
Monitoring & Observability at Scale: Lessons from Production

Real-world lessons from implementing Monitoring & Observability in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 10 min read

In late 2023, we rebuilt the monitoring infrastructure for a B2B SaaS platform serving 450 enterprise customers. The existing setup — a single Prometheus instance, Elasticsearch for logs, and a collection of hand-crafted Grafana dashboards — was failing under the load of 200+ microservices. This is the story of migrating to a scalable observability stack while maintaining visibility during the transition.

Starting Point

The platform ran 200+ microservices on EKS across two AWS regions. The monitoring infrastructure had accumulated four years of technical debt:

  • Prometheus: Single instance ingesting 2.8M active series, regularly OOMing at 64GB RAM
  • Elasticsearch: 3-node cluster consuming 12TB storage, queries taking 30+ seconds for multi-service searches
  • Grafana: 340 dashboards, of which only 40 were actively used. No one knew who created the other 300
  • Alerts: 127 active alert rules. The on-call team received an average of 45 alerts per day and ignored 90% of them
  • Cost: $14,200/month for monitoring infrastructure (Prometheus EC2, Elasticsearch, Grafana)

The tipping point was a 2-hour outage during which the monitoring system itself was down, so the team couldn't diagnose the root cause. Prometheus had OOMed while processing a cardinality explosion from a misconfigured service.

Architecture Decisions

Why Grafana Mimir + Loki + Tempo

We evaluated Datadog ($45,000/month estimated at our scale), Grafana Cloud ($22,000/month), and a self-hosted Grafana LGTM stack ($8,000/month). Self-hosted won on cost, and the team already had the Kubernetes operations experience to run it.

The architecture:

```
Services → OTel Collector (DaemonSet) → OTel Collector (Gateway)

               ┌─────────┼─────────┐
               ↓         ↓         ↓
             Mimir      Loki     Tempo
           (metrics)   (logs)  (traces)
               ↓         ↓         ↓
                      Grafana
```
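
At the collector layer, the gateway fans each signal out to its backend. Below is a minimal sketch of what that gateway pipeline looks like; the endpoints, hostnames, and the use of Loki's OTLP ingest path are illustrative assumptions, not our exact production config.

```yaml
# Gateway collector sketch: one OTLP front door, three backend exporters.
# All endpoints and hostnames below are assumed placeholder values.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheusremotewrite:            # metrics → Mimir
    endpoint: http://mimir-nginx/api/v1/push
  otlphttp/loki:                    # logs → Loki's native OTLP endpoint
    endpoint: http://loki-gateway:3100/otlp
  otlp/tempo:                       # traces → Tempo
    endpoint: tempo-distributor:4317
    tls:
      insecure: true                # in-cluster traffic; tighten as needed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```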

Migration Strategy

We ran the old and new systems in parallel for six weeks:

  • Weeks 1-2: Deploy the new stack and dual-write metrics and logs (see the remote_write sketch below)
  • Weeks 3-4: Migrate dashboards and alerts to the new stack
  • Weeks 5-6: Validate, then decommission the old stack
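
For the dual-write phase, the lowest-risk lever was the existing Prometheus instance's own remote_write, pointed at Mimir while local storage stayed authoritative. A minimal sketch, assuming a single tenant; the endpoint, tenant ID, and queue numbers are illustrative:

```yaml
# prometheus.yml fragment: dual-write into Mimir during the transition.
# Endpoint, tenant header, and queue sizes are assumed values.
remote_write:
  - url: http://mimir-nginx.observability.svc/api/v1/push
    headers:
      X-Scope-OrgID: platform       # Mimir tenant ID (hypothetical name)
    queue_config:
      max_shards: 50                # bound resend parallelism
      max_samples_per_send: 2000    # larger batches for high series counts
```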

Results

Metric                         Before              After               Change
Monthly cost                   $14,200             $8,400              -41%
Active series capacity         3M (OOM at peak)    20M (comfortable)   +567%
Log query latency (p95)        32s                 1.8s                -94%
Dashboard count                340                 52                  -85%
Daily alert volume             45                  8                   -82%
MTTD (mean time to detect)     12 min              3 min               -75%
MTTR (mean time to resolve)    47 min              22 min              -53%

The alert reduction from 45/day to 8/day was the most impactful change. We replaced 127 threshold-based alerts with 23 SLO-based alerts. On-call satisfaction scores improved from 2.1/5 to 4.2/5.
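
The SLO-based rules follow the multiwindow, multi-burn-rate pattern from the Google SRE Workbook. As one hedged example (the metric name, the 99.9% availability target, and the label scheme are assumptions, not our exact rules), a fast-burn page looks roughly like this:

```yaml
# Sketch of one fast-burn SLO alert in Prometheus/Mimir rule format:
# page when the error budget of a 99.9% availability SLO burns at
# >14.4x over both a 1h and a 5m window. http_requests_total and its
# labels are assumed metric names.
groups:
  - name: slo-api-availability
    rules:
      - alert: ApiErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API availability error budget burning at >14.4x"
```

The two windows are what make this trustworthy: the 5m window keeps detection fast, while requiring the 1h window to burn as well filters out brief blips that would have paged under the old threshold rules.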


What Went Wrong

Elasticsearch migration data loss. During the Loki migration, we discovered that our Elasticsearch index lifecycle policy had been silently deleting logs older than 14 days instead of the intended 30 days. We couldn't backfill this data into Loki.

Cardinality explosion on day 3. A service team deployed a change that added request_id as a Prometheus label. This generated 500,000 new series in 2 hours. The old Prometheus would have OOMed; Mimir's per-tenant limits rejected the excess series with a clear error message. The fix took 20 minutes instead of causing an outage.
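
Those per-tenant limits are ordinary Mimir runtime overrides. A sketch of the shape, where the tenant name and numbers are illustrative rather than our production values:

```yaml
# Mimir runtime overrides sketch: cap series growth per tenant so a
# bad deploy is rejected with a clear error instead of OOMing ingesters.
# Tenant name and limit values are assumed for illustration.
overrides:
  platform:
    max_global_series_per_user: 20000000   # hard cap on active series
    max_label_names_per_series: 30         # blocks label-bloat patterns
    ingestion_rate: 500000                 # samples/sec steady state
    ingestion_burst_size: 1000000          # headroom for short spikes
```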

Dashboard migration was manual. We couldn't automatically migrate the 340 Grafana dashboards because many used Elasticsearch-specific query syntax. We migrated only the 40 actively used dashboards and asked teams to recreate any others they actually needed. No one asked for the remaining 300.

Honest Retrospective

The biggest win was reducing alert volume by 82%. The engineering team went from ignoring alerts to trusting them. When an alert fires now, it means something actionable.

What we'd do differently: Start with the OTel Collector migration (dual-writing) before changing any backends. Our initial plan tried to migrate backends and instrumentation simultaneously, which created too many moving parts. Sequential migrations are slower but safer.

The hidden cost: six weeks of two engineers' time (~$50,000 in loaded salary) plus the $8,400/month ongoing cost. The infrastructure savings alone ($14,200 - $8,400 = $5,800/month) pay back that investment in roughly nine months; counting the productivity gains from better MTTD and MTTR, the ROI turned positive much sooner.

Conclusion

Monitoring migrations are high-risk, high-reward projects. The risk is losing visibility during the transition; the reward is dramatically better observability at lower cost. The parallel-running approach — where both old and new systems receive data simultaneously — is essential for safe migration. It doubles your monitoring cost for the transition period but eliminates the "flying blind" risk.

The most valuable outcome wasn't the technology change — it was the opportunity to rationalize dashboards, alerts, and on-call processes. Four years of accumulated monitoring debt was cleaned up in six weeks because the migration forced every team to justify what they actually needed.

