
Monitoring & Observability at Scale: Lessons from Production

Real-world lessons from implementing Monitoring & Observability in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 10 min read

In late 2023, we rebuilt the monitoring infrastructure for a B2B SaaS platform serving 450 enterprise customers. The existing setup — a single Prometheus instance, Elasticsearch for logs, and a collection of hand-crafted Grafana dashboards — was failing under the load of 200+ microservices. This is the story of migrating to a scalable observability stack while maintaining visibility during the transition.

Starting Point

The platform ran 200+ microservices on EKS across two AWS regions. The monitoring infrastructure had accumulated four years of technical debt:

  • Prometheus: Single instance ingesting 2.8M active series, regularly OOMing at 64GB RAM
  • Elasticsearch: 3-node cluster consuming 12TB storage, queries taking 30+ seconds for multi-service searches
  • Grafana: 340 dashboards, of which only 40 were actively used. No one knew who created the other 300
  • Alerts: 127 active alert rules. The on-call team received an average of 45 alerts per day and ignored 90% of them
  • Cost: $14,200/month for monitoring infrastructure (Prometheus EC2, Elasticsearch, Grafana)

The tipping point was a 2-hour outage during which the monitoring system itself was down, so the team couldn't diagnose the root cause. Prometheus had OOMed while processing a cardinality explosion from a misconfigured service.

Architecture Decisions

Why Grafana Mimir + Loki + Tempo

We evaluated Datadog ($45,000/month estimated at our scale), Grafana Cloud ($22,000/month), and a self-hosted Grafana LGTM stack ($8,000/month). Self-hosted won on cost, and the team already had the Kubernetes operations experience to run it.

The architecture:

```
Services → OTel Collector (DaemonSet) → OTel Collector (Gateway)

               ┌─────────┼─────────┐
               ↓         ↓         ↓
             Mimir      Loki     Tempo
           (metrics)   (logs)  (traces)
               ↓         ↓         ↓
                      Grafana
```
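
At the collector layer, the gateway fans each signal out to its backend. Below is a minimal sketch of what that gateway pipeline looks like; the endpoints, hostnames, and the use of Loki's OTLP ingest path are illustrative assumptions, not our exact production config.

```yaml
# Gateway collector sketch: one OTLP front door, three backend exporters.
# All endpoints and hostnames below are assumed placeholder values.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheusremotewrite:            # metrics → Mimir
    endpoint: http://mimir-nginx/api/v1/push
  otlphttp/loki:                    # logs → Loki's native OTLP endpoint
    endpoint: http://loki-gateway:3100/otlp
  otlp/tempo:                       # traces → Tempo
    endpoint: tempo-distributor:4317
    tls:
      insecure: true                # in-cluster traffic; tighten as needed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```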

Migration Strategy

We ran the old and new systems in parallel for six weeks:

  • Weeks 1-2: Deploy the new stack and dual-write metrics and logs (see the remote_write sketch below)
  • Weeks 3-4: Migrate dashboards and alerts to the new stack
  • Weeks 5-6: Validate, then decommission the old stack
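
For the dual-write phase, the lowest-risk lever was the existing Prometheus instance's own remote_write, pointed at Mimir while local storage stayed authoritative. A minimal sketch, assuming a single tenant; the endpoint, tenant ID, and queue numbers are illustrative:

```yaml
# prometheus.yml fragment: dual-write into Mimir during the transition.
# Endpoint, tenant header, and queue sizes are assumed values.
remote_write:
  - url: http://mimir-nginx.observability.svc/api/v1/push
    headers:
      X-Scope-OrgID: platform       # Mimir tenant ID (hypothetical name)
    queue_config:
      max_shards: 50                # bound resend parallelism
      max_samples_per_send: 2000    # larger batches for high series counts
```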

Results

Metric                         Before              After               Change
Monthly cost                   $14,200             $8,400              -41%
Active series capacity         3M (OOM at peak)    20M (comfortable)   +567%
Log query latency (p95)        32s                 1.8s                -94%
Dashboard count                340                 52                  -85%
Daily alert volume             45                  8                   -82%
MTTD (mean time to detect)     12 min              3 min               -75%
MTTR (mean time to resolve)    47 min              22 min              -53%

The alert reduction from 45/day to 8/day was the most impactful change. We replaced 127 threshold-based alerts with 23 SLO-based alerts. On-call satisfaction scores improved from 2.1/5 to 4.2/5.
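
The SLO-based rules follow the multiwindow, multi-burn-rate pattern from the Google SRE Workbook. As one hedged example (the metric name, the 99.9% availability target, and the label scheme are assumptions, not our exact rules), a fast-burn page looks roughly like this:

```yaml
# Sketch of one fast-burn SLO alert in Prometheus/Mimir rule format:
# page when the error budget of a 99.9% availability SLO burns at
# >14.4x over both a 1h and a 5m window. http_requests_total and its
# labels are assumed metric names.
groups:
  - name: slo-api-availability
    rules:
      - alert: ApiErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "API availability error budget burning at >14.4x"
```

The two windows are what make this trustworthy: the 5m window keeps detection fast, while requiring the 1h window to burn as well filters out brief blips that would have paged under the old threshold rules.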


What Went Wrong

Elasticsearch migration data loss. During the Loki migration, we discovered that our Elasticsearch index lifecycle policy had been silently deleting logs older than 14 days instead of the intended 30 days. We couldn't backfill this data into Loki.

Cardinality explosion on day 3. A service team deployed a change that added request_id as a Prometheus label. This generated 500,000 new series in 2 hours. The old Prometheus would have OOMed; Mimir's per-tenant limits rejected the excess series with a clear error message. The fix took 20 minutes instead of causing an outage.
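
Those per-tenant limits are ordinary Mimir runtime overrides. A sketch of the shape, where the tenant name and numbers are illustrative rather than our production values:

```yaml
# Mimir runtime overrides sketch: cap series growth per tenant so a
# bad deploy is rejected with a clear error instead of OOMing ingesters.
# Tenant name and limit values are assumed for illustration.
overrides:
  platform:
    max_global_series_per_user: 20000000   # hard cap on active series
    max_label_names_per_series: 30         # blocks label-bloat patterns
    ingestion_rate: 500000                 # samples/sec steady state
    ingestion_burst_size: 1000000          # headroom for short spikes
```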

Dashboard migration was manual. We couldn't automatically migrate the 340 Grafana dashboards because many used Elasticsearch-specific query syntax. We migrated only the 40 actively used dashboards and asked teams to recreate any others they actually needed. No one asked for the remaining 300.

Honest Retrospective

The biggest win was reducing alert volume by 82%. The engineering team went from ignoring alerts to trusting them. When an alert fires now, it means something actionable.

What we'd do differently: Start with the OTel Collector migration (dual-writing) before changing any backends. Our initial plan tried to migrate backends and instrumentation simultaneously, which created too many moving parts. Sequential migrations are slower but safer.

The hidden cost: six weeks of two engineers' time (~$50,000 in loaded salary) plus the $8,400/month ongoing cost. The infrastructure savings alone ($14,200 - $8,400 = $5,800/month) pay back that investment in roughly nine months; counting the productivity gains from better MTTD and MTTR, the ROI turned positive much sooner.

Conclusion

Monitoring migrations are high-risk, high-reward projects. The risk is losing visibility during the transition; the reward is dramatically better observability at lower cost. The parallel-running approach — where both old and new systems receive data simultaneously — is essential for safe migration. It doubles your monitoring cost for the transition period but eliminates the "flying blind" risk.

The most valuable outcome wasn't the technology change — it was the opportunity to rationalize dashboards, alerts, and on-call processes. Four years of accumulated monitoring debt was cleaned up in six weeks because the migration forced every team to justify what they actually needed.

