High-scale monitoring demands infrastructure that can ingest millions of data points per second while maintaining sub-second query latency. When you're running 10,000+ containers across multiple regions, every cardinality decision and sampling strategy has significant cost implications.
Metrics Pipeline Architecture
High-Cardinality Metrics Handling
At high scale, Prometheus alone cannot handle the ingestion volume. Grafana Mimir (or Thanos) provides horizontal scaling, multi-tenancy, and long-term storage. A single Mimir cluster can ingest 10M+ samples/second with proper configuration.
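A rough sketch of what the Prometheus side of that setup can look like, assuming Mimir's standard push endpoint and header-based multi-tenancy (the gateway address, tenant ID, and queue numbers below are placeholders):

```yaml
# prometheus.yml (excerpt) -- forward locally scraped samples to a Mimir cluster.
# Gateway hostname, tenant ID, and queue settings are illustrative.
remote_write:
  - url: http://mimir-gateway.monitoring.svc:8080/api/v1/push
    headers:
      X-Scope-OrgID: team-payments      # Mimir tenant for multi-tenancy
    queue_config:
      max_samples_per_send: 2000        # batch size per remote-write request
      max_shards: 50                    # upper bound on parallel send shards
```

Prometheus stays as a local scraper and short-term buffer; Mimir owns long-term storage, deduplication, and the query path.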
Recording Rules for Query Performance
Recording rules pre-compute expensive queries. At high scale, a query like histogram_quantile(0.99, sum(rate(...[5m])) by (le, service)) across 100,000 time series takes 30+ seconds. The same query against a recording rule returns in milliseconds.
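A minimal sketch of such a rule, using an illustrative histogram metric and rule name; dashboards and alerts then query the recorded series instead of the raw buckets:

```yaml
# rules.yml (excerpt) -- pre-compute the p99 latency per service every 30s.
# Metric and rule names are illustrative; adapt to your own histogram.
groups:
  - name: service_latency
    interval: 30s
    rules:
      - record: service:request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(request_duration_seconds_bucket[5m])) by (le, service))
```

Grafana panels and alert rules then reference service:request_duration_seconds:p99_5m directly, so the expensive aggregation runs once per evaluation interval rather than once per viewer.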
Distributed Tracing at Scale
Sampling Strategy
At 50,000 traces/second, storing everything costs $100,000+/month. Tail sampling keeps 100% of errors and high-latency traces (the ones you actually debug) while sampling 5% of normal traces. This reduces storage costs by 90% while maintaining full debuggability for incidents.
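One common way to express that policy is the OpenTelemetry Collector's tail_sampling processor; a sketch, assuming traces slower than 1s count as high-latency (the threshold and decision wait are illustrative):

```yaml
# otel-collector config (excerpt) -- keep all errors and slow traces,
# sample 5% of everything else. Threshold and wait time are illustrative.
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

With multiple policies, a trace is kept if any policy samples it, which is what layers the error and latency bias on top of the 5% baseline.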
Log Aggregation for 10,000+ Containers
Loki's label-based indexing is critical at high scale. Unlike Elasticsearch, which indexes every field by default, Loki indexes only labels (service, namespace, level). This reduces storage costs by 80% compared to ELK at equivalent log volumes. The trade-off is slower full-text search, which is acceptable when you have trace IDs for correlation.
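In practice that means keeping the indexed label set small and bounded and pushing everything else into the log line, to be filtered at query time with a LogQL query like {service="checkout", level="error"} |= "traceID=..." during an incident. A sketch of the shipping side using Promtail-style relabeling (job and label names are illustrative; the log level usually comes from a pipeline stage, not shown here):

```yaml
# promtail scrape config (excerpt) -- index only a bounded set of labels.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: service
```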
Anti-Patterns to Avoid
Unbounded metric cardinality. A single label with unbounded values (user_id, request_id, URL path) can generate millions of time series. At $0.10 per 1,000 active series/month, this adds up fast. Enforce cardinality limits at the collector level.
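A sketch of what collector-level enforcement can look like in Prometheus scrape-config terms (the sample limit and label names are illustrative):

```yaml
# prometheus.yml scrape config (excerpt) -- enforce cardinality at ingestion.
scrape_configs:
  - job_name: app
    sample_limit: 50000             # the scrape is rejected if a target exceeds this
    metric_relabel_configs:
      - action: labeldrop           # strip known-unbounded labels before storage
        regex: user_id|request_id|path
```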
Storing all traces. At 50,000 traces/second, storing everything is economically insane. Tail sampling with error and latency bias maintains debugging capability at 5-10% of the cost.
Single Prometheus instance. Beyond 5,000 targets or 1M active series, a single Prometheus hits memory limits. Shard across multiple Prometheus instances with consistent hashing, or migrate to Mimir/Thanos.
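The usual mechanism for this is hashmod relabeling (strictly modulo hashing of the target address rather than true consistent hashing); a sketch with three shards, where each replica runs an identical config except for its shard index:

```yaml
# Shard 0 of 3 -- only the keep regex differs between replicas.
scrape_configs:
  - job_name: node
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3                  # total number of Prometheus shards
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"                  # this replica's shard index
        action: keep
```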
Alerting on raw metrics. At high scale, raw metric queries time out. Alert on recording rules that pre-compute the expensive aggregations.
Production Checklist
- Mimir/Thanos for horizontally scaled metrics storage
- Recording rules for all dashboard and alert queries
- Cardinality limits enforced per tenant/team
- Tail sampling for traces (errors: 100%, slow: 100%, baseline: 5%)
- Loki with label-based indexing for cost-effective log storage
- Memcached caching layer for query performance
- Multi-region metric federation
- Per-team ingestion rate limits (see the overrides sketch after this checklist)
- Automated cardinality analysis and alerting
- Query performance SLOs (<2s for dashboard queries)
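For the per-team limits above, a sketch of Mimir-style runtime overrides; tenant names and numbers are illustrative:

```yaml
# Mimir runtime overrides (excerpt) -- per-tenant ingestion and series limits.
overrides:
  team-payments:
    ingestion_rate: 250000              # samples/second
    ingestion_burst_size: 500000
    max_global_series_per_user: 3000000
  team-search:
    ingestion_rate: 100000
    ingestion_burst_size: 200000
    max_global_series_per_user: 1000000
```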
Conclusion
High-scale observability is fundamentally a cost optimization problem. The technology to collect, store, and query telemetry at any scale exists. The challenge is doing so economically. Recording rules, tail sampling, cardinality limits, and label-based log indexing are the tools that keep costs proportional to value rather than proportional to scale.
Teams operating at 10,000+ containers should budget 3-5% of their infrastructure cost for observability. Below that, you're under-investing and flying blind. Above that, you're likely storing too much data at too high resolution for too long.