
Monitoring & Observability Best Practices for High-Scale Teams

Battle-tested monitoring and observability practices for high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 12 min read

High-scale monitoring demands infrastructure that can ingest millions of data points per second while maintaining sub-second query latency. When you're running 10,000+ containers across multiple regions, every cardinality decision and sampling strategy has significant cost implications.

Metrics Pipeline Architecture

High-Cardinality Metrics Handling

```yaml
# Grafana Mimir configuration for high-scale
target: all

server:
  http_listen_port: 9009

distributor:
  ring:
    kvstore:
      store: consul
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: consul

ingester:
  ring:
    replication_factor: 3
    kvstore:
      store: consul

limits:
  max_global_series_per_user: 5000000
  max_global_series_per_metric: 500000
  ingestion_rate: 500000
  ingestion_burst_size: 1000000
  max_label_names_per_series: 30
  max_label_value_length: 2048
  max_fetched_series_per_query: 100000

blocks_storage:
  backend: s3
  s3:
    bucket_name: metrics-blocks
    endpoint: s3.us-east-1.amazonaws.com
  tsdb:
    dir: /data/tsdb
    block_ranges_period: [2h]
    retention_period: 24h
```

At high scale, Prometheus alone cannot handle the ingestion volume. Grafana Mimir (or Thanos) provides horizontal scaling, multi-tenancy, and long-term storage. A single Mimir cluster can ingest 10M+ samples/second with proper configuration.
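On the agent side, each Prometheus (or Grafana Agent) ships samples to the Mimir distributors via remote write. A minimal sketch of that scrape-side configuration, assuming an in-cluster distributor service name and an illustrative tenant ID:

```yaml
# Prometheus remote_write to a Mimir distributor
# (endpoint URL and X-Scope-OrgID tenant are illustrative assumptions)
remote_write:
  - url: http://mimir-distributor.monitoring:9009/api/v1/push
    headers:
      X-Scope-OrgID: platform-team      # tenant ID; enables per-team limits
    queue_config:
      capacity: 10000                   # buffered samples per shard
      max_shards: 50                    # parallel senders under backlog
      max_samples_per_send: 2000
```

The tenant header is what makes the per-user limits in the Mimir config above enforceable per team.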

Recording Rules for Query Performance

```yaml
groups:
  - name: high-scale-recording-rules
    interval: 15s
    rules:
      - record: cluster:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (cluster, service, status_code)

      - record: cluster:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m]))
            by (le, cluster, service)
          )

      - record: cluster:node_cpu:utilization_avg5m
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
          by (cluster, instance)

      - record: cluster:container_memory:usage_ratio
        expr: |
          sum(container_memory_working_set_bytes) by (cluster, namespace, pod)
          /
          sum(kube_pod_container_resource_limits{resource="memory"}) by (cluster, namespace, pod)
```

Recording rules pre-compute expensive queries. At high scale, a query like `histogram_quantile(0.99, sum(rate(...[5m])) by (le, service))` across 100,000 time series takes 30+ seconds. The same query against a recording rule returns in milliseconds.
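Dashboards and alerts then query the cheap pre-aggregated series instead of the raw counters. For example, a 5xx error-ratio panel built on the first rule above (the `status_code` regex assumes your labels hold numeric HTTP codes):

```promql
sum by (cluster, service) (cluster:http_requests:rate5m{status_code=~"5.."})
/
sum by (cluster, service) (cluster:http_requests:rate5m)
```

Because the inputs are already rated and aggregated, this touches a handful of series per service rather than every underlying counter.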

Distributed Tracing at Scale

Sampling Strategy

```yaml
# OpenTelemetry Collector with tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 50000
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high-latency
        type: latency
        latency:
          threshold_ms: 2000
      - name: low-volume-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service, auth-service]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

At 50,000 traces/second, storing everything costs $100,000+/month. Tail sampling keeps 100% of errors and high-latency traces (the ones you actually debug) while sampling 5% of normal traces. This reduces storage costs by 90% while maintaining full debuggability for incidents.
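For completeness, a sketch of how the processor wires into a collector traces pipeline (receiver and exporter names are placeholders for whatever your deployment uses):

```yaml
# Wiring tail_sampling into the collector's traces pipeline
# (receiver/exporter names are illustrative)
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/tempo]
```

Note that tail sampling only works if every span of a trace reaches the same collector instance, which at this scale usually means a load-balancing exporter tier in front of the sampling tier.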

Log Aggregation for 10,000+ Containers

```yaml
# Grafana Loki configuration for high scale
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/cache
  aws:
    s3: s3://logs-bucket/loki
    region: us-east-1

limits_config:
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 200
  max_entries_limit_per_query: 10000
  max_query_series: 5000
  retention_period: 168h

chunk_store_config:
  chunk_cache_config:
    memcached:
      host: memcached.monitoring
      service: memcached-client
```

Loki's label-based indexing is critical at high scale. Unlike Elasticsearch, which indexes every field, Loki indexes only labels (service, namespace, level). This reduces storage costs by roughly 80% compared to ELK at equivalent log volumes. The trade-off is slower full-text search, which is acceptable when you have trace IDs for correlation.
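In practice that trade-off means selecting by label first, then grepping within the narrowed stream. A typical incident query pivoting from a trace ID might look like this (the label names and the `trace_id=` log format are assumptions about your ingest pipeline):

```logql
{namespace="checkout", service="payment-service", level="error"}
  |= "trace_id=4bf92f3577b34da6"
```

The label matchers hit the small index; the line filter scans only the matching chunks, not the whole log corpus.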


Anti-Patterns to Avoid

Unbounded metric cardinality. A single label with unbounded values (user_id, request_id, URL path) can generate millions of time series. At $0.10 per 1,000 active series/month, this adds up fast. Enforce cardinality limits at the collector level.
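One way to enforce this at the edge is metric relabeling in the scrape config, stripping known-unbounded labels before they ever become series (the label names here are illustrative):

```yaml
# Prometheus scrape config: drop unbounded labels at ingestion
metric_relabel_configs:
  - action: labeldrop
    regex: (user_id|request_id|session_id)
  # Drop series whose path label explodes cardinality entirely,
  # if the metric cannot be fixed at the source
  - source_labels: [path]
    regex: /api/users/.+
    action: drop
```

Fixing the instrumentation (e.g. templated route labels instead of raw URLs) is still the better long-term answer; relabeling is the backstop.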

Storing all traces. At 50,000 traces/second, storing everything is economically insane. Tail sampling with error and latency bias maintains debugging capability at 5-10% of the cost.

Single Prometheus instance. Beyond 5,000 targets or 1M active series, a single Prometheus hits memory limits. Shard across multiple Prometheus instances with consistent hashing, or migrate to Mimir/Thanos.
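A common sharding approach is hashmod relabeling, where each Prometheus replica keeps only its slice of the discovered targets (the shard count of 4 and the hard-coded shard number are illustrative; in practice the number is templated per replica):

```yaml
# Functional sharding: each of 4 Prometheus replicas scrapes ~1/4 of targets
relabel_configs:
  - source_labels: [__address__]
    modulus: 4
    target_label: __tmp_shard
    action: hashmod
  - source_labels: [__tmp_shard]
    regex: "2"        # this replica's shard number (0-3)
    action: keep
```

With remote write into Mimir/Thanos on top, queries still see a single global view despite the sharded scrape layer.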

Alerting on raw metrics. At high scale, raw metric queries time out. Alert on recording rules that pre-compute the expensive aggregations.
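For example, a latency alert evaluated against the recorded p99 series from earlier, rather than the raw histogram buckets (threshold, service, and severity labels are illustrative):

```yaml
groups:
  - name: high-scale-alerts
    rules:
      - alert: HighP99Latency
        # Cheap lookup of a pre-computed series, not a bucket aggregation
        expr: cluster:http_request_duration:p99_5m{service="payment-service"} > 1.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p99 latency above 1.5s for {{ $labels.service }} in {{ $labels.cluster }}"
```

The rule evaluates in milliseconds on every cycle, so the alerting path stays reliable even when the raw data would be too expensive to query.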

Production Checklist

  • Mimir/Thanos for horizontally-scaled metrics storage
  • Recording rules for all dashboard and alert queries
  • Cardinality limits enforced per tenant/team
  • Tail sampling for traces (errors: 100%, slow: 100%, baseline: 5%)
  • Loki with label-based indexing for cost-effective log storage
  • Memcached caching layer for query performance
  • Multi-region metric federation
  • Per-team ingestion rate limits
  • Automated cardinality analysis and alerting
  • Query performance SLOs (<2s for dashboard queries)

Conclusion

High-scale observability is fundamentally a cost optimization problem. The technology to collect, store, and query telemetry at any scale exists. The challenge is doing so economically. Recording rules, tail sampling, cardinality limits, and label-based log indexing are the tools that keep costs proportional to value rather than proportional to scale.

Teams operating at 10,000+ containers should budget 3-5% of their infrastructure cost for observability. Below that, you're under-investing and flying blind. Above that, you're likely storing too much data at too high resolution for too long.
