
Monitoring & Observability Best Practices for High-Scale Teams

Battle-tested monitoring and observability practices for high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 12 min read

High-scale monitoring demands infrastructure that can ingest millions of data points per second while maintaining sub-second query latency. When you're running 10,000+ containers across multiple regions, every cardinality decision and sampling strategy has significant cost implications.

Metrics Pipeline Architecture

High-Cardinality Metrics Handling

```yaml
# Grafana Mimir configuration for high-scale
target: all

server:
  http_listen_port: 9009

distributor:
  ring:
    kvstore:
      store: consul
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: consul

ingester:
  ring:
    replication_factor: 3
    kvstore:
      store: consul

limits:
  max_global_series_per_user: 5000000
  max_global_series_per_metric: 500000
  ingestion_rate: 500000
  ingestion_burst_size: 1000000
  max_label_names_per_series: 30
  max_label_value_length: 2048
  max_fetched_series_per_query: 100000

blocks_storage:
  backend: s3
  s3:
    bucket_name: metrics-blocks
    endpoint: s3.us-east-1.amazonaws.com
  tsdb:
    dir: /data/tsdb
    block_ranges_period: [2h]
    retention_period: 24h
```

At high scale, Prometheus alone cannot handle the ingestion volume. Grafana Mimir (or Thanos) provides horizontal scaling, multi-tenancy, and long-term storage. A single Mimir cluster can ingest 10M+ samples/second with proper configuration.
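On the agent side, each Prometheus (or Grafana Agent) ships samples to the Mimir distributors via remote write. A minimal sketch of that scrape-side configuration, assuming an in-cluster distributor service name and an illustrative tenant ID:

```yaml
# Prometheus remote_write to a Mimir distributor
# (endpoint URL and X-Scope-OrgID tenant are illustrative assumptions)
remote_write:
  - url: http://mimir-distributor.monitoring:9009/api/v1/push
    headers:
      X-Scope-OrgID: platform-team      # tenant ID; enables per-team limits
    queue_config:
      capacity: 10000                   # buffered samples per shard
      max_shards: 50                    # parallel senders under backlog
      max_samples_per_send: 2000
```

The tenant header is what makes the per-user limits in the Mimir config above enforceable per team.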

Recording Rules for Query Performance

```yaml
groups:
  - name: high-scale-recording-rules
    interval: 15s
    rules:
      - record: cluster:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (cluster, service, status_code)

      - record: cluster:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m]))
            by (le, cluster, service)
          )

      - record: cluster:node_cpu:utilization_avg5m
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
          by (cluster, instance)

      - record: cluster:container_memory:usage_ratio
        expr: |
          sum(container_memory_working_set_bytes) by (cluster, namespace, pod)
          /
          sum(kube_pod_container_resource_limits{resource="memory"}) by (cluster, namespace, pod)
```

Recording rules pre-compute expensive queries. At high scale, a query like `histogram_quantile(0.99, sum(rate(...[5m])) by (le, service))` across 100,000 time series takes 30+ seconds. The same query against a recording rule returns in milliseconds.
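Dashboards and alerts then query the cheap pre-aggregated series instead of the raw counters. For example, a 5xx error-ratio panel built on the first rule above (the `status_code` regex assumes your labels hold numeric HTTP codes):

```promql
sum by (cluster, service) (cluster:http_requests:rate5m{status_code=~"5.."})
/
sum by (cluster, service) (cluster:http_requests:rate5m)
```

Because the inputs are already rated and aggregated, this touches a handful of series per service rather than every underlying counter.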

Distributed Tracing at Scale

Sampling Strategy

```yaml
# OpenTelemetry Collector with tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 50000
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high-latency
        type: latency
        latency:
          threshold_ms: 2000
      - name: low-volume-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service, auth-service]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

At 50,000 traces/second, storing everything costs $100,000+/month. Tail sampling keeps 100% of errors and high-latency traces (the ones you actually debug) while sampling 5% of normal traces. This reduces storage costs by 90% while maintaining full debuggability for incidents.
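For completeness, a sketch of how the processor wires into a collector traces pipeline (receiver and exporter names are placeholders for whatever your deployment uses):

```yaml
# Wiring tail_sampling into the collector's traces pipeline
# (receiver/exporter names are illustrative)
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/tempo]
```

Note that tail sampling only works if every span of a trace reaches the same collector instance, which at this scale usually means a load-balancing exporter tier in front of the sampling tier.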

Log Aggregation for 10,000+ Containers

```yaml
# Grafana Loki configuration for high scale
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/cache
  aws:
    s3: s3://logs-bucket/loki
    region: us-east-1

limits_config:
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 200
  max_entries_limit_per_query: 10000
  max_query_series: 5000
  retention_period: 168h

chunk_store_config:
  chunk_cache_config:
    memcached:
      host: memcached.monitoring
      service: memcached-client
```

Loki's label-based indexing is critical at high scale. Unlike Elasticsearch, which indexes every field, Loki indexes only labels (service, namespace, level). This reduces storage costs by roughly 80% compared to ELK at equivalent log volumes. The trade-off is slower full-text search, which is acceptable when you have trace IDs for correlation.
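In practice that trade-off means selecting by label first, then grepping within the narrowed stream. A typical incident query pivoting from a trace ID might look like this (the label names and the `trace_id=` log format are assumptions about your ingest pipeline):

```logql
{namespace="checkout", service="payment-service", level="error"}
  |= "trace_id=4bf92f3577b34da6"
```

The label matchers hit the small index; the line filter scans only the matching chunks, not the whole log corpus.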


Anti-Patterns to Avoid

Unbounded metric cardinality. A single label with unbounded values (user_id, request_id, URL path) can generate millions of time series. At $0.10 per 1,000 active series/month, this adds up fast. Enforce cardinality limits at the collector level.
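One way to enforce this at the edge is metric relabeling in the scrape config, stripping known-unbounded labels before they ever become series (the label names here are illustrative):

```yaml
# Prometheus scrape config: drop unbounded labels at ingestion
metric_relabel_configs:
  - action: labeldrop
    regex: (user_id|request_id|session_id)
  # Drop series whose path label explodes cardinality entirely,
  # if the metric cannot be fixed at the source
  - source_labels: [path]
    regex: /api/users/.+
    action: drop
```

Fixing the instrumentation (e.g. templated route labels instead of raw URLs) is still the better long-term answer; relabeling is the backstop.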

Storing all traces. At 50,000 traces/second, storing everything is economically insane. Tail sampling with error and latency bias maintains debugging capability at 5-10% of the cost.

Single Prometheus instance. Beyond 5,000 targets or 1M active series, a single Prometheus hits memory limits. Shard across multiple Prometheus instances with consistent hashing, or migrate to Mimir/Thanos.
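A common sharding approach is hashmod relabeling, where each Prometheus replica keeps only its slice of the discovered targets (the shard count of 4 and the hard-coded shard number are illustrative; in practice the number is templated per replica):

```yaml
# Functional sharding: each of 4 Prometheus replicas scrapes ~1/4 of targets
relabel_configs:
  - source_labels: [__address__]
    modulus: 4
    target_label: __tmp_shard
    action: hashmod
  - source_labels: [__tmp_shard]
    regex: "2"        # this replica's shard number (0-3)
    action: keep
```

With remote write into Mimir/Thanos on top, queries still see a single global view despite the sharded scrape layer.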

Alerting on raw metrics. At high scale, raw metric queries time out. Alert on recording rules that pre-compute the expensive aggregations.
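For example, a latency alert evaluated against the recorded p99 series from earlier, rather than the raw histogram buckets (threshold, service, and severity labels are illustrative):

```yaml
groups:
  - name: high-scale-alerts
    rules:
      - alert: HighP99Latency
        # Cheap lookup of a pre-computed series, not a bucket aggregation
        expr: cluster:http_request_duration:p99_5m{service="payment-service"} > 1.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p99 latency above 1.5s for {{ $labels.service }} in {{ $labels.cluster }}"
```

The rule evaluates in milliseconds on every cycle, so the alerting path stays reliable even when the raw data would be too expensive to query.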

Production Checklist

  • Mimir/Thanos for horizontally-scaled metrics storage
  • Recording rules for all dashboard and alert queries
  • Cardinality limits enforced per tenant/team
  • Tail sampling for traces (errors: 100%, slow: 100%, baseline: 5%)
  • Loki with label-based indexing for cost-effective log storage
  • Memcached caching layer for query performance
  • Multi-region metric federation
  • Per-team ingestion rate limits
  • Automated cardinality analysis and alerting
  • Query performance SLOs (<2s for dashboard queries)

Conclusion

High-scale observability is fundamentally a cost optimization problem. The technology to collect, store, and query telemetry at any scale exists. The challenge is doing so economically. Recording rules, tail sampling, cardinality limits, and label-based log indexing are the tools that keep costs proportional to value rather than proportional to scale.

Teams operating at 10,000+ containers should budget 3-5% of their infrastructure cost for observability. Below that, you're under-investing and flying blind. Above that, you're likely storing too much data at too high resolution for too long.
