DevOps

Monitoring & Observability Best Practices for Enterprise Teams

Battle-tested best practices for Monitoring & Observability tailored to Enterprise teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 11 min read

Enterprise monitoring and observability require more than dashboards and alerts. At enterprise scale, you're managing hundreds of services, multiple teams with different SLO requirements, compliance-mandated audit trails, and the political complexity of shared infrastructure. These practices address the organizational and technical challenges of observability in large engineering organizations.

Observability Strategy

The Three Pillars with Enterprise Context

Metrics, logs, and traces form the technical foundation, but enterprise observability adds three more dimensions: SLOs as contracts, cost attribution per team, and compliance-grade audit trails.

```yaml
# OpenTelemetry Collector configuration for enterprise
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  resource:
    attributes:
      - key: environment
        value: production
        action: insert
      - key: cost_center
        action: insert
        from_attribute: team
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - go_.*
          - process_.*
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-sampling
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency-sampling
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-sampling
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir.monitoring:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      # batch goes last so sampling decisions happen on complete traces
      receivers: [otlp]
      processors: [resource, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [resource, filter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]
```

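The three tail-sampling policies above compose as "keep if any policy matches." The sketch below illustrates that decision logic in Python under stated assumptions; it is a model of the policy semantics, not the collector's actual code, and the function name is hypothetical.

```python
# Illustrative model of the tail-sampling policies in the collector config:
# keep every errored trace, keep every slow trace, keep ~10% of the rest.
import random

ERROR = "ERROR"

def keep_trace(status_code: str, duration_ms: float, rng: random.Random) -> bool:
    if status_code == ERROR:     # error-sampling policy: 100% of errors
        return True
    if duration_ms >= 1000:      # latency-sampling policy (threshold_ms: 1000)
        return True
    return rng.random() < 0.10   # probabilistic-sampling policy (10%)

rng = random.Random(42)
print(keep_trace("OK", 1500, rng))   # slow trace: always kept -> True
print(keep_trace("ERROR", 20, rng))  # errored trace: always kept -> True
```

Because the error and latency policies always win, you retain full debugging signal for the traces that matter while paying for only a fraction of the healthy, fast ones.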
SLO-Driven Alerting

```yaml
# Prometheus recording rules for SLO tracking
groups:
  - name: slo-recording-rules
    interval: 30s
    rules:
      - record: slo:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - record: slo:api_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))

      # p99 latency in seconds (a quantile, despite the shared naming convention)
      - record: slo:api_latency_p99:ratio_rate5m
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

  - name: slo-alerts
    rules:
      - alert: SLOBurnRateHigh
        expr: |
          (
            slo:api_availability:ratio_rate5m < 0.999
            and
            slo:api_availability:ratio_rate1h < 0.999
          )
        for: 2m
        labels:
          severity: critical
          slo: api-availability
        annotations:
          summary: "API availability SLO burn rate is high"
          description: "5m availability: {{ $value | humanizePercentage }}"

      # fires when p99 latency stays above 0.5s (500ms) for 5 minutes
      - alert: SLOLatencyBudgetConsuming
        expr: slo:api_latency_p99:ratio_rate5m > 0.5
        for: 5m
        labels:
          severity: warning
          slo: api-latency
```

SLO-based alerting replaces threshold-based alerting. Instead of alerting when CPU hits 80%, alert when the error budget consumption rate suggests you will breach your SLO within the next hour. This dramatically reduces alert noise; teams commonly report 70-80% fewer alerts after switching to SLO-based approaches.
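
The "burn rate" behind this style of alerting is just the error ratio divided by the error budget. The sketch below works through the multi-window math for a 99.9% SLO; the 14.4x threshold follows the commonly cited Google SRE workbook values for a 5m/1h window pair, and the function names are illustrative.

```python
# Multi-window burn-rate math for a 99.9% availability SLO (illustrative).

SLO_TARGET = 0.999             # 99.9% availability
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_ratio / ERROR_BUDGET

def page_worthy(err_5m: float, err_1h: float, threshold: float = 14.4) -> bool:
    # Both windows must agree: the short window catches the spike quickly,
    # the long window filters out blips that have already recovered.
    return burn_rate(err_5m) > threshold and burn_rate(err_1h) > threshold

# A sustained 2% error rate burns the budget 20x too fast: page.
print(page_worthy(0.02, 0.02))   # True
# A brief spike the 1h window has not confirmed: no page.
print(page_worthy(0.02, 0.001))  # False
```

A burn rate of exactly 1.0 means the budget would run out precisely at the end of the SLO window; paging at 14.4x means the monthly budget would be gone in roughly two days if nothing changed.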

Centralized Log Management

```yaml
# Structured logging standard for enterprise
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-standard
data:
  schema.json: |
    {
      "required_fields": {
        "timestamp": "ISO 8601 format",
        "level": "debug|info|warn|error|fatal",
        "service": "service name from deployment label",
        "trace_id": "W3C trace context",
        "span_id": "W3C span context",
        "message": "human-readable description"
      },
      "optional_fields": {
        "user_id": "anonymized user identifier",
        "request_id": "correlation ID",
        "duration_ms": "operation duration",
        "error_code": "application-specific error code",
        "team": "owning team identifier",
        "cost_center": "billing attribution"
      }
    }
```

Enterprise log management is 80% standardization and 20% technology. Without a mandated log format, searching across 200 services becomes impossible because every team uses different field names and structures.
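
Standards only work when they are enforced. A minimal sketch of a validator, assuming the field list from the ConfigMap above; the function name is hypothetical, and in practice this check might run in CI or in the ingest pipeline:

```python
# Validate one JSON log line against the required-fields standard (sketch).
import json

REQUIRED_FIELDS = {"timestamp", "level", "service", "trace_id", "span_id", "message"}
ALLOWED_LEVELS = {"debug", "info", "warn", "error", "fatal"}

def validate_log_line(raw: str) -> list[str]:
    """Return a list of schema violations for one JSON log line."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if record.get("level") not in ALLOWED_LEVELS:
        errors.append(f"invalid level: {record.get('level')!r}")
    return errors

good = ('{"timestamp": "2024-01-01T00:00:00Z", "level": "info", "service": "api",'
        ' "trace_id": "abc", "span_id": "def", "message": "ok"}')
print(validate_log_line(good))                    # []
print(validate_log_line('{"level": "verbose"}'))  # missing fields + bad level
```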

Cost Management

Observability costs scale with cardinality (metrics), volume (logs), and retention (traces). At enterprise scale, uncontrolled observability spending can easily reach $50,000-200,000/month.
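
Cardinality dominates metrics cost because the series count is the product of each label's value count, so a single high-cardinality label multiplies everything. A back-of-envelope sketch, with illustrative label counts (the figures are assumptions, not measurements):

```python
# Series count = product of per-label value counts (rough model).
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    return prod(label_cardinalities.values())

# Illustrative: 200 services x 30 endpoints x 5 statuses x 12 histogram buckets.
base = {"service": 200, "endpoint": 30, "status": 5, "le": 12}
print(series_count(base))  # -> 360000

# Adding a user_id label with 10,000 values explodes the series count:
print(series_count({**base, "user_id": 10_000}))  # -> 3600000000
```

This is why per-team quotas (below) target series counts rather than raw sample volume: one careless label can outweigh every other optimization.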

```yaml
# Grafana Mimir tenant-based cost tracking
overrides:
  team-payments:
    max_series_per_user: 500000
    max_samples_per_query: 50000000
    ingestion_rate: 100000
    max_label_names_per_series: 30
  team-frontend:
    max_series_per_user: 200000
    max_samples_per_query: 20000000
    ingestion_rate: 50000
```

Per-team quotas prevent a single team from consuming the entire observability budget. When a team hits its quota, it is forced to reduce cardinality rather than ignore the cost.
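
Mimir enforces these overrides server-side and rejects writes that exceed a tenant's limits; the sketch below only models what that check looks like, using the quota values from the config above (the function and message strings are illustrative):

```python
# Model of a per-tenant quota check (Mimir does this server-side in reality).
QUOTAS = {
    "team-payments": {"max_series_per_user": 500_000, "ingestion_rate": 100_000},
    "team-frontend": {"max_series_per_user": 200_000, "ingestion_rate": 50_000},
}

def over_quota(team: str, active_series: int, samples_per_sec: int) -> list[str]:
    q = QUOTAS[team]
    breaches = []
    if active_series > q["max_series_per_user"]:
        breaches.append("series limit exceeded: reduce label cardinality")
    if samples_per_sec > q["ingestion_rate"]:
        breaches.append("ingestion rate exceeded: lower scrape frequency")
    return breaches

print(over_quota("team-frontend", 250_000, 60_000))  # both limits breached
print(over_quota("team-payments", 400_000, 80_000))  # within quota -> []
```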

Anti-Patterns to Avoid

Alert fatigue from threshold-based alerting. If your on-call engineers receive more than 5 alerts per shift, most are being ignored. Switch to SLO-based alerting and multi-window burn rate alerts.

No log retention policy. Storing all logs at full resolution indefinitely costs more than the infrastructure they're monitoring. Implement tiered retention: 7 days hot (full resolution), 30 days warm (sampled), 1 year cold (errors only).

Ignoring observability cost attribution. Without per-team cost tracking, no team has an incentive to reduce their cardinality or log volume. Make observability costs visible in each team's budget.

Custom dashboards for common patterns. Standard dashboards (RED metrics, USE method, SLO burn rate) should be templated and deployed automatically for every new service. Custom dashboards should only be built for domain-specific metrics.
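
The savings from the tiered retention policy above can be estimated with simple arithmetic. All dollar figures, volumes, and sampling rates below are illustrative assumptions, not vendor pricing; the comparison is of steady-state storage footprint times an assumed price per GB-month:

```python
# Rough flat-vs-tiered retention cost comparison (all figures are assumptions).
DAILY_GB = 500                       # assumed raw log volume per day
HOT, WARM, COLD = 0.50, 0.10, 0.01   # assumed $/GB/month per storage tier

def flat_cost(days: int = 365) -> float:
    """Monthly cost of keeping a full year of logs at hot-tier prices."""
    return DAILY_GB * days * HOT

def tiered_cost() -> float:
    hot = DAILY_GB * 7 * HOT             # days 1-7: full resolution
    warm = DAILY_GB * 0.2 * 23 * WARM    # days 8-30: ~20% sampled
    cold = DAILY_GB * 0.01 * 335 * COLD  # days 31-365: errors only (~1%)
    return hot + warm + cold

print(round(flat_cost()))    # -> 91250
print(round(tiered_cost()))  # -> 1997
```

Even allowing wide error bars on the assumptions, tiering reduces the storage footprint by more than an order of magnitude, which is why a retention policy belongs in the production checklist below.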

Production Checklist

  • OpenTelemetry Collector deployed as DaemonSet and Gateway
  • Structured logging standard enforced across all services
  • SLO definitions for all customer-facing services
  • Multi-window burn rate alerting replacing threshold alerts
  • Per-team observability cost tracking and quotas
  • Tail sampling for traces (errors: 100%, latency outliers: 100%, normal: 10%)
  • Log retention policy: 7d hot, 30d warm, 365d cold
  • Automated dashboard provisioning for new services
  • Runbook links in every alert annotation
  • Compliance-grade audit trail for data access
  • Cross-team trace correlation enabled
  • Regular cardinality reviews (monthly)

Conclusion

Enterprise observability is an organizational challenge as much as a technical one. The technology stack — OpenTelemetry, Prometheus/Mimir, Loki, Tempo — is well-established. The harder problems are standardizing log formats across 50 teams, implementing SLO-based alerting that actually reduces page volume, and making observability costs visible so teams optimize their instrumentation.

The most impactful investment is in the OpenTelemetry Collector pipeline. A well-configured collector with tail sampling, resource attribution, and metric filtering reduces backend costs by 50-70% while maintaining full visibility for debugging. Teams that skip this step and send everything to a SaaS backend discover the cost problem months later.
