DevOps

Complete Guide to Monitoring & Observability with Python

A comprehensive guide to implementing Monitoring & Observability using Python, covering architecture, code examples, and production-ready patterns.

Muneer Puthiya Purayil · 16 min read

Python's dynamic nature and rich library ecosystem make it a natural fit for building monitoring dashboards, data analysis pipelines, and custom alerting logic. While Python isn't the language of choice for building high-throughput monitoring backends (Go and Rust dominate there), it excels at the human-facing layer of observability.

Application Instrumentation

Prometheus Client

```python
from prometheus_client import Counter, Histogram, Gauge
import time

REQUEST_COUNT = Counter(
    "http_requests_total", "Total requests", ["method", "endpoint", "status"]
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "Request duration", ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
ACTIVE_REQUESTS = Gauge("http_requests_active", "Active requests")

async def metrics_middleware(request, call_next):
    ACTIVE_REQUESTS.inc()
    start = time.perf_counter()
    try:
        response = await call_next(request)
    finally:
        # Decrement even if the handler raises, or the gauge drifts upward
        ACTIVE_REQUESTS.dec()
    duration = time.perf_counter() - start
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=str(response.status_code),
    ).inc()
    REQUEST_DURATION.labels(method=request.method, endpoint=request.url.path).observe(duration)
    return response
```
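Prometheus pulls metrics over HTTP, so the app also needs a scrape endpoint. A minimal, framework-agnostic sketch (the counter name here is illustrative; in a Starlette/FastAPI app a `GET /metrics` route would return this pair as the response body and `Content-Type`):

```python
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

# Hypothetical standalone counter for illustration; in the app above,
# REQUEST_COUNT and friends live in the same default registry and are
# exported by the same call.
DEMO_COUNT = Counter("demo_requests_total", "Demo requests")

def render_metrics() -> tuple[bytes, str]:
    # Serialize every collector in the default registry using the
    # Prometheus text exposition format, plus the matching Content-Type.
    return generate_latest(), CONTENT_TYPE_LATEST
```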

OpenTelemetry Integration

```python
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

def setup_telemetry(app, service_name: str):
    resource = Resource.create({"service.name": service_name})

    # Traces: batch spans and ship them over OTLP/gRPC
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(tracer_provider)

    # Metrics: expose via the Prometheus bridge
    reader = PrometheusMetricReader()
    metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

    # Auto-instrumentation for the web framework, ORM, and HTTP client
    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()
```

Structured Logging

```python
import structlog
import logging

def setup_logging(level: str = "INFO"):
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(getattr(logging, level)),
        logger_factory=structlog.PrintLoggerFactory(),
    )

logger = structlog.get_logger()

# Usage: event name first, context as key-value pairs
logger.info("order_created", order_id="o_123", amount=4999, customer_id="c_456")
logger.error("payment_failed", order_id="o_123", error="insufficient_funds", provider="stripe")
```


Custom Monitoring Scripts

SLA Report Generator

```python
from datetime import datetime, timedelta, timezone
from dataclasses import dataclass
import httpx

@dataclass
class SLAReport:
    service: str
    period_start: datetime
    period_end: datetime
    availability: float
    p50_latency_ms: float
    p99_latency_ms: float
    error_count: int
    total_requests: int
    slo_met: bool

async def query_prometheus(client: httpx.AsyncClient, url: str, query: str):
    resp = await client.get(f"{url}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Instant-query vectors come back as [timestamp, value]; empty means no data
    return result[0]["value"][1] if result else None

async def generate_sla_report(prometheus_url: str, service: str, hours: int = 24) -> SLAReport:
    async with httpx.AsyncClient() as client:
        end = datetime.now(timezone.utc)
        start = end - timedelta(hours=hours)

        availability = await query_prometheus(
            client, prometheus_url,
            f'sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{hours}h]))'
            f' / sum(rate(http_requests_total{{service="{service}"}}[{hours}h]))',
        )
        p50 = await query_prometheus(
            client, prometheus_url,
            f'histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket{{service="{service}"}}[{hours}h])) by (le))',
        )
        p99 = await query_prometheus(
            client, prometheus_url,
            f'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{{service="{service}"}}[{hours}h])) by (le))',
        )
        errors = await query_prometheus(
            client, prometheus_url,
            f'sum(increase(http_requests_total{{service="{service}",status=~"5.."}}[{hours}h]))',
        )
        total = await query_prometheus(
            client, prometheus_url,
            f'sum(increase(http_requests_total{{service="{service}"}}[{hours}h]))',
        )

        return SLAReport(
            service=service,
            period_start=start,
            period_end=end,
            availability=float(availability or 0),
            p50_latency_ms=float(p50 or 0) * 1000,
            p99_latency_ms=float(p99 or 0) * 1000,
            error_count=int(float(errors or 0)),
            total_requests=int(float(total or 0)),
            slo_met=float(availability or 0) >= 0.999,
        )
```
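A report object is only useful once it reaches humans, typically as a Slack message or e-mail digest. A small formatter sketch, with the dataclass restated so the snippet stands alone and the 99.9% figure mirroring the `slo_met` threshold above:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SLAReport:  # same fields as the dataclass above
    service: str
    period_start: datetime
    period_end: datetime
    availability: float
    p50_latency_ms: float
    p99_latency_ms: float
    error_count: int
    total_requests: int
    slo_met: bool

def format_sla_report(r: SLAReport) -> str:
    # Plain-text summary suitable for an alert channel or daily digest
    status = "met" if r.slo_met else "BREACHED"
    return (
        f"SLA report: {r.service} "
        f"({r.period_start:%Y-%m-%d %H:%M} to {r.period_end:%Y-%m-%d %H:%M} UTC)\n"
        f"  availability : {r.availability:.3%} (99.9% SLO {status})\n"
        f"  latency      : p50 {r.p50_latency_ms:.0f} ms, p99 {r.p99_latency_ms:.0f} ms\n"
        f"  errors       : {r.error_count} of {r.total_requests} requests"
    )
```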

Anomaly Detection

```python
import numpy as np
from collections import deque

class SimpleAnomalyDetector:
    """Rolling z-score detector over a fixed-size window of recent values."""

    def __init__(self, window_size: int = 100, z_threshold: float = 3.0):
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def is_anomaly(self, value: float) -> tuple[bool, float]:
        # Too little history to establish a baseline
        if len(self.window) < 10:
            self.window.append(value)
            return False, 0.0

        mean = np.mean(list(self.window))
        std = np.std(list(self.window))

        # A perfectly flat series has no spread to measure against
        if std == 0:
            self.window.append(value)
            return False, 0.0

        z_score = abs(value - mean) / std
        self.window.append(value)

        return z_score > self.z_threshold, z_score
```
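A quick usage sketch. The detector is restated here using only the standard library (`statistics.fmean` and `statistics.pstdev` compute the same mean and population standard deviation as `np.mean`/`np.std`), which also shows that NumPy is optional for a plain rolling z-score:

```python
import statistics
from collections import deque

class RollingZScoreDetector:  # stdlib re-implementation of the class above
    def __init__(self, window_size: int = 100, z_threshold: float = 3.0):
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def is_anomaly(self, value: float) -> tuple[bool, float]:
        if len(self.window) < 10:
            self.window.append(value)
            return False, 0.0
        mean = statistics.fmean(self.window)
        std = statistics.pstdev(self.window)  # population std, like np.std
        if std == 0:
            self.window.append(value)
            return False, 0.0
        z = abs(value - mean) / std
        self.window.append(value)
        return z > self.z_threshold, z

# Warm up on steady latencies around 100 ms, then feed a spike
detector = RollingZScoreDetector(window_size=50)
for v in [100.0, 101.0, 99.0, 102.0, 98.0] * 10:
    detector.is_anomaly(v)
spike, z = detector.is_anomaly(400.0)  # flagged: z-score far above 3.0
```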

Conclusion

Python's role in monitoring is at the analysis and automation layer — building SLA reports, anomaly detection, custom alerting logic, and monitoring dashboards. The Prometheus client and OpenTelemetry SDK handle standard instrumentation, while Python's data science libraries (NumPy, pandas) enable sophisticated analysis that would be cumbersome in Go or Java. For teams with Python services, structlog and the OTel auto-instrumentation libraries provide production-grade observability with minimal configuration.


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
