DevOps

Complete Guide to Kubernetes Production Setup with Python

A comprehensive guide to implementing Kubernetes Production Setup using Python, covering architecture, code examples, and production-ready patterns.

Muneer Puthiya Purayil · 16 min read

Python workloads on Kubernetes present unique challenges: the GIL limits true parallelism, memory usage can be unpredictable with large ML libraries, and dependency management adds significant image bloat. This guide covers production-ready patterns for deploying Python applications — from FastAPI web services to data pipelines — on Kubernetes.

Optimized Container Images

Python images are notoriously large. A naive python:3.12 base image is 1GB+. Proper multi-stage builds and base image selection reduce this dramatically.

```dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .

FROM python:3.12-slim
RUN groupadd -g 1001 appuser && useradd -u 1001 -g appuser appuser
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src ./src
ENV PATH="/app/.venv/bin:$PATH"
USER appuser
EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

Using uv instead of pip reduces dependency installation from minutes to seconds. The --frozen flag ensures the lockfile is respected exactly. Separating the virtual environment into the runtime stage avoids including build tools, headers, and compilation artifacts.

For ML workloads with heavy native dependencies (NumPy, pandas, PyTorch), consider distroless Python or custom base images:

```dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

FROM gcr.io/distroless/python3-debian12
WORKDIR /app
COPY --from=builder /app/.venv/lib/python3.12/site-packages /usr/lib/python3.12/site-packages
COPY src/ ./src/
EXPOSE 8000
CMD ["src/main.py"]
```

One caveat: the distroless image's Python minor version must match the builder's, since copied site-packages are only found under the interpreter's own version directory. Verify the distroless tag's bundled Python version before adopting this pattern.

FastAPI Application Structure

```python
import os
import time
from contextlib import asynccontextmanager
from typing import AsyncIterator

import asyncpg
import structlog
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

logger = structlog.get_logger()

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

class AppState:
    db_pool: asyncpg.Pool | None = None
    shutting_down: bool = False

state = AppState()

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    state.db_pool = await asyncpg.create_pool(
        dsn=os.environ["DATABASE_URL"],
        min_size=5,
        max_size=20,
        max_inactive_connection_lifetime=300,
    )
    logger.info("database pool created")
    yield
    if state.db_pool:
        await state.db_pool.close()
        logger.info("database pool closed")

app = FastAPI(lifespan=lifespan)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.url.path,
    ).observe(duration)
    return response

@app.get("/healthz")
async def health():
    return {"status": "ok"}

@app.get("/readyz")
async def ready():
    # Returning a (body, status) tuple does not set the status code in
    # FastAPI; use an explicit response object for the 503 cases.
    if state.shutting_down:
        return JSONResponse(status_code=503, content={"status": "shutting down"})
    try:
        async with state.db_pool.acquire() as conn:
            await conn.fetchval("SELECT 1")
        return {"status": "ready"}
    except Exception:
        return JSONResponse(status_code=503, content={"status": "not ready"})

@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

Worker Configuration

Python's GIL means CPU-bound work in a single process only uses one core. For web services, use multiple Uvicorn workers:

```yaml
env:
- name: WEB_CONCURRENCY
  value: "4"  # Number of Uvicorn workers
- name: UVICORN_WORKERS
  value: "4"
```
The usual formula: workers = 2 * cpu_cores + 1 for I/O-bound services, workers = cpu_cores for CPU-bound services. Each Uvicorn worker is a separate process consuming 50-150MB of memory, so 4 workers on a 1Gi memory limit use roughly 200-600MB at baseline, leaving about 400-800MB of shared headroom for request handling, on the order of 100-200MB per worker. Note that env-based worker counts only apply when --workers is not passed on the command line; the CLI flag, as in the Dockerfile above, takes precedence.
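The formula above can be sketched as a small helper. This is a hypothetical function (not from the article's codebase), and note the caveat in the docstring: inside a container, `os.cpu_count()` reports the node's cores, not the pod's CPU limit, so in a real deployment you would derive the core count from the resource limit instead (e.g. via the Downward API).

```python
import os

def suggested_workers(io_bound: bool = True) -> int:
    """Apply the worker-count formula (hypothetical helper).

    Caveat: os.cpu_count() sees the node's cores, not the pod's CPU
    limit; inside a pod, derive the count from the CPU limit instead.
    """
    cores = os.cpu_count() or 1
    return 2 * cores + 1 if io_bound else cores
```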

Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: api
        image: registry.example.com/api-service:v1.3.0
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: api-secrets
              key: database-url
        - name: WEB_CONCURRENCY
          value: "4"
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            memory: 1Gi
        readinessProbe:
          httpGet:
            path: /readyz
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 10
          periodSeconds: 15
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1001
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}
```

Python applications typically need a writable /tmp for temporary files, and many libraries assume one exists. The emptyDir volume satisfies this requirement while keeping the root filesystem read-only. (Compiled .pyc files are a separate concern: by default they are written next to the source under __pycache__, which a read-only filesystem silently prevents.)
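If bytecode caching matters for cold-start time, the standard PYTHONPYCACHEPREFIX environment variable (Python 3.8+) can redirect .pyc files into the writable emptyDir. A sketch of the env addition:

```yaml
env:
- name: PYTHONPYCACHEPREFIX
  value: /tmp/pycache   # .pyc files land in the writable emptyDir
```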

Celery Workers on Kubernetes

Background task processing with Celery requires separate deployments for the worker and beat scheduler:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      containers:
      - name: worker
        image: registry.example.com/api-service:v1.3.0
        command: ["celery", "-A", "src.celery_app", "worker",
                  "--loglevel=info",
                  "--concurrency=4",
                  "--max-tasks-per-child=1000",
                  "--without-heartbeat"]
        env:
        - name: CELERY_BROKER_URL
          valueFrom:
            secretKeyRef:
              name: celery-secrets
              key: broker-url
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            memory: 1Gi
        livenessProbe:
          exec:
            command: ["celery", "-A", "src.celery_app", "inspect", "ping",
                      "--timeout", "5"]
          initialDelaySeconds: 30
          periodSeconds: 60
          timeoutSeconds: 10
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-beat
spec:
  replicas: 1
  selector:
    matchLabels:
      app: celery-beat
  template:
    metadata:
      labels:
        app: celery-beat
    spec:
      containers:
      - name: beat
        image: registry.example.com/api-service:v1.3.0
        command: ["celery", "-A", "src.celery_app", "beat",
                  "--loglevel=info"]
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 256Mi
```

Key Celery-on-Kubernetes patterns:

  • --max-tasks-per-child=1000 prevents memory leaks by recycling workers after 1,000 tasks. Python's garbage collector doesn't always reclaim fragmented memory.
  • --without-heartbeat reduces Redis/RabbitMQ overhead when pod health is managed by Kubernetes probes.
  • Celery Beat must run as a single replica. Use a Deployment with replicas: 1 and consider adding a leader election sidecar for high availability.
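To keep tasks from being lost during the pod churn described above, these flags pair well with late acknowledgement. A hedged sketch of settings (these are real Celery option names, but verify behavior against your broker and Celery version) that would be applied with `app.conf.update(**CELERY_CONFIG)`:

```python
# Sketch: Celery settings that pair well with Kubernetes-managed workers.
# With task_acks_late, a task killed mid-flight during a rolling update is
# redelivered to another worker instead of being silently lost.
CELERY_CONFIG = {
    "task_acks_late": True,              # ack after completion, not on receipt
    "worker_prefetch_multiplier": 1,     # don't hoard tasks a dying pod can't finish
    "task_reject_on_worker_lost": True,  # requeue tasks from OOM-killed workers
}
```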


Data Pipeline Jobs

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-etl
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 3
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: etl
            image: registry.example.com/etl-pipeline:v1.0.0
            command: ["python", "-m", "src.etl.daily_pipeline"]
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
              limits:
                memory: 8Gi
            env:
            - name: PYTHONUNBUFFERED
              value: "1"
```

PYTHONUNBUFFERED=1 ensures print statements and log output appear in kubectl logs immediately rather than being buffered. Without this, debugging failed jobs requires waiting for buffer flushes that may never happen if the process crashes.
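When you cannot control the environment, the same effect is available per call with the standard `flush` argument. A minimal stdlib sketch (the function and message are illustrative):

```python
import sys

def log_line(msg: str) -> None:
    # flush=True forces the line out immediately, the per-call
    # equivalent of running the process with PYTHONUNBUFFERED=1
    print(msg, file=sys.stdout, flush=True)

log_line("daily_etl: extract complete")
```

Running the interpreter with `python -u` is the command-line equivalent of setting the environment variable.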

Memory Management

Python's memory allocator doesn't return freed memory to the OS reliably. This means a pod that processes a large batch job can hold onto memory long after the data is garbage collected.

```python
import gc
import tracemalloc

import structlog

logger = structlog.get_logger()

tracemalloc.start()

def transform(item: dict) -> dict:
    # placeholder for the real per-item transformation
    return item

def process_large_batch(items: list[dict]) -> list[dict]:
    results = []
    # Process in chunks to limit peak memory
    chunk_size = 1000
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        chunk_results = [transform(item) for item in chunk]
        results.extend(chunk_results)
        del chunk, chunk_results
        gc.collect()  # Force garbage collection between chunks

    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:5]:
        logger.info("memory_usage", stat=str(stat))

    return results
```

For Kubernetes, this means setting memory limits with headroom for Python's memory management behavior. If your application's working set is 500MB, set the memory limit to 800MB-1GB to account for fragmentation and GC overhead.
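Expressed as a manifest fragment (numbers illustrative, matching the 500MB working-set example):

```yaml
resources:
  requests:
    memory: 512Mi   # approximate steady-state working set
  limits:
    memory: 1Gi     # ~2x headroom for fragmentation and GC overhead
```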

Anti-Patterns to Avoid

Using python:3.12 as the base image. The full Python image includes compilers, headers, and development tools totaling 1GB+. Always use python:3.12-slim or distroless variants.

Running a single Uvicorn worker. A single-worker process can execute only one CPU-bound request at a time, so one slow request stalls everything queued behind it. Always run multiple workers.

Ignoring SIGTERM handling. Uvicorn handles SIGTERM by default, but Celery workers need explicit signal handling to finish in-progress tasks. Without graceful shutdown, tasks are lost during rolling updates.

Installing packages at runtime. Pods that run pip install on startup have unpredictable startup times and fail when PyPI is unreachable. All dependencies must be baked into the container image.

Not setting PYTHONDONTWRITEBYTECODE. On a read-only root filesystem Python does not crash when it cannot write .pyc files; it silently skips caching them, so every pod start pays full bytecode compilation on import. Set PYTHONDONTWRITEBYTECODE=1 to make that tradeoff explicit, or redirect the cache to a writable path with PYTHONPYCACHEPREFIX.

Conclusion

Python on Kubernetes requires accepting the language's runtime characteristics — the GIL, memory management quirks, and larger image sizes — and compensating at the infrastructure level. Multi-worker processes, generous memory limits with headroom for fragmentation, and proper container image optimization are the foundations.

The Python ecosystem's strength lies in its library breadth, particularly for data science and ML workloads. A well-optimized Python deployment on Kubernetes — using uv for dependency management, multi-stage builds for image size, and Celery for background processing — provides a productive development environment with production-grade operations.

