
Kubernetes Production Setup Best Practices for High-Scale Teams

Battle-tested best practices for Kubernetes production setups, tailored to high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 11 min read

Running Kubernetes at high scale demands a fundamentally different operational mindset than managing a handful of clusters. When you're orchestrating thousands of pods across multiple regions, every misconfiguration compounds. This guide distills production-tested practices from teams running 500+ node clusters serving millions of requests per second.

Resource Management at Scale

Resource requests and limits are the foundation of stable high-scale clusters. Underspecified resources lead to noisy-neighbor problems; overspecified resources waste capacity.

Right-Sizing with VPA and Goldilocks

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"  # Recommendation-only in production
  resourcePolicy:
    containerPolicies:
      - containerName: api-gateway
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
```

At high scale, run VPA in recommendation mode and feed its suggestions into your CI pipeline rather than allowing live mutations. Direct pod updates cause restarts that cascade through large deployments.
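In recommendation mode the suggested values land in the VPA object's `status`, where a CI job can read them (for example via `kubectl get vpa -o yaml`) and open a pull request against the manifest. The fragment below sketches the shape of that status; the numbers are illustrative, not real recommendations:

```yaml
# Shape of a VPA recommendation in "Off" mode (values illustrative)
status:
  recommendation:
    containerRecommendations:
      - containerName: api-gateway
        lowerBound:            # below this, the pod is likely under-provisioned
          cpu: 150m
          memory: 256Mi
        target:                # the value to feed into the manifest's requests
          cpu: 500m
          memory: 640Mi
        upperBound:            # above this, capacity is likely wasted
          cpu: 1200m
          memory: 1Gi
```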

Guaranteed QoS for Critical Paths

```yaml
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```

For latency-sensitive services, set requests equal to limits to achieve Guaranteed QoS class. This prevents CPU throttling and OOM kills during traffic spikes. At scale, the 15-20% capacity overhead pays for itself in reduced incident frequency.

Cluster Architecture for Multi-Region

Topology-Aware Routing

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: payment-service
  ports:
    - port: 443
      targetPort: 8443
```

Topology-aware routing reduces cross-zone traffic by 60-70%, which at high scale translates to significant cost savings. A team running 2,000 pods across 3 AZs saved $14,000/month in inter-AZ data transfer after enabling this.

Node Pool Segmentation

```bash
# Dedicated node pool for latency-critical services
eksctl create nodegroup \
  --cluster production \
  --name latency-critical \
  --node-type c6i.2xlarge \
  --nodes-min 10 \
  --nodes-max 50 \
  --node-labels "tier=latency-critical" \
  --node-taints "dedicated=latency-critical:NoSchedule"
```

Separate node pools by workload characteristics: compute-intensive, memory-intensive, GPU, and general-purpose. This prevents resource contention and allows independent scaling. At 500+ nodes, mixing workload types on the same nodes creates unpredictable performance profiles.
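Workloads then opt in to a dedicated pool by tolerating its taint and selecting its label; a pod-template fragment matching the label and taint from the `eksctl` command above:

```yaml
# Pod template fragment for the latency-critical pool
spec:
  nodeSelector:
    tier: latency-critical     # matches --node-labels
  tolerations:
    - key: dedicated           # matches --node-taints
      operator: Equal
      value: latency-critical
      effect: NoSchedule
```

Without the toleration the scheduler rejects the pod; without the nodeSelector it may still land on general-purpose nodes, so both are needed.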

Pod Disruption Budgets and Rolling Updates

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      app: api-gateway
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 50
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 10%
    type: RollingUpdate
```

At high replica counts, percentage-based disruption budgets scale naturally. A fixed maxUnavailable: 1 on a 50-replica deployment makes rollouts painfully slow; 10% allows 5 pods to cycle simultaneously while maintaining 90% capacity.

Network Policies as Default Deny

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 5432
```

Default-deny network policies are non-negotiable at scale. A compromised pod in a 1,000-pod cluster with no network policies has lateral movement access to everything. Start with deny-all and whitelist explicitly.
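One caveat worth baking in from day one: a default-deny policy that includes Egress also blocks DNS, which breaks service discovery for every pod in the namespace. Most teams pair it with a blanket DNS allowance; a minimal sketch, assuming cluster DNS listens on the standard port 53:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}              # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}  # cluster DNS typically runs in kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

A tighter variant scopes the `namespaceSelector` to the namespace actually running CoreDNS rather than allowing port 53 everywhere.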


Observability at Scale

Custom Metrics for HPA

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 20
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    - type: Pods
      pods:
        metric:
          name: p99_latency_ms
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```

CPU-based HPA fails at scale because CPU utilization doesn't correlate with user experience. Use custom metrics — requests per second, p99 latency, queue depth — that reflect actual service health. The asymmetric scale-up/scale-down behavior prevents flapping: scale up aggressively (50% in 60s) but scale down conservatively (10% in 60s with 5-minute stabilization).
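Note that Pods-type metrics such as `http_requests_per_second` are not built into Kubernetes; they must be served through the custom metrics API, typically by prometheus-adapter. A sketch of an adapter rule that derives the per-second rate from a Prometheus counter (the series name and labels are assumptions; adjust them to your instrumentation):

```yaml
# prometheus-adapter config fragment (assumed counter: http_requests_total)
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: namespace}
        pod: {resource: pod}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"          # exposes http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```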

Prometheus Federation for Multi-Cluster

```yaml
# prometheus-federation.yaml
scrape_configs:
  - job_name: 'federated-clusters'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-east.internal:9090'
          - 'prometheus-eu-west.internal:9090'
          - 'prometheus-ap-south.internal:9090'
```

At high scale, a single Prometheus instance cannot ingest metrics from all clusters. Federation with recording rules at the edge reduces central ingestion by 90%. Each cluster Prometheus retains full-resolution data for 24 hours; the federation layer stores aggregated metrics for long-term trending.
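The `job:` and `node:` prefixes matched in the federation config follow the Prometheus recording-rule naming convention (`level:metric:operations`); each edge Prometheus pre-aggregates with rules along these lines (metric names illustrative):

```yaml
# Edge-cluster recording rules; only these aggregates cross the federation boundary
groups:
  - name: edge-aggregations
    interval: 30s
    rules:
      - record: job:http_requests_per_second:sum
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:request_errors_per_second:sum
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
```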

Anti-Patterns to Avoid

Running without Pod Disruption Budgets. At scale, node drains during upgrades can take down entire services if PDBs aren't configured. A cluster upgrade on 200 nodes without PDBs caused a 45-minute outage when all replicas of a critical service were drained simultaneously.

Using latest tags in production. Image tag immutability is essential when you have 50 deployments across 3 clusters. A single latest tag pointing to a broken image propagates failures across your entire infrastructure in minutes.
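This is enforceable at admission time rather than by convention; a sketch of a Kyverno policy that rejects `:latest` (OPA Gatekeeper can express the same constraint):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag or digest, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"   # reject any image ending in :latest
```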

Ignoring etcd performance. At 500+ nodes, etcd becomes the bottleneck. Watch count, object count, and request latency all grow non-linearly. Monitor etcd metrics aggressively and consider dedicated etcd nodes with SSD storage.
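As a starting point, alert on etcd's backend commit latency; the threshold below is a judgment call, so tune it to your disks:

```yaml
# Prometheus alert rule for etcd disk pressure
groups:
  - name: etcd-alerts
    rules:
      - alert: EtcdHighCommitLatency
        expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "etcd p99 backend commit latency above 250ms"
```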

Skipping admission controllers. Without admission webhooks enforcing resource requests, namespace quotas, and security contexts, a single misconfigured deployment can consume an entire node pool. OPA Gatekeeper or Kyverno should be mandatory at scale.
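A sketch of the corresponding Kyverno guardrail, requiring requests and a memory limit on every container (the exact fields to mandate are a policy decision):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU/memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"      # any non-empty value
                    memory: "?*"
                  limits:
                    memory: "?*"
```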

Manual kubectl operations. GitOps (ArgoCD or Flux) is the only safe deployment mechanism at high scale. Manual kubectl applies across multiple clusters are impossible to audit, impossible to roll back atomically, and guaranteed to cause drift.
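A minimal ArgoCD Application tying a cluster namespace to a Git path (the repo URL and path are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/example/platform-manifests  # placeholder
    targetRevision: main
    path: apps/api-gateway
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```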

Production Checklist

  • VPA running in recommendation mode with CI integration
  • Guaranteed QoS for latency-critical services
  • Topology-aware routing enabled for cross-zone optimization
  • Separate node pools for distinct workload types
  • PDBs on every production deployment
  • Default-deny network policies per namespace
  • Custom metrics HPA with asymmetric scaling behavior
  • Prometheus federation for multi-cluster monitoring
  • GitOps-only deployment pipeline (no manual kubectl)
  • etcd monitoring with dedicated alerting
  • Resource quotas enforced per namespace
  • Pod Security Standards enforced cluster-wide
  • Image tag immutability enforced via admission controller
  • Cluster autoscaler tuned for scale-up speed (30s scan interval)
  • Regular chaos engineering exercises (pod kill, node drain, AZ failure)

Conclusion

High-scale Kubernetes operations require treating your cluster configuration with the same rigor as application code. Every resource manifest, network policy, and autoscaling configuration should be version-controlled, reviewed, and tested before reaching production. The practices outlined here — from VPA-driven right-sizing to federation-based observability — form a cohesive operational framework that scales from 100 to 10,000 pods.

The most critical investment is in guardrails: admission controllers that prevent misconfigurations, PDBs that protect availability during maintenance, and GitOps pipelines that ensure reproducibility. Teams that build these foundations early avoid the operational emergencies that plague organizations scaling reactively.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
