
Kubernetes Production Setup Best Practices for High-Scale Teams

Battle-tested best practices for Kubernetes production setups, tailored to high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 11 min read

Running Kubernetes at high scale demands a fundamentally different operational mindset than managing a handful of clusters. When you're orchestrating thousands of pods across multiple regions, every misconfiguration compounds. This guide distills production-tested practices from teams running 500+ node clusters serving millions of requests per second.

Resource Management at Scale

Resource requests and limits are the foundation of stable high-scale clusters. Underspecified resources lead to noisy-neighbor problems; overspecified resources waste capacity.

Right-Sizing with VPA and Goldilocks

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"  # Recommendation-only in production
  resourcePolicy:
    containerPolicies:
      - containerName: api-gateway
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
```

At high scale, run VPA in recommendation mode and feed its suggestions into your CI pipeline rather than allowing live mutations. Direct pod updates cause restarts that cascade through large deployments.
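In recommendation mode the suggested values land in the VPA object's `status`, where a CI job can read them (for example via `kubectl get vpa -o yaml`) and open a pull request against the manifest. The fragment below sketches the shape of that status; the numbers are illustrative, not real recommendations:

```yaml
# Shape of a VPA recommendation in "Off" mode (values illustrative)
status:
  recommendation:
    containerRecommendations:
      - containerName: api-gateway
        lowerBound:            # below this, the pod is likely under-provisioned
          cpu: 150m
          memory: 256Mi
        target:                # the value to feed into the manifest's requests
          cpu: 500m
          memory: 640Mi
        upperBound:            # above this, capacity is likely wasted
          cpu: 1200m
          memory: 1Gi
```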

Guaranteed QoS for Critical Paths

```yaml
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```

For latency-sensitive services, set requests equal to limits to achieve Guaranteed QoS class. This prevents CPU throttling and OOM kills during traffic spikes. At scale, the 15-20% capacity overhead pays for itself in reduced incident frequency.

Cluster Architecture for Multi-Region

Topology-Aware Routing

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: payment-service
  ports:
    - port: 443
      targetPort: 8443
```

Topology-aware routing reduces cross-zone traffic by 60-70%, which at high scale translates to significant cost savings. A team running 2,000 pods across 3 AZs saved $14,000/month in inter-AZ data transfer after enabling this.

Node Pool Segmentation

```bash
# Dedicated node pool for latency-critical services
eksctl create nodegroup \
  --cluster production \
  --name latency-critical \
  --node-type c6i.2xlarge \
  --nodes-min 10 \
  --nodes-max 50 \
  --node-labels "tier=latency-critical" \
  --node-taints "dedicated=latency-critical:NoSchedule"
```

Separate node pools by workload characteristics: compute-intensive, memory-intensive, GPU, and general-purpose. This prevents resource contention and allows independent scaling. At 500+ nodes, mixing workload types on the same nodes creates unpredictable performance profiles.
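Workloads then opt in to a dedicated pool by tolerating its taint and selecting its label; a pod-template fragment matching the label and taint from the `eksctl` command above:

```yaml
# Pod template fragment for the latency-critical pool
spec:
  nodeSelector:
    tier: latency-critical     # matches --node-labels
  tolerations:
    - key: dedicated           # matches --node-taints
      operator: Equal
      value: latency-critical
      effect: NoSchedule
```

Without the toleration the scheduler rejects the pod; without the nodeSelector it may still land on general-purpose nodes, so both are needed.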

Pod Disruption Budgets and Rolling Updates

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      app: api-gateway
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 50
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 10%
    type: RollingUpdate
```

At high replica counts, percentage-based disruption budgets scale naturally. A fixed maxUnavailable: 1 on a 50-replica deployment makes rollouts painfully slow; 10% allows 5 pods to cycle simultaneously while maintaining 90% capacity.

Network Policies as Default Deny

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 5432
```

Default-deny network policies are non-negotiable at scale. A compromised pod in a 1,000-pod cluster with no network policies has lateral movement access to everything. Start with deny-all and whitelist explicitly.
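One caveat worth baking in from day one: a default-deny policy that includes Egress also blocks DNS, which breaks service discovery for every pod in the namespace. Most teams pair it with a blanket DNS allowance; a minimal sketch, assuming cluster DNS listens on the standard port 53:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}              # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}  # cluster DNS typically runs in kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

A tighter variant scopes the `namespaceSelector` to the namespace actually running CoreDNS rather than allowing port 53 everywhere.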


Observability at Scale

Custom Metrics for HPA

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 20
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    - type: Pods
      pods:
        metric:
          name: p99_latency_ms
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```

CPU-based HPA fails at scale because CPU utilization doesn't correlate with user experience. Use custom metrics — requests per second, p99 latency, queue depth — that reflect actual service health. The asymmetric scale-up/scale-down behavior prevents flapping: scale up aggressively (50% in 60s) but scale down conservatively (10% in 60s with 5-minute stabilization).
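Note that Pods-type metrics such as `http_requests_per_second` are not built into Kubernetes; they must be served through the custom metrics API, typically by prometheus-adapter. A sketch of an adapter rule that derives the per-second rate from a Prometheus counter (the series name and labels are assumptions; adjust them to your instrumentation):

```yaml
# prometheus-adapter config fragment (assumed counter: http_requests_total)
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: namespace}
        pod: {resource: pod}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"          # exposes http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```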

Prometheus Federation for Multi-Cluster

```yaml
# prometheus-federation.yaml
scrape_configs:
  - job_name: 'federated-clusters'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node:.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-east.internal:9090'
          - 'prometheus-eu-west.internal:9090'
          - 'prometheus-ap-south.internal:9090'
```

At high scale, a single Prometheus instance cannot ingest metrics from all clusters. Federation with recording rules at the edge reduces central ingestion by 90%. Each cluster Prometheus retains full-resolution data for 24 hours; the federation layer stores aggregated metrics for long-term trending.
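The `job:` and `node:` prefixes matched in the federation config follow the Prometheus recording-rule naming convention (`level:metric:operations`); each edge Prometheus pre-aggregates with rules along these lines (metric names illustrative):

```yaml
# Edge-cluster recording rules; only these aggregates cross the federation boundary
groups:
  - name: edge-aggregations
    interval: 30s
    rules:
      - record: job:http_requests_per_second:sum
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:request_errors_per_second:sum
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
```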

Anti-Patterns to Avoid

Running without Pod Disruption Budgets. At scale, node drains during upgrades can take down entire services if PDBs aren't configured. A cluster upgrade on 200 nodes without PDBs caused a 45-minute outage when all replicas of a critical service were drained simultaneously.

Using latest tags in production. Image tag immutability is essential when you have 50 deployments across 3 clusters. A single latest tag pointing to a broken image propagates failures across your entire infrastructure in minutes.
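This is enforceable at admission time rather than by convention; a sketch of a Kyverno policy that rejects `:latest` (OPA Gatekeeper can express the same constraint):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag or digest, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"   # reject any image ending in :latest
```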

Ignoring etcd performance. At 500+ nodes, etcd becomes the bottleneck. Watch count, object count, and request latency all grow non-linearly. Monitor etcd metrics aggressively and consider dedicated etcd nodes with SSD storage.
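As a starting point, alert on etcd's backend commit latency; the threshold below is a judgment call, so tune it to your disks:

```yaml
# Prometheus alert rule for etcd disk pressure
groups:
  - name: etcd-alerts
    rules:
      - alert: EtcdHighCommitLatency
        expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "etcd p99 backend commit latency above 250ms"
```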

Skipping admission controllers. Without admission webhooks enforcing resource requests, namespace quotas, and security contexts, a single misconfigured deployment can consume an entire node pool. OPA Gatekeeper or Kyverno should be mandatory at scale.
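A sketch of the corresponding Kyverno guardrail, requiring requests and a memory limit on every container (the exact fields to mandate are a policy decision):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU/memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"      # any non-empty value
                    memory: "?*"
                  limits:
                    memory: "?*"
```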

Manual kubectl operations. GitOps (ArgoCD or Flux) is the only safe deployment mechanism at high scale. Manual kubectl applies across multiple clusters are impossible to audit, impossible to roll back atomically, and guaranteed to cause drift.
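A minimal ArgoCD Application tying a cluster namespace to a Git path (the repo URL and path are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/example/platform-manifests  # placeholder
    targetRevision: main
    path: apps/api-gateway
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```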

Production Checklist

  • VPA running in recommendation mode with CI integration
  • Guaranteed QoS for latency-critical services
  • Topology-aware routing enabled for cross-zone optimization
  • Separate node pools for distinct workload types
  • PDBs on every production deployment
  • Default-deny network policies per namespace
  • Custom metrics HPA with asymmetric scaling behavior
  • Prometheus federation for multi-cluster monitoring
  • GitOps-only deployment pipeline (no manual kubectl)
  • etcd monitoring with dedicated alerting
  • Resource quotas enforced per namespace
  • Pod Security Standards enforced cluster-wide
  • Image tag immutability enforced via admission controller
  • Cluster autoscaler tuned for scale-up speed (30s scan interval)
  • Regular chaos engineering exercises (pod kill, node drain, AZ failure)

Conclusion

High-scale Kubernetes operations require treating your cluster configuration with the same rigor as application code. Every resource manifest, network policy, and autoscaling configuration should be version-controlled, reviewed, and tested before reaching production. The practices outlined here — from VPA-driven right-sizing to federation-based observability — form a cohesive operational framework that scales from 100 to 10,000 pods.

The most critical investment is in guardrails: admission controllers that prevent misconfigurations, PDBs that protect availability during maintenance, and GitOps pipelines that ensure reproducibility. Teams that build these foundations early avoid the operational emergencies that plague organizations scaling reactively.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
