Kubernetes Production Setup Best Practices for Enterprise Teams
Muneer Puthiya Purayil · 10 min read
Running Kubernetes in production for enterprise workloads is fundamentally different from getting a cluster up and running. The gap between a working cluster and a production-grade platform spans security hardening, multi-tenancy, observability, compliance, and operational procedures that prevent 3 AM pages. Enterprise teams need battle-tested patterns, not just tutorials.
This guide covers the practices that separate production Kubernetes from demo Kubernetes — drawn from real-world deployments managing hundreds of services across regulated industries.
Cluster Architecture
Control Plane Configuration
Enterprise clusters should use managed control planes (EKS, GKE, AKS) unless you have a dedicated platform team of 5+ engineers. Self-managing etcd, the API server, and controller managers is a full-time job.
For EKS, configure the control plane for high availability:
```yaml
# eksctl cluster config
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: production-primary
  region: us-east-1
  version: "1.29"

iam:
  withOIDC: true

vpc:
  clusterEndpoints:
    privateAccess: true
    publicAccess: false  # API server not exposed to the internet

managedNodeGroups:
  - name: system
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 5
    desiredCapacity: 3
    labels:
      role: system
    taints:
      - key: CriticalAddonsOnly
        effect: NoSchedule
    privateNetworking: true
    volumeSize: 100
    volumeType: gp3
    tags:
      Team: platform
      Environment: production

  - name: workload
    instanceType: m6i.2xlarge
    minSize: 5
    maxSize: 50
    desiredCapacity: 10
    labels:
      role: workload
    privateNetworking: true
    volumeSize: 200
    volumeType: gp3

addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest
```
Key decisions:
Private API endpoint only. Publicly exposed API servers are among the most common attack vectors for enterprise clusters. Use a VPN or bastion for kubectl access.
Separate system and workload node groups. System components (ingress, monitoring, cert-manager) run on dedicated nodes with taints. Application workloads can't interfere with cluster operations.
gp3 volumes. Roughly 20% cheaper than gp2, with a 3,000 IOPS baseline regardless of volume size (gp2 scales at 3 IOPS/GB, with a 100 IOPS floor).
Node Pool Strategy
Enterprise clusters typically need 3-4 node pools:
| Node Pool | Instance Type | Purpose | Taint |
|-----------|---------------|---------|-------|
| system | m6i.xlarge | Ingress, monitoring, cert-manager | CriticalAddonsOnly |
| workload | m6i.2xlarge | Application services | None |
| compute | c6i.4xlarge | CPU-intensive batch jobs | workload-type=compute |
| gpu | g5.xlarge | ML inference | nvidia.com/gpu |
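A system component opts into the tainted system pool with a matching toleration and node selector. A minimal sketch (this pod spec fragment is illustrative, not taken from the cluster config above):

```yaml
# Illustrative pod spec fragment for a component (e.g. an ingress controller)
# that should land on the tainted system pool.
spec:
  nodeSelector:
    role: system            # matches the label on the system node group
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
      effect: NoSchedule    # tolerate the system pool's taint
```

Workloads without this toleration are kept off system nodes, which is the point of the taint.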
Namespace and Multi-Tenancy
Namespace Strategy
Organize namespaces by team and environment, not by application:
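A team-and-environment scheme, sketched as a namespace manifest (the labels shown are illustrative conventions):

```yaml
# Illustrative namespace following a <team>-<environment> naming scheme.
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod     # team + environment, not application
  labels:
    team: payments
    environment: production
```

Team-scoped namespaces make RBAC bindings, resource quotas, and network policies map cleanly onto ownership boundaries.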
preStop hook: The 10-second sleep allows load balancers to deregister the pod before it terminates. Without this, you get 502 errors during deployments.
terminationGracePeriodSeconds: 60: Give in-flight requests time to complete.
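The two settings above combine in the pod template. A minimal sketch, with placeholder container name and image:

```yaml
# Illustrative pod template fragment showing graceful shutdown settings.
spec:
  terminationGracePeriodSeconds: 60   # give in-flight requests time to finish
  containers:
    - name: payment-service           # placeholder name
      image: payment-service:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]  # let the load balancer deregister first
```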
Pod Disruption Budgets
Protect services during node drains and cluster upgrades:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: team-payments-prod
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: payment-service
```
With 5 replicas and minAvailable: 3, only 2 pods can be down simultaneously during voluntary disruptions. This prevents cluster upgrades from taking down your service.
Cluster Upgrades and Maintenance
Upgrade Strategy
Enterprise clusters should follow a staged upgrade pattern:
Dev cluster: Upgrade immediately when new version is available
Staging cluster: Upgrade 1 week after dev (soak test)
Production cluster: Upgrade 2-3 weeks after staging
DR cluster: Upgrade after production is stable (1 week)
For EKS managed node groups, use a rolling update:
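With eksctl, a rolling update might look like the following (cluster and node-group names assume the config above; verify the target version against your upgrade plan):

```shell
# Upgrade the control plane first, then roll the managed node group.
eksctl upgrade cluster --name production-primary --approve
eksctl upgrade nodegroup \
  --cluster production-primary \
  --name workload \
  --kubernetes-version 1.29
```

eksctl cordons and drains nodes one at a time, so pod disruption budgets like the one above bound how much of a service goes down during the roll.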
Automate node rotation to ensure nodes don't drift:
```yaml
# Karpenter NodePool for automatic node management
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: workload
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h  # Rotate nodes every 30 days
  limits:
    cpu: 200
    memory: 800Gi
```
Disaster Recovery
Backup Strategy
Use Velero for cluster-level backups:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "team-*"
    excludedResources:
      - events
      - pods
    storageLocation: default
    ttl: 720h  # 30-day retention
    snapshotVolumes: true
```
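Backups are only as good as your last restore test. With the Velero CLI, a restore from the schedule above can be exercised like this (run against a non-production cluster first):

```shell
# List backups produced by the schedule, then restore
# from its most recent successful backup.
velero backup get
velero restore create --from-schedule daily-backup
velero restore get   # watch for Completed status
```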
Multi-Cluster Strategy
Enterprise deployments should run at least two clusters:
Active-active for critical services (traffic split across clusters)
Active-passive for stateful services (failover on primary failure)
Use an external load balancer (AWS Global Accelerator, Cloudflare) to route traffic between clusters, rather than an in-cluster service mesh.
Conclusion
Production Kubernetes for enterprise teams is a platform engineering discipline, not a deployment target. The patterns covered here — namespace isolation, pod security standards, network policies, observability, and upgrade procedures — represent the minimum bar for enterprise workloads. Skipping any of these creates operational debt that surfaces at the worst possible time.
The most common failure mode isn't a cluster going down. It's a gradual erosion of operational discipline: resource quotas not enforced, network policies not applied, alerts not actionable. Enterprise teams should treat their Kubernetes configuration as production code — reviewed, tested, and continuously validated. GitOps tools like Argo CD or Flux make this practical by ensuring the cluster state always matches what's in version control.
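As a sketch, a GitOps setup with Argo CD pins a namespace's manifests to a Git path (the repository URL and path below are placeholders):

```yaml
# Illustrative Argo CD Application syncing a team's manifests from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config   # placeholder repo
    targetRevision: main
    path: teams/payments/prod                             # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: team-payments-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # revert manual drift to match Git
```

With `selfHeal` enabled, manual `kubectl edit` changes are reverted automatically, which is exactly the discipline-erosion guard described above.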