
Kubernetes Production Setup Best Practices for Enterprise Teams

Battle-tested practices for running Kubernetes in production, tailored to enterprise teams, including the anti-patterns to avoid along the way.

Muneer Puthiya Purayil · 10 min read


Running Kubernetes in production for enterprise workloads is fundamentally different from getting a cluster up and running. The gap between a working cluster and a production-grade platform spans security hardening, multi-tenancy, observability, compliance, and operational procedures that prevent 3 AM pages. Enterprise teams need battle-tested patterns, not just tutorials.

This guide covers the practices that separate production Kubernetes from demo Kubernetes — drawn from real-world deployments managing hundreds of services across regulated industries.

Cluster Architecture

Control Plane Configuration

Enterprise clusters should use managed control planes (EKS, GKE, AKS) unless you have a dedicated platform team of 5+ engineers. Self-managing etcd, the API server, and controller managers is a full-time job.

For EKS, configure the control plane for high availability:

```yaml
# eksctl cluster config
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: production-primary
  region: us-east-1
  version: "1.29"

iam:
  withOIDC: true

vpc:
  clusterEndpoints:
    privateAccess: true
    publicAccess: false # API server not exposed to internet

managedNodeGroups:
  - name: system
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 5
    desiredCapacity: 3
    labels:
      role: system
    taints:
      - key: CriticalAddonsOnly
        effect: NoSchedule
    privateNetworking: true
    volumeSize: 100
    volumeType: gp3
    tags:
      Team: platform
      Environment: production

  - name: workload
    instanceType: m6i.2xlarge
    minSize: 5
    maxSize: 50
    desiredCapacity: 10
    labels:
      role: workload
    privateNetworking: true
    volumeSize: 200
    volumeType: gp3

addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest
```

Key decisions:

  • Private API endpoint only. Public API endpoints are the #1 attack vector for enterprise clusters. Use a VPN or bastion for kubectl access.
  • Separate system and workload node groups. System components (ingress, monitoring, cert-manager) run on dedicated nodes with taints. Application workloads can't interfere with cluster operations.
  • gp3 volumes. Roughly 20% cheaper than gp2 per GB, with a 3,000 IOPS baseline regardless of volume size (gp2 scales at 3 IOPS/GB, with a 100 IOPS floor).

Node Pool Strategy

Enterprise clusters typically need 3-4 node pools:

| Node Pool | Instance Type | Purpose | Taint |
| --- | --- | --- | --- |
| system | m6i.xlarge | Ingress, monitoring, cert-manager | CriticalAddonsOnly |
| workload | m6i.2xlarge | Application services | None |
| compute | c6i.4xlarge | CPU-intensive batch jobs | workload-type=compute |
| gpu | g5.xlarge | ML inference | nvidia.com/gpu |
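A workload opts into one of the tainted pools with a nodeSelector plus a matching toleration. A minimal sketch for the compute pool, assuming those nodes carry a `role: compute` label alongside the `workload-type=compute` taint (both label and taint effect are assumptions here):

```yaml
# Sketch: pin a batch job to the compute pool (label and image are illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-reconciliation
spec:
  template:
    spec:
      nodeSelector:
        role: compute          # assumed label on compute-pool nodes
      tolerations:
        - key: workload-type
          operator: Equal
          value: compute
          effect: NoSchedule   # assumed effect for the pool taint
      containers:
        - name: reconcile
          image: registry.example.com/reconcile:latest
      restartPolicy: Never
```

Without the toleration the pod is repelled by the taint; without the nodeSelector it may still land on general workload nodes.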

Namespace and Multi-Tenancy

Namespace Strategy

Organize namespaces by team and environment, not by application:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod
  labels:
    team: payments
    environment: production
    cost-center: eng-payments
  annotations:
    contacts: "[email protected]"
```

Resource Quotas

Every namespace must have resource quotas. Without them, a single team's runaway deployment can starve the entire cluster:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments-prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "100"
    services: "20"
    services.loadbalancers: "5"
    persistentvolumeclaims: "20"
    secrets: "50"
    configmaps: "50"
```

Limit Ranges

Set defaults so that pods without explicit resource requests don't run unbounded:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments-prod
spec:
  limits:
    - default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: "50m"
        memory: 64Mi
      type: Container
```

Security Hardening

Pod Security Standards

Enforce pod security at the namespace level using built-in Pod Security Admission:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.29
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

The restricted policy requires:

  • No privileged containers and no privilege escalation
  • No host networking or host ports
  • Running as a non-root user
  • All Linux capabilities dropped
  • A seccomp profile set (RuntimeDefault or Localhost)
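For reference, a pod spec that clears the restricted bar looks roughly like this (a sketch; the image and user ID are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-worker
  namespace: team-payments-prod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: worker
      image: registry.example.com/payment-worker:latest
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true # extra hardening on top of the profile
```

With the enforce label set, pods missing any of these fields are rejected at admission time rather than at runtime.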

Network Policies

Default deny all traffic, then explicitly allow what's needed:

```yaml
# Default deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow DNS resolution
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Allow specific service communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-payments
  namespace: team-payments-prod
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: api
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
```

RBAC

Create role bindings per team with least-privilege access:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-developer
  namespace: team-payments-prod
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"] # Allow debug access
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"] # No create/update; secrets managed by platform team
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-devs
  namespace: team-payments-prod
subjects:
  - kind: Group
    name: "payments-developers"
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-developer
  apiGroup: rbac.authorization.k8s.io
```

Secrets Management

Never treat Kubernetes Secret objects as a secure store on their own: their data is base64-encoded, not encrypted, and anyone with read access to the namespace can decode it. Keep the source of truth in an external secret manager and sync values into the cluster with an external secrets operator.
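That is easy to verify locally: base64 is an encoding, and decoding it takes one command (no cluster required):

```shell
# A Kubernetes Secret's data is only base64-encoded; "decrypting" it is one pipe
encoded=$(printf 's3cr3t-password' | base64)
echo "as stored in the Secret: $encoded"
printf '%s' "$encoded" | base64 -d   # prints: s3cr3t-password
```

An external operator keeps the real values outside the cluster; the ExternalSecret below syncs them from AWS Secrets Manager on a schedule: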

```yaml
# External Secrets Operator with AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-api-credentials
  namespace: team-payments-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: payment-api-credentials
    creationPolicy: Owner
  data:
    - secretKey: api-key
      remoteRef:
        key: production/payments/api-key
    - secretKey: db-password
      remoteRef:
        key: production/payments/db-password
```
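The operator materializes an ordinary Secret named `payment-api-credentials` in the namespace, so workloads consume it the usual way. A Deployment fragment as a sketch:

```yaml
# Deployment container fragment: expose the synced secret keys as env vars
containers:
  - name: payment-service
    envFrom:
      - secretRef:
          name: payment-api-credentials # api-key and db-password become env vars
```

Rotation then happens in Secrets Manager; the operator picks up new values on the next refresh interval.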

Observability

The Three Pillars

Enterprise Kubernetes requires metrics, logs, and traces working together.

Metrics with Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  namespace: team-payments-prod
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
```

Alerting rules:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-alerts
  namespace: team-payments-prod
spec:
  groups:
    - name: payment-service
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="team-payments-prod", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="team-payments-prod"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment service error rate above 1%"
            runbook: "https://runbooks.example.com/payment-errors"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total{namespace="team-payments-prod"}[15m]) > 0
          for: 15m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"

        - alert: HighLatencyP99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{namespace="team-payments-prod"}[5m])) by (le)
            ) > 2
          for: 10m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "P99 latency exceeds 2 seconds"
```

Structured Logging

Enforce JSON logging across all services. This makes log aggregation and querying practical at scale:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    spec:
      containers:
        - name: payment-service
          env:
            - name: LOG_FORMAT
              value: "json"
            - name: LOG_LEVEL
              value: "info"
            - name: OTEL_SERVICE_NAME
              value: "payment-service"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.monitoring:4317"
```
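The payoff shows up in day-to-day debugging: JSON entries can be filtered by field instead of by fragile regexes. A quick illustration with jq (the field names here are illustrative, not a required schema):

```shell
# JSON logs can be queried by field rather than pattern-matched as text
log='{"level":"error","service":"payment-service","msg":"card declined","trace_id":"abc123"}'
echo "$log" | jq -r 'select(.level == "error") | .trace_id'   # prints: abc123
```

The same query works unchanged in Loki, CloudWatch Logs Insights, or any JSON-aware aggregator.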


Deployment Strategies

Rolling Updates with Safety

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: team-payments-prod
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v2.3.1
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2.3.1
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          resources:
            requests:
              cpu: "250m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 30 # 150 seconds max startup time
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"] # Allow LB to drain
```

Key points:

  • Three probes: Startup (for slow-starting apps), readiness (for traffic routing), liveness (for stuck processes).
  • preStop hook: The 10-second sleep allows load balancers to deregister the pod before it terminates. Without this, you get 502 errors during deployments.
  • terminationGracePeriodSeconds: 60: Give in-flight requests time to complete.

Pod Disruption Budgets

Protect services during node drains and cluster upgrades:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: team-payments-prod
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: payment-service
```

With 5 replicas and minAvailable: 3, only 2 pods can be down simultaneously during voluntary disruptions. This prevents cluster upgrades from taking down your service.

Cluster Upgrades and Maintenance

Upgrade Strategy

Enterprise clusters should follow a staged upgrade pattern:

  1. Dev cluster: Upgrade immediately when new version is available
  2. Staging cluster: Upgrade 1 week after dev (soak test)
  3. Production cluster: Upgrade 2-3 weeks after staging
  4. DR cluster: Upgrade after production is stable (1 week)

For EKS managed node groups, use a rolling update:

```bash
# Update control plane first
aws eks update-cluster-version \
  --name production-primary \
  --kubernetes-version 1.30

# Wait for control plane update to complete
aws eks wait cluster-active --name production-primary

# Update node groups one at a time
aws eks update-nodegroup-version \
  --cluster-name production-primary \
  --nodegroup-name system \
  --kubernetes-version 1.30

# Wait, then update workload nodes
aws eks update-nodegroup-version \
  --cluster-name production-primary \
  --nodegroup-name workload \
  --kubernetes-version 1.30
```

Node Maintenance

Automate node rotation to ensure nodes don't drift:

```yaml
# Karpenter NodePool for automatic node management
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: workload
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # Rotate nodes every 30 days
  limits:
    cpu: 200
    memory: 800Gi
```

Disaster Recovery

Backup Strategy

Use Velero for cluster-level backups:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2 AM daily
  template:
    includedNamespaces:
      - "team-*"
    excludedResources:
      - events
      - pods
    storageLocation: default
    ttl: 720h # 30-day retention
    snapshotVolumes: true
```

Multi-Cluster Strategy

Enterprise deployments should run at least two clusters:

  • Active-active for critical services (traffic split across clusters)
  • Active-passive for stateful services (failover on primary failure)

Use an external load balancer (AWS Global Accelerator, Cloudflare) to route between clusters, not in-cluster service mesh.

Conclusion

Production Kubernetes for enterprise teams is a platform engineering discipline, not a deployment target. The patterns covered here — namespace isolation, pod security standards, network policies, observability, and upgrade procedures — represent the minimum bar for enterprise workloads. Skipping any of these creates operational debt that surfaces at the worst possible time.

The most common failure mode isn't a cluster going down. It's a gradual erosion of operational discipline: resource quotas not enforced, network policies not applied, alerts not actionable. Enterprise teams should treat their Kubernetes configuration as production code — reviewed, tested, and continuously validated. GitOps tools like Argo CD or Flux make this practical by ensuring the cluster state always matches what's in version control.
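As a sketch of what that looks like with Argo CD, an Application that continuously reconciles a namespace against Git (the repo URL and path are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-payments-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-config.git # illustrative repo
    targetRevision: main
    path: teams/payments/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: team-payments-prod
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift in the cluster
```

With selfHeal enabled, a quota or network policy deleted by hand is restored automatically, which is exactly the erosion-of-discipline failure mode described above.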


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
