
Kubernetes Production Setup Best Practices for Enterprise Teams

Battle-tested practices for running Kubernetes in production, tailored to enterprise teams, including the anti-patterns to avoid along the way.

Muneer Puthiya Purayil · 10 min read


Running Kubernetes in production for enterprise workloads is fundamentally different from getting a cluster up and running. The gap between a working cluster and a production-grade platform spans security hardening, multi-tenancy, observability, compliance, and operational procedures that prevent 3 AM pages. Enterprise teams need battle-tested patterns, not just tutorials.

This guide covers the practices that separate production Kubernetes from demo Kubernetes — drawn from real-world deployments managing hundreds of services across regulated industries.

Cluster Architecture

Control Plane Configuration

Enterprise clusters should use managed control planes (EKS, GKE, AKS) unless you have a dedicated platform team of 5+ engineers. Self-managing etcd, the API server, and controller managers is a full-time job.

For EKS, configure the control plane for high availability:

```yaml
# eksctl cluster config
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: production-primary
  region: us-east-1
  version: "1.29"

iam:
  withOIDC: true

vpc:
  clusterEndpoints:
    privateAccess: true
    publicAccess: false # API server not exposed to internet

managedNodeGroups:
  - name: system
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 5
    desiredCapacity: 3
    labels:
      role: system
    taints:
      - key: CriticalAddonsOnly
        effect: NoSchedule
    privateNetworking: true
    volumeSize: 100
    volumeType: gp3
    tags:
      Team: platform
      Environment: production

  - name: workload
    instanceType: m6i.2xlarge
    minSize: 5
    maxSize: 50
    desiredCapacity: 10
    labels:
      role: workload
    privateNetworking: true
    volumeSize: 200
    volumeType: gp3

addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest
```

Key decisions:

  • Private API endpoint only. Public API endpoints are the #1 attack vector for enterprise clusters. Use a VPN or bastion for kubectl access.
  • Separate system and workload node groups. System components (ingress, monitoring, cert-manager) run on dedicated nodes with taints. Application workloads can't interfere with cluster operations.
  • gp3 volumes. Roughly 20% cheaper than gp2 per GB, with a 3,000 IOPS baseline regardless of volume size (gp2 scales at 3 IOPS/GB, with a 100 IOPS floor).

Node Pool Strategy

Enterprise clusters typically need 3-4 node pools:

| Node Pool | Instance Type | Purpose | Taint |
| --- | --- | --- | --- |
| system | m6i.xlarge | Ingress, monitoring, cert-manager | CriticalAddonsOnly |
| workload | m6i.2xlarge | Application services | None |
| compute | c6i.4xlarge | CPU-intensive batch jobs | workload-type=compute |
| gpu | g5.xlarge | ML inference | nvidia.com/gpu |
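A workload opts into one of the tainted pools with a nodeSelector plus a matching toleration. A minimal sketch for the compute pool, assuming those nodes carry a `role: compute` label alongside the `workload-type=compute` taint (both label and taint effect are assumptions here):

```yaml
# Sketch: pin a batch job to the compute pool (label and image are illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-reconciliation
spec:
  template:
    spec:
      nodeSelector:
        role: compute          # assumed label on compute-pool nodes
      tolerations:
        - key: workload-type
          operator: Equal
          value: compute
          effect: NoSchedule   # assumed effect for the pool taint
      containers:
        - name: reconcile
          image: registry.example.com/reconcile:latest
      restartPolicy: Never
```

Without the toleration the pod is repelled by the taint; without the nodeSelector it may still land on general workload nodes.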

Namespace and Multi-Tenancy

Namespace Strategy

Organize namespaces by team and environment, not by application:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod
  labels:
    team: payments
    environment: production
    cost-center: eng-payments
  annotations:
    contacts: "[email protected]"
```

Resource Quotas

Every namespace must have resource quotas. Without them, a single team's runaway deployment can starve the entire cluster:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments-prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "100"
    services: "20"
    services.loadbalancers: "5"
    persistentvolumeclaims: "20"
    secrets: "50"
    configmaps: "50"
```

Limit Ranges

Set defaults so that pods without explicit resource requests don't run unbounded:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments-prod
spec:
  limits:
    - default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: "50m"
        memory: 64Mi
      type: Container
```

Security Hardening

Pod Security Standards

Enforce pod security at the namespace level using built-in Pod Security Admission:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.29
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

The restricted policy requires:

  • No privileged containers and no privilege escalation
  • No host networking or host ports
  • Running as a non-root user
  • All Linux capabilities dropped
  • A seccomp profile set (RuntimeDefault or Localhost)
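For reference, a pod spec that clears the restricted bar looks roughly like this (a sketch; the image and user ID are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-worker
  namespace: team-payments-prod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: worker
      image: registry.example.com/payment-worker:latest
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true # extra hardening on top of the profile
```

With the enforce label set, pods missing any of these fields are rejected at admission time rather than at runtime.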

Network Policies

Default deny all traffic, then explicitly allow what's needed:

```yaml
# Default deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow DNS resolution
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Allow specific service communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-payments
  namespace: team-payments-prod
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: api
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
```

RBAC

Create role bindings per team with least-privilege access:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-developer
  namespace: team-payments-prod
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"] # Allow debug access
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"] # No create/update; secrets managed by platform team
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-devs
  namespace: team-payments-prod
subjects:
  - kind: Group
    name: "payments-developers"
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-developer
  apiGroup: rbac.authorization.k8s.io
```

Secrets Management

Never treat Kubernetes Secret objects as a secure store on their own: their data is base64-encoded, not encrypted, and anyone with read access to the namespace can decode it. Keep the source of truth in an external secret manager and sync values into the cluster with an external secrets operator.
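That is easy to verify locally: base64 is an encoding, and decoding it takes one command (no cluster required):

```shell
# A Kubernetes Secret's data is only base64-encoded; "decrypting" it is one pipe
encoded=$(printf 's3cr3t-password' | base64)
echo "as stored in the Secret: $encoded"
printf '%s' "$encoded" | base64 -d   # prints: s3cr3t-password
```

An external operator keeps the real values outside the cluster; the ExternalSecret below syncs them from AWS Secrets Manager on a schedule: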

```yaml
# External Secrets Operator with AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-api-credentials
  namespace: team-payments-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: payment-api-credentials
    creationPolicy: Owner
  data:
    - secretKey: api-key
      remoteRef:
        key: production/payments/api-key
    - secretKey: db-password
      remoteRef:
        key: production/payments/db-password
```
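The operator materializes an ordinary Secret named `payment-api-credentials` in the namespace, so workloads consume it the usual way. A Deployment fragment as a sketch:

```yaml
# Deployment container fragment: expose the synced secret keys as env vars
containers:
  - name: payment-service
    envFrom:
      - secretRef:
          name: payment-api-credentials # api-key and db-password become env vars
```

Rotation then happens in Secrets Manager; the operator picks up new values on the next refresh interval.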

Observability

The Three Pillars

Enterprise Kubernetes requires metrics, logs, and traces working together.

Metrics with Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  namespace: team-payments-prod
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
```

Alerting rules:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-alerts
  namespace: team-payments-prod
spec:
  groups:
    - name: payment-service
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="team-payments-prod", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="team-payments-prod"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment service error rate above 1%"
            runbook: "https://runbooks.example.com/payment-errors"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total{namespace="team-payments-prod"}[15m]) > 0
          for: 15m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"

        - alert: HighLatencyP99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{namespace="team-payments-prod"}[5m])) by (le)
            ) > 2
          for: 10m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "P99 latency exceeds 2 seconds"
```

Structured Logging

Enforce JSON logging across all services. This makes log aggregation and querying practical at scale:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    spec:
      containers:
        - name: payment-service
          env:
            - name: LOG_FORMAT
              value: "json"
            - name: LOG_LEVEL
              value: "info"
            - name: OTEL_SERVICE_NAME
              value: "payment-service"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.monitoring:4317"
```
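The payoff shows up in day-to-day debugging: JSON entries can be filtered by field instead of by fragile regexes. A quick illustration with jq (the field names here are illustrative, not a required schema):

```shell
# JSON logs can be queried by field rather than pattern-matched as text
log='{"level":"error","service":"payment-service","msg":"card declined","trace_id":"abc123"}'
echo "$log" | jq -r 'select(.level == "error") | .trace_id'   # prints: abc123
```

The same query works unchanged in Loki, CloudWatch Logs Insights, or any JSON-aware aggregator.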


Deployment Strategies

Rolling Updates with Safety

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: team-payments-prod
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v2.3.1
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2.3.1
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          resources:
            requests:
              cpu: "250m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 30 # 150 seconds max startup time
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"] # Allow LB to drain
```

Key points:

  • Three probes: Startup (for slow-starting apps), readiness (for traffic routing), liveness (for stuck processes).
  • preStop hook: The 10-second sleep allows load balancers to deregister the pod before it terminates. Without this, you get 502 errors during deployments.
  • terminationGracePeriodSeconds: 60: Give in-flight requests time to complete.

Pod Disruption Budgets

Protect services during node drains and cluster upgrades:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: team-payments-prod
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: payment-service
```

With 5 replicas and minAvailable: 3, only 2 pods can be down simultaneously during voluntary disruptions. This prevents cluster upgrades from taking down your service.

Cluster Upgrades and Maintenance

Upgrade Strategy

Enterprise clusters should follow a staged upgrade pattern:

  1. Dev cluster: Upgrade immediately when new version is available
  2. Staging cluster: Upgrade 1 week after dev (soak test)
  3. Production cluster: Upgrade 2-3 weeks after staging
  4. DR cluster: Upgrade after production is stable (1 week)

For EKS managed node groups, use a rolling update:

```bash
# Update control plane first
aws eks update-cluster-version \
  --name production-primary \
  --kubernetes-version 1.30

# Wait for control plane update to complete
aws eks wait cluster-active --name production-primary

# Update node groups one at a time
aws eks update-nodegroup-version \
  --cluster-name production-primary \
  --nodegroup-name system \
  --kubernetes-version 1.30

# Wait, then update workload nodes
aws eks update-nodegroup-version \
  --cluster-name production-primary \
  --nodegroup-name workload \
  --kubernetes-version 1.30
```

Node Maintenance

Automate node rotation to ensure nodes don't drift:

```yaml
# Karpenter NodePool for automatic node management
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: workload
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # Rotate nodes every 30 days
  limits:
    cpu: 200
    memory: 800Gi
```

Disaster Recovery

Backup Strategy

Use Velero for cluster-level backups:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2 AM daily
  template:
    includedNamespaces:
      - "team-*"
    excludedResources:
      - events
      - pods
    storageLocation: default
    ttl: 720h # 30-day retention
    snapshotVolumes: true
```

Multi-Cluster Strategy

Enterprise deployments should run at least two clusters:

  • Active-active for critical services (traffic split across clusters)
  • Active-passive for stateful services (failover on primary failure)

Use an external load balancer (AWS Global Accelerator, Cloudflare) to route between clusters, not in-cluster service mesh.

Conclusion

Production Kubernetes for enterprise teams is a platform engineering discipline, not a deployment target. The patterns covered here — namespace isolation, pod security standards, network policies, observability, and upgrade procedures — represent the minimum bar for enterprise workloads. Skipping any of these creates operational debt that surfaces at the worst possible time.

The most common failure mode isn't a cluster going down. It's a gradual erosion of operational discipline: resource quotas not enforced, network policies not applied, alerts not actionable. Enterprise teams should treat their Kubernetes configuration as production code — reviewed, tested, and continuously validated. GitOps tools like Argo CD or Flux make this practical by ensuring the cluster state always matches what's in version control.
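As a sketch of what that looks like with Argo CD, an Application that continuously reconciles a namespace against Git (the repo URL and path are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-payments-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-config.git # illustrative repo
    targetRevision: main
    path: teams/payments/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: team-payments-prod
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift in the cluster
```

With selfHeal enabled, a quota or network policy deleted by hand is restored automatically, which is exactly the erosion-of-discipline failure mode described above.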


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
