DevOps

Kubernetes at Scale: Lessons from Production

Real-world lessons from migrating a production workload to Kubernetes, including architecture decisions, measurable results, and an honest retrospective.

Muneer Puthiya Purayil · 13 min read

In early 2024, we migrated a monolithic Node.js application serving 2.3 million daily active users to Kubernetes on AWS EKS. The migration took four months, involved zero downtime, and reduced infrastructure costs by 34%. This is an honest account of what worked, what didn't, and what we'd do differently.

Starting Point

The application was a B2B SaaS platform running on 12 EC2 instances behind an Application Load Balancer. Deployments were SSH-based scripts that took 45 minutes and required a dedicated engineer babysitting the process. Rollbacks meant re-deploying the previous version through the same 45-minute pipeline. The team had grown to 8 backend engineers, and deployment contention — where one team's deploy blocked another's — was costing roughly 6 hours per week in engineering time.

The infrastructure costs were $18,400/month on reserved instances sized for peak traffic. The instances ran at 25-30% average utilization because we provisioned for Black Friday-level loads year-round.

Architecture Decisions

Container Strategy

We chose to containerize the monolith first and decompose later. The alternative — breaking the monolith into microservices during migration — was rejected because it would have doubled the project timeline and introduced distributed systems complexity simultaneously with infrastructure changes.

```dockerfile
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
RUN addgroup -g 1001 -S appuser && adduser -S appuser -u 1001
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 8080
CMD ["node", "dist/server.js"]
```
The multi-stage build reduced image size from 1.2GB to 340MB. We later added a .dockerignore for test files and documentation, bringing it down to 290MB.
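The .dockerignore looked roughly like this (entries are illustrative, not our exact file):

```
# Illustrative .dockerignore — example entries, not the exact production file
node_modules
dist
coverage
*.test.js
docs/
.git
Dockerfile
.dockerignore
```

Excluding `node_modules` and `dist` from the build context also speeds up `docker build`, since the daemon no longer has to ship those directories to the builder.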

Cluster Configuration

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
  version: "1.28"

managedNodeGroups:
  - name: application
    instanceType: c6i.xlarge
    desiredCapacity: 6
    minSize: 4
    maxSize: 15
    volumeSize: 50
    labels:
      workload: application
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true

  - name: system
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    volumeSize: 30
    labels:
      workload: system
    taints:
      - key: CriticalAddonsOnly
        effect: NoSchedule
```
We separated system components (ingress controllers, monitoring, cert-manager) from application workloads using node groups with taints. This prevented the Prometheus monitoring stack from being evicted during application scaling events.
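For the taint to work, each system component needs a matching toleration and node selector in its pod spec. A sketch (field values follow the node group above; the surrounding manifest is assumed, not taken from ours):

```yaml
# Sketch: pod spec fragment for a system component (e.g. the ingress controller).
# The nodeSelector and toleration match the tainted "system" node group.
spec:
  nodeSelector:
    workload: system
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
      effect: NoSchedule
```

Without the toleration, system pods would be rejected by the tainted nodes; without the nodeSelector, they could still land on application nodes.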

The Migration

Phase 1: Shadow Traffic (Weeks 1-4)

We ran the Kubernetes deployment alongside the existing EC2 infrastructure, sending mirrored traffic to the K8s cluster using an Nginx mirror directive:

```nginx
location / {
    proxy_pass http://ec2-backend;
    mirror /mirror;
    mirror_request_body on;
}

location = /mirror {
    internal;
    proxy_pass http://k8s-backend$request_uri;
}
```

This revealed three critical issues:

  1. DNS resolution caching. The Node.js application cached DNS lookups indefinitely by default. In Kubernetes, service IPs change during rolling updates. We set dns.setDefaultResultOrder('ipv4first') and reduced the DNS cache TTL to 30 seconds.

  2. Health check timeouts. Our readiness probe initially used the main application endpoint, which ran database queries. Under load, the probe timed out, causing Kubernetes to remove pods from service rotation. We created a dedicated /healthz endpoint that only checked process responsiveness.

  3. Graceful shutdown. SIGTERM handling was missing. Without it, in-flight requests were dropped during rolling updates. The fix was straightforward:

```javascript
process.on('SIGTERM', () => {
  // Stop accepting new connections, let in-flight requests finish,
  // then close database connections before exiting cleanly
  server.close(() => {
    database.disconnect().then(() => {
      process.exit(0);
    });
  });

  // Safety net: force exit if draining takes longer than 30 seconds
  setTimeout(() => {
    process.exit(1);
  }, 30000);
});
```

Phase 2: Canary Migration (Weeks 5-8)

We used weighted target groups on the ALB to shift traffic gradually:

  • Week 5: 5% to Kubernetes
  • Week 6: 25% to Kubernetes
  • Week 7: 50% to Kubernetes
  • Week 8: 100% to Kubernetes
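The weekly shift above can be driven with a single AWS CLI call that rewrites the listener's forward weights. A hedged sketch (ARNs are placeholders; this assumes one default forward rule splitting traffic between the two target groups):

```shell
# Sketch: shift 25% of traffic to the Kubernetes target group.
# $LISTENER_ARN, $EC2_TG_ARN, and $K8S_TG_ARN are placeholders.
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions '[{
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        {"TargetGroupArn": "'"$EC2_TG_ARN"'", "Weight": 75},
        {"TargetGroupArn": "'"$K8S_TG_ARN"'", "Weight": 25}
      ]
    }
  }]'
```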

At each step, we monitored p50, p95, and p99 latency, error rates, and database connection pool utilization. The p99 latency on Kubernetes was actually 12ms lower than EC2, likely because the newer c6i instances had better network performance.

Phase 3: Decommission EC2 (Weeks 9-12)

We kept the EC2 instances running for four additional weeks as a rollback path. During this period, we set up the full production operational stack:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 6
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
```


Measurable Results

| Metric | Before (EC2) | After (EKS) | Change |
|---|---|---|---|
| Monthly infrastructure cost | $18,400 | $12,150 | -34% |
| Average deployment time | 45 min | 3 min | -93% |
| Rollback time | 45 min | 30 sec | -99% |
| Average CPU utilization | 28% | 62% | +121% |
| p99 latency | 145ms | 133ms | -8% |
| Deployment frequency | 2/week | 8/day | +28x |
| Deployment-related incidents | 3/month | 0.5/month | -83% |

The cost savings came primarily from two sources: right-sizing (the HPA maintained 60-65% CPU utilization vs the previous 28%) and spot instances for background workers (saving 68% on those nodes).
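The spot capacity for background workers was an additional eksctl node group. A sketch (instance types and sizes are illustrative, not our exact configuration):

```yaml
# Sketch: eksctl managed node group running background workers on Spot.
# Multiple instance types widen the Spot pools and reduce interruption risk.
managedNodeGroups:
  - name: workers-spot
    instanceTypes: ["c6i.xlarge", "c5.xlarge", "m6i.xlarge"]
    spot: true
    desiredCapacity: 3
    minSize: 0
    maxSize: 10
    labels:
      workload: background
    taints:
      - key: spot
        effect: NoSchedule
```

Tainting the Spot nodes keeps latency-sensitive API pods off them; only workers that tolerate interruption schedule there.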

What Went Wrong

Persistent volume migration was painful. We underestimated the complexity of migrating the application's file upload storage from local EBS volumes to S3. The application assumed a local filesystem, and the refactor to use S3 added three weeks to the timeline.

Monitoring gaps during migration. Our existing Datadog setup didn't integrate well with Kubernetes labels and annotations out of the box. We spent a week configuring auto-discovery and label-based dashboards. In retrospect, we should have set up the monitoring stack before starting the migration.

Ingress controller sizing. The default nginx ingress controller replicas (2) were insufficient for our traffic. During the 50% traffic shift, the ingress controllers hit CPU limits and started dropping connections. Scaling to 4 replicas with proper resource requests resolved this, but it caused a 15-minute incident.

Honest Retrospective

Would we use Kubernetes again for this? Yes, but the decision isn't obvious. ECS Fargate would have achieved similar results with less operational complexity. We chose Kubernetes because two team members had prior experience and because we anticipated needing the flexibility for the eventual microservices decomposition.

What we'd do differently:

  1. Migrate file storage to S3 before the Kubernetes migration, not during.
  2. Set up the complete monitoring stack as the first step.
  3. Use Karpenter from day one instead of Cluster Autoscaler — we switched later and it reduced scaling time from 4 minutes to 45 seconds.
  4. Run load tests on the Kubernetes setup before starting the canary migration.
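For reference on point 3, a Karpenter NodePool roughly equivalent to the application node group might look like this (field names follow the Karpenter v1 API; all values are illustrative, not our production manifest):

```yaml
# Sketch: Karpenter v1 NodePool replacing the "application" node group.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: application
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c6i.xlarge", "c6i.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "60"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Because Karpenter provisions instances directly instead of resizing an Auto Scaling group, pending pods get capacity in seconds rather than minutes — which is where the 4-minute-to-45-second improvement came from.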

What we got right:

  1. Containerizing the monolith without decomposing it.
  2. The shadow traffic phase caught three issues that would have caused production incidents.
  3. Keeping EC2 as a rollback path for four weeks gave the team confidence to proceed.
  4. Investing in GitOps (ArgoCD) from the start — it made the operational steady-state much simpler.

Conclusion

Migrating to Kubernetes is a significant undertaking, but for a team deploying multiple times per day with autoscaling needs, the investment pays off within months. The 34% cost reduction alone justified the four-month project, and the operational improvements — sub-minute rollbacks, 28x deployment frequency — transformed how the engineering team ships code.

The critical lesson is to treat the migration as an infrastructure project, not an application rewrite. Containerize what you have, validate it with shadow traffic, shift gradually, and keep a rollback path. Every team that tries to simultaneously migrate infrastructure and rewrite their application architecture ends up doing neither well.

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
