DevOps

Kubernetes at Scale: Lessons from Production

Real-world lessons from migrating a production workload to Kubernetes, including architecture decisions, measurable results, and an honest retrospective.

Muneer Puthiya Purayil · 13 min read

In early 2024, we migrated a monolithic Node.js application serving 2.3 million daily active users to Kubernetes on AWS EKS. The migration took four months, involved zero downtime, and reduced infrastructure costs by 34%. This is an honest account of what worked, what didn't, and what we'd do differently.

Starting Point

The application was a B2B SaaS platform running on 12 EC2 instances behind an Application Load Balancer. Deployments were SSH-based scripts that took 45 minutes and required a dedicated engineer babysitting the process. Rollbacks meant re-deploying the previous version through the same 45-minute pipeline. The team had grown to 8 backend engineers, and deployment contention — where one team's deploy blocked another's — was costing roughly 6 hours per week in engineering time.

The infrastructure costs were $18,400/month on reserved instances sized for peak traffic. The instances ran at 25-30% average utilization because we provisioned for Black Friday-level loads year-round.

Architecture Decisions

Container Strategy

We chose to containerize the monolith first and decompose later. The alternative — breaking the monolith into microservices during migration — was rejected because it would have doubled the project timeline and introduced distributed systems complexity simultaneously with infrastructure changes.

```dockerfile
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
RUN addgroup -g 1001 -S appuser && adduser -S appuser -u 1001
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 8080
CMD ["node", "dist/server.js"]
```
The multi-stage build reduced image size from 1.2GB to 340MB. We later added a .dockerignore for test files and documentation, bringing it down to 290MB.
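The .dockerignore looked roughly like this (entries are illustrative, not our exact file):

```
# Illustrative .dockerignore — example entries, not the exact production file
node_modules
dist
coverage
*.test.js
docs/
.git
Dockerfile
.dockerignore
```

Excluding `node_modules` and `dist` from the build context also speeds up `docker build`, since the daemon no longer has to ship those directories to the builder.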

Cluster Configuration

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
  version: "1.28"

managedNodeGroups:
  - name: application
    instanceType: c6i.xlarge
    desiredCapacity: 6
    minSize: 4
    maxSize: 15
    volumeSize: 50
    labels:
      workload: application
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true

  - name: system
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    volumeSize: 30
    labels:
      workload: system
    taints:
      - key: CriticalAddonsOnly
        effect: NoSchedule
```
We separated system components (ingress controllers, monitoring, cert-manager) from application workloads using node groups with taints. This prevented the Prometheus monitoring stack from being evicted during application scaling events.
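For the taint to work, each system component needs a matching toleration and node selector in its pod spec. A sketch (field values follow the node group above; the surrounding manifest is assumed, not taken from ours):

```yaml
# Sketch: pod spec fragment for a system component (e.g. the ingress controller).
# The nodeSelector and toleration match the tainted "system" node group.
spec:
  nodeSelector:
    workload: system
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
      effect: NoSchedule
```

Without the toleration, system pods would be rejected by the tainted nodes; without the nodeSelector, they could still land on application nodes.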

The Migration

Phase 1: Shadow Traffic (Weeks 1-4)

We ran the Kubernetes deployment alongside the existing EC2 infrastructure, sending mirrored traffic to the K8s cluster using an Nginx mirror directive:

```nginx
location / {
    proxy_pass http://ec2-backend;
    mirror /mirror;
    mirror_request_body on;
}

location = /mirror {
    internal;
    proxy_pass http://k8s-backend$request_uri;
}
```

This revealed three critical issues:

  1. DNS resolution caching. The Node.js application cached DNS lookups indefinitely by default. In Kubernetes, service IPs change during rolling updates. We set dns.setDefaultResultOrder('ipv4first') and reduced the DNS cache TTL to 30 seconds.

  2. Health check timeouts. Our readiness probe initially used the main application endpoint, which ran database queries. Under load, the probe timed out, causing Kubernetes to remove pods from service rotation. We created a dedicated /healthz endpoint that only checked process responsiveness.

  3. Graceful shutdown. SIGTERM handling was missing. Without it, in-flight requests were dropped during rolling updates. The fix was straightforward:

```javascript
process.on('SIGTERM', () => {
  // Stop accepting new connections, let in-flight requests finish,
  // then close database connections before exiting cleanly
  server.close(() => {
    database.disconnect().then(() => {
      process.exit(0);
    });
  });

  // Safety net: force exit if draining takes longer than 30 seconds
  setTimeout(() => {
    process.exit(1);
  }, 30000);
});
```

Phase 2: Canary Migration (Weeks 5-8)

We used weighted target groups on the ALB to shift traffic gradually:

  • Week 5: 5% to Kubernetes
  • Week 6: 25% to Kubernetes
  • Week 7: 50% to Kubernetes
  • Week 8: 100% to Kubernetes
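The weekly shift above can be driven with a single AWS CLI call that rewrites the listener's forward weights. A hedged sketch (ARNs are placeholders; this assumes one default forward rule splitting traffic between the two target groups):

```shell
# Sketch: shift 25% of traffic to the Kubernetes target group.
# $LISTENER_ARN, $EC2_TG_ARN, and $K8S_TG_ARN are placeholders.
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions '[{
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        {"TargetGroupArn": "'"$EC2_TG_ARN"'", "Weight": 75},
        {"TargetGroupArn": "'"$K8S_TG_ARN"'", "Weight": 25}
      ]
    }
  }]'
```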

At each step, we monitored p50, p95, and p99 latency, error rates, and database connection pool utilization. The p99 latency on Kubernetes was actually 12ms lower than EC2, likely because the newer c6i instances had better network performance.

Phase 3: Decommission EC2 (Weeks 9-12)

We kept the EC2 instances running for four additional weeks as a rollback path. During this period, we set up the full production operational stack:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 6
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
```


Measurable Results

| Metric | Before (EC2) | After (EKS) | Change |
|---|---|---|---|
| Monthly infrastructure cost | $18,400 | $12,150 | -34% |
| Average deployment time | 45 min | 3 min | -93% |
| Rollback time | 45 min | 30 sec | -99% |
| Average CPU utilization | 28% | 62% | +121% |
| p99 latency | 145ms | 133ms | -8% |
| Deployment frequency | 2/week | 8/day | +28x |
| Deployment-related incidents | 3/month | 0.5/month | -83% |

The cost savings came primarily from two sources: right-sizing (the HPA maintained 60-65% CPU utilization vs the previous 28%) and spot instances for background workers (saving 68% on those nodes).
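The spot capacity for background workers was an additional eksctl node group. A sketch (instance types and sizes are illustrative, not our exact configuration):

```yaml
# Sketch: eksctl managed node group running background workers on Spot.
# Multiple instance types widen the Spot pools and reduce interruption risk.
managedNodeGroups:
  - name: workers-spot
    instanceTypes: ["c6i.xlarge", "c5.xlarge", "m6i.xlarge"]
    spot: true
    desiredCapacity: 3
    minSize: 0
    maxSize: 10
    labels:
      workload: background
    taints:
      - key: spot
        effect: NoSchedule
```

Tainting the Spot nodes keeps latency-sensitive API pods off them; only workers that tolerate interruption schedule there.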

What Went Wrong

Persistent volume migration was painful. We underestimated the complexity of migrating the application's file upload storage from local EBS volumes to S3. The application assumed a local filesystem, and the refactor to use S3 added three weeks to the timeline.

Monitoring gaps during migration. Our existing Datadog setup didn't integrate well with Kubernetes labels and annotations out of the box. We spent a week configuring auto-discovery and label-based dashboards. In retrospect, we should have set up the monitoring stack before starting the migration.

Ingress controller sizing. The default nginx ingress controller replicas (2) were insufficient for our traffic. During the 50% traffic shift, the ingress controllers hit CPU limits and started dropping connections. Scaling to 4 replicas with proper resource requests resolved this, but it caused a 15-minute incident.

Honest Retrospective

Would we use Kubernetes again for this? Yes, but the decision isn't obvious. ECS Fargate would have achieved similar results with less operational complexity. We chose Kubernetes because two team members had prior experience and because we anticipated needing the flexibility for the eventual microservices decomposition.

What we'd do differently:

  1. Migrate file storage to S3 before the Kubernetes migration, not during.
  2. Set up the complete monitoring stack as the first step.
  3. Use Karpenter from day one instead of Cluster Autoscaler — we switched later and it reduced scaling time from 4 minutes to 45 seconds.
  4. Run load tests on the Kubernetes setup before starting the canary migration.
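For reference on point 3, a Karpenter NodePool roughly equivalent to the application node group might look like this (field names follow the Karpenter v1 API; all values are illustrative, not our production manifest):

```yaml
# Sketch: Karpenter v1 NodePool replacing the "application" node group.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: application
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c6i.xlarge", "c6i.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "60"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Because Karpenter provisions instances directly instead of resizing an Auto Scaling group, pending pods get capacity in seconds rather than minutes — which is where the 4-minute-to-45-second improvement came from.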

What we got right:

  1. Containerizing the monolith without decomposing it.
  2. The shadow traffic phase caught three issues that would have caused production incidents.
  3. Keeping EC2 as a rollback path for four weeks gave the team confidence to proceed.
  4. Investing in GitOps (ArgoCD) from the start — it made the operational steady-state much simpler.

Conclusion

Migrating to Kubernetes is a significant undertaking, but for a team deploying multiple times per day with autoscaling needs, the investment pays off within months. The 34% cost reduction alone justified the four-month project, and the operational improvements — sub-minute rollbacks, 28x deployment frequency — transformed how the engineering team ships code.

The critical lesson is to treat the migration as an infrastructure project, not an application rewrite. Containerize what you have, validate it with shadow traffic, shift gradually, and keep a rollback path. Every team that tries to simultaneously migrate infrastructure and rewrite their application architecture ends up doing neither well.

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
