In early 2024, we migrated a monolithic Node.js application serving 2.3 million daily active users to Kubernetes on AWS EKS. The migration took four months, involved zero downtime, and reduced infrastructure costs by 34%. This is an honest account of what worked, what didn't, and what we'd do differently.
Starting Point
The application was a B2B SaaS platform running on 12 EC2 instances behind an Application Load Balancer. Deployments were SSH-based scripts that took 45 minutes and required a dedicated engineer babysitting the process. Rollbacks meant re-deploying the previous version through the same 45-minute pipeline. The team had grown to 8 backend engineers, and deployment contention — where one team's deploy blocked another's — was costing roughly 6 hours per week in engineering time.
The infrastructure costs were $18,400/month on reserved instances sized for peak traffic. The instances ran at 25-30% average utilization because we provisioned for Black Friday-level loads year-round.
Architecture Decisions
Container Strategy
We chose to containerize the monolith first and decompose later. The alternative — breaking the monolith into microservices during migration — was rejected because it would have doubled the project timeline and introduced distributed systems complexity simultaneously with infrastructure changes.
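A minimal sketch of the multi-stage Dockerfile pattern we used (base images, paths, and the build script name are illustrative, not our exact files):

```dockerfile
# Build stage: full dependency tree plus build tooling
FROM node:20-slim AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production dependencies and build output only
FROM node:20-slim
WORKDIR /app
ENV NODE_ENV=production
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node
CMD ["node", "dist/server.js"]
```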
The multi-stage build reduced image size from 1.2GB to 340MB. We later added a .dockerignore for test files and documentation, bringing it down to 290MB.
Cluster Configuration
We separated system components (ingress controllers, monitoring, cert-manager) from application workloads using node groups with taints. This prevented the monitoring Prometheus from being evicted during application scaling events.
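The shape of that setup, sketched as a pod-spec fragment for a system workload (the taint key, value, and label names here are illustrative, not our exact configuration):

```yaml
# Node group taint, applied when creating the system node group:
#   dedicated=system:NoSchedule
# System workloads (e.g. Prometheus) then carry a matching
# toleration and node selector in their pod spec:
tolerations:
  - key: dedicated
    value: system
    effect: NoSchedule
nodeSelector:
  node-group: system
```

Application pods carry neither the toleration nor the selector, so they can never land on (or be evicted from) the system nodes.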
The Migration
Phase 1: Shadow Traffic (Weeks 1-4)
We ran the Kubernetes deployment alongside the existing EC2 infrastructure, sending mirrored traffic to the K8s cluster using an Nginx mirror directive:
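A sketch of that mirroring configuration using nginx's `ngx_http_mirror_module` (upstream names and hosts are illustrative):

```nginx
upstream ec2_backend {
    server app-ec2.internal.example.com:3000;
}

upstream k8s_backend {
    server k8s-ingress.internal.example.com:80;
}

server {
    listen 80;

    location / {
        # Production traffic is served by EC2; a copy of each request
        # is mirrored to Kubernetes as a fire-and-forget subrequest.
        mirror /mirror;
        proxy_pass http://ec2_backend;
    }

    location = /mirror {
        internal;
        # The mirror subrequest reuses the original URI; its response
        # is discarded, so mirror-side errors never reach clients.
        proxy_pass http://k8s_backend$request_uri;
    }
}
```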
This revealed three critical issues:
- **DNS resolution caching.** The Node.js application cached DNS lookups indefinitely by default. In Kubernetes, service IPs change during rolling updates. We set `dns.setDefaultResultOrder('ipv4first')` and reduced the DNS cache TTL to 30 seconds.
- **Health check timeouts.** Our readiness probe initially used the main application endpoint, which ran database queries. Under load, the probe timed out, causing Kubernetes to remove pods from service rotation. We created a dedicated `/healthz` endpoint that only checked process responsiveness.
- **Graceful shutdown.** SIGTERM handling was missing. Without it, in-flight requests were dropped during rolling updates. The fix was straightforward:
Phase 2: Canary Migration (Weeks 5-8)
We used weighted target groups on the ALB to shift traffic gradually:
- Week 5: 5% to Kubernetes
- Week 6: 25% to Kubernetes
- Week 7: 50% to Kubernetes
- Week 8: 100% to Kubernetes
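Each shift was a one-line listener update. A sketch of the week-5 step using the AWS CLI (listener and target group ARNs are placeholders):

```bash
# Shift 5% of traffic to the Kubernetes target group,
# keeping 95% on the EC2 target group.
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[
      {"TargetGroupArn":"'"$EC2_TG_ARN"'","Weight":95},
      {"TargetGroupArn":"'"$K8S_TG_ARN"'","Weight":5}]}}]'
```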
At each step, we monitored p50, p95, and p99 latency, error rates, and database connection pool utilization. The p99 latency on Kubernetes was actually 12ms lower than EC2, likely because the newer c6i instances had better network performance.
Phase 3: Decommission EC2 (Weeks 9-12)
We kept the EC2 instances running for four additional weeks as a rollback path. During this period, we set up the full production operational stack: Datadog monitoring and alerting, horizontal pod autoscaling, and GitOps-based deployments with ArgoCD.
Measurable Results
| Metric | Before (EC2) | After (EKS) | Change |
|---|---|---|---|
| Monthly infrastructure cost | $18,400 | $12,150 | -34% |
| Average deployment time | 45 min | 3 min | -93% |
| Rollback time | 45 min | 30 sec | -99% |
| Average CPU utilization | 28% | 62% | +121% |
| p99 latency | 145ms | 133ms | -8% |
| Deployment frequency | 2/week | 8/day | +28x |
| Deployment-related incidents | 3/month | 0.5/month | -83% |
The cost savings came primarily from two sources: right-sizing (the HPA maintained 60-65% CPU utilization vs the previous 28%) and spot instances for background workers (saving 68% on those nodes).
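The HPA target was along these lines (replica bounds and the exact utilization target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-monolith
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-monolith
  minReplicas: 6
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Scale to hold average CPU near the mid-60s, vs the
          # ~28% utilization of the fixed-size EC2 fleet.
          averageUtilization: 65
```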
What Went Wrong
Persistent volume migration was painful. We underestimated the complexity of migrating the application's file upload storage from local EBS volumes to S3. The application assumed a local filesystem, and the refactor to use S3 added three weeks to the timeline.
Monitoring gaps during migration. Our existing Datadog setup didn't integrate well with Kubernetes labels and annotations out of the box. We spent a week configuring auto-discovery and label-based dashboards. In retrospect, we should have set up the monitoring stack before starting the migration.
Ingress controller sizing. The default nginx ingress controller replicas (2) were insufficient for our traffic. During the 50% traffic shift, the ingress controllers hit CPU limits and started dropping connections. Scaling to 4 replicas with proper resource requests resolved this, but it caused a 15-minute incident.
Honest Retrospective
Would we use Kubernetes again for this? Yes, but the decision isn't obvious. ECS Fargate would have achieved similar results with less operational complexity. We chose Kubernetes because two team members had prior experience and because we anticipated needing the flexibility for the eventual microservices decomposition.
What we'd do differently:
- Migrate file storage to S3 before the Kubernetes migration, not during.
- Set up the complete monitoring stack as the first step.
- Use Karpenter from day one instead of Cluster Autoscaler — we switched later and it reduced scaling time from 4 minutes to 45 seconds.
- Run load tests on the Kubernetes setup before starting the canary migration.
What we got right:
- Containerizing the monolith without decomposing it.
- The shadow traffic phase caught three issues that would have caused production incidents.
- Keeping EC2 as a rollback path for four weeks gave the team confidence to proceed.
- Investing in GitOps (ArgoCD) from the start — it made the operational steady-state much simpler.
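For context on the GitOps point, the steady state was one ArgoCD `Application` per service syncing from a manifests repo. A minimal sketch (repo URL, paths, and namespaces are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-monolith
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-manifests
    path: apps/api-monolith
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources removed from git
      selfHeal: true  # revert manual drift to match git
```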
Conclusion
Migrating to Kubernetes is a significant undertaking, but for a team deploying multiple times per day with autoscaling needs, the investment pays off within months. The 34% cost reduction alone justified the four-month project, and the operational improvements — sub-minute rollbacks, 28x deployment frequency — transformed how the engineering team ships code.
The critical lesson is to treat the migration as an infrastructure project, not an application rewrite. Containerize what you have, validate it with shadow traffic, shift gradually, and keep a rollback path. Every team that tries to simultaneously migrate infrastructure and rewrite their application architecture ends up doing neither well.