In early 2024, our team at a B2B SaaS company serving 2,000+ enterprise customers made a commitment: zero customer-visible downtime during deployments. We were deploying 8-12 times per week, and each deploy involved a 15-30 second window where some requests failed. Over a month, that added up to 5-10 minutes of degraded service — not enough to breach our 99.95% SLA, but enough that our largest customers noticed and complained.
This is the story of how we eliminated deployment-related downtime completely, including the mistakes we made along the way.
The Starting Point
Our stack: 4 NestJS API services, 2 background worker services, PostgreSQL, Redis, and a Next.js frontend. Everything ran on AWS ECS Fargate across two availability zones. Total traffic: ~3,000 requests per second at peak.
The problems with our existing deployment process:
- ECS task replacements dropped connections — old tasks were killed before connections drained
- Database migrations ran inline during deployment, locking tables for 2-5 seconds
- Redis cache invalidation happened all at once, causing cache stampedes
- No health check sophistication — a 200 from /health didn't mean the service was ready to handle traffic
Architecture Changes
Change 1: Proper Connection Draining
Our first fix was configuring ECS task draining correctly. Before:
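A representative task definition fragment for this state (the values here are illustrative, not our exact config) gave the container only a few seconds to finish in-flight work before ECS sent SIGKILL, while the ALB target group's deregistration delay sat at its 300-second default:

```json
{
  "family": "api",
  "containerDefinitions": [
    {
      "name": "api",
      "stopTimeout": 10
    }
  ]
}
```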
After:
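A sketch of the corrected fragment (again, values are illustrative): raise stopTimeout well above the ALB drain window, and lower the target group's deregistration_delay.timeout_seconds (e.g. to 30) so old tasks stay alive until the load balancer has finished draining them:

```json
{
  "family": "api",
  "containerDefinitions": [
    {
      "name": "api",
      "stopTimeout": 120
    }
  ]
}
```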
We also added a graceful shutdown handler to our NestJS services:
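In NestJS this hooks into app.enableShutdownHooks(); a framework-agnostic sketch of the pattern (class names and timing values are illustrative, not our exact handler) looks like this:

```typescript
import { createServer } from "node:http";

// Hypothetical drain coordinator: once SIGTERM arrives, the health
// endpoint starts failing so the ALB stops routing new requests here,
// while in-flight requests are allowed to finish normally.
class DrainState {
  private draining = false;
  beginDrain(): void {
    this.draining = true;
  }
  isHealthy(): boolean {
    return !this.draining;
  }
}

const state = new DrainState();

const server = createServer((req, res) => {
  if (req.url === "/health") {
    res.statusCode = state.isHealthy() ? 200 : 503;
    return res.end();
  }
  res.end("ok");
});

process.on("SIGTERM", () => {
  state.beginDrain(); // 1. Start failing health checks.
  // 2. Wait for the load balancer to notice the failing checks. The 15s
  //    here is an assumption; it should exceed the health check interval
  //    multiplied by the unhealthy threshold.
  setTimeout(() => {
    // 3. Stop accepting new connections; existing ones drain, then exit.
    server.close(() => process.exit(0));
  }, 15_000);
});
```

The delay before server.close() matters: if you close immediately, the ALB may still route a few requests to a socket that refuses them.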
Impact: Connection drops during deployment went from ~50 per deploy to zero.
Change 2: Separating Database Migrations
We had been running Prisma migrations as part of the ECS task definition's startup command. This meant every new task ran prisma migrate deploy before starting the application, which occasionally locked tables.
New approach: migrations run in a separate CI/CD step before the application deployment.
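Sketched as a pipeline fragment (step names and commands are assumptions, not our actual pipeline), the ordering is the important part: migrate first, deploy second.

```yaml
steps:
  - name: Run database migrations
    run: npx prisma migrate deploy
  - name: Deploy application
    run: aws ecs update-service --cluster prod --service api --force-new-deployment
```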
Crucially, we enforced that all migrations must be backward compatible:
Column removals and renames became two-deploy operations. Deploy 1 stops reading from the old column. Deploy 2 drops the column.
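As an illustration (table and column names are invented, not from our schema), dropping a column splits like this:

```sql
-- Deploy 1: ship application code that no longer reads legacy_status.
-- If the column is being replaced, add the new one alongside the old:
ALTER TABLE invoices ADD COLUMN status TEXT;

-- Deploy 2: only after Deploy 1 is fully rolled out everywhere.
ALTER TABLE invoices DROP COLUMN legacy_status;
```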
Impact: Database-related deployment delays went from 2-5 seconds per deploy to zero.
Change 3: Cache Warming Strategy
Our biggest remaining issue: when all tasks were replaced simultaneously, every new task started with cold caches. At 3,000 RPS, this meant 3,000 cache misses hitting PostgreSQL in the first few seconds. Response times spiked from 40ms to 800ms until caches warmed.
Solution: staggered task replacement with cache pre-warming.
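A minimal sketch of the pre-warming step (the loader map and in-memory cache are illustrative; the real system used Redis): each new task loads its hottest keys before it starts passing health checks, so the first real requests hit a warm cache instead of PostgreSQL.

```typescript
// Hypothetical cache warmer. Each loader fetches one hot dataset from the
// database; warming runs to completion before the task reports healthy.
type Loader = () => Promise<unknown>;

async function warmCache(
  loaders: Map<string, Loader>,
  cache: Map<string, unknown>,
): Promise<number> {
  for (const [key, load] of loaders) {
    try {
      cache.set(key, await load());
    } catch {
      // A failed entry just stays a cache miss; don't block startup on it.
    }
  }
  return cache.size; // number of entries warmed
}
```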
We also changed our ECS deployment to replace tasks gradually instead of all at once:
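The relevant service settings (a representative fragment, matching the percentages described in this section):

```json
{
  "deploymentConfiguration": {
    "maximumPercent": 125,
    "minimumHealthyPercent": 100
  }
}
```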
With 8 desired tasks, this allowed at most two simultaneous replacements (125% of 8 = 10 maximum running tasks, minus the 8 that must stay healthy = 2). Each new task warmed its cache before passing health checks, avoiding stampedes.
Impact: Post-deployment latency spikes eliminated. p99 latency during deployment stayed under 100ms (vs 800ms previously).
Change 4: Feature Flags for Risky Changes
We adopted a simple feature flag system for changes that modified business logic:
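A minimal sketch of the idea (the interface and hashing scheme are assumptions, not our actual implementation): a tenant is enabled either explicitly, for the internal and friendly-customer phases, or by falling into a stable percentage bucket derived from its id.

```typescript
import { createHash } from "node:crypto";

interface Flag {
  enabledTenants: Set<string>; // explicit allow-list
  rolloutPercent: number; // 0-100, applied to everyone else
}

function isEnabled(flag: Flag, tenantId: string): boolean {
  if (flag.enabledTenants.has(tenantId)) return true;
  // Hashing keeps each tenant's bucket stable across deploys, so raising
  // the percentage only ever adds tenants, never flip-flops them.
  const digest = createHash("sha256").update(tenantId).digest();
  const bucket = digest.readUInt16BE(0) % 100;
  return bucket < flag.rolloutPercent;
}
```

The stable bucketing is what makes the 10% → 25% → 50% → 100% ramp safe: a tenant that saw the new code path at 10% keeps seeing it at every later stage.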
For the new billing engine we shipped, the rollout looked like:
- Deploy code with feature flag (flag disabled) — verified no regressions
- Enable for internal tenant — tested for 2 days
- Enable for 3 friendly customers — tested for 1 week
- Roll out to 10%, then 25%, then 50%, then 100% over 2 weeks
- Remove old code path after 100% for 1 month
Impact: Two production incidents avoided that would have affected all 2,000 customers. Instead, they affected 3 friendly customers who helped us debug.
Measurable Results
After 3 months of these changes:
| Metric | Before | After |
|---|---|---|
| Deploys per week | 8-12 | 15-20 |
| Failed requests per deploy | ~50 | 0 |
| p99 latency during deploy | 800ms | 95ms |
| Deployment-related incidents/month | 2-3 | 0 |
| Deploy-to-production time | 25 min | 12 min |
| Rollback time | 15 min | 3 min |
| Monthly downtime from deploys | 5-10 min | 0 min |
The team's confidence in deploying increased dramatically. We went from "deploy during low traffic hours" to "deploy anytime, it doesn't matter."
What We'd Do Differently
Start with Feature Flags Earlier
We implemented feature flags after the first billing bug reached production. If we'd had them from the start, we would have caught the issue in the 3-customer test phase instead of the 2,000-customer production phase.
Automate Migration Compatibility Checks
We relied on code review to catch backward-incompatible migrations. This failed twice — once when a reviewer missed a NOT NULL constraint addition, and once when a column rename slipped through. We should have added a CI check that validates migration compatibility using a schema diff tool.
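Even a naive version helps. This sketch (the pattern list is illustrative, and a real schema diff tool is still the better long-term answer) scans pending migration SQL for statements that aren't backward compatible within a single deploy:

```typescript
// Statements that require the two-deploy treatment described earlier.
const UNSAFE_PATTERNS: RegExp[] = [
  /\bDROP\s+COLUMN\b/i,
  /\bRENAME\s+COLUMN\b/i,
  /\bSET\s+NOT\s+NULL\b/i,
];

// Returns the offending statements so CI can print them and fail the build.
function findUnsafeStatements(sql: string): string[] {
  return sql
    .split(";")
    .map((stmt) => stmt.trim())
    .filter((stmt) => UNSAFE_PATTERNS.some((rx) => rx.test(stmt)));
}
```

Both of the reviews that failed us (the NOT NULL constraint and the column rename) would have been caught by a check of roughly this shape.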
Invest in Deployment Observability Earlier
For the first month, we monitored deployments by watching Datadog dashboards manually. Automated canary analysis (even simple error rate checks) would have caught the cache stampede issue on the first occurrence instead of the fourth.
Test Graceful Shutdown Under Load
We tested graceful shutdown with curl, not with 3,000 concurrent connections. The first production deploy with the new shutdown handler revealed a race condition where the health check returned 503 before the preStop sleep completed, causing the ALB to deregister the task too early. Load testing the shutdown path would have caught this in staging.