
Zero-Downtime Deployments at Scale: Lessons from Production

Real-world lessons from implementing zero-downtime deployments in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 14 min read

In early 2024, our team at a B2B SaaS company serving 2,000+ enterprise customers made a commitment: zero customer-visible downtime during deployments. We were deploying 8-12 times per week, and each deploy involved a 15-30 second window where some requests failed. Over a month, that added up to 5-10 minutes of degraded service — not enough to breach our 99.95% SLA, but enough that our largest customers noticed and complained.

This is the story of how we eliminated deployment-related downtime completely, including the mistakes we made along the way.

The Starting Point

Our stack: 4 NestJS API services, 2 background worker services, PostgreSQL, Redis, and a Next.js frontend. Everything ran on AWS ECS Fargate across two availability zones. Total traffic: ~3,000 requests per second at peak.

The problems with our existing deployment process:

  1. ECS task replacements dropped connections — old tasks were killed before connections drained
  2. Database migrations ran inline during deployment, locking tables for 2-5 seconds
  3. Redis cache invalidation happened all at once, causing cache stampedes
  4. No health check sophistication — a 200 from /health didn't mean the service was ready to handle traffic
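Item 4 above is worth making concrete: a health endpoint is only useful if "healthy" means the service can actually serve. A minimal sketch of a dependency-aware readiness aggregator (the probe functions you pass in, e.g. a PostgreSQL or Redis ping, are hypothetical and not from our codebase):

```typescript
// Aggregate dependency probes into a single readiness verdict.
// Each probe should resolve true only when its dependency answers a real query.
type Probe = () => Promise<boolean>;

async function isReady(
  probes: Record<string, Probe>,
): Promise<{ ready: boolean; failing: string[] }> {
  const results = await Promise.all(
    Object.entries(probes).map(async ([name, probe]) => {
      try {
        return { name, ok: await probe() };
      } catch {
        // A probe that throws counts as a failing dependency
        return { name, ok: false };
      }
    }),
  );
  const failing = results.filter(r => !r.ok).map(r => r.name);
  return { ready: failing.length === 0, failing };
}
```

Wired into /health, a 200 now means "every dependency answered a real query", not just "the process is up".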

Architecture Changes

Change 1: Proper Connection Draining

Our first fix was configuring ECS task draining correctly. Before:

```json
{
  "deregistrationDelay": 0,
  "healthCheckIntervalSeconds": 30,
  "healthyThresholdCount": 5
}
```

After:

```json
{
  "deregistrationDelay": 30,
  "healthCheckIntervalSeconds": 10,
  "healthyThresholdCount": 2,
  "unhealthyThresholdCount": 3
}
```

We also added a graceful shutdown handler to our NestJS services:

```typescript
// main.ts
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  let isShuttingDown = false;

  // Health endpoint that reflects shutdown state
  app.use('/health', (req, res) => {
    if (isShuttingDown) {
      res.status(503).json({ status: 'shutting_down' });
    } else {
      res.json({ status: 'healthy' });
    }
  });

  // Handle SIGTERM from ECS
  process.on('SIGTERM', async () => {
    isShuttingDown = true;

    // Wait for the ALB to deregister this task
    await new Promise(r => setTimeout(r, 15000));

    // Close the NestJS app (drains existing connections)
    await app.close();
    process.exit(0);
  });

  await app.listen(8080);
}
bootstrap();
```

Impact: Connection drops during deployment went from ~50 per deploy to zero.
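The shutdown ordering (fail the health check, wait out the deregistration window, then close) is easy to get subtly wrong, so it helps to pull it into a function with injectable timing. A minimal sketch under our assumptions (the `Closable` interface and function name are ours, not from NestJS):

```typescript
// Encapsulate the SIGTERM sequence so the ordering can be unit-tested
// with a zero-millisecond drain window instead of a live deploy.
interface Closable { close(): Promise<void>; }

async function gracefulShutdown(
  app: Closable,
  setUnhealthy: () => void, // flips /health to 503 so the ALB deregisters us
  drainMs: number,          // should stay within the ECS stopTimeout
): Promise<void> {
  setUnhealthy();                                  // 1. stop attracting new traffic
  await new Promise(r => setTimeout(r, drainMs));  // 2. wait for ALB deregistration
  await app.close();                               // 3. drain in-flight connections
}
```

The SIGTERM handler then reduces to `process.on('SIGTERM', () => gracefulShutdown(app, () => { isShuttingDown = true; }, 15_000).then(() => process.exit(0)))`, and the ordering itself can be asserted in a test.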

Change 2: Separating Database Migrations

We had been running Prisma migrations as part of the ECS task definition's startup command. This meant every new task ran prisma migrate deploy before starting the application, which occasionally locked tables.

New approach: migrations run in a separate CI/CD step before the application deployment.

```yaml
# .github/workflows/deploy.yml
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run migrations
        run: npx prisma migrate deploy
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
      - name: Verify migration
        run: npx prisma migrate status  # Ensure no pending migrations

  deploy:
    needs: migrate
    runs-on: ubuntu-latest
    steps:
      - name: Update ECS service
        run: |
          aws ecs update-service \
            --cluster production \
            --service api-server \
            --task-definition api-server:${{ env.TASK_DEF_REV }} \
            --deployment-configuration "minimumHealthyPercent=100,maximumPercent=200"
```

Crucially, we enforced that all migrations must be backward compatible:

```
Rule: Every migration must work with BOTH the current AND the previous application version.

Allowed:
✅ ADD COLUMN (nullable)
✅ CREATE TABLE
✅ CREATE INDEX CONCURRENTLY
✅ ADD new enum value

Not allowed in a single deploy:
❌ DROP COLUMN
❌ RENAME COLUMN
❌ ALTER COLUMN type
❌ DROP TABLE
```

Column removals and renames became multi-deploy operations: first ship a release that no longer reads the old column (for renames, one that also writes the new column), then a follow-up release that drops the old column.

Impact: Database-related deployment delays went from 2-5 seconds per deploy to zero.

Change 3: Cache Warming Strategy

Our biggest remaining issue: when all tasks were replaced simultaneously, every new task started with cold caches. At 3,000 RPS, this meant 3,000 cache misses hitting PostgreSQL in the first few seconds. Response times spiked from 40ms to 800ms until caches warmed.

Solution: staggered task replacement with cache pre-warming.

```typescript
// Pre-warm critical caches during startup
async function warmCaches() {
  const criticalKeys = [
    'plans:active',
    'feature-flags:all',
    'rate-limits:config',
  ];

  for (const key of criticalKeys) {
    const cached = await redis.get(key);
    if (!cached) {
      // Fetch from the database and cache
      await refreshCacheKey(key);
    }
  }

  // Warm the most frequently accessed tenant configs
  const topTenants = await prisma.tenant.findMany({
    where: { plan: 'enterprise' },
    orderBy: { requestCount: 'desc' },
    take: 100,
  });

  await Promise.all(
    topTenants.map(t => warmTenantCache(t.id)),
  );
}

// Call warmCaches before the health check starts returning 200
await warmCaches();
isReady = true; // Now the health check passes
```

We also changed our ECS deployment to roll one task at a time:

```json
{
  "deploymentConfiguration": {
    "minimumHealthyPercent": 100,
    "maximumPercent": 125
  }
}
```

With 8 tasks, maximumPercent=125 capped the service at 10 tasks, so at most 2 replacement tasks could start at once (125% of 8 = 10, minus the 8 that must stay healthy = 2). Each new task warmed its cache before passing health checks, avoiding stampedes.

Impact: Post-deployment latency spikes eliminated. p99 latency during deployment stayed under 100ms (vs 800ms previously).
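The surge arithmetic (maximum allowed tasks minus the tasks that must stay healthy) generalizes to any task count. A small helper makes it explicit; this is purely illustrative, not part of our deploy tooling:

```typescript
// With minimumHealthyPercent=100, the deploy surge headroom equals the
// number of replacement tasks ECS can run simultaneously:
// surge = floor(desired * maximumPercent / 100) - desired
function maxSimultaneousReplacements(
  desiredCount: number,
  maximumPercent: number,
): number {
  return Math.floor((desiredCount * maximumPercent) / 100) - desiredCount;
}
```

So 8 tasks at 125% gives 2 replacements at a time, while the earlier 200% setting would have allowed all 8 at once.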

Change 4: Feature Flags for Risky Changes

We adopted a simple feature flag system for changes that modified business logic:

```typescript
// lib/features.ts
import { redis } from './redis';

interface FlagConfig {
  enabled: boolean;
  rolloutPercent: number;
  allowList: string[]; // tenant IDs
}

export async function isEnabled(
  flagName: string,
  tenantId: string,
): Promise<boolean> {
  const raw = await redis.get(`flag:${flagName}`);
  if (!raw) return false;

  const config: FlagConfig = JSON.parse(raw);
  if (!config.enabled) return false;
  if (config.allowList.includes(tenantId)) return true;

  // Consistent hash for percentage rollout
  const hash = simpleHash(`${flagName}:${tenantId}`);
  return (hash % 100) < config.rolloutPercent;
}

function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
  }
  return Math.abs(hash);
}
```

For the new billing engine we shipped, the rollout looked like:

  1. Deploy code with feature flag (flag disabled) — verified no regressions
  2. Enable for internal tenant — tested for 2 days
  3. Enable for 3 friendly customers — tested for 1 week
  4. Roll out to 10%, then 25%, then 50%, then 100% over 2 weeks
  5. Remove old code path after 100% for 1 month
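A property this staged rollout quietly depends on is stable bucketing: a tenant must hash into the same bucket on every request and every task, and a tenant enabled at 10% must stay enabled at 25%. That invariant is cheap to assert; here is the bucketing logic isolated as a sketch (it mirrors the simpleHash in lib/features.ts):

```typescript
// Percentage-rollout bucketing: a tenant is enabled when its stable
// hash bucket (0-99) falls below the rollout percentage.
function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
  }
  return Math.abs(hash);
}

function inRollout(
  flagName: string,
  tenantId: string,
  rolloutPercent: number,
): boolean {
  return simpleHash(`${flagName}:${tenantId}`) % 100 < rolloutPercent;
}
```

Because the bucket depends only on the flag name and tenant ID, raising rolloutPercent from 10 to 25 only ever adds tenants; nobody who was enabled gets disabled mid-rollout.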

Impact: Two production incidents avoided that would have affected all 2,000 customers. Instead, they affected 3 friendly customers who helped us debug.


Measurable Results

After 3 months of these changes:

| Metric | Before | After |
| --- | --- | --- |
| Deploys per week | 8-12 | 15-20 |
| Failed requests per deploy | ~50 | 0 |
| p99 latency during deploy | 800ms | 95ms |
| Deployment-related incidents/month | 2-3 | 0 |
| Deploy-to-production time | 25 min | 12 min |
| Rollback time | 15 min | 3 min |
| Monthly downtime from deploys | 5-10 min | 0 min |

The team's confidence in deploying increased dramatically. We went from "deploy during low traffic hours" to "deploy anytime, it doesn't matter."

What We'd Do Differently

Start with Feature Flags Earlier

We implemented feature flags after the first billing bug reached production. If we'd had them from the start, we would have caught the issue in the 3-customer test phase instead of the 2,000-customer production phase.

Automate Migration Compatibility Checks

We relied on code review to catch backward-incompatible migrations. This failed twice — once when a reviewer missed a NOT NULL constraint addition, and once when a column rename slipped through. We should have added a CI check that validates migration compatibility using a schema diff tool.
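Such a check does not need a full schema-diff tool to catch the obvious cases; even a naive scan of migration SQL would have flagged both of our misses. A sketch, assuming migrations are plain SQL files readable in CI (the forbidden-pattern list is ours and deliberately conservative):

```typescript
// Flag migration statements that would break the previous app version.
// Naive pattern matching: a production check should parse the SQL,
// but this catches the common offenders in CI.
const FORBIDDEN: Array<[RegExp, string]> = [
  [/\bDROP\s+COLUMN\b/i, 'DROP COLUMN requires a two-deploy removal'],
  [/\bRENAME\s+COLUMN\b/i, 'RENAME COLUMN requires a multi-deploy rename'],
  [/\bALTER\s+COLUMN\s+\w+\s+(SET\s+NOT\s+NULL|TYPE)\b/i, 'type/NOT NULL changes break old writers'],
  [/\bDROP\s+TABLE\b/i, 'DROP TABLE requires a two-deploy removal'],
];

function lintMigration(sql: string): string[] {
  return FORBIDDEN.filter(([re]) => re.test(sql)).map(([, msg]) => msg);
}
```

A CI step that runs this over every new migration file and fails on a non-empty result would have caught both the NOT NULL addition and the column rename that slipped past review.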

Invest in Deployment Observability Earlier

For the first month, we monitored deployments by watching Datadog dashboards manually. Automated canary analysis (even simple error rate checks) would have caught the cache stampede issue on the first occurrence instead of the fourth.
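The check we wished we had is only a few lines. An illustrative sketch (the margin and the function are ours; real canary analysis would also look at latency and saturation):

```typescript
// Simplest possible canary gate: refuse to promote if the new tasks'
// error rate exceeds the baseline by more than an absolute margin.
function canaryHealthy(
  baselineErrors: number, baselineTotal: number,
  canaryErrors: number, canaryTotal: number,
  marginPct: number = 0.5, // allowed increase in error-rate percentage points
): boolean {
  if (canaryTotal === 0) return false; // no traffic observed yet: do not promote
  const baseRate = baselineTotal === 0 ? 0 : (baselineErrors / baselineTotal) * 100;
  const canaryRate = (canaryErrors / canaryTotal) * 100;
  return canaryRate <= baseRate + marginPct;
}
```

Run against per-task metrics right after each task replacement, even this crude gate would have surfaced the cache-stampede latency errors on the first deploy instead of the fourth.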

Test Graceful Shutdown Under Load

We tested graceful shutdown with curl, not with 3,000 concurrent connections. The first production deploy with the new shutdown handler revealed a race condition where the health check returned 503 before the post-SIGTERM drain sleep completed, causing the ALB to deregister the task too early. Load testing the shutdown path would have caught this in staging.


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
