
Zero-Downtime Deployments at Scale: Lessons from Production

Real-world lessons from implementing zero-downtime deployments in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 14 min read

In early 2024, our team at a B2B SaaS company serving 2,000+ enterprise customers made a commitment: zero customer-visible downtime during deployments. We were deploying 8-12 times per week, and each deploy involved a 15-30 second window where some requests failed. Over a month, that added up to 5-10 minutes of degraded service — not enough to breach our 99.95% SLA, but enough that our largest customers noticed and complained.

This is the story of how we eliminated deployment-related downtime completely, including the mistakes we made along the way.

The Starting Point

Our stack: 4 NestJS API services, 2 background worker services, PostgreSQL, Redis, and a Next.js frontend. Everything ran on AWS ECS Fargate across two availability zones. Total traffic: ~3,000 requests per second at peak.

The problems with our existing deployment process:

  1. ECS task replacements dropped connections — old tasks were killed before connections drained
  2. Database migrations ran inline during deployment, locking tables for 2-5 seconds
  3. Redis cache invalidation happened all at once, causing cache stampedes
  4. No health check sophistication — a 200 from /health didn't mean the service was ready to handle traffic
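Item 4 above is worth making concrete: a health endpoint is only useful if "healthy" means the service can actually serve. A minimal sketch of a dependency-aware readiness aggregator (the probe functions you pass in, e.g. a PostgreSQL or Redis ping, are hypothetical and not from our codebase):

```typescript
// Aggregate dependency probes into a single readiness verdict.
// Each probe should resolve true only when its dependency answers a real query.
type Probe = () => Promise<boolean>;

async function isReady(
  probes: Record<string, Probe>,
): Promise<{ ready: boolean; failing: string[] }> {
  const results = await Promise.all(
    Object.entries(probes).map(async ([name, probe]) => {
      try {
        return { name, ok: await probe() };
      } catch {
        // A probe that throws counts as a failing dependency
        return { name, ok: false };
      }
    }),
  );
  const failing = results.filter(r => !r.ok).map(r => r.name);
  return { ready: failing.length === 0, failing };
}
```

Wired into /health, a 200 now means "every dependency answered a real query", not just "the process is up".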

Architecture Changes

Change 1: Proper Connection Draining

Our first fix was configuring ECS task draining correctly. Before:

```json
{
  "deregistrationDelay": 0,
  "healthCheckIntervalSeconds": 30,
  "healthyThresholdCount": 5
}
```

After:

```json
{
  "deregistrationDelay": 30,
  "healthCheckIntervalSeconds": 10,
  "healthyThresholdCount": 2,
  "unhealthyThresholdCount": 3
}
```

We also added a graceful shutdown handler to our NestJS services:

```typescript
// main.ts
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  let isShuttingDown = false;

  // Health endpoint that reflects shutdown state
  app.use('/health', (req, res) => {
    if (isShuttingDown) {
      res.status(503).json({ status: 'shutting_down' });
    } else {
      res.json({ status: 'healthy' });
    }
  });

  // Handle SIGTERM from ECS
  process.on('SIGTERM', async () => {
    isShuttingDown = true;

    // Wait for the ALB to deregister this task
    await new Promise(r => setTimeout(r, 15000));

    // Close the NestJS app (drains existing connections)
    await app.close();
    process.exit(0);
  });

  await app.listen(8080);
}
bootstrap();
```

Impact: Connection drops during deployment went from ~50 per deploy to zero.
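The shutdown ordering (fail the health check, wait out the deregistration window, then close) is easy to get subtly wrong, so it helps to pull it into a function with injectable timing. A minimal sketch under our assumptions (the `Closable` interface and function name are ours, not from NestJS):

```typescript
// Encapsulate the SIGTERM sequence so the ordering can be unit-tested
// with a zero-millisecond drain window instead of a live deploy.
interface Closable { close(): Promise<void>; }

async function gracefulShutdown(
  app: Closable,
  setUnhealthy: () => void, // flips /health to 503 so the ALB deregisters us
  drainMs: number,          // should stay within the ECS stopTimeout
): Promise<void> {
  setUnhealthy();                                  // 1. stop attracting new traffic
  await new Promise(r => setTimeout(r, drainMs));  // 2. wait for ALB deregistration
  await app.close();                               // 3. drain in-flight connections
}
```

The SIGTERM handler then reduces to `process.on('SIGTERM', () => gracefulShutdown(app, () => { isShuttingDown = true; }, 15_000).then(() => process.exit(0)))`, and the ordering itself can be asserted in a test.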

Change 2: Separating Database Migrations

We had been running Prisma migrations as part of the ECS task definition's startup command. This meant every new task ran prisma migrate deploy before starting the application, which occasionally locked tables.

New approach: migrations run in a separate CI/CD step before the application deployment.

```yaml
# .github/workflows/deploy.yml
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run migrations
        run: npx prisma migrate deploy
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
      - name: Verify migration
        run: npx prisma migrate status  # Ensure no pending migrations

  deploy:
    needs: migrate
    runs-on: ubuntu-latest
    steps:
      - name: Update ECS service
        run: |
          aws ecs update-service \
            --cluster production \
            --service api-server \
            --task-definition api-server:${{ env.TASK_DEF_REV }} \
            --deployment-configuration "minimumHealthyPercent=100,maximumPercent=200"
```

Crucially, we enforced that all migrations must be backward compatible:

```
Rule: Every migration must work with BOTH the current AND the previous application version.

Allowed:
✅ ADD COLUMN (nullable)
✅ CREATE TABLE
✅ CREATE INDEX CONCURRENTLY
✅ ADD new enum value

Not allowed in a single deploy:
❌ DROP COLUMN
❌ RENAME COLUMN
❌ ALTER COLUMN type
❌ DROP TABLE
```

Column removals and renames became multi-deploy operations: first ship a release that no longer reads the old column (for renames, one that also writes the new column), then a follow-up release that drops the old column.

Impact: Database-related deployment delays went from 2-5 seconds per deploy to zero.

Change 3: Cache Warming Strategy

Our biggest remaining issue: when all tasks were replaced simultaneously, every new task started with cold caches. At 3,000 RPS, this meant 3,000 cache misses hitting PostgreSQL in the first few seconds. Response times spiked from 40ms to 800ms until caches warmed.

Solution: staggered task replacement with cache pre-warming.

```typescript
// Pre-warm critical caches during startup
async function warmCaches() {
  const criticalKeys = [
    'plans:active',
    'feature-flags:all',
    'rate-limits:config',
  ];

  for (const key of criticalKeys) {
    const cached = await redis.get(key);
    if (!cached) {
      // Fetch from the database and cache
      await refreshCacheKey(key);
    }
  }

  // Warm the most frequently accessed tenant configs
  const topTenants = await prisma.tenant.findMany({
    where: { plan: 'enterprise' },
    orderBy: { requestCount: 'desc' },
    take: 100,
  });

  await Promise.all(
    topTenants.map(t => warmTenantCache(t.id)),
  );
}

// Call warmCaches before the health check starts returning 200
await warmCaches();
isReady = true; // Now the health check passes
```

We also changed our ECS deployment to roll one task at a time:

```json
{
  "deploymentConfiguration": {
    "minimumHealthyPercent": 100,
    "maximumPercent": 125
  }
}
```

With 8 tasks, maximumPercent=125 capped the service at 10 tasks, so at most 2 replacement tasks could start at once (125% of 8 = 10, minus the 8 that must stay healthy = 2). Each new task warmed its cache before passing health checks, avoiding stampedes.

Impact: Post-deployment latency spikes eliminated. p99 latency during deployment stayed under 100ms (vs 800ms previously).
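The surge arithmetic (maximum allowed tasks minus the tasks that must stay healthy) generalizes to any task count. A small helper makes it explicit; this is purely illustrative, not part of our deploy tooling:

```typescript
// With minimumHealthyPercent=100, the deploy surge headroom equals the
// number of replacement tasks ECS can run simultaneously:
// surge = floor(desired * maximumPercent / 100) - desired
function maxSimultaneousReplacements(
  desiredCount: number,
  maximumPercent: number,
): number {
  return Math.floor((desiredCount * maximumPercent) / 100) - desiredCount;
}
```

So 8 tasks at 125% gives 2 replacements at a time, while the earlier 200% setting would have allowed all 8 at once.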

Change 4: Feature Flags for Risky Changes

We adopted a simple feature flag system for changes that modified business logic:

```typescript
// lib/features.ts
import { redis } from './redis';

interface FlagConfig {
  enabled: boolean;
  rolloutPercent: number;
  allowList: string[]; // tenant IDs
}

export async function isEnabled(
  flagName: string,
  tenantId: string,
): Promise<boolean> {
  const raw = await redis.get(`flag:${flagName}`);
  if (!raw) return false;

  const config: FlagConfig = JSON.parse(raw);
  if (!config.enabled) return false;
  if (config.allowList.includes(tenantId)) return true;

  // Consistent hash for percentage rollout
  const hash = simpleHash(`${flagName}:${tenantId}`);
  return (hash % 100) < config.rolloutPercent;
}

function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
  }
  return Math.abs(hash);
}
```

For the new billing engine we shipped, the rollout looked like:

  1. Deploy code with feature flag (flag disabled) — verified no regressions
  2. Enable for internal tenant — tested for 2 days
  3. Enable for 3 friendly customers — tested for 1 week
  4. Roll out to 10%, then 25%, then 50%, then 100% over 2 weeks
  5. Remove old code path after 100% for 1 month
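A property this staged rollout quietly depends on is stable bucketing: a tenant must hash into the same bucket on every request and every task, and a tenant enabled at 10% must stay enabled at 25%. That invariant is cheap to assert; here is the bucketing logic isolated as a sketch (it mirrors the simpleHash in lib/features.ts):

```typescript
// Percentage-rollout bucketing: a tenant is enabled when its stable
// hash bucket (0-99) falls below the rollout percentage.
function simpleHash(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
  }
  return Math.abs(hash);
}

function inRollout(
  flagName: string,
  tenantId: string,
  rolloutPercent: number,
): boolean {
  return simpleHash(`${flagName}:${tenantId}`) % 100 < rolloutPercent;
}
```

Because the bucket depends only on the flag name and tenant ID, raising rolloutPercent from 10 to 25 only ever adds tenants; nobody who was enabled gets disabled mid-rollout.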

Impact: Two production incidents avoided that would have affected all 2,000 customers. Instead, they affected 3 friendly customers who helped us debug.


Measurable Results

After 3 months of these changes:

| Metric | Before | After |
| --- | --- | --- |
| Deploys per week | 8-12 | 15-20 |
| Failed requests per deploy | ~50 | 0 |
| p99 latency during deploy | 800ms | 95ms |
| Deployment-related incidents/month | 2-3 | 0 |
| Deploy-to-production time | 25 min | 12 min |
| Rollback time | 15 min | 3 min |
| Monthly downtime from deploys | 5-10 min | 0 min |

The team's confidence in deploying increased dramatically. We went from "deploy during low traffic hours" to "deploy anytime, it doesn't matter."

What We'd Do Differently

Start with Feature Flags Earlier

We implemented feature flags after the first billing bug reached production. If we'd had them from the start, we would have caught the issue in the 3-customer test phase instead of the 2,000-customer production phase.

Automate Migration Compatibility Checks

We relied on code review to catch backward-incompatible migrations. This failed twice — once when a reviewer missed a NOT NULL constraint addition, and once when a column rename slipped through. We should have added a CI check that validates migration compatibility using a schema diff tool.
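Such a check does not need a full schema-diff tool to catch the obvious cases; even a naive scan of migration SQL would have flagged both of our misses. A sketch, assuming migrations are plain SQL files readable in CI (the forbidden-pattern list is ours and deliberately conservative):

```typescript
// Flag migration statements that would break the previous app version.
// Naive pattern matching: a production check should parse the SQL,
// but this catches the common offenders in CI.
const FORBIDDEN: Array<[RegExp, string]> = [
  [/\bDROP\s+COLUMN\b/i, 'DROP COLUMN requires a two-deploy removal'],
  [/\bRENAME\s+COLUMN\b/i, 'RENAME COLUMN requires a multi-deploy rename'],
  [/\bALTER\s+COLUMN\s+\w+\s+(SET\s+NOT\s+NULL|TYPE)\b/i, 'type/NOT NULL changes break old writers'],
  [/\bDROP\s+TABLE\b/i, 'DROP TABLE requires a two-deploy removal'],
];

function lintMigration(sql: string): string[] {
  return FORBIDDEN.filter(([re]) => re.test(sql)).map(([, msg]) => msg);
}
```

A CI step that runs this over every new migration file and fails on a non-empty result would have caught both the NOT NULL addition and the column rename that slipped past review.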

Invest in Deployment Observability Earlier

For the first month, we monitored deployments by watching Datadog dashboards manually. Automated canary analysis (even simple error rate checks) would have caught the cache stampede issue on the first occurrence instead of the fourth.
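The check we wished we had is only a few lines. An illustrative sketch (the margin and the function are ours; real canary analysis would also look at latency and saturation):

```typescript
// Simplest possible canary gate: refuse to promote if the new tasks'
// error rate exceeds the baseline by more than an absolute margin.
function canaryHealthy(
  baselineErrors: number, baselineTotal: number,
  canaryErrors: number, canaryTotal: number,
  marginPct: number = 0.5, // allowed increase in error-rate percentage points
): boolean {
  if (canaryTotal === 0) return false; // no traffic observed yet: do not promote
  const baseRate = baselineTotal === 0 ? 0 : (baselineErrors / baselineTotal) * 100;
  const canaryRate = (canaryErrors / canaryTotal) * 100;
  return canaryRate <= baseRate + marginPct;
}
```

Run against per-task metrics right after each task replacement, even this crude gate would have surfaced the cache-stampede latency errors on the first deploy instead of the fourth.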

Test Graceful Shutdown Under Load

We tested graceful shutdown with curl, not with 3,000 concurrent connections. The first production deploy with the new shutdown handler revealed a race condition where the health check returned 503 before the post-SIGTERM drain sleep completed, causing the ALB to deregister the task too early. Load testing the shutdown path would have caught this in staging.


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
