Zero-Downtime Deployments Best Practices for Startup Teams

Battle-tested best practices for zero-downtime deployments, tailored to startup teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 12 min read

Startups can't afford downtime during deployments, but they also can't afford spending two weeks building an enterprise deployment pipeline. You need something that works reliably with minimal infrastructure, scales with you, and doesn't require a dedicated DevOps engineer to maintain. This guide covers the pragmatic deployment patterns that get startups to zero-downtime without over-engineering.

Start Simple: Platform-Managed Deployments

Before building any deployment infrastructure, use what your platform provides:

Vercel / Netlify (Frontend + Serverless)

Zero-downtime is built in. Every deploy creates an immutable deployment. Traffic atomically shifts from old to new. Rollback is instant — just point production to a previous deployment.

vercel.json (zero-downtime by default):

json
{
  "builds": [{ "src": "next.config.ts", "use": "@vercel/next" }],
  "routes": [{ "src": "/(.*)", "dest": "/" }]
}

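Cost: Free tiers cover most early-stage projects, though plan terms (for example, limits on commercial use) vary by platform. No dedicated DevOps work is needed.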

Railway / Render (Backend Services)

Both platforms support rolling deploys with health checks:

yaml
# render.yaml
services:
  - type: web
    name: api
    runtime: node
    buildCommand: npm run build
    startCommand: npm start
    healthCheckPath: /health
    numInstances: 2
    autoDeploy: true

With 2+ instances and a health check path, Render performs rolling updates automatically. No Kubernetes, no Argo Rollouts, no CI/CD pipeline to maintain.
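
To make the rolling-update idea concrete, here is a minimal sketch of the general technique. It is a simplified model, not Render's actual implementation: Instance, HealthCheck, and rollingUpdate are hypothetical names, and the health check is synchronous only to keep the example short.

```typescript
// Hypothetical model of a rolling update; not any platform's real API.
type Instance = { id: string; version: string };
type HealthCheck = (instance: Instance) => boolean;

function rollingUpdate(
  running: Instance[],
  newVersion: string,
  isHealthy: HealthCheck,
): { instances: Instance[]; completed: boolean } {
  const instances = [...running];
  for (let i = 0; i < instances.length; i++) {
    const old = instances[i];
    // Start the replacement first; the old instance keeps serving meanwhile.
    const candidate: Instance = { id: `${old.id}-next`, version: newVersion };
    if (!isHealthy(candidate)) {
      // Abort: every instance not yet replaced is still in rotation.
      return { instances, completed: false };
    }
    // Retire the old instance only after its replacement passes the check.
    instances[i] = candidate;
  }
  return { instances, completed: true };
}
```

With two or more instances, at least one healthy instance is always serving traffic, and a failed health check stops the rollout before the remaining old instances are touched.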

Fly.io (Containers with Built-in Rolling Deploys)

toml
# fly.toml
[http_service]
  internal_port = 8080
  force_https = true
  auto_start_machines = true
  auto_stop_machines = true
  min_machines_running = 2

[http_service.concurrency]
  type = "requests"
  hard_limit = 250
  soft_limit = 200

[[http_service.checks]]
  interval = "10s"
  timeout = "5s"
  grace_period = "10s"
  method = "GET"
  path = "/health"

[deploy]
  strategy = "rolling"

Health Check Implementation

The foundation of zero-downtime deploys. Get this right first:

typescript
// server.ts (Express; the same pattern applies to Hono or Fastify)
import express from 'express';
import { PrismaClient } from '@prisma/client';

const app = express();
const prisma = new PrismaClient();

let isShuttingDown = false;

// Readiness: can this instance serve traffic?
app.get('/health', async (_req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'shutting_down' });
  }

  try {
    // Check critical dependencies
    await prisma.$queryRaw`SELECT 1`;
    res.json({ status: 'healthy' });
  } catch {
    res.status(503).json({ status: 'unhealthy' });
  }
});

// PORT is an assumed environment variable; 8080 is used as the default
const server = app.listen(Number(process.env.PORT) || 8080);

// Graceful shutdown
process.on('SIGTERM', () => {
  isShuttingDown = true;

  // Give the load balancer time to see the failing health check and stop sending traffic
  setTimeout(() => {
    // Stop accepting new connections, let in-flight requests finish, then exit
    server.close(() => {
      process.exit(0);
    });
  }, 10000);
});

Key requirements:

  • Health check must fail when the process is shutting down
  • Database/Redis connectivity should be verified
  • Response time should be under 100ms
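
One way to keep the health check fast is to put a time limit on each dependency check, so a slow database produces a quick "unhealthy" instead of a hanging request. The sketch below shows that idea under stated assumptions: checkWithTimeout is a hypothetical helper, not part of Express or Prisma, and the 100ms budget comes from the requirement above.

```typescript
// Resolves to true only if `check` succeeds within `timeoutMs` milliseconds.
async function checkWithTimeout(
  check: () => Promise<unknown>,
  timeoutMs: number,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('health check timed out')), timeoutMs);
  });
  try {
    // Whichever settles first wins: the dependency check or the timer.
    await Promise.race([check(), timeout]);
    return true;
  } catch {
    return false; // the dependency failed or was too slow
  } finally {
    clearTimeout(timer);
  }
}
```

Inside the /health handler this could be used as checkWithTimeout(() => prisma.$queryRaw`SELECT 1`, 100), returning 503 whenever the result is false.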

Docker-Based Deployment

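When you outgrow platform-managed deployments, Docker with health checks and rolling updates provides zero-downtime. Note that the deploy.update_config and deploy.rollback_config settings shown here are applied by Docker Swarm (docker stack deploy); a plain docker compose up does not perform rolling updates from them.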

Docker Compose with Health Checks

yaml
# docker-compose.yml
services:
  api:
    image: api:latest
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 30s
        order: start-first # Start new before stopping old
      rollback_config:
        parallelism: 0 # 0 rolls back all replicas at once
        order: stop-first
    healthcheck:
      # Requires curl to be installed in the image
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    ports:
      - "8080:8080"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      api:
        condition: service_healthy

Nginx Upstream with Health Checks

nginx
upstream api_servers {
    server api:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://api_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Retry on connection errors (not on 5xx)
        proxy_next_upstream error timeout;
        proxy_next_upstream_tries 2;
    }

    location /health {
        proxy_pass http://api_servers;
        access_log off;
    }
}
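
The retry rule in that config is easy to misread, so here is a small model of the policy it expresses: move on to the next upstream only when the connection fails or times out, and pass any HTTP response, including a 5xx, straight back to the client. This is a simplified sketch, not how Nginx is implemented; Outcome and proxyWithRetry are hypothetical names.

```typescript
// Hypothetical model of "retry on error/timeout, never on an HTTP response".
type Outcome =
  | { kind: 'response'; status: number }
  | { kind: 'error' } // could not connect, or the connection was reset
  | { kind: 'timeout' };

function proxyWithRetry(upstreams: Array<() => Outcome>, maxTries: number): Outcome {
  let last: Outcome = { kind: 'error' };
  const tries = Math.min(maxTries, upstreams.length);
  for (let i = 0; i < tries; i++) {
    last = upstreams[i]();
    // Any HTTP response ends the loop, even a 500: the upstream may already
    // have done the work, so repeating the request could duplicate it.
    if (last.kind === 'response') return last;
    // Connection error or timeout: fall through and try the next upstream.
  }
  return last;
}
```

During a deploy, this is what lets a request that hits a just-stopped container succeed on a second instance without the client noticing.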

GitHub Actions CI/CD

A simple but effective deployment pipeline:

yaml
# .github/workflows/deploy.yml
name: Deploy
on:
  push:
    branches: [main]

# Placeholder values: replace with your own registry, image name, and host.
# DEPLOY_HOST is a hypothetical secret name. Registry login and SSH key setup
# are omitted here and assumed to be configured already.
env:
  REGISTRY: registry.example.com
  IMAGE: api
  SERVER: ${{ secrets.DEPLOY_HOST }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bun run lint
      - run: bun test

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build and push Docker image
        run: |
          docker build -t $REGISTRY/$IMAGE:${{ github.sha }} .
          docker push $REGISTRY/$IMAGE:${{ github.sha }}

      - name: Deploy with rolling update
        # Assumes docker-compose.yml on the server sets the api image to
        # $REGISTRY/$IMAGE:${IMAGE_TAG} so the tag can be chosen per deploy.
        # On Swarm, use docker stack deploy instead so update_config applies.
        run: |
          ssh deploy@$SERVER "
            docker pull $REGISTRY/$IMAGE:${{ github.sha }}
            IMAGE_TAG=${{ github.sha }} docker compose up -d --no-deps api
          "

      - name: Verify deployment
        run: |
          for i in {1..30}; do
            if curl -sf https://api.example.com/health; then
              echo 'Deployment healthy'
              exit 0
            fi
            sleep 2
          done
          echo 'Deployment health check failed'
          exit 1

      - name: Rollback on failure
        if: failure()
        # Redeploy the commit that was live before this push
        # (assumes its image is still in the registry)
        run: |
          ssh deploy@$SERVER "
            IMAGE_TAG=${{ github.event.before }} docker compose up -d --no-deps api
          "

Database Migrations for Startups

Use the additive-only pattern — only add columns and tables, never remove or rename in the same deploy:

typescript
// Safe migration pattern (the SQL assumes PostgreSQL)
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Deploy 1: Add new column (nullable)
// prisma/migrations/001_add_email_column.sql
//   ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

// Deploy 2: Start writing to new column
// Your application code (function name is illustrative; a numeric id is assumed)
async function markEmailVerified(userId: number) {
  await prisma.user.update({
    where: { id: userId },
    data: {
      emailVerified: true,
      // Keep writing to old fields too
    },
  });
}

// Deploy 3: Backfill existing records
// scripts/backfill-email-verified.ts
const BATCH_SIZE = 1000;

async function backfillEmailVerified() {
  while (true) {
    // Each pass picks up to BATCH_SIZE rows that still need the backfill, so no
    // cursor is required. PostgreSQL has no UPDATE ... LIMIT, hence the subquery.
    const updated = await prisma.$executeRaw`
      UPDATE users
      SET email_verified = true
      WHERE id IN (
        SELECT id FROM users
        WHERE email_verified IS NULL
          AND email_confirmed_at IS NOT NULL
        LIMIT ${BATCH_SIZE}
      )
    `;
    if (updated === 0) break;
    await new Promise(r => setTimeout(r, 100)); // Avoid overloading DB
  }
}

// Deploy 4: Make column non-nullable and remove old column
// Only after Deploy 3 is verified and any remaining NULL rows are set to false
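
Between Deploy 1 and Deploy 4, old and new code run side by side and the new column is still NULL for many rows, so reads need to tolerate both states. A minimal sketch of such a read path is below. UserRow and isEmailVerified are hypothetical names; the column names are the ones used in the migration above.

```typescript
// Row shape during the transition: the new column may still be NULL.
interface UserRow {
  email_verified: boolean | null; // new column, added in Deploy 1
  email_confirmed_at: Date | null; // existing column
}

// Prefer the new column; fall back to the old one until the backfill is done.
function isEmailVerified(user: UserRow): boolean {
  if (user.email_verified !== null) return user.email_verified;
  return user.email_confirmed_at !== null;
}
```

Once the backfill is verified and the column is non-nullable, the fallback branch can be deleted in a later deploy.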

Environment Variable Management

Use environment-based feature toggles for zero-downtime changes:

typescript
// lib/config.ts
export const config = {
  features: {
    newCheckout: process.env.FEATURE_NEW_CHECKOUT === 'true',
    v2Api: process.env.FEATURE_V2_API === 'true',
  },
} as const;

// Usage (inside a request handler)
if (config.features.newCheckout) {
  return handleNewCheckout(req);
}
return handleLegacyCheckout(req);

Update the environment variable in your platform (Railway, Fly.io, Render) to toggle features without deploying code. Most platforms restart instances when environment variables change, so combine with rolling deploys.
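
A strict === 'true' comparison silently treats a typo such as "ture" or "TRUE" as off. If you would rather fail loudly at startup, a slightly stricter parser is one option. This is a sketch, not a required pattern: readFlag and Env are hypothetical names, and accepting "1" and "0" is an assumption about what your team wants.

```typescript
type Env = Record<string, string | undefined>;

// Parses a boolean feature flag; throws on values it does not recognize.
function readFlag(env: Env, name: string, fallback = false): boolean {
  const raw = env[name]?.trim().toLowerCase();
  if (raw === undefined || raw === '') return fallback; // unset: use the default
  if (raw === 'true' || raw === '1') return true;
  if (raw === 'false' || raw === '0') return false;
  throw new Error(`Invalid value for ${name}: expected "true" or "false"`);
}

// Usage in lib/config.ts would be, for example:
//   newCheckout: readFlag(process.env, 'FEATURE_NEW_CHECKOUT'),
```

Because the error is thrown while the config module loads, a bad value makes the new instance fail its health check instead of quietly running with the feature off.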

Anti-Patterns to Avoid

Over-Engineering the Pipeline

A startup with 3 engineers doesn't need Argo Rollouts, Istio service mesh, and a custom deployment controller. Start with your platform's built-in deployments. Add complexity only when you outgrow the simple approach.

Deploying from Laptops

ssh prod && git pull && npm start works until it doesn't. The first time someone deploys from a branch with uncommitted changes, you'll understand why CI/CD exists. Set up GitHub Actions on day one.

Skipping Health Checks

Without health checks, your platform can't distinguish a healthy deployment from a crashed one. A simple /health endpoint takes about 5 minutes to implement and keeps crashed or misconfigured releases from receiving traffic.

Running Database Migrations in the Deployment Pipeline

Don't tie migrations to deployments. Run migrations separately, verify they succeeded, then deploy the code that uses the new schema. This lets you roll back the code without rolling back the migration.

No Rollback Plan

Every deployment should have a documented rollback path. For platform-managed deployments, this is usually "redeploy the previous commit." Test this before you need it in a crisis.

Startup Readiness Checklist

  • Health check endpoint implemented and returning dependency status
  • At least 2 instances running for availability
  • Rolling deployment configured (platform-managed or Docker)
  • CI/CD pipeline runs tests before deploying
  • Graceful shutdown with SIGTERM handling
  • Database migrations separated from code deployments
  • Environment variable-based feature toggles for risky changes
  • Rollback procedure documented (even if it's "click Revert in Vercel")
  • Post-deployment health verification automated
  • Alerting configured for error rate spikes after deployment

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
