
Zero-Downtime Deployments Best Practices for Enterprise Teams

Battle-tested best practices for zero-downtime deployments tailored to enterprise teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 13 min read

Zero-downtime deployments are table stakes for enterprise SaaS platforms. Your customers signed SLAs with 99.99% uptime commitments. A 30-second deployment restart during business hours means incident reports, credits, and erosion of trust. Enterprise-grade zero-downtime deployment goes beyond blue-green swaps — it encompasses database migrations, feature flags, traffic management, and organizational processes that ensure every release is invisible to users.

Deployment Strategy Selection

Enterprise teams need multiple strategies in their toolkit, selected based on risk level:

Risk Level | Strategy               | Use Case
-----------|------------------------|-------------------------------------------
Low        | Rolling update         | Config changes, dependency bumps
Medium     | Blue-green             | Application logic changes
High       | Canary                 | Database schema changes, new features
Critical   | Feature flags + canary | Payment flows, auth changes, API contracts
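
The mapping above can be encoded as a small lookup so the pipeline picks a strategy automatically. This is a hypothetical helper, not part of any deployment tool:

```typescript
// Hypothetical risk-to-strategy lookup mirroring the table above.
type RiskLevel = "low" | "medium" | "high" | "critical";
type Strategy =
  | "rolling-update"
  | "blue-green"
  | "canary"
  | "feature-flags+canary";

const STRATEGY_BY_RISK: Record<RiskLevel, Strategy> = {
  low: "rolling-update",
  medium: "blue-green",
  high: "canary",
  critical: "feature-flags+canary",
};

function selectStrategy(risk: RiskLevel): Strategy {
  return STRATEGY_BY_RISK[risk];
}

console.log(selectStrategy("high")); // → canary
```

Encoding the policy in code (or pipeline config) keeps strategy selection consistent across teams instead of leaving it to per-release judgment.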

Rolling Updates with Health Checks

yaml
# Kubernetes rolling update with proper health checks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2           # Allow 2 extra pods during rollout
      maxUnavailable: 0     # Never reduce below desired count
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: api-server:v2.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

The preStop hook is critical — it gives the load balancer time to drain connections before the pod terminates. Without it, in-flight requests get dropped during pod termination.

Blue-Green with Traffic Shifting

yaml
# Argo Rollouts blue-green strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 6
  strategy:
    blueGreen:
      activeService: api-active
      previewService: api-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: api-preview
      scaleDownDelaySeconds: 300   # Keep old version for 5 min after switch

Canary with Progressive Delivery

yaml
# Canary with automatic analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5             # 5% of traffic
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 25            # 25% of traffic
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: latency-check
        - setWeight: 50            # 50% of traffic
        - pause: { duration: 15m }
        - analysis:
            templates:
              - templateName: full-health-check
        - setWeight: 100           # Full rollout
      canaryMetadata:
        labels:
          deployment: canary
      stableMetadata:
        labels:
          deployment: stable

Database Migration Without Downtime

Database changes are the hardest part of zero-downtime deployment. The expand-and-contract pattern is non-negotiable:

Phase 1: Expand (Backward Compatible)

sql
-- Migration: Add new column without breaking existing code
ALTER TABLE orders ADD COLUMN customer_email VARCHAR(255);

-- Backfill in batches (don't lock the table)
UPDATE orders SET customer_email = (
  SELECT email FROM customers WHERE customers.id = orders.customer_id
)
WHERE id BETWEEN 1 AND 100000
  AND customer_email IS NULL;

-- Repeat for remaining batches, sleeping between batches
-- to avoid replication lag
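
The "repeat for remaining batches" step is usually driven by a small script. Here is one possible sketch; `runQuery`, the batch size, and the sleep interval are illustrative assumptions, not a real client API:

```typescript
// Illustrative batched-backfill runner; swap runQuery for your DB client.
const BATCH_SIZE = 100_000;
const SLEEP_MS = 500; // pause between batches so replicas can catch up

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Compute inclusive [start, end] id ranges covering 1..maxId.
function batchRanges(maxId: number, size: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let start = 1; start <= maxId; start += size) {
    ranges.push([start, Math.min(start + size - 1, maxId)]);
  }
  return ranges;
}

// Placeholder for your database client (pg, mysql2, Prisma.$executeRaw, ...).
async function runQuery(sql: string, params: unknown[]): Promise<void> {}

async function backfillCustomerEmail(maxId: number): Promise<void> {
  for (const [start, end] of batchRanges(maxId, BATCH_SIZE)) {
    await runQuery(
      `UPDATE orders SET customer_email = (
         SELECT email FROM customers WHERE customers.id = orders.customer_id
       )
       WHERE id BETWEEN $1 AND $2 AND customer_email IS NULL`,
      [start, end]
    );
    await sleep(SLEEP_MS); // throttle to limit replication lag
  }
}
```

The `customer_email IS NULL` guard makes each batch idempotent, so the backfill can be stopped and resumed safely.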

Phase 2: Dual Write (Both Versions Work)

typescript
// Application code writes to both old and new columns
async function createOrder(data: CreateOrderDTO) {
  return prisma.order.create({
    data: {
      customerId: data.customerId,
      // Write to new column alongside old reference
      customerEmail: data.customerEmail,
      amount: data.amount,
    },
  });
}

Phase 3: Contract (Remove Old Column)

Only after all application instances use the new column:

sql
-- Verify no queries reference the old pattern
-- Then drop the old constraint/column
ALTER TABLE orders DROP COLUMN IF EXISTS old_customer_ref;

This three-phase approach spans at least three deployments. Each phase is independently deployable and rollback-safe.

Connection Draining

Proper connection draining prevents request failures during pod transitions:

go
// Graceful shutdown handler
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// apiHandler stands in for the real application routes.
func apiHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

func main() {
	var healthy int32 = 1

	mux := http.NewServeMux()

	// Health check that reflects shutdown state
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&healthy) == 1 {
			w.WriteHeader(http.StatusOK)
			w.Write([]byte("ready"))
		} else {
			w.WriteHeader(http.StatusServiceUnavailable)
			w.Write([]byte("shutting down"))
		}
	})

	mux.HandleFunc("/api/", apiHandler)

	server := &http.Server{
		Addr:    ":8080",
		Handler: mux,
	}

	// Start server
	go server.ListenAndServe()

	// Wait for shutdown signal
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	<-quit

	// Phase 1: Stop accepting new connections (fail the readiness probe)
	atomic.StoreInt32(&healthy, 0)

	// Phase 2: Wait for load balancer to detect unhealthy status
	time.Sleep(15 * time.Second)

	// Phase 3: Drain existing connections with timeout
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	server.Shutdown(ctx)
}

Feature Flags for Safe Rollouts

Decouple deployment from release using feature flags:

typescript
// lib/features.ts
interface FeatureFlag {
  key: string;
  enabled: boolean;
  rolloutPercentage: number;
  allowedTenants?: string[];
}

class FeatureService {
  private flags: Map<string, FeatureFlag> = new Map();

  async isEnabled(
    key: string,
    context: { userId: string; tenantId: string }
  ): Promise<boolean> {
    const flag = this.flags.get(key);
    if (!flag || !flag.enabled) return false;

    // Tenant allowlist takes priority
    if (flag.allowedTenants?.includes(context.tenantId)) {
      return true;
    }

    // Percentage-based rollout using consistent hashing
    const hash = this.hashUser(context.userId, key);
    return hash < flag.rolloutPercentage;
  }

  private hashUser(userId: string, flagKey: string): number {
    const str = `${flagKey}:${userId}`;
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
    }
    return Math.abs(hash) % 100;
  }
}

const features = new FeatureService();

// Usage in API handler
async function handleRequest(req: Request, ctx: AppContext) {
  const useNewBilling = await features.isEnabled(
    'new-billing-engine',
    { userId: ctx.userId, tenantId: ctx.tenantId }
  );

  if (useNewBilling) {
    return newBillingHandler(req, ctx);
  }
  return legacyBillingHandler(req, ctx);
}


Rollback Procedures

Enterprise deployments need automated rollback triggers:

yaml
# Argo Rollouts automatic rollback analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 60s
      failureLimit: 3
      successCondition: result[0] < 0.01   # <1% error rate
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5..",deployment="canary"}[5m]))
            /
            sum(rate(http_requests_total{deployment="canary"}[5m]))
    - name: latency-p99
      interval: 60s
      failureLimit: 2
      successCondition: result[0] < 500    # p99 < 500ms
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{deployment="canary"}[5m])) by (le)
            ) * 1000

Multi-Region Deployment

Enterprise deployments span multiple regions. Deploy regionally with automated promotion:

Region Order:
1. us-west-2 (internal/staging) → 30 min observation
2. eu-west-1 (smallest production region) → 1 hour observation
3. us-east-1 (largest region) → 2 hour observation
4. ap-southeast-1 → automatic after us-east-1 success

yaml
# ArgoCD ApplicationSet for multi-region
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-server
spec:
  generators:
    - list:
        elements:
          - region: us-west-2
            wave: "1"
            cluster: staging
          - region: eu-west-1
            wave: "2"
            cluster: prod-eu
          - region: us-east-1
            wave: "3"
            cluster: prod-us
          - region: ap-southeast-1
            wave: "4"
            cluster: prod-ap
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: wave
              operator: In
              values: ["1"]
        - matchExpressions:
            - key: wave
              operator: In
              values: ["2"]
        - matchExpressions:
            - key: wave
              operator: In
              values: ["3", "4"]

Anti-Patterns to Avoid

Deploying During Peak Hours

Enterprise customers use your product during business hours. Deploying at 2 PM EST maximizes the blast radius. Schedule deployments during low-traffic windows or use canary deployments that minimize risk regardless of timing.

Big-Bang Database Migrations

Running ALTER TABLE ... ADD COLUMN NOT NULL on a table with 100M rows locks the table for minutes. Use the expand-and-contract pattern with background backfills. Every schema change should be a separate deployment from the application change that uses it.

Skipping Rollback Testing

Teams test the deployment forward path but never test rollback. Run rollback drills monthly. Verify that rolling back to the previous version doesn't corrupt data, lose in-flight requests, or break dependent services.

Shared Mutable State During Deploys

If your deployment process writes to a shared cache, database, or message queue, ensure both old and new versions can read what the other writes. Version your cache keys, maintain backward-compatible message schemas, and never assume only one version runs at a time.
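
One simple way to version cache keys is to prefix them with a schema version. This is a minimal sketch; the scheme and names are illustrative, not from any particular library:

```typescript
// Hypothetical versioned cache-key scheme: bump the version whenever
// the cached value's shape changes.
const CACHE_SCHEMA_VERSION = "v2";

function cacheKey(entity: string, id: string): string {
  return `${CACHE_SCHEMA_VERSION}:${entity}:${id}`;
}

// During a rollout, v1 pods keep reading/writing "v1:order:123" while
// v2 pods use "v2:order:123" — neither version misreads the other's
// entries, and the old keys simply expire after the rollout completes.
console.log(cacheKey("order", "123")); // → v2:order:123
```

The cost is a one-time cold cache for the new version, which is almost always cheaper than deserialization errors from mixed-version reads.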

Manual Approval Gates Without Timeouts

Requiring manual approval for production deployments is fine. But without a timeout, deployments stall indefinitely when the approver is unavailable. Set a 4-hour timeout — if no one approves, the deployment auto-rolls back.
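
The gate logic can be as simple as racing the approval against a timer. A sketch under stated assumptions; `waitForApproval` and the function names are illustrative, not a real pipeline API:

```typescript
// Hypothetical approval gate: promote only on explicit approval,
// roll back on rejection or timeout.
async function approvalGate(
  waitForApproval: () => Promise<boolean>,
  timeoutMs: number
): Promise<"promote" | "rollback"> {
  const timedOut = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), timeoutMs)
  );
  const outcome = await Promise.race([
    waitForApproval().then((ok) => (ok ? "approved" : "rejected")),
    timedOut,
  ]);
  // Anything other than an explicit approval rolls back.
  return outcome === "approved" ? "promote" : "rollback";
}
```

Defaulting the timeout path to rollback (rather than auto-promote) keeps an absent approver from silently shipping a change no one signed off on.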

Enterprise Readiness Checklist

  • Rolling updates configured with zero maxUnavailable
  • Readiness and liveness probes differentiated (ready ≠ live)
  • Graceful shutdown with connection draining (30s minimum)
  • PreStop hook with sleep to allow LB deregistration
  • Database migrations follow expand-and-contract pattern
  • Feature flags decouple deployment from release
  • Canary analysis with automated rollback on error rate spike
  • Multi-region deployment with progressive regional rollout
  • Rollback procedure documented and tested monthly
  • Deployment windows defined and communicated to stakeholders
  • Post-deployment monitoring dashboard with deployment markers
  • Change management process integrated with deployment pipeline

