
Zero-Downtime Deployment Best Practices for High-Scale Teams

Battle-tested best practices for zero-downtime deployments at high scale, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 14 min read

At high scale, zero-downtime deployment is a distributed systems problem. When you're running 500+ pods across multiple regions serving 100K+ requests per second, a deployment isn't just "replace old with new." It's a coordinated state transition that must maintain consistency across load balancers, caches, databases, message queues, and background workers — all while keeping every request successful.

The High-Scale Deployment Challenge

At scale, deployment complexity grows non-linearly:

| Scale | Pods | Regions | Deploy Time | Risk Surface |
| --- | --- | --- | --- | --- |
| Small | 3-10 | 1 | 2 min | Low |
| Medium | 10-50 | 2 | 5 min | Moderate |
| High | 50-200 | 3+ | 15 min | High |
| Very High | 200-1000 | 5+ | 30+ min | Critical |

At 500 pods, even a batched rolling update takes 25+ minutes end to end. During that window, both versions serve traffic simultaneously, which creates compatibility requirements for every API endpoint, database query, cache key, and message format.

Coordinated Rollout Strategy

Wave-Based Deployment

Deploy in waves to limit blast radius while maintaining reasonable deployment speed:

```yaml
# Wave-based rollout with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 200
  strategy:
    canary:
      maxSurge: "10%"
      maxUnavailable: 0
      steps:
        # Wave 1: 1% canary (2 pods)
        - setWeight: 1
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: high-scale-analysis

        # Wave 2: 5% (10 pods)
        - setWeight: 5
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: high-scale-analysis

        # Wave 3: 25% (50 pods)
        - setWeight: 25
        - pause: { duration: 15m }
        - analysis:
            templates:
              - templateName: high-scale-analysis

        # Wave 4: 50% (100 pods)
        - setWeight: 50
        - pause: { duration: 20m }
        - analysis:
            templates:
              - templateName: high-scale-analysis

        # Wave 5: full rollout
        - setWeight: 100
```

Analysis Template for High Scale

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: high-scale-analysis
spec:
  metrics:
    - name: error-rate
      interval: 30s
      failureLimit: 3
      successCondition: result[0] < 0.005 # <0.5% errors
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090 # adjust to your Prometheus endpoint
          query: |
            sum(rate(http_requests_total{status=~"5..",rollout_version="canary"}[2m]))
            /
            sum(rate(http_requests_total{rollout_version="canary"}[2m]))

    - name: latency-p99
      interval: 30s
      failureLimit: 3
      successCondition: result[0] < 0.3 # p99 < 300ms
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{rollout_version="canary"}[2m])) by (le)
            )

    - name: saturation
      interval: 60s
      failureLimit: 2
      successCondition: result[0] < 0.8 # CPU < 80%
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            avg(rate(container_cpu_usage_seconds_total{pod=~"api-server-canary.*"}[2m]))
            /
            avg(kube_pod_container_resource_limits{resource="cpu", pod=~"api-server-canary.*"})

    - name: downstream-errors
      interval: 60s
      failureLimit: 2
      successCondition: result[0] < 0.001
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(grpc_client_handled_total{grpc_code!="OK", source="api-server-canary"}[2m]))
            /
            sum(rate(grpc_client_handled_total{source="api-server-canary"}[2m]))
```

Cache Versioning During Deployment

At high scale, cache inconsistency during deployment causes subtle bugs:

```go
// Cache key versioning to handle mixed old/new pods
package cache

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

type VersionedCache struct {
	client     *redis.Client
	appVersion string
}

func NewVersionedCache(client *redis.Client, version string) *VersionedCache {
	return &VersionedCache{
		client:     client,
		appVersion: version,
	}
}

func (c *VersionedCache) key(base string) string {
	return fmt.Sprintf("v%s:%s", c.appVersion, base)
}

func (c *VersionedCache) Get(ctx context.Context, key string, dest interface{}) error {
	// Try the current version's key only.
	data, err := c.client.Get(ctx, c.key(key)).Bytes()
	if err == nil {
		return json.Unmarshal(data, dest)
	}

	// Don't fall back to the old version — a cache miss is safer
	// than returning stale data with an incompatible schema.
	return err
}

func (c *VersionedCache) Set(
	ctx context.Context,
	key string,
	value interface{},
	ttl time.Duration,
) error {
	data, err := json.Marshal(value)
	if err != nil {
		return err
	}
	return c.client.Set(ctx, c.key(key), data, ttl).Err()
}
```
Graceful Shutdown at Scale

At high throughput, graceful shutdown must handle thousands of in-flight requests:

```go
package server

import (
	"context"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

type GracefulServer struct {
	server         *http.Server
	activeRequests int64 // exported via metrics; shutdown itself waits on wg
	shuttingDown   int32
	wg             sync.WaitGroup
}

func (s *GracefulServer) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&s.shuttingDown) == 1 {
			w.Header().Set("Connection", "close")
			w.Header().Set("Retry-After", "5")
			http.Error(w, "Service shutting down", http.StatusServiceUnavailable)
			return
		}

		atomic.AddInt64(&s.activeRequests, 1)
		s.wg.Add(1)
		defer func() {
			s.wg.Done()
			atomic.AddInt64(&s.activeRequests, -1)
		}()

		next.ServeHTTP(w, r)
	})
}

func (s *GracefulServer) Shutdown() {
	// Phase 1: mark as shutting down (readiness checks start failing)
	atomic.StoreInt32(&s.shuttingDown, 1)

	// Phase 2: wait for the LB to deregister (mirrors the Kubernetes preStop window)
	time.Sleep(15 * time.Second)

	// Phase 3: wait for active requests to complete
	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		// All requests completed.
	case <-time.After(45 * time.Second):
		// Force shutdown after timeout.
	}

	// Phase 4: close the server
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = s.server.Shutdown(ctx)
}
```


Queue Draining for Background Workers

At scale, background workers must drain their queues before shutting down:

```go
package worker

import (
	"context"
	"sync"
	"time"
)

type Worker struct {
	queue       MessageQueue
	handler     func(ctx context.Context, msg Message) error
	concurrency int
	wg          sync.WaitGroup
	cancel      context.CancelFunc
}

func (w *Worker) Start(ctx context.Context) {
	ctx, w.cancel = context.WithCancel(ctx)

	for i := 0; i < w.concurrency; i++ {
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				default:
					msg, err := w.queue.Receive(ctx, 5*time.Second)
					if err != nil {
						continue
					}

					if err := w.handler(ctx, msg); err != nil {
						msg.Nack() // return to queue for retry
					} else {
						msg.Ack()
					}
				}
			}
		}()
	}
}

func (w *Worker) GracefulStop(timeout time.Duration) {
	// Stop receiving new messages.
	w.cancel()

	// Wait for in-progress messages to complete.
	done := make(chan struct{})
	go func() {
		w.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
	case <-time.After(timeout):
		// Messages still in progress will reappear after the queue's
		// visibility timeout and be retried by another worker.
	}
}
```

Load Balancer Configuration

At high scale, load balancer configuration directly affects deployment safety:

```hcl
# AWS ALB target group configuration for zero-downtime
resource "aws_lb_target_group" "api" {
  name     = "api-server"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    path                = "/health/ready"
    port                = 8080
    healthy_threshold   = 2  # 2 consecutive successes to mark healthy
    unhealthy_threshold = 3  # 3 consecutive failures to mark unhealthy
    interval            = 10 # check every 10 seconds
    timeout             = 5
    matcher             = "200"
  }

  deregistration_delay = 30 # drain connections for 30s before removal

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 3600
    enabled         = false # disable for stateless services
  }
}
```

Anti-Patterns to Avoid

Deploying All Regions Simultaneously

At high scale, a bad deployment that hits all regions at once is an outage. Always deploy to the smallest region first, observe for at least 30 minutes, then cascade to larger regions. Each region should be independently rollback-capable.
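The cascade can be made explicit in the deploy pipeline's own configuration; a hedged sketch (the region names, ordering, and the `bake` field are illustrative, not a specific CI product's schema):

```yaml
# Ordered waves: smallest region first, 30+ min observation between waves,
# each region independently rollback-capable.
deploy_waves:
  - region: ap-south-2   # smallest region first
    bake: 30m            # observe before cascading
  - region: eu-west-3
    bake: 30m
  - region: us-east-1    # largest region last
    bake: 30m
rollback:
  scope: per-region      # a failed wave rolls back only its own region
```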

Ignoring Deployment Velocity Limits

Replacing 50 pods per minute sounds fast, but if each new pod needs 30 seconds to warm up (load caches, establish connections), you'll have 25 cold pods serving slow requests at any given time. Limit rollout speed to match your warm-up time.

Shared Global State During Rollout

Global caches, feature flag configs, and circuit breaker states that change mid-deployment cause split-brain scenarios. Version your cache keys, make feature flags immutable during deployment, and ensure circuit breakers evaluate per-instance, not globally.

Assuming Instant Load Balancer Updates

Load balancers take 10-30 seconds to reflect health check changes. After a pod becomes unhealthy, traffic continues flowing to it until the next health check interval. Account for this delay in your preStop hook timing.
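In Kubernetes this delay is usually absorbed with a preStop sleep sized from the health-check math (the durations below are illustrative; they must cover your load balancer's interval × unhealthy threshold, plus margin):

```yaml
# Container-level hook: keep the pod alive and serving while the LB
# notices the failing readiness check and deregisters the target.
lifecycle:
  preStop:
    exec:
      # e.g. 10s interval × 3 failures = 30s of residual traffic; add margin.
      command: ["sh", "-c", "sleep 35"]
# Pod-level field: must exceed the preStop sleep plus in-flight drain time.
terminationGracePeriodSeconds: 90
```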

Testing Only the Happy Path

High-scale deployments expose race conditions, connection pool exhaustion, and cache stampedes that don't appear at low scale. Run deployment tests against a production-like environment with production-level traffic using shadow traffic or load testing.

High-Scale Readiness Checklist

  • Wave-based rollout with automated analysis at each step
  • Canary analysis covers error rate, latency, CPU saturation, and downstream errors
  • Cache keys versioned by application version
  • Graceful shutdown handles 10K+ in-flight requests
  • Background workers drain queues before termination
  • Load balancer deregistration delay matches grace period
  • Multi-region deployment with progressive rollout
  • Rollback completes in under 2 minutes
  • Deployment velocity limited by pod warm-up time
  • Shadow traffic testing validates deployment at production scale
  • PodDisruptionBudgets prevent cluster operations from disrupting rollouts
  • Deployment dashboards show real-time canary vs stable comparison



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
