System Design

Saga Pattern Implementation Best Practices for High Scale Teams

Battle-tested best practices for Saga Pattern Implementation tailored to High Scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 11 min read

At high scale, saga implementations face challenges that enterprise teams at moderate throughput never encounter: saga state storage becomes a bottleneck, compensation cascades under load create thundering herds, and observability across thousands of concurrent sagas requires purpose-built tooling. These best practices are drawn from operating saga-based systems processing 50,000+ transactions per minute.

Partitioning Saga State Storage

At high throughput, a single saga state table becomes a write bottleneck. Partition saga state by saga ID hash to distribute writes across multiple shards.

```typescript
// Partitioned saga state store
class PartitionedSagaStore {
  private readonly partitions: SagaStorePartition[];
  private readonly partitionCount: number;

  constructor(partitions: SagaStorePartition[]) {
    this.partitions = partitions;
    this.partitionCount = partitions.length;
  }

  private getPartition(sagaId: string): SagaStorePartition {
    const hash = this.hashCode(sagaId);
    const index = Math.abs(hash) % this.partitionCount;
    return this.partitions[index];
  }

  private hashCode(str: string): number {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash |= 0; // force 32-bit integer
    }
    return hash;
  }

  async save(state: SagaState): Promise<void> {
    const partition = this.getPartition(state.sagaId);
    await partition.save(state);
  }

  async get(sagaId: string): Promise<SagaState | null> {
    const partition = this.getPartition(sagaId);
    return partition.get(sagaId);
  }
}

// PostgreSQL partition setup
/*
CREATE TABLE saga_state (
  saga_id TEXT NOT NULL,
  saga_type TEXT NOT NULL,
  status TEXT NOT NULL,
  current_step INT NOT NULL DEFAULT 0,
  context JSONB NOT NULL,
  completed_steps TEXT[] NOT NULL DEFAULT '{}',
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY HASH (saga_id);

CREATE TABLE saga_state_p0 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE saga_state_p1 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 1);
CREATE TABLE saga_state_p2 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 2);
CREATE TABLE saga_state_p3 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 3);
CREATE TABLE saga_state_p4 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 4);
CREATE TABLE saga_state_p5 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 5);
CREATE TABLE saga_state_p6 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 6);
CREATE TABLE saga_state_p7 PARTITION OF saga_state FOR VALUES WITH (MODULUS 8, REMAINDER 7);
*/
```

Eight partitions handle most workloads up to 100K writes/second. Benchmark your specific write pattern before adding more partitions, since coordination overhead increases with partition count.
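Before provisioning shards, it is worth sanity-checking that the hash routing is deterministic and reasonably balanced for your ID format. A standalone sketch (the `hashCode` logic mirrors the store above; the synthetic `saga-N` IDs and partition count of 8 are assumptions for illustration):

```typescript
// Same 32-bit string hash as the store above.
function hashCode(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0; // force 32-bit integer
  }
  return hash;
}

// A saga ID must always route to the same partition.
function partitionFor(sagaId: string, partitionCount: number): number {
  return Math.abs(hashCode(sagaId)) % partitionCount;
}

// Check the distribution across 8 partitions for 10,000 synthetic saga IDs.
const counts: number[] = new Array(8).fill(0);
for (let i = 0; i < 10_000; i++) {
  counts[partitionFor(`saga-${i}`, 8)]++;
}
console.log(counts); // expect roughly even buckets
```

If your real saga IDs are UUIDs the distribution is usually fine; structured IDs with shared prefixes are the case worth checking.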

Rate-Limiting Compensation Cascades

When a downstream service fails under load, hundreds of sagas trigger compensation simultaneously, creating a thundering herd on the already-stressed service.

```typescript
class CompensationRateLimiter {
  private readonly tokens: number; // bucket capacity (burst size)
  private available: number;
  private lastRefill: number;
  private readonly refillRateMs: number;
  private readonly queue: Array<{
    resolve: () => void;
    reject: (error: Error) => void;
    timeout: ReturnType<typeof setTimeout>;
  }> = [];

  constructor(tokensPerSecond: number, burstSize: number) {
    this.tokens = burstSize;
    this.available = burstSize;
    this.refillRateMs = 1000 / tokensPerSecond;
    this.lastRefill = Date.now();
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = now - this.lastRefill;
    const newTokens = Math.floor(elapsed / this.refillRateMs);

    if (newTokens > 0) {
      this.available = Math.min(this.tokens, this.available + newTokens);
      this.lastRefill = now;
    }
  }

  async acquire(timeoutMs: number = 30000): Promise<void> {
    this.refill();

    if (this.available > 0) {
      this.available--;
      return;
    }

    // No token available: park the caller, bounded by a timeout.
    return new Promise((resolve, reject) => {
      const timeout = setTimeout(() => {
        const idx = this.queue.findIndex(item => item.resolve === resolve);
        if (idx !== -1) this.queue.splice(idx, 1);
        reject(new Error('Compensation rate limit timeout'));
      }, timeoutMs);

      this.queue.push({ resolve, reject, timeout });
    });
  }

  release(): void {
    if (this.queue.length > 0) {
      // Hand the freed slot directly to the longest-waiting caller.
      const next = this.queue.shift()!;
      clearTimeout(next.timeout);
      next.resolve();
    } else {
      this.available = Math.min(this.tokens, this.available + 1);
    }
  }
}

// Usage in saga orchestrator
const compensationLimiter = new CompensationRateLimiter(100, 50); // 100/sec, burst 50

async function compensateWithRateLimit<TContext>(
  step: SagaStep<TContext>,
  ctx: TContext
): Promise<void> {
  await compensationLimiter.acquire();
  try {
    await step.compensate(ctx);
  } finally {
    compensationLimiter.release();
  }
}
```

Parallel Step Execution

Not all saga steps are sequential. When steps have no dependencies between them, execute them in parallel to reduce total saga duration.

```typescript
interface ParallelSagaStep<TContext> {
  name: string;
  execute: (ctx: TContext) => Promise<void>;
  compensate: (ctx: TContext) => Promise<void>;
}

interface SagaPhase<TContext> {
  steps: ParallelSagaStep<TContext>[];
  mode: 'sequential' | 'parallel';
}

class PhasedSagaOrchestrator<TContext> {
  private phases: SagaPhase<TContext>[] = [];
  private completedPhases: SagaPhase<TContext>[] = [];

  addSequentialPhase(steps: ParallelSagaStep<TContext>[]): this {
    this.phases.push({ steps, mode: 'sequential' });
    return this;
  }

  addParallelPhase(steps: ParallelSagaStep<TContext>[]): this {
    this.phases.push({ steps, mode: 'parallel' });
    return this;
  }

  async execute(ctx: TContext): Promise<void> {
    for (const phase of this.phases) {
      const completedInPhase: ParallelSagaStep<TContext>[] = [];
      try {
        if (phase.mode === 'parallel') {
          await Promise.all(phase.steps.map(step => step.execute(ctx)));
        } else {
          for (const step of phase.steps) {
            await step.execute(ctx);
            completedInPhase.push(step);
          }
        }
        this.completedPhases.push(phase);
      } catch (error) {
        // Compensate steps that already succeeded inside the failing phase first.
        // When Promise.all rejects we cannot know which parallel steps succeeded,
        // so every step in the phase is compensated; compensations must be idempotent.
        if (phase.mode === 'parallel') {
          await Promise.allSettled(phase.steps.map(s => s.compensate(ctx)));
        } else {
          for (const step of [...completedInPhase].reverse()) {
            await step.compensate(ctx).catch(() => {});
          }
        }
        await this.compensateAll(ctx);
        throw error;
      }
    }
  }

  private async compensateAll(ctx: TContext): Promise<void> {
    for (const phase of [...this.completedPhases].reverse()) {
      if (phase.mode === 'parallel') {
        await Promise.allSettled(phase.steps.map(s => s.compensate(ctx)));
      } else {
        for (const step of [...phase.steps].reverse()) {
          await step.compensate(ctx).catch(() => {});
        }
      }
    }
  }
}

// Example: E-commerce order saga with parallel phases
const orderSaga = new PhasedSagaOrchestrator<OrderContext>()
  // Phase 1: Validate (parallel — independent checks)
  .addParallelPhase([
    { name: 'validate_inventory', execute: validateInventory, compensate: noop },
    { name: 'validate_payment', execute: validatePaymentMethod, compensate: noop },
    { name: 'validate_address', execute: validateShippingAddress, compensate: noop },
  ])
  // Phase 2: Reserve (sequential — order matters)
  .addSequentialPhase([
    { name: 'reserve_inventory', execute: reserveInventory, compensate: releaseInventory },
    { name: 'charge_payment', execute: chargePayment, compensate: refundPayment },
  ])
  // Phase 3: Fulfill (parallel — independent operations)
  .addParallelPhase([
    { name: 'create_shipment', execute: createShipment, compensate: cancelShipment },
    { name: 'send_confirmation', execute: sendConfirmation, compensate: noop },
  ]);
```

Distributed Tracing for Saga Observability

At high scale, you need to trace a single saga across multiple services and step executions.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('saga-orchestrator');

class TracedSagaOrchestrator<TContext> {
  async execute(
    sagaId: string,
    sagaType: string,
    steps: SagaStep<TContext>[],
    ctx: TContext
  ): Promise<void> {
    const span = tracer.startSpan(`saga.${sagaType}`, {
      attributes: {
        'saga.id': sagaId,
        'saga.type': sagaType,
        'saga.steps.count': steps.length,
      },
    });

    try {
      for (let i = 0; i < steps.length; i++) {
        const step = steps[i];

        const stepSpan = tracer.startSpan(`saga.step.${step.name}`, {
          attributes: {
            'saga.step.name': step.name,
            'saga.step.index': i,
          },
        });

        try {
          await step.execute(ctx);
          stepSpan.setStatus({ code: SpanStatusCode.OK });
        } catch (error) {
          stepSpan.setStatus({
            code: SpanStatusCode.ERROR,
            message: (error as Error).message,
          });
          stepSpan.recordException(error as Error);
          throw error;
        } finally {
          stepSpan.end();
        }
      }

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      span.setAttribute('saga.compensation.triggered', true);
      throw error;
    } finally {
      span.end();
    }
  }
}
```


Saga Instance Limits and Backpressure

Prevent resource exhaustion by limiting concurrent saga instances per type.

```typescript
class SagaConcurrencyLimiter {
  private readonly limits: Map<string, number>;
  private readonly active: Map<string, number>;

  constructor(limits: Record<string, number>) {
    this.limits = new Map(Object.entries(limits));
    this.active = new Map();
  }

  canStart(sagaType: string): boolean {
    const limit = this.limits.get(sagaType) ?? Infinity;
    const current = this.active.get(sagaType) ?? 0;
    return current < limit;
  }

  acquire(sagaType: string): boolean {
    if (!this.canStart(sagaType)) return false;
    this.active.set(sagaType, (this.active.get(sagaType) ?? 0) + 1);
    return true;
  }

  release(sagaType: string): void {
    const current = this.active.get(sagaType) ?? 0;
    this.active.set(sagaType, Math.max(0, current - 1));
  }

  getMetrics(): Record<string, { active: number; limit: number }> {
    const metrics: Record<string, { active: number; limit: number }> = {};
    for (const [type, limit] of this.limits) {
      metrics[type] = { active: this.active.get(type) ?? 0, limit };
    }
    return metrics;
  }
}

// Usage
const limiter = new SagaConcurrencyLimiter({
  'order-fulfillment': 500,
  'account-provisioning': 100,
  'payment-reconciliation': 50,
});
```
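What callers do when a limit is hit is a policy decision; a common choice at high scale is to shed load explicitly rather than queue internally. A minimal sketch of that policy (the `StartResult` shape and `retryAfterMs` value are illustrative assumptions, and an inline counter stands in for the limiter class above so the example is self-contained):

```typescript
type StartResult = { accepted: true } | { accepted: false; retryAfterMs: number };

// Minimal per-type counter standing in for the concurrency limiter.
const active = new Map<string, number>();
const limits: Record<string, number> = { 'order-fulfillment': 500 };

function tryStartSaga(sagaType: string): StartResult {
  const limit = limits[sagaType] ?? Infinity;
  const current = active.get(sagaType) ?? 0;
  if (current >= limit) {
    // Backpressure: tell the caller to retry later instead of queueing.
    return { accepted: false, retryAfterMs: 1000 };
  }
  active.set(sagaType, current + 1);
  return { accepted: true };
}

function finishSaga(sagaType: string): void {
  active.set(sagaType, Math.max(0, (active.get(sagaType) ?? 0) - 1));
}
```

In an HTTP API this typically maps to a 429 response with a Retry-After header, which pushes the queueing onto clients that already know how to back off.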

Anti-Patterns at Scale

Anti-Pattern 1: Unbounded Saga Context

Storing full request payloads, response bodies, and intermediate results in the saga context. At 50K concurrent sagas, this consumes gigabytes of memory. Store only IDs and amounts in the saga context — fetch full objects only when a step needs them.
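As an illustration of the difference, compare a fat context against a lean one for the order saga (all field names here are hypothetical, not taken from the earlier examples):

```typescript
// Fat context (anti-pattern): full objects held in memory for the saga's lifetime.
interface FatOrderContext {
  order: unknown;           // full order payload
  customer: unknown;        // full customer record
  paymentResponse: unknown; // raw gateway response body
}

// Lean context: only the identifiers and amounts that steps actually need.
interface LeanOrderContext {
  orderId: string;
  customerId: string;
  amountCents: number;
  currency: string;
}

// Project the full order down to the lean context once, at saga start.
function toLeanContext(order: {
  id: string;
  customerId: string;
  totalCents: number;
  currency: string;
  items: unknown[];
}): LeanOrderContext {
  return {
    orderId: order.id,
    customerId: order.customerId,
    amountCents: order.totalCents,
    currency: order.currency,
  };
}
```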

Anti-Pattern 2: Synchronous Compensation Chains

Running compensation steps synchronously when a saga fails under load. If compensation for each step takes 200ms and you have five steps, that is one second of compensation time per saga. With 1,000 concurrent failures, you are spending 1,000 seconds of serial compensation time. Use parallel compensation for independent steps.
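Parallel compensation for independent steps can be sketched as follows (the `CompensableStep` shape echoes the step interfaces used earlier; collecting failure names rather than aborting is an assumption about the desired error policy):

```typescript
interface CompensableStep {
  name: string;
  compensate: () => Promise<void>;
}

// Run all compensations concurrently and collect failures instead of
// aborting, so one failed compensation does not block the rest.
async function compensateInParallel(steps: CompensableStep[]): Promise<string[]> {
  const results = await Promise.allSettled(steps.map(s => s.compensate()));
  return results
    .map((r, i) => (r.status === 'rejected' ? steps[i].name : null))
    .filter((n): n is string => n !== null);
}
```

The returned list of failed step names is what you would route to a dead letter queue for manual or retried compensation.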

Anti-Pattern 3: Missing Circuit Breakers on Step Execution

Without circuit breakers, a failing downstream service causes all sagas to queue up at that step, exhausting connection pools and memory. Implement circuit breakers per downstream service, and fail fast to trigger compensation rather than waiting for timeouts.
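A minimal count-based breaker looks like this (the thresholds, the single-probe half-open policy, and the injectable clock are illustrative assumptions; production breakers usually track failures over a rolling window):

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = Date.now, // injectable for testing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = 'half-open'; // allow a single probe request through
      } else {
        // Fail fast so the saga compensates now instead of waiting on timeouts.
        throw new Error('Circuit open: failing fast to trigger compensation');
      }
    }
    try {
      const result = await fn();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

One breaker instance per downstream service, shared across saga instances, is the configuration that matches the anti-pattern described above.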

Anti-Pattern 4: Global Saga State Locks

Using a global lock on the saga state table to prevent concurrent modifications. This serializes all saga operations and becomes a single point of contention. Use row-level optimistic locking with version numbers instead.
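Optimistic locking reduces to a compare-and-swap on a version column. A sketch with the SQL shown as a comment and equivalent in-memory logic below it (the `VersionedState` row shape is an assumption):

```typescript
// SQL equivalent:
//   UPDATE saga_state SET context = $1, version = version + 1
//   WHERE saga_id = $2 AND version = $3;
// Zero rows updated means another writer won the race: reload and retry.

interface VersionedState {
  sagaId: string;
  version: number;
  context: Record<string, unknown>;
}

const store = new Map<string, VersionedState>();

// Returns true if the write succeeded, false if the caller's version was stale.
function compareAndSwap(next: VersionedState): boolean {
  const current = store.get(next.sagaId);
  if (current && current.version !== next.version) {
    return false; // stale writer loses; caller must reload and retry
  }
  store.set(next.sagaId, { ...next, version: next.version + 1 });
  return true;
}
```

Contention is resolved per row, so unrelated sagas never block each other, which is exactly what the global lock destroys.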

High-Scale Production Checklist

  • Saga state storage is partitioned by saga ID
  • Compensation rate limiting prevents thundering herds
  • Independent steps execute in parallel
  • Each step and compensation is instrumented with distributed tracing
  • Concurrent saga instances are bounded per saga type
  • Saga context contains only IDs and critical values, not full objects
  • Circuit breakers protect each downstream service call
  • Compensation uses parallel execution for independent steps
  • Saga state uses optimistic locking, not global locks
  • Dead letter queue with automatic alerting for compensation failures
  • Metrics dashboards show: throughput, p99 latency, failure rate, compensation rate, DLQ depth
  • Load tested at 2x expected peak throughput

Conclusion

High-scale saga implementations require infrastructure-level thinking that goes beyond the pattern itself. The saga orchestration logic — step sequencing, compensation, and state transitions — is the easy part. The hard part is operating it at throughput where every inefficiency compounds: unbounded contexts eat memory, synchronous compensations block threads, and unpartitioned state tables throttle writes.

The phased execution model (parallel validation, sequential mutation, parallel notification) reduces end-to-end saga latency by 40-60% in typical e-commerce workflows. Combined with partitioned state storage and rate-limited compensation, you get a system that degrades gracefully under load instead of cascading failures across services.

Instrument everything from day one. At high scale, you cannot debug individual sagas — you debug patterns. Distributed tracing across saga steps, combined with aggregate metrics (completion rate by saga type, p99 latency by step, compensation frequency), gives you the operational visibility to identify bottlenecks before they become outages.



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
