Saga Pattern Implementation Best Practices for Enterprise Teams

Battle-tested best practices for implementing the saga pattern on enterprise teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 10 min read

The saga pattern keeps data consistent across microservices without a distributed transaction, but a flawed implementation creates failure modes worse than the problem it solves. After implementing sagas across three enterprise platforms (an e-commerce order pipeline, a financial reconciliation system, and a multi-tenant provisioning service), I found that the following practices separate production-grade implementations from fragile prototypes.

Choosing Between Orchestration and Choreography

The first decision is whether a central orchestrator coordinates the saga or whether services react to events independently (choreography).

Use orchestration when:

  • The saga has more than four steps
  • Compensation logic is complex or conditional
  • You need centralized monitoring and retry policies
  • Business stakeholders need to understand the flow (an orchestrator maps directly to a workflow diagram)

Use choreography when:

  • The saga has two to three steps
  • Services are owned by different teams with independent release cycles
  • Loose coupling is more important than centralized visibility
  • Each service already publishes domain events

In practice, most enterprise systems benefit from orchestration. The visibility and debuggability advantages outweigh the coupling trade-off.
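To make the contrast concrete, here is a minimal sketch of the choreography style. It assumes a hypothetical in-memory `EventBus` and invented event names (`order.created`, `inventory.reserved`, `payment.failed`, and so on); none of these come from a real library, and a production system would use a message broker instead.

```typescript
// Hypothetical in-memory event bus; a real system would use a message broker.
type SagaEvent = { type: string; orderId: string };
type Handler = (e: SagaEvent) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();
  readonly log: string[] = []; // every published event type, in order

  subscribe(type: string, handler: Handler): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  publish(e: SagaEvent): void {
    this.log.push(e.type);
    for (const h of this.handlers.get(e.type) ?? []) h(e);
  }
}

// Each service knows only the events it consumes and emits.
function wireServices(bus: EventBus, paymentSucceeds: boolean): void {
  // Inventory service: reserves on order creation, releases if payment fails.
  bus.subscribe('order.created', e =>
    bus.publish({ type: 'inventory.reserved', orderId: e.orderId }));
  bus.subscribe('payment.failed', e =>
    bus.publish({ type: 'inventory.released', orderId: e.orderId }));

  // Payment service: charges once inventory is reserved.
  bus.subscribe('inventory.reserved', e =>
    bus.publish({
      type: paymentSucceeds ? 'payment.charged' : 'payment.failed',
      orderId: e.orderId,
    }));
}

const demoBus = new EventBus();
wireServices(demoBus, false);
demoBus.publish({ type: 'order.created', orderId: 'o-1' });
console.log(demoBus.log.join(' -> '));
// order.created -> inventory.reserved -> payment.failed -> inventory.released
```

Note that the compensation path (`payment.failed` leading to `inventory.released`) exists only as a pair of subscriptions. No single place describes the whole flow, which is the visibility cost that an orchestrator removes.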

typescript
// Orchestrator-based saga definition
interface SagaStep<TContext> {
  name: string;
  execute: (ctx: TContext) => Promise<void>;
  compensate: (ctx: TContext) => Promise<void>;
  retryPolicy?: RetryPolicy;
}

interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  retryableErrors?: string[]; // matched against error.name; if omitted, every error is retried
}

// Raised when a step fails and the completed steps were compensated
class SagaFailedError extends Error {
  constructor(readonly stepName: string, readonly originalError: Error) {
    super(`Saga failed at step "${stepName}": ${originalError.message}`);
    this.name = 'SagaFailedError';
  }
}

// Raised when one or more compensations fail
class CompensationFailedError extends Error {
  constructor(readonly errors: Error[]) {
    super(`${errors.length} compensation step(s) failed`);
    this.name = 'CompensationFailedError';
  }
}

class SagaOrchestrator<TContext> {
  private steps: SagaStep<TContext>[] = [];

  addStep(step: SagaStep<TContext>): this {
    this.steps.push(step);
    return this;
  }

  async execute(ctx: TContext): Promise<void> {
    // Track completed steps per run so one orchestrator can execute many sagas
    const completedSteps: SagaStep<TContext>[] = [];

    for (const step of this.steps) {
      try {
        await this.executeWithRetry(step, ctx);
        completedSteps.push(step);
      } catch (error) {
        // If compensation itself fails, CompensationFailedError propagates instead
        await this.compensate(completedSteps, ctx);
        throw new SagaFailedError(step.name, error as Error);
      }
    }
  }

  private async compensate(
    completedSteps: SagaStep<TContext>[],
    ctx: TContext
  ): Promise<void> {
    // Compensate in reverse order
    const toCompensate = [...completedSteps].reverse();
    const compensationErrors: Error[] = [];

    for (const step of toCompensate) {
      try {
        // Reuse the retry logic by running the compensation as the step body
        await this.executeWithRetry(
          { ...step, execute: step.compensate, compensate: async () => {} },
          ctx
        );
      } catch (error) {
        // Record the error and continue so as many steps as possible are compensated
        compensationErrors.push(error as Error);
      }
    }

    if (compensationErrors.length > 0) {
      throw new CompensationFailedError(compensationErrors);
    }
  }

  private async executeWithRetry(step: SagaStep<TContext>, ctx: TContext): Promise<void> {
    const policy = step.retryPolicy ?? { maxAttempts: 3, backoffMs: 100, backoffMultiplier: 2 };
    let lastError: Error | null = null;

    for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
      try {
        await step.execute(ctx);
        return;
      } catch (error) {
        lastError = error as Error;

        // Permanent errors fail immediately instead of consuming retries
        if (policy.retryableErrors && !policy.retryableErrors.includes(lastError.name)) {
          throw lastError;
        }

        if (attempt < policy.maxAttempts) {
          const delay = policy.backoffMs * Math.pow(policy.backoffMultiplier, attempt - 1);
          await new Promise(resolve => setTimeout(resolve, delay));
        }
      }
    }

    throw lastError!;
  }
}
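As a quick check of the retry arithmetic in `executeWithRetry`, the wait before retry n is `backoffMs * backoffMultiplier^(n - 1)`. The helper below (`backoffSchedule`, a name introduced here purely for illustration) lists the waits a given policy would produce.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
}

// Delays (in ms) slept between attempts; there is no wait after the final attempt.
function backoffSchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < policy.maxAttempts; attempt++) {
    delays.push(policy.backoffMs * Math.pow(policy.backoffMultiplier, attempt - 1));
  }
  return delays;
}

// Default policy from the orchestrator: 3 attempts, 100 ms base, multiplier 2.
console.log(backoffSchedule({ maxAttempts: 3, backoffMs: 100, backoffMultiplier: 2 })); // [ 100, 200 ]
```

With the default policy, a step that keeps failing is therefore abandoned after roughly 300 ms of waiting plus the time spent in the three attempts themselves.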

Best Practice 1: Make Every Step Idempotent

Every execute and compensate function must be safely re-runnable. Network failures, container restarts, and message redelivery will cause duplicate invocations.

typescript
// Assumes OrderContext and paymentService are defined elsewhere in the codebase.

// BAD: not idempotent, so a retry creates a duplicate charge
async function chargePaymentUnsafe(ctx: OrderContext): Promise<void> {
  await paymentService.charge(ctx.userId, ctx.amount);
}

// GOOD: idempotent via a deterministic idempotency key
async function chargePayment(ctx: OrderContext): Promise<void> {
  await paymentService.charge({
    idempotencyKey: `order-${ctx.orderId}-charge`,
    userId: ctx.userId,
    amount: ctx.amount,
  });
}

// GOOD: idempotent compensation that refunds only if an unrefunded charge exists
async function compensateChargePayment(ctx: OrderContext): Promise<void> {
  const charge = await paymentService.getCharge(`order-${ctx.orderId}-charge`);
  if (!charge || charge.status === 'refunded') return; // Nothing to undo, or already compensated

  await paymentService.refund({
    idempotencyKey: `order-${ctx.orderId}-refund`,
    chargeId: charge.id,
  });
}

The pattern: use a deterministic idempotency key derived from the saga context, and check current state before mutating.
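To show what the receiving side of that pattern might look like, here is a small sketch of a service that deduplicates by idempotency key. `InMemoryPaymentService` and its methods are hypothetical stand-ins, not the API of any real payment provider.

```typescript
// Hypothetical in-memory payment service that deduplicates by idempotency key.
interface Charge {
  id: string;
  userId: string;
  amount: number;
  status: 'charged' | 'refunded';
}

class InMemoryPaymentService {
  private chargesByKey = new Map<string, Charge>();

  charge(req: { idempotencyKey: string; userId: string; amount: number }): Charge {
    const existing = this.chargesByKey.get(req.idempotencyKey);
    if (existing) return existing; // duplicate request: return the earlier result

    const created: Charge = {
      id: `ch_${this.chargesByKey.size + 1}`,
      userId: req.userId,
      amount: req.amount,
      status: 'charged',
    };
    this.chargesByKey.set(req.idempotencyKey, created);
    return created;
  }

  refund(idempotencyKey: string): void {
    const charge = this.chargesByKey.get(idempotencyKey);
    if (!charge || charge.status === 'refunded') return; // check state before mutating
    charge.status = 'refunded';
  }

  // Sum of all charges for a user that have not been refunded.
  totalCharged(userId: string): number {
    let total = 0;
    for (const c of this.chargesByKey.values()) {
      if (c.userId === userId && c.status === 'charged') total += c.amount;
    }
    return total;
  }
}

const payments = new InMemoryPaymentService();
const chargeKey = 'order-42-charge';
const firstCharge = payments.charge({ idempotencyKey: chargeKey, userId: 'u1', amount: 50 });
const retriedCharge = payments.charge({ idempotencyKey: chargeKey, userId: 'u1', amount: 50 });
console.log(firstCharge.id === retriedCharge.id); // true
console.log(payments.totalCharged('u1'));         // 50
```

The retry returns the original charge instead of creating a second one, so the caller can re-run the step as often as the orchestrator's retry policy demands.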

Best Practice 2: Persist Saga State at Every Step

If the orchestrator crashes mid-saga, you need to resume from the last completed step, not restart from the beginning.

typescript
interface SagaState {
  sagaId: string;
  sagaType: string;
  status: 'running' | 'compensating' | 'completed' | 'failed';
  currentStep: number;
  context: Record<string, unknown>;
  completedSteps: string[];
  startedAt: Date;
  updatedAt: Date;
  error?: string;
}

// SagaStore is assumed to expose async get(sagaId) and save(state) methods
class PersistentSagaOrchestrator<TContext extends Record<string, unknown>> {
  constructor(
    private store: SagaStore,
    private steps: SagaStep<TContext>[],
    private sagaType = 'order-fulfillment',
  ) {}

  async execute(sagaId: string, ctx: TContext): Promise<void> {
    // Check for existing state (resume after crash)
    let state = await this.store.get(sagaId);

    if (!state) {
      state = {
        sagaId,
        sagaType: this.sagaType,
        status: 'running',
        currentStep: 0,
        context: ctx,
        completedSteps: [],
        startedAt: new Date(),
        updatedAt: new Date(),
      };
      await this.store.save(state);
    }

    // A saga that already finished must not run again
    if (state.status === 'completed' || state.status === 'failed') return;

    // On resume, trust the persisted context rather than the caller's copy
    const sagaCtx = state.context as TContext;

    // A crash during rollback resumes the rollback, not the forward path
    if (state.status === 'compensating') {
      await this.compensate(state, sagaCtx);
      return;
    }

    // Resume from the first step that has not completed
    for (let i = state.currentStep; i < this.steps.length; i++) {
      const step = this.steps[i];

      try {
        await step.execute(sagaCtx);

        state.currentStep = i + 1;
        state.completedSteps.push(step.name);
        state.updatedAt = new Date();
        await this.store.save(state);
      } catch (error) {
        state.status = 'compensating';
        state.error = (error as Error).message;
        state.updatedAt = new Date();
        await this.store.save(state);

        await this.compensate(state, sagaCtx);
        return;
      }
    }

    state.status = 'completed';
    state.updatedAt = new Date();
    await this.store.save(state);
  }

  private async compensate(state: SagaState, ctx: TContext): Promise<void> {
    // Undo completed steps in reverse order. A resumed rollback may repeat
    // compensations, which is safe only because they are idempotent.
    const toCompensate = [...state.completedSteps].reverse();

    for (const stepName of toCompensate) {
      const step = this.steps.find(s => s.name === stepName);
      if (!step) {
        throw new Error(`Unknown step "${stepName}" in saga ${state.sagaId}`);
      }
      await step.compensate(ctx); // a throw leaves status 'compensating' for a later resume
    }

    state.status = 'failed';
    state.updatedAt = new Date();
    await this.store.save(state);
  }
}
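The orchestrator above depends on a `SagaStore` that the article does not define. The sketch below assumes the contract implied by its usage (`get` and `save`) and backs it with a `Map`. It is only suitable for tests; `InMemorySagaStore` is a name introduced here, and a production store would write to a durable database. The `SagaState` shape is repeated so the block stands alone.

```typescript
interface SagaState {
  sagaId: string;
  sagaType: string;
  status: 'running' | 'compensating' | 'completed' | 'failed';
  currentStep: number;
  context: Record<string, unknown>;
  completedSteps: string[];
  startedAt: Date;
  updatedAt: Date;
  error?: string;
}

// Assumed persistence contract, inferred from how the orchestrator calls the store.
interface SagaStore {
  get(sagaId: string): Promise<SagaState | null>;
  save(state: SagaState): Promise<void>;
}

// Copies the mutable parts so later in-memory mutation cannot alter "persisted" data.
function copyState(state: SagaState): SagaState {
  return {
    ...state,
    context: { ...state.context },
    completedSteps: [...state.completedSteps],
  };
}

class InMemorySagaStore implements SagaStore {
  private states = new Map<string, SagaState>();

  async get(sagaId: string): Promise<SagaState | null> {
    const found = this.states.get(sagaId);
    return found ? copyState(found) : null;
  }

  async save(state: SagaState): Promise<void> {
    this.states.set(state.sagaId, copyState(state));
  }
}

async function demoCrashBeforeSave(): Promise<number> {
  const store = new InMemorySagaStore();
  const state: SagaState = {
    sagaId: 's-1', sagaType: 'order-fulfillment', status: 'running',
    currentStep: 0, context: { orderId: 'o-1' }, completedSteps: [],
    startedAt: new Date(), updatedAt: new Date(),
  };
  await store.save(state);
  state.currentStep = 1; // progress made in memory, then a "crash" before save()
  const reloaded = await store.get('s-1');
  return reloaded ? reloaded.currentStep : -1;
}

demoCrashBeforeSave().then(step => console.log(step)); // 0
```

The demo shows why the save after each step matters: progress that was never persisted is invisible on resume, so the step runs again and must be idempotent.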

Best Practice 3: Define Explicit Timeouts Per Step

Sagas without timeouts hang indefinitely when a downstream service stops responding.

typescript
interface SagaStepWithTimeout<TContext> extends SagaStep<TContext> {
  timeoutMs: number;
  onTimeout: 'compensate' | 'retry' | 'alert';
}

class StepTimeoutError extends Error {
  constructor(readonly stepName: string, readonly timeoutMs: number) {
    super(`Step "${stepName}" timed out after ${timeoutMs}ms`);
    this.name = 'StepTimeoutError';
  }
}

async function executeWithTimeout<TContext>(
  step: SagaStepWithTimeout<TContext>,
  ctx: TContext
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects once the deadline passes. The step receives no abort signal, so its
  // underlying call may still finish later; idempotent steps make that safe.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new StepTimeoutError(step.name, step.timeoutMs)),
      step.timeoutMs
    );
  });

  try {
    // Whichever settles first wins: the step itself or the deadline
    await Promise.race([step.execute(ctx), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Example: Order saga with step-level timeouts
const orderSagaSteps: SagaStepWithTimeout<OrderContext>[] = [
  {
    name: 'reserve_inventory',
    execute: reserveInventory,
    compensate: releaseInventory,
    timeoutMs: 5000, // Inventory service should respond within 5s
    onTimeout: 'compensate',
  },
  {
    name: 'charge_payment',
    execute: chargePayment,
    compensate: compensateChargePayment,
    timeoutMs: 30000, // Payment processing can be slow
    onTimeout: 'retry',
    retryPolicy: { maxAttempts: 2, backoffMs: 5000, backoffMultiplier: 1 },
  },
  {
    name: 'create_shipment',
    execute: createShipment,
    compensate: cancelShipment,
    timeoutMs: 10000,
    onTimeout: 'alert', // Alert ops team, don't auto-compensate
  },
];
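The step definitions declare an `onTimeout` policy, but how an orchestrator acts on it is left open. One possible reading, sketched below, treats it as a pure decision function. The name `nextActionAfterTimeout`, the `alert_and_pause` outcome, and the rule that exhausted retries fall back to compensation are all assumptions made for this example.

```typescript
type TimeoutPolicy = 'compensate' | 'retry' | 'alert';
type NextAction = 'retry' | 'compensate' | 'alert_and_pause';

// Decides what the orchestrator does after a step times out.
// Assumption: once retries are exhausted, a 'retry' policy falls back to compensation.
function nextActionAfterTimeout(
  onTimeout: TimeoutPolicy,
  attempt: number,
  maxAttempts: number
): NextAction {
  switch (onTimeout) {
    case 'retry':
      return attempt < maxAttempts ? 'retry' : 'compensate';
    case 'alert':
      return 'alert_and_pause'; // leave the saga untouched for an operator
    case 'compensate':
      return 'compensate';
  }
}

console.log(nextActionAfterTimeout('retry', 1, 2)); // retry
console.log(nextActionAfterTimeout('retry', 2, 2)); // compensate
console.log(nextActionAfterTimeout('alert', 1, 1)); // alert_and_pause
```

Keeping the decision separate from the timing code makes the policy easy to unit test without waiting on real timers.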

Best Practice 4: Use a Dead Letter Queue for Failed Compensations

When compensation fails, you have a data consistency problem that requires human intervention. Route these to a dead letter queue with enough context for manual resolution.

typescript
interface DeadLetterEntry {
  sagaId: string;
  sagaType: string;
  failedStep: string;
  compensationError: string;
  context: Record<string, unknown>;
  completedSteps: string[];
  timestamp: Date;
  resolved: boolean;
}

async function handleCompensationFailure(
  saga: SagaState,
  step: string,
  error: Error
): Promise<void> {
  const entry: DeadLetterEntry = {
    sagaId: saga.sagaId,
    sagaType: saga.sagaType,
    failedStep: step,
    compensationError: error.message,
    context: saga.context,
    completedSteps: saga.completedSteps,
    timestamp: new Date(),
    resolved: false,
  };

  await deadLetterQueue.publish(entry);

  // Alert operations team
  await alerting.send({
    severity: 'critical',
    title: `Saga compensation failed: ${saga.sagaType}`,
    body: `Saga ${saga.sagaId} failed to compensate step "${step}". ` +
      `Manual intervention required. Completed steps: ${saga.completedSteps.join(', ')}`,
    metadata: { sagaId: saga.sagaId, step },
  });
}
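The other half of a dead letter queue is the operator workflow: listing open entries and marking them resolved. The sketch below assumes a hypothetical `InMemoryDeadLetterQueue` with a trimmed `DeadLetterEntry`; the `resolutionNote` field and the `unresolved` and `resolve` methods are additions for illustration and are not part of the code above.

```typescript
interface DeadLetterEntry {
  sagaId: string;
  failedStep: string;
  compensationError: string;
  timestamp: Date;
  resolved: boolean;
  resolutionNote?: string; // hypothetical field recording how an operator fixed it
}

// Hypothetical in-memory queue; production systems would use a durable broker or table.
class InMemoryDeadLetterQueue {
  private entries: DeadLetterEntry[] = [];

  async publish(entry: DeadLetterEntry): Promise<void> {
    this.entries.push({ ...entry });
  }

  // Entries still waiting for an operator, oldest first.
  unresolved(): DeadLetterEntry[] {
    return this.entries
      .filter(e => !e.resolved)
      .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());
  }

  // Marks every open entry for a saga as resolved and returns how many were closed.
  resolve(sagaId: string, note: string): number {
    let closed = 0;
    for (const e of this.entries) {
      if (e.sagaId === sagaId && !e.resolved) {
        e.resolved = true;
        e.resolutionNote = note;
        closed++;
      }
    }
    return closed;
  }
}

async function demoOperatorFlow(): Promise<number> {
  const dlq = new InMemoryDeadLetterQueue();
  await dlq.publish({
    sagaId: 's-1', failedStep: 'charge_payment',
    compensationError: 'refund API unavailable',
    timestamp: new Date(), resolved: false,
  });
  dlq.resolve('s-1', 'Refund issued manually');
  return dlq.unresolved().length;
}

demoOperatorFlow().then(open => console.log(open)); // 0
```

Recording a resolution note alongside the flag gives later audits a trail of how each inconsistency was repaired.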

Best Practice 5: Separate Read and Write Models for Saga State

Query the saga state without loading the full execution context. This is critical for dashboards and monitoring.

typescript
// Write model: full context for execution
interface SagaExecutionState {
  sagaId: string;
  context: Record<string, unknown>; // Can be large (order items, user data, etc.)
  steps: SagaStepState[];
}

// Read model: lightweight for queries and dashboards
interface SagaStatusView {
  sagaId: string;
  sagaType: string;
  status: string;
  currentStep: string; // name of the last completed step, or 'not_started'
  startedAt: Date;
  updatedAt: Date;
  duration: number; // milliseconds since the saga started
  stepsCompleted: number;
  totalSteps: number;
}

// Materialize read model on every state change.
// totalSteps comes from the saga definition because SagaState does not carry it.
// `db` is assumed to be an ORM client exposing a Prisma-style upsert.
async function updateSagaStatusView(state: SagaState, totalSteps: number): Promise<void> {
  const view = {
    status: state.status,
    currentStep: state.completedSteps[state.completedSteps.length - 1] ?? 'not_started',
    duration: Date.now() - state.startedAt.getTime(),
    stepsCompleted: state.completedSteps.length,
    updatedAt: new Date(),
  };

  await db.sagaStatusView.upsert({
    where: { sagaId: state.sagaId },
    update: view,
    create: {
      sagaId: state.sagaId,
      sagaType: state.sagaType,
      startedAt: state.startedAt,
      totalSteps,
      ...view, // the first write reflects the real state instead of hard-coded zeros
    },
  });
}

Anti-Patterns to Avoid

Anti-Pattern 1: Nested Sagas Without Boundaries

Never start a saga from within another saga's step. If step 3 of Saga A triggers Saga B, Saga A's compensation logic cannot reliably roll back Saga B. Instead, make Saga B a subsequent step in Saga A or use a parent orchestrator that coordinates both.
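As a sketch of the flat alternative, the example below folds what could have been a nested "billing saga" into the parent's step list, so one reverse pass compensates everything. The step names and the `runFlatSaga` helper are invented for illustration, and the steps are synchronous only to keep the example short.

```typescript
interface FlatStep {
  name: string;
  execute: () => void;
  compensate: () => void;
}

// Builds a step that records what it did; `fail` makes execute throw.
function makeStep(name: string, log: string[], fail = false): FlatStep {
  return {
    name,
    execute: () => {
      if (fail) throw new Error(`${name} failed`);
      log.push(`do:${name}`);
    },
    compensate: () => { log.push(`undo:${name}`); },
  };
}

// Runs steps in order; on failure, compensates completed steps in reverse.
function runFlatSaga(steps: FlatStep[]): boolean {
  const done: FlatStep[] = [];
  for (const step of steps) {
    try {
      step.execute();
      done.push(step);
    } catch {
      for (const s of [...done].reverse()) s.compensate();
      return false;
    }
  }
  return true;
}

const sagaLog: string[] = [];
const provisioningSteps = [makeStep('create_tenant', sagaLog), makeStep('allocate_db', sagaLog)];
// The would-be nested saga becomes ordinary steps of the parent.
const billingSteps = [makeStep('create_subscription', sagaLog), makeStep('charge_setup_fee', sagaLog, true)];

runFlatSaga([...provisioningSteps, ...billingSteps]);
console.log(sagaLog.join(', '));
// do:create_tenant, do:allocate_db, do:create_subscription, undo:create_subscription, undo:allocate_db, undo:create_tenant
```

Because the billing steps live in the same list, a failure in `charge_setup_fee` rolls back the subscription and the provisioning work in a single, ordered pass.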

Anti-Pattern 2: Compensations That Call External APIs Without Idempotency

If your compensation step calls an external payment API to issue a refund but doesn't pass an idempotency key, a retry during compensation creates a double refund. Every external call in a compensation must be idempotent.

Anti-Pattern 3: Using Saga State as a General-Purpose Database

Saga context should contain only the data needed for execution and compensation. Do not store derived data, analytics payloads, or user preferences in the saga state. Keep the context minimal — typically IDs, amounts, and timestamps.

Anti-Pattern 4: Ignoring Partial Failure in Compensation

If step 3 of 5 fails and compensation for step 2 also fails, do not silently mark the saga as "failed." Step 2's effects are still in place, or only partly undone, even though the saga has stopped. That inconsistency requires explicit handling: dead letter queues, alerts, and manual resolution workflows.
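One way to avoid the silent "failed" status is to record an outcome per compensation and derive a distinct terminal status from them. The sketch below assumes hypothetical names (`compensateAll`, `needs_manual_resolution`) and synchronous compensations for brevity.

```typescript
interface CompensableStep {
  name: string;
  compensate: () => void;
}

interface CompensationOutcome {
  step: string;
  ok: boolean;
  error?: string;
}

type RollbackStatus = 'compensated' | 'needs_manual_resolution';

// Compensates in reverse order, recording the outcome of every step instead of
// stopping at the first failure.
function compensateAll(
  completed: CompensableStep[]
): { status: RollbackStatus; outcomes: CompensationOutcome[] } {
  const outcomes: CompensationOutcome[] = [];
  for (const step of [...completed].reverse()) {
    try {
      step.compensate();
      outcomes.push({ step: step.name, ok: true });
    } catch (error) {
      outcomes.push({ step: step.name, ok: false, error: (error as Error).message });
    }
  }
  const status: RollbackStatus = outcomes.every(o => o.ok)
    ? 'compensated'
    : 'needs_manual_resolution';
  return { status, outcomes };
}

// Steps 1 and 2 completed, step 3 failed; undoing step 2 then fails as well.
const rollback = compensateAll([
  { name: 'reserve_inventory', compensate: () => {} },
  { name: 'charge_payment', compensate: () => { throw new Error('refund API unavailable'); } },
]);
console.log(rollback.status); // needs_manual_resolution
console.log(rollback.outcomes.map(o => `${o.step}:${o.ok}`).join(', '));
// charge_payment:false, reserve_inventory:true
```

The per-step outcomes are exactly the context an operator needs in a dead letter entry: which effects were undone and which are still live.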

Production Checklist

  • Every step has both execute and compensate functions
  • All execute and compensate functions are idempotent
  • Saga state is persisted after every step transition
  • Each step has an explicit timeout
  • Failed compensations route to a dead letter queue with alerts
  • Saga status is queryable without loading full execution context
  • No nested sagas — use a flat step list or parent orchestrator
  • Retry policies distinguish between transient and permanent errors
  • Monitoring dashboards show saga completion rates, average duration, and failure rates per step
  • Load tests verify saga behavior under concurrent execution
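For the checklist item on transient versus permanent errors, here is a minimal sketch of the distinction, matching on `error.name` the same way the `retryableErrors` policy field does. The error names and the `runWithRetry` helper are hypothetical, and backoff is omitted to keep the example synchronous.

```typescript
class NamedError extends Error {
  constructor(errorName: string, message: string) {
    super(message);
    this.name = errorName;
  }
}

// Hypothetical error names; use whatever your service clients actually throw.
const TRANSIENT_ERRORS = ['ServiceUnavailableError', 'TimeoutError'];

// Retries only errors whose name is listed as retryable; anything else fails at once.
function runWithRetry(
  op: () => void,
  maxAttempts: number,
  retryableErrors: string[]
): { attempts: number; succeeded: boolean } {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      op();
      return { attempts: attempt, succeeded: true };
    } catch (error) {
      const retryable = retryableErrors.includes((error as Error).name);
      if (!retryable || attempt === maxAttempts) {
        return { attempts: attempt, succeeded: false };
      }
    }
  }
  return { attempts: 0, succeeded: false }; // only reached when maxAttempts < 1
}

const transientRun = runWithRetry(
  () => { throw new NamedError('TimeoutError', 'upstream slow'); }, 3, TRANSIENT_ERRORS);
const permanentRun = runWithRetry(
  () => { throw new NamedError('CardDeclinedError', 'card declined'); }, 3, TRANSIENT_ERRORS);
console.log(transientRun.attempts, permanentRun.attempts); // 3 1
```

A declined card will not succeed on the second try, so failing fast moves the saga into compensation sooner and keeps retries for errors that can actually clear.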

Conclusion

The saga pattern is a tool for managing distributed consistency, not a silver bullet. The implementation complexity is significant — idempotent steps, persistent state, compensation chains, dead letter queues, and monitoring — and teams frequently underestimate this upfront investment. A well-implemented saga gives you reliable cross-service transactions with clear failure recovery. A poorly implemented one gives you data inconsistency with extra infrastructure to maintain.

Start with the simplest saga that solves your immediate consistency problem: two or three steps with an orchestrator, persistent state, and a dead letter queue for compensation failures. Add complexity (conditional branching, parallel steps, nested workflows) only when you have concrete requirements and operational maturity to support it.

The orchestration approach scales better for enterprise teams because it centralizes the workflow definition, making it auditable, testable, and visible to non-engineers. Choreography works for simple, loosely coupled flows but becomes a distributed debugging nightmare when you have seven services reacting to events with no central view of the overall transaction.

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
