When should you use the saga pattern instead of two-phase commit?

Use sagas when your services own separate databases or when you cannot hold locks across services, which describes most microservice architectures. Two-phase commit (2PC) requires every participant to support a common commit protocol (typically XA) and to hold locks from the prepare phase until the coordinator decides, which creates latency and availability problems at scale. Sagas trade strong consistency for availability: you get eventual consistency with explicit compensation instead of blocking all participants until commit.

How do you handle saga steps that depend on external third-party APIs?

Wrap third-party API calls in an adapter that handles idempotency, timeouts, and circuit breaking. Store the external transaction ID (e.g., Stripe charge ID) in the saga context so the compensation step can reference it for refunds or cancellations. Set aggressive timeouts — third-party APIs are the most common source of saga hangs. If the third party does not support idempotency keys, implement at-most-once delivery by checking your own records before calling out.
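
The adapter described above could look roughly like the following. This is a minimal sketch under stated assumptions: `ThirdPartyPayments`, `PaymentSagaContext`, `PaymentAdapter`, and `withTimeout` are hypothetical names rather than part of any real SDK, and the in-memory map stands in for records you would keep in your own database. Circuit breaking is left out to keep it short.

```typescript
// Hypothetical third-party client; stands in for a real SDK without idempotency keys.
interface ThirdPartyPayments {
  charge(amountCents: number): Promise<{ id: string }>;
}

interface PaymentSagaContext {
  orderId: string;
  amountCents: number;
  externalChargeId?: string; // stored so the compensation step can reference it
}

type CallRecord = { status: 'pending' } | { status: 'done'; chargeId: string };

// Rejects if the wrapped promise does not settle within timeoutMs.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

class PaymentAdapter {
  // Our own record of outbound calls, keyed by order ID.
  // Assumption: in production this lives in the application database.
  private records = new Map<string, CallRecord>();

  constructor(private client: ThirdPartyPayments, private timeoutMs: number) {}

  async charge(ctx: PaymentSagaContext): Promise<void> {
    const record = this.records.get(ctx.orderId);
    if (record && record.status === 'done') {
      ctx.externalChargeId = record.chargeId; // already charged: reuse the stored ID
      return;
    }
    if (record && record.status === 'pending') {
      // An earlier attempt may or may not have reached the provider, so do not call again.
      throw new Error(`charge for order ${ctx.orderId} has an unknown outcome`);
    }
    this.records.set(ctx.orderId, { status: 'pending' }); // record intent before calling out
    const result = await withTimeout(this.client.charge(ctx.amountCents), this.timeoutMs);
    this.records.set(ctx.orderId, { status: 'done', chargeId: result.id });
    ctx.externalChargeId = result.id;
  }
}
```

Recording the attempt before the call is what makes this at-most-once: a timeout leaves the record in `pending`, and a later invocation refuses to charge again until someone reconciles the outcome with the provider.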

What database should you use for saga state storage?

Use whatever your primary application database is, typically PostgreSQL. Saga state is transactional (you need atomicity between updating step status and recording completion), so a relational database with ACID guarantees is the right fit. Do not use an eventually consistent store for saga state unless you are prepared to handle the edge cases of reading stale state during compensation. For high-throughput scenarios (>10K sagas/second), consider a dedicated instance with tuned write performance.

How do you test sagas in integration tests?

Create a test harness that lets you inject failures at specific steps. Override individual step execute/compensate functions with stubs that throw at configured points. Verify three scenarios for every saga: (1) all steps succeed, (2) failure at each step triggers correct compensation, (3) compensation failure routes to the dead letter queue. Use a test database for saga state and assert on the final state records, not just the return values.
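
A failure-injection harness along these lines might be sketched as follows. The `runSaga` function is a tiny stand-in for whatever orchestrator you are testing, and `injectFailures`, `Outcome`, and the `deadLetters` array are illustrative names introduced for this sketch, not part of any library.

```typescript
interface SagaStep<C> {
  name: string;
  execute: (ctx: C) => Promise<void>;
  compensate: (ctx: C) => Promise<void>;
}

type Outcome = 'completed' | 'compensated' | 'dead_lettered';

// Tiny stand-in for the orchestrator under test (an assumption of this sketch).
async function runSaga<C>(steps: SagaStep<C>[], ctx: C, deadLetters: string[]): Promise<Outcome> {
  const done: SagaStep<C>[] = [];
  for (const step of steps) {
    try {
      await step.execute(ctx);
      done.push(step);
    } catch {
      let outcome: Outcome = 'compensated';
      for (const prev of [...done].reverse()) {
        try {
          await prev.compensate(ctx);
        } catch {
          deadLetters.push(prev.name); // compensation failed: needs manual resolution
          outcome = 'dead_lettered';
        }
      }
      return outcome;
    }
  }
  return 'completed';
}

// Returns copies of the steps whose execute/compensate throw at the configured points.
function injectFailures<C>(
  steps: SagaStep<C>[],
  failExecute: string[],
  failCompensate: string[] = [],
): SagaStep<C>[] {
  const failingExecutes = new Set(failExecute);
  const failingCompensations = new Set(failCompensate);
  const fail = (what: string) => async (): Promise<void> => {
    throw new Error(`injected failure: ${what}`);
  };
  return steps.map((step) => ({
    ...step,
    execute: failingExecutes.has(step.name) ? fail(`execute ${step.name}`) : step.execute,
    compensate: failingCompensations.has(step.name)
      ? fail(`compensate ${step.name}`)
      : step.compensate,
  }));
}
```

With this shape, the three scenarios become three calls: the unmodified steps, `injectFailures(steps, [name])` once per step name, and `injectFailures(steps, [name], [earlierName])` to force a compensation failure. In a real suite the assertions would read the persisted saga state rather than the returned outcome.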

Saga Pattern Implementation Best Practices for Enterprise Teams

The saga pattern solves distributed transaction consistency across microservices, but getting it wrong creates failure modes worse than the problem it solves. After implementing sagas across three different enterprise platforms — an e-commerce order pipeline, a financial reconciliation system, and a multi-tenant provisioning service — these are the practices that separate production-grade implementations from fragile prototypes.

Choosing Between Orchestration and Choreography

The first decision is whether a central orchestrator coordinates the saga or whether services react to events independently (choreography).

Use orchestration when:

The saga has more than four steps
Compensation logic is complex or conditional
You need centralized monitoring and retry policies
Business stakeholders need to understand the flow (an orchestrator maps directly to a workflow diagram)

Use choreography when:

The saga has two to three steps
Services are owned by different teams with independent release cycles
Loose coupling is more important than centralized visibility
Each service already publishes domain events

In practice, most enterprise systems benefit from orchestration. The visibility and debuggability advantages outweigh the coupling trade-off.
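
For contrast with the orchestrator, here is a rough sketch of what choreography looks like for a small flow. It assumes an in-memory, synchronous `EventBus` and invented event names (`OrderPlaced`, `InventoryReserved`, and so on); a real system would publish through a message broker.

```typescript
type SagaEvent =
  | { type: 'OrderPlaced'; orderId: string }
  | { type: 'InventoryReserved'; orderId: string }
  | { type: 'PaymentCaptured'; orderId: string }
  | { type: 'PaymentFailed'; orderId: string }
  | { type: 'InventoryReleased'; orderId: string };

type Handler = (event: SagaEvent) => void;

// In-memory, synchronous bus used only for illustration.
class EventBus {
  readonly history: string[] = [];
  private handlers = new Map<string, Handler[]>();

  subscribe(type: SagaEvent['type'], handler: Handler): void {
    const existing = this.handlers.get(type) ?? [];
    this.handlers.set(type, [...existing, handler]);
  }

  publish(event: SagaEvent): void {
    this.history.push(event.type);
    const handlers = this.handlers.get(event.type) ?? [];
    handlers.forEach((handler) => handler(event));
  }
}

// Each service reacts to events on its own; no component sees the whole flow.
function wireServices(bus: EventBus, paymentSucceeds: boolean): void {
  // Inventory service: reserve stock when an order is placed
  bus.subscribe('OrderPlaced', (e) => {
    bus.publish({ type: 'InventoryReserved', orderId: e.orderId });
  });

  // Payment service: charge once stock is reserved
  bus.subscribe('InventoryReserved', (e) => {
    if (paymentSucceeds) {
      bus.publish({ type: 'PaymentCaptured', orderId: e.orderId });
    } else {
      bus.publish({ type: 'PaymentFailed', orderId: e.orderId });
    }
  });

  // Inventory service again: compensation is just another event reaction
  bus.subscribe('PaymentFailed', (e) => {
    bus.publish({ type: 'InventoryReleased', orderId: e.orderId });
  });
}
```

The only place the full sequence is visible here is `bus.history`, which exists purely for the demo. That is the visibility cost of choreography: the workflow is implied by the subscriptions rather than written down in one place.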

```typescript
// Orchestrator-based saga definition
interface SagaStep<TContext> {
  name: string;
  execute: (ctx: TContext) => Promise<void>;
  compensate: (ctx: TContext) => Promise<void>;
  retryPolicy?: RetryPolicy;
}

interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
  backoffMultiplier: number;
  retryableErrors?: string[]; // Error names that may be retried; omit to retry everything
}

// Minimal error types so the example compiles; adapt to your own error hierarchy
class SagaFailedError extends Error {
  constructor(public readonly stepName: string, public readonly originalError: Error) {
    super(`Saga failed at step "${stepName}": ${originalError.message}`);
    this.name = 'SagaFailedError';
  }
}

class CompensationFailedError extends Error {
  constructor(public readonly errors: Error[]) {
    super(`${errors.length} compensation step(s) failed`);
    this.name = 'CompensationFailedError';
  }
}

class SagaOrchestrator<TContext> {
  private steps: SagaStep<TContext>[] = [];

  addStep(step: SagaStep<TContext>): this {
    this.steps.push(step);
    return this;
  }

  async execute(ctx: TContext): Promise<void> {
    // Tracked per run so a reused orchestrator never compensates steps from an earlier saga
    const completedSteps: SagaStep<TContext>[] = [];

    for (const step of this.steps) {
      try {
        await this.executeWithRetry(step, ctx);
        completedSteps.push(step);
      } catch (error) {
        // If compensation itself fails, CompensationFailedError propagates instead
        await this.compensate(completedSteps, ctx);
        throw new SagaFailedError(step.name, error as Error);
      }
    }
  }

  private async compensate(completedSteps: SagaStep<TContext>[], ctx: TContext): Promise<void> {
    // Compensate in reverse order
    const toCompensate = [...completedSteps].reverse();
    const compensationErrors: Error[] = [];

    for (const step of toCompensate) {
      try {
        // Reuse the retry logic by presenting the compensation as the step's execute
        await this.executeWithRetry(
          { ...step, execute: step.compensate, compensate: async () => {} },
          ctx
        );
      } catch (error) {
        // Record the failure but continue: compensate as many steps as possible
        compensationErrors.push(error as Error);
      }
    }

    if (compensationErrors.length > 0) {
      throw new CompensationFailedError(compensationErrors);
    }
  }

  private async executeWithRetry(step: SagaStep<TContext>, ctx: TContext): Promise<void> {
    const policy = step.retryPolicy ?? { maxAttempts: 3, backoffMs: 100, backoffMultiplier: 2 };
    let lastError: Error | null = null;

    for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
      try {
        await step.execute(ctx);
        return;
      } catch (error) {
        lastError = error as Error;

        // Errors outside the retryable list fail immediately
        if (policy.retryableErrors && !policy.retryableErrors.includes(lastError.name)) {
          throw lastError;
        }

        if (attempt < policy.maxAttempts) {
          // Exponential backoff: backoffMs, then backoffMs * multiplier, and so on
          const delay = policy.backoffMs * Math.pow(policy.backoffMultiplier, attempt - 1);
          await new Promise(resolve => setTimeout(resolve, delay));
        }
      }
    }

    throw lastError!;
  }
}
```

Best Practice 1: Make Every Step Idempotent

Every execute and compensate function must be safely re-runnable. Network failures, container restarts, and message redelivery will cause duplicate invocations.

```typescript
// BAD: not idempotent, creates duplicate charges on retry
async function chargePaymentUnsafe(ctx: OrderContext): Promise<void> {
  await paymentService.charge(ctx.userId, ctx.amount);
}

// GOOD: idempotent via idempotency key
async function chargePayment(ctx: OrderContext): Promise<void> {
  await paymentService.charge({
    idempotencyKey: `order-${ctx.orderId}-charge`,
    userId: ctx.userId,
    amount: ctx.amount,
  });
}

// GOOD: idempotent compensation, only refund if the charge exists
async function compensateChargePayment(ctx: OrderContext): Promise<void> {
  const charge = await paymentService.getCharge(`order-${ctx.orderId}-charge`);
  if (!charge || charge.status === 'refunded') return; // Never charged or already compensated

  await paymentService.refund({
    idempotencyKey: `order-${ctx.orderId}-refund`,
    chargeId: charge.id,
  });
}
```

The pattern: use a deterministic idempotency key derived from the saga context, and check current state before mutating.
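
To make the two halves of that pattern concrete, here is a small sketch of the receiving side: a service that deduplicates on the key and checks state before mutating. `InMemoryPaymentService` and `idempotencyKey` are hypothetical names used only for illustration, and the map stands in for durable storage.

```typescript
interface Charge {
  id: string;
  amount: number;
  status: 'captured' | 'refunded';
}

// Deterministic key: the same saga step always produces the same key.
function idempotencyKey(sagaId: string, stepName: string): string {
  return `${sagaId}:${stepName}`;
}

// Hypothetical in-memory payment service, used only to illustrate the pattern.
class InMemoryPaymentService {
  private chargesByKey = new Map<string, Charge>();

  charge(key: string, amount: number): Charge {
    const existing = this.chargesByKey.get(key);
    if (existing) return existing; // duplicate invocation: return the first result

    const charge: Charge = {
      id: `ch_${this.chargesByKey.size + 1}`,
      amount,
      status: 'captured',
    };
    this.chargesByKey.set(key, charge);
    return charge;
  }

  refund(key: string): void {
    const charge = this.chargesByKey.get(key);
    // Check current state before mutating: nothing to do if never charged or already refunded
    if (!charge || charge.status === 'refunded') return;
    charge.status = 'refunded';
  }

  totalCaptured(): number {
    let total = 0;
    this.chargesByKey.forEach((charge) => {
      if (charge.status === 'captured') total += charge.amount;
    });
    return total;
  }
}
```

Because the key is derived from the saga ID and step name rather than generated per call, a retried step after a crash produces the same key and therefore the same charge.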

Best Practice 2: Persist Saga State at Every Step

If the orchestrator crashes mid-saga, you need to resume from the last completed step, not restart from the beginning.

```typescript
interface SagaState {
  sagaId: string;
  sagaType: string;
  status: 'running' | 'compensating' | 'completed' | 'failed';
  currentStep: number; // Index of the next step to execute
  context: Record<string, unknown>;
  completedSteps: string[];
  startedAt: Date;
  updatedAt: Date;
  error?: string;
}

// Storage contract assumed by this example
interface SagaStore {
  get(sagaId: string): Promise<SagaState | null>;
  save(state: SagaState): Promise<void>;
}

class PersistentSagaOrchestrator<TContext> {
  constructor(
    private store: SagaStore,
    private steps: SagaStep<TContext>[],
  ) {}

  async execute(sagaId: string, ctx: TContext): Promise<void> {
    // Check for existing state (resume after crash)
    let state = await this.store.get(sagaId);

    if (!state) {
      state = {
        sagaId,
        sagaType: 'order-fulfillment',
        status: 'running',
        currentStep: 0,
        context: ctx as Record<string, unknown>,
        completedSteps: [],
        startedAt: new Date(),
        updatedAt: new Date(),
      };
      await this.store.save(state);
    }

    // A finished saga must not run again
    if (state.status === 'completed' || state.status === 'failed') return;

    // Crashed mid-compensation: finish rolling back rather than re-executing steps
    if (state.status === 'compensating') {
      await this.compensate(state, ctx);
      return;
    }

    // Resume from the first step that has not completed
    for (let i = state.currentStep; i < this.steps.length; i++) {
      const step = this.steps[i];

      try {
        await step.execute(ctx);

        // Persist progress after every step so a crash loses at most one step of work
        state.currentStep = i + 1;
        state.completedSteps.push(step.name);
        state.updatedAt = new Date();
        await this.store.save(state);
      } catch (error) {
        state.status = 'compensating';
        state.error = (error as Error).message;
        state.updatedAt = new Date();
        await this.store.save(state);

        await this.compensate(state, ctx);
        return;
      }
    }

    state.status = 'completed';
    state.updatedAt = new Date();
    await this.store.save(state);
  }

  private async compensate(state: SagaState, ctx: TContext): Promise<void> {
    // Compensations must be idempotent: a resumed saga may repeat ones that already ran
    const toCompensate = [...state.completedSteps].reverse();

    for (const stepName of toCompensate) {
      const step = this.steps.find(s => s.name === stepName)!;
      await step.compensate(ctx);
    }

    state.status = 'failed';
    state.updatedAt = new Date();
    await this.store.save(state);
  }
}
```

Best Practice 3: Define Explicit Timeouts Per Step

Sagas without timeouts hang indefinitely when a downstream service stops responding.

```typescript
interface SagaStepWithTimeout<TContext> extends SagaStep<TContext> {
  timeoutMs: number;
  onTimeout: 'compensate' | 'retry' | 'alert';
}

// Minimal error type so the example compiles
class StepTimeoutError extends Error {
  constructor(public readonly stepName: string, public readonly timeoutMs: number) {
    super(`Step "${stepName}" timed out after ${timeoutMs}ms`);
    this.name = 'StepTimeoutError';
  }
}

async function executeWithTimeout<TContext>(
  step: SagaStepWithTimeout<TContext>,
  ctx: TContext
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects once the deadline passes. An AbortController alone would do nothing here,
  // because step.execute does not accept an abort signal.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new StepTimeoutError(step.name, step.timeoutMs)),
      step.timeoutMs
    );
  });

  try {
    // Whichever settles first wins. The step itself is not cancelled and may still
    // finish later, which is one more reason every step must be idempotent.
    await Promise.race([step.execute(ctx), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Example: Order saga with step-level timeouts
const orderSagaSteps: SagaStepWithTimeout<OrderContext>[] = [
  {
    name: 'reserve_inventory',
    execute: reserveInventory,
    compensate: releaseInventory,
    timeoutMs: 5000, // Inventory service should respond within 5s
    onTimeout: 'compensate',
  },
  {
    name: 'charge_payment',
    execute: chargePayment,
    compensate: refundPayment,
    timeoutMs: 30000, // Payment processing can be slow
    onTimeout: 'retry',
    retryPolicy: { maxAttempts: 2, backoffMs: 5000, backoffMultiplier: 1 },
  },
  {
    name: 'create_shipment',
    execute: createShipment,
    compensate: cancelShipment,
    timeoutMs: 10000,
    onTimeout: 'alert', // Alert ops team, don't auto-compensate
  },
];
```
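
The step definitions declare an `onTimeout` policy, so the orchestrator needs some rule for turning that policy into a next action. One possible mapping is sketched below. The `NextAction` names and the choice to fall back to compensation once retries are exhausted are assumptions of this sketch, not something the step definitions prescribe.

```typescript
type OnTimeout = 'compensate' | 'retry' | 'alert';
type NextAction = 'retry_step' | 'start_compensation' | 'pause_and_alert';

// Decides what the orchestrator does after a step has timed out.
// attempt is 1-based; maxAttempts comes from the step's retry policy.
function nextActionAfterTimeout(
  onTimeout: OnTimeout,
  attempt: number,
  maxAttempts: number,
): NextAction {
  switch (onTimeout) {
    case 'retry':
      // Retries are bounded; once exhausted, fall back to compensation (assumed policy)
      return attempt < maxAttempts ? 'retry_step' : 'start_compensation';
    case 'compensate':
      return 'start_compensation';
    case 'alert':
      // Leave the saga paused for a human instead of rolling back automatically
      return 'pause_and_alert';
  }
}
```

Keeping this decision in a pure function makes the timeout policy easy to unit test without running any steps.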


Best Practice 4: Use a Dead Letter Queue for Failed Compensations

When compensation fails, you have a data consistency problem that requires human intervention. Route these to a dead letter queue with enough context for manual resolution.

```typescript
interface DeadLetterEntry {
  sagaId: string;
  sagaType: string;
  failedStep: string;
  compensationError: string;
  context: Record<string, unknown>;
  completedSteps: string[];
  timestamp: Date;
  resolved: boolean;
}

// deadLetterQueue and alerting are clients provided by your application
async function handleCompensationFailure(
  saga: SagaState,
  step: string,
  error: Error
): Promise<void> {
  const entry: DeadLetterEntry = {
    sagaId: saga.sagaId,
    sagaType: saga.sagaType,
    failedStep: step,
    compensationError: error.message,
    context: saga.context,
    completedSteps: saga.completedSteps,
    timestamp: new Date(),
    resolved: false,
  };

  await deadLetterQueue.publish(entry);

  // Alert operations team
  await alerting.send({
    severity: 'critical',
    title: `Saga compensation failed: ${saga.sagaType}`,
    body: `Saga ${saga.sagaId} failed to compensate step "${step}". ` +
      `Manual intervention required. Completed steps: ${saga.completedSteps.join(', ')}`,
    metadata: { sagaId: saga.sagaId, step },
  });
}
```

Best Practice 5: Separate Read and Write Models for Saga State

Query the saga state without loading the full execution context. This is critical for dashboards and monitoring.

```typescript
// Write model: full context for execution
interface SagaExecutionState {
  sagaId: string;
  context: Record<string, unknown>; // Can be large (order items, user data, etc.)
  steps: SagaStepState[];
}

// Read model: lightweight for queries and dashboards
interface SagaStatusView {
  sagaId: string;
  sagaType: string;
  status: string;
  currentStep: string; // Name of the most recently completed step
  startedAt: Date;
  updatedAt: Date;
  duration: number; // Milliseconds since the saga started
  stepsCompleted: number;
  totalSteps: number;
}

// Materialize read model on every state change.
// totalSteps comes from the saga definition because SagaState does not record it.
async function updateSagaStatusView(state: SagaState, totalSteps: number): Promise<void> {
  // Derive the progress fields once so the create and update branches cannot drift apart
  const progress = {
    status: state.status,
    currentStep: state.completedSteps[state.completedSteps.length - 1] ?? 'not_started',
    duration: Date.now() - state.startedAt.getTime(),
    stepsCompleted: state.completedSteps.length,
    updatedAt: new Date(),
  };

  await db.sagaStatusView.upsert({
    where: { sagaId: state.sagaId },
    update: progress,
    create: {
      sagaId: state.sagaId,
      sagaType: state.sagaType,
      startedAt: state.startedAt,
      totalSteps,
      ...progress,
    },
  });
}
```

Anti-Patterns to Avoid

Anti-Pattern 1: Nested Sagas Without Boundaries

Never start a saga from within another saga's step. If step 3 of Saga A triggers Saga B, Saga A's compensation logic cannot reliably roll back Saga B. Instead, make Saga B a subsequent step in Saga A or use a parent orchestrator that coordinates both.
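
One way to make Saga B "a subsequent step in Saga A" is to splice B's steps into A's step list, so a single orchestrator owns one compensation chain. The helper below is a hedged sketch of that idea; `inlineChildSaga` is a hypothetical name, and it assumes both sagas share the same context type.

```typescript
interface SagaStep<C> {
  name: string;
  execute: (ctx: C) => Promise<void>;
  compensate: (ctx: C) => Promise<void>;
}

// Inserts the child saga's steps directly after the named parent step,
// producing one flat list that a single orchestrator can run and roll back.
function inlineChildSaga<C>(
  parent: SagaStep<C>[],
  afterStep: string,
  child: SagaStep<C>[],
): SagaStep<C>[] {
  const index = parent.findIndex((step) => step.name === afterStep);
  if (index === -1) {
    throw new Error(`unknown parent step: ${afterStep}`);
  }
  return [...parent.slice(0, index + 1), ...child, ...parent.slice(index + 1)];
}
```

Because the result is an ordinary step list, reverse-order compensation unwinds the child's steps and then the parent's earlier steps without any cross-saga coordination.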

Anti-Pattern 2: Compensations That Call External APIs Without Idempotency

If your compensation step calls an external payment API to issue a refund but doesn't pass an idempotency key, a retry during compensation creates a double refund. Every external call in a compensation must be idempotent.

Anti-Pattern 3: Using Saga State as a General-Purpose Database

Saga context should contain only the data needed for execution and compensation. Do not store derived data, analytics payloads, or user preferences in the saga state. Keep the context minimal — typically IDs, amounts, and timestamps.

Anti-Pattern 4: Ignoring Partial Failure in Compensation

If step 3 of 5 fails and compensation for step 2 also fails, do not silently mark the saga as "failed." The inconsistent state between steps 1 (completed) and 2 (partially compensated) requires explicit handling — dead letter queues, alerts, and manual resolution workflows.
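
A sketch of what explicit handling can look like: record the outcome of every compensation and give the saga a distinct terminal status when any of them fails. The status name `needs_manual_resolution` and the `CompensationResult` shape are illustrative assumptions.

```typescript
interface SagaStep<C> {
  name: string;
  execute: (ctx: C) => Promise<void>;
  compensate: (ctx: C) => Promise<void>;
}

interface CompensationResult {
  step: string;
  ok: boolean;
  error?: string;
}

interface CompensationReport {
  status: 'compensated' | 'needs_manual_resolution';
  results: CompensationResult[];
}

// Runs compensations in reverse order and reports the outcome per step,
// so a partial rollback is never collapsed into a plain "failed".
async function compensateAll<C>(completed: SagaStep<C>[], ctx: C): Promise<CompensationReport> {
  const results: CompensationResult[] = [];

  for (const step of [...completed].reverse()) {
    try {
      await step.compensate(ctx);
      results.push({ step: step.name, ok: true });
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      results.push({ step: step.name, ok: false, error: message });
    }
  }

  const allOk = results.every((result) => result.ok);
  return { status: allOk ? 'compensated' : 'needs_manual_resolution', results };
}
```

The per-step results are what an operator needs from the dead letter entry: which steps were rolled back, which were not, and why.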

Production Checklist

Conclusion

The saga pattern is a tool for managing distributed consistency, not a silver bullet. The implementation complexity is significant — idempotent steps, persistent state, compensation chains, dead letter queues, and monitoring — and teams frequently underestimate this upfront investment. A well-implemented saga gives you reliable cross-service transactions with clear failure recovery. A poorly implemented one gives you data inconsistency with extra infrastructure to maintain.

Start with the simplest saga that solves your immediate consistency problem: two or three steps with an orchestrator, persistent state, and a dead letter queue for compensation failures. Add complexity (conditional branching, parallel steps, nested workflows) only when you have concrete requirements and operational maturity to support it.

The orchestration approach scales better for enterprise teams because it centralizes the workflow definition, making it auditable, testable, and visible to non-engineers. Choreography works for simple, loosely coupled flows but becomes a distributed debugging nightmare when you have seven services reacting to events with no central view of the overall transaction.

FAQ


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
