The saga pattern addresses transaction consistency across microservices, but getting it wrong creates failure modes worse than the problem it solves. Having implemented sagas on three enterprise platforms (an e-commerce order pipeline, a financial reconciliation system, and a multi-tenant provisioning service), I've found that the practices below separate production-grade implementations from fragile prototypes.
Choosing Between Orchestration and Choreography
The first decision is whether a central orchestrator coordinates the saga (orchestration) or whether services react to each other's events independently (choreography).
Use orchestration when:
- The saga has four or more steps
- Compensation logic is complex or conditional
- You need centralized monitoring and retry policies
- Business stakeholders need to understand the flow (an orchestrator maps directly to a workflow diagram)
Use choreography when:
- The saga has two to three steps
- Services are owned by different teams with independent release cycles
- Loose coupling is more important than centralized visibility
- Each service already publishes domain events
In practice, most enterprise systems benefit from orchestration. The visibility and debuggability advantages outweigh the coupling trade-off.
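To make the orchestration option concrete, here is a minimal in-process sketch of a central coordinator. The names `Step`, `run_saga`, and the three order steps are hypothetical and chosen only for illustration. A real orchestrator would also persist state and handle failed compensations, which this sketch deliberately leaves out.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


@dataclass
class Step:
    """One saga step: a forward action and the action that undoes it."""
    name: str
    execute: Callable[[Context], None]
    compensate: Callable[[Context], None]


def run_saga(steps: List[Step], ctx: Context) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.execute(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["error"] = f"{step.name}: {exc}"
            # Undo only what actually completed, newest first.
            for done in reversed(completed):
                done.compensate(ctx)
            return "compensated"
    return "completed"


# Hypothetical order saga where the third step fails.
log: List[str] = []


def fail_shipping(ctx: Context) -> None:
    raise RuntimeError("carrier unavailable")


steps = [
    Step("reserve_stock", lambda c: log.append("reserve"), lambda c: log.append("release")),
    Step("charge_card", lambda c: log.append("charge"), lambda c: log.append("refund")),
    Step("book_shipping", fail_shipping, lambda c: log.append("cancel_shipping")),
]
result = run_saga(steps, {"order_id": "o-1"})
print(result, log)
```

Because the whole flow lives in one list of steps, it reads much like the workflow diagram a stakeholder would draw, which is the visibility argument made above.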
Best Practice 1: Make Every Step Idempotent
Every execute and compensate function must be safely re-runnable. Network failures, container restarts, and message redelivery will cause duplicate invocations.
The pattern: use a deterministic idempotency key derived from the saga context, and check current state before mutating.
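A minimal sketch of that pattern, assuming the downstream service can remember which keys it has already applied. `idempotency_key` and the in-memory `PaymentService` are hypothetical stand-ins, not a real payment API.

```python
import hashlib
from typing import Dict


def idempotency_key(saga_id: str, step_name: str, action: str) -> str:
    """Derive the same key every time for the same saga, step, and action."""
    raw = f"{saga_id}:{step_name}:{action}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class PaymentService:
    """Hypothetical in-memory stand-in for a downstream service."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}  # idempotency key -> amount in cents

    def charge(self, key: str, amount_cents: int) -> str:
        # Check current state before mutating: a repeated key is a no-op.
        if key in self.charges:
            return "already_applied"
        self.charges[key] = amount_cents
        return "applied"


payments = PaymentService()
key = idempotency_key("saga-42", "charge_card", "execute")
first = payments.charge(key, 1999)
second = payments.charge(key, 1999)  # simulated redelivery of the same message
print(first, second, len(payments.charges))
```

Including the action ("execute" or "compensate") in the key keeps a step's forward call and its compensation from colliding with each other.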
Best Practice 2: Persist Saga State at Every Step
If the orchestrator crashes mid-saga, you need to resume from the last completed step, not restart from the beginning.
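One way to sketch this is a checkpoint table that records the index of the next step to run. The example below assumes SQLite from the Python standard library as the store; the `saga_state` table and the `run_or_resume` function are hypothetical names for illustration.

```python
import json
import sqlite3
from typing import Callable, Dict, List, Tuple

SagaStep = Tuple[str, Callable[[Dict], None]]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS saga_state ("
        "saga_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL, context TEXT NOT NULL)"
    )
    return db


def run_or_resume(db: sqlite3.Connection, saga_id: str,
                  steps: List[SagaStep], initial_ctx: Dict) -> Dict:
    """Start a saga, or continue it from the last persisted checkpoint."""
    row = db.execute(
        "SELECT next_step, context FROM saga_state WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    next_step, ctx = (row[0], json.loads(row[1])) if row else (0, dict(initial_ctx))
    for index in range(next_step, len(steps)):
        _name, execute = steps[index]
        execute(ctx)
        # Checkpoint after every completed step so a restart skips it.
        db.execute(
            "INSERT OR REPLACE INTO saga_state VALUES (?, ?, ?)",
            (saga_id, index + 1, json.dumps(ctx)),
        )
        db.commit()
    return ctx


# Simulate a crash during the second step, then a restart.
calls: List[str] = []
crash_once = {"armed": True}


def reserve(ctx: Dict) -> None:
    calls.append("reserve")
    ctx["reserved"] = True


def charge(ctx: Dict) -> None:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash")
    calls.append("charge")


db = open_store()
steps = [("reserve", reserve), ("charge", charge)]
try:
    run_or_resume(db, "saga-1", steps, {"order_id": "o-1"})
except RuntimeError:
    pass  # the process "died" here
final_ctx = run_or_resume(db, "saga-1", steps, {"order_id": "o-1"})
print(calls, final_ctx)
```

Note that a crash can still land between a step finishing and its checkpoint being written, so the resumed run may repeat that step. This is why persistence complements idempotent steps rather than replacing them.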
Best Practice 3: Define Explicit Timeouts Per Step
Sagas without timeouts hang indefinitely when a downstream service stops responding.
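A small sketch of per-step timeouts, assuming steps are written as `asyncio` coroutines. The step names and the values in `STEP_TIMEOUTS_S` are made up for the demo and are not recommendations.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical per-step budgets in seconds; tune these per downstream service.
STEP_TIMEOUTS_S: Dict[str, float] = {"reserve_stock": 0.5, "charge_card": 0.05}


async def run_step(name: str, execute: Callable[[], Awaitable[None]]) -> str:
    """Run one step, giving up once its own timeout budget is spent."""
    try:
        await asyncio.wait_for(execute(), timeout=STEP_TIMEOUTS_S[name])
        return "ok"
    except asyncio.TimeoutError:
        # The outcome is unknown: the remote call may still have succeeded.
        return "timed_out"


async def fast() -> None:
    await asyncio.sleep(0)


async def hung() -> None:
    await asyncio.sleep(10)  # simulates a downstream service that never answers


async def demo() -> List[str]:
    return [
        await run_step("reserve_stock", fast),
        await run_step("charge_card", hung),
    ]


results = asyncio.run(demo())
print(results)
```

A timed-out step should be treated as "possibly applied" rather than "failed", so the compensation that follows it needs to be safe to run even if the step never took effect.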
Best Practice 4: Use a Dead Letter Queue for Failed Compensations
When compensation fails, you have a data consistency problem that requires human intervention. Route these to a dead letter queue with enough context for manual resolution.
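The sketch below shows what "enough context" might look like. It assumes an in-memory list as the queue. `DeadLetter`, `compensate_with_dlq`, and the field names are hypothetical, and a real system would publish to its message broker and add backoff between attempts.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class DeadLetter:
    """Everything an operator needs to resolve the failure by hand."""
    saga_id: str
    step: str
    error: str
    attempts: int
    context: Dict
    failed_at: float = field(default_factory=time.time)


dead_letters: List[str] = []  # stand-in for a real dead letter queue


def compensate_with_dlq(saga_id: str, step: str,
                        compensate: Callable[[Dict], None],
                        ctx: Dict, max_attempts: int = 3) -> bool:
    """Try a compensation a few times, then park it for manual resolution."""
    last_error = ""
    for _attempt in range(max_attempts):
        try:
            compensate(ctx)
            return True
        except Exception as exc:
            last_error = repr(exc)
    entry = DeadLetter(saga_id, step, last_error, max_attempts, ctx)
    dead_letters.append(json.dumps(asdict(entry)))
    return False


def broken_refund(ctx: Dict) -> None:
    raise ConnectionError("payment API down")


ok = compensate_with_dlq(
    "saga-7", "charge_card", broken_refund,
    {"order_id": "o-9", "amount_cents": 1999},
)
entry = json.loads(dead_letters[0])
print(ok, entry["step"], entry["error"])
```

The entry carries the saga ID, the step, the last error, and the saga context, so whoever picks it up does not have to reconstruct the situation from logs.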
Best Practice 5: Separate Read and Write Models for Saga State
Maintain a lightweight status record alongside the full execution state so that saga status can be queried without loading the full execution context. This is critical for dashboards and monitoring.
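One possible shape for this split, sketched with SQLite from the standard library: a write-side table holding the full context and a narrow status table that dashboards query. The table names, columns, and `record_transition` helper are assumptions for illustration.

```python
import json
import sqlite3
from typing import Dict

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE saga_execution (      -- write model: full context for the orchestrator
    saga_id TEXT PRIMARY KEY,
    context TEXT NOT NULL
);
CREATE TABLE saga_status (         -- read model: small rows for dashboards
    saga_id      TEXT PRIMARY KEY,
    status       TEXT NOT NULL,
    current_step TEXT NOT NULL,
    updated_at   TEXT NOT NULL
);
""")


def record_transition(saga_id: str, status: str, current_step: str, ctx: Dict) -> None:
    """Update both models in one transaction so they cannot drift apart."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO saga_execution VALUES (?, ?)",
            (saga_id, json.dumps(ctx)),
        )
        db.execute(
            "INSERT OR REPLACE INTO saga_status VALUES (?, ?, ?, datetime('now'))",
            (saga_id, status, current_step),
        )


def status_counts() -> Dict[str, int]:
    """Dashboard query: touches only the narrow status table."""
    rows = db.execute("SELECT status, COUNT(*) FROM saga_status GROUP BY status")
    return dict(rows.fetchall())


record_transition("s1", "running", "charge_card", {"order_id": "o-1"})
record_transition("s2", "failed", "book_shipping", {"order_id": "o-2"})
record_transition("s1", "completed", "done", {"order_id": "o-1", "charge_id": "c-1"})
counts = status_counts()
print(counts)
```

In a larger deployment the read model might live in a separate store fed by events. Updating both tables in the same transaction is simply the easiest variant to reason about.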
Anti-Patterns to Avoid
Anti-Pattern 1: Nested Sagas Without Boundaries
Never start a saga from within another saga's step. If step 3 of Saga A triggers Saga B, Saga A's compensation logic cannot reliably roll back Saga B. Instead, make Saga B a subsequent step in Saga A or use a parent orchestrator that coordinates both.
Anti-Pattern 2: Compensations That Call External APIs Without Idempotency
If your compensation step calls an external payment API to issue a refund but doesn't pass an idempotency key, a retry during compensation creates a double refund. Every external call in a compensation must be idempotent.
Anti-Pattern 3: Using Saga State as a General-Purpose Database
Saga context should contain only the data needed for execution and compensation. Do not store derived data, analytics payloads, or user preferences in the saga state. Keep the context minimal — typically IDs, amounts, and timestamps.
Anti-Pattern 4: Ignoring Partial Failure in Compensation
If step 3 of 5 fails and compensation for step 2 also fails, do not silently mark the saga as "failed." The inconsistent state between steps 1 (completed) and 2 (partially compensated) requires explicit handling — dead letter queues, alerts, and manual resolution workflows.
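One way to avoid the silent "failed" status is to give compensation failure its own terminal state and record exactly where rollback stopped. The sketch below mirrors the scenario above. `SagaStatus`, `compensate_all`, and the report fields are hypothetical, and halting at the first failed compensation is an assumption that may not suit every domain.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class SagaStatus(Enum):
    COMPENSATED = "compensated"
    COMPENSATION_FAILED = "compensation_failed"  # needs a human, not just "failed"


def compensate_all(completed: List[Tuple[str, Callable[[], None]]]) -> Dict:
    """Undo completed steps in reverse order; stop and report at the first failure."""
    pending = list(reversed(completed))
    undone: List[str] = []
    while pending:
        name, compensate = pending[0]
        try:
            compensate()
        except Exception as exc:
            return {
                "status": SagaStatus.COMPENSATION_FAILED,
                "undone": undone,
                "stuck_at": name,
                "not_attempted": [n for n, _ in pending[1:]],
                "error": str(exc),
            }
        undone.append(name)
        pending.pop(0)
    return {"status": SagaStatus.COMPENSATED, "undone": undone,
            "stuck_at": None, "not_attempted": [], "error": None}


# Steps 1 and 2 completed, step 3 failed, and undoing step 2 fails as well.
def undo_step_1() -> None:
    pass


def undo_step_2() -> None:
    raise RuntimeError("refund rejected")


report = compensate_all([("step_1", undo_step_1), ("step_2", undo_step_2)])
print(report["status"].value, report["stuck_at"], report["not_attempted"])
```

The report names the step whose compensation is stuck and the steps that are still in their completed state, which is the information an alert or manual resolution workflow needs.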
Production Checklist
- Every step has both execute and compensate functions
- All execute and compensate functions are idempotent
- Saga state is persisted after every step transition
- Each step has an explicit timeout
- Failed compensations route to a dead letter queue with alerts
- Saga status is queryable without loading full execution context
- No nested sagas — use a flat step list or parent orchestrator
- Retry policies distinguish between transient and permanent errors
- Monitoring dashboards show saga completion rates, average duration, and failure rates per step
- Load tests verify saga behavior under concurrent execution
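The checklist item on retry policies can be sketched as follows. The `TransientError` and `PermanentError` classes are hypothetical. In practice you would map your client library's exceptions or status codes onto these two categories, and the backoff values here are placeholders.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying: timeouts, connection resets, throttling."""


class PermanentError(Exception):
    """Retrying cannot help: validation failures, declined payments."""


def call_with_retry(fn: Callable[[], T], max_attempts: int = 4,
                    base_delay_s: float = 0.01,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient errors with exponential backoff; fail fast on permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermanentError:
            raise  # go straight to compensation
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


attempts = {"flaky": 0, "invalid": 0}


def flaky() -> str:
    attempts["flaky"] += 1
    if attempts["flaky"] < 3:
        raise TransientError("timeout")
    return "ok"


def invalid() -> str:
    attempts["invalid"] += 1
    raise PermanentError("card declined")


delays: List[float] = []  # record delays instead of sleeping in the demo
flaky_result = call_with_retry(flaky, sleep=delays.append)
try:
    call_with_retry(invalid, sleep=delays.append)
    permanent_raised = False
except PermanentError:
    permanent_raised = True
print(flaky_result, attempts, delays)
```

Passing `sleep` as a parameter keeps the demo fast and makes the backoff schedule easy to assert on in tests.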
Conclusion
The saga pattern is a tool for managing distributed consistency, not a silver bullet. The implementation complexity is significant — idempotent steps, persistent state, compensation chains, dead letter queues, and monitoring — and teams frequently underestimate this upfront investment. A well-implemented saga gives you reliable cross-service transactions with clear failure recovery. A poorly implemented one gives you data inconsistency with extra infrastructure to maintain.
Start with the simplest saga that solves your immediate consistency problem: two or three steps with an orchestrator, persistent state, and a dead letter queue for compensation failures. Add complexity (conditional branching, parallel steps, parent orchestrators that coordinate multiple sagas) only when you have concrete requirements and the operational maturity to support it.
The orchestration approach scales better for enterprise teams because it centralizes the workflow definition, making it auditable, testable, and visible to non-engineers. Choreography works for simple, loosely coupled flows but becomes a distributed debugging nightmare when you have seven services reacting to events with no central view of the overall transaction.