When we migrated our monolithic order processing system to microservices, the first production incident taught us a painful lesson: without distributed transaction management, a payment could succeed while inventory reservation failed, leaving customers charged for products we could not ship. Over four months, we implemented the saga pattern to coordinate transactions across six services. This is the full account — architecture decisions, production failures, and the metrics that justified the investment.
## The System Before Sagas
Our e-commerce platform processed 12,000 orders per hour at peak. The monolith handled the entire order flow in a single database transaction:
One transaction, one database, zero coordination problems. Then we split into microservices: Order Service, Inventory Service, Payment Service, Shipping Service, Notification Service, and Analytics Service. Each owned its database. The single ACID transaction became six network calls with no atomicity guarantees.
## First Attempt: Eventual Consistency Without Sagas
Our initial approach was event-driven choreography. Each service published events, and downstream services reacted:
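A stripped-down sketch of that choreography, with an in-memory stand-in for the Kafka topics (service and event names here are illustrative):

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a Kafka-style pub/sub broker."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

bus = EventBus()
log = []

# Each service reacts to the previous service's event; nobody owns the whole flow.
def reserve_inventory(event):
    log.append(("inventory.reserved", event["order_id"]))
    bus.publish("inventory.reserved", event)

def charge_payment(event):
    log.append(("payment.charged", event["order_id"]))
    bus.publish("payment.charged", event)

def create_shipment(event):
    log.append(("shipment.created", event["order_id"]))

bus.subscribe("order.created", reserve_inventory)
bus.subscribe("inventory.reserved", charge_payment)
bus.subscribe("payment.charged", create_shipment)

bus.publish("order.created", {"order_id": "o-123"})
```

The happy path chains cleanly from event to event. The problem is that nothing coordinates what happens when a link in the chain fails.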
This worked for the happy path. The failure paths were catastrophic:
- Payment succeeded, inventory reservation failed: Customer charged, product unavailable. Required manual refund.
- Shipment creation timed out: Payment charged, inventory reserved, but no shipment record. Operations team discovered these 6-12 hours later.
- Duplicate events from Kafka rebalancing: Two inventory reservations for the same order, one product shipped twice.
In the first month after the microservices migration, we logged 847 inconsistency incidents requiring manual intervention. Support ticket volume increased 340%.
## Architecture: Orchestrated Sagas with AWS
We chose orchestration over choreography because we needed centralized visibility into the order flow and deterministic compensation ordering.
### Key Design Decisions
Decision 1: ECS-based orchestrator over Step Functions. AWS Step Functions was the obvious choice, but at 12,000 orders/hour, the per-state-transition pricing added $2,400/month. Our ECS orchestrator running on two t3.large instances costs $140/month. Step Functions would have been the right choice below 1,000 orders/hour.
Decision 2: RDS PostgreSQL for saga state. We evaluated DynamoDB but needed transactional guarantees on saga state updates. When a step completes, we update the step status and check for flow completion in a single transaction. DynamoDB's single-item transaction model did not fit this pattern without restructuring our data model.
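The pattern that ruled out DynamoDB, sketched against sqlite3 as a stand-in for PostgreSQL (table and column names are illustrative): the step-status update and the completion check must share one transaction.

```python
import sqlite3

def complete_step(conn, saga_id, step_name):
    # The status update and the "is everything done?" check share one
    # transaction, so concurrent step completions cannot both observe a
    # partially updated saga and skip the completion transition.
    with conn:
        conn.execute(
            "UPDATE saga_steps SET status = 'DONE' WHERE saga_id = ? AND step = ?",
            (saga_id, step_name))
        remaining = conn.execute(
            "SELECT COUNT(*) FROM saga_steps WHERE saga_id = ? AND status != 'DONE'",
            (saga_id,)).fetchone()[0]
        if remaining == 0:
            conn.execute("UPDATE sagas SET status = 'COMPLETED' WHERE id = ?",
                         (saga_id,))
```

DynamoDB can do transactional multi-item writes, but a read-then-conditionally-write flow like this maps awkwardly onto its model without restructuring the data.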
Decision 3: HTTP for step execution, SQS for compensation. Forward steps (execute) use synchronous HTTP calls because the orchestrator needs the result before proceeding. Compensation steps use SQS with a dead letter queue because compensation can be eventually consistent — the user already sees a failure message.
## Implementation Timeline
Week 1-2: Saga orchestrator framework. Built the core orchestrator: step sequencing, state persistence, retry logic, and compensation triggering. This was reusable across saga types.
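The orchestrator's core loop is simple in sketch form (retry logic and state persistence omitted; names are illustrative): run the steps in order, and on failure run the compensations of the completed steps in reverse.

```python
def run_saga(steps):
    """steps: list of (name, execute, compensate) callables."""
    completed = []
    for name, execute, compensate in steps:
        try:
            execute()
        except Exception:
            # Undo completed steps in reverse order -- the deterministic
            # compensation ordering that motivated orchestration.
            for _, compensate_done in reversed(completed):
                compensate_done()
            return "COMPENSATED"
        completed.append((name, compensate))
    return "COMPLETED"
```

Note the failed step itself is not compensated: it never completed, so it has nothing to undo (which is exactly why every step must be idempotent on retry).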
Week 3: Idempotency layer. Added idempotency key support to every service endpoint. This was the most tedious work — each service needed an idempotency key table, deduplication logic, and response caching for replayed requests.
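The per-endpoint shape was roughly this (a minimal in-memory sketch; production used a database table for the keys and cached responses):

```python
class IdempotentEndpoint:
    """Deduplicates requests by idempotency key and replays the cached response."""

    def __init__(self, handler):
        self.handler = handler
        self.responses = {}  # idempotency_key -> response (a DB table in production)

    def call(self, idempotency_key, payload):
        if idempotency_key in self.responses:
            # Replayed request: return the cached response, run no side effects.
            return self.responses[idempotency_key]
        response = self.handler(payload)
        self.responses[idempotency_key] = response
        return response
```

A real implementation also needs to handle the in-flight case (a second request arriving before the first finishes), which is where most of the tedium lived.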
Week 4-5: Order fulfillment saga. Implemented the specific saga for order processing, including all four steps and their compensations. Wrote integration tests that injected failures at each step.
Week 6-7: Monitoring and dead letter queue processing. Built dashboards for saga throughput, failure rates, compensation rates, and DLQ depth. Created a manual resolution UI for dead letter entries.
Week 8: Shadow mode deployment. Ran the saga orchestrator in parallel with the existing choreography system. Both processed orders, but only the choreography result was committed. We compared outcomes to validate correctness.
Week 9-12: Gradual rollout. Shifted 10% → 25% → 50% → 100% of traffic to the saga orchestrator over four weeks.
## Production Failures and Lessons
### Failure 1: Saga State Table Lock Contention
Two weeks after full rollout, saga processing latency spiked from 800ms p99 to 12 seconds during peak hours. The root cause: our saga state table had a single index on saga_id, and PostgreSQL's row-level locking was contending with our monitoring queries that scanned the table for active sagas.
Fix: Added a materialized view for monitoring queries that refreshed every 30 seconds, keeping analytical reads off the main table. Saga processing p99 dropped back to 900ms.
### Failure 2: Payment Service Timeout Cascade
The payment provider had a 45-second degradation. Our 30-second timeout triggered correctly, but 400 sagas simultaneously entered the compensation path, all trying to release inventory at once. The inventory service, already at 80% capacity, rejected 60% of the release requests.
Fix: Implemented exponential backoff with jitter on compensation retries, and added a rate limiter (200 compensations/second max) to prevent thundering herds. We also increased the compensation SQS visibility timeout to 60 seconds.
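The two mechanisms, in sketch form (parameters are illustrative; the production limiter sat in front of the SQS compensation consumer):

```python
import random

def backoff_delay(attempt, base=0.5, cap=60.0):
    # "Full jitter": a uniform draw over [0, capped exponential delay], which
    # spreads retries out so compensations do not stampede the inventory
    # service in lockstep.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class TokenBucket:
    """Caps compensation throughput (e.g. 200/second)."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now):
        # Refill proportionally to elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejected compensations go back to SQS and retry later with a fresh backoff delay, rather than hammering a service that is already saturated.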
### Failure 3: Idempotency Key Collision
We generated idempotency keys as ${orderId}-${stepName}. When a customer placed two orders within the same second (double-click on the order button), both orders shared the same orderId from our sequence generator. The second order's payment step received the cached response from the first order.
Fix: Changed the order ID to a UUID. Added a unique constraint on the (customer_id, created_at, item_hash) tuple to prevent true duplicate orders at the application level. Idempotency keys became ${orderId}-${sagaInstanceId}-${stepName}.
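After the fix, key construction looked roughly like this (illustrative Python; the function names are assumptions, not our actual code):

```python
import uuid

def new_order_id():
    # UUIDs instead of a shared sequence generator: two orders placed in the
    # same second can no longer collide on the same ID.
    return str(uuid.uuid4())

def idempotency_key(order_id, saga_instance_id, step_name):
    # Scoping the key by saga instance means a retried saga for the same
    # order cannot replay a stale cached response from an earlier instance.
    return f"{order_id}-{saga_instance_id}-{step_name}"
```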
## Results After Four Months
| Metric | Before Sagas | After Sagas | Change |
|---|---|---|---|
| Inconsistency incidents/month | 847 | 3 | -99.6% |
| Manual resolution time/month | 160 hours | 2 hours | -98.8% |
| Order processing p50 latency | 420ms | 650ms | +55% |
| Order processing p99 latency | 1.2s | 1.8s | +50% |
| Support tickets (order issues) | 1,200/month | 45/month | -96.3% |
| Failed order recovery rate | 12% (manual) | 97% (automatic) | +708% |
The latency increase was expected: we added network round-trips for saga state persistence. The consistency gains justified it. Those 847 monthly incidents each cost an average of $23 in manual resolution time and $8 in customer goodwill credits. The saga infrastructure cost $380/month (ECS, RDS, SQS), replacing roughly $26,257/month in incident costs ($19,481 in resolution time plus $6,776 in goodwill credits).
## Infrastructure Cost Breakdown
- ECS orchestrator (2x t3.large): $140/month
- RDS PostgreSQL (db.t3.medium): $165/month
- SQS (compensation queues + DLQ): $45/month
- CloudWatch (monitoring): $30/month
- Total: $380/month
## What We Would Change
1. Start with idempotency, not sagas. Half of our choreography-era incidents would have been prevented by idempotent service endpoints alone. Sagas handle the other half — multi-step coordination — but idempotency is the foundation. Build it first.
2. Use SQS for forward steps too, not just compensation. Our synchronous HTTP calls for forward steps create tight coupling. If the inventory service is slow, the orchestrator blocks. An async model with SQS for all steps would decouple latency at the cost of slightly more complex state management.
3. Build the manual resolution UI from day one. We built it in week 7, but the first compensation failures happened in week 3 of shadow mode. Those four weeks of manual SQL queries for DLQ resolution were painful and error-prone.
4. Shadow mode for longer. Two weeks of shadow mode caught three bugs. If we had run it for four weeks, we would have caught the idempotency key collision bug before it hit production.
## Conclusion
The saga pattern added latency, infrastructure, and operational complexity to our order processing pipeline. It also eliminated 99.6% of data inconsistency incidents and saved the team 158 hours of manual resolution work per month. That trade-off was unambiguously worth it.
The orchestration approach gave us something choreography never could: a single place to look when an order goes wrong. Instead of tracing events across six service logs, we query the saga state table and see exactly which step failed, what compensation ran, and what ended up in the dead letter queue. For an e-commerce platform where order integrity is revenue integrity, that visibility is essential.
The implementation was not the hard part. Making every service endpoint idempotent was. If you are considering sagas, start by auditing your services for idempotency gaps. The saga orchestrator is straightforward to build — the idempotency layer is where the real engineering effort lives.