When we migrated our monolithic order processing system to microservices, the first production incident taught us a painful lesson: without distributed transaction management, a payment could succeed while inventory reservation failed, leaving customers charged for products we could not ship. Over four months, we implemented the saga pattern to coordinate transactions across six services. This is the full account — architecture decisions, production failures, and the metrics that justified the investment.
## The System Before Sagas
Our e-commerce platform processed 12,000 orders per hour at peak. The monolith handled the entire order flow in a single database transaction:
One transaction, one database, zero coordination problems. Then we split into microservices: Order Service, Inventory Service, Payment Service, Shipping Service, Notification Service, and Analytics Service. Each owned its database. The single ACID transaction became six network calls with no atomicity guarantees.
## First Attempt: Eventual Consistency Without Sagas
Our initial approach was event-driven choreography. Each service published events, and downstream services reacted:
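A stripped-down sketch of that choreography, with an in-memory stand-in for the Kafka topics (service and event names here are illustrative):

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a Kafka-style pub/sub broker."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

bus = EventBus()
log = []

# Each service reacts to the previous service's event; nobody owns the whole flow.
def reserve_inventory(event):
    log.append(("inventory.reserved", event["order_id"]))
    bus.publish("inventory.reserved", event)

def charge_payment(event):
    log.append(("payment.charged", event["order_id"]))
    bus.publish("payment.charged", event)

def create_shipment(event):
    log.append(("shipment.created", event["order_id"]))

bus.subscribe("order.created", reserve_inventory)
bus.subscribe("inventory.reserved", charge_payment)
bus.subscribe("payment.charged", create_shipment)

bus.publish("order.created", {"order_id": "o-123"})
```

The happy path chains cleanly from event to event. The problem is that nothing coordinates what happens when a link in the chain fails.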
This worked for the happy path. The failure paths were catastrophic:
- Payment succeeded, inventory reservation failed: Customer charged, product unavailable. Required manual refund.
- Shipment creation timed out: Payment charged, inventory reserved, but no shipment record. Operations team discovered these 6-12 hours later.
- Duplicate events from Kafka rebalancing: Two inventory reservations for the same order, one product shipped twice.
In the first month after the microservices migration, we logged 847 inconsistency incidents requiring manual intervention. Support ticket volume increased 340%.
## Architecture: Orchestrated Sagas with AWS
We chose orchestration over choreography because we needed centralized visibility into the order flow and deterministic compensation ordering.
### Key Design Decisions
Decision 1: ECS-based orchestrator over Step Functions. AWS Step Functions was the obvious choice, but at 12,000 orders/hour, the per-state-transition pricing added $2,400/month. Our ECS orchestrator running on two t3.large instances costs $140/month. Step Functions would have been the right choice below 1,000 orders/hour.
Decision 2: RDS PostgreSQL for saga state. We evaluated DynamoDB but needed transactional guarantees on saga state updates. When a step completes, we update the step status and check for flow completion in a single transaction. DynamoDB's single-item transaction model did not fit this pattern without restructuring our data model.
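The pattern that ruled out DynamoDB, sketched against sqlite3 as a stand-in for PostgreSQL (table and column names are illustrative): the step-status update and the completion check must share one transaction.

```python
import sqlite3

def complete_step(conn, saga_id, step_name):
    # The status update and the "is everything done?" check share one
    # transaction, so concurrent step completions cannot both observe a
    # partially updated saga and skip the completion transition.
    with conn:
        conn.execute(
            "UPDATE saga_steps SET status = 'DONE' WHERE saga_id = ? AND step = ?",
            (saga_id, step_name))
        remaining = conn.execute(
            "SELECT COUNT(*) FROM saga_steps WHERE saga_id = ? AND status != 'DONE'",
            (saga_id,)).fetchone()[0]
        if remaining == 0:
            conn.execute("UPDATE sagas SET status = 'COMPLETED' WHERE id = ?",
                         (saga_id,))
```

DynamoDB can do transactional multi-item writes, but a read-then-conditionally-write flow like this maps awkwardly onto its model without restructuring the data.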
Decision 3: HTTP for step execution, SQS for compensation. Forward steps (execute) use synchronous HTTP calls because the orchestrator needs the result before proceeding. Compensation steps use SQS with a dead letter queue because compensation can be eventually consistent — the user already sees a failure message.
## Implementation Timeline
Week 1-2: Saga orchestrator framework. Built the core orchestrator: step sequencing, state persistence, retry logic, and compensation triggering. This was reusable across saga types.
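The orchestrator's core loop is simple in sketch form (retry logic and state persistence omitted; names are illustrative): run the steps in order, and on failure run the compensations of the completed steps in reverse.

```python
def run_saga(steps):
    """steps: list of (name, execute, compensate) callables."""
    completed = []
    for name, execute, compensate in steps:
        try:
            execute()
        except Exception:
            # Undo completed steps in reverse order -- the deterministic
            # compensation ordering that motivated orchestration.
            for _, compensate_done in reversed(completed):
                compensate_done()
            return "COMPENSATED"
        completed.append((name, compensate))
    return "COMPLETED"
```

Note the failed step itself is not compensated: it never completed, so it has nothing to undo (which is exactly why every step must be idempotent on retry).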
Week 3: Idempotency layer. Added idempotency key support to every service endpoint. This was the most tedious work — each service needed an idempotency key table, deduplication logic, and response caching for replayed requests.
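The per-endpoint shape was roughly this (a minimal in-memory sketch; production used a database table for the keys and cached responses):

```python
class IdempotentEndpoint:
    """Deduplicates requests by idempotency key and replays the cached response."""

    def __init__(self, handler):
        self.handler = handler
        self.responses = {}  # idempotency_key -> response (a DB table in production)

    def call(self, idempotency_key, payload):
        if idempotency_key in self.responses:
            # Replayed request: return the cached response, run no side effects.
            return self.responses[idempotency_key]
        response = self.handler(payload)
        self.responses[idempotency_key] = response
        return response
```

A real implementation also needs to handle the in-flight case (a second request arriving before the first finishes), which is where most of the tedium lived.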
Week 4-5: Order fulfillment saga. Implemented the specific saga for order processing, including all four steps and their compensations. Wrote integration tests that injected failures at each step.
Week 6-7: Monitoring and dead letter queue processing. Built dashboards for saga throughput, failure rates, compensation rates, and DLQ depth. Created a manual resolution UI for dead letter entries.
Week 8: Shadow mode deployment. Ran the saga orchestrator in parallel with the existing choreography system. Both processed orders, but only the choreography result was committed. We compared outcomes to validate correctness.
Week 9-12: Gradual rollout. Shifted 10% → 25% → 50% → 100% of traffic to the saga orchestrator over four weeks.
## Production Failures and Lessons
### Failure 1: Saga State Table Lock Contention
Two weeks after full rollout, saga processing latency spiked from 800ms p99 to 12 seconds during peak hours. The root cause: our saga state table had a single index on saga_id, and PostgreSQL's row-level locking was contending with our monitoring queries that scanned the table for active sagas.
Fix: Added a materialized view for monitoring queries that refreshed every 30 seconds, keeping analytical reads off the main table. Saga processing p99 dropped back to 900ms.
### Failure 2: Payment Service Timeout Cascade
The payment provider had a 45-second degradation. Our 30-second timeout triggered correctly, but 400 sagas simultaneously entered the compensation path, all trying to release inventory at once. The inventory service, already at 80% capacity, rejected 60% of the release requests.
Fix: Implemented exponential backoff with jitter on compensation retries, and added a rate limiter (200 compensations/second max) to prevent thundering herds. We also increased the compensation SQS visibility timeout to 60 seconds.
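The two mechanisms, in sketch form (parameters are illustrative; the production limiter sat in front of the SQS compensation consumer):

```python
import random

def backoff_delay(attempt, base=0.5, cap=60.0):
    # "Full jitter": a uniform draw over [0, capped exponential delay], which
    # spreads retries out so compensations do not stampede the inventory
    # service in lockstep.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class TokenBucket:
    """Caps compensation throughput (e.g. 200/second)."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now):
        # Refill proportionally to elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejected compensations go back to SQS and retry later with a fresh backoff delay, rather than hammering a service that is already saturated.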
### Failure 3: Idempotency Key Collision
We generated idempotency keys as ${orderId}-${stepName}. When a customer placed two orders within the same second (double-click on the order button), both orders shared the same orderId from our sequence generator. The second order's payment step received the cached response from the first order.
Fix: Changed the order ID to a UUID. Added a unique constraint on the (customer_id, created_at, item_hash) tuple to prevent true duplicate orders at the application level. Idempotency keys became ${orderId}-${sagaInstanceId}-${stepName}.
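After the fix, key construction looked roughly like this (illustrative Python; the function names are assumptions, not our actual code):

```python
import uuid

def new_order_id():
    # UUIDs instead of a shared sequence generator: two orders placed in the
    # same second can no longer collide on the same ID.
    return str(uuid.uuid4())

def idempotency_key(order_id, saga_instance_id, step_name):
    # Scoping the key by saga instance means a retried saga for the same
    # order cannot replay a stale cached response from an earlier instance.
    return f"{order_id}-{saga_instance_id}-{step_name}"
```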
## Results After Four Months
| Metric | Before Sagas | After Sagas | Change |
|---|---|---|---|
| Inconsistency incidents/month | 847 | 3 | -99.6% |
| Manual resolution time/month | 160 hours | 2 hours | -98.8% |
| Order processing p50 latency | 420ms | 650ms | +55% |
| Order processing p99 latency | 1.2s | 1.8s | +50% |
| Support tickets (order issues) | 1,200/month | 45/month | -96.3% |
| Failed order recovery rate | 12% (manual) | 97% (automatic) | +708% |
The latency increase was expected: we added network round-trips for saga state persistence. The consistency gains justified it. Those 847 monthly incidents each cost an average of $23 in manual resolution time and $8 in customer goodwill credits. The saga infrastructure cost $380/month (ECS, RDS, SQS), replacing roughly $26,257/month in incident costs ($19,481 in resolution time plus $6,776 in goodwill credits).
## Infrastructure Cost Breakdown
- ECS orchestrator (2x t3.large): $140/month
- RDS PostgreSQL (db.t3.medium): $165/month
- SQS (compensation queues + DLQ): $45/month
- CloudWatch (monitoring): $30/month
- Total: $380/month
## What We Would Change
1. Start with idempotency, not sagas. Half of our choreography-era incidents would have been prevented by idempotent service endpoints alone. Sagas handle the other half — multi-step coordination — but idempotency is the foundation. Build it first.
2. Use SQS for forward steps too, not just compensation. Our synchronous HTTP calls for forward steps create tight coupling. If the inventory service is slow, the orchestrator blocks. An async model with SQS for all steps would decouple latency at the cost of slightly more complex state management.
3. Build the manual resolution UI from day one. We built it in week 7, but the first compensation failures happened in week 3 of shadow mode. Those four weeks of manual SQL queries for DLQ resolution were painful and error-prone.
4. Shadow mode for longer. Two weeks of shadow mode caught three bugs. If we had run it for four weeks, we would have caught the idempotency key collision bug before it hit production.
## Conclusion
The saga pattern added latency, infrastructure, and operational complexity to our order processing pipeline. It also eliminated 99.6% of data inconsistency incidents and saved the team 158 hours of manual resolution work per month. That trade-off was unambiguously worth it.
The orchestration approach gave us something choreography never could: a single place to look when an order goes wrong. Instead of tracing events across six service logs, we query the saga state table and see exactly which step failed, what compensation ran, and what ended up in the dead letter queue. For an e-commerce platform where order integrity is revenue integrity, that visibility is essential.
The implementation was not the hard part. Making every service endpoint idempotent was. If you are considering sagas, start by auditing your services for idempotency gaps. The saga orchestrator is straightforward to build — the idempotency layer is where the real engineering effort lives.