
Event-Driven Architecture at Scale: Lessons from Production

Real-world lessons from implementing Event-Driven Architecture in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 9 min read

In early 2024, our team migrated a large e-commerce platform's order processing pipeline from synchronous REST-based communication to event-driven architecture. The system processed 15,000 orders per hour at peak, with 12 downstream services that needed to react to order lifecycle events—from payment processing through fulfillment, inventory management, and customer notifications. This is the story of what worked, what didn't, and the measurable impact on reliability and performance.

The Problem

The existing architecture used synchronous HTTP calls between services. When a customer placed an order, the checkout service made sequential calls to:

  1. Payment service → charge the card
  2. Inventory service → reserve stock
  3. Fulfillment service → create shipment
  4. Notification service → send confirmation email
  5. Analytics service → record the transaction
  6. Loyalty service → credit reward points

Each call added 50-200ms to the total response time. The p99 checkout latency was 4.2 seconds. Worse, if any downstream service was slow or unavailable, the entire checkout failed. During Black Friday 2023, the analytics service went down under load, causing a 45-minute outage that blocked all orders—even though analytics wasn't critical to completing a purchase.

Key metrics before migration:

Metric | Value
------ | -----
Checkout p50 latency | 1.8s
Checkout p99 latency | 4.2s
Monthly order failures | 2,340 (0.52% of orders)
MTTR for cascade failures | 38 minutes
Deploy coupling (services requiring coordinated deploys) | 8 of 12

Architecture Decision

We evaluated three approaches:

  1. Choreography-based EDA — services publish events, consumers react independently
  2. Orchestration-based EDA — a central orchestrator coordinates the workflow via commands
  3. Hybrid — orchestration for the critical path, choreography for non-critical side effects

We chose the hybrid approach. Payment processing and inventory reservation needed transactional guarantees and clear error handling, making orchestration the right fit. Notifications, analytics, and loyalty points were fire-and-forget side effects that worked well with choreography.

Technology Stack

  • Message broker: Amazon MSK (Managed Kafka) with 3 brokers across 3 AZs
  • Orchestrator: Custom saga implementation using a state machine library
  • Schema registry: Confluent Schema Registry with Avro serialization
  • Monitoring: Datadog for metrics, distributed tracing via OpenTelemetry

We chose Kafka over SQS because we needed event replay capability (to rebuild read models), consumer groups (multiple services consuming the same events at their own pace), and ordering guarantees (events for the same order processed in sequence).
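The ordering guarantee comes from keying events by order ID: Kafka assigns each message to a partition by hashing its key, so all events for one order land on the same partition and are consumed in sequence. A toy illustration of that property (Kafka's clients actually use murmur2 hashing; the md5-based hash here is only for demonstration):

```typescript
import { createHash } from "node:crypto";

// Same key always hashes to the same partition, which is what gives
// per-order ordering. Real Kafka clients use murmur2; md5 illustrates
// the same deterministic property.
function partitionFor(orderId: string, numPartitions: number): number {
  const digest = createHash("md5").update(orderId).digest();
  // Interpret the first 4 bytes as an unsigned int, then mod partition count.
  return digest.readUInt32BE(0) % numPartitions;
}

const p1 = partitionFor("order-1234", 32);
const p2 = partitionFor("order-1234", 32);
console.log(p1 === p2); // true: events for one order stay on one partition
```

The flip side of keying by order ID is that ordering holds only within a partition, so events for different orders may interleave, which is acceptable here.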

Implementation

Phase 1: Event Infrastructure (Weeks 1-3)

We set up MSK, configured topics, and built shared libraries for event production and consumption. Topics were organized by domain:

orders.events — 32 partitions, 7-day retention
payments.events — 16 partitions, 30-day retention
inventory.events — 16 partitions, 7-day retention
fulfillment.events — 16 partitions, 7-day retention
notifications.commands — 8 partitions, 3-day retention

Partition counts were based on expected throughput: 15K orders/hour = ~4 orders/second, but we sized for 100x growth. Each partition handles 10K+ events/sec, so 32 partitions for the orders topic was generous headroom.
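As a sanity check on that arithmetic, using only the numbers quoted above:

```typescript
// Partition sizing sanity check with the article's figures.
const ordersPerHour = 15_000;
const ordersPerSecond = ordersPerHour / 3600;            // ~4.17 events/sec
const growthFactor = 100;                                 // sized for 100x growth
const targetThroughput = ordersPerSecond * growthFactor;  // ~417 events/sec
const perPartitionCapacity = 10_000;                      // events/sec per partition

// Throughput alone needs just one partition; 32 is headroom, and it also
// caps how many consumer instances can read the topic in parallel.
const partitionsNeeded = Math.ceil(targetThroughput / perPartitionCapacity);
console.log({ ordersPerSecond, targetThroughput, partitionsNeeded });
```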

The shared event library enforced the envelope structure:

typescript
interface OrderEvent {
  eventId: string;
  eventType: string;
  eventVersion: number;
  orderId: string;
  timestamp: string;
  correlationId: string;
  data: Record<string, unknown>;
}
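A minimal sketch of how such a library might stamp the envelope fields, so producers only supply the business data. The `makeEvent` helper is hypothetical, not the team's actual API, and the envelope interface is repeated here to keep the sketch self-contained:

```typescript
import { randomUUID } from "node:crypto";

// Repeats the envelope interface above so this sketch stands alone.
interface OrderEvent {
  eventId: string;
  eventType: string;
  eventVersion: number;
  orderId: string;
  timestamp: string;
  correlationId: string;
  data: Record<string, unknown>;
}

// Hypothetical factory: producers pass the business fields; the library
// stamps identity, time, and correlation metadata consistently.
function makeEvent(
  eventType: string,
  orderId: string,
  data: Record<string, unknown>,
  correlationId?: string,
): OrderEvent {
  return {
    eventId: randomUUID(),
    eventType,
    eventVersion: 1,
    orderId,
    timestamp: new Date().toISOString(),
    // Reuse the caller's correlation ID to stitch traces across services,
    // or start a new one at the edge of the system.
    correlationId: correlationId ?? randomUUID(),
    data,
  };
}

const evt = makeEvent("OrderPlaced", "order-1234", { total: 99.5 });
```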

Phase 2: Dual-Write Migration (Weeks 4-6)

Rather than a big-bang cutover, we ran the old synchronous calls and new events in parallel. The checkout service continued making HTTP calls but also published events:

typescript
async function placeOrder(order: Order): Promise<OrderResult> {
  // Critical path — still synchronous
  const payment = await paymentService.charge(order);
  const reservation = await inventoryService.reserve(order);

  // Publish event for non-critical consumers
  await kafka.produce("orders.events", {
    eventType: "OrderPlaced",
    orderId: order.id,
    data: { ...order, paymentId: payment.id, reservationId: reservation.id },
  });

  // Old path — still active, will be removed
  await notificationService.sendConfirmation(order); // Will become event consumer
  await analyticsService.trackOrder(order); // Will become event consumer
  await loyaltyService.creditPoints(order); // Will become event consumer

  return { orderId: order.id, status: "confirmed" };
}

During this phase, we validated that event consumers produced the same results as the synchronous calls. We compared notification emails, analytics records, and loyalty point credits between the two paths. After 2 weeks of consistent results, we removed the synchronous calls to non-critical services.
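The comparison step can be sketched as a diff keyed by order ID: any order missing from either path, or produced with a different payload, gets flagged. This comparator is illustrative, not the team's actual tooling:

```typescript
// Diff the synchronous path's output against the event consumer's output,
// keyed by orderId. Mismatches block removal of the old path.
function diffPaths(
  syncRecords: Map<string, unknown>,
  eventRecords: Map<string, unknown>,
): string[] {
  const mismatches: string[] = [];
  for (const [orderId, syncValue] of syncRecords) {
    const eventValue = eventRecords.get(orderId);
    if (eventValue === undefined) {
      mismatches.push(`${orderId}: missing from event path`);
    } else if (JSON.stringify(syncValue) !== JSON.stringify(eventValue)) {
      mismatches.push(`${orderId}: payloads differ`);
    }
  }
  for (const orderId of eventRecords.keys()) {
    if (!syncRecords.has(orderId)) {
      mismatches.push(`${orderId}: missing from sync path`);
    }
  }
  return mismatches;
}
```

One practical wrinkle: fields that legitimately differ between paths (timestamps, generated IDs) need to be stripped before comparing, or every record shows up as a false mismatch.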

Phase 3: Saga Orchestrator (Weeks 7-10)

We replaced the synchronous payment and inventory calls with an orchestrated saga:

typescript
const orderSaga = createSaga("OrderPlacement", [
  {
    step: "reserveInventory",
    execute: (ctx) => publishCommand("inventory.commands", "ReserveStock", ctx.order),
    compensate: (ctx) => publishCommand("inventory.commands", "ReleaseStock", ctx.order),
    timeout: 5000,
  },
  {
    step: "processPayment",
    execute: (ctx) => publishCommand("payments.commands", "ChargeCard", ctx.order),
    compensate: (ctx) => publishCommand("payments.commands", "RefundCharge", ctx.order),
    timeout: 10000,
  },
  {
    step: "confirmOrder",
    execute: (ctx) => publishEvent("orders.events", "OrderConfirmed", ctx.order),
    // No compensation — this is the final step
    timeout: 3000,
  },
]);

The saga orchestrator maintained state in a PostgreSQL table, making it resilient to crashes. If the service restarted mid-saga, it resumed from the last completed step.
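Resuming only requires the ordered step list and the index of the last step recorded as complete. A simplified sketch of that recovery logic, with the PostgreSQL persistence abstracted into a plain object (the shape and field names here are assumptions, not the actual schema):

```typescript
type SagaStatus = "running" | "completed" | "compensating";

interface PersistedSaga {
  sagaId: string;
  steps: string[];             // ordered step names, as defined in createSaga
  lastCompletedIndex: number;  // -1 if no step has finished yet
  status: SagaStatus;
}

// On restart, derive the plan purely from persisted state: either the
// remaining steps to execute, or the completed steps to undo newest-first.
function resumePlan(saga: PersistedSaga): { action: string; steps: string[] } {
  if (saga.status === "compensating") {
    const done = saga.steps.slice(0, saga.lastCompletedIndex + 1);
    return { action: "compensate", steps: done.reverse() };
  }
  const remaining = saga.steps.slice(saga.lastCompletedIndex + 1);
  return remaining.length > 0
    ? { action: "execute", steps: remaining }
    : { action: "done", steps: [] };
}
```

The reversal matters: compensations must run in the opposite order of execution, so a refund happens before the stock release that preceded the charge is undone.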

Phase 4: Remove Synchronous Calls (Weeks 11-12)

With all consumers running as event-driven services and the saga handling the critical path, we removed the remaining synchronous calls. The checkout service's response time dropped immediately—it only waited for the saga's first step (inventory reservation) before returning a pending status to the customer.


What Went Wrong

Problem 1: Consumer Lag During Deployment

When we deployed a consumer update, the rolling restart caused a Kafka rebalance. During rebalancing, all partitions were revoked and reassigned, creating a 30-second processing gap. With 4 events/second, that's 120 events delayed.

Fix: Switched to CooperativeStickyAssignor, which only moves partitions that need reassignment. Rebalancing downtime dropped from 30 seconds to under 2 seconds.

Problem 2: Schema Evolution Breaking Consumers

A developer added a required field to the OrderPlaced event without updating all consumers. The new field was shippingMethod, added as required in the Avro schema. Old events (replayed during a consumer rebuild) didn't have this field, causing deserialization failures.

Fix: Enforced backward compatibility in the schema registry. New schemas must be compatible with the previous version. Required fields can only be added with default values. We added a CI check that runs schema compatibility verification before merging.
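The backward-compatibility rule boils down to this: a reader on the new schema must still decode old records, so any field added to the schema needs a default. A deliberately simplified check in that spirit (real Avro compatibility also covers type promotions, aliases, and removals; in CI you would call the schema registry's own compatibility check rather than roll your own):

```typescript
interface AvroField {
  name: string;
  type: string;
  default?: unknown;
}

// Simplified backward-compatibility rule: every field present in the new
// schema but absent from the old one must declare a default, or replayed
// old records will fail to deserialize.
function addedFieldsMissingDefaults(
  oldFields: AvroField[],
  newFields: AvroField[],
): string[] {
  const oldNames = new Set(oldFields.map((f) => f.name));
  return newFields
    .filter((f) => !oldNames.has(f.name) && f.default === undefined)
    .map((f) => f.name);
}
```

Applied to the incident above, a schema adding `shippingMethod` without a default would be rejected before merge instead of failing at replay time.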

Problem 3: Idempotency Gap

During a Kafka broker failover, some events were delivered twice. The notification service sent duplicate order confirmation emails to ~200 customers.

Fix: Added idempotent processing using the eventId as a deduplication key. Each consumer checks a Redis set before processing:

typescript
async function processIdempotent(
  event: OrderEvent,
  handler: (event: OrderEvent) => Promise<void>,
): Promise<void> {
  const dedupKey = `processed:${consumerGroup}:${event.eventId}`;
  // SET with NX returns null if the key already exists; EX expires it after 24h
  const isNew = await redis.set(dedupKey, "1", "NX", "EX", 86400);

  if (!isNew) return; // Already processed
  await handler(event);
}

Results

After 12 weeks, the migration was complete. Here are the measured results:

Metric | Before | After | Change
------ | ------ | ----- | ------
Checkout p50 latency | 1.8s | 0.4s | -78%
Checkout p99 latency | 4.2s | 1.1s | -74%
Monthly order failures | 2,340 | 312 | -87%
MTTR for cascade failures | 38 min | 4 min | -89%
Deploy coupling | 8 services | 2 services | -75%
Non-critical service outage impact | Full checkout failure | Zero checkout impact | Eliminated

The most impactful change was decoupling non-critical services. When the analytics service went down during a load test, orders continued processing without interruption. The analytics consumer simply caught up when it recovered—processing the backlog in 3 minutes.

Cost Impact

  • MSK cluster: $1,200/month (3 brokers, m5.large)
  • Removed: API gateway costs for inter-service calls ($400/month)
  • Removed: Circuit breaker infrastructure ($200/month)
  • Net cost increase: $600/month for significantly improved reliability

Team Velocity Impact

Independent deployability was the biggest velocity gain. Before EDA, changing the checkout flow required coordinated deploys across 8 services. After, the checkout team deploys independently. Other teams consume events at their own pace, deploying on their own schedules. Sprint velocity (measured by story points delivered) increased 23% across the three teams most affected by the migration.

Key Lessons

  1. Dual-write migration is essential. Running old and new paths in parallel catches inconsistencies before they affect customers. Budget 2-3 weeks for validation.

  2. Choreography for side effects, orchestration for transactions. Don't force one pattern everywhere. Critical business transactions need the explicit error handling that sagas provide.

  3. Schema governance pays for itself immediately. Our one schema-breaking incident took 4 hours to resolve. The CI check that prevents it took 2 hours to implement.

  4. Idempotency isn't optional. Every consumer will eventually receive duplicates. Build deduplication in from day one, not after the first incident.

  5. Consumer lag is your primary health metric. It tells you whether the system is keeping up before anything visibly breaks. Alert aggressively.
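That lag metric is simply the gap between each partition's newest offset and the consumer group's committed offset, summed across partitions. A minimal sketch (in practice the numbers come from Kafka's admin API or a monitoring integration such as Datadog's):

```typescript
interface PartitionOffsets {
  partition: number;
  logEndOffset: number;    // newest offset the broker holds
  committedOffset: number; // consumer group's committed position
}

// Total lag for a consumer group: sum of per-partition gaps.
// A steadily growing value means the consumer is falling behind;
// alert on the trend, not just the absolute number.
function totalLag(partitions: PartitionOffsets[]): number {
  return partitions.reduce(
    (sum, p) => sum + Math.max(0, p.logEndOffset - p.committedOffset),
    0,
  );
}
```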

Conclusion

Migrating to event-driven architecture reduced our checkout latency by 78%, order failures by 87%, and eliminated cascade failures from non-critical services. The hybrid approach—saga orchestration for the critical payment and inventory path, choreography for everything else—gave us transactional safety where it mattered and decoupling everywhere else.

The migration took 12 weeks with a team of 4 engineers. The first 3 weeks were infrastructure setup, the next 3 were dual-write validation, and the final 6 were saga implementation and cutover. If we did it again, we'd enforce schema compatibility from day one and implement idempotent consumers before the first event was published—both lessons we learned the hard way.
