
Event-Driven Architecture at Scale: Lessons from Production

Real-world lessons from implementing Event-Driven Architecture in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 9 min read

In early 2024, our team migrated a large e-commerce platform's order processing pipeline from synchronous REST-based communication to event-driven architecture. The system processed 15,000 orders per hour at peak, with 12 downstream services that needed to react to order lifecycle events—from payment processing through fulfillment, inventory management, and customer notifications. This is the story of what worked, what didn't, and the measurable impact on reliability and performance.

The Problem

The existing architecture used synchronous HTTP calls between services. When a customer placed an order, the checkout service made sequential calls to:

  1. Payment service → charge the card
  2. Inventory service → reserve stock
  3. Fulfillment service → create shipment
  4. Notification service → send confirmation email
  5. Analytics service → record the transaction
  6. Loyalty service → credit reward points

Each call added 50-200ms to the total response time. The p99 checkout latency was 4.2 seconds. Worse, if any downstream service was slow or unavailable, the entire checkout failed. During Black Friday 2023, the analytics service went down under load, causing a 45-minute outage that blocked all orders—even though analytics wasn't critical to completing a purchase.

Key metrics before migration:

Metric | Value
------ | -----
Checkout p50 latency | 1.8s
Checkout p99 latency | 4.2s
Monthly order failures | 2,340 (0.52% of orders)
MTTR for cascade failures | 38 minutes
Deploy coupling (services requiring coordinated deploys) | 8 of 12

Architecture Decision

We evaluated three approaches:

  1. Choreography-based EDA — services publish events, consumers react independently
  2. Orchestration-based EDA — a central orchestrator coordinates the workflow via commands
  3. Hybrid — orchestration for the critical path, choreography for non-critical side effects

We chose the hybrid approach. Payment processing and inventory reservation needed transactional guarantees and clear error handling, making orchestration the right fit. Notifications, analytics, and loyalty points were fire-and-forget side effects that worked well with choreography.

Technology Stack

  • Message broker: Amazon MSK (Managed Kafka) with 3 brokers across 3 AZs
  • Orchestrator: Custom saga implementation using a state machine library
  • Schema registry: Confluent Schema Registry with Avro serialization
  • Monitoring: Datadog for metrics, distributed tracing via OpenTelemetry

We chose Kafka over SQS because we needed event replay capability (to rebuild read models), consumer groups (multiple services consuming the same events at their own pace), and ordering guarantees (events for the same order processed in sequence).
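The ordering guarantee comes from keying events by order ID: Kafka assigns each message to a partition by hashing its key, so all events for one order land on the same partition and are consumed in sequence. A toy illustration of that property (Kafka's clients actually use murmur2 hashing; the md5-based hash here is only for demonstration):

```typescript
import { createHash } from "node:crypto";

// Same key always hashes to the same partition, which is what gives
// per-order ordering. Real Kafka clients use murmur2; md5 illustrates
// the same deterministic property.
function partitionFor(orderId: string, numPartitions: number): number {
  const digest = createHash("md5").update(orderId).digest();
  // Interpret the first 4 bytes as an unsigned int, then mod partition count.
  return digest.readUInt32BE(0) % numPartitions;
}

const p1 = partitionFor("order-1234", 32);
const p2 = partitionFor("order-1234", 32);
console.log(p1 === p2); // true: events for one order stay on one partition
```

The flip side of keying by order ID is that ordering holds only within a partition, so events for different orders may interleave, which is acceptable here.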

Implementation

Phase 1: Event Infrastructure (Weeks 1-3)

We set up MSK, configured topics, and built shared libraries for event production and consumption. Topics were organized by domain:

orders.events — 32 partitions, 7-day retention
payments.events — 16 partitions, 30-day retention
inventory.events — 16 partitions, 7-day retention
fulfillment.events — 16 partitions, 7-day retention
notifications.commands — 8 partitions, 3-day retention

Partition counts were based on expected throughput: 15K orders/hour = ~4 orders/second, but we sized for 100x growth. Each partition handles 10K+ events/sec, so 32 partitions for the orders topic was generous headroom.
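As a sanity check on that arithmetic, using only the numbers quoted above:

```typescript
// Partition sizing sanity check with the article's figures.
const ordersPerHour = 15_000;
const ordersPerSecond = ordersPerHour / 3600;            // ~4.17 events/sec
const growthFactor = 100;                                 // sized for 100x growth
const targetThroughput = ordersPerSecond * growthFactor;  // ~417 events/sec
const perPartitionCapacity = 10_000;                      // events/sec per partition

// Throughput alone needs just one partition; 32 is headroom, and it also
// caps how many consumer instances can read the topic in parallel.
const partitionsNeeded = Math.ceil(targetThroughput / perPartitionCapacity);
console.log({ ordersPerSecond, targetThroughput, partitionsNeeded });
```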

The shared event library enforced the envelope structure:

typescript
interface OrderEvent {
  eventId: string;
  eventType: string;
  eventVersion: number;
  orderId: string;
  timestamp: string;
  correlationId: string;
  data: Record<string, unknown>;
}
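A minimal sketch of how such a library might stamp the envelope fields, so producers only supply the business data. The `makeEvent` helper is hypothetical, not the team's actual API, and the envelope interface is repeated here to keep the sketch self-contained:

```typescript
import { randomUUID } from "node:crypto";

// Repeats the envelope interface above so this sketch stands alone.
interface OrderEvent {
  eventId: string;
  eventType: string;
  eventVersion: number;
  orderId: string;
  timestamp: string;
  correlationId: string;
  data: Record<string, unknown>;
}

// Hypothetical factory: producers pass the business fields; the library
// stamps identity, time, and correlation metadata consistently.
function makeEvent(
  eventType: string,
  orderId: string,
  data: Record<string, unknown>,
  correlationId?: string,
): OrderEvent {
  return {
    eventId: randomUUID(),
    eventType,
    eventVersion: 1,
    orderId,
    timestamp: new Date().toISOString(),
    // Reuse the caller's correlation ID to stitch traces across services,
    // or start a new one at the edge of the system.
    correlationId: correlationId ?? randomUUID(),
    data,
  };
}

const evt = makeEvent("OrderPlaced", "order-1234", { total: 99.5 });
```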

Phase 2: Dual-Write Migration (Weeks 4-6)

Rather than a big-bang cutover, we ran the old synchronous calls and new events in parallel. The checkout service continued making HTTP calls but also published events:

typescript
async function placeOrder(order: Order): Promise<OrderResult> {
  // Critical path — still synchronous
  const payment = await paymentService.charge(order);
  const reservation = await inventoryService.reserve(order);

  // Publish event for non-critical consumers
  await kafka.produce("orders.events", {
    eventType: "OrderPlaced",
    orderId: order.id,
    data: { ...order, paymentId: payment.id, reservationId: reservation.id },
  });

  // Old path — still active, will be removed
  await notificationService.sendConfirmation(order); // Will become event consumer
  await analyticsService.trackOrder(order); // Will become event consumer
  await loyaltyService.creditPoints(order); // Will become event consumer

  return { orderId: order.id, status: "confirmed" };
}

During this phase, we validated that event consumers produced the same results as the synchronous calls. We compared notification emails, analytics records, and loyalty point credits between the two paths. After 2 weeks of consistent results, we removed the synchronous calls to non-critical services.
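The comparison step can be sketched as a diff keyed by order ID: any order missing from either path, or produced with a different payload, gets flagged. This comparator is illustrative, not the team's actual tooling:

```typescript
// Diff the synchronous path's output against the event consumer's output,
// keyed by orderId. Mismatches block removal of the old path.
function diffPaths(
  syncRecords: Map<string, unknown>,
  eventRecords: Map<string, unknown>,
): string[] {
  const mismatches: string[] = [];
  for (const [orderId, syncValue] of syncRecords) {
    const eventValue = eventRecords.get(orderId);
    if (eventValue === undefined) {
      mismatches.push(`${orderId}: missing from event path`);
    } else if (JSON.stringify(syncValue) !== JSON.stringify(eventValue)) {
      mismatches.push(`${orderId}: payloads differ`);
    }
  }
  for (const orderId of eventRecords.keys()) {
    if (!syncRecords.has(orderId)) {
      mismatches.push(`${orderId}: missing from sync path`);
    }
  }
  return mismatches;
}
```

One practical wrinkle: fields that legitimately differ between paths (timestamps, generated IDs) need to be stripped before comparing, or every record shows up as a false mismatch.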

Phase 3: Saga Orchestrator (Weeks 7-10)

We replaced the synchronous payment and inventory calls with an orchestrated saga:

typescript
const orderSaga = createSaga("OrderPlacement", [
  {
    step: "reserveInventory",
    execute: (ctx) => publishCommand("inventory.commands", "ReserveStock", ctx.order),
    compensate: (ctx) => publishCommand("inventory.commands", "ReleaseStock", ctx.order),
    timeout: 5000,
  },
  {
    step: "processPayment",
    execute: (ctx) => publishCommand("payments.commands", "ChargeCard", ctx.order),
    compensate: (ctx) => publishCommand("payments.commands", "RefundCharge", ctx.order),
    timeout: 10000,
  },
  {
    step: "confirmOrder",
    execute: (ctx) => publishEvent("orders.events", "OrderConfirmed", ctx.order),
    // No compensation — this is the final step
    timeout: 3000,
  },
]);

The saga orchestrator maintained state in a PostgreSQL table, making it resilient to crashes. If the service restarted mid-saga, it resumed from the last completed step.
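Resuming only requires the ordered step list and the index of the last step recorded as complete. A simplified sketch of that recovery logic, with the PostgreSQL persistence abstracted into a plain object (the shape and field names here are assumptions, not the actual schema):

```typescript
type SagaStatus = "running" | "completed" | "compensating";

interface PersistedSaga {
  sagaId: string;
  steps: string[];             // ordered step names, as defined in createSaga
  lastCompletedIndex: number;  // -1 if no step has finished yet
  status: SagaStatus;
}

// On restart, derive the plan purely from persisted state: either the
// remaining steps to execute, or the completed steps to undo newest-first.
function resumePlan(saga: PersistedSaga): { action: string; steps: string[] } {
  if (saga.status === "compensating") {
    const done = saga.steps.slice(0, saga.lastCompletedIndex + 1);
    return { action: "compensate", steps: done.reverse() };
  }
  const remaining = saga.steps.slice(saga.lastCompletedIndex + 1);
  return remaining.length > 0
    ? { action: "execute", steps: remaining }
    : { action: "done", steps: [] };
}
```

The reversal matters: compensations must run in the opposite order of execution, so a refund happens before the stock release that preceded the charge is undone.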

Phase 4: Remove Synchronous Calls (Weeks 11-12)

With all consumers running as event-driven services and the saga handling the critical path, we removed the remaining synchronous calls. The checkout service's response time dropped immediately—it only waited for the saga's first step (inventory reservation) before returning a pending status to the customer.


What Went Wrong

Problem 1: Consumer Lag During Deployment

When we deployed a consumer update, the rolling restart caused a Kafka rebalance. During rebalancing, all partitions were revoked and reassigned, creating a 30-second processing gap. With 4 events/second, that's 120 events delayed.

Fix: Switched to CooperativeStickyAssignor, which only moves partitions that need reassignment. Rebalancing downtime dropped from 30 seconds to under 2 seconds.

Problem 2: Schema Evolution Breaking Consumers

A developer added a required field to the OrderPlaced event without updating all consumers. The new field was shippingMethod, added as required in the Avro schema. Old events (replayed during a consumer rebuild) didn't have this field, causing deserialization failures.

Fix: Enforced backward compatibility in the schema registry. New schemas must be compatible with the previous version. Required fields can only be added with default values. We added a CI check that runs schema compatibility verification before merging.
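The backward-compatibility rule boils down to this: a reader on the new schema must still decode old records, so any field added to the schema needs a default. A deliberately simplified check in that spirit (real Avro compatibility also covers type promotions, aliases, and removals; in CI you would call the schema registry's own compatibility check rather than roll your own):

```typescript
interface AvroField {
  name: string;
  type: string;
  default?: unknown;
}

// Simplified backward-compatibility rule: every field present in the new
// schema but absent from the old one must declare a default, or replayed
// old records will fail to deserialize.
function addedFieldsMissingDefaults(
  oldFields: AvroField[],
  newFields: AvroField[],
): string[] {
  const oldNames = new Set(oldFields.map((f) => f.name));
  return newFields
    .filter((f) => !oldNames.has(f.name) && f.default === undefined)
    .map((f) => f.name);
}
```

Applied to the incident above, a schema adding `shippingMethod` without a default would be rejected before merge instead of failing at replay time.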

Problem 3: Idempotency Gap

During a Kafka broker failover, some events were delivered twice. The notification service sent duplicate order confirmation emails to ~200 customers.

Fix: Added idempotent processing using the eventId as a deduplication key. Each consumer checks a Redis set before processing:

typescript
async function processIdempotent(
  event: OrderEvent,
  handler: (event: OrderEvent) => Promise<void>,
): Promise<void> {
  const dedupKey = `processed:${consumerGroup}:${event.eventId}`;
  // SET with NX returns null if the key already exists; EX expires it after 24h
  const isNew = await redis.set(dedupKey, "1", "NX", "EX", 86400);

  if (!isNew) return; // Already processed
  await handler(event);
}

Results

After 12 weeks, the migration was complete. Here are the measured results:

Metric | Before | After | Change
------ | ------ | ----- | ------
Checkout p50 latency | 1.8s | 0.4s | -78%
Checkout p99 latency | 4.2s | 1.1s | -74%
Monthly order failures | 2,340 | 312 | -87%
MTTR for cascade failures | 38 min | 4 min | -89%
Deploy coupling | 8 services | 2 services | -75%
Non-critical service outage impact | Full checkout failure | Zero checkout impact | Eliminated

The most impactful change was decoupling non-critical services. When the analytics service went down during a load test, orders continued processing without interruption. The analytics consumer simply caught up when it recovered—processing the backlog in 3 minutes.

Cost Impact

  • MSK cluster: $1,200/month (3 brokers, m5.large)
  • Removed: API gateway costs for inter-service calls ($400/month)
  • Removed: Circuit breaker infrastructure ($200/month)
  • Net cost increase: $600/month for significantly improved reliability

Team Velocity Impact

Independent deployability was the biggest velocity gain. Before EDA, changing the checkout flow required coordinated deploys across 8 services. After, the checkout team deploys independently. Other teams consume events at their own pace, deploying on their own schedules. Sprint velocity (measured by story points delivered) increased 23% across the three teams most affected by the migration.

Key Lessons

  1. Dual-write migration is essential. Running old and new paths in parallel catches inconsistencies before they affect customers. Budget 2-3 weeks for validation.

  2. Choreography for side effects, orchestration for transactions. Don't force one pattern everywhere. Critical business transactions need the explicit error handling that sagas provide.

  3. Schema governance pays for itself immediately. Our one schema-breaking incident took 4 hours to resolve. The CI check that prevents it took 2 hours to implement.

  4. Idempotency isn't optional. Every consumer will eventually receive duplicates. Build deduplication in from day one, not after the first incident.

  5. Consumer lag is your primary health metric. It tells you whether the system is keeping up before anything visibly breaks. Alert aggressively.
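That lag metric is simply the gap between each partition's newest offset and the consumer group's committed offset, summed across partitions. A minimal sketch (in practice the numbers come from Kafka's admin API or a monitoring integration such as Datadog's):

```typescript
interface PartitionOffsets {
  partition: number;
  logEndOffset: number;    // newest offset the broker holds
  committedOffset: number; // consumer group's committed position
}

// Total lag for a consumer group: sum of per-partition gaps.
// A steadily growing value means the consumer is falling behind;
// alert on the trend, not just the absolute number.
function totalLag(partitions: PartitionOffsets[]): number {
  return partitions.reduce(
    (sum, p) => sum + Math.max(0, p.logEndOffset - p.committedOffset),
    0,
  );
}
```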

Conclusion

Migrating to event-driven architecture reduced our checkout latency by 78%, order failures by 87%, and eliminated cascade failures from non-critical services. The hybrid approach—saga orchestration for the critical payment and inventory path, choreography for everything else—gave us transactional safety where it mattered and decoupling everywhere else.

The migration took 12 weeks with a team of 4 engineers. The first 3 weeks were infrastructure setup, the next 3 were dual-write validation, and the final 6 were saga implementation and cutover. If we did it again, we'd enforce schema compatibility from day one and implement idempotent consumers before the first event was published—both lessons we learned the hard way.
