In early 2024, our team migrated a large e-commerce platform's order processing pipeline from synchronous REST-based communication to event-driven architecture. The system processed 15,000 orders per hour at peak, with 12 downstream services that needed to react to order lifecycle events—from payment processing through fulfillment, inventory management, and customer notifications. This is the story of what worked, what didn't, and the measurable impact on reliability and performance.
## The Problem
The existing architecture used synchronous HTTP calls between services. When a customer placed an order, the checkout service made sequential calls to:
- Payment service → charge the card
- Inventory service → reserve stock
- Fulfillment service → create shipment
- Notification service → send confirmation email
- Analytics service → record the transaction
- Loyalty service → credit reward points
Each call added 50-200ms to the total response time. The p99 checkout latency was 4.2 seconds. Worse, if any downstream service was slow or unavailable, the entire checkout failed. During Black Friday 2023, the analytics service went down under load, causing a 45-minute outage that blocked all orders—even though analytics wasn't critical to completing a purchase.
Key metrics before migration:
| Metric | Value |
|---|---|
| Checkout p50 latency | 1.8s |
| Checkout p99 latency | 4.2s |
| Monthly order failures | 2,340 (0.52% of orders) |
| MTTR for cascade failures | 38 minutes |
| Deploy coupling (services requiring coordinated deploys) | 8 of 12 |
## Architecture Decision
We evaluated three approaches:
- Choreography-based EDA — services publish events, consumers react independently
- Orchestration-based EDA — a central orchestrator coordinates the workflow via commands
- Hybrid — orchestration for the critical path, choreography for non-critical side effects
We chose the hybrid approach. Payment processing and inventory reservation needed transactional guarantees and clear error handling, making orchestration the right fit. Notifications, analytics, and loyalty points were fire-and-forget side effects that worked well with choreography.
### Technology Stack
- Message broker: Amazon MSK (Managed Kafka) with 3 brokers across 3 AZs
- Orchestrator: Custom saga implementation using a state machine library
- Schema registry: Confluent Schema Registry with Avro serialization
- Monitoring: Datadog for metrics, distributed tracing via OpenTelemetry
We chose Kafka over SQS because we needed event replay capability (to rebuild read models), consumer groups (multiple services consuming the same events at their own pace), and ordering guarantees (events for the same order processed in sequence).
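The per-order ordering guarantee comes from Kafka's key-based partitioning: records with the same key always land on the same partition, and a partition is consumed in order. A toy sketch of the idea (the hash function is a simplified stand-in for Kafka's murmur2 default partitioner, and the event names are illustrative):

```python
# Sketch: keying events by order ID keeps one order's events in sequence.
# The hash below is a toy stand-in for Kafka's murmur2-based partitioner.

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic toy hash: same key always maps to the same partition."""
    return sum(key.encode()) % num_partitions

NUM_PARTITIONS = 32  # matches the orders topic sizing described above

events = [
    ("order-1001", "OrderPlaced"),
    ("order-2002", "OrderPlaced"),
    ("order-1001", "PaymentCaptured"),
    ("order-1001", "OrderShipped"),
]

# Every event keyed "order-1001" routes to the same partition, so a single
# consumer sees Placed -> Captured -> Shipped in production order.
assignments = [(key, partition_for(key, NUM_PARTITIONS)) for key, _ in events]
```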
## Implementation
### Phase 1: Event Infrastructure (Weeks 1-3)
We set up MSK, configured topics, and built shared libraries for event production and consumption. Topics were organized by domain:
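The original topic listing isn't reproduced here; a plausible layout consistent with the article (the 32-partition orders topic and 3-AZ replication are from the text, the other topic names and counts are assumptions) might look like:

```python
# Hypothetical domain-oriented topic layout. Only the orders topic's
# 32 partitions and replication factor 3 (one replica per AZ) are taken
# from the article; other names and counts are illustrative.
TOPICS = {
    "orders.events":        {"partitions": 32, "replication_factor": 3},
    "payments.events":      {"partitions": 16, "replication_factor": 3},
    "inventory.events":     {"partitions": 16, "replication_factor": 3},
    "fulfillment.events":   {"partitions": 16, "replication_factor": 3},
    "notifications.events": {"partitions": 8,  "replication_factor": 3},
}
```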
Partition counts were based on expected throughput: 15K orders/hour = ~4 orders/second, but we sized for 100x growth. Each partition handles 10K+ events/sec, so 32 partitions for the orders topic was generous headroom.
The shared event library enforced the envelope structure:
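The library's actual code isn't shown; a minimal sketch of such an envelope (field names are assumptions, not the team's exact schema) could be:

```python
# Hypothetical event envelope: identity, ordering key, and versioning
# metadata wrapped around the domain payload. Field names are assumed.
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass(frozen=True)
class EventEnvelope:
    event_type: str         # e.g. "OrderPlaced"
    aggregate_id: str       # partition key, e.g. the order ID
    payload: dict[str, Any] # domain-specific body
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    schema_version: int = 1

evt = EventEnvelope("OrderPlaced", "order-1001", {"total_cents": 4999})
```

The `event_id` is what later makes idempotent consumption possible: it gives every delivery of the same event a stable deduplication key.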
### Phase 2: Dual-Write Migration (Weeks 4-6)
Rather than a big-bang cutover, we ran the old synchronous calls and new events in parallel. The checkout service continued making HTTP calls but also published events:
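The shape of that dual write (function and endpoint names are hypothetical) is roughly: the synchronous call remains the source of truth, and the event publish is best-effort, so a broker problem cannot fail checkout during the migration window.

```python
# Dual-write sketch (names hypothetical). The old HTTP path still drives
# the real side effect; the new event publish is best-effort for now.
import logging

def place_order(order, http_client, event_publisher):
    # Old path: synchronous call remains authoritative during migration.
    http_client.post("/notifications/order-confirmation", json=order)
    # New path: publish the same fact as an event; a publish failure must
    # not fail checkout while we are still validating the event path.
    try:
        event_publisher.publish("orders.events", key=order["id"],
                                value={"type": "OrderPlaced", **order})
    except Exception:
        logging.exception("dual-write publish failed; HTTP path succeeded")
```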
During this phase, we validated that event consumers produced the same results as the synchronous calls. We compared notification emails, analytics records, and loyalty point credits between the two paths. After 2 weeks of consistent results, we removed the synchronous calls to non-critical services.
### Phase 3: Saga Orchestrator (Weeks 7-10)
We replaced the synchronous payment and inventory calls with an orchestrated saga:
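The team's actual saga library isn't shown; a minimal sketch of the pattern (step names follow the article, class and function names are assumptions) runs each step in order and, on failure, executes the compensating actions in reverse:

```python
# Minimal saga sketch. Each step pairs an action with a compensation;
# a failure triggers compensations for completed steps in reverse order.
# A production orchestrator would persist self.completed (here, the
# article's version used a PostgreSQL table) to survive crashes.
class Saga:
    def __init__(self, steps):
        self.steps = steps        # list of (name, action, compensation)
        self.completed = []

    def run(self, ctx):
        for name, action, compensate in self.steps:
            try:
                action(ctx)
                self.completed.append((name, compensate))
            except Exception:
                for _, comp in reversed(self.completed):
                    comp(ctx)     # undo already-completed steps
                return "compensated"
        return "completed"

# Example wiring for the critical path (handlers are illustrative):
def reserve_inventory(ctx): ctx["reserved"] = True
def release_inventory(ctx): ctx["reserved"] = False
def charge_payment(ctx):
    if ctx.get("card_declined"):
        raise RuntimeError("payment failed")
    ctx["charged"] = True
def refund_payment(ctx): ctx["charged"] = False

CRITICAL_PATH = [
    ("reserve-inventory", reserve_inventory, release_inventory),
    ("charge-payment", charge_payment, refund_payment),
]
```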
The saga orchestrator maintained state in a PostgreSQL table, making it resilient to crashes. If the service restarted mid-saga, it resumed from the last completed step.
### Phase 4: Remove Synchronous Calls (Weeks 11-12)
With all consumers running as event-driven services and the saga handling the critical path, we removed the remaining synchronous calls. The checkout service's response time dropped immediately—it only waited for the saga's first step (inventory reservation) before returning a pending status to the customer.
## What Went Wrong
### Problem 1: Consumer Lag During Deployment
When we deployed a consumer update, the rolling restart caused a Kafka rebalance. During rebalancing, all partitions were revoked and reassigned, creating a 30-second processing gap. With 4 events/second, that's 120 events delayed.
Fix: Switched to CooperativeStickyAssignor, which only moves partitions that need reassignment. Rebalancing downtime dropped from 30 seconds to under 2 seconds.
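In confluent-kafka (Python) terms, the change is a single consumer setting; the equivalent Java client setting is `partition.assignment.strategy=CooperativeStickyAssignor`. Broker addresses and group name below are illustrative.

```python
# Consumer config for incremental cooperative rebalancing (confluent-kafka
# style). Broker addresses and group.id are illustrative placeholders.
consumer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "group.id": "notification-service",
    # Only partitions that actually move are revoked during a rebalance,
    # instead of every partition being revoked from every group member.
    "partition.assignment.strategy": "cooperative-sticky",
    "enable.auto.commit": False,   # commit offsets after processing succeeds
    "auto.offset.reset": "earliest",
}
```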
### Problem 2: Schema Evolution Breaking Consumers
A developer added a required field to the OrderPlaced event without updating all consumers. The new field was shippingMethod, added as required in the Avro schema. Old events (replayed during a consumer rebuild) didn't have this field, causing deserialization failures.
Fix: Enforced backward compatibility in the schema registry: every new schema version must be compatible with the previous one, and new fields may only be added if they declare a default value. We added a CI check that verifies schema compatibility before merging.
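The CI gate can be a thin wrapper around Confluent Schema Registry's compatibility-check endpoint; a sketch (registry URL and subject naming are assumptions, and the HTTP client is injected so the check is testable):

```python
# Sketch of a CI schema-compatibility gate against Confluent Schema
# Registry. The registry URL and subject name are assumptions; the http
# client is injected so the function can be exercised without a network.
import json

def is_backward_compatible(http, registry_url, subject, schema_str):
    """POSTs a candidate schema to the registry's compatibility endpoint
    and returns True only if it is compatible with the latest version."""
    resp = http.post(
        f"{registry_url}/compatibility/subjects/{subject}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": schema_str}),
    )
    return resp.json().get("is_compatible", False)
```

In CI, a non-compatible result fails the build before the schema change can merge.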
### Problem 3: Idempotency Gap
During a Kafka broker failover, some events were delivered twice. The notification service sent duplicate order confirmation emails to ~200 customers.
Fix: Added idempotent processing using the eventId as a deduplication key. Each consumer checks a Redis set before processing:
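The check-and-record pattern can be one atomic Redis call, since `SADD` returns 1 only when the member is new; a sketch (key name assumed, client injected for testability):

```python
# Dedup sketch following the article's Redis-set approach. The key name is
# an assumption. SADD returns 1 only for a member not already in the set,
# making check-and-record a single atomic call.
def handle_once(redis_client, event, process):
    added = redis_client.sadd("processed:event-ids", event["event_id"])
    if added == 0:
        return "duplicate-skipped"   # this eventId was already handled
    # Note: if process() fails here, a real consumer should remove the id
    # (or use SET NX with a TTL) so the event can be retried.
    process(event)
    return "processed"
```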
## Results
After 12 weeks, the migration was complete. Here are the measured results:
| Metric | Before | After | Change |
|---|---|---|---|
| Checkout p50 latency | 1.8s | 0.4s | -78% |
| Checkout p99 latency | 4.2s | 1.1s | -74% |
| Monthly order failures | 2,340 | 312 | -87% |
| MTTR for cascade failures | 38 min | 4 min | -89% |
| Deploy coupling | 8 services | 2 services | -75% |
| Non-critical service outage impact | Full checkout failure | Zero checkout impact | Eliminated |
The most impactful change was decoupling non-critical services. When the analytics service went down during a load test, orders continued processing without interruption. The analytics consumer simply caught up when it recovered—processing the backlog in 3 minutes.
### Cost Impact
- MSK cluster: $1,200/month (3 brokers, m5.large)
- Removed: API gateway costs for inter-service calls ($400/month)
- Removed: Circuit breaker infrastructure ($200/month)
- Net cost increase: $600/month for significantly improved reliability
### Team Velocity Impact
Independent deployability was the biggest velocity gain. Before EDA, changing the checkout flow required coordinated deploys across 8 services. After, the checkout team deploys independently. Other teams consume events at their own pace, deploying on their own schedules. Sprint velocity (measured by story points delivered) increased 23% across the three teams most affected by the migration.
## Key Lessons

- Dual-write migration is essential. Running old and new paths in parallel catches inconsistencies before they affect customers. Budget 2-3 weeks for validation.
- Choreography for side effects, orchestration for transactions. Don't force one pattern everywhere. Critical business transactions need the explicit error handling that sagas provide.
- Schema governance pays for itself immediately. Our one schema-breaking incident took 4 hours to resolve. The CI check that prevents it took 2 hours to implement.
- Idempotency isn't optional. Every consumer will eventually receive duplicates. Build deduplication in from day one, not after the first incident.
- Consumer lag is your primary health metric. It tells you whether the system is keeping up before anything visibly breaks. Alert aggressively.
## Conclusion
Migrating to event-driven architecture reduced our checkout latency by 78%, order failures by 87%, and eliminated cascade failures from non-critical services. The hybrid approach—saga orchestration for the critical payment and inventory path, choreography for everything else—gave us transactional safety where it mattered and decoupling everywhere else.
The migration took 12 weeks with a team of 4 engineers. The first 3 weeks were infrastructure setup, the next 3 were dual-write validation, and the final 6 were saga implementation and cutover. If we did it again, we'd enforce schema compatibility from day one and implement idempotent consumers before the first event was published—both lessons we learned the hard way.