In 2023, we migrated a monolithic order management system processing $2.1B in annual transaction volume to a CQRS and Event Sourcing architecture. The legacy system — a 500K-line Java monolith backed by a single PostgreSQL instance — had reached its scaling ceiling. Response times during peak periods (Black Friday, flash sales) regularly exceeded 8 seconds, and the database was maxing out at 12,000 connections. This is the story of what worked, what failed, and what we would do differently.
The Starting Point
The existing system handled everything in a single request-response cycle: validate the order, check inventory, process payment, update the database, send notifications, and return a response. A single order placement touched 14 database tables in a transaction that held locks for 200-400ms.
Key metrics before migration:
- p99 latency: 4.2s (order placement), 8.1s during peak
- Peak throughput: 1,200 orders/minute before degradation
- Database connections: 12,000 (pool exhaustion during peaks)
- Deployment frequency: Bi-weekly, 4-hour maintenance windows
- Mean time to recovery: 45 minutes
The business needed 10x throughput capacity for international expansion and sub-second response times to reduce cart abandonment.
Architecture Decisions
Event Store Selection: Amazon EventBridge + DynamoDB
We evaluated EventStoreDB, Kafka, and a custom DynamoDB-based solution. We chose DynamoDB as the event store with EventBridge for event distribution for three reasons: operational simplicity (serverless), single-digit millisecond write latency, and the team's existing AWS expertise.
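The core write path can be sketched without any AWS dependency. The following is a minimal in-memory model of the append pattern we used — in the real system, `append` is a DynamoDB `put_item` with a condition expression on the `(aggregate_id, version)` key, so a concurrent writer's conditional write fails instead of silently overwriting. Class and method names here are illustrative, not the production API.

```python
class ConcurrencyError(Exception):
    """Raised when another writer appended to the same stream first."""


class EventStore:
    """In-memory stand-in for a DynamoDB-backed event store."""

    def __init__(self):
        self._streams = {}  # aggregate_id -> ordered list of events

    def append(self, aggregate_id, expected_version, events):
        stream = self._streams.setdefault(aggregate_id, [])
        # Mirrors DynamoDB's conditional write: reject the append if the
        # stream has moved past the version this writer loaded.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}")
        stream.extend(events)
        return len(stream)  # the new stream version

    def load(self, aggregate_id):
        return list(self._streams.get(aggregate_id, []))
```

The optimistic-concurrency check is what makes DynamoDB viable as an event store despite having no native stream semantics: the conditional write gives you per-aggregate serialization without locks.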
Aggregate Design: Order as the Primary Aggregate
We decomposed the monolithic order into four aggregates:
- Order: Placement, modification, cancellation (5-8 events per lifecycle)
- Payment: Authorization, capture, refund (3-5 events)
- Fulfillment: Warehouse assignment, packing, shipping (4-7 events)
- Inventory: Reservation, allocation, release (2-3 events)
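Each aggregate derives its current state by folding its event stream. A simplified sketch of the Order aggregate follows — the event names match the lifecycle above, but the fields and dispatch style are assumptions for illustration, not the production schema.

```python
class Order:
    """Order aggregate: state is a pure fold over the event stream."""

    def __init__(self):
        self.status = None
        self.items = []

    def apply(self, event):
        # Dispatch on event type. Unknown types are ignored so that
        # replaying old streams tolerates newer event kinds.
        if event["type"] == "OrderPlaced":
            self.status = "placed"
            self.items = list(event.get("items", []))
        elif event["type"] == "OrderModified":
            self.items = list(event.get("items", self.items))
        elif event["type"] == "OrderCancelled":
            self.status = "cancelled"
        return self

    @classmethod
    def from_events(cls, events):
        order = cls()
        for event in events:
            order.apply(event)
        return order
```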
Projection Strategy: Purpose-Built Read Models
We built five projections, each optimized for a specific access pattern:
- Order Status (DynamoDB): Customer-facing, sub-10ms reads
- Order Search (OpenSearch): Full-text search for support agents
- Analytics (Redshift): Business intelligence and reporting
- Fulfillment Queue (SQS + DynamoDB): Warehouse operations
- Inventory View (ElastiCache): Real-time inventory levels
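To show the shape of one projection, here is a hedged sketch of the Order Status consumer: a denormalized row per order, keyed for point reads. In production this is a Lambda consuming EventBridge events and writing to DynamoDB; an in-memory dict stands in for the table, and the field names are illustrative.

```python
def project_order_status(read_model, event):
    """Fold one event into the customer-facing order-status read model."""
    row = read_model.setdefault(event["order_id"], {"status": "unknown"})
    if event["type"] == "OrderPlaced":
        row["status"] = "placed"
    elif event["type"] == "OrderShipped":
        row["status"] = "shipped"
    elif event["type"] == "OrderCancelled":
        row["status"] = "cancelled"
    # Track the last applied stream version; useful for detecting
    # projection lag on reads.
    row["version"] = event["version"]
    return read_model
```

The point of purpose-built projections is that each consumer only stores what its access pattern needs — this one never joins, never scans, and answers "where is my order?" in one key lookup.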
Migration Strategy: Strangler Fig
We ran the old and new systems in parallel for 12 weeks, using a feature flag to control traffic routing.
Phase 1 (Weeks 1-4): Shadow mode — all orders processed by the monolith, events emitted in parallel to build and validate projections.
Phase 2 (Weeks 5-8): Canary — 5% of orders processed by the new system, results compared with monolith output for correctness.
Phase 3 (Weeks 9-12): Gradual rollout — 5% → 25% → 50% → 100%, with automatic rollback triggers on error rate or latency thresholds.
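The routing flag and rollback trigger from the rollout phases can be sketched as follows. Hash-based bucketing keeps each customer pinned to one system so their orders stay internally consistent; the function names and thresholds here are illustrative, not our exact values.

```python
import hashlib


def route_to_new_system(customer_id, rollout_percent):
    """Stable per-customer bucket in [0, 100); below the dial means new system."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


def should_rollback(error_rate, p99_ms, max_error_rate=0.01, max_p99_ms=500):
    """Automatic rollback trigger: trip on either threshold."""
    return error_rate > max_error_rate or p99_ms > max_p99_ms
```

Because the bucket is derived from the customer id rather than random per request, moving the dial from 5% to 25% only adds customers to the new system — nobody flips back and forth mid-session.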
Measurable Results
After full migration:
| Metric | Before | After | Improvement |
|---|---|---|---|
| p99 latency (order placement) | 4.2s | 180ms | 23x |
| Peak throughput | 1,200 orders/min | 18,000 orders/min | 15x |
| Database connections | 12,000 | 0 (serverless) | N/A |
| Deployment frequency | Bi-weekly | Multiple daily | 10x |
| Mean time to recovery | 45 min | 3 min | 15x |
| Infrastructure cost | $42K/month | $28K/month | 33% reduction |
The cost reduction was unexpected. Despite higher architectural complexity, the move to DynamoDB on-demand pricing and Lambda-based projections eliminated the over-provisioned RDS instances and connection pooling infrastructure.
What Failed
Underestimating Eventual Consistency Impact on Customer Support
Customer support agents were trained on a system where changes appeared instantly. With CQRS, the order search projection had a 2-3 second lag. Agents would update an order and immediately search for it — finding stale data. We solved this with read-your-writes tokens, but only after weeks of escalated support tickets.
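The read-your-writes fix works roughly like this: a write returns the new stream version as a token, and subsequent reads by the same agent retry until the projection has caught up to that version. This sketch simplifies the polling and backoff; the names are illustrative.

```python
def read_own_write(read_model, order_id, min_version, attempts=3):
    """Return the projected row only once it reflects the caller's write."""
    for _ in range(attempts):
        row = read_model.get(order_id)
        if row and row.get("version", 0) >= min_version:
            return row
        # In production: a short sleep here, then re-query the projection.
    return None  # projection still lagging; caller retries or falls back
```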
Event Schema Evolution During Migration
We changed the OrderPlaced event schema three times during the 12-week migration. Each change required coordinating upcasters, redeploying all projection consumers, and rebuilding affected projections. We should have frozen the event schema before starting the migration and dealt with imperfect schemas post-migration.
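An upcaster chain looks roughly like this: each function lifts an event by exactly one schema version, and consumers always upcast to the latest version before applying. The specific fields added per version below are hypothetical, chosen only to show the mechanics.

```python
def upcast_v1_to_v2(event):
    # Hypothetical v2 change: a currency field with a default.
    event = dict(event, schema_version=2)
    event.setdefault("currency", "USD")
    return event


def upcast_v2_to_v3(event):
    # Hypothetical v3 change: an order-channel field with a default.
    event = dict(event, schema_version=3)
    event.setdefault("channel", "web")
    return event


UPCASTERS = {1: upcast_v1_to_v2, 2: upcast_v2_to_v3}
LATEST_VERSION = 3


def upcast(event):
    """Lift an event to the latest schema, one version step at a time."""
    while event.get("schema_version", 1) < LATEST_VERSION:
        event = UPCASTERS[event.get("schema_version", 1)](event)
    return event
```

The pain we describe above came from the fact that every schema change adds a link to this chain and forces every consumer to redeploy — which is why freezing the schema before migration would have been cheaper.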
Projection Rebuild Times
The analytics projection (Redshift) took 14 hours to rebuild from scratch. This meant we couldn't deploy schema changes to that projection without a 14-hour window of degraded analytics. We eventually built a parallel rebuild pipeline that reduced this to 2 hours, but it consumed significant engineering time.
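The speedup came from parallelism: event history is already ordered per aggregate, so partitions can be replayed independently and merged. This toy version shows the shape only — the real pipeline sharded across Lambda workers and bulk-loaded Redshift, and the function names here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def rebuild(events, project, workers=8):
    """Replay per-aggregate partitions concurrently, then merge read models."""
    partitions = {}
    for event in events:  # events arrive ordered within each aggregate
        partitions.setdefault(event["order_id"], []).append(event)

    def replay(stream):
        model = {}
        for event in stream:
            project(model, event)
        return model

    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(replay, partitions.values()):
            merged.update(partial)
    return merged
```

Ordering only has to hold within an aggregate, never across aggregates — that is the property that makes the partitioned rebuild safe.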
Honest Retrospective
Would we use CQRS/ES again? Yes, for this use case. The audit trail alone justified the effort — regulatory compliance required us to explain every state change to any order, and Event Sourcing made that trivial.
What would we change?
- Invest in projection rebuild infrastructure before the migration, not after
- Use a schema registry from the start
- Train customer support on eventual consistency before go-live
- Start with Kafka instead of EventBridge — EventBridge's 256KB event size limit forced us to split large order events awkwardly
Is CQRS/ES worth it for every system? Absolutely not. We evaluated it for our user profile service and correctly rejected it — CRUD was the right pattern for that domain. The order management system benefited because it has complex domain logic, regulatory audit requirements, and wildly different read/write patterns.
Conclusion
Migrating to CQRS and Event Sourcing delivered transformative results for our order management system: 23x latency improvement, 15x throughput increase, and a 33% cost reduction. The journey required 12 weeks of parallel operation, three event schema iterations, and significant investment in projection infrastructure.
The most valuable lesson was that CQRS/ES is as much an organizational change as a technical one. Every team that consumed order data — support, analytics, fulfillment — needed to understand and adapt to eventual consistency. The technical migration was the easy part.