In 2023, we migrated a monolithic order management system processing $2.1B in annual transaction volume to a CQRS and Event Sourcing architecture. The legacy system — a 500K-line Java monolith backed by a single PostgreSQL instance — had reached its scaling ceiling. Response times during peak periods (Black Friday, flash sales) regularly exceeded 8 seconds, and the database was maxing out at 12,000 connections. This is the story of what worked, what failed, and what we would do differently.
The Starting Point
The existing system handled everything in a single request-response cycle: validate the order, check inventory, process payment, update the database, send notifications, and return a response. A single order placement touched 14 database tables in a transaction that held locks for 200-400ms.
Key metrics before migration:
- p99 latency: 4.2s (order placement), 8.1s during peak
- Peak throughput: 1,200 orders/minute before degradation
- Database connections: 12,000 (pool exhaustion during peaks)
- Deployment frequency: Bi-weekly, 4-hour maintenance windows
- Mean time to recovery: 45 minutes
The business needed 10x throughput capacity for international expansion and sub-second response times to reduce cart abandonment.
Architecture Decisions
Event Store Selection: Amazon EventBridge + DynamoDB
We evaluated EventStoreDB, Kafka, and a custom DynamoDB-based solution. We chose DynamoDB as the event store with EventBridge for event distribution for three reasons: operational simplicity (serverless), single-digit millisecond write latency, and the team's existing AWS expertise.
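The core write path can be sketched without any AWS dependency. The following is a minimal in-memory model of the append pattern we used — in the real system, `append` is a DynamoDB `put_item` with a condition expression on the `(aggregate_id, version)` key, so a concurrent writer's conditional write fails instead of silently overwriting. Class and method names here are illustrative, not the production API.

```python
class ConcurrencyError(Exception):
    """Raised when another writer appended to the same stream first."""


class EventStore:
    """In-memory stand-in for a DynamoDB-backed event store."""

    def __init__(self):
        self._streams = {}  # aggregate_id -> ordered list of events

    def append(self, aggregate_id, expected_version, events):
        stream = self._streams.setdefault(aggregate_id, [])
        # Mirrors DynamoDB's conditional write: reject the append if the
        # stream has moved past the version this writer loaded.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}")
        stream.extend(events)
        return len(stream)  # the new stream version

    def load(self, aggregate_id):
        return list(self._streams.get(aggregate_id, []))
```

The optimistic-concurrency check is what makes DynamoDB viable as an event store despite having no native stream semantics: the conditional write gives you per-aggregate serialization without locks.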
Aggregate Design: Order as the Primary Aggregate
We decomposed the monolithic order into four aggregates:
- Order: Placement, modification, cancellation (5-8 events per lifecycle)
- Payment: Authorization, capture, refund (3-5 events)
- Fulfillment: Warehouse assignment, packing, shipping (4-7 events)
- Inventory: Reservation, allocation, release (2-3 events)
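Each aggregate derives its current state by folding its event stream. A simplified sketch of the Order aggregate follows — the event names match the lifecycle above, but the fields and dispatch style are assumptions for illustration, not the production schema.

```python
class Order:
    """Order aggregate: state is a pure fold over the event stream."""

    def __init__(self):
        self.status = None
        self.items = []

    def apply(self, event):
        # Dispatch on event type. Unknown types are ignored so that
        # replaying old streams tolerates newer event kinds.
        if event["type"] == "OrderPlaced":
            self.status = "placed"
            self.items = list(event.get("items", []))
        elif event["type"] == "OrderModified":
            self.items = list(event.get("items", self.items))
        elif event["type"] == "OrderCancelled":
            self.status = "cancelled"
        return self

    @classmethod
    def from_events(cls, events):
        order = cls()
        for event in events:
            order.apply(event)
        return order
```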
Projection Strategy: Purpose-Built Read Models
We built five projections, each optimized for a specific access pattern:
- Order Status (DynamoDB): Customer-facing, sub-10ms reads
- Order Search (OpenSearch): Full-text search for support agents
- Analytics (Redshift): Business intelligence and reporting
- Fulfillment Queue (SQS + DynamoDB): Warehouse operations
- Inventory View (ElastiCache): Real-time inventory levels
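To show the shape of one projection, here is a hedged sketch of the Order Status consumer: a denormalized row per order, keyed for point reads. In production this is a Lambda consuming EventBridge events and writing to DynamoDB; an in-memory dict stands in for the table, and the field names are illustrative.

```python
def project_order_status(read_model, event):
    """Fold one event into the customer-facing order-status read model."""
    row = read_model.setdefault(event["order_id"], {"status": "unknown"})
    if event["type"] == "OrderPlaced":
        row["status"] = "placed"
    elif event["type"] == "OrderShipped":
        row["status"] = "shipped"
    elif event["type"] == "OrderCancelled":
        row["status"] = "cancelled"
    # Track the last applied stream version; useful for detecting
    # projection lag on reads.
    row["version"] = event["version"]
    return read_model
```

The point of purpose-built projections is that each consumer only stores what its access pattern needs — this one never joins, never scans, and answers "where is my order?" in one key lookup.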
Migration Strategy: Strangler Fig
We ran the old and new systems in parallel for 12 weeks, using a feature flag to control traffic routing.
Phase 1 (Weeks 1-4): Shadow mode — all orders processed by the monolith, events emitted in parallel to build and validate projections.
Phase 2 (Weeks 5-8): Canary — 5% of orders processed by the new system, results compared with monolith output for correctness.
Phase 3 (Weeks 9-12): Gradual rollout — 5% → 25% → 50% → 100%, with automatic rollback triggers on error rate or latency thresholds.
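The routing flag and rollback trigger from the rollout phases can be sketched as follows. Hash-based bucketing keeps each customer pinned to one system so their orders stay internally consistent; the function names and thresholds here are illustrative, not our exact values.

```python
import hashlib


def route_to_new_system(customer_id, rollout_percent):
    """Stable per-customer bucket in [0, 100); below the dial means new system."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


def should_rollback(error_rate, p99_ms, max_error_rate=0.01, max_p99_ms=500):
    """Automatic rollback trigger: trip on either threshold."""
    return error_rate > max_error_rate or p99_ms > max_p99_ms
```

Because the bucket is derived from the customer id rather than random per request, moving the dial from 5% to 25% only adds customers to the new system — nobody flips back and forth mid-session.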
Measurable Results
After full migration:
| Metric | Before | After | Improvement |
|---|---|---|---|
| p99 latency (order placement) | 4.2s | 180ms | 23x |
| Peak throughput | 1,200 orders/min | 18,000 orders/min | 15x |
| Database connections | 12,000 | 0 (serverless) | N/A |
| Deployment frequency | Bi-weekly | Multiple daily | 10x |
| Mean time to recovery | 45 min | 3 min | 15x |
| Infrastructure cost | $42K/month | $28K/month | 33% reduction |
The cost reduction was unexpected. Despite higher architectural complexity, the move to DynamoDB on-demand pricing and Lambda-based projections eliminated the over-provisioned RDS instances and connection pooling infrastructure.
What Failed
Underestimating Eventual Consistency Impact on Customer Support
Customer support agents were trained on a system where changes appeared instantly. With CQRS, the order search projection had a 2-3 second lag. Agents would update an order and immediately search for it — finding stale data. We solved this with read-your-writes tokens, but only after weeks of escalated support tickets.
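The read-your-writes fix works roughly like this: a write returns the new stream version as a token, and subsequent reads by the same agent retry until the projection has caught up to that version. This sketch simplifies the polling and backoff; the names are illustrative.

```python
def read_own_write(read_model, order_id, min_version, attempts=3):
    """Return the projected row only once it reflects the caller's write."""
    for _ in range(attempts):
        row = read_model.get(order_id)
        if row and row.get("version", 0) >= min_version:
            return row
        # In production: a short sleep here, then re-query the projection.
    return None  # projection still lagging; caller retries or falls back
```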
Event Schema Evolution During Migration
We changed the OrderPlaced event schema three times during the 12-week migration. Each change required coordinating upcasters, redeploying all projection consumers, and rebuilding affected projections. We should have frozen the event schema before starting the migration and dealt with imperfect schemas post-migration.
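An upcaster chain looks roughly like this: each function lifts an event by exactly one schema version, and consumers always upcast to the latest version before applying. The specific fields added per version below are hypothetical, chosen only to show the mechanics.

```python
def upcast_v1_to_v2(event):
    # Hypothetical v2 change: a currency field with a default.
    event = dict(event, schema_version=2)
    event.setdefault("currency", "USD")
    return event


def upcast_v2_to_v3(event):
    # Hypothetical v3 change: an order-channel field with a default.
    event = dict(event, schema_version=3)
    event.setdefault("channel", "web")
    return event


UPCASTERS = {1: upcast_v1_to_v2, 2: upcast_v2_to_v3}
LATEST_VERSION = 3


def upcast(event):
    """Lift an event to the latest schema, one version step at a time."""
    while event.get("schema_version", 1) < LATEST_VERSION:
        event = UPCASTERS[event.get("schema_version", 1)](event)
    return event
```

The pain we describe above came from the fact that every schema change adds a link to this chain and forces every consumer to redeploy — which is why freezing the schema before migration would have been cheaper.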
Projection Rebuild Times
The analytics projection (Redshift) took 14 hours to rebuild from scratch. This meant we couldn't deploy schema changes to that projection without a 14-hour window of degraded analytics. We eventually built a parallel rebuild pipeline that reduced this to 2 hours, but it consumed significant engineering time.
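The speedup came from parallelism: event history is already ordered per aggregate, so partitions can be replayed independently and merged. This toy version shows the shape only — the real pipeline sharded across Lambda workers and bulk-loaded Redshift, and the function names here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def rebuild(events, project, workers=8):
    """Replay per-aggregate partitions concurrently, then merge read models."""
    partitions = {}
    for event in events:  # events arrive ordered within each aggregate
        partitions.setdefault(event["order_id"], []).append(event)

    def replay(stream):
        model = {}
        for event in stream:
            project(model, event)
        return model

    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(replay, partitions.values()):
            merged.update(partial)
    return merged
```

Ordering only has to hold within an aggregate, never across aggregates — that is the property that makes the partitioned rebuild safe.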
Honest Retrospective
Would we use CQRS/ES again? Yes, for this use case. The audit trail alone justified the effort — regulatory compliance required us to explain every state change to any order, and Event Sourcing made that trivial.
What would we change?
- Invest in projection rebuild infrastructure before the migration, not after
- Use a schema registry from the start
- Train customer support on eventual consistency before go-live
- Start with Kafka instead of EventBridge — EventBridge's 256KB event size limit forced us to split large order events awkwardly
Is CQRS/ES worth it for every system? Absolutely not. We evaluated it for our user profile service and correctly rejected it — CRUD was the right pattern for that domain. The order management system benefited because it has complex domain logic, regulatory audit requirements, and wildly different read/write patterns.
Conclusion
Migrating to CQRS and Event Sourcing delivered transformative results for our order management system: 23x latency improvement, 15x throughput increase, and a 33% cost reduction. The journey required 12 weeks of parallel operation, three event schema iterations, and significant investment in projection infrastructure.
The most valuable lesson was that CQRS/ES is as much an organizational change as a technical one. Every team that consumed order data — support, analytics, fulfillment — needed to understand and adapt to eventual consistency. The technical migration was the easy part.