CQRS and Event Sourcing become essential tools when your system processes millions of events per day and read workloads outpace writes by orders of magnitude. At high scale, the architectural decisions you make around event storage, projection infrastructure, and consistency boundaries directly impact your cloud bill and your on-call team's sleep quality. These best practices come from operating CQRS/ES systems handling 50K+ events per second across distributed clusters.
The High-Scale Imperative
High-scale teams face different constraints than enterprise teams. Latency budgets measured in single-digit milliseconds. Event throughput that demands horizontal partitioning. Read models serving millions of concurrent users. The patterns that work for a team processing 100 events per second actively harm you at 100,000.
The fundamental architecture remains the same — commands produce events, events build projections — but every component must be designed for horizontal scalability, partition tolerance, and graceful degradation under load.
Best Practices for High-Scale CQRS & Event Sourcing
1. Partition Your Event Store by Aggregate ID
At high scale, a single-partition event store becomes a bottleneck. Partition events by aggregate ID to distribute write load and enable parallel projection processing.
Use consistent hashing for partition assignment to minimize redistribution when adding partitions. EventStoreDB supports server-side partitioning; with Kafka as the event store, topic partitions map directly to this pattern.
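As a sketch of the assignment step, a consistent-hash ring with virtual nodes keeps most aggregate-to-partition mappings stable when a partition is added. The `ConsistentHashRing` class and vnode count below are illustrative, not any particular library's API:

```python
import bisect
import hashlib

def _stable_hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps aggregate IDs to partitions. Adding a partition only remaps
    the keys that fall into the new partition's arcs of the ring."""

    def __init__(self, partitions, vnodes: int = 64):
        # Each partition gets `vnodes` points on the ring for even spread.
        self._ring = sorted(
            (_stable_hash(f"{p}#{v}"), p) for p in partitions for v in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def partition_for(self, aggregate_id: str) -> str:
        # Walk clockwise to the first vnode at or past the key's hash.
        idx = bisect.bisect(self._keys, _stable_hash(aggregate_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"events-{i}" for i in range(8)])
# All events for one aggregate land on the same partition,
# preserving per-aggregate ordering.
assert ring.partition_for("order-42") == ring.partition_for("order-42")
```

Growing the ring from 8 to 9 partitions remaps only roughly 1/9 of aggregates, instead of nearly all of them as naive modulo hashing would.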
2. Use Snapshots Aggressively
At high throughput, rehydrating aggregates from full event history is prohibitively expensive. Snapshot after every N events and on every significant state transition.
Target snapshot intervals of 50-100 events. Store snapshots in a dedicated fast-access store (Redis, DynamoDB) separate from the event store to avoid adding read pressure.
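A minimal sketch of snapshot-assisted rehydration, using in-memory dicts as stand-ins for the real snapshot store (Redis/DynamoDB) and event store:

```python
SNAPSHOT_INTERVAL = 50  # events between snapshots, per the 50-100 guideline

class InMemorySnapshotStore:
    """Stand-in for a fast-access store like Redis or DynamoDB."""
    def __init__(self):
        self._snaps = {}
    def get(self, agg_id):
        return self._snaps.get(agg_id)
    def put(self, agg_id, state, version):
        self._snaps[agg_id] = {"state": dict(state), "version": version}

def rehydrate(agg_id, snapshots, events, apply_event):
    """Load the latest snapshot, then replay only the events recorded
    after it, instead of the aggregate's full history."""
    snap = snapshots.get(agg_id)
    state, version = (dict(snap["state"]), snap["version"]) if snap else ({}, 0)
    for event in events[version:]:  # a list stands in for the event store here
        state = apply_event(state, event)
        version += 1
    return state, version

def append(agg_id, events, event, snapshots, apply_event):
    """Append an event and snapshot every SNAPSHOT_INTERVAL events."""
    events.append(event)
    if len(events) % SNAPSHOT_INTERVAL == 0:
        state, version = rehydrate(agg_id, snapshots, events, apply_event)
        snapshots.put(agg_id, state, version)
```

With a 50-event interval, rehydration replays at most 49 events plus one snapshot read, regardless of how long the aggregate has lived.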
3. Build Projections with Parallel Consumers
A single projection consumer cannot keep up with high-throughput event streams. Parallelize projection building across partitions while maintaining ordering guarantees within each aggregate.
Use consumer group protocols (Kafka consumer groups, EventStoreDB competing consumers) to distribute partitions across workers. This gives you horizontal scaling and automatic rebalancing.
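A toy batch version of the idea: partitions are drained concurrently, but each partition's events are applied strictly in order. Real consumer groups add offset commits and rebalancing, which this sketch omits:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def project_in_parallel(events, partition_of, apply_to_projection, workers=4):
    """Process partitions concurrently while applying each partition's
    events in order, mirroring a consumer-group partition assignment."""
    by_partition = defaultdict(list)
    for event in events:
        by_partition[partition_of(event)].append(event)

    def drain(partition_events):
        for event in partition_events:  # in-order within the partition
            apply_to_projection(event)

    # One worker per partition at most; partitions never interleave.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(drain, by_partition.values()))
```

Because partitioning is by aggregate ID, no two workers ever touch the same aggregate, so per-aggregate ordering survives the parallelism.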
4. Implement Back-Pressure on Command Processing
When event throughput spikes, back-pressure prevents cascade failures. Rate-limit command acceptance based on downstream capacity.
Monitor projection lag as a key health signal. When projections fall behind by more than your SLA threshold, apply back-pressure to commands to let projections catch up.
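A minimal sketch of a lag-aware admission gate: the `lag_fn` hook is a hypothetical callback into your metrics pipeline (e.g. consumer-group lag), and rejected commands would be surfaced to callers as a retry-later response such as HTTP 429:

```python
LAG_SLA_SECONDS = 5.0  # matches the checklist's critical-projection SLA

class LagAwareGate:
    """Sheds commands while projection lag exceeds the SLA,
    giving projections headroom to catch up."""

    def __init__(self, lag_fn, sla=LAG_SLA_SECONDS):
        self._lag_fn = lag_fn  # hypothetical hook: returns current lag in seconds
        self._sla = sla

    def try_accept(self, command) -> bool:
        if self._lag_fn() > self._sla:
            return False  # caller retries with backoff or returns 429
        return True
```

Production versions usually soften this with probabilistic shedding or priority tiers rather than a hard cutoff, but the control loop is the same: lag goes up, admission goes down.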
5. Use Read-Through Caching for Hot Projections
At high read volumes, even optimized projections need a caching layer. Implement read-through caches that invalidate on event arrival.
For read models serving 100K+ QPS, consider materialized views in Redis with event-driven updates rather than database-backed projections with caching.
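A sketch of the invalidate-on-event pattern, with a plain dict standing in for Redis/Memcached and a caller-supplied loader standing in for the projection store:

```python
class ReadThroughCache:
    """Serves reads from cache, falling back to the projection store on a
    miss; the event handler invalidates affected keys as events arrive."""

    def __init__(self, load_fn):
        self._cache = {}
        self._load = load_fn  # reads from the backing projection store

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)  # read-through on miss
        return self._cache[key]

    def on_event(self, event):
        # Invalidate rather than update, so the next read re-materializes
        # from the projection and stale writes can't race each other.
        self._cache.pop(event["aggregate_id"], None)
```

Invalidation (versus in-place update) trades one extra read per changed key for immunity to out-of-order update races, which matters once events arrive from parallel consumers.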
6. Implement Event Compaction for Cold Data
Long-running aggregates accumulate thousands of events. Event compaction reduces storage costs and speeds up historical replays.
Move compacted events to cold storage (S3, GCS) with lifecycle policies. Maintain an index for compliance queries that need historical access.
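The compaction step itself can be sketched as a fold: a run of cold events collapses into one summary record carrying the derived state, a count of what it replaces, and a pointer to the archived originals. The record shape and `archive_ref` field are illustrative:

```python
def compact(events, apply_event, initial_state=None):
    """Fold a cold aggregate's event run into one compacted record,
    keeping a pointer to the archived originals for compliance lookups."""
    state = dict(initial_state or {})
    for event in events:
        state = apply_event(state, event)
    return {
        "type": "Compacted",
        "state": state,           # derived state replaces the raw events
        "replaces": len(events),  # how many events this record stands in for
        "archive_ref": None,      # hypothetical: set to the S3/GCS key after upload
    }
```

Replays then apply the compacted record as a single state-load, while the cold-storage archive remains queryable through the compliance index.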

Anti-Patterns to Avoid
Synchronous Projections in the Command Path
Never update projections synchronously during command processing. This couples write latency to projection complexity and eliminates the scaling benefit of CQRS. Accept eventual consistency and design your UX around it.
Global Ordering Where Partition Ordering Suffices
Global event ordering is expensive — it forces single-writer bottlenecks. Most business requirements only need ordering within an aggregate or partition. Challenge any requirement for global ordering and explore whether causal or partition-local ordering meets the actual need.
Under-Provisioning the Dead Letter Queue
At 50K events/second, even a 0.01% failure rate generates 5 failed events per second. Without proper dead letter queue handling, monitoring, and replay tooling, these failures compound into data inconsistencies.
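A minimal sketch of bounded retry with a dead letter queue; alerting and replay tooling are omitted, and the three-attempt limit and record shape are illustrative:

```python
MAX_RETRIES = 3  # illustrative bound; tune per failure mode

def process_with_dlq(events, handler, dead_letters):
    """Retry each failed event a bounded number of times, then park it
    on the dead letter queue with its error for later replay."""
    for event in events:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                handler(event)
                break  # success: move on to the next event
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    # Exhausted retries: record enough context to replay later.
                    dead_letters.append(
                        {"event": event, "error": repr(exc), "attempts": attempt}
                    )
```

The key point is that poison messages leave the hot path quickly instead of blocking the partition, and the DLQ record carries what replay tooling needs to reprocess them.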
Ignoring Projection Rebuild Time
If rebuilding a projection takes 12 hours, you effectively cannot deploy schema changes to that projection. Track rebuild time as a metric and invest in parallel rebuild infrastructure before it becomes a deployment bottleneck.
High-Scale Readiness Checklist
- Event store partitioned by aggregate ID with consistent hashing
- Snapshot strategy implemented with sub-100ms rehydration target
- Projections running on parallel consumer groups
- Back-pressure mechanism on command ingestion
- Read-through cache layer for hot projections (Redis/Memcached)
- Event compaction and archival pipeline for cold data
- Dead letter queue with automated retry and alerting
- Projection lag monitoring with SLA-based alerts (< 5s for critical projections)
- Load tested at 3x expected peak throughput
- Partition rebalancing tested with zero downtime
- Event replay tooling supports a parallel full rebuild in under 1 hour
- Graceful degradation plan when projections are unavailable
Conclusion
High-scale CQRS and Event Sourcing demand a systems-thinking approach where every component is designed for horizontal scaling and graceful degradation. The patterns that differentiate high-scale implementations — partitioned event stores, aggressive snapshotting, parallel projection engines, and back-pressure mechanisms — all share a common theme: accepting distributed systems realities rather than fighting them.
Start by establishing your throughput baseline, identify the bottleneck (it is usually projection lag), and optimize from there. Monitor event store partition distribution, projection consumer lag, and cache hit ratios as your three north-star metrics.