
CQRS & Event Sourcing Best Practices for High-Scale Teams

Battle-tested best practices for CQRS & Event Sourcing tailored to high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 11 min read

CQRS and Event Sourcing become essential tools when your system processes millions of events per day and read workloads outpace writes by orders of magnitude. At high scale, the architectural decisions you make around event storage, projection infrastructure, and consistency boundaries directly impact your cloud bill and your on-call team's sleep quality. These best practices come from operating CQRS/ES systems handling 50K+ events per second across distributed clusters.

The High-Scale Imperative

High-scale teams face different constraints than enterprise teams. Latency budgets measured in single-digit milliseconds. Event throughput that demands horizontal partitioning. Read models serving millions of concurrent users. The patterns that work for a team processing 100 events per second actively harm you at 100,000.

The fundamental architecture remains the same — commands produce events, events build projections — but every component must be designed for horizontal scalability, partition tolerance, and graceful degradation under load.

Best Practices for High-Scale CQRS & Event Sourcing

1. Partition Your Event Store by Aggregate ID

At high scale, a single-partition event store becomes a bottleneck. Partition events by aggregate ID to distribute write load and enable parallel projection processing.

```go
type PartitionedEventStore struct {
	partitions    int
	stores        []EventStore
	partitionFunc func(aggregateID string) int
}

func NewPartitionedEventStore(partitions int) *PartitionedEventStore {
	stores := make([]EventStore, partitions)
	for i := range stores {
		stores[i] = NewEventStore(fmt.Sprintf("events_partition_%d", i))
	}
	return &PartitionedEventStore{
		partitions: partitions,
		stores:     stores,
		partitionFunc: func(id string) int {
			h := fnv.New32a()
			h.Write([]byte(id))
			return int(h.Sum32()) % partitions
		},
	}
}

func (p *PartitionedEventStore) Append(aggregateID string, events []Event) error {
	partition := p.partitionFunc(aggregateID)
	return p.stores[partition].Append(aggregateID, events)
}
```

Use consistent hashing for partition assignment to minimize redistribution when adding partitions. EventStoreDB supports server-side partitioning; with Kafka as the event store, topic partitions map directly to this pattern.
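Modulo hashing, as in the sketch above, reshuffles most keys whenever the partition count changes. A minimal consistent-hash ring looks like the following; all names here (`ring`, `newRing`, `partitionFor`) are illustrative, not from a specific library:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps virtual-node hash points to physical partitions, so adding a
// partition only moves the keys that fall between its new points.
type ring struct {
	points []uint32       // sorted hash points
	owner  map[uint32]int // hash point -> partition index
}

func newRing(partitions, virtualNodes int) *ring {
	r := &ring{owner: make(map[uint32]int)}
	for p := 0; p < partitions; p++ {
		for v := 0; v < virtualNodes; v++ {
			h := hash32(fmt.Sprintf("partition-%d-vnode-%d", p, v))
			r.points = append(r.points, h)
			r.owner[h] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

func (r *ring) partitionFor(aggregateID string) int {
	h := hash32(aggregateID)
	// Take the first point clockwise from the key's hash; wrap to the start.
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	r := newRing(8, 16)
	fmt.Println(r.partitionFor("order-1234")) // deterministic for a fixed ring
}
```

With virtual nodes, growing from 8 to 9 partitions moves roughly 1/9 of the keys instead of nearly all of them.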

2. Use Snapshots Aggressively

At high throughput, rehydrating aggregates from full event history is prohibitively expensive. Snapshot after every N events and on every significant state transition.

```go
type SnapshotStrategy struct {
	eventThreshold int
	store          SnapshotStore
	eventStore     EventStore // needed to replay events recorded after the snapshot
}

func (s *SnapshotStrategy) ShouldSnapshot(aggregate Aggregate) bool {
	return aggregate.UncommittedCount() >= s.eventThreshold
}

func (s *SnapshotStrategy) LoadAggregate(ctx context.Context, id string) (Aggregate, error) {
	snapshot, err := s.store.GetLatest(ctx, id)
	if err != nil {
		return nil, err
	}

	var startVersion int64
	aggregate := NewAggregate(id)

	if snapshot != nil {
		aggregate.RestoreFromSnapshot(snapshot.State)
		startVersion = snapshot.Version + 1
	}

	// Replay only the events appended since the snapshot was taken.
	events, err := s.eventStore.ReadFrom(ctx, id, startVersion)
	if err != nil {
		return nil, err
	}

	for _, event := range events {
		aggregate.Apply(event)
	}

	return aggregate, nil
}
```

Target snapshot intervals of 50-100 events. Store snapshots in a dedicated fast-access store (Redis, DynamoDB) separate from the event store to avoid adding read pressure.
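A minimal in-memory sketch of such a dedicated snapshot store follows; the type names are assumptions, and in production the map would be backed by Redis or DynamoDB:

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot pairs a serialized aggregate state with the version it was taken at.
type Snapshot struct {
	AggregateID string
	Version     int64
	State       []byte
}

// InMemorySnapshotStore keeps only the latest snapshot per aggregate,
// which is all rehydration needs.
type InMemorySnapshotStore struct {
	mu     sync.RWMutex
	latest map[string]Snapshot
}

func NewInMemorySnapshotStore() *InMemorySnapshotStore {
	return &InMemorySnapshotStore{latest: make(map[string]Snapshot)}
}

func (s *InMemorySnapshotStore) Save(snap Snapshot) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Never overwrite a newer snapshot with a stale one from a slow writer.
	if cur, ok := s.latest[snap.AggregateID]; ok && cur.Version >= snap.Version {
		return
	}
	s.latest[snap.AggregateID] = snap
}

func (s *InMemorySnapshotStore) GetLatest(aggregateID string) (Snapshot, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	snap, ok := s.latest[aggregateID]
	return snap, ok
}

func main() {
	store := NewInMemorySnapshotStore()
	store.Save(Snapshot{AggregateID: "acct-1", Version: 100, State: []byte(`{"balance":42}`)})
	store.Save(Snapshot{AggregateID: "acct-1", Version: 50, State: []byte("stale")})
	snap, _ := store.GetLatest("acct-1")
	fmt.Println(snap.Version) // 100: the stale write was rejected
}
```

The version guard in `Save` matters when snapshotting runs concurrently with command handling; losing a snapshot is harmless, but regressing to an older one lengthens every subsequent rehydration.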

3. Build Projections with Parallel Consumers

A single projection consumer cannot keep up with high-throughput event streams. Parallelize projection building across partitions while maintaining ordering guarantees within each aggregate.

```go
type ParallelProjectionEngine struct {
	workers    int
	partitions int
	handler    ProjectionHandler
	eventStore EventStore
	deadLetter DeadLetterQueue
}

func (e *ParallelProjectionEngine) Start(ctx context.Context) error {
	errGroup, ctx := errgroup.WithContext(ctx)

	for i := 0; i < e.workers; i++ {
		workerID := i
		errGroup.Go(func() error {
			return e.runWorker(ctx, workerID)
		})
	}

	return errGroup.Wait()
}

func (e *ParallelProjectionEngine) runWorker(ctx context.Context, workerID int) error {
	// Each worker handles a subset of partitions.
	assignedPartitions := e.assignPartitions(workerID)

	for _, partition := range assignedPartitions {
		checkpoint, _ := e.handler.GetCheckpoint(partition)
		// ReadPartition streams events from the checkpoint onward as a channel.
		events := e.eventStore.ReadPartition(ctx, partition, checkpoint)

		for event := range events {
			if err := e.handler.Handle(ctx, event); err != nil {
				// Dead-letter the event and keep the partition moving.
				e.deadLetter.Send(event, err)
				continue
			}
			e.handler.SaveCheckpoint(partition, event.Position)
		}
	}
	return nil
}
```

Use consumer group protocols (Kafka consumer groups, EventStoreDB competing consumers) to distribute partitions across workers. This gives you horizontal scaling and automatic rebalancing.
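The `assignPartitions` helper referenced above can be as simple as a static modulo assignment. This is a sketch; real consumer-group protocols rebalance dynamically, but the invariant is the same — every partition has exactly one owner:

```go
package main

import "fmt"

// assignPartitions statically assigns each worker every partition whose
// index is congruent to the worker ID modulo the worker count. Every
// partition gets exactly one owner, so per-aggregate ordering is preserved.
func assignPartitions(workerID, workers, partitions int) []int {
	var assigned []int
	for p := workerID; p < partitions; p += workers {
		assigned = append(assigned, p)
	}
	return assigned
}

func main() {
	// 3 workers over 8 partitions: worker 1 owns partitions 1, 4, 7.
	fmt.Println(assignPartitions(1, 3, 8)) // [1 4 7]
}
```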

4. Implement Back-Pressure on Command Processing

When event throughput spikes, back-pressure prevents cascade failures. Rate-limit command acceptance based on downstream capacity.

```go
type BackPressureCommandBus struct {
	limiter    *rate.Limiter
	handler    CommandHandler
	metrics    MetricsCollector
	queueDepth atomic.Int64
	maxDepth   int64
}

func (b *BackPressureCommandBus) Dispatch(ctx context.Context, cmd Command) error {
	if b.queueDepth.Load() > b.maxDepth {
		b.metrics.Increment("commands.rejected.queue_full")
		return ErrServiceOverloaded
	}

	if !b.limiter.Allow() {
		b.metrics.Increment("commands.rejected.rate_limited")
		return ErrRateLimited
	}

	b.queueDepth.Add(1)
	defer b.queueDepth.Add(-1)

	return b.handler.Handle(ctx, cmd)
}
```

Monitor projection lag as a key health signal. When projections fall behind by more than your SLA threshold, apply back-pressure to commands to let projections catch up.
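One way to wire projection lag into the limiter is a linear derating curve: full rate below a low-water mark, zero above a high-water mark, proportional in between. A sketch with hypothetical thresholds:

```go
package main

import "fmt"

// allowedRate linearly derates the command rate as projection lag grows:
// full rate below lagLow, zero above lagHigh, proportional in between.
func allowedRate(lagSeconds, lagLow, lagHigh, maxRate float64) float64 {
	switch {
	case lagSeconds <= lagLow:
		return maxRate
	case lagSeconds >= lagHigh:
		return 0
	default:
		return maxRate * (lagHigh - lagSeconds) / (lagHigh - lagLow)
	}
}

func main() {
	// Full 10K cmd/s below 1s of lag, hard stop at 30s of lag.
	fmt.Println(allowedRate(0.5, 1, 30, 10000))  // 10000
	fmt.Println(allowedRate(15.5, 1, 30, 10000)) // 5000
	fmt.Println(allowedRate(45, 1, 30, 10000))   // 0
}
```

The output of this function would feed `rate.Limiter.SetLimit` on the command bus, reevaluated on each lag sample.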

5. Use Read-Through Caching for Hot Projections

At high read volumes, even optimized projections need a caching layer. Implement read-through caches that invalidate on event arrival.

```go
type CachedProjection struct {
	cache      *redis.Client
	projection ProjectionStore
	ttl        time.Duration
}

func (c *CachedProjection) Get(ctx context.Context, key string) (*ProjectionView, error) {
	cached, err := c.cache.Get(ctx, key).Bytes()
	if err == nil {
		var view ProjectionView
		if json.Unmarshal(cached, &view) == nil {
			return &view, nil
		}
		// Corrupt cache entry: fall through and reload from the projection store.
	}

	view, err := c.projection.Get(ctx, key)
	if err != nil {
		return nil, err
	}

	// A cache write failure should not fail the read, so the error is dropped.
	if data, err := json.Marshal(view); err == nil {
		c.cache.Set(ctx, key, data, c.ttl)
	}
	return view, nil
}

func (c *CachedProjection) OnEvent(ctx context.Context, event DomainEvent) error {
	// Update the projection first, then invalidate the cached view.
	if err := c.projection.Handle(ctx, event); err != nil {
		return err
	}
	c.cache.Del(ctx, event.AggregateID)
	return nil
}
```

For read models serving 100K+ QPS, consider materialized views in Redis with event-driven updates rather than database-backed projections with caching.
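The materialized-view alternative drops the backing database entirely: the view is rebuilt in place as events arrive, and reads never touch the event store. An in-memory sketch (the event shape and type names are hypothetical; the map stands in for Redis hashes):

```go
package main

import "fmt"

// BalanceEvent is a hypothetical domain event for this sketch.
type BalanceEvent struct {
	AccountID string
	Delta     int64
}

// BalanceView is a materialized view updated directly from the event
// stream — queries hit this structure, never the event store.
type BalanceView struct {
	balances map[string]int64
}

func NewBalanceView() *BalanceView {
	return &BalanceView{balances: make(map[string]int64)}
}

func (v *BalanceView) Apply(e BalanceEvent) {
	v.balances[e.AccountID] += e.Delta
}

func (v *BalanceView) Get(accountID string) int64 {
	return v.balances[accountID]
}

func main() {
	view := NewBalanceView()
	for _, e := range []BalanceEvent{
		{"acct-1", 100}, {"acct-1", -30}, {"acct-2", 50},
	} {
		view.Apply(e)
	}
	fmt.Println(view.Get("acct-1")) // 70
}
```

The trade-off versus read-through caching: there is no invalidation logic and no cold-miss latency, but the view must be fully rebuildable from the stream after a restart.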

6. Implement Event Compaction for Cold Data

Long-running aggregates accumulate thousands of events. Event compaction reduces storage costs and speeds up historical replays.

```go
type EventCompactor struct {
	eventStore    EventStore
	snapshotStore SnapshotStore
	retention     time.Duration
}

func (c *EventCompactor) Compact(ctx context.Context, aggregateID string) error {
	snapshot, err := c.snapshotStore.GetLatest(ctx, aggregateID)
	if err != nil {
		return err
	}
	if snapshot == nil {
		// Without a snapshot there is no safe point to compact behind.
		return nil
	}

	// Archive events older than the retention period that precede the snapshot.
	cutoff := time.Now().Add(-c.retention)
	archivedCount, err := c.eventStore.ArchiveBefore(ctx, aggregateID, snapshot.Version, cutoff)
	if err != nil {
		return err
	}

	log.Printf("Compacted %d events for aggregate %s", archivedCount, aggregateID)
	return nil
}
```

Move compacted events to cold storage (S3, GCS) with lifecycle policies. Maintain an index for compliance queries that need historical access.
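The compliance index can be a small lookup table mapping an aggregate's archived version ranges to cold-storage object keys. A sketch with assumed names; the real index would live in a durable store alongside the event store metadata:

```go
package main

import "fmt"

// ArchiveEntry records where an aggregate's archived event range lives.
type ArchiveEntry struct {
	FromVersion, ToVersion int64
	ObjectKey              string // e.g. an S3/GCS object key
}

// ArchiveIndex answers "which cold-storage object holds version V of
// aggregate X?" for compliance queries after compaction.
type ArchiveIndex struct {
	entries map[string][]ArchiveEntry
}

func NewArchiveIndex() *ArchiveIndex {
	return &ArchiveIndex{entries: make(map[string][]ArchiveEntry)}
}

func (i *ArchiveIndex) Record(aggregateID string, e ArchiveEntry) {
	i.entries[aggregateID] = append(i.entries[aggregateID], e)
}

func (i *ArchiveIndex) Locate(aggregateID string, version int64) (string, bool) {
	for _, e := range i.entries[aggregateID] {
		if version >= e.FromVersion && version <= e.ToVersion {
			return e.ObjectKey, true
		}
	}
	return "", false
}

func main() {
	idx := NewArchiveIndex()
	idx.Record("order-9", ArchiveEntry{1, 500, "archive/order-9/0001-0500.json.gz"})
	key, ok := idx.Locate("order-9", 42)
	fmt.Println(key, ok)
}
```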


Anti-Patterns to Avoid

Synchronous Projections in the Command Path

Never update projections synchronously during command processing. This couples write latency to projection complexity and eliminates the scaling benefit of CQRS. Accept eventual consistency and design your UX around it.

Global Ordering Where Partition Ordering Suffices

Global event ordering is expensive — it forces single-writer bottlenecks. Most business requirements only need ordering within an aggregate or partition. Challenge any requirement for global ordering and explore whether causal ordering or partition-local ordering meets the actual need.

Under-Provisioning the Dead Letter Queue

At 50K events/second, even a 0.01% failure rate generates 5 failed events per second. Without proper dead letter queue handling, monitoring, and replay tooling, these failures compound into data inconsistencies.

Ignoring Projection Rebuild Time

If rebuilding a projection takes 12 hours, you effectively cannot deploy schema changes to that projection. Track rebuild time as a metric and invest in parallel rebuild infrastructure before it becomes a deployment bottleneck.
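Parallel rebuild follows the same partition layout as live consumption: replay each partition on its own goroutine into a fresh view, then swap it for the live one. A minimal sketch, with per-partition sums standing in for applying events to a projection:

```go
package main

import (
	"fmt"
	"sync"
)

// rebuildParallel replays each partition's events concurrently into a
// fresh view; the caller atomically swaps it for the live projection.
func rebuildParallel(partitions [][]int64) map[int]int64 {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		totals = make(map[int]int64)
	)
	for p, events := range partitions {
		wg.Add(1)
		go func(p int, events []int64) {
			defer wg.Done()
			var sum int64
			for _, e := range events {
				sum += e // stand-in for applying an event to the view
			}
			mu.Lock()
			totals[p] = sum
			mu.Unlock()
		}(p, events)
	}
	wg.Wait()
	return totals
}

func main() {
	totals := rebuildParallel([][]int64{{1, 2, 3}, {10, 20}})
	fmt.Println(totals[0], totals[1]) // 6 30
}
```

Because partitions never share aggregates, the rebuild parallelizes with no cross-partition coordination; wall-clock time divides roughly by the partition count, bounded by the slowest partition.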

High-Scale Readiness Checklist

  • Event store partitioned by aggregate ID with consistent hashing
  • Snapshot strategy implemented with sub-100ms rehydration target
  • Projections running on parallel consumer groups
  • Back-pressure mechanism on command ingestion
  • Read-through cache layer for hot projections (Redis/Memcached)
  • Event compaction and archival pipeline for cold data
  • Dead letter queue with automated retry and alerting
  • Projection lag monitoring with SLA-based alerts (< 5s for critical projections)
  • Load tested at 3x expected peak throughput
  • Partition rebalancing tested with zero downtime
  • Event replay tooling supports parallel full rebuild under 1 hour
  • Graceful degradation plan when projections are unavailable

Conclusion

High-scale CQRS and Event Sourcing demand a systems-thinking approach where every component is designed for horizontal scaling and graceful degradation. The patterns that differentiate high-scale implementations — partitioned event stores, aggressive snapshotting, parallel projection engines, and back-pressure mechanisms — all share a common theme: accepting distributed systems realities rather than fighting them.

Start by establishing your throughput baseline, identify the bottleneck (it is usually projection lag), and optimize from there. Monitor event store partition distribution, projection consumer lag, and cache hit ratios as your three north-star metrics.

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
