CQRS and Event Sourcing become essential tools when your system processes millions of events per day and read workloads outpace writes by orders of magnitude. At high scale, the architectural decisions you make around event storage, projection infrastructure, and consistency boundaries directly impact your cloud bill and your on-call team's sleep quality. These best practices come from operating CQRS/ES systems handling 50K+ events per second across distributed clusters.
The High-Scale Imperative
High-scale teams face different constraints than enterprise teams. Latency budgets measured in single-digit milliseconds. Event throughput that demands horizontal partitioning. Read models serving millions of concurrent users. The patterns that work for a team processing 100 events per second actively harm you at 100,000.
The fundamental architecture remains the same — commands produce events, events build projections — but every component must be designed for horizontal scalability, partition tolerance, and graceful degradation under load.
Best Practices for High-Scale CQRS & Event Sourcing
1. Partition Your Event Store by Aggregate ID
At high scale, a single-partition event store becomes a bottleneck. Partition events by aggregate ID to distribute write load and enable parallel projection processing.
Use consistent hashing for partition assignment to minimize redistribution when adding partitions. EventStoreDB supports server-side partitioning; with Kafka as the event store, topic partitions map directly to this pattern.
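As a sketch of the assignment step, a consistent-hash ring with virtual nodes keeps most aggregate-to-partition mappings stable when a partition is added. The `ConsistentHashRing` class and vnode count below are illustrative, not any particular library's API:

```python
import bisect
import hashlib

def _stable_hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps aggregate IDs to partitions. Adding a partition only remaps
    the keys that fall into the new partition's arcs of the ring."""

    def __init__(self, partitions, vnodes: int = 64):
        # Each partition gets `vnodes` points on the ring for even spread.
        self._ring = sorted(
            (_stable_hash(f"{p}#{v}"), p) for p in partitions for v in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def partition_for(self, aggregate_id: str) -> str:
        # Walk clockwise to the first vnode at or past the key's hash.
        idx = bisect.bisect(self._keys, _stable_hash(aggregate_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"events-{i}" for i in range(8)])
# All events for one aggregate land on the same partition,
# preserving per-aggregate ordering.
assert ring.partition_for("order-42") == ring.partition_for("order-42")
```

Growing the ring from 8 to 9 partitions remaps only roughly 1/9 of aggregates, instead of nearly all of them as naive modulo hashing would.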
2. Use Snapshots Aggressively
At high throughput, rehydrating aggregates from full event history is prohibitively expensive. Snapshot after every N events and on every significant state transition.
Target snapshot intervals of 50-100 events. Store snapshots in a dedicated fast-access store (Redis, DynamoDB) separate from the event store to avoid adding read pressure.
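A minimal sketch of snapshot-assisted rehydration, using in-memory dicts as stand-ins for the real snapshot store (Redis/DynamoDB) and event store:

```python
SNAPSHOT_INTERVAL = 50  # events between snapshots, per the 50-100 guideline

class InMemorySnapshotStore:
    """Stand-in for a fast-access store like Redis or DynamoDB."""
    def __init__(self):
        self._snaps = {}
    def get(self, agg_id):
        return self._snaps.get(agg_id)
    def put(self, agg_id, state, version):
        self._snaps[agg_id] = {"state": dict(state), "version": version}

def rehydrate(agg_id, snapshots, events, apply_event):
    """Load the latest snapshot, then replay only the events recorded
    after it, instead of the aggregate's full history."""
    snap = snapshots.get(agg_id)
    state, version = (dict(snap["state"]), snap["version"]) if snap else ({}, 0)
    for event in events[version:]:  # a list stands in for the event store here
        state = apply_event(state, event)
        version += 1
    return state, version

def append(agg_id, events, event, snapshots, apply_event):
    """Append an event and snapshot every SNAPSHOT_INTERVAL events."""
    events.append(event)
    if len(events) % SNAPSHOT_INTERVAL == 0:
        state, version = rehydrate(agg_id, snapshots, events, apply_event)
        snapshots.put(agg_id, state, version)
```

With a 50-event interval, rehydration replays at most 49 events plus one snapshot read, regardless of how long the aggregate has lived.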
3. Build Projections with Parallel Consumers
A single projection consumer cannot keep up with high-throughput event streams. Parallelize projection building across partitions while maintaining ordering guarantees within each aggregate.
Use consumer group protocols (Kafka consumer groups, EventStoreDB competing consumers) to distribute partitions across workers. This gives you horizontal scaling and automatic rebalancing.
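A toy batch version of the idea: partitions are drained concurrently, but each partition's events are applied strictly in order. Real consumer groups add offset commits and rebalancing, which this sketch omits:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def project_in_parallel(events, partition_of, apply_to_projection, workers=4):
    """Process partitions concurrently while applying each partition's
    events in order, mirroring a consumer-group partition assignment."""
    by_partition = defaultdict(list)
    for event in events:
        by_partition[partition_of(event)].append(event)

    def drain(partition_events):
        for event in partition_events:  # in-order within the partition
            apply_to_projection(event)

    # One worker per partition at most; partitions never interleave.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(drain, by_partition.values()))
```

Because partitioning is by aggregate ID, no two workers ever touch the same aggregate, so per-aggregate ordering survives the parallelism.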
4. Implement Back-Pressure on Command Processing
When event throughput spikes, back-pressure prevents cascade failures. Rate-limit command acceptance based on downstream capacity.
Monitor projection lag as a key health signal. When projections fall behind by more than your SLA threshold, apply back-pressure to commands to let projections catch up.
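A minimal sketch of a lag-aware admission gate: the `lag_fn` hook is a hypothetical callback into your metrics pipeline (e.g. consumer-group lag), and rejected commands would be surfaced to callers as a retry-later response such as HTTP 429:

```python
LAG_SLA_SECONDS = 5.0  # matches the checklist's critical-projection SLA

class LagAwareGate:
    """Sheds commands while projection lag exceeds the SLA,
    giving projections headroom to catch up."""

    def __init__(self, lag_fn, sla=LAG_SLA_SECONDS):
        self._lag_fn = lag_fn  # hypothetical hook: returns current lag in seconds
        self._sla = sla

    def try_accept(self, command) -> bool:
        if self._lag_fn() > self._sla:
            return False  # caller retries with backoff or returns 429
        return True
```

Production versions usually soften this with probabilistic shedding or priority tiers rather than a hard cutoff, but the control loop is the same: lag goes up, admission goes down.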
5. Use Read-Through Caching for Hot Projections
At high read volumes, even optimized projections need a caching layer. Implement read-through caches that invalidate on event arrival.
For read models serving 100K+ QPS, consider materialized views in Redis with event-driven updates rather than database-backed projections with caching.
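A sketch of the invalidate-on-event pattern, with a plain dict standing in for Redis/Memcached and a caller-supplied loader standing in for the projection store:

```python
class ReadThroughCache:
    """Serves reads from cache, falling back to the projection store on a
    miss; the event handler invalidates affected keys as events arrive."""

    def __init__(self, load_fn):
        self._cache = {}
        self._load = load_fn  # reads from the backing projection store

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)  # read-through on miss
        return self._cache[key]

    def on_event(self, event):
        # Invalidate rather than update, so the next read re-materializes
        # from the projection and stale writes can't race each other.
        self._cache.pop(event["aggregate_id"], None)
```

Invalidation (versus in-place update) trades one extra read per changed key for immunity to out-of-order update races, which matters once events arrive from parallel consumers.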
6. Implement Event Compaction for Cold Data
Long-running aggregates accumulate thousands of events. Event compaction reduces storage costs and speeds up historical replays.
Move compacted events to cold storage (S3, GCS) with lifecycle policies. Maintain an index for compliance queries that need historical access.
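The compaction step itself can be sketched as a fold: a run of cold events collapses into one summary record carrying the derived state, a count of what it replaces, and a pointer to the archived originals. The record shape and `archive_ref` field are illustrative:

```python
def compact(events, apply_event, initial_state=None):
    """Fold a cold aggregate's event run into one compacted record,
    keeping a pointer to the archived originals for compliance lookups."""
    state = dict(initial_state or {})
    for event in events:
        state = apply_event(state, event)
    return {
        "type": "Compacted",
        "state": state,           # derived state replaces the raw events
        "replaces": len(events),  # how many events this record stands in for
        "archive_ref": None,      # hypothetical: set to the S3/GCS key after upload
    }
```

Replays then apply the compacted record as a single state-load, while the cold-storage archive remains queryable through the compliance index.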

Anti-Patterns to Avoid
Synchronous Projections in the Command Path
Never update projections synchronously during command processing. This couples write latency to projection complexity and eliminates the scaling benefit of CQRS. Accept eventual consistency and design your UX around it.
Global Ordering Where Partition Ordering Suffices
Global event ordering is expensive — it forces single-writer bottlenecks. Most business requirements only need ordering within an aggregate or partition. Challenge any requirement for global ordering and explore whether causal or partition-local ordering meets the actual need.
Under-Provisioning the Dead Letter Queue
At 50K events/second, even a 0.01% failure rate generates 5 failed events per second. Without proper dead letter queue handling, monitoring, and replay tooling, these failures compound into data inconsistencies.
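A minimal sketch of bounded retry with a dead letter queue; alerting and replay tooling are omitted, and the three-attempt limit and record shape are illustrative:

```python
MAX_RETRIES = 3  # illustrative bound; tune per failure mode

def process_with_dlq(events, handler, dead_letters):
    """Retry each failed event a bounded number of times, then park it
    on the dead letter queue with its error for later replay."""
    for event in events:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                handler(event)
                break  # success: move on to the next event
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    # Exhausted retries: record enough context to replay later.
                    dead_letters.append(
                        {"event": event, "error": repr(exc), "attempts": attempt}
                    )
```

The key point is that poison messages leave the hot path quickly instead of blocking the partition, and the DLQ record carries what replay tooling needs to reprocess them.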
Ignoring Projection Rebuild Time
If rebuilding a projection takes 12 hours, you effectively cannot deploy schema changes to that projection. Track rebuild time as a metric and invest in parallel rebuild infrastructure before it becomes a deployment bottleneck.
High-Scale Readiness Checklist
- Event store partitioned by aggregate ID with consistent hashing
- Snapshot strategy implemented with sub-100ms rehydration target
- Projections running on parallel consumer groups
- Back-pressure mechanism on command ingestion
- Read-through cache layer for hot projections (Redis/Memcached)
- Event compaction and archival pipeline for cold data
- Dead letter queue with automated retry and alerting
- Projection lag monitoring with SLA-based alerts (< 5s for critical projections)
- Load tested at 3x expected peak throughput
- Partition rebalancing tested with zero downtime
- Event replay tooling supports a parallel full rebuild in under 1 hour
- Graceful degradation plan when projections are unavailable
Conclusion
High-scale CQRS and Event Sourcing demand a systems-thinking approach where every component is designed for horizontal scaling and graceful degradation. The patterns that differentiate high-scale implementations — partitioned event stores, aggressive snapshotting, parallel projection engines, and back-pressure mechanisms — all share a common theme: accepting distributed systems realities rather than fighting them.
Start by establishing your throughput baseline, identify the bottleneck (it is usually projection lag), and optimize from there. Monitor event store partition distribution, projection consumer lag, and cache hit ratios as your three north-star metrics.