System Design

Event-Driven Architecture Best Practices for High Scale Teams

Battle-tested best practices for Event-Driven Architecture tailored to High Scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 12 min read

When your event-driven system processes 500K events per second and a single consumer falling behind triggers a cascade that affects 40 downstream services, you need engineering practices that go beyond standard patterns. High-scale EDA operates under different constraints than typical enterprise systems—network bandwidth becomes a bottleneck, partition rebalancing can cause minutes of downtime, and a single misconfigured consumer can exhaust your entire Kafka cluster's capacity.

These practices come from operating event-driven systems processing 2B+ events per day across real-time bidding platforms, financial trading systems, and large-scale e-commerce infrastructure.

Throughput-Optimized Producer Configuration

Batching and Compression

The single biggest throughput lever is producer batching. Instead of sending one event per network request, accumulate events and send them in batches:

```java
Properties props = new Properties();
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // 64KB batches
props.put(ProducerConfig.LINGER_MS_CONFIG, 10); // Wait up to 10ms
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864); // 64MB buffer
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd"); // Best ratio
props.put(ProducerConfig.ACKS_CONFIG, "1"); // Leader ack only
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
```

Impact on a 16-core producer:

| Config | Throughput | Latency (p50) | Latency (p99) |
|---|---|---|---|
| Default (no batching) | 50K msg/s | 2ms | 15ms |
| 64KB batch, 5ms linger | 400K msg/s | 7ms | 25ms |
| 64KB batch, 10ms linger, zstd | 600K msg/s | 12ms | 35ms |

Zstd compression reduces network bandwidth by 70-80% for JSON payloads with minimal CPU overhead. The tradeoff is ~10ms additional latency from linger time—acceptable for most event-driven flows.

Partition Count Sizing

Partitions are the unit of parallelism. More partitions enable more concurrent consumers but increase metadata overhead:

```
Target throughput: 500K events/sec
Single consumer throughput: 10K events/sec
Required partitions: 500K / 10K = 50 partitions
With 30% headroom: 65 → round to 64 (nearest power of two)
```

Rules of thumb:

  • 1 partition handles 10-30K events/sec depending on message size
  • Keep total partitions per broker under 4,000
  • Over-partitioning (>200 per topic) increases leader election time and rebalance duration
  • You can increase partitions but never decrease them
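The sizing arithmetic above can be wrapped in a small helper. This is a sketch, not canon: the ceiling division, integer-percent headroom, and nearest-power-of-two snap are illustrative choices.

```java
public class PartitionSizing {

    /** Partitions for a target rate with percent headroom, snapped to the
     *  nearest power of two (so 65 becomes 64, as in the worked example). */
    static int sizePartitions(long targetPerSec, long perConsumerPerSec, int headroomPct) {
        long required = (targetPerSec + perConsumerPerSec - 1) / perConsumerPerSec; // ceil
        long withHeadroom = (required * (100 + headroomPct) + 99) / 100;            // ceil
        return nearestPowerOfTwo((int) withHeadroom);
    }

    static int nearestPowerOfTwo(int n) {
        int below = Integer.highestOneBit(n);      // largest power of two <= n
        int above = (below == n) ? n : below << 1; // smallest power of two >= n
        return (n - below) <= (above - n) ? below : above;
    }
}
```

For the numbers in the text, `sizePartitions(500_000, 10_000, 30)` yields 64.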

Consumer Optimization

Parallel Processing Within a Partition

Processing events sequentially within a partition is safe but slow. When you only need per-key ordering rather than strict ordering across the entire partition, use a thread pool:

```java
// Requires a batch listener (spring.kafka.listener.type=batch or a
// batch-enabled container factory) for the List-based signature.
@Service
public class ParallelEventConsumer {

    private final ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors() * 2
    );

    @KafkaListener(topics = "events", groupId = "parallel-consumer")
    public void consume(List<ConsumerRecord<String, Event>> records) {
        // Group by key (ordering preserved per key)
        Map<String, List<ConsumerRecord<String, Event>>> byKey = records.stream()
            .collect(Collectors.groupingBy(ConsumerRecord::key));

        // Process different keys in parallel
        List<CompletableFuture<Void>> futures = byKey.values().stream()
            .map(keyRecords -> CompletableFuture.runAsync(
                () -> processSequentially(keyRecords), executor))
            .toList();

        // Wait for all to complete before committing
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }

    private void processSequentially(List<ConsumerRecord<String, Event>> records) {
        for (var record : records) {
            processEvent(record.value());
        }
    }
}
```

This preserves per-key ordering (critical for aggregate consistency) while parallelizing across keys. On a 16-core machine, this approach increases throughput from 10K to 80K events/sec per consumer instance.

Consumer Lag Management

Consumer lag is the time gap between event production and consumption. At high scale, lag compounds quickly:

```
Production rate: 500K events/sec
Consumer processing rate: 480K events/sec
Lag growth: 20K events/sec → 72M events/hour → unrecoverable
```

A 4% throughput deficit makes the system unrecoverable within hours. Monitor lag per consumer group and set alerts aggressively:

| Lag Level | Events Behind | Action |
|---|---|---|
| Normal | <10K | No action |
| Warning | 10K-100K | Investigate, prepare to scale |
| Critical | 100K-1M | Scale consumers immediately |
| Emergency | >1M | Activate catch-up mode, consider skipping |
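The thresholds above translate directly into code. A minimal sketch; the enum names and the drain-time estimate are illustrative, not part of any standard API:

```java
public class LagPolicy {

    enum Level { NORMAL, WARNING, CRITICAL, EMERGENCY }

    /** Classify consumer-group lag per the alert table. */
    static Level classify(long eventsBehind) {
        if (eventsBehind > 1_000_000) return Level.EMERGENCY;
        if (eventsBehind > 100_000)   return Level.CRITICAL;
        if (eventsBehind > 10_000)    return Level.WARNING;
        return Level.NORMAL;
    }

    /** Seconds until lag drains, or -1 when the consumer can never catch up
     *  (the 500K-produce vs 480K-consume case from the text). */
    static long secondsToDrain(long lag, long produceRate, long consumeRate) {
        long surplus = consumeRate - produceRate;
        if (surplus <= 0) return -1; // lag grows without bound
        return (lag + surplus - 1) / surplus; // ceil
    }
}
```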

Catch-Up Mode

When a consumer falls far behind, switch to catch-up mode—simplified processing that skips non-essential work:

```typescript
function processEvent(event: Event, lag: number): void {
  if (lag > 1_000_000) {
    // Catch-up mode: only essential processing
    processCriticalPath(event);
    return;
  }

  if (lag > 100_000) {
    // Degraded mode: skip enrichment and analytics
    processCoreLogic(event);
    return;
  }

  // Normal mode: full processing
  processFullPipeline(event);
}
```

Exactly-Once Processing at Scale

Idempotent Consumers

At 500K events/sec, duplicates are inevitable—network retries, consumer rebalances, and producer retries all create duplicates. Make every consumer idempotent:

```sql
-- Deduplication table: composite key, since the same event may be
-- processed independently by different consumer groups
CREATE TABLE processed_events (
    event_id UUID NOT NULL,
    consumer_group VARCHAR(255) NOT NULL,
    processed_at TIMESTAMP NOT NULL DEFAULT NOW(),
    partition_id INT NOT NULL,
    offset_val BIGINT NOT NULL,
    PRIMARY KEY (event_id, consumer_group)
);

-- Index for cleanup
CREATE INDEX idx_processed_events_time
    ON processed_events (processed_at);
```
```java
@Transactional
public void processEvent(Event event) {
    // Check + insert in the same transaction; the composite primary key
    // is the backstop if two consumers race past this check
    if (eventRepository.existsByEventIdAndConsumerGroup(
            event.getEventId(), CONSUMER_GROUP)) {
        log.debug("Duplicate event: {}", event.getEventId());
        return;
    }

    // Process the event
    executeBusinessLogic(event);

    // Mark as processed
    eventRepository.save(new ProcessedEvent(
        event.getEventId(), CONSUMER_GROUP,
        Instant.now(), event.getPartition(), event.getOffset()
    ));
}
```

Clean up the deduplication table periodically. Events older than the consumer's committed offset plus a safety margin can be purged:

```sql
DELETE FROM processed_events WHERE processed_at < NOW() - INTERVAL '24 hours';
```

Transactional Outbox Pattern

Ensure events are published only when the database transaction commits:

```java
@Transactional
public void placeOrder(OrderRequest request) throws JsonProcessingException {
    // 1. Save to database
    Order order = orderRepository.save(new Order(request));

    // 2. Write to outbox table (same transaction)
    outboxRepository.save(new OutboxEvent(
        UUID.randomUUID(),   // event id
        order.getId(),       // aggregate id, used as the Kafka message key below
        "OrderPlaced",
        objectMapper.writeValueAsString(order),
        Instant.now()
    ));
}

// Separate process polls outbox and publishes to Kafka
@Scheduled(fixedDelay = 100)
public void publishOutboxEvents() {
    List<OutboxEvent> pending = outboxRepository
        .findByPublishedFalseOrderByCreatedAtAsc();

    for (OutboxEvent event : pending) {
        kafkaTemplate.send("orders.events", event.getAggregateId(), event.getPayload());
        event.setPublished(true);
        outboxRepository.save(event);
    }
}
```

At high scale, use Debezium CDC (Change Data Capture) instead of polling—it captures outbox table changes from the database's transaction log with sub-second latency and zero polling overhead.
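As a rough sketch, a Debezium setup pairs the Postgres connector with the built-in outbox event router; the hostnames, database, and table names below are assumptions that must match your outbox schema:

```properties
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=db.internal
database.dbname=orders
table.include.list=public.outbox_event
transforms=outbox
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
```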


Backpressure Management

Producer-Side Backpressure

When the broker can't keep up, the producer's buffer fills up. Handle this gracefully:

```java
try {
    Future<RecordMetadata> future = producer.send(record);
    future.get(5, TimeUnit.SECONDS); // block until the broker acks
    consecutiveFailures = 0;
} catch (TimeoutException e) {
    // Broker is overwhelmed — apply backpressure
    consecutiveFailures++;
    metrics.increment("producer.backpressure");
    Thread.sleep(calculateBackoff(consecutiveFailures));
    // Consider writing to local disk as overflow buffer
    overflowBuffer.write(record);
}
// InterruptedException and ExecutionException handling omitted for brevity
```

Consumer-Side Backpressure

Use pause() and resume() to control consumption rate:

```java
@KafkaListener(topics = "events")
public void consume(ConsumerRecord<String, Event> record,
                    Consumer<String, Event> consumer) {
    int queueDepth = processingQueue.size();

    if (queueDepth > 10_000) {
        // Pause consumption until queue drains
        consumer.pause(consumer.assignment());
        metrics.increment("consumer.paused");
    } else if (queueDepth < 1_000 && !consumer.paused().isEmpty()) {
        consumer.resume(consumer.assignment());
        metrics.increment("consumer.resumed");
    }

    processingQueue.offer(record);
}
```

Network and Infrastructure Optimization

Broker Placement

For multi-AZ deployments, configure min.insync.replicas=2 with replication.factor=3. Place brokers across 3 availability zones:

```
AZ-1: broker-1, broker-4
AZ-2: broker-2, broker-5
AZ-3: broker-3, broker-6
```

This layout survives a full AZ failure without losing acknowledged writes, provided producers use acks=all for durability-critical topics (note the tension with the acks=1 throughput setting earlier). The tradeoff is cross-AZ replication latency (~2ms per hop).

Network Tuning

At 500K events/sec with 1KB average message size, you're pushing 500MB/s through the network. Tune OS-level buffers:

```bash
# Producer/consumer hosts
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
```

Monitoring at Scale

Standard monitoring tools struggle at high event rates. Use sampling for detailed traces:

```java
// Sample 0.1% of events for detailed tracing
boolean shouldTrace = ThreadLocalRandom.current().nextDouble() < 0.001;

if (shouldTrace) {
    Span span = tracer.buildSpan("process-event")
        .withTag("event.type", event.getType())
        .withTag("event.partition", event.getPartition())
        .start();
    // ... detailed instrumentation
    span.finish();
}

// Always record aggregate metrics
metrics.counter("events.processed").increment();
metrics.timer("events.latency").record(processingTime, TimeUnit.MILLISECONDS);
```

High-Scale Checklist

  • Producer batching configured (batch.size ≥ 64KB, linger.ms 5-20ms)
  • Compression enabled (zstd for best ratio, lz4 for lowest CPU)
  • Partition count sized for target throughput with 30% headroom
  • Consumers use parallel processing per partition with key-based ordering
  • Consumer lag alerts at 10K (warn), 100K (critical), 1M (emergency)
  • Catch-up mode implemented for degraded processing when behind
  • All consumers are idempotent with deduplication
  • Transactional outbox or CDC for reliable event publishing
  • Backpressure handling on both producer and consumer side
  • Multi-AZ deployment with min.insync.replicas=2
  • OS network buffers tuned for high throughput
  • Trace sampling at 0.1-1% with aggregate metrics at 100%

Conclusion

High-scale event-driven architecture is about managing the physics of distributed systems: network bandwidth, partition rebalancing, and the compounding effect of small throughput deficits. The patterns here—producer batching, parallel consumption, idempotent processing, and backpressure management—are the critical levers.

Start with accurate capacity planning. Know your peak throughput, average message size, and required end-to-end latency. Size partitions and consumers accordingly, then add 30% headroom. Monitor consumer lag as your primary health indicator—it tells you whether your system is keeping up before anything else breaks.

The difference between a system that handles 100K events/sec and one that handles 1M events/sec isn't architectural—it's operational. Both use the same patterns. The high-scale system has better tuning, more aggressive monitoring, and faster incident response.


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
