System Design

Event-Driven Architecture Best Practices for High Scale Teams

Battle-tested best practices for Event-Driven Architecture tailored to High Scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 12 min read

When your event-driven system processes 500K events per second and a single consumer falling behind triggers a cascade that affects 40 downstream services, you need engineering practices that go beyond standard patterns. High-scale EDA operates under different constraints than typical enterprise systems—network bandwidth becomes a bottleneck, partition rebalancing can cause minutes of downtime, and a single misconfigured consumer can exhaust your entire Kafka cluster's capacity.

These practices come from operating event-driven systems processing 2B+ events per day across real-time bidding platforms, financial trading systems, and large-scale e-commerce infrastructure.

Throughput-Optimized Producer Configuration

Batching and Compression

The single biggest throughput lever is producer batching. Instead of sending one event per network request, accumulate events and send them in batches:

```java
Properties props = new Properties();
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // 64KB batches
props.put(ProducerConfig.LINGER_MS_CONFIG, 10); // Wait up to 10ms
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864); // 64MB buffer
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd"); // Best ratio
props.put(ProducerConfig.ACKS_CONFIG, "1"); // Leader ack only
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
```

Impact on a 16-core producer:

| Config | Throughput | Latency (p50) | Latency (p99) |
|---|---|---|---|
| Default (no batching) | 50K msg/s | 2ms | 15ms |
| 64KB batch, 5ms linger | 400K msg/s | 7ms | 25ms |
| 64KB batch, 10ms linger, zstd | 600K msg/s | 12ms | 35ms |

Zstd compression reduces network bandwidth by 70-80% for JSON payloads with minimal CPU overhead. The tradeoff is ~10ms additional latency from linger time—acceptable for most event-driven flows.

Partition Count Sizing

Partitions are the unit of parallelism. More partitions enable more concurrent consumers but increase metadata overhead:

```
Target throughput: 500K events/sec
Single consumer throughput: 10K events/sec
Required partitions: 500K / 10K = 50 partitions
With 30% headroom: 65 → round to 64 (nearest power of two)
```

Rules of thumb:

  • 1 partition handles 10-30K events/sec depending on message size
  • Keep total partitions per broker under 4,000
  • Over-partitioning (>200 per topic) increases leader election time and rebalance duration
  • You can increase partitions but never decrease them
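The sizing arithmetic above can be wrapped in a small helper. This is a sketch, not canon: the ceiling division, integer-percent headroom, and nearest-power-of-two snap are illustrative choices.

```java
public class PartitionSizing {

    /** Partitions for a target rate with percent headroom, snapped to the
     *  nearest power of two (so 65 becomes 64, as in the worked example). */
    static int sizePartitions(long targetPerSec, long perConsumerPerSec, int headroomPct) {
        long required = (targetPerSec + perConsumerPerSec - 1) / perConsumerPerSec; // ceil
        long withHeadroom = (required * (100 + headroomPct) + 99) / 100;            // ceil
        return nearestPowerOfTwo((int) withHeadroom);
    }

    static int nearestPowerOfTwo(int n) {
        int below = Integer.highestOneBit(n);      // largest power of two <= n
        int above = (below == n) ? n : below << 1; // smallest power of two >= n
        return (n - below) <= (above - n) ? below : above;
    }
}
```

For the numbers in the text, `sizePartitions(500_000, 10_000, 30)` yields 64.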

Consumer Optimization

Parallel Processing Within a Partition

Processing events sequentially within a partition is safe but slow. When you only need per-key ordering rather than strict ordering across the entire partition, use a thread pool:

```java
// Requires a batch listener (spring.kafka.listener.type=batch or a
// batch-enabled container factory) for the List-based signature.
@Service
public class ParallelEventConsumer {

    private final ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors() * 2
    );

    @KafkaListener(topics = "events", groupId = "parallel-consumer")
    public void consume(List<ConsumerRecord<String, Event>> records) {
        // Group by key (ordering preserved per key)
        Map<String, List<ConsumerRecord<String, Event>>> byKey = records.stream()
            .collect(Collectors.groupingBy(ConsumerRecord::key));

        // Process different keys in parallel
        List<CompletableFuture<Void>> futures = byKey.values().stream()
            .map(keyRecords -> CompletableFuture.runAsync(
                () -> processSequentially(keyRecords), executor))
            .toList();

        // Wait for all to complete before committing
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }

    private void processSequentially(List<ConsumerRecord<String, Event>> records) {
        for (var record : records) {
            processEvent(record.value());
        }
    }
}
```

This preserves per-key ordering (critical for aggregate consistency) while parallelizing across keys. On a 16-core machine, this approach increases throughput from 10K to 80K events/sec per consumer instance.

Consumer Lag Management

Consumer lag is the time gap between event production and consumption. At high scale, lag compounds quickly:

```
Production rate: 500K events/sec
Consumer processing rate: 480K events/sec
Lag growth: 20K events/sec → 72M events/hour → unrecoverable
```

A 4% throughput deficit makes the system unrecoverable within hours. Monitor lag per consumer group and set alerts aggressively:

| Lag Level | Events Behind | Action |
|---|---|---|
| Normal | <10K | No action |
| Warning | 10K-100K | Investigate, prepare to scale |
| Critical | 100K-1M | Scale consumers immediately |
| Emergency | >1M | Activate catch-up mode, consider skipping |
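The thresholds above translate directly into code. A minimal sketch; the enum names and the drain-time estimate are illustrative, not part of any standard API:

```java
public class LagPolicy {

    enum Level { NORMAL, WARNING, CRITICAL, EMERGENCY }

    /** Classify consumer-group lag per the alert table. */
    static Level classify(long eventsBehind) {
        if (eventsBehind > 1_000_000) return Level.EMERGENCY;
        if (eventsBehind > 100_000)   return Level.CRITICAL;
        if (eventsBehind > 10_000)    return Level.WARNING;
        return Level.NORMAL;
    }

    /** Seconds until lag drains, or -1 when the consumer can never catch up
     *  (the 500K-produce vs 480K-consume case from the text). */
    static long secondsToDrain(long lag, long produceRate, long consumeRate) {
        long surplus = consumeRate - produceRate;
        if (surplus <= 0) return -1; // lag grows without bound
        return (lag + surplus - 1) / surplus; // ceil
    }
}
```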

Catch-Up Mode

When a consumer falls far behind, switch to catch-up mode—simplified processing that skips non-essential work:

```typescript
function processEvent(event: Event, lag: number): void {
  if (lag > 1_000_000) {
    // Catch-up mode: only essential processing
    processCriticalPath(event);
    return;
  }

  if (lag > 100_000) {
    // Degraded mode: skip enrichment and analytics
    processCoreLogic(event);
    return;
  }

  // Normal mode: full processing
  processFullPipeline(event);
}
```

Exactly-Once Processing at Scale

Idempotent Consumers

At 500K events/sec, duplicates are inevitable—network retries, consumer rebalances, and producer retries all create duplicates. Make every consumer idempotent:

```sql
-- Deduplication table: composite key, since the same event may be
-- processed independently by different consumer groups
CREATE TABLE processed_events (
    event_id UUID NOT NULL,
    consumer_group VARCHAR(255) NOT NULL,
    processed_at TIMESTAMP NOT NULL DEFAULT NOW(),
    partition_id INT NOT NULL,
    offset_val BIGINT NOT NULL,
    PRIMARY KEY (event_id, consumer_group)
);

-- Index for cleanup
CREATE INDEX idx_processed_events_time
    ON processed_events (processed_at);
```
```java
@Transactional
public void processEvent(Event event) {
    // Check + insert in the same transaction; the composite primary key
    // is the backstop if two consumers race past this check
    if (eventRepository.existsByEventIdAndConsumerGroup(
            event.getEventId(), CONSUMER_GROUP)) {
        log.debug("Duplicate event: {}", event.getEventId());
        return;
    }

    // Process the event
    executeBusinessLogic(event);

    // Mark as processed
    eventRepository.save(new ProcessedEvent(
        event.getEventId(), CONSUMER_GROUP,
        Instant.now(), event.getPartition(), event.getOffset()
    ));
}
```

Clean up the deduplication table periodically. Events older than the consumer's committed offset plus a safety margin can be purged:

```sql
DELETE FROM processed_events WHERE processed_at < NOW() - INTERVAL '24 hours';
```

Transactional Outbox Pattern

Ensure events are published only when the database transaction commits:

```java
@Transactional
public void placeOrder(OrderRequest request) throws JsonProcessingException {
    // 1. Save to database
    Order order = orderRepository.save(new Order(request));

    // 2. Write to outbox table (same transaction)
    outboxRepository.save(new OutboxEvent(
        UUID.randomUUID(),   // event id
        order.getId(),       // aggregate id, used as the Kafka message key below
        "OrderPlaced",
        objectMapper.writeValueAsString(order),
        Instant.now()
    ));
}

// Separate process polls outbox and publishes to Kafka
@Scheduled(fixedDelay = 100)
public void publishOutboxEvents() {
    List<OutboxEvent> pending = outboxRepository
        .findByPublishedFalseOrderByCreatedAtAsc();

    for (OutboxEvent event : pending) {
        kafkaTemplate.send("orders.events", event.getAggregateId(), event.getPayload());
        event.setPublished(true);
        outboxRepository.save(event);
    }
}
```

At high scale, use Debezium CDC (Change Data Capture) instead of polling—it captures outbox table changes from the database's transaction log with sub-second latency and zero polling overhead.
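As a rough sketch, a Debezium setup pairs the Postgres connector with the built-in outbox event router; the hostnames, database, and table names below are assumptions that must match your outbox schema:

```properties
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=db.internal
database.dbname=orders
table.include.list=public.outbox_event
transforms=outbox
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
```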


Backpressure Management

Producer-Side Backpressure

When the broker can't keep up, the producer's buffer fills up. Handle this gracefully:

```java
try {
    Future<RecordMetadata> future = producer.send(record);
    future.get(5, TimeUnit.SECONDS); // block until the broker acks
    consecutiveFailures = 0;
} catch (TimeoutException e) {
    // Broker is overwhelmed — apply backpressure
    consecutiveFailures++;
    metrics.increment("producer.backpressure");
    Thread.sleep(calculateBackoff(consecutiveFailures));
    // Consider writing to local disk as overflow buffer
    overflowBuffer.write(record);
}
// InterruptedException and ExecutionException handling omitted for brevity
```

Consumer-Side Backpressure

Use pause() and resume() to control consumption rate:

```java
@KafkaListener(topics = "events")
public void consume(ConsumerRecord<String, Event> record,
                    Consumer<String, Event> consumer) {
    int queueDepth = processingQueue.size();

    if (queueDepth > 10_000) {
        // Pause consumption until queue drains
        consumer.pause(consumer.assignment());
        metrics.increment("consumer.paused");
    } else if (queueDepth < 1_000 && !consumer.paused().isEmpty()) {
        consumer.resume(consumer.assignment());
        metrics.increment("consumer.resumed");
    }

    processingQueue.offer(record);
}
```

Network and Infrastructure Optimization

Broker Placement

For multi-AZ deployments, configure min.insync.replicas=2 with replication.factor=3. Place brokers across 3 availability zones:

```
AZ-1: broker-1, broker-4
AZ-2: broker-2, broker-5
AZ-3: broker-3, broker-6
```

This layout survives a full AZ failure without losing acknowledged writes, provided producers use acks=all for durability-critical topics (note the tension with the acks=1 throughput setting earlier). The tradeoff is cross-AZ replication latency (~2ms per hop).

Network Tuning

At 500K events/sec with 1KB average message size, you're pushing 500MB/s through the network. Tune OS-level buffers:

```bash
# Producer/consumer hosts
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
```

Monitoring at Scale

Standard monitoring tools struggle at high event rates. Use sampling for detailed traces:

```java
// Sample 0.1% of events for detailed tracing
boolean shouldTrace = ThreadLocalRandom.current().nextDouble() < 0.001;

if (shouldTrace) {
    Span span = tracer.buildSpan("process-event")
        .withTag("event.type", event.getType())
        .withTag("event.partition", event.getPartition())
        .start();
    // ... detailed instrumentation
    span.finish();
}

// Always record aggregate metrics
metrics.counter("events.processed").increment();
metrics.timer("events.latency").record(processingTime, TimeUnit.MILLISECONDS);
```

High-Scale Checklist

  • Producer batching configured (batch.size ≥ 64KB, linger.ms 5-20ms)
  • Compression enabled (zstd for best ratio, lz4 for lowest CPU)
  • Partition count sized for target throughput with 30% headroom
  • Consumers use parallel processing per partition with key-based ordering
  • Consumer lag alerts at 10K (warn), 100K (critical), 1M (emergency)
  • Catch-up mode implemented for degraded processing when behind
  • All consumers are idempotent with deduplication
  • Transactional outbox or CDC for reliable event publishing
  • Backpressure handling on both producer and consumer side
  • Multi-AZ deployment with min.insync.replicas=2
  • OS network buffers tuned for high throughput
  • Trace sampling at 0.1-1% with aggregate metrics at 100%

Conclusion

High-scale event-driven architecture is about managing the physics of distributed systems: network bandwidth, partition rebalancing, and the compounding effect of small throughput deficits. The patterns here—producer batching, parallel consumption, idempotent processing, and backpressure management—are the critical levers.

Start with accurate capacity planning. Know your peak throughput, average message size, and required end-to-end latency. Size partitions and consumers accordingly, then add 30% headroom. Monitor consumer lag as your primary health indicator—it tells you whether your system is keeping up before anything else breaks.

The difference between a system that handles 100K events/sec and one that handles 1M events/sec isn't architectural—it's operational. Both use the same patterns. The high-scale system has better tuning, more aggressive monitoring, and faster incident response.


Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
