System Design

Event-Driven Architecture Best Practices for Startup Teams

Battle-tested best practices for event-driven architecture tailored to startup teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 10 min read

Startups adopting event-driven architecture face a unique challenge: you need the decoupling benefits of EDA without the operational overhead that enterprise teams absorb with dedicated platform engineers. The wrong choices early on lead to either a brittle synchronous monolith that can't scale or an over-engineered event mesh that burns engineering time on infrastructure instead of product features.

This guide covers the practices that work for teams of 3-15 engineers processing 1K-100K events per second, optimized for speed of implementation, low operational overhead, and a clear path to scale when needed.

Start with Managed Services

Self-hosting Kafka is a full-time job. At startup scale, use managed services:

| Service | Best For | Pricing Model | Setup Time |
|---|---|---|---|
| Amazon SQS + SNS | Simple pub/sub, <10K events/sec | Per-message ($0.40/1M) | 30 minutes |
| Amazon EventBridge | Event routing with filtering | Per-event ($1.00/1M) | 1 hour |
| Upstash Kafka | Kafka-compatible, serverless | Pay-per-message | 15 minutes |
| Confluent Cloud Basic | Full Kafka features needed | From $1/hour per CKU | 1 hour |
| Google Cloud Pub/Sub | GCP-native, auto-scaling | $40/TB ingested | 30 minutes |

Recommendation for most startups: Start with SQS + SNS if you're on AWS. It handles 100K events/sec with zero operational overhead, automatic scaling, and no cluster management. Switch to Kafka only when you need features SQS lacks: event replay, consumer groups with offset management, or stream processing.
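The snippets throughout this guide call `eventBus.publish` and `eventBus.subscribe`. That interface is shorthand, not a specific library; a minimal in-memory sketch of it looks like this (for illustration only — in production it would wrap SNS + SQS or EventBridge):

```typescript
// Hypothetical event bus matching the publish/subscribe shape used
// in the examples below. In-memory only; no durability or retries.
interface BusEvent {
  id: string;
  type: string;
  timestamp: string;
  data: unknown;
}

type Handler = (event: BusEvent) => Promise<void>;

class InMemoryEventBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(type: string, handler: Handler): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  async publish(event: BusEvent): Promise<void> {
    // Fan out to every subscriber of this type, mirroring SNS -> SQS delivery
    const list = this.handlers.get(event.type) ?? [];
    await Promise.all(list.map((h) => h(event)));
  }
}
```

Swapping this for a real broker later only changes the class internals; producer and consumer code stays the same.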

Keep Events Simple

Minimal Event Structure

Skip the enterprise event envelope. Start with what you actually need:

```typescript
interface Event<T = unknown> {
  id: string;
  type: string;
  timestamp: string;
  data: T;
}

interface OrderPlacedData {
  orderId: string;
  customerId: string;
  total: number;
  items: { productId: string; quantity: number; price: number }[];
}

// Example
const event: Event<OrderPlacedData> = {
  id: crypto.randomUUID(),
  type: "order.placed",
  timestamp: new Date().toISOString(),
  data: {
    orderId: "ord-123",
    customerId: "cust-456",
    total: 99.99,
    items: [{ productId: "prod-789", quantity: 2, price: 49.99 }],
  },
};
```

Add correlationId, causationId, and version fields when you actually need them—not preemptively. You can always add fields to events; removing them is harder.
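To keep that minimal shape consistent across producers, a small factory can fill in `id` and `timestamp` so call sites only supply the type and payload. A sketch (the `createEvent` helper is an assumption, not part of the article; it uses Node's `crypto.randomUUID`, available since Node 16.7):

```typescript
import { randomUUID } from "node:crypto";

interface Event<T = unknown> {
  id: string;
  type: string;
  timestamp: string;
  data: T;
}

// Hypothetical helper: producers pass only the type and payload;
// id and timestamp are filled in uniformly.
function createEvent<T>(type: string, data: T): Event<T> {
  return {
    id: randomUUID(),
    type,
    timestamp: new Date().toISOString(),
    data,
  };
}

const event = createEvent("order.placed", { orderId: "ord-123" });
```

One factory also gives you a single place to add `correlationId` or `version` later, when you actually need them.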

Use Fat Events

Always include all relevant data in the event. Thin events (just IDs) create runtime dependencies:

```typescript
// Bad: consumer must call the orders API to get details
{ type: "order.placed", data: { orderId: "ord-123" } }

// Good: consumer has everything it needs
{ type: "order.placed", data: { orderId: "ord-123", total: 99.99, items: [...] } }
```

The storage cost difference is negligible. The operational cost of a consumer failing because the producer's API is down is significant.

JSON Over Binary Formats

Use JSON for event serialization. Avro and Protobuf add schema management complexity that isn't justified at startup scale. JSON is debuggable (you can read events in the queue console), universally supported, and good enough for 100K events/sec. Switch to binary serialization when JSON parsing becomes a measurable bottleneck—which it won't until you're well past startup scale.
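JSON's flexibility has one catch: nothing guarantees what comes off the wire matches your `Event` type. A cheap runtime check at the deserialization boundary covers most of what a schema registry would catch at this scale. A sketch (the `parseEvent` helper is an assumption for illustration):

```typescript
interface Event<T = unknown> {
  id: string;
  type: string;
  timestamp: string;
  data: T;
}

// Reject payloads missing the four required fields before handing
// them to consumers; everything else passes through untouched.
function parseEvent(raw: string): Event {
  const obj: unknown = JSON.parse(raw);
  if (typeof obj !== "object" || obj === null) {
    throw new Error("Malformed event: not an object");
  }
  for (const field of ["id", "type", "timestamp", "data"]) {
    if (!(field in obj)) {
      throw new Error(`Malformed event: missing "${field}"`);
    }
  }
  return obj as Event;
}
```

Malformed messages fail fast and land in the DLQ instead of crashing handlers deep in business logic.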

Three Patterns That Cover 90% of Use Cases

1. Fire-and-Forget Notifications

The simplest pattern. Publish an event, consumers process it asynchronously:

```typescript
// Publisher (in your order service)
async function placeOrder(order: Order): Promise<void> {
  await db.orders.create(order);

  await eventBus.publish({
    id: crypto.randomUUID(),
    type: "order.placed",
    timestamp: new Date().toISOString(),
    data: { orderId: order.id, customerId: order.customerId, total: order.total },
  });
}

// Consumer (email service)
eventBus.subscribe("order.placed", async (event) => {
  await sendOrderConfirmationEmail(event.data.customerId, event.data.orderId);
});

// Consumer (analytics)
eventBus.subscribe("order.placed", async (event) => {
  await trackRevenue(event.data.total);
});
```

2. Async Task Processing

Offload expensive operations from the request path:

```typescript
// In the API handler — respond immediately
app.post("/api/reports", async (req, res) => {
  const reportId = crypto.randomUUID();
  await db.reports.create({ id: reportId, status: "pending" });

  await eventBus.publish({
    id: crypto.randomUUID(),
    type: "report.requested",
    timestamp: new Date().toISOString(),
    data: { reportId, parameters: req.body },
  });

  res.json({ reportId, status: "pending" });
});

// Worker process — handles the expensive work
eventBus.subscribe("report.requested", async (event) => {
  const { reportId, parameters } = event.data;
  await db.reports.update(reportId, { status: "generating" });

  const report = await generateReport(parameters); // Takes 30 seconds
  await storage.upload(`reports/${reportId}.pdf`, report);

  await db.reports.update(reportId, { status: "completed" });
  await eventBus.publish({
    id: crypto.randomUUID(),
    type: "report.completed",
    timestamp: new Date().toISOString(),
    data: { reportId },
  });
});
```

3. Event-Driven Cache Invalidation

Keep caches fresh without complex TTL strategies:

```typescript
// When data changes, publish an event
async function updateProduct(id: string, data: Partial<Product>): Promise<void> {
  await db.products.update(id, data);

  await eventBus.publish({
    id: crypto.randomUUID(),
    type: "product.updated",
    timestamp: new Date().toISOString(),
    data: { productId: id },
  });
}

// Cache service listens and invalidates
eventBus.subscribe("product.updated", async (event) => {
  await cache.delete(`product:${event.data.productId}`);
  // Invalidate list caches; assumes the cache client supports pattern
  // deletes (plain Redis DEL does not; you'd scan for matching keys)
  await cache.delete("products:list:*");
});
```

Error Handling for Small Teams

Simple Retry with Dead Letter Queue

Don't build complex retry infrastructure. Use your message broker's built-in retry and DLQ support:

```typescript
// SQS example with maxReceiveCount
const queueConfig = {
  RedrivePolicy: JSON.stringify({
    maxReceiveCount: 3,
    deadLetterTargetArn: dlqArn,
  }),
  VisibilityTimeout: "30",
};
```

For custom consumers, implement basic retry:

```typescript
async function processWithRetry(
  event: Event,
  handler: (event: Event) => Promise<void>,
  maxRetries: number = 3,
): Promise<void> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      await handler(event);
      return;
    } catch (error) {
      if (attempt === maxRetries) {
        console.error(`Failed after ${maxRetries} retries:`, event.id, error);
        await moveToDlq(event, error);
        return;
      }
      const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

DLQ Processing

Check your DLQ daily. At startup scale, a Slack notification plus manual processing is fine:

```typescript
// Scheduled job — runs every hour
async function processDlq(): Promise<void> {
  const failedEvents = await dlq.receiveMessages(10);

  if (failedEvents.length > 0) {
    await slack.send("#eng-alerts", {
      text: `${failedEvents.length} events in DLQ. Check dashboard.`,
    });
  }
}
```

Automate DLQ replay only when you're processing enough events that manual review isn't feasible.
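When you do automate it, replay is conceptually just re-publishing each failed message to its source queue and then deleting it from the DLQ. A sketch with hypothetical `receive`/`republish`/`ack` functions (in SQS terms: `ReceiveMessage` on the DLQ, `SendMessage` to the source queue, then `DeleteMessage`):

```typescript
// Hypothetical message shape; real SQS messages carry more metadata.
interface FailedMessage {
  id: string;
  body: string;
}

async function replayDlq(
  receive: (max: number) => Promise<FailedMessage[]>,
  republish: (body: string) => Promise<void>,
  ack: (id: string) => Promise<void>,
  batchSize = 10,
): Promise<number> {
  const messages = await receive(batchSize);
  for (const msg of messages) {
    await republish(msg.body); // back onto the source queue
    await ack(msg.id); // remove from the DLQ only after republish succeeds
  }
  return messages.length;
}
```

Acking only after a successful republish means a crash mid-replay duplicates messages rather than losing them, which is safe because every consumer is idempotent.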


Make Every Consumer Idempotent

This is the one rule you can't skip. Network retries, redeliveries, and duplicate events will happen:

```typescript
async function handleOrderPlaced(event: Event<OrderPlacedData>): Promise<void> {
  // Idempotency check
  const existing = await db.processedEvents.findOne({ eventId: event.id });
  if (existing) return;

  // Process the event
  await fulfillmentService.initiateShipment(event.data.orderId);

  // Record processing
  await db.processedEvents.create({
    eventId: event.id,
    processedAt: new Date(),
  });
}
```

For database operations, use upserts or conditional updates instead of a separate deduplication table:

```sql
INSERT INTO shipments (order_id, status, created_at)
VALUES ($1, 'pending', NOW())
ON CONFLICT (order_id) DO NOTHING;
```

Monitoring That Fits Startup Resources

You don't need Datadog's full APM suite. Track these three things:

  1. Queue depth: How many events are waiting to be processed. Rising depth means consumers can't keep up.
  2. Processing errors: Count of failed event processing attempts per hour.
  3. End-to-end latency: Time from event publication to consumer processing completion.
```typescript
// Simple metrics using CloudWatch or your logging service
function trackEventMetrics(event: Event, startTime: number): void {
  const processingMs = Date.now() - startTime;
  const e2eMs = Date.now() - new Date(event.timestamp).getTime();

  console.log(JSON.stringify({
    metric: "event_processed",
    eventType: event.type,
    processingMs,
    e2eLatencyMs: e2eMs,
  }));
}
```

Set alerts on queue depth only. If the queue is growing, you'll know something is wrong before users notice.
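The alert itself can be a scheduled check rather than a full APM integration. A sketch, where `getQueueDepth`, `notify`, and the threshold are assumptions (with SQS, depth would come from `GetQueueAttributes` reading `ApproximateNumberOfMessages`):

```typescript
// Hypothetical threshold; tune to your normal traffic so a brief
// burst doesn't page you but a stuck consumer does.
const QUEUE_DEPTH_THRESHOLD = 1000;

function shouldAlert(depth: number, threshold = QUEUE_DEPTH_THRESHOLD): boolean {
  return depth > threshold;
}

// Run on a schedule (e.g. every 5 minutes) alongside the DLQ check.
async function checkQueueDepth(
  getQueueDepth: () => Promise<number>,
  notify: (msg: string) => Promise<void>,
): Promise<void> {
  const depth = await getQueueDepth();
  if (shouldAlert(depth)) {
    await notify(`Queue depth at ${depth}: consumers may be falling behind.`);
  }
}
```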

When to Evolve Beyond Startup Patterns

Graduate to enterprise EDA patterns when:

  • Team size exceeds 15 engineers → Add an event registry to prevent schema conflicts
  • You have more than 10 event types → Standardize on an event envelope with versioning
  • Events/sec exceeds 100K → Consider Kafka for replay, consumer groups, and ordering guarantees
  • Multiple teams consume the same events → Add schema compatibility checks in CI
  • You need event replay → Move from SQS to Kafka or EventBridge Archive

Startup EDA Checklist

  • Using a managed message broker (SQS, Pub/Sub, or managed Kafka)
  • Events use simple JSON with id, type, timestamp, and data fields
  • Fat events include all data consumers need (no callback APIs)
  • Every consumer is idempotent
  • Retry with exponential backoff (3 retries max)
  • Dead letter queue configured with daily monitoring
  • Queue depth alerts set up
  • Event publishing happens after database writes succeed
  • Events named as past-tense domain facts (order.placed, not place.order)
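The last checklist item can even be enforced mechanically, for example in a unit test over your event-type constants. A sketch; the `<entity>.<past-tense-verb>` pattern follows this article's convention, but the regex itself is a crude assumption:

```typescript
// Accepts names like "order.placed" or "report.completed": a lowercase
// entity, a dot, then a lowercase verb ending in "ed". Crude on purpose:
// irregular past-tense verbs (e.g. "sent") would need an allowlist.
const EVENT_NAME_PATTERN = /^[a-z]+(?:_[a-z]+)*\.[a-z]+ed$/;

function isValidEventName(name: string): boolean {
  return EVENT_NAME_PATTERN.test(name);
}
```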

Conclusion

Startup EDA should be boring infrastructure. Use managed services, keep events simple, make consumers idempotent, and monitor queue depth. These practices let a 5-person team process 100K events/sec with near-zero operational overhead.

Resist the urge to build event sourcing, schema registries, or complex saga orchestrators until you have concrete problems that require them. The startups that succeed with EDA are the ones that use it as a simple decoupling mechanism—not the ones that build a distributed computing framework before they have product-market fit.

