Our e-commerce platform hit the database scaling wall in Q3 2023. A single PostgreSQL instance serving 12TB of order data, 45,000 read QPS, and 8,000 write QPS during peak Black Friday traffic was crumbling under the load. Connection pool exhaustion caused cascading failures. p99 query latency during peaks hit 12 seconds. We sharded the orders database across 16 shards and this is what happened.
The Breaking Point
The symptoms were clear: during the November 2023 sale, our PostgreSQL primary hit 95% CPU utilization. The connection pool (800 connections via PgBouncer) was fully saturated. Read replicas lagged by 30+ seconds, causing stale inventory reads and overselling. We lost an estimated $340K in abandoned carts over the 4-hour degradation window.
Pre-sharding metrics:
- Database size: 12TB (orders + order_items + payments)
- Peak read QPS: 45,000
- Peak write QPS: 8,000
- p99 read latency: 2.1s (normal), 12s (peak)
- p99 write latency: 450ms (normal), 4.2s (peak)
- Connection pool: 800 connections, 100% utilized during peaks
We had already exhausted vertical scaling (r6g.16xlarge — 64 vCPUs, 512GB RAM), read replicas (3 replicas), table partitioning (by month), and query optimization. Sharding was the remaining option.
Architecture Decisions
Shard Key: Customer ID
We chose customer_id over order_id for two reasons. First, 90% of queries scope to a single customer (order history, account page, recommendation engine). Second, order volume is spread relatively evenly across customers — no single customer accounts for more than 0.001% of total orders.
Shard Count: 16
We calculated the target shard count based on per-shard capacity:
- Each shard: ~750GB data (12TB / 16)
- Each shard peak read: ~2,800 QPS (45K / 16)
- Each shard peak write: ~500 QPS (8K / 16)
This kept each shard well within a single r6g.4xlarge instance's capacity with 60% headroom for growth.
Routing: Application-Level with Consistent Hashing
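The routing layer itself is small. Here is a minimal sketch of the idea (class and shard names are illustrative, not our production code): each shard owns a number of virtual nodes on a hash ring, and a customer_id routes to the first virtual node at or clockwise from its hash.

```python
import bisect
import hashlib

class ConsistentHashRouter:
    """Map a customer_id to a shard via a hash ring with virtual nodes."""

    def __init__(self, shard_ids, vnodes=100):
        # Each shard contributes `vnodes` points on the ring.
        self.ring = []
        for shard_id in shard_ids:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{shard_id}:{v}"), shard_id))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, customer_id) -> str:
        # First ring point at or after the key's hash, wrapping at the end.
        idx = bisect.bisect(self.keys, self._hash(str(customer_id))) % len(self.ring)
        return self.ring[idx][1]

router = ConsistentHashRouter([f"shard-{i:02d}" for i in range(16)])
shard = router.shard_for(123456)  # stable across calls and processes
```

The payoff over plain `hash(customer_id) % 16` is that adding or removing a shard only remaps the keys adjacent to its virtual nodes, instead of reshuffling nearly everything — which matters for the re-shard discussed later.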
Global Order ID Index
Since 8% of queries looked up orders by order_id (not customer_id), we maintained a lightweight Redis-backed index mapping order_id → (shard_id, customer_id).
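In sketch form, the index's read/write path looks like this (the `kv` object stands in for a redis-py client — anything with `get`/`set` works — and the key format is our illustration, not a library convention):

```python
def index_order(kv, order_id: int, shard_id: str, customer_id: int) -> None:
    # One small string per order, e.g. "order:9001" -> "shard-07:123456".
    kv.set(f"order:{order_id}", f"{shard_id}:{customer_id}")

def locate_order(kv, order_id: int):
    # Returns (shard_id, customer_id), or None if the order is unknown.
    raw = kv.get(f"order:{order_id}")
    if raw is None:
        return None
    if isinstance(raw, bytes):  # redis-py returns bytes by default
        raw = raw.decode()
    shard_id, customer_id = raw.split(":")
    return shard_id, int(customer_id)
```

An `order_id` lookup then costs one Redis round trip plus one single-shard query, instead of a 16-way scatter.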
Migration Strategy
We migrated over 6 weeks using a dual-write approach with gradual cutover.
Weeks 1-2: Set up the 16 shard instances. Deploy the routing layer in shadow mode (route queries, but still read from the monolith).
Week 3: Begin bulk data migration using pg_dump per customer ID range, loading into target shards in parallel. 12TB took 18 hours across 16 parallel restore jobs.
Week 4: Enable dual-write. New orders write to both monolith and target shard. A reconciliation job verified consistency every hour.
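The dual-write path, in sketch form (the db objects and their `insert` method are stand-ins for our data layer, not a real driver API): the monolith write must succeed, while a shard write failure is logged and left to the reconciliation job rather than surfaced to the customer.

```python
import logging

log = logging.getLogger("dual_write")

def create_order(order: dict, monolith_db, router, shard_dbs) -> None:
    # The monolith stays the source of truth; if this raises, the request fails.
    monolith_db.insert("orders", order)

    # Best-effort shard write; the hourly reconciliation job repairs any gap.
    shard_id = router.shard_for(order["customer_id"])
    try:
        shard_dbs[shard_id].insert("orders", order)
    except Exception:
        log.warning("shard write failed for order %s on %s",
                    order["id"], shard_id, exc_info=True)
```

Keeping the shard write best-effort during this phase is what makes the cutover reversible: the monolith never depends on the shards being healthy.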
Week 5: Gradual read cutover. 5% → 25% → 50% → 100% of reads served from shards, validated by comparing results with monolith reads.
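The percentage ramp only works if each customer sees a consistent read path at every stage, so the rollout decision should be a deterministic hash bucket rather than a random coin flip. A minimal sketch (the function name is ours):

```python
import hashlib

def reads_from_shards(customer_id: int, rollout_pct: int) -> bool:
    """Place each customer in a stable bucket 0-99; buckets below the
    rollout percentage read from the shards, the rest from the monolith.
    Raising the percentage only ever moves customers monolith -> shards."""
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_pct
```

Because the bucketing is monotonic, a customer who moved to shard reads at 25% stays there at 50% and 100%, which keeps the monolith-vs-shard comparison results meaningful.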
Week 6: Disable monolith writes. Monolith database becomes read-only backup. Decommission after 30-day observation period.
Results
Post-sharding metrics (Black Friday 2024):
| Metric | Before | After | Improvement |
|---|---|---|---|
| p99 read latency | 12s (peak) | 45ms | 267x |
| p99 write latency | 4.2s (peak) | 28ms | 150x |
| Peak read QPS capacity | 45K (saturated) | 200K+ (headroom) | 4.4x |
| Peak write QPS capacity | 8K (saturated) | 40K+ (headroom) | 5x |
| Connection pool utilization | 100% | 35% | N/A |
| Database cost | $18K/month | $12K/month | 33% savings |
| Revenue lost to DB issues | $340K (Nov 2023) | $0 (Nov 2024) | N/A |
The cost savings came from replacing one massive r6g.16xlarge with 16 smaller r6g.4xlarge instances. Total compute capacity was higher, but the per-instance cost was much lower, and we retired the central PgBouncer cluster (replaced, as described below, by much lighter per-shard pooling).
What Failed
Cross-Shard Admin Queries Were Slower Than Expected
Admin dashboards that ran aggregate queries (total revenue, order counts by status) went from 200ms on the monolith to 3-5 seconds via scatter-gather across 16 shards. We addressed this by building a separate analytics pipeline that aggregated shard data into a ClickHouse instance for dashboards.
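For reference, a scatter-gather aggregate in sketch form (the `query_scalar` method is a stand-in for our data layer). Even with the fan-out fully parallelized, latency is gated by the slowest of the 16 shards, which is why p99 degraded so badly:

```python
from concurrent.futures import ThreadPoolExecutor

def total_revenue(shard_dbs: dict) -> float:
    # Fan the aggregate out to every shard, then combine the partial sums.
    def partial(db):
        return db.query_scalar("SELECT COALESCE(SUM(total), 0) FROM orders")
    with ThreadPoolExecutor(max_workers=len(shard_dbs)) as pool:
        return sum(pool.map(partial, shard_dbs.values()))
```

This pattern is fine for occasional queries, but any dashboard that refreshes it constantly pays the tail-latency tax on every load — hence the ClickHouse pipeline.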
Customer ID Reassignment
When two customer accounts merged (acquisition, duplicate detection), all orders from the source customer needed to move to the destination customer's shard. We underestimated how frequently this happened — about 50 merges per month. Each merge required a coordinated cross-shard data move. We built an automated merge pipeline, but it took 3 weeks of engineering time we had not planned for.
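The merge itself is a copy-then-delete move that has to stay safe if retried partway through. A simplified sketch of the ordering (method names are stand-ins; the real pipeline also moved order_items and payments and updated the Redis order index):

```python
def merge_customer(src_id: int, dst_id: int, router, shard_dbs) -> None:
    src = router.shard_for(src_id)
    dst = router.shard_for(dst_id)
    for order in shard_dbs[src].fetch_orders(src_id):
        moved = dict(order, customer_id=dst_id)
        # Upsert keyed on order id makes a retried merge idempotent.
        shard_dbs[dst].upsert("orders", moved)
    # Delete from the source only after every copy has landed, so a crash
    # mid-merge leaves duplicates (harmless on retry) rather than data loss.
    shard_dbs[src].delete_orders(src_id)
```

Copy-before-delete plus idempotent upserts means the pipeline can simply re-run a failed merge from the start.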
Connection Count Explosion
16 shards × 50 connections per shard × 12 application instances = 9,600 total connections. Our network infrastructure was not prepared for this. We had to upgrade our VPC configuration and add connection pooling per shard (PgBouncer per shard instance), reducing the per-application connection count to 10 per shard.
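Per shard, the pooler configuration is small. An illustrative pgbouncer.ini (hostnames and sizes are examples, not our production values; transaction pooling is what lets 120 client connections share a much smaller server-side pool):

```ini
[databases]
; each shard instance runs its own PgBouncer in front of local Postgres
orders = host=127.0.0.1 port=5432 dbname=orders

[pgbouncer]
listen_port = 6432
pool_mode = transaction        ; release server conns at transaction end
default_pool_size = 50         ; server-side connections per db/user pair
max_client_conn = 200          ; 12 app instances x 10 conns, plus slack
```

With 10 connections per application instance per shard, the application-side total drops from 9,600 to 1,920 (16 × 10 × 12), and each Postgres instance sees at most the pool size.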
Honest Retrospective
Was sharding the right call? Yes. The monolith database could not have survived another Black Friday. But we should have started the project 6 months earlier to avoid the time pressure.
What would we change?
- Start with 32 shards instead of 16 — we are already planning a re-shard for 2025 as data grows
- Build the analytics pipeline before the migration, not after admin dashboards broke
- Plan for account merges in the initial design
- Use Citus instead of application-level sharding — the engineering cost of building routing, migration, and monitoring tooling was significant
Conclusion
Sharding our orders database from a single 12TB instance to 16 shards reduced p99 latency by over 100x and eliminated the Black Friday database failures that cost us $340K. The migration took 6 weeks of engineering effort from a team of four, with ongoing operational overhead for shard management, cross-shard queries, and the global order index. For write-heavy workloads that have outgrown vertical scaling, application-level sharding with consistent hashing provides predictable performance at the cost of increased operational complexity.