Our e-commerce platform hit the database scaling wall in Q3 2023. A single PostgreSQL instance serving 12TB of order data, 45,000 read QPS, and 8,000 write QPS during peak Black Friday traffic was crumbling under the load. Connection pool exhaustion caused cascading failures. p99 query latency during peaks hit 12 seconds. We sharded the orders database across 16 shards and this is what happened.
The Breaking Point
The symptoms were clear: during the November 2023 sale, our PostgreSQL primary hit 95% CPU utilization. The connection pool (800 connections via PgBouncer) was fully saturated. Read replicas lagged by 30+ seconds, causing stale inventory reads and overselling. We lost an estimated $340K in abandoned carts over the 4-hour degradation window.
Pre-sharding metrics:
- Database size: 12TB (orders + order_items + payments)
- Peak read QPS: 45,000
- Peak write QPS: 8,000
- p99 read latency: 2.1s (normal), 12s (peak)
- p99 write latency: 450ms (normal), 4.2s (peak)
- Connection pool: 800 connections, 100% utilized during peaks
We had already exhausted vertical scaling (r6g.16xlarge — 64 vCPUs, 512GB RAM), read replicas (3 replicas), table partitioning (by month), and query optimization. Sharding was the remaining option.
Architecture Decisions
Shard Key: Customer ID
We chose customer_id over order_id for two reasons. First, 90% of queries scope to a single customer (order history, account page, recommendation engine). Second, order volume is spread relatively evenly across customers — no single customer accounts for more than 0.001% of total orders.
Shard Count: 16
We calculated the target shard count based on per-shard capacity:
- Each shard: ~750GB data (12TB / 16)
- Each shard peak read: ~2,800 QPS (45K / 16)
- Each shard peak write: ~500 QPS (8K / 16)
This kept each shard well within a single r6g.4xlarge instance's capacity with 60% headroom for growth.
Routing: Application-Level with Consistent Hashing
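The routing layer itself is small. Here is a minimal sketch of the idea (class and shard names are illustrative, not our production code): each shard owns a number of virtual nodes on a hash ring, and a customer_id routes to the first virtual node at or clockwise from its hash.

```python
import bisect
import hashlib

class ConsistentHashRouter:
    """Map a customer_id to a shard via a hash ring with virtual nodes."""

    def __init__(self, shard_ids, vnodes=100):
        # Each shard contributes `vnodes` points on the ring.
        self.ring = []
        for shard_id in shard_ids:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{shard_id}:{v}"), shard_id))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, customer_id) -> str:
        # First ring point at or after the key's hash, wrapping at the end.
        idx = bisect.bisect(self.keys, self._hash(str(customer_id))) % len(self.ring)
        return self.ring[idx][1]

router = ConsistentHashRouter([f"shard-{i:02d}" for i in range(16)])
shard = router.shard_for(123456)  # stable across calls and processes
```

The payoff over plain `hash(customer_id) % 16` is that adding or removing a shard only remaps the keys adjacent to its virtual nodes, instead of reshuffling nearly everything — which matters for the re-shard discussed later.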
Global Order ID Index
Since 8% of queries looked up orders by order_id (not customer_id), we maintained a lightweight Redis-backed index mapping order_id → (shard_id, customer_id).
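In sketch form, the index's read/write path looks like this (the `kv` object stands in for a redis-py client — anything with `get`/`set` works — and the key format is our illustration, not a library convention):

```python
def index_order(kv, order_id: int, shard_id: str, customer_id: int) -> None:
    # One small string per order, e.g. "order:9001" -> "shard-07:123456".
    kv.set(f"order:{order_id}", f"{shard_id}:{customer_id}")

def locate_order(kv, order_id: int):
    # Returns (shard_id, customer_id), or None if the order is unknown.
    raw = kv.get(f"order:{order_id}")
    if raw is None:
        return None
    if isinstance(raw, bytes):  # redis-py returns bytes by default
        raw = raw.decode()
    shard_id, customer_id = raw.split(":")
    return shard_id, int(customer_id)
```

An `order_id` lookup then costs one Redis round trip plus one single-shard query, instead of a 16-way scatter.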
Migration Strategy
We migrated over 6 weeks using a dual-write approach with gradual cutover.
Weeks 1-2: Set up the 16 shard instances. Deploy the routing layer in shadow mode (route queries, but still read from the monolith).
Week 3: Begin bulk data migration using pg_dump per customer ID range, loading into target shards in parallel. 12TB took 18 hours across 16 parallel restore jobs.
Week 4: Enable dual-write. New orders write to both monolith and target shard. A reconciliation job verified consistency every hour.
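The dual-write path, in sketch form (the db objects and their `insert` method are stand-ins for our data layer, not a real driver API): the monolith write must succeed, while a shard write failure is logged and left to the reconciliation job rather than surfaced to the customer.

```python
import logging

log = logging.getLogger("dual_write")

def create_order(order: dict, monolith_db, router, shard_dbs) -> None:
    # The monolith stays the source of truth; if this raises, the request fails.
    monolith_db.insert("orders", order)

    # Best-effort shard write; the hourly reconciliation job repairs any gap.
    shard_id = router.shard_for(order["customer_id"])
    try:
        shard_dbs[shard_id].insert("orders", order)
    except Exception:
        log.warning("shard write failed for order %s on %s",
                    order["id"], shard_id, exc_info=True)
```

Keeping the shard write best-effort during this phase is what makes the cutover reversible: the monolith never depends on the shards being healthy.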
Week 5: Gradual read cutover. 5% → 25% → 50% → 100% of reads served from shards, validated by comparing results with monolith reads.
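The percentage ramp only works if each customer sees a consistent read path at every stage, so the rollout decision should be a deterministic hash bucket rather than a random coin flip. A minimal sketch (the function name is ours):

```python
import hashlib

def reads_from_shards(customer_id: int, rollout_pct: int) -> bool:
    """Place each customer in a stable bucket 0-99; buckets below the
    rollout percentage read from the shards, the rest from the monolith.
    Raising the percentage only ever moves customers monolith -> shards."""
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_pct
```

Because the bucketing is monotonic, a customer who moved to shard reads at 25% stays there at 50% and 100%, which keeps the monolith-vs-shard comparison results meaningful.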
Week 6: Disable monolith writes. Monolith database becomes read-only backup. Decommission after 30-day observation period.
Results
Post-sharding metrics (Black Friday 2024):
| Metric | Before | After | Improvement |
|---|---|---|---|
| p99 read latency | 12s (peak) | 45ms | 267x |
| p99 write latency | 4.2s (peak) | 28ms | 150x |
| Peak read QPS capacity | 45K (saturated) | 200K+ (headroom) | 4.4x |
| Peak write QPS capacity | 8K (saturated) | 40K+ (headroom) | 5x |
| Connection pool utilization | 100% | 35% | N/A |
| Database cost | $18K/month | $12K/month | 33% savings |
| Revenue lost to DB issues | $340K (Nov 2023) | $0 (Nov 2024) | N/A |
The cost savings came from replacing one massive r6g.16xlarge with 16 smaller r6g.4xlarge instances. Total compute capacity was higher, but the per-instance cost was much lower, and we retired the central PgBouncer cluster (replaced, as described below, by much lighter per-shard pooling).
What Failed
Cross-Shard Admin Queries Were Slower Than Expected
Admin dashboards that ran aggregate queries (total revenue, order counts by status) went from 200ms on the monolith to 3-5 seconds via scatter-gather across 16 shards. We addressed this by building a separate analytics pipeline that aggregated shard data into a ClickHouse instance for dashboards.
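For reference, a scatter-gather aggregate in sketch form (the `query_scalar` method is a stand-in for our data layer). Even with the fan-out fully parallelized, latency is gated by the slowest of the 16 shards, which is why p99 degraded so badly:

```python
from concurrent.futures import ThreadPoolExecutor

def total_revenue(shard_dbs: dict) -> float:
    # Fan the aggregate out to every shard, then combine the partial sums.
    def partial(db):
        return db.query_scalar("SELECT COALESCE(SUM(total), 0) FROM orders")
    with ThreadPoolExecutor(max_workers=len(shard_dbs)) as pool:
        return sum(pool.map(partial, shard_dbs.values()))
```

This pattern is fine for occasional queries, but any dashboard that refreshes it constantly pays the tail-latency tax on every load — hence the ClickHouse pipeline.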
Customer ID Reassignment
When two customer accounts merged (acquisition, duplicate detection), all orders from the source customer needed to move to the destination customer's shard. We underestimated how frequently this happened — about 50 merges per month. Each merge required a coordinated cross-shard data move. We built an automated merge pipeline, but it took 3 weeks of engineering time we had not planned for.
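The merge itself is a copy-then-delete move that has to stay safe if retried partway through. A simplified sketch of the ordering (method names are stand-ins; the real pipeline also moved order_items and payments and updated the Redis order index):

```python
def merge_customer(src_id: int, dst_id: int, router, shard_dbs) -> None:
    src = router.shard_for(src_id)
    dst = router.shard_for(dst_id)
    for order in shard_dbs[src].fetch_orders(src_id):
        moved = dict(order, customer_id=dst_id)
        # Upsert keyed on order id makes a retried merge idempotent.
        shard_dbs[dst].upsert("orders", moved)
    # Delete from the source only after every copy has landed, so a crash
    # mid-merge leaves duplicates (harmless on retry) rather than data loss.
    shard_dbs[src].delete_orders(src_id)
```

Copy-before-delete plus idempotent upserts means the pipeline can simply re-run a failed merge from the start.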
Connection Count Explosion
16 shards × 50 connections per shard × 12 application instances = 9,600 total connections. Our network infrastructure was not prepared for this. We had to upgrade our VPC configuration and add connection pooling per shard (PgBouncer per shard instance), reducing the per-application connection count to 10 per shard.
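Per shard, the pooler configuration is small. An illustrative pgbouncer.ini (hostnames and sizes are examples, not our production values; transaction pooling is what lets 120 client connections share a much smaller server-side pool):

```ini
[databases]
; each shard instance runs its own PgBouncer in front of local Postgres
orders = host=127.0.0.1 port=5432 dbname=orders

[pgbouncer]
listen_port = 6432
pool_mode = transaction        ; release server conns at transaction end
default_pool_size = 50         ; server-side connections per db/user pair
max_client_conn = 200          ; 12 app instances x 10 conns, plus slack
```

With 10 connections per application instance per shard, the application-side total drops from 9,600 to 1,920 (16 × 10 × 12), and each Postgres instance sees at most the pool size.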
Honest Retrospective
Was sharding the right call? Yes. The monolith database could not have survived another Black Friday. But we should have started the project 6 months earlier to avoid the time pressure.
What would we change?
- Start with 32 shards instead of 16 — we are already planning a re-shard for 2025 as data grows
- Build the analytics pipeline before the migration, not after admin dashboards broke
- Plan for account merges in the initial design
- Use Citus instead of application-level sharding — the engineering cost of building routing, migration, and monitoring tooling was significant
Conclusion
Sharding our orders database from a single 12TB instance to 16 shards reduced p99 latency by over 100x and eliminated the Black Friday database failures that cost us $340K. The migration took 6 weeks of engineering effort from a team of four, with ongoing operational overhead for shard management, cross-shard queries, and the global order index. For write-heavy workloads that have outgrown vertical scaling, application-level sharding with consistent hashing provides predictable performance at the cost of increased operational complexity.