System Design

Database Sharding Best Practices for Enterprise Teams

Battle-tested best practices for database sharding in enterprise teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 11 min read

Database sharding introduces operational complexity that enterprises must manage for years. The decision to shard is irreversible in practice — once data is distributed across shards, consolidating it requires a migration effort comparable to the original sharding project. These best practices help enterprise teams implement sharding that scales without creating an operational nightmare.

When Enterprises Actually Need Sharding

Sharding becomes necessary when vertical scaling reaches its limit and read replicas cannot absorb the write load. For most enterprise PostgreSQL deployments, that threshold is around 5-10TB of data with 50,000+ write transactions per second. Before sharding, exhaust these alternatives: connection pooling (PgBouncer), read replicas, table partitioning, query optimization, and caching layers.

The decision framework: if your database is write-bound (not read-bound), growing faster than vertical scaling can accommodate, and your data has a natural partition key, sharding is the right next step.

Best Practices

1. Choose the Shard Key Based on Query Patterns, Not Data Structure

The shard key determines query performance, data distribution, and operational complexity for the life of the system.

sql
-- Good: tenant ID as shard key for multi-tenant SaaS
-- Queries naturally scope to a single tenant
SELECT * FROM orders WHERE tenant_id = 'acme-corp' AND status = 'pending';
-- Routes to exactly one shard

-- Bad: order date as shard key
-- Range queries span all shards, and recent shards become hot
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days';
-- Must query all shards and merge results

For enterprise multi-tenant systems, tenant_id is almost always the correct shard key. It provides natural isolation and predictable routing, and it aligns with compliance requirements for data residency.

2. Implement a Routing Layer with Shard Map

Centralize shard routing in a dedicated service that maps shard keys to database connections.

typescript
import { Pool, QueryResult } from 'pg'; // assumes node-postgres

interface ShardConfig {
  shardId: string;
  host: string;
  port: number;
  database: string;
  weight: number; // For weighted routing during migration
}

class ShardRouter {
  private shardMap: Map<string, ShardConfig>;
  private pools: Map<string, Pool>;
  private consistentHash: ConsistentHashRing;

  constructor(shards: ShardConfig[]) {
    this.shardMap = new Map(shards.map(s => [s.shardId, s]));
    this.pools = new Map();
    this.consistentHash = new ConsistentHashRing(
      shards.map(s => s.shardId),
      150 // virtual nodes per shard
    );

    for (const shard of shards) {
      this.pools.set(shard.shardId, new Pool({
        host: shard.host,
        port: shard.port,
        database: shard.database,
        max: 20,
      }));
    }
  }

  getAllShardIds(): string[] {
    return Array.from(this.pools.keys());
  }

  // Route by shard key via consistent hashing
  getPool(shardKey: string): Pool {
    const shardId = this.consistentHash.getNode(shardKey);
    return this.getPoolById(shardId);
  }

  // Direct lookup when the shard is already known (monitoring, migrations)
  getPoolById(shardId: string): Pool {
    const pool = this.pools.get(shardId);
    if (!pool) throw new Error(`No pool for shard ${shardId}`);
    return pool;
  }

  async query(shardKey: string, sql: string, params: unknown[]): Promise<QueryResult> {
    const pool = this.getPool(shardKey);
    return pool.query(sql, params);
  }

  async scatter(sql: string, params: unknown[]): Promise<QueryResult[]> {
    const promises = Array.from(this.pools.values()).map(pool =>
      pool.query(sql, params)
    );
    return Promise.all(promises);
  }
}

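The ConsistentHashRing the router depends on is assumed rather than shown. A minimal sketch with virtual nodes follows; the md5-based hash and the linear scan are illustrative choices (a production ring would binary-search the sorted positions):

```typescript
import { createHash } from "node:crypto";

// Minimal consistent hash ring with virtual nodes. Each physical shard is
// placed on the ring many times so keys spread evenly and adding or
// removing a shard only remaps a fraction of keys.
class ConsistentHashRing {
  // Sorted positions on the ring, each mapping back to a physical shard id
  private ring: { position: number; nodeId: string }[] = [];

  constructor(nodeIds: string[], virtualNodes = 150) {
    for (const nodeId of nodeIds) {
      for (let v = 0; v < virtualNodes; v++) {
        this.ring.push({ position: this.hash(`${nodeId}#${v}`), nodeId });
      }
    }
    this.ring.sort((a, b) => a.position - b.position);
  }

  // First 8 hex chars of md5 interpreted as a 32-bit ring position
  private hash(key: string): number {
    return parseInt(createHash("md5").update(key).digest("hex").slice(0, 8), 16);
  }

  // Walk clockwise to the first virtual node at or after the key's position
  getNode(key: string): string {
    const pos = this.hash(key);
    for (const entry of this.ring) {
      if (entry.position >= pos) return entry.nodeId;
    }
    return this.ring[0].nodeId; // wrapped past the end of the ring
  }
}
```

The 150 virtual nodes per shard matches the constant passed by the router; higher counts smooth the distribution at the cost of a larger ring.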
3. Include the Shard Key in Every Table's Primary Key

Every table that participates in sharding must include the shard key in its primary key and all foreign key relationships. This ensures queries can be routed to a single shard.

sql
-- Every table includes tenant_id in the primary key
CREATE TABLE orders (
    tenant_id UUID NOT NULL,
    order_id UUID NOT NULL,
    customer_id UUID NOT NULL,
    status VARCHAR(50) NOT NULL,
    total_cents BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, order_id)
);

CREATE TABLE order_items (
    tenant_id UUID NOT NULL,
    order_id UUID NOT NULL,
    item_id UUID NOT NULL,
    product_id UUID NOT NULL,
    quantity INT NOT NULL,
    unit_price BIGINT NOT NULL,
    PRIMARY KEY (tenant_id, order_id, item_id),
    FOREIGN KEY (tenant_id, order_id) REFERENCES orders (tenant_id, order_id)
);

-- Indexes must also include tenant_id for efficient shard-local queries
CREATE INDEX idx_orders_customer ON orders (tenant_id, customer_id, created_at DESC);
CREATE INDEX idx_orders_status ON orders (tenant_id, status);

4. Plan for Cross-Shard Queries

Some queries inherently span shards. Design explicit scatter-gather patterns for these cases.

typescript
class CrossShardQueryExecutor {
  constructor(private router: ShardRouter) {}

  async aggregateAcrossShards<T>(
    sql: string,
    params: unknown[],
    merge: (results: T[][]) => T[]
  ): Promise<T[]> {
    const results = await this.router.scatter(sql, params);
    const typed = results.map(r => r.rows as T[]);
    return merge(typed);
  }

  async getGlobalOrderStats(): Promise<OrderStats> {
    const perShardStats = await this.aggregateAcrossShards<ShardStats>(
      `SELECT COUNT(*) as order_count, SUM(total_cents) as revenue
       FROM orders WHERE created_at > NOW() - INTERVAL '30 days'`,
      [],
      (results) => results.flat()
    );

    // node-postgres returns COUNT/SUM as strings, so coerce before summing
    const totalOrders = perShardStats.reduce((sum, s) => sum + Number(s.order_count), 0);
    const totalRevenue = perShardStats.reduce((sum, s) => sum + Number(s.revenue), 0);

    return {
      totalOrders,
      totalRevenue,
      // Derive the average from global totals; averaging per-shard averages
      // skews the result toward small shards
      avgOrderValue: totalOrders > 0 ? totalRevenue / totalOrders : 0,
    };
  }
}

5. Implement Shard-Aware Migrations

Schema changes must reach every shard consistently. A migration that fails on one shard while succeeding on others leaves the fleet in an inconsistent state, so validate everywhere first and roll back on partial failure.

typescript
class ShardMigrationRunner {
  constructor(
    private router: ShardRouter,
    private migrationStore: MigrationStore
  ) {}

  async migrate(migration: Migration): Promise<MigrationResult> {
    const shardIds = this.router.getAllShardIds();
    const results: Map<string, 'success' | 'failed'> = new Map();

    // Phase 1: Validate migration on all shards (dry run)
    for (const shardId of shardIds) {
      const isValid = await this.validateMigration(shardId, migration);
      if (!isValid) {
        return { status: 'aborted', reason: `Validation failed on ${shardId}` };
      }
    }

    // Phase 2: Apply migration shard by shard
    for (const shardId of shardIds) {
      try {
        await this.applyMigration(shardId, migration);
        results.set(shardId, 'success');
      } catch (error) {
        results.set(shardId, 'failed');
        // Roll back the shards that already succeeded
        for (const [completedShard, status] of results) {
          if (status === 'success') {
            await this.rollbackMigration(completedShard, migration);
          }
        }
        return { status: 'rolled_back', failedShard: shardId, error };
      }
    }

    await this.migrationStore.record(migration.id, 'completed');
    return { status: 'completed' };
  }
}

6. Monitor Shard Balance and Hotspots

Uneven data distribution leads to hotspot shards that degrade overall system performance.

typescript
class ShardMonitor {
  constructor(private router: ShardRouter) {}

  async getShardHealth(): Promise<ShardHealthReport[]> {
    const shardIds = this.router.getAllShardIds();
    const reports: ShardHealthReport[] = [];

    for (const shardId of shardIds) {
      // Look the pool up by id directly; hashing the id as if it were a
      // shard key would route to the wrong shard
      const pool = this.router.getPoolById(shardId);
      const [size, connections, slowQueries] = await Promise.all([
        pool.query('SELECT pg_database_size(current_database()) as size'),
        pool.query('SELECT count(*) as active FROM pg_stat_activity WHERE state = $1', ['active']),
        pool.query(`SELECT count(*) as slow FROM pg_stat_activity
                    WHERE state = 'active' AND NOW() - query_start > INTERVAL '5 seconds'`),
      ]);

      reports.push({
        shardId,
        sizeBytes: Number(size.rows[0].size),
        activeConnections: Number(connections.rows[0].active),
        slowQueries: Number(slowQueries.rows[0].slow),
      });
    }

    return reports;
  }

  detectHotspots(reports: ShardHealthReport[]): string[] {
    const avgSize = reports.reduce((s, r) => s + r.sizeBytes, 0) / reports.length;
    return reports
      .filter(r => r.sizeBytes > avgSize * 1.5 || r.activeConnections > 50)
      .map(r => r.shardId);
  }
}


Anti-Patterns to Avoid

Sharding Too Early

Premature sharding adds distributed systems complexity to a problem that PostgreSQL table partitioning could solve. A single well-tuned PostgreSQL instance handles more load than most applications will ever need.

Auto-Increment IDs Across Shards

Sequential IDs collide across shards. Use UUIDs (v7 for sortability) or a distributed ID generator (Snowflake-style) from the start.
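A Snowflake-style generator fits in a few lines. This is a sketch; the custom epoch, bit widths, and class name are illustrative assumptions, not a fixed standard:

```typescript
// Sketch of a Snowflake-style 64-bit ID: 41-bit millisecond timestamp,
// 10-bit worker id, 12-bit per-millisecond sequence. IDs are unique per
// worker and roughly time-sortable across the fleet.
const CUSTOM_EPOCH = 1700000000000n; // assumed custom epoch (ms since Unix epoch)

class SnowflakeGenerator {
  private lastMs = -1n;
  private sequence = 0n;

  constructor(private workerId: bigint) {
    if (workerId < 0n || workerId > 1023n) {
      throw new Error("workerId must fit in 10 bits");
    }
  }

  next(): bigint {
    let now = BigInt(Date.now());
    if (now === this.lastMs) {
      this.sequence = (this.sequence + 1n) & 4095n; // 12-bit sequence
      if (this.sequence === 0n) {
        // Sequence exhausted for this millisecond; spin until the next tick
        while (BigInt(Date.now()) <= this.lastMs) {}
        now = BigInt(Date.now());
      }
    } else {
      this.sequence = 0n;
    }
    this.lastMs = now;
    return ((now - CUSTOM_EPOCH) << 22n) | (this.workerId << 12n) | this.sequence;
  }
}
```

Each shard-writing process gets a distinct worker id, so generators never coordinate at write time; the trade-off versus UUIDv7 is the need to assign those worker ids.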

Ignoring Cross-Shard Transaction Requirements

If your business logic requires atomic operations across entities on different shards, you need either saga patterns or to reconsider your shard key so those entities co-locate.
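A saga replaces the cross-shard transaction with shard-local steps plus compensating actions. A minimal executor sketch, with hypothetical step names:

```typescript
// Minimal saga executor: each step has a forward action and a compensation.
// If any step fails, the steps that already completed are compensated in
// reverse order instead of relying on a distributed transaction.
interface SagaStep<T> {
  name: string;
  action: () => Promise<T>;
  compensate: (result: T) => Promise<void>;
}

async function runSaga(steps: SagaStep<unknown>[]): Promise<"committed" | "compensated"> {
  const completed: { step: SagaStep<unknown>; result: unknown }[] = [];
  for (const step of steps) {
    try {
      completed.push({ step, result: await step.action() });
    } catch {
      // Undo in reverse order; real systems persist saga state so
      // compensation survives a coordinator crash
      for (const { step: done, result } of completed.reverse()) {
        await done.compensate(result);
      }
      return "compensated";
    }
  }
  return "committed";
}
```

Compensations must be idempotent, since a crashed coordinator may retry them; that operational burden is why co-locating the entities under one shard key is usually the better fix.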

Shard Key That Changes

If the shard key value can change (e.g., a user's region), every record must be moved between shards when it changes. Choose immutable shard keys.

Enterprise Readiness Checklist

  • Shard key chosen based on query pattern analysis
  • Shard key included in all primary keys and foreign keys
  • Routing layer implemented with consistent hashing
  • Cross-shard query patterns identified and scatter-gather implemented
  • Shard-aware migration tooling with rollback capability
  • Monitoring dashboards for per-shard metrics (size, connections, latency)
  • Hotspot detection and alerting configured
  • Data rebalancing runbook documented and tested
  • UUID or distributed ID generation for all primary keys
  • Backup and restore procedures tested per-shard and cross-shard
  • DR plan accounts for partial shard failures
  • Connection pooling configured per shard

Conclusion

Enterprise database sharding succeeds when the shard key aligns with both query patterns and business isolation requirements. For multi-tenant SaaS systems, tenant ID provides natural sharding that satisfies both performance and compliance needs. Invest heavily in the routing layer, shard-aware migration tooling, and monitoring before the first production shard split — these operational foundations determine whether sharding scales gracefully or becomes an ongoing source of incidents.
