When your vector database handles hundreds of millions of vectors and thousands of queries per second, the architecture decisions that worked at smaller scale become bottlenecks. High-scale vector deployments expose failure modes you won't find in tutorials: memory pressure from large indexes, query tail latency spikes during index rebuilds, and the cold-start problem when scaling horizontally.
This guide covers the patterns that work at high scale — not theoretical limits, but configurations and architectures proven in production systems serving 10M+ daily queries across billions of vectors.
Capacity Planning for High-Scale Deployments
Before architecting, establish your numbers. Every decision flows from these constraints:
Memory estimation for HNSW indexes:
At 500M vectors with 1536 dimensions, the raw float32 vectors alone occupy roughly 3TB (500M × 1536 dims × 4 bytes), and the HNSW graph links plus allocator overhead push total RAM toward 4TB. This is where quantization and sharding become mandatory, not optional.
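As a back-of-the-envelope check, a sketch like this turns vector count, dimensionality, and the HNSW `M` parameter into a RAM estimate. The 1.1 graph-overhead factor is an assumption for illustration, not a measured constant; real engines add metadata and allocator overhead on top, which is how the estimate grows toward the 4TB figure above:

```python
def hnsw_memory_bytes(num_vectors: int, dims: int, m: int = 16,
                      bytes_per_dim: int = 4) -> int:
    """Rough HNSW memory estimate: raw vectors plus graph links.

    Layer 0 holds up to 2*M neighbor ids (4 bytes each) per vector;
    the sparser upper layers are folded into the 1.1 fudge factor.
    """
    vector_bytes = num_vectors * dims * bytes_per_dim
    link_bytes = int(num_vectors * 2 * m * 4 * 1.1)
    return vector_bytes + link_bytes

total = hnsw_memory_bytes(500_000_000, 1536, m=32)
print(f"{total / 1e12:.2f} TB")  # ~3.2 TB before operational headroom
```

Run this with your own projected counts before picking instance sizes; the output is a floor, not a budget.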
Sharding Strategies
Hash-Based Sharding
Distribute vectors across shards using consistent hashing:
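A minimal consistent-hash ring might look like the following — virtual nodes smooth the distribution, and adding a shard remaps only about 1/N of the keys. Shard names, the vnode count, and the MD5-based hash are illustrative choices, not a prescription:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map vector ids to shards via a hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        self.ring = []  # (hash, shard) pairs, sorted by hash
        for shard in shards:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{shard}:{i}"), shard))
        self.ring.sort()
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        # any stable, well-mixed 64-bit hash works here
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def shard_for(self, vector_id: str) -> str:
        # walk clockwise to the first vnode at or after the key's hash
        idx = bisect.bisect(self._keys, self._hash(vector_id)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing([f"shard-{i}" for i in range(8)])
print(ring.shard_for("doc-12345"))
```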
Partition-by-Tenant Sharding
For multi-tenant systems, co-locate each tenant's vectors on dedicated shards:
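One way to sketch this is a deterministic tenant-to-shard mapping; the function name and parameters below are illustrative, and very large tenants would typically get dedicated shards via an explicit override table instead:

```python
import hashlib

def shards_for_tenant(tenant_id: str, shard_count: int,
                      shards_per_tenant: int = 2) -> list[int]:
    """Pin a tenant to a small, stable shard set so its vectors stay
    co-located and its queries never fan out cluster-wide."""
    base = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    return [(base + i) % shard_count for i in range(shards_per_tenant)]

print(shards_for_tenant("acme-corp", shard_count=64))
```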
Quantization for Memory Efficiency
At high scale, quantization reduces memory by 4-8x with minimal recall loss:
Product Quantization (PQ)
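PQ splits each vector into M subvectors and stores only the id of the nearest centroid per subspace. The toy sketch below uses random codebooks so it runs standalone; in production the codebooks are trained with k-means over a sample of your data (this is what libraries like faiss do), and the shapes here are illustrative:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode x as one uint8 centroid id per subspace."""
    m, k, sub_dim = codebooks.shape  # (subspaces, centroids, sub_dim)
    subs = x.reshape(m, sub_dim)
    return np.array(
        [np.argmin(((codebooks[i] - subs[i]) ** 2).sum(axis=1)) for i in range(m)],
        dtype=np.uint8,
    )

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector from its PQ codes."""
    return np.concatenate([codebooks[i, c] for i, c in enumerate(codes)])

rng = np.random.default_rng(0)
dims, m, k = 128, 16, 256  # 128-dim float32 vector -> 16 bytes (32x smaller)
codebooks = rng.standard_normal((m, k, dims // m)).astype(np.float32)
x = rng.standard_normal(dims).astype(np.float32)
codes = pq_encode(x, codebooks)
approx = pq_decode(codes, codebooks)
print(codes.nbytes, x.nbytes)  # 16 bytes vs 512 bytes
```

With 256 centroids per subspace, each code fits in a single byte, which is where the large compression ratios come from.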
Scalar Quantization (SQ8)
Simpler and faster than PQ, with less compression:
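A minimal SQ8 sketch, assuming per-dimension min/max learned from a training sample (a common scheme, though engines differ in how they pick the quantization range):

```python
import numpy as np

def sq8_train(vectors):
    """Learn per-dimension min/max from a training sample."""
    return vectors.min(axis=0), vectors.max(axis=0)

def sq8_encode(x, vmin, vmax):
    """Map each float32 dimension to one uint8 bucket: 4x compression."""
    scale = np.where(vmax > vmin, vmax - vmin, 1.0)
    return np.clip(np.round((x - vmin) / scale * 255), 0, 255).astype(np.uint8)

def sq8_decode(codes, vmin, vmax):
    return vmin + codes.astype(np.float32) / 255 * (vmax - vmin)

rng = np.random.default_rng(1)
train = rng.standard_normal((1000, 64)).astype(np.float32)
vmin, vmax = sq8_train(train)
x = train[0]
codes = sq8_encode(x, vmin, vmax)
approx = sq8_decode(codes, vmin, vmax)
print(codes.nbytes, x.nbytes)  # 64 vs 256 bytes: 4x smaller
```

The reconstruction error per dimension is bounded by half a quantization step, which is why SQ8 recall loss is usually negligible compared to PQ.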
Write Pipeline for High Throughput
Ingesting vectors at 100K/second requires careful batching and backpressure:
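The core of such a pipeline can be sketched as a bounded buffer: the queue's size limit is the backpressure mechanism, so producers block rather than drop writes when the indexer falls behind. `sink` stands in for your engine's bulk-upsert call; a production pipeline would add retries, a write-ahead log, and metrics on top:

```python
import queue
import threading
import time

class VectorWriteBuffer:
    """Batch incoming vectors; a bounded queue provides backpressure."""

    def __init__(self, flush_batch_size=1000, max_buffered=10_000,
                 flush_interval=0.2, sink=None):
        self.q = queue.Queue(maxsize=max_buffered)  # bound = backpressure
        self.batch_size = flush_batch_size
        self.interval = flush_interval
        self.sink = sink or (lambda batch: None)    # bulk-upsert stand-in
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, vector_id, embedding):
        self.q.put((vector_id, embedding))  # blocks when the buffer is full

    def _run(self):
        batch = []
        deadline = time.monotonic() + self.interval
        while not (self._stop.is_set() and self.q.empty()):
            try:
                batch.append(self.q.get(timeout=0.05))
            except queue.Empty:
                pass
            # flush on size, or on the interval deadline for partial batches
            if len(batch) >= self.batch_size or (batch and time.monotonic() >= deadline):
                self.sink(batch)
                batch = []
                deadline = time.monotonic() + self.interval
        if batch:
            self.sink(batch)  # final partial batch on shutdown

    def close(self):
        self._stop.set()
        self._worker.join()

received = []
buf = VectorWriteBuffer(flush_batch_size=10,
                        sink=lambda batch: received.extend(batch))
for i in range(25):
    buf.submit(f"vec-{i}", [0.1, 0.2, 0.3])
buf.close()  # drains the queue and flushes the final partial batch
print(len(received))  # 25
```

Blocking `submit()` is a deliberate choice: it pushes backpressure up to the caller (and ultimately the ingest API) instead of silently losing vectors.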
Query Fan-Out and Result Merging
When vectors are distributed across shards, queries must fan out and merge:
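A sketch of scatter-gather with a deadline, assuming each shard exposes a `search(query, k)` returning scored hits (the interfaces here are illustrative). Shards that fail or miss the timeout are skipped, trading a little recall for bounded latency:

```python
import heapq
from concurrent.futures import (ThreadPoolExecutor, as_completed,
                                TimeoutError as ShardTimeout)

def fan_out_search(shards, query, k=10, timeout=0.5):
    """Query all shards in parallel, merge per-shard top-k into global top-k."""
    results = []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(s.search, query, k) for s in shards]
        try:
            for fut in as_completed(futures, timeout=timeout):
                try:
                    results.extend(fut.result())  # (score, doc_id) pairs
                except Exception:
                    pass  # failed shard: degrade instead of erroring out
        except ShardTimeout:
            pass  # slow shards missed the deadline; merge what arrived
    return heapq.nlargest(k, results)  # highest score = most similar

# toy shard returning pre-scored hits
class FakeShard:
    def __init__(self, hits): self.hits = hits
    def search(self, query, k): return sorted(self.hits, reverse=True)[:k]

shards = [FakeShard([(0.9, "a"), (0.4, "b")]),
          FakeShard([(0.8, "c"), (0.7, "d")])]
print(fan_out_search(shards, query=None, k=3))  # [(0.9, 'a'), (0.8, 'c'), (0.7, 'd')]
```

Note that each shard only needs to return its local top-k; the merge step never touches more than `k × num_shards` candidates.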
Index Rebuild Strategy
HNSW index rebuilds are expensive. At high scale, you need a strategy that avoids downtime:
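The usual answer is a blue-green swap: build a candidate index offline from a snapshot, replay the writes that arrived during the build from a change log, then flip the read alias atomically. The sketch below shows the control flow with toy stand-ins; `build_fn` and the `change_log` interface are assumptions about your storage layer:

```python
class IndexAlias:
    """Blue-green rebuild: offline build, catch-up replay, atomic swap,
    with the old index retained for instant rollback."""

    def __init__(self, live_index):
        self.live = live_index
        self.previous = None

    def rebuild(self, build_fn, change_log):
        start_seq = change_log.head()            # mark before the build
        candidate = build_fn()                   # long-running offline build
        for op in change_log.since(start_seq):   # replay mid-build writes
            candidate.apply(op)
        self.previous, self.live = self.live, candidate  # the swap

    def rollback(self):
        self.live, self.previous = self.previous, self.live

# toy stand-ins for a real index and write-ahead change log
class ToyIndex:
    def __init__(self, docs): self.docs = set(docs)
    def apply(self, op): self.docs.add(op)

class ToyLog:
    def __init__(self): self.ops = []
    def head(self): return len(self.ops)
    def since(self, seq): return self.ops[seq:]

log = ToyLog()
alias = IndexAlias(ToyIndex({"a"}))

def build_fn():
    snapshot = ToyIndex({"a"})   # built from a snapshot taken at start
    log.ops.append("b")          # a write lands mid-build
    return snapshot

alias.rebuild(build_fn, log)
print(sorted(alias.live.docs))   # ['a', 'b'] — the mid-build write was replayed
```

Keeping `previous` alive doubles peak memory during the swap window; budget for it, since that is also your rollback path.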
Anti-Patterns to Avoid
Loading Entire Index into RAM
At billion-scale, full in-memory indexes are cost-prohibitive. Use quantized vectors in RAM with full vectors on SSD for rescoring. This reduces memory by 4-8x while maintaining recall above 0.95.
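A sketch of the two-stage pattern, using float16 as a stand-in for real SQ/PQ codes and an in-memory array standing in for the SSD tier (`full_store` and the overfetch factor are illustrative):

```python
import numpy as np

def search_with_rescoring(query, quantized, full_store, k=10, overfetch=4):
    """Stage 1: rank on compressed in-RAM vectors.
    Stage 2: rescore the top overfetch*k with full-precision vectors
    fetched from slower storage, recovering recall lost to quantization."""
    coarse = quantized @ query
    candidates = np.argsort(-coarse)[: k * overfetch]
    exact = {int(i): float(full_store[int(i)] @ query) for i in candidates}
    return sorted(exact, key=exact.get, reverse=True)[:k]

rng = np.random.default_rng(2)
full = rng.standard_normal((1000, 64)).astype(np.float32)   # "SSD" tier
quant = full.astype(np.float16).astype(np.float32)          # "RAM" tier
q = rng.standard_normal(64).astype(np.float32)
top = search_with_rescoring(q, quant, full, k=5)
```

The overfetch factor is the knob: larger values recover more recall at the cost of more SSD reads per query.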
Synchronous Write-then-Read
Vector indexes update asynchronously. Writes are not immediately searchable. Design your application to tolerate eventual consistency — typically 100ms to 2 seconds for newly inserted vectors to become queryable.
Unbounded Fan-Out
Querying all shards for every request doesn't scale past 50 shards. Implement query routing that narrows the shard set based on metadata filters or cluster assignment. For tenant-scoped queries, route directly to the tenant's shards.
Ignoring Cold Start Latency
When a shard starts or restarts, the first queries hit disk while the index loads into memory. Pre-warm shards by running synthetic queries during startup. Set readiness probes to only pass after warmup completes.
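A minimal sketch of readiness gated on warmup, assuming a shard process with a `search` method and a probe endpoint wired to `readiness_probe()` (names are illustrative):

```python
import threading

class ShardServer:
    """Serve the readiness probe only after synthetic warmup queries
    have paged the index into memory."""

    def __init__(self, index, warmup_queries):
        self.index = index
        self._ready = threading.Event()
        threading.Thread(target=self._warm, args=(warmup_queries,),
                         daemon=True).start()

    def _warm(self, queries):
        for q in queries:            # first hits touch cold pages
            self.index.search(q, k=10)
        self._ready.set()

    def readiness_probe(self) -> bool:  # e.g. behind GET /readyz
        return self._ready.is_set()

class ToyIndex:
    def search(self, q, k): return []

srv = ShardServer(ToyIndex(), warmup_queries=[[0.0]] * 100)
srv._ready.wait(timeout=5)
print(srv.readiness_probe())  # True — probe passes only after warmup
```

Use a sample of real production queries for warmup where possible; synthetic uniform queries may not touch the same graph regions as live traffic.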
Monolithic Index Configuration
Different query patterns need different indexes. Keep a high-recall HNSW index for RAG queries alongside a faster IVF_PQ index for recommendation feeds. Route queries to the appropriate index based on the use case.
High-Scale Readiness Checklist
- Memory budget calculated per shard with quantization factored in
- Sharding strategy tested with 2x projected vector count
- Write pipeline handles backpressure without dropping vectors
- Query fan-out respects timeouts and degrades gracefully on shard failure
- Quantization recall validated against ground-truth dataset
- Blue-green index rebuild procedure documented and tested
- Pre-warming implemented for cold start mitigation
- Monitoring covers per-shard latency, recall estimates, and memory pressure
- Capacity alerts set at 70% utilization to allow scaling lead time
- Disaster recovery tested: single shard loss, full region failover
- Cost model validated against cloud provider billing at projected scale