
Vector Database Architecture Best Practices for High Scale Teams

Battle-tested practices for vector database architecture on high-scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 18 min read

When your vector database handles hundreds of millions of vectors and thousands of queries per second, the architecture decisions that worked at smaller scale become bottlenecks. High-scale vector deployments expose failure modes you won't find in tutorials: memory pressure from large indexes, query tail latency spikes during index rebuilds, and the cold-start problem when scaling horizontally.

This guide covers the patterns that work at high scale — not theoretical limits, but configurations and architectures proven in production systems serving 10M+ daily queries across billions of vectors.

Capacity Planning for High-Scale Deployments

Before architecting, establish your numbers. Every decision flows from these constraints:

  • Vectors: 500M - 5B total
  • Dimensions: 768 - 1536
  • Query throughput: 5K - 50K QPS
  • Latency target: p99 < 100ms
  • Write throughput: 10K - 100K vectors/second
  • Recall target: > 0.95

Memory estimation for HNSW indexes:

python
def estimate_memory_gb(
    num_vectors: int,
    dimensions: int,
    m_connections: int = 16,
    bytes_per_float: int = 4,
    overhead_factor: float = 1.3,
) -> float:
    """
    Estimate HNSW index memory requirements.

    Components:
    - Vector data: num_vectors * dimensions * bytes_per_float
    - Graph links: num_vectors * m_connections * 2 * 8 bytes (neighbor IDs)
    - Metadata overhead: ~30% on top
    """
    vector_bytes = num_vectors * dimensions * bytes_per_float
    graph_bytes = num_vectors * m_connections * 2 * 8
    total_bytes = (vector_bytes + graph_bytes) * overhead_factor
    return total_bytes / (1024 ** 3)

# Example: 500M vectors, 1536 dimensions
memory = estimate_memory_gb(500_000_000, 1536)
print(f"Estimated memory: {memory:.0f} GB")
# Output: Estimated memory: 3874 GB

At 500M vectors with 1536 dimensions, you need roughly 4TB of RAM for the HNSW index alone. This is where quantization and sharding become mandatory, not optional.
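To see why, it helps to rerun the same arithmetic with quantization and sharding applied. A rough sketch of the per-shard budget — the shard count and int8 compression here are illustrative assumptions, not a sizing recommendation:

```python
def quantized_memory_per_shard_gb(
    num_vectors: int,
    dimensions: int,
    num_shards: int,
    bytes_per_component: int = 1,   # int8 scalar quantization
    m_connections: int = 16,
    overhead_factor: float = 1.3,
) -> float:
    """Per-shard memory once vectors are quantized and evenly sharded."""
    per_shard = num_vectors // num_shards
    vector_bytes = per_shard * dimensions * bytes_per_component
    graph_bytes = per_shard * m_connections * 2 * 8  # HNSW neighbor links
    return (vector_bytes + graph_bytes) * overhead_factor / (1024 ** 3)

# 500M vectors, 1536 dims, int8, spread over 16 shards
print(f"{quantized_memory_per_shard_gb(500_000_000, 1536, 16):.0f} GB per shard")
# Output: 68 GB per shard
```

That brings each shard into the range of a single large memory-optimized instance, which is the point of combining both techniques.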

Sharding Strategies

Hash-Based Sharding

Distribute vectors across shards using consistent hashing:

go
package vectordb

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
	"sync"
)

type ShardRouter struct {
	shards   []ShardInfo
	ring     []ringEntry
	mu       sync.RWMutex
	replicas int
}

type ShardInfo struct {
	ID       string
	Endpoint string
	Weight   int
}

type ringEntry struct {
	hash    uint64
	shardID string
}

func NewShardRouter(shards []ShardInfo, virtualNodes int) *ShardRouter {
	router := &ShardRouter{
		shards:   shards,
		replicas: virtualNodes,
	}
	router.buildRing()
	return router
}

func (r *ShardRouter) buildRing() {
	r.ring = nil
	for _, shard := range r.shards {
		for i := 0; i < r.replicas*shard.Weight; i++ {
			key := fmt.Sprintf("%s:%d", shard.ID, i)
			hash := hashKey(key)
			r.ring = append(r.ring, ringEntry{
				hash:    hash,
				shardID: shard.ID,
			})
		}
	}
	sort.Slice(r.ring, func(i, j int) bool {
		return r.ring[i].hash < r.ring[j].hash
	})
}

func hashKey(key string) uint64 {
	h := sha256.Sum256([]byte(key))
	return binary.BigEndian.Uint64(h[:8])
}

func (r *ShardRouter) GetShard(vectorID string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()

	hash := hashKey(vectorID)
	idx := sort.Search(len(r.ring), func(i int) bool {
		return r.ring[i].hash >= hash
	})
	if idx == len(r.ring) {
		idx = 0
	}
	return r.ring[idx].shardID
}

// GetQueryShards returns every shard exactly once: queries fan out
// to all shards and the caller merges the results.
func (r *ShardRouter) GetQueryShards() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()

	seen := make(map[string]bool)
	var shards []string
	for _, entry := range r.ring {
		if !seen[entry.shardID] {
			seen[entry.shardID] = true
			shards = append(shards, entry.shardID)
		}
	}
	return shards
}

Partition-by-Tenant Sharding

For multi-tenant systems, co-locate each tenant's vectors on dedicated shards:

go
type TenantRouter struct {
	assignments map[string][]string // tenant -> shard IDs
	mu          sync.RWMutex
}

func (r *TenantRouter) AssignTenant(
	tenantID string,
	vectorCount int64,
	shardCapacity int64,
) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Ceiling division: shards needed to hold the tenant's vectors
	shardsNeeded := (vectorCount + shardCapacity - 1) / shardCapacity
	// findAvailableShards (not shown) picks shards with spare capacity
	shards := r.findAvailableShards(int(shardsNeeded))
	r.assignments[tenantID] = shards
}

func (r *TenantRouter) GetWriteShard(tenantID string, vectorID string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()

	shards := r.assignments[tenantID]
	if len(shards) == 0 {
		return ""
	}
	// Hash within the tenant's shard set
	hash := hashKey(vectorID)
	idx := hash % uint64(len(shards))
	return shards[idx]
}

Quantization for Memory Efficiency

At high scale, quantization reduces memory by 4-8x with minimal recall loss:

Product Quantization (PQ)

python
# Milvus with IVF_PQ index for billion-scale
from pymilvus import (
    connections, Collection, CollectionSchema,
    FieldSchema, DataType,
)

connections.connect(host="milvus-proxy", port="19530")

schema = CollectionSchema(fields=[
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
])

collection = Collection(name="vectors_pq", schema=schema)

# IVF_PQ: partition vectors into clusters, then quantize within clusters
index_params = {
    "metric_type": "IP",  # Inner product (cosine after normalization)
    "index_type": "IVF_PQ",
    "params": {
        "nlist": 16384,  # Number of clusters (sqrt(N) is a starting point)
        "m": 48,         # Number of sub-quantizers (1536 dims / 48 = 32 dims each)
        "nbits": 8,      # Bits per sub-quantizer codeword
    },
}

collection.create_index(
    field_name="embedding",
    index_params=index_params,
)

# Query; to recover accuracy lost to quantization, fetch extra candidates
# and rerank them with the original vectors at the application layer
search_params = {
    "metric_type": "IP",
    "params": {
        "nprobe": 128,  # Number of clusters to search (higher = better recall)
    },
}

Scalar Quantization (SQ8)

Simpler and faster than PQ, with less compression:

python
# Qdrant with scalar quantization
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, SearchParams,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
    QuantizationSearchParams,
)

client = QdrantClient(url="http://qdrant:6333")

client.create_collection(
    collection_name="high_scale_vectors",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
        on_disk=True,  # Full-precision vectors on SSD, quantized in RAM
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,    # Clip outliers above the 99th percentile
            always_ram=True,  # Keep quantized vectors in RAM
        ),
    ),
)

# Search uses quantized vectors first, then rescores top candidates
results = client.search(
    collection_name="high_scale_vectors",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(
        quantization=QuantizationSearchParams(
            ignore=False,      # Use the quantized index
            rescore=True,      # Rescore with original vectors
            oversampling=2.0,  # Fetch 2x candidates before rescoring
        ),
    ),
)

Write Pipeline for High Throughput

Ingesting vectors at 100K/second requires careful batching and backpressure:

go
package ingest

import (
	"context"
	"sync"
	"time"
)

type Vector struct {
	ID        string
	Embedding []float32
	Metadata  map[string]interface{}
	TenantID  string
}

type BatchWriter struct {
	buffer        []Vector
	mu            sync.Mutex
	batchSize     int
	flushInterval time.Duration
	writeFn       func(ctx context.Context, batch []Vector) error
	done          chan struct{}
}

func NewBatchWriter(
	batchSize int,
	flushInterval time.Duration,
	writeFn func(ctx context.Context, batch []Vector) error,
) *BatchWriter {
	bw := &BatchWriter{
		buffer:        make([]Vector, 0, batchSize),
		batchSize:     batchSize,
		flushInterval: flushInterval,
		writeFn:       writeFn,
		done:          make(chan struct{}),
	}
	go bw.flushLoop()
	return bw
}

// Add buffers a vector; a full buffer flushes synchronously,
// which applies backpressure to the producer.
func (bw *BatchWriter) Add(v Vector) error {
	bw.mu.Lock()
	bw.buffer = append(bw.buffer, v)
	shouldFlush := len(bw.buffer) >= bw.batchSize
	bw.mu.Unlock()

	if shouldFlush {
		return bw.flush()
	}
	return nil
}

func (bw *BatchWriter) flush() error {
	bw.mu.Lock()
	if len(bw.buffer) == 0 {
		bw.mu.Unlock()
		return nil
	}
	batch := bw.buffer
	bw.buffer = make([]Vector, 0, bw.batchSize)
	bw.mu.Unlock()

	return bw.writeFn(context.Background(), batch)
}

func (bw *BatchWriter) flushLoop() {
	ticker := time.NewTicker(bw.flushInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			bw.flush()
		case <-bw.done:
			bw.flush() // Final flush
			return
		}
	}
}

func (bw *BatchWriter) Close() {
	close(bw.done)
}


Query Fan-Out and Result Merging

When vectors are distributed across shards, queries must fan out and merge:

go
package query

import (
	"context"
	"sort"
	"sync"
	"time"
)

type SearchResult struct {
	ID       string
	Score    float32
	Metadata map[string]interface{}
	ShardID  string
}

type ShardSearcher interface {
	Search(ctx context.Context, vector []float32, topK int) ([]SearchResult, error)
}

func FanOutSearch(
	ctx context.Context,
	shards map[string]ShardSearcher,
	vector []float32,
	topK int,
	timeout time.Duration,
) ([]SearchResult, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var mu sync.Mutex
	var allResults []SearchResult
	var firstErr error

	var wg sync.WaitGroup
	for shardID, searcher := range shards {
		wg.Add(1)
		go func(id string, s ShardSearcher) {
			defer wg.Done()
			results, err := s.Search(ctx, vector, topK)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			for i := range results {
				results[i].ShardID = id
			}
			allResults = append(allResults, results...)
		}(shardID, searcher)
	}

	wg.Wait()

	// If all shards failed, return the error
	if len(allResults) == 0 && firstErr != nil {
		return nil, firstErr
	}

	// Merge: sort by score descending, take top K
	sort.Slice(allResults, func(i, j int) bool {
		return allResults[i].Score > allResults[j].Score
	})

	if len(allResults) > topK {
		allResults = allResults[:topK]
	}

	return allResults, nil
}

Index Rebuild Strategy

HNSW index rebuilds are expensive. At high scale, you need a strategy that avoids downtime:

python
# Blue-green index rebuild pattern
import asyncio

class IndexManager:
    def __init__(self, client, base_name: str):
        self.client = client
        self.base_name = base_name
        self.active_suffix = "blue"

    @property
    def active_collection(self) -> str:
        return f"{self.base_name}_{self.active_suffix}"

    @property
    def standby_collection(self) -> str:
        suffix = "green" if self.active_suffix == "blue" else "blue"
        return f"{self.base_name}_{suffix}"

    async def rebuild_index(
        self,
        vectors_source,
        new_config: dict,
    ):
        standby = self.standby_collection

        # Drop and recreate the standby collection with the new config
        if self.client.collection_exists(standby):
            self.client.delete_collection(standby)
        self.client.create_collection(standby, **new_config)

        # Bulk load into standby
        batch = []
        async for vector in vectors_source:
            batch.append(vector)
            if len(batch) >= 10000:
                self.client.upsert(standby, batch)
                batch = []
        if batch:
            self.client.upsert(standby, batch)

        # Wait for indexing to complete (polls collection status; not shown)
        await self._wait_for_index_ready(standby)

        # Swap active pointer
        old_suffix = self.active_suffix
        self.active_suffix = "green" if old_suffix == "blue" else "blue"

        # Clean up the old collection after traffic drains
        await asyncio.sleep(30)
        self.client.delete_collection(f"{self.base_name}_{old_suffix}")

Anti-Patterns to Avoid

Loading Entire Index into RAM

At billion-scale, full in-memory indexes are cost-prohibitive. Use quantized vectors in RAM with full vectors on SSD for rescoring. This reduces memory by 4-8x while maintaining recall above 0.95.

Synchronous Write-then-Read

Vector indexes update asynchronously. Writes are not immediately searchable. Design your application to tolerate eventual consistency — typically 100ms to 2 seconds for newly inserted vectors to become queryable.
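One way to tolerate this in read-after-write paths and integration tests is to poll with a deadline instead of assuming immediate visibility. A minimal sketch — `fetch_fn` is a stand-in for your database's point-lookup or search-by-id call, not a real client API:

```python
import time

def wait_until_searchable(
    fetch_fn,
    vector_id: str,
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.1,
) -> bool:
    """Poll until a newly written vector is visible, or give up.

    fetch_fn(vector_id) -> bool wraps the database's lookup call.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_fn(vector_id):
            return True
        time.sleep(poll_interval_s)
    return False
```

Production code paths should avoid this kind of blocking wait entirely and be designed so the UI or downstream consumer tolerates the indexing lag.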

Unbounded Fan-Out

Querying all shards for every request doesn't scale past 50 shards. Implement query routing that narrows the shard set based on metadata filters or cluster assignment. For tenant-scoped queries, route directly to the tenant's shards.
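A minimal routing sketch, assuming tenant-to-shard assignments are tracked as in the tenant router above (the shard IDs and filter fields are illustrative):

```python
def shards_for_query(
    all_shards: list[str],
    tenant_shards: dict[str, list[str]],
    filters: dict,
) -> list[str]:
    """Return the minimal shard set for a query.

    Tenant-scoped queries go only to the tenant's shards; everything
    else falls back to a full fan-out.
    """
    tenant_id = filters.get("tenant_id")
    if tenant_id and tenant_id in tenant_shards:
        return tenant_shards[tenant_id]
    return all_shards

shards = shards_for_query(
    ["s1", "s2", "s3", "s4"],
    {"acme": ["s2"], "globex": ["s3", "s4"]},
    {"tenant_id": "acme"},
)
# → ["s2"] instead of all four shards
```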

Ignoring Cold Start Latency

When a shard starts or restarts, the first queries hit disk while the index loads into memory. Pre-warm shards by running synthetic queries during startup. Set readiness probes to only pass after warmup completes.
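A sketch of that warmup gate — `search_fn` stands in for the shard's search call, and random probes substitute for what would ideally be a replayed sample of production queries:

```python
import random
import threading

class WarmupGate:
    """Run synthetic queries at startup; readiness passes only afterwards."""

    def __init__(self, search_fn, dimensions: int, num_queries: int = 200):
        self.search_fn = search_fn   # stand-in for the shard's search call
        self.dimensions = dimensions
        self.num_queries = num_queries
        self._ready = threading.Event()

    def warm(self) -> None:
        for _ in range(self.num_queries):
            # Random probes page the index into memory
            query = [random.uniform(-1, 1) for _ in range(self.dimensions)]
            self.search_fn(query)
        self._ready.set()

    def is_ready(self) -> bool:
        # Wire this into the readiness probe handler
        return self._ready.is_set()
```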

Monolithic Index Configuration

Different query patterns need different indexes. Keep a high-recall HNSW index for RAG queries alongside a faster IVF_PQ index for recommendation feeds. Route queries to the appropriate index based on the use case.
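A sketch of that routing layer, with hypothetical collection names, use-case labels, and parameters:

```python
# Map each query type to the collection and search params tuned for it
INDEX_ROUTES = {
    "rag": {
        "collection": "vectors_hnsw",  # high-recall HNSW
        "params": {"ef": 256},
    },
    "recommendations": {
        "collection": "vectors_pq",    # faster IVF_PQ
        "params": {"nprobe": 32},
    },
}

def route_query(use_case: str) -> dict:
    """Resolve the collection and search params for a use case."""
    route = INDEX_ROUTES.get(use_case)
    if route is None:
        raise ValueError(f"no index configured for use case: {use_case}")
    return route

print(route_query("rag")["collection"])  # → vectors_hnsw
```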

High-Scale Readiness Checklist

  • Memory budget calculated per shard with quantization factored in
  • Sharding strategy tested with 2x projected vector count
  • Write pipeline handles backpressure without dropping vectors
  • Query fan-out respects timeouts and degrades gracefully on shard failure
  • Quantization recall validated against ground-truth dataset
  • Blue-green index rebuild procedure documented and tested
  • Pre-warming implemented for cold start mitigation
  • Monitoring covers per-shard latency, recall estimates, and memory pressure
  • Capacity alerts set at 70% utilization to allow scaling lead time
  • Disaster recovery tested: single shard loss, full region failover
  • Cost model validated against cloud provider billing at projected scale
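The recall-validation item above can be sketched as a recall@k check of the approximate (quantized) index against exact brute-force ground truth — the ID lists here are placeholders for real query results:

```python
def recall_at_k(
    approx_ids: list[list[str]],
    exact_ids: list[list[str]],
    k: int,
) -> float:
    """Fraction of exact top-k neighbors the approximate index returned."""
    hits = 0
    total = 0
    for approx, exact in zip(approx_ids, exact_ids):
        truth = set(exact[:k])
        hits += len(truth & set(approx[:k]))
        total += len(truth)
    return hits / total if total else 0.0

# Example: 2 queries, k=3; one approximate miss out of six neighbors
approx = [["a", "b", "x"], ["p", "q", "r"]]
exact = [["a", "b", "c"], ["p", "q", "r"]]
print(recall_at_k(approx, exact, 3))  # 5/6 ≈ 0.833
```

Run this over a held-out query set whenever quantization or index parameters change, and alert if recall drops below the target from your capacity plan.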

