How many vectors per shard is optimal?

Target 10-50 million vectors per shard for HNSW with scalar quantization. Beyond 50M, query latency degrades noticeably as the graph traversal becomes deeper. If using IVF_PQ, you can push to 200M per shard since the cluster-based approach scales more linearly.

When should I switch from pgvector to a dedicated vector database?

Switch when you hit any of these: query latency exceeds 50ms at p95, vector count exceeds 5 million, you need more than 2,000 dimensions, or you need horizontal scaling. pgvector is single-node and its HNSW implementation, while solid, doesn't support sharding or quantization natively.

How do I handle embedding model upgrades at scale?

Never mix embeddings from different models in the same index. Create a new collection with the updated model, backfill all vectors in parallel (use your write pipeline at full throughput), validate recall on a test set, then swap traffic. Keep the old collection available for 48 hours as a rollback target.

What's the cost difference between managed and self-hosted at scale?

At 500M vectors, managed services (Pinecone, Zilliz) typically cost $15K-40K/month. Self-hosted on Kubernetes with Qdrant or Milvus runs $5K-12K/month in compute and storage, plus one full-time engineer for operations. The break-even point is usually around 100M vectors — below that, managed is almost always cheaper when you factor in engineering time.

Vector Database Architecture Best Practices for High Scale Teams

When your vector database handles hundreds of millions of vectors and thousands of queries per second, the architecture decisions that worked at smaller scale become bottlenecks. High-scale vector deployments expose failure modes you won't find in tutorials: memory pressure from large indexes, query tail latency spikes during index rebuilds, and the cold-start problem when scaling horizontally.

This guide covers the patterns that work at high scale — not theoretical limits, but configurations and architectures proven in production systems serving 10M+ daily queries across billions of vectors.

Capacity Planning for High-Scale Deployments

Before architecting, establish your numbers. Every decision flows from these constraints:

1Vectors: 500M - 5B total

2Dimensions: 768 - 1536

3Query throughput: 5K - 50K QPS

4Latency target: p99 < 100ms

5Write throughput: 10K - 100K vectors/second

6Recall target: > 0.95

Memory estimation for HNSW indexes:

python

1def estimate_memory_gb(

2 num_vectors: int,

3 dimensions: int,

4 m_connections: int = 16,

5 bytes_per_float: int = 4,

6 overhead_factor: float = 1.3,

7) -> float:

8 """

9 Estimate HNSW index memory requirements.

11 Components:

12 - Vector data: num_vectors * dimensions * bytes_per_float

13 - Graph links: num_vectors * m_connections * 2 * 8 bytes (neighbor IDs)

14 - Metadata overhead: ~30% on top

15 """

16 vector_bytes = num_vectors * dimensions * bytes_per_float

17 graph_bytes = num_vectors * m_connections * 2 * 8

18 total_bytes = (vector_bytes + graph_bytes) * overhead_factor

19 return total_bytes / (1024 ** 3)

21# Example: 500M vectors, 1536 dimensions

22memory = estimate_memory_gb(500_000_000, 1536)

23print(f"Estimated memory: {memory:.0f} GB")

24# Output: Estimated memory: 3878 GB

At 500M vectors with 1536 dimensions, you need roughly 4TB of RAM for the HNSW index alone. This is where quantization and sharding become mandatory, not optional.

Sharding Strategies

Hash-Based Sharding

Distribute vectors across shards using consistent hashing:

1package vectordb

3import (

4 "crypto/sha256"

5 "encoding/binary"

6 "fmt"

7 "sort"

8 "sync"

11type ShardRouter struct {

12 shards []ShardInfo

13 ring []ringEntry

14 mu sync.RWMutex

15 replicas int

16}

18type ShardInfo struct {

19 ID string

20 Endpoint string

21 Weight int

22}

24type ringEntry struct {

25 hash uint64

26 shardID string

27}

29func NewShardRouter(shards []ShardInfo, virtualNodes int) *ShardRouter {

30 router := &ShardRouter{

31 shards: shards,

32 replicas: virtualNodes,

33 }

34 router.buildRing()

35 return router

36}

38func (r *ShardRouter) buildRing() {

39 r.ring = nil

40 for _, shard := range r.shards {

41 for i := 0; i < r.replicas * shard.Weight; i++ {

42 key := fmt.Sprintf("%s:%d", shard.ID, i)

43 hash := hashKey(key)

44 r.ring = append(r.ring, ringEntry{

45 hash: hash,

46 shardID: shard.ID,

47 })

48 }

49 }

50 sort.Slice(r.ring, func(i, j int) bool {

51 return r.ring[i].hash < r.ring[j].hash

52 })

53}

55func hashKey(key string) uint64 {

56 h := sha256.Sum256([]byte(key))

57 return binary.BigEndian.Uint64(h[:8])

58}

60func (r *ShardRouter) GetShard(vectorID string) string {

61 r.mu.RLock()

62 defer r.mu.RUnlock()

64 hash := hashKey(vectorID)

65 idx := sort.Search(len(r.ring), func(i int) bool {

66 return r.ring[i].hash >= hash

67 })

68 if idx == len(r.ring) {

69 idx = 0

70 }

71 return r.ring[idx].shardID

72}

74func (r *ShardRouter) GetQueryShards(topK int) []string {

75 // For queries, fan out to all shards and merge results

76 seen := make(map[string]bool)

77 var shards []string

78 for _, entry := range r.ring {

79 if !seen[entry.shardID] {

80 seen[entry.shardID] = true

81 shards = append(shards, entry.shardID)

82 }

83 }

84 return shards

85}

Partition-by-Tenant Sharding

For multi-tenant systems, co-locate each tenant's vectors on dedicated shards:

1type TenantRouter struct {

2 assignments map[string][]string // tenant -> shard IDs

3 mu sync.RWMutex

6func (r *TenantRouter) AssignTenant(

7 tenantID string,

8 vectorCount int64,

9 shardCapacity int64,

10) {

11 r.mu.Lock()

12 defer r.mu.Unlock()

14 shardsNeeded := (vectorCount + shardCapacity - 1) / shardCapacity

15 shards := r.findAvailableShards(int(shardsNeeded))

16 r.assignments[tenantID] = shards

17}

19func (r *TenantRouter) GetWriteShard(tenantID string, vectorID string) string {

20 r.mu.RLock()

21 defer r.mu.RUnlock()

23 shards := r.assignments[tenantID]

24 if len(shards) == 0 {

25 return ""

26 }

27 // Hash within tenant's shard set

28 hash := hashKey(vectorID)

29 idx := hash % uint64(len(shards))

30 return shards[idx]

31}

Quantization for Memory Efficiency

At high scale, quantization reduces memory by 4-8x with minimal recall loss:

Product Quantization (PQ)

python

1# Milvus with IVF_PQ index for billion-scale

2from pymilvus import (

3 connections, Collection, CollectionSchema,

4 FieldSchema, DataType, utility,

7connections.connect(host="milvus-proxy", port="19530")

9schema = CollectionSchema(fields=[

10 FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),

11 FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64),

12 FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),

13])

15collection = Collection(name="vectors_pq", schema=schema)

17# IVF_PQ: partition vectors into clusters, then quantize within clusters

18index_params = {

19 "metric_type": "IP", # Inner product (cosine after normalization)

20 "index_type": "IVF_PQ",

21 "params": {

22 "nlist": 16384, # Number of clusters (sqrt(N) is a starting point)

23 "m": 48, # Number of sub-quantizers (dimensions / m ≈ 32)

24 "nbits": 8, # Bits per sub-quantizer

25 },

26}

28collection.create_index(

29 field_name="embedding",

30 index_params=index_params,

31)

33# Query with reranking for accuracy recovery

34search_params = {

35 "metric_type": "IP",

36 "params": {

37 "nprobe": 128, # Number of clusters to search (higher = better recall)

38 "rerank_k": 200, # Rerank top candidates with original vectors

39 },

40}

Scalar Quantization (SQ8)

Simpler and faster than PQ, with less compression:

python

1# Qdrant with scalar quantization

2from qdrant_client import QdrantClient

3from qdrant_client.models import (

4 VectorParams, Distance,

5 ScalarQuantizationConfig, ScalarType,

6 QuantizationSearchParams,

9client = QdrantClient(url="http://qdrant:6333")

11client.create_collection(

12 collection_name="high_scale_vectors",

13 vectors_config=VectorParams(

14 size=1536,

15 distance=Distance.COSINE,

16 on_disk=True, # Vectors on SSD, quantized in RAM

17 ),

18 quantization_config=ScalarQuantizationConfig(

19 type=ScalarType.INT8,

20 quantile=0.99, # Clip outliers above 99th percentile

21 always_ram=True, # Keep quantized vectors in RAM

22 ),

23)

25# Search uses quantized vectors first, then rescores top candidates

26results = client.search(

27 collection_name="high_scale_vectors",

28 query_vector=query_embedding,

29 limit=10,

30 search_params=QuantizationSearchParams(

31 ignore=False, # Use quantization

32 rescore=True, # Rescore with original vectors

33 oversampling=2.0, # Fetch 2x candidates before rescoring

34 ),

35)

Write Pipeline for High Throughput

Ingesting vectors at 100K/second requires careful batching and backpressure:

1package ingest

3import (

4 "context"

5 "sync"

6 "time"

9type Vector struct {

10 ID string

11 Embedding []float32

12 Metadata map[string]interface{}

13 TenantID string

14}

16type BatchWriter struct {

17 buffer []Vector

18 mu sync.Mutex

19 batchSize int

20 flushInterval time.Duration

21 writeFn func(ctx context.Context, batch []Vector) error

22 done chan struct{}

23}

25func NewBatchWriter(

26 batchSize int,

27 flushInterval time.Duration,

28 writeFn func(ctx context.Context, batch []Vector) error,

29) *BatchWriter {

30 bw := &BatchWriter{

31 buffer: make([]Vector, 0, batchSize),

32 batchSize: batchSize,

33 flushInterval: flushInterval,

34 writeFn: writeFn,

35 done: make(chan struct{}),

36 }

37 go bw.flushLoop()

38 return bw

39}

41func (bw *BatchWriter) Add(v Vector) error {

42 bw.mu.Lock()

43 bw.buffer = append(bw.buffer, v)

44 shouldFlush := len(bw.buffer) >= bw.batchSize

45 bw.mu.Unlock()

47 if shouldFlush {

48 return bw.flush()

49 }

50 return nil

51}

53func (bw *BatchWriter) flush() error {

54 bw.mu.Lock()

55 if len(bw.buffer) == 0 {

56 bw.mu.Unlock()

57 return nil

58 }

59 batch := bw.buffer

60 bw.buffer = make([]Vector, 0, bw.batchSize)

61 bw.mu.Unlock()

63 return bw.writeFn(context.Background(), batch)

64}

66func (bw *BatchWriter) flushLoop() {

67 ticker := time.NewTicker(bw.flushInterval)

68 defer ticker.Stop()

69 for {

70 select {

71 case <-ticker.C:

72 bw.flush()

73 case <-bw.done:

74 bw.flush() // Final flush

75 return

76 }

77 }

78}

80func (bw *BatchWriter) Close() {

81 close(bw.done)

82}

Need a second opinion on your AI systems architecture?

I run free 30-minute strategy calls for engineering teams tackling this exact problem.

Book a Free Call

Query Fan-Out and Result Merging

When vectors are distributed across shards, queries must fan out and merge:

1package query

3import (

4 "context"

5 "sort"

6 "sync"

7 "time"

10type SearchResult struct {

11 ID string

12 Score float32

13 Metadata map[string]interface{}

14 ShardID string

15}

17type ShardSearcher interface {

18 Search(ctx context.Context, vector []float32, topK int) ([]SearchResult, error)

19}

21func FanOutSearch(

22 ctx context.Context,

23 shards map[string]ShardSearcher,

24 vector []float32,

25 topK int,

26 timeout time.Duration,

27) ([]SearchResult, error) {

28 ctx, cancel := context.WithTimeout(ctx, timeout)

29 defer cancel()

31 var mu sync.Mutex

32 var allResults []SearchResult

33 var firstErr error

35 var wg sync.WaitGroup

36 for shardID, searcher := range shards {

37 wg.Add(1)

38 go func(id string, s ShardSearcher) {

39 defer wg.Done()

40 results, err := s.Search(ctx, vector, topK)

41 mu.Lock()

42 defer mu.Unlock()

43 if err != nil {

44 if firstErr == nil {

45 firstErr = err

46 }

47 return

48 }

49 for i := range results {

50 results[i].ShardID = id

51 }

52 allResults = append(allResults, results...)

53 }(shardID, searcher)

54 }

56 wg.Wait()

58 // If all shards failed, return error

59 if len(allResults) == 0 && firstErr != nil {

60 return nil, firstErr

61 }

63 // Merge: sort by score descending, take top K

64 sort.Slice(allResults, func(i, j int) bool {

65 return allResults[i].Score > allResults[j].Score

66 })

68 if len(allResults) > topK {

69 allResults = allResults[:topK]

70 }

72 return allResults, nil

73}

Index Rebuild Strategy

HNSW index rebuilds are expensive. At high scale, you need a strategy that avoids downtime:

python

1# Blue-green index rebuild pattern

2class IndexManager:

3 def __init__(self, client, base_name: str):

4 self.client = client

5 self.base_name = base_name

6 self.active_suffix = "blue"

8 @property

9 def active_collection(self) -> str:

10 return f"{self.base_name}_{self.active_suffix}"

12 @property

13 def standby_collection(self) -> str:

14 suffix = "green" if self.active_suffix == "blue" else "blue"

15 return f"{self.base_name}_{suffix}"

17 async def rebuild_index(

18 self,

19 vectors_source,

20 new_config: dict,

21 ):

22 standby = self.standby_collection

24 # Drop and recreate standby collection with new config

25 if self.client.collection_exists(standby):

26 self.client.delete_collection(standby)

27 self.client.create_collection(standby, **new_config)

29 # Bulk load into standby

30 batch = []

31 async for vector in vectors_source:

32 batch.append(vector)

33 if len(batch) >= 10000:

34 self.client.upsert(standby, batch)

35 batch = []

36 if batch:

37 self.client.upsert(standby, batch)

39 # Wait for indexing to complete

40 await self._wait_for_index_ready(standby)

42 # Swap active pointer

43 old_suffix = self.active_suffix

44 self.active_suffix = (

45 "green" if old_suffix == "blue" else "blue"

46 )

48 # Clean up old collection after traffic drains

49 await asyncio.sleep(30)

50 self.client.delete_collection(

51 f"{self.base_name}_{old_suffix}"

52 )

Anti-Patterns to Avoid

Loading Entire Index into RAM

At billion-scale, full in-memory indexes are cost-prohibitive. Use quantized vectors in RAM with full vectors on SSD for rescoring. This reduces memory by 4-8x while maintaining recall above 0.95.

Synchronous Write-then-Read

Vector indexes update asynchronously. Writes are not immediately searchable. Design your application to tolerate eventual consistency — typically 100ms to 2 seconds for newly inserted vectors to become queryable.

Unbounded Fan-Out

Querying all shards for every request doesn't scale past 50 shards. Implement query routing that narrows the shard set based on metadata filters or cluster assignment. For tenant-scoped queries, route directly to the tenant's shards.

Ignoring Cold Start Latency

When a shard starts or restarts, the first queries hit disk while the index loads into memory. Pre-warm shards by running synthetic queries during startup. Set readiness probes to only pass after warmup completes.

Monolithic Index Configuration

Different query patterns need different indexes. Keep a high-recall HNSW index for RAG queries alongside a faster IVF_PQ index for recommendation feeds. Route queries to the appropriate index based on the use case.

High-Scale Readiness Checklist

FAQ

Need expert help?

Building with agentic AI?

I help teams ship production-grade systems. From architecture review to hands-on builds.

Book a Free Call Send a Brief

vector-db embeddings similarity-search ai-infrastructure high-scale best-practices

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.

View Portfolio Book a Call

← Previous

Capacity Planning for High-Scale Deployments

Sharding Strategies

Hash-Based Sharding

Partition-by-Tenant Sharding

Quantization for Memory Efficiency

Product Quantization (PQ)

Scalar Quantization (SQ8)

Write Pipeline for High Throughput

Query Fan-Out and Result Merging

Index Rebuild Strategy

Anti-Patterns to Avoid

Loading Entire Index into RAM

Synchronous Write-then-Read

Unbounded Fan-Out

Ignoring Cold Start Latency

Monolithic Index Configuration

High-Scale Readiness Checklist

FAQ

Building with agentic AI?

Vector Database Architecture Best Practices for Enterprise Teams

Vector Database Architecture Best Practices for Startup Teams

Vector Database Architecture at Scale: Lessons from Production

Vector Database Architecture at Scale: Lessons from Production

Vector Database Architecture Best Practices for Enterprise Teams

Start aConversation.

Start a
Conversation.