AI Architecture

RAG Pipeline Design Best Practices for High Scale Teams

Battle-tested best practices for RAG Pipeline Design tailored to High Scale teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 16 min read

High-scale RAG pipelines serving millions of queries daily face challenges that enterprise deployments at moderate scale never encounter: vector database sharding, embedding computation at throughput limits, cache invalidation across distributed retrieval nodes, and maintaining retrieval quality while optimizing for latency at the p99. These best practices address the engineering problems that emerge above 10,000 queries per hour.

Vector Database Scaling

Sharding Strategy

At high scale, a single vector database instance becomes a bottleneck. Shard your vector collections by document domain:

```python
import asyncio
from enum import Enum

class ShardKey(Enum):
    ENGINEERING = "engineering"
    LEGAL = "legal"
    FINANCE = "finance"
    PRODUCT = "product"
    SUPPORT = "support"

class ShardedVectorStore:
    def __init__(self, shard_configs: dict[ShardKey, VectorStoreConfig]):
        self.shards = {
            key: VectorStore(config) for key, config in shard_configs.items()
        }
        self.router = ShardRouter()

    async def query(self, query: str, embedding: list[float],
                    target_shards: list[ShardKey] | None = None,
                    top_k: int = 10) -> list[dict]:
        shards = target_shards or self.router.route(query)

        # Query relevant shards in parallel
        results = await asyncio.gather(*[
            self.shards[shard].query(embedding, top_k=top_k)
            for shard in shards
        ])

        # Merge and re-rank across shards
        all_results = [r for shard_results in results for r in shard_results]
        all_results.sort(key=lambda x: x["score"], reverse=True)
        return all_results[:top_k]

class ShardRouter:
    def __init__(self):
        self.classifier = None  # Lightweight query classifier

    def route(self, query: str) -> list[ShardKey]:
        """Route queries to relevant shards based on content."""
        # Use a fast classifier or keyword-based routing
        keywords = set(query.lower().split())
        relevant = set()

        keyword_map = {
            ShardKey.ENGINEERING: {"api", "code", "deploy", "bug", "architecture"},
            ShardKey.LEGAL: {"contract", "compliance", "regulation", "policy", "terms"},
            ShardKey.FINANCE: {"revenue", "budget", "forecast", "quarterly", "expense"},
            ShardKey.PRODUCT: {"feature", "roadmap", "user", "feedback", "release"},
            ShardKey.SUPPORT: {"ticket", "issue", "error", "help", "troubleshoot"},
        }

        for shard, shard_keywords in keyword_map.items():
            if shard_keywords & keywords:
                relevant.add(shard)

        # Fall back to fanning out to all shards when routing is ambiguous
        return list(relevant) if relevant else list(ShardKey)
```

Index Optimization

At high query volume, index configuration directly impacts latency:

```python
# Qdrant collection configuration for high throughput
COLLECTION_CONFIG = {
    "vectors": {
        "size": 1536,
        "distance": "Cosine",
        "on_disk": False,  # Keep vectors in memory for < 10ms queries
    },
    "optimizers_config": {
        "indexing_threshold": 20000,  # Reindex after 20K updates
        "memmap_threshold": 50000,
    },
    "hnsw_config": {
        "m": 32,  # Higher = better recall, more memory
        "ef_construct": 200,  # Higher = better index quality, slower builds
        "on_disk": False,
    },
    "quantization_config": {
        "scalar": {
            "type": "int8",
            "quantile": 0.99,
            "always_ram": True,
        }
    },
}
```

Int8 scalar quantization reduces vector memory usage by 4x with less than 1% recall degradation. At 10M 1536-dimensional vectors per replica, that is the difference between needing roughly 61GB and 15GB of RAM for vector storage.
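The savings follow directly from bytes per dimension. A back-of-envelope sketch (illustrative only; a real vector store adds index and metadata overhead on top):

```python
def vector_memory_gb(num_vectors: int, dims: int, bytes_per_dim: int) -> float:
    """Raw vector storage in GB (1 GB = 1e9 bytes), ignoring index overhead."""
    return num_vectors * dims * bytes_per_dim / 1e9

n, dims = 10_000_000, 1536
float32_gb = vector_memory_gb(n, dims, 4)  # raw float32 vectors
int8_gb = vector_memory_gb(n, dims, 1)     # int8 scalar quantization
print(f"float32: {float32_gb:.1f} GB, int8: {int8_gb:.1f} GB")
# → float32: 61.4 GB, int8: 15.4 GB
```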

Embedding Pipeline at Scale

Batch Processing with Backpressure

```python
import asyncio
from collections import deque

class EmbeddingPipeline:
    def __init__(self, embed_fn, batch_size: int = 64, max_concurrent: int = 4):
        self.embed_fn = embed_fn
        self.batch_size = batch_size
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.queue: deque = deque()
        self.metrics = EmbeddingMetrics()

    async def _embed_one_batch(self, batch: list[str]) -> list[list[float]]:
        # The semaphore applies backpressure: at most max_concurrent
        # batches are in flight against the embedding API at once
        async with self.semaphore:
            start = asyncio.get_event_loop().time()
            embeddings = await self.embed_fn(batch)
            elapsed = asyncio.get_event_loop().time() - start

            self.metrics.record(
                batch_size=len(batch),
                latency_ms=elapsed * 1000,
            )
            return embeddings

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        batches = [
            texts[i:i + self.batch_size]
            for i in range(0, len(texts), self.batch_size)
        ]

        # Run batches concurrently; gather preserves input order
        batch_results = await asyncio.gather(*[
            self._embed_one_batch(batch) for batch in batches
        ])
        return [emb for batch in batch_results for emb in batch]
```

Embedding Cache

At high scale, many queries repeat or are semantically similar. Cache embeddings aggressively:

```python
import hashlib
import json

from redis import asyncio as aioredis

class EmbeddingCache:
    def __init__(self, redis: aioredis.Redis, ttl: int = 86400):
        self.redis = redis
        self.ttl = ttl
        self.hits = 0
        self.misses = 0

    def _key(self, text: str, model: str) -> str:
        h = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
        return f"emb:{h}"

    async def get(self, text: str, model: str) -> list[float] | None:
        key = self._key(text, model)
        data = await self.redis.get(key)
        if data:
            self.hits += 1
            return json.loads(data)
        self.misses += 1
        return None

    async def set(self, text: str, model: str, embedding: list[float]) -> None:
        key = self._key(text, model)
        await self.redis.setex(key, self.ttl, json.dumps(embedding))

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
```

Retrieval Optimization

Two-Stage Retrieval with Cross-Encoder Re-Ranking

Fast retrieval (ANN search) gets candidates. Slow but accurate re-ranking (cross-encoder) sorts them:

```python
import asyncio

from sentence_transformers import CrossEncoder

class TwoStageRetriever:
    def __init__(self, vector_store,
                 cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.vector_store = vector_store
        self.reranker = CrossEncoder(cross_encoder_model)

    async def retrieve(self, query: str, embedding: list[float],
                       candidates: int = 50, final_k: int = 5) -> list[dict]:
        # Stage 1: Fast ANN retrieval
        results = await self.vector_store.query(embedding, top_k=candidates)

        # Stage 2: Cross-encoder re-ranking (CPU-bound, so run it in a
        # worker thread rather than blocking the event loop)
        pairs = [(query, r["text"]) for r in results]
        scores = await asyncio.to_thread(self.reranker.predict, pairs)

        for result, score in zip(results, scores):
            result["rerank_score"] = float(score)

        results.sort(key=lambda x: x["rerank_score"], reverse=True)
        return results[:final_k]
```

Result Caching with Semantic Deduplication

```python
import hashlib
import json
import time

class SemanticCache:
    def __init__(self, vector_store, similarity_threshold: float = 0.95):
        self.cache_store = vector_store
        self.threshold = similarity_threshold

    async def get(self, query_embedding: list[float]) -> dict | None:
        results = await self.cache_store.query(query_embedding, top_k=1)
        if results and results[0]["score"] >= self.threshold:
            return json.loads(results[0]["metadata"]["response"])
        return None

    async def set(self, query_embedding: list[float], query: str, response: dict) -> None:
        await self.cache_store.upsert(
            ids=[hashlib.sha256(query.encode()).hexdigest()],
            embeddings=[query_embedding],
            metadata=[{"query": query, "response": json.dumps(response), "timestamp": time.time()}],
        )
```
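Wiring the semantic cache into the request path typically follows a cache-aside pattern. A minimal sketch, assuming a `SemanticCache` like the one above plus retriever and generation callables you would supply (`retriever` and `generate_fn` are placeholder names):

```python
async def answer_with_cache(semantic_cache, retriever, generate_fn,
                            query: str, embedding: list[float]) -> dict:
    # Cache-aside: serve a stored response for a near-duplicate query;
    # otherwise retrieve, generate, and populate the cache for next time
    cached = await semantic_cache.get(embedding)
    if cached is not None:
        return cached
    docs = await retriever.retrieve(query, embedding)
    response = await generate_fn(query, docs)
    await semantic_cache.set(embedding, query, response)
    return response
```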


Monitoring and Alerting

Key Metrics for High-Scale RAG

```python
from prometheus_client import Histogram, Counter, Gauge

retrieval_latency = Histogram(
    "rag_retrieval_latency_seconds",
    "Retrieval latency",
    ["stage"],  # "ann", "rerank", "total"
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

embedding_latency = Histogram(
    "rag_embedding_latency_seconds",
    "Embedding computation latency",
    ["model"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25],
)

retrieval_empty_results = Counter(
    "rag_empty_results_total",
    "Queries returning zero relevant results",
)

cache_hit_rate = Gauge(
    "rag_cache_hit_rate",
    "Embedding and result cache hit rate",
    ["cache_type"],
)

generation_tokens = Counter(
    "rag_generation_tokens_total",
    "Total tokens used in LLM generation",
    ["model", "type"],  # type: "input" or "output"
)
```

Alert on:

  • Retrieval p99 latency > 500ms
  • Empty result rate > 15% (indicates chunking or embedding quality degradation)
  • Embedding cache hit rate < 20% (indicates query diversity change or cache misconfiguration)
  • Generation token cost per query > budget threshold

Checklist

  • Vector database sharded by document domain
  • Int8 quantization for memory efficiency
  • Embedding batch pipeline with backpressure
  • Embedding cache with > 30% hit rate target
  • Two-stage retrieval (ANN + cross-encoder re-ranking)
  • Semantic result cache for repeated queries
  • Horizontal scaling of retrieval nodes behind load balancer
  • Monitoring: retrieval latency, empty result rate, token cost
  • Graceful degradation when vector DB is unavailable
  • Load testing at 2x projected peak QPS
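The graceful-degradation item from the checklist can be sketched as a fallback wrapper. A minimal version, assuming your vector store and a lexical fallback index expose the interfaces shown (both are placeholder names, not a specific library):

```python
import asyncio

class DegradableRetriever:
    """Fall back to lexical search when the vector DB is slow or down."""

    def __init__(self, vector_store, keyword_index, timeout_s: float = 0.5):
        self.vector_store = vector_store
        self.keyword_index = keyword_index
        self.timeout_s = timeout_s

    async def retrieve(self, query: str, embedding: list[float],
                       top_k: int = 10) -> list[dict]:
        try:
            # Primary path: ANN search with a hard timeout
            return await asyncio.wait_for(
                self.vector_store.query(embedding, top_k=top_k),
                timeout=self.timeout_s,
            )
        except (asyncio.TimeoutError, ConnectionError):
            # Degraded path: keyword search returns lower-quality results
            # but keeps answers flowing instead of failing the request
            return await self.keyword_index.search(query, top_k=top_k)
```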

Anti-Patterns to Avoid

Over-sharding: More shards means more fan-out queries and merge overhead. Start with 3-5 domain shards and split further only when a single shard exceeds performance targets.

Skipping re-ranking for latency: ANN search alone has 70-80% precision. Cross-encoder re-ranking pushes it to 90-95%. The 30-50ms latency cost of re-ranking is almost always worth the quality improvement.

Caching at the wrong layer: Cache embeddings and final responses, not intermediate retrieval results. Intermediate results change with index updates, but query embeddings and well-formed responses are stable.

Synchronous embedding in the request path: At high scale, embedding computation should be async with pre-computed query embeddings for common queries. Batch embed during off-peak hours and cache aggressively.
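One way to implement that off-peak pre-computation is a cache pre-warming job. A hedged sketch, assuming `cache` and `pipeline` match the `get`/`set` and `embed_batch` interfaces sketched earlier; in practice `queries` would come from your query logs:

```python
async def prewarm_embedding_cache(cache, pipeline, queries: list[str],
                                  model: str) -> int:
    """Embed and cache any frequent queries not yet cached.

    Returns the number of embeddings computed.
    """
    # Only pay for embeddings the cache doesn't already hold
    missing = [q for q in queries if await cache.get(q, model) is None]
    if not missing:
        return 0
    embeddings = await pipeline.embed_batch(missing)
    for text, emb in zip(missing, embeddings):
        await cache.set(text, model, emb)
    return len(missing)
```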

Conclusion

High-scale RAG pipelines are distributed systems first and ML systems second. The core engineering challenges — sharding, caching, backpressure, and monitoring — are the same problems that arise in any high-throughput data pipeline. The ML-specific concerns (embedding quality, chunking strategy, re-ranking) layer on top of solid distributed systems foundations.

Invest in the embedding cache and semantic result cache early. At high query volumes, cache hit rates above 40% reduce both latency and embedding API costs dramatically. Pair this with two-stage retrieval and continuous quality monitoring, and the pipeline scales from thousands to millions of queries per day without architectural changes.
