
Vector Database Architecture at Scale: Lessons from Production

Real-world lessons from running a vector database architecture in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 9 min read

In mid-2024, our team at a Series B fintech company needed to build a document intelligence platform that could search across 12 million financial documents for our institutional clients. Each document was chunked into paragraphs, producing roughly 180 million vectors. This is the story of what we built, what broke, and what we'd do differently.

The Problem

Our clients — asset managers and compliance teams — needed to search across SEC filings, earnings transcripts, analyst reports, and internal memos using natural language queries. Keyword search returned too many irrelevant results. They needed semantic understanding: "companies discussing supply chain risks in Asia" should surface relevant paragraphs even if those exact words weren't used.

Requirements:

  • 180M vectors at 1536 dimensions
  • Multi-tenant: 45 institutional clients with strict data isolation
  • Query latency: p95 under 200ms
  • Ingestion: 500K new documents per week
  • Hybrid search: semantic + keyword for exact entity matches (ticker symbols, CUSIP numbers)

Architecture Decisions

Decision 1: Qdrant Over Pinecone

We evaluated Pinecone, Weaviate, Qdrant, and Milvus. Qdrant won for three reasons:

  1. Compliance: Our clients required data to stay in our VPC. Pinecone's managed service wasn't an option.
  2. Payload filtering: Qdrant's filtered search performance was consistently better than Weaviate's for our access-control patterns.
  3. Quantization: Built-in scalar quantization reduced our memory footprint by 4x without significant recall loss.
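The 4x figure follows directly from the storage arithmetic: scalar quantization replaces each 4-byte float32 component with a single int8 byte. A quick back-of-the-envelope check using the 180M x 1536-dimension numbers from the requirements above:

```python
# Back-of-the-envelope memory math for scalar quantization
n_vectors, dims = 180_000_000, 1536

full_precision_gb = n_vectors * dims * 4 / 1e9  # float32: 4 bytes/dim
quantized_gb = n_vectors * dims * 1 / 1e9       # int8: 1 byte/dim

print(f"full precision: {full_precision_gb:.0f} GB")  # ~1106 GB
print(f"int8 quantized: {quantized_gb:.0f} GB")       # ~276 GB
print(f"reduction: {full_precision_gb / quantized_gb:.0f}x")
```

Note that Qdrant still keeps the original vectors (on disk or in RAM) for optional rescoring; the saving applies to the hot in-memory working set.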

Decision 2: Collection-Per-Client

We chose one Qdrant collection per client rather than a shared collection with metadata filtering:

client_acme_capital     → 4.2M vectors
client_bridgewater      → 12.8M vectors
client_citadel_research → 8.1M vectors
...

This added operational complexity (managing 45 collections) but gave us:

  • Provable data isolation for compliance audits
  • Per-client index tuning (clients with larger corpora got different HNSW parameters)
  • Independent backup/restore per client
  • Clean data deletion when a client churns
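The per-client tuning reduces to a small lookup from corpus size to index build parameters. A minimal sketch of how that might look; the size thresholds and HNSW values here are illustrative, not our production settings:

```python
def hnsw_params_for(vector_count: int) -> dict:
    """Pick HNSW build parameters by corpus size (illustrative thresholds)."""
    if vector_count > 10_000_000:
        # Bigger graphs: more edges per node to hold recall at scale
        return {"m": 32, "ef_construct": 256}
    if vector_count > 1_000_000:
        return {"m": 16, "ef_construct": 128}
    return {"m": 8, "ef_construct": 64}

def collection_name(client_id: str) -> str:
    """One collection per client, named by a stable client identifier."""
    return f"client_{client_id}"

# e.g. a 12.8M-vector client gets the large-corpus profile
params = hnsw_params_for(12_800_000)
```

These values would then be passed to the collection-creation call when a client is onboarded, so each client's index is built with its own profile.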

Decision 3: Two-Stage Retrieval

Pure vector search at 180M scale was too slow for our latency target. We implemented a two-stage approach:

Stage 1: Coarse filter (metadata + quantized vectors) → 500 candidates
Stage 2: Rescore with full-precision vectors → top 20 results

# Two-stage retrieval implementation.
# Assumes qdrant_client (a QdrantClient instance), SearchResult, and
# build_qdrant_filter are defined elsewhere in the service.
from qdrant_client import models

async def search(
    client_id: str,
    query_embedding: list[float],
    filters: dict,
    top_k: int = 20,
) -> list[SearchResult]:
    collection = f"client_{client_id}"

    # Stage 1: Fast approximate search with quantized vectors
    candidates = qdrant_client.search(
        collection_name=collection,
        query_vector=query_embedding,
        limit=500,
        query_filter=build_qdrant_filter(filters),
        search_params=models.SearchParams(
            quantization=models.QuantizationSearchParams(
                ignore=False,
                rescore=False,  # Skip rescoring in stage 1
            ),
            hnsw_ef=64,  # Lower ef for speed
        ),
    )

    # Stage 2: Rescore top candidates with full-precision vectors
    candidate_ids = [c.id for c in candidates]
    rescored = qdrant_client.search(
        collection_name=collection,
        query_vector=query_embedding,
        limit=top_k,
        query_filter=models.Filter(
            must=[models.HasIdCondition(has_id=candidate_ids)]
        ),
        search_params=models.SearchParams(
            quantization=models.QuantizationSearchParams(
                ignore=True,  # Use full vectors
            ),
            exact=True,  # Exact search within candidates
        ),
    )

    return [
        SearchResult(
            id=r.id,
            score=r.score,
            content=r.payload["content"],
            document_title=r.payload["document_title"],
            document_date=r.payload["document_date"],
        )
        for r in rescored
    ]

Infrastructure Layout

We ran Qdrant on Kubernetes across three availability zones:

3x r6i.8xlarge (256GB RAM, 32 vCPU) — Qdrant nodes
  - Replication factor: 2 (each vector stored on 2 nodes)
  - Sharding: automatic, 6 shards per collection

2x c6i.4xlarge (32GB RAM, 16 vCPU) — Embedding workers
  - Process ingestion queue
  - OpenAI API calls with batching

1x r6i.2xlarge (64GB RAM, 8 vCPU) — API gateway
  - Query routing, auth, rate limiting

Monthly cost: roughly $8,200 for compute + $1,400 for EBS storage + $2,100 for embedding API calls = $11,700/month serving 45 clients.


What Broke in Production

Incident 1: Index Rebuild Storm

Three months in, we upgraded Qdrant and needed to rebuild indexes for the new version. We naively triggered rebuilds across all 45 collections simultaneously. Every node's CPU pegged at 100%, query latency spiked to 15 seconds, and our monitoring alerted every client's SLA threshold.

Fix: We built a sequential rebuild controller that processes one collection at a time, with a 10-minute cooldown between rebuilds. Total rebuild time went from 2 hours (parallel, broken) to 18 hours (sequential, stable).
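The controller is conceptually just a serialized loop with a cooldown between iterations. A stripped-down sketch with the rebuild and status-polling hooks injected as functions (the function names are illustrative):

```python
import time

def rebuild_sequentially(collections, rebuild_fn, wait_fn,
                         cooldown_s=600, sleep=time.sleep):
    """Rebuild one collection at a time, pausing between each.

    rebuild_fn(name) triggers the rebuild; wait_fn(name) blocks until
    the collection's index status is green again. The 10-minute default
    cooldown lets query latency recover before the next rebuild starts.
    """
    done = []
    for name in collections:
        rebuild_fn(name)
        wait_fn(name)  # never start the next rebuild early
        done.append(name)
        if name != collections[-1]:
            sleep(cooldown_s)
    return done
```

Injecting `sleep` and the two hooks keeps the controller testable without a live cluster; in production they would wrap the Qdrant admin calls.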

Incident 2: Embedding Pipeline Backlog

A client uploaded 50,000 documents in a single batch. Our embedding workers consumed their OpenAI rate limit within minutes, and the queue backed up to 200,000 pending chunks. New documents from other clients weren't being processed.

Fix: Per-client rate limiting on the ingestion queue. Each client gets a maximum of 1,000 embedding API calls per minute. Excess work is queued with fair scheduling across clients.
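A per-client token bucket drained round-robin is one simple way to get both properties at once. A sketch under those assumptions; the 1,000 calls/minute comes from the fix above, while the class and method names are illustrative:

```python
import time
from collections import defaultdict, deque

class FairIngestQueue:
    """Per-client token buckets with round-robin fair scheduling."""

    def __init__(self, rate_per_min=1000, now=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.tokens = defaultdict(float)
        self.last = {}
        self.queues = defaultdict(deque)
        self.now = now

    def submit(self, client_id, chunk):
        self.queues[client_id].append(chunk)

    def _refill(self, client_id):
        t = self.now()
        elapsed = t - self.last.get(client_id, t)
        self.last[client_id] = t
        # Burst cap: at most one minute's allowance banked per client
        self.tokens[client_id] = min(
            1000.0, self.tokens[client_id] + elapsed * self.rate
        )

    def next_batch(self):
        """Round-robin over clients, skipping any that are out of tokens."""
        batch = []
        for client_id in list(self.queues):
            self._refill(client_id)
            if self.queues[client_id] and self.tokens[client_id] >= 1:
                self.tokens[client_id] -= 1
                batch.append((client_id, self.queues[client_id].popleft()))
        return batch
```

A 50,000-document dump from one client now only drains that client's bucket; everyone else keeps flowing at their own rate.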

Incident 3: Memory Pressure from Payload Storage

We stored the full paragraph text in Qdrant's payload. For our largest client (12.8M vectors), payloads consumed 40GB of RAM — more than the vectors themselves. This pushed the node into swap, destroying query performance.

Fix: Moved full text to PostgreSQL and stored only a text hash in Qdrant's payload. On query, we fetch the full text from PostgreSQL using the returned IDs. This reduced Qdrant's memory usage by 60%.
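The pattern reduces to: keep only a stable key in the vector store and join on it at query time. A toy sketch with a dict standing in for the PostgreSQL table (the payload field names here are illustrative):

```python
import hashlib

text_store = {}  # stands in for a PostgreSQL table keyed by content hash

def make_payload(text, title, date):
    """Store full text externally; keep only a hash in the vector payload."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    text_store[digest] = text
    return {
        "content_hash": digest,
        "document_title": title,
        "document_date": date,
    }

def hydrate(results):
    """After search, re-attach full text by hash lookup.

    In production this is a single SELECT ... WHERE hash IN (...) per query.
    """
    return [{**r, "content": text_store[r["content_hash"]]} for r in results]
```

The extra round-trip to PostgreSQL is cheap next to the RAM it frees: the payload shrinks from a full paragraph to a 64-character hash.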

Measurable Results

After six months in production:

Metric                  Target            Actual
Query latency (p50)     < 100ms           42ms
Query latency (p95)     < 200ms           128ms
Query latency (p99)     < 500ms           287ms
Recall@20               > 0.90            0.94
Ingestion throughput    500K docs/week    720K docs/week
Uptime                  99.9%             99.95%
Cost per client         < $500/month      $260/month

User feedback was strongly positive. Compliance teams reported finding relevant documents in seconds instead of hours. The semantic search caught contextual references that keyword search missed entirely — particularly valuable for risk assessment queries.

What We'd Do Differently

Use Hybrid Search from Day One

We launched with pure vector search and added BM25 keyword search three months later when users complained about missing exact ticker symbol matches. The retrofit required re-indexing all 180M vectors with text payloads for BM25. If we'd planned for hybrid search from the start, we would have avoided two weeks of re-indexing downtime.
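For completeness, one common way to merge a BM25 ranking with a vector ranking is reciprocal rank fusion; the article doesn't specify which fusion method we used, so treat this as an illustrative sketch:

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=20):
    """Merge ranked ID lists: each ID scores sum(1 / (k + rank)).

    IDs ranked highly in both the vector and keyword lists accumulate
    the largest scores; k damps the influence of any single list.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# d2 is near the top of both lists, so it wins the fused ranking
merged = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
```

Rank-based fusion like this sidesteps the score-calibration problem: BM25 scores and cosine similarities live on incomparable scales, but ranks always compose.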

Invest in Evaluation Earlier

We didn't build a systematic evaluation framework until month four. Before that, we relied on qualitative feedback from users. Building a test set of 200 query/expected-result pairs and running weekly recall benchmarks would have caught our chunking quality issues two months sooner.
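The harness itself is small once the labeled pairs exist. A sketch of recall@k over such a test set, with the search wiring injected as a function:

```python
def recall_at_k(eval_set, search_fn, k=20):
    """Mean recall@k over a labeled query set.

    eval_set: list of (query, expected_ids) pairs.
    search_fn(query, k): returns the top-k retrieved ids for a query.
    """
    total = 0.0
    for query, expected in eval_set:
        retrieved = set(search_fn(query, k))
        total += len(retrieved & set(expected)) / len(expected)
    return total / len(eval_set)
```

Run weekly against a frozen 200-pair set, a single number like this turns "search feels worse" into a regression you can bisect.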

Start with Smaller Embedding Dimensions

We used 1536-dimension embeddings from the start. Testing later showed that 768 dimensions (using OpenAI's text-embedding-3-small with the dimensions parameter) achieved 0.92 recall versus 0.94 at 1536 — a negligible difference for our use case. At 768 dimensions, our memory footprint and costs would have been halved.
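Lower dimensions work with the text-embedding-3 models because they are trained so that a prefix of the vector is itself a usable embedding; per OpenAI's documentation, shortening is just truncate-and-renormalize, which is what the dimensions parameter does server-side. A client-side sketch of the same operation:

```python
import math

def shorten_embedding(vec, dims=768):
    """Truncate an embedding to its first `dims` components and renormalize.

    Equivalent to requesting dimensions=768 from the embeddings API
    for text-embedding-3 models; the result is unit-length again, so
    cosine similarity still behaves as expected.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

This also means an existing 1536-d corpus can be evaluated at 768-d offline, without re-embedding, before committing to the smaller size.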

Separate Read and Write Paths

Our initial architecture used the same Qdrant nodes for both queries and ingestion. During bulk ingestion, query latency degraded noticeably. Adding dedicated ingestion nodes that replicate to query nodes would have eliminated this interference.

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
