
Vector Database Architecture at Scale: Lessons from Production

Real-world lessons from running a vector database architecture in production, including architecture decisions, measurable results, and honest retrospectives.

Muneer Puthiya Purayil · 9 min read

In mid-2024, our team at a Series B fintech company needed to build a document intelligence platform that could search across 12 million financial documents for our institutional clients. Each document was chunked into paragraphs, producing roughly 180 million vectors. This is the story of what we built, what broke, and what we'd do differently.

The Problem

Our clients — asset managers and compliance teams — needed to search across SEC filings, earnings transcripts, analyst reports, and internal memos using natural language queries. Keyword search returned too many irrelevant results. They needed semantic understanding: "companies discussing supply chain risks in Asia" should surface relevant paragraphs even if those exact words weren't used.

Requirements:

  • 180M vectors at 1536 dimensions
  • Multi-tenant: 45 institutional clients with strict data isolation
  • Query latency: p95 under 200ms
  • Ingestion: 500K new documents per week
  • Hybrid search: semantic + keyword for exact entity matches (ticker symbols, CUSIP numbers)

Architecture Decisions

Decision 1: Qdrant Over Pinecone

We evaluated Pinecone, Weaviate, Qdrant, and Milvus. Qdrant won for three reasons:

  1. Compliance: Our clients required data to stay in our VPC. Pinecone's managed service wasn't an option.
  2. Payload filtering: Qdrant's filtered search performance was consistently better than Weaviate's for our access-control patterns.
  3. Quantization: Built-in scalar quantization reduced our memory footprint by 4x without significant recall loss.
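The 4x figure follows directly from the storage arithmetic: scalar quantization replaces each 4-byte float32 component with a single int8 byte. A quick back-of-the-envelope check using the 180M x 1536-dimension numbers from the requirements above:

```python
# Back-of-the-envelope memory math for scalar quantization
n_vectors, dims = 180_000_000, 1536

full_precision_gb = n_vectors * dims * 4 / 1e9  # float32: 4 bytes/dim
quantized_gb = n_vectors * dims * 1 / 1e9       # int8: 1 byte/dim

print(f"full precision: {full_precision_gb:.0f} GB")  # ~1106 GB
print(f"int8 quantized: {quantized_gb:.0f} GB")       # ~276 GB
print(f"reduction: {full_precision_gb / quantized_gb:.0f}x")
```

Note that Qdrant still keeps the original vectors (on disk or in RAM) for optional rescoring; the saving applies to the hot in-memory working set.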

Decision 2: Collection-Per-Client

We chose one Qdrant collection per client rather than a shared collection with metadata filtering:

client_acme_capital     → 4.2M vectors
client_bridgewater      → 12.8M vectors
client_citadel_research → 8.1M vectors
...

This added operational complexity (managing 45 collections) but gave us:

  • Provable data isolation for compliance audits
  • Per-client index tuning (clients with larger corpora got different HNSW parameters)
  • Independent backup/restore per client
  • Clean data deletion when a client churns
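The per-client tuning reduces to a small lookup from corpus size to index build parameters. A minimal sketch of how that might look; the size thresholds and HNSW values here are illustrative, not our production settings:

```python
def hnsw_params_for(vector_count: int) -> dict:
    """Pick HNSW build parameters by corpus size (illustrative thresholds)."""
    if vector_count > 10_000_000:
        # Bigger graphs: more edges per node to hold recall at scale
        return {"m": 32, "ef_construct": 256}
    if vector_count > 1_000_000:
        return {"m": 16, "ef_construct": 128}
    return {"m": 8, "ef_construct": 64}

def collection_name(client_id: str) -> str:
    """One collection per client, named by a stable client identifier."""
    return f"client_{client_id}"

# e.g. a 12.8M-vector client gets the large-corpus profile
params = hnsw_params_for(12_800_000)
```

These values would then be passed to the collection-creation call when a client is onboarded, so each client's index is built with its own profile.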

Decision 3: Two-Stage Retrieval

Pure vector search at 180M scale was too slow for our latency target. We implemented a two-stage approach:

Stage 1: Coarse filter (metadata + quantized vectors) → 500 candidates
Stage 2: Rescore with full-precision vectors → top 20 results

# Two-stage retrieval implementation.
# Assumes qdrant_client (a QdrantClient instance), SearchResult, and
# build_qdrant_filter are defined elsewhere in the service.
from qdrant_client import models

async def search(
    client_id: str,
    query_embedding: list[float],
    filters: dict,
    top_k: int = 20,
) -> list[SearchResult]:
    collection = f"client_{client_id}"

    # Stage 1: Fast approximate search with quantized vectors
    candidates = qdrant_client.search(
        collection_name=collection,
        query_vector=query_embedding,
        limit=500,
        query_filter=build_qdrant_filter(filters),
        search_params=models.SearchParams(
            quantization=models.QuantizationSearchParams(
                ignore=False,
                rescore=False,  # Skip rescoring in stage 1
            ),
            hnsw_ef=64,  # Lower ef for speed
        ),
    )

    # Stage 2: Rescore top candidates with full-precision vectors
    candidate_ids = [c.id for c in candidates]
    rescored = qdrant_client.search(
        collection_name=collection,
        query_vector=query_embedding,
        limit=top_k,
        query_filter=models.Filter(
            must=[models.HasIdCondition(has_id=candidate_ids)]
        ),
        search_params=models.SearchParams(
            quantization=models.QuantizationSearchParams(
                ignore=True,  # Use full vectors
            ),
            exact=True,  # Exact search within candidates
        ),
    )

    return [
        SearchResult(
            id=r.id,
            score=r.score,
            content=r.payload["content"],
            document_title=r.payload["document_title"],
            document_date=r.payload["document_date"],
        )
        for r in rescored
    ]

Infrastructure Layout

We ran Qdrant on Kubernetes across three availability zones:

3x r6i.8xlarge (256GB RAM, 32 vCPU) — Qdrant nodes
  - Replication factor: 2 (each vector stored on 2 nodes)
  - Sharding: automatic, 6 shards per collection

2x c6i.4xlarge (32GB RAM, 16 vCPU) — Embedding workers
  - Process ingestion queue
  - OpenAI API calls with batching

1x r6i.2xlarge (64GB RAM, 8 vCPU) — API gateway
  - Query routing, auth, rate limiting

Monthly cost: roughly $8,200 for compute + $1,400 for EBS storage + $2,100 for embedding API calls = $11,700/month serving 45 clients.


What Broke in Production

Incident 1: Index Rebuild Storm

Three months in, we upgraded Qdrant and needed to rebuild indexes for the new version. We naively triggered rebuilds across all 45 collections simultaneously. Every node's CPU pegged at 100%, query latency spiked to 15 seconds, and our monitoring alerted every client's SLA threshold.

Fix: We built a sequential rebuild controller that processes one collection at a time, with a 10-minute cooldown between rebuilds. Total rebuild time went from 2 hours (parallel, broken) to 18 hours (sequential, stable).
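The controller is conceptually just a serialized loop with a cooldown between iterations. A stripped-down sketch with the rebuild and status-polling hooks injected as functions (the function names are illustrative):

```python
import time

def rebuild_sequentially(collections, rebuild_fn, wait_fn,
                         cooldown_s=600, sleep=time.sleep):
    """Rebuild one collection at a time, pausing between each.

    rebuild_fn(name) triggers the rebuild; wait_fn(name) blocks until
    the collection's index status is green again. The 10-minute default
    cooldown lets query latency recover before the next rebuild starts.
    """
    done = []
    for name in collections:
        rebuild_fn(name)
        wait_fn(name)  # never start the next rebuild early
        done.append(name)
        if name != collections[-1]:
            sleep(cooldown_s)
    return done
```

Injecting `sleep` and the two hooks keeps the controller testable without a live cluster; in production they would wrap the Qdrant admin calls.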

Incident 2: Embedding Pipeline Backlog

A client uploaded 50,000 documents in a single batch. Our embedding workers consumed their OpenAI rate limit within minutes, and the queue backed up to 200,000 pending chunks. New documents from other clients weren't being processed.

Fix: Per-client rate limiting on the ingestion queue. Each client gets a maximum of 1,000 embedding API calls per minute. Excess work is queued with fair scheduling across clients.
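A per-client token bucket drained round-robin is one simple way to get both properties at once. A sketch under those assumptions; the 1,000 calls/minute comes from the fix above, while the class and method names are illustrative:

```python
import time
from collections import defaultdict, deque

class FairIngestQueue:
    """Per-client token buckets with round-robin fair scheduling."""

    def __init__(self, rate_per_min=1000, now=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.tokens = defaultdict(float)
        self.last = {}
        self.queues = defaultdict(deque)
        self.now = now

    def submit(self, client_id, chunk):
        self.queues[client_id].append(chunk)

    def _refill(self, client_id):
        t = self.now()
        elapsed = t - self.last.get(client_id, t)
        self.last[client_id] = t
        # Burst cap: at most one minute's allowance banked per client
        self.tokens[client_id] = min(
            1000.0, self.tokens[client_id] + elapsed * self.rate
        )

    def next_batch(self):
        """Round-robin over clients, skipping any that are out of tokens."""
        batch = []
        for client_id in list(self.queues):
            self._refill(client_id)
            if self.queues[client_id] and self.tokens[client_id] >= 1:
                self.tokens[client_id] -= 1
                batch.append((client_id, self.queues[client_id].popleft()))
        return batch
```

A 50,000-document dump from one client now only drains that client's bucket; everyone else keeps flowing at their own rate.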

Incident 3: Memory Pressure from Payload Storage

We stored the full paragraph text in Qdrant's payload. For our largest client (12.8M vectors), payloads consumed 40GB of RAM — more than the vectors themselves. This pushed the node into swap, destroying query performance.

Fix: Moved full text to PostgreSQL and stored only a text hash in Qdrant's payload. On query, we fetch the full text from PostgreSQL using the returned IDs. This reduced Qdrant's memory usage by 60%.
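The pattern reduces to: keep only a stable key in the vector store and join on it at query time. A toy sketch with a dict standing in for the PostgreSQL table (the payload field names here are illustrative):

```python
import hashlib

text_store = {}  # stands in for a PostgreSQL table keyed by content hash

def make_payload(text, title, date):
    """Store full text externally; keep only a hash in the vector payload."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    text_store[digest] = text
    return {
        "content_hash": digest,
        "document_title": title,
        "document_date": date,
    }

def hydrate(results):
    """After search, re-attach full text by hash lookup.

    In production this is a single SELECT ... WHERE hash IN (...) per query.
    """
    return [{**r, "content": text_store[r["content_hash"]]} for r in results]
```

The extra round-trip to PostgreSQL is cheap next to the RAM it frees: the payload shrinks from a full paragraph to a 64-character hash.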

Measurable Results

After six months in production:

Metric                  Target            Actual
Query latency (p50)     < 100ms           42ms
Query latency (p95)     < 200ms           128ms
Query latency (p99)     < 500ms           287ms
Recall@20               > 0.90            0.94
Ingestion throughput    500K docs/week    720K docs/week
Uptime                  99.9%             99.95%
Cost per client         < $500/month      $260/month

User feedback was strongly positive. Compliance teams reported finding relevant documents in seconds instead of hours. The semantic search caught contextual references that keyword search missed entirely — particularly valuable for risk assessment queries.

What We'd Do Differently

Use Hybrid Search from Day One

We launched with pure vector search and added BM25 keyword search three months later when users complained about missing exact ticker symbol matches. The retrofit required re-indexing all 180M vectors with text payloads for BM25. If we'd planned for hybrid search from the start, we would have avoided two weeks of re-indexing downtime.
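For completeness, one common way to merge a BM25 ranking with a vector ranking is reciprocal rank fusion; the article doesn't specify which fusion method we used, so treat this as an illustrative sketch:

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=20):
    """Merge ranked ID lists: each ID scores sum(1 / (k + rank)).

    IDs ranked highly in both the vector and keyword lists accumulate
    the largest scores; k damps the influence of any single list.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# d2 is near the top of both lists, so it wins the fused ranking
merged = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
```

Rank-based fusion like this sidesteps the score-calibration problem: BM25 scores and cosine similarities live on incomparable scales, but ranks always compose.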

Invest in Evaluation Earlier

We didn't build a systematic evaluation framework until month four. Before that, we relied on qualitative feedback from users. Building a test set of 200 query/expected-result pairs and running weekly recall benchmarks would have caught our chunking quality issues two months sooner.
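The harness itself is small once the labeled pairs exist. A sketch of recall@k over such a test set, with the search wiring injected as a function:

```python
def recall_at_k(eval_set, search_fn, k=20):
    """Mean recall@k over a labeled query set.

    eval_set: list of (query, expected_ids) pairs.
    search_fn(query, k): returns the top-k retrieved ids for a query.
    """
    total = 0.0
    for query, expected in eval_set:
        retrieved = set(search_fn(query, k))
        total += len(retrieved & set(expected)) / len(expected)
    return total / len(eval_set)
```

Run weekly against a frozen 200-pair set, a single number like this turns "search feels worse" into a regression you can bisect.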

Start with Smaller Embedding Dimensions

We used 1536-dimension embeddings from the start. Testing later showed that 768 dimensions (using OpenAI's text-embedding-3-small with the dimensions parameter) achieved 0.92 recall versus 0.94 at 1536 — a negligible difference for our use case. At 768 dimensions, our memory footprint and costs would have been halved.
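Lower dimensions work with the text-embedding-3 models because they are trained so that a prefix of the vector is itself a usable embedding; per OpenAI's documentation, shortening is just truncate-and-renormalize, which is what the dimensions parameter does server-side. A client-side sketch of the same operation:

```python
import math

def shorten_embedding(vec, dims=768):
    """Truncate an embedding to its first `dims` components and renormalize.

    Equivalent to requesting dimensions=768 from the embeddings API
    for text-embedding-3 models; the result is unit-length again, so
    cosine similarity still behaves as expected.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

This also means an existing 1536-d corpus can be evaluated at 768-d offline, without re-embedding, before committing to the smaller size.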

Separate Read and Write Paths

Our initial architecture used the same Qdrant nodes for both queries and ingestion. During bulk ingestion, query latency degraded noticeably. Adding dedicated ingestion nodes that replicate to query nodes would have eliminated this interference.

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
