In mid-2024, our team at a Series B fintech company needed to build a document intelligence platform that could search across 12 million financial documents for our institutional clients. Each document was chunked into paragraphs, producing roughly 180 million vectors. This is the story of what we built, what broke, and what we'd do differently.
The Problem
Our clients — asset managers and compliance teams — needed to search across SEC filings, earnings transcripts, analyst reports, and internal memos using natural language queries. Keyword search returned too many irrelevant results. They needed semantic understanding: "companies discussing supply chain risks in Asia" should surface relevant paragraphs even if those exact words weren't used.
Requirements:
- 180M vectors at 1536 dimensions
- Multi-tenant: 45 institutional clients with strict data isolation
- Query latency: p95 under 200ms
- Ingestion: 500K new documents per week
- Hybrid search: semantic + keyword for exact entity matches (ticker symbols, CUSIP numbers)
Architecture Decisions
Decision 1: Qdrant Over Pinecone
We evaluated Pinecone, Weaviate, Qdrant, and Milvus. Qdrant won for three reasons:
- Compliance: Our clients required data to stay in our VPC. Pinecone's managed service wasn't an option.
- Payload filtering: Qdrant's filtered search performance was consistently better than Weaviate's for our access-control patterns.
- Quantization: Built-in scalar quantization reduced our memory footprint by 4x without significant recall loss.
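The 4x figure falls out of the arithmetic: scalar quantization maps each float32 component (4 bytes) onto an int8 level (1 byte). A minimal sketch of the idea in pure Python — the real Qdrant implementation also tracks quantiles and supports rescoring against the original vectors, which this toy version omits:

```python
def scalar_quantize(vec):
    """Map float components onto 256 int8 levels between min and max.
    Sketch of the core idea only; not Qdrant's actual algorithm."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255.0 or 1.0
    return [round((x - lo) / scale) - 128 for x in vec], lo, scale

def dequantize(qvec, lo, scale):
    """Approximate reconstruction; error is bounded by scale/2 per component."""
    return [(q + 128) * scale + lo for q in qvec]

# Memory: a 1536-dim float32 vector is 6144 bytes; int8 is 1536 bytes.
assert (1536 * 4) // (1536 * 1) == 4
```

The recall cost is a small per-component rounding error, which is why rescoring a shortlist against full-precision vectors (as Qdrant can do) recovers most of the lost accuracy.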
Decision 2: Collection-Per-Client
We chose one Qdrant collection per client rather than a shared collection with metadata filtering. This added operational complexity (managing 45 collections) but gave us:
- Provable data isolation for compliance audits
- Per-client index tuning (clients with larger corpora got different HNSW parameters)
- Independent backup/restore per client
- Clean data deletion when a client churns
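Per-client tuning is easy to express once collections are split. A sketch of what the provisioning logic might look like — the naming scheme and the specific HNSW values here are illustrative assumptions, not the parameters we actually shipped:

```python
def collection_name(client_id: str) -> str:
    # Hypothetical naming scheme: one isolated collection per client.
    return f"docs_{client_id}"

def hnsw_params(vector_count: int) -> dict:
    """Illustrative tiering: larger corpora get a denser graph (higher m)
    and a wider construction beam (ef_construct) to hold recall at scale."""
    if vector_count > 10_000_000:
        return {"m": 32, "ef_construct": 256}
    if vector_count > 1_000_000:
        return {"m": 16, "ef_construct": 128}
    return {"m": 8, "ef_construct": 64}
```

Churn handling then reduces to dropping one collection, which is also what makes the deletion story clean for auditors.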
Decision 3: Two-Stage Retrieval
Pure vector search at 180M scale was too slow for our latency target, so we implemented a two-stage retrieval approach.
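The post doesn't spell out the two stages here, but the common pattern is a cheap approximate pass to shortlist candidates followed by an exact scoring pass over the shortlist. A sketch under that assumption — in production stage 1 would be an ANN index (e.g. HNSW over quantized vectors), not the linear scan shown:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na = math.sqrt(dot(a, a)) or 1.0
    nb = math.sqrt(dot(b, b)) or 1.0
    return dot(a, b) / (na * nb)

def two_stage_search(query, corpus, candidates=100, top_k=10):
    """Stage 1: cheap approximate scoring to shortlist `candidates` IDs.
    Stage 2: exact cosine scoring over the shortlist only.
    `corpus` maps doc_id -> vector."""
    stage1 = sorted(corpus, key=lambda d: dot(query, corpus[d]), reverse=True)[:candidates]
    stage2 = sorted(stage1, key=lambda d: cosine(query, corpus[d]), reverse=True)
    return stage2[:top_k]
```

The latency win comes from running the expensive scoring over hundreds of candidates instead of millions of vectors.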
Infrastructure Layout
We ran Qdrant on Kubernetes across three availability zones.
Monthly cost: roughly $8,200 for compute + $1,400 for EBS storage + $2,100 for embedding API calls = $11,700/month serving 45 clients.
What Broke in Production
Incident 1: Index Rebuild Storm
Three months in, we upgraded Qdrant and needed to rebuild indexes for the new version. We naively triggered rebuilds across all 45 collections simultaneously. Every node's CPU pegged at 100%, query latency spiked to 15 seconds, and our monitoring fired SLA-breach alerts for every client.
Fix: We built a sequential rebuild controller that processes one collection at a time, with a 10-minute cooldown between rebuilds. Total rebuild time went from 2 hours (parallel, broken) to 18 hours (sequential, stable).
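The controller itself is simple; the hard-won lesson is the sequencing. A minimal sketch, assuming `rebuild_fn` blocks until the collection's rebuild completes (the names and cooldown value mirror the description above):

```python
import time

def rebuild_sequentially(collections, rebuild_fn, cooldown_s=600, sleep=time.sleep):
    """Rebuild one collection at a time, with a cooldown between rebuilds
    so the cluster can drain IO and merge backlog before the next build.
    `sleep` is injectable for testing."""
    done = []
    for i, name in enumerate(collections):
        rebuild_fn(name)  # assumed to block until the rebuild finishes
        done.append(name)
        if i < len(collections) - 1:
            sleep(cooldown_s)
    return done
```

With 45 collections and 10-minute cooldowns, roughly 7.5 hours of the 18-hour total is cooldown alone; the rest is the rebuilds themselves, now bounded to one collection's worth of CPU at a time.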
Incident 2: Embedding Pipeline Backlog
A client uploaded 50,000 documents in a single batch. Our embedding workers exhausted our OpenAI rate limit within minutes, and the queue backed up to 200,000 pending chunks. New documents from other clients weren't being processed.
Fix: Per-client rate limiting on the ingestion queue. Each client gets a maximum of 1,000 embedding API calls per minute. Excess work is queued with fair scheduling across clients.
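One way to sketch the combination of a per-client budget and fair scheduling is round-robin over per-client queues, capping each client at its per-window budget. This is an illustrative model of the design, not the production implementation:

```python
from collections import deque

class FairIngestScheduler:
    """Round-robin over per-client queues, honoring a per-client budget of
    embedding calls per scheduling window (e.g. 1,000 per minute).
    Illustrative sketch of the fix described above."""

    def __init__(self, per_client_budget=1000):
        self.budget = per_client_budget
        self.queues = {}

    def enqueue(self, client, chunk):
        self.queues.setdefault(client, deque()).append(chunk)

    def next_batch(self):
        """One window's worth of work: up to `budget` chunks per client,
        interleaved so no single bulk upload starves the others."""
        batch = []
        spent = {c: 0 for c in self.queues}
        progress = True
        while progress:
            progress = False
            for client, q in self.queues.items():
                if q and spent[client] < self.budget:
                    batch.append((client, q.popleft()))
                    spent[client] += 1
                    progress = True
        return batch
```

A 50,000-document dump now drains over many windows while other clients' chunks keep flowing in every window.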
Incident 3: Memory Pressure from Payload Storage
We stored the full paragraph text in Qdrant's payload. For our largest client (12.8M vectors), payloads consumed 40GB of RAM — more than the vectors themselves. This pushed the node into swap, destroying query performance.
Fix: Moved full text to PostgreSQL and stored only a text hash in Qdrant's payload. On query, we fetch the full text from PostgreSQL using the returned IDs. This reduced Qdrant's memory usage by 60%.
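The shape of that fix is easy to show. In this sketch a dict stands in for the PostgreSQL table, and the function names are hypothetical; the point is that the vector payload carries only an ID and a hash, never the paragraph itself:

```python
import hashlib

full_text_store = {}  # stand-in for the PostgreSQL documents table

def index_chunk(doc_id: str, text: str) -> dict:
    """Write the full paragraph to the relational store; return the compact
    payload (ID plus content hash) that goes into the vector index."""
    full_text_store[doc_id] = text
    return {"doc_id": doc_id,
            "text_sha256": hashlib.sha256(text.encode()).hexdigest()}

def hydrate(hits):
    """After a vector search, fetch full text by the returned IDs."""
    return [(h["doc_id"], full_text_store[h["doc_id"]]) for h in hits]
```

The hash earns its keep during re-indexing and audits: you can verify that the text behind a vector hasn't drifted without storing it twice.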
Measurable Results
After six months in production:
| Metric | Target | Actual |
|---|---|---|
| Query latency (p50) | < 100ms | 42ms |
| Query latency (p95) | < 200ms | 128ms |
| Query latency (p99) | < 500ms | 287ms |
| Recall@20 | > 0.90 | 0.94 |
| Ingestion throughput | 500K docs/week | 720K docs/week |
| Uptime | 99.9% | 99.95% |
| Cost per client | < $500/month | $260/month |
User feedback was strongly positive. Compliance teams reported finding relevant documents in seconds instead of hours. The semantic search caught contextual references that keyword search missed entirely — particularly valuable for risk assessment queries.
What We'd Do Differently
Use Hybrid Search from Day One
We launched with pure vector search and added BM25 keyword search three months later when users complained about missing exact ticker symbol matches. The retrofit required re-indexing all 180M vectors with text payloads for BM25. If we'd planned for hybrid search from the start, we would have avoided two weeks of re-indexing downtime.
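Hybrid search also needs a way to merge the two result lists. The post doesn't say which fusion method was used; reciprocal-rank fusion (RRF) is one common, score-free choice, sketched here:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal-rank fusion over multiple ranked ID lists, e.g. one from
    the vector index and one from BM25. Documents ranked highly by either
    list accumulate a large fused score; k damps the head of each list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, it sidesteps the problem of normalizing BM25 scores against cosine similarities, which is exactly the kind of calibration exact-match queries for ticker symbols tend to break.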
Invest in Evaluation Earlier
We didn't build a systematic evaluation framework until month four. Before that, we relied on qualitative feedback from users. Building a test set of 200 query/expected-result pairs and running weekly recall benchmarks would have caught our chunking quality issues two months sooner.
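The framework itself is small once the labeled pairs exist. A minimal sketch of the weekly benchmark, assuming `search_fn` returns ranked document IDs for a query:

```python
def recall_at_k(results, expected, k=20):
    """Fraction of expected-relevant doc IDs found in the top-k results
    for one query."""
    if not expected:
        return 1.0
    top = set(results[:k])
    return sum(1 for d in expected if d in top) / len(expected)

def benchmark(search_fn, test_set, k=20):
    """test_set: list of (query, expected_doc_ids) pairs, e.g. the 200
    labeled pairs described above. Returns mean recall@k."""
    scores = [recall_at_k(search_fn(q), exp, k) for q, exp in test_set]
    return sum(scores) / len(scores)
```

Tracking this number week over week is what turns "chunking feels off" into a regression you can see on a chart.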
Start with Smaller Embedding Dimensions
We used 1536-dimension embeddings from the start. Testing later showed that 768 dimensions (using OpenAI's text-embedding-3-small with the dimensions parameter) achieved 0.92 recall versus 0.94 at 1536 — a negligible difference for our use case. At 768 dimensions, our memory footprint and costs would have been halved.
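The "halved" claim is back-of-envelope arithmetic on raw float32 storage, ignoring index overhead and payloads:

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_component=4):
    """Uncompressed float32 footprint of the vector data alone."""
    return n_vectors * dims * bytes_per_component

GIB = 1024 ** 3
at_1536 = raw_vector_bytes(180_000_000, 1536) / GIB  # ~1030 GiB
at_768 = raw_vector_bytes(180_000_000, 768) / GIB    # ~515 GiB
```

Combined with the 4x scalar quantization win, the 768-dimension corpus would have fit comfortably in a fraction of the RAM we actually provisioned.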
Separate Read and Write Paths
Our initial architecture used the same Qdrant nodes for both queries and ingestion. During bulk ingestion, query latency degraded noticeably. Adding dedicated ingestion nodes that replicate to query nodes would have eliminated this interference.