
Vector Database Architecture Best Practices for Startup Teams

Battle-tested best practices for vector database architecture tailored to startup teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 16 min read

Startups building AI-powered features face a paradox with vector databases: you need something production-ready enough to ship, but flexible enough to pivot when your product direction changes. Over-engineering the vector layer too early burns runway. Under-engineering it means rewriting everything six months in.

This guide covers the pragmatic patterns for startup teams — what to ship in week one, what to defer, and the specific traps that catch teams with limited engineering bandwidth.

Start with pgvector, Migrate Later

Unless you have a specific reason to run a dedicated vector database, start with pgvector. Here's why:

```sql
-- Add vector support to your existing PostgreSQL
CREATE EXTENSION IF NOT EXISTS vector;

-- Add an embedding column to your existing table
ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- Create an HNSW index
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query — it's just SQL
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

Why pgvector works for startups:

  • No new infrastructure — runs in your existing PostgreSQL
  • Transactional consistency — vectors and metadata update atomically
  • Familiar tooling — Prisma, TypeORM, SQLAlchemy all work
  • Good enough performance — sub-50ms queries up to 1M vectors on a db.r6g.xlarge

The migration trigger: switch to a dedicated vector database when you exceed 5M vectors, need sub-10ms p99 latency, or require horizontal scaling.

Embedding Pipeline for Small Teams

Skip the infrastructure-heavy patterns. A simple, reliable pipeline beats an overengineered one:

```typescript
// lib/embeddings.ts
import OpenAI from 'openai';

const openai = new OpenAI();

const BATCH_SIZE = 100;
const MAX_RETRIES = 3;

export async function generateEmbeddings(
  texts: string[]
): Promise<number[][]> {
  const embeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);
    let attempts = 0;

    while (attempts < MAX_RETRIES) {
      try {
        const response = await openai.embeddings.create({
          model: 'text-embedding-3-small',
          input: batch,
        });

        embeddings.push(
          ...response.data.map((d) => d.embedding)
        );
        break;
      } catch (error) {
        attempts++;
        if (attempts === MAX_RETRIES) throw error;
        await new Promise((r) =>
          setTimeout(r, 1000 * Math.pow(2, attempts))
        );
      }
    }
  }

  return embeddings;
}

export async function generateSingleEmbedding(
  text: string
): Promise<number[]> {
  const [embedding] = await generateEmbeddings([text]);
  return embedding;
}
```

Document Ingestion with Chunking

Chunking strategy matters more than embedding model choice for RAG quality:

```typescript
// lib/ingest.ts
import { prisma } from './prisma';
import { generateEmbeddings } from './embeddings';

interface ChunkOptions {
  maxTokens: number;
  overlap: number;
}

export function chunkText(
  text: string,
  options: ChunkOptions = { maxTokens: 500, overlap: 50 }
): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentLength = 0;

  for (const sentence of sentences) {
    const sentenceTokens = estimateTokens(sentence);

    if (
      currentLength + sentenceTokens > options.maxTokens &&
      current.length > 0
    ) {
      chunks.push(current.join(' '));

      // Keep overlap sentences
      const overlapSentences: string[] = [];
      let overlapLength = 0;
      for (let i = current.length - 1; i >= 0; i--) {
        const len = estimateTokens(current[i]);
        if (overlapLength + len > options.overlap) break;
        overlapSentences.unshift(current[i]);
        overlapLength += len;
      }
      current = overlapSentences;
      currentLength = overlapLength;
    }

    current.push(sentence);
    currentLength += sentenceTokens;
  }

  if (current.length > 0) {
    chunks.push(current.join(' '));
  }

  return chunks;
}

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text
  return Math.ceil(text.length / 4);
}

export async function ingestDocument(
  docId: string,
  title: string,
  content: string,
  tenantId: string
) {
  const chunks = chunkText(content);
  const embeddings = await generateEmbeddings(chunks);

  // Delete existing chunks for this document
  await prisma.$executeRaw`
    DELETE FROM document_chunks WHERE document_id = ${docId}
  `;

  // Insert new chunks with embeddings. Serialize each embedding to
  // pgvector's '[1,2,3]' text format before the cast — a raw JS array
  // parameter won't cast to vector.
  for (let i = 0; i < chunks.length; i++) {
    await prisma.$executeRaw`
      INSERT INTO document_chunks (
        document_id, tenant_id, chunk_index,
        content, embedding
      ) VALUES (
        ${docId}, ${tenantId}, ${i},
        ${chunks[i]}, ${JSON.stringify(embeddings[i])}::vector
      )
    `;
  }

  return chunks.length;
}
```

Semantic Search API Route

Build a simple search endpoint that handles both semantic and filtered queries:

```typescript
// app/api/search/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { prisma } from '@/lib/prisma';
import { generateSingleEmbedding } from '@/lib/embeddings';

export async function POST(request: NextRequest) {
  const { query, tenantId, limit = 10, threshold = 0.7 } =
    await request.json();

  if (!query || !tenantId) {
    return NextResponse.json(
      { error: 'query and tenantId required' },
      { status: 400 }
    );
  }

  const embedding = await generateSingleEmbedding(query);
  // pgvector expects the '[1,2,3]' text format for the cast
  const vector = JSON.stringify(embedding);

  const results = await prisma.$queryRaw`
    SELECT
      dc.document_id,
      dc.content,
      dc.chunk_index,
      d.title,
      1 - (dc.embedding <=> ${vector}::vector) AS similarity
    FROM document_chunks dc
    JOIN documents d ON d.id = dc.document_id
    WHERE dc.tenant_id = ${tenantId}
      AND 1 - (dc.embedding <=> ${vector}::vector) > ${threshold}
    ORDER BY dc.embedding <=> ${vector}::vector
    LIMIT ${limit}
  `;

  return NextResponse.json({ results });
}
```


RAG Pipeline

Connect your search to an LLM for retrieval-augmented generation:

```typescript
// lib/rag.ts
import OpenAI from 'openai';
import { prisma } from './prisma';
import { generateSingleEmbedding } from './embeddings';

const openai = new OpenAI();

interface RAGOptions {
  tenantId: string;
  topK?: number;
  model?: string;
  systemPrompt?: string;
}

export async function ragQuery(
  query: string,
  options: RAGOptions
) {
  const {
    tenantId,
    topK = 5,
    model = 'gpt-4o-mini',
    systemPrompt = 'Answer based on the provided context. If the context does not contain relevant information, say so.',
  } = options;

  // Retrieve relevant chunks
  const embedding = await generateSingleEmbedding(query);
  // pgvector expects the '[1,2,3]' text format for the cast
  const vector = JSON.stringify(embedding);

  const chunks = await prisma.$queryRaw<
    { content: string; title: string; similarity: number }[]
  >`
    SELECT
      dc.content,
      d.title,
      1 - (dc.embedding <=> ${vector}::vector) AS similarity
    FROM document_chunks dc
    JOIN documents d ON d.id = dc.document_id
    WHERE dc.tenant_id = ${tenantId}
    ORDER BY dc.embedding <=> ${vector}::vector
    LIMIT ${topK}
  `;

  // Build context from retrieved chunks
  const context = chunks
    .map(
      (c, i) =>
        `[Source ${i + 1}: ${c.title}]\n${c.content}`
    )
    .join('\n\n---\n\n');

  // Generate response
  const completion = await openai.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
    temperature: 0.1,
  });

  return {
    answer: completion.choices[0].message.content,
    sources: chunks.map((c) => ({
      title: c.title,
      similarity: c.similarity,
      excerpt: c.content.slice(0, 200),
    })),
  };
}
```
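One detail the pipeline above glosses over: at larger `topK` values the concatenated context can exceed the model's input budget. A minimal guard, reusing the same chars/4 token heuristic as the ingestion code — the budget number and the `fitToBudget` helper name are illustrative assumptions, not part of any API:

```typescript
// Trim retrieved chunks to a token budget before prompt assembly.
// Assumes the ~4 chars/token heuristic from lib/ingest.ts.

interface RetrievedChunk {
  content: string;
  title: string;
  similarity: number;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

export function fitToBudget(
  chunks: RetrievedChunk[],
  maxContextTokens = 4000 // illustrative budget; tune per model
): RetrievedChunk[] {
  const kept: RetrievedChunk[] = [];
  let used = 0;

  // Chunks arrive ordered by similarity, so keep the best-ranked first
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxContextTokens) break;
    kept.push(chunk);
    used += cost;
  }

  return kept;
}
```

Dropping the lowest-ranked chunks whole is usually safer than truncating a chunk mid-sentence, since a cut-off chunk can mislead the model.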

Cost-Efficient Embedding Strategy

Embedding costs add up fast. Here's how to keep them manageable:

```typescript
// lib/embeddings-cache.ts
import { prisma } from './prisma';
import crypto from 'crypto';
import { generateEmbeddings } from './embeddings';

function hashText(text: string): string {
  return crypto
    .createHash('sha256')
    .update(text)
    .digest('hex');
}

export async function getCachedEmbedding(
  text: string
): Promise<number[] | null> {
  const hash = hashText(text);

  const cached = await prisma.$queryRaw<
    { embedding: string }[]
  >`
    SELECT embedding::text AS embedding
    FROM embedding_cache
    WHERE text_hash = ${hash}
      AND model = 'text-embedding-3-small'
  `;

  // pgvector's text format '[0.1,0.2,...]' is valid JSON
  return cached[0] ? JSON.parse(cached[0].embedding) : null;
}

export async function getOrCreateEmbedding(
  text: string
): Promise<number[]> {
  const cached = await getCachedEmbedding(text);
  if (cached) return cached;

  const [embedding] = await generateEmbeddings([text]);
  const hash = hashText(text);

  await prisma.$executeRaw`
    INSERT INTO embedding_cache (text_hash, model, embedding)
    VALUES (${hash}, 'text-embedding-3-small', ${JSON.stringify(embedding)}::vector)
    ON CONFLICT (text_hash, model) DO NOTHING
  `;

  return embedding;
}
```

Cost comparison at startup scale:

| Volume | text-embedding-3-small | text-embedding-3-large |
| --- | --- | --- |
| 100K docs (avg 500 tokens) | $1.00 | $6.50 |
| 1M docs | $10.00 | $65.00 |
| 10M docs | $100.00 | $650.00 |

Use text-embedding-3-small (1536 dimensions) unless you have benchmarks showing the large model measurably improves your specific retrieval task.
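The table's numbers are easy to reproduce when budgeting. A small helper — the per-million-token prices are the ones implied by the table above ($0.02 for small, $0.13 for large), so verify them against current pricing before relying on the output:

```typescript
// Estimate one-time embedding cost for a corpus.
// Prices ($ per 1M tokens) are assumptions taken from the table above.
const PRICE_PER_MILLION_TOKENS = {
  'text-embedding-3-small': 0.02,
  'text-embedding-3-large': 0.13,
} as const;

export function estimateEmbeddingCost(
  docCount: number,
  avgTokensPerDoc: number,
  model: keyof typeof PRICE_PER_MILLION_TOKENS
): number {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * PRICE_PER_MILLION_TOKENS[model];
}

// estimateEmbeddingCost(100_000, 500, 'text-embedding-3-small') ≈ 1.00
```

Running this for your actual document count and average length before a bulk re-embed is a cheap way to catch surprises, e.g. a chunking change that doubles token volume.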

When to Migrate Off pgvector

Track these metrics weekly. When any threshold is crossed, start planning migration:

```typescript
// scripts/vector-health-check.ts
import { prisma } from '@/lib/prisma';

async function checkVectorHealth() {
  const stats = await prisma.$queryRaw<any[]>`
    SELECT
      COUNT(*) AS total_vectors,
      pg_size_pretty(pg_total_relation_size('document_chunks')) AS table_size,
      (SELECT COUNT(DISTINCT tenant_id) FROM document_chunks) AS tenant_count
    FROM document_chunks
  `;

  // Sample query latency with a zero vector — EXPLAIN ANALYZE
  // actually executes the query and reports timing
  const testVector = `[${new Array(1536).fill(0).join(',')}]`;
  const latencyTest = await prisma.$queryRaw<any[]>`
    EXPLAIN (ANALYZE, FORMAT JSON)
    SELECT id, 1 - (embedding <=> ${testVector}::vector) AS similarity
    FROM document_chunks
    WHERE tenant_id = 'test'
    ORDER BY embedding <=> ${testVector}::vector
    LIMIT 10
  `;
  const queryLatencyMs =
    latencyTest[0]?.['QUERY PLAN']?.[0]?.['Execution Time'];

  const row = stats[0];
  const warnings: string[] = [];

  if (row.total_vectors > 5_000_000) {
    warnings.push(
      `Vector count (${row.total_vectors}) exceeds 5M threshold`
    );
  }

  console.log({
    totalVectors: row.total_vectors,
    tableSize: row.table_size,
    tenantCount: row.tenant_count,
    queryLatencyMs,
    warnings,
    migrationRecommended: warnings.length > 0,
  });
}

checkVectorHealth();
```

Anti-Patterns to Avoid

Running a Dedicated Vector Database Before 100K Vectors

The operational overhead of Milvus, Weaviate, or Qdrant is not justified at small scale. pgvector in your existing PostgreSQL handles 100K vectors with sub-20ms queries. Don't add infrastructure complexity until you have the data volume to justify it.

Embedding Everything

Not every piece of text needs to be embedded. Skip boilerplate, navigation text, legal disclaimers, and duplicate content. Compute embeddings only for content that users actually search through; for typical corpora this can cut your vector count by 40-60%.
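A concrete sketch of that filter. The specific skip rules here — a minimum length, a boilerplate phrase list, hash-based dedupe — are illustrative assumptions you would tune against your own corpus:

```typescript
// lib/embed-filter.ts
import crypto from 'crypto';

// Phrases that mark boilerplate; tune this list for your content
const BOILERPLATE_PATTERNS = [
  /all rights reserved/i,
  /terms of service/i,
  /cookie policy/i,
];

const MIN_CHARS = 80; // skip fragments too short to be searchable

export function selectEmbeddable(texts: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];

  for (const text of texts) {
    const trimmed = text.trim();
    if (trimmed.length < MIN_CHARS) continue;
    if (BOILERPLATE_PATTERNS.some((p) => p.test(trimmed))) continue;

    // Dedupe on a whitespace-normalized hash so near-identical
    // copies are embedded only once
    const hash = crypto
      .createHash('sha256')
      .update(trimmed.toLowerCase().replace(/\s+/g, ' '))
      .digest('hex');
    if (seen.has(hash)) continue;
    seen.add(hash);

    result.push(trimmed);
  }

  return result;
}
```

Run this before `generateEmbeddings` in the ingestion path; every text it drops is an API call and a stored vector you never pay for.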

Building a Custom Embedding Model

Fine-tuning embedding models requires significant ML expertise and thousands of labeled query-document pairs. Use OpenAI or Cohere's off-the-shelf models until you have concrete evidence that general-purpose embeddings underperform for your domain.

Premature Sharding

If your entire vector index fits in memory on a single $200/month instance, sharding adds complexity with zero performance benefit. A single r6g.xlarge (32 GB RAM) can serve a few million 1536-dimension vectors with HNSW — though at 5M vectors the raw float4 data alone is roughly 31 GB, so that scale is the practical ceiling for this instance size, not a comfortable fit.
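The sizing arithmetic is simple enough to keep as a helper: raw vector storage is count × dimensions × 4 bytes (float4 per component). The function names are illustrative, and the result is a lower bound — HNSW graph structure and per-row overhead add a meaningful fraction on top:

```typescript
// Raw in-memory footprint of a pgvector column (float4 per dimension).
// Ignores HNSW graph and tuple overhead, so treat as a lower bound.
export function rawVectorBytes(
  vectorCount: number,
  dimensions: number
): number {
  return vectorCount * dimensions * 4; // 4 bytes per float4 component
}

export function rawVectorGB(
  vectorCount: number,
  dimensions: number
): number {
  return rawVectorBytes(vectorCount, dimensions) / 1e9;
}

// rawVectorGB(5_000_000, 1536) ≈ 30.72 GB — already near a 32 GB instance
```

Rerun this against your projected vector count before reaching for sharding; in most startup scenarios a single larger instance is the cheaper answer.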

Ignoring Chunking Quality

Teams spend weeks optimizing embedding models while using naive fixed-size chunking. Sentence-boundary chunking with semantic overlap consistently outperforms character-based splitting. Invest in chunking quality before tuning anything else.

Startup Readiness Checklist

  • pgvector extension installed and HNSW index created
  • Embedding pipeline with batching and retry logic
  • Sentence-boundary chunking with configurable overlap
  • Embedding cache to avoid re-computing unchanged documents
  • Search API with tenant isolation and similarity threshold
  • RAG pipeline connected to LLM with source attribution
  • Cost monitoring on embedding API calls
  • Weekly health check script tracking vector count and query latency
  • Migration criteria defined (vector count, latency, feature needs)
  • Document ingestion tested with your actual data format



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
