
Vector Database Architecture Best Practices for Startup Teams

Battle-tested best practices for vector database architecture tailored to startup teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil · 16 min read

Startups building AI-powered features face a paradox with vector databases: you need something production-ready enough to ship, but flexible enough to pivot when your product direction changes. Over-engineering the vector layer too early burns runway. Under-engineering it means rewriting everything six months in.

This guide covers the pragmatic patterns for startup teams — what to ship in week one, what to defer, and the specific traps that catch teams with limited engineering bandwidth.

Start with pgvector, Migrate Later

Unless you have a specific reason to run a dedicated vector database, start with pgvector. Here's why:

```sql
-- Add vector support to your existing PostgreSQL
CREATE EXTENSION IF NOT EXISTS vector;

-- Add an embedding column to your existing table
ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- Create an HNSW index
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query — it's just SQL
SELECT id, title, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

Why pgvector works for startups:

  • No new infrastructure — runs in your existing PostgreSQL
  • Transactional consistency — vectors and metadata update atomically
  • Familiar tooling — Prisma, TypeORM, SQLAlchemy all work
  • Good enough performance — sub-50ms queries up to 1M vectors on a db.r6g.xlarge

The migration trigger: switch to a dedicated vector database when you exceed 5M vectors, need sub-10ms p99 latency, or require horizontal scaling.

Embedding Pipeline for Small Teams

Skip the infrastructure-heavy patterns. A simple, reliable pipeline beats an overengineered one:

```typescript
// lib/embeddings.ts
import OpenAI from 'openai';

const openai = new OpenAI();

const BATCH_SIZE = 100;
const MAX_RETRIES = 3;

export async function generateEmbeddings(
  texts: string[]
): Promise<number[][]> {
  const embeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);
    let attempts = 0;

    while (attempts < MAX_RETRIES) {
      try {
        const response = await openai.embeddings.create({
          model: 'text-embedding-3-small',
          input: batch,
        });

        embeddings.push(
          ...response.data.map((d) => d.embedding)
        );
        break;
      } catch (error) {
        attempts++;
        if (attempts === MAX_RETRIES) throw error;
        await new Promise((r) =>
          setTimeout(r, 1000 * Math.pow(2, attempts))
        );
      }
    }
  }

  return embeddings;
}

export async function generateSingleEmbedding(
  text: string
): Promise<number[]> {
  const [embedding] = await generateEmbeddings([text]);
  return embedding;
}
```

Document Ingestion with Chunking

Chunking strategy matters more than embedding model choice for RAG quality:

```typescript
// lib/ingest.ts
import { prisma } from './prisma';
import { generateEmbeddings } from './embeddings';

interface ChunkOptions {
  maxTokens: number;
  overlap: number;
}

export function chunkText(
  text: string,
  options: ChunkOptions = { maxTokens: 500, overlap: 50 }
): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentLength = 0;

  for (const sentence of sentences) {
    const sentenceTokens = estimateTokens(sentence);

    if (
      currentLength + sentenceTokens > options.maxTokens &&
      current.length > 0
    ) {
      chunks.push(current.join(' '));

      // Keep overlap sentences
      const overlapSentences: string[] = [];
      let overlapLength = 0;
      for (let i = current.length - 1; i >= 0; i--) {
        const len = estimateTokens(current[i]);
        if (overlapLength + len > options.overlap) break;
        overlapSentences.unshift(current[i]);
        overlapLength += len;
      }
      current = overlapSentences;
      currentLength = overlapLength;
    }

    current.push(sentence);
    currentLength += sentenceTokens;
  }

  if (current.length > 0) {
    chunks.push(current.join(' '));
  }

  return chunks;
}

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text
  return Math.ceil(text.length / 4);
}

export async function ingestDocument(
  docId: string,
  title: string,
  content: string,
  tenantId: string
) {
  const chunks = chunkText(content);
  const embeddings = await generateEmbeddings(chunks);

  // Delete existing chunks for this document
  await prisma.$executeRaw`
    DELETE FROM document_chunks WHERE document_id = ${docId}
  `;

  // Insert new chunks with embeddings. Serialize each embedding to
  // pgvector's '[1,2,3]' text format before the cast — a raw JS array
  // parameter won't cast to vector.
  for (let i = 0; i < chunks.length; i++) {
    await prisma.$executeRaw`
      INSERT INTO document_chunks (
        document_id, tenant_id, chunk_index,
        content, embedding
      ) VALUES (
        ${docId}, ${tenantId}, ${i},
        ${chunks[i]}, ${JSON.stringify(embeddings[i])}::vector
      )
    `;
  }

  return chunks.length;
}
```

Semantic Search API Route

Build a simple search endpoint that handles both semantic and filtered queries:

```typescript
// app/api/search/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { prisma } from '@/lib/prisma';
import { generateSingleEmbedding } from '@/lib/embeddings';

export async function POST(request: NextRequest) {
  const { query, tenantId, limit = 10, threshold = 0.7 } =
    await request.json();

  if (!query || !tenantId) {
    return NextResponse.json(
      { error: 'query and tenantId required' },
      { status: 400 }
    );
  }

  const embedding = await generateSingleEmbedding(query);
  // pgvector expects the '[1,2,3]' text format for the cast
  const vector = JSON.stringify(embedding);

  const results = await prisma.$queryRaw`
    SELECT
      dc.document_id,
      dc.content,
      dc.chunk_index,
      d.title,
      1 - (dc.embedding <=> ${vector}::vector) AS similarity
    FROM document_chunks dc
    JOIN documents d ON d.id = dc.document_id
    WHERE dc.tenant_id = ${tenantId}
      AND 1 - (dc.embedding <=> ${vector}::vector) > ${threshold}
    ORDER BY dc.embedding <=> ${vector}::vector
    LIMIT ${limit}
  `;

  return NextResponse.json({ results });
}
```


RAG Pipeline

Connect your search to an LLM for retrieval-augmented generation:

```typescript
// lib/rag.ts
import OpenAI from 'openai';
import { prisma } from './prisma';
import { generateSingleEmbedding } from './embeddings';

const openai = new OpenAI();

interface RAGOptions {
  tenantId: string;
  topK?: number;
  model?: string;
  systemPrompt?: string;
}

export async function ragQuery(
  query: string,
  options: RAGOptions
) {
  const {
    tenantId,
    topK = 5,
    model = 'gpt-4o-mini',
    systemPrompt = 'Answer based on the provided context. If the context does not contain relevant information, say so.',
  } = options;

  // Retrieve relevant chunks
  const embedding = await generateSingleEmbedding(query);
  // pgvector expects the '[1,2,3]' text format for the cast
  const vector = JSON.stringify(embedding);

  const chunks = await prisma.$queryRaw<
    { content: string; title: string; similarity: number }[]
  >`
    SELECT
      dc.content,
      d.title,
      1 - (dc.embedding <=> ${vector}::vector) AS similarity
    FROM document_chunks dc
    JOIN documents d ON d.id = dc.document_id
    WHERE dc.tenant_id = ${tenantId}
    ORDER BY dc.embedding <=> ${vector}::vector
    LIMIT ${topK}
  `;

  // Build context from retrieved chunks
  const context = chunks
    .map(
      (c, i) =>
        `[Source ${i + 1}: ${c.title}]\n${c.content}`
    )
    .join('\n\n---\n\n');

  // Generate response
  const completion = await openai.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
    temperature: 0.1,
  });

  return {
    answer: completion.choices[0].message.content,
    sources: chunks.map((c) => ({
      title: c.title,
      similarity: c.similarity,
      excerpt: c.content.slice(0, 200),
    })),
  };
}
```
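One detail the pipeline above glosses over: at larger `topK` values the concatenated context can exceed the model's input budget. A minimal guard, reusing the same chars/4 token heuristic as the ingestion code — the budget number and the `fitToBudget` helper name are illustrative assumptions, not part of any API:

```typescript
// Trim retrieved chunks to a token budget before prompt assembly.
// Assumes the ~4 chars/token heuristic from lib/ingest.ts.

interface RetrievedChunk {
  content: string;
  title: string;
  similarity: number;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

export function fitToBudget(
  chunks: RetrievedChunk[],
  maxContextTokens = 4000 // illustrative budget; tune per model
): RetrievedChunk[] {
  const kept: RetrievedChunk[] = [];
  let used = 0;

  // Chunks arrive ordered by similarity, so keep the best-ranked first
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxContextTokens) break;
    kept.push(chunk);
    used += cost;
  }

  return kept;
}
```

Dropping the lowest-ranked chunks whole is usually safer than truncating a chunk mid-sentence, since a cut-off chunk can mislead the model.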

Cost-Efficient Embedding Strategy

Embedding costs add up fast. Here's how to keep them manageable:

```typescript
// lib/embeddings-cache.ts
import { prisma } from './prisma';
import crypto from 'crypto';
import { generateEmbeddings } from './embeddings';

function hashText(text: string): string {
  return crypto
    .createHash('sha256')
    .update(text)
    .digest('hex');
}

export async function getCachedEmbedding(
  text: string
): Promise<number[] | null> {
  const hash = hashText(text);

  const cached = await prisma.$queryRaw<
    { embedding: string }[]
  >`
    SELECT embedding::text AS embedding
    FROM embedding_cache
    WHERE text_hash = ${hash}
      AND model = 'text-embedding-3-small'
  `;

  // pgvector's text format '[0.1,0.2,...]' is valid JSON
  return cached[0] ? JSON.parse(cached[0].embedding) : null;
}

export async function getOrCreateEmbedding(
  text: string
): Promise<number[]> {
  const cached = await getCachedEmbedding(text);
  if (cached) return cached;

  const [embedding] = await generateEmbeddings([text]);
  const hash = hashText(text);

  await prisma.$executeRaw`
    INSERT INTO embedding_cache (text_hash, model, embedding)
    VALUES (${hash}, 'text-embedding-3-small', ${JSON.stringify(embedding)}::vector)
    ON CONFLICT (text_hash, model) DO NOTHING
  `;

  return embedding;
}
```

Cost comparison at startup scale:

| Volume | text-embedding-3-small | text-embedding-3-large |
| --- | --- | --- |
| 100K docs (avg 500 tokens) | $1.00 | $6.50 |
| 1M docs | $10.00 | $65.00 |
| 10M docs | $100.00 | $650.00 |

Use text-embedding-3-small (1536 dimensions) unless you have benchmarks showing the large model measurably improves your specific retrieval task.
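The table's numbers are easy to reproduce when budgeting. A small helper — the per-million-token prices are the ones implied by the table above ($0.02 for small, $0.13 for large), so verify them against current pricing before relying on the output:

```typescript
// Estimate one-time embedding cost for a corpus.
// Prices ($ per 1M tokens) are assumptions taken from the table above.
const PRICE_PER_MILLION_TOKENS = {
  'text-embedding-3-small': 0.02,
  'text-embedding-3-large': 0.13,
} as const;

export function estimateEmbeddingCost(
  docCount: number,
  avgTokensPerDoc: number,
  model: keyof typeof PRICE_PER_MILLION_TOKENS
): number {
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1_000_000) * PRICE_PER_MILLION_TOKENS[model];
}

// estimateEmbeddingCost(100_000, 500, 'text-embedding-3-small') ≈ 1.00
```

Running this for your actual document count and average length before a bulk re-embed is a cheap way to catch surprises, e.g. a chunking change that doubles token volume.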

When to Migrate Off pgvector

Track these metrics weekly. When any threshold is crossed, start planning migration:

```typescript
// scripts/vector-health-check.ts
import { prisma } from '@/lib/prisma';

async function checkVectorHealth() {
  const stats = await prisma.$queryRaw<any[]>`
    SELECT
      COUNT(*) AS total_vectors,
      pg_size_pretty(pg_total_relation_size('document_chunks')) AS table_size,
      (SELECT COUNT(DISTINCT tenant_id) FROM document_chunks) AS tenant_count
    FROM document_chunks
  `;

  // Sample query latency with a zero vector — EXPLAIN ANALYZE
  // actually executes the query and reports timing
  const testVector = `[${new Array(1536).fill(0).join(',')}]`;
  const latencyTest = await prisma.$queryRaw<any[]>`
    EXPLAIN (ANALYZE, FORMAT JSON)
    SELECT id, 1 - (embedding <=> ${testVector}::vector) AS similarity
    FROM document_chunks
    WHERE tenant_id = 'test'
    ORDER BY embedding <=> ${testVector}::vector
    LIMIT 10
  `;
  const queryLatencyMs =
    latencyTest[0]?.['QUERY PLAN']?.[0]?.['Execution Time'];

  const row = stats[0];
  const warnings: string[] = [];

  if (row.total_vectors > 5_000_000) {
    warnings.push(
      `Vector count (${row.total_vectors}) exceeds 5M threshold`
    );
  }

  console.log({
    totalVectors: row.total_vectors,
    tableSize: row.table_size,
    tenantCount: row.tenant_count,
    queryLatencyMs,
    warnings,
    migrationRecommended: warnings.length > 0,
  });
}

checkVectorHealth();
```

Anti-Patterns to Avoid

Running a Dedicated Vector Database Before 100K Vectors

The operational overhead of Milvus, Weaviate, or Qdrant is not justified at small scale. pgvector in your existing PostgreSQL handles 100K vectors with sub-20ms queries. Don't add infrastructure complexity until you have the data volume to justify it.

Embedding Everything

Not every piece of text needs to be embedded. Skip boilerplate, navigation text, legal disclaimers, and duplicate content. Compute embeddings only for content that users actually search through; for typical corpora this can cut your vector count by 40-60%.
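A concrete sketch of that filter. The specific skip rules here — a minimum length, a boilerplate phrase list, hash-based dedupe — are illustrative assumptions you would tune against your own corpus:

```typescript
// lib/embed-filter.ts
import crypto from 'crypto';

// Phrases that mark boilerplate; tune this list for your content
const BOILERPLATE_PATTERNS = [
  /all rights reserved/i,
  /terms of service/i,
  /cookie policy/i,
];

const MIN_CHARS = 80; // skip fragments too short to be searchable

export function selectEmbeddable(texts: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];

  for (const text of texts) {
    const trimmed = text.trim();
    if (trimmed.length < MIN_CHARS) continue;
    if (BOILERPLATE_PATTERNS.some((p) => p.test(trimmed))) continue;

    // Dedupe on a whitespace-normalized hash so near-identical
    // copies are embedded only once
    const hash = crypto
      .createHash('sha256')
      .update(trimmed.toLowerCase().replace(/\s+/g, ' '))
      .digest('hex');
    if (seen.has(hash)) continue;
    seen.add(hash);

    result.push(trimmed);
  }

  return result;
}
```

Run this before `generateEmbeddings` in the ingestion path; every text it drops is an API call and a stored vector you never pay for.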

Building a Custom Embedding Model

Fine-tuning embedding models requires significant ML expertise and thousands of labeled query-document pairs. Use OpenAI or Cohere's off-the-shelf models until you have concrete evidence that general-purpose embeddings underperform for your domain.

Premature Sharding

If your entire vector index fits in memory on a single $200/month instance, sharding adds complexity with zero performance benefit. A single r6g.xlarge (32 GB RAM) can serve a few million 1536-dimension vectors with HNSW — though at 5M vectors the raw float4 data alone is roughly 31 GB, so that scale is the practical ceiling for this instance size, not a comfortable fit.
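The sizing arithmetic is simple enough to keep as a helper: raw vector storage is count × dimensions × 4 bytes (float4 per component). The function names are illustrative, and the result is a lower bound — HNSW graph structure and per-row overhead add a meaningful fraction on top:

```typescript
// Raw in-memory footprint of a pgvector column (float4 per dimension).
// Ignores HNSW graph and tuple overhead, so treat as a lower bound.
export function rawVectorBytes(
  vectorCount: number,
  dimensions: number
): number {
  return vectorCount * dimensions * 4; // 4 bytes per float4 component
}

export function rawVectorGB(
  vectorCount: number,
  dimensions: number
): number {
  return rawVectorBytes(vectorCount, dimensions) / 1e9;
}

// rawVectorGB(5_000_000, 1536) ≈ 30.72 GB — already near a 32 GB instance
```

Rerun this against your projected vector count before reaching for sharding; in most startup scenarios a single larger instance is the cheaper answer.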

Ignoring Chunking Quality

Teams spend weeks optimizing embedding models while using naive fixed-size chunking. Sentence-boundary chunking with semantic overlap consistently outperforms character-based splitting. Invest in chunking quality before tuning anything else.

Startup Readiness Checklist

  • pgvector extension installed and HNSW index created
  • Embedding pipeline with batching and retry logic
  • Sentence-boundary chunking with configurable overlap
  • Embedding cache to avoid re-computing unchanged documents
  • Search API with tenant isolation and similarity threshold
  • RAG pipeline connected to LLM with source attribution
  • Cost monitoring on embedding API calls
  • Weekly health check script tracking vector count and query latency
  • Migration criteria defined (vector count, latency, feature needs)
  • Document ingestion tested with your actual data format



Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
