
RAG Pipeline Design Best Practices for Startup Teams

Battle-tested best practices for RAG pipeline design tailored to startup teams, including anti-patterns to avoid and a ready-to-use checklist.

Muneer Puthiya Purayil 14 min read

Startup RAG pipelines need to deliver value fast without the infrastructure complexity of enterprise deployments. The goal is a working retrieval-augmented generation system that answers questions over your product documentation, support tickets, or knowledge base in days rather than months. These best practices focus on pragmatic choices that maximize quality with minimal engineering investment.

Start with a Minimal Architecture

The simplest production RAG pipeline has four components:

Documents → Chunker → Embedder → Vector DB → Query → LLM → Response

Resist the urge to add re-ranking, query expansion, hybrid search, or complex chunking until you have user feedback on the basic pipeline.

Day-One Implementation

```python
import os
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = OpenAI()
qdrant = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])

COLLECTION = "knowledge_base"
EMBED_MODEL = "text-embedding-3-small"  # Cheaper, fast, good enough to start
DIMENSIONS = 1536

# One-time setup
qdrant.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=DIMENSIONS, distance=Distance.COSINE),
)

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [item.embedding for item in response.data]

def ingest(documents: list[dict]):
    """Ingest documents with simple fixed-size chunking."""
    points = []
    for doc in documents:
        chunks = chunk_text(doc["content"], max_tokens=500, overlap=50)
        embeddings = embed([c["text"] for c in chunks])

        for chunk, embedding in zip(chunks, embeddings):
            points.append(PointStruct(
                id=chunk["id"],
                vector=embedding,
                payload={
                    "text": chunk["text"],
                    "source": doc["source"],
                    "title": doc["title"],
                },
            ))

    qdrant.upsert(collection_name=COLLECTION, points=points)

def query(question: str, top_k: int = 5) -> str:
    query_embedding = embed([question])[0]
    results = qdrant.search(
        collection_name=COLLECTION,
        query_vector=query_embedding,
        limit=top_k,
    )

    context = "\n\n---\n\n".join(
        f"Source: {r.payload['title']}\n{r.payload['text']}"
        for r in results
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer the user's question using only the provided context. "
                "If the context doesn't contain enough information, say so. "
                "Cite your sources."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        max_tokens=1000,
    )

    return response.choices[0].message.content

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[dict]:
    """Fixed-size chunking on whitespace words (a rough proxy for tokens)."""
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunk = " ".join(words[start:end])
        chunks.append({
            # Deterministic UUID: Qdrant point IDs must be unsigned integers
            # or UUIDs, and a stable ID lets re-ingestion overwrite old chunks.
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, chunk)),
            "text": chunk,
        })
        if end == len(words):
            break  # Last chunk reached; without this, the overlap step loops forever
        start = end - overlap

    return chunks
```

This is approximately 80 lines of code and handles the core use case. Ship this first, measure quality, then iterate.

Choosing Your Stack

Vector Database

For startups, start with a managed vector database to avoid operational overhead:

| Option | Cost | Best For |
|---|---|---|
| Qdrant Cloud | Free tier: 1 GB | Small corpuses, < 100K vectors |
| Pinecone | Free tier: 100K vectors | Serverless scaling, minimal ops |
| Supabase pgvector | Included with Supabase plan | If already using Supabase |
| ChromaDB (self-hosted) | Infrastructure cost only | Local development, POCs |

Avoid self-hosting a vector database in production until you have at least 10M vectors. The operational overhead is not worth it at startup scale.

Embedding Model

```python
# Cost comparison per 1M tokens (as of 2025)
EMBEDDING_COSTS = {
    "text-embedding-3-small": 0.02,  # Good quality, cheapest
    "text-embedding-3-large": 0.13,  # Best quality OpenAI
    "voyage-3": 0.06,                # Best for code/technical
    "cohere-embed-v3": 0.10,         # Strong multilingual
}
```

Start with text-embedding-3-small. It costs 6x less than text-embedding-3-large with only 3-5% lower retrieval quality on most benchmarks. Switch to a larger model when you have data showing retrieval quality is the bottleneck.
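Before the first ingest, a back-of-envelope cost estimate is worth thirty seconds. This sketch uses the common ~4-characters-per-token heuristic for English prose — an approximation, not a tokenizer count:

```python
# Rough embedding cost estimator. The ~4 chars/token ratio is a common
# approximation for English text, not an exact tokenizer count.
EMBEDDING_COST_PER_1M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def estimate_embedding_cost(total_chars: int, model: str = "text-embedding-3-small") -> float:
    """Approximate USD cost to embed a corpus of total_chars characters."""
    approx_tokens = total_chars / 4
    return approx_tokens / 1_000_000 * EMBEDDING_COST_PER_1M_TOKENS[model]

# A 10,000-document corpus averaging 5,000 characters each:
cost = estimate_embedding_cost(10_000 * 5_000)  # ~$0.25
```

At these prices, embedding cost is rarely the bottleneck — which is exactly why the small model is the right default.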

LLM for Generation

Use the cheapest model that produces acceptable output:

```python
# Start with gpt-4o-mini or claude-3-5-haiku for cost efficiency.
# Upgrade to gpt-4o or claude-sonnet when quality demands it.
# Reserve gpt-4 / claude-opus for complex reasoning tasks.

MODEL_TIERS = {
    "fast_cheap": "gpt-4o-mini",               # $0.15/1M input tokens
    "balanced": "claude-sonnet-4-5-20250514",  # Better reasoning
    "premium": "gpt-4o",                       # Highest quality
}
```
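One way to operationalize the tiers is a simple escalation loop: try the cheap model first and move up only when its answer signals failure. A sketch, not a prescription — `call_model` is a hypothetical stand-in for your LLM client, and the refusal check is a naive heuristic you would replace with your own confidence signal:

```python
# Tier escalation sketch. call_model(model, question) -> str is injected
# so the routing logic stays testable without a live API.
TIER_ORDER = ["gpt-4o-mini", "claude-sonnet-4-5-20250514", "gpt-4o"]

def answer_with_escalation(question: str, call_model, max_tier: int = 2) -> tuple[str, str]:
    """Return (answer, model_used), escalating past refusal-style replies."""
    for model in TIER_ORDER[: max_tier + 1]:
        answer = call_model(model, question)
        # Naive low-confidence heuristic; swap in your own signal
        if "i don't know" not in answer.lower():
            return answer, model
    return answer, model  # Best effort from the highest tier tried
```

Most queries never leave the cheap tier, so the blended cost stays close to the gpt-4o-mini rate.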

Quick Wins for Retrieval Quality

Add Document Titles to Chunks

The single biggest retrieval quality improvement with zero complexity:

```python
def chunk_with_context(doc_title: str, text: str, max_tokens: int = 500) -> list[dict]:
    # Reserve ~20 tokens of the budget for the prepended title line
    chunks = chunk_text(text, max_tokens=max_tokens - 20)
    for chunk in chunks:
        chunk["text"] = f"Document: {doc_title}\n\n{chunk['text']}"
    return chunks
```

Filter by Metadata

Add source filtering so users can scope their search:

```python
from qdrant_client import models

def query_with_filter(question: str, source_filter: str | None = None, top_k: int = 5):
    query_embedding = embed([question])[0]

    filter_condition = None
    if source_filter:
        filter_condition = models.Filter(
            must=[models.FieldCondition(
                key="source",
                match=models.MatchValue(value=source_filter),
            )]
        )

    results = qdrant.search(
        collection_name=COLLECTION,
        query_vector=query_embedding,
        query_filter=filter_condition,
        limit=top_k,
    )
    return results
```

Implement Simple Feedback Collection

Track which responses users find helpful to identify retrieval failures:

```python
from datetime import datetime, timezone

async def log_feedback(query_id: str, helpful: bool, user_comment: str | None = None):
    # `db` is whatever async database client your app already uses
    await db.insert("rag_feedback", {
        "query_id": query_id,
        "helpful": helpful,
        "comment": user_comment,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# Review unhelpful responses weekly to identify:
# 1. Missing documents (need to add content)
# 2. Poor chunking (chunk boundaries split relevant information)
# 3. Wrong retrieval (correct document exists but wrong chunks are returned)
```


When to Add Complexity

Only add these when you have evidence they are needed:

| Feature | Add When | Evidence |
|---|---|---|
| Hybrid search (BM25 + vector) | Keyword queries return poor results | Users searching for exact terms get irrelevant results |
| Re-ranking | Top-5 results contain irrelevant chunks | Relevant docs appear at position 8-15 |
| Query expansion | Short or ambiguous queries fail | 1-3 word queries return empty results |
| Semantic chunking | Fixed-size chunks split information | Feedback shows answers are incomplete |
| Streaming responses | Users wait too long for answers | Time-to-first-token > 3 seconds |
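If keyword queries do start failing, hybrid search does not require a new stack: reciprocal rank fusion (RRF) merges a BM25 ranking and a vector ranking in a few lines of pure Python. A minimal sketch — inputs are document-ID lists ordered best-first, and `k=60` is the conventional damping constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge best-first rankings: each doc scores sum(1 / (k + rank + 1))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],   # BM25 ranking
    ["doc_b", "doc_d", "doc_a"],   # vector ranking
])
# doc_b ranks first: it appears high in both lists
```

Documents that appear in both rankings float to the top, which is exactly the behavior hybrid search is meant to deliver.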

Checklist

  • Basic ingestion pipeline (parse, chunk, embed, store)
  • Single vector database collection with managed hosting
  • Simple query → retrieve → generate pipeline
  • Source citations in generated responses
  • User feedback collection (thumbs up/down)
  • Basic error handling (API failures, empty results)
  • Document title prepended to chunks
  • Cost monitoring (embedding + LLM API costs)
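For the cost-monitoring item on the checklist, a per-query estimator logged alongside each response is enough to start. A sketch using illustrative per-1M-token rates; substitute your provider's current pricing:

```python
# Per-query cost estimate. Rates are USD per 1M tokens and illustrative;
# check your provider's current price sheet.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Summing this per user or per day surfaces runaway prompts (usually oversized retrieved context) long before the invoice does.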

Anti-Patterns to Avoid

Building evaluation infrastructure before having users: You need real queries to build meaningful evaluation sets. Ship the basic pipeline, collect 100 real queries, then build evaluation.
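Once those 100 real queries exist, the first useful metric is hit rate @ k: the fraction of queries whose expected source appears in the top-k results. A minimal sketch, with `retrieve` injected as a stand-in for your actual search call:

```python
def hit_rate_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """eval_set items: {"question": ..., "expected_source": ...}.
    retrieve(question, top_k) returns dicts with a "source" key."""
    hits = 0
    for case in eval_set:
        sources = [r["source"] for r in retrieve(case["question"], top_k=k)]
        hits += case["expected_source"] in sources
    return hits / len(eval_set)
```

A single number per week is enough to tell whether a chunking or model change actually helped.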

Using the most expensive models from day one: Start cheap. text-embedding-3-small + gpt-4o-mini costs 10x less than the premium stack and handles 80% of use cases adequately.

Implementing all retrieval strategies simultaneously: Hybrid search + re-ranking + query expansion + HyDE adds 4 weeks of engineering time. The basic pipeline delivers 70-80% of the value. Add complexity based on measured quality gaps.

Over-engineering the chunking pipeline: Recursive character splitting with markdown-aware boundaries and semantic deduplication is a week of engineering. Fixed-size chunking with document context prepended takes 30 minutes and gets you 80% there.

Conclusion

The startup RAG playbook is straightforward: ship the simplest pipeline that answers user questions, collect feedback, and add complexity based on evidence. The 80/20 rule applies aggressively — basic chunking, a cheap embedding model, and a simple vector database handle the majority of real-world RAG use cases.

Resist premature optimization. The difference between a startup RAG pipeline and an enterprise one is not architectural complexity — it is data quality. A well-curated corpus of 1,000 documents with clean chunking outperforms a poorly maintained corpus of 100,000 documents with the most sophisticated retrieval stack.

Muneer Puthiya Purayil

SaaS Architect & AI Systems Engineer. 10+ years shipping production infrastructure across fintech, automotive, e-commerce, and healthcare.
