RAG in Production: Beyond the Tutorial (Vector DB Selection, Chunking, Evaluation)
Every RAG tutorial follows the same script. Load documents, split into chunks, embed them, store in a vector database, retrieve the top-k results, pass to an LLM. It works in the demo. Then you deploy it and everything falls apart.
I have built RAG systems that serve real traffic — internal knowledge bases, customer support tools, document analysis pipelines. The gap between the tutorial version and what survives production is enormous. This post covers everything the tutorials skip: the chunking strategies that actually matter, embedding model tradeoffs you will hit immediately, vector database selection for real workloads, hybrid search, evaluation, and the failure modes that will bite you at 2 AM.
If you are building your first RAG pipeline, start with the AI Engineer Guide for the bigger picture. If you have already been burned by agent loops gone wrong, my production agent architecture post covers adjacent problems. This post is specifically about retrieval — getting the right context to the LLM before generation even begins.
The Tutorial vs. Production Gap
The tutorial version of RAG has roughly five lines of meaningful code:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
vectorstore = Chroma.from_documents(chunks, embedding_model)
results = vectorstore.similarity_search(query, k=4)
```

This works for a hackathon. Here is what it does not handle:
- Source data changes daily. Your embeddings become stale. Documents get updated, deleted, or added. There is no sync mechanism.
- Chunks split mid-sentence. A critical piece of information sits across two chunks. Neither chunk alone answers the question.
- Top-k retrieval returns irrelevant results. The query is semantically similar to many documents but only one actually answers the question.
- The LLM hallucinates despite having context. The retrieved chunks are relevant but the model ignores them or synthesizes information that is not there.
- Latency is unacceptable. Embedding the query, searching the vector store, reranking, and generating a response takes 8 seconds. Users leave.
Production RAG is an engineering discipline. Let me walk through each component.
Chunking Strategies: The Foundation You Cannot Get Wrong
Chunking is the single most impactful decision in your RAG pipeline. Bad chunks mean irrelevant retrieval. Irrelevant retrieval means bad answers. No amount of prompt engineering fixes garbage context.
Fixed-Size Chunking
Split every N characters (or tokens) with some overlap. Simple, fast, predictable.
```typescript
function fixedSizeChunk(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap;
  }
  return chunks;
}
```

The problem: it splits mid-paragraph, mid-sentence, even mid-word. A paragraph explaining a concept gets cut in half. Each half is meaningless on its own.
Recursive Character Splitting
Split by paragraph first, then by sentence, then by character — only going deeper when a chunk exceeds the size limit. LangChain's default, and a solid baseline.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
```

This respects natural boundaries better than fixed-size, but it still has no understanding of meaning. Two paragraphs about the same topic get split if they exceed the size limit.
Semantic Chunking
Group text by meaning. Embed sentences, then split where the embedding similarity drops below a threshold. Adjacent sentences about the same topic stay together.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75
)
chunks = chunker.create_documents([document_text])
```

Higher quality, but slower — you are embedding every sentence during ingestion. For large corpora this adds significant processing time and cost.
Document-Aware Chunking
Parse the document structure (headings, sections, lists, code blocks) and split at structural boundaries. This is what I recommend for technical documentation, legal documents, and any content with explicit hierarchy.
```typescript
interface DocumentChunk {
  content: string;
  metadata: {
    heading: string;
    section_path: string[]; // ["Chapter 3", "Authentication", "OAuth2 Flow"]
    chunk_index: number;
    source_file: string;
  };
}

// recursiveSplit is the recursive character splitter from the previous
// section, reused here as a fallback for oversized sections.
function documentAwareChunk(markdown: string): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  const sections = markdown.split(/(?=^#{1,3}\s)/m);
  let sectionPath: string[] = [];
  for (const section of sections) {
    const headingMatch = section.match(/^(#{1,3})\s+(.+)/);
    if (headingMatch) {
      const level = headingMatch[1].length;
      const heading = headingMatch[2].trim();
      sectionPath = [...sectionPath.slice(0, level - 1), heading];
    }
    // If section is too large, apply recursive splitting within it
    if (section.length > 1500) {
      const subChunks = recursiveSplit(section, 1000, 200);
      subChunks.forEach((sub, i) => {
        chunks.push({
          content: sub,
          metadata: {
            heading: sectionPath[sectionPath.length - 1] || "untitled",
            section_path: [...sectionPath],
            chunk_index: i,
            source_file: "",
          },
        });
      });
    } else {
      chunks.push({
        content: section,
        metadata: {
          heading: sectionPath[sectionPath.length - 1] || "untitled",
          section_path: [...sectionPath],
          chunk_index: 0,
          source_file: "",
        },
      });
    }
  }
  return chunks;
}
```

The key advantage: metadata. When you retrieve a chunk, you know exactly where it came from in the document hierarchy. This helps the LLM cite sources and helps you debug retrieval issues.
Chunking Strategy Comparison
| Strategy | Quality | Speed | Complexity | Best For |
|---|---|---|---|---|
| Fixed-size | Low | Fast | Minimal | Quick prototypes, uniform text (logs, raw data) |
| Recursive character | Medium | Fast | Low | General-purpose, mixed content |
| Semantic | High | Slow | Medium | Nuanced content where topic boundaries matter |
| Document-aware | High | Medium | High | Structured docs, technical writing, legal text |
My recommendation: start with recursive character splitting to get a baseline. Measure retrieval quality. Switch to document-aware chunking when you need better precision — which you almost certainly will.
Embedding Model Selection
Your embedding model determines what "similar" means in your vector search. Choose wrong and semantically relevant documents will rank below irrelevant ones.
The Models That Matter in 2026
| Model | Dimensions | Max Tokens | Multilingual | Relative Cost | MTEB Score |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (configurable) | 8191 | Yes | $$$ | ~64.6 |
| OpenAI text-embedding-3-small | 1536 (configurable) | 8191 | Yes | $ | ~62.3 |
| Cohere Embed v3 | 1024 | 512 | 100+ languages | $$ | ~64.5 |
| BGE-M3 (BAAI) | 1024 | 8192 | 100+ languages | Free (self-hosted) | ~64.2 |
| E5-Large-v2 | 1024 | 512 | English-focused | Free (self-hosted) | ~62.7 |
Key tradeoffs:
- OpenAI text-embedding-3-large has the best overall quality and the simplest integration. But you pay per token and you have a vendor dependency. The dimension reduction feature (you can truncate to 256 or 1024 dims) is useful for cost-sensitive deployments.
- Cohere Embed v3 is strong for multilingual use cases and offers separate `search_document` and `search_query` input types, which improves retrieval quality. The 512 token limit is restrictive for longer chunks.
- BGE-M3 is my default for self-hosted deployments. Supports dense, sparse, and multi-vector retrieval natively. No API costs. You need GPU infrastructure.
- E5-Large-v2 is lightweight and fast. Good for English-only, latency-sensitive applications.
```python
# OpenAI embedding with dimension reduction
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], dimensions: int = 1024) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dimensions  # Reduce from 3072 to save storage/cost
    )
    return [item.embedding for item in response.data]
```

Critical rule: the embedding model you use at index time must be the same model you use at query time. This sounds obvious but I have seen production systems break because someone upgraded the embedding model without re-indexing. Your old embeddings and new query embeddings live in different vector spaces. Similarity scores become meaningless.
Vector Database Comparison
The vector database is your retrieval engine. The choice matters more than most tutorials suggest.
| Feature | Pinecone | Weaviate | ChromaDB | Qdrant |
|---|---|---|---|---|
| Hosting | Managed only | Managed + self-hosted | Self-hosted (local) | Managed + self-hosted |
| Open Source | No | Yes | Yes | Yes |
| Hybrid Search | Sparse vectors (manual) | Native BM25 + vector | No | Sparse vectors + payload index |
| Filtering | Metadata filters | GraphQL-style filters | Where clauses | Payload filters (powerful) |
| Max Vectors | Billions (serverless) | Millions+ | Thousands-Millions | Billions |
| Latency (p99) | ~50ms | ~100ms | ~10ms (local) | ~30ms |
| Best For | Managed production at scale | Hybrid search, multi-modal | Local dev, prototyping | High-performance, complex filters |
| Pricing | Pay-per-query + storage | Free tier + usage-based | Free | Free tier + usage-based |
When to Use What
ChromaDB when you are prototyping or building a local tool. Zero config, pip install, done. Do not use it for production traffic — it is an in-process database with no replication.
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    documents=["RAG is retrieval augmented generation..."],
    metadatas=[{"source": "glossary.md"}],
    ids=["doc-001"]
)
results = collection.query(query_texts=["What is RAG?"], n_results=3)
```

Pinecone when you want managed infrastructure and your team does not want to operate a database. Serverless pricing is reasonable for moderate traffic. The main limitation is that hybrid search requires you to compute sparse vectors yourself.
Weaviate when you need native hybrid search (BM25 + vector) out of the box. This is a significant advantage — you get keyword matching and semantic matching without building it yourself.
```python
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Document")

# Hybrid search: combines BM25 keyword + vector similarity
results = collection.query.hybrid(
    query="OAuth2 refresh token expiration",
    alpha=0.5,  # 0 = pure keyword, 1 = pure vector
    limit=5
)
```

Qdrant when you need fast filtering alongside vector search. Its payload filtering is the most flexible — you can filter by nested fields, ranges, geo coordinates, and combine them with boolean logic before vector search runs.
Hybrid Search: Why Vector-Only Retrieval Fails
Pure vector similarity search has a fundamental weakness: it matches by meaning, not by terminology. If a user asks "What is the EBITDA margin for Q3?" and your documents use "EBITDA" as a specific term, vector search might return chunks about "profitability metrics" or "earnings analysis" that are semantically similar but do not contain the specific data.
Hybrid search combines vector similarity (semantic) with BM25 keyword search (lexical). The results are fused using Reciprocal Rank Fusion (RRF) or a weighted linear combination.
```typescript
interface SearchResult {
  id: string;
  content: string;
  score: number;
}

function reciprocalRankFusion(
  vectorResults: SearchResult[],
  keywordResults: SearchResult[],
  k: number = 60
): SearchResult[] {
  const scores = new Map<string, { score: number; content: string }>();
  vectorResults.forEach((result, rank) => {
    const rrf = 1 / (k + rank + 1);
    const existing = scores.get(result.id);
    scores.set(result.id, {
      score: (existing?.score || 0) + rrf,
      content: result.content,
    });
  });
  keywordResults.forEach((result, rank) => {
    const rrf = 1 / (k + rank + 1);
    const existing = scores.get(result.id);
    scores.set(result.id, {
      score: (existing?.score || 0) + rrf,
      content: result.content,
    });
  });
  return Array.from(scores.entries())
    .map(([id, { score, content }]) => ({ id, content, score }))
    .sort((a, b) => b.score - a.score);
}
```

In my experience, hybrid search with alpha around 0.6-0.7 (favoring vector) consistently outperforms pure vector or pure keyword search. The keyword component catches exact term matches that embedding models sometimes miss, especially for domain-specific jargon, product names, and acronyms.
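The weighted linear alternative to RRF is a few lines once both score lists are min-max normalized; `weighted_fusion` is my own helper name, with alpha playing the same role as in the hybrid weighting described above:

```python
def weighted_fusion(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.65,  # 1.0 = pure vector, 0.0 = pure keyword
) -> list[tuple[str, float]]:
    """Combine vector and keyword scores with a linear weight after normalizing."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
        return {k: (v - lo) / span for k, v in scores.items()}

    v, kw = normalize(vector_scores), normalize(keyword_scores)
    doc_ids = set(v) | set(kw)  # a document may appear in only one result list
    combined = {i: alpha * v.get(i, 0.0) + (1 - alpha) * kw.get(i, 0.0) for i in doc_ids}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Unlike RRF, this uses the raw scores rather than ranks, so it is more sensitive to score calibration between the two retrievers; normalization is what makes the combination meaningful.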
Production Architecture
Here is the architecture that works. Every component exists for a reason.
RAG Production Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ Source Data ──→ Parser ──→ Chunker ──→ Embedder ──→ Vector Store │
│ (docs, APIs, (PDF, (document- (batch (with │
│ databases) markdown, aware + embed, metadata) │
│ HTML) metadata) queue) │
│ │
│ ┌─── Change Detection ───── Incremental Re-index ──┐ │
│ │ (hash comparison, (update changed, │ │
│ │ timestamps) delete removed) │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL PIPELINE │
│ │
│ User Query │
│ │ │
│ ├──→ Query Expansion (optional: rephrase, HyDE) │
│ │ │
│ ├──→ Hybrid Search (vector + BM25) │
│ │ │ │
│ │ ├──→ Metadata Filtering (source, date, category) │
│ │ │ │
│ │ └──→ Top-N candidates (N=20) │
│ │ │
│ ├──→ Reranker (Cohere Rerank / cross-encoder) │
│ │ │ │
│ │ └──→ Top-K results (K=5) │
│ │ │
│ └──→ Context Assembly ──→ LLM Generation ──→ Response │
└─────────────────────────────────────────────────────────────────────┘

The Reranking Step Most Tutorials Skip
Retrieve 20 candidates, rerank to 5. This single addition improved our retrieval precision by 15-25% across multiple projects. The initial vector search casts a wide net. The reranker — a cross-encoder model — scores each candidate against the query with much higher accuracy than embedding similarity alone.
```python
import cohere

co = cohere.ClientV2()

def rerank_results(query: str, documents: list[str], top_k: int = 5) -> list[dict]:
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_k,
        model="rerank-v3.5"
    )
    return [
        {
            "index": result.index,
            "relevance_score": result.relevance_score,
            "text": documents[result.index]
        }
        for result in response.results
    ]

# Usage
candidates = vector_search(query, n=20)  # Cast wide net
reranked = rerank_results(query, [c.text for c in candidates], top_k=5)
```

The cost of reranking 20 documents is negligible compared to the improvement in answer quality.
Evaluation with RAGAS
You cannot improve what you cannot measure. RAGAS (Retrieval Augmented Generation Assessment) gives you four metrics that cover the full RAG pipeline:
| Metric | What It Measures | Range | Target |
|---|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | 0-1 | > 0.85 |
| Answer Relevancy | Does the answer actually address the question? | 0-1 | > 0.80 |
| Context Precision | Are retrieved chunks relevant to the question? | 0-1 | > 0.75 |
| Context Recall | Did retrieval find all relevant information? | 0-1 | > 0.70 |
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Build your evaluation dataset (50-100 examples minimum)
eval_data = {
    "question": [
        "What is the OAuth2 refresh token expiration policy?",
        "How do I configure rate limiting for the API gateway?",
    ],
    "answer": [
        # Generated by your RAG pipeline
        "Refresh tokens expire after 30 days of inactivity...",
        "Rate limiting is configured in the gateway.yaml file...",
    ],
    "contexts": [
        # Retrieved chunks that were passed to the LLM
        ["Refresh tokens: tokens expire after 30 days...", "OAuth2 spec requires..."],
        ["Gateway configuration: rate_limit section...", "API throttling docs..."],
    ],
    "ground_truth": [
        # Human-verified correct answers
        "Refresh tokens expire after 30 days of inactivity. Active tokens are renewed.",
        "Add a rate_limit block to gateway.yaml with requests_per_second and burst_size.",
    ],
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.87,
#  'context_precision': 0.81, 'context_recall': 0.76}
```

How I use RAGAS in practice:
- Curate 50-100 question-answer pairs from real user queries. Do not synthesize them — use actual questions people asked your system.
- Run evaluation after every pipeline change (new chunking strategy, different embedding model, reranker update).
- Set regression thresholds in CI. If faithfulness drops below 0.80, the pipeline change does not deploy.
- Track metrics over time. Degradation usually means your source data changed and your embeddings are stale.
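The regression-threshold gate from the list above can be sketched in a few lines. The threshold values are illustrative (the faithfulness floor matches the 0.80 mentioned above), and `check_regression` is my own name:

```python
# Minimal CI gate: fail the build when any RAGAS metric drops below its
# regression floor. Floor values are illustrative; tune them per project.
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.65,
}

def check_regression(results: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means safe to deploy."""
    failures = []
    for metric, floor in THRESHOLDS.items():
        score = results.get(metric, 0.0)  # a missing metric counts as failing
        if score < floor:
            failures.append(f"{metric}: {score:.2f} < {floor:.2f}")
    return failures

# In CI: run the evaluation, then gate the deploy
# failures = check_regression(eval_results)
# if failures:
#     raise SystemExit("RAGAS regression: " + "; ".join(failures))
```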
What Actually Breaks in Production
After shipping multiple RAG systems, these are the failure modes I see repeatedly.
1. Stale Embeddings
Your source documents change. Your vector store does not know. The system confidently returns outdated information because the old embedding is still the closest match.
Solution: build a change detection layer. Hash each document at ingestion time. On every sync cycle, compare hashes. Re-embed changed documents, delete removed ones.
```typescript
import crypto from "crypto";

interface DocumentRecord {
  id: string;
  content_hash: string;
  last_indexed: Date;
}

function detectChanges(
  currentDocs: Map<string, string>,
  indexedRecords: Map<string, DocumentRecord>
): { toAdd: string[]; toUpdate: string[]; toDelete: string[] } {
  const toAdd: string[] = [];
  const toUpdate: string[] = [];
  const toDelete: string[] = [];
  for (const [id, content] of currentDocs) {
    const hash = crypto.createHash("sha256").update(content).digest("hex");
    const existing = indexedRecords.get(id);
    if (!existing) {
      toAdd.push(id);
    } else if (existing.content_hash !== hash) {
      toUpdate.push(id);
    }
  }
  for (const id of indexedRecords.keys()) {
    if (!currentDocs.has(id)) {
      toDelete.push(id);
    }
  }
  return { toAdd, toUpdate, toDelete };
}
```

2. Context Window Overflow
You retrieve 10 chunks, each 1000 tokens. That is 10,000 tokens of context before your system prompt and the user query. With a complex system prompt, you are pushing into expensive territory and potentially exceeding the context window on smaller models.
Solution: retrieve more, rerank aggressively, keep only the top 3-5 chunks. Quality over quantity. A focused 2000-token context outperforms a sprawling 10,000-token one.
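A minimal sketch of that budget-driven context assembly, using a rough four-characters-per-token estimate in place of a real tokenizer; `assemble_context` is my own helper name:

```python
def assemble_context(
    ranked_chunks: list[str],
    max_tokens: int = 2000,
    chars_per_token: int = 4,  # rough heuristic; swap in a real tokenizer
) -> str:
    """Greedily keep top-ranked chunks until the token budget is spent."""
    kept: list[str] = []
    budget = max_tokens * chars_per_token  # budget tracked in characters
    for chunk in ranked_chunks:
        if len(chunk) > budget:
            break  # stop at the first chunk that does not fit; rank order matters
        kept.append(chunk)
        budget -= len(chunk)
    return "\n\n".join(kept)
```

Because the chunks arrive already reranked, the budget naturally drops the weakest candidates first.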
3. Embedding Model Mismatch
Someone updates the embedding model in the query pipeline but forgets to re-embed the entire corpus. Old embeddings and new embeddings are incompatible. Similarity scores become random noise.
Solution: version your embedding model in your vector store metadata. Add a check at query time that rejects results from a different model version.
```python
CURRENT_EMBEDDING_MODEL = "text-embedding-3-large-v2"

def safe_search(query: str, collection, top_k: int = 5):
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        where={"embedding_model": CURRENT_EMBEDDING_MODEL}
    )
    if len(results["ids"][0]) < top_k:
        logger.warning(
            f"Only {len(results['ids'][0])} results matched current model version. "
            f"Re-indexing may be needed."
        )
    return results
```

4. Poor Chunk Boundaries
A table spans two chunks. A code block is split in half. A critical definition sits at the end of one chunk and its explanation at the start of the next.
Solution: document-aware chunking (as described above) helps, but overlap alone is not enough. For tables and code blocks, detect them as atomic units and never split them. Add a pre-processing step that identifies these structures.
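A sketch of that pre-processing step for markdown sources: fenced code blocks are carved out as atomic segments that downstream chunking must never split (tables could get the same treatment with a pipe-row pattern). `split_atomic` is my own name:

```python
import re

FENCE = "`" * 3  # built this way so the example itself survives markdown rendering

def split_atomic(markdown: str) -> list[tuple[str, bool]]:
    """Return (segment, is_atomic) pairs; atomic segments must never be re-split."""
    # Capture fenced blocks so re.split keeps them as their own segments
    pattern = re.compile("(" + FENCE + r".*?" + FENCE + ")", re.DOTALL)
    segments: list[tuple[str, bool]] = []
    for part in pattern.split(markdown):
        if not part.strip():
            continue  # drop whitespace-only gaps between segments
        segments.append((part, part.startswith(FENCE)))
    return segments
```

Downstream, only the non-atomic segments go through the normal chunker; atomic ones are emitted as single chunks (or attached whole to a neighboring chunk).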
5. The "Relevant but Wrong" Retrieval
The retrieved chunks are semantically related to the query but do not contain the answer. The LLM, seeing related context, generates a plausible-sounding but fabricated answer instead of saying "I don't know."
Solution: add a faithfulness check. After generation, score whether the answer is grounded in the retrieved context. If the faithfulness score is below a threshold, return a fallback response. This is where the RAGAS faithfulness metric becomes a runtime guardrail, not just an offline evaluation tool.
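A sketch of the guardrail itself, where `score_faithfulness` is a placeholder for whatever grounding scorer you wire in (an LLM judge, an NLI model, or a single-example RAGAS run), and `guarded_answer` is my own name:

```python
from typing import Callable

FALLBACK = "I could not find a grounded answer in the available documents."

def guarded_answer(
    question: str,
    answer: str,
    contexts: list[str],
    score_faithfulness: Callable[[str, str, list[str]], float],  # placeholder scorer
    threshold: float = 0.80,
) -> str:
    """Return the generated answer only if it is sufficiently grounded."""
    score = score_faithfulness(question, answer, contexts)
    return answer if score >= threshold else FALLBACK
```

The fallback costs one extra scoring call per request, which is usually cheap relative to generation, and it converts silent hallucinations into an honest "I don't know."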
The Minimal Production RAG Stack
If I were starting a new RAG project today with real production requirements, here is exactly what I would pick:
| Component | Choice | Why |
|---|---|---|
| Chunking | Document-aware + recursive fallback | Best balance of quality and complexity |
| Embedding | OpenAI text-embedding-3-large (1024 dims) | Quality, simplicity, reduced dimensions for cost |
| Vector DB | Qdrant (self-hosted) or Pinecone (managed) | Depends on ops preference |
| Search | Hybrid (vector + BM25) with alpha=0.65 | Catches both semantic and lexical matches |
| Reranker | Cohere Rerank v3.5 | Best quality-to-cost ratio |
| Evaluation | RAGAS, 75+ eval pairs, CI integration | Regression prevention |
| LLM | Claude Sonnet for generation | Strong instruction following, good grounding |
Start with this stack, measure with RAGAS, and iterate on the component that scores lowest. In my experience, improving chunking gives you the biggest lift first, followed by adding reranking, followed by hybrid search tuning.
Closing Thoughts
RAG is deceptively simple in concept and genuinely difficult in execution. The retrieval step is where most systems fail, and it is the step that gets the least attention in tutorials. The generation model is only as good as the context you feed it.
Focus on three things: chunk quality, retrieval precision, and measurement. Everything else — the specific vector database, the exact embedding model, the framework you use — is a secondary decision that you can change later. But if your chunks are bad and you are not measuring retrieval quality, no amount of model upgrades or prompt engineering will save you.
Build the evaluation harness first. Then iterate on everything else with data, not intuition.