RAG in Production: Beyond the Tutorial (Vector DB Selection, Chunking, Evaluation)
Every RAG tutorial follows the same script. Load documents, split into chunks, embed them, store in a vector database, retrieve the top-k results, pass to an LLM. It works in the demo. Then you deploy it and everything falls apart.
I have built RAG systems that serve real traffic — internal knowledge bases, customer support tools, document analysis pipelines. The gap between the tutorial version and what survives production is enormous. This post covers everything the tutorials skip: the chunking strategies that actually matter, embedding model tradeoffs you will hit immediately, vector database selection for real workloads, hybrid search, evaluation, and the failure modes that will bite you at 2 AM.
If you are building your first RAG pipeline, start with the AI Engineer Guide for the bigger picture. If you have already been burned by agent loops gone wrong, my production agent architecture post covers adjacent problems. This post is specifically about retrieval — getting the right context to the LLM before generation even begins.
The Tutorial vs. Production Gap
The tutorial version of RAG has roughly five lines of meaningful code:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
vectorstore = Chroma.from_documents(chunks, embedding_model)
results = vectorstore.similarity_search(query, k=4)
```

This works for a hackathon. Here is what it does not handle:
- Source data changes daily. Your embeddings become stale. Documents get updated, deleted, or added. There is no sync mechanism.
- Chunks split mid-sentence. A critical piece of information sits across two chunks. Neither chunk alone answers the question.
- Top-k retrieval returns irrelevant results. The query is semantically similar to many documents but only one actually answers the question.
- The LLM hallucinates despite having context. The retrieved chunks are relevant but the model ignores them or synthesizes information that is not there.
- Latency is unacceptable. Embedding the query, searching the vector store, reranking, and generating a response takes 8 seconds. Users leave.
Production RAG is an engineering discipline. Let me walk through each component.
Chunking Strategies: The Foundation You Cannot Get Wrong
Chunking is the single most impactful decision in your RAG pipeline. Bad chunks mean irrelevant retrieval. Irrelevant retrieval means bad answers. No amount of prompt engineering fixes garbage context.
Fixed-Size Chunking
Split every N characters (or tokens) with some overlap. Simple, fast, predictable.
```typescript
function fixedSizeChunk(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap;
  }
  return chunks;
}
```

The problem: it splits mid-paragraph, mid-sentence, even mid-word. A paragraph explaining a concept gets cut in half. Each half is meaningless on its own.
Recursive Character Splitting
Split by paragraph first, then by sentence, then by character — only going deeper when a chunk exceeds the size limit. LangChain's default, and a solid baseline.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
```

This respects natural boundaries better than fixed-size, but it still has no understanding of meaning. Two paragraphs about the same topic get split if they exceed the size limit.
Semantic Chunking
Group text by meaning. Embed sentences, then split where the embedding similarity drops below a threshold. Adjacent sentences about the same topic stay together.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75
)
chunks = chunker.create_documents([document_text])
```

Higher quality, but slower — you are embedding every sentence during ingestion. For large corpora this adds significant processing time and cost.
Document-Aware Chunking
Parse the document structure (headings, sections, lists, code blocks) and split at structural boundaries. This is what I recommend for technical documentation, legal documents, and any content with explicit hierarchy.
```typescript
interface DocumentChunk {
  content: string;
  metadata: {
    heading: string;
    section_path: string[]; // ["Chapter 3", "Authentication", "OAuth2 Flow"]
    chunk_index: number;
    source_file: string;
  };
}

// recursiveSplit is the recursive character splitter from the previous
// section, reused here as a fallback for oversized sections.
function documentAwareChunk(markdown: string): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  const sections = markdown.split(/(?=^#{1,3}\s)/m);
  let sectionPath: string[] = [];
  for (const section of sections) {
    const headingMatch = section.match(/^(#{1,3})\s+(.+)/);
    if (headingMatch) {
      const level = headingMatch[1].length;
      const heading = headingMatch[2].trim();
      sectionPath = [...sectionPath.slice(0, level - 1), heading];
    }
    // If section is too large, apply recursive splitting within it
    if (section.length > 1500) {
      const subChunks = recursiveSplit(section, 1000, 200);
      subChunks.forEach((sub, i) => {
        chunks.push({
          content: sub,
          metadata: {
            heading: sectionPath[sectionPath.length - 1] || "untitled",
            section_path: [...sectionPath],
            chunk_index: i,
            source_file: "",
          },
        });
      });
    } else {
      chunks.push({
        content: section,
        metadata: {
          heading: sectionPath[sectionPath.length - 1] || "untitled",
          section_path: [...sectionPath],
          chunk_index: 0,
          source_file: "",
        },
      });
    }
  }
  return chunks;
}
```

The key advantage: metadata. When you retrieve a chunk, you know exactly where it came from in the document hierarchy. This helps the LLM cite sources and helps you debug retrieval issues.
Chunking Strategy Comparison
| Strategy | Quality | Speed | Complexity | Best For |
|---|---|---|---|---|
| Fixed-size | Low | Fast | Minimal | Quick prototypes, uniform text (logs, raw data) |
| Recursive character | Medium | Fast | Low | General-purpose, mixed content |
| Semantic | High | Slow | Medium | Nuanced content where topic boundaries matter |
| Document-aware | High | Medium | High | Structured docs, technical writing, legal text |
My recommendation: start with recursive character splitting to get a baseline. Measure retrieval quality. Switch to document-aware chunking when you need better precision — which you almost certainly will.
Embedding Model Selection
Your embedding model determines what "similar" means in your vector search. Choose wrong and semantically relevant documents will rank below irrelevant ones.
The Models That Matter in 2026
| Model | Dimensions | Max Tokens | Multilingual | Relative Cost | MTEB Score |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (configurable) | 8191 | Yes | $$$ | ~64.6 |
| OpenAI text-embedding-3-small | 1536 (configurable) | 8191 | Yes | $ | ~62.3 |
| Cohere Embed v3 | 1024 | 512 | 100+ languages | $$ | ~64.5 |
| BGE-M3 (BAAI) | 1024 | 8192 | 100+ languages | Free (self-hosted) | ~64.2 |
| E5-Large-v2 | 1024 | 512 | English-focused | Free (self-hosted) | ~62.7 |
Key tradeoffs:
- OpenAI text-embedding-3-large has the best overall quality and the simplest integration. But you pay per token and you have a vendor dependency. The dimension reduction feature (you can truncate to 256 or 1024 dims) is useful for cost-sensitive deployments.
- Cohere Embed v3 is strong for multilingual use cases and offers separate `search_document` and `search_query` input types, which improves retrieval quality. The 512 token limit is restrictive for longer chunks.
- BGE-M3 is my default for self-hosted deployments. Supports dense, sparse, and multi-vector retrieval natively. No API costs. You need GPU infrastructure.
- E5-Large-v2 is lightweight and fast. Good for English-only, latency-sensitive applications.
```python
# OpenAI embedding with dimension reduction
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], dimensions: int = 1024) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=dimensions  # Reduce from 3072 to save storage/cost
    )
    return [item.embedding for item in response.data]
```

Critical rule: the embedding model you use at index time must be the same model you use at query time. This sounds obvious but I have seen production systems break because someone upgraded the embedding model without re-indexing. Your old embeddings and new query embeddings live in different vector spaces. Similarity scores become meaningless.
Vector Database Comparison
The vector database is your retrieval engine. The choice matters more than most tutorials suggest.
| Feature | Pinecone | Weaviate | ChromaDB | Qdrant |
|---|---|---|---|---|
| Hosting | Managed only | Managed + self-hosted | Self-hosted (local) | Managed + self-hosted |
| Open Source | No | Yes | Yes | Yes |
| Hybrid Search | Sparse vectors (manual) | Native BM25 + vector | No | Sparse vectors + payload index |
| Filtering | Metadata filters | GraphQL-style filters | Where clauses | Payload filters (powerful) |
| Max Vectors | Billions (serverless) | Millions+ | Thousands-Millions | Billions |
| Latency (p99) | ~50ms | ~100ms | ~10ms (local) | ~30ms |
| Best For | Managed production at scale | Hybrid search, multi-modal | Local dev, prototyping | High-performance, complex filters |
| Pricing | Pay-per-query + storage | Free tier + usage-based | Free | Free tier + usage-based |
When to Use What
ChromaDB when you are prototyping or building a local tool. Zero config, pip install, done. Do not use it for production traffic — it is an in-process database with no replication.
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    documents=["RAG is retrieval augmented generation..."],
    metadatas=[{"source": "glossary.md"}],
    ids=["doc-001"]
)
results = collection.query(query_texts=["What is RAG?"], n_results=3)
```

Pinecone when you want managed infrastructure and your team does not want to operate a database. Serverless pricing is reasonable for moderate traffic. The main limitation is that hybrid search requires you to compute sparse vectors yourself.
Weaviate when you need native hybrid search (BM25 + vector) out of the box. This is a significant advantage — you get keyword matching and semantic matching without building it yourself.
```python
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("Document")

# Hybrid search: combines BM25 keyword + vector similarity
results = collection.query.hybrid(
    query="OAuth2 refresh token expiration",
    alpha=0.5,  # 0 = pure keyword, 1 = pure vector
    limit=5
)
```

Qdrant when you need fast filtering alongside vector search. Its payload filtering is the most flexible — you can filter by nested fields, ranges, geo coordinates, and combine them with boolean logic before vector search runs.
Hybrid Search: Why Vector-Only Retrieval Fails
Pure vector similarity search has a fundamental weakness: it matches by meaning, not by terminology. If a user asks "What is the EBITDA margin for Q3?" and your documents use "EBITDA" as a specific term, vector search might return chunks about "profitability metrics" or "earnings analysis" that are semantically similar but do not contain the specific data.
Hybrid search combines vector similarity (semantic) with BM25 keyword search (lexical). The results are fused using Reciprocal Rank Fusion (RRF) or a weighted linear combination.
```typescript
interface SearchResult {
  id: string;
  content: string;
  score: number;
}

function reciprocalRankFusion(
  vectorResults: SearchResult[],
  keywordResults: SearchResult[],
  k: number = 60
): SearchResult[] {
  const scores = new Map<string, { score: number; content: string }>();
  vectorResults.forEach((result, rank) => {
    const rrf = 1 / (k + rank + 1);
    const existing = scores.get(result.id);
    scores.set(result.id, {
      score: (existing?.score || 0) + rrf,
      content: result.content,
    });
  });
  keywordResults.forEach((result, rank) => {
    const rrf = 1 / (k + rank + 1);
    const existing = scores.get(result.id);
    scores.set(result.id, {
      score: (existing?.score || 0) + rrf,
      content: result.content,
    });
  });
  return Array.from(scores.entries())
    .map(([id, { score, content }]) => ({ id, content, score }))
    .sort((a, b) => b.score - a.score);
}
```

In my experience, hybrid search with alpha around 0.6-0.7 (favoring vector) consistently outperforms pure vector or pure keyword search. The keyword component catches exact term matches that embedding models sometimes miss, especially for domain-specific jargon, product names, and acronyms.
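The weighted linear alternative to RRF is a few lines once both score lists are min-max normalized; `weighted_fusion` is my own helper name, with alpha playing the same role as in the hybrid weighting described above:

```python
def weighted_fusion(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.65,  # 1.0 = pure vector, 0.0 = pure keyword
) -> list[tuple[str, float]]:
    """Combine vector and keyword scores with a linear weight after normalizing."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
        return {k: (v - lo) / span for k, v in scores.items()}

    v, kw = normalize(vector_scores), normalize(keyword_scores)
    doc_ids = set(v) | set(kw)  # a document may appear in only one result list
    combined = {i: alpha * v.get(i, 0.0) + (1 - alpha) * kw.get(i, 0.0) for i in doc_ids}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Unlike RRF, this uses the raw scores rather than ranks, so it is more sensitive to score calibration between the two retrievers; normalization is what makes the combination meaningful.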
Production Architecture
Here is the architecture that works. Every component exists for a reason.
RAG Production Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ Source Data ──→ Parser ──→ Chunker ──→ Embedder ──→ Vector Store │
│ (docs, APIs, (PDF, (document- (batch (with │
│ databases) markdown, aware + embed, metadata) │
│ HTML) metadata) queue) │
│ │
│ ┌─── Change Detection ───── Incremental Re-index ──┐ │
│ │ (hash comparison, (update changed, │ │
│ │ timestamps) delete removed) │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL PIPELINE │
│ │
│ User Query │
│ │ │
│ ├──→ Query Expansion (optional: rephrase, HyDE) │
│ │ │
│ ├──→ Hybrid Search (vector + BM25) │
│ │ │ │
│ │ ├──→ Metadata Filtering (source, date, category) │
│ │ │ │
│ │ └──→ Top-N candidates (N=20) │
│ │ │
│ ├──→ Reranker (Cohere Rerank / cross-encoder) │
│ │ │ │
│ │ └──→ Top-K results (K=5) │
│ │ │
│ └──→ Context Assembly ──→ LLM Generation ──→ Response │
└─────────────────────────────────────────────────────────────────────┘

The Reranking Step Most Tutorials Skip
Retrieve 20 candidates, rerank to 5. This single addition improved our retrieval precision by 15-25% across multiple projects. The initial vector search casts a wide net. The reranker — a cross-encoder model — scores each candidate against the query with much higher accuracy than embedding similarity alone.
```python
import cohere

co = cohere.ClientV2()

def rerank_results(query: str, documents: list[str], top_k: int = 5) -> list[dict]:
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_k,
        model="rerank-v3.5"
    )
    return [
        {
            "index": result.index,
            "relevance_score": result.relevance_score,
            "text": documents[result.index]
        }
        for result in response.results
    ]

# Usage
candidates = vector_search(query, n=20)  # Cast wide net
reranked = rerank_results(query, [c.text for c in candidates], top_k=5)
```

The cost of reranking 20 documents is negligible compared to the improvement in answer quality.
Evaluation with RAGAS
You cannot improve what you cannot measure. RAGAS (Retrieval Augmented Generation Assessment) gives you four metrics that cover the full RAG pipeline:
| Metric | What It Measures | Range | Target |
|---|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | 0-1 | > 0.85 |
| Answer Relevancy | Does the answer actually address the question? | 0-1 | > 0.80 |
| Context Precision | Are retrieved chunks relevant to the question? | 0-1 | > 0.75 |
| Context Recall | Did retrieval find all relevant information? | 0-1 | > 0.70 |
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Build your evaluation dataset (50-100 examples minimum)
eval_data = {
    "question": [
        "What is the OAuth2 refresh token expiration policy?",
        "How do I configure rate limiting for the API gateway?",
    ],
    "answer": [
        # Generated by your RAG pipeline
        "Refresh tokens expire after 30 days of inactivity...",
        "Rate limiting is configured in the gateway.yaml file...",
    ],
    "contexts": [
        # Retrieved chunks that were passed to the LLM
        ["Refresh tokens: tokens expire after 30 days...", "OAuth2 spec requires..."],
        ["Gateway configuration: rate_limit section...", "API throttling docs..."],
    ],
    "ground_truth": [
        # Human-verified correct answers
        "Refresh tokens expire after 30 days of inactivity. Active tokens are renewed.",
        "Add a rate_limit block to gateway.yaml with requests_per_second and burst_size.",
    ],
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.87,
#  'context_precision': 0.81, 'context_recall': 0.76}
```

How I use RAGAS in practice:
- Curate 50-100 question-answer pairs from real user queries. Do not synthesize them — use actual questions people asked your system.
- Run evaluation after every pipeline change (new chunking strategy, different embedding model, reranker update).
- Set regression thresholds in CI. If faithfulness drops below 0.80, the pipeline change does not deploy.
- Track metrics over time. Degradation usually means your source data changed and your embeddings are stale.
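The regression-threshold gate from the list above can be sketched in a few lines. The threshold values are illustrative (the faithfulness floor matches the 0.80 mentioned above), and `check_regression` is my own name:

```python
# Minimal CI gate: fail the build when any RAGAS metric drops below its
# regression floor. Floor values are illustrative; tune them per project.
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.65,
}

def check_regression(results: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means safe to deploy."""
    failures = []
    for metric, floor in THRESHOLDS.items():
        score = results.get(metric, 0.0)  # a missing metric counts as failing
        if score < floor:
            failures.append(f"{metric}: {score:.2f} < {floor:.2f}")
    return failures

# In CI: run the evaluation, then gate the deploy
# failures = check_regression(eval_results)
# if failures:
#     raise SystemExit("RAGAS regression: " + "; ".join(failures))
```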
What Actually Breaks in Production
After shipping multiple RAG systems, these are the failure modes I see repeatedly.
1. Stale Embeddings
Your source documents change. Your vector store does not know. The system confidently returns outdated information because the old embedding is still the closest match.
Solution: build a change detection layer. Hash each document at ingestion time. On every sync cycle, compare hashes. Re-embed changed documents, delete removed ones.
```typescript
import crypto from "crypto";

interface DocumentRecord {
  id: string;
  content_hash: string;
  last_indexed: Date;
}

function detectChanges(
  currentDocs: Map<string, string>,
  indexedRecords: Map<string, DocumentRecord>
): { toAdd: string[]; toUpdate: string[]; toDelete: string[] } {
  const toAdd: string[] = [];
  const toUpdate: string[] = [];
  const toDelete: string[] = [];
  for (const [id, content] of currentDocs) {
    const hash = crypto.createHash("sha256").update(content).digest("hex");
    const existing = indexedRecords.get(id);
    if (!existing) {
      toAdd.push(id);
    } else if (existing.content_hash !== hash) {
      toUpdate.push(id);
    }
  }
  for (const id of indexedRecords.keys()) {
    if (!currentDocs.has(id)) {
      toDelete.push(id);
    }
  }
  return { toAdd, toUpdate, toDelete };
}
```

2. Context Window Overflow
You retrieve 10 chunks, each 1000 tokens. That is 10,000 tokens of context before your system prompt and the user query. With a complex system prompt, you are pushing into expensive territory and potentially exceeding the context window on smaller models.
Solution: retrieve more, rerank aggressively, keep only the top 3-5 chunks. Quality over quantity. A focused 2000-token context outperforms a sprawling 10,000-token one.
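A minimal sketch of that budget-driven context assembly, using a rough four-characters-per-token estimate in place of a real tokenizer; `assemble_context` is my own helper name:

```python
def assemble_context(
    ranked_chunks: list[str],
    max_tokens: int = 2000,
    chars_per_token: int = 4,  # rough heuristic; swap in a real tokenizer
) -> str:
    """Greedily keep top-ranked chunks until the token budget is spent."""
    kept: list[str] = []
    budget = max_tokens * chars_per_token  # budget tracked in characters
    for chunk in ranked_chunks:
        if len(chunk) > budget:
            break  # stop at the first chunk that does not fit; rank order matters
        kept.append(chunk)
        budget -= len(chunk)
    return "\n\n".join(kept)
```

Because the chunks arrive already reranked, the budget naturally drops the weakest candidates first.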
3. Embedding Model Mismatch
Someone updates the embedding model in the query pipeline but forgets to re-embed the entire corpus. Old embeddings and new embeddings are incompatible. Similarity scores become random noise.
Solution: version your embedding model in your vector store metadata. Add a check at query time that rejects results from a different model version.
```python
CURRENT_EMBEDDING_MODEL = "text-embedding-3-large-v2"

def safe_search(query: str, collection, top_k: int = 5):
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        where={"embedding_model": CURRENT_EMBEDDING_MODEL}
    )
    if len(results["ids"][0]) < top_k:
        logger.warning(
            f"Only {len(results['ids'][0])} results matched current model version. "
            f"Re-indexing may be needed."
        )
    return results
```

4. Poor Chunk Boundaries
A table spans two chunks. A code block is split in half. A critical definition sits at the end of one chunk and its explanation at the start of the next.
Solution: document-aware chunking (as described above) helps, but overlap alone is not enough. For tables and code blocks, detect them as atomic units and never split them. Add a pre-processing step that identifies these structures.
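A sketch of that pre-processing step for markdown sources: fenced code blocks are carved out as atomic segments that downstream chunking must never split (tables could get the same treatment with a pipe-row pattern). `split_atomic` is my own name:

```python
import re

FENCE = "`" * 3  # built this way so the example itself survives markdown rendering

def split_atomic(markdown: str) -> list[tuple[str, bool]]:
    """Return (segment, is_atomic) pairs; atomic segments must never be re-split."""
    # Capture fenced blocks so re.split keeps them as their own segments
    pattern = re.compile("(" + FENCE + r".*?" + FENCE + ")", re.DOTALL)
    segments: list[tuple[str, bool]] = []
    for part in pattern.split(markdown):
        if not part.strip():
            continue  # drop whitespace-only gaps between segments
        segments.append((part, part.startswith(FENCE)))
    return segments
```

Downstream, only the non-atomic segments go through the normal chunker; atomic ones are emitted as single chunks (or attached whole to a neighboring chunk).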
5. The "Relevant but Wrong" Retrieval
The retrieved chunks are semantically related to the query but do not contain the answer. The LLM, seeing related context, generates a plausible-sounding but fabricated answer instead of saying "I don't know."
Solution: add a faithfulness check. After generation, score whether the answer is grounded in the retrieved context. If the faithfulness score is below a threshold, return a fallback response. This is where the RAGAS faithfulness metric becomes a runtime guardrail, not just an offline evaluation tool.
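A sketch of the guardrail itself, where `score_faithfulness` is a placeholder for whatever grounding scorer you wire in (an LLM judge, an NLI model, or a single-example RAGAS run), and `guarded_answer` is my own name:

```python
from typing import Callable

FALLBACK = "I could not find a grounded answer in the available documents."

def guarded_answer(
    question: str,
    answer: str,
    contexts: list[str],
    score_faithfulness: Callable[[str, str, list[str]], float],  # placeholder scorer
    threshold: float = 0.80,
) -> str:
    """Return the generated answer only if it is sufficiently grounded."""
    score = score_faithfulness(question, answer, contexts)
    return answer if score >= threshold else FALLBACK
```

The fallback costs one extra scoring call per request, which is usually cheap relative to generation, and it converts silent hallucinations into an honest "I don't know."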
The Minimal Production RAG Stack
If I were starting a new RAG project today with real production requirements, here is exactly what I would pick:
| Component | Choice | Why |
|---|---|---|
| Chunking | Document-aware + recursive fallback | Best balance of quality and complexity |
| Embedding | OpenAI text-embedding-3-large (1024 dims) | Quality, simplicity, reduced dimensions for cost |
| Vector DB | Qdrant (self-hosted) or Pinecone (managed) | Depends on ops preference |
| Search | Hybrid (vector + BM25) with alpha=0.65 | Catches both semantic and lexical matches |
| Reranker | Cohere Rerank v3.5 | Best quality-to-cost ratio |
| Evaluation | RAGAS, 75+ eval pairs, CI integration | Regression prevention |
| LLM | Claude Sonnet for generation | Strong instruction following, good grounding |
Start with this stack, measure with RAGAS, and iterate on the component that scores lowest. In my experience, improving chunking gives you the biggest lift first, followed by adding reranking, followed by hybrid search tuning.
Closing Thoughts
RAG is deceptively simple in concept and genuinely difficult in execution. The retrieval step is where most systems fail, and it is the step that gets the least attention in tutorials. The generation model is only as good as the context you feed it.
Focus on three things: chunk quality, retrieval precision, and measurement. Everything else — the specific vector database, the exact embedding model, the framework you use — is a secondary decision that you can change later. But if your chunks are bad and you are not measuring retrieval quality, no amount of model upgrades or prompt engineering will save you.
Build the evaluation harness first. Then iterate on everything else with data, not intuition.