Deep Dive 06-Retrieval and Retrieval-Augmented Generation¶
A self-contained reference. By the end of this chapter you can implement a production-grade RAG system from first principles, evaluate it with the right metrics, and reason about every design choice (chunk size, index type, hybrid weight, rerank depth) on the basis of what the math says rather than what a tutorial says.
0. Reading guide¶
This chapter is long because retrieval is wide. The shape:
- Why retrieval at all-the parametric-knowledge problem.
- Sparse retrieval (BM25), with the formulas derived.
- Dense retrieval (bi-encoders, contrastive training, hard negatives).
- Cross-encoders and the rerank pattern.
- Vector indexes (HNSW, IVF, PQ, DiskANN) and the recall-latency curve.
- Vector DB landscape and a decision matrix.
- Hybrid retrieval: RRF and convex combination.
- Chunking, including late chunking and Anthropic's contextual retrieval.
- The reference RAG pipeline.
- Query rewriting (HyDE, multi-query, step-back).
- Evaluation-retrieval and generation. RAGAS.
- Failure modes (lost-in-the-middle, retrieval-generation gap, ...).
- Multi-hop and agentic retrieval; GraphRAG.
- Metadata filtering, pre- vs post-filter.
- Production concerns (freshness, versioning, tenancy, citations).
- Self-host vs API embedding decision.
- Six exercises with worked solutions and acceptance criteria.
If you only have an hour, read sections 2, 3, 7, 8, 11, 17. Come back for the rest when you're putting it under real load.
1. Why retrieval¶
1.1 The parametric-knowledge problem¶
A pretrained LLM stores a snapshot of the world inside its weights. That snapshot has three problems for almost every product use case:
- Stale. Training cutoff is some date in the past. Your customer's pricing page changed yesterday; the model has no idea.
- Lossy. Even within the training window, models compress information. Long-tail facts (the second-tier feature flag, the SLA in contract revision 14) get crushed.
- Unsourced. The model cannot cite where a fact came from, so the user cannot verify and you cannot audit.
Hallucination is the visible symptom: the model produces text that is fluent, plausible, and wrong. The root cause is asking a parametric system to answer non-parametric questions.
Retrieval-Augmented Generation (RAG) inverts the assumption. Instead of asking the model to know, you ask it to read. At query time you retrieve relevant text from an external corpus (the knowledge base) and include it in the prompt. The model's job becomes "answer the question using only this context, and cite the source."
The basic shape, which we will refine all chapter:
user_query
│
▼
retriever ── reads ──▶ corpus (indexed)
│
▼ top-k passages
prompt builder
│
▼ "Answer using only:\n{passages}\n\nQ: {query}"
LLM
│
▼
answer + citations
1.2 What retrieval is not¶
- It is not fine-tuning. Fine-tuning teaches behavior; retrieval teaches facts. Use retrieval for changing knowledge; use fine-tuning for changing style, format, or skill.
- It is not memory. Memory is short-horizon, agent-scoped state. Retrieval is long-horizon, corpus-scoped knowledge.
- It is not a vector database. The vector DB is one component (the index for dense retrieval). Real systems use sparse + dense + reranker + filters + freshness pipeline.
1.3 When retrieval is the right tool¶
- Knowledge that changes faster than your training cycle.
- Knowledge whose provenance must be auditable.
- Knowledge that is too large or too sparse to fit in context.
- Anything that needs citations.
When retrieval is not the right tool: mathematical reasoning, code execution, logic puzzles where the answer is computed not looked up.
2. Sparse retrieval-BM25¶
Sparse retrieval treats documents and queries as bags of terms over a vocabulary V. The score of a document for a query is a sum of per-term contributions. It is "sparse" because each document's representation is mostly zeros-only the terms it contains are non-zero.
2.1 From counting to TF-IDF¶
Start with three quantities. Let t be a term, d a document, and D
the corpus of N documents.
- Term frequency: tf(t, d) = number of times t occurs in d.
- Document frequency: df(t) = number of documents in D that contain t.
- Inverse document frequency: idf(t) = log(N / df(t)).
The TF-IDF weight is

    tfidf(t, d) = tf(t, d) · idf(t)

The intuition is "rare words that occur often in this document matter." But TF-IDF has two problems:
- tf is unbounded-a term repeated 100 times scores 100x a single occurrence, which doesn't match human intuition (the second mention is informative, the hundredth is not).
- Long documents win unfairly because they contain more total tf.
BM25 fixes both with a single ranking function.
2.2 BM25, derived¶
The Okapi BM25 score for a query q = (t_1, ..., t_m) against a document
d is
score(q, d) = Σ     IDF(t) · tf(t, d) · (k1 + 1)
             t∈q  ──────────────────────────────────────────
                  tf(t, d) + k1 · (1 − b + b · |d| / avgdl)
Where:
- |d| = length of document d (in tokens).
- avgdl = average document length over the corpus.
- k1 ∈ [1.2, 2.0] typically (default ~1.2). Controls term-frequency
saturation.
- b ∈ [0, 1] typically (default 0.75). Controls length normalization.
- The IDF used in BM25 is the smoothed form
IDF(t) = log((N − df(t) + 0.5) / (df(t) + 0.5) + 1).
Reading the formula¶
The numerator tf · (k1 + 1) is monotone increasing in tf but bounded
above: as tf → ∞ the term converges to (k1 + 1) · IDF(t). The denominator's
tf + k1 · (...) term is what produces the saturation: doubling tf
doesn't double the score, it pushes you closer to the asymptote. That
matches the "second mention informative, hundredth isn't" intuition.
The (1 − b + b · |d|/avgdl) factor is length normalization. If
|d| = avgdl it equals 1 (no penalty). If the document is twice the
average length, it scales to (1 − b) + 2b = 1 + b, which (with b = 0.75)
makes the denominator 1.75x bigger and so penalizes the score. That
matches "long documents shouldn't win just by being long."
Parameter intuition¶
- k1 low (~1.0): saturation kicks in fast-the second occurrence is already near-asymptotic. Useful when documents repeat keywords formulaically.
- k1 high (~2.0): scores keep growing with tf. Useful when raw frequency is genuinely informative.
- b = 0: no length normalization. Long docs win.
- b = 1: full normalization. Short docs win.
- b = 0.75: the conventional sweet spot.
You will rarely beat the defaults without a held-out tuning set.
2.3 Why BM25 is still a strong baseline in 2026¶
The transformer revolution ate a lot of fields, but BM25 has a peculiar property: it is unbeatable on the long tail of rare-term queries. If your query is "ERR-7842 retry policy" (an exact error code), a dense embedding will get clever and find documents about similar error patterns; BM25 will find the exact document with that string in it. Most production retrieval errors are exact-match misses, not semantic-similarity misses.
Rules of thumb that have held since the original IR work:
- BM25 alone wins on short, keyword-y queries with exact-term overlap.
- Dense alone wins on long, paraphrased, conversational queries.
- Hybrid wins on average. We'll get there in section 7.
2.4 Implementing BM25¶
In production, use Elasticsearch / OpenSearch (their default scorer is
BM25 since v5) or Postgres tsvector for moderate scale. For a small
corpus or experimentation, rank_bm25 in Python is fine.
# pip install rank_bm25
from rank_bm25 import BM25Okapi
import re
def tokenize(text: str) -> list[str]:
return re.findall(r"[a-z0-9]+", text.lower())
corpus = [
"BM25 is a bag-of-words ranking function",
"Dense retrieval uses neural embeddings",
"Hybrid retrieval combines sparse and dense signals",
# ... thousands more
]
tokenized = [tokenize(d) for d in corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)
def search(query: str, k: int = 10) -> list[tuple[int, float]]:
scores = bm25.get_scores(tokenize(query))
top_idx = scores.argsort()[::-1][:k]
return [(int(i), float(scores[i])) for i in top_idx]
A from-scratch implementation (we'll use this in Exercise 1):
import math
from collections import Counter
class BM25:
def __init__(self, docs: list[list[str]], k1: float = 1.5, b: float = 0.75):
self.k1, self.b = k1, b
self.docs = docs
self.N = len(docs)
self.avgdl = sum(len(d) for d in docs) / self.N
# df[t] = number of docs containing t
self.df: Counter[str] = Counter()
for d in docs:
for t in set(d):
self.df[t] += 1
# cache idf
self.idf = {
t: math.log((self.N - df + 0.5) / (df + 0.5) + 1.0)
for t, df in self.df.items()
}
self.tf = [Counter(d) for d in docs]
def score(self, query: list[str], i: int) -> float:
d_len = len(self.docs[i])
tf_d = self.tf[i]
s = 0.0
for t in query:
if t not in self.idf:
continue
tf = tf_d.get(t, 0)
if tf == 0:
continue
num = tf * (self.k1 + 1)
den = tf + self.k1 * (1 - self.b + self.b * d_len / self.avgdl)
s += self.idf[t] * num / den
return s
def topk(self, query: list[str], k: int = 10) -> list[tuple[int, float]]:
scores = [(i, self.score(query, i)) for i in range(self.N)]
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:k]
This is ~40 lines and reproduces the BM25 you get from any library. If you can write it, you understand it.
3. Dense retrieval¶
3.1 The bi-encoder¶
A bi-encoder is two (often shared-weight) neural encoders E_q and E_d
that map text to fixed-dim vectors. Score is a similarity:
sim(q, d) = E_q(q) · E_d(d) (dot product)
or cos(E_q(q), E_d(d)) (cosine; equivalent if vectors are L2-normalized)
The crucial property: documents are encoded once, offline and stored in a vector index. At query time you only encode the query (one forward pass) and run an approximate nearest-neighbor search. This is what makes dense retrieval cheap enough for production.
Compare this to a cross-encoder (next section), which encodes (q, d)
together. Cross-encoders cannot pre-compute, so they are too expensive
to run over millions of documents-you use them only as a reranker on a
short candidate list.
3.2 Training a bi-encoder: contrastive learning¶
A bi-encoder is trained on (query, positive_doc, negative_docs)
triples. The objective is "pull positive close, push negatives far." The
standard loss is InfoNCE (also called multi-class N-pair):
exp(sim(q, d+) / τ)
L = − log ─────────────────────────────────────────────────
exp(sim(q, d+)/τ) + Σ_{d−} exp(sim(q, d−)/τ)
Where τ is a temperature (typically 0.01–0.1 for cosine; 1 is fine for
dot product if vectors aren't normalized). This is just cross-entropy
over the "which doc is the right one" classification problem with the
positive against all negatives.
Why InfoNCE works¶
It pushes the positive's similarity to dominate the softmax. The gradient
on the positive pulls it toward q; the gradient on each negative
pushes it away. Crucially, the loss is a function of relative
similarities, not absolute ones, so the encoder learns a geometry rather
than a calibrated score.
In-batch negatives¶
The cheap trick that made dense retrieval practical: in a batch of B query-positive pairs, each query's negatives are the other B-1 positives in the batch. You get B-1 negatives per query for free. With B = 256 you have 255 negatives per query per gradient step.
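A minimal sketch of InfoNCE with in-batch negatives in PyTorch, assuming q_emb and d_emb are already L2-normalized query/positive embeddings where row i of d_emb is the positive for query i:

import torch
import torch.nn.functional as F

def info_nce_in_batch(q_emb: torch.Tensor, d_emb: torch.Tensor,
                      tau: float = 0.05) -> torch.Tensor:
    # logits[i, j] = sim(query i, doc j); each positive sits on the diagonal
    logits = (q_emb @ d_emb.T) / tau                         # [B, B]
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    # cross-entropy over "which doc is the right one" is exactly InfoNCE
    return F.cross_entropy(logits, labels)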
Hard-negative mining-the quality lever¶
Random negatives are easy. The model learns to separate "tomato soup" from "differential equations" but not "tomato soup" from "tomato bisque." Hard negatives are what teach the fine-grained distinctions.
The standard recipe:
1. Train a v0 model with random/in-batch negatives.
2. Use v0 to retrieve top-100 for each training query.
3. Take docs that v0 ranked high but are not the labeled positive. These are the hard negatives.
4. Retrain (v1) with hard negatives mixed into the InfoNCE loss.
5. Optionally iterate (v2 mines its own hard negatives, etc.).
This is where most of the quality of modern embedding models comes from. The architecture is "another transformer encoder"; the data pipeline is where the magic is.
Watch out for false negatives¶
A "hard negative" mined this way might actually be a relevant document that just wasn't labeled. Pushing it away from the query damages the model. In practice people apply a similarity ceiling (don't mine negatives that are too similar to the positive) or use a teacher model (cross-encoder) to filter out likely false negatives.
3.3 Modern embedding models (2024–2026 landscape)¶
Approximate landscape-exact numbers shift quarterly, treat as illustrative.
| Family | Provider | Approx dim | Notes |
|---|---|---|---|
| text-embedding-3-small / -large | OpenAI | 1536 / 3072 (truncatable) | API; supports Matryoshka truncation |
| Cohere embed v3 / v4 | Cohere | ~1024 | API; multilingual; quantization-friendly |
| Voyage (voyage-3 family) | Voyage | ~1024–2048 | API; competitive on RAG benchmarks |
| BGE (M3, large, base) | BAAI | 384 / 768 / 1024 | Open weights; strong retrieval |
| E5 (small/base/large/multilingual) | Microsoft | 384–1024 | Open weights; trained with weak supervision |
| GTE (small/base/large) | Alibaba | 384–1024 | Open weights |
| jina-embeddings-v3 / late-chunking | Jina | ~1024 | Open; long-context, late-chunking-friendly |
Picking one in 2026 looks like:
- Need a strong default fast → text-embedding-3-large or Voyage.
- Need open weights, self-hosted → BGE-M3 or E5-large or jina-v3.
- Need multilingual → BGE-M3, Cohere embed multilingual, jina-v3.
- Need long context (>8k input) → jina-v3, BGE-M3, GTE-large.
- Need quantized for edge → BGE-small, E5-small.
Don't agonize. Pick one with strong scores on the benchmarks closest to your domain (BEIR, MTEB), measure on your own eval set (section 11), swap if needed.
3.4 Embedding dimensionality¶
Dim D is a quality/cost knob.
- Quality: larger D usually helps but with diminishing returns. From D = 384 to D = 1024 you typically gain a few NDCG points. From 1024 to 3072 you gain less.
- Storage: 4 · D · N bytes for float32; halves for float16; an eighth for int8 quantization. At N = 100M docs and D = 1024 fp32 that's 400 GB before any index overhead.
- Index latency: HNSW search cost scales with D (distance computation).
- Quantization: most production indexes quantize to int8 or use scalar quantization with negligible recall loss.
3.5 Matryoshka embeddings¶
A standard embedding is a single vector of fixed size; truncating it
breaks it. Matryoshka Representation Learning (Kusupati et al., 2022,
adopted broadly by 2024) trains the embedding so that the first k
dimensions of the D-dim vector are themselves a useful (lower-quality,
smaller) embedding. Properly trained, you can store the full 3072-dim
vector but query with the first 256 dims for a cheap first-stage filter,
then rescore the candidates with the full vector. OpenAI's
text-embedding-3 and several open models support this natively.
Why care: it lets you spend memory once on a "fat" index but get cheap-pass and expensive-pass retrieval out of it without re-embedding.
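A sketch of the two-pass pattern in NumPy, assuming docs holds full-dimension Matryoshka embeddings (model-specific details omitted; coarse_dims and shortlist are illustrative):

import numpy as np

def matryoshka_search(q: np.ndarray, docs: np.ndarray,
                      coarse_dims: int = 256, shortlist: int = 1000,
                      k: int = 10) -> np.ndarray:
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    # cheap pass: cosine over the first coarse_dims dimensions only
    coarse = norm(docs[:, :coarse_dims]) @ norm(q[:coarse_dims])
    cand = np.argsort(coarse)[::-1][:shortlist]
    # expensive pass: rescore the shortlist with the full vectors
    full = norm(docs[cand]) @ norm(q)
    return cand[np.argsort(full)[::-1][:k]]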
4. Cross-encoders and reranking¶
4.1 Cross-encoder architecture¶
A cross-encoder takes (q, d) together as a single input
[CLS] q [SEP] d [SEP] and predicts a relevance score (usually a single
scalar from a regression head). Because the transformer attends over q
and d jointly, it can model subtle interactions a bi-encoder cannot —
for instance, "the query asks about retries on 5xx errors; this doc
mentions retries but only for 4xx." A bi-encoder collapses each side to
a vector before they ever meet; a cross-encoder lets them attend.
The cost: you cannot pre-compute. Every (q, d) pair is a fresh forward
pass. For a corpus of 10M documents this is utterly infeasible at query
time-you'd be computing 10M forward passes per user query.
4.2 The standard pipeline¶
The dominant production architecture is a two-stage retrieve-and-rerank funnel:
sparse (BM25) ┐
user query ├──▶ candidates (top-100 to top-1000)
dense (embed) ┘
│
▼
cross-encoder rerank
│
▼
top-10 to LLM
Stage 1 is cheap and recall-oriented: sparse + dense pulls in everything that could be relevant. Stage 2 is expensive and precision-oriented: the cross-encoder reorders the short list. The product is much closer to cross-encoder quality at near-bi-encoder cost.
4.3 Production rerankers¶
- Cohere Rerank (rerank-3 / rerank-3.5): API; multilingual; near-SOTA on most benchmarks; the easy default.
- BGE-reranker (base / large / v2-m3): open weights; self-hosted.
- Cross-encoders on Hugging Face: cross-encoder/ms-marco-MiniLM-L-6-v2 is a small, fast, decent default; bge-reranker-v2-m3 is the modern open-weights pick.
4.4 The cost-quality knobs¶
You have three numbers to set:
- k_retrieve: how many candidates the first stage pulls. Bigger = more recall, more rerank cost. Common default: 50–200.
- k_rerank: how many of those the cross-encoder scores. Usually equal to k_retrieve, but you can do progressive: rerank top-50, only the top-10 go to the LLM.
- k_context: how many reranked docs you put in the LLM prompt. Common default: 5–10.
A strong baseline on the Pareto frontier: k_retrieve = 100, k_rerank = 100,
k_context = 5. Tune from your eval set, not vibes.
import cohere
co = cohere.Client(api_key="...")
def hybrid_retrieve(query: str, k: int = 100) -> list[dict]:
    bm25_hits = bm25_search(query, k=k)    # from section 2
    dense_hits = dense_search(query, k=k)  # from section 3
    # note top_k, not k: in rrf_merge the k parameter is the RRF constant
    return rrf_merge(bm25_hits, dense_hits, top_k=k)  # from section 7
def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
docs = [c["text"] for c in candidates]
resp = co.rerank(query=query, documents=docs, top_n=top_k,
model="rerank-3.5")
return [candidates[r.index] | {"rerank_score": r.relevance_score}
for r in resp.results]
5. Vector indexing¶
5.1 Why exact NN doesn't scale¶
Exact nearest-neighbor search on D-dim vectors over a corpus of N takes
O(N · D) per query-you must compute the distance to every document.
At N = 10M, D = 1024, that's 10^10 floating-point operations per query,
several hundred milliseconds even on optimized BLAS. At N = 1B, it's
unworkable.
Approximate NN (ANN) trades a small loss in recall for orders-of-magnitude faster search. The fundamental algorithms:
- HNSW (graph-based)-fastest in-memory, default for most vector DBs.
- IVF (cluster-based)-partitions space, search inside likely clusters only.
- PQ (product quantization)-compresses vectors so the index fits in RAM at billion scale.
- DiskANN-graph index designed for SSD; billion-scale on a single machine.
5.2 HNSW-Hierarchical Navigable Small World¶
HNSW (Malkov & Yashunin, 2016) builds a multi-layer proximity graph.
Build¶
For each vector, draw a level l = ⌊−ln(U) · mL⌋ with U ~ Uniform(0, 1) and
mL = 1/ln(M), so higher levels are exponentially rarer. Insert the vector
into all layers from l down to 0. At each layer:
1. Greedily walk from an entry point toward the closest existing node.
2. Find the M nearest existing nodes; create bidirectional edges.
3. Prune neighbors using a heuristic (keep diverse, not just closest).
The result is a graph where layer 0 contains all nodes (dense), upper layers contain only a fraction (sparse, long-range edges).
Search¶
Start at an entry point in the top layer.
1. Greedy descent: at each layer, walk to the locally closest neighbor.
Move down a layer.
2. At layer 0, run a beam search (priority queue of size efSearch)
until no closer neighbor is found.
3. Return the k closest of those visited.
Parameters¶
- M: number of edges per node per layer. Higher = better recall, more memory and build time. Typical 16–48.
- efConstruction: candidate-list size at build time. Higher = better graph quality, slower build. Typical 200–800.
- efSearch: candidate-list size at query time. The recall/latency knob. Higher = better recall, slower. Typical 32–512.
You tune efSearch against your latency target. For most apps, ef = 64
gives recall@10 above 0.98 with sub-10ms p99 latency on 1M vectors.
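A minimal hnswlib example showing all three parameters in play (random vectors stand in for real embeddings):

# pip install hnswlib
import hnswlib
import numpy as np

dim, n = 1024, 100_000
data = np.float32(np.random.rand(n, dim))

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=400, M=16)
index.add_items(data, np.arange(n))

index.set_ef(64)  # efSearch: walk this up/down the recall-latency curve
labels, distances = index.knn_query(data[:5], k=10)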
5.3 IVF-Inverted File¶
Partition the corpus into nlist clusters (k-means on embeddings).
Store, for each cluster, the list of vector IDs in it (the inverted
file).
At query: compute distance from query to all nlist centroids. Pick the
nprobe closest centroids. Search inside their inverted lists only.
- nlist: typically ~sqrt(N). Bigger N → bigger nlist.
- nprobe: 1 to nlist. Bigger = more recall, more cost.
Pure IVF is rarely used today because HNSW dominates in-memory. But IVF is the natural pairing for product quantization.
5.4 PQ-Product Quantization¶
PQ compresses vectors aggressively. Split the D-dim vector into m
sub-vectors of dim D/m. For each sub-vector, run k-means with K = 256
codes. Now each vector is m bytes (one byte per sub-vector code).
A 1024-dim fp32 vector (4 KB) becomes m = 64 bytes-64x compression. Distance computations use precomputed lookup tables (one per query, per sub-vector) so they remain fast.
PQ on its own loses recall. The standard combination is IVF-PQ: IVF
for partitioning + PQ for compression of the residuals. FAISS's
IndexIVFPQ is the canonical implementation.
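A minimal FAISS IVF-PQ sketch (random data as a stand-in; nlist and nprobe must be tuned on your corpus):

# pip install faiss-cpu
import numpy as np
import faiss

d, nlist, m = 1024, 4096, 64            # m sub-vectors -> 64 bytes/vector
xb = np.float32(np.random.rand(1_000_000, d))

quantizer = faiss.IndexFlatL2(d)        # coarse centroids for IVF
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per code
index.train(xb[:200_000])               # k-means for centroids + PQ codebooks
index.add(xb)

index.nprobe = 16                       # clusters searched per query
D, I = index.search(xb[:5], 10)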
5.5 DiskANN¶
For billion-scale on a single machine, you can't fit even quantized vectors in RAM affordably. DiskANN (Microsoft, 2019) builds an HNSW-like graph that lives on SSD, with a small in-RAM cache for hot nodes. Search performs a few SSD reads per query (the graph is designed to be SSD-friendly: nodes laid out for sequential reads, beam width tuned for SSD random-read latency).
DiskANN is what underlies several of the billion-scale vector services (it's behind parts of Azure Cognitive Search and is the algorithm benchmark winner at billion scale on the BigANN benchmark).
5.6 The recall–latency tradeoff¶
You will see this curve everywhere:
recall@10
1.0 │
│ ───●── HNSW (ef=512)
│ ──●── HNSW (ef=128)
│ ──●── HNSW (ef=64)
│ ──●── HNSW (ef=32)
0.9 │ ●
│
└──────────────────────────► latency (ms)
1 5 20 100
Every ANN library exposes a knob (efSearch for HNSW, nprobe for IVF)
that walks this curve. You always benchmark on your data: indexes
behave differently depending on intrinsic dimension and clusterability.
6. Vector DB landscape¶
6.1 The contenders (2026)¶
- pgvector (Postgres extension): the choice when you already have Postgres. Supports both IVFFlat and HNSW (since pgvector 0.5). Filtering via SQL. Up to ~10M vectors comfortable; 100M with care; not for 1B+. Wins on operational simplicity: same backup, same auth, same monitoring as your transactional DB.
- Qdrant (Rust, open source): strong filtering ("payload" with pre-filter integrated into the HNSW walk so filtering doesn't destroy recall), good multi-tenancy, snapshots. The default modern choice if you don't have Postgres and want a real vector DB.
- Weaviate: built-in modules for embedding, hybrid search via BM25 + vectors out of the box, GraphQL API. Heavier and more opinionated than Qdrant.
- Milvus / Zilliz: scale champion. If you genuinely have billions of vectors and a team to operate it, Milvus is the most-deployed open option.
- Chroma: SQLite-style local-first vector DB. Excellent for prototyping; lighter on production features.
- FAISS: not a database-an index library from Meta. You embed it in your service. Use when you need control and have engineers; it underlies many of the others' indexes.
- Pinecone, Vespa, Elasticsearch + dense_vector: managed and enterprise options worth knowing exist.
6.2 Decision matrix¶
| Situation | Pick |
|---|---|
| You already have Postgres, < 50M vectors, want single-DB ops | pgvector |
| You want a real vector DB, open source, strong filtering | Qdrant |
| You have 100M+ vectors and a team to run it | Milvus |
| Prototype on a laptop | Chroma |
| You need full control or have a custom index | FAISS embedded |
| You're on managed cloud, want zero ops, OK with $$$ | Pinecone |
| You already have Elasticsearch and need vectors too | ES dense_vector / OpenSearch k-NN |
The honest answer: pgvector for almost everything new in 2026, and graduate to Qdrant or Milvus only when you outgrow it. The cost of running a separate stateful service is real and easy to underestimate.
7. Hybrid retrieval¶
7.1 Why hybrid wins¶
Sparse and dense retrieval fail in different ways:
- Sparse misses paraphrases. "How do I cancel my subscription?" vs a doc titled "Account closure procedure"-no term overlap, BM25 scores zero.
- Dense misses exact matches. "ERR-7842" vs the exact doc with that string-dense embeddings normalize away the literal token.
Combining the two captures both signals. On almost every public benchmark, hybrid beats either alone by 3–10 points of NDCG@10.
7.2 Reciprocal Rank Fusion (RRF)¶
RRF is a rank-based fusion method that's almost embarrassingly simple
and almost always works. For each retrieval system i, let rank_i(d)
be the rank of document d (1-indexed). Then:

    RRF(d) = Σ_i 1 / (k + rank_i(d))

The constant k (typically 60) damps the head-without it the top-1
document gets disproportionate weight.
Why RRF is robust:
- It uses ranks, not scores. BM25 scores and cosine similarities are
on completely different scales; combining them directly via
weighted sum requires careful normalization. Ranks sidestep this.
- It naturally rewards documents that show up in multiple systems'
top-k.
- The constant k = 60 is from the original RRF paper (Cormack, Clarke,
Buettcher, 2009) and works essentially everywhere.
from collections import defaultdict
def rrf_merge(*ranked_lists: list[tuple[str, float]], k: int = 60,
top_k: int = 100) -> list[tuple[str, float]]:
"""ranked_lists: each is [(doc_id, score), ...] in descending score."""
fused: dict[str, float] = defaultdict(float)
for ranked in ranked_lists:
for rank, (doc_id, _score) in enumerate(ranked, start=1):
fused[doc_id] += 1.0 / (k + rank)
return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]
7.3 Convex combination¶
Alternative: normalize each system's scores to [0, 1] and take a weighted sum: score(d) = α · norm_dense(d) + (1 − α) · norm_sparse(d).
Pros: smooth, lets you tune α with grid search. Cons: requires careful score normalization (min-max per query, or softmax, or rank-as-score). If a system returns an absurd outlier score, normalization is brittle. RRF is just safer.
When does convex beat RRF? When you have a labeled tuning set and tune α per query type. In practice the gain is small. Default to RRF.
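For completeness, a sketch of the convex combination with per-query min-max normalization (inputs as {doc_id: score} dicts):

def minmax(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    rng = (hi - lo) or 1.0              # guard against identical scores
    return {d: (s - lo) / rng for d, s in scores.items()}

def convex_merge(sparse: dict[str, float], dense: dict[str, float],
                 alpha: float = 0.5, top_k: int = 100) -> list[tuple[str, float]]:
    s, d = minmax(sparse), minmax(dense)
    # missing docs contribute 0 from the system that didn't retrieve them
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in set(s) | set(d)}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]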
7.4 Sparse-dense fusion in vector DBs¶
Modern vector DBs increasingly support hybrid natively (Qdrant, Weaviate, OpenSearch, Vespa). They run BM25 and ANN in parallel and fuse. If yours doesn't, do RRF in your application code-it's 10 lines.
8. Chunking strategies¶
A retriever doesn't operate on documents; it operates on chunks. How you split a 50-page PDF into chunks before embedding determines the ceiling of your retrieval quality.
8.1 The chunking dilemma¶
- Too small (<128 tokens): each chunk loses context. The model needs many chunks to answer; relevance gets diluted.
- Too large (>2000 tokens): each chunk dilutes its own embedding (the embedding has to represent too many topics at once); irrelevant text crowds the LLM's prompt window; retrieval becomes coarse.
- Boundary-naive: splitting in the middle of a sentence or table or code block destroys the meaning.
8.2 Fixed-size chunking¶
Split every N tokens with M tokens of overlap (e.g., N = 512, M = 64). Simplest, most common, often suboptimal.
def fixed_chunks(text: str, size: int = 512, overlap: int = 64,
tokenize=lambda s: s.split()) -> list[str]:
toks = tokenize(text)
out = []
i = 0
while i < len(toks):
out.append(" ".join(toks[i:i + size]))
i += size - overlap
return out
8.3 Semantic / boundary-aware chunking¶
Split at "natural" boundaries first, only fall back to size limits. Roughly:
- Split into sections (Markdown headings, HTML <h1>/<h2>).
- Within sections, split into paragraphs.
- Within paragraphs, split into sentences.
- Greedily pack sentences into chunks up to a token budget; never split mid-sentence.
LangChain's RecursiveCharacterTextSplitter does exactly this with a
priority list of separators. LlamaIndex has equivalents.
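A minimal greedy sentence-packing sketch (naive regex sentence splitter and whitespace token counts; a real system would split sections and paragraphs first):

import re

def pack_sentences(text: str, budget: int = 512) -> list[str]:
    """Pack whole sentences into chunks of up to `budget` tokens;
    never split mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, cur, cur_len = [], [], 0
    for s in sentences:
        n = len(s.split())                    # crude token count
        if cur and cur_len + n > budget:
            chunks.append(" ".join(cur))
            cur, cur_len = [], 0
        cur.append(s)
        cur_len += n
    if cur:
        chunks.append(" ".join(cur))
    return chunks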
8.4 Hierarchical / parent-document chunking¶
The trick: embed small chunks (high precision retrieval), but at LLM generation time, hand the parent chunk (more context). This decouples retrieval granularity from generation granularity.
Implementation:
- Split docs into "parents" (e.g., 2000 tokens).
- Split each parent into "children" (e.g., 400 tokens).
- Index children for retrieval; each child stores parent_id.
- At query time, retrieve top-k children, look up parents, dedupe,
pass parents to the LLM.
This costs more in context but reliably improves answer quality on questions that need surrounding context.
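A sketch under the same assumptions as the pipeline skeletons elsewhere in this chapter (embedder and vector_db handles, plus an assumed parent_store KV map), using fixed_chunks from section 8.2:

def ingest_hierarchical(doc_text: str, doc_id: str) -> None:
    parents = fixed_chunks(doc_text, size=2000, overlap=0)   # section 8.2
    for p_i, parent in enumerate(parents):
        parent_id = f"{doc_id}_p{p_i}"
        parent_store[parent_id] = parent                     # assumed KV store
        for c_i, child in enumerate(fixed_chunks(parent, size=400, overlap=50)):
            vector_db.upsert(id=f"{parent_id}_c{c_i}",
                             vector=embedder.encode(child),
                             payload={"parent_id": parent_id})

def retrieve_parents(query: str, k: int = 20, n_parents: int = 5) -> list[str]:
    hits = vector_db.search(embedder.encode(query), k=k)     # child-level hits
    seen: set[str] = set()
    parents: list[str] = []
    for h in hits:                            # dedupe, preserve rank order
        pid = h["payload"]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
    return parents[:n_parents]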
8.5 Late chunking (Jina, 2024)¶
A subtle and powerful idea. The standard pipeline embeds each chunk in isolation, so the embedding doesn't know the rest of the document exists. Late chunking inverts the order:
- Run the embedding model over the entire document with a long-context encoder, producing a per-token embedding sequence.
- Then chunk the token sequence by averaging (or pooling) per chunk.
The resulting chunk embeddings carry context from the surrounding document-pronouns get resolved, topic drift is smoothed, references become embeddable.
Late chunking requires a long-context encoder (8k+ tokens), so it pairs with models like jina-v3, BGE-M3, or anything based on Mistral/Llama encoders. The quality lift on long documents is consistently several points of NDCG.
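A sketch of the pooling step with Hugging Face transformers; the model name and trust_remote_code flag are assumptions-use whatever long-context encoder you've validated:

import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "jinaai/jina-embeddings-v2-base-en"   # assumed long-context encoder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(document: str, spans: list[tuple[int, int]]) -> torch.Tensor:
    """spans: (start, end) token indices per chunk. One forward pass over
    the whole document, then mean-pool each span, so every chunk embedding
    has seen the full surrounding context."""
    inputs = tok(document, return_tensors="pt",
                 truncation=True, max_length=8192)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # [T, D] per-token
    return torch.stack([hidden[s:e].mean(dim=0) for s, e in spans])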
8.6 Contextual retrieval (Anthropic, 2024)¶
Anthropic published a technique that's a bit ugly and very effective. Before embedding each chunk, prepend it with an LLM-generated context description that situates the chunk in its document.
For each (document, chunk):
context = LLM("Here is a document: {document}\n"
"Here is a chunk: {chunk}\n"
"Please give a short, succinct context to situate this "
"chunk within the overall document for the purposes of "
"improving search retrieval.")
embedded_text = context + "\n\n" + chunk
embedding = embed(embedded_text)
bm25_doc = embedded_text # also feed into the BM25 index
Anthropic reported (and it has been widely reproduced) that contextual retrieval reduces top-20 retrieval failure rate by ~35% on its own and ~67% combined with BM25 + reranking. The cost is one LLM call per chunk at ingestion time-amortized over the lifetime of the corpus, this is cheap. Prompt caching makes it even cheaper because you can cache the document portion across all its chunks.
When to use contextual retrieval: any time chunks lose meaning out of context (most non-trivial corpora). It's a no-brainer for legal, medical, technical documentation. It's overkill for FAQs.
8.7 Choosing chunk size¶
Rules of thumb that hold up:
- 256–512 tokens for FAQ-style or short-passage corpora.
- 512–1000 tokens for documentation, articles, books.
- Smaller if your queries are atomic (single fact lookups).
- Larger if your queries need broader context (summarization, multi-fact questions).
- Test it. Hold out a query set. Measure recall@k for chunk sizes 256, 512, 1024, 2048. Pick the elbow.
Overlap of 10–20% of chunk size is the standard default and rarely worth tuning.
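A sketch of the chunk-size sweep, assuming a hypothetical build_index(chunk_size) that re-ingests the corpus and returns a retriever, plus evaluate from Exercise 2:

for size in (256, 512, 1024, 2048):
    retriever = build_index(chunk_size=size)        # hypothetical re-ingest
    scores = evaluate(retriever, eval_set, k=5)     # from Exercise 2
    print(f"chunk_size={size:5d}  R@5={scores['R@k']:.3f}  MRR={scores['MRR']:.3f}")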
9. The RAG pipeline-production reference¶
9.1 The two pipelines¶
RAG systems have two flows: ingestion (offline, batch) and query (online, latency-sensitive). They share a corpus and an index.
INGEST (offline):
source ──▶ loader ──▶ cleaner ──▶ chunker ──▶ contextualizer (optional)
│
▼
embedder
│
▼
vector_db.upsert + bm25.upsert
│
▼
metadata store
QUERY (online):
user_q ──▶ query_rewriter ──▶ retriever (sparse + dense, hybrid)
│
▼
reranker (cross-encoder)
│
▼
context_builder (prompt assembly)
│
▼
LLM
│
▼
answer + citations + telemetry
9.2 Walk through every stage¶
Loader. Pulls the source (S3 PDFs, Confluence, GitHub, a database
table). Always extract along with the text: a stable doc_id, source
URL, last-modified timestamp, tenant/owner, doc type.
Cleaner. Strips boilerplate (headers, footers, ads), normalizes
whitespace, fixes encoding, optionally removes tables/images or
serializes them to text. PDF cleaners are their own universe; tools
like unstructured, marker, docling are the modern picks.
Chunker. From section 8. Output: list of Chunk(text, doc_id,
chunk_id, parent_id?, position, metadata).
Contextualizer (optional). Per chunk, generate a context-prefix using an LLM (section 8.6). This is where you add the most quality at ingestion time.
Embedder. Bi-encoder. Input chunk text (with optional context prefix). Output a fixed-dim vector. Batch up to the model's max input.
Vector DB upsert. upsert(id=chunk_id, vector=emb, payload={doc_id,
tenant_id, ...}). Upsert, not insert-re-ingestion is normal and
must be idempotent.
BM25 upsert. Same chunks indexed sparsely. (Or use a vector DB with hybrid built in.)
Metadata store. Postgres or similar holding chunk_id → text
(because vector DBs are not great at storing text), doc_id → metadata,
ingestion lineage. You will need this for citations and re-ingestion.
Query rewriter. From section 10.
Retriever (hybrid). From section 7. RRF over BM25 and dense.
Reranker. From section 4. Cross-encoder narrows top-100 → top-10.
Context builder. Assemble the prompt. Place rerank-top docs with care (section 12.1, lost-in-the-middle). Truncate if necessary (prefer dropping low-rank docs over truncating high-rank ones).
LLM. Instructed to answer using only the provided context and to cite chunk IDs.
Telemetry. Log the query, retrieved chunk IDs (for reproducibility), latencies per stage, and the answer. This is your ground truth for everything downstream-eval, debugging, training data.
9.3 A minimal end-to-end example¶
# Skeleton; assume bm25, dense_index, reranker, llm exist.
def ingest_doc(doc: Document) -> None:
cleaned = clean(doc.text)
chunks = chunk(cleaned, size=512, overlap=64,
doc_id=doc.id, metadata=doc.metadata)
for c in chunks:
c.text_for_embedding = contextualize(doc, c) # optional
embs = embedder.encode([c.text_for_embedding for c in chunks])
vector_db.upsert([
{"id": c.id, "vector": e, "payload": {**doc.metadata,
"doc_id": doc.id,
"chunk_id": c.id}}
for c, e in zip(chunks, embs)
])
bm25.upsert([{"id": c.id, "text": c.text_for_embedding} for c in chunks])
metadata_store.upsert(chunks)
def answer(user_q: str, tenant_id: str) -> dict:
rewrites = rewrite_queries(user_q) # section 10
sparse = bm25.search(rewrites, k=100, filter={"tenant_id": tenant_id})
dense = dense_index.search(rewrites, k=100, filter={"tenant_id": tenant_id})
candidates = rrf_merge(sparse, dense, k=60)[:100]
candidate_texts = metadata_store.fetch_texts([c[0] for c in candidates])
reranked = reranker.rerank(user_q, candidate_texts, top_k=8)
prompt = build_prompt(user_q, reranked)
out = llm.generate(prompt)
return {"answer": out.text, "citations": [r["chunk_id"] for r in reranked]}
This is the entire production architecture in 30 lines of Python skeleton. The hard part isn't the wiring; it's the eval (section 11) and the operational concerns (section 15).
10. Query rewriting¶
The user's literal query is rarely the optimal retrieval query. Three techniques to bridge the gap.
10.1 HyDE-Hypothetical Document Embeddings¶
(Gao et al., 2022.) Insight: a query and its answer have very different shapes. "How do I reset my password?" and a doc that says "To reset your password, click 'Forgot password' on the sign-in page..." live in different parts of embedding space. HyDE asks the LLM to generate a hypothetical answer to the query, embeds the hypothetical, and uses that embedding to retrieve.
def hyde_retrieve(query: str, k: int = 100) -> list[dict]:
hypothetical = llm.generate(
f"Write a concise factual answer to: {query}\nAnswer:"
).text
emb = embedder.encode(hypothetical)
return dense_index.search_by_vector(emb, k=k)
Helps most for short, vague queries. Hurts when the LLM hallucinates a detailed but wrong "answer"-the wrong embedding finds the wrong docs. Often best combined with the original query (RRF over both).
10.2 Multi-query¶
Generate N rephrasings of the query, retrieve for each, dedupe, merge by RRF.
def multi_query(q: str, n: int = 4, k: int = 50) -> list[dict]:
rephrased = llm.generate(
f"Generate {n} different ways to phrase this question for search,"
f" one per line:\n{q}"
).text.splitlines()
queries = [q] + [r.strip("- 1234567890.") for r in rephrased if r.strip()]
results = [retrieve(qi, k=k) for qi in queries]
return rrf_merge(*results, k=60, top_k=k)
Dirt simple, often a 2–4 point recall gain. Cost: N+1 retrievals per query.
10.3 Step-back prompting¶
(Zheng et al., 2023.) For specific questions, a "step-back" generalization sometimes retrieves better.
User: "Did the Q3 2025 product release ship the multi-tenant SSO feature?"
Step-back: "What did the Q3 2025 product release ship?"
Retrieve for both. The step-back query pulls broader context (release notes); the original pulls the specific feature mention. RRF the results.
Pattern: useful when answers require a frame around them. Less useful for atomic factoid lookup.
10.4 Choosing a rewriter (or none)¶
For most production systems, start with no rewriting. Then add multi-query (cheap, broadly helpful). Add HyDE only if your queries are short and vague (search bar, not chatbot). Step-back is niche.
Rewriting costs latency (extra LLM call) and tokens. Always measure.
11. RAG evaluation¶
This is the section most teams skip and most teams regret skipping. Without an eval set, your "improvements" are vibes and your regressions are silent.
11.1 Two layers of eval¶
A RAG system has two failure modes that need separate metrics:
- Retrieval failure: the right context never reached the LLM.
- Generation failure: the right context reached the LLM but the answer is wrong/incomplete/unfaithful.
You must measure both. Otherwise improving one hides regressions in the other.
11.2 Retrieval metrics¶
Setup. You have an eval set: a list of (query, relevant_doc_ids)
pairs. (The hard part is building this-see section 17.6.)
For a query q, your retriever returns a ranked list of doc IDs
r_1, r_2, ..., r_k. Let R_q be the set of relevant docs.
Recall@k¶
Fraction of all relevant docs that made it into the top-k: recall@k = |{r_1, ..., r_k} ∩ R_q| / |R_q|. The single most important metric: if recall@k is low, no rerank or generation improvement can save you.
Precision@k¶
Fraction of top-k that are relevant: precision@k = |{r_1, ..., r_k} ∩ R_q| / k. Less critical than recall-the LLM filters out irrelevant context tolerably well-but very useful for diagnosing context bloat.
MRR-Mean Reciprocal Rank¶
Each query contributes the reciprocal rank of its first relevant doc: position 1 contributes 1.0, position 2 contributes 0.5, position 10 contributes 0.1; past k it contributes 0. Best when one relevant doc suffices to answer (factoid QA).
def mrr(eval_set: list[dict], retriever, k: int = 10) -> float:
total = 0.0
for ex in eval_set:
retrieved = [d["id"] for d in retriever(ex["query"], k=k)]
rank = next((i + 1 for i, d in enumerate(retrieved)
if d in set(ex["relevant"])), None)
total += (1.0 / rank) if rank else 0.0
return total / len(eval_set)
NDCG-Normalized Discounted Cumulative Gain¶
For graded relevance (0/1/2/3 instead of binary), NDCG is the right
metric. Define rel_i as the graded relevance of the doc at rank i.
DCG@k = Σ (2^rel_i − 1) / log2(i + 1)
i=1..k
IDCG@k = DCG@k of the ideal ordering (sort by relevance desc)
NDCG@k = DCG@k / IDCG@k
NDCG ∈ [0, 1], higher better. The log2(i+1) discount means errors at
high ranks hurt more than errors at low ranks-which matches human
intuition about ranked lists.
Use NDCG when you have graded judgments. Use Recall@k + MRR when you have only binary relevance (most cases).
import math
def ndcg(retrieved_with_rels: list[float], k: int = 10) -> float:
"""retrieved_with_rels: rel_i values in retrieved order."""
def dcg(rels):
return sum((2**r - 1) / math.log2(i + 2)
for i, r in enumerate(rels[:k]))
actual = dcg(retrieved_with_rels)
ideal = dcg(sorted(retrieved_with_rels, reverse=True))
return actual / ideal if ideal > 0 else 0.0
11.3 Generation metrics¶
For the generation half you need to grade answers. Three useful axes:
- Faithfulness (a.k.a. groundedness): does the answer make claims not supported by retrieved context? An unfaithful answer hallucinated even when given correct context.
- Answer relevance: does the answer actually address the question? An answer can be faithful but tangential.
- Context recall: does the retrieved context contain all the information needed to answer? This couples back to retrieval-it's the generation-side view of retrieval recall.
Three ways to grade these:
1. Human labels-gold standard, expensive. Use for the seed eval set.
2. LLM-as-judge-a strong model grades. Cheap, scales, biased toward its own style. Always validate the judge against humans on a sample.
3. Reference-based metrics-BLEU/ROUGE on a reference answer. Old-school and noisy for free-form text; use sparingly.
11.4 RAGAS¶
RAGAS (Es et al., 2023) is a framework that operationalizes the above metrics with an LLM judge. Its core metrics:
- Faithfulness: extract claims from the answer; for each claim, check whether it is entailed by the retrieved context. Score = fraction entailed.
- Answer relevance: prompt LLM to generate questions for which the answer would be a good response; measure cosine similarity between those questions and the original question. High similarity = high relevance.
- Context precision / recall: of the retrieved chunks, which contain info used in the answer? Of the info needed by the ground-truth answer, which is in the retrieved chunks?
You don't need RAGAS specifically-you can implement the same metrics yourself. But it's a fine starting framework. Roll your own only when you've outgrown it.
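A sketch of a DIY faithfulness check in the spirit of RAGAS, assuming the llm handle used throughout this chapter (the judge prompts are illustrative, not RAGAS's actual prompts):

def faithfulness(answer: str, context: str) -> float:
    claims = llm.generate(
        f"List each factual claim in this answer, one per line:\n{answer}"
    ).text.splitlines()
    claims = [c.strip() for c in claims if c.strip()]
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        verdict = llm.generate(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            f"Is the claim fully supported by the context? Answer YES or NO."
        ).text
        supported += verdict.strip().upper().startswith("YES")
    return supported / len(claims)   # fraction of claims entailed by context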
11.5 The metric that ultimately matters¶
End-to-end task accuracy: did the user get a correct, useful answer?
Per-stage metrics are diagnostic. Task accuracy is the headline. If you improve recall@10 by 5 points and end-to-end accuracy doesn't move, you either had headroom in another stage (rerank, generation) or your eval set is leaking.
A serviceable eval scorecard:
| Metric | What it measures | Target (for "good enough") |
|---|---|---|
| Retrieval Recall@10 | Right context found | ≥ 0.90 |
| Retrieval MRR | Right context found early | ≥ 0.70 |
| Faithfulness | No hallucination | ≥ 0.95 |
| Answer relevance | Addresses the question | ≥ 0.90 |
| End-to-end accuracy | Right answer | depends on domain (≥ 0.80 for well-scoped KB) |
Targets vary by domain-calibrate against a baseline (BM25-only retrieval, no rerank) and demand each new component improve a metric.
11.6 Building the eval set¶
This is the unglamorous foundation. Two strategies:
- Mine real questions. From support tickets, search logs, user questions in product. Hand-label the gold doc(s) and the gold answer. 50–200 is enough to start; aim for 500+ over time.
- Synthetic + reviewed. Use an LLM to generate questions from the corpus (give it a chunk, ask "what question does this chunk answer?"). Then have a human review/filter. Faster to start, lower-quality if you skip the human review step.
Cover the long tail: include queries that are short, long, multi-hop, typo-laden, in alternate phrasings, in non-English if you serve multilingual.
12. Common RAG failure modes¶
12.1 Lost in the middle¶
(Liu et al., 2023.) Models attend less to information in the middle of long contexts than at the beginning or end. With 10 retrieved docs, the ones in slots 4–7 are often effectively ignored.
Fixes:
- Rerank (section 4) so the most relevant doc is at the top.
- Place the highest-scoring docs at the start and end of the context block. Some teams reorder reranker output as [1, 3, 5, ..., 6, 4, 2] so the strongest are at the edges (sketched after this list).
- Shorten the context. If 5 docs work, don't pass 20.
- Prefer smaller, sharper context windows. The "more context is
better" instinct is wrong here.
Worked example: a customer support bot retrieves 10 chunks; the answer-bearing chunk is at position 6. Without reordering, end-to-end accuracy on questions where the answer-chunk is in slot 6 is ~62%. After reranking it to slot 1, accuracy on the same questions rises to ~89%. (Numbers illustrative; reproduce on your own eval set.)
12.2 Retrieval-generation gap¶
The retrieved context contains the answer, but the LLM ignores it, contradicts it, or hallucinates around it. Symptoms: faithfulness < 0.9 even when context recall is 1.0.
Fixes:
- Stronger reranking-irrelevant context confuses the LLM.
- Smaller context window-3 strong chunks beat 10 mixed ones.
- Explicit instructions-"Answer only using the provided context. If the context does not contain the answer, say so."
- Citations required-instruct the model to cite chunk IDs; this forces grounding.
- Use a more capable LLM for generation. Some failure modes are fundamentally generation-side, not retrieval-side.
12.3 Off-by-one chunks¶
The answer is split across the boundary of two chunks, and only one is retrieved. The retriever pulls a chunk that mentions the question but not the answer (it's in the next chunk).
Fixes:
- Increase chunk overlap (8.2).
- Hierarchical retrieval (8.4): retrieve at the child level, generate with the parent.
- Late chunking (8.5): the chunk's embedding carries surrounding context.
12.4 Stale data¶
The index is out of date. The right doc exists in the corpus but isn't yet ingested, or the ingested version is old.
Fixes:
- Monitor ingestion lag as an SLO. Alert when median lag > target.
- Incremental ingestion triggered by source-system events (webhooks, change-data-capture), not nightly batch.
- Versioning (section 15.2): keep enough history to reproduce a query result on a past version of the corpus.
12.5 Off-distribution queries¶
Your eval set is medical lookups; the user is asking small talk. The retriever returns nonsense; the LLM dutifully writes an answer.
Fixes:
- Out-of-scope detection: a classifier or threshold on retrieval
scores. If top retrieval score is below θ, return a "I don't have
information on that" answer rather than fabricated context.
- Refuse safely. Better to say "I don't know" than to invent.
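A minimal threshold guard, reusing hybrid_retrieve and rerank from section 4 (θ is a placeholder; tune it on your own score distribution):

REFUSAL = "I don't have information on that."

def answer_or_refuse(query: str, theta: float = 0.3) -> str:
    hits = rerank(query, hybrid_retrieve(query), top_k=5)
    if not hits or hits[0]["rerank_score"] < theta:
        return REFUSAL                  # out of scope: refuse, don't fabricate
    return llm.generate(build_prompt(query, hits)).text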
12.6 Citation hallucination¶
The LLM cites doc IDs that aren't in the retrieved context, or that don't exist. Always validate citations before showing them: every citation in the answer must be a chunk ID that was actually in the prompt.
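A sketch of that validation step, assuming citations appear as bracketed IDs like [doc123_c4]:

import re

def validate_citations(answer: str, allowed_ids: set[str]) -> tuple[str, list[str]]:
    cited = re.findall(r"\[([\w-]+)\]", answer)      # e.g. [doc123_c4]
    valid = [c for c in cited if c in allowed_ids]
    for bad in set(cited) - allowed_ids:
        answer = answer.replace(f"[{bad}]", "")      # strip hallucinated citations
    return answer, valid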
13. Multi-hop and agentic retrieval¶
13.1 Multi-hop¶
Some questions require combining facts from multiple documents:
"Who reported to Alice in 2023, and which of them later joined the security team?"
A single retrieval pass cannot solve this-the chain is Alice →
reports → security_team_roster. You need at least two retrievals where
the second is conditioned on the first's results.
Approaches:
- Iterative retrieval: retrieve, reason, identify what's still missing, retrieve again. Loop until the model says "done" or a hop limit.
- Decompose-then-retrieve: an LLM decomposes the question into sub-questions, retrieves for each, then synthesizes.
def multi_hop(question: str, max_hops: int = 3) -> str:
    context: list[dict] = []
    for hop in range(max_hops):
        # don't shadow the builtin `format`; join the retrieved chunk texts
        ctx_text = "\n".join(c["text"] for c in context)
        sub_q = llm.generate(
            f"Original question: {question}\n"
            f"Context so far:\n{ctx_text}\n"
            f"What is the next question we need to answer "
            f"to reach the final answer? "
            f"If we have enough info, respond with DONE."
        ).text
        if sub_q.strip().startswith("DONE"):
            break
        context.extend(retrieve(sub_q, k=5))  # assumes dicts with a "text" key
    return llm.generate(answer_prompt(question, context)).text
13.2 Agentic RAG¶
The model decides what to retrieve and when, by issuing search calls as tool invocations. Same architecture as any tool-using agent (see deep dive on agents) but with retrievers as the tools.
Pros:
- Handles multi-hop naturally (the agent loops until satisfied).
- Can choose between multiple corpora (knowledge sources, code, web, internal docs).
- Best quality on complex queries.
Cons:
- High latency (multiple LLM + retrieval round trips).
- Hard to evaluate (non-deterministic plans).
- Higher cost.
When to use: complex, varied query distributions where a single retrieval pass is too narrow. Customer support agents, research assistants, internal coding agents fit. Simple FAQ bots do not.
13.3 GraphRAG¶
(Microsoft Research, 2024.) Build an entity-and-relationship graph from the corpus offline (using LLM extraction), then navigate the graph at query time.
Pipeline:
1. Offline: chunk corpus → run an LLM to extract entities and relations from each chunk → build a graph → run community detection → generate per-community summaries with an LLM.
2. Online: route the query-for "global" questions ("what are the main themes?") use community summaries; for "local" questions ("who works on X?") traverse the graph from the matching entities.
GraphRAG shines on global queries that span the whole corpus — classic RAG retrieves k chunks and never sees the bigger picture. For "what does this company do?" applied to a 10,000-document corpus, flat RAG returns a few chunks and the answer is fragmentary; GraphRAG's community summaries already aggregated the corpus.
Cost: graph construction is expensive (one LLM call per chunk for extraction + community summarization). Worth it for stable corpora and analytical queries; overkill for transactional QA.
14. Filtering and metadata¶
14.1 Why metadata matters from day 1¶
Pure semantic retrieval is rarely what you want in production. You almost always need to filter:
- Tenant isolation: customer A must not see customer B's docs.
- Document type: "policy documents only," "code only."
- Time range: "only post-2024."
- Permissions: "user X has access to projects {1, 4, 9}."
- Language, region, product line.
Plan the metadata schema before you ingest. Migrating later means re-embedding everything.
Standard fields to attach to every chunk:
- doc_id, chunk_id, parent_id, position
- tenant_id (or org_id)
- acl / permissions (list of group/user IDs)
- doc_type, source
- created_at, updated_at, version
- language, region
- arbitrary custom payload
14.2 Pre-filter vs post-filter¶
Two ways to combine ANN search with filters:
Post-filter: ANN returns top-k by similarity, then drops hits that fail the filter.
- Cheap when the filter accepts most docs.
- Catastrophic when the filter is selective: you get top-k from the whole corpus, throw most away, and end up with empty or near-empty results.

Pre-filter: filter the corpus first to a subset, then run ANN within it.
- Correct results regardless of filter selectivity.
- Implementation is harder-the ANN index doesn't naturally support arbitrary subsets. Either:
  - Build a subset index per filter combo (impractical for many combinations).
  - Integrate filtering into the ANN walk: only follow edges to nodes that pass the filter. Qdrant, Weaviate, Milvus, and modern pgvector all support this. This is what you want for selective filters.
Rule: if any filter can shrink the corpus to <10% of total, use pre-filter. Otherwise post-filter is fine.
# pgvector example with HNSW + WHERE pre-filter
"""
SELECT chunk_id, embedding <=> %s AS distance
FROM chunks
WHERE tenant_id = %s
AND created_at >= %s
ORDER BY embedding <=> %s
LIMIT 100;
"""
The <=> operator is cosine distance in pgvector. With an HNSW index and an
appropriate hnsw.ef_search setting (recent pgvector versions add iterative
index scans for selective filters), this avoids the classic post-filter
recall collapse.
15. Production concerns¶
15.1 Freshness¶
Two metrics:
- Ingestion lag: time from "doc updated at source" to "doc retrievable in our index." Target depends on use case (minutes for support; hours for docs; days for archives).
- Retrieval freshness: are we serving the latest version of a changed doc?
Patterns:
- Event-driven ingestion: source emits change events (webhook, CDC,
Kafka topic). Ingestor consumes, re-embeds, upserts. This is the only
way to get sub-minute lag.
- Soft deletes: when a source doc is deleted, mark its chunks
deleted_at rather than hard-deleting; gives you a chance to undo.
- Tombstones in the index: when a doc is removed, ensure all its
chunks are removed from sparse + dense indexes atomically.
15.2 Versioning and reproducibility¶
Two questions you will be asked:
1. "Why did the bot say X yesterday and Y today?"
2. "Reproduce this answer for an audit / compliance review."
For both you need to store, per query:
- The query, timestamp, user.
- The retrieved chunk IDs and their versions.
- The prompt sent to the LLM.
- The answer.
If chunks are mutable, version them: chunk_id, version, text, embedding,
created_at, replaced_at. Retrieval logs the (chunk_id, version) pair.
15.3 Multi-tenancy¶
Three patterns, in increasing isolation:
- Single index, tenant_id filter on every query. Cheapest. You must enforce the filter in code on every path; one missed code path is a data leak. Audit every query.
- Index per tenant. Better isolation; higher overhead per tenant. Good for fewer, larger tenants.
- Cluster per tenant. For high-compliance industries. Costly.
In all cases: log tenant_id with every query; alert on anomalies (e.g., cross-tenant query patterns).
15.4 Citations and provenance¶
The LLM should cite. Architecture:
- Each chunk in the prompt is given an explicit ID, e.g., the chunk text preceded by a [doc_id_chunk_id] tag.
- The prompt instructs the LLM to use those IDs: "Cite sources as [doc_id_chunk_id]."
- After generation, parse out citations and validate every one is actually in the prompt. Discard or rewrite hallucinated citations.
- Resolve citations to URLs / titles for the UI.
Never let the LLM invent citations. Always validate.
15.5 Cost monitoring¶
Per query, track:
- Embed (sometimes 0 if cached).
- Retrieve (DB cost).
- Rerank (model cost).
- LLM completion (the dominant cost).
- Total wall time.
Alert on regressions. A subtle prompt change that adds 500 tokens of context across millions of queries is a real bill.
15.6 Caching¶
- Query embedding cache. A user often asks similar things; caching the query embedding by query string saves an API call.
- Answer cache. Map (normalized_query, tenant, top_5_chunk_ids) → cached answer with TTL. Be careful: the cache key must include the retrieved set, or you'll serve stale answers on doc updates.
- Prompt cache (provider-side, e.g., Anthropic's prompt caching). If your prompt has a large stable prefix (system prompt, few-shot examples), cache it on the LLM provider's side; it pays for itself fast.
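A sketch of an answer-cache key that bakes in the retrieved set, so a doc update (which changes the chunk IDs or versions) invalidates the entry:

import hashlib

def answer_cache_key(query: str, tenant_id: str, chunk_ids: list[str]) -> str:
    normalized = " ".join(query.lower().split())    # cheap query normalization
    payload = "|".join([normalized, tenant_id, *sorted(chunk_ids)])
    return hashlib.sha256(payload.encode()).hexdigest()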
16. Self-host vs API for embeddings¶
16.1 The choice¶
| Axis | API (OpenAI, Cohere, Voyage) | Self-host (BGE, E5, jina via sentence-transformers) |
|---|---|---|
| Ops cost | Zero | Real (GPUs, monitoring, HA) |
| Per-token cost | $/M tokens, ongoing | One-time GPU; near-zero marginal |
| Latency | Network + provider | Local; controllable |
| Tail latency | Provider's SLAs | Yours to control |
| Quality | Often top-of-pack | Open weights catching up; close on most benchmarks |
| Compliance | Data leaves your network | Stays in your VPC |
| Lock-in | Provider-shaped | Portable (any inference runtime) |
16.2 Decision matrix¶
- Prototype / small corpus / occasional ingestion → API. Don't run a GPU service for 10k embeddings.
- High volume, predictable load → self-host pays back fast. 10M+ embeddings/day is the rough threshold.
- Compliance / data residency requirements → self-host or region-pinned API.
- Multilingual / domain-specific tuning needed → self-host (you can fine-tune).
- Best raw quality, willing to pay → top API model + cross-encoder rerank.
A realistic mixed setup: API embeddings for query-time (low volume, latency sensitive); self-hosted for ingestion (high volume, batch). Or vice versa, depending on traffic shape.
16.3 Self-hosting in practice¶
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
embeddings = model.encode(
["chunk one", "chunk two"],
batch_size=64,
normalize_embeddings=True, # cosine ≡ dot product after this
show_progress_bar=True,
)
For production: serve via Triton, Text Embeddings Inference (TEI), or vLLM (for embedding-capable models). Put a queue in front; batch aggressively. A single A10G can do tens of thousands of embeddings per second on a 384-dim small model.
17. Practical exercises¶
Exercise 1-BM25 from scratch on 100 docs¶
Task. Use the BM25 class from section 2.4. Build a corpus of 100 short technical docs (you can synthesize them or use a public dataset like Wikipedia abstracts). Index. Run 10 queries. Verify the top-1 hit is the doc you intended for each query.
Acceptance. For 10 queries you author, top-1 retrieval matches your
intended target on at least 8/10. Your implementation matches
rank_bm25.BM25Okapi to within 0.001 on the same parameters and
tokenization.
Solution sketch.
import math, re
from collections import Counter
def tok(s): return re.findall(r"[a-z0-9]+", s.lower())
corpus = [
"BM25 is a sparse ranking function based on term frequency",
"Dense retrieval encodes queries and documents into vectors",
"HNSW is a graph-based approximate nearest neighbor index",
# ... add 97 more
]
docs = [tok(d) for d in corpus]
bm25 = BM25(docs, k1=1.5, b=0.75) # from section 2.4
q = tok("what is bm25 ranking")
top = bm25.topk(q, k=5)
for i, s in top:
print(f"{s:.3f} {corpus[i][:80]}")
Exercise 2-Recall@5, Precision@5, MRR on a 50-query eval set¶
Task. Build (or get) a 50-query eval set with binary gold labels. Compute Recall@5, Precision@5, MRR for two retrievers (BM25 and dense).
Acceptance. A printed scorecard: one row per retriever (BM25, dense), with columns Recall@5, Precision@5, MRR.
Report which retriever wins on which metric and hypothesize why.
Solution sketch.
def precision_at_k(retrieved, relevant, k):
hits = [d for d in retrieved[:k] if d in relevant]
return len(hits) / k
def recall_at_k(retrieved, relevant, k):
hits = [d for d in retrieved[:k] if d in relevant]
return len(hits) / max(len(relevant), 1)
def reciprocal_rank(retrieved, relevant):
for i, d in enumerate(retrieved, start=1):
if d in relevant:
return 1.0 / i
return 0.0
def evaluate(retriever, eval_set, k=5):
P, R, MRR = 0.0, 0.0, 0.0
for ex in eval_set:
retrieved = [d["id"] for d in retriever(ex["query"], k=k)]
rel = set(ex["relevant"])
P += precision_at_k(retrieved, rel, k)
R += recall_at_k(retrieved, rel, k)
MRR += reciprocal_rank(retrieved, rel)
n = len(eval_set)
return {"P@k": P/n, "R@k": R/n, "MRR": MRR/n}
Exercise 3-RRF over BM25 + dense¶
Task. Implement rrf_merge (section 7.2). Run BM25 and dense
retrieval on the same 50-query eval set. Merge with RRF (k = 60). Score
the fused retriever on Recall@5 and MRR. Verify it beats both individual
retrievers.
Acceptance. RRF Recall@5 ≥ max(BM25, Dense) and RRF MRR ≥ max(BM25, Dense), or you've genuinely understood why your data is an exception (rare but possible-e.g., near-perfectly redundant retrievers).
Solution sketch.
def rrf_retrieve(query, k=10):
bm25_hits = bm25_search(query, k=100)
dense_hits = dense_search(query, k=100)
fused = rrf_merge(bm25_hits, dense_hits, k=60, top_k=k)
return [{"id": d, "score": s} for d, s in fused]
print(evaluate(rrf_retrieve, eval_set, k=5))
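If you're working the exercises without the rest of the chapter at hand, here is a minimal rrf_merge consistent with how it's called above (section 7.2 has the derivation):

def rrf_merge(*rankings, k=60, top_k=10):
    # Each ranking is [(doc_id, score), ...] sorted best-first. RRF ignores
    # the raw scores and uses only rank positions: 1 / (k + rank).
    fused = {}
    for ranking in rankings:
        for rank, (doc_id, _) in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: -x[1])[:top_k]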
Exercise 4-Contextual retrieval on a 500-doc synthetic corpus¶
Task. Build a synthetic corpus of 500 documents (e.g., generated support FAQs across 10 product areas). Generate 100 questions. Implement contextual retrieval (section 8.6). Compare retrieval Recall@5:
- Baseline: chunks embedded as-is.
- Contextual: chunks prepended with LLM-generated context.
Acceptance. Contextual retrieval shows ≥ 5-point Recall@5 improvement over the baseline, or you've debugged why not (small documents that don't need context, redundant context, etc.).
Solution sketch.
CTX_PROMPT = """Document:
{doc}
Chunk:
{chunk}
Give a one-sentence context describing where this chunk sits in the
document and what it is about. Output the context only."""
def contextualize(doc_text, chunk):
    return llm.generate(CTX_PROMPT.format(doc=doc_text, chunk=chunk)).text.strip()

def ingest_contextual(doc_text, doc_id):
    # chunk_text(), embedder, llm, and upsert() are the pipeline helpers
    # assumed from the reference pipeline (section 9).
    chunks = chunk_text(doc_text, size=400, overlap=50)
    for c in chunks:
        ctx = contextualize(doc_text, c.text)
        c.text_for_embed = f"{ctx}\n\n{c.text}"
    embs = embedder.encode([c.text_for_embed for c in chunks])
    upsert(chunks, embs, doc_id)
Use prompt caching on the document portion (every chunk in the same doc
shares the same doc_text) to make this affordable.
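A sketch of what that looks like with the Anthropic Messages API (the model name is illustrative). The stable document text sits before the cache breakpoint, so every chunk's call reuses the cached prefix and only pays full price for the short per-chunk tail.

import anthropic

client = anthropic.Anthropic()

def contextualize_cached(doc_text, chunk):
    resp = client.messages.create(
        model="claude-sonnet-4-5",        # illustrative; any cache-capable model
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                # Stable prefix: cached across every chunk of this document.
                {"type": "text",
                 "text": f"Document:\n{doc_text}",
                 "cache_control": {"type": "ephemeral"}},
                # Per-chunk tail: different on every call, not cached.
                {"type": "text",
                 "text": f"Chunk:\n{chunk}\n\nGive a one-sentence context "
                         "describing where this chunk sits in the document "
                         "and what it is about. Output the context only."},
            ],
        }],
    )
    return resp.content[0].text.strip()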
Exercise 5-Diagnose a "lost in the middle" failure¶
Task. Construct a query whose answer-bearing chunk lands at position 6 in a 10-chunk context (you can engineer the rerank to put it there). Generate an answer. Then move the chunk to position 1 and regenerate. Compare the answers.
Acceptance. A worked-out before/after where the position-6 answer is notably worse (incomplete, hallucinated, or a refusal) and the position-1 answer is correct. Document the test and add it to a regression suite; this is the kind of failure that comes back.
Worked example.
Query: "What's the maximum retry count for 5xx errors in our default policy?"
Context (10 chunks; chunk 6 is the only one with the answer):
1. General error handling overview.
2. Authentication errors (4xx).
3. Rate limiting strategies.
4. Client-side timeout config.
5. Logging conventions.
6. "... For 5xx server errors, the default policy retries up to 5 times with exponential backoff..."
7. Webhook delivery semantics.
8. Error reporting integration.
9. Glossary entry on idempotency.
10. Changelog summary.
Without reordering, a typical model gives: "The retry count depends on error type; please consult the documentation." (A refusal: the answer was present in the middle slot, but the model didn't anchor on it.)
With chunk 6 promoted to slot 1, the model gives: "The default policy retries 5xx errors up to 5 times with exponential backoff [doc_42_chunk_6]."
Fix: the reranker promotes the answer-bearing chunk, and the context builder places the top-ranked chunks at slots 1 and N (the edges); a minimal ordering sketch follows.
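Given chunks sorted best-first by the reranker, one simple edge-placement scheme interleaves them so the strongest land at the two edges and the weakest sink toward the middle:

def order_for_context(ranked_chunks):
    # ranked_chunks is best-first. Result: rank 1 at the front, rank 2 at the
    # back, rank 3 second, rank 4 second-to-last, ... weakest in the middle.
    front, back = [], []
    for i, c in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(c)
    return front + back[::-1]

print(order_for_context([1, 2, 3, 4, 5]))   # [1, 3, 5, 4, 2]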
Exercise 6-Design a RAG eval set for a customer support KB¶
Task. Design (don't necessarily run) a 50-question eval set for a customer support knowledge base of ~5,000 articles. Specify:
- The question categories you'll cover and how many in each.
- The metrics you'll track and why.
- Pass thresholds.
- How you'll source the questions.
- How you'll label gold docs.
Acceptance. A 1–2 page eval design that another engineer could execute. Below is one valid design.
Sample design.
Categories (50 questions total):
- 10 factoid lookups ("What's the refund window for plan X?")
- 10 how-to / procedural ("How do I add a teammate?")
- 5 multi-hop ("Does plan X include feature Y, and at what tier?")
- 5 negation / boundary ("What's not covered under the basic plan?")
- 5 edge cases (very short queries, typos, slang)
- 5 out-of-scope ("What's the weather?"); gold = refuse
- 5 synonym / paraphrase variants of the factoid lookups
- 5 recently-changed-doc questions (tests freshness)
Metrics tracked:
- Retrieval: Recall@10, MRR, Precision@5.
- Generation (LLM-as-judge with human spot-check): faithfulness, answer relevance, refusal correctness on out-of-scope.
- End-to-end accuracy (human-graded on a sample, LLM-judged on the full set).
Pass thresholds for shipping:
- Recall@10 ≥ 0.92, MRR ≥ 0.75.
- Faithfulness ≥ 0.95.
- Refusal correctness on out-of-scope ≥ 0.95 (this is a safety bar).
- End-to-end accuracy ≥ 0.85, with no individual category below 0.75.
Sourcing:
- 30 questions mined from real ticket data (anonymized).
- 15 synthetic, generated by an LLM from chunks and human-reviewed.
- 5 written by hand to cover specific edge cases / known weak spots.
Labeling:
- Gold doc(s): an annotator finds the article(s) that contain the answer; at least one annotator plus an adjudicator on disagreements. Allow multiple gold docs (a set, not a single doc).
- Gold answer: a short reference text used for human grading and as judge-prompt input.
Cadence:
- Run on every retriever / model change.
- Re-mine 5–10 new questions per month from fresh tickets to catch drift.
- Manually spot-check 20% of LLM-judge calls each run.
18. What to take away¶
If you're going to remember six things from this chapter:
- Hybrid retrieval (BM25 + dense, fused with RRF) + cross-encoder rerank is the dominant production architecture in 2026. Almost every other choice (which embedder, which DB, which chunker) is a tuning decision around this skeleton.
- Chunking and contextualization are the single highest-leverage ingestion-time choices. Start with semantic chunking; if quality isn't there, add Anthropic-style contextual retrieval.
- Always measure both retrieval and generation separately. Recall@k and MRR for retrieval; faithfulness and answer relevance for generation. The single end-to-end accuracy number is the headline but is uninterpretable on its own.
- Lost-in-the-middle is real. Rerank, prune context, place strong docs at the edges.
- Plan metadata, freshness, versioning, and tenancy from day 1. Migrating later means re-embedding, which is expensive and disruptive.
- The eval set is the foundation. No eval, no improvement. Build a small good one before you optimize anything else.
The rest is engineering.