Month 5-Week 1: Retrieval framing + BM25 baseline + retrieval metrics¶
Week summary¶
- Goal: Frame the RAG problem properly. Pick a real corpus + 30 labeled queries. Implement BM25 baseline. Compute NDCG@10 and MRR from scratch (once) before reaching for a library.
- Time: ~9 h over 3 sessions.
- Output: evals/retrieval_eval.ipynb with corpus, queries, BM25 baseline, and reported metrics.
- Sequences relied on: 10-retrieval-and-rag rungs 01, 02, 05, 07; 06-classical-ml rung 08.
Why this week matters¶
Most teams skip BM25 and start with a vector DB. That's how you ship a worse system than necessary. BM25 is the baseline every RAG system must beat. This week installs the discipline of measuring first and optimizing against a meaningful baseline, which is what separates senior RAG engineers from the rest.
Implementing NDCG and MRR by hand once internalizes what the metrics actually measure: rank-aware quality. Reading about them alone doesn't stick.
Prerequisites¶
- M04 complete.
- Comfortable with Python, Pandas.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): pick corpus + create eval set
- Session B-Sat morning (~3 h): chunking + BM25
- Session C-Sun afternoon (~3 h): NDCG + MRR + write up
Session A-Corpus + labeled queries¶
Goal: Pick a corpus that aligns with your anchor project. Create 30 queries with labeled relevant chunks.
Part 1-Pick the corpus (45 min)¶
Strong options:
1. Your runbooks / docs (synthesize 50–200 documents about your domain; Claude can help generate them).
2. HotpotQA (huggingface.co/datasets/hotpotqa/hotpot_qa): Wikipedia paragraphs with multi-hop questions.
3. A code corpus: your own repos or a popular OSS project.
For the incident-triage anchor: synthesize ~100 runbook-like documents about service architectures, failure modes, common fixes. Real or synthetic both work; aim for variety.
Part 2-Create labeled queries (75 min)¶
30 queries. For each, hand-label which document(s) contain the answer:
{"query_id": "q01", "query": "How do we recover from a checkout-api OOM crash?", "relevant_doc_ids": ["doc_42", "doc_57"]}
{"query_id": "q02", "query": "What's the rollback procedure for v2.x deploys?", "relevant_doc_ids": ["doc_11"]}
Labeling tips:
- Don't second-guess. If a doc is the answer, label it.
- Multiple docs can be relevant.
- Create some adversarial queries (lexical mismatch, paraphrase) to stress dense vs lexical methods; see the example below.
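For instance, a hypothetical adversarial query (IDs and wording are illustrative) whose phrasing deliberately avoids the runbook's own vocabulary:
{"query_id": "q17", "query": "What do we do when checkout goes down and payments stop going through?", "relevant_doc_ids": ["doc_42"]}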
Part 3-Document the data (60 min)¶
Store:
- evals/corpus/ - one file per document or a single JSONL.
- evals/retrieval_queries.jsonl - labeled queries.
- evals/coverage.md - coverage analysis: how many docs are referenced by ≥1 query? (At least 50% should be; otherwise your queries don't exercise the corpus.) See the sketch below.
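A minimal sketch of that coverage check, assuming the corpus is stored as a single JSONL with a doc_id field (the file paths here are illustrative):
import json

# Illustrative paths; adjust if you stored one file per document instead of a single JSONL
with open("evals/corpus/corpus.jsonl") as f:
    corpus_ids = {json.loads(line)["doc_id"] for line in f}
with open("evals/retrieval_queries.jsonl") as f:
    queries = [json.loads(line) for line in f]

referenced = {d for q in queries for d in q["relevant_doc_ids"]}
covered = referenced & corpus_ids
print(f"{len(covered)}/{len(corpus_ids)} docs referenced by >=1 query ({len(covered) / len(corpus_ids):.0%})")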
Output of Session A¶
- Corpus committed.
- 30 labeled queries.
- Coverage analysis.
Session B-Chunking + BM25 baseline¶
Goal: Chunk the corpus and build a BM25 index. Retrieve top-10 for each query.
Part 1-Chunking strategies (60 min)¶
Watch: Greg Kamradt's "5 Levels of Chunking" video (YouTube search "kamradt chunking strategies").
Three strategies to compare later:
1. Fixed-size: 512 tokens, 50-token overlap.
2. Paragraph-based: split on \n\n.
3. Recursive: split by major separators first, then minor (langchain.text_splitter.RecursiveCharacterTextSplitter is the reference design).
For this week, pick fixed-size 512 with 50-token overlap as your default. We'll compare strategies in a later iteration.
import tiktoken

def chunk_fixed(text, chunk_size=512, overlap=50):
    # Use tiktoken for token-accurate chunking
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = enc.decode(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
Apply to all docs. Tag each chunk with (doc_id, chunk_idx).
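One way to apply it, assuming each corpus record is a dict with doc_id and text keys (field names are illustrative):
chunks = []
for doc in corpus:  # corpus: list of {"doc_id": ..., "text": ...} records
    for idx, chunk_text in enumerate(chunk_fixed(doc["text"])):
        chunks.append({"doc_id": doc["doc_id"], "chunk_idx": idx, "text": chunk_text})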
Part 2-BM25 implementation (75 min)¶
import re
from rank_bm25 import BM25Okapi

# Tokenize for BM25 (simple lowercase split; sufficient for English)
def tokenize(text):
    return re.findall(r'\w+', text.lower())

bm25 = BM25Okapi([tokenize(c["text"]) for c in chunks])

def search_bm25(query: str, k: int = 10):
    scores = bm25.get_scores(tokenize(query))
    top = scores.argsort()[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]
Run for all 30 queries. Save (query_id, retrieved_chunk_ids) per query.
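A sketch of that loop, assuming the query file from Session A; it also collapses retrieved chunks to their parent doc_id (deduplicated, rank order preserved) so Session C can score against relevant_doc_ids. The output path is an assumption:
import json

with open("evals/retrieval_queries.jsonl") as f:
    queries = [json.loads(line) for line in f]

results = []
for q in queries:
    hits = search_bm25(q["query"], k=10)
    doc_ids = []
    for c, _score in hits:
        if c["doc_id"] not in doc_ids:  # collapse chunks to parent docs, keeping rank order
            doc_ids.append(c["doc_id"])
    results.append({
        "query_id": q["query_id"],
        "retrieved_chunk_ids": [f'{c["doc_id"]}#{c["chunk_idx"]}' for c, _score in hits],
        "retrieved": doc_ids,  # doc-level ranking, consumed by the Session C metrics
        "relevant": q["relevant_doc_ids"],
    })

with open("evals/bm25_results.jsonl", "w") as f:  # output path is an assumption
    f.writelines(json.dumps(r) + "\n" for r in results)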
Part 3-Sanity check (45 min)¶
Manually inspect 5 query results. Are top-1 hits sensible? If not, the corpus or queries need refinement.
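For example, eyeball the first five queries using the results list built above:
for r, q in zip(results[:5], queries[:5]):
    print(q["query"])
    print("  top-1 doc:", r["retrieved"][0] if r["retrieved"] else None, "| relevant:", r["relevant"])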
Output of Session B¶
- Chunked corpus.
- BM25 index.
- Top-10 retrievals saved per query.
Session C-NDCG, MRR, write up¶
Goal: Implement NDCG@10 and MRR from scratch. Compute on BM25 results. Document.
Part 1-Implement metrics (75 min)¶
MRR (Mean Reciprocal Rank):
def reciprocal_rank(retrieved_doc_ids, relevant_doc_ids):
    """1/rank of first relevant doc; 0 if none in retrieved."""
    for i, d in enumerate(retrieved_doc_ids, start=1):
        if d in relevant_doc_ids:
            return 1.0 / i
    return 0.0

def mrr(results):
    return sum(reciprocal_rank(r["retrieved"], r["relevant"]) for r in results) / len(results)
NDCG@10:
import math

def dcg_at_k(retrieved_doc_ids, relevant_doc_ids, k=10):
    """Discounted Cumulative Gain. Binary relevance: 1 if relevant else 0."""
    return sum(
        (1.0 if d in relevant_doc_ids else 0.0) / math.log2(i + 2)
        for i, d in enumerate(retrieved_doc_ids[:k])
    )

def ndcg_at_k(retrieved_doc_ids, relevant_doc_ids, k=10):
    actual = dcg_at_k(retrieved_doc_ids, relevant_doc_ids, k)
    # Ideal DCG: all relevant docs ranked at the top
    n_rel = min(len(relevant_doc_ids), k)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(n_rel))
    return actual / ideal if ideal > 0 else 0.0

def mean_ndcg_at_k(results, k=10):
    return sum(ndcg_at_k(r["retrieved"], r["relevant"], k) for r in results) / len(results)
Why these: MRR captures "did the user get a relevant result fast?"; NDCG captures "is the full list well-ordered?". Both matter for RAG.
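Before running on real results, sanity-check the implementations on hand-computable cases (a quick sketch using the functions and math import above):
# Only relevant doc at rank 1: both metrics are perfect
assert reciprocal_rank(["d1", "d2"], ["d1"]) == 1.0
assert ndcg_at_k(["d1", "d2"], ["d1"]) == 1.0

# Relevant doc at rank 2: RR = 1/2, NDCG = (1/log2(3)) / (1/log2(2)) ≈ 0.631
assert reciprocal_rank(["d2", "d1"], ["d1"]) == 0.5
assert abs(ndcg_at_k(["d2", "d1"], ["d1"]) - 1 / math.log2(3)) < 1e-9

# No relevant doc retrieved: both metrics are 0
assert reciprocal_rank(["d3"], ["d1"]) == 0.0
assert ndcg_at_k(["d3"], ["d1"]) == 0.0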
Part 2-Compute and report (60 min)¶
Likely numbers: NDCG@10 = 0.5–0.7, MRR = 0.4–0.6 (depends on corpus difficulty).
This is your baseline. Every later approach (dense, hybrid, rerank) compares against these numbers.
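A minimal sketch of the computation, assuming the results file saved in Session B (path and keys as used there):
import json
import statistics

with open("evals/bm25_results.jsonl") as f:
    results = [json.loads(line) for line in f]

print(f"NDCG@10 = {mean_ndcg_at_k(results, k=10):.3f}")
print(f"MRR     = {mrr(results):.3f}")

# Median rank of the first relevant doc, over queries with at least one hit in the top 10
first_ranks = []
for r in results:
    rr = reciprocal_rank(r["retrieved"], r["relevant"])
    if rr > 0:
        first_ranks.append(round(1 / rr))
print(f"Median rank of first relevant = {statistics.median(first_ranks)}")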
Part 3-Write up + push (45 min)¶
evals/retrieval_eval.ipynb:
## BM25 baseline
- 30 queries, 100-doc corpus chunked at 512 tokens.
- NDCG@10 = 0.612
- MRR = 0.534
- Median rank of first relevant: 2.
Failure modes observed:
1. Lexical mismatch: query "outage" misses docs that say "downtime".
2. Synonyms: "checkout-api" docs missed by queries using "purchase service".
These are the cases dense retrieval will help with; we measure that next week.
Push.
Output of Session C¶
- NDCG@10 + MRR implemented from scratch.
- Baseline metrics documented.
- Notebook published.
End-of-week artifact¶
- Corpus + 30 labeled queries committed
- BM25 baseline retrievals saved
- NDCG@10 and MRR implemented from scratch
- Baseline numbers documented in README
- Failure mode observations recorded
End-of-week self-assessment¶
- I can implement NDCG and MRR without looking them up.
- I can defend "BM25 first" reasoning.
- I have measurable retrieval baselines.
Common failure modes for this week¶
- Skipping BM25. Don't.
- Synthetic queries that look like the docs. Add adversarial cases that paraphrase.
- Vague metric reporting. Always pair NDCG with the corpus and query-set descriptions.
What's next (preview of M05-W02)¶
Dense retrieval. Embeddings + vector DB. Compare to BM25. The semantic vs lexical comparison is one of RAG's most informative diagnostics.