Month 5-Week 1: Retrieval framing + BM25 baseline + retrieval metrics

Week summary

  • Goal: Frame the RAG problem properly. Pick a real corpus + 30 labeled queries. Implement BM25 baseline. Compute NDCG@10 and MRR from scratch (once) before reaching for a library.
  • Time: ~9 h over 3 sessions.
  • Output: evals/retrieval_eval.ipynb with corpus, queries, BM25 baseline, and reported metrics.
  • Sequences relied on: 10-retrieval-and-rag rungs 01, 02, 05, 07; 06-classical-ml rung 08.

Why this week matters

Most teams skip BM25 and start with a vector DB. That's how you ship a worse system than necessary. BM25 is the baseline every RAG system must beat. This week installs the discipline of measuring first and optimizing against a meaningful baseline, which is what separates senior RAG engineers from the rest.

Implementing NDCG and MRR by hand once internalizes what the metrics actually measure: rank-aware quality. Reading alone doesn't stick.

Prerequisites

  • M04 complete.
  • Comfortable with Python and Pandas.

Schedule

  • Session A - Tue/Wed evening (~3 h): pick corpus + create eval set
  • Session B - Sat morning (~3 h): chunking + BM25
  • Session C - Sun afternoon (~3 h): NDCG + MRR + write-up

Session A-Corpus + labeled queries

Goal: Pick a corpus that aligns with your anchor project. Create 30 queries with labeled relevant chunks.

Part 1-Pick the corpus (45 min)

Strong options:

  1. Your runbooks / docs (synthesize 50–200 documents about your domain; Claude can help generate them).
  2. HotpotQA (huggingface.co/datasets/hotpotqa/hotpot_qa): Wikipedia paragraphs with multi-hop questions.
  3. A code corpus: your own repos or a popular OSS project.

For the incident-triage anchor: synthesize ~100 runbook-like documents about service architectures, failure modes, common fixes. Real or synthetic both work; aim for variety.

Part 2-Create labeled queries (75 min)

30 queries. For each, hand-label which document(s) contain the answer:

{"query_id": "q01", "query": "How do we recover from a checkout-api OOM crash?", "relevant_doc_ids": ["doc_42", "doc_57"]}
{"query_id": "q02", "query": "What's the rollback procedure for v2.x deploys?", "relevant_doc_ids": ["doc_11"]}

Labeling tips:

  • Don't second-guess. If a doc contains the answer, label it.
  • Multiple docs can be relevant.
  • Create some adversarial queries (lexical mismatch, paraphrases) to stress dense vs lexical methods.

Part 3-Document the data (60 min)

Store:

  • evals/corpus/ - one file per document or a single JSONL.
  • evals/retrieval_queries.jsonl - labeled queries.
  • evals/coverage.md - coverage analysis: how many docs are referenced by ≥1 query? (At least 50% should be; otherwise your queries don't exercise the corpus. A quick way to compute this is sketched below.)
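A minimal coverage check, as a sketch only: it assumes the corpus is stored as a single JSONL with a doc_id field and the query file matches the format above; adjust paths and field names to your layout.

import json

# Hypothetical layout: corpus as one JSONL with "doc_id", queries as labeled JSONL.
with open("evals/corpus/corpus.jsonl") as f:
    all_doc_ids = {json.loads(line)["doc_id"] for line in f}

with open("evals/retrieval_queries.jsonl") as f:
    queries = [json.loads(line) for line in f]

covered = {d for q in queries for d in q["relevant_doc_ids"]}
print(f"{len(covered)}/{len(all_doc_ids)} docs referenced by >=1 query "
      f"({len(covered) / len(all_doc_ids):.0%})")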

Output of Session A

  • Corpus committed.
  • 30 labeled queries.
  • Coverage analysis.

Session B-Chunking + BM25 baseline

Goal: Chunk the corpus and build a BM25 index. Retrieve top-10 for each query.

Part 1-Chunking strategies (60 min)

Watch: Greg Kamradt's "5 Levels of Chunking" video (YouTube search "kamradt chunking strategies").

Three strategies to compare later:

  1. Fixed-size: 512 tokens, 50-token overlap.
  2. Paragraph-based: split on \n\n.
  3. Recursive: split by major separators first, then minor (langchain.text_splitter.RecursiveCharacterTextSplitter is the reference design).

For this week, pick fixed-size 512 with 50-token overlap as your default. We'll compare strategies in a later iteration.

def chunk_fixed(text, chunk_size=512, overlap=50):
    # Use tiktoken for token-accurate chunking
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = enc.decode(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

Apply to all docs. Tag each chunk with (doc_id, chunk_idx).
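One way to do the apply-and-tag step, as a sketch: it assumes each document has been loaded into a dict with doc_id and text keys (adjust to however you stored the corpus in Session A).

chunks = []
for doc in docs:  # docs: list of {"doc_id": ..., "text": ...} loaded from evals/corpus/
    for idx, text in enumerate(chunk_fixed(doc["text"])):
        chunks.append({"doc_id": doc["doc_id"], "chunk_idx": idx, "text": text})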

Part 2-BM25 implementation (75 min)

from rank_bm25 import BM25Okapi

# Tokenize for BM25 (simple lowercase split-sufficient for English)
def tokenize(text):
    import re
    return re.findall(r'\w+', text.lower())

bm25 = BM25Okapi([tokenize(c["text"]) for c in chunks])

def search_bm25(query: str, k: int = 10):
    scores = bm25.get_scores(tokenize(query))
    top = scores.argsort()[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

Run for all 30 queries. Save (query_id, retrieved_chunk_ids) per query.
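A sketch of running and persisting the retrievals; the output file name and the doc_id#chunk_idx ID format are just suggestions, and queries is the labeled list loaded in Session A.

import json

# Save top-10 chunk IDs per query so Session C can score them without re-running retrieval.
with open("evals/bm25_retrievals.jsonl", "w") as f:
    for q in queries:
        hits = search_bm25(q["query"], k=10)
        f.write(json.dumps({
            "query_id": q["query_id"],
            "retrieved_chunk_ids": [f'{c["doc_id"]}#{c["chunk_idx"]}' for c, _ in hits],
        }) + "\n")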

Part 3-Sanity check (45 min)

Manually inspect 5 query results. Are top-1 hits sensible? If not, the corpus or queries need refinement.
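A quick way to eyeball results, assuming the queries list and the chunk dicts from earlier:

import random

# Spot-check: print the top-3 hits for five random queries and judge them by hand.
for q in random.sample(queries, 5):
    print(f"\n=== {q['query']} ===")
    for chunk, score in search_bm25(q["query"], k=3):
        print(f"  [{score:.2f}] {chunk['doc_id']}#{chunk['chunk_idx']}: {chunk['text'][:80]}...")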

Output of Session B

  • Chunked corpus.
  • BM25 index.
  • Top-10 retrievals saved per query.

Session C-NDCG, MRR, write up

Goal: Implement NDCG@10 and MRR from scratch. Compute on BM25 results. Document.

Part 1-Implement metrics (75 min)

MRR (Mean Reciprocal Rank):

def reciprocal_rank(retrieved_doc_ids, relevant_doc_ids):
    """1/rank of first relevant doc; 0 if none in retrieved."""
    for i, d in enumerate(retrieved_doc_ids, start=1):
        if d in relevant_doc_ids:
            return 1.0 / i
    return 0.0

def mrr(results):
    return sum(reciprocal_rank(r["retrieved"], r["relevant"]) for r in results) / len(results)

NDCG@10:

import math

def dcg_at_k(retrieved_doc_ids, relevant_doc_ids, k=10):
    """Discounted Cumulative Gain. Binary relevance: 1 if relevant else 0."""
    return sum(
        (1.0 if d in relevant_doc_ids else 0.0) / math.log2(i + 2)
        for i, d in enumerate(retrieved_doc_ids[:k])
    )

def ndcg_at_k(retrieved_doc_ids, relevant_doc_ids, k=10):
    actual = dcg_at_k(retrieved_doc_ids, relevant_doc_ids, k)
    # Ideal: all relevant docs at the top
    n_rel = min(len(relevant_doc_ids), k)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(n_rel))
    return actual / ideal if ideal > 0 else 0.0

def mean_ndcg_at_k(results, k=10):
    return sum(ndcg_at_k(r["retrieved"], r["relevant"], k) for r in results) / len(results)

Why these: MRR captures "did the user get a relevant result fast?"; NDCG captures "is the full list well-ordered?". Both matter for RAG.
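Before trusting the numbers, check the implementations on a toy case you can verify by hand. With relevant docs {A, C} and retrieved list [B, A, D, C], the first relevant hit is at rank 2, so RR = 0.5; DCG = 1/log2(3) + 1/log2(5) ≈ 1.06, ideal DCG = 1/log2(2) + 1/log2(3) ≈ 1.63, so NDCG ≈ 0.65.

# Hand-checkable toy case: relevant = {A, C}, retrieved = [B, A, D, C].
toy = [{"retrieved": ["B", "A", "D", "C"], "relevant": {"A", "C"}}]
assert abs(mrr(toy) - 0.5) < 1e-9               # first relevant at rank 2 -> RR = 1/2
assert abs(mean_ndcg_at_k(toy) - 0.65) < 0.01   # DCG ~1.06 / ideal ~1.63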

Part 2-Compute and report (60 min)
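The metric functions expect a results list of {"retrieved": [...], "relevant": {...}} records at the document level. One way to assemble it, assuming the queries list and search_bm25 from Session B; note the labels are per document, so multiple chunk hits from the same doc are collapsed.

results = []
for q in queries:
    doc_ids = []
    for chunk, _score in search_bm25(q["query"], k=10):
        if chunk["doc_id"] not in doc_ids:   # keep the first (highest-ranked) hit per doc
            doc_ids.append(chunk["doc_id"])
    results.append({"retrieved": doc_ids, "relevant": set(q["relevant_doc_ids"])})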

print(f"BM25 baseline: NDCG@10 = {mean_ndcg_at_k(results):.4f}, MRR = {mrr(results):.4f}")

Likely numbers: NDCG@10 = 0.5–0.7, MRR = 0.4–0.6 (depends on corpus difficulty).

This is your baseline. Every later approach (dense, hybrid, rerank) compares against these numbers.

Part 3-Write up + push (45 min)

evals/retrieval_eval.ipynb:

## BM25 baseline
- 30 queries, 100-doc corpus chunked at 512 tokens.
- NDCG@10 = 0.612
- MRR     = 0.534
- Median rank of first relevant: 2.

Failure modes observed:
1. Lexical mismatch: query "outage" misses docs that say "downtime".
2. Synonyms: "checkout-api" docs missed by queries using "purchase service".

These are the cases dense retrieval should help with; we'll measure that next week.

Push.

Output of Session C

  • NDCG@10 + MRR implemented from scratch.
  • Baseline metrics documented.
  • Notebook published.

End-of-week artifact

  • Corpus + 30 labeled queries committed
  • BM25 baseline retrievals saved
  • NDCG@10 and MRR implemented from scratch
  • Baseline numbers documented in README
  • Failure mode observations recorded

End-of-week self-assessment

  • I can implement NDCG and MRR without looking them up.
  • I can defend "BM25 first" reasoning.
  • I have measurable retrieval baselines.

Common failure modes for this week

  • Skipping BM25. Don't.
  • Synthetic queries that look like the docs. Add adversarial cases that paraphrase.
  • Vague metric reporting. Always pair NDCG with the corpus and query-set descriptions.

What's next (preview of M05-W02)

Dense retrieval. Embeddings + vector DB. Compare to BM25. The semantic vs lexical comparison is one of RAG's most informative diagnostics.
