Month 5-Week 1: Retrieval framing + BM25 baseline + retrieval metrics

Week summary

  • Goal: Frame the RAG problem properly. Pick a real corpus + 30 labeled queries. Implement BM25 baseline. Compute NDCG@10 and MRR from scratch (once) before reaching for a library.
  • Time: ~9 h over 3 sessions.
  • Output: evals/retrieval_eval.ipynb with corpus, queries, BM25 baseline, and reported metrics.
  • Sequences relied on: 10-retrieval-and-rag rungs 01, 02, 05, 07; 06-classical-ml rung 08.

Why this week matters

Most teams skip BM25 and start with a vector DB. That's how you ship a worse system than necessary. BM25 is the baseline every RAG system must beat. This week installs the discipline of measuring first and optimizing against a meaningful baseline, which is what separates senior RAG engineers from the rest.

Implementing NDCG and MRR by hand once internalizes what the metrics actually measure: rank-aware quality. Reading alone doesn't stick.

Prerequisites

  • M04 complete.
  • Comfortable with Python and Pandas.

Schedule

  • Session A - Tue/Wed evening (~3 h): pick corpus + create eval set
  • Session B - Sat morning (~3 h): chunking + BM25
  • Session C - Sun afternoon (~3 h): NDCG + MRR + write-up

Session A-Corpus + labeled queries

Goal: Pick a corpus that aligns with your anchor project. Create 30 queries with labeled relevant chunks.

Part 1-Pick the corpus (45 min)

Strong options:

  1. Your runbooks / docs (synthesize 50–200 documents about your domain; Claude can help generate them).
  2. HotpotQA (huggingface.co/datasets/hotpotqa/hotpot_qa): Wikipedia paragraphs with multi-hop questions.
  3. A code corpus: your own repos or a popular OSS project.

For the incident-triage anchor: synthesize ~100 runbook-like documents about service architectures, failure modes, common fixes. Real or synthetic both work; aim for variety.

Part 2-Create labeled queries (75 min)

30 queries. For each, hand-label which document(s) contain the answer:

{"query_id": "q01", "query": "How do we recover from a checkout-api OOM crash?", "relevant_doc_ids": ["doc_42", "doc_57"]}
{"query_id": "q02", "query": "What's the rollback procedure for v2.x deploys?", "relevant_doc_ids": ["doc_11"]}

Labeling tips:

  • Don't second-guess. If a doc contains the answer, label it.
  • Multiple docs can be relevant.
  • Create some adversarial queries (lexical mismatch, paraphrases) to stress dense vs lexical methods.

Part 3-Document the data (60 min)

Store:

  • evals/corpus/ - one file per document or a single JSONL.
  • evals/retrieval_queries.jsonl - labeled queries.
  • evals/coverage.md - coverage analysis: how many docs are referenced by ≥1 query? (At least 50% should be; otherwise your queries don't exercise the corpus. A quick way to compute this is sketched below.)
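A minimal coverage check, as a sketch only: it assumes the corpus is stored as a single JSONL with a doc_id field and the query file matches the format above; adjust paths and field names to your layout.

import json

# Hypothetical layout: corpus as one JSONL with "doc_id", queries as labeled JSONL.
with open("evals/corpus/corpus.jsonl") as f:
    all_doc_ids = {json.loads(line)["doc_id"] for line in f}

with open("evals/retrieval_queries.jsonl") as f:
    queries = [json.loads(line) for line in f]

covered = {d for q in queries for d in q["relevant_doc_ids"]}
print(f"{len(covered)}/{len(all_doc_ids)} docs referenced by >=1 query "
      f"({len(covered) / len(all_doc_ids):.0%})")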

Output of Session A

  • Corpus committed.
  • 30 labeled queries.
  • Coverage analysis.

Session B-Chunking + BM25 baseline

Goal: Chunk the corpus and build a BM25 index. Retrieve top-10 for each query.

Part 1-Chunking strategies (60 min)

Watch: Greg Kamradt's "5 Levels of Chunking" video (YouTube search "kamradt chunking strategies").

Three strategies to compare later:

  1. Fixed-size: 512 tokens, 50-token overlap.
  2. Paragraph-based: split on \n\n.
  3. Recursive: split by major separators first, then minor (langchain.text_splitter.RecursiveCharacterTextSplitter is the reference design).

For this week, pick fixed-size 512 with 50-token overlap as your default. We'll compare strategies in a later iteration.

def chunk_fixed(text, chunk_size=512, overlap=50):
    # Use tiktoken for token-accurate chunking
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = enc.decode(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

Apply to all docs. Tag each chunk with (doc_id, chunk_idx).
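One way to do the apply-and-tag step, as a sketch: it assumes each document has been loaded into a dict with doc_id and text keys (adjust to however you stored the corpus in Session A).

chunks = []
for doc in docs:  # docs: list of {"doc_id": ..., "text": ...} loaded from evals/corpus/
    for idx, text in enumerate(chunk_fixed(doc["text"])):
        chunks.append({"doc_id": doc["doc_id"], "chunk_idx": idx, "text": text})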

Part 2-BM25 implementation (75 min)

from rank_bm25 import BM25Okapi

# Tokenize for BM25 (simple lowercase split-sufficient for English)
def tokenize(text):
    import re
    return re.findall(r'\w+', text.lower())

bm25 = BM25Okapi([tokenize(c["text"]) for c in chunks])

def search_bm25(query: str, k: int = 10):
    scores = bm25.get_scores(tokenize(query))
    top = scores.argsort()[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

Run for all 30 queries. Save (query_id, retrieved_chunk_ids) per query.
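A sketch of running and persisting the retrievals; the output file name and the doc_id#chunk_idx ID format are just suggestions, and queries is the labeled list loaded in Session A.

import json

# Save top-10 chunk IDs per query so Session C can score them without re-running retrieval.
with open("evals/bm25_retrievals.jsonl", "w") as f:
    for q in queries:
        hits = search_bm25(q["query"], k=10)
        f.write(json.dumps({
            "query_id": q["query_id"],
            "retrieved_chunk_ids": [f'{c["doc_id"]}#{c["chunk_idx"]}' for c, _ in hits],
        }) + "\n")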

Part 3-Sanity check (45 min)

Manually inspect 5 query results. Are top-1 hits sensible? If not, the corpus or queries need refinement.
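A quick way to eyeball results, assuming the queries list and the chunk dicts from earlier:

import random

# Spot-check: print the top-3 hits for five random queries and judge them by hand.
for q in random.sample(queries, 5):
    print(f"\n=== {q['query']} ===")
    for chunk, score in search_bm25(q["query"], k=3):
        print(f"  [{score:.2f}] {chunk['doc_id']}#{chunk['chunk_idx']}: {chunk['text'][:80]}...")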

Output of Session B

  • Chunked corpus.
  • BM25 index.
  • Top-10 retrievals saved per query.

Session C-NDCG, MRR, write up

Goal: Implement NDCG@10 and MRR from scratch. Compute on BM25 results. Document.

Part 1-Implement metrics (75 min)

MRR (Mean Reciprocal Rank):

def reciprocal_rank(retrieved_doc_ids, relevant_doc_ids):
    """1/rank of first relevant doc; 0 if none in retrieved."""
    for i, d in enumerate(retrieved_doc_ids, start=1):
        if d in relevant_doc_ids:
            return 1.0 / i
    return 0.0

def mrr(results):
    return sum(reciprocal_rank(r["retrieved"], r["relevant"]) for r in results) / len(results)

NDCG@10:

import math

def dcg_at_k(retrieved_doc_ids, relevant_doc_ids, k=10):
    """Discounted Cumulative Gain. Binary relevance: 1 if relevant else 0."""
    return sum(
        (1.0 if d in relevant_doc_ids else 0.0) / math.log2(i + 2)
        for i, d in enumerate(retrieved_doc_ids[:k])
    )

def ndcg_at_k(retrieved_doc_ids, relevant_doc_ids, k=10):
    actual = dcg_at_k(retrieved_doc_ids, relevant_doc_ids, k)
    # Ideal: all relevant docs at the top
    n_rel = min(len(relevant_doc_ids), k)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(n_rel))
    return actual / ideal if ideal > 0 else 0.0

def mean_ndcg_at_k(results, k=10):
    return sum(ndcg_at_k(r["retrieved"], r["relevant"], k) for r in results) / len(results)

Why these: MRR captures "did the user get a relevant result fast?"; NDCG captures "is the full list well-ordered?". Both matter for RAG.
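Before trusting the numbers, check the implementations on a toy case you can verify by hand. With relevant docs {A, C} and retrieved list [B, A, D, C], the first relevant hit is at rank 2, so RR = 0.5; DCG = 1/log2(3) + 1/log2(5) ≈ 1.06, ideal DCG = 1/log2(2) + 1/log2(3) ≈ 1.63, so NDCG ≈ 0.65.

# Hand-checkable toy case: relevant = {A, C}, retrieved = [B, A, D, C].
toy = [{"retrieved": ["B", "A", "D", "C"], "relevant": {"A", "C"}}]
assert abs(mrr(toy) - 0.5) < 1e-9               # first relevant at rank 2 -> RR = 1/2
assert abs(mean_ndcg_at_k(toy) - 0.65) < 0.01   # DCG ~1.06 / ideal ~1.63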

Part 2-Compute and report (60 min)
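The metric functions expect a results list of {"retrieved": [...], "relevant": {...}} records at the document level. One way to assemble it, assuming the queries list and search_bm25 from Session B; note the labels are per document, so multiple chunk hits from the same doc are collapsed.

results = []
for q in queries:
    doc_ids = []
    for chunk, _score in search_bm25(q["query"], k=10):
        if chunk["doc_id"] not in doc_ids:   # keep the first (highest-ranked) hit per doc
            doc_ids.append(chunk["doc_id"])
    results.append({"retrieved": doc_ids, "relevant": set(q["relevant_doc_ids"])})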

print(f"BM25 baseline: NDCG@10 = {mean_ndcg_at_k(results):.4f}, MRR = {mrr(results):.4f}")

Likely numbers: NDCG@10 = 0.5–0.7, MRR = 0.4–0.6 (depends on corpus difficulty).

This is your baseline. Every later approach (dense, hybrid, rerank) compares against these numbers.

Part 3-Write up + push (45 min)

evals/retrieval_eval.ipynb:

## BM25 baseline
- 30 queries, 100-doc corpus chunked at 512 tokens.
- NDCG@10 = 0.612
- MRR     = 0.534
- Median rank of first relevant: 2.

Failure modes observed:
1. Lexical mismatch: query "outage" misses docs that say "downtime".
2. Synonyms: "checkout-api" docs missed by queries using "purchase service".

These are the cases dense retrieval should help with; we'll measure that next week.

Push.

Output of Session C

  • NDCG@10 + MRR implemented from scratch.
  • Baseline metrics documented.
  • Notebook published.

End-of-week artifact

  • Corpus + 30 labeled queries committed
  • BM25 baseline retrievals saved
  • NDCG@10 and MRR implemented from scratch
  • Baseline numbers documented in README
  • Failure mode observations recorded

End-of-week self-assessment

  • I can implement NDCG and MRR without looking them up.
  • I can defend "BM25 first" reasoning.
  • I have measurable retrieval baselines.

Common failure modes for this week

  • Skipping BM25. Don't.
  • Synthetic queries that look like the docs. Add adversarial cases that paraphrase.
  • Vague metric reporting. Always pair NDCG with the corpus and query-set descriptions.

What's next (preview of M05-W02)

Dense retrieval. Embeddings + vector DB. Compare to BM25. The semantic vs lexical comparison is one of RAG's most informative diagnostics.
