
Month 5-Week 3: Hybrid retrieval, reranking, contextual retrieval

Week summary

  • Goal: Combine BM25 + dense via Reciprocal Rank Fusion. Add a cross-encoder reranker on top-50. Try Anthropic Contextual Retrieval. Quantify each step's lift with bootstrap CIs.
  • Time: ~9 h over 3 sessions.
  • Output: Hybrid + rerank pipeline; contextual retrieval experiment; 4-way comparison table.
  • Sequences relied on: 10-retrieval-and-rag rungs 06, 09.

Why this week matters

Hybrid + rerank is the modern best-practice baseline. Most teams stop at dense retrieval alone and miss out on a 10–30% quality gain. This week installs the discipline of stacking complementary techniques and measuring each addition.

Anthropic Contextual Retrieval (released 2024) is a clever technique that prepends short context to each chunk before embedding. Trying it is both useful and a marker of staying current with the field.

Prerequisites

  • M05-W01 + W02 complete.
  • BM25 + dense retrieval already running.

Sessions

  • Session A-Tue/Wed evening (~3 h): RRF + reranking
  • Session B-Sat morning (~3.5 h): contextual retrieval
  • Session C-Sun afternoon (~2.5 h): final comparison + draft post

Session A-Reciprocal Rank Fusion + reranking

Goal: Implement RRF to combine BM25 + dense. Add a reranker. Measure both.

Part 1-Reciprocal Rank Fusion (60 min)

RRF intuition: for each query, every retrieval method produces a ranked list. RRF scores each document by summing 1/(k + rank) across the lists it appears in; the constant k damps the influence of the very top ranks so no single method dominates. Documents that rank well in either method bubble up.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: list of ranked lists of doc_ids. Returns combined ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# For each query, get top-100 from BM25 and dense; combine with RRF
def search_hybrid(query: str, k: int = 10):
    bm25_top = [c["chunk_id"] for c, _ in search_bm25(query, k=100)]
    dense_top = [c["chunk_id"] for c, _ in search_dense(query, k=100)]
    fused = rrf([bm25_top, dense_top])
    return fused[:k]

Evaluate. Likely:

Hybrid (BM25 + dense, RRF): NDCG@10 = 0.756 (vs 0.732 dense large)

Hybrid usually beats either alone by 1–4 points NDCG.

Part 2-Reranking (60 min)

A cross-encoder reranker scores each (query, document) pair with a fine-tuned model that attends over both texts jointly. That makes it more accurate than a bi-encoder but slow per pair, so apply it only to the top-K candidates from a faster retriever.

Pick a reranker:

  • BAAI/bge-reranker-large: open-source, strong.
  • BAAI/bge-reranker-v2-m3: newer, multilingual.
  • Cohere Rerank API: commercial, very strong, free tier.

from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large")

def search_with_rerank(query: str, k: int = 10, rerank_top_n: int = 50):
    candidates = search_hybrid(query, k=rerank_top_n)
    pairs = [(query, chunk_text(cid)) for cid in candidates]
    scores = reranker.predict(pairs, batch_size=8)
    sorted_idx = scores.argsort()[::-1]
    return [candidates[i] for i in sorted_idx[:k]]

Evaluate. Likely:

Hybrid + rerank (top-50): NDCG@10 = 0.798 (+0.042 over hybrid)

The largest single lift in modern RAG often comes from reranking. Document this.

Part 3-Latency cost (30 min)

Reranking is slow. Measure:

  • Top-50 reranking with bge-reranker-large on CPU: ~2 sec.
  • Same on GPU: ~150 ms.
  • Cohere API: ~200 ms over network.

Document tradeoff. For sub-second latency on CPU, consider rerank top-10 only or use a smaller reranker.
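
One way to produce these numbers: time each search function over the eval queries and report the 95th percentile. A minimal sketch (eval_queries stands in for whatever query list the eval harness already uses):

import time
import numpy as np

def p95_latency_ms(search_fn, queries, warmup=3):
    """Rough p95 latency (ms) of a search function over a list of queries."""
    for q in queries[:warmup]:        # warm up the model / caches before timing
        search_fn(q)
    times_ms = []
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return float(np.percentile(times_ms, 95))

# e.g. compare p95_latency_ms(search_hybrid, eval_queries)
#      with    p95_latency_ms(search_with_rerank, eval_queries)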

Output of Session A

  • RRF implementation.
  • Reranking pipeline.
  • Comparison: BM25 → dense → hybrid → hybrid+rerank.

Session B-Anthropic Contextual Retrieval

Goal: Implement Contextual Retrieval (prepending generated context to each chunk before embedding). Quantify the effect.

Part 1-Read + design (45 min)

Read: Anthropic's "Introducing Contextual Retrieval" at anthropic.com/news/contextual-retrieval (search if the URL changes).

The technique:

  1. For each chunk, prompt Claude with the whole document and the chunk; ask Claude to produce a 50–100 token "context" describing where the chunk sits in the document.
  2. Prepend the context to the chunk before embedding (and before BM25 indexing).
  3. The embedding now reflects the chunk's role in its document, not just its surface text.

This is cheap with prompt caching: the whole document sits in the cached prefix, so only the per-chunk portion (the chunk text and the generated context) is paid for on every call.

Part 2-Implement (90 min)

# The context prompt, split so the (long) document sits in a cacheable prefix
# while the per-chunk request varies.
DOC_PROMPT = """<document>
{doc_text}
</document>"""

CHUNK_PROMPT = """Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>

Please give a short (50-100 token) succinct context that situates this chunk within
the overall document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""

def generate_context(doc_text: str, chunk_text: str) -> str:
    resp = client.messages.create(
        model="claude-haiku-4-5",  # cheap; fine for this
        max_tokens=128,
        messages=[{
            "role": "user",
            "content": [
                # The document block is marked for caching, so repeated calls for
                # chunks of the same document reuse the cached prefix.
                {"type": "text",
                 "text": DOC_PROMPT.format(doc_text=doc_text),
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text",
                 "text": CHUNK_PROMPT.format(chunk_text=chunk_text)},
            ],
        }],
    )
    return resp.content[0].text.strip()

# For each chunk, generate context, then re-embed
contextualized = []
for c in chunks:
    ctx = generate_context(docs[c["doc_id"]], c["text"])
    c["text_with_context"] = ctx + "\n\n" + c["text"]
    contextualized.append(c)

new_embeds = model.encode([c["text_with_context"] for c in contextualized],
                          normalize_embeddings=True)
# Also re-build BM25 index using contextualized text
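
To make that last comment concrete, here is a minimal sketch of rebuilding the lexical index over the contextualized text, assuming rank_bm25 as a stand-in for whatever BM25 implementation the existing pipeline uses:

from rank_bm25 import BM25Okapi

# Tokenize the context-prepended text and rebuild the BM25 index over it
tokenized = [c["text_with_context"].lower().split() for c in contextualized]
bm25_ctx = BM25Okapi(tokenized)

def search_bm25_ctx(query: str, k: int = 100):
    scores = bm25_ctx.get_scores(query.lower().split())
    top = scores.argsort()[::-1][:k]
    return [(contextualized[i], float(scores[i])) for i in top]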

Cost note: with caching, context generation is cheap, on the order of cents per hundred chunks; even batched over a 1000-chunk corpus the whole run comes to a few dollars at most.

Part 3-Evaluate (45 min)

Re-run hybrid + rerank on the contextualized embeddings:

Hybrid + rerank (contextualized): NDCG@10 = 0.823 (+0.025 over non-contextualized)

Honest take: gains are real but smaller than the rerank lift. Worth it for production-grade systems; maybe not for prototypes.

Output of Session B

  • Contextual chunks generated and re-indexed.
  • Comparison vs non-contextualized.

Session C-Final comparison + start writeup

Goal: Build the master comparison table with bootstrap CIs. Start the M05 blog post.

Part 1-Master comparison (60 min)

methods = {
    "bm25": results_bm25,
    "dense_small": results_dense_small,
    "dense_large": results_dense_large,
    "hybrid": results_hybrid,
    "hybrid_rerank": results_hybrid_rerank,
    "contextual_hybrid_rerank": results_contextual_hr,
}

# Per-query NDCG@10 → bootstrap CI
import numpy as np
def bootstrap_ci(per_query_scores, n=10000):
    arr = np.array(per_query_scores)
    boots = [np.random.choice(arr, len(arr), replace=True).mean() for _ in range(n)]
    return arr.mean(), np.percentile(boots, [2.5, 97.5])

for name, results in methods.items():
    scores = [ndcg_at_k(r["retrieved"], r["relevant"]) for r in results]
    mean, ci = bootstrap_ci(scores)
    print(f"{name}: NDCG@10 = {mean:.4f} (95% CI [{ci[0]:.4f}, {ci[1]:.4f}])")
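
The per-method CIs say how uncertain each mean is, but the week's goal is to quantify each step's lift. One way to put a CI on the lift itself is a paired bootstrap over per-query score differences; a minimal sketch, assuming the per-query NDCG@10 lists for two methods are aligned on the same queries:

def bootstrap_lift_ci(scores_a, scores_b, n=10000):
    """Paired bootstrap CI for the mean per-query lift of method B over method A."""
    diffs = np.array(scores_b) - np.array(scores_a)   # per-query lift
    boots = [np.random.choice(diffs, len(diffs), replace=True).mean()
             for _ in range(n)]
    return diffs.mean(), np.percentile(boots, [2.5, 97.5])

# Hypothetical usage, assuming per-query scores were kept per method:
# lift, ci = bootstrap_lift_ci(per_query["hybrid"], per_query["hybrid_rerank"])
# A 95% CI that excludes 0 is evidence the reranking step actually helped.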

Honest reporting includes CIs. The final table:

Method                          NDCG@10   95% CI            Latency p95
BM25                            0.612     [0.564, 0.658]    5 ms
Dense small                     0.687     [0.640, 0.732]    8 ms
Dense large                     0.732     [0.689, 0.774]    14 ms
Hybrid                          0.756     [0.715, 0.794]    20 ms
Hybrid + rerank                 0.798     [0.762, 0.832]    180 ms (GPU)
Contextual + hybrid + rerank    0.823     [0.789, 0.853]    180 ms

Part 2-Begin blog post (60 min)

Title: "What actually moved retrieval quality on my dataset, measured."

Outline:

  1. The corpus and queries.
  2. BM25 baseline (always start here).
  3. Dense retrieval; what improved, what didn't.
  4. Hybrid via RRF.
  5. The rerank lift (often the biggest).
  6. Anthropic Contextual Retrieval: incremental but real.
  7. Cost-latency tradeoffs.
  8. What I'd do differently.

Draft 1500 words this session; finish next week.

Part 3-Push + retro (30 min)

Push v0.5.0. Update LEARNING_LOG.md.

Output of Session C

  • Master comparison with bootstrap CIs.
  • Blog post draft (1500 words).

End-of-week artifact

  • Hybrid retrieval via RRF
  • Reranking pipeline
  • Contextual retrieval experiment
  • 4–6 method comparison table with bootstrap CIs

End-of-week self-assessment

  • I can implement RRF from scratch.
  • I can defend each step's lift with data, not folklore.
  • I can articulate when reranking is worth its latency cost.

Common failure modes for this week

  • Skipping CIs. Without them, "improvements" are noise.
  • Reranking everything (no top-K cap). Latency explodes.
  • Treating contextual retrieval as a silver bullet. It's a 2–3 point lift, not a 20-point one.

What's next (preview of M05-W04)

End-to-end RAG eval: faithfulness, answer relevance, context precision/recall (RAGAS). Plus publish the M05 blog post.
