10: Retrieval & RAG
Why this matters in the journey
Most production LLM systems are retrieval-augmented because no model knows your data. RAG is also where most LLM teams produce mediocre results: the gap between "set up a vector DB" and "shipped a retrieval system that beats baselines" is enormous. Closing that gap is one of the most marketable skills in 2026 AI engineering.
The rungs
Rung 01: Why retrieval at all (problem framing)
- What: LLMs are stateless and have a context limit. To answer questions over private data, you must fetch relevant chunks and put them in context.
- Why it earns its place: Frame the problem before the tool. Many teams reach for a vector DB when keyword search would have worked.
- Resource: Anthropic "Contextual Retrieval" blog (search "anthropic contextual retrieval"). Plus Pinecone's "What is RAG?" intro.
- Done when: You can articulate when RAG is the right pattern vs fine-tuning vs long-context.
Rung 02: Lexical search (BM25)
- What: Classical keyword search. TF-IDF descendant. Often shockingly competitive.
- Why it earns its place: BM25 is the baseline every RAG system must beat. Teams skip it and ship worse systems.
- Resource: `rank_bm25` Python library docs. Plus the BM25 Wikipedia article (it's actually good).
- Done when: You can build a BM25 index over a corpus and retrieve top-k for a query (see the sketch below).
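A minimal sketch with the `rank_bm25` library; the corpus and query are toy placeholders:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "RAG systems retrieve relevant chunks before generation.",
    "BM25 is a classical lexical ranking function, a TF-IDF descendant.",
    "Vector databases index dense embeddings at scale.",
]

# rank_bm25 expects pre-tokenized documents; a lowercased whitespace split
# is the simplest tokenizer and good enough for a first baseline.
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "what is bm25 ranking".split()
print(bm25.get_scores(query))               # one score per document
print(bm25.get_top_n(query, corpus, n=2))   # top-k documents, best first
```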
Rung 03: Embeddings: dense semantic search
- What: Encode each document chunk as a vector via a sentence-embedding model. At query time, encode the query, find nearest vectors.
- Why it earns its place: Captures semantic similarity that BM25 can't. The standard "RAG" approach.
- Resource: `sentence-transformers` library docs (sbert.net). Plus the Sentence-BERT paper (arxiv.org/abs/1908.10084).
- Done when: You can encode a corpus, store in NumPy, run cosine similarity, and retrieve top-k (see the sketch below).
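A minimal sketch with `sentence-transformers` plus NumPy; `all-MiniLM-L6-v2` is just one common small model, and the corpus is again a toy placeholder:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "RAG systems retrieve relevant chunks before generation.",
    "BM25 is a classical lexical ranking function.",
    "Vector databases index dense embeddings at scale.",
]

# With normalized vectors, cosine similarity reduces to a dot product.
doc_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode(["how does semantic search work"],
                         normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec
for i in np.argsort(scores)[::-1][:2]:  # top-2, best first
    print(f"{scores[i]:.3f}  {corpus[i]}")
```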
Rung 04: Vector databases
- What: Specialized stores for high-dimensional vector search at scale (HNSW, IVF indices). Examples: Qdrant, Weaviate, Pinecone, pgvector.
- Why it earns its place: Once you have >100k chunks, NumPy doesn't cut it. Knowing one vector DB well is enough.
- Resource: Pick one and go deep. I recommend pgvector (Postgres extension) if you want minimal infra, Qdrant if you want best-in-class.
- Done when: You can stand up a vector DB, ingest a corpus, and query it (see the sketch below).
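A sketch against Qdrant's in-memory mode, assuming the `qdrant-client` Python package; the collection name is arbitrary and the random vectors are stand-ins for real embeddings:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # swap for QdrantClient(url=...) in production
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 = MiniLM dim
)

# Random vectors as stand-ins for real embeddings.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 384)).astype(np.float32)
texts = ["chunk one", "chunk two", "chunk three"]

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=v.tolist(), payload={"text": t})
        for i, (v, t) in enumerate(zip(vecs, texts))
    ],
)

hits = client.search(collection_name="docs", query_vector=vecs[0].tolist(), limit=2)
for hit in hits:
    print(hit.score, hit.payload["text"])
```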
Rung 05: Chunking strategies
- What: Split documents into retrievable units. Strategies: fixed-size, sentence, paragraph, recursive, semantic. Overlap between chunks.
- Why it earns its place: Bad chunking = bad retrieval. Underrated lever.
- Resource: Greg Kamradt's "5 Levels of Chunking" video (search "kamradt chunking strategies").
- Done when: You can articulate the tradeoffs of fixed-size vs semantic chunking and have tried at least two (two simple ones are sketched below).
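Two of the simpler strategies side by side; the sizes and overlap are arbitrary starting points to tune against your own eval, not recommendations:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows. Overlap means a sentence cut at one
    boundary still appears whole in the neighboring chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_paragraphs(text: str) -> list[str]:
    """Paragraph chunks: respects the author's structure, but chunk sizes
    vary wildly, which can hurt embedding quality."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = ("First paragraph about retrieval.\n\n"
       "Second paragraph about chunking tradeoffs.")
print(chunk_fixed(doc, size=40, overlap=10))
print(chunk_paragraphs(doc))
```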
Rung 06: Hybrid search and reranking
- What: Combine BM25 + dense via Reciprocal Rank Fusion. Then rerank top results with a cross-encoder for precision.
- Why it earns its place: Hybrid + rerank is the modern best-practice baseline. It typically beats either alone by 10–30% on real datasets.
- Resource: Cohere's reranker docs. Plus `bge-reranker` from BAAI (open-source, on Hugging Face).
- Done when: You can implement hybrid retrieval + reranking and measure the lift over each component alone (see the sketch below).
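A sketch of Reciprocal Rank Fusion followed by a cross-encoder rerank. The document ids and passages are placeholders, and loading `BAAI/bge-reranker-base` through sentence-transformers' CrossEncoder is an assumption worth checking against the model card:

```python
from sentence_transformers import CrossEncoder

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge best-first id lists.
    k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["d3", "d1", "d7"]   # placeholder lexical ranking
dense_ids = ["d1", "d9", "d3"]  # placeholder dense ranking
fused = rrf([bm25_ids, dense_ids])

# Rerank the fused candidates with a cross-encoder for precision at the top.
reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "how does hybrid search work"
passages = {"d1": "...", "d3": "...", "d7": "...", "d9": "..."}  # id -> text
rerank_scores = reranker.predict([(query, passages[d]) for d in fused])
reranked = [d for _, d in sorted(zip(rerank_scores, fused), reverse=True)]
print(reranked)
```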
Rung 07: Retrieval evaluation
- What: NDCG, MRR, recall@k, precision@k. Pick the right one for your task.
- Why it earns its place: You cannot improve what you don't measure. Most RAG systems have no retrieval eval.
- Resource: Introduction to Information Retrieval (Manning, Raghavan, Schütze; free online), chapter 8.
- Done when: You can compute NDCG@10 and MRR for a query set with labeled gold passages (see the sketch below).
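Hand-rolled versions of both metrics, assuming binary relevance (a set of gold passage ids per query):

```python
import math

def reciprocal_rank(ranked: list[str], gold: set[str]) -> float:
    """1/rank of the first relevant hit; 0 if nothing relevant is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], gold: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG discounts hits by log2(rank + 1),
    normalized by the best achievable DCG for this query."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# One (ranked, gold) pair per labeled query; MRR is the mean over queries.
queries = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5"], {"d9"})]
print(sum(reciprocal_rank(r, g) for r, g in queries) / len(queries))  # MRR = 0.25
print(ndcg_at_k(["d3", "d1", "d7"], {"d1"}, k=10))
```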
Rung 08: End-to-end RAG evaluation
- What: Beyond retrieval, evaluate answer quality: faithfulness (no hallucination), answer relevance, context precision/recall.
- Why it earns its place: Retrieval can be perfect and answers still be bad. End-to-end eval is what you ship on.
- Resource: RAGAS paper (arxiv.org/abs/2309.15217) and library (docs.ragas.io). Also Hamel Husain's "Eval Driven Development for RAG" posts.
- Done when: You can run RAGAS or a hand-rolled equivalent (sketched below) and get faithfulness + answer relevance scores.
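The core move in RAGAS-style faithfulness is decompose-then-verify: split the answer into claims, then ask a judge model whether the context supports each one. A hand-rolled sketch, where `call_llm` is a hypothetical stand-in for your model client:

```python
# Hand-rolled faithfulness in the spirit of RAGAS; `call_llm` is a
# hypothetical stand-in, not a real API.
VERIFY_PROMPT = """Context:
{context}

Claim: {claim}

Is the claim fully supported by the context? Answer only "yes" or "no"."""

def faithfulness(context: str, answer: str, call_llm) -> float:
    # Naive claim splitting on sentences; RAGAS uses an LLM for this step too.
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = sum(
        call_llm(VERIFY_PROMPT.format(context=context, claim=c))
        .strip().lower().startswith("yes")
        for c in claims
    )
    return supported / len(claims) if claims else 1.0
```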
Rung 09: Advanced retrieval techniques
- What: HyDE (hypothetical document embeddings), query rewriting, multi-query expansion, recursive retrieval, sentence-window retrieval, parent-document retrieval.
- Why it earns its place: Toolbox for when basic RAG plateaus. Each addresses a specific failure mode.
- Resource: LlamaIndex docs on advanced RAG patterns. Plus Anthropic Contextual Retrieval (search "anthropic contextual retrieval").
- Done when: You can identify which advanced technique addresses a specific failure mode you've observed (HyDE is sketched below).
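A sketch of HyDE, the first technique on the list: embed a hallucinated answer passage instead of the terse query. `call_llm` is again a hypothetical stand-in:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_query_vector(question: str, call_llm):
    # 1. Generate a hypothetical document that answers the question. It may
    #    be factually wrong; what matters is that it *looks like* an answer.
    fake_doc = call_llm(f"Write a short passage answering: {question}")
    # 2. Embed the fake document. It tends to land nearer to real answer
    #    passages in embedding space than the bare question does.
    return model.encode([fake_doc], normalize_embeddings=True)[0]

# Use the returned vector for nearest-neighbor search exactly as in rung 03.
```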
Rung 10: Long-context vs RAG (a 2025+ debate)
- What: Frontier models with 200k–1M context windows can sometimes obviate RAG. When does each win?
- Why it earns its place: This decision shapes architecture. Knowing the tradeoffs (cost, latency, recall) is judgment.
- Resource: Lost in the Middle paper (arxiv.org/abs/2307.03172). Plus blog posts comparing long-context vs RAG (search "long context vs rag 2024").
- Done when: You can argue both sides of "should we use long-context instead of RAG" with cost and quality data (the cost arithmetic is sketched below).
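The cost half of that argument is simple arithmetic. The price below is a made-up placeholder, not any provider's real rate, and prompt caching can change the picture substantially:

```python
PRICE_PER_M_INPUT_TOKENS = 3.00  # USD; placeholder, plug in your provider's rate

def input_cost(context_tokens: int) -> float:
    """Input-token cost of a single query at the placeholder rate."""
    return context_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

# Stuffing a 500k-token corpus into context every query vs retrieving ~4k tokens:
print(input_cost(500_000))  # long-context: $1.50 per query
print(input_cost(4_000))    # RAG:          $0.012 per query
```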
Rung 11: RAG observability
- What: Trace retrieval-then-generation pipelines. Capture top-k results per query, faithfulness scores, eval drift over time.
- Why it earns its place: Production RAG quality drifts as data changes. Observability is the only safety net.
- Resource: Langfuse, LangSmith, or W&B Weave docs; pick one (also covered in sequence 13).
- Done when: Your RAG system has traces showing retrieval, generation, and end-to-end metrics in a dashboard (a minimal hand-rolled trace is sketched below).
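The dedicated tools give you this out of the box, but the underlying record is simple. A hand-rolled JSONL trace sketch with illustrative field names:

```python
import json
import time
import uuid

def log_trace(query: str, retrieved: list[tuple[str, float]], answer: str,
              evals: dict, path: str = "rag_traces.jsonl") -> None:
    """Append one retrieval-then-generation trace as a JSON line."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [{"doc_id": d, "score": s} for d, s in retrieved],
        "answer": answer,
        "evals": evals,  # e.g. {"faithfulness": 0.9, "ndcg@10": 0.7}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```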
Minimum required to leave this sequence
- BM25 baseline working on your corpus.
- Dense retrieval with sentence-transformers + a vector DB.
- At least two chunking strategies tried and compared.
- Hybrid retrieval + reranking implemented.
- Retrieval eval (NDCG@10, MRR) computed on labeled data.
- End-to-end faithfulness eval running.
Going further
- Introduction to Information Retrieval (Manning et al., free online): chapters 6–9.
- Pinecone's RAG learning hub: well-organized free resources.
- LlamaIndex docs: even if you don't use it, the patterns are well-explained.
How this sequence connects to the year
- Month 5: This sequence IS month 5.
- Month 6: Eval rigor (rungs 07–08) compounds into the eval sequence.
- Q3 (if Track A, Evals): Building eval frameworks for RAG is your wheelhouse.