10: Retrieval & RAG
Why this matters in the journey
Most production LLM systems are retrieval-augmented because no model knows your data. RAG is also where most LLM teams produce mediocre results: the gap between "set up a vector DB" and "shipped a retrieval system that beats baselines" is enormous. Closing that gap is one of the most marketable skills in 2026 AI engineering.
The rungs
Rung 01: Why retrieval at all (problem framing)
- What: LLMs are stateless and have a context limit. To answer questions over private data, you must fetch relevant chunks and put them in context.
- Why it earns its place: Frame the problem before the tool. Many teams reach for a vector DB when keyword search would have worked.
- Resource: Anthropic "Contextual Retrieval" blog (search "anthropic contextual retrieval"). Plus Pinecone's "What is RAG?" intro.
- Done when: You can articulate when RAG is the right pattern vs fine-tuning vs long-context.
Rung 02: Lexical search (BM25)
- What: Classical keyword search. TF-IDF descendant. Often shockingly competitive.
- Why it earns its place: BM25 is the baseline every RAG system must beat. Teams skip it and ship worse systems.
- Resource: `rank_bm25` Python library docs. Plus the BM25 Wikipedia article (it's actually good).
- Done when: You can build a BM25 index over a corpus and retrieve top-k for a query (see the sketch below).
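A minimal sketch with the `rank_bm25` library; the corpus and query are toy placeholders:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "RAG systems retrieve relevant chunks before generation.",
    "BM25 is a classical lexical ranking function, a TF-IDF descendant.",
    "Vector databases index dense embeddings at scale.",
]

# rank_bm25 expects pre-tokenized documents; a lowercased whitespace split
# is the simplest tokenizer and good enough for a first baseline.
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "what is bm25 ranking".split()
print(bm25.get_scores(query))               # one score per document
print(bm25.get_top_n(query, corpus, n=2))   # top-k documents, best first
```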
Rung 03: Embeddings: dense semantic search
- What: Encode each document chunk as a vector via a sentence-embedding model. At query time, encode the query, find nearest vectors.
- Why it earns its place: Captures semantic similarity that BM25 can't. The standard "RAG" approach.
- Resource: `sentence-transformers` library docs (sbert.net). Plus the Sentence-BERT paper (arxiv.org/abs/1908.10084).
- Done when: You can encode a corpus, store in NumPy, run cosine similarity, and retrieve top-k (see the sketch below).
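A minimal sketch with `sentence-transformers` plus NumPy; `all-MiniLM-L6-v2` is just one common small model, and the corpus is again a toy placeholder:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "RAG systems retrieve relevant chunks before generation.",
    "BM25 is a classical lexical ranking function.",
    "Vector databases index dense embeddings at scale.",
]

# With normalized vectors, cosine similarity reduces to a dot product.
doc_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode(["how does semantic search work"],
                         normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec
for i in np.argsort(scores)[::-1][:2]:  # top-2, best first
    print(f"{scores[i]:.3f}  {corpus[i]}")
```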
Rung 04: Vector databases
- What: Specialized stores for high-dimensional vector search at scale (HNSW, IVF indices). Examples: Qdrant, Weaviate, Pinecone, pgvector.
- Why it earns its place: Once you have >100k chunks, NumPy doesn't cut it. Knowing one vector DB well is enough.
- Resource: Pick one and go deep. I recommend pgvector (Postgres extension) if you want minimal infra, Qdrant if you want best-in-class.
- Done when: You can stand up a vector DB, ingest a corpus, and query it (see the sketch below).
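A sketch against Qdrant's in-memory mode, assuming the `qdrant-client` Python package; the collection name is arbitrary and the random vectors are stand-ins for real embeddings:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # swap for QdrantClient(url=...) in production
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 = MiniLM dim
)

# Random vectors as stand-ins for real embeddings.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 384)).astype(np.float32)
texts = ["chunk one", "chunk two", "chunk three"]

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=v.tolist(), payload={"text": t})
        for i, (v, t) in enumerate(zip(vecs, texts))
    ],
)

hits = client.search(collection_name="docs", query_vector=vecs[0].tolist(), limit=2)
for hit in hits:
    print(hit.score, hit.payload["text"])
```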
Rung 05: Chunking strategies
- What: Split documents into retrievable units. Strategies: fixed-size, sentence, paragraph, recursive, semantic. Overlap between chunks.
- Why it earns its place: Bad chunking = bad retrieval. Underrated lever.
- Resource: Greg Kamradt's "5 Levels of Chunking" video (search "kamradt chunking strategies").
- Done when: You can articulate the tradeoffs of fixed-size vs semantic chunking and have tried at least two (two simple ones are sketched below).
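Two of the simpler strategies side by side; the sizes and overlap are arbitrary starting points to tune against your own eval, not recommendations:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows. Overlap means a sentence cut at one
    boundary still appears whole in the neighboring chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_paragraphs(text: str) -> list[str]:
    """Paragraph chunks: respects the author's structure, but chunk sizes
    vary wildly, which can hurt embedding quality."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = ("First paragraph about retrieval.\n\n"
       "Second paragraph about chunking tradeoffs.")
print(chunk_fixed(doc, size=40, overlap=10))
print(chunk_paragraphs(doc))
```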
Rung 06: Hybrid search and reranking
- What: Combine BM25 + dense via Reciprocal Rank Fusion. Then rerank top results with a cross-encoder for precision.
- Why it earns its place: Hybrid + rerank is the modern best-practice baseline. It typically beats either alone by 10–30% on real datasets.
- Resource: Cohere's reranker docs. Plus `bge-reranker` from BAAI (open-source, on Hugging Face).
- Done when: You can implement hybrid retrieval + reranking and measure the lift over each component alone (see the sketch below).
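A sketch of Reciprocal Rank Fusion followed by a cross-encoder rerank. The document ids and passages are placeholders, and loading `BAAI/bge-reranker-base` through sentence-transformers' CrossEncoder is an assumption worth checking against the model card:

```python
from sentence_transformers import CrossEncoder

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge best-first id lists.
    k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["d3", "d1", "d7"]   # placeholder lexical ranking
dense_ids = ["d1", "d9", "d3"]  # placeholder dense ranking
fused = rrf([bm25_ids, dense_ids])

# Rerank the fused candidates with a cross-encoder for precision at the top.
reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "how does hybrid search work"
passages = {"d1": "...", "d3": "...", "d7": "...", "d9": "..."}  # id -> text
rerank_scores = reranker.predict([(query, passages[d]) for d in fused])
reranked = [d for _, d in sorted(zip(rerank_scores, fused), reverse=True)]
print(reranked)
```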
Rung 07: Retrieval evaluation
- What: NDCG, MRR, recall@k, precision@k. Pick the right one for your task.
- Why it earns its place: You cannot improve what you don't measure. Most RAG systems have no retrieval eval.
- Resource: Introduction to Information Retrieval (Manning, Raghavan, Schütze; free online), chapter 8.
- Done when: You can compute NDCG@10 and MRR for a query set with labeled gold passages (see the sketch below).
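Hand-rolled versions of both metrics, assuming binary relevance (a set of gold passage ids per query):

```python
import math

def reciprocal_rank(ranked: list[str], gold: set[str]) -> float:
    """1/rank of the first relevant hit; 0 if nothing relevant is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], gold: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG discounts hits by log2(rank + 1),
    normalized by the best achievable DCG for this query."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# One (ranked, gold) pair per labeled query; MRR is the mean over queries.
queries = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5"], {"d9"})]
print(sum(reciprocal_rank(r, g) for r, g in queries) / len(queries))  # MRR = 0.25
print(ndcg_at_k(["d3", "d1", "d7"], {"d1"}, k=10))
```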
Rung 08: End-to-end RAG evaluation
- What: Beyond retrieval, evaluate answer quality: faithfulness (no hallucination), answer relevance, context precision/recall.
- Why it earns its place: Retrieval can be perfect and answers still be bad. End-to-end eval is what you ship on.
- Resource: RAGAS paper (arxiv.org/abs/2309.15217) and library (docs.ragas.io). Also Hamel Husain's "Eval Driven Development for RAG" posts.
- Done when: You can run RAGAS or a hand-rolled equivalent (sketched below) and get faithfulness + answer relevance scores.
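The core move in RAGAS-style faithfulness is decompose-then-verify: split the answer into claims, then ask a judge model whether the context supports each one. A hand-rolled sketch, where `call_llm` is a hypothetical stand-in for your model client:

```python
# Hand-rolled faithfulness in the spirit of RAGAS; `call_llm` is a
# hypothetical stand-in, not a real API.
VERIFY_PROMPT = """Context:
{context}

Claim: {claim}

Is the claim fully supported by the context? Answer only "yes" or "no"."""

def faithfulness(context: str, answer: str, call_llm) -> float:
    # Naive claim splitting on sentences; RAGAS uses an LLM for this step too.
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = sum(
        call_llm(VERIFY_PROMPT.format(context=context, claim=c))
        .strip().lower().startswith("yes")
        for c in claims
    )
    return supported / len(claims) if claims else 1.0
```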
Rung 09: Advanced retrieval techniques
- What: HyDE (hypothetical document embeddings), query rewriting, multi-query expansion, recursive retrieval, sentence-window retrieval, parent-document retrieval.
- Why it earns its place: Toolbox for when basic RAG plateaus. Each addresses a specific failure mode.
- Resource: LlamaIndex docs on advanced RAG patterns. Plus Anthropic Contextual Retrieval (search "anthropic contextual retrieval").
- Done when: You can identify which advanced technique addresses a specific failure mode you've observed (HyDE is sketched below).
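A sketch of HyDE, the first technique on the list: embed a hallucinated answer passage instead of the terse query. `call_llm` is again a hypothetical stand-in:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_query_vector(question: str, call_llm):
    # 1. Generate a hypothetical document that answers the question. It may
    #    be factually wrong; what matters is that it *looks like* an answer.
    fake_doc = call_llm(f"Write a short passage answering: {question}")
    # 2. Embed the fake document. It tends to land nearer to real answer
    #    passages in embedding space than the bare question does.
    return model.encode([fake_doc], normalize_embeddings=True)[0]

# Use the returned vector for nearest-neighbor search exactly as in rung 03.
```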
Rung 10: Long-context vs RAG (a 2025+ debate)
- What: Frontier models with 200k–1M context windows can sometimes obviate RAG. When does each win?
- Why it earns its place: This decision shapes architecture. Knowing the tradeoffs (cost, latency, recall) is judgment.
- Resource: Lost in the Middle paper (arxiv.org/abs/2307.03172). Plus blog posts comparing long-context vs RAG (search "long context vs rag 2024").
- Done when: You can argue both sides of "should we use long-context instead of RAG" with cost and quality data (the cost arithmetic is sketched below).
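The cost half of that argument is simple arithmetic. The price below is a made-up placeholder, not any provider's real rate, and prompt caching can change the picture substantially:

```python
PRICE_PER_M_INPUT_TOKENS = 3.00  # USD; placeholder, plug in your provider's rate

def input_cost(context_tokens: int) -> float:
    """Input-token cost of a single query at the placeholder rate."""
    return context_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

# Stuffing a 500k-token corpus into context every query vs retrieving ~4k tokens:
print(input_cost(500_000))  # long-context: $1.50 per query
print(input_cost(4_000))    # RAG:          $0.012 per query
```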
Rung 11: RAG observability
- What: Trace retrieval-then-generation pipelines. Capture top-k results per query, faithfulness scores, eval drift over time.
- Why it earns its place: Production RAG quality drifts as data changes. Observability is the only safety net.
- Resource: Langfuse, LangSmith, or W&B Weave docs; pick one (also covered in sequence 13).
- Done when: Your RAG system has traces showing retrieval, generation, and end-to-end metrics in a dashboard (a minimal hand-rolled trace is sketched below).
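The dedicated tools give you this out of the box, but the underlying record is simple. A hand-rolled JSONL trace sketch with illustrative field names:

```python
import json
import time
import uuid

def log_trace(query: str, retrieved: list[tuple[str, float]], answer: str,
              evals: dict, path: str = "rag_traces.jsonl") -> None:
    """Append one retrieval-then-generation trace as a JSON line."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [{"doc_id": d, "score": s} for d, s in retrieved],
        "answer": answer,
        "evals": evals,  # e.g. {"faithfulness": 0.9, "ndcg@10": 0.7}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```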
Minimum required to leave this sequence
- BM25 baseline working on your corpus.
- Dense retrieval with sentence-transformers + a vector DB.
- At least two chunking strategies tried and compared.
- Hybrid retrieval + reranking implemented.
- Retrieval eval (NDCG@10, MRR) computed on labeled data.
- End-to-end faithfulness eval running.
Going further
- Introduction to Information Retrieval (Manning et al., free online): chapters 6–9.
- Pinecone's RAG learning hub: well-organized free resources.
- LlamaIndex docs: even if you don't use it, the patterns are well-explained.
How this sequence connects to the year
- Month 5: This sequence IS month 5.
- Month 6: Eval rigor (rungs 07–08) compounds into the eval sequence.
- Q3 (if Track A, Evals): Building eval frameworks for RAG is your wheelhouse.