
10-Retrieval & RAG

Why this matters in the journey

Most production LLM systems are retrieval-augmented because no model knows your data. RAG is also where most LLM teams produce mediocre results-the gap between "set up a vector DB" and "shipped a retrieval system that beats baselines" is enormous. Closing that gap is one of the most marketable skills in 2026 AI engineering.

The rungs

Rung 01-Why retrieval at all (problem framing)

  • What: LLMs are stateless and have a context limit. To answer questions over private data, you must fetch relevant chunks and put them in context.
  • Why it earns its place: Frame the problem before the tool. Many teams reach for a vector DB when keyword search would have worked.
  • Resource: Anthropic "Contextual Retrieval" blog (search "anthropic contextual retrieval"). Plus Pinecone's "What is RAG?" intro.
  • Done when: You can articulate when RAG is the right pattern vs fine-tuning vs long-context.

Rung 02-Lexical search (BM25)

  • What: Classical keyword search, a descendant of TF-IDF. Often shockingly competitive (sketch below).
  • Why it earns its place: BM25 is the baseline every RAG system must beat. Teams skip it and ship worse systems.
  • Resource: rank_bm25 Python library docs. Plus the BM25 Wikipedia article (it's actually good).
  • Done when: You can build a BM25 index over a corpus and retrieve top-k for a query.
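
A minimal sketch with the rank_bm25 library; the toy corpus and whitespace tokenization are illustrative only, not a recommendation:

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a classical lexical ranking function.",
    "Dense retrieval encodes text as vectors.",
    "RAG systems fetch relevant chunks into the context window.",
]

# rank_bm25 expects pre-tokenized documents; whitespace split is the
# crudest possible tokenizer, real systems stem and strip stopwords.
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "lexical ranking with bm25".lower().split()
top_k = bm25.get_top_n(query, corpus, n=2)  # returns the documents themselves
print(top_k)
```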

Rung 03-Dense retrieval (embeddings)

  • What: Encode each document chunk as a vector with a sentence-embedding model. At query time, encode the query and find the nearest document vectors (sketch below).
  • Why it earns its place: Captures semantic similarity that BM25 can't. The standard "RAG" approach.
  • Resource: sentence-transformers library docs (sbert.net). Plus the Sentence-BERT paper (arxiv.org/abs/1908.10084).
  • Done when: You can encode a corpus, store in NumPy, run cosine similarity, and retrieve top-k.
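
A sketch with sentence-transformers and plain NumPy; the model name is a common default, not a requirement:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

corpus = [
    "BM25 is a classical lexical ranking function.",
    "Dense retrieval encodes text as vectors.",
    "RAG systems fetch relevant chunks into the context window.",
]

# normalize_embeddings=True makes dot product equal cosine similarity
doc_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode(["how are documents represented as vectors"],
                         normalize_embeddings=True)

sims = doc_vecs @ query_vec.T          # cosine similarity per chunk
top_k = np.argsort(-sims.ravel())[:2]  # indices of the 2 best chunks
print([corpus[i] for i in top_k])
```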

Rung 04-Vector databases

  • What: Specialized stores for high-dimensional vector search at scale (HNSW, IVF indices); sketch below. Examples: Qdrant, Weaviate, Pinecone, pgvector.
  • Why it earns its place: Once you have >100k chunks, NumPy doesn't cut it. Knowing one vector DB well is enough.
  • Resource: Pick one and go deep. I recommend pgvector (Postgres extension) if you want minimal infra, Qdrant if you want a best-in-class dedicated engine.
  • Done when: You can stand up a vector DB, ingest a corpus, and query it.
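
A sketch using qdrant-client's in-memory mode so it runs without a server; the collection name and toy vectors are made up, and the query API differs slightly across client versions:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # swap for QdrantClient(url=...) in production

client.create_collection(
    collection_name="chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Toy vectors; in practice these come from your embedding model.
vecs = np.random.rand(3, 384).tolist()
texts = ["chunk one", "chunk two", "chunk three"]
client.upsert(
    collection_name="chunks",
    points=[PointStruct(id=i, vector=v, payload={"text": t})
            for i, (v, t) in enumerate(zip(vecs, texts))],
)

hits = client.search(collection_name="chunks",
                     query_vector=np.random.rand(384).tolist(), limit=2)
for hit in hits:
    print(hit.payload["text"], hit.score)
```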

Rung 05-Chunking strategies

  • What: Split documents into retrievable units. Strategies: fixed-size, sentence, paragraph, recursive, semantic. Add overlap between chunks so boundaries don't cut context mid-sentence (sketch below).
  • Why it earns its place: Bad chunking = bad retrieval. Underrated lever.
  • Resource: Greg Kamradt's "5 Levels of Chunking" video (search "kamradt chunking strategies").
  • Done when: You can articulate the tradeoffs of fixed-size vs semantic chunking and have tried at least two.
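
For reference, the simplest strategy (fixed-size with overlap) is only a few lines; sizes here count characters for brevity, though production chunkers usually count tokens:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters
    with their neighbor, so nothing is lost entirely at a boundary."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_fixed("some long document " * 200)
print(len(chunks), len(chunks[0]))
```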

Rung 06-Hybrid search and reranking

  • What: Combine BM25 + dense retrieval via Reciprocal Rank Fusion, then rerank the top results with a cross-encoder for precision (sketch below).
  • Why it earns its place: Hybrid + rerank is the modern best-practice baseline. It typically beats either alone by 10–30% on real datasets.
  • Resource: Cohere's reranker docs. Plus bge-reranker from BAAI (open-source, on Hugging Face).
  • Done when: You can implement hybrid retrieval + reranking and measure the lift over each component alone.
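
A sketch of Reciprocal Rank Fusion followed by cross-encoder reranking; k=60 is the constant from the original RRF paper, and the reranker model name is one common open option, not the only choice:

```python
from sentence_transformers import CrossEncoder

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each list is doc ids ordered best-first; fuse by summing 1/(k + rank).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d1", "d3", "d2"],   # BM25 ranking
             ["d3", "d2", "d1"]])  # dense ranking

# Rerank the fused candidates with a cross-encoder for precision.
docs = {"d1": "text of d1", "d2": "text of d2", "d3": "text of d3"}
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [("the query", docs[d]) for d in fused]
scores = reranker.predict(pairs)  # one relevance score per (query, doc) pair
reranked = [d for _, d in sorted(zip(scores, fused), reverse=True)]
print(reranked)
```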

Rung 07-Retrieval evaluation

  • What: NDCG (normalized discounted cumulative gain), MRR (mean reciprocal rank), recall@k, precision@k. Pick the right one for your task (sketch below).
  • Why it earns its place: You cannot improve what you don't measure. Most RAG systems have no retrieval eval.
  • Resource: Introduction to Information Retrieval (Manning, Raghavan, Schütze; free online), chapter 8.
  • Done when: You can compute NDCG@10 and MRR for a query set with labeled gold passages.
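
Both metrics are small enough to hand-roll for binary relevance labels; `gold` here is the set of relevant doc ids for one query, and you average each metric over your query set:

```python
import math

def mrr(ranked: list[str], gold: set[str]) -> float:
    # Reciprocal rank of the first relevant result; 0 if none retrieved.
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], gold: set[str], k: int = 10) -> float:
    # Binary-relevance NDCG: discount each hit by log2(rank + 1),
    # then normalize by the best achievable DCG for this query.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0

print(mrr(["d2", "d1"], {"d1"}), ndcg_at_k(["d2", "d1"], {"d1"}))
```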

Rung 08-End-to-end RAG evaluation

  • What: Beyond retrieval, evaluate answer quality: faithfulness (no hallucination), answer relevance, context precision/recall.
  • Why it earns its place: Retrieval can be perfect and answers still be bad. End-to-end eval is what you ship on.
  • Resource: RAGAS paper (arxiv.org/abs/2309.15217) and library (docs.ragas.io). Also Hamel Husain's "Eval Driven Development for RAG" posts.
  • Done when: You can run RAGAS or a hand-rolled equivalent (sketch below) and get faithfulness + answer relevance scores.
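
A hedged sketch of the classic RAGAS entry point; the API has shifted across library versions (and it needs an LLM key configured to score), so treat this as the shape of the call rather than exact current signatures:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One row per (question, retrieved contexts, generated answer).
data = Dataset.from_dict({
    "question": ["What ranking function does lexical search use?"],
    "contexts": [["BM25 is a classical lexical ranking function."]],
    "answer": ["Lexical search typically uses BM25."],
})

# Scores each row with an LLM judge; requires an API key in the environment.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```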

Rung 09-Advanced retrieval techniques

  • What: HyDE (hypothetical document embeddings; sketch below), query rewriting, multi-query expansion, recursive retrieval, sentence-window retrieval, parent-document retrieval.
  • Why it earns its place: Toolbox for when basic RAG plateaus. Each addresses a specific failure mode.
  • Resource: LlamaIndex docs on advanced RAG patterns. Plus Anthropic Contextual Retrieval (search "anthropic contextual retrieval").
  • Done when: You can identify which advanced technique addresses a specific failure mode you've observed.
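
As one concrete example, HyDE is a few lines once you have an LLM call; llm_complete below is a hypothetical stand-in for whatever completion client you use:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in: swap for a real LLM client call.
    return "BM25 ranks documents by term frequency and inverse document frequency."

def hyde_query_vector(question: str):
    # Embed a *hypothetical answer* instead of the raw question: an
    # answer-shaped passage lives closer to real chunks in vector space.
    hypothetical = llm_complete(f"Write a short passage answering: {question}")
    return model.encode([hypothetical], normalize_embeddings=True)

vec = hyde_query_vector("What ranking function does lexical search use?")
print(vec.shape)
```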

Rung 10-Long-context vs RAG (a 2025+ debate)

  • What: Frontier models with 200k–1M context windows can sometimes obviate RAG. When does each win?
  • Why it earns its place: This decision shapes architecture. Knowing the tradeoffs (cost, latency, recall) is judgment.
  • Resource: Lost in the Middle paper (arxiv.org/abs/2307.03172). Plus blog posts comparing long-context vs RAG (search "long context vs rag 2024").
  • Done when: You can argue both sides of "should we use long-context instead of RAG" with cost and quality data.

Rung 11-RAG observability

  • What: Trace retrieval-then-generation pipelines. Capture top-k results per query, faithfulness scores, and eval drift over time (sketch below).
  • Why it earns its place: Production RAG quality drifts as data changes. Observability is the only safety net.
  • Resource: Langfuse, LangSmith, or W&B Weave docs-pick one (also covered in sequence 13).
  • Done when: Your RAG system has traces showing retrieval, generation, and end-to-end metrics in a dashboard.
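
Even before adopting a tool, the core record is simple; a hand-rolled sketch of the per-query trace you would otherwise send to Langfuse/LangSmith/Weave:

```python
import json
import time
import uuid

def log_rag_trace(query: str, retrieved: list[tuple[str, float]],
                  answer: str, faithfulness: float,
                  path: str = "traces.jsonl") -> None:
    # One JSON line per request: enough to replay retrieval decisions
    # and chart eval drift over time.
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "top_k": [{"doc_id": d, "score": s} for d, s in retrieved],
        "answer": answer,
        "faithfulness": faithfulness,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_rag_trace("what is bm25", [("d1", 0.91)], "BM25 is ...", 0.87)
```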

Minimum required to leave this sequence

  • BM25 baseline working on your corpus.
  • Dense retrieval with sentence-transformers + a vector DB.
  • At least two chunking strategies tried and compared.
  • Hybrid retrieval + reranking implemented.
  • Retrieval eval (NDCG@10, MRR) computed on labeled data.
  • End-to-end faithfulness eval running.

Going further

  • Introduction to Information Retrieval (Manning et al., free online)-chapters 6–9.
  • Pinecone's RAG learning hub-well-organized free resources.
  • LlamaIndex docs-even if you don't use it, the patterns are well-explained.

How this sequence connects to the year

  • Month 5: This sequence IS month 5.
  • Month 6: Eval rigor (rungs 07–08) compounds into the eval sequence.
  • Q3 (if Track A-Evals): Building eval frameworks for RAG is your wheelhouse.
