Week 22 - Retrieval-Augmented Generation: Doing It Properly

22.1 Conceptual Core

RAG fails in seven places, and a senior engineer must know each:

Ingestion: garbage-in (bad PDFs, lost layout, OCR errors).
Chunking: too big → diluted relevance; too small → loss of context. Try semantic / recursive / sentence-window strategies; benchmark them.
Embedding: model choice (text-embedding-3-large, bge-large, nomic-embed, voyage-3), normalization, dimension, multilingual support.
Indexing: HNSW (hnswlib, faiss, usearch), IVF-PQ for scale, keyword (bm25s, tantivy), hybrid (dense + sparse + reranker).
Retrieval: top-K, MMR for diversity, query rewriting / HyDE, query routing.
Reranking: a cross-encoder reranker (bge-reranker, Cohere rerank, Voyage rerank) on the top-50 → top-5. Often the single biggest quality win.
Prompting: how the chunks are presented, citation format, instructions for "don't answer if not in context."
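
To make the chunking trade-off concrete, here is a minimal sketch of the sentence-window strategy: each sentence is embedded on its own (small, precise unit), while the neighbouring sentences are stored beside it and handed to the LLM (so context is not lost). The names `Chunk`, `split_sentences`, and `sentence_window_chunks` are illustrative, and the regex splitter is an assumption that only suits clean prose; a real corpus would likely need a proper sentence tokenizer.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str    # the single sentence that gets embedded and indexed
    window: str  # that sentence plus its neighbours, passed to the LLM


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_window_chunks(text: str, window: int = 1) -> list[Chunk]:
    """Return one chunk per sentence, each carrying `window` neighbours per side."""
    sentences = split_sentences(text)
    chunks = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        chunks.append(Chunk(text=sentence, window=" ".join(sentences[lo:hi])))
    return chunks
```

Varying `window` (and comparing against a fixed-size or recursive splitter) gives the benchmark axis: retrieval matches on `text`, while generation reads `window`.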

Vector DBs: pgvector (Postgres extension, the boring-and-correct choice), qdrant, weaviate, milvus, chroma (dev), lance/lancedb (good for local), turbopuffer (cheap, serverless).
Hybrid search: RRF (reciprocal rank fusion) over dense + BM25.
Embedding pipelines with backpressure: bound in-flight requests so you neither exhaust your provider's rate limits nor OOM your own workers, batch carefully, retry idempotently.
Evals for RAG: retrieval recall@K, answer faithfulness (LLM-as-judge), answer relevance, context precision (ragas, trulens, custom).
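
Reciprocal rank fusion is small enough to sketch in full. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in, so only ranks matter and the dense and BM25 scores never need to be put on a common scale. The function name `rrf_fuse` is illustrative; k = 60 is a commonly used default, treated here as a tunable assumption.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; a document missing from a list contributes nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["d1", "d2", "d3"]  # hypothetical dense-retriever output
bm25_hits = ["d3", "d1", "d4"]   # hypothetical BM25 output
fused = rrf_fuse([dense_hits, bm25_hits])
```

Here `d1` and `d3` rise to the top because both retrievers returned them; the fused list is what a reranker would then receive.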

Pick a corpus (your own docs, a Wikipedia subset, or a publicly available QA dataset). Ingest with at least two chunking strategies.
Stand up pgvector or qdrant. Index with two embedding models.
Implement hybrid retrieval (dense + BM25 + RRF) and add a reranker.
Build a 50-question gold eval set with reference answers. Score with ragas. Iterate retrieval until faithfulness > 0.85.
Record the impact of each pipeline change in a results table. Resist the urge to tune blindly.
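
The retrieval half of the gold eval needs no framework. Below is a sketch of recall@K, assuming each gold question is labelled with the set of chunk ids that contain its answer, and that `retrieve` is whatever callable your pipeline exposes (a hypothetical interface: question in, ranked chunk ids out). Faithfulness and answer relevance still need an LLM judge and are not covered here.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        raise ValueError("each gold question needs at least one relevant chunk id")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    gold: dict[str, set[str]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall@k over a gold set mapping question -> relevant chunk ids."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores)


# Toy stand-in for a real retriever, used only to exercise the metric.
fake_index = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
gold_set = {"q1": {"a", "b"}, "q2": {"c"}}
score_at_2 = mean_recall_at_k(gold_set, lambda q: fake_index[q], k=2)
score_at_3 = mean_recall_at_k(gold_set, lambda q: fake_index[q], k=3)
```

One row per pipeline change (chunker, embedding model, fusion, reranker) with recall@K next to the judged metrics is usually enough to show which change actually moved the numbers.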

Add eval-on-CI: every PR runs the gold set against the changed pipeline; regressions block merge.
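
One way to wire such a gate, as a sketch: the eval job writes its metrics to a JSON file, and a small script compares them against a committed baseline and returns a non-zero exit code on regression. The file layout, the metric names, and the 0.01 tolerance are all assumptions for illustration, not a prescribed format.

```python
import json
from pathlib import Path


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """List every baseline metric that is missing or dropped by more than `tolerance`."""
    failures = []
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None:
            failures.append(f"{metric}: missing from current run")
        elif new_value < base_value - tolerance:
            failures.append(f"{metric}: {base_value:.3f} -> {new_value:.3f}")
    return failures


def main(baseline_path: str, current_path: str) -> int:
    """Return a process exit code: 0 if no metric regressed, 1 otherwise."""
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    failures = find_regressions(baseline, current)
    for line in failures:
        print(f"REGRESSION {line}")
    return 1 if failures else 0
```

In a CI step this would be invoked with something like `sys.exit(main("baseline.json", "current.json"))` (hypothetical file names), so a failing comparison blocks the merge. The tolerance absorbs LLM-judge noise; set it from the run-to-run variance you actually observe.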