
10 - Retrieval-Augmented Generation (RAG)

What this session is

About an hour. RAG is the dominant production pattern for LLMs answering questions over your data. Instead of fine-tuning facts in, you retrieve relevant passages at query time and pass them to the model.

Why RAG, not fine-tuning, for facts

Fine-tuning teaches a model patterns, styles, and formats. Asking it to memorize facts works poorly: the knowledge degrades, the model hallucinates "knowing" things, and there is no clean way to update it when facts change.

RAG separates concerns: the LLM is the language interface; the knowledge is in a database. Update facts by updating the database - no retraining.

The architecture

User question
Embed the question (vector)
Search a vector DB for similar passages → top-k passages
Build a prompt: question + retrieved passages
LLM generates answer using both

Five components:

  1. Documents - your knowledge corpus (docs, PDFs, wiki).
  2. Chunker - splits docs into ~200-1000 token passages.
  3. Embedder - a model that turns text into vectors.
  4. Vector store - stores passages + their embeddings; supports nearest-neighbor search.
  5. LLM - generates the final answer.

A complete (minimal) RAG

from sentence_transformers import SentenceTransformer
import numpy as np
import torch
from transformers import pipeline

# 1. Documents - a tiny corpus
docs = [
    "Lagos is the most populous city in Nigeria.",
    "Abuja is the capital of Nigeria.",
    "The Niger River flows through Mali, Niger, and Nigeria.",
    "Python was created by Guido van Rossum in 1991.",
    "Rust was first released in 2010 by Mozilla.",
    "Go was designed at Google starting in 2007.",
]

# 2. Embed all documents (one-time index-build)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)
# shape: (6, 384)

# 3. Search function
def retrieve(query, k=2):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    sims = (doc_embeddings @ q_emb.T).flatten()        # cosine sim because normalized
    topk = np.argsort(-sims)[:k]
    return [docs[i] for i in topk]

# 4. Generate with retrieved context
gen = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct",
               torch_dtype=torch.bfloat16, device_map="auto")

def answer(question):
    context = "\n".join(f"- {p}" for p in retrieve(question))
    prompt = f"""<|user|>
Use the following context to answer the question.

Context:
{context}

Question: {question}<|end|>
<|assistant|>
"""
    out = gen(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
    return out[len(prompt):]    # strip the prompt

print(answer("What is the capital of Nigeria?"))
print(answer("Who created Python?"))

That's a working RAG in ~30 lines. The model answers using the retrieved context, not just its baked-in knowledge.

For real production you'd swap in a proper vector DB (next section); the LLM call stays the same.

Vector databases

For 100 documents, a NumPy dot product is fine. For 1M+ documents, you need a vector database with efficient approximate nearest neighbor search.

Self-hosted:

  • FAISS (Facebook) - library, in-process. Fast. No persistence layer; you build that.
  • Chroma - embedded, easy to start.
  • Qdrant - server-mode, production-grade.
  • Weaviate - feature-rich, server-mode.
  • Milvus - distributed, for very large scale.

Hosted:

  • Pinecone - first popular hosted vector DB.
  • Cloud-native: AWS OpenSearch with k-NN, Postgres + pgvector, Redis with vector search.

For learning: Chroma or FAISS. For production: depends on scale and existing infra.

A Chroma example:

import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(
    documents=docs,
    embeddings=doc_embeddings.tolist(),
    ids=[f"doc-{i}" for i in range(len(docs))],
)

results = collection.query(
    query_embeddings=embedder.encode(["What is Lagos?"], normalize_embeddings=True).tolist(),  # normalize like the indexed vectors
    n_results=2,
)
print(results["documents"])

Chunking strategies

Long documents must be split. Naive: split every N characters. Better:

  • Fixed-size with overlap (e.g., 500 chars, 50-char overlap to preserve context across boundaries).
  • Semantic chunks (paragraphs, headings).
  • Recursive chunking - try paragraphs first, fall back to sentences, fall back to words.

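LangChain's RecursiveCharacterTextSplitter is the popular tool (imported from langchain.text_splitter, or from the separate langchain_text_splitters package in newer releases). Try a few strategies; the best chunking is task-dependent.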
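To make the "fixed-size with overlap" option concrete, here is a minimal sketch in plain Python. The function name chunk_fixed and its defaults are illustrative, not from any library; it counts characters, not tokens.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters.

    Consecutive windows share `overlap` characters, so text near a
    boundary appears in both neighbours.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))  # 1200 characters of filler
chunks = chunk_fixed(sample, size=500, overlap=50)
print([len(c) for c in chunks])
```

The last 50 characters of each chunk are the first 50 of the next. This is the naive baseline: it will still cut sentences in half, which is what the semantic and recursive strategies try to avoid.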

Embedding choices

Bigger embedding models generally retrieve better, but they are slower to embed and usually produce larger vectors (more storage, slower search).

Model                            Dim   Speed      Quality
all-MiniLM-L6-v2                 384   very fast  decent
all-mpnet-base-v2                768   medium     good
BAAI/bge-base-en-v1.5            768   medium     excellent
BAAI/bge-large-en-v1.5           1024  slow       best
text-embedding-3-small (OpenAI)  1536  API        excellent
nomic-embed-text-v1.5            768   medium     excellent, open source

For learning: all-MiniLM-L6-v2. For production: BGE or Nomic embed are strong open options.

Quality knobs

Things that matter, in order of impact:

  1. Chunking strategy. Bad chunks = bad retrieval. Tune first.
  2. Number of retrieved chunks (k). 3-10 typical. Too few = miss relevant info. Too many = context bloat, "lost in the middle."
  3. Re-ranking. Retrieve k=20, then re-rank with a more expensive model down to top-5. Improves quality at modest cost.
  4. Hybrid search. Combine semantic (vector) with keyword (BM25). Catches cases where exact word match matters.
  5. Query rewriting. LLM rewrites the user's question into a better search query.
  6. Embedding model. Better embeddings = better retrieval. Worth experimenting.

For a beginner, just fixed-size chunks + top-3 semantic retrieval is a strong baseline.
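Hybrid search needs some way to merge the vector ranking with the keyword ranking. One common choice, shown here as an assumption and not the only option, is reciprocal rank fusion (RRF): each list contributes 1 / (smoothing + rank) per document. The doc ids and both rankings below are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], smoothing: int = 60, top_n: int = 3) -> list[str]:
    """Reciprocal rank fusion over several ranked lists of document ids."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (smoothing + rank)
    # Highest fused score first; ties broken by id so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


vector_hits = ["doc-2", "doc-0", "doc-5"]   # hypothetical semantic-search order
keyword_hits = ["doc-0", "doc-3", "doc-2"]  # hypothetical BM25 order
fused = rrf_fuse([vector_hits, keyword_hits])
print(fused)
```

Documents that both retrievers rank well (doc-0 and doc-2 here) rise to the top. RRF only uses rank positions, so you don't have to reconcile cosine similarities with BM25 scores.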

When RAG fails

  • User asks a question whose answer requires synthesis across many docs. RAG retrieves the top-k passages, each scored independently against the query, so the retrieved set may not cover everything the answer needs. Synthesis fails.
  • Question is ambiguous. Retrieval gets the wrong passage; answer is confidently wrong.
  • The corpus genuinely doesn't contain the answer. The LLM hallucinates because the user expects an answer.

Mitigations: an explicit instruction in the prompt to say "I don't know" when the context lacks the answer; structured outputs that include source citations; user-facing transparency about what was retrieved.
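A rough sketch of the first two mitigations combined: number the retrieved passages so the model can cite them, and tell it that declining is allowed. The function name and the exact wording of the instruction are assumptions; tune both for your model.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Build a prompt with numbered sources and an explicit way out."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer ONLY using the numbered sources below. "
        "Cite the source number in brackets after each claim. "
        "If the sources do not contain the answer, say \"I don't know.\"\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "What is the capital of Nigeria?",
    [
        "Abuja is the capital of Nigeria.",
        "Lagos is the most populous city in Nigeria.",
    ],
)
print(prompt)
```

With the Phi-3 pipeline from the minimal RAG, this string would go between the <|user|> and <|end|> markers. Showing the same numbered sources to the user covers the transparency point.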

Frameworks

Real RAG apps often use:

  • LangChain - most popular framework. Composable chains for retrieval + generation.
  • LlamaIndex - alternative, more retrieval-focused.
  • Haystack - pipeline-oriented, German-engineered.

These wrap the patterns above with batteries included. For learning, building from scratch (like this page) makes the mechanics clear; for production, frameworks save time.

Going deeper

You can build a RAG pipeline now. This section goes a level deeper to explain why RAG systems disappoint in practice: "it retrieves and generates" is easy; "it retrieves the right thing" is the entire hard problem. Here are the failure modes, with what you'll see.

The #1 RAG failure: retrieval returns the wrong chunks

RAG's quality ceiling is its retrieval. If the wrong chunks come back, the LLM gets fed irrelevant context and either hallucinates or says "I don't know" - and the LLM is not the problem, retrieval is. The first debugging move is always to look at what was retrieved:

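# `vector_db` stands in for whatever vector store client you use; its query API will differ
results = vector_db.query(query_embedding, top_k=5)
for r in results:
    print(f"score={r.score:.3f}  {r.text[:80]}")    # ALWAYS inspect the retrieved chunks

# Example output when retrieval has failed:
# score=0.42  "...unrelated paragraph about billing..."     <- low score, off-topic = bad retrieval
# score=0.41  "...another tangent..."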

Low similarity scores (and visibly off-topic text) mean retrieval failed - and no prompt engineering on the generation side will fix it. When RAG gives bad answers, inspect the retrieved chunks first. If they're irrelevant, the bug is upstream (chunking, embedding, or the query), not the LLM.

Chunking is where most RAG quality is won or lost

How you split documents into chunks determines what can be retrieved. The common failures:

  • Chunks too big - a 2000-token chunk dilutes the embedding (it's "about" too many things), so it matches poorly and wastes context window.
  • Chunks too small - a 50-token chunk loses context (a sentence retrieved without its surrounding meaning).
  • Chunks split mid-thought - naive fixed-size splitting cuts sentences/tables in half, so neither half is coherent.

The reliable starting point: ~200-500 token chunks with ~10-20% overlap, split on natural boundaries (paragraphs, sections) not arbitrary character counts. Overlap ensures a concept spanning a boundary appears whole in at least one chunk. Most "RAG retrieves garbage" problems trace back to bad chunking - it's the unglamorous lever with the biggest effect.
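One way to sketch "split on natural boundaries, with overlap": pack whole paragraphs into chunks up to a size budget, and repeat the last paragraph of each chunk at the start of the next. This is an illustrative helper, not a library function, and it uses word count as a rough stand-in for token count.

```python
def chunk_paragraphs(text: str, max_words: int = 300, overlap_paras: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_words words.

    The last `overlap_paras` paragraphs of a chunk are repeated at the start
    of the next one. A paragraph longer than max_words is kept whole.
    """
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paras:
        words = sum(len(p.split()) for p in current) + len(para.split())
        if current and words > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap
            current = current[-overlap_paras:] if overlap_paras else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Five 10-word paragraphs, with a deliberately small budget to force splits
doc = "\n\n".join(f"Paragraph {i} " + "word " * 8 for i in range(1, 6))
chunks = chunk_paragraphs(doc, max_words=25, overlap_paras=1)
print(len(chunks))
```

No chunk ever starts or ends mid-paragraph, and each boundary paragraph appears in two chunks. For real documents you would swap the word count for your embedder's tokenizer and decide how to handle oversized paragraphs and tables.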

The embedding mismatch you won't notice

A subtle, silent failure: the query and the documents must be embedded with the same model, and some embedding models need an asymmetric prefix (a query prefix vs a passage prefix):

# WRONG - mixing embedding models, or forgetting required prefixes
doc_emb = model_a.encode(documents)
query_emb = model_b.encode(query)        # different model -> incomparable vectors -> garbage

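# Some models expect role prefixes. E5-style models use "query: " / "passage: ";
# BGE models instead recommend an instruction string on the query only. Check the model card.
doc_emb = model.encode("passage: " + text)
query_emb = model.encode("query: " + question)    # forgetting these degrades retrieval silently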

Mismatched embedding models produce vectors in different spaces - similarity is meaningless, retrieval is effectively random, and if the two models happen to share a dimension there's no error. Check your embedding model's docs for required prefixes; forgetting them quietly degrades retrieval quality.
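Since nothing fails on its own, one defensive option is to record which model built the index and check it at query time. The sketch below is an assumption about how you might structure that: IndexMeta, check_compatible, and with_prefix are hypothetical helpers, and the prefix table follows the E5-style convention only.

```python
from dataclasses import dataclass

# E5-style convention; other model families differ, so check the model card
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def with_prefix(text: str, role: str, prefixes: dict[str, str] = E5_PREFIXES) -> str:
    """Prepend the role-specific prefix ('query' or 'passage') before embedding."""
    return prefixes[role] + text


@dataclass
class IndexMeta:
    """Hypothetical metadata saved next to the vectors when the index is built."""
    model_name: str
    dim: int


def check_compatible(meta: IndexMeta, query_model: str, query_dim: int) -> None:
    """Fail loudly instead of comparing vectors from different spaces."""
    if meta.model_name != query_model or meta.dim != query_dim:
        raise ValueError(
            f"index built with {meta.model_name} ({meta.dim}d), "
            f"query embedded with {query_model} ({query_dim}d)"
        )


meta = IndexMeta(model_name="all-MiniLM-L6-v2", dim=384)
check_compatible(meta, "all-MiniLM-L6-v2", 384)   # same model: passes silently
print(with_prefix("Who created Python?", "query"))
```

Calling check_compatible(meta, "all-mpnet-base-v2", 768) raises a ValueError, which turns the silent failure into a visible one.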

What you'll see: the LLM ignores the context (or hallucinates anyway)

Sometimes retrieval is good but the answer is still wrong. Two patterns:

  • Hallucination despite good context - the model answers from its training, ignoring your retrieved chunks. Fix: a firmer prompt ("Answer ONLY using the context below. If the answer isn't in the context, say 'I don't know.'") and put the context prominently (right before the question).
  • "I don't know" despite the answer being present - the model is too conservative, or the relevant chunk was retrieved but ranked low and got cut by top_k. Fix: increase top_k, or add a reranking step (retrieve 20, rerank to the best 5 with a cross-encoder).

The diagnostic: print the exact prompt sent to the LLM (retrieved context + question). Seeing what the model actually received - is the answer even in there? is the context buried? - resolves most generation-side RAG bugs. The full prompt is the ground truth.
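The reranking fix (retrieve 20, keep the best 5) has a simple shape, sketched here under assumptions: retrieve_then_rerank is an illustrative helper, and word_overlap is a toy scorer standing in for a real cross-encoder so the example runs without a model download.

```python
import re
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Re-score first-stage candidates with a slower scorer and keep the best."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def word_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the passage."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


# Pretend these came back from the vector search, in first-stage order
candidates = [
    "Lagos is the most populous city in Nigeria.",
    "The Niger River flows through Mali, Niger, and Nigeria.",
    "Abuja is the capital of Nigeria.",
]
best = retrieve_then_rerank("capital of Nigeria", candidates, word_overlap, keep=1)
print(best)
```

The relevant passage was third in the first-stage order, so top-1 retrieval would have cut it; reranking moves it to the front. In practice score_fn would wrap a cross-encoder model that reads the query and passage together.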

The honest evaluation problem

RAG is hard to evaluate because there are two failure points (retrieval and generation) and "looks plausible" isn't "correct" (the Evaluation chapter's lesson). At minimum, separate the two: measure retrieval (did the right chunk make the top-k? - you can check this with known question/chunk pairs) and generation (given the right context, is the answer faithful to it?). A RAG system failing end-to-end is almost always failing at one of these specifically - measuring them separately tells you which to fix.
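The retrieval half of that split can be measured with very little code. Below is a minimal sketch of hit rate at k over known question/chunk pairs; the eval set, the doc ids, and the canned retriever are all made up for illustration, and in practice retrieve_ids would call your real retriever.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve_ids: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose known-relevant chunk id is in the top-k results."""
    hits = sum(1 for question, gold_id in eval_set if gold_id in retrieve_ids(question, k))
    return hits / len(eval_set)


# Canned rankings standing in for a real retriever
canned = {
    "What is the capital of Nigeria?": ["doc-1", "doc-0", "doc-2"],
    "Who created Python?": ["doc-4", "doc-3", "doc-5"],
    "Which river flows through Mali?": ["doc-2", "doc-0", "doc-1"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return canned[question][:k]


eval_set = [
    ("What is the capital of Nigeria?", "doc-1"),
    ("Who created Python?", "doc-3"),
    ("Which river flows through Mali?", "doc-2"),
]
print(hit_rate_at_k(eval_set, fake_retrieve, k=1))
print(hit_rate_at_k(eval_set, fake_retrieve, k=2))
```

Here the right chunk for "Who created Python?" only shows up at rank 2, so the hit rate rises from 2/3 at k=1 to 1.0 at k=2. A number like this tells you whether to work on retrieval before you look at generation at all.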

Try it (with what you'll see)

  1. Run a query and print the retrieved chunks with scores before generation. Are they relevant? Are the scores high (>0.7-ish for good embedders, though absolute values vary by model, so compare against a query you know matches well)? Inspect before trusting.

  2. Re-chunk a document three ways: 2000-token, 50-token, and 300-token-with-overlap. Run the same query against each. See retrieval quality change with chunking alone.

  3. Deliberately embed documents and the query with different models. Watch retrieval return garbage with no error. Fix by using the same model.

  4. Print the full prompt (context + question) sent to the LLM. Confirm the answer is actually in the retrieved context - if the answer's wrong but the context is right, it's a generation/prompt problem, not retrieval.

Exercise

  1. Run the minimal RAG above. Confirm the answers use the retrieved context.

  2. Expand the corpus: add 20 more facts. Try a question that's ambiguous between two retrieved docs - see how the model handles it.

  3. Different embedder: swap all-MiniLM-L6-v2 for BAAI/bge-base-en-v1.5. Larger model; do retrieval results improve for tricky questions?

  4. Chunking exercise: download a Markdown doc (your README.md or any project's). Use RecursiveCharacterTextSplitter from langchain to chunk it. Index the chunks. Ask questions.

  5. (Stretch) Try Chroma instead of in-memory NumPy. Same RAG flow with persistent index.

What you might wonder

"What if the LLM ignores the retrieved context?" Happens. Make the prompt clearer: "Answer ONLY using the context above. If the context doesn't contain the answer, say 'I don't know.'" Smaller models ignore instructions more; bigger ones follow.

"Should I do RAG or fine-tuning?" Both, often. RAG for facts; fine-tuning for style + format. Don't pit them against each other.

"What's a 'retriever' vs an 'embedder'?" An embedder produces vectors. A retriever uses the embedder + a vector DB + post-processing to return passages. Same pipeline, different name for different layers.

"How do I evaluate a RAG?" Next page. Hardest part.

Done

  • Build a RAG pipeline end-to-end with embeddings + vector search + LLM.
  • Distinguish from fine-tuning (facts vs style).
  • Recognize vector DB options.
  • Apply basic quality knobs (chunking, k, re-ranking, hybrid search).
  • Know LangChain / LlamaIndex / Haystack exist.

Next: Evaluation →
