10 - Retrieval-Augmented Generation (RAG)

What this session is

About an hour. RAG is the dominant production pattern for LLMs answering questions over your own data. Instead of fine-tuning facts into the model, you retrieve relevant passages at query time and pass them to the model alongside the question.

Why RAG, not fine-tuning, for facts

Fine-tuning teaches a model patterns, styles, and formats. Asking it to memorize facts works poorly: knowledge degrades, the model hallucinates things it only half-knows, and there is no clean way to update it when facts change.

RAG separates concerns: the LLM is the language interface; the knowledge is in a database. Update facts by updating the database - no retraining.

The architecture

  1. User question.
  2. Embed the question into a vector.
  3. Search a vector DB for similar passages → top-k passages.
  4. Build a prompt: question + retrieved passages.
  5. The LLM generates an answer using both.

Five components:

  1. Documents - your knowledge corpus (docs, PDFs, wiki).
  2. Chunker - splits docs into ~200-1000 token passages.
  3. Embedder - a model that turns text into vectors.
  4. Vector store - stores passages + their embeddings; supports nearest-neighbor search.
  5. LLM - generates the final answer.

A complete (minimal) RAG

from sentence_transformers import SentenceTransformer
import numpy as np
import torch
from transformers import pipeline

# 1. Documents - a tiny corpus
docs = [
    "Lagos is the most populous city in Nigeria.",
    "Abuja is the capital of Nigeria.",
    "The Niger River flows through Mali, Niger, and Nigeria.",
    "Python was created by Guido van Rossum in 1991.",
    "Rust was first released in 2010 by Mozilla.",
    "Go was designed at Google starting in 2007.",
]

# 2. Embed all documents (one-time index-build)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)
# shape: (6, 384)

# 3. Search function
def retrieve(query, k=2):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    sims = (doc_embeddings @ q_emb.T).flatten()        # cosine sim because normalized
    topk = np.argsort(-sims)[:k]
    return [docs[i] for i in topk]

# 4. Generate with retrieved context
gen = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct",
               torch_dtype=torch.bfloat16, device_map="auto")

def answer(question):
    context = "\n".join(f"- {p}" for p in retrieve(question))
    prompt = f"""<|user|>
Use the following context to answer the question.

Context:
{context}

Question: {question}<|end|>
<|assistant|>
"""
    out = gen(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
    return out[len(prompt):]    # strip the prompt

print(answer("What is the capital of Nigeria?"))
print(answer("Who created Python?"))

That's a working RAG in ~30 lines. The model answers using the retrieved context, not just its baked-in knowledge.

For real production you'd swap in a proper vector DB (next section); the LLM call stays the same.

Vector databases

For 100 documents, a NumPy dot product is fine. For 1M+ documents, you need a vector database with efficient approximate nearest neighbor search.

Self-hosted:

  • FAISS (Facebook) - a library, runs in-process. Fast. No persistence layer; you build that yourself.
  • Chroma - embedded, easy to start with.
  • Qdrant - server mode, production-grade.
  • Weaviate - feature-rich, server mode.
  • Milvus - distributed, for very large scale.

Hosted:

  • Pinecone - the first popular hosted vector DB.
  • Cloud-native options: AWS OpenSearch with k-NN, Postgres + pgvector, Redis with vector search.

For learning: Chroma or FAISS. For production: depends on scale and existing infra.
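
A FAISS sketch for comparison, reusing docs, doc_embeddings, and embedder from the pipeline above (assumes faiss-cpu is installed; IndexFlatIP is exact search - at real scale you'd pick an ANN index like IndexHNSWFlat):

import faiss
import numpy as np

d = doc_embeddings.shape[1]                     # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatIP(d)                    # inner product = cosine, since vectors are normalized
index.add(doc_embeddings.astype(np.float32))    # FAISS expects float32

q = embedder.encode(["What is Lagos?"], normalize_embeddings=True)
scores, ids = index.search(q.astype(np.float32), 2)   # top-2 neighbors
print([docs[i] for i in ids[0]])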

A Chroma example:

import chromadb

client = chromadb.Client()                     # in-memory; chromadb.PersistentClient(path=...) persists to disk
collection = client.create_collection("docs")

collection.add(
    documents=docs,
    embeddings=doc_embeddings.tolist(),
    ids=[f"doc-{i}" for i in range(len(docs))],
)

results = collection.query(
    query_embeddings=embedder.encode(["What is Lagos?"], normalize_embeddings=True).tolist(),  # normalize, matching the indexed docs
    n_results=2,
)
print(results["documents"])

Chunking strategies

Long documents must be split. Naive: split every N characters. Better:

  • Fixed-size with overlap (e.g., 500 chars with a 50-char overlap, to preserve context across boundaries).
  • Semantic chunks (paragraphs, headings).
  • Recursive chunking - try paragraphs first, fall back to sentences, then words.

RecursiveCharacterTextSplitter from LangChain (in the langchain-text-splitters package in recent versions) is the popular tool. Try a few strategies; the best chunking is task-dependent.
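
For intuition, fixed-size chunking with overlap is a few lines of plain Python - a sketch only; real splitters also respect sentence and paragraph boundaries:

def chunk_text(text, size=500, overlap=50):
    # Step forward by (size - overlap) so consecutive chunks share `overlap` chars.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text(open("README.md").read())
print(len(chunks), chunks[0][:80])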

Embedding choices

Bigger embedding model = better retrieval, slower to embed, larger vectors.

Model                             Dim   Speed      Quality
all-MiniLM-L6-v2                  384   very fast  decent
all-mpnet-base-v2                 768   medium     good
BAAI/bge-base-en-v1.5             768   medium     excellent
BAAI/bge-large-en-v1.5            1024  slow       best
text-embedding-3-small (OpenAI)   1536  API        excellent
nomic-embed-text-v1.5             768   medium     excellent, open source

For learning: all-MiniLM-L6-v2. For production: BGE or Nomic embed are strong open options.
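
Swapping embedders is a one-line change, but you must re-embed the whole corpus - vectors from different models live in different spaces. A sketch with bge-base (the BGE model cards recommend prefixing short queries with a retrieval instruction; the string below is the documented one, but check the card for your version):

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)   # now (6, 768), not (6, 384)

# BGE query-side instruction prefix (per the model card):
query = "Represent this sentence for searching relevant passages: Who created Python?"
q_emb = embedder.encode([query], normalize_embeddings=True)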

Quality knobs

Things that matter, in order of impact:

  1. Chunking strategy. Bad chunks = bad retrieval. Tune first.
  2. Number of retrieved chunks (k). 3-10 typical. Too few = miss relevant info. Too many = context bloat, "lost in the middle."
  3. Re-ranking. Retrieve k=20, then re-rank with a more expensive model down to the top 5 (see the sketch after this list). Improves quality at modest cost.
  4. Hybrid search. Combine semantic (vector) with keyword (BM25). Catches cases where exact word match matters.
  5. Query rewriting. LLM rewrites the user's question into a better search query.
  6. Embedding model. Better embeddings = better retrieval. Worth experimenting.

For a beginner, just fixed-size chunks + top-3 semantic retrieval is a strong baseline.
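
A minimal re-ranking sketch with a cross-encoder from sentence-transformers, reusing retrieve() from the pipeline above (the ms-marco model is one common choice, not the only one; with the six-document toy corpus, k_retrieve is effectively capped at 6):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_reranked(query, k_retrieve=20, k_final=5):
    # Stage 1: cheap vector search casts a wide net.
    candidates = retrieve(query, k=k_retrieve)
    # Stage 2: the cross-encoder reads each (query, passage) pair jointly - slower, sharper.
    scores = reranker.predict([(query, p) for p in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [passage for _, passage in ranked[:k_final]]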

When RAG fails

  • The answer requires synthesis across many documents. RAG retrieves top-k passages, each chosen independently, so cross-document synthesis often fails.
  • The question is ambiguous. Retrieval fetches the wrong passage, and the answer is confidently wrong.
  • The corpus genuinely doesn't contain the answer. The LLM hallucinates one, because the user expects an answer.

Mitigations: explicit "I don't know" in the prompt; structured outputs that include source citations; user-facing transparency about what was retrieved.
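
A sketch of the first mitigation, adapting answer() from above (the guard wording is a choice, not a standard; tune it for your model):

def answer_guarded(question):
    context = "\n".join(f"- {p}" for p in retrieve(question))
    prompt = f"""<|user|>
Answer ONLY using the context below. If the context does not contain
the answer, reply exactly: I don't know.

Context:
{context}

Question: {question}<|end|>
<|assistant|>
"""
    out = gen(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
    return out[len(prompt):]

print(answer_guarded("What is the tallest mountain in Kenya?"))   # not in the corpus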

Frameworks

Real RAG apps often use:

  • LangChain - the most popular framework. Composable chains for retrieval + generation.
  • LlamaIndex - an alternative, more retrieval-focused.
  • Haystack - pipeline-oriented, built by deepset.

These wrap the patterns above with batteries included. For learning, building from scratch (like this page) makes the mechanics clear; for production, frameworks save time.

Exercise

  1. Run the minimal RAG above. Confirm the answers use the retrieved context.

  2. Expand the corpus: add 20 more facts. Try a question that's ambiguous between two retrieved docs - see how the model handles it.

  3. Different embedder: swap all-MiniLM-L6-v2 for BAAI/bge-base-en-v1.5. Larger model; do retrieval results improve for tricky questions?

  4. Chunking exercise: download a Markdown doc (your README.md or any project's). Use RecursiveCharacterTextSplitter from langchain to chunk it. Index the chunks. Ask questions.

  5. (Stretch) Try Chroma instead of in-memory NumPy. Same RAG flow with persistent index.

What you might wonder

"What if the LLM ignores the retrieved context?" Happens. Make the prompt clearer: "Answer ONLY using the context above. If the context doesn't contain the answer, say 'I don't know.'" Smaller models ignore instructions more; bigger ones follow.

"Should I do RAG or fine-tuning?" Both, often. RAG for facts; fine-tuning for style + format. Don't pit them against each other.

"What's a 'retriever' vs an 'embedder'?" An embedder produces vectors. A retriever uses the embedder + a vector DB + post-processing to return passages. Same pipeline, different name for different layers.

"How do I evaluate a RAG?" Next page. Hardest part.

Done

  • Build a RAG pipeline end-to-end with embeddings + vector search + LLM.
  • Distinguish RAG from fine-tuning (facts vs style).
  • Recognize vector DB options.
  • Apply basic quality knobs (chunking, k, re-ranking, hybrid search).
  • Know LangChain / LlamaIndex / Haystack exist.

Next: Evaluation →
