Workshop - Production-grade RAG with hybrid retrieval + reranking + eval¶

Difficulty: Capstone | Time: 2 hours

Needs: Python 3.11+, Docker (for Postgres+pgvector), OpenAI API key (~$2 in tokens), Anthropic API key, optionally Cohere

Before you start:

Comfortable with Python and basic SQL
Have called an LLM API at least once
Understand what embeddings are at a conceptual level (vectors of numbers representing text meaning)

Launch in Killercoda: a free browser-based environment, no install required to follow along.

Companion to AI Systems -> Month 05 -> Week 18: Retrieval and Vector Stores, and the fifth in the AI implementations workshop series. Most RAG tutorials stop at "chunk, embed, retrieve, answer" and call it production. The numbers tell a different story: that naive baseline typically puts the right chunk in the top-5 only about 40-50% of the time, low enough that the system still feels broken to users. This workshop builds the same baseline, measures it, then layers in the three techniques that actually move the number - hybrid retrieval, reranking, and disciplined chunking - and measures again after each. By the end you'll have a RAG system with a hit-rate above 0.80 on a real golden set, and you'll know which knob does what.

~120 minutes. Needs: Python 3.11+, Docker (for Postgres + pgvector), an OpenAI API key for embeddings (~$2 in tokens), an Anthropic API key for generation, optionally a Cohere API key for hosted reranking. A laptop is plenty; no GPU required.

What you'll build, and the idea it makes concrete¶

You'll build a complete RAG pipeline against a real corpus (a subset of Wikipedia, ~10,000 chunks). Then you'll build a 50-question golden set with known-relevant chunks, measure the naive baseline, and watch it return a hit-rate around 0.40-0.50. Then you'll add hybrid retrieval (BM25 + vector with RRF fusion), reranking (cross-encoder), and smarter chunking, measuring at each step. By the end the hit rate is in the 0.80-0.85 range and you can attribute every percentage point to a specific technique.

The idea this makes concrete:

RAG quality is dominated by retrieval, not by the LLM. When users say "the RAG bot gave a wrong answer," 80% of the time it's because the right context never made it into the prompt. The generation step is usually fine - if you hand a competent model the right paragraph, it will produce a correct answer. So when you optimize a RAG system, you optimize retrieval, and you optimize it by measuring it as a search problem with the metrics search engines use (hit-rate, MRR, NDCG) - not by eyeballing the final answers. The eval comes before the optimization, not after, because without numbers you cannot tell which knob helps.

A second idea, equally load-bearing:

Hybrid retrieval + reranking is the production default in 2026. Pure-vector RAG was the 2023 state of the art and is now a baseline, not a goal. Vector embeddings excel at semantic similarity (synonyms, paraphrases) but miss exact-keyword matches that BM25 nails (acronyms, names, dates, product SKUs, code identifiers). Combining them via Reciprocal Rank Fusion and then reranking the union with a cross-encoder beats either alone by 20-40 points of hit-rate on most corpora. If a RAG system in 2026 is pure-vector, it's leaving accuracy on the table.

Step 0: the architecture you're about to assemble¶

INGEST                                         RUNTIME
+----------+                          +-----+   user query   +------+
| documents|                          |     |  ----------->  |      |
+----+-----+                          |     |                |      |
     |  chunk                         |     |  embed query   |      |
     v                                |     |  ------------> |      |
+----------+                          | Q U |                |  L L |
|  chunks  |                          | E R |  BM25          |  M   |
+----+-----+                          | Y   |  -----------+  |      |
     |  embed     +-----------+       |     |             |  |      |
     +----------> |  pgvector |       |     |  vector     |  |      |
                  | (vectors) |       |     |  ---------+ |  |      |
     |  index     +-----------+       |     |           v v  |      |
     +----------> +-----------+       |     |       +----------+    |
                  | tsvector  |       |     |       |   RRF    |    |
                  | (BM25)    |       |     |       |  fusion  |    |
                  +-----------+       |     |       +----+-----+    |
                                      |     |            |          |
                                      |     |            v          |
                                      |     |       +----------+    |
                                      |     |       | reranker |    |
                                      |     |       +----+-----+    |
                                      |     |            |          |
                                      |     |  context   v          |
                                      +-----+ ---------------------->
                                          top-5 reranked chunks

Two non-obvious truths this layout encodes that the literature often glosses over:

There is no "the RAG model." RAG is a pipeline. Each box can be swapped independently and measured independently. The "RAG model" mental shortcut hides the fact that retrieval and generation are different optimization problems with different metrics.
The retrieval signal flows through three stages by the end. Candidate generation (BM25 and vector search each retrieve ~30 candidates), fusion (RRF combines the two ranked lists into one), and reranking (a cross-encoder reorders the top candidates by deep relevance). Candidate generation can be cheap and recall-focused; the reranker can be expensive and precision-focused; together they reach accuracy a single-stage retriever cannot.

Step 1: set up Postgres + pgvector and ingest some data¶

Spin up Postgres with pgvector. The cleanest way is Docker:

$ docker run -d --name pgvec \
  -e POSTGRES_PASSWORD=workshop \
  -p 5432:5432 \
  pgvector/pgvector:pg16

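# Postgres takes a few seconds to accept connections after `docker run`; retry if this fails.
$ docker exec -it pgvec psql -U postgres -c "CREATE EXTENSION IF NOT EXISTS vector;"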

Create the schema. We need both a vector column and a full-text-search column on the same table:

-- Run this in psql.
CREATE TABLE chunks (
    id           BIGSERIAL PRIMARY KEY,
    source       TEXT NOT NULL,
    text         TEXT NOT NULL,
    embedding    vector(1536),                         -- text-embedding-3-small dims
    tsv          tsvector GENERATED ALWAYS AS (        -- full-text (keyword) search column
                     to_tsvector('english', text)
                 ) STORED
);

CREATE INDEX chunks_tsv     ON chunks USING GIN (tsv);
CREATE INDEX chunks_emb_hnsw ON chunks USING hnsw (embedding vector_cosine_ops);

Two indexes, two retrievers. HNSW for fast approximate nearest-neighbor search on the embedding; GIN for fast full-text search on the generated tsvector. Postgres now serves as both your vector store and your keyword search engine, the role BM25 plays in a dedicated search stack (Postgres' built-in ranking is not literally BM25, but it is close enough for this workshop). You do not need a separate vector database for any non-extreme workload.

Grab a real dataset. The Hugging Face wikipedia dataset, English subset, first 1,000 articles is enough to be interesting:

# ingest.py
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
articles = []
for i, article in enumerate(ds):
    if i >= 1000:
        break
    articles.append(article)
print(f"loaded {len(articles)} articles, {sum(len(a['text']) for a in articles):,} chars")

About 1,000 articles, several million characters. Realistic enough to evaluate retrieval honestly; small enough to ingest in 5 minutes.

Step 2: chunk, embed, and index (the naive baseline)¶

Chunking strategy in the naive baseline is the dumbest thing that works: fixed-size character windows with a small overlap.

def chunk_fixed(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + size])
        i += size - overlap
    return chunks

Why 800 characters? Because text-embedding-3-small does its best work on inputs of roughly 100-500 tokens (~400-2000 chars), and 800 chars is a sweet spot that's specific enough to be useful but big enough to carry full sentences. Why 100 overlap? Because cutting mid-sentence loses information; overlap recovers some of it. These are not laws; they are starting values you'll tune.
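To make the size/overlap trade-off concrete, here is a small sketch. It repeats chunk_fixed from above, adds a guard against overlap >= size (an addition for this sketch, not part of the original function), and counts the chunks a text produces under a few settings. The 10,000-character sample is a stand-in, not real corpus data.

```python
def chunk_fixed(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Guard added for this sketch: with overlap >= size the window never advances.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + size])
        i += size - overlap  # stride: each chunk starts this far after the previous one
    return chunks


sample = "x" * 10_000  # stand-in for a 10,000-character article
for size, overlap in [(800, 0), (800, 100), (800, 400)]:
    n = len(chunk_fixed(sample, size, overlap))
    print(f"size={size} overlap={overlap} stride={size - overlap} -> {n} chunks")
```

On this sample the chunk count goes from 13 (no overlap) to 15 (overlap 100) to 25 (overlap 400). More overlap means more chunks to embed and store, which is the efficiency cost to weigh against the information recovered at boundaries.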

Embed and store in batches:

from openai import OpenAI
import psycopg

client = OpenAI()
conn = psycopg.connect("postgresql://postgres:workshop@localhost/postgres")

def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

batch = []
with conn.cursor() as cur:
    for article in articles:
        for ch in chunk_fixed(article["text"]):
            batch.append((article["title"], ch))
            if len(batch) >= 100:
                vecs = embed_batch([t for _, t in batch])
                cur.executemany(
                    "INSERT INTO chunks (source, text, embedding) VALUES (%s, %s, %s)",
                    [(src, txt, vec) for (src, txt), vec in zip(batch, vecs)],
                )
                conn.commit()
                batch = []
    if batch:
        vecs = embed_batch([t for _, t in batch])
        cur.executemany(
            "INSERT INTO chunks (source, text, embedding) VALUES (%s, %s, %s)",
            [(src, txt, vec) for (src, txt), vec in zip(batch, vecs)],
        )
        conn.commit()

Run it. About 10,000 chunks, $1-2 in embedding tokens, ~5 minutes wall-clock. You now have a complete naive RAG store.

Quick vector retrieval to confirm it works:

def retrieve_vector(query: str, k: int = 5) -> list[dict]:
    qvec = embed_batch([query])[0]
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, source, text, embedding <=> %s::vector AS dist "
            "FROM chunks ORDER BY dist LIMIT %s",
            (qvec, k),
        )
        return [{"id": r[0], "source": r[1], "text": r[2], "score": -float(r[3])}
                for r in cur.fetchall()]

results = retrieve_vector("What year did the French Revolution begin?")
for r in results:
    print(f"{r['score']:.3f}  {r['source']}: {r['text'][:80]}...")

You'll see five plausible chunks come back. The temptation is to declare victory. Resist it - until you measure, you don't know whether "plausible" is "right."

Step 3: build the golden set (the most important step)¶

A golden set is a list of (question, relevant-chunk-ids) pairs that lets you compute retrieval accuracy as a real number. Without a golden set, every change you make is vibes; with one, every change is measured.

The painful truth: golden-set authorship is manual work, and most RAG projects skip it, which is why most RAG projects can't tell whether their changes help. Spend the 90 minutes; everything else gets easier.

Procedure to build a 50-question golden set:

Pick 50 chunks at random from your store. These will be the "truth" chunks.
For each chunk, write a question that the chunk specifically answers. The question should be answerable from that chunk alone, with no ambiguity. Variety matters: include some questions that name an exact entity ("When was the Treaty of Westphalia signed?"), some that ask for a property ("What is the boiling point of mercury?"), and some that are looser ("Why did the Roman Empire fall?").
Mark additional relevant chunks if any. Some questions are answered by multiple chunks; mark them all. Many will have just the one.
Save as JSONL so you can iterate:

# golden_set.jsonl
{"question": "When was the Treaty of Westphalia signed?", "relevant_ids": [847]}
{"question": "What is the boiling point of mercury in Celsius?", "relevant_ids": [3214]}
...

A worth-it shortcut: use Claude to propose questions from sampled chunks ("Given this chunk: {chunk_text}, write one question this chunk specifically answers"), then have a human spend ~30 seconds reviewing each. Authorship time drops from 90 minutes to about 20, with similar quality.

Save the golden set. This file is more valuable than your code - your code can be rewritten; the golden set encodes what "good retrieval" means for your corpus, and rebuilding it from scratch is the painful path.

Step 4: measure the baseline¶

The two canonical retrieval metrics for RAG evaluation are hit-rate at k ("did the relevant chunk appear in the top-k?") and MRR (mean reciprocal rank - "how high up was the first relevant chunk on average?"). Both are between 0 and 1; both are dead simple to compute.

import json

def load_golden() -> list[dict]:
    with open("golden_set.jsonl") as f:
        return [json.loads(line) for line in f]

def evaluate(retriever, golden: list[dict], k: int = 5) -> dict:
    hits = 0
    mrr_sum = 0.0
    for example in golden:
        results = retriever(example["question"], k=k)
        retrieved_ids = [r["id"] for r in results]
        relevant = set(example["relevant_ids"])

        if any(rid in relevant for rid in retrieved_ids):
            hits += 1
        for rank, rid in enumerate(retrieved_ids, start=1):
            if rid in relevant:
                mrr_sum += 1.0 / rank
                break

    return {
        "hit@k": hits / len(golden),
        "mrr":   mrr_sum / len(golden),
        "k":     k,
        "n":     len(golden),
    }

golden = load_golden()
print("naive vector baseline:", evaluate(retrieve_vector, golden, k=5))

Run it. Typical result on a Wikipedia-style corpus with naive 800-char chunks and text-embedding-3-small:

naive vector baseline: {'hit@k': 0.46, 'mrr': 0.31, 'k': 5, 'n': 50}

46% hit-rate at k=5. Less than half the time, the right chunk made it into the top-5. The user-facing implication: ~half of all queries hit a degraded answer or a confidently wrong one. This is the gap between "I built a RAG demo" and "this is in production."

(Your number will vary by corpus and golden-set wording. The point is not the specific number; it's that the number exists and you can move it.)
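One way to put a rough size on that variation is a normal-approximation interval for a proportion. The sketch below is a back-of-the-envelope aid, not part of the workshop's pipeline. It treats each golden question as an independent hit/miss trial, and the helper name hit_rate_interval is hypothetical.

```python
import math


def hit_rate_interval(hit_rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a hit-rate measured on n questions."""
    se = math.sqrt(hit_rate * (1.0 - hit_rate) / n)  # standard error of a proportion
    return max(0.0, hit_rate - z * se), min(1.0, hit_rate + z * se)


for n in (50, 200):
    lo, hi = hit_rate_interval(0.46, n)
    print(f"n={n}: hit@5 = 0.46, ~95% interval [{lo:.2f}, {hi:.2f}]")
```

With 50 questions the interval is about ±0.14 around 0.46, shrinking to about ±0.07 at 200 questions. Differences of a few points between two runs on a 50-question set can therefore be noise. Comparing two retrievers on the same questions is a paired comparison, which is usually tighter than this independent-sample estimate suggests.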

Step 5: add BM25 retrieval (the first big win)¶

Vector retrieval excels at semantic similarity but misses exact-keyword queries. "Albert Einstein" as a vector query is dominated by "famous physicist" matches; as a BM25 query it lights up every chunk containing the literal name. A staggering fraction of real-world queries are name-, acronym-, or token-based and BM25 wins those decisively.

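Postgres' built-in full-text search over the tsvector column fills the BM25 role here and runs against the GIN index you already built. Strictly speaking, ts_rank_cd is a cover-density ranking rather than BM25 (it uses no corpus-wide statistics such as IDF), but it is a serviceable lexical ranker for this workshop. Note that plainto_tsquery ANDs every non-stopword term, so long natural-language questions can match few or no chunks: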

def retrieve_bm25(query: str, k: int = 10) -> list[dict]:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, source, text, "
            "       ts_rank_cd(tsv, plainto_tsquery('english', %s)) AS score "
            "FROM chunks "
            "WHERE tsv @@ plainto_tsquery('english', %s) "
            "ORDER BY score DESC LIMIT %s",
            (query, query, k),
        )
        return [{"id": r[0], "source": r[1], "text": r[2], "score": float(r[3])}
                for r in cur.fetchall()]

print("naive BM25 baseline:", evaluate(retrieve_bm25, golden, k=5))

Typical result:

naive BM25 baseline: {'hit@k': 0.52, 'mrr': 0.39, 'k': 5, 'n': 50}

BM25 alone slightly outperforms vector alone on a name-heavy corpus. The interesting question is what happens when you combine them.

Step 6: hybrid retrieval with Reciprocal Rank Fusion¶

You cannot just add the BM25 score and the vector score - they live on different scales (BM25 is unbounded, cosine distance is [0,2]). The standard fusion is Reciprocal Rank Fusion (RRF), which uses only the rank of each document, not the raw score:

def rrf_fuse(rankings: list[list[int]], k_rrf: int = 60) -> list[int]:
    """Reciprocal Rank Fusion: combine multiple ranked lists into one.

    k_rrf = 60 is the standard constant from the original paper; it dampens
    the influence of any single ranker's top result so a doc must be near
    the top of multiple lists to win.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_rrf + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


def retrieve_hybrid(query: str, k: int = 5, candidate_k: int = 30) -> list[dict]:
    """Pull candidate_k from each retriever, fuse by RRF, return top k.

    Candidate pool of 30 each = 60 max; in practice ~40 unique after overlap.
    """
    vec  = retrieve_vector(query, k=candidate_k)
    bm25 = retrieve_bm25(query, k=candidate_k)

    by_id = {r["id"]: r for r in vec}
    by_id.update({r["id"]: r for r in bm25})

    fused_ids = rrf_fuse([[r["id"] for r in vec], [r["id"] for r in bm25]])[:k]
    return [by_id[i] for i in fused_ids]

print("hybrid (RRF):", evaluate(retrieve_hybrid, golden, k=5))

Typical result:

hybrid (RRF): {'hit@k': 0.68, 'mrr': 0.52, 'k': 5, 'n': 50}

46% → 68%. A 22-point hit-rate improvement from one fusion function with no model changes. This is the biggest single-step win in RAG, and it's free. If your production RAG is pure-vector, this is your highest-priority change.

The k_rrf=60 constant comes from the original Cormack et al. paper (2009). Tune it if you must, but you're chasing single-digit percentage points; the default is fine.
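To see what the constant does, here is a toy run with made-up chunk ids. It is a sketch only: rrf_fuse is restated with the same scoring rule as above (ranks counted from 1), and the two rankings are invented for illustration.

```python
def rrf_fuse(rankings: list[list[str]], k_rrf: int = 60) -> list[str]:
    """Sum 1 / (k_rrf + rank) over all lists, with ranks starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_rrf + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ids: "x" tops the vector list only; "y" is third in both lists.
vector_ranking = ["x", "p", "y"]
bm25_ranking = ["q", "r", "y"]

print(rrf_fuse([vector_ranking, bm25_ranking], k_rrf=60))  # "y" comes first
print(rrf_fuse([vector_ranking, bm25_ranking], k_rrf=0))   # "x" comes first
```

With k_rrf=60, "y" scores 2/63 and beats "x" at 1/61, so agreement between retrievers outweighs a single first place. With k_rrf=0, "x" scores 1 against 2/3 for "y", so one retriever's top pick dominates. That is the dampening the docstring describes.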

Step 7: rerank the candidates with a cross-encoder¶

The hybrid retriever returns 5 chunks. The naive expectation is they're already in best-first order. The reality: hybrid retrieval is good at recall (the right chunk is somewhere in the candidates) and only OK at precision (the right chunk is at the top). A cross-encoder reranker reads each (query, chunk) pair as a single input and produces a relevance score - much slower than the bi-encoder embedding model, but dramatically more accurate.

Use a cross-encoder via sentence-transformers (free, runs locally):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def retrieve_hybrid_reranked(query: str, k: int = 5, candidate_k: int = 30) -> list[dict]:
    candidates = retrieve_hybrid(query, k=candidate_k * 2, candidate_k=candidate_k)
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    paired = sorted(zip(scores, candidates), key=lambda x: -x[0])
    return [c for _, c in paired[:k]]

print("hybrid + rerank:", evaluate(retrieve_hybrid_reranked, golden, k=5))

Typical result:

hybrid + rerank: {'hit@k': 0.82, 'mrr': 0.71, 'k': 5, 'n': 50}

68% → 82%. Another 14 points. The total journey from naive vector (46%) to hybrid + reranker (82%) is a 36-point hit-rate improvement, and every step is attributable.

A few notes on rerankers:

They are slow. A cross-encoder rerank of 60 candidates takes 1-3 seconds on CPU, ~200ms on GPU. Plan for this latency budget. Some production systems do bi-encoder reranking (faster, less accurate) or hosted reranking (Cohere rerank-3, ~100ms typical).
The candidate pool matters. Reranking 100 candidates from hybrid retrieval is meaningfully better than reranking 30, up to about 200 where returns diminish. Tune candidate_k * 2 to the upper end of what your latency budget tolerates.
bge-reranker-v2-m3 is the strong public choice in 2026. Cohere's rerank-3 is the strong commercial choice. Both will likely be superseded; check the BEIR leaderboard for current best.
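Since rerankers get swapped over time, it can help to keep the reranking stage independent of any one model. The sketch below assumes candidates are dicts with a "text" key, as in the retrievers above. The names rerank and toy_overlap_scorer are hypothetical, and the toy scorer is only a stand-in so the example runs without downloading a model; it is not a cross-encoder.

```python
from typing import Callable

PairScorer = Callable[[list[tuple[str, str]]], list[float]]


def rerank(query: str, candidates: list[dict], score_pairs: PairScorer, k: int = 5) -> list[dict]:
    """Score every (query, chunk text) pair in one batch and keep the k best."""
    if not candidates:
        return []
    scores = score_pairs([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:k]]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer (NOT a cross-encoder): fraction of query words found in the chunk."""
    out = []
    for query, text in pairs:
        q_words = set(query.lower().split())
        t_words = set(text.lower().split())
        out.append(len(q_words & t_words) / max(len(q_words), 1))
    return out


candidates = [
    {"id": 1, "text": "the cat sat on the mat"},
    {"id": 2, "text": "treaty of westphalia signed in 1648"},
    {"id": 3, "text": "westphalia is a region in germany"},
]
top = rerank("treaty westphalia signed", candidates, toy_overlap_scorer, k=2)
print([c["id"] for c in top])
```

Passing the CrossEncoder's predict method from above as score_pairs should reproduce the scoring step of retrieve_hybrid_reranked. A hosted reranker could be wrapped in a function with the same shape.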

Step 8: smarter chunking is the third lever¶

Fixed-size character chunking is the baseline. Two improvements that tend to move the number:

Recursive splitting with semantic boundaries. Prefer to break at paragraph, then sentence, then word boundaries - never mid-word, never mid-code-block. Libraries: langchain.text_splitter.RecursiveCharacterTextSplitter, llama_index.node_parser.SentenceSplitter. A 5-10 line custom version usually suffices.
Structural chunking. If your documents are Markdown with headings, code blocks, lists, tables, structure-aware chunking respects those - one section = one chunk, with the heading prepended. For mixed content (HTML, PDFs with layout), tools like unstructured or docling extract structure first and chunk on it.

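import re

def chunk_structural(text: str, max_size: int = 1000) -> list[str]:
    """Split on paragraphs (blank lines), then fall back to sentence boundaries."""
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank paragraphs; the embeddings API rejects empty input
        if len(paragraph) <= max_size:
            chunks.append(paragraph)
            continue
        # Paragraph too big - split after sentence-ending punctuation, which keeps
        # the punctuation attached instead of re-adding ". " by hand.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.replace("\n", " "))
        current = ""
        for sent in sentences:
            if current and len(current) + 1 + len(sent) > max_size:
                chunks.append(current)
                current = sent
            else:
                current = f"{current} {sent}" if current else sent
        # A single sentence longer than max_size still becomes one oversized chunk.
        if current:
            chunks.append(current)
    return chunks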

Re-ingest your corpus with structural chunking and re-evaluate. Typical lift: another 2-5 points of hit-rate, more if your corpus has strong structural cues (code, headings, tables). The lift is corpus-specific - measure it before committing.
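For the heading-aware variant described above ("one section = one chunk, with the heading prepended"), a minimal sketch for Markdown input might look like the following. It assumes ATX-style headings (lines starting with #) and fenced code blocks. The function name chunk_markdown_sections is hypothetical, and oversized sections are left whole rather than split further.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


def chunk_markdown_sections(markdown: str) -> list[str]:
    """One chunk per heading-delimited section, with the heading text prepended."""
    chunks: list[str] = []
    heading = ""
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(f"{heading}\n{text}" if heading else text)

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "Intro text",
    "# Title",
    "First para.",
    "## Sub",
    "Second para.",
    FENCE,
    "# not a heading",
    FENCE,
])
for chunk in chunk_markdown_sections(doc):
    print(repr(chunk))
```

Sections longer than your size budget could be passed through chunk_structural afterwards, re-prepending the heading to each piece. Whether that helps is, again, something to measure on the golden set.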

Step 9: generate the final answer¶

With strong retrieval, generation is anticlimactic. Hand the top chunks to the LLM with a clear "answer from these sources only" instruction:

import anthropic

claude = anthropic.Anthropic()

def answer(query: str) -> str:
    chunks = retrieve_hybrid_reranked(query, k=5)
    context = "\n\n---\n\n".join(
        f"[{i+1}] {c['source']}\n{c['text']}" for i, c in enumerate(chunks)
    )
    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=(
            "You are a careful assistant. Answer the user's question using ONLY "
            "the provided sources. Cite source numbers in square brackets like [1]. "
            "If the sources do not contain the answer, say so explicitly."
        ),
        messages=[{
            "role": "user",
            "content": f"Sources:\n\n{context}\n\nQuestion: {query}",
        }],
    )
    return "".join(b.text for b in resp.content if b.type == "text")

print(answer("When was the Treaty of Westphalia signed?"))

The output cites sources, refuses to invent answers, and benefits from the strong retrieval underneath. The generation step is the cheap part now - the hard work was upstream.

Step 10: break it (failure modes worth knowing)¶

10.1 Embeddings don't catch typos or rare names¶

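"AAPL" (a ticker) gets buried by news-article embeddings about apples. BM25 catches it because it matches the literal token - this is exactly the niche it occupies, and dropping BM25 leaves your users exposed to these queries. Typos are a separate problem: "Einstien" (misspelled) degrades pure vector search, and lexical search does not rescue it either, because full-text matching needs the token itself to match after stemming. If misspellings matter for your queries, add a fuzzy-matching step (for example trigram similarity via Postgres' pg_trgm extension) or spell-correct the query before retrieval.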

10.2 Long documents get truncated at chunk boundaries¶

A question that requires synthesis across two adjacent chunks may have neither chunk contain the full answer. Fixes: bigger chunks (lose granularity), bigger overlaps (lose efficiency), or "parent-document retrieval" where chunks are searched but their parent document is what gets fed to the LLM. The last is the strongest pattern for long-document corpora.
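The parent-document pattern can be sketched in a few lines. The version below assumes each hit carries a "source" key identifying its parent (as the retrievers above return) and that hits arrive best-first. The function name parents_for_hits and the parent_text lookup (source -> full text) are hypothetical stand-ins for wherever you store whole documents or sections.

```python
def parents_for_hits(hits: list[dict], parent_text: dict[str, str], max_parents: int = 3) -> list[dict]:
    """Map ranked chunk hits to their parents, deduplicated, best-ranked first."""
    seen: set[str] = set()
    parents: list[dict] = []
    for hit in hits:  # hits are assumed to be sorted best-first
        source = hit["source"]
        if source in seen or source not in parent_text:
            continue  # already included, or no stored parent for this chunk
        seen.add(source)
        parents.append({"source": source, "text": parent_text[source]})
        if len(parents) == max_parents:
            break
    return parents


hits = [
    {"id": 7, "source": "Peace of Westphalia", "text": "...signed in 1648..."},
    {"id": 8, "source": "Peace of Westphalia", "text": "...ended the war..."},
    {"id": 3, "source": "Thirty Years' War", "text": "...began in 1618..."},
]
parent_text = {
    "Peace of Westphalia": "Full text of the Westphalia article (placeholder).",
    "Thirty Years' War": "Full text of the Thirty Years' War article (placeholder).",
}
for parent in parents_for_hits(hits, parent_text):
    print(parent["source"])
```

For long articles the "parent" is often better defined as a section than as the whole document, so the context handed to the LLM stays within a sensible budget.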

10.3 The golden set goes stale¶

You added new documents to the corpus and your hit-rate dropped from 0.82 to 0.71. Did retrieval get worse, or did the new documents add competing chunks that displaced the right ones? Without revisiting the golden set, you can't tell. Plan to extend the golden set as the corpus grows; aim to grow it by ~5% each quarter, with questions that target the new documents.

10.4 Embedding-model upgrades require full re-ingestion¶

You decide to move from text-embedding-3-small to text-embedding-3-large (or to BGE-large, or to the next model). The new vectors are in a different space - existing vectors are now garbage and must be regenerated. Plan for: cost of re-embedding (~$10-100 per corpus depending on size), downtime or dual-write strategy, and re-evaluation against your golden set. Pin the embedding model in code so accidental upgrades are visible.

10.5 The "context too long" cliff¶

You feed the LLM 30 chunks "just to be safe." Two problems: cost scales linearly (often more than the retrieval cost), and quality often drops past about 10 chunks because the model gets distracted by irrelevant context. Stick to 3-5 chunks; let the reranker do its job.

Step 11: production considerations¶

A handful of operational concerns the workshop hasn't yet covered:

Caching. Embed queries that repeat (the same user asks the same thing twice; a popular query gets asked by many users). A Redis cache keyed on the normalized query string saves embedding-API tokens and 100ms per cached hit.
Freshness. When does new content show up in retrieval? Batch-ingest nightly is fine for slow-moving corpora; per-document streaming is necessary for news, support tickets, or fast-changing documentation.
Cost tracking. Embedding tokens, generation tokens, reranker GPU time. Log per query so you can see anomalies (a user query that triggers a 50-chunk context is 10x the cost of the average).
Eval in CI. Run the golden-set evaluation in a CI job before deploys. If hit-rate regresses below a threshold, block the release. whatifd exists for exactly this pattern - paired-percentile bootstrap on regression evals - and it's the right tool for the job once your eval is mature.
Per-user / per-tenant retrieval. Multi-tenant corpora must filter at retrieval time, not generation time. Add a tenant_id column, partition or filter the index, and never let one tenant's results leak into another's.
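As an illustration of the "Eval in CI" item, here is a plain-Python sketch of a paired percentile bootstrap on per-question hits. It is not whatifd's API; it only shows the idea. It assumes you record a 0/1 hit per golden question for both the baseline and the candidate retriever, which needs a small change to evaluate() since that function returns only aggregates. The names and the gating policy are hypothetical.

```python
import random


def paired_bootstrap_delta(
    baseline_hits: list[int],
    candidate_hits: list[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean delta, 2.5th pct, 97.5th pct) of candidate minus baseline hit-rate.

    Both lists hold 0/1 per golden question, in the same question order.
    """
    if len(baseline_hits) != len(candidate_hits) or not baseline_hits:
        raise ValueError("need two non-empty per-question lists of equal length")
    diffs = [c - b for b, c in zip(baseline_hits, candidate_hits)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


def gate(baseline_hits: list[int], candidate_hits: list[int], max_regression: float = 0.02) -> bool:
    """Pass unless the whole interval lies below -max_regression (a clear regression)."""
    _, _, hi = paired_bootstrap_delta(baseline_hits, candidate_hits)
    return hi >= -max_regression


baseline = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # made-up per-question hits
candidate = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]  # same questions, new retriever
delta, lo, hi = paired_bootstrap_delta(baseline, candidate)
print(f"delta={delta:+.2f}, 95% interval [{lo:+.2f}, {hi:+.2f}], pass={gate(baseline, candidate)}")
```

The gate here blocks only confident regressions. A stricter policy would block whenever the lower bound falls below the threshold. Which one fits depends on how costly a missed regression is compared with a blocked release.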

Now extend it¶

Add query rewriting / HyDE. Before retrieval, ask the LLM to (a) rephrase the query into 2-3 alternative phrasings, or (b) hallucinate a "hypothetical document" that would answer it. Embed each variant and union the results. Typical lift: 3-5 hit-rate points on ambiguous queries.
Add metadata filtering. Many real queries have implicit filters ("issues from the last week," "articles by Alice"). Detect these with a small LLM call and add as WHERE clauses; restrict the retrieval set before ranking.
Add re-evaluation on every model change. Wire your golden-set eval into CI so any embedding-model or reranker swap shows its impact on a PR comment. Use whatifd's regression check for the statistical rigor.
Add citation tracing. Track which source each generated sentence came from and surface citations in the UI. The "is this answer hallucinated" question becomes auditable.
Compare to a hosted vector DB. Run the same pipeline on Pinecone, Qdrant, or Weaviate. Note the operational differences (auth, billing, search latency) and decide if your scale justifies leaving Postgres. Most teams under 10M documents are best off staying on pgvector.
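For the query-rewriting extension, the retrieval side can be sketched without committing to any particular rewriting prompt. The version below assumes you already have a list of query variants (the original query plus LLM-generated rephrasings or a HyDE passage) and a retriever callable that takes (query, k) and returns dicts with an "id" key, like the retrievers above. It orders the union with RRF, which is one reasonable choice among several. The name retrieve_multi_query is hypothetical.

```python
from typing import Callable


def retrieve_multi_query(
    variants: list[str],
    retriever: Callable[[str, int], list[dict]],
    k: int = 5,
    candidate_k: int = 30,
    k_rrf: int = 60,
) -> list[dict]:
    """Run the retriever once per query variant and fuse the ranked lists with RRF."""
    by_id: dict[int, dict] = {}
    scores: dict[int, float] = {}
    for variant in variants:
        for rank, hit in enumerate(retriever(variant, candidate_k), start=1):
            by_id.setdefault(hit["id"], hit)  # keep the first copy of each chunk
            scores[hit["id"]] = scores.get(hit["id"], 0.0) + 1.0 / (k_rrf + rank)
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [by_id[i] for i in best]


# Stub retriever with canned rankings so the sketch runs without a database.
fake_index = {
    "when was the treaty signed": [10, 11, 12],
    "treaty signing date": [12, 10, 13],
}


def stub_retriever(query: str, k: int) -> list[dict]:
    return [{"id": i, "text": f"chunk {i}"} for i in fake_index[query][:k]]


fused = retrieve_multi_query(list(fake_index), stub_retriever, k=2)
print([hit["id"] for hit in fused])
```

Each variant costs one more embedding call and one more database round-trip, so the lift is worth confirming on the golden set before paying that latency in production.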

What you might wonder¶

"Do I need a vector database, or is pgvector enough?" pgvector is enough for any corpus under ~50M chunks and most teams never reach that. It's the same JOIN-friendly Postgres your application already runs, it does both vector and BM25, and the operational complexity is zero new services. Switch to a dedicated vector DB only when you have a concrete reason - filter complexity that Postgres struggles with, write throughput beyond Postgres, or strict latency budgets at scale.

"Why text-embedding-3-small instead of large?" Cost. The small model is roughly 6.5x cheaper at launch list prices ($0.02 vs. $0.13 per million tokens) and ~95% as accurate on retrieval benchmarks. Spend the savings on more retrieval candidates and reranking; the per-query quality is higher than large alone. Switch to large only if you have a measured reason in your golden set.

"What about LLM-as-judge eval (faithfulness, answer relevance)?" Useful but slower and noisier than retrieval metrics. Build the retrieval eval first (deterministic, fast, cheap). Add LLM-as-judge for end-to-end answer quality once retrieval is solid. The order matters: a system with bad retrieval can't be saved by judging the answer.

"How do I handle very long documents (papers, books, codebases)?" Hierarchical retrieval: index summaries at the document level, find the right document first, then retrieve chunks within it. Or: parent-document retrieval (the chunk index for search, the parent for generation context). Don't try to make one chunk size fit a 500-page document.

"How big a golden set do I need?" Diminishing returns past ~200 questions. The first 50 give you direction; 100-200 give you reliable comparisons; more lets you slice by query type. For a brand-new system, 50 is enough to get started; grow it to 200 over the first quarter of production.

"What does 'good' hit-rate look like?" Strongly corpus-dependent. For Wikipedia-style corpora with hand-written golden sets, 0.80-0.90 hit@5 is realistic. For technical documentation, 0.85-0.95 is the bar. For unstructured customer-support tickets, 0.65-0.75 may be the realistic ceiling. The number is less useful in absolute terms than as a delta you control.

"When is RAG not the right tool?" Three places it underperforms: (1) when the answer requires multi-hop reasoning across many chunks (use long-context with the full document in the prompt, or build a chain of RAG calls); (2) when the corpus is small enough to fit in the model's context (just stuff it - no embedding cost, no retrieval failure mode); (3) when the user query is so vague that no retrieval can help (a small clarification-question step before retrieval is the fix).

What this gave you¶

You built a complete production RAG pipeline (ingestion → hybrid retrieval → reranking → generation) on pgvector with measured numbers at every step.
You authored a golden set and learned that the eval is more important than the code.
You saw the 22-point lift from naive vector → hybrid retrieval (the biggest single-step RAG win).
You saw the additional 14 points from cross-encoder reranking.
You can attribute every percentage point of your hit-rate to a specific technique.
You know the failure modes that bite (embeddings vs typos, chunking boundaries, golden-set drift, embedding-model upgrades, context-length cliff) and how to defend against each.
You have a clear path to whatifd or any other eval framework once your evaluation matures.

The bigger transfer: AI systems quality is measured, not eyeballed. Every workshop after this builds on the assumption that you have a number you're moving. If you take one habit from this workshop, take that one.

Next: Workshop 6 - Structured output with grammar-constrained decoding, where the model literally cannot emit invalid JSON - tokens are constrained at the decoding step itself.

Submit your build¶

When you finish this workshop, share what you built so others can see and learn from your work. Include:

Public repo with the full RAG pipeline (chunker, embedder, retriever, reranker, generator)
Your golden set as a committed JSONL file
A measurement table showing hit-rate@5 and MRR at each stage (naive vector / BM25 / hybrid / hybrid+rerank)
Short note (5 to 8 sentences) on which knob moved the number most on YOUR corpus and why
