Month 5-Week 4: End-to-end RAG eval, RAGAS, publish

Week summary

  • Goal: Evaluate the full RAG pipeline (retrieval + generation): faithfulness, answer relevance, context precision/recall. Use RAGAS + a hand-rolled equivalent. Failure mode taxonomy. Publish the fifth public blog post.
  • Time: ~9 h over 3 sessions.
  • Output: End-to-end RAG eval; failure-mode analysis; fifth public blog post; M05 retrospective.
  • Sequences relied on: 10-retrieval-and-rag rungs 08, 11; 12-evaluation-systems rungs 04, 05, 10.

Why this week matters

Retrieval can be perfect and the answers still bad. End-to-end RAG evaluation, covering faithfulness (no hallucination), answer relevance, and context precision/recall, is the discipline that converts "retrieval works" into "the system answers correctly." Without it you ship hallucinations.

The blog post wraps up M05's RAG arc. Combined with M04's eval methodology, you now have a public portfolio of applied AI engineering with measurable outcomes, which is a rare and valuable signal.

Prerequisites

  • M05-W01–W03 complete.
  • Hybrid + rerank pipeline working.

Session plan

  • Session A-Tue/Wed evening (~3 h): RAGAS setup + first eval
  • Session B-Sat morning (~3.5 h): hand-rolled eval + failure taxonomy
  • Session C-Sun afternoon (~2.5 h): publish post + M05 retro

Session A-RAGAS setup + first end-to-end eval

Goal: Install RAGAS. Wire it to your pipeline. Get first numbers on faithfulness, answer relevance, context precision/recall.

Part 1-Read RAGAS (60 min)

Read: RAGAS paper (arxiv.org/abs/2309.15217), sections 1, 2, 3. Read: RAGAS docs (docs.ragas.io), focusing on the four core metrics:

  • Faithfulness: does the answer follow from the context?
  • Answer relevance: does the answer address the question?
  • Context precision: are the retrieved chunks relevant?
  • Context recall: did we retrieve all the relevant context?
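As rough intuition only (a simplified sketch; the RAGAS library computes these metrics with LLM judges and embedding similarity, so the real implementations are more involved), faithfulness and context recall both reduce to simple ratios:

# Simplified intuition only, not the library's exact implementation.
# Faithfulness: share of the answer's atomic claims supported by the retrieved context.
supported_claims, total_claims = 6, 8
faithfulness = supported_claims / total_claims                 # 0.75

# Context recall: share of ground-truth statements attributable to the retrieved context.
attributed_statements, total_gt_statements = 4, 5
context_recall = attributed_statements / total_gt_statements   # 0.8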

Part 2-Build the Q+A+context dataset (75 min)

For RAGAS-style eval, you need (question, ground_truth_answer, retrieved_contexts, generated_answer). Take 30 of your queries:

  • Query (you already have these).
  • Ground-truth answer (write it, or generate then edit; ~1-2 sentences each).
  • Retrieved contexts (from your best pipeline).
  • Generated answer (run your pipeline end-to-end with a generation step).

If your project doesn't have a generation step yet, add one:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rag_answer(query: str, k: int = 5) -> tuple[str, list[str]]:
    """Retrieve top-k chunks, then generate an answer grounded in them."""
    # search_with_rerank / chunk_text are your project's existing retrieval helpers.
    chunks = search_with_rerank(query, k=k)
    context = "\n\n".join(chunk_text(c) for c in chunks)
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        system="Answer the question using only the provided context. If the answer is not in context, say so.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return resp.content[0].text, [chunk_text(c) for c in chunks]
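
To assemble the eval dataset, loop your labeled queries through this pipeline. A minimal sketch, assuming your 30 queries and ground-truth answers live in a JSONL file (the file name and field names here are hypothetical placeholders for whatever you already use):

import json

# Hypothetical input: one JSON object per line with "question" and "ground_truth" keys.
with open("queries_with_ground_truth.jsonl") as f:
    labeled = [json.loads(line) for line in f]

# Build (question, generated_answer, contexts, ground_truth) tuples for Part 3.
eval_data = []
for row in labeled:
    answer, contexts = rag_answer(row["question"])
    eval_data.append((row["question"], answer, contexts, row["ground_truth"]))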

Part 3-Run RAGAS (45 min)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# eval_data: the (question, generated_answer, contexts, ground_truth) tuples from Part 2.
ds = Dataset.from_list([{
    "question": q,
    "answer": gen_a,
    "contexts": ctxs,
    "ground_truth": gt,
} for q, gen_a, ctxs, gt in eval_data])

result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)

Likely first numbers (your project's may differ):

faithfulness:        0.78
answer_relevancy:    0.83
context_precision:   0.69
context_recall:      0.71
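
Keep the per-question scores as well as the aggregates; Session B's failure analysis needs them. A small sketch, assuming your installed RAGAS version exposes result.to_pandas() (check your version's docs):

# Persist per-question scores so Session B can sort and inspect the worst cases.
df = result.to_pandas()
df.to_csv("rag_eval_scores.csv", index=False)
print(df.sort_values("faithfulness").head(10))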

Output of Session A

  • Q+A+context dataset (30 examples).
  • RAGAS first run with all 4 metrics.

Session B-Hand-rolled eval + failure mode taxonomy

Goal: Implement your own RAGAS-style evaluators (gives you control + understanding). Categorize failures.

Part 1-Hand-rolled faithfulness (75 min)

FAITHFULNESS_PROMPT = """Given a context and an answer, identify which factual claims in the answer are supported by the context.

Context:
<<<CONTEXT>>>

Answer:
<<<ANSWER>>>

List each claim in the answer (atomic factual statements). For each, indicate whether it is supported by the context (YES/NO/PARTIAL).

Output strict JSON:
{"claims": [{"claim": "...", "supported": "YES|NO|PARTIAL"}]}
"""

import json

def faithfulness_score(answer: str, contexts: list[str]) -> float:
    """Fraction of the answer's claims that the judge marks as supported by the context."""
    ctx = "\n\n".join(contexts)
    prompt = FAITHFULNESS_PROMPT.replace("<<<CONTEXT>>>", ctx).replace("<<<ANSWER>>>", answer)
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt asks for strict JSON; parsing will raise if the judge adds extra prose.
    parsed = json.loads(resp.content[0].text)
    claims = parsed["claims"]
    if not claims:
        return 1.0  # no claims extracted: nothing to contradict the context
    yes = sum(1 for c in claims if c["supported"] == "YES")
    return yes / len(claims)

Run it on the same 30 examples and compare to RAGAS faithfulness. The two should correlate (Spearman ≥ 0.6, ideally); if the correlation is much lower, your prompt or theirs has issues.
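
A minimal sketch of that comparison, assuming you saved the per-question RAGAS scores to a dataframe in Session A (hand_rolled and the "faithfulness" column name are assumptions about your own artifacts):

from scipy.stats import spearmanr

# Score the same 30 examples with the hand-rolled judge, in the same order.
hand_rolled = [faithfulness_score(gen_a, ctxs) for _, gen_a, ctxs, _ in eval_data]
ragas_scores = df["faithfulness"].tolist()  # per-question column from the saved dataframe

rho, p = spearmanr(hand_rolled, ragas_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")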

Part 2-Failure mode taxonomy (60 min)

For the cases that scored poorly, categorize:

| Failure mode | Description | Count | Example query |
|---|---|---|---|
| Retrieval miss | Right context not retrieved | 4 | "How to handle X": relevant doc not in top-5 |
| Faithful but unhelpful | Answer cites context but doesn't actually answer | 2 | (paraphrasing without addressing) |
| Hallucination | Claims not in context | 1 | (model invented a CLI flag) |
| Judge disagreement | Eval judged it wrong but the answer was acceptable | 1 | (wording-level disagreement) |

This taxonomy is content for your blog post and your future improvements.
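
A small sketch for pulling the low-scoring cases out for manual categorization, assuming the per-question scores were saved to rag_eval_scores.csv in Session A (column names as in that sketch; the 0.7 threshold is an arbitrary starting point):

import pandas as pd

df = pd.read_csv("rag_eval_scores.csv")

# Flag the cases worth a manual look.
suspect = df[(df["faithfulness"] < 0.7) | (df["answer_relevancy"] < 0.7)]
for _, row in suspect.iterrows():
    print(row["question"],
          "| faithfulness:", round(row["faithfulness"], 2),
          "| answer_relevancy:", round(row["answer_relevancy"], 2))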

Part 3-Inspect each failure (45 min)

Open one failure of each type. For each, write a 2-sentence note: what would fix this? Possible fixes:

  • Retrieval miss → improve chunking, add query expansion, tune k.
  • Hallucination → constrain prompts, add a "say I don't know" instruction, or switch to a stricter model (one tightened prompt variant is sketched below).
  • Faithful but unhelpful → improve the answer prompt, ensure the model sees the question clearly.
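
As a concrete example of the hallucination fix, a hedged sketch of a stricter system prompt for the generation step (the wording is illustrative, not a tested prompt):

# Hypothetical stricter variant of the rag_answer system prompt.
STRICT_SYSTEM = (
    "Answer the question using only the provided context. "
    "Quote or closely paraphrase the context; do not add facts that are not in it. "
    "If the context does not contain the answer, reply exactly: "
    "'I don't know based on the provided context.'"
)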

Output of Session B

  • Hand-rolled faithfulness scorer.
  • Failure mode taxonomy with counts and examples.
  • 4 fix hypotheses.

Session C-Finish and publish blog post + M05 retro

Goal: Polish and publish the M05 RAG blog post. Run month retrospective.

Part 1-Polish the post (60 min)

Build on the W03 draft:

  • Add the end-to-end eval section (RAGAS metrics, hand-rolled scorer, failure taxonomy).
  • Final structure (~2500 words):
    1. Hook: "What actually moved retrieval quality."
    2. Setup: corpus, queries, eval methodology.
    3. The progression: BM25 → dense → hybrid → rerank → contextual.
    4. Numbers (the comparison table with CIs).
    5. End-to-end eval (RAGAS + hand-rolled).
    6. Failure taxonomy.
    7. What I'd do differently.
    8. Bridge to month 6 (agents).

Part 2-Publish + share (45 min)

  • Personal blog.
  • Cross-post: dev.to, HN, r/MachineLearning, r/LocalLLaMA, X, LinkedIn.
  • Tag relevant accounts (RAGAS team, Anthropic Contextual Retrieval team) politely.

Part 3-Month-5 retro (45 min)

MONTH_5_RETRO.md:

# Month 5 retro

## Artifacts shipped
- BM25 + dense + hybrid + rerank + contextual pipelines
- 30-query labeled retrieval eval
- 30-query end-to-end RAG eval (RAGAS + hand-rolled)
- Failure mode taxonomy
- Blog post: <link>

## KPIs vs Q2 targets
| Metric | Target Q2 | Actual end of M05 |
|---|---|---|
| Public repos | 2 | 1 (anchor) |
| Blog posts | 2 | 2 ✓ |
| Eval runs | 5+ | 8 ✓ |

## Lessons
1. Reranking gave the biggest single lift in retrieval.
2. End-to-end eval is more important than retrieval-only eval.
3. CIs make many "improvements" look smaller-and that's the point.

## What slipped

## Pace check

## M06 plan (agents)
- Tool-use loop scale-up to 5+ tools.
- ReAct + reflection.
- Agent observability with Langfuse / LangSmith.
- Q3 track decision (Evals / Agents / Inference) by start of M06-W04.

Output of Session C

  • Fifth public blog post live, ≥3 channels.
  • M05 retrospective committed.

End-of-week artifact

  • RAGAS eval running on 30+ examples
  • Hand-rolled faithfulness implemented
  • Failure mode taxonomy with example counts
  • Fifth blog post published, ≥3 channels
  • M05 retrospective written

End-of-week self-assessment

  • I can explain faithfulness vs answer relevance precisely.
  • I can run a RAGAS eval and interpret each metric.
  • My failure mode taxonomy guides my next improvements.

Common failure modes for this week

  • Treating RAGAS scores as ground truth. They're approximations from LLM judges. Validate against humans where possible.
  • Hiding the negative result. "Contextual retrieval gained only 2.5 points" is publishable.
  • Skipping the taxonomy. It's the lever for everything you build later.

What's next (preview of M06-W01)

Agents. Tool-use loop scale-up, ReAct, agent eval design. Then Q3 track decision by end of month.