11 - Evaluation

What this session is

About an hour. The hardest part of building with AI - and the one most beginner tutorials skip. By the end you'll know why "looks good" is not evaluation, and how to do it for real.

Why this page matters

Most "AI products" you'll see are evaluated by their authors clicking around and saying "yep, looks good." That's how launched products go viral with embarrassing failures the moment a user does something unexpected.

Good evaluation is what separates a demo from a system. Most engineers - even experienced ML practitioners - get this wrong. Take this page seriously.

The fundamental rule

You cannot iterate on what you cannot measure.

Without an objective evaluation, every change is a coin flip. Did this prompt change improve things or make them worse? You can't tell. Without a number, you'll convince yourself it's better - because you wrote it.

A measurable eval lets you see real improvement, A/B test prompts, catch regressions when you change models, and ship confidently.

Types of evaluation

Different problems need different evals.

Classification - easy

If your output is a class (positive/negative, A/B/C, 0-9):

  • Accuracy - fraction correct.
  • Precision - of predicted positives, what fraction are actually positive.
  • Recall - of actual positives, what fraction did you find.
  • F1 - harmonic mean of precision and recall.
  • Confusion matrix - full breakdown of predicted vs actual.

scikit-learn:

from sklearn.metrics import classification_report, confusion_matrix

# y_true: the gold labels, y_pred: your model's predictions
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

Done. Easy.

Free-form text generation - hard

If your output is a paragraph of text (LLM chatbot, summarizer):

  • Exact match - useless unless you're matching against a fixed answer.
  • BLEU, ROUGE, METEOR - n-gram overlap with a reference. Useful for translation; poor for chat (paraphrase = bad score).
  • Embedding similarity - cosine similarity between generated and reference embedding. Better than n-gram.
  • LLM-as-judge - use a strong model to grade outputs. Most-used in practice, with caveats below.
  • Human eval - gold standard, expensive.

For chat / RAG / summarization, LLM-as-judge is the practical default.
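
To make the embedding-similarity option concrete, here's a minimal sketch assuming sentence-transformers is installed (the model name is just one common choice; any embedding model works):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; any embedder works

def embedding_similarity(generated: str, reference: str) -> float:
    # Cosine similarity between the generated answer and the reference answer.
    gen_vec, ref_vec = model.encode([generated, reference])
    return float(np.dot(gen_vec, ref_vec) / (np.linalg.norm(gen_vec) * np.linalg.norm(ref_vec)))

print(embedding_similarity("Paris is the capital of France.", "The capital of France is Paris."))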

Retrieval - medium

If your problem is "did I retrieve the right passages":

  • Recall@K - of all relevant passages, how many appear in your top-K results.
  • MRR (Mean Reciprocal Rank) - average of 1 / rank-of-first-relevant.
  • nDCG (normalized Discounted Cumulative Gain) - measures ranking quality, discounting relevant passages that appear lower in the list.

You need a labeled dataset: each query has a known correct passage. Build this manually for ~100-1000 queries.
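
A minimal sketch of Recall@K and MRR over such a labeled set, assuming each query maps to a set of known relevant passage ids:

def recall_at_k(retrieved, relevant, k=5):
    # retrieved: ranked list of passage ids; relevant: set of relevant passage ids
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(all_retrieved, all_relevant):
    # Mean of 1/rank of the first relevant passage per query (0 if none retrieved).
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)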

LLM-as-judge

You have outputs from your system. You want to know "are these good?"

import json

from openai import OpenAI            # or any LLM client

client = OpenAI()                     # assumes OPENAI_API_KEY is set in the environment

judge_prompt = """You are grading an AI assistant's answer.

Question: {question}
Expected answer: {gold}
AI answer: {generated}

Grade the AI answer on:
- Correctness (0-5): does it match the expected answer in substance?
- Completeness (0-5): does it cover the key points?
- Conciseness (0-5): is it free of fluff?

Respond ONLY with JSON: {{"correctness": N, "completeness": N, "conciseness": N, "rationale": "..."}}
"""

def grade(question, gold, generated):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": judge_prompt.format(
            question=question, gold=gold, generated=generated
        )}],
    )
    return json.loads(response.choices[0].message.content)

Run your system against a held-out dataset of (question, gold-answer) pairs. Have the judge grade each. Aggregate the scores.
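
A minimal sketch of that loop, reusing grade() from above (eval_set is assumed to be a list of dicts with "question" and "gold" keys, and run_system is a stand-in for your own pipeline):

from statistics import mean

def evaluate(eval_set, run_system):
    # Run the system on each example, grade it, and average each dimension.
    grades = [grade(ex["question"], ex["gold"], run_system(ex["question"])) for ex in eval_set]
    return {
        "correctness": mean(g["correctness"] for g in grades),
        "completeness": mean(g["completeness"] for g in grades),
        "conciseness": mean(g["conciseness"] for g in grades),
    }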

Caveats:

  • Use a more capable model for judging than for generating. Don't have GPT-3.5 grade GPT-3.5; have GPT-4 or Claude do it.
  • Judge bias. LLM judges have biases (preferring longer answers, preferring their own family's models). Counter with care.
  • Calibration. Run human-judged grades on a subset; check the LLM judge agrees with humans.
  • Pairwise > absolute. "Which of A and B is better" judgments are more stable than absolute 1-5 scores (sketch below).
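
Here's a minimal pairwise sketch, reusing the client and json import from the grading snippet above; swapping which answer appears as A vs B across comparisons helps counter position bias:

pairwise_prompt = """You are comparing two AI answers to the same question.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better overall? Respond ONLY with JSON: {{"winner": "A" or "B", "rationale": "..."}}
"""

def compare(question, answer_a, answer_b):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": pairwise_prompt.format(
            question=question, answer_a=answer_a, answer_b=answer_b
        )}],
    )
    return json.loads(response.choices[0].message.content)["winner"]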

Even with caveats, LLM-as-judge is far better than "looks good to me."

Build an evaluation dataset

For LLM apps, this is the work you'll spend the most time on. Patterns:

  1. Production traces. Sample real user queries from your service logs. Manually label expected answers. ~100-1000 examples.
  2. Adversarial cases. Deliberately construct queries where you know what the system should do: boundary cases, ambiguous queries, out-of-scope queries.
  3. Public benchmarks. MMLU (multitask), TruthfulQA, HumanEval (coding), GSM8K (math). Useful for "how does my model compare," less useful for "is my prompt better."

A good eval dataset is representative + adversarial + maintained. Production examples + manually-curated edge cases. Refresh as your product evolves.
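
One simple storage choice (an assumption here, not a requirement) is JSONL, one JSON object per line; the file name and example rows below are purely illustrative:

import json

eval_set = [
    {"question": "What does the refund policy cover?", "gold": "Refunds within 30 days of purchase."},
    {"question": "Is there a free tier?", "gold": "Yes, up to 1,000 requests per month."},
]

# Write one example per line so the set is easy to diff, review, and extend.
with open("eval_set.jsonl", "w") as f:
    for example in eval_set:
        f.write(json.dumps(example) + "\n")

# Read it back before an eval run.
with open("eval_set.jsonl") as f:
    eval_set = [json.loads(line) for line in f]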

A real workflow

Pattern that works:

  1. Build a small golden dataset. ~50-200 examples to start.
  2. Run your current system. Score with LLM-as-judge. Get a baseline number.
  3. Make a change (new prompt, new model, new retrieval).
  4. Re-run. Compare. If the number went up materially, ship; if it went down, revert; if it's noise, you didn't change enough.
  5. Expand the eval set as you discover failure modes in production.

This loop is the whole job. Every successful AI product team runs some version of it. Every failed one didn't.
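
Steps 2-4 can be as small as this sketch, reusing the evaluate() helper from the LLM-as-judge section (run_baseline and run_candidate are hypothetical stand-ins for your current and changed systems):

baseline = evaluate(eval_set, run_baseline)    # step 2: baseline number
candidate = evaluate(eval_set, run_candidate)  # step 4: re-run after the change

for metric in baseline:
    delta = candidate[metric] - baseline[metric]
    print(f"{metric}: {baseline[metric]:.2f} -> {candidate[metric]:.2f} ({delta:+.2f})")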

Cost and latency are evaluation criteria

A model that's 5% better but 10× slower might be worse for your product. Track both quality and ops costs:

  • Per-request cost (tokens × model price).
  • p50, p95 latency.
  • Throughput (requests per second).

A useful "is the next model worth it" question: "for every 1% quality improvement, how much do cost/latency change?"

Bias, fairness, safety

Big and important; out of scope for a beginner page. The minimum:

  • Test on diverse inputs. Different demographics, languages, dialects, edge cases.
  • Test refusal behavior. Does it refuse harmful requests? Does it over-refuse benign ones?
  • Test on adversarial prompts (prompt injection).

Production teams have dedicated red-teamers. For your first project, manual sampling is fine.

Specific tools

  • evaluate (Hugging Face) - eval-metric library. Bundles many standard metrics.
  • langsmith (LangChain) - tracing + evaluation platform.
  • promptfoo - open-source eval CLI for LLM prompts.
  • ragas - RAG-specific evaluation metrics (faithfulness, context relevancy).
  • lm-eval-harness (EleutherAI) - runs many academic benchmarks.

For learning: roll your own (the snippet above). For scale: pick one tool.

Exercise

  1. Build a tiny eval dataset. Use the RAG from page 10. Write 10 (question, expected-answer) pairs covering your corpus.

  2. Run the RAG. Manually score each output 1-5 on correctness. Average the scores; that's your baseline.

  3. Make a change. Change the prompt, or k=2 → k=4, or use a bigger embedder. Re-score. Did it improve?

  4. Add LLM-as-judge. Have the model itself score the outputs against expected answers. Compare to your manual scores. How well do they agree?

  5. (Stretch) Use the ragas library on your RAG. Run its faithfulness + answer_relevancy metrics.

What you might wonder

"How big does my eval set need to be?" 50 is the minimum for noisy signal. 500+ is comfortable. 5000+ for academic-paper-strength results. For getting started, start at 50 and expand.

"Can I trust LLM-as-judge?" Mostly. Pair with human spot-checks (10% of examples reviewed by you). When LLM-as-judge says scores went up but you can't see the improvement, something's miscalibrated.

"What about RLHF / DPO / online evaluation?" Real production AI products have ongoing eval pipelines collecting user feedback, A/B testing changes, fine-tuning on preference data. Beyond beginner; mentioned for awareness.

"How does this compare to evaluating a 'normal' classifier?" Classifiers have ground truth + simple metrics. LLM outputs have ambiguity at every step - there's no single correct answer to "summarize this article." Evaluation gets correspondingly fuzzy. The discipline is the same; the metrics are softer.

Done

  • Recognize different eval types (classification, generation, retrieval).
  • Build an evaluation dataset.
  • Use LLM-as-judge correctly (with calibration awareness).
  • Run the build → measure → change → measure loop.
  • Track cost + latency alongside quality.

Next: Serving models →
