Deep Dive 08-Evaluation Systems for LLMs¶
Status: foundational chapter for the user's primary specialty. Sequence link: extends Sequence 12 of the curriculum from survey depth to working depth. Reading time: ~3 hours active, ~6 hours if you do every exercise. Prerequisite chapters: 01 (transformer mechanics), 03 (prompting), 06 (RAG), 07 (agents).
This chapter is the longest of the deep dives because evaluation is the leverage point of the entire applied-AI stack. If you cannot measure quality, every other engineering choice-model selection, prompt revision, retrieval tuning, fine-tuning-is a guess dressed up as a decision. The teams that ship well-behaved LLM systems have one trait in common: they invested in eval before they invested in the model. That is the position you are training for.
The structure of the chapter mirrors how a real eval program is built up: first the philosophy of why this is hard, then the taxonomy of approaches, then the foundations (golden dataset, statistics, agreement), then the modern default (LLM-as-judge and its calibration), then operations (CI, dashboards, regressions, online vs offline), then the task-specific forms, then the meta-eval problem (evaluating the judge), then tools and lifecycle, and finally the cost and anti-pattern landscape, ending with exercises.
1. Why LLM evaluation is hard¶
Classical ML evaluation is a closed-form arithmetic problem. You have a labeled dataset, the model emits a prediction, you compare against the label, and you average a metric across the set. The metric-accuracy, AUC, RMSE-is a number with well-understood statistical properties. You can train against it, measure progress against it, and compare two models on it without philosophical debate.
LLM evaluation is none of that. The shift is structural, not cosmetic, and it affects every downstream decision.
(a) Generative outputs are open-ended. Asked to summarize a document, write an SQL query, or respond to a customer ticket, an LLM produces a string from a combinatorially large space. There is rarely a single "right" output. Two competent humans will produce different summaries; both can be correct. A reference-comparison metric that punishes any deviation from a single ground-truth string is therefore mismatched with the underlying notion of quality. This is not a small error-it is a category error, and it shows up as low correlation between automatic metrics and human judgment.
(b) Reference-based metrics correlate weakly with humans on generation. BLEU was designed for translation, where a token-overlap signal is at least defensible. ROUGE was designed for summarization, where it is already shaky. When applied to chat responses, instruction following, or RAG answers, BLEU/ROUGE numbers go up and down without tracking actual quality. Studies dating back to Liu et al. on summarization, and Kocmi et al. on translation, repeatedly show that humans and these metrics disagree often enough that a 1–2 point movement in the metric is essentially noise. You can ship a worse model with a better BLEU.
(c) Costs scale with eval-set size times call price times judge price. A single eval run on 1,000 examples with a 4-cent candidate call and a 10-cent judge call is $140 per run. If you iterate 20 times in a sprint, that is $2,800-and that is for one product surface. Many teams have multiple surfaces. Eval cost is a real budget line, not a rounding error.
(d) Models change weekly. Vendors deprecate, retrain, and re-release. Even if your prompt and code are frozen, the model under you is not. Eval has to be fast to re-run: every time the upstream model version moves, you need to know within hours whether your behavior shifted. This pushes you toward small, well-stratified eval sets and aggressive caching, against the natural temptation to grow the eval set unboundedly.
(e) Distributional shift is constant. Production traffic drifts. Users learn new tricks. Adversarial inputs appear. An eval set frozen in 2024 stops describing 2026 traffic. This is why eval is a lifecycle, not a one-off.
(f) The "good" judgment is multi-dimensional. Faithfulness, helpfulness, safety, format compliance, latency, cost. A single scalar metric is convenient and dishonest; serious eval is a vector with separate guard-rails per dimension.
The combination-open-ended outputs, weak automatic metrics, real costs, drifting models, drifting traffic, multi-dimensional quality-is what makes eval the engineering problem of applied LLM work. The rest of this chapter is how to attack it.
2. The eval taxonomy¶
There are four families of LLM eval. Each answers a different question; pick the family before you pick the metric.
2.1 Reference-based eval¶
You have a ground-truth answer; you compare the model's output to it.
- Exact match (EM): `score = 1 if pred == gold else 0`. Brittle; whitespace and casing kill it.
- EM with normalization: lower-case, strip punctuation, collapse whitespace, drop articles. The standard for short-answer QA (SQuAD style).
- Token-level F1: treat pred and gold as bags of tokens. `precision = |P ∩ G| / |P|`, `recall = |P ∩ G| / |G|`, `F1 = 2·P·R / (P + R)`. Works when partial credit is sensible.
- BLEU: n-gram precision over 1..4-grams, geometric mean, brevity penalty. Designed for MT.
- ROUGE-N / ROUGE-L: n-gram recall (ROUGE-N) or longest common subsequence (ROUGE-L). Designed for summarization.
- BERTScore: token-level cosine similarity of contextual embeddings between pred and gold; better correlation with humans than BLEU/ROUGE on most generation tasks.
- Embedding cosine: sentence-level embedding similarity. Cheap, very rough.
Use reference-based when (i) the task has narrow correct answers (factoid QA, SQL generation against a fixed schema, code outputs measurable by test) or (ii) you want a cheap, deterministic regression signal alongside richer metrics.
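A minimal sketch of the two cheapest metrics above, assuming SQuAD-style normalization (exact rules vary by benchmark); `exact_match` and `token_f1` here are illustrative helpers, not a particular library's API.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lower-case, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    p_tokens, g_tokens = normalize(pred).split(), normalize(gold).split()
    if not p_tokens or not g_tokens:
        return float(p_tokens == g_tokens)
    overlap = sum((Counter(p_tokens) & Counter(g_tokens)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p_tokens), overlap / len(g_tokens)
    return 2 * precision * recall / (precision + recall)

# exact_match("The Eiffel Tower", "eiffel tower")  -> 1.0
# token_f1("Paris, France", "Paris")               -> ~0.67
```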
2.2 Reference-free eval¶
No ground truth; you assess the output on its own merits.
- Heuristic / programmatic: "does the response contain a JSON object that parses?" "does it cite at least one source?" "is it under 200 tokens?" These are cheap and high-signal for format and contract compliance.
- Classifier-based: a trained classifier scores a property (toxicity, sentiment, hallucination probability). Examples: Detoxify, NLI-based faithfulness scorers.
- LLM-as-judge: another LLM rates the response against a rubric. Now the modern default for subjective dimensions like helpfulness and faithfulness. Section 4 covers this in depth.
Use reference-free when (i) ground-truth answers are infeasible to produce at scale, or (ii) the dimension you care about (toxicity, format, helpfulness) is not about matching a string.
2.3 Outcome-based eval¶
You don't score the model output directly; you score whether the downstream system succeeded.
- For a search agent: did the user find the document? (click-through, dwell time, follow-up question rate)
- For a coding agent: do the generated tests pass? Does the patch make CI green?
- For an SQL agent: does the query return the correct rows? (compare against gold result set, not gold query string)
- For a triage agent: was the ticket routed to the team that actually owned it?
Use outcome-based whenever the system has a closed-loop notion of success. It is the gold standard of relevance because it measures the thing you actually care about, not a proxy. The downside: outcomes are often delayed, sparse, or confounded by non-LLM factors.
2.4 Trajectory-based eval¶
For agents, the final answer can be right by luck or wrong despite the right approach. Trajectory eval scores the sequence of tool calls, intermediate states, and reasoning steps.
- Did the agent call the right tools in a sensible order?
- Did it spend tokens on irrelevant subtasks?
- When given a chance to recover from an error, did it?
- Number of steps to solution; cost-per-task; tool-call success rate.
Trajectory eval is what Inspect AI (UK AISI's framework) bakes in as a first-class concept and what Braintrust/LangSmith expose via tracing. It is essential for agent work because outcome-only eval is too coarse to debug.
2.5 When each is right¶
| Question | Family |
|---|---|
| "Is this short answer correct?" | Reference-based (EM/F1) |
| "Is this summary faithful?" | Reference-free (LLM-as-judge with rubric) |
| "Did the user click?" | Outcome-based |
| "Did the agent take a sensible path?" | Trajectory-based |
| "Is this code correct?" | Outcome-based (run the tests) |
| "Is this response polite?" | Reference-free (classifier or judge) |
Real systems use a stack of these, not one. A production RAG eval might run: programmatic format check → reference-based exact match on factoid subset → LLM-as-judge faithfulness → outcome-based click-through. Each catches a different class of failure.
3. Golden dataset design-the foundation¶
Everything downstream-your metric, your judge, your CI gate-sits on top of the golden dataset. A bad eval set produces measurements you cannot trust, which is worse than no measurements because it gives confident wrong answers.
3.1 Size: think in tiers¶
There are three useful sizes for an eval set, and they correspond to different stages of the lifecycle.
- ~50 examples (v0). Fast iteration. You can run this in under a minute and read every failure by hand. Use during early prompt design and prototyping. Confidence intervals are wide, but you don't need them yet-you need signal and speed.
- ~500 examples (v1). Confident measurement of large effects (>3 percentage points). Good enough for "does this prompt change improve things or not?" Will run in a few minutes, costs a few dollars per pass, and produces credible aggregate numbers.
- ~2,000+ examples (v2+). Fine-grained analysis: stratify by slice, detect 1-point regressions, support multi-judge agreement work. Required when the system is in production and decisions cost money.
Section 5 makes the size arithmetic precise via statistical power; the tiers above are pragmatic lower bounds.
3.2 Coverage: stratify deliberately¶
A 1,000-example set drawn uniformly from production looks fine in aggregate but typically over-represents the head intents and under-represents the rare-but-important tail. Stratify by:
- Intent / use-case: if your bot serves billing, technical, and account questions, sample roughly equally from each, not by frequency.
- Difficulty: include known-hard cases (multi-hop, ambiguous, adversarial). Hand-pick about 10% of the set as a "hard" slice you watch separately.
- Length: include short and long inputs. Length is a major axis of failure that gets averaged out in unstratified eval.
- Domain / vertical: if you have domain-specific traffic (legal, medical, code), make each domain a slice.
- Language / locale: if multilingual, include each language with enough examples to compute a per-language metric.
- Adversarial: prompt-injection attempts, jailbreaks, intentional ambiguity. You will not sample these uniformly from production.
The "rare but important" tail is where models fail in production and lose customers. A naive uniform sample will not catch it. Build a stratified set with explicit per-slice quotas.
3.3 Provenance: real beats synthetic beats hand-crafted¶
In rough order of preference:
- Real production traffic, sampled and anonymized. This is the source of truth about what users actually do. Sampling strategies: uniform random, importance-weighted toward rare intents, error-driven (cases where the system was unsure or got negative feedback).
- Synthetic data from a strong model. Useful to expand coverage cheaply, especially in the tail. Risks: distributional mismatch with real traffic, judge-model bias if the same family generates and judges.
- Hand-crafted by domain experts. Highest per-example quality and intent precision; highest per-example cost. Best for the hard slice and for adversarial cases that synthetic generation will not invent.
Most mature eval sets are a mix: ~60% production-sampled, ~30% hand-crafted (especially the hard and adversarial slices), ~10% synthetic for coverage gaps.
3.4 Anonymization is not optional¶
Production data carries PII. Before any data leaves a logged-in environment for eval use, run it through a redaction pipeline (regex + named-entity-recognition + LLM redaction for obscure identifiers). This is a compliance requirement in most jurisdictions and a reputational requirement everywhere else.
3.5 Versioning: SHA the file¶
Every eval result must be pinned to a dataset version. The simplest discipline:
- The eval set lives as a JSONL file in the repo (or in object storage with content-addressable keys).
- Compute SHA-256 of the file bytes.
- The eval run record stores `(dataset_sha, dataset_version_tag, model_id, prompt_sha, judge_id, judge_prompt_sha, timestamp, results)`.
- Two runs are comparable if and only if `dataset_sha` matches.
When you change the eval set (add examples, fix a label), bump the version tag and SHA. Old results stay valid as historical records on the old SHA; they simply are not directly comparable to runs on the new SHA. Re-run the baseline on the new SHA to bridge.
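A minimal sketch of the pinning discipline, assuming the eval set lives as a JSONL file; the field names mirror the run-record tuple above, and `results` is whatever your harness produces.

```python
import hashlib
from datetime import datetime, timezone

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def text_sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def make_run_record(dataset_path, dataset_version_tag, model_id,
                    prompt_text, judge_id, judge_prompt_text, results):
    return {
        "dataset_sha": file_sha256(dataset_path),
        "dataset_version_tag": dataset_version_tag,
        "model_id": model_id,
        "prompt_sha": text_sha256(prompt_text),
        "judge_id": judge_id,
        "judge_prompt_sha": text_sha256(judge_prompt_text),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }

def comparable(run_a: dict, run_b: dict) -> bool:
    # Two runs are directly comparable iff they were scored on the same dataset bytes.
    return run_a["dataset_sha"] == run_b["dataset_sha"]
```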
3.6 The rotating holdout¶
A classic ML pathology, recurring with vengeance in LLM work: model authors tune to the eval set, the eval set leaks into the iteration loop, and reported numbers stop predicting production performance. The mitigation is a rotating private holdout.
- Designate ~20% of the eval set as private. Model authors and prompt authors do not see these examples or their labels.
- Public eval results are reported on the public 80%; the private 20% is run periodically by a separate process (a release engineer, a CI job with restricted access) and reported as a sanity check.
- Every quarter, rotate: move some private examples to public (so authors learn from them) and pull new private examples (so leakage is bounded).
This is the same logic as a held-out test set in classical ML, adapted to a world where the "training" of the system is informal prompt iteration.
3.7 Schema¶
A well-formed eval example has at least:
{
"id": "evset-v1.3-00472",
"input": {"user_message": "...", "context": [...]},
"expected": {"answer": "...", "must_contain": ["..."], "must_not_contain": ["..."]},
"labels": {"intent": "billing", "difficulty": "hard", "length_bucket": "long", "locale": "en-US"},
"provenance": "production-2026-03-14",
"annotator": "human-A",
"annotated_at": "2026-03-20"
}
The labels field is what makes slice analysis possible. Skipping labels at creation time is a tax you pay forever after.
4. LLM-as-judge-the modern default¶
Reference-free eval at scale is dominated by LLM-as-judge. The pattern: a judge model reads the candidate response (and optionally the input, the rubric, and a reference answer) and emits a score or a preference. It is cheap relative to humans, fast, and-if calibrated-credible.
4.1 Variants¶
- Single-grade judge. Output a 0–10 (or 1–5) integer score on a rubric dimension. Simple to instrument; low resolution; sensitive to anchoring effects (judges cluster around 7).
- Pairwise judge. Given two candidates A and B for the same input, emit which is better (or "tie"). Higher signal per unit cost than single-grade because relative judgment is easier than absolute. The standard for model-vs-model comparison.
- Reference-augmented judge. The judge sees a ground-truth reference along with the candidate. Especially useful for faithfulness ("does the candidate agree with the reference on facts X, Y, Z?") and for tasks where a competent answer is hard to recognize without an example.
- Rubric-decomposed judge. Instead of one score, the judge produces sub-scores on a structured rubric (faithfulness, coverage, fluency, format) and an overall. Decomposition reduces ambiguity and enables slice analysis along the rubric dimensions.
- Chain-of-thought judge. Judge produces a short rationale before its score. Empirically improves agreement with humans, at higher token cost. Most production judges use it.
4.2 Documented biases-these are real¶
The literature on LLM-as-judge biases (notably Zheng et al.'s MT-Bench paper and follow-up work) consistently shows the following effects. Treat them as known hazards.
- Position bias. In pairwise comparisons, judges prefer the option presented first (or sometimes second, depending on the model family). Mitigation: randomize order per example; double-pass by running both A-first and B-first and averaging; report disagreement rate between the two passes as a noise floor.
- Length bias. Judges prefer longer answers, all else equal. Mitigation: explicit instruction that length is not a quality signal; length-normalize candidate responses before judging where appropriate; report scores conditioned on length bucket.
- Verbosity-as-quality bias. Judges reward confident-sounding language and structural cues (numbered lists, headers) even when correctness is identical or worse. Mitigation: explicit rubric language ("ignore confident tone if facts are wrong"); pair with a programmatic correctness check.
- Self-preference bias. Judge models score outputs from their own model family higher than outputs from other families, even when humans rate them equal. Mitigation: cross-family judging-when comparing model X to model Y, use a judge from family Z. When that is not possible, use multiple judges from different families and take the median.
- Format / parseability bias. Judges penalize outputs that disrupt their parsing (extra commentary, missing headers). This can be desirable or undesirable depending on whether format compliance is a real product requirement.
- Anchoring on irrelevant cues. Judges sometimes pick up on stylistic markers (markdown formatting, leading caveats) and conflate them with quality.
A judge prompt is therefore not a one-shot artifact; it is a small piece of software that encodes an opinion about quality and a defense against these biases.
4.3 A defensible judge prompt template¶
You are a strict, calibrated evaluator. Your job is to score the candidate
response against the rubric. You must follow the rubric mechanically and
not be swayed by length, confident tone, or formatting.
INPUT:
<the user's input verbatim>
CANDIDATE RESPONSE:
<the candidate verbatim>
REFERENCE (optional):
<a known-good answer, when available>
RUBRIC:
- faithfulness (0-3): are claims supported by the input/reference?
0 = contains a clearly false claim
1 = mostly correct but with one unsupported claim
2 = correct but with hedging that obscures meaning
3 = fully supported by input/reference
- coverage (0-3): does it address all parts of the user's question?
- fluency (0-2): is it readable and grammatical?
- format (0-2): does it match required format (JSON, length, citations)?
Important:
- A longer response is not automatically better. Penalize verbosity that
does not add information.
- Do not reward confident tone. Score only on factual support.
- If candidate disagrees with reference on a fact, the candidate is wrong.
OUTPUT FORMAT (JSON):
{
"rationale": "<2-3 sentences explaining the most important strengths and weaknesses>",
"scores": {"faithfulness": 0-3, "coverage": 0-3, "fluency": 0-2, "format": 0-2},
"overall": 0-10
}
Notes on this template:
- The rubric is decomposed: separate sub-scores. This forces the judge to think on each axis and gives you slice metrics for free.
- The instruction explicitly disclaims length and tone bias. This is not a guarantee the bias is gone, but it measurably reduces it.
- Output is structured JSON, which makes downstream parsing trivial and lets you handle parse failures as a separate signal.
- A short rationale precedes the scores. This is cheap chain-of-thought and improves agreement.
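A sketch of wiring the template into an automated harness. `call_llm` is a stand-in for whatever client you use (an assumption, not a real API); the point is the handling around the judge: fill the template, parse the JSON, and record a parse failure as its own signal rather than a silent zero.

```python
import json

# Hypothetical: the Section 4.3 template as a format string with {input}, {candidate}, {reference} slots.
JUDGE_TEMPLATE = "...rubric text from Section 4.3 with {input} / {candidate} / {reference} slots..."

def call_llm(prompt: str) -> str:
    """Placeholder for your model client; swap in your real API call."""
    raise NotImplementedError

def run_judge(user_input: str, candidate: str, reference: str = "") -> dict:
    prompt = JUDGE_TEMPLATE.format(input=user_input, candidate=candidate, reference=reference)
    raw = call_llm(prompt)
    try:
        parsed = json.loads(raw)
        return {"ok": True, "rationale": parsed.get("rationale", ""),
                "scores": parsed["scores"], "overall": parsed["overall"]}
    except (json.JSONDecodeError, KeyError):
        # Track parse failures separately; do not silently coerce them to a score of 0.
        return {"ok": False, "raw": raw}
```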
4.4 Calibration: the step that makes the judge trustworthy¶
A judge is a measurement instrument. An uncalibrated instrument is a number generator. The calibration procedure:
- Sample 50–200 examples from your eval set.
- Have two humans independently grade each on the same rubric. (Two humans, not one-you need to know whether humans agree with each other before you can ask whether the LLM agrees with humans.)
- Compute inter-human kappa (Section 6). If humans disagree a lot (κ < 0.4), the rubric is ambiguous; fix the rubric before going further.
- Run the judge on the same examples.
- Compute judge-vs-human kappa for each human, and judge-vs-consensus kappa.
- Decision rule:
- κ ≥ 0.6: judge is substantially aligned; usable for production eval, with periodic re-calibration.
- 0.4 ≤ κ < 0.6: moderate alignment; usable for relative comparisons (A-vs-B) but not for absolute thresholds.
- κ < 0.4: judge is unreliable on this rubric. Fix the prompt, the rubric, or the model.
This calibration is not a one-time event. Re-run it (a) when you change the judge model, (b) when you change the rubric, (c) on a quarterly cadence to detect drift, (d) when you suspect a regression.
4.5 Cost discipline for judges¶
Judge calls dominate eval cost because judges are usually larger / smarter models than candidates. Three levers:
- Cache by `(judge_model, judge_prompt_sha, input_sha, candidate_sha)`. Identical tuples produce identical scores; the cache is safe and very effective when iteration only changes prompts upstream of the judge.
- Sample for fast iteration. During prompt iteration, run the judge on a 100-example subset; reserve the full set for nightly / pre-merge runs.
- Use a smaller judge with a stronger rubric. A well-decomposed rubric on a mid-tier judge often matches a flat rubric on a top-tier judge at a third the cost.
5. Statistical power for eval¶
Most "model A is better than model B" claims in industry are statistically illiterate. Here is the arithmetic so yours are not.
5.1 The core question¶
Suppose your baseline accuracy is p = 0.70 and you want to be 95% confident that you can detect a true improvement of Δ = 0.01 (one percentage point). How many eval examples N do you need?
For a one-sample binomial proportion test, the standard-error-driven rule of thumb is:

N ≈ z² · p(1 - p) / Δ²

For 95% confidence (z ≈ 1.96, so z² ≈ 3.84 - round to 4 for the back-of-envelope rule):

N ≈ 4 · p(1 - p) / Δ²

Plug in p = 0.70, Δ = 0.01:

N ≈ 4 · 0.70 · 0.30 / 0.01² = 4 · 0.21 / 0.0001 = 8,400

Eight thousand four hundred examples to confidently detect a one-point delta. Most teams have 50–500. This is why "this prompt change moved accuracy from 71% to 72%" is, in almost every case, noise.
For Δ = 0.02:

N ≈ 4 · 0.21 / 0.02² = 4 · 0.21 / 0.0004 = 2,100

Two-point deltas are detectable with ~2k examples. Five-point deltas with a few hundred. The N-vs-Δ tradeoff is quadratic, which is the central reason small eval sets cannot adjudicate small changes.
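The same arithmetic as a helper, using the z ≈ 2 back-of-envelope constant; treat the output as order-of-magnitude guidance, not a formal power analysis.

```python
def rule_of_thumb_n(p: float, delta: float, z: float = 2.0) -> int:
    """N ≈ z² · p(1-p) / Δ² examples to detect a change of `delta` around accuracy p."""
    return round(z ** 2 * p * (1 - p) / delta ** 2)

# rule_of_thumb_n(0.70, 0.01) -> 8400
# rule_of_thumb_n(0.70, 0.02) -> 2100
# rule_of_thumb_n(0.70, 0.05) -> 336
```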
5.2 Paired comparison cuts N substantially¶
If you run the same eval set on both model A and model B, you have paired observations. The relevant test is now McNemar's test for paired binary outcomes, and the relevant variance is the variance of the disagreements between A and B, not the variance of each marginal accuracy.
Let:
- b = number of examples where A is right and B is wrong
- c = number of examples where A is wrong and B is right
- McNemar's statistic: χ² = (|b - c| - 1)² / (b + c), distributed approximately as χ² with 1 degree of freedom under H0 (no difference).
The key efficiency: paired tests need far fewer examples to detect the same Δ because per-example noise (some examples are easy, some are hard) is cancelled out. In practice, paired comparison roughly halves the required N relative to the unpaired estimate.
Always pair your A/B comparisons-same eval set, same order, run both candidates, compare item-by-item. The unpaired estimate above is the conservative ceiling.
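A minimal McNemar check built from the b and c counts defined above; it compares the continuity-corrected statistic against the χ² critical value for one degree of freedom at α = 0.05 (≈ 3.84), so no stats library is needed.

```python
def mcnemar(results_a, results_b, critical=3.841):
    """results_a, results_b: equal-length lists of booleans (pass/fail per example), paired by example."""
    assert len(results_a) == len(results_b)
    b = sum(1 for a, bb in zip(results_a, results_b) if a and not bb)  # A right, B wrong
    c = sum(1 for a, bb in zip(results_a, results_b) if not a and bb)  # A wrong, B right
    if b + c == 0:
        return {"b": b, "c": c, "chi2": 0.0, "significant": False}  # the models never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return {"b": b, "c": c, "chi2": chi2, "significant": chi2 > critical}
```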
5.3 Bootstrap confidence intervals¶
When the metric is something other than a proportion (per-example LLM-as-judge score on 0–10, average ROUGE, F1), use the bootstrap.
Procedure (B = 1,000 typical):
Given metric m and N examples:
for b in 1..B:
sample N examples WITH REPLACEMENT
compute m_b on the sample
report (m, percentile_2.5(m_1..m_B), percentile_97.5(m_1..m_B))
The reported triple (point estimate, lower CI, upper CI) is what you compare across runs. If two runs' CIs overlap heavily, you have not detected a difference.
For paired metrics (delta of A vs B per example), bootstrap the deltas, not the marginals. The CI on the delta is what tells you whether to ship.
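A runnable version of the procedure, written for the paired case (per-example deltas between candidates A and B); the marginal case is the same loop over a single score list.

```python
import random

def bootstrap_ci(deltas, n_boot=1000, seed=0):
    """deltas: per-example (score_B - score_A), paired by example. Returns (mean, lo 2.5%, hi 97.5%)."""
    rng = random.Random(seed)
    n = len(deltas)
    point = sum(deltas) / n
    means = []
    for _ in range(n_boot):
        resample = [deltas[rng.randrange(n)] for _ in range(n)]  # sample N with replacement
        means.append(sum(resample) / n)
    means.sort()
    return point, means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# mean, lo, hi = bootstrap_ci(deltas)
# ship = lo > 0   # the CI on the delta excludes zero in the right direction
```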
5.4 Multiple comparisons¶
If you run 20 prompt variants and pick the best one, the "best" has an inflated metric estimate by chance alone. The classical fix is Bonferroni (divide α by the number of comparisons), or-more powerful-use a held-out set for the final winner-vs-baseline comparison after selecting on the dev set.
5.5 Practical N-rules to memorize¶
For binary metrics, with paired comparison and 95% confidence:
| Δ to detect | N needed (rough) |
|---|---|
| 10% | ~50–100 |
| 5% | ~200–400 |
| 2% | ~1,500–2,000 |
| 1% | ~6,000–8,000 |
Honest reporting includes the CI. "Model A: 0.72 [0.68, 0.76], Model B: 0.74 [0.70, 0.78]" tells the reader these are not distinguishable at this N.
6. Inter-rater agreement¶
You need to know how much human raters agree with each other before you can ask whether your judge agrees with humans, and you need to express that agreement in a way that controls for chance.
6.1 Cohen's kappa (two raters)¶
Raw agreement (p_o = fraction of items both raters labeled the same) is misleading because high agreement can occur by chance, especially with imbalanced label distributions. Cohen's kappa corrects for chance agreement.
Definitions:
- `p_o` = observed agreement = (number of items both raters scored the same) / N
- `p_e` = expected agreement under chance, computed from the marginals
For a binary label (pass/fail) with marginals:
- rater 1 fraction "pass" = p1
- rater 2 fraction "pass" = p2
- expected chance agreement on "pass" = p1 · p2
- expected chance agreement on "fail" = (1 - p1) · (1 - p2)
- p_e = p1·p2 + (1 - p1)·(1 - p2)
Then:

κ = (p_o - p_e) / (1 - p_e)
Interpretation (Landis & Koch, widely cited convention):
| κ | Interpretation |
|---|---|
| < 0 | Worse than chance |
| 0.00 – 0.20 | Slight |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Substantial |
| 0.81 – 1.00 | Almost perfect |
For LLM-as-judge calibration, you want κ ≥ 0.6 against humans. Below that, your judge is making decisions partly by coin flip.
6.2 Kappa from scratch (Python)¶
def cohens_kappa(rater_a, rater_b):
"""
rater_a, rater_b: equal-length sequences of categorical labels.
Returns Cohen's kappa.
"""
assert len(rater_a) == len(rater_b)
n = len(rater_a)
labels = sorted(set(rater_a) | set(rater_b))
# observed agreement
p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
# expected agreement under independence
p_e = 0.0
for L in labels:
p_a = sum(1 for x in rater_a if x == L) / n
p_b = sum(1 for x in rater_b if x == L) / n
p_e += p_a * p_b
if p_e == 1.0:
return 1.0 # both raters always agree on one label
return (p_o - p_e) / (1 - p_e)
Test it on a contrived case to build intuition: if rater A and rater B both say "pass" 90% of the time, raw agreement near 0.82 is achievable by chance, and kappa correctly punishes it. If both call "pass" 50% of the time and they agree 90% of the time, kappa is much higher because the chance baseline is only 50%.
6.3 Fleiss' kappa (more than two raters)¶
For k raters labeling each of N items into C categories, Fleiss' kappa generalizes Cohen's. The math:
For each item i, let n_ij = number of raters assigning category j. Define:
- per-item agreement: P_i = (1 / (k(k - 1))) · Σ_j n_ij (n_ij - 1)
- mean observed agreement: P_bar = (1/N) · Σ_i P_i
- per-category marginal: p_j = (1/(N·k)) · Σ_i n_ij
- expected agreement: P_e_bar = Σ_j p_j²

Then:

κ = (P_bar - P_e_bar) / (1 - P_e_bar)
Use Fleiss when you have 3+ human annotators per item (which you should, for the calibration set, if budget allows).
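A from-scratch Fleiss' kappa following the definitions above; `ratings` is an N×C matrix of counts n_ij (rows are items, columns are categories), each row summing to the same k raters.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who assigned item i to category j."""
    n_items = len(ratings)
    k = sum(ratings[0])  # raters per item; must be constant
    assert all(sum(row) == k for row in ratings)
    n_categories = len(ratings[0])
    # per-item agreement P_i and its mean
    p_items = [sum(n * (n - 1) for n in row) / (k * (k - 1)) for row in ratings]
    p_bar = sum(p_items) / n_items
    # per-category marginals p_j and chance agreement
    p_j = [sum(row[j] for row in ratings) / (n_items * k) for j in range(n_categories)]
    p_e_bar = sum(p ** 2 for p in p_j)
    if p_e_bar == 1.0:
        return 1.0
    return (p_bar - p_e_bar) / (1 - p_e_bar)
```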
6.4 Krippendorff's alpha¶
Krippendorff's α is more general: it handles missing data, any number of raters, and any measurement level (nominal, ordinal, interval, ratio) via a chosen distance function. It is the right choice for ordinal rubrics (0–3 faithfulness scores), where treating the labels as nominal would throw away the ordinal information. Most stats libraries implement it; you do not need to derive it from scratch, but you do need to know when it is appropriate.
6.5 What good calibration looks like¶
A defensible calibration report for an LLM judge contains:
- N (calibration set size, ≥ 50; ≥ 100 preferred).
- Per-rater marginals (how often each label was used).
- Pairwise inter-human kappa.
- Judge-vs-each-human kappa.
- Judge-vs-human-consensus kappa (consensus = majority vote).
- Confusion matrices (judge vs consensus) per rubric dimension.
- Subset analysis: kappa on the "easy" slice vs the "hard" slice. Judges often agree on easy items and diverge on hard ones; if so, you under-detect failures.
- A list of disagreement examples. Read them. They tell you what the judge is missing.
7. Eval-driven development workflow¶
Eval-driven development inverts the naive flow ("build, then measure"). The eval set comes first; everything else is hill-climbing on it.
7.1 The loop¶
- Define the metric before writing the prompt. What does success look like, in numbers?
- Build the v0 eval set (50 examples, hand-crafted). Include the hard slice from day one.
- Author the v0 prompt, run it, score it. The first number is usually bad.
- Iterate: change the prompt, re-run eval, commit results. The eval result becomes part of the commit message.
- Calibrate the judge (Section 4.4) when you stand up the judge, then quarterly.
- Promote to v1 (500 examples) when v0 stops surfacing useful failures.
- CI gate: a PR that regresses the metric beyond the noise floor blocks merge. Section 8 expands on what counts as a regression.
- Dashboards: per-metric trend over time, per-slice. The dashboard is the artifact senior stakeholders read.
- Production loop: failures from prod feed back into the eval set with a label distinguishing them from authored examples.
7.2 Commit-level discipline¶
Every commit that changes prompt, model, retrieval, or judge runs the eval. The eval result is logged with the commit SHA. After six months you have a time series of how each engineering change moved which metric. This is exactly the "engineer-as-scientist" stance the user's curriculum is building toward.
7.3 The "if you can't measure it, you can't improve it" operationalization¶
It is a cliché, and like most clichés it becomes useful when you make it concrete:
- A new feature ships only with an eval set covering it.
- A bug-fix ships only when the failing case has been added to the eval set with the correct label, and the fix changes its result from fail to pass.
- A model upgrade ships only after the regression report (Section 8) shows no per-slice regression beyond the noise floor.
- A judge change is a code change with its own calibration set and its own PR.
This is heavyweight. It is also the difference between a team that ships LLM features predictably and a team that ships and prays.
8. Regression detection¶
Aggregate metrics lie. A model that improves by 1% overall while regressing by 8% on the "billing" slice is shipping a billing outage. The discipline is to surface those slice-level regressions before they ship.
8.1 Per-example regression¶
For a paired eval (same examples, two model versions), classify each example into one of:
- stayed-pass: pass in both
- stayed-fail: fail in both
- flipped-pass-to-fail: pass before, fail now (regression)
- flipped-fail-to-pass: fail before, pass now (improvement)
Read the flipped-pass-to-fail list. Every item on it is a regression you should review individually before shipping. If there are 10 of them, it is feasible. If there are 100, you need slice analysis (next).
8.2 Per-slice regression and the average-tide trap¶
The "average-tide trap" is when the overall metric improves but a critical slice silently regresses. To detect it:
for slice in slices:
delta_slice = metric(model_new, slice) - metric(model_old, slice)
if delta_slice < -threshold and CI_excludes_zero:
FLAG
threshold is typically 5 percentage points for a major slice, 2 percentage points for the hard slice, 0 (any regression) for the safety / adversarial slice. The CI check uses a paired bootstrap on the slice (Section 5.3).
8.3 Noise-floor calibration¶
A metric that is itself noisy will hide small regressions and produce false alarms on phantom ones. Before you can declare a regression, you need the metric's noise floor. Procedure:
- Run the eval twice on the same model with no changes (different judge seeds, or just re-runs if the judge has temperature > 0).
- Compute the run-to-run delta on the metric and on each slice.
- The noise floor is roughly 2× the standard deviation of the run-to-run delta.
- Any change smaller than that is indistinguishable from noise.
Report the noise floor in your dashboard. A regression that is twice the noise floor is real; one that is half is not.
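A small helper for the procedure above: given per-slice metrics from repeated runs of the same unchanged system, estimate the per-slice noise floor as roughly twice the spread of the run-to-run delta (with only two runs, the single delta's magnitude is the best available estimate).

```python
from statistics import pstdev

def noise_floor(runs):
    """runs: list of per-slice metric dicts from repeated runs of the SAME system,
    e.g. [{"billing": 0.81, "technical": 0.76}, {"billing": 0.79, "technical": 0.77}]."""
    floors = {}
    for sl in runs[0]:
        deltas = [runs[i + 1][sl] - runs[i][sl] for i in range(len(runs) - 1)]
        spread = pstdev(deltas) if len(deltas) > 1 else abs(deltas[0])
        floors[sl] = 2 * spread
    return floors
```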
8.4 Sample regression-detection script¶
import json
def load(path):
with open(path) as f:
return {row["id"]: row for row in (json.loads(line) for line in f)}
def regressions(prev_path, curr_path, slice_key="intent",
threshold=0.05, score_field="overall"):
prev = load(prev_path)
curr = load(curr_path)
common = sorted(set(prev) & set(curr))
by_slice = {}
for ex_id in common:
sl = prev[ex_id]["labels"][slice_key]
by_slice.setdefault(sl, []).append(
(prev[ex_id][score_field], curr[ex_id][score_field])
)
findings = []
for sl, pairs in by_slice.items():
n = len(pairs)
prev_mean = sum(p for p, _ in pairs) / n
curr_mean = sum(c for _, c in pairs) / n
delta = curr_mean - prev_mean
if delta < -threshold:
findings.append((sl, n, prev_mean, curr_mean, delta))
findings.sort(key=lambda r: r[-1])
return findings
if __name__ == "__main__":
import sys
for sl, n, p, c, d in regressions(sys.argv[1], sys.argv[2]):
print(f"REGRESSION slice={sl} n={n} prev={p:.3f} curr={c:.3f} delta={d:+.3f}")
In production, swap the simple delta check for a paired-bootstrap CI to control false positives, and add per-example flip lists for the worst-regressing slices.
8.5 Silent regressions¶
The hardest regressions are the ones the metric does not see at all-for example, the model is now correct on the eval set but produces longer, more expensive responses. This is why eval is multi-dimensional: track latency, output length, cost, and refusal rate alongside the quality metrics. A "no regression on accuracy, +30% latency" change should not ship without explicit acceptance.
9. Online vs offline eval¶
Offline eval is what we have been discussing: a fixed set, deterministic, fast, reproducible, but limited to the inputs you anticipated. Online eval is on real traffic. They are complementary, not substitutes.
9.1 Offline eval¶
- Strengths: reproducible; cheap to re-run; supports tight CI loops; good for regression gates.
- Weaknesses: input distribution is whatever you put in the set; if production drifts, offline numbers stop predicting production behavior; impossible to measure delayed outcomes (user satisfaction, follow-up actions).
9.2 Online eval¶
- Live metrics on production traffic: click-through, conversion, follow-up rate, explicit thumbs-up/down, complaint rate, escalation rate.
- Strengths: measures the truth; covers the actual input distribution; captures emergent behaviors no eval set anticipated.
- Weaknesses: slow (hours to weeks for stat-sig); confounded by non-LLM changes; ethically constrained (you cannot expose real users to known-bad models); requires logging and feedback infrastructure.
9.3 Counterfactual eval (replay)¶
A bridge between offline and online. Procedure:
- Log production inputs (with user consent and PII redaction).
- Periodically (e.g., nightly), replay a sample of those inputs through a challenger model offline.
- Compare challenger output to the production model's output (which the user actually saw) using LLM-as-judge or programmatic checks.
- Promote the challenger if it wins by a meaningful margin on a representative sample.
Counterfactual eval gives you the input realism of online eval without exposing users to the challenger. The cost is that you cannot measure the user-side outcome (the user's reaction was conditioned on the production response, not the challenger's). For most quality dimensions this is acceptable; for outcome metrics it is not.
9.4 A/B testing¶
The gold standard when feasible. Allocate a small fraction of traffic (typically 1–10%) to the challenger; collect outcome metrics; declare a winner when the CI on the difference excludes zero. Required reading: any introductory experimentation textbook (Kohavi et al., Trustworthy Online Controlled Experiments).
For LLM features specifically:
- Sample size for binary metrics, normal-approximation derivation, two-sided 95% / 80% power:

  N ≈ 16 · p(1 - p) / Δ² per arm

  For p = 0.10, Δ = 0.01: N ≈ 16 · 0.09 / 0.0001 = 14,400 per arm. Note the "16" is the canonical rule (≈ 2·(z_{α/2} + z_β)² with z values for 95%/80%); some textbooks render it as ~21 with different power assumptions. Use the actual normal-approximation calculation for any decision that costs real money.
- Sequential testing. The naive "peek every day and stop when significant" inflates Type I error grossly. Use formal sequential designs (Pocock, O'Brien-Fleming) or always-valid p-values (mSPRT, e-values) to allow continuous monitoring without p-hacking.
- Bayesian A/B. Place a prior over the effect size, update with data, decide based on the posterior probability that the challenger is better. Often more interpretable for stakeholders than a frequentist p-value, and natively supports "we are 92% sure this is positive-should we ship?" conversations.
- Guard rails. Pre-register the metrics that block shipping even if the headline metric is positive. "Latency must not increase by more than 100ms" or "refusal rate must not increase by more than 1pp." A 1% lift on the headline metric is not worth a 5% lift on user complaints.
10. Eval for specific tasks¶
Different tasks fail in different ways. Each has its own metric stack.
10.1 Classification¶
Standard ML metrics with a few LLM-flavored caveats.
- Accuracy: fraction correct. Adequate when classes are balanced.
- Precision / Recall / F1 per class: essential when classes are imbalanced. Macro-F1 (unweighted average across classes) protects the rare classes; micro-F1 weights by frequency.
- Confusion matrix: read it. The off-diagonal entries are the failure modes.
- Calibration: for classifiers that output a probability, do the predicted probabilities match observed frequencies? Compute reliability diagrams; report Expected Calibration Error (ECE). LLMs that emit confidence words ("definitely", "probably") tend to be overconfident; explicit calibration is a separate eval dimension.
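A minimal expected-calibration-error sketch for the calibration point above: bin predictions by stated confidence and compare each bin's average confidence against its observed accuracy.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probabilities in [0, 1]; correct: booleans, aligned by index."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece
```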
10.2 Summarization¶
- ROUGE-1/2/L: poorly correlated with human judgment on modern summaries; report only as a cheap regression signal.
- BERTScore: noticeably better correlation than ROUGE; still imperfect.
- LLM-as-judge with rubric: the modern default. Decompose into:
- Faithfulness: every claim in the summary is supported by the source. Crucial; this is where hallucination shows up.
- Coverage: the summary captures the source's key points (use a checklist if the source has discrete points).
- Conciseness: information density per token.
- Fluency: readability.
- Reference-augmented BERTScore: when a gold summary exists, BERTScore against it gives a deterministic signal alongside the judge.
10.3 RAG¶
The RAGAS framework (Es et al., 2023) decomposes RAG eval into four metrics that should be in every RAG eval suite:
- Faithfulness: are the answer's claims supported by the retrieved context? (Reference-free; often LLM-as-judge.)
- Answer relevance: does the answer address the user's question? (Reference-free.)
- Context precision: of the retrieved chunks, what fraction are relevant? (Reference-based against gold relevance labels.)
- Context recall: of the relevant chunks (according to gold), what fraction were retrieved?
A weak retrieval will tank context precision/recall; a weak generator will tank faithfulness even with good retrieval. Decomposing lets you fix the right component.
Add to the RAGAS core:
- Citation quality: if your system emits citations, are they correctly attributed?
- Refusal rate: when the answer is not in the context, does the system say so instead of confabulating? This is the "I don't know" eval; it requires examples where the right answer is "I cannot answer from the provided sources."
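Context precision and recall reduce to set arithmetic once you have gold relevance labels per chunk. A definitional sketch (the RAGAS library uses its own estimators, so this is the concept, not that implementation):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant (per the gold labels)."""
    retrieved = set(retrieved_ids)
    return len(retrieved & set(relevant_ids)) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that the retriever actually returned."""
    relevant = set(relevant_ids)
    return len(set(retrieved_ids) & relevant) / len(relevant) if relevant else 1.0
```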
10.4 Agents¶
Agents need both outcome and trajectory eval.
- Outcome: task success rate. Did the agent achieve the goal?
- Trajectory:
- tool-call validity rate (what fraction of tool calls were syntactically/semantically valid)
- tool-selection appropriateness (did it pick the right tool at each step; LLM-as-judge against trajectory)
- number of steps to solution (efficiency)
- error-recovery rate (when a tool failed, did the agent recover sensibly)
- cost per task (tokens × price, summed over the trajectory)
Inspect AI (UK AISI) treats trajectory as a first-class structure (samples have message histories with tool calls and results; scorers can run over the full trajectory). For agent work, this matters a lot more than for chat eval.
10.5 Code generation¶
The standard metric is pass@k. Definitions, following the HumanEval paper (Chen et al., 2021):
- For each problem, generate `n` independent samples.
- Let `c` of them pass the unit tests.
- The unbiased estimator of pass@k is:

  pass@k = E over problems [ 1 - C(n - c, k) / C(n, k) ]

  where C(·,·) is the binomial coefficient and the expectation is over problems. The intuition: C(n - c, k) / C(n, k) is the probability that all k samples drawn (without replacement from the n) miss every correct one; one minus that is the probability that at least one of the k is correct.
Important corner cases:
- If `n - c < k`, then `C(n - c, k) = 0` and `pass@k = 1` for that problem (you cannot avoid drawing a correct one).
- The estimator requires `n ≥ k`. For `pass@1` you need at least one sample per problem; for `pass@5` you need at least 5.
- This is per-problem; the headline number is the mean across problems.

Worked numerical example: n = 20, c = 3, k = 5.

pass@5 = 1 - C(17, 5) / C(20, 5) = 1 - 6188 / 15504 ≈ 0.6008

So for that one problem, pass@5 ≈ 0.60. Average across all problems for the suite-level number.
Beyond pass@k:
- SWE-bench-style evaluation: did the model produce a patch that fixes a real GitHub issue and passes the project's test suite? This is closer to outcome eval and far harder than HumanEval-style.
- Test coverage of the generated code: generating code that passes hand-picked tests is one thing; generating code with reasonable robustness is another.
10.6 Open-ended generation¶
For chat / instruction-following / creative writing, the standard is MT-Bench-style evaluation: a fixed set of multi-turn prompts, an LLM-as-judge with a rubric, and either single-grade or pairwise scoring. Aggregate to a leaderboard. This is reference-free and rubric-driven; everything in Section 4 applies.
11. Eval of the judge¶
This is the recursive step that everyone wants to skip and no serious team does. The judge is itself a measurement instrument; it can drift, it can be miscalibrated, it can be silently broken. You must evaluate it.
11.1 The eval-of-eval set¶
Build a small, gold-standard set of (input, candidate, human-consensus-score) triples, where human consensus is from at least two and ideally three independent annotators with documented inter-rater kappa.
Size: 100–300 examples, stratified by rubric dimension. This is small relative to the main eval set because the cost is human-rater time, which dominates.
11.2 Tracked metrics for the judge¶
- Judge-vs-human kappa, overall and per rubric dimension.
- Judge bias measurements, run as designed experiments:
- Position bias: in pairwise mode, present the same pair as both (A, B) and (B, A); fraction of cases where the judge's preference flips quantifies the bias.
- Length bias: generate length-controlled pairs (same content, different length) and measure the judge's preference for the longer version.
- Self-preference: generate pairs from different model families on the same input; check whether the judge from family X over-prefers the candidate from family X.
- Judge confidence vs accuracy: if the judge emits confidence-like signals, are they calibrated against ground truth?
- Drift: the same calibration set, re-run quarterly, against the same judge model. Drift > some threshold triggers re-calibration of the prompt or migration to a new judge model.
11.3 Judge prompt versioning¶
Judge prompt is a code artifact, versioned in git. Any change to the prompt is a PR with re-calibration results attached. Old eval results stay valid only on the old judge prompt SHA; cross-prompt comparisons require running the baseline on the new judge prompt.
11.4 Multi-judge ensembles¶
When stakes are high (final ship/no-ship decision, public-facing leaderboards), use an ensemble: 3 judges from different model families, score = median. Disagreement among judges is itself a signal-examples where judges disagree are usually genuinely ambiguous and worth human review.
12. Tool landscape¶
The eval-tooling ecosystem is moving fast. The mappings below are general and approximate; specific feature claims are version-dependent and you should verify against current docs before adoption.
- Inspect AI (UK AISI). Open-source Python framework, originally built for AI safety evaluation. First-class concepts include `Sample`, `Solver`, `Scorer`, message histories with tool calls. Strong support for trajectory-level scoring. Free, well-engineered, used in safety-critical work. The right default for agent eval and for any setting where you need fine-grained control over the eval pipeline.
- Braintrust. Managed eval platform with strong UX for dataset management, judge prompt iteration, and experiment comparison. Pricing scales with volume; can become significant for large eval sets. Picks itself when team velocity matters more than cost minimization.
- Langfuse / LangSmith. Tracing-first observability platforms with eval as a feature. Strong fit when your primary problem is "we need to see what's happening in production" and eval is a downstream capability you want integrated. Less specialized for eval-only workflows than Braintrust; better for teams already invested in their tracing.
- OpenAI Evals. Open-source benchmark registry with a YAML-driven definition format. Originally tied to OpenAI's models but increasingly model-agnostic. Good for running standardized benchmarks reproducibly; less well-suited for custom production eval.
- Promptfoo. Lightweight, opinionated CLI/config-driven eval. Great for small teams that want to add eval to a CI pipeline quickly. Limited at scale and for agent / trajectory eval.
- RAGAS. Python library implementing the RAGAS metrics (faithfulness, answer relevance, context precision/recall) plus an extensible scorer set. Specialized for RAG; pairs well with general-purpose frameworks (Inspect, Promptfoo) that orchestrate the runs.
- HELM / EleutherAI lm-eval-harness / others. Academic / standardized benchmark suites; valuable for research-style evaluation against published benchmarks. Less used for product eval.
12.1 When to pick which¶
- "We have an agent and need trajectory eval" → Inspect AI.
- "We have a chat product, want a managed UX, willing to pay" → Braintrust (or LangSmith if already tracing there).
- "We have a RAG pipeline and want the four core RAG metrics tomorrow" → RAGAS for the metrics, orchestrated under Inspect or Promptfoo.
- "We want CI-gated eval on a small team, minimal complexity" → Promptfoo for the gate, custom scripts for the rich metrics.
- "We want to run published benchmarks reproducibly" → lm-eval-harness or OpenAI Evals.
The non-decision: never build your own eval framework from scratch as your first step. The libraries above absorb six months of common-case engineering. Build a custom layer only when you have outgrown them on a specific axis.
13. The eval-set lifecycle¶
The eval set is software, not data. It versions, it deploys, it ages out.
13.1 v0-bootstrap (week 1)¶
20–50 hand-crafted examples. Goal: enable iteration. Composition:
- 60% representative-of-target-traffic.
- 20% known-hard cases (multi-hop, ambiguous, adversarial).
- 20% format/safety cases (does the system refuse properly; does the format check pass).
Authored by the team lead and one domain expert in a single afternoon. Stored in eval/v0.jsonl; SHA recorded.
13.2 v1-confident measurement (month 1–3)¶
500 examples. Composition:
- 50% sampled from production logs (anonymized).
- 30% hand-crafted to fill known coverage gaps and tail cases.
- 20% synthetic for breadth.
Stratified labels mandatory. Holdout 20% (Section 3.6). Now eligible to be a CI gate.
13.3 v2+-mature (month 3 onward)¶
2000+ examples, growing continuously by feeding production failures back in. Discipline:
- Each production failure that a customer reports is added to the eval set (with permission / anonymization).
- Each customer-impacting incident creates 5–20 eval examples that would have caught the regression.
- Quarterly review: deduplicate, retire stale examples, rotate holdout.
13.4 Versioning ritual¶
eval/
v0_2026-01-12_a3f9e1.jsonl # tag_date_sha8
v1_2026-02-28_b71d04.jsonl
v2_2026-04-10_e12fa7.jsonl
current -> v2_2026-04-10_e12fa7.jsonl # symlink
CHANGELOG.md # what changed and why
Tag every run with the eval-set filename. Cross-version comparisons require re-running the baseline on the new version.
14. A/B testing for LLM features (deeper)¶
Online A/B testing on LLM features adds wrinkles classical A/B tests do not have.
14.1 Sample size¶
For a binary metric (e.g., conversion rate p = 0.10), the standard formula derived from the normal approximation, with two-sided α = 0.05 and power 1 - β = 0.80, is:

N = 2 · (z_{α/2} + z_β)² · p(1 - p) / Δ² per arm

With z_{0.025} ≈ 1.96, z_{0.20} ≈ 0.84, (1.96 + 0.84)² ≈ 7.84, so 2·7.84 ≈ 15.7 ≈ 16:

N ≈ 16 · p(1 - p) / Δ² per arm

For p = 0.10, Δ = 0.01: ~14,400 per arm. For continuous metrics (revenue per user), use the variance instead of p(1-p).
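The same formula as a helper, using `statistics.NormalDist` so the z-values follow directly from α and power; note the exact constant (~15.7) sits slightly below the "16" shortcut.

```python
from statistics import NormalDist

def per_arm_n(p: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """N per arm ≈ 2·(z_{α/2} + z_β)² · p(1-p) / Δ² (normal approximation, binary metric)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ≈ 1.96 for α = 0.05
    z_beta = z.inv_cdf(power)           # ≈ 0.84 for 80% power
    return round(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)

# per_arm_n(0.10, 0.01) -> ~14,100 (the "16" shortcut rounds this up to 14,400)
```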
14.2 Sequential testing¶
Naïve repeated peeking inflates the false-positive rate. Three options:
- Pre-registered fixed-horizon test. Decide N up front, do not peek (or only peek for safety not for stat-sig). Easiest to defend.
- Group sequential (Pocock, O'Brien-Fleming). Pre-specify checkpoints; spend α at each according to a boundary that controls overall α.
- Always-valid p-values (mSPRT, e-values). Modern; allows continuous monitoring with valid Type I error control. Higher math overhead, lower planning overhead.
14.3 Bayesian A/B¶
Posterior over the effect size given a prior and observed data. Decision: ship if P(challenger > baseline | data) > 0.95 and the magnitude is meaningful. Natural fit when stakeholders want continuous, intuitive read-outs.
14.4 Guard rails¶
Pre-declare every metric whose regression blocks shipping, even if the headline metric wins. For LLM features, common guard rails:
- p95 latency.
- Cost per request.
- Refusal rate / non-response rate.
- Safety-eval pass rate (toxicity, jailbreak resistance).
- Per-segment quality on a critical slice (e.g., regulated-industry traffic).
A challenger that wins headline by 1pp and regresses guard-rail safety by 1pp does not ship.
14.5 Novelty effect and seasonality¶
A new model often shows a temporary lift from novelty (users explore, click more) that fades. Run experiments for at least two weeks, ideally over a full week-cycle, to detect this. Compare first-week and steady-state effect; if they diverge, the steady-state is the one to ship on.
15. The hidden costs of eval¶
Eval is not free. Treat it as a budget line.
15.1 Per-run cost arithmetic¶
Cost per eval run, ignoring caching:
cost ≈ N · (T_in_cand · price_in_cand + T_out_cand · price_out_cand)
+ N · (T_in_judge · price_in_judge + T_out_judge · price_out_judge)
With N = 1,000, T_in_cand = 800 tokens, T_out_cand = 400, T_in_judge = 1500 (input + candidate + rubric), T_out_judge = 200, and current-day per-token prices, the math is straightforward and worth doing for your specific stack. For typical 2026 prices, expect $20–$200 per full run depending on judge size. Runs add up across iterations.
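The arithmetic as a helper; the prices in the usage comment are placeholders purely for illustration (not vendor quotes), so substitute your actual per-token rates.

```python
def eval_run_cost(n, cand_in, cand_out, judge_in, judge_out,
                  price_cand_in, price_cand_out, price_judge_in, price_judge_out):
    """Token counts are per example; prices are dollars per token. Returns dollars per full run."""
    candidate = n * (cand_in * price_cand_in + cand_out * price_cand_out)
    judge = n * (judge_in * price_judge_in + judge_out * price_judge_out)
    return candidate + judge

# Placeholder prices, illustration only:
# eval_run_cost(1000, 800, 400, 1500, 200,
#               3e-6, 15e-6, 5e-6, 15e-6)   # -> ~$19 per run
```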
15.2 Wallclock cost¶
A serial run of 1,000 examples at 5 seconds per example is 5,000 seconds-about 80 minutes. Parallelize aggressively (10–50 concurrent requests, respecting vendor rate limits). With 20× concurrency, the same run is under 5 minutes.
15.3 Caching¶
Cache by (model_id, prompt_sha, input_sha) for candidate calls and (judge_model, judge_prompt_sha, input_sha, candidate_sha) for judge calls. When you change only the prompt and not the model, candidate calls re-run but you can still cache anything upstream. This commonly cuts cost 5–10× during prompt-iteration sprints.
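A sketch of those cache keys, hashed so they stay fixed-length; the in-memory dict is a stand-in for whatever persistent cache you already run.

```python
import hashlib

def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def candidate_key(model_id, prompt, user_input):
    return _sha("|".join([model_id, _sha(prompt), _sha(user_input)]))

def judge_key(judge_model, judge_prompt, user_input, candidate):
    return _sha("|".join([judge_model, _sha(judge_prompt), _sha(user_input), _sha(candidate)]))

_cache = {}  # swap for a persistent store (disk, Redis) in practice

def cached(key, compute):
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]
```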
15.4 Subset sampling for fast iteration¶
During inner-loop iteration (a few-minute cycle), run on a 100-example subset. Promote to the full eval at the end of the day or in CI. Stratify the subset to match the full set's slice distribution; otherwise you will be iterating on the head and discovering tail regressions only at the end.
15.5 Human-rater cost¶
Calibration sets and eval-of-eval sets require human raters. Budget concretely: 100 examples × 2 raters × 2 minutes per example = 400 minutes ≈ 7 rater-hours. This is real work; build it into the project plan.
16. Eval anti-patterns¶
A non-exhaustive list of failure modes that recur across teams and recur across years.
- Vibe-checking. Running 5 example prompts, looking at the outputs, and shipping if they "look good." This is everyone's first eval and everyone's worst eval. Quantify or do not deploy.
- Eval-set leakage into training. Fine-tuning on data that overlaps the eval inputs. The model memorizes the answers; the metric goes up; production stays the same or gets worse. Mitigation: hold inputs of the rotating private set strictly out of any training data; SHA-match against your training corpus.
- Optimizing on the test set. The classical ML sin, recurring. If you tune the prompt 50 times against the eval set and ship the best variant, the reported metric is biased upward by selection. Mitigation: hold out a private set for the final ship/no-ship comparison; report on it separately.
- Ignoring the tail. Reporting only aggregate accuracy. The product fails 5% of users badly, and they churn. Aggregate is up, NPS is down. Mitigation: per-slice metrics; explicit hard-slice gate; failure-mode reading.
- Single-metric tunnel vision. Accuracy up; latency 2×; cost 3×; the team celebrates the accuracy. Net product impact is negative. Mitigation: report a metric vector; pre-declare guard rails; ship decisions are multi-dimensional.
- Uncalibrated judge. "Our LLM-as-judge says we improved 3pp." If kappa to humans is 0.3, the 3pp is noise. Mitigation: Section 4.4. No production judge without calibration.
- Over-specified eval. A 50-page rubric that no one applies consistently. Annotators disagree; humans disagree; the LLM judge memorizes the rubric format and ignores the content. Mitigation: short rubrics with concrete examples per score level; calibration; iteration on the rubric itself.
- One-shot eval set. Built once, never updated. Production drifts; eval set goes stale; the team flies blind without knowing it. Mitigation: lifecycle (Section 13).
- Comparing on different eval-set versions. "v3.2 of the model gets 0.81; v3.3 gets 0.83"-but the eval set was updated between runs. The 2pp is a dataset effect, not a model effect. Mitigation: SHA the dataset; explicit version pins on every reported number.
- Confusing offline gains for online gains. "Offline +5pp, but A/B test shows -1pp on engagement." Offline is a proxy. Mitigation: validate offline-online correlation periodically; treat offline metrics as evidence, not proof.
- Judge collusion. Generator and judge from the same model family. Self-preference bias inflates scores. Mitigation: cross-family judges; multi-judge ensemble.
- No noise floor. A 1pp move is reported as a regression or improvement without context. Half the team's energy goes into chasing noise. Mitigation: measure run-to-run variance on an unchanged baseline and report it in every dashboard.
17. Practical exercises¶
Do every one of these. The grade is whether you can defend the answer in a hiring loop.
17.1 Implement Cohen's kappa from scratch¶
Goal: build kappa without numpy/sklearn, then verify against a library implementation.
Provided dataset format (100 examples, two raters, binary labels):
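The original format block is not reproduced here; a plausible JSONL layout consistent with the description (one line per example, two raters, binary pass/fail labels) would be:

```json
{"id": "ex-001", "rater_a": "pass", "rater_b": "pass"}
{"id": "ex-002", "rater_a": "pass", "rater_b": "fail"}
{"id": "ex-003", "rater_a": "fail", "rater_b": "fail"}
```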
Tasks:
- Implement the function in Section 6.2 from scratch.
- On the 100-example set, report `p_o`, `p_e`, `κ`.
- Verify against `sklearn.metrics.cohen_kappa_score`.
- Build intuition: construct three synthetic 100-example datasets:
  - A: both raters always say "pass" (agreement = 1.0, kappa = ?).
  - B: both raters say "pass" 50% of the time and agree on 90% of items (kappa = ?).
  - C: both raters say "pass" 95% of the time and agree on 92% of items (kappa = ?).
  Compute by hand and explain why C's kappa is lower than B's despite higher raw agreement.
Expected: A → kappa undefined (degenerate; the function should return 1.0 by the convention in Section 6.2 because p_e = 1); B → high kappa around 0.8; C → low kappa even though raw agreement is high, because the chance baseline is near 0.91-both raters almost always say "pass."
17.2 Statistical power calculation¶
Question: how many eval examples do you need to detect a 2-percentage-point improvement on a baseline accuracy of 75%, with 95% confidence?
Solve, both unpaired and paired:
- Unpaired (rule of thumb): N ≈ 4 · 0.75 · 0.25 / (0.02)² = 4 · 0.1875 / 0.0004 = 1,875. Round up to ~2,000.
- Paired (rough rule, ~half N): ~1,000 examples.
Now compute it more rigorously. For unpaired, the standard two-proportion sample size per group (normal approximation) is:
N ≈ (z_{1−α/2} · √(2 · p_bar · (1 − p_bar)) + z_{1−β} · √(p_A · (1 − p_A) + p_B · (1 − p_B)))² / (p_A − p_B)², where p_bar = (p_A + p_B)/2.
Plug in p_A = 0.75, p_B = 0.77, α = 0.05, β = 0.20, and compare the result against the rule of thumb. Explain the gap: the quick rule ignores the power requirement, so the rigorous per-group number comes out noticeably larger.
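A minimal sketch of that calculation, assuming the pooled-variance normal-approximation formula above and SciPy for the z-values:
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group(p_a, p_b, alpha=0.05, beta=0.20):
    # Two-proportion sample size per group under the normal approximation.
    z_a = norm.ppf(1 - alpha / 2)  # 1.96 for a two-sided 5% test
    z_b = norm.ppf(1 - beta)       # 0.84 for 80% power
    p_bar = (p_a + p_b) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_a - p_b) ** 2)

print(n_per_group(0.75, 0.77))  # roughly 7,000 per group: well above the quick estimate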
17.3 Author and calibrate an LLM-as-judge prompt for summarization faithfulness¶
Steps:
- Take 50 (source, summary) pairs from a public summarization dataset.
- Have two human raters score each summary's faithfulness on a 0–3 scale (or arrange paired-rater scoring with a labeling buddy).
- Compute inter-human kappa. If < 0.5, fix the rubric until raters agree more.
- Write an LLM-as-judge prompt (use the template in Section 4.3, reduced to faithfulness only).
- Run the judge on the same 50 pairs.
- Compute judge-vs-each-human and judge-vs-consensus kappa (a scoring sketch follows this exercise).
- Read the disagreements. For each, articulate whether the judge or the human is more defensible. This is the most instructive part of the exercise.
- Iterate on the rubric language until judge-vs-consensus kappa ≥ 0.6.
Deliverable: the final prompt, the kappa numbers across iterations, and a 200-word writeup of what changed in the prompt and why kappa moved.
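For the kappa computations in this exercise, sklearn is sufficient; a sketch with stand-in scores (quadratic weighting is a suggestion for the ordinal 0–3 scale, not something the chapter mandates):
from sklearn.metrics import cohen_kappa_score

# Stand-in ratings; in the exercise these are the 50 judge scores and the 50 human-consensus scores.
human_consensus = [3, 2, 0, 1, 3, 2, 1, 0, 2, 3]
judge_scores = [3, 2, 1, 1, 3, 1, 1, 0, 2, 3]

print(cohen_kappa_score(human_consensus, judge_scores))                       # plain kappa
print(cohen_kappa_score(human_consensus, judge_scores, weights="quadratic"))  # ordinal-aware kappa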
17.4 Build a regression-detection script¶
Use the script in Section 8.4 as a starting point. Extend it to:
- Compute paired-bootstrap CIs on the per-slice delta (a minimal helper is sketched after this exercise).
- Report only slices where the CI on the delta excludes zero (significant regressions).
- Output a markdown report listing the top 5 regressing slices with the worst flipping examples (pass → fail).
- Include a noise-floor estimate computed from two unchanged baseline runs.
Test it on synthetic data: prev run 1000 examples with 0.75 accuracy, current run with one slice deliberately regressed to 0.65. Confirm the script flags that slice and not the others.
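A minimal helper for the paired-bootstrap extension referenced in the list above (a sketch; the full script applies it per slice to per-example records):
import random

def paired_bootstrap_delta_ci(prev, curr, n_boot=2000, alpha=0.05, seed=0):
    # prev/curr are aligned per-example 0/1 scores for one slice; returns a CI on mean(curr) - mean(prev).
    assert len(prev) == len(curr)
    rng = random.Random(seed)
    n = len(prev)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(curr[i] - prev[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(alpha / 2 * n_boot)], deltas[int((1 - alpha / 2) * n_boot) - 1]

# Synthetic check: a slice that drops from 0.75 to 0.65 on 200 paired examples.
prev = [1] * 150 + [0] * 50
curr = [1] * 130 + [0] * 70
print(paired_bootstrap_delta_ci(prev, curr))  # the CI should sit below zero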
17.5 Compute pass@5 from raw n=20 sample correctness data¶
Given for each problem a count c of correct samples out of n = 20, compute pass@5 per problem and then averaged across the suite.
from math import comb
def pass_at_k(n, c, k):
    # 1 - C(n-c, k) / C(n, k): the chance that a random k-subset of the n samples contains at least one correct sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
problems = [
{"id": "p1", "n": 20, "c": 0}, # pass@5 = 0
{"id": "p2", "n": 20, "c": 3}, # pass@5 ≈ 0.6008
{"id": "p3", "n": 20, "c": 10}, # pass@5 ≈ 0.9837
{"id": "p4", "n": 20, "c": 20}, # pass@5 = 1
]
scores = [pass_at_k(p["n"], p["c"], 5) for p in problems]
suite_pass_at_5 = sum(scores) / len(scores)
print(scores, suite_pass_at_5)
Verify p2: C(17, 5) = 6188, C(20, 5) = 15504, 1 - 6188/15504 = 0.6008. Verify p3: C(10, 5) = 252, 1 - 252/15504 = 0.9837. The suite-level pass@5 here is (0 + 0.6008 + 0.9837 + 1)/4 = 0.6461.
Extend: write the same calculation for pass@1 and pass@10, and discuss how the per-problem variance behaves with k.
17.6 Capstone-Q4 incident-triage agent eval set¶
Design the full eval program for an agent that triages incoming SRE alerts and routes them to the right on-call team.
Deliverables:
- Intent and slice taxonomy. Enumerate the alert types you expect (e.g., service-down, latency-spike, cert-expiry, disk-full, auth-failure). Define slices on (service tier, alert type, time-of-day, false-positive likelihood). Justify the slice cuts.
- Schema. Define the JSONL schema for an eval example. Include alert text, environment context, expected routing target(s), expected severity, expected runbook tag(s), labels for slices, provenance, annotator. (An illustrative record follows this list.)
- Counts per slice. Target counts for v0, v1, v2. Explain how you balance frequency-weighted (mostly common alerts) vs uniform (also rare alerts). My recommendation: v1 has 500 examples allocated as ~70% production-frequency-weighted, ~30% deliberately tail-stratified.
- Metric stack. Define:
  - Outcome metric: routing precision and recall against the gold target team.
  - Trajectory metrics: appropriate tool calls, ordering, no excess tool calls.
  - Programmatic checks: severity field present and in the allowed set; runbook tag present.
  - LLM-as-judge metric: rationale quality on a 0–3 rubric (does the rationale correctly identify the symptom and propose a defensible next action).
- Cost / latency guard rails.
- Judge prompt. Author the rationale-quality judge using the Section 4.3 template, customized for triage rationales. State the rubric levels concretely.
- Calibration plan. Propose: 100 examples, 2 SRE annotators, target inter-human kappa ≥ 0.6 after one round of rubric refinement, target judge-vs-consensus kappa ≥ 0.6 before the judge is used in CI.
- CI gate. Define the merge-blocking conditions: any flipped pass-to-fail on the safety slice; per-major-slice routing precision regression > 5pp with the CI excluding zero; judge metric regression > 5pp on aggregate; latency p95 regression > 200 ms.
- Lifecycle plan. v0 hand-crafted in week 1 with 50 examples (built from real anonymized alerts). v1 in month 1 with 500 examples drawn from a month of production. Monthly addition of new failure cases. Quarterly rotating-holdout refresh.
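For the schema deliverable, one illustrative record, shown pretty-printed here but stored as a single JSONL line in practice (every field name and value is a placeholder to adapt, not a prescribed schema):
{
  "id": "alert-0042",
  "alert_text": "p99 latency > 2s on checkout-service for 10 minutes",
  "environment": {"service": "checkout-service", "tier": 1, "region": "eu-west-1"},
  "expected_routing": ["payments-oncall"],
  "expected_severity": "sev2",
  "expected_runbook_tags": ["latency-spike"],
  "slices": {"alert_type": "latency-spike", "service_tier": 1, "time_of_day": "off-hours", "false_positive_likelihood": "low"},
  "provenance": "production-export-month-1",
  "annotator": "sre-reviewer-1"
}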
If you can sit down and write this program, with the schema concrete and the metrics specific, you are operating at the level expected of an applied-AI engineer with eval as their headline specialty. That is the bar this chapter is training you toward.
Closing¶
The takeaways from this chapter compress to the following:
- Eval is the leverage point. Build it before the model.
- The golden dataset is software: stratified, versioned, SHA-pinned, lifecycle-managed.
- LLM-as-judge is the modern default-but only after calibration against humans (κ ≥ 0.6) and only with explicit defenses against position, length, verbosity, and self-preference biases.
- Statistical literacy is non-negotiable. Detecting a 1pp delta requires thousands of examples; honest reporting includes confidence intervals.
- Slice analysis catches regressions that aggregate metrics hide; the noise floor calibrates which deltas are real.
- Online and offline complement each other. Counterfactual replay bridges them; A/B is the gold standard but slow.
- Each task has a metric stack: classification, summarization, RAG, agents, code, open-ended generation. Pick the stack before you pick the metric.
- The judge itself must be evaluated, calibrated, and version-controlled.
- Tools (Inspect AI, Braintrust, LangSmith, RAGAS, Promptfoo) absorb common-case engineering. Use them; do not reinvent.
- Anti-patterns (vibe-checking, leakage, over-tuning, single-metric tunnel vision) are predictable and avoidable.
Eval done well is the most valuable thing an applied-AI engineer ships. It is the substrate every other improvement runs on. Master it, and the rest of the curriculum becomes hill-climbing on instruments you can trust.