
12-Evaluation Systems

Why this matters in the journey

Evals is the most undersupplied skill in 2026 AI engineering. Every team building with LLMs has an eval problem: "Is this prompt change actually better?" "Did the new model regress?" "Why does our agent fail on these examples?" Solving this (rigorously, at scale, with the right metrics) is your highest-leverage specialty given your observability background. The skill transfer from SLI/SLO design is direct.

The rungs

Rung 01-Why evals (and what's wrong with vibes)

  • What: Without evals, AI engineering is folklore. Prompt changes are decided by "felt better." Regressions ship silently.
  • Why it earns its place: Frame the problem before the tooling.
  • Resource: Hamel Husain-"Your AI product needs evals" (hamel.dev/blog/posts/evals/). Read this twice.
  • Done when: You can argue for an eval-first culture in your own words.

Rung 02-Building a golden dataset

  • What: A curated set of (input, expected output) pairs. Must be representative of production traffic. A set of 30–500 examples is usually enough to start; one possible file format is sketched after this list.
  • Why it earns its place: No golden set, no eval. Curating it is the first step that almost everyone skips.
  • Resource: Hamel's "How to create a high-quality eval set" posts.
  • Done when: You have a 50-example golden set for your project, with both common and edge-case examples.
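
A minimal sketch of one way to store and load a golden set, assuming a JSONL file; the field names (`input`, `expected`, `tags`) are illustrative, not a standard. Any format works as long as it is diffable and reviewable in PRs.

```python
# A possible golden-set layout: JSONL, one (input, expected output) pair per line.
# Field names ("input", "expected", "tags") are illustrative, not a standard.
import json
from pathlib import Path

EXAMPLES = [
    {"input": "Summarize: The deploy failed because the config was stale.",
     "expected": "Deploy failed due to stale config.",
     "tags": ["common"]},
    {"input": "Summarize: ",  # empty body: an edge case worth keeping
     "expected": "No content to summarize.",
     "tags": ["edge-case"]},
]


def save_golden_set(examples: list[dict], path: str = "golden.jsonl") -> None:
    """Write the golden set as JSONL so it can be diffed and reviewed in PRs."""
    with Path(path).open("w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")


def load_golden_set(path: str = "golden.jsonl") -> list[dict]:
    """Read the golden set back for an eval run."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]


if __name__ == "__main__":
    save_golden_set(EXAMPLES)
    print(len(load_golden_set()), "examples loaded")
```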

Rung 03-Deterministic and heuristic checks

  • What: Cheap, fast, automatic: regex matches, JSON validity, output length, refusal detection, format conformance.
  • Why it earns its place: Catch the easy bugs before reaching for LLM-as-judge. Cheap evals run on every change.
  • Resource: Hamel's eval blog series, plus the pytest-based eval pattern (write evals as tests; sketched after this list).
  • Done when: Your project has heuristic evals running in CI on every prompt change.
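
A sketch of the pytest-based pattern: heuristic checks written as ordinary tests so CI runs them on every prompt change. `call_model` is a placeholder for however your project invokes the LLM.

```python
# Heuristic checks written as ordinary pytest tests, so CI runs them on every
# prompt change. `call_model` is a placeholder for your actual LLM call.
import json
import re

import pytest


def call_model(prompt: str) -> str:
    """Replace with your project's model invocation."""
    raise NotImplementedError


# In practice, load these from the golden.jsonl built in Rung 02.
GOLDEN = [
    {"input": "Return the user as JSON with keys name and age: Ana, 34"},
]


@pytest.mark.parametrize("example", GOLDEN)
def test_output_is_valid_json(example):
    output = call_model(example["input"])
    json.loads(output)  # raises, and fails the test, if the output is not valid JSON


@pytest.mark.parametrize("example", GOLDEN)
def test_output_is_not_a_refusal(example):
    output = call_model(example["input"])
    assert not re.search(r"\b(I can't|I cannot|as an AI)\b", output, re.IGNORECASE)


@pytest.mark.parametrize("example", GOLDEN)
def test_output_length_is_bounded(example):
    output = call_model(example["input"])
    assert len(output) < 2_000  # catches runaway generations
```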

Rung 04-LLM-as-judge

  • What: Use a strong model to grade outputs against a rubric, via pairwise comparison or pointwise scoring (a pointwise sketch follows this list).
  • Why it earns its place: The dominant scalable eval method for open-ended outputs. Also the most easily abused.
  • Resource: Judging LLM-as-a-Judge paper (arxiv.org/abs/2306.05685). Plus Eugene Yan's posts on LLM-as-judge (eugeneyan.com).
  • Done when: You can write a clear rubric, prompt a judge model, and validate the judge against human labels.
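
A minimal pointwise-scoring sketch. The rubric, the 1–5 scale, and the `complete` callable are illustrative assumptions, not a fixed API; pairwise comparison follows the same shape with two answers in the prompt.

```python
# A pointwise LLM-as-judge sketch. The rubric, the 1-5 scale, and the
# `complete` callable are illustrative, not a fixed API.
import json

RUBRIC = """You are grading an answer to a customer-support question.
Score the ANSWER from 1 to 5 against this rubric:
5 = fully correct, uses only facts present in the CONTEXT, directly actionable
3 = partially correct, or missing an important step
1 = wrong, fabricated, or unhelpful
Return JSON: {"score": <int>, "reason": "<one sentence>"}"""


def judge(question: str, context: str, answer: str, complete) -> dict:
    """`complete(prompt) -> str` is any callable that returns the judge model's text."""
    prompt = (
        f"{RUBRIC}\n\nQUESTION:\n{question}\n\nCONTEXT:\n{context}\n\n"
        f"ANSWER:\n{answer}\n\nJSON:"
    )
    raw = complete(prompt)
    return json.loads(raw)  # in practice, validate the schema and retry on malformed JSON
```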

Rung 05-Validating the judge

  • What: Compute agreement (Cohen's kappa, Spearman correlation) between judge and human ratings (a short computation is sketched after this list). If agreement is poor, the judge is unreliable.
  • Why it earns its place: Without validating the judge, your eval is fiction. This is the most-skipped step.
  • Resource: AI Engineering (Huyen) chapters on eval. Plus Hamel's "Be skeptical of LLM-as-judge."
  • Done when: You've human-labeled 30+ examples and computed agreement with your judge.
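
A sketch of the agreement computation using scikit-learn and SciPy; the ratings below are made up and stand in for your human and judge scores on the same examples.

```python
# Agreement between judge scores and human labels, using scikit-learn and SciPy.
# The ratings below are made up; use your judge and human scores on the same examples.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

human = [5, 3, 1, 4, 2, 5, 3, 1, 4, 5]  # human ratings, one per example
judge = [5, 4, 1, 4, 2, 4, 3, 2, 4, 5]  # judge ratings for the same examples

# Cohen's kappa treats ratings as categories; quadratic weighting penalizes
# large disagreements more than off-by-one ones (useful for ordinal scales).
kappa = cohen_kappa_score(human, judge, weights="quadratic")

# Spearman correlation asks whether the judge ranks examples the way humans do.
rho, p_value = spearmanr(human, judge)

print(f"weighted kappa={kappa:.2f}, spearman rho={rho:.2f} (p={p_value:.3f})")
```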

Rung 06-Eval datasets and benchmarks

  • What: Public benchmarks: MMLU (knowledge), GSM8K / MATH (reasoning), HumanEval / SWE-bench (code), HELM, BIG-bench. Domain-specific datasets to mirror what matters.
  • Why it earns its place: Knowing the canonical benchmarks lets you read papers. Knowing their limitations lets you not be misled.
  • Resource: HELM paper (arxiv.org/abs/2211.09110) and website (crfm.stanford.edu/helm).
  • Done when: You can list the major benchmark categories and one limitation of each.

Rung 07-Eval harnesses

  • What: Tools that orchestrate evals: dataset, model, scorer, results store. Examples: Inspect AI (UK AISI), Braintrust, OpenAI evals, Promptfoo, lm-eval-harness (EleutherAI).
  • Why it earns its place: Rolling your own gets you started; switching to a harness gets you parallelism, caching, dashboards, and shareability.
  • Resource: Inspect AI docs (inspect.ai-safety-institute.org.uk), strongly recommended for their thoughtful design. Plus Braintrust docs.
  • Done when: You can run an Inspect AI eval against an LLM and view the report (a minimal task is sketched after this list).
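
A minimal Inspect AI task, assuming a recent `inspect_ai` release; argument names (e.g. `solver`) have shifted between versions and the model string is illustrative, so treat this as a sketch and confirm against the docs.

```python
# A minimal Inspect AI task. Argument names (e.g. solver) have shifted across
# inspect_ai releases, so treat this as a sketch and confirm against the docs.
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def tiny_arithmetic():
    return Task(
        dataset=[
            Sample(input="What is 7 * 6? Reply with only the number.", target="42"),
            Sample(input="What is 40 + 2? Reply with only the number.", target="42"),
        ],
        solver=generate(),  # single model call, no multi-step plan
        scorer=match(),     # string match against the target
    )


if __name__ == "__main__":
    # CLI equivalent: inspect eval this_file.py --model <provider/model>
    eval(tiny_arithmetic(), model="openai/gpt-4o-mini")  # model string is illustrative
```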

Rung 08-Regression testing for prompts

  • What: Treat prompt changes like code changes: a PR triggers the eval suite, blocking merge if scores regress (see the gate sketch after this list).
  • Why it earns its place: Production discipline. Prevents the "one-off improvement that broke five other things" pattern.
  • Resource: Promptfoo docs (promptfoo.dev). Plus Hamel's "eval-driven development" framing.
  • Done when: Your project has a CI step that runs evals on PRs and surfaces regressions.
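
A sketch of a CI regression gate that compares the current run's scores with a committed baseline and fails the build on a drop. The file names, metric names, and tolerance are illustrative choices, not a fixed convention.

```python
# A CI regression gate sketch: compare this run's eval scores with a committed
# baseline and exit non-zero if any metric drops beyond a tolerance.
# File names, metric names, and the tolerance are illustrative choices.
import json
import sys
from pathlib import Path

TOLERANCE = 0.02  # absorb small run-to-run noise; tune per metric


def load_scores(path: str) -> dict[str, float]:
    return json.loads(Path(path).read_text())


def main() -> int:
    baseline = load_scores("eval_baseline.json")  # e.g. {"accuracy": 0.86, "faithfulness": 0.91}
    current = load_scores("eval_current.json")    # produced by this PR's eval run
    regressions = {
        metric: (baseline[metric], current.get(metric, 0.0))
        for metric in baseline
        if current.get(metric, 0.0) < baseline[metric] - TOLERANCE
    }
    for metric, (old, new) in regressions.items():
        print(f"REGRESSION: {metric} dropped {old:.3f} -> {new:.3f}")
    return 1 if regressions else 0  # non-zero exit blocks the merge


if __name__ == "__main__":
    sys.exit(main())
```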

Rung 09-Online evals and production observability

  • What: Evals on real production traffic, not just curated sets. Sample, score (deterministically or with judge), aggregate. Detect drift.
  • Why it earns its place: Golden sets go stale; the real traffic distribution shifts. Online evals catch what offline can't.
  • Resource: Langfuse / LangSmith production eval guides. Plus the SLI/SLO mental model from your existing skill set, which applies directly.
  • Done when: You have a production sampler that scores 1% of real traffic and alerts on score drops (a sampler sketch follows this list).
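
A sketch of an online sampler: score roughly 1% of requests and alert when a rolling average drops. `score_fn` and `alert` are placeholders for your heuristics or judge and your alerting system; the rate, window, and threshold are illustrative.

```python
# An online-eval sampler sketch: score roughly 1% of production requests and
# alert when a rolling average drops. `score_fn` and `alert` are placeholders
# for your heuristics/judge and your alerting system.
import random
from collections import deque
from typing import Callable

SAMPLE_RATE = 0.01      # score ~1% of traffic
WINDOW = 200            # rolling window of sampled scores
ALERT_THRESHOLD = 0.75  # alert if the rolling mean falls below this

_recent_scores: deque = deque(maxlen=WINDOW)


def maybe_score(request: str, response: str,
                score_fn: Callable[[str, str], float],
                alert: Callable[[str], None]) -> None:
    """Call from the serving path, or from an async worker fed by request logs."""
    if random.random() >= SAMPLE_RATE:
        return
    score = score_fn(request, response)  # heuristic or LLM-as-judge score in [0, 1]
    _recent_scores.append(score)
    if len(_recent_scores) == WINDOW:
        mean = sum(_recent_scores) / WINDOW
        if mean < ALERT_THRESHOLD:
            alert(f"Online eval score drop: rolling mean {mean:.2f} over last {WINDOW} samples")
```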

Rung 10-Specialized eval domains

  • What: Faithfulness for RAG. Trajectory + outcome for agents. Code execution for coding tasks. Factuality for knowledge tasks. Safety/harmlessness. (A RAG faithfulness sketch follows this list.)
  • Why it earns its place: Each domain has its own metric vocabulary. Mastery is per-domain.
  • Resource: RAGAS for RAG. SWE-bench eval methodology for code agents. Anthropic's responsible scaling policy for safety evals.
  • Done when: You can pick an eval suite for a task type and justify the choice.
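
A sketch of a RAG faithfulness check in the spirit of RAGAS: decompose the answer into claims, ask a judge whether each claim is supported by the retrieved context, and report the supported fraction. The prompts and the `complete` callable are illustrative; RAGAS itself packages this with more rigor.

```python
# A RAG faithfulness sketch in the spirit of RAGAS: split the answer into claims,
# ask a judge whether each claim is supported by the retrieved context, and
# report the supported fraction. Prompts and `complete` are illustrative.
def faithfulness(answer: str, context: str, complete) -> float:
    """`complete(prompt) -> str` is any text-completion callable for a judge model."""
    claims_prompt = (
        "List each factual claim in the following answer, one per line:\n\n"
        f"{answer}\n\nClaims:"
    )
    claims = [c.strip("- ").strip() for c in complete(claims_prompt).splitlines() if c.strip()]
    if not claims:
        return 1.0  # nothing asserted, nothing unfaithful

    supported = 0
    for claim in claims:
        verdict_prompt = (
            "Answer strictly YES or NO. Is the claim supported by the context?\n\n"
            f"CONTEXT:\n{context}\n\nCLAIM:\n{claim}\n\nAnswer:"
        )
        if complete(verdict_prompt).strip().upper().startswith("YES"):
            supported += 1
    return supported / len(claims)
```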

Rung 11-Building an eval framework (Q3 Track A capstone)

  • What: Open-source a focused eval tool, perhaps for agent trajectories, for RAG faithfulness, or for a specific domain underserved by existing tools.
  • Why it earns its place: This is your specialty made visible. Few engineers ship credible eval frameworks; doing so makes you a recognized practitioner.
  • Resource: Read the source of Inspect AI, Braintrust, and OpenAI evals. Identify a real gap. Build for that gap.
  • Done when: Public repo with: README, example eval, comparison against an existing tool, blog post.

Minimum required to leave this sequence

  • 50-example golden dataset for a real task.
  • Heuristic evals running in CI.
  • LLM-as-judge with a written rubric, validated against human labels.
  • Inspect AI or equivalent harness running an eval suite.
  • Regression check on prompt changes via CI.
  • Online sampler scoring production traffic.

Going further

  • Hamel Husain's eval blog series (hamel.dev), the entire archive.
  • Eugene Yan's eval posts (eugeneyan.com).
  • Anthropic's evals documentation + responsible scaling policy.
  • UK AISI Inspect AI-read the source code.

How this sequence connects to the year

  • Month 6: Rungs 01–05 are core to that month's eval harness build.
  • Q3 Track A: This sequence is your specialty if you pick evals.
  • Q4 capstone: A specialized eval framework is the recommended capstone artifact.
