
12-Evaluation Systems

Why this matters in the journey

Evals is the most undersupplied skill in 2026 AI engineering. Every team building with LLMs has an eval problem: "Is this prompt change actually better?" "Did the new model regress?" "Why does our agent fail on these examples?" Solving this (rigorously, at scale, with the right metrics) is your highest-leverage specialty given your observability background. The skill transfer from SLI/SLO design is direct.

The rungs

Rung 01-Why evals (and what's wrong with vibes)

  • What: Without evals, AI engineering is folklore. Prompt changes are decided by "felt better." Regressions ship silently.
  • Why it earns its place: Frame the problem before the tooling.
  • Resource: Hamel Husain-"Your AI product needs evals" (hamel.dev/blog/posts/evals/). Read this twice.
  • Done when: You can argue for an eval-first culture in your own words.

Rung 02-Building a golden dataset

  • What: A curated set of (input, expected output) pairs. Must be representative of production traffic. A set of 30–500 examples is usually enough to start; one possible file format is sketched after this list.
  • Why it earns its place: No golden set, no eval. Curating it is the first step that almost everyone skips.
  • Resource: Hamel's "How to create a high-quality eval set" posts.
  • Done when: You have a 50-example golden set for your project, with both common and edge-case examples.
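
A minimal sketch of one way to store and load a golden set, assuming a JSONL file; the field names (`input`, `expected`, `tags`) are illustrative, not a standard. Any format works as long as it is diffable and reviewable in PRs.

```python
# A possible golden-set layout: JSONL, one (input, expected output) pair per line.
# Field names ("input", "expected", "tags") are illustrative, not a standard.
import json
from pathlib import Path

EXAMPLES = [
    {"input": "Summarize: The deploy failed because the config was stale.",
     "expected": "Deploy failed due to stale config.",
     "tags": ["common"]},
    {"input": "Summarize: ",  # empty body: an edge case worth keeping
     "expected": "No content to summarize.",
     "tags": ["edge-case"]},
]


def save_golden_set(examples: list[dict], path: str = "golden.jsonl") -> None:
    """Write the golden set as JSONL so it can be diffed and reviewed in PRs."""
    with Path(path).open("w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")


def load_golden_set(path: str = "golden.jsonl") -> list[dict]:
    """Read the golden set back for an eval run."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]


if __name__ == "__main__":
    save_golden_set(EXAMPLES)
    print(len(load_golden_set()), "examples loaded")
```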

Rung 03-Deterministic and heuristic checks

  • What: Cheap, fast, automatic: regex matches, JSON validity, output length, refusal detection, format conformance.
  • Why it earns its place: Catch the easy bugs before reaching for LLM-as-judge. Cheap evals run on every change.
  • Resource: Hamel's eval blog series, plus the pytest-based eval pattern (write evals as tests; sketched after this list).
  • Done when: Your project has heuristic evals running in CI on every prompt change.
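
A sketch of the pytest-based pattern: heuristic checks written as ordinary tests so CI runs them on every prompt change. `call_model` is a placeholder for however your project invokes the LLM.

```python
# Heuristic checks written as ordinary pytest tests, so CI runs them on every
# prompt change. `call_model` is a placeholder for your actual LLM call.
import json
import re

import pytest


def call_model(prompt: str) -> str:
    """Replace with your project's model invocation."""
    raise NotImplementedError


# In practice, load these from the golden.jsonl built in Rung 02.
GOLDEN = [
    {"input": "Return the user as JSON with keys name and age: Ana, 34"},
]


@pytest.mark.parametrize("example", GOLDEN)
def test_output_is_valid_json(example):
    output = call_model(example["input"])
    json.loads(output)  # raises, and fails the test, if the output is not valid JSON


@pytest.mark.parametrize("example", GOLDEN)
def test_output_is_not_a_refusal(example):
    output = call_model(example["input"])
    assert not re.search(r"\b(I can't|I cannot|as an AI)\b", output, re.IGNORECASE)


@pytest.mark.parametrize("example", GOLDEN)
def test_output_length_is_bounded(example):
    output = call_model(example["input"])
    assert len(output) < 2_000  # catches runaway generations
```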

Rung 04-LLM-as-judge

  • What: Use a strong model to grade outputs against a rubric, via pairwise comparison or pointwise scoring (a pointwise sketch follows this list).
  • Why it earns its place: The dominant scalable eval method for open-ended outputs. Also the most easily abused.
  • Resource: Judging LLM-as-a-Judge paper (arxiv.org/abs/2306.05685). Plus Eugene Yan's posts on LLM-as-judge (eugeneyan.com).
  • Done when: You can write a clear rubric, prompt a judge model, and validate the judge against human labels.
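
A minimal pointwise-scoring sketch. The rubric, the 1–5 scale, and the `complete` callable are illustrative assumptions, not a fixed API; pairwise comparison follows the same shape with two answers in the prompt.

```python
# A pointwise LLM-as-judge sketch. The rubric, the 1-5 scale, and the
# `complete` callable are illustrative, not a fixed API.
import json

RUBRIC = """You are grading an answer to a customer-support question.
Score the ANSWER from 1 to 5 against this rubric:
5 = fully correct, uses only facts present in the CONTEXT, directly actionable
3 = partially correct, or missing an important step
1 = wrong, fabricated, or unhelpful
Return JSON: {"score": <int>, "reason": "<one sentence>"}"""


def judge(question: str, context: str, answer: str, complete) -> dict:
    """`complete(prompt) -> str` is any callable that returns the judge model's text."""
    prompt = (
        f"{RUBRIC}\n\nQUESTION:\n{question}\n\nCONTEXT:\n{context}\n\n"
        f"ANSWER:\n{answer}\n\nJSON:"
    )
    raw = complete(prompt)
    return json.loads(raw)  # in practice, validate the schema and retry on malformed JSON
```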

Rung 05-Validating the judge

  • What: Compute agreement (Cohen's kappa, Spearman correlation) between judge and human ratings (a short computation is sketched after this list). If agreement is poor, the judge is unreliable.
  • Why it earns its place: Without validating the judge, your eval is fiction. This is the most-skipped step.
  • Resource: AI Engineering (Huyen) chapters on eval. Plus Hamel's "Be skeptical of LLM-as-judge."
  • Done when: You've human-labeled 30+ examples and computed agreement with your judge.
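
A sketch of the agreement computation using scikit-learn and SciPy; the ratings below are made up and stand in for your human and judge scores on the same examples.

```python
# Agreement between judge scores and human labels, using scikit-learn and SciPy.
# The ratings below are made up; use your judge and human scores on the same examples.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

human = [5, 3, 1, 4, 2, 5, 3, 1, 4, 5]  # human ratings, one per example
judge = [5, 4, 1, 4, 2, 4, 3, 2, 4, 5]  # judge ratings for the same examples

# Cohen's kappa treats ratings as categories; quadratic weighting penalizes
# large disagreements more than off-by-one ones (useful for ordinal scales).
kappa = cohen_kappa_score(human, judge, weights="quadratic")

# Spearman correlation asks whether the judge ranks examples the way humans do.
rho, p_value = spearmanr(human, judge)

print(f"weighted kappa={kappa:.2f}, spearman rho={rho:.2f} (p={p_value:.3f})")
```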

Rung 06-Eval datasets and benchmarks

  • What: Public benchmarks: MMLU (knowledge), GSM8K / MATH (reasoning), HumanEval / SWE-bench (code), HELM, BIG-bench. Domain-specific datasets to mirror what matters.
  • Why it earns its place: Knowing the canonical benchmarks lets you read papers. Knowing their limitations lets you not be misled.
  • Resource: HELM paper (arxiv.org/abs/2211.09110) and website (crfm.stanford.edu/helm).
  • Done when: You can list the major benchmark categories and one limitation of each.

Rung 07-Eval harnesses

  • What: Tools that orchestrate evals: dataset, model, scorer, results store. Examples: Inspect AI (UK AISI), Braintrust, OpenAI evals, Promptfoo, lm-eval-harness (EleutherAI).
  • Why it earns its place: Rolling your own gets you started; switching to a harness gets you parallelism, caching, dashboards, and shareability.
  • Resource: Inspect AI docs (inspect.ai-safety-institute.org.uk), strongly recommended for their thoughtful design. Plus Braintrust docs.
  • Done when: You can run an Inspect AI eval against an LLM and view the report (a minimal task is sketched after this list).
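
A minimal Inspect AI task, assuming a recent `inspect_ai` release; argument names (e.g. `solver`) have shifted between versions and the model string is illustrative, so treat this as a sketch and confirm against the docs.

```python
# A minimal Inspect AI task. Argument names (e.g. solver) have shifted across
# inspect_ai releases, so treat this as a sketch and confirm against the docs.
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def tiny_arithmetic():
    return Task(
        dataset=[
            Sample(input="What is 7 * 6? Reply with only the number.", target="42"),
            Sample(input="What is 40 + 2? Reply with only the number.", target="42"),
        ],
        solver=generate(),  # single model call, no multi-step plan
        scorer=match(),     # string match against the target
    )


if __name__ == "__main__":
    # CLI equivalent: inspect eval this_file.py --model <provider/model>
    eval(tiny_arithmetic(), model="openai/gpt-4o-mini")  # model string is illustrative
```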

Rung 08-Regression testing for prompts

  • What: Treat prompt changes like code changes: a PR triggers the eval suite, blocking merge if scores regress (see the gate sketch after this list).
  • Why it earns its place: Production discipline. Prevents the "one-off improvement that broke five other things" pattern.
  • Resource: Promptfoo docs (promptfoo.dev). Plus Hamel's "eval-driven development" framing.
  • Done when: Your project has a CI step that runs evals on PRs and surfaces regressions.
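
A sketch of a CI regression gate that compares the current run's scores with a committed baseline and fails the build on a drop. The file names, metric names, and tolerance are illustrative choices, not a fixed convention.

```python
# A CI regression gate sketch: compare this run's eval scores with a committed
# baseline and exit non-zero if any metric drops beyond a tolerance.
# File names, metric names, and the tolerance are illustrative choices.
import json
import sys
from pathlib import Path

TOLERANCE = 0.02  # absorb small run-to-run noise; tune per metric


def load_scores(path: str) -> dict[str, float]:
    return json.loads(Path(path).read_text())


def main() -> int:
    baseline = load_scores("eval_baseline.json")  # e.g. {"accuracy": 0.86, "faithfulness": 0.91}
    current = load_scores("eval_current.json")    # produced by this PR's eval run
    regressions = {
        metric: (baseline[metric], current.get(metric, 0.0))
        for metric in baseline
        if current.get(metric, 0.0) < baseline[metric] - TOLERANCE
    }
    for metric, (old, new) in regressions.items():
        print(f"REGRESSION: {metric} dropped {old:.3f} -> {new:.3f}")
    return 1 if regressions else 0  # non-zero exit blocks the merge


if __name__ == "__main__":
    sys.exit(main())
```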

Rung 09-Online evals and production observability

  • What: Evals on real production traffic, not just curated sets. Sample, score (deterministically or with judge), aggregate. Detect drift.
  • Why it earns its place: Golden sets go stale; the real traffic distribution shifts. Online evals catch what offline can't.
  • Resource: Langfuse / LangSmith production eval guides. Plus the SLI/SLO mental model from your existing skill set, which applies directly.
  • Done when: You have a production sampler that scores 1% of real traffic and alerts on score drops (a sampler sketch follows this list).
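
A sketch of an online sampler: score roughly 1% of requests and alert when a rolling average drops. `score_fn` and `alert` are placeholders for your heuristics or judge and your alerting system; the rate, window, and threshold are illustrative.

```python
# An online-eval sampler sketch: score roughly 1% of production requests and
# alert when a rolling average drops. `score_fn` and `alert` are placeholders
# for your heuristics/judge and your alerting system.
import random
from collections import deque
from typing import Callable

SAMPLE_RATE = 0.01      # score ~1% of traffic
WINDOW = 200            # rolling window of sampled scores
ALERT_THRESHOLD = 0.75  # alert if the rolling mean falls below this

_recent_scores: deque = deque(maxlen=WINDOW)


def maybe_score(request: str, response: str,
                score_fn: Callable[[str, str], float],
                alert: Callable[[str], None]) -> None:
    """Call from the serving path, or from an async worker fed by request logs."""
    if random.random() >= SAMPLE_RATE:
        return
    score = score_fn(request, response)  # heuristic or LLM-as-judge score in [0, 1]
    _recent_scores.append(score)
    if len(_recent_scores) == WINDOW:
        mean = sum(_recent_scores) / WINDOW
        if mean < ALERT_THRESHOLD:
            alert(f"Online eval score drop: rolling mean {mean:.2f} over last {WINDOW} samples")
```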

Rung 10-Specialized eval domains

  • What: Faithfulness for RAG. Trajectory + outcome for agents. Code execution for coding tasks. Factuality for knowledge tasks. Safety/harmlessness. (A RAG faithfulness sketch follows this list.)
  • Why it earns its place: Each domain has its own metric vocabulary. Mastery is per-domain.
  • Resource: RAGAS for RAG. SWE-bench eval methodology for code agents. Anthropic's responsible scaling policy for safety evals.
  • Done when: You can pick an eval suite for a task type and justify the choice.
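
A sketch of a RAG faithfulness check in the spirit of RAGAS: decompose the answer into claims, ask a judge whether each claim is supported by the retrieved context, and report the supported fraction. The prompts and the `complete` callable are illustrative; RAGAS itself packages this with more rigor.

```python
# A RAG faithfulness sketch in the spirit of RAGAS: split the answer into claims,
# ask a judge whether each claim is supported by the retrieved context, and
# report the supported fraction. Prompts and `complete` are illustrative.
def faithfulness(answer: str, context: str, complete) -> float:
    """`complete(prompt) -> str` is any text-completion callable for a judge model."""
    claims_prompt = (
        "List each factual claim in the following answer, one per line:\n\n"
        f"{answer}\n\nClaims:"
    )
    claims = [c.strip("- ").strip() for c in complete(claims_prompt).splitlines() if c.strip()]
    if not claims:
        return 1.0  # nothing asserted, nothing unfaithful

    supported = 0
    for claim in claims:
        verdict_prompt = (
            "Answer strictly YES or NO. Is the claim supported by the context?\n\n"
            f"CONTEXT:\n{context}\n\nCLAIM:\n{claim}\n\nAnswer:"
        )
        if complete(verdict_prompt).strip().upper().startswith("YES"):
            supported += 1
    return supported / len(claims)
```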

Rung 11-Building an eval framework (Q3 Track A capstone)

  • What: Open-source a focused eval tool, perhaps for agent trajectories, for RAG faithfulness, or for a specific domain underserved by existing tools.
  • Why it earns its place: This is your specialty made visible. Few engineers ship credible eval frameworks; doing so makes you a recognized practitioner.
  • Resource: Read the source of Inspect AI, Braintrust, and OpenAI evals. Identify a real gap. Build for that gap.
  • Done when: Public repo with: README, example eval, comparison against an existing tool, blog post.

Minimum required to leave this sequence

  • 50-example golden dataset for a real task.
  • Heuristic evals running in CI.
  • LLM-as-judge with a written rubric, validated against human labels.
  • Inspect AI or equivalent harness running an eval suite.
  • Regression check on prompt changes via CI.
  • Online sampler scoring production traffic.

Going further

  • Hamel Husain's eval blog series (hamel.dev), the entire archive.
  • Eugene Yan's eval posts (eugeneyan.com).
  • Anthropic's evals documentation + responsible scaling policy.
  • UK AISI Inspect AI-read the source code.

How this sequence connects to the year

  • Month 6: Rungs 01–05 are core to that month's eval harness build.
  • Q3 Track A: This sequence is your specialty if you pick evals.
  • Q4 capstone: A specialized eval framework is the recommended capstone artifact.
