# Month 4, Week 3: Evals - golden set, heuristics, judge, and validation

## Week summary

- Goal: Build the eval discipline that defines the rest of Q2 (and your Q3 specialty): a 50-example golden dataset, heuristic + LLM-as-judge scorers, CI integration, and judge-vs-human agreement (kappa) measured.
- Time: ~10 h over 3 sessions.
- Output: `evals/` directory with golden set, scorers, CI workflow, and judge-validation report.
- Sequences relied on: 12-evaluation-systems rungs 01-05; 09-llm-application-engineering rung 11; 03-probability-statistics rungs 09-10.
## Why this week matters

Without evals, AI engineering is folklore. Every prompt change feels like an improvement; every model swap is celebrated; every deploy is a leap of faith. Real teams in 2026 use eval-driven development: golden datasets, automated scorers, regression tests in CI, online sampling on production traffic. This week installs that discipline in your project.

This is also the week your Q3 specialty hypothesis crystallizes. If you find the eval work intellectually satisfying ("designing the metric is harder and more interesting than designing the model"), that's a strong signal you should pick Track A (Evals) in Q3.
## Prerequisites
- M04-W01 + W02 complete.
- A working LLM application that produces structured outputs.
## Recommended cadence

- Session A (Tue/Wed evening, ~3 h): read Hamel + design the golden set.
- Session B (Sat morning, ~3.5 h): scorers + CI.
- Session C (Sun afternoon, ~3 h): human labeling + judge validation.
## Session A: Read deeply, design the golden set
Goal: Internalize Hamel's eval philosophy. Curate 50 representative examples.
### Part 1: Hamel Husain's eval archive (90 min)

Hamel's blog (hamel.dev) has the best applied-eval writing on the internet. Read in this order:

1. "Your AI product needs evals" - the philosophy.
2. "Levels of complexity: RAG applications" - eval modalities.
3. "How to create a high-quality eval set" - the practicalities.
4. "Be skeptical of LLM-as-judge" - the warnings.

Take notes on:

- The difference between eval cases and eval criteria.
- Why you want diversity over volume.
- When LLM-as-judge is appropriate vs. unreliable.
### Part 2: Golden dataset design (60 min)

Curate 50 incident-triage examples. Composition:

- 30 typical: the bread-and-butter cases your system must handle well.
- 10 edge cases: ambiguous, multi-cause, partial information.
- 5 "should refuse": cases where the right answer is "I need more information" or "escalate to a human."
- 5 distractors: cases that look like one type of incident but aren't.

Format as JSONL:

```jsonl
{"id": "001", "input": "Sudden spike in 5xx errors on checkout-api...", "expected": {"severity": "critical", "service_contains": "checkout", "cause_keywords": ["deploy", "v2.3.4"]}}
{"id": "002", "input": "...", "expected": {...}}
```
**Tip:** Generate first drafts with Claude (give it a prompt asking for diverse incident scenarios), then edit carefully. Synthesizing then editing is faster than writing from scratch and avoids your own bias toward easy cases.
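A minimal sketch of that synthesis step, reusing the `anthropic` client and the model name this project uses elsewhere; the prompt wording is just an example, and every generated case still needs hand review:

```python
# evals/synthesize.py - draft-generation helper (sketch; hand-edit the output)
import anthropic

PROMPT = """Generate 10 diverse incident-triage scenarios as JSONL, one object per line,
with fields: id, input (the raw alert text), expected (severity, service_contains,
cause_keywords). Vary severity, affected service, failure mode, and alert length."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{"role": "user", "content": PROMPT}],
)
print(resp.content[0].text)  # review and edit before appending to evals/golden.jsonl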
### Part 3: Reasoning about diversity (30 min)

Plot or tabulate your 50 cases by:

- Severity distribution (don't make them all "critical").
- Service variety (don't have only 2 services).
- Failure mode (deploy regression, infra, dependency, code bug).
- Length (some short alerts, some long detailed reports).

If anything is over-represented, replace examples until coverage is balanced; a quick tally script like the one below helps.
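A rough tally over the golden set (a sketch; `failure_mode` is a hypothetical tag you would add to `expected` during curation, and the length threshold is arbitrary):

```python
# evals/coverage.py - rough diversity tally for golden.jsonl
import json
from collections import Counter

with open("evals/golden.jsonl") as f:
    cases = [json.loads(line) for line in f]

print("severity:", Counter(c["expected"].get("severity", "n/a") for c in cases))
print("failure_mode:", Counter(c["expected"].get("failure_mode", "untagged") for c in cases))  # hypothetical tag
print("length:", Counter("short" if len(c["input"]) < 200 else "long" for c in cases))
```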
### Output of Session A

- `evals/golden.jsonl` with 50 examples.
- `evals/coverage.md` showing diversity stats.
## Session B: Scorers and CI integration

Goal: Implement heuristic + LLM-as-judge scorers. Wire them into a `make eval` command and GitHub Actions CI.
### Part 1: Heuristic / deterministic scorers (75 min)

Cheap, fast, deterministic. Always check these first:

```python
# evals/scorers.py
from src.triage.schemas import IncidentReport


def score_schema_valid(output: IncidentReport, expected: dict) -> bool:
    # If we got an IncidentReport at all, Pydantic has already validated the schema.
    return isinstance(output, IncidentReport)


def score_severity_match(output: IncidentReport, expected: dict) -> bool:
    return expected.get("severity") == output.severity.value


def score_service_contains(output: IncidentReport, expected: dict) -> bool:
    needle = expected.get("service_contains", "").lower()
    return needle in output.affected_service.lower()


def score_cause_keywords(output: IncidentReport, expected: dict) -> float:
    # Fraction of expected keywords that appear in the probable-cause text.
    needed = expected.get("cause_keywords", [])
    if not needed:
        return 1.0
    found = sum(1 for k in needed if k.lower() in output.probable_cause.lower())
    return found / len(needed)
```
Run all scorers over the golden set. Aggregate:

```
schema_valid: 100% (50/50)
severity_match: 78% (39/50)
service_contains: 92% (46/50)
cause_keywords (mean): 0.71
```

The heuristic eval pass rate is your baseline: every prompt change is compared against it.
### Part 2: LLM-as-judge with rubric (90 min)

```python
# evals/judge.py
import json

import anthropic

from src.triage.schemas import IncidentReport

JUDGE_PROMPT = """You are an expert evaluator. Score an incident triage report on three dimensions, 1-5 each:

1. **Faithfulness**: does the cause analysis match what the input describes? Penalize hallucinated facts.
2. **Action specificity**: are recommended actions concrete and useful, or generic?
3. **Severity appropriateness**: is the severity assignment reasonable given the input?

Return strict JSON:
{"faithfulness": int, "action_specificity": int, "severity_appropriateness": int, "rationale": "1-2 sentences"}

Input incident:
<<<INPUT>>>

Triage report:
<<<REPORT>>>
"""


def judge(incident: str, report: IncidentReport) -> dict:
    client = anthropic.Anthropic()
    prompt = JUDGE_PROMPT.replace("<<<INPUT>>>", incident).replace(
        "<<<REPORT>>>", report.model_dump_json(indent=2)
    )
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.content[0].text)  # raises if the judge strays from strict JSON
```
Run it over the golden set and aggregate mean scores per dimension; the aggregation can be as small as the sketch below.
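A sketch of that aggregation, assuming the per-case judge dicts are collected into a list:

```python
# Mean judge score per dimension (sketch)
from statistics import mean


def judge_means(judge_results: list[dict]) -> dict:
    dims = ("faithfulness", "action_specificity", "severity_appropriateness")
    return {d: round(mean(r[d] for r in judge_results), 2) for d in dims}
```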
### Part 3: `make eval` + GitHub Actions (45 min)
```python
# evals/run.py - orchestrator
import json

from src.triage.client import triage

from .judge import judge
from .scorers import (
    score_cause_keywords,
    score_schema_valid,
    score_service_contains,
    score_severity_match,
)


def summary(results: list[dict]) -> str:
    # Pass rates for the boolean scorers, mean for the fractional one.
    n = len(results)
    lines = []
    for key in ("schema_valid", "severity_match", "service_contains"):
        hits = sum(r["heuristic"][key] for r in results)
        lines.append(f"{key}: {100 * hits / n:.0f}% ({hits}/{n})")
    mean_kw = sum(r["heuristic"]["cause_keywords"] for r in results) / n
    lines.append(f"cause_keywords (mean): {mean_kw:.2f}")
    return "\n".join(lines)


def main():
    with open("evals/golden.jsonl") as f:
        cases = [json.loads(line) for line in f]
    results = []
    for case in cases:
        output = triage(case["input"])
        h = {
            "schema_valid": score_schema_valid(output, case["expected"]),
            "severity_match": score_severity_match(output, case["expected"]),
            "service_contains": score_service_contains(output, case["expected"]),
            "cause_keywords": score_cause_keywords(output, case["expected"]),
        }
        j = judge(case["input"], output)
        results.append({"id": case["id"], "heuristic": h, "judge": j})
    # Aggregate + write report
    with open("evals/latest_report.json", "w") as f:
        json.dump(results, f, indent=2)
    print(summary(results))


if __name__ == "__main__":
    main()
```
Add to Makefile:
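A minimal target (a sketch; it simply wraps the same module invocation the CI workflow below uses):

```make
# Makefile - eval target (sketch)
.PHONY: eval
eval:
	uv run python -m evals.run
```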
GitHub Actions:
```yaml
# .github/workflows/eval.yml
name: eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v2
      - run: uv sync
      - run: uv run python -m evals.run
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        with: { name: eval-report, path: evals/latest_report.json }
```
Add a baseline (last-known-good) comparison and fail the job if scores drop materially; one way to wire that is sketched below.
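A minimal regression gate, assuming you commit the last known-good report as `evals/baseline_report.json` (a name chosen here for illustration):

```python
# evals/check_regression.py - fail CI if scores drop materially (sketch)
import json
import sys

THRESHOLD = 0.05  # tolerate up to a 5-point drop; tune to your noise level


def pass_rates(path: str) -> dict:
    with open(path) as f:
        results = json.load(f)
    keys = ("schema_valid", "severity_match", "service_contains", "cause_keywords")
    # float() maps booleans to 0/1 so all four scorers average the same way.
    return {k: sum(float(r["heuristic"][k]) for r in results) / len(results) for k in keys}


baseline = pass_rates("evals/baseline_report.json")
latest = pass_rates("evals/latest_report.json")
regressions = {k: (baseline[k], latest[k]) for k in baseline if latest[k] < baseline[k] - THRESHOLD}
if regressions:
    print("Regressions vs. baseline:", regressions)
    sys.exit(1)  # non-zero exit fails the GitHub Actions job
```

In the workflow, run it as an extra step after `uv run python -m evals.run`.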
### Output of Session B

- Heuristic scorers + LLM-as-judge implemented.
- `make eval` runs end-to-end.
- CI workflow committed.
## Session C: Judge validation (the part most teams skip)
Goal: Hand-label 30 examples; compute Cohen's kappa between you and the judge; refine the judge if agreement is poor.
### Part 1: Hand-label 30 examples (90 min)

Take 30 of your golden cases. For each, run `triage()` to get the report. Then:

- Label faithfulness 1-5 yourself.
- Label action specificity 1-5.
- Label severity appropriateness 1-5.

Be honest. This labeling is the ground truth.

Store the labels in `evals/human_labels.jsonl`; a bare-bones labeling loop follows.
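A sketch of such a loop (the `id` field and file names are assumptions that the agreement script below relies on):

```python
# evals/label.py - interactive hand-labeling helper (sketch)
import json

from src.triage.client import triage

DIMS = ("faithfulness", "action_specificity", "severity_appropriateness")

with open("evals/golden.jsonl") as f:
    cases = [json.loads(line) for line in f][:30]  # first 30; pick a deliberate spread in practice

with open("evals/human_labels.jsonl", "w") as out:
    for case in cases:
        print("\nINPUT:", case["input"])
        print("REPORT:", triage(case["input"]).model_dump_json(indent=2))
        label = {"id": case["id"]}
        for dim in DIMS:
            label[dim] = int(input(f"{dim} (1-5): "))
        out.write(json.dumps(label) + "\n")
```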
### Part 2: Compute agreement (45 min)

```python
# evals/agreement.py
import json

from sklearn.metrics import cohen_kappa_score

with open("evals/human_labels.jsonl") as f:
    human = [json.loads(line) for line in f]
with open("evals/latest_report.json") as f:
    judge_by_id = {r["id"]: r["judge"] for r in json.load(f)}
judge = [judge_by_id[h["id"]] for h in human]  # align judge scores to the human-labeled cases


def kappa(human, judge, dim):
    h = [x[dim] for x in human]
    j = [x[dim] for x in judge]
    return cohen_kappa_score(h, j, weights="quadratic")  # quadratic weights for ordinal 1-5 scales


print("Faithfulness kappa:", kappa(human, judge, "faithfulness"))
print("Action kappa:", kappa(human, judge, "action_specificity"))
print("Severity kappa:", kappa(human, judge, "severity_appropriateness"))
```
Interpretation:

- κ < 0.4: poor agreement; the judge is unreliable. Refine.
- κ 0.4-0.6: moderate; usable but watch closely.
- κ 0.6-0.8: substantial; fine for production.
- κ 0.8+: almost-human; rare and valuable.

If a dimension scores poorly, look at the disagreements (the snippet below prints the worst offenders). Update the rubric prompt to address the systematic gap. Re-run, re-measure.
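A sketch for surfacing those disagreements, reusing the aligned `human` and `judge` lists from `agreement.py`:

```python
# Print the cases where human and judge disagree most on a dimension (sketch)
def worst_disagreements(human, judge, dim, k=5):
    gaps = sorted(
        zip(human, judge),
        key=lambda pair: abs(pair[0][dim] - pair[1][dim]),
        reverse=True,
    )
    for h, j in gaps[:k]:
        print(f"{h['id']}: human={h[dim]} judge={j[dim]} ({j.get('rationale', '')})")
```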
### Part 3: Document and ship (45 min)
Add to the README:

```markdown
## Eval methodology

- 50 golden examples, balanced across severity and failure modes.
- Heuristic checks: schema validity, severity match, service mention, cause keywords.
- LLM-as-judge (Claude Opus 4.7) on faithfulness, action specificity, severity appropriateness.
- Judge validated against 30 human labels:
    - Faithfulness κ = 0.71 (substantial)
    - Action specificity κ = 0.58 (moderate)
    - Severity κ = 0.82 (almost-human)
- Eval CI runs on every PR; regressions fail the build.
```
Push `v0.3.0`. Update `LEARNING_LOG.md`.
### Output of Session C
- 30 human labels in repo.
- Cohen's kappa per dimension.
- Refined judge prompt if needed.
- README eval methodology section.
## End-of-week artifacts

- 50-example golden dataset with diversity stats.
- Heuristic + LLM-as-judge scorers.
- `make eval` + CI integration.
- 30 human labels + Cohen's kappa per dimension.
- README eval methodology section.
## End-of-week self-assessment
- I can curate a representative golden set.
- I can write a judge prompt and validate it against humans.
- I can interpret Cohen's kappa correctly.
- If asked "are your evals trustworthy?", I have data to defend them.
## Common failure modes for this week
- Skipping judge validation. Without it, your eval is fiction.
- All easy cases in the golden set. Edge cases are where models fail in production.
- Treating κ < 0.5 as fine. It isn't. Iterate the rubric until at least 0.6.
## What's next (preview of M04-W04)

Polish + the first month-4 blog post. Write up your project, your numbers, and your eval methodology: the post that announces your AI-engineer identity to the world.