Month 6-Week 3: Inspect AI, regression detection, online eval prep¶
Week summary¶
- Goal: Adopt Inspect AI as your eval harness. Migrate golden set and scorers. Set up regression detection in CI. Begin online eval sampling on production-like traffic.
- Time: ~9 h over 3 sessions.
- Output: Inspect AI eval suite running; CI fails on regression; expanded human-labeled set (50); production sampler stub.
- Sequences relied on: 12-evaluation-systems rungs 05, 07, 08, 09.
Why this week matters¶
Hand-rolled evals get you started. Real eval harnesses give you parallelism, caching, datasets-as-code, dashboards, and shareability. Inspect AI (from the UK AI Safety Institute) is the most thoughtfully designed eval framework in 2025–2026 and a strong portfolio signal: using it well is itself a credential.
Online eval (sampling production traffic) is what catches drift that golden sets miss. It's also the closest analog to your existing SLO discipline.
Prerequisites¶
- M04-W03 + M06-W01–W02 complete.
- Working agent + RAG eval pipelines.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): Inspect AI deep-dive
- Session B-Sat morning (~3.5 h): port + regression CI
- Session C-Sun afternoon (~2.5 h): expanded human labeling + online sampler
Session A-Inspect AI deep dive¶
Goal: Read enough Inspect AI to know its design. Run an example. Plan the port.
Part 1-Read Inspect AI docs (75 min)¶
inspect.ai-safety-institute.org.uk (the official docs).
Concepts to internalize:
- Task: a function, decorated with @task, that returns a Task object combining a dataset, a solver, and a scorer.
- Solver: a function that takes a TaskState and returns an updated state. Composable.
- Scorer: a function that produces a Score from (state, target); metrics aggregate those scores across samples.
- Dataset: examples loaded from JSONL/HF/etc.
- Sample: one example (input, target, optional metadata).
The composability is the design's strength. You can mix-and-match solvers and scorers across tasks.
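For example, a minimal sketch of the mix-and-match idea, assuming the solver/scorer names from the Inspect AI docs and a placeholder dataset path:
# compose_example.py - one dataset, two solver/scorer pairings
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.solver import generate, system_message
from inspect_ai.scorer import match, model_graded_fact

@task
def exact_match_variant():
    return Task(
        dataset=json_dataset("data/qa.jsonl"),
        solver=[system_message("Answer concisely."), generate()],
        scorer=match(),
    )

@task
def judged_variant():
    # same dataset and solver chain, graded by a model instead of string match
    return Task(
        dataset=json_dataset("data/qa.jsonl"),
        solver=[system_message("Answer concisely."), generate()],
        scorer=model_graded_fact(),
    )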
Part 2-Read the source (45 min)¶
Skim:
- solver/: chain, generate, multiple_choice, tool_use.
- scorer/: match, model_graded.
- dataset/: formats.
This is also a good Python project to study: well-structured, well-tested.
Part 3-Run a quickstart (60 min)¶
# eval_quickstart.py
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import match

@task
def dummy():
    samples = [
        Sample(input="What's 2+2?", target="4"),
        Sample(input="Capital of France?", target="Paris"),
    ]
    return Task(dataset=samples, solver=generate(), scorer=match())

# Run:
# inspect eval eval_quickstart.py --model anthropic/claude-opus-4-7
Open the results with inspect view and look over the report. Notice how much the dashboard gives you for free.
Output of Session A¶
- Inspect AI installed and running on a toy task.
- Notes on the design.
Session B-Port + regression CI¶
Goal: Port your golden set and scorers to Inspect AI. Set up regression detection in CI.
Part 1-Port the dataset (45 min)¶
Inspect dataset format:
# evals/inspect_dataset.py
import json
from inspect_ai.dataset import Sample, json_dataset

def load_triage_samples():
    return json_dataset(
        "evals/golden.jsonl",
        sample_fields=lambda r: Sample(
            id=r["id"],
            input=r["input"],
            # Sample.target expects a string, so serialize the full expected dict
            target=json.dumps(r["expected"]),
            metadata={"failure_mode": r.get("failure_mode")},
        ),
    )
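Before wiring up scorers, a quick sanity check is worth the minute (a sketch, assuming the loader above and your evals/golden.jsonl):
# load the ported dataset and eyeball the first sample
ds = load_triage_samples()
print(len(ds), "samples")
print(ds[0].input)
print(ds[0].target)    # the JSON-serialized expected dict
print(ds[0].metadata)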
Part 2-Port the scorers (75 min)¶
Inspect AI scorers return Score(value, answer, explanation, metadata). Port your heuristic + judge:
import json
from inspect_ai.scorer import Score, scorer, mean

@scorer(metrics=[mean()])
def severity_match():
    async def score(state, target):
        # Parse the structured output (state.output.completion)
        try:
            report = IncidentReport.model_validate_json(state.output.completion)
        except Exception:
            return Score(value=0, explanation="schema invalid")
        # target.text holds the JSON-serialized expected dict from the dataset port
        expected_sev = json.loads(target.text)["severity"]
        matched = report.severity.value == expected_sev
        return Score(value=1 if matched else 0,
                     explanation=f"got {report.severity.value}, expected {expected_sev}")
    return score

# Compose multiple scorers with multi_scorer, or pass a list of scorers to the Task
Run the full eval suite:
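A sketch of the Task that ties the ported dataset and scorers together, plus the command to run it (triage_eval, the file path, and the import paths are placeholders for your own layout):
# evals/triage_task.py
from inspect_ai import Task, task
from inspect_ai.solver import generate
from inspect_dataset import load_triage_samples   # assumed module paths
from scorers import severity_match

@task
def triage_eval():
    return Task(
        dataset=load_triage_samples(),
        solver=generate(),            # or your full agent solver chain
        scorer=[severity_match()],    # add the judge scorer alongside
    )

# Run:
# inspect eval evals/triage_task.py --model anthropic/claude-opus-4-7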
Part 3-Regression CI (60 min)¶
Create a baseline file evals/baseline.json with last-known-good metric values.
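baseline.json can be a flat metric-to-value map, refreshed from a known-good run. A sketch (the logs/latest_summary.json path mirrors the CI script below and is an assumption about your own layout):
# evals/update_baseline.py - promote the latest known-good metrics to the baseline
import json
from pathlib import Path

latest = json.loads(Path("logs/latest_summary.json").read_text())
# e.g. {"severity_match/mean": 0.92, "faithfulness_judge/mean": 0.88}
Path("evals/baseline.json").write_text(json.dumps(latest, indent=2))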
In CI:
# evals/check_regression.py
import json, sys
from pathlib import Path

baseline = json.loads(Path("evals/baseline.json").read_text())
latest = json.loads(Path("logs/latest_summary.json").read_text())

THRESHOLD = 0.02  # allow a 2-percentage-point drop before failing

for metric, baseline_val in baseline.items():
    latest_val = latest.get(metric)
    if latest_val is None:
        print(f"REGRESSION: {metric} missing from latest run")
        sys.exit(1)
    if latest_val < baseline_val - THRESHOLD:
        print(f"REGRESSION: {metric} dropped from {baseline_val:.4f} to {latest_val:.4f}")
        sys.exit(1)

print("OK: no regressions.")
Wire into .github/workflows/eval.yml so PRs that worsen evals are blocked.
Test it. Submit a deliberately bad prompt. CI should fail.
Output of Session B¶
- Golden set + scorers ported to Inspect AI.
- CI regression detection working and tested.
Session C-Expanded human labeling + online sampler¶
Goal: Add 30 more human labels (now 50 total). Recompute kappa. Stub a production sampler.
Part 1-Hand-label 30 more examples (75 min)¶
Like M04-W03, hand-label faithfulness, action specificity, and severity. Aim for diversity: pull from across the failure-mode taxonomy.
Recompute Cohen's kappa per dimension. If it dropped, your judge needs refinement or the new examples reveal coverage gaps.
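A minimal sketch of the recomputation, assuming labels are stored as parallel lists per dimension (scikit-learn's cohen_kappa_score does the work):
# evals/kappa.py - judge/human agreement per dimension
from sklearn.metrics import cohen_kappa_score

def kappa_per_dimension(human: dict[str, list], judge: dict[str, list]) -> dict[str, float]:
    # human/judge: {"faithfulness": [1, 0, 1, ...], "action_specificity": [...], ...}
    return {dim: cohen_kappa_score(human[dim], judge[dim]) for dim in human}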
Update README with new kappa numbers.
Part 2-Production sampler stub (45 min)¶
A production sampler means 1% of real (or simulated production) traffic gets:
- Trace stored.
- Async-scored by your eval suite.
- Aggregated daily.
# src/triage/online_eval.py
import random
import uuid
from datetime import datetime

SAMPLE_RATE = 0.01

async def sample_for_eval(incident: str, agent_state: AgentRun):
    if random.random() > SAMPLE_RATE:
        return
    # Save to a queue / table for async scoring; the id lets the worker tie scores back
    save_for_eval({"id": str(uuid.uuid4()),
                   "input": incident,
                   "state": agent_state.model_dump(),
                   "timestamp": datetime.utcnow()})

# Async worker (separate process/cron)
async def score_sampled():
    items = pull_from_queue()
    for item in items:
        score = await run_inspect_on_one(item)
        write_score(item["id"], score)
For now, just stub it (no production traffic yet). The architecture is what matters.
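To round out the architecture, the daily aggregation step could look roughly like this (a sketch; read_scores_since and write_daily_report are hypothetical helpers over whatever store backs the queue):
# daily rollup of online eval scores (run from cron)
from datetime import datetime, timedelta
from statistics import mean

def aggregate_daily():
    scores = read_scores_since(datetime.utcnow() - timedelta(days=1))
    by_metric: dict[str, list[float]] = {}
    for s in scores:
        by_metric.setdefault(s["metric"], []).append(s["value"])
    report = {metric: mean(vals) for metric, vals in by_metric.items()}
    write_daily_report(report)  # e.g. append to a table your dashboard reads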
Part 3-Document + push (30 min)¶
README.md "Eval methodology" section updated:
- Inspect AI suite running.
- 50 human labels with kappa per dimension.
- CI regression detection.
- Production sampler architecture (stub).
Push v0.7.0.
Output of Session C¶
- 50 total human labels with kappa documented.
- Production sampler scaffold.
End-of-week artifact¶
- Inspect AI eval suite working
- Scorers ported (heuristic + judge)
- CI regression detection working
- 50 human-labeled examples + kappa
- Production sampler stub
End-of-week self-assessment¶
- I can write an Inspect AI Task from scratch.
- My eval CI catches regressions before merge.
- My judge has substantial agreement with humans (kappa ≥ 0.6).
Common failure modes for this week¶
- Skipping the source-reading. The Inspect AI source is the best single way to understand its design.
- Accepting low kappa. If <0.6, the judge isn't trustworthy. Iterate the rubric.
- No regression baseline. Without it, CI passes regressions silently.
What's next (preview of M06-W04)¶
Q2 capstone: the bridge observability blog post (your highest-leverage post yet) plus the Q3 track decision.