Month 6-Week 3: Inspect AI, regression detection, online eval prep

Week summary

  • Goal: Adopt Inspect AI as your eval harness. Migrate golden set and scorers. Set up regression detection in CI. Begin online eval sampling on production-like traffic.
  • Time: ~9 h over 3 sessions.
  • Output: Inspect AI eval suite running; CI fails on regression; expanded human-labeled set (50); production sampler stub.
  • Sequences relied on: 12-evaluation-systems rungs 05, 07, 08, 09.

Why this week matters

Hand-rolled evals get you started. Real eval harnesses give you parallelism, caching, datasets-as-code, dashboards, and shareability. Inspect AI (UK AISI) is the most thoughtfully designed eval framework in 2025–2026 and a strong portfolio signal-using it well is itself a credential.

Online eval (sampling production traffic) is what catches drift that golden sets miss. It's also the closest analog to your existing SLO discipline.

Prerequisites

  • M04-W03 + M06-W01–W02 complete.
  • Working agent + RAG eval pipelines.

Session plan

  • Session A-Tue/Wed evening (~3 h): Inspect AI deep-dive
  • Session B-Sat morning (~3.5 h): port + regression CI
  • Session C-Sun afternoon (~2.5 h): expanded human labeling + online sampler

Session A-Inspect AI deep dive

Goal: Read enough Inspect AI to know its design. Run an example. Plan the port.

Part 1-Read Inspect AI docs (75 min)

inspect.ai-safety-institute.org.uk - the official docs.

Concepts to internalize:

  • Task: a function that returns a Task object-combines dataset, solver, scorer.
  • Solver: a function that takes a TaskState and returns updated state. Composable.
  • Scorer: a function that produces metrics from (state, target).
  • Dataset: examples loaded from JSONL/HF/etc.
  • Sample: one example.

The composability is the design's strength. You can mix-and-match solvers and scorers across tasks.
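A minimal sketch of that mix-and-match (the task names here are illustrative; Task, task, Sample, chain, generate, system_message, and match are Inspect AI's):

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import chain, generate, system_message
from inspect_ai.scorer import match

SAMPLES = [Sample(input="Capital of France?", target="Paris")]

@task
def plain():
    # Bare model call, exact-match scoring.
    return Task(dataset=SAMPLES, solver=generate(), scorer=match())

@task
def terse():
    # Same dataset and scorer; only the solver chain changes.
    solver = chain(system_message("Answer with a single word."), generate())
    return Task(dataset=SAMPLES, solver=solver, scorer=match())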

Part 2-Read the source (45 min)

git clone https://github.com/UKGovernmentBEIS/inspect_ai
cd inspect_ai/src

Skim:

  • solver/ - chain, generate, multiple_choice, tool_use.
  • scorer/ - match, model_graded.
  • dataset/ - formats.

This is also a good Python project to study-well-structured, well-tested.

Part 3-Run a quickstart (60 min)

pip install inspect-ai

# eval_quickstart.py
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import match

@task
def dummy():
    samples = [
        Sample(input="What's 2+2?", target="4"),
        Sample(input="Capital of France?", target="Paris"),
    ]
    return Task(dataset=samples, solver=generate(), scorer=match())

# Run from the shell:
# inspect eval eval_quickstart.py --model anthropic/claude-opus-4-7

Inspect the report (inspect view opens the log viewer). Notice the dashboard you get for free.

Output of Session A

  • Inspect AI installed and running on a toy task.
  • Notes on the design.

Session B-Port + regression CI

Goal: Port your golden set and scorers to Inspect AI. Set up regression detection in CI.

Part 1-Port the dataset (45 min)

Inspect dataset format:

# evals/inspect_dataset.py
import json

from inspect_ai.dataset import Sample, json_dataset

def load_triage_samples():
    # Each golden.jsonl record looks like:
    #   {"id": ..., "input": ..., "expected": {...}, "failure_mode": ...}
    return json_dataset(
        "evals/golden.jsonl",
        sample_fields=lambda r: Sample(
            id=r["id"],
            input=r["input"],
            # Inspect targets are strings, so serialize the full expected
            # dict to JSON; scorers parse it back out of target.text.
            target=json.dumps(r["expected"]),
            metadata={"failure_mode": r.get("failure_mode")},
        ),
    )

Part 2-Port the scorers (75 min)

Inspect AI scorers return Score(value, answer, explanation, metadata). Port your heuristic + judge:

import json

from inspect_ai.scorer import Score, Target, mean, scorer
from inspect_ai.solver import TaskState

from triage.models import IncidentReport  # your existing Pydantic model (path illustrative)

@scorer(metrics=[mean()])
def severity_match():
    async def score(state: TaskState, target: Target):
        # Parse the structured output (state.output.completion)
        try:
            report = IncidentReport.model_validate_json(state.output.completion)
        except Exception:
            return Score(value=0, explanation="schema invalid")
        # The target holds the expected dict serialized as JSON (see the dataset port)
        expected_sev = json.loads(target.text)["severity"]
        matched = report.severity.value == expected_sev
        return Score(value=1 if matched else 0,
                     explanation=f"got {report.severity.value}, expected {expected_sev}")
    return score

# Compose multiple scorers with multi_scorer, or attach a list of scorers to the Task
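
A hedged sketch of the list-of-scorers route, assuming a second judge scorer (faithfulness_judge, hypothetical) ported from your M04 work:

from inspect_ai import Task, task
from inspect_ai.solver import generate

from evals.inspect_dataset import load_triage_samples

@task
def triage():
    return Task(
        dataset=load_triage_samples(),
        solver=generate(),
        scorer=[severity_match(), faithfulness_judge()],  # faithfulness_judge is hypothetical
    )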

Run the full eval suite:

inspect eval evals/triage.py --model anthropic/claude-opus-4-7

Part 3-Regression CI (60 min)

Create a baseline file evals/baseline.json with last-known-good metric values.
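
One way to seed it, a sketch that assumes your eval run writes a flat metric-name-to-value summary at logs/latest_summary.json (the same path the CI check below reads):

# evals/update_baseline.py - run manually after a reviewed, known-good eval
import json
from pathlib import Path

summary = json.loads(Path("logs/latest_summary.json").read_text())
Path("evals/baseline.json").write_text(json.dumps(summary, indent=2))
print("Baseline updated:", summary)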

In CI:

# evals/check_regression.py
import json
import sys
from pathlib import Path

baseline = json.loads(Path("evals/baseline.json").read_text())
latest = json.loads(Path("logs/latest_summary.json").read_text())

THRESHOLD = 0.02  # allow a 2-percentage-point drop

failed = False
for metric, baseline_val in baseline.items():
    latest_val = latest.get(metric)
    if latest_val is None:
        print(f"MISSING: {metric} not in latest results")
        failed = True
    elif latest_val < baseline_val - THRESHOLD:
        print(f"REGRESSION: {metric} dropped from {baseline_val:.4f} to {latest_val:.4f}")
        failed = True
if failed:
    sys.exit(1)
print("OK: no regressions.")

Wire into .github/workflows/eval.yml so PRs that worsen evals are blocked.

Test it. Submit a deliberately bad prompt. CI should fail.

Output of Session B

  • Golden set + scorers ported to Inspect AI.
  • CI regression detection working and tested.

Session C-Expanded human labeling + online sampler

Goal: Add 30 more human labels (now 50 total). Recompute kappa. Stub a production sampler.

Part 1-Hand-label 30 more examples (75 min)

Like M04-W03, hand-label faithfulness, action specificity, severity. Aim for diversity-pull from across the failure-mode taxonomy.

Recompute Cohen's kappa per dimension. If it dropped, your judge needs refinement or the new examples reveal coverage gaps.
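
A minimal sketch with scikit-learn, assuming you keep parallel human/judge label lists per dimension:

# evals/kappa.py
from sklearn.metrics import cohen_kappa_score

# Illustrative labels for one dimension (faithfulness); repeat per dimension.
human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]

print(f"faithfulness kappa: {cohen_kappa_score(human, judge):.2f}")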

Update README with new kappa numbers.

Part 2-Production sampler stub (45 min)

A production sampler: 1% of real (or simulated production) traffic gets:

  • Trace stored.
  • Async-scored by your eval suite.
  • Aggregated daily.

# src/triage/online_eval.py
import random
import uuid
from datetime import datetime, timezone

SAMPLE_RATE = 0.01

async def sample_for_eval(incident: str, agent_state: AgentRun):
    if random.random() > SAMPLE_RATE:
        return
    # Save to a queue / table for async scoring (save_for_eval is your stub)
    save_for_eval({"id": str(uuid.uuid4()),
                   "input": incident,
                   "state": agent_state.model_dump(),
                   "timestamp": datetime.now(timezone.utc).isoformat()})

# Async worker (separate process/cron); pull_from_queue, run_inspect_on_one,
# and write_score are stubs to fill in later
async def score_sampled():
    items = pull_from_queue()
    for item in items:
        score = await run_inspect_on_one(item)
        write_score(item["id"], score)

For now, just stub it (no production traffic yet). The architecture is what matters.
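
For the daily-aggregation piece, a hypothetical rollup over whatever write_score persists:

from collections import defaultdict

def daily_summary(scores):
    # scores: iterable of {"metric": str, "value": float} for one day
    by_metric = defaultdict(list)
    for s in scores:
        by_metric[s["metric"]].append(s["value"])
    return {m: sum(vals) / len(vals) for m, vals in by_metric.items()}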

Part 3-Document + push (30 min)

README.md "Eval methodology" section updated: - Inspect AI suite running. - 50 human labels with kappa per dimension. - CI regression detection. - Production sampler architecture (stub).

Push v0.7.0.

Output of Session C

  • 50 total human labels with kappa documented.
  • Production sampler scaffold.

End-of-week artifact

  • Inspect AI eval suite working
  • Scorers ported (heuristic + judge)
  • CI regression detection working
  • 50 human-labeled examples + kappa
  • Production sampler stub

End-of-week self-assessment

  • I can write an Inspect AI Task from scratch.
  • My eval CI catches regressions before merge.
  • My judge has substantial agreement with humans (kappa ≥ 0.6).

Common failure modes for this week

  • Skipping the source-reading. The Inspect AI source is the best single way to understand its design.
  • Accepting low kappa. If <0.6, the judge isn't trustworthy. Iterate the rubric.
  • No regression baseline. Without it, CI passes regressions silently.

What's next (preview of M06-W04)

Q2 capstone-the bridge observability blog post (your highest-leverage post yet) + Q3 track decision.
