Month 7-Week 3: Head-to-head vs incumbent + supporting paper

Week summary

  • Goal: Run an honest head-to-head comparison of your track project against an established incumbent. Surface either a clear differentiator or a needed pivot.
  • Time: ~9 h over 3 sessions.
  • Output: Comparison data; comparison writeup in DESIGN.md; supporting paper notes.

Why this week matters

A track project that's "almost as good as Inspect AI / vLLM / etc." is worth nothing: there's no reason for anyone to use yours over the incumbent. You need either a clear differentiator (a specific niche you own) or a pivot. Better to find that out in week 3 than in week 11.

Prerequisites

  • M07-W01 + W02 complete.
  • Track project at usable v0.1+.

Session plan

  • Session A-Tue/Wed evening (~3 h): design the comparison.
  • Session B-Sat morning (~3.5 h): run the comparison.
  • Session C-Sun afternoon (~2.5 h): analyze + write up + supporting paper.

Session A-Design the comparison

Goal: Define exactly what you're measuring and how to make it apples-to-apples.

Part 1-Scenario selection (45 min)

Pick 1-2 representative scenarios that exercise the dimension where your project might win.

Track A:

  • Scenario: score 30 agent trajectories using a custom rubric.
  • Compare: your framework vs Inspect AI's model_graded scorer.
  • Metrics: agreement with humans (see the kappa sketch after this list), time to write a custom scorer, cost per eval, dashboard quality.

Track B:

  • Scenario: resolve 10 SWE-bench-Lite issues.
  • Compare: your agent vs a published baseline (e.g., Aider, SWE-agent).
  • Metrics: success rate, average cost per issue, average wall-clock time.

Track C:

  • Scenario: serve a 70B model with 50 concurrent users at p95 latency < 5 s.
  • Compare: vLLM vs SGLang vs TensorRT-LLM (one configuration of each).
  • Metrics: TTFT p95, ITL, throughput, GPU memory, ease of setup.
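
For Track A, the "agreement with humans" metric can be computed as Cohen's kappa between judge scores and human labels. A minimal sketch, assuming you have paired per-trajectory labels (the label lists below are illustrative):

```python
# Sketch: agreement between an LLM-judge scorer and human labels (Track A).
# Assumes paired labels per trajectory; scikit-learn's cohen_kappa_score
# handles both binary and multi-class rubric labels.
from sklearn.metrics import cohen_kappa_score

human_labels = ["pass", "fail", "pass", "pass", "fail"]   # illustrative
judge_labels = ["pass", "fail", "fail", "pass", "fail"]   # illustrative

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Score-vs-human kappa: {kappa:.2f}")
```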

Part 2-Make it apples-to-apples (45 min)

For a fair comparison:

  • Same hardware.
  • Same input set.
  • Same evaluation criteria.
  • Same allowed budget.

Document conditions so a reader can reproduce.

Part 3-Pre-register expectations (30 min)

Before running, write down what you expect:

  • "I expect my framework wins on X by Y%."
  • "I expect to lose on Z because the incumbent has more features."

Pre-registration reduces motivated reasoning later.

Document in comparison/setup.md.
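
One way to keep the conditions and the pre-registered expectations in a single reproducible place is to generate comparison/setup.md from a small script committed alongside it. A minimal sketch; every field value below is a placeholder you'd fill in for your own track:

```python
# Sketch: write comparison/setup.md with conditions + pre-registered expectations.
# All values are placeholders; the point is to commit this file BEFORE running.
from pathlib import Path

setup = {
    "scenario": "Score 30 agent trajectories with a custom rubric",
    "hardware": "same machine for both sides, fp16 throughout",
    "input_set": "evals/agent_trajs/ (30 shared examples, committed)",
    "budget": "$5 total API spend per side",
    "expectations": [
        "I expect my framework wins on scorer-authoring time by ~2x.",
        "I expect to lose on dashboard quality; the incumbent has more features.",
    ],
}

lines = ["# Comparison setup", ""]
for key in ("scenario", "hardware", "input_set", "budget"):
    lines.append(f"- **{key}**: {setup[key]}")
lines += ["", "## Pre-registered expectations", ""]
lines += [f"- {e}" for e in setup["expectations"]]

Path("comparison").mkdir(exist_ok=True)
Path("comparison/setup.md").write_text("\n".join(lines) + "\n")
```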

Output of Session A

  • comparison/setup.md with scenario, conditions, pre-registered expectations.

Session B-Run the comparison

Goal: Run both sides. Capture all metrics. Don't fudge.

Part 1-Run your project (75 min)

Same configuration as production. Capture metrics. Save outputs for post-hoc inspection.
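
A minimal harness sketch for capturing per-case metrics and raw outputs so they can be inspected later. The file paths and the run_case function are stand-ins for whatever entry point your project (or the incumbent) actually exposes:

```python
# Sketch: run one side of the comparison, save per-case metrics + raw outputs.
# `run_case` and the file paths are hypothetical stand-ins.
import json
import time
from pathlib import Path

def run_case(case):
    # Replace with a real call into your project (or the incumbent).
    return {"score": 1.0, "cost_usd": 0.006, "raw_output": "..."}

cases = json.loads(Path("evals/agent_trajs/cases.json").read_text())
out_path = Path("comparison/mine_results.jsonl")
out_path.parent.mkdir(exist_ok=True)

with out_path.open("w") as f:
    for case in cases:
        start = time.monotonic()
        result = run_case(case)
        result["case_id"] = case["id"]
        result["wall_clock_s"] = round(time.monotonic() - start, 2)
        f.write(json.dumps(result) + "\n")  # one line per case, easy to diff later
```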

Part 2-Run the incumbent (75 min)

Same scenario. Don't put half the effort into the incumbent; it should be configured just as carefully as your own project.

Part 3-Capture surprising failures (30 min)

For both sides: pick 3 cases where output was surprising. Save these for the writeup; surprising failures are the most informative.
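
If both sides produced per-case scores, one quick way to surface candidates is to rank cases by how much the two runs disagree. A sketch, assuming the JSONL result format from Part 1 of this session (file names are illustrative):

```python
# Sketch: find the cases where the two sides disagree most (writeup candidates).
import json
from pathlib import Path

def load(path):
    rows = map(json.loads, Path(path).read_text().splitlines())
    return {r["case_id"]: r for r in rows}

mine = load("comparison/mine_results.jsonl")
theirs = load("comparison/incumbent_results.jsonl")

gaps = sorted(
    ((abs(mine[c]["score"] - theirs[c]["score"]), c) for c in mine if c in theirs),
    reverse=True,
)
for gap, case_id in gaps[:3]:
    print(case_id, gap, mine[case_id]["raw_output"][:80])
```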

Output of Session B

  • Comparison data: metrics + saved outputs.

Session C-Analyze + write up + supporting paper

Goal: Analyze honestly. Write up. Read one supporting paper.

Part 1-Analyze (60 min)

Build the comparison table. Compute differences with bootstrap CIs where applicable.
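
A minimal percentile-bootstrap sketch for the difference in a paired per-case metric (e.g., score), assuming both runs cover the same cases; the score arrays below are illustrative:

```python
# Sketch: percentile bootstrap CI for the mean paired difference (mine - incumbent).
import numpy as np

rng = np.random.default_rng(0)
mine = np.array([0.8, 0.9, 0.7, 1.0, 0.6])       # illustrative paired per-case scores
incumbent = np.array([0.7, 0.9, 0.6, 0.9, 0.7])

diffs = mine - incumbent
boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean diff = {diffs.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```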

Three possible outcomes:

  1. Clear differentiator found. Document the niche; your project owns it.
  2. Niche too narrow. Decide: scope down further or pivot.
  3. Incumbent dominates everywhere. Honest pivot needed: move to a different project, or contribute to the incumbent instead.

Part 2-Write up in DESIGN.md (45 min)

Add a "Comparison vs " section to DESIGN.md:

## Comparison vs Inspect AI (M07-W03)

### Scenario
30 agent trajectories scored against a custom rubric for tool-call appropriateness.

### Setup
- Hardware: same machine, fp16 throughout.
- Input set: shared 30 examples committed to evals/agent_trajs/.
- Judge: claude-opus-4-7.

### Results
| Metric | Mine | Inspect AI |
|---|---|---|
| Score-vs-human kappa | 0.72 | 0.69 |
| Time to define a new scorer | 12 min | 25 min |
| Cost per eval run | $0.18 | $0.21 |
| Dashboard quality | basic | excellent |

### Verdict
My project wins on time-to-write-custom-scorer (~2× faster) due to its more declarative API.
Loses on dashboard ergonomics. Niche I'll claim: "Inspect-AI-quality evals with cleaner author-time API for trajectory-level scorers."

Part 3-Supporting paper (45 min)

Read one more paper:

  • (A) HELM (arxiv.org/abs/2211.09110): methodology depth.
  • (B) AutoGen / a multi-agent paper: context on framework variety.
  • (C) Speculative Decoding (arxiv.org/abs/2211.17192) or SGLang (arxiv.org/abs/2312.07104).

Take notes as you read; they are committed as part of this week's output.

Output of Session C

  • Analysis + comparison writeup committed to DESIGN.md.
  • Supporting paper notes added.

End-of-week artifact

  • Comparison setup + data
  • DESIGN.md updated with comparison verdict
  • Supporting paper notes

End-of-week self-assessment

  • I have an honest verdict on my project vs incumbents.
  • If pivot is needed, I've named it.
  • I have a defendable niche (or know I don't).

Common failure modes for this week

  • Stacking the comparison in your favor. Apples-to-apples or it's worthless.
  • Hiding the negative result. "Incumbent wins" is honest and clarifying.
  • Pivoting too late. This week is when pivots are cheap.

What's next (preview of M07-W04)

Track milestone + first specialty post. The repo gets a v0.2 release; the post announces your specialty publicly.