Month 7-Week 3: Head-to-head vs incumbent + supporting paper¶
Week summary¶
- Goal: Run an honest head-to-head comparison of your track project against an established incumbent. Surface either a clear differentiator or a needed pivot.
- Time: ~9 h over 3 sessions.
- Output: Comparison data; comparison writeup in DESIGN.md; supporting paper notes.
Why this week matters¶
A track project that's "almost as good as Inspect AI / vLLM / etc." is worth nothing-nobody has a reason to use yours. You need either a clear differentiator (a specific niche you own) or a pivot. Better to find that out in week 3 than in week 11.
Prerequisites¶
- M07-W01 + W02 complete.
- Track project at usable v0.1+.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): design the comparison
- Session B-Sat morning (~3.5 h): run the comparison
- Session C-Sun afternoon (~2.5 h): analyze + write up + supporting paper
Session A-Design the comparison¶
Goal: Define exactly what you're measuring and how to make it apples-to-apples.
Part 1-Scenario selection (45 min)¶
Pick 1-2 representative scenarios that exercise the dimension where your project might win.
Track A:
- Scenario: "Score 30 agent trajectories using a custom rubric."
- Compare: your framework vs Inspect AI's model_graded scorer.
- Metrics: agreement with humans, time to write a custom scorer, cost per eval, dashboard quality.
Track B:
- Scenario: "Resolve 10 SWE-bench-Lite issues."
- Compare: your agent vs a published baseline (e.g., Aider, SWE-agent).
- Metrics: success rate, avg cost per issue, avg wall-clock time.
Track C:
- Scenario: "Serve a 70B model with 50 concurrent users at p95 < 5s."
- Compare: vLLM vs SGLang vs TensorRT-LLM (one config of each).
- Metrics: TTFT p95, ITL, throughput, GPU memory, ease of setup.
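For Track C, the latency metrics fall out of per-request timing records. Here's a minimal sketch of computing TTFT p95, mean inter-token latency, and throughput; the record fields and the toy data are assumptions-adapt them to whatever your load generator actually logs:

```python
# Sketch: Track C metrics from per-request timing records.
# Field names (start, first_token, end, n_tokens) are assumptions.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestTiming:
    start: float        # wall-clock seconds when the request was sent
    first_token: float  # when the first output token arrived
    end: float          # when the last output token arrived
    n_tokens: int       # output tokens generated

def summarize(timings: list[RequestTiming]) -> dict:
    ttfts = [t.first_token - t.start for t in timings]
    # Inter-token latency: time after the first token, per remaining token.
    itls = [(t.end - t.first_token) / max(t.n_tokens - 1, 1) for t in timings]
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "ttft_p95_s": quantiles(ttfts, n=20)[-1],  # last of 19 cut points = p95
        "itl_mean_s": sum(itls) / len(itls),
        "throughput_tok_s": sum(t.n_tokens for t in timings) / wall,
    }

# Toy data standing in for a real 50-concurrent-user run:
demo = [RequestTiming(0.0, 0.8 + 0.01 * i, 5.0 + 0.05 * i, 200) for i in range(50)]
print(summarize(demo))
```

Record the same three numbers for every serving stack so the table in Session C is directly comparable.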
Part 2-Make it apples-to-apples (45 min)¶
For a fair comparison:
- Same hardware.
- Same input set.
- Same evaluation criteria.
- Same allowed budget.
Document conditions so a reader can reproduce.
Part 3-Pre-register expectations (30 min)¶
Before running, write down what you expect:
- "I expect my framework wins on X by Y%."
- "I expect to lose on Z because the incumbent has more features."
Pre-registration reduces motivated reasoning later.
Document in comparison/setup.md.
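To make the pre-registration concrete, here's a minimal skeleton for comparison/setup.md; the section names and Track A placeholders are suggestions, not a required format:

```markdown
# Comparison setup (M07-W03)

## Scenario
Score 30 agent trajectories with a custom rubric.

## Conditions
- Hardware: ...
- Input set: ...
- Evaluation criteria: ...
- Budget: ...

## Pre-registered expectations
- I expect to win on <X> by <Y>%.
- I expect to lose on <Z> because <reason>.
```

Commit this file before Session B so the expectations are timestamped ahead of the results.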
Output of Session A¶
- comparison/setup.md with scenario, conditions, and pre-registered expectations.
Session B-Run the comparison¶
Goal: Run both sides. Capture all metrics. Don't fudge.
Part 1-Run your project (75 min)¶
Same configuration as production. Capture metrics. Save outputs for post-hoc inspection.
Part 2-Run the incumbent (75 min)¶
Same scenario. Don't half-effort the incumbent-it should be configured well.
Part 3-Capture surprising failures (30 min)¶
For both sides: pick 3 cases where output was surprising. Save these for the writeup; surprising failures are the most informative.
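A lightweight way to capture those cases as you spot them is an append-only JSONL log. A minimal sketch-the record fields and the surprises.jsonl filename are assumptions, not a fixed schema:

```python
# Sketch: log surprising cases for the Session C writeup.
# Record fields and file name are illustrative assumptions.
import json
from pathlib import Path

def log_surprise(path: Path, system: str, case_id: str, note: str, output: str) -> None:
    """Append one surprising case to a JSONL file."""
    record = {"system": system, "case_id": case_id, "note": note, "output": output}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_surprise(Path("surprises.jsonl"), "mine", "traj-07",
             "judge flipped its verdict on rerun", "...")
log_surprise(Path("surprises.jsonl"), "incumbent", "traj-19",
             "scored an empty trajectory 9/10", "...")
```

One record per surprise, tagged with which system produced it, keeps both sides' failures in one place for the writeup.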
Output of Session B¶
- Comparison data: metrics + saved outputs.
Session C-Analyze + write up + supporting paper¶
Goal: Analyze honestly. Write up. Read one supporting paper.
Part 1-Analyze (60 min)¶
Build the comparison table. Compute differences with bootstrap CIs where applicable.
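Since both systems score the same items, a paired percentile bootstrap over items is the simplest CI. A sketch with stdlib only-the toy scores are placeholders for your per-item results:

```python
# Sketch: paired percentile-bootstrap CI for mean(mine) - mean(theirs),
# resampling items. Toy scores below are placeholders.
import random
from statistics import mean

def bootstrap_diff_ci(mine, theirs, n_boot=10_000, alpha=0.05, seed=0):
    """95% percentile-bootstrap CI for the mean per-item score difference."""
    assert len(mine) == len(theirs)
    rng = random.Random(seed)
    n = len(mine)
    diffs = sorted(
        mean(mine[j] - theirs[j] for j in (rng.randrange(n) for _ in range(n)))
        for _ in range(n_boot)
    )
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy example: 30 per-item pass/fail scores for each system.
mine = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 3
theirs = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0] * 3
print(bootstrap_diff_ci(mine, theirs))
```

If the CI straddles zero, say so in the writeup-"no detectable difference on this scenario" is itself a finding.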
Three possible outcomes:
1. Clear differentiator found. Document the niche; your project owns it.
2. Niche too narrow. Decide: scope down further or pivot.
3. Incumbent dominates everywhere. Honest pivot needed-go to a different project, or contribute to the incumbent instead.
Part 2-Write up in DESIGN.md (45 min)¶
Add a "Comparison vs <incumbent>" section. Track A example:
## Comparison vs Inspect AI (M07-W03)
### Scenario
30 agent trajectories scored against a custom rubric for tool-call appropriateness.
### Setup
- Hardware: same machine, fp16 throughout.
- Input set: shared 30 examples committed to evals/agent_trajs/.
- Judge: claude-opus-4-7.
### Results
| Metric | Mine | Inspect AI |
|---|---|---|
| Score-vs-human kappa | 0.72 | 0.69 |
| Time to define a new scorer | 12 min | 25 min |
| Cost per eval run | $0.18 | $0.21 |
| Dashboard quality | basic | excellent |
### Verdict
My project wins on time-to-write-custom-scorer (~2× faster) due to its more declarative API.
Loses on dashboard ergonomics. Niche I'll claim: "Inspect-AI-quality evals with cleaner author-time API for trajectory-level scorers."
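The "Score-vs-human kappa" row in the example table is Cohen's kappa: raw agreement corrected for chance agreement. A self-contained sketch-the pass/fail labels are illustrative:

```python
# Sketch: Cohen's kappa between scorer labels and human labels.
# Label values below are illustrative placeholders.
from collections import Counter

def cohens_kappa(a, b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

human  = ["pass", "pass", "fail", "pass", "fail", "pass"]
scorer = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(human, scorer), 2))  # → 0.67
```

Run this over your saved outputs and the human labels from W02 to fill in the kappa row for both systems.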
Part 3-Supporting paper (45 min)¶
Read one more paper:
- (A) HELM (arxiv.org/abs/2211.09110)-methodology depth.
- (B) AutoGen / multi-agent paper-for context on framework variety.
- (C) Speculative Decoding (arxiv.org/abs/2211.17192) or SGLang (arxiv.org/abs/2312.07104).
Notes added.
Output of Session C¶
- Analysis + comparison writeup committed to DESIGN.md.
- Supporting paper notes added.
End-of-week artifact¶
- Comparison setup + data
- DESIGN.md updated with comparison verdict
- Supporting paper notes
End-of-week self-assessment¶
- I have an honest verdict on my project vs incumbents.
- If pivot is needed, I've named it.
- I have a defendable niche (or know I don't).
Common failure modes for this week¶
- Stacking the comparison in your favor. Apples-to-apples or it's worthless.
- Hiding the negative result. "Incumbent wins" is honest and clarifying.
- Pivoting too late. This week is when pivots are cheap.
What's next (preview of M07-W04)¶
Track milestone + first specialty post. The repo gets a v0.2 release; the post announces your specialty publicly.