Month 7-Week 3: Head-to-head vs incumbent + supporting paper

Week summary

  • Goal: Run an honest head-to-head comparison of your track project against an established incumbent. Surface either a clear differentiator or a needed pivot.
  • Time: ~9 h over 3 sessions.
  • Output: Comparison data; comparison writeup in DESIGN.md; supporting paper notes.

Why this week matters

A track project that's "almost as good as Inspect AI / vLLM / etc." is worth nothing: there's no reason for anyone to use yours over the incumbent. You need either a clear differentiator (a specific niche you own) or a pivot. Better to find that out in week 3 than in week 11.

Prerequisites

  • M07-W01 + W02 complete.
  • Track project at usable v0.1+.

Session plan

  • Session A-Tue/Wed evening (~3 h): design the comparison.
  • Session B-Sat morning (~3.5 h): run the comparison.
  • Session C-Sun afternoon (~2.5 h): analyze + write up + supporting paper.

Session A-Design the comparison

Goal: Define exactly what you're measuring and how to make it apples-to-apples.

Part 1-Scenario selection (45 min)

Pick 1-2 representative scenarios that exercise the dimension where your project might win.

Track A:

  • Scenario: score 30 agent trajectories using a custom rubric.
  • Compare: your framework vs Inspect AI's model_graded scorer.
  • Metrics: agreement with humans (see the kappa sketch after this list), time to write a custom scorer, cost per eval, dashboard quality.

Track B:

  • Scenario: resolve 10 SWE-bench-Lite issues.
  • Compare: your agent vs a published baseline (e.g., Aider, SWE-agent).
  • Metrics: success rate, average cost per issue, average wall-clock time.

Track C:

  • Scenario: serve a 70B model with 50 concurrent users at p95 latency < 5 s.
  • Compare: vLLM vs SGLang vs TensorRT-LLM (one configuration of each).
  • Metrics: TTFT p95, ITL, throughput, GPU memory, ease of setup.
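
For Track A, the "agreement with humans" metric can be computed as Cohen's kappa between judge scores and human labels. A minimal sketch, assuming you have paired per-trajectory labels (the label lists below are illustrative):

```python
# Sketch: agreement between an LLM-judge scorer and human labels (Track A).
# Assumes paired labels per trajectory; scikit-learn's cohen_kappa_score
# handles both binary and multi-class rubric labels.
from sklearn.metrics import cohen_kappa_score

human_labels = ["pass", "fail", "pass", "pass", "fail"]   # illustrative
judge_labels = ["pass", "fail", "fail", "pass", "fail"]   # illustrative

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Score-vs-human kappa: {kappa:.2f}")
```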

Part 2-Make it apples-to-apples (45 min)

For a fair comparison:

  • Same hardware.
  • Same input set.
  • Same evaluation criteria.
  • Same allowed budget.

Document conditions so a reader can reproduce.

Part 3-Pre-register expectations (30 min)

Before running, write down what you expect:

  • "I expect my framework wins on X by Y%."
  • "I expect to lose on Z because the incumbent has more features."

Pre-registration reduces motivated reasoning later.

Document in comparison/setup.md.
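
One way to keep the conditions and the pre-registered expectations in a single reproducible place is to generate comparison/setup.md from a small script committed alongside it. A minimal sketch; every field value below is a placeholder you'd fill in for your own track:

```python
# Sketch: write comparison/setup.md with conditions + pre-registered expectations.
# All values are placeholders; the point is to commit this file BEFORE running.
from pathlib import Path

setup = {
    "scenario": "Score 30 agent trajectories with a custom rubric",
    "hardware": "same machine for both sides, fp16 throughout",
    "input_set": "evals/agent_trajs/ (30 shared examples, committed)",
    "budget": "$5 total API spend per side",
    "expectations": [
        "I expect my framework wins on scorer-authoring time by ~2x.",
        "I expect to lose on dashboard quality; the incumbent has more features.",
    ],
}

lines = ["# Comparison setup", ""]
for key in ("scenario", "hardware", "input_set", "budget"):
    lines.append(f"- **{key}**: {setup[key]}")
lines += ["", "## Pre-registered expectations", ""]
lines += [f"- {e}" for e in setup["expectations"]]

Path("comparison").mkdir(exist_ok=True)
Path("comparison/setup.md").write_text("\n".join(lines) + "\n")
```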

Output of Session A

  • comparison/setup.md with scenario, conditions, pre-registered expectations.

Session B-Run the comparison

Goal: Run both sides. Capture all metrics. Don't fudge.

Part 1-Run your project (75 min)

Same configuration as production. Capture metrics. Save outputs for post-hoc inspection.
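
A minimal harness sketch for capturing per-case metrics and raw outputs so they can be inspected later. The file paths and the run_case function are stand-ins for whatever entry point your project (or the incumbent) actually exposes:

```python
# Sketch: run one side of the comparison, save per-case metrics + raw outputs.
# `run_case` and the file paths are hypothetical stand-ins.
import json
import time
from pathlib import Path

def run_case(case):
    # Replace with a real call into your project (or the incumbent).
    return {"score": 1.0, "cost_usd": 0.006, "raw_output": "..."}

cases = json.loads(Path("evals/agent_trajs/cases.json").read_text())
out_path = Path("comparison/mine_results.jsonl")
out_path.parent.mkdir(exist_ok=True)

with out_path.open("w") as f:
    for case in cases:
        start = time.monotonic()
        result = run_case(case)
        result["case_id"] = case["id"]
        result["wall_clock_s"] = round(time.monotonic() - start, 2)
        f.write(json.dumps(result) + "\n")  # one line per case, easy to diff later
```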

Part 2-Run the incumbent (75 min)

Same scenario. Don't put half the effort into the incumbent; it should be configured just as carefully as your own project.

Part 3-Capture surprising failures (30 min)

For both sides: pick 3 cases where output was surprising. Save these for the writeup; surprising failures are the most informative.
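
If both sides produced per-case scores, one quick way to surface candidates is to rank cases by how much the two runs disagree. A sketch, assuming the JSONL result format from Part 1 of this session (file names are illustrative):

```python
# Sketch: find the cases where the two sides disagree most (writeup candidates).
import json
from pathlib import Path

def load(path):
    rows = map(json.loads, Path(path).read_text().splitlines())
    return {r["case_id"]: r for r in rows}

mine = load("comparison/mine_results.jsonl")
theirs = load("comparison/incumbent_results.jsonl")

gaps = sorted(
    ((abs(mine[c]["score"] - theirs[c]["score"]), c) for c in mine if c in theirs),
    reverse=True,
)
for gap, case_id in gaps[:3]:
    print(case_id, gap, mine[case_id]["raw_output"][:80])
```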

Output of Session B

  • Comparison data: metrics + saved outputs.

Session C-Analyze + write up + supporting paper

Goal: Analyze honestly. Write up. Read one supporting paper.

Part 1-Analyze (60 min)

Build the comparison table. Compute differences with bootstrap CIs where applicable.
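
A minimal percentile-bootstrap sketch for the difference in a paired per-case metric (e.g., score), assuming both runs cover the same cases; the score arrays below are illustrative:

```python
# Sketch: percentile bootstrap CI for the mean paired difference (mine - incumbent).
import numpy as np

rng = np.random.default_rng(0)
mine = np.array([0.8, 0.9, 0.7, 1.0, 0.6])       # illustrative paired per-case scores
incumbent = np.array([0.7, 0.9, 0.6, 0.9, 0.7])

diffs = mine - incumbent
boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean diff = {diffs.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```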

Three possible outcomes:

  1. Clear differentiator found. Document the niche; your project owns it.
  2. Niche too narrow. Decide: scope down further or pivot.
  3. Incumbent dominates everywhere. Honest pivot needed: move to a different project, or contribute to the incumbent instead.

Part 2-Write up in DESIGN.md (45 min)

Add a "Comparison vs " section to DESIGN.md:

## Comparison vs Inspect AI (M07-W03)

### Scenario
30 agent trajectories scored against a custom rubric for tool-call appropriateness.

### Setup
- Hardware: same machine, fp16 throughout.
- Input set: shared 30 examples committed to evals/agent_trajs/.
- Judge: claude-opus-4-7.

### Results
| Metric | Mine | Inspect AI |
|---|---|---|
| Score-vs-human kappa | 0.72 | 0.69 |
| Time to define a new scorer | 12 min | 25 min |
| Cost per eval run | $0.18 | $0.21 |
| Dashboard quality | basic | excellent |

### Verdict
My project wins on time-to-write-custom-scorer (~2× faster) due to its more declarative API.
Loses on dashboard ergonomics. Niche I'll claim: "Inspect-AI-quality evals with cleaner author-time API for trajectory-level scorers."

Part 3-Supporting paper (45 min)

Read one more paper:

  • (A) HELM (arxiv.org/abs/2211.09110): methodology depth.
  • (B) AutoGen / a multi-agent paper: context on framework variety.
  • (C) Speculative Decoding (arxiv.org/abs/2211.17192) or SGLang (arxiv.org/abs/2312.07104).

Take notes as you read; they are committed as part of this week's output.

Output of Session C

  • Analysis + comparison writeup committed to DESIGN.md.
  • Supporting paper notes added.

End-of-week artifact

  • Comparison setup + data
  • DESIGN.md updated with comparison verdict
  • Supporting paper notes

End-of-week self-assessment

  • I have an honest verdict on my project vs incumbents.
  • If pivot is needed, I've named it.
  • I have a defendable niche (or know I don't).

Common failure modes for this week

  • Stacking the comparison in your favor. Apples-to-apples or it's worthless.
  • Hiding the negative result. "Incumbent wins" is honest and clarifying.
  • Pivoting too late. This week is when pivots are cheap.

What's next (preview of M07-W04)

Track milestone + first specialty post. The repo gets a v0.2 release; the post announces your specialty publicly.