Month 10-Week 1: Capstone build kickoff

Week summary

Goal: Begin the Q4 capstone. Repo scaffolding done. Architecture sketched. Eval target chosen and runnable. First end-to-end feature working.
Time: ~10 h over 3 sessions.
Output: Capstone repo with DESIGN, architecture, eval pipeline, first feature.

Why this week matters

The capstone is the artifact you'll point to for years. This week is about getting it scoped right and started clean. Avoid the "rebuild from scratch" trap: extend your strongest existing code; don't restart.

Prerequisites

M09-W04 complete with Q4 capstone DESIGN.md.
Capstone repo scaffolded.

Recommended cadence

Session A-Tue/Wed evening (~3.5 h): scaffolding + DESIGN refinement
Session B-Sat morning (~4 h): architecture + first feature
Session C-Sun afternoon (~2.5 h): eval target wired + paper refresh

Session A-Scaffolding + DESIGN refinement

Goal: Repo polished. DESIGN.md sharper than the M09-W04 version.

Part 1-Boilerplate (60 min)

If not done in M09, add: README.md (placeholder), LICENSE, CONTRIBUTING.md, CI workflow, tests directory, examples directory.

Quality matters here: sloppy scaffolding signals a sloppy project.
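The boilerplate step can be scripted so it is repeatable. The sketch below creates whichever scaffold files are missing and never overwrites existing ones. The file contents and the `.github/workflows/ci.yml` path are assumptions (the path presumes GitHub Actions, and the placeholder body is not a working workflow). Adjust both to your own setup.

```python
from pathlib import Path

# Hypothetical placeholder contents; replace with real text for your project.
# The CI path assumes GitHub Actions and the body is only a stub.
SCAFFOLD_FILES = {
    "README.md": "# Capstone\n\nPlaceholder. See DESIGN.md.\n",
    "LICENSE": "TODO: choose a license.\n",
    "CONTRIBUTING.md": "# Contributing\n\nTODO.\n",
    ".github/workflows/ci.yml": "# TODO: define the CI workflow\n",
    "tests/.gitkeep": "",
    "examples/.gitkeep": "",
}


def scaffold(root: Path) -> list[Path]:
    """Create any missing scaffold files under root; never overwrite existing ones."""
    created = []
    for rel, content in SCAFFOLD_FILES.items():
        path = root / rel
        if path.exists():
            continue  # keep whatever is already there
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
        created.append(path)
    return created
```

Calling `scaffold(Path("."))` from the repo root returns the list of files it created, so a second run returns an empty list.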

Part 2-DESIGN refinement (90 min)

Re-read your M09 capstone DESIGN. Sharpen:

- Make the problem statement more specific.
- Add 2 references to incumbents (compare what's missing).
- Add measurable success criteria.
- Add an anchor experiment with a predicted result.
- List risks.
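One way to keep success criteria and the anchor experiment from drifting into wishlist territory is to write them down as data that code can check. This is a minimal sketch. The class names, metric names, and numbers are all hypothetical and only show the shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    """A measurable target, e.g. 'pass_rate >= 0.30' (numbers are illustrative)."""
    metric: str
    target: float
    higher_is_better: bool = True

    def met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


@dataclass(frozen=True)
class AnchorExperiment:
    """An experiment whose result you predict before running it."""
    name: str
    predicted: float
    tolerance: float

    def surprised(self, observed: float) -> bool:
        """True when the observed result falls outside the predicted band."""
        return abs(observed - self.predicted) > self.tolerance
```

For example, `SuccessCriterion("pass_rate", 0.30).met(0.25)` is `False`, and an anchor experiment predicted at 0.20 with tolerance 0.05 counts an observed 0.30 as a surprise worth investigating.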

Part 3-First commit (60 min)

Push the polished scaffolding + DESIGN. Tag v0.0.1.

Output of Session A

Polished capstone repo with strong DESIGN.

Session B-Architecture + first feature

Goal: Module structure committed. First end-to-end feature runnable.

Part 1-Module sketch (45 min)

<capstone>/
├── src/<capstone>/
│   ├── __init__.py
│   ├── core.py          # main interfaces
│   ├── <feature1>.py
│   ├── <feature2>.py
├── tests/
├── examples/
├── docs/
└── pyproject.toml

Define the core.py interfaces as abstract base classes or Protocols, and type-annotate them.
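As a sketch of what Protocol-based interfaces in core.py might look like, the block below borrows Track A's Task, Solver, and Scorer vocabulary. The method names and the two toy implementations are assumptions for illustration, not a prescribed API.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Task:
    prompt: str
    expected: str


@runtime_checkable
class Solver(Protocol):
    def solve(self, task: Task) -> str: ...


@runtime_checkable
class Scorer(Protocol):
    def score(self, task: Task, output: str) -> float: ...


class EchoSolver:
    """Toy solver that returns the prompt unchanged (illustrative only)."""

    def solve(self, task: Task) -> str:
        return task.prompt


class ExactMatch:
    """Scores 1.0 when the output equals the expected answer, else 0.0."""

    def score(self, task: Task, output: str) -> float:
        return 1.0 if output.strip() == task.expected.strip() else 0.0


def run_one(task: Task, solver: Solver, scorer: Scorer) -> float:
    """Run a single task through solver and scorer."""
    return scorer.score(task, solver.solve(task))
```

Protocols use structural typing, so `EchoSolver` satisfies `Solver` without inheriting from it. Abstract base classes are the better fit when implementations should share behavior or when you want instantiation to fail if a method is missing.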

Part 2-Build first feature (135 min)

The smallest end-to-end thing that works. Examples:

- (Track A) Task → Solver → Scorer pipeline runnable on 5 examples.
- (Track B) Agent that reads a benchmark task and produces an output (low quality is fine; focus on pipeline).
- (Track C) Benchmark harness that captures TTFT / throughput on a single config.
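For the Track C example, a first harness can be as small as timing a token stream. The sketch below assumes the model is exposed as an iterable that yields tokens lazily. `fake_stream` is a hypothetical stand-in for a real streaming client, and throughput here is end-to-end tokens per second (it includes the wait for the first token).

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class StreamStats:
    ttft_s: float   # time to first token, in seconds
    total_s: float  # time from request start to last token
    n_tokens: int

    @property
    def tokens_per_s(self) -> float:
        # End-to-end throughput; includes the time spent waiting for token 1.
        return self.n_tokens / self.total_s if self.total_s > 0 else 0.0


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamStats:
    """Consume a lazily produced token stream and record TTFT and total time."""
    start = clock()
    first = None
    n = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now
        n += 1
    end = clock()
    if first is None:
        raise ValueError("stream produced no tokens")
    return StreamStats(ttft_s=first - start, total_s=end - start, n_tokens=n)


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model's streaming API."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"
```

A call such as `measure_stream(fake_stream(20, 0.01))` exercises the whole path. The injectable `clock` argument lets tests run deterministically without sleeping.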

Part 3-Push (15 min)

Commit. CI green.

Output of Session B

Module structure committed.
First feature working end-to-end.

Session C-Eval target + paper refresh

Goal: Eval target wired and runnable. Refresh top 3 papers from the year.

Part 1-Eval target (75 min)

Pick the public benchmark or eval suite the capstone will be measured on:

- (A) A specific eval task in your eval framework, with a target metric.
- (B) SWE-bench Lite (a 50-issue subset of its 300 instances), GAIA, or τ-bench.
- (C) A standardized inference benchmark (your own, well-defined).

Get one full run working end-to-end. The score doesn't matter yet; the pipeline does.
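A "wired" eval pipeline can be sketched as one function that runs every example, records per-example results to disk, and returns a summary metric. Everything below is an assumption for illustration: the example set, the exact-match rule, and the JSONL layout. A real benchmark would supply its own loader and scorer.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical 5-example eval set; a real run would load the benchmark's data.
EXAMPLES = [
    {"id": "ex1", "input": "2+2", "expected": "4"},
    {"id": "ex2", "input": "3+3", "expected": "6"},
    {"id": "ex3", "input": "1+1", "expected": "2"},
    {"id": "ex4", "input": "5+0", "expected": "5"},
    {"id": "ex5", "input": "4+4", "expected": "8"},
]


def run_eval(solve: Callable[[str], str], examples: list[dict], out_path: Path) -> dict:
    """Run solve on each example, write per-example rows as JSONL, return a summary."""
    rows = []
    for ex in examples:
        try:
            output, error = solve(ex["input"]), None
        except Exception as exc:  # one failing example should not kill the run
            output, error = "", repr(exc)
        rows.append({
            "id": ex["id"],
            "output": output,
            "correct": output == ex["expected"],
            "error": error,
        })
    with out_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    n = len(rows)
    return {
        "n": n,
        "accuracy": sum(r["correct"] for r in rows) / n if n else 0.0,
        "errors": sum(r["error"] is not None for r in rows),
    }
```

Even a trivial baseline such as `run_eval(lambda s: "4", EXAMPLES, Path("results.jsonl"))` proves the path from input to score works, which is the point of this session.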

Part 2-Re-read top 3 papers (60 min)

Pick the 3 most useful papers from your year. Re-read. They will hit differently now.

Likely candidates:

- Foundational paper for your track.
- A frontier paper (DeepSeek-V3 / R1).
- A method paper (DPO, FSDP, ReAct, vLLM, etc.).

Add 100-word "what I see now I didn't before" notes to each.

Part 3-Push + LEARNING_LOG (15 min)

Output of Session C

Eval target wired.
Refresh notes on 3 papers.

End-of-week artifact

Capstone repo with DESIGN, scaffolding, first feature
Module structure committed
Eval pipeline runnable
3 refresh-paper notes

End-of-week self-assessment

My capstone has a measurable success criterion.
First feature runs end-to-end.
Eval pipeline is wired (not just planned).

Common failure modes for this week

Over-scoping the first feature. Smallest-possible thing first.
DESIGN as wishlist. Commitments, not aspirations.
No eval pipeline. Without it, you're shipping by feel.

What's next (preview of M10-W02)

Build sprint week 1. 3-5 substantive features. Eval each.