Month 7, Week 1: Specialty kickoff (foundational paper + design doc + first commit)

Week summary

  • Goal: Begin your Q3 specialty track. Read your foundational paper deeply (with notes). Set up the specialty repo. Write DESIGN.md. Make first non-trivial commit.
  • Time: ~10 h over 3 sessions.
  • Output: Specialty repo with DESIGN.md, paper notes, first non-trivial commit.
  • Sequences relied on (track-specific): 12-evaluation-systems (A), 11-agents (B), 14-inference-serving (C).

Why this week matters

Q3 is the depth quarter, and whether the depth happens depends largely on this week. Pick the foundational paper for your track, read it deeply rather than skimming it, and start a repo with a written design. That is what differentiates "I worked on agents this quarter" from "I built a measurable agent system this quarter."

Prerequisites

  • Q3 track committed (in Q3_TRACK_DECISION.md).
  • Q1 + Q2 complete.

Session plan

  • Session A, Tue/Wed evening (~3 h): foundational paper deep read
  • Session B, Sat morning (~4 h): repo setup + DESIGN.md
  • Session C, Sun afternoon (~3 h): first commit + supporting paper

Session A: Foundational paper, deeply

Goal: Read your track's foundational paper twice, with notes. Understand it well enough to explain it to a colleague.

Part 1: First pass (90 min)

Track A, Evals:

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (arxiv.org/abs/2306.05685).
  • Sections 1–4 deeply. The methodology is the contribution.

Track B, Agents:

  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arxiv.org/abs/2310.06770).
  • Or ReAct (arxiv.org/abs/2210.03629) if you want a fundamental building block.
  • Sections 1–4 deeply.

Track C, Inference Infra:

  • The vLLM paper: Efficient Memory Management for Large Language Model Serving with PagedAttention (arxiv.org/abs/2309.06180).
  • Sections 1–5 deeply.

First pass: read for understanding. Don't take detailed notes yet; just orient.

Part 2: Second pass + notes (75 min)

Second pass with notes. Per section:

  • Key claim.
  • Method.
  • Result (with numbers).
  • Limitation acknowledged by authors.

Save the notes in your repo as paper_notes/<paper-shortname>.md, about 500 words.
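One way to keep the second pass honest is to start from a skeleton that repeats the per-section checklist above for every section of the paper. The sketch below is only an illustration: `notes_skeleton`, the `FIELDS` list, and the section names are hypothetical, not part of any required tooling.

```python
# Per-section prompts, copied from the note-taking checklist above.
FIELDS = [
    "Key claim",
    "Method",
    "Result (with numbers)",
    "Limitation acknowledged by authors",
]


def notes_skeleton(title: str, sections: list[str]) -> str:
    """Return a Markdown skeleton with one block of prompts per paper section."""
    lines = [f"# Notes: {title}", ""]
    for section in sections:
        lines.append(f"## {section}")
        lines.extend(f"- {field}:" for field in FIELDS)
        lines.append("")
    return "\n".join(lines)


# Hypothetical section names; use the headings of the paper you are reading.
skeleton = notes_skeleton("Example paper", ["1 Introduction", "2 Method"])
print(skeleton)
```

Paste the output into paper_notes/<paper-shortname>.md and fill in each prompt during the second pass.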

Part 3: Three things you'd try (30 min)

What 3 specific experiments could you run that follow from this paper? Frame each as hypothesis + method + metric. Examples:

  • (Track A) "Pairwise judges produce more reliable scores than pointwise on agent trajectories. Test: pairwise vs pointwise on 50 examples; agreement with humans."
  • (Track B) "Adding a self-reflection step lifts SWE-bench score by ≥3 pp. Test: with/without on 30 issues."
  • (Track C) "Speculative decoding lifts throughput more than continuous batching for our workload. Test: ablate each."

These hypotheses become Q3 milestones.
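To make the "metric" part of a hypothesis concrete, here is a minimal sketch of the Track A metric (agreement with humans) for a pairwise and a pointwise judge. The verdict lists are made-up illustration data, and `agreement_rate` is a hypothetical helper; plain agreement is just one possible choice, and a chance-corrected statistic may suit your data better.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must be the same length")
    if not human:
        raise ValueError("need at least one labeled example")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(human)


# Made-up verdicts on 5 examples ("A" or "B" = which response was preferred).
human = ["A", "B", "A", "A", "B"]
pairwise_judge = ["A", "B", "A", "B", "B"]
pointwise_judge = ["A", "A", "A", "B", "B"]

pairwise_agreement = agreement_rate(pairwise_judge, human)
pointwise_agreement = agreement_rate(pointwise_judge, human)
print(f"pairwise agreement:  {pairwise_agreement:.2f}")
print(f"pointwise agreement: {pointwise_agreement:.2f}")
```

Writing the metric as code before running the experiment forces the hypothesis to name exactly what will be compared.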

Output of Session A

  • Paper notes file in repo.
  • Three hypothesis statements.

Session B: Repo setup + DESIGN.md

Goal: New track repo with proper boilerplate. Write DESIGN.md (1500+ words) committing to scope.

Part 1: Repo setup (60 min)

mkdir <track-repo> && cd <track-repo>
uv init
git init
mkdir -p src tests examples docs

Boilerplate:

  • README.md: placeholder pointing to DESIGN.md.
  • LICENSE: MIT or Apache 2.0.
  • pyproject.toml.
  • .github/workflows/ci.yml: lint + tests on push.
  • CONTRIBUTING.md: even if it's just "open an issue first."

Part 2: DESIGN.md (105 min)

Structure:

# DESIGN: <project name>

## Problem
[Specific problem this addresses. 1-2 paragraphs. Cite who has it.]

## Why existing tools don't quite fit
[Honest comparison: Inspect AI / Braintrust / vLLM / etc. Don't dismiss them; note what's missing.]

## Goals
- [Specific outcome 1]
- [Specific outcome 2]
- [Specific outcome 3]

## Non-goals
- [What this is NOT: scope discipline]
- [What this WON'T do]

## Approach
[The technical approach. Architecture sketch. Key design decisions and rationale.]

## Success criteria
- Quantitative: [e.g., "passes Inspect AI's example tasks with ≤5% performance overhead"]
- Qualitative: [e.g., "an outsider can write a custom scorer in <1 hour"]

## Anchor experiment
[The headline result this project will produce.]

## Roadmap (Q3 weeks)
- M07-W01: foundations done (you're here)
- M07-W02: first non-trivial feature
- M07-W03: comparison vs incumbent
- M07-W04: track milestone + first specialty post
- M08: universal infra + fine-tuning weeks (parallel learning)
- M09: track final push, OSS PR, polish

## What I'm reading this quarter
- [paper 1]
- [paper 2]
- [paper 3]
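A small check can guard the structure of the template above, for example as part of the test suite that CI runs. This is a sketch under assumptions: `check_design` and `REQUIRED_SECTIONS` are hypothetical names, the section list mirrors the template headings, and the 1500-word floor comes from the session goal.

```python
import re

# Level-2 headings taken from the DESIGN.md template above.
REQUIRED_SECTIONS = [
    "Problem",
    "Why existing tools don't quite fit",
    "Goals",
    "Non-goals",
    "Approach",
    "Success criteria",
    "Anchor experiment",
    "Roadmap (Q3 weeks)",
    "What I'm reading this quarter",
]


def check_design(text: str, min_words: int = 1500) -> list[str]:
    """Return a list of problems found in a DESIGN.md draft (empty if none)."""
    headings = {
        m.group(1).strip()
        for m in re.finditer(r"^##[ \t]+(.+)$", text, re.MULTILINE)
    }
    problems = [
        f"missing section: {section}"
        for section in REQUIRED_SECTIONS
        if section not in headings
    ]
    n_words = len(text.split())
    if n_words < min_words:
        problems.append(f"too short: {n_words} words, need {min_words}+")
    return problems


# A deliberately incomplete draft to show the output.
draft = "# DESIGN\n\n## Problem\nToo short.\n"
problems = check_design(draft)
for problem in problems:
    print(problem)
```

In a real repo you would read the text with `pathlib.Path("DESIGN.md").read_text()` instead of using an inline string.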

Part 3: Commit + push (15 min)

Commit, then push the empty-but-designed repo to a public remote. The commitment goes on the record.

Output of Session B

  • Public repo with DESIGN.md, README placeholder, LICENSE.

Session C: First non-trivial commit + supporting paper

Goal: Ship something runnable. Read one supporting paper.

Part 1: First feature (90 min)

The smallest end-to-end feature that proves the project is real.

Track A, Evals:

  • A Task class wrapping a dataset + a single scorer.
  • Run on 5 examples and produce a report.
  • Comparison to Inspect AI's equivalent in your DESIGN.md.
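For Track A, the first feature might look roughly like the sketch below: a `Task` that pairs a dataset with one scorer and returns a small report. Every name here (`Example`, `Task`, `exact_match`, `toy_model`) is hypothetical and chosen for illustration; this is not Inspect AI's API, and the toy model stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    prompt: str
    target: str


# A scorer maps (model output, target) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_match(output: str, target: str) -> float:
    """Case-insensitive exact match after trimming whitespace."""
    return 1.0 if output.strip().lower() == target.strip().lower() else 0.0


@dataclass
class Task:
    name: str
    dataset: list[Example]
    scorer: Scorer

    def run(self, model: Callable[[str], str]) -> dict:
        """Score the model on every example and summarize the results."""
        scores = [self.scorer(model(ex.prompt), ex.target) for ex in self.dataset]
        mean = sum(scores) / len(scores) if scores else 0.0
        return {"task": self.name, "n": len(scores), "scores": scores, "mean": mean}


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call: a lookup table with one wrong answer."""
    answers = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "4", "5+5": "11"}
    return answers.get(prompt, "")


task = Task(
    name="toy-arithmetic",
    dataset=[
        Example("2+2", "4"),
        Example("3*3", "9"),
        Example("10-7", "3"),
        Example("8/2", "4"),
        Example("5+5", "10"),
    ],
    scorer=exact_match,
)
report = task.run(toy_model)
print(report)
```

Keeping the scorer a plain function makes it easy to swap in a different one later without touching `Task`.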

Track B, Agents:

  • A baseline ReAct agent on 5 SWE-bench issues. Even if score is 0/5, the pipeline works.
  • Output: trajectory + final patch.
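For Track B, the core of a baseline ReAct agent is a loop that alternates model steps (thought + action) with tool observations until the model finishes. The sketch below shows only that loop, under loud assumptions: the `Action: name[arg]` format, `scripted_model`, and the `read_file` tool are hypothetical stand-ins, and no real LLM client or SWE-bench harness is involved.

```python
import re
from typing import Callable, Optional

Model = Callable[[str], str]  # prompt -> "Thought: ...\nAction: name[arg]"
Tool = Callable[[str], str]   # tool argument -> observation text

# Assumed action format for this sketch: "Action: tool_name[argument]".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]", re.DOTALL)


def react_loop(
    model: Model, tools: dict[str, Tool], task: str, max_steps: int = 5
) -> tuple[Optional[str], list[str]]:
    """Run thought/action/observation steps; return (final answer, trajectory)."""
    trajectory = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(trajectory))
        trajectory.append(step)
        match = ACTION_RE.search(step)
        if match is None:
            break  # malformed step: stop and keep the trajectory for debugging
        name, arg = match.groups()
        if name == "finish":
            return arg, trajectory
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        trajectory.append(f"Observation: {observation}")
    return None, trajectory


def scripted_model(prompt: str) -> str:
    """Stand-in for an LLM: reads the file first, then proposes a fix."""
    if "Observation:" not in prompt:
        return "Thought: I should look at the failing file.\nAction: read_file[calc.py]"
    return "Thought: The operator is wrong.\nAction: finish[replace '-' with '+' in add()]"


tools = {"read_file": lambda path: "def add(a, b): return a - b"}
patch, trajectory = react_loop(scripted_model, tools, "add() returns wrong sums")
print(patch)
print(len(trajectory), "trajectory entries")
```

Returning the trajectory alongside the final patch matches the required output and makes 0/5 runs debuggable.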

Track C, Inference Infra:

  • vLLM serving Llama 3.1 8B with one config. Benchmark TTFT (time to first token) + throughput at concurrency=10.
  • Reproducible launch script.
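For Track C, the measurement logic can be sketched independently of the server. The harness below times TTFT and token throughput with a concurrency cap of 10. It is an assumption-laden illustration: `fake_stream` simulates a streaming response with sleeps and would need to be replaced by a real streaming client for your vLLM deployment, and all function names are hypothetical.

```python
import asyncio
import time
from typing import AsyncIterator, Callable, Optional

StreamFn = Callable[[str], AsyncIterator[str]]


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Simulated token stream; swap in a real streaming client here."""
    await asyncio.sleep(0.02)  # pretend prefill latency
    for token in ["hello", "world", "!"]:
        yield token
        await asyncio.sleep(0.005)  # pretend per-token decode latency


async def timed_request(stream_fn: StreamFn, prompt: str) -> tuple[Optional[float], int]:
    """Return (time to first token in seconds, number of tokens received)."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    async for _ in stream_fn(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start
        n_tokens += 1
    return ttft, n_tokens


async def benchmark(stream_fn: StreamFn, prompts: list[str], concurrency: int = 10) -> dict:
    """Run all prompts with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str) -> tuple[Optional[float], int]:
        async with sem:
            return await timed_request(stream_fn, prompt)

    wall_start = time.perf_counter()
    results = await asyncio.gather(*(bounded(p) for p in prompts))
    wall = time.perf_counter() - wall_start
    ttfts = sorted(t for t, _ in results if t is not None)
    total_tokens = sum(n for _, n in results)
    return {
        "requests": len(results),
        "ttft_p50_s": ttfts[len(ttfts) // 2] if ttfts else None,
        "throughput_tok_s": total_tokens / wall if wall > 0 else 0.0,
    }


report = asyncio.run(benchmark(fake_stream, [f"prompt {i}" for i in range(20)]))
print(report)
```

Because the TTFT clock starts inside the semaphore, time spent waiting for a free slot is not counted against the server.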

Part 2: Supporting paper (60 min)

A second paper related to your track. For example:

  • (A) RAGAS or HELM.
  • (B) Reflexion or Tree of Thoughts.
  • (C) FlashAttention or a speculative decoding paper.

Notes added to paper_notes/.

Part 3: Push + retro (30 min)

git tag v0.0.1
git push
git push --tags

Update LEARNING_LOG with: "What I learned in week 1 of Q3 specialty."

Output of Session C

  • v0.0.1 tagged with first feature.
  • Second paper notes.

End-of-week artifact

  • Foundational paper read deeply with notes
  • Specialty repo public with DESIGN.md
  • First non-trivial commit (v0.0.1 tag)
  • Second paper notes added

End-of-week self-assessment

  • I can explain the foundational paper to a colleague.
  • My DESIGN.md is specific enough that I can't pivot mid-quarter without rewriting it.
  • My first feature is runnable, not aspirational.

Common failure modes for this week

  • Skimming the foundational paper. Two passes minimum.
  • DESIGN.md as wishlist. It must be a commitment, not an aspiration.
  • First feature too ambitious. The smallest thing that's real is right.

What's next (preview of M07-W02)

Frontier paper (DeepSeek V3 technical report) + substantive build progress on the track.
