
Month 3-Week 4: Q1 capstone-modify a transformer, ship a retrospective

Week summary

  • Goal: Modify nanoGPT architecturally (one of: RMSNorm, RoPE, SwiGLU, GQA). Compare baseline vs modified with 3 seeds each. Publish a long-form Q1 retrospective blog post.
  • Time: ~10 h over 3 sessions.
  • Output: Modified nanoGPT with comparison; third public blog post (Q1 retrospective); Q1 retro document.
  • Sequences relied on: 08-transformers rungs 03, 04, 11; 05-pytorch rungs 07–10.

Why this week matters

Q1 closes here. Three months ago you'd never implemented backprop; this week you modify a transformer architecturally and run a research-style ablation. The Q1 retrospective post is the artifact you'll point to for years-the one that proves you took yourself from "engineer who calls APIs" to "engineer who modifies models." Done well, this post alone can draw the kind of attention that changes a career.

The architectural modification also matters technically. The 2024–2026 frontier models (Llama 3, Mistral, Qwen, DeepSeek V3) all build on some combination of RMSNorm, RoPE, SwiGLU, and attention variants such as GQA-departures from the original Vaswani et al. (2017) transformer. By implementing one, you understand why the field moved past the original paper.

Prerequisites

  • M03-W01 + W02 + W03 complete.
  • nanoGPT or your own transformer trains on something.

Session plan

  • Session A-Tue/Wed evening (~3 h): pick modification + read paper
  • Session B-Sat morning (~4 h): implement + train baseline & modified
  • Session C-Sun afternoon (~3 h): write & publish retrospective post

Session A-Pick modification, read the paper deeply

Goal: Choose one architectural modification. Read its paper. Understand what it does and why.

Part 1-Pick the modification (15 min)

Modification   What it changes                Used in          Difficulty
RMSNorm        LayerNorm without centering    Llama, Mistral   Easy (~30 min impl)
RoPE           Rotary positional embeddings   Llama, Qwen      Medium (~90 min impl)
SwiGLU         Activation function in FFN     Llama, PaLM      Easy (~30 min impl)
GQA            Grouped-query attention        Llama 2/3        Medium (~90 min impl)
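
For a sense of scale on the "Easy" ratings: RMSNorm, for example, is only a handful of lines. A minimal sketch of such a module, assuming you'd swap it in wherever your model currently uses LayerNorm (this is not nanoGPT's code):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # LayerNorm without mean-centering or bias: scale x by its root-mean-square.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight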

Recommendation: RoPE. It's the most conceptually rich (rotates Q and K vectors by position-dependent amounts), the most widely adopted, and produces measurable downstream effects.

Document choice in transformer-from-scratch/MODIFICATION.md with one paragraph reasoning.

Part 2-Read the relevant paper (90 min)

  • RoPE: RoFormer paper, arxiv.org/abs/2104.09864. Read sections 1, 3.1, 3.2, 3.3.
  • RMSNorm: arxiv.org/abs/1910.07467. Short paper; read in full.
  • SwiGLU: "GLU Variants Improve Transformer", arxiv.org/abs/2002.05202. Skim.
  • GQA: arxiv.org/abs/2305.13245. Read sections 1–3.

For RoPE specifically:

  • Position is encoded by rotating Q and K vectors by an angle proportional to position.
  • Rotations are applied to pairs of dimensions: (x₀, x₁) → (x₀cosθ − x₁sinθ, x₀sinθ + x₁cosθ).
  • The angle for dimension-pair index i is θᵢ = pos · 10000^(−2i/d).
  • Why it works: the dot products Q·K then depend only on relative position (see the numeric check below).
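
To make that last point concrete, here is a tiny self-contained check on a single dimension pair (a throwaway sketch, not part of the build): rotating q at position m and k at position n gives a dot product that depends only on the offset m − n.

import torch

def rotate_pair(x, theta):
    # Rotate one (x0, x1) dimension pair by angle theta.
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([x[0] * c - x[1] * s, x[0] * s + x[1] * c])

q, k = torch.randn(2), torch.randn(2)
theta_i = torch.tensor(1.0)            # frequency for this dimension pair
for shift in (0, 3, 7):
    m, n = 5 + shift, 2 + shift        # same relative offset m - n = 3 each time
    score = torch.dot(rotate_pair(q, m * theta_i), rotate_pair(k, n * theta_i))
    print(score.item())                # same value (up to float error) for every shift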

Part 3-Sketch the implementation (75 min)

Open model.py. Identify exactly what to change.

For RoPE:

  • New helper function apply_rotary_emb(q, k, freqs).
  • Modify the attention forward to apply rotary to Q and K before the dot product.
  • Compute freqs once per training run (cached).

Pseudocode:

import torch

def precompute_freqs_cis(dim, end, theta=10000.0):
    # dim = head_dim, end = max sequence length
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end)
    freqs = torch.outer(t, freqs).float()
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex unit vectors, shape (end, dim // 2)
    return freqs_cis

def apply_rotary_emb(xq, xk, freqs_cis):
    # xq, xk shape: (B, T, n_head, head_dim); pass freqs_cis sliced to length T
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # broadcast over batch and heads
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

Make a sketch (don't fully implement yet)-Saturday is for implementation.
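
For orientation while sketching, here is roughly where the call sits in a nanoGPT-style CausalSelfAttention.forward. This is a sketch under assumptions: q, k, v are already split into heads as (B, T, n_head, head_dim), self.freqs_cis is a buffer cached from precompute_freqs_cis, and B, T, C come from x.size() earlier in the method; nanoGPT's real code differs in its details.

import torch.nn.functional as F

# ... inside forward(), after projecting x and splitting q, k, v into heads ...
q, k = apply_rotary_emb(q, k, self.freqs_cis[:T])   # rotate before the QK^T dot product
q, k, v = (t.transpose(1, 2) for t in (q, k, v))    # -> (B, n_head, T, head_dim)
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
y = y.transpose(1, 2).contiguous().view(B, T, C)    # merge heads back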

Output of Session A

  • MODIFICATION.md with choice + reasoning + paper notes.
  • Implementation sketch.

Session B-Implement, train, compare

Goal: Code the modification. Run baseline + modified with 3 seeds each. Capture results.

Part 1-Implementation (90 min)

Implement the modification cleanly. Add unit tests where possible (e.g., for RoPE, verify that apply_rotary_emb preserves the magnitude of vectors).

# Test: rotary preserves magnitude (assumes precompute_freqs_cis and apply_rotary_emb from Session A)
import torch

q = torch.randn(2, 8, 4, 64)   # (B, T, n_head, head_dim)
k = torch.randn(2, 8, 4, 64)
freqs = precompute_freqs_cis(64, 8)
q_rot, k_rot = apply_rotary_emb(q, k, freqs)
assert torch.allclose(q.norm(dim=-1), q_rot.norm(dim=-1), atol=1e-4)
assert torch.allclose(k.norm(dim=-1), k_rot.norm(dim=-1), atol=1e-4)

Part 2-Train baseline + modified, 3 seeds each (130 min)

6 training runs total. Each ~15–20 min on a T4 GPU.

import torch
import wandb

seeds = [0, 1, 2]
configs = ['baseline', 'modified']
for cfg in configs:
    for seed in seeds:
        torch.manual_seed(seed)
        wandb.init(project='q1-capstone', name=f'{cfg}-seed{seed}',
                   config={'cfg': cfg, 'seed': seed})
        # train ... (log val_loss as in the sketch below)
        wandb.finish()
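
Inside the # train ... placeholder, log validation loss periodically so each run's summary carries a final value you can compare later. A sketch; estimate_val_loss and eval_interval stand in for whatever your training script already has (the names are placeholders):

# Inside each run's training loop (sketch):
if step % eval_interval == 0:
    val_loss = estimate_val_loss(model)           # placeholder for your eval helper
    wandb.log({'val_loss': val_loss}, step=step)  # last logged value ends up in run.summary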

Part 3-Analyze (30 min)

Pull the W&B data and summarize per config (numbers here are illustrative):

                mean val_loss   std val_loss
baseline        2.137           0.024
modified        2.089           0.018

Compute bootstrap CI on the difference. Is the modification helping?
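
A minimal bootstrap sketch, assuming you've copied each run's final val_loss into two arrays (the numbers below are placeholders, not results):

import numpy as np

baseline = np.array([2.137, 2.151, 2.118])   # placeholder: final val_loss per seed
modified = np.array([2.089, 2.095, 2.070])   # placeholder

rng = np.random.default_rng(0)
diffs = [rng.choice(baseline, 3, replace=True).mean()
         - rng.choice(modified, 3, replace=True).mean()
         for _ in range(10_000)]
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"improvement: {baseline.mean() - modified.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")

With only three seeds the interval will be wide; that's part of the honest story, not a problem to hide.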

Be honest. Many modifications give a small or zero gain at this small scale. An honest negative result is still publishable. Don't fudge.

Output of Session B

  • 6 W&B runs.
  • Comparison table with bootstrap CIs.
  • Working modification merged into your transformer.

Session C-Q1 retrospective post + repo polish

Goal: Write and publish the Q1 retrospective. The longest, most substantive post yet.

Part 1-Outline + draft (90 min)

Title (suggestion): "Twelve weeks from no-backprop to a modified transformer-a Q1 deep dive."

Outline (~3000 words):

  1. Hook. The transformation in 12 weeks.
  2. Where I started. Honest baseline (couldn't derive backprop, etc.).
  3. The math foundations weeks. What clicked, what didn't.
  4. The classical-ML detour. XGBoost vs MLP. Why this matters.
  5. The transformer build. Karpathy's pedagogy. The week attention clicked.
  6. The modification experiment. The paper. The implementation. The results with CIs.
  7. What surprised me. 3–5 specific surprises.
  8. What I'd do differently. Honest critique.
  9. What's next. Bridge to Q2 (LLM applications).

Embed code snippets, charts, and diagrams from the modification paper.

Part 2-Polish + publish (60 min)

  • Edit ruthlessly. Read aloud.
  • Add charts.
  • Publish.
  • Cross-post: HN (Show HN), r/MachineLearning (Project flair), r/LocalLLaMA, X, LinkedIn.
  • Tag relevant accounts (Karpathy if you built directly on his lecture; the paper authors, politely).

Part 3-Q1 retrospective document (45 min)

Write Q1_RETRO.md in your repo:

# Q1 Retrospective: Foundations

## Artifacts shipped (12 weeks)
- `ml-from-scratch/` - 4 from-scratch notebooks
- `micrograd-minimal/` - autograd engine
- `classical-ml/` - course notebooks, ablation, tabular comparison
- `transformer-from-scratch/` - 7 notebooks, modified nanoGPT
- 3 public blog posts

## KPIs vs targets (per AI_EXPERT_ROADMAP.md)
| Metric | Q1 Target | Actual |
|---|---|---|
| Public repos | 3 | 4 |
| Blog posts | 1 | 3 |
| Papers read deeply | 8 | ~10 |

## Three biggest insights
1. Backprop became inevitable after micrograd.
2. Cosine identity is the bridge between algebra and geometry.
3. Seed variance is real and most claimed improvements are noise.

## What slipped
- ...

## Pace check
- (sustainable / accelerated / behind)

## Q2 plan
- LLM application engineering. Build a real project.
- Anchor: incident triage / RCA system.
- Q2 starts with M04-W01.

## Confidence calibration before Q2
- [ ] I can implement attention from a blank file in 30 min.
- [ ] I can read any transformer paper.
- [ ] I have public artifacts to point to.

Output of Session C

  • Third public blog post live and shared in ≥3 channels.
  • Q1 retrospective document.

End-of-week artifact

  • Modified transformer with comparison vs baseline (3 seeds × 2 variants)
  • Third public blog post-Q1 retrospective, ~3000 words
  • Q1 retrospective document in repo
  • Updated AI_EXPERT_ROADMAP.md checkmarks

End-of-week self-assessment

  • I can articulate Q1's transformation in 30 seconds.
  • I have shipped artifacts that prove the work.
  • I'm ready to shift from "build models" to "build with models" in Q2.

Common failure modes for this week

  • Picking a modification that's too ambitious. RMSNorm or SwiGLU is fine. The point is the experiment design, not exotic complexity.
  • Hiding the negative result. "RoPE didn't help at this scale" is publishable if honest.
  • Not publishing the post. This is the year's most leveraged post so far. Ship.

What's next (preview of M04-W01-Q2 begins)

LLM application engineering. Pick your Q2 anchor project (recommended: incident triage / RCA on real or synthetic CI/CD telemetry). Make first calls to two providers. Set up structured outputs and Pydantic.
