Month 3-Week 2: Self-attention, multi-head attention, GPT from scratch¶
Week summary¶
- Goal: Watch and implement Karpathy Zero to Hero lecture 6-"Let's build GPT." Implement self-attention, multi-head attention, causal masking, and a complete transformer block from scratch in PyTorch. Train on Tiny Shakespeare. By Sunday you can read any transformer paper.
- Time: ~12 h over 3 sessions (intentionally heavy-block the weekend).
- Output: transformer-from-scratch/05-attention.ipynb, 06-full-gpt.ipynb. Trained model with sample text.
- Sequences relied on: 08-transformers rungs 03–07; 05-pytorch rungs 07, 08; 01-linear-algebra rungs 04, 05.
Why this week matters¶
This is the week. Implementing self-attention from scratch is the single highest-leverage intellectual move of your year. After this week, the transformer becomes a glass box: every paper that builds on it (BERT, GPT, Llama, Mistral, DeepSeek) reads as a variation. Without this week, every later session has a small black-box left in it. With it, the whole AI literature opens up.
Block your calendar. Tell your family. Set expectations. This week deserves serious time.
Prerequisites¶
- M03-W01 complete (Karpathy lectures 2–4).
- Cosine identity from M01-W01 internalized.
- Multivariable chain rule from M01-W02 internalized.
- PyTorch fluency from M01-W04.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): pre-read Alammar + paper
- Session B-Sat full day (~5–6 h): Karpathy lecture 6 in full
- Session C-Sun afternoon (~3 h): modifications + experiments
Session A-Pre-read: Alammar and Attention Is All You Need¶
Goal: Build conceptual model of attention before coding. By end of session, you can describe Q/K/V projections, scaled dot-product attention, and multi-head attention in your own words.
Part 1-Jay Alammar's Illustrated Transformer (75 min)¶
Read carefully. jalammar.github.io/illustrated-transformer/.
Take notes on:
1. The encoder-decoder architecture (we'll only use the decoder side).
2. Q, K, V projections-what each represents.
3. Scaled dot-product attention-geometric intuition.
4. Multi-head: why split into multiple heads instead of one big one.
5. Positional encoding-why it's needed.
Sketch on paper: the data flow from input tokens → embeddings → attention → MLP → next-token logits. Label every tensor shape.
Part 2-Attention Is All You Need (75 min)¶
Paper: arxiv.org/abs/1706.03762.
Read sections 1, 2, 3.1, 3.2, 3.3 carefully. Skim 3.4–3.5. Skip the rest for now.
Key formulas:
1. Scaled dot-product: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
2. Multi-head: MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · W_O
Re-derive the shape arithmetic:
- Input: (batch, seq_len, d_model)
- Q, K, V projections: (batch, seq_len, d_model)
- Reshape to multi-head: (batch, n_heads, seq_len, d_head) where d_head = d_model / n_heads
- Attention scores: Q · Kᵀ → (batch, n_heads, seq_len, seq_len)
- After scaling, masking, softmax: still (batch, n_heads, seq_len, seq_len)
- Apply to V: (batch, n_heads, seq_len, d_head)
- Concat heads: (batch, seq_len, d_model)
- Output projection: (batch, seq_len, d_model)
If any line of this is mysterious, re-read.
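The shape walk above can be checked directly in a few lines of PyTorch. This is a minimal sketch (random weights, illustrative sizes, no masking) purely to confirm the tensor shapes at each step:

```python
import torch

batch, seq_len, d_model, n_heads = 2, 8, 64, 4
d_head = d_model // n_heads  # 64 / 4 = 16

x = torch.randn(batch, seq_len, d_model)               # input
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
scores = q @ k.transpose(-2, -1) / d_head ** 0.5       # (batch, n_heads, seq_len, seq_len)
weights = scores.softmax(dim=-1)                       # rows sum to 1
out = (weights @ v).transpose(1, 2).reshape(batch, seq_len, d_model)  # concat heads
print(out.shape)
```

An output projection W_O (another d_model × d_model matmul) would follow `out` without changing its shape.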
Part 3-Self-check (30 min)¶
Without notes:
1. Why scale by √dₖ? (Hint: variance of dot product grows with dₖ.)
2. Why is causal masking necessary in a decoder LM?
3. Why multiple heads? What does "head 1 attends to syntax, head 2 to semantics" mean architecturally?
4. What's the parameter count of a single attention layer in terms of d_model?
5. Why is the output projection W_O needed (couldn't we just concat heads)?
If any are shaky, re-read Alammar.
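For question 4, a quick way to check your answer is to count parameters programmatically. A sketch, assuming the layer is four d_model × d_model projections (Q, K, V, output) with biases omitted for simplicity:

```python
import torch.nn as nn

d_model = 64
# Q, K, V, and output projections; bias=False keeps the count clean
attn = nn.ModuleDict({
    name: nn.Linear(d_model, d_model, bias=False)
    for name in ("q", "k", "v", "out")
})
n_params = sum(p.numel() for p in attn.parameters())
print(n_params, 4 * d_model ** 2)  # the two should match
```

With biases, add 4 · d_model on top of the 4 · d_model² weight parameters.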
Output of Session A¶
- Notes on Alammar + paper.
- Shape-arithmetic sketch on paper or whiteboard.
- Self-check answers.
Session B-Karpathy lecture 6-building GPT (~5–6 hours)¶
Goal: Type along with Karpathy lecture 6 in full. End with a working transformer LM training on Tiny Shakespeare.
This session is long. Do it Saturday morning. Take a 30-min break in the middle. Do not split across days.
Part 1-Lecture 6, first half (~2.5 h)¶
Karpathy lecture 6: "Let's build GPT".
The first half covers:
1. Tiny Shakespeare dataset.
2. Character-level tokenizer.
3. The "averaging the previous tokens" baseline (a sanity check).
4. Single-head self-attention.
5. Multi-head attention.
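The single-head self-attention you'll build in this half has roughly this shape. A sketch in the lecture's style, with illustrative hyperparameter names (n_embd, head_size, block_size), not a substitute for typing along:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (sketch; sizes are illustrative)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: position t may attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T), scaled
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                        # (B, T, head_size)
```

The masked_fill before the softmax is the causal step: future positions get −∞ scores, so they receive zero attention weight.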
Type along. Do not paste. Every line you type is a chance to ask "why is this here?"
Part 2-Lecture 6, second half (~2.5 h)¶
The second half covers:
1. Position embeddings.
2. The full transformer block (attention + FFN + residuals + LayerNorm).
3. Stacking blocks.
4. Training loop.
5. Sampling.
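The full block from item 2 wires attention, FFN, residuals, and LayerNorm together in pre-norm style. A self-contained sketch under illustrative names (CausalSelfAttention, Block), using the fused-QKV trick so the heads run as one batched matmul:

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # fused Q, K, V
        self.proj = nn.Linear(n_embd, n_embd)                 # output projection W_O
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # (B, T, C) -> (B, n_head, T, d_head)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        att = q @ k.transpose(-2, -1) * (C // self.n_head) ** -0.5
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf")).softmax(dim=-1)
        y = (att @ v).transpose(1, 2).reshape(B, T, C)        # concat heads
        return self.proj(y)

class Block(nn.Module):
    """Pre-norm: LayerNorm feeds each sublayer; the residual adds around it."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around FFN
        return x
```

Stacking blocks is then just `nn.Sequential(*[Block(...) for _ in range(n_layer)])`.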
By the end you have ~250 lines of code that train a transformer on Shakespeare and produce vaguely Shakespeare-like text.
Part 3-Train longer, sample (~30 min)¶
Run training for at least 5000 iterations. Save a checkpoint. Sample 500 characters at temperature 1. Compare to the bigram and MLP-LM samples from W01; the qualitative jump should be obvious.
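The temperature-1 sampling step is one softmax and one draw per character. A minimal sketch (the function name and signature are illustrative, not from the lecture):

```python
import torch

def sample_token(logits, temperature=1.0):
    """Draw one token id from a logits vector.
    temperature=1.0 leaves the distribution untouched; lower sharpens it."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

In the generation loop you'd call this on the model's logits for the last position, append the id to the context, and repeat 500 times.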
Common pitfalls in Session B¶
- Subtle off-by-one in causal masking. Easy to mask the wrong way; loss looks fine but model "cheats." Compare your mask to Karpathy's.
- Forgetting torch.no_grad() (or detaching the context tensor) during sampling. The autograd graph keeps growing across steps and memory blows up.
- Wrong scale on attention scores. If loss flatlines at log(vocab_size), your scaling or softmax is off.
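The masking pitfall above is cheap to guard against. A small sanity-check sketch for the mask tensor itself: every allowed entry must sit on or below the diagonal (query position t may see key positions s ≤ t, including itself):

```python
import torch

T = 5
mask = torch.tril(torch.ones(T, T))  # 1 where attention is allowed
# Row t = query position; column s = key position.
assert mask.triu(1).sum() == 0       # nothing above the diagonal: no peeking ahead
assert mask.diagonal().all()         # each token still attends to itself
print(mask)
```

If you accidentally use `torch.triu`, or mask where the tensor equals 1 instead of 0, these asserts catch it immediately, whereas the loss curve might not.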
Output of Session B¶
- 05-attention.ipynb and 06-full-gpt.ipynb working.
- Trained Shakespeare model.
- 500-character sample committed to README.
Session C-Modifications and self-test¶
Goal: Modify the transformer in 3 ways and observe effects. Self-test that you understand each piece.
Part 1-Three modifications (90 min)¶
Modification 1: Embedding dimension. Double n_embd from 64 to 128 (or whatever your default). Train. Compare validation loss.
Modification 2: Number of layers. Double n_layer. Train. Compare.
Modification 3: Activation swap. Replace nn.ReLU in the FFN with nn.GELU. Train. Compare.
For each, log to W&B. Capture loss curves on the same plot.
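Modification 3 is a one-line swap if the FFN takes its activation as a parameter. A small illustrative helper (the name `ffn` and the swappable-activation design are this sketch's, not the lecture's):

```python
import torch
import torch.nn as nn

def ffn(n_embd, act=nn.GELU):
    """FFN sublayer with a swappable activation.
    Pass act=nn.ReLU to reproduce the baseline run."""
    return nn.Sequential(
        nn.Linear(n_embd, 4 * n_embd), act(), nn.Linear(4 * n_embd, n_embd)
    )
```

Keeping every other hyperparameter fixed across the three runs is what makes the W&B loss curves directly comparable.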
Part 2-Self-test, no notes (45 min)¶
Open a fresh file. From a blank page:
1. Implement scaled dot-product attention in <30 lines.
2. Implement multi-head attention as a single batched matmul (no for-loops).
3. Implement causal masking using torch.tril.
4. Implement a transformer block (attention + FFN + residual + LayerNorm pre-norm style).
If you can do these in 60 minutes, you've absorbed the lecture. If not, re-watch the relevant section.
Part 3-Push, document, retro (45 min)¶
- Push everything to repo. README has architecture diagram, sample text, modification results.
- Update LEARNING_LOG.md with: "Three things I learned that the paper didn't say."
- Read M03-W03.md to prep for nanoGPT week.
Output of Session C¶
- Three modification experiments logged to W&B.
- Self-test code in a fresh file.
- Repo updated.
End-of-week artifact¶
- Working transformer LM on Tiny Shakespeare
- Three modification experiments compared in W&B
- Self-test passing (implement attention from scratch in 60 min)
- Sample text committed to README
- Architecture diagram in README
End-of-week self-assessment¶
- I can implement self-attention from a blank file in 30 minutes.
- I can explain why we scale by √dₖ.
- I can explain why causal masking enables parallel training.
- I can read any transformer paper and follow the architecture sections.
- I feel like the transformer is now a glass box.
Common failure modes for this week¶
- Splitting Saturday across days. The 5-hour block is the point. Connection between concepts requires uninterrupted attention.
- Pasting Karpathy's code. Type it all; typing each line yourself is part of how the material sticks.
- Skipping the self-test. It's the proof. Without it, you don't know what you know.
What's next (preview of M03-W03)¶
nanoGPT-production-style transformer reference implementation. You'll reproduce it on TinyStories or OpenWebText. Plus Karpathy's tokenizer lecture demystifies why LLMs fail at character counting.