
Month 3-Week 2: Self-attention, multi-head attention, GPT from scratch

Week summary

  • Goal: Watch and implement Karpathy Zero to Hero lecture 6-"Let's build GPT." Implement self-attention, multi-head attention, causal masking, and a complete transformer block from scratch in PyTorch. Train on Tiny Shakespeare. By Sunday you can read any transformer paper.
  • Time: ~12 h over 3 sessions (intentionally heavy; block out the weekend).
  • Output: transformer-from-scratch/05-attention.ipynb, 06-full-gpt.ipynb. Trained model with sample text.
  • Sequences relied on: 08-transformers rungs 03–07; 05-pytorch rungs 07, 08; 01-linear-algebra rungs 04, 05.

Why this week matters

This is the week. Implementing self-attention from scratch is the single highest-leverage intellectual move of your year. After this week, the transformer becomes a glass box: every paper that builds on it (BERT, GPT, Llama, Mistral, DeepSeek) reads as a variation. Without this week, every later session has a small black box left in it. With it, the whole AI literature opens up.

Block your calendar. Tell your family. Set expectations. This week deserves serious time.

Prerequisites

  • M03-W01 complete (Karpathy lectures 2–4).
  • Cosine identity from M01-W01 internalized.
  • Multivariable chain rule from M01-W02 internalized.
  • PyTorch fluency from M01-W04.

Session plan

  • Session A-Tue/Wed evening (~3 h): pre-read Alammar + paper.
  • Session B-Sat full day (~5–6 h): Karpathy lecture 6 in full.
  • Session C-Sun afternoon (~3 h): modifications + experiments.

Session A-Pre-read: Alammar and Attention Is All You Need

Goal: Build conceptual model of attention before coding. By end of session, you can describe Q/K/V projections, scaled dot-product attention, and multi-head attention in your own words.

Part 1-Jay Alammar's Illustrated Transformer (75 min)

Read it carefully: jalammar.github.io/illustrated-transformer/.

Take notes on:

  1. The encoder-decoder architecture (we'll only use the decoder side).
  2. Q, K, V projections-what each represents.
  3. Scaled dot-product attention-geometric intuition.
  4. Multi-head: why split into multiple heads instead of one big one.
  5. Positional encoding-why it's needed.

Sketch on paper: the data flow from input tokens → embeddings → attention → MLP → next-token logits. Label every tensor shape.

Part 2-Attention Is All You Need (75 min)

Paper: arxiv.org/abs/1706.03762.

Read sections 1, 2, 3.1, 3.2, 3.3 carefully. Skim 3.4–3.5. Skip the rest for now.

Key formulas:

  1. Scaled dot-product: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
  2. Multi-head: MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W_O, where each headᵢ is scaled dot-product attention over that head's own Q, K, V projections.

Re-derive the shape arithmetic:

  • Input: (batch, seq_len, d_model)
  • Q, K, V projections: (batch, seq_len, d_model)
  • Reshape to multi-head: (batch, n_heads, seq_len, d_head), where d_head = d_model / n_heads
  • Attention scores Q · Kᵀ: (batch, n_heads, seq_len, seq_len)
  • After scaling, masking, softmax: still (batch, n_heads, seq_len, seq_len)
  • Apply to V: (batch, n_heads, seq_len, d_head)
  • Concat heads: (batch, seq_len, d_model)
  • Output projection: (batch, seq_len, d_model)

If any line of this is mysterious, re-read.
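
A quick way to verify the arithmetic is to push a random tensor through the projections and print shapes. A minimal sketch; the sizes are arbitrary, chosen only for the check:

```python
import torch
import torch.nn as nn

# Arbitrary sizes, chosen only for the shape check.
batch, seq_len, d_model, n_heads = 4, 8, 64, 4
d_head = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)               # input
q, k, v = nn.Linear(d_model, 3 * d_model)(x).chunk(3, dim=-1)  # fused Q/K/V projection

split = lambda t: t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)
q, k, v = split(q), split(k), split(v)                 # (batch, n_heads, seq_len, d_head)

scores = q @ k.transpose(-2, -1) / d_head**0.5         # (batch, n_heads, seq_len, seq_len)
out = scores.softmax(dim=-1) @ v                       # (batch, n_heads, seq_len, d_head)
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)  # concat heads
out = nn.Linear(d_model, d_model)(out)                 # output projection W_O
print(out.shape)                                       # torch.Size([4, 8, 64])
```

The transpose/reshape pair at the end is the "concat heads" step from the list above.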

Part 3-Self-check (30 min)

Without notes:

  1. Why scale by √dₖ? (Hint: the variance of a dot product grows with dₖ.)
  2. Why is causal masking necessary in a decoder LM?
  3. Why multiple heads? What does "head 1 attends to syntax, head 2 to semantics" mean architecturally?
  4. What's the parameter count of a single attention layer in terms of d_model?
  5. Why is the output projection W_O needed (couldn't we just concat the heads)?

If any are shaky, re-read Alammar.
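
For question 1, the variance claim is easy to check empirically. A throwaway experiment, not from the paper:

```python
import torch

# The dot product of two i.i.d. standard-normal vectors has variance ≈ dₖ,
# so unscaled softmax inputs saturate as dₖ grows.
for d_k in (16, 64, 256):
    q, k = torch.randn(100_000, d_k), torch.randn(100_000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.var().item(), (dots / d_k**0.5).var().item())
```

Expect the second column to land near 16, 64, 256 and the scaled third column to stay near 1.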

Output of Session A

  • Notes on Alammar + paper.
  • Shape-arithmetic sketch on paper or whiteboard.
  • Self-check answers.

Session B-Karpathy lecture 6-building GPT (~5–6 hours)

Goal: Type along with Karpathy lecture 6 in full. End with a working transformer LM training on Tiny Shakespeare.

This session is long. Do it Saturday morning. Take a 30-min break in the middle. Do not split across days.

Part 1-Lecture 6, first half (~2.5 h)

Karpathy lecture 6: "Let's build GPT".

The first half covers:

  1. Tiny Shakespeare dataset.
  2. Character-level tokenizer.
  3. The "averaging the previous tokens" baseline (a sanity check).
  4. Single-head self-attention.
  5. Multi-head attention.

Type along. Do not paste. Every line you type is a chance to ask "why is this here?"
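
For orientation before you press play (don't paste this either; it's a sketch in the spirit of the lecture, with my own naming), a single head of causal self-attention looks roughly like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may attend only to positions <= t.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                    # x: (B, T, n_embd)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                       # (B, T, head_size)
```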

Part 2-Lecture 6, second half (~2.5 h)

The second half covers:

  1. Position embeddings.
  2. The full transformer block (attention + FFN + residuals + LayerNorm).
  3. Stacking blocks.
  4. Training loop.
  5. Sampling.

By the end you have ~250 lines of code that train a transformer on Shakespeare and produce vaguely Shakespeare-like text.
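
As a reference point for what "the full transformer block" means structurally, here is a condensed pre-norm sketch; it is an illustration under my own naming, not Karpathy's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """All heads at once via one fused projection and batched matmuls."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.n_head, self.head_size = n_head, n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)    # fused Q/K/V
        self.proj = nn.Linear(n_embd, n_embd)                   # output projection W_O
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        heads = lambda t: t.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)                  # (B, n_head, T, head_size)
        wei = q @ k.transpose(-2, -1) * self.head_size ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        out = F.softmax(wei, dim=-1) @ v                        # (B, n_head, T, head_size)
        return self.proj(out.transpose(1, 2).reshape(B, T, C))  # concat heads, project

class Block(nn.Module):
    """Pre-norm transformer block: communication (attention), then computation (FFN)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = CausalSelfAttention(n_embd, n_head, block_size)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual around attention
        x = x + self.ffwd(self.ln2(x))    # residual around feed-forward
        return x
```

Pre-norm (LayerNorm before each sublayer) is what the lecture uses and what most modern GPTs use; the original paper used post-norm.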

Part 3-Train longer, sample (~30 min)

Run training for at least 5000 iterations. Save a checkpoint. Sample 500 characters at temperature 1.0. Compare to the bigram and MLP-LM samples from W01; the qualitative jump should be obvious.
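
A sampling loop in the spirit of the lecture (illustrative names; it assumes your model's forward returns logits of shape (batch, time, vocab) when called without targets):

```python
import torch

@torch.no_grad()   # no autograd graph during generation; avoids the memory blowup
def sample(model, idx, max_new_tokens, block_size, temperature=1.0):
    # idx: (B, T) tensor of token indices in the current context.
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                  # crop context to block size
        logits = model(idx_cond)[:, -1, :]               # logits at the last position
        probs = torch.softmax(logits / temperature, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```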

Common pitfalls in Session B

  • Subtle off-by-one in causal masking. It's easy to mask the wrong triangle; the loss looks fine but the model "cheats" by attending to future tokens. Compare your mask to Karpathy's (see the check after this list).
  • Forgetting to run sampling under torch.no_grad() (or to detach the context tensor). The autograd graph grows with every generated token and memory blows up.
  • Wrong scale on attention scores. If the loss flatlines near log(vocab_size) (≈4.17 for the 65-character Shakespeare vocabulary), your scaling or softmax is off.
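
A two-line sanity check for the masking pitfall above (rows are query positions, columns are key positions):

```python
import torch

T = 5
mask = torch.tril(torch.ones(T, T))  # lower-triangular is correct for a causal LM
print(mask)
# Row t (the query at position t) has ones only in columns 0..t: each token
# attends to itself and the past. Ones above the diagonal mean the model
# can see the future, which is the "cheating" described above.
```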

Output of Session B

  • 05-attention.ipynb and 06-full-gpt.ipynb working.
  • Trained Shakespeare model.
  • 500-character sample committed to README.

Session C-Modifications and self-test

Goal: Modify the transformer in 3 ways and observe effects. Self-test that you understand each piece.

Part 1-Three modifications (90 min)

Modification 1: Embedding dimension. Double n_embd from 64 to 128 (or from whatever your default is). Train. Compare validation loss.

Modification 2: Number of layers. Double n_layer. Train. Compare.

Modification 3: Activation swap. Replace nn.ReLU in the FFN with nn.GELU. Train. Compare.

For each, log to W&B. Capture loss curves on the same plot.
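
If W&B logging isn't wired up yet, a minimal pattern looks like this; the project and run names, and the train_step/estimate_val_loss helpers, are placeholders for your own code:

```python
import wandb

# Placeholders: names and helpers below stand in for your own code.
config = {"n_embd": 128, "n_layer": 6, "activation": "relu"}
run = wandb.init(project="transformer-from-scratch", name="mod1-n_embd-128", config=config)

for step in range(max_iters):                 # your existing training loop
    loss = train_step()                       # hypothetical helper
    if step % eval_interval == 0:
        run.log({"step": step, "train_loss": loss, "val_loss": estimate_val_loss()})

run.finish()
```

Keeping the swept hyperparameter in `config` is what lets W&B overlay the three runs' loss curves on one plot.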

Part 2-Self-test, no notes (45 min)

Open a fresh file. From a blank page:

  1. Implement scaled dot-product attention in <30 lines.
  2. Implement multi-head attention as a single batched matmul (no for-loops).
  3. Implement causal masking using torch.tril.
  4. Implement a transformer block (attention + FFN + residual + LayerNorm, pre-norm style).

If you can do these in the allotted 45 minutes, you've absorbed the lecture. If not, re-watch the relevant section.

Part 3-Push, document, retro (45 min)

  • Push everything to repo. README has architecture diagram, sample text, modification results.
  • Update LEARNING_LOG.md with: "Three things I learned that the paper didn't say."
  • Read M03-W03.md to prep for nanoGPT week.

Output of Session C

  • Three modification experiments logged to W&B.
  • Self-test code in a fresh file.
  • Repo updated.

End-of-week artifact

  • Working transformer LM on Tiny Shakespeare
  • Three modification experiments compared in W&B
  • Self-test passing (implement attention from scratch in 45 min)
  • Sample text committed to README
  • Architecture diagram in README

End-of-week self-assessment

  • I can implement self-attention from a blank file in 30 minutes.
  • I can explain why we scale by √dₖ.
  • I can explain why causal masking enables parallel training.
  • I can read any transformer paper and follow the architecture sections.
  • I feel like the transformer is now a glass box.

Common failure modes for this week

  • Splitting the Saturday session across days. The 5-hour block is the point; the connections between concepts require uninterrupted attention.
  • Pasting Karpathy's code. Type it all; typing it yourself is part of how the material sticks.
  • Skipping the self-test. It's the proof. Without it, you don't know what you know.

What's next (preview of M03-W03)

nanoGPT: a production-style transformer reference implementation. You'll reproduce it on TinyStories or OpenWebText. Karpathy's tokenizer lecture then demystifies why LLMs fail at character counting.
