
Month 3-Week 2: Self-attention, multi-head attention, GPT from scratch

Week summary

  • Goal: Watch and implement Karpathy Zero to Hero lecture 6-"Let's build GPT." Implement self-attention, multi-head attention, causal masking, and a complete transformer block from scratch in PyTorch. Train on Tiny Shakespeare. By Sunday you can read any transformer paper.
  • Time: ~12 h over 3 sessions (intentionally heavy; block out the weekend).
  • Output: transformer-from-scratch/05-attention.ipynb, 06-full-gpt.ipynb. Trained model with sample text.
  • Sequences relied on: 08-transformers rungs 03–07; 05-pytorch rungs 07, 08; 01-linear-algebra rungs 04, 05.

Why this week matters

This is the week. Implementing self-attention from scratch is the single highest-leverage intellectual move of your year. After this week, the transformer becomes a glass box: every paper that builds on it (BERT, GPT, Llama, Mistral, DeepSeek) reads as a variation. Without this week, every later session has a small black box left in it. With it, the whole AI literature opens up.

Block your calendar. Tell your family. Set expectations. This week deserves serious time.

Prerequisites

  • M03-W01 complete (Karpathy lectures 2–4).
  • Cosine identity from M01-W01 internalized.
  • Multivariable chain rule from M01-W02 internalized.
  • PyTorch fluency from M01-W04.

Session plan

  • Session A-Tue/Wed evening (~3 h): pre-read Alammar + paper.
  • Session B-Sat full day (~5–6 h): Karpathy lecture 6 in full.
  • Session C-Sun afternoon (~3 h): modifications + experiments.

Session A-Pre-read: Alammar and Attention Is All You Need

Goal: Build conceptual model of attention before coding. By end of session, you can describe Q/K/V projections, scaled dot-product attention, and multi-head attention in your own words.

Part 1-Jay Alammar's Illustrated Transformer (75 min)

Read it carefully: jalammar.github.io/illustrated-transformer/.

Take notes on:

  1. The encoder-decoder architecture (we'll only use the decoder side).
  2. Q, K, V projections-what each represents.
  3. Scaled dot-product attention-geometric intuition.
  4. Multi-head: why split into multiple heads instead of one big one.
  5. Positional encoding-why it's needed.

Sketch on paper: the data flow from input tokens → embeddings → attention → MLP → next-token logits. Label every tensor shape.

Part 2-Attention Is All You Need (75 min)

Paper: arxiv.org/abs/1706.03762.

Read sections 1, 2, 3.1, 3.2, 3.3 carefully. Skim 3.4–3.5. Skip the rest for now.

Key formulas:

  1. Scaled dot-product: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
  2. Multi-head: MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W_O, where each headᵢ is scaled dot-product attention over that head's own Q, K, V projections.

Re-derive the shape arithmetic:

  • Input: (batch, seq_len, d_model)
  • Q, K, V projections: (batch, seq_len, d_model)
  • Reshape to multi-head: (batch, n_heads, seq_len, d_head), where d_head = d_model / n_heads
  • Attention scores Q · Kᵀ: (batch, n_heads, seq_len, seq_len)
  • After scaling, masking, softmax: still (batch, n_heads, seq_len, seq_len)
  • Apply to V: (batch, n_heads, seq_len, d_head)
  • Concat heads: (batch, seq_len, d_model)
  • Output projection: (batch, seq_len, d_model)

If any line of this is mysterious, re-read.
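
A quick way to verify the arithmetic is to push a random tensor through the projections and print shapes. A minimal sketch; the sizes are arbitrary, chosen only for the check:

```python
import torch
import torch.nn as nn

# Arbitrary sizes, chosen only for the shape check.
batch, seq_len, d_model, n_heads = 4, 8, 64, 4
d_head = d_model // n_heads

x = torch.randn(batch, seq_len, d_model)               # input
q, k, v = nn.Linear(d_model, 3 * d_model)(x).chunk(3, dim=-1)  # fused Q/K/V projection

split = lambda t: t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)
q, k, v = split(q), split(k), split(v)                 # (batch, n_heads, seq_len, d_head)

scores = q @ k.transpose(-2, -1) / d_head**0.5         # (batch, n_heads, seq_len, seq_len)
out = scores.softmax(dim=-1) @ v                       # (batch, n_heads, seq_len, d_head)
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)  # concat heads
out = nn.Linear(d_model, d_model)(out)                 # output projection W_O
print(out.shape)                                       # torch.Size([4, 8, 64])
```

The transpose/reshape pair at the end is the "concat heads" step from the list above.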

Part 3-Self-check (30 min)

Without notes:

  1. Why scale by √dₖ? (Hint: the variance of a dot product grows with dₖ.)
  2. Why is causal masking necessary in a decoder LM?
  3. Why multiple heads? What does "head 1 attends to syntax, head 2 to semantics" mean architecturally?
  4. What's the parameter count of a single attention layer in terms of d_model?
  5. Why is the output projection W_O needed (couldn't we just concat the heads)?

If any are shaky, re-read Alammar.
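
For question 1, the variance claim is easy to check empirically. A throwaway experiment, not from the paper:

```python
import torch

# The dot product of two i.i.d. standard-normal vectors has variance ≈ dₖ,
# so unscaled softmax inputs saturate as dₖ grows.
for d_k in (16, 64, 256):
    q, k = torch.randn(100_000, d_k), torch.randn(100_000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.var().item(), (dots / d_k**0.5).var().item())
```

Expect the second column to land near 16, 64, 256 and the scaled third column to stay near 1.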

Output of Session A

  • Notes on Alammar + paper.
  • Shape-arithmetic sketch on paper or whiteboard.
  • Self-check answers.

Session B-Karpathy lecture 6-building GPT (~5–6 hours)

Goal: Type along with Karpathy lecture 6 in full. End with a working transformer LM training on Tiny Shakespeare.

This session is long. Do it Saturday morning. Take a 30-min break in the middle. Do not split across days.

Part 1-Lecture 6, first half (~2.5 h)

Karpathy lecture 6: "Let's build GPT".

The first half covers:

  1. Tiny Shakespeare dataset.
  2. Character-level tokenizer.
  3. The "averaging the previous tokens" baseline (a sanity check).
  4. Single-head self-attention.
  5. Multi-head attention.

Type along. Do not paste. Every line you type is a chance to ask "why is this here?"
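
For orientation before you press play (don't paste this either; it's a sketch in the spirit of the lecture, with my own naming), a single head of causal self-attention looks roughly like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may attend only to positions <= t.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                    # x: (B, T, n_embd)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                       # (B, T, head_size)
```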

Part 2-Lecture 6, second half (~2.5 h)

The second half covers:

  1. Position embeddings.
  2. The full transformer block (attention + FFN + residuals + LayerNorm).
  3. Stacking blocks.
  4. Training loop.
  5. Sampling.

By the end you have ~250 lines of code that train a transformer on Shakespeare and produce vaguely Shakespeare-like text.
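
As a reference point for what "the full transformer block" means structurally, here is a condensed pre-norm sketch; it is an illustration under my own naming, not Karpathy's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """All heads at once via one fused projection and batched matmuls."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.n_head, self.head_size = n_head, n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)    # fused Q/K/V
        self.proj = nn.Linear(n_embd, n_embd)                   # output projection W_O
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        heads = lambda t: t.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)                  # (B, n_head, T, head_size)
        wei = q @ k.transpose(-2, -1) * self.head_size ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        out = F.softmax(wei, dim=-1) @ v                        # (B, n_head, T, head_size)
        return self.proj(out.transpose(1, 2).reshape(B, T, C))  # concat heads, project

class Block(nn.Module):
    """Pre-norm transformer block: communication (attention), then computation (FFN)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = CausalSelfAttention(n_embd, n_head, block_size)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual around attention
        x = x + self.ffwd(self.ln2(x))    # residual around feed-forward
        return x
```

Pre-norm (LayerNorm before each sublayer) is what the lecture uses and what most modern GPTs use; the original paper used post-norm.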

Part 3-Train longer, sample (~30 min)

Run training for at least 5000 iterations. Save a checkpoint. Sample 500 characters at temperature 1.0. Compare to the bigram and MLP-LM samples from W01; the qualitative jump should be obvious.
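
A sampling loop in the spirit of the lecture (illustrative names; it assumes your model's forward returns logits of shape (batch, time, vocab) when called without targets):

```python
import torch

@torch.no_grad()   # no autograd graph during generation; avoids the memory blowup
def sample(model, idx, max_new_tokens, block_size, temperature=1.0):
    # idx: (B, T) tensor of token indices in the current context.
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                  # crop context to block size
        logits = model(idx_cond)[:, -1, :]               # logits at the last position
        probs = torch.softmax(logits / temperature, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```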

Common pitfalls in Session B

  • Subtle off-by-one in causal masking. It's easy to mask the wrong triangle; the loss looks fine but the model "cheats" by attending to future tokens. Compare your mask to Karpathy's (see the check after this list).
  • Forgetting to run sampling under torch.no_grad() (or to detach the context tensor). The autograd graph grows with every generated token and memory blows up.
  • Wrong scale on attention scores. If the loss flatlines near log(vocab_size) (≈4.17 for the 65-character Shakespeare vocabulary), your scaling or softmax is off.
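
A two-line sanity check for the masking pitfall above (rows are query positions, columns are key positions):

```python
import torch

T = 5
mask = torch.tril(torch.ones(T, T))  # lower-triangular is correct for a causal LM
print(mask)
# Row t (the query at position t) has ones only in columns 0..t: each token
# attends to itself and the past. Ones above the diagonal mean the model
# can see the future, which is the "cheating" described above.
```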

Output of Session B

  • 05-attention.ipynb and 06-full-gpt.ipynb working.
  • Trained Shakespeare model.
  • 500-character sample committed to README.

Session C-Modifications and self-test

Goal: Modify the transformer in 3 ways and observe effects. Self-test that you understand each piece.

Part 1-Three modifications (90 min)

Modification 1: Embedding dimension. Double n_embd from 64 to 128 (or from whatever your default is). Train. Compare validation loss.

Modification 2: Number of layers. Double n_layer. Train. Compare.

Modification 3: Activation swap. Replace nn.ReLU in the FFN with nn.GELU. Train. Compare.

For each, log to W&B. Capture loss curves on the same plot.
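
If W&B logging isn't wired up yet, a minimal pattern looks like this; the project and run names, and the train_step/estimate_val_loss helpers, are placeholders for your own code:

```python
import wandb

# Placeholders: names and helpers below stand in for your own code.
config = {"n_embd": 128, "n_layer": 6, "activation": "relu"}
run = wandb.init(project="transformer-from-scratch", name="mod1-n_embd-128", config=config)

for step in range(max_iters):                 # your existing training loop
    loss = train_step()                       # hypothetical helper
    if step % eval_interval == 0:
        run.log({"step": step, "train_loss": loss, "val_loss": estimate_val_loss()})

run.finish()
```

Keeping the swept hyperparameter in `config` is what lets W&B overlay the three runs' loss curves on one plot.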

Part 2-Self-test, no notes (45 min)

Open a fresh file. From a blank page:

  1. Implement scaled dot-product attention in <30 lines.
  2. Implement multi-head attention as a single batched matmul (no for-loops).
  3. Implement causal masking using torch.tril.
  4. Implement a transformer block (attention + FFN + residual + LayerNorm, pre-norm style).

If you can do these in the allotted 45 minutes, you've absorbed the lecture. If not, re-watch the relevant section.

Part 3-Push, document, retro (45 min)

  • Push everything to repo. README has architecture diagram, sample text, modification results.
  • Update LEARNING_LOG.md with: "Three things I learned that the paper didn't say."
  • Read M03-W03.md to prep for nanoGPT week.

Output of Session C

  • Three modification experiments logged to W&B.
  • Self-test code in a fresh file.
  • Repo updated.

End-of-week artifact

  • Working transformer LM on Tiny Shakespeare
  • Three modification experiments compared in W&B
  • Self-test passing (implement attention from scratch in 45 min)
  • Sample text committed to README
  • Architecture diagram in README

End-of-week self-assessment

  • I can implement self-attention from a blank file in 30 minutes.
  • I can explain why we scale by √dₖ.
  • I can explain why causal masking enables parallel training.
  • I can read any transformer paper and follow the architecture sections.
  • I feel like the transformer is now a glass box.

Common failure modes for this week

  • Splitting the Saturday session across days. The 5-hour block is the point; the connections between concepts require uninterrupted attention.
  • Pasting Karpathy's code. Type it all; typing it yourself is part of how the material sticks.
  • Skipping the self-test. It's the proof. Without it, you don't know what you know.

What's next (preview of M03-W03)

nanoGPT: a production-style transformer reference implementation. You'll reproduce it on TinyStories or OpenWebText. Karpathy's tokenizer lecture then demystifies why LLMs fail at character counting.
