08-Transformers¶
Why this matters in the journey¶
The transformer is the architectural foundation of every modern LLM. Implementing one from scratch is the single highest-leverage intellectual move of your year. Once you can implement it, you can read papers, debug training, modify architectures, and reason about why things work. Without it, you remain a black-box user.
The rungs¶
Rung 01-Tokenization¶
- What: Convert text into integer IDs. Modern LLMs use Byte Pair Encoding (BPE) variants like GPT's tiktoken or Llama's SentencePiece.
- Why it earns its place: Most "weird LLM behavior" turns out to be a tokenization quirk. Subword tokenization is also why LLMs handle rare words.
- Resource: Karpathy Zero to Hero lecture on the GPT tokenizer (search "karpathy let's build the gpt tokenizer"). Plus the BPE paper (arxiv.org/abs/1508.07909).
- Done when: You can implement BPE on a small corpus and explain how the merge process works.
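To make the "done when" concrete, here is a minimal sketch of the BPE merge loop in plain Python (names like `bpe_merges` are my own, not from any library): count adjacent symbol pairs across the corpus, merge the most frequent pair into one symbol, repeat.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges on a whitespace-split corpus.
    Each word starts as a tuple of characters; each iteration merges
    the most frequent adjacent pair into a single symbol."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # fuse the pair
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

merges = bpe_merges("low low low lower lowest", 3)
# learns ('l','o'), then ('lo','w'), then ('low','e')
```

Real tokenizers (tiktoken, SentencePiece) work on bytes and add special tokens, but the merge process is exactly this.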
Rung 02-Embeddings¶
- What: Token IDs are looked up in a matrix `E` of shape `(vocab_size, hidden_dim)`. Each token becomes a vector.
- Why it earns its place: The first operation in every LLM. Embedding geometry is also the basis of similarity search and RAG.
- Resource: Karpathy Zero to Hero `makemore` lectures introduce embeddings; nanoGPT shows them in production form.
- Done when: You can implement an `nn.Embedding` lookup by hand using just indexing and a parameter matrix.
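The by-hand version really is just row indexing into a parameter matrix; a short sketch assuming PyTorch:

```python
import torch

vocab_size, hidden_dim = 10, 4
E = torch.nn.Parameter(torch.randn(vocab_size, hidden_dim))

token_ids = torch.tensor([[1, 5, 5, 2]])   # (batch, seq_len)
vectors = E[token_ids]                     # fancy indexing -> (batch, seq_len, hidden_dim)

# Same result as the library module when given identical weights:
emb = torch.nn.Embedding(vocab_size, hidden_dim)
emb.weight = E
assert torch.equal(emb(token_ids), vectors)
```

Note that repeated token IDs (the two 5s above) fetch the same row, so identical tokens start with identical vectors.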
Rung 03-Positional encodings¶
- What: Transformers have no inherent notion of order. Position is injected via sinusoidal embeddings (original), learned positional embeddings (GPT-2), RoPE (rotary, used in Llama), or ALiBi.
- Why it earns its place: Long-context behavior, context-length extension, and position-bias bugs all trace back to positional encoding.
- Resource: Original transformer paper (sec 3.5). RoPE paper (arxiv.org/abs/2104.09864). Excellent blog: search "Eleuther rope".
- Done when: You can implement sinusoidal positional encoding from scratch and explain RoPE conceptually.
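A from-scratch sketch of the sinusoidal variant (sec 3.5 of the original paper), assuming PyTorch; the function name is my own:

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    freqs = pos / (10000 ** (i / d_model))                          # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(freqs)   # even dims get sine
    pe[:, 1::2] = torch.cos(freqs)   # odd dims get cosine
    return pe

pe = sinusoidal_pe(128, 64)   # added to token embeddings before layer 1
```

Each position gets a unique pattern of sinusoids at geometrically spaced frequencies, which is what lets attention recover relative offsets.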
Rung 04-Self-attention¶
- What: Compute `softmax(QKᵀ / √d) V`, where Q, K, V are linear projections of the input. Each position attends to all others, weighted.
- Why it earns its place: The single most important operation in modern AI. Implementing it is what makes you a transformer engineer.
- Resource: Karpathy Zero to Hero lecture 6 ("Let's build GPT"). Plus The Annotated Transformer (Harvard NLP).
- Done when: You can implement scaled dot-product attention in <30 lines of PyTorch.
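It genuinely fits in well under 30 lines; a sketch assuming PyTorch:

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    q, k, v: (..., seq_len, d).  mask: 0 where attention is forbidden."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)    # (..., seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # weighted mix of values

q = k = v = torch.randn(2, 5, 8)                       # (batch, seq, d)
out = attention(q, k, v)                               # (2, 5, 8)
```

The `√d` scaling keeps the dot products from growing with dimension, which would otherwise saturate the softmax.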
Rung 05-Multi-head attention¶
- What: Multiple attention "heads" run in parallel with different projections; outputs are concatenated.
- Why it earns its place: All real transformers are multi-head. Different heads learn different things.
- Resource: Same as rung 04. Plus visualization tools like `bertviz` to see what heads attend to.
- Done when: You can implement multi-head attention as either a loop over heads or (efficiently) a single reshaped matmul.
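The efficient single-matmul version is mostly reshaping; a minimal sketch assuming PyTorch (the class and its layout mirror nanoGPT-style code, not any library API):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """All heads computed in one batched matmul, then concatenated."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # Q, K, V in one projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, d_head): split channels into heads
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        att = att.softmax(dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)  # re-concat heads
        return self.proj(out)

mha = MultiHeadAttention(d_model=32, n_heads=4)
y = mha(torch.randn(2, 10, 32))   # shape preserved: (2, 10, 32)
```

Writing the loop-over-heads version first and checking it matches this one is a good exercise.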
Rung 06-Causal masking¶
- What: In a decoder-only LM, position `i` must not attend to positions `> i`. Implemented by setting future-position scores to `-inf` before softmax.
- Why it earns its place: Without causal masking, the model cheats during training, and inference is broken.
- Resource: Karpathy nanoGPT-read the masking code.
- Done when: You can implement causal masking from scratch and explain why it makes training parallelizable across positions.
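A minimal sketch of the mask itself, assuming PyTorch: a lower-triangular matrix marks the allowed positions, and everything above the diagonal is set to `-inf` before the softmax.

```python
import torch

T = 4
scores = torch.randn(T, T)   # raw attention scores for one head
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # True on/below diagonal
scores = scores.masked_fill(~mask, float("-inf"))       # forbid the future
weights = scores.softmax(dim=-1)
# Row i now has nonzero weight only on columns 0..i, so every position's
# next-token prediction uses only the past. That is why one forward pass
# can train all T positions in parallel.
```

Row 0 can only attend to itself, so its single weight is exactly 1.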
Rung 07-The transformer block¶
- What: Attention → residual + LayerNorm → MLP → residual + LayerNorm. Modern variants use pre-norm (LayerNorm before attention) and RMSNorm.
- Why it earns its place: The repeated unit. Stack 12+ of these = GPT-2 small. Stack hundreds = a frontier model.
- Resource: The Annotated Transformer. Plus nanoGPT's `model.py`.
- Done when: You can implement a transformer block as an `nn.Module` and stack it to make a working LM.
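A sketch of a pre-norm block as an `nn.Module`, assuming PyTorch; it leans on `nn.MultiheadAttention` for brevity, where a full from-scratch version would use your rung-05 implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: x = x + attn(norm(x)); x = x + mlp(norm(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # conventional 4x expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Bool mask: True above the diagonal = masked (causal)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + a                      # residual around attention
        x = x + self.mlp(self.ln2(x))  # residual around MLP
        return x

x = torch.randn(2, 16, 64)
y = Block(d_model=64, n_heads=4)(x)   # shape preserved: (2, 16, 64)
```

Because input and output shapes match, blocks stack with a plain `nn.Sequential` or a list in a loop.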
Rung 08-Training a small LM end-to-end¶
- What: Tokenize a corpus, build a `Dataset`, batch with `DataLoader`, train with cross-entropy on next-token prediction.
- Why it earns its place: The capstone of foundations. Once done, the rest of the year is variations and applications.
- Resource: Reproduce nanoGPT on Shakespeare or TinyStories (search "TinyStories dataset huggingface").
- Done when: Your model produces coherent (or coherent-ish) text and you've watched the loss go down.
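The shape of the training loop, as a sketch assuming PyTorch. A toy embedding-plus-linear model stands in for the real transformer here (so the random "corpus" won't produce coherent text); swap in your stacked blocks and a real tokenized corpus and the loop is unchanged:

```python
import torch
import torch.nn as nn

vocab_size, d_model, block_size = 50, 32, 8
# Stand-in for a transformer LM; the loop below is identical for the real thing.
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

data = torch.randint(0, vocab_size, (1000,))   # stand-in tokenized corpus
losses = []
for step in range(50):
    ix = torch.randint(0, len(data) - block_size - 1, (16,))
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # shifted by one
    logits = model(x)                                                # (B, T, vocab)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The key idea is the one-position shift between `x` and `y`: every position is simultaneously a training example for predicting its successor.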
Rung 09-Inference: greedy, top-k, top-p, temperature¶
- What: Sampling strategies during generation.
- Why it earns its place: Production decoding parameters matter enormously for output quality.
- Resource: Hugging Face blog "How to generate text" (search "huggingface how to generate"). Plus the nucleus sampling paper (arxiv.org/abs/1904.09751).
- Done when: You can implement top-k and top-p sampling, vary temperature, and observe the effects on output diversity.
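All three knobs fit in one small function; a sketch assuming PyTorch (the `sample` helper is my own, roughly following the logic in the Hugging Face generation code):

```python
import torch

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token ID from a 1-D logits vector."""
    logits = logits / temperature              # <1 sharpens, >1 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_p, idx = logits.softmax(-1).sort(descending=True)
        cum = sorted_p.cumsum(-1)
        # Drop tokens once cumulative mass (excluding self) exceeds top_p,
        # so the most probable token always survives.
        drop_sorted = cum - sorted_p > top_p
        drop = torch.zeros_like(drop_sorted).scatter(0, idx, drop_sorted)
        logits = logits.masked_fill(drop, float("-inf"))
    return torch.multinomial(logits.softmax(-1), 1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
tok = sample(logits, top_k=2)   # only tokens 0 and 1 can ever be drawn
```

Greedy decoding is the `top_k=1` special case; run the function repeatedly at different temperatures to see the diversity effect directly.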
Rung 10-Scaling and architecture variations¶
- What: Encoder-only (BERT), decoder-only (GPT, Llama), encoder-decoder (T5). Mixture-of-Experts (MoE). Sparse attention.
- Why it earns its place: Reading SOTA papers requires recognizing these families.
- Resource: Sebastian Raschka's "LLMs from Scratch" book + blog posts (search "raschka LLMs from scratch"). Plus survey: "A Survey of Large Language Models" (arxiv.org/abs/2303.18223).
- Done when: You can sketch the differences between BERT, GPT, T5, and a MoE model.
Rung 11-Modern efficiency techniques (read-only depth)¶
- What: FlashAttention, KV cache, grouped-query attention (GQA), sliding-window attention, RoPE scaling.
- Why it earns its place: These are why modern LLMs are fast enough to use. You don't have to implement them, but you have to know what they do.
- Resource: FlashAttention paper (arxiv.org/abs/2205.14135). Llama 2 paper for GQA. Mistral paper for sliding-window.
- Done when: You can explain in 3 sentences each what FlashAttention, GQA, and KV-cache do.
Minimum required to leave this sequence¶
- Implement BPE on a small corpus.
- Implement scaled dot-product attention from scratch.
- Implement multi-head + causal-masked attention.
- Implement a full transformer block.
- Train a small LM end-to-end and sample from it.
- Implement top-k and top-p sampling.
- Explain the difference between encoder-only and decoder-only transformers.
Going further¶
- Sebastian Raschka-Build a Large Language Model (From Scratch)-book, hands-on, excellent.
- Stanford CS336-Language Modeling from Scratch (free lectures online; intense).
- The Illustrated Transformer by Jay Alammar (jalammar.github.io)-gentle re-read after you implement.
- The Annotated Transformer (Harvard NLP)-code-first walkthrough.
How this sequence connects to the year¶
- Month 3: This sequence IS month 3.
- Month 4 onwards: You'll use HF `transformers` daily, but you'll know what it's doing.
- Month 8: Fine-tuning is just gradient descent on these same blocks. Knowing the architecture lets you debug LoRA targets and freezing strategies.
- Month 9: Inference optimization (rung 11 made deep) is its own track.