08-Transformers¶
Why this matters in the journey¶
The transformer is the architectural foundation of every modern LLM. Implementing one from scratch is the single highest-leverage intellectual move of your year. Once you can implement it, you can read papers, debug training, modify architectures, and reason about why things work. Without it, you remain a black-box user.
The rungs¶
Rung 01-Tokenization¶
- What: Convert text into integer IDs. Modern LLMs use Byte Pair Encoding (BPE) variants like GPT's tiktoken or Llama's SentencePiece.
- Why it earns its place: Most "weird LLM behavior" turns out to be a tokenization quirk. Subword tokenization is also why LLMs handle rare words.
- Resource: Karpathy Zero to Hero lecture on the GPT tokenizer (search "karpathy let's build the gpt tokenizer"). Plus the BPE paper (arxiv.org/abs/1508.07909).
- Done when: You can implement BPE on a small corpus and explain how the merge process works.
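To make the "done when" concrete, here is a minimal sketch of the BPE merge loop in plain Python (names like `bpe_merges` are my own, not from any library): count adjacent symbol pairs across the corpus, merge the most frequent pair into one symbol, repeat.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges on a whitespace-split corpus.
    Each word starts as a tuple of characters; each iteration merges
    the most frequent adjacent pair into a single symbol."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # fuse the pair
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

merges = bpe_merges("low low low lower lowest", 3)
# learns ('l','o'), then ('lo','w'), then ('low','e')
```

Real tokenizers (tiktoken, SentencePiece) work on bytes and add special tokens, but the merge process is exactly this.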
Rung 02-Embeddings¶
- What: Token IDs are looked up in a matrix `E` of shape `(vocab_size, hidden_dim)`. Each token becomes a vector.
- Why it earns its place: The first operation in every LLM. Embedding geometry is also the basis of similarity search and RAG.
- Resource: Karpathy Zero to Hero `makemore` lectures introduce embeddings; nanoGPT shows them in production form.
- Done when: You can implement an `nn.Embedding` lookup by hand using just indexing and a parameter matrix.
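The by-hand version really is just row indexing into a parameter matrix; a short sketch assuming PyTorch:

```python
import torch

vocab_size, hidden_dim = 10, 4
E = torch.nn.Parameter(torch.randn(vocab_size, hidden_dim))

token_ids = torch.tensor([[1, 5, 5, 2]])   # (batch, seq_len)
vectors = E[token_ids]                     # fancy indexing -> (batch, seq_len, hidden_dim)

# Same result as the library module when given identical weights:
emb = torch.nn.Embedding(vocab_size, hidden_dim)
emb.weight = E
assert torch.equal(emb(token_ids), vectors)
```

Note that repeated token IDs (the two 5s above) fetch the same row, so identical tokens start with identical vectors.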
Rung 03-Positional encodings¶
- What: Transformers have no inherent notion of order. Position is injected via sinusoidal embeddings (original), learned positional embeddings (GPT-2), RoPE (rotary, used in Llama), or ALiBi.
- Why it earns its place: Long-context behavior, context-length extension, and position-bias bugs all trace back to positional encoding.
- Resource: Original transformer paper (sec 3.5). RoPE paper (arxiv.org/abs/2104.09864). Excellent blog: search "Eleuther rope".
- Done when: You can implement sinusoidal positional encoding from scratch and explain RoPE conceptually.
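A from-scratch sketch of the sinusoidal variant (sec 3.5 of the original paper), assuming PyTorch; the function name is my own:

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    freqs = pos / (10000 ** (i / d_model))                          # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(freqs)   # even dims get sine
    pe[:, 1::2] = torch.cos(freqs)   # odd dims get cosine
    return pe

pe = sinusoidal_pe(128, 64)   # added to token embeddings before layer 1
```

Each position gets a unique pattern of sinusoids at geometrically spaced frequencies, which is what lets attention recover relative offsets.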
Rung 04-Self-attention¶
- What: Compute `softmax(QKᵀ / √d) V`, where Q, K, V are linear projections of the input. Each position attends to all others, weighted.
- Why it earns its place: The single most important operation in modern AI. Implementing it is what makes you a transformer engineer.
- Resource: Karpathy Zero to Hero lecture 6 ("Let's build GPT"). Plus The Annotated Transformer (Harvard NLP).
- Done when: You can implement scaled dot-product attention in <30 lines of PyTorch.
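It genuinely fits in well under 30 lines; a sketch assuming PyTorch:

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    q, k, v: (..., seq_len, d).  mask: 0 where attention is forbidden."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)    # (..., seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # weighted mix of values

q = k = v = torch.randn(2, 5, 8)                       # (batch, seq, d)
out = attention(q, k, v)                               # (2, 5, 8)
```

The `√d` scaling keeps the dot products from growing with dimension, which would otherwise saturate the softmax.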
Rung 05-Multi-head attention¶
- What: Multiple attention "heads" run in parallel with different projections; outputs are concatenated.
- Why it earns its place: All real transformers are multi-head. Different heads learn different things.
- Resource: Same as rung 04. Plus visualization tools like `bertviz` to see what heads attend to.
- Done when: You can implement multi-head attention as either a loop over heads or (efficiently) a single reshaped matmul.
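The efficient single-matmul version is mostly reshaping; a minimal sketch assuming PyTorch (the class and its layout mirror nanoGPT-style code, not any library API):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """All heads computed in one batched matmul, then concatenated."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # Q, K, V in one projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, d_head): split channels into heads
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        att = att.softmax(dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)  # re-concat heads
        return self.proj(out)

mha = MultiHeadAttention(d_model=32, n_heads=4)
y = mha(torch.randn(2, 10, 32))   # shape preserved: (2, 10, 32)
```

Writing the loop-over-heads version first and checking it matches this one is a good exercise.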
Rung 06-Causal masking¶
- What: In a decoder-only LM, position `i` must not attend to positions `> i`. Implemented by setting future-position scores to `-inf` before softmax.
- Why it earns its place: Without causal masking, the model cheats during training, and inference is broken.
- Resource: Karpathy nanoGPT-read the masking code.
- Done when: You can implement causal masking from scratch and explain why it makes training parallelizable across positions.
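A minimal sketch of the mask itself, assuming PyTorch: a lower-triangular matrix marks the allowed positions, and everything above the diagonal is set to `-inf` before the softmax.

```python
import torch

T = 4
scores = torch.randn(T, T)   # raw attention scores for one head
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # True on/below diagonal
scores = scores.masked_fill(~mask, float("-inf"))       # forbid the future
weights = scores.softmax(dim=-1)
# Row i now has nonzero weight only on columns 0..i, so every position's
# next-token prediction uses only the past. That is why one forward pass
# can train all T positions in parallel.
```

Row 0 can only attend to itself, so its single weight is exactly 1.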
Rung 07-The transformer block¶
- What: Attention → residual + LayerNorm → MLP → residual + LayerNorm. Modern variants use pre-norm (LayerNorm before attention) and RMSNorm.
- Why it earns its place: The repeated unit. Stack 12+ of these = GPT-2 small. Stack hundreds = a frontier model.
- Resource: The Annotated Transformer. Plus nanoGPT's `model.py`.
- Done when: You can implement a transformer block as an `nn.Module` and stack it to make a working LM.
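A sketch of a pre-norm block as an `nn.Module`, assuming PyTorch; it leans on `nn.MultiheadAttention` for brevity, where a full from-scratch version would use your rung-05 implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: x = x + attn(norm(x)); x = x + mlp(norm(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # conventional 4x expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Bool mask: True above the diagonal = masked (causal)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + a                      # residual around attention
        x = x + self.mlp(self.ln2(x))  # residual around MLP
        return x

x = torch.randn(2, 16, 64)
y = Block(d_model=64, n_heads=4)(x)   # shape preserved: (2, 16, 64)
```

Because input and output shapes match, blocks stack with a plain `nn.Sequential` or a list in a loop.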
Rung 08-Training a small LM end-to-end¶
- What: Tokenize a corpus, build a `Dataset`, batch with `DataLoader`, train with cross-entropy on next-token prediction.
- Why it earns its place: The capstone of foundations. Once done, the rest of the year is variations and applications.
- Resource: Reproduce nanoGPT on Shakespeare or TinyStories (search "TinyStories dataset huggingface").
- Done when: Your model produces coherent (or coherent-ish) text and you've watched the loss go down.
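The shape of the training loop, as a sketch assuming PyTorch. A toy embedding-plus-linear model stands in for the real transformer here (so the random "corpus" won't produce coherent text); swap in your stacked blocks and a real tokenized corpus and the loop is unchanged:

```python
import torch
import torch.nn as nn

vocab_size, d_model, block_size = 50, 32, 8
# Stand-in for a transformer LM; the loop below is identical for the real thing.
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

data = torch.randint(0, vocab_size, (1000,))   # stand-in tokenized corpus
losses = []
for step in range(50):
    ix = torch.randint(0, len(data) - block_size - 1, (16,))
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # shifted by one
    logits = model(x)                                                # (B, T, vocab)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The key idea is the one-position shift between `x` and `y`: every position is simultaneously a training example for predicting its successor.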
Rung 09-Inference: greedy, top-k, top-p, temperature¶
- What: Sampling strategies during generation.
- Why it earns its place: Production decoding parameters matter enormously for output quality.
- Resource: Hugging Face blog "How to generate text" (search "huggingface how to generate"). Plus the nucleus sampling paper (arxiv.org/abs/1904.09751).
- Done when: You can implement top-k and top-p sampling, vary temperature, and observe the effects on output diversity.
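All three knobs fit in one small function; a sketch assuming PyTorch (the `sample` helper is my own, roughly following the logic in the Hugging Face generation code):

```python
import torch

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token ID from a 1-D logits vector."""
    logits = logits / temperature              # <1 sharpens, >1 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_p, idx = logits.softmax(-1).sort(descending=True)
        cum = sorted_p.cumsum(-1)
        # Drop tokens once cumulative mass (excluding self) exceeds top_p,
        # so the most probable token always survives.
        drop_sorted = cum - sorted_p > top_p
        drop = torch.zeros_like(drop_sorted).scatter(0, idx, drop_sorted)
        logits = logits.masked_fill(drop, float("-inf"))
    return torch.multinomial(logits.softmax(-1), 1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
tok = sample(logits, top_k=2)   # only tokens 0 and 1 can ever be drawn
```

Greedy decoding is the `top_k=1` special case; run the function repeatedly at different temperatures to see the diversity effect directly.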
Rung 10-Scaling and architecture variations¶
- What: Encoder-only (BERT), decoder-only (GPT, Llama), encoder-decoder (T5). Mixture-of-Experts (MoE). Sparse attention.
- Why it earns its place: Reading SOTA papers requires recognizing these families.
- Resource: Sebastian Raschka's "LLMs from Scratch" book + blog posts (search "raschka LLMs from scratch"). Plus survey: "A Survey of Large Language Models" (arxiv.org/abs/2303.18223).
- Done when: You can sketch the differences between BERT, GPT, T5, and a MoE model.
Rung 11-Modern efficiency techniques (read-only depth)¶
- What: FlashAttention, KV cache, grouped-query attention (GQA), sliding-window attention, RoPE scaling.
- Why it earns its place: These are why modern LLMs are fast enough to use. You don't have to implement them, but you have to know what they do.
- Resource: FlashAttention paper (arxiv.org/abs/2205.14135). Llama 2 paper for GQA. Mistral paper for sliding-window.
- Done when: You can explain in 3 sentences each what FlashAttention, GQA, and KV-cache do.
Minimum required to leave this sequence¶
- Implement BPE on a small corpus.
- Implement scaled dot-product attention from scratch.
- Implement multi-head + causal-masked attention.
- Implement a full transformer block.
- Train a small LM end-to-end and sample from it.
- Implement top-k and top-p sampling.
- Explain the difference between encoder-only and decoder-only transformers.
Going further¶
- Sebastian Raschka-Build a Large Language Model (From Scratch)-book, hands-on, excellent.
- Stanford CS336-Language Modeling from Scratch (free lectures online; intense).
- The Illustrated Transformer by Jay Alammar (jalammar.github.io)-gentle re-read after you implement.
- The Annotated Transformer (Harvard NLP)-code-first walkthrough.
How this sequence connects to the year¶
- Month 3: This sequence IS month 3.
- Month 4 onwards: You'll use HF `transformers` daily, but you'll know what it's doing.
- Month 8: Fine-tuning is just gradient descent on these same blocks. Knowing the architecture lets you debug LoRA targets and freezing strategies.
- Month 9: Inference optimization (rung 11 made deep) is its own track.