07-Deep Learning
Why this matters in the journey
Transformers are deep neural networks. To debug a transformer you need DL fundamentals: what an MLP is, why we have nonlinearities, what initialization does, why batch/layer norm helps, what residual connections solve, why dropout works, and what failure modes (vanishing/exploding gradients, dead ReLUs) look like. This sequence is the bridge between classical ML and transformers.
The rungs
Rung 01-The multilayer perceptron (MLP)
- What: A stack of `Linear` → activation → `Linear` → activation → ... layers. A universal approximator.
- Why it earns its place: The feedforward block in every transformer is an MLP. Understanding MLPs deeply makes transformers half-understood already.
- Resource: Karpathy Zero to Hero lecture 3 (`makemore` MLP). Or 3Blue1Brown NN series episode 1.
- Done when: You can implement an MLP for MNIST in PyTorch and reach >95% accuracy (a starting point is sketched after this list).
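A minimal PyTorch sketch of such a stack for MNIST-shaped inputs; the layer widths here are illustrative assumptions, not tuned values, and the actual MNIST loading and training loop are left to you:

```python
import torch
import torch.nn as nn

# Minimal MLP for MNIST-shaped inputs (28*28 = 784 features, 10 classes).
# Layer widths are illustrative, not tuned.
model = nn.Sequential(
    nn.Flatten(),        # (B, 1, 28, 28) -> (B, 784)
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # raw logits; pair with nn.CrossEntropyLoss
)

x = torch.randn(32, 1, 28, 28)  # fake batch standing in for MNIST
logits = model(x)               # shape: (32, 10)
print(logits.shape)
```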
Rung 02-Activations: ReLU, GELU, SiLU/Swish
- What: Nonlinearities applied elementwise. Without them, the entire network collapses to a linear map.
- Why it earns its place: GELU and SiLU are the activations of choice in modern transformers. ReLU is in everything else.
- Resource: Skim the GELU paper (arxiv.org/abs/1606.08415).
- Done when: You can plot ReLU, GELU, and SiLU (see the snippet below) and explain why GELU's smoothness is preferred for transformers.
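A small sketch using PyTorch's built-ins to compare the three; swap the print loop for matplotlib to get the actual plot:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)
relu = F.relu(x)   # max(0, x): hard kink at zero
gelu = F.gelu(x)   # x * Phi(x): smooth everywhere
silu = F.silu(x)   # x * sigmoid(x), a.k.a. Swish

# For x < 0, ReLU is exactly zero (gradient dead), while GELU and SiLU
# stay small but nonzero and differentiable: that is the smoothness
# the rung's done-when asks about.
for xi, r, g, s in zip(x.tolist(), relu.tolist(), gelu.tolist(), silu.tolist()):
    print(f"x={xi:+.1f}  relu={r:+.3f}  gelu={g:+.3f}  silu={s:+.3f}")
```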
Rung 03-Initialization
- What: How you set initial weights matters. Xavier/Glorot for tanh/sigmoid, Kaiming/He for ReLU, scaled init for transformers.
- Why it earns its place: Bad init = no training. Modern model code carefully scales by `1/√fan_in` for a reason.
- Resource: Karpathy Zero to Hero lectures on initialization (parts of `makemore` 4–5). Plus the He init paper (arxiv.org/abs/1502.01852).
- Done when: You can explain why the initialization standard deviation depends on `fan_in` (a sketch follows this list).
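A sketch of why the std must scale like `1/√fan_in`: a pre-activation sums `fan_in` independent products, so its variance grows with `fan_in` unless the weights shrink to compensate. The widths below are illustrative:

```python
import torch
import torch.nn as nn

fan_in = 512
x = torch.randn(10_000, fan_in)

w_naive = torch.randn(fan_in)                       # std 1
w_he = torch.randn(fan_in) * (2.0 / fan_in) ** 0.5  # Kaiming/He for ReLU

print((x @ w_naive).std())  # ~sqrt(fan_in) ~= 22.6: activations blow up
print((x @ w_he).std())     # ~sqrt(2) ~= 1.4: stable at any width

# The built-in equivalent:
layer = nn.Linear(fan_in, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```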
Rung 04-Normalization: BatchNorm, LayerNorm, RMSNorm
- What: Re-center / re-scale activations within a batch (BN) or within a sample (LN). RMSNorm drops the centering.
- Why it earns its place: Transformers use LayerNorm. Llama-family models use RMSNorm. Knowing the difference and why it matters is essential.
- Resource: Skim the original BN paper (arxiv.org/abs/1502.03167) and the LN paper (arxiv.org/abs/1607.06450). Plus a clean explanation in The Annotated Transformer.
- Done when: You can explain why LayerNorm is preferred for variable-length sequences (a from-scratch sketch follows this list).
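A from-scratch sketch of both norms, with the learnable gain/bias omitted for brevity. The point: every statistic comes from the feature dimension of a single sample, never from the batch, which is why sequence length and batch size don't matter:

```python
import torch

def layer_norm(x, eps=1e-5):
    # Statistics over the last (feature) dim of each sample independently:
    # no batch statistics involved.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: same idea minus the re-centering (no mean subtraction).
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(2, 7, 64)  # (batch, seq_len, d_model); lengths may vary
print(layer_norm(x).shape, rms_norm(x).shape)  # shapes unchanged
```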
Rung 05-Residual connections
- What: `output = layer(x) + x`. The skip path lets gradients flow directly through.
- Why it earns its place: Without residual connections, deep networks don't train. Every transformer block has them.
- Resource: ResNet paper (arxiv.org/abs/1512.03385).
- Done when: You can explain why residuals address the vanishing gradient problem (a minimal block is sketched below).
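A minimal sketch of a residual block, here in the pre-norm arrangement common in modern transformers; widths and depth are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Pre-norm residual block: the `+ x` skip gives gradients an
    # identity path back to earlier layers.
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))  # output = layer(x) + x

deep = nn.Sequential(*[ResidualBlock(64) for _ in range(12)])
print(deep(torch.randn(8, 64)).shape)  # 12 blocks deep, still trainable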
Rung 06-Optimizers: SGD, Adam, AdamW
- What: SGD follows the raw gradient; Adam adapts per-parameter learning rates; AdamW decouples weight decay from the gradient update.
- Why it earns its place: AdamW is the optimizer for nearly all LLMs. Wrong optimizer = wrong loss curve.
- Resource: Adam (arxiv.org/abs/1412.6980), AdamW (arxiv.org/abs/1711.05101).
- Done when: You can explain the difference between Adam's L2 regularization and AdamW's decoupled weight decay (both configured below).
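Both optimizers share a signature in `torch.optim`; the difference is where the `weight_decay` term acts. Hyperparameter values below are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)

# Adam: weight_decay enters the gradient as an L2 penalty, so it gets
# rescaled by the adaptive per-parameter step sizes.
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW: weights are shrunk directly by lr * weight_decay each step,
# outside the adaptive-moment machinery ("decoupled").
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```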
Rung 07-Learning rate schedules
- What: Constant, warmup, cosine decay, linear decay. Modern LLM training uses warmup + cosine.
- Why it earns its place: Wrong schedule = unstable training or undertrained model.
- Resource: The Hugging Face `get_scheduler` source. Plus the Training Compute-Optimal LLMs paper (Chinchilla, arxiv.org/abs/2203.15556) for context.
- Done when: You can plot a warmup-then-cosine schedule and explain its parts (a sketch follows this list).
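A self-contained sketch of the schedule's shape; all parameter values are made up, and Hugging Face's `get_scheduler` implements the same idea:

```python
import math

def warmup_cosine_lr(step, max_lr=3e-4, min_lr=3e-5,
                     warmup_steps=2_000, total_steps=100_000):
    """Illustrative warmup-then-cosine schedule; parameter values made up."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # decays 1 -> 0
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(step, f"{warmup_cosine_lr(step):.2e}")
```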
Rung 08-Regularization in DL: dropout, weight decay, label smoothing
- What: Dropout randomly zeros activations. Weight decay shrinks weights. Label smoothing softens hard targets.
- Why it earns its place: Each shows up in training scripts. Knowing which one to reach for is judgment.
- Resource: Goodfellow chapter 7. Plus the original dropout paper.
- Done when: You can explain what dropout does at train time vs eval time (demonstrated below).
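The train/eval contrast is easy to see directly. PyTorch uses inverted dropout: survivors are scaled up at train time so no rescaling is needed at eval time:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # ~half the entries zeroed; survivors scaled by 1/(1-p)=2

drop.eval()
print(drop(x))  # identity: dropout is a no-op at eval time
```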
Rung 09-Convolutions and CNNs (light touch for breadth)
- What: Local connectivity, weight sharing, pooling. ImageNet-era architecture.
- Why it earns its place: You'll encounter ConvNeXt, ViT comparisons, and multimodal architectures (CLIP, etc.) where conv intuition helps.
- Resource: fast.ai or Stanford CS231n (free lectures online).
- Done when: You can explain why a CNN has many fewer parameters than an MLP for images (the count is worked below).
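A quick way to see the gap that weight sharing creates; the fully connected count is computed rather than allocated, since such a layer would not fit in memory:

```python
import torch.nn as nn

# A 3x3 conv from 3 to 64 channels has the same weight count no matter
# how large the image is, because the kernel is shared across positions.
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(f"conv params: {sum(p.numel() for p in conv.parameters()):,}")  # 1,792

# The fully connected equivalent for a 224x224 RGB image needs one weight
# per (input pixel, output unit) pair (computed, not allocated):
in_features = 3 * 224 * 224    # 150,528
out_features = 64 * 224 * 224  # 3,211,264
print(f"fc params:   {in_features * out_features:,}")  # ~4.8e11
```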
Rung 10-Failure modes and how to diagnose them
- What: Vanishing/exploding gradients, dead ReLUs, loss NaNs, mode collapse.
- Why it earns its place: Every long training run hits one of these. Diagnosis is half of training.
- Resource: Andrej Karpathy's "A Recipe for Training Neural Networks" blog post (search "karpathy training recipe").
- Done when: You can list 5 things to check when loss goes to NaN (a starter checklist follows this list).
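A starter version of that checklist as code: a hypothetical helper (not a library API) to call right after `loss.backward()`. Other usual suspects to rule out: a too-high learning rate, fp16 overflow, bad input batches, and missing gradient clipping (`torch.nn.utils.clip_grad_norm_`):

```python
import torch

def nan_checklist(model, loss):
    # A few of the things to check when loss goes to NaN; call this
    # right after loss.backward(). Hypothetical helper, not a library API.
    if not torch.isfinite(loss):
        print("loss is NaN/Inf: check LR, fp16 overflow, loss-fn edge cases")
    total_sq = 0.0
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite weights in {name}")
        if p.grad is not None:
            if not torch.isfinite(p.grad).all():
                print(f"non-finite grad in {name}")
            total_sq += p.grad.pow(2).sum().item()
    print(f"global grad norm: {total_sq ** 0.5:.3e}")  # spikes precede NaNs

model = torch.nn.Linear(4, 4)
loss = model(torch.randn(2, 4)).sum()
loss.backward()
nan_checklist(model, loss)
```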
Minimum required to leave this sequence
- Implement an MLP on a real dataset and tune it to a target accuracy.
- Explain ReLU vs GELU vs SiLU.
- Implement LayerNorm from scratch.
- Build a model with residual connections and explain why they help.
- Configure AdamW with a warmup-cosine schedule.
- Diagnose at least one training failure (e.g., NaN loss) and fix it.
Going further
- Deep Learning (Goodfellow, Bengio, Courville; free online)-chapters 6–8.
- Stanford CS231n-free lectures, classic.
- Karpathy "Recipe for Training Neural Networks"-must-read.
How this sequence connects to the year
- Months 2–3: This is the bulk of what month 2 covers and what makes month 3 (transformers) feasible.
- Month 8: Fine-tuning a model is just deep learning with smaller learning rates and fewer steps. Same diagnostics apply.