07-Deep Learning
Why this matters in the journey
Transformers are deep neural networks. To debug a transformer you need DL fundamentals: what an MLP is, why we have nonlinearities, what initialization does, why batch/layer norm helps, what residual connections solve, why dropout works, and what failure modes (vanishing/exploding gradients, dead ReLUs) look like. This sequence is the bridge between classical ML and transformers.
The rungs
Rung 01-The multilayer perceptron (MLP)
- What: A stack of `Linear` → activation → `Linear` → activation → ... layers. A universal approximator.
- Why it earns its place: The feedforward block in every transformer is an MLP. Understanding MLPs deeply makes transformers half-understood already.
- Resource: Karpathy Zero to Hero lecture 3 (`makemore` MLP). Or 3Blue1Brown NN series episode 1.
- Done when: You can implement an MLP for MNIST in PyTorch and reach >95% accuracy (a starting point is sketched after this list).
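A minimal PyTorch sketch of such a stack for MNIST-shaped inputs; the layer widths here are illustrative assumptions, not tuned values, and the actual MNIST loading and training loop are left to you:

```python
import torch
import torch.nn as nn

# Minimal MLP for MNIST-shaped inputs (28*28 = 784 features, 10 classes).
# Layer widths are illustrative, not tuned.
model = nn.Sequential(
    nn.Flatten(),        # (B, 1, 28, 28) -> (B, 784)
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # raw logits; pair with nn.CrossEntropyLoss
)

x = torch.randn(32, 1, 28, 28)  # fake batch standing in for MNIST
logits = model(x)               # shape: (32, 10)
print(logits.shape)
```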
Rung 02-Activations: ReLU, GELU, SiLU/Swish
- What: Nonlinearities applied elementwise. Without them, the entire network collapses to a linear map.
- Why it earns its place: GELU and SiLU are the activations of choice in modern transformers. ReLU is in everything else.
- Resource: Skim the GELU paper (arxiv.org/abs/1606.08415).
- Done when: You can plot ReLU, GELU, and SiLU (see the snippet below) and explain why GELU's smoothness is preferred for transformers.
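A small sketch using PyTorch's built-ins to compare the three; swap the print loop for matplotlib to get the actual plot:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)
relu = F.relu(x)   # max(0, x): hard kink at zero
gelu = F.gelu(x)   # x * Phi(x): smooth everywhere
silu = F.silu(x)   # x * sigmoid(x), a.k.a. Swish

# For x < 0, ReLU is exactly zero (gradient dead), while GELU and SiLU
# stay small but nonzero and differentiable: that is the smoothness
# the rung's done-when asks about.
for xi, r, g, s in zip(x.tolist(), relu.tolist(), gelu.tolist(), silu.tolist()):
    print(f"x={xi:+.1f}  relu={r:+.3f}  gelu={g:+.3f}  silu={s:+.3f}")
```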
Rung 03-Initialization
- What: How you set initial weights matters. Xavier/Glorot for tanh/sigmoid, Kaiming/He for ReLU, scaled init for transformers.
- Why it earns its place: Bad init = no training. Modern model code carefully scales by `1/√fan_in` for a reason.
- Resource: Karpathy Zero to Hero lectures on initialization (parts of `makemore` 4–5). Plus the He init paper (arxiv.org/abs/1502.01852).
- Done when: You can explain why the initialization standard deviation depends on `fan_in` (a sketch follows this list).
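A sketch of why the std must scale like `1/√fan_in`: a pre-activation sums `fan_in` independent products, so its variance grows with `fan_in` unless the weights shrink to compensate. The widths below are illustrative:

```python
import torch
import torch.nn as nn

fan_in = 512
x = torch.randn(10_000, fan_in)

w_naive = torch.randn(fan_in)                       # std 1
w_he = torch.randn(fan_in) * (2.0 / fan_in) ** 0.5  # Kaiming/He for ReLU

print((x @ w_naive).std())  # ~sqrt(fan_in) ~= 22.6: activations blow up
print((x @ w_he).std())     # ~sqrt(2) ~= 1.4: stable at any width

# The built-in equivalent:
layer = nn.Linear(fan_in, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```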
Rung 04-Normalization: BatchNorm, LayerNorm, RMSNorm
- What: Re-center / re-scale activations within a batch (BN) or within a sample (LN). RMSNorm drops the centering.
- Why it earns its place: Transformers use LayerNorm. Llama-family models use RMSNorm. Knowing the difference and why it matters is essential.
- Resource: Skim the original BN paper (arxiv.org/abs/1502.03167) and the LN paper (arxiv.org/abs/1607.06450). Plus a clean explanation in The Annotated Transformer.
- Done when: You can explain why LayerNorm is preferred for variable-length sequences (a from-scratch sketch follows this list).
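A from-scratch sketch of both norms, with the learnable gain/bias omitted for brevity. The point: every statistic comes from the feature dimension of a single sample, never from the batch, which is why sequence length and batch size don't matter:

```python
import torch

def layer_norm(x, eps=1e-5):
    # Statistics over the last (feature) dim of each sample independently:
    # no batch statistics involved.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: same idea minus the re-centering (no mean subtraction).
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(2, 7, 64)  # (batch, seq_len, d_model); lengths may vary
print(layer_norm(x).shape, rms_norm(x).shape)  # shapes unchanged
```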
Rung 05-Residual connections
- What: `output = layer(x) + x`. The skip path lets gradients flow directly through.
- Why it earns its place: Without residual connections, deep networks don't train. Every transformer block has them.
- Resource: ResNet paper (arxiv.org/abs/1512.03385).
- Done when: You can explain why residuals address the vanishing gradient problem (a minimal block is sketched below).
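A minimal sketch of a residual block, here in the pre-norm arrangement common in modern transformers; widths and depth are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Pre-norm residual block: the `+ x` skip gives gradients an
    # identity path back to earlier layers.
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))  # output = layer(x) + x

deep = nn.Sequential(*[ResidualBlock(64) for _ in range(12)])
print(deep(torch.randn(8, 64)).shape)  # 12 blocks deep, still trainable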
Rung 06-Optimizers: SGD, Adam, AdamW
- What: SGD follows the raw gradient; Adam adapts per-parameter learning rates; AdamW decouples weight decay from the gradient update.
- Why it earns its place: AdamW is the optimizer for nearly all LLMs. Wrong optimizer = wrong loss curve.
- Resource: Adam (arxiv.org/abs/1412.6980), AdamW (arxiv.org/abs/1711.05101).
- Done when: You can explain the difference between Adam's L2 regularization and AdamW's decoupled weight decay (both configured below).
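Both optimizers share a signature in `torch.optim`; the difference is where the `weight_decay` term acts. Hyperparameter values below are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)

# Adam: weight_decay enters the gradient as an L2 penalty, so it gets
# rescaled by the adaptive per-parameter step sizes.
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW: weights are shrunk directly by lr * weight_decay each step,
# outside the adaptive-moment machinery ("decoupled").
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```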
Rung 07-Learning rate schedules
- What: Constant, warmup, cosine decay, linear decay. Modern LLM training uses warmup + cosine.
- Why it earns its place: Wrong schedule = unstable training or undertrained model.
- Resource: The Hugging Face `get_scheduler` source. Plus the Training Compute-Optimal LLMs paper (Chinchilla, arxiv.org/abs/2203.15556) for context.
- Done when: You can plot a warmup-then-cosine schedule and explain its parts (a sketch follows this list).
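A self-contained sketch of the schedule's shape; all parameter values are made up, and Hugging Face's `get_scheduler` implements the same idea:

```python
import math

def warmup_cosine_lr(step, max_lr=3e-4, min_lr=3e-5,
                     warmup_steps=2_000, total_steps=100_000):
    """Illustrative warmup-then-cosine schedule; parameter values made up."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # decays 1 -> 0
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(step, f"{warmup_cosine_lr(step):.2e}")
```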
Rung 08-Regularization in DL: dropout, weight decay, label smoothing
- What: Dropout randomly zeros activations. Weight decay shrinks weights. Label smoothing softens hard targets.
- Why it earns its place: Each shows up in training scripts. Knowing which one to reach for is judgment.
- Resource: Goodfellow chapter 7. Plus the original dropout paper.
- Done when: You can explain what dropout does at train time vs eval time (demonstrated below).
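The train/eval contrast is easy to see directly. PyTorch uses inverted dropout: survivors are scaled up at train time so no rescaling is needed at eval time:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # ~half the entries zeroed; survivors scaled by 1/(1-p)=2

drop.eval()
print(drop(x))  # identity: dropout is a no-op at eval time
```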
Rung 09-Convolutions and CNNs (light touch for breadth)
- What: Local connectivity, weight sharing, pooling. ImageNet-era architecture.
- Why it earns its place: You'll encounter ConvNeXt, ViT comparisons, and multimodal architectures (CLIP, etc.) where conv intuition helps.
- Resource: fast.ai or Stanford CS231n (free lectures online).
- Done when: You can explain why a CNN has many fewer parameters than an MLP for images (the count is worked below).
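A quick way to see the gap that weight sharing creates; the fully connected count is computed rather than allocated, since such a layer would not fit in memory:

```python
import torch.nn as nn

# A 3x3 conv from 3 to 64 channels has the same weight count no matter
# how large the image is, because the kernel is shared across positions.
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(f"conv params: {sum(p.numel() for p in conv.parameters()):,}")  # 1,792

# The fully connected equivalent for a 224x224 RGB image needs one weight
# per (input pixel, output unit) pair (computed, not allocated):
in_features = 3 * 224 * 224    # 150,528
out_features = 64 * 224 * 224  # 3,211,264
print(f"fc params:   {in_features * out_features:,}")  # ~4.8e11
```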
Rung 10-Failure modes and how to diagnose them
- What: Vanishing/exploding gradients, dead ReLUs, loss NaNs, mode collapse.
- Why it earns its place: Every long training run hits one of these. Diagnosis is half of training.
- Resource: Andrej Karpathy's "A Recipe for Training Neural Networks" blog post (search "karpathy training recipe").
- Done when: You can list 5 things to check when loss goes to NaN (a starter checklist follows this list).
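A starter version of that checklist as code: a hypothetical helper (not a library API) to call right after `loss.backward()`. Other usual suspects to rule out: a too-high learning rate, fp16 overflow, bad input batches, and missing gradient clipping (`torch.nn.utils.clip_grad_norm_`):

```python
import torch

def nan_checklist(model, loss):
    # A few of the things to check when loss goes to NaN; call this
    # right after loss.backward(). Hypothetical helper, not a library API.
    if not torch.isfinite(loss):
        print("loss is NaN/Inf: check LR, fp16 overflow, loss-fn edge cases")
    total_sq = 0.0
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite weights in {name}")
        if p.grad is not None:
            if not torch.isfinite(p.grad).all():
                print(f"non-finite grad in {name}")
            total_sq += p.grad.pow(2).sum().item()
    print(f"global grad norm: {total_sq ** 0.5:.3e}")  # spikes precede NaNs

model = torch.nn.Linear(4, 4)
loss = model(torch.randn(2, 4)).sum()
loss.backward()
nan_checklist(model, loss)
```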
Minimum required to leave this sequence
- Implement an MLP on a real dataset and tune it to a target accuracy.
- Explain ReLU vs GELU vs SiLU.
- Implement LayerNorm from scratch.
- Build a model with residual connections and explain why they help.
- Configure AdamW with a warmup-cosine schedule.
- Diagnose at least one training failure (e.g., NaN loss) and fix it.
Going further
- Deep Learning (Goodfellow, Bengio, Courville; free online)-chapters 6–8.
- Stanford CS231n-free lectures, classic.
- Karpathy "Recipe for Training Neural Networks"-must-read.
How this sequence connects to the year
- Months 2–3: This is the bulk of what month 2 covers and what makes month 3 (transformers) feasible.
- Month 8: Fine-tuning a model is just deep learning with smaller learning rates and fewer steps. Same diagnostics apply.