07-Deep Learning

Why this matters in the journey

Transformers are deep neural networks. To debug a transformer you need DL fundamentals: what an MLP is, why we have nonlinearities, what initialization does, why batch/layer norm helps, what residual connections solve, why dropout works, and what failure modes (vanishing/exploding gradients, dead ReLUs) look like. This sequence is the bridge between classical ML and transformers.

The rungs

Rung 01-The multilayer perceptron (MLP)

  • What: Stack of Linear → activation → Linear → activation → ... layers. Universal approximator.
  • Why it earns its place: The feedforward block in every transformer is an MLP. Understand MLPs deeply and you are already halfway to understanding a transformer.
  • Resource: Karpathy Zero to Hero lecture 3 (makemore MLP). Or 3Blue1Brown NN series episode 1.
  • Done when: You can implement an MLP for MNIST in PyTorch and reach >95% accuracy (a minimal sketch of the model follows below).
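
A minimal sketch of such an MLP in PyTorch, assuming torchvision (or any other loader) supplies the real MNIST batches; the hidden width and the fake input batch here are illustrative, not tuned values.

```python
import torch
import torch.nn as nn

# Minimal MLP for 28x28 MNIST digits: flatten -> Linear -> ReLU -> Linear.
class MLP(nn.Module):
    def __init__(self, in_dim=28 * 28, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                 # (B, 1, 28, 28) -> (B, 784)
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)                # raw logits; pair with nn.CrossEntropyLoss

model = MLP()
x = torch.randn(32, 1, 28, 28)            # fake batch standing in for MNIST images
logits = model(x)                          # shape: (32, 10)
```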

Rung 02-Activations: ReLU, GELU, SiLU/Swish

  • What: Nonlinearities applied elementwise. Without them, the entire network collapses to a linear map.
  • Why it earns its place: GELU and SiLU are the activations of choice in modern transformers. ReLU is in everything else.
  • Resource: Skim the GELU paper (arxiv.org/abs/1606.08415).
  • Done when: You can plot ReLU, GELU, and SiLU and explain why GELU's smoothness is preferred for transformers (a plotting sketch follows this list).
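
A quick way to see the three curves side by side, assuming matplotlib is available; the x-range is arbitrary.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# ReLU has a hard kink at 0 and zero gradient for negative inputs;
# GELU and SiLU are smooth and let small negative values through.
x = torch.linspace(-4, 4, 400)
for name, y in [("ReLU", torch.relu(x)), ("GELU", F.gelu(x)), ("SiLU", F.silu(x))]:
    plt.plot(x, y, label=name)
plt.legend()
plt.title("ReLU vs GELU vs SiLU")
plt.show()
```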

Rung 03-Initialization

  • What: How you set initial weights matters. Xavier/Glorot for tanh/sigmoid, Kaiming/He for ReLU, scaled init for transformers.
  • Why it earns its place: Bad init = no training. Modern model code carefully scales by 1/√fan_in for a reason.
  • Resource: Karpathy Zero to Hero lectures on initialization (parts of makemore 4–5). Plus the He init paper (arxiv.org/abs/1502.01852).
  • Done when: You can explain why the initialization standard deviation depends on fan_in (see the demo after this list).
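
A small numerical demo of why the scale matters: a linear layer's output is a sum of fan_in terms, so with unit-variance weights its standard deviation grows like √fan_in unless the weights shrink to match. The tensor sizes below are arbitrary.

```python
import torch

fan_in = 512
x = torch.randn(10_000, fan_in)                 # unit-variance inputs

w_naive  = torch.randn(fan_in)                  # std 1
w_scaled = torch.randn(fan_in) / fan_in**0.5    # std 1/sqrt(fan_in)

print((x @ w_naive).std())    # roughly sqrt(512) ≈ 22.6: activations blow up
print((x @ w_scaled).std())   # roughly 1.0: activations stay well-scaled
```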

Rung 04-Normalization: BatchNorm, LayerNorm, RMSNorm

  • What: Re-center / re-scale activations within a batch (BN) or within a sample (LN). RMSNorm drops the centering. (Both LN and RMSNorm are sketched from scratch after this list.)
  • Why it earns its place: Transformers use LayerNorm. Llama-family models use RMSNorm. Knowing the difference and why it matters is essential.
  • Resource: Skim the original BN paper (arxiv.org/abs/1502.03167) and the LN paper (arxiv.org/abs/1607.06450). Plus a clean explanation in The Annotated Transformer.
  • Done when: You can explain why LayerNorm is preferred for variable-length sequences.
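
A from-scratch sketch of both norms, with the learnable gain (and LayerNorm's bias) omitted for brevity; each sample is normalized over its own feature dimension, which is why the sequence length can vary freely.

```python
import torch

def layer_norm(x, eps=1e-5):
    # Normalize each sample over its last (feature) dimension:
    # subtract the mean, divide by the standard deviation.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm skips the mean subtraction and only rescales by the root mean square.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(2, 5, 16)   # (batch, sequence, features): any sequence length works
print(layer_norm(x).shape, rms_norm(x).shape)
```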

Rung 05-Residual connections

  • What: output = layer(x) + x. The skip path lets gradients flow directly through (sketched after this list).
  • Why it earns its place: Without residual connections, deep networks don't train. Every transformer block has them.
  • Resource: ResNet paper (arxiv.org/abs/1512.03385).
  • Done when: You can explain why residuals address the vanishing gradient problem.
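
A minimal residual block, assuming the sub-layer preserves the feature dimension so the addition is shape-compatible; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # A sub-layer wrapped in a skip connection: output = layer(x) + x.
    # The "+ x" path gives gradients a direct route past the layer,
    # which is why deep stacks of these blocks still train.
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.layer(x) + x

x = torch.randn(8, 64)
block = ResidualBlock()
print(block(x).shape)   # (8, 64): the skip connection requires matching shapes
```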

Rung 06-Optimizers: SGD, Adam, AdamW

  • What: SGD follows the raw gradient; Adam adapts per-parameter learning rates; AdamW decouples weight decay from the adaptive update (see the sketch after this list).
  • Why it earns its place: AdamW is the optimizer for nearly all LLMs. Wrong optimizer = wrong loss curve.
  • Resource: Adam (arxiv.org/abs/1412.6980), AdamW (arxiv.org/abs/1711.05101).
  • Done when: You can explain the difference between Adam's L2 regularization and AdamW's weight decay.
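
How the two are configured in PyTorch; the learning rate and decay strength below are illustrative, not recommendations.

```python
import torch

model = torch.nn.Linear(10, 10)

# Adam with weight_decay adds an L2 penalty to the gradient, so the decay gets
# rescaled by Adam's adaptive per-parameter step sizes. AdamW instead applies
# the decay directly to the weights, decoupled from the gradient-based update.
adam  = torch.optim.Adam(model.parameters(),  lr=3e-4, weight_decay=0.1)
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
```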

Rung 07-Learning rate schedules

  • What: Constant, warmup, cosine decay, linear decay. Modern LLM training uses warmup + cosine.
  • Why it earns its place: Wrong schedule = unstable training or undertrained model.
  • Resource: Hugging Face get_scheduler source. Plus the Training Compute-Optimal LLMs paper (Chinchilla, arxiv.org/abs/2203.15556) for context.
  • Done when: You can plot a warmup-then-cosine schedule and explain its parts (a plotting sketch follows below).
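
One way to plot the schedule without any framework code; the lr_at helper, the peak learning rate, and the step counts below are illustrative stand-ins for whatever a real run would use.

```python
import math
import matplotlib.pyplot as plt

def lr_at(step, peak=3e-4, warmup=1000, total=10000):
    # Linear warmup from 0 to `peak` over the first `warmup` steps,
    # then cosine decay from `peak` down to 0 at `total` steps.
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

steps = list(range(10000))
plt.plot(steps, [lr_at(s) for s in steps])
plt.xlabel("step")
plt.ylabel("learning rate")
plt.show()
```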

Rung 08-Regularization in DL: dropout, weight decay, label smoothing

  • What: Dropout randomly zeros activations. Weight decay shrinks weights. Label smoothing softens hard targets.
  • Why it earns its place: Each shows up in training scripts. Knowing which one to reach for is judgment.
  • Resource: Goodfellow chapter 7. Plus the original dropout paper.
  • Done when: You can explain what dropout does at train time vs eval time (demonstrated in the sketch below).
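
A tiny demonstration of that difference using PyTorch's (inverted) dropout:

```python
import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity: no zeroing and no scaling at eval time
```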

Rung 09-Convolutions and CNNs (light touch for breadth)

  • What: Local connectivity, weight sharing, pooling. ImageNet-era architecture.
  • Why it earns its place: You'll encounter ConvNeXt, ViT comparisons, and multimodal architectures (CLIP, etc.) where conv intuition helps.
  • Resource: fast.ai or Stanford CS231n (free lectures online).
  • Done when: You can explain why a CNN has far fewer parameters than an MLP for images (a quick parameter count follows below).
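
A quick parameter count comparing a single 3×3 conv layer with a single fully connected layer on a 224×224 RGB input; the filter count and output width are arbitrary choices for illustration.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Weight sharing: the conv reuses the same 3x3x3 kernel at every spatial
# location, so its parameter count is independent of the image size.
conv = nn.Conv2d(3, 16, kernel_size=3)    # 16*3*3*3 + 16 = 448 params
fc   = nn.Linear(3 * 224 * 224, 16)       # 150528*16 + 16 ≈ 2.4M params

print(n_params(conv), n_params(fc))
```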

Rung 10-Failure modes and how to diagnose them

  • What: Vanishing/exploding gradients, dead ReLUs, loss NaNs, mode collapse.
  • Why it earns its place: Every long training run hits one of these. Diagnosis is half of training.
  • Resource: Andrej Karpathy's "A Recipe for Training Neural Networks" blog post (search "karpathy training recipe").
  • Done when: You can list five things to check when loss goes to NaN (one per-step sanity check is sketched below).
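
One sketch of a per-step sanity check, meant to run after loss.backward(); the check_step name, the model/loss arguments, and the clipping threshold are placeholders for whatever a real training loop uses.

```python
import torch

def check_step(model, loss, max_norm=1.0):
    # Stop immediately if the loss has already gone non-finite.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    # clip_grad_norm_ returns the global gradient norm before clipping;
    # a spike in this value often precedes a NaN loss by a few steps.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(grad_norm):
        raise RuntimeError("non-finite gradient norm")
    return grad_norm   # worth logging every step
```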

Minimum required to leave this sequence

  • Implement an MLP on a real dataset and tune it to a target accuracy.
  • Explain ReLU vs GELU vs SiLU.
  • Implement LayerNorm from scratch.
  • Build a model with residual connections and explain why.
  • Configure AdamW with a warmup-cosine schedule.
  • Diagnose at least one training failure (e.g., NaN loss) and fix it.

Going further

  • Deep Learning (Goodfellow, Bengio, Courville; free online)-chapters 6–8.
  • Stanford CS231n-free lectures, classic.
  • Karpathy "Recipe for Training Neural Networks"-must-read.

How this sequence connects to the year

  • Months 2–3: This is the bulk of what month 2 covers and what makes month 3 (transformers) feasible.
  • Month 8: Fine-tuning a model is just deep learning with smaller learning rates and fewer steps. Same diagnostics apply.
