
Deep Learning Fundamentals

The bridge between classical ML and transformers. This chapter assumes you have already met linear/logistic regression, gradient descent, and basic linear algebra. It assumes you have not yet met attention. Its job is to make sure that when you open AI_SYSTEMS_PLAN/DEEP_DIVES/07_ATTENTION_TRANSFORMER.md, every term (backprop, LayerNorm, AdamW, warmup, residual, GELU, He init) is something you have already derived, not something you have to take on faith.

Cross-references:

  • /07_ATTENTION_TRANSFORMER.md: attention/transformer math. The architecture you assemble out of the parts in this chapter.
  • /11_NUMERICAL_STABILITY.md: full treatment of mixed precision, FP16/BF16, log-sum-exp tricks. We only touch on it here.

Notation: lowercase bold x would be a vector if we had it; we use plain Unicode and trust context. Shapes are written [d_in, d_out] for matrices and (d,) for vectors. θ is the full parameter set. L is a scalar loss. g = ∂L/∂θ. We write δ = ∂L/∂z for the "error signal" at a pre-activation z.


1. The Neural-Network Setup

A neural network is, structurally, just a parameterized function

f_θ : R^d_in -> R^d_out

built by composing simple pieces. The simplest non-trivial piece is the affine layer

z = W x + b           W ∈ R^{d_out × d_in},  b ∈ R^{d_out}

followed by a nonlinearity σ:

h = σ(z)              σ applied elementwise (mostly)

A network is the chain

f_θ(x) = σ_L( W_L · σ_{L-1}( W_{L-1} · ... σ_1(W_1 x + b_1) ... ) + b_L )

Without the nonlinearities, the whole composition collapses to a single affine map (W_L W_{L-1} ... W_1) and the network is no more expressive than logistic regression. The nonlinearity is what buys us expressiveness; the depth is what buys us efficient expressiveness.

1.1 Universal approximation, in one paragraph

A theorem (Cybenko 1989, Hornik 1991): a feed-forward network with one hidden layer of sufficient width and essentially any non-polynomial nonlinearity can approximate any continuous function on a compact set to arbitrary accuracy. This is reassuring but useless in practice: "sufficient width" can be exponential in d_in. Universal approximation tells us networks can represent whatever we need; it does not tell us they can learn it from data with reasonable amounts of compute.

1.2 Why depth helps

The empirical answer, validated by twenty years of experiments and by parts of theory: depth lets the network reuse intermediate features. There are functions a 2-layer net can represent only with width exponential in the input dimension, while a deep net of polynomial width represents them compactly. Concretely, in a CNN you can see this happen: layer 1 picks up edges, layer 2 corners and textures, layer 3 object parts, layer 4 objects. Each layer composes the previous. A wide-shallow net would need to relearn "edge" for every "corner" it represents.

For transformers the depth-buys-features story is the same: layer 1 might attend to local syntax, deeper layers route information across longer ranges. The chapter you'll read next (/07) shows this concretely.


2. Forward Pass for an MLP

We will work through a 2-layer MLP because every term in transformer training is a generalization of something here.

Setup. Inputs x ∈ R^{d_in} (we'll allow a batch dimension B later). Hidden width d_h. Outputs d_out. ReLU between layers.

z1 = W1 x + b1            W1 : [d_h, d_in],   b1 : (d_h,)        z1 : (d_h,)
h1 = ReLU(z1)             ReLU(z) = max(z, 0)                    h1 : (d_h,)
z2 = W2 h1 + b2           W2 : [d_out, d_h],  b2 : (d_out,)      z2 : (d_out,)
ŷ  = z2                   for regression; or softmax(z2) for classification

For a classification cross-entropy loss with class label y:

p   = softmax(z2),     p_k = exp(z2_k) / Σ_j exp(z2_j)
L   = -log(p_y)

Batched: replace x with X : [B, d_in] and write Z1 = X W1ᵀ + b1, etc. Frameworks store matrices W as [d_out, d_in], so the actual code is Z1 = X @ W1.T + b1. We will keep math in the per-sample form for readability and switch to batched only when shapes matter.

Activation choice in the hidden layer: ReLU is the workhorse. We discuss alternatives in Section 5.

Activation choice in the output layer: depends on the task.

  • Regression with squared loss: identity (no σ).
  • Binary classification with BCE: sigmoid.
  • K-class classification with cross-entropy: softmax.

Choosing the loss-and-output-activation pair correctly produces a beautifully clean gradient (see Section 5.7 / Exercise 1).
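
The whole forward pass fits in a few lines. A minimal sketch in PyTorch (shapes and names follow the setup above; the sizes and seed are arbitrary, and the init rule is previewed from Section 6):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, d_in, d_h, d_out = 4, 8, 16, 3

W1 = torch.randn(d_h, d_in) * (2 / d_in) ** 0.5    # He init, see Section 6.3
b1 = torch.zeros(d_h)
W2 = torch.randn(d_out, d_h) * (2 / d_h) ** 0.5
b2 = torch.zeros(d_out)

X = torch.randn(B, d_in)                  # a batch of inputs
y = torch.randint(0, d_out, (B,))         # class labels

Z1 = X @ W1.T + b1                        # [B, d_h]
H1 = torch.relu(Z1)                       # [B, d_h]
Z2 = H1 @ W2.T + b2                       # [B, d_out], the logits
loss = F.cross_entropy(Z2, y)             # fused softmax + CE, see Section 5.7
print(loss.item())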


3. Backpropagation, Fully Derived

Backprop is the chain rule, applied in reverse, with a memory trick: we cache forward activations so we don't recompute them.

3.1 Chain-rule reminder

If L = f(g(h(x))), then

dL/dx = f'(g(h(x))) · g'(h(x)) · h'(x)

In multiple dimensions, derivatives become Jacobians and · becomes matrix product. The key fact: gradients propagate right-to-left through the same wires that activations flowed left-to-right.

3.2 Backward through a single linear layer

Forward: z = W x + b with W : [d_out, d_in], x : (d_in,), z : (d_out,).

Suppose downstream computation gives us δ = ∂L/∂z ∈ R^{d_out} ("the error signal at the layer's output"). We want three things:

∂L/∂W,    ∂L/∂b,    ∂L/∂x

Gradient w.r.t. b. Since z = Wx + b and ∂z_i/∂b_j = δ_{ij} (the Kronecker delta, not the error signal):

∂L/∂b_j = Σ_i (∂L/∂z_i)(∂z_i/∂b_j) = δ_j      ⇒   ∂L/∂b = δ

Gradient w.r.t. W. z_i = Σ_k W_{ik} x_k + b_i so ∂z_i/∂W_{jk} = δ_{ij} x_k:

∂L/∂W_{jk} = Σ_i δ_i δ_{ij} x_k = δ_j x_k      ⇒   ∂L/∂W = δ xᵀ           [d_out, d_in]

That's an outer product: row j of ∂L/∂W is a copy of x scaled by δ_j.

Gradient w.r.t. x (the signal we pass back to the previous layer):

∂L/∂x_k = Σ_i δ_i W_{ik}                       ⇒   ∂L/∂x = Wᵀ δ            (d_in,)

Three-line summary, memorize this:

∂L/∂W = δ · xᵀ          (outer product)
∂L/∂b = δ
∂L/∂x = Wᵀ · δ          (the upstream signal)

Batched version. With X : [B, d_in], Z = X Wᵀ + b, and upstream δ : [B, d_out]:

∂L/∂W = δᵀ X            [d_out, d_in]
∂L/∂b = sum over batch of δ           (d_out,)
∂L/∂X = δ W             [B, d_in]

You should be able to write these from memory. They are the entire core of backprop.
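
A quick autograd sanity check of the batched triplet (a throwaway sketch, not library code; the sizes are arbitrary):

import torch

B, d_in, d_out = 5, 4, 3
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)
X = torch.randn(B, d_in, requires_grad=True)

Z = X @ W.T + b
delta = torch.randn(B, d_out)            # stand-in for an upstream ∂L/∂Z
Z.backward(delta)                        # autograd applies the chain rule

assert torch.allclose(W.grad, delta.T @ X, atol=1e-5)    # ∂L/∂W = δᵀ X
assert torch.allclose(b.grad, delta.sum(0), atol=1e-5)   # ∂L/∂b = Σ over batch of δ
assert torch.allclose(X.grad, delta @ W, atol=1e-5)      # ∂L/∂X = δ W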

3.3 Backward through a (pointwise) activation

For pointwise h_i = σ(z_i), the Jacobian ∂h/∂z is diagonal with entries σ'(z_i). So if δ_h = ∂L/∂h:

δ_z = δ_h ⊙ σ'(z)        elementwise multiply

For ReLU specifically: σ'(z_i) = 1 if z_i > 0 else 0, so δ_z = δ_h * [z > 0]. This is the only thing you ever do for a ReLU backward: multiply by a mask marking where the pre-activation was positive.

For softmax-which is not pointwise, because the denominator couples all inputs-the Jacobian is full. We derive it in Section 5.7.

3.4 A 2-layer MLP, forward and backward, line by line

Forward (cross-entropy classification):

z1 = W1 x + b1                  (d_h,)
h1 = ReLU(z1)                   (d_h,)
z2 = W2 h1 + b2                 (d_out,)
p  = softmax(z2)                (d_out,)
L  = -log(p_y)

Backward. Start at the loss and walk left.

δ2 = ∂L/∂z2 = p - e_y                              (1)
∂L/∂W2 = δ2 · h1ᵀ                                  (2)
∂L/∂b2 = δ2                                        (3)
δ_h1   = W2ᵀ · δ2                                  (4)
δ1     = δ_h1 ⊙ 1[z1 > 0]                          (5)
∂L/∂W1 = δ1 · xᵀ                                   (6)
∂L/∂b1 = δ1                                        (7)
∂L/∂x  = W1ᵀ · δ1                                  (8)  needed only if x comes from an earlier layer

Line (1) is the famous identity for softmax + cross-entropy: the gradient at the logits is just p - e_y (predicted minus one-hot). We derive this in Section 5.7. It is the cleanest gradient in deep learning.

Lines (2)-(3) and (6)-(7): the "linear-layer triplet" we derived above.

Line (4): pass back through the second linear layer, from output side to input side.

Line (5): pointwise ReLU backward, the only line where information about the forward z1 is consumed.

That's the entire algorithm. Every neural network you will ever train, including a 100-billion-parameter transformer, is some elaboration of this loop.

3.5 Worked numerical example

Two inputs, three hidden units (ReLU), one output (squared loss). Let's compute every gradient by hand.

x  = [1, 2]
W1 = [[ 0.5, -0.3],
      [ 0.1,  0.4],
      [-0.2,  0.2]]                    [3, 2]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, -1.0, 0.5]]                [1, 3]
b2 = [0.0]
y  = 1.0     (target)
L  = 0.5 (ŷ - y)^2

Forward:

z1 = W1 x + b1
   = [0.5·1 + (-0.3)·2, 0.1·1 + 0.4·2, -0.2·1 + 0.2·2]
   = [0.5 - 0.6, 0.1 + 0.8, -0.2 + 0.4]
   = [-0.1, 0.9, 0.2]

h1 = ReLU(z1) = [0.0, 0.9, 0.2]              # the first unit is dead this step

z2 = W2 h1 + b2 = 1.0·0.0 + (-1.0)·0.9 + 0.5·0.2 = -0.9 + 0.1 = -0.8
ŷ  = z2 = -0.8
L  = 0.5 (-0.8 - 1.0)^2 = 0.5 · 3.24 = 1.62

Backward:

δ2 = ∂L/∂z2 = (ŷ - y) = -1.8

∂L/∂W2 = δ2 · h1ᵀ = -1.8 · [0.0, 0.9, 0.2] = [0.0, -1.62, -0.36]
∂L/∂b2 = -1.8

δ_h1 = W2ᵀ · δ2 = [1.0, -1.0, 0.5]ᵀ · -1.8 = [-1.8, 1.8, -0.9]

mask = [z1 > 0] = [0, 1, 1]
δ1   = δ_h1 ⊙ mask = [0, 1.8, -0.9]

∂L/∂W1 = δ1 · xᵀ
       = [0, 1.8, -0.9]ᵀ · [1, 2]
       = [[0·1, 0·2],
          [1.8·1, 1.8·2],
          [-0.9·1, -0.9·2]]
       = [[0, 0], [1.8, 3.6], [-0.9, -1.8]]

∂L/∂b1 = [0, 1.8, -0.9]
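
If you want to check the arithmetic, the entire worked example is a dozen lines of NumPy (a sketch; names mirror the math above):

import numpy as np

x  = np.array([1.0, 2.0])
W1 = np.array([[0.5, -0.3], [0.1, 0.4], [-0.2, 0.2]])
b1 = np.zeros(3)
W2 = np.array([[1.0, -1.0, 0.5]])
b2 = np.zeros(1)
y  = 1.0

z1 = W1 @ x + b1                   # [-0.1, 0.9, 0.2]
h1 = np.maximum(z1, 0)             # [ 0.0, 0.9, 0.2]
z2 = W2 @ h1 + b2                  # [-0.8]
L  = 0.5 * (z2 - y) ** 2           # [1.62]

d2  = z2 - y                       # [-1.8]
dW2 = np.outer(d2, h1)             # [[0.0, -1.62, -0.36]]
db2 = d2
dh1 = W2.T @ d2                    # [-1.8, 1.8, -0.9]
d1  = dh1 * (z1 > 0)               # [ 0.0, 1.8, -0.9]
dW1 = np.outer(d1, x)              # [[0, 0], [1.8, 3.6], [-0.9, -1.8]]
db1 = d1
print(L, dW1, dW2)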

Now imagine doing 1 trillion of these per training step across 10^11 parameters. That's modern deep learning.

Notice the dead unit: hidden unit 0 had z1 = -0.1, so ReLU killed it, and consequently row 0 of ∂L/∂W1 is zero. If on every training example this unit's pre-activation is negative, it never updates and remains dead forever. This is the dead-ReLU problem (Section 5.1).


4. Vanishing and Exploding Gradients

When you stack many layers, the backward pass multiplies many Jacobians together:

∂L/∂x_0 = J_L · J_{L-1} · ... · J_1 · ∂L/∂x_L

If a typical singular value of these Jacobians is < 1, the product shrinks geometrically-vanishing gradients, training stalls. If > 1, it grows geometrically-exploding gradients, NaN.

4.1 The classic sigmoid stack failure

Sigmoid: σ(z) = 1/(1 + e^{-z}), σ'(z) = σ(z)(1 - σ(z)).

σ'(z) is at most 0.25 (at z=0) and approaches 0 in the saturated regions. A 10-layer sigmoid MLP multiplies ten activation factors of ≤ 0.25 into the backward product, so the gradient at layer 1 is at most 0.25^10 ≈ 10^-6 of the gradient at the output, before even counting the weight Jacobians. Layer 1 effectively does not learn. Pre-2010, this was the reason deep networks were considered unworkable.

Two things broke us out: ReLU (derivative is exactly 1 in the active region) and good initialization (Section 6) so we don't start in the saturated regime.
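
Ten stacked sigmoids versus ten stacked ReLUs, measured at the input (a toy demo; width, depth, and init are arbitrary choices):

import torch

torch.manual_seed(0)
depth, width = 10, 64

def input_grad_norm(act):
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for _ in range(depth):
        W = torch.randn(width, width) * (2 / width) ** 0.5   # He init
        h = act(h @ W.T)
    h.sum().backward()
    return x.grad.norm().item()

print("sigmoid:", input_grad_norm(torch.sigmoid))  # shrinks by orders of magnitude
print("relu:   ", input_grad_norm(torch.relu))     # stays O(1): active path passes 1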

4.2 The skip-connection insight (ResNet, He et al. 2015)

Even with ReLU, very deep networks (50+ layers) degraded. The fix that unlocked depth was almost embarrassingly simple: change the layer from

y = F(x)            # the layer learns the full transformation

to

y = x + F(x)        # the layer learns a residual

Why this works, mathematically: the backward pass is

∂L/∂x = ∂L/∂y · (I + ∂F/∂x) = ∂L/∂y + ∂L/∂y · ∂F/∂x

There is now an identity term I in the Jacobian. Even if F learns nothing, the gradient flows through unchanged. Depth becomes free: adding a layer can only add capacity, it cannot block the gradient signal.
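
You can see the identity term at work in a 50-layer toy stack (a sketch; the residual branch here is just a deliberately weak random linear + ReLU):

import torch

torch.manual_seed(0)
depth, width = 50, 64

def input_grad_norm(residual: bool) -> float:
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for _ in range(depth):
        W = torch.randn(width, width) * 0.01      # weak layers: Jacobian norm << 1
        f = torch.relu(h @ W.T)
        h = h + f if residual else f
    h.sum().backward()
    return x.grad.norm().item()

print("plain:   ", input_grad_norm(False))   # underflows toward 0
print("residual:", input_grad_norm(True))    # healthy: gradient rides the skip path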

Every modern transformer block is x + Attention(LN(x)) then x + MLP(LN(x)). The skip connections are non-negotiable. The transformer chapter (/07) will show you exactly where they sit.


5. Activations

5.1 ReLU

ReLU(z) = max(0, z)
ReLU'(z) = 1 if z > 0 else 0      (undefined at 0; pick 0 by convention)

Pros: cheap, gradient is 0 or 1 (no vanishing in the active path), induces sparsity (about half of units are off in expectation at init).

Con: dead-ReLU problem. If a unit gets pushed into z < 0 for every input, its gradient is always 0 and it never recovers. This is a real phenomenon, not a theoretical worry-you can lose 20-40% of units this way with a bad LR.

5.2 Leaky ReLU and PReLU

LeakyReLU(z) = z      if z > 0
             = α·z    if z ≤ 0,    typical α = 0.01

A small positive slope on the negative side. Dead units can recover. PReLU: same idea but α is a learnable parameter per channel.

In practice these mostly fix a non-problem; well-initialized networks with appropriate LR don't lose many neurons. They show up in CV literature more than in modern transformers.

5.3 GELU (Gaussian Error Linear Unit)

GELU(z) = z · Φ(z)

where Φ is the standard-normal CDF. The "soft gate": z is multiplied by the probability that a standard Gaussian is below z. Approximate form (the one most code uses):

GELU(z) ≈ 0.5 z (1 + tanh( √(2/π) (z + 0.044715 z^3) ))

Why GELU: it is smooth (infinitely differentiable), it is non-monotonic (slight dip below zero around z ≈ -0.7), and empirically it trains transformers a little better than ReLU. BERT and GPT-2/3 use GELU.

5.4 SiLU / Swish

SiLU(z) = Swish(z) = z · sigmoid(z)

Like GELU but cheaper. Smooth, non-monotonic, self-gated. Used in many vision and language models. Often interchangeable with GELU.
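
All three are one-liners, and the tanh form tracks exact GELU closely (a sketch using only torch; erf gives the exact Φ):

import math
import torch

def gelu_exact(z):
    return z * 0.5 * (1 + torch.erf(z / math.sqrt(2.0)))    # z · Φ(z)

def gelu_tanh(z):
    return 0.5 * z * (1 + torch.tanh(math.sqrt(2 / math.pi) * (z + 0.044715 * z**3)))

def silu(z):
    return z * torch.sigmoid(z)

z = torch.linspace(-4, 4, 1001)
print((gelu_exact(z) - gelu_tanh(z)).abs().max())   # small, on the order of 1e-3 or less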

5.5 SwiGLU (Llama, PaLM)

A gated linear unit combined with SiLU. The standard MLP block is

MLP(x) = W2 · activation(W1 x)             # 2 matrices: W1, W2

SwiGLU replaces it with

SwiGLU(x) = W2 · ( SiLU(W1 x) ⊙ (W3 x) )    # 3 matrices: W1, W2, W3

So instead of one nonlinearity applied to one projection, we have an elementwise gate where one branch (W3 x) modulates the other (SiLU(W1 x)). To keep parameter count comparable, the hidden dimension is reduced (the standard recipe: d_ff is (2/3)·4·d_model instead of 4·d_model).

Cost: 3 weight matrices instead of 2, slightly more FLOPs and memory. Benefit: empirically better perplexity. This is the MLP block in Llama, Mistral, and most modern open-weights LLMs.
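
A minimal sketch of the block (the names w1, w2, w3 follow the equations above; some real implementations fuse w1 and w3 into a single matmul):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # MLP(x) = W2 · ( SiLU(W1 x) ⊙ (W3 x) )
    def __init__(self, d_model: int):
        super().__init__()
        d_ff = int(2 / 3 * 4 * d_model)                  # the (2/3)·4·d_model recipe
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gated branch
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # gating branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)        # [batch, seq, d_model]
print(SwiGLU(512)(x).shape)        # torch.Size([2, 16, 512])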

5.6 Softmax (and its Jacobian)

Softmax for a vector z ∈ R^K:

S_i = softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

It maps logits to a probability distribution. It appears in two places in modern DL: 1. Output layer for classification. 2. Attention scores in transformers (softmax(QKᵀ/√d)).

Jacobian. Compute ∂S_i / ∂z_k. Two cases:

Case i = k:

∂S_i/∂z_i = (exp(z_i) Σ - exp(z_i)·exp(z_i)) / Σ^2
          = S_i - S_i^2 = S_i (1 - S_i)

Case i ≠ k:

∂S_i/∂z_k = (0·Σ - exp(z_i)·exp(z_k)) / Σ^2
          = -S_i · S_k

Compactly, with S as a column vector:

∂S/∂z = diag(S) - S Sᵀ

It is symmetric, of size K×K, and rank-deficient: the all-ones vector is in its null space, which is exactly the statement that softmax is invariant to additive shifts in z.
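
Autograd agrees (a quick check; torch.autograd.functional.jacobian builds the full K×K matrix):

import torch

z = torch.randn(5)
S = torch.softmax(z, dim=0)
J = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), z)

assert torch.allclose(J, torch.diag(S) - torch.outer(S, S), atol=1e-6)
# the all-ones vector is in the null space: softmax ignores additive shifts of z
assert torch.allclose(J @ torch.ones(5), torch.zeros(5), atol=1e-6)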

5.7 Softmax + cross-entropy = clean gradient

For classification, L = -log(S_y). Then

∂L/∂z_k = -∂ log(S_y) / ∂z_k = -(1/S_y) · ∂S_y/∂z_k

Plug in:

  • if k = y:  ∂L/∂z_y = -(1/S_y) · S_y(1 - S_y) = S_y - 1
  • if k ≠ y:  ∂L/∂z_k = -(1/S_y) · (-S_y S_k) = S_k

Combining: ∂L/∂z = S - e_y. Predicted distribution minus one-hot target. This is the only gradient you need to remember.

This is also why frameworks fuse softmax and cross-entropy into a single op (cross_entropy(logits, target)): the fused backward is just softmax(logits) - one_hot(target), which avoids a separate softmax-Jacobian materialization and is more numerically stable (uses log-sum-exp; see /11_NUMERICAL_STABILITY.md).
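
The identity is easy to verify numerically (a sketch; the class index is arbitrary):

import torch
import torch.nn.functional as F

logits = torch.randn(7, requires_grad=True)
y = torch.tensor(3)

F.cross_entropy(logits[None, :], y[None]).backward()   # fused softmax + CE

expected = torch.softmax(logits.detach(), dim=0) - F.one_hot(y, 7).float()
assert torch.allclose(logits.grad, expected, atol=1e-6)   # ∂L/∂z = S - e_y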


6. Initialization

Why init matters: the network's behavior at step 0-before any learning-depends entirely on the random Ws and bs. If the initial weights cause activations or gradients to blow up or shrink to zero through the layers, training will either NaN immediately or stall in vanishing-gradient land.

6.1 The variance argument

Take a linear layer z = W x with x ∈ R^{n_in}, weights drawn iid with mean 0 and variance σ_W², inputs iid with mean 0 and variance σ_x², and weights independent of inputs. Then for any single output:

z_i = Σ_j W_{ij} x_j
Var(z_i) = Σ_j Var(W_{ij} x_j) = n_in · σ_W^2 · σ_x^2

For activations not to grow or shrink as they pass through the layer, we want Var(z) = Var(x), which requires

σ_W^2 = 1 / n_in       ("fan-in" rule)

Symmetrically, for the backward pass we want gradients not to blow up, which requires

σ_W^2 = 1 / n_out      ("fan-out" rule)

6.2 Xavier / Glorot

You can't satisfy both at once unless n_in = n_out. Glorot and Bengio (2010) split the difference:

Var(W) = 2 / (n_in + n_out)             # Xavier (normal)
W ~ Uniform[-√(6/(n_in+n_out)), +√(6/(n_in+n_out))]   # Xavier (uniform)

This is correct for symmetric activations (tanh, sigmoid in the linear regime).

6.3 He / Kaiming

For ReLU, half of the activations are zero on average, which halves the variance of the post-activation signal. To compensate, double the weight variance:

Var(W) = 2 / n_in          # He (Kaiming) init for ReLU
W ~ N(0, 2/n_in),  i.e. standard deviation √(2/n_in)

This is the standard for any ReLU (or ReLU-like) MLP. Exercise 3 works a concrete example.

6.4 Modern transformer init

Transformers are deeper and have residual connections, and the right init is more delicate. Standard recipes:

  • GPT-2: weights N(0, 0.02²) (std 0.02), biases zero, residual-projection weights scaled by an additional 1/√(2L) factor to keep activation variance roughly constant through L residual blocks. The intuition: each block contributes two residual branches (attention and MLP), and 2L of them stacked grow the variance by a factor of about 2L; scaling each branch's output weights by 1/√(2L) keeps the residual stream at its original scale.
  • T5 / Llama: similar, with slight differences. The empirical answer is "use what the reference implementation uses"; the variance-preservation principle is the same.

6.5 Why bad init = NaN in epoch 1

If σ_W is too large, activations grow geometrically with depth. By layer 30 they are at 1e30. Squaring that in MSE loss gives 1e60, exceeds FP32 max (~3.4e38), produces inf, and the next op produces NaN. By the time you see loss = NaN at step 1, the network is already dead. Conversely, if σ_W is too small, activations underflow to 0, and the gradient at every layer is also 0, and the network learns nothing.

The remedy is rarely "look at activation statistics by hand." It is almost always "use He/Kaiming, or use whatever the architecture's reference does." But knowing why lets you debug the rare case (e.g., weight_init='zeros' from a copy-paste mistake) immediately.
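
The geometric growth/decay is easy to see (a demo; depth, width, and the gain multipliers are arbitrary choices):

import torch

torch.manual_seed(0)
depth, width = 30, 256
x = torch.randn(1024, width)

for gain in (0.5, 1.0, 2.0):            # multiples of the He variance 2/fan_in
    h = x
    for _ in range(depth):
        W = torch.randn(width, width) * (gain * 2 / width) ** 0.5
        h = torch.relu(h @ W.T)
    print(f"gain {gain}: activation std after {depth} layers = {h.std().item():.3e}")

# gain 1.0 stays O(1); 0.5 decays like 0.5^(depth/2); 2.0 grows like 2^(depth/2)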


7. Optimizers, Derived

7.1 SGD

θ ← θ - lr · g                where g = ∇L(θ)

In stochastic gradient descent, g is computed on a minibatch, so it is a noisy estimate of the true gradient. Convergence intuition: in the direction of the true gradient you descend; in directions perpendicular to it, the noise averages out over many steps.

Pros: simple, well-understood, generalizes well in CV. Cons: slow on ill-conditioned losses, sensitive to LR, no per-parameter scaling.

7.2 SGD with momentum

The "rolling ball" picture: imagine a ball rolling down the loss surface. It accumulates velocity in directions that have consistently pointed the same way; it dampens oscillations in directions that have flipped sign.

v ← β · v + g                 # velocity (or "momentum buffer"); β typically 0.9
θ ← θ - lr · v

(Some texts write v ← β v + (1 - β) g; equivalent up to LR rescaling.)

Why it helps: in narrow valleys with a long, gentle slope along one axis and steep walls perpendicular, vanilla SGD bounces between the walls. Momentum sums consistent gradient sign along the slope (large v) and cancels the bouncing sign perpendicular to it (small v). You traverse the valley faster.

7.3 Nesterov momentum

Peek-ahead trick: evaluate the gradient at the location momentum will take you to, not at where you are now.

θ_lookahead = θ - lr · β · v
g           = ∇L(θ_lookahead)
v           = β v + g
θ           = θ - lr · v

Slightly faster convergence on convex problems. In deep learning, the gain over plain momentum is small; rarely the bottleneck.

7.4 RMSprop

The first widely-used adaptive optimizer. Maintain a running second moment (uncentered variance) of gradients per parameter, and divide:

v ← β · v + (1 - β) · g²            # elementwise; β typically 0.99
θ ← θ - lr · g / (√v + ε)

The √v denominator gives each parameter an effective LR proportional to 1/RMS(g), the root-mean-square of its recent gradients. Parameters with large gradients get a smaller step; parameters with tiny gradients get a relatively larger step. This is the per-parameter adaptive scaling that fixes ill-conditioning.

7.5 Adam (the workhorse)

Adam = Momentum + RMSprop + bias correction. Algorithm:

m ← β1·m + (1 - β1)·g          # first moment (mean of g)
v ← β2·v + (1 - β2)·g²         # second moment (uncentered variance of g)
m̂ = m / (1 - β1^t)             # bias-corrected first moment
v̂ = v / (1 - β2^t)             # bias-corrected second moment
θ ← θ - lr · m̂ / (√v̂ + ε)

Defaults (Kingma & Ba 2015): β1=0.9, β2=0.999, ε=1e-8. For LLMs people often use β2=0.95 (slightly faster adaptation when you're on a tight token budget).

Why bias correction. At t=1, m = (1 - β1) g = 0.1 g, a 10× underestimate of the true mean. Without correction, the first thousand steps see severely undersized first-moment estimates. The factor 1/(1 - β1^t) exactly cancels this: at t=1 it multiplies by 10 (so m̂ = g); at large t the factor approaches 1 (no effect). Similarly for v̂.

Why these defaults: β1=0.9 matches common momentum, β2=0.999 averages over thousands of steps for a stable variance estimate, ε=1e-8 prevents division-by-zero without distorting the normal-magnitude regime.

When Adam wins: ill-conditioned losses (transformers, RNNs), tasks where SGD requires extensive LR tuning. When SGD wins: image classification with the right schedule, where SGD-with-momentum often generalizes slightly better despite slower convergence (the "Adam generalizes worse" debate).

7.6 AdamW (the actual modern default)

The Loshchilov & Hutter (2017) insight. L2 regularization adds λ ||θ||² / 2 to the loss. The gradient of that penalty is λ θ. In plain SGD this gives the update

θ ← θ - lr (g + λ θ) = (1 - lr·λ) θ - lr·g

i.e., a multiplicative shrinkage of θ by (1 - lr·λ): weight decay. So in SGD, "L2 regularization" and "weight decay" coincide.

In Adam, they don't. Folding λθ into g gives

θ ← θ - lr · (m̂ + λθ) / (√v̂ + ε)

The decay term is divided by √v̂. Parameters with large historical gradients get less decay; parameters with tiny gradients get more. This breaks the regularizer: it no longer applies uniformly.

AdamW decouples the decay from the gradient:

m ← β1·m + (1 - β1)·g
v ← β2·v + (1 - β2)·g²
m̂ = m / (1 - β1^t)
v̂ = v / (1 - β2^t)
θ ← θ - lr · m̂ / (√v̂ + ε) - lr · λ · θ           ← decay applied directly

Now the decay is exactly (1 - lr·λ) shrinkage per step, irrespective of gradient history. AdamW is the default in every modern transformer codebase. Typical λ ∈ [0.01, 0.1] for LLM pretraining.

Whether to decay biases and LayerNorm gains: the strong convention is no-only decay 2-D weight matrices. Decaying a LayerNorm gain pulls the gain toward 0, which kills the layer's ability to scale its outputs.
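
Relative to plain Adam, the change is a single gradient-free shrinkage per step. A sketch mirroring the MyAdam class in Exercise 2 below (illustrative, not a reference implementation):

import torch

class MyAdamW:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
        self.params = list(params)
        self.lr, self.b1, self.b2 = lr, betas[0], betas[1]
        self.eps, self.wd = eps, wd
        self.t = 0
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        self.t += 1
        for p, m, v in zip(self.params, self.m, self.v):
            p.mul_(1 - self.lr * self.wd)      # decoupled decay: never divided by √v̂
            g = p.grad
            m.mul_(self.b1).add_(g, alpha=1 - self.b1)
            v.mul_(self.b2).addcmul_(g, g, value=1 - self.b2)
            m_hat = m / (1 - self.b1 ** self.t)
            v_hat = v / (1 - self.b2 ** self.t)
            p.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)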

7.7 Lion, Sophia (briefly)

  • Lion (Chen et al. 2023, Google): sign-only update with momentum. c = β1 m + (1-β1) g; θ ← θ - lr · sign(c); m ← β2 m + (1-β2) g. Matches AdamW with less memory (no v buffer). Promising; not yet universal.
  • Sophia (Liu et al. 2023): Hessian-aware second-moment estimator. Faster pretraining in some reports. Research-stage; not a default.

For now, AdamW with cosine schedule and warmup is the default. Use Lion if you need to save optimizer-state memory.

7.8 When to pick which

Setting                             Optimizer
Image classification (ResNets)      SGD + momentum + cosine
Transformer pretraining             AdamW + warmup-cosine
Fine-tuning a transformer           AdamW, smaller lr
RL policy gradient                  Adam (sometimes RMSprop)
Simple convex / linear problems     Plain SGD
Optimizer-state-memory bound        Lion

8. Learning-Rate Schedules

The LR is the most important hyperparameter. The right schedule lets you start fast (large step), exploit fast (medium step), and refine at the end (small step). Some history first.

8.1 Constant LR

What it sounds like. Used only in toy problems and in some online-learning settings. For deep networks: never the right answer.

8.2 Step decay (legacy)

lr divided by 10 every N epochs. The PyTorch MultiStepLR. Used to dominate ImageNet recipes (e.g., divide by 10 at epochs 30, 60, 90). Largely superseded by cosine.

8.3 Linear decay

lr_t = lr_max · (1 - t/T)

Linear ramp from lr_max to 0 over the full training run. Simple, monotone, sometimes used in RoBERTa-style fine-tuning.

8.4 Cosine annealing (the modern default)

Loshchilov & Hutter (2016):

lr_t = lr_min + 0.5 (lr_max - lr_min) (1 + cos(π · t / T))

A smooth half-cosine from lr_max down to lr_min (often 0 or lr_max/10). Why cosine: the schedule decreases slowly at first (stay near lr_max while exploration is most useful), accelerates the decrease in the middle, then slows again at the end (small refinement steps near the optimum). Empirically beats step decay and linear decay on essentially every modern benchmark.

8.5 Warmup

lr_t = lr_max · (t / T_warmup)        for t ≤ T_warmup

Linear ramp from 0 (or near 0) up to lr_max. Then hand off to the main schedule (typically cosine).

Why warmup is critical for transformers. At step 0, the attention weights are random: the softmax distribution is roughly uniform. The gradients that flow are not yet meaningful directional signal, and Adam's second-moment estimate v is still near zero, so the normalized step lr · m̂ / (√v̂ + ε) can be enormous. A few large random steps can throw the network into a region from which it never recovers. Warmup keeps the LR small while the optimizer's statistics stabilize.

A typical recipe: 1k–10k warmup steps, often 1% of total training.

8.6 One-cycle (Leslie Smith)

linear ramp 0 → lr_max over first half
linear ramp lr_max → lr_min over second half

Sometimes with a tail of even-smaller LR at the end. Smith showed this gives "super-convergence" on CIFAR: fewer epochs than step decay. Less used in transformers (cosine + warmup tends to win there).

8.7 The standard recipe (memorize this)

linear warmup from 0 to lr_max for the first N_warmup steps
cosine decay from lr_max to lr_min for the remaining steps

For LLM pretraining, common values:

  • lr_max ≈ 3e-4 for small models (~125M params), down to ~1e-4 for large (10B+).
  • N_warmup ≈ 2000 steps.
  • lr_min = 0.1 · lr_max (often) or 0.

This recipe is what's behind GPT-2, GPT-3, Llama, and almost every modern LLM trained from scratch.
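
The whole schedule is a short function (a sketch with the common values above as defaults; you can hand it to a LambdaLR-style scheduler as a multiplier):

import math

def lr_at(step: int, max_steps: int, lr_max: float = 3e-4,
          warmup: int = 2000, lr_min_ratio: float = 0.1) -> float:
    lr_min = lr_min_ratio * lr_max
    if step < warmup:
        return lr_max * step / warmup                    # linear warmup
    t = (step - warmup) / max(1, max_steps - warmup)     # 0 → 1 over the remainder
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

print(lr_at(1000, 100_000))      # mid-warmup: 1.5e-4
print(lr_at(2000, 100_000))      # peak: 3e-4
print(lr_at(100_000, 100_000))   # end: 3e-5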


9. Normalization

Normalization layers re-center and re-scale activations to keep them in a healthy range across training. There are three you must know.

9.1 BatchNorm (Ioffe & Szegedy 2015)

For a feature j over a batch of size B:

μ_j = (1/B) Σ_i x_{ij}
σ²_j = (1/B) Σ_i (x_{ij} - μ_j)²
x̂_{ij} = (x_{ij} - μ_j) / √(σ²_j + ε)
y_{ij} = γ_j · x̂_{ij} + β_j        # learned scale γ and shift β

At inference time, μ and σ² are running averages from training, not batch statistics.

Why it works: BN was originally framed as fixing "internal covariate shift", the distribution of layer inputs changing during training. The real reason is debated; modern thinking is that BN smooths the loss landscape (Santurkar et al. 2018) and decouples the direction of weight updates from their magnitude.

Why it's bad for transformers: BN normalizes across the batch, but for variable-length sequences and for batch sizes that vary at inference, the statistics are unstable. Also, when batch size shrinks (small-batch fine-tuning, distributed training with small per-device batch), BN statistics become noisy.

Where BN still wins: convolutional vision models with stable batch sizes.

9.2 LayerNorm (Ba, Kiros, Hinton 2016)

For a single sample, normalize across the feature dimension d:

μ = (1/d) Σ_k x_k
σ² = (1/d) Σ_k (x_k - μ)²
x̂_k = (x_k - μ) / √(σ² + ε)
y_k = γ_k · x̂_k + β_k

No batch coupling. Each token is normalized independently of others. Works for any batch size, any sequence length. The transformer default.

LayerNorm gradient: derive in Exercise 5. The short version is that you need to backprop through the normalization, which couples all d features (because μ and σ² are functions of all of them).

9.3 RMSNorm (Zhang & Sennrich 2019, popularized by Llama)

LayerNorm without mean-centering:

rms = √( (1/d) Σ_k x_k² + ε )
y_k = γ_k · x_k / rms

Drops μ and β. One fewer reduction (no mean) and one fewer parameter (no shift). About 5-10% faster. Empirically as good as LayerNorm, sometimes slightly better. Used in Llama, Mistral, most modern open LLMs.

The intuition for why dropping the mean is fine: in a residual network, the residual stream's mean drifts but the model can absorb that drift in the next linear layer's bias. The variance scaling is the part that matters.
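
The whole layer is a few lines (a sketch of the Llama-style module; the ε placement matches the formula above):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))   # scale only, no shift

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * x / rms

x = torch.randn(2, 10, 512)
print(RMSNorm(512)(x).shape)   # torch.Size([2, 10, 512])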

9.4 Pre-norm vs post-norm

Two ways to insert normalization in a transformer block.

Post-norm (original Transformer, Vaswani et al. 2017):

x ← LN(x + Sublayer(x))

Pre-norm (GPT-2, Llama, modern default):

x ← x + Sublayer(LN(x))

The difference matters at depth. In post-norm, the residual stream is normalized after each block; the gradient on the residual path passes through ∂LN/∂x, which is not the identity, and gradient magnitudes attenuate with depth. In pre-norm, the residual path is x ← x + (...) with no normalization applied to the skip-the gradient flows through identity untouched. Pre-norm trains stably to 100+ layers; post-norm without learning-rate warmup or careful scaling fails.

The transformer chapter (/07) shows the block diagram. Pre-norm is what you should default to.


10. Regularization in Deep Learning

10.1 Dropout

Train-time: independently zero each activation with probability p. Test-time: keep all activations, but compensate during training (the inverted dropout trick):

mask ~ Bernoulli(1 - p)             # shape of activation; 1 means keep
y = (mask ⊙ x) / (1 - p)            # scale up surviving activations

The /(1 - p) scaling means E[y] = x so test-time and train-time activations have the same expected magnitude. No special inference path.

Why it helps: training many "thinned" sub-networks simultaneously approximates an ensemble of 2^N networks (where N is the number of units that can be dropped). It is also equivalent to noise injection that prevents co-adaptation between units.

Dropout is the major regularizer in MLPs, RNNs, and parts of transformers (commonly p=0.1 on attention probabilities and on the FFN output). LLM pretraining often uses p=0 because the dataset is large enough that overfitting isn't the bottleneck.
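
Inverted dropout in full (a sketch; torch's built-in nn.Dropout does the same thing):

import torch

def dropout(x, p: float, training: bool):
    if not training or p == 0.0:
        return x                                  # identity at inference
    mask = (torch.rand_like(x) > p).float()       # keep with probability 1 - p
    return mask * x / (1 - p)                     # rescale so E[output] = x

x = torch.ones(100_000)
print(dropout(x, 0.1, training=True).mean())      # ≈ 1.0: expectation preserved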

10.2 DropPath (stochastic depth)

For residual networks: with probability p, replace x + F(x) with just x (drop the entire residual branch). Equivalent to randomly making the network shallower at each step. Used in DeiT, ConvNeXt, video transformers. Lets you train very deep networks with reasonable compute.

10.3 Weight decay

Per AdamW (Section 7.6). The dominant regularizer in transformer training.

10.4 Label smoothing

Replace one-hot target e_y with

y_smooth_k = 1 - ε       if k = y
           = ε / (K - 1) otherwise

For typical ε ∈ [0.05, 0.1]. Effect: prevents the network from pushing logits to ±∞, which in turn prevents pathological overconfidence. Improves calibration of the predicted probabilities (the predicted probability of the top class is closer to the true frequency of being correct).

In LLMs, less universally used than in CV; recent training recipes often skip it.

10.5 Early stopping

Track validation loss; stop when it stops improving. Equivalent (under some assumptions) to L2 regularization. Cheap and effective. Standard in supervised fine-tuning.

10.6 Data augmentation

Vision: crops, flips, color jitter, MixUp, CutMix, RandAugment. Language: limited (back-translation, EDA tricks) and rarely used in pretraining. The dataset itself is the augmentation.


11. Loss Landscapes (Intuition)

The training loss L(θ) is a function from R^N (where N is the number of parameters, typically 10^6 to 10^11) to R.

11.1 Non-convexity

L(θ) is wildly non-convex. There are many local minima, many saddle points, many flat regions. Classical optimization theory (which assumes convex L) does not apply.

11.2 Saddle points dominate

In high dimensions, saddle points are exponentially more common than local minima. Reason: a critical point is a local minimum only if all N Hessian eigenvalues are positive. If each is independently positive with probability ~1/2 (a hand-wavy random-matrix heuristic), the probability of pure-positive is 2^-N. With N=10^9, the probability is essentially zero. Almost every critical point you encounter is a saddle.

This is good news. Saddle points are escapable-you just need any downhill direction, and there are typically many.

11.3 Why SGD's noise helps

SGD's gradient is a noisy estimate of the true gradient. The noise has two effects:

  1. Saddle-point escape. At a saddle, the true gradient is zero, but the stochastic gradient is not. Noise pushes you off the saddle in some direction; if any direction is downhill, you take it.

  2. Implicit regularization. SGD's noise has been argued to bias optimization toward flat minima rather than sharp minima.

11.4 Flat vs sharp minima

A "sharp" minimum has high curvature (large Hessian eigenvalues): a small perturbation in θ causes a large jump in loss. A "flat" minimum has low curvature: nearby θ give nearly the same loss.

The empirical hypothesis (Hochreiter & Schmidhuber 1997, Keskar et al. 2017): flat minima generalize better. Intuitively, training-set vs test-set difference is a small perturbation in the loss surface; if the minimum is flat, that perturbation barely moves the loss; if sharp, it does. This is part of why "Adam generalizes worse than SGD" in some settings-Adam's per-parameter scaling can find sharper minima.

This is intuition, not theorem. Recent work shows the sharpness-generalization correlation is reparameterization-dependent. But it's the working picture most practitioners hold.


12. Gradient Clipping

12.1 Clip-by-norm (the standard)

Compute the global norm of all gradients:

‖g‖ = √( Σ over all parameters of ||g_θ||² )

If ‖g‖ > max_norm, scale every gradient by max_norm / ‖g‖:

g_θ ← g_θ · min(1, max_norm / ‖g‖)

Typical max_norm = 1.0 for transformers (sometimes 0.5 for very large models, sometimes 5.0 for older code).

Why clip-by-norm preserves direction. All gradients are scaled by the same scalar, so the direction of the global gradient vector is unchanged; only its magnitude is capped. This is what you want: if the gradient direction is right, follow it; just don't take a huge step.

12.2 Clip-by-value

Clip each component individually to [-c, +c]:

g_i ← max(-c, min(c, g_i))

Distorts the gradient direction (some components capped, others not). Less common; survives mostly in older RL code.

12.3 Why clipping matters

Even with good init, careful LR, and warmup, occasional large gradients happen-a single rare token, a single bad batch, a single instability between optimizer steps. Without clipping, one of these events can produce an inf weight; from there, NaN spreads. With clipping, the largest possible step is bounded, and the network rides through.

Clipping is standard for any transformer training run, including fine-tuning. PyTorch:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Place it after loss.backward() and before optimizer.step().


13. Mixed Precision Training (Overview)

Full treatment in /11_NUMERICAL_STABILITY.md. The core idea:

13.1 The setup

  • Master weights kept in FP32 (4 bytes/param). The optimizer reads and writes these.
  • Compute done in FP16 or BF16 (2 bytes/param). Forward and backward pass produce activations and gradients in low precision.
  • Optimizer step updates FP32 master weights using FP32-cast gradients.

Motivation: FP16/BF16 ops on Tensor Cores are 2-8× faster than FP32, and activations occupy half the memory. For large models this is the difference between "fits on a GPU" and "doesn't."

13.2 Loss scaling (FP16 only)

FP16 has a small dynamic range (~6e-8 to 6e4). Many gradients are smaller than 6e-8 and underflow to zero in FP16, halting training.

Fix: multiply the loss by a large constant S (e.g., 1024 or dynamic) before backward; this scales all gradients by S, lifting them out of underflow. After backward, divide gradients by S before the optimizer step:

loss_scaled = S · loss
loss_scaled.backward()
for p in params: p.grad /= S
optimizer.step()

PyTorch: torch.cuda.amp.GradScaler does this automatically and adapts S based on overflow detection.
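
The standard loop looks like this (a sketch; assumes a CUDA GPU, and the model and data here are toys):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(64, 10).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 64, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    opt.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):
        loss = F.cross_entropy(model(x), y)       # forward in FP16
    scaler.scale(loss).backward()                 # backward on S · loss
    scaler.unscale_(opt)                          # divide grads by S before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)                              # skips the step on overflow
    scaler.update()                               # adapts S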

13.3 BF16

BF16 has the same dynamic range as FP32 (8 exponent bits) but only 7 mantissa bits (vs FP32's 23). Underflow is no longer a concern, and loss scaling is unnecessary. Less precision in the mantissa means small numerical noise in matmul outputs, but training is robust to this.

When BF16 is the cleaner choice: any Ampere+ GPU (A100, H100, RTX 30/40 series) and any modern training run. BF16 is the default for LLM pretraining today.


14. Practical Exercises

Exercise 1. Cross-entropy gradient

Derive ∂L/∂z for L = -log(softmax(z)_y).

Solution. Let S = softmax(z). Then L = -log(S_y).

∂L/∂z_k = -(1/S_y) · ∂S_y/∂z_k

∂S_y/∂z_k = S_y (1 - S_y)    if k = y
          = -S_y S_k          if k ≠ y

⇒ ∂L/∂z_y = -(1/S_y) · S_y(1 - S_y) = S_y - 1 = (S - e_y)_y
  ∂L/∂z_k = -(1/S_y) · (-S_y S_k)  = S_k       = (S - e_y)_k    for k ≠ y

Combined: ∂L/∂z = S - e_y. The cleanest gradient in deep learning, and the reason you should never split softmax and cross_entropy into separate ops.

Exercise 2. Adam in 20 lines

import math
import torch

class MyAdam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr, self.b1, self.b2, self.eps = lr, betas[0], betas[1], eps
        self.t = 0
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        self.t += 1
        for p, m, v in zip(self.params, self.m, self.v):
            g = p.grad
            m.mul_(self.b1).add_(g, alpha=1 - self.b1)
            v.mul_(self.b2).addcmul_(g, g, value=1 - self.b2)
            m_hat = m / (1 - self.b1 ** self.t)
            v_hat = v / (1 - self.b2 ** self.t)
            p.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)

# Test on a simple quadratic L(x) = 0.5 ||x - target||^2
target = torch.tensor([3.0, -1.0, 2.0])
x = torch.zeros(3, requires_grad=True)
opt = MyAdam([x], lr=0.1)
for step in range(200):
    loss = 0.5 * ((x - target) ** 2).sum()
    loss.backward()
    opt.step()
    x.grad.zero_()
print(x)  # converges to target

That's Adam. Twenty lines, no framework optimizer.

Exercise 3. He init for (in=512, out=2048) ReLU layer

He / Kaiming for ReLU: Var(W) = 2 / fan_in.

σ = √(2 / 512) = √(1/256) = 1/16 = 0.0625

So W is drawn from N(0, σ²) with standard deviation σ = 0.0625. PyTorch:

torch.nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

(fan_out mode is also available; for ReLU MLPs fan_in is conventional.)

Exercise 4. NaN at step 3000, grad_norm = [1.2, 1.5, 2.0, 4.5, 12, NaN]

Three diagnoses:

  1. Insufficient or absent gradient clipping. The grad norm jumped from ~2 to 12 to NaN over five steps. Gradient clipping by global norm at max_norm=1.0 would have prevented this. Add clip_grad_norm_ after backward().

  2. LR too high (or warmup too short). Stable then sudden divergence is the signature of stepping into a sharp region of the loss landscape with too large a step. Reduce lr_max by 2-4×, or extend the warmup, or both.

  3. FP16 overflow without proper loss scaling. If training in FP16, a gradient just outside the FP16 range (±6.5e4) becomes inf and propagates. Switch to BF16 (no loss scaling needed, larger dynamic range) or verify the GradScaler is working-overflow detection should rescale, not produce NaN. In a healthy AMP run, scaler.update() would have backed off S and you'd see step skipping, not NaN.

Less likely (but worth checking): bad data point (a single sample with degenerate features), a buggy custom CUDA kernel, a bug introduced by recent code change, a bug in mixed-precision casting.

The first thing to do is print grad_norm per parameter at the failing step and find which parameter group blew up. The blowup is usually localized.

Exercise 5. Gradient through LayerNorm

LayerNorm forward (single sample, vector x ∈ R^d):

μ = (1/d) Σ_k x_k
σ² = (1/d) Σ_k (x_k - μ)²
x̂_k = (x_k - μ) / √(σ² + ε)
y_k = γ_k · x̂_k + β_k

Given upstream δy = ∂L/∂y, derive δx = ∂L/∂x.

Step 1. ∂L/∂γ_k = δy_k · x̂_k. ∂L/∂β_k = δy_k.

Step 2. ∂L/∂x̂_k = δy_k · γ_k. Call this δx̂_k.

Step 3. Now propagate through normalization. Let s = √(σ² + ε), so x̂_k = (x_k - μ)/s.

∂x̂_k/∂x_j = (1/s) (δ_{kj} - 1/d) - (x_k - μ) (1/s²) · ∂s/∂x_j
∂s/∂x_j = (1/(2s)) · ∂σ²/∂x_j = (1/(s d)) (x_j - μ)

Combining and simplifying (algebra; standard derivation):

δx_k = (1/(d·s)) · [ d · δx̂_k - Σ_j δx̂_j - x̂_k · Σ_j δx̂_j · x̂_j ]

Or, equivalently:

δx = (1/s) · [ δx̂ - mean(δx̂) - x̂ · mean(δx̂ ⊙ x̂) ]

The structure: subtract the mean of the upstream gradient, then subtract a component along x̂ whose magnitude is the mean inner product ⟨δx̂, x̂⟩/d, then divide by s. This is the LayerNorm backward. PyTorch implements it as a fused kernel; this formula is what's inside.
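
A numerical check of the final formula against autograd (a sketch; sizes arbitrary):

import torch

torch.manual_seed(0)
d, eps = 16, 1e-5
x = torch.randn(d, requires_grad=True)
gamma, beta = torch.randn(d), torch.randn(d)

mu, var = x.mean(), x.var(unbiased=False)   # population variance, (1/d) Σ
s = torch.sqrt(var + eps)
x_hat = (x - mu) / s
y = gamma * x_hat + beta

dy = torch.randn(d)                  # upstream δy
y.backward(dy)

with torch.no_grad():
    dx_hat = dy * gamma
    dx = (dx_hat - dx_hat.mean() - x_hat * (dx_hat * x_hat).mean()) / s
assert torch.allclose(x.grad, dx, atol=1e-5)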

Exercise 6. Initial LR for a 24-layer transformer

Standard recipe:

  • Use AdamW.
  • Use linear warmup → cosine decay.
  • For a "small-medium" transformer (decoder-only, ~125M-350M params), lr_max is typically 3e-4 or 2e-4. As models grow, lr_max shrinks (Llama-2 7B uses 3e-4; Llama-2 70B uses 1.5e-4). Width-dependent scaling (μP) suggests lr ∝ 1/d_model is the principled choice.
  • Warmup: 2000 steps is the most common default.
  • Cosine decay to lr_min = 0.1 · lr_max over the rest of training.
  • Weight decay: 0.1.
  • Gradient clip: max_norm = 1.0.
  • Betas: (0.9, 0.95).

Justification: 3e-4 is the "Adam learning rate that just works" empirically; a warmup of 2000 steps gives Adam's second-moment estimate time to stabilize before large steps; cosine to a small lr_min allows refinement; weight decay 0.1 has been the LLM-pretraining standard since GPT-3.

So my answer: lr_max = 3e-4, 2000 warmup steps, cosine to lr_min = 3e-5, AdamW(β=(0.9, 0.95), wd=0.1), grad-clip 1.0. That recipe transfers across nearly all decoder-only LLM training runs in this size range.


What you now have

Every concept in /07_ATTENTION_TRANSFORMER.md rests on this chapter:

  • The forward pass of a transformer layer is a chain of affine + nonlinearity (softmax, GELU/SwiGLU)-Sections 1, 2, 5.
  • The backward pass is mechanical chain rule-Section 3.
  • Pre-norm with residual connections is what makes deep transformers train at all-Sections 4, 9.4.
  • AdamW + cosine + warmup is the optimization recipe-Sections 7, 8.
  • He / GPT-2 init, gradient clipping, BF16 mixed precision are the day-2 stability tools-Sections 6, 12, 13.

When in /07 you read "the transformer block is x ← x + Attention(LayerNorm(x)) followed by x ← x + MLP(LayerNorm(x)), trained with AdamW(β=(0.9, 0.95), wd=0.1) on a cosine schedule with 2000 warmup steps and gradient clipped at 1.0," every clause should now read as something you have derived, not memorized.

That is the bridge to transformers. Cross it.
