Deep Dive 01-Math for Machine Learning (Self-Contained Reference)

A self-contained chapter for the working applied AI engineer. Every claim is derived; nothing is offloaded to an external link. Read this once carefully and you have the math you actually need to read papers, debug training, and reason about transformers, embeddings, and losses.


How to read this document

The chapter has four parts:

  • Part A-Linear Algebra: vectors, matrices, SVD, norms, tensors. The operational level-enough to read a transformer implementation and know why every shape is what it is.
  • Part B-Calculus: derivatives, chain rule, gradients, the two backprop derivations every ML engineer should be able to do on a whiteboard.
  • Part C-Probability and Statistics: distributions, MLE, KL, cross-entropy, perplexity. The probabilistic glue that explains why the losses look the way they do.
  • Part D-Worked exercises: six problems with full solutions, exercising material from all three parts.

Notation throughout: lowercase bold-ish letters (a, x, w) are vectors; uppercase (A, W, Q) are matrices or higher-rank tensors; ‖·‖ is a norm (L2 unless stated); · is multiplication (scalar, dot product, or matrix-vector; context disambiguates); aᵀ is transpose; ∂ and ∇ are partial derivatives and gradients; E[·] is expectation; log is natural log unless stated otherwise.


Part A-Linear Algebra (the operational level)

A.1 Vectors: list-of-numbers and arrow-with-direction

A vector in ℝⁿ is, on paper, an ordered tuple of n real numbers:

a = (a₁, a₂, ..., aₙ)

It has two equivalent interpretations:

  1. Algebraic view: a list. a = (3, 4) is just two numbers.
  2. Geometric view: an arrow from the origin to the point (3, 4) in 2D. It has a direction (which way it points) and a magnitude (how long it is).

The two views are the same object. The list (3, 4) is the arrow ending at (3, 4). The geometric view is useful when we want to talk about angles and lengths; the algebraic view is useful when we want to compute.

Switching between views:

  • The magnitude (length) of a = (a₁, ..., aₙ) is

    ‖a‖ = √(a₁² + a₂² + ... + aₙ²)

This is the Pythagorean theorem in n dimensions: (3, 4) has length √(9 + 16) = √25 = 5.

  • The direction of a non-zero vector is a / ‖a‖ - the same arrow scaled to length 1. This is called the unit vector in the direction of a.

  • A unit vector has length 1 by construction. Any vector can be written as

    a = ‖a‖ · (a / ‖a‖) = magnitude · direction.

So the two views are reconciled: a vector = (length, direction), and once you know its components in a chosen coordinate frame, you can compute both.

Why this matters in ML: when we say two embeddings e₁ and e₂ are "similar," we usually mean their directions are similar. Magnitude often encodes something like word frequency or token "energy," and we frequently want to factor it out. That motivates cosine similarity (A.3).

A.2 Vector operations and the dot product identity

Addition: componentwise.

(a₁, a₂) + (b₁, b₂) = (a₁ + b₁, a₂ + b₂)

Geometrically, this is the head-to-tail (triangle) rule: place the tail of b at the head of a; the sum is the arrow from the origin to the new head. Placing both tails at the origin and completing the parallelogram gives the same sum - that is the parallelogram rule.

Scalar multiplication: componentwise.

c · (a₁, a₂) = (c·a₁, c·a₂)

Geometrically, scaling by c stretches the arrow by c (and flips it if c < 0).

Dot product (algebraic definition):

a · b = a₁·b₁ + a₂·b₂ + ... + aₙ·bₙ

This is a scalar. It is bilinear (linear in each argument), commutative (a·b = b·a), and a·a = ‖a‖².

Deriving a·b = ‖a‖ ‖b‖ cos(θ)

This identity is the bridge between the algebraic and geometric views. Here is the derivation.

Place vectors a and b in the plane with their tails at the origin. The angle between them is θ. Form the vector c = a − b. The three vectors form a triangle: from the origin to the tip of a, from the origin to the tip of b, and from the tip of b to the tip of a (which is c, since a = b + c).

By the law of cosines applied to this triangle:

‖c‖² = ‖a‖² + ‖b‖² − 2·‖a‖·‖b‖·cos(θ)        ... (1)

Now compute ‖c‖² algebraically:

‖c‖² = (a − b) · (a − b)
     = a·a − a·b − b·a + b·b
     = ‖a‖² − 2·(a·b) + ‖b‖²                    ... (2)

Setting (1) = (2):

‖a‖² − 2·(a·b) + ‖b‖² = ‖a‖² + ‖b‖² − 2·‖a‖·‖b‖·cos(θ)

Cancel ‖a‖² + ‖b‖² from both sides:

− 2·(a·b) = − 2·‖a‖·‖b‖·cos(θ)

Divide by −2:

a · b = ‖a‖ · ‖b‖ · cos(θ)        ✓

This is the fundamental identity of the dot product. It says: the dot product encodes both magnitudes and the angle between vectors.

Consequences worth memorizing:

  • a · b = 0 iff a and b are perpendicular (cos(90°) = 0). This is the definition of orthogonality.
  • a · b > 0 means the angle is acute (< 90°), i.e., they point in "broadly the same direction."
  • a · b < 0 means the angle is obtuse (> 90°), i.e., they point in "broadly opposite directions."
  • For unit vectors (‖a‖ = ‖b‖ = 1), a · b = cos(θ) directly.

A.3 Cosine similarity

Definition:

cos_sim(a, b) = (a · b) / (‖a‖ · ‖b‖)

This is just cos(θ) extracted from the dot product identity. It lies in [−1, 1]:

  • +1: identical direction.
  • 0: orthogonal-no shared direction.
  • −1: opposite direction.

Worked example: a = (1, 0), b = (1, 1).

a · b   = 1·1 + 0·1 = 1
‖a‖    = √(1 + 0) = 1
‖b‖    = √(1 + 1) = √2
cos_sim = 1 / (1 · √2) = 1/√2 ≈ 0.7071

The angle is arccos(1/√2) = 45°. Sanity check: (1, 0) points along the x-axis; (1, 1) is the diagonal; the angle between them is indeed 45°.
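
A minimal NumPy sketch of the computation, reproducing the worked example above (the helper name cos_sim is ours, not from any library):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    # (a · b) / (‖a‖ · ‖b‖) — cos(θ) extracted from the dot-product identity
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
s = cos_sim(a, b)
print(s)                         # 0.7071... = 1/√2
print(np.degrees(np.arccos(s)))  # 45.0
```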

Why cosine similarity is the default for embeddings:

In an embedding model (sentence-transformer, CLIP image encoder, etc.), each input is mapped to a vector in ℝᵈ, typically with d between 256 and 4096. We want a similarity score that is:

  1. Invariant to magnitude. If a sentence is "longer" or has higher-energy activations, that should not by itself make it look more similar to everything else. Dividing by ‖a‖·‖b‖ strips magnitude.
  2. Bounded. [−1, 1] is convenient for thresholds, ranking, and interpretation.
  3. Cheap to compute when vectors are normalized. If we pre-normalize embeddings to unit length, cosine similarity is just the dot product-one fused multiply-add per dimension. This is critical for nearest-neighbor search at scale.
  4. Reflects what training optimized. Contrastive losses (InfoNCE, the workhorse of CLIP and most retrieval models) are computed on dot products of L2-normalized embeddings, which is cosine similarity. So evaluating with cosine matches training.

Other similarities exist (Euclidean, dot product without normalization, learned metrics), but cosine is the dominant default in retrieval, RAG, and semantic search.

A.4 Matrices as linear transformations

A matrix A ∈ ℝᵐˣⁿ is a rectangular array of numbers with m rows and n columns. But the operational view that matters for ML is: a matrix is a linear function from ℝⁿ to ℝᵐ.

A function f : ℝⁿ → ℝᵐ is linear if for all vectors x, y and scalars c:

f(x + y)  = f(x) + f(y)
f(c · x)  = c · f(x)

Every linear function is matrix-vector multiplication for some unique matrix A, and every matrix A defines a linear function x ↦ A·x. This equivalence is the heart of linear algebra.

Why matmul = composition: If A : ℝⁿ → ℝᵐ and B : ℝᵏ → ℝⁿ are linear, then their composition (A ∘ B)(x) = A(B(x)) is also linear. The matrix that represents the composition is the product A·B. That is the definition of matrix multiplication: (AB)·x = A·(B·x) for all x.

This is why matrix multiplication is associative ((AB)C = A(BC)) but not commutative (function composition isn't commutative-putting on socks then shoes ≠ shoes then socks).

Shape rules follow directly from "matmul is composition":

  • For B : ℝᵏ → ℝⁿ, B is n × k.
  • For A : ℝⁿ → ℝᵐ, A is m × n.
  • The composition A·B : ℝᵏ → ℝᵐ must be m × k.
  • (m × n) · (n × k) = (m × k). Inner dims must match; outer dims become the result.

This is the shape rule. Burn it in. Every "shape mismatch" error you ever debug is a violation of this rule.

A.5 Matrix multiplication: three views

Given A of shape (m, n) and B of shape (n, k), the product C = A·B of shape (m, k) has entries

C[i, j] = Σₚ A[i, p] · B[p, j]    (sum over p = 1 to n)

There are three equivalent ways to visualize this. All three are useful; you should be fluent in switching between them.

View 1: row times column (the textbook view)

C[i, j] is the dot product of row i of A with column j of B.

C[i, j] = (A's row i) · (B's column j)

Worked example:

A = [[1, 2],          B = [[5, 6],
     [3, 4]]                [7, 8]]

C[0, 0] = (1, 2) · (5, 7) = 5 + 14 = 19
C[0, 1] = (1, 2) · (6, 8) = 6 + 16 = 22
C[1, 0] = (3, 4) · (5, 7) = 15 + 28 = 43
C[1, 1] = (3, 4) · (6, 8) = 18 + 32 = 50

C = [[19, 22],
     [43, 50]]

This view is good for hand-calculating and for understanding attention scores: Q·Kᵀ is exactly "for each query row, take its dot product with each key row"-a similarity matrix.

View 2: outer-product sum

A·B = Σₚ (column p of A) ⊗ (row p of B)

where ⊗ is the outer product: u ⊗ v is the rank-1 matrix u · vᵀ (an m × 1 times a 1 × k gives m × k).

So C = a₁ b₁ᵀ + a₂ b₂ᵀ + ... + aₙ bₙᵀ, where aᵢ is the i-th column of A and bᵢᵀ is the i-th row of B.

Worked example (same A, B as above):

a₁ b₁ᵀ = (1, 3)ᵀ · (5, 6) = [[5, 6], [15, 18]]
a₂ b₂ᵀ = (2, 4)ᵀ · (7, 8) = [[14, 16], [28, 32]]

Sum:    [[5+14, 6+16], [15+28, 18+32]] = [[19, 22], [43, 50]]   ✓

This view is the right one for understanding low-rank decomposition: any rank-r matrix can be written as a sum of r outer products. SVD (A.9) makes this rigorous, and LoRA (low-rank adaptation in fine-tuning) directly exploits this view: ΔW = B·A where B is m×r and A is r×n, so ΔW is a sum of r outer products with far fewer parameters than full m×n.

View 3: linear combination of columns

A·x is a linear combination of the columns of A, with coefficients given by the entries of x.

A·x = x₁ · (col 1 of A) + x₂ · (col 2 of A) + ... + xₙ · (col n of A)

Generalized: column j of A·B is A · (column j of B), which is a linear combination of A's columns weighted by column j of B.

Worked example: A·(2, 3)ᵀ with A as above:

= 2·(1, 3)ᵀ + 3·(2, 4)ᵀ
= (2, 6)ᵀ + (6, 12)ᵀ
= (8, 18)ᵀ

This view explains the column space (the range of A as a function): the set of all A·x is precisely the set of all linear combinations of columns of A. Hence "column space."
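
To make the three views concrete, here is a short NumPy sketch that computes C all three ways and checks them against the built-in matmul (same A and B as the worked examples):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

# View 1: each entry is a row-of-A dot column-of-B.
C1 = np.array([[A[i] @ B[:, j] for j in range(2)] for i in range(2)])

# View 2: sum of outer products (column p of A) ⊗ (row p of B).
C2 = sum(np.outer(A[:, p], B[p, :]) for p in range(2))

# View 3: column j of C is A times column j of B,
# i.e., a linear combination of A's columns.
C3 = np.stack([A @ B[:, j] for j in range(2)], axis=1)

assert np.allclose(C1, A @ B) and np.allclose(C2, A @ B) and np.allclose(C3, A @ B)
print(A @ B)  # [[19. 22.] [43. 50.]]
```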

A.6 Transpose

The transpose Aᵀ of an m × n matrix A is the n × m matrix obtained by swapping rows and columns: Aᵀ[i, j] = A[j, i].

For vectors viewed as n × 1 column matrices, aᵀ is a 1 × n row matrix, and aᵀ · b is a 1×1 matrix whose single entry is exactly the dot product a · b.

Why (AB)ᵀ = BᵀAᵀ

Compute the (i, j) entry of (AB)ᵀ:

(AB)ᵀ[i, j] = (AB)[j, i] = Σₚ A[j, p] · B[p, i]

Compute the (i, j) entry of BᵀAᵀ:

(BᵀAᵀ)[i, j] = Σₚ Bᵀ[i, p] · Aᵀ[p, j]
             = Σₚ B[p, i] · A[j, p]
             = Σₚ A[j, p] · B[p, i]    (multiplication is commutative for scalars)

The two expressions are equal entry-by-entry. ✓

The order reverses because of shape compatibility: (AB)ᵀ is k × m, Aᵀ is n × m, Bᵀ is k × n, and only Bᵀ · Aᵀ (with shapes (k×n)·(n×m) = k×m) lines up.

Why transpose appears in Q·Kᵀ (attention): We have queries Q of shape (S, d) (S tokens, d-dim each) and keys K of shape (S, d). We want a score matrix where entry (i, j) is the dot product of query i with key j. That score is exactly Q[i, :] · K[j, :], the dot product of two row vectors. Writing this as a matrix product requires transposing K so that its rows become columns: (Q · Kᵀ)[i, j] = Q[i, :] · K[j, :]. The result is (S, d) · (d, S) = (S, S). The transpose is a shape-mechanical necessity, not a deep mathematical move.

Why transpose appears in backprop: If forward is y = W·x (with W of shape (out, in) and x of shape (in,)), the gradient with respect to x of any downstream loss L is

∂L/∂x = Wᵀ · (∂L/∂y)

That is, the upstream gradient (which has shape (out,)) gets multiplied by Wᵀ (shape (in, out)) to produce a gradient with shape (in,). The shape works because the backward pass goes from output-space to input-space, the reverse of the forward pass. We will derive this in B.18.

A.7 Identity, inverse, determinant

Identity matrix I: square, 1's on the diagonal, 0's elsewhere. I·x = x for any x. The "do nothing" linear map.

Inverse A⁻¹: defined for square matrices A only. A⁻¹ is the unique matrix (when it exists) such that A·A⁻¹ = A⁻¹·A = I. It exists iff A is invertible iff det(A) ≠ 0 iff the columns of A are linearly independent iff A has full rank.

Why we don't compute A⁻¹ in practice: numerically unstable and O(n³). To "solve A·x = b," use a factorization (LU, QR, Cholesky) or an iterative method. In ML we almost never invert a learned weight matrix; the only inverses you commonly see in code are diagonal ones (trivial) or covariance inverses in classical methods.

Determinant-geometric meaning: For an n × n matrix A, |det(A)| is the factor by which A scales n-dimensional volume, and sign(det(A)) tells you whether A preserves orientation (+) or flips it (−).

In 2D: take the unit square with vertices (0,0), (1,0), (1,1), (0,1). Apply A. The image is a parallelogram whose area equals |det(A)|.

Worked example:

A = [[2, 0],
     [0, 3]]

A stretches by 2 along x and by 3 along y. det(A) = 2·3 = 6. The unit square (area 1) becomes a 2×3 rectangle (area 6). ✓

det(A) = 0 means A collapses some dimension-the image is lower-dimensional than the input. That's exactly when A is non-invertible: information was destroyed; you cannot uniquely recover x from A·x.
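
A quick NumPy sanity check of the area claim: apply an arbitrary illustrative 2×2 matrix to the unit square's corners and measure the image parallelogram with the shoelace formula.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])             # a shear + scale; det = 6

corners = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
img = corners @ A.T                    # image of each corner under A

# Shoelace formula for the area of the image polygon.
x_, y_ = img[:, 0], img[:, 1]
area = 0.5 * abs(np.dot(x_, np.roll(y_, -1)) - np.dot(y_, np.roll(x_, -1)))
print(area, abs(np.linalg.det(A)))    # both 6.0
```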

A.8 Rank, basis, dimension; what "low rank" means

A set of vectors {v₁, ..., vₖ} is linearly independent if no vᵢ can be written as a linear combination of the others-equivalently, the only solution to c₁v₁ + ... + cₖvₖ = 0 is c₁ = ... = cₖ = 0.

A basis of a vector space is a linearly independent set that spans the space (every vector in the space is some linear combination of basis vectors). All bases of a given space have the same number of elements; that number is the dimension.

The rank of a matrix A is the dimension of its column space (equivalently, of its row space-they always have the same dimension; this is a non-trivial theorem). Practically:

  • rank(A) = number of linearly independent columns of A.
  • For an m × n matrix, rank(A) ≤ min(m, n).
  • A is full rank if rank(A) = min(m, n); otherwise rank-deficient.

Low rank: a matrix is "low rank" if its rank is much smaller than min(m, n). By the outer-product view (A.5, View 2), a rank-r matrix can be written as a sum of r outer products:

A = u₁v₁ᵀ + u₂v₂ᵀ + ... + uᵣvᵣᵀ

That uses r·(m + n) numbers instead of m·n. For m = n = 4096 and r = 16, that is 16·8192 = 131,072 parameters versus 16,777,216 - a 128× reduction.

Why low-rank approximations work in ML: many natural matrices are approximately low-rank-they have a few large singular values and many tiny ones. The "important" structure lives in a low-dimensional subspace; the rest is noise or near-zero. Truncating to the top-r components preserves most of the signal at a fraction of the cost. SVD (next section) makes this precise.

LoRA, the dominant fine-tuning technique for LLMs, exploits this: instead of updating a full 4096×4096 weight matrix, we learn a rank-r correction B·A with r=8 or r=16. We are betting that the update to the pretrained weights is approximately low-rank, which empirically holds.

A.9 Singular Value Decomposition (SVD)

Every real matrix A of shape m × n can be factored as

A = U · Σ · Vᵀ

where:

  • U is m × m, orthogonal (UᵀU = I, columns are orthonormal). Its columns are the left singular vectors.
  • Σ is m × n, diagonal-shaped with non-negative entries σ₁ ≥ σ₂ ≥ ... ≥ σ_min(m,n) ≥ 0 on the diagonal and zeros off the diagonal. The σᵢ are the singular values.
  • V is n × n, orthogonal. Its columns are the right singular vectors.

This factorization always exists. The proof goes through the eigendecomposition of AᵀA (whose eigenvalues are σᵢ² and whose eigenvectors form V), but for our purposes existence and the geometric picture matter more than the proof.

Geometric picture: rotate, scale, rotate

Apply A·x = U·Σ·Vᵀ·x step by step:

  1. Vᵀ · x: an orthogonal transformation. Geometrically, this is a rotation (and possibly reflection) in input space. It does not change lengths.
  2. Σ · (Vᵀ · x): scale each axis by the corresponding singular value σᵢ. The unit ball of input space becomes an axis-aligned ellipsoid with semi-axes of length σᵢ.
  3. U · (Σ · Vᵀ · x): another rotation (and possibly reflection), now in output space. The axis-aligned ellipsoid becomes a tilted ellipsoid.

So every linear map is a rotation, a per-axis scaling, and another rotation. That's the deep content of SVD.

Rank-k truncation: the Eckart–Young theorem

The best rank-k approximation of A (best in Frobenius norm and in spectral norm) is

A_k = U[:, :k] · Σ[:k, :k] · V[:, :k]ᵀ
    = σ₁·u₁v₁ᵀ + σ₂·u₂v₂ᵀ + ... + σₖ·uₖvₖᵀ

— keep the top-k singular values and their vectors; throw the rest away. The reconstruction error is ‖A − A_k‖² = σ_{k+1}² + σ_{k+2}² + ... (sum of squared discarded singular values, in Frobenius norm).
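
A short NumPy sketch checking the Eckart–Young error identity on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation

err_sq = np.linalg.norm(A - A_k, ord="fro") ** 2
print(err_sq, np.sum(s[k:] ** 2))                 # equal up to float error
```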

Why this matters in ML:

  • PCA is SVD applied to mean-centered data: the top-k right singular vectors are the principal directions; the data's projection onto them is the rank-k embedding.
  • Compression and pruning: replace a big weight matrix by its rank-k SVD truncation. Deploy fewer parameters with controlled accuracy loss.
  • Understanding what an attention head does: the value-output projection W_V · W_O (with low rank when d_head < d_model) is literally a low-rank factorization built into the architecture.
  • LoRA: ΔW = B · A where B is m×r and A is r×n is a rank-r update-a deliberately structured truncated SVD-shaped update.

A.10 Norms: L1, L2, L∞

A norm ‖·‖ assigns each vector a non-negative "size," with ‖x‖ = 0 iff x = 0, ‖c·x‖ = |c|·‖x‖, and ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).

For x = (x₁, ..., xₙ):

  • L1 norm: ‖x‖₁ = |x₁| + |x₂| + ... + |xₙ|. Sum of absolute values. Geometrically, the unit ball is a diamond (rotated square) in 2D, an octahedron in 3D-pointy along the axes.
  • L2 norm: ‖x‖₂ = √(x₁² + x₂² + ... + xₙ²). Euclidean length. Unit ball is a circle/sphere.
  • L∞ norm: ‖x‖∞ = max(|x₁|, |x₂|, ..., |xₙ|). Largest single component. Unit ball is a square/cube.

These are all instances of the Lp norm: ‖x‖ₚ = (Σ|xᵢ|ᵖ)^(1/p) for p ≥ 1. L1 is p=1, L2 is p=2, and L∞ is the limit as p → ∞.

Why L2 regularization "shrinks" weights

Add a penalty λ·‖w‖₂² = λ·Σwᵢ² to the loss. The gradient of the penalty w.r.t. wᵢ is 2λ·wᵢ. Each gradient step in the penalty direction updates

wᵢ ← wᵢ − η · 2λ · wᵢ = (1 − 2ηλ) · wᵢ

— a uniform multiplicative shrinkage toward zero each step. Big weights get pulled in proportionally. This is also called weight decay, and in modern optimizers (AdamW) decoupled weight decay implements it directly as wᵢ ← (1 − ηλ)·wᵢ independent of the gradient of the original loss.

Why L1 regularization "selects" features

The gradient of λ·‖w‖₁ w.r.t. wᵢ is λ·sign(wᵢ) - a constant magnitude regardless of how big wᵢ is. So a weight gets pushed toward zero by the same amount each step until it crosses zero, where the subgradient analysis tells you it sticks at zero. The L1 penalty therefore drives many weights exactly to zero, producing a sparse model. L2 only shrinks; it doesn't zero things out.
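
A small sketch contrasting the two penalty dynamics in isolation (no data term; η and λ are illustrative constants): L2 shrinks geometrically and never quite reaches zero, while L1 steps by a fixed amount and sticks at exactly zero.

```python
import numpy as np

eta, lam, steps = 0.1, 0.5, 60
w_l2, w_l1 = 2.0, 2.0
for _ in range(steps):
    w_l2 -= eta * 2 * lam * w_l2          # gradient of λ·w² → geometric decay
    if w_l1 != 0.0:
        step = eta * lam * np.sign(w_l1)  # (sub)gradient of λ·|w| → constant step
        # once the step would overshoot zero, the weight sticks at zero
        w_l1 = 0.0 if abs(w_l1) <= abs(step) else w_l1 - step

print(w_l2)  # ~0.0036: small but never exactly zero
print(w_l1)  # exactly 0.0 (reached after 40 steps)
```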

In transformers we mostly use L2 weight decay (often via AdamW). L1 shows up in classical sparse-coding and feature-selection settings, and in L1-regularized sparse autoencoders used for mechanistic interpretability of LLMs (where the goal is to discover sparsely-activating features in a model's internal representations).

A.11 Tensors as N-D arrays; the (B, S, H) layout

A tensor, in the ML sense, is an N-dimensional array. (Mathematicians use "tensor" for something more sophisticated; ignore that here.)

  • 0-D: scalar, e.g., 3.14.
  • 1-D: vector, shape (n,).
  • 2-D: matrix, shape (m, n).
  • 3-D: e.g., shape (B, S, H) - a batch of sequences of hidden states.
  • 4-D: e.g., shape (B, C, H, W) - a batch of images (channels, height, width).

Reshape: rearrange the same data into a different shape, preserving the total number of elements. (2, 3, 4) and (6, 4) and (24,) all hold 24 numbers. Reshape changes how indices map to memory locations only by reinterpreting strides-typically zero-cost (no data copy) if the tensor is contiguous.

Transpose generalizes to permute: rearrange the order of axes. permute(0, 2, 1) on a (B, S, H) tensor produces (B, H, S). This does change the strides; whether you need a copy depends on the framework.

Broadcasting: when two tensors of different shapes are combined elementwise, their shapes are aligned from the right (trailing dimensions first), and the smaller shape is virtually expanded by repeating, provided that for each aligned pair of dims either:

  • their sizes match, or
  • one of them is 1 (which gets repeated to match the other).

Examples:

  • (B, S, H) + (H,): the bias of shape (H,) is added to every (B, S) position. Broadcasting expands (H,) → (1, 1, H) → (B, S, H).
  • (B, 1, H) * (B, S, 1) → (B, S, H): this is how you would multiply per-token weights by per-feature weights.

Get this wrong and you silently broadcast in unintended directions, producing the right shape but the wrong math. A frequent bug source.
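
A NumPy sketch of these rules, including one silent-bug case (all shapes here are illustrative):

```python
import numpy as np

B, S, H = 2, 3, 4
x = np.ones((B, S, H))
bias = np.arange(H, dtype=float)       # shape (H,)

y = x + bias                           # (H,) → (1,1,H) → (B,S,H)
assert y.shape == (B, S, H)

tok_w  = np.ones((B, S, 1))            # per-token weights
feat_w = np.ones((B, 1, H))            # per-feature weights
z = tok_w * feat_w                     # → (B, S, H)
assert z.shape == (B, S, H)

# The bug case: a per-token vector of shape (S,) aligns against the H
# axis, not the S axis (alignment is from the right). Here S != H, so
# it raises; if S == H it would silently compute the wrong thing.
per_token = np.arange(S, dtype=float)  # shape (S,)
try:
    x + per_token                      # intended per-token add; wrong axis
except ValueError as e:
    print("caught:", e)                # the fix: reshape to (S, 1)
```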

Why (B, S, H) is the standard transformer layout

  • B (batch): independent examples processed in parallel for GPU throughput. The model is identical along this axis.
  • S (sequence): tokens within a sequence. Causal self-attention mixes information along this axis.
  • H (hidden / d_model): per-token feature dimension. Linear layers act along this axis.

This is the right ordering because:

  1. The vast majority of operations (LayerNorm, linear, GELU) act independently across B and S and only mix along H. So H being the last (innermost, fastest-changing memory) dimension makes those ops cache-friendly.
  2. Attention is an exception: Q·Kᵀ mixes along S. To do this, we reshape (B, S, H) to (B, S, num_heads, d_head), permute to (B, num_heads, S, d_head) for the attention matmuls, and permute back afterward.
  3. Position-along-sequence is the natural axis for masking, causal attention masks, RoPE rotations, and KV-cache slicing.

You will see a few other conventions ((S, B, H) in older PyTorch RNN code, (B, H, S) for some 1-D-conv-based models), but (B, S, H) is standard for attention-based architectures.


Part B-Calculus (only what backprop and gradient descent need)

B.12 Derivative as slope of tangent

For a function f : ℝ → ℝ, the derivative at x is

f'(x) = lim_{h→0} [f(x + h) − f(x)] / h

Geometrically: take two points on the graph, (x, f(x)) and (x+h, f(x+h)). The line through them is a secant; its slope is [f(x+h) − f(x)]/h. As h → 0, the secant approaches the tangent line at x, and its slope is f'(x).

Mental picture: zoom in on the graph of f at the point x. As you zoom enough, any smooth curve looks like a straight line. The slope of that line is f'(x).

The fundamental approximation:

f(x + h) ≈ f(x) + h · f'(x)        for small h

This is the linear (or first-order) approximation. Everything in optimization comes from this idea: locally, a smooth function looks linear; we exploit that to take useful steps.

Worked example: f(x) = x². By the limit:

f'(x) = lim_{h→0} [(x+h)² − x²] / h
      = lim_{h→0} [x² + 2xh + h² − x²] / h
      = lim_{h→0} [2xh + h²] / h
      = lim_{h→0} (2x + h)
      = 2x

So at x = 3, the slope is 6, and near x = 3, f(3 + h) ≈ 9 + 6h. (Exact: (3 + 0.1)² = 9.61; approximation: 9 + 0.6 = 9.6. Off by 0.01 = h².)
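
A quick numerical check of this derivative with finite differences (the step size h is an illustrative choice):

```python
def f(x):
    return x * x

x, h = 3.0, 1e-5
forward = (f(x + h) - f(x)) / h            # ≈ 6.00001 (O(h) error)
central = (f(x + h) - f(x - h)) / (2 * h)  # ≈ 6.000000 (O(h²) error)
print(forward, central, 2 * x)
```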

B.13 Chain rule

If y = f(g(x)), set u = g(x) so y = f(u). Then

dy/dx = (dy/du) · (du/dx) = f'(g(x)) · g'(x)

Why this is the engine of backprop: a deep network is a long composition

y = f_L(f_{L-1}(... f_2(f_1(x)) ...))

Compute the gradient of a scalar loss L with respect to any internal parameter by repeated application of the chain rule, multiplying local Jacobians from output back to that parameter. Backpropagation is exactly this-with two efficiencies:

  1. Cache the forward activations so each local Jacobian is cheap.
  2. Multiply right-to-left (output toward input) so each intermediate is a vector (the gradient w.r.t. that activation), not a full Jacobian matrix. We never materialize the giant intermediate Jacobians.

Worked example: y = sin(x²). Let u = x², so y = sin(u).

du/dx = 2x
dy/du = cos(u) = cos(x²)
dy/dx = cos(x²) · 2x = 2x · cos(x²)

B.14 Partial derivatives and the gradient

For a function f : ℝⁿ → ℝ, the partial derivative ∂f/∂xᵢ is the derivative of f with respect to xᵢ, holding all other variables constant.

Worked example: f(x, y) = x²·y + 3y.

∂f/∂x = 2x · y         (treating y as constant)
∂f/∂y = x² + 3         (treating x as constant)

The gradient ∇f is the vector of all partials:

∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)

For the example: ∇f(x, y) = (2xy, x² + 3).

Geometric meaning: at any point, ∇f points in the direction in which f increases fastest, and ‖∇f‖ is the rate of increase per unit distance in that direction. A direction perpendicular to ∇f keeps f constant (locally)-those are the level sets / contour lines.

Derivation of "gradient is direction of steepest ascent"

By the multivariate linear approximation: for a small step Δx ∈ ℝⁿ,

f(x + Δx) ≈ f(x) + ∇f(x) · Δx

Constrain ‖Δx‖ = ε (a small fixed step). To maximize f(x + Δx) − f(x) ≈ ∇f · Δx, we are maximizing the dot product of ∇f with a unit-length-times-ε vector Δx.

Using the dot-product identity a·b = ‖a‖‖b‖cos(θ): the dot product is maximized when cos(θ) = 1, i.e., Δx is aligned with ∇f. So the direction of steepest ascent is +∇f. The direction of steepest descent is −∇f. ✓

B.15 Gradient descent

Algorithm: pick a starting point x₀. Repeat

x_{t+1} = x_t − η · ∇f(x_t)

η > 0 is the learning rate (step size).

Why this works: by the linear approximation,

f(x_{t+1}) ≈ f(x_t) + ∇f(x_t) · (x_{t+1} − x_t)
          = f(x_t) − η · ‖∇f(x_t)‖²

For small enough η, the correction −η·‖∇f‖² is negative (assuming gradient is nonzero), so f decreases. Over many steps, we converge toward a minimum.

If η is too large, the linear approximation is invalid and we may overshoot-the loss can increase. If η is too small, we crawl. Most modern training uses adaptive optimizers (Adam, AdamW) that adjust per-parameter step sizes from running estimates of gradient first and second moments, plus a learning-rate schedule (warmup + cosine decay typically) for the global η.

Stochastic gradient descent (SGD): instead of ∇f(x_t) over the whole dataset, use the gradient on a random mini-batch. The gradient is noisy but unbiased; it is much cheaper per step; and the noise itself acts as implicit regularization. All of modern deep learning is SGD with adaptive step sizes.
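
A minimal gradient-descent sketch on f(x) = x², where each update has the closed form x ← (1 − 2η)·x, so with η = 0.1 the iterate decays geometrically (and with η > 1 the same loop would diverge):

```python
def grad(x):
    return 2 * x   # gradient of f(x) = x²

x, eta = 5.0, 0.1
for t in range(50):
    x = x - eta * grad(x)
print(x)           # ~7e-5, essentially at the minimum x = 0
```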

B.16 Derive ∂/∂w of MSE for ŷ = w·x + b

Setup: a single example (x, y), prediction ŷ = w·x + b (with scalar w, b, x, y for clarity), squared error loss

L = (y − ŷ)² = (y − w·x − b)²

Goal: ∂L/∂w.

Let r = y − w·x − b (the residual). Then L = r². By the chain rule:

∂L/∂w = (dL/dr) · (∂r/∂w)
      = 2r · (−x)
      = −2x · (y − w·x − b)
      = −2x · (y − ŷ)

So:

∂L/∂w = −2 · x · (y − ŷ)

Sanity check on signs:

  • If ŷ < y (we underpredicted) and x > 0: (y − ŷ) > 0, so ∂L/∂w < 0. Gradient descent updates w ← w − η·(∂L/∂w) = w + η·2x·(y−ŷ) - i.e., we increase w, which (since x > 0) increases ŷ, reducing the error. ✓
  • If ŷ > y (overpredicted), (y − ŷ) < 0, so ∂L/∂w > 0, and we decrease w. ✓

Vector form (with w, x ∈ ℝᵈ and ŷ = w·x + b):

∂L/∂w = −2 · (y − ŷ) · x        (a vector of length d)
∂L/∂b = −2 · (y − ŷ)

Over a batch, average across examples. This is exactly the linear-regression gradient, and the closed-form least-squares solution w = (XᵀX)⁻¹Xᵀy is the place where this gradient is zero.
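
A finite-difference check of this gradient at an arbitrary point (all constants illustrative):

```python
x, y, w, b = 2.0, 5.0, 0.3, 0.1
h = 1e-6
loss = lambda w_: (y - (w_ * x + b)) ** 2

numeric  = (loss(w + h) - loss(w - h)) / (2 * h)
analytic = -2 * x * (y - (w * x + b))   # −2·x·(y − ŷ)
print(numeric, analytic)                # both ≈ −17.2
```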

B.17 Derive ∂/∂z of cross-entropy(y, softmax(z)) = softmax(z) − y

This is the derivation that justifies the elegant softmax_logits − one_hot_label form you see in classification training code.

Setup: z ∈ ℝᴷ is a vector of logits (one per class). The softmax produces a probability vector p:

pᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)        for i = 1, ..., K

y ∈ ℝᴷ is the one-hot label: yᵢ = 1 for the true class, 0 otherwise. So Σᵢ yᵢ = 1.

Cross-entropy loss:

L = − Σᵢ yᵢ · log(pᵢ)

Goal: derive ∂L/∂zₖ for each k.

Step 1: simplify L using the one-hot structure.

Since y is one-hot at the true class c, only one term survives:

L = − log(p_c)

Or, expanding p_c:

L = − [z_c − log(Σⱼ exp(zⱼ))]
  = − z_c + log(Σⱼ exp(zⱼ))

This is the log-sum-exp form of cross-entropy and the form that is numerically stable in practice (subtract max(z) from all zⱼ first).

Step 2: derivative of the log-sum-exp term.

Let S = Σⱼ exp(zⱼ). By the chain rule:

∂/∂zₖ log(S) = (1/S) · ∂S/∂zₖ
             = (1/S) · exp(zₖ)
             = exp(zₖ) / S
             = pₖ

Step 3: derivative of −z_c.

∂/∂zₖ (−z_c) = −1 if k = c else 0
            = −yₖ        (using the one-hot definition)

Step 4: combine.

∂L/∂zₖ = −yₖ + pₖ = pₖ − yₖ

Vector form:

∂L/∂z = p − y = softmax(z) − y

That is, the gradient of cross-entropy w.r.t. the pre-softmax logits is exactly the predicted probability minus the one-hot label. No special cases, no awkward division-by-zero issues that a naive (− y / p) · (∂p/∂z) derivation would suggest.

This is why every classification head in every modern framework is "linear → softmax cross-entropy," and the gradient flows back as (p − y) cleanly. It is also why frameworks fuse softmax and cross-entropy into a single op: doing them separately introduces numerical issues and computes the same gradient anyway.

Generalization beyond one-hot: if y is a general probability vector (e.g., from label smoothing or distillation targets), the same derivation goes through with Σᵢ yᵢ = 1, and the result is still ∂L/∂z = p − y. ✓
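
A numerical check that the analytic gradient p − y matches finite differences (the logits and label here are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])  # one-hot, true class c = 1

analytic = softmax(z) - y
numeric = np.zeros_like(z)
h = 1e-6
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += h; zm[k] -= h
    numeric[k] = (loss(zp, y) - loss(zm, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # ~1e-9
```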

B.18 Jacobian and the vector chain rule

For a function f : ℝⁿ → ℝᵐ, the Jacobian J is the m × n matrix of all partials:

J[i, j] = ∂fᵢ/∂xⱼ

The Jacobian generalizes the derivative to vector-valued functions.

Vector chain rule: if y = f(g(x)) with g : ℝⁿ → ℝᵏ and f : ℝᵏ → ℝᵐ, then

J_y/x = J_f(g(x)) · J_g(x)        (matrix product, m×k times k×n)

This is just the scalar chain rule applied componentwise and packaged into matrices.

Why we need Jacobians for backprop: a layer like y = W·x is a vector-valued function of vector-valued input. The Jacobian of y w.r.t. x is W itself (componentwise: yᵢ = Σⱼ Wᵢⱼ xⱼ, so ∂yᵢ/∂xⱼ = Wᵢⱼ).

Backward through a linear layer: forward y = W·x. Suppose downstream we have a scalar loss L and we know g_y = ∂L/∂y (a vector of shape (m,)). We want g_x = ∂L/∂x.

By the chain rule, g_x (as a row vector in the gradient-as-row convention) is g_y · J_y/x = g_y · W. As a column vector,

g_x = Wᵀ · g_y

That's where transpose appears in backprop: the Jacobian of a forward linear layer is W, and backprop multiplies by the transpose of the Jacobian to push gradients backward.

For the weights:

∂L/∂Wᵢⱼ = (∂L/∂yᵢ) · (∂yᵢ/∂Wᵢⱼ) = g_y[i] · x[j]

So ∂L/∂W = g_y · xᵀ (an outer product, shape (m, n) = (m, 1)·(1, n)). This is the rule "weight gradient is upstream gradient outer-product input."

Practical note: in real backprop we never materialize per-layer Jacobians. We define a backward(g_y) → g_x, g_W function for each layer that computes Wᵀ·g_y and g_y·xᵀ directly without forming the Jacobian explicitly. The Jacobian is the conceptual scaffold; the implementation is the closed-form contraction.
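
A NumPy sketch of this backward rule, with the input gradient verified by a finite difference through the scalar loss L(x) = g_y · (W·x) (sizes and values are arbitrary):

```python
import numpy as np

m, n = 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))
x = rng.normal(size=(n,))

y = W @ x                     # forward: shape (m,)
g_y = rng.normal(size=(m,))   # pretend upstream gradient ∂L/∂y

g_x = W.T @ g_y               # ∂L/∂x, shape (n,)
g_W = np.outer(g_y, x)        # ∂L/∂W, shape (m, n): outer product rule
assert g_x.shape == x.shape and g_W.shape == W.shape

# Finite-difference check on one input coordinate.
h, k = 1e-6, 2
xp, xm = x.copy(), x.copy()
xp[k] += h; xm[k] -= h
num = (g_y @ (W @ xp) - g_y @ (W @ xm)) / (2 * h)
print(num, g_x[k])            # agree
```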

B.19 Hessian (light touch)

The Hessian H of a scalar function f : ℝⁿ → ℝ is the n × n matrix of second-order partials:

H[i, j] = ∂²f / (∂xᵢ ∂xⱼ)

It encodes curvature-how the gradient itself changes as you move. For smooth f, H is symmetric (∂²f/∂x∂y = ∂²f/∂y∂x).

Why curvature matters: the second-order Taylor expansion is

f(x + Δx) ≈ f(x) + ∇f(x) · Δx + (1/2) Δxᵀ · H · Δx

The first-order term tells you the slope; the second-order term tells you the bowl shape. Newton's method uses H directly: Δx = −H⁻¹ · ∇f. This is far faster near a minimum but requires O(n²) storage and O(n³) solves per step-infeasible for billion-parameter models.

Why we don't use the Hessian in deep learning training: too big. For a model with N = 10⁹ parameters, the Hessian is 10¹⁸ entries. Modern optimizers (Adam, AdamW, Lion, Sophia, Shampoo) use approximations to second-order info-diagonal estimates, block-diagonal estimates, or running variance estimates that act as a coarse curvature proxy. We defer that discussion to the optimization chapter; for now, know that Hessian = curvature, that curvature accelerates convergence in principle, and that practical optimizers approximate it cheaply.


Part C-Probability and Statistics (the ML-relevant subset)

C.20 Random variables and distributions

A random variable X is a variable whose value is uncertain-described by a probability distribution.

  • Discrete RV: takes values in a countable set. Described by a probability mass function (PMF) P(X = x) = p(x) with Σₓ p(x) = 1.
  • Continuous RV: takes real values. Described by a probability density function (PDF) p(x) with ∫p(x)dx = 1. P(a ≤ X ≤ b) = ∫_a^b p(x)dx. The point-probability P(X = x) is zero for continuous RVs; only intervals have positive probability.

ML mostly cares about joint and conditional distributions over many variables: p(x, y) over input-and-label pairs, p(y | x) for a classifier's predictive distribution, p(x_t | x_{<t}) for a language model's next-token distribution.

C.21 Expectation, variance, standard deviation

Expectation (a.k.a. mean):

E[X] = Σₓ x · p(x)        (discrete)
E[X] = ∫ x · p(x) dx      (continuous)

The "average value" of X weighted by probabilities.

Linearity of expectation-the most useful property:

E[a·X + b·Y + c] = a·E[X] + b·E[Y] + c

This holds even when X and Y are not independent. Critical: linearity is true unconditionally; multiplicativity (E[XY] = E[X]E[Y]) requires independence.

Variance: spread around the mean.

Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

The second form is computationally handy. Variance is in squared units.

Standard deviation:

σ(X) = √Var(X)

In the same units as X. Roughly the typical deviation from the mean.

Why we care in ML:

  • Loss is an expectation over the data distribution; we estimate it with the empirical mean over a batch.
  • Variance of the gradient estimator drives optimization noise. Bigger batch ⇒ lower-variance gradient estimate ⇒ smoother trajectory but more compute per step.
  • Initialization (Xavier/Glorot, He) is designed to keep activation variance roughly constant across layers-so signals don't explode or vanish at initialization.
  • BatchNorm, LayerNorm, RMSNorm: all manipulate first and second moments of activations.

C.22 Common distributions

Bernoulli(p): a single binary trial. P(X=1) = p, P(X=0) = 1−p. Mean p, variance p(1−p). Where it shows up: binary classification (per-example label is Bernoulli given features), dropout (each unit kept with probability p).

Categorical(p₁, ..., p_K): a single trial with K outcomes; P(X = k) = pₖ with Σpₖ = 1. Where: multi-class classification target distribution; next-token distribution from a language model.

Gaussian / Normal N(μ, σ²):

p(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Mean μ, variance σ². Where: noise model for regression (predict a real value, errors approximately Gaussian); weight initialization; the Central Limit Theorem makes Gaussians the right default whenever you're summing many small effects; latent variables in VAEs and diffusion are Gaussian by design.

Multivariate Gaussian N(μ, Σ): μ ∈ ℝᵈ, Σ ∈ ℝᵈˣᵈ is the covariance matrix (positive definite for the density below to exist). PDF:

p(x) = (2π)^(-d/2) · |Σ|^(-1/2) · exp(−(1/2)·(x−μ)ᵀΣ⁻¹(x−μ))

Uniform U(a, b): equal density on [a, b]. Where: weight initialization (Xavier/Glorot and He both have uniform variants over a small symmetric interval), and random sampling for synthetic data and augmentation.

C.23 Conditional probability and Bayes' theorem

Joint probability P(A, B) is the probability that both events happen. Equivalently P(A ∩ B).

Conditional probability:

P(A | B) = P(A, B) / P(B)        (when P(B) > 0)

Read: "the probability of A given that we know B happened."

Rearranging gives the chain rule of probability:

P(A, B) = P(A | B) · P(B) = P(B | A) · P(A)

Bayes' theorem: from the equality P(A | B) · P(B) = P(B | A) · P(A), divide both sides by P(B):

P(A | B) = P(B | A) · P(A) / P(B)

Read it as:

posterior  =  likelihood × prior  /  evidence

  • P(A) - the prior on A before we see B.
  • P(B | A) - the likelihood of observing B if A were true.
  • P(B) - the total probability of B, the normalizing constant. It can be expanded by the law of total probability: P(B) = P(B|A)·P(A) + P(B|¬A)·P(¬A) for binary A, or generally P(B) = Σₖ P(B|Aₖ)·P(Aₖ) over a partition.
  • P(A | B) - the posterior on A after seeing B.

Why this matters in ML: discriminative models learn P(y|x) directly. Generative models often factor as P(x|y)·P(y) and use Bayes to invert when needed (e.g., classification from a generative model). Bayesian deep learning is about treating weights as random variables with posterior P(W | data). Diffusion training reverses a noise process-every step is a conditional distribution p(x_{t-1} | x_t).

C.24 Independence and conditional independence

Events A, B are independent iff P(A, B) = P(A) · P(B). Equivalently P(A|B) = P(A): knowing B tells you nothing about A.

For random variables: X ⊥ Y iff p(x, y) = p(x)·p(y) for all x, y.

Conditional independence: X ⊥ Y | Z iff p(x, y | z) = p(x | z) · p(y | z). Given Z, X and Y are independent-but they may not be marginally independent.

Where conditional independence shows up:

  • Naive Bayes assumes features are conditionally independent given the class. Strong assumption; surprisingly effective on text.
  • Causal graphs and Markov blankets.
  • Autoregressive language models factor p(x₁, ..., x_T) = Πₜ p(xₜ | x_{<t}) - this is not an independence assumption (the factorization is exact by the chain rule), but it does say each token is generated conditioned only on its prefix.
  • Diffusion models assume the reverse process is Markovian: p(x_{t-1} | x_t, x_{t+1}, ..., x_T) = p(x_{t-1} | x_t).

C.25 Maximum Likelihood Estimation (MLE)

Setup: a parametric model p(x | θ) and an i.i.d. dataset D = {x₁, ..., x_N}. We want to choose θ.

The likelihood is the model's probability of producing the observed data:

L(θ) = Πᵢ p(xᵢ | θ)

The MLE is

θ_MLE = argmax_θ L(θ) = argmax_θ Σᵢ log p(xᵢ | θ)

(Take logs to convert the product to a sum; argmax is unchanged because log is monotone.)

Equivalently, MLE minimizes the negative log-likelihood (NLL):

NLL(θ) = − Σᵢ log p(xᵢ | θ)

That is what we literally compute as the training loss in classification (cross-entropy) and language modeling (next-token NLL).

MLE for Gaussian noise = MSE

Setup: regression. We model y | x ~ N(f_θ(x), σ²) for some fixed σ². Per example:

p(yᵢ | xᵢ, θ) = (1/√(2πσ²)) · exp(−(yᵢ − f_θ(xᵢ))² / (2σ²))

Log-likelihood per example:

log p(yᵢ | xᵢ, θ) = −(1/2)·log(2πσ²) − (yᵢ − f_θ(xᵢ))² / (2σ²)

Total NLL:

NLL(θ) = (N/2)·log(2πσ²) + (1/(2σ²)) · Σᵢ (yᵢ − f_θ(xᵢ))²

The first term is constant in θ (since σ² is fixed). Minimizing NLL over θ is therefore equivalent to minimizing

Σᵢ (yᵢ − f_θ(xᵢ))²        (sum of squared errors = N · MSE up to constants)

— exactly mean squared error. ✓

So when you train a regression model with MSE, you are doing MLE under a Gaussian-noise assumption. If your residuals are Gaussian-ish, MSE is principled. If they have heavy tails, MAE (Laplace assumption) or Huber loss may be better-justified.
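
A sketch confirming this equivalence numerically: on synthetic data (the true w and σ are invented for illustration), the Gaussian NLL and the sum of squared errors are minimized by the same w over a grid, because they differ only by a constant and a positive scale.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.7 * x + 0.5 * rng.normal(size=200)   # true w = 1.7, σ = 0.5

ws = np.linspace(0.0, 3.0, 601)
sse = np.array([np.sum((y - w * x) ** 2) for w in ws])

sigma2 = 0.25                              # treat σ² as fixed, as in the text
nll = 0.5 * len(x) * np.log(2 * np.pi * sigma2) + sse / (2 * sigma2)

print(ws[np.argmin(sse)], ws[np.argmin(nll)])  # identical minimizer
```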

MLE for Categorical = Cross-entropy

Setup: classification. A model outputs a categorical distribution p_θ(y | x) (e.g., softmax(logits)). Per example, with one-hot label y (so the true class is c):

p_θ(y | x) = p_θ(c | x)

Log-likelihood: log p_θ(c | x). Total NLL:

NLL(θ) = − Σᵢ log p_θ(cᵢ | xᵢ)

This is cross-entropy (in its one-hot form). With non-one-hot targets y (a probability vector), the same calculation gives

CE = − Σᵢ Σₖ yᵢ,ₖ · log p_θ(k | xᵢ)

— general cross-entropy. ✓

So cross-entropy IS the negative log-likelihood for categorical models. Training a classifier with cross-entropy loss is doing MLE under a categorical model. Training a language model with next-token cross-entropy is doing MLE for the joint distribution of token sequences (factorized autoregressively).

This is the deep reason these losses are the defaults: they are not arbitrary penalties; they are the principled probabilistic choices given the modeling assumption.

C.26 KL divergence

The Kullback-Leibler divergence between distributions p and q (over the same space) is

KL(p ‖ q) = Σₓ p(x) · log(p(x) / q(x))      (discrete)
KL(p ‖ q) = ∫ p(x) · log(p(x) / q(x)) dx    (continuous)

It measures "how far q is from p"-but it is not symmetric, and it is not a metric. Properties:

  • KL(p ‖ q) ≥ 0 always (Gibbs' inequality).
  • KL(p ‖ q) = 0 iff p = q almost everywhere.
  • KL(p ‖ q) ≠ KL(q ‖ p) in general.

Why we minimize it: if p is the true data distribution and q_θ is our model, training tries to make q_θ close to p.

Connection: minimizing cross-entropy ≈ minimizing KL

Decompose:

KL(p ‖ q) = Σₓ p(x)·log p(x) − Σₓ p(x)·log q(x)
          = − H(p) − Σₓ p(x)·log q(x)
          = H(p, q) − H(p)

where H(p, q) = − Σₓ p(x)·log q(x) is the cross-entropy and H(p) = − Σₓ p(x)·log p(x) is the entropy of p.

So:

KL(p ‖ q) = H(p, q) − H(p)

H(p) does not depend on q. Therefore minimizing KL(p ‖ q) over q is equivalent to minimizing H(p, q) - the cross-entropy. The constant offset −H(p) doesn't move the argmin.

In ML practice, p is the empirical data distribution (one-hot or near-one-hot for labeled data). We minimize cross-entropy, which is exactly MLE (C.25), which is exactly KL minimization up to a constant. Three different motivations, one optimization objective. That convergence of motivations is why cross-entropy is the right loss.
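
A three-outcome numerical check of the decomposition (the two distributions are arbitrary):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl            = np.sum(p * np.log(p / q))   # KL(p ‖ q)
cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
entropy       = -np.sum(p * np.log(p))      # H(p)

print(kl, cross_entropy - entropy)          # equal
```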

C.27 Cross-entropy in detail

For two probability vectors p (target) and q (predicted), over K classes:

H(p, q) = − Σₖ pₖ · log qₖ

One-hot target case: p = e_c (the c-th standard basis vector), so

H(p, q) = − log q_c

— just the negative log-probability the model assigns to the true class. This is the "NLL of the right answer."

Properties:

  • Non-negative; minimized at zero when q puts all mass on the true class.
  • Strongly penalizes confident wrong answers (log of a small number is very negative).
  • Asymmetric in roles: the target p supplies the weights; the prediction q is what gets logged.

Why it's the standard classification loss: derived three ways above-MLE, KL minimization, log-likelihood-they all give cross-entropy. The gradient-w.r.t.-logits form softmax(z) − y (B.17) makes it cheap and numerically clean.

Connection to perplexity (C.28): cross-entropy is reported in nats per token (or bits per token if log base 2). Perplexity is exp(cross-entropy).

C.28 Information theory primer; perplexity

Self-information of an outcome x: I(x) = − log p(x). Rare events carry more information; certain events carry zero information. Units depend on log base: log₂ → bits, ln → nats.

Entropy of a distribution p:

H(p) = E_p[− log p(X)] = − Σₓ p(x)·log p(x)

The expected information per sample. Equivalently, the average optimal code length per sample (Shannon's source coding theorem). Higher entropy ⇒ more uncertainty.

Worked example: fair coin. p(H) = p(T) = 0.5. H = − 0.5·log 0.5 − 0.5·log 0.5 = log 2 = 0.693 nats = 1 bit. Maximally uncertain.

Biased coin p(H) = 0.9: H = − 0.9·log 0.9 − 0.1·log 0.1 ≈ 0.325 nats. Less uncertain (we mostly know what's coming).

Perplexity:

PPL(p) = exp(H(p))

Or for a model evaluated on data:

PPL = exp(cross_entropy_loss_in_nats)

Interpretation: the model is "as confused as if it had to choose uniformly among PPL equally likely options at each step."

Worked example: a language model with cross-entropy loss 2.5 nats/token has perplexity exp(2.5) ≈ 12.18. Roughly: at each token, the model is as uncertain as picking from ~12 equally likely options.

Lower perplexity = better predictive language model. A model with PPL=20 on Wikipedia is far worse than one with PPL=10. The minimum achievable PPL is bounded below by the entropy of the data itself (you can't do better than the data's intrinsic randomness).

If the loss is reported in log₂ (bits/token), perplexity is 2^cross_entropy. Most LM training uses natural log, so exp is the right base.
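
A sketch of the computation, with invented per-token probabilities standing in for a model's outputs:

```python
import numpy as np

# Probability the model assigned to each token that actually occurred.
p_observed = np.array([0.5, 0.05, 0.2, 0.1])

ce_nats = -np.mean(np.log(p_observed))  # cross-entropy in nats/token
print(ce_nats, np.exp(ce_nats))         # ≈ 1.90 nats → PPL ≈ 6.7
```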

C.29 Sampling and the law of large numbers

Sample mean: given i.i.d. samples X₁, ..., X_N from a distribution with mean μ, the sample mean is

X̄_N = (1/N) · Σᵢ Xᵢ

Law of Large Numbers (LLN, intuitive form): as N → ∞, X̄_N → μ.

Why this matters in ML:

  • Mini-batch gradients. The true loss is E_{x ~ data}[ℓ(x; θ)]; the mini-batch loss is its sample average. By LLN, the mini-batch gradient is an unbiased estimator of the full-data gradient, with variance shrinking as 1/B (batch size).
  • Monte Carlo estimation. Whenever you can't compute an expectation in closed form (e.g., expected reward in RL, expected ELBO in VAEs), draw samples and average.
  • Eval metrics on a held-out set. You report sample-average accuracy / perplexity / BLEU; LLN tells you that with enough samples, this estimate is close to the true expected metric. Confidence intervals come from the Central Limit Theorem: for large N, X̄_N is approximately Gaussian with mean μ and standard deviation σ/√N. So error bars shrink as 1/√N - to halve them, you need 4× more samples (see the sketch below).
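
A Monte Carlo sketch of the 1/√N shrinkage, measuring the sample mean's spread across repeated trials:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 0.0, 1.0, 2000
for N in (100, 400, 1600):
    # trials independent datasets of size N; one sample mean per dataset
    means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
    print(N, means.std())   # ≈ σ/√N: 0.1, 0.05, 0.025
```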

Part D-Worked exercises (full solutions)

D.1 Cosine similarity of (3, 4) and (4, 3)

a = (3, 4)
b = (4, 3)

a · b = 3·4 + 4·3 = 12 + 12 = 24
‖a‖   = √(9 + 16) = √25 = 5
‖b‖   = √(16 + 9) = √25 = 5

cos_sim = 24 / (5 · 5) = 24 / 25 = 0.96

θ = arccos(0.96) ≈ 16.26°

Predicted angle: about 16°. Sanity check: both vectors are close to the line y = x (which is at 45° from each axis). (3,4) is just above the line (slope 4/3 > 1, angle from x-axis ≈ 53.13°); (4,3) is just below (slope 3/4 < 1, angle ≈ 36.87°). Difference ≈ 16.26°. ✓

D.2 Shape walkthrough for a 2-layer MLP

Architecture:

x : (B=4, in=8)
h = ReLU(x · W₁ + b₁)
y = h · W₂ + b₂

Shapes:

symbol    shape     reason
x         (4, 8)    batch of 4, each input is 8-dim
W₁        (8, 16)   maps in=8 to hidden=16
b₁        (16,)     one bias per hidden unit; broadcasts over batch
x · W₁    (4, 16)   (4,8)·(8,16) = (4,16)
h         (4, 16)   same shape as x · W₁ after bias add and ReLU
W₂        (16, 3)   maps hidden=16 to out=3
b₂        (3,)      one bias per output
h · W₂    (4, 3)    (4,16)·(16,3) = (4,3)
y         (4, 3)    final logits, batch of 4, 3-class problem

Chain compatibility check: at each matmul, inner dimensions agree (8=8 for x·W₁, 16=16 for h·W₂). Broadcasting for biases: (4,16) + (16,) → (4,16); (4,3) + (3,) → (4,3). All consistent.

Parameter counts:

  • W₁: 8 · 16 = 128
  • b₁: 16
  • W₂: 16 · 3 = 48
  • b₂: 3
  • Total: 128 + 16 + 48 + 3 = 195 parameters.

Backward shapes (using B.18):

  • Upstream gradient at output: g_y shape (4, 3).
  • ∂L/∂W₂ = hᵀ · g_y of shape (16, 4)·(4, 3) = (16, 3). ✓ Same shape as W₂.
  • ∂L/∂h = g_y · W₂ᵀ of shape (4, 3)·(3, 16) = (4, 16). ✓
  • ReLU's local Jacobian: 1 where h > 0, 0 elsewhere-applied elementwise.
  • ∂L/∂W₁ = xᵀ · (∂L/∂(x·W₁+b₁)) of shape (8, 4)·(4, 16) = (8, 16). ✓ Same shape as W₁.

Every shape is forced by chain compatibility.
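
The whole exercise as a NumPy sketch, with every claimed shape asserted (weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
B, In, Hid, Out = 4, 8, 16, 3
x  = rng.normal(size=(B, In))
W1 = rng.normal(size=(In, Hid)); b1 = np.zeros(Hid)
W2 = rng.normal(size=(Hid, Out)); b2 = np.zeros(Out)

pre = x @ W1 + b1                  # (4, 16); bias broadcasts over batch
h   = np.maximum(pre, 0.0)         # ReLU, elementwise
y   = h @ W2 + b2                  # (4, 3)
assert y.shape == (B, Out)
print(W1.size + b1.size + W2.size + b2.size)  # 195 parameters

g_y   = rng.normal(size=y.shape)   # pretend upstream gradient ∂L/∂y
g_W2  = h.T @ g_y                  # (16, 3)
g_h   = g_y @ W2.T                 # (4, 16)
g_pre = g_h * (pre > 0)            # ReLU local Jacobian as a mask
g_W1  = x.T @ g_pre                # (8, 16)
assert g_W1.shape == W1.shape and g_W2.shape == W2.shape
```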

D.3 Gradient of L = (y − σ(w · x))² with sigmoid

Setup: y, w, x scalar (extends componentwise to vectors). Sigmoid

σ(z) = 1 / (1 + exp(−z))

Let z = w·x, ŷ = σ(z), L = (y − ŷ)².

Need: ∂L/∂w.

Step 1: derivative of sigmoid.

σ'(z) = d/dz [1 / (1 + e^{−z})]
      = −1·(1 + e^{−z})^{−2} · (−e^{−z})
      = e^{−z} / (1 + e^{−z})²
      = [1/(1+e^{−z})] · [e^{−z}/(1+e^{−z})]
      = σ(z) · (1 − σ(z))

So σ'(z) = σ(z)(1 − σ(z)) = ŷ · (1 − ŷ). Memorize this-it's a workhorse identity.

Step 2: chain rule.

L = (y − ŷ)²
∂L/∂ŷ = −2(y − ŷ)
∂ŷ/∂z = σ'(z) = ŷ(1 − ŷ)
∂z/∂w = x

By the chain rule:

∂L/∂w = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)
      = [−2(y − ŷ)] · [ŷ(1 − ŷ)] · x
      = −2 · x · (y − ŷ) · ŷ · (1 − ŷ)

Vector form (with w, x ∈ ℝᵈ):

∂L/∂w = −2 · (y − ŷ) · ŷ · (1 − ŷ) · x        (gradient vector in ℝᵈ)

Note on why this is rarely used: combining sigmoid + MSE introduces the ŷ(1 − ŷ) factor, which vanishes when ŷ is near 0 or 1-exactly when the model is confidently wrong. The gradient saturates and learning stalls.

The fix is to use sigmoid + binary cross-entropy (or equivalently, the numerically-stable binary_cross_entropy_with_logits). The gradient w.r.t. w simplifies to (ŷ − y)·x - no ŷ(1−ŷ) factor, no saturation. This is the same predicted − target clean gradient we derived for softmax + cross-entropy in B.17. That is the reason classification uses cross-entropy and not MSE.
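
A sketch of the saturation problem at a confidently wrong prediction (the point w = −6 is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, x, w = 1.0, 1.0, -6.0   # true label 1; model very confident in 0
y_hat = sigmoid(w * x)     # ≈ 0.0025

grad_mse = -2 * x * (y - y_hat) * y_hat * (1 - y_hat)  # ≈ -0.005: nearly stalled
grad_bce = (y_hat - y) * x                             # ≈ -0.998: full strength
print(grad_mse, grad_bce)
```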

D.4 Bayes for spam classification

Given:

P(spam)        = 0.3
P(¬spam)       = 0.7
P(word | spam) = 0.8
P(word | ¬spam) = 0.1

Goal: P(spam | word).

By Bayes' theorem:

P(spam | word) = P(word | spam) · P(spam) / P(word)

By the law of total probability:

P(word) = P(word | spam) · P(spam) + P(word | ¬spam) · P(¬spam)
        = 0.8 · 0.3 + 0.1 · 0.7
        = 0.24 + 0.07
        = 0.31

So:

P(spam | word) = (0.8 · 0.3) / 0.31
               = 0.24 / 0.31
               ≈ 0.7742

Interpretation: before seeing the word, our prior on spam was 30%. After seeing the word (which is much more likely under spam than under non-spam), our posterior is about 77%. The word is strong evidence; the prior was weak; the posterior shifts substantially.

Sanity check on the likelihood ratio:

LR = P(word | spam) / P(word | ¬spam) = 0.8 / 0.1 = 8

The word is 8× more likely under spam. Posterior odds = prior odds × LR:

prior odds       = 0.3 / 0.7 ≈ 0.4286
posterior odds   = 0.4286 · 8 ≈ 3.4286
posterior prob   = 3.4286 / (1 + 3.4286) ≈ 0.7742  ✓

(This is the "log-odds form" of Bayes that underlies naive Bayes classifiers and logistic regression's logit interpretation.)
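
Both routes to the posterior as a short sketch:

```python
p_spam, p_word_spam, p_word_ham = 0.3, 0.8, 0.1

# Direct Bayes with the law of total probability in the denominator.
evidence = p_word_spam * p_spam + p_word_ham * (1 - p_spam)   # 0.31
posterior = p_word_spam * p_spam / evidence

# Odds form: posterior odds = prior odds × likelihood ratio.
odds = (p_spam / (1 - p_spam)) * (p_word_spam / p_word_ham)
posterior_via_odds = odds / (1 + odds)

print(posterior, posterior_via_odds)  # both ≈ 0.7742
```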

D.5 MLE for Gaussian-noise regression = MSE (full derivation)

Setup: data {(xᵢ, yᵢ)}_{i=1}^N, model f_θ(x), assume

yᵢ = f_θ(xᵢ) + εᵢ,    εᵢ ~ N(0, σ²)  i.i.d.

Equivalently, yᵢ | xᵢ ~ N(f_θ(xᵢ), σ²).

Likelihood of the dataset:

L(θ) = Πᵢ p(yᵢ | xᵢ, θ)
     = Πᵢ (1/√(2πσ²)) · exp(−(yᵢ − f_θ(xᵢ))² / (2σ²))

Log-likelihood:

log L(θ) = Σᵢ [−(1/2)·log(2πσ²) − (yᵢ − f_θ(xᵢ))² / (2σ²)]
         = −(N/2)·log(2πσ²) − (1/(2σ²)) · Σᵢ (yᵢ − f_θ(xᵢ))²

MLE: maximize log L(θ) over θ. Treat σ² as fixed (or jointly optimize and substitute its MLE later-same conclusion for θ).

The first term −(N/2)·log(2πσ²) does not depend on θ. The second term is −(1/(2σ²)) (a negative constant) times the sum of squared errors Σᵢ (yᵢ − f_θ(xᵢ))².

Maximizing a constant-times-negative quantity = minimizing the positive quantity:

argmax_θ log L(θ) = argmin_θ Σᵢ (yᵢ − f_θ(xᵢ))²

Dividing by N (a positive constant, doesn't change argmin):

= argmin_θ (1/N) Σᵢ (yᵢ − f_θ(xᵢ))² = argmin_θ MSE(θ)        ✓

Conclusion: MLE under Gaussian noise is identical to MSE minimization.

Bonus-what about σ²? Take ∂ log L / ∂σ² = 0:

∂/∂σ² log L = −N/(2σ²) + (1/(2σ⁴)) · Σᵢ (yᵢ − f_θ(xᵢ))² = 0

Solving:

σ²_MLE = (1/N) · Σᵢ (yᵢ − f_θ(xᵢ))² = MSE at the MLE θ

So the MLE noise variance equals the residual MSE at the optimum-the residual variance is the noise estimate. In Bayesian regression with a prior on σ², you get a posterior; that's the gateway to heteroscedastic regression and uncertainty estimates. For deterministic deep nets with MSE loss, the implicit σ² cancels out and we just learn θ.

D.6 KL divergence between Bernoulli(0.3) and Bernoulli(0.5)

Setup:

p = Bernoulli(0.3):  P_p(1) = 0.3,  P_p(0) = 0.7
q = Bernoulli(0.5):  P_q(1) = 0.5,  P_q(0) = 0.5

KL(p ‖ q):

KL(p ‖ q) = Σₓ p(x) · log(p(x) / q(x))
          = 0.3 · log(0.3 / 0.5) + 0.7 · log(0.7 / 0.5)

Compute each term (natural log):

log(0.3 / 0.5) = log(0.6) ≈ −0.5108
log(0.7 / 0.5) = log(1.4) ≈  0.3365

So:

KL(p ‖ q) ≈ 0.3 · (−0.5108) + 0.7 · (0.3365)
          = −0.15325 + 0.23552
          ≈ 0.08228 nats

In bits (divide by ln 2 ≈ 0.6931):

≈ 0.08228 / 0.6931 ≈ 0.1187 bits

KL(q ‖ p) (asymmetry check):

KL(q ‖ p) = 0.5 · log(0.5/0.3) + 0.5 · log(0.5/0.7)
          = 0.5 · log(5/3) + 0.5 · log(5/7)
          = 0.5 · 0.5108 + 0.5 · (−0.3365)
          = 0.2554 − 0.16825
          ≈ 0.08715 nats

So KL(p ‖ q) ≈ 0.0823 and KL(q ‖ p) ≈ 0.0872. Different values, confirming asymmetry. (In this symmetric-ish example they're close; for very different distributions the gap is larger.)

Interpretation: both KLs are small and positive-the two Bernoullis are similar but not identical. Both are zero only when p = q exactly.

Sanity check: when q = uniform = Bernoulli(0.5):

KL(p ‖ uniform) = log K − H(p)

(general identity for KL to uniform over K outcomes; here K = 2).

log 2 ≈ 0.6931
H(p)  = −0.3·log 0.3 − 0.7·log 0.7
      = −0.3·(−1.2040) − 0.7·(−0.3567)
      = 0.3612 + 0.2497
      = 0.6109 nats

KL = 0.6931 − 0.6109 = 0.0822 nats   ✓  (matches our direct computation)

Wrap-up: the minimum mental model

If you keep only one paragraph from this chapter, keep this:

A neural network is a long composition of linear maps and nonlinearities. Linear algebra gives us the language for the linear pieces (matmul = composition; transpose appears whenever we go backward; SVD shows every matrix is rotate-scale-rotate; low-rank structure is everywhere and we exploit it). Calculus, via the chain rule, propagates a scalar loss's gradient back through the composition-and the only two derivatives an applied engineer must be able to do from scratch are MSE for regression and softmax-cross-entropy for classification. Probability tells us why those losses: cross-entropy is the negative log-likelihood for categorical models, MSE is the negative log-likelihood for Gaussian-noise regression, and minimizing either is equivalent (up to a constant) to minimizing KL divergence between data and model. Cosine similarity is the geometric meaning of normalized dot products and is why retrieval works. Perplexity is exp(cross_entropy) - the "effective branching factor" of a language model. That's the spine. Everything else-attention, normalization layers, optimizer variants, regularization-is engineering on top of these pieces.

You now have the machinery. The rest of the curriculum builds implementations on top of this math. When debugging or reading a paper, return here and re-derive-the muscle memory is what separates an applied engineer from someone who just runs notebooks.
