
Month 1-Week 3: Probability foundations + MLP forward/backward by hand

Week summary

  • Goal: Build probability foundations (random variables, expectation, MLE, KL divergence). Implement a 2-layer MLP, forward and backward pass, entirely in NumPy with no autograd. Train it to >90% test accuracy on MNIST.
  • Time: ~10–12 hours over 3 sessions (this is the densest week of the month).
  • Output: `03-mlp-numpy.ipynb` in `ml-from-scratch`: manual backprop, runnable on MNIST.
  • Sequences relied on: 03-probability-statistics rungs 01–07; 02-calculus rungs 04, 10; 07-deep-learning rungs 01–02.

Why this week matters in your AI expert journey

This is the week backprop becomes inevitable. After implementing forward AND backward by hand for a 2-layer network, PyTorch's .backward() stops being magic: you'll know exactly what it's doing under the hood. The probability section also lays the groundwork for understanding why cross-entropy is the loss for classification (it's MLE under a Categorical model) and for reading every paper that contains a KL divergence term: DPO, distillation, RLHF, anything alignment-flavored.

You will not implement backprop by hand again after this week. But having done it once permanently removes a layer of mystery from everything you'll build for the rest of your career.

Prerequisites

  • M01-W01, M01-W02 complete.
  • Comfortable with the multivariable chain rule (M01-W02 Session A Part 3).
  • NumPy fluency from prior weeks.

Week schedule

  • Session A (Tue/Wed evening, ~3 h): probability + MLE → cross-entropy
  • Session B (Sat morning, ~4 h): MLP forward + manual backward derivation
  • Session C (Sun afternoon, ~3 h): train on MNIST, polish, ship

Session A-Probability, MLE, and why cross-entropy is the LLM loss

Goal: Internalize random variables, expectation, MLE. Derive that minimizing cross-entropy is maximizing likelihood under a Categorical model.

Part 1-Random variables, expectation, variance (45 min)

A random variable maps outcomes to numbers. Examples: X = result of a die roll (1–6); X = next token in a sentence (∈ vocab).

Expectation: E[X] = Σ x·P(X=x) (discrete), the average value weighted by probability. Variance: Var(X) = E[(X − E[X])²], the average squared deviation from the mean.

Watch: Joe Blitzstein's Stat 110, Lectures 6–8 (search "Stat 110 expectation").

Worked example. For a die: E[X] = (1+2+3+4+5+6)/6 = 3.5. Var(X) = E[(X−3.5)²] = ((−2.5)² + ... + 2.5²)/6 = 35/12 ≈ 2.92.

Code

import numpy as np
rng = np.random.default_rng(0)
samples = rng.integers(1, 7, 100_000)   # 100,000 fair die rolls (values 1–6)
print(samples.mean())  # ~3.5
print(samples.var())   # ~2.92

Part 2-Maximum likelihood estimation (60 min)

Setup. A model with parameters θ defines P(x; θ). Given observed data {x₁, ..., xₙ}, MLE picks:

θ̂ = argmax_θ  ∏ᵢ P(xᵢ; θ)
   = argmax_θ  Σᵢ log P(xᵢ; θ)
   = argmin_θ  −Σᵢ log P(xᵢ; θ)        [negative log-likelihood]
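
Before moving to the Categorical case, here is a minimal sketch of MLE on the simplest example, a Bernoulli coin (illustrative only; the true P(heads) = 0.3 and the grid search are choices made here, not part of the week's deliverables). The log-likelihood over candidate θ peaks at the empirical frequency of heads.

import numpy as np

rng = np.random.default_rng(0)
flips = rng.random(1_000) < 0.3              # 1,000 coin flips, true P(heads) = 0.3

thetas = np.linspace(0.01, 0.99, 99)         # candidate values of θ
# log-likelihood of the observed flips under each candidate θ
log_lik = flips.sum() * np.log(thetas) + (~flips).sum() * np.log(1 - thetas)

print(thetas[log_lik.argmax()])              # ≈ 0.3 (nearest grid point to the sample mean)
print(flips.mean())                          # the Bernoulli MLE in closed form: the sample mean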

Now apply this to a Categorical distribution. Suppose your model outputs probabilities p = softmax(z) over K classes, and the true label is a one-hot vector y. The probability of the true label is:

P(y | x; θ) = ∏ₖ pₖ^yₖ

Negative log-likelihood:

−log P(y | x) = −Σₖ yₖ · log pₖ

This is exactly cross-entropy. So minimizing cross-entropy = maximizing likelihood under a Categorical model.
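
A quick numeric check of that identity (an illustrative sketch, not a deliverable): with a one-hot y, the sum −Σₖ yₖ·log pₖ collapses to −log of the probability assigned to the true class.

import numpy as np

z = np.array([2.0, -1.0, 0.5])            # logits for K = 3 classes
p = np.exp(z - z.max()); p /= p.sum()     # softmax probabilities
y = np.array([1.0, 0.0, 0.0])             # one-hot label: true class is 0

cross_entropy = -(y * np.log(p)).sum()    # −Σₖ yₖ·log pₖ
nll = -np.log(p[0])                       # −log P(true class | x)
print(cross_entropy, nll)                 # identical values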

Why this matters for LLMs. Next-token prediction is Categorical over the vocabulary. Cross-entropy is the natural training objective. Every LLM is trained by maximum likelihood.

Part 3-KL divergence and entropy (45 min)

Entropy: H(p) = −Σ p log p. Measures the uncertainty of distribution p. Maximum when uniform; zero when concentrated on one outcome.

KL divergence: D_KL(p || q) = Σ p·log(p/q) = E_p[log(p/q)]. Measures how distribution p differs from q. Always ≥ 0; zero iff p == q. Not symmetric: D_KL(p||q) ≠ D_KL(q||p).

Cross-entropy decomposition:

H(p, q) = H(p) + D_KL(p || q)

So minimizing cross-entropy when p (the data distribution) is fixed is equivalent to minimizing D_KL(p || q).

Why this matters. KL appears in:

  • Distillation: match a small model's output distribution to a teacher's.
  • DPO/RLHF: keep the policy from drifting too far from a reference model via a −β · D_KL(π_θ || π_ref) term.
  • VAEs: regularize the latent distribution toward a prior.

You will see D_KL constantly. Knowing what it measures (asymmetric divergence between distributions) is the floor.
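
A small numeric sketch of these definitions and the decomposition (p and q here are arbitrary distributions over four outcomes, chosen only for illustration):

import numpy as np

p = np.array([0.7, 0.1, 0.1, 0.1])      # "data" distribution
q = np.array([0.4, 0.3, 0.2, 0.1])      # model distribution

H_p   = -(p * np.log(p)).sum()          # entropy H(p)
H_pq  = -(p * np.log(q)).sum()          # cross-entropy H(p, q)
kl_pq =  (p * np.log(p / q)).sum()      # D_KL(p || q)
kl_qp =  (q * np.log(q / p)).sum()      # D_KL(q || p)

print(H_pq, H_p + kl_pq)                # equal: H(p, q) = H(p) + D_KL(p || q)
print(kl_pq, kl_qp)                     # different: KL is not symmetric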

Common pitfalls in Session A

  • Confusing entropy H(p) (a property of one distribution) with cross-entropy H(p, q) (between two).
  • Forgetting that KL is asymmetric and using D_KL(q || p) when you want D_KL(p || q).
  • Not seeing that "minimize cross-entropy" and "maximize likelihood" are the same thing said two ways.

Output of Session A

  • Notes file with the MLE → cross-entropy derivation written out.
  • A short NumPy snippet verifying E[X] and Var(X) empirically for a die.

Session B-MLP forward + backward by hand

Goal: Implement a 2-layer MLP for MNIST in NumPy. Derive every gradient by hand. Verify forward + backward correctness against numerical gradients.

Part 1-Architecture and forward pass (45 min)

The network. Input x ∈ ℝ⁷⁸⁴ (flattened MNIST). Architecture:

z₁ = x·W₁ + b₁     where W₁ ∈ ℝ⁷⁸⁴ˣ¹²⁸, b₁ ∈ ℝ¹²⁸
a₁ = ReLU(z₁)      ∈ ℝ¹²⁸
z₂ = a₁·W₂ + b₂    where W₂ ∈ ℝ¹²⁸ˣ¹⁰,  b₂ ∈ ℝ¹⁰
ŷ  = softmax(z₂)   ∈ ℝ¹⁰   (probabilities)
L  = −Σₖ yₖ·log ŷₖ        (cross-entropy)

Implement (forward only)

import numpy as np

def relu(x): return np.maximum(0, x)

def softmax(z):
    z_shift = z - z.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(z_shift)
    return exp / exp.sum(axis=-1, keepdims=True)

class MLP:
    def __init__(self, rng):
        # Kaiming-flavored init for ReLU
        self.W1 = rng.normal(0, np.sqrt(2/784), (784, 128))
        self.b1 = np.zeros(128)
        self.W2 = rng.normal(0, np.sqrt(2/128), (128, 10))
        self.b2 = np.zeros(10)
    def forward(self, X):
        self.X = X
        self.z1 = X @ self.W1 + self.b1
        self.a1 = relu(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.p  = softmax(self.z2)
        return self.p

Test: random input → output is a valid probability distribution per row.
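
A possible version of that test (a sketch, assuming the MLP class above):

rng = np.random.default_rng(0)
model = MLP(rng)
p = model.forward(rng.normal(size=(4, 784)))     # 4 random fake "images"

assert p.shape == (4, 10)
assert np.all(p >= 0)
assert np.allclose(p.sum(axis=-1), 1.0)          # each row sums to 1
print("forward pass produces valid probability rows")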

Part 2-Backward by hand (75 min)

Goal: derive ∂L/∂W₁, ∂L/∂b₁, ∂L/∂W₂, ∂L/∂b₂.

We start from the output and walk backwards. Use the chain rule.

Step 1: ∂L/∂z₂. For one example: L = −Σₖ yₖ · log ŷₖ and ŷ = softmax(z₂). Standard derivation (write it out):

∂L/∂z₂ = ŷ − y     (a vector of length 10)

This is the same form as logistic regression, generalized to multiclass.

Step 2: ∂L/∂W₂ and ∂L/∂b₂. Since z₂ = a₁ · W₂ + b₂:

∂L/∂W₂ = a₁ᵀ · (ŷ − y)              shape: (128, 10) ✓
∂L/∂b₂ = (ŷ − y)                    shape: (10,) ✓

Note the transpose: that's the chain rule across a matrix multiplication.

Step 3: ∂L/∂a₁. Since z₂ = a₁ · W₂ + b₂:

∂L/∂a₁ = (ŷ − y) · W₂ᵀ              shape: (128,) ✓

Step 4: ∂L/∂z₁. Since a₁ = ReLU(z₁), and ReLU'(z) = 1 if z > 0 else 0:

∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁)         elementwise multiplication

Step 5: ∂L/∂W₁ and ∂L/∂b₁. Since z₁ = x · W₁ + b₁:

∂L/∂W₁ = xᵀ · ∂L/∂z₁                shape: (784, 128) ✓
∂L/∂b₁ = ∂L/∂z₁                     shape: (128,) ✓

Photograph this entire derivation. Every step. Commit it to the repo as derivation.jpg or derivation.md.

Part 3-Implement and verify with numerical gradient check (60 min)

def cross_entropy(p, y_onehot, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=-1))

def backward(model, y_onehot):
    N = model.X.shape[0]
    dz2 = (model.p - y_onehot) / N            # (N, 10)
    dW2 = model.a1.T @ dz2                    # (128, 10)
    db2 = dz2.sum(axis=0)                     # (10,)
    da1 = dz2 @ model.W2.T                    # (N, 128)
    dz1 = da1 * (model.z1 > 0)                # ReLU'(z1)
    dW1 = model.X.T @ dz1                     # (784, 128)
    db1 = dz1.sum(axis=0)                     # (128,)
    return dW1, db1, dW2, db2

Numerical gradient check (the test that proves your math). For one weight Wᵢⱼ:

numerical_grad = (L(W + ε·eᵢⱼ) − L(W − ε·eᵢⱼ)) / (2·ε)

This should match ∂L/∂Wᵢⱼ from your backward pass to ~1e-6 relative error. Pick ε = 1e-5. Test 10 random weights from W1 and W2, as in the sketch below.
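
One way to write that check as a minimal sketch (it assumes the MLP, cross_entropy, and backward defined above; grad_check is a name introduced here for illustration, not part of the spec):

def grad_check(model, X, y_onehot, eps=1e-5, n_checks=10, seed=0):
    rng = np.random.default_rng(seed)
    model.forward(X)
    dW1, _, dW2, _ = backward(model, y_onehot)        # analytic gradients at the current weights
    analytic = {"W1": dW1, "W2": dW2}
    for name in ("W1", "W2"):
        W = getattr(model, name)
        for _ in range(n_checks):
            i, j = rng.integers(W.shape[0]), rng.integers(W.shape[1])
            orig = W[i, j]
            W[i, j] = orig + eps
            L_plus = cross_entropy(model.forward(X), y_onehot)
            W[i, j] = orig - eps
            L_minus = cross_entropy(model.forward(X), y_onehot)
            W[i, j] = orig                                        # restore the weight
            num = (L_plus - L_minus) / (2 * eps)                  # central difference
            ana = analytic[name][i, j]
            rel = abs(num - ana) / max(abs(num) + abs(ana), 1e-8)
            print(f"{name}[{i},{j}]  relative error {rel:.2e}")   # expect ~1e-6 or smaller

Run it on a small batch (for example, a few dozen MNIST examples with their one-hot labels) before starting the full training run.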

If the check fails, find the bug before continuing. This is the contract that says your derivation matches your code.

Common pitfalls in Session B

  • Forgetting to divide by N in the gradient. Cross-entropy averages over examples; gradient must too.
  • Wrong shape on backprop matmul. Always check shapes after every line. Transpose where shapes demand.
  • Skipping the gradient check. Without it, you'll silently train a broken model and not understand why.
  • Numerical issues in softmax. Subtract max(z) before exp.

Output of Session B

  • Working forward + backward in NumPy.
  • Numerical gradient check passing.
  • Hand derivation committed.

Session C-Train on MNIST, polish, ship

Goal: Train the MLP on MNIST. Reach >90% test accuracy. Polish notebook. Push.

Part 1-Load MNIST + training loop (45 min)

from tensorflow.keras.datasets import mnist  # or use sklearn.datasets.fetch_openml
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 784).astype(np.float32) / 255.0
X_test  = X_test.reshape(-1, 784).astype(np.float32) / 255.0

def one_hot(y, K=10):
    out = np.zeros((len(y), K)); out[np.arange(len(y)), y] = 1
    return out

y_train_oh = one_hot(y_train)
y_test_oh  = one_hot(y_test)

# Mini-batch SGD
model = MLP(np.random.default_rng(0))
lr = 0.1
batch = 64
n_epochs = 5
for epoch in range(n_epochs):
    perm = np.random.permutation(len(X_train))
    for i in range(0, len(X_train), batch):
        idx = perm[i:i+batch]
        xb, yb = X_train[idx], y_train_oh[idx]
        model.forward(xb)
        dW1, db1, dW2, db2 = backward(model, yb)
        model.W1 -= lr * dW1; model.b1 -= lr * db1
        model.W2 -= lr * dW2; model.b2 -= lr * db2
    p_test = model.forward(X_test)
    acc = (p_test.argmax(-1) == y_test).mean()
    print(f"epoch {epoch+1}: test acc = {acc:.4f}")

Expected: test accuracy reaches >90% within 5 epochs on CPU (~3 min).
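
Part 3 asks for training curves, but the loop above only prints accuracy. A minimal variant that also records per-epoch loss (a sketch reusing cross_entropy from Session B; history is a name introduced here):

# Same SGD loop as above, plus loss tracking for the training curves.
history = {"loss": [], "acc": []}
for epoch in range(n_epochs):
    perm = np.random.permutation(len(X_train))
    batch_losses = []
    for i in range(0, len(X_train), batch):
        idx = perm[i:i+batch]
        xb, yb = X_train[idx], y_train_oh[idx]
        batch_losses.append(cross_entropy(model.forward(xb), yb))
        dW1, db1, dW2, db2 = backward(model, yb)
        model.W1 -= lr * dW1; model.b1 -= lr * db1
        model.W2 -= lr * dW2; model.b2 -= lr * db2
    history["loss"].append(float(np.mean(batch_losses)))
    history["acc"].append(float((model.forward(X_test).argmax(-1) == y_test).mean()))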

Part 2-Visualize errors (45 min)

Display a 5×5 grid of misclassified test examples with true and predicted labels. Are they genuinely hard? (They usually are; 4s and 9s are a common confusion.)

import matplotlib.pyplot as plt

wrong = np.where(p_test.argmax(-1) != y_test)[0][:25]   # first 25 misclassified test examples
fig, axes = plt.subplots(5, 5, figsize=(8, 8))
for ax, i in zip(axes.flat, wrong):
    ax.imshow(X_test[i].reshape(28, 28), cmap='gray')
    ax.set_title(f"true={y_test[i]}, pred={p_test[i].argmax()}", fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()

Part 3-Polish, push, recall (60 min)

Notebook polish. Add markdown explaining:

  1. The architecture and the shape of every tensor.
  2. The backward derivation (link the photo).
  3. Why MLE → cross-entropy.
  4. The training curves (loss + accuracy by epoch).

Push. Add `03-mlp-numpy.ipynb` to the repo and update the README.

Recall test (no peeking, on paper):

  1. Write the 5-step backward derivation for the 2-layer MLP.
  2. Explain why cross-entropy is "natural" for classification.
  3. Why does the transpose W₂ᵀ show up when propagating the gradient backwards?

Output of Session C

  • 03-mlp-numpy.ipynb complete with training curves and error grid.
  • Repo pushed with three notebooks.
  • Recall test on paper-note any gaps for review.

End-of-week artifact

  • 03-mlp-numpy.ipynb runnable end-to-end
  • Test accuracy >90% on MNIST
  • Hand derivation of all 4 gradients in repo
  • Numerical gradient check passing
  • Misclassified-examples visualization

End-of-week self-assessment

  • I can implement softmax with numerical stability.
  • I can derive ∂L/∂z₂ = ŷ − y from cross-entropy + softmax.
  • I can write the 5 gradient terms for a 2-layer MLP from a blank page.
  • I can explain why minimizing cross-entropy = maximizing likelihood.
  • I can define KL divergence and explain why it's asymmetric.

Common failure modes for this week

  • Skipping the numerical gradient check. It's the only way to know your math is right.
  • Trying to memorize the gradients instead of deriving them. Memorization fades; derivation skill compounds.
  • Reaching for autograd "just to try". Not this week. The point is to feel the pain so PyTorch becomes a relief, not a black box.

What's next (preview of M01-W04)

PyTorch + first public blog post. You'll port your hand-built MLP to PyTorch, implement Karpathy's micrograd to deeply understand autograd, and publish "Backprop with no hand-waving", your first public artifact.
