
Month 1-Week 2: Calculus, the chain rule, and logistic regression

Week summary

  • Goal: Internalize partial derivatives and the chain rule. Implement logistic regression for binary classification, deriving binary cross-entropy by hand and observing its σ(z) − y gradient form.
  • Time: ~9–11 hours over 3 sessions.
  • Output: 02-logistic-regression.ipynb in ml-from-scratch - full derivation, training, decision boundary, confusion matrix.
  • Sequences relied on: 02-calculus rungs 02–04, 08; 06-classical-ml rung 03; 03-probability-statistics rungs 01–02.

Why this week matters in your AI expert journey

Cross-entropy is the loss that powers every LLM. Next-token prediction is multiclass classification. GPT, Claude, Llama-every one of them is trained by minimizing cross-entropy. Logistic regression is its simplest expression: binary classification with sigmoid output. Doing this derivation by hand once-and seeing the elegant σ(z) − y gradient drop out-is what makes "softmax + cross-entropy" feel inevitable instead of magical for the rest of your career.

The chain rule is what backpropagation is. Master it on simple cases now; transformer training is just the same rule chained deeper.

Prerequisites

  • M01-W01 complete (vectors, dot products, MSE gradient derivation).
  • Basic high-school calculus (derivatives of xⁿ, eˣ, log x).

Week schedule

  • Session A-Tue/Wed evening (~3 h): chain rule + gradients
  • Session B-Sat morning (~3–4 h): derive BCE gradient + implement
  • Session C-Sun afternoon (~2–3 h): metrics, polish, ship

Session A-Chain rule, partial derivatives, gradients

Goal: Be able to differentiate sin(x²), log(1+eˣ), and compute the gradient of a multivariable function without hesitation.

Pre-flight: M01-W01.

Part 1-Single-variable chain rule (45 min)

The chain rule for (f∘g)(x) = f(g(x)):

d/dx [f(g(x))] = f'(g(x)) · g'(x)
Differentiate the outer, leave the inner alone, multiply by the derivative of the inner.

Watch - 3Blue1Brown Essence of Calculus, Episode 4 ("Visualizing the chain rule and product rule")-~16 min.

Worked examples (do on paper)

  1. d/dx [sin(x²)] = cos(x²) · 2x
  2. d/dx [e^(2x+1)] = e^(2x+1) · 2
  3. d/dx [log(1 + eˣ)]: let u = 1 + eˣ. Then du/dx = eˣ and d/du [log u] = 1/u. Combined: eˣ / (1 + eˣ) = σ(x) - the sigmoid! Notice this; we'll use it tonight (and you can verify it numerically below).

Self-check (no calculator, on paper)

  1. d/dx [(x² + 1)³]
  2. d/dx [tanh(2x)]
  3. d/dx [σ(x)] where σ(x) = 1 / (1 + e⁻ˣ). Show that σ'(x) = σ(x)(1 − σ(x)).
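
If you want to sanity-check your paper answers, here is a minimal numerical sketch (assumes NumPy; h = 1e-5 is an arbitrary step size):

import numpy as np

def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)    # central finite difference

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = 0.7
# d/dx log(1 + eˣ) should equal σ(x)
print(numerical_derivative(lambda t: np.log(1 + np.exp(t)), x), sigmoid(x))
# σ'(x) should equal σ(x)·(1 − σ(x))
print(numerical_derivative(sigmoid, x), sigmoid(x) * (1 - sigmoid(x)))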

Part 2-Partial derivatives and gradients (60 min)

For f: ℝⁿ → ℝ, the partial derivative ∂f/∂xᵢ holds all other variables fixed. The gradient is the vector of all partials:

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

Geometric meaning: ∇f points in the direction of steepest ascent. So −∇f points to steepest descent-which is why gradient descent works.

Watch - Khan Academy "Multivariable Calculus → Partial derivatives intro" (~15 min) and "Gradient and directional derivatives" (~15 min).

Worked example: for f(x, y) = x² + 3xy + y²:

  • ∂f/∂x = 2x + 3y
  • ∂f/∂y = 3x + 2y
  • ∇f(1, 2) = [2·1 + 3·2, 3·1 + 2·2] = [8, 7]

Visualize in code

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 20); y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)
F = X**2 + 3*X*Y + Y**2
dFdx = 2*X + 3*Y
dFdy = 3*X + 2*Y
plt.contour(X, Y, F, levels=20)
plt.quiver(X, Y, dFdx, dFdy, scale=200)
plt.title('Gradient field of x² + 3xy + y²')
Observe: arrows point uphill; orthogonal to contour lines.

Part 3-Multivariable chain rule (the level you need) (60 min)

For composed functions y = f(g(x)) where g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵏ:

∂y/∂x = (∂y/∂g) · (∂g/∂x)        [matrix-chain form]
This is a chain of Jacobians. Evaluated from the loss backwards, these are vector-Jacobian products - exactly what .backward() does in PyTorch.
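
To make the matrix-chain form concrete, here is a minimal sketch with assumed toy functions (g(x) = A·x and f(u) = Σ uᵢ², neither from the lectures), checked against a finite difference:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))       # g: ℝ⁴ → ℝ³, g(x) = A·x
x = rng.normal(size=4)

g = A @ x                         # inner output, shape (3,)
y = np.sum(g**2)                  # outer f(g) = Σ gᵢ², a scalar

dy_dg = 2 * g                     # ∂y/∂g, shape (3,)
dg_dx = A                         # Jacobian ∂g/∂x, shape (3, 4)
grad_chain = dy_dg @ dg_dx        # (∂y/∂g) · (∂g/∂x), shape (4,)

h = 1e-6                          # finite-difference check of the same gradient
grad_fd = np.array([(np.sum((A @ (x + h * e))**2) - y) / h for e in np.eye(4)])
print(np.allclose(grad_chain, grad_fd, atol=1e-3))   # True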

A 2-layer MLP example (we'll use this Friday)

z₁ = W₁·x + b₁
a₁ = ReLU(z₁)
z₂ = W₂·a₁ + b₂
ŷ  = softmax(z₂)
L  = cross_entropy(ŷ, y)
To compute ∂L/∂W₁, you walk backwards:
∂L/∂z₂ = ŷ − y                      (we'll derive this in Session B)
∂L/∂a₁ = W₂ᵀ · ∂L/∂z₂
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁)
∂L/∂W₁ = ∂L/∂z₁ · xᵀ
The ⊙ is elementwise multiplication. The Wᵀ is what propagates the gradient backwards through a linear layer. Notice how W₂ shows up transposed in the gradient w.r.t. a₁ - that's the multivariable chain rule at work.
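
To see those four lines run, here is a minimal NumPy sketch with assumed shapes and random data (one example, one-hot label; not the Friday implementation, just a check of the chain rule):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))
y = np.array([[0.], [1.], [0.]])               # one-hot label, 3 classes
W1, b1 = rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = rng.normal(size=(3, 5)), np.zeros((3, 1))

def forward(W1):
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0)                     # ReLU
    z2 = W2 @ a1 + b2
    p = np.exp(z2 - z2.max()); p /= p.sum()    # softmax
    L = -np.sum(y * np.log(p))                 # cross-entropy
    return z1, a1, p, L

z1, a1, p, L = forward(W1)

# The four backward lines from above
dz2 = p - y                                    # ∂L/∂z₂
da1 = W2.T @ dz2                               # ∂L/∂a₁
dz1 = da1 * (z1 > 0)                           # ∂L/∂z₁ (⊙ ReLU')
dW1 = dz1 @ x.T                                # ∂L/∂W₁

# Finite-difference check on one entry of W₁
h = 1e-6
W1p = W1.copy(); W1p[0, 0] += h
print(dW1[0, 0], (forward(W1p)[-1] - L) / h)   # should agree closely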

Self-check

  1. Why does the transpose appear when propagating gradients backward through Wx?
  2. If you double the size of W₂, what happens to gradient magnitudes flowing into W₁?
  3. Define ReLU'(z). What value does it take when z < 0?

Common pitfalls in Session A

  • Treating partial derivatives as "the same as derivatives"-they hold other variables fixed.
  • Forgetting the g'(x) factor in the chain rule-leads to silent wrong gradients.
  • Confusing ∇f with (∂f/∂x)·x - the first is a vector, the second is a number.

Output of Session A

  • Notebook page with chain-rule derivations photographed/typeset.
  • Gradient field plot of x² + 3xy + y².

Session B-Logistic regression: BCE derivation and implementation

Goal: Derive binary cross-entropy's gradient by hand. Implement logistic regression. Train it to >95% accuracy on a 2-class problem. Visualize the decision boundary updating.

Pre-flight: Session A complete; you can compute σ'(x) = σ(x)(1−σ(x)).

Part 1-Sigmoid + BCE on paper (60 min)

Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ). Squashes ℝ → (0, 1) - a probability.

Binary cross-entropy (BCE) loss for one example:

L = −[y · log σ(z) + (1−y) · log(1 − σ(z))]
where y ∈ {0, 1} is the label and z = w·x + b is the model's pre-activation.

Why BCE? It's the negative log-likelihood under the assumption that y is Bernoulli with P(y=1|x) = σ(z). Maximizing likelihood = minimizing negative log-likelihood = minimizing BCE. So minimizing BCE is doing maximum likelihood. (Sequence 03 rung 06.)
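
Spelled out: the Bernoulli likelihood of one example is P(y | x) = σ(z)ʸ · (1 − σ(z))¹⁻ʸ, so

−log P(y | x) = −[y · log σ(z) + (1 − y) · log(1 − σ(z))] = L

which is exactly the BCE above.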

Derive dL/dz by hand. Using dσ/dz = σ(1−σ) and d(log σ)/dz = (1−σ):

dL/dz = −y · (1−σ) + (1−y) · σ
     = −y + y·σ + σ − y·σ
     = σ − y
     = σ(z) − y
This is the prize of the week. Sigmoid + BCE produces the elegant gradient σ(z) − y - just the prediction error. By the chain rule:
dL/dw = (σ(z) − y) · x
dL/db = (σ(z) − y)
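
Before moving to code, it's worth sanity-checking this result numerically. A minimal sketch (assumes NumPy; the toy values for w, b, x, y are arbitrary):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

w, b = np.array([0.5, -1.2]), 0.3          # arbitrary toy parameters
x, y = np.array([2.0, 1.0]), 1.0           # one example with label y = 1

def bce(w):
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

analytic = (sigmoid(w @ x + b) - y) * x    # dL/dw = (σ(z) − y) · x
h = 1e-6
numeric = np.array([(bce(w + h * e) - bce(w)) / h for e in np.eye(2)])
print(analytic, numeric)                   # the two should agree closely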

Why this is beautiful and why it generalizes. The same elegance shows up for softmax + categorical cross-entropy: dL/dz = softmax(z) − y_onehot. This is the loss for LLMs. Recognize the form when it appears.

Photograph this derivation. Commit it to the repo.

Part 2-Implementation on a 2D toy problem (75 min)

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 200
# Two Gaussian blobs
X0 = rng.normal([-2, -2], 1, size=(N//2, 2))
X1 = rng.normal([ 2,  2], 1, size=(N//2, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(N//2), np.ones(N//2)])

def sigmoid(z): return 1 / (1 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1
losses = []
for step in range(2000):
    z = X @ w + b
    p = sigmoid(z)
    eps = 1e-9    # avoid log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    losses.append(loss)

    error = p - y                  # this is the gradient!
    grad_w = X.T @ error / N
    grad_b = error.mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(f"final loss={losses[-1]:.4f}, w={w}, b={b:.4f}")

# Decision boundary
xx, yy = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
preds = sigmoid(grid @ w + b).reshape(xx.shape)
plt.contourf(xx, yy, preds, levels=20, cmap='RdBu_r', alpha=0.6)
plt.scatter(X[:,0], X[:,1], c=y, cmap='RdBu', edgecolor='k')
plt.title('Logistic regression decision boundary')

Verify: accuracy >95%. Boundary is a line between the two blobs.
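
To check the >95% claim in code, a minimal addition (assumes the loop above; thresholding at 0.5 is the usual choice):

pred = (sigmoid(X @ w + b) >= 0.5).astype(float)   # threshold probabilities at 0.5
accuracy = (pred == y).mean()
print(f"train accuracy = {accuracy:.3f}")          # expect > 0.95 for these blobs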

Part 3-Animation of training (45 min)

Modify the loop to save weights every 50 iterations. Plot the decision boundary at iterations 0, 100, 500, 2000 in a 2×2 grid. Watch the line rotate and translate into place.

This visualization is what makes "the model is learning" tactile.
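
A minimal sketch of that snapshot grid, assuming the variables (X, y, lr, N, sigmoid, grid, xx, yy) from Part 2; for brevity it saves only at the four plotted iterations:

snapshots = {}
w = np.zeros(2); b = 0.0
for step in range(2001):
    if step in (0, 100, 500, 2000):
        snapshots[step] = (w.copy(), b)            # save parameters before this update
    p = sigmoid(X @ w + b)
    error = p - y
    w -= lr * (X.T @ error / N)
    b -= lr * error.mean()

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (step, (w_s, b_s)) in zip(axes.ravel(), snapshots.items()):
    preds = sigmoid(grid @ w_s + b_s).reshape(xx.shape)
    ax.contourf(xx, yy, preds, levels=np.linspace(0, 1, 21), cmap='RdBu_r', alpha=0.6)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k', s=10)
    ax.set_title(f'iteration {step}')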

Common pitfalls in Session B

  • Forgetting the eps in log(p). If p ever hits 0 or 1, you get −inf and gradients explode silently.
  • Confusing σ(z) − y with y − σ(z) (sign flip). If loss increases, check this.
  • Not normalizing by N - gradients become dataset-size-dependent.

Output of Session B

  • 02-logistic-regression.ipynb in repo: derivation, training loop, decision boundary, animation grid.

Session C-Metrics from scratch, ship, retro

Goal: Implement confusion matrix, precision, recall, F1 from scratch. Polish notebook. Push.

Pre-flight: Sessions A and B complete.

Part 1-Classification metrics (60 min)

Definitions for binary classification:

  • TP (true positive): pred=1, label=1
  • TN (true negative): pred=0, label=0
  • FP (false positive): pred=1, label=0
  • FN (false negative): pred=0, label=1

Then:

  • Accuracy = (TP + TN) / N
  • Precision = TP / (TP + FP) - of those we said are positive, how many really are?
  • Recall = TP / (TP + FN) - of all the actual positives, how many did we catch?
  • F1 = 2·P·R / (P + R) - the harmonic mean of precision and recall.

Why each exists. Imagine a fraud-detection problem: 99% of transactions are clean. Predicting "all clean" gives 99% accuracy but 0 recall on fraud. Accuracy is misleading; recall isn't.
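
Worked numbers (illustrative): with 1,000 transactions and 10 frauds, the "all clean" classifier has TP = 0, FN = 10, TN = 990, FP = 0 - so accuracy = 990/1000 = 0.99, while recall = 0/(0 + 10) = 0 and precision is undefined (no positive predictions at all).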

Implement from scratch (no sklearn)

def confusion_matrix(y_true, y_pred):
    TP = ((y_pred == 1) & (y_true == 1)).sum()
    TN = ((y_pred == 0) & (y_true == 0)).sum()
    FP = ((y_pred == 1) & (y_true == 0)).sum()
    FN = ((y_pred == 0) & (y_true == 1)).sum()
    return TP, TN, FP, FN

def precision_recall_f1(y_true, y_pred):
    TP, TN, FP, FN = confusion_matrix(y_true, y_pred)
    precision = TP / (TP + FP) if (TP + FP) else 0
    recall    = TP / (TP + FN) if (TP + FN) else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    return precision, recall, f1

Apply to your model's predictions on a held-out test set. Report all metrics.
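
A minimal sketch of that evaluation, assuming the arrays (X, y), helpers (sigmoid, precision_recall_f1), and rng from earlier in the notebook; the 80/20 split and retraining on the split are illustrative choices:

idx = rng.permutation(N)                   # shuffle, then hold out 20% as test
train, test = idx[:160], idx[160:]

w = np.zeros(2); b = 0.0                   # retrain on the training split only
for step in range(2000):
    p = sigmoid(X[train] @ w + b)
    error = p - y[train]
    w -= lr * (X[train].T @ error / len(train))
    b -= lr * error.mean()

y_pred = (sigmoid(X[test] @ w + b) >= 0.5).astype(float)
precision, recall, f1 = precision_recall_f1(y[test], y_pred)
accuracy = (y_pred == y[test]).mean()
print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")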

Part 2-Notebook polish + push (60 min)

  • Add markdown sections explaining each block.
  • Embed the BCE derivation photo.
  • Add a "Why this matters" closing paragraph: "Cross-entropy is what's underneath every LLM. We'll meet it again in week 3 (multiclass) and every week after."
  • Push to repo. Update README to link the new notebook.

Part 3-Recall + forward look (45 min)

Recall test (no peeking). On a fresh page, write:

  1. The BCE loss (one example).
  2. The chain-rule derivation showing dL/dz = σ(z) − y.
  3. Why this gradient is "elegant."

If gaps emerge, re-read your derivation. Recall is the test of comprehension.

Forward-look: Read M01-W03.md. Pre-watch the first ~15 minutes of Karpathy's micrograd lecture (it's 2+ hours; you'll do the rest in W04).

Output of Session C

  • Polished notebook with metrics implemented from scratch.
  • Repo pushed with two notebooks.
  • Updated LEARNING_LOG.md.

End-of-week artifact

  • 02-logistic-regression.ipynb complete in repo
  • BCE gradient derivation embedded
  • Decision boundary visualization showing convergence
  • Precision/recall/F1 implemented from scratch

End-of-week self-assessment

  • I can differentiate log(1 + eˣ) without notes.
  • I can derive dL/dz = σ(z) − y on a blank page.
  • I can explain why F1 is preferred over accuracy in imbalanced classification.
  • I can sketch the multivariable chain rule for a 2-layer network.
  • My logistic regression converges to >95% accuracy.

Common failure modes for this week

  • "I'll just trust the σ(z)−y form." No. Derive it. The act of derivation is what builds the intuition that pays off in transformers.
  • Skipping the metrics implementation in favor of sklearn.metrics. Just once, by hand. After that, use the library.
  • Reading without writing. Recall, not re-reading, is what cements knowledge.

What's next (preview of M01-W03)

Probability foundations + a 2-layer MLP forward AND backward pass implemented entirely in NumPy with no autograd. This is the week backpropagation truly clicks.
