
Month 1-Week 2: Calculus, the chain rule, and logistic regression

Week summary

  • Goal: Internalize partial derivatives and the chain rule. Implement logistic regression for binary classification, deriving binary cross-entropy by hand and observing its σ(z) − y gradient form.
  • Time: ~9–11 hours over 3 sessions.
  • Output: 02-logistic-regression.ipynb in ml-from-scratch - full derivation, training, decision boundary, confusion matrix.
  • Sequences relied on: 02-calculus rungs 02–04, 08; 06-classical-ml rung 03; 03-probability-statistics rungs 01–02.

Why this week matters in your AI expert journey

Cross-entropy is the loss that powers every LLM. Next-token prediction is multiclass classification. GPT, Claude, Llama-every one of them is trained by minimizing cross-entropy. Logistic regression is its simplest expression: binary classification with sigmoid output. Doing this derivation by hand once-and seeing the elegant σ(z) − y gradient drop out-is what makes "softmax + cross-entropy" feel inevitable instead of magical for the rest of your career.

The chain rule is what backpropagation is. Master it on simple cases now; transformer training is just the same rule chained deeper.

Prerequisites

  • M01-W01 complete (vectors, dot products, MSE gradient derivation).
  • Basic high-school calculus (derivatives of xⁿ, eˣ, log x).

Week schedule

  • Session A-Tue/Wed evening (~3 h): chain rule + gradients
  • Session B-Sat morning (~3–4 h): derive BCE gradient + implement
  • Session C-Sun afternoon (~2–3 h): metrics, polish, ship

Session A-Chain rule, partial derivatives, gradients

Goal: Be able to differentiate sin(x²), log(1+eˣ), and compute the gradient of a multivariable function without hesitation.

Pre-flight: M01-W01.

Part 1-Single-variable chain rule (45 min)

The chain rule for (f∘g)(x) = f(g(x)):

d/dx [f(g(x))] = f'(g(x)) · g'(x)
Differentiate the outer, leave the inner alone, multiply by the derivative of the inner.

Watch - 3Blue1Brown Essence of Calculus, Episode 4 ("Visualizing the chain rule and product rule")-~16 min.

Worked examples (do on paper)

  1. d/dx [sin(x²)] = cos(x²) · 2x
  2. d/dx [e^(2x+1)] = e^(2x+1) · 2
  3. d/dx [log(1 + eˣ)]: let u = 1 + eˣ. Then du/dx = eˣ and d/du [log u] = 1/u. Combined: eˣ / (1 + eˣ) = σ(x) - the sigmoid! Notice this; we'll use it tonight (and you can verify it numerically below).

Self-check (no calculator, on paper)

  1. d/dx [(x² + 1)³]
  2. d/dx [tanh(2x)]
  3. d/dx [σ(x)] where σ(x) = 1 / (1 + e⁻ˣ). Show that σ'(x) = σ(x)(1 − σ(x)).
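
If you want to sanity-check your paper answers, here is a minimal numerical sketch (assumes NumPy; h = 1e-5 is an arbitrary step size):

import numpy as np

def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)    # central finite difference

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = 0.7
# d/dx log(1 + eˣ) should equal σ(x)
print(numerical_derivative(lambda t: np.log(1 + np.exp(t)), x), sigmoid(x))
# σ'(x) should equal σ(x)·(1 − σ(x))
print(numerical_derivative(sigmoid, x), sigmoid(x) * (1 - sigmoid(x)))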

Part 2-Partial derivatives and gradients (60 min)

For f: ℝⁿ → ℝ, the partial derivative ∂f/∂xᵢ holds all other variables fixed. The gradient is the vector of all partials:

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

Geometric meaning: ∇f points in the direction of steepest ascent. So −∇f points to steepest descent-which is why gradient descent works.

Watch - Khan Academy "Multivariable Calculus → Partial derivatives intro" (~15 min) and "Gradient and directional derivatives" (~15 min).

Worked example: for f(x, y) = x² + 3xy + y²:

  • ∂f/∂x = 2x + 3y
  • ∂f/∂y = 3x + 2y
  • ∇f(1, 2) = [2·1 + 3·2, 3·1 + 2·2] = [8, 7]

Visualize in code

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 20); y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)
F = X**2 + 3*X*Y + Y**2
dFdx = 2*X + 3*Y
dFdy = 3*X + 2*Y
plt.contour(X, Y, F, levels=20)
plt.quiver(X, Y, dFdx, dFdy, scale=200)
plt.title('Gradient field of x² + 3xy + y²')
Observe: arrows point uphill; orthogonal to contour lines.

Part 3-Multivariable chain rule (the level you need) (60 min)

For composed functions y = f(g(x)) where g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵏ:

∂y/∂x = (∂y/∂g) · (∂g/∂x)        [matrix-chain form]
This is a chain of Jacobians. Evaluated from the loss backwards, these are vector-Jacobian products - exactly what .backward() does in PyTorch.
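
To make the matrix-chain form concrete, here is a minimal sketch with assumed toy functions (g(x) = A·x and f(u) = Σ uᵢ², neither from the lectures), checked against a finite difference:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))       # g: ℝ⁴ → ℝ³, g(x) = A·x
x = rng.normal(size=4)

g = A @ x                         # inner output, shape (3,)
y = np.sum(g**2)                  # outer f(g) = Σ gᵢ², a scalar

dy_dg = 2 * g                     # ∂y/∂g, shape (3,)
dg_dx = A                         # Jacobian ∂g/∂x, shape (3, 4)
grad_chain = dy_dg @ dg_dx        # (∂y/∂g) · (∂g/∂x), shape (4,)

h = 1e-6                          # finite-difference check of the same gradient
grad_fd = np.array([(np.sum((A @ (x + h * e))**2) - y) / h for e in np.eye(4)])
print(np.allclose(grad_chain, grad_fd, atol=1e-3))   # True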

A 2-layer MLP example (we'll use this Friday)

z₁ = W₁·x + b₁
a₁ = ReLU(z₁)
z₂ = W₂·a₁ + b₂
ŷ  = softmax(z₂)
L  = cross_entropy(ŷ, y)
To compute ∂L/∂W₁, you walk backwards:
∂L/∂z₂ = ŷ − y                      (we'll derive this in Session B)
∂L/∂a₁ = W₂ᵀ · ∂L/∂z₂
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁)
∂L/∂W₁ = ∂L/∂z₁ · xᵀ
The ⊙ is elementwise multiplication. The Wᵀ is what propagates the gradient backwards through a linear layer. Notice how W₂ shows up transposed in the gradient w.r.t. a₁ - that's the multivariable chain rule at work.
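
To see those four lines run, here is a minimal NumPy sketch with assumed shapes and random data (one example, one-hot label; not the Friday implementation, just a check of the chain rule):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))
y = np.array([[0.], [1.], [0.]])               # one-hot label, 3 classes
W1, b1 = rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = rng.normal(size=(3, 5)), np.zeros((3, 1))

def forward(W1):
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0)                     # ReLU
    z2 = W2 @ a1 + b2
    p = np.exp(z2 - z2.max()); p /= p.sum()    # softmax
    L = -np.sum(y * np.log(p))                 # cross-entropy
    return z1, a1, p, L

z1, a1, p, L = forward(W1)

# The four backward lines from above
dz2 = p - y                                    # ∂L/∂z₂
da1 = W2.T @ dz2                               # ∂L/∂a₁
dz1 = da1 * (z1 > 0)                           # ∂L/∂z₁ (⊙ ReLU')
dW1 = dz1 @ x.T                                # ∂L/∂W₁

# Finite-difference check on one entry of W₁
h = 1e-6
W1p = W1.copy(); W1p[0, 0] += h
print(dW1[0, 0], (forward(W1p)[-1] - L) / h)   # should agree closely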

Self-check

  1. Why does the transpose appear when propagating gradients backward through Wx?
  2. If you double the size of W₂, what happens to gradient magnitudes flowing into W₁?
  3. Define ReLU'(z). What value does it take when z < 0?

Common pitfalls in Session A

  • Treating partial derivatives as "the same as derivatives"-they hold other variables fixed.
  • Forgetting the g'(x) factor in the chain rule-leads to silent wrong gradients.
  • Confusing ∇f with (∂f/∂x)·x - the first is a vector, the second is a number.

Output of Session A

  • Notebook page with chain-rule derivations photographed/typeset.
  • Gradient field plot of x² + 3xy + y².

Session B-Logistic regression: BCE derivation and implementation

Goal: Derive binary cross-entropy's gradient by hand. Implement logistic regression. Train it to >95% accuracy on a 2-class problem. Visualize the decision boundary updating.

Pre-flight: Session A complete; you can compute σ'(x) = σ(x)(1−σ(x)).

Part 1-Sigmoid + BCE on paper (60 min)

Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ). Squashes ℝ → (0, 1) - a probability.

Binary cross-entropy (BCE) loss for one example:

L = −[y · log σ(z) + (1−y) · log(1 − σ(z))]
where y ∈ {0, 1} is the label and z = w·x + b is the model's pre-activation.

Why BCE? It's the negative log-likelihood under the assumption that y is Bernoulli with P(y=1|x) = σ(z). Maximizing likelihood = minimizing negative log-likelihood = minimizing BCE. So minimizing BCE is doing maximum likelihood. (Sequence 03 rung 06.)
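
Spelled out: the Bernoulli likelihood of one example is P(y | x) = σ(z)ʸ · (1 − σ(z))¹⁻ʸ, so

−log P(y | x) = −[y · log σ(z) + (1 − y) · log(1 − σ(z))] = L

which is exactly the BCE above.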

Derive dL/dz by hand. Using dσ/dz = σ(1−σ) and d(log σ)/dz = (1−σ):

dL/dz = −y · (1−σ) + (1−y) · σ
     = −y + y·σ + σ − y·σ
     = σ − y
     = σ(z) − y
This is the prize of the week. Sigmoid + BCE produces the elegant gradient σ(z) − y - just the prediction error. By the chain rule:
dL/dw = (σ(z) − y) · x
dL/db = (σ(z) − y)
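
Before moving to code, it's worth sanity-checking this result numerically. A minimal sketch (assumes NumPy; the toy values for w, b, x, y are arbitrary):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

w, b = np.array([0.5, -1.2]), 0.3          # arbitrary toy parameters
x, y = np.array([2.0, 1.0]), 1.0           # one example with label y = 1

def bce(w):
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

analytic = (sigmoid(w @ x + b) - y) * x    # dL/dw = (σ(z) − y) · x
h = 1e-6
numeric = np.array([(bce(w + h * e) - bce(w)) / h for e in np.eye(2)])
print(analytic, numeric)                   # the two should agree closely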

Why this is beautiful and why it generalizes. The same elegance shows up for softmax + categorical cross-entropy: dL/dz = softmax(z) − y_onehot. This is the loss for LLMs. Recognize the form when it appears.

Photograph this derivation. Commit it to the repo.

Part 2-Implementation on a 2D toy problem (75 min)

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 200
# Two Gaussian blobs
X0 = rng.normal([-2, -2], 1, size=(N//2, 2))
X1 = rng.normal([ 2,  2], 1, size=(N//2, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(N//2), np.ones(N//2)])

def sigmoid(z): return 1 / (1 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1
losses = []
for step in range(2000):
    z = X @ w + b
    p = sigmoid(z)
    eps = 1e-9    # avoid log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    losses.append(loss)

    error = p - y                  # this is the gradient!
    grad_w = X.T @ error / N
    grad_b = error.mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(f"final loss={losses[-1]:.4f}, w={w}, b={b:.4f}")

# Decision boundary
xx, yy = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
preds = sigmoid(grid @ w + b).reshape(xx.shape)
plt.contourf(xx, yy, preds, levels=20, cmap='RdBu_r', alpha=0.6)
plt.scatter(X[:,0], X[:,1], c=y, cmap='RdBu', edgecolor='k')
plt.title('Logistic regression decision boundary')

Verify: accuracy >95%. Boundary is a line between the two blobs.
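
To check the >95% claim in code, a minimal addition (assumes the loop above; thresholding at 0.5 is the usual choice):

pred = (sigmoid(X @ w + b) >= 0.5).astype(float)   # threshold probabilities at 0.5
accuracy = (pred == y).mean()
print(f"train accuracy = {accuracy:.3f}")          # expect > 0.95 for these blobs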

Part 3-Animation of training (45 min)

Modify the loop to save weights every 50 iterations. Plot the decision boundary at iterations 0, 100, 500, 2000 in a 2×2 grid. Watch the line rotate and translate into place.

This visualization is what makes "the model is learning" tactile.
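
A minimal sketch of that snapshot grid, assuming the variables (X, y, lr, N, sigmoid, grid, xx, yy) from Part 2; for brevity it saves only at the four plotted iterations:

snapshots = {}
w = np.zeros(2); b = 0.0
for step in range(2001):
    if step in (0, 100, 500, 2000):
        snapshots[step] = (w.copy(), b)            # save parameters before this update
    p = sigmoid(X @ w + b)
    error = p - y
    w -= lr * (X.T @ error / N)
    b -= lr * error.mean()

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (step, (w_s, b_s)) in zip(axes.ravel(), snapshots.items()):
    preds = sigmoid(grid @ w_s + b_s).reshape(xx.shape)
    ax.contourf(xx, yy, preds, levels=np.linspace(0, 1, 21), cmap='RdBu_r', alpha=0.6)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k', s=10)
    ax.set_title(f'iteration {step}')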

Common pitfalls in Session B

  • Forgetting the eps in log(p). If p ever hits 0 or 1, you get −inf and gradients explode silently.
  • Confusing σ(z) − y with y − σ(z) (sign flip). If loss increases, check this.
  • Not normalizing by N - gradients become dataset-size-dependent.

Output of Session B

  • 02-logistic-regression.ipynb in repo: derivation, training loop, decision boundary, animation grid.

Session C-Metrics from scratch, ship, retro

Goal: Implement confusion matrix, precision, recall, F1 from scratch. Polish notebook. Push.

Pre-flight: Sessions A and B complete.

Part 1-Classification metrics (60 min)

Definitions for binary classification:

  • TP (true positive): pred=1, label=1
  • TN (true negative): pred=0, label=0
  • FP (false positive): pred=1, label=0
  • FN (false negative): pred=0, label=1

Then:

  • Accuracy = (TP + TN) / N
  • Precision = TP / (TP + FP) - of those we said are positive, how many really are?
  • Recall = TP / (TP + FN) - of all the actual positives, how many did we catch?
  • F1 = 2·P·R / (P + R) - the harmonic mean of precision and recall.

Why each exists. Imagine a fraud-detection problem: 99% of transactions are clean. Predicting "all clean" gives 99% accuracy but 0 recall on fraud. Accuracy is misleading; recall isn't.
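
Worked numbers (illustrative): with 1,000 transactions and 10 frauds, the "all clean" classifier has TP = 0, FN = 10, TN = 990, FP = 0 - so accuracy = 990/1000 = 0.99, while recall = 0/(0 + 10) = 0 and precision is undefined (no positive predictions at all).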

Implement from scratch (no sklearn)

def confusion_matrix(y_true, y_pred):
    TP = ((y_pred == 1) & (y_true == 1)).sum()
    TN = ((y_pred == 0) & (y_true == 0)).sum()
    FP = ((y_pred == 1) & (y_true == 0)).sum()
    FN = ((y_pred == 0) & (y_true == 1)).sum()
    return TP, TN, FP, FN

def precision_recall_f1(y_true, y_pred):
    TP, TN, FP, FN = confusion_matrix(y_true, y_pred)
    precision = TP / (TP + FP) if (TP + FP) else 0
    recall    = TP / (TP + FN) if (TP + FN) else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    return precision, recall, f1

Apply to your model's predictions on a held-out test set. Report all metrics.
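
A minimal sketch of that evaluation, assuming the arrays (X, y), helpers (sigmoid, precision_recall_f1), and rng from earlier in the notebook; the 80/20 split and retraining on the split are illustrative choices:

idx = rng.permutation(N)                   # shuffle, then hold out 20% as test
train, test = idx[:160], idx[160:]

w = np.zeros(2); b = 0.0                   # retrain on the training split only
for step in range(2000):
    p = sigmoid(X[train] @ w + b)
    error = p - y[train]
    w -= lr * (X[train].T @ error / len(train))
    b -= lr * error.mean()

y_pred = (sigmoid(X[test] @ w + b) >= 0.5).astype(float)
precision, recall, f1 = precision_recall_f1(y[test], y_pred)
accuracy = (y_pred == y[test]).mean()
print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")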

Part 2-Notebook polish + push (60 min)

  • Add markdown sections explaining each block.
  • Embed the BCE derivation photo.
  • Add a "Why this matters" closing paragraph: "Cross-entropy is what's underneath every LLM. We'll meet it again in week 3 (multiclass) and every week after."
  • Push to repo. Update README to link the new notebook.

Part 3-Recall + forward look (45 min)

Recall test (no peeking). On a fresh page, write:

  1. The BCE loss (one example).
  2. The chain-rule derivation showing dL/dz = σ(z) − y.
  3. Why this gradient is "elegant."

If gaps emerge, re-read your derivation. Recall is the test of comprehension.

Forward-look: Read M01-W03.md. Pre-watch the first ~15 minutes of Karpathy's micrograd lecture (it's 2+ hours; you'll do the rest in W04).

Output of Session C

  • Polished notebook with metrics implemented from scratch.
  • Repo pushed with two notebooks.
  • Updated LEARNING_LOG.md.

End-of-week artifact

  • 02-logistic-regression.ipynb complete in repo
  • BCE gradient derivation embedded
  • Decision boundary visualization showing convergence
  • Precision/recall/F1 implemented from scratch

End-of-week self-assessment

  • I can differentiate log(1 + eˣ) without notes.
  • I can derive dL/dz = σ(z) − y on a blank page.
  • I can explain why F1 is preferred over accuracy in imbalanced classification.
  • I can sketch the multivariable chain rule for a 2-layer network.
  • My logistic regression converges to >95% accuracy.

Common failure modes for this week

  • "I'll just trust the σ(z)−y form." No. Derive it. The act of derivation is what builds the intuition that pays off in transformers.
  • Skipping the metrics implementation in favor of sklearn.metrics. Just once, by hand. After that, use the library.
  • Reading without writing. Recall, not re-reading, is what cements knowledge.

What's next (preview of M01-W03)

Probability foundations + a 2-layer MLP forward AND backward pass implemented entirely in NumPy with no autograd. This is the week backpropagation truly clicks.
