Month 1-Week 2: Calculus, the chain rule, and logistic regression¶
Week summary¶
- Goal: Internalize partial derivatives and the chain rule. Implement logistic regression for binary classification, deriving binary cross-entropy by hand and observing its σ(z) − y gradient form.
- Time: ~9–11 hours over 3 sessions.
- Output: `02-logistic-regression.ipynb` in `ml-from-scratch`: full derivation, training, decision boundary, confusion matrix.
- Sequences relied on: 02-calculus rungs 02–04, 08; 06-classical-ml rung 03; 03-probability-statistics rungs 01–02.
Why this week matters in your AI expert journey¶
Cross-entropy is the loss that powers every LLM. Next-token prediction is multiclass classification. GPT, Claude, Llama-every one of them is trained by minimizing cross-entropy. Logistic regression is its simplest expression: binary classification with sigmoid output. Doing this derivation by hand once-and seeing the elegant σ(z) − y gradient drop out-is what makes "softmax + cross-entropy" feel inevitable instead of magical for the rest of your career.
The chain rule is what backpropagation is. Master it on simple cases now; transformer training is just the same rule chained deeper.
Prerequisites¶
- M01-W01 complete (vectors, dot products, MSE gradient derivation).
- Basic high-school calculus (derivatives of xⁿ, eˣ, log x).
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): chain rule + gradients
- Session B-Sat morning (~3–4 h): derive BCE gradient + implement
- Session C-Sun afternoon (~2–3 h): metrics, polish, ship
Session A-Chain rule, partial derivatives, gradients¶
Goal: Be able to differentiate sin(x²) and log(1 + eˣ), and compute the gradient of a multivariable function, without hesitation.
Pre-flight: M01-W01.
Part 1-Single-variable chain rule (45 min)¶
The chain rule for (f∘g)(x) = f(g(x)): d/dx [f(g(x))] = f'(g(x)) · g'(x).
Watch - 3Blue1Brown Essence of Calculus, Episode 4 ("Visualizing the chain rule and product rule")-~16 min.
Worked examples (do on paper)
1. d/dx [sin(x²)] = cos(x²) · 2x
2. d/dx [e^(2x+1)] = e^(2x+1) · 2
3. d/dx [log(1 + eˣ)]
- Let u = 1 + eˣ. Then du/dx = eˣ.
- d/du [log u] = 1/u.
- Combined: d/dx [log(1 + eˣ)] = eˣ / (1 + eˣ) = σ(x), the sigmoid! Notice this; we'll use it tonight.
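If you want to double-check the pencil work, here is a quick finite-difference sanity check (a minimal throwaway sketch, not part of the notebook) that the derivative of log(1 + eˣ) really is σ(x):

```python
import numpy as np

def softplus(x):
    return np.log(1 + np.exp(x))   # log(1 + e^x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 9)
h = 1e-5
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)   # central difference
print(np.max(np.abs(numeric - sigmoid(x))))               # ~1e-10: the two forms agree
```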
Self-check (no calculator, on paper)
1. d/dx [(x² + 1)^3]
2. d/dx [tanh(2x)]
3. d/dx [σ(x)] where σ(x) = 1 / (1 + e⁻ˣ). Show that σ'(x) = σ(x)(1 − σ(x)).
Part 2-Partial derivatives and gradients (60 min)¶
For f: ℝⁿ → ℝ, the partial derivative ∂f/∂xᵢ holds all other variables fixed. The gradient is the vector of all partials: ∇f = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ].
Geometric meaning: ∇f points in the direction of steepest ascent. So −∇f points to steepest descent-which is why gradient descent works.
Watch - Khan Academy "Multivariable Calculus → Partial derivatives intro" (~15 min) and "Gradient and directional derivatives" (~15 min).
Worked example
For f(x, y) = x² + 3xy + y²:
- ∂f/∂x = 2x + 3y
- ∂f/∂y = 3x + 2y
- ∇f(1, 2) = [2·1 + 3·2, 3·1 + 2·2] = [8, 7]
Visualize in code
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 20); y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)
F = X**2 + 3*X*Y + Y**2
dFdx = 2*X + 3*Y
dFdy = 3*X + 2*Y

plt.contour(X, Y, F, levels=20)
plt.quiver(X, Y, dFdx, dFdy, scale=200)
plt.title('Gradient field of x² + 3xy + y²')
```
Part 3-Multivariable chain rule (the level you need) (60 min)¶
For composed functions y = f(g(x)) where g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵏ, the Jacobians multiply:
∂y/∂x = (∂f/∂g) · (∂g/∂x)
Chaining these Jacobian products backwards through every layer is exactly what `.backward()` does in PyTorch.
A 2-layer MLP example (we'll implement this fully in M01-W03)
With forward pass z₁ = W₁x, a₁ = ReLU(z₁), z₂ = W₂a₁, ŷ = σ(z₂), to compute ∂L/∂W₁ you walk backwards:
∂L/∂z₂ = ŷ − y (we'll derive this in Session B)
∂L/∂a₁ = W₂ᵀ · ∂L/∂z₂
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁)
∂L/∂W₁ = ∂L/∂z₁ · xᵀ
⊙ is elementwise multiplication. The Wᵀ is what propagates the gradient backwards through a linear layer. Notice how W₂ shows up transposed in the gradient w.r.t. a₁; that's the multivariable chain rule at work.
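To make the shapes concrete, here is a minimal NumPy sketch of that backward pass (the layer sizes and variable names are illustrative assumptions, not from the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: x ∈ ℝ⁴, a hidden layer of 3 units, one sigmoid output
x = rng.normal(size=(4, 1))
y = np.array([[1.0]])
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(1, 3))

# Forward pass: z₁ = W₁x, a₁ = ReLU(z₁), z₂ = W₂a₁, ŷ = σ(z₂)
z1 = W1 @ x
a1 = np.maximum(z1, 0)
z2 = W2 @ a1
y_hat = 1 / (1 + np.exp(-z2))

# Backward pass, line by line from the equations above
dL_dz2 = y_hat - y              # ŷ − y
dL_da1 = W2.T @ dL_dz2          # the transpose propagates the gradient back
dL_dz1 = dL_da1 * (z1 > 0)      # ⊙ ReLU'(z₁)
dL_dW1 = dL_dz1 @ x.T           # outer product with the input

print(dL_dW1.shape)             # (3, 4), the same shape as W1
```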
Self-check
1. Why does the transpose appear when propagating gradients backward through Wx?
2. If you double the size of W₂, what happens to gradient magnitudes flowing into W₁?
3. Define ReLU'(z). What value does it take when z < 0?
Common pitfalls in Session A¶
- Treating partial derivatives as "the same as derivatives"-they hold other variables fixed.
- Forgetting the g'(x) factor in the chain rule; this leads to silently wrong gradients.
- Confusing ∇f with a single partial ∂f/∂xᵢ: the first is a vector, the second is a number.
Output of Session A¶
- Notebook page with chain-rule derivations photographed/typeset.
- Gradient field plot of x² + 3xy + y².
Session B-Logistic regression: BCE derivation and implementation¶
Goal: Derive binary cross-entropy's gradient by hand. Implement logistic regression. Train it to >95% on a 2-class problem. Visualize the decision boundary updating.
Pre-flight: Session A complete; you can compute σ'(x) = σ(x)(1−σ(x)).
Part 1-Sigmoid + BCE on paper (60 min)¶
Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ). Squashes ℝ → (0, 1) - a probability.
**Binary cross-entropy** (BCE) loss for one example:
L = −[y · log σ(z) + (1 − y) · log(1 − σ(z))]
where y ∈ {0, 1} is the label and z = w·x + b is the model's pre-activation.
Why BCE? It's the negative log-likelihood under the assumption that y is Bernoulli with P(y=1|x) = σ(z). Maximizing likelihood = minimizing negative log-likelihood = minimizing BCE. So minimizing BCE is doing maximum likelihood. (Sequence 03 rung 06.)
Derive dL/dz by hand. Using dσ/dz = σ(1−σ) and d(log σ)/dz = (1−σ):
dL/dz = −[y(1 − σ(z)) − (1 − y)σ(z)] = σ(z) − y
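A quick numerical cross-check (a throwaway sketch, not part of the notebook): the analytic gradient σ(z) − y should match a finite-difference estimate of the BCE loss.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(z, y):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, h = 0.7, 1.0, 1e-6
numeric = (bce(z + h, y) - bce(z - h, y)) / (2 * h)   # central difference
analytic = sigmoid(z) - y                             # σ(z) − y from the derivation
print(numeric, analytic)                              # should agree to ~6 decimals
```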
Why this is beautiful and why it generalizes
The same elegance shows up for softmax + categorical cross-entropy: dL/dz = softmax(z) − y_onehot. This is the loss for LLMs. Recognize the form when it appears.
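The same finite-difference trick verifies the softmax version, if you want a preview (a sketch; the multiclass derivation itself is week 3 material):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def ce(z, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y_onehot = np.eye(5)[2]              # true class is index 2

h = 1e-6
numeric = np.array([
    (ce(z + h * np.eye(5)[i], y_onehot) - ce(z - h * np.eye(5)[i], y_onehot)) / (2 * h)
    for i in range(5)
])
analytic = softmax(z) - y_onehot     # softmax(z) − y_onehot
print(np.max(np.abs(numeric - analytic)))   # ~1e-9
```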
Photograph this derivation. Commit it to the repo.
Part 2-Implementation on a 2D toy problem (75 min)¶
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 200

# Two Gaussian blobs
X0 = rng.normal([-2, -2], 1, size=(N//2, 2))
X1 = rng.normal([ 2,  2], 1, size=(N//2, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(N//2), np.ones(N//2)])

def sigmoid(z): return 1 / (1 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1
losses = []

for step in range(2000):
    z = X @ w + b
    p = sigmoid(z)
    eps = 1e-9  # avoid log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    losses.append(loss)

    error = p - y  # this is the gradient!
    grad_w = X.T @ error / N
    grad_b = error.mean()

    w -= lr * grad_w
    b -= lr * grad_b

print(f"final loss={losses[-1]:.4f}, w={w}, b={b:.4f}")

# Decision boundary
xx, yy = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
preds = sigmoid(grid @ w + b).reshape(xx.shape)
plt.contourf(xx, yy, preds, levels=20, cmap='RdBu_r', alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k')
plt.title('Logistic regression decision boundary')
```
Verify: accuracy >95%. Boundary is a line between the two blobs.
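A minimal accuracy check, reusing the variables from the training block above (thresholding at 0.5 is an assumption; pick whatever cutoff you prefer):

```python
# Fraction of training points classified correctly at a 0.5 threshold
acc = ((sigmoid(X @ w + b) >= 0.5) == y).mean()
print(f"accuracy = {acc:.1%}")   # expect well above 95% on these separated blobs
```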
Part 3-Animation of training (45 min)¶
Modify the loop to save weights every 50 iterations. Plot the decision boundary at iterations 0, 100, 500, 2000 in a 2×2 grid. Watch the line rotate and translate into place.
This visualization is what makes "the model is learning" tactile.
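One possible shape for that modification (a sketch that reuses X, y, lr, N, sigmoid, grid, xx, yy from the blocks above; the checkpoint spacing and figure layout follow the text):

```python
# Retrain from scratch, saving (w, b) every 50 iterations
snapshots = {}
w, b = np.zeros(2), 0.0
for step in range(2001):
    if step % 50 == 0:
        snapshots[step] = (w.copy(), b)
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y) / N)
    b -= lr * (p - y).mean()

# Plot the decision boundary at four of the saved checkpoints
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
for ax, step in zip(axes.ravel(), [0, 100, 500, 2000]):
    w_s, b_s = snapshots[step]
    preds = sigmoid(grid @ w_s + b_s).reshape(xx.shape)
    ax.contourf(xx, yy, preds, levels=20, cmap='RdBu_r', alpha=0.6)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k')
    ax.set_title(f'iteration {step}')
```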
Common pitfalls in Session B¶
- Forgetting the `eps` inside `log(p)`. If `p` ever hits 0 or 1, you get `-inf` and gradients explode silently.
- Confusing σ(z) − y with y − σ(z) (a sign flip). If the loss increases, check this.
- Not normalizing by `N`: gradients become dataset-size-dependent.
Output of Session B¶
- `02-logistic-regression.ipynb` in repo: derivation, training loop, decision boundary, animation grid.
Session C-Metrics from scratch, ship, retro¶
Goal: Implement confusion matrix, precision, recall, F1 from scratch. Polish notebook. Push.
Pre-flight: Sessions A and B complete.
Part 1-Classification metrics (60 min)¶
Definitions for binary classification:
- TP (true positive): pred=1, label=1
- TN (true negative): pred=0, label=0
- FP (false positive): pred=1, label=0
- FN (false negative): pred=0, label=1
Then:
- Accuracy = (TP + TN) / N
- Precision = TP / (TP + FP) - of those we said are positive, how many really are?
- Recall = TP / (TP + FN) - of all the actual positives, how many did we catch?
- F1 = 2·P·R / (P + R) - the harmonic mean of precision and recall.
Why each exists. Imagine a fraud-detection problem: 99% of transactions are clean. Predicting "all clean" gives 99% accuracy but 0 recall on fraud. Accuracy is misleading; recall isn't.
Implement from scratch (no sklearn)
```python
def confusion_matrix(y_true, y_pred):
    TP = ((y_pred == 1) & (y_true == 1)).sum()
    TN = ((y_pred == 0) & (y_true == 0)).sum()
    FP = ((y_pred == 1) & (y_true == 0)).sum()
    FN = ((y_pred == 0) & (y_true == 1)).sum()
    return TP, TN, FP, FN

def precision_recall_f1(y_true, y_pred):
    TP, TN, FP, FN = confusion_matrix(y_true, y_pred)
    precision = TP / (TP + FP) if (TP + FP) else 0
    recall = TP / (TP + FN) if (TP + FN) else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    return precision, recall, f1
```
Apply to your model's predictions on a held-out test set. Report all metrics.
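One way to wire this up end to end (a sketch under an assumed 80/20 split of the same blob data; the training loop earlier used all 200 points, so this version retrains on the training portion only and reuses X, y, N, lr, rng, sigmoid from above):

```python
# Hypothetical hold-out evaluation: split the blobs, retrain on 80%, score the rest
idx = rng.permutation(N)
train, test = idx[:160], idx[160:]

w, b = np.zeros(2), 0.0
for step in range(2000):
    p = sigmoid(X[train] @ w + b)
    w -= lr * (X[train].T @ (p - y[train]) / len(train))
    b -= lr * (p - y[train]).mean()

y_pred = (sigmoid(X[test] @ w + b) >= 0.5).astype(int)
precision, recall, f1 = precision_recall_f1(y[test], y_pred)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```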
Part 2-Notebook polish + push (60 min)¶
- Add markdown sections explaining each block.
- Embed the BCE derivation photo.
- Add a "Why this matters" closing paragraph: "Cross-entropy is what's underneath every LLM. We'll meet it again in week 3 (multiclass) and every week after."
- Push to repo. Update README to link the new notebook.
Part 3-Recall + forward look (45 min)¶
Recall test (no peeking). On a fresh page, write:
1. The BCE loss (one example).
2. The chain-rule derivation showing dL/dz = σ(z) − y.
3. Why this gradient is "elegant."
If gaps emerge, re-read your derivation. Recall is the test of comprehension.
Forward-look: Read M01-W03.md. Pre-watch the first ~15 minutes of Karpathy's micrograd lecture (it's 2+ hours; you'll do the rest in W04).
Output of Session C¶
- Polished notebook with metrics implemented from scratch.
- Repo pushed with two notebooks.
- Updated `LEARNING_LOG.md`.
End-of-week artifact¶
- `02-logistic-regression.ipynb` complete in repo
- BCE gradient derivation embedded
- Decision boundary visualization showing convergence
- Precision/recall/F1 implemented from scratch
End-of-week self-assessment¶
- I can differentiate log(1 + eˣ) without notes.
- I can derive dL/dz = σ(z) − y on a blank page.
- I can explain why F1 is preferred over accuracy in imbalanced classification.
- I can sketch the multivariable chain rule for a 2-layer network.
- My logistic regression converges to >95% accuracy.
Common failure modes for this week¶
- "I'll just trust the σ(z)−y form." No. Derive it. The act of derivation is what builds the intuition that pays off in transformers.
- Skipping the metrics implementation in favor of `sklearn.metrics`. Just once, by hand. After that, use the library.
- Reading without writing. Recall, not re-reading, is what cements knowledge.
What's next (preview of M01-W03)¶
Probability foundations + a 2-layer MLP forward AND backward pass implemented entirely in NumPy with no autograd. This is the week backpropagation truly clicks.