02 - Calculus
Why this matters in the journey
Training a neural network is an optimization problem solved by gradient descent. Gradients are derivatives. Backpropagation is the chain rule. Without calculus intuition you cannot debug training (vanishing gradients, exploding gradients, why ReLU helps, why batch norm helps), and the modern fine-tuning paper landscape (DPO, GRPO, etc.) stays closed to you. You don't need ε-δ rigor; you need to know what a derivative is, what a gradient is, and why the chain rule lets us train arbitrarily deep networks.
The rungs
Rung 01 - Derivatives as instantaneous rate of change
- What: The derivative of f at x is the slope of the tangent line at that point. Symbolically, df/dx.
- Why it earns its place: Loss is a function of weights. The derivative of loss w.r.t. a weight tells us how to nudge the weight to reduce loss. That's all training is.
- Resource: 3Blue1Brown's Essence of Calculus, episodes 1–3. Search YouTube "3blue1brown essence of calculus".
- Done when: You can explain what df/dx means without using the word "derivative."
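To make "instantaneous rate of change" concrete, here is a minimal finite-difference sketch (the toy function and probe point are my own choices, not from the resources above): the secant slope (f(x + h) − f(x)) / h approaches the tangent slope as h shrinks.

```python
def f(x):
    return x ** 2  # toy function; the true slope at x = 3 is 6

# Forward-difference slope estimate: (f(x + h) - f(x)) / h.
# As h shrinks, the secant line approaches the tangent line.
for h in [1.0, 0.1, 0.01, 0.001]:
    slope = (f(3 + h) - f(3)) / h
    print(h, slope)  # 7.0, 6.1, 6.01, 6.001 -> converging to 6
```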
Rung 02 - Differentiation rules
- What: Power rule, product rule, quotient rule, chain rule. In particular: (f∘g)'(x) = f'(g(x)) · g'(x).
- Why it earns its place: The chain rule is backprop. Internalize it.
- Resource: 3Blue1Brown episode 4 ("Visualizing the chain rule and product rule"). Plus Khan Academy "Calculus 1 → Differentiation rules" exercises.
- Done when: You can differentiate sin(x²), e^(2x+1), and log(1+e^x) without looking anything up.
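A useful habit once you know the rules: check your hand derivative against a numerical one. A minimal sketch (the function is from the "done when" list above; the probe point is my own choice):

```python
import numpy as np

def f(x):
    return np.sin(x ** 2)

def f_prime(x):
    # Chain rule: derivative of the outer sin at the inner value,
    # times the derivative of the inner x^2.
    return np.cos(x ** 2) * 2 * x

x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
print(f_prime(x), numeric)  # should agree to ~6 decimal places
```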
Rung 03 - Partial derivatives and gradients
- What: For a multi-variable function f(x, y, z), the partial derivative ∂f/∂x holds the other variables fixed. The gradient ∇f is the vector of all the partials.
- Why it earns its place: A neural network has millions of parameters. We need the partial derivative of loss w.r.t. each one. The gradient is what gradient descent descends.
- Resource: Multivariable Calculus on Khan Academy (free; Grant Sanderson of 3Blue1Brown is the instructor). Search "khan academy multivariable calculus".
- Done when: You can compute the gradient of f(x, y) = x² + 3xy + y² and explain what direction it points.
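A minimal sketch of that exercise (the probe point and step size are arbitrary choices of mine): compute the gradient analytically, then verify that f increases fastest along +∇f and decreases along −∇f.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3*x*y + y**2

def grad_f(p):
    # (∂f/∂x, ∂f/∂y) = (2x + 3y, 3x + 2y), each computed holding the other fixed
    x, y = p
    return np.array([2*x + 3*y, 3*x + 2*y])

p = np.array([1.0, 2.0])
u = grad_f(p) / np.linalg.norm(grad_f(p))  # unit vector along the gradient
eps = 1e-3
print(f(p), f(p + eps * u), f(p - eps * u))  # rises along +∇f, falls along -∇f
```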
Rung 04 - Chain rule in multiple dimensions
- What: For composed functions y = f(g(x)) where everything is multi-dimensional, the gradient flows backwards as a chain of Jacobian-vector products.
- Why it earns its place: This is literally how backprop is implemented in PyTorch. Every .backward() call walks the computational graph applying the multi-dimensional chain rule.
- Resource: The 3Blue1Brown chain rule episode plus Karpathy's Zero to Hero lecture 1 (micrograd), the best calculus pedagogy on the internet for ML.
- Done when: You can hand-derive the gradient through a 2-layer MLP with ReLU (a worked NumPy version follows).
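Here is a minimal NumPy sketch of that hand derivation (the layer sizes, seed, and squared-error loss are my own choices): a forward pass, then the chain rule applied layer by layer, with one gradient entry checked numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # input
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)
y = np.array([1.0])                          # target

# Forward: linear -> ReLU -> linear -> squared loss
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)
z2 = W2 @ a1 + b2
loss = 0.5 * np.sum((z2 - y) ** 2)

# Backward: each line is one application of the chain rule
dz2 = z2 - y                 # dL/dz2
dW2 = np.outer(dz2, a1)
da1 = W2.T @ dz2             # gradient flows back through the second linear layer
dz1 = da1 * (z1 > 0)         # ReLU's local derivative is a 0/1 mask
dW1 = np.outer(dz1, x)

# Finite-difference check of one entry of dW1
h = 1e-6
W1[0, 0] += h
loss_h = 0.5 * np.sum((W2 @ np.maximum(W1 @ x + b1, 0) + b2 - y) ** 2)
print(dW1[0, 0], (loss_h - loss) / h)  # should match closely
```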
Rung 05 - Optimization basics: gradient descent
- What: Repeatedly nudge the parameters in the direction of −∇L to reduce the loss. The step size is the learning rate.
- Why it earns its place: Every neural network you ever train uses some flavor of gradient descent.
- Resource: 3Blue1Brown's Neural Networks series, episode 2 ("Gradient descent, how neural networks learn"). Plus a hand-rolled NumPy implementation as part of week 1 of month 1.
- Done when: You can write gradient descent in NumPy from scratch on a 1D function.
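A minimal sketch of that exercise (the 1D function and hyperparameters are my own choices):

```python
# Minimize f(x) = (x - 3)^2 + 1 by repeatedly stepping against the gradient.
f = lambda x: (x - 3) ** 2 + 1
grad = lambda x: 2 * (x - 3)

x = -5.0    # arbitrary starting point
lr = 0.1    # learning rate (step size)
for _ in range(50):
    x -= lr * grad(x)   # nudge x in the direction of -f'(x)
print(x, f(x))          # x ≈ 3, f(x) ≈ 1
```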
Rung 06 - Convexity intuition
- What: A convex function has a single basin: every local minimum is a global minimum. Linear regression's loss is convex; neural network loss is not.
- Why it earns its place: It explains why we have local minima in deep learning, why initialization matters, and why "good enough" is the goal.
- Resource: Mathematics for Machine Learning (Deisenroth et al., free PDF) chapter 7 sections on convexity. Or any optimization 101 source.
- Done when: You can sketch a convex vs non-convex loss landscape and explain the implications.
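You can also feel the difference numerically. A minimal sketch (both toy functions are my own picks, assuming plain gradient descent): on a convex function every start reaches the same minimum; on a non-convex one, the starting point decides which basin you land in.

```python
def descend(grad, x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex: f(x) = x^2. Any start converges to the unique minimum at 0.
print(descend(lambda x: 2 * x, -4.0), descend(lambda x: 2 * x, 5.0))

# Non-convex: f(x) = x^4 - 3x^2 + x has two basins (f'(x) = 4x^3 - 6x + 1).
# Different starts land in different local minima: initialization matters.
g = lambda x: 4 * x**3 - 6 * x + 1
print(descend(g, -2.0), descend(g, 2.0))  # two different answers
```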
Rung 07 - Stochastic gradient descent + variants
- What: SGD computes the gradient on a mini-batch instead of the full dataset. Variants: momentum, Adam, AdamW.
- Why it earns its place: Adam is the default optimizer family for LLMs; AdamW (Adam with decoupled weight decay) is the variant actually used in practice. You should know what β1, β2, and ε mean.
- Resource: Sebastian Ruder's blog post "An overview of gradient descent optimization algorithms" (search "ruder gradient descent overview"). Plus the original Adam paper (arxiv.org/abs/1412.6980).
- Done when: You can explain why Adam is better than vanilla SGD for transformer training.
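To make β1, β2, and ε concrete, here is a minimal NumPy sketch of the update rule from the Adam paper linked above (the toy problem and hyperparameters are my own choices; AdamW differs only in applying weight decay directly to the parameters rather than through the gradient):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # 1st moment: momentum-style average
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment: per-parameter scale
    m_hat = m / (1 - beta1 ** t)              # bias correction (matters early on)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # AdamW would additionally do: theta -= lr * weight_decay * theta
    return theta, m, v

# A badly scaled quadratic: curvatures differ by 10^4 across coordinates.
# Vanilla SGD must pick one learning rate for both; Adam adapts per parameter.
theta, m, v = np.array([5.0, 5.0]), np.zeros(2), np.zeros(2)
scales = np.array([100.0, 0.01])
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * scales * theta, m, v, t)
print(theta)  # both coordinates end up near 0 despite the scale gap
```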
Rung 08 - Loss functions and what their gradients look like
- What: MSE, cross-entropy, KL divergence, hinge loss. Each has a characteristic gradient shape.
- Why it earns its place: Cross-entropy is the loss for next-token prediction (i.e., for every LLM). KL divergence shows up in DPO, distillation, and RL fine-tuning.
- Resource: Deep Learning (Goodfellow) chapter 5 sections on loss functions. Plus implement each in NumPy as a one-page exercise.
- Done when: You can derive the gradient of cross-entropy w.r.t. the logits and recognize the elegant softmax(z) − y form.
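A minimal sketch of checking that derivation (the logits and target are arbitrary choices of mine): compute softmax(z) − y and compare it against a numerical gradient of the cross-entropy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    # y is a one-hot target; z are the logits
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])

analytic = softmax(z) - y       # the elegant closed form

h = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * h)
print(analytic, numeric)        # should agree
```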
Rung 09 - Jacobians, Hessians (concept-level only)
- What: Jacobian = matrix of all first partials. Hessian = matrix of all second partials.
- Why it earns its place: You'll see "Jacobian" in PyTorch's autograd docs and in second-order optimization papers (rare in practice, but required reading for breadth).
- Resource: Khan Academy multivariable, Mathematics for ML chapter 5. Skim, don't grind.
- Done when: You know what a Jacobian is and can explain why second-order methods are too expensive for LLMs.
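Concept-level is enough, but a quick numerical sketch pins down the definition (the toy function is my own): the Jacobian of f: R³ → R² is just a 2×3 grid of first partials, and the closing comment spells out why Hessians don't scale.

```python
import numpy as np

def f(x):
    # f: R^3 -> R^2, so its Jacobian is a 2x3 matrix
    return np.array([x[0] * x[1], np.sin(x[2])])

def numerical_jacobian(f, x, h=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += h
        J[:, j] = (f(xp) - y) / h   # column j: sensitivity to input j
    return J

print(numerical_jacobian(f, np.array([1.0, 2.0, 0.5])))
# Why second-order methods don't scale: the Hessian of an n-parameter model
# is n x n. For a 7-billion-parameter LLM that is roughly 5e19 entries,
# far beyond anything you could store, let alone invert.
```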
Rung 10 - Automatic differentiation
- What: Frameworks like PyTorch build a computational graph and apply the chain rule automatically. Forward mode vs reverse mode: deep learning uses reverse mode because it computes the gradient of one scalar output (the loss) w.r.t. many inputs (the parameters) in a single backward pass.
- Why it earns its place: This is the magic that makes PyTorch usable. Knowing how it works under the hood lets you debug "why doesn't my gradient flow."
- Resource: Karpathy's micrograd (lecture + code on GitHub: karpathy/micrograd). Implement it. It is ~150 lines and changes how you see PyTorch forever.
- Done when: You can hand-implement a tiny autograd engine that backprops through +, *, and tanh (a minimal sketch follows).
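Here is a minimal sketch of such an engine, in the spirit of micrograd but stripped down and written from scratch (the class and method names are my own, not Karpathy's exact code):

```python
import math

class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # pushes out.grad into the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():                # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():                # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():                # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# A single tanh neuron: out = tanh(w*x + b)
x, w, b = Value(0.5), Value(-3.0), Value(2.0)
out = (w * x + b).tanh()
out.backward()
print(x.grad, w.grad, b.grad)   # exactly what the hand-applied chain rule gives
```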
Minimum required to leave this sequence
- Differentiate single-variable functions fluently.
- Compute and interpret a gradient in 2 or 3 dimensions.
- Explain the chain rule in your own words.
- Hand-derive the gradient of a 2-layer MLP loss.
- Implement gradient descent in NumPy on a toy problem.
- Implement micrograd from Karpathy's lecture and run backprop on a simple expression.
- Explain why cross-entropy is used for classification.
Going further (only after the minimum)
- MIT 18.01 / 18.02 lectures on OCW for rigorous coverage.
- Mathematics for Machine Learning chapters 5–7.
- Convex Optimization by Boyd & Vandenberghe (free PDF), chapters 1–3 only.
How this sequence connects to the year
- Months 1–2: rungs 01–08 are the math behind every model you train.
- Month 3: rung 04 (multi-dim chain rule) and rung 10 (autograd) are the infrastructure for understanding transformer training.
- Month 8: rungs 07–08 are needed to read the DPO / GRPO papers without bouncing.