Week 3 - Tensors, Autograd, the Gradient Tape
3.1 Conceptual Core
- A tensor is an N-dimensional array with a dtype, a shape, a device, and a computation graph attached (if it requires grad).
- Automatic differentiation has two modes:
  - Forward-mode (efficient when outputs ≫ inputs): propagate derivatives alongside values.
  - Reverse-mode / backpropagation (efficient when inputs ≫ outputs, the ML case): build a graph during forward, traverse it backward.
- PyTorch implements dynamic (define-by-run) reverse-mode AD via a graph built from `Function` nodes. JAX implements functional AD via tracing.
- The single most useful thing about reverse-mode is the VJP (vector-Jacobian product): given output gradients, propagate to input gradients without ever materializing the Jacobian matrix.
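A small sketch of the VJP idea in PyTorch (the function and shapes are arbitrary illustrations, not anything prescribed):

```python
# Backward propagates a *vector* of output gradients through the graph;
# the full 5x5 Jacobian of y w.r.t. x is never formed.
import torch

x = torch.randn(5, requires_grad=True)
y = 3 * x.sin()                          # 5 outputs, 5 inputs
v = torch.ones_like(y)                   # the "vector" in vector-Jacobian product
(grad_x,) = torch.autograd.grad(y, x, grad_outputs=v)
print(grad_x)                            # equals v @ J = 3 * cos(x), computed in one pass
```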
3.2 Mechanical Detail
- A PyTorch tensor with `requires_grad=True` records every op into a graph. `loss.backward()` traverses the graph, calling each op's backward function.
- The graph is built per-iteration (this is what "dynamic" means). At `backward()`, the graph is consumed and discarded (unless `retain_graph=True`).
- `torch.no_grad()` disables graph building; used during inference and during certain training tricks (target networks in RL, EMA updates). `detach()` creates a tensor that shares storage but is severed from the graph.
- Custom autograd: `torch.autograd.Function` lets you define forward/backward pairs. The escape hatch when you need a custom op (week 11–12).
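A short sketch of these mechanics plus a toy custom op; the name `MyReLU` and the shapes are illustrative choices, not part of any API:

```python
# Graph building, consumption, no_grad, and detach in a few lines.
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
print(y.grad_fn)              # SumBackward0: the graph was recorded during forward
y.backward()                  # traverse and consume the graph
print(x.grad)                 # d(y)/d(x) = 2 for every element

with torch.no_grad():         # no graph is built inside this block
    z = x * 2
print(z.grad_fn)              # None

d = x.detach()                # shares storage with x, severed from the graph
print(d.requires_grad)        # False

# Custom autograd: a forward/backward pair defined by hand (a toy ReLU).
class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        ctx.save_for_backward(inp)
        return inp.clamp(min=0.0)

    @staticmethod
    def backward(ctx, grad_out):
        (inp,) = ctx.saved_tensors
        return grad_out * (inp > 0)

x.grad = None                 # clear the gradient accumulated above
out = MyReLU.apply(x).sum()
out.backward()                # calls MyReLU.backward with the upstream gradient
print(x.grad)                 # 1 where x > 0, else 0
```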
3.3 Lab - "Autograd From Scratch"
Implement reverse-mode AD in ~100 lines of pure Python (no PyTorch). Support:
- A `Tensor` class wrapping a NumPy array with a `grad` field.
- `__add__`, `__mul__`, `__matmul__`, `relu`, `sum`. Each records its inputs and a backward function.
- A `backward()` method that topologically sorts and traverses the graph.
- Test on a tiny MLP: define `f = x @ W1 + b1; g = relu(f); h = g @ W2 + b2; loss = h.sum()`. Verify the gradients match a `torch.autograd` reference to within floating-point precision.
This is Andrej Karpathy's micrograd exercise. Do it before reading his code; then read his code and compare.
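A minimal sketch of one way to structure it, in pure Python + NumPy (names like `_backward_fn` and `_unbroadcast` are illustrative, not prescribed):

```python
import numpy as np

def _unbroadcast(grad, shape):
    # Sum the gradient over broadcast dimensions so it matches the original shape
    # (so a (hidden,) bias gets a correct gradient from a (batch, hidden) output).
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    for i, s in enumerate(shape):
        if s == 1 and grad.shape[i] != 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad

class Tensor:
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._parents = parents       # input Tensors that produced this one
        self._backward_fn = None      # pushes self.grad into the parents' grads

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other))
        def _backward():
            self.grad += _unbroadcast(out.grad, self.data.shape)
            other.grad += _unbroadcast(out.grad, other.data.shape)
        out._backward_fn = _backward
        return out

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other))
        def _backward():
            self.grad += _unbroadcast(other.data * out.grad, self.data.shape)
            other.grad += _unbroadcast(self.data * out.grad, other.data.shape)
        out._backward_fn = _backward
        return out

    def __matmul__(self, other):
        out = Tensor(self.data @ other.data, (self, other))
        def _backward():
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out._backward_fn = _backward
        return out

    def relu(self):
        out = Tensor(np.maximum(self.data, 0.0), (self,))
        def _backward():
            self.grad += (self.data > 0) * out.grad
        out._backward_fn = _backward
        return out

    def sum(self):
        out = Tensor(self.data.sum(), (self,))
        def _backward():
            self.grad += np.ones_like(self.data) * out.grad
        out._backward_fn = _backward
        return out

    def backward(self):
        # Topologically sort the graph reachable from self, then run each node's
        # backward function in reverse order.
        topo, seen = [], set()
        def build(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t._parents:
                    build(p)
                topo.append(t)
        build(self)
        self.grad = np.ones_like(self.data)   # d(loss)/d(loss) = 1
        for t in reversed(topo):
            if t._backward_fn is not None:
                t._backward_fn()

# Usage: build x, W1, b1, W2, b2 as Tensors, compute
# loss = ((x @ W1 + b1).relu() @ W2 + b2).sum(), call loss.backward(),
# then compare W1.grad etc. against a torch.autograd reference.
```

The closure-per-op pattern mirrors micrograd; PyTorch stores the same information in its `grad_fn` graph nodes instead of Python closures.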
3.4 Idiomatic & Diagnostic Drill
- Run a real training step and inspect `tensor.grad_fn`. Walk the graph manually: `loss.grad_fn.next_functions[0][0].next_functions`...
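A sketch of the drill, using a throwaway linear model (the model and shapes are arbitrary choices):

```python
import torch

model = torch.nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).sum()

print(loss.grad_fn)                    # e.g. SumBackward0
print(loss.grad_fn.next_functions)     # the parent nodes that feed it

# Follow one chain of next_functions down to a leaf (AccumulateGrad) or a None entry.
node = loss.grad_fn
while node is not None:
    print(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None
```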
3.5 Production Slice
- The most common beginner bug: forgetting `optimizer.zero_grad()`, so gradients accumulate across iterations. Add a unit test to your training-loop scaffolding that asserts gradients are zeroed at the start of every step.
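One way to wire in that check (a sketch: the model, the loop, and the convention of zeroing at the end of each step are assumptions for illustration):

```python
import torch

def assert_no_stale_grads(model):
    # Passes only if every parameter's grad is unset or all-zero.
    for name, p in model.named_parameters():
        assert p.grad is None or not p.grad.any().item(), f"stale gradient on {name}"

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    assert_no_stale_grads(model)               # start of step: nothing carried over
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)            # forgetting this trips the assert next step
```

The same helper can be called from a unit test that runs a couple of steps and expects no assertion failure.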