Week 3 - Tensors, Autograd, the Gradient Tape
3.1 Conceptual Core
- A tensor is an N-dimensional array with a dtype, a shape, a device, and a computation graph attached (if it requires grad).
- Automatic differentiation has two modes:
  - Forward-mode (efficient when outputs ≫ inputs): propagate derivatives alongside values.
  - Reverse-mode / backpropagation (efficient when inputs ≫ outputs, the ML case): build a graph during forward, traverse it backward.
- PyTorch implements dynamic (define-by-run) reverse-mode AD via a graph built from `Function` nodes. JAX implements functional AD via tracing.
- The single most useful thing about reverse-mode is the VJP (vector-Jacobian product): given output gradients, propagate to input gradients without ever materializing the Jacobian matrix.
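A small sketch of the VJP idea in PyTorch (the function and shapes are arbitrary illustrations, not anything prescribed):

```python
# Backward propagates a *vector* of output gradients through the graph;
# the full 5x5 Jacobian of y w.r.t. x is never formed.
import torch

x = torch.randn(5, requires_grad=True)
y = 3 * x.sin()                          # 5 outputs, 5 inputs
v = torch.ones_like(y)                   # the "vector" in vector-Jacobian product
(grad_x,) = torch.autograd.grad(y, x, grad_outputs=v)
print(grad_x)                            # equals v @ J = 3 * cos(x), computed in one pass
```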
3.2 Mechanical Detail
- A PyTorch tensor with `requires_grad=True` records every op into a graph. `loss.backward()` traverses the graph, calling each op's backward function.
- The graph is built per-iteration (this is what "dynamic" means). At `backward()`, the graph is consumed and discarded (unless `retain_graph=True`).
- `torch.no_grad()` disables graph building; used during inference and during certain training tricks (target networks in RL, EMA updates). `detach()` creates a tensor that shares storage but is severed from the graph.
- Custom autograd: `torch.autograd.Function` lets you define forward/backward pairs. The escape hatch when you need a custom op (week 11–12).
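A short sketch of these mechanics plus a toy custom op; the name `MyReLU` and the shapes are illustrative choices, not part of any API:

```python
# Graph building, consumption, no_grad, and detach in a few lines.
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
print(y.grad_fn)              # SumBackward0: the graph was recorded during forward
y.backward()                  # traverse and consume the graph
print(x.grad)                 # d(y)/d(x) = 2 for every element

with torch.no_grad():         # no graph is built inside this block
    z = x * 2
print(z.grad_fn)              # None

d = x.detach()                # shares storage with x, severed from the graph
print(d.requires_grad)        # False

# Custom autograd: a forward/backward pair defined by hand (a toy ReLU).
class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        ctx.save_for_backward(inp)
        return inp.clamp(min=0.0)

    @staticmethod
    def backward(ctx, grad_out):
        (inp,) = ctx.saved_tensors
        return grad_out * (inp > 0)

x.grad = None                 # clear the gradient accumulated above
out = MyReLU.apply(x).sum()
out.backward()                # calls MyReLU.backward with the upstream gradient
print(x.grad)                 # 1 where x > 0, else 0
```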
3.3 Lab - "Autograd From Scratch"
Implement reverse-mode AD in ~100 lines of pure Python (no PyTorch). Support:
- A `Tensor` class wrapping a NumPy array with a `grad` field.
- `__add__`, `__mul__`, `__matmul__`, `relu`, `sum`. Each records its inputs and a backward function.
- A `backward()` method that topologically sorts and traverses the graph.
- Test on a tiny MLP: define `f = x @ W1 + b1; g = relu(f); h = g @ W2 + b2; loss = h.sum()`. Verify the gradients match a `torch.autograd` reference to within floating-point precision.
This is Andrej Karpathy's micrograd exercise. Do it before reading his code; then read his code and compare.
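A minimal sketch of one way to structure it, in pure Python + NumPy (names like `_backward_fn` and `_unbroadcast` are illustrative, not prescribed):

```python
import numpy as np

def _unbroadcast(grad, shape):
    # Sum the gradient over broadcast dimensions so it matches the original shape
    # (so a (hidden,) bias gets a correct gradient from a (batch, hidden) output).
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    for i, s in enumerate(shape):
        if s == 1 and grad.shape[i] != 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad

class Tensor:
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._parents = parents       # input Tensors that produced this one
        self._backward_fn = None      # pushes self.grad into the parents' grads

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other))
        def _backward():
            self.grad += _unbroadcast(out.grad, self.data.shape)
            other.grad += _unbroadcast(out.grad, other.data.shape)
        out._backward_fn = _backward
        return out

    def __mul__(self, other):
        out = Tensor(self.data * other.data, (self, other))
        def _backward():
            self.grad += _unbroadcast(other.data * out.grad, self.data.shape)
            other.grad += _unbroadcast(self.data * out.grad, other.data.shape)
        out._backward_fn = _backward
        return out

    def __matmul__(self, other):
        out = Tensor(self.data @ other.data, (self, other))
        def _backward():
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out._backward_fn = _backward
        return out

    def relu(self):
        out = Tensor(np.maximum(self.data, 0.0), (self,))
        def _backward():
            self.grad += (self.data > 0) * out.grad
        out._backward_fn = _backward
        return out

    def sum(self):
        out = Tensor(self.data.sum(), (self,))
        def _backward():
            self.grad += np.ones_like(self.data) * out.grad
        out._backward_fn = _backward
        return out

    def backward(self):
        # Topologically sort the graph reachable from self, then run each node's
        # backward function in reverse order.
        topo, seen = [], set()
        def build(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t._parents:
                    build(p)
                topo.append(t)
        build(self)
        self.grad = np.ones_like(self.data)   # d(loss)/d(loss) = 1
        for t in reversed(topo):
            if t._backward_fn is not None:
                t._backward_fn()

# Usage: build x, W1, b1, W2, b2 as Tensors, compute
# loss = ((x @ W1 + b1).relu() @ W2 + b2).sum(), call loss.backward(),
# then compare W1.grad etc. against a torch.autograd reference.
```

The closure-per-op pattern mirrors micrograd; PyTorch stores the same information in its `grad_fn` graph nodes instead of Python closures.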
3.4 Idiomatic & Diagnostic Drill
- Run a real training step and inspect `tensor.grad_fn`. Walk the graph manually: `loss.grad_fn.next_functions[0][0].next_functions`...
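A sketch of the drill, using a throwaway linear model (the model and shapes are arbitrary choices):

```python
import torch

model = torch.nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).sum()

print(loss.grad_fn)                    # e.g. SumBackward0
print(loss.grad_fn.next_functions)     # the parent nodes that feed it

# Follow one chain of next_functions down to a leaf (AccumulateGrad) or a None entry.
node = loss.grad_fn
while node is not None:
    print(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None
```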
3.5 Production Slice
- The most common beginner bug: forgetting `optimizer.zero_grad()`, so gradients accumulate across iterations. Add a unit test to your training-loop scaffolding that asserts gradients are zeroed at the start of every step.
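One way to wire in that check (a sketch: the model, the loop, and the convention of zeroing at the end of each step are assumptions for illustration):

```python
import torch

def assert_no_stale_grads(model):
    # Passes only if every parameter's grad is unset or all-zero.
    for name, p in model.named_parameters():
        assert p.grad is None or not p.grad.any().item(), f"stale gradient on {name}"

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    assert_no_stale_grads(model)               # start of step: nothing carried over
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)            # forgetting this trips the assert next step
```

The same helper can be called from a unit test that runs a couple of steps and expects no assertion failure.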