
Month 1-Week 4: PyTorch, autograd, and your first blog post

Week summary

  • Goal: Port your hand-built MLP to PyTorch. Implement Karpathy's micrograd from scratch to deeply understand autograd. Publish the first public blog post: "Backprop with no hand-waving."
  • Time: ~10 hours over 3 sessions.
  • Output: 04-mlp-pytorch.ipynb, separate micrograd-minimal/ repo, first public blog post.
  • Sequences relied on: 05-pytorch rungs 01–05; 02-calculus rung 10; 04-python-for-ml rungs 01, 02, 06.

Why this week matters in your AI expert journey

Autograd is what makes modern deep learning practical. PyTorch lets you write the forward pass; the gradient is computed for free. But "for free" is misleading-there's a computational graph being built and walked. Knowing what's underneath the magic-by writing your own ~150-line autograd engine-converts PyTorch from a black box into a glass box. Once you've felt how Value.backward() works in micrograd, you can debug PyTorch behavior that confuses everyone else.

The blog post matters separately. AI careers compound on visibility. Most engineers will never publish anything. Those who do are remembered. Your very first post being technical, well-derived, and honest sets the tone for the rest of the year.

Prerequisites

  • M01-W01, W02, W03 complete.
  • Your W03 NumPy MLP with its numerical gradient check passing.

Session plan

  • Session A-Tue/Wed evening (~2.5 h): PyTorch tutorials + tensors
  • Session B-Sat morning (~4 h): port MLP + implement micrograd
  • Session C-Sun afternoon (~3 h): blog post + month-1 retro

Session A-PyTorch tensors, modules, autograd

Goal: Be fluent in PyTorch tensor operations, autograd, and the nn.Module pattern.

Part 1-Tensors and devices (45 min)

PyTorch tensors are like NumPy arrays with three additions:

1. They live on a device (CPU or CUDA GPU). Move with .to('cuda').
2. They optionally track gradients when requires_grad=True.
3. They're the inputs to autograd's computational graph.
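A minimal sketch touching all three (the cuda branch only runs if you actually have a GPU):

import torch

x = torch.randn(3, 4)                                      # like a NumPy array, but...
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = x.to(device)                                           # 1. it lives on a device
w = torch.randn(4, 2, device=device, requires_grad=True)   # 2. it can track gradients
y = (x @ w).sum()                                          # 3. this op is recorded in the autograd graph
print(y.grad_fn)                                           # <SumBackward0 object at ...>
y.backward()
print(w.grad.shape)                                        # torch.Size([4, 2])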

Read

  • PyTorch tutorial: "Tensors"-pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html
  • PyTorch tutorial: "Autograd"-pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html

Reproduce W03 ops in PyTorch

import torch

# Cross-entropy gradient parity check
torch.manual_seed(0)
z = torch.randn(2, 10, requires_grad=True)
y = torch.tensor([3, 7])
loss = torch.nn.functional.cross_entropy(z, y)
loss.backward()
print(z.grad)        # should ~equal (softmax(z) - one_hot(y)) / batch_size
Verify: gradient matches your hand derivation from W03.
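To make the comparison explicit rather than eyeballed, one way to check it in the same cell (a sketch; it just recomputes the W03 formula in torch):

import torch.nn.functional as F

with torch.no_grad():
    probs = F.softmax(z, dim=1)
    one_hot = F.one_hot(y, num_classes=10).float()
    manual_grad = (probs - one_hot) / z.shape[0]   # mean reduction divides by batch size

print(torch.allclose(z.grad, manual_grad, atol=1e-6))   # True if the derivation holds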

Part 2-nn.Module and nn.Linear (60 min)

Every PyTorch model inherits from nn.Module and implements forward. Parameters auto-register for autograd and .to(device).

Code along (do not skip)

import torch.nn as nn

class TwoLayerMLP(nn.Module):
    def __init__(self, in_dim=784, hidden=128, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc2(h)

Inspect

model = TwoLayerMLP()
for name, p in model.named_parameters():
    print(name, p.shape)
Match the output to your W03 NumPy MLP: same parameter shapes; the PyTorch version just adds parameter registration and autograd bookkeeping on top.
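For reference, with the default sizes you should see shapes like the following (nn.Linear stores its weight as (out_features, in_features)):

fc1.weight torch.Size([128, 784])
fc1.bias torch.Size([128])
fc2.weight torch.Size([10, 128])
fc2.bias torch.Size([10])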

Part 3-DataLoader, training loop, and optimizer (60 min)

The standard PyTorch training loop:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1)),
])
train_ds = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_ds  = datasets.MNIST('./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_ds, batch_size=256)

model = TwoLayerMLP()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    for x, y in train_loader:
        opt.zero_grad()
        logits = model(x)
        loss = loss_fn(logits, y)
        loss.backward()
        opt.step()
    # eval
    model.eval()
    correct = 0; total = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(-1) == y).sum().item()
            total += len(y)
    print(f"epoch {epoch+1}: test acc = {correct/total:.4f}")
Expected: >97% test accuracy within 5 epochs. PyTorch converges faster than your NumPy version largely because of better defaults, notably nn.Linear's built-in weight initialization.

Common pitfalls in Session A

  • Forgetting opt.zero_grad(). Gradients accumulate by default; you must clear them every step (see the short demo after this list).
  • Forgetting model.eval() and torch.no_grad() at evaluation. Causes wrong dropout behavior and wasted memory.
  • Using loss.backward() twice without zeroing. Double-counts.
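To see the accumulation behavior for yourself, a minimal sketch (the toy tensor is just for illustration):

import torch

w = torch.tensor(3.0, requires_grad=True)
for step in range(2):
    loss = w * 2.0          # rebuild the graph each step, as a training loop would
    loss.backward()
    print(step, w.grad)     # 2.0, then 4.0: gradients accumulate without zero_grad

Clearing the gradient each step (w.grad = None, or opt.zero_grad() when using an optimizer) keeps the printed gradient at 2.0.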

Output of Session A

  • Tensor parity check verifying your W03 derivation.
  • A working PyTorch MLP achieving >97% on MNIST.

Session B-Implement micrograd (autograd from scratch)

Goal: Build Karpathy's micrograd - a minimal autograd engine in ~150 lines that backprops through `+`, `*`, `tanh`, `exp`, and `log`. This is the highest-leverage 4 hours of your month.

Part 1-Watch and understand (90 min)

Watch in full

  • Karpathy, Zero to Hero, Lecture 1: "Building micrograd"-~2.5 h, YouTube. Take notes as you watch.

The lecture builds:

1. A Value class wrapping a scalar with gradient tracking.
2. Operator overloads (__add__, __mul__, etc.) that build a graph.
3. A topological sort over the graph.
4. A backward() that walks the topo order in reverse, applying local gradients.
5. A neural-network-style usage on a tiny dataset.

Part 2-Implement along (90 min)

Type along-don't paste. Build micrograd-minimal/engine.py:

import math

# A scalar node in the computational graph: it stores its value, its gradient,
# and a closure that applies the local chain-rule step during backward().
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out
    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out
    def backward(self):
        # Topologically sort the graph reachable from this node, then apply each
        # node's local backward rule in reverse order (chain rule; gradients accumulate).
        topo = []
        visited = set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._prev:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

Test it

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a*b + c
d.backward()
print(a.grad)  # should be -3.0 (∂d/∂a = b)
print(b.grad)  # should be  2.0 (∂d/∂b = a)
print(c.grad)  # should be  1.0
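As a final sanity check in the W03 spirit, compare against a finite difference (a quick sketch; the helper f is just for illustration):

def f(a_val):
    a, b, c = Value(a_val), Value(-3.0), Value(10.0)
    return (a*b + c).data

h = 1e-5
print((f(2.0 + h) - f(2.0 - h)) / (2 * h))   # ~ -3.0, matching a.grad above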

Part 3-Build a tiny neural net on top (60 min)

Add Neuron, Layer, and MLP classes on top of Value. Train on a 4-point XOR-style dataset. Watch the loss go down. This proves that ~150 lines of code can train a real network.

import random
class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))
    def __call__(self, x):
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()
    def parameters(self):
        return self.w + [self.b]
# ...Layer, MLP analogous; pattern in Karpathy's video
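If you want the rest of the scaffolding spelled out, here is a minimal sketch of Layer and MLP following the same pattern (the video remains the reference; the exact signatures below are just one reasonable choice):

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, nin, nouts):   # e.g. MLP(2, [4, 4, 1]) for the XOR-style demo
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]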

Why this matters. PyTorch's autograd does the same thing-but with tensors instead of scalars, and CUDA-accelerated. The graph-building, topo-sort, backward-walk pattern is exactly the design PyTorch uses.
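You can see the same structure on PyTorch's side using the tiny expression from the micrograd test (a quick sketch; the attribute names are PyTorch's, the mapping to micrograd is loose):

import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-3.0, requires_grad=True)
c = torch.tensor(10.0, requires_grad=True)
d = a*b + c
print(d.grad_fn)                  # AddBackward0 -- roughly micrograd's _op
print(d.grad_fn.next_functions)   # upstream graph nodes -- roughly micrograd's _prev
d.backward()
print(a.grad, b.grad, c.grad)     # tensor(-3.), tensor(2.), tensor(1.)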

Common pitfalls in Session B

  • Pasting Karpathy's code instead of typing it. Do not. Typing is what cements the design.
  • Forgetting that += on .grad is needed. Multiple paths to the same node accumulate.
  • Misunderstanding the topological order. Backward must visit a node only after all of its consumers have already run.

Output of Session B

  • micrograd-minimal/ repo, public, with engine.py and a tiny demo.
  • A 100-word note in your LEARNING_LOG.md: "What PyTorch does that micrograd doesn't (yet)."

Session C-Blog post + month-1 retrospective

Goal: Publish "Backprop with no hand-waving", ~1500 words, with code. Run month-1 retrospective.

Part 1-Outline + draft (90 min)

Outline

1. Hook. "Most ML tutorials hand-wave the backward pass. I refused. Here's what happened when I derived it from scratch."
2. The 1D chain rule (a small f(g(x)) example with the derivation; see the sketch below for the flavor).
3. The 2-layer MLP backward derivation. Show the 5 steps. Embed your photo or LaTeX.
4. The numerical gradient check. Show the code. Show it passing.
5. Bridge to autograd. Brief intro to micrograd. Link to your repo.
6. What PyTorch adds beyond micrograd. Tensors, devices, the operator zoo, fused kernels.
7. Closing. What's next: transformers.
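For item 2, one example of the kind of derivation to show (a sketch, not a prescription; any simple composite function works):

f(u) = u^2, \quad g(x) = 3x + 1, \qquad \frac{d}{dx}\,f(g(x)) = f'(g(x))\,g'(x) = 2(3x+1)\cdot 3 = 6(3x+1)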

Length: ~1500 words. Tone: Confident, specific, free of unnecessary hedging.

Part 2-Polish + publish (60 min)

  • Edit. Cut filler.
  • Add charts: training curves, the gradient field plot.
  • Choose a platform: personal blog (preferred), dev.to, Hashnode, or Substack.
  • Publish.
  • Share in one channel: Twitter/X, Reddit r/learnmachinelearning, LinkedIn, your team Slack.

Part 3-Month-1 retrospective (45 min)

Write MONTH_1_RETRO.md in your ml-from-scratch repo:

# Month 1 retro

## Artifacts shipped
- `01-linear-regression.ipynb`
- `02-logistic-regression.ipynb`
- `03-mlp-numpy.ipynb`
- `04-mlp-pytorch.ipynb`
- `micrograd-minimal/`
- Blog post: <link>

## KPIs vs targets (Q1 row)
| Metric | Target | Actual | Note |
|--------|--------|--------|------|
| Public repos | 3 | 2 | ml-from-scratch + micrograd-minimal |
| Blog posts | 1 | 1 | "Backprop with no hand-waving" |
| Papers read deeply | 8/quarter | 0 | will accelerate from M03 |

## Biggest insights
1. ...
2. ...
3. ...

## What slipped
- ...

## Pace check
- Sustainable / accelerated / behind?
- Adjustments for M02:

## Confidence checklist before M02
- [ ] Vectors, dot products, cosine similarity automatic
- [ ] Cross-entropy ↔ MLE link clear
- [ ] Backprop derivation possible from blank page
- [ ] PyTorch training loop fluent

Output of Session C

  • Public blog post live.
  • Month-1 retrospective written.

End-of-week artifact

  • 04-mlp-pytorch.ipynb reaching >97% on MNIST
  • micrograd-minimal/ public repo with engine + demo
  • First public blog post live and shared in one channel
  • MONTH_1_RETRO.md written

End-of-week self-assessment

  • I can write the PyTorch training loop boilerplate from memory.
  • I can implement micrograd's +, *, and backward() from a blank file.
  • I can explain what loss.backward() does in PyTorch in terms of micrograd's design.
  • My blog post is something I'd be proud to link in a job application.

Common failure modes for this week

  • Not publishing. "It's not perfect yet" is the killer. Publish at 80%; the comments improve it.
  • Pasting micrograd from GitHub. The lecture's value is the typing. Don't shortcut.
  • Skipping the retro. It's the highest-leverage 45 min of the month. Schedule it.

What's next (preview of M02-W01)

Classical ML-pick fast.ai or Andrew Ng. Build a real image classifier. Start the discipline of train/val/test splits, baselines, and ablations. The math foundations are in place; now we apply them.
