05-PyTorch
Why this matters in the journey
PyTorch is the lingua franca of modern AI. Every transformer, every fine-tuning script, every research paper's reference implementation: PyTorch. Hugging Face is built on it. vLLM is built on it. Knowing it well is non-negotiable. The goal of this sequence is to take you from "I can write a training loop" to "I can read and modify nanoGPT, debug shape errors instantly, and write efficient custom modules."
The rungs
Rung 01-Tensors
- What: PyTorch tensors are NumPy arrays that live on CPU or GPU and track gradients. Same API surface as NumPy plus `.to(device)` and `.requires_grad`.
- Why it earns its place: Everything in PyTorch is a tensor. Comfort here is the floor.
- Resource: PyTorch official tutorial "Tensors" (search "pytorch tutorials tensors").
- Done when: You can create, reshape, slice, transpose, and matmul tensors fluently and predict shapes without running code (try the shape drill below).
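A quick self-test along those lines; the shapes and variable names here are arbitrary, not taken from the tutorial. Predict each commented shape before running it:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4).to(device)  # (2, 3, 4)
y = x.transpose(1, 2)       # swap the last two dims -> (2, 4, 3)
z = x @ y                   # batched matmul: (2, 3, 4) @ (2, 4, 3) -> (2, 3, 3)
flat = z.reshape(2, -1)     # flatten everything but the batch dim -> (2, 9)
rows = z[:, 0, :]           # slice out the first row of each matrix -> (2, 3)

print(x.shape, y.shape, z.shape, flat.shape, rows.shape)
```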
Rung 02-Autograd
- What: PyTorch tracks operations on tensors that have `requires_grad=True` and computes gradients via `.backward()`.
- Why it earns its place: This is the magic. Understanding it lets you debug "why is my gradient None" and "why is this slow."
- Resource: PyTorch tutorial "A Gentle Introduction to torch.autograd". Plus Karpathy's `micrograd` for the conceptual model.
- Done when: You can hand-trace what `.backward()` does on a small computation graph (like the one sketched below).
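A tiny graph you can trace by hand to check yourself; the values are arbitrary:

```python
import torch

x = torch.tensor(3.0)                      # treated as a constant
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = (w * x + b) ** 2                       # y = (2*3 + 1)^2 = 49
y.backward()                               # populates .grad on w and b

# Chain rule by hand: dy/dw = 2*(w*x + b)*x = 2*7*3 = 42
#                     dy/db = 2*(w*x + b)     = 14
print(w.grad, b.grad)                      # tensor(42.), tensor(14.)
```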
Rung 03-nn.Module and nn.Parameter
- What: A `Module` holds parameters and a `forward` method. Parameters auto-register for gradient tracking and `.to(device)` movement.
- Why it earns its place: All real PyTorch code is structured as `nn.Module`s. Reading model code becomes much easier once you know the convention.
- Resource: PyTorch tutorial "Build the Neural Network."
- Done when: You can write a 2-layer MLP as an `nn.Module` from scratch without referencing docs (one possible version is sketched below).
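One minimal version of that MLP; the layer sizes and the ReLU are arbitrary choices:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # submodules auto-register their parameters
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP(784, 256, 10)
print(sum(p.numel() for p in model.parameters()))  # 784*256 + 256 + 256*10 + 10 = 203530
print(model(torch.randn(32, 784)).shape)           # torch.Size([32, 10])
```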
Rung 04-DataLoader and Dataset
- What: `Dataset` provides items by index; `DataLoader` batches, shuffles, and parallelizes loading.
- Why it earns its place: Real training is bottlenecked by data loading. Knowing how to write a custom `Dataset` is bread-and-butter.
- Resource: PyTorch "Datasets and DataLoaders" tutorial.
- Done when: You can write a custom `Dataset` for a tokenized text corpus that returns `(input_ids, target_ids)` pairs (see the sketch after this list).
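A sketch of that Dataset for next-token prediction; the random stand-in corpus, the class name, and `block_size` are placeholders for your real tokenized data:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenizedTextDataset(Dataset):
    """Serves (input_ids, target_ids) windows from one long token sequence."""
    def __init__(self, tokens: torch.Tensor, block_size: int):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        chunk = self.tokens[idx : idx + self.block_size + 1]
        return chunk[:-1], chunk[1:]       # targets are the inputs shifted by one token

tokens = torch.randint(0, 50_000, (10_000,))   # stand-in for a real tokenized corpus
loader = DataLoader(TokenizedTextDataset(tokens, block_size=128),
                    batch_size=32, shuffle=True, num_workers=2)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)                      # torch.Size([32, 128]) twice
```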
Rung 05-The training loop
- What: The boilerplate: zero grads, forward, loss, backward, step. Plus eval mode, gradient clipping, learning rate schedules.
- Why it earns its place: You'll write this loop hundreds of times. Internalizing it removes friction.
- Resource: PyTorch tutorial "Optimizing Model Parameters." Plus Karpathy's nanoGPT `train.py`; read it line by line.
- Done when: You can write a training loop from scratch with optimizer, loss, gradient clipping, validation, and checkpointing (see the sketch below).
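A self-contained version of that loop on toy data; the model, hyperparameters, and checkpoint names are placeholders, not a recommended recipe:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
train_loader = DataLoader(TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,))),
                          batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randint(0, 10, (64,))),
                        batch_size=32)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()                   # zero grads
        loss = loss_fn(model(inputs), targets)  # forward + loss
        loss.backward()                         # backward
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before step
        optimizer.step()                        # step

    model.eval()                                # eval mode for validation
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)
    torch.save({"model": model.state_dict(), "epoch": epoch}, f"ckpt_{epoch}.pt")
    print(f"epoch {epoch}: val_loss {val_loss:.3f}")
```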
Rung 06-Devices, mixed precision, gradient accumulation
- What: `.to('cuda')`, `torch.cuda.amp.autocast`, `torch.compile`, gradient accumulation for large effective batches.
- Why it earns its place: Modern training needs these tricks to fit in memory and to run fast. They're not optional even at small scale.
- Resource: PyTorch "Automatic Mixed Precision" tutorial. Plus the `torch.compile` docs.
- Done when: You can convert a vanilla training loop to AMP + grad accum and verify both correctness and speedup (the core pattern is sketched below).
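The AMP + gradient-accumulation pattern on a toy model; this assumes a CUDA device, and the sizes and accumulation factor are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                  # effective batch = micro-batch * 4

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 512, device="cuda")       # stand-in micro-batch
    with torch.cuda.amp.autocast():              # mixed-precision forward + loss
        loss = model(x).pow(2).mean()
    scaler.scale(loss / accum_steps).backward()  # scale to avoid fp16 gradient underflow

    if (step + 1) % accum_steps == 0:            # update once per accumulation window
        scaler.unscale_(optimizer)               # clip in true (unscaled) gradient units
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```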
Rung 07-Common modules: Linear, Embedding, LayerNorm, Dropout
- What: The building blocks of every transformer. Each has a specific shape behavior.
- Why it earns its place: Every transformer architecture is composed of these. Knowing the shape behavior of each is debugging fluency.
- Resource: PyTorch docs for each (`nn.Linear`, `nn.Embedding`, `nn.LayerNorm`, `nn.Dropout`).
- Done when: You can predict the output shape and parameter count of each given the inputs (try the drill below).
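A shape-and-parameter drill to check yourself against; the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

B, T, D, V = 8, 128, 512, 50_000    # batch, sequence length, model dim, vocab size

emb = nn.Embedding(V, D)            # params: V*D
lin = nn.Linear(D, 4 * D)           # params: D*4D weight + 4D bias
ln = nn.LayerNorm(D)                # params: D scale + D shift
drop = nn.Dropout(0.1)              # params: 0

ids = torch.randint(0, V, (B, T))
x = drop(ln(emb(ids)))              # Embedding: (B, T) -> (B, T, D); LayerNorm/Dropout preserve shape
h = lin(x)                          # Linear acts on the last dim: (B, T, D) -> (B, T, 4D)

print(x.shape, h.shape)
for name, m in [("emb", emb), ("lin", lin), ("ln", ln)]:
    print(name, sum(p.numel() for p in m.parameters()))
```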
Rung 08-Implementing attention
- What: Multi-head attention from scratch using `nn.Linear` and basic tensor ops.
- Why it earns its place: Implementing attention once unlocks the entire transformer field. Reading any modern paper afterwards becomes 10× easier.
- Resource: Karpathy's Zero to Hero lecture 6 ("Let's build GPT"). Plus The Annotated Transformer (Harvard NLP, search "annotated transformer harvard").
- Done when: You can implement multi-head self-attention in <50 lines of PyTorch and explain every line (one such version is sketched below).
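One way to write it, here with a causal mask as in GPT-style models; the names and the single fused QKV projection are implementation choices, not the only layout:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)                      # each (B, T, D)
        # Split heads: (B, T, D) -> (B, n_heads, T, d_head)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (B, h, T, T)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))          # no attending to the future
        out = F.softmax(scores, dim=-1) @ v                         # (B, h, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, D)                  # merge heads back
        return self.proj(out)

attn = MultiHeadSelfAttention(d_model=64, n_heads=4)
print(attn(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```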
Rung 09-Hugging Face transformers library
- What: Pre-built model classes (`AutoModelForCausalLM`), tokenizers (`AutoTokenizer`), the `Trainer` API, generation utilities.
- Why it earns its place: Most of your applied work uses Hugging Face. Reading their source is also a great way to learn idiomatic PyTorch.
- Resource: Hugging Face NLP course (free at huggingface.co/learn).
- Done when: You can load a model, tokenize input, generate output, and inspect attention weights (see the sketch below).
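The end-to-end flow on a small checkpoint; GPT-2 is used here only because it downloads quickly, and the prompt is arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")               # tokenize
with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)  # greedy generation
print(tok.decode(out_ids[0]))

with torch.no_grad():
    out = model(**inputs, output_attentions=True)    # one forward pass with attentions kept
# One attention tensor per layer, each shaped (batch, heads, seq, seq).
print(len(out.attentions), out.attentions[0].shape)
```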
Rung 10-Profiling and debugging
- What: `torch.profiler`, `nvidia-smi`, gradient checking, `torch.autograd.detect_anomaly`.
- Why it earns its place: When training is slow or wrong, these are the only ways out.
- Resource: PyTorch profiler tutorial. Plus the "Common debugging" section of the official docs.
- Done when: You can profile a training step and identify the slowest operation (see the sketch below).
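A minimal `torch.profiler` run on a toy step; the model and step count are placeholders for your real training step:

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)    # also record GPU kernels when available

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):                          # profile a handful of full steps
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()

# Sort by self time to surface the single slowest operation.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```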
Minimum required to leave this sequence
- Implement an MLP from scratch as an `nn.Module`.
- Write a custom `Dataset` and `DataLoader`.
- Write a complete training loop with mixed precision and gradient clipping.
- Implement multi-head self-attention in <50 lines.
- Load and run a Hugging Face causal LM.
- Profile a training step and identify the bottleneck.
Going further
- Deep Learning with PyTorch (Stevens, Antiga, Viehmann): read it cover to cover.
- PyTorch internals blog post by Edward Yang (search "ezyang pytorch internals"): what's under the hood.
- `torch.compile` deep dive: once you have a real training loop you want to make fast.
How this sequence connects to the year
- Months 1–2: rungs 01–06 are the toolkit for every NumPy → PyTorch port you'll do.
- Month 3: rungs 07–09 are how you implement nanoGPT.
- Month 4 onwards: rung 09 (Hugging Face) is the daily driver.
- Month 8: rung 10 (profiling) becomes critical when you're tuning fine-tuning runs.