05-PyTorch
Why this matters in the journey
PyTorch is the lingua franca of modern AI. Every transformer, every fine-tuning script, every research paper's reference implementation: PyTorch. Hugging Face is built on it. vLLM is built on it. Knowing it well is non-negotiable. The goal of this sequence is to take you from "I can write a training loop" to "I can read and modify nanoGPT, debug shape errors instantly, and write efficient custom modules."
The rungs
Rung 01-Tensors
- What: PyTorch tensors are NumPy arrays that live on CPU or GPU and track gradients. Same API surface as NumPy plus `.to(device)` and `.requires_grad`.
- Why it earns its place: Everything in PyTorch is a tensor. Comfort here is the floor.
- Resource: PyTorch official tutorial "Tensors" (search "pytorch tutorials tensors").
- Done when: You can create, reshape, slice, transpose, and matmul tensors fluently and predict shapes without running code (try the shape drill below).
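A quick self-test along those lines; the shapes and variable names here are arbitrary, not taken from the tutorial. Predict each commented shape before running it:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4).to(device)  # (2, 3, 4)
y = x.transpose(1, 2)       # swap the last two dims -> (2, 4, 3)
z = x @ y                   # batched matmul: (2, 3, 4) @ (2, 4, 3) -> (2, 3, 3)
flat = z.reshape(2, -1)     # flatten everything but the batch dim -> (2, 9)
rows = z[:, 0, :]           # slice out the first row of each matrix -> (2, 3)

print(x.shape, y.shape, z.shape, flat.shape, rows.shape)
```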
Rung 02-Autograd
- What: PyTorch tracks operations on tensors that have `requires_grad=True` and computes gradients via `.backward()`.
- Why it earns its place: This is the magic. Understanding it lets you debug "why is my gradient None" and "why is this slow."
- Resource: PyTorch tutorial "A Gentle Introduction to torch.autograd". Plus Karpathy's `micrograd` for the conceptual model.
- Done when: You can hand-trace what `.backward()` does on a small computation graph (like the one sketched below).
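A tiny graph you can trace by hand to check yourself; the values are arbitrary:

```python
import torch

x = torch.tensor(3.0)                      # treated as a constant
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = (w * x + b) ** 2                       # y = (2*3 + 1)^2 = 49
y.backward()                               # populates .grad on w and b

# Chain rule by hand: dy/dw = 2*(w*x + b)*x = 2*7*3 = 42
#                     dy/db = 2*(w*x + b)     = 14
print(w.grad, b.grad)                      # tensor(42.), tensor(14.)
```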
Rung 03-nn.Module and nn.Parameter
- What: A `Module` holds parameters and a `forward` method. Parameters auto-register for gradient tracking and `.to(device)` movement.
- Why it earns its place: All real PyTorch code is structured as `nn.Module`s. Reading model code becomes much easier once you know the convention.
- Resource: PyTorch tutorial "Build the Neural Network."
- Done when: You can write a 2-layer MLP as an `nn.Module` from scratch without referencing docs (one possible version is sketched below).
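One minimal version of that MLP; the layer sizes and the ReLU are arbitrary choices:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # submodules auto-register their parameters
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP(784, 256, 10)
print(sum(p.numel() for p in model.parameters()))  # 784*256 + 256 + 256*10 + 10 = 203530
print(model(torch.randn(32, 784)).shape)           # torch.Size([32, 10])
```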
Rung 04-DataLoader and Dataset
- What: `Dataset` provides items by index; `DataLoader` batches, shuffles, and parallelizes loading.
- Why it earns its place: Real training is bottlenecked by data loading. Knowing how to write a custom `Dataset` is bread-and-butter.
- Resource: PyTorch "Datasets and DataLoaders" tutorial.
- Done when: You can write a custom `Dataset` for a tokenized text corpus that returns `(input_ids, target_ids)` pairs (see the sketch after this list).
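A sketch of that Dataset for next-token prediction; the random stand-in corpus, the class name, and `block_size` are placeholders for your real tokenized data:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenizedTextDataset(Dataset):
    """Serves (input_ids, target_ids) windows from one long token sequence."""
    def __init__(self, tokens: torch.Tensor, block_size: int):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        chunk = self.tokens[idx : idx + self.block_size + 1]
        return chunk[:-1], chunk[1:]       # targets are the inputs shifted by one token

tokens = torch.randint(0, 50_000, (10_000,))   # stand-in for a real tokenized corpus
loader = DataLoader(TokenizedTextDataset(tokens, block_size=128),
                    batch_size=32, shuffle=True, num_workers=2)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)                      # torch.Size([32, 128]) twice
```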
Rung 05-The training loop
- What: The boilerplate: zero grads, forward, loss, backward, step. Plus eval mode, gradient clipping, learning rate schedules.
- Why it earns its place: You'll write this loop hundreds of times. Internalizing it removes friction.
- Resource: PyTorch tutorial "Optimizing Model Parameters." Plus Karpathy's nanoGPT `train.py`; read it line by line.
- Done when: You can write a training loop from scratch with optimizer, loss, gradient clipping, validation, and checkpointing (see the sketch below).
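A self-contained version of that loop on toy data; the model, hyperparameters, and checkpoint names are placeholders, not a recommended recipe:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
train_loader = DataLoader(TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,))),
                          batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randint(0, 10, (64,))),
                        batch_size=32)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()                   # zero grads
        loss = loss_fn(model(inputs), targets)  # forward + loss
        loss.backward()                         # backward
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before step
        optimizer.step()                        # step

    model.eval()                                # eval mode for validation
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)
    torch.save({"model": model.state_dict(), "epoch": epoch}, f"ckpt_{epoch}.pt")
    print(f"epoch {epoch}: val_loss {val_loss:.3f}")
```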
Rung 06-Devices, mixed precision, gradient accumulation
- What: `.to('cuda')`, `torch.cuda.amp.autocast`, `torch.compile`, gradient accumulation for large effective batches.
- Why it earns its place: Modern training needs these tricks to fit in memory and to run fast. They're not optional even at small scale.
- Resource: PyTorch "Automatic Mixed Precision" tutorial. Plus the `torch.compile` docs.
- Done when: You can convert a vanilla training loop to AMP + grad accum and verify both correctness and speedup (the core pattern is sketched below).
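The AMP + gradient-accumulation pattern on a toy model; this assumes a CUDA device, and the sizes and accumulation factor are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                  # effective batch = micro-batch * 4

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 512, device="cuda")       # stand-in micro-batch
    with torch.cuda.amp.autocast():              # mixed-precision forward + loss
        loss = model(x).pow(2).mean()
    scaler.scale(loss / accum_steps).backward()  # scale to avoid fp16 gradient underflow

    if (step + 1) % accum_steps == 0:            # update once per accumulation window
        scaler.unscale_(optimizer)               # clip in true (unscaled) gradient units
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```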
Rung 07-Common modules: Linear, Embedding, LayerNorm, Dropout
- What: The building blocks of every transformer. Each has a specific shape behavior.
- Why it earns its place: Every transformer architecture is composed of these. Knowing the shape behavior of each is debugging fluency.
- Resource: PyTorch docs for each (`nn.Linear`, `nn.Embedding`, `nn.LayerNorm`, `nn.Dropout`).
- Done when: You can predict the output shape and parameter count of each given the inputs (try the drill below).
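A shape-and-parameter drill to check yourself against; the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

B, T, D, V = 8, 128, 512, 50_000    # batch, sequence length, model dim, vocab size

emb = nn.Embedding(V, D)            # params: V*D
lin = nn.Linear(D, 4 * D)           # params: D*4D weight + 4D bias
ln = nn.LayerNorm(D)                # params: D scale + D shift
drop = nn.Dropout(0.1)              # params: 0

ids = torch.randint(0, V, (B, T))
x = drop(ln(emb(ids)))              # Embedding: (B, T) -> (B, T, D); LayerNorm/Dropout preserve shape
h = lin(x)                          # Linear acts on the last dim: (B, T, D) -> (B, T, 4D)

print(x.shape, h.shape)
for name, m in [("emb", emb), ("lin", lin), ("ln", ln)]:
    print(name, sum(p.numel() for p in m.parameters()))
```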
Rung 08-Implementing attention
- What: Multi-head attention from scratch using `nn.Linear` and basic tensor ops.
- Why it earns its place: Implementing attention once unlocks the entire transformer field. Reading any modern paper afterwards becomes 10× easier.
- Resource: Karpathy's Zero to Hero lecture 6 ("Let's build GPT"). Plus The Annotated Transformer (Harvard NLP, search "annotated transformer harvard").
- Done when: You can implement multi-head self-attention in <50 lines of PyTorch and explain every line (one such version is sketched below).
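One way to write it, here with a causal mask as in GPT-style models; the names and the single fused QKV projection are implementation choices, not the only layout:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)                      # each (B, T, D)
        # Split heads: (B, T, D) -> (B, n_heads, T, d_head)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (B, h, T, T)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))          # no attending to the future
        out = F.softmax(scores, dim=-1) @ v                         # (B, h, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, D)                  # merge heads back
        return self.proj(out)

attn = MultiHeadSelfAttention(d_model=64, n_heads=4)
print(attn(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```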
Rung 09-Hugging Face transformers library
- What: Pre-built model classes (`AutoModelForCausalLM`), tokenizers (`AutoTokenizer`), the `Trainer` API, generation utilities.
- Why it earns its place: Most of your applied work uses Hugging Face. Reading their source is also a great way to learn idiomatic PyTorch.
- Resource: Hugging Face NLP course (free at huggingface.co/learn).
- Done when: You can load a model, tokenize input, generate output, and inspect attention weights (see the sketch below).
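The end-to-end flow on a small checkpoint; GPT-2 is used here only because it downloads quickly, and the prompt is arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")               # tokenize
with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)  # greedy generation
print(tok.decode(out_ids[0]))

with torch.no_grad():
    out = model(**inputs, output_attentions=True)    # one forward pass with attentions kept
# One attention tensor per layer, each shaped (batch, heads, seq, seq).
print(len(out.attentions), out.attentions[0].shape)
```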
Rung 10-Profiling and debugging
- What: `torch.profiler`, `nvidia-smi`, gradient checking, `torch.autograd.detect_anomaly`.
- Why it earns its place: When training is slow or wrong, these are the only ways out.
- Resource: PyTorch profiler tutorial. Plus the "Common debugging" section of the official docs.
- Done when: You can profile a training step and identify the slowest operation (see the sketch below).
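A minimal `torch.profiler` run on a toy step; the model and step count are placeholders for your real training step:

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)    # also record GPU kernels when available

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):                          # profile a handful of full steps
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()

# Sort by self time to surface the single slowest operation.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```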
Minimum required to leave this sequence
- Implement an MLP from scratch as an `nn.Module`.
- Write a custom `Dataset` and `DataLoader`.
- Write a complete training loop with mixed precision and gradient clipping.
- Implement multi-head self-attention in <50 lines.
- Load and run a Hugging Face causal LM.
- Profile a training step and identify the bottleneck.
Going further
- Deep Learning with PyTorch (Stevens, Antiga, Viehmann): read it cover to cover.
- PyTorch internals blog post by Edward Yang (search "ezyang pytorch internals"): what's under the hood.
- `torch.compile` deep dive: once you have a real training loop you want to make fast.
How this sequence connects to the year
- Months 1–2: rungs 01–06 are the toolkit for every NumPy → PyTorch port you'll do.
- Month 3: rungs 07–09 are how you implement nanoGPT.
- Month 4 onwards: rung 09 (Hugging Face) is the daily driver.
- Month 8: rung 10 (profiling) becomes critical when you're tuning fine-tuning runs.