
05-PyTorch

Why this matters in the journey

PyTorch is the lingua franca of modern AI. Every transformer, every fine-tuning script, every research paper's reference implementation is written in PyTorch. Hugging Face is built on it. vLLM is built on it. Knowing it well is non-negotiable. The goal of this sequence is to take you from "I can write a training loop" to "I can read and modify nanoGPT, debug shape errors instantly, and write efficient custom modules."

The rungs

Rung 01-Tensors

  • What: PyTorch tensors are NumPy-style arrays that can live on CPU or GPU and track gradients. Same API surface as NumPy plus .to(device) and .requires_grad (warm-up sketch below).
  • Why it earns its place: Everything in PyTorch is a tensor. Comfort here is the floor.
  • Resource: PyTorch official tutorial-"Tensors" (search "pytorch tutorials tensors").
  • Done when: You can create, reshape, slice, transpose, and matmul tensors fluently and predict shapes without running code.
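
A minimal warm-up in this spirit; nothing here is specific to any tutorial, and the shapes in the comments are the point:

```python
import torch

# Tensors look like NumPy arrays but can move devices and track gradients.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(6, dtype=torch.float32).reshape(2, 3).to(device)

# (2, 3) @ (3, 2) -> (2, 2); predicting this without running it is the goal.
y = x @ x.T
print(x.shape, y.shape)        # torch.Size([2, 3]) torch.Size([2, 2])

# Slicing and transposing behave like NumPy; requires_grad is the extra bit.
w = torch.randn(3, 4, device=device, requires_grad=True)
out = x[0] @ w                 # (3,) @ (3, 4) -> (4,)
print(out.shape)               # torch.Size([4])
```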

Rung 02-Autograd

  • What: PyTorch tracks operations on tensors that have requires_grad=True and computes gradients via .backward() (worked example below).
  • Why it earns its place: This is the magic. Understanding it lets you debug "why is my gradient None" and "why is this slow."
  • Resource: PyTorch tutorial "A Gentle Introduction to torch.autograd". Plus Karpathy's micrograd for the conceptual model.
  • Done when: You can hand-trace what .backward() does on a small computation graph.
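
One way to check a hand-trace, on a graph small enough to differentiate on paper:

```python
import torch

# y = (x*w + b)^2 with x held fixed; only w and b need gradients.
x = torch.tensor(2.0)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = (x * w + b) ** 2           # (2*3 + 1)^2 = 49
y.backward()

# Hand-trace via the chain rule:
#   dy/dw = 2*(x*w + b)*x = 2*7*2 = 28
#   dy/db = 2*(x*w + b)   = 2*7   = 14
print(w.grad, b.grad)          # tensor(28.) tensor(14.)
```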

Rung 03-nn.Module and nn.Parameter

  • What: A Module holds parameters and a forward method. Parameters auto-register for gradient tracking and .to(device) movement (sketch below).
  • Why it earns its place: All real PyTorch code is structured as nn.Modules. Reading model code becomes much easier once you know the convention.
  • Resource: PyTorch tutorial "Build the Neural Network."
  • Done when: You can write a 2-layer MLP as an nn.Module from scratch without referencing docs.
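
A sketch of what that exercise might look like (the layer sizes are arbitrary choices, not part of the tutorial):

```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # auto-registered as parameters
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP(784, 256, 10)
# Every parameter is visible to optimizers and moves with .to(device).
print(sum(p.numel() for p in model.parameters()))   # 784*256+256 + 256*10+10 = 203530
```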

Rung 04-DataLoader and Dataset

  • What: Dataset provides items by index; DataLoader batches, shuffles, and parallelizes loading (example below).
  • Why it earns its place: Real training is bottlenecked by data loading. Knowing how to write a custom Dataset is bread-and-butter.
  • Resource: PyTorch "Datasets and DataLoaders" tutorial.
  • Done when: You can write a custom Dataset for a tokenized text corpus that returns (input_ids, target_ids) pairs.
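
One plausible shape for such a Dataset, assuming the corpus has already been tokenized into a single 1-D tensor of ids (the class and variable names are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    """Sliding windows over one long tensor of token ids."""
    def __init__(self, token_ids: torch.Tensor, block_size: int):
        self.token_ids = token_ids
        self.block_size = block_size

    def __len__(self):
        return len(self.token_ids) - self.block_size

    def __getitem__(self, idx):
        chunk = self.token_ids[idx : idx + self.block_size + 1]
        return chunk[:-1], chunk[1:]           # (input_ids, target_ids)

# Fake corpus: 10k random token ids drawn from a 50k vocabulary.
ds = TokenDataset(torch.randint(0, 50_000, (10_000,)), block_size=128)
loader = DataLoader(ds, batch_size=32, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)                        # torch.Size([32, 128]) twice
```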

Rung 05-The training loop

  • What: The boilerplate: zero grads, forward, loss, backward, step. Plus eval mode, gradient clipping, learning rate schedules (skeleton below).
  • Why it earns its place: You'll write this loop hundreds of times. Internalizing it removes friction.
  • Resource: PyTorch tutorial "Optimizing Model Parameters." Plus Karpathy's nanoGPT `train.py - read it line by line.
  • Done when: You can write a training loop from scratch with: optimizer, loss, gradient clipping, validation, checkpointing.
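
A bare-bones skeleton of that loop; the function name, loss choice, and clipping value are illustrative, and validation/checkpointing are omitted to keep it short:

```python
import torch
from torch import nn

def train_one_epoch(model, loader, optimizer, device, max_norm=1.0):
    criterion = nn.CrossEntropyLoss()
    model.train()                              # vs. model.eval() for validation
    for input_ids, target_ids in loader:
        input_ids, target_ids = input_ids.to(device), target_ids.to(device)
        optimizer.zero_grad()                  # zero grads
        logits = model(input_ids)              # forward: (B, T, vocab)
        loss = criterion(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        loss.backward()                        # backward
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()                       # step
```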

Rung 06-Devices, mixed precision, gradient accumulation

  • What: .to('cuda'), torch.cuda.amp.autocast, torch.compile, gradient accumulation for large effective batches (sketch below).
  • Why it earns its place: Modern training requires these tricks to fit and to be fast. They're not optional even at small scale.
  • Resource: PyTorch "Automatic Mixed Precision" tutorial. Plus the torch.compile docs.
  • Done when: You can convert a vanilla training loop to AMP + grad accum and verify both correctness and speedup.
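
Roughly what the conversion looks like, assuming a CUDA device, with a toy linear model and random batches standing in for your real ones (accum_steps=4 is an arbitrary choice):

```python
import torch
from torch import nn

model = nn.Linear(128, 10).cuda()              # stand-in for your real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                # effective batch = 32 * 4

for step in range(100):
    x = torch.randn(32, 128, device="cuda")    # stand-in batch
    y = torch.randint(0, 10, (32,), device="cuda")
    with torch.cuda.amp.autocast():            # forward pass in mixed precision
        loss = criterion(model(x), y) / accum_steps
    scaler.scale(loss).backward()              # grads accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                 # unscales, then optimizer.step()
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```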

Rung 07-Common modules: Linear, Embedding, LayerNorm, Dropout

  • What: The building blocks of every transformer. Each has a specific shape behavior (shape check below).
  • Why it earns its place: Every transformer architecture is composed of these. Knowing the shape behavior of each is debugging fluency.
  • Resource: PyTorch docs for each (nn.Linear, nn.Embedding, nn.LayerNorm, nn.Dropout).
  • Done when: You can predict the output shape and parameter count of each given the inputs.
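
A quick self-test you can run; the batch, sequence, and vocabulary sizes below are arbitrary:

```python
import torch
from torch import nn

x = torch.randn(8, 16, 64)                     # (batch, seq, d_model)
ids = torch.randint(0, 50_000, (8, 16))        # token ids

linear = nn.Linear(64, 128)                    # 64*128 + 128 = 8,320 params
emb = nn.Embedding(50_000, 64)                 # 50,000*64 = 3,200,000 params
ln = nn.LayerNorm(64)                          # 64 + 64 = 128 params (weight, bias)
drop = nn.Dropout(0.1)                         # 0 params

print(linear(x).shape)                         # (8, 16, 128): only the last dim changes
print(emb(ids).shape)                          # (8, 16, 64): ids -> vectors
print(ln(x).shape, drop(x).shape)              # both (8, 16, 64): shape-preserving
```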

Rung 08-Implementing attention

  • What: Multi-head attention from scratch using nn.Linear and basic tensor ops (one compact version below).
  • Why it earns its place: Implementing attention once unlocks the entire transformer field. Reading any modern paper afterwards becomes 10× easier.
  • Resource: Karpathy's Zero to Hero lecture 6 ("Let's build GPT"). Plus The Annotated Transformer (Harvard NLP, search "annotated transformer harvard").
  • Done when: You can implement multi-head self-attention in <50 lines of PyTorch and explain every line.
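
One compact way to write it, here as a causal variant without dropout or a KV cache:

```python
import math
import torch
from torch import nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)     # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, T, C) -> (B, n_heads, T, d_head)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~causal, float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, C)   # merge heads back
        return self.proj(out)

mha = MultiHeadSelfAttention(d_model=64, n_heads=4)
print(mha(torch.randn(2, 10, 64)).shape)       # torch.Size([2, 10, 64])
```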

Rung 09-Hugging Face transformers library

  • What: Pre-built model classes (AutoModelForCausalLM), tokenizers (AutoTokenizer), Trainer API, generation utilities (minimal session below).
  • Why it earns its place: Most of your applied work uses Hugging Face. Reading their source is also a great way to learn idiomatic PyTorch.
  • Resource: Hugging Face NLP course (free at huggingface.co/learn).
  • Done when: You can load a model, tokenize input, generate output, and inspect attention weights.
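
A minimal session, assuming the transformers package is installed and using gpt2 as a small stand-in checkpoint (any causal LM works):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("PyTorch is", return_tensors="pt")
out_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out_ids[0]))

# Attention weights: a tuple with one tensor per layer,
# each of shape (batch, heads, seq, seq).
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
print(len(out.attentions), out.attentions[0].shape)
```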

Rung 10-Profiling and debugging

  • What: torch.profiler, nvidia-smi, gradient checking, torch.autograd.detect_anomaly (starter snippet below).
  • Why it earns its place: When training is slow or wrong, these are the only ways out.
  • Resource: PyTorch profiler tutorial. Plus the "Common debugging" section of the official docs.
  • Done when: You can profile a training step and identify the slowest operation.
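
A starting point with torch.profiler, assuming a CUDA device; the toy MLP stands in for your real model:

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(x).sum()                      # one forward...
    loss.backward()                            # ...and backward step

# Sort by GPU time; the top rows are your bottleneck candidates.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```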

Minimum required to leave this sequence

  • Implement an MLP from scratch as nn.Module.
  • Write a custom Dataset and DataLoader.
  • Write a complete training loop with mixed precision and gradient clipping.
  • Implement multi-head self-attention in <50 lines.
  • Load and run a Hugging Face causal LM.
  • Profile a training step and identify the bottleneck.

Going further

  • Deep Learning with PyTorch (Stevens, Antiga, Viehmann)-read cover to cover.
  • PyTorch internals blog by Edward Yang (search "ezyang pytorch internals")-what's under the hood.
  • torch.compile deep dive-once you have a real training loop you want to make fast.

How this sequence connects to the year

  • Months 1–2: rungs 01–06 are the toolkit for every NumPy → PyTorch port you'll do.
  • Month 3: rungs 07–09 are how you implement nanoGPT.
  • Month 4 onwards: rung 09 (Hugging Face) is the daily driver.
  • Month 8: rung 10 (profiling) becomes critical when you're tuning fine-tuning runs.
