
Week 4 - The Honest Training Loop

4.1 Conceptual Core

  • A "real" training loop has ~15 concerns most tutorials skip:
  • Data loading (parallel, prefetched, shuffled, sharded).
  • Forward + backward + optimizer step.
  • Mixed-precision casting (AMP).
  • Gradient accumulation (effective batch > per-step batch).
  • Gradient clipping (NaN prevention).
  • LR schedule.
  • Checkpointing (model, optimizer, scheduler, RNG state).
  • Resumption (idempotent, exact reproduction).
  • Validation loop (no grad, EMA where applicable).
  • Metrics logging (loss, throughput, GPU util, memory).
  • Determinism flags (torch.use_deterministic_algorithms for debugging).
  • Error handling (NaN detection, OOM recovery).
  • Multi-GPU coordination (preview; full in Month 4).
  • Early stopping and best-checkpoint tracking.
  • Run metadata (commit, config, hardware fingerprint).
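
A minimal sketch of how the core concerns compose in one step, assuming model, optimizer, scheduler, loader, and loss_fn are defined elsewhere (all names and hyperparameters here are placeholders, not a definitive implementation):

```python
import torch

scaler = torch.cuda.amp.GradScaler()      # FP16 loss scaling; disable for BF16
ACCUM_STEPS = 4                           # effective batch = 4x per-step batch

for step, (x, y) in enumerate(loader):
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    with torch.cuda.amp.autocast():       # mixed-precision forward
        loss = loss_fn(model(x), y) / ACCUM_STEPS   # rescale for accumulation
    scaler.scale(loss).backward()         # grads accumulate across micro-steps

    if (step + 1) % ACCUM_STEPS == 0:
        scaler.unscale_(optimizer)        # so clipping sees true gradient norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)            # skips the step if grads overflowed
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()

    if step % 1000 == 0:                  # checkpoint everything needed to resume
        torch.save({
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "scaler": scaler.state_dict(),
            "rng": torch.get_rng_state(),
            "cuda_rng": torch.cuda.get_rng_state_all(),
            "step": step,
        }, f"ckpt_{step:07d}.pt")
```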

4.2 Mechanical Detail

  • torch.utils.data.Dataset + DataLoader: the canonical data path. num_workers > 0 for parallel loading; pin_memory=True for faster host-to-device (H2D) copies; prefetch_factor to tune how far workers read ahead (construction example after this list).
  • torch.cuda.amp.autocast + GradScaler for mixed-precision FP16 training. For BF16, autocast alone is sufficient (no loss scaling needed; BF16's exponent range matches FP32's).
  • torch.utils.checkpoint.checkpoint for gradient checkpointing: trade compute (an extra forward pass) for memory (activations are not stored). Essential for fitting large models (see the sketch after this list).
  • Determinism: setting torch.manual_seed, numpy.random.seed, random.seed, torch.use_deterministic_algorithms(True), and CUBLAS_WORKSPACE_CONFIG=:4096:8 is not always enough; some kernels are inherently non-deterministic. Document what you achieve (seed-setting helper after this list).
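
A typical construction of that data path (train_set is a placeholder Dataset):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,          # reshuffles every epoch
    num_workers=4,         # parallel worker processes
    pin_memory=True,       # page-locked host memory -> faster H2D copies
    prefetch_factor=2,     # batches each worker keeps ahead (needs num_workers > 0)
)
```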
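
For gradient checkpointing, a minimal sketch, assuming the model decomposes into a list of blocks:

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Activations inside each block are recomputed during backward
    # instead of stored: one extra forward pass buys the memory back.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```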
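
And a seed-setting helper collecting the determinism flags above (call it once, before any CUDA work):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must precede cuBLAS init
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # seeds CPU and all CUDA devices
    torch.use_deterministic_algorithms(True)   # raise on non-deterministic kernels
```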

4.3 Lab - "Train Something Small, Right"

Train a 1-layer transformer (or a 2-layer MLP if a transformer is too far) on TinyShakespeare or MNIST. Required:

  • Dataset class + DataLoader with num_workers=4, pin_memory=True.
  • AMP autocast (BF16 on Ampere+, FP16 with GradScaler on older GPUs).
  • LR schedule (warmup + cosine; a sketch follows this list).
  • Checkpoint every N steps; resuming from any checkpoint must reproduce the subsequent loss exactly (within 1e-5).
  • Per-step metrics: loss, tokens/sec, GPU memory, GPU utilization %.
  • Final report: train + val loss curves, throughput, peak memory, total cost in $ (compute hours × $/hr).
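
A minimal warmup + cosine schedule, written as a plain function (names are placeholders; you could equally compose PyTorch's built-in schedulers):

```python
import math

def lr_at(step: int, max_lr: float, warmup_steps: int, total_steps: int) -> float:
    # Linear warmup to max_lr, then cosine decay to zero.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Applied per step, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = lr_at(step, 3e-4, 500, 10_000)
```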

4.4 Idiomatic & Diagnostic Drill

  • torch.profiler.profile for one training step. Read the output. Identify what fraction of time is spent in: forward, backward, optimizer step, data loading, idle.
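
A starting point, where train_one_step is a hypothetical stand-in for your forward/backward/optimizer code:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_one_step()  # one full step: data fetch, forward, backward, optimizer
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```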

4.5 Production Slice

  • Ship the training script as a reproducible Docker image. Pin Python, PyTorch, CUDA, and NumPy versions. Embed git rev-parse HEAD into the run metadata (helper sketch below). This is the foundation for every later lab.
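
A sketch of the metadata capture, assuming the script runs from inside the git checkout:

```python
import json
import platform
import subprocess
import sys

import torch

def run_metadata() -> dict:
    # Fingerprint the exact code and environment that produced this run.
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    return {
        "commit": commit,
        "python": sys.version,
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        "platform": platform.platform(),
    }

# json.dump(run_metadata(), open("run_meta.json", "w"), indent=2)
```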

Month 1 Capstone Deliverable

A foundations/ directory containing:

  1. roofline-sketch/ (week 1): measured roofline plot for your laptop and one GPU.
  2. three-matmuls/ (week 2): naive vs NumPy timings, with a markdown writeup.
  3. micrograd/ (week 3): your AD from scratch.
  4. honest-training-loop/ (week 4): the full training scaffold, reproducible.

By the end of Month 1 you should be comfortable in PyTorch, not a wizard. Comfort is what Month 2's GPU work assumes.


Month 1 Reading

  • The roofline paper (Williams, Waterman & Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009).
  • Karpathy's micrograd and nanoGPT (the latter is the right level of implementation depth for the year ahead).
  • The PyTorch autograd documentation, end to end.
  • Goodfellow et al., Deep Learning, chapters 6 and 8.
