Week 4 - The Honest Training Loop
4.1 Conceptual Core
- A "real" training loop has ~15 concerns most tutorials skip:
- Data loading (parallel, prefetched, shuffled, sharded).
- Forward + backward + optimizer step.
- Mixed-precision casting (AMP).
- Gradient accumulation (effective batch > per-step batch).
- Gradient clipping (NaN-prevention).
- LR schedule.
- Checkpointing (model, optimizer, scheduler, RNG state).
- Resumption (idempotent, exact reproduction).
- Validation loop (no grad, EMA where applicable).
- Metrics logging (loss, throughput, GPU util, memory).
- Determinism flags (`torch.use_deterministic_algorithms` for debugging).
- Error handling (NaN detection, OOM recovery).
- Multi-GPU coordination (preview; full in Month 4).
- Early stopping and best-checkpoint tracking.
- Run metadata (commit, config, hardware fingerprint).
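A minimal sketch of how these concerns slot into one loop iteration, on the FP16 `GradScaler` path. `model`, `train_loader`, `optimizer`, and `scheduler` are assumed to exist, and `model(x, y)` is assumed to return the loss directly; this is a skeleton, not a drop-in implementation.

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # mixed precision (FP16 path)
accum_steps = 4                        # gradient accumulation
max_norm = 1.0                         # gradient clipping threshold

for step, (x, y) in enumerate(train_loader):               # data loading
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    with torch.cuda.amp.autocast():                         # AMP casting
        loss = model(x, y) / accum_steps                    # forward
    if torch.isnan(loss):                                   # NaN detection
        raise RuntimeError(f"NaN loss at step {step}")
    scaler.scale(loss).backward()                           # backward
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                          # clip on real grads
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        scaler.step(optimizer)                              # optimizer step
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()                                    # LR schedule
    if step % 1000 == 0:                                    # checkpointing
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "scheduler": scheduler.state_dict(),
                    "scaler": scaler.state_dict(),
                    "rng": torch.get_rng_state(),
                    "step": step}, f"ckpt_{step:07d}.pt")
```

Validation, metrics, resumption, and multi-GPU coordination wrap around this core; the point is that even the "simple" part already has this much structure.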
4.2 Mechanical Detail
- `torch.utils.data.Dataset` + `DataLoader`: the canonical data path. `num_workers > 0` for parallel loading; `pin_memory=True` for faster host-to-device (H2D) copies; `prefetch_factor` for tuning lookahead.
- `torch.cuda.amp.autocast` + `GradScaler` for mixed-precision FP16 training. For BF16, autocast alone is sufficient (no scaling needed; the dynamic range is enough).
- `torch.utils.checkpoint.checkpoint` for gradient checkpointing: trade compute (an extra forward pass) for memory (activations are not stored). Essential for fitting large models.
- Determinism: setting `torch.manual_seed`, `numpy.random.seed`, `random.seed`, `torch.use_deterministic_algorithms(True)`, and `CUBLAS_WORKSPACE_CONFIG=:4096:8` is not always enough; some kernels are inherently non-deterministic. Document what you achieve. A sketch of these pieces follows.
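A minimal sketch of the pieces named above. `dataset`, `model`, `batch`, `block`, and `activations` are placeholders, not real objects from this lab.

```python
import os
import random

import numpy as np
import torch
from torch.utils.data import DataLoader
from torch.utils.checkpoint import checkpoint

# Determinism: seeds plus the deterministic-algorithms flag. The CUBLAS
# variable must be set before CUDA is initialized; some kernels remain
# non-deterministic regardless.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

# Canonical data path: parallel workers, pinned host memory, tuned prefetch.
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True, prefetch_factor=2)

# BF16 autocast needs no GradScaler; its exponent range matches FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(batch)

# Gradient checkpointing: recompute `block` during backward instead of
# storing its activations, trading an extra forward pass for memory.
out = checkpoint(block, activations, use_reentrant=False)
```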
4.3 Lab-"Train Something Small, Right"¶
Train a 1-layer transformer (or a 2-layer MLP if a transformer is too big a step) on TinyShakespeare or MNIST. Required:
- Dataset class + DataLoader with `num_workers=4`, `pin_memory=True`.
- AMP autocast (BF16 on Ampere+, FP16 with GradScaler on older).
- LR schedule (warmup + cosine); a minimal scheduler sketch follows this list.
- Checkpoint every N steps; able to resume from any checkpoint and produce identical loss thereafter (within 1e-5).
- Per-step metrics: loss, tokens/sec, GPU memory, GPU util%.
- Final report: train + val loss curves, throughput, peak memory, total cost in $ (compute hours × $/hr).
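The warmup + cosine requirement can be met with a single `LambdaLR`; a minimal sketch, assuming `optimizer` is already built and `warmup_steps`/`total_steps` are chosen for your run:

```python
import math

import torch

def lr_lambda(step, warmup_steps=100, total_steps=10_000):
    """Multiplier on the base LR: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step, and save/restore
# scheduler.state_dict() in your checkpoints so resumption is exact.
```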
4.4 Idiomatic & Diagnostic Drill
Run `torch.profiler.profile` for one training step. Read the output. Identify what fraction of time is spent in: forward, backward, optimizer step, data loading, idle.
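A sketch of the drill; `train_step(batch)` stands in for your forward + backward + optimizer call:

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    train_step(batch)  # one full training step

# Sort by CUDA time to see where the step actually goes.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

Data-loading and idle time show up as gaps between kernels; the trace view (`prof.export_chrome_trace("step.json")`) makes those easier to see than the table.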
4.5 Production Slice
- Ship the training script as a reproducible Docker image. Pin Python, PyTorch, CUDA, and NumPy versions. Embed `git rev-parse HEAD` into the run metadata. This is the foundation for every later lab.
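A minimal sketch of capturing that metadata at launch; the output filename and the exact fields are up to you:

```python
import json
import platform
import subprocess

import torch

metadata = {
    "commit": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
}
with open("run_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Bake the same pinned versions into the Dockerfile so the image and the metadata can never disagree.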
Month 1 Capstone Deliverable
A `foundations/` directory containing:
1. `roofline-sketch/` (Week 1): measured roofline plot for your laptop and one GPU.
2. `three-matmuls/` (Week 2): naive vs NumPy timings, with a markdown writeup.
3. `micrograd/` (Week 3): your AD engine from scratch.
4. `honest-training-loop/` (Week 4): the full training scaffold, reproducible.
By the end of Month 1 you should be comfortable in PyTorch, not a wizard. Comfort is what Month 2's GPU work assumes.
Recommended Reading Done This Month
- The roofline paper (Williams, Waterman, and Patterson, 2009).
- Karpathy's `micrograd` and `nanoGPT` (the latter is the right level of implementation depth for the year ahead).
- The PyTorch autograd documentation, end to end.
- Goodfellow et al., Deep Learning, chapters 6 and 8.