
Week 4 - The Honest Training Loop

4.1 Conceptual Core

  • A "real" training loop has ~15 concerns most tutorials skip:
  • Data loading (parallel, prefetched, shuffled, sharded).
  • Forward + backward + optimizer step.
  • Mixed-precision casting (AMP).
  • Gradient accumulation (effective batch > per-step batch).
  • Gradient clipping (NaN prevention).
  • LR schedule.
  • Checkpointing (model, optimizer, scheduler, RNG state).
  • Resumption (idempotent, exact reproduction).
  • Validation loop (no grad, EMA where applicable).
  • Metrics logging (loss, throughput, GPU util, memory).
  • Determinism flags (torch.use_deterministic_algorithms for debugging).
  • Error handling (NaN detection, OOM recovery).
  • Multi-GPU coordination (preview; full in Month 4).
  • Early stopping and best-checkpoint tracking.
  • Run metadata (commit, config, hardware fingerprint).
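
A minimal sketch of how the core concerns compose in one step, assuming model, optimizer, scheduler, loader, and loss_fn are defined elsewhere (all names and hyperparameters here are placeholders, not a definitive implementation):

```python
import torch

scaler = torch.cuda.amp.GradScaler()      # FP16 loss scaling; disable for BF16
ACCUM_STEPS = 4                           # effective batch = 4x per-step batch

for step, (x, y) in enumerate(loader):
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    with torch.cuda.amp.autocast():       # mixed-precision forward
        loss = loss_fn(model(x), y) / ACCUM_STEPS   # rescale for accumulation
    scaler.scale(loss).backward()         # grads accumulate across micro-steps

    if (step + 1) % ACCUM_STEPS == 0:
        scaler.unscale_(optimizer)        # so clipping sees true gradient norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)            # skips the step if grads overflowed
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()

    if step % 1000 == 0:                  # checkpoint everything needed to resume
        torch.save({
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "scaler": scaler.state_dict(),
            "rng": torch.get_rng_state(),
            "cuda_rng": torch.cuda.get_rng_state_all(),
            "step": step,
        }, f"ckpt_{step:07d}.pt")
```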

4.2 Mechanical Detail

  • torch.utils.data.Dataset + DataLoader: the canonical data path. num_workers > 0 for parallel loading; pin_memory=True for faster host-to-device (H2D) copies; prefetch_factor to tune how far workers read ahead (construction example after this list).
  • torch.cuda.amp.autocast + GradScaler for mixed-precision FP16 training. For BF16, autocast alone is sufficient (no loss scaling needed; BF16's exponent range matches FP32's).
  • torch.utils.checkpoint.checkpoint for gradient checkpointing: trade compute (an extra forward pass) for memory (activations are not stored). Essential for fitting large models (see the sketch after this list).
  • Determinism: setting torch.manual_seed, numpy.random.seed, random.seed, torch.use_deterministic_algorithms(True), and CUBLAS_WORKSPACE_CONFIG=:4096:8 is not always enough; some kernels are inherently non-deterministic. Document what you achieve (seed-setting helper after this list).
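
A typical construction of that data path (train_set is a placeholder Dataset):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,          # reshuffles every epoch
    num_workers=4,         # parallel worker processes
    pin_memory=True,       # page-locked host memory -> faster H2D copies
    prefetch_factor=2,     # batches each worker keeps ahead (needs num_workers > 0)
)
```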
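
For gradient checkpointing, a minimal sketch, assuming the model decomposes into a list of blocks:

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Activations inside each block are recomputed during backward
    # instead of stored: one extra forward pass buys the memory back.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```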
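
And a seed-setting helper collecting the determinism flags above (call it once, before any CUDA work):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must precede cuBLAS init
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # seeds CPU and all CUDA devices
    torch.use_deterministic_algorithms(True)   # raise on non-deterministic kernels
```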

4.3 Lab - "Train Something Small, Right"

Train a 1-layer transformer (or a 2-layer MLP if a transformer is too far) on TinyShakespeare or MNIST. Required:

  • Dataset class + DataLoader with num_workers=4, pin_memory=True.
  • AMP autocast (BF16 on Ampere+, FP16 with GradScaler on older GPUs).
  • LR schedule (warmup + cosine; a sketch follows this list).
  • Checkpoint every N steps; resuming from any checkpoint must reproduce the subsequent loss exactly (within 1e-5).
  • Per-step metrics: loss, tokens/sec, GPU memory, GPU utilization %.
  • Final report: train + val loss curves, throughput, peak memory, total cost in $ (compute hours × $/hr).
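
A minimal warmup + cosine schedule, written as a plain function (names are placeholders; you could equally compose PyTorch's built-in schedulers):

```python
import math

def lr_at(step: int, max_lr: float, warmup_steps: int, total_steps: int) -> float:
    # Linear warmup to max_lr, then cosine decay to zero.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Applied per step, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = lr_at(step, 3e-4, 500, 10_000)
```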

4.4 Idiomatic & Diagnostic Drill

  • torch.profiler.profile for one training step. Read the output. Identify what fraction of time is spent in: forward, backward, optimizer step, data loading, idle.
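
A starting point, where train_one_step is a hypothetical stand-in for your forward/backward/optimizer code:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_one_step()  # one full step: data fetch, forward, backward, optimizer
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```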

4.5 Production Slice

  • Ship the training script as a reproducible Docker image. Pin Python, PyTorch, CUDA, and NumPy versions. Embed git rev-parse HEAD into the run metadata (helper sketch below). This is the foundation for every later lab.
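
A sketch of the metadata capture, assuming the script runs from inside the git checkout:

```python
import json
import platform
import subprocess
import sys

import torch

def run_metadata() -> dict:
    # Fingerprint the exact code and environment that produced this run.
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    return {
        "commit": commit,
        "python": sys.version,
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        "platform": platform.platform(),
    }

# json.dump(run_metadata(), open("run_meta.json", "w"), indent=2)
```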

Month 1 Capstone Deliverable

A foundations/ directory containing:

  1. roofline-sketch/ (week 1): measured roofline plot for your laptop and one GPU.
  2. three-matmuls/ (week 2): naive vs NumPy timings, with a markdown writeup.
  3. micrograd/ (week 3): your AD from scratch.
  4. honest-training-loop/ (week 4): the full training scaffold, reproducible.

By the end of Month 1 you should be comfortable in PyTorch, not a wizard. Comfort is what Month 2's GPU work assumes.


Month 1 Reading

  • The roofline paper (Williams, Waterman & Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009).
  • Karpathy's micrograd and nanoGPT (the latter is the right level of implementation depth for the year ahead).
  • The PyTorch autograd documentation, end to end.
  • Goodfellow et al., Deep Learning, chapters 6 and 8.
