16-Distributed Training
Why this matters in the journey
You almost certainly will not pretrain a frontier model. But you absolutely need to read the papers, understand the DDP/FSDP/ZeRO concepts, recognize when a problem is communication-bound, and know what makes training expensive. This sequence is breadth-first: concept depth, not implementation depth. It's also the sequence that most clearly leverages your distributed-systems background.
The rungs
Rung 01-Why distributed training
- What: A 70B model in fp16 needs 140GB just for weights. Add gradients and Adam optimizer states (fp32 master weights, momentum, variance) and you're at roughly 16 bytes per parameter, over 1TB of training state. Single-GPU training tops out at a few billion parameters; tricks like QLoRA stretch fine-tuning further, but not full training (see the worked sketch after this list).
- Why it earns its place: Frame the necessity before the techniques.
- Resource: "Scaling Laws for Neural Language Models" (Kaplan et al., arxiv.org/abs/2001.08361). Plus a memory accounting walkthrough-search "transformer math 101 eleuther".
- Done when: You can compute memory requirements for a training run from model size and optimizer choice.
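To make the accounting concrete, here is a rough sketch, assuming the standard mixed-precision Adam breakdown from the ZeRO paper (bf16 weights and gradients, fp32 master weights, fp32 momentum and variance); activations and buffers are ignored.

```python
def training_memory_gb(n_params: float) -> float:
    """Rough per-replica training memory in decimal GB, ignoring activations."""
    bytes_per_param = (
        2      # bf16 weights
        + 2    # bf16 gradients
        + 4    # fp32 master weights
        + 4    # Adam momentum (fp32)
        + 4    # Adam variance (fp32)
    )          # = 16 bytes/param, the ZeRO paper's accounting
    return n_params * bytes_per_param / 1e9

for name, size in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{training_memory_gb(size):,.0f} GB before activations")
# 7B  -> ~112 GB   (already more than one 80 GB GPU)
# 70B -> ~1,120 GB (the 140 GB quoted above is just the 2-byte weights)
```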
Rung 02-Data parallelism (DDP)
- What: Each GPU holds the full model. Each gets a different mini-batch. Gradients are all-reduced across GPUs.
- Why it earns its place: Simplest distributed strategy. Foundation of everything else.
- Resource: PyTorch DDP tutorial (search "pytorch ddp tutorial"). Plus the PyTorch Distributed Overview.
- Done when: You can explain DDP and where its bottleneck is (gradient all-reduce bandwidth); a minimal skeleton follows below.
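A minimal sketch of a DDP training script, assuming a single node launched with `torchrun --nproc_per_node=4 train_ddp.py` (the script name, model, and batch are stand-ins).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()     # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])    # full replica on every GPU
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")   # each rank sees its own batch
        loss = model(x).pow(2).mean()
        loss.backward()                            # gradient all-reduce happens here,
        opt.step()                                 # overlapped with the backward pass
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```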
Rung 03-Tensor parallelism
- What: Split each layer's weights across GPUs. Megatron-LM is the canonical implementation.
- Why it earns its place: Required when a single layer doesn't fit on a single GPU.
- Resource: Megatron-LM paper (arxiv.org/abs/1909.08053). Plus the Megatron-LM repo README.
- Done when: You can explain why attention's QKV projections are easy to tensor-parallelize, as the sketch below illustrates.
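A single-process sketch of the idea (the shards stand in for GPUs): splitting the query projection column-wise is the same as giving each GPU a subset of heads, and each shard's matmul needs no communication.

```python
import torch

d_model, n_heads, n_shards = 1024, 16, 4
x = torch.randn(8, d_model)                    # identical activations on every rank
w_q = torch.randn(d_model, d_model)            # query projection (same story for K, V)

shards = torch.chunk(w_q, n_shards, dim=1)     # 4 heads' worth of columns per shard
partial_q = [x @ w for w in shards]            # each "GPU" computes only its heads;
                                               # attention then proceeds per-head locally
recombined = torch.cat(partial_q, dim=1)
assert torch.allclose(recombined, x @ w_q, atol=1e-5)
# In Megatron-LM the matching all-reduce happens only after the row-parallel
# output projection, so the whole attention block costs one all-reduce in forward.
```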
Rung 04-Pipeline parallelism
- What: Split model layers across GPU stages. Microbatches flow through the pipeline.
- Why it earns its place: Used when tensor-parallel + data-parallel isn't enough. Bubble overhead is the cost.
- Resource: GPipe paper (arxiv.org/abs/1811.06965). PipeDream paper.
- Done when: You can explain pipeline bubbles and 1F1B scheduling at a conceptual level; the back-of-the-envelope below helps.
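A back-of-the-envelope for the bubble, using the standard GPipe-style formula: with p stages and m microbatches the idle fraction is (p - 1) / (m + p - 1).

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 8, 32):
    print(f"p=4 stages, m={m:>2} microbatches -> bubble {bubble_fraction(4, m):.0%}")
# m=1  -> 75%   (pipeline is mostly idle)
# m=4  -> 43%
# m=8  -> 27%
# m=32 -> 9%    (more microbatches amortize the bubble; 1F1B keeps the same bubble
#                but caps how many microbatches' activations are alive at once)
```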
Rung 05-ZeRO (DeepSpeed)
- What: Shard optimizer states, gradients, and parameters across GPUs. Stages 1, 2, 3 progressively shard more.
- Why it earns its place: ZeRO-3 is the dominant approach for training models that don't fit on a single GPU.
- Resource: ZeRO paper (arxiv.org/abs/1910.02054). DeepSpeed docs.
- Done when: You can explain what each ZeRO stage shards and the tradeoffs (see the accounting sketch below).
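A sketch of the per-GPU accounting, using the ZeRO paper's 16-bytes-per-parameter baseline (2 weights + 2 gradients + 12 optimizer); the GPU count and model size are illustrative.

```python
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= n_gpus          # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus          # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus        # stage 3: also shard the parameters themselves
    return weights + grads + optim

n = 64
for stage in (0, 1, 2, 3):
    per_param = zero_bytes_per_param(stage, n)
    print(f"stage {stage}: {per_param:5.2f} B/param -> {per_param * 70e9 / 1e9:6.1f} GB per GPU for 70B")
# stage 0: 16.00 B/param -> 1120.0 GB per GPU (plain DDP: impossible)
# stage 3:  0.25 B/param ->   17.5 GB per GPU (plus activations and comm buffers)
```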
Rung 06-FSDP (PyTorch native)
- What: Fully Sharded Data Parallel, PyTorch's native equivalent to ZeRO-3. Shards parameters and gathers them just-in-time, layer by layer.
- Why it earns its place: FSDP is the modern PyTorch standard. Hugging Face Accelerate uses it.
- Resource: FSDP paper (arxiv.org/abs/2304.11277). PyTorch FSDP tutorial.
- Done when: You can explain FSDP's wrapping policies and when to use them; a wrapping sketch follows below.
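A minimal wrapping sketch, assuming a torchrun launch and a toy TransformerBlock standing in for a real layer; the policy decides which submodules become their own shard/gather units.

```python
import functools
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class TransformerBlock(nn.Module):              # stand-in for a real block
    def __init__(self, d=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, 8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        x = x + self.attn(x, x, x)[0]
        return x + self.mlp(x)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[TransformerBlock() for _ in range(8)]).cuda()

# Wrap per transformer block: parameters are gathered just-in-time, one block at
# a time, instead of keeping the whole model materialized on every GPU.
policy = functools.partial(transformer_auto_wrap_policy,
                           transformer_layer_cls={TransformerBlock})
model = FSDP(model,
             auto_wrap_policy=policy,
             mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                            reduce_dtype=torch.bfloat16))
```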
Rung 07-3D parallelism
- What: Combine data, tensor, and pipeline parallelism. Used for the largest models. Choosing the split along each axis is a hyperparameter search problem in itself.
- Why it earns its place: Frontier-scale training. Reading-only depth for most engineers.
- Resource: Megatron-DeepSpeed documentation. Plus GPT-NeoX paper for an open-source reference.
- Done when: You can sketch a 3D parallelism layout for an 8-GPU setup, like the one below.
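One possible 8-GPU layout, sketched as a rank-to-coordinates mapping; putting tensor parallelism innermost (on the fastest links, e.g. NVLink within a node) follows common Megatron-LM practice.

```python
TP, PP, DP = 2, 2, 2      # 2 x 2 x 2 = 8 GPUs

for rank in range(TP * PP * DP):
    tp = rank % TP                    # tensor-parallel neighbor (shares layer shards)
    pp = (rank // TP) % PP            # pipeline stage
    dp = rank // (TP * PP)            # data-parallel replica
    print(f"rank {rank}: tp={tp} pp={pp} dp={dp}")
# Tensor-parallel all-reduces stay inside each tp pair, activations flow between
# pipeline stages, and gradient all-reduce runs within each of the 4 data-parallel
# groups (ranks that share the same tp and pp coordinates).
```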
Rung 08-Mixed precision and bf16
- What: Train in lower precision (fp16, bf16) with master weights kept in fp32. bf16 is the modern default: same dynamic range as fp32, half the memory.
- Why it earns its place: Required for training to be fast and to fit in memory.
- Resource: PyTorch AMP docs. Plus the BF16 vs FP16 comparison in PaLM and Gopher papers.
- Done when: You can explain why bf16 is preferred over fp16 for stability (the sketch below spells out the range argument).
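A minimal bf16 sketch with PyTorch autocast; the model and shapes are stand-ins. The stability point is in the comments: bf16 keeps fp32's 8 exponent bits, so no loss scaling is needed.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()        # fp32 master weights live here
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()          # matmuls run in bf16
loss.backward()
opt.step()
opt.zero_grad()

# fp16 has only 5 exponent bits (max ~65,504), so large activations or gradients
# overflow and a GradScaler is required; bf16 trades mantissa bits for fp32's
# exponent range, which is why it trains more stably with no loss scaling.
```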
Rung 09-Activation checkpointing / gradient checkpointing
- What: Don't store activations during forward pass; recompute them during backward. Trade compute for memory.
- Why it earns its place: Standard memory-saving technique. Knowing it lets you fit larger batches.
- Resource: PyTorch checkpoint utilities docs.
- Done when: You can explain when to use gradient checkpointing (see the sketch below).
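A minimal sketch with torch.utils.checkpoint; the block and shapes are stand-ins.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()

x = torch.randn(64, 4096, device="cuda", requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)   # forward stores only x, not the
loss = y.pow(2).mean()                          # intermediate activations
loss.backward()                                 # block's forward is re-run here

# Rule of thumb: worth it when activation memory (batch x seq x hidden x layers)
# dominates and you can afford roughly a third more compute to fit a bigger batch.
```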
Rung 10-Communication and networking
- What: NCCL, RDMA, all-reduce algorithms (ring, tree). InfiniBand vs Ethernet. Your distributed-systems instincts are gold here.
- Why it earns its place: Communication is the bottleneck of distributed training. Understanding it sets you apart.
- Resource: NVIDIA NCCL docs. Plus the Bandwidth Optimal All-Reduce literature.
- Done when: You can explain ring all-reduce and why bandwidth (not latency) matters most for it; the arithmetic below makes it concrete.
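A back-of-the-envelope for why bandwidth dominates, using the standard ring all-reduce volume of 2(N - 1)/N times the buffer size per GPU; the link speeds are illustrative.

```python
def allreduce_seconds(grad_bytes: float, n_gpus: int, bandwidth_gbps: float) -> float:
    volume = 2 * (n_gpus - 1) / n_gpus * grad_bytes    # bytes each GPU sends/receives
    return volume / (bandwidth_gbps * 1e9 / 8)         # Gbit/s -> bytes/s

grad_bytes = 7e9 * 2                                   # 7B params, bf16 gradients
for n, bw in [(8, 400), (8, 3200), (64, 3200)]:
    print(f"{n} GPUs @ {bw} Gbit/s: ~{allreduce_seconds(grad_bytes, n, bw):.2f}s per all-reduce")
# Note the time barely changes with N: for large messages it is ~2S/B, so link
# bandwidth, not latency, is what matters. At 400 Gbit/s the gradient sync alone
# takes ~0.5s per step; overlap it with the backward pass (as DDP does) or it
# becomes the bottleneck.
```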
Rung 11-Practical: training a tiny model with FSDP
- What: Stand up a multi-GPU FSDP training job (even on rented GPUs from RunPod / Lambda Labs).
- Why it earns its place: Reading vs doing. One real run is worth ten papers.
- Resource: PyTorch FSDP examples. Plus Hugging Face Accelerate.
- Done when: You've run a multi-GPU job and observed scaling behavior; the Accelerate loop below is a starting point.
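A minimal Accelerate loop as a sketch (model and data are stand-ins): the same script runs on one GPU or many, the DDP/FSDP choice lives in `accelerate config`, and you launch it on the rented box with `accelerate launch train.py`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(4096, 1024)), batch_size=32)

model, opt, loader = accelerator.prepare(model, opt, loader)  # wraps per your config

for (x,) in loader:                    # batches arrive on the right device,
    loss = model(x).pow(2).mean()      # split across processes
    accelerator.backward(loss)
    opt.step()
    opt.zero_grad()

# To observe scaling behavior, log tokens/sec per GPU at world sizes 1, 2, and 4
# and watch where communication starts eating into it.
```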
Minimum required to leave this sequence
- Compute memory requirements for a training run.
- Explain DDP, FSDP, ZeRO.
- Distinguish tensor, pipeline, data parallelism.
- Explain bf16 and gradient checkpointing.
- Run a multi-GPU FSDP job on rented hardware.
Going further
- EleutherAI's "Transformer Math 101" blog post (search for it).
- Megatron-LM and DeepSpeed source code: read it.
- Stanford CS336 lectures on distributed training.
How this sequence connects to the year
- Month 9: Rungs 01–06 are the conceptual core.
- Q3 Track C: Operational depth if you go inference / infra.
- Reading frontier papers: The vocabulary of every major lab's tech reports.