16-Distributed Training
Why this matters in the journey
You almost certainly will not pretrain a frontier model. But you absolutely need to read the papers, understand the DDP/FSDP/ZeRO concepts, recognize when a problem is communication-bound, and know what makes training expensive. This sequence is breadth-first: concept depth, not implementation depth. It's also the sequence that most clearly leverages your distributed-systems background.
The rungs
Rung 01-Why distributed training
- What: A 70B model in fp16 needs 140GB just for weights. Add gradients and Adam optimizer states (fp32 master weights, momentum, variance) and you're at roughly 16 bytes per parameter, over 1TB of training state. Single-GPU training tops out at a few billion parameters; tricks like QLoRA stretch fine-tuning further, but not full training (see the worked sketch after this list).
- Why it earns its place: Frame the necessity before the techniques.
- Resource: "Scaling Laws for Neural Language Models" (Kaplan et al., arxiv.org/abs/2001.08361). Plus a memory accounting walkthrough-search "transformer math 101 eleuther".
- Done when: You can compute memory requirements for a training run from model size and optimizer choice.
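To make the accounting concrete, here is a rough sketch, assuming the standard mixed-precision Adam breakdown from the ZeRO paper (bf16 weights and gradients, fp32 master weights, fp32 momentum and variance); activations and buffers are ignored.

```python
def training_memory_gb(n_params: float) -> float:
    """Rough per-replica training memory in decimal GB, ignoring activations."""
    bytes_per_param = (
        2      # bf16 weights
        + 2    # bf16 gradients
        + 4    # fp32 master weights
        + 4    # Adam momentum (fp32)
        + 4    # Adam variance (fp32)
    )          # = 16 bytes/param, the ZeRO paper's accounting
    return n_params * bytes_per_param / 1e9

for name, size in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{training_memory_gb(size):,.0f} GB before activations")
# 7B  -> ~112 GB   (already more than one 80 GB GPU)
# 70B -> ~1,120 GB (the 140 GB quoted above is just the 2-byte weights)
```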
Rung 02-Data parallelism (DDP)
- What: Each GPU holds the full model. Each gets a different mini-batch. Gradients are all-reduced across GPUs.
- Why it earns its place: Simplest distributed strategy. Foundation of everything else.
- Resource: PyTorch DDP tutorial (search "pytorch ddp tutorial"). Plus the PyTorch Distributed Overview.
- Done when: You can explain DDP and where its bottleneck is (gradient all-reduce bandwidth); a minimal skeleton follows below.
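A minimal sketch of a DDP training script, assuming a single node launched with `torchrun --nproc_per_node=4 train_ddp.py` (the script name, model, and batch are stand-ins).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()     # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])    # full replica on every GPU
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")   # each rank sees its own batch
        loss = model(x).pow(2).mean()
        loss.backward()                            # gradient all-reduce happens here,
        opt.step()                                 # overlapped with the backward pass
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```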
Rung 03-Tensor parallelism
- What: Split each layer's weights across GPUs. Megatron-LM is the canonical implementation.
- Why it earns its place: Required when a single layer doesn't fit on a single GPU.
- Resource: Megatron-LM paper (arxiv.org/abs/1909.08053). Plus the Megatron-LM repo README.
- Done when: You can explain why attention's QKV projections are easy to tensor-parallelize, as the sketch below illustrates.
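A single-process sketch of the idea (the shards stand in for GPUs): splitting the query projection column-wise is the same as giving each GPU a subset of heads, and each shard's matmul needs no communication.

```python
import torch

d_model, n_heads, n_shards = 1024, 16, 4
x = torch.randn(8, d_model)                    # identical activations on every rank
w_q = torch.randn(d_model, d_model)            # query projection (same story for K, V)

shards = torch.chunk(w_q, n_shards, dim=1)     # 4 heads' worth of columns per shard
partial_q = [x @ w for w in shards]            # each "GPU" computes only its heads;
                                               # attention then proceeds per-head locally
recombined = torch.cat(partial_q, dim=1)
assert torch.allclose(recombined, x @ w_q, atol=1e-5)
# In Megatron-LM the matching all-reduce happens only after the row-parallel
# output projection, so the whole attention block costs one all-reduce in forward.
```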
Rung 04-Pipeline parallelism
- What: Split model layers across GPU stages. Microbatches flow through the pipeline.
- Why it earns its place: Used when tensor-parallel + data-parallel isn't enough. Bubble overhead is the cost.
- Resource: GPipe paper (arxiv.org/abs/1811.06965). PipeDream paper.
- Done when: You can explain pipeline bubbles and 1F1B scheduling at a conceptual level; the back-of-the-envelope below helps.
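A back-of-the-envelope for the bubble, using the standard GPipe-style formula: with p stages and m microbatches the idle fraction is (p - 1) / (m + p - 1).

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 8, 32):
    print(f"p=4 stages, m={m:>2} microbatches -> bubble {bubble_fraction(4, m):.0%}")
# m=1  -> 75%   (pipeline is mostly idle)
# m=4  -> 43%
# m=8  -> 27%
# m=32 -> 9%    (more microbatches amortize the bubble; 1F1B keeps the same bubble
#                but caps how many microbatches' activations are alive at once)
```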
Rung 05-ZeRO (DeepSpeed)
- What: Shard optimizer states, gradients, and parameters across GPUs. Stages 1, 2, 3 progressively shard more.
- Why it earns its place: ZeRO-3 is the dominant approach for training models that don't fit on a single GPU.
- Resource: ZeRO paper (arxiv.org/abs/1910.02054). DeepSpeed docs.
- Done when: You can explain what each ZeRO stage shards and the tradeoffs (see the accounting sketch below).
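A sketch of the per-GPU accounting, using the ZeRO paper's 16-bytes-per-parameter baseline (2 weights + 2 gradients + 12 optimizer); the GPU count and model size are illustrative.

```python
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= n_gpus          # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus          # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus        # stage 3: also shard the parameters themselves
    return weights + grads + optim

n = 64
for stage in (0, 1, 2, 3):
    per_param = zero_bytes_per_param(stage, n)
    print(f"stage {stage}: {per_param:5.2f} B/param -> {per_param * 70e9 / 1e9:6.1f} GB per GPU for 70B")
# stage 0: 16.00 B/param -> 1120.0 GB per GPU (plain DDP: impossible)
# stage 3:  0.25 B/param ->   17.5 GB per GPU (plus activations and comm buffers)
```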
Rung 06-FSDP (PyTorch native)
- What: Fully Sharded Data Parallel, PyTorch's native equivalent to ZeRO-3. Shards parameters and gathers them just-in-time, layer by layer.
- Why it earns its place: FSDP is the modern PyTorch standard. Hugging Face Accelerate uses it.
- Resource: FSDP paper (arxiv.org/abs/2304.11277). PyTorch FSDP tutorial.
- Done when: You can explain FSDP's wrapping policies and when to use them; a wrapping sketch follows below.
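A minimal wrapping sketch, assuming a torchrun launch and a toy TransformerBlock standing in for a real layer; the policy decides which submodules become their own shard/gather units.

```python
import functools
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class TransformerBlock(nn.Module):              # stand-in for a real block
    def __init__(self, d=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, 8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        x = x + self.attn(x, x, x)[0]
        return x + self.mlp(x)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[TransformerBlock() for _ in range(8)]).cuda()

# Wrap per transformer block: parameters are gathered just-in-time, one block at
# a time, instead of keeping the whole model materialized on every GPU.
policy = functools.partial(transformer_auto_wrap_policy,
                           transformer_layer_cls={TransformerBlock})
model = FSDP(model,
             auto_wrap_policy=policy,
             mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                            reduce_dtype=torch.bfloat16))
```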
Rung 07-3D parallelism
- What: Combine data, tensor, and pipeline parallelism. Used for the largest models. Choosing the split along each axis is a hyperparameter search problem in itself.
- Why it earns its place: Frontier-scale training. Reading-only depth for most engineers.
- Resource: Megatron-DeepSpeed documentation. Plus GPT-NeoX paper for an open-source reference.
- Done when: You can sketch a 3D parallelism layout for an 8-GPU setup, like the one below.
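One possible 8-GPU layout, sketched as a rank-to-coordinates mapping; putting tensor parallelism innermost (on the fastest links, e.g. NVLink within a node) follows common Megatron-LM practice.

```python
TP, PP, DP = 2, 2, 2      # 2 x 2 x 2 = 8 GPUs

for rank in range(TP * PP * DP):
    tp = rank % TP                    # tensor-parallel neighbor (shares layer shards)
    pp = (rank // TP) % PP            # pipeline stage
    dp = rank // (TP * PP)            # data-parallel replica
    print(f"rank {rank}: tp={tp} pp={pp} dp={dp}")
# Tensor-parallel all-reduces stay inside each tp pair, activations flow between
# pipeline stages, and gradient all-reduce runs within each of the 4 data-parallel
# groups (ranks that share the same tp and pp coordinates).
```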
Rung 08-Mixed precision and bf16
- What: Train in lower precision (fp16, bf16) with master weights kept in fp32. bf16 is the modern default: same dynamic range as fp32, half the memory.
- Why it earns its place: Required for training to be fast and to fit in memory.
- Resource: PyTorch AMP docs. Plus the BF16 vs FP16 comparison in PaLM and Gopher papers.
- Done when: You can explain why bf16 is preferred over fp16 for stability (the sketch below spells out the range argument).
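A minimal bf16 sketch with PyTorch autocast; the model and shapes are stand-ins. The stability point is in the comments: bf16 keeps fp32's 8 exponent bits, so no loss scaling is needed.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()        # fp32 master weights live here
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()          # matmuls run in bf16
loss.backward()
opt.step()
opt.zero_grad()

# fp16 has only 5 exponent bits (max ~65,504), so large activations or gradients
# overflow and a GradScaler is required; bf16 trades mantissa bits for fp32's
# exponent range, which is why it trains more stably with no loss scaling.
```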
Rung 09-Activation checkpointing / gradient checkpointing
- What: Don't store activations during forward pass; recompute them during backward. Trade compute for memory.
- Why it earns its place: Standard memory-saving technique. Knowing it lets you fit larger batches.
- Resource: PyTorch checkpoint utilities docs.
- Done when: You can explain when to use gradient checkpointing (see the sketch below).
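A minimal sketch with torch.utils.checkpoint; the block and shapes are stand-ins.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()

x = torch.randn(64, 4096, device="cuda", requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)   # forward stores only x, not the
loss = y.pow(2).mean()                          # intermediate activations
loss.backward()                                 # block's forward is re-run here

# Rule of thumb: worth it when activation memory (batch x seq x hidden x layers)
# dominates and you can afford roughly a third more compute to fit a bigger batch.
```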
Rung 10-Communication and networking
- What: NCCL, RDMA, all-reduce algorithms (ring, tree). InfiniBand vs Ethernet. Your distributed-systems instincts are gold here.
- Why it earns its place: Communication is the bottleneck of distributed training. Understanding it sets you apart.
- Resource: NVIDIA NCCL docs. Plus the Bandwidth Optimal All-Reduce literature.
- Done when: You can explain ring all-reduce and why bandwidth (not latency) matters most for it; the arithmetic below makes it concrete.
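A back-of-the-envelope for why bandwidth dominates, using the standard ring all-reduce volume of 2(N - 1)/N times the buffer size per GPU; the link speeds are illustrative.

```python
def allreduce_seconds(grad_bytes: float, n_gpus: int, bandwidth_gbps: float) -> float:
    volume = 2 * (n_gpus - 1) / n_gpus * grad_bytes    # bytes each GPU sends/receives
    return volume / (bandwidth_gbps * 1e9 / 8)         # Gbit/s -> bytes/s

grad_bytes = 7e9 * 2                                   # 7B params, bf16 gradients
for n, bw in [(8, 400), (8, 3200), (64, 3200)]:
    print(f"{n} GPUs @ {bw} Gbit/s: ~{allreduce_seconds(grad_bytes, n, bw):.2f}s per all-reduce")
# Note the time barely changes with N: for large messages it is ~2S/B, so link
# bandwidth, not latency, is what matters. At 400 Gbit/s the gradient sync alone
# takes ~0.5s per step; overlap it with the backward pass (as DDP does) or it
# becomes the bottleneck.
```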
Rung 11-Practical: training a tiny model with FSDP
- What: Stand up a multi-GPU FSDP training job (even on rented GPUs from RunPod / Lambda Labs).
- Why it earns its place: Reading vs doing. One real run is worth ten papers.
- Resource: PyTorch FSDP examples. Plus Hugging Face Accelerate.
- Done when: You've run a multi-GPU job and observed scaling behavior; the Accelerate loop below is a starting point.
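A minimal Accelerate loop as a sketch (model and data are stand-ins): the same script runs on one GPU or many, the DDP/FSDP choice lives in `accelerate config`, and you launch it on the rented box with `accelerate launch train.py`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(4096, 1024)), batch_size=32)

model, opt, loader = accelerator.prepare(model, opt, loader)  # wraps per your config

for (x,) in loader:                    # batches arrive on the right device,
    loss = model(x).pow(2).mean()      # split across processes
    accelerator.backward(loss)
    opt.step()
    opt.zero_grad()

# To observe scaling behavior, log tokens/sec per GPU at world sizes 1, 2, and 4
# and watch where communication starts eating into it.
```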
Minimum required to leave this sequence
- Compute memory requirements for a training run.
- Explain DDP, FSDP, ZeRO.
- Distinguish tensor, pipeline, data parallelism.
- Explain bf16 and gradient checkpointing.
- Run a multi-GPU FSDP job on rented hardware.
Going further
- EleutherAI's "Transformer Math 101" blog post (search for it).
- Megatron-LM and DeepSpeed source code: read it.
- Stanford CS336 lectures on distributed training.
How this sequence connects to the year
- Month 9: Rungs 01–06 are the conceptual core.
- Q3 Track C: Operational depth if you go inference / infra.
- Reading frontier papers: The vocabulary of every major lab's tech reports.