
16-Distributed Training

Why this matters in the journey

You almost certainly will not pretrain a frontier model. But you absolutely need to read the papers, understand DDP/FSDP/ZeRO concepts, recognize when a problem is communication-bound, and know what makes training expensive. This sequence is breadth-first: conceptual depth, not implementation depth. It's also the sequence that most clearly leverages your distributed-systems background.

The rungs

Rung 01-Why distributed training

  • What: A 70B model in fp16 needs 140GB just for weights. Adam's optimizer states (fp32 master weights, momentum, variance) add roughly 12 more bytes per parameter on top of the 2-byte fp16 weights. Single-GPU training tops out at maybe 7B, and only with QLoRA.
  • Why it earns its place: Frame the necessity before the techniques.
  • Resource: "Scaling Laws for Neural Language Models" (Kaplan et al., arxiv.org/abs/2001.08361). Plus a memory-accounting walkthrough: search "transformer math 101 eleuther".
  • Done when: You can compute memory requirements for a training run from model size and optimizer choice.
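The accounting above can be sketched in a few lines. This is a rough estimate, not an exact planner: it uses the common mixed-precision Adam accounting (2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 optimizer state per parameter, as in the ZeRO paper) and ignores activations; the function name is my own.

```python
# Rough training-memory estimate for mixed-precision Adam: 2B fp16 weights
# + 2B fp16 gradients + 12B fp32 optimizer state (master weights, momentum,
# variance) = 16 bytes per parameter, before activations.
def training_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Memory in GB (1 GB = 1e9 bytes) for model + grads + Adam states."""
    return n_params * bytes_per_param / 1e9

# A 70B-parameter model: weights alone in fp16 vs full training state.
weights_fp16 = training_memory_gb(70e9, bytes_per_param=2)
full_training = training_memory_gb(70e9)

print(f"fp16 weights only: {weights_fp16:.0f} GB")            # 140 GB
print(f"weights + grads + Adam states: {full_training:.0f} GB")  # 1120 GB
```

The jump from 140GB to over a terabyte is exactly why the rest of this sequence exists.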

Rung 02-Data parallelism (DDP)

  • What: Each GPU holds the full model. Each gets a different mini-batch. Gradients are all-reduced across GPUs.
  • Why it earns its place: Simplest distributed strategy. Foundation of everything else.
  • Resource: PyTorch DDP tutorial (search "pytorch ddp tutorial"). Plus the PyTorch Distributed Overview.
  • Done when: You can explain DDP and where its bottleneck is (gradient all-reduce bandwidth).
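The key invariant of DDP can be checked without any GPUs. The sketch below is a conceptual simulation (plain numpy, not the real `torch.distributed` API): averaging per-replica gradients via all-reduce gives the same update as one big batch on one device.

```python
import numpy as np

# Conceptual simulation of DDP's gradient all-reduce: each "replica"
# computes gradients on its own mini-batch, the gradients are averaged,
# and every replica then applies the identical update.
rng = np.random.default_rng(0)
w = np.zeros(4)                                        # same weights everywhere
batches = [rng.normal(size=(8, 4)) for _ in range(4)]  # one batch per "GPU"

def grad(x, w):
    """Gradient of the toy loss 0.5*||x @ w - 1||^2, averaged over the batch."""
    return x.T @ (x @ w - 1.0) / len(x)

local_grads = [grad(x, w) for x in batches]
allreduced = np.mean(local_grads, axis=0)   # what the NCCL all-reduce computes

# Equivalent to a single-GPU gradient over the concatenated global batch.
global_grad = grad(np.concatenate(batches), w)
print(np.allclose(allreduced, global_grad))  # True
```

The bottleneck bullet follows directly: the all-reduce must move a full gradient's worth of data every step, so interconnect bandwidth caps scaling.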

Rung 03-Tensor parallelism

  • What: Split each layer's weights across GPUs. Megatron-LM is the canonical implementation.
  • Why it earns its place: Required when a single layer doesn't fit on a single GPU.
  • Resource: Megatron-LM paper (arxiv.org/abs/1909.08053). Plus the Megatron-LM repo README.
  • Done when: You can explain why attention's QKV projections are easy to tensor-parallelize.
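The "Done when" claim can be demonstrated with a column-parallel matmul, Megatron-style. This is a one-machine illustration, not distributed code: splitting a projection's output columns across devices and concatenating the partial results reproduces the full matmul, and since attention heads live in disjoint column blocks of the Q/K/V projections, each GPU can own whole heads with no communication until the output projection.

```python
import numpy as np

# Column-parallel linear layer (conceptual): split W's output columns
# across two "GPUs"; each computes x @ W_i; concatenation recovers x @ W.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))        # (batch, d_model)
W = rng.normal(size=(8, 8))        # full projection weight

shards = np.split(W, 2, axis=1)    # two "GPUs", 4 output columns each
partial = [x @ Wi for Wi in shards]
tp_out = np.concatenate(partial, axis=1)

print(np.allclose(tp_out, x @ W))  # True
```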

Rung 04-Pipeline parallelism

  • What: Split model layers across GPU stages. Microbatches flow through the pipeline.
  • Why it earns its place: Used when tensor-parallel + data-parallel isn't enough. Bubble overhead is the cost.
  • Resource: GPipe paper (arxiv.org/abs/1811.06965). PipeDream paper.
  • Done when: You can explain pipeline bubbles and 1F1B scheduling at a conceptual level.
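The bubble cost has a clean closed form for a GPipe-style schedule: with p stages and m microbatches, the pipeline idles for p-1 of m+p-1 slots. A tiny calculation shows why more microbatches amortize the bubble:

```python
# Pipeline bubble fraction for p stages and m microbatches (GPipe-style
# schedule): (p-1)/(m+p-1) of the schedule is idle time.
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

for m in (4, 16, 64):
    print(f"p=4 stages, m={m:>2} microbatches: "
          f"{bubble_fraction(4, m):.1%} idle")
```

With 4 microbatches on 4 stages you lose roughly 43% of the schedule to bubbles; with 64 microbatches, under 5%. 1F1B reduces peak activation memory but shares the same bubble arithmetic.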

Rung 05-ZeRO (DeepSpeed)

  • What: Shard optimizer states, gradients, and parameters across GPUs. Stages 1, 2, 3 progressively shard more.
  • Why it earns its place: ZeRO-3 is the dominant approach for training models that don't fit per-GPU.
  • Resource: ZeRO paper (arxiv.org/abs/1910.02054). DeepSpeed docs.
  • Done when: You can explain what each ZeRO stage shards and the tradeoffs.
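The stage-by-stage tradeoff is easiest to see as per-GPU bytes per parameter. The sketch below uses the ZeRO paper's accounting (2B fp16 weights + 2B fp16 gradients + 12B fp32 optimizer state); the function name and stage-0 "plain DDP" baseline are my framing.

```python
# Per-GPU memory (bytes per parameter) at each ZeRO stage across N
# data-parallel GPUs, using the ZeRO paper's 2+2+12 accounting.
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    if stage == 0:                       # plain DDP: everything replicated
        return 2 + 2 + 12
    if stage == 1:                       # shard optimizer states
        return 2 + 2 + 12 / n_gpus
    if stage == 2:                       # + shard gradients
        return 2 + (2 + 12) / n_gpus
    if stage == 3:                       # + shard parameters too
        return (2 + 2 + 12) / n_gpus
    raise ValueError(stage)

# A 7B model on 64 GPUs: GB per GPU at each stage.
for s in range(4):
    gb = 7e9 * zero_bytes_per_param(s, 64) / 1e9
    print(f"ZeRO stage {s}: {gb:.1f} GB/GPU")
```

The tradeoff the "Done when" asks about: each stage shards more state, but stages 2 and 3 add communication (gradient reduce-scatter, parameter all-gathers) that stage 1 avoids.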

Rung 06-FSDP (PyTorch native)

  • What: Fully Sharded Data Parallel-PyTorch's native equivalent to ZeRO-3. Shards parameters, gathers them just-in-time per layer.
  • Why it earns its place: FSDP is the modern PyTorch standard. Hugging Face Accelerate uses it.
  • Resource: FSDP paper (arxiv.org/abs/2304.11277). PyTorch FSDP tutorial.
  • Done when: You can explain FSDP's wrapping policies and when to use them.
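The shard/gather lifecycle is the part worth internalizing. The sketch below is a conceptual simulation in numpy, not the real `torch.distributed.fsdp` API: each rank stores only a 1/N shard of a layer's flattened parameters, all-gathers the full layer just in time for its forward, and frees the full copy afterward. A wrapping policy is just the choice of which module boundaries get this treatment.

```python
import numpy as np

# Conceptual FSDP per-layer flow: shard at rest, all-gather just-in-time,
# free the full copy after use.
N = 4                                            # "GPUs"
rng = np.random.default_rng(0)
layer = rng.normal(size=(8, 8))                  # one layer's weight

flat = layer.ravel()
shards = np.split(flat, N)                       # what each rank stores at rest

def all_gather(shards, shape):
    """Rebuild the full parameter just-in-time for this layer's forward."""
    return np.concatenate(shards).reshape(shape)

x = rng.normal(size=(2, 8))
full = all_gather(shards, layer.shape)           # transient full copy
y = x @ full                                     # this layer's forward
del full                                         # freed until backward re-gathers

print(np.allclose(y, x @ layer))                 # True
```

Coarser wrapping means fewer, larger all-gathers (better bandwidth utilization, higher peak memory); finer wrapping inverts that.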

Rung 07-3D parallelism

  • What: Combine data, tensor, and pipeline parallelism. Used for the largest models. Choosing the three degrees is a hyperparameter search problem in itself.
  • Why it earns its place: Frontier-scale training. Reading-only depth for most engineers.
  • Resource: Megatron-DeepSpeed documentation. Plus GPT-NeoX paper for an open-source reference.
  • Done when: You can sketch a 3D parallelism layout for an 8-GPU setup.
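One possible 8-GPU layout, as the "Done when" asks for. This is an illustration of one assignment (2 tensor-parallel x 2 pipeline x 2 data-parallel), not a recommendation; the usual rule of thumb is to put tensor-parallel groups on the fastest links (e.g. NVLink within a node), since they communicate every layer.

```python
from itertools import product

# A 2x2x2 layout for 8 GPUs: tensor-parallel degree 2, pipeline degree 2,
# data-parallel degree 2. Adjacent ranks share a tensor-parallel group so
# the chattiest communication stays on the fastest links.
TP, PP, DP = 2, 2, 2
assert TP * PP * DP == 8

layout = [
    {"rank": rank, "dp_replica": dp, "pipeline_stage": pp, "tp_group": tp}
    for rank, (dp, pp, tp) in enumerate(product(range(DP), range(PP), range(TP)))
]
for g in layout:
    print(g)
```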

Rung 08-Mixed precision and bf16

  • What: Train in lower precision (fp16, bf16) with master weights in fp32. bf16 is the modern default-same dynamic range as fp32, half the memory.
  • Why it earns its place: Required for training to be fast and to fit in memory.
  • Resource: PyTorch AMP docs. Plus the BF16 vs FP16 comparison in PaLM and Gopher papers.
  • Done when: You can explain why bf16 is preferred over fp16 for stability.
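The stability argument is pure exponent-bit arithmetic, and you can compute it from the formats' bit layouts. The helper below is my own; it computes the largest finite value of an IEEE-style float from its exponent and mantissa widths.

```python
import numpy as np

# bf16 keeps fp32's 8 exponent bits (same ~3.4e38 dynamic range) and gives
# up mantissa bits instead; fp16's 5 exponent bits overflow at 65504.
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style float format."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias

print(f"fp16 max: {max_finite(5, 10):.5g}")    # 65504
print(f"bf16 max: {max_finite(8, 7):.3e}")     # ~3.4e38, like fp32

# Overflow in practice: squaring a modest activation blows up fp16.
x = np.float16(300.0)
print(x * x)                                   # inf
```

That overflow is why fp16 training needs loss scaling while bf16 usually does not: the range problem disappears, at the cost of precision (7 vs 10 mantissa bits).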

Rung 09-Activation checkpointing / gradient checkpointing

  • What: Don't store activations during forward pass; recompute them during backward. Trade compute for memory.
  • Why it earns its place: Standard memory-saving technique. Knowing it lets you fit larger batches.
  • Resource: PyTorch checkpoint utilities docs.
  • Done when: You can explain when to use gradient checkpointing.
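The compute-for-memory trade has a classic quantitative form (the sqrt-checkpointing scheme of Chen et al. 2016): checkpoint every sqrt(L)-th layer, keep only the checkpoints plus one live segment, and recompute each segment once during backward (roughly one extra forward pass). The function below is my own back-of-envelope model, counting memory in "layers of activations held":

```python
import math

# Peak live activation count with and without sqrt-style checkpointing.
def activation_memory(n_layers: int, checkpoint: bool) -> int:
    """Peak live activations, in 'layer units'."""
    if not checkpoint:
        return n_layers                  # keep every layer's activations
    seg = math.isqrt(n_layers)           # segment length ~ sqrt(L)
    return n_layers // seg + seg         # checkpoints + one live segment

L = 64
print(f"no checkpointing:   {activation_memory(L, False)} layers held")
print(f"sqrt checkpointing: {activation_memory(L, True)} layers held")
```

For a 64-layer model that is 64 vs 16 layer-units of activations: use it when activations, not weights, are what's blowing your memory budget, and accept the ~30% step-time hit from recomputation.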

Rung 10-Communication and networking

  • What: NCCL, RDMA, all-reduce algorithms (ring, tree). InfiniBand vs Ethernet. Your distributed-systems instincts are gold here.
  • Why it earns its place: Communication is the bottleneck of distributed training. Understanding it sets you apart.
  • Resource: NVIDIA NCCL docs. Plus the Bandwidth Optimal All-Reduce literature.
  • Done when: You can explain ring all-reduce and why bandwidth (not latency) matters most for it.
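Ring all-reduce is small enough to simulate end to end. The sketch below is plain Python (no NCCL): a reduce-scatter phase of N-1 steps, then an all-gather phase of N-1 steps. Each node sends 2(N-1)/N of the buffer in total, nearly independent of N, which is why the algorithm is bandwidth-optimal and why per-link bandwidth, not latency, dominates for large gradient buffers.

```python
import numpy as np

# Ring all-reduce over N "GPUs": reduce-scatter, then all-gather.
N = 4
rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(N)]     # one gradient per GPU
chunks = [np.split(g.copy(), N) for g in grads]    # each buffer split into N chunks

# Reduce-scatter: each step, node i forwards its running partial sum of one
# chunk to node i+1. After N-1 steps, node i holds the full sum of chunk (i+1)%N.
for step in range(N - 1):
    for i in range(N):
        src = (i - step) % N                       # chunk node i sends now
        chunks[(i + 1) % N][src] += chunks[i][src]

# All-gather: circulate each fully reduced chunk once around the ring.
for step in range(N - 1):
    for i in range(N):
        src = (i + 1 - step) % N                   # chunk node i forwards
        chunks[(i + 1) % N][src] = chunks[i][src]

result = np.concatenate(chunks[0])
print(np.allclose(result, sum(grads)))             # True
```

Note the structure: 2(N-1) sequential steps (latency grows with ring size) but constant total bytes per link, so for multi-GB gradients the bandwidth term wins.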

Rung 11-Practical: training a tiny model with FSDP

  • What: Stand up a multi-GPU FSDP training job (even on rented GPUs from RunPod / Lambda Labs).
  • Why it earns its place: Reading vs doing. One real run is worth ten papers.
  • Resource: PyTorch FSDP examples. Plus Hugging Face Accelerate.
  • Done when: You've run a multi-GPU job and observed scaling behavior.

Minimum required to leave this sequence

  • Compute memory requirements for a training run.
  • Explain DDP, FSDP, ZeRO.
  • Distinguish tensor, pipeline, data parallelism.
  • Explain bf16 and gradient checkpointing.
  • Run a multi-GPU FSDP job on rented hardware.

Going further

  • EleutherAI's "Transformer Math 101" blog post (search).
  • Read the Megatron-LM and DeepSpeed source code.
  • Stanford CS336 lectures on distributed training.

How this sequence connects to the year

  • Month 9: Rungs 01–06 are the conceptual core.
  • Q3 Track C: Operational depth if you go inference / infra.
  • Reading frontier papers: The vocabulary of every major lab's tech reports.
