# Month 4 - Distributed Training: NCCL, DDP, FSDP, Tensor & Pipeline Parallelism, FP8
Goal: by the end of Week 16 you can (a) explain the ring-allreduce algorithm and predict its communication cost, (b) train a model on 8 GPUs across 2 nodes with FSDP achieving >85% scaling efficiency, (c) implement tensor-parallel attention by hand, and (d) reason about 3D-parallelism schedules and FP8 training stability.
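A quick worked example for goal (a): in the standard ring-allreduce cost model, each of the N ranks sends and receives 2(N-1)/N of the payload (a reduce-scatter phase followed by an all-gather phase), so for a payload of S bytes over links of bandwidth B the time is roughly t ≈ 2(N-1)S / (NB), ignoring latency. A minimal sketch of that arithmetic (the GPU count, payload, and link speed below are illustrative assumptions, not benchmarks):

```python
def ring_allreduce_time(n_gpus: int, payload_bytes: float, link_gbps: float) -> float:
    """Bandwidth-term estimate for ring allreduce: t = 2*(N-1)/N * S / B.

    Each rank transmits 2*(N-1)/N of the payload (reduce-scatter + all-gather),
    so per-rank traffic approaches 2*S as N grows -- the bandwidth-optimality
    property proved in the Week 13-15 deep dive.
    """
    bytes_per_sec = link_gbps * 1e9 / 8  # convert Gbit/s to bytes/s
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / bytes_per_sec

# Assumed scenario: 8 GPUs, 1 GiB of FP16 gradients, 400 Gbit/s effective links.
print(f"{ring_allreduce_time(8, 2**30, 400) * 1e3:.1f} ms")  # ~37.6 ms
```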
Deep-dive companions (read in tandem):

- Weeks 13–15 (all) → DEEP_DIVES/06_DISTRIBUTED_TRAINING.md - derivation of all 5 all-reduce algorithms (with the ring-allreduce bandwidth-optimality proof), full ZeRO-1/2/3 memory-math table, Megatron column→row partition derivations, ASCII pipeline schedules (GPipe, 1F1B, Interleaved 1F1B, Zero Bubble) with bubble formulas, and 3D-parallelism worked examples for 8B/70B/405B models.
- Week 16 → DEEP_DIVES/11_NUMERICS_AND_MIXED_PRECISION.md - IEEE-754 derivation, FP16/BF16/FP8 layouts, the full loss-scaling derivation including the dynamic GradScaler, FP8 with delayed scaling, the Adam-in-low-precision pitfall, catastrophic cancellation, and transformer numerical-stability tricks.
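As a taste of the ZeRO memory math covered in that deep dive: with mixed-precision Adam, each parameter typically costs 2 bytes (FP16 weights) + 2 bytes (FP16 grads) + 12 bytes (FP32 master weights, momentum, variance), and ZeRO stages 1/2/3 progressively shard those terms across the N data-parallel ranks. A minimal sketch of that table (the byte accounting follows the standard ZeRO-paper convention; the 8B-parameter/8-rank scenario is an illustrative assumption):

```python
def zero_gb_per_gpu(n_params: float, n_ranks: int, stage: int) -> float:
    """Per-GPU model-state memory (GB) for mixed-precision Adam under ZeRO.

    Per parameter: 2 B FP16 weights + 2 B FP16 grads + 12 B FP32 optimizer
    state. ZeRO-1 shards the optimizer state, ZeRO-2 also shards grads,
    ZeRO-3 also shards the weights themselves.
    """
    weights = 2 / n_ranks if stage >= 3 else 2
    grads = 2 / n_ranks if stage >= 2 else 2
    optim = 12 / n_ranks if stage >= 1 else 12
    return n_params * (weights + grads + optim) / 1e9

# Assumed scenario: 8B parameters sharded over 8 ranks.
for stage in (0, 1, 2, 3):
    print(f"ZeRO-{stage}: {zero_gb_per_gpu(8e9, 8, stage):.0f} GB/GPU")
# -> 128, 44, 30, 16 GB/GPU
```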
## Weeks
- Week 13 - Communication Primitives: NCCL, Allreduce, Topology
- Week 14 - Data Parallelism: DDP, ZeRO, FSDP (see the minimal FSDP sketch after this list)
- Week 15 - Tensor Parallelism and Pipeline Parallelism
- Week 16 - Mixed Precision, FP8, Numerical Stability at Scale
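For Week 14, a minimal FSDP skeleton to build toward the 2-node/8-GPU goal (a sketch, assuming a torchrun launch that sets RANK/LOCAL_RANK/WORLD_SIZE; the toy model, batch, and hyperparameters are placeholders, not a full training loop):

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun, which sets RANK / LOCAL_RANK / WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-in model (placeholder for the transformer you actually train).
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

# Default FSDP flattens and shards params, grads, and optimizer state across
# all ranks; device_id places this rank's shard on its local GPU.
model = FSDP(model, device_id=local_rank)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")  # dummy batch; shapes are illustrative
loss = model(x).square().mean()          # dummy loss
loss.backward()   # gradients are reduce-scattered across ranks here
optim.step()      # each rank updates only its own parameter shard

dist.destroy_process_group()
```

For the 2-node target, launch with something like `torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=<host>:29500 train.py` (standard torchrun flags; adjust to your cluster), then compare per-GPU throughput against the single-GPU baseline to measure scaling efficiency.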