
Month 4 - Distributed Training: NCCL, DDP, FSDP, Tensor & Pipeline Parallelism, FP8

Goal: by the end of Week 16 you can (a) explain the ring-allreduce algorithm and predict its bandwidth cost, (b) train a model on 8 GPUs across 2 nodes with FSDP at >85% scaling efficiency, (c) implement tensor-parallel attention by hand, and (d) reason about 3D parallelism schedules and FP8 training stability.
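Goal (a) is quantitative: ring-allreduce moves 2(N-1)/N of the buffer over each rank's slowest link, so its time is predictable from buffer size and link bandwidth alone. A minimal sketch (the function name and the zero-latency, bandwidth-bound assumption are mine, not from the course materials):

```python
def ring_allreduce_time(size_bytes: float, n_gpus: int, link_bw_gbps: float) -> float:
    """Predicted ring-allreduce time, assuming the transfer is bandwidth-bound
    (per-message latency ignored) and all links have equal bandwidth.

    Each rank sends 2*(N-1)/N * size_bytes in total: (N-1)/N chunks during
    reduce-scatter plus (N-1)/N chunks during all-gather.
    """
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return bytes_on_wire / (link_bw_gbps * 1e9)

# Example: 1 GB of gradients, 8 GPUs, 100 GB/s per link
t = ring_allreduce_time(1e9, 8, 100.0)  # → 0.0175 s
```

As N grows, 2(N-1)/N approaches 2, so the per-rank traffic saturates at twice the buffer size: this is the sense in which ring-allreduce is bandwidth-optimal.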

Deep-dive companions (read in tandem):

- Weeks 13–15 (all) → DEEP_DIVES/06_DISTRIBUTED_TRAINING.md - derivations of all five all-reduce algorithms (with the ring-allreduce bandwidth-optimality proof), the full ZeRO-1/2/3 memory-math table, Megatron column→row partition derivations, ASCII pipeline schedules (GPipe, 1F1B, Interleaved 1F1B, Zero Bubble) with bubble formulas, and worked 3D-parallelism examples for 8B/70B/405B models.
- Week 16 → DEEP_DIVES/11_NUMERICS_AND_MIXED_PRECISION.md - IEEE-754 derivation, FP16/BF16/FP8 layouts, the full loss-scaling derivation including the dynamic GradScaler, FP8 with delayed scaling, the low-precision Adam pitfall, catastrophic cancellation, and transformer numerical-stability tricks.
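The ZeRO-1/2/3 memory math referenced above follows a simple accounting rule: under mixed-precision Adam, model states cost 2Ψ (fp16 params) + 2Ψ (fp16 grads) + 12Ψ (fp32 master params, momentum, variance) bytes, and each ZeRO stage shards one more of those terms across the data-parallel group. A sketch of that accounting (the helper name is mine; activations and buffers are deliberately excluded):

```python
def zero_memory_per_gpu(psi: float, n: int, stage: int) -> float:
    """Per-GPU bytes for model states under mixed-precision Adam.

    psi:   parameter count
    n:     data-parallel degree
    stage: 0 = plain DDP (fully replicated), 1/2/3 = ZeRO stage;
           ZeRO-1 shards optimizer states, ZeRO-2 adds gradients,
           ZeRO-3 adds the parameters themselves.
    """
    params, grads, opt = 2 * psi, 2 * psi, 12 * psi
    if stage >= 3:
        params /= n
    if stage >= 2:
        grads /= n
    if stage >= 1:
        opt /= n
    return params + grads + opt

# Example: a 7B-parameter model on 8 GPUs
# plain DDP: zero_memory_per_gpu(7e9, 8, 0) → 112 GB per GPU
# ZeRO-3:    zero_memory_per_gpu(7e9, 8, 3) → 14 GB per GPU
```

The 16Ψ → 16Ψ/N progression is why ZeRO-3 (and FSDP, its PyTorch-native counterpart) is the default for the Week 13–15 multi-node runs.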

