Week 15 - Tensor Parallelism and Pipeline Parallelism
15.1 Conceptual Core
- Tensor parallelism (TP): shard individual layers across ranks. The classic Megatron-LM approach (Shoeybi et al., 2019):
- Column-parallel linear: shard weight matrix by output dim. Each rank computes a slice of output; allgather to combine if needed.
- Row-parallel linear: shard by input dim. Each rank computes a partial output; allreduce to sum.
- Attention: shard heads across ranks. Each rank computes its assigned heads; the row-parallel output projection then produces partial results that are summed with an allreduce.
- TP requires fast intra-node interconnect (NVLink). Across nodes, latency dominates.
- Pipeline parallelism (PP): split layers across ranks. Each rank holds layers L_i to L_j. Activations flow forward; gradients flow backward.
- Naive PP idles most ranks (the pipeline bubble). GPipe (Huang et al., 2018) splits the batch into microbatches to keep stages busy. The 1F1B schedule (one forward, one backward per stage, from PipeDream) keeps a similar bubble but sharply cuts activation memory; interleaved 1F1B shrinks the bubble further.
- 3D parallelism: combine TP (intra-node) + PP (inter-node) + DP (across replicas of the whole pipeline). Each GPU is identified by its (TP, PP, DP) coordinates and holds the corresponding shard of the model. Training recipes for the largest models (e.g., GPT-3-scale runs, Llama 3 405B) combine these axes.
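The column-parallel/row-parallel decomposition above can be checked on a single process by slicing the weight matrix by hand. A minimal numpy sketch (the concat and sum stand in for the allgather and allreduce; all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 4                      # simulated TP ranks
d_in, d_out = 8, 12
x = rng.standard_normal((2, d_in))  # batch of activations
W = rng.standard_normal((d_in, d_out))
y_ref = x @ W                       # single-device reference

# Column-parallel: shard W by output dim; each rank owns a slice of y.
col_shards = np.split(W, world_size, axis=1)
y_col = np.concatenate([x @ w for w in col_shards], axis=1)  # "allgather"

# Row-parallel: shard W by input dim (and x to match); partials sum to y.
row_shards = np.split(W, world_size, axis=0)
x_shards = np.split(x, world_size, axis=1)
y_row = sum(xi @ wi for xi, wi in zip(x_shards, row_shards))  # "allreduce"

assert np.allclose(y_col, y_ref) and np.allclose(y_row, y_ref)
```

Both shardings reproduce the dense result exactly, which is why Megatron can chain a column-parallel layer directly into a row-parallel one with no collective in between.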
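The GPipe bubble has a simple closed form: with p stages and m microbatches, the idle fraction is (p-1)/(m+p-1). A quick sanity check:

```python
def bubble_fraction(p: int, m: int) -> float:
    """GPipe pipeline bubble fraction: p stages, m microbatches per step."""
    return (p - 1) / (m + p - 1)

# Naive PP (1 microbatch) on 8 stages: 7/8 of step time is bubble.
assert abs(bubble_fraction(8, 1) - 7 / 8) < 1e-12
# 32 microbatches shrink it to 7/39, roughly 18%.
assert abs(bubble_fraction(8, 32) - 7 / 39) < 1e-12
```

This is why microbatch count is tuned upward until activation memory or per-microbatch kernel efficiency becomes the binding constraint.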
15.2 Mechanical Detail
- TP communication: an allreduce after each row-parallel layer (two per transformer block in the forward pass: one after attention, one after the MLP), plus the matching allreduces in backward; allgather only where full activations are needed. Latency-sensitive; do TP within a node where NVLink bandwidth dominates.
- PP communication: only at stage boundaries; activations forward + gradients backward. Bandwidth-friendly; do PP across nodes.
- DP: gradient sync once per step; very bandwidth-friendly; do DP across pipeline replicas.
- Which parallelism goes along which axis is governed by:
- TP degree ≤ GPUs-per-node (NVLink scope).
- PP degree determined by memory needs (each stage holds layers + activations).
- DP fills the rest.
- Megatron-LM and DeepSpeed are the two main open-source 3D-parallelism stacks. PyTorch's pipeline-parallelism API (torch.distributed.pipelining, introduced in 2.4) and DTensor (torch.distributed.tensor) are converging on a unified native path.
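The three constraints above can be turned into a tiny degree-selection sketch. Function and argument names here are illustrative, not any library's API:

```python
def choose_degrees(total_gpus: int, gpus_per_node: int, min_pp: int):
    """Pick (tp, pp, dp): TP capped at node size (NVLink scope),
    PP set by per-stage memory budget, DP fills the remainder.
    Illustrative sketch only."""
    tp = gpus_per_node
    pp = min_pp
    assert total_gpus % (tp * pp) == 0, "degrees must tile the cluster"
    dp = total_gpus // (tp * pp)
    return tp, pp, dp

# 16 nodes of 8 GPUs, pipeline needs at least 4 stages to fit:
tp, pp, dp = choose_degrees(total_gpus=128, gpus_per_node=8, min_pp=4)
assert (tp, pp, dp) == (8, 4, 4)
```

The assert on divisibility matters in practice: TP x PP x DP must equal the world size, so degrees are chosen from the divisors of the cluster shape.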
15.3 Lab: "Implement Tensor-Parallel Attention"
By hand, in pure PyTorch + torch.distributed:
1. Implement the Megatron-style tensor-parallel multi-head attention: column-parallel QKV projection, sharded heads, row-parallel output projection.
2. Verify numerically against a single-GPU reference for correctness (allclose to atol=1e-3).
3. Benchmark on 4 GPUs vs 1-GPU baseline. Compute scaling efficiency.
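Before writing the distributed version, the sharding math of steps 1-2 can be verified on one process. A numpy sketch that shards heads across simulated ranks and sums the row-parallel output-projection partials (all dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, world = 4, 8, 2        # 2 simulated TP ranks, 2 heads each
d_model = n_heads * d_head
seq = 5
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def run_heads(Wq_s, Wk_s, Wv_s):
    """Attention over whichever heads these (column-sharded) QKV slices cover."""
    q, k, v = x @ Wq_s, x @ Wk_s, x @ Wv_s
    out = []
    for h in range(q.shape[1] // d_head):
        qs, ks, vs = (t[:, h * d_head:(h + 1) * d_head] for t in (q, k, v))
        out.append(softmax(qs @ ks.T / np.sqrt(d_head)) @ vs)
    return np.concatenate(out, axis=1)

# Reference: all heads on one "device".
ref = run_heads(Wq, Wk, Wv) @ Wo

# Tensor-parallel: column-shard QKV, row-shard Wo, sum partials ("allreduce").
per = d_model // world
partials = []
for r in range(world):
    sl = slice(r * per, (r + 1) * per)
    local = run_heads(Wq[:, sl], Wk[:, sl], Wv[:, sl])  # this rank's heads
    partials.append(local @ Wo[sl, :])                  # row-parallel Wo
tp_out = sum(partials)                                  # simulated allreduce

assert np.allclose(tp_out, ref)
```

Porting this to torch.distributed is then mostly a matter of replacing the loop over ranks with per-rank weight shards and the final sum with dist.all_reduce.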
15.4 Idiomatic & Diagnostic Drill
- Use `nsys profile` to see the timeline of allreduce vs compute. Tensor parallelism's signature is short, frequent allreduces.
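A typical invocation traces CUDA kernels plus NVTX ranges (which NCCL emits) into a report you can open in the Nsight Systems GUI; the script name is a placeholder:

```shell
# Trace CUDA kernels and NVTX/NCCL ranges for a few training steps
nsys profile -t cuda,nvtx -o tp_trace python train_tp.py
```

In the resulting timeline, TP shows up as many small `ncclAllReduce` calls interleaved with GEMMs; if those allreduces stretch out, TP has likely leaked across the NVLink boundary.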
15.5 Production Slice
- Capture the topology decision: for your hardware (e.g., 8x H100 per node, 16 nodes), what TP/PP/DP degrees do you choose for a 70B model? Document the math.
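One possible shape of the documented math, using the standard mixed-precision rule of thumb of 16 bytes of model state per parameter (bf16 weights + grads, fp32 master weights + Adam m and v); treat every number as an estimate, and note that ZeRO/FSDP would shard the optimizer states across DP as well:

```python
# Rough memory math for a 70B model on 128 GPUs (16 nodes x 8 H100-80GB).
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 w+g, fp32 master, Adam m, v
model_state_gb = params * bytes_per_param / 1e9   # 1120 GB before sharding

tp, pp = 8, 4                          # TP inside a node, PP across nodes
dp = 128 // (tp * pp)                  # DP fills the rest
per_gpu_gb = model_state_gb / (tp * pp)  # plain DP replicates model states

assert dp == 4
# ~35 GB of model state per GPU leaves headroom for activations on 80 GB.
assert 30 < per_gpu_gb < 40
```

Writing the math this way makes the trade-off explicit: raising PP lowers per-GPU state but grows the bubble, so the lab's bubble-fraction and memory numbers should be reported together.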