Week 15 - Tensor Parallelism and Pipeline Parallelism
15.1 Conceptual Core
- Tensor parallelism (TP): shard individual layers across ranks. The classic Megatron-LM approach (Shoeybi et al., 2019):
- Column-parallel linear: shard weight matrix by output dim. Each rank computes a slice of output; allgather to combine if needed.
- Row-parallel linear: shard by input dim. Each rank computes a partial output; allreduce to sum.
- Attention: shard heads across ranks. Each rank computes its assigned heads; the row-parallel output projection then produces partial results that are summed with an allreduce.
- TP requires fast intra-node interconnect (NVLink). Across nodes, latency dominates.
- Pipeline parallelism (PP): split layers across ranks. Each rank holds layers L_i to L_j. Activations flow forward; gradients flow backward.
- Naive PP idles most ranks (the pipeline bubble). GPipe (Huang et al., 2018) splits the batch into microbatches to keep stages busy. The 1F1B schedule (one forward, one backward per stage, from PipeDream) keeps a similar bubble but sharply cuts activation memory; interleaved 1F1B shrinks the bubble further.
- 3D parallelism: combine TP (intra-node) + PP (inter-node) + DP (across replicas of the whole pipeline). Each GPU is identified by its (TP, PP, DP) coordinates and holds the corresponding shard of the model. Training recipes for the largest models (e.g., GPT-3-scale runs, Llama 3 405B) combine these axes.
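The column-parallel/row-parallel decomposition above can be checked on a single process by slicing the weight matrix by hand. A minimal numpy sketch (the concat and sum stand in for the allgather and allreduce; all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 4                      # simulated TP ranks
d_in, d_out = 8, 12
x = rng.standard_normal((2, d_in))  # batch of activations
W = rng.standard_normal((d_in, d_out))
y_ref = x @ W                       # single-device reference

# Column-parallel: shard W by output dim; each rank owns a slice of y.
col_shards = np.split(W, world_size, axis=1)
y_col = np.concatenate([x @ w for w in col_shards], axis=1)  # "allgather"

# Row-parallel: shard W by input dim (and x to match); partials sum to y.
row_shards = np.split(W, world_size, axis=0)
x_shards = np.split(x, world_size, axis=1)
y_row = sum(xi @ wi for xi, wi in zip(x_shards, row_shards))  # "allreduce"

assert np.allclose(y_col, y_ref) and np.allclose(y_row, y_ref)
```

Both shardings reproduce the dense result exactly, which is why Megatron can chain a column-parallel layer directly into a row-parallel one with no collective in between.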
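The GPipe bubble has a simple closed form: with p stages and m microbatches, the idle fraction is (p-1)/(m+p-1). A quick sanity check:

```python
def bubble_fraction(p: int, m: int) -> float:
    """GPipe pipeline bubble fraction: p stages, m microbatches per step."""
    return (p - 1) / (m + p - 1)

# Naive PP (1 microbatch) on 8 stages: 7/8 of step time is bubble.
assert abs(bubble_fraction(8, 1) - 7 / 8) < 1e-12
# 32 microbatches shrink it to 7/39, roughly 18%.
assert abs(bubble_fraction(8, 32) - 7 / 39) < 1e-12
```

This is why microbatch count is tuned upward until activation memory or per-microbatch kernel efficiency becomes the binding constraint.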
15.2 Mechanical Detail
- TP communication: an allreduce after each row-parallel layer (two per transformer block in the forward pass: one after attention, one after the MLP), plus the matching allreduces in backward; allgather only where full activations are needed. Latency-sensitive; do TP within a node where NVLink bandwidth dominates.
- PP communication: only at stage boundaries; activations forward + gradients backward. Bandwidth-friendly; do PP across nodes.
- DP: gradient sync once per step; very bandwidth-friendly; do DP across pipeline replicas.
- Which parallelism goes along which axis is governed by:
- TP degree ≤ GPUs-per-node (NVLink scope).
- PP degree determined by memory needs (each stage holds layers + activations).
- DP fills the rest.
- Megatron-LM and DeepSpeed are the two main open-source 3D-parallelism stacks. PyTorch's pipeline-parallelism API (torch.distributed.pipelining, introduced in 2.4) and DTensor (torch.distributed.tensor) are converging on a unified native path.
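The three constraints above can be turned into a tiny degree-selection sketch. Function and argument names here are illustrative, not any library's API:

```python
def choose_degrees(total_gpus: int, gpus_per_node: int, min_pp: int):
    """Pick (tp, pp, dp): TP capped at node size (NVLink scope),
    PP set by per-stage memory budget, DP fills the remainder.
    Illustrative sketch only."""
    tp = gpus_per_node
    pp = min_pp
    assert total_gpus % (tp * pp) == 0, "degrees must tile the cluster"
    dp = total_gpus // (tp * pp)
    return tp, pp, dp

# 16 nodes of 8 GPUs, pipeline needs at least 4 stages to fit:
tp, pp, dp = choose_degrees(total_gpus=128, gpus_per_node=8, min_pp=4)
assert (tp, pp, dp) == (8, 4, 4)
```

The assert on divisibility matters in practice: TP x PP x DP must equal the world size, so degrees are chosen from the divisors of the cluster shape.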
15.3 Lab: "Implement Tensor-Parallel Attention"
By hand, in pure PyTorch + torch.distributed:
1. Implement the Megatron-style tensor-parallel multi-head attention: column-parallel QKV projection, sharded heads, row-parallel output projection.
2. Verify numerically against a single-GPU reference for correctness (allclose to atol=1e-3).
3. Benchmark on 4 GPUs vs 1-GPU baseline. Compute scaling efficiency.
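Before writing the distributed version, the sharding math of steps 1-2 can be verified on one process. A numpy sketch that shards heads across simulated ranks and sums the row-parallel output-projection partials (all dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, world = 4, 8, 2        # 2 simulated TP ranks, 2 heads each
d_model = n_heads * d_head
seq = 5
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

def run_heads(Wq_s, Wk_s, Wv_s):
    """Attention over whichever heads these (column-sharded) QKV slices cover."""
    q, k, v = x @ Wq_s, x @ Wk_s, x @ Wv_s
    out = []
    for h in range(q.shape[1] // d_head):
        qs, ks, vs = (t[:, h * d_head:(h + 1) * d_head] for t in (q, k, v))
        out.append(softmax(qs @ ks.T / np.sqrt(d_head)) @ vs)
    return np.concatenate(out, axis=1)

# Reference: all heads on one "device".
ref = run_heads(Wq, Wk, Wv) @ Wo

# Tensor-parallel: column-shard QKV, row-shard Wo, sum partials ("allreduce").
per = d_model // world
partials = []
for r in range(world):
    sl = slice(r * per, (r + 1) * per)
    local = run_heads(Wq[:, sl], Wk[:, sl], Wv[:, sl])  # this rank's heads
    partials.append(local @ Wo[sl, :])                  # row-parallel Wo
tp_out = sum(partials)                                  # simulated allreduce

assert np.allclose(tp_out, ref)
```

Porting this to torch.distributed is then mostly a matter of replacing the loop over ranks with per-rank weight shards and the final sum with dist.all_reduce.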
15.4 Idiomatic & Diagnostic Drill
- Use `nsys profile` to see the timeline of allreduce vs compute. Tensor parallelism's signature is short, frequent allreduces.
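A typical invocation traces CUDA kernels plus NVTX ranges (which NCCL emits) into a report you can open in the Nsight Systems GUI; the script name is a placeholder:

```shell
# Trace CUDA kernels and NVTX/NCCL ranges for a few training steps
nsys profile -t cuda,nvtx -o tp_trace python train_tp.py
```

In the resulting timeline, TP shows up as many small `ncclAllReduce` calls interleaved with GEMMs; if those allreduces stretch out, TP has likely leaked across the NVLink boundary.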
15.5 Production Slice
- Capture the topology decision: for your hardware (e.g., 8x H100 per node, 16 nodes), what TP/PP/DP degrees do you choose for a 70B model? Document the math.
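One possible shape of the documented math, using the standard mixed-precision rule of thumb of 16 bytes of model state per parameter (bf16 weights + grads, fp32 master weights + Adam m and v); treat every number as an estimate, and note that ZeRO/FSDP would shard the optimizer states across DP as well:

```python
# Rough memory math for a 70B model on 128 GPUs (16 nodes x 8 H100-80GB).
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 w+g, fp32 master, Adam m, v
model_state_gb = params * bytes_per_param / 1e9   # 1120 GB before sharding

tp, pp = 8, 4                          # TP inside a node, PP across nodes
dp = 128 // (tp * pp)                  # DP fills the rest
per_gpu_gb = model_state_gb / (tp * pp)  # plain DP replicates model states

assert dp == 4
# ~35 GB of model state per GPU leaves headroom for activations on 80 GB.
assert 30 < per_gpu_gb < 40
```

Writing the math this way makes the trade-off explicit: raising PP lowers per-GPU state but grows the bubble, so the lab's bubble-fraction and memory numbers should be reported together.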