
Week 15 - Tensor Parallelism and Pipeline Parallelism

15.1 Conceptual Core

  • Tensor parallelism (TP): shard individual layers across ranks. The classic Megatron-LM approach (Shoeybi et al., 2019) has three building blocks (a minimal sketch of the two parallel linear layers follows this list):
      • Column-parallel linear: shard the weight matrix along the output dim. Each rank computes a slice of the output; allgather to combine if the full output is needed.
      • Row-parallel linear: shard along the input dim. Each rank computes a partial output; allreduce to sum.
      • Attention: shard heads across ranks. Each rank computes its assigned heads and its slice of the row-parallel output projection; a single allreduce then sums the partial outputs.
  • TP requires fast (intra-node, NVLink) communication. Across nodes, communication latency dominates.
  • Pipeline parallelism (PP): split layers across ranks. Each rank holds layers L_i to L_j. Activations flow forward; gradients flow backward.
  • Naive PP leaves most ranks idle (the pipeline bubble). GPipe (Huang et al., 2018) splits each batch into microbatches to fill the pipeline; the 1F1B schedule (one forward, one backward per microbatch, introduced by PipeDream) further shrinks the bubble and caps activation memory.
  • 3D parallelism: combine TP (intra-node), PP (inter-node), and DP (across replicas of the TP × PP grid). The GPU grid is factored along three axes: TP and PP shard the model, and DP replicates the sharded copy. The GPT-3 and Llama 3 405B training recipes both combine these axes (Llama 3 adds context parallelism as a fourth).
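A minimal sketch of the two parallel linear layers in plain PyTorch, assuming the process group is already initialized (e.g., under torchrun). The class names mirror Megatron's, but this is an illustration, not its API:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Weight sharded along the output dim; each rank emits a slice of Y."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.weight = nn.Parameter(
            torch.randn(out_features // world_size, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No communication here: a following row-parallel layer consumes
        # exactly this rank's slice of the output features.
        return x @ self.weight.t()

class RowParallelLinear(nn.Module):
    """Weight sharded along the input dim; ranks hold partial sums of Y."""
    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert in_features % world_size == 0
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features // world_size) * 0.02)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        partial = x_shard @ self.weight.t()
        dist.all_reduce(partial)  # sum the per-rank partial outputs
        return partial
```

Chaining column-parallel into row-parallel is what lets a whole MLP block run with a single allreduce in the forward pass.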

15.2 Mechanical Detail

  • TP communication: per-layer allreduce in row-parallel; allgather in column-parallel. Latency-sensitive; do TP within a node where NVLink bandwidth dominates.
  • PP communication: only at stage boundaries; activations forward + gradients backward. Bandwidth-friendly; do PP across nodes.
  • DP: gradient sync once per step; very bandwidth-friendly; do DP across replicas of the whole TP × PP grid.
  • Which form of parallelism goes on which hardware axis is governed by:
      • TP degree ≤ GPUs per node (NVLink scope).
      • PP degree set by memory needs (each stage must hold its layers plus in-flight activations).
      • DP fills the remaining GPUs.
  • Megatron-LM and DeepSpeed are the two main open-source 3D-parallelism stacks. PyTorch's pipeline-parallelism API (torch.distributed.pipelining, added in 2.4) and DTensor (torch.distributed.tensor) are converging on a unified native path; a device-mesh sketch follows this list.
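A minimal sketch of carving the GPU grid with PyTorch's DeviceMesh, assuming 128 GPUs (16 nodes of 8) launched via torchrun; the degrees (dp=4, pp=4, tp=8) are illustrative, not a recommendation:

```python
from torch.distributed.device_mesh import init_device_mesh

# tp is the innermost (fastest-varying) dim, so each tp group's 8 ranks
# share a node and communicate over NVLink; pp and dp groups span nodes.
mesh = init_device_mesh("cuda", (4, 4, 8), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()  # per-layer allreduces live here
dp_group = mesh["dp"].get_group()  # once-per-step gradient sync lives here
```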

15.3 Lab: "Implement Tensor-Parallel Attention"

By hand, in pure PyTorch + torch.distributed:

1. Implement Megatron-style tensor-parallel multi-head attention: column-parallel QKV projection, sharded heads, row-parallel output projection.
2. Verify numerically against a single-GPU reference for correctness (allclose to atol=1e-3).
3. Benchmark on 4 GPUs vs. the 1-GPU baseline; compute scaling efficiency.
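A hedged skeleton for step 1, assuming a launch like torchrun --nproc_per_node=4 tp_attention.py; the file name and function names are illustrative:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def tp_attention(x, wqkv_shard, wo_shard, n_local_heads, head_dim):
    """x: (batch, seq, d_model), replicated on every rank.
    wqkv_shard: column-parallel slice of the fused QKV weight,
                shape (3 * n_local_heads * head_dim, d_model).
    wo_shard:   row-parallel slice of the output projection,
                shape (d_model, n_local_heads * head_dim)."""
    b, s, _ = x.shape
    qkv = x @ wqkv_shard.t()                 # column-parallel: no comm
    q, k, v = qkv.chunk(3, dim=-1)
    # Reshape to (batch, local_heads, seq, head_dim) for attention.
    q = q.view(b, s, n_local_heads, head_dim).transpose(1, 2)
    k = k.view(b, s, n_local_heads, head_dim).transpose(1, 2)
    v = v.view(b, s, n_local_heads, head_dim).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)
    out = out.transpose(1, 2).reshape(b, s, n_local_heads * head_dim)
    out = out @ wo_shard.t()                 # row-parallel: partial output
    dist.all_reduce(out)                     # sum partials across tp ranks
    return out
```

For step 2, one workable approach is to materialize the full weights on rank 0, broadcast them, slice per rank, and compare the allreduced output against a plain single-process multi-head attention on the same inputs.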

15.4 Idiomatic & Diagnostic Drill

  • Use nsys profile (e.g., nsys profile -t cuda,nvtx -o tp_trace <launch command>) to see the timeline of allreduce vs. compute. Tensor parallelism's signature is short, frequent allreduces (NCCL kernels) interleaved with GEMMs.

15.5 Production Slice

  • Capture the topology decision: for your hardware (e.g., 8x H100 per node, 16 nodes), what TP/PP/DP degrees do you choose for a 70B model? Document the math; a back-of-envelope starting point is sketched below.
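A back-of-envelope sketch, assuming bf16 weights and gradients, Adam with fp32 master weights and moments (16 bytes of state per parameter), plain DP with no ZeRO, and ignoring activations; all numbers are illustrative:

```python
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 w + bf16 grad + fp32 master + Adam m, v
tp, pp, dp = 8, 4, 4                  # 8 * 4 * 4 = 128 GPUs = 16 nodes x 8

model_shards = tp * pp                # DP replicates state, so it doesn't shrink it
per_gpu_gb = params * bytes_per_param / model_shards / 1e9
print(f"~{per_gpu_gb:.0f} GB of weights+optimizer per GPU")  # ~35 GB of an 80 GB H100
```

TP = 8 saturates the NVLink scope of one node, PP = 4 brings the per-GPU state under the 80 GB budget with headroom for activations, and DP = 4 fills the remaining GPUs.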
