Week 13 - Communication Primitives: NCCL, Allreduce, Topology¶
13.1 Conceptual Core¶
- Distributed training reduces to collective communication: at certain points, every GPU's tensor must combine with every other GPU's. The fundamental collectives:
- Allreduce (sum tensors across all ranks; result on all ranks). The workhorse: used for gradient sync in data parallelism.
- Allgather (concatenate per-rank tensors; result on all ranks). Used for sharded ops.
- Reduce-scatter (sum then shard; each rank gets a piece). Used in ZeRO/FSDP.
- Broadcast (one rank → all). Initialization, parameter sync.
- All-to-all (every rank exchanges with every other). Used in MoE expert routing.
- NCCL (the NVIDIA Collective Communications Library) is the canonical implementation on NVIDIA GPUs; RCCL is AMD's equivalent. Both implement the same API.
- The algorithms matter:
- Ring-allreduce: the gold standard. The tensor is split into N chunks; each rank passes a chunk to its neighbor while accumulating (a reduce-scatter phase), then the completed chunks circulate (an allgather phase). After 2(N-1) steps (N = ranks), every rank holds the full sum. Bandwidth-optimal at scale.
- Tree-allreduce: lower latency for small messages, but worse bandwidth scaling for large ones.
- Hierarchical: ring within a node (NVLink), ring across nodes (InfiniBand). NCCL chooses automatically.
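The ring algorithm can be sketched in plain Python (no GPUs required) to make the two phases and the 2(N-1) step count concrete. `ring_allreduce` is a hypothetical helper name, and the lockstep loops are a simplification of what NCCL actually pipelines:

```python
def ring_allreduce(tensors):
    """Simulate ring-allreduce over N 'ranks'.

    tensors: list of N equal-length lists, one per rank (length divisible by N).
    Returns (per-rank buffers, total step count).
    """
    n = len(tensors)
    chunk = len(tensors[0]) // n
    bufs = [list(t) for t in tensors]  # each rank's private buffer
    steps = 0

    # Phase 1: reduce-scatter. In step s, rank src sends chunk (src - s) mod N
    # to its right neighbor, which adds it in. After N-1 steps, rank r holds
    # the fully reduced chunk (r + 1) mod N.
    for s in range(n - 1):
        sends = []
        for src in range(n):
            c = (src - s) % n  # chunk index src forwards this step
            sends.append(((src + 1) % n, c, bufs[src][c * chunk:(c + 1) * chunk]))
        for dst, c, data in sends:
            for i, v in enumerate(data):
                bufs[dst][c * chunk + i] += v
        steps += 1

    # Phase 2: allgather. Completed chunks circulate for another N-1 steps;
    # receivers overwrite rather than add.
    for s in range(n - 1):
        sends = []
        for src in range(n):
            c = (src + 1 - s) % n  # completed chunk src forwards this step
            sends.append(((src + 1) % n, c, bufs[src][c * chunk:(c + 1) * chunk]))
        for dst, c, data in sends:
            bufs[dst][c * chunk:(c + 1) * chunk] = data
        steps += 1

    return bufs, steps

ranks = [[1, 2, 3, 4, 5, 6],
         [10, 20, 30, 40, 50, 60],
         [100, 200, 300, 400, 500, 600]]
out, steps = ring_allreduce(ranks)
# steps == 4 == 2(N-1); every buffer ends as [111, 222, 333, 444, 555, 666]
```

Note the design point this exposes: each rank only ever talks to its two neighbors, so the algorithm's cost is set by link bandwidth, not by the number of ranks shouting at once.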
13.2 Mechanical Detail¶
- NVLink: GPU-to-GPU interconnect within a node. ~900 GB/s on H100 (NVLink 4). Matters because intra-node allreduce moves at NVLink speed, while inter-node moves at NIC speed (typically 200-400 Gbps InfiniBand, or 800 Gbps on newer fabrics, i.e. 25-100 GB/s: roughly 10-30× slower).
- Topology: 8-GPU H100 nodes typically use NVSwitch, giving full bisection bandwidth between any pair of GPUs. Cross-node traffic rides NDR/HDR InfiniBand or RoCE. The `nvidia-smi topo -m` command shows the connectivity matrix.
- `torch.distributed` wraps NCCL: `init_process_group(backend='nccl')`, then `dist.all_reduce(tensor, op=ReduceOp.SUM)`. Collectives are synchronous by default; passing `async_op=True` returns a `Work` handle instead.
- Ring-allreduce bandwidth analysis: for a tensor of B bytes across N ranks, each rank sends 2(N-1)/N · B bytes, so time ≈ 2(N-1)/N · B / link_bandwidth. Forgetting the factor of ~2 understates the true cost of gradient sync.
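The bandwidth analysis above translates into a one-line cost model. A sketch; `allreduce_time` is a hypothetical helper, and the bandwidth figures in the example are assumed round numbers, not measurements:

```python
def allreduce_time(msg_bytes, n_ranks, link_gbs):
    """Predicted ring-allreduce time in seconds.

    Each rank moves 2(N-1)/N * B bytes over its slowest link;
    link_gbs is that link's bandwidth in GB/s.
    """
    traffic = 2 * (n_ranks - 1) / n_ranks * msg_bytes
    return traffic / (link_gbs * 1e9)

# Illustrative: a 1 GiB gradient bucket on 8 ranks (assumed bandwidths).
B = 1 << 30
print(f"intra-node (assume 450 GB/s): {allreduce_time(B, 8, 450) * 1e3:.2f} ms")
print(f"inter-node (assume  50 GB/s): {allreduce_time(B, 8, 50) * 1e3:.2f} ms")
```

Running the same numbers for both link speeds makes the intra/inter-node gap in 13.2 concrete: the message is identical, only the link bandwidth in the denominator changes.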
13.3 Lab: "Allreduce Bench"¶
On at least 2 GPUs (single node fine), run an allreduce benchmark:
1. torch.distributed.all_reduce on tensors from 1 KB to 1 GB.
2. Compute achieved bandwidth (= 2(N-1)/N · message_size / time).
3. Plot bandwidth vs message size; identify the message size at which BW saturates (the "knee").
4. If you have access: run on 8 GPUs via single node (NVLink) and compare to 8 GPUs across 2 nodes (InfiniBand). Document the gap.
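For steps 2 and 3, the post-processing can be as simple as the sketch below. `achieved_bw` and `find_knee` are hypothetical helpers, and the sample timings in the usage line are fabricated purely to show the shape of the analysis:

```python
def achieved_bw(msg_bytes, seconds, n_ranks):
    """Achieved (bus) bandwidth in GB/s: 2(N-1)/N * size / time."""
    return 2 * (n_ranks - 1) / n_ranks * msg_bytes / seconds / 1e9

def find_knee(samples, n_ranks, frac=0.9):
    """Smallest message size whose bandwidth reaches `frac` of the peak.

    samples: list of (msg_bytes, seconds) pairs from the benchmark.
    """
    bws = [(size, achieved_bw(size, t, n_ranks)) for size, t in samples]
    peak = max(bw for _, bw in bws)
    return min(size for size, bw in bws if bw >= frac * peak)

# Fabricated (size, time) pairs for illustration only:
samples = [(1 << 10, 50e-6), (1 << 20, 60e-6), (1 << 26, 650e-6), (1 << 30, 10e-3)]
print(find_knee(samples, n_ranks=8))  # first size within 90% of peak BW
```

Expect small messages to be latency-dominated (terrible bandwidth) and the curve to flatten once messages are large enough to saturate the link; the knee is where you want your gradient buckets to land.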
13.4 Idiomatic & Diagnostic Drill¶
`NCCL_DEBUG=INFO` produces verbose NCCL output. Read one full session's output; identify the chosen algorithm (ring/tree/CollNet) and the topology NCCL inferred.
13.5 Production Slice¶
- Document your cluster's NIC count, IB rail layout, and GPU-NIC affinity. NCCL's performance hinges on the `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, and `NCCL_TOPO_FILE` settings; wrong defaults can halve throughput.
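A hedged example of pinning these settings. The device and interface names below are placeholders; read the real ones from `nvidia-smi topo -m` and `ibstat` on your own cluster:

```shell
# Placeholder values -- substitute your cluster's actual HCAs and interfaces.
export NCCL_IB_HCA=mlx5_0,mlx5_1   # restrict NCCL to the HCAs with GPU affinity
export NCCL_SOCKET_IFNAME=eth0     # interface for NCCL's bootstrap traffic
export NCCL_DEBUG=INFO             # log which devices NCCL actually selected
# export NCCL_TOPO_FILE=/path/to/topo.xml  # only if auto-detection gets the topology wrong
```

Rerun the week's allreduce benchmark after any change here; the debug log plus the measured bandwidth tells you whether NCCL actually picked the devices you intended.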