Week 13 - Communication Primitives: NCCL, Allreduce, Topology¶
13.1 Conceptual Core¶
- Distributed training reduces to collective communication: at certain points, every GPU's tensor must combine with every other GPU's. The fundamental collectives:
- Allreduce (sum tensors across all ranks; result on all ranks). The workhorse: used for gradient sync in data parallelism.
- Allgather (concatenate per-rank tensors; result on all ranks). Used for sharded ops.
- Reduce-scatter (sum then shard; each rank gets a piece). Used in ZeRO/FSDP.
- Broadcast (one rank → all). Initialization, parameter sync.
- All-to-all (every rank exchanges with every other). Used in MoE expert routing.
- NCCL (the NVIDIA Collective Communications Library) is the canonical implementation on NVIDIA GPUs; RCCL is AMD's equivalent. Both implement the same API.
- The algorithms matter:
- Ring-allreduce: the gold standard. The tensor is split into N chunks; each rank passes a chunk to its neighbor while accumulating (a reduce-scatter phase), then the completed chunks circulate (an allgather phase). After 2(N-1) steps (N = ranks), every rank holds the full sum. Bandwidth-optimal at scale.
- Tree-allreduce: lower latency for small messages, but worse bandwidth scaling for large ones.
- Hierarchical: ring within a node (NVLink), ring across nodes (InfiniBand). NCCL chooses automatically.
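The ring algorithm can be sketched in plain Python (no GPUs required) to make the two phases and the 2(N-1) step count concrete. `ring_allreduce` is a hypothetical helper name, and the lockstep loops are a simplification of what NCCL actually pipelines:

```python
def ring_allreduce(tensors):
    """Simulate ring-allreduce over N 'ranks'.

    tensors: list of N equal-length lists, one per rank (length divisible by N).
    Returns (per-rank buffers, total step count).
    """
    n = len(tensors)
    chunk = len(tensors[0]) // n
    bufs = [list(t) for t in tensors]  # each rank's private buffer
    steps = 0

    # Phase 1: reduce-scatter. In step s, rank src sends chunk (src - s) mod N
    # to its right neighbor, which adds it in. After N-1 steps, rank r holds
    # the fully reduced chunk (r + 1) mod N.
    for s in range(n - 1):
        sends = []
        for src in range(n):
            c = (src - s) % n  # chunk index src forwards this step
            sends.append(((src + 1) % n, c, bufs[src][c * chunk:(c + 1) * chunk]))
        for dst, c, data in sends:
            for i, v in enumerate(data):
                bufs[dst][c * chunk + i] += v
        steps += 1

    # Phase 2: allgather. Completed chunks circulate for another N-1 steps;
    # receivers overwrite rather than add.
    for s in range(n - 1):
        sends = []
        for src in range(n):
            c = (src + 1 - s) % n  # completed chunk src forwards this step
            sends.append(((src + 1) % n, c, bufs[src][c * chunk:(c + 1) * chunk]))
        for dst, c, data in sends:
            bufs[dst][c * chunk:(c + 1) * chunk] = data
        steps += 1

    return bufs, steps

ranks = [[1, 2, 3, 4, 5, 6],
         [10, 20, 30, 40, 50, 60],
         [100, 200, 300, 400, 500, 600]]
out, steps = ring_allreduce(ranks)
# steps == 4 == 2(N-1); every buffer ends as [111, 222, 333, 444, 555, 666]
```

Note the design point this exposes: each rank only ever talks to its two neighbors, so the algorithm's cost is set by link bandwidth, not by the number of ranks shouting at once.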
13.2 Mechanical Detail¶
- NVLink: GPU-to-GPU interconnect within a node. ~900 GB/s on H100 (NVLink 4). Matters because intra-node allreduce moves at NVLink speed, while inter-node moves at NIC speed (typically 200-400 Gbps InfiniBand, or 800 Gbps on newer fabrics, i.e. 25-100 GB/s: roughly 10-30× slower).
- Topology: 8-GPU H100 nodes typically use NVSwitch, giving full bisection bandwidth between any pair of GPUs. Cross-node traffic rides NDR/HDR InfiniBand or RoCE. The `nvidia-smi topo -m` command shows the connectivity matrix.
- `torch.distributed` wraps NCCL: `init_process_group(backend='nccl')`, then `dist.all_reduce(tensor, op=ReduceOp.SUM)`. Collectives are synchronous by default; passing `async_op=True` returns a `Work` handle instead.
- Ring-allreduce bandwidth analysis: for a tensor of B bytes across N ranks, each rank sends 2(N-1)/N · B bytes, so time ≈ 2(N-1)/N · B / link_bandwidth. Forgetting the factor of ~2 understates the true cost of gradient sync.
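The bandwidth analysis above translates into a one-line cost model. A sketch; `allreduce_time` is a hypothetical helper, and the bandwidth figures in the example are assumed round numbers, not measurements:

```python
def allreduce_time(msg_bytes, n_ranks, link_gbs):
    """Predicted ring-allreduce time in seconds.

    Each rank moves 2(N-1)/N * B bytes over its slowest link;
    link_gbs is that link's bandwidth in GB/s.
    """
    traffic = 2 * (n_ranks - 1) / n_ranks * msg_bytes
    return traffic / (link_gbs * 1e9)

# Illustrative: a 1 GiB gradient bucket on 8 ranks (assumed bandwidths).
B = 1 << 30
print(f"intra-node (assume 450 GB/s): {allreduce_time(B, 8, 450) * 1e3:.2f} ms")
print(f"inter-node (assume  50 GB/s): {allreduce_time(B, 8, 50) * 1e3:.2f} ms")
```

Running the same numbers for both link speeds makes the intra/inter-node gap in 13.2 concrete: the message is identical, only the link bandwidth in the denominator changes.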
13.3 Lab: "Allreduce Bench"¶
On at least 2 GPUs (single node fine), run an allreduce benchmark:
1. torch.distributed.all_reduce on tensors from 1 KB to 1 GB.
2. Compute achieved bandwidth (= 2(N-1)/N · message_size / time).
3. Plot bandwidth vs message size; identify the message size at which BW saturates (the "knee").
4. If you have access: run on 8 GPUs via single node (NVLink) and compare to 8 GPUs across 2 nodes (InfiniBand). Document the gap.
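For steps 2 and 3, the post-processing can be as simple as the sketch below. `achieved_bw` and `find_knee` are hypothetical helpers, and the sample timings in the usage line are fabricated purely to show the shape of the analysis:

```python
def achieved_bw(msg_bytes, seconds, n_ranks):
    """Achieved (bus) bandwidth in GB/s: 2(N-1)/N * size / time."""
    return 2 * (n_ranks - 1) / n_ranks * msg_bytes / seconds / 1e9

def find_knee(samples, n_ranks, frac=0.9):
    """Smallest message size whose bandwidth reaches `frac` of the peak.

    samples: list of (msg_bytes, seconds) pairs from the benchmark.
    """
    bws = [(size, achieved_bw(size, t, n_ranks)) for size, t in samples]
    peak = max(bw for _, bw in bws)
    return min(size for size, bw in bws if bw >= frac * peak)

# Fabricated (size, time) pairs for illustration only:
samples = [(1 << 10, 50e-6), (1 << 20, 60e-6), (1 << 26, 650e-6), (1 << 30, 10e-3)]
print(find_knee(samples, n_ranks=8))  # first size within 90% of peak BW
```

Expect small messages to be latency-dominated (terrible bandwidth) and the curve to flatten once messages are large enough to saturate the link; the knee is where you want your gradient buckets to land.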
13.4 Idiomatic & Diagnostic Drill¶
`NCCL_DEBUG=INFO` produces verbose NCCL output. Read one full session's output; identify the chosen algorithm (ring/tree/CollNet) and the topology NCCL inferred.
13.5 Production Slice¶
- Document your cluster's NIC count, IB rail layout, and GPU-NIC affinity. NCCL's performance hinges on the `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, and `NCCL_TOPO_FILE` settings; wrong defaults can halve throughput.
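A hedged example of pinning these settings. The device and interface names below are placeholders; read the real ones from `nvidia-smi topo -m` and `ibstat` on your own cluster:

```shell
# Placeholder values -- substitute your cluster's actual HCAs and interfaces.
export NCCL_IB_HCA=mlx5_0,mlx5_1   # restrict NCCL to the HCAs with GPU affinity
export NCCL_SOCKET_IFNAME=eth0     # interface for NCCL's bootstrap traffic
export NCCL_DEBUG=INFO             # log which devices NCCL actually selected
# export NCCL_TOPO_FILE=/path/to/topo.xml  # only if auto-detection gets the topology wrong
```

Rerun the week's allreduce benchmark after any change here; the debug log plus the measured bandwidth tells you whether NCCL actually picked the devices you intended.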