Week 5 - GPU Hardware Architecture
5.1 Conceptual Core
- A modern NVIDIA GPU (Hopper H100 used here as the canonical example) consists of:
- 132 SMs (Streaming Multiprocessors), each functionally an independent processor.
- 80 GB HBM3 at ~3 TB/s.
- 50 MB L2 shared across SMs.
- Per-SM resources: a 256 KB register file, 256 KB of combined L1/shared memory, 128 FP32 CUDA cores, and 4 tensor cores.
- The GPU runs threads in groups of 32 called warps. Threads in a warp execute the same instruction in lockstep (SIMT: single instruction, multiple threads). Branching within a warp causes divergence: the warp executes both branches serially, masking off the inactive lanes.
- A block is a group of warps (up to 32 warps = 1024 threads) that share an SM, share L1/shared memory, and can synchronize via `__syncthreads()`.
- A grid is the set of blocks for a kernel launch. Blocks within a grid do not share state (other than HBM) and do not synchronize. The minimal kernel sketched below illustrates this hierarchy.
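A minimal CUDA sketch of the hierarchy above; the kernel name, the 256-thread block size, and the reverse-within-block operation are illustrative. Each block stages its slice of the input into shared memory, synchronizes with `__syncthreads()`, and then every thread reads a value written by another thread of the same block.

```cuda
#include <cuda_runtime.h>

// One block = up to 32 warps sharing one SM's L1/shared memory.
// Threads cooperate through shared memory and __syncthreads(); blocks do not.
__global__ void reverse_within_block(const float* in, float* out, int n) {
    extern __shared__ float smem[];                    // dynamic shared memory, one float per thread
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index across the grid
    if (gid < n) smem[threadIdx.x] = in[gid];
    __syncthreads();                                   // every warp in the block reaches this barrier
    int rev = blockDim.x - 1 - threadIdx.x;            // read a slot another thread wrote
    if (gid < n && blockIdx.x * blockDim.x + rev < n)
        out[gid] = smem[rev];
}

// Launch shape (illustrative): a grid of ceil(n/256) blocks, 256 threads (8 warps) per block,
// 256 * sizeof(float) bytes of dynamic shared memory per block:
//   reverse_within_block<<<(n + 255) / 256, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```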
5.2 Mechanical Detail
- Warp scheduling: each SM has 4 warp schedulers; each cycle, each scheduler picks one ready warp from up to 16 resident warps and issues an instruction. Latency is hidden by switching warps, not by speculation as on CPUs. This is why you want many resident warps ("occupancy"); see the occupancy query sketched after this list.
- Memory tiers per-SM:
- Registers: 64 K 32-bit registers (256 KB) per SM, at most 255 per thread; kernels that need more spill to local memory (slow).
- Shared memory: 228 KB per SM on H100 (configurable split with L1).
- L1 cache: shares budget with shared memory.
- Tensor cores: specialized matmul-accumulate units. On H100 each tensor core does an 8x4 BF16 GEMM per cycle; aggregated across the chip, ~989 TFLOPS dense BF16. They demand specific data layouts (16x16 tiles in registers); the `wmma` and `mma.sync` PTX instructions expose them (see the WMMA sketch after this list).
- Async copy (`cp.async`): copy from HBM to shared memory in the background; the thread continues other work and syncs later. This is the foundation for software pipelining (used in FlashAttention); a `memcpy_async` sketch follows this list.
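To make the occupancy point concrete, the runtime can report how many blocks of a given kernel fit on one SM at once. A minimal sketch using `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; the trivial `dummy_kernel` and the 256-thread block size are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* x) {
    x[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;  // stand-in for a real kernel
}

int main() {
    int blocks_per_sm = 0;
    // How many 256-thread blocks of dummy_kernel can be resident on one SM,
    // given its register and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy_kernel,
                                                  /*blockSize=*/256, /*dynSharedMem=*/0);
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    int max_warps = p.maxThreadsPerMultiProcessor / 32;
    printf("resident blocks/SM: %d -> %d of %d possible resident warps\n",
           blocks_per_sm, blocks_per_sm * 256 / 32, max_warps);
    return 0;
}
```

More resident warps give the four schedulers more ready warps to choose from each cycle, which is exactly how memory latency gets hidden.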
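A minimal sketch of driving the tensor cores through the CUDA WMMA API (which compiles down to the `wmma`/`mma.sync` PTX instructions). Assumptions: one warp computes a single 16x16x16 tile, inputs are FP16 for brevity (the same API covers BF16 via `__nv_bfloat16`), and the operands already sit in global memory; a real kernel would tile over K and stage operands through shared memory.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B + 0 for a single 16x16x16 tile.
// The fragments live in the warp's registers, as noted above.
__global__ void wmma_tile(const __half* A, const __half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);                       // zero the accumulator
    wmma::load_matrix_sync(a, A, 16);                     // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);                       // the tensor-core matmul-accumulate
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

// Launch with exactly one warp for this single tile:
//   wmma_tile<<<1, 32>>>(d_A, d_B, d_D);
```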
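And a sketch of the async-copy pattern: `cooperative_groups::memcpy_async` lowers to `cp.async` on Ampere/Hopper, staging a tile of HBM into shared memory while the block does other work. The kernel name, tile size, and the trivial scale operation are illustrative; FlashAttention-style kernels build a multi-stage pipeline out of exactly this primitive.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void scale_tile(const float* __restrict__ in, float* __restrict__ out, float alpha) {
    extern __shared__ float tile[];                       // blockDim.x floats
    cg::thread_block block = cg::this_thread_block();
    size_t base = (size_t)blockIdx.x * blockDim.x;

    // Kick off the global->shared copy; no thread stalls on HBM latency here.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * blockDim.x);

    // ... independent work (e.g. computing on the previously staged tile) overlaps here ...

    cg::wait(block);                                      // copy complete + block-wide barrier
    out[base + threadIdx.x] = alpha * tile[threadIdx.x];
}

// Launch (illustrative): scale_tile<<<num_tiles, 256, 256 * sizeof(float)>>>(d_in, d_out, 2.0f);
```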
5.3 Lab - "Inspect Your Hardware"
- Run `nvidia-smi` and `nvidia-smi -q`. Read every line.
- Compile and run NVIDIA's `deviceQuery` sample. It prints all the numbers above for your specific GPU.
- Compile and run `bandwidthTest` (CUDA samples). Compare measured PCIe and HBM bandwidth to spec.
- Compute: at the measured HBM bandwidth and compute peak of your GPU, what is the arithmetic-intensity break-even point? Sketch the roofline (a worked sketch follows).
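A minimal probe in the spirit of `deviceQuery`, plus the break-even arithmetic: with the ~989 TFLOPS dense BF16 and ~3 TB/s figures quoted above, the roofline ridge point sits near 989e12 / 3e12 ≈ 330 FLOPs per byte of HBM traffic. The hard-coded peaks are datasheet values (assumed H100 SXM here); substitute your GPU's.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s: %d SMs, %.0f GB HBM, %zu KB shared/SM, %d KB L2\n",
           p.name, p.multiProcessorCount, p.totalGlobalMem / 1e9,
           p.sharedMemPerMultiprocessor / 1024, p.l2CacheSize / 1024);

    // Roofline break-even (ridge point): the arithmetic intensity at which a kernel
    // stops being memory-bound, AI* = peak FLOP/s / peak bytes/s.
    const double peak_flops = 989e12;   // dense BF16 peak from the datasheet (assumption: H100 SXM)
    const double peak_bw    = 3.0e12;   // ~3 TB/s HBM3, as quoted above
    printf("break-even arithmetic intensity: %.0f FLOP/byte\n", peak_flops / peak_bw);
    return 0;
}
```

Kernels with lower intensity than the ridge point (most things other than large matmuls) are bandwidth-bound on such a part.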
5.4 Idiomatic & Diagnostic Drill
- Install Nsight Systems (`nsys`) and Nsight Compute (`ncu`). They are the tools. Familiarize yourself with the GUIs; learn `nsys profile -o trace.qdrep ./prog` and `ncu --set full ./prog`.
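One habit that makes `nsys` timelines legible is wrapping phases of the host code in NVTX ranges. A minimal sketch with illustrative range and function names; link against `-lnvToolsExt` (or use the header-only nvtx3 C++ API shipped with recent CUDA toolkits).

```cuda
#include <nvToolsExt.h>

// Named ranges show up as labelled bars on the nsys timeline, making it easy
// to attribute kernels and memcpys to phases of the program.
void train_step() {
    nvtxRangePushA("forward");
    // ... launch forward kernels ...
    nvtxRangePop();

    nvtxRangePushA("backward");
    // ... launch backward kernels ...
    nvtxRangePop();
}
```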
5.5 Production Slice
- Document your GPU fleet's exact SKUs in a `HARDWARE.md`: model, HBM size, peak bandwidth, peak BF16/FP16/FP8 FLOPS, TDP. Cluster-ops decisions hinge on this. A template sketch follows.
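A sketch of what one `HARDWARE.md` row might look like; the H100 figures mirror the numbers quoted in 5.1, and the remaining cells are placeholders to fill from the vendor datasheet and your own `bandwidthTest` measurements.

```markdown
| SKU       | HBM   | Peak BW (spec / measured) | Peak BF16 dense | Peak FP8 dense | TDP         |
|-----------|-------|---------------------------|-----------------|----------------|-------------|
| H100 SXM5 | 80 GB | ~3 TB/s / (measured)      | ~989 TFLOPS     | (datasheet)    | (datasheet) |
```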