Week 5 - GPU Hardware Architecture
5.1 Conceptual Core
- A modern NVIDIA GPU (Hopper H100 used here as the canonical example) consists of:
- 132 SMs (Streaming Multiprocessors), each functionally an independent processor.
- 80 GB HBM3 at ~3 TB/s.
- 50 MB L2 shared across SMs.
- Per-SM resources: a 256 KB register file, 256 KB of combined L1/shared memory, 128 FP32 CUDA cores, and 4 tensor cores.
- The GPU runs threads in groups of 32 called warps. Threads in a warp execute the same instruction in lockstep (SIMT: single instruction, multiple threads). Branching within a warp causes divergence: the warp executes both branches serially, masking off the inactive lanes.
- A block is a group of warps (up to 32 warps = 1024 threads) that share an SM, share L1/shared memory, and can synchronize via `__syncthreads()`.
- A grid is the set of blocks for a kernel launch. Blocks within a grid do not share state (other than HBM) and do not synchronize. The minimal kernel sketched below illustrates this hierarchy.
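A minimal CUDA sketch of the hierarchy above; the kernel name, the 256-thread block size, and the reverse-within-block operation are illustrative. Each block stages its slice of the input into shared memory, synchronizes with `__syncthreads()`, and then every thread reads a value written by another thread of the same block.

```cuda
#include <cuda_runtime.h>

// One block = up to 32 warps sharing one SM's L1/shared memory.
// Threads cooperate through shared memory and __syncthreads(); blocks do not.
__global__ void reverse_within_block(const float* in, float* out, int n) {
    extern __shared__ float smem[];                    // dynamic shared memory, one float per thread
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index across the grid
    if (gid < n) smem[threadIdx.x] = in[gid];
    __syncthreads();                                   // every warp in the block reaches this barrier
    int rev = blockDim.x - 1 - threadIdx.x;            // read a slot another thread wrote
    if (gid < n && blockIdx.x * blockDim.x + rev < n)
        out[gid] = smem[rev];
}

// Launch shape (illustrative): a grid of ceil(n/256) blocks, 256 threads (8 warps) per block,
// 256 * sizeof(float) bytes of dynamic shared memory per block:
//   reverse_within_block<<<(n + 255) / 256, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```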
5.2 Mechanical Detail
- Warp scheduling: each SM has 4 warp schedulers; each cycle, each scheduler picks one ready warp from up to 16 resident warps and issues an instruction. Latency is hidden by switching warps, not by speculation as on CPUs. This is why you want many resident warps ("occupancy"); see the occupancy query sketched after this list.
- Memory tiers per-SM:
- Registers: 64 K 32-bit registers (256 KB) per SM, at most 255 per thread; kernels that need more spill to local memory (slow).
- Shared memory: 228 KB per SM on H100 (configurable split with L1).
- L1 cache: shares budget with shared memory.
- Tensor cores: specialized matmul-accumulate units. On H100 each tensor core does an 8x4 BF16 GEMM per cycle; aggregated across the chip, ~989 TFLOPS dense BF16. They demand specific data layouts (16x16 tiles in registers); the `wmma` and `mma.sync` PTX instructions expose them (see the WMMA sketch after this list).
- Async copy (`cp.async`): copy from HBM to shared memory in the background; the thread continues other work and syncs later. This is the foundation for software pipelining (used in FlashAttention); a `memcpy_async` sketch follows this list.
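To make the occupancy point concrete, the runtime can report how many blocks of a given kernel fit on one SM at once. A minimal sketch using `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; the trivial `dummy_kernel` and the 256-thread block size are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* x) {
    x[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;  // stand-in for a real kernel
}

int main() {
    int blocks_per_sm = 0;
    // How many 256-thread blocks of dummy_kernel can be resident on one SM,
    // given its register and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy_kernel,
                                                  /*blockSize=*/256, /*dynSharedMem=*/0);
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    int max_warps = p.maxThreadsPerMultiProcessor / 32;
    printf("resident blocks/SM: %d -> %d of %d possible resident warps\n",
           blocks_per_sm, blocks_per_sm * 256 / 32, max_warps);
    return 0;
}
```

More resident warps give the four schedulers more ready warps to choose from each cycle, which is exactly how memory latency gets hidden.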
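A minimal sketch of driving the tensor cores through the CUDA WMMA API (which compiles down to the `wmma`/`mma.sync` PTX instructions). Assumptions: one warp computes a single 16x16x16 tile, inputs are FP16 for brevity (the same API covers BF16 via `__nv_bfloat16`), and the operands already sit in global memory; a real kernel would tile over K and stage operands through shared memory.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B + 0 for a single 16x16x16 tile.
// The fragments live in the warp's registers, as noted above.
__global__ void wmma_tile(const __half* A, const __half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);                       // zero the accumulator
    wmma::load_matrix_sync(a, A, 16);                     // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);                       // the tensor-core matmul-accumulate
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

// Launch with exactly one warp for this single tile:
//   wmma_tile<<<1, 32>>>(d_A, d_B, d_D);
```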
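And a sketch of the async-copy pattern: `cooperative_groups::memcpy_async` lowers to `cp.async` on Ampere/Hopper, staging a tile of HBM into shared memory while the block does other work. The kernel name, tile size, and the trivial scale operation are illustrative; FlashAttention-style kernels build a multi-stage pipeline out of exactly this primitive.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void scale_tile(const float* __restrict__ in, float* __restrict__ out, float alpha) {
    extern __shared__ float tile[];                       // blockDim.x floats
    cg::thread_block block = cg::this_thread_block();
    size_t base = (size_t)blockIdx.x * blockDim.x;

    // Kick off the global->shared copy; no thread stalls on HBM latency here.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * blockDim.x);

    // ... independent work (e.g. computing on the previously staged tile) overlaps here ...

    cg::wait(block);                                      // copy complete + block-wide barrier
    out[base + threadIdx.x] = alpha * tile[threadIdx.x];
}

// Launch (illustrative): scale_tile<<<num_tiles, 256, 256 * sizeof(float)>>>(d_in, d_out, 2.0f);
```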
5.3 Lab - "Inspect Your Hardware"
- Run `nvidia-smi` and `nvidia-smi -q`. Read every line.
- Compile and run NVIDIA's `deviceQuery` sample. It prints all the numbers above for your specific GPU.
- Compile and run `bandwidthTest` (CUDA samples). Compare measured PCIe and HBM bandwidth to spec.
- Compute: at the measured HBM bandwidth and compute peak of your GPU, what is the arithmetic-intensity break-even point? Sketch the roofline (a worked sketch follows).
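A minimal probe in the spirit of `deviceQuery`, plus the break-even arithmetic: with the ~989 TFLOPS dense BF16 and ~3 TB/s figures quoted above, the roofline ridge point sits near 989e12 / 3e12 ≈ 330 FLOPs per byte of HBM traffic. The hard-coded peaks are datasheet values (assumed H100 SXM here); substitute your GPU's.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s: %d SMs, %.0f GB HBM, %zu KB shared/SM, %d KB L2\n",
           p.name, p.multiProcessorCount, p.totalGlobalMem / 1e9,
           p.sharedMemPerMultiprocessor / 1024, p.l2CacheSize / 1024);

    // Roofline break-even (ridge point): the arithmetic intensity at which a kernel
    // stops being memory-bound, AI* = peak FLOP/s / peak bytes/s.
    const double peak_flops = 989e12;   // dense BF16 peak from the datasheet (assumption: H100 SXM)
    const double peak_bw    = 3.0e12;   // ~3 TB/s HBM3, as quoted above
    printf("break-even arithmetic intensity: %.0f FLOP/byte\n", peak_flops / peak_bw);
    return 0;
}
```

Kernels with lower intensity than the ridge point (most things other than large matmuls) are bandwidth-bound on such a part.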
5.4 Idiomatic & Diagnostic Drill
- Install Nsight Systems (`nsys`) and Nsight Compute (`ncu`). They are the tools. Familiarize yourself with the GUIs; learn `nsys profile -o trace.qdrep ./prog` and `ncu --set full ./prog`.
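One habit that makes `nsys` timelines legible is wrapping phases of the host code in NVTX ranges. A minimal sketch with illustrative range and function names; link against `-lnvToolsExt` (or use the header-only nvtx3 C++ API shipped with recent CUDA toolkits).

```cuda
#include <nvToolsExt.h>

// Named ranges show up as labelled bars on the nsys timeline, making it easy
// to attribute kernels and memcpys to phases of the program.
void train_step() {
    nvtxRangePushA("forward");
    // ... launch forward kernels ...
    nvtxRangePop();

    nvtxRangePushA("backward");
    // ... launch backward kernels ...
    nvtxRangePop();
}
```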
5.5 Production Slice
- Document your GPU fleet's exact SKUs in a `HARDWARE.md`: model, HBM size, peak bandwidth, peak BF16/FP16/FP8 FLOPS, TDP. Cluster-ops decisions hinge on this. A template sketch follows.
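A sketch of what one `HARDWARE.md` row might look like; the H100 figures mirror the numbers quoted in 5.1, and the remaining cells are placeholders to fill from the vendor datasheet and your own `bandwidthTest` measurements.

```markdown
| SKU       | HBM   | Peak BW (spec / measured) | Peak BF16 dense | Peak FP8 dense | TDP         |
|-----------|-------|---------------------------|-----------------|----------------|-------------|
| H100 SXM5 | 80 GB | ~3 TB/s / (measured)      | ~989 TFLOPS     | (datasheet)    | (datasheet) |
```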