Deep Dive 01-NVIDIA GPU Architecture and Memory Hierarchy

A self-contained reference chapter. Reader prerequisites: Python, basic C, basic linear algebra. Everything else is built up here. Canonical chip: NVIDIA H100 (Hopper, SM_90a). Other generations are referenced with explicit version tags.


Table of Contents

  1. Why GPUs Exist: Throughput Machines vs Latency Machines
  2. Execution Models: SIMD, SIMT, MIMD
  3. Why Deep Learning Maps onto Throughput Hardware
  4. The Streaming Multiprocessor (SM)-Anatomy of the Compute Unit
  5. Warp Scheduling, Dual-Issue, and Scoreboarding
  6. Branch Divergence and Independent Thread Scheduling
  7. The Memory Hierarchy: Registers to NVMe (with H100 numbers)
  8. Tensor Cores: WMMA, mma.sync, Fragments, Precisions
  9. 2:4 Structured Sparsity and Async Tensor Cores
  10. Async Copy and the Tensor Memory Accelerator (TMA)
  11. Occupancy Theory-Derivation From First Principles
  12. NVLink, NVSwitch, and Multi-GPU Topologies
  13. Ada and Blackwell Deltas (with explicit uncertainty)
  14. AMD CDNA / MI300X Contrast
  15. Five Worked Practical Exercises

1. Why GPUs Exist: Throughput Machines vs Latency Machines

A CPU is a latency machine. Its design goal is: take one sequential thread of instructions and complete it as fast as possible. To do that, a modern CPU core spends most of its silicon area on machinery that has nothing to do with arithmetic:

  • Out-of-order execution (reorder buffer, register renaming, ~hundreds of in-flight instructions).
  • Aggressive branch prediction (TAGE-style predictors with multi-KB history tables).
  • Several MB of private and shared cache.
  • Speculative loads, memory disambiguation, prefetchers.

On a recent server CPU, perhaps 5–10% of the die is the ALUs that actually do arithmetic. The other 90–95% exists to hide latency for one thread.

A GPU is a throughput machine. Its design goal is: given many independent units of work, finish all of them as fast as possible. Per-task latency is irrelevant; what matters is operations completed per second across the whole chip. To do that, a GPU does the inverse: it spends silicon on ALUs and, instead of hiding latency with prediction and out-of-order, hides latency with parallelism. When one warp stalls on memory, the SM switches to another warp. As long as you have enough independent warps in flight, every cycle is doing useful arithmetic.

Concretely on H100:

              Latency machine (CPU core)        Throughput machine (H100 SM)
              -----------------------------     -----------------------------
ALUs/unit     ~few wide vector lanes            128 FP32 + 64 FP64 + 64 INT32
                                                 + 4 Tensor Cores per SM
Threads in    1 (SMT: 2)                        Up to 64 warps = 2048 threads
flight                                           per SM, x132 SMs = 270k threads
Branch pred   Massive                            Minimal
OoO window    ~512 micro-ops                    None (in-order issue per warp)
Cache/core    ~1–4 MB private                   256 KB register file +
                                                 228 KB combined L1/smem

The two designs are both correct, just for different problem shapes. CPU wins when you have one critical-path thread. GPU wins when you have a problem expressible as "do this same operation to a million data items, mostly independently."

Roofline picture. Any program is bounded either by compute (FLOPs/s) or by memory bandwidth (bytes/s). The crossover point is arithmetic intensity-FLOPs per byte loaded. H100 has roughly:

  • BF16 dense compute: ~989 TFLOPS
  • HBM3 bandwidth: ~3 TB/s = ~3 × 10^12 B/s

Crossover intensity = 989e12 / 3e12 ≈ 330 FLOPs/byte (BF16 case). Below that, you are memory-bound; above, compute-bound. A GEMM of large enough size sits well above. Element-wise activations sit far below. This single inequality predicts most of GPU programming pain.
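
To make the roofline concrete, here is a minimal host-side sketch. The 989 TFLOPS and 3 TB/s constants are the approximate H100 SXM5 figures quoted above; the shapes are illustrative.

#include <algorithm>
#include <cstdio>

// Minimum kernel time in seconds given its FLOP count and bytes moved,
// under peak compute (FLOP/s) and peak memory bandwidth (B/s).
double roofline_seconds(double flops, double bytes,
                        double peak_flops = 989e12, double peak_bw = 3e12) {
    return std::max(flops / peak_flops, bytes / peak_bw);
}

int main() {
    // Elementwise op: ~1 FLOP/elem, 4 B moved/elem (BF16 read + write).
    printf("elementwise, 1e9 elems: %.3f ms\n",
           roofline_seconds(1e9, 4e9) * 1e3);                     // memory-bound
    // Large GEMM: 2*M*N*K FLOPs, ~2*(MK + KN + MN) bytes in BF16.
    double M = 8192, N = 8192, K = 8192;
    printf("8192^3 GEMM: %.3f ms\n",
           roofline_seconds(2*M*N*K, 2*(M*K + K*N + M*N)) * 1e3); // compute-bound
    return 0;
}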


2. Execution Models: SIMD, SIMT, MIMD

Three classical taxonomies for parallel hardware:

  • SIMD (Single Instruction, Multiple Data). One instruction fetch operates on a vector register of N lanes. Examples: x86 AVX-512 (16 FP32 lanes), ARM SVE. The programmer writes vector code explicitly; lanes are not addressable as separate threads. Branches require predication or scalar fallback.

  • MIMD (Multiple Instruction, Multiple Data). Each core runs its own independent instruction stream. Examples: any multi-core CPU. Maximum flexibility, maximum hardware overhead per core.

  • SIMT (Single Instruction, Multiple Threads-NVIDIA's term). 32 lanes execute the same instruction in lockstep, but each lane is programmed as if it were a thread: it has its own registers, its own program counter (logically), its own stack. The compiler/programmer writes scalar code; the hardware groups 32 of them into a warp that issues together.

SIMT is essentially "SIMD with a thread illusion." It buys you:

  1. Programmer ergonomics. You write if (tid % 2 == 0) ... else ... and it works. Under the hood the warp executes both sides with masks (this is divergence; see §6).
  2. Per-lane addressing. A warp can load 32 different addresses in one instruction-a gather. SIMD can too, but only via dedicated gather instructions; in SIMT it is the natural mode.
  3. Dynamic parallelism scaling. You launch a grid of millions of threads and the hardware schedules warps onto SMs. The same source compiles for a 24-SM laptop GPU and a 132-SM H100.

The cost: lanes within a warp must execute the same instruction each cycle. If they branch differently, only the lanes on the active path do useful work. This is the fundamental SIMT tradeoff and the source of much GPU performance lore.

SIMD:     [op] -> [lane0 lane1 ... lane15]   one PC, one mask register
SIMT:     [op] -> [thread0 thread1 ... thread31]   per-lane PC since Volta
                                                    plus an "active mask"
MIMD:     [op0]->core0   [op1]->core1   ...        N independent PCs

3. Why Deep Learning Maps onto Throughput Hardware

Take a single forward pass through a transformer block. The expensive ops are:

  • Linear layers (y = x W): a GEMM. For batch B, sequence S, model dim D, this is (B·S × D) × (D × D) = O(B·S·D²) FLOPs against O(B·S·D + D²) bytes of weights+activations. Arithmetic intensity grows with D.
  • Attention (Q Kᵀ, softmax, ·V): two GEMMs and a softmax. The GEMMs are again high-intensity for non-tiny shapes.
  • LayerNorm / RMSNorm / activations: element-wise; arithmetic intensity ~1. Memory-bound.

GEMMs are the canonical GPU-friendly workload because:

  1. Massive independent parallelism. Each output element is an independent dot product. A 4096×4096 output has 16.7M independent dot products-enough to fill an H100's 270k threads many times over.
  2. High arithmetic intensity at scale. For (M×K)·(K×N), FLOPs ≈ 2·M·N·K and bytes ≈ 2·(M·K + K·N + M·N) (in BF16). Intensity scales with min(M,N,K), so big matmuls are compute-bound.
  3. Regular memory access. Tiles map cleanly onto shared memory. Vectorized 128-bit loads stay coalesced.
  4. Exploitable structure. Tensor Cores accept fixed-shape tiles (e.g., 16×16×16 BF16) and produce one output in a few cycles. The math literally is "outer product and accumulate," which is what a transformer wants.

Training adds backward and optimizer steps but the dominant cost is still GEMM. So the entire ML training stack-PyTorch, cuBLAS, cuDNN, FlashAttention, Triton, CUTLASS-is essentially elaborate machinery for feeding GEMMs to Tensor Cores efficiently.


4. The Streaming Multiprocessor (SM)-Anatomy of the Compute Unit

The SM is the smallest unit of "GPU." An H100 has 132 SMs (H100 SXM5; the PCIe variant has 114). All your CUDA threads run inside SMs. A thread block is assigned to exactly one SM and stays there for its lifetime.

4.1 H100 SM block diagram (canonical)

+--------------------------------------------------------------+
|                       Streaming Multiprocessor (H100)        |
|                                                              |
|  +-------------------+   +-------------------+               |
|  |  Sub-partition 0  |   |  Sub-partition 1  |  ... x4       |
|  |                   |   |                   |               |
|  |  Warp Scheduler   |   |  Warp Scheduler   |               |
|  |  Dispatch Unit    |   |  Dispatch Unit    |               |
|  |                   |   |                   |               |
|  |  Register File    |   |  Register File    |               |
|  |   16384 x 32-bit  |   |   16384 x 32-bit  |               |
|  |   = 64 KB         |   |   = 64 KB         |               |
|  |                   |   |                   |               |
|  |  32 FP32 ALUs     |   |  32 FP32 ALUs     |               |
|  |  16 FP64 ALUs     |   |  16 FP64 ALUs     |               |
|  |  16 INT32 ALUs    |   |  16 INT32 ALUs    |               |
|  |  8  LD/ST units   |   |  8  LD/ST units   |               |
|  |  4  SFUs          |   |  4  SFUs          |               |
|  |  1  Tensor Core   |   |  1  Tensor Core   |               |
|  +-------------------+   +-------------------+               |
|                                                              |
|  Total per SM: 128 FP32, 64 FP64, 64 INT32, 4 Tensor Cores   |
|                                                              |
|  +--------------------------------------------------------+  |
|  |  Combined L1 Data Cache + Shared Memory  (228 KB)      |  |
|  |  (configurable: up to 228 KB shared, rest as L1)       |  |
|  +--------------------------------------------------------+  |
|                                                              |
|  Tensor Memory Accelerator (TMA)   |  Async barrier engines  |
+--------------------------------------------------------------+

The SM is divided into four sub-partitions, also called processing blocks. Each has its own:

  • Warp scheduler-picks one ready warp per cycle.
  • Dispatch unit-issues the picked warp's instruction to the right execution unit.
  • Register file slice-16384 × 32-bit registers = 64 KB. (4 × 64 KB = 256 KB per SM total.)
  • A 1/4 share of the FP32/FP64/INT/SFU/LD-ST/Tensor-Core resources.

Threads within a warp always live in one sub-partition. That is why the warp is exactly 32 threads: it matches the sub-partition's lane count for the most common ops.

4.2 The execution units

  • FP32 cores ("CUDA cores"): 128/SM on H100. Do scalar FP32 add/multiply/FMA each cycle.
  • FP64 cores: 64/SM. Hopper's FP64 ratio is much higher than gaming GPUs (Ada has only 2 FP64 units per SM as a throttled path).
  • INT32 cores: 64/SM. Address arithmetic and integer kernels.
  • SFU (Special Function Unit): transcendentals (rsqrt, exp2, log2, sin, cos) at reduced throughput (typically 1/4 of FP32). PyTorch GELU/SiLU eventually lower to SFU instructions.
  • LD/ST units: issue loads/stores to L1, shared, L2, HBM.
  • Tensor Cores: 4/SM on Hopper, 4th-generation. Each does a small matrix multiply per cycle on tiles.

4.3 Register file-the fastest, most precious resource

  • Capacity: 256 KB per SM = 65536 × 32-bit registers.
  • Latency: effectively 0 cycles (read-after-write hazards are tracked but throughput is one operand bundle/cycle).
  • Bandwidth: enormous-every FMA reads 3 operands and writes 1, on every functional unit, every cycle.

Each thread can use up to 255 32-bit registers (a hard CUDA limit). The compiler (ptxas) decides how many registers a kernel actually uses, controllable via __launch_bounds__ or -maxrregcount. Higher register count per thread → fewer threads can be resident on the SM (because the file is fixed size) → lower occupancy. This is the central tension we'll formalize in §11.

A warp's "register footprint" is regs_per_thread × 32. For a warp using 64 regs/thread, that is 2048 registers = 8 KB. The 64 KB sub-partition register file therefore holds at most 64 KB / 8 KB = 8 warps' worth, which caps that sub-partition's warp residency.

4.4 Generation deltas (SM-level)

Feature                  A100 (Ampere, SM_80)   H100 (Hopper, SM_90)   RTX 4090 (Ada, SM_89)   B100/B200 (Blackwell, SM_100)
FP32/SM                  64                     128                    128                     128 (approximate)
Tensor Cores/SM          4 (3rd gen)            4 (4th gen)            4 (4th gen)             4 (5th gen, FP4-capable)
Register file/SM         256 KB                 256 KB                 256 KB                  256 KB (publicly stated)
L1+smem/SM               192 KB                 228 KB                 128 KB                  ~256 KB (approximate)
FP8 tensor cores         no                     yes (E4M3 / E5M2)      yes                     yes (also FP4 / FP6)
TMA hardware             no                     yes                    no                      yes (enhanced)
Thread block clusters    no                     yes (SM-to-SM smem)    no                      yes
Transformer Engine       no                     yes (1st gen)          no                      yes (2nd gen)

Numbers marked "approximate" for Blackwell because as of authoring NVIDIA had not published every microarchitectural detail with full precision-verify with the latest H100/B200 whitepapers when relying on exact figures.


5. Warp Scheduling, Dual-Issue, and Scoreboarding

Now the dynamic picture: how an SM picks what to execute each cycle.

5.1 The basic loop

Each cycle, in each sub-partition:

  1. The warp scheduler scans all resident warps in this sub-partition (up to 16; 64 total per SM).
  2. It selects warps whose next instruction has all source operands ready (no outstanding dependencies). This is scoreboarding.
  3. It issues one (Hopper: sometimes two) instruction(s) to the appropriate functional unit.

Key fact: the SM does no out-of-order execution within a warp. Instructions from a single warp are issued in program order. Latency is hidden by switching among warps, not by reordering one warp.

5.2 Scoreboarding in detail

Each register has a "scoreboard bit" tracking whether a long-latency operation (memory load, transcendental, tensor MMA) is still writing it. When the scheduler considers issuing instruction I, it checks all of I's source registers' scoreboard bits. If any are set, the warp is not ready; the scheduler picks a different warp. When the long op completes, the bit clears.

This is why a kernel with many independent warps hides memory latency for free. The math:

  • Suppose every load takes 400 cycles to HBM.
  • A warp that just issued a load will be unable to issue its next dependent instruction for ~400 cycles.
  • If the sub-partition has 8 resident warps and each issues a load every ~50 cycles, then on average there is always at least one warp ready to issue.
  • Functional units stay busy; no stall is observed.

The exact threshold is Little's Law: parallelism_required = latency × throughput. For an SM that issues 1 instruction/cycle/sub-partition with 400-cycle memory latency, you need ~400 in-flight instructions per sub-partition to hide the latency. Each warp can have a few in flight at once (independent loads), so a handful of warps suffice.
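
The same arithmetic as a tiny host-side sketch; the 400-cycle latency and 50-cycle issue interval are the illustrative numbers from above, not measurements:

#include <cstdio>

int main() {
    // Little's Law: in-flight work = latency x issue rate.
    double mem_latency = 400.0;        // cycles, nominal HBM round trip
    double issue_rate  = 1.0;          // instructions/cycle/sub-partition
    double in_flight   = mem_latency * issue_rate;           // ~400

    // A warp issuing an independent load every ~50 cycles keeps
    // 400/50 = 8 loads in flight, so 8 such warps cover the latency.
    double warps_needed = mem_latency / 50.0;
    printf("in-flight: %.0f, warps needed: %.0f\n", in_flight, warps_needed);
    return 0;
}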

5.3 Dual-issue on Hopper

In some cases a Hopper sub-partition can issue two instructions from the same warp in one cycle, provided they target different functional units and have no dependency. Example: an FP32 FMA and an INT32 address calculation can co-issue. This is not superscalar OoO-it is constrained dual-issue from the same in-order warp.

5.4 Warp scheduling timeline (ASCII)

cycle:        0   1   2   3   4   5   6   7   8   9  ...
warp 0:       I0          (LOAD pending..............)
warp 1:           I0  I1  I2
warp 2:                       I0          (LOAD pending...)
warp 3:                           I0  I1
warp 0:                                       [load done] I1
issue slot:   W0  W1  W1  W1  W2  W3  W3  --  W0  ...
                                          ^ bubble: warps 0 and 2 both waiting on loads

The scheduler's job is to keep that bottom row never empty. If it goes empty, the SM is stalled and you are leaving FLOPs on the table.

5.5 Resident warp limits (H100)

  • 64 warps maximum per SM (16 per sub-partition × 4 sub-partitions).
  • 32 thread blocks maximum per SM.
  • 2048 threads maximum per SM.
  • Constrained additionally by registers (256 KB/SM) and shared memory (≤228 KB/SM).

The actual resident count is min of all these constraints. §11 walks the math.


6. Branch Divergence and Independent Thread Scheduling

6.1 The classical (pre-Volta) story

A warp has one program counter. If lanes diverge:

if (tid % 2 == 0) {
    A();  // even lanes
} else {
    B();  // odd lanes
}

The hardware executes A() with mask 0x55555555 (even lanes), then B() with mask 0xAAAAAAAA (odd lanes), then reconverges. Both branches run sequentially; throughput is halved. Worst case (32 different paths) is a 32× slowdown: the warp serializes.

6.2 Independent Thread Scheduling (Volta and later)

Since Volta (SM_70), each lane has its own program counter and call stack. The hardware can interleave divergent paths and even let lanes from the same warp synchronize among themselves. This enables fine-grained algorithms (per-lane locks, producer/consumer within a warp) that were impossible before.

The performance picture is unchanged: at any one cycle the sub-partition can only issue one path. The benefit is correctness/expressiveness, not raw throughput. You still want lanes within a warp to follow the same path most of the time.

6.3 Practical rules

  • Branch on warp-aligned quantities when possible (e.g., on warp_id, not tid). All 32 lanes go the same way; no divergence.
  • Hoist invariant work out of divergent branches.
  • Use __ballot_sync, __any_sync, __all_sync for warp-wide voting instead of explicit branch+reduce (see the sketch after this list).
  • Predication (compiler-generated) handles short branches with no actual divergence cost.
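
A minimal sketch of the voting intrinsics, using a hypothetical kernel that counts positive values per warp:

__global__ void count_positive_per_warp(const float* x, int* warp_counts, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    float v  = (tid < n) ? x[tid] : 0.0f;

    // All 32 lanes evaluate the predicate together; no divergent branch runs.
    unsigned mask = __ballot_sync(0xFFFFFFFFu, v > 0.0f);

    if (lane == 0)                        // one lane per warp writes the count
        warp_counts[tid / 32] = __popc(mask);
}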

7. The Memory Hierarchy: Registers to NVMe (with H100 Numbers)

This section is the core of practical GPU programming. Every performance decision is a memory decision.

7.1 The hierarchy (H100 specific)

Tier              Capacity       Latency               Bandwidth (peak)          Scope
Registers         256 KB / SM    ~1 cycle              enormous (per-lane)       per-thread
L1 / shared mem   228 KB / SM    ~20–30 cycles         tens of TB/s aggregate    per-block
L2 cache          50 MB          ~150–250 cycles       ~5–7 TB/s aggregate       device-wide
HBM3              80 GB          ~400–600 cycles       ~3 TB/s                   device-wide
Host DRAM         system-dep.    ~µs (many k cycles)   ~50 GB/s (PCIe 5)         host
NVMe SSD          TB-scale       ~tens of µs           ~5–14 GB/s (PCIe 5)       host (block device)

Caveats: cycle counts are nominal-they vary with bank conflicts, DRAM row state, and contention. Treat them as orders of magnitude. Bandwidth numbers are peak achievable on H100 SXM5 in well-tuned kernels; everyday kernels see less.

Aggregate L1/smem bandwidth: each SM can issue ~128 B/cycle of shared loads, at ~1.6 GHz, × 132 SMs ≈ 27 TB/s peak in well-balanced cases. Treat as approximate-verify with NVIDIA H100 datasheet.

7.2 ASCII picture

   per-thread               per-block               device-wide        host
  +-----------+         +----------------+        +------------+    +-------+
  | Registers |         |  Shared Memory |        |     L2     |    | DRAM  |
  |  ~1 cyc   |  -->    |  (config'd     |  -->   |   ~150 cyc |--> | µs    |
  | 256 KB/SM |         |   from L1)     |        |   50 MB    |    | (PCIe |
  +-----------+         |  ~25 cyc       |        +------------+    |  5)   |
        ^               |  228 KB/SM     |              |           +-------+
        |               +----------------+              |               |
        |                       ^                       v               v
        |                       |                  +---------+      +------+
  +----ld/st-------------(L1 hit ~30 cyc)<---|  HBM3   |      | NVMe |
                                                   |  ~500   |      |  10s |
                                                   |  cyc    |      |  µs  |
                                                   |  3 TB/s |      | GB/s |
                                                   |  80 GB  |      +------+
                                                   +---------+

7.3 Cascade of misses

A single load instruction in a CUDA kernel does this:

  1. Coalesce check. The 32 lanes' addresses are inspected. If they fall within a small number of 128-byte sectors, the request becomes 1–4 memory transactions. If they scatter, it becomes up to 32. This is the single biggest determinant of memory performance.

  2. L1 lookup. Each transaction consults L1. If it hits, ~30 cycles, done.

  3. L2 lookup on L1 miss. Sent to L2 (which is shared device-wide and partitioned across the chip). Hit: ~150–250 cycles round trip.

  4. HBM on L2 miss. Goes to HBM3. ~400–600 cycles. Possibly more if DRAM page must be opened.

  5. (Unified memory only) host fault. If using managed memory and the page lives in host DRAM, a page fault crosses PCIe-microseconds, i.e. thousands of cycles. Catastrophic for kernel throughput.

  6. NVMe. Only via explicit pinning + cudaMemcpy or via GPUDirect Storage. Tens of microseconds or worse.

The practical implication: once you miss to HBM, you have ~500 cycles to hide. Once you miss to host, you have ~10000+ cycles to hide and almost certainly cannot. Keep working sets in registers and shared memory.

7.4 Coalescing: a worked picture

Suppose 32 threads each load float a = arr[tid] where arr is 4-byte aligned. The 32 addresses are arr[0], arr[1], ..., arr[31]: contiguous, 128 bytes total. The hardware bundles this into one 128-byte transaction, perfectly coalesced.

Now suppose float a = arr[tid * 32]. The addresses are arr[0], arr[32], arr[64], ..., each 128 bytes apart. That is 32 transactions of 128 bytes-32× the memory traffic for the same useful data.

COALESCED:    [t0 t1 t2 ... t31]  -> one 128B sector
              ^^^^^^^^^^^^^^^^^^

UNCOALESCED:  t0 ........... t1 ............ t2 ............ ...
              ^^^^                                                 -> 32 sectors
                              ^^^^
                                                ^^^^
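
The two access patterns as actual kernels (a sketch; float data, grid sized so every index stays in bounds):

__global__ void coalesced(const float* __restrict__ in, float* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];        // lanes 0..31 read 128 contiguous bytes: 1 transaction
}

__global__ void strided(const float* __restrict__ in, float* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * 32];   // lanes land 128 B apart: up to 32 transactions
}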

7.5 Shared memory: the workhorse

Shared memory is software-managed, on-SM SRAM. Latency is ~25 cycles, comparable to an L1 hit, and all 32 threads in a warp can read independent addresses simultaneously (subject to bank conflicts). This is what makes it the staging area for tile-based algorithms.

It is organized into 32 banks, each 4 bytes wide. A warp's 32 accesses are conflict-free if every lane hits a different bank. If two lanes hit the same bank (different rows), the access serializes-a bank conflict. Common pitfall: 2D tiles of size 32×32 with a stride of 32 produce systematic conflicts; the fix is padding to stride 33.

banks:    0   1   2   3  ...  31
           |   |   |   |       |
words:   [B0][B1][B2][B3] ... [B31]   <- a warp reading these is conflict-free
         [B0][B1][B2][B3] ... [B31]   <- next row, same banks
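
A sketch of the padding fix inside a shared-memory transpose (assumes a square matrix whose dimension n is a multiple of 32, launched with 32×32 thread blocks):

__global__ void transpose32(const float* in, float* out, int n) {
    __shared__ float tile[32][33];           // stride 33, not 32: column accesses
                                             // now touch 32 different banks
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];        // coalesced read
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;                     // swap block origin
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];       // coalesced write; the
                                                           // column read is conflict-
                                                           // free thanks to padding
}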

7.6 L2 cache (50 MB)

On H100 the L2 is split into two partitions joined by a high-bandwidth crossbar. It serves all SMs. Useful properties:

  • Persistent access policies (cudaAccessPolicyWindow) let you mark a buffer as "keep this in L2 with high priority"-relevant for KV-caches that fit in 50 MB (see the sketch after this list).
  • L2 hit rate matters: a 50 MB working set that fits in L2 effectively turns HBM bandwidth into L2 bandwidth (~5–7 TB/s) for that data.
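
A hedged host-side sketch of the persistence API; kv_cache, kv_bytes, and stream are placeholders for your own buffer and stream:

cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = kv_cache;      // device pointer to keep hot
attr.accessPolicyWindow.num_bytes = kv_bytes;      // window size (must fit policy limits)
attr.accessPolicyWindow.hitRatio  = 1.0f;          // fraction of the window to persist
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);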

7.7 HBM3 (80 GB, ~3 TB/s)

HBM is stacked DRAM connected by a wide silicon interposer. Peak bandwidth is ~3 TB/s (H100 SXM5; H100 PCIe is lower). This is the single number that governs memory-bound kernel performance.

Two consequences:

  • Largest model that fits: max params = 80 GB / bytes-per-param. At BF16 (2 B/param), that's ~40B parameters of weights alone, before activations and gradients.
  • Memory-bound kernel ceiling: an elementwise op on a tensor of size N (BF16, so 2N bytes read + 2N bytes written = 4N bytes) on H100 takes at minimum 4N / 3e12 seconds. For N = 1e9, that's ~1.3 ms-and you cannot go faster regardless of compute.

7.8 Host DRAM and NVMe

Crossing PCIe 5 x16 to host DRAM is ~50 GB/s-60× slower than HBM. This is why model loading is slow, why pinned memory matters (avoids an extra copy through pageable DRAM), and why model.to(device) for a 70 GB model takes ~1.5 s of pure transfer at best.

NVMe via GPUDirect Storage can hit 5–14 GB/s on PCIe 5 SSDs, bypassing the CPU. It is the right tool for streaming datasets but not for hot tensors.


8. Tensor Cores: WMMA, mma.sync, Fragments, Precisions

8.1 Why they exist

A standard FP32 FMA does 2 FLOPs (multiply + add) per cycle per lane. An H100 SM at ~1.8 GHz with 128 FP32 lanes does ~460 GFLOPS of FP32-and 132 SMs gives ~60 TFLOPS FP32. Respectable.

The same SM has 4 Tensor Cores. Each Tensor Core, at BF16, executes a small dense matmul per cycle-far more FLOPs per cycle than the FP32 path because the operation is fused. At full chip: ~989 TFLOPS BF16, ~17× faster than the FP32 path. At FP8: ~1979 TFLOPS dense, ~34× faster.

Ignoring Tensor Cores leaves 90%+ of the chip's arithmetic on the floor. This is non-negotiable for ML.

8.2 What a Tensor Core operation actually is

Conceptually: D = A · B + C where A, B, C, D are small fixed-shape tiles. On Hopper at BF16, a common shape is 16×16×16: A is 16×16, B is 16×16, C and D are 16×16. So each Tensor Core MMA does:

  • 16×16×16 = 4096 multiply-accumulates per instruction
  • = 8192 FLOPs per instruction

If every Tensor Core completed a full 16×16×16 MMA each cycle, the chip would do 8192 FLOPs × 4 Tensor Cores × 132 SMs ≈ 4.3 MFLOPs per cycle, or ~7.8 PFLOPS at 1.8 GHz. The published ~989 TFLOPS BF16 is roughly 8× lower because a Tensor Core does not issue a full 16×16×16 MMA every cycle; the datasheet number bakes in real issue rates.

8.3 The PTX mma.sync family

PTX is NVIDIA's virtual ISA. The instructions you actually emit (or that ptxas emits) for tensor cores are of the form:

mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 d, a, b, c;

Decoded:

  • m16n8k16 - D is 16×8, A is 16×16, B is 16×8 (the shared K dim is 16). MMA shapes vary: m16n8k8, m16n8k16, m16n8k32 (for 8-bit types), etc.
  • row.col - A is row-major in the fragment, B is column-major. Layout matters.
  • .f32.bf16.bf16.f32 - D type, A type, B type, C type. Here: BF16 inputs, FP32 accumulator. This is the standard mixed-precision recipe.

Each thread in the warp owns a fragment of the tile-a few registers' worth of A, B, C, D each. The 32 threads collaboratively hold the entire 16×16 tile distributed across their register files. The hardware knows the distribution; you must load fragments using matching ldmatrix instructions or via the WMMA API.

8.4 The WMMA C++ API

wmma::fragment<...> is the high-level handle:

#include <mma.h>
#include <cuda_bf16.h>   // __nv_bfloat16
using namespace nvcuda::wmma;

fragment<matrix_a, 16, 16, 16, __nv_bfloat16, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, __nv_bfloat16, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;

fill_fragment(c_frag, 0.0f);
load_matrix_sync(a_frag, A_smem_ptr, 16);   // ldm = 16 (row stride of the tile)
load_matrix_sync(b_frag, B_smem_ptr, 16);
mma_sync(c_frag, a_frag, b_frag, c_frag);   // c_frag += a_frag * b_frag
store_matrix_sync(D_global_ptr, c_frag, 16, mem_row_major);

Each *_sync call is warp-collective-all 32 lanes must participate. The fragment objects live in registers.

8.5 Supported precisions and what they buy

For Hopper Tensor Cores (4th gen), peak dense throughput approximately scales as:

Type   Bits   H100 dense TFLOPS (approx.)   Notes
FP64   64     ~67                           dedicated FP64 Tensor Core path
TF32   19     ~495                          NVIDIA's drop-in for FP32 GEMM
BF16   16     ~989                          standard ML training dtype
FP16   16     ~989                          older default; less dynamic range
FP8    8      ~1979                         E4M3 (forward), E5M2 (gradients)
INT8   8      ~1979                         quantized inference

(Numbers are H100 SXM5 dense, no sparsity. Verify exact figures with the H100 datasheet.) Each precision halving roughly doubles throughput because each Tensor Core can pack twice as many ops in the same die area-that is the raison d'être of FP8 for inference and FP4 for Blackwell.

FP8 detail. Two formats exist:

  • E4M3: 1 sign + 4 exponent + 3 mantissa. Range ~±448. Used for forward activations and weights-needs precision more than range.
  • E5M2: 1 sign + 5 exponent + 2 mantissa. Range ~±57344. Used for gradients in training-gradients have huge dynamic range.

A "Transformer Engine" library (NVIDIA's TE, plus cuDNN/cuBLAS support) automatically picks per-tensor scaling factors to keep values in range, accumulates in FP32, and chooses E4M3 vs E5M2 based on tensor role.

TF32. A funny format: 19 bits total (1 sign + 8 exponent + 10 mantissa), padded to look like FP32 in registers. The Tensor Core silently truncates the mantissa. Net effect: code that reads "FP32 GEMM" runs at TF32 speed (~495 TFLOPS) on Ampere/Hopper unless you pass torch.backends.cuda.matmul.allow_tf32 = False.

8.6 Data movement is everything

Tensor Cores are so fast that they will starve unless data flows from HBM → L2 → smem → registers fast enough. A typical tile-based GEMM kernel structure:

Persistent loop over output tiles (each block owns one output tile):
  Loop over K dimension in tile-sized chunks:
    1. cp.async (or TMA) loads A-tile and B-tile from HBM into shared memory.
    2. ldmatrix loads fragments from shared memory into registers.
    3. mma.sync accumulates into the output fragment (in registers).
    4. Overlap: while step 3 runs, step 1 of the next K-chunk is already
       in flight via async copy.
  Write output tile from registers to HBM.

The whole CUTLASS/cuBLAS/Triton design space is variations on this skeleton: tile sizes, number of K-stages buffered, smem layout (swizzled to avoid bank conflicts), warp specialization (some warps load, others compute).


9. 2:4 Structured Sparsity and Async Tensor Cores

9.1 The 2:4 sparsity hardware path

Since Ampere (3rd-gen Tensor Cores) and continued in Hopper/Blackwell, Tensor Cores can natively skip multiplications when the weight matrix is 2:4 structured-sparse: in every contiguous group of 4 elements along the K dimension, exactly 2 are zero.

This is a hardware-enforced pattern. The benefit:

  • The compressed weight stores only the 2 nonzero values + a 4-bit metadata mask per group of 4 (so 50% memory).
  • The Tensor Core fetches the 2 nonzeros + the matching 2 lanes of the activation, multiplies, accumulates. The two skipped multiplications are physically not done.
  • Effective throughput doubles: H100 BF16 dense is ~989 TFLOPS, BF16 with 2:4 sparsity is ~1979 TFLOPS.

Caveat: the model must actually have 2:4 sparsity. Tools (NVIDIA ASP) prune dense models to this pattern with QAT-style fine-tuning. Quality recovery is usually possible but not free.

4 contiguous K-elements:    [w0  w1  w2  w3]
2:4 enforced:               [w0   0   0  w3]  <- two zeros in known positions
compressed in memory:       [w0  w3] + 4-bit metadata "10..01"

9.2 Async (warpgroup) Tensor Core operations on Hopper

Hopper introduced the warpgroup MMA (wgmma.mma_async): a Tensor Core operation that operates on a warpgroup of 4 warps (128 threads) and runs asynchronously with respect to the issuing thread. Properties:

  • Inputs A, B can come directly from shared memory (not just registers)-saves register pressure.
  • The accumulator C lives in registers.
  • The instruction returns immediately; you wgmma.wait_group to synchronize before reading C.
  • The Tensor Core can be working while the warp issues additional instructions (e.g., the next cp.async to fetch the next K-tile).

This is the missing piece that lets a Hopper kernel truly overlap data movement with tensor compute. On Ampere, the MMA was synchronous: while it ran, the warp was blocked. On Hopper, the warp can do other work, including issuing more MMAs and queueing more loads. This is why Hopper kernels often use warp specialization: dedicate some warps to loading (issuing TMAs), others to computing (issuing wgmmas), and let them communicate via shared memory and barriers.


10. Async Copy and the Tensor Memory Accelerator (TMA)

10.1 cp.async (Ampere onward)

The instruction cp.async.cg.shared.global copies bytes from global memory to shared memory without going through registers and without blocking the issuing thread.

cp.async.cg.shared.global   [smem_ptr], [global_ptr], 16;
cp.async.commit_group;          // package outstanding cp.async into group
... do other work ...
cp.async.wait_group 0;          // wait for all groups to finish
__syncthreads();                // ensure smem is visible to all threads

Key semantics:

  • commit_group bundles all outstanding cp.async instructions issued by this thread since the last commit into a numbered group.
  • wait_group N waits until at most N groups remain in flight.
  • This pattern enables pipelined / multi-stage smem buffers: while compute uses stage k, stage k+1 is being filled.
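
The same pattern is exposed as CUDA C++ intrinsics; a minimal sketch (block size of 256 is an assumption):

#include <cuda_pipeline.h>

__global__ void staged_load(const float4* __restrict__ gmem) {
    __shared__ float4 smem[256];
    int t = threadIdx.x;                                  // assume blockDim.x == 256
    __pipeline_memcpy_async(&smem[t], &gmem[t], sizeof(float4)); // 16 B, no register trip
    __pipeline_commit();                                  // close the group
    __pipeline_wait_prior(0);                             // wait: 0 groups may remain
    __syncthreads();                                      // smem now visible block-wide
    // ... compute on smem ...
}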

A typical 3-stage pipeline:

stage k:    [load HBM->smem]   [load HBM->smem]   [load HBM->smem]
stage k-1:                     [compute on smem]  [compute on smem]
stage k-2:                                        [(done)]
                ^                  ^                  ^
              cycle T            cycle T+1          cycle T+2

10.2 The Tensor Memory Accelerator (Hopper)

cp.async is per-thread. A 128×128 BF16 tile is 32 KB; loading it takes thousands of cp.async instructions across a warp, each doing address arithmetic. That is a lot of instruction-issue overhead.

The TMA is dedicated hardware that takes a single descriptor describing a multi-dimensional tensor (base pointer, dimensions, strides, element type, swizzle pattern) and a tile coordinate, and asynchronously moves the entire tile between HBM and shared memory. One instruction triggers a multi-KB transfer.

// pseudo-PTX
cp.async.bulk.tensor.2d.shared::cluster.global   [smem_ptr], [tma_descr, {x, y}], [mbarrier];
mbarrier.try_wait(mbarrier);

Properties:

  • One thread issues the TMA on behalf of the whole block.
  • An mbarrier (memory barrier object in shared memory) is signaled when bytes arrive.
  • Multi-dimensional indexing is in hardware: the TMA computes addresses for a 2D, 3D, 4D, or 5D tile correctly, including out-of-bounds zero-fill.
  • Swizzle patterns (interleavings of elements within smem) are applied for free, eliminating bank conflicts in the subsequent ldmatrix.
  • TMA also supports HBM→HBM and multicast within a thread block cluster.

10.3 Thread block clusters

Hopper added a new level above thread block: the cluster (up to 16 blocks). Blocks in the same cluster can directly access each other's shared memory (Distributed Shared Memory, DSM) and synchronize. The TMA can multicast a tile to all blocks in the cluster, broadcasting input data with a single HBM read. This is how very large GEMMs amortize input bandwidth across many SMs.

Thread hierarchy (Hopper):
  thread  (1)
    -> warp (32 threads)
      -> warpgroup (4 warps = 128 threads)
        -> block (1..1024 threads, all on one SM)
          -> cluster (up to 16 blocks, on adjacent SMs)
            -> grid (entire kernel launch)
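
A hedged sketch of launching a kernel with a cluster dimension via the CUDA 12 runtime API (requires SM_90; the grid and block sizes are placeholders):

__global__ void cluster_kernel() {
    // blocks in the same cluster can exchange data through distributed smem
}

void launch() {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(132, 1, 1);       // must be divisible by the cluster size
    cfg.blockDim = dim3(256, 1, 1);
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;            // 2 blocks per cluster
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, cluster_kernel);
}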

11. Occupancy Theory-Derivation From First Principles

Occupancy = (resident warps per SM) / (maximum warps per SM). On H100 the denominator is 64.

11.1 What constrains residency

Each thread block, once placed on an SM, consumes:

  • R = registers per thread × threads per block (in 32-bit registers)
  • S = shared memory per block (bytes)
  • T = threads per block

The SM has hard limits:

  • R_max = 65536 registers (256 KB / 4 B)
  • S_max = 228 KB (configurable; often slightly less is usable)
  • T_max = 2048 threads
  • B_max = 32 blocks
  • W_max = 64 warps

The number of resident blocks is

B_resident = min( floor(R_max / R),
                  floor(S_max / S),
                  floor(T_max / T),
                  B_max )

Then resident warps = B_resident × ceil(T / 32), and occupancy = resident_warps / W_max.
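
The formula as a host-side sketch with H100 limits hard-coded (for real kernels, prefer cudaOccupancyMaxActiveBlocksPerMultiprocessor, which knows each device's allocation granularities):

#include <algorithm>
#include <cstdio>

int resident_blocks(int regs_per_thread, int smem_per_block, int threads_per_block) {
    const int R_max = 65536, S_max = 228 * 1024, T_max = 2048, B_max = 32;
    int by_regs  = R_max / (regs_per_thread * threads_per_block);
    int by_smem  = smem_per_block ? S_max / smem_per_block : B_max;
    int by_thrds = T_max / threads_per_block;
    return std::min({by_regs, by_smem, by_thrds, B_max});
}

int main() {
    int b     = resident_blocks(96, 16 * 1024, 256);   // worked example 1 below
    int warps = b * (256 / 32);
    printf("blocks=%d warps=%d occupancy=%.0f%%\n", b, warps, 100.0 * warps / 64.0);
    return 0;
}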

11.2 Worked example 1-register-limited

A kernel with:

  • 256 threads per block
  • 96 registers per thread
  • 16 KB shared memory per block

Compute each constraint:

  • R = 96 × 256 = 24576 registers/block. floor(65536 / 24576) = 2 blocks.
  • S = 16384 bytes/block. floor(228·1024 / 16384) = floor(14.25) = 14 blocks.
  • T = 256. floor(2048 / 256) = 8 blocks.
  • B_max = 32 blocks.

B_resident = min(2, 14, 8, 32) = 2 blocks. Warps = 2 × (256/32) = 16. Occupancy = 16/64 = 25%. The bottleneck is registers.

To raise occupancy you'd reduce regs/thread (via __launch_bounds__(256, 4), which tells ptxas "I want at least 4 blocks of 256 threads resident, please compile within that register budget"). The compiler will spill some registers to local memory to comply-possibly hurting performance more than the occupancy gain helps.

11.3 Worked example 2-shared-memory-limited

Same kernel but bumped to 80 KB shared memory per block (e.g., a big tile):

  • S = 80 KB/block. floor(228 / 80) = 2 blocks.
  • R = 96 × 256 = 24576. floor(65536 / 24576) = 2 blocks.
  • T limit: 8 blocks.

B_resident = 2 blocks, occupancy = 25%, but now both registers and shared memory are at the 2-block limit. Reducing shared memory wouldn't help unless you also reduce registers.

11.4 Worked example 3-block-count-limited

256 threads/block, 32 regs/thread, 4 KB smem/block:

  • R = 8192. floor(65536/8192) = 8 blocks.
  • S = 4096. floor(228·1024/4096) = 57 blocks.
  • T: 8 blocks.
  • B_max: 32 blocks.

B_resident = min(8, 57, 8, 32) = 8 blocks. Warps = 8 × 8 = 64. Occupancy = 100%.

11.5 Reverse engineering-"given X warps, what register budget?"

If you want at least 8 warps resident in each sub-partition (e.g., 4 resident blocks of 256 threads, each contributing 2 warps to every sub-partition; be careful with the per-sub-partition accounting), and the sub-partition has 16K registers:

regs/thread × 8 warps × 32 lanes ≤ 16384  ⟹  regs/thread ≤ 16384 / (8 × 32) = 64.

So a 64-reg/thread budget gives exactly 8 warps per sub-partition = 32 warps per SM = 50% occupancy.

11.6 When low occupancy is fine

Occupancy is only a means to an end (latency hiding). It is not the same as performance. Kernels can run at full HBM bandwidth or full Tensor Core throughput at 25% occupancy if:

  • Each warp has lots of independent instructions (high ILP) → less reliance on warp-level parallelism for latency hiding.
  • The bottleneck is HBM bandwidth, not arithmetic, and the few warps already saturate it.
  • The kernel uses async copies / TMA so memory operations don't tie up warps.

CUTLASS and FlashAttention typically run at modest occupancy (30–50%) because their warps are very busy doing useful Tensor Core work. Chasing occupancy by spilling registers usually loses performance. Always profile.

11.7 When low occupancy hurts

  • Kernels with serialized memory waits and no async copy: only way to hide HBM latency is to have many warps.
  • Kernels with frequent __syncthreads() and short between-sync work: a single block doesn't hide much; you need many concurrent blocks.
  • Memory-bound kernels with low ILP per warp.

12. NVLink, NVSwitch, and Multi-GPU Topologies

12.1 Why we need it

A single H100 has 80 GB. A 70B-parameter LLM at BF16 is 140 GB of weights, plus optimizer states (3–5× weights for AdamW), activations, gradients. Training requires multi-GPU. Inference of a 405B model also requires multi-GPU. The interconnect determines how close the multi-GPU system is to a single big GPU.

12.2 The bandwidth ladder

  • PCIe 5 x16: ~64 GB/s per direction (~128 GB/s bidirectional). Your CPU↔GPU path. Also the only path to non-NVLink machines.
  • NVLink 4 (Hopper): 18 links per H100, each 50 GB/s bidirectional, = ~900 GB/s aggregate per GPU (sum of all 18 links, both directions). About 7× a single PCIe 5 x16.
  • NVLink 3 (Ampere): 12 links × 50 GB/s = 600 GB/s aggregate per A100.
  • NVLink 5 (Blackwell): roughly doubled bandwidth vs NVLink 4 (publicly stated as ~1.8 TB/s aggregate per B200; verify with NVIDIA datasheet).

The 900 GB/s is aggregate across all NVLink peers; if traffic is spread evenly across 8 peers, the bandwidth to any single peer is 900/8 ≈ 112 GB/s, still ~2× PCIe.

12.3 NVSwitch

In a DGX H100 (8 GPUs), you can wire each GPU's 18 NVLinks point-to-point-but with 8 GPUs that's only ~2 links per pair, asymmetric. NVSwitch is a chip that acts as a non-blocking crossbar: every GPU has all 18 links going into NVSwitches, and the switches route any-to-any at full bandwidth. A DGX H100 has 4 NVSwitches.

    GPU0 -+        +- GPU4
    GPU1 -+--NVSw-+- GPU5
    GPU2 -+        +- GPU6
    GPU3 -+        +- GPU7

    Each GPU sees a flat 900 GB/s aggregate to *any* combination of peers.

12.4 NVL8, NVL36, NVL72

NVIDIA's GB200 ("Grace + Blackwell") rack systems chain many GPUs into one NVLink domain via external NVLink switches:

  • NVL8-8 GPUs in one server (DGX/HGX). The classic configuration.
  • NVL36-36 GPUs (typically 18 GB200 superchips × 2 Blackwells each). Single NVLink domain across one rack half.
  • NVL72-72 GPUs in one NVLink domain (GB200 NVL72 rack). Total NVLink bandwidth is staggering (~130 TB/s aggregate). All 72 GPUs see each other as if they were on the same node.

The point of large NVLink domains is to make tensor parallelism and expert parallelism tractable across more GPUs without falling off a bandwidth cliff onto InfiniBand (which is ~50 GB/s per port, ~20× slower than NVLink).

12.5 Reading nvidia-smi topo -m

The topology matrix shows how every GPU pair is connected. Cell legend:

  • X    - self
  • NV#  - NVLink; # is the number of links between the pair (more = higher bandwidth)
  • PIX  - same PCIe switch (no host bridge in between)
  • PXB  - multiple PCIe switches, no CPU
  • PHB  - through a host bridge (CPU root complex)
  • NODE - PCIe plus the interconnect between host bridges within a NUMA node
  • SYS  - traverses the CPU socket interconnect (worst PCIe path)

Example partial output:

        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7
GPU0     X     NV18   NV18   NV18   NV18   NV18   NV18   NV18
GPU1   NV18     X     NV18   NV18   NV18   NV18   NV18   NV18
...

NV18 means 18 NVLinks between every pair-i.e., a fully-connected NVSwitch fabric. Conversely if you saw SYS between two GPUs, you would know that GPU↔GPU traffic goes over PCIe and across CPU sockets, the slowest possible path.

12.6 Collective communication primitives

NCCL sits on top of NVLink/NVSwitch/IB and provides AllReduce, AllGather, ReduceScatter, Broadcast, AlltoAll. The two performance numbers to know:

  • AllReduce effective bandwidth ≈ NVLink bandwidth × (n-1)/n; ring and tree algorithms within an NVLink domain come close to this bound.
  • Cross-domain (over IB) AllReduce is bottlenecked by per-node IB bandwidth-typically 4× 400 Gb/s = ~200 GB/s in modern clusters, ~5× slower than NVLink.

This mismatch is why hierarchical algorithms (intra-node first, then inter-node) dominate.
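
A minimal single-process NCCL sketch: one in-place AllReduce across all visible GPUs (the buffer size and single-process topology are illustrative):

#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<ncclComm_t> comms(n);
    ncclCommInitAll(comms.data(), n, nullptr);       // nullptr: use devices 0..n-1

    const size_t count = 1 << 24;                    // 16M floats per GPU
    std::vector<float*> buf(n);
    std::vector<cudaStream_t> streams(n);
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    ncclGroupStart();                                // one call per GPU, grouped
    for (int i = 0; i < n; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}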


13. Ada and Blackwell Deltas (with explicit uncertainty)

13.1 Ada Lovelace (RTX 40 series, SM_89)

Ada is the consumer/workstation generation contemporary with Hopper. Per-SM compute and Tensor Cores look similar to Hopper on the surface but Ada is missing key datacenter features:

  • No TMA hardware-you can use cp.async (Ampere-style) but not the multi-dim tensor descriptors.
  • No thread block clusters-no distributed shared memory.
  • No async warpgroup MMA (wgmma)-MMAs are synchronous as on Ampere.
  • No HBM-uses GDDR6X (~1 TB/s on 4090 vs 3 TB/s HBM3 on H100).
  • Throttled FP64-Ada has 2 FP64 cores per SM vs Hopper's 64. Crippling for HPC, irrelevant for ML.
  • Has FP8 Tensor Cores-same E4M3/E5M2 as Hopper.
  • Has 2:4 sparsity.

So Ada is a fine inference card but is a different architecture from Hopper for kernel programming. A FlashAttention kernel written for Hopper (using TMA + wgmma + clusters) needs significant fallback code on Ada.

13.2 Blackwell (B100 / B200 / GB200, SM_100)

Blackwell is the post-Hopper datacenter generation. Public commitments include:

  • 5th-generation Tensor Cores with native FP4 (E2M1) and FP6 support. FP4 enables ~2× throughput vs FP8-quoted as approximately 10–20 PFLOPS dense FP4 on B200 (verify with NVIDIA Blackwell whitepaper; numbers vary by SKU and dense-vs-sparse).
  • 2nd-generation Transformer Engine-extended micro-scaling formats (MXFP8, MXFP6, MXFP4) where each small block of values shares an exponent; enables FP4 inference and FP4-ish training without catastrophic accuracy loss.
  • NVLink 5-~1.8 TB/s aggregate per GPU, ~2× NVLink 4.
  • HBM3e-higher capacity (192 GB on B200) and bandwidth (~8 TB/s; verify) vs Hopper's 80 GB / 3 TB/s.
  • GB200 superchip-1 Grace CPU + 2 Blackwells on one board with NVLink-C2C between Grace and the GPUs (much higher than PCIe).
  • Two-die package-B200 is a multi-die GPU (two compute dies joined by a high-bandwidth on-package interconnect, presented to software as a single GPU).
  • NVL72 racks-72 Blackwells in one NVLink domain.

Uncertainty disclosure. As of authoring, NVIDIA had not published every Blackwell microarchitectural detail with the same fidelity as Hopper's whitepaper. Treat exact TFLOPS numbers, register-file sizes, and shared-memory capacities for Blackwell as approximate-verify with the latest official Blackwell datasheet/whitepaper before relying on them in code. The shape of the architecture (more dies, FP4, NVLink 5, 2nd-gen TE) is committed; precise numbers may shift between announcement and shipping silicon.

13.3 What this means practically

Concern                   A100           H100       Ada (4090)          B100/B200
Best ML training option   yes (legacy)   yes        no (consumer)       yes (current)
FP8 inference             no             yes        yes                 yes
FP4 inference             no             no         no                  yes
TMA / wgmma kernels       no             yes        no                  yes (extended)
HBM capacity              40/80 GB       80 GB      24 GB GDDR6X        ~192 GB HBM3e
NVLink bw aggregate       600 GB/s       900 GB/s   none (or limited)   ~1.8 TB/s

14. AMD CDNA / MI300X Contrast

A short, accurate comparison so you know what's the same and what isn't.

14.1 Vocabulary

NVIDIA term                      AMD CDNA equivalent
Streaming Multiprocessor (SM)    Compute Unit (CU)
CUDA core (FP32)                 SIMD lane (each CU has 4 SIMD16 vector units)
Warp (32 threads)                Wavefront (64 threads on CDNA, twice as wide)
Tensor Core                      Matrix Core (MFMA)
NVLink                           Infinity Fabric (xGMI)
L2 cache                         L2 (per-XCD on MI300)
HBM                              HBM (same physical technology)

14.2 MI300X (CDNA 3) specifics

  • 8 XCDs (Accelerator Compute Dies) per package, 304 CUs total-vs H100's 132 SMs. The CUs are smaller individually; aggregate FP64 and matrix throughput are competitive.
  • 192 GB HBM3 (vs H100's 80 GB). Single biggest practical advantage for large-model inference: a 70B BF16 model fits on one MI300X with room for KV cache; on H100 you need 2 GPUs.
  • 5.3 TB/s HBM3 bandwidth (vs H100's ~3 TB/s).
  • Matrix Cores support FP64, FP32, TF32-equivalent, BF16, FP16, INT8, FP8. FP4 is not supported on MI300X (it is reportedly added in MI355).
  • Wavefront = 64 lanes, so divergence dynamics are coarser; tiles and vectorization need adjustment.
  • Infinity Fabric between MI300Xs is roughly comparable to NVLink 4 in per-link bandwidth, but the topology is different (e.g., 8-GPU all-to-all in MI300X servers).

14.3 Software story

ROCm (HIP, rocBLAS, MIOpen, RCCL) targets CDNA. HIP is a near-source-compatible CUDA dialect: cudaMalloc becomes hipMalloc, and a translation tool (hipify) automates most of the renaming. Kernels written for plain CUDA usually port; kernels written for Hopper-specific features (TMA, wgmma, clusters) do not: those are NVIDIA-exclusive.

Practical rule: MI300X is excellent silicon, often the best choice for very large model inference where 192 GB > 80 GB matters; ecosystem maturity for training and exotic kernels still trails CUDA's. Triton and PyTorch both work on ROCm but with thinner kernel coverage than on CUDA.


15. Five Worked Practical Exercises

Exercise 1-Occupancy on H100

A kernel uses 64 registers per thread, 24 KB of shared memory per block, and launches 256 threads per block. Compute occupancy on H100.

Solution.

Per-block resource use:

  • Registers: 64 × 256 = 16,384 registers.
  • Shared memory: 24 × 1024 = 24,576 bytes.
  • Threads: 256, i.e. 8 warps.

SM limits (H100): 65,536 registers, 228 KB smem (= 233,472 B), 2048 threads, 32 blocks, 64 warps.

Per-resource block caps:

  • Registers: 65,536 / 16,384 = 4 blocks.
  • Shared memory: 233,472 / 24,576 ≈ 9.5 → 9 blocks.
  • Threads: 2048 / 256 = 8 blocks.
  • Block hard cap: 32.

B_resident = min(4, 9, 8, 32) = 4 blocks.

Warps resident: 4 × 8 = 32 warps. Occupancy = 32 / 64 = 50%.

The bottleneck is registers. To raise occupancy you'd compile with __launch_bounds__(256, 6) and accept the spills, or refactor the kernel to use fewer registers (e.g., smaller register-resident tile).


Exercise 2-Roofline on a memory-bound kernel

You write an elementwise BF16 kernel: y = silu(x). Tensor x has 1 billion elements. On H100 (3 TB/s HBM, ~989 TFLOPS BF16), what is the minimum runtime, and what limits it?

Solution.

Bytes moved: read x (2 B/elem) + write y (2 B/elem) = 4 B/elem × 1e9 elem = 4e9 B = 4 GB.

Min time (HBM-bound) = 4e9 / 3e12 ≈ 1.33 ms.

FLOPs done: SiLU ≈ a multiply, an exp, an add, a divide ≈ ~4 FLOPs/elem (the exp via SFU is more expensive). Call it 5 FLOPs × 1e9 = 5e9 FLOPs. At 989 TFLOPS, that's 5e9 / 9.89e14 = ~5 µs of compute.

Compute is negligible (5 µs) vs HBM (1.33 ms). The kernel is memory-bound. Arithmetic intensity = 5 FLOPs / 4 B = 1.25 FLOPs/B, well below the H100 BF16 ridge of ~330 FLOPs/B. No amount of Tensor Core wizardry helps; the ceiling is HBM. Fusing this op into a neighboring GEMM (so x and y are read/written through registers in a fused kernel) is the only way past 1.33 ms.


Exercise 3-Tile size for a tensor-core GEMM

You write a CUDA kernel that loads a 128×64 BF16 tile of A and a 64×128 BF16 tile of B per K-step. How much shared memory does double-buffering require, and is that compatible with 50% occupancy at 256 threads/block?

Solution.

Single tile: A is 128 × 64 × 2 B = 16,384 B = 16 KB. B is 64 × 128 × 2 B = 16 KB. Total per stage: 32 KB.

Double-buffered (2 stages): 64 KB per block.

H100 has 228 KB smem/SM. Blocks per SM by smem: floor(228 / 64) = 3 blocks.

For 50% occupancy = 32 warps = 8 blocks of 4 warps each = 8 blocks of 128 threads each, or 4 blocks of 8 warps each = 4 blocks of 256 threads each.

At 256 threads/block (8 warps), 50% needs 4 blocks resident. Smem allows only 3. So double-buffered 64 KB/block is incompatible with 50% occupancy at 256 threads/block-you get at most 3 × 8 = 24 warps = 37.5%.

Options: (a) Shrink the tile (e.g., 128×32 for A and 32×128 for B = 16 KB/stage, 32 KB double-buffered → 7 blocks fit, more than enough). (b) Keep the big tiles, use a 3-stage pipeline that spends even more smem to keep the Tensor Cores fed, and accept the lower occupancy; often the right call on Hopper. (c) Use thread block clusters to share input tiles across multiple blocks via TMA multicast, reducing per-block smem.


Exercise 4-Coalescing analysis

A kernel does y[tid] = x[tid * stride] with stride = 8, BF16. How many HBM transactions per warp, and what's the effective bandwidth utilization?

Solution.

Each lane reads x[tid*8]. Lane addresses (in bytes from base): 0, 16, 32, 48, ..., 16·31 = 496. So the warp's 32 reads span 0..496 + 1 element = 498 B.

A 128-byte sector covers 64 BF16 elements. Lane 0 reads byte 0 (sector 0). Lane 8 reads byte 128 (sector 1). Lane 16 reads byte 256 (sector 2). Lane 24 reads byte 384 (sector 3).

So the warp's loads span 4 sectors, but uses only 32 elements × 2 B = 64 B of the 4 × 128 B = 512 B fetched. Useful data ratio = 64/512 = 12.5%.

Effective bandwidth utilization = 12.5% of HBM peak. To fix: reorganize x so accesses are contiguous (stride 1), or use a transposed kernel that achieves contiguous access via shared-memory staging.


Exercise 5-Why is FlashAttention faster than naive attention?

Naive attention computes Q Kᵀ → softmax → · V by materializing the full N×N attention matrix in HBM. For N = 8192, BF16, what is the HBM traffic, and how does FlashAttention's tile-based approach reduce it?

Solution.

Naive (per attention head):

  • Q is N×d (d=128). Read Q: N·d·2 = 2 MB.
  • K is N×d. Read K: 2 MB.
  • Compute Q Kᵀ → S of shape N×N. N² × 2 B = 8192² × 2 = 128 MB. Write S to HBM.
  • Read S (128 MB). Softmax. Write back P (128 MB). Together: 256 MB.
  • Read P (128 MB), read V (N×d = 2 MB), write O (N×d = 2 MB).

Total HBM bytes ≈ 2 + 2 + 128 + 128 + 128 + 128 + 2 + 2 ≈ 520 MB per head, dominated by the N×N matrix shuffling. For 32 heads: ~16 GB.

FlashAttention:

  • Tile Q in row-blocks of size Br, tile K and V in column-blocks of size Bc, where Br·d + 2·Bc·d fits in shared memory.
  • For each Q tile, stream all K, V tiles through it, computing partial softmax statistics (rowmax, rowsum) online and accumulating O directly.
  • The N×N matrix S never materializes in HBM-it lives only as Br×Bc tiles in shared memory.

HBM bytes for FlashAttention:

  • Read Q once: 2 MB.
  • Read K and V: 2 + 2 = 4 MB in the ideal case. (Strictly, K and V are reread once per Q tile, so their total traffic scales as O(N²·d²/M) for shared-memory size M; with reasonable tile sizes the reread multiplier is small, approaching O(N·d) per tensor given enough smem.)
  • Write O: 2 MB.

Total ≈ ~10 MB per head, ~50× less HBM traffic than naive. Since attention at N=8192 is HBM-bound, this directly translates to ~50× wall-clock speedup at this scale.

The lesson: the GPU memory hierarchy is the algorithm. FlashAttention's mathematical content (online softmax) exists because keeping intermediate state in shared memory-instead of HBM-is the only way to make attention fast at long context. Tile size, smem budget, and the SM architecture chose the algorithm.


Closing Notes

What you should now hold without external references:

  1. The GPU is a throughput machine that hides latency by parallelism, not prediction.
  2. The SM has 4 sub-partitions; each holds up to 16 warps; 64-warp / 2048-thread / 256 KB-register / 228 KB-smem caps on H100.
  3. The memory hierarchy is registers → smem/L1 (~25 cycles) → L2 (~150 cyc) → HBM (~500 cyc) → host (~µs) → NVMe (~10s of µs), and every algorithmic decision is a memory decision.
  4. Tensor Cores do D = A·B + C on small fixed tiles; 4 per SM; precisions BF16, FP16, FP8, INT8 (Hopper), FP4 (Blackwell); structured 2:4 sparsity doubles throughput.
  5. TMA + warpgroup MMA + clusters are Hopper's mechanism for decoupling HBM motion from Tensor Core compute-kernels overlap them via warp specialization.
  6. Occupancy = resident_warps / 64 on H100, derived from min(register, smem, thread, block) caps. Higher is not always faster.
  7. NVLink 4 = 900 GB/s aggregate per H100; NVSwitch makes 8-GPU domains flat; NVL72 makes 72-GPU domains flat for Blackwell.
  8. Ada lacks TMA / wgmma / clusters / HBM. Blackwell adds FP4, NVLink 5, HBM3e, multi-die.
  9. AMD MI300X has 192 GB HBM3 / 304 CUs / 64-lane wavefronts-different shape, similar physics.
  10. Numbers I marked approximate (especially Blackwell figures): always verify with the current NVIDIA datasheet/whitepaper before relying on them in code or capacity planning.
