Week 1 - The Compute Hierarchy and the Cost Model
1.1 Conceptual Core
- Modern AI computation crosses seven memory tiers, each ~10× slower and ~10× larger than the one above. Performance is determined by which tier you're operating in:
- Registers (~1 KB per core, 0-cycle latency, ~10 TB/s).
- L1 / shared memory / SRAM (~64 KB per SM/core, ~5-cycle, ~5 TB/s).
- L2 cache (~50 MB on H100, ~50-cycle).
- HBM / VRAM (~80 GB on H100, ~500-cycle, ~3 TB/s).
- Host DRAM (~TB, latency dominated by the PCIe hop when accessed from the GPU, ~64 GB/s over PCIe Gen5 x16).
- Local NVMe (~TB+, ms latency, ~10 GB/s).
- Network / cluster (TB to PB, microseconds–milliseconds, depends on fabric).
- Training a model across 1,024 GPUs is the same algorithm applied at every layer of this hierarchy: keep the hot data near the compute.
- The defining trick of modern ML systems: most operations are memory-bound. The ALUs are starved. Performance work is moving data better, not computing faster.
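To make the tiers tangible on your own machine, here is a rough probe, a sketch assuming NumPy: stream over progressively larger arrays and watch effective bandwidth fall once the working set spills out of cache into DRAM. The sizes where the drop happens depend on your CPU's cache sizes, and Python call overhead blurs the smallest arrays.

```python
import time
import numpy as np

# Rough cache-vs-DRAM probe: sum over arrays of increasing size and watch
# effective bandwidth drop once the working set no longer fits in cache.
for mib in (1, 4, 16, 64, 256, 512):
    n = mib * 2**20 // 8                 # number of float64 elements
    x = np.ones(n)
    reps = 20
    t0 = time.perf_counter()
    for _ in range(reps):
        x.sum()                          # reads ~8*n bytes per pass
    dt = (time.perf_counter() - t0) / reps
    print(f"{mib:4d} MiB  {8 * n / dt / 1e9:6.1f} GB/s")
```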
1.2 Mechanical Detail
- Floating-point formats in 2026:
- FP32 (single, 32-bit): 8-bit exponent, 23-bit mantissa. Default training precision historically.
- FP16 (half, 16-bit): 5-bit exponent, 10-bit mantissa. Limited dynamic range; prone to overflow in training (see the range check after this list).
- BF16 (bfloat16, 16-bit): 8-bit exponent (matches FP32), 7-bit mantissa. The standard training format on modern GPUs/TPUs.
- FP8 (8-bit, two variants): E4M3 (4 exp, 3 mantissa) and E5M2 (5 exp, 2 mantissa). H100/H200 native; the future of training.
- INT8 / INT4: integer quantization formats for inference.
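A quick way to see why exponent width matters, using only NumPy (a minimal sketch; NumPy has no native bfloat16, so BF16's FP32-sized exponent is only noted in the comments):

```python
import numpy as np

# Inspect exponent/mantissa widths and dynamic range of the formats NumPy has.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{np.dtype(dtype).name}: exponent bits={info.nexp}, "
          f"mantissa bits={info.nmant}, max={info.max:.3e}")

# FP16 tops out at 65504, so activations and gradients overflow easily:
print(np.float16(60000) + np.float16(10000))   # -> inf
# BF16 keeps FP32's 8-bit exponent (max ~3.4e38), trading mantissa bits
# for FP32-like dynamic range.
```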
- Hardware peak numbers worth memorizing (2026, single GPU):
- H100 SXM: ~989 TFLOPS BF16 dense, ~1979 TFLOPS FP8 dense, ~3.35 TB/s HBM.
- H200: same compute, ~4.8 TB/s HBM.
- B200 (Blackwell): roughly 2.5× H100 BF16, 5× FP8 with sparsity, ~8 TB/s HBM.
- A100: ~312 TFLOPS BF16, ~2 TB/s HBM. Still the workhorse.
- Arithmetic intensity = FLOPs per byte loaded. The crossover (ridge) point between memory-bound and compute-bound is hardware-specific: peak FLOPS divided by peak memory bandwidth. For H100 BF16 that is ~295 FLOP/byte (989 TFLOPS / 3.35 TB/s). Below it: memory-bound. Above: compute-bound. This is the most important number in your career.
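To make the crossover concrete, here is the arithmetic behind that ~295 figure, plus the point at which a square matmul crosses it under the naive 3·N² traffic model used in the lab below. This is a back-of-envelope sketch using the peak numbers quoted above, not a measured result.

```python
# Ridge point = peak compute / peak memory bandwidth (H100 SXM, BF16 dense).
peak_flops = 989e12          # FLOP/s
hbm_bw = 3.35e12             # bytes/s
ridge = peak_flops / hbm_bw
print(f"ridge ~ {ridge:.0f} FLOP/byte")          # ~295

# Square matmul: 2*N^3 FLOPs over 3*N^2 elements of 2-byte BF16 traffic,
# so arithmetic intensity = N / 3 FLOP/byte under this naive model.
for n in (64, 256, 1024, 4096):
    ai = (2 * n**3) / (3 * n**2 * 2)
    side = "compute-bound" if ai > ridge else "memory-bound"
    print(f"N={n:5d}: {ai:6.1f} FLOP/byte ({side})")
# Crossover at N ~ 3 * 295 ~ 900 for this traffic model.
```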
1.3 Lab-"Roofline Sketch"¶
Write a small program (Python+NumPy, or any) that:
1. Performs C = A @ B for square matrices N=64, 256, 1024, 4096.
2. Times each. Computes achieved FLOPS (= 2·N³ / time).
3. Computes the bytes moved (= 3·N²·sizeof(dtype)).
4. Plots achieved FLOPS vs arithmetic intensity on log-log axes.
5. Overlays the theoretical roofline of your laptop CPU (look up its peak FLOPS and DRAM bandwidth).
You should see the small-N points sit under the bandwidth ramp and the large-N points approach the compute roof. Keep the plot; every subsequent lab will produce another. A starter sketch follows.
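If you want a starting point, here is a minimal NumPy + matplotlib version of steps 1-5. PEAK_FLOPS and PEAK_BW are placeholders you must replace with your own CPU's peak FLOPS and DRAM bandwidth.

```python
import time
import numpy as np
import matplotlib.pyplot as plt

PEAK_FLOPS = 200e9   # placeholder: your CPU's peak FLOP/s (look it up)
PEAK_BW = 40e9       # placeholder: your DRAM bandwidth in bytes/s

sizes = [64, 256, 1024, 4096]
achieved, intensity = [], []
for n in sizes:
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b                                     # warm-up run
    t0 = time.perf_counter()
    c = a @ b
    dt = time.perf_counter() - t0
    flops = 2 * n**3 / dt                     # achieved FLOP/s
    bytes_moved = 3 * n**2 * a.itemsize       # read A, B; write C once
    achieved.append(flops)
    intensity.append(2 * n**3 / bytes_moved)  # arithmetic intensity

# Roofline: min(peak compute, bandwidth * arithmetic intensity).
ai = np.logspace(0, 4, 200)
plt.loglog(ai, np.minimum(PEAK_FLOPS, PEAK_BW * ai), label="roofline")
plt.loglog(intensity, achieved, "o", label="measured matmul")
plt.xlabel("arithmetic intensity (FLOP/byte)")
plt.ylabel("achieved FLOP/s")
plt.legend()
plt.savefig("roofline.png")
```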
1.4 Idiomatic & Diagnostic Drill
- Install htop, numactl, and perf. Pin the matmul to a single CPU core (taskset -c 0); observe how performance changes. Pin it to a NUMA node (numactl --cpunodebind=0 --membind=0); observe again.
1.5 Production Slice
- Most production AI workloads run on cloud GPUs metered per second. Build a one-page "GPU economics cheat sheet" with cost per hour for A100 (40 GB and 80 GB), H100, H200, and B200 across at least three providers (e.g. AWS, GCP, Lambda, RunPod). Update it yearly.