Week 1 - The Compute Hierarchy and the Cost Model
1.1 Conceptual Core
- Modern AI computation crosses seven memory tiers, each ~10× slower and ~10× larger than the one above. Performance is determined by which tier you're operating in:
- Registers (~1 KB per core, 0-cycle latency, ~10 TB/s).
- L1 / shared memory / SRAM (~64 KB per SM/core, ~5-cycle, ~5 TB/s).
- L2 cache (~50 MB on H100, ~50-cycle).
- HBM / VRAM (~80 GB on H100, ~500-cycle, ~3 TB/s).
- Host DRAM (~TB, latency dominated by the PCIe hop when accessed from the GPU, ~64 GB/s over PCIe Gen5 x16).
- Local NVMe (~TB+, ms latency, ~10 GB/s).
- Network / cluster (TB to PB, microseconds–milliseconds, depends on fabric).
- Training a model across 1,024 GPUs is the same algorithm applied at every layer of this hierarchy: keep the hot data near the compute.
- The defining trick of modern ML systems: most operations are memory-bound. The ALUs are starved. Performance work is moving data better, not computing faster.
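To make the tiers tangible on your own machine, here is a rough probe, a sketch assuming NumPy: stream over progressively larger arrays and watch effective bandwidth fall once the working set spills out of cache into DRAM. The sizes where the drop happens depend on your CPU's cache sizes, and Python call overhead blurs the smallest arrays.

```python
import time
import numpy as np

# Rough cache-vs-DRAM probe: sum over arrays of increasing size and watch
# effective bandwidth drop once the working set no longer fits in cache.
for mib in (1, 4, 16, 64, 256, 512):
    n = mib * 2**20 // 8                 # number of float64 elements
    x = np.ones(n)
    reps = 20
    t0 = time.perf_counter()
    for _ in range(reps):
        x.sum()                          # reads ~8*n bytes per pass
    dt = (time.perf_counter() - t0) / reps
    print(f"{mib:4d} MiB  {8 * n / dt / 1e9:6.1f} GB/s")
```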
1.2 Mechanical Detail
- Floating-point formats in 2026:
- FP32 (single, 32-bit): 8-bit exponent, 23-bit mantissa. Default training precision historically.
- FP16 (half, 16-bit): 5-bit exponent, 10-bit mantissa. Limited dynamic range; prone to overflow in training (see the range check after this list).
- BF16 (bfloat16, 16-bit): 8-bit exponent (matches FP32), 7-bit mantissa. The standard training format on modern GPUs/TPUs.
- FP8 (8-bit, two variants): E4M3 (4 exp, 3 mantissa) and E5M2 (5 exp, 2 mantissa). H100/H200 native; the future of training.
- INT8 / INT4: integer quantization formats for inference.
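A quick way to see why exponent width matters, using only NumPy (a minimal sketch; NumPy has no native bfloat16, so BF16's FP32-sized exponent is only noted in the comments):

```python
import numpy as np

# Inspect exponent/mantissa widths and dynamic range of the formats NumPy has.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{np.dtype(dtype).name}: exponent bits={info.nexp}, "
          f"mantissa bits={info.nmant}, max={info.max:.3e}")

# FP16 tops out at 65504, so activations and gradients overflow easily:
print(np.float16(60000) + np.float16(10000))   # -> inf
# BF16 keeps FP32's 8-bit exponent (max ~3.4e38), trading mantissa bits
# for FP32-like dynamic range.
```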
- Hardware peak numbers worth memorizing (2026, single GPU):
- H100 SXM: ~989 TFLOPS BF16 dense, ~1979 TFLOPS FP8 dense, ~3.35 TB/s HBM.
- H200: same compute, ~4.8 TB/s HBM.
- B200 (Blackwell): roughly 2.5× H100 BF16, 5× FP8 with sparsity, ~8 TB/s HBM.
- A100: ~312 TFLOPS BF16, ~2 TB/s HBM. Still the workhorse.
- Arithmetic intensity = FLOPs per byte loaded. The crossover (ridge) point between memory-bound and compute-bound is hardware-specific: peak FLOPS divided by peak memory bandwidth. For H100 BF16 that is ~295 FLOP/byte (989 TFLOPS / 3.35 TB/s). Below it: memory-bound. Above: compute-bound. This is the most important number in your career.
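To make the crossover concrete, here is the arithmetic behind that ~295 figure, plus the point at which a square matmul crosses it under the naive 3·N² traffic model used in the lab below. This is a back-of-envelope sketch using the peak numbers quoted above, not a measured result.

```python
# Ridge point = peak compute / peak memory bandwidth (H100 SXM, BF16 dense).
peak_flops = 989e12          # FLOP/s
hbm_bw = 3.35e12             # bytes/s
ridge = peak_flops / hbm_bw
print(f"ridge ~ {ridge:.0f} FLOP/byte")          # ~295

# Square matmul: 2*N^3 FLOPs over 3*N^2 elements of 2-byte BF16 traffic,
# so arithmetic intensity = N / 3 FLOP/byte under this naive model.
for n in (64, 256, 1024, 4096):
    ai = (2 * n**3) / (3 * n**2 * 2)
    side = "compute-bound" if ai > ridge else "memory-bound"
    print(f"N={n:5d}: {ai:6.1f} FLOP/byte ({side})")
# Crossover at N ~ 3 * 295 ~ 900 for this traffic model.
```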
1.3 Lab-"Roofline Sketch"¶
Write a small program (Python+NumPy, or any) that:
1. Performs C = A @ B for square matrices N=64, 256, 1024, 4096.
2. Times each. Computes achieved FLOPS (= 2·N³ / time).
3. Computes the bytes moved (= 3·N²·sizeof(dtype)).
4. Plots achieved FLOPS vs arithmetic intensity on log-log axes.
5. Overlays the theoretical roofline of your laptop CPU (look up its peak FLOPS and DRAM bandwidth).
You should see the small-N points sit under the bandwidth ramp and the large-N points approach the compute roof. Keep the plot; every subsequent lab will produce another. A starter sketch follows.
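If you want a starting point, here is a minimal NumPy + matplotlib version of steps 1-5. PEAK_FLOPS and PEAK_BW are placeholders you must replace with your own CPU's peak FLOPS and DRAM bandwidth.

```python
import time
import numpy as np
import matplotlib.pyplot as plt

PEAK_FLOPS = 200e9   # placeholder: your CPU's peak FLOP/s (look it up)
PEAK_BW = 40e9       # placeholder: your DRAM bandwidth in bytes/s

sizes = [64, 256, 1024, 4096]
achieved, intensity = [], []
for n in sizes:
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b                                     # warm-up run
    t0 = time.perf_counter()
    c = a @ b
    dt = time.perf_counter() - t0
    flops = 2 * n**3 / dt                     # achieved FLOP/s
    bytes_moved = 3 * n**2 * a.itemsize       # read A, B; write C once
    achieved.append(flops)
    intensity.append(2 * n**3 / bytes_moved)  # arithmetic intensity

# Roofline: min(peak compute, bandwidth * arithmetic intensity).
ai = np.logspace(0, 4, 200)
plt.loglog(ai, np.minimum(PEAK_FLOPS, PEAK_BW * ai), label="roofline")
plt.loglog(intensity, achieved, "o", label="measured matmul")
plt.xlabel("arithmetic intensity (FLOP/byte)")
plt.ylabel("achieved FLOP/s")
plt.legend()
plt.savefig("roofline.png")
```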
1.4 Idiomatic & Diagnostic Drill
- Install htop, numactl, and perf. Pin the matmul to a single CPU core (taskset -c 0); observe how performance changes. Pin it to a NUMA node (numactl --cpunodebind=0 --membind=0); observe again.
1.5 Production Slice
- Most production AI workloads run on cloud GPUs metered per second. Build a one-page "GPU economics cheat sheet" with cost per hour for A100 (40 GB and 80 GB), H100, H200, and B200 across at least three providers (e.g. AWS, GCP, Lambda, RunPod). Update it yearly.