Week 1 - The Compute Hierarchy and the Cost Model

1.1 Conceptual Core

  • Modern AI computation crosses seven memory tiers, each ~10× slower and ~10× larger than the one above. Performance is determined by which tier you're operating in:
  • Registers (~1 KB per core, 0-cycle latency, ~10 TB/s).
  • L1 / shared memory / SRAM (~64 KB per SM/core, ~5-cycle, ~5 TB/s).
  • L2 cache (~50 MB on H100, ~50-cycle).
  • HBM / VRAM (~80 GB on H100, ~500-cycle, ~3 TB/s).
  • Host DRAM (~TB scale, ~100 ns latency locally; ~64 GB/s to the GPU over PCIe Gen5 x16).
  • Local NVMe (~TB+, ms latency, ~10 GB/s).
  • Network / cluster (TB to PB, microseconds–milliseconds, depends on fabric).
  • Training the same model across 1,024 GPUs applies one principle at every layer of this hierarchy: keep the hot data near the compute.
  • The defining insight of modern ML systems: most operations are memory-bound, so the ALUs sit starved. Performance work is mostly moving data better, not computing faster.
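You can see the memory-bound/compute-bound split on any machine by comparing the achieved FLOP rate of a low-intensity elementwise add against a high-intensity matmul. A minimal NumPy sketch (the array sizes are arbitrary choices, and the exact numbers depend on your CPU and BLAS):

```python
import time
import numpy as np

def best_time(fn, repeats=5):
    """Best wall-clock time over several runs, to reduce timer noise."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Elementwise add: 1 FLOP per element, ~12 bytes moved per element
# (read a, read b, write result) -> intensity ~0.08 FLOP/byte. Memory-bound.
n = 4_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
add_gflops = n / best_time(lambda: a + b) / 1e9

# Matmul: 2*N^3 FLOPs over ~3*N^2*4 bytes -> intensity ~N/6 FLOP/byte.
# At N=2048 that is ~340 FLOP/byte: compute-bound on any CPU.
N = 2048
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
mm_gflops = 2 * N**3 / best_time(lambda: A @ B) / 1e9

print(f"vector add: {add_gflops:6.1f} GFLOP/s (memory-bound)")
print(f"matmul:     {mm_gflops:6.1f} GFLOP/s (compute-bound)")
```

The matmul typically achieves one to two orders of magnitude more FLOP/s than the add, despite running on the same ALUs; the only difference is how many times each loaded byte gets reused.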

1.2 Mechanical Detail

  • Floating-point formats in 2026:
  • FP32 (single, 32-bit): 8-bit exponent, 23-bit mantissa. Default training precision historically.
  • FP16 (half, 16-bit): 5-bit exponent, 10-bit mantissa. Limited dynamic range; overflows during training without loss scaling.
  • BF16 (bfloat16, 16-bit): 8-bit exponent (matches FP32), 7-bit mantissa. The standard training format on modern GPUs/TPUs.
  • FP8 (8-bit, two variants): E4M3 (4 exp, 3 mantissa) and E5M2 (5 exp, 2 mantissa). H100/H200 native; the future of training.
  • INT8 / INT4: integer quantization formats for inference.
  • Hardware peak numbers worth memorizing (2026, single GPU):
  • H100 SXM: ~989 TFLOPS BF16 dense, ~1979 TFLOPS FP8 dense, ~3.35 TB/s HBM.
  • H200: same compute, ~4.8 TB/s HBM.
  • B200 (Blackwell): roughly 2.5× H100 BF16, 5× FP8 with sparsity, ~8 TB/s HBM.
  • A100: ~312 TFLOPS BF16, ~2 TB/s HBM. Still the workhorse.
  • Arithmetic intensity = FLOPs per byte loaded. The crossover (ridge) point between memory-bound and compute-bound is hardware-specific: peak FLOPS divided by memory bandwidth. For H100 BF16: 989 TFLOPS / 3.35 TB/s ≈ 295 FLOP/byte. Below it: memory-bound. Above: compute-bound. This is the most important number in your career.
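The range limits above are easy to verify directly. A small NumPy check of FP16's narrow dynamic range (NumPy has no native bfloat16, so BF16 is noted only in comments):

```python
import numpy as np

# FP16's largest finite value is 65504; anything bigger overflows to inf.
print(np.finfo(np.float16).max)    # 65504.0
print(np.float16(70000.0))         # inf -- why FP16 training needs loss scaling
print(np.float32(70000.0))         # 70000.0 -- FP32 max is ~3.4e38

# BF16 shares FP32's 8-bit exponent, so its range also reaches ~3.4e38,
# traded against precision: only 7 mantissa bits (~2-3 decimal digits).
```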
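The ridge point falls straight out of the two peak numbers; a few lines of arithmetic using the H100 SXM figures quoted in this section:

```python
# Ridge point = peak FLOP/s / memory bandwidth, in FLOP per byte.
peak_bf16 = 989e12        # FLOP/s, H100 SXM BF16 dense
hbm_bw = 3.35e12          # bytes/s, H100 HBM
ridge = peak_bf16 / hbm_bw
print(f"H100 BF16 ridge point: {ridge:.0f} FLOP/byte")    # ~295

# Square BF16 matmul: 2*N^3 FLOPs over ~3*N^2*2 bytes -> intensity = N/3.
# It becomes compute-bound once N/3 exceeds the ridge, i.e. N > ~885.
n_crossover = 3 * ridge
print(f"matmul compute-bound above N ~ {n_crossover:.0f}")
```

This is why small matmuls (and most non-matmul ops) never come close to the quoted TFLOPS: their intensity sits far left of the ridge.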

1.3 Lab: "Roofline Sketch"

Write a small program (Python+NumPy, or any language) that:

  1. Performs C = A @ B for square matrices N = 64, 256, 1024, 4096.
  2. Times each run and computes achieved FLOPS (= 2·N³ / time).
  3. Computes the bytes moved (= 3·N²·sizeof(dtype), a lower bound counting each matrix once).
  4. Plots achieved FLOPS vs. arithmetic intensity on log-log axes.
  5. Overlays the theoretical roofline of your laptop CPU (look up its peak FLOPS and DRAM bandwidth).
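Steps 1–3 can be sketched as below; the plotting steps reuse the same rows with matplotlib, overlaying min(peak, bandwidth · intensity) for your CPU's spec-sheet numbers:

```python
import time
import numpy as np

DTYPE = np.float32
ITEM = np.dtype(DTYPE).itemsize

def best_time(fn, repeats=3):
    """Best wall-clock time over a few runs, to reduce timer noise."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

rows = []
for N in (64, 256, 1024, 4096):
    A = np.random.rand(N, N).astype(DTYPE)
    B = np.random.rand(N, N).astype(DTYPE)
    t = best_time(lambda: A @ B)
    flops = 2 * N**3                 # multiply-adds in C = A @ B
    moved = 3 * N**2 * ITEM          # A, B read + C written, each once (lower bound)
    rows.append((N, flops / moved, flops / t / 1e9))

print(f"{'N':>6} {'FLOP/byte':>10} {'GFLOP/s':>10}")
for N, intensity, gflops in rows:
    print(f"{N:>6} {intensity:>10.1f} {gflops:>10.1f}")

# Steps 4-5: plt.loglog() the (intensity, gflops) pairs, then overlay
# min(peak_gflops, dram_gbps * x) from your CPU's published specs.
```

Note the intensity of an N×N FP32 matmul under this byte count is N/6, so the table's first column is just N/6: the sweep over N is a sweep along the roofline's x-axis.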

You should see the small-N points sit under the bandwidth ramp and the large-N points approach the compute roof. Keep the plot; every subsequent lab will produce another.

1.4 Idiomatic & Diagnostic Drill

  • Install htop, numactl, and perf. Pin the matmul to a single CPU core (taskset -c 0) and observe how performance changes. Then pin it to a NUMA node (numactl --cpunodebind=0 --membind=0) and observe again.

1.5 Production Slice

  • Most production AI workloads run on cloud GPUs metered per second. Build a one-page "GPU economics cheat sheet" with cost per hour for the A100 (40GB and 80GB), H100, H200, and B200 across at least three providers (e.g., AWS, GCP, Lambda, RunPod). Update it yearly.
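The cheat-sheet math itself is one division: providers quote $/hr, but the comparable number is $/PFLOP-hour. A sketch with hypothetical placeholder prices (the dollar figures below are not real quotes; fill in current ones per provider) and the BF16 peaks from section 1.2:

```python
# $/hr figures are HYPOTHETICAL PLACEHOLDERS -- replace with live quotes.
gpus = {
    #  name         ($/hr,  dense BF16 TFLOPS from section 1.2)
    "A100-80GB": (1.80, 312),
    "H100":      (2.50, 989),
    "H200":      (3.20, 989),   # same compute as H100, more HBM bandwidth
}

# Cost per PFLOP-hour = ($/hr) / (TFLOPS / 1000).
costs = {name: usd_hr / (tflops / 1000)
         for name, (usd_hr, tflops) in gpus.items()}

for name, c in costs.items():
    print(f"{name:>10}: ${c:5.2f} per PFLOP-hour")
```

At these placeholder prices the H100 is cheaper per unit of compute than the A100 despite a higher hourly rate, which is the usual shape of the real comparison and the whole point of normalizing.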
