Month 2 - GPU Programming: Architecture, CUDA, Memory, Triton
Goal: by the end of week 8 you can (a) describe the GPU's hierarchical execution model from grid down to warp lane, (b) write CUDA kernels that reach more than 70% of peak memory bandwidth on memory-bound problems and more than 70% of peak compute throughput on compute-bound problems, (c) use shared memory and tensor cores correctly, and (d) write equivalent kernels in Triton that run within 2× of the CUDA versions' performance with about 5× less code.
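As a rough illustration of goal (a), the sketch below models in plain Python how a flat global thread index in a 1D launch breaks down into block, thread-in-block, warp, and lane. It assumes the NVIDIA warp size of 32 and a 1D grid; the function name `locate_thread` is hypothetical and not part of any CUDA or Triton API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def locate_thread(global_idx: int, block_dim: int) -> tuple[int, int, int, int]:
    """Decompose a flat 1D global thread index into its hierarchy coordinates.

    Mirrors the CUDA relation global_idx = blockIdx.x * blockDim.x + threadIdx.x
    and the usual convention warp = threadIdx.x // 32, lane = threadIdx.x % 32.
    """
    block_idx, thread_idx = divmod(global_idx, block_dim)  # grid -> block
    warp_idx, lane = divmod(thread_idx, WARP_SIZE)         # block -> warp -> lane
    return block_idx, thread_idx, warp_idx, lane


# Thread 1000 of a launch with 256 threads per block:
print(locate_thread(1000, 256))  # (3, 232, 7, 8)
```

Reading the result left to right walks the hierarchy: block 3 of the grid, thread 232 of that block, warp 7 of that block, lane 8 of that warp.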
Deep-dive companions (read in tandem):
- Week 5 → the GPU Architecture deep dive: full SM/memory/tensor-core derivation, occupancy theory, NVLink topology.
- Weeks 6–7 → the CUDA Programming deep dive: six-stage tiled GEMM with code, mma.sync PTX, and a complete buildable BF16 GEMM at 60–70% of cuBLAS performance.
- Week 8 → the Triton deep dive: block-level model, autotune, six annotated kernels including a simplified flash-attention.
Worked investigation (hands-on, real GPU): Profile a kernel with ncu (the Nsight Compute CLI) and read its speed-of-light report to determine whether the kernel is memory-bound or compute-bound and how close it runs to the hardware ceiling. See the Worked Examples section.
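The memory-bound versus compute-bound question can be approximated on paper with a simple roofline calculation before profiling. The sketch below is a minimal model, not what ncu computes internally; the helper names and every hardware and timing number in it (peak FLOP/s, peak bandwidth, elapsed time) are hypothetical placeholders to be replaced with your GPU's specifications and measurements.

```python
def roofline(flops: float, bytes_moved: float,
             peak_flops: float, peak_bw: float) -> tuple[str, float]:
    """Classify a kernel and return its attainable FLOP/s ceiling.

    intensity = FLOPs per byte of memory traffic
    ridge     = intensity at which the compute and bandwidth limits meet
    """
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bw
    bound = "memory" if intensity < ridge else "compute"
    ceiling = min(peak_flops, intensity * peak_bw)
    return bound, ceiling


def fraction_of_ceiling(flops: float, elapsed_s: float, ceiling: float) -> float:
    """Achieved FLOP/s divided by the attainable ceiling."""
    return (flops / elapsed_s) / ceiling


# Hypothetical example: float32 vector add, c[i] = a[i] + b[i].
n = 1 << 26
flops = n                 # one add per element
bytes_moved = 3 * 4 * n   # two 4-byte reads and one 4-byte write per element
PEAK_FLOPS = 20e12        # assumed peak, FLOP/s (placeholder)
PEAK_BW = 2e12            # assumed peak, bytes/s (placeholder)
ELAPSED_S = 0.6e-3        # assumed measured kernel time (placeholder)

bound, ceiling = roofline(flops, bytes_moved, PEAK_FLOPS, PEAK_BW)
frac = fraction_of_ceiling(flops, ELAPSED_S, ceiling)
print(f"{bound}-bound, {frac:.0%} of the attainable ceiling")
```

With these placeholder numbers the kernel performs 1/12 FLOP per byte, far below the ridge point of 10 FLOP per byte, so it is memory-bound, and the fraction of the ceiling equals the fraction of peak bandwidth achieved.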
Weeks
- Week 5 - GPU Hardware Architecture
- Week 6 - Your First CUDA Kernels
- Week 7 - Memory Optimization: Coalescing, Shared Memory, Tensor Cores
- Week 8 - Triton: GPU Kernels From Python