Month 2 - GPU Programming: Architecture, CUDA, Memory, Triton

Goal: by the end of week 8 you can (a) describe the GPU's hierarchical execution model from grid down to warp lane, (b) write CUDA kernels that reach more than 70% of peak memory bandwidth on memory-bound problems and more than 70% of peak compute throughput on compute-bound problems, (c) use shared memory and tensor cores correctly, and (d) write equivalent kernels in Triton that run within 2× of the CUDA versions' performance with roughly a fifth of the code.
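To make the ">70% of peak bandwidth" target concrete, here is a minimal sketch of the arithmetic for a float32 vector add. The elapsed time and the peak bandwidth figure are hypothetical placeholders, not measurements or specifications of any real GPU; substitute your own timing and your device's datasheet number.

```python
def achieved_bandwidth_gbs(bytes_moved: int, seconds: float) -> float:
    """Effective bandwidth in GB/s, using 1 GB = 1e9 bytes."""
    return bytes_moved / seconds / 1e9


def fraction_of_peak(achieved_gbs: float, peak_gbs: float) -> float:
    """Share of the peak bandwidth that the kernel actually reached."""
    return achieved_gbs / peak_gbs


# Hypothetical workload: c = a + b over n float32 elements.
n = 1 << 26                      # 67,108,864 elements
bytes_moved = 3 * n * 4          # two reads + one write, 4 bytes each

# Hypothetical numbers, chosen only for illustration.
elapsed_s = 1.0e-3               # assumed measured kernel time
peak_gbs = 1000.0                # assumed peak DRAM bandwidth in GB/s

achieved = achieved_bandwidth_gbs(bytes_moved, elapsed_s)
frac = fraction_of_peak(achieved, peak_gbs)

print(f"achieved: {achieved:.1f} GB/s, {frac:.1%} of peak")
print("meets 70% target" if frac > 0.70 else "below 70% target")
```

The same calculation applies to compute-bound kernels if you replace bytes moved with floating-point operations and peak bandwidth with peak FLOP/s.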

Deep-dive companions (read in tandem):

- Week 5 → the GPU Architecture deep dive: full SM/memory/tensor-core derivation, occupancy theory, NVLink topology.
- Week 6–7 → the CUDA Programming deep dive: six-stage tiled GEMM with code, mma.sync PTX, complete buildable BF16 GEMM at 60–70% cuBLAS.
- Week 8 → the Triton deep dive: block-level model, autotune, six annotated kernels including a simplified flash-attention.

Worked investigation (hands-on, real GPU): Profile a kernel with ncu (the Nsight Compute CLI) and read the speed-of-light report to determine whether the kernel is memory-bound or compute-bound and how close it runs to the hardware ceiling. See the Worked Examples section.
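The memory-bound versus compute-bound distinction can be sketched with a roofline-style calculation before you ever open a profiler. This is an illustration of the underlying arithmetic only, not of how ncu computes its report; the peak figures are hypothetical, and the GEMM byte count assumes the idealized case where each matrix crosses DRAM exactly once.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved


def ridge_point(peak_flops: float, peak_bytes_per_s: float) -> float:
    """Intensity at which the memory and compute ceilings meet."""
    return peak_flops / peak_bytes_per_s


def classify(ai: float, peak_flops: float, peak_bytes_per_s: float) -> str:
    """Below the ridge the kernel is limited by bandwidth, above it by compute."""
    if ai < ridge_point(peak_flops, peak_bytes_per_s):
        return "memory-bound"
    return "compute-bound"


def attainable_flops(ai: float, peak_flops: float, peak_bytes_per_s: float) -> float:
    """Roofline ceiling: the lower of the compute peak and bandwidth * intensity."""
    return min(peak_flops, ai * peak_bytes_per_s)


# Hypothetical device, for illustration only.
PEAK_FLOPS = 100e12   # assumed 100 TFLOP/s
PEAK_BW = 1e12        # assumed 1 TB/s

# float32 vector add: 1 FLOP per element, 12 bytes of traffic per element.
n = 1 << 26
vec_ai = arithmetic_intensity(n, 3 * n * 4)

# Square GEMM with 2-byte elements: 2*m^3 FLOPs, three m*m matrices moved once.
m = 4096
gemm_ai = arithmetic_intensity(2 * m**3, 3 * m * m * 2)

for name, ai in [("vector add", vec_ai), ("GEMM", gemm_ai)]:
    kind = classify(ai, PEAK_FLOPS, PEAK_BW)
    ceiling = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)
    print(f"{name}: {ai:.2f} FLOP/byte, {kind}, ceiling {ceiling / 1e12:.2f} TFLOP/s")
```

Comparing a measured throughput against the ceiling returned by `attainable_flops` gives the "how close to the hardware ceiling" figure for that kernel under these assumptions.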


Weeks
