
Month 2 - GPU Programming: Architecture, CUDA, Memory, Triton

Goal: by the end of Week 8 you can (a) describe the GPU's hierarchical execution model from grid down to warp lane, (b) write CUDA kernels that reach >70% of peak bandwidth on memory-bound problems and >70% of peak compute on compute-bound ones, (c) use shared memory and tensor cores correctly, and (d) write equivalent Triton kernels at within 2× of the CUDA performance in roughly 5× less code.
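
As a concrete anchor for the memory-bound half of goal (b), here is a minimal sketch of how such a kernel is typically measured: a grid-stride copy kernel timed with CUDA events and converted to achieved GB/s, to be compared against the GPU's peak HBM bandwidth. The kernel name, launch configuration, and buffer size are illustrative choices, not taken from the course materials.

```cuda
// Sketch: measure achieved bandwidth of a memory-bound kernel.
// All names and sizes here are illustrative, not from the deep dives.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float* __restrict__ in,
                            float* __restrict__ out, size_t n) {
    // Grid-stride loop: each thread handles multiple elements, so the
    // same launch configuration works for any n.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = in[i];
}

int main() {
    const size_t n = 1 << 28;                // 256M floats = 1 GiB per buffer
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_kernel<<<1024, 256>>>(in, out, n);  // warm-up launch
    cudaEventRecord(start);
    copy_kernel<<<1024, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per element.
    double gb = 2.0 * n * sizeof(float) / 1e9;
    printf("achieved %.1f GB/s; compare against your GPU's peak bandwidth\n",
           gb / (ms / 1e3));
    return 0;
}
```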

Deep-dive companions (read in tandem):

- Week 5 → `DEEP_DIVES/01_GPU_ARCHITECTURE.md` - full SM/memory/tensor-core derivation, occupancy theory, NVLink topology.
- Week 6–7 → `DEEP_DIVES/02_CUDA_PROGRAMMING.md` - six-stage tiled GEMM with code, `mma.sync` PTX, and a complete buildable BF16 GEMM at 60–70% of cuBLAS.
- Week 8 → `DEEP_DIVES/03_TRITON.md` - block-level programming model, autotuning, six annotated kernels including a simplified flash attention.
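
To preview the core idea behind the Week 6–7 material, below is a minimal sketch of the first stage of a shared-memory tiled GEMM: each block stages a tile of A and B into shared memory so every global load is reused TILE times. It assumes row-major matrices with dimensions that are multiples of the tile size; the kernel name and tile size are illustrative and not taken from the deep dive, which develops this through six stages up to tensor cores.

```cuda
// Sketch: stage one of a tiled SGEMM (shared-memory reuse only).
// Assumes M, N, K are multiples of TILE; names are illustrative.
#include <cuda_runtime.h>

constexpr int TILE = 32;

// C[M,N] = A[M,K] * B[K,N], all row-major.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Each thread stages one element of the A and B tiles; every
        // staged element is then reused TILE times from shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * K + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles while still in use
    }
    C[row * N + col] = acc;
}
```

Launched with `dim3 block(TILE, TILE); dim3 grid(N / TILE, M / TILE);`, this one change already cuts global traffic by a factor of TILE relative to a naive kernel; the later stages of the deep dive add register tiling, vectorized loads, and `mma.sync`.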

