# Month 2 - GPU Programming: Architecture, CUDA, Memory, Triton
Goal: by the end of week 8 you can (a) describe the GPU's hierarchical execution model from grid down to warp lane, (b) write CUDA kernels that achieve >70% of peak bandwidth on memory-bound problems and >70% of peak compute on compute-bound ones, (c) use shared memory and tensor cores correctly, and (d) write equivalent kernels in Triton within 2× of the CUDA performance in roughly 5× less code.
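To make target (b) concrete, here is a minimal sketch of a memory-bound kernel in the SAXPY mold; the kernel name, problem size, and launch configuration are illustrative choices, not taken from the deep-dives. Each thread handles one element, so the 32 lanes of a warp touch 32 consecutive floats and every load and store is coalesced.

```cuda
#include <cuda_runtime.h>

// Memory-bound kernel: y = a*x + y. Thread i touches element i, so a warp's
// lanes read/write consecutive addresses -- fully coalesced accesses.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // 2 loads + 1 store per element
}

int main() {
    const int n = 1 << 24;  // 16M elements (illustrative size)
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    // Effective bandwidth = 3 * n * sizeof(float) / elapsed_seconds;
    // compare against the device's peak to check the >70% target.
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Wrapping the launch in `cudaEvent_t` records gives the elapsed time for the bandwidth calculation in the final comment.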
Deep-dive companions (read in tandem):

- Week 5 → `DEEP_DIVES/01_GPU_ARCHITECTURE.md` - full SM/memory/tensor-core derivation, occupancy theory, NVLink topology.
- Week 6–7 → `DEEP_DIVES/02_CUDA_PROGRAMMING.md` - six-stage tiled GEMM with code, `mma.sync` PTX, complete buildable BF16 GEMM at 60–70% of cuBLAS.
- Week 8 → `DEEP_DIVES/03_TRITON.md` - block-level model, autotune, six annotated kernels including a simplified flash attention.
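The six-stage GEMM progression itself lives in the deep-dive; as a preview of the first shared-memory tiling stage, here is a minimal FP32 sketch (the deep-dive's buildable version uses BF16 and `mma.sync`; the tile size, names, and host driver below are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

#define TILE 32  // 32x32 tiles -> 1024 threads/block, 8 KB shared memory

// Stage-one tiling: each block stages TILE x TILE sub-matrices of A and B
// into shared memory, so every global element is read once per tile rather
// than once per multiply-accumulate.
__global__ void sgemm_tiled(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // output column
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperative, coalesced loads (consecutive threadIdx.x -> consecutive
        // addresses); zero-pad at the edges so the inner loop needs no guards.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

int main() {
    int M = 512, N = 512, K = 512;  // illustrative sizes
    float *A, *B, *C;
    cudaMalloc(&A, M * K * sizeof(float));
    cudaMalloc(&B, K * N * sizeof(float));
    cudaMalloc(&C, M * N * sizeof(float));

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_tiled<<<grid, block>>>(M, N, K, A, B, C);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Tiling cuts global traffic by a factor of TILE; the deep-dive's remaining stages layer register blocking, double buffering, and tensor-core `mma.sync` on top of this skeleton.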
## Weeks
- Week 5 - GPU Hardware Architecture
- Week 6 - Your First CUDA Kernels
- Week 7 - Memory Optimization: Coalescing, Shared Memory, Tensor Cores
- Week 8 - Triton: GPU Kernels From Python