Week 8 - Triton: GPU Kernels From Python
8.1 Conceptual Core
- Triton (Tillet et al., MAPL 2019; developed at OpenAI) is a Python-embedded DSL that compiles to PTX. It targets the same execution model as CUDA but at a block (not thread) abstraction: you write the work for a block of threads, and the compiler handles intra-block parallelism.
- The promise: 80% of CUTLASS performance with 20% of the code. The reality (as of 2026): often achieved for memory-bound kernels (attention, layer norm); for compute-bound GEMMs it can still leave 2-3× on the table vs hand-tuned CUTLASS.
- Triton is the dominant kernel-authoring DSL in the open-source LLM ecosystem (FlashAttention's Triton implementation, vLLM's custom kernels, Liger Kernel, Unsloth: all Triton).
8.2 Mechanical Detail
- A Triton kernel is a `@triton.jit`-decorated Python function. Inside, you operate on blocks (vectors / tiles of values) rather than individual scalars (see the sketch after this list):
  - Memory ops: `tl.load(ptr, mask)`, `tl.store(ptr, val, mask)`.
  - Math ops: elementwise, `tl.dot` (which uses tensor cores!), reductions.
- Autotuning: `@triton.autotune(configs=[...])` sweeps block sizes / `num_warps` / `num_stages` and picks the fastest at runtime (demonstrated in the matmul sketch in 8.3).
- The compiler handles: vectorization, software pipelining, register allocation, `cp.async` insertion, tensor-core mapping.
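A minimal sketch tying these pieces together: the canonical masked vector add, closely following the official Triton tutorial. The `BLOCK_SIZE=1024` choice is arbitrary; it is exactly the kind of knob autotuning sweeps.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                    # 1-D launch grid over blocks
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```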
8.3 Lab: "Three Triton Kernels"
- Elementwise add (the Hello World).
- Softmax with online maximum subtraction (numerical stability). Compare to `torch.softmax` perf; a single-pass version is sketched below.
- Naive matmul in Triton with autotuning. Compare to cuBLAS: you should reach 70-90% of peak for square BF16 matmuls on common shapes. An autotuned sketch follows the softmax one.
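A minimal sketch of the stable softmax, assuming a contiguous 2-D input whose rows each fit in a single block; the streaming variant that never materializes a full row is the harder follow-on exercise.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                            # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # -inf padding keeps the max and sum correct for the masked tail
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=-float("inf")).to(tl.float32)
    x = x - tl.max(x, axis=0)                         # max subtraction: exp() never overflows
    num = tl.exp(x)
    tl.store(out_ptr + row * n_cols + cols, num / tl.sum(num, axis=0), mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)       # whole row lives in one block
    softmax_kernel[(n_rows,)](x, out, n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out
```

Benchmark against `torch.softmax(x, dim=-1)` across a range of row lengths, not just one shape.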
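And a naive autotuned matmul sketch. The two `triton.Config` entries are placeholder choices (a real sweep would list many more), and it accumulates and stores in fp32 for simplicity; cast `acc` to `tl.bfloat16` before the store to match a BF16 output buffer.

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[  # placeholder sweep; add more block shapes for real tuning
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=4, num_stages=3),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=8, num_stages=4),
    ],
    key=["M", "N", "K"],  # re-tune whenever the problem shape changes
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m, pid_n = tl.program_id(0), tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)      # this program's output tile
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):                    # march along the K dimension
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)                           # tl.dot maps onto tensor cores
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    # grid depends on the winning config, so it is a function of the meta-params
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]), triton.cdiv(N, meta["BLOCK_N"]))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1))
    return c
```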
8.4 Idiomatic & Diagnostic Drill
- `TRITON_PRINT_AUTOTUNING=1` to see autotune traces.
- `triton.compiler.compile(...).asm['ptx']` to inspect generated PTX. The PTX-level view becomes useful in week 16+ when you debug kernel choice. Both diagnostics are combined in the sketch below.
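A sketch combining both diagnostics, reusing `add_kernel` from the 8.2 sketch. Two caveats: the environment variable must be set before the first launch, and the launch-returns-a-handle route to `.asm` has moved between Triton releases, so treat that attribute access as version-dependent and fall back to `triton.compiler.compile(...)` if it is missing.

```python
import os
os.environ["TRITON_PRINT_AUTOTUNING"] = "1"   # set before any autotuned kernel runs

import torch
import triton
# assumes add_kernel from the 8.2 sketch is defined in this file

n = 1 << 20
x = torch.randn(n, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)

# In recent releases, launching a @triton.jit function returns a compiled-kernel
# handle whose .asm dict maps stage names ("ttir", "ptx", ...) to text.
handle = add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
print(handle.asm["ptx"][:1000])               # first 1 KB of generated PTX
```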
8.5 Production Slice
- Build a small library of "kernels you'll need later": fused softmax, layer norm, RMSNorm, fused dropout. Each ≤100 lines, autotuned, benchmarked vs a PyTorch reference. An RMSNorm sketch follows.
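For instance, a sketch of the RMSNorm entry, using the same row-per-program pattern as the softmax above. The `rmsnorm` wrapper and its `eps` default are illustrative, not a fixed API, and rows are assumed contiguous and block-resident.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                            # one program per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)   # root mean square of the row
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    # store casts back to the output tensor's dtype
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out
```

Benchmark against the reference `x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight`.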
Month 2 Capstone Deliverable
A `gpu-programming/` directory:
1. `hardware-survey/` - your hardware tier.
2. `cuda-kernels/` - vector add, reduction, naive matmul, optimized matmul.
3. `triton-kernels/` - three kernels with autotune, benchmark plots vs PyTorch baseline.
4. A `KERNEL_LOG.md` documenting each optimization step's `ncu` deltas.
This is the artifact that will impress GPU-engineer interviewers.
Recommended Reading Done This Month
- Programming Massively Parallel Processors, Hwu/Kirk/Hajj, chapters 1–6.
- The CUDA C++ Programming Guide, sections 1–5.
- The Triton paper (MAPL 2019).
- NVIDIA's "GPU Performance Background" technical blog posts.
- The CUTLASS README and the `gemm_universal` example.