Week 8 - Triton: GPU Kernels From Python

8.1 Conceptual Core

  • Triton (Tillet et al., 2019; developed at OpenAI) is a Python-embedded DSL that compiles to PTX. It targets the same execution model as CUDA but at the block (not thread) level of abstraction: you write the work for a whole block of threads, and the compiler handles intra-block parallelism.
  • The promise: 80% of CUTLASS performance with 20% of the code. The reality (as of 2026): it often delivers for memory-bound kernels (attention, layer norm), but can leave 2-3× on the table for compute-bound GEMMs versus hand-tuned CUTLASS.
  • Triton is the dominant kernel-authoring DSL in the open-source LLM ecosystem: FlashAttention's reference implementation, vLLM's custom kernels, Liger Kernel, and Unsloth are all written in Triton.

8.2 Mechanical Detail

  • A Triton kernel is a @triton.jit-decorated Python function. Inside it, you operate on blocks (vectors / tiles of values) rather than individual scalars:
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)                       # which block this program instance handles
        offsets = pid * BLOCK + tl.arange(0, BLOCK)  # per-lane element indices
        mask = offsets < n                           # guard the ragged final block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)
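
  • Launching it from Python is one line of grid arithmetic. A minimal sketch; the wrapper name add and BLOCK=1024 are illustrative choices, not fixed by Triton:
    import torch
    import triton

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        # One program instance per BLOCK-sized chunk of the output.
        grid = lambda meta: (triton.cdiv(n, meta['BLOCK']),)
        add_kernel[grid](x, y, out, n, BLOCK=1024)
        return out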
    
  • Memory ops: tl.load(ptr, mask), tl.store(ptr, val, mask).
  • Math ops: elementwise ops, tl.dot (uses tensor cores!), and reductions.
  • Autotuning: @triton.autotune(configs=[...]) sweeps block sizes / num_warps / num_stages and picks the fastest at runtime (see the sketch after this list).
  • The compiler handles: vectorization, software pipelining, register allocation, cp.async insertion, tensor-core mapping.
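
  • A minimal autotune decoration for the add kernel above. The two configs shown are plausible starting points, not tuned values:
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({'BLOCK': 256}, num_warps=4),
            triton.Config({'BLOCK': 1024}, num_warps=8),
        ],
        key=['n'],  # re-tune whenever the problem size n changes
    )
    @triton.jit
    def add_kernel_tuned(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)
    Note that the caller now omits BLOCK entirely: add_kernel_tuned[grid](x, y, out, n). The autotuner supplies it from the winning config.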

8.3 Lab: "Three Triton Kernels"

  1. Elementwise add (the Hello World).
  2. Softmax with online maximum subtraction for numerical stability (see the sketch after this list). Compare its performance to torch.softmax.
  3. Naive matmul in Triton with autotuning. Compare to cuBLAS; you should reach 70-90% of peak for square BF16 matmuls on common shapes.
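
  • For lab 2, a minimal single-block sketch: it assumes each row fits in one block, so the "online" running-max update collapses to a single tl.max reduction. BLOCK_N must be a power of two, and the wrapper is an illustrative harness:
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def softmax_kernel(x_ptr, out_ptr, n_cols, stride, BLOCK_N: tl.constexpr):
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_N)
        mask = cols < n_cols
        # Out-of-range lanes get -inf so they vanish after exp().
        x = tl.load(x_ptr + row * stride + cols, mask=mask, other=-float('inf'))
        x = x - tl.max(x, axis=0)       # subtract the row max for stability
        num = tl.exp(x)
        tl.store(out_ptr + row * stride + cols, num / tl.sum(num, axis=0), mask=mask)

    def softmax(x: torch.Tensor) -> torch.Tensor:
        n_rows, n_cols = x.shape
        out = torch.empty_like(x)
        BLOCK_N = triton.next_power_of_2(n_cols)  # one block covers the whole row
        softmax_kernel[(n_rows,)](x, out, n_cols, x.stride(0), BLOCK_N=BLOCK_N)
        return out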

8.4 Idiomatic & Diagnostic Drill

  • Set TRITON_PRINT_AUTOTUNING=1 to see autotune traces, and use triton.compiler.compile(...).asm['ptx'] to inspect the generated PTX. The PTX-level view becomes useful in week 16+, when you debug kernel selection.
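
  • If you would rather flip the switch from Python than the shell, a two-line sketch; it must run before the first autotuned launch:
    import os
    os.environ['TRITON_PRINT_AUTOTUNING'] = '1'  # set before any @triton.autotune kernel runs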

8.5 Production Slice

  • Build a small library of "kernels you'll need later": fused softmax, layer norm, RMSNorm, fused dropout. Each ≤100 lines, autotuned, and benchmarked against its PyTorch reference. A starting sketch for one of them follows.
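
  • An RMSNorm forward sketch, one program per row, launched as rmsnorm_kernel[(n_rows,)](...). The signature and eps handling here follow the usual PyTorch convention but are assumptions; check against your reference implementation:
    import triton
    import triton.language as tl

    @triton.jit
    def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, stride, eps, BLOCK_N: tl.constexpr):
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_N)
        mask = cols < n_cols
        x = tl.load(x_ptr + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
        # rms = sqrt(mean(x^2) + eps); accumulate in fp32 for accuracy
        rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
        w = tl.load(w_ptr + cols, mask=mask)
        tl.store(out_ptr + row * stride + cols, (x / rms) * w, mask=mask)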

Month 2 Capstone Deliverable

A gpu-programming/ directory:

  1. hardware-survey/ - your hardware tier.
  2. cuda-kernels/ - vector add, reduction, naive matmul, optimized matmul.
  3. triton-kernels/ - three kernels with autotune, benchmark plots vs the PyTorch baseline.
  4. KERNEL_LOG.md - documents each optimization step's ncu deltas.

This is the artifact that will impress GPU-engineer interviewers.


Month 2 Reading List

  • Programming Massively Parallel Processors (Hwu, Kirk, and Hajj), chapters 1–6.
  • The CUDA C++ Programming Guide, sections 1–5.
  • The Triton paper (Tillet et al., MAPL 2019).
  • NVIDIA's "GPU Performance Background" technical blog posts.
  • The CUTLASS README and the gemm_universal example.
