Week 8 - Triton: GPU Kernels From Python

8.1 Conceptual Core

  • Triton (Tillet et al., 2019; developed at OpenAI) is a Python-embedded DSL that compiles to PTX. It targets the same execution model as CUDA but at the block (not thread) level of abstraction: you write the work for a whole block of threads, and the compiler handles intra-block parallelism.
  • The promise: 80% of CUTLASS performance with 20% of the code. The reality (as of 2026): it often delivers for memory-bound kernels (attention, layer norm), but can leave 2-3× on the table for compute-bound GEMMs versus hand-tuned CUTLASS.
  • Triton is the dominant kernel-authoring DSL in the open-source LLM ecosystem: FlashAttention's reference implementation, vLLM's custom kernels, Liger Kernel, and Unsloth are all written in Triton.

8.2 Mechanical Detail

  • A Triton kernel is a @triton.jit-decorated Python function. Inside it, you operate on blocks (vectors / tiles of values) rather than individual scalars:
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)                       # which block this program instance handles
        offsets = pid * BLOCK + tl.arange(0, BLOCK)  # per-lane element indices
        mask = offsets < n                           # guard the ragged final block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)
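
  • Launching it from Python is one line of grid arithmetic. A minimal sketch; the wrapper name add and BLOCK=1024 are illustrative choices, not fixed by Triton:
    import torch
    import triton

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        # One program instance per BLOCK-sized chunk of the output.
        grid = lambda meta: (triton.cdiv(n, meta['BLOCK']),)
        add_kernel[grid](x, y, out, n, BLOCK=1024)
        return out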
    
  • Memory ops: tl.load(ptr, mask), tl.store(ptr, val, mask).
  • Math ops: elementwise ops, tl.dot (uses tensor cores!), and reductions.
  • Autotuning: @triton.autotune(configs=[...]) sweeps block sizes / num_warps / num_stages and picks the fastest at runtime (see the sketch after this list).
  • The compiler handles: vectorization, software pipelining, register allocation, cp.async insertion, tensor-core mapping.
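
  • A minimal autotune decoration for the add kernel above. The two configs shown are plausible starting points, not tuned values:
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({'BLOCK': 256}, num_warps=4),
            triton.Config({'BLOCK': 1024}, num_warps=8),
        ],
        key=['n'],  # re-tune whenever the problem size n changes
    )
    @triton.jit
    def add_kernel_tuned(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)
    Note that the caller now omits BLOCK entirely: add_kernel_tuned[grid](x, y, out, n). The autotuner supplies it from the winning config.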

8.3 Lab: "Three Triton Kernels"

  1. Elementwise add (the Hello World).
  2. Softmax with online maximum subtraction for numerical stability (see the sketch after this list). Compare its performance to torch.softmax.
  3. Naive matmul in Triton with autotuning. Compare to cuBLAS; you should reach 70-90% of peak for square BF16 matmuls on common shapes.
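
  • For lab 2, a minimal single-block sketch: it assumes each row fits in one block, so the "online" running-max update collapses to a single tl.max reduction. BLOCK_N must be a power of two, and the wrapper is an illustrative harness:
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def softmax_kernel(x_ptr, out_ptr, n_cols, stride, BLOCK_N: tl.constexpr):
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_N)
        mask = cols < n_cols
        # Out-of-range lanes get -inf so they vanish after exp().
        x = tl.load(x_ptr + row * stride + cols, mask=mask, other=-float('inf'))
        x = x - tl.max(x, axis=0)       # subtract the row max for stability
        num = tl.exp(x)
        tl.store(out_ptr + row * stride + cols, num / tl.sum(num, axis=0), mask=mask)

    def softmax(x: torch.Tensor) -> torch.Tensor:
        n_rows, n_cols = x.shape
        out = torch.empty_like(x)
        BLOCK_N = triton.next_power_of_2(n_cols)  # one block covers the whole row
        softmax_kernel[(n_rows,)](x, out, n_cols, x.stride(0), BLOCK_N=BLOCK_N)
        return out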

8.4 Idiomatic & Diagnostic Drill

  • Set TRITON_PRINT_AUTOTUNING=1 to see autotune traces, and use triton.compiler.compile(...).asm['ptx'] to inspect the generated PTX. The PTX-level view becomes useful in week 16+, when you debug kernel selection.
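
  • If you would rather flip the switch from Python than the shell, a two-line sketch; it must run before the first autotuned launch:
    import os
    os.environ['TRITON_PRINT_AUTOTUNING'] = '1'  # set before any @triton.autotune kernel runs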

8.5 Production Slice

  • Build a small library of "kernels you'll need later": fused softmax, layer norm, RMSNorm, fused dropout. Each ≤100 lines, autotuned, and benchmarked against its PyTorch reference. A starting sketch for one of them follows.
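
  • An RMSNorm forward sketch, one program per row, launched as rmsnorm_kernel[(n_rows,)](...). The signature and eps handling here follow the usual PyTorch convention but are assumptions; check against your reference implementation:
    import triton
    import triton.language as tl

    @triton.jit
    def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, stride, eps, BLOCK_N: tl.constexpr):
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_N)
        mask = cols < n_cols
        x = tl.load(x_ptr + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
        # rms = sqrt(mean(x^2) + eps); accumulate in fp32 for accuracy
        rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
        w = tl.load(w_ptr + cols, mask=mask)
        tl.store(out_ptr + row * stride + cols, (x / rms) * w, mask=mask)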

Month 2 Capstone Deliverable

A gpu-programming/ directory:

  1. hardware-survey/ - your hardware tier.
  2. cuda-kernels/ - vector add, reduction, naive matmul, optimized matmul.
  3. triton-kernels/ - three kernels with autotune, benchmark plots vs the PyTorch baseline.
  4. KERNEL_LOG.md - documents each optimization step's ncu deltas.

This is the artifact that will impress GPU-engineer interviewers.


Month 2 Reading List

  • Programming Massively Parallel Processors (Hwu, Kirk, and Hajj), chapters 1–6.
  • The CUDA C++ Programming Guide, sections 1–5.
  • The Triton paper (Tillet et al., MAPL 2019).
  • NVIDIA's "GPU Performance Background" technical blog posts.
  • The CUTLASS README and the gemm_universal example.
