Week 6 - Your First CUDA Kernels
6.1 Conceptual Core
- A CUDA program has host code (runs on the CPU, in C++) and device code (runs on the GPU, in C++ with CUDA extensions). They share a binary but compile via different toolchains (`nvcc` orchestrates).
- A kernel is a `__global__` function called from the host with a launch configuration: `kernel<<<gridDim, blockDim, sharedBytes, stream>>>(args)`.
- Within a kernel, each thread reads `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` to compute its own work.
6.2 Mechanical Detail
- Vector add is the canonical first kernel:
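A minimal sketch of that first kernel (names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; the bounds check guards the tail
// when n is not a multiple of the block size.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host-side launch (error checks omitted here for brevity):
//   int n = 1 << 20;
//   int threads = 256;
//   int blocks = (n + threads - 1) / threads;  // round up to cover all elements
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```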
- Memory transfer: `cudaMalloc`, `cudaMemcpy`, `cudaFree` - the basics.
- Pinned memory (`cudaMallocHost`): host memory that can DMA directly. Faster H2D/D2H copies.
- Unified memory (`cudaMallocManaged`): a single pointer accessible from both host and device; the runtime migrates pages on demand. Easy but unpredictable; avoid for performance code.
- Async (`cudaMemcpyAsync` + streams): overlap copy with compute. Essential.
- Streams: queues of kernels and copies. Operations within a stream run sequentially; operations in different streams may run concurrently. The default stream is special (it synchronizes with all others); use explicit non-default streams in production.
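A sketch of the copy/compute overlap pattern, assuming `h_in` is pinned host memory and `myKernel` is a hypothetical kernel; chunks alternate between two streams so one chunk's H2D copy overlaps another chunk's compute:

```cuda
#include <cuda_runtime.h>

// Illustrative only: processes `n` floats in `chunks` pieces across 2 streams.
// Assumes n divides evenly and h_in was allocated with cudaMallocHost.
void processOverlapped(const float* h_in, float* d_buf, int n, int chunks) {
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    int chunk = n / chunks;
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];          // alternate streams
        float* dst = d_buf + c * chunk;
        cudaMemcpyAsync(dst, h_in + c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        // Enqueue compute on the SAME stream so it waits for its own copy:
        // myKernel<<<grid, block, 0, st>>>(dst, chunk);
    }
    cudaDeviceSynchronize();                 // wait for all streams
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}
```

The key design point: ordering is enforced per stream, so each chunk's kernel naturally follows its copy, while the other stream's work fills the gap.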
- Error handling: every CUDA call returns an error code. Wrap calls with a macro (`CUDA_CHECK(...)`) that aborts on error. This is the most-skipped step in beginner code, and the source of every "silent corrupt output" bug.
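One common way to write such a macro (a sketch, not the only idiom):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Evaluate a CUDA runtime call; on failure, print the error string with
// file/line context and abort so bugs surface immediately.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
//   kernel<<<grid, block>>>(/* args */);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches async execution errors
```

Note that kernel launches themselves return nothing, which is why the `cudaGetLastError()` / `cudaDeviceSynchronize()` pair matters.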
6.3 Lab - "Kernel Speedrun"
Write three kernels in CUDA C++:
1. Vector add: SAXPY (`y = a*x + y`). Time it against cuBLAS `axpy`.
2. Reduction: sum a million floats. Compare your naive version (every thread adds to one global atomic) with a hierarchical version (block-level reduction in shared memory, then global). Expect a ~100× difference.
3. Naive matmul: 1024×1024 BF16. Compare to cuBLAS - expect to be 50-100× slower. Don't get discouraged; you'll close most of the gap in week 7.
For each: measure runtime with `cudaEvent_t` timing; compute achieved throughput; mark the result on the roofline.
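The event-timing pattern looks like this (the kernel launch is a placeholder for whichever of the three you are measuring):

```cuda
#include <cuda_runtime.h>

// Returns elapsed GPU time in milliseconds for the work enqueued
// between the two event records (default stream here).
float timeKernelMs() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // kernel<<<grid, block>>>(/* args */);  // the launch being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until `stop` has actually occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Achieved bandwidth for the roofline: bytes_moved / (ms * 1e-3) bytes per
// second; for SAXPY that's 3 floats (read x, read y, write y) per element.
```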
6.4 Idiomatic & Diagnostic Drill
- Run each kernel under `ncu --set full`. Read the "GPU Speed Of Light" section: it reports the percentage of peak compute and memory bandwidth you achieved. Memorize the report layout.
6.5 Production Slice
- Wrap CUDA error checking and timing in a small C++ header you'll reuse all month. This is your `cuda_utils.cuh`.