Week 6 - Your First CUDA Kernels
6.1 Conceptual Core
- A CUDA program has host code (runs on the CPU, in C++) and device code (runs on the GPU, in C++ with CUDA extensions). They share a binary but compile via different toolchains (`nvcc` orchestrates).
- A kernel is a `__global__` function called from the host with a launch configuration: `kernel<<<gridDim, blockDim, sharedBytes, stream>>>(args)`.
- Within a kernel, each thread reads `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` to compute its own work.
6.2 Mechanical Detail
- Vector add is the canonical first kernel:
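A minimal sketch of that first kernel (names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; the bounds check guards the tail
// when n is not a multiple of the block size.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host-side launch (error checks omitted here for brevity):
//   int n = 1 << 20;
//   int threads = 256;
//   int blocks = (n + threads - 1) / threads;  // round up to cover all elements
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```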
- Memory transfer: `cudaMalloc`, `cudaMemcpy`, `cudaFree` - the basics.
- Pinned memory (`cudaMallocHost`): host memory that can DMA directly. Faster H2D/D2H copies.
- Unified memory (`cudaMallocManaged`): a single pointer accessible from both host and device; the runtime migrates pages on demand. Easy but unpredictable; avoid for performance code.
- Async (`cudaMemcpyAsync` + streams): overlap copy with compute. Essential.
- Streams: queues of kernels and copies. Operations within a stream run sequentially; operations in different streams may run concurrently. The default stream is special (it synchronizes with all others); use explicit non-default streams in production.
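A sketch of the copy/compute overlap pattern, assuming `h_in` is pinned host memory and `myKernel` is a hypothetical kernel; chunks alternate between two streams so one chunk's H2D copy overlaps another chunk's compute:

```cuda
#include <cuda_runtime.h>

// Illustrative only: processes `n` floats in `chunks` pieces across 2 streams.
// Assumes n divides evenly and h_in was allocated with cudaMallocHost.
void processOverlapped(const float* h_in, float* d_buf, int n, int chunks) {
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    int chunk = n / chunks;
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];          // alternate streams
        float* dst = d_buf + c * chunk;
        cudaMemcpyAsync(dst, h_in + c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        // Enqueue compute on the SAME stream so it waits for its own copy:
        // myKernel<<<grid, block, 0, st>>>(dst, chunk);
    }
    cudaDeviceSynchronize();                 // wait for all streams
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}
```

The key design point: ordering is enforced per stream, so each chunk's kernel naturally follows its copy, while the other stream's work fills the gap.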
- Error handling: every CUDA call returns an error code. Wrap calls with a macro (`CUDA_CHECK(...)`) that aborts on error. This is the most-skipped step in beginner code, and the source of every "silent corrupt output" bug.
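One common way to write such a macro (a sketch, not the only idiom):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Evaluate a CUDA runtime call; on failure, print the error string with
// file/line context and abort so bugs surface immediately.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
//   kernel<<<grid, block>>>(/* args */);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches async execution errors
```

Note that kernel launches themselves return nothing, which is why the `cudaGetLastError()` / `cudaDeviceSynchronize()` pair matters.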
6.3 Lab - "Kernel Speedrun"
Write three kernels in CUDA C++:
1. Vector add: SAXPY (`y = a*x + y`). Time it against cuBLAS `axpy`.
2. Reduction: sum a million floats. Compare your naive version (every thread adds to one global atomic) with a hierarchical version (block-level reduction in shared memory, then global). Expect a ~100× difference.
3. Naive matmul: 1024×1024 BF16. Compare to cuBLAS - expect to be 50-100× slower. Don't get discouraged; you'll close most of the gap in week 7.
For each: measure runtime with `cudaEvent_t` timing; compute achieved throughput; mark the result on the roofline.
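The event-timing pattern looks like this (the kernel launch is a placeholder for whichever of the three you are measuring):

```cuda
#include <cuda_runtime.h>

// Returns elapsed GPU time in milliseconds for the work enqueued
// between the two event records (default stream here).
float timeKernelMs() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // kernel<<<grid, block>>>(/* args */);  // the launch being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until `stop` has actually occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Achieved bandwidth for the roofline: bytes_moved / (ms * 1e-3) bytes per
// second; for SAXPY that's 3 floats (read x, read y, write y) per element.
```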
6.4 Idiomatic & Diagnostic Drill
- Run each kernel under `ncu --set full`. Read the "GPU Speed Of Light" section: it reports the percentage of peak compute and memory bandwidth you achieved. Memorize the report layout.
6.5 Production Slice
- Wrap CUDA error checking and timing in a small C++ header you'll reuse all month. This is your `cuda_utils.cuh`.