Week 6 - Your First CUDA Kernels

6.1 Conceptual Core

  • A CUDA program has host code (runs on CPU, in C++) and device code (runs on GPU, in C++ with CUDA extensions). They share a binary, but compile via different toolchains (nvcc orchestrates).
  • A kernel is a __global__ function called from the host with a launch configuration: kernel<<<gridDim, blockDim, sharedBytes, stream>>>(args).
  • Within a kernel, each thread sees threadIdx, blockIdx, blockDim, gridDim to compute its own work.

6.2 Mechanical Detail

  • Vector add is the canonical first kernel:
    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }
    // launch: vadd<<<(n+255)/256, 256>>>(a, b, c, n);
    
  • Memory transfer:
      • cudaMalloc, cudaMemcpy, cudaFree: the basics.
      • Pinned memory (cudaMallocHost): host memory the GPU can DMA directly, giving faster H2D/D2H copies.
      • Unified memory (cudaMallocManaged): a single pointer accessible from both sides; the runtime migrates pages on demand. Easy but unpredictable; avoid for performance code.
      • Async (cudaMemcpyAsync + streams): overlap copy with compute. Essential.
  • Streams: queues of kernels and copies. Operations within a stream are sequential; across streams, concurrent. Default stream is special (synchronizes with all others); use explicit non-default streams in production.
  • Error handling: every CUDA runtime call returns an error code. Wrap every call with a macro (CUDA_CHECK(...)) that aborts on error; note that kernel launches report errors asynchronously, so also check cudaGetLastError() after a launch. This is the most-skipped step in beginner code and the source of every "silent corrupt output" bug.
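The pinned-memory, async-copy, and stream bullets above combine into one standard pattern: chunk the work and alternate between two streams so the copy of one chunk overlaps the kernel on another. A minimal sketch (the chunk size and function name are illustrative; error checks are elided for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void vadd(const float* a, const float* b, float* c, int n);  // from 6.2

// Process a large array in chunks, alternating between two streams so the
// H2D copy of chunk k+1 overlaps with the kernel running on chunk k.
// Assumes ha/hb/hc were allocated with cudaMallocHost (pinned): async copies
// from pageable memory silently fall back to synchronous staging.
void vadd_streamed(const float* ha, const float* hb, float* hc, int n) {
    const int CHUNK = 1 << 20;
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    for (int off = 0, i = 0; off < n; off += CHUNK, ++i) {
        int m = (n - off < CHUNK) ? n - off : CHUNK;
        cudaStream_t st = s[i % 2];  // operations below stay ordered within st
        cudaMemcpyAsync(da + off, ha + off, m * sizeof(float), cudaMemcpyHostToDevice, st);
        cudaMemcpyAsync(db + off, hb + off, m * sizeof(float), cudaMemcpyHostToDevice, st);
        vadd<<<(m + 255) / 256, 256, 0, st>>>(da + off, db + off, dc + off, m);
        cudaMemcpyAsync(hc + off, dc + off, m * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();  // wait for both streams before touching hc
    cudaFree(da); cudaFree(db); cudaFree(dc);
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
}
```

Note the explicit non-default streams: issuing this through the default stream would serialize everything and defeat the overlap.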

6.3 Lab: "Kernel Speedrun"

Write three kernels in CUDA C++:

  1. SAXPY (y = a*x + y), vector add with a scale factor. Time it against cuBLAS axpy.
  2. Reduction: sum a million floats. Compare your naive version (one global atomic per thread) with a hierarchical version (block-level reduction in shared memory, then one atomic per block). Expect a ~100× difference.
  3. Naive matmul: 1024×1024 BF16. Compare to cuBLAS; expect to be 50-100× slower. Don't get discouraged; you'll close most of the gap in week 7.
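One possible shape for the hierarchical reduction (a sketch, not the only valid layout; the 256-thread block size is an assumption that must match the launch configuration):

```cuda
#include <cuda_runtime.h>

// Each block reduces its slice in shared memory via a tree reduction,
// then contributes one atomicAdd per block instead of one per element.
__global__ void reduce_sum(const float* x, float* out, int n) {
    __shared__ float buf[256];                      // one slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // halve active threads each step
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, buf[0]);           // one atomic per block
}
```

The ~100× gap versus the naive version comes almost entirely from this reduction in atomic traffic: contention on a single global address serializes the whole GPU.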

For each: measure runtime with cudaEvent_t timing; compute achieved throughput; mark on the roofline.
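A cudaEvent_t timing harness can look like this (a sketch; the wrapper name is illustrative). Events are recorded into the GPU's command stream, so the elapsed time measures device work rather than host launch overhead:

```cuda
#include <cuda_runtime.h>

// Times whatever GPU work `launch` enqueues; returns milliseconds.
float time_gpu_ms(void (*launch)()) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();                     // enqueue the kernel(s) under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block until `stop` has actually executed
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
    // Achieved bandwidth for vadd: 3 * n * sizeof(float) / (ms * 1e-3) bytes/s
    // (two reads + one write per element); plot that against the roofline.
}
```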

6.4 Idiomatic & Diagnostic Drill

  • Run each kernel under ncu --set full. Read the "GPU speed of light" section: it tells you the % of peak compute and BW you achieved. Memorize the report layout.
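For example (binary and report names are illustrative):

```shell
# Collect the full metric set and save a report you can open in the ncu UI.
ncu --set full -o saxpy_report ./saxpy
```

Collecting the full set replays the kernel many times, so expect it to be much slower than an untimed run.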

6.5 Production Slice

  • Wrap CUDA error checking and timing in a small C++ header you'll reuse all month. This is your cuda_utils.cuh.
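A minimal starting point for that header (names are suggestions; extend it with your timing helper as you go):

```cuda
// cuda_utils.cuh -- shared error-checking helpers.
#pragma once
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line context on any CUDA runtime error.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::abort();                                                  \
        }                                                                  \
    } while (0)

// Kernel launches don't return an error directly; call this after a launch
// to surface invalid-configuration and similar asynchronous errors.
#define CUDA_CHECK_LAST() CUDA_CHECK(cudaGetLastError())
```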
