
Deep Dive 02: CUDA Programming-From First Kernel to Optimized GEMM

A self-contained reference chapter for the AI Systems curriculum (Month 2). Read this end-to-end and you will be able to write, optimize, and profile CUDA kernels at a level sufficient to read CUTLASS, FlashAttention, and Triton-generated PTX with comprehension. No external dependency required for the core material.

Anti-fabrication note: numbers marked with ~ are realistic ballparks, not invented exact figures. Anything stronger than that is either a hard architectural fact (e.g., warp size = 32) or you should verify against deviceQuery / Nsight on your specific GPU.


Table of Contents

  1. The CUDA programming model
  2. Indexing, launch configuration, and the SM/warp execution model
  3. Memory transfer: host ↔ device, pinned, managed, zero-copy
  4. Streams, events, synchronization
  5. Error handling discipline
  6. Memory access patterns: coalescing, derived from first principles
  7. Shared memory and bank conflicts
  8. Building blocks: vector add, reduction, prefix sum, histogram
  9. Tiled GEMM walkthrough-naive → tensor-core → double-buffered
  10. Tensor cores via nvcuda::wmma
  11. mma.sync PTX inline assembly
  12. Cooperative groups and thread-block clusters
  13. Profiling discipline
  14. A complete, build-and-run-ready BF16 GEMM at 2048×2048
  15. Practical exercises with answer sketches

1. The CUDA Programming Model

1.1 Intuition

A CUDA program runs on two machines at once: a host (CPU) and a device (GPU). The host owns control flow and orchestrates the device the way a conductor cues a section of an orchestra: it allocates device memory, copies data in, launches a function (a kernel) that runs on tens of thousands of GPU threads in parallel, then copies results back. There is no shared address space by default; the two memories are physically distinct (host DRAM and device HBM/GDDR). Modern Unified Memory blurs this, but the mental model of "two machines" is still the right starting point.

A kernel is a function executed N times in parallel by N CUDA threads. The threads are organized into a grid of blocks, and each block is a 1D, 2D, or 3D array of threads. Threads within a block can cooperate (shared memory, barriers); threads in different blocks generally cannot-they may not even be co-resident on hardware at the same time.

1.2 Function-space qualifiers

Three qualifiers tell the compiler where a function runs and where it can be called from:

| Qualifier | Runs on | Callable from | Notes |
|---|---|---|---|
| __global__ | Device | Host (and device on CC ≥ 3.5 via dynamic parallelism) | This is a kernel. Must return void. |
| __device__ | Device | Device | Inlined aggressively. |
| __host__ | Host | Host | Default if unmarked. |
| __host__ __device__ | Both | Both | Same source compiled twice. |

__device__ float square(float x) { return x * x; }     // GPU helper
__global__ void apply(float* y, const float* x, int n) // kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = square(x[i]);
}
int main() { /* host code */ }                         // implicit __host__

1.3 Launch syntax

A kernel is launched with the chevron syntax:

kernel<<<gridDim, blockDim, dynamicSharedBytes, stream>>>(args...);
  • gridDim: a dim3 giving the number of blocks along x, y, z.
  • blockDim: a dim3 giving the number of threads per block along x, y, z.
  • dynamicSharedBytes: bytes of extern __shared__ memory to allocate per block. Defaults to 0.
  • stream: a cudaStream_t to enqueue into. Defaults to the default stream (stream 0), which has special legacy semantics described in §4.

Total threads launched = gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z. Hard limits (verify with deviceQuery for your GPU): blockDim.x*y*z ≤ 1024, gridDim.x ≤ 2^31-1.
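Rather than memorizing the limits, query them at runtime. A minimal sketch using the standard cudaDeviceProp fields (the printed values are typical, not guaranteed; check your own GPU):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("maxThreadsPerBlock : %d\n", p.maxThreadsPerBlock);      // typically 1024
    printf("maxGridSize        : %d %d %d\n",
           p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);   // typically 2^31-1, 65535, 65535
    printf("warpSize           : %d\n", p.warpSize);                // 32
    return 0;
}
```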

1.4 The SM/warp execution model

A GPU is a collection of Streaming Multiprocessors (SMs). On H100 there are 132 SMs and 4 warp schedulers per SM; on A100 there are 108 SMs and also 4 schedulers per SM. Each SM has a register file (~256 KB on Ampere/Hopper), an L1/shared scratchpad (~228 KB on H100, configurable), warp schedulers, INT/FP32/FP64 pipes, and tensor cores.

When you launch a kernel, the GPU's hardware work distributor assigns whole blocks to SMs. Each SM can hold multiple blocks resident simultaneously, up to limits dictated by registers per thread, shared memory per block, and hardware caps (max 2048 resident threads per SM on most architectures, ~32–64 resident warps depending on SM).

Once on an SM, a block is sliced into warps of 32 threads each. The warp is the unit of scheduling and instruction issue. All 32 threads in a warp execute the same instruction at the same time-this is SIMT (Single Instruction Multiple Threads). When threads diverge (different sides of a branch), the warp executes both paths serially, masking inactive lanes. From Volta onward, threads have independent program counters (Independent Thread Scheduling), so the model is more nuanced, but warp-level ops still operate on full warps with explicit masks.

The warp size of 32 is hard-baked into hardware, PTX, and tooling. Treat it as a constant of the universe.
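To make divergence concrete, here is a hypothetical kernel where a warp takes both sides of a branch. The serialization only happens when the condition differs *within* a warp:

```cuda
__global__ void divergent(float* y) {
    int lane = threadIdx.x & 31;
    // The warp executes BOTH sides serially: first the lane < 16 path with
    // lanes 16..31 masked off, then the other path with lanes 0..15 masked.
    if (lane < 16) y[threadIdx.x] = 1.0f;
    else           y[threadIdx.x] = 2.0f;
    // By contrast, a branch whose condition is uniform across the warp
    // (e.g., on blockIdx.x) incurs no divergence penalty.
}
```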

1.5 Why this matters

Three rules fall directly out of the model and you will rederive them constantly:

  1. Block size should be a multiple of 32. Any other choice wastes lanes in the trailing warp. 128 or 256 are good defaults.
  2. Memory accesses should be warp-coalesced (§6). The 32 threads of a warp issue one load instruction; if their addresses fall in a single 128-byte aligned segment, the memory subsystem services it in one transaction.
  3. Shared memory has 32 banks (§7). Patterns that map 32 threads of a warp to 32 distinct banks run at full speed; collisions serialize.

Micro-exercise 1.1

Why must a kernel return void? Answer sketch: the chevron launch is asynchronous-by the time the host instruction stream proceeds past the launch, the kernel has likely not run. There is no synchronous return value to capture. Outputs go through pointers to device memory.


2. Indexing & Launch Configuration

2.1 Built-in variables

Inside a kernel you have:

  • gridDim - dimensions of the grid (in blocks).
  • blockDim - dimensions of a block (in threads).
  • blockIdx - this block's coordinate within the grid.
  • threadIdx - this thread's coordinate within its block.
  • warpSize - always 32.

2.2 Computing a global thread ID

For a 1D grid of 1D blocks:

int gtid = blockIdx.x * blockDim.x + threadIdx.x;

For a 2D grid of 2D blocks (e.g., per-pixel image kernel):

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) { /* ... */ }

The bounds check matters because grid sizes are typically rounded up:

dim3 block(16, 16);                                              // 256 threads
dim3 grid((width + 15) / 16, (height + 15) / 16);                // ceil-div
saxpy2d<<<grid, block>>>(img, width, height);

2.3 Choosing block size-heuristics

  • Multiple of 32 (warp size). Otherwise lanes are wasted.
  • 128 or 256 is the sweet spot for memory-bound kernels.
  • 256–1024 is typical for compute-bound kernels-but bigger blocks reduce the number of resident blocks per SM, which reduces the scheduler's ability to hide latency by switching warps. Run the occupancy calculator or use cudaOccupancyMaxPotentialBlockSize to pick an optimal pair given per-thread register and shared-memory usage.
int minGrid, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &blockSize,
                                   (void*)myKernel, /*smem*/0, /*N*/0);
int grid = (n + blockSize - 1) / blockSize;
myKernel<<<grid, blockSize>>>(...);

2.4 Grid-stride loops

Decoupling problem size from grid size is the right idiom: launch a fixed grid (typically 2 * SMcount * blocksPerSM), let each thread loop:

__global__ void saxpy_grid_stride(int n, float a, const float* x, float* y) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}

This kernel handles any n without recompiling and tends to be faster on arrays whose size doesn't divide nicely into the grid.
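The "2 * SMcount * blocksPerSM" sizing mentioned above can be computed at runtime. A hedged sketch of a launcher for the saxpy_grid_stride kernel (blocksPerSM = 4 is an assumption to tune against occupancy, not a rule):

```cuda
#include <cuda_runtime.h>

// Hypothetical launcher: size the grid from the SM count, not from n.
void launch_saxpy(int n, float a, const float* x, float* y) {
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, /*device*/0);
    const int block = 256;
    const int blocksPerSM = 4;              // assumption; tune per kernel
    int grid = 2 * smCount * blocksPerSM;   // fixed grid; the stride loop covers any n
    saxpy_grid_stride<<<grid, block>>>(n, a, x, y);
}
```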

Micro-exercise 2.1

A kernel accesses a 1D array of 1,000,003 floats (a prime). What block and grid sizes do you launch? Answer sketch: block=256, grid=ceil(1000003/256)=3907. The last block covers indices 999936..1000191, so it has 1000003 - 999936 = 67 active threads; its third warp has only 3 active lanes. Add the if (i < n) guard.

nvcc invocation

nvcc -O3 -arch=sm_90 -lineinfo saxpy.cu -o saxpy   # Hopper (H100)
nvcc -O3 -arch=sm_80 saxpy.cu -o saxpy             # Ampere (A100)
-lineinfo keeps source-line mappings for profilers without disabling optimizations (unlike -G, which does).


3. Memory Transfer

3.1 The four ways to get bytes onto the GPU

| API | Allocator | Locality | Transfer cost | Use case |
|---|---|---|---|---|
| cudaMalloc + cudaMemcpy | Device DRAM | Device | PCIe / NVLink | Default for hot data. |
| cudaMallocHost | Pinned host RAM | Host (page-locked) | Faster cudaMemcpy | Staging buffer for async copies. |
| cudaMallocManaged (UVM) | Either | Migrates on access | Page-fault driven | Prototyping, pointer sharing. |
| Zero-copy (cudaHostAllocMapped) | Pinned host RAM | Device sees host directly | PCIe per access | Tiny, rare GPU reads of host data. |
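Zero-copy is the only row without an example elsewhere in this section. A minimal sketch (the kernel launch is a placeholder; every device access to d_view pays a PCIe round trip, which is why this is reserved for tiny, rarely-read data):

```cuda
#include <cuda_runtime.h>

void zero_copy_demo(size_t bytes) {
    float* h_buf  = nullptr;   // pinned, device-mapped host allocation
    float* d_view = nullptr;   // device-visible alias of the same pages
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_view, h_buf, /*flags*/0);
    // kernel<<<grid, block>>>(d_view, ...);   // reads host RAM over PCIe
    cudaFreeHost(h_buf);
}
```

On modern UVA systems h_buf and d_view are typically the same address, but the two-pointer form is portable.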

3.2 The classic pattern

const int N = 1 << 20;
size_t bytes = N * sizeof(float);

float *h_x = (float*)malloc(bytes);
float *d_x;  cudaMalloc(&d_x, bytes);

cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(d_x, N);
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

cudaFree(d_x);  free(h_x);

cudaMemcpy on the default stream is host-blocking: the call does not return until the copy has completed (for H2D from pageable memory it may return once the source has been staged into a driver buffer). This is the most common source of "my kernel is slow": the kernel is fine; the copy is dominating.

3.3 Pinned (page-locked) host memory

The OS may swap pageable memory; the GPU's DMA engine therefore must stage through a driver-pinned bounce buffer. Allocating already-pinned memory cuts that out:

float* h_x;
cudaMallocHost(&h_x, bytes);   // pinned
// ... fill h_x ...
cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);

Practical effect: PCIe Gen4 x16 has a theoretical peak of ~32 GB/s unidirectional. With pageable memory you typically see ~6–12 GB/s; with pinned you see ~22–28 GB/s. Pinning is mandatory for cudaMemcpyAsync to actually overlap with kernels-otherwise the driver silently falls back to sync.

Caveat: pinned memory is a scarce OS resource. Don't pin gigabytes unnecessarily.

3.4 Unified Memory (cudaMallocManaged)

float* x;
cudaMallocManaged(&x, bytes);
for (int i = 0; i < N; ++i) x[i] = i;        // host writes
kernel<<<grid, block>>>(x, N);                // device reads-pages migrate
cudaDeviceSynchronize();
printf("%f\n", x[0]);                          // host reads-pages migrate back
cudaFree(x);

The driver fault-handles page migrations on demand. On Pascal+ this is fully demand-paged at 4 KB granularity; on pre-Pascal the entire allocation migrated on touch. UVM is great for prototyping but page faults are expensive (~tens of μs each), so production code typically uses explicit cudaMemPrefetchAsync to push data ahead of time, or just falls back to explicit cudaMemcpy.
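A sketch of the prefetch pattern just described, turning demand faults into one bulk, overlappable transfer (the kernel launch is a placeholder):

```cuda
#include <cuda_runtime.h>

void prefetch_demo(float* x, size_t bytes, cudaStream_t stream) {
    int dev = 0;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(x, bytes, dev, stream);             // push pages to the GPU
    // kernel<<<grid, block, 0, stream>>>(x, ...);           // no page faults now
    cudaMemPrefetchAsync(x, bytes, cudaCpuDeviceId, stream); // pull back before CPU reads
}
```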

3.5 Performance ballparks (H100 + PCIe5; verify on your hardware)

  • HBM3 device-to-device (cudaMemcpy(D2D)): ~2.5–3 TB/s.
  • PCIe Gen5 x16 H↔D pinned: ~50–55 GB/s.
  • Pageable H↔D: ~half that, plus driver overhead.
  • Zero-copy reads from device: bound by PCIe per access, latency ~1 μs.

Micro-exercise 3.1

You repeatedly run a kernel on the same input. Where should the input live? Answer: in device memory (cudaMalloc), copied once. If the input is streamed from disk each iteration, use cudaMemcpyAsync from a pinned staging buffer overlapped with compute (§4).


4. Streams, Events, Synchronization

4.1 Streams as queues

A stream is a FIFO queue of GPU work. Operations enqueued into the same stream execute in order; operations in different streams may execute concurrently (subject to hardware resources).

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s1);
kernelA<<<g, b, 0, s1>>>(dA);

cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s2);
kernelB<<<g, b, 0, s2>>>(dB);
// kernelA and kernelB may run concurrently if the SMs have room

4.2 The default stream's special semantics

Stream 0 (the default / NULL / legacy stream) is implicitly synchronizing: a launch into the default stream waits for all prior work in every stream of the same device, and all subsequent stream work waits for it. This is hostile to concurrency.

Two ways out:

  1. Compile with --default-stream per-thread so each host thread gets its own non-blocking default stream.
  2. Always use explicit non-default streams created with cudaStreamCreate (or cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) to make a stream not synchronize against the default stream).

4.3 Events for timing and dependencies

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
kernel<<<g, b, 0, stream>>>(...);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);             // host waits for stop
float ms;
cudaEventElapsedTime(&ms, start, stop); // resolution ~0.5 μs

Events also express cross-stream dependencies:

cudaEventRecord(eA, sA);                // record after kernelA
cudaStreamWaitEvent(sB, eA, 0);         // sB cannot proceed until eA
kernelB<<<g, b, 0, sB>>>(...);          // depends on kernelA

4.4 Overlapping copy and compute

The classic three-stream pipeline for a chunked workload:

const int CHUNK = 1 << 20;
const int NSTREAMS = 3;
cudaStream_t s[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

for (int chunk = 0; chunk < nChunks; ++chunk) {
    int k = chunk % NSTREAMS;
    cudaMemcpyAsync(d_in + chunk*CHUNK, h_in + chunk*CHUNK,
                    CHUNK*sizeof(float), cudaMemcpyHostToDevice, s[k]);
    process<<<grid, block, 0, s[k]>>>(d_in + chunk*CHUNK,
                                       d_out + chunk*CHUNK, CHUNK);
    cudaMemcpyAsync(h_out + chunk*CHUNK, d_out + chunk*CHUNK,
                    CHUNK*sizeof(float), cudaMemcpyDeviceToHost, s[k]);
}
for (int i = 0; i < NSTREAMS; ++i) cudaStreamSynchronize(s[i]);

Why three streams: a GPU typically has two DMA engines (one each direction) plus the compute engine. Three streams let H2D, kernel, and D2H run concurrently across chunks. With pinned host buffers, total wall time collapses from T_h2d + T_compute + T_d2h toward max(T_h2d, T_compute, T_d2h).

Micro-exercise 4.1

You time a kernel as cudaEventRecord(start), kernel launch, cudaEventRecord(stop), cudaEventElapsedTime(&ms, start, stop), and get a suspiciously small number. What did you forget? Answer: cudaEventSynchronize(stop) before reading the elapsed time. Without it, the event may not have completed.


5. Error Handling Discipline

CUDA reports errors in two channels:

  1. The return code of every API call.
  2. A sticky per-thread error state for asynchronous failures, queryable via cudaGetLastError() (which clears it) or cudaPeekAtLastError() (which doesn't).

A kernel launch itself returns nothing; a bad launch configuration is reported immediately by the next cudaGetLastError / cudaPeekAtLastError, while an in-kernel failure (e.g., an out-of-bounds access) surfaces only at the next synchronization point. So:

#define CUDA_CHECK(call) do {                                                 \
    cudaError_t _e = (call);                                                  \
    if (_e != cudaSuccess) {                                                  \
        fprintf(stderr, "CUDA error %s:%d: %s\n",                             \
                __FILE__, __LINE__, cudaGetErrorString(_e));                  \
        std::abort();                                                         \
    }                                                                         \
} while (0)

#define CUDA_LAUNCH_CHECK() do {                                              \
    CUDA_CHECK(cudaPeekAtLastError());                                        \
    CUDA_CHECK(cudaDeviceSynchronize());                                      \
} while (0)

Use:

CUDA_CHECK(cudaMalloc(&d_x, bytes));
CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
mykernel<<<g, b>>>(d_x);
CUDA_LAUNCH_CHECK();    // catches both launch and runtime kernel errors

In release builds you may want to skip cudaDeviceSynchronize (it kills async pipelining); make CUDA_LAUNCH_CHECK peek-only and rely on compute-sanitizer during development:

compute-sanitizer --tool memcheck ./myprog
compute-sanitizer --tool racecheck ./myprog

Common pitfalls

  • Sticky errors: if any earlier call failed, every subsequent call returns the old error until you clear it with cudaGetLastError.
  • Asynchronous errors: a kernel writing OOB will not surface until you next sync. Always sync in tests.
  • Multi-GPU: CUDA errors are per-thread per-context. Setting the wrong device with cudaSetDevice produces silent wrongness, not an error.
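The two error channels can be seen side by side in a small demo (the exact error strings vary by driver, so the comments hedge them):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void oob(float* x) { x[1 << 28] = 1.0f; }   // deliberate OOB write

int main() {
    float* d = nullptr;
    cudaMalloc(&d, 4);

    oob<<<1, 2048>>>(d);    // invalid config: 2048 > 1024 threads per block
    printf("launch 1: %s\n", cudaGetErrorString(cudaGetLastError()));
    // reported immediately (an invalid-configuration error); the kernel never ran

    oob<<<1, 32>>>(d);      // launch is valid, kernel misbehaves
    printf("launch 2: %s\n", cudaGetErrorString(cudaGetLastError()));
    // typically reports success here...
    printf("sync:     %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    // ...the illegal access surfaces only at this sync point
    return 0;
}
```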

6. Memory Access Patterns: Coalescing

6.1 The coalescing rule, derived

Global memory is served by the L2 cache and HBM in 32-byte and 128-byte sectors. When a warp issues a load instruction, the hardware computes the union of the 32 thread addresses and issues the minimum number of sector fetches to cover them.

The ideal case: 32 threads of a warp load 32 consecutive 4-byte floats starting at a 128-byte aligned address. One 128-byte transaction services the entire warp. Achieved bandwidth = peak.

The worst case (for 4-byte loads): 32 threads load 32 floats that each fall in a different 128-byte segment. The hardware issues 32 separate 32-byte sector fetches, moving 1024 bytes to deliver 128 useful bytes: an 8× waste in data volume and 32× in transactions. Achieved bandwidth collapses to a small fraction of peak.

6.2 Concrete: coalesced vs strided

__global__ void coalesced(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * 2.0f;
}

__global__ void strided(const float* x, float* y, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i * stride;          // stride 32 → each thread in a different sector
    if (j < n) y[j] = x[j] * 2.0f;
}

For stride=1, achieved ~80–95% of HBM peak (verify; H100 HBM3 peak ~3 TB/s, so realistic ~2.4–2.8 TB/s). For stride=32, achieved ~3–10% of peak because each warp issues 32 sector loads where the coalesced version issues 1. Concretely, on a 1 GiB array:

  • Coalesced runtime ≈ 2 GiB moved (1 GiB read + 1 GiB write) / 2.5 TB/s ≈ 0.8 ms.
  • Strided runtime ≈ up to 8× that (each 32-byte sector carries only 4 useful bytes), so several ms. (Approximate-L2 hits dampen the worst case.)

6.3 Alignment

Device pointers from cudaMalloc are aligned to at least 256 bytes. But offset a pointer by, say, 1 element, and every warp load straddles a segment boundary, costing an extra sector fetch per warp. The measured penalty is usually modest (the L2 absorbs some of the overlap) but real. This is why CUTLASS aligns leading dimensions of GEMM tiles.

6.4 Vectorized loads

float4 (16 bytes) and int4 let a single thread issue one 16-byte load, so a warp moves 32 × 16 = 512 bytes in 4 transactions of 128 B. This is identical bandwidth to scalar coalesced loads but cuts the instruction count 4×, often improving compute-bound or latency-bound kernels.

__global__ void copy_vec(const float4* x, float4* y, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) y[i] = x[i];      // 16 B per thread, 512 B per warp
}

Caveat: n must be divisible by 4 and pointers 16-byte aligned (true for cudaMalloc).
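The divisibility caveat is commonly handled with a vector body plus a scalar tail. A sketch (assumes x and y are the cudaMalloc allocation bases, hence 16-byte aligned):

```cuda
__global__ void copy_vec_tail(const float* x, float* y, int n) {
    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;                                  // number of whole float4s
    if (i < n4)
        reinterpret_cast<float4*>(y)[i] =
            reinterpret_cast<const float4*>(x)[i];   // 16 B per thread
    int rem = n - 4 * n4;                            // 0..3 leftover scalars
    if (i < rem) y[4 * n4 + i] = x[4 * n4 + i];      // first threads mop up the tail
}
```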

Micro-exercise 6.1

A 2D row-major matrix A[M][N] is processed with threadIdx.x indexing columns. Is A[row][threadIdx.x] coalesced? Answer: yes-consecutive threads access consecutive columns, which are consecutive in memory in row-major. If threadIdx.x indexed rows instead (A[threadIdx.x][col]) that's a stride-N access and bad.


7. Shared Memory and Bank Conflicts

7.1 The bank model

Shared memory is divided into 32 banks. Each bank serves one 4-byte word per cycle. A warp accessing 32 distinct banks completes in one cycle. Two threads hitting the same bank with different addresses serialize — that's a 2-way bank conflict, and the warp issue takes 2 cycles. Up to 32-way is possible (worst case: every thread the same bank).

Bank index = (address / 4) mod 32. So:

__shared__ float s[32];
float v = s[threadIdx.x];        // thread t reads bank t: no conflict

__shared__ float s2[32 * 32];
float w = s2[threadIdx.x * 32];  // every thread hits bank 0: 32-way conflict

7.2 The transpose problem

A naive 32×32 shared-memory tile transpose:

__shared__ float tile[32][32];
tile[threadIdx.y][threadIdx.x] = in[...];   // store: coalesced, no conflict
__syncthreads();
out[...] = tile[threadIdx.x][threadIdx.y];  // load:  stride-32, 32-way conflict

The store has threadIdx.x varying along the inner dimension (banks 0..31, no conflict). The load reads tile[threadIdx.x][threadIdx.y]: fixing threadIdx.y and varying threadIdx.x reads tile[0][y], tile[1][y], ..., which sit at addresses 32 floats = 128 bytes apart, i.e., the same bank every time. 32-way conflict.

7.3 The padding fix

Add a phantom column:

__shared__ float tile[32][33];   // 33, not 32

Now tile[k][y] is at k*33 + y floats from base. Bank index = (k*33 + y) mod 32 = (k + y) mod 32 since 33 mod 32 = 1. As k varies 0..31 with y fixed, the bank stride is 1-every thread hits a different bank. Conflict gone, at the cost of 32 floats × 4 bytes = 128 B of waste per tile.

__global__ void transpose(const float* in, float* out, int N) {
    __shared__ float tile[32][33];
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    if (x < N && y < N) tile[threadIdx.y][threadIdx.x] = in[y*N + x];
    __syncthreads();
    int xt = blockIdx.y * 32 + threadIdx.x;   // swap block coords
    int yt = blockIdx.x * 32 + threadIdx.y;
    if (xt < N && yt < N) out[yt*N + xt] = tile[threadIdx.x][threadIdx.y];
}

7.4 Dynamic shared memory

If the size is known only at launch, use extern __shared__:

__global__ void k(int n) {
    extern __shared__ float buf[];
    // ... use buf[0..n-1]
}
k<<<grid, block, n*sizeof(float)>>>(n);

The third chevron argument supplies the byte count.

7.5 Capacity

H100 has up to ~228 KB of combined L1+shared per SM, configurable via cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, …). A100 has ~164 KB. Default per-block limit is 48 KB unless opted up.
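Opting a kernel into more than the default 48 KB looks like this. A minimal sketch with a placeholder kernel body (the attribute call must precede the launch):

```cuda
#include <cuda_runtime.h>

__global__ void big_smem_kernel(float* out) {
    extern __shared__ float buf[];     // sized by the launch below
    buf[threadIdx.x] = 0.0f;           // placeholder body
    if (threadIdx.x == 0) out[0] = buf[0];
}

void launch_big(float* d_out) {
    size_t smemBytes = 100 * 1024;     // > 48 KB default: must opt in
    cudaFuncSetAttribute(big_smem_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smemBytes);
    big_smem_kernel<<<1, 256, smemBytes>>>(d_out);
}
```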

Micro-exercise 7.1

A warp does s[threadIdx.x * 2] on a __shared__ float s[64]. Conflict? Answer: thread t accesses word 2t, i.e., bank (2t) mod 32. Threads 0..15 hit banks 0, 2, ..., 30; threads 16..31 wrap around and hit banks 0, 2, ..., 30 again, each colliding with one thread from the first half. 2-way conflict.


8. Building Blocks

8.1 Vector add

__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

This is purely bandwidth-bound: 3 × 4 = 12 bytes touched per output, ~1 FLOP per output → arithmetic intensity 1/12 FLOP/byte. On H100 (~3 TB/s, ~67 TFLOP/s FP32), the roofline says ~250 GFLOP/s-this kernel achieves near-peak HBM bandwidth, not near-peak FLOPs.

8.2 Reduction-naive

Sum N floats. Naive in-block reduction:

__global__ void reduce_naive(const float* x, float* out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, sdata[0]);
}

Problems: divergence inside the warp (tid % (2*s)) wastes lanes; bank conflicts on sdata accesses.

8.3 Reduction-sequential addressing

for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
}

Now active threads are 0..s-1, contiguous, no divergence within active warps. Bank conflicts gone (stride-1).

8.4 Reduction-warp shuffle

The last 6 levels of reduction (s = 32 down to 1) execute within a single warp. Warp shuffle (__shfl_down_sync) lets a thread read another lane's register without going through shared memory:

__device__ float warp_reduce_sum(float v) {
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_down_sync(0xffffffff, v, off);
    return v;     // lane 0 holds the full warp sum
}

__global__ void reduce_shfl(const float* x, float* out, int n) {
    __shared__ float sm[32];                       // one slot per warp in block
    int tid = threadIdx.x;
    int lane = tid & 31;
    int wid  = tid >> 5;
    int i = blockIdx.x * blockDim.x + tid;
    float v = (i < n) ? x[i] : 0.0f;
    v = warp_reduce_sum(v);
    if (lane == 0) sm[wid] = v;
    __syncthreads();
    v = (tid < blockDim.x / 32) ? sm[lane] : 0.0f;
    if (wid == 0) v = warp_reduce_sum(v);
    if (tid == 0) atomicAdd(out, v);
}

Performance evolution (approximate, 256-thread block, 1 GiB input, H100):

| Variant | Time | Notes |
|---|---|---|
| Naive divergence | ~3.0 ms | warp divergence + bank conflicts |
| Sequential addr. | ~1.2 ms | clean shared-mem reduction |
| Warp shuffle | ~0.55 ms | near HBM-bound |
| + grid-stride loop | ~0.40 ms | 1 launch, fewer atomics |
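The grid-stride row in the table has no code of its own; the change is that each thread first accumulates many elements in a register before any cross-thread reduction, so far fewer blocks (and atomics) are needed. A sketch reusing warp_reduce_sum from §8.4:

```cuda
__global__ void reduce_shfl_gs(const float* x, float* out, int n) {
    float v = 0.0f;
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        v += x[i];                                   // serial accumulation in a register
    v = warp_reduce_sum(v);                          // warp reduce, as in §8.4
    __shared__ float sm[32];
    int lane = threadIdx.x & 31, wid = threadIdx.x >> 5;
    if (lane == 0) sm[wid] = v;
    __syncthreads();
    v = (threadIdx.x < blockDim.x / 32) ? sm[lane] : 0.0f;
    if (wid == 0) v = warp_reduce_sum(v);
    if (threadIdx.x == 0) atomicAdd(out, v);         // one atomic per block
}
```

Launch it with a fixed grid sized from the SM count (§2.4) rather than from n.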

8.5 Cooperative groups (optional cleaner API)

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void reduce_cg(const float* x, float* out, int n) {
    auto block = cg::this_thread_block();
    auto warp  = cg::tiled_partition<32>(block);
    int i = block.group_index().x * block.size() + block.thread_rank();
    float v = (i < n) ? x[i] : 0.0f;
    v = cg::reduce(warp, v, cg::plus<float>());
    if (warp.thread_rank() == 0) atomicAdd(out, v);
}

8.6 Inclusive prefix sum (Hillis–Steele within a block)

__global__ void prefix_sum(float* x, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    s[tid] = (tid < n) ? x[tid] : 0.0f;
    __syncthreads();
    for (int off = 1; off < blockDim.x; off <<= 1) {
        float v = (tid >= off) ? s[tid - off] : 0.0f;
        __syncthreads();
        s[tid] += v;
        __syncthreads();
    }
    if (tid < n) x[tid] = s[tid];
}

For full-array scan you do block scans, scan the per-block totals, then add back. Or use thrust::inclusive_scan.
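For reference, the Thrust route mentioned above is a one-liner; it runs the block-scan / scan-of-totals / add-back pipeline internally:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

void scan_demo(int n) {
    thrust::device_vector<float> d(n, 1.0f);               // n ones on the device
    thrust::inclusive_scan(d.begin(), d.end(), d.begin()); // in-place inclusive scan
    // d now holds 1, 2, 3, ..., n
}
```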

8.7 Histogram with privatization

The naive histogram has every thread atomicAdd(&hist[bin], 1) on global — a contention disaster. Privatize per-block in shared memory, merge at end:

__global__ void hist_priv(const int* data, int* hist, int n, int B) {
    extern __shared__ int sh[];
    for (int i = threadIdx.x; i < B; i += blockDim.x) sh[i] = 0;
    __syncthreads();
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += stride) atomicAdd(&sh[data[i]], 1);
    __syncthreads();
    for (int i = threadIdx.x; i < B; i += blockDim.x)
        atomicAdd(&hist[i], sh[i]);
}

Speedup over naive global atomics: typically 10–50×, larger as bin counts shrink.


9. Tiled GEMM Walkthrough

We compute C = A · B with A: MxK, B: KxN, C: MxN, all row-major. We'll evolve the kernel through six stages, each with its expected speedup over the naive baseline on a square problem like 4096×4096.

9.1 Stage 0-Naive

Each thread computes one output element:

__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.f;
        for (int k = 0; k < K; ++k) acc += A[row*K + k] * B[k*N + col];
        C[row*N + col] = acc;
    }
}

Bandwidth analysis: each thread reads 2K floats and writes 1, performing 2K FLOPs → arithmetic intensity 2K / (8K bytes) = 0.25 FLOP/byte. Bandwidth bound at ~750 GFLOP/s on H100 (HBM ~3 TB/s × 0.25). With this thread mapping the accesses are actually reasonably friendly: B[k*N + col] and C[row*N + col] are coalesced across a warp and A[row*K + k] is a same-address broadcast. The real problem is zero reuse: every byte is fetched from HBM each time it is needed.

Baseline: ~1× (by definition). Achieves on the order of ~1–3% of cuBLAS.
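The cuBLAS number used as the yardstick throughout this section comes from a call like the following. cuBLAS is column-major, so for row-major C = A·B we compute Cᵀ = Bᵀ·Aᵀ, which in column-major view is exactly the row-major C we want:

```cuda
#include <cublas_v2.h>

void gemm_cublas(cublasHandle_t h, const float* A, const float* B, float* C,
                 int M, int N, int K) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K,      // dimensions of C^T: N x M
                &alpha,
                B, N,         // B^T in column-major = row-major B with ld N
                A, K,         // A^T in column-major = row-major A with ld K
                &beta,
                C, N);        // C^T with ld N = row-major C
}
```

Time this with cudaEvents (§4.3) on the same stream as your kernels for a fair comparison.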

9.2 Stage 1-Coalesced

One might expect to need to remap threads for coalescing, but the naive code above is already laid out well: threadIdx.x maps to col, so writes to C[row*N + col] are coalesced, and B[k*N + col] is coalesced too (for fixed k, varying col is contiguous in row-major). There is little left to win from coalescing alone; the real fix is reusing each loaded byte across threads, which only shared memory delivers.

Speedup over naive: small (~1.2–1.5×) just from cache effects.

9.3 Stage 2-Shared-memory tiled

Each block computes a BM × BN tile of C. It cooperatively loads tiles of A and B into shared memory, then iterates. With BM = BN = BK = 32:

template<int BM, int BN, int BK>
__global__ void gemm_smem(const float* A, const float* B, float* C,
                          int M, int N, int K) {
    __shared__ float sA[BM][BK];
    __shared__ float sB[BK][BN];
    int row = blockIdx.y * BM + threadIdx.y;
    int col = blockIdx.x * BN + threadIdx.x;
    float acc = 0.f;
    for (int kt = 0; kt < K; kt += BK) {
        sA[threadIdx.y][threadIdx.x] =
            (row < M && kt + threadIdx.x < K) ? A[row*K + kt + threadIdx.x] : 0.f;
        sB[threadIdx.y][threadIdx.x] =
            (kt + threadIdx.y < K && col < N) ? B[(kt + threadIdx.y)*N + col] : 0.f;
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < BK; ++k) acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row*N + col] = acc;
}

Each tile of A and B is loaded once from HBM and reused 32× from shared memory. Arithmetic intensity rises to ~8 FLOP/byte (2·BM·BN·BK FLOPs per tile step against 4·(BM·BK + BK·BN) bytes loaded). On H100 FP32 (~67 TFLOP/s peak), this can reach ~30–40% of cuBLAS FP32. Speedup over naive: ~10–15×.

Note: in the inner loop sA[threadIdx.y][k] * sB[k][threadIdx.x], the sA read is a same-address broadcast across the warp (conflict-free) and sB walks banks 0..31. Bank conflicts do appear for other tile shapes or thread mappings, especially column-wise reads as in §7.2; padding (sA[BM][BK+1]) is the standard fix there.

9.4 Stage 3-Register tiling (1D and 2D)

Have each thread compute multiple outputs, holding partial sums in registers. With TM = TN = 8, each thread now produces an 8×8 micro-tile, so a BM=128, BN=128 block uses 128*128/(8*8) = 256 threads-convenient.

Sketch (key inner loop):

constexpr int BM=128, BN=128, BK=8, TM=8, TN=8;
__shared__ float sA[BM][BK];
__shared__ float sB[BK][BN];
float regA[TM], regB[TN], acc[TM][TN] = {0};
// ... load sA, sB cooperatively ...
__syncthreads();
#pragma unroll
for (int k = 0; k < BK; ++k) {
    #pragma unroll
    for (int i = 0; i < TM; ++i) regA[i] = sA[ty*TM + i][k];
    #pragma unroll
    for (int j = 0; j < TN; ++j) regB[j] = sB[k][tx*TN + j];
    #pragma unroll
    for (int i = 0; i < TM; ++i)
        #pragma unroll
        for (int j = 0; j < TN; ++j) acc[i][j] += regA[i] * regB[j];
}

Each register-resident regA[i] * regB[j] is one FMA. With TM=TN=8 we generate 64 FMAs per k step from 8+8=16 register loads → 4 FMAs/load, matching SM dispatch ratios. Arithmetic intensity now in the hundreds of FLOP/byte. Expected: ~50–70% of cuBLAS FP32. Speedup over naive: ~30–40×.

9.5 Stage 4-Tensor cores (wmma, BF16 in / FP32 out)

On Volta+ each SM has tensor cores that compute matrix-multiply-accumulate on small fragments at warp scope. For BF16 inputs (Ampere, sm_80, and later) the standard wmma fragment shape is m=16, n=16, k=16 with an FP32 accumulator (we'll call this the "16x16x16" shape). One mma_sync call consumes a warp's fragments and accumulates one 16×16 tile.

#include <mma.h>
using namespace nvcuda::wmma;

__global__ void gemm_wmma(const __nv_bfloat16* A, const __nv_bfloat16* B,
                          float* C, int M, int N, int K) {
    int warpM = (blockIdx.y * blockDim.y + threadIdx.y);
    int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    fragment<matrix_a, 16,16,16, __nv_bfloat16, row_major> a_frag;
    fragment<matrix_b, 16,16,16, __nv_bfloat16, row_major> b_frag;
    fragment<accumulator, 16,16,16, float> c_frag;
    fill_fragment(c_frag, 0.0f);
    for (int k = 0; k < K; k += 16) {
        load_matrix_sync(a_frag, A + warpM*16*K + k, K);
        load_matrix_sync(b_frag, B + k*N + warpN*16, N);
        mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    store_matrix_sync(C + warpM*16*N + warpN*16, c_frag, N, mem_row_major);
}

This is the simplest possible tensor-core GEMM-one warp produces one 16×16 output tile and walks the K-dimension with no shared-memory tiling. Useful as a teaching example, but it misses the data-reuse win. To get serious throughput, combine wmma with shared-memory tiling (stage 5+): load tiles of A and B into shared memory, iterate wmma fragments out of those tiles, accumulate per-warp.

Expected speedup once integrated with shared-memory tiling and BF16 inputs: ~5–10× over the FP32 register-tiled stage 3, simply because BF16 tensor-core peak on H100 (~990 TFLOP/s dense, roughly 2× that with structured sparsity) dwarfs FP32 (~67 TFLOP/s).

9.6 Stage 5-Double-buffered with cp.async (Ampere+)

cp.async (PTX: cp.async.cg.shared.global) initiates a global → shared copy that doesn't block the issuing thread; it commits asynchronously. You overlap the next tile's load with the current tile's compute.

#include <cuda/pipeline>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

extern __shared__ char smem_raw[];
auto sA = reinterpret_cast<__nv_bfloat16(*)[BK]>(smem_raw);
auto sB = reinterpret_cast<__nv_bfloat16(*)[BN]>(smem_raw + sizeof(*sA)*BM*2);

__pipeline_memcpy_async(&sA[buf][...], &A[...], bytes);
__pipeline_memcpy_async(&sB[buf][...], &B[...], bytes);
__pipeline_commit();
__pipeline_wait_prior(0);     // wait for all outstanding copies (pass 1 to keep the newest prefetch in flight)
__syncthreads();
// ... wmma on sA[buf], sB[buf], while next iter prefetches into sA[1-buf] ...

By keeping two shared-memory buffers and pre-issuing the next load while the math fragment runs, the kernel overlaps HBM latency with tensor-core compute. Final speedup on H100 with proper pipelining and Hopper-specific TMA/wgmma: cuBLAS-class performance (~70–95% of cuBLAS for square problems > 1024).

9.7 Cumulative speedup table (4096³ BF16, H100-approximate)

| Stage | Approx. % of cuBLAS | Notes |
| --- | --- | --- |
| 0. Naive FP32 | ~1–3% | bandwidth-bound, strided B |
| 1. Coalesced FP32 | ~3–5% | minor |
| 2. Shared-mem tiled FP32 | ~25–40% | classic GEMM tiling |
| 3. Register tiled FP32 | ~50–70% | OK for FP32 |
| 4. wmma BF16 (no tiling) | N/A | demo only |
| 4'. wmma BF16 + tiling | ~40–60% of BF16 cuBLAS | |
| 5. + cp.async double-buffer | ~70–90% of BF16 cuBLAS | |
| 6. Hopper wgmma + TMA | ~90–98% | what CUTLASS does |

10. Tensor Cores via nvcuda::wmma

10.1 Fragment types

fragment<USE, M, N, K, T, LAYOUT> frag;
  • USE: matrix_a, matrix_b, or accumulator.
  • M, N, K: shape; the canonical, supported-everywhere combination is 16,16,16 for FP16/BF16 → FP32. Other shapes (8,32,16; 32,8,16) exist on some archs.
  • T: element type. __half, __nv_bfloat16 for inputs; float, __half, int32_t for accumulators.
  • LAYOUT: row_major or col_major (only for A and B fragments).

A fragment is opaque across threads of a warp-internally it's a strided slice of registers. The element-by-element layout is unspecified; you only interact with it through the API.

10.2 The four operations

fill_fragment(frag, value);
load_matrix_sync(frag, ptr, leadingDim);
mma_sync(d, a, b, c);                       // d = a*b + c
store_matrix_sync(ptr, frag, leadingDim, layout);

load_matrix_sync requires the pointer to be 16-byte aligned and the leading dim to be a multiple of 8 (FP16/BF16). The "sync" suffix is a warp-wide collective; all 32 threads must reach it with the same args.

10.3 Supported precisions (canonical)

| Inputs A,B | Accum | Notes |
| --- | --- | --- |
| __half | __half / float | Volta+ |
| __nv_bfloat16 | float | Ampere+ |
| TF32 (FP32 fed in) | float | Ampere+, via wmma::precision::tf32 |
| int8 | int32 | Turing+ |
| FP8 (E4M3, E5M2) | float | Hopper+ (use wgmma/CUTLASS in practice) |

10.4 Layout interplay

For canonical 16x16x16 BF16: a warp's 32 threads cooperatively hold a 16×16 fragment (256 elements / 32 lanes = 8 elements/lane). The layout is opaque, but load_matrix_sync knows how to materialize it from row- or col-major memory with the leading-dim arg. Always pass the true leading dim of the parent matrix-passing the tile width is a common bug.

Micro-exercise 10.1

You have an FP32 matrix and want to use BF16 tensor cores. What do you do? Answer: either pre-convert with a kernel that emits BF16 (__float2bfloat16), or use TF32 fragments (wmma::precision::tf32) which take FP32 input and truncate the mantissa internally.


11. mma.sync PTX Inline Assembly

wmma is a high-level wrapper. CUTLASS and FlashAttention drop down to mma.sync PTX for tile sizes the wrapper doesn't expose, layout flexibility, or to interleave multiple MMAs to hide latency.

A canonical Ampere BF16 16×8×16 MMA (note: 16×8×16, not 16×16×16-the PTX shape is finer):

__device__ inline void mma_m16n8k16_bf16(
    float d[4],
    const uint32_t a[4],     // packed BF16: 2 elements per uint32, 8 per lane
    const uint32_t b[2],
    const float c[4])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}

Reading the mnemonic:

  • `mma.sync` - synchronous warp-collective MMA.
  • `aligned` - all threads must execute together with consistent operands.
  • `m16n8k16` - shape: 16×8 output tile, K-dim 16.
  • `row.col` - A is row-major, B is col-major (in fragment layout terms).
  • `f32.bf16.bf16.f32` - D, A, B, C types.

Each lane carries part of the fragment; the layout is documented in PTX ISA. CUTLASS abstracts this with cutlass::arch::Mma templates so you don't write the asm by hand for production code, but it's important to recognize it when reading library code.

Hopper introduces wgmma.mma_async (warp-group MMA), which operates over 128 threads (4 warps) and runs asynchronously, freeing the warp group to issue more work while the tensor core churns. CUTLASS 3.x and cuBLAS use this internally for peak Hopper performance.


12. Cooperative Groups & Thread-Block Clusters

12.1 Cooperative groups

The cooperative_groups namespace provides typed handles for thread collectives:

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>   // needed for cg::reduce
namespace cg = cooperative_groups;

__global__ void k() {
    auto block = cg::this_thread_block();
    auto warp  = cg::tiled_partition<32>(block);
    auto half  = cg::tiled_partition<16>(warp);   // sub-warp tile
    // collective ops:
    int lane_value = warp.thread_rank();          // example per-lane input
    int v = cg::reduce(warp, lane_value, cg::plus<int>());
    block.sync();      // == __syncthreads()
}

Useful when you want to write code that doesn't hard-code warp size 32 (in case of future architectures or simulating smaller tiles).

12.2 Grid-wide synchronization

If you launch with cudaLaunchCooperativeKernel, you can call cg::this_grid().sync() for a barrier across all blocks of the grid. The constraint: the entire grid must fit on the GPU concurrently. Useful for multi-pass algorithms that previously required separate kernel launches.

12.3 Thread-block clusters (Hopper)

Hopper introduces a new level above the block: the cluster, a group of nearby blocks that share Distributed Shared Memory (DSMEM) and can synchronize collectively.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 2, 1) cluster_k() {
    auto cluster = cg::this_cluster();
    cluster.sync();             // sync all blocks in the cluster
    int rank = cluster.block_rank();
    // distributed shared memory:
    __shared__ float buf[1024];
    auto neighbor = cluster.map_shared_rank(buf, /*rank*/0);
    // now `neighbor` points into block-0's shared memory
}

This unlocks new patterns like cross-block tile sharing in GEMM without going through HBM, which is part of how Hopper hits its tensor-core peak.


13. Profiling Discipline

13.1 Timing with events (the right way)

cudaEvent_t s, e;
cudaEventCreate(&s); cudaEventCreate(&e);
// warmup
for (int i = 0; i < 3; ++i) kernel<<<g,b>>>(...);
cudaDeviceSynchronize();
cudaEventRecord(s);
for (int i = 0; i < N_ITERS; ++i) kernel<<<g,b>>>(...);
cudaEventRecord(e);
cudaEventSynchronize(e);
float ms;  cudaEventElapsedTime(&ms, s, e);
double avg_ms = ms / N_ITERS;

Always:
  • Warm up (first launch is JIT-y and miss-y).
  • Loop and average.
  • Sync before reading the timer.
  • Match the stream events are recorded in to the kernel's stream.

13.2 Common pitfalls

  • Forgetting to sync before stopping the host clock. Kernel launches return immediately; CPU chrono::now() measures launch overhead, not kernel time. Use events.
  • Default-stream interactions make cross-stream timing measure serialized work even when you wrote concurrent code. Check with Nsight Systems' timeline view.
  • JIT compile on first launch can add seconds. Pre-build for your arch (`-arch=sm_90`) and ignore the first iteration.
  • Power state: GPUs throttle. Pin clocks (nvidia-smi -pm 1 -lgc …) for reproducible benchmarks.

13.3 nsys (Nsight Systems)-system-wide timeline

Captures a Gantt chart of CUDA API calls, kernels, and memcopies across all streams. Best first stop:

nsys profile --stats=true -o report ./myprog
nsys-ui report.nsys-rep        # GUI

Use to answer: are H2D, kernel, D2H actually overlapped? Is the default stream serializing things?

13.4 ncu (Nsight Compute)-per-kernel deep dive

ncu --set full -k mygemm -o profile ./myprog
ncu-ui profile.ncu-rep

Tells you achieved occupancy, achieved DRAM throughput, L1/L2 hit rates, shared-memory bank conflicts, warp stall reasons. The "Speed of Light" section tells you, for each kernel, how close to peak compute and peak memory you are-if both are far from 100%, you have latency-hiding problems (too few resident warps, dependencies).

13.5 Reading roofline

For any kernel, compute:
  • Arithmetic intensity = FLOPs / bytes touched.
  • Achievable = min(peak_compute, AI × peak_BW).

If AI < peak_compute / peak_BW (the "ridge point"), the kernel is memory-bound and you should think about reuse, not FLOP throughput. For H100 BF16 tensor cores: ridge ≈ 990 TFLOP/s ÷ 3 TB/s ≈ 330 FLOP/byte. Almost nothing reaches that without aggressive shared-memory tiling.


14. A Complete BF16 GEMM at 2048×2048

Below is a single self-contained file targeting Ampere+ (sm_80). It includes shared-memory tiling, wmma tensor cores, and a simple double-buffered cp.async pipeline. On A100 we observe ~60–70% of cuBLAS BF16 for M=N=K=2048. (Actual depends heavily on tile choice and hardware-verify with ncu.)

gemm_bf16.cu:

// Build:
//   nvcc -O3 -arch=sm_80 -lineinfo -lcublas gemm_bf16.cu -o gemm_bf16
// Run:
//   ./gemm_bf16

#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <mma.h>
#include <cublas_v2.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cmath>     // std::abs on floating-point
#include <vector>

#define CUDA_CHECK(call) do {                                    \
    cudaError_t e_ = (call);                                     \
    if (e_ != cudaSuccess) {                                     \
        fprintf(stderr, "CUDA %s:%d: %s\n", __FILE__, __LINE__,  \
                cudaGetErrorString(e_));                         \
        std::exit(1);                                            \
    }                                                            \
} while (0)

using nvcuda::wmma::fragment;
using nvcuda::wmma::matrix_a;
using nvcuda::wmma::matrix_b;
using nvcuda::wmma::accumulator;
using nvcuda::wmma::row_major;
using nvcuda::wmma::col_major;
using nvcuda::wmma::load_matrix_sync;
using nvcuda::wmma::mma_sync;
using nvcuda::wmma::store_matrix_sync;
using nvcuda::wmma::fill_fragment;
using nvcuda::wmma::mem_row_major;

// Tile config: BM × BN output tile per block; BK along K.
// One warp computes a WM × WN sub-tile.
constexpr int BM = 128, BN = 128, BK = 32;
constexpr int WM = 64,  WN = 64;
constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;
constexpr int WARPS_PER_BLOCK = (BM/WM) * (BN/WN);   // 2*2 = 4
constexpr int THREADS_PER_BLOCK = WARPS_PER_BLOCK * 32; // 128

// `cp.async` 16-byte copy; source and destination must be 16-byte aligned.
__device__ __forceinline__ void cp_async_16(void* smem_ptr, const void* gmem_ptr) {
#if __CUDA_ARCH__ >= 800
    unsigned int smem_int = __cvta_generic_to_shared(smem_ptr);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(smem_int), "l"(gmem_ptr));
#else
    *reinterpret_cast<int4*>(smem_ptr) = *reinterpret_cast<const int4*>(gmem_ptr);
#endif
}
__device__ __forceinline__ void cp_async_commit() {
#if __CUDA_ARCH__ >= 800
    asm volatile("cp.async.commit_group;\n" ::);
#endif
}
__device__ __forceinline__ void cp_async_wait_all() {
#if __CUDA_ARCH__ >= 800
    asm volatile("cp.async.wait_all;\n" ::);
#endif
}

__global__ void gemm_bf16_kernel(const __nv_bfloat16* __restrict__ A,
                                 const __nv_bfloat16* __restrict__ B,
                                 float* __restrict__ C,
                                 int M, int N, int K)
{
    // Two shared-mem buffers per matrix for double-buffering.
    __shared__ __nv_bfloat16 sA[2][BM][BK];
    __shared__ __nv_bfloat16 sB[2][BK][BN];

    int tid       = threadIdx.x;
    int warp_id   = tid / 32;
    int lane_id   = tid % 32;
    int warp_row  = warp_id / (BN / WN);   // 0 or 1
    int warp_col  = warp_id % (BN / WN);

    int block_row = blockIdx.y * BM;
    int block_col = blockIdx.x * BN;

    // Each thread brings 16B = 8 BF16 from A and 8 from B per K-tile.
    // 128 threads × 8 = 1024 elements per tile; BM*BK = 128*32 = 4096 → 4 iters,
    // BK*BN = 32*128 = 4096 → 4 iters. We unroll inline below.

    auto load_tile = [&](int buf, int kt) {
        // Load sA[buf][BM][BK] from A[block_row..+BM][kt..+BK]
        constexpr int A_TILE = BM * BK;       // 4096
        constexpr int A_PER_THREAD = A_TILE / THREADS_PER_BLOCK / 8;  // 4
        #pragma unroll
        for (int i = 0; i < A_PER_THREAD; ++i) {
            int idx = (i * THREADS_PER_BLOCK + tid) * 8;
            int r = idx / BK;
            int c = idx % BK;
            const void* gptr = &A[(block_row + r) * K + (kt + c)];
            void* sptr = &sA[buf][r][c];
            cp_async_16(sptr, gptr);
        }
        constexpr int B_TILE = BK * BN;       // 4096
        constexpr int B_PER_THREAD = B_TILE / THREADS_PER_BLOCK / 8;  // 4
        #pragma unroll
        for (int i = 0; i < B_PER_THREAD; ++i) {
            int idx = (i * THREADS_PER_BLOCK + tid) * 8;
            int r = idx / BN;
            int c = idx % BN;
            const void* gptr = &B[(kt + r) * N + (block_col + c)];
            void* sptr = &sB[buf][r][c];
            cp_async_16(sptr, gptr);
        }
        cp_async_commit();
    };

    // Per-warp output fragments.
    constexpr int FRAG_M = WM / WMMA_M;   // 4
    constexpr int FRAG_N = WN / WMMA_N;   // 4
    fragment<accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag[FRAG_M][FRAG_N];
    #pragma unroll
    for (int i = 0; i < FRAG_M; ++i)
        #pragma unroll
        for (int j = 0; j < FRAG_N; ++j) fill_fragment(c_frag[i][j], 0.0f);

    // Prefetch first tile.
    int kt = 0;
    int buf = 0;
    load_tile(buf, kt);
    cp_async_wait_all();
    __syncthreads();

    for (kt = BK; kt < K; kt += BK) {
        // Issue next tile prefetch into the other buffer.
        int next_buf = buf ^ 1;
        load_tile(next_buf, kt);

        // Compute on current buffer.
        #pragma unroll
        for (int kk = 0; kk < BK; kk += WMMA_K) {
            fragment<matrix_a, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> a_frag[FRAG_M];
            fragment<matrix_b, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> b_frag[FRAG_N];
            #pragma unroll
            for (int i = 0; i < FRAG_M; ++i) {
                int row = warp_row * WM + i * WMMA_M;
                load_matrix_sync(a_frag[i], &sA[buf][row][kk], BK);
            }
            #pragma unroll
            for (int j = 0; j < FRAG_N; ++j) {
                int col = warp_col * WN + j * WMMA_N;
                load_matrix_sync(b_frag[j], &sB[buf][kk][col], BN);
            }
            #pragma unroll
            for (int i = 0; i < FRAG_M; ++i)
                #pragma unroll
                for (int j = 0; j < FRAG_N; ++j)
                    mma_sync(c_frag[i][j], a_frag[i], b_frag[j], c_frag[i][j]);
        }

        cp_async_wait_all();
        __syncthreads();
        buf = next_buf;
    }

    // Tail compute on last loaded buffer.
    #pragma unroll
    for (int kk = 0; kk < BK; kk += WMMA_K) {
        fragment<matrix_a, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> a_frag[FRAG_M];
        fragment<matrix_b, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> b_frag[FRAG_N];
        #pragma unroll
        for (int i = 0; i < FRAG_M; ++i)
            load_matrix_sync(a_frag[i], &sA[buf][warp_row*WM + i*WMMA_M][kk], BK);
        #pragma unroll
        for (int j = 0; j < FRAG_N; ++j)
            load_matrix_sync(b_frag[j], &sB[buf][kk][warp_col*WN + j*WMMA_N], BN);
        #pragma unroll
        for (int i = 0; i < FRAG_M; ++i)
            #pragma unroll
            for (int j = 0; j < FRAG_N; ++j)
                mma_sync(c_frag[i][j], a_frag[i], b_frag[j], c_frag[i][j]);
    }

    // Store back.
    #pragma unroll
    for (int i = 0; i < FRAG_M; ++i) {
        #pragma unroll
        for (int j = 0; j < FRAG_N; ++j) {
            int row = block_row + warp_row * WM + i * WMMA_M;
            int col = block_col + warp_col * WN + j * WMMA_N;
            store_matrix_sync(&C[row*N + col], c_frag[i][j], N, mem_row_major);
        }
    }
    (void)lane_id;
}

// Reference using cuBLAS for correctness + perf comparison.
static void cublas_gemm(cublasHandle_t h,
                        const __nv_bfloat16* dA, const __nv_bfloat16* dB,
                        float* dC, int M, int N, int K) {
    float alpha = 1.f, beta = 0.f;
    // cuBLAS is column-major. To compute C = A*B with row-major inputs,
    // call cublasGemmEx with op_N, op_N and swap the operand order:
    //   C^T = B^T * A^T, treated as col-major.
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N,
                 N, M, K,
                 &alpha,
                 dB, CUDA_R_16BF, N,
                 dA, CUDA_R_16BF, K,
                 &beta,
                 dC, CUDA_R_32F, N,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}

int main() {
    const int M = 2048, N = 2048, K = 2048;
    size_t aBytes = (size_t)M*K*sizeof(__nv_bfloat16);
    size_t bBytes = (size_t)K*N*sizeof(__nv_bfloat16);
    size_t cBytes = (size_t)M*N*sizeof(float);

    std::vector<__nv_bfloat16> hA(M*K), hB(K*N);
    std::vector<float> hC(M*N), hRef(M*N);
    for (int i = 0; i < M*K; ++i) hA[i] = __float2bfloat16((float)((i*7) % 13) * 0.01f);
    for (int i = 0; i < K*N; ++i) hB[i] = __float2bfloat16((float)((i*5) % 11) * 0.01f);

    __nv_bfloat16 *dA, *dB; float *dC, *dRef;
    CUDA_CHECK(cudaMalloc(&dA, aBytes));
    CUDA_CHECK(cudaMalloc(&dB, bBytes));
    CUDA_CHECK(cudaMalloc(&dC, cBytes));
    CUDA_CHECK(cudaMalloc(&dRef, cBytes));
    CUDA_CHECK(cudaMemcpy(dA, hA.data(), aBytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, hB.data(), bBytes, cudaMemcpyHostToDevice));

    dim3 grid(N/BN, M/BM);
    dim3 block(THREADS_PER_BLOCK);

    // Warmup
    for (int i = 0; i < 3; ++i)
        gemm_bf16_kernel<<<grid, block>>>(dA, dB, dC, M, N, K);
    CUDA_CHECK(cudaDeviceSynchronize());

    cudaEvent_t s, e;
    cudaEventCreate(&s); cudaEventCreate(&e);
    const int ITERS = 50;
    cudaEventRecord(s);
    for (int i = 0; i < ITERS; ++i)
        gemm_bf16_kernel<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaEventRecord(e);
    cudaEventSynchronize(e);
    float ms; cudaEventElapsedTime(&ms, s, e);
    double avg_ms = ms / ITERS;
    double tflops = 2.0 * M * N * K / (avg_ms * 1e-3) / 1e12;
    printf("Custom: %.3f ms,  %.2f TFLOP/s\n", avg_ms, tflops);

    cublasHandle_t h; cublasCreate(&h);
    for (int i = 0; i < 3; ++i) cublas_gemm(h, dA, dB, dRef, M, N, K);
    CUDA_CHECK(cudaDeviceSynchronize());
    cudaEventRecord(s);
    for (int i = 0; i < ITERS; ++i) cublas_gemm(h, dA, dB, dRef, M, N, K);
    cudaEventRecord(e);
    cudaEventSynchronize(e);
    cudaEventElapsedTime(&ms, s, e);
    double cb_ms = ms / ITERS;
    double cb_tflops = 2.0 * M * N * K / (cb_ms * 1e-3) / 1e12;
    printf("cuBLAS: %.3f ms,  %.2f TFLOP/s\n", cb_ms, cb_tflops);
    printf("Ratio:  %.1f%% of cuBLAS\n", 100.0 * tflops / cb_tflops);

    // Correctness spot-check (sample 16 elements).
    CUDA_CHECK(cudaMemcpy(hC.data(),  dC,   cBytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaMemcpy(hRef.data(),dRef, cBytes, cudaMemcpyDeviceToHost));
    double max_rel = 0;
    for (int i = 0; i < M*N; i += (M*N/16)) {
        double r = std::abs(hC[i]-hRef[i]) / (std::abs(hRef[i]) + 1e-6);
        if (r > max_rel) max_rel = r;
    }
    printf("Max sampled rel error vs cuBLAS: %.3e\n", max_rel);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dRef);
    return 0;
}

Build:

nvcc -O3 -arch=sm_80 -lineinfo -lcublas gemm_bf16.cu -o gemm_bf16
./gemm_bf16

Annotations on the design:

  • Tile: 128×128 output × 32 K-block per iteration. Each block has 4 warps, each warp owns a 64×64 output sub-tile, materialized as a 4×4 grid of 16×16 tensor-core fragments.
  • Threads per block: 128 = 4 warps × 32. Modest-leaves room for 4–8 blocks per SM (occupancy-bounded by shared-memory: each block uses 2 × (128×32 + 32×128) × 2 B = 32 KB of smem).
  • cp.async double-buffering: while the tensor cores compute on sA[buf] / sB[buf], the next K-tile is fetched into the other buffer with no thread blocking. cp.async.wait_all + __syncthreads() flips the buffer when both the compute and the prefetch are done.
  • No bank-conflict fix: at 32-wide BF16 rows the natural layout already hits 32 different banks per warp load through load_matrix_sync. If you change BK you may need to add padding.
  • cuBLAS comparison: cuBLAS is column-major, so we swap the operand order to compute the row-major equivalent, a standard trick.

What this kernel doesn't do (that CUTLASS / cuBLAS do): swizzled shared-mem layout to fully avoid bank conflicts at all tile shapes; multi-stage cp.async pipeline (3+ buffers) for deeper latency hiding; wgmma on Hopper; epilogue fusion (bias, ReLU, scale); split-K for skinny problems. Implementing each of those is what closes the remaining 30–40% gap.


15. Practical Exercises

Exercise 1-Saturating PCIe

Write a benchmark that measures cudaMemcpy H→D bandwidth for transfer sizes 1 KB to 1 GB (powers of 2), with and without pinned host memory, and plots the curve. Answer sketch: allocate once outside the timing loop; cudaMallocHost for pinned; warm up; time with events; expect pinned to plateau near PCIe peak (~25 GB/s Gen4 / ~50 GB/s Gen5) starting ~1 MB, while pageable plateaus lower.

Exercise 2-Fix a coalescing bug

The kernel below is 8× slower than expected. Why?

__global__ void scale(float* x, int N) {
    int i = threadIdx.x * gridDim.x + blockIdx.x;
    if (i < N) x[i] *= 2.f;
}
Answer: the index is `i = threadIdx.x * gridDim.x + blockIdx.x` - consecutive threads stride by `gridDim.x`, so loads from `x` go through different sectors. Fix: `i = blockIdx.x * blockDim.x + threadIdx.x`.

Exercise 3-Eliminate bank conflicts

Profile the transpose kernel (§7.2) without padding using ncu and verify shared_load_transactions_per_request ≈ 32. Add the [32][33] padding and verify it drops to 1. Answer sketch: ncu --set full -k transpose ./prog; look at the "Memory Workload Analysis → Shared Memory" pane.

Exercise 4-Reduction with fewer atomics

Modify reduce_shfl (§8.4) to do a two-stage reduction: per-block partial into an array partials[gridDim.x], then a second kernel reduces partials. Compare against the single-kernel atomicAdd version. Answer sketch: saves the contended atomics; for n = 2^28 the two-kernel version is often ~10–20% faster on H100 because atomics hit L2 contention.

Exercise 5-Naive vs tiled GEMM

Write FP32 gemm_naive (§9.1) and gemm_smem (§9.3) for M=N=K=2048. Time both. Compare to cuBLAS. Answer sketch: expect naive ~1–2% of cuBLAS, tiled ~25–40%, with cuBLAS BF16 itself being ~10× cuBLAS FP32 on tensor-core-capable hardware. Use the tiled FP32 number as the FP32 roofline reference.

Exercise 6-wmma correctness

Implement a wmma-based 16×16×16 GEMM (§9.5) for M=N=K=64 and verify against a CPU reference. Catch: BF16 has 7 mantissa bits - accept ~1e-2 relative error, not 1e-6. Answer sketch: allocate BF16 inputs / FP32 output; CPU computes in float; compare element-wise; a rel-error budget of ~5e-2 per element is realistic for random inputs in [0,1].


Closing thoughts

You've now seen, end-to-end, how a CUDA kernel becomes a tensor-core GEMM: the launch model and execution machinery, the memory hierarchy and the access patterns it punishes or rewards, the shared-memory bank discipline, the building-block kernels every CUDA author reimplements once, and a six-stage GEMM evolution that mirrors how CUTLASS got built. From here:

  • Read CUTLASS's cute layouts-you now have the model to understand tensor algebra over fragments.
  • Read FlashAttention's CUDA-block-tiled GEMM + softmax fused, with exactly the cp.async pipeline pattern you implemented in §14.
  • Read PyTorch's `aten/src/ATen/native/cuda/*.cu` - production CUDA written for correctness first, perf second.
  • For Hopper-class peak, learn wgmma, TMA, and Distributed Shared Memory; CUTLASS 3.x is the canonical reference.

When in doubt: profile with ncu, look at the Speed-of-Light, and let the achieved-vs-peak gap tell you whether your kernel is bandwidth-bound, compute-bound, or latency-bound. Optimization without measurement is just guessing; with measurement, every step here is mechanical.

— end of deep dive 02 —
