Deep Dive 02: CUDA Programming – From First Kernel to Optimized GEMM¶
A self-contained reference chapter for the AI Systems curriculum (Month 2). Read this end-to-end and you will be able to write, optimize, and profile CUDA kernels at a level sufficient to read CUTLASS, FlashAttention, and Triton-generated PTX with comprehension. No external dependency required for the core material.
Anti-fabrication note: numbers marked with ~ are realistic ballparks, not invented exact figures. Anything stronger than that is either a hard architectural fact (e.g., warp size = 32) or something you should verify against deviceQuery / Nsight on your specific GPU.
Table of Contents¶
- The CUDA programming model
- Indexing, launch configuration, and the SM/warp execution model
- Memory transfer: host ↔ device, pinned, managed, zero-copy
- Streams, events, synchronization
- Error handling discipline
- Memory access patterns: coalescing, derived from first principles
- Shared memory and bank conflicts
- Building blocks: vector add, reduction, prefix sum, histogram
- Tiled GEMM walkthrough – naive → tensor-core → double-buffered
- Tensor cores via nvcuda::wmma
- mma.sync PTX inline assembly
- Cooperative groups and thread-block clusters
- Profiling discipline
- A complete, build-and-run-ready BF16 GEMM at 2048×2048
- Practical exercises with answer sketches
1. The CUDA Programming Model¶
1.1 Intuition¶
A CUDA program runs on two machines at once: a host (CPU) and a device (GPU). The host owns control flow and orchestrates the device the way a conductor cues a section of an orchestra: it allocates device memory, copies data in, launches a function (a kernel) that runs on tens of thousands of GPU threads in parallel, then copies results back. There is no shared address space by default; the two memories are physically distinct (host DRAM and device HBM/GDDR). Modern Unified Memory blurs this, but the mental model of "two machines" is still the right starting point.
A kernel is a function executed N times in parallel by N CUDA threads. The threads are organized into a grid of blocks, and each block is a 1D, 2D, or 3D array of threads. Threads within a block can cooperate (shared memory, barriers); threads in different blocks generally cannot, and may not even be co-resident on hardware at the same time.
1.2 Function-space qualifiers¶
Three qualifiers tell the compiler where a function runs and where it can be called from:
| Qualifier | Runs on | Callable from | Notes |
|---|---|---|---|
| __global__ | Device | Host (and device on CC ≥ 3.5 via dynamic parallelism) | This is a kernel. Must return void. |
| __device__ | Device | Device | Inlined aggressively. |
| __host__ | Host | Host | Default if unmarked. |
| __host__ __device__ | Both | Both | Same source compiled twice. |
__device__ float square(float x) { return x * x; } // GPU helper
__global__ void apply(float* y, const float* x, int n) // kernel
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = square(x[i]);
}
int main() { /* host code */ } // implicit __host__
1.3 Launch syntax¶
A kernel is launched with the chevron syntax:
kernel<<<gridDim, blockDim, dynamicSharedBytes, stream>>>(arguments...);
- gridDim: a dim3 giving the number of blocks along x, y, z.
- blockDim: a dim3 giving the number of threads per block along x, y, z.
- dynamicSharedBytes: bytes of extern __shared__ memory to allocate per block. Defaults to 0.
- stream: a cudaStream_t to enqueue into. Defaults to the default stream (stream 0), which has special legacy semantics described in §4.
Total threads launched = gridDim.x * gridDim.y * gridDim.z *
blockDim.x * blockDim.y * blockDim.z. Hard limits (verify with
deviceQuery for your GPU): blockDim.x*y*z ≤ 1024, gridDim.x ≤ 2^31-1.
1.4 The SM/warp execution model¶
A GPU is a collection of Streaming Multiprocessors (SMs). On H100 there are 132 SMs and 4 warp schedulers per SM; on A100 there are 108 SMs and also 4 schedulers per SM. Each SM has a register file (~256 KB on Ampere/Hopper), an L1/shared scratchpad (~228 KB on H100, configurable), warp schedulers, INT/FP32/FP64 pipes, and tensor cores.
When you launch a kernel, the GPU's hardware work distributor assigns whole blocks to SMs. Each SM can hold multiple blocks resident simultaneously, up to limits dictated by registers per thread, shared memory per block, and hardware caps (max 2048 resident threads per SM on most architectures, ~32–64 resident warps depending on SM).
Once on an SM, a block is sliced into warps of 32 threads each. The warp is the unit of scheduling and instruction issue. All 32 threads in a warp execute the same instruction at the same time; this is SIMT (Single Instruction Multiple Threads). When threads diverge (different sides of a branch), the warp executes both paths serially, masking inactive lanes. From Volta onward, threads have independent program counters (Independent Thread Scheduling), so the model is more nuanced, but warp-level ops still operate on full warps with explicit masks.
The warp size of 32 is hard-baked into hardware, PTX, and tooling. Treat it as a constant of the universe.
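To make divergence concrete, here is a minimal illustrative pair (names and the specific operations are arbitrary, not taken from the text above): the first kernel branches on a value that differs between lanes of the same warp, so every warp issues both paths with some lanes masked; the second branches on a value that is uniform across each warp, so no warp ever runs both paths.

__global__ void divergent(float* y, const float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) y[i] = x[i] * 2.0f;   // even lanes take this path
    else            y[i] = x[i] + 1.0f;   // odd lanes take this one: both paths are issued, masked
}

__global__ void warp_uniform(float* y, const float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) y[i] = x[i] * 2.0f;  // branch value is constant within a warp
    else                   y[i] = x[i] + 1.0f;  // so each warp issues exactly one path
}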
1.5 Why this matters¶
Three rules fall directly out of the model and you will rederive them constantly:
- Block size should be a multiple of 32. Any other choice wastes lanes in the trailing warp. 128 or 256 are good defaults.
- Memory accesses should be warp-coalesced (§6). The 32 threads of a warp issue one load instruction; if their addresses fall in a single 128-byte aligned segment, the memory subsystem services it in one transaction.
- Shared memory has 32 banks (§7). Patterns that map 32 threads of a warp to 32 distinct banks run at full speed; collisions serialize.
Micro-exercise 1.1¶
Why must a kernel return void? Answer sketch: the chevron launch is
asynchronous: by the time the host instruction stream proceeds past the
launch, the kernel has likely not run. There is no synchronous return value
to capture. Outputs go through pointers to device memory.
2. Indexing & Launch Configuration¶
2.1 Built-in variables¶
Inside a kernel you have:
- gridDim: dimensions of the grid (in blocks).
- blockDim: dimensions of a block (in threads).
- blockIdx: this block's coordinate within the grid.
- threadIdx: this thread's coordinate within its block.
- warpSize: always 32.
2.2 Computing a global thread ID¶
For a 1D grid of 1D blocks:
int i = blockIdx.x * blockDim.x + threadIdx.x;
For a 2D grid of 2D blocks (e.g., per-pixel image kernel):
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) { /* ... */ }
The bounds check matters because grid sizes are typically rounded up:
dim3 block(16, 16); // 256 threads
dim3 grid((width + 15) / 16, (height + 15) / 16); // ceil-div
saxpy2d<<<grid, block>>>(img, width, height);
2.3 Choosing block size – heuristics¶
- Multiple of 32 (warp size). Otherwise lanes are wasted.
- 128 or 256 is the sweet spot for memory-bound kernels.
- 256–1024 is typical for compute-bound kernels, but bigger blocks reduce the number of resident blocks per SM, which reduces the scheduler's ability to hide latency by switching warps. Run the occupancy calculator or use cudaOccupancyMaxPotentialBlockSize to pick an optimal pair given per-thread register and shared-memory usage.
int minGrid, blockSize;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &blockSize,
(void*)myKernel, /*smem*/0, /*N*/0);
int grid = (n + blockSize - 1) / blockSize;
myKernel<<<grid, blockSize>>>(...);
2.4 Grid-stride loops¶
Decoupling problem size from grid size is the right idiom: launch a fixed
grid (typically 2 * SMcount * blocksPerSM), let each thread loop:
__global__ void saxpy_grid_stride(int n, float a, const float* x, float* y) {
int stride = gridDim.x * blockDim.x;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
y[i] = a * x[i] + y[i];
}
}
This kernel handles any n without recompiling and tends to be faster on
arrays whose size doesn't divide nicely into the grid.
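A host-side sketch of sizing such a fixed grid from the SM count and the occupancy query (assumes the saxpy_grid_stride kernel above and device buffers d_x, d_y of length n):

int dev = 0, smCount = 0, blocksPerSM = 0;
cudaGetDevice(&dev);
cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
int block = 256;
// How many 256-thread blocks of this kernel fit on one SM, given its register/smem usage?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy_grid_stride, block, 0);
int grid = 2 * smCount * blocksPerSM;   // ~2x oversubscription smooths out tail effects
saxpy_grid_stride<<<grid, block>>>(n, 2.0f, d_x, d_y);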
Micro-exercise 2.1¶
A kernel accesses a 1D array of 1,000,003 floats (a prime). What block and
grid sizes do you launch? Answer sketch: block=256, grid=ceil(1000003/256)=3907.
The last block covers indices 999,936–1,000,191, of which only 67 are valid
(1000003 mod 256 = 67): two full warps plus 3 active lanes in a third warp,
with the remaining five warps doing nothing. Add an if (i < n) guard.
nvcc invocation¶
nvcc -O3 -arch=sm_90 -lineinfo saxpy.cu -o saxpy # Hopper (H100)
nvcc -O3 -arch=sm_80 saxpy.cu -o saxpy # Ampere (A100)
3. Memory Transfer¶
3.1 The four ways to get bytes onto the GPU¶
| API | Allocator | Locality | Transfer cost | Use case |
|---|---|---|---|---|
| cudaMalloc + cudaMemcpy | Device DRAM | Device | PCIe / NVLink | Default for hot data. |
| cudaMallocHost | Pinned host RAM | Host (page-locked) | Faster cudaMemcpy | Staging buffer for async copies. |
| cudaMallocManaged (UVM) | Either | Migrates on access | Page-fault driven | Prototyping, pointer sharing. |
| Zero-copy (cudaHostAllocMapped) | Pinned host | Device sees host directly | PCIe per access | Tiny, rare GPU reads of host data. |
3.2 The classic pattern¶
const int N = 1 << 20;
size_t bytes = N * sizeof(float);
float *h_x = (float*)malloc(bytes);
float *d_x; cudaMalloc(&d_x, bytes);
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(d_x, N);
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_x); free(h_x);
cudaMemcpy on the default stream is host-blocking: for H2D from pageable
memory it returns only after the data has been staged, and for D2H it returns
only once the copy is complete. This is the most common source of
"my kernel is slow": the kernel is fine, the copy is dominating.
3.3 Pinned (page-locked) host memory¶
The OS may swap pageable memory; the GPU's DMA engine therefore must stage through a driver-pinned bounce buffer. Allocating already-pinned memory cuts that out:
float* h_x;
cudaMallocHost(&h_x, bytes); // pinned
// ... fill h_x ...
cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);
Practical effect: PCIe Gen4 x16 has a theoretical peak of ~32 GB/s
unidirectional. With pageable memory you typically see ~6–12 GB/s; with
pinned you see ~22–28 GB/s. Pinning is effectively mandatory for
cudaMemcpyAsync to actually overlap with kernels; with a pageable pointer the
driver silently falls back to a synchronous copy.
Caveat: pinned memory is a scarce OS resource. Don't pin gigabytes unnecessarily.
3.4 Unified Memory (cudaMallocManaged)¶
float* x;
cudaMallocManaged(&x, bytes);
for (int i = 0; i < N; ++i) x[i] = i; // host writes
kernel<<<grid, block>>>(x, N); // device reads-pages migrate
cudaDeviceSynchronize();
printf("%f\n", x[0]); // host reads-pages migrate back
cudaFree(x);
The driver fault-handles page migrations on demand. On Pascal+ this is fully
demand-paged at 4 KB granularity; on pre-Pascal the entire allocation
migrated on touch. UVM is great for prototyping but page faults are
expensive (~tens of μs each), so production code typically uses explicit
cudaMemPrefetchAsync to push data ahead of time, or just falls back to
explicit cudaMemcpy.
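A minimal sketch of that prefetch pattern, reusing the managed buffer x from the listing above (the stream s is assumed to exist; device 0 is assumed):

int dev = 0;
cudaGetDevice(&dev);
// Push the pages to the GPU before the kernel needs them, on the kernel's stream.
cudaMemPrefetchAsync(x, bytes, dev, s);
kernel<<<grid, block, 0, s>>>(x, N);
// Pull the pages back ahead of the host-side read.
cudaMemPrefetchAsync(x, bytes, cudaCpuDeviceId, s);
cudaStreamSynchronize(s);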
3.5 Performance ballparks (H100 + PCIe5; verify on your hardware)¶
- HBM3 device-to-device (cudaMemcpy D2D): ~2.5–3 TB/s.
- PCIe Gen5 x16 H↔D pinned: ~50–55 GB/s.
- Pageable H↔D: roughly half that, plus driver overhead.
- Zero-copy reads from device: bound by PCIe per access, latency ~1 μs.
Micro-exercise 3.1¶
You repeatedly run a kernel on the same input. Where should the input live?
Answer: in device memory (cudaMalloc), copied once. If the input is
streamed from disk each iteration, use cudaMemcpyAsync from a pinned
staging buffer overlapped with compute (§4).
4. Streams, Events, Synchronization¶
4.1 Streams as queues¶
A stream is a FIFO queue of GPU work. Operations enqueued into the same stream execute in order; operations in different streams may execute concurrently (subject to hardware resources).
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s1);
kernelA<<<g, b, 0, s1>>>(dA);
cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s2);
kernelB<<<g, b, 0, s2>>>(dB);
// kernelA and kernelB may run concurrently if the SMs have room
4.2 The default stream's special semantics¶
Stream 0 (the default / NULL / legacy stream) is implicitly synchronizing: a launch into the default stream waits for all prior work in every stream of the same device, and all subsequent stream work waits for it. This is hostile to concurrency.
Two ways out:
- Compile with --default-stream per-thread so each host thread gets its own non-blocking default stream.
- Always use explicit non-default streams created with cudaStreamCreate (or cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) to make a stream that does not synchronize against the default stream).
4.3 Events for timing and dependencies¶
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream);
kernel<<<g, b, 0, stream>>>(...);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop); // host waits for stop
float ms;
cudaEventElapsedTime(&ms, start, stop); // resolution ~0.5 μs
Events also express cross-stream dependencies:
cudaEventRecord(eA, sA); // record after kernelA
cudaStreamWaitEvent(sB, eA, 0); // sB cannot proceed until eA
kernelB<<<g, b, 0, sB>>>(...); // depends on kernelA
4.4 Overlapping copy and compute¶
The classic three-stream pipeline for a chunked workload:
const int CHUNK = 1 << 20;
const int NSTREAMS = 3;
cudaStream_t s[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);
for (int chunk = 0; chunk < nChunks; ++chunk) {
int k = chunk % NSTREAMS;
cudaMemcpyAsync(d_in + chunk*CHUNK, h_in + chunk*CHUNK,
CHUNK*sizeof(float), cudaMemcpyHostToDevice, s[k]);
process<<<grid, block, 0, s[k]>>>(d_in + chunk*CHUNK,
d_out + chunk*CHUNK, CHUNK);
cudaMemcpyAsync(h_out + chunk*CHUNK, d_out + chunk*CHUNK,
CHUNK*sizeof(float), cudaMemcpyDeviceToHost, s[k]);
}
for (int i = 0; i < NSTREAMS; ++i) cudaStreamSynchronize(s[i]);
Why three streams: a GPU typically has two DMA engines (one each direction)
plus the compute engine. Three streams let H2D, kernel, and D2H run
concurrently across chunks. With pinned host buffers, total wall time
collapses from T_h2d + T_compute + T_d2h toward max(T_h2d, T_compute,
T_d2h).
Micro-exercise 4.1¶
You time a kernel as cudaEventRecord(start), kernel launch,
cudaEventRecord(stop), cudaEventElapsedTime(&ms, start, stop), and get a
suspiciously small number. What did you forget? Answer: cudaEventSynchronize(stop)
before reading the elapsed time. Without it, the event may not have completed.
5. Error Handling Discipline¶
CUDA reports errors in two channels:
- The return code of every API call.
- A sticky per-thread error state for asynchronous failures, queryable via cudaGetLastError() (which clears it) or cudaPeekAtLastError() (which doesn't).
Kernel launches return success if the launch configuration is valid; an in-kernel out-of-bounds access surfaces only on the next sync point. So:
#define CUDA_CHECK(call) do { \
cudaError_t _e = (call); \
if (_e != cudaSuccess) { \
fprintf(stderr, "CUDA error %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(_e)); \
std::abort(); \
} \
} while (0)
#define CUDA_LAUNCH_CHECK() do { \
CUDA_CHECK(cudaPeekAtLastError()); \
CUDA_CHECK(cudaDeviceSynchronize()); \
} while (0)
Use:
CUDA_CHECK(cudaMalloc(&d_x, bytes));
CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
mykernel<<<g, b>>>(d_x);
CUDA_LAUNCH_CHECK(); // catches both launch and runtime kernel errors
In release builds you may want to skip cudaDeviceSynchronize (it kills
async pipelining); make CUDA_LAUNCH_CHECK peek-only and rely on
compute-sanitizer (e.g. compute-sanitizer ./app) during development.
Common pitfalls¶
- Sticky errors: a fatal device-side error (e.g., an illegal memory access) corrupts the context; every subsequent call returns the same error, and cudaGetLastError reports it but cannot un-stick it. Non-fatal errors are cleared by the next cudaGetLastError.
- Asynchronous errors: a kernel writing OOB will not surface until you next sync. Always sync in tests.
- Multi-GPU: CUDA errors are per-thread, per-context. Setting the wrong device with cudaSetDevice produces silent wrongness, not an error.
6. Memory Access Patterns: Coalescing¶
6.1 The coalescing rule, derived¶
Global memory is served by the L2 cache and HBM in 32-byte sectors, grouped into 128-byte cache lines. When a warp issues a load instruction, the hardware computes the union of the 32 thread addresses and issues the minimum number of sector fetches to cover them.
The ideal case: 32 threads of a warp load 32 consecutive 4-byte floats starting at a 128-byte aligned address. One 128-byte transaction services the entire warp. Achieved bandwidth = peak.
The worst case (for 4-byte loads): 32 threads load 32 floats, each falling in a different 128-byte line. The hardware issues 32 separate 32-byte sector fetches, moving 1024 bytes to deliver 128 useful bytes: an 8× traffic blow-up, and 8× the transactions of the coalesced case. Achieved useful bandwidth collapses to a small fraction of peak.
6.2 Concrete: coalesced vs strided¶
__global__ void coalesced(const float* x, float* y, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = x[i] * 2.0f;
}
__global__ void strided(const float* x, float* y, int n, int stride) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = i * stride; // stride 32 → each thread in a different sector
if (j < n) y[j] = x[j] * 2.0f;
}
For stride=1, achieved ~80–95% of HBM peak (verify; H100 HBM3 peak ~3 TB/s,
so realistically ~2.4–2.8 TB/s). For stride=32, achieved ~3–10% of peak,
because each warp issues 32 separate sector fetches where the coalesced
version issues 4 (one 128-byte request). Concretely, to touch a 1 GiB array:
- Coalesced runtime ≈ 2 GiB of read+write traffic / 2.5 TB/s ≈ 0.8 ms.
- Strided runtime ≈ 10–30× that, i.e. several milliseconds. (Approximate; L2 hits dampen the worst case.)
6.3 Alignment¶
Device pointers from cudaMalloc are aligned to at least 256 bytes. But if
you offset such a pointer by, say, one element, each warp-wide 128-byte load
now straddles a line boundary: 5 sectors (two 128-byte lines) instead of 4.
On cached architectures the measured loss is modest (~10–25% rather than a
collapse), but in a GEMM inner loop even that matters, which is why CUTLASS
aligns the leading dimensions of its tiles.
6.4 Vectorized loads¶
float4 (16 bytes) and int4 let a single thread issue a 16-byte load,
so a warp issues 32 × 16 = 512 bytes in 4 transactions of 128 B. This is
identical bandwidth to scalar coalesced loads but reduces instruction
count, often improving compute-bound kernels.
__global__ void copy_vec(const float4* x, float4* y, int n4) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n4) y[i] = x[i]; // 16 B per thread, 512 B per warp
}
Caveat: n must be divisible by 4 and pointers 16-byte aligned (true for
cudaMalloc).
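One way around the divisibility caveat (a sketch, one option among several): copy the bulk through float4 and let the first few threads mop up the scalar tail.

__global__ void copy_vec_tail(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;                                   // whole float4 chunks
    if (i < n4) {
        // Safe: cudaMalloc pointers are at least 256-byte aligned.
        reinterpret_cast<float4*>(y)[i] = reinterpret_cast<const float4*>(x)[i];
    }
    int tail = n4 * 4 + i;                            // 0..3 leftover scalars
    if (i < n % 4) y[tail] = x[tail];
}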
Micro-exercise 6.1¶
A 2D row-major matrix A[M][N] is processed with threadIdx.x indexing
columns. Is A[row][threadIdx.x] coalesced? Answer: yes-consecutive
threads access consecutive columns, which are consecutive in memory in
row-major. If threadIdx.x indexed rows instead (A[threadIdx.x][col])
that's a stride-N access and bad.
7. Shared Memory and Bank Conflicts¶
7.1 The bank model¶
Shared memory is divided into 32 banks. Each bank serves one 4-byte word per cycle. A warp accessing 32 distinct banks completes in one cycle. Two threads hitting the same bank with different addresses serialize — that's a 2-way bank conflict, and the warp issue takes 2 cycles. Up to 32-way is possible (worst case: every thread the same bank).
Bank index = (address / 4) mod 32. So consecutive 4-byte words land in consecutive banks, and two addresses 128 bytes (32 floats) apart land in the same bank.
7.2 The transpose problem¶
A naive 32×32 shared-memory tile transpose:
__shared__ float tile[32][32];
tile[threadIdx.y][threadIdx.x] = in[...]; // store: coalesced, no conflict
__syncthreads();
out[...] = tile[threadIdx.x][threadIdx.y]; // load: stride-32, 32-way conflict
The store has threadIdx.x varying along the inner dimension (banks 0..31,
no conflict). The load reads tile[threadIdx.x][threadIdx.y]: fixing
threadIdx.y and varying threadIdx.x reads tile[0][y], tile[1][y], ...,
which sit at addresses differing by 32 floats = 128 bytes = the same bank.
32-way conflict.
7.3 The padding fix¶
Add a phantom column:
__shared__ float tile[32][33];
Now tile[k][y] is at k*33 + y floats from base. Bank index =
(k*33 + y) mod 32 = (k + y) mod 32 since 33 mod 32 = 1. As k varies 0..31
with y fixed, the bank stride is 1, so every thread hits a different bank.
Conflict gone, at the cost of 32 floats × 4 bytes = 128 B of waste per tile.
__global__ void transpose(const float* in, float* out, int N) {
__shared__ float tile[32][33];
int x = blockIdx.x * 32 + threadIdx.x;
int y = blockIdx.y * 32 + threadIdx.y;
if (x < N && y < N) tile[threadIdx.y][threadIdx.x] = in[y*N + x];
__syncthreads();
int xt = blockIdx.y * 32 + threadIdx.x; // swap block coords
int yt = blockIdx.x * 32 + threadIdx.y;
if (xt < N && yt < N) out[yt*N + xt] = tile[threadIdx.x][threadIdx.y];
}
7.4 Dynamic shared memory¶
If the size is known only at launch, use extern __shared__:
__global__ void k(int n) {
extern __shared__ float buf[];
// ... use buf[0..n-1]
}
k<<<grid, block, n*sizeof(float)>>>(n);
The third chevron argument supplies the byte count.
7.5 Capacity¶
H100 has a 256 KB unified L1/shared array per SM, of which up to ~228 KB can
be used as shared memory once you opt in via
cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, …).
A100's array is 192 KB, with up to ~164 KB usable as shared memory. The
default per-block limit is 48 KB unless opted up.
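For example, a sketch of opting a kernel into more than the 48 KB default (big_smem_kernel, d_x, n, grid, and block are placeholders; the attribute value and the launch's third chevron argument must agree):

size_t smemBytes = 100 * 1024;   // 100 KB of dynamic shared memory (Ampere/Hopper)
cudaFuncSetAttribute(big_smem_kernel,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     (int)smemBytes);
big_smem_kernel<<<grid, block, smemBytes>>>(d_x, n);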
Micro-exercise 7.1¶
A warp does s[threadIdx.x * 2] on a __shared__ float s[64]. Conflict?
Answer: threads 0..15 hit banks 0,2,4,...,30; threads 16..31 (indices
32..62) wrap around mod 32 and hit the same even banks again. Each even bank
is therefore hit by exactly two threads, so every access is a 2-way conflict
(and the odd banks sit idle).
8. Building Blocks¶
8.1 Vector add¶
__global__ void vadd(const float* a, const float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) c[i] = a[i] + b[i];
}
This is purely bandwidth-bound: 3 × 4 = 12 bytes touched per output, ~1 FLOP per output → arithmetic intensity 1/12 FLOP/byte. On H100 (~3 TB/s, ~67 TFLOP/s FP32), the roofline says ~250 GFLOP/s-this kernel achieves near-peak HBM bandwidth, not near-peak FLOPs.
8.2 Reduction – naive¶
Sum N floats. Naive in-block reduction:
__global__ void reduce_naive(const float* x, float* out, int n) {
extern __shared__ float sdata[];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + tid;
sdata[tid] = (i < n) ? x[i] : 0.0f;
__syncthreads();
for (int s = 1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s];
__syncthreads();
}
if (tid == 0) atomicAdd(out, sdata[0]);
}
Problems: divergence inside the warp (tid % (2*s)) wastes lanes; bank
conflicts on sdata accesses.
8.3 Reduction – sequential addressing¶
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) sdata[tid] += sdata[tid + s];
__syncthreads();
}
Now active threads are 0..s-1, contiguous, no divergence within active warps. Bank conflicts gone (stride-1).
8.4 Reduction – warp shuffle¶
The last 6 levels of reduction (s = 32 down to 1) execute within a single
warp. Warp shuffle (__shfl_down_sync) lets a thread read another lane's
register without going through shared memory:
__device__ float warp_reduce_sum(float v) {
for (int off = 16; off > 0; off >>= 1)
v += __shfl_down_sync(0xffffffff, v, off);
return v; // lane 0 holds the full warp sum
}
__global__ void reduce_shfl(const float* x, float* out, int n) {
__shared__ float sm[32]; // one slot per warp in block
int tid = threadIdx.x;
int lane = tid & 31;
int wid = tid >> 5;
int i = blockIdx.x * blockDim.x + tid;
float v = (i < n) ? x[i] : 0.0f;
v = warp_reduce_sum(v);
if (lane == 0) sm[wid] = v;
__syncthreads();
v = (tid < blockDim.x / 32) ? sm[lane] : 0.0f;
if (wid == 0) v = warp_reduce_sum(v);
if (tid == 0) atomicAdd(out, v);
}
Performance evolution (approximate, 256-thread block, 1 GiB input, H100):
| Variant | Time | Notes |
|---|---|---|
| Naive divergence | ~3.0 ms | warp divergence + bank conflicts |
| Sequential addr. | ~1.2 ms | clean shared-mem reduction |
| Warp shuffle | ~0.55 ms | near HBM-bound |
| + grid-stride loop | ~0.40 ms | 1 launch, fewer atomics (sketch below) |
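A sketch of that last variant: combining the grid-stride accumulation with the warp_reduce_sum helper from §8.4, so each block performs a single atomicAdd regardless of how much input it consumed.

__global__ void reduce_shfl_gs(const float* x, float* out, int n) {
    __shared__ float sm[32];                       // one slot per warp
    int tid = threadIdx.x, lane = tid & 31, wid = tid >> 5;
    float v = 0.0f;
    // Each thread privately accumulates many elements before any cross-thread work.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        v += x[i];
    v = warp_reduce_sum(v);
    if (lane == 0) sm[wid] = v;
    __syncthreads();
    v = (tid < blockDim.x / 32) ? sm[lane] : 0.0f;
    if (wid == 0) v = warp_reduce_sum(v);
    if (tid == 0) atomicAdd(out, v);               // one atomic per block
}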
8.5 Cooperative groups (optional cleaner API)¶
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>   // cg::reduce
namespace cg = cooperative_groups;
__global__ void reduce_cg(const float* x, float* out, int n) {
auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);
int i = block.group_index().x * block.size() + block.thread_rank();
float v = (i < n) ? x[i] : 0.0f;
v = cg::reduce(warp, v, cg::plus<float>());
if (warp.thread_rank() == 0) atomicAdd(out, v);
}
8.6 Inclusive prefix sum (Hillis–Steele within a block)¶
__global__ void prefix_sum(float* x, int n) {
extern __shared__ float s[];
int tid = threadIdx.x;
s[tid] = (tid < n) ? x[tid] : 0.0f;
__syncthreads();
for (int off = 1; off < blockDim.x; off <<= 1) {
float v = (tid >= off) ? s[tid - off] : 0.0f;
__syncthreads();
s[tid] += v;
__syncthreads();
}
if (tid < n) x[tid] = s[tid];
}
For full-array scan you do block scans, scan the per-block totals, then add
back. Or use thrust::inclusive_scan.
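For reference, the Thrust one-liner mentioned above looks like this (a sketch: d_x is a raw device pointer to n floats; a thrust::device_vector works the same way):

#include <thrust/scan.h>
#include <thrust/execution_policy.h>

// In-place inclusive scan of n floats already resident in device memory.
thrust::inclusive_scan(thrust::device, d_x, d_x + n, d_x);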
8.7 Histogram with privatization¶
The naive histogram has every thread atomicAdd(&hist[bin], 1) on global —
a contention disaster. Privatize per-block in shared memory, merge at end:
__global__ void hist_priv(const int* data, int* hist, int n, int B) {
extern __shared__ int sh[];
for (int i = threadIdx.x; i < B; i += blockDim.x) sh[i] = 0;
__syncthreads();
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
for (int i = tid; i < n; i += stride) atomicAdd(&sh[data[i]], 1);
__syncthreads();
for (int i = threadIdx.x; i < B; i += blockDim.x)
atomicAdd(&hist[i], sh[i]);
}
Speedup over naive global atomics: typically 10–50×, larger as bin counts shrink.
9. Tiled GEMM Walkthrough¶
We compute C = A · B with A: MxK, B: KxN, C: MxN, all row-major.
We'll evolve the kernel through six stages, each with its expected speedup
over the naive baseline on a square problem like 4096×4096.
9.1 Stage 0 – Naive¶
Each thread computes one output element:
__global__ void gemm_naive(const float* A, const float* B, float* C,
int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < N) {
float acc = 0.f;
for (int k = 0; k < K; ++k) acc += A[row*K + k] * B[k*N + col];
C[row*N + col] = acc;
}
}
Bandwidth analysis: each thread reads 2K floats and writes 1, performs
2K FLOPs → arithmetic intensity ≈ 0.25 FLOP/byte. Bandwidth
bound at ~750 GFLOP/s on H100 (HBM ~3 TB/s × 0.25). The warp-level access
pattern is actually fine (A[row*K + k] is a warp-wide broadcast and
B[k*N + col] is contiguous across the warp, see §9.2); the real problem is
that no loaded byte is reused outside the caches.
Baseline: ~1× (by definition). Achieves on the order of ~1–3% of cuBLAS.
9.2 Stage 1 – Coalesced¶
Check the thread mapping first: with threadIdx.x → col, the naive kernel
above is already coalesced. Writes to C[row*N + col] are contiguous across
the warp, B[k*N + col] is contiguous for fixed k, and A[row*K + k] is a
warp-wide broadcast. (If your mapping has threadIdx.x → row instead, swapping
it is a large win.) The real fix is reusing each loaded byte across
threads, which only shared memory delivers.
Speedup over naive: small (~1.2–1.5×) just from cache effects.
9.3 Stage 2 – Shared-memory tiled¶
Each block computes a BM × BN tile of C. It cooperatively loads tiles of
A and B into shared memory, then iterates. With BM = BN = BK = 32:
template<int BM, int BN, int BK>
__global__ void gemm_smem(const float* A, const float* B, float* C,
int M, int N, int K) {
__shared__ float sA[BM][BK];
__shared__ float sB[BK][BN];
int row = blockIdx.y * BM + threadIdx.y;
int col = blockIdx.x * BN + threadIdx.x;
float acc = 0.f;
for (int kt = 0; kt < K; kt += BK) {
sA[threadIdx.y][threadIdx.x] =
(row < M && kt + threadIdx.x < K) ? A[row*K + kt + threadIdx.x] : 0.f;
sB[threadIdx.y][threadIdx.x] =
(kt + threadIdx.y < K && col < N) ? B[(kt + threadIdx.y)*N + col] : 0.f;
__syncthreads();
#pragma unroll
for (int k = 0; k < BK; ++k) acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
__syncthreads();
}
if (row < M && col < N) C[row*N + col] = acc;
}
Each tile of A and B is loaded once from HBM and reused 32× from shared
memory. Arithmetic intensity rises to ~8 FLOP/byte (each 8 KB pair of 32×32
tiles feeds 64K FLOPs). On H100 FP32 (~67 TFLOP/s peak), this can reach
~30–40% of cuBLAS FP32. Speedup over naive: ~10–15×.
Note: in the inner-product loop, sA[ty][k] is the same word for every lane
of a warp (a conflict-free broadcast) and sB[k][tx] is stride-1, so this
32×32 layout has no bank conflicts. Other tile shapes (e.g., a transposed
sA) do, and padding such as sA[BM][BK+1] is the standard fix.
9.4 Stage 3 – Register tiling (1D and 2D)¶
Have each thread compute multiple outputs, holding partial sums in
registers. With TM = TN = 8, each thread now produces an 8×8 micro-tile,
so a BM=128, BN=128 block uses 128*128/(8*8) = 256 threads, a convenient size.
Sketch (key inner loop):
constexpr int BM=128, BN=128, BK=8, TM=8, TN=8;
__shared__ float sA[BM][BK];
__shared__ float sB[BK][BN];
float regA[TM], regB[TN], acc[TM][TN] = {0};
// ... load sA, sB cooperatively ...
__syncthreads();
#pragma unroll
for (int k = 0; k < BK; ++k) {
#pragma unroll
for (int i = 0; i < TM; ++i) regA[i] = sA[ty*TM + i][k];
#pragma unroll
for (int j = 0; j < TN; ++j) regB[j] = sB[k][tx*TN + j];
#pragma unroll
for (int i = 0; i < TM; ++i)
#pragma unroll
for (int j = 0; j < TN; ++j) acc[i][j] += regA[i] * regB[j];
}
Each register-resident regA[i] * regB[j] is one FMA. With TM=TN=8 we
generate 64 FMAs per k step from 8+8=16 register loads → 4 FMAs/load,
matching SM dispatch ratios. Arithmetic intensity now in the hundreds of
FLOP/byte. Expected: ~50–70% of cuBLAS FP32. Speedup over naive: ~30–40×.
9.5 Stage 4 – Tensor cores (wmma, BF16 in / FP32 out)¶
On Volta+ each SM has tensor cores that compute matrix-multiply-accumulate
on small fragments per warp. The fundamental fragment shape supported across
generations for BF16 is m=16, n=16, k=16 with FP32 accumulator (we'll call
this the "16x16x16" shape). Each warp does one tensor-core op per
mma_sync.
#include <mma.h>
using namespace nvcuda::wmma;
__global__ void gemm_wmma(const __nv_bfloat16* A, const __nv_bfloat16* B,
float* C, int M, int N, int K) {
int warpM = (blockIdx.y * blockDim.y + threadIdx.y);
int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
fragment<matrix_a, 16,16,16, __nv_bfloat16, row_major> a_frag;
fragment<matrix_b, 16,16,16, __nv_bfloat16, row_major> b_frag;
fragment<accumulator, 16,16,16, float> c_frag;
fill_fragment(c_frag, 0.0f);
for (int k = 0; k < K; k += 16) {
load_matrix_sync(a_frag, A + warpM*16*K + k, K);
load_matrix_sync(b_frag, B + k*N + warpN*16, N);
mma_sync(c_frag, a_frag, b_frag, c_frag);
}
store_matrix_sync(C + warpM*16*N + warpN*16, c_frag, N, mem_row_major);
}
This is the simplest possible tensor-core GEMM-one warp produces one
16×16 output tile and walks the K-dimension with no shared-memory tiling.
Useful as a teaching example, but it misses the data-reuse win. To get
serious throughput, combine wmma with shared-memory tiling (stage 5+):
load tiles of A and B into shared memory, iterate wmma fragments out
of those tiles, accumulate per-warp.
Expected speedup once integrated with shared-memory tiling and BF16 inputs: ~5–10× over the FP32 register-tiled stage 3, simply because BF16 tensor-core peak on H100 SXM (~990 TFLOP/s dense, roughly double with structured sparsity) dwarfs FP32 (~67 TFLOP/s).
9.6 Stage 5 – Double-buffered with cp.async (Ampere+)¶
cp.async (PTX: cp.async.cg.shared.global) initiates a global → shared
copy that doesn't block the issuing thread; it commits asynchronously. You
overlap the next tile's load with the current tile's compute.
#include <cuda_pipeline.h>   // __pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior
extern __shared__ char smem_raw[];
auto sA = reinterpret_cast<__nv_bfloat16(*)[BK]>(smem_raw);
auto sB = reinterpret_cast<__nv_bfloat16(*)[BN]>(smem_raw + sizeof(*sA)*BM*2);
__pipeline_memcpy_async(&sA[buf][...], &A[...], bytes);
__pipeline_memcpy_async(&sB[buf][...], &B[...], bytes);
__pipeline_commit();
__pipeline_wait_prior(0); // wait for the buffer we'll consume next
__syncthreads();
// ... wmma on sA[buf], sB[buf], while next iter prefetches into sA[1-buf] ...
By keeping two shared-memory buffers and pre-issuing the next load while
the math fragment runs, the kernel overlaps HBM latency with tensor-core
compute. Final speedup on H100 with proper pipelining and Hopper-specific
TMA/wgmma: cuBLAS-class performance (~70–95% of cuBLAS for square
problems > 1024).
9.7 Cumulative speedup table (4096³ BF16, H100, approximate)¶
| Stage | Approx. % of cuBLAS | Notes |
|---|---|---|
| 0. Naive FP32 | ~1–3% | bandwidth-bound, strided B |
| 1. Coalesced FP32 | ~3–5% | minor |
| 2. Shared-mem tiled FP32 | ~25–40% | classic GEMM tiling |
| 3. Register tiled FP32 | ~50–70% | OK for FP32 |
| 4. wmma BF16 (no tiling) | N/A | demo only |
| 4'. wmma BF16 + tiling | ~40–60% | of BF16 cuBLAS |
| 5. + cp.async double-buffering | ~70–90% | of BF16 cuBLAS |
| 6. Hopper wgmma + TMA | ~90–98% | what CUTLASS does |
10. Tensor Cores via nvcuda::wmma¶
10.1 Fragment types¶
A fragment is declared as fragment<Use, M, N, K, T, Layout>:
- Use: matrix_a, matrix_b, or accumulator.
- M, N, K: shape; the canonical, supported-everywhere combination is 16,16,16 for FP16/BF16 → FP32. Other shapes (8,32,16; 32,8,16) exist on some archs.
- T: element type. __half, __nv_bfloat16 for inputs; float, __half, int32_t for accumulators.
- Layout: row_major or col_major (only for A and B fragments).
A fragment is opaque across threads of a warp-internally it's a strided slice of registers. The element-by-element layout is unspecified; you only interact with it through the API.
10.2 The four operations¶
fill_fragment(frag, value);
load_matrix_sync(frag, ptr, leadingDim);
mma_sync(d, a, b, c); // d = a*b + c
store_matrix_sync(ptr, frag, leadingDim, layout);
load_matrix_sync requires the pointer to be 256-bit (32-byte) aligned and the
leading dimension to be a multiple of 8 elements for FP16/BF16. The "sync"
suffix marks a warp-wide collective; all 32 threads must reach it with the
same arguments.
10.3 Supported precisions (canonical)¶
| Inputs A,B | Accum | Notes |
|---|---|---|
| __half | __half / float | Volta+ |
| __nv_bfloat16 | float | Ampere+ |
| TF32 (FP32 fed in) | float | Ampere+, via wmma::precision::tf32 |
| int8 | int32 | Turing+ |
| FP8 (E4M3, E5M2) | float | Hopper+ (use wgmma/CUTLASS in practice) |
10.4 Layout interplay¶
For canonical 16x16x16 BF16: a warp's 32 threads cooperatively hold a
16×16 fragment (256 elements / 32 lanes = 8 elements/lane). The layout is
opaque, but load_matrix_sync knows how to materialize it from row- or
col-major memory with the leading-dim arg. Always pass the true leading
dim of the parent matrix-passing the tile width is a common bug.
Micro-exercise 10.1¶
You have an FP32 matrix and want to use BF16 tensor cores. What do you do?
Answer: either pre-convert with a kernel that emits BF16 (__float2bfloat16),
or use TF32 fragments (wmma::precision::tf32) which take FP32 input and
truncate the mantissa internally.
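A minimal sketch of the TF32 route on Ampere+ (one warp computing one 16×16 output tile; the 16×16×8 fragment shape and the element-wise __float_to_tf32 rounding follow the wmma TF32 convention; A, B, C and their leading dimensions are supplied by the caller):

#include <mma.h>
using namespace nvcuda;

__global__ void tf32_tile(const float* A, const float* B, float* C,
                          int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c;
    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, lda);   // loads FP32 values into the fragment
    wmma::load_matrix_sync(b, B, ldb);
    // Round each element to TF32 (10 mantissa bits) before the MMA.
    for (int i = 0; i < a.num_elements; ++i) a.x[i] = wmma::__float_to_tf32(a.x[i]);
    for (int i = 0; i < b.num_elements; ++i) b.x[i] = wmma::__float_to_tf32(b.x[i]);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, ldc, wmma::mem_row_major);
}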
11. mma.sync PTX Inline Assembly¶
wmma is a high-level wrapper. CUTLASS and FlashAttention drop down to
mma.sync PTX for tile sizes the wrapper doesn't expose, layout flexibility,
or to interleave multiple MMAs to hide latency.
A canonical Ampere BF16 16×8×16 MMA (note: 16×8×16, not 16×16×16; the PTX shape is finer):
__device__ inline void mma_m16n8k16_bf16(
float d[4],
const uint32_t a[4], // packed BF16, 2 elements per uint32 (8 total)
const uint32_t b[2],
const float c[4])
{
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
"{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
: "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
: "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
"r"(b[0]), "r"(b[1]),
"f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
Reading the mnemonic:
- mma.sync: synchronous warp-collective MMA.
- aligned: all threads must execute together with consistent operands.
- m16n8k16: shape; 16×8 output tile, K-dim 16.
- row.col: A is row-major, B is col-major (in fragment layout terms).
- f32.bf16.bf16.f32: D, A, B, C types.
Each lane carries part of the fragment; the layout is documented in PTX
ISA. CUTLASS abstracts this with cutlass::arch::Mma templates so you
don't write the asm by hand for production code, but it's important to
recognize it when reading library code.
Hopper introduces wgmma.mma_async (warp-group MMA), which operates over
128 threads (4 warps) and runs asynchronously, freeing the warp group to
issue more work while the tensor core churns. CUTLASS 3.x and cuBLAS use
this internally for peak Hopper performance.
12. Cooperative Groups & Thread-Block Clusters¶
12.1 Cooperative groups¶
The cooperative_groups namespace provides typed handles for thread
collectives:
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>   // cg::reduce
namespace cg = cooperative_groups;
__global__ void k() {
auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);
auto half = cg::tiled_partition<16>(warp);
// collective ops:
int lane_value = (int)warp.thread_rank();   // example per-lane value
int v = cg::reduce(warp, lane_value, cg::plus<int>());
block.sync(); // == __syncthreads()
}
Useful when you want to write code that doesn't hard-code warp size 32 (in case of future architectures or simulating smaller tiles).
12.2 Grid-wide synchronization¶
If you launch with cudaLaunchCooperativeKernel, you can call
cg::this_grid().sync() for a barrier across all blocks of the grid. The
constraint: the entire grid must fit on the GPU concurrently. Useful for
multi-pass algorithms that previously required separate kernel launches.
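A sketch of what that looks like end to end (two_pass, d_data, n, grid, and block are placeholders; the device must report cudaDevAttrCooperativeLaunch, the whole grid must be co-resident, and grid-wide sync typically requires compiling with -rdc=true):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void two_pass(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    // ... pass 1 over data ...
    grid.sync();                     // every block reaches here before pass 2 starts
    // ... pass 2 that consumes pass-1 results ...
}

// Host side: arguments go in as an array of pointers.
void* args[] = { &d_data, &n };
cudaLaunchCooperativeKernel((void*)two_pass, dim3(grid), dim3(block),
                            args, /*sharedMem*/ 0, /*stream*/ 0);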
12.3 Thread-block clusters (Hopper)¶
Hopper introduces a new level above the block: the cluster, a group of nearby blocks that share Distributed Shared Memory (DSMEM) and can synchronize collectively.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void __cluster_dims__(2, 2, 1) cluster_k() {
auto cluster = cg::this_cluster();
cluster.sync(); // sync all blocks in the cluster
int rank = cluster.block_rank();
// distributed shared memory:
__shared__ float buf[1024];
auto neighbor = cluster.map_shared_rank(buf, /*rank*/0);
// now `neighbor` points into block-0's shared memory
}
This unlocks new patterns like cross-block tile sharing in GEMM without going through HBM, which is part of how Hopper hits its tensor-core peak.
13. Profiling Discipline¶
13.1 Timing with events (the right way)¶
cudaEvent_t s, e;
cudaEventCreate(&s); cudaEventCreate(&e);
// warmup
for (int i = 0; i < 3; ++i) kernel<<<g,b>>>(...);
cudaDeviceSynchronize();
cudaEventRecord(s);
for (int i = 0; i < N_ITERS; ++i) kernel<<<g,b>>>(...);
cudaEventRecord(e);
cudaEventSynchronize(e);
float ms; cudaEventElapsedTime(&ms, s, e);
double avg_ms = ms / N_ITERS;
Always:
- Warm up (the first launch is JIT-y and miss-y).
- Loop and average.
- Sync before reading.
- Record the events in the same stream the kernel runs in.
13.2 Common pitfalls¶
- Forgetting to sync before stopping the host clock. Kernel launches return immediately; CPU chrono::now() measures launch overhead, not kernel time. Use events.
- Default-stream interactions make cross-stream timing measure serialized work even when you wrote concurrent code. Check with Nsight Systems' timeline view.
- JIT compile on first launch can add seconds. Pre-build for your arch (-arch=sm_90) and ignore the first iteration.
- Power state: GPUs throttle. Pin clocks (nvidia-smi -pm 1 -lgc …) for reproducible benchmarks.
13.3 nsys (Nsight Systems) – system-wide timeline¶
Captures a Gantt chart of CUDA API calls, kernels, and memcopies across all streams. Best first stop:
nsys profile -o timeline ./app   # open the report in the Nsight Systems GUI
Use it to answer: are H2D, kernel, D2H actually overlapped? Is the default stream serializing things?
13.4 ncu (Nsight Compute) – per-kernel deep dive¶
Tells you achieved occupancy, achieved DRAM throughput, L1/L2 hit rates, shared-memory bank conflicts, warp stall reasons. The "Speed of Light" section tells you, for each kernel, how close to peak compute and peak memory you are-if both are far from 100%, you have latency-hiding problems (too few resident warps, dependencies).
13.5 Reading roofline¶
For any kernel, compute:
- Arithmetic intensity = FLOPs / bytes touched.
- Achievable = min(peak_compute, AI × peak_BW).
If AI < peak_compute / peak_BW (the "ridge point"), the kernel is memory-bound and you should think about reuse, not FLOP throughput. For H100 BF16 tensor cores: ridge ≈ 990 TFLOP/s ÷ 3 TB/s ≈ 330 FLOP/byte. Almost nothing reaches that without aggressive shared-memory tiling.
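As a worked example of that arithmetic (host-side C++, using the ~H100 ballpark figures quoted in this chapter), here is the roofline check for the vector add of §8.1:

#include <algorithm>
#include <cstdio>

int main() {
    double peak_bw    = 3.0e12;      // ~HBM3 bytes/s; substitute your GPU's numbers
    double peak_flops = 67e12;       // ~FP32 FLOP/s (non-tensor-core)
    double ai         = 1.0 / 12.0;  // vadd: 1 FLOP per 12 bytes touched
    double achievable = std::min(peak_flops, ai * peak_bw);
    printf("vadd achievable: %.0f GFLOP/s (memory-bound)\n", achievable / 1e9);
    printf("FP32 ridge point: %.1f FLOP/byte\n", peak_flops / peak_bw);
    return 0;
}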
14. A Complete BF16 GEMM at 2048×2048¶
Below is a single self-contained file targeting Ampere+ (sm_80). It
includes shared-memory tiling, wmma tensor cores, and a simple
double-buffered cp.async pipeline. On A100 we observe ~60–70% of cuBLAS
BF16 for M=N=K=2048. (Actual depends heavily on tile choice and
hardware-verify with ncu.)
gemm_bf16.cu:
// Build:
// nvcc -O3 -arch=sm_80 -lineinfo -lcublas gemm_bf16.cu -o gemm_bf16
// Run:
// ./gemm_bf16
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <mma.h>
#include <cublas_v2.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cmath>     // std::abs(double) in the correctness check
#include <vector>
#define CUDA_CHECK(call) do { \
cudaError_t e_ = (call); \
if (e_ != cudaSuccess) { \
fprintf(stderr, "CUDA %s:%d: %s\n", __FILE__, __LINE__, \
cudaGetErrorString(e_)); \
std::exit(1); \
} \
} while (0)
using nvcuda::wmma::fragment;
using nvcuda::wmma::matrix_a;
using nvcuda::wmma::matrix_b;
using nvcuda::wmma::accumulator;
using nvcuda::wmma::row_major;
using nvcuda::wmma::col_major;
using nvcuda::wmma::load_matrix_sync;
using nvcuda::wmma::mma_sync;
using nvcuda::wmma::store_matrix_sync;
using nvcuda::wmma::fill_fragment;
using nvcuda::wmma::mem_row_major;
// Tile config: BM × BN output tile per block; BK along K.
// One warp computes a WM × WN sub-tile.
constexpr int BM = 128, BN = 128, BK = 32;
constexpr int WM = 64, WN = 64;
constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;
constexpr int WARPS_PER_BLOCK = (BM/WM) * (BN/WN); // 2*2 = 4
constexpr int THREADS_PER_BLOCK = WARPS_PER_BLOCK * 32; // 128
// cp.async 16-byte copy; source and destination must be 16-byte aligned.
__device__ __forceinline__ void cp_async_16(void* smem_ptr, const void* gmem_ptr) {
#if __CUDA_ARCH__ >= 800
unsigned int smem_int = __cvta_generic_to_shared(smem_ptr);
asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
:: "r"(smem_int), "l"(gmem_ptr));
#else
*reinterpret_cast<int4*>(smem_ptr) = *reinterpret_cast<const int4*>(gmem_ptr);
#endif
}
__device__ __forceinline__ void cp_async_commit() {
#if __CUDA_ARCH__ >= 800
asm volatile("cp.async.commit_group;\n" ::);
#endif
}
__device__ __forceinline__ void cp_async_wait_all() {
#if __CUDA_ARCH__ >= 800
asm volatile("cp.async.wait_all;\n" ::);
#endif
}
__global__ void gemm_bf16_kernel(const __nv_bfloat16* __restrict__ A,
const __nv_bfloat16* __restrict__ B,
float* __restrict__ C,
int M, int N, int K)
{
// Two shared-mem buffers per matrix for double-buffering.
__shared__ __nv_bfloat16 sA[2][BM][BK];
__shared__ __nv_bfloat16 sB[2][BK][BN];
int tid = threadIdx.x;
int warp_id = tid / 32;
int lane_id = tid % 32;
int warp_row = warp_id / (BN / WN); // 0 or 1
int warp_col = warp_id % (BN / WN);
int block_row = blockIdx.y * BM;
int block_col = blockIdx.x * BN;
// Each thread brings 16B = 8 BF16 from A and 8 from B per K-tile.
// 128 threads × 8 = 1024 elements per tile; BM*BK = 128*32 = 4096 → 4 iters,
// BK*BN = 32*128 = 4096 → 4 iters. We unroll inline below.
auto load_tile = [&](int buf, int kt) {
// Load sA[buf][BM][BK] from A[block_row..+BM][kt..+BK]
constexpr int A_TILE = BM * BK; // 4096
constexpr int A_PER_THREAD = A_TILE / THREADS_PER_BLOCK / 8; // 4
#pragma unroll
for (int i = 0; i < A_PER_THREAD; ++i) {
int idx = (i * THREADS_PER_BLOCK + tid) * 8;
int r = idx / BK;
int c = idx % BK;
const void* gptr = &A[(block_row + r) * K + (kt + c)];
void* sptr = &sA[buf][r][c];
cp_async_16(sptr, gptr);
}
constexpr int B_TILE = BK * BN; // 4096
constexpr int B_PER_THREAD = B_TILE / THREADS_PER_BLOCK / 8; // 4
#pragma unroll
for (int i = 0; i < B_PER_THREAD; ++i) {
int idx = (i * THREADS_PER_BLOCK + tid) * 8;
int r = idx / BN;
int c = idx % BN;
const void* gptr = &B[(kt + r) * N + (block_col + c)];
void* sptr = &sB[buf][r][c];
cp_async_16(sptr, gptr);
}
cp_async_commit();
};
// Per-warp output fragments.
constexpr int FRAG_M = WM / WMMA_M; // 4
constexpr int FRAG_N = WN / WMMA_N; // 4
fragment<accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag[FRAG_M][FRAG_N];
#pragma unroll
for (int i = 0; i < FRAG_M; ++i)
#pragma unroll
for (int j = 0; j < FRAG_N; ++j) fill_fragment(c_frag[i][j], 0.0f);
// Prefetch first tile.
int kt = 0;
int buf = 0;
load_tile(buf, kt);
cp_async_wait_all();
__syncthreads();
for (kt = BK; kt < K; kt += BK) {
// Issue next tile prefetch into the other buffer.
int next_buf = buf ^ 1;
load_tile(next_buf, kt);
// Compute on current buffer.
#pragma unroll
for (int kk = 0; kk < BK; kk += WMMA_K) {
fragment<matrix_a, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> a_frag[FRAG_M];
fragment<matrix_b, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> b_frag[FRAG_N];
#pragma unroll
for (int i = 0; i < FRAG_M; ++i) {
int row = warp_row * WM + i * WMMA_M;
load_matrix_sync(a_frag[i], &sA[buf][row][kk], BK);
}
#pragma unroll
for (int j = 0; j < FRAG_N; ++j) {
int col = warp_col * WN + j * WMMA_N;
load_matrix_sync(b_frag[j], &sB[buf][kk][col], BN);
}
#pragma unroll
for (int i = 0; i < FRAG_M; ++i)
#pragma unroll
for (int j = 0; j < FRAG_N; ++j)
mma_sync(c_frag[i][j], a_frag[i], b_frag[j], c_frag[i][j]);
}
cp_async_wait_all();
__syncthreads();
buf = next_buf;
}
// Tail compute on last loaded buffer.
#pragma unroll
for (int kk = 0; kk < BK; kk += WMMA_K) {
fragment<matrix_a, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> a_frag[FRAG_M];
fragment<matrix_b, WMMA_M, WMMA_N, WMMA_K, __nv_bfloat16, row_major> b_frag[FRAG_N];
#pragma unroll
for (int i = 0; i < FRAG_M; ++i)
load_matrix_sync(a_frag[i], &sA[buf][warp_row*WM + i*WMMA_M][kk], BK);
#pragma unroll
for (int j = 0; j < FRAG_N; ++j)
load_matrix_sync(b_frag[j], &sB[buf][kk][warp_col*WN + j*WMMA_N], BN);
#pragma unroll
for (int i = 0; i < FRAG_M; ++i)
#pragma unroll
for (int j = 0; j < FRAG_N; ++j)
mma_sync(c_frag[i][j], a_frag[i], b_frag[j], c_frag[i][j]);
}
// Store back.
#pragma unroll
for (int i = 0; i < FRAG_M; ++i) {
#pragma unroll
for (int j = 0; j < FRAG_N; ++j) {
int row = block_row + warp_row * WM + i * WMMA_M;
int col = block_col + warp_col * WN + j * WMMA_N;
store_matrix_sync(&C[row*N + col], c_frag[i][j], N, mem_row_major);
}
}
(void)lane_id;
}
// Reference using cuBLAS for correctness + perf comparison.
static void cublas_gemm(cublasHandle_t h,
const __nv_bfloat16* dA, const __nv_bfloat16* dB,
float* dC, int M, int N, int K) {
float alpha = 1.f, beta = 0.f;
// cuBLAS is column-major. To compute C = A*B with row-major inputs,
// call cublasGemmEx with op_N, op_N and swap the operand order:
// C^T = B^T * A^T, treated as col-major.
cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N,
N, M, K,
&alpha,
dB, CUDA_R_16BF, N,
dA, CUDA_R_16BF, K,
&beta,
dC, CUDA_R_32F, N,
CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
int main() {
const int M = 2048, N = 2048, K = 2048;
size_t aBytes = (size_t)M*K*sizeof(__nv_bfloat16);
size_t bBytes = (size_t)K*N*sizeof(__nv_bfloat16);
size_t cBytes = (size_t)M*N*sizeof(float);
std::vector<__nv_bfloat16> hA(M*K), hB(K*N);
std::vector<float> hC(M*N), hRef(M*N);
for (int i = 0; i < M*K; ++i) hA[i] = __float2bfloat16((float)((i*7) % 13) * 0.01f);
for (int i = 0; i < K*N; ++i) hB[i] = __float2bfloat16((float)((i*5) % 11) * 0.01f);
__nv_bfloat16 *dA, *dB; float *dC, *dRef;
CUDA_CHECK(cudaMalloc(&dA, aBytes));
CUDA_CHECK(cudaMalloc(&dB, bBytes));
CUDA_CHECK(cudaMalloc(&dC, cBytes));
CUDA_CHECK(cudaMalloc(&dRef, cBytes));
CUDA_CHECK(cudaMemcpy(dA, hA.data(), aBytes, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(dB, hB.data(), bBytes, cudaMemcpyHostToDevice));
dim3 grid(N/BN, M/BM);
dim3 block(THREADS_PER_BLOCK);
// Warmup
for (int i = 0; i < 3; ++i)
gemm_bf16_kernel<<<grid, block>>>(dA, dB, dC, M, N, K);
CUDA_CHECK(cudaDeviceSynchronize());
cudaEvent_t s, e;
cudaEventCreate(&s); cudaEventCreate(&e);
const int ITERS = 50;
cudaEventRecord(s);
for (int i = 0; i < ITERS; ++i)
gemm_bf16_kernel<<<grid, block>>>(dA, dB, dC, M, N, K);
cudaEventRecord(e);
cudaEventSynchronize(e);
float ms; cudaEventElapsedTime(&ms, s, e);
double avg_ms = ms / ITERS;
double tflops = 2.0 * M * N * K / (avg_ms * 1e-3) / 1e12;
printf("Custom: %.3f ms, %.2f TFLOP/s\n", avg_ms, tflops);
cublasHandle_t h; cublasCreate(&h);
for (int i = 0; i < 3; ++i) cublas_gemm(h, dA, dB, dRef, M, N, K);
CUDA_CHECK(cudaDeviceSynchronize());
cudaEventRecord(s);
for (int i = 0; i < ITERS; ++i) cublas_gemm(h, dA, dB, dRef, M, N, K);
cudaEventRecord(e);
cudaEventSynchronize(e);
cudaEventElapsedTime(&ms, s, e);
double cb_ms = ms / ITERS;
double cb_tflops = 2.0 * M * N * K / (cb_ms * 1e-3) / 1e12;
printf("cuBLAS: %.3f ms, %.2f TFLOP/s\n", cb_ms, cb_tflops);
printf("Ratio: %.1f%% of cuBLAS\n", 100.0 * cb_tflops / tflops > 100 ? 100.0 : 100.0 * tflops / cb_tflops);
// Correctness spot-check (sample 16 elements).
CUDA_CHECK(cudaMemcpy(hC.data(), dC, cBytes, cudaMemcpyDeviceToHost));
CUDA_CHECK(cudaMemcpy(hRef.data(),dRef, cBytes, cudaMemcpyDeviceToHost));
double max_rel = 0;
for (int i = 0; i < M*N; i += (M*N/16)) {
double r = std::abs(hC[i]-hRef[i]) / (std::abs(hRef[i]) + 1e-6);
if (r > max_rel) max_rel = r;
}
printf("Max sampled rel error vs cuBLAS: %.3e\n", max_rel);
cublasDestroy(h);
cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dRef);
return 0;
}
Build and run with the nvcc line from the file's header comment: nvcc -O3 -arch=sm_80 -lineinfo -lcublas gemm_bf16.cu -o gemm_bf16, then ./gemm_bf16.
Annotations on the design:
- Tile: 128×128 output × 32 K-block per iteration. Each block has 4 warps, each warp owns a 64×64 output sub-tile, materialized as a 4×4 grid of 16×16 tensor-core fragments.
- Threads per block: 128 = 4 warps × 32. Modest; it leaves room for 4–8 blocks per SM (occupancy is bounded by shared memory: each block uses 2 × (128×32 + 32×128) × 2 B = 32 KB of smem).
- cp.async double-buffering: while the tensor cores compute on sA[buf]/sB[buf], the next K-tile is fetched into the other buffer with no thread blocking. cp.async.wait_all + __syncthreads() flips the buffer once both the compute and the prefetch are done.
- No bank-conflict fix: at 32-wide BF16 rows the natural layout already hits 32 different banks per warp load through load_matrix_sync. If you change BK you may need to add padding.
- cuBLAS comparison: cuBLAS is column-major, so we swap the operand order to compute the row-major equivalent, a standard trick.
What this kernel doesn't do (that CUTLASS / cuBLAS do): swizzled shared-mem
layout to fully avoid bank conflicts at all tile shapes; multi-stage cp.async
pipeline (3+ buffers) for deeper latency hiding; wgmma on Hopper; epilogue
fusion (bias, ReLU, scale); split-K for skinny problems. Implementing each
of those is what closes the remaining 30–40% gap.
15. Practical Exercises¶
Exercise 1 – Saturating PCIe¶
Write a benchmark that measures cudaMemcpy H→D bandwidth for transfer sizes
1 KB to 1 GB (powers of 2), with and without pinned host memory, and plots
the curve. Answer sketch: allocate once outside the timing loop;
cudaMallocHost for pinned; warm up; time with events; expect pinned to
plateau near PCIe peak (~25 GB/s Gen4 / ~50 GB/s Gen5) starting ~1 MB, while
pageable plateaus lower.
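A sketch of the pinned-memory measurement loop (MAX_BYTES is a placeholder for the largest transfer size, CUDA_CHECK is the macro from §5; swap cudaMallocHost for malloc to get the pageable curve):

float *h_buf, *d_buf;
CUDA_CHECK(cudaMallocHost(&h_buf, MAX_BYTES));      // pinned host staging buffer
CUDA_CHECK(cudaMalloc(&d_buf, MAX_BYTES));
cudaEvent_t s, e;
cudaEventCreate(&s); cudaEventCreate(&e);
for (size_t bytes = 1 << 10; bytes <= MAX_BYTES; bytes <<= 1) {
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);          // warmup
    cudaEventRecord(s);
    for (int i = 0; i < 20; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e);
    cudaEventSynchronize(e);
    float ms; cudaEventElapsedTime(&ms, s, e);
    printf("%10zu B  %.2f GB/s\n", bytes, 20.0 * bytes / (ms * 1e-3) / 1e9);
}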
Exercise 2 – Fix a coalescing bug¶
The kernel below is 8× slower than expected. Why?
__global__ void scale(float* x, int N) {
int i = threadIdx.x * gridDim.x * blockDim.x / blockDim.x + blockIdx.x;
if (i < N) x[i] *= 2.f;
}
i = threadIdx.x * gridDim.x + blockIdx.x: consecutive threads stride by
gridDim.x, so loads from x hit different sectors. Fix:
i = blockIdx.x * blockDim.x + threadIdx.x.
Exercise 3 – Eliminate bank conflicts¶
Profile the transpose kernel (§7.2) without padding using ncu and verify
shared_load_transactions_per_request ≈ 32. Add the [32][33] padding
and verify it drops to 1. Answer sketch: ncu --set full -k transpose
./prog; look at the "Memory Workload Analysis → Shared Memory" pane.
Exercise 4 – Reduction with fewer atomics¶
Modify reduce_shfl (§8.4) to do a two-stage reduction: per-block partial
into an array partials[gridDim.x], then a second kernel reduces partials.
Compare against the single-kernel atomicAdd version. Answer sketch:
saves the contended atomics; for n = 2^28 the two-kernel version is often
~10–20% faster on H100 because atomics hit L2 contention.
Exercise 5 – Naive vs tiled GEMM¶
Write FP32 gemm_naive (§9.1) and gemm_smem (§9.3) for M=N=K=2048.
Time both. Compare to cuBLAS. Answer sketch: expect naive ~1–2% of
cuBLAS, tiled ~25–40%, with cuBLAS BF16 itself being ~10× cuBLAS FP32 on
tensor-core-capable hardware. Use the tiled FP32 number as the FP32
roofline reference.
Exercise 6 – wmma correctness¶
Implement a wmma-based 16×16×16 GEMM (§9.5) for M=N=K=64 and verify
against a CPU reference. Catch: BF16 has 7 stored mantissa bits, so accept
~1e-2 relative error, not 1e-6. Answer sketch: allocate BF16 inputs / FP32
output; CPU computes in float; compare element-wise; a relative-error budget
of 5e-2 per element is realistic for random inputs in [0,1].
Closing thoughts¶
You've now seen, end-to-end, how a CUDA kernel becomes a tensor-core GEMM: the launch model and execution machinery, the memory hierarchy and the access patterns it punishes or rewards, the shared-memory bank discipline, the building-block kernels every CUDA author reimplements once, and a six-stage GEMM evolution that mirrors how CUTLASS got built. From here:
- Read CUTLASS's cute layouts: you now have the model to understand tensor algebra over fragments.
- Read FlashAttention's CUDA: block-tiled GEMM + softmax fused, with exactly the cp.async pipeline pattern you implemented in §14.
- Read PyTorch's aten/src/ATen/native/cuda/*.cu: production CUDA written for correctness first, perf second.
- For Hopper-class peak, learn wgmma, TMA, and Distributed Shared Memory; CUTLASS 3.x is the canonical reference.
When in doubt: profile with ncu, look at the Speed-of-Light, and let
the achieved-vs-peak gap tell you whether your kernel is bandwidth-bound,
compute-bound, or latency-bound. Optimization without measurement is just
guessing; with measurement, every step here is mechanical.
— end of deep dive 02 —