Week 7 - Memory Optimization: Coalescing, Shared Memory, Tensor Cores
7.1 Conceptual Core
- The naive matmul of week 6 is slow because:
- Uncoalesced memory access: adjacent threads read non-adjacent addresses. Each warp issues many memory transactions instead of one.
- No data reuse: each element of A is loaded N times from HBM.
- No tensor cores: scalar FP32 ops, not 16×16 BF16 GEMM blocks.
- Three optimizations, each ~5-10×:
- Coalesce: make threads in a warp read adjacent addresses.
- Tile in shared memory: a block cooperatively loads a 32×32 tile of A and B into shared memory; each thread computes its output from the shared data. Each element of A is loaded from HBM once per block.
- Tensor cores: use the `wmma` intrinsics (in the `nvcuda::wmma` namespace) to issue 16×16 GEMM blocks. Roughly 10× over CUDA cores at BF16.
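The tiling idea can be sketched on the CPU: process C in TILE×TILE blocks so each block of A and B is touched once per tile pair, mirroring the once-per-block shared-memory load on the GPU. This is an illustrative host-side analogue (the names `TILE` and `tiled_matmul` are mine, not week-6 code), not a CUDA kernel:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 32;

// Blocked (tiled) matmul over row-major N x N matrices. The inner triple
// loop over (i, k, j) within one tile is what a single thread block does
// with its shared-memory tiles of A and B on the GPU.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t i0 = 0; i0 < N; i0 += TILE)
        for (std::size_t j0 = 0; j0 < N; j0 += TILE)
            for (std::size_t k0 = 0; k0 < N; k0 += TILE)
                for (std::size_t i = i0; i < std::min(i0 + TILE, N); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TILE, N); ++k) {
                        float a = A[i * N + k];  // reused across the whole j tile
                        for (std::size_t j = j0; j < std::min(j0 + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
    return C;
}
```

The block structure changes only the loop order, not the arithmetic: for any fixed output element, products are still accumulated in ascending-k order, so the result matches the naive version exactly.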
7.2 Mechanical Detail
- Coalescing rule: a warp's 32 accesses that fall within a single 128-byte aligned, contiguous range coalesce into one transaction. Strided or scattered access costs up to 32 transactions.
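This rule is easy to model: count the distinct 128-byte aligned segments a warp's 32 accesses touch. A small host-side sketch (the function name is mine, and this ignores hardware details like sector-level granularity):

```cpp
#include <cstdint>
#include <set>

// Number of 128-byte memory transactions a warp issues for 32 accesses
// starting at `base` with a fixed per-lane stride: one transaction per
// distinct 128-byte aligned segment touched.
int transactions_per_warp(std::uint64_t base, std::uint64_t stride_bytes) {
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < 32; ++lane)
        segments.insert((base + lane * stride_bytes) / 128);
    return static_cast<int>(segments.size());
}
```

Adjacent 4-byte floats (stride 4) give 1 transaction; reading a column of a row-major matrix with a 128-float row (stride 512 bytes) gives 32.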
- Shared memory bank conflicts: shared memory is divided into 32 banks with a 4-byte stride. If two threads in a warp access different addresses in the same bank, the accesses are serialized. Common with column-major access. Fix: pad arrays (`shared_mem[32][33]`, not `[32][32]`).
- Double buffering: while the SM computes on tile N, asynchronously load tile N+1 with `cp.async`. The compute hides the load latency. This is software pipelining.
- Tensor core usage (CUDA C++):
```cpp
using namespace nvcuda::wmma;

fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;

fill_fragment(c_frag, 0.0f);               // zero the accumulator first
load_matrix_sync(a_frag, A_smem, 16);
load_matrix_sync(b_frag, B_smem, 16);
mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a_frag * b_frag
store_matrix_sync(C_gmem, c_frag, N, mem_row_major);
```
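The bank-conflict arithmetic above can be checked with a host-side model: bank = word index mod 32, and for a column access thread t reads element [t][col] of a row of width W, i.e. word index t·W + col. A sketch (the function name is mine):

```cpp
#include <set>

// Number of distinct shared-memory banks hit when 32 threads read one
// column of a row-major 2D array with `width` 4-byte words per row.
// 1 distinct bank for 32 threads = a 32-way conflict; 32 = conflict-free.
int distinct_banks(int width, int col) {
    std::set<int> banks;
    for (int t = 0; t < 32; ++t)
        banks.insert((t * width + col) % 32);
    return static_cast<int>(banks.size());
}
```

With width 32, every thread lands in bank `col` (a 32-way conflict); padding the row to 33 words shifts each row by one bank, so all 32 accesses hit distinct banks.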
7.3 Lab - "Climb the Roofline"
Take your week 6 naive matmul and progressively optimize:
1. Coalesce loads (transpose access pattern). Re-time.
2. Tile in shared memory with 32×32 blocks. Re-time.
3. Double-buffer with cp.async. Re-time.
4. Use tensor cores with BF16. Re-time.
You should reach 30–60% of cuBLAS perf. Document each step's improvement and the residual gap. Read NVIDIA's CUTLASS examples for the production-grade version.
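To see why steps 1–2 climb the roofline, it helps to compute arithmetic intensity before and after tiling. A back-of-envelope sketch under simplifying assumptions (fp32, ignoring the N² writes of C; the names are mine):

```cpp
// Roofline inputs for an N x N fp32 matmul: FLOPs are fixed at 2*N^3,
// only the HBM traffic changes between kernel versions.
struct Roofline {
    double flops;
    double bytes;
    double intensity() const { return flops / bytes; }  // FLOP per byte
};

// Naive kernel: every output element re-reads a row of A and a column of B,
// so 2*N^3 elements (4 bytes each) come from HBM.
Roofline naive_traffic(double N) {
    return { 2.0 * N * N * N, 4.0 * 2.0 * N * N * N };
}

// T x T shared-memory tiles: each element of A and B is loaded once per tile
// it participates in, cutting HBM traffic by a factor of T.
Roofline tiled_traffic(double N, double T) {
    return { 2.0 * N * N * N, 4.0 * 2.0 * N * N * N / T };
}
```

The naive kernel sits at 0.25 FLOP/byte regardless of N; 32×32 tiling lifts it to 8 FLOP/byte, a 32× jump in intensity, which is why it moves you from memory-bound toward compute-bound.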
7.4 Idiomatic & Diagnostic Drill
- For each version, capture an `ncu` report. The metrics that matter:
  - `sm__cycles_active.avg.pct_of_peak_sustained_elapsed` (overall SM utilization)
  - `dram__bytes_read.sum` (HBM traffic)
  - `l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum` (global-load sectors; fewer sectors per request means better coalescing)
7.5 Production Slice
- Production CUDA kernels go through cuBLAS, cuDNN, or CUTLASS, not from-scratch CUDA, in 99% of cases. Read CUTLASS's `examples/` directory; understand its template-based GEMM and how it is tuned per architecture.