Week 7 - Memory Optimization: Coalescing, Shared Memory, Tensor Cores
7.1 Conceptual Core
- The naive matmul of week 6 is slow because:
- Uncoalesced memory access: adjacent threads read non-adjacent addresses. Each warp issues many memory transactions instead of one.
- No data reuse: each element of A is loaded N times from HBM.
- No tensor cores: scalar FP32 ops, not 16×16 BF16 GEMM blocks.
- Three optimizations, each ~5-10×:
- Coalesce: make threads in a warp read adjacent addresses.
- Tile in shared memory: a block cooperatively loads a 32×32 tile of A and B into shared memory; each thread computes its output from the shared data. Each element of A is loaded from HBM once per block.
- Tensor cores: use the `wmma` intrinsics (in the `nvcuda::wmma` namespace) to issue 16×16 GEMM blocks. Roughly 10× over CUDA cores at BF16.
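The tiling idea can be sketched on the CPU: process C in TILE×TILE blocks so each block of A and B is touched once per tile pair, mirroring the once-per-block shared-memory load on the GPU. This is an illustrative host-side analogue (the names `TILE` and `tiled_matmul` are mine, not week-6 code), not a CUDA kernel:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 32;

// Blocked (tiled) matmul over row-major N x N matrices. The inner triple
// loop over (i, k, j) within one tile is what a single thread block does
// with its shared-memory tiles of A and B on the GPU.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t i0 = 0; i0 < N; i0 += TILE)
        for (std::size_t j0 = 0; j0 < N; j0 += TILE)
            for (std::size_t k0 = 0; k0 < N; k0 += TILE)
                for (std::size_t i = i0; i < std::min(i0 + TILE, N); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TILE, N); ++k) {
                        float a = A[i * N + k];  // reused across the whole j tile
                        for (std::size_t j = j0; j < std::min(j0 + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
    return C;
}
```

The block structure changes only the loop order, not the arithmetic: for any fixed output element, products are still accumulated in ascending-k order, so the result matches the naive version exactly.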
7.2 Mechanical Detail
- Coalescing rule: a warp's 32 accesses that fall within a single 128-byte aligned, contiguous range coalesce into one transaction. Strided or scattered access costs up to 32 transactions.
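This rule is easy to model: count the distinct 128-byte aligned segments a warp's 32 accesses touch. A small host-side sketch (the function name is mine, and this ignores hardware details like sector-level granularity):

```cpp
#include <cstdint>
#include <set>

// Number of 128-byte memory transactions a warp issues for 32 accesses
// starting at `base` with a fixed per-lane stride: one transaction per
// distinct 128-byte aligned segment touched.
int transactions_per_warp(std::uint64_t base, std::uint64_t stride_bytes) {
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < 32; ++lane)
        segments.insert((base + lane * stride_bytes) / 128);
    return static_cast<int>(segments.size());
}
```

Adjacent 4-byte floats (stride 4) give 1 transaction; reading a column of a row-major matrix with a 128-float row (stride 512 bytes) gives 32.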
- Shared memory bank conflicts: shared memory is divided into 32 banks with a 4-byte stride. If two threads in a warp access different addresses in the same bank, the accesses are serialized. Common with column-major access. Fix: pad arrays (`shared_mem[32][33]`, not `[32][32]`).
- Double buffering: while the SM computes on tile N, asynchronously load tile N+1 with `cp.async`. The compute hides the load latency. This is software pipelining.
- Tensor core usage (CUDA C++):
```cpp
using namespace nvcuda::wmma;

fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;

fill_fragment(c_frag, 0.0f);               // zero the accumulator first
load_matrix_sync(a_frag, A_smem, 16);
load_matrix_sync(b_frag, B_smem, 16);
mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a_frag * b_frag
store_matrix_sync(C_gmem, c_frag, N, mem_row_major);
```
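The bank-conflict arithmetic above can be checked with a host-side model: bank = word index mod 32, and for a column access thread t reads element [t][col] of a row of width W, i.e. word index t·W + col. A sketch (the function name is mine):

```cpp
#include <set>

// Number of distinct shared-memory banks hit when 32 threads read one
// column of a row-major 2D array with `width` 4-byte words per row.
// 1 distinct bank for 32 threads = a 32-way conflict; 32 = conflict-free.
int distinct_banks(int width, int col) {
    std::set<int> banks;
    for (int t = 0; t < 32; ++t)
        banks.insert((t * width + col) % 32);
    return static_cast<int>(banks.size());
}
```

With width 32, every thread lands in bank `col` (a 32-way conflict); padding the row to 33 words shifts each row by one bank, so all 32 accesses hit distinct banks.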
7.3 Lab - "Climb the Roofline"
Take your week 6 naive matmul and progressively optimize:
1. Coalesce loads (transpose access pattern). Re-time.
2. Tile in shared memory with 32×32 blocks. Re-time.
3. Double-buffer with cp.async. Re-time.
4. Use tensor cores with BF16. Re-time.
You should reach 30–60% of cuBLAS perf. Document each step's improvement and the residual gap. Read NVIDIA's CUTLASS examples for the production-grade version.
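To see why steps 1–2 climb the roofline, it helps to compute arithmetic intensity before and after tiling. A back-of-envelope sketch under simplifying assumptions (fp32, ignoring the N² writes of C; the names are mine):

```cpp
// Roofline inputs for an N x N fp32 matmul: FLOPs are fixed at 2*N^3,
// only the HBM traffic changes between kernel versions.
struct Roofline {
    double flops;
    double bytes;
    double intensity() const { return flops / bytes; }  // FLOP per byte
};

// Naive kernel: every output element re-reads a row of A and a column of B,
// so 2*N^3 elements (4 bytes each) come from HBM.
Roofline naive_traffic(double N) {
    return { 2.0 * N * N * N, 4.0 * 2.0 * N * N * N };
}

// T x T shared-memory tiles: each element of A and B is loaded once per tile
// it participates in, cutting HBM traffic by a factor of T.
Roofline tiled_traffic(double N, double T) {
    return { 2.0 * N * N * N, 4.0 * 2.0 * N * N * N / T };
}
```

The naive kernel sits at 0.25 FLOP/byte regardless of N; 32×32 tiling lifts it to 8 FLOP/byte, a 32× jump in intensity, which is why it moves you from memory-bound toward compute-bound.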
7.4 Idiomatic & Diagnostic Drill
- For each version, capture an `ncu` report. The metrics that matter:
  - `sm__cycles_active.avg.pct_of_peak_sustained_elapsed` (overall SM utilization)
  - `dram__bytes_read.sum` (HBM traffic)
  - `l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum` (global-load sectors; fewer sectors per request means better coalescing)
7.5 Production Slice
- Production CUDA kernels go through cuBLAS, cuDNN, or CUTLASS, not from-scratch CUDA, in 99% of cases. Read CUTLASS's `examples/` directory; understand its template-based GEMM and how it is tuned per architecture.