Week 7 - Memory Optimization: Coalescing, Shared Memory, Tensor Cores

7.1 Conceptual Core

  • The naive matmul of week 6 is slow because:
  • Uncoalesced memory access: adjacent threads read non-adjacent addresses. Each warp issues many memory transactions instead of one.
  • No data reuse: each element of A is loaded N times from HBM.
  • No tensor cores: scalar FP32 ops, not 16×16 BF16 GEMM blocks.
  • Three optimizations, each ~5-10×:
  • Coalesce: make adjacent threads in a warp read adjacent addresses.
  • Tile in shared memory: a block cooperatively loads 32×32 tiles of A and B into shared memory, and each thread computes its output from the shared data, so each element of A is read from HBM only once per block (see the sketch after this list).
  • Tensor cores: use the wmma (nvcuda::wmma) intrinsics to issue 16×16×16 GEMM fragments, roughly 10× faster than CUDA cores at BF16.
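
A minimal sketch of the coalesced, shared-memory-tiled version, assuming square row-major FP32 matrices with N a multiple of 32; the kernel name and launch shape are illustrative, not a reference solution:

    // Tiled matmul: each block computes a 32×32 tile of C from 32×32 tiles of A and B.
    #include <cuda_runtime.h>

    #define TILE 32

    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        // Padded to TILE+1 per 7.2's bank-conflict tip (matters most for transposed/column accesses)
        __shared__ float As[TILE][TILE + 1];
        __shared__ float Bs[TILE][TILE + 1];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // Coalesced loads: consecutive threadIdx.x values read consecutive addresses
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();                      // tile fully loaded before anyone computes

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                      // done with this tile before overwriting it
        }
        C[row * N + col] = acc;
    }

    // Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE);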

7.2 Mechanical Detail

  • Coalescing rule: a warp's 32 memory accesses to a 128-byte aligned, contiguous range = 1 transaction. Strided or scattered = up to 32 transactions.
  • Shared memory bank conflicts: shared memory is divided into 32 banks with a 4-byte stride. If two threads in a warp access different addresses in the same bank, the accesses are serialized; this is common with column-major access patterns. Fix: pad arrays (shared_mem[32][33], not [32][32]), as in the tiled kernel sketched above.
  • Double buffering: while the SM computes on tile N, asynchronously load tile N+1 with cp.async. The compute hides the load latency. This is software pipelining.
  • Tensor core usage (CUDA C++):
    #include <mma.h>                        // WMMA API; requires sm_70+
    using namespace nvcuda::wmma;

    // 16×16×16 fragments: A and B tiles in FP16, accumulator in FP32.
    // For BF16 (lab step 4), swap half for __nv_bfloat16 from <cuda_bf16.h> (sm_80+).
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    fill_fragment(c_frag, 0.0f);                          // zero the accumulator
    load_matrix_sync(a_frag, A_smem, 16);                 // load tiles from shared memory (leading dim 16)
    load_matrix_sync(b_frag, B_smem, 16);
    mma_sync(c_frag, a_frag, b_frag, c_frag);             // c_frag += a_frag * b_frag, one warp-wide op
    store_matrix_sync(C_gmem, c_frag, N, mem_row_major);  // write the 16×16 result tile to global memory
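
A minimal sketch of the double-buffering pattern using the cp.async primitives from <cuda_pipeline.h> (CUDA 11+; the copies are truly asynchronous on sm_80+). The kernel name, buffer layout, and the running-sum "compute" are illustrative stand-ins for the real per-tile math:

    // Double buffering: prefetch tile t+1 with cp.async while computing on tile t.
    #include <cuda_pipeline.h>

    #define TILE 32

    // Launch with one block of TILE threads; A holds num_tiles * TILE floats.
    __global__ void stream_tiles(const float* A, float* out, int num_tiles) {
        __shared__ float buf[2][TILE];        // one buffer being computed on, one being filled

        // Prefetch tile 0 (4 bytes per thread, issued asynchronously)
        __pipeline_memcpy_async(&buf[0][threadIdx.x], &A[threadIdx.x], sizeof(float));
        __pipeline_commit();

        float acc = 0.0f;
        for (int t = 0; t < num_tiles; ++t) {
            int cur = t & 1, nxt = (t + 1) & 1;
            if (t + 1 < num_tiles) {
                // Kick off the next tile's load before touching the current one
                __pipeline_memcpy_async(&buf[nxt][threadIdx.x],
                                        &A[(t + 1) * TILE + threadIdx.x], sizeof(float));
                __pipeline_commit();
                __pipeline_wait_prior(1);     // tile t has landed; tile t+1 stays in flight
            } else {
                __pipeline_wait_prior(0);     // last tile: drain all outstanding copies
            }
            __syncthreads();

            acc += buf[cur][threadIdx.x];     // stand-in for the real per-tile compute
            __syncthreads();                  // finished with buf[cur] before it is reused
        }
        out[threadIdx.x] = acc;
    }

In a real GEMM the same pattern is applied to the 32×32 A and B tiles from the kernel in 7.1; CUTLASS generalizes it to multi-stage pipelines.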
    

7.3 Lab-"Climb the Roofline"

Take your week 6 naive matmul and progressively optimize it, re-timing after each step (a timing sketch follows the list):

  1. Coalesce loads (transpose the access pattern).
  2. Tile in shared memory with 32×32 blocks.
  3. Double-buffer with cp.async.
  4. Use tensor cores with BF16.
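
One way to re-time each version, sketched with CUDA events; it assumes the matmul_tiled kernel from the 7.1 sketch is in the same file and stands in for whichever version is being measured:

    // Times one kernel launch with CUDA events and reports achieved TFLOP/s.
    #include <cstdio>
    #include <cuda_runtime.h>

    float time_matmul(const float* A, const float* B, float* C, int N) {
        dim3 block(32, 32), grid(N / 32, N / 32);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        matmul_tiled<<<grid, block>>>(A, B, C, N);   // warm-up launch

        cudaEventRecord(start);
        matmul_tiled<<<grid, block>>>(A, B, C, N);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // A GEMM does 2*N^3 flops; TFLOP/s is the y-axis of the roofline plot
        printf("%8.3f ms   %6.2f TFLOP/s\n", ms, 2.0 * N * N * N / (ms * 1e-3) / 1e12);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }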

You should reach 30–60% of cuBLAS performance. Document each step's improvement and the residual gap. Read NVIDIA's CUTLASS examples for the production-grade version.

7.4 Idiomatic & Diagnostic Drill

  • For each version, capture an ncu report. The metrics that matter: sm__cycles_active.avg.pct_of_peak_sustained_elapsed (overall SM utilization), dram__bytes_read.sum (HBM read traffic), l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum (sectors touched by global loads, a direct coalescing signal).
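
A typical capture, as a sketch: a full report for the Nsight Compute UI plus a text-mode pull of just the counters above. The output name and binary are placeholders, and flag spellings follow recent Nsight Compute releases:

    # Full capture (open the .ncu-rep file in the Nsight Compute UI)
    ncu --set full -o week7_step2 ./matmul_tiled
    # Quick text-mode check of the three counters listed above
    ncu --metrics sm__cycles_active.avg.pct_of_peak_sustained_elapsed,dram__bytes_read.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum ./matmul_tiled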

7.5 Production Slice

  • In 99% of cases, production CUDA kernels go through cuBLAS, cuDNN, or CUTLASS rather than from-scratch CUDA (a minimal cuBLAS call is sketched below). Read CUTLASS's examples/ directory; understand its template-based GEMM and how it is tuned per architecture.
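
A minimal sketch of the cuBLAS path the lab benchmarks against: a BF16 GEMM with FP32 accumulation through cublasGemmEx, which dispatches to tensor cores automatically. Matrices are assumed square, column-major (the cuBLAS convention), and already on the device; the function name is illustrative and error checking is omitted:

    #include <cublas_v2.h>
    #include <cuda_bf16.h>

    // C = A * B, with A and B in BF16 and C accumulated/stored in FP32.
    void gemm_bf16(cublasHandle_t handle, int N,
                   const __nv_bfloat16* A, const __nv_bfloat16* B, float* C) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasGemmEx(handle,
                     CUBLAS_OP_N, CUBLAS_OP_N,
                     N, N, N,
                     &alpha,
                     A, CUDA_R_16BF, N,
                     B, CUDA_R_16BF, N,
                     &beta,
                     C, CUDA_R_32F, N,
                     CUBLAS_COMPUTE_32F,
                     CUBLAS_GEMM_DEFAULT);
    }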
