Worked investigation - Profile a kernel with ncu and read the speed-of-light report¶

Companion to AI Systems -> Month 02 (GPU Programming) and the GPU Architecture deep dive. The deep dive derives the roofline model and memory hierarchy in theory. This page makes you point Nsight Compute at a single kernel and read the speed-of-light report - the GPU's verdict on "are you compute-bound or memory-bound, and how close to the hardware ceiling are you?" Time: ~45 minutes. Requires an NVIDIA GPU with Nsight Compute (ncu) installed.

The symptom you're learning to diagnose¶

nsys (previous investigation) told you the GPU is genuinely the bottleneck and showed which kernel dominates. Now: is that kernel good? Is it near the hardware's limit, or leaving 80% of the GPU on the table? "It's slow" isn't actionable; "this kernel achieves 18% of peak memory bandwidth and is memory-bound because of uncoalesced access" is. ncu gives you the latter - the difference between guessing at optimizations and knowing exactly which ceiling you're hitting.

Step 0: the one model - the roofline¶

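Every kernel is limited by one of two things: how fast the GPU can do math (compute-bound) or how fast it can move data (memory-bound). The deciding factor is arithmetic intensity = FLOPs performed per byte moved to or from memory. Compare it against the hardware's ridge point (peak FLOP/s divided by peak memory bandwidth, in FLOPs per byte): below the ridge point the kernel is memory-bound, above it compute-bound.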

Low intensity (few math ops per byte - e.g. an elementwise add: read 2 values, do 1 add, write 1 value) -> memory-bound. You're limited by bandwidth; the compute units sit idle waiting for data. Optimizing the math does nothing; you must move less data or move it better.
High intensity (lots of math per byte - e.g. a big matmul: each loaded value is reused across many multiply-adds) -> compute-bound. You're limited by the math units; the fix is using faster math units (tensor cores - next investigation).

The roofline plot (you sketched this in Month 2's lab) draws both ceilings; your kernel sits under one. ncu's speed-of-light report tells you which ceiling and how close. The single most important question about any kernel is "memory-bound or compute-bound?" because it determines the entire optimization strategy.
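
The arithmetic behind the two examples can be sketched in a few lines. This is a rough model, not a measurement: the peak FLOP/s and bandwidth figures are assumed, illustrative values, and the matmul count assumes ideal reuse (each matrix crosses the memory bus exactly once). The function names are hypothetical.

```python
# Arithmetic intensity = FLOPs / bytes moved to and from memory.
# ASSUMED, illustrative hardware peaks; substitute your GPU's datasheet values.
PEAK_FLOPS = 19.5e12       # assumed FP32 peak, FLOP/s
PEAK_BANDWIDTH = 1.555e12  # assumed DRAM peak, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP/byte where the two ceilings meet


def elementwise_add_intensity(bytes_per_value=4):
    # Per element: read 2 values, write 1 value, do 1 add.
    return 1 / (3 * bytes_per_value)


def matmul_intensity(n, bytes_per_value=4):
    # Square n x n matmul with ideal reuse: 2*n^3 FLOPs,
    # and each of the three matrices is moved once (3*n^2 values).
    return (2 * n**3) / (3 * n**2 * bytes_per_value)


def bound(intensity):
    # Left of the ridge point the bandwidth ceiling is lower; right of it, compute.
    return "memory-bound" if intensity < RIDGE else "compute-bound"


def attainable_flops(intensity):
    # The roofline itself: the lower of the two ceilings at this intensity.
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


for name, ai in [("elementwise add", elementwise_add_intensity()),
                 ("matmul n=4096", matmul_intensity(4096))]:
    print(f"{name}: {ai:.3f} FLOP/byte -> {bound(ai)}")
```

With these assumed peaks the ridge point sits near 12.5 FLOP/byte, so the elementwise add (about 0.08 FLOP/byte) lands far to the left of it and the large matmul far to the right.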

Step 1: profile one kernel¶

ncu is invasive (it replays each kernel many times to gather counters) - so target a specific kernel, not a whole run. Profile, say, the matmul in a small script:

$ ncu --set full -k "regex:gemm|matmul" -c 3 -o kernel_report python bench.py

--set full - collect the full metric set (including speed-of-light).
-k "regex:..." - only profile kernels whose name matches (here, matmuls). Without filtering, ncu profiles everything and takes forever.
-c 3 - profile only the first 3 matching launches (you don't need all of them).
Open kernel_report.ncu-rep in the Nsight Compute GUI (ncu-ui), or read the terminal summary:

$ ncu --set full -k "regex:gemm" -c 1 python bench.py
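
If you profile several kernels in a row, it can help to assemble the command in a script rather than retyping it. The sketch below only builds the argument list from the flags shown above; the helper name `build_ncu_command` is hypothetical, and nothing is launched unless you pass the list to `subprocess.run` yourself.

```python
import shlex


def build_ncu_command(target, kernel_regex, launch_count, report_name=None):
    """Assemble an ncu invocation as an argument list (hypothetical helper)."""
    cmd = ["ncu", "--set", "full",
           "-k", f"regex:{kernel_regex}",   # only kernels whose name matches
           "-c", str(launch_count)]          # stop after this many matching launches
    if report_name is not None:
        cmd += ["-o", report_name]           # write a report file instead of terminal output
    return cmd + list(target)


cmd = build_ncu_command(["python", "bench.py"], "gemm|matmul", 3, "kernel_report")
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list (rather than one shell string) avoids quoting problems with the `|` in the regex.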

Step 2: read the speed-of-light section¶

This is the headline - the first thing in every ncu report. It expresses your kernel as a percentage of the hardware's theoretical maximum:

  Section: GPU Speed Of Light Throughput
  ---------------------------------------------------------------------
  Compute (SM) Throughput                 [%]        22.4
  Memory Throughput                       [%]        91.7
  ---------------------------------------------------------------------
  DRAM Throughput                         [%]        91.7
  Duration                                [us]       412
  Achieved Occupancy                      [%]        58.3

Read it in one glance:

Memory Throughput 91.7%, Compute 22.4% - this kernel is memory-bound. It's using 92% of the available memory bandwidth but only 22% of the compute throughput. The compute units are starving for data. No amount of math optimization will help; you must reduce or improve data movement. (If the numbers were flipped - Compute ~90%, Memory ~20% - it'd be compute-bound, and the fix would be tensor cores / better math, the next investigation.)
DRAM Throughput 91.7% - it's saturating the slowest, biggest memory (global DRAM). The fix direction: get data into faster on-chip memory (shared memory / L2) and reuse it, so you hit DRAM less.
Achieved Occupancy 58.3% - on average, 58% of the SMs' warp slots are filled. (ncu prints this figure in its separate Occupancy section; it is shown alongside the throughput numbers here for convenience.) Below ~50% often means not enough parallelism to hide memory latency (the GPU hides stalls by switching warps - the deep dive's occupancy theory). Worth improving, but secondary to the memory-bound verdict.

In two numbers (Compute% vs Memory%) you know the entire optimization strategy. That's the power of the speed-of-light report: it collapses "is this kernel good?" into a verdict.
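
When you compare many kernels, pulling the two headline numbers out of saved terminal output is easy to script. The sketch below parses the text layout as printed on this page; real ncu output differs between versions, so treat the regular expression as an assumption to adapt. `parse_metrics` and `headline_verdict` are hypothetical helper names.

```python
import re

# Sample text copied from the section above.
REPORT = """\
  Compute (SM) Throughput                 [%]        22.4
  Memory Throughput                       [%]        91.7
  DRAM Throughput                         [%]        91.7
  Duration                                [us]       412
  Achieved Occupancy                      [%]        58.3
"""

# Columns: metric name, [unit], value; at least two spaces end the name.
LINE = re.compile(
    r"^\s*(?P<name>.+?)\s{2,}\[(?P<unit>[^\]]*)\]\s+(?P<value>[\d.,]+)\s*$"
)


def parse_metrics(text):
    """Return {metric name: float value} for every line that matches LINE."""
    metrics = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            # Drop thousands separators before converting.
            metrics[m["name"]] = float(m["value"].replace(",", ""))
    return metrics


def headline_verdict(metrics):
    """Compare the two headline percentages, as the text above does by eye."""
    compute = metrics["Compute (SM) Throughput"]
    memory = metrics["Memory Throughput"]
    return "memory-bound" if memory > compute else "compute-bound"


metrics = parse_metrics(REPORT)
print(headline_verdict(metrics))
```

Separator lines and section titles simply fail to match and are skipped.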

Step 3: confirm the why - the memory workload section¶

The verdict is "memory-bound"; ncu also tells you why the memory traffic is high. Scroll to the memory workload analysis:

  Section: Memory Workload Analysis
  ---------------------------------------------------------------------
  Memory Throughput              [GB/s]      1430
  L1/TEX Hit Rate                [%]         12.1     <- very low
  L2 Hit Rate                    [%]         34.8     <- low
  Global Load Efficiency         [%]         25.0     <- the smoking gun

Global Load Efficiency 25% - of the bytes each load transaction fetched, only 25% were actually used. This is the classic uncoalesced access problem: when neighboring threads in a warp read non-contiguous addresses, the hardware still fetches whole 32-byte sectors (four per 128-byte cache line) but uses only a fraction of each, wasting 75% of the bandwidth. The fix is restructuring the access pattern so neighboring threads read neighboring memory (coalescing) - or a different data layout (the struct-of-arrays idea). (The label follows nvprof's gld_efficiency metric; depending on your ncu version the same figure may appear under a different name.)
Low cache hit rates confirm data isn't being reused on-chip - it's being re-fetched from slow DRAM repeatedly. Tiling the computation to reuse data in shared memory is the canonical fix (exactly the "naive matmul vs tiled matmul, ~50-100x" lab from Month 2).
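
A small model shows where a figure like 25% comes from. It is a simplified sketch, assuming memory is fetched in 32-byte sectors and that one warp of 32 threads each reads one 4-byte value at a fixed stride; real hardware adds caching effects this ignores. The function name is hypothetical.

```python
SECTOR_BYTES = 32  # assumed fetch granularity
WARP_SIZE = 32


def global_load_efficiency(stride_elems, elem_bytes=4):
    """Bytes requested / bytes fetched for one warp-wide load.

    Thread t reads elem_bytes at byte offset t * stride_elems * elem_bytes.
    Intended for stride_elems >= 1.
    """
    sectors = set()
    for t in range(WARP_SIZE):
        start = t * stride_elems * elem_bytes
        # Record every sector the element touches (first and last byte).
        sectors.add(start // SECTOR_BYTES)
        sectors.add((start + elem_bytes - 1) // SECTOR_BYTES)
    requested = WARP_SIZE * elem_bytes
    fetched = len(sectors) * SECTOR_BYTES
    return requested / fetched


for stride in (1, 2, 4, 8):
    print(f"stride {stride}: {global_load_efficiency(stride):.1%}")
```

In this model a stride of 1 element is fully coalesced (100%), while a stride of 4 elements puts only two useful values in each sector and reproduces the 25% seen in the report.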

So the full diagnosis: memory-bound, because of uncoalesced global access and poor cache reuse. That sentence dictates the fix. You're no longer optimizing blind.

Step 4: the before/after that proves it¶

Take Month 2's naive matmul (two global loads per multiply-add, one from each input matrix, with no reuse) and the tiled version (load a tile into shared memory, reuse it). Profile both with ncu:

NAIVE matmul:                            TILED matmul (shared-memory reuse):
Compute Throughput:   18%                Compute Throughput:   71%
Memory Throughput:    94%                Memory Throughput:    48%
Global Load Eff:      25%                Global Load Eff:      98%
Duration:             4100 us           Duration:             280 us   (~15x faster)

Read the transformation: tiling moved the kernel from memory-bound to compute-bound (memory dropped from 94% to 48%, compute rose from 18% to 71%), global load efficiency went from 25% to 98% (coalesced + reused), and it ran ~15x faster. The speed-of-light report predicted the fix (reduce memory traffic) and confirmed it (the bottleneck moved). This is the entire discipline: read the ceiling, fix toward it, re-profile, watch the bound move. When it becomes compute-bound, the next lever is tensor cores - the final investigation.
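
The speedup and the traffic reduction behind it can be checked with simple arithmetic. The durations are the ones in the table above; the load counts are a textbook model of square matmul (an assumption, since the lab's exact kernels are not shown here), and the tile size of 16 is an illustrative choice.

```python
naive_us, tiled_us = 4100, 280  # durations from the before/after table
speedup = naive_us / tiled_us


def global_loads_naive(n):
    # Every multiply-add reads one element of A and one of B from global memory.
    return 2 * n**3


def global_loads_tiled(n, tile):
    # Each value staged in shared memory is reused `tile` times,
    # so global loads drop by that factor. Assumes n is a multiple of tile.
    return 2 * n**3 // tile


n, tile = 1024, 16  # illustrative sizes
reduction = global_loads_naive(n) / global_loads_tiled(n, tile)
print(f"speedup: {speedup:.1f}x, global loads reduced {reduction:.0f}x")
```

The measured speedup is usually smaller than the modeled load reduction, which is consistent with the bound moving: once memory stops being the limit, compute becomes the ceiling.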

Step 5: the rules of thumb the report encodes¶

After reading a few speed-of-light reports, the verdicts become instinct:

What you see	Diagnosis	Fix direction
Memory% high, Compute% low	memory-bound	reduce/coalesce data movement; tile for cache reuse
Compute% high, Memory% low	compute-bound	faster math: tensor cores, better algorithm
Both low	latency/occupancy-bound	more parallelism; bigger problem; fix occupancy
Global Load Efficiency low	uncoalesced access	restructure access pattern / data layout
Low cache hit rates	no data reuse	tile into shared memory
Low occupancy + memory-bound	can't hide latency	more warps in flight
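
The table can be sketched as a small decision function. The thresholds (60% for "high", 50% for low efficiency and low occupancy) are assumptions chosen for illustration, not values ncu defines; the cache-hit-rate row is left out for brevity, and the "near both ceilings" branch is an addition for the case the table does not list.

```python
def diagnose(compute_pct, memory_pct, load_eff_pct=None, occupancy_pct=None,
             high=60.0, low_eff=50.0, low_occ=50.0):
    """Return a list of (diagnosis, fix direction) pairs. Thresholds are assumed."""
    findings = []
    if memory_pct >= high and compute_pct < high:
        findings.append(("memory-bound",
                         "reduce/coalesce data movement; tile for cache reuse"))
    elif compute_pct >= high and memory_pct < high:
        findings.append(("compute-bound",
                         "faster math: tensor cores, better algorithm"))
    elif compute_pct < high and memory_pct < high:
        findings.append(("latency/occupancy-bound",
                         "more parallelism; bigger problem; fix occupancy"))
    else:
        # Both high: not a row in the table above.
        findings.append(("near both ceilings", "little headroom on either side"))

    if load_eff_pct is not None and load_eff_pct < low_eff:
        findings.append(("uncoalesced access",
                         "restructure access pattern / data layout"))

    memory_bound = findings[0][0] == "memory-bound"
    if memory_bound and occupancy_pct is not None and occupancy_pct < low_occ:
        findings.append(("can't hide latency", "more warps in flight"))
    return findings


for diagnosis, fix in diagnose(22.4, 91.7, load_eff_pct=25.0, occupancy_pct=58.3):
    print(f"{diagnosis}: {fix}")
```

Fed the numbers from Steps 2 and 3, it reports memory-bound plus uncoalesced access, matching the diagnosis reached by hand.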

"Both low" is the surprising one and a common beginner result: a kernel using neither the compute nor the memory hardware well, because it's launch-overhead-bound (tiny kernels - fuse them, the nsys lesson) or occupancy-starved (not enough threads to hide latency).

Now you do it (on any NVIDIA GPU)¶

Write or grab a simple elementwise kernel (or use a PyTorch op). Profile with ncu --set full -c 1. Read the speed-of-light section. Predict before you look: an elementwise op (low arithmetic intensity) should be strongly memory-bound. Confirm.
Profile a matmul (torch.matmul on big tensors). Compare its Compute% vs Memory% to the elementwise op. Matmul should be far more compute-bound (high arithmetic intensity).
If you did Month 2's naive vs tiled matmul lab, profile both and reproduce the before/after table. Watch the bound flip from memory to compute and the global load efficiency jump.
Find a kernel with low Global Load Efficiency. Look at its access pattern in the source. Is it striding through memory non-contiguously? That's the uncoalescing.

What you might wonder¶

"ncu is incredibly slow - is that normal?" Yes. ncu replays each kernel dozens of times to collect different hardware counters (you can't measure them all in one pass). This is why you must filter to specific kernels (-k) and few launches (-c). It's a precision microscope, not a continuous monitor - the opposite end from nvidia-smi (whole card, live) and nsys (timeline, low overhead). Three tools, three zoom levels: smi -> nsys -> ncu.

"What's 'occupancy' really?" The ratio of active warps to the maximum the SM can hold. The GPU hides memory latency by switching to another ready warp when one stalls (the deep dive's core mechanism) - so you need enough warps in flight. Low occupancy means too few warps to hide stalls, so the compute units idle during memory waits. But high occupancy isn't always the goal - a kernel can be fast at moderate occupancy if it has enough instruction-level parallelism. Treat low occupancy as a suspect, confirmed by the memory/compute verdict, not a target to max blindly.

"Do I need this if I just use PyTorch?" For everyday training, no - you use library kernels (cuBLAS, cuDNN) already tuned to near speed-of-light. You need ncu when you write custom kernels (CUDA/Triton - Month 2-3), debug why a fused op is slow, or evaluate whether a kernel is worth optimizing. It's the tool for the kernel-author and the performance specialist - the deepest layer of the AI systems stack. Reading the report is valuable even if you never write a kernel, because it builds the compute-vs-memory intuition that explains why models run the way they do.

"How does this connect to FlashAttention and the famous kernels?" FlashAttention's whole insight is exactly this report's lesson: naive attention is memory-bound (it materializes a huge N x N matrix in slow DRAM); FlashAttention restructures it to keep data in fast on-chip SRAM and never write the big matrix to DRAM, turning it compute-bound and far faster. If you ncu'd both, you'd see naive attention at ~90% memory / low compute, FlashAttention at high compute / low memory - the same flip as the tiled-matmul before/after, on the kernel that defined modern LLM serving.

What this gave you¶

You know the roofline: every kernel is memory-bound or compute-bound, decided by arithmetic intensity.
You read the speed-of-light report's two headline numbers (Compute% vs Memory%) and instantly know the optimization strategy.
You confirm the why (uncoalesced access, poor cache reuse) in the memory workload section.
You watched tiling flip a kernel from memory-bound to compute-bound with a 15x speedup, predicted and confirmed by ncu.
You have the verdict-to-fix rules of thumb, including the "both low" overhead/occupancy case.
You know the three-tool zoom (nvidia-smi -> nsys -> ncu) and where ncu fits - and how this explains FlashAttention.

Back to the GPU Programming month or the GPU Architecture deep dive, or on to watching tensor-core utilization.