Worked investigation - Read an nsys trace of a training step¶
Companion to AI Systems -> Month 03 (Framework Internals) and the PyTorch Internals deep dive. The curriculum explains the GPU executes kernels dispatched by the framework. This page makes you capture and read a timeline of a real training step - and answer the question the previous investigation raised: "my GPU is at 100% util but slow - why?" ~45 minutes. Needs an NVIDIA GPU with Nsight Systems (nsys, ships with the CUDA toolkit).
The symptom you're learning to diagnose¶
Your training is slower than it should be.
nvidia-smisays GPU-Util 95-100%, so "the GPU is busy" - but throughput is bad and power draw is suspiciously low (the nvidia-smi investigation flagged this). Is the GPU starved by the CPU dataloader? Are the kernels tiny and launch-overhead-bound? Is there a sync point serializing everything? You can't tell from a utilization number. You need the timeline - what ran, when, and what the GPU was waiting for.
nsys records a microsecond-resolution timeline of CPU and GPU activity. Reading it is the single highest-leverage performance skill in ML systems work, because the bottleneck is almost never where you'd guess (the measure-don't-guess lesson, GPU edition).
Step 0: the one fact - GPU-Util lies, the timeline doesn't¶
"GPU-Util 100%" means "a kernel was executing during the sampling window" - not "the GPU was doing useful math at full capacity." A GPU running a thousand tiny kernels back to back, or one kernel then waiting on the CPU for the next batch, can read ~100% util while accomplishing little. The timeline shows the truth: the gaps between kernels (GPU idle, waiting), the size of kernels (tiny = overhead-bound), and what the CPU was doing during GPU gaps (dataloading? Python overhead?). You diagnose by looking at the gaps, not the utilization.
Step 1: capture a trace¶
Wrap a few training steps. You don't need the whole run - 10-20 steps is plenty (and traces get huge fast). The pattern:
$ nsys profile -o trace --force-overwrite true \
--trace=cuda,nvtx,osrt,cudnn,cublas \
python train.py --max-steps 20
--trace=cuda,nvtx,osrt,cudnn,cublas- record CUDA kernels, NVTX annotations, OS runtime (CPU threads/syscalls), and the cuDNN/cuBLAS library calls. This combination shows both what the GPU did and what the CPU was doing.- It produces
trace.nsys-rep- open it in the Nsight Systems GUI (nsys-ui trace.nsys-rep), or get a text summary in the terminal:
For a real run you'll use the GUI timeline (it's visual and essential), but the stats give you the headline numbers fast.
Step 2: the headline - where did the time go?¶
nsys stats prints several tables. The most important is the kernel summary:
Time(%) Total Time Instances Avg (ns) Name
------- ----------- --------- -------- ----
38.2 1284900000 4820 266577 ampere_sgemm_128x64 (matmul)
22.1 743100000 38400 19351 elementwise_kernel (add/mul/activation)
18.7 628200000 19200 32718 vectorized_layer_norm
9.4 316800000 9600 33000 reduce_kernel
...
Read it like a CPU profiler's "top functions," but for GPU kernels: - ampere_sgemm at 38% - matrix multiply, the work you want dominating. Good - the GPU is mostly doing the math that matters. - elementwise_kernel at 22%, 38400 instances - 38,400 tiny add/multiply/activation kernels. That instance count is a flag: thousands of tiny kernels means launch overhead and memory round-trips dominate. This is the classic "fusion opportunity" - those 38,400 separate ops could be fused into far fewer (what torch.compile does - the next investigation's territory).
If instead the top entry were Memcpy HtoD (host-to-device copy) or the kernels were all tiny with huge gaps, you'd have a data movement or starvation problem, not a compute problem - a completely different fix.
Step 3: the real diagnosis - read the gaps (GUI timeline)¶
The stats tell you what ran; the timeline tells you what the GPU waited for. In nsys-ui, you see stacked rows: CPU threads on top, CUDA API calls, then the GPU's kernel execution stream. The pattern you're hunting:
GPU stream: [matmul][gap........][matmul][gap........][matmul]
CPU thread: [....dataloader collate....][launch] [....dataloader....]
^ the GPU is IDLE here, waiting for the CPU to prepare the next batch
Gaps in the GPU row = the GPU sitting idle. If those gaps line up with the CPU doing dataloader work, your bottleneck is the dataloader, not the GPU - the GPU finishes a step and waits for the CPU to feed it the next batch. This is the #1 cause of "100% util but slow": util counts the kernel bursts, but the gaps (where the real time goes) are invisible to nvidia-smi. The fix is num_workers > 0 in the DataLoader, pin_memory=True, prefetching - CPU-side, nothing to do with the GPU or model.
Three gap patterns and their causes: - Gaps aligned with CPU dataloader activity -> input pipeline starvation. Fix: more dataloader workers, prefetch, faster storage. - Gaps aligned with cudaStreamSynchronize / cudaDeviceSynchronize -> a sync point serializing CPU and GPU (often a .item(), .cpu(), or print(loss) mid-step forcing the GPU to finish before the CPU continues). Fix: remove or defer the sync. - No gaps, but kernels are tiny and numerous -> overhead/fusion-bound. Fix: torch.compile, larger batches, fused ops.
Step 4: use NVTX to find your step boundaries¶
A raw trace is a sea of kernels with no structure. Annotate your code with NVTX ranges so the timeline shows your phases:
import torch.cuda.nvtx as nvtx
for batch in loader:
nvtx.range_push("dataload")
x, y = batch
nvtx.range_pop()
nvtx.range_push("forward")
out = model(x)
loss = criterion(out, y)
nvtx.range_pop()
nvtx.range_push("backward")
loss.backward()
nvtx.range_pop()
nvtx.range_push("optimizer")
optimizer.step(); optimizer.zero_grad()
nvtx.range_pop()
Now the timeline has labeled bands - dataload, forward, backward, optimizer - and you can measure each phase. If dataload is 40% of the step wall-clock, that's your answer in one glance. (PyTorch's profiler emits these automatically with with torch.profiler.profile(...) + record_function, an alternative to manual NVTX.) NVTX ranges are the GPU-trace equivalent of giving a flame graph meaningful function names instead of raw addresses.
Step 5: confirm the fix moved the bottleneck¶
The discipline from every performance investigation: after a fix, re-trace and confirm the bottleneck moved. Suppose Step 3 showed dataloader gaps and you added num_workers=8:
BEFORE (starved): AFTER (num_workers=8):
GPU: [mm][gap....][mm][gap....] GPU: [mm][mm][mm][mm] (gaps gone)
step time: 180 ms step time: 95 ms (1.9x faster)
GPU gaps: 45% of step GPU gaps: 4% of step
The gaps closed, step time nearly halved, and now the GPU stream is dense - the bottleneck moved onto the GPU (now you'd profile the kernels with ncu - the next investigation). You stop when the GPU is genuinely the bound and the kernels are efficient. This before/after - gaps quantified, step time measured - is the proof, not "it feels faster."
Now you do it (on any NVIDIA GPU)¶
nsys profile --trace=cuda,nvtx,osrt -o t python your_train.py --max-steps 20, thennsys stats t.nsys-rep. Read the kernel summary - what's the top kernel, and is it matmul (good) or thousands of tiny elementwise ops (fusion opportunity)?- Add NVTX ranges around dataload/forward/backward/optimizer. Re-trace, open in
nsys-ui, and measure what fraction of each step is each phase. Where does the time actually go? (Bet against your guess.) - Find a gap in the GPU row. What's the CPU doing during it? Dataloader? A sync? That's your bottleneck.
- If you find dataloader starvation, set
num_workers>0, pin_memory=True, re-trace, and confirm the gaps shrink and step time drops. Quantify the speedup.
What you might wonder¶
"nsys vs ncu vs the PyTorch profiler - which?" nsys (Nsight Systems) = the timeline, the big picture: CPU+GPU, gaps, what's serializing, dataloader vs compute. Start here always. ncu (Nsight Compute) = a microscope on one kernel - is this matmul using the tensor cores, hitting memory bandwidth limits (the next investigation). The PyTorch profiler (torch.profiler) = a convenient framework-level view that emits NVTX and a nice table, great for "which PyTorch op is expensive" without leaving Python. Use nsys for system-level, ncu for kernel-level, the torch profiler for quick framework-level checks.
"My trace file is enormous / nsys-ui is sluggish." Trace fewer steps (10-20), and skip the first few (warmup/JIT compile distort everything - the same warmup lesson as JMH/Linux benchmarking). Use --capture-range with NVTX to record only the steady-state region. A focused 15-step trace beats a 5-minute one you can't load.
"Why is the first step always weird in the trace?" cuDNN/cuBLAS autotuning, lazy CUDA context init, allocator warmup, and (if used) torch.compile JIT all happen on the first step(s) - they're 10-100x slower than steady state. Always profile steady-state steps, never the first. This is the GPU version of JIT warmup from the profiling investigations.
"How does this connect to the 'low power at 100% util' clue?" Directly. Low power + 100% util = the GPU is busy with overhead or waiting, not doing heavy math. The nsys timeline shows you which: gaps (waiting) or tiny kernels (overhead). nvidia-smi raised the suspicion; nsys confirms and localizes it.
What this gave you¶
- You can capture a CPU+GPU timeline of real training steps with
nsys. - You read the kernel summary to see whether matmul dominates (good) or tiny elementwise ops do (fuse them).
- You diagnose by the gaps in the GPU stream - dataloader starvation, sync points, or overhead - which utilization numbers hide.
- You annotate with NVTX to measure your own forward/backward/dataload/optimizer phases.
- You confirm a fix by re-tracing and watching the bottleneck move, with quantified before/after.
- You know when to escalate to
ncu(kernel microscope) vs stay at thensyssystem level.
Back to the Framework Internals month or the PyTorch Internals deep dive, or on to profiling a single kernel with ncu.