Worked investigation - Read an nsys trace of a training step¶

Companion to AI Systems -> Month 03 (Framework Internals) and the PyTorch Internals deep dive. The curriculum explains the GPU executes kernels dispatched by the framework. This page makes you capture and read a timeline of a real training step - and answer the question the previous investigation raised: "my GPU is at 100% util but slow - why?" ~45 minutes. Needs an NVIDIA GPU with Nsight Systems (nsys, ships with the CUDA toolkit).

The symptom you're learning to diagnose¶

Your training is slower than it should be. nvidia-smi says GPU-Util 95-100%, so "the GPU is busy" - but throughput is bad and power draw is suspiciously low (the nvidia-smi investigation flagged this). Is the GPU starved by the CPU dataloader? Are the kernels tiny and launch-overhead-bound? Is there a sync point serializing everything? You can't tell from a utilization number. You need the timeline - what ran, when, and what the GPU was waiting for.

nsys records a microsecond-resolution timeline of CPU and GPU activity. Reading it is the single highest-leverage performance skill in ML systems work, because the bottleneck is almost never where you'd guess (the measure-don't-guess lesson, GPU edition).

Step 0: the one fact - GPU-Util lies, the timeline doesn't¶

"GPU-Util 100%" means "a kernel was executing during the sampling window" - not "the GPU was doing useful math at full capacity." A GPU running a thousand tiny kernels back to back, or one kernel then waiting on the CPU for the next batch, can read ~100% util while accomplishing little. The timeline shows the truth: the gaps between kernels (GPU idle, waiting), the size of kernels (tiny = overhead-bound), and what the CPU was doing during GPU gaps (dataloading? Python overhead?). You diagnose by looking at the gaps, not the utilization.

Step 1: capture a trace¶

Wrap a few training steps. You don't need the whole run - 10-20 steps is plenty (and traces get huge fast). The pattern:

$ nsys profile -o trace --force-overwrite true \
    --trace=cuda,nvtx,osrt,cudnn,cublas \
    python train.py --max-steps 20

--trace=cuda,nvtx,osrt,cudnn,cublas - record CUDA kernels, NVTX annotations, OS runtime (CPU threads/syscalls), and the cuDNN/cuBLAS library calls. This combination shows both what the GPU did and what the CPU was doing.
It produces trace.nsys-rep - open it in the Nsight Systems GUI (nsys-ui trace.nsys-rep), or get a text summary in the terminal:

$ nsys stats trace.nsys-rep

For a real run you'll use the GUI timeline (it's visual and essential), but the stats give you the headline numbers fast.

Step 2: the headline - where did the time go?¶

nsys stats prints several tables. The most important is the kernel summary:

 Time(%)  Total Time   Instances   Avg (ns)    Name
 -------  -----------   ---------   --------    ----
   38.2    1284900000      4820     266577     ampere_sgemm_128x64    (matmul)
   22.1     743100000     38400      19351     elementwise_kernel     (add/mul/activation)
   18.7     628200000     19200      32718     vectorized_layer_norm
    9.4     316800000      9600      33000     reduce_kernel
    ...

Read it like a CPU profiler's "top functions," but for GPU kernels: - ampere_sgemm at 38% - matrix multiply, the work you want dominating. Good - the GPU is mostly doing the math that matters. - elementwise_kernel at 22%, 38400 instances - 38,400 tiny add/multiply/activation kernels. That instance count is a flag: thousands of tiny kernels means launch overhead and memory round-trips dominate. This is the classic "fusion opportunity" - those 38,400 separate ops could be fused into far fewer (what torch.compile does - the next investigation's territory).

If instead the top entry were Memcpy HtoD (host-to-device copy) or the kernels were all tiny with huge gaps, you'd have a data movement or starvation problem, not a compute problem - a completely different fix.

Step 3: the real diagnosis - read the gaps (GUI timeline)¶

The stats tell you what ran; the timeline tells you what the GPU waited for. In nsys-ui, you see stacked rows: CPU threads on top, CUDA API calls, then the GPU's kernel execution stream. The pattern you're hunting:

GPU stream:   [matmul][gap........][matmul][gap........][matmul]
CPU thread:   [....dataloader collate....][launch]      [....dataloader....]
                    ^ the GPU is IDLE here, waiting for the CPU to prepare the next batch

Gaps in the GPU row = the GPU sitting idle. If those gaps line up with the CPU doing dataloader work, your bottleneck is the dataloader, not the GPU - the GPU finishes a step and waits for the CPU to feed it the next batch. This is the #1 cause of "100% util but slow": util counts the kernel bursts, but the gaps (where the real time goes) are invisible to nvidia-smi. The fix is num_workers > 0 in the DataLoader, pin_memory=True, prefetching - CPU-side, nothing to do with the GPU or model.

Three gap patterns and their causes: - Gaps aligned with CPU dataloader activity -> input pipeline starvation. Fix: more dataloader workers, prefetch, faster storage. - Gaps aligned with cudaStreamSynchronize / cudaDeviceSynchronize -> a sync point serializing CPU and GPU (often a .item(), .cpu(), or print(loss) mid-step forcing the GPU to finish before the CPU continues). Fix: remove or defer the sync. - No gaps, but kernels are tiny and numerous -> overhead/fusion-bound. Fix: torch.compile, larger batches, fused ops.

Step 4: use NVTX to find your step boundaries¶

A raw trace is a sea of kernels with no structure. Annotate your code with NVTX ranges so the timeline shows your phases:

import torch.cuda.nvtx as nvtx

for batch in loader:
    nvtx.range_push("dataload")
    x, y = batch
    nvtx.range_pop()

    nvtx.range_push("forward")
    out = model(x)
    loss = criterion(out, y)
    nvtx.range_pop()

    nvtx.range_push("backward")
    loss.backward()
    nvtx.range_pop()

    nvtx.range_push("optimizer")
    optimizer.step(); optimizer.zero_grad()
    nvtx.range_pop()

Now the timeline has labeled bands - dataload, forward, backward, optimizer - and you can measure each phase. If dataload is 40% of the step wall-clock, that's your answer in one glance. (PyTorch's profiler emits these automatically with with torch.profiler.profile(...) + record_function, an alternative to manual NVTX.) NVTX ranges are the GPU-trace equivalent of giving a flame graph meaningful function names instead of raw addresses.

Step 5: confirm the fix moved the bottleneck¶

The discipline from every performance investigation: after a fix, re-trace and confirm the bottleneck moved. Suppose Step 3 showed dataloader gaps and you added num_workers=8:

BEFORE (starved):                        AFTER (num_workers=8):
GPU:  [mm][gap....][mm][gap....]         GPU:  [mm][mm][mm][mm]   (gaps gone)
step time: 180 ms                        step time: 95 ms        (1.9x faster)
GPU gaps: 45% of step                    GPU gaps: 4% of step

The gaps closed, step time nearly halved, and now the GPU stream is dense - the bottleneck moved onto the GPU (now you'd profile the kernels with ncu - the next investigation). You stop when the GPU is genuinely the bound and the kernels are efficient. This before/after - gaps quantified, step time measured - is the proof, not "it feels faster."

Now you do it (on any NVIDIA GPU)¶

nsys profile --trace=cuda,nvtx,osrt -o t python your_train.py --max-steps 20, then nsys stats t.nsys-rep. Read the kernel summary - what's the top kernel, and is it matmul (good) or thousands of tiny elementwise ops (fusion opportunity)?
Add NVTX ranges around dataload/forward/backward/optimizer. Re-trace, open in nsys-ui, and measure what fraction of each step is each phase. Where does the time actually go? (Bet against your guess.)
Find a gap in the GPU row. What's the CPU doing during it? Dataloader? A sync? That's your bottleneck.
If you find dataloader starvation, set num_workers>0, pin_memory=True, re-trace, and confirm the gaps shrink and step time drops. Quantify the speedup.

What you might wonder¶

"nsys vs ncu vs the PyTorch profiler - which?" nsys (Nsight Systems) = the timeline, the big picture: CPU+GPU, gaps, what's serializing, dataloader vs compute. Start here always. ncu (Nsight Compute) = a microscope on one kernel - is this matmul using the tensor cores, hitting memory bandwidth limits (the next investigation). The PyTorch profiler (torch.profiler) = a convenient framework-level view that emits NVTX and a nice table, great for "which PyTorch op is expensive" without leaving Python. Use nsys for system-level, ncu for kernel-level, the torch profiler for quick framework-level checks.

"My trace file is enormous / nsys-ui is sluggish." Trace fewer steps (10-20), and skip the first few (warmup/JIT compile distort everything - the same warmup lesson as JMH/Linux benchmarking). Use --capture-range with NVTX to record only the steady-state region. A focused 15-step trace beats a 5-minute one you can't load.

"Why is the first step always weird in the trace?" cuDNN/cuBLAS autotuning, lazy CUDA context init, allocator warmup, and (if used) torch.compile JIT all happen on the first step(s) - they're 10-100x slower than steady state. Always profile steady-state steps, never the first. This is the GPU version of JIT warmup from the profiling investigations.

"How does this connect to the 'low power at 100% util' clue?" Directly. Low power + 100% util = the GPU is busy with overhead or waiting, not doing heavy math. The nsys timeline shows you which: gaps (waiting) or tiny kernels (overhead). nvidia-smi raised the suspicion; nsys confirms and localizes it.

What this gave you¶

You can capture a CPU+GPU timeline of real training steps with nsys.
You read the kernel summary to see whether matmul dominates (good) or tiny elementwise ops do (fuse them).
You diagnose by the gaps in the GPU stream - dataloader starvation, sync points, or overhead - which utilization numbers hide.
You annotate with NVTX to measure your own forward/backward/dataload/optimizer phases.
You confirm a fix by re-tracing and watching the bottleneck move, with quantified before/after.
You know when to escalate to ncu (kernel microscope) vs stay at the nsys system level.

Back to the Framework Internals month or the PyTorch Internals deep dive, or on to profiling a single kernel with ncu.