Worked investigation - Watch tensor-core utilization rise when you fix precision

Companion to AI Systems -> Month 04 (Distributed Training), the Numerics and Mixed Precision deep dive, and the GPU Architecture deep dive. The deep dives explain tensor cores and mixed precision in theory. This page makes you prove whether your training is actually using the tensor cores - the specialized hardware that does the bulk of modern ML math - and watch utilization jump when you fix a precision mistake that silently leaves them idle. ~40 minutes. Needs an NVIDIA GPU with tensor cores (Volta/Turing/Ampere/Hopper - V100, T4, A10G, A100, H100; not older cards). The BF16 and TF32 steps need Ampere or newer; on V100 and T4, substitute FP16.

The symptom you're learning to diagnose

Your training is running, the GPU is busy, loss is going down - but it's slower than the hardware should allow, and you suspect you're not using the GPU's best math units. Or: you "turned on mixed precision" but saw little speedup and aren't sure it's actually doing anything. At peak, tensor cores can do matrix math roughly 8-16x faster than the regular FP32 units (the exact ratio depends on GPU generation and input format) - but only if your data is in the right precision and shapes. A single wrong dtype can leave them completely idle while everything looks fine. You're going to measure whether they're working.

Step 0: the one fact - tensor cores need the right precision

Modern NVIDIA GPUs have two kinds of math units:
- CUDA cores - general FP32/FP64 arithmetic. Flexible, but the "slow" path for matrix math.
- Tensor cores - specialized units that do small matrix-multiply-accumulate operations per clock, dramatically faster - but only for reduced-precision inputs (FP16, BF16, TF32, FP8, INT8 depending on generation). Feed them FP32 and they mostly can't engage; the work falls back to the slower CUDA cores.

So the entire game is: get your matmuls into a precision the tensor cores accept (FP16/BF16/TF32), without wrecking numerical stability. That's what "mixed precision training" is - keep most math in BF16/FP16 for tensor-core speed, keep the numerically sensitive parts (loss scaling, reductions, master weights) in FP32 for stability (the Numerics deep dive). The failure mode this page catches: thinking you enabled it but actually still running FP32, leaving the tensor cores idle.

Step 1: the baseline - profile an FP32 training step

Use ncu (previous investigation) on the matmul, looking specifically at the tensor-core utilization metric. With a default FP32 model:

$ ncu --set full -k "regex:gemm|matmul" -c 1 python train.py

In the report (or ncu-ui), find the compute breakdown / pipe utilization:

  Section: Compute Workload Analysis
  ---------------------------------------------------------------------
  Executed Ipc Active            [inst/cycle]    1.21
  SM Busy                        [%]             68.4
  Tensor (FP)                    [%]             0.0      <- TENSOR CORES IDLE
  FMA (FP32)                     [%]             64.1     <- all math on the slow units

Tensor (FP) 0.0% - the tensor cores are doing nothing. All the matrix math is going through the FP32 FMA units (64%). The GPU looks busy (SM Busy 68%), nvidia-smi shows high util - but you're using the slow math path. This is the silent waste: everything appears to work, you're just leaving a several-fold speedup on the floor (up to 8x or more at peak on the matmuls themselves). nvidia-smi and even nsys won't tell you this; only the tensor-core metric does.
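
If you'd rather make this check in a script than by eye, rows like the ones above are easy to parse. This is a minimal sketch, assuming the report text uses the same "Name [unit] value" layout as the excerpt on this page; real ncu output varies by version and section set. The function names and the 5% cutoff are illustrative and not part of ncu.

```python
import re

# One report row: a metric name, a unit in square brackets, then a number.
ROW = re.compile(r"^\s*(.+?)\s+\[(.+?)\]\s+([\d.]+)")


def parse_metrics(report: str) -> dict[str, float]:
    """Map metric name -> value for every line that looks like a report row."""
    metrics = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            name, _unit, value = match.groups()
            metrics[name] = float(value)
    return metrics


def tensor_cores_idle(metrics: dict[str, float], threshold: float = 5.0) -> bool:
    """True when the tensor pipe sits below `threshold` percent (hypothetical cutoff)."""
    return metrics.get("Tensor (FP)", 0.0) < threshold


# The FP32 baseline rows from this page, without the margin annotations.
REPORT = """\
  Executed Ipc Active            [inst/cycle]    1.21
  SM Busy                        [%]             68.4
  Tensor (FP)                    [%]             0.0
  FMA (FP32)                     [%]             64.1
"""

metrics = parse_metrics(REPORT)
print(metrics)
print("tensor cores idle:", tensor_cores_idle(metrics))
```

A check like this could sit in a CI job or a pre-launch script, so a regression to the FP32 path gets flagged instead of silently costing throughput.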

Step 2: enable mixed precision (the right way)

Turn on autocast + gradient scaling - the standard PyTorch mixed-precision recipe:

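import torch
from torch.amp import autocast, GradScaler

# model, optimizer, criterion and loader are assumed to be defined already,
# with the model and each batch on the GPU.
scaler = GradScaler()                  # required for FP16; harmless with BF16
for x, y in loader:
    optimizer.zero_grad()
    with autocast(device_type="cuda", dtype=torch.bfloat16):   # matmuls run in BF16
        out = model(x)
        loss = criterion(out, y)
    scaler.scale(loss).backward()      # loss scaling guards against FP16 gradient underflow
    scaler.step(optimizer)             # unscales grads, then steps; weights stay FP32
    scaler.update()                    # adjusts the scale factor for the next step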

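autocast automatically runs the tensor-core-friendly ops (matmuls, convolutions) in BF16/FP16 while keeping precision-sensitive ops such as reductions in FP32. The parameters and the optimizer step stay FP32 because autocast never changes the stored weights. GradScaler prevents small gradients from underflowing in FP16 (BF16 rarely needs it, but it's harmless). On pre-Ampere GPUs (V100, T4), which have no BF16 tensor-core path, pass dtype=torch.float16 instead, and there GradScaler is not optional. This is mixed precision: fast where it's safe, precise where it matters.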
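
To see why FP16 needs loss scaling and BF16 usually doesn't, you can emulate the rounding in plain Python. This is a sketch rather than what PyTorch does internally: it uses the standard library's half-precision struct format for FP16 and emulates BF16 by rounding a float32 encoding to its top 16 bits. The gradient value is hypothetical; the scale of 2**16 matches GradScaler's default initial scale.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision and back (struct's 'e' format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round x to bfloat16: keep the top 16 bits of its float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


grad = 1e-8        # hypothetical tiny gradient value
scale = 65536.0    # 2**16

fp16_raw = to_fp16(grad)                        # below FP16's smallest subnormal
fp16_recovered = to_fp16(grad * scale) / scale  # scale, round to FP16, unscale
bf16_raw = to_bf16(grad)                        # BF16 has FP32's exponent range

print("FP16 unscaled:           ", fp16_raw)
print("FP16 scaled then unscaled:", fp16_recovered)
print("BF16 unscaled:           ", bf16_raw)
```

The unscaled FP16 value flushes to zero, the scaled one survives the round trip, and BF16 keeps the value without any scaling, only with fewer significant digits.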

Step 3: prove it worked - re-profile

Run ncu again on the same kernel:

  Section: Compute Workload Analysis
  ---------------------------------------------------------------------
  SM Busy                        [%]             74.2
  Tensor (FP)                    [%]             82.7     <- TENSOR CORES NOW ENGAGED
  FMA (FP32)                     [%]             9.3      <- most math moved off the slow path

Tensor (FP) jumped from 0.0% to 82.7%. The matrix math is now running on the tensor cores. Confirm the payoff at the timeline/throughput level:

                  FP32 (tensor cores idle)    BF16 autocast (tensor cores 83%)
step time:        310 ms                      78 ms      (~4x faster)
Tensor (FP):      0.0%                        82.7%
tokens/sec:       4,100                       16,300

~4x faster wall-clock, tensor cores engaged, and (you'd verify) the loss curve essentially unchanged. You measured the difference between "mixed precision is on" (the config) and "tensor cores are actually working" (the hardware reality) - and they're not the same thing until you check.
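
How much of the peak tensor-core advantage shows up as wall-clock speedup depends on what fraction of the step is matmul, which is plain Amdahl's-law arithmetic. The sketch below uses the step times from this page; the 8x matmul speedup and the matmul fractions are hypothetical inputs chosen for illustration, not measurements.

```python
def overall_speedup(matmul_fraction: float, matmul_speedup: float) -> float:
    """Amdahl's law: only the matmul share of the step gets faster."""
    if not 0.0 <= matmul_fraction <= 1.0:
        raise ValueError("matmul_fraction must be between 0 and 1")
    if matmul_speedup <= 0:
        raise ValueError("matmul_speedup must be positive")
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)


# Whole-step speedup from the two step times reported on this page.
measured = 310 / 78
print(f"measured step speedup: {measured:.2f}x")

# Hypothetical: matmuls get 8x faster; vary how much of the step they occupy.
for fraction in (0.2, 0.5, 0.9, 1.0):
    print(f"matmul share {fraction:.0%}: {overall_speedup(fraction, 8.0):.2f}x overall")
```

A step that is only 20% matmul barely moves even with a large matmul speedup, which is why it pays to confirm with a whole-step profile that matmuls dominate before tuning their precision.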

Step 4: the subtle traps that silently disable tensor cores

This is the real-world value - mixed precision can be "on" and still not use tensor cores, for reasons that look fine in code:

Bad shapes. Tensor cores want dimensions that are multiples of 8 (FP16) or 16 (some configs). A matmul with an inner dimension of, say, 4095 may fall back to CUDA cores or run inefficiently. Padding vocab sizes and hidden dims to multiples of 8/64 is a real, free speedup. (ncu's tensor-core % drops if shapes don't fit.)
TF32 turned off. On Ampere+, FP32 matmuls can transparently use tensor cores via TF32 - but PyTorch has left it off for matmuls by default since 1.12. torch.backends.cuda.matmul.allow_tf32 = True (or, equivalently, torch.set_float32_matmul_precision("high")) flips it on, speeding up even "FP32" code with a small accuracy cost: TF32 keeps FP32's range but only 10 mantissa bits.
An accidental FP32 cast mid-model. A custom layer that does .float() somewhere forces that matmul back to FP32, and you'd never know without profiling. The tensor-core metric catches it.
Ops autocast doesn't cover. Some custom or unusual ops aren't autocast-eligible and stay FP32. Visible as FP32 FMA still dominating in the report.
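
For the bad-shapes trap, the fix is usually a round-up at model-definition time. This is a minimal sketch of that helper; 4095 comes from the example above, while the vocabulary and hidden sizes are illustrative. Which multiple actually matters depends on the GPU, dtype and library version, so treat 8 and 64 as starting points and confirm with the profiler.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    if n < 0 or multiple <= 0:
        raise ValueError("n must be >= 0 and multiple must be > 0")
    return ((n + multiple - 1) // multiple) * multiple


# (label, original size) pairs; sizes other than 4095 are illustrative.
for name, size in [("inner dim", 4095), ("vocab", 50257), ("hidden", 4096)]:
    print(f"{name}: {size} -> x8: {round_up(size, 8)}, x64: {round_up(size, 64)}")
```

Padded vocabulary rows are never the target of any token, so they only need to be ignored (or masked) when sampling. Padded hidden dimensions change the model's shape and must be decided before training starts.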

The discipline: don't assume mixed precision is working because you wrote the autocast block. Measure the tensor-core percentage. The config is a request; the hardware metric is the truth - exactly the lesson of every investigation in this set.

Step 5: the precision ladder (what you're trading)

Why not always use the lowest precision? Because precision trades speed for numerical range/accuracy (the Numerics deep dive). The ladder, fastest-and-riskiest last:

FP32  - CUDA-core path; roughly 8x or more below tensor-core peak matmul throughput, full stability. The safe baseline.
TF32  - Ampere+ FP32 path via tensor cores; near-free speedup, tiny accuracy cost. Almost always worth turning on.
BF16  - wide exponent (same range as FP32), reduced mantissa. The modern default for training -
        stable, fast, rarely needs loss scaling. Use this.
FP16  - narrow exponent (underflow risk) -> needs GradScaler. The only 16-bit tensor-core format on pre-Ampere cards (V100, T4). Fiddlier.
FP8   - Hopper+ (H100). 2x over BF16 again, but needs careful scaling per-tensor (the deep dive's
        FP8 section). The frontier for the largest models.

The practical answer for training in 2026: BF16 autocast, TF32 always enabled, FP8 only on H100 with care. For inference, you go further down (INT8/INT4 quantization - the Quantization deep dive). Each step down the ladder, you re-profile to confirm tensor cores engage and you re-check the loss/eval to confirm accuracy held. Speed and stability are the two axes; the tensor-core metric measures the speed half, your eval measures the stability half.
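
The range and precision behind each rung follow directly from the bit layout. This small sketch derives them from the exponent and mantissa widths, using the standard widths for these formats. TF32 is stored in 32 bits, but only 10 mantissa bits take part in the multiply. FP8 is left out because it comes in more than one layout.

```python
# name: (exponent bits, explicit mantissa bits)
FORMATS = {
    "FP32": (8, 23),
    "TF32": (8, 10),
    "BF16": (8, 7),
    "FP16": (5, 10),
}


def describe(exp_bits: int, man_bits: int) -> tuple[float, float, float]:
    """Return (largest finite value, smallest normal value, machine epsilon)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -man_bits          # spacing between 1.0 and the next value
    return max_finite, min_normal, epsilon


for name, (exp_bits, man_bits) in FORMATS.items():
    max_finite, min_normal, epsilon = describe(exp_bits, man_bits)
    print(f"{name}: max {max_finite:.3e}  min normal {min_normal:.3e}  eps {epsilon:.3e}")
```

The three 8-bit-exponent formats share the same range and differ only in epsilon. FP16 trades range for a finer epsilon than BF16, and that narrow range is what loss scaling compensates for.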

Now you do it (on a tensor-core GPU)

Profile an FP32 training step with ncu --set full -k "regex:gemm" -c 1 and find Tensor (FP). Confirm it's ~0% - tensor cores idle.
Wrap the step in autocast(device_type="cuda", dtype=torch.bfloat16). Re-profile. Watch Tensor (FP) jump and step time drop. Record the speedup.
Enable TF32 (torch.set_float32_matmul_precision("high")) on an otherwise-FP32 model (Ampere or newer only) and confirm even "FP32" matmuls now show nonzero tensor-core usage.
Deliberately add a .float() cast inside the model's forward. Re-profile and watch the tensor-core % for that matmul drop - see how one stray cast silently disables the fast path.
Confirm the loss curve is essentially unchanged between FP32 and BF16 over a few hundred steps - speed gained, accuracy held.
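
For the last exercise, "essentially unchanged" is easier to judge with a number than by squinting at two plots. One possible check, sketched under assumptions: both runs logged one positive loss value per step, for the same steps, with the same seed and data order. The window size and any tolerance you compare against are choices for you to make, not established thresholds.

```python
def smoothed(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early entries average over what is available."""
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


def max_relative_gap(reference: list[float], candidate: list[float],
                     window: int = 20) -> float:
    """Largest relative gap between the two smoothed loss curves.

    Assumes losses are positive, so the reference never smooths to zero.
    """
    if len(reference) != len(candidate):
        raise ValueError("both runs must log the same number of steps")
    ref_s = smoothed(reference, window)
    cand_s = smoothed(candidate, window)
    return max(abs(r - c) / abs(r) for r, c in zip(ref_s, cand_s))


# Synthetic stand-ins for two logged runs (not real training data).
fp32_losses = [2.0 / (1 + 0.01 * step) for step in range(300)]
bf16_losses = [1.01 * loss for loss in fp32_losses]

print(f"max relative gap: {max_relative_gap(fp32_losses, bf16_losses):.4f}")
```

Smoothing matters because per-step losses are noisy, and raw step-by-step differences would mostly measure that noise.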

What you might wonder

"BF16 vs FP16 - which should I use?" BF16 for training, almost always. It has the same exponent range as FP32 (8 bits), so gradients don't underflow and you usually don't even need loss scaling - far fewer "loss went to NaN" surprises. FP16 has more mantissa precision but a narrow exponent (5 bits) that overflows/underflows easily, requiring GradScaler and care. FP16 was the only option on older cards (pre-Ampere); on modern hardware BF16 is the default. (Inference is different - FP16 is common there since the dynamic range is smaller.)

"Is the speedup always ~4x?" No - it depends on how matmul-dominated your model is (the nsys investigation). A transformer (mostly matmuls) sees large speedups; a model bottlenecked on elementwise ops, dataloading, or tiny kernels sees less, because the tensor cores only accelerate the matrix math. This is why you profile the whole step (nsys) first to confirm matmuls dominate, then optimize their precision (ncu/this page). Optimizing tensor-core usage on a dataloader-bound run does nothing - fix the bottleneck nsys identified first.

"Will reduced precision hurt my model's accuracy?" Usually negligibly with BF16 + FP32 master weights (the standard recipe keeps the sensitive parts in FP32). TF32 is essentially free. FP8 needs real care (per-tensor scaling) and you must validate. The rule: drop precision for speed, but always re-check eval metrics - the tensor-core % tells you it's faster; only your validation set tells you it's still correct. Both matter.

"How does this relate to quantization?" Mixed precision (this page) is about training speed - keeping matmuls in BF16/FP16 for tensor cores while preserving training stability. Quantization (the deep dive) is mostly about inference - compressing a trained model to INT8/INT4 to fit more in memory and serve faster. Same underlying idea (lower precision = faster + smaller, at an accuracy cost), different stage and different precisions. Both are ultimately about feeding the hardware's fast low-precision paths.

What this gave you

You know tensor cores are the fast math path and only engage for reduced precision (FP16/BF16/TF32/FP8).
You can read the Tensor (FP) metric in ncu to prove whether tensor cores are actually working - 0% means you're silently on the slow path.
You enabled BF16 autocast correctly and watched tensor-core utilization jump 0% -> 83% with a ~4x speedup.
You know the silent disablers (bad shapes, TF32 off, stray FP32 casts, non-autocast ops) and how the metric catches them.
You understand the precision ladder (FP32/TF32/BF16/FP16/FP8) and the speed-vs-stability trade at each rung.
You re-check eval, not just speed - the config is a request, the hardware metric and the validation set are the truth.

Back to the Distributed Training month, the Numerics and Mixed Precision deep dive, or revisit the ncu kernel investigation for the compute-vs-memory frame this builds on.