Worked investigation - Diagnose a GPU out-of-memory with nvidia-smi
Companion to AI Systems -> Month 05 (Inference Systems) and the GPU Architecture deep dive. The curriculum explains the GPU memory hierarchy in theory. This page makes you read GPU memory on a real card, understand the CUDA out of memory error every ML engineer hits weekly, and diagnose what is eating VRAM and why. ~30 minutes. Needs an NVIDIA GPU (any will do: a laptop RTX, a cloud T4/L4/A10G, a rented A100). Use Colab or a rented box if you don't own one.
The symptom you're learning to diagnose
You launch a training run or load a model and get the most common error in all of deep learning:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.69 GiB total capacity; 21.44 GiB already allocated; 1.80 GiB free; 22.11 GiB reserved in total by PyTorch)
The card has 24 GB. You're "only" loading a 7B model. Why is it full? Is it the model, the activations, the optimizer, a memory leak, or another process? Most people just lower the batch size at random and pray. You're going to read what's actually there and fix the real cause.
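The numbers inside that message can be pulled apart mechanically. The sketch below is a minimal, hedged example: `parse_oom_message` and its patterns are hypothetical helpers written for exactly the message layout shown above; other PyTorch versions word this message differently, in which case the patterns would need adjusting.

```python
import re

# Message text copied from the error shown above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.69 GiB total capacity; "
    "21.44 GiB already allocated; 1.80 GiB free; 22.11 GiB reserved in total by PyTorch)"
)

# Hypothetical patterns: they match only the message layout above.
_FIELDS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "total": r"([\d.]+) GiB total capacity",
    "allocated": r"([\d.]+) GiB already allocated",
    "free": r"([\d.]+) GiB free",
    "reserved": r"([\d.]+) GiB reserved in total",
}


def parse_oom_message(message: str) -> dict[str, float]:
    """Extract the GiB figures from an OOM message in the layout above."""
    fields = {}
    for name, pattern in _FIELDS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        fields[name] = float(match.group(1))
    return fields


fields = parse_oom_message(OOM_MESSAGE)
# Reserved minus allocated is memory PyTorch holds in its cache but no tensor uses.
cached_unused = fields["reserved"] - fields["allocated"]
print(f"requested {fields['requested']:.2f} GiB, free {fields['free']:.2f} GiB")
print(f"cached but unused: {cached_unused:.2f} GiB")
```

Here the request (2.00 GiB) is larger than what is free (1.80 GiB), and only about 0.67 GiB sits in PyTorch's cache, so nearly all of the used memory is held by live tensors.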
Step 0: the four things that occupy GPU memory
Before the terminal - the mental model the error message assumes you have. GPU VRAM during training holds four distinct things, and knowing their relative sizes is the whole diagnosis:
- Model weights - params x bytes/param. A 7B model in fp16 = 7e9 x 2 = 14 GB. In fp32, 28 GB. In 4-bit, ~3.5 GB. This is fixed and predictable.
- Optimizer state - the silent killer. Adam keeps two extra fp32 values (momentum + variance) per parameter. For a 7B model that's 7e9 x 2 x 4 = 56 GB - four times the fp16 weights. This is why "the model fits but training OOMs."
- Gradients - one per parameter, same size as the weights (~14 GB fp16 for 7B).
- Activations - intermediate tensors saved for the backward pass. Scales with batch size x sequence length - the only one you control at runtime, which is why lowering batch size "works."
The error says "21.44 GiB already allocated." The diagnosis is figuring out which of these four dominates. Inference is just model weights plus small activations; training is all four, and optimizer state usually dwarfs the rest.
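The arithmetic in the list above can be written as a small estimator. This is a rough sketch under the same assumptions as the text (fp16 weights and gradients, Adam with two fp32 values per parameter); the function name and defaults are illustrative, and activations are left out because they depend on batch size and sequence length.

```python
def estimate_training_memory_gb(
    n_params: float,
    weight_bytes: int = 2,      # fp16 weights
    grad_bytes: int = 2,        # fp16 gradients, same size as the weights
    adam_state_bytes: int = 8,  # momentum + variance, 4 bytes each (fp32)
) -> dict[str, float]:
    """Rough per-consumer estimate in GB (1e9 bytes); activations excluded."""
    return {
        "weights": n_params * weight_bytes / 1e9,
        "gradients": n_params * grad_bytes / 1e9,
        "optimizer_state": n_params * adam_state_bytes / 1e9,
    }


estimate = estimate_training_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:16s} {gb:6.1f} GB")
print(f"{'total':16s} {sum(estimate.values()):6.1f} GB (before activations)")
```

For a 7B model this gives 14 GB of weights, 14 GB of gradients, and 56 GB of optimizer state, which is why a model that loads fine for inference can still run out of memory the moment training starts.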
Step 1: read the card - nvidia-smi
The fundamental tool. Run it:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.x Driver Version: 550.x CUDA Version: 12.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 34C P0 58W / 300W | 21440MiB / 24258MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
|=============================================================================|
| 0 N/A N/A 4127 C python train.py 21436MiB |
+-----------------------------------------------------------------------------+
Decode the fields that matter:
- Memory-Usage: 21440MiB / 24258MiB - 21.4 GB used of 24 GB. This lines up with the error's "already allocated." The card really is nearly full.
- GPU-Util: 97% - a kernel was running on the GPU for 97% of the sample window. (Important: this is not memory - a common confusion. High GPU-Util + high memory = working hard. Low GPU-Util + high memory = memory is held while the GPU sits idle or stalled, a different problem.)
- Pwr: 58W / 300W - drawing only 58 W of 300 W. This is a red flag - 97% "util" but low power means the GPU is busy waiting (often on memory or the CPU feeding it), not doing heavy math. We'll come back to this in the nsys investigation.
- The Processes table - PID 4127 (python train.py) holds 21436MiB. This is the key: it tells you which process owns the memory. If you see two processes, a leftover one is stealing VRAM - the most common "why is my GPU full when I haven't launched anything" cause.
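The utilization-versus-power red flag described above can be checked with a few lines of code. This is a hedged sketch: the query fields `utilization.gpu`, `power.draw`, and `power.limit` are standard nvidia-smi ones, but `looks_busy_but_idle` and its thresholds are hypothetical and not tuned for any particular card.

```python
def parse_gpu_row(csv_row: str) -> tuple[float, float, float]:
    """Parse one `utilization.gpu, power.draw, power.limit` row (nounits)."""
    util, draw, limit = (float(field) for field in csv_row.split(","))
    return util, draw, limit


def looks_busy_but_idle(util_pct, power_w, power_limit_w,
                        util_threshold=90.0, power_fraction=0.4):
    """True when utilization is high but power is a small share of the limit.

    Thresholds are illustrative assumptions, not vendor guidance.
    """
    return util_pct >= util_threshold and power_w < power_fraction * power_limit_w


# One row as produced by:
#   nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit \
#              --format=csv,noheader,nounits
row = "97, 58.00, 300.00"  # values taken from the table above
util, draw, limit = parse_gpu_row(row)
print(looks_busy_but_idle(util, draw, limit))
```

With the values from the table above (97% util, 58 W of 300 W) the check returns True, which matches the reading that the GPU is waiting rather than doing heavy math.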
Step 2: the most common real cause - a zombie process
Run nvidia-smi and see this:
| Processes: |
| 0 N/A N/A 3088 C python 8000MiB | <- ???
| 0 N/A N/A 4127 C python train.py 13436MiB |
+-----------------------------------------------------------------------------+
There's a second python (PID 3088) holding 8 GB - a crashed or detached previous run (a Jupyter kernel you forgot, a killed training that didn't release VRAM, an IPython session). It's eating memory your real run needs. The fix is not "lower batch size" - it's to kill the zombie:
$ kill -9 3088
$ nvidia-smi --query-gpu=memory.used,memory.free --format=csv
memory.used [MiB], memory.free [MiB]
13436 MiB, 10822 MiB # 8 GB freed - the OOM is gone
Always check the process table before touching your model config. A huge fraction of GPU OOMs are leftover processes, not your code. (On a shared/cloud box, also check you're not sharing the GPU with someone else's job.)
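The process-table check can also be scripted. The sketch below assumes the `--query-compute-apps` CSV output of nvidia-smi; the helper names are hypothetical, and the sample text uses illustrative values so the parsing logic runs on a machine without a GPU.

```python
import csv
import io
import subprocess

# Real nvidia-smi query; the helper functions below are hypothetical.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def query_compute_apps() -> str:
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_compute_apps(csv_text: str) -> list[dict]:
    """Parse `pid, process_name, used_memory` rows; memory is in MiB."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (field.strip() for field in record)
        rows.append({
            "pid": int(pid),
            "name": name,
            # Some environments report a non-numeric placeholder here.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def unexpected_processes(rows: list[dict], expected_pids: set[int]) -> list[dict]:
    """Return every GPU process whose PID you did not expect to see."""
    return [row for row in rows if row["pid"] not in expected_pids]


# Illustrative sample; on a real machine use: csv_text = query_compute_apps()
sample = "3088, /usr/bin/python3, 8000\n4127, /usr/bin/python3, 13436\n"
rows = parse_compute_apps(sample)
for row in unexpected_processes(rows, expected_pids={4127}):
    print(f"PID {row['pid']} ({row['name']}) holds {row['used_mib']} MiB")
```

Running a check like this before a launch makes leftover processes visible without reading the table by eye; whether to kill them is still your call, especially on a shared machine.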
Step 3: watch memory grow live - find a leak
If it's not a zombie, watch your run's memory over time. Stream nvidia-smi:
$ nvidia-smi --query-gpu=memory.used --format=csv,noheader -l 1
13436 MiB
14102 MiB
14780 MiB <- climbing every second...
15455 MiB
16130 MiB <- ...and never coming back down = a leak
Memory that climbs monotonically across training steps and never plateaus is a leak - in PyTorch, almost always one of two classic bugs:
# LEAK 1: accumulating tensors that keep the autograd graph alive
losses.append(loss) # BAD - loss carries the whole computation graph
losses.append(loss.item()) # FIX - .item() extracts a plain float, frees the graph
# LEAK 2: holding references across steps
self.history.append(output) # BAD if output is a CUDA tensor never freed
self.history.append(output.detach().cpu()) # FIX - detach + move off GPU
Memory that rises to a plateau and stays flat is not a leak - that's normal (weights + optimizer + steady activations). Memory that spikes on the first backward pass is activations (lower batch size or use gradient checkpointing). The shape of the curve tells you which of the four memory consumers is the problem. This live-streaming view is the GPU equivalent of running watch free in the Linux page-cache investigation - you watch the resource move and read the pattern.
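Reading the curve shape can be approximated in code. The classifier below is a hypothetical heuristic, not a standard tool: it takes a list of `memory.used` samples in MiB (one per second, as streamed above) and labels the shape, with a tolerance and window size that are illustrative assumptions.

```python
def classify_memory_curve(samples_mib, tolerance_mib=64, tail=3):
    """Label a series of memory.used samples (MiB) by its shape.

    Hypothetical heuristic: thresholds are illustrative, not tuned.
    """
    if len(samples_mib) < tail + 1:
        return "not enough samples"
    # Change between consecutive samples.
    deltas = [b - a for a, b in zip(samples_mib, samples_mib[1:])]
    recent = deltas[-tail:]
    if all(abs(d) <= tolerance_mib for d in recent):
        return "plateau"
    if all(d > tolerance_mib for d in deltas):
        return "monotonic climb (possible leak)"
    return "mixed (spikes or still warming up)"


# The climbing samples from the stream above.
print(classify_memory_curve([13436, 14102, 14780, 15455, 16130]))
# A made-up run that ramps up and then settles.
print(classify_memory_curve([9000, 13000, 13436, 13440, 13436, 13438]))
```

The first series is labeled a monotonic climb and the second a plateau. A real run needs many more samples than this before a climb can be called a leak, since memory also grows during warm-up.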
Step 4: see PyTorch's own accounting
nvidia-smi shows the whole card; PyTorch can tell you what it allocated, broken down:
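import torch
allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
print(f"{allocated / 1e9:.1f} GB actually in use by tensors")
print(f"{torch.cuda.memory_reserved() / 1e9:.1f} GB reserved by PyTorch's caching allocator")
# Report the peak relative to the current allocation
print(f"{(torch.cuda.max_memory_allocated() - allocated) / 1e9:.1f} GB peak above current")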
13.9 GB actually in use by tensors
21.4 GB reserved by PyTorch's caching allocator <- matches nvidia-smi
2.1 GB peak above current
The gap between allocated (13.9) and reserved (21.4) explains a confusing part of the error message ("reserved in total by PyTorch"): PyTorch's caching allocator grabs VRAM from the driver and keeps it for reuse (faster than asking the driver each time), so nvidia-smi shows the reserved amount even when tensors use less. This is why torch.cuda.empty_cache() exists (it returns reserved-but-unused memory to the driver) - and why it rarely helps your own OOM (the memory is reserved because you're using it). For the full breakdown of PyTorch's allocator statistics:
print(torch.cuda.memory_summary())
Step 5: the fixes, matched to the cause
Now the diagnosis pays off - each cause has a different fix, and you no longer guess:
| What the data showed | The real fix |
|---|---|
| A second process in the table | Kill the zombie (Step 2) - not a code change at all |
| Monotonic climb, never plateaus | Fix the leak: .item(), .detach(), don't accumulate graph tensors (Step 3) |
| Optimizer state dominates (training a big model) | 8-bit optimizer (bitsandbytes.optim.AdamW8bit), or ZeRO/FSDP sharding (Month 4) |
| Spikes on backward pass (activations) | Lower batch size, or gradient checkpointing (trade compute for memory), or shorter sequences |
| Model weights alone don't fit | Quantize (4-bit/8-bit - the Quantization deep dive), or model parallelism (Month 4) |
"Lower the batch size" is only correct for one row of that table. Reading the memory first tells you which row you're in.
Now you do it (on any NVIDIA GPU)
- Run nvidia-smi on your GPU. Read the memory usage, GPU-util, power, and the process table. Identify every process holding VRAM.
- Load a model in PyTorch and move it to the GPU (AutoModel.from_pretrained("gpt2").to("cuda") is small and fits anywhere). Print torch.cuda.memory_allocated() before and after. That delta is the model weights - check it against params x bytes/param (4 bytes if the weights load as fp32, 2 bytes if they load as fp16).
- Stream nvidia-smi --query-gpu=memory.used --format=csv,noheader -l 1 while running a few training steps. Is the curve flat (healthy) or climbing (leak)? Deliberately introduce losses.append(loss) (no .item()) and watch it climb.
- Trigger an OOM on purpose (giant batch size), read the full error message, and map "already allocated / reserved / free" to what you saw in nvidia-smi and memory_summary().
What you might wonder
"GPU-Util is 100% but training is slow - is the GPU the bottleneck?" Not necessarily, and this is the deepest nvidia-smi trap. "GPU-Util" means "a kernel was running during the sample window" - it does not mean the compute units were saturated. A GPU spending all its time on tiny memory-bound kernels, or waiting on the CPU dataloader, shows ~100% util at low power. The real question - "are the tensor cores actually busy?" - needs nsys/ncu (the next two investigations). Low power draw at "100% util" is the giveaway that you're not actually compute-bound.
"What's MIG?" Multi-Instance GPU - newer datacenter cards (A100, H100) can be partitioned into isolated slices, each appearing as its own GPU in nvidia-smi (the GI/CI columns). If you're on a MIG slice you have less memory than the full card; check the slice size. Relevant on shared cluster GPUs.
"Why does nvidia-smi show memory used when my program exited?" Usually a zombie/detached process (Step 2) or, occasionally, a driver not reclaiming after an unclean exit. nvidia-smi process table finds the former; a rare GPU reset (nvidia-smi --gpu-reset, needs no running processes) handles the latter.
"How is this different from system RAM OOM (the Linux OOM investigation)?" Same concept, different memory pool and no OOM-killer mercy: GPU VRAM is much smaller (24-80 GB vs hundreds of GB of RAM), has no swap, and the failure is a hard CUDA out of memory exception rather than a killed process. The diagnostic discipline - read what's actually consuming the memory before you change config - is identical to the Linux page-cache and OOM investigations.
What this gave you
- You can read every field of nvidia-smi - memory, util, power, and crucially the process table.
- You know the four consumers of GPU memory and their relative sizes (optimizer state is the silent giant).
- You catch zombie processes - the most common OOM cause - before touching your model config.
- You can tell a leak (monotonic climb) from healthy steady-state (plateau) from an activation spike, by the shape of the live memory curve.
- You understand PyTorch allocated-vs-reserved and why empty_cache() rarely saves you.
- You match each cause to its specific fix instead of randomly lowering batch size.
Back to the Inference Systems month or the GPU Architecture deep dive, or on to reading an nsys trace to answer "why is my 100%-util GPU slow?"