Worked investigation - Watch the OOM killer fire
Companion to Linux Kernel -> Month 02 -> Week 6: Swapping, OOM, Memory Pressure (PSI). The chapter tells you PSI and the OOM killer exist. This page makes you watch them happen on your own machine, read the exact output, and walk out able to diagnose the single most common production memory incident: "my process died and I don't know why."
Everything here runs on any modern Linux box (kernel 5.2+ for PSI, cgroups v2). No special hardware. ~40 minutes.
The symptom you're learning to diagnose
You will hit this in production:
A service gets killed. The dashboard says it was using 1.6 GB and its limit was 2 GB - 80%, not full. free showed plenty of memory on the host. Yet the kernel killed it. The on-call engineer stares at a green memory graph and a dead process and has no idea why.
By the end of this page you'll know exactly what happened, where to look, and which knob to turn. We're going to reproduce it deliberately.
Step 0: the two facts that explain everything
Before the terminal, two facts the survey chapter states but doesn't make tangible. Hold them:
- "Used memory" is two very different things. Anonymous memory (your program's heap, stacks - things with no file backing) cannot simply be dropped under pressure; it can only be swapped out, or freed by killing its owner. Page cache (file contents the kernel keeps around) can be dropped almost instantly when it is clean. free lumps them together, which is why it lies to your intuition. The OOM killer fires when a demand for anonymous memory can't be satisfied, regardless of how much page cache is around.
- The OOM killer's job is to pick a victim, not to single out the process that made the last request. When a memory cgroup hits its hard limit and reclaim fails, the kernel kills the process with the highest oom_score in that cgroup. The victim may not be the process that asked for the last byte.
Keep these two in mind; the terminal is about to make them concrete.
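To make fact 1 concrete, here is a minimal Python sketch that splits a cgroup's usage into anonymous and file-backed bytes. It assumes the cgroup v2 memory.stat layout of one "name value" pair per line, with anon and file reported in bytes. The sample text and the helper names (parse_memory_stat, anon_share) are made up for illustration.

```python
def parse_memory_stat(text):
    """Parse cgroup v2 memory.stat text ("name value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def anon_share(stats):
    """Fraction of anon + file bytes that is anonymous (cannot just be dropped)."""
    anon = stats.get("anon", 0)
    file_backed = stats.get("file", 0)
    total = anon + file_backed
    return anon / total if total else 0.0


# Hypothetical sample, not captured from a real machine.
SAMPLE_STAT = """\
anon 419430400
file 104857600
kernel_stack 65536
"""

stats = parse_memory_stat(SAMPLE_STAT)
print(f"anon={stats['anon'] >> 20} MiB, file={stats['file'] >> 20} MiB, "
      f"anon share={anon_share(stats):.0%}")
```

On the lab machine you would feed it the contents of /sys/fs/cgroup/oomlab/memory.stat instead of the sample string. A high anon share means reclaim has little it can drop.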
Step 1: create a constrained cgroup
We'll box a process into 512 MB so we don't need to exhaust your whole machine. (This is exactly what a container runtime does - a Kubernetes pod limit is a cgroup memory.max.)
$ sudo mkdir /sys/fs/cgroup/oomlab
$ echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ echo "512M" | sudo tee /sys/fs/cgroup/oomlab/memory.max
$ echo "400M" | sudo tee /sys/fs/cgroup/oomlab/memory.high
memory.high (400M) is the soft throttle - the kernel slows allocations and reclaims hard when you cross it. memory.max (512M) is the hard cliff - cross it and something dies. The gap between them is where you'll watch the struggle before the kill.
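The two thresholds can be pictured as three bands. The sketch below is only an illustration of that picture, not how the kernel implements it: it converts the "512M" style values written above into bytes (assuming the K/M/G suffixes are powers of 1024) and names the band a given usage falls in. The function names and band labels are hypothetical.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_limit(value):
    """Convert '512M' or '536870912' to bytes; the literal 'max' means no limit."""
    value = value.strip()
    if value == "max":
        return None
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)


def zone(current, high, maximum):
    """Name the band that `current` bytes falls in relative to high and max."""
    if maximum is not None and current >= maximum:
        return "at_max"      # reclaim must succeed or the OOM killer runs
    if high is not None and current > high:
        return "throttled"   # allocations slowed, aggressive reclaim
    return "below_high"      # no back-pressure


high = parse_limit("400M")
maximum = parse_limit("512M")
for used_mib in (300, 450, 512):
    print(used_mib, "MiB ->", zone(used_mib << 20, high, maximum))
```

Reading memory.current, memory.high, and memory.max from the cgroup directory and passing them through these helpers would tell you which band the lab process is in at any moment.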
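Step 2: open the pressure gauge
In a second terminal, watch PSI for the cgroup live by re-reading /sys/fs/cgroup/oomlab/memory.pressure about once a second (watch with cat is enough). This is the gauge the survey chapter described abstractly - now you see it move.
Right now, with nothing running in the cgroup, you'll see a some line and a full line with every average at 0.00 and total=0.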
Decode every field, because this is the output you'll stare at in an incident:
- some - the percentage of wall-clock time at least one task in the cgroup was stalled waiting for memory (blocked on reclaim instead of doing work).
- full - the percentage of time every runnable task was stalled simultaneously - i.e. the cgroup got zero useful work done because it was all waiting on memory. full climbing is the "we are in real trouble" signal.
- avg10 / avg60 / avg300 - those stall percentages averaged over the last 10, 60, and 300 seconds. avg10 reacts fast; avg300 shows a sustained trend.
- total - cumulative microseconds stalled since the cgroup was created. Monotonic; rate-of-change is the live signal.
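If you would rather have these fields as numbers than as text, a small parser is enough. This is a sketch that assumes the two-line "some ... / full ..." layout described above; the function name parse_pressure and the sample values are illustrative.

```python
def parse_pressure(text):
    """Parse memory.pressure text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # total is an integer count of microseconds; the averages are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Sample lines in the memory.pressure format.
SAMPLE_PRESSURE = """\
some avg10=18.43 avg60=4.21 avg300=0.83 total=2147483
full avg10=11.02 avg60=2.55 avg300=0.49 total=1203847
"""

psi = parse_pressure(SAMPLE_PRESSURE)
print("some.avg10 =", psi["some"]["avg10"])
print("full.total =", psi["full"]["total"], "us")
```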
A healthy service sits near avg10=0. Anything above ~5-10 on avg10 for some, sustained, means memory is actively hurting you even if nothing has died yet. That's the number most teams don't watch and wish they had.
Step 3: apply pressure, watch the gauge climb
Back in the first terminal, put a memory eater into the cgroup. A tiny Python loop that grabs memory and holds it:
$ echo $$ | sudo tee /sys/fs/cgroup/oomlab/cgroup.procs # put THIS shell in the cgroup
$ python3 -c '
import time
chunks = []
for i in range(2000):
    chunks.append(bytearray(1024 * 1024))  # 1 MB of anonymous memory, untouchable by reclaim
    if i % 50 == 0:
        print(f"{i} MB allocated", flush=True)
    time.sleep(0.02)
'
Watch the second terminal as it runs. As allocation crosses ~400 MB (memory.high), the PSI gauge comes alive:
some avg10=18.43 avg60=4.21 avg300=0.83 total=2147483
full avg10=11.02 avg60=2.55 avg300=0.49 total=1203847
This is the tangible version of "PSI reports stall time." The process is now spending ~18% of its time stalled on memory reclaim instead of running - you can watch total race upward. The allocator is being throttled at memory.high; the kernel is frantically trying to reclaim (there's nothing reclaimable - it's all anonymous), and the process crawls. This is what a memory-starved service feels like before it dies: not a clean crash, a slow strangle. The PSI numbers are the only metric that shows it; CPU and "memory used" both look unremarkable.
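Because total is a cumulative microsecond counter, two readings taken a known interval apart give the stall percentage for exactly that window, without waiting for the averages to catch up. A minimal sketch of the arithmetic, with hypothetical readings and a hypothetical helper name:

```python
def stall_percent(total_before_us, total_after_us, interval_s):
    """Percent of the interval spent stalled, from two readings of `total`."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_us = total_after_us - total_before_us
    return 100.0 * delta_us / (interval_s * 1_000_000)


# Hypothetical: `some total` read twice, 2 seconds apart.
before_us = 2_147_483
after_us = 2_507_483
print(f"{stall_percent(before_us, after_us, 2.0):.1f}% of the last 2 s stalled")
```

Here the counter advanced by 360,000 microseconds in a 2 second window, so 0.36 s of the 2 s was spent stalled.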
Step 4: cross the cliff - the kill
The allocator keeps pushing past memory.high toward memory.max (512M). When anonymous memory can't be reclaimed and the hard limit is hit, the OOM killer fires. The Python process in your first terminal dies, and the shell prints Killed.
Killed (and exit code 137 = 128 + signal 9) is the process being SIGKILLed by the OOM killer. Now read why - this is the line every on-call engineer needs to find:
[ 4521.88] Memory cgroup out of memory: Killed process 18472 (python3)
total-vm:531204kB, anon-rss:511820kB, file-rss:2304kB, shmem-rss:0kB,
UID:1000 pgtables:1080kB oom_score_adj:0
Decode it, field by field:
- Memory cgroup out of memory - the kill was cgroup-local, not host-wide. The host had plenty of RAM. The cgroup's memory.max was the wall. This is the answer to the opening mystery: the pod "at 80%" was at 80% of a number that wasn't the real ceiling, or the metric was sampling RSS while a spike crossed max between samples.
- anon-rss:511820kB - about 500 MiB of anonymous memory (511820 / 1024), just under the 512M limit. The unreclaimable kind from Step 0, fact 1. This is what killed it. Note file-rss is tiny - page cache wasn't the problem, and wouldn't have been (it's droppable).
- oom_score_adj:0 - the victim's score adjustment. Default 0. We're about to change this and watch a different process die.
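When you have to pull these fields out of many log lines, a regular expression does the decoding for you. The sketch below assumes the message layout shown in the sample above (real kernels emit it on one line; whitespace is collapsed first). The function name and regexes are illustrative and may need adjusting for other kernel versions.

```python
import re

HEAD_RE = re.compile(
    r"Memory cgroup out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)
FIELD_RE = re.compile(r"(?P<key>[A-Za-z_-]+):(?P<value>-?\d+)(?:kB)?")


def parse_oom_line(text):
    """Return the kill's fields as a dict, or None if this is not a cgroup OOM kill."""
    flat = " ".join(text.split())  # undo line wrapping
    head = HEAD_RE.search(flat)
    if head is None:
        return None
    info = {"pid": int(head["pid"]), "comm": head["comm"]}
    # Only scan the key:value pairs that follow the "(comm)" part.
    for match in FIELD_RE.finditer(flat, head.end()):
        info[match["key"]] = int(match["value"])
    return info


SAMPLE_LOG = """\
[ 4521.88] Memory cgroup out of memory: Killed process 18472 (python3)
total-vm:531204kB, anon-rss:511820kB, file-rss:2304kB, shmem-rss:0kB,
UID:1000 pgtables:1080kB oom_score_adj:0
"""

info = parse_oom_line(SAMPLE_LOG)
print(info["comm"], info["pid"], f"anon={info['anon-rss'] / 1024:.1f} MiB",
      f"file={info['file-rss'] / 1024:.1f} MiB", "adj", info["oom_score_adj"])
```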
Step 5: the real-world fix - protect the right process
Now the part that's actually a job skill. Suppose this cgroup runs two processes: a critical one (a server) and an expendable one (a batch job). Under memory pressure you want the batch job to die, not the server. oom_score_adj (range -1000 to +1000) biases the choice.
Re-run the lab with two memory eaters, but tell the kernel to spare the "critical" one:
# In the cgroup, start a "critical" process and protect it:
$ python3 -c 'import time; x=bytearray(200*1024*1024); time.sleep(300)' &
$ echo -800 | sudo tee /proc/$!/oom_score_adj # strongly protect this PID
# Start an expendable hog with default score:
$ python3 -c 'import time; c=[]; \
[c.append(bytearray(1024*1024)) or time.sleep(0.02) for i in range(2000)]'
When the cgroup hits memory.max, watch dmesg again - the kernel kills the expendable hog (oom_score_adj 0), and the critical process (oom_score_adj -800) survives.
In production this is systemd's OOMScoreAdjust=-800 on a critical unit, or Kubernetes priority/QoS classes that set it for you. You've just done by hand what those abstractions do.
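To see why -800 is enough to flip the choice, it helps to model the scoring. The sketch below is a simplified model loosely based on the kernel's badness heuristic, not the kernel's code: it assumes 4 KiB pages, scores a process as its resident pages plus the adjustment scaled by one thousandth of the cgroup limit, and treats -1000 as "never kill". All names and numbers are illustrative.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def badness(rss_bytes, adj, limit_bytes):
    """Simplified score: bigger means more likely to be chosen as the victim."""
    if adj == -1000:
        return None  # exempt from selection
    limit_pages = limit_bytes // PAGE_SIZE
    points = rss_bytes // PAGE_SIZE
    points += adj * (limit_pages // 1000)  # each adj unit is worth 0.1% of the limit
    return points


def pick_victim(procs, limit_bytes):
    """procs maps name -> (rss_bytes, oom_score_adj); returns the chosen name."""
    scores = {name: badness(rss, adj, limit_bytes)
              for name, (rss, adj) in procs.items()}
    eligible = {name: s for name, s in scores.items() if s is not None}
    return max(eligible, key=eligible.get) if eligible else None


LIMIT = 512 << 20
protected = {"critical": (200 << 20, -800), "hog": (300 << 20, 0)}
unprotected = {"critical": (300 << 20, 0), "hog": (200 << 20, 0)}
print("with protection:   ", pick_victim(protected, LIMIT))
print("without protection:", pick_victim(unprotected, LIMIT))
```

In this model an adjustment of -800 subtracts roughly 80% of the cgroup limit from the critical process's score, so it would have to use far more memory than the hog before it became the victim. Without any adjustment, whichever process is larger is chosen, critical or not.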
Step 6: watch the kill happen live (bonus observability)
For the real flex - a one-line bpftrace tool that prints every OOM kill as it happens, across the whole machine:
$ sudo bpftrace -e 'kprobe:oom_kill_process { printf("OOM kill at %s\n", strftime("%H:%M:%S", nsecs)); }'
Re-trigger the lab and watch it fire in real time. This is the kind of always-on tracer you'd leave running on a box you're debugging. (Full eBPF tooling is its own investigation - see the bpftrace page.)
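If bpftrace isn't installed, a cruder, per-cgroup alternative is to watch the oom_kill counter in the cgroup's memory.events file. This is a different technique from the kprobe above: it only sees kills in one cgroup, and only by polling. The sketch assumes the "name value" layout of memory.events; the sample contents and helper names are hypothetical.

```python
def parse_events(text):
    """Parse memory.events text ("name value" per line) into a dict of ints."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def new_oom_kills(before_text, after_text):
    """Number of OOM kills in the cgroup between two reads of memory.events."""
    before = parse_events(before_text).get("oom_kill", 0)
    after = parse_events(after_text).get("oom_kill", 0)
    return after - before


# Hypothetical contents of /sys/fs/cgroup/oomlab/memory.events, read twice.
EVENTS_BEFORE = "low 0\nhigh 1523\nmax 40\noom 0\noom_kill 0\n"
EVENTS_AFTER = "low 0\nhigh 1611\nmax 42\noom 1\noom_kill 1\n"

print("OOM kills since last check:", new_oom_kills(EVENTS_BEFORE, EVENTS_AFTER))
```

Reading the file in a loop and alerting whenever the difference is non-zero gives you a simple kill detector that needs no tracing privileges beyond read access to the cgroup.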
Healthy vs sick, side by side
The diagnostic skill is recognizing the difference at a glance:
HEALTHY cgroup under normal load:
some avg10=0.12 full avg10=0.00
(occasional brief stall, no full)
SICK cgroup about to die:
some avg10=24.7 full avg10=16.3
(a quarter of time stalled, full climbing)
When some.avg10 is low single digits and full is ~0, memory is fine. When some.avg10 is double digits and full is climbing, you have minutes before something dies - act now: raise the limit, add memory, or fix the leak (see the Memory chapter of AI Systems, or the leak section of any of these paths).
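The rule of thumb above translates directly into a check you could put in a monitoring script. This is a sketch only: the exact cut-offs (5 and 10 for some.avg10, 1 for full.avg10) are assumptions chosen to match "low single digits", "double digits", and "~0", and should be tuned per service.

```python
def classify(some_avg10, full_avg10):
    """Rough health label from the two avg10 values of memory.pressure."""
    if some_avg10 >= 10.0 and full_avg10 >= 1.0:
        return "sick"      # double-digit some with real full stalls
    if some_avg10 < 5.0 and full_avg10 < 1.0:
        return "healthy"   # low single digits, full near zero
    return "watch"         # in between: look at the trend


print("healthy example:", classify(0.12, 0.00))
print("sick example:   ", classify(24.7, 16.3))
```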
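Clean up
Exit any shell you placed in the cgroup (or move it back by writing its PID to /sys/fs/cgroup/cgroup.procs), make sure the background Python processes have exited, then remove the lab cgroup with sudo rmdir /sys/fs/cgroup/oomlab. The rmdir fails with "Device or resource busy" while any process is still inside.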
Now you do it
- Recreate the cgroup at memory.max=256M. Run the memory eater. Predict at roughly what allocation it dies, then confirm against dmesg. (It'll be a bit under 256 MB - the process has overhead beyond the bytearrays.)
- While it runs, capture memory.pressure every 0.5s. Find the moment full.avg10 first crosses 10. That's your "page someone" threshold - note how many seconds of warning you got before the kill.
- Set memory.high=200M below memory.max=256M with a slower allocator (time.sleep(0.1)). Watch the process throttle and survive in the high-to-max band for a while - PSI high, but no kill. This is the difference between high (back-pressure) and max (cliff), felt directly.
- Two processes, protect one with oom_score_adj=-900, confirm the other dies. You've reproduced the production "protect the critical service" pattern.
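For the second exercise, once you have a series of timestamped memory.pressure captures, finding the first crossing is a short scan. The sketch below assumes each capture is the raw text of the file paired with a time in seconds; the sample series, the kill time, and the helper names are all hypothetical.

```python
import re

FULL_AVG10_RE = re.compile(r"^full avg10=(\d+(?:\.\d+)?)", re.MULTILINE)


def first_crossing(samples, threshold=10.0):
    """Return the time of the first sample whose full.avg10 reaches threshold."""
    for seconds, text in samples:
        match = FULL_AVG10_RE.search(text)
        if match and float(match.group(1)) >= threshold:
            return seconds
    return None


def warning_seconds(samples, kill_time, threshold=10.0):
    """Seconds between the first crossing and the kill, or None if never crossed."""
    crossed = first_crossing(samples, threshold)
    return None if crossed is None else kill_time - crossed


def capture(some, full):
    """Build memory.pressure-style text for a hypothetical sample."""
    return (f"some avg10={some:.2f} avg60=0.00 avg300=0.00 total=0\n"
            f"full avg10={full:.2f} avg60=0.00 avg300=0.00 total=0\n")


# Hypothetical captures taken every 0.5 s, with the kill observed at t=3.5 s.
samples = [(0.0, capture(0.0, 0.0)), (0.5, capture(6.2, 4.1)),
           (1.0, capture(15.8, 10.52)), (1.5, capture(21.0, 14.0))]
print("first crossing at", first_crossing(samples), "s;",
      "warning:", warning_seconds(samples, kill_time=3.5), "s")
```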
What you might wonder
"Why did free show free memory while my process died?" Because the limit that bound it was the cgroup's memory.max, not the host's RAM. free shows the host. Always check the cgroup limit (/sys/fs/cgroup/<group>/memory.max and memory.current) for a containerized process, not host free. This single confusion causes more wasted incident time than almost anything else.
"Is avg10 over 5 always bad?" No - a brief spike during a known burst (startup, a batch step) is fine. Sustained double-digit some.avg10, or any meaningful full, on a latency-sensitive service is the actionable signal. PSI is about trends, not instants.
"Should I add swap so this doesn't happen?" Swap changes the failure mode from "killed" to "slow" (anonymous memory gets paged to disk instead of triggering a kill). For latency-sensitive services that's often worse - a thrashing-but-alive service can take down a whole cluster via timeouts, where a clean OOM kill + restart is recoverable. Many production systems run swapless deliberately for exactly this reason. zram (compressed RAM swap) is the middle ground. The point: "add swap" is a real tradeoff, not an obvious fix.
"How does this map to Kubernetes?" A pod's memory limit becomes the cgroup memory.max. OOMKilled in kubectl describe pod is exactly the dmesg line you just read. QoS class (Guaranteed/Burstable/BestEffort) sets oom_score_adj for you - BestEffort pods get killed first. You now understand the mechanism under the abstraction.
What this gave you
- You watched PSI move and can read every field of memory.pressure in an incident.
- You know anonymous vs page-cache memory and why the distinction decides who dies.
- You can read an OOM-kill dmesg line and answer "why did it die" definitively.
- You can protect a critical process with oom_score_adj - by hand and via the abstractions.
- You can tell a healthy cgroup from one minutes-from-death at a glance.
- You can explain the "killed at 80%, host had free RAM" mystery cold.
That is the difference between knowing PSI exists and being the person in the room who knows what the graph means. Back to the Week 6 chapter for the conceptual frame, or on to the scheduler investigation.