
Worked investigation - See the page cache (and why free lies)

Companion to Linux Kernel -> Month 02 -> Week 5: Paging and the Page Cache. The chapter explains that the kernel caches file contents in RAM. This page makes you watch that cache fill, watch it make a file read an order of magnitude faster (16x in the run below, more on spinning disks), and resolve the eternal "why is my memory full?!" panic. ~30 minutes, any Linux box.

The symptom you're learning to diagnose

Someone runs free -h, sees "only 200 MB free" out of 32 GB, and panics: "we're out of memory, the server's going to die!" Meanwhile the box is perfectly healthy. Or: a batch job reads the same dataset twice and the second run is mysteriously 100x faster, and nobody can explain it. Both are the page cache, and both confuse people who don't know it exists.

By the end you'll read free correctly, prove the cache speedup with a stopwatch, and never panic at "low free memory" again.

Step 0: the one fact that defuses the panic

When you read a file, the kernel keeps its contents in RAM afterward - the page cache - so the next read is served from memory instead of disk. This cache grows to fill all otherwise-unused RAM, because unused RAM is wasted RAM. Crucially: clean page cache is cheaply reclaimable. The moment a program needs that memory, the kernel drops clean cache pages to make room at almost no cost (only dirty pages, those holding writes not yet flushed, have to reach disk first). So "RAM full of page cache" is not "out of memory" - it's "RAM doing its job." free's scary "low free" number leaves out everything the cache is holding, as if that RAM were gone for good; its available number tells the truth.

Step 1: read free correctly

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        4.2Gi       412Mi       1.1Gi        27Gi        26Gi

The panic-inducing number is free = 412Mi. The number that matters is available = 26Gi. Decode the row:

  • used (4.2Gi) - memory genuinely in use: mostly processes' anonymous memory (the unreclaimable kind from the OOM investigation), plus the kernel's own allocations.
  • buff/cache (27Gi) - page cache + buffers. RAM holding file contents for speed. Reclaimable.
  • free (412Mi) - RAM doing literally nothing. You want this low - idle RAM is wasted.
  • available (26Gi) - the kernel's honest estimate of "how much a new program could get right now" = free + most of buff/cache. This is the only number to watch.

So this box has 26Gi available despite 412Mi "free." Anyone alarmed by the free column is reading the wrong column. That's the whole panic, resolved.
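
free builds its Mem: row from /proc/meminfo, so you can pull the two numbers that matter straight from the source. What follows is a minimal sketch, not part of the original chapter: the function name mem_summary is made up for illustration, and the optional file argument exists only so it can be pointed at a saved copy of meminfo.

```shell
# mem_summary [FILE]: print MemFree and MemAvailable in MiB.
# FILE defaults to /proc/meminfo, where values are reported in kB.
mem_summary() {
    awk '
        $1 == "MemFree:"      { free  = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            printf "free:      %d MiB\n", free  / 1024
            printf "available: %d MiB\n", avail / 1024
        }
    ' "${1:-/proc/meminfo}"
}

# Run against the live system only where /proc/meminfo exists.
if [ -r /proc/meminfo ]; then
    mem_summary
fi
```

If the second line is large, the first one is not worth a second look.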

Step 2: watch the cache fill

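Make a big file, drop the cache, then read it and watch buff/cache jump. If /tmp is a tmpfs on your distro, the file already lives in RAM and drop_caches cannot evict it, so use a disk-backed path instead: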

$ dd if=/dev/zero of=/tmp/big.bin bs=1M count=1024    # a 1 GB file
$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches  # flush all caches (test only!)
$ free -h | grep Mem                                  # note buff/cache now
Mem:  31Gi  4.2Gi  27Gi  1.1Gi  1.2Gi  26Gi           # buff/cache low after drop

$ cat /tmp/big.bin > /dev/null                        # read the whole file once
$ free -h | grep Mem
Mem:  31Gi  4.2Gi  26Gi  1.1Gi  2.2Gi  25Gi           # buff/cache jumped ~1Gi - that's our file

buff/cache rose by ~1 GB - the file you just read is now sitting in RAM. used (real process memory) didn't change at all. You watched the cache fill with your own file. The exact accounting per file is visible too:

$ vmtouch /tmp/big.bin       # if installed: shows how much of the file is cached
           Files: 1
     Directories: 0
  Resident Pages: 262144/262144  1G/1G  100%   # 100% resident
         Elapsed: 0.0124 seconds
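
The jump in buff/cache can also be measured rather than eyeballed by reading the Cached field of /proc/meminfo before and after the read. A small sketch under assumptions: cached_kb and cache_growth_mib are hypothetical helper names, and the usage lines assume the /tmp/big.bin file created above.

```shell
# cached_kb [FILE]: print the Cached value (kB) from a meminfo-format file.
# The exact match on "Cached:" skips the SwapCached line.
cached_kb() {
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# cache_growth_mib BEFORE_KB AFTER_KB: difference between two readings, in MiB.
cache_growth_mib() {
    echo $(( ($2 - $1) / 1024 ))
}

# Usage around a read (commented out so pasting this block does nothing by itself):
#   before=$(cached_kb)
#   cat /tmp/big.bin > /dev/null
#   after=$(cached_kb)
#   cache_growth_mib "$before" "$after"
```

On an otherwise quiet machine the result should land close to the file size in MiB; other activity on the box will add noise.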

Step 3: prove the speedup with a stopwatch

This is the "second run is mysteriously faster" mystery, reproduced on demand. Drop caches, time a cold read, then time a warm read:

$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time cat /tmp/big.bin > /dev/null      # COLD - comes from disk
real    0m3.812s

$ time cat /tmp/big.bin > /dev/null      # WARM - comes from page cache
real    0m0.231s

3.8 seconds cold, 0.23 seconds warm - a 16x speedup, same command, same file. On spinning disk the gap is larger (50-100x); on NVMe smaller but still real. The only difference: the second read never touched the disk, because the first read populated the page cache. This is why "run it twice and it's faster" happens, why benchmarks must drop caches to be honest (the strace/JMH measure-don't-lie lesson applies here too), and why warming the cache before a latency-sensitive workload is a real technique.

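The magnitudes that make it tangible: a page-cache hit is a RAM access (~100 ns); a cold read is a disk access (~100 µs NVMe, ~10 ms spinning). That's 3-5 orders of magnitude. Those are per-access latencies, so the full gap shows up on small random reads; the sequential cat above is limited by throughput, which is why it measured 16x rather than 1000x. The page cache is the kernel quietly turning disk speed into RAM speed for data you touch twice.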
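
The arithmetic behind both claims is quick to check. This is a sketch with hypothetical helper names (speedup, orders_of_magnitude); the inputs are the timings and latencies quoted above, with latencies passed in nanoseconds to keep the numbers whole.

```shell
# speedup COLD_SECONDS WARM_SECONDS: print the cold/warm ratio to one decimal place.
speedup() {
    awk -v cold="$1" -v warm="$2" 'BEGIN { printf "%.1fx\n", cold / warm }'
}

# orders_of_magnitude SLOW FAST: print log10(SLOW / FAST), rounded to a whole number.
orders_of_magnitude() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.0f\n", log(slow / fast) / log(10) }'
}

speedup 3.812 0.231                 # cold vs warm cat timings
orders_of_magnitude 100000 100      # NVMe (~100 us) vs RAM (~100 ns)
orders_of_magnitude 10000000 100    # spinning disk (~10 ms) vs RAM (~100 ns)
```

Feed speedup your own two real times from the stopwatch test to get your machine's ratio.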

Step 4: see it from the process side

/proc/<pid>/status and friends distinguish a process's anonymous memory (its real footprint) from file-backed pages. The key fields when judging "how much memory does this process really need":

$ grep -E 'VmRSS|RssAnon|RssFile' /proc/<pid>/status   # substitute a real PID
VmRSS:    52340 kB     # total resident (anon + file + shmem; shmem is 0 here)
RssAnon:  18204 kB     # anonymous - cannot be dropped, only swapped; the footprint that matters
RssFile:  34136 kB     # file-backed - shared, reclaimable, often counted twice across procs

When someone says "this process is using 52 MB," the number that constrains the system is RssAnon (18 MB) - the part that can't be dropped (tie back to the OOM investigation: anon is what gets you killed). RssFile is largely shared page cache that shows up in many processes' RSS at once, which is why summing RSS across processes wildly overcounts real memory use.
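
To turn those three fields into a single "how much of this is really the process's own" figure, something like the following works. It is a sketch only: rss_breakdown is a made-up name, and it assumes a kernel recent enough to report RssAnon and RssFile in the status file.

```shell
# rss_breakdown FILE: summarize RssAnon vs RssFile from a /proc/<pid>/status-format file.
rss_breakdown() {
    awk '
        $1 == "VmRSS:"   { rss  = $2 }
        $1 == "RssAnon:" { anon = $2 }
        $1 == "RssFile:" { file = $2 }
        END {
            if (rss > 0)
                printf "anon %d kB, file %d kB, anon share %.1f%%\n", anon, file, 100 * anon / rss
        }
    ' "$1"
}

# Example: inspect the current shell ($$ is its PID) where procfs is available.
if [ -r "/proc/$$/status" ]; then
    rss_breakdown "/proc/$$/status"
fi
```

For the sample numbers above the anon share comes out at roughly a third of the reported RSS.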

Healthy vs actually-out-of-memory

The skill is telling "RAM full of cache (fine)" from "RAM full of anonymous (danger)":

FINE - cache-full, healthy:              DANGER - anon-full, near OOM:
free:        300Mi                       free:        300Mi
buff/cache:   27Gi  <- reclaimable       buff/cache:  1.2Gi <- nothing to reclaim
available:    26Gi  <- lots of headroom  available:   800Mi <- almost none!

Both show low free. The first is healthy (huge available, cache will yield on demand). The second is genuinely dangerous: available is tiny and buff/cache is already squeezed to almost nothing, so the kernel has little left to reclaim and a large enough allocation will push the box into heavy swapping or wake the OOM killer. available and buff/cache together tell you which situation you're in; free alone tells you nothing.
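
The same judgment can be scripted as a rough check. Treat this as a sketch, not a monitoring rule: mem_verdict is a hypothetical name, and the 10% default threshold is an arbitrary illustrative value, not something the kernel or the chapter defines.

```shell
# mem_verdict [FILE] [MIN_PCT]: classify by MemAvailable as a percentage of MemTotal.
# MIN_PCT defaults to 10, an arbitrary illustrative threshold (assumption).
mem_verdict() {
    awk -v min="${2:-10}" '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total == 0) exit 1
            pct = 100 * avail / total
            printf "%s (available %.0f%% of total)\n", (pct >= min ? "ok" : "low"), pct
        }
    ' "${1:-/proc/meminfo}"
}

# Usage: mem_verdict                    # live system, 10% threshold
#        mem_verdict /proc/meminfo 5    # stricter threshold
```

Note what the check never looks at: MemFree. Both columns in the comparison above would pass or fail purely on available.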

Clean up

$ rm /tmp/big.bin

(Never run drop_caches on a production box for fun - it evicts every cached file system-wide, making the whole machine slow until caches refill. It's a benchmarking/teaching tool only.)

Now you do it

  1. free -h on your machine. Identify free vs available. Is anyone you work with watching the wrong one?
  2. Reproduce the cold-vs-warm time cat on a ~1 GB file. Record the speedup ratio. That ratio is your disk-to-RAM gap, measured.
  3. Read a file that is not cached yet, running grep '^Cached' /proc/meminfo before and after - watch the number rise by roughly the file size.
  4. Check a real process you run: grep -E 'RssAnon|RssFile' /proc/<pid>/status. How much of its "memory" is actually its own anonymous footprint vs shared cache? Usually less than people assume.

What you might wonder

"Should I try to reduce buff/cache to free up memory?" No - that's exactly the mistake. High buff/cache is good; it's free performance and yields instantly when needed. Trying to keep free high (e.g. periodic drop_caches) just makes everything slower by throwing away useful cache. Watch available, not free, and leave the cache alone.

"Why does my container show the host's huge cache?" Older tools inside a container read host /proc/meminfo. Modern cgroup-v2-aware tools read the cgroup's memory.current/memory.stat (which separates file and anon per cgroup). When debugging container memory, use the cgroup files (/sys/fs/cgroup/.../memory.stat), not host free - same lesson as the OOM investigation.

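For the container case, the anon/file split comes from two lines of the cgroup v2 memory.stat file, both reported in bytes. A minimal sketch: cgroup_anon_file is a hypothetical helper, and you have to supply the path to your own cgroup's memory.stat, since it depends on how the container runtime lays out /sys/fs/cgroup.

```shell
# cgroup_anon_file STAT_FILE: print anon and file bytes from a cgroup v2 memory.stat, in MiB.
# Exact matches on "anon" and "file" skip related keys such as file_mapped or file_dirty.
cgroup_anon_file() {
    awk '
        $1 == "anon" { anon = $2 }
        $1 == "file" { file = $2 }
        END { printf "anon %d MiB, file (page cache) %d MiB\n", anon / 1048576, file / 1048576 }
    ' "$1"
}

# Usage: cgroup_anon_file /sys/fs/cgroup/.../memory.stat   # fill in your cgroup's path
```

The anon figure is the one that counts against the container the way RssAnon does for a single process.
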
"Does writing also use the cache?" Yes - writes go to the page cache first (marked "dirty") and are flushed to disk asynchronously by kernel threads. That's why a write can "finish" instantly then take time to actually hit disk (and why sync and fsync exist - to force the flush, critical for databases). Dirty-page writeback is its own deep topic; the read-cache here is the foundation.

"How does this relate to mmap?" mmap-ing a file maps its page-cache pages directly into your address space - reads become memory accesses with no read() syscall, served from the same cache. It's how databases and the dynamic linker load files efficiently. Same cache, different access path.

What this gave you

  • You read free correctly: available is truth, free is noise, buff/cache is reclaimable.
  • You watched the page cache fill with your own file and measured a 16x+ cold-vs-warm speedup.
  • You know the magnitudes (RAM ~100 ns vs disk ~100 µs-10 ms) that make caching matter.
  • You distinguish a process's real anonymous footprint (RssAnon) from shared cache (RssFile).
  • You can tell "healthy cache-full RAM" from "genuinely near OOM" using available + buff/cache.
  • You'll never panic at "low free memory" again, and you can stop a teammate from doing it.

Back to the Memory & Scheduling month, or on to building a container by hand.
