
Virtual memory

When your program prints the address of a variable as 0x7ffe1234abcd, that address is a lie. There is no DRAM cell at byte 0x7ffe1234abcd in your machine. The number is a virtual address; the CPU and the operating system translate it to a different physical address (the actual location in DRAM) on every load and store. Every process on the system gets its own lie, so two processes can both think they own address 0x7ffe1234abcd without conflict.

Virtual memory is the operating-system feature that makes much of a modern system possible: process isolation, mmap, fork, demand paging, swap, copy-on-write, shared libraries, even the way malloc works. It is one of the highest-leverage abstractions in computing, and understanding it changes the way you read crash dumps, performance traces, and top output.

This page walks the arc: why we have it, how the translation works, what mmap is doing under the hood, the patterns engineers care about, and the production gotchas that come from forgetting it is there.

1. The problem virtual memory solves

A computer in 1965 was simple: there was one program running, addresses in the CPU were physical addresses, and the program could read or write any byte of DRAM. This works for one program; it does not work for multiple:

  • Two programs both want to live at address 0x1000. They cannot - the bytes physically exist in one place.
  • A buggy program writes to address 0x12345678. If that address belongs to the OS kernel or another program, the system is now corrupted.
  • A program is bigger than physical RAM. It cannot run.
  • A program wants to load a shared library. The library has to be relocated to wherever the program's address space has room - either at runtime (slow, complex) or at compile time (rigid).

The fix is a level of indirection: each program is presented with its own private address space, and a translation layer maps those private addresses to wherever the OS chooses to place the data in physical memory. Now:

  • Two programs can each think they live at 0x1000. Their virtual 0x1000s map to different physical addresses.
  • A bug that writes to an unmapped virtual address triggers a hardware fault (a segmentation fault) instead of corrupting other programs.
  • A program's virtual address space can be larger than physical RAM; the OS pages cold parts out to disk (swap) and pages them back in when accessed.
  • Shared libraries map cleanly into each program's address space at whatever virtual address the program wants.

All of this is one mechanism: the page table.

2. Pages and the page table

The unit of virtual-memory bookkeeping is the page: 4096 bytes (4 KB) on x86-64 and on most ARM systems (some ARM64 platforms, such as Apple silicon, use 16 KB). (Linux supports 2 MB and 1 GB "huge pages" too, mostly for databases and JITs; we will get to those.) Each virtual page that is in memory corresponds to a same-sized page frame in physical memory.

The CPU's MMU (memory management unit) holds a pointer to a per-process page table, a tree-structured data structure that maps virtual page numbers to physical page frame numbers. On x86-64 the page table is four levels deep (PML4 → PDPT → PD → PT) because a 48-bit virtual address is broken into four 9-bit indices and a 12-bit page offset:

  bit:  47..39   38..30   29..21   20..12   11..0
        PML4     PDPT     PD       PT       offset within page
         9        9        9        9         12

To translate a virtual address V:

  1. Look up V[47..39] in the PML4 (top-level table) → get the address of a PDPT.

  2. Look up V[38..30] in that PDPT → get the address of a PD.

  3. Look up V[29..21] in that PD → get the address of a PT.

  4. Look up V[20..12] in that PT → get the physical page frame number.

  5. Concatenate the physical frame number with V[11..0] (the offset within the page) → physical address.
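The index arithmetic in the steps above can be sketched in a few lines of C. This is only the bit-slicing half of the walk, assuming 48-bit addresses and 4-KB pages as described here; the names va_parts, split_va, and physical_address are made up for illustration, and a real walk would also read each table from memory.

```c
#include <assert.h>
#include <stdint.h>

/* Indices extracted from a 48-bit x86-64 virtual address (4-KB pages). */
struct va_parts {
    unsigned pml4, pdpt, pd, pt;  /* four 9-bit table indices */
    unsigned offset;              /* 12-bit offset within the page */
};

/* Steps 1-4 each use one of these indices to pick an entry in a table. */
static struct va_parts split_va(uint64_t v) {
    struct va_parts p;
    p.pml4   = (unsigned)((v >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((v >> 30) & 0x1FF);
    p.pd     = (unsigned)((v >> 21) & 0x1FF);
    p.pt     = (unsigned)((v >> 12) & 0x1FF);
    p.offset = (unsigned)(v & 0xFFF);
    return p;
}

/* Step 5: glue a physical frame number onto the page offset. */
static uint64_t physical_address(uint64_t frame_number, unsigned offset) {
    assert(offset < 4096);
    return (frame_number << 12) | offset;
}
```

For the address from the introduction, split_va(0x7ffe1234abcd) gives indices 255, 504, 145, and 330 with offset 0xbcd.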

Each table entry is 8 bytes and each table has 512 entries, so every table is exactly one 4-KB page and the structure fits neatly into pages. The cost: done naively, every memory access would require four extra memory accesses to walk the page table. That would be unacceptable. The fix is the next concept.

3. The TLB: making translation fast

The CPU caches recent virtual-to-physical translations in the TLB (Translation Lookaside Buffer). A modern x86 core has on the order of 64 entries in its L1 data TLB and from 512 to a few thousand in its second-level TLB, depending on the generation. An L1 TLB hit adds essentially no latency to the access; a TLB miss triggers the page-table walk above (which itself goes through the data caches but is dramatically slower).

The TLB is per-CPU and gets flushed every time the OS switches between processes (because each process has its own page table). Recent CPUs have PCID (Process-Context Identifier) to tag TLB entries with their owning process, allowing entries to survive a context switch.

The TLB has a small capacity. A process touching memory across many pages will thrash the TLB - every new page costs a miss. Two patterns make this worse:

  • Random-access workloads like hash tables and graph traversal. Each pointer chase might land on a fresh page; TLB misses dominate the runtime.
  • Large working sets. Walking through 2 GB of data touches 524,288 4-KB pages, far more than the TLB has entries for. Even sequential access takes TLB misses (though far fewer per byte than random access).

Huge pages solve this. A 2-MB huge page replaces 512 4-KB pages in the page table; one TLB entry now covers 2 MB instead of 4 KB. Databases (PostgreSQL, MySQL), JVMs (-XX:+UseLargePages), and some ML frameworks can all be configured to use huge pages, and production deployments often do. The trade-off: huge pages need contiguous physical memory, which can be hard to find after the system has been running for a while.
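The arithmetic behind that claim is small enough to write down. The helpers below are an illustrative sketch (pages_needed and tlb_reach are invented names), and the entry count used in the usage note is a round example figure, not a measurement of any particular CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages (and therefore translations) needed to cover a working set. */
static uint64_t pages_needed(uint64_t working_set_bytes, uint64_t page_bytes) {
    assert(page_bytes > 0);
    return (working_set_bytes + page_bytes - 1) / page_bytes;  /* round up */
}

/* Bytes of memory a TLB with `entries` slots can map at once ("TLB reach"). */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes) {
    return entries * page_bytes;
}
```

With 4-KB pages a 2-GB working set needs pages_needed(2 GB, 4 KB) = 524,288 translations; with 2-MB pages it needs 1,024. Likewise, a hypothetical 512-entry TLB reaches 2 MB with 4-KB pages and 1 GB with 2-MB pages, assuming it can hold that many huge-page entries, which varies by CPU.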

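You can see explicitly reserved huge pages in /proc/meminfo (HugePages_Total, HugePages_Free; the AnonHugePages line counts transparent huge pages). A process can ask for huge pages with madvise(MADV_HUGEPAGE), which is a hint rather than a guarantee, or get them explicitly by mapping a file on a hugetlbfs mount such as /dev/hugepages.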
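A minimal sketch of the madvise route, assuming Linux and 2-MB huge pages. The helper name alloc_thp_hint is hypothetical; the over-allocate-and-trim step is one common way to get a 2-MB-aligned start, not the only one. Whether the kernel actually backs the region with huge pages depends on the system's THP settings and on memory fragmentation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_2MB ((size_t)2 << 20)

/* Reserve `len` bytes of anonymous memory starting on a 2-MB boundary and hint
 * that the kernel may back it with transparent huge pages.
 * Returns NULL on failure. The over-allocated slack is unmapped again. */
static void *alloc_thp_hint(size_t len) {
    assert(len > 0 && len % HUGE_2MB == 0);
    size_t span = len + HUGE_2MB;                 /* extra room to align */
    char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    uintptr_t start = ((uintptr_t)raw + HUGE_2MB - 1) & ~(uintptr_t)(HUGE_2MB - 1);
    char *aligned = (char *)start;
    size_t head = (size_t)(aligned - raw);
    size_t tail = span - head - len;
    if (head) munmap(raw, head);                  /* trim before the aligned start */
    if (tail) munmap(aligned + len, tail);        /* trim after the end */

#ifdef MADV_HUGEPAGE
    /* A hint only: if THP is unavailable the call fails and the region
     * simply stays backed by ordinary 4-KB pages. */
    (void)madvise(aligned, len, MADV_HUGEPAGE);
#endif
    return aligned;
}
```

The caller releases the region with munmap(p, len) as usual.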

4. Page faults

When the CPU accesses a virtual address that is not currently mapped to a physical frame, the MMU raises a page fault - a hardware exception that traps to the OS kernel. The OS then decides what to do:

  • Minor fault: the page exists in memory but is not currently in this process's page table (e.g., shared with another process and just needs to be linked in). The kernel updates the page table and returns. Fast.
  • Major fault: the page needs to be loaded from disk - either from swap (it was paged out) or from a file (the program just touched a mmaped region whose data is not yet in memory). The faulting thread blocks while the disk I/O happens, from tens of microseconds on fast SSDs to milliseconds on spinning disks; the CPU runs other work in the meantime. Slow.
  • Invalid access: the virtual address is not mapped at all, or the access violates the page's permissions (e.g., dereferencing a NULL pointer, or a stale pointer into a region that has since been unmapped). The kernel delivers SIGSEGV to the process, which usually dies. This is what a "segmentation fault" is.

In Linux, /usr/bin/time -v ./your_program reports minor and major faults separately. A program that hammers swap will show a huge "Major (requiring I/O) page faults" count and run slowly. A program that does a lot of mmap and munmap will show many minor faults, each of which is handled without any disk I/O and is cheap by comparison. The distinction matters for performance diagnosis.
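The same counters are available from inside a program through getrusage, whose ru_minflt and ru_majflt fields hold the minor and major fault totals. The sketch below (the function name is made up) touches fresh anonymous pages and reports how many minor faults that caused. On a typical Linux system the result is roughly one per page, but the exact figure depends on the kernel and its configuration, so treat it as an observation rather than a guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Touch `npages` fresh anonymous pages and return how many minor faults the
 * process took while doing so, or -1 on error. */
static long minor_faults_for_touching(size_t npages) {
    assert(npages > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;

    volatile char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;

    struct rusage before, after;
    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap((void *)mem, len);
        return -1;
    }
    for (size_t i = 0; i < npages; i++)
        mem[i * page] = 1;           /* first write to a page faults it in */
    int rc = getrusage(RUSAGE_SELF, &after);

    munmap((void *)mem, len);
    return rc == 0 ? after.ru_minflt - before.ru_minflt : -1;
}
```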

5. mmap: the operation that powers everything

mmap ("memory map") is the system call that creates a new mapping in a process's virtual address space. It is the foundation underneath:

  • Allocators: malloc for large blocks calls mmap to grab a chunk of anonymous memory.
  • Shared libraries: dlopen and the dynamic linker mmap library code into your address space.
  • File I/O: mmap a file and the OS lets you access its bytes as if they were memory, demand-loading pages from disk as you read them.
  • Inter-process communication: two processes can mmap the same shared-memory segment to communicate without copying.
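  • JIT compilers: a JIT writes machine code into a writable region, calls mprotect to make it executable, then executes it.

// Fragment for use inside a function; needs <fcntl.h>, <sys/mman.h>, <sys/stat.h>.
// Every call below can fail (mmap returns MAP_FAILED); checks are omitted for brevity.

// Map a 1 MB anonymous region (like a malloc for one big block)
void *anon = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

// Map a file - reading from data[i] reads from offset i in the file
int fd = open("data.bin", O_RDONLY);
struct stat st;
fstat(fd, &st);  // st.st_size is the file length in bytes
const unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);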

The clever part: the mapping is lazy. mmap does not load anything into memory; it just records "if anyone touches the bytes from p to p + size, here is where to find them." Pages are loaded on demand at the first access (which triggers a page fault that the kernel handles by reading from disk). A 4-GB mmap of a file uses zero RAM until you actually read from the addresses.

This is demand paging - the principle that lets a 100 GB database file open instantly and only load the pages the queries actually touch. It is also why a cat huge_file > /dev/null warms the OS page cache: every page of the file gets read, and the kernel keeps them around in case anything else wants them.
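For a complete, checked version of the file-mapping idea, here is a sketch that sums the bytes of a file through a read-only mapping. The function name sum_mapped_file is invented for this example. Nothing is read from disk at the mmap call; each page is faulted in when the loop first touches it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum every byte of a file through a read-only mapping.
 * Returns the sum, or -1 on error. An empty file sums to 0. */
static long sum_mapped_file(const char *path) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }    /* mmap rejects length 0 */

    const unsigned char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                     MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (data == MAP_FAILED) return -1;

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];              /* first touch of each page faults it in */

    munmap((void *)data, (size_t)st.st_size);
    return sum;
}
```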

6. Copy-on-write and fork

fork() creates a child process that is a duplicate of the parent's address space. Naively, you would have to copy every page - gigabytes of memory for a large process. Instead, virtual memory provides a trick: mark every shared page read-only. Both parent and child point at the same physical pages. When either tries to write to a page, the MMU faults; the kernel allocates a fresh physical page, copies the original contents, updates the writer's page table to point at the new page, and lets the write proceed. Pages that are never written are never copied.

This is copy-on-write (COW), and it is the reason Python's multiprocessing.Pool, Redis's RDB snapshots, and the shell's fork; exec pattern are tolerable. Without COW, every fork() of a 10 GB Postgres process would copy all 10 GB up front, taking seconds and doubling memory usage.

The catch: COW is per-page, not per-allocation. If the child writes to one byte of a page, the entire page is copied. A process with a 1-GB heap that the child writes to in a scattered pattern can quickly duplicate most of the heap, defeating the optimization. Python's reference counting is famously hostile to COW because every "read" of an object updates its refcount, which writes to the page, which triggers a copy.
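The isolation that COW provides can be observed directly. In this sketch (cow_isolates_parent is a made-up name) the parent stores 42 in a private page, forks, and the child overwrites its view with 99. The child's write lands in its own copy of the page, so the parent still reads 42 afterwards.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child's write to a private page left the parent's copy
 * untouched, 0 if not, -1 on error. */
static int cow_isolates_parent(void) {
    int *value = mmap(NULL, sizeof *value, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (value == MAP_FAILED) return -1;
    *value = 42;                       /* page is now resident in the parent */

    pid_t pid = fork();                /* child shares the page until a write */
    if (pid < 0) { munmap(value, sizeof *value); return -1; }
    if (pid == 0) {
        *value = 99;                   /* write fault: child gets its own copy */
        _exit(*value == 99 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        munmap(value, sizeof *value);
        return -1;
    }
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int parent_unchanged = (*value == 42);
    munmap(value, sizeof *value);
    return child_ok && parent_unchanged;
}
```

The program cannot see which physical page backs each view; it only observes the effect, namely that the two processes diverge after the first write.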

7. Swap and the working set

If the OS runs out of physical RAM, it can write rarely-used pages to disk (the swap partition or file), free the physical frames, and reclaim them for more pressing demands. When the original program later touches the swapped-out page, a major page fault loads it back from disk. The user sees the system "thrash" - apparent freezes while the disk reads in pages.

The working set of a process is the set of pages it touches "recently" (over some window). If the working set fits in physical RAM, the process runs at memory speed. If it does not, every additional access likely costs a disk read, and the program slows down by 1000-100,000× depending on whether swap is on SSD or HDD.

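Modern servers often disable swap entirely (swapoff -a) or strongly discourage it with a low vm.swappiness. Note that vm.swappiness=0 only tells the kernel to avoid swapping as long as it can; it does not turn swap off. The reasoning: an OOM kill is bad, but a server running 1000× slower is worse - it is still nominally up, yet the OS may be too starved to recover. Tuning guides for database servers such as Postgres, MySQL, and Cassandra commonly recommend disabling swap or keeping swappiness low.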

8. Memory protection bits

Each page-table entry carries permission bits. The relevant ones for production code:

  • Present: page is currently mapped (vs paged out or unmapped).
  • Writable: write access allowed. A read-only page that gets written triggers a fault (which COW exploits).
  • User: user-space access allowed (vs kernel-only).
  • NX (No eXecute): page cannot be executed as code. Modern systems mark the heap and stack NX to prevent classic stack-overflow exploits where an attacker writes shellcode and jumps to it.

mprotect(addr, len, prot) changes the permissions on a range. JITs use it constantly: allocate a region with PROT_READ | PROT_WRITE, generate code into it, then mprotect it to PROT_READ | PROT_EXEC. The "W^X" rule ("writable XOR executable") says a page should never be both at once. Hardened systems enforce it: OpenBSD, and SELinux policies that deny execmem, reject a PROT_WRITE | PROT_EXEC request. A stock Linux kernel generally still permits it, but JITs follow the two-step pattern anyway for security reasons.
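The permission flip can be demonstrated without emitting real machine code. The sketch below is a simplification: it switches a page from read-write to PROT_READ (a JIT would use PROT_READ | PROT_EXEC at that step) and then checks, in a forked child, that a later write is stopped with SIGSEGV. The function name is hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if, after mprotect(PROT_READ), reads still work and a write kills
 * the writer with SIGSEGV; 0 if not; -1 on error. */
static int write_after_readonly_faults(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return -1;

    strcpy(page, "generated");         /* stand-in for emitted machine code */
    if (mprotect(page, len, PROT_READ) != 0) { munmap(page, len); return -1; }
    int readable = (strcmp(page, "generated") == 0);

    pid_t pid = fork();                /* probe the fault in a child process */
    if (pid < 0) { munmap(page, len); return -1; }
    if (pid == 0) {
        *(volatile char *)page = 'X';  /* expected to raise SIGSEGV */
        _exit(0);                      /* reached only if the write succeeded */
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(page, len);
    if (waited != pid) return -1;
    return readable && WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Probing in a child keeps the example simple: the parent inspects the child's termination signal instead of installing a SIGSEGV handler.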

9. Advanced: things that bite in production

9.1 RSS vs VSZ: what does ps actually mean?

ps aux shows two memory columns: VSZ (virtual size, the size of all mapped regions) and RSS (resident set size, the physical pages currently in RAM). A process with VSZ of 100 GB and RSS of 200 MB is using 200 MB of RAM; the other 99.8 GB is just reserved virtual address space (maybe a huge mmap of a file, or unused arena allocator regions).

The metric that matters for "how much memory am I using" is RSS in nearly all cases. Cloud cost dashboards that bill by VSZ are wrong; OOM killers reasonably target RSS-heavy processes.
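A process can read its own two numbers from /proc/self/statm, whose first two fields are the total mapped size and the resident size, both in pages. This is a Linux-specific sketch; the struct and function names are invented for the example.

```c
#include <stdio.h>
#include <unistd.h>

/* Virtual size and resident set size of the calling process, in bytes. */
struct mem_usage {
    long vsz_bytes;
    long rss_bytes;
};

/* Reads the first two fields of /proc/self/statm (total and resident pages).
 * Returns 0 on success, -1 on failure. */
static int read_mem_usage(struct mem_usage *out) {
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) return -1;

    long size_pages = 0, resident_pages = 0;
    int fields = fscanf(f, "%ld %ld", &size_pages, &resident_pages);
    fclose(f);
    if (fields != 2) return -1;

    long page = sysconf(_SC_PAGESIZE);
    out->vsz_bytes = size_pages * page;
    out->rss_bytes = resident_pages * page;
    return 0;
}
```

Calling it before and after a large mmap that is never touched should show VSZ jump by the size of the mapping while RSS barely moves.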

9.2 Memory overcommit

Linux by default overcommits memory: malloc and mmap can return successfully for more memory than physically exists, on the bet that most processes do not actually touch all of what they allocate. If the bet fails (everyone really does use their allocations), the OOM killer picks a process and terminates it.

You can tune this via /proc/sys/vm/overcommit_memory (0 = heuristic, 1 = always say yes, 2 = strict accounting). Most production servers use 0 or 1. Databases sometimes set 2 to prevent surprise terminations at the cost of more conservative allocation.
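A program that cares about the setting can check it at startup. This small sketch reads the file named above and maps the value to a label; the function names and the label strings are illustrative choices, not kernel terminology.

```c
#include <stdio.h>

/* Returns the current overcommit mode (0, 1, or 2), or -1 if it cannot be read. */
static int overcommit_mode(void) {
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (!f) return -1;
    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1) mode = -1;
    fclose(f);
    return mode;
}

/* Human-readable label for a mode value. */
static const char *overcommit_name(int mode) {
    switch (mode) {
    case 0:  return "heuristic";
    case 1:  return "always";
    case 2:  return "strict";
    default: return "unknown";
    }
}
```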

9.3 The page cache

When you read() a file, Linux first looks in the page cache - a kernel-managed cache of file pages already in RAM. If the data is there, the read is a memcpy from cache to your buffer; no disk I/O. If not, the kernel reads the page from disk into the page cache, then copies to your buffer. The page cache eats all otherwise-free RAM on a busy server.

This is why free shows "buff/cache" eating most of your memory on an idle production box: Linux is using the RAM productively for file caching. The figure to read is the available column, which estimates how much memory applications could still get, counting the cache that can be reclaimed.

mmap-ing a file gives you the page cache directly - the mapped pages and the cached pages are the same physical pages. This is why some databases (LMDB, for example) prefer mmap over read(): there is no copy from cache to a userspace buffer, just direct access to the cached pages.

9.4 Address space layout randomization (ASLR)

Modern OSes randomize the virtual addresses of code, data, stack, and heap on every process start. This is ASLR, and it defeats a class of exploits that depend on knowing the address of a particular gadget or buffer. Without ASLR, a buffer-overflow vulnerability that overwrites a return address can reliably point at known shellcode; with ASLR, the attacker has to guess, with a low probability of success.

ASLR has a downside for debugging: addresses in crash dumps differ across runs. coredumpctl debug opens the dump in gdb, which combines the load addresses recorded in the core with the binary's debug info to map addresses back to symbols, so the user-facing experience is fine. But "the bug always happens at address 0x7f...123" stops being a meaningful statement.

9.5 NUMA

On a multi-socket server, each CPU socket has its own local memory bank. Memory access from a CPU to its local DRAM is fast; access to the other socket's DRAM is much slower (50-100% latency penalty). The OS schedules processes and allocates pages with awareness of this non-uniform memory access (NUMA) topology.

numactl --hardware shows the topology; numactl --cpunodebind=0 --membind=0 ./prog pins a process to one socket and its memory. Databases and HPC workloads frequently care about NUMA placement; web servers usually do not.

9.6 Transparent huge pages can hurt latency

Linux's "transparent huge pages" (THP) feature tries to opportunistically promote groups of 4-KB pages to 2-MB pages. The promotion involves defragmenting physical memory, and that work can stall a process at unpredictable moments, producing long tail-latency spikes.

Latency-sensitive systems (low-latency trading, real-time audio, some databases) disable THP entirely: echo never > /sys/kernel/mm/transparent_hugepage/enabled. The MongoDB, Redis, and Cassandra docs have all recommended this; check the guidance for the version you run, since recommendations change between releases. The throughput-vs-tail-latency tradeoff is real.
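Reading that same file shows the current setting. Its contents list every option and mark the active one with square brackets, for example "always [madvise] never". The sketch below pulls out the bracketed word; both function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copy the bracketed word from a line such as "always [madvise] never" into
 * `out`. Returns 0 on success, -1 if there is no bracketed word or it does
 * not fit in `out_len` bytes. */
static int active_thp_mode(const char *line, char *out, size_t out_len) {
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    if (!lb || !rb) return -1;

    size_t n = (size_t)(rb - lb - 1);
    if (n == 0 || n >= out_len) return -1;
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the system-wide setting. Returns 0 on success, -1 if unavailable. */
static int read_thp_mode(char *out, size_t out_len) {
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f) return -1;
    char *got = fgets(line, sizeof line, f);
    fclose(f);
    return got ? active_thp_mode(line, out, out_len) : -1;
}
```

A service could call read_thp_mode at startup and log a warning when the result is "always", which is how some databases surface this setting to operators.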

10. The mental model to keep

  • Every address your program sees is a virtual address. The MMU translates it to a physical address on each access using the per-process page table.
  • The TLB caches recent translations; missing the TLB is expensive. Huge pages (2 MB) reduce TLB pressure dramatically for large working sets.
  • A page fault is the hardware mechanism for "the page is not currently mapped." Minor faults are cheap (just update the table); major faults read from disk; invalid accesses become SIGSEGV.
  • mmap is the foundation: it sets up the mapping, but the pages are loaded lazily on first access. This is demand paging, and it is why a 100 GB file can be opened instantly.
  • fork uses copy-on-write: pages are shared until one process writes, then the page is copied. Python's refcounting is hostile to this.
  • Swap exists; production servers usually disable it because slow-by-1000× is worse than dead.
  • VSZ is "address space reserved." RSS is "physical RAM used." Watch RSS.
  • The page cache uses all otherwise-free RAM. available in free is the number that matters.

Once virtual memory is a clear picture in your head, half of what top, vmstat, free, pmap, and strace show you stops being mysterious.

11. Further reading

  • Operating Systems: Three Easy Pieces (free online), the "Virtualization" chapters - the clearest textbook treatment.
  • Ulrich Drepper, "What Every Programmer Should Know About Memory" - covers cache, TLB, NUMA, and prefetching.
  • The Linux mmap(2), mprotect(2), madvise(2) man pages.
  • /proc/<pid>/maps and pmap <pid> - the live view of any process's virtual address space.
  • Pointer arithmetic and arrays - the layer above this one.
  • Bit operations - the page-table-walk address arithmetic is bit-shifting on the virtual address.