Home › System Design › OS & Kernel Internals

Virtual Memory, Paging & the TLB

Every load and store your program issues names a virtual address, and the CPU's memory-management unit (MMU) silently rewrites that address into a physical DRAM address on the fly, page by page, using a per-process translation table anchored in a control register. That indirection is what lets each process believe it owns a private, contiguous, huge address space while the kernel packs everyone into the same finite RAM, isolates processes from each other, and pages cold data out to disk.

Why it matters. Without paging you get the pre-1970s world: programs are hard-linked to physical addresses, one process can scribble over another, and you cannot run a program larger than RAM. Paging solves protection (each PTE carries permission bits), isolation (no process can even name another's physical memory), over-commit (map more virtual memory than you have DRAM), and sharing (two processes map the same physical frame — how fork, shared libraries, and mmap'd files work). The cost is that every memory access now needs a translation — and making that translation nearly free is the entire job of the TLB.

Address translation: the multi-level page table

A 64-bit machine does not use all 64 bits — x86-64 canonically uses 48 bits of virtual address with 4 KB pages. The low 12 bits are the page offset (2^12 = 4096 bytes within a page) and never get translated. The upper 36 bits are the page number, which must be mapped to a physical frame number.

A single flat table over 36 bits would need 2^36 entries × 8 bytes = 512 GB per process — absurd, since address spaces are overwhelmingly sparse (a bit of code near the bottom, a stack near the top, gaps everywhere). So the table is split into a 4-level radix tree: the 36-bit page number is chopped into four 9-bit indices (2^9 = 512 entries per node, each node exactly one 4 KB page of 512 × 8-byte entries). Empty regions cost nothing — you simply don't allocate the lower-level tables. The MMU walks the tree from the root, whose physical base sits in register CR3.

Traced walk of one address

Take virtual address 0x0000008080604123. Decompose the low 48 bits: offset 0x123 (= 291), then four 9-bit indices — PT = 4, PD = 3, PDPT = 2, PML4 = 1 (the address was built as (1<<39)|(2<<30)|(3<<21)|(4<<12)|0x123). The MMU performs:

Step	Read from	Yields
1	`CR3 + 1×8` (PML4[1])	physical base of the PDPT
2	PDPT base + 2×8 (PDPT[2])	physical base of the PD
3	PD base + 3×8 (PD[3])	physical base of the PT
4	PT base + 4×8 (PT[4])	PTE: frame `0x1F2A3`, present=1

Final physical address = (0x1F2A3 << 12) | 0x123 = 0x1F2A3123. Note the sting: a single program memory access just cost four extra memory accesses to walk the tree. If every load paid that, the machine would run at a fifth of its speed. That is precisely why the TLB exists.

The TLB and TLB misses

The Translation Lookaside Buffer is a small, fully-/set-associative hardware cache inside the MMU that memoizes recent page-number → frame-number translations (plus permission bits). On every access the CPU checks the TLB in parallel with address formation; a hit costs effectively 0–1 cycles and the four-level walk is skipped entirely.

Real numbers on a modern Intel core: an L1 data TLB of ~64 entries backed by a unified L2 STLB of ~1536–2048 entries. A TLB miss triggers the hardware page walk shown above. It is not always 4 full DRAM trips — CPUs add page-walk caches that cache upper-level entries, and the PTEs themselves often live in L1/L2 data cache, so a warm miss may cost ~10–50 cycles. A cold miss, where every level of the walk actually misses cache and pays a ~100 ns DRAM access, is a different regime: four sequential ~100 ns loads is ~400 ns, which at a 3 GHz+ clock is well over a thousand cycles, not merely 'hundreds' — so treat hundreds of cycles as the common warm-ish case (page-walk caches absorb one or two of the four accesses) and a low thousands of cycles as the realistic worst case when the walk is fully cold. Multiply that by a workload that misses constantly and translation becomes the bottleneck — you will see it in perf stat as dtlb_load_misses.walk_active cycles.

The TLB is tagged per address space. A naive context switch would have to flush the whole TLB (why fork/context-switch-heavy workloads used to stall); modern CPUs tag entries with an address-space identifier (ASID / PCID on x86) so switches keep other processes' entries warm.

Hardware-walked vs. software-managed TLB refill. x86 and ARM use a hardware page-table walker: on a miss the MMU itself walks the radix tree and refills the TLB with no OS involvement, which is fast (tens of cycles when cache-resident) but locks the page-table format into silicon — the OS cannot use an arbitrary in-memory structure. MIPS and (classic) SPARC instead trap to a short OS-written handler on every TLB miss (software-managed refill): the kernel is free to use any table shape it likes (including a single shared hashed table), which is more flexible and simpler hardware, but every miss now pays a full trap-and-return (~50–150 cycles of pure overhead) even for a walk that would otherwise be a couple of cache hits. Use hardware walking when raw miss latency dominates (general-purpose CPUs); software refill is attractive when you want a non-standard table format or are building simple/embedded hardware and can tolerate trap overhead.

Page faults: minor vs. major

When the walk reaches a PTE whose present bit is 0, the MMU cannot translate — it raises a page fault, a synchronous trap that hands control to the kernel's fault handler with the faulting address. The handler decides what kind of fault this is:

Minor (soft) fault — the page is already sitting in physical RAM, it just isn't wired into this process's page table yet. Cases: first touch of anonymous memory (kernel hands you the shared zero page or a fresh frame), a copy-on-write page after fork, a shared library already resident for another process, or a file page that's already in the page cache. No disk I/O — the handler just installs a PTE. Cost: hundreds of ns to a few µs.
Major (hard) fault — the data is not in RAM and must be read from storage: a file page not yet in the page cache, or a page previously evicted to swap. This blocks the thread on I/O. Cost: ~10–100 µs on NVMe, ~5–10 ms on a spinning disk — four to six orders of magnitude worse than a minor fault.

This distinction is directly observable: /usr/bin/time -v ./app reports "major (requiring I/O) page faults" separately from minor ones, and a healthy steady-state process should show near-zero major faults. A rising major-fault count under load is the classic signature of memory pressure and thrashing.

The page cache & demand paging

Demand paging is the policy of never loading a page until it is actually touched. When you mmap a 2 GB file or exec a large binary, the kernel doesn't read anything — it just sets up the virtual mappings with present=0. Pages stream in lazily as minor/major faults fire on first access. This makes process startup fast and lets you map data far larger than RAM.

The page cache is the kernel's unified in-RAM cache of file-backed pages. Every read()/write() and every file mmap goes through it: on a read miss the kernel does a major fault, pulls the page from disk into a free frame, and leaves it cached so the next access is a cheap minor fault (or a direct hit for read()). It uses all otherwise-free RAM — this is why Linux "free" memory always looks low and why free -m shows a large "buff/cache" column that is instantly reclaimable. Under pressure the kernel reclaims pages using an LRU-approximation (Linux keeps active and inactive lists, a 2Q-style scheme that resists one-shot scans polluting the hot set). Dirty pages must be written back before reclaim; clean file pages can be dropped for free.

Why locality is the whole game for performance

The key derived quantity is TLB reach = (TLB entries) × (page size) — the amount of memory you can address without a single TLB miss. With 64 L1-dTLB entries × 4 KB, reach is only 256 KB. The moment your working set (the pages actually being touched) exceeds reach, behavior forks sharply:

Sequential / dense access touches each page once and reuses it many times; you pay one walk per page and then hit. A tight loop over a contiguous array is TLB-friendly.
Random / pointer-chasing access over a multi-megabyte structure (hash tables, linked lists, large graphs) touches a new page almost every access, exceeding reach and paying a walk per access — a hidden 10–100× tax that never shows up in the algorithm's Big-O. This is mechanical sympathy: the same O(1) hash lookup is dramatically slower when it thrashes the TLB.

The standard fix is huge pages (2 MB or 1 GB): one entry now covers 2 MB, so 64 entries reach 128 MB, and the walk is one level shorter. Databases (Postgres, Oracle), the JVM (-XX:+UseLargePages), and DPDK use them precisely to keep large hot regions inside TLB reach.

Pitfalls a working engineer actually hits

Blaming the CPU for a TLB-bound workload. High IPC-stall with low cache-miss counts but high dtlb_load_misses.walk_active means you're translation-bound, not compute-bound. Profile with perf before optimizing arithmetic.
Transparent Huge Pages (THP) latency spikes. Linux THP can improve throughput but the defrag path can stall a thread for milliseconds while it compacts memory to form a 2 MB page — a notorious tail-latency source for latency-sensitive services (many run with THP set to madvise or off).
Confusing page cache with a leak. "My server has 2 GB free out of 64 GB!" — the rest is reclaimable page cache, not leaked. Watch available, not free.
Silent major-fault thrashing. Once resident set exceeds RAM, the kernel evicts hot pages that are immediately re-faulted from swap. Throughput collapses while CPU looks idle (everyone is blocked on I/O). Watch vmstat si/so and major-fault rate.
False sharing at page granularity after fork. COW means a write to one byte faults and copies the whole 4 KB page; a process that writes sparsely across a huge COW region can multiply memory use unexpectedly.

Trade-offs & when to use what

Hierarchical (radix) page tables vs. inverted/hashed page tables. The x86/ARM 4-level tree costs a bounded 4-read walk and keeps sparse address spaces cheap, but consumes memory proportional to mapped virtual space per process. An inverted page table (PowerPC, IA-64, and hashed variants elsewhere) instead stores one entry per physical frame, so its size scales with RAM, not with the number/size of processes — attractive for machines with enormous virtual spaces and many processes. The cost: translation becomes a hash lookup with collision chains (variable, sometimes worse than 4 reads), and shared/aliased mappings are awkward since the table is indexed by frame, not by virtual address. Use hierarchical when address spaces are sparse and cross-process sharing is common (the general-purpose case, and what Linux/Windows/macOS all pick); consider inverted/hashed only when per-process table overhead genuinely dominates, e.g. many processes each mapping a small fraction of a very large physical RAM.

4 KB pages vs. huge pages (2 MB / 1 GB). Huge pages multiply TLB reach (64 entries × 2 MB = 128 MB instead of 256 KB) and shorten the walk by a level, which is exactly what large hot working sets (databases, JVM heaps, DPDK packet buffers) need. The trade-off is internal fragmentation: a 2 MB page backing a 4 KB-sized object wastes up to ~2 MB, and a request for many huge pages can fail or stall under memory pressure because the kernel must find (or compact memory into) physically contiguous 2 MB regions — the same compaction cost that makes Transparent Huge Pages a tail-latency risk (see pitfalls above). Use huge pages deliberately for large, long-lived, densely-used regions; keep 4 KB pages for small or short-lived allocations where fragmentation would dominate.

Hardware-walked TLB refill vs. software-managed TLB refill. As above: x86/ARM hardware walkers minimize per-miss latency at the cost of a fixed table format baked into the ISA; MIPS/SPARC-style software refill trades a fixed per-trap overhead for total freedom over the table's in-memory shape. Pick hardware walking for general-purpose high-throughput CPUs; software refill suits research/embedded designs that want a non-standard or unified table format more than they want the fastest possible miss path.

🤖 Don't fully get this? Learn it with Claude

Stuck on Virtual Memory, Paging & the TLB? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **Virtual Memory, Paging & the TLB** (System Design) and want to truly understand it. Explain Virtual Memory, Paging & the TLB from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **Virtual Memory, Paging & the TLB** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **Virtual Memory, Paging & the TLB** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **Virtual Memory, Paging & the TLB** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes

← Process & Thread Scheduling (CFS)I/O Models: Blocking, epoll & io_u →