Knowledge Guide
HomeDatabasesDatabase Engine Internals

The Buffer Pool & Page Cache

The buffer pool is a hash-indexed array of fixed-size RAM frames that shadows disk pages, so the engine reads and writes RAM at roughly 100 ns instead of touching a disk that answers in ~100 µs–10 ms

A storage engine never reads one row from disk. The disk — HDD or SSD — only transfers whole aligned blocks, and the OS only maps memory in pages, so the engine adopts the same quantum: the page (8 KB in PostgreSQL, 16 KB in InnoDB, 4 KB in SQLite by default). A page is the unit of I/O, the unit of caching, the unit of write-back, and the unit the WAL protects. Everything above — rows, tuples, B-tree nodes — is just bytes laid out inside a page.

The buffer pool (PostgreSQL calls its copy shared_buffers; the kernel's own copy is the page cache) is a large slab of RAM carved into page-sized frames, plus a hash table mapping (relation_id, page_number) → frame. When a query needs page 4711, the engine hashes that key: a hit returns the frame pointer in nanoseconds; a miss issues one disk read into a free frame, then returns it. Because hot pages — the upper levels of every B-tree, the current insert page, small dimension tables — stay resident, a well-sized pool serves the vast majority of accesses from RAM.

Why it matters

The gap it hides is enormous. On a hit, locating an already-resident 8 KB page and returning a pointer to it costs on the order of 100 nanoseconds — this is a rough figure for the hash lookup plus pointer/copy overhead of a resident page, not the latency of a single DRAM access (a single memory access is closer to ~1–1000 ns depending on cache level; copying 8 KB of already-cached bytes adds a small, bandwidth-bound cost on top). The same 8 KB from an NVMe SSD costs ~100 µs (roughly 1,000× slower than the RAM case), and from a spinning disk with a seek costs ~5–10 ms (50,000–100,000×). Without a buffer pool, every index descent — 3 to 4 page reads per point lookup — would pay disk latency, capping a single core at a few thousand queries per second. With a 99% hit ratio those lookups run almost entirely in RAM and throughput rises by orders of magnitude. The buffer pool is the single most important reason a database feels fast.

A traced lookup, hit then miss

Take an index point query SELECT * FROM orders WHERE id = 5001 against a B-tree of height 3 (root → internal → leaf), with a 4-frame pool for illustration. Pages are numbered; D marks dirty.

StepNeed pagePool beforeHit/MissAction
1root (10)[10, 22, —, —]HITroot is always hot; return frame
2internal (37)[10, 22, —, —]MISSread pg 37 into free frame → [10,22,37,—]
3leaf (58)[10, 22, 37, —]MISSread pg 58 → [10,22,37,58]; row found
4leaf (58)[10, 22, 37, 58]HITa re-read of the same key: 0 disk I/O

The first execution paid 2 disk reads (~200 µs). The second execution of the same query pays zero — all three pages are resident. This is why the very first queries after a restart are slow (a cold cache) and why benchmarks must “warm up” the pool before measuring. Each access also touches the page's reference metadata (below), which is what lets eviction later distinguish the hot root from a one-shot leaf.

Dirty pages, and how they leave

A write does not go to disk. UPDATE orders SET status='shipped' WHERE id=5001 modifies the row inside the resident leaf frame and sets that frame's dirty bit. The page now diverges from its on-disk copy. Dirty pages are flushed later — lazily by a background writer/checkpointer, or eagerly when the frame is chosen as an eviction victim. Batching many writes to one hot page into a single flush is a huge win: a counter updated 1,000 times can hit disk once.

Eviction: LRU vs the CLOCK approximation

When every frame is full and a miss needs one, the pool must evict a victim. The ideal is to evict the page reused farthest in the future (Bélády's optimum) — unknowable at runtime — so engines approximate it with recency. Strict LRU keeps a doubly linked list and moves a page to the head on every access. That is correct but expensive: every single page hit — the 99% common case — must take a global list latch and relink pointers, which becomes a contention hotspot under concurrency.

So real engines use CLOCK (a.k.a. second-chance), an O(1) LRU approximation. Frames sit in a circular buffer; each carries a reference bit. An access just sets the bit (a cheap store, no list surgery). To evict, a hand sweeps the circle: if the bit is 1, clear it and move on (a second chance); if it is 0, that frame is cold — evict it. Pages touched since the last sweep survive; pages touched once and abandoned fall out. PostgreSQL uses a refined variant, CLOCK-sweep with a usage counter (0–5) instead of a single bit, so genuinely hot pages accumulate more chances. InnoDB uses a midpoint-insertion LRU that splits the list into “young” and “old” sublists — a new page enters at the old-list head, and only survives to the young list if it is touched again, which is the key defense against scan pollution (below).

Steal / no-force — and why it forces a WAL

Two independent policies decide when a page's changes reach disk relative to transaction commit:

Every high-performance engine chooses steal + no-force — the fastest combination — but that combination breaks durability and atomicity on its own, and understanding why is the whole point of the WAL.

Trace the no-force failure. Transaction T sets status='shipped' on the leaf page for order 5001. The change lands only in the RAM frame; the dirty bit is set; commit returns immediately without touching the data file. One second later the machine loses power. RAM is gone. The on-disk copy of that leaf page still says the old status — the committed update was never physically written anywhere durable except RAM, which just vanished. Without a log, this is unrecoverable data loss on a transaction the client was told succeeded. This is exactly the gap a Write-Ahead Log closes: before the in-memory page is modified (or at least before commit is acknowledged), the engine appends a compact log record — “txn T, page P, offset O, old value, new value” — to a sequential log file and fsyncs just that log tail. The log record is tiny and sequential, so this fsync is cheap compared to flushing the full 8 KB data page. On commit, the durability guarantee is satisfied by the log being on disk, not by the data page being on disk. After the crash, recovery reads the log from the last checkpoint forward and REDOes every logged change whose transaction committed — reapplying “set status=shipped” to the stale on-disk leaf page — reconstructing exactly the in-memory update that was lost. This is the WAL rule in one sentence: the log record for a change must be durable before the data page carrying that change is allowed to be durable (log-before-data), which is precisely what lets the data page itself be written lazily, whenever it happens to be evicted or checkpointed.

Steal has a mirror-image failure. Because steal lets the pool evict and flush dirty pages from an uncommitted transaction (say, under memory pressure mid-transaction), an abort or crash can leave partially-applied, never-committed changes sitting on disk. Recovery must UNDO them — the log also records the old value for exactly this purpose, so recovery can roll back any transaction that was in-flight (not committed) at crash time, restoring the pre-transaction bytes to those data pages.

So the two policies create two distinct risks and the WAL is engineered to cover both at once: no-force → REDO (a committed change might still be sitting only in the now-lost buffer pool — replay it forward from the log), and steal → UNDO (an uncommitted change might already be sitting on disk — roll it back using the log's old-value record). This REDO/UNDO recovery over a steal+no-force buffer pool is the core of the ARIES protocol, and it is exactly why the buffer pool can be so aggressively lazy about writing data pages: the log, not the data file, is the durable source of truth at commit time.

Hit ratio, and why sequential beats random

The buffer hit ratio = hits / (hits + misses) is the pool's headline health metric. The nonlinearity is the point: at a 99% hit ratio, 1 in 100 accesses hits disk; drop to 98% and you have doubled the disk traffic. Because a miss is ~1,000× slower than a hit, average access time is dominated by the miss rate long before it looks alarming. A ratio sliding from 99.9% toward 95% is often the first sign the working set has outgrown RAM. (Caveat: on Linux, PostgreSQL's shared_buffers hit ratio ignores pages served by the OS page cache, so a “low” ratio there is not necessarily disk I/O — confirm with iostat/actual reads.)

Sequential vs random I/O is the other half. Reading 128 contiguous pages in one sweep and reading 128 scattered pages transfer the same bytes but cost wildly differently. On an HDD each random read pays a seek + rotational latency (~5–10 ms); 128 randoms ≈ 1 second, while one sequential sweep ≈ a few ms. Even on SSDs — no seeks — sequential wins: it saturates the flash channels, lets the engine issue large multi-block reads, and enables read-ahead / prefetch, where the pool detects a scan and fetches pages 200–264 before the query asks. This asymmetry is why the query planner will happily choose a full sequential table scan over an index when a query touches a large fraction of rows: many cheap sequential page reads beat fewer expensive random ones. It is the same reason B-trees pack keys densely per page (fewer, fatter pages = fewer random seeks) and why bulk loads sort data first.

Pitfalls a working engineer hits

Trade-offs & when to use vs a named alternative

The buffer pool is the read-optimized, in-place-update model of the classic B-tree / page-oriented engine (InnoDB, PostgreSQL, SQLite). Its defining choice: mutate pages in RAM, write them back in place, absorb random writes behind a WAL. This is ideal for read-heavy and mixed OLTP workloads with a working set that fits (mostly) in RAM — point lookups and range scans are fast because a page maps directly to its on-disk home and hot pages stay resident.

The named alternative is the LSM-tree (RocksDB, Cassandra, ScyllaDB). An LSM buffers writes in an in-memory memtable, flushes them as immutable sorted files (SSTables), and merges them later via compaction — turning random writes into sequential ones. Trade-off: LSMs win on write-heavy, ingest-heavy workloads and offer better space efficiency (no half-empty pages), but they pay read amplification (a read may probe several SSTable levels + bloom filters) and suffer unpredictable compaction latency spikes. So: choose the buffer-pool B-tree engine when reads dominate, you need predictable low-latency point reads, and strong secondary-index / transactional semantics matter; choose an LSM when write throughput and storage cost dominate and you can tolerate read amplification and compaction jitter. Most relational databases are the former precisely because the buffer pool makes their read path cheap. A closer contrast within the same family is relying purely on the OS page cache (as SQLite and older MySQL setups largely do) instead of a dedicated pool: it is simpler and avoids double-buffering, but the engine loses control over eviction policy, page pinning, and prefetch — which is why serious engines run their own pool.

Takeaways

Recall

Q: The engine uses steal + no-force. A transaction commits, then the machine loses power one second later before its data pages were flushed. What guarantees the committed change survives, and which part of the WAL provides it? (Then: what if the transaction had instead aborted but its dirty page was already stolen to disk?)


Sources: Ramakrishnan & Gehrke, Database Management Systems (buffer management, steal/force matrix); Mohan et al., “ARIES” (1992) for WAL/REDO/UNDO; Petrov, Database Internals (page cache, eviction, LSM vs B-tree); Kleppmann, Designing Data-Intensive Applications ch. 3 (SSTables/LSM vs B-trees); Gregg, Systems Performance (memory/disk latency, sequential vs random I/O); PostgreSQL and InnoDB documentation (shared_buffers, CLOCK-sweep, midpoint-insertion LRU). Re-authored/Deepened for this guide.

🤖 Don't fully get this? Learn it with Claude

Stuck on The Buffer Pool & Page Cache? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **The Buffer Pool & Page Cache** (Databases) and want to truly understand it. Explain The Buffer Pool & Page Cache from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **The Buffer Pool & Page Cache** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **The Buffer Pool & Page Cache** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **The Buffer Pool & Page Cache** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes