The Buffer Pool & Page Cache
The buffer pool is a hash-indexed array of fixed-size RAM frames that shadows disk pages, so the engine reads and writes RAM at roughly 100 ns instead of touching a disk that answers in ~100 µs–10 ms
A storage engine never reads one row from disk. The disk — HDD or SSD — only transfers whole aligned blocks, and the OS only maps memory in pages, so the engine adopts the same quantum: the page (8 KB in PostgreSQL, 16 KB in InnoDB, 4 KB in SQLite by default). A page is the unit of I/O, the unit of caching, the unit of write-back, and the unit the WAL protects. Everything above — rows, tuples, B-tree nodes — is just bytes laid out inside a page.
The buffer pool (PostgreSQL calls its copy shared_buffers; the kernel's own copy is the page cache) is a large slab of RAM carved into page-sized frames, plus a hash table mapping (relation_id, page_number) → frame. When a query needs page 4711, the engine hashes that key: a hit returns the frame pointer in nanoseconds; a miss issues one disk read into a free frame, then returns it. Because hot pages — the upper levels of every B-tree, the current insert page, small dimension tables — stay resident, a well-sized pool serves the vast majority of accesses from RAM.
Why it matters
The gap it hides is enormous. On a hit, locating an already-resident 8 KB page and returning a pointer to it costs on the order of 100 nanoseconds — this is a rough figure for the hash lookup plus pointer/copy overhead of a resident page, not the latency of a single DRAM access (a single memory access is closer to ~1–1000 ns depending on cache level; copying 8 KB of already-cached bytes adds a small, bandwidth-bound cost on top). The same 8 KB from an NVMe SSD costs ~100 µs (roughly 1,000× slower than the RAM case), and from a spinning disk with a seek costs ~5–10 ms (50,000–100,000×). Without a buffer pool, every index descent — 3 to 4 page reads per point lookup — would pay disk latency, capping a single core at a few thousand queries per second. With a 99% hit ratio those lookups run almost entirely in RAM and throughput rises by orders of magnitude. The buffer pool is the single most important reason a database feels fast.
A traced lookup, hit then miss
Take an index point query SELECT * FROM orders WHERE id = 5001 against a B-tree of height 3 (root → internal → leaf), with a 4-frame pool for illustration. Pages are numbered; D marks dirty.
| Step | Need page | Pool before | Hit/Miss | Action |
|---|---|---|---|---|
| 1 | root (10) | [10, 22, —, —] | HIT | root is always hot; return frame |
| 2 | internal (37) | [10, 22, —, —] | MISS | read pg 37 into free frame → [10,22,37,—] |
| 3 | leaf (58) | [10, 22, 37, —] | MISS | read pg 58 → [10,22,37,58]; row found |
| 4 | leaf (58) | [10, 22, 37, 58] | HIT | a re-read of the same key: 0 disk I/O |
The first execution paid 2 disk reads (~200 µs). The second execution of the same query pays zero — all three pages are resident. This is why the very first queries after a restart are slow (a cold cache) and why benchmarks must “warm up” the pool before measuring. Each access also touches the page's reference metadata (below), which is what lets eviction later distinguish the hot root from a one-shot leaf.
Dirty pages, and how they leave
A write does not go to disk. UPDATE orders SET status='shipped' WHERE id=5001 modifies the row inside the resident leaf frame and sets that frame's dirty bit. The page now diverges from its on-disk copy. Dirty pages are flushed later — lazily by a background writer/checkpointer, or eagerly when the frame is chosen as an eviction victim. Batching many writes to one hot page into a single flush is a huge win: a counter updated 1,000 times can hit disk once.
Eviction: LRU vs the CLOCK approximation
When every frame is full and a miss needs one, the pool must evict a victim. The ideal is to evict the page reused farthest in the future (Bélády's optimum) — unknowable at runtime — so engines approximate it with recency. Strict LRU keeps a doubly linked list and moves a page to the head on every access. That is correct but expensive: every single page hit — the 99% common case — must take a global list latch and relink pointers, which becomes a contention hotspot under concurrency.
So real engines use CLOCK (a.k.a. second-chance), an O(1) LRU approximation. Frames sit in a circular buffer; each carries a reference bit. An access just sets the bit (a cheap store, no list surgery). To evict, a hand sweeps the circle: if the bit is 1, clear it and move on (a second chance); if it is 0, that frame is cold — evict it. Pages touched since the last sweep survive; pages touched once and abandoned fall out. PostgreSQL uses a refined variant, CLOCK-sweep with a usage counter (0–5) instead of a single bit, so genuinely hot pages accumulate more chances. InnoDB uses a midpoint-insertion LRU that splits the list into “young” and “old” sublists — a new page enters at the old-list head, and only survives to the young list if it is touched again, which is the key defense against scan pollution (below).
Steal / no-force — and why it forces a WAL
Two independent policies decide when a page's changes reach disk relative to transaction commit:
- Steal vs no-steal — may the pool evict (and thus write to disk) a dirty page belonging to an uncommitted transaction? Steal = yes. Real engines steal, because pinning every uncommitted page in RAM would let one long transaction exhaust the pool.
- Force vs no-force — must all of a transaction's dirty pages be flushed to disk before commit returns? No-force = no. Forcing would turn every commit into a storm of random page writes, destroying commit latency — a single-row update on a hot page could require flushing that page plus every other dirty page belonging to the transaction, each a random 8 KB write, before the client even gets an acknowledgment.
Every high-performance engine chooses steal + no-force — the fastest combination — but that combination breaks durability and atomicity on its own, and understanding why is the whole point of the WAL.
Trace the no-force failure. Transaction T sets status='shipped' on the leaf page for order 5001. The change lands only in the RAM frame; the dirty bit is set; commit returns immediately without touching the data file. One second later the machine loses power. RAM is gone. The on-disk copy of that leaf page still says the old status — the committed update was never physically written anywhere durable except RAM, which just vanished. Without a log, this is unrecoverable data loss on a transaction the client was told succeeded. This is exactly the gap a Write-Ahead Log closes: before the in-memory page is modified (or at least before commit is acknowledged), the engine appends a compact log record — “txn T, page P, offset O, old value, new value” — to a sequential log file and fsyncs just that log tail. The log record is tiny and sequential, so this fsync is cheap compared to flushing the full 8 KB data page. On commit, the durability guarantee is satisfied by the log being on disk, not by the data page being on disk. After the crash, recovery reads the log from the last checkpoint forward and REDOes every logged change whose transaction committed — reapplying “set status=shipped” to the stale on-disk leaf page — reconstructing exactly the in-memory update that was lost. This is the WAL rule in one sentence: the log record for a change must be durable before the data page carrying that change is allowed to be durable (log-before-data), which is precisely what lets the data page itself be written lazily, whenever it happens to be evicted or checkpointed.
Steal has a mirror-image failure. Because steal lets the pool evict and flush dirty pages from an uncommitted transaction (say, under memory pressure mid-transaction), an abort or crash can leave partially-applied, never-committed changes sitting on disk. Recovery must UNDO them — the log also records the old value for exactly this purpose, so recovery can roll back any transaction that was in-flight (not committed) at crash time, restoring the pre-transaction bytes to those data pages.
So the two policies create two distinct risks and the WAL is engineered to cover both at once: no-force → REDO (a committed change might still be sitting only in the now-lost buffer pool — replay it forward from the log), and steal → UNDO (an uncommitted change might already be sitting on disk — roll it back using the log's old-value record). This REDO/UNDO recovery over a steal+no-force buffer pool is the core of the ARIES protocol, and it is exactly why the buffer pool can be so aggressively lazy about writing data pages: the log, not the data file, is the durable source of truth at commit time.
Hit ratio, and why sequential beats random
The buffer hit ratio = hits / (hits + misses) is the pool's headline health metric. The nonlinearity is the point: at a 99% hit ratio, 1 in 100 accesses hits disk; drop to 98% and you have doubled the disk traffic. Because a miss is ~1,000× slower than a hit, average access time is dominated by the miss rate long before it looks alarming. A ratio sliding from 99.9% toward 95% is often the first sign the working set has outgrown RAM. (Caveat: on Linux, PostgreSQL's shared_buffers hit ratio ignores pages served by the OS page cache, so a “low” ratio there is not necessarily disk I/O — confirm with iostat/actual reads.)
Sequential vs random I/O is the other half. Reading 128 contiguous pages in one sweep and reading 128 scattered pages transfer the same bytes but cost wildly differently. On an HDD each random read pays a seek + rotational latency (~5–10 ms); 128 randoms ≈ 1 second, while one sequential sweep ≈ a few ms. Even on SSDs — no seeks — sequential wins: it saturates the flash channels, lets the engine issue large multi-block reads, and enables read-ahead / prefetch, where the pool detects a scan and fetches pages 200–264 before the query asks. This asymmetry is why the query planner will happily choose a full sequential table scan over an index when a query touches a large fraction of rows: many cheap sequential page reads beat fewer expensive random ones. It is the same reason B-trees pack keys densely per page (fewer, fatter pages = fewer random seeks) and why bulk loads sort data first.
Pitfalls a working engineer hits
- Scan pollution / cache thrashing. A one-off analytical query over a huge table streams millions of pages through the pool, evicting the hot OLTP working set (the B-tree roots, the small hot tables). Latency for normal traffic spikes long after the scan finishes, until the pool re-warms. Defenses: InnoDB's midpoint-insertion LRU (a scanned page dies in the old sublist unless re-touched), PostgreSQL's ring buffers that confine large seq-scans/VACUUM to a few frames, and
fadvise(DONTNEED). - Over-sizing the pool. Bigger is not linearly better. Past the working-set size the marginal hit-ratio gain is tiny, while you steal RAM from the OS page cache (double buffering), lengthen checkpoint flush storms, and slow crash recovery. Common guidance for a dedicated Postgres box is ~25% of RAM for
shared_buffers, letting the OS cache do the rest. - Checkpoint write storms. No-force lets dirty pages pile up; when a checkpoint forces them all out at once, a burst of random writes stalls foreground commits. Tuning spreads the flush over time (
checkpoint_completion_target) instead of one spike. - Forgetting the cold cache. A failover, restart, or deploy resets the pool to empty; the first minutes of traffic run at disk speed. Production systems pre-warm (replay recent access patterns) after restart.
Trade-offs & when to use vs a named alternative
The buffer pool is the read-optimized, in-place-update model of the classic B-tree / page-oriented engine (InnoDB, PostgreSQL, SQLite). Its defining choice: mutate pages in RAM, write them back in place, absorb random writes behind a WAL. This is ideal for read-heavy and mixed OLTP workloads with a working set that fits (mostly) in RAM — point lookups and range scans are fast because a page maps directly to its on-disk home and hot pages stay resident.
The named alternative is the LSM-tree (RocksDB, Cassandra, ScyllaDB). An LSM buffers writes in an in-memory memtable, flushes them as immutable sorted files (SSTables), and merges them later via compaction — turning random writes into sequential ones. Trade-off: LSMs win on write-heavy, ingest-heavy workloads and offer better space efficiency (no half-empty pages), but they pay read amplification (a read may probe several SSTable levels + bloom filters) and suffer unpredictable compaction latency spikes. So: choose the buffer-pool B-tree engine when reads dominate, you need predictable low-latency point reads, and strong secondary-index / transactional semantics matter; choose an LSM when write throughput and storage cost dominate and you can tolerate read amplification and compaction jitter. Most relational databases are the former precisely because the buffer pool makes their read path cheap. A closer contrast within the same family is relying purely on the OS page cache (as SQLite and older MySQL setups largely do) instead of a dedicated pool: it is simpler and avoids double-buffering, but the engine loses control over eviction policy, page pinning, and prefetch — which is why serious engines run their own pool.
Takeaways
- The page is the universal unit of I/O and caching; the buffer pool keeps hot pages in RAM to hide a 1,000–100,000× latency cliff.
- Eviction approximates LRU cheaply with CLOCK/second-chance (usage bits), because true LRU's per-hit relinking is a concurrency bottleneck.
- Steal + no-force gives the fastest commit path but mandates a WAL: no-force needs REDO (committed changes may only exist in a now-lost buffer pool), steal needs UNDO (uncommitted changes may already be on disk) — this is the ARIES rationale.
- Hit ratio is nonlinear and sequential I/O beats random — which is why planners prefer big seq-scans and why LSM-trees exist for write-heavy loads.
Recall
Q: The engine uses steal + no-force. A transaction commits, then the machine loses power one second later before its data pages were flushed. What guarantees the committed change survives, and which part of the WAL provides it? (Then: what if the transaction had instead aborted but its dirty page was already stolen to disk?)
Sources: Ramakrishnan & Gehrke, Database Management Systems (buffer management, steal/force matrix); Mohan et al., “ARIES” (1992) for WAL/REDO/UNDO; Petrov, Database Internals (page cache, eviction, LSM vs B-tree); Kleppmann, Designing Data-Intensive Applications ch. 3 (SSTables/LSM vs B-trees); Gregg, Systems Performance (memory/disk latency, sequential vs random I/O); PostgreSQL and InnoDB documentation (shared_buffers, CLOCK-sweep, midpoint-insertion LRU). Re-authored/Deepened for this guide.
🤖 Don't fully get this? Learn it with Claude
Stuck on The Buffer Pool & Page Cache? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **The Buffer Pool & Page Cache** (Databases) and want to truly understand it. Explain The Buffer Pool & Page Cache from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **The Buffer Pool & Page Cache** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **The Buffer Pool & Page Cache** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **The Buffer Pool & Page Cache** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.