Write-Ahead Logging & Durability
Write-ahead logging works because a sequential append to a log file forces to durable storage far cheaper than a random write to a data page: before any modification touches the in-memory page (the buffer pool copy of an on-disk page), the engine writes a record describing the change to a log and fsyncs that log to stable storage. The cardinal rule — the WAL protocol — is: the log record must reach durable storage before the corresponding data page does. The log is the source of truth; the data files are a lazily-updated cache of it.
Why it matters: the problem WAL solves
A transaction that commits must survive a power loss one microsecond later (durability), and a crash mid-transaction must leave none of its effects (atomicity). Without a log the engine faces an impossible choice. If it flushes every dirty page at commit, commit latency is dominated by random disk seeks and a transaction touching 20 scattered pages pays 20 syncs. If it doesn't flush, a crash loses committed data. Worse, a page is often larger (8 KB in PostgreSQL, 16 KB in InnoDB) than the disk's atomic write unit (typically a 512-byte or 4 KB sector), so a crash mid-write can leave a torn page — half old, half new — which is neither state the database ever intended.
WAL breaks the deadlock. Commit becomes one sequential append + one fsync of the log. The expensive random page writes are deferred and batched by a background process. Recovery replays the log to reconstruct exactly the committed state.
Redo, undo, and the two policies that make WAL necessary
Each log record is enough to redo (reapply a change whose data page never made it to disk) and/or undo (roll back a change whose data page did reach disk before the transaction committed). Whether you need each depends on two buffer-manager policies:
- Steal: may the buffer manager write a dirty page belonging to an uncommitted transaction back to disk (to reclaim a buffer)? If yes → you need undo, because a crash could leave uncommitted changes on disk that must be reversed.
- No-force: at commit, are you allowed to not force the transaction's dirty pages to disk? If yes → you need redo, because a committed change may live only in the log at crash time.
Essentially every high-throughput engine chooses steal + no-force — the most flexible policy for the buffer manager, and the one that makes commit cheap — which is precisely why it needs a log carrying both redo and undo. The alternatives (force = slow commits; no-steal = pins the whole working set in RAM) don't scale.
A traced commit: what actually hits the disk
Every log record has a monotonically increasing LSN (Log Sequence Number). Each data page stores in its header the LSN of the last log record applied to it (pageLSN); this is what lets recovery tell whether a change is already reflected on disk. Trace a single-row update, UPDATE accounts SET bal=900 WHERE id=7 (old value 1000):
| Step | LSN | Action | What is durable after |
|---|---|---|---|
| 1 | — | BEGIN in memory; pin page holding id=7 in buffer pool | nothing |
| 2 | 101 | Two distinct actions bound by the WAL rule, not one: (a) mutate the in-memory copy of page 88 — set its bal field to 900 and stamp its pageLSN=101 — and (b) append the log record UPDATE, page=88, before=1000, after=900 to the log buffer. The WAL rule governs the ordering of a later event, not this step: it says the log append from (b) must be fsync'd before page 88's mutated bytes are ever flushed to disk. The in-memory mutation itself can happen immediately; nothing durable has occurred yet. | log buffer only (not yet synced); page 88 dirty in RAM |
| 3 | 102 | Append COMMIT record | log buffer |
| 4 | — | fsync the WAL up to LSN 102 | log through 102 on disk → transaction is now durable |
| 5 | — | Return "committed" to the client | page 88 still only in RAM |
| 6 | — | Background checkpoint/eviction writes page 88 | data file now matches log |
Crash between step 4 and step 6? Recovery reads the log, sees COMMIT at 102, finds page 88 on disk still at 1000 (its pageLSN < 101), and redoes the change. Crash between step 2 and step 4? No COMMIT record exists, so the change is discarded (and undone if the page had been stolen to disk). The client was never told "committed", so no promise is broken.
Group commit & fsync batching: the throughput lever
The physics: a single fsync on a spinning disk costs one rotation (~8–10 ms); on an enterprise SSD/NVMe with a power-loss-protected write cache it is ~50–200 µs. Either way, if every transaction did its own fsync, throughput would be capped at 1 / fsync_latency — on a HDD roughly 100–200 commits/sec, no matter how many CPU cores you have. Crucially, that ceiling doesn't rise just because more transactions run concurrently: a WAL is a single append-only file with one physical write head (and, on most engines, one in-process log-writer/queue), so N transactions each calling fsync on their own don't parallelize their syncs — they queue up behind each other, and the disk still only completes one rotation-bound flush at a time. Ten concurrent committers with no group commit still serialize on that one physical resource, so wall-clock throughput stays pinned near 100–200/sec even at high concurrency. The fsync, not the CPU or the concurrency level, is the bottleneck.
Group commit amortizes it. Instead of each committer syncing alone, the engine holds a brief window: while transaction A's fsync is in flight, B, C, and D append their COMMIT records to the same log buffer and wait. When A's sync returns, one single subsequent fsync covers B, C, and D too (a sequential log means one sync flushes everyone's records up to the current LSN). Ten concurrent committers can share one physical sync — a near-10× throughput gain at the cost of a few ms of added latency per commit. This is exactly what turns the serialized fsync queue from a hard ceiling into a batching opportunity: the same one-writer-at-a-time constraint that caps naive throughput is what group commit exploits, since all waiters are already funneling through that single queue anyway. PostgreSQL exposes this as commit_delay/commit_siblings; MySQL/InnoDB as binlog_group_commit_sync_delay. This is the classic latency-for-throughput trade: you deliberately delay to batch.
Checkpoints: bounding recovery time
The log grows forever and dirty pages accumulate in RAM; without intervention, recovery would have to replay the entire log from the beginning of time. A checkpoint bounds this by recording, at some point in the log, enough information that recovery can safely skip everything before it.
The naive approach — a sharp (consistent) checkpoint — stalls all new writes, flushes every dirty page to disk, and only then writes a single checkpoint record. It gives the simplest possible recovery (start exactly at the checkpoint, nothing before it matters at all), but the write-stall is proportional to buffer-pool size: on a server with tens of GB of dirty pages that can be seconds to minutes of frozen writes, which is unacceptable for an always-on OLTP system.
Modern engines instead use a fuzzy checkpoint, which never stalls transactions:
- Write a
begin_checkpointrecord marking the start. - Without quiescing anything, snapshot two in-memory structures as they stand right now: the dirty page table (DPT — every page currently dirty in the buffer pool, plus its
recLSN, the LSN of the log record that first dirtied it) and the active transaction table (ATT — every transaction currently in flight). - Write those two snapshots into an
end_checkpointrecord. Pages can keep being read, written, and even flushed to disk while steps 2–3 happen — hence "fuzzy": the snapshot is not a single atomic instant, just a consistent enough bracket.
Recovery uses this directly: take the smallest recLSN across every page in the DPT at the last end_checkpoint. Every logged change before that LSN is guaranteed to already be on disk (its page can't still be dirty with an older recLSN and be unflushed, by definition of how the DPT is maintained), so redo only needs to scan forward from that bound — not from the start of the log. This is the entire payoff: a checkpoint doesn't make recovery correct, it makes it fast, by shrinking the redo scan from "whole log" to "log since the oldest still-dirty page."
The trade-off is the same one sharp checkpoints tried to dodge, just relocated: checkpoint too often and you burn I/O bandwidth continually re-flushing hot pages that will be dirtied again seconds later; checkpoint too rarely and the DPT's oldest recLSN drifts far behind, so the redo bound barely moves and recovery after a crash can take minutes. Sharp checkpoints buy simplicity at the cost of availability (write-stalls); fuzzy checkpoints buy availability at the cost of a slightly more complex recovery (you now need the DPT/ATT bookkeeping in Analysis, covered next). Every production engine chooses fuzzy for exactly that reason. PostgreSQL's checkpoint_timeout and max_wal_size, and InnoDB's adaptive flushing, tune this recovery-time-vs-steady-state-I/O balance.
Crash recovery: ARIES in three passes
The industry-standard algorithm is ARIES (Mohan et al., IBM, 1992). Its defining principles are WAL, repeating history during redo, and logging changes during undo (so undo itself is crash-safe). Recovery runs three passes over the log:
- Analysis. Start at the last checkpoint's
begin_checkpoint, load the DPT/ATT from itsend_checkpoint, then scan forward to the end of the log, updating both tables as it goes. The result: the set of dirty pages (and the smallestrecLSN= where redo begins) and the set of transactions that were live at crash time ("losers" — they never committed and must be undone). - Redo. Start at the smallest
recLSNand replay every logged change — winners and losers — that isn't already reflected on the page (i.e. wherelogLSN > pageLSN). This "repeats history" to reconstruct the exact page state at the moment of the crash. Redo is idempotent precisely because of thepageLSNcomparison: replaying an already-applied record is skipped. - Undo. Roll back the losers, walking each transaction's log records backward via the
prevLSNback-chain. Each undo writes a Compensation Log Record (CLR) describing the reversal and carrying anUndoNextpointer. If the system crashes again mid-recovery, CLRs ensure undo resumes where it left off and never re-undoes an already-reversed action — this is what makes ARIES recovery itself restartable.
Concretely: T1 committed (LSN 105 COMMIT), T2 was mid-flight at crash (last record LSN 110, no COMMIT). Analysis marks T2 a loser. Redo reapplies both T1's and T2's data changes to reconstruct crash-time state. Undo then reverses only T2, appending CLRs; T1's committed effects remain. Net result: T1 durable, T2 vanished — atomicity and durability, both satisfied.
Pitfalls a working engineer actually hits
fsyncthat lies. Consumer drives and some virtualized/cloud block layers acknowledgefsyncbefore data is truly on stable media (volatile write cache). Your WAL is then not durable and no software can fix it. This caused real MySQL/Postgres data loss; enterprise drives use battery/capacitor-backed caches. Verify with tools likediskchecker.plorpg_test_fsync.- The 2018 "fsync gate." PostgreSQL discovered that on Linux, if a writeback error occurs, the error may be reported to one
fsynccaller and then the dirty page is dropped — a laterfsyncreturns success despite lost data. The fix was to panic and force full recovery on anyfsyncfailure rather than trust a retry. Lesson: a failedfsyncis not safely retryable. - Torn pages. Because a page write isn't atomic across sectors, a crash mid-write corrupts the page. PostgreSQL guards this with full-page writes (the first modification of a page after a checkpoint logs the entire page image into the WAL); InnoDB uses the doublewrite buffer. Both cost extra WAL/IO — and disabling them for speed is a classic footgun.
- WAL that won't recycle. An abandoned replication slot, a stuck archiver, or a long-running transaction pins WAL segments; the log directory fills and the database halts. A frequent 3 AM page.
- Assuming
fsync=off/synchronous_commit=offis "just faster." It moves you off the durability guarantee entirely — a crash loses the last window of commits. Fine for a rebuildable cache, catastrophic for a system of record.
Trade-offs & when to use vs a named alternative
The durability knob is the design decision. From strongest to weakest: (a) fsync every commit (full durability, lowest throughput); (b) group commit (full durability, high throughput, small added latency — the default sweet spot); (c) asynchronous commit (synchronous_commit=off: commit returns before fsync, risking a bounded window of committed-but-lost transactions on crash); (d) fsync=off (no durability). Move down the ladder only as the cost of losing a recent commit falls.
WAL vs shadow paging (the named alternative, used by classic SQLite rollback mode and LMDB): shadow paging never overwrites a page in place — it writes a new copy and atomically flips a root pointer at commit, giving atomicity/durability without a redo log. It shines for simple, low-concurrency, embedded workloads: no log to manage, trivially crash-safe. But it fragments the file, does poorly with high write concurrency (the root-pointer flip serializes), and loses locality. WAL wins for high-throughput, high-concurrency OLTP because sequential log appends + group commit + deferred batched page writes extract far more IOPS from the hardware — which is why every server-class engine (PostgreSQL, InnoDB, Oracle, SQL Server, and SQLite's own WAL mode) uses write-ahead logging. Choose shadow paging for an embedded single-writer store; choose WAL when many transactions commit concurrently and you need to bound both commit latency and recovery time.
Sharp vs fuzzy checkpoints (the named alternative within WAL itself): a sharp checkpoint trades availability for recovery simplicity — every write blocks until the whole buffer pool is flushed, then recovery has nothing to redo before the checkpoint at all. A fuzzy checkpoint trades a little recovery-code complexity (you must track the DPT/ATT and compute the minimum recLSN) for zero write-stalling. Any system that can tolerate a brief freeze (offline batch systems, some embedded stores) can use sharp checkpoints for their simplicity; anything user-facing and always-on uses fuzzy.
Takeaways
- WAL's power is physical: it converts durability from expensive random page writes into a cheap sequential log append + one
fsync, deferring and batching the random writes. - Steal + no-force is the high-throughput policy, and it is exactly why the log must carry both undo (for pages stolen while uncommitted) and redo (for commits not yet flushed).
- The WAL rule orders two different events — log durability and page-flush-to-disk — not the in-memory mutation, which can happen immediately.
fsynclatency is a hard per-writer ceiling (~100–200/sec on HDD) because every committer serializes on one physical log write head; group commit raises throughput by sharing that one sync across many waiting transactions, not by parallelizing it.- Fuzzy checkpoints bound recovery time without stalling writes, by snapshotting the dirty-page table's minimum
recLSNas the point redo must start from; sharp checkpoints get simpler recovery at the cost of write-stalls, and every production engine picks fuzzy. - ARIES recovery is three passes — Analysis rebuilds state, Redo repeats history from the checkpoint bound, Undo rolls back losers with restart-safe CLRs.
- WAL vs shadow paging is a concurrency trade: shadow paging is simpler for a single-writer embedded store; WAL wins whenever many transactions commit concurrently.
🤖 Don't fully get this? Learn it with Claude
Stuck on Write-Ahead Logging & Durability? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **Write-Ahead Logging & Durability** (Databases) and want to truly understand it. Explain Write-Ahead Logging & Durability from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **Write-Ahead Logging & Durability** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **Write-Ahead Logging & Durability** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **Write-Ahead Logging & Durability** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.