Knowledge Guide
HomeDatabasesDatabase Engine Internals

MVCC & Snapshot Isolation

The core trick: never overwrite, never read-lock

Multi-version concurrency control (MVCC) works because an UPDATE does not mutate a row in place — the engine keeps the old version and writes a new physical version tagged with the transaction that created it. A reader is handed a snapshot: a rule for deciding, per row version, "was this change committed before my transaction started?" Because every version carries the id of the transaction that wrote it and the id of the transaction that deleted/superseded it, a reader can reconstruct exactly the database as it looked at one instant — without taking a single lock. Readers never block writers and writers never block readers; the only contention is writer-vs-writer on the same row.

Why it matters

Before MVCC, consistent reads meant shared read locks (two-phase locking). A long analytic query would hold read locks across thousands of rows and stall every writer behind it; a writer holding an exclusive lock would stall every reader. Throughput collapsed under mixed read/write load. MVCC removes that entire class of contention: a 10-minute reporting query sees a frozen, self-consistent view of the data at its start time while OLTP writes stream past it untouched. This is why essentially every serious OLTP engine — PostgreSQL, InnoDB, Oracle, SQL Server (RCSI), CockroachDB — is MVCC-based.

Transaction ids and the visibility test

Each transaction gets a monotonically increasing id (PostgreSQL calls it a txid/xid; InnoDB a trx_id). Every row version stores two hidden fields:

A snapshot in PostgreSQL is the triple (xmin, xmax, xip_list): xmin = oldest transaction still running, xmax = the next txid not yet assigned, and xip_list = the set of txids in-progress at the moment the snapshot was taken. The visibility rule for a version, given snapshot S, is:

  1. Its xmin must be committed and visible to S — i.e. xmin < S.xmax and xmin not in S.xip_list. (A change made by a transaction that started after me, or was still running when I took my snapshot, is invisible.)
  2. Its xmax must be absent, aborted, or NOT yet visible to S — otherwise the version was deleted from my point of view.

Commit state is not stored in the row; it lives in a separate commit log (PostgreSQL's pg_xact/clog, InnoDB's transaction system). Visibility = row's xmin/xmax + commit log + snapshot, evaluated together.

A traced example

Assume accounts holds one row for #42, initially balance=100 written by committed txid 100. Two transactions run:

TimeTransaction TW (txid 205)Transaction TR (txid 210)Physical state of row #42
t0BEGINv1{bal 100, xmin 100, xmax 0}
t1UPDATE bal=150v1{xmax←205}; v2{bal 150, xmin 205, xmax 0}
t2BEGIN; snapshot taken (xip includes 205)unchanged
t3SELECT bal → 100 (205 in xip → v2 invisible)unchanged
t4COMMITclog: 205 committed
t5SELECT bal → still 100 (snapshot frozen at t2)unchanged
t6COMMITv1 now dead for all future snapshots

Key point at t5: even though 205 committed at t4, TR's snapshot was captured at t2 with 205 in its in-progress list, so v2 stays invisible for TR's whole life. That is repeatable read under snapshot isolation. (Under PostgreSQL's default READ COMMITTED, TR would take a fresh snapshot per statement, so t5 would return 150.)

Snapshot isolation is not serializable — write skew

Snapshot isolation (SI) is very strong: no dirty reads, no non-repeatable reads, and no lost updates — because two transactions updating the same row create a write-write conflict and one of them is rejected. PostgreSQL raises could not serialize access due to concurrent update, but the exact mechanism is worth being precise about: PostgreSQL implements first-updater-wins, not pure optimistic first-committer-wins. The second writer to touch the row blocks on the ordinary row lock until the first either commits or aborts; only then does it re-check whether the row changed under it, and if so, it errors out. Classic Berenson-et-al. SI and systems like CockroachDB instead let both writers proceed and use an optimistic check purely at commit time (first committer wins). The observable end state — one writer wins, the other retries — is the same, but the blocking-then-checking mechanism differs from a pure optimistic scheme. But SI is not serializable either way. Its signature anomaly is write skew: two transactions read an overlapping set of rows, then each writes a different row. There is no write-write conflict on any single row, so nothing aborts — yet the combined result violates an invariant that spans rows.

Canonical case: a hospital requires at least one doctor on call. Two doctors, Alice and Bob, are both on call. Each opens a transaction, reads "on-call count = 2 ≥ 1, safe to leave," and sets their own row to off-call. Both snapshots read the count before either write. Both commit. Result: zero doctors on call — an invariant broken by two individually-legal transactions.

Vacuum and garbage collection of old versions

Never overwriting means old versions accumulate. A version can be reclaimed only once no live or future snapshot could ever need it — i.e. its xmax transaction is committed and older than the oldest snapshot still running (PostgreSQL's xmin horizon / "OldestXmin"). Until then it is dead but unremovable.

PostgreSQL reclaims dead tuples with VACUUM (usually autovacuum in the background): it scans, frees dead tuples for reuse within the page, updates the free space map and visibility map, and advances the frozen-xid horizon to prevent transaction id wraparound (the 32-bit txid space is circular; if the oldest unfrozen row is ~2 billion txids behind, the database force-stops to avoid data appearing "in the future"). Note plain VACUUM does not usually shrink the file — that needs VACUUM FULL (which rewrites and takes an exclusive lock). InnoDB's equivalent is the purge thread, which trims undo-log records once no ReadView needs them.

How PostgreSQL and MySQL/InnoDB differ

Both are MVCC, but they store versions in opposite ways — and the trade-offs follow directly.

PostgreSQL (heap MVCC)MySQL InnoDB (undo-log MVCC)
Where old versions liveIn the table heap itself — every UPDATE inserts a whole new tupleOnly the latest version is in the clustered index; old versions are reconstructed from the undo log
Row metadataxmin, xmax, t_ctid chain per tupleDB_TRX_ID + DB_ROLL_PTR pointing into undo
Read of an older versionWalk the heap; skip versions failing the visibility testFollow roll_ptr back through undo, applying diffs until the version is visible to the ReadView
GCVACUUM / autovacuumPurge thread trims undo
Main costTable & index bloat; every index points at each tuple version (mitigated by HOT updates when no indexed column changes)Undo/history bloat; a long-running read transaction pins undo ("history list length" grows), slowing everyone
Write-write conflict handlingFirst-updater-wins: second writer blocks on the row lock, then re-checks after the first commits/abortsStandard row-lock queueing under 2PL-style locking reads; MVCC governs what concurrent readers see, not writer ordering
Default isolationREAD COMMITTED (fresh snapshot per statement)REPEATABLE READ (one snapshot per transaction) + gap/next-key locks to block phantoms

The deep consequence: in InnoDB, secondary indexes are compact and reads of hot rows are cheap, but a forgotten BEGIN in a reporting connection can bloat undo indefinitely. In PostgreSQL, updates are fast (append a tuple), but a churny table with lagging autovacuum bloats badly and index scans degrade. Neither engine's default isolation level is serializable — both need an explicit opt-in (Postgres SERIALIZABLE/SSI, or application-level locking reads in InnoDB) to close the write-skew gap described above.

Pitfalls a working engineer hits

Trade-offs & when to use vs. named alternatives

MVCC snapshot isolation vs. two-phase locking (2PL / classic SERIALIZABLE). 2PL gives true serializability by taking shared locks on everything read and exclusive locks on everything written, held to commit. It prevents write skew — but readers block writers and vice versa, so a long read stalls the write path and concurrency drops. MVCC-SI buys enormous read/write concurrency (no read locks) at the cost of one anomaly class (write skew). Use SI as your default for OLTP with heavy mixed read/write load and no fragile cross-row invariants. Use 2PL-style locking only where you truly need serializable guarantees and can tolerate the contention.

MVCC-SI vs. Serializable Snapshot Isolation (SSI). SSI (Cahill/Fekete, shipped as PostgreSQL's SERIALIZABLE) keeps MVCC's lock-free reads but adds runtime tracking of read-write dependencies; when it detects the "dangerous structure" of two rw-antidependencies that could form a cycle, it aborts one transaction. You get true serializability with near-SI concurrency — the price is bookkeeping overhead and occasional false-positive aborts (so, again, retry logic). Prefer SSI when you need serializability and reads dominate. Fixes short of full serializable: under plain SI you can defeat specific write skews cheaply with SELECT ... FOR UPDATE / FOR SHARE to materialize a conflict, or a promotion lock / explicit constraint. Note MySQL InnoDB has no SSI — its SERIALIZABLE degrades to locking reads (adds S/next-key locks), closer to 2PL.

Takeaways

Recall question

Why does snapshot isolation reliably prevent lost updates but not write skew, even though both involve concurrent transactions reading stale data?

Answer: a lost update is two transactions writing the same row — a write-write conflict on one version chain, which SI detects (first-updater-wins in Postgres; first-committer-wins in classic SI) and rejects the loser. Write skew has each transaction writing a different row, so there is no write-write conflict on any single version to detect; the violated invariant spans multiple rows, which version-level conflict checking cannot see.


Sources: M. Kleppmann, Designing Data-Intensive Applications, ch. 7 (Weak Isolation, Snapshot Isolation, Write Skew & Phantoms); A. Petrov, Database Internals, ch. 5; PostgreSQL documentation — Concurrency Control (MVCC) & Routine Vacuuming; H. Berenson et al., "A Critique of ANSI SQL Isolation Levels" (SIGMOD 1995); M. Cahill, U. Röhm, A. Fekete, "Serializable Isolation for Snapshot Databases" (SIGMOD 2008); MySQL InnoDB manual — Multi-Versioning & the undo/purge subsystem. Re-authored/Deepened for this guide.

🤖 Don't fully get this? Learn it with Claude

Stuck on MVCC & Snapshot Isolation? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **MVCC & Snapshot Isolation** (Databases) and want to truly understand it. Explain MVCC & Snapshot Isolation from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **MVCC & Snapshot Isolation** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **MVCC & Snapshot Isolation** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **MVCC & Snapshot Isolation** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes