MVCC & Snapshot Isolation
The core trick: never overwrite, never read-lock
Multi-version concurrency control (MVCC) works because an UPDATE does not mutate a row in place — the engine keeps the old version and writes a new physical version tagged with the transaction that created it. A reader is handed a snapshot: a rule for deciding, per row version, "was this change committed before my transaction started?" Because every version carries the id of the transaction that wrote it and the id of the transaction that deleted/superseded it, a reader can reconstruct exactly the database as it looked at one instant — without taking a single lock. Readers never block writers and writers never block readers; the only contention is writer-vs-writer on the same row.
Why it matters
Before MVCC, consistent reads meant shared read locks (two-phase locking). A long analytic query would hold read locks across thousands of rows and stall every writer behind it; a writer holding an exclusive lock would stall every reader. Throughput collapsed under mixed read/write load. MVCC removes that entire class of contention: a 10-minute reporting query sees a frozen, self-consistent view of the data at its start time while OLTP writes stream past it untouched. This is why essentially every serious OLTP engine — PostgreSQL, InnoDB, Oracle, SQL Server (RCSI), CockroachDB — is MVCC-based.
Transaction ids and the visibility test
Each transaction gets a monotonically increasing id (PostgreSQL calls it a txid/xid; InnoDB a trx_id). Every row version stores two hidden fields:
xmin— the txid that created this version (the inserting/updating transaction).xmax— the txid that deleted or superseded it. On disk this is never a true NULL —0is the invalid-transaction sentinel value, and PostgreSQL further gates whether it counts as "set" using theXMAX_INVALID/XMAX_COMMITTEDhint bits in the tuple's infomask. Conceptually, though, "xmax unset" means "still live."
A snapshot in PostgreSQL is the triple (xmin, xmax, xip_list): xmin = oldest transaction still running, xmax = the next txid not yet assigned, and xip_list = the set of txids in-progress at the moment the snapshot was taken. The visibility rule for a version, given snapshot S, is:
- Its
xminmust be committed and visible to S — i.e.xmin < S.xmaxandxminnot inS.xip_list. (A change made by a transaction that started after me, or was still running when I took my snapshot, is invisible.) - Its
xmaxmust be absent, aborted, or NOT yet visible to S — otherwise the version was deleted from my point of view.
Commit state is not stored in the row; it lives in a separate commit log (PostgreSQL's pg_xact/clog, InnoDB's transaction system). Visibility = row's xmin/xmax + commit log + snapshot, evaluated together.
A traced example
Assume accounts holds one row for #42, initially balance=100 written by committed txid 100. Two transactions run:
| Time | Transaction TW (txid 205) | Transaction TR (txid 210) | Physical state of row #42 |
|---|---|---|---|
| t0 | BEGIN | v1{bal 100, xmin 100, xmax 0} | |
| t1 | UPDATE bal=150 | v1{xmax←205}; v2{bal 150, xmin 205, xmax 0} | |
| t2 | BEGIN; snapshot taken (xip includes 205) | unchanged | |
| t3 | SELECT bal → 100 (205 in xip → v2 invisible) | unchanged | |
| t4 | COMMIT | clog: 205 committed | |
| t5 | SELECT bal → still 100 (snapshot frozen at t2) | unchanged | |
| t6 | COMMIT | v1 now dead for all future snapshots |
Key point at t5: even though 205 committed at t4, TR's snapshot was captured at t2 with 205 in its in-progress list, so v2 stays invisible for TR's whole life. That is repeatable read under snapshot isolation. (Under PostgreSQL's default READ COMMITTED, TR would take a fresh snapshot per statement, so t5 would return 150.)
Snapshot isolation is not serializable — write skew
Snapshot isolation (SI) is very strong: no dirty reads, no non-repeatable reads, and no lost updates — because two transactions updating the same row create a write-write conflict and one of them is rejected. PostgreSQL raises could not serialize access due to concurrent update, but the exact mechanism is worth being precise about: PostgreSQL implements first-updater-wins, not pure optimistic first-committer-wins. The second writer to touch the row blocks on the ordinary row lock until the first either commits or aborts; only then does it re-check whether the row changed under it, and if so, it errors out. Classic Berenson-et-al. SI and systems like CockroachDB instead let both writers proceed and use an optimistic check purely at commit time (first committer wins). The observable end state — one writer wins, the other retries — is the same, but the blocking-then-checking mechanism differs from a pure optimistic scheme. But SI is not serializable either way. Its signature anomaly is write skew: two transactions read an overlapping set of rows, then each writes a different row. There is no write-write conflict on any single row, so nothing aborts — yet the combined result violates an invariant that spans rows.
Canonical case: a hospital requires at least one doctor on call. Two doctors, Alice and Bob, are both on call. Each opens a transaction, reads "on-call count = 2 ≥ 1, safe to leave," and sets their own row to off-call. Both snapshots read the count before either write. Both commit. Result: zero doctors on call — an invariant broken by two individually-legal transactions.
Vacuum and garbage collection of old versions
Never overwriting means old versions accumulate. A version can be reclaimed only once no live or future snapshot could ever need it — i.e. its xmax transaction is committed and older than the oldest snapshot still running (PostgreSQL's xmin horizon / "OldestXmin"). Until then it is dead but unremovable.
PostgreSQL reclaims dead tuples with VACUUM (usually autovacuum in the background): it scans, frees dead tuples for reuse within the page, updates the free space map and visibility map, and advances the frozen-xid horizon to prevent transaction id wraparound (the 32-bit txid space is circular; if the oldest unfrozen row is ~2 billion txids behind, the database force-stops to avoid data appearing "in the future"). Note plain VACUUM does not usually shrink the file — that needs VACUUM FULL (which rewrites and takes an exclusive lock). InnoDB's equivalent is the purge thread, which trims undo-log records once no ReadView needs them.
How PostgreSQL and MySQL/InnoDB differ
Both are MVCC, but they store versions in opposite ways — and the trade-offs follow directly.
| PostgreSQL (heap MVCC) | MySQL InnoDB (undo-log MVCC) | |
|---|---|---|
| Where old versions live | In the table heap itself — every UPDATE inserts a whole new tuple | Only the latest version is in the clustered index; old versions are reconstructed from the undo log |
| Row metadata | xmin, xmax, t_ctid chain per tuple | DB_TRX_ID + DB_ROLL_PTR pointing into undo |
| Read of an older version | Walk the heap; skip versions failing the visibility test | Follow roll_ptr back through undo, applying diffs until the version is visible to the ReadView |
| GC | VACUUM / autovacuum | Purge thread trims undo |
| Main cost | Table & index bloat; every index points at each tuple version (mitigated by HOT updates when no indexed column changes) | Undo/history bloat; a long-running read transaction pins undo ("history list length" grows), slowing everyone |
| Write-write conflict handling | First-updater-wins: second writer blocks on the row lock, then re-checks after the first commits/aborts | Standard row-lock queueing under 2PL-style locking reads; MVCC governs what concurrent readers see, not writer ordering |
| Default isolation | READ COMMITTED (fresh snapshot per statement) | REPEATABLE READ (one snapshot per transaction) + gap/next-key locks to block phantoms |
The deep consequence: in InnoDB, secondary indexes are compact and reads of hot rows are cheap, but a forgotten BEGIN in a reporting connection can bloat undo indefinitely. In PostgreSQL, updates are fast (append a tuple), but a churny table with lagging autovacuum bloats badly and index scans degrade. Neither engine's default isolation level is serializable — both need an explicit opt-in (Postgres SERIALIZABLE/SSI, or application-level locking reads in InnoDB) to close the write-skew gap described above.
Pitfalls a working engineer hits
- Assuming SI = serializable. Write skew and phantom-based invariant violations slip through. If correctness depends on a cross-row invariant ("at least one on call", "balance never negative across two accounts"), SI will silently break it under concurrency.
- Long-running transactions poison GC. One idle-in-transaction connection holds the xmin horizon back, so VACUUM/purge cannot reclaim anything newer — bloat balloons cluster-wide. Monitor
pg_stat_activityforidle in transactionand InnoDB's history list length. - Serialization failures are normal under RR/SERIALIZABLE. PostgreSQL will abort with
40001(could not serialize access); the app must retry the whole transaction, not just resubmit the statement. - Txid wraparound. Suppressing/starving autovacuum on a high-write cluster risks an emergency read-only shutdown. This is a genuine production outage cause.
- Read Committed re-reads. In PostgreSQL's default level, two
SELECTs in one transaction can return different values — each takes a new snapshot. Code that assumes stability must useREPEATABLE READ. - Bloated indexes on Postgres UPDATE-heavy tables — check whether HOT updates apply (don't index columns you update constantly).
Trade-offs & when to use vs. named alternatives
MVCC snapshot isolation vs. two-phase locking (2PL / classic SERIALIZABLE). 2PL gives true serializability by taking shared locks on everything read and exclusive locks on everything written, held to commit. It prevents write skew — but readers block writers and vice versa, so a long read stalls the write path and concurrency drops. MVCC-SI buys enormous read/write concurrency (no read locks) at the cost of one anomaly class (write skew). Use SI as your default for OLTP with heavy mixed read/write load and no fragile cross-row invariants. Use 2PL-style locking only where you truly need serializable guarantees and can tolerate the contention.
MVCC-SI vs. Serializable Snapshot Isolation (SSI). SSI (Cahill/Fekete, shipped as PostgreSQL's SERIALIZABLE) keeps MVCC's lock-free reads but adds runtime tracking of read-write dependencies; when it detects the "dangerous structure" of two rw-antidependencies that could form a cycle, it aborts one transaction. You get true serializability with near-SI concurrency — the price is bookkeeping overhead and occasional false-positive aborts (so, again, retry logic). Prefer SSI when you need serializability and reads dominate. Fixes short of full serializable: under plain SI you can defeat specific write skews cheaply with SELECT ... FOR UPDATE / FOR SHARE to materialize a conflict, or a promotion lock / explicit constraint. Note MySQL InnoDB has no SSI — its SERIALIZABLE degrades to locking reads (adds S/next-key locks), closer to 2PL.
Takeaways
- MVCC = versions tagged with creator/deleter txids + a per-reader snapshot; reads reconstruct a consistent instant with no read locks, so readers and writers never block each other.
- Snapshot isolation eliminates dirty reads, non-repeatable reads, and lost updates — but permits write skew, so it is not serializable.
- Postgres's write-write conflict handling is first-updater-wins (block-then-recheck), a related but distinct mechanism from the pure optimistic first-committer-wins scheme in classic SI/CockroachDB — same observable outcome, different path to it.
- Old versions must be garbage-collected (VACUUM / InnoDB purge); a single long-running transaction stalls GC cluster-wide and causes bloat.
- Postgres stores versions in the heap (fast writes, bloat + autovacuum + wraparound to manage); InnoDB reconstructs from undo (compact table, undo/history bloat under long readers). For true serializability, reach for Postgres SSI, targeted
FOR UPDATE, or locking.
Recall question
Why does snapshot isolation reliably prevent lost updates but not write skew, even though both involve concurrent transactions reading stale data?
Answer: a lost update is two transactions writing the same row — a write-write conflict on one version chain, which SI detects (first-updater-wins in Postgres; first-committer-wins in classic SI) and rejects the loser. Write skew has each transaction writing a different row, so there is no write-write conflict on any single version to detect; the violated invariant spans multiple rows, which version-level conflict checking cannot see.
Sources: M. Kleppmann, Designing Data-Intensive Applications, ch. 7 (Weak Isolation, Snapshot Isolation, Write Skew & Phantoms); A. Petrov, Database Internals, ch. 5; PostgreSQL documentation — Concurrency Control (MVCC) & Routine Vacuuming; H. Berenson et al., "A Critique of ANSI SQL Isolation Levels" (SIGMOD 1995); M. Cahill, U. Röhm, A. Fekete, "Serializable Isolation for Snapshot Databases" (SIGMOD 2008); MySQL InnoDB manual — Multi-Versioning & the undo/purge subsystem. Re-authored/Deepened for this guide.
🤖 Don't fully get this? Learn it with Claude
Stuck on MVCC & Snapshot Isolation? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **MVCC & Snapshot Isolation** (Databases) and want to truly understand it. Explain MVCC & Snapshot Isolation from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **MVCC & Snapshot Isolation** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **MVCC & Snapshot Isolation** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **MVCC & Snapshot Isolation** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.