Home › Databases › Database Engine Internals

MVCC & Snapshot Isolation

The core trick: never overwrite, never read-lock

Multi-version concurrency control (MVCC) works because an UPDATE does not mutate a row in place — the engine keeps the old version and writes a new physical version tagged with the transaction that created it. A reader is handed a snapshot: a rule for deciding, per row version, "was this change committed before my transaction started?" Because every version carries the id of the transaction that wrote it and the id of the transaction that deleted/superseded it, a reader can reconstruct exactly the database as it looked at one instant — without taking a single lock. Readers never block writers and writers never block readers; the only contention is writer-vs-writer on the same row.

Why it matters

Before MVCC, consistent reads meant shared read locks (two-phase locking). A long analytic query would hold read locks across thousands of rows and stall every writer behind it; a writer holding an exclusive lock would stall every reader. Throughput collapsed under mixed read/write load. MVCC removes that entire class of contention: a 10-minute reporting query sees a frozen, self-consistent view of the data at its start time while OLTP writes stream past it untouched. This is why essentially every serious OLTP engine — PostgreSQL, InnoDB, Oracle, SQL Server (RCSI), CockroachDB — is MVCC-based.

Transaction ids and the visibility test

Each transaction gets a monotonically increasing id (PostgreSQL calls it a txid/xid; InnoDB a trx_id). Every row version stores two hidden fields:

xmin — the txid that created this version (the inserting/updating transaction).
xmax — the txid that deleted or superseded it. On disk this is never a true NULL — 0 is the invalid-transaction sentinel value, and PostgreSQL further gates whether it counts as "set" using the XMAX_INVALID/XMAX_COMMITTED hint bits in the tuple's infomask. Conceptually, though, "xmax unset" means "still live."

A snapshot in PostgreSQL is the triple (xmin, xmax, xip_list): xmin = oldest transaction still running, xmax = the next txid not yet assigned, and xip_list = the set of txids in-progress at the moment the snapshot was taken. The visibility rule for a version, given snapshot S, is:

Its xmin must be committed and visible to S — i.e. xmin < S.xmax and xmin not in S.xip_list. (A change made by a transaction that started after me, or was still running when I took my snapshot, is invisible.)
Its xmax must be absent, aborted, or NOT yet visible to S — otherwise the version was deleted from my point of view.

Commit state is not stored in the row; it lives in a separate commit log (PostgreSQL's pg_xact/clog, InnoDB's transaction system). Visibility = row's xmin/xmax + commit log + snapshot, evaluated together.

A traced example

Assume accounts holds one row for #42, initially balance=100 written by committed txid 100. Two transactions run:

Time	Transaction T_W (txid 205)	Transaction T_R (txid 210)	Physical state of row #42
t0	BEGIN		v1{bal 100, xmin 100, xmax 0}
t1	UPDATE bal=150		v1{xmax←205}; v2{bal 150, xmin 205, xmax 0}
t2		BEGIN; snapshot taken (xip includes 205)	unchanged
t3		SELECT bal → 100 (205 in xip → v2 invisible)	unchanged
t4	COMMIT		clog: 205 committed
t5		SELECT bal → still 100 (snapshot frozen at t2)	unchanged
t6		COMMIT	v1 now dead for all future snapshots

Key point at t5: even though 205 committed at t4, T_R's snapshot was captured at t2 with 205 in its in-progress list, so v2 stays invisible for T_R's whole life. That is repeatable read under snapshot isolation. (Under PostgreSQL's default READ COMMITTED, T_R would take a fresh snapshot per statement, so t5 would return 150.)

Snapshot isolation is not serializable — write skew

Snapshot isolation (SI) is very strong: no dirty reads, no non-repeatable reads, and no lost updates — because two transactions updating the same row create a write-write conflict and one of them is rejected. PostgreSQL raises could not serialize access due to concurrent update, but the exact mechanism is worth being precise about: PostgreSQL implements first-updater-wins, not pure optimistic first-committer-wins. The second writer to touch the row blocks on the ordinary row lock until the first either commits or aborts; only then does it re-check whether the row changed under it, and if so, it errors out. Classic Berenson-et-al. SI and systems like CockroachDB instead let both writers proceed and use an optimistic check purely at commit time (first committer wins). The observable end state — one writer wins, the other retries — is the same, but the blocking-then-checking mechanism differs from a pure optimistic scheme. But SI is not serializable either way. Its signature anomaly is write skew: two transactions read an overlapping set of rows, then each writes a different row. There is no write-write conflict on any single row, so nothing aborts — yet the combined result violates an invariant that spans rows.

Canonical case: a hospital requires at least one doctor on call. Two doctors, Alice and Bob, are both on call. Each opens a transaction, reads "on-call count = 2 ≥ 1, safe to leave," and sets their own row to off-call. Both snapshots read the count before either write. Both commit. Result: zero doctors on call — an invariant broken by two individually-legal transactions.

Vacuum and garbage collection of old versions

Never overwriting means old versions accumulate. A version can be reclaimed only once no live or future snapshot could ever need it — i.e. its xmax transaction is committed and older than the oldest snapshot still running (PostgreSQL's xmin horizon / "OldestXmin"). Until then it is dead but unremovable.

PostgreSQL reclaims dead tuples with VACUUM (usually autovacuum in the background): it scans, frees dead tuples for reuse within the page, updates the free space map and visibility map, and advances the frozen-xid horizon to prevent transaction id wraparound (the 32-bit txid space is circular; if the oldest unfrozen row is ~2 billion txids behind, the database force-stops to avoid data appearing "in the future"). Note plain VACUUM does not usually shrink the file — that needs VACUUM FULL (which rewrites and takes an exclusive lock). InnoDB's equivalent is the purge thread, which trims undo-log records once no ReadView needs them.

How PostgreSQL and MySQL/InnoDB differ

Both are MVCC, but they store versions in opposite ways — and the trade-offs follow directly.

	PostgreSQL (heap MVCC)	MySQL InnoDB (undo-log MVCC)
Where old versions live	In the table heap itself — every UPDATE inserts a whole new tuple	Only the latest version is in the clustered index; old versions are reconstructed from the undo log
Row metadata	`xmin`, `xmax`, `t_ctid` chain per tuple	`DB_TRX_ID` + `DB_ROLL_PTR` pointing into undo
Read of an older version	Walk the heap; skip versions failing the visibility test	Follow `roll_ptr` back through undo, applying diffs until the version is visible to the ReadView
GC	VACUUM / autovacuum	Purge thread trims undo
Main cost	Table & index bloat; every index points at each tuple version (mitigated by HOT updates when no indexed column changes)	Undo/history bloat; a long-running read transaction pins undo ("history list length" grows), slowing everyone
Write-write conflict handling	First-updater-wins: second writer blocks on the row lock, then re-checks after the first commits/aborts	Standard row-lock queueing under 2PL-style locking reads; MVCC governs what concurrent readers see, not writer ordering
Default isolation	`READ COMMITTED` (fresh snapshot per statement)	`REPEATABLE READ` (one snapshot per transaction) + gap/next-key locks to block phantoms

The deep consequence: in InnoDB, secondary indexes are compact and reads of hot rows are cheap, but a forgotten BEGIN in a reporting connection can bloat undo indefinitely. In PostgreSQL, updates are fast (append a tuple), but a churny table with lagging autovacuum bloats badly and index scans degrade. Neither engine's default isolation level is serializable — both need an explicit opt-in (Postgres SERIALIZABLE/SSI, or application-level locking reads in InnoDB) to close the write-skew gap described above.

Pitfalls a working engineer hits

Assuming SI = serializable. Write skew and phantom-based invariant violations slip through. If correctness depends on a cross-row invariant ("at least one on call", "balance never negative across two accounts"), SI will silently break it under concurrency.
Long-running transactions poison GC. One idle-in-transaction connection holds the xmin horizon back, so VACUUM/purge cannot reclaim anything newer — bloat balloons cluster-wide. Monitor pg_stat_activity for idle in transaction and InnoDB's history list length.
Serialization failures are normal under RR/SERIALIZABLE. PostgreSQL will abort with 40001 (could not serialize access); the app must retry the whole transaction, not just resubmit the statement.
Txid wraparound. Suppressing/starving autovacuum on a high-write cluster risks an emergency read-only shutdown. This is a genuine production outage cause.
Read Committed re-reads. In PostgreSQL's default level, two SELECTs in one transaction can return different values — each takes a new snapshot. Code that assumes stability must use REPEATABLE READ.
Bloated indexes on Postgres UPDATE-heavy tables — check whether HOT updates apply (don't index columns you update constantly).

Trade-offs & when to use vs. named alternatives

MVCC snapshot isolation vs. two-phase locking (2PL / classic SERIALIZABLE). 2PL gives true serializability by taking shared locks on everything read and exclusive locks on everything written, held to commit. It prevents write skew — but readers block writers and vice versa, so a long read stalls the write path and concurrency drops. MVCC-SI buys enormous read/write concurrency (no read locks) at the cost of one anomaly class (write skew). Use SI as your default for OLTP with heavy mixed read/write load and no fragile cross-row invariants. Use 2PL-style locking only where you truly need serializable guarantees and can tolerate the contention.

MVCC-SI vs. Serializable Snapshot Isolation (SSI). SSI (Cahill/Fekete, shipped as PostgreSQL's SERIALIZABLE) keeps MVCC's lock-free reads but adds runtime tracking of read-write dependencies; when it detects the "dangerous structure" of two rw-antidependencies that could form a cycle, it aborts one transaction. You get true serializability with near-SI concurrency — the price is bookkeeping overhead and occasional false-positive aborts (so, again, retry logic). Prefer SSI when you need serializability and reads dominate. Fixes short of full serializable: under plain SI you can defeat specific write skews cheaply with SELECT ... FOR UPDATE / FOR SHARE to materialize a conflict, or a promotion lock / explicit constraint. Note MySQL InnoDB has no SSI — its SERIALIZABLE degrades to locking reads (adds S/next-key locks), closer to 2PL.

Takeaways

MVCC = versions tagged with creator/deleter txids + a per-reader snapshot; reads reconstruct a consistent instant with no read locks, so readers and writers never block each other.
Snapshot isolation eliminates dirty reads, non-repeatable reads, and lost updates — but permits write skew, so it is not serializable.
Postgres's write-write conflict handling is first-updater-wins (block-then-recheck), a related but distinct mechanism from the pure optimistic first-committer-wins scheme in classic SI/CockroachDB — same observable outcome, different path to it.
Old versions must be garbage-collected (VACUUM / InnoDB purge); a single long-running transaction stalls GC cluster-wide and causes bloat.
Postgres stores versions in the heap (fast writes, bloat + autovacuum + wraparound to manage); InnoDB reconstructs from undo (compact table, undo/history bloat under long readers). For true serializability, reach for Postgres SSI, targeted FOR UPDATE, or locking.

Recall question

Why does snapshot isolation reliably prevent lost updates but not write skew, even though both involve concurrent transactions reading stale data?

Answer: a lost update is two transactions writing the same row — a write-write conflict on one version chain, which SI detects (first-updater-wins in Postgres; first-committer-wins in classic SI) and rejects the loser. Write skew has each transaction writing a different row, so there is no write-write conflict on any single version to detect; the violated invariant spans multiple rows, which version-level conflict checking cannot see.

Sources: M. Kleppmann, Designing Data-Intensive Applications, ch. 7 (Weak Isolation, Snapshot Isolation, Write Skew & Phantoms); A. Petrov, Database Internals, ch. 5; PostgreSQL documentation — Concurrency Control (MVCC) & Routine Vacuuming; H. Berenson et al., "A Critique of ANSI SQL Isolation Levels" (SIGMOD 1995); M. Cahill, U. Röhm, A. Fekete, "Serializable Isolation for Snapshot Databases" (SIGMOD 2008); MySQL InnoDB manual — Multi-Versioning & the undo/purge subsystem. Re-authored/Deepened for this guide.

🤖 Don't fully get this? Learn it with Claude

Stuck on MVCC & Snapshot Isolation? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **MVCC & Snapshot Isolation** (Databases) and want to truly understand it. Explain MVCC & Snapshot Isolation from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **MVCC & Snapshot Isolation** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **MVCC & Snapshot Isolation** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **MVCC & Snapshot Isolation** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes

← B-Tree vs LSM-Tree Storage Engines