B-Tree vs LSM-Tree Storage Engines
Two answers to one question: where does a row physically go when you write it?
Every OLTP storage engine must reconcile a hardware fact — disks (and SSDs) love large sequential writes and hate small random ones — with the logical need to keep keys in sorted order for fast lookups and range scans. A B-tree keeps the sorted structure permanently on disk and mutates it in place: to change a key you seek to the exact disk page that owns it and overwrite that page. An LSM-tree (Log-Structured Merge-tree) refuses to seek: it appends every write to an in-memory sorted buffer, flushes buffers as immutable sorted files, and lets a background process merge those files later. The B-tree pays at write time to make reads cheap and predictable; the LSM-tree pays at read time (and in background CPU/IO) to make writes cheap and sequential.
This one design fork determines the write throughput, read latency, disk footprint, and tail-latency behavior of nearly every database you will touch — so a senior engineer chooses the engine by matching its amplification profile to the workload, not by brand loyalty.
B-tree: navigate a fixed tree, overwrite one page
A B-tree stores data in fixed-size pages (PostgreSQL 8 KB, InnoDB 16 KB, SQLite 4 KB). Each page holds many sorted keys plus child-page pointers, giving a high branching factor — with a typical fanout of a few hundred children per page, even billions of keys sit in a tree only 3–4 levels deep (narrower fanout means more levels, wider fanout means fewer). A lookup is a top-down walk: read the root page, binary-search its keys to pick a child pointer, descend, repeat until a leaf. Because depth is fixed for a given fanout, every point read costs the same small, bounded number of page fetches — the killer feature for read-latency SLAs.
Now trace an update. Say we set user:8123.balance = 500 on an InnoDB table:
- Write-ahead log (WAL) first. The change is appended to the redo log and fsynced — this is what makes the update durable and crash-recoverable.
- Locate the leaf. Root → internal → leaf: 3 page reads (often cache hits in the buffer pool).
- Mutate in place. The 100-byte row lives inside a 16 KB leaf page. The engine edits the row inside that page and marks the whole page dirty.
- Flush the whole page. A background checkpoint eventually writes the entire 16 KB page back to disk — you cannot write 100 bytes; the disk block is the unit.
That is write amplification: a 100-byte logical change caused a 16 KB page write (~160×), plus the WAL record. If the insert overflows the leaf, the page splits into two half-full pages and a pointer is pushed up — extra writes and, over time, fragmentation (pages left ~50–70% full). This is not what Postgres's VACUUM fixes: VACUUM reclaims dead tuples left behind by MVCC updates/deletes (old row versions), it does not defragment or re-pack half-full btree leaf pages from splits. Undoing split fragmentation needs a full structural rewrite — REINDEX, VACUUM FULL, or pg_repack — and btree's own page-merge-on-delete logic only partially offsets it in between.
LSM-tree: append, flush immutable files, merge in the background
The LSM-tree turns random writes into sequential ones. The write path:
- WAL append. As with the B-tree, durability comes first: append the write to a commit log, fsync.
- Memtable insert. The key/value goes into an in-memory sorted structure — a skip list (RocksDB) or red-black tree — kept sorted by key. This is an in-RAM O(log n) insert; no disk seek.
- Flush → SSTable. When the memtable exceeds a threshold (e.g. 64 MB), it is frozen and written to disk in one big sequential pass as an immutable SSTable (Sorted String Table): keys in order, with a sparse index and a bloom filter in its footer. The old WAL segment can now be discarded.
- Compaction. A background thread merges multiple SSTables into fewer, larger, non-overlapping ones — dropping overwritten values and applying deletes. This is a k-way merge of already-sorted files, so it too is sequential IO.
Deletes are writes. There is no page to edit, so a delete appends a tombstone marker; the real space is reclaimed only when compaction runs. An update is likewise just a newer append that shadows the old value — so a single key may exist in several SSTables at once, with the newest copy winning.
Reads must reconcile that. A read checks the memtable, then SSTables newest→oldest, returning the first hit. To avoid touching files that lack the key, each SSTable carries a bloom filter — a probabilistic bit-set that answers "definitely absent" or "possibly present." At ~10 bits/key it gives roughly a 1% false-positive rate and zero false negatives, so a point lookup for a missing key usually skips every SSTable's disk read. The residual cost of consulting several files is read amplification.
The core trade-off: three amplifications you cannot all minimize
Every storage engine trades along three axes — the RUM conjecture (Read, Update, Memory: optimizing two worsens the third):
- Write amplification — bytes written to disk per byte of logical data. B-trees: full-page rewrites + page splits + WAL. LSMs: each byte is rewritten once per compaction level it passes through (leveled compaction can be 10–30×).
- Read amplification — disk reads per logical read. B-trees: ~tree depth (3–4), bounded and stable. LSMs: memtable + potentially one SSTable per level, mitigated but not erased by bloom filters — and bloom filters help point lookups only, not range scans.
- Space amplification — disk bytes per byte of live data. B-trees: fragmentation from half-full split pages (often 1.3–2×). LSMs: stale/tombstoned copies awaiting compaction, plus transient double-space during a compaction — but immutable, densely-packed SSTables compress far better, so steady-state LSM space is often the smaller of the two.
Compaction strategy is the LSM's own internal fork of the same trade-off. Size-tiered (Cassandra STCS) merges similarly-sized files — low write amplification, but high space and read amplification (many overlapping files). Leveled (RocksDB/LevelDB LCS) keeps non-overlapping files per level — low read and space amplification, but higher write amplification. Choosing the strategy is choosing which amplification hurts least for your workload.
| B-tree | LSM-tree | |
|---|---|---|
| Write pattern | Random, in-place, full-page | Sequential appends + merges |
| Write throughput | Lower (seek + page rewrite) | Higher (sequential) |
| Read latency | Low & predictable (fixed depth) | Higher, variable (multi-file + compaction) |
| Range scans | Excellent (linked leaves) | Good, but merge across files |
| Steady-state space | Larger (fragmentation) | Smaller (compression, packing) |
| Used by | PostgreSQL, MySQL/InnoDB, SQLite, Oracle, MongoDB WiredTiger | RocksDB, LevelDB, Cassandra, ScyllaDB, HBase, Bigtable, ClickHouse |
Pitfalls a working engineer actually hits
- LSM write stalls. If ingest outruns compaction, L0 files pile up and RocksDB throttles or halts writes (the dreaded
Stalling writes/slowdowntriggers). Symptom: throughput is fine for minutes, then latency spikes brutally. You must provision compaction threads/IO budget for peak, not average, write rate. - Compaction tail latency. A background compaction competes for disk IO with foreground reads, so LSM p99/p999 read latency is spiky even when p50 is great. This is why latency-SLO systems often prefer B-trees.
- Zombie deletes (Cassandra). A tombstone must survive long enough to reach every replica before compaction removes it (
gc_grace_seconds, default 10 days). Compact too early and a deleted row can resurrect from a replica that missed the delete. Range-scanning over millions of live tombstones also causes read timeouts. - Bloom filters don't help range queries. They answer point membership only. A
WHERE ts BETWEEN …scan still touches every candidate SSTable — LSM range performance leans on the sparse index and level layout, not blooms. - B-tree fragmentation & bloat — two distinct mechanisms, often conflated. Page splits leave leaf pages half-full; that fragmentation is only undone by a full rewrite (
REINDEX,VACUUM FULL,pg_repack). Separately, MVCC leaves dead row versions behind after updates/deletes; that bloat is reclaimed by routineVACUUM. Skip either kind of maintenance and a table can silently balloon well past its live-data size. - Both need a WAL. Neither structure is crash-safe on its own — durability comes from the write-ahead/commit log, and a mistuned
fsyncpolicy trades durability for throughput in both.
When to use which
Reach for a B-tree engine (PostgreSQL, InnoDB) when reads dominate, when you need predictable low read latency and tight p99 SLOs, for heavy range scans / secondary indexes, and when you want mature transactional (MVCC + row-locking) semantics. The named alternative's cost you're avoiding: the LSM's compaction-driven tail-latency jitter and read amplification.
Reach for an LSM engine (RocksDB, Cassandra, ScyllaDB) when writes are heavy or bursty (time-series, event logs, metrics, message queues, high-ingest KV), when you're write-bound on a single node, or when storage cost/compression matters at scale. The B-tree cost you're avoiding: random-write IO and full-page write amplification that caps write throughput.
A useful rule of thumb: if your working set is read-heavy and latency-sensitive, favor the B-tree; if it is write-heavy and throughput/space-sensitive, favor the LSM — then tune the compaction strategy (leveled for read/space, size-tiered for write) to shave the amplification that hurts most. Modern systems increasingly offer both (MongoDB ships WiredTiger's B-tree by default but has offered an LSM variant; MySQL has MyRocks alongside InnoDB), so the choice is often per-table, not per-database.
Takeaways
- Same goal, opposite bet: B-trees mutate a sorted structure in place (read-optimized, random writes); LSM-trees append and merge later (write-optimized, sequential writes).
- You can't win all three amplifications (read/write/space) — the RUM conjecture. Pick the engine, then the compaction strategy, to sacrifice the amplification your workload tolerates best.
- Bloom filters rescue LSM point reads (skip absent keys, ~1% false positive, zero false negatives) but do nothing for range scans; B-tree read cost is just its fixed depth (for a given fanout).
- LSM's hidden tax is tail latency and write stalls from background compaction; B-tree's hidden tax is split fragmentation and MVCC bloat — two different problems needing two different fixes (REINDEX/VACUUM FULL vs. routine VACUUM), not one.
Recall question: Your service ingests 200k sensor writes/sec and mostly serves recent-window range queries, but your on-call is paging over p999 read spikes. Which engine did you likely choose, what is causing the spikes, and what two knobs would you turn first?
Answer: An LSM engine (right for the write rate). The p999 spikes are background compaction stealing IO from foreground reads, possibly compounded by L0 write stalls. First knobs: increase/bound compaction throughput and threads (and separate its IO), and switch/tune compaction strategy (leveled to cut read amplification) plus rate-limit compaction so it doesn't starve reads; verify bloom filter bits-per-key for point paths.
Sources: Designing Data-Intensive Applications, Kleppmann, Ch. 3 ("Storage and Retrieval" — B-trees, LSM-trees, SSTables, the read/write/space amplification framing); Database Internals, Alex Petrov, Parts I–II (B-tree mechanics, page splits, LSM compaction strategies); the RUM Conjecture (Athanassoulis et al., EDBT 2016); the original LSM-tree paper (O'Neil et al., 1996) and Google's Bigtable paper (SSTables, memtable); RocksDB and Apache Cassandra compaction documentation; PostgreSQL documentation on VACUUM, REINDEX, and MVCC. Re-authored/Deepened for this guide with concrete write-path traces and named-system trade-offs.
🤖 Don't fully get this? Learn it with Claude
Stuck on B-Tree vs LSM-Tree Storage Engines? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **B-Tree vs LSM-Tree Storage Engines** (Databases) and want to truly understand it. Explain B-Tree vs LSM-Tree Storage Engines from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **B-Tree vs LSM-Tree Storage Engines** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **B-Tree vs LSM-Tree Storage Engines** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **B-Tree vs LSM-Tree Storage Engines** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.