Knowledge Guide
HomeDatabasesDatabase Engine Internals

B-Tree vs LSM-Tree Storage Engines

Two answers to one question: where does a row physically go when you write it?

Every OLTP storage engine must reconcile a hardware fact — disks (and SSDs) love large sequential writes and hate small random ones — with the logical need to keep keys in sorted order for fast lookups and range scans. A B-tree keeps the sorted structure permanently on disk and mutates it in place: to change a key you seek to the exact disk page that owns it and overwrite that page. An LSM-tree (Log-Structured Merge-tree) refuses to seek: it appends every write to an in-memory sorted buffer, flushes buffers as immutable sorted files, and lets a background process merge those files later. The B-tree pays at write time to make reads cheap and predictable; the LSM-tree pays at read time (and in background CPU/IO) to make writes cheap and sequential.

This one design fork determines the write throughput, read latency, disk footprint, and tail-latency behavior of nearly every database you will touch — so a senior engineer chooses the engine by matching its amplification profile to the workload, not by brand loyalty.

B-tree: navigate a fixed tree, overwrite one page

A B-tree stores data in fixed-size pages (PostgreSQL 8 KB, InnoDB 16 KB, SQLite 4 KB). Each page holds many sorted keys plus child-page pointers, giving a high branching factor — with a typical fanout of a few hundred children per page, even billions of keys sit in a tree only 3–4 levels deep (narrower fanout means more levels, wider fanout means fewer). A lookup is a top-down walk: read the root page, binary-search its keys to pick a child pointer, descend, repeat until a leaf. Because depth is fixed for a given fanout, every point read costs the same small, bounded number of page fetches — the killer feature for read-latency SLAs.

Now trace an update. Say we set user:8123.balance = 500 on an InnoDB table:

  1. Write-ahead log (WAL) first. The change is appended to the redo log and fsynced — this is what makes the update durable and crash-recoverable.
  2. Locate the leaf. Root → internal → leaf: 3 page reads (often cache hits in the buffer pool).
  3. Mutate in place. The 100-byte row lives inside a 16 KB leaf page. The engine edits the row inside that page and marks the whole page dirty.
  4. Flush the whole page. A background checkpoint eventually writes the entire 16 KB page back to disk — you cannot write 100 bytes; the disk block is the unit.

That is write amplification: a 100-byte logical change caused a 16 KB page write (~160×), plus the WAL record. If the insert overflows the leaf, the page splits into two half-full pages and a pointer is pushed up — extra writes and, over time, fragmentation (pages left ~50–70% full). This is not what Postgres's VACUUM fixes: VACUUM reclaims dead tuples left behind by MVCC updates/deletes (old row versions), it does not defragment or re-pack half-full btree leaf pages from splits. Undoing split fragmentation needs a full structural rewrite — REINDEX, VACUUM FULL, or pg_repack — and btree's own page-merge-on-delete logic only partially offsets it in between.

LSM-tree: append, flush immutable files, merge in the background

The LSM-tree turns random writes into sequential ones. The write path:

  1. WAL append. As with the B-tree, durability comes first: append the write to a commit log, fsync.
  2. Memtable insert. The key/value goes into an in-memory sorted structure — a skip list (RocksDB) or red-black tree — kept sorted by key. This is an in-RAM O(log n) insert; no disk seek.
  3. Flush → SSTable. When the memtable exceeds a threshold (e.g. 64 MB), it is frozen and written to disk in one big sequential pass as an immutable SSTable (Sorted String Table): keys in order, with a sparse index and a bloom filter in its footer. The old WAL segment can now be discarded.
  4. Compaction. A background thread merges multiple SSTables into fewer, larger, non-overlapping ones — dropping overwritten values and applying deletes. This is a k-way merge of already-sorted files, so it too is sequential IO.

Deletes are writes. There is no page to edit, so a delete appends a tombstone marker; the real space is reclaimed only when compaction runs. An update is likewise just a newer append that shadows the old value — so a single key may exist in several SSTables at once, with the newest copy winning.

Reads must reconcile that. A read checks the memtable, then SSTables newest→oldest, returning the first hit. To avoid touching files that lack the key, each SSTable carries a bloom filter — a probabilistic bit-set that answers "definitely absent" or "possibly present." At ~10 bits/key it gives roughly a 1% false-positive rate and zero false negatives, so a point lookup for a missing key usually skips every SSTable's disk read. The residual cost of consulting several files is read amplification.

The core trade-off: three amplifications you cannot all minimize

Every storage engine trades along three axes — the RUM conjecture (Read, Update, Memory: optimizing two worsens the third):

Compaction strategy is the LSM's own internal fork of the same trade-off. Size-tiered (Cassandra STCS) merges similarly-sized files — low write amplification, but high space and read amplification (many overlapping files). Leveled (RocksDB/LevelDB LCS) keeps non-overlapping files per level — low read and space amplification, but higher write amplification. Choosing the strategy is choosing which amplification hurts least for your workload.

B-treeLSM-tree
Write patternRandom, in-place, full-pageSequential appends + merges
Write throughputLower (seek + page rewrite)Higher (sequential)
Read latencyLow & predictable (fixed depth)Higher, variable (multi-file + compaction)
Range scansExcellent (linked leaves)Good, but merge across files
Steady-state spaceLarger (fragmentation)Smaller (compression, packing)
Used byPostgreSQL, MySQL/InnoDB, SQLite, Oracle, MongoDB WiredTigerRocksDB, LevelDB, Cassandra, ScyllaDB, HBase, Bigtable, ClickHouse

Pitfalls a working engineer actually hits

When to use which

Reach for a B-tree engine (PostgreSQL, InnoDB) when reads dominate, when you need predictable low read latency and tight p99 SLOs, for heavy range scans / secondary indexes, and when you want mature transactional (MVCC + row-locking) semantics. The named alternative's cost you're avoiding: the LSM's compaction-driven tail-latency jitter and read amplification.

Reach for an LSM engine (RocksDB, Cassandra, ScyllaDB) when writes are heavy or bursty (time-series, event logs, metrics, message queues, high-ingest KV), when you're write-bound on a single node, or when storage cost/compression matters at scale. The B-tree cost you're avoiding: random-write IO and full-page write amplification that caps write throughput.

A useful rule of thumb: if your working set is read-heavy and latency-sensitive, favor the B-tree; if it is write-heavy and throughput/space-sensitive, favor the LSM — then tune the compaction strategy (leveled for read/space, size-tiered for write) to shave the amplification that hurts most. Modern systems increasingly offer both (MongoDB ships WiredTiger's B-tree by default but has offered an LSM variant; MySQL has MyRocks alongside InnoDB), so the choice is often per-table, not per-database.

Takeaways

Recall question: Your service ingests 200k sensor writes/sec and mostly serves recent-window range queries, but your on-call is paging over p999 read spikes. Which engine did you likely choose, what is causing the spikes, and what two knobs would you turn first?

Answer: An LSM engine (right for the write rate). The p999 spikes are background compaction stealing IO from foreground reads, possibly compounded by L0 write stalls. First knobs: increase/bound compaction throughput and threads (and separate its IO), and switch/tune compaction strategy (leveled to cut read amplification) plus rate-limit compaction so it doesn't starve reads; verify bloom filter bits-per-key for point paths.


Sources: Designing Data-Intensive Applications, Kleppmann, Ch. 3 ("Storage and Retrieval" — B-trees, LSM-trees, SSTables, the read/write/space amplification framing); Database Internals, Alex Petrov, Parts I–II (B-tree mechanics, page splits, LSM compaction strategies); the RUM Conjecture (Athanassoulis et al., EDBT 2016); the original LSM-tree paper (O'Neil et al., 1996) and Google's Bigtable paper (SSTables, memtable); RocksDB and Apache Cassandra compaction documentation; PostgreSQL documentation on VACUUM, REINDEX, and MVCC. Re-authored/Deepened for this guide with concrete write-path traces and named-system trade-offs.

🤖 Don't fully get this? Learn it with Claude

Stuck on B-Tree vs LSM-Tree Storage Engines? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **B-Tree vs LSM-Tree Storage Engines** (Databases) and want to truly understand it. Explain B-Tree vs LSM-Tree Storage Engines from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **B-Tree vs LSM-Tree Storage Engines** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **B-Tree vs LSM-Tree Storage Engines** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **B-Tree vs LSM-Tree Storage Engines** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes