Home › Databases › Database Engine Internals

The Cost-Based Query Optimizer

A cost-based optimizer works because there are usually many physically different ways to execute the same SQL, and the engine can predict — using statistics about the stored data — roughly how much CPU and I/O each way will cost, then run only the cheapest one. SQL is declarative: you say what rows you want, never how to fetch them. The optimizer is the component that turns the what into a concrete how — a plan tree of scans, joins, sorts and aggregates — by estimating the size of each intermediate result and pricing every candidate plan against a cost model.

This matters because the difference between the best and worst legal plan for the same query is routinely 3–5 orders of magnitude. A three-table join can be executed as a nested loop that does 800 million random index probes, or as a hash join that does two sequential scans — same answer, but one takes 40 minutes and the other 4 seconds. Without a cost model the engine would have no principled way to choose, and every non-trivial query would be a coin flip. The optimizer is what makes SQL usable at scale without hand-tuning each query.

The pipeline: statistics → cardinality → cost → cheapest plan

Every cost-based optimizer runs the same four-stage machine. First it collects statistics about each table and column, stored in the catalog (PostgreSQL's pg_statistic, exposed via pg_stats). From those it computes cardinality estimates — how many rows each scan, filter and join will emit. It feeds those row counts into a cost model that converts them to an abstract cost (a proxy for page reads plus CPU work). Finally it searches the plan space — different join orders, join algorithms, and access methods — and keeps the plan with the lowest total cost.

Statistics: what the catalog actually stores

The optimizer never scans your data to plan a query — that would be as expensive as running it. Instead it reads a compact summary gathered by ANALYZE (triggered automatically by autovacuum in PostgreSQL). For each relevant column it keeps four things:

Row count (reltuples, plus relpages): how many rows and disk pages the table has. This scales every other estimate.
Distinct values (n_distinct): the count of distinct values in the column. For an equality predicate with no better information, selectivity ≈ 1 / n_distinct (a uniformity assumption).
Most-common values (the MCV list: most_common_vals + most_common_freqs): the top-K frequent values and their exact frequencies. This captures skew — e.g. that status='shipped' is 30% of rows while status='refunded' is 0.1%.
Histogram (histogram_bounds): an equi-depth histogram of the non-MCV values — a list of boundary values chosen so each bucket holds the same number of rows (not the same value-range width). This is what makes range predicates (amount > 500) estimable.

By default PostgreSQL keeps 100 MCVs and 100 histogram buckets per column (default_statistics_target). MCVs handle skew; the histogram handles ranges; n_distinct handles the uniform tail. Together they let the planner estimate almost any single-column predicate without touching the table.

Cardinality estimation: a traced example

Take a concrete query against orders (10,000,000 rows, ~100 rows/page ⇒ ~100,000 pages) and customers (100,000 rows):

SELECT o.id
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'DE'
  AND o.amount > 500;

The planner estimates the size of each operator bottom-up:

Filter c.country = 'DE'. The MCV list for country contains 'DE' with frequency 0.08. So estimated rows = 0.08 × 100,000 = 8,000. (Had 'DE' not been an MCV, it would fall back to (1 − sum(mcv_freqs)) / (n_distinct − num_mcv).)
Filter o.amount > 500. This is a range, so it uses the histogram. With default_statistics_target=100 there are 100 equi-depth bucket bounds; suppose 60 buckets lie at or below 500 and 40 above. Each bucket = 1% of rows, so selectivity = 0.40 ⇒ 0.40 × 10,000,000 = 4,000,000 rows.
Join on customer_id = id. The textbook equijoin estimate is
|R ⋈ S| ≈ |R| × |S| / max(n_distinct(R.key), n_distinct(S.key)).
With filters pushed down first (8,000 customers, 4,000,000 orders) and n_distinct(orders.customer_id) ≈ 100,000 (the unfiltered value — PostgreSQL doesn't recompute n_distinct for a filtered subset), the join emits ≈ 8,000 × 4,000,000 / 100,000 = 320,000 rows.

Note the assumption baked into step 3: this formula treats the join key's distinct-value count and the two filters as statistically independent — it never checks whether customers matching country='DE' place orders with the same customer_id distribution as the whole table, or whether the amount > 500 filter correlates with which customers order. Real optimizers make this independence assumption everywhere because tracking cross-column, cross-table correlation exactly is combinatorially expensive. It is usually close enough — but it is also the single biggest reason cardinality estimates drift from reality, as the pitfalls section below shows concretely.

These three numbers — 8,000, 4,000,000, 320,000 — are the entire basis for what happens next. If any is badly wrong, the plan built on top of it is wrong.

The cost model: turning rows into a number

Cardinalities become a single comparable number through a cost model whose constants encode the relative price of operations. PostgreSQL's defaults:

Constant	Default	Meaning
`seq_page_cost`	1.0	read one page sequentially
`random_page_cost`	4.0	read one page via random I/O (4× a seq read)
`cpu_tuple_cost`	0.01	process one row
`cpu_index_tuple_cost`	0.005	process one index entry

A sequential scan of orders costs roughly seq_page_cost × pages + cpu_tuple_cost × rows = 1.0 × 100,000 + 0.01 × 10,000,000 = ~200,000. The random_page_cost / seq_page_cost = 4 ratio is the single most consequential knob: it is what tells the planner that random index lookups are expensive relative to a big sequential read — the reason it will scan a whole table rather than do millions of scattered probes. On SSDs many shops lower random_page_cost to ~1.1 because random reads are no longer 4× slower; getting this ratio wrong is a classic source of bad plans on modern hardware.

Join order and join algorithm — where the search happens

Two independent decisions dominate the plan space:

Which join algorithm?

Nested loop join — for each outer row, probe the inner (ideally via an index). Cost ≈ |outer| × cost_per_inner_probe.
Hash join — build a hash table on the smaller input, then stream the larger past it. Cost ≈ scan both sides once, plus work_mem for the build side.
Sort-merge join — sort both inputs on the join key (or reuse existing order), then merge them in one linear pass. Cost ≈ two sorts (O(n log n)) plus a linear merge, unless the inputs are already ordered, in which case it's effectively free of sort cost.

For our example the planner prices the realistic options. Hash join: scan all 10M orders sequentially (~200,000) + scan/hash 8,000 DE customers (~small) ≈ ~205,000. Nested loop with an index on orders(customer_id): 8,000 DE customers × ~100 orders each = 800,000 rows fetched via random I/O (random_page_cost 4.0) ≈ ~3,300,000. Sort-merge: sorting 4,000,000 filtered orders from scratch (no usable index order) costs roughly n log n in I/O-equivalent units — for 4M rows this lands well above the hash join's linear scan, so it loses here too, unless the orders were already coming out of an index scan in customer_id order for some other reason. Hash join wins this query by roughly 16× over nested loop and clearly beats an from-scratch sort-merge.

Algorithm	When it wins	When it's the wrong choice
Nested loop	Outer side is tiny (few rows), or inner side has a highly selective index — e.g. a lookup by primary key for a handful of rows.	Outer side is large or its cardinality was underestimated — probe count explodes into millions of random I/Os, as in the stale-statistics example below.
Hash join	Large, unsorted, equi-join inputs where the smaller side's hash table fits in `work_mem` — the common case for fact/dimension joins.	Build side doesn't fit in `work_mem` (spills to disk, multiple batches) or the join isn't an equality predicate (hash join only supports equijoins).
Sort-merge	Both inputs are already sorted on the join key (e.g. delivered by an index scan or a prior merge) or a sort is needed downstream anyway (e.g. for `ORDER BY` or `GROUP BY` on the same key).	Neither input is pre-sorted and both are large — paying for two full sorts from scratch usually loses to a hash join.

Which join order?

Join order is the combinatorial hard part: n tables have up to (2n−2)! / (n−1)! plan shapes. PostgreSQL's planner (a System-R-style bottom-up dynamic-programming search) computes the cheapest way to join every subset of tables, reusing sub-results — O(2ⁿ) rather than factorial. Beyond geqo_threshold (default 12 tables) it switches to a genetic algorithm that samples the space heuristically because exhaustive search becomes too expensive to plan.

Why estimates go wrong — the root of nearly all bad plans

The optimizer's arithmetic is sound; its inputs are where reality breaks in. Two failure modes cause the vast majority of production incidents:

1. Stale statistics

Statistics are a snapshot. After a bulk load, a large delete, or steady growth, the catalog can describe a table that no longer exists. Suppose autovacuum hasn't run and the planner still thinks 'DE' matches 50 customers when it now matches 8,000. It estimates the nested-loop outer side at 50 rows, prices NL at ~20,000, and picks it. At runtime the loop actually drives 8,000 customers → 800,000 random probes → the ~3.3M-cost plan it thought it was avoiding. A 16× regression from a single stale number. The tell in EXPLAIN ANALYZE is a large gap between estimated and actual rows. Fix: run ANALYZE, and for volatile tables raise autovacuum aggressiveness.

2. Correlated columns (the independence assumption)

To combine predicates the planner multiplies per-column selectivities, assuming columns are statistically independent — the same assumption used for the join estimate above. Consider WHERE country = 'DE' AND city = 'Berlin'. It computes 0.08 × 0.001 = 0.00008 → ~8 rows. But city functionally determines country — every Berlin row is already a DE row — so the true count is ~the city count alone, maybe 2,000 rows. A 250× underestimate, which typically makes the planner pick a nested loop or forgo a hash join that would have won. PostgreSQL's remedy is extended statistics: CREATE STATISTICS s (dependencies, mcv) ON country, city FROM customers; then ANALYZE. This stores multi-column dependency and MCV data so the planner stops assuming independence. Correlation between join keys and filter predicates is the single hardest thing for any optimizer to estimate.

EXPLAIN — reading the planner's mind

EXPLAIN prints the chosen plan tree with estimated costs; EXPLAIN (ANALYZE, BUFFERS) runs the query and prints actual rows, actual time, and pages read, so you can compare estimate against reality. A typical line:

Hash Join  (cost=254.00..12894.00 rows=320000 width=8)
           (actual time=3.1..1204.7 rows=41822 loops=1)

Read it like this: cost=254.00..12894.00 is startup cost (before the first row — e.g. building the hash) then total cost. rows=320000 is the estimate; actual ... rows=41822 is the truth. Here the planner overestimated by ~8×.

The senior engineer's workflow with EXPLAIN ANALYZE: (1) scan the tree for the operator where estimated rows and actual rows diverge sharply — that node is where the bad decision was seeded. (2) Check loops= on the inner side of nested loops; a huge loop count is the fingerprint of a mis-estimated outer cardinality. (3) Watch for Rows Removed by Filter (a scan reading far more than it keeps) and for a hash join reporting Batches: > 1 (it spilled because work_mem was too small). Fix the estimate — usually via ANALYZE or extended statistics — before you reach for a hint.

Pitfalls a working engineer actually hits

Estimation error compounds up the tree. A 10× error at a leaf scan becomes 100×–1000× three joins up. The join at the top is only as good as the cardinalities feeding it — always debug bottom-up.
Functions and expressions blind the planner. WHERE lower(email) = ... or WHERE created_at::date = ... can't use column statistics or the plain index — the planner falls back to a default guess (~0.5% selectivity). Use an expression index or extended statistics.
Parameterized / prepared statements and one plan for many parameters. A generic plan cached for status = $1 may be great for the rare value and catastrophic for the 30%-frequent one — the 'parameter sniffing' problem (acute in SQL Server; PostgreSQL mitigates with its custom-vs-generic plan heuristic after 5 executions).
Out-of-range predicates. A filter like created_at > now() on a table whose histogram was last refreshed weeks ago falls entirely outside the newest bucket bound. The planner assumes a small, fixed selectivity beyond the histogram's edge, which for a fast-growing table can badly underestimate a query that actually matches millions of very recent rows — another argument for aggressive autovacuum on append-heavy tables.
Reaching for hints before fixing the estimate. Forcing a join order or algorithm (via pg_hint_plan or rewriting the query) treats a symptom. If the underlying cardinality estimate is wrong, the same class of query will misfire again the moment data shifts. Fix statistics first; hint only as a last resort for a known, stable skew the optimizer structurally can't model.

🤖 Don't fully get this? Learn it with Claude

Stuck on The Cost-Based Query Optimizer? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **The Cost-Based Query Optimizer** (Databases) and want to truly understand it. Explain The Cost-Based Query Optimizer from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **The Cost-Based Query Optimizer** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **The Cost-Based Query Optimizer** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **The Cost-Based Query Optimizer** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes

← Write-Ahead Logging & Durability B-Tree vs LSM-Tree Storage Engines →