Bloom Filters

Step 16 in the System Design path · 5 concepts · 0 problems

0 / 5 complete

📘 Learn Bloom Filters from zero

A Bloom filter is a compact, probabilistic data structure that answers one question fast: "have I definitely NOT seen this item, or have I MAYBE seen it?" It can return false positives (says "maybe present" when absent) but never false negatives (if it says "absent," it truly is). In interviews it's the classic move when you need a cheap membership pre-check in front of an expensive lookup — skipping disk reads, network calls, or cache misses for items you can prove aren't there. Reach for it when you have huge cardinality, a tight memory budget, and can tolerate a small false-positive rate.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🏗️ Apply it — design walkthrough

Work through this after you've learned the concepts in the lessons above.

Why not just a HashSet?

🤔 You need to track whether each of 1 billion URLs has been crawled. A HashSet works — so why would anyone invent a different structure?

Reveal the reasoning

Chain: a HashSet stores the actual keys → 1B URLs averaging ~64 bytes each → ~64 GB of RAM just to remember what you've seen. That won't fit on one machine, so you shard or spill to disk → every membership check may cross the network or hit disk.

A Bloom filter stores no keys at all — only bits. For 1B items at a 1% false-positive rate it needs about 9.6 bits/item ≈ 1.2 GB. That's a ~50x memory reduction, small enough to live in RAM on a single node.

The cost: you give up the ability to retrieve or enumerate keys, and you accept that ~1% of "present" answers are wrong (false positives).

How does insert work?

🤔 You have a bit array of size m (all zeros) and k hash functions. What exactly happens when you insert the URL "a.com"?

Reveal the reasoning

Chain: run the item through all k independent hashes → each produces an index in [0, m) → set the bit at each of those k positions to 1. Example with k=3, m=16: "a.com" hashes to {2, 7, 13} → bits 2, 7, 13 become 1.

Insert "b.com" → hashes to {7, 9, 13} → bits 7 and 13 are already 1, so they stay 1; bit 9 flips to 1. Bits are shared across items and never reset on insert.

The cost: because bits are shared and never cleared, you can never delete from a standard Bloom filter — clearing a bit for one item could silently erase another item that relied on it. Insert is O(k), independent of how many items are already stored.

Why no false negatives?

🤔 The filter set bits {2,7,13} for "a.com" — if you later query "a.com" and one of those bits reads 0, what does that tell you, and why is that impossible?

Reveal the reasoning

Chain: query = hash the item with the same k functions → check those k bits → if all are 1, answer "maybe present"; if any is 0, answer "definitely absent."

Once a bit is set to 1 it is never cleared. So if "a.com" was inserted, bits 2, 7, 13 are permanently 1. A query for "a.com" can therefore never find a 0 in those positions → it can never wrongly say "absent." That's the no-false-negatives guarantee.

The asymmetry / cost: the reverse can happen. "c.com" might hash to {2, 9, 13} — all of which were set by other items — so the filter says "maybe present" for something never inserted. That's a false positive, the price you pay for the guarantee.

Tune m and k

🤔 You want a 1% false-positive rate for n = 10 million items. How big should the bit array m be, and how many hash functions k? More hashes always = fewer errors, right?

Reveal the reasoning

Chain: false-positive probability is roughly p ≈ (1 − e^(−kn/m))^k. The optimal sizing falls out of this:

Bits per item: m/n ≈ −1.44 · log₂(p) → for p=1%, that's ~9.6 bits/item → m ≈ 96 Mbit ≈ 12 MB for 10M items.
Optimal hashes: k ≈ 0.693 · (m/n) → ~7 hash functions.

Why "more hashes" is wrong: more k sets more bits per insert → for a fixed array the bits saturate with 1s faster → past the optimum, false positives go up, not down. (Concretely, with this m: k=7 gives ~1.0%, but k=12 gives ~1.8%.) There's a sweet spot.

The cost: p is fixed at design time by m and k. If your real n exceeds the planned capacity, the array fills and the false-positive rate degrades sharply — and you can't resize a standard filter in place.

Where it pays off

🤔 In an LSM-tree database (Cassandra, RocksDB), a read may have to check many on-disk SSTables, most of which don't contain the key. How does a Bloom filter turn that into a win — and when does it NOT help?

Reveal the reasoning

Chain: keep one small in-memory Bloom filter per SSTable → on a read, check the filter first → if it says "definitely absent," skip the disk seek entirely. Across several SSTables at a 1% false-positive rate, ~99% of the irrelevant disk reads vanish → a disk seek (~5–10 ms) is replaced by an in-RAM bit check (~100 ns), a ~50,000x gap.

Same pattern elsewhere: a CDN/cache checks "have we ever stored this key?" before a backend round-trip; a crawler checks "seen this URL?" before fetching; Chrome historically pre-screened malicious URLs locally before calling Google's service.

When it does NOT help: the filter only adds value when the answer is usually "no." If most queried items are present, the filter says "maybe" almost every time → you do the expensive lookup anyway plus the filter check → net slowdown. It's a negative-lookup optimizer.

Deletion problem

🤔 Your dedup filter needs to support removing items (a user unsubscribes, a URL is purged). You can't clear a bit safely — so how do Counting Bloom Filters fix this, and what do they cost?

Reveal the reasoning

Chain: replace each single bit with a small counter (typically 4 bits). Insert → increment the k counters. Delete → decrement them. A position is "set" if its counter is > 0. Now decrementing for one item doesn't wipe a position another item depends on (those counters are still positive) → safe deletion.

The costs:

~4x the memory — 4-bit counters instead of 1-bit slots.
Counter overflow re-introduces false negatives — a 4-bit counter saturates at 15; once it stops incrementing, later decrements pull it below the true count, so a future delete can drop a still-live position to 0 → a real item now reads "absent."
Deleting a never-inserted item is unsafe — it decrements counters shared with real items, which can later flip those items to false negatives. CBFs only preserve the no-false-negative guarantee if you delete strictly items you actually inserted.

Filter is full

🤔 You sized your filter for 10M items but traffic grew and you're now at 50M — false positives have spiked. You can't resize a Bloom filter in place. What's the standard fix, and what does it sacrifice?

Reveal the reasoning

Chain: use a Scalable Bloom Filter — a chain of sub-filters. Fill filter #1 to its target; when it's saturated, freeze it and start filter #2 with more bits and a tighter target FPR. A query checks every layer and returns "present" if any layer says so.

Why the geometric tightening: because a query consults all layers, their per-layer false-positive rates compound. Shrinking each new layer's FPR by a fixed ratio (r < 1) makes the total a convergent geometric series → the overall false-positive rate stays bounded as n grows without knowing n up front.

The costs: every query now checks all layers → query time grows with the number of layers (more hashing, worse cache locality); and the tightening means total memory is somewhat larger than one perfectly-sized filter would have been. (A Cuckoo filter is the modern alternative: similar space, but supports deletion and better lookup locality — at the cost of more complex inserts that can fail and force a rebuild.)

State it crisply

🤔 The interviewer says: "Summarize in two sentences why a Bloom filter fits here and what risk you're accepting." What do you say?

Reveal the reasoning

Model answer: "A Bloom filter gives me an O(k), constant-memory membership pre-check — ~1.2 GB for 1B items at 1% FPR vs ~64 GB for a real set — so I skip the expensive disk/network lookup for items I can prove aren't present."

"The risk I'm accepting is a tunable false-positive rate: ~1% of the time I'll do a wasted lookup that finds nothing, but I will never miss an item that's actually there, and I lose the ability to delete or enumerate keys."

The trade-off in one line: you trade exactness and deletability for a massive space saving and O(1)-ish speed. Name the FPR number, name the no-false-negative guarantee, and name what you gave up — that's the complete answer.

📐 Architecture diagrams (6)

🎯 Guided practice

Easy — read the bit array. A Bloom filter has m = 8 bits (indices 0–7) and k = 3 hash functions. After some inserts the array is [0,1,1,0,1,0,1,1] (bits 1,2,4,6,7 are on). A query for "apple" hashes to positions {1, 4, 7}; a query for "mango" hashes to {2, 4, 5}. What does the filter answer for each?
Reasoning: A query checks whether all k bits are on. For "apple": positions 1,4,7 → all on → answer "possibly present." For "mango": positions 2,4,5 → bit 5 is 0 → since even one bit is off, answer "definitely not present." Teaching point: a single zero is decisive (true negative guaranteed); all-ones is only suggestive (could be a false positive). Never invert this — there is no such thing as a false negative.
Medium — size the filter. You are building a Cassandra-style read path and want a Bloom filter holding n = 1,000,000 keys with a target false-positive rate p = 1%. How many bits m and hash functions k do you need, and what does that mean operationally?
Reasoning — apply the standard formulas. Optimal bits: m = -n·ln(p) / (ln2)². With ln(0.01) ≈ -4.6052 and (ln2)² ≈ 0.4805: m ≈ (1,000,000 × 4.6052) / 0.4805 ≈ 9.585M bits ≈ 1.2 MB. That is the well-known rule of thumb: ~9.6 bits per element for 1% FP. Optimal hashes: k = (m/n)·ln2 ≈ 9.585 × 0.693 ≈ 6.64, so round to k = 7 (the FP curve is flat near the optimum, so rounding to 6 or 7 barely changes p). Operational meaning: ~1.2 MB of RAM lets you skip a disk read for ~99% of keys that aren't in this SSTable; about 1 in 100 absent keys triggers a wasted read, and a true present key is never missed. Trade-off check: tighten p to 0.1% and m grows to ~14.4 bits/element (~1.8 MB) with k ≈ 10 — every 10× drop in FP rate costs a fixed additive ln(10)/(ln2)² ≈ 4.79 bits per element. Recognising this space-cost-is-linear-in-the-log-of-the-FP-rate pattern is the core insight interviewers probe.
Hard — choose the variant. Your crawler's "seen URLs" set must support removing stale URLs (re-crawl after expiry) and it grows past the n you sized for. A plain Bloom filter can do neither. What do you reach for, and what does each choice cost?
Reasoning: Deletion rules out a standard Bloom filter (clearing a shared bit corrupts other keys). Two canonical answers: (1) a Counting Bloom filter replaces each bit with a small counter (e.g. 4 bits) so delete decrements instead of clearing — correct, but ~3–4× the memory and counters can theoretically overflow. (2) A Cuckoo filter stores short fingerprints in a cuckoo hash table; it supports delete, gives bounded worst-case lookup, and at low target FP rates is often more space-efficient than a counting filter — the modern default when deletes are required. For the unbounded-growth half, a Scalable Bloom filter chains progressively larger sub-filters with geometrically tightening FP rates so the overall rate stays under a bound as n rises. The interview signal here is naming the right variant for the constraint (delete → Counting/Cuckoo; growth → Scalable) rather than insisting one structure does everything.

✨ Added by the guide — work these before the full problem set.