Knowledge Guide
HomeSystem Design

Caching

Step 7 in the System Design path · 9 concepts · 0 problems

0 / 9 complete

📘 Learn Caching from zero

A cache is a small, fast store (usually in-memory like Redis) that sits in front of a slow source (database, disk, network) and serves a copy of frequently-requested data. In an interview, "add a cache" is the single most common lever for cutting latency and offloading a database — but the interviewer is really testing whether you understand the cost: every cache introduces a consistency problem. Reasoning out loud about hit ratio, invalidation, eviction, and stampede is what separates "I'd add Redis" from a real design.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🏗️ Apply it — design walkthrough

Work through this after you've learned the concepts in the lessons above.

Why cache at all?

🤔 A read hits a disk-backed DB at ~10 ms. The same query runs 10,000 times/sec. What does putting a copy in RAM actually buy you, in numbers?

Reveal the reasoning

RAM access is ~100 ns vs disk/DB ~10 ms — roughly a 100,000× speed gap. Chain: read goes to RAM instead of disk → per-read latency drops from 10 ms to ~0.5 ms (including the network hop to Redis) → at 10k req/s you also remove most of those queries from the DB → the DB now only sees the small fraction that miss.

Concretely, if 95% of reads hit the cache (95% hit ratio), the DB load drops 20× (from 10k to 500 q/s), and average latency becomes 0.95 × 0.5ms + 0.05 × 10ms ≈ 0.97 ms instead of 10 ms.

Cost it introduces: you now store data in two places, so you own a consistency/staleness problem forever, plus extra infra (the cache cluster) to run and pay for.

Hit ratio is everything

🤔 Why does a cache with a 90% hit ratio sometimes still leave your p99 latency at the full 10 ms — even though the average looks great?

Reveal the reasoning

Averages hide the tail. With 90% hits, 1 in 10 requests still misses and pays the full DB cost. The hits are fast (~0.5 ms), so the bulk of the distribution looks great — but the slowest 10% of requests are exactly the misses, so the p99 lands inside the miss path and sits near 10 ms.

Chain: lower hit ratio → more requests fall through to the slow path → the tail (p99/p999) is set by the miss path, not the hit path. Pushing hit ratio from 90% → 99% is what actually pulls the tail down, because now only 1 in 100 pays full price, so the 99th percentile finally falls on a fast hit instead of a miss.

Key metric: hit ratio = hits / (hits + misses). Cost: chasing a higher hit ratio usually means caching more data → more memory → more money, with diminishing returns once you're caching cold, rarely-reused keys.

Read & write strategies

🤔 With cache-aside (lazy loading), your app checks the cache, and on a miss reads the DB and writes the value back. What happens to the very first request for any key — and how is that different from read-through?

Reveal the reasoning

Cache-aside: the app is in control. On a miss: app reads DB → writes the result into the cache → returns it. So the first request for each key always misses and pays full DB cost (a 'cold cache'); subsequent reads are fast. Pros: only requested data is cached (memory-efficient), and a cache outage degrades to slow reads rather than broken reads. Con: the app repeats the miss-handling logic everywhere, and there's a window where stale data can be served.

Read-through: the cache layer itself fetches from the DB on a miss, so the app only ever talks to the cache. Cleaner app code, but you depend on the cache layer for correctness and availability.

Write strategies pair with these (trade-offs): write-through writes to cache + DB synchronously — consistent, but slower writes; write-back writes to cache and flushes to the DB later — fast writes, but risk of data loss if the cache dies before the flush.

Eviction policies

🤔 Your cache holds 10 GB but the working set is 50 GB. When it fills up, why does LRU usually beat FIFO for hit ratio — and when does LRU fail badly?

Reveal the reasoning

LRU (Least Recently Used) evicts whatever hasn't been touched longest. Chain: real traffic is skewed (a hot 10% of keys gets ~90% of reads) → recently-used keys are likely to be used again soon → keeping them maximizes future hits. FIFO evicts the oldest-inserted item regardless of usage, so it can throw out a red-hot key just because it was loaded early — tanking hit ratio.

Where LRU fails: a large sequential scan (e.g. a batch job reading 50 GB once) floods the cache with one-time keys → those recent-but-useless keys push out your hot set. This is cache pollution. LFU (Least Frequently Used) resists it by keeping high-count items, but it reacts slowly when popularity shifts (a formerly-hot key lingers).

Cost: LRU must track recency (a linked list / timestamps) → extra memory and per-access bookkeeping. No policy is free — you trade metadata overhead plus a workload assumption for hit ratio.

Invalidation & staleness

🤔 Phil Karlton quipped that cache invalidation is one of the two hard problems in CS. Concretely: a user updates their profile, you write the DB but forget the cache. What goes wrong, and what are your two main tools to bound it?

Reveal the reasoning

Chain: the DB now holds the new value, the cache still holds the old one → every read keeps hitting the cache → users see stale data until that key is evicted, which could be never. The cache and the source-of-truth have diverged with no automatic reconciliation.

Tool 1 — TTL (time-to-live): set e.g. TTL = 60s so the key auto-expires. This bounds staleness to at most 60s with zero explicit invalidation logic. Cost: a tuning knob — short TTL → fresher data but lower hit ratio (more misses); long TTL → higher hit ratio but staler data.

Tool 2 — explicit invalidation: on write, delete or update the cache key. Fresher, but you must reliably reach every place the key is cached, and a write-then-read interleaving can re-populate the cache with the old value (a stale-set race). Many systems combine both: explicit delete on write plus a TTL as a safety net.

Consistency models

🤔 Why do almost all real caches choose eventual consistency over strong consistency — and what's one concrete anomaly a user can observe because of that choice?

Reveal the reasoning

Strong consistency would require every write to synchronously update or invalidate the cache before returning — and across a multi-node cache, coordinate every replica first. Chain: that coordination adds round-trips and locking on the write path → writes get slower and the cache stops being 'fast' → you've defeated its purpose. So caches accept eventual consistency: the cache converges to the truth after a short delay (the TTL window or replication lag).

Observable anomaly: a read-your-own-writes violation — a user updates their bio, the write lands in the DB, but their next read hits a cache entry (or replica) that still has the old value, so for a few seconds they see their old bio and think the save failed.

Cost: you accept a bounded window of stale/divergent reads in exchange for low latency and high throughput. Fixing the specific anomaly (e.g. routing a user's reads so they reflect their own recent writes) adds complexity to the read path.

Cache stampede

🤔 A super-popular key with TTL=60s expires at exactly noon. 50,000 requests/sec are reading it. What happens in the millisecond after it expires, and why can it take down your DB?

Reveal the reasoning

This is a cache stampede (a.k.a. thundering herd / dogpile). Chain: the key expires → the next request misses → but so do all the concurrent requests in that instant, because none of them find the key → they all hit the DB simultaneously for the same value → the DB that normally sees ~500 q/s suddenly takes a 50,000-query spike (100×) → it saturates, latency explodes, and because the recompute is now slow the cache stays empty, so the herd keeps arriving.

Fixes (each with a cost): request coalescing / locking — only one request recomputes while the rest wait for its result (adds a lock plus wait latency); early / probabilistic recompute — refresh the key slightly before it expires so it never goes fully cold (occasional wasted recomputes); jittered TTLs — randomize expiry so many keys don't expire at the same instant (freshness becomes slightly less predictable).

Should you even cache?

🤔 You're told 'just add a cache to make it faster.' Name a case where a cache hurts — where you'd push back in the interview.

Reveal the reasoning

Caching pays off only when reads repeat and data is read-heavy and reasonably stable. When those don't hold, it actively hurts:

  • Low / zero reuse (unique keys, one-off analytics scans): hit ratio is near 0% → you pay the cache lookup + memory but still hit the DB every time → net slower and more expensive than no cache.
  • Write-heavy, rapidly-changing data: you invalidate keys faster than you read them → constant churn, near-zero hit ratio, plus you've added a consistency risk for no latency win.
  • Strong-consistency requirements (account balances, inventory at checkout): any staleness window is unacceptable → reading a stale value can cause double-spend or oversell, so the safe answer is don't cache (or cache only with synchronous write-through and accept the slower writes).

The interview move: before adding a cache, state the access pattern you're assuming — high reuse, read-heavy, staleness-tolerant. If you can't justify all three, say so. 'A cache is a bet that the same data gets read again before it changes; if that bet is wrong, it's pure overhead.'

📐 Architecture diagrams (8)
Image
Image
Image
Image
Write Through Cache
Write Through Cache
Write Around Cache
Write Around Cache
Write Back Cache
Write Back Cache
Cache Invalidation Methods
Cache Invalidation Methods
Image
Image
Cache Stampede
Cache Stampede

🎯 Guided practice

  1. (Easy) Compute the hit ratio and average latency. A cache serves 8,000 requests as hits and 2,000 as misses. A hit costs 2 ms; a miss costs 40 ms (lookup + origin fetch).

    Reasoning: First, hit ratio = hits / (hits + misses) = 8000 / 10000 = 0.8 (80%), so miss ratio = 0.2. Average latency is the weighted average: 0.8 × 2 ms + 0.2 × 40 ms = 1.6 + 8.0 = 9.6 ms. The takeaway: even at 80% hits, the 20% of misses dominate the average because the miss penalty is 20× the hit cost. The lesson FAANG interviewers probe: raising the hit ratio matters most when the miss penalty is large — that is exactly when a cache is worth building, and why you optimize for the hot keys.

  2. (Medium) Trace an LRU cache of capacity 2 and explain the eviction. Operations on a key→value LRU cache (capacity 2): put(1,A), put(2,B), get(1), put(3,C), get(2). What does the final get(2) return, and why?

    Reasoning — track recency, most-recent on the right:
    1) put(1,A) → order [1].
    2) put(2,B) → order [1, 2] (now full).
    3) get(1) returns A and marks 1 as most-recently-used → order [2, 1]. This is the step people forget: a read counts as a use and reorders recency.
    4) put(3,C) → cache is full, so evict the least-recently-used, which is now 2, not 1. Order becomes [1, 3].
    5) get(2) → key 2 was evicted, so this is a miss (returns null / not found).
    Why it matters: the answer hinges on get(1) rescuing key 1 from eviction. To do every operation in O(1), back the cache with a hash map (key → node) plus a doubly linked list ordering nodes by recency: the map gives O(1) lookup, and the list lets you move-to-front and pop-the-tail in O(1). Contrast with FIFO, which evicts by insertion order and ignores reads entirely: get(1) changes nothing, so on put(3,C) FIFO evicts 1 (the oldest insert), leaving [2, 3] — and the later get(2) would be a hit. Same trace, opposite outcome, purely because of how each policy treats a read.

✨ Added by the guide — work these before the full problem set.