Knowledge Guide
HomeSystem Design

System Design Building Blocks

Step 2 in the System Design path · 23 concepts · 0 problems

0 / 23 complete

📘 Learn System Design Building Blocks from zero

System Design "building blocks" are the reusable primitives — caches, load balancers, replicas, partitions, indexes — you snap together to answer the open-ended "design X" interview question. The interviewer rarely wants the one right answer; they want to hear you reason about WHY a block belongs there, the exact mechanism by which it helps, and the cost it quietly adds (staleness, complexity, more failure modes). Each step below asks you a concrete question first — commit to an answer in your head before you reveal the reasoning. That habit of "decision → mechanism → trade-off" is exactly what separates a hire from a no-hire.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🏗️ Apply it — design walkthrough

Work through this after you've learned the concepts in the lessons above.

Estimate before you design

🤔 You're asked to design a Twitter-like feed for 200M daily users posting 2 tweets/day each. Before drawing a single box, what numbers must you derive, and why does this gate every later block?

Reveal the reasoning

Chain: 200M × 2 = 400M tweets/day → ÷ 86,400s ≈ ~4,600 writes/sec average, and peak is ~2–3x average → ~9,000–14,000 writes/sec (round to ~12K). Reads dominate: if each user refreshes a 100-tweet feed several times/day, you land around ~100K–1M reads/sec.

  • This read:write ratio of roughly 100:1 is the single most important number — it is what justifies caching and read replicas later.
  • Storage: 400M tweets × ~300 bytes ≈ 120 GB/day → ~44 TB/year of raw text → forces partitioning.

Cause → effect: the estimate dictates which blocks are mandatory — a read-heavy ratio mandates caches/replicas, the storage figure mandates sharding. Cost/trade-off: these are order-of-magnitude guesses, not precise figures — stating a wrong assumption confidently (e.g. forgetting the peak multiplier) makes you under-provision by ~3x. Always say your assumptions out loud so the interviewer can correct them cheaply.

Why a load balancer?

🤔 You have 12,000 writes/sec hitting one server that maxes out at 2,000 req/sec. You add 10 servers. What exact problem does a load balancer solve, and what new single point of failure did you just create?

Reveal the reasoning

Chain: clients can't know which of 10 servers to hit → an LB sits behind one virtual IP / DNS name → it spreads 12,000 req/sec across 10 boxes = 1,200 each, comfortably under the 2,000 limit. It also runs health checks (e.g. every few seconds) and stops routing to a dead node within seconds, so one crash doesn't drop traffic.

  • Mechanism: L4 LBs route on IP/port (fast, payload-opaque); L7 LBs parse the HTTP request and can route /images vs /api differently (smarter, slightly more CPU per request).
  • When: the moment you have >1 server, or want zero-downtime rolling deploys.

Cost/trade-off: the LB itself is now a single point of failure — if it dies, all 10 healthy servers become unreachable. Fix: run LBs in an active-passive pair with a floating/virtual IP that fails over to the standby, which adds cost and one more component to monitor. You've traded many per-server failure modes for one smaller, centralized one that you then make redundant.

Which LB algorithm?

🤔 Round-robin sends request 1→server A, 2→B, 3→C, evenly. So why do real systems often NOT use plain round-robin, and when would "least connections" or a hashing algorithm be strictly better?

Reveal the reasoning

Chain: round-robin assumes every request costs the same. But if server A draws three 10-second report queries while B/C get 10ms lookups, round-robin keeps feeding A its even share → A's queue grows → its p99 latency spikes while B/C sit idle.

  • Least-connections: routes to whoever has the fewest in-flight requests → naturally drains work away from the slow/overloaded A. Best for uneven request durations.
  • Weighted round-robin: a 16-core box gets weight 4, a 4-core box weight 1 → handles heterogeneous hardware.
  • IP/URL hash: the same client always lands on the same server → enables warm local cache hits / sticky sessions.

Cost/trade-off: least-connections needs the LB to track live connection counts (extra state and bookkeeping). Hash-based stickiness deliberately breaks even distribution: one whale client (a celebrity, a hot key) can overload its assigned server, and naive mod N hashing reshuffles nearly every client when a server is added or removed — which is the exact pain consistent hashing later solves by remapping only ~1/N of keys.

Why cache, and how does it cut p99?

🤔 Your DB read takes 20ms; a Redis lookup takes 1ms. With a 100:1 read:write ratio and a 90% cache hit rate, walk through the exact latency and load math. Then: what breaks when the cached data is wrong?

Reveal the reasoning

Chain: cache-aside — the app checks Redis first; on a miss it reads the DB, then writes the value back to Redis with a TTL. At a 90% hit rate: 90% of reads cost 1ms, 10% cost 1ms + 20ms ≈ 21ms → weighted average ≈ 3ms vs 20ms before, and crucially the DB now sees only 10% of read traffic (e.g. 100K → 10K reads/sec). That drop is what lets the DB survive.

  • p99 effect: the slow tail (DB queueing under load) is now hit by far fewer requests, so once the DB is no longer saturated the p99 collapses.

Cost/trade-off — staleness: a cached value can be out of date until its TTL expires. You pick the dial: short TTL (fresher, more DB load) or write-through / invalidate-on-write (consistent, more write-path complexity). Eviction: RAM is finite, so LRU drops cold keys — only worth it if the hot subset fits in memory. Stampede: when a hot key expires, thousands of concurrent misses hammer the DB at once — mitigate with a per-key rebuild lock (only one request repopulates) or jittered TTLs. Caching trades correctness-by-default for speed.

Partition when one box isn't enough

🤔 Your ~44 TB/year of tweets won't fit on one machine, and writes exceed one node's capacity. You decide to shard. What's the difference between sharding by hash(user_id) vs by date range — and what's the failure mode of each?

Reveal the reasoning

Chain: partitioning splits data across N nodes so each holds ~44 TB / N and absorbs ~12,000 / N writes/sec. The shard key decides everything downstream.

  • Hash(user_id): spreads load evenly → no single hot node. But a query like "all tweets between Jan–Mar" has no single home and must scatter-gather across all N shards, and with naive mod N hashing, adding a node remaps nearly every key (use consistent hashing to limit churn).
  • Range by date: time-range queries hit one shard → fast and cheap. But today's shard takes ~100% of writes while older shards sit idle → a write hotspot that moves over time.

Cost/trade-off: the deeper pain is cross-shard operations — a JOIN or a transaction spanning two shards needs distributed coordination (two-phase commit, slow and failure-prone) or app-level stitching. Celebrity hotspot: one user with 100M followers can overwhelm their shard regardless of key choice. Partitioning buys horizontal scale at the price of cheap joins and global transactions — so you pick a key that matches your dominant query pattern.

Index the read path

🤔 A query "find tweets by @user" scans a 10M-row table at maybe 1M rows/sec — that's ~10 seconds. You add an index and it returns in ~1ms. What did the index actually build, and why isn't it free to just index every column?

Reveal the reasoning

Chain: an index is a separate sorted structure (typically a B-tree) mapping user_id → row locations. Instead of a full scan (O(n), ~10M row reads), the DB descends the tree (O(log n); log₂(10M) ≈ ~23 levels of comparison, and far fewer actual node reads thanks to high B-tree fanout) → the 10s scan becomes a handful of seeks ≈ ~1ms.

  • Mechanism: it trades a linear scan for a logarithmic lookup by keeping the indexed column pre-sorted.
  • When: columns you frequently filter / sort / join on AND that are selective (many distinct values).

Cost/trade-off: every index is extra storage and must be updated on every write — each insert/update/delete also rewrites every affected index's B-tree. So 5 indexes can make writes meaningfully slower (often ~2–3x) and bloat disk. Indexing a low-cardinality column (e.g. a boolean) barely helps because it can't narrow the candidate set. Indexes optimize reads by taxing writes — the wrong default for a write-heavy table.

Replicate for HA and read scale

🤔 Your single DB serves 10K reads/sec fine, but if that box dies you lose everything, and you can't scale reads further. You add 2 replicas. How does this fix both problems at once — and why might a user post a tweet and then NOT see it?

Reveal the reasoning

Chain: leader-follower replication — all writes go to the leader, which streams its change log to 2 followers. Now: (1) reads spread across leader + 2 followers → roughly 3x read capacity; (2) if the leader dies, a follower is promoted (failover) → no committed data lost, only a brief unavailability window.

  • Mechanism: followers replay the leader's log, trailing it by anywhere from a few ms to seconds under load.

Cost/trade-off — replication lag: with async replication the leader acks the write immediately, so a user who posts (write → leader) and instantly reads from a lagging follower sees nothing yet → a read-your-writes violation. Fixes: pin that user's reads to the leader for a short window after a write, or use sync replication (the leader waits for a follower to confirm — safer, but every write now pays the slower of the two and write latency rises). Replication trades a window of inconsistency for availability and read scale — so you decide per read path how fresh the data must be.

📐 Architecture diagrams (8)
Vertical scaling vs. Horizontal scaling
Vertical scaling vs. Horizontal scaling
Image
Image
Image
Image
Image
Image
Round Robin
Round Robin
Least Connections
Least Connections
Weighted Round Robin
Weighted Round Robin
Weighted Least Connections
Weighted Least Connections

🎯 Guided practice

  1. Easy — Pick the right block. "A news site shows the same 50 trending articles to 2M readers/hour; the DB is melting under read load. What do you add?" Step 1: classify the workload — extreme read-heavy, tiny working set (50 items), repeated reads of identical data. Step 2: map the symptom to a limit — DB CPU/IO saturated by redundant identical reads. Step 3: the matching block is a cache in front of the DB using cache-aside: on a request, check Redis; on miss, read DB and populate the cache. Step 4: name the tradeoff — cached articles can go stale, so set a short TTL (e.g. 60s) or invalidate on edit; guard the expiry moment against a thundering herd (request coalescing / lock-on-miss). Pattern learned: read-heavy + small hot set ⇒ cache; always pair a cache with an invalidation/expiry story.
  2. Medium — Scale storage and survive failure. "Design the data layer for a chat app: 500M messages/day, must stay available if a server dies, and inbox queries must be fast." Step 1 (estimate): 500M / 86,400 ≈ 5,800 writes/sec average (size for a peak multiple, say 3–5×); one node can't hold or serve this → scale out. Step 2 (partition): shard by user_id (or conversation_id) so one user's messages co-locate; use consistent hashing so adding capacity moves only ~1/N of keys, not the whole keyspace. Step 3 (availability): add replication — one leader per shard takes writes and replicates to followers; a heartbeat detects leader death and triggers failover so there's no single point of failure. Step 4 (fast queries): add a secondary index on (user_id, timestamp) for O(log n) range scans of a conversation. Step 5 (tradeoff): with async replication, a read served by a follower can lag the leader — you get eventual consistency and possible stale reads (no read-your-writes unless you route to the leader). This is not "AP" by itself: CAP only describes behavior during a partition, and a single-leader design typically sacrifices availability of the affected shard while a new leader is elected. Per PACELC, even with no partition (the Else branch) async replication trades consistency for lower latency. If correctness matters more, use synchronous quorum reads/writes with W + R > N. Pattern learned: scale via partition (+consistent hashing), survive via replication (+heartbeat/failover), query fast via indexes — and explicitly state the consistency/latency tradeoff each choice forces.

✨ Added by the guide — work these before the full problem set.