Redundancy
Step 12 in the System Design path · 1 concept · 0 problems
📘 Learn Redundancy from zero
Start from the problem. Every physical thing eventually fails — a server's disk dies, a power supply burns out, a network cable is unplugged. If your system depends on exactly one of something, then when that one thing dies, your whole system is down. That single critical component is called a single point of failure (SPOF). Redundancy is the deliberate duplication of critical components so that if one copy fails, another takes over and users barely notice.
Analogy. Think of a passenger jet with two engines. One engine is enough to fly. The second engine isn't there to go faster — it's there so that if one engine fails mid-flight, the plane stays in the air. You pay for fuel and maintenance on an engine you hope never to need. That is exactly the redundancy trade-off: extra cost in normal operation, bought as insurance against failure.
Worked example. Suppose a web app runs on a single server with 99% availability — that's about 3.65 days of downtime per year. Add a second identical server behind a load balancer. Assuming the two fail independently, the app is fully down only when both are down at once: 1 − (1 − 0.99)² = 99.99%, dropping downtime to under an hour per year. The load balancer detects a dead server via health checks and routes all traffic to the survivor. We also duplicate the database with a primary-secondary setup: writes go to the primary, which replicates to the secondary; if the primary dies, the secondary is promoted — note this is a failover with a detection-plus-promotion gap (seconds, and possibly lost un-replicated writes), not a seamless handover.
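The arithmetic in the worked example can be checked with a short sketch. The helper names `parallel_availability` and `downtime_hours_per_year` are made up for illustration, and the model makes the same assumption as the text: replicas are identical and fail independently. It covers only the web servers, not the load balancer or database.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def parallel_availability(availability: float, copies: int) -> float:
    """Availability of `copies` identical replicas that fail independently.

    The group is down only when every replica is down at the same time.
    """
    return 1 - (1 - availability) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1 - availability) * HOURS_PER_YEAR


single = 0.99
pair = parallel_availability(single, copies=2)

print(f"one server : {single:.4%}, {downtime_hours_per_year(single) / 24:.2f} days/year")
print(f"two servers: {pair:.4%}, {downtime_hours_per_year(pair) * 60:.1f} minutes/year")
```

Under these assumptions the single server is down about 3.65 days per year and the pair about 52.6 minutes per year, matching the figures above.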
Key insight: redundancy converts a guaranteed catastrophic failure (one thing dies → everything stops) into a tolerable, recoverable event (one copy dies → a redundant copy serves) — by paying for spare capacity you hope to never fully use. The leverage only holds while failures stay independent; correlated failures across a shared failure domain quietly erase it.
✨ Added by the guide to build intuition — not from the source course.
🎯 Guided practice
Problem 1 (Easy): Remove the SPOF in a two-tier app. You have one web server talking to one database. Make it survive any single server failure.
- Find the SPOFs. Two of them: the web server and the database. If either dies, the app is down. Each is a single point of failure.
- Duplicate the stateless tier. Run two (or more) identical web servers. They hold no unique data, so adding copies is cheap and easy — any instance can serve any request.
- Add a distributor. Put a load balancer in front so clients hit one address; it spreads requests across the servers and uses health checks to stop routing to a dead one — that's automatic failover for the stateless tier.
- Duplicate the stateful tier. Use database replication: a primary that takes writes and a secondary that copies the data and can be promoted if the primary fails. This is the hard tier — you now own a consistency/lag trade-off (sync replication is safe but slower; async is fast but can lose recent writes on failover).
- Check the new SPOF. A single load balancer is now itself a SPOF, so add a second LB in an active-passive pair that share a virtual/floating IP; a heartbeat between them moves the IP to the standby if the active LB dies. Lesson: redundancy is iterative — each fix can reveal the next SPOF.
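The stateless-tier steps above (duplicate the servers, put a load balancer in front, skip dead instances) can be sketched as a toy in-memory model. `Backend` and `RoundRobinBalancer` are hypothetical names, and the `healthy` flag stands in for what a real load balancer would learn from periodic health probes. This illustrates the routing idea only. It is not how any particular load balancer is implemented.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True  # assumption: a real system sets this from health probes


class RoundRobinBalancer:
    """Toy load balancer: rotates over backends, skipping unhealthy ones."""

    def __init__(self, backends: list[Backend]) -> None:
        self.backends = backends
        self._next = 0  # index where the next routing attempt starts

    def route(self) -> Backend:
        # Try each backend at most once, starting from the rotation pointer.
        for offset in range(len(self.backends)):
            index = (self._next + offset) % len(self.backends)
            backend = self.backends[index]
            if backend.healthy:
                self._next = (index + 1) % len(self.backends)
                return backend
        raise RuntimeError("no healthy backends: the whole tier is down")


web1, web2 = Backend("web-1"), Backend("web-2")
lb = RoundRobinBalancer([web1, web2])

before = [lb.route().name for _ in range(4)]
web1.healthy = False  # simulate web-1 failing its health check
after = [lb.route().name for _ in range(4)]

print(before)  # alternates between web-1 and web-2
print(after)   # every request goes to web-2
```

The balancer object is itself a single instance here, which mirrors the last bullet: fixing the web tier leaves the load balancer as the next SPOF to address.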
Problem 2 (Medium): Choose active-passive vs active-active, and reason about availability. Two data centers, each with full app stacks at ~99.9% availability. Pick a redundancy strategy and compute the result.
- Active-passive (failover): one data center serves all traffic; the second is a hot standby that takes over on failure. Simpler, but the standby's capacity sits idle and failover takes time (failure detection + DNS/promotion).
- Active-active: both data centers serve live traffic simultaneously (e.g. via GeoDNS / anycast), roughly doubling usable capacity and shrinking the recovery gap to detection time — but now you must handle data synchronization across sites and the risk of split-brain if they diverge during a partition. (It shrinks, not fully eliminates, the gap: in-flight requests to the failed site still error until health checks redirect them.)
- Compute availability. Treat the two sites as parallel (redundant) components: overall = 1 − (1 − 0.999)² = 1 − (0.001)² = 1 − 0.000001 = 99.9999% ("six nines"). Redundancy turned three nines into six.
- Add the realism caveat. That number is a theoretical ceiling: it assumes independent failures. A shared dependency (same DNS provider, same global config push, a correlated regional outage) acts as a hidden series component that drags the real number down. That is exactly why you spread across independent failure domains and avoid shared blast-radius dependencies.
- Decide. If cost-sensitive and a brief failover gap is acceptable → active-passive. If you need full capacity, the lowest latency globally, and the smallest recovery gap, and you can pay for cross-region data sync → active-active. State the trade-off out loud; that reasoning is what the interviewer is grading.
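To see how much a shared dependency can erode the six-nines figure, here is a minimal sketch that composes availabilities in parallel and in series. The function names and the 99.95% figure for a shared DNS provider are illustrative assumptions, not values from the course.

```python
def parallel(*availabilities: float) -> float:
    """Redundant components: down only if all are down (assumes independence)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1 - a
    return 1 - all_down


def series(*availabilities: float) -> float:
    """Chained dependencies: up only if every component is up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


MINUTES_PER_YEAR = 365 * 24 * 60

site = 0.999
shared_dns = 0.9995  # hypothetical availability of a dependency both sites share

ideal = parallel(site, site)           # two independent data centers
realistic = series(shared_dns, ideal)  # the same pair behind one shared dependency

for label, value in [("independent sites", ideal), ("with shared DNS", realistic)]:
    downtime = (1 - value) * MINUTES_PER_YEAR
    print(f"{label:18s} {value:.6%}  ~{downtime:.1f} min/year")
```

With these assumed numbers, the ideal pair is down about half a minute per year, while the pair behind the shared dependency is down more than four hours per year. The combined system can never be more available than its weakest series component.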
✨ Added by the guide — work these before the full problem set.