Replication

Step 11 in the System Design path · 3 concepts · 0 problems

0 / 3 complete

📘 Learn Replication from zero

Replication means keeping multiple copies of the same data on different machines. In an interview it shows up the moment you say "what if this database dies?" or "reads are too slow" — and the interviewer wants to see that you understand the mechanism (how a write propagates to copies), not just the buzzword. Get the trade-offs right (consistency vs. availability, lag, write bottlenecks) and you sound senior; wave your hands and you sound like you memorized a diagram. This walkthrough makes you reason through each decision the way you'd have to defend it live.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🏗️ Apply it — design walkthrough

Work through this after you've learned the concepts in the lessons above.

What problem does it solve?

🤔 You have ONE database server holding all user data. Name three distinct things that break the moment that single box has a problem — and which of them does simply adding a second copy actually fix?

Reveal the reasoning

Cause→effect chain for a single box:

Durability/availability: the disk dies → 100% of data gone or unreachable → site is down. A second copy means the data still exists elsewhere → you survive the failure.
Read throughput: 50,000 read QPS all hit one box → CPU/IO saturate → p99 latency spikes from 10ms to 2s. Extra copies let you spread reads → e.g. 3 copies ≈ 3x read capacity.
Geographic latency: users in India hit a US box → ~200ms round-trip baseline. A copy in Mumbai cuts that to ~10ms.

Replication helps all three. But note the cost: it does NOT increase write capacity (every copy must absorb every write), and it introduces the hard problem of keeping copies in sync. So replication trades a simple consistency story for availability + read scale.

Sync vs async

🤔 A write arrives at the leader. The leader can reply 'OK' to the client EITHER after the follower confirms it got the data (synchronous) OR immediately, before the follower hears about it (asynchronous). What does each choice cost you when the leader crashes one millisecond later?

Reveal the reasoning

Synchronous: the leader waits for a follower ack → write latency = leader time + network RTT to follower (e.g. 2ms → 12ms). If the leader then crashes, the follower already has the write → zero data loss. The cost: if that follower is slow or down, the write blocks → one stalled replica can freeze all writes.
Asynchronous: the leader acks immediately (e.g. stays at ~2ms) → fast, and a dead follower doesn't block anyone. The cost: the leader can crash holding writes that never reached any follower → those acknowledged writes are permanently lost on failover.

Real systems compromise with semi-synchronous replication — keep ONE follower synchronous (a guaranteed durable copy) and the rest async (so you don't block on slow replicas). PostgreSQL and MySQL both offer this. The trade-off you're naming out loud: latency + write availability vs durability guarantee.

Single-leader mechanism

🤔 In the most common setup, exactly one node accepts writes (the leader) and the rest are read-only (followers). Walk the path of a single write from client to all replicas — what is the actual thing shipped over the wire, and why not just re-run the SQL on each follower?

Reveal the reasoning

Step-by-step mechanism:

1. Client sends the write → leader.
2. Leader applies it locally AND appends it to a replication log (e.g. MySQL binlog, Postgres WAL).
3. Leader streams that log to each follower.
4. Each follower replays the log entries in the same order → ends in the same state.

Why ship a log, not the raw SQL? Re-running statements is non-deterministic: NOW(), RANDOM(), or an UPDATE ... LIMIT without an ORDER BY can produce different results on each follower → replicas silently diverge. Shipping the actual row changes (row-based replication / physical WAL) is deterministic → every replica converges to the same state.

Cost: a single write point is both a write bottleneck and a single point of failure for writes; if the leader dies you must promote a follower (failover), which takes seconds and risks losing any not-yet-replicated async writes.

Replication lag

🤔 A user updates their profile photo, the write succeeds on the leader, then they immediately reload the page — and see their OLD photo. Nothing crashed. What happened, and why is this almost guaranteed at scale?

Reveal the reasoning

Cause→effect: with async replication the leader acks before followers catch up. The reload hit a follower that is N milliseconds behind → it hasn't replayed the photo update yet → stale read. This is replication lag, and it's not a bug — it's the price of async.

Lag is usually a few ms but spikes to seconds under load, large transactions, or network hiccups. The stale-read-after-your-own-write case is a read-your-own-writes violation. Fixes:

Read-your-writes: route a user's reads to the leader for ~1s after they write (or until their writes have replicated).
Monotonic reads: pin a user to one replica so they never see time go backwards (if replica A is fresher than B, bouncing between them can make saved data appear to vanish).

Trade-off: every consistency fix pushes load back toward the leader or pins users to specific replicas → you claw back consistency at the cost of the read-scaling and flexibility replication gave you in the first place.

Multi-leader / leaderless

🤔 Single-leader still funnels every write through one box. What if you let TWO data centers both accept writes (multi-leader), or let any node accept a write (leaderless, like Dynamo/Cassandra)? What new, nasty problem appears that single-leader never had?

Reveal the reasoning

The new problem is write conflicts. With one leader, all writes are ordered by that single node → no conflict possible. With multiple write-acceptors, two users can edit the same record concurrently on different nodes:

DC-A sets title = "X" at 10:00:00.
DC-B sets title = "Y" at 10:00:00.
They sync → which wins? There's no single global order.

Resolution strategies: Last-Write-Wins (simple, but silently discards one user's edit), version vectors to detect concurrency, or CRDTs that merge automatically (e.g. collaborative editors). Leaderless systems add quorum reads/writes: with N replicas, require W acks on write + R responses on read where W + R > N (e.g. N=3, W=2, R=2 → 4 > 3) → the read set always overlaps the latest write set → you see fresh data.

When to use: multi-leader for multi-region write availability or offline clients; leaderless for high write availability with tunable consistency. Cost: you trade the simple, automatically-correct ordering of single-leader for application-level conflict logic — far more complexity, and a real chance of silent data loss if you pick LWW carelessly.

Backup vs replication

🤔 You have 3 replicas, all in perfect sync. An engineer runs DELETE FROM users; with no WHERE clause. How many of your replicas still have the data — and what does that tell you about whether replication counts as a backup?

Reveal the reasoning

Zero. Replication's whole job is to faithfully copy every write → the destructive DELETE replicates to all 3 replicas in milliseconds → data gone everywhere, instantly. Same story for a bad migration, ransomware, or corrupted rows: replication spreads the damage, it doesn't contain it.

The key distinction:

Replication protects against hardware/node failure (a disk or box dies). It's online, live, and current.
Backup protects against logical/human errors (a bad delete, corruption, a bug). It's a point-in-time snapshot from the past you can restore from — yesterday's copy still has the users table.

So a replica is NOT a backup. Real setups need both: e.g. continuous replication for availability, plus periodic snapshots (say, daily) + a transaction-log archive to enable point-in-time recovery (restore to 10:59, one minute before the DELETE). Cost: backups consume extra storage and the restore is slow (minutes to hours), so they're your safety net, not your hot path.

Backup vs Disaster Recovery

🤔 You diligently take a backup every night to a disk in the SAME data center. A flood destroys that data center. You have backups — so why might you still be out of business, and what two numbers should you have agreed on beforehand?

Reveal the reasoning

The backups died with the data center → having backups isn't the same as being able to recover. Disaster Recovery (DR) is the broader plan: off-site / cross-region copies, standby infrastructure, and a tested runbook to actually bring service back. A backup is one ingredient; DR is the whole recipe.

The two numbers that define a DR plan:

RPO (Recovery Point Objective): how much data you can afford to lose, measured in time. Nightly backup → up to 24h of data lost on a crash; continuous log shipping → seconds. Lower RPO = more frequent/continuous copying = more cost and write overhead.
RTO (Recovery Time Objective): how long you can afford to be down before service is back. Restoring a cold backup onto fresh hardware → hours; a warm cross-region standby you can promote → seconds to minutes. Lower RTO = more standby infrastructure running idle = more cost.

Trade-off: both numbers trend toward zero only by spending money — continuous replication and hot standbys are expensive. You set RPO/RTO from the business cost of lost data and downtime, then buy exactly the DR posture that meets them. Stating these two numbers out loud is what separates 'we have backups' from 'we have a recovery plan.'

📐 Architecture diagrams (2)

🎯 Guided practice

Problem 1 (Easy). A photo-sharing app's single PostgreSQL instance is at 90% CPU, almost entirely from read queries (feed loads, profile views). Writes (new uploads) are a tiny fraction. How do you scale, and what's the one bug users might notice?

Identify the pattern: read-heavy, write-light. The signal "reads dominate" points to single-leader replication with read replicas, not sharding (sharding is for write/storage scaling).
Apply it: keep the existing instance as the leader for all writes. Add 2-3 read replicas. Route read queries to the replicas (via a read/write split in the driver, a proxy like PgBouncer/ProxySQL, or app-level routing).
Reason about the result: read CPU is now spread across nodes; the leader handles only writes plus streaming its replication log. Read capacity scales nearly linearly with replica count.
The bug — and which anomaly it is: right after a user uploads a photo, their feed read may hit a lagging replica that hasn't applied the write yet — the photo "disappears." This is a read-your-writes violation. Fix: for a short window after a write (or for data the user owns), route that user's reads to the leader, or track the write's log position and only read from a replica that has caught up to it.

Problem 2 (Medium). A global e-commerce site wants users in the US, EU, and Asia to all write reviews with low latency, tolerating regional outages. Single-leader means EU/Asia writes cross an ocean to the US leader. What replication topology fits, and what new problem does it create?

Spot why single-leader fails here: all writes funnel to one region, so two of three regions always pay cross-ocean write latency, and an outage of the leader's region halts all writes until failover. "Low-latency writes everywhere + survive a regional outage" rules it out.
Choose the topology — and know the difference: multi-leader (one leader per region, leaders replicate to each other asynchronously) gives a fast local write path per region. Leaderless (Dynamo-style quorum: write to W of N, read from R of N) is the alternative — it tolerates node failures via quorum rather than failover. Both accept writes in multiple places concurrently.
Surface the new problem — write conflicts: the same row can be written concurrently in two regions, producing divergent values. Resolution differs by topology: multi-leader uses last-write-wins (simple, silently drops data), version vectors, or app-level merge; leaderless adds quorum reads, read repair, and anti-entropy to converge, but still surfaces conflicting versions (siblings) the app must merge. CRDTs (e.g. for a shopping cart) are the canonical conflict-free merge.
Sanity-check the data type: reviews are append-mostly and rarely conflict, so multi-leader is acceptable. If the same design were proposed for inventory counts or account balances, reject it — those need a single authority per key, so keep a single leader (or a per-shard leader) for that data. Match the topology to the invariant, not to the geography alone.

✨ Added by the guide — work these before the full problem set.