Knowledge Guide
HomeSystem Design

Heartbeat

Step 18 in the System Design path · 1 concepts · 0 problems

0 / 1 complete

📘 Learn Heartbeat from zero

The problem from zero: In a distributed system machines fail constantly — power loss, kernel panic, a yanked network cable — and crucially they fail silently. A crashed server doesn't announce "I'm dying"; it just stops responding. So how does the rest of the system find out? Fundamentally you cannot distinguish "slow/unreachable" from "dead" by waiting for one reply — this is the impossibility at the heart of asynchronous failure detection, and every scheme below is a practical tradeoff around it.

Analogy: a night-shift guard must phone HQ every 5 minutes to say "all clear." HQ doesn't call the guard (that scales badly across many guards) — the guard proactively checks in. If HQ hears nothing for 15 minutes (three missed calls), it assumes trouble and sends backup. The periodic call is the heartbeat; the 15-minute window is the timeout; the 3-missed-call rule is the threshold that absorbs a single dropped call.

Worked example: A coordinator monitors 3 workers. Each sends a UDP packet {node_id, timestamp} every interval = 5s. The coordinator keeps a last_seen table and a background sweep marks a node DEAD when now - last_seen > timeout (15s). Worker B's last beat lands at t=10s, then B crashes. The sweep flags B once its entry is 15s stale, i.e. by t=25s, and reassigns B's tasks to A and C. Worst-case detection latency from the moment of crash is bounded by interval + timeout (B could crash just after a beat, so up to one interval passes before the missing-beat clock effectively starts), and is at least timeout.

Key insight: Heartbeat converts a silent failure into a detectable one by replacing "wait for a response" with "expect a periodic signal, and treat its sustained absence as failure." The hard part is never the mechanism — it's the tradeoff: short timeout = fast detection but false positives under load/jitter; long timeout = stable but slow. Everything else (who listens, how it scales) is topology around that core.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🎯 Guided practice

  1. Easy — Choosing interval and timeout. A monitor must detect a dead node within roughly 30 seconds, but the network occasionally drops a single packet. What interval and timeout do you pick? Reasoning: Work from the detection bound: worst-case latency from crash ≈ interval + timeout (and at least timeout). To survive a dropped beat without a false positive, make timeout a multiple of interval so several consecutive beats must be missed. Pick interval = 5s, timeout = 20s (4 missed beats): one or two lost packets still leave beats arriving inside the window, yet a genuinely dead node is caught in ≤ 25s, meeting the ~30s goal. The lesson: never set timeout = interval — a single hiccup would falsely kill a healthy node — and don't quote detection as just timeout; include the up-to-one-interval slack before the clock starts.
  2. Medium — Removing the central-monitor hotspot at scale. You now have 10,000 nodes. One coordinator receiving a beat from every node each interval is an O(N) bottleneck and a single point of failure. Redesign it. Reasoning: (1) Name the cost: central monitor is O(N) inbound per interval and dies if the monitor dies. (2) Decentralize membership. Two canonical, distinct designs — don't conflate them: (a) Gossip-style heartbeating (van Renesse; the model behind Cassandra's membership): each node holds a table of {node_id, heartbeat_counter, last_updated}, bumps its own counter each round, and exchanges the table with a few random peers; counters propagate epidemically in O(log N) rounds. A node locally declares a peer dead when that peer's counter has not advanced within the timeout. (b) SWIM (Serf/Consul): no global heartbeat counters — each node periodically probes a random peer with ping; on no ack it asks k other nodes to ping-req the target indirectly (ruling out a single bad link), marks it SUSPECT, gossips that suspicion, and promotes to DEAD if unrefuted. (3) Decide failure locally in both. (4) Tame false positives from jitter/GC with a phi accrual detector (continuous suspicion from inter-arrival history) rather than a single hard timeout, and require quorum + fencing before any destructive failover so a partition can't drive a split brain. Note: Amazon's Dynamo deliberately uses purely local, request-driven failure detection (A notices B only when A's own request to B fails) and gossips only membership — so it's the wrong citation for gossip failure detection; cite Cassandra/SWIM instead. Core invariant carried through: still "periodic signal, sustained absence = failure" — only the who-listens-to-whom topology changed to kill the bottleneck.

✨ Added by the guide — work these before the full problem set.