Heartbeat
Step 18 in the System Design path · 1 concepts · 0 problems
📘 Learn Heartbeat from zero
The problem from zero: In a distributed system machines fail constantly — power loss, kernel panic, a yanked network cable — and crucially they fail silently. A crashed server doesn't announce "I'm dying"; it just stops responding. So how does the rest of the system find out? Fundamentally you cannot distinguish "slow/unreachable" from "dead" by waiting for one reply — this is the impossibility at the heart of asynchronous failure detection, and every scheme below is a practical tradeoff around it.
Analogy: a night-shift guard must phone HQ every 5 minutes to say "all clear." HQ doesn't call the guard (that scales badly across many guards) — the guard proactively checks in. If HQ hears nothing for 15 minutes (three missed calls), it assumes trouble and sends backup. The periodic call is the heartbeat; the 15-minute window is the timeout; the 3-missed-call rule is the threshold that absorbs a single dropped call.
Worked example: A coordinator monitors 3 workers. Each sends a UDP packet {node_id, timestamp} every interval = 5s. The coordinator keeps a last_seen table and a background sweep marks a node DEAD when now - last_seen > timeout (15s). Worker B's last beat lands at t=10s, then B crashes. The sweep flags B once its entry is 15s stale, i.e. by t=25s, and reassigns B's tasks to A and C. Worst-case detection latency from the moment of crash is bounded by interval + timeout (B could crash just after a beat, so up to one interval passes before the missing-beat clock effectively starts), and is at least timeout.
Key insight: Heartbeat converts a silent failure into a detectable one by replacing "wait for a response" with "expect a periodic signal, and treat its sustained absence as failure." The hard part is never the mechanism — it's the tradeoff: short timeout = fast detection but false positives under load/jitter; long timeout = stable but slow. Everything else (who listens, how it scales) is topology around that core.
✨ Added by the guide to build intuition — not from the source course.
Lessons in this topic
🎯 Guided practice
- Easy — Choosing interval and timeout. A monitor must detect a dead node within roughly 30 seconds, but the network occasionally drops a single packet. What interval and timeout do you pick? Reasoning: Work from the detection bound: worst-case latency from crash ≈
interval + timeout(and at leasttimeout). To survive a dropped beat without a false positive, maketimeouta multiple ofintervalso several consecutive beats must be missed. Pickinterval = 5s,timeout = 20s(4 missed beats): one or two lost packets still leave beats arriving inside the window, yet a genuinely dead node is caught in ≤ 25s, meeting the ~30s goal. The lesson: never settimeout = interval— a single hiccup would falsely kill a healthy node — and don't quote detection as justtimeout; include the up-to-one-interval slack before the clock starts. - Medium — Removing the central-monitor hotspot at scale. You now have 10,000 nodes. One coordinator receiving a beat from every node each interval is an O(N) bottleneck and a single point of failure. Redesign it. Reasoning: (1) Name the cost: central monitor is O(N) inbound per interval and dies if the monitor dies. (2) Decentralize membership. Two canonical, distinct designs — don't conflate them: (a) Gossip-style heartbeating (van Renesse; the model behind Cassandra's membership): each node holds a table of
{node_id, heartbeat_counter, last_updated}, bumps its own counter each round, and exchanges the table with a few random peers; counters propagate epidemically in O(log N) rounds. A node locally declares a peer dead when that peer's counter has not advanced within the timeout. (b) SWIM (Serf/Consul): no global heartbeat counters — each node periodically probes a random peer withping; on noackit asks k other nodes toping-reqthe target indirectly (ruling out a single bad link), marks itSUSPECT, gossips that suspicion, and promotes toDEADif unrefuted. (3) Decide failure locally in both. (4) Tame false positives from jitter/GC with aphi accrualdetector (continuous suspicion from inter-arrival history) rather than a single hard timeout, and require quorum + fencing before any destructive failover so a partition can't drive a split brain. Note: Amazon's Dynamo deliberately uses purely local, request-driven failure detection (A notices B only when A's own request to B fails) and gossips only membership — so it's the wrong citation for gossip failure detection; cite Cassandra/SWIM instead. Core invariant carried through: still "periodic signal, sustained absence = failure" — only the who-listens-to-whom topology changed to kill the bottleneck.
✨ Added by the guide — work these before the full problem set.