Knowledge Guide
HomeSystem DesignMental Models & Systems Thinking

Designing for Failure — Blast Radius, Timeouts, Breakers & Bulkheads

Assume failure is the default state

The junior mental model is "it works, and sometimes fails." The senior model is inverted: at scale, something is always failing — a disk, a node, a dependency, a network link. So the design question is never "if this dependency is down" but "when it's slow or down, what happens to everything that depends on it?" Get that wrong and one slow service cascades into a full outage.

Left: a slow dependency cascades and takes down services B and C. Right: a circuit breaker, timeout, and bulkhead isolate the failure so B and C degrade gracefully
Left: a slow dependency cascades and takes down services B and C. Right: a circuit breaker, timeout, and bulkhead isolate the failure so B and C degrade gracefully

The cascade, and how to contain it

The classic failure: service A calls a slow dependency; A's request threads all block waiting; the thread pool exhausts; now A is down for everyone, and B (which calls A) blocks too — the failure propagates backwards. The defenses, each answering a failure question:

DefenseStops
Timeout (always set one)waiting forever on a slow dependency
Retry with backoff + jitter (+ idempotency!)transient blips — but jitter prevents a retry storm (thundering herd)
Circuit breakerhammering a dependency that's already down; fail fast, recover when healthy
Bulkhead (isolate resource pools)one slow dependency exhausting all threads — partition them so failure is contained
Graceful degradation / fallbacka total outage — serve stale cache or a reduced feature instead of an error
Load sheddingcollapse under overload — reject excess early to protect the core

The retry-storm trap

Naive retries make outages worse: a struggling service gets 3× the load from retries, plus every client retrying in lockstep (synchronized). Always: cap retries, exponential backoff, random jitter, and only retry idempotent operations.

This is also a design question for every dependency: "what's my timeout, my retry policy, and my fallback when this is down?" If you can't answer it, you have an outage waiting.

Takeaways


Re-authored for this guide; blast-radius diagram hand-authored as SVG. Follows Release It! (Nygard) and the Google SRE Workbook. See also: the microservices resilience patterns (circuit breaker, bulkhead, retry), Rate Limiting (load shedding), Replication (failover).

🤖 Don't fully get this? Learn it with Claude

Stuck on Designing for Failure — Blast Radius, Timeouts, Breakers & Bulkheads? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **Designing for Failure — Blast Radius, Timeouts, Breakers & Bulkheads** (System Design) and want to truly understand it. Explain Designing for Failure — Blast Radius, Timeouts, Breakers & Bulkheads from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **Designing for Failure — Blast Radius, Timeouts, Breakers & Bulkheads** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **Designing for Failure — Blast Radius, Timeouts, Breakers & Bulkheads** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **Designing for Failure — Blast Radius, Timeouts, Breakers & Bulkheads** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes