Burn-Rate Alerting

Burn-rate alerting fires when a service consumes its error budget, measured over a rolling time window, faster than a set multiple of the constant speed that would drain the entire budget exactly at the end of the SLO period. The alert threshold is therefore derived arithmetically from the SLO itself rather than guessed by a human.

Why static-threshold alerts are noisy

A rule like “page if error rate > 1% for 5 minutes” has no memory of the budget and no notion of how much error the service is allowed. That produces two failure modes at once:

False pages. A six-minute deploy blip at 1.5% errors trips the pager even though it spends only about 0.2% of a 30-day, 99.9% budget and endangers nothing. Engineers learn to ignore the pager, and that alert fatigue is what actually causes missed incidents.
Missed slow burns. A steady 0.4% error leak sits below the 1% line forever, yet at four times the sustainable rate it quietly drains a 99.9%, 30-day budget in about 7.5 days. The static rule never fires.

You cannot tune your way out: sensitive thresholds flap, insensitive ones miss real degradation. The static rule conflates a brief spike (harmless) with a sustained leak (dangerous) because it looks only at the instantaneous rate, never at cumulative budget spend.

Error budget and burn rate

An SLO of 99.9% success over 30 days permits a 0.1% error budget. To make that concrete, assume a service running at a constant 1,000 req/s. This is a simplification: real traffic is bursty, so production burn-rate math tracks raw good-event and bad-event counts over each window rather than an assumed-constant QPS, which keeps the ratio correct through traffic swings. Under that assumption:

30 days = 43,200 minutes = ~2.59 billion requests at that constant rate.
Budget = 0.1% of those = ~2.59 million allowed error requests (equivalently, ~43.2 minutes of total outage).
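The arithmetic above can be checked in a few lines of Python. This is only a sketch under the same constant-1,000-req/s simplification; the variable names are illustrative and do not come from any monitoring library.

```python
# Error-budget arithmetic for a 99.9% SLO over 30 days.
# ASSUMED_QPS is the constant-traffic simplification from the text, not a measurement.
SLO = 0.999
PERIOD_DAYS = 30
ASSUMED_QPS = 1_000

period_seconds = PERIOD_DAYS * 24 * 60 * 60          # seconds in the SLO period
period_minutes = period_seconds // 60                # minutes in the SLO period
total_requests = ASSUMED_QPS * period_seconds        # requests served in the period
budget_fraction = 1 - SLO                            # share of requests allowed to fail

allowed_errors = total_requests * budget_fraction        # error requests the SLO tolerates
full_outage_minutes = period_minutes * budget_fraction   # same budget as minutes of 100% errors

print(f"minutes in period  : {period_minutes:,}")
print(f"requests in period : {total_requests:,}")
print(f"allowed errors     : {allowed_errors:,.0f}")
print(f"full-outage minutes: {full_outage_minutes:.1f}")
```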

The burn rate is a dimensionless multiplier, and because it is a ratio of rates it stays valid even when the constant-QPS assumption doesn’t hold:

burn_rate = observed_error_rate / (1 - SLO)
          = observed_error_rate / 0.001   (for a 99.9% SLO)

Read it as “how many times faster than sustainable are we spending?”

burn rate = 1 → error rate is exactly 0.1%; you hit zero budget precisely at day 30. Sustainable.
burn rate = 2 → 0.2% errors; budget gone in 15 days.
burn rate = 14.4 → 1.44% errors; budget gone in ~50 hours.

The general relation: time_to_exhaust = SLO_period / burn_rate. This single number replaces every hand-tuned threshold, and it auto-scales with traffic (it is a ratio, not an absolute count).
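The two relations above translate directly into code. The sketch below assumes a 30-day (720-hour) period and a full budget at the start; the function names are hypothetical helpers chosen for illustration.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / (1.0 - slo)


def hours_to_exhaust(rate: float, period_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone if this burn rate holds constant."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget never runs out
    return period_hours / rate


# Reproduce the three example rows for a 99.9% SLO.
for err in (0.001, 0.002, 0.0144):
    br = burn_rate(err, 0.999)
    print(f"error rate {err:.2%}: burn rate {br:.1f}, "
          f"budget gone in {hours_to_exhaust(br):.0f} h")
```

Because both inputs are ratios, nothing in these helpers depends on request volume, which is the property that makes the threshold scale with traffic.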

Multi-window, multi-burn-rate (MWMBR) alerts

A single window forces a bad trade-off. A long window (say 1h) gives statistical significance, so it won’t fire on a five-error blip, but it is slow to reset: after you fix the incident, the average stays elevated until the bad minutes age out of the window (up to the full hour), so the pager keeps screaming at an already-resolved problem. A short window (5m) resets fast but flaps on noise.

The MWMBR pattern (Google SRE Workbook, ch. 5) fires only when both windows are over threshold:

alert = (long_window_burn ≥ threshold) AND (short_window_burn ≥ threshold)

The long window supplies significance and precision; the short window (typically 1/12 the length) supplies recency and fast reset — it confirms the burn is still happening now. The instant the incident clears, the short window drops and the AND clause goes false within minutes, even while the long window is still mathematically elevated from the earlier spend.
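The AND condition can be sketched over raw event counts, which is how the ratio stays correct when traffic varies. `WindowCounts` and `should_alert` are hypothetical names, and the counts in the usage lines are made-up figures for a service at roughly 1,000 req/s.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Bad and total event counts seen in one rolling window (illustrative container)."""
    bad: int
    total: int

    def error_rate(self) -> float:
        # An empty window is treated as "no burn" here; a real setup would
        # raise a separate no-data alert instead of silently returning 0.
        return self.bad / self.total if self.total else 0.0


def window_burn(counts: WindowCounts, slo: float) -> float:
    """Burn rate observed in a single window."""
    return counts.error_rate() / (1.0 - slo)


def should_alert(long_w: WindowCounts, short_w: WindowCounts,
                 threshold: float, slo: float = 0.999) -> bool:
    """Fire only when BOTH windows are at or above the burn-rate threshold."""
    return (window_burn(long_w, slo) >= threshold
            and window_burn(short_w, slo) >= threshold)


long_1h = WindowCounts(bad=60_000, total=3_600_000)       # ~1.67% over the last hour
short_during = WindowCounts(bad=9_000, total=300_000)     # 3% over the last 5 minutes
short_after = WindowCounts(bad=60, total=300_000)         # back to 0.02% after the fix

print(should_alert(long_1h, short_during, threshold=14.4))  # burn still in progress
print(should_alert(long_1h, short_after, threshold=14.4))   # long window hot, short window calm
```

The second call shows the reset behaviour: the long window alone would still be over threshold, but the short window vetoes the alert.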

Then you stack tiers, mapping burn severity to response urgency. The canonical config for a 99.9% SLO:

Severity	Long / short window	Burn rate	Error-rate threshold	Budget spent to fire	Route
Fast burn	1h / 5m	14.4	1.44%	2% in 1h	Page
Medium burn	6h / 30m	6	0.6%	5% in 6h	Page
Slow burn	3d / 6h	1	0.1%	10% in 3d	Ticket

The math ties out: budget-fraction consumed = burn_rate × (window / SLO_period). For 1h: 14.4 × (1 / 720) = 2%. For 3 days: 1 × (72 / 720) = 10%. Each threshold is derived, not chosen.
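Inverting that relation gives each tier's burn-rate threshold from the budget fraction you are willing to spend before alerting. A minimal sketch, assuming the 30-day period and the three tiers from the table; `burn_threshold` is an illustrative helper name.

```python
PERIOD_HOURS = 30 * 24  # 720 hours in the SLO period
SLO = 0.999

# (tier name, long window in hours, budget fraction spent when the alert fires)
TIERS = [("fast", 1, 0.02), ("medium", 6, 0.05), ("slow", 72, 0.10)]


def burn_threshold(budget_fraction: float, window_hours: float,
                   period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


for name, window_h, fraction in TIERS:
    br = burn_threshold(fraction, window_h)
    error_rate = br * (1 - SLO)  # convert the burn rate back to an error-rate threshold
    print(f"{name:<6} window={window_h:>2}h  burn>={br:.1f}  error_rate>={error_rate:.2%}")
```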

Worked example: a 3% error burn on a 99.9% SLO

Service at 1,000 req/s, baseline ~0.02% errors. At 10:00 a bad deploy pushes the sustained error rate to 3%, and it stays at 3% continuously until the rollback lands at 11:00 (a full 60-minute incident, so the numbers below are traceable end to end). Burn rate = 3.0 / 0.1 = 30 → at this rate the whole 30-day budget would be gone in 720/30 = 24 hours if it kept going. Trace the fast-burn tier (1h / 5m, threshold 1.44%):

Time	5m window (short)	1h window (long)	AND → PAGE?
10:00	rising	~0.02%	no
10:05	3.0% ≥ 1.44% ✓	3%×(5/60)=0.25% ✗	no (long not yet significant)
10:29	3.0% ✓	3%×(29/60)=1.45% ≥ 1.44% ✓	PAGE (budget spent ~2%)
11:00	still elevated, 3.0% ✓ (rollback lands right at 11:00)	window is 10:00–11:00, fully elevated → 3.0% ✓	page holds
11:05	rollback took effect → 0.02% ✗	window is 10:05–11:05, 55 of 60 min still elevated → 3%×(55/60)=2.75% ✓ (still high)	RESET — short dropped, so AND goes false even though the long window is still hot
12:00	0.02%	window is 11:00–12:00, entirely post-rollback → ~0.02% ✗	fully clear

This is the fast-reset mechanism working exactly as designed: the long window stays mathematically elevated for a while after the fix (it's still averaging in the bad minutes), but because the short window snaps back to baseline immediately, the AND condition — and therefore the page — clears within 5 minutes of the rollback, not after a full hour.
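The trace above can be reproduced with a small minute-by-minute simulation. It is a sketch under simplifying assumptions: constant traffic (so a window's error rate is the plain mean of its per-minute rates), a step change at 10:00, and an instant recovery at 11:00. Unlike the table, it keeps the 0.02% baseline in the averages, so intermediate values differ slightly.

```python
BASELINE = 0.0002            # 0.02% errors outside the incident
INCIDENT = 0.03              # 3% errors during the incident
THRESHOLD = 0.0144           # fast-burn tier: burn rate 14.4 on a 99.9% SLO
START, END = 10 * 60, 11 * 60  # incident covers [10:00, 11:00), minutes since midnight


def minute_rate(minute: int) -> float:
    """Error rate during the one-minute slot starting at `minute`."""
    return INCIDENT if START <= minute < END else BASELINE


def window_rate(now: int, length: int) -> float:
    """Mean error rate over the `length` minutes ending at `now`."""
    return sum(minute_rate(m) for m in range(now - length, now)) / length


def pages(now: int) -> bool:
    """Fast-burn tier: 1h AND 5m windows both at or above the threshold."""
    return window_rate(now, 60) >= THRESHOLD and window_rate(now, 5) >= THRESHOLD


def hhmm(minute: int) -> str:
    return f"{minute // 60:02d}:{minute % 60:02d}"


firing = [m for m in range(9 * 60, 13 * 60) if pages(m)]
print("first firing minute:", hhmm(firing[0]))
print("last firing minute :", hhmm(firing[-1]))
print("1h window at 11:05 :", f"{window_rate(11 * 60 + 5, 60):.2%}")
```

In this model the alert stops firing a few minutes after 11:00, while the 1h window printed on the last line is still well above 1.44%.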

Detection time follows a clean formula:

t_detect ≈ (threshold_error_rate × long_window) / actual_error_rate
        = (1.44% × 60 min) / 3%  ≈ 28.8 min

The elegant consequence: detection time is inversely proportional to burn rate. A total outage (100% errors, BR 1000) crosses the 1h average in 1.44% × 60 / 100% ≈ 0.9 min — it pages in under a minute. A mild leak takes longer to page but its slow-burn ticket tier catches it within days. Urgency of response now matches severity of burn, automatically.
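The detection-time formula is easy to wrap in a helper. This sketch assumes a near-zero baseline and a step change to a constant error rate, and it returns infinity when the rate is below the tier's threshold, since that tier would never fire. `detect_minutes` is an illustrative name.

```python
def detect_minutes(threshold_error_rate: float, long_window_min: float,
                   actual_error_rate: float) -> float:
    """Minutes until the long-window average reaches the threshold."""
    if actual_error_rate < threshold_error_rate:
        return float("inf")  # the long window can never average above the threshold
    return threshold_error_rate * long_window_min / actual_error_rate


# Fast-burn tier (1.44% threshold, 60-minute long window) at three severities.
for rate in (0.03, 0.10, 1.00):
    print(f"error rate {rate:.0%}: detected after ~{detect_minutes(0.0144, 60, rate):.1f} min")
```

For a step change the long window is the binding condition: the short window averages over fewer minutes, so it crosses the same threshold first.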

Pitfalls a working engineer hits

Low-traffic services wreck the ratio. If a 5-minute window sees only 20 requests, a single error is a 5% rate (BR 50) and pages spuriously. Fix: require a minimum event count in the window, widen the windows, or aggregate related services. This is the single most common MWMBR false-page source, and it's a real limit on where burn-rate alerting applies at all — below some traffic floor the ratio simply isn't statistically stable, no matter how you tune windows.
Short window alone flaps; long window alone lingers. Skipping the AND reintroduces exactly the noise/lag you were escaping. Always join both.
Rolling vs. calendar budget. A rolling 30-day window continuously forgives old errors; a calendar-month reset forgives them all at once at month start. Alert semantics and “budget remaining” dashboards must agree on which one you use, or the numbers lie.
Wrong SLI. Measuring HTTP 200s at the load balancer while users hit client-side timeouts means your budget looks healthy during an outage. The whole scheme inherits the quality of its SLI — the SLI must reflect user-perceived success (latency + errors), not a proxy; a sloppy SLI makes burn-rate math precisely wrong instead of vaguely right.
Overlapping tiers double-notify. Without inhibition rules, a severe sustained burn ends up firing the fast, medium, and slow alerts together. Suppress lower tiers when a higher one is active.
Zero-denominator. No traffic → division by zero. Guard the ratio and treat “no data” as its own (separate) alert.
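Three of the pitfalls above (low traffic, zero denominator, overlapping tiers) can be handled with small guards. The sketch below is illustrative only: `MIN_EVENTS` is a hypothetical traffic floor that would need tuning per service, and `inhibit` stands in for whatever suppression mechanism your alerting system provides.

```python
from typing import Optional, Set

MIN_EVENTS = 1_000  # hypothetical minimum sample size per window; tune per service


def guarded_burn(bad: int, total: int, slo: float = 0.999) -> Optional[float]:
    """Return the burn rate, or None when the window has too few events to trust.

    The floor check also covers total == 0, so the division is always safe.
    A None result should feed a separate "no data / low traffic" signal.
    """
    if total < MIN_EVENTS:
        return None
    return (bad / total) / (1.0 - slo)


SEVERITY_ORDER = ["fast", "medium", "slow"]  # most severe first


def inhibit(firing: Set[str]) -> Optional[str]:
    """Keep only the most severe firing tier; lower tiers are suppressed."""
    for tier in SEVERITY_ORDER:
        if tier in firing:
            return tier
    return None


print(guarded_burn(bad=1, total=20))         # too few events: no verdict
print(guarded_burn(bad=300, total=10_000))   # enough events: burn rate near 30
print(inhibit({"slow", "medium", "fast"}))   # only the fast tier notifies
```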

Trade-offs and when to use it

vs. static-threshold alerting. Static rules are simpler, have zero window lag, and are the right tool for binary symptoms with no budget semantics — host down, disk 95% full, certificate expiring in 7 days, a queue that must never exceed N. Reach for burn-rate alerting instead whenever the signal is a rate against a user-facing SLO: it is self-tuning to traffic, distinguishes spike from sustained leak, and cuts page volume dramatically. The cost is complexity — you maintain several recording rules and multiple windows, and on-call must understand what “2% budget in an hour” means.

vs. single-window burn-rate (SRE Workbook approaches #4–#5). A lone long window has good precision but slow reset and poor detection-time control; a lone short window has good recall but flaps. MWMBR (approach #6) is the recommended sweet spot: it balances precision, recall, detection time, and reset time simultaneously, at the price of more moving parts. Use single-window only for a scrappy first SLO where the operational overhead of tiers isn’t yet justified.

Limitations of burn-rate alerting itself, independent of any comparison:

SLI quality. It is only as trustworthy as its SLI: a sloppy or proxy SLI makes the whole scheme confidently wrong rather than vaguely right.
Traffic floor. It needs enough traffic to keep the good/bad ratio statistically stable, so very low-QPS services (internal tools, cron-triggered jobs, new features pre-launch) can't use it without aggregation or a minimum-sample guard.
One SLO is not enough. A single SLO rarely tells the whole story, so production setups need several correlated SLOs (availability, latency, correctness) reconciled against each other, not one number in isolation.
Config burden. Running multiple tiers across multiple SLOs multiplies the alert-dedup and inhibition-rule burden. Someone has to own that config or you get exactly the double-paging pitfall above.

When NOT to use burn-rate at all: for infrastructure liveness (process up/down), capacity limits, and security events — there is no error budget to burn, so a threshold or absence check is clearer and faster. Also avoid it, or use it only with heavy sample-size guards, for services too low-traffic to produce a stable ratio.

Takeaways

The threshold is derived from the SLO: burn_rate = error_rate / (1 − SLO). Stop guessing percentages.
Two windows joined by AND: long = significance/precision, short = recency/fast reset.
Multiple tiers map burn severity to response urgency — fast burn pages, slow burn tickets.
Detection time ∝ 1/burn_rate, so a total outage pages in under a minute and slow leaks still get caught within days.
The technique is only as good as its SLI and needs enough traffic for the ratio to be stable — it is not a universal replacement for every alert.

Recall question

Why do multi-burn-rate alerts join a long window and a short window with AND, instead of using a single window? (Answer: the long window supplies statistical significance so the alert won’t fire on a tiny blip — controlling precision — while the short window confirms the burn is still happening right now, so the alert resets within minutes of resolution even if the long window's average is still mathematically elevated from the earlier incident.)

Sources: Google SRE Workbook, ch. 5 “Alerting on SLOs” (multiwindow multi-burn-rate config); Google SRE Book, ch. 4 “Service Level Objectives” and ch. 6 “Monitoring Distributed Systems”; Rob Ewaschuk, “My Philosophy on Alerting”; Michael Nygard, Release It! (stability & feedback). Re-authored/Deepened for this guide.
