
SLIs, SLOs, SLAs & Error Budgets

Reliability is not a feeling — it is a measured ratio that a monitoring system computes from real request events, compared every window against a number you committed to, so that a machine (not an argument in a meeting) can decide whether you are allowed to ship the next release. That single mechanism — turn reliability into a countable quantity, then let the arithmetic govern behavior — is what SLIs, SLOs, and error budgets exist to provide.

The three terms, from the inside out

They form a chain: you measure an SLI, you target it with an SLO, and you sometimes promise a weaker version of it in an SLA.

Why the SLA is weaker than the SLO — on purpose

If your SLA and SLO were both 99.9%, then the instant you broke your internal target you would simultaneously owe customers money and have zero warning margin. Setting SLA < SLO builds a buffer: the SLO breach is an internal signal that fires while there is still contractual headroom left. Google's rule of thumb is to make the internal SLO stricter than the external SLA by a meaningful margin for exactly this reason.

The error budget: reliability as a spendable quantity

Here is the pivotal move. If the SLO is 99.9%, then 0.1% of events are allowed to fail. That leftover is not a defect to be driven to zero — it is a budget. Over a fixed window it is a finite, countable amount of failure you are permitted to spend, and spending it is not shameful: it is the resource that pays for shipping features.

This converts an unwinnable philosophical fight — "devs want to ship, SREs want stability" — into a shared account balance. When budget remains, you ship aggressively: risky launches, config changes, experiments are all funded by the budget. When the budget is exhausted, an agreed policy kicks in automatically: freeze feature releases, and spend all engineering effort on reliability until the rolling window refills the budget. Nobody has to win an argument; the number decides.
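The account-balance framing can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the class name `ErrorBudget`, its fields, and the `freeze_fraction` parameter are hypothetical, and the default of freezing only at zero remaining budget is an assumption standing in for whatever policy a team actually agrees on.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical model of an error budget over one SLO window."""
    slo: float            # target ratio, e.g. 0.999
    valid_events: int     # valid requests in the window
    bad_events: int = 0   # failures observed so far

    @property
    def allowed(self) -> int:
        # Failures the SLO permits in this window: (1 - SLO) x valid events.
        return round((1 - self.slo) * self.valid_events)

    @property
    def remaining(self) -> int:
        return self.allowed - self.bad_events

    def can_ship(self, freeze_fraction: float = 0.0) -> bool:
        # Assumed policy: ship while more than `freeze_fraction` of the
        # budget is left; otherwise the release freeze applies.
        return self.remaining > freeze_fraction * self.allowed


budget = ErrorBudget(slo=0.999, valid_events=1_000_000)
budget.bad_events = 400
print(budget.allowed, budget.remaining, budget.can_ship())  # 1000 600 True
```

The point of the design is that `can_ship` takes no opinion as input: the decision is a comparison between two counted quantities.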

Burn rate — how fast you are spending

The instantaneous measure is the burn rate: how fast you are consuming budget relative to the rate that would exhaust it exactly at the end of the window. Burn rate 1× means you will use the whole budget precisely on schedule (i.e. you are running right at your SLO). Burn rate 10× means you will be broke in a tenth of the window.
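As a rough sketch of that definition, burn rate is the observed error ratio divided by the error ratio the SLO allows, and a full budget lasts window / burn rate. The function names below are illustrative, not from any particular monitoring library.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the ratio the SLO permits."""
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """How long a full budget lasts at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# With a 99.9% SLO, 0.1% errors is 1x and 1% errors is 10x
# (values are approximate because of floating point).
print(burn_rate(0.001, 0.999))
print(burn_rate(0.01, 0.999))
print(days_to_exhaustion(10, 30))  # 3.0 days, a tenth of a 30-day window
```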

Google's SRE Workbook builds multi-window, multi-burn-rate alerts on this: a fast-burn alert (e.g. 14.4× sustained over both a 1-hour window and a paired 5-minute window) pages a human immediately, a second paging tier catches a slower 6× burn over 6 hours, and a slow-burn alert (e.g. 1× over 3 days) files a ticket. The later section on raw threshold alerting demonstrates why this beats a raw error-count or single-window threshold, with numbers you can check by hand.
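The paired-window condition can be sketched as follows. It assumes the error ratios for each window have already been computed by the monitoring system; the function names are hypothetical, and the 14.4× threshold is the fast-burn figure quoted above.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    return error_ratio / (1 - slo)


def paired_window_fires(long_ratio: float, short_ratio: float,
                        slo: float, threshold: float) -> bool:
    """True only if BOTH windows burn at or above the threshold."""
    return (burn_rate(long_ratio, slo) >= threshold
            and burn_rate(short_ratio, slo) >= threshold)


def budget_fraction_consumed(threshold: float, long_window_hours: float,
                             slo_window_days: float) -> float:
    """Share of the whole budget spent if the threshold burn rate is
    sustained for the full long window."""
    return threshold * long_window_hours / (slo_window_days * 24)


SLO = 0.999
# Ongoing incident: 2% errors over the last hour, 5% over the last 5 minutes.
print(paired_window_fires(0.02, 0.05, SLO, 14.4))    # True
# Incident already over: the hour still looks bad, the last 5 minutes do not.
print(paired_window_fires(0.02, 0.0005, SLO, 14.4))  # False
# 14.4x for one hour spends about 2% of a 30-day budget.
print(budget_fraction_consumed(14.4, 1, 30))
```

The long window supplies significance (enough budget has really been spent), while the short window makes the alert stop firing soon after the problem stops.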

The nines: turning a percentage into minutes

An SLO percentage is abstract until you convert it to allowed downtime per window. The arithmetic is just (1 − SLO) × window duration. A 30-day month is 30 × 24 × 60 = 43,200 minutes; a 365-day year is 525,600 minutes.

So for 99.9% ("three nines"), the budget is 0.1% of the month: 0.001 × 43,200 = 43.2 minutes/month. Each extra nine divides the allowed downtime by 10.

SLO (availability)       | Unavailable fraction | Downtime / month (30d) | Downtime / year
99% ("two nines")        | 1%                   | 7.2 hours              | 3.65 days
99.9% ("three nines")    | 0.1%                 | 43.2 minutes           | 8.76 hours
99.95%                   | 0.05%                | 21.6 minutes           | 4.38 hours
99.99% ("four nines")    | 0.01%                | 4.32 minutes           | 52.6 minutes
99.999% ("five nines")   | 0.001%               | 25.9 seconds           | 5.26 minutes
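The rows above follow from one line of arithmetic, sketched here so other SLO values or window lengths can be checked the same way. The function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """(1 - SLO) x window duration, expressed in minutes."""
    return (100 - slo_percent) / 100 * window_days * 24 * 60


for slo in (99, 99.9, 99.95, 99.99, 99.999):
    month = allowed_downtime_minutes(slo, 30)
    year = allowed_downtime_minutes(slo, 365)
    print(f"{slo}%: {month:.2f} min/month, {year:.2f} min/year")
```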

The table is a design tool, not trivia. "Five nines" allows ~26 seconds of downtime a month — less than a single pod reschedule or a JVM stop-the-world GC pause on a bad day. It forces the honest question: can our deploy process, our dependencies, and our incident response even fit inside that budget? Usually the answer is that each extra nine multiplies engineering cost, so you buy the fewest nines your users actually notice.

A traced example: one month of budget, computed exactly

Payments API, SLO = 99.9% availability over a rolling 28-day window. Traffic is a steady 500 requests/second, so valid events in 28 days = 500 × 86,400 × 28 = 1,209,600,000 requests. The error budget is 0.1% of that = 1,209,600 allowed failures. At exactly 1× burn rate — the rate that exhausts the whole budget precisely at day 28 — that is 1,209,600 / 28 = 43,200 failures/day, the number every other burn rate below is a multiple of.

  1. Day 0. Budget full: 1,209,600 failures available (100%). Release train runs freely; two feature deploys ship.
  2. Day 6, 02:00. A bad deploy returns 5xx for 12 minutes at 500 rps → 500 × 720 s = 360,000 failures burned in one incident. Remaining = 1,209,600 − 360,000 = 849,600 (70.2%). Fast-burn alert (14.4× sustained over both the 1h and 5m windows) had already paged at minute ~2.
  3. Day 6–18 (12 days). Steady-state error rate settles at 0.02% of traffic — exactly 0.2× nominal burn, since 0.02% / 0.1% = 0.2, i.e. 0.2 × 43,200 = 8,640 failures/day. Over 12 days that burns 8,640 × 12 = 103,680 failures. Remaining = 849,600 − 103,680 = 745,920 (61.7%) — a number you can reproduce from just the rate and the duration, not eyeballed off a chart.
  4. Day 18. A dependency degrades; latency SLI (p99 < 300 ms) starts failing, pushing the total bad-event rate to 0.15% — exactly 1.5× nominal burn (0.15% / 0.1% = 1.5), i.e. 1.5 × 43,200 = 64,800 failures/day. The slow-burn alert (sustained multi-hour elevated burn) files a ticket, not a page.
  5. Day 18–26 (8 days) at 1.5×. Burns 64,800 × 8 = 518,400 failures. Remaining = 745,920 − 518,400 = 227,520 (18.8%).
  6. Day 26. Remaining budget (18.8%) crosses the team's pre-agreed 20%-remaining freeze threshold. Release freeze triggers automatically. No new features ship; the team's whole focus is the dependency and the latency regression until the rolling window ages out the day-6 incident and/or the dependency is fixed and the budget climbs back above 20%.

Notice the budget did its job twice: it funded risk early (deploy freely while flush) and halted risk late (freeze once thin) — and every number along the way is the stated rate multiplied by the stated duration, nothing asserted.
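The ledger in the steps above can be replayed mechanically. This sketch uses only the figures stated in the example (500 rps, 28 days, 99.9%, the three burn periods, the 20% freeze threshold); the variable names are illustrative, and it treats the window as a simple running balance rather than modelling events ageing out of a rolling window.

```python
RPS = 500
WINDOW_DAYS = 28
ALLOWED_PER_MILLION = 1_000   # 99.9% SLO: 0.1% of events may fail

valid = RPS * 86_400 * WINDOW_DAYS
budget = valid * ALLOWED_PER_MILLION // 1_000_000
per_day_at_1x = budget // WINDOW_DAYS

# (label, failures burned) for each period in the walkthrough
events = [
    ("day 6: bad deploy, 12 min of 5xx", RPS * 12 * 60),
    ("days 6-18 at 0.2x", round(0.2 * per_day_at_1x) * 12),
    ("days 18-26 at 1.5x", round(1.5 * per_day_at_1x) * 8),
]

remaining = budget
ledger = []
for label, burned in events:
    remaining -= burned
    pct_left = round(100 * remaining / budget, 1)
    ledger.append((label, burned, remaining, pct_left))
    print(f"{label}: burned {burned:,}, remaining {remaining:,} ({pct_left}%)")

freeze = remaining / budget < 0.20   # the team's 20%-remaining threshold
print("release freeze:", freeze)
```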

Choosing good SLIs

A bad SLI makes the whole edifice lie. The discipline: measure what a user experiences at the point of interaction, as a ratio, and pick the threshold where users start to care.

Availability

Define "good" precisely. good = HTTP responses that are not 5xx is the common request-based SLI, but decide the denominator carefully: exclude requests the client aborted, and be explicit about whether 429 (rate-limited) or 400 (client's own bad input) count as "bad". Usually they should not, since they are not your service failing. For long-lived systems, a time-based availability SLI (fraction of good minutes) can fit better than a request-based one; prefer request-based when traffic is uneven, because a time-based ratio weights a quiet minute the same as a peak one.
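One way to make those denominator decisions explicit is to write them down as a classifier. The sketch below is an assumption-laden example, not a standard: the `Request` type is hypothetical, client-aborted requests are excluded from the denominator, and 4xx responses (including 400 and 429) are counted as valid and good.

```python
from typing import Iterable, NamedTuple, Optional


class Request(NamedTuple):
    status: int                   # HTTP status code returned
    client_aborted: bool = False  # client hung up before a response


def classify(req: Request) -> Optional[bool]:
    """True = good, False = bad, None = excluded from the denominator."""
    if req.client_aborted:
        return None
    if 500 <= req.status <= 599:
        return False
    # Assumption: 2xx/3xx/4xx (including 400 and 429) are not service failures.
    return True


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    verdicts = [classify(r) for r in requests]
    valid = [v for v in verdicts if v is not None]
    if not valid:
        return None  # no valid events: the ratio is undefined
    return sum(valid) / len(valid)


sample = [Request(200), Request(500), Request(429), Request(400),
          Request(200, client_aborted=True)]
print(availability_sli(sample))  # 0.75: 3 good out of 4 valid
```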

Latency — always a percentile, never a mean

The average latency is the single most misleading number in a distributed system: one slow tail hides behind millions of fast requests. If p50 = 40 ms but p99 = 900 ms, one in a hundred users waits nearly a second. On a page that fans out to 100 backends, the chance all of them beat p99 is only 0.99¹⁰⁰ ≈ 37%, so tail latency becomes the typical page latency (Dean & Barroso, "The Tail at Scale"). So the latency SLI is a threshold expressed as a ratio: good = requests served in < 300 ms; SLI = good / valid. An SLO of 99% on that SLI is the same statement as "p99 < 300 ms over the window". Set the threshold at the value where user behavior changes, not at a round number.
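Both claims are easy to check numerically. The sketch below uses a made-up sample (99 requests at 40 ms, one at 900 ms) purely for illustration; the function names are hypothetical.

```python
from statistics import mean


def p_all_fast(per_backend_fast: float, fanout: int) -> float:
    """Chance that every one of `fanout` independent calls is fast."""
    return per_backend_fast ** fanout


def latency_sli(latencies_ms, threshold_ms: float = 300.0) -> float:
    """Fraction of requests served under the threshold (good / valid)."""
    good = sum(1 for x in latencies_ms if x < threshold_ms)
    return good / len(latencies_ms)


# Illustrative sample: the mean looks healthy while one request is very slow.
sample = [40] * 99 + [900]
print(mean(sample))         # 48.6 ms, which hides the 900 ms request
print(latency_sli(sample))  # 0.99

# Fan-out to 100 backends, each fast 99% of the time.
print(round(p_all_fast(0.99, 100), 3))  # 0.366
```

The fan-out calculation assumes backend latencies are independent, which is a simplification; correlated slowness changes the exact figure but not the direction of the effect.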

The other three of the golden signals

Availability and latency are the two you almost always turn into SLOs. Google's four golden signals add traffic and saturation (usually watched, rarely SLO'd) and fold errors into availability. Correctness/quality/freshness SLIs matter for data pipelines and reads-of-stale-data systems.

Pitfalls a working engineer hits

Trade-offs & when to use error budgets

Use SLO/error-budget governance when you have a service with enough traffic for ratios to be statistically meaningful, a real tension between shipping speed and stability, and the organizational will to honor the freeze. It shines for user-facing request/response systems where "good" is cleanly definable per request.

Versus raw threshold alerting (the named alternative) — demonstrated, not asserted

The traditional approach is a static, single-window threshold: "page if 5xx rate > 1% over a 5-minute window." Reuse the Payments API numbers above (500 rps, 1× burn = 43,200 failures/day) to see exactly where this breaks in both directions.

That is the actual mechanism, not a slogan: a single-window threshold conflates a large-but-brief blip with a genuine trend (false pages), and it is blind to any leak that stays under its fixed line no matter how long it runs (missed detections). Burn-rate alerting fixes both because it tracks rate × duration, that is, budget actually consumed, which is the same quantity the freeze policy itself is denominated in. The alert threshold and the governance threshold are therefore the same currency.
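Both failure directions can be reproduced with the Payments API figures. The two scenarios below (a 10-second total outage and a steady 0.8% error rate) are hypothetical inputs chosen here for illustration, and the function names are not from any library; the 14.4× and 6× thresholds are the ones quoted earlier.

```python
RPS = 500
SLO = 0.999
WINDOW_DAYS = 28
BUDGET = round((1 - SLO) * RPS * 86_400 * WINDOW_DAYS)  # 1,209,600 failures


def static_rule_pages(error_ratio_5m: float, limit: float = 0.01) -> bool:
    """The traditional rule: page if 5xx rate > 1% over 5 minutes."""
    return error_ratio_5m > limit


def burn(error_ratio: float) -> float:
    return error_ratio / (1 - SLO)


# Scenario A (hypothetical): 10 s of total outage, then full recovery.
blip_ratio_5m = 10 / 300     # 3.3% of the 5-minute window failed
blip_ratio_1h = 10 / 3600    # but only 0.28% of the hour
blip_cost = RPS * 10         # 5,000 failures, about 0.4% of the budget
print(static_rule_pages(blip_ratio_5m))   # True: a page for a trivial spend
print(burn(blip_ratio_1h) >= 14.4)        # False: fast-burn rule stays quiet

# Scenario B (hypothetical): a steady 0.8% error rate that never recovers.
leak_ratio = 0.008
print(static_rule_pages(leak_ratio))      # False: under the 1% line forever
print(burn(leak_ratio) >= 6)              # True: about 8x, a 6x rule fires
print(WINDOW_DAYS / burn(leak_ratio))     # about 3.5 days to an empty budget
```

Scenario A is the false page and scenario B is the missed detection; the burn-rate rules get both right because they are expressed in budget consumed rather than in instantaneous rate.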

They are not exclusive: mature teams keep a couple of hard threshold/heartbeat alerts ("service totally down") for instant catastrophe detection, and layer burn-rate SLO alerts on top for everything nuanced. Small internal tools with trivial traffic often need neither — a heartbeat check is enough, and formal SLOs are over-engineering.

Takeaways

Recall question

Your service has a 99.95% monthly availability SLO and serves 200 rps. A single incident causes total outage. How long can that outage last before it burns the entire month's error budget, and how many failed requests is that? (Answer, assuming a 30-day month: 0.05% of 43,200 min = 21.6 minutes; at 200 rps that is 200 × 21.6 × 60 = 259,200 failed requests.)


Sources: Google, Site Reliability Engineering (Beyer et al.) ch. 3 & 4 and The SRE Workbook ch. 2 & 5 (SLIs/SLOs, error budgets, multi-window multi-burn-rate alerting, the four golden signals); Dean & Barroso, "The Tail at Scale" (CACM, 2013) for tail-latency and fan-out amplification; Nygard, Release It! for stability-vs-velocity framing. Re-authored/Deepened for this guide.
