The Cost of Observability — Cardinality
A metrics backend like Prometheus stores one independent time series for every distinct combination of metric name and label values, indexed by a hash of that label set; because storage, memory, and query cost scale with the number of series rather than the number of data points, attaching an unbounded label such as user_id or request_id silently multiplies your series count into the billions and topples the whole system. This is the single most common way self-hosted observability blows its budget — and it fails not with a clean error but with a slow OOM-and-restart death spiral.
Why it matters
Observability is supposed to be the cheap insurance you buy against production surprises. But metrics have a peculiar cost curve: an extra label value is nearly free, while an extra high-cardinality label is catastrophic, because cardinality multiplies. Engineers reach for "just add user_id so we can slice by customer" without realizing they have converted a 40,000-series metric into a 40-billion-series time bomb. Getting the metrics-vs-logs-vs-traces boundary right is what keeps a monitoring bill at hundreds of dollars instead of hundreds of thousands.
Cardinality is a product, not a sum
The total number of series a metric can produce is the Cartesian product of the count of distinct values across all its labels. Take a request counter:
http_requests_total{method, status, endpoint}
Trace the arithmetic with realistic value counts:
| Label | Distinct values | Running product |
|---|---|---|
| method | 5 (GET, POST, PUT, DELETE, PATCH) | 5 |
| status | ~40 (all HTTP codes seen) | 200 |
| endpoint | ~200 route templates | 40,000 |
40,000 active series is comfortable: on the order of 40–120 MB of RAM at roughly 1–3 KB per active series in Prometheus. Now a well-meaning engineer adds user_id to "debug per-customer latency," with 1,000,000 users:
| Label added | Distinct values | Series |
|---|---|---|
| + user_id | 1,000,000 | 40,000,000,000 (40 billion) |
| + request_id | new every request | unbounded — grows forever |
The jump is multiplicative, not additive: you didn't add a million series, you multiplied the existing 40,000 by a million. And request_id is worse than large — it is unbounded: every request mints a value that is never reused, so the series count climbs without limit until the process dies. This is called a cardinality explosion (or, when driven by user input, a cardinality bomb).
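The running product in the tables above can be reproduced in a few lines. This is a minimal sketch in Python; the label counts are the illustrative figures from the tables, not measurements, and the helper name running_series_count is made up for this example.

```python
from itertools import accumulate
from operator import mul


def running_series_count(label_counts):
    """Return (label, running product) pairs for an ordered mapping of
    label name -> number of distinct values."""
    names = list(label_counts)
    products = accumulate(label_counts.values(), mul)
    return list(zip(names, products))


# Illustrative counts taken from the tables above (assumed, not measured).
base = {"method": 5, "status": 40, "endpoint": 200}
with_user = {**base, "user_id": 1_000_000}

for name, total in running_series_count(with_user):
    print(f"{name:>9}: {total:,} series")
```

Each added label multiplies the previous total, which is why the last row jumps from 40,000 to 40 billion rather than to roughly one million.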
Where the cost actually lives inside the engine
To see why series count — not sample count — is the cost driver, follow one sample into a Prometheus-style TSDB. When http_requests_total{method="GET", status="200", user_id="U8842"} arrives, the engine hashes the full label set to a stable series ID. That ID owns three things for its entire lifetime:
- Inverted-index postings. Every label value gets a posting list mapping it to the series IDs that carry it (user_id="U8842" → [ids…]). This is what makes sum by (status) fast, and what balloons when there are a million distinct user_id values, each needing its own posting list.
- A head chunk in RAM. Each active series keeps an open, in-memory chunk it appends samples to. As a working figure for Prometheus specifically, budget roughly 1–3 KB of RAM per active series (index + chunk overhead). This is a rule of thumb for Prometheus, not a universal TSDB constant; other engines (VictoriaMetrics, Mimir, Thanos) use different chunk encodings and index layouts and land at different (often lower) per-series overhead. 1,000,000 series ≈ 1–3 GB of head memory before you have stored a single interesting value; 40 billion series would need on the order of 40–120 TB at that rate, which is unallocatable on any of them.
- An on-disk compressed chunk stream plus index segments. Prometheus flushes the head block to disk on a configurable interval that defaults to ~2 hours (--storage.tsdb.min-block-duration); Mimir, Thanos, and VictoriaMetrics use their own compaction/flush cadences, so treat the number as "how Prometheus is configured out of the box," not a law of TSDBs.
The crucial asymmetry — and this part does generalize across time-series engines: appending the millionth sample to an existing series is nearly free (delta-of-delta + XOR compression, often <2 bytes/sample). Creating the millionth series costs a fresh chunk, index entries, and memory that is never reclaimed while the series stays active. High scrape frequency is cheap; high cardinality is ruinous.
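The asymmetry between "new sample" and "new series" can be illustrated with a toy in-memory head. This is a sketch under loud assumptions: ToyHead and its methods are invented for this example, the real Prometheus head hashes label sets and stores compressed chunks rather than Python lists, and nothing here reflects Prometheus's actual internals beyond the general shape described above.

```python
from collections import defaultdict


class ToyHead:
    """Toy in-memory head: one chunk per series, one posting list per label pair."""

    def __init__(self):
        self.chunks = {}                  # series key -> list of (timestamp, value)
        self.postings = defaultdict(set)  # (label, value) -> set of series keys

    def append(self, name, labels, timestamp, value):
        # Series identity is the metric name plus the full, sorted label set.
        key = (name, tuple(sorted(labels.items())))
        if key not in self.chunks:
            # New series: allocate a chunk and register it in every posting list.
            self.chunks[key] = []
            for pair in key[1]:
                self.postings[pair].add(key)
        # Existing series: appending a sample does not touch the index.
        self.chunks[key].append((timestamp, value))

    def series_count(self):
        return len(self.chunks)

    def sample_count(self):
        return sum(len(chunk) for chunk in self.chunks.values())


head = ToyHead()

# 1,000 scrapes of one bounded series: many samples, still one series.
for t in range(1000):
    head.append("http_requests_total", {"method": "GET", "status": "200"}, t, float(t))

# 1,000 requests that each carry a fresh user_id: one sample each, 1,000 new series.
for i in range(1000):
    head.append("http_requests_total",
                {"method": "GET", "status": "200", "user_id": f"U{i}"}, 0, 1.0)

print(head.series_count(), head.sample_count())  # 1001 2000
```

Both loops append 1,000 samples, but only the second one grows the chunk map and the posting lists, which is where the per-series memory goes.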
What belongs in metrics vs logs vs traces
The fix is not "never record user_id" — it is recording each dimension in the signal whose cost model can absorb it. The three pillars have fundamentally different cardinality tolerances:
| Signal | Cost model | Cardinality tolerance | Put here | Keep out |
|---|---|---|---|---|
| Metrics | per active series (one open chunk each) | Low — must be bounded & small | numeric aggregates: rates, latencies (histograms), error counts, saturation; low-card labels (method, status, region, endpoint template) | user_id, request_id, email, full URL, SQL text |
| Logs | per event (bytes written & indexed) | High — every field can vary | rich per-event context: the exact user, params, error message, stack trace | data you need to graph continuously (that is a metric) |
| Traces | per sampled request | High, but sampled | request_id, span timings, per-hop causality across services | unsampled high-QPS firehose (cost + storage) |
Rule of thumb: if a dimension is unbounded or user-controlled, it is a log field or a trace attribute — never a metric label. Metrics answer "how many / how fast, sliced by a handful of fixed dimensions." Logs and traces answer "show me this specific event." The bridge between them is the exemplar: attach a trace_id to a single sample inside a latency-histogram bucket, so you can jump from the p99 spike on the graph straight to one representative trace — high-cardinality pointer, zero high-cardinality series.
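One way to apply the rule of thumb at the instrumentation point is an explicit allow-list of metric labels. The sketch below assumes a hypothetical request event represented as a dict; the names METRIC_LABELS and split_event are invented for illustration and do not come from any client library.

```python
import json

# Assumption: the bounded labels agreed on for this metric.
METRIC_LABELS = ("method", "status", "endpoint")


def split_event(event):
    """Return (metric labels, log record) for one request event.

    Only allow-listed keys become metric labels. The log record keeps the
    full event, including unbounded fields such as user_id and request_id.
    """
    labels = {k: str(event[k]) for k in METRIC_LABELS if k in event}
    log_record = dict(event)
    return labels, log_record


event = {
    "method": "GET",
    "status": 200,
    "endpoint": "/users/{id}",
    "user_id": "U8842",
    "request_id": "req-0001",
}

labels, log_record = split_event(event)
print(labels)                   # bounded: safe to use as metric labels
print(json.dumps(log_record))   # rich per-event context for the log line
```

An allow-list fails closed: a field someone adds to the event later ends up in the log by default, not in the series count.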
Pitfalls a working engineer actually hits
- The unbounded-value trap. Any label sourced from user input or a UUID (request_id, session_id, raw path with IDs in it) grows forever. Normalize the URL to its route template (/users/{id}, not /users/8842) before it becomes a label.
- Cardinality is per-target, then summed. A metric with 500 series per instance × 400 instances = 200,000 series centrally. Local-looking labels multiply by fleet size.
- Error-message-as-label. error="connection reset by peer: 10.2.3.4:5432" embeds an IP and port, which makes it effectively unbounded. Use a bounded error_type enum; put the full message in a log.
- Deleted series still cost you. A series stops receiving samples but its index entries and head-block metadata persist until the block is compacted and its retention window passes. A churny label (pods that restart, ephemeral IPs) creates "dead" series that bloat memory long after they stop reporting. This is churn, and it hurts as much as raw cardinality.
- Silent failure mode. There is rarely a clean rejection. Ingestion latency creeps, the head block grows, memory climbs, GC thrashes, and Prometheus is OOM-killed and restarts. On restart it replays the WAL, which is slow, so it lags and drops data exactly when you are trying to debug the incident that caused it.
- Histogram footprint. A native histogram is cheap, but a classic Prometheus histogram creates one series per bucket, plus _sum and _count series. A 12-bucket histogram already multiplies your label combinations by at least 12, so mind that when the base label set is non-trivial.
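The two normalization pitfalls above (raw paths and error strings) can be handled before a value ever becomes a label. The following is a minimal sketch: the regular expression, the placeholder {id}, and the error-type table are assumptions chosen for illustration, and in a real service the route template is usually better taken from the web framework's router than guessed from the path.

```python
import re

# Assumption: path segments that are all digits or UUID-shaped are identifiers.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def route_template(path):
    """Replace identifier-looking path segments with a fixed placeholder."""
    path = path.split("?", 1)[0]  # drop the query string before labeling
    segments = path.split("/")
    return "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)


# Hypothetical bounded enum of error types; the substrings are examples only.
_ERROR_TYPES = (
    ("connection reset", "connection_reset"),
    ("connection refused", "connection_refused"),
    ("timeout", "timeout"),
)


def error_type(message):
    """Map a free-form error message to one of a small, fixed set of labels."""
    lowered = message.lower()
    for needle, label in _ERROR_TYPES:
        if needle in lowered:
            return label
    return "other"


print(route_template("/users/8842"))
print(error_type("connection reset by peer: 10.2.3.4:5432"))
```

The catch-all "other" value keeps the label bounded even when a new error string appears; the full message still belongs in the log line.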
Trade-offs & when to reach for a different tool
The inverted-index TSDB model (Prometheus, Thanos, Cortex/Mimir, VictoriaMetrics) is optimized for low-cardinality, high-frequency data and cheap aggregation queries — that is exactly why it punishes cardinality. The named alternatives and mitigations trade differently:
- Prometheus vs. wide-event / columnar stores (Honeycomb, ClickHouse-backed systems). Column stores keep each event field in its own column with per-column compression and answer queries by scanning, so cardinality is nearly free: you can group by user_id after the fact. The cost: queries scan rather than index-lookup (slower for simple dashboards), storage is larger, and you typically sample. Use metrics for always-on SLO dashboards and alerting on bounded dimensions; use wide events / high-card tracing when the questions are open-ended and per-entity ("which 3 customers saw p99 > 2s in the last hour?").
- Adding a metric label vs. using exemplars. A label makes every query auto-slice by that dimension but multiplies series. An exemplar attaches a trace pointer to bucket samples, so you keep the graph cheap and drill into a real trace on demand. Prefer exemplars whenever the high-cardinality dimension is for investigation, not alerting.
- Keep-the-label vs. drop-it-at-ingest (relabeling). If a label already leaked in, metric_relabel_configs can drop or aggregate it at scrape time before it ever creates a series. Cheaper than re-instrumenting, but you lose the raw dimension permanently. It is good for stopping active bleeding, not a substitute for fixing the instrumentation.
- Ad-hoc label vs. a recording rule. Instead of adding a new label to the raw metric, precompute the aggregation you actually query (e.g. job:http_errors:rate5m by region, not by user_id) with a recording rule. You get a fast, bounded series for dashboards while raw high-cardinality slicing, if ever needed, goes to logs/traces.
- No limit vs. a cardinality limit. Prometheus/Mimir/Cortex support per-target or per-tenant sample/series limits (sample_limit, ingester series limits) that reject or drop excess series instead of letting one bad label field OOM the whole instance. This is a blast-radius control, not a fix, but it converts a cluster-wide outage into a contained, alertable rejection.
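The two safety nets from the list above, dropping a label at ingest and capping series per scrape, can be sketched as plain functions. This is an illustration of the idea, not Prometheus configuration or code: drop_labels and enforce_series_limit are hypothetical names, and summing series that collide after the drop is a simplification made here. In Prometheus itself, removing a label must still leave series uniquely labeled, so check the relabeling documentation before relying on that behavior.

```python
def drop_labels(series, unwanted):
    """Remove unwanted labels from each scraped series and merge the results.

    `series` maps a frozenset of (label, value) pairs to a counter value.
    Series that become identical after the drop are summed (a simplification
    for this sketch).
    """
    merged = {}
    for labels, value in series.items():
        kept = frozenset((k, v) for k, v in labels if k not in unwanted)
        merged[kept] = merged.get(kept, 0.0) + value
    return merged


def enforce_series_limit(series, limit):
    """Reject the whole scrape when it carries more series than allowed."""
    if len(series) > limit:
        raise ValueError(f"scrape rejected: {len(series)} series > limit {limit}")
    return series


# Hypothetical scrape in which user_id leaked into the label set.
scrape = {
    frozenset({("status", "200"), ("user_id", "U1")}): 3.0,
    frozenset({("status", "200"), ("user_id", "U2")}): 4.0,
    frozenset({("status", "500"), ("user_id", "U1")}): 1.0,
}

cleaned = drop_labels(scrape, {"user_id"})
enforce_series_limit(cleaned, limit=2)  # passes: two series remain
print(len(scrape), "->", len(cleaned))
```

The limit is deliberately blunt: it turns runaway cardinality into an explicit failure that can be alerted on, while the label drop removes the cause for metrics that are already leaking.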
The senior instinct: default to bounded metrics + sampled traces + rich logs, relabel/limit at the edges as a safety net, and only pay for a high-cardinality store when the business genuinely needs per-entity slicing that alerting-grade metrics can't give.
Takeaways
- Cardinality multiplies. Total series = product of distinct label-value counts; one unbounded label (user_id, request_id) turns thousands of series into billions or infinity.
- Cost tracks series, not samples. Each active series owns a head chunk (~1–3 KB RAM on Prometheus defaults; other TSDBs differ) and index entries; high scrape rates are cheap, high cardinality is fatal.
- Match the dimension to the signal. Bounded numeric aggregates → metrics; per-event detail → logs; per-request causality → sampled traces. Bridge them with exemplars.
- Normalize before you label, and keep a safety net. Route templates over raw paths, error-type enums over error strings; use relabeling to drop unbounded labels at ingest and per-tenant series limits to cap blast radius — this is the difference between a $200 and a $200,000 bill.
Recall question
A counter api_calls_total{region, tier, endpoint} has 4 regions, 3 tiers, and 250 endpoints, scraped from 300 instances. A teammate proposes adding customer_id (80,000 customers) so dashboards can slice by customer. What happens to the series count, and what should you do instead?
Answer: base cardinality is 4 × 3 × 250 = 3,000 per instance × 300 = 900,000 series (already large). Adding customer_id multiplies by 80,000 → ~72 billion series — an instant OOM. Instead, keep the metric bounded, emit per-customer detail as a log field or a trace attribute, use a recording rule for the aggregate you actually dashboard, and attach a trace_id exemplar to the latency histogram so you can still drill from a dashboard spike into a specific customer's request. If the label leaks in before instrumentation is fixed, drop it at ingest with metric_relabel_configs and set a series limit as a backstop.
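The arithmetic in the answer, including the fleet multiplier, can be checked with a short script. This is a worst-case sketch: it assumes every label combination appears on every instance, and it reuses the 1–3 KB per-series working figure for Prometheus quoted earlier, so the memory numbers are rough upper-bound estimates rather than predictions. The function names are made up for this example.

```python
from math import prod


def fleet_series(label_counts, instances):
    """Worst-case series count: product of label cardinalities times instances."""
    return prod(label_counts.values()) * instances


def head_memory_gb(series, kb_per_series):
    """Rough head memory in GB (decimal) for a given per-series cost in KB."""
    return series * kb_per_series * 1000 / 1e9


labels = {"region": 4, "tier": 3, "endpoint": 250}
before = fleet_series(labels, instances=300)
after = fleet_series({**labels, "customer_id": 80_000}, instances=300)

for name, count in (("before", before), ("after", after)):
    low, high = head_memory_gb(count, 1), head_memory_gb(count, 3)
    print(f"{name}: {count:,} series, roughly {low:,.1f} to {high:,.1f} GB of head memory")
```

Even at the low end of the per-series estimate, the "after" case lands in the tens of terabytes, which is why the proposal has to be redirected to logs, traces, and exemplars instead of tuned.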
Sources: B. Brazil, Prometheus: Up & Running (label/cardinality guidance, the ~1–3 KB/series working figure — a Prometheus-specific default, not a universal TSDB constant); Prometheus documentation on naming, TSDB head blocks (including the configurable min-block-duration), inverted index, relabeling, series limits, and exemplars; C. Majors et al., Observability Engineering (Honeycomb) on wide events and high-cardinality querying; Google SRE Book & Workbook (metrics, SLOs, and the monitoring signal boundary); B. Gregg, Systems Performance (the USE method and metric selection). Re-authored/Deepened for this guide.