The Cost of Observability — Cardinality

A metrics backend like Prometheus stores one independent time series for every distinct combination of metric name and label values, indexed by a hash of that label set. Storage, memory, and query cost scale with the number of series rather than the number of data points, so attaching an unbounded label such as user_id or request_id silently multiplies your series count into the billions and can topple the whole system. This is one of the most common ways self-hosted observability blows its budget, and it fails not with a clean error but with a slow OOM-and-restart death spiral.

Why it matters

Observability is supposed to be the cheap insurance you buy against production surprises. But metrics have a peculiar cost curve: one more value on a bounded label adds a small, fixed number of series, while one more high-cardinality label multiplies every series you already have. Engineers reach for "just add user_id so we can slice by customer" without realizing they have converted a 40,000-series metric into a 40-billion-series time bomb. Getting the metrics-vs-logs-vs-traces boundary right is what keeps a monitoring bill at hundreds of dollars instead of hundreds of thousands.

Cardinality is a product, not a sum

The total number of series a metric can produce is at most the product of the number of distinct values of each of its labels, that is, the size of the Cartesian product of the label value sets. Take a request counter:

http_requests_total{method, status, endpoint}

Trace the arithmetic with realistic value counts:

Label	Distinct values	Running product
method	5 (GET, POST, PUT, DELETE, PATCH)	5
status	~40 (all HTTP codes seen)	200
endpoint	~200 route templates	40,000

40,000 active series is comfortable: roughly 40–120 MB of RAM at the 1–3 KB-per-series working figure used later in this section. Now a well-meaning engineer adds user_id to "debug per-customer latency," with 1,000,000 users:

Label added	Distinct values	Series
+ user_id	1,000,000	40,000,000,000 (40 billion)
+ request_id	new every request	unbounded — grows forever

The jump is multiplicative, not additive: you didn't add a million series, you multiplied the existing 40,000 by a million. And request_id is worse than large — it is unbounded: every request mints a value that is never reused, so the series count climbs without limit until the process dies. This is called a cardinality explosion (or, when driven by user input, a cardinality bomb).
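The arithmetic in the two tables above can be checked in a few lines. This is a minimal sketch: the function name series_count and the dictionary layout are illustrative choices, not part of any Prometheus API, and the result is an upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Value counts taken from the tables above.
base = {"method": 5, "status": 40, "endpoint": 200}
with_user = {**base, "user_id": 1_000_000}

print(series_count(base))       # 40000
print(series_count(with_user))  # 40000000000
```

Adding a label never adds to the total; it multiplies it, which is why the second figure is a million times the first rather than a million more.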

Where the cost actually lives inside the engine

To see why series count — not sample count — is the cost driver, follow one sample into a Prometheus-style TSDB. When http_requests_total{method="GET", status="200", user_id="U8842"} arrives, the engine hashes the full label set to a stable series ID. That ID owns three things for its entire lifetime:

Inverted-index postings. Every label value gets a posting list mapping it to the series IDs that carry it (user_id="U8842" → [ids…]). This is what makes sum by (status) fast — and what balloons when there are a million distinct user_id values, each needing its own posting list entry.
A head chunk in RAM. Each active series keeps an open, in-memory chunk it appends samples to. As a working figure for Prometheus specifically, budget roughly 1–3 KB of RAM per active series (index + chunk overhead). This is a Prometheus-specific rule of thumb, not a universal TSDB constant; other engines (VictoriaMetrics, Mimir, Thanos) use different chunk encodings and index layouts and land at different (often lower) per-series overhead. 1,000,000 series ≈ 1–3 GB of head memory before you have stored a single interesting value; 40 billion series would need 40–120 TB at the same rate, which no single node can allocate on any of them.
An on-disk compressed chunk stream plus index segments. Prometheus flushes the head block to disk on a configurable interval that defaults to ~2 hours (--storage.tsdb.min-block-duration); Mimir, Thanos, and VictoriaMetrics use their own compaction/flush cadences, so treat the number as "how Prometheus is configured out of the box," not a law of TSDBs.
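To make the first item concrete, here is a toy model of the label-set lookup and the inverted-index postings. It is a sketch under loud assumptions: ToyIndex, get_or_create, and select are hypothetical names, series IDs are simple counters rather than hashes, and nothing here reflects how Prometheus lays out its index on disk. The only borrowed convention is that the metric name is stored as a label called __name__.

```python
from collections import defaultdict


class ToyIndex:
    """Toy label-set -> series-ID map plus inverted-index postings."""

    def __init__(self) -> None:
        self.series_ids: dict[frozenset, int] = {}
        self.postings: dict[tuple[str, str], list[int]] = defaultdict(list)

    def get_or_create(self, name: str, labels: dict[str, str]) -> int:
        # The metric name is treated as just another label (__name__).
        key = frozenset({"__name__": name, **labels}.items())
        if key in self.series_ids:
            return self.series_ids[key]  # existing series: no new index work
        sid = len(self.series_ids)
        self.series_ids[key] = sid
        for pair in key:  # one postings entry per (label, value) pair
            self.postings[pair].append(sid)
        return sid

    def select(self, label: str, value: str) -> list[int]:
        """Series IDs carrying label=value, read straight from the postings."""
        return self.postings.get((label, value), [])


idx = ToyIndex()
for user in ("U1", "U2", "U3", "U1"):  # U1 repeats and reuses its series
    idx.get_or_create(
        "http_requests_total",
        {"method": "GET", "status": "200", "user_id": user},
    )

print(len(idx.series_ids))          # 3 series
print(idx.select("status", "200"))  # [0, 1, 2]
print(len(idx.postings))            # 6 posting lists, 3 of them for user_id
```

Even in this toy, every new user_id value costs a new series and a new posting list, while the repeated sample for U1 costs nothing in the index.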

The crucial asymmetry — and this part does generalize across time-series engines: appending the millionth sample to an existing series is nearly free (delta-of-delta + XOR compression, often <2 bytes/sample). Creating the millionth series costs a fresh chunk, index entries, and memory that is never reclaimed while the series stays active. High scrape frequency is cheap; high cardinality is ruinous.
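A rough sizing helper makes the asymmetry tangible. It is only a back-of-envelope sketch: the constants are the working figures quoted in this section (1–3 KB per active series, about 2 bytes per sample as an upper bound), 1 KB is treated as 1,000 bytes for round numbers, and the function names are made up for illustration.

```python
BYTES_PER_SERIES = (1_000, 3_000)  # low/high working figure, Prometheus-specific
BYTES_PER_SAMPLE = 2               # upper end of the "<2 bytes/sample" figure


def head_ram_gb(active_series: int) -> tuple[float, float]:
    """Low/high estimate of head memory in GB for a given active series count."""
    low, high = BYTES_PER_SERIES
    return (active_series * low / 1e9, active_series * high / 1e9)


def series_vs_sample_ratio() -> tuple[float, float]:
    """How many appended samples cost about as much as one new series."""
    low, high = BYTES_PER_SERIES
    return (low / BYTES_PER_SAMPLE, high / BYTES_PER_SAMPLE)


print(head_ram_gb(40_000))       # (0.04, 0.12)
print(head_ram_gb(1_000_000))    # (1.0, 3.0)
print(series_vs_sample_ratio())  # (500.0, 1500.0)
```

Under these assumptions, creating one series costs about as much as appending several hundred to over a thousand samples to a series that already exists.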

What belongs in metrics vs logs vs traces

The fix is not "never record user_id" — it is recording each dimension in the signal whose cost model can absorb it. The three pillars have fundamentally different cardinality tolerances:

Signal	Cost model	Cardinality tolerance	Put here	Keep out
Metrics	per active series (one open chunk each)	Low — must be bounded & small	numeric aggregates: rates, latencies (histograms), error counts, saturation; low-card labels (method, status, region, endpoint template)	user_id, request_id, email, full URL, SQL text
Logs	per event (bytes written & indexed)	High — every field can vary	rich per-event context: the exact user, params, error message, stack trace	data you need to graph continuously (that is a metric)
Traces	per sampled request	High, but sampled	request_id, span timings, per-hop causality across services	unsampled high-QPS firehose (cost + storage)

Rule of thumb: if a dimension is unbounded or user-controlled, it is a log field or a trace attribute — never a metric label. Metrics answer "how many / how fast, sliced by a handful of fixed dimensions." Logs and traces answer "show me this specific event." The bridge between them is the exemplar: attach a trace_id to a single sample inside a latency-histogram bucket, so you can jump from the p99 spike on the graph straight to one representative trace — high-cardinality pointer, zero high-cardinality series.
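The exemplar idea can be sketched without any metrics library. The ToyHistogram below is hypothetical and deliberately simplified: its buckets are non-cumulative (real Prometheus buckets are cumulative), and it keeps only the most recent exemplar per bucket. What it is meant to show is that the number of buckets stays fixed no matter how many distinct trace IDs pass through.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyHistogram:
    """Toy latency histogram keeping one exemplar (trace pointer) per bucket."""

    bounds: tuple[float, ...]  # upper bounds in seconds; +Inf is implicit
    counts: list[int] = field(init=False)
    exemplars: list[Optional[tuple[str, float]]] = field(init=False)

    def __post_init__(self) -> None:
        n = len(self.bounds) + 1  # one extra slot for the +Inf bucket
        self.counts = [0] * n
        self.exemplars = [None] * n

    def observe(self, seconds: float, trace_id: str) -> None:
        i = bisect_left(self.bounds, seconds)    # first bucket with bound >= value
        self.counts[i] += 1
        self.exemplars[i] = (trace_id, seconds)  # overwrite: latest exemplar wins


h = ToyHistogram(bounds=(0.1, 0.5, 2.0))
for n in range(1000):
    h.observe(0.05, f"trace-{n}")  # 1,000 distinct trace IDs, all fast
h.observe(3.2, "trace-slow")       # one slow outlier

print(len(h.counts))   # 4 buckets, regardless of how many trace IDs were seen
print(h.exemplars[3])  # ('trace-slow', 3.2)
```

The slow bucket carries a pointer to one real trace, yet the trace ID never became a label, so it never created a series.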

Pitfalls a working engineer actually hits

The unbounded-value trap. Any label sourced from user input or a UUID (request_id, session_id, raw path with IDs in it) grows forever. Normalize the URL to its route template (/users/{id}, not /users/8842) before it becomes a label.
Cardinality is per-target, then summed. A metric with 500 series per instance × 400 instances = 200,000 series centrally. Local-looking labels multiply by fleet size.
Error-message-as-label. error="connection reset by peer: 10.2.3.4:5432" embeds an IP and port — effectively unbounded. Use a bounded error_type enum; put the full message in a log.
Deleted series still cost you. A series stops receiving samples but its index entries and head-block metadata persist until the block is compacted and its retention window passes. A churny label (pods that restart, ephemeral IPs) creates "dead" series that bloat memory long after they stop reporting — this is churn, and it hurts as much as raw cardinality.
Silent failure mode. There is rarely a clean rejection. Ingestion latency creeps, the head block grows, memory climbs, GC thrashes, and Prometheus is OOM-killed and restarts. On restart it replays the WAL, which is slow, so it lags and drops data exactly when you are trying to debug the incident that caused it.
Histogram footprint. A native histogram is cheap, but a classic Prometheus histogram creates one series per bucket, plus a _sum and a _count series. A 12-bucket histogram therefore multiplies your label combinations by at least 12 (more once _sum and _count are included), so mind that when the base label set is non-trivial.
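Two of the pitfalls above, raw paths and error strings as labels, come down to normalizing a value before it becomes a label. The sketch below is one possible approach, not a recommended library: route_template and error_type are hypothetical helpers, the regex only recognizes numeric and UUID-shaped segments, and the error categories are invented for the example. Where a web framework exposes the matched route, using that directly is usually more reliable than pattern matching.

```python
import re

# Path segments that look like IDs: all digits, or a UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

# Bounded enum of error types (illustrative categories only).
_ERROR_TYPES = (
    ("connection reset", "connection_reset"),
    ("refused", "connection_refused"),
    ("timeout", "timeout"),
)


def route_template(path: str) -> str:
    """Drop the query string and replace ID-like segments with {id}."""
    segments = path.split("?", 1)[0].split("/")
    return "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)


def error_type(message: str) -> str:
    """Map a free-form error message onto a small, fixed set of label values."""
    lowered = message.lower()
    for needle, label in _ERROR_TYPES:
        if needle in lowered:
            return label
    return "other"


print(route_template("/users/8842"))                         # /users/{id}
print(route_template("/users/8842/orders/17?page=2"))        # /users/{id}/orders/{id}
print(error_type("connection reset by peer: 10.2.3.4:5432")) # connection_reset
```

The full path and the full error message still belong somewhere: in the log line or the trace span, where per-event detail is what the cost model expects.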

Trade-offs & when to reach for a different tool

The inverted-index TSDB model (Prometheus, Thanos, Cortex/Mimir, VictoriaMetrics) is optimized for low-cardinality, high-frequency data and cheap aggregation queries — that is exactly why it punishes cardinality. The named alternatives and mitigations trade differently:

Prometheus vs. wide-event / columnar stores (Honeycomb, ClickHouse-backed systems). Column stores keep each event's fields in separate, individually compressed columns and answer queries by scanning, so cardinality is nearly free: you can group by user_id after the fact. The cost: queries scan rather than index-lookup (slower for simple dashboards), storage is larger, and you typically sample. Use metrics for always-on SLO dashboards and alerting on bounded dimensions; use wide events / high-card tracing when the questions are open-ended and per-entity ("which 3 customers saw p99 > 2s in the last hour?").
Adding a metric label vs. using exemplars. A label makes every query auto-slice by that dimension but multiplies series. An exemplar attaches a trace pointer to bucket samples — you keep the graph cheap and drill into a real trace on demand. Prefer exemplars whenever the high-cardinality dimension is for investigation, not alerting.
Keep-the-label vs. drop-it-at-ingest (relabeling). If a label already leaked in, metric_relabel_configs can drop it (or drop the whole metric) at scrape time before it ever creates a series, provided the remaining labels still identify each series uniquely. Cheaper than re-instrumenting, but you lose the raw dimension permanently. It is good for stopping active bleeding, not a substitute for fixing the instrumentation.
Ad-hoc label vs. a recording rule. Instead of adding a new label to the raw metric, precompute the aggregation you actually query (e.g. job:http_errors:rate5m by region, not by user_id) with a recording rule. You get a fast, bounded series for dashboards while raw high-cardinality slicing, if ever needed, goes to logs/traces.
No limit vs. a cardinality limit. Prometheus/Mimir/Cortex support per-target or per-tenant sample/series limits (sample_limit, ingester series limits) that reject or drop excess series instead of letting one bad label field OOM the whole instance — a blast-radius control, not a fix, but it converts a cluster-wide outage into a contained, alertable rejection.
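The blast-radius idea behind a series limit can be shown with a toy ingester. This is an illustration of the concept only: ToyIngester and its fields are hypothetical, and the behavior is loosely modeled on a per-tenant series cap, not on the exact semantics of Prometheus's sample_limit or of any Mimir/Cortex setting.

```python
class ToyIngester:
    """Toy ingester that refuses to create series beyond a fixed cap."""

    def __init__(self, max_series: int) -> None:
        self.max_series = max_series
        self.series: dict[frozenset, float] = {}
        self.rejected = 0  # something to alert on

    def add(self, labels: dict[str, str], value: float) -> bool:
        key = frozenset(labels.items())
        is_new = key not in self.series
        if is_new and len(self.series) >= self.max_series:
            self.rejected += 1  # contained rejection instead of an OOM
            return False
        # Existing series keep accepting samples even when the cap is reached.
        self.series[key] = self.series.get(key, 0.0) + value
        return True


ing = ToyIngester(max_series=3)
accepted = [ing.add({"user_id": f"U{i}"}, 1.0) for i in range(10)]

print(accepted.count(True))          # 3
print(ing.rejected)                  # 7
print(ing.add({"user_id": "U0"}, 1.0))  # True: U0 already has a series
```

The cap does nothing to fix the leaking label, but it turns unbounded growth into a counter you can alert on while the instrumentation is repaired.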

The senior instinct: default to bounded metrics + sampled traces + rich logs, relabel/limit at the edges as a safety net, and only pay for a high-cardinality store when the business genuinely needs per-entity slicing that alerting-grade metrics can't give.

Takeaways

Cardinality multiplies. Total series = product of distinct label-value counts; one unbounded label (user_id, request_id) turns thousands of series into billions or infinity.
Cost tracks series, not samples. Each active series owns a head chunk (~1–3 KB RAM on Prometheus defaults; other TSDBs differ) and index entries; high scrape rates are cheap, high cardinality is fatal.
Match the dimension to the signal. Bounded numeric aggregates → metrics; per-event detail → logs; per-request causality → sampled traces. Bridge them with exemplars.
Normalize before you label, and keep a safety net. Route templates over raw paths, error-type enums over error strings; use relabeling to drop unbounded labels at ingest and per-tenant series limits to cap blast radius — this is the difference between a $200 and a $200,000 bill.

Recall question

A counter api_calls_total{region, tier, endpoint} has 4 regions, 3 tiers, and 250 endpoints, scraped from 300 instances. A teammate proposes adding customer_id (80,000 customers) so dashboards can slice by customer. What happens to the series count, and what should you do instead?

Answer: base cardinality is 4 × 3 × 250 = 3,000 per instance × 300 = 900,000 series (already large). Adding customer_id multiplies by 80,000 → ~72 billion series — an instant OOM. Instead, keep the metric bounded, emit per-customer detail as a log field or a trace attribute, use a recording rule for the aggregate you actually dashboard, and attach a trace_id exemplar to the latency histogram so you can still drill from a dashboard spike into a specific customer's request. If the label leaks in before instrumentation is fixed, drop it at ingest with metric_relabel_configs and set a series limit as a backstop.
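The numbers in the answer can be reproduced as follows. This is a quick check rather than a sizing tool: fleet_series is a made-up helper, it assumes every instance exposes every label combination, and the memory line reuses the 1 KB low-end working figure (with 1 KB taken as 1,000 bytes).

```python
from math import prod


def fleet_series(label_counts: list[int], instances: int) -> int:
    """Upper bound on series across a fleet: per-instance product times instances."""
    return prod(label_counts) * instances


before = fleet_series([4, 3, 250], instances=300)
after = fleet_series([4, 3, 250, 80_000], instances=300)

print(before)               # 900000
print(after)                # 72000000000
print(after * 1_000 / 1e12) # 72.0 (TB of head memory at the 1 KB low end)
```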

Sources: B. Brazil, Prometheus: Up & Running (label/cardinality guidance, the ~1–3 KB/series working figure — a Prometheus-specific default, not a universal TSDB constant); Prometheus documentation on naming, TSDB head blocks (including the configurable min-block-duration), inverted index, relabeling, series limits, and exemplars; C. Majors et al., Observability Engineering (Honeycomb) on wide events and high-cardinality querying; Google SRE Book & Workbook (metrics, SLOs, and the monitoring signal boundary); B. Gregg, Systems Performance (the USE method and metric selection). Re-authored/Deepened for this guide.
