Knowledge Guide

Metrics, Logs & Traces — the Three Pillars

The mechanism: three ways to record what a system did

Every observability signal is the same act — a running process writes a fact about itself to an emitter that a backend later stores and queries — but the three pillars differ in what they throw away at write time, and that discarding is the whole story. A metric is a number pre-aggregated into fixed-width time buckets, so the process keeps only a running counter or histogram and forgets which individual request touched it. A log is a discrete, timestamped event kept in full detail, so nothing is discarded but every event costs a full write. A trace is a set of causally-linked spans stitched by a propagated trace ID, so the process keeps the shape of one request as it fans across services but samples away most requests to survive the volume.

Understanding the pillars means understanding those three write-time bargains: metrics trade detail for cheap aggregation, logs trade cost for total detail, traces trade completeness (via sampling) for cross-service causality. Pick the wrong pillar for a question and you either can't answer it or you bankrupt your telemetry budget answering it.
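The three bargains can be made concrete with a minimal sketch. It uses only the Python standard library; the function names (record_metric, record_log, start_span) and the in-memory sinks are hypothetical stand-ins for a real client library and backend, not any particular SDK's API.

```python
import json
import time
import uuid
from collections import Counter

# Metric: only a running count per bounded label set survives the write.
request_count = Counter()

def record_metric(method: str, status: int) -> None:
    request_count[(method, status)] += 1  # the individual request is forgotten here

# Log: the full event is kept, so every event costs one full write.
def record_log(sink: list, **fields) -> None:
    sink.append(json.dumps({"ts": time.time(), **fields}))

# Trace: spans share one trace_id and point at their parent span.
def start_span(sink: list, name: str, trace_id: str, parent_id=None) -> str:
    span_id = uuid.uuid4().hex[:16]
    sink.append({"trace_id": trace_id, "span_id": span_id,
                 "parent_id": parent_id, "name": name})
    return span_id

logs, spans = [], []
trace_id = uuid.uuid4().hex
for user in ("u1", "u2", "u3"):
    record_metric("POST", 200)
    record_log(logs, msg="checkout ok", user_id=user, trace_id=trace_id)

root = start_span(spans, "POST /checkout", trace_id)
child = start_span(spans, "payment-svc", trace_id, parent_id=root)
```

After three requests the metric holds a single number (3) with no memory of which users were involved, the log sink holds three full events, and the span list holds a parent-child shape for one request.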

Why it matters: three different questions in an incident

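At 02:14 a pager fires. Over the next ten minutes you ask three structurally different questions, and each maps to exactly one pillar: is something wrong (metrics), where in the request path is it (traces), and why exactly did it happen (logs).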

A metric tells you the building is on fire; a trace tells you which floor; a log tells you it was the deep-fryer in the third-floor kitchen. Trying to substitute one for another is the classic failure: alerting on log volume (expensive and laggy), or debugging a single request from a dashboard (impossible — the number was averaged away).

A traced incident, step by step

Concrete numbers make the hand-off between pillars obvious. A checkout service normally serves p99 = 45 ms. A bad deploy ships at 02:10.

  1. t=02:14 — metric fires. http_request_duration_seconds is conventionally instrumented as a Prometheus Histogram (not a Counter) — it exposes cumulative per-bucket counters (_bucket), plus _sum and _count. A recording rule computes histogram_quantile(0.99, …) over those buckets across a 5-minute window and ~2M requests. p99 crosses 500 ms for 3 consecutive evaluations → alert. Note what the metric can and can't say: it knows the rate and the quantile, but not which of the 2M requests were slow — that detail was summed into buckets at write time.
  2. t=02:15 — pivot to a trace. You open a trace tagged latency > 500ms. The waterfall shows the parent span POST /checkout = 780 ms, with children: auth 8 ms, cart-svc 12 ms, payment-svc 740 ms. Inside payment-svc, the child span db.query = 720 ms. The trace localized the fault to one hop and one operation — something a metric averaged over all hops could never do.
  3. t=02:17 — pivot to logs. Every span carries the trace_id; you query the log store for trace_id=abc123 AND service=payment-svc. The matching line: WARN slow query: SELECT … FROM orders WHERE user_id=? — full table scan, 1.2M rows, missing index idx_orders_user after migration 0047. The log gives you the exact SQL, the row count, and the migration — the root cause.
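Step 1 relies on a quantile computed from cumulative bucket counters, so the p99 is an estimate rather than a measured value. The sketch below shows the general idea of linear interpolation inside the bucket that contains the target rank. The bucket boundaries and counts are invented for illustration (the source only states ~2M requests and a 500 ms threshold), and the function is a simplified approximation, not Prometheus's actual implementation.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    with a final +Inf bucket that holds the total count.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the open-ended bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for ~2M requests in the evaluation window.
buckets = [
    (0.05, 1_900_000),
    (0.1, 1_950_000),
    (0.5, 1_970_000),
    (1.0, 2_000_000),
    (math.inf, 2_000_000),
]
p99 = estimate_quantile(0.99, buckets)
alert = p99 > 0.5  # the 500 ms threshold from the walkthrough
```

The estimate can only land somewhere between two bucket boundaries (here between 0.5 s and 1.0 s), which is one more way the metric has discarded per-request detail.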

The correlation key is what makes this flow work: a single trace_id propagated in request headers (W3C traceparent) and stamped onto both spans and log lines. Without it you'd be eyeballing timestamps across five services. This is why mature stacks inject trace_id into every structured log — it turns three separate haystacks into one joinable dataset.
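A rough sketch of that correlation key in action: parse the incoming traceparent header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags), stamp the trace ID onto every structured log line, then join on it. The validation is simplified, and log_line plus the in-memory list standing in for a log store are hypothetical helpers.

```python
import json
import re

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Return trace context from a traceparent header, or None if malformed."""
    m = TRACEPARENT_RE.match(header.strip())
    if not m:
        return None
    _version, trace_id, parent_id, flags = m.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def log_line(ctx: dict, service: str, level: str, msg: str) -> str:
    # Hypothetical helper: every log line carries the propagated trace_id.
    return json.dumps({"trace_id": ctx["trace_id"], "service": service,
                       "level": level, "msg": msg})

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
store = [
    log_line(ctx, "cart-svc", "INFO", "cart loaded"),
    log_line(ctx, "payment-svc", "WARN", "slow query"),
]

# The join: filter the log store by trace_id and service.
hits = [json.loads(l) for l in store
        if json.loads(l)["trace_id"] == ctx["trace_id"]
        and json.loads(l)["service"] == "payment-svc"]
```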

The cost and cardinality trade-off — the constraint that shapes real stacks

The reason you can't just "log everything and compute metrics from logs" is cardinality. A metric's cost is not driven by traffic volume — it's driven by the number of unique label combinations (time series). A counter http_requests_total{method, status} with 4 methods × 6 status codes = 24 series, cheap forever. Add user_id as a label and you get one series per user — millions of series, each with its own memory footprint in the TSDB index. This is a cardinality explosion, and it OOMs Prometheus. The rule of thumb: label values must be bounded and low-cardinality (region, endpoint, status class), never unbounded IDs.
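The arithmetic is worth seeing once. The sketch below computes the worst-case number of time series as the product of distinct values per label; real series counts are often lower because not every combination occurs. The figure of 2,000,000 users is an assumed number for illustration.

```python
from math import prod

def max_series(label_values: dict) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_values.values())

bounded = {"method": 4, "status": 6}
with_user_id = {**bounded, "user_id": 2_000_000}  # hypothetical user count

bounded_series = max_series(bounded)         # 4 * 6
exploded_series = max_series(with_user_id)   # 4 * 6 * 2,000,000
```

Traffic volume never appears in the formula: a million requests against the bounded label set still touch at most 24 series.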

That constraint is exactly why the pillars are separate systems rather than one:

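| Signal  | Cost driver                     | High-cardinality data?                    | Typical retention |
|---------|---------------------------------|-------------------------------------------|-------------------|
| Metrics | # of time series (label combos) | No: keep IDs OUT of labels                | weeks–months      |
| Traces  | volume × sampling rate          | Yes: trace_id, user_id fine as attributes | days              |
| Logs    | bytes ingested + indexed        | Yes: full detail is the point             | days–weeks        |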

High-cardinality identifiers (user, order, request ID) belong in logs and trace attributes, never in metric labels. The senior move is to keep the alert-driving metric skinny and low-cardinality, and let the trace_id stitch you across into the high-cardinality world only when you need detail.

Sampling and the exemplar bridge

Traces are sampled precisely because keeping every span at 100k RPS is unaffordable. Two strategies trade off differently. Head-based sampling makes the keep/drop decision at the very first span, before anyone knows how the request turns out. It's cheap (one coin-flip, no buffering) and works with any collector, but it discards evidence blindly: at a 1% sampling rate, a rare slow or erroring request is kept only 1% of the time, exactly like a boring one. Tail-based sampling instead buffers the whole trace until it completes and only then decides, which means it can keep 100% of errors and slow requests while still dropping most fast, healthy ones. The price is a collector holding every in-flight trace in memory until it finishes, which costs more infrastructure and adds a few seconds of latency before a trace is exportable. In short: head-based is cheap but statistically blind; tail-based is evidence-aware but operationally heavier.
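The difference between the two strategies fits in a few lines. This is a sketch under assumptions, not any collector's real policy engine: head_sample derives a deterministic decision from the trace ID (so every service agrees without coordination), and tail_sample is a hypothetical policy that inspects a completed trace. The span dictionaries, field names, and the 500 ms threshold are illustrative.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at the first span, using only the trace ID.

    Maps the low 64 bits of the hex trace ID onto [0, 1] and keeps the
    trace if the value falls below the sampling rate.
    """
    return int(trace_id[-16:], 16) / 2**64 < rate

def tail_sample(spans: list, slow_ms: float, baseline_rate: float) -> bool:
    """Decide after the whole trace has been buffered."""
    has_error = any(s.get("error") for s in spans)
    duration_ms = max(s["duration_ms"] for s in spans)  # root span is the longest
    if has_error or duration_ms > slow_ms:
        return True  # keep every erroring or slow trace
    # Healthy traces fall back to the cheap probabilistic decision.
    return head_sample(spans[0]["trace_id"], baseline_rate)

unlucky_id = "a" * 16 + "f" * 16          # low bits map to ~1.0, so head sampling drops it
lucky_id = "a" * 16 + "0" * 15 + "1"      # low bits map to ~0.0, so head sampling keeps it

slow_trace = [
    {"trace_id": unlucky_id, "name": "POST /checkout", "duration_ms": 780},
    {"trace_id": unlucky_id, "name": "db.query", "duration_ms": 720},
]
fast_trace = [{"trace_id": unlucky_id, "name": "POST /checkout", "duration_ms": 45}]

head_keeps_slow = head_sample(unlucky_id, 0.01)
tail_keeps_slow = tail_sample(slow_trace, slow_ms=500, baseline_rate=0.01)
tail_keeps_fast = tail_sample(fast_trace, slow_ms=500, baseline_rate=0.01)
```

The 780 ms trace is invisible to the head sampler because its fate was fixed by its ID before the slow query ran, while the tail sampler keeps it because it can see the outcome. The buffering cost shows up as the spans list that must exist in full before tail_sample can be called.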

Exemplars are the bridge from a metric back to a concrete trace: when the histogram records an observation into a latency bucket, it optionally attaches a sample trace_id for one request that landed in that bucket. Practically, that means clicking the p99 spike on a Grafana/Prometheus dashboard can jump you straight to one representative slow trace, skipping the manual step of finding a matching request by timestamp. The caveat: an exemplar is a single sample of whatever request happened to be observed when that bucket last updated — it is not guaranteed to be the worst offender, or even representative, of everything in that bucket, so treat it as a fast lead to pull on, not proof of the dominant failure mode; corroborate with a real trace search (e.g. latency > 500ms) when the exemplar doesn't match the pattern you expect.
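The caveat is easier to see in code. Below is a toy histogram that stores one exemplar per bucket with a last-writer-wins policy; that policy, the class name ExemplarHistogram, and the bucket bounds are assumptions for illustration, and real client libraries may choose exemplars differently. Counts here are per bucket rather than cumulative, to keep the sketch short.

```python
import bisect

class ExemplarHistogram:
    def __init__(self, bounds):
        self.bounds = sorted(bounds)                     # finite upper bounds, seconds
        self.counts = [0] * (len(self.bounds) + 1)       # last slot is the +Inf bucket
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, seconds, trace_id=None):
        # Index of the first bucket whose upper bound is >= the observed value.
        i = bisect.bisect_left(self.bounds, seconds)
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i] = (trace_id, seconds)      # last writer wins

h = ExemplarHistogram([0.05, 0.1, 0.5, 1.0])
h.observe(0.78, trace_id="abc123")   # the genuinely bad request
h.observe(0.52, trace_id="def456")   # a milder one, observed later
slow_bucket = h.bounds.index(1.0)
```

Both observations land in the bucket with upper bound 1.0 s, but the stored exemplar now points at def456 (0.52 s), not at the 0.78 s request. The exemplar is a lead, not the worst offender.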

Pitfalls a working engineer actually hits

Trade-offs & when to reach for which

The pillars are complementary, not competing — but each has a named alternative you might over-reach for, and the discipline is not to:

Metrics vs. wide structured events (the Honeycomb critique)

A named alternative to the three-pillars model itself is "observability 2.0" / wide events (Honeycomb, Charity Majors): instead of three siloed systems, emit one very-wide, high-cardinality structured event per request and derive metrics, traces, and logs from it at query time. The trade-off: wide events give you arbitrary high-cardinality slicing ("p99 latency for users on iOS 17 in eu-west hitting endpoint X") that pre-aggregated metrics fundamentally cannot — you must pick metric labels before the incident, and you can only ever ask questions about the dimensions you chose. The cost is a columnar store that can scan billions of events, and abandoning the cheap, months-long retention of a TSDB. For most teams the three pillars remain the pragmatic default; wide events win when your debugging questions are unpredictable and high-dimensional.
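A small sketch of the wide-event idea: one dictionary per request carrying every dimension, with the quantile computed at query time over whatever slice is asked for. The events, field names, and the quantile helper are invented for illustration; a real system would run this over a columnar store rather than a Python list.

```python
import math

# One wide event per request (hypothetical fields and values).
events = [
    {"endpoint": "/checkout", "os": "iOS 17", "region": "eu-west", "duration_ms": 780},
    {"endpoint": "/checkout", "os": "iOS 17", "region": "eu-west", "duration_ms": 45},
    {"endpoint": "/checkout", "os": "Android 14", "region": "eu-west", "duration_ms": 40},
    {"endpoint": "/cart", "os": "iOS 17", "region": "us-east", "duration_ms": 30},
]

def quantile(events, q, **filters):
    """Nearest-rank quantile of duration_ms over events matching all filters."""
    durations = sorted(
        e["duration_ms"] for e in events
        if all(e.get(k) == v for k, v in filters.items())
    )
    if not durations:
        return None
    rank = math.ceil(q * len(durations))  # nearest-rank method, 1-indexed
    return durations[rank - 1]

# A slice nobody had to declare as a metric label in advance.
ios_eu_checkout_p99 = quantile(events, 0.99, os="iOS 17", region="eu-west",
                               endpoint="/checkout")
android_p99 = quantile(events, 0.99, os="Android 14")
```

The flexibility comes from scanning raw events on every query, which is exactly the cost the paragraph above describes.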

Takeaways

Recall

You have a counter payments_total and want to break it down by customer to find which customer is erroring. Why is adding customer_id as a metric label the wrong move, and what should you do instead?

Answer: it causes a cardinality explosion — one time series per customer will exhaust the TSDB's index memory. Instead keep the metric low-cardinality (label by status/region only), and use logs or trace attributes — which are built for high-cardinality data — filtered by trace_id or customer_id to get the per-customer detail.


Sources: Google SRE Book & SRE Workbook (monitoring, the four golden signals, SLO/burn-rate alerting); Charity Majors et al., "Observability Engineering" (O'Reilly) and the wide-events / cardinality argument; Cindy Sridharan, "Distributed Systems Observability"; the W3C Trace Context spec and OpenTelemetry data model (metrics/logs/traces + exemplars); Prometheus documentation (histograms, cardinality, recording rules); Jaeger/Zipkin tracing model; Dean & Barroso, "The Tail at Scale" (why tails, not means, matter). Re-authored/Deepened for this guide.
