
Metrics, Logs & Traces — the Three Pillars

The mechanism: three ways to record what a system did

Every observability signal is the same act: a running process hands a fact about itself to an emitter, and a backend later stores and queries it. The three pillars differ in what they throw away at write time, and that discarding is the whole story. A metric is a number pre-aggregated into fixed-width time buckets, so the process keeps only a running counter or histogram and forgets which individual request touched it. A log is a discrete, timestamped event kept in full detail, so nothing is discarded but every event costs a full write. A trace is a set of causally linked spans stitched by a propagated trace ID, so the process keeps the shape of one request as it fans across services but samples away most requests to survive the volume.

Understanding the pillars means understanding those three write-time bargains: metrics trade detail for cheap aggregation, logs trade cost for total detail, traces trade completeness (via sampling) for cross-service causality. Pick the wrong pillar for a question and you either can't answer it or you bankrupt your telemetry budget answering it.
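The three bargains can be made concrete with a small standard-library sketch. Everything here (the function names, the field names, the span layout) is a hypothetical illustration rather than any real client library's API; the point is only what each signal keeps and what it forgets at write time.

```python
import json
import time
import uuid
from collections import Counter
from typing import Optional

# Metric: only a running count per bounded label set survives the write.
request_counts = Counter()

def record_metric(method: str, status: int) -> None:
    # The identity of the individual request is discarded right here.
    request_counts[(method, status)] += 1

# Log: the whole event is serialized, so nothing is lost but every event costs a write.
def record_log(message: str, **fields) -> str:
    return json.dumps({"ts": time.time(), "msg": message, **fields})

# Trace: a span keeps timing plus a parent link, all under one shared trace_id.
def make_span(trace_id: str, name: str, parent_id: Optional[str], duration_ms: float) -> dict:
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
        "duration_ms": duration_ms,
    }

trace_id = uuid.uuid4().hex
for _user in ("alice", "bob", "carol"):
    record_metric("POST", 200)  # three requests collapse into one number

log_line = record_log("checkout ok", user_id="alice", trace_id=trace_id)
root = make_span(trace_id, "POST /checkout", None, 45.0)
child = make_span(trace_id, "payment-svc", root["span_id"], 30.0)
```

After this runs, the metric holds a single count of 3 with no memory of who made the requests, the log line still carries user_id, and the two spans are joined by trace_id and parent_id.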

Why it matters: three different questions in an incident

At 02:14 a pager fires. Over the next ten minutes you ask three structurally different questions, and each maps to exactly one pillar:

"Is something wrong, and how bad?" — a trend over all traffic. Only metrics answer this cheaply: p99 latency jumped from 40 ms to 900 ms, error rate 0.1% → 12%. This is what alerts fire on because you can evaluate it every 15 s over millions of requests for pennies.
"Which service in the chain is slow?" — the request path. Only a trace shows that the 900 ms lives in payment-svc → fraud-check, not in the API gateway, because it captures parent/child timing across the RPC boundary.
"What exactly happened to this request?" — a discrete event with full context. Only a log has the stack trace, the SQL text, the customer ID, the actual exception message.

A metric tells you the building is on fire; a trace tells you which floor; a log tells you it was the deep-fryer in the third-floor kitchen. Trying to substitute one for another is the classic failure: alerting on log volume (expensive and laggy), or debugging a single request from a dashboard (impossible — the number was averaged away).

A traced incident, step by step

Concrete numbers make the hand-off between pillars obvious. A checkout service normally serves p99 = 45 ms. A bad deploy ships at 02:10.

t=02:14 — metric fires. http_request_duration_seconds is conventionally instrumented as a Prometheus Histogram (not a Counter) — it exposes cumulative per-bucket counters (_bucket), plus _sum and _count. A recording rule computes histogram_quantile(0.99, …) over those buckets across a 5-minute window and ~2M requests. p99 crosses 500 ms for 3 consecutive evaluations → alert. Note what the metric can and can't say: it knows the rate and the quantile, but not which of the 2M requests were slow — that detail was summed into buckets at write time.
t=02:15 — pivot to a trace. You open a trace tagged latency > 500ms. The waterfall shows the parent span POST /checkout = 780 ms, with children: auth 8 ms, cart-svc 12 ms, payment-svc 740 ms. Inside payment-svc, the child span db.query = 720 ms. The trace localized the fault to one hop and one operation — something a metric averaged over all hops could never do.
t=02:17 — pivot to logs. Every span carries the trace_id; you query the log store for trace_id=abc123 AND service=payment-svc. The matching line: WARN slow query: SELECT … FROM orders WHERE user_id=? — full table scan, 1.2M rows, missing index idx_orders_user after migration 0047. The log gives you the exact SQL, the row count, and the migration — the root cause.
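The 02:14 step relies on estimating a quantile from cumulative buckets. A simplified sketch of the linear-interpolation idea follows; it is not Prometheus' actual implementation, which handles more edge cases, and the bucket bounds and counts are made-up numbers chosen to sum to the 2M requests in the example.

```python
import math

def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    with the last bound equal to +Inf. Assumes the lowest bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts over a 5-minute window (about 2M requests).
window = [
    (0.05, 1_900_000),
    (0.1, 1_950_000),
    (0.5, 1_975_000),
    (1.0, 1_998_000),
    (math.inf, 2_000_000),
]
p99 = quantile_from_buckets(0.99, window)
alert = p99 > 0.5
```

With these counts the estimate lands between 0.5 s and 1.0 s, so the 500 ms threshold is crossed. Note that the function only ever sees bucket totals: nothing in its input could identify which requests were slow.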

The correlation key is what makes this flow work: a single trace_id propagated in request headers (W3C traceparent) and stamped onto both spans and log lines. Without it you'd be eyeballing timestamps across five services. This is why mature stacks inject trace_id into every structured log — it turns three separate haystacks into one joinable dataset.
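As a rough sketch of that stamping step, the code below parses a W3C traceparent header (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) and injects the trace ID into every log record through a logging filter. The helper names, the logger name, and the log format are assumptions made for illustration; a real service would normally let its tracing SDK handle propagation, and this parser skips some validation the spec requires.

```python
import contextvars
import io
import logging
import re

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

def parse_traceparent(header):
    """Return trace context from a traceparent header, or None if malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto each record so the formatter can print it."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

stream = io.StringIO()  # stands in for the real log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payment-svc")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
current_trace_id.set(ctx["trace_id"])
logger.warning("slow query on orders")
log_output = stream.getvalue()
```

Because the filter reads from a context variable, application code never has to pass the trace ID to each logging call; setting it once per request is enough for every log line to become joinable with its spans.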

The cost and cardinality trade-off — the constraint that shapes real stacks

The reason you can't just "log everything and compute metrics from logs" is cardinality. A metric's cost is not driven by traffic volume; it is driven by the number of unique label combinations (time series). A counter http_requests_total{method, status} with 4 methods × 6 status codes = 24 series, cheap forever. Add user_id as a label and that count is multiplied by the number of users: up to 24 series per user, so millions of series, each with its own memory footprint in the TSDB index. This is a cardinality explosion, and it can OOM Prometheus. The rule of thumb: label values must be bounded and low-cardinality (region, endpoint, status class), never unbounded IDs.
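The arithmetic is worth seeing once. This sketch computes the worst-case series count as the product of per-label cardinalities; the user count of 2,000,000 is a hypothetical figure, and the result is an upper bound because not every combination necessarily occurs in practice.

```python
from math import prod

def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())

bounded = {"method": 4, "status": 6}
with_user_id = {**bounded, "user_id": 2_000_000}  # hypothetical user count

bounded_series = max_series(bounded)
exploded_series = max_series(with_user_id)
```

The bounded metric stays at 24 series no matter how much traffic arrives, while the version with user_id can reach 48,000,000 series from the same traffic.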

That constraint is exactly why the pillars are separate systems rather than one:

Signal	Cost driver	High-cardinality data?	Typical retention
Metrics	# of time series (label combos)	No — keep IDs OUT of labels	weeks–months
Traces	volume × sampling rate	Yes — trace_id, user_id fine as attributes	days
Logs	bytes ingested + indexed	Yes — full detail is the point	days–weeks

High-cardinality identifiers (user, order, request ID) belong in logs and trace attributes, never in metric labels. The senior move is to keep the alert-driving metric skinny and low-cardinality, and let the trace_id stitch you across into the high-cardinality world only when you need detail.

Sampling and the exemplar bridge

Traces are sampled precisely because keeping every span at 100k RPS is unaffordable. Two strategies trade off differently. Head-based sampling makes the keep/drop decision at the very first span, before anyone knows how the request turns out. It is cheap (one coin flip, no buffering) and works with any collector, but it discards evidence blindly: at a 1% sample rate, a rare slow or erroring request is kept with the same 1% probability as a boring one, so about 99% of them are dropped. Tail-based sampling instead buffers the whole trace until it completes and only then decides, which means it can keep 100% of errors and slow requests while still dropping most fast, healthy ones. The price is a collector that holds every in-flight trace in memory until it finishes, which costs more infrastructure and adds a few seconds of latency before a trace is exportable. In short: head-based is cheap but statistically blind; tail-based is evidence-aware but operationally heavier.
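A toy simulation shows the difference in what survives. It assumes completed traces are already available as objects, which glosses over the buffering a real tail-sampling collector has to do; the Trace class, the 500 ms threshold, and the 1-in-100 failure pattern are all illustrative assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: int
    duration_ms: float
    error: bool

def head_keep(rng: random.Random, rate: float) -> bool:
    # Decided before the outcome is known, so the trace itself is not consulted.
    return rng.random() < rate

def tail_keep(trace: Trace, rng: random.Random, baseline_rate: float,
              slow_ms: float = 500.0) -> bool:
    # Decided after completion: always keep errors and slow traces.
    if trace.error or trace.duration_ms > slow_ms:
        return True
    return rng.random() < baseline_rate

rng = random.Random(7)  # fixed seed so runs are repeatable
traces = [
    Trace(i, 900.0 if i % 100 == 0 else 40.0, error=(i % 100 == 0))
    for i in range(10_000)
]

head_kept = [t for t in traces if head_keep(rng, 0.01)]
tail_kept = [t for t in traces if tail_keep(t, rng, 0.01)]

errors_total = sum(t.error for t in traces)
errors_in_head = sum(t.error for t in head_kept)
errors_in_tail = sum(t.error for t in tail_kept)
```

Tail sampling retains all 100 failing traces by construction, whereas head sampling retains each one only with 1% probability, so on most runs it keeps one or two of them, or none at all.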

Exemplars are the bridge from a metric back to a concrete trace: when the histogram records an observation into a latency bucket, it optionally attaches a sample trace_id for one request that landed in that bucket. Practically, that means clicking the p99 spike on a Grafana/Prometheus dashboard can jump you straight to one representative slow trace, skipping the manual step of finding a matching request by timestamp. The caveat: an exemplar is a single sample of whatever request happened to be observed when that bucket last updated — it is not guaranteed to be the worst offender, or even representative, of everything in that bucket, so treat it as a fast lead to pull on, not proof of the dominant failure mode; corroborate with a real trace search (e.g. latency > 500ms) when the exemplar doesn't match the pattern you expect.
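The caveat is easier to see in a minimal sketch of a histogram that stores one exemplar per bucket. The last-writer-wins policy used here is an assumption for illustration; real client libraries choose exemplars according to their own rules, and the class name and trace IDs are hypothetical.

```python
import bisect

class ExemplarHistogram:
    """Histogram with 'le' style buckets and one exemplar slot per bucket."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)        # per-bucket, not cumulative
        self.exemplars = [None] * len(self.bounds)  # (trace_id, value) or None

    def observe(self, value, trace_id=None):
        # Index of the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        if trace_id is not None:
            # Assumed policy: the most recent observation overwrites the slot.
            self.exemplars[idx] = (trace_id, value)

hist = ExemplarHistogram([0.05, 0.1, 0.5, 1.0])
hist.observe(0.04, trace_id="aaa111")
hist.observe(0.95, trace_id="def456")  # the slowest request in the 1.0 s bucket
hist.observe(0.78, trace_id="abc123")  # arrives later and replaces it

slow_bucket = hist.bounds.index(1.0)
exemplar = hist.exemplars[slow_bucket]
```

Two requests landed in the 0.5 s to 1.0 s bucket, but the stored exemplar points at the 0.78 s one simply because it arrived last. The 0.95 s request is counted but no longer reachable from the metric, which is why an exemplar is a lead rather than proof.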

Pitfalls a working engineer actually hits

Cardinality explosion in metric labels. Putting user_id, request_id, full URL paths (with IDs), or error messages in labels. It works in staging with 3 users and takes down the TSDB in prod. Normalize paths (/user/{id}) and drop unbounded labels.
Averages hide the tail. Alerting on mean latency is nearly useless: if 99% of requests take 50 ms and 1% take 5 s, the mean is only about 100 ms (0.99 × 50 + 0.01 × 5000 = 99.5 ms), which looks tolerable while 1 request in 100 is effectively broken. Alert on quantiles (p99/p95) or SLO burn rate. And beware: you cannot average pre-computed percentiles across instances; aggregate the histogram buckets, then compute the quantile.
Logging on the hot path. Synchronous, unbuffered logging (especially at DEBUG in prod) adds latency and I/O to every request and can itself cause the incident. Use async appenders and sample high-volume logs.
Sampling away the evidence. Head-based 1% sampling means the one error trace you need is 99% likely gone. Use tail-based sampling or always-sample-on-error so failures are never dropped.
Orphaned signals. Logs without trace_id and traces without correlated logs force manual timestamp archaeology across services. Inject the trace context into your logging MDC/context from day one.
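Clock skew across services. Unsynchronized host clocks make trace waterfalls show child spans that appear to start before their parent, or gaps that look negative. Rely on each span's own measured duration, not on subtracting wall-clock timestamps taken on different hosts.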
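Two of the pitfalls above, the mean hiding the tail and averaging per-instance percentiles, can be checked with a few lines of arithmetic. The sketch uses raw latencies and a nearest-rank percentile for simplicity; with Prometheus the equivalent move is to sum histogram buckets across instances before computing the quantile. The two-instance split and all numbers are hypothetical.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile; pct is a whole number such as 99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical fleet: a healthy instance and a degraded one with less traffic.
instance_a = [40] * 9_800   # 9,800 requests at 40 ms
instance_b = [5_000] * 200  # 200 requests at 5 s
combined = instance_a + instance_b

fleet_mean = mean(combined)
true_p99 = percentile(combined, 99)
averaged_p99 = mean([percentile(instance_a, 99), percentile(instance_b, 99)])
```

The fleet mean comes out near 139 ms even though 2% of requests take 5 s, and the averaged per-instance p99 (2,520 ms) does not match the true fleet p99 (5,000 ms), because the average ignores how much traffic each instance served.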

Trade-offs & when to reach for which

The pillars are complementary, not competing — but each has a named alternative you might over-reach for, and the discipline is not to:

Reach for metrics when the question is aggregate and continuous: SLOs, dashboards, alerts, capacity. Not for per-request debugging — the individual request was averaged away. Alternative misuse: computing metrics from log queries ("count errors in Loki") — works at low volume, but it's slower, laggier, and far more expensive than a purpose-built counter at scale.
Reach for traces when latency or errors cross service boundaries and you need to know where. Not as your primary detection signal (they're sampled — you'll miss trends) and not for questions answerable within one service by a log. Named alternative here is the sampling strategy choice itself: head-based (cheap, statistically blind to rare bad requests) vs. tail-based (evidence-aware, needs a buffering collector and adds export latency) — pick tail-based, or always-sample-on-error, whenever the cost of missing the one bad trace outweighs the extra collector infrastructure; stick with head-based when trace volume is enormous, requests are largely uniform, and you mainly need rough latency shape rather than guaranteed capture of outliers.
Reach for logs when you need the full, discrete detail of specific events: the exception, the query, the payload. Not as an alerting substrate (expensive, high-latency) and not for trend analysis (that's a metric).

Metrics vs. wide structured events (the Honeycomb critique)

A named alternative to the three-pillars model itself is "observability 2.0" / wide events (Honeycomb, Charity Majors): instead of three siloed systems, emit one very-wide, high-cardinality structured event per request and derive metrics, traces, and logs from it at query time. The trade-off: wide events give you arbitrary high-cardinality slicing ("p99 latency for users on iOS 17 in eu-west hitting endpoint X") that pre-aggregated metrics fundamentally cannot — you must pick metric labels before the incident, and you can only ever ask questions about the dimensions you chose. The cost is a columnar store that can scan billions of events, and abandoning the cheap, months-long retention of a TSDB. For most teams the three pillars remain the pragmatic default; wide events win when your debugging questions are unpredictable and high-dimensional.
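To illustrate the query-time idea, here is a minimal sketch in which each request is one wide dictionary and aggregates are derived only when a question is asked. The event fields, values, and helper functions are hypothetical, and a real system would run this kind of query in a columnar store rather than over an in-memory list.

```python
import math

# Hypothetical wide events: one record per request, unbounded fields included.
EVENTS = [
    {"endpoint": "/checkout", "os": "iOS 17", "region": "eu-west",
     "user_id": "u1", "status": 200, "duration_ms": 42},
    {"endpoint": "/checkout", "os": "iOS 17", "region": "eu-west",
     "user_id": "u2", "status": 500, "duration_ms": 910},
    {"endpoint": "/checkout", "os": "Android 14", "region": "eu-west",
     "user_id": "u3", "status": 200, "duration_ms": 38},
    {"endpoint": "/cart", "os": "iOS 17", "region": "us-east",
     "user_id": "u4", "status": 200, "duration_ms": 15},
    {"endpoint": "/checkout", "os": "iOS 17", "region": "eu-west",
     "user_id": "u5", "status": 200, "duration_ms": 55},
]

def select(events, **filters):
    """Keep events whose fields equal every given filter value."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]

def error_rate(events):
    if not events:
        return 0.0
    return sum(e["status"] >= 500 for e in events) / len(events)

def percentile_ms(events, pct):
    """Nearest-rank latency percentile; returns None for an empty selection."""
    if not events:
        return None
    ordered = sorted(e["duration_ms"] for e in events)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# The slice is chosen at query time; no label had to be declared in advance.
ios_eu_checkout = select(EVENTS, endpoint="/checkout", os="iOS 17", region="eu-west")
slice_error_rate = error_rate(ios_eu_checkout)
slice_p99 = percentile_ms(ios_eu_checkout, 99)
```

The same list could just as easily be sliced by user_id or any other field, which is the flexibility the wide-events argument rests on. The cost shows up in the fact that every query scans raw events instead of reading a pre-aggregated series.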

Takeaways

The pillars differ by what they discard at write time: metrics discard per-request identity for cheap aggregation, logs discard nothing (and pay for it), traces discard most requests (sampling) to keep cross-service causality.
Incident flow is metrics → traces → logs: detect the problem, localize the hop, explain the cause. The trace_id is the join key that makes the hand-off work.
Cardinality is the master constraint. Keep unbounded IDs out of metric labels; put them in logs and span attributes. This single rule prevents the most common self-inflicted outage.
Head-based sampling is cheap but statistically blind to rare bad requests; tail-based sampling captures them but costs a buffering collector and export latency — exemplars give a fast, non-authoritative lead from a metric to one representative trace.
Don't substitute pillars: alerting on logs is expensive, debugging one request from a metric is impossible, detecting trends from sampled traces misses reality.

Recall

You have a counter payments_total and want to break it down by customer to find which customer is erroring. Why is adding customer_id as a metric label the wrong move, and what should you do instead?

Answer: it causes a cardinality explosion — one time series per customer will exhaust the TSDB's index memory. Instead keep the metric low-cardinality (label by status/region only), and use logs or trace attributes — which are built for high-cardinality data — filtered by trace_id or customer_id to get the per-customer detail.

Sources: Google SRE Book & SRE Workbook (monitoring, the four golden signals, SLO/burn-rate alerting); Charity Majors et al., "Observability Engineering" (O'Reilly) and the wide-events / cardinality argument; Cindy Sridharan, "Distributed Systems Observability"; the W3C Trace Context spec and OpenTelemetry data model (metrics/logs/traces + exemplars); Prometheus documentation (histograms, cardinality, recording rules); Jaeger/Zipkin tracing model; Dean & Barroso, "The Tail at Scale" (why tails, not means, matter). Re-authored/Deepened for this guide.
