
Distributed Tracing & Exemplars

Distributed tracing works by stamping every request with a single trace ID at the edge, then having each service create a span — a timestamped record of one unit of work — that carries the trace ID plus its own span ID and its parent's span ID; the collector later reassembles these spans, linked by parent pointers, into one causal tree that reconstructs the request's entire path across process and network boundaries.

The problem it solves is specific to microservices. A single user action fans out into dozens of RPCs across services owned by different teams, on different hosts. When p99 latency doubles, per-service dashboards each look fine — the pain is smeared across the call graph and no single log line sees the whole request. Metrics tell you that something is slow; logs tell you what one process did; only a trace tells you where in the distributed call path the time went and which downstream dependency was on the critical path.

The trace / span model

A trace is a tree of spans sharing one 16-byte trace ID — within a single trace the parent-id relation is strictly one-parent-per-span, so the structure is always a tree, never a general graph. Each span records: an 8-byte span ID, the parent span ID (empty for the root), a name (GET /checkout, db.query), start and end wall-clock timestamps, a status (OK / ERROR), and a bag of key–value attributes (http.status_code=500, db.system=postgres). Spans may also carry events (timestamped logs) and links — references to spans in other traces (e.g. a batch job triggered by many separate requests). Links are where OpenTelemetry's DAG language actually applies: they let a span point across trace boundaries, so the graph of all traces connected by links can be a DAG, even though each individual trace's parent/child skeleton stays a tree.

Because every span holds its parent ID, the collector rebuilds that tree with no clock coordination between hosts — parentage, not timestamps, defines structure; this reconstruction works precisely because it's a tree, one parent per span, not a graph needing extra bookkeeping. Timestamps only position spans on the timeline, which is why clock skew between hosts can make a child span appear to start slightly before its parent; good UIs clamp this rather than trusting raw wall-clocks.
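The reassembly step can be sketched in a few lines. This is a minimal illustration, not any collector's actual implementation: the Span fields, the short IDs, and the build_tree and render helpers are hypothetical names chosen for the example. Note that timestamps are used only to order siblings, so a child with a skewed clock still lands under the right parent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int  # wall-clock start; used only to order siblings


def build_tree(spans):
    """Group spans by parent ID. Returns (root, children_by_parent_id)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    for kids in children.values():
        kids.sort(key=lambda s: s.start_ms)
    return root, children


def render(span, children, depth=0):
    """Return the tree as indented lines, depth-first."""
    lines = ["  " * depth + span.name]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive at the collector in arbitrary order.
spans = [
    Span("c2", "a1", "payment.charge", 105),
    Span("e4", "c2", "psp.authorize", 98),  # skewed clock: "starts" before its parent
    Span("a1", None, "GET /checkout", 100),
    Span("d3", "a1", "inventory.reserve", 103),
]
root, children = build_tree(spans)
tree_lines = render(root, children)
print("\n".join(tree_lines))
```

Even though psp.authorize carries a start time earlier than every other span, it is still placed under payment.charge, because only the parent pointer decides where it goes.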

Propagating context across services

The tree only forms if the trace ID survives every hop. This is context propagation: the caller serializes its active span's identity into the outgoing request and the callee deserializes it to set its parent. The vendor-neutral wire format is the W3C Trace Context header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

version: 00
trace-id: 4bf92f3577b34da6a3ce929d0e0e4736 (16 bytes, 32 hex; stable for the whole trace)
parent-id: 00f067aa0ba902b7 (8 bytes, 16 hex; this hop's span-id, changes every hop)
trace-flags: 01 (sampled)

A second header, tracestate, carries vendor-specific key–values so multiple systems coexist without clobbering each other. The literal example from the W3C spec: tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE — each entry is vendor-key=vendor-value, comma-separated, and a vendor only ever writes its own entry, leaving the others untouched.
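Splitting the traceparent header into its four fields might look like the sketch below. It is a simplified parser for version 00 only, and the TraceContext type and parse_traceparent function are hypothetical names for this example. It rejects the cases the format treats as invalid (version ff, all-zero IDs) and ignores tracestate entirely.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The sampled flag is the lowest bit of trace-flags.
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

Returning None on a malformed header lets the caller fall back to starting a fresh trace instead of propagating garbage downstream.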

Traced hop-by-hop

Gateway receives an untraced request → generates trace-id 4bf9…4736 and root span-id a1…b7, makes the sampling decision (flag 01).
Gateway calls payment over HTTP, injecting traceparent: 00-4bf9…4736-a1…b7-01.
payment extracts the header, sees a valid parent, starts a child span with parent-id = a1…b7 and a fresh span-id c2…d9, keeping the same trace-id and flag.
payment calls the PSP, injecting ...-c2…d9-01. And so on. The trace-id is invariant; only parent-id changes per hop.

The critical failure mode: the 01 flag must propagate too. If a service re-samples independently instead of honoring the inbound flag, you get broken traces — half the spans sampled in, half dropped, an incomplete tree.
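The hop-by-hop rules above can be sketched as one function per service. This is an illustration under stated assumptions, not an SDK: handle_request and format_traceparent are hypothetical names, the inbound header is assumed to be already validated, and the sampling rate is a made-up parameter. The point is that the sampling decision is made once at the root and merely copied afterwards.

```python
import random
import secrets
from typing import Optional


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def handle_request(inbound: Optional[str], sample_rate: float = 0.01) -> str:
    """Return the traceparent this service injects into its outgoing calls."""
    if inbound is None:
        # Root of the trace: mint a trace ID and make the sampling decision once.
        trace_id = secrets.token_hex(16)
        sampled = random.random() < sample_rate
    else:
        # Downstream hop: keep the trace ID and honor the inbound flag.
        # Re-sampling here is what produces broken traces.
        _version, trace_id, _parent_id, flags = inbound.split("-")
        sampled = bool(int(flags, 16) & 0x01)
    span_id = secrets.token_hex(8)  # fresh span ID for this hop
    return format_traceparent(trace_id, span_id, sampled)


gateway_out = handle_request(None, sample_rate=1.0)  # force "sampled" for the demo
payment_out = handle_request(gateway_out)
psp_out = handle_request(payment_out)
for header in (gateway_out, payment_out, psp_out):
    print(header)
```

Across the three printed headers the second field (trace-id) and the last field (flags) stay fixed, while the third field changes at every hop.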

Sampling: head vs tail

Tracing every request at scale is ruinous — a service doing 100k RPS emitting a 1 KB span per hop across 20 hops produces ~2 GB/s of trace data. Sampling keeps a representative subset. The decision where it's made defines the two families:
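The volume estimate is plain multiplication, spelled out here with decimal units (1 KB = 1,000 bytes). The 1% keep rate at the end is an illustrative assumption, included only to show how directly sampling scales the number down.

```python
rps = 100_000        # requests per second
hops = 20            # spans per request, one per hop
span_bytes = 1_000   # 1 KB per span, decimal units

full_bytes_per_s = rps * hops * span_bytes
full_tb_per_day = full_bytes_per_s * 86_400 / 1e12

keep_percent = 1     # assumed sampling rate, for illustration only
sampled_bytes_per_s = full_bytes_per_s * keep_percent // 100

print(f"unsampled: {full_bytes_per_s / 1e9:.1f} GB/s, {full_tb_per_day:.1f} TB/day")
print(f"at {keep_percent}% sampling: {sampled_bytes_per_s / 1e6:.0f} MB/s")
```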

Head-based sampling decides at the root, before the request runs, typically probabilistically ("keep 1%"). The decision rides the sampled flag downstream so the whole trace is coherently kept or dropped. Cheap, stateless, no buffering — but it's blind: it commits before knowing whether the request errored or was slow, so 99% of your rare 500s and p99 outliers are thrown away.
Tail-based sampling exports all spans to a collector that buffers them, groups by trace ID, waits for the trace to complete (a decision window, e.g. 10s), then applies policies: keep if any span errored, or latency > 1s, else keep 0.1%. You capture every interesting trace — but the collector must hold all in-flight spans in memory, is stateful, and needs all spans of one trace routed to the same collector instance (consistent hashing on trace ID).
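A tail sampler's core loop is small enough to sketch. What follows is a simplified model, not the OpenTelemetry Collector's implementation: the Span and TailSampler classes are hypothetical, the decision window is left to the caller (who invokes decide once the window has expired), trace latency is approximated by the longest span, and collector_for uses plain modulo where a real deployment would use consistent hashing.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


class TailSampler:
    def __init__(self, latency_ms=1000.0, baseline=0.001, rng=random.random):
        self.latency_ms = latency_ms
        self.baseline = baseline          # keep rate for uninteresting traces
        self.rng = rng
        self.buffer = defaultdict(list)   # trace_id -> spans seen so far

    def add(self, span):
        """Buffer a span until its trace's decision window expires."""
        self.buffer[span.trace_id].append(span)

    def decide(self, trace_id):
        """Return the spans to export for this trace (empty list = dropped)."""
        spans = self.buffer.pop(trace_id, [])
        if any(s.error for s in spans):
            return spans                  # policy 1: keep every errored trace
        if any(s.duration_ms > self.latency_ms for s in spans):
            return spans                  # policy 2: keep every slow trace
        return spans if self.rng() < self.baseline else []  # policy 3: baseline


def collector_for(trace_id: str, n_collectors: int) -> int:
    """Route all spans of a trace to one instance (modulo stand-in for consistent hashing)."""
    return int(trace_id, 16) % n_collectors


sampler = TailSampler(rng=lambda: 1.0)  # rng pinned so the baseline never keeps
sampler.add(Span("aa01", "GET /checkout", 120.0))
sampler.add(Span("aa01", "db.query", 15.0, error=True))
sampler.add(Span("bb02", "GET /checkout", 1500.0))
sampler.add(Span("cc03", "GET /checkout", 20.0))

kept_error = sampler.decide("aa01")
kept_slow = sampler.decide("bb02")
kept_fast = sampler.decide("cc03")
print(len(kept_error), len(kept_slow), len(kept_fast))
```

The buffer dictionary is the cost the text describes: every in-flight span sits in memory until decide is called for its trace.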

The practical pattern is hybrid: modest head sampling to cap ingest, plus tail sampling in the collector to guarantee errors and slow traces are never dropped.

Exemplars: from a metric spike to the exact trace

Metrics and traces are sampled independently, so when a Grafana graph shows p99 latency spiking, you historically had no way to jump from that aggregate to a concrete slow request — you'd guess trace IDs by time-window. An exemplar fixes this by attaching a trace ID to a specific metric sample at record time. In the OpenMetrics text format an exemplar is appended after a # on a histogram bucket line:

http_request_duration_seconds_bucket{le="2.5"} 84102 # {trace_id="4bf9…4736"} 2.31 1609459200.123
#                                       bucket count      exemplar: this trace   value  timestamp

Read that as: "among the requests that landed in the ≤2.5s bucket, here is one real trace, 4bf9…4736, that took 2.31s." The metrics client records an exemplar when an observation is made while a span is active, capturing the trace ID from that span context; which observations get one (every traced request, only the latest per bucket, only outliers) depends on the client library. In the UI the p99 line renders little diamonds; click one and you land in the trace view for that exact request. Exemplars bridge two of the three pillars, from the metric (cheap, always-on aggregate) to the trace (expensive, detailed, sampled), turning "p99 is up" into "here is the span tree of a request that was slow."
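To make the mechanics concrete, here is a toy histogram that keeps one exemplar per bucket and renders the bucket lines in the shape shown above. It is a sketch, not a real metrics client: the Histogram and Exemplar classes are hypothetical, the trace ID is passed in explicitly rather than read from an active span context, and "latest observation wins" is just one possible replacement policy.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Exemplar:
    trace_id: str
    value: float
    timestamp: float


class Histogram:
    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)       # last slot is +Inf
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, value, trace_id=None, timestamp=0.0):
        # Index of the first upper bound >= value, i.e. the bucket it lands in.
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i] = Exemplar(trace_id, value, timestamp)  # latest wins

    def expose(self):
        """Render cumulative bucket lines, appending an exemplar where one exists."""
        lines, cumulative = [], 0
        labels = [str(b) for b in self.bounds] + ["+Inf"]
        for i, le in enumerate(labels):
            cumulative += self.counts[i]
            line = f'{self.name}_bucket{{le="{le}"}} {cumulative}'
            ex = self.exemplars[i]
            if ex is not None:
                line += f' # {{trace_id="{ex.trace_id}"}} {ex.value} {ex.timestamp}'
            lines.append(line)
        return lines


h = Histogram("http_request_duration_seconds", [0.5, 1.0, 2.5])
h.observe(0.12)
h.observe(0.30)
h.observe(2.31, trace_id="4bf92f3577b34da6a3ce929d0e0e4736", timestamp=1609459200.123)
lines = h.expose()
print("\n".join(lines))
```

Only the bucket that received a traced observation carries an exemplar; the other lines stay plain counters, which is why exemplars add little overhead to the metric itself.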

Crucially, an exemplar can defeat the sampling mismatch: because the exemplar pins a real trace ID that made it into the histogram, a tail-sampling pipeline that is told to always keep exemplar'd traces ensures the trace you click still exists. Without such a rule, an exemplar can point at a trace that sampling later dropped.

OpenTelemetry: the vendor-neutral plumbing

OpenTelemetry (OTel) is the CNCF standard that decouples instrumentation from your backend. Its pieces: the API (what your code calls to start spans; stable, and a no-op if no SDK is installed), the SDK (the implementation: samplers, span processors, batching), the OTLP wire protocol (over gRPC or HTTP), and the Collector, a standalone process with a receive → process → export pipeline where tail sampling, batching, and redaction live. You instrument once against the OTel API; swapping Jaeger for Tempo or for a commercial vendor is a Collector config change, not a code change. OTel also unifies traces, metrics, and logs under one context, which is precisely what makes exemplars possible: the metrics SDK reads the active span's trace ID from the same context object.

Pitfalls a working engineer hits

Broken propagation on async boundaries. Context is stored per-thread (or in an async-local). Hand work to a thread pool, a message queue, or a callback without explicitly capturing and restoring context and the child span silently re-parents to nothing — orphaned spans, split traces.
Missing spans across a queue. Kafka/SQS hops break the in-band header chain. You must inject traceparent into message headers and use span links (not parent-child) on the consumer, since one consumer batch may drain many traces.
Cardinality explosion. Putting a user ID or full URL with IDs into span attributes is fine (traces are per-request), but copying those into metric labels detonates your TSDB. Keep high-cardinality data on spans, low-cardinality on metrics.
Head-sampling starving errors. A 1% head sampler plus a 0.5% error rate means only ~1 in 20,000 requests produces a retained error trace (0.01 × 0.005 = 0.00005); 99 of every 100 errors leave no trace at all. If you can't explain outages, you're sampling at the wrong end.
Cost creep. Traces are the most expensive pillar per byte. Unbounded span attributes and 100% retention will dwarf your metrics bill.
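The async-boundary pitfall can be reproduced with the standard library alone. In this sketch a contextvars.ContextVar stands in for the tracer's "active span" storage; current_trace_id, do_work, and run_in_pool are hypothetical names. On standard CPython builds a pool worker typically does not see the submitting thread's context, so the naive submit usually reads the default value, while the explicit capture-and-restore version reads the caller's trace ID.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the tracer's per-task "active span" slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def do_work():
    """Pretend to start a child span: report which trace this work belongs to."""
    return current_trace_id.get()


def run_in_pool(pool, fn):
    ctx = contextvars.copy_context()  # capture the caller's context at submit time
    return pool.submit(ctx.run, fn)   # restore it inside the worker thread


current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")

with ThreadPoolExecutor(max_workers=1) as pool:
    naive = pool.submit(do_work).result()             # context not carried over
    propagated = run_in_pool(pool, do_work).result()  # context carried explicitly

print("naive:", naive)
print("propagated:", propagated)
```

The same capture-then-restore shape applies to queues: there the "captured context" is the traceparent written into the message headers, and the restore happens on the consumer.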

Trade-offs & when to use vs alternatives

Vs. structured logs with a correlation ID (the common alternative): injecting a request ID into every log line and grepping across services also reconstructs a request — and it's simpler, needs no new backend, and captures full detail. But logs give you a flat, unordered list with no timing structure and no parent-child causality; you can't see that inventory ran parallel to payment, or read the critical path off a timeline. Correlated logs answer "what happened in this request?"; tracing answers "where did the time go and what was on the critical path?" Reach for tracing when latency is the problem and the call graph fans out; logs+correlation-ID suffice for low-fan-out services or pure error-context debugging.

Vs. eBPF-based auto-instrumentation: tools that attach eBPF probes to kernel/syscall and library entry points (e.g. Pixie, Odigos, Grafana Beyla) capture spans for HTTP/gRPC/SQL calls with zero code changes and no redeploy — a real win for legacy or unowned services you can't touch. The cost: you get generic, protocol-level spans with no business-meaningful attributes (user_id, cart_total) or custom span names, and inferring context propagation across in-process async boundaries (thread pools, coroutines) from outside the process is harder and less reliable than explicit code-level propagation. Manual SDK instrumentation costs code changes but gives precise, semantically rich spans and full control over sampling. In practice: eBPF for fast baseline coverage everywhere, manual instrumentation on the services that carry your critical business logic.

Vs. metrics alone: metrics are orders of magnitude cheaper and always-on — keep them as the primary SLO signal. Tracing is the sampled, detailed drill-down you jump into from a metric anomaly. The exemplar is exactly the seam that lets the cheap always-on layer hand off to the expensive detailed one. Use all three; don't try to make one do another's job.

Takeaways

A trace is a tree of spans linked by parent IDs, not timestamps — that's why it survives clock skew and reassembles across hosts. (The DAG language in OTel refers to cross-trace links, not to this parent/child skeleton.)
The trace only forms if context propagates on every hop (W3C traceparent), including the sampled flag and across async/queue boundaries.
Head sampling is cheap but blind; tail sampling catches every error/slow trace at the cost of stateful buffering — production uses a hybrid.
Exemplars pin a real trace ID onto a metric sample, turning "p99 is up" into a one-click jump to the span tree that explains why.

Recall

You run 1% head-based sampling. A customer reports intermittent 2-second checkouts that occur on ~0.3% of requests, and none of the sampled traces show the slowness. What two changes make the slow trace reliably appear in your trace UI, and why does each work?

Answer: (1) Add tail-based sampling in the Collector with a latency policy (keep if duration > 1s). This works because the decision is made after the trace completes, so it can look at the actual outcome — it doesn't need to gamble at the root, it just checks "was this one of the slow ones?" and keeps it deterministically. (2) Wire an exemplar from the checkout latency histogram and configure tail sampling to always keep exemplar'd trace IDs. This works because the exemplar pins the exact trace ID of a request that landed in the slow bucket at record time, so even if your sampling policies are still tuned imperfectly, that specific trace ID is guaranteed to be retained and clickable from the p99 graph. Raising the head sampling rate alone would not reliably fix this: head sampling decides blind, before the slow downstream call has even happened, so a 2s checkout is no more likely to be kept than a 20ms one — only a decision made after the request finishes (tail-based, or exemplar-pinned) can specifically target "duration > 1s."
