Distributed Tracing
Distributed tracing tracks a request as it flows through multiple services. In a microservice architecture where a single user action might touch dozens of services, tracing answers the critical question: where did the time go?
Trace Context
A trace represents the entire journey of a request. It is identified by a unique trace ID that propagates across every service the request touches.
Trace context components:
trace-id: 128-bit globally unique identifier for the entire request
span-id: 64-bit identifier for a single operation within the trace
parent-id: span-id of the caller (links spans into a tree)
flags: sampling decision and other metadata
W3C Trace Context header (standard):
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Format: version-trace_id-span_id-flags (the span_id field is the sender's span ID, which the W3C specification names parent-id)
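To make the header layout concrete, here is a minimal parsing sketch. It assumes the four-field layout shown above with lowercase hex, and the names `TraceParent` and `parse_traceparent` are illustrative, not part of any library.

```python
import re
from typing import NamedTuple, Optional

# Four dash-separated lowercase hex fields: 2, 32, 16, and 2 characters long.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str  # span ID of the caller
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header, returning None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are defined as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags byte carries the sampling decision.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)
```

Returning None instead of raising lets a service fall back to starting a fresh trace when the incoming header is missing or malformed.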
Context Propagation
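The trace context must travel with the request across every process boundary: HTTP calls, gRPC calls, and message queues. Database queries are usually recorded as client-side spans in the calling service, since most databases do not continue the trace themselves.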
Request flow with context propagation (IDs shortened for readability; real trace and span IDs are 32 and 16 hex characters):
Browser --> API Gateway (generates trace-id: abc-123)
  Header: traceparent: 00-abc123-span01-01
API Gateway --> Order Service
  Header: traceparent: 00-abc123-span02-01
Order Service --> Payment Service (HTTP call)
  Header: traceparent: 00-abc123-span03-01
Order Service --> Kafka (async message)
  Message header: traceparent: 00-abc123-span04-01
Kafka --> Notification Service (consumes message)
  Reads traceparent from message, continues trace abc-123
Without propagation, each service creates an isolated trace, and you lose the end-to-end picture. This is the most common tracing failure mode.
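The extract-then-inject cycle each service performs can be sketched as follows. This is a simplified illustration, not a real SDK: headers are plain dicts with lowercase keys, validation is minimal, and the function names are hypothetical.

```python
import secrets
from typing import Dict, Optional, Tuple


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)  # 16 hex characters


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, caller_span_id, flags) from incoming headers, or None."""
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, span_id, flags = parts
    return trace_id, span_id, flags


def start_server_span(incoming: Dict[str, str]) -> Tuple[str, str, Optional[str], str]:
    """Continue the caller's trace if a context arrived, else start a new trace.

    Returns (trace_id, span_id, parent_span_id, flags).
    """
    ctx = extract(incoming)
    if ctx is None:
        # No usable context: this service becomes the root of a new trace.
        return new_trace_id(), new_span_id(), None, "01"
    trace_id, caller_span_id, flags = ctx
    return trace_id, new_span_id(), caller_span_id, flags


def inject(trace_id: str, span_id: str, flags: str, outgoing: Dict[str, str]) -> None:
    """Write the current span's context into the headers of an outgoing call."""
    outgoing["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
```

A service that skips `inject` on an outgoing call forces the next service down the `ctx is None` branch, which is exactly how a trace fragments into isolated pieces.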
Spans
A span represents a single unit of work within a trace. Spans form a tree structure where each span has at most one parent and zero or more children.
Span Structure
Span fields:
traceId: "abc-123"
spanId: "span-05"
parentSpanId: "span-03"
operationName: "POST /api/payments"
serviceName: "payment-service"
startTime: 1710504727445 (epoch ms)
duration: 155ms
status: OK
attributes:
  http.method: "POST"
  http.status_code: 200
  payment.amount: 59.99
  payment.currency: "USD"
events:
  - timestamp: 1710504727500, message: "Stripe API call started"
  - timestamp: 1710504727600, message: "Stripe API call completed"
Span Tree Visualization
Trace: abc-123 (order placement)
[API Gateway] 0ms ---------------------------------> 280ms
  [Auth Service] 5ms -------> 25ms
  [Order Service] 30ms ----------------------------> 270ms
    [DB: insert order] 35ms ---> 45ms
    [Payment Service] 50ms --------------------> 230ms
      [Stripe API call] 55ms ----------------> 220ms
    [Inventory Service] 235ms -----> 255ms
      [DB: update stock] 238ms --> 250ms
    [Notification: enqueue] 258ms -> 265ms
From this view, the Stripe API call (55ms to 220ms, 165ms in total) is the clear bottleneck, accounting for more than half of the 280ms request. Without tracing, you would only know the overall request took 280ms.
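The same conclusion can be reached programmatically by computing each span's self time, meaning its duration minus the time spent in its direct children. The sketch below uses the timings from the trace above; the parent links are an assumption inferred from those timings, and it assumes sibling spans do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


SPANS = [
    Span("API Gateway", None, 0, 280),
    Span("Auth Service", "API Gateway", 5, 25),
    Span("Order Service", "API Gateway", 30, 270),
    Span("DB: insert order", "Order Service", 35, 45),
    Span("Payment Service", "Order Service", 50, 230),
    Span("Stripe API call", "Payment Service", 55, 220),
    Span("Inventory Service", "Order Service", 235, 255),
    Span("DB: update stock", "Inventory Service", 238, 250),
    Span("Notification: enqueue", "Order Service", 258, 265),
]


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Duration of each span minus the total duration of its direct children."""
    child_total = {s.name: 0 for s in spans}
    for s in spans:
        if s.parent is not None:
            child_total[s.parent] += s.duration_ms
    return {s.name: s.duration_ms - child_total[s.name] for s in spans}


def bottleneck(spans: List[Span]) -> str:
    """Name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)
```

Ranking by self time rather than raw duration keeps parent spans such as the API Gateway, which mostly wait on their children, from masking the span where the time is actually spent.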
Span Kinds
Span kinds:
CLIENT: outgoing request (calling another service)
SERVER: incoming request (handling a call from another service)
PRODUCER: creating a message for async processing
CONSUMER: processing a message from a queue
INTERNAL: in-process operation (no network boundary)
Jaeger, now a CNCF project, was originally built at Uber to trace requests across its microservice architecture. A single ride lifecycle generates traces spanning driver matching, pricing, routing, payment, and notification services.
Propagation Mechanisms
Trace context must cross different communication boundaries. Each requires a specific propagation approach.
HTTP Headers
W3C standard (recommended):
traceparent: 00-trace_id-span_id-flags
tracestate: vendor-specific key-value pairs
B3 format (Zipkin legacy):
X-B3-TraceId: trace_id
X-B3-SpanId: span_id
X-B3-ParentSpanId: parent_span_id
X-B3-Sampled: 1
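When some services still emit B3 headers, a gateway or shim may need to translate them. The sketch below converts the multi-header B3 form into a traceparent value. It assumes exact header-name casing and treats a missing X-B3-Sampled as "not sampled", which is a simplification; the function name is hypothetical.

```python
from typing import Dict, Optional

_HEX_DIGITS = set("0123456789abcdef")


def _is_hex(value: str, lengths: tuple) -> bool:
    return len(value) in lengths and set(value) <= _HEX_DIGITS


def b3_to_traceparent(headers: Dict[str, str]) -> Optional[str]:
    """Build a W3C traceparent value from B3 multi-header fields, or None."""
    trace_id = headers.get("X-B3-TraceId", "").lower()
    span_id = headers.get("X-B3-SpanId", "").lower()
    # B3 trace IDs may be 64-bit (16 hex) or 128-bit (32 hex).
    if not _is_hex(trace_id, (16, 32)) or not _is_hex(span_id, (16,)):
        return None
    # Left-pad a 64-bit trace ID with zeros to the 128 bits W3C requires.
    trace_id = trace_id.zfill(32)
    flags = "01" if headers.get("X-B3-Sampled") == "1" else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

X-B3-ParentSpanId has no counterpart in traceparent and is dropped, because the W3C header carries only the sender's own span ID.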
gRPC Metadata
gRPC propagation:
Trace context travels as gRPC metadata (similar to HTTP headers)
OpenTelemetry gRPC interceptors handle this automatically
Client interceptor: injects trace context into outgoing metadata
Server interceptor: extracts trace context from incoming metadata
Message Queue Headers
Kafka propagation:
Trace context stored in Kafka message headers
Producer: inject traceparent into message headers
Consumer: extract traceparent from message headers, continue trace
This links the asynchronous consumer span to the producer span,
even though they execute at different times.
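The producer and consumer steps can be sketched without a Kafka client by modelling message headers as a list of (key, bytes) pairs, the shape common Python Kafka clients use. The helper names are illustrative only.

```python
from typing import List, Optional, Tuple

Headers = List[Tuple[str, bytes]]


def inject_traceparent(headers: Headers, traceparent: str) -> Headers:
    """Producer side: return a copy of the headers carrying the trace context."""
    # Drop any stale traceparent so the message carries exactly one.
    kept = [(key, value) for key, value in headers if key != "traceparent"]
    kept.append(("traceparent", traceparent.encode("ascii")))
    return kept


def extract_traceparent(headers: Headers) -> Optional[str]:
    """Consumer side: read the trace context back, or None if it is absent."""
    for key, value in headers:
        if key == "traceparent":
            return value.decode("ascii")
    return None
```

The consumer would then start its span with the extracted trace ID and use the producer's span ID as the parent, even if the message sat in the topic for minutes.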
Cross-Language Propagation
In polyglot environments, services in different languages must agree on propagation format. W3C Trace Context is the standard that enables Java, Go, Python, and Node.js services to participate in the same trace.
LinkedIn's infrastructure spans Java, Scala, Python, and C++ services. Their adoption of a standard propagation format allows traces to flow seamlessly across language boundaries.
Sampling
Collecting traces for every request is prohibitively expensive at scale. Sampling selects which requests to trace.
Head-Based Sampling
The sampling decision is made at the entry point (head) of the trace, and propagated to all downstream services.
Head-based sampling:
API Gateway receives request
Random: is rand() < 0.01? (1% sampling rate)
Yes: set sampled flag, all downstream services record spans
No: set unsampled flag, no service records spans
Pros: simple, consistent (all-or-nothing per trace)
Cons: interesting traces (errors, high latency) may not be sampled
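A minimal sketch of the head-based decision follows. The first function mirrors the random check above. The second is a deterministic variant that derives the decision from the trace ID, so any service seeing the same ID reaches the same answer; using the low 64 bits as the hash input is an assumption of this sketch.

```python
import random


def head_sample_random(rate: float, rng: random.Random) -> bool:
    """Random decision made once at the entry point."""
    return rng.random() < rate


def head_sample_by_trace_id(trace_id: str, rate: float) -> bool:
    """Deterministic decision: the same trace ID always yields the same result."""
    # Treat the low 64 bits of the trace ID as a uniform random number.
    low64 = int(trace_id[-16:], 16)
    return low64 < int(rate * (1 << 64))


def trace_flags(sampled: bool) -> str:
    """Flags field to propagate downstream in traceparent."""
    return "01" if sampled else "00"
```

Downstream services should honor the propagated flag rather than deciding again, which is what keeps a trace all-or-nothing.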
Tail-Based Sampling
The sampling decision is deferred until the trace is complete. This allows keeping traces that are interesting (errors, high latency) while dropping routine ones.
Tail-based sampling:
All services record spans to a local buffer
Spans are forwarded to a sampling collector
Collector assembles complete traces
Decision rules:
- Traces with errors: keep all
- Traces with latency > 1s: keep all
- Traces for specific users: keep all
- Remaining traces: keep a random 1%
Pros: captures all interesting traces
Cons: requires buffering all spans before decision, higher infrastructure cost
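The decision rules above could be expressed as a single function applied once the collector has assembled a trace. This is a sketch under simplifying assumptions: the `FinishedSpan` type is hypothetical, and trace latency is approximated by the longest span, which is normally the root.

```python
import random
from dataclasses import dataclass
from typing import List, Set


@dataclass
class FinishedSpan:
    duration_ms: float
    is_error: bool
    user_id: str = ""


def keep_trace(
    spans: List[FinishedSpan],
    watched_users: Set[str],
    rng: random.Random,
    latency_threshold_ms: float = 1000.0,
    baseline_rate: float = 0.01,
) -> bool:
    """Decide whether a fully assembled trace should be stored."""
    # Rule 1: any error anywhere in the trace.
    if any(span.is_error for span in spans):
        return True
    # Rule 2: slow trace (longest span stands in for end-to-end latency).
    if max((span.duration_ms for span in spans), default=0.0) > latency_threshold_ms:
        return True
    # Rule 3: trace belongs to a user being watched.
    if any(span.user_id in watched_users for span in spans):
        return True
    # Rule 4: random baseline sample of everything else.
    return rng.random() < baseline_rate
```

Because the function needs every span of a trace, the collector has to buffer spans until the trace is judged complete, which is where the extra infrastructure cost comes from.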
Adaptive Sampling
Adjust sampling rates based on current conditions. During normal operation, sample 1%. During an incident, increase to 100% for the affected service.
Adaptive sampling rules:
Default: 1% random sampling
If error_rate > 5%: increase to 50% sampling
If specific service latency > p99 threshold: 100% sampling for that service
High-priority customers: always 100% sampling
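These rules translate into a small rate-selection function. The order in which the rules are checked (most detailed rate wins) is an assumption of this sketch, as are the parameter names; the text above does not specify precedence.

```python
def sampling_rate(
    error_rate: float,
    service_latency_ms: float,
    p99_threshold_ms: float,
    high_priority_customer: bool,
) -> float:
    """Pick a sampling probability from current conditions."""
    if high_priority_customer:
        return 1.0  # always trace high-priority customers
    if service_latency_ms > p99_threshold_ms:
        return 1.0  # service is slower than its p99 threshold
    if error_rate > 0.05:
        return 0.5  # elevated error rate
    return 0.01  # default: 1% random sampling
```

In practice the inputs would come from a metrics system and be refreshed periodically, so the rate falls back to the default once conditions return to normal.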
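Google's Dapper paper describes an adaptive sampling scheme: rather than a fixed probability, the sampler targets a rate of sampled traces per unit of time, so low-traffic workloads are sampled more heavily and high-traffic workloads less, keeping overhead under control.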
OpenTelemetry
OpenTelemetry (OTel) is the industry-standard framework for collecting traces, metrics, and logs. It provides vendor-neutral instrumentation that can export to any backend that accepts OTLP or has an OTel exporter.
OpenTelemetry Architecture
Application code
--> OTel SDK (creates spans, metrics, logs)
--> OTel Exporter (formats data for backend)
--> OTel Collector (receives, processes, exports)
--> Backend (Jaeger, Zipkin, Datadog, Honeycomb, etc.)
OTel Collector pipeline:
Receivers: accept data (OTLP, Jaeger, Zipkin formats)
Processors: batch, filter, sample, add attributes
Exporters: send to backends (multiple simultaneously)
Auto-Instrumentation
OTel provides automatic instrumentation for common libraries, creating spans without code changes.
Auto-instrumented libraries (examples):
HTTP clients: spans for outgoing HTTP calls
HTTP servers: spans for incoming requests
Database drivers: spans for queries (with query text)
gRPC: spans for client and server calls
Kafka: spans for produce and consume operations
Redis: spans for cache operations
Result: basic distributed tracing with zero code changes
Custom spans are added for business-specific operations
Shopify migrated to OpenTelemetry across their Ruby and Go services, replacing multiple vendor-specific agents with a single instrumentation framework. This reduced their observability vendor lock-in while maintaining trace quality.
Debugging with Traces
Traces transform debugging from guesswork into directed investigation.
Latency Analysis
Finding the slow service:
Trace shows total duration: 2,300ms
Span breakdown:
API Gateway: overhead 15ms
Auth Service: 22ms
Product Service: 45ms
Recommendation: 1,850ms <-- obvious bottleneck
Cart Service: 35ms
Drill into Recommendation Service span:
DB query: 50ms
ML model call: 1,780ms <-- root cause
Response build: 20ms
Error Propagation
Trace shows error propagation:
[API Gateway] status: 500
  [Order Service] status: 500
    [Payment Service] status: 503
      [Stripe API] status: timeout
Root cause: Stripe API timeout at the leaf span
Every parent span failed because of this one timeout
Without traces: you see a 500 at the gateway with no context
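Finding the originating failure can be automated by looking for failed spans that have no failed children. The sketch below applies that rule to the trace above; the `ErrSpan` type and the extra successful Auth Service span are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrSpan:
    name: str
    parent: Optional[str]
    status: str
    failed: bool


def root_causes(spans: List[ErrSpan]) -> List[ErrSpan]:
    """Failed spans with no failed child: the deepest point of each failure."""
    parents_of_failures = {
        span.parent for span in spans if span.failed and span.parent is not None
    }
    return [
        span for span in spans
        if span.failed and span.name not in parents_of_failures
    ]


TRACE = [
    ErrSpan("API Gateway", None, "500", True),
    ErrSpan("Auth Service", "API Gateway", "200", False),
    ErrSpan("Order Service", "API Gateway", "500", True),
    ErrSpan("Payment Service", "Order Service", "503", True),
    ErrSpan("Stripe API", "Payment Service", "timeout", True),
]
```

Applied to `TRACE`, the function returns only the Stripe API span, since every other failed span has a failed child and is therefore a consequence rather than a cause.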
Identifying Cascading Failures
Trace pattern during cascade:
Multiple traces show:
Service A --> Service B (timeout after 30s)
Service A retries --> Service B (timeout after 30s)
Service A retries --> Service B (timeout after 30s)
Service A's thread pool is now exhausted waiting for Service B
Service C calls Service A --> Service A returns 503 (no threads available)
Traces reveal: Service B is the root cause, Service A is the victim,
Service C is collateral damage
Common Pitfalls
- Broken context propagation. If any service in the chain does not propagate the trace context, the trace splits into disconnected fragments. Test propagation across every service boundary.
- Sampling away all interesting traces. Random 1% head-based sampling keeps only about 1 in 100 error or slow traces, because the decision is made before the outcome is known. Use tail-based sampling, or rules that always keep traces with errors.
- Too many spans per trace. Instrumenting every function call creates traces with thousands of spans that are expensive to store and hard to read. Instrument service boundaries and significant operations, not every method.
- Missing async propagation. Trace context often breaks at message queue boundaries. Explicitly inject context into message headers and extract it on the consumer side.
- No span attributes. Spans without attributes (user ID, order ID, error message) provide timing but no debugging context. Add relevant business attributes to spans.
- Vendor lock-in through proprietary instrumentation. Using vendor-specific tracing SDKs ties you to one backend. OpenTelemetry provides vendor-neutral instrumentation.
Key Takeaways
- A trace follows one request end-to-end. Spans are the building blocks, forming a tree that shows timing and causality across services.
- Context propagation is the foundation. If trace context does not cross every service boundary, tracing is incomplete and misleading.
- Tail-based sampling captures interesting traces (errors, latency outliers) that head-based sampling misses. Use it if your infrastructure budget allows.
- OpenTelemetry is the standard for vendor-neutral instrumentation. It supports auto-instrumentation for common libraries and manual instrumentation for business logic.
- Traces excel at answering "where is the bottleneck" and "what caused this error cascade." Combine with metrics for breadth and logs for depth.