Distributed Tracing

Distributed tracing tracks a request as it flows through multiple services. In a microservice architecture where a single user action might touch dozens of services, tracing answers the critical question: where did the time go?

Trace Context

A trace represents the entire journey of a request. It is identified by a unique trace ID that propagates across every service the request touches.

Trace context components:
  trace-id:   128-bit globally unique identifier for the entire request
  span-id:    64-bit identifier for a single operation within the trace
  parent-id:  span-id of the caller (links spans into a tree)
  flags:      sampling decision and other metadata

W3C Trace Context header (standard):
  traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
               version-trace_id-span_id-flags
               (the spec names the third field parent-id: it is the caller's span-id)
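To make the header layout concrete, here is a minimal sketch that builds and parses a version-00 traceparent value using only the Python standard library. The function names `new_traceparent` and `parse_traceparent` are hypothetical, chosen for illustration; a real service would normally rely on its tracing library rather than hand-rolled parsing.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex span-id, 2 hex flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Build a fresh version-00 traceparent value with random IDs."""
    trace_id = secrets.token_hex(16)  # 128 bits -> 32 hex chars
    span_id = secrets.token_hex(8)    # 64 bits  -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value: str):
    """Return (trace_id, span_id, sampled), or None if the value is malformed."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit is the sampled bit
    return trace_id, span_id, sampled
```

Returning None for malformed input lets the caller fall back to starting a new trace instead of propagating a broken header.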

Context Propagation

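The trace context must travel with the request across every service boundary: HTTP calls, gRPC calls, and message queues. Database queries are usually recorded as client spans in the calling service rather than carrying the context into the database itself.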

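Request flow with context propagation (IDs abbreviated; trace abc-123 is written abc123 inside the hyphen-delimited header):

  Browser --> API Gateway (no incoming context; gateway generates trace-id: abc-123)
    Gateway root span: traceparent: 00-abc123-span01-01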
    
  API Gateway --> Order Service
    Header: traceparent: 00-abc123-span02-01
    
  Order Service --> Payment Service (HTTP call)
    Header: traceparent: 00-abc123-span03-01
    
  Order Service --> Kafka (async message)
    Message header: traceparent: 00-abc123-span04-01
    
  Kafka --> Notification Service (consumes message)
    Reads traceparent from message, continues trace abc-123

Without propagation, each service creates an isolated trace, and you lose the end-to-end picture. This is the most common tracing failure mode.
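The hop-by-hop behaviour in the flow above can be sketched as a single function: keep the incoming trace ID and flags, mint a new span ID, and attach the result to the outgoing call. This is a simplified illustration that assumes a plain dict of lower-cased header names; `outgoing_headers` is a hypothetical helper, and it collapses the server and client spans of a real service into one span per hop.

```python
import secrets


def outgoing_headers(incoming: dict) -> dict:
    """Continue the caller's trace if a traceparent is present, else start one."""
    parts = incoming.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        # Keep the trace ID and the sampling flags chosen upstream.
        trace_id, flags = parts[1], parts[3]
    else:
        # No usable context: this service becomes the head of a new trace.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # every hop gets its own span ID
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

A service that skips this step and always takes the `else` branch produces exactly the isolated traces described above.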

Spans

A span represents a single unit of work within a trace. Spans form a tree structure where each span has at most one parent and zero or more children.

Span Structure

Span fields:
  traceId:      "abc-123"
  spanId:       "span-05"
  parentSpanId: "span-03"
  operationName: "POST /api/payments"
  serviceName:  "payment-service"
  startTime:    1710504727445 (epoch ms)
  duration:     155ms
  status:       OK
  attributes:
    http.method: "POST"
    http.status_code: 200
    payment.amount: 59.99
    payment.currency: "USD"
  events:
    - timestamp: 1710504727500, message: "Stripe API call started"
    - timestamp: 1710504727600, message: "Stripe API call completed"

Span Tree Visualization

Trace: abc-123 (order placement)

[API Gateway]              0ms ---------------------------------> 280ms
  [Auth Service]           5ms -------> 25ms
  [Order Service]          30ms ----------------------------> 270ms
    [DB: insert order]     35ms ---> 45ms
    [Payment Service]      50ms --------------------> 230ms
      [Stripe API call]   55ms ----------------> 220ms
    [Inventory Service]   235ms -----> 255ms
      [DB: update stock]  238ms --> 250ms
    [Notification: enqueue] 258ms -> 265ms

From this view, the Stripe API call, at 165ms of the 280ms total, is the clear bottleneck. Without tracing, you would only know the overall request took 280ms.
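The same reading can be reproduced mechanically by computing each span's self time, meaning its duration minus the time spent in its direct children. The sketch below hard-codes the span tree above as tuples; the tuple layout and the `self_times` helper are illustrative assumptions, not a tracing backend's data model.

```python
# (name, parent, start_ms, end_ms) copied from the span tree above.
SPANS = [
    ("API Gateway", None, 0, 280),
    ("Auth Service", "API Gateway", 5, 25),
    ("Order Service", "API Gateway", 30, 270),
    ("DB: insert order", "Order Service", 35, 45),
    ("Payment Service", "Order Service", 50, 230),
    ("Stripe API call", "Payment Service", 55, 220),
    ("Inventory Service", "Order Service", 235, 255),
    ("DB: update stock", "Inventory Service", 238, 250),
    ("Notification: enqueue", "Order Service", 258, 265),
]


def self_times(spans):
    """Map span name -> duration minus the time covered by direct children.

    Assumes children of the same parent do not overlap, as in the tree above.
    """
    result = {name: end - start for name, _, start, end in spans}
    for _, parent, start, end in spans:
        if parent is not None:
            result[parent] -= end - start
    return result


times = self_times(SPANS)
bottleneck = max(times, key=times.get)
print(bottleneck, times[bottleneck])  # Stripe API call 165
```

Ranking by self time rather than raw duration keeps parent spans such as Order Service (240ms total, 23ms of its own work) from masking the leaf that actually consumed the time.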

Span Kinds

Span kinds:
  CLIENT:   outgoing request (calling another service)
  SERVER:   incoming request (handling a call from another service)
  PRODUCER: creating a message for async processing
  CONSUMER: processing a message from a queue
  INTERNAL: in-process operation (no network boundary)

Jaeger, the open-source tracing system created at Uber, was built to trace requests across their microservice architecture. A single ride lifecycle can generate traces spanning driver matching, pricing, routing, payment, and notification services.

Propagation Mechanisms

Trace context must cross different communication boundaries. Each requires a specific propagation approach.

HTTP Headers

W3C standard (recommended):
  traceparent: 00-trace_id-span_id-flags
  tracestate: vendor-specific key-value pairs

B3 format (Zipkin legacy):
  X-B3-TraceId: trace_id
  X-B3-SpanId: span_id
  X-B3-ParentSpanId: parent_span_id
  X-B3-Sampled: 1
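When a B3-instrumented service calls one that expects W3C headers, something at the boundary has to translate. The sketch below shows the field mapping under a few assumptions: headers arrive as a dict with the exact key casing shown above, and `b3_to_traceparent` is a hypothetical helper name. B3 permits 64-bit trace IDs, so shorter IDs are left-padded with zeros to reach the 128 bits W3C requires.

```python
_HEX = set("0123456789abcdef")


def b3_to_traceparent(headers: dict):
    """Translate B3 multi-header values into a W3C traceparent string.

    Returns None when the trace or span ID is missing or malformed.
    """
    trace_id = headers.get("X-B3-TraceId", "").lower()
    span_id = headers.get("X-B3-SpanId", "").lower()
    if len(trace_id) not in (16, 32) or len(span_id) != 16:
        return None
    if not set(trace_id + span_id) <= _HEX:
        return None
    # B3 allows 64-bit trace IDs; W3C needs 128 bits, so left-pad with zeros.
    trace_id = trace_id.rjust(32, "0")
    # Only an explicit "1" is treated as sampled in this sketch.
    flags = "01" if headers.get("X-B3-Sampled") == "1" else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

X-B3-ParentSpanId has no counterpart in traceparent, which carries only the caller's own span ID, so it is dropped in the translation.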

gRPC Metadata

gRPC propagation:
  Trace context travels as gRPC metadata (similar to HTTP headers)
  OpenTelemetry gRPC interceptors handle this automatically
  
  Client interceptor: injects trace context into outgoing metadata
  Server interceptor: extracts trace context from incoming metadata

Message Queue Headers

Kafka propagation:
  Trace context stored in Kafka message headers
  Producer: inject traceparent into message headers
  Consumer: extract traceparent from message headers, continue trace
  
  This links the asynchronous consumer span to the producer span,
  even though they execute at different times.
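The producer and consumer halves described above can be sketched without a broker. The example assumes message headers are a list of (str, bytes) pairs, a shape some Python Kafka clients use; `inject` and `extract` are hypothetical helper names, not a client library API.

```python
TRACEPARENT_KEY = "traceparent"


def inject(headers: list, traceparent: str) -> list:
    """Producer side: return headers with the trace context added.

    Any existing traceparent entry is replaced so the message carries one context.
    """
    kept = [(k, v) for k, v in headers if k != TRACEPARENT_KEY]
    return kept + [(TRACEPARENT_KEY, traceparent.encode("utf-8"))]


def extract(headers: list):
    """Consumer side: return the traceparent string, or None if absent."""
    for key, value in headers:
        if key == TRACEPARENT_KEY:
            return value.decode("utf-8")
    return None
```

Because the context rides inside the message, the consumer can continue the trace minutes or hours after the producer span has ended.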

Cross-Language Propagation

In polyglot environments, services in different languages must agree on propagation format. W3C Trace Context is the standard that enables Java, Go, Python, and Node.js services to participate in the same trace.

LinkedIn's infrastructure spans Java, Scala, Python, and C++ services. Their adoption of a standard propagation format allows traces to flow seamlessly across language boundaries.

Sampling

Collecting traces for every request is prohibitively expensive at scale. Sampling selects which requests to trace.

Head-Based Sampling

The sampling decision is made at the entry point (head) of the trace, and propagated to all downstream services.

Head-based sampling:
  API Gateway receives request
  Random: is rand() < 0.01?  (1% sampling rate)
    Yes: set sampled flag, all downstream services record spans
    No:  set unsampled flag, no service records spans

Pros: simple, consistent (all-or-nothing per trace)
Cons: interesting traces (errors, high latency) may not be sampled
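A minimal sketch of the head-based decision follows. Instead of a separate rand() call, this variant derives the decision from the low 64 bits of the randomly generated trace ID, so any service that sees the same trace ID would reach the same answer. That substitution, along with the names `head_sample` and `start_trace`, is an assumption made for illustration.

```python
import secrets


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide once per trace from the low 64 bits of the trace ID (deterministic)."""
    threshold = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < threshold


def start_trace(rate: float = 0.01) -> str:
    """Entry point: create the IDs, decide, and record the decision in the flags."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if head_sample(trace_id, rate) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"
```

Downstream services only need to read the flags field; they never re-run the decision, which is what keeps each trace all-or-nothing.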

Tail-Based Sampling

The sampling decision is deferred until the trace is complete. This allows keeping traces that are interesting (errors, high latency) while dropping routine ones.

Tail-based sampling:
  All services record spans to a local buffer
  Spans are forwarded to a sampling collector
  Collector assembles complete traces
  Decision rules:
    - Keep all traces with errors: always
    - Keep traces with latency > 1s: always
    - Keep traces for specific users: always
    - Keep random sample of remaining: 1%

Pros: captures all interesting traces
Cons: requires buffering all spans before decision, higher infrastructure cost
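The decision rules above can be sketched as a small in-memory collector. Everything here is illustrative: the `TailSampler` class, the span dict keys (`trace_id`, `start_ms`, `duration_ms`, `error`, `user_id`), and the idea that something calls `finish` once a trace is complete are assumptions, and a production collector would also need timeouts and bounded memory.

```python
import random
from collections import defaultdict


class TailSampler:
    def __init__(self, latency_ms=1000, vip_users=(), base_rate=0.01, rng=random.random):
        self.latency_ms = latency_ms
        self.vip_users = set(vip_users)
        self.base_rate = base_rate
        self.rng = rng                   # injectable so tests can be deterministic
        self.buffer = defaultdict(list)  # trace_id -> spans seen so far

    def add_span(self, span: dict) -> None:
        """Buffer a span until its trace is complete."""
        self.buffer[span["trace_id"]].append(span)

    def finish(self, trace_id: str):
        """Called once the trace is complete; returns spans to keep, or None to drop."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return None
        start = min(s["start_ms"] for s in spans)
        end = max(s["start_ms"] + s["duration_ms"] for s in spans)
        keep = (
            any(s.get("error") for s in spans)                         # errors: always
            or end - start > self.latency_ms                           # slow: always
            or any(s.get("user_id") in self.vip_users for s in spans)  # specific users
            or self.rng() < self.base_rate                             # random remainder
        )
        return spans if keep else None
```

The buffer is the cost named in the cons line: every span of every trace is held in memory until the decision is made.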

Adaptive Sampling

Adjust sampling rates based on current conditions. During normal operation, sample 1%. During an incident, increase to 100% for the affected service.

Adaptive sampling rules:
  Default: 1% random sampling
  If error_rate > 5%: increase to 50% sampling
  If specific service latency > p99 threshold: 100% sampling for that service
  High-priority customers: always 100% sampling
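As a sketch, the rules above reduce to a function that returns a sampling probability for the current conditions. The parameter names and the precedence (most aggressive rule first) are assumptions made for illustration; the source rules do not state an ordering.

```python
def sampling_rate(error_rate: float, latency_ms: float, p99_threshold_ms: float,
                  high_priority: bool = False) -> float:
    """Return the sampling probability for one service, checking 100% rules first."""
    if high_priority:
        return 1.0   # high-priority customers: always sampled
    if latency_ms > p99_threshold_ms:
        return 1.0   # service latency above its p99 threshold
    if error_rate > 0.05:
        return 0.5   # elevated error rate
    return 0.01      # default: 1% random sampling
```

For example, `sampling_rate(error_rate=0.08, latency_ms=120, p99_threshold_ms=400)` yields 0.5, since only the error-rate rule fires.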

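Google's Dapper paper described an adaptive sampling scheme parameterized by a target rate of sampled traces per unit time rather than a fixed probability, so low-traffic workloads are sampled more heavily and high-traffic workloads less. The condition-driven rules above apply the same principle: conditions, not a constant, set the rate.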

OpenTelemetry

OpenTelemetry (OTel) is the industry-standard framework for collecting traces, metrics, and logs. It provides vendor-neutral instrumentation that works with any backend.

OpenTelemetry Architecture

Application code
  --> OTel SDK (creates spans, metrics, logs)
  --> OTel Exporter (formats data for backend)
  --> OTel Collector (receives, processes, exports)
  --> Backend (Jaeger, Zipkin, Datadog, Honeycomb, etc.)

OTel Collector pipeline:
  Receivers: accept data (OTLP, Jaeger, Zipkin formats)
  Processors: batch, filter, sample, add attributes
  Exporters: send to backends (multiple simultaneously)

Auto-Instrumentation

OTel provides automatic instrumentation for common libraries, creating spans without code changes.

Auto-instrumented libraries (examples):
  HTTP clients: spans for outgoing HTTP calls
  HTTP servers: spans for incoming requests
  Database drivers: spans for queries (often with sanitized query text)
  gRPC: spans for client and server calls
  Kafka: spans for produce and consume operations
  Redis: spans for cache operations

Result: basic distributed tracing with no changes to application code (the agent or instrumentation packages still have to be installed and configured)
Custom spans are added for business-specific operations

Shopify migrated to OpenTelemetry across their Ruby and Go services, replacing multiple vendor-specific agents with a single instrumentation framework. This reduced their observability vendor lock-in while maintaining trace quality.

Debugging with Traces

Traces transform debugging from guesswork into directed investigation.

Latency Analysis

Finding the slow service:
  Trace shows total duration: 2,300ms
  
  Span breakdown:
    API Gateway:     overhead 15ms
    Auth Service:    22ms
    Product Service: 45ms
    Recommendation:  1,850ms  <-- obvious bottleneck
    Cart Service:    35ms
    
  Drill into Recommendation Service span:
    DB query:        50ms
    ML model call:   1,780ms  <-- root cause
    Response build:  20ms

Error Propagation

Trace shows error propagation:
  [API Gateway] status: 500
    [Order Service] status: 500
      [Payment Service] status: 503
        [Stripe API] status: timeout
        
  Root cause: Stripe API timeout at the leaf span
  Every parent span failed because of this one timeout
  Without traces: you see a 500 at the gateway with no context
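Finding the root cause in such a trace amounts to following failed spans downward until a failed span has no failed child. The sketch below models the trace above with a hypothetical `Span` dataclass and treats any status other than "200" as a failure; both are simplifying assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    status: str
    children: list = field(default_factory=list)


def root_cause(span, ok_status="200"):
    """Return the deepest failing span on the error path, or None if span is healthy."""
    if span.status == ok_status:
        return None
    for child in span.children:
        cause = root_cause(child, ok_status)
        if cause is not None:
            return cause
    return span  # failed, but no child failed: the error originated here


trace = Span("API Gateway", "500", [
    Span("Order Service", "500", [
        Span("Payment Service", "503", [
            Span("Stripe API", "timeout"),
        ]),
    ]),
])
print(root_cause(trace).name)  # Stripe API
```

If several children fail, this sketch returns the first one it encounters; a fuller version might return all failing leaves.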

Identifying Cascading Failures

Trace pattern during cascade:
  Multiple traces show:
    Service A --> Service B (timeout after 30s)
    Service A retries --> Service B (timeout after 30s)
    Service A retries --> Service B (timeout after 30s)
    
  Service A's thread pool is now exhausted waiting for Service B
  Service C calls Service A --> Service A returns 503 (no threads available)
  
  Traces reveal: Service B is the root cause, Service A is the victim,
  Service C is collateral damage

Common Pitfalls

Broken context propagation. If any service in the chain does not propagate the trace context, the trace splits into disconnected fragments. Test propagation across every service boundary.
Sampling away all interesting traces. Random 1% head-based sampling drops roughly 99% of errors and latency outliers along with the routine traffic. Use tail-based sampling, or rules that always keep traces containing errors.
Too many spans per trace. Instrumenting every function call creates traces with thousands of spans that are expensive to store and hard to read. Instrument service boundaries and significant operations, not every method.
Missing async propagation. Trace context often breaks at message queue boundaries. Explicitly inject context into message headers and extract it on the consumer side.
No span attributes. Spans without attributes (user ID, order ID, error message) provide timing but no debugging context. Add relevant business attributes to spans.
Vendor lock-in through proprietary instrumentation. Using vendor-specific tracing SDKs ties you to one backend. OpenTelemetry provides vendor-neutral instrumentation.

Key Takeaways

A trace follows one request end-to-end. Spans are the building blocks, forming a tree that shows timing and causality across services.
Context propagation is the foundation. If trace context does not cross every service boundary, tracing is incomplete and misleading.
Tail-based sampling captures interesting traces (errors, latency outliers) that head-based sampling misses. Use it if your infrastructure budget allows.
OpenTelemetry is the standard for vendor-neutral instrumentation. It supports auto-instrumentation for common libraries and manual instrumentation for business logic.
Traces excel at answering "where is the bottleneck" and "what caused this error cascade." Combine with metrics for breadth and logs for depth.