Distributed Tracing

In a monolith, a stack trace tells you what happened. In a distributed system, a request crosses five services, two databases, and a message queue before returning. When something is slow, which service is the bottleneck? When something fails, where did it break? Distributed tracing answers these questions by following a request across every service it touches.

The Problem Tracing Solves

A user reports that checkout is slow. You check the API gateway logs: the request took 4.2 seconds. But the gateway just proxies to other services. Which one was slow?

Without tracing, you search logs in every downstream service, try to match timestamps, and hope the clocks are synchronized. With tracing, you pull up one trace ID and see:

checkout-service         |=============================| 4200ms
  -> inventory-service   |====|                          120ms
  -> pricing-service     |==|                              45ms
  -> payment-service     |=========================|    3800ms
     -> stripe-api       |=======================|      3650ms

The payment service took 3.8 seconds because its Stripe API call took 3.65 seconds. You found the bottleneck in 30 seconds instead of 30 minutes.
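The same reading of the waterfall can be done in code by computing each span's self time, meaning its duration minus the time spent in its direct children. This is a minimal sketch that assumes children run one after another without overlapping; the SpanTiming class and self_times function are illustrative names, not part of any tracing library.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpanTiming:
    """Hypothetical, simplified view of one row in the waterfall."""
    name: str
    duration_ms: int
    children: list[SpanTiming] = field(default_factory=list)


def self_times(span: SpanTiming) -> dict[str, int]:
    """Map each span name to the time not explained by its direct children.

    Assumes children run sequentially, so their durations can simply be summed.
    """
    own = span.duration_ms - sum(c.duration_ms for c in span.children)
    result = {span.name: own}
    for child in span.children:
        result.update(self_times(child))
    return result


# The checkout trace from the waterfall above.
checkout = SpanTiming("checkout-service", 4200, [
    SpanTiming("inventory-service", 120),
    SpanTiming("pricing-service", 45),
    SpanTiming("payment-service", 3800, [SpanTiming("stripe-api", 3650)]),
])

times = self_times(checkout)
slowest = max(times, key=times.get)
print(slowest, times[slowest])
```

Self time separates "this service is slow" from "this service is waiting on something slow": the payment service accounts for only 150 ms of its own 3800 ms.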

Core Concepts

Traces

A trace represents the entire journey of a request through the system. It has a unique trace ID that stays constant across all services.

Spans

A span represents a single unit of work within a trace -- an HTTP request, a database query, a function call. Each span has:

A span ID (unique to this span)
A parent span ID (which span created this one; absent on the root span)
A start time and duration
Attributes (key-value metadata)
A status (OK, ERROR)

Trace: abc-123
  Span: checkout (root span)
    Span: check-inventory (child of checkout)
    Span: calculate-price (child of checkout)
    Span: process-payment (child of checkout)
      Span: stripe-charge (child of process-payment)
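Because every span carries its own ID and its parent's ID, a backend can rebuild the tree above from a flat list of spans that arrive in any order. The sketch below shows the idea; SpanRecord and render_tree are hypothetical names, and the short span IDs are made up for readability.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical minimal span: just the fields needed to rebuild the tree."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def render_tree(spans: list[SpanRecord]) -> list[str]:
    """Group spans by parent ID, then walk depth-first from the root."""
    children: dict[Optional[str], list[SpanRecord]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans arrive out of order: the deepest span is reported first.
spans = [
    SpanRecord("s5", "s4", "stripe-charge"),
    SpanRecord("s1", None, "checkout"),
    SpanRecord("s2", "s1", "check-inventory"),
    SpanRecord("s3", "s1", "calculate-price"),
    SpanRecord("s4", "s1", "process-payment"),
]

tree = render_tree(spans)
print("\n".join(tree))
```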

Context Propagation

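For tracing to work across services, the trace context must be passed from one service to the next. Over HTTP this happens via request headers; other transports carry the same context in their own metadata, such as gRPC metadata or message headers on a queue.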

OpenTelemetry

OpenTelemetry (OTel) is the de facto standard for distributed tracing. It is a CNCF project that provides APIs, SDKs, and tools for generating and collecting telemetry data (traces, metrics, and logs).

Before OpenTelemetry, there were two competing projects: OpenTracing and OpenCensus. OTel merged them in 2019, and both predecessors have since been retired, so new instrumentation should target OTel.

Architecture

Application (instrumented with OTel SDK)
  -> OTel Collector (receives, processes, exports)
    -> Backend (Jaeger, Tempo, Zipkin, Datadog)

W3C Trace Context Headers

OpenTelemetry uses the W3C Trace Context standard for propagation by default. Two headers carry the context:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             |  |                                |                |
             v  trace-id                         span-id          flags (sampled)
             version

tracestate: vendor1=value1,vendor2=value2

When service A calls service B, it includes these headers. Service B extracts the trace ID and creates a child span under the same trace.
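To make the header format concrete, here is a minimal sketch of what service B does with an incoming traceparent value: validate it, keep the trace ID, and mint a new span ID for its own outgoing calls. It handles version 00 only and uses illustrative names (TraceContext, parse_traceparent, child_traceparent); in a real service the OTel SDK's propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple

# version - trace-id (32 hex) - parent span-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceContext:
    """Parse a version-00 traceparent header, rejecting malformed values."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the W3C specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("trace-id and span-id must not be all zeros")
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return TraceContext(version, trace_id, span_id, sampled)


def child_traceparent(incoming: TraceContext) -> str:
    """Build the header for a downstream call: same trace ID, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if incoming.sampled else "00"
    return f"00-{incoming.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
outgoing = child_traceparent(incoming)
print(outgoing)
```

The trace ID and sampled flag pass through unchanged, which is what keeps every service's spans in the same trace and under the same sampling decision.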

OTel SDK Setup (Python)

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "checkout-service",
    "service.version": "1.2.3",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

The Go setup follows the same pattern using go.opentelemetry.io/otel packages: create a resource with service name and version, configure an OTLP gRPC exporter, and register a TracerProvider with a batch span processor.

OTel Collector

The collector sits between your applications and the backend. It receives spans, processes them (filtering, sampling, enriching), and exports to one or more backends.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

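exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true  # plaintext; only appropriate inside a trusted network
  # The older "logging" exporter (and its loglevel setting) has been removed
  # from recent Collector releases; "debug" is its replacement.
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, debug]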

Auto-Instrumentation vs Manual Spans

Auto-Instrumentation

OTel provides auto-instrumentation libraries that patch common frameworks and libraries. With minimal code, you get spans for HTTP requests, database queries, gRPC calls, and message queue operations.

Python:

pip install opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests opentelemetry-instrumentation-sqlalchemy

opentelemetry-instrument --service_name checkout-service python app.py

This wraps your Flask routes, outgoing HTTP calls, and SQLAlchemy queries with spans automatically. No code changes required.

Node.js:

npm install @opentelemetry/auto-instrumentations-node

node --require @opentelemetry/auto-instrumentations-node/register app.js

Auto-instrumentation covers the framework boundaries -- HTTP in, HTTP out, database queries. It does not cover your business logic.
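The word "patch" is literal: an instrumentation library replaces a function in the target library with a wrapper that records timing around the original call. The toy sketch below illustrates only that idea. FakeHttpClient, instrument, and the recorded list are hypothetical stand-ins, not how any OTel instrumentation package is actually implemented.

```python
import functools
import time

recorded = []  # stand-in for a span exporter: (span name, duration in seconds)


def instrument(func, span_name):
    """Wrap func so that every call is timed and recorded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call returned or raised.
            recorded.append((span_name, time.perf_counter() - start))
    return wrapper


class FakeHttpClient:
    """Hypothetical library class that application code calls directly."""
    def get(self, url):
        return f"response from {url}"


# The "patch": swap the method on the class, so callers change nothing.
FakeHttpClient.get = instrument(FakeHttpClient.get, "HTTP GET")

response = FakeHttpClient().get("http://inventory-service/items")
print(response, recorded[0][0])
```

This also shows the limit described above: the wrapper sees the library call, but nothing about what your own code does before or after it.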

Manual Spans

For visibility into application-specific operations, create manual spans:

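from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# check_inventory, calculate_price, charge, and PaymentError are application code.
def process_checkout(order):
    # Parent span for the whole checkout; spans started inside nest under it.
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.items", len(order.items))

        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order.items)

        with tracer.start_as_current_span("calculate_total"):
            total = calculate_price(order.items)
            # Recorded on the parent span so the total is visible at the top level.
            span.set_attribute("order.total", total)

        # Bind the child span so the error lands on the payment span, not the parent.
        with tracer.start_as_current_span("charge_payment") as payment_span:
            try:
                charge(order.payment_method, total)
            except PaymentError as e:
                payment_span.set_status(trace.StatusCode.ERROR, str(e))
                payment_span.record_exception(e)
                raise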

The best approach: start with auto-instrumentation for the basics, then add manual spans for critical business logic.

Visualization Backends

Jaeger

Originally from Uber. Full-featured trace visualization. Supports search by service, operation, tags, and duration. Open source.

Grafana Tempo

Trace backend that integrates natively with Grafana. Uses object storage (S3, GCS) for cost-effective retention. Pairs naturally with Loki (logs) and Prometheus (metrics) for full observability.

# Tempo configuration
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

storage:
  trace:
    backend: s3
    s3:
      bucket: my-tempo-bucket
      endpoint: s3.amazonaws.com

Choosing a Backend

Zipkin is another option -- simpler than Jaeger with a smaller footprint. But if you already run Grafana and Prometheus, use Tempo. The integration between metrics, logs, and traces in a single Grafana UI is powerful.

Connecting the Dots

The real power of tracing emerges when you connect traces to logs and metrics.

Trace to Logs

Include the trace ID in every log line (covered in the structured logging topic). In Grafana, configure the Tempo data source with tracesToLogs pointing to Loki. Click a span in a trace and see the exact log lines emitted during that span.
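A minimal sketch of the first step, putting the trace ID on every log line, using only the standard library. The current_trace_id context variable is a stand-in for the active span; with the OTel API you would read the ID from trace.get_current_span().get_span_context().trace_id and format it as 32 hex characters. The log format itself is an assumption.

```python
import contextvars
import io
import logging

# Stand-in for the active span's trace ID; "-" means no active trace.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


# In-memory stream so the output can be inspected; use sys.stdout in a service.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output confined to our handler

current_trace_id.set("0af7651916cd43dd8448eb211c80319c")
logger.info("payment authorized")

log_line = stream.getvalue().strip()
print(log_line)
```

Once the ID is a consistent field in the log line, the log backend can look up all lines for a trace with a simple field match.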

Metrics to Traces

Prometheus metrics can include exemplars -- sample trace IDs attached to metric data points. When you see a latency spike in Grafana, click the data point to jump to an example trace.

Sampling

At high traffic, collecting every trace is expensive. Sampling strategies reduce volume while preserving visibility.

Head-Based Sampling

Decide at the start of the trace whether to sample it. This is simple and cheap, but the decision is made before you know whether the request will fail or be slow, so interesting traces can be dropped.

# OTel Collector probabilistic sampling (processor config)
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Keep 10% of traces
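The probabilistic_sampler snippet above is Collector configuration. The same kind of decision can be made inside the application when the root span starts; the Python SDK ships TraceIdRatioBased and ParentBased samplers in opentelemetry.sdk.trace.sampling for this. The sketch below shows the underlying idea in plain Python, not the SDK's exact code: derive the decision from the trace ID so that every service reaches the same answer without coordination. The should_sample name is illustrative.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-sampling decision derived from the trace ID.

    Treats the low 64 bits of the trace ID as a uniform random number and
    keeps the trace if it falls below ratio * 2**64.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_64_bits < int(ratio * 2**64)


trace_id = "0af7651916cd43dd8448eb211c80319c"

keep_at_10_percent = should_sample(trace_id, 0.10)
keep_at_60_percent = should_sample(trace_id, 0.60)
print(keep_at_10_percent, keep_at_60_percent)
```

Because the input is the trace ID rather than a fresh random number, the decision is reproducible: any service that sees this trace ID with the same ratio keeps or drops it consistently.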

Tail-Based Sampling

Decide after the trace completes. Keep all error traces and slow traces, sample the rest. More useful, but the collector must buffer complete traces in memory before deciding, and all spans of a trace must reach the same collector instance.

# OTel Collector tail sampling
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

Tail-based sampling is almost always what you want in production. You keep 100% of errors and slow requests (the traces you actually need) and sample the happy path.
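The policy list above amounts to a short decision function evaluated once per completed trace. The sketch below mirrors the three policies under simplifying assumptions: FinishedSpan and keep_trace are hypothetical names, and trace latency is approximated by the longest span rather than by the real start-to-end time the Collector measures.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class FinishedSpan:
    """Hypothetical minimal view of a completed span."""
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(
    spans: Sequence[FinishedSpan],
    threshold_ms: float = 2000,
    sample_pct: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Return the name of the policy that keeps the trace, or None to drop it."""
    # Policy "errors": keep every trace containing a failed span.
    if any(s.is_error for s in spans):
        return "errors"
    # Policy "slow-requests": longest span stands in for trace latency here.
    if max((s.duration_ms for s in spans), default=0) > threshold_ms:
        return "slow-requests"
    # Policy "default": keep a small random share of the happy path.
    if rng() * 100 < sample_pct:
        return "default"
    return None


failed = [FinishedSpan("charge_payment", 50, is_error=True)]
slow = [FinishedSpan("checkout", 4200), FinishedSpan("stripe-charge", 3650)]
healthy = [FinishedSpan("checkout", 100)]

print(keep_trace(failed), keep_trace(slow))
```

Passing rng as a parameter keeps the random policy testable: a fixed function such as lambda: 0.99 makes the outcome deterministic.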

Common Pitfalls

Not propagating context. If one service in the chain does not pass trace headers, the trace breaks into disconnected fragments. Verify propagation end to end.
Missing auto-instrumentation for a key library. Your HTTP framework is instrumented but your Redis client is not. Check which libraries have OTel instrumentation available and install them.
Not sampling in production. Collecting 100% of traces at 10,000 requests per second will overwhelm your backend and your budget. Use tail-based sampling.
Ignoring trace data. Tracing is useless if nobody looks at it. Link traces from alerts and dashboards so they are part of the debugging workflow.
Not adding manual spans for business logic. Auto-instrumentation shows you the framework boundaries. Manual spans show you where time is spent inside your code.
Treating tracing as a logging replacement. Traces show request flow and timing. Logs show detailed context. Use both.

Key Takeaways

Distributed tracing follows a request across every service it touches, showing where time is spent and where failures occur.
OpenTelemetry is the standard. Use its SDKs for instrumentation and its collector for processing and exporting.
Start with auto-instrumentation for HTTP, database, and gRPC calls. Add manual spans for critical business operations.
Use tail-based sampling to keep error and slow traces while reducing volume on the happy path.
Connect traces to logs (via trace ID) and to metrics (via exemplars). The three pillars of observability are most powerful when linked.
When tracing is integrated into your debugging workflow, it turns multi-hour investigations into 5-minute lookups.