Distributed Tracing
In a monolith, a stack trace tells you what happened. In a distributed system, a single request may cross five services, two databases, and a message queue before returning. When something is slow, which service is the bottleneck? When something fails, where did it break? Distributed tracing answers these questions by following a request across every service it touches.
The Problem Tracing Solves
A user reports that checkout is slow. You check the API gateway logs: the request took 4.2 seconds. But the gateway just proxies to other services. Which one was slow?
Without tracing, you search logs in every downstream service, try to match timestamps, and hope the clocks are synchronized. With tracing, you pull up one trace ID and see:
checkout-service |=============================| 4200ms
  -> inventory-service |====| 120ms
  -> pricing-service |==| 45ms
  -> payment-service |=========================| 3800ms
    -> stripe-api |=======================| 3650ms
The payment service took 3.8 seconds, and 3.65 seconds of that was the Stripe API call. You found the bottleneck in 30 seconds instead of 30 minutes.
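The same reading can be done arithmetically by computing each span's self time: its duration minus the time spent in its direct children. The sketch below uses the durations from the trace above and assumes the child calls run one after another (with parallel children, subtracting their sum would overstate the time spent waiting). The tuple layout and the `self_times` helper are illustrative, not part of any tracing library.

```python
from collections import defaultdict

# (span name, parent name, duration in ms), copied from the example trace.
SPANS = [
    ("checkout-service", None, 4200),
    ("inventory-service", "checkout-service", 120),
    ("pricing-service", "checkout-service", 45),
    ("payment-service", "checkout-service", 3800),
    ("stripe-api", "payment-service", 3650),
]


def self_times(spans):
    """Return each span's duration minus the time spent in its direct children."""
    child_total = defaultdict(int)
    for _name, parent, duration in spans:
        if parent is not None:
            child_total[parent] += duration
    return {name: duration - child_total[name] for name, _parent, duration in spans}


times = self_times(SPANS)
bottleneck = max(times, key=times.get)
print(times)
print("bottleneck:", bottleneck)  # bottleneck: stripe-api
```

The checkout service itself accounts for only 235 ms and the payment service for 150 ms; almost all of the latency sits in the external call.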
Core Concepts
Traces
A trace represents the entire journey of a request through the system. It has a unique trace ID that stays constant across all services.
Spans
A span represents a single unit of work within a trace -- an HTTP request, a database query, a function call. Each span has:
- A span ID (unique to this span)
- A parent span ID (which span created this one)
- A start time and duration
- Attributes (key-value metadata)
- A status (OK, ERROR)
Trace: abc-123
  Span: checkout (root span)
    Span: check-inventory (child of checkout)
    Span: calculate-price (child of checkout)
    Span: process-payment (child of checkout)
      Span: stripe-charge (child of process-payment)
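As a data structure, a trace is just a flat list of spans that share a trace ID and point at their parents. The sketch below models that with a hypothetical `Span` dataclass (not the OpenTelemetry class) and rebuilds the tree from the parent links. The short IDs such as `s1` are placeholders for readability; real span IDs are 16 hex characters.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    span_id: str
    trace_id: str
    parent_span_id: Optional[str] = None  # None marks the root span
    start_ms: int = 0
    duration_ms: int = 0
    attributes: dict = field(default_factory=dict)
    status: str = "OK"


def render_tree(spans):
    """Return indented lines showing each span under its parent."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


TRACE_ID = "abc-123"
spans = [
    Span("checkout", "s1", TRACE_ID),
    Span("check-inventory", "s2", TRACE_ID, parent_span_id="s1"),
    Span("calculate-price", "s3", TRACE_ID, parent_span_id="s1"),
    Span("process-payment", "s4", TRACE_ID, parent_span_id="s1"),
    Span("stripe-charge", "s5", TRACE_ID, parent_span_id="s4"),
]
print("\n".join(render_tree(spans)))
```

Because the tree is reconstructed from parent IDs, spans can arrive at the backend in any order and from any service.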
Context Propagation
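For tracing to work across services, the trace context must be passed from one service to the next. Over HTTP this happens via request headers; other transports carry the same context in their own metadata (gRPC metadata, message headers on a queue).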
OpenTelemetry
OpenTelemetry (OTel) is the standard for distributed tracing. It is a CNCF project that provides APIs, SDKs, and tools for generating and collecting telemetry data (traces, metrics, and logs).
Before OpenTelemetry, there were competing standards: OpenTracing and OpenCensus. OTel merged both and is now the only standard you should adopt.
Architecture
Application (instrumented with OTel SDK)
  -> OTel Collector (receives, processes, exports)
    -> Backend (Jaeger, Tempo, Zipkin, Datadog)
W3C Trace Context Headers
OpenTelemetry uses the W3C Trace Context standard for propagation. Two headers carry the context:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             |  |                                |                |
             |  trace-id                         span-id          flags (sampled)
             version
tracestate: vendor1=value1,vendor2=value2
When service A calls service B, it includes these headers. Service B extracts the trace ID and the caller's span ID, then creates a child span under the same trace with that span ID as its parent.
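To make the header format concrete, here is a minimal sketch of what a receiving service does with `traceparent`: validate it, pull out the IDs, and build the header for its own outgoing calls. In practice the OTel SDK's propagators handle this for you. The names `TraceContext`, `parse_traceparent`, and `child_traceparent` are made up for this example, and the parser only handles the four-field layout shown above.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - span-id (16 hex) - flags (2 hex), lowercase only
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the context carried by a traceparent header, or None if it is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None or match["version"] == "ff":
        return None
    # All-zero trace or span IDs are not valid.
    if set(match["trace_id"]) == {"0"} or set(match["span_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceContext(match["trace_id"], match["span_id"], sampled)


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
ctx = parse_traceparent(incoming)
print(ctx.trace_id, ctx.span_id, ctx.sampled)
print(child_traceparent(ctx))
```

The trace ID and the sampled flag pass through unchanged; only the span ID is replaced at each hop.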
OTel SDK Setup (Python)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Resource attributes identify this service on every span it emits.
resource = Resource.create({
    "service.name": "checkout-service",
    "service.version": "1.2.3",
    "deployment.environment": "production",
})

# Spans are batched in memory and sent to the collector over OTLP/gRPC.
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))

# Register the provider globally, then get a tracer for this module.
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
The Go setup follows the same pattern using go.opentelemetry.io/otel packages: create a resource with service name and version, configure an OTLP gRPC exporter, and register a TracerProvider with a batch span processor.
OTel Collector
The collector sits between your applications and the backend. It receives spans, processes them (filtering, sampling, enriching), and exports to one or more backends.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # The debug exporter replaces the older logging exporter in recent Collector releases.
  debug:
    verbosity: basic
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, debug]
Auto-Instrumentation vs Manual Spans
Auto-Instrumentation
OTel provides auto-instrumentation libraries that patch common frameworks and libraries. With minimal code, you get spans for HTTP requests, database queries, gRPC calls, and message queue operations.
Python:
pip install opentelemetry-distro opentelemetry-exporter-otlp opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests opentelemetry-instrumentation-sqlalchemy
opentelemetry-instrument --service_name checkout-service python app.py
This wraps your Flask routes, outgoing HTTP calls, and SQLAlchemy queries with spans automatically. No code changes required.
Node.js:
npm install @opentelemetry/auto-instrumentations-node
node --require @opentelemetry/auto-instrumentations-node/register app.js
Auto-instrumentation covers the framework boundaries -- HTTP in, HTTP out, database queries. It does not cover your business logic.
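"Patching" here means replacing a library's functions with wrapped versions at startup. The sketch below shows the idea with a toy decorator and a stand-in client class. `FakeHttpClient`, `traced`, and `recorded_spans` are hypothetical names for illustration, and real instrumentation libraries create proper OTel spans rather than dictionaries.

```python
import functools
import time

recorded_spans = []  # stand-in for a span exporter


def traced(span_name):
    """Wrap a function so every call records its duration and outcome."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "OK"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "ERROR"
                raise
            finally:
                recorded_spans.append({
                    "name": span_name,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                    "status": status,
                })
        return wrapper
    return decorator


class FakeHttpClient:
    """Hypothetical library class standing in for a real HTTP client."""

    def get(self, url):
        return f"response from {url}"


# The patch: swap the library method for the wrapped one. Calling code is untouched.
FakeHttpClient.get = traced("HTTP GET")(FakeHttpClient.get)

client = FakeHttpClient()
client.get("http://inventory-service/items")
print(recorded_spans[0]["name"], recorded_spans[0]["status"])  # HTTP GET OK
```

This is also why coverage stops at library boundaries: only functions that an instrumentation package knows how to wrap produce spans.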
Manual Spans
For visibility into application-specific operations, create manual spans:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# check_inventory, calculate_price, charge, and PaymentError are application code.
def process_checkout(order):
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.items", len(order.items))

        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order.items)

        with tracer.start_as_current_span("calculate_total"):
            total = calculate_price(order.items)
            span.set_attribute("order.total", total)

        with tracer.start_as_current_span("charge_payment") as payment_span:
            try:
                charge(order.payment_method, total)
            except PaymentError as e:
                # Mark the span that actually failed, not the parent.
                payment_span.set_status(trace.StatusCode.ERROR, str(e))
                payment_span.record_exception(e)
                raise
The best approach: start with auto-instrumentation for the basics, then add manual spans for critical business logic.
Visualization Backends
Jaeger
Originally from Uber. Full-featured trace visualization. Supports search by service, operation, tags, and duration. Open source.
Grafana Tempo
Trace backend that integrates natively with Grafana. Uses object storage (S3, GCS) for cost-effective retention. Pairs naturally with Loki (logs) and Prometheus (metrics) for full observability.
# Tempo configuration
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
storage:
  trace:
    backend: s3
    s3:
      bucket: my-tempo-bucket
      endpoint: s3.amazonaws.com
Choosing a Backend
Zipkin is another option -- simpler than Jaeger with a smaller footprint. But if you already run Grafana and Prometheus, use Tempo. The integration between metrics, logs, and traces in a single Grafana UI is powerful.
Connecting the Dots
The real power of tracing emerges when you connect traces to logs and metrics.
Trace to Logs
Include the trace ID in every log line (covered in the structured logging topic). In Grafana, configure the Tempo data source with tracesToLogs pointing to Loki. Click a span in a trace and see the exact log lines emitted during that span.
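A minimal sketch of the first step, getting the trace ID into every log line, using only the standard library. The `current_trace_id` context variable is an assumption standing in for the active span; in an instrumented service you would read the ID from the current OTel span context instead. The `key=value` log format is likewise just one choice that Loki can parse.

```python
import contextvars
import io
import logging

# Stand-in for "the trace ID of the request currently being handled".
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("level=%(levelname)s trace_id=%(trace_id)s msg=%(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)

current_trace_id.set("0af7651916cd43dd8448eb211c80319c")
log.info("payment authorized")
print(stream.getvalue(), end="")
# level=INFO trace_id=0af7651916cd43dd8448eb211c80319c msg=payment authorized
```

Attaching the ID in a filter rather than at each call site means existing log statements pick it up without changes.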
Metrics to Traces
Prometheus metrics can include exemplars -- sample trace IDs attached to metric data points. When you see a latency spike in Grafana, click the data point to jump to an example trace.
Sampling
At high traffic, collecting every trace is expensive. Sampling strategies reduce volume while preserving visibility.
Head-Based Sampling
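Decide at the start of the trace whether to sample it. Simple, but the decision is made before the outcome is known, so you may drop exactly the traces you later need (errors, slow requests).
# OTel Collector probabilistic sampling
# (in the SDK, the equivalent is OTEL_TRACES_SAMPLER=parentbased_traceidratio
#  with OTEL_TRACES_SAMPLER_ARG=0.1)
processors:
  probabilistic_sampler:
    sampling_percentage: 10 # Keep 10% of traces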
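A head-based decision is typically derived deterministically from the trace ID, so every service that sees the same trace reaches the same verdict without coordination. The sketch below illustrates that idea by comparing the low 64 bits of the trace ID against a threshold. `should_sample` is a hypothetical helper written for this example; it mirrors the general approach of ratio-based samplers rather than reproducing any specific implementation.

```python
import random


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(ratio * 2**64)


trace_id = "0af7651916cd43dd8448eb211c80319c"
print(should_sample(trace_id, 0.10))  # False
print(should_sample(trace_id, 1.00))  # True

# Over many random trace IDs, roughly `ratio` of them are kept.
rng = random.Random(42)
ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in ids)
print(f"kept {kept} of {len(ids)}")
```

Because the result depends only on the trace ID and the ratio, a trace is either kept by every service or dropped by every service, which avoids partial traces.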
Tail-Based Sampling
Decide after the trace completes. Keep all error traces and slow traces, sample the rest. More useful but requires the collector to buffer complete traces.
# OTel Collector tail sampling
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
Tail-based sampling is almost always what you want in production. You keep 100% of errors and slow requests (the traces you actually need) and sample the happy path.
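The decision logic behind those three policies can be sketched in a few lines. This is an illustration of the idea, not the collector's implementation: `FinishedSpan` and `tail_sampling_decision` are hypothetical names, the function expects a non-empty list of spans, and it treats the longest span as the trace duration, which is a simplification.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    status: str = "OK"


def tail_sampling_decision(
    spans: List[FinishedSpan],
    latency_threshold_ms: float = 2000,
    baseline_percentage: float = 5,
    rng=random,
) -> Optional[str]:
    """Return the name of the first policy that keeps the trace, or None to drop it."""
    if any(span.status == "ERROR" for span in spans):
        return "errors"
    # Simplification: use the longest span as the duration of the whole trace.
    if max(span.duration_ms for span in spans) > latency_threshold_ms:
        return "slow-requests"
    # Everything else is the happy path: keep only a small random share.
    if rng.random() * 100 < baseline_percentage:
        return "default"
    return None


slow_trace = [
    FinishedSpan("checkout", 4200),
    FinishedSpan("process-payment", 3800),
]
failed_trace = [
    FinishedSpan("checkout", 300),
    FinishedSpan("process-payment", 250, status="ERROR"),
]
print(tail_sampling_decision(slow_trace))    # slow-requests
print(tail_sampling_decision(failed_trace))  # errors
```

The function needs every span of a trace before it can decide, which is the buffering cost that tail-based sampling pays for its better selection.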
Common Pitfalls
- Not propagating context. If one service in the chain does not pass trace headers, the trace breaks into disconnected fragments. Verify propagation end to end.
- Missing auto-instrumentation for a key library. Your HTTP framework is instrumented but your Redis client is not. Check which libraries have OTel instrumentation available and install them.
- Not sampling in production. Collecting 100% of traces at 10,000 requests per second will overwhelm your backend and your budget. Use tail-based sampling.
- Ignoring trace data. Tracing is useless if nobody looks at it. Link traces from alerts and dashboards so they are part of the debugging workflow.
- Not adding manual spans for business logic. Auto-instrumentation shows you the framework boundaries. Manual spans show you where time is spent inside your code.
- Treating tracing as a logging replacement. Traces show request flow and timing. Logs show detailed context. Use both.
Key Takeaways
- Distributed tracing follows a request across every service it touches, showing where time is spent and where failures occur.
- OpenTelemetry is the standard. Use its SDKs for instrumentation and its collector for processing and exporting.
- Start with auto-instrumentation for HTTP, database, and gRPC calls. Add manual spans for critical business operations.
- Use tail-based sampling to keep error and slow traces while reducing volume on the happy path.
- Connect traces to logs (via trace ID) and to metrics (via exemplars). The three pillars of observability are most powerful when linked.
- When tracing is integrated into your debugging workflow, it turns multi-hour investigations into 5-minute lookups.