Three Pillars of Observability
Observability is the ability to understand a system's internal state from its external outputs. The three pillars (metrics, logs, and traces) each provide a distinct lens into system behavior. Together they form a complete picture that no single pillar can provide alone.

Metrics
Metrics are numeric measurements collected at regular intervals. They answer "what is happening" with aggregate numbers: request rates, error percentages, CPU utilization, queue depths.
What Metrics Provide
Metrics characteristics:
- Numeric values sampled over time (time series)
- Storage cost fixed per time series, regardless of traffic volume (it grows with the number of label combinations instead)
- Ideal for dashboards, alerts, and trend analysis
- Aggregated by nature (averages, percentiles, sums)
Examples:
http_requests_total: 45,230 (counter)
request_duration_seconds: p50=12ms, p99=340ms (histogram)
active_connections: 1,247 (gauge)
error_rate: 2.3% (derived)
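The four kinds of metric in the examples above can be sketched in a few lines of Python. This is an illustration only: the class names `Counter`, `Gauge`, and `Histogram` and the helper `error_rate` are hypothetical, and the histogram keeps every raw observation for simplicity, whereas production client libraries usually keep bucket counts so that storage stays fixed.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Counter:
    """Monotonically increasing count, e.g. http_requests_total."""
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can rise or fall, e.g. active_connections."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Distribution of observed values (raw list kept for illustration only)."""
    observations: List[float] = field(default_factory=list)

    def observe(self, value: float) -> None:
        self.observations.append(value)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile for 0 < p <= 100."""
        if not self.observations:
            raise ValueError("no observations recorded")
        ordered = sorted(self.observations)
        rank = math.ceil(p * len(ordered) / 100)
        return ordered[max(rank, 1) - 1]


def error_rate(errors: Counter, requests: Counter) -> float:
    """Derived metric: fraction of requests that failed."""
    return errors.value / requests.value if requests.value else 0.0
```

Note that the percentile is computed from the whole distribution: once it is reported, the identity of the individual requests behind it is gone.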
Metrics Strengths
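Metrics are cheap to store and fast to query. At 1-second resolution, one time series accumulates about 31.5 million samples a year; at the 1 to 2 bytes per sample that a compressing time-series database such as Prometheus typically achieves, that is a few tens of megabytes. Querying "what was the p99 latency last Tuesday at 3pm" typically returns in milliseconds. This makes metrics the default starting point for monitoring.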
Metrics Limitations
Metrics show aggregates but hide individual events. If the p99 latency spikes, metrics tell you it spiked but not which specific request was slow or why. You cannot drill into a metric data point to see the underlying request.
Metric tells you: error rate jumped from 0.1% to 5% at 14:32
Metric does NOT tell you:
- Which specific requests failed
- What error messages they returned
- Which downstream service caused the failures
- What the request payload looked like
Datadog, Prometheus, and CloudWatch are widely used metrics platforms. Netflix monitors over 2 billion metric time series across its infrastructure, using Atlas for real-time operational insight.
Logs
Logs are timestamped text records of discrete events. They answer "what happened" with specific details: which request, what error message, what parameters, what user.
What Logs Provide
Log characteristics:
- Individual event records with arbitrary detail
- Variable storage cost (proportional to traffic volume)
- Ideal for debugging specific incidents
- Rich context per event (stack traces, request bodies, user IDs)
Example structured log entry:
{
"timestamp": "2025-03-15T14:32:07.445Z",
"level": "ERROR",
"service": "payment-service",
"traceId": "abc-123",
"userId": "u-789",
"message": "Payment processing failed",
"error": "Timeout connecting to payment gateway",
"duration_ms": 30004,
"gateway": "stripe",
"amount": 59.99,
"correlationId": "req-456"
}
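An entry shaped like the one above can be produced with Python's standard `logging` module and a custom formatter. The sketch below is one possible approach, not a prescribed one: `JsonFormatter`, `build_logger`, and the convention of passing fields under a `context` key are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # isoformat() writes the UTC offset as +00:00 rather than Z.
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become an attribute on
        # the record; merge them into the top-level object.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, default=str)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    log.error(
        "Payment processing failed",
        extra={"context": {"traceId": "abc-123", "userId": "u-789",
                           "error": "Timeout connecting to payment gateway",
                           "duration_ms": 30004}},
    )
```

Keeping every field as a top-level key, rather than interpolating it into the message string, is what makes the entry filterable by `traceId` or `userId` later.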
Logs Strengths
Logs provide the richest detail about individual events. When debugging, you need to know exactly what happened: the error message, the stack trace, the request parameters. Logs deliver this context. They are also the most natural form of telemetry for developers, since nearly every language ships with a logging facility or has a de facto standard library for it.
Logs Limitations
Logs are expensive at scale. A service handling 100,000 requests per second that writes a single 1 KB entry per request (an illustrative size) produces about 100 MB of logs per second, or roughly 8.6 TB per day. Searching across that many entries is slow without proper indexing. Logs from different services are isolated unless explicitly correlated.
Log tells you: payment-service returned "timeout" for user u-789 at 14:32
Log does NOT tell you (without correlation):
- What other services were involved in this request
- How long each service took
- Where the bottleneck was in the request path
- Whether this is part of a broader pattern or an isolated incident
The ELK stack (Elasticsearch, Logstash, Kibana) and Splunk are among the most widely used log aggregation platforms. Stripe processes petabytes of logs to debug payment failures and maintain audit trails for compliance.
Traces
Traces follow a single request as it propagates through multiple services. They answer "where did time go" by showing the path, timing, and relationships between service calls.
What Traces Provide
Trace characteristics:
- End-to-end view of a single request across services
- Timing breakdown per service and operation
- Parent-child relationships between operations (spans)
- Causal chain of what triggered what
Example trace for an order placement:
Trace ID: abc-123
[API Gateway] 0ms ------> 250ms
[Auth Service] 5ms --> 25ms
[Order Service] 30ms --------> 240ms
[Inventory Check] 35ms --> 55ms
[Payment Service] 60ms ----------> 230ms
[Stripe API] 65ms --------> 220ms
[Email Service] 235ms -> 245ms
Traces Strengths
Traces reveal latency bottlenecks at a glance. In the example above, the Stripe API call is clearly the bottleneck: it runs from 65ms to 220ms, accounting for 155ms of the 250ms request. Without traces, you would need to correlate timestamps across log entries from every service in the request path to reconstruct this picture.
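The bottleneck in the example trace can also be found programmatically by computing each span's self time, meaning its duration minus the time covered by its children. The sketch below makes several assumptions: the `Span` class and both functions are hypothetical, spans reference their parent by name (real tracers use span IDs), and the parent links are inferred from the timings in the example, so Email Service is attached to API Gateway because that is the only span whose interval contains it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None  # parent span name; None for the root

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Spans from example trace abc-123; parent links are inferred, not given.
TRACE = [
    Span("API Gateway", 0, 250),
    Span("Auth Service", 5, 25, parent="API Gateway"),
    Span("Order Service", 30, 240, parent="API Gateway"),
    Span("Inventory Check", 35, 55, parent="Order Service"),
    Span("Payment Service", 60, 230, parent="Order Service"),
    Span("Stripe API", 65, 220, parent="Payment Service"),
    Span("Email Service", 235, 245, parent="API Gateway"),
]


def self_time_ms(span: Span, trace: List[Span]) -> int:
    """Duration of the span minus the time covered by its direct children."""
    children = sorted(
        (s.start_ms, s.end_ms) for s in trace if s.parent == span.name
    )
    covered, cursor = 0, span.start_ms
    for start, end in children:
        # Clip to the parent and skip any part already counted (overlaps).
        start, end = max(start, cursor), min(end, span.end_ms)
        if end > start:
            covered += end - start
            cursor = end
    return span.duration_ms - covered


def bottleneck(trace: List[Span]) -> Span:
    """The span that spent the most time doing its own work."""
    return max(trace, key=lambda s: self_time_ms(s, trace))


if __name__ == "__main__":
    for span in TRACE:
        print(f"{span.name:16} self={self_time_ms(span, TRACE)}ms")
    print("bottleneck:", bottleneck(TRACE).name)
```

Self time is used instead of raw duration because the root span is always the longest; it merely waits on its children.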
Traces Limitations
Traces are expensive to collect at 100% sampling. A single request to a microservice architecture might generate dozens of spans. Most systems sample traces (collect 1% or 10%) to manage cost, which means rare issues may not be captured.
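One common way to sample a fixed fraction of traces is to decide from a hash of the trace ID, so that every service reaches the same decision for the same request without coordinating. The function below is a minimal sketch of that idea; the name `should_sample` and the choice of SHA-256 are assumptions for illustration, not the algorithm of any particular tracing library.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based sampling: keep roughly `rate` of all traces."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep the
    # trace if it falls in the lowest `rate` fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(should_sample(t, 0.10) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, a trace kept at 1% is also kept at 10%. The decision is made before the outcome of the request is known, which is why rare failures can be missed.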
Trace tells you: the Stripe API call took 155ms in request abc-123
Trace does NOT tell you:
- Whether 155ms is normal or abnormal for Stripe calls
- The business impact of this latency
- How many requests are affected (that requires metrics)
- The detailed error message (that requires logs)
Jaeger, Zipkin, and AWS X-Ray are popular tracing platforms. Jaeger itself originated at Uber, which built it to process billions of spans per day across thousands of microservices.
How the Pillars Complement Each Other
No single pillar is sufficient. Each answers different questions about the same incident.
Incident Investigation Flow
Typical debugging workflow:
Step 1: Metrics alert fires
"Error rate exceeded 5% threshold on order-service"
You know SOMETHING is wrong.
Step 2: Metrics dashboard drill-down
Error rate spike started at 14:32, affects /api/orders endpoint
p99 latency jumped from 200ms to 3s
You know WHAT is wrong and WHEN it started.
Step 3: Logs investigation
Filter logs: service=order-service, level=ERROR, time=14:32-14:35
Find: "Connection timeout to payment-gateway after 30s"
You know the specific error.
Step 4: Trace analysis
Pick a failed trace from the affected time period
See: order-service -> payment-service -> stripe (timeout at 30s)
Payment-service retried 3 times before failing
You know WHERE in the call chain the problem is.
Step 5: Root cause
Check payment-service metrics: connection pool exhausted
Check payment-service logs: "Max connections (100) reached"
Root cause: connection leak introduced in yesterday's deploy
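Step 3 of the workflow above is, at its core, a filter over structured log entries by service, level, and time window. In practice a log platform's query language does this; the sketch below only shows the logic, assuming entries are dictionaries with `timestamp`, `service`, and `level` keys as in the earlier structured log example. The function name `filter_logs` is hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_logs(entries, service, level, start, end):
    """Return entries for one service and level with start <= time <= end."""
    return [
        entry for entry in entries
        if entry["service"] == service
        and entry["level"] == level
        and start <= parse_timestamp(entry["timestamp"]) <= end
    ]


if __name__ == "__main__":
    entries = [
        {"timestamp": "2025-03-15T14:32:07.445Z", "service": "order-service",
         "level": "ERROR",
         "message": "Connection timeout to payment-gateway after 30s"},
        {"timestamp": "2025-03-15T14:33:10.000Z", "service": "order-service",
         "level": "INFO", "message": "Order accepted"},
    ]
    window_start = datetime(2025, 3, 15, 14, 32, tzinfo=timezone.utc)
    window_end = datetime(2025, 3, 15, 14, 35, tzinfo=timezone.utc)
    for hit in filter_logs(entries, "order-service", "ERROR",
                           window_start, window_end):
        print(hit["message"])
```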
Correlation Between Pillars
The key to effective observability is connecting the three pillars through shared identifiers.
Correlation strategy:
Trace ID links everything:
Metric exemplar: traceId="abc-123" (attached to individual samples; a trace ID used as a label would create one time series per request)
Log field: "traceId": "abc-123"
Trace: traceId: abc-123
From a metric spike:
Click on an exemplar --> jump to a specific trace
From trace --> see logs for each span
From a log entry:
Extract traceId --> view the full distributed trace
From trace --> see where time was spent
From a trace:
Check service metrics at that time --> see if this is isolated or systemic
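The log-to-trace direction of this strategy amounts to grouping log entries by their trace ID. A minimal sketch follows, assuming each entry is a dictionary carrying a `traceId` field as in the structured log example; `index_by_trace` and `logs_for_trace` are hypothetical names, and a real log platform would build this index for you.

```python
from collections import defaultdict


def index_by_trace(log_entries):
    """Group log entries by traceId; entries without one are skipped."""
    index = defaultdict(list)
    for entry in log_entries:
        trace_id = entry.get("traceId")
        if trace_id:
            index[trace_id].append(entry)
    return index


def logs_for_trace(trace_id, index):
    """All entries recorded along one request, oldest first.

    Sorting the timestamp strings works only when every service writes
    them in the same ISO 8601 format and time zone.
    """
    return sorted(index.get(trace_id, []), key=lambda e: e["timestamp"])


if __name__ == "__main__":
    logs = [
        {"timestamp": "2025-03-15T14:32:07.445Z", "service": "payment-service",
         "traceId": "abc-123", "message": "Payment processing failed"},
        {"timestamp": "2025-03-15T14:32:07.100Z", "service": "order-service",
         "traceId": "abc-123", "message": "Calling payment-service"},
    ]
    for entry in logs_for_trace("abc-123", index_by_trace(logs)):
        print(entry["service"], "-", entry["message"])
```

Entries without a trace ID cannot be joined this way, which is why consistent propagation of the ID across services matters.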
Grafana provides this cross-pillar navigation. You can click from a metric alert to exemplar traces to correlated logs, all within the same interface. This workflow can turn hours of debugging into minutes.
Choosing the Right Pillar
| Question | Primary Pillar | Supporting Pillar |
|---------------------------------------|---------------|-------------------|
| Is the system healthy right now? | Metrics | |
| What triggered this alert? | Metrics | Logs |
| Why is this specific request slow? | Traces | Logs |
| What error did this user encounter? | Logs | Traces |
| Where is the bottleneck? | Traces | Metrics |
| Is this a widespread issue? | Metrics | Logs |
| What changed in the last deploy? | Logs | Metrics |
| What is the request flow? | Traces | |
Real-World Observability Stacks
Open source stack:
Metrics: Prometheus + Grafana
Logs: Loki or ELK (Elasticsearch, Logstash, Kibana)
Traces: Jaeger or Tempo
Correlation: Grafana (unified UI)
Collection: OpenTelemetry
Cloud-native stack (AWS):
Metrics: CloudWatch Metrics
Logs: CloudWatch Logs
Traces: AWS X-Ray
Correlation: CloudWatch ServiceLens
Commercial stack:
Datadog: unified metrics, logs, traces in one platform
New Relic: similar unified approach
Honeycomb: trace-first observability with high-cardinality support
Common Pitfalls
- Implementing only one pillar. Metrics alone cannot debug individual failures. Logs alone cannot show system-wide trends. Traces alone cannot alert on problems. You need all three.
- No correlation between pillars. Without trace IDs linking metrics, logs, and traces, moving between pillars requires manual timestamp matching, which is slow and error-prone.
- Over-logging, under-metricing. Teams often log everything and emit no metrics. Logs at scale are expensive and slow to query; metrics are cheap and fast. Use metrics for dashboards and alerts, and logs for detail.
- Collecting traces without a sampling strategy. Tracing 100% of requests at scale is often prohibitively expensive. Define sampling rules that capture all errors and a representative sample of successes.
- Tool sprawl without integration. Using separate unconnected tools for each pillar makes cross-pillar navigation impossible. Choose tools that integrate or use OpenTelemetry as a unified collection layer.
- Treating observability as ops-only. Developers should instrument their code with metrics, structured logs, and trace spans. Observability is a development concern, not just an operations concern.
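The sampling rule suggested in the pitfalls above (keep every error, keep a representative sample of successes) can be written as a decision made after a trace has completed, often called tail-based sampling. The sketch below is illustrative only: `keep_trace`, the 5% default, and the span shape (a dictionary with an optional `error` flag) are assumptions.

```python
import random


def keep_trace(spans, success_rate=0.05, rng=random.random):
    """Decide whether to store a completed trace.

    Any trace containing a failed span is always kept; traces without
    errors are kept with probability `success_rate`.
    """
    if any(span.get("error") for span in spans):
        return True
    return rng() < success_rate


if __name__ == "__main__":
    failed = [{"name": "order-service"},
              {"name": "payment-service", "error": True}]
    healthy = [{"name": "order-service"}, {"name": "payment-service"}]
    print(keep_trace(failed))   # always True
    print(keep_trace(healthy))  # True for about 5% of calls
```

The trade-off is that spans must be buffered until the request finishes, since the decision needs the outcome.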
Key Takeaways
- Metrics tell you what is happening (aggregate trends and alerts). Logs tell you what happened (individual event details). Traces tell you where time is spent (request flow across services).
- The real power of observability comes from correlating all three pillars through shared identifiers like trace IDs.
- Metrics are the cheapest and fastest signal. Start every investigation with metrics, then drill into traces and logs for specifics.
- OpenTelemetry is the emerging standard for collecting all three signals through a single instrumentation framework.
- Observability is a property of the system, not a product you buy. It requires intentional instrumentation in application code.