Three Pillars of Observability

Observability is the ability to understand a system's internal state from its external outputs. The three pillars (metrics, logs, and traces) each provide a distinct lens into system behavior. Together they give a picture of the system that no single pillar provides on its own.

Figure: Three pillars of observability feeding into dashboards and alerts

Metrics

Metrics are numeric measurements collected at regular intervals. They answer "what is happening" with aggregate numbers: request rates, error percentages, CPU utilization, queue depths.

What Metrics Provide

Metrics characteristics:
  - Numeric values sampled over time (time series)
  - Storage cost that scales with the number of time series (label combinations), not with traffic volume
  - Ideal for dashboards, alerts, and trend analysis
  - Aggregated by nature (averages, percentiles, sums)
  
Examples:
  http_requests_total: 45,230 (counter)
  request_duration_seconds: p50=12ms, p99=340ms (histogram)
  active_connections: 1,247 (gauge)
  error_rate: 2.3% (derived)
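
The four metric kinds listed above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any particular metrics library: the class names, the bucket bounds, and the sample observations are all assumptions made for the example.

```python
import math


class Counter:
    """Monotonically increasing count, e.g. http_requests_total."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that can rise or fall, e.g. active_connections."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets keyed by upper bound (seconds)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last bucket is +Inf
        self.count = 0

    def observe(self, value):
        self.count += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
                return
        self.bucket_counts[-1] += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains quantile q (an estimate)."""
        target = q * self.count
        seen = 0
        for bound, n in zip(self.bounds + [math.inf], self.bucket_counts):
            seen += n
            if seen >= target:
                return bound
        return math.inf


http_requests_total = Counter()
http_errors_total = Counter()
active_connections = Gauge()
request_duration_seconds = Histogram([0.01, 0.05, 0.1, 0.5, 1.0])

# Hypothetical observations: (duration in seconds, succeeded?)
for duration, ok in [(0.012, True), (0.008, True), (0.34, False), (0.02, True)]:
    http_requests_total.inc()
    if not ok:
        http_errors_total.inc()
    request_duration_seconds.observe(duration)

active_connections.set(1247)
error_rate = http_errors_total.value / http_requests_total.value  # derived metric
p99_bound = request_duration_seconds.quantile_upper_bound(0.99)
```

Note that the histogram keeps only bucket counts, so the percentile it reports is an upper bound for a bucket rather than an exact value. That trade-off is what keeps storage independent of request volume.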

Metrics Strengths

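Metrics are cheap to store and fast to query. A single time series at 1-second resolution holds about 31.5 million samples per year, which comes to tens of megabytes in a time-series database that compresses samples to one or two bytes each, as Prometheus typically does. Querying "what was the p99 latency last Tuesday at 3pm" typically returns in milliseconds. This makes metrics the default starting point for monitoring.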

Metrics Limitations

Metrics show aggregates but hide individual events. If the p99 latency spikes, metrics tell you it spiked but not which specific request was slow or why. You cannot drill into a metric data point to see the underlying request.

Metric tells you: error rate jumped from 0.1% to 5% at 14:32
Metric does NOT tell you:
  - Which specific requests failed
  - What error messages they returned
  - Which downstream service caused the failures
  - What the request payload looked like

Datadog, Prometheus, and CloudWatch are widely used metrics platforms. Netflix monitors over 2 billion metrics time series across their infrastructure, using Atlas for real-time operational insight.

Logs

Logs are timestamped text records of discrete events. They answer "what happened" with specific details: which request, what error message, what parameters, what user.

What Logs Provide

Log characteristics:
  - Individual event records with arbitrary detail
  - Variable storage cost (proportional to traffic volume)
  - Ideal for debugging specific incidents
  - Rich context per event (stack traces, request bodies, user IDs)

Example structured log entry:
  {
    "timestamp": "2025-03-15T14:32:07.445Z",
    "level": "ERROR",
    "service": "payment-service",
    "traceId": "abc-123",
    "userId": "u-789",
    "message": "Payment processing failed",
    "error": "Timeout connecting to payment gateway",
    "duration_ms": 30004,
    "gateway": "stripe",
    "amount": 59.99,
    "correlationId": "req-456"
  }
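
One way to produce entries shaped like the one above is a custom formatter on top of Python's standard `logging` module. The sketch below is illustrative only: `JsonFormatter`, `build_logger`, and the convention of passing fields under a `context` key are assumptions for this example, not part of any library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes record.context.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


# A StringIO stands in for stdout so the output can be inspected here.
buffer = io.StringIO()
logger = build_logger("payment-service", buffer)
logger.error(
    "Payment processing failed",
    extra={"context": {"traceId": "abc-123", "userId": "u-789", "gateway": "stripe"}},
)
entry = json.loads(buffer.getvalue())
print(entry["level"], entry["traceId"], entry["message"])
```

Emitting one JSON object per line keeps every field queryable by the log platform without regex parsing.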

Logs Strengths

Logs provide the richest detail about individual events. When debugging, you need to know exactly what happened: the error message, the stack trace, the request parameters. Logs deliver this context. They are also the most natural form of telemetry for developers, since nearly every language ships with a logging facility or has a widely adopted logging library.

Logs Limitations

Logs are expensive at scale. A service handling 100,000 requests per second generates enormous log volume. Searching across millions of log entries is slow without proper indexing. Logs from different services are isolated unless explicitly correlated.
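
A back-of-the-envelope estimate makes "enormous" concrete. The request rate comes from the paragraph above; the entry size and the one-line-per-request figure are assumptions chosen for illustration, and real services often log several lines per request.

```python
REQUESTS_PER_SECOND = 100_000   # from the text
LOG_LINES_PER_REQUEST = 1       # assumption: a single entry per request
BYTES_PER_LINE = 500            # assumption: a modest structured JSON entry
SECONDS_PER_DAY = 86_400

bytes_per_day = (
    REQUESTS_PER_SECOND * LOG_LINES_PER_REQUEST * BYTES_PER_LINE * SECONDS_PER_DAY
)
terabytes_per_day = bytes_per_day / 1e12  # decimal terabytes

print(f"{terabytes_per_day:.2f} TB/day before compression and indexing")
```

Under these assumptions the service writes roughly 4.3 TB of raw log data per day, before any index overhead.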

Log tells you: payment-service returned "timeout" for user u-789 at 14:32
Log does NOT tell you (without correlation):
  - What other services were involved in this request
  - How long each service took
  - Where the bottleneck was in the request path
  - Whether this is part of a broader pattern or an isolated incident

The ELK stack (Elasticsearch, Logstash, Kibana) and Splunk are among the most common log aggregation platforms. Stripe processes petabytes of logs to debug payment failures and maintain audit trails for compliance.

Traces

Traces follow a single request as it propagates through multiple services. They answer "where did time go" by showing the path, timing, and relationships between service calls.

What Traces Provide

Trace characteristics:
  - End-to-end view of a single request across services
  - Timing breakdown per service and operation
  - Parent-child relationships between operations (spans)
  - Causal chain of what triggered what

Example trace for an order placement:
  Trace ID: abc-123
  
  [API Gateway]         0ms  ------>  250ms
    [Auth Service]      5ms  -->  25ms
    [Order Service]     30ms -------->  240ms
      [Inventory Check] 35ms -->  55ms
      [Payment Service] 60ms ---------->  230ms
        [Stripe API]    65ms -------->  220ms
      [Email Service]   235ms -> 245ms

Traces Strengths

Traces make latency bottlenecks easy to spot. In the example above, the Stripe API call is clearly the bottleneck, taking 155ms of the 250ms total. Without traces, you would need to correlate timestamps across log entries from several different services to reconstruct this picture.
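
The same reading of the example trace can be reproduced programmatically. The sketch below models spans as a small tree and ranks them by self time, meaning the time a span spends outside its direct children. The `Span` class is a hypothetical structure for illustration, and the self-time formula assumes the children of a span run one after another.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self):
        # Time not covered by direct children (assumes sequential children).
        return self.duration_ms - sum(c.duration_ms for c in self.children)


# The order placement trace from the example, trace ID abc-123.
trace = Span("API Gateway", 0, 250, [
    Span("Auth Service", 5, 25),
    Span("Order Service", 30, 240, [
        Span("Inventory Check", 35, 55),
        Span("Payment Service", 60, 230, [Span("Stripe API", 65, 220)]),
        Span("Email Service", 235, 245),
    ]),
])


def walk(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


bottleneck = max(walk(trace), key=lambda s: s.self_time_ms)
print(bottleneck.name, bottleneck.self_time_ms)
```

Ranking by self time rather than total duration avoids blaming a parent span such as the API Gateway for time that was actually spent in its children.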

Traces Limitations

Traces are expensive to collect at 100% sampling. A single request in a microservice architecture might generate dozens of spans. Most systems sample traces (for example, keeping 1% or 10% of requests) to manage cost, which means rare issues may not be captured.
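
A common way to sample a fixed fraction of requests is to derive the decision from the trace ID itself, so that every service handling the same request reaches the same answer without coordination. The function below is a minimal sketch of that idea. The name `should_sample` and the use of SHA-256 are choices made for this example, not the algorithm of any specific tracing system.

```python
import hashlib


def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding from the trace ID alone."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


# Hypothetical trace IDs, sampled at 10%.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision is a pure function of the trace ID, a trace is either recorded in full by every service or not recorded at all, which avoids traces with missing spans.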

Trace tells you: the Stripe API call took 155ms in request abc-123
Trace does NOT tell you:
  - Whether 155ms is normal or abnormal for Stripe calls
  - The business impact of this latency
  - How many requests are affected (that requires metrics)
  - The detailed error message (that requires logs)

Jaeger, Zipkin, and AWS X-Ray are popular tracing platforms. Uber built their own distributed tracing system that processes billions of spans per day across thousands of microservices.

How the Pillars Complement Each Other

No single pillar is sufficient. Each answers different questions about the same incident.

Incident Investigation Flow

Typical debugging workflow:

Step 1: Metrics alert fires
  "Error rate exceeded 5% threshold on order-service"
  You know SOMETHING is wrong.

Step 2: Metrics dashboard drill-down
  Error rate spike started at 14:32, affects /api/orders endpoint
  p99 latency jumped from 200ms to 30s
  You know WHAT is wrong and WHEN it started.

Step 3: Logs investigation
  Filter logs: service=order-service, level=ERROR, time=14:32-14:35
  Find: "Connection timeout to payment-gateway after 30s"
  You know the specific error.

Step 4: Trace analysis
  Pick a failed trace from the affected time period
  See: order-service -> payment-service -> stripe (timeout at 30s)
  Payment-service retried 3 times before failing
  You know WHERE in the call chain the problem is.

Step 5: Root cause
  Check payment-service metrics: connection pool exhausted
  Check payment-service logs: "Max connections (100) reached"
  Root cause: connection leak introduced in yesterday's deploy
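
The alert in Step 1 is typically computed from counters rather than from individual requests. The sketch below derives an error rate from two scrapes of a request counter and an error counter, then compares it with the 5% threshold. The `CounterSample` type and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    timestamp: int        # seconds since an arbitrary start
    requests_total: int   # cumulative requests served
    errors_total: int     # cumulative failed requests


def error_rate(earlier: CounterSample, later: CounterSample) -> float:
    """Error ratio over the window between two counter scrapes."""
    requests = later.requests_total - earlier.requests_total
    errors = later.errors_total - earlier.errors_total
    return errors / requests if requests > 0 else 0.0


THRESHOLD = 0.05  # the 5% threshold from Step 1

# Two hypothetical scrapes of order-service, 60 seconds apart.
before = CounterSample(timestamp=0, requests_total=40_000, errors_total=40)
after = CounterSample(timestamp=60, requests_total=45_230, errors_total=354)

rate = error_rate(before, after)
alert_firing = rate > THRESHOLD
print(f"error rate {rate:.1%}, alert firing: {alert_firing}")
```

Working from counter deltas keeps the alert cheap to evaluate: it needs two numbers per scrape, however many requests were served in between.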

Correlation Between Pillars

The key to effective observability is connecting the three pillars through shared identifiers.

Correlation strategy:
  
  Trace ID links everything:
    Metric exemplar: traceId="abc-123" (attached to individual samples, not a regular label, which would explode cardinality)
    Log field: "traceId": "abc-123"
    Trace: traceId: abc-123
  
  From a metric spike:
    Click on an exemplar --> jump to a specific trace
    From trace --> see logs for each span
  
  From a log entry:
    Extract traceId --> view the full distributed trace
    From trace --> see where time was spent
  
  From a trace:
    Check service metrics at that time --> see if this is isolated or systemic
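
For the log side of this strategy, the trace ID has to reach every log line written while a request is being handled. One way to do that in Python is a context variable plus a logging filter, sketched below. The names `current_trace_id`, `TraceIdFilter`, `ListHandler`, and `handle_request` are hypothetical, and a real service would take the ID from its tracing library rather than pass it in by hand.

```python
import contextvars
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class ListHandler(logging.Handler):
    """Collect formatted lines in memory so the example is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)

logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)


def handle_request(trace_id):
    """Set the trace ID for the duration of one request."""
    token = current_trace_id.set(trace_id)
    try:
        logger.error("Connection timeout to payment-gateway")
    finally:
        current_trace_id.reset(token)


handle_request("abc-123")
print(handler.lines[0])
```

Application code never passes the trace ID to the logger explicitly, so the correlation cannot be forgotten at individual call sites.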

Grafana supports this kind of cross-pillar navigation. You can click from a metric panel to exemplar traces to correlated logs, all within the same interface. This workflow can turn hours of debugging into minutes.

Choosing the Right Pillar

| Question                              | Primary Pillar | Supporting Pillar |
|---------------------------------------|---------------|-------------------|
| Is the system healthy right now?      | Metrics       |                   |
| What triggered this alert?            | Metrics       | Logs              |
| Why is this specific request slow?    | Traces        | Logs              |
| What error did this user encounter?   | Logs          | Traces            |
| Where is the bottleneck?              | Traces        | Metrics           |
| Is this a widespread issue?           | Metrics       | Logs              |
| What changed in the last deploy?      | Logs          | Metrics           |
| What is the request flow?             | Traces        |                   |

Real-World Observability Stacks

Open source stack:
  Metrics: Prometheus + Grafana
  Logs: Loki or ELK (Elasticsearch, Logstash, Kibana)
  Traces: Jaeger or Tempo
  Correlation: Grafana (unified UI)
  Collection: OpenTelemetry

Cloud-native stack (AWS):
  Metrics: CloudWatch Metrics
  Logs: CloudWatch Logs
  Traces: AWS X-Ray
  Correlation: CloudWatch ServiceLens

Commercial stack:
  Datadog: unified metrics, logs, traces in one platform
  New Relic: similar unified approach
  Honeycomb: trace-first observability with high-cardinality support

Common Pitfalls

Implementing only one pillar. Metrics alone cannot debug individual failures. Logs alone cannot show system-wide trends. Traces alone cannot alert on problems. You need all three.
No correlation between pillars. Without trace IDs linking metrics, logs, and traces, moving between pillars requires manual timestamp matching, which is slow and error-prone.
Over-logging, under-measuring. Teams often log everything and record almost no metrics. Logs at scale are expensive and slow to query. Metrics are cheap and fast. Use metrics for dashboards and alerts, logs for detail.
Collecting traces without sampling strategy. Tracing 100% of requests at scale is prohibitively expensive. Define sampling rules that capture all errors and a representative sample of successes.
Tool sprawl without integration. Using separate, unconnected tools for each pillar makes cross-pillar navigation slow and manual. Choose tools that integrate or use OpenTelemetry as a unified collection layer.
Treating observability as ops-only. Developers should instrument their code with metrics, structured logs, and trace spans. Observability is a development concern, not just an operations concern.
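
The sampling rule from the pitfalls above (keep every error, keep a sample of successes) can be sketched as a decision made once a trace has completed, since only then is it known whether the request failed. The function name, the default ratio, and the injectable random source are assumptions for illustration.

```python
import random


def keep_trace(has_error: bool, success_ratio: float = 0.05, rng=random.random) -> bool:
    """Decide whether to store a finished trace."""
    if has_error:
        return True  # never drop a failed request
    return rng() < success_ratio  # keep a fraction of successful requests


# Hypothetical finished traces: (trace ID, did any span record an error?)
finished = [("abc-123", True), ("abc-124", False), ("abc-125", False)]
stored = [trace_id for trace_id, failed in finished if keep_trace(failed)]
print(stored)
```

Deciding after completion means spans must be buffered until the trace ends, which costs memory in the collector in exchange for never losing an error trace.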

Key Takeaways

Metrics tell you what is happening (aggregate trends and alerts). Logs tell you what happened (individual event details). Traces tell you where time is spent (request flow across services).
The real power of observability comes from correlating all three pillars through shared identifiers like trace IDs.
Metrics are the cheapest and fastest signal. Start every investigation with metrics, then drill into traces and logs for specifics.
OpenTelemetry is the emerging standard for collecting all three signals through a single instrumentation framework.
Observability is a property of the system, not a product you buy. It requires intentional instrumentation in application code.