Logging at Scale

Logs are the most detailed record of what happened in a system. At scale, the challenge shifts from producing logs to managing the volume: collecting, transporting, storing, and searching billions of log entries per day without drowning in cost or complexity.

Structured Logging

Structured logging outputs log entries as key-value pairs or JSON objects instead of free-form text. This is the single most impactful logging practice for systems at scale.

Unstructured vs Structured

Unstructured (bad for machines):
  2025-03-15 14:32:07 ERROR PaymentService - Payment failed for user u-789 order abc-123: timeout after 30s connecting to Stripe

Structured (good for machines):
  {
    "timestamp": "2025-03-15T14:32:07.445Z",
    "level": "ERROR",
    "service": "payment-service",
    "instance": "payment-service-7b4d-02",
    "message": "Payment processing failed",
    "error": "Timeout connecting to payment gateway",
    "userId": "u-789",
    "orderId": "abc-123",
    "duration_ms": 30004,
    "gateway": "stripe",
    "traceId": "trace-456",
    "correlationId": "req-012"
  }

Structured logs can be parsed, indexed, and queried automatically. Finding all errors for user u-789 becomes a simple field query instead of a regex against unpredictable text formats.
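
As a minimal sketch of how a service might emit entries like the one above, the following uses Python's standard logging module with a custom formatter. The JsonFormatter class, the build_logger helper, and the convention of passing fields through extra={"fields": ...} are assumptions made for illustration, not part of any particular logging standard.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Extra key-value pairs arrive via extra={"fields": {...}} (assumed convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout or a log file
log = build_logger("payment-service", buffer)
log.error(
    "Payment processing failed",
    extra={"fields": {"userId": "u-789", "orderId": "abc-123", "duration_ms": 30004}},
)
```

Each call produces one JSON object per line, so a log shipper can parse entries without multi-line handling.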

Log Levels

Standard levels (from most to least severe):
  FATAL:  system is unusable, immediate shutdown
  ERROR:  operation failed, requires attention
  WARN:   unexpected condition, system still functioning
  INFO:   significant business events (request completed, order placed)
  DEBUG:  detailed diagnostic information
  TRACE:  extremely detailed, step-by-step execution

Production recommendations:
  Default level: INFO
  During incident investigation: temporarily set to DEBUG
  Never run TRACE in production (volume is overwhelming)
  
Volume impact:
  INFO only: ~1 KB per request
  DEBUG: ~10 KB per request (10x more volume)
  TRACE: ~100 KB per request (100x more volume)
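
The per-request figures above turn into daily volume by simple multiplication. The sketch below assumes 1 KB = 1,000 bytes and uses a hypothetical steady rate of 1,000 requests/second, which is an illustration rather than a figure from any real system.

```python
# Approximate per-request log sizes, taken from the list above.
BYTES_PER_REQUEST = {"INFO": 1_000, "DEBUG": 10_000, "TRACE": 100_000}
SECONDS_PER_DAY = 86_400


def daily_log_bytes(level: str, requests_per_second: float) -> float:
    """Estimate the bytes of log data produced per day at a given level."""
    return BYTES_PER_REQUEST[level] * requests_per_second * SECONDS_PER_DAY


def to_terabytes(num_bytes: float) -> float:
    """Convert bytes to decimal terabytes."""
    return num_bytes / 1e12


# Hypothetical service handling 1,000 requests/second.
for level in ("INFO", "DEBUG", "TRACE"):
    print(level, round(to_terabytes(daily_log_bytes(level, 1_000)), 4), "TB/day")
```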

What to Log

Always log:
  - Request entry and exit (with duration)
  - Errors and exceptions (with stack traces)
  - Authentication events (login, logout, failed attempts)
  - Business-critical operations (payment, order state changes)
  - External service calls (with latency and status)

Never log:
  - Passwords, API keys, tokens, or secrets
  - Full credit card numbers or SSNs (use masking)
  - Large request/response bodies (use sampling)
  - Health check requests (noise at high frequency)
  - Per-iteration loop details in production
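
One way to enforce the "never log" rules is to scrub fields before they reach the log pipeline. The following is a rough sketch only: the SENSITIVE_KEYS names and the card-number pattern are assumptions, and a real scrubber would need a pattern set tuned to the data the service actually handles.

```python
import re

# Field names treated as secrets (assumed; adjust to your own schema).
SENSITIVE_KEYS = {"password", "api_key", "token", "secret"}

# 16 to 19 digits, optionally separated by spaces or hyphens; captures the last four.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}(\d{4})\b")


def mask_card_numbers(text: str) -> str:
    """Replace card-like digit runs, keeping only the last four digits."""
    return CARD_PATTERN.sub(lambda match: "****" + match.group(1), text)


def scrub(fields: dict) -> dict:
    """Return a copy of the log fields that is safe to emit."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = mask_card_numbers(value)
        else:
            clean[key] = value
    return clean


safe = scrub({"userId": "u-789", "password": "hunter2", "note": "card 4111 1111 1111 1111"})
```

Scrubbing in the application is a first line of defense; it complements, rather than replaces, scrubbing in the pipeline.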

GitHub processes hundreds of terabytes of logs daily. Their structured logging standard ensures every service emits logs in the same format, enabling cross-service investigation through a unified search interface.

Log Aggregation

In a distributed system, logs are scattered across hundreds or thousands of instances. Log aggregation collects them into a centralized system for unified search and analysis.

Aggregation Pipeline

Log aggregation architecture:

  Service A instances (20 pods) --> log shipper (Fluentd/Filebeat)
  Service B instances (50 pods) --> log shipper
  Service C instances (10 pods) --> log shipper
                                       |
                                       v
                                  Message buffer (Kafka)
                                       |
                                       v
                                  Processing (parse, enrich, filter)
                                       |
                                       v
                                  Storage & Index (Elasticsearch / Loki)
                                       |
                                       v
                                  Query UI (Kibana / Grafana)

Collection Approaches

Sidecar pattern (Kubernetes):
  Each pod has a log collector sidecar container
  Sidecar reads from shared log volume or stdout
  Ships to central aggregation
  Pro: per-pod configuration, isolation
  Con: resource overhead per pod

DaemonSet pattern (Kubernetes):
  One log collector per node (not per pod)
  Collects logs from all pods on the node via node filesystem
  Pro: lower resource overhead
  Con: shared configuration, less isolation

Direct shipping:
  Application sends logs directly to aggregation service
  Pro: simplest architecture
  Con: application can block when the aggregation service is slow; tight coupling to the logging backend
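
The blocking risk of direct shipping can be reduced by handing records to an in-process queue and letting a background thread do the slow write. A minimal sketch using QueueHandler and QueueListener from Python's standard library follows; the StringIO-backed handler is a stand-in for a network handler, and the queue size of 10,000 is an arbitrary assumption.

```python
import io
import logging
import logging.handlers
import queue


def build_async_logger(slow_handler: logging.Handler):
    """Return (logger, listener); the listener drains the queue on a background thread."""
    log_queue = queue.Queue(maxsize=10_000)  # bounded so memory use stays capped
    logger = logging.getLogger("direct-shipping-demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    # The request path only enqueues; it never waits on the slow handler.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener


shipped = io.StringIO()  # stand-in for a handler that sends to the aggregation service
logger, listener = build_async_logger(logging.StreamHandler(shipped))
logger.info("order placed")
listener.stop()  # processes any queued records, then joins the thread
```

With a bounded queue, records are dropped rather than blocking the application when the queue fills up, which is usually the preferable failure mode for logs.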

ELK Stack

The Elasticsearch, Logstash, Kibana (ELK) stack is one of the most widely deployed log aggregation solutions.

ELK pipeline:
  Application --> Filebeat (lightweight shipper on each host)
    --> Logstash (parse, transform, enrich)
    --> Elasticsearch (index and store)
    --> Kibana (search and visualize)

Elasticsearch log index:
  Logs are stored in daily indices: logs-2025.03.15
  Each index has configurable shard count and replication
  Full-text search on any field
  Aggregations for dashboards

Grafana Loki

Loki takes a different approach: it indexes only metadata labels, not full log content, making it much cheaper to operate.

Loki vs Elasticsearch:
  | Aspect           | Elasticsearch          | Loki                       |
  |------------------|------------------------|----------------------------|
  | Indexing         | Full text of every log | Labels only (service, pod) |
  | Query speed      | Fast on any field      | Fast by label, grep body   |
  | Storage cost     | High (full index)      | Low (compressed chunks)    |
  | Operational cost | Complex (cluster mgmt) | Simple (minimal state)     |
  | Best for         | Ad-hoc search          | Known query patterns       |

Retention

Log storage costs grow continuously. Retention policies balance the need for historical data against storage costs.

Tiered Retention

Tiered retention strategy:

  Hot tier (0-7 days):
    Fast SSD storage, full indexing
    All log levels, all fields searchable
    Used for active incident investigation
  
  Warm tier (7-30 days):
    Standard storage, reduced replicas
    Queries are slower but still reasonable
    Used for recent trend analysis
  
  Cold tier (30-90 days):
    Compressed object storage (S3, GCS)
    Must be rehydrated before searching
    Used for compliance and post-mortems
  
  Archive (90 days - 7 years):
    Deep archive storage (S3 Glacier)
    Hours to days to retrieve
    Used for regulatory compliance only
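
The tier boundaries above amount to a lookup by log age. A small sketch, where the tier_for_age function and the treatment of boundaries (a log exactly 7 days old counts as warm) are assumptions:

```python
from datetime import timedelta

# (exclusive upper age bound, tier name), following the tiers listed above.
TIERS = [
    (timedelta(days=7), "hot"),
    (timedelta(days=30), "warm"),
    (timedelta(days=90), "cold"),
    (timedelta(days=7 * 365), "archive"),
]


def tier_for_age(age: timedelta) -> str:
    """Return the storage tier for a log entry of the given age."""
    for upper_bound, name in TIERS:
        if age < upper_bound:
            return name
    return "expired"  # older than the archive window: eligible for deletion
```

In practice the storage system's lifecycle policies would apply these transitions; the function only makes the boundaries explicit.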

Retention by Log Level

Differential retention:
  ERROR/FATAL: retain 1 year (critical for post-mortems)
  WARN: retain 90 days
  INFO: retain 30 days
  DEBUG: retain 7 days (only enabled temporarily)
  
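  This can reduce storage by roughly 60-80% compared to flat retention
  across all levels; the exact saving depends on the volume mix per level
  and on the flat retention period used as the baseline.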
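
The saving from differential retention can be estimated from steady-state stored volume, which is daily volume multiplied by days retained, summed over levels. In the sketch below the retention periods come from the list above, while the daily volume mix and the 180-day flat baseline are hypothetical values chosen only for illustration.

```python
# Retention periods in days, from the differential policy above.
RETENTION_DAYS = {"FATAL": 365, "ERROR": 365, "WARN": 90, "INFO": 30, "DEBUG": 7}


def stored_volume(daily_volume: dict, retention_days: dict) -> float:
    """Steady-state stored volume: daily volume times days retained, per level."""
    return sum(daily_volume[level] * retention_days[level] for level in daily_volume)


def savings_vs_flat(daily_volume: dict, flat_days: int) -> float:
    """Fraction of storage saved relative to keeping every level for flat_days."""
    flat = stored_volume(daily_volume, {level: flat_days for level in daily_volume})
    tiered = stored_volume(daily_volume, RETENTION_DAYS)
    return 1 - tiered / flat


# Hypothetical mix in GB/day; DEBUG is normally disabled in production.
mix = {"ERROR": 2, "WARN": 8, "INFO": 90, "DEBUG": 0}
saving = savings_vs_flat(mix, flat_days=180)
```

Because INFO usually dominates volume, the INFO retention period and the chosen baseline drive most of the result.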

Netflix retains billions of log entries across tiered storage. Their hot tier covers the last few days for active debugging, while cold storage in S3 holds months of compressed logs for long-term analysis.

Search

Searching logs efficiently at scale requires both good indexing and good query patterns.

Query Patterns

Common log search queries:

  By service and time range:
    service="payment-service" AND timestamp > "2025-03-15T14:00:00Z"

  By error type:
    level="ERROR" AND message CONTAINS "timeout"

  By user (incident investigation):
    userId="u-789" AND timestamp BETWEEN "2025-03-15T14:30:00Z" AND "2025-03-15T14:35:00Z"

  By trace (following a request):
    traceId="trace-456"

  Aggregation (pattern detection):
    SELECT error, service, COUNT(*) AS count
    WHERE level="ERROR" AND timestamp > now() - 1h
    GROUP BY error, service
    ORDER BY count DESC
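
To make the semantics of these queries concrete, the sketch below runs the same kinds of filter and aggregation over a small in-memory list of structured entries. The where and top_errors helpers and the sample entries are made up for illustration; a real log store evaluates such queries against its indices rather than scanning a list.

```python
from collections import Counter

logs = [
    {"level": "ERROR", "service": "payment-service", "userId": "u-789",
     "traceId": "trace-456", "error": "Timeout connecting to payment gateway"},
    {"level": "ERROR", "service": "payment-service", "userId": "u-111",
     "traceId": "trace-457", "error": "Timeout connecting to payment gateway"},
    {"level": "ERROR", "service": "order-service", "userId": "u-222",
     "traceId": "trace-458", "error": "Inventory unavailable"},
    {"level": "INFO", "service": "order-service", "userId": "u-789",
     "traceId": "trace-456"},
]


def where(entries, **conditions):
    """Field query: keep entries whose fields equal every given condition."""
    return [e for e in entries if all(e.get(k) == v for k, v in conditions.items())]


def top_errors(entries):
    """Aggregation: count ERROR entries per (service, error), most frequent first."""
    errors = where(entries, level="ERROR")
    return Counter((e["service"], e["error"]) for e in errors).most_common()


by_trace = where(logs, traceId="trace-456")
ranking = top_errors(logs)
```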

Search Performance

Performance strategies:
  
  Time-based partitioning:
    Logs stored in daily/hourly indices
    Queries always include time range to limit scan scope
    Searching 1 hour of logs vs 1 month: roughly 720x less data to scan (30 days x 24 hours)
  
  Field indexing:
    Index high-value fields: service, level, traceId, userId
    Do not index high-cardinality free text (full message body)
    Balance: more indexes = faster queries but higher storage
  
  Query optimization:
    Always include time range (most selective filter)
    Filter by indexed fields before full-text search
    Use aggregations instead of retrieving millions of raw entries
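
Time-based partitioning pays off because a time range maps directly to a short list of indices. A sketch of that mapping, assuming daily indices named with a logs-YYYY.MM.DD pattern and timezone-aware UTC timestamps (the indices_for_range helper is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def indices_for_range(start: datetime, end: datetime, prefix: str = "logs") -> list[str]:
    """List the daily indices a query over [start, end] has to touch."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append(f"{prefix}-{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


one_hour = indices_for_range(
    datetime(2025, 3, 15, 14, 0, tzinfo=timezone.utc),
    datetime(2025, 3, 15, 15, 0, tzinfo=timezone.utc),
)
one_month = indices_for_range(
    datetime(2025, 2, 14, 0, 0, tzinfo=timezone.utc),
    datetime(2025, 3, 15, 15, 0, tzinfo=timezone.utc),
)
```

A query without a time range gives the store no way to rule out any index, so every index must be searched.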

Correlation IDs

Correlation IDs link related log entries across services, enabling end-to-end request tracking through logs alone.

Types of Correlation IDs

Trace ID:
  Generated at the entry point, propagated through all services
  Links all logs for a single end-to-end request
  Same as the distributed tracing trace ID

Correlation ID (request ID):
  Similar to trace ID, sometimes used interchangeably
  May be client-generated (for idempotency tracking)

Session ID:
  Links all requests from a single user session
  Useful for understanding user journeys

Causation ID:
  Links an effect to its cause
  Event B was caused by Event A: causationId = A's eventId

Implementing Correlation IDs

Propagation flow:

  1. API Gateway generates correlationId: "req-012"
  2. Gateway sets the header X-Correlation-ID: req-012 on the request (and echoes it in the response)
  3. Gateway logs: { correlationId: "req-012", message: "Request received" }
  4. Gateway passes to Order Service via header
  5. Order Service logs: { correlationId: "req-012", message: "Creating order" }
  6. Order Service calls Payment Service with same header
  7. Payment Service logs: { correlationId: "req-012", message: "Processing payment" }

  Searching for correlationId="req-012" returns logs from ALL three services,
  in chronological order, showing the complete request story.
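
The propagation flow above can be sketched in a single process, with plain function calls standing in for HTTP requests between services. Everything here is illustrative: the handle helper, the captured list (a stand-in for the central log store), and the generated ID format are assumptions. The contextvars module keeps the current ID available to any code that logs during the request.

```python
import contextvars
import uuid

HEADER = "X-Correlation-ID"
correlation_id = contextvars.ContextVar("correlation_id", default=None)
captured = []  # stands in for the central log store


def log(service: str, message: str) -> None:
    """Emit a structured entry tagged with the current correlation ID."""
    captured.append(
        {"service": service, "correlationId": correlation_id.get(), "message": message}
    )


def handle(service, message, headers, downstream=None):
    """Adopt the incoming ID (or create one), log, and forward the same header."""
    cid = headers.get(HEADER) or f"req-{uuid.uuid4().hex[:8]}"
    token = correlation_id.set(cid)
    try:
        log(service, message)
        if downstream is not None:
            downstream({HEADER: cid})  # stand-in for an HTTP call with the header
    finally:
        correlation_id.reset(token)
    return cid


def payment_service(headers):
    return handle("payment-service", "Processing payment", headers)


def order_service(headers):
    return handle("order-service", "Creating order", headers, downstream=payment_service)


def api_gateway(headers):
    return handle("api-gateway", "Request received", headers, downstream=order_service)


request_id = api_gateway({})  # no incoming header, so the gateway generates the ID
```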

Correlation in Async Flows

Async correlation (message queues):
  
  Order Service:
    1. Process request (correlationId: "req-012")
    2. Publish to Kafka with correlationId in message header
    
  Notification Service (consumes later):
    3. Extract correlationId from message header
    4. Log: { correlationId: "req-012", message: "Sending email" }
    
  Even though these execute minutes apart, the correlation ID
  links them as part of the same logical operation.
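
A sketch of the same idea for asynchronous flows follows, using an in-memory queue as a stand-in for a Kafka topic. The message layout (a headers dict next to a JSON value) and the function names are assumptions for illustration and do not reflect a specific Kafka client API.

```python
import json
import queue

topic = queue.Queue()  # in-memory stand-in for a Kafka topic


def publish(payload: dict, correlation_id: str) -> None:
    """Producer side: carry the ID in a message header, separate from the payload."""
    topic.put({"headers": {"correlationId": correlation_id}, "value": json.dumps(payload)})


def consume_one() -> dict:
    """Consumer side: restore the ID from the header before logging anything."""
    message = topic.get()
    cid = message["headers"].get("correlationId", "unknown")
    order = json.loads(message["value"])
    return {
        "correlationId": cid,
        "service": "notification-service",
        "message": "Sending email",
        "orderId": order["orderId"],
    }


publish({"orderId": "abc-123"}, correlation_id="req-012")
entry = consume_one()  # could run minutes later in another process
```

Keeping the ID in a header rather than in the payload lets intermediaries and generic consumers read it without knowing the payload schema.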

Stripe includes a request ID in every API response. When merchants report issues, Stripe support can search for that request ID across all internal service logs to reconstruct exactly what happened.

Log Processing Patterns

Sampling High-Volume Logs

When a service handles 100K requests/second:
  Full logging: 100K log entries/second = 8.6 billion/day
  1% sampling: 1K entries/second = 86 million/day (99% volume reduction)

Sampling strategies:
  Random: log 1% of all requests
  Error-biased: log 100% of errors, 1% of successes
  Head-based: decide at request start, log all or nothing for that request
  Dynamic: increase sampling during incidents
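
Error-biased and head-based sampling combine naturally: keep every error, and for everything else make one deterministic decision per request. The sketch below derives that decision from a hash of the trace ID so that every service reaches the same answer without coordination. The function names and the choice of SHA-256 are assumptions, not a standard.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based decision: the same trace ID always gives the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def should_log(level: str, trace_id: str, success_rate: float = 0.01) -> bool:
    """Error-biased sampling: keep all ERROR/FATAL entries, sample the rest."""
    if level in ("ERROR", "FATAL"):
        return True
    return head_sample(trace_id, success_rate)
```

Dynamic sampling fits the same shape: raising success_rate during an incident increases coverage without code changes elsewhere.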

Log Enrichment

Enriching logs at collection time:

  Raw log from application:
    { "userId": "u-789", "action": "purchase" }
  
  Enriched by log pipeline:
    {
      "userId": "u-789",
      "action": "purchase",
      "environment": "production",
      "region": "us-east-1",
      "kubernetes_pod": "order-svc-7b4d-02",
      "kubernetes_node": "node-15",
      "deploy_version": "v2.34.1",
      "team": "commerce"
    }
  
  Environment metadata added without application code changes.
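
A minimal sketch of the enrichment step follows. The metadata values are copied from the example above; in a real collector they would typically come from the environment or the orchestrator rather than a hard-coded dict, and the enrich function name is hypothetical.

```python
# Static metadata known to the collector, not to the application (values from the example).
PIPELINE_METADATA = {
    "environment": "production",
    "region": "us-east-1",
    "kubernetes_pod": "order-svc-7b4d-02",
    "kubernetes_node": "node-15",
    "deploy_version": "v2.34.1",
    "team": "commerce",
}


def enrich(raw: dict, metadata: dict = PIPELINE_METADATA) -> dict:
    """Add pipeline metadata without overwriting fields the application set itself."""
    return {**metadata, **raw}


enriched = enrich({"userId": "u-789", "action": "purchase"})
```

Letting application fields win on conflict is a design choice; some pipelines prefer the opposite so that services cannot spoof fields such as environment.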

Common Pitfalls

Logging secrets. Passwords, API keys, and tokens in logs are a security breach waiting to happen. Implement log scrubbing in the pipeline and audit regularly.
Unstructured log formats. Free-form log messages cannot be reliably parsed or queried. Enforce structured logging standards across all services.
No time range in queries. Searching all logs without a time range scans the entire dataset. Always include a time range as the first filter.
Flat retention for all log levels. Retaining DEBUG logs for a year costs orders of magnitude more than retaining only ERROR logs long-term. Differentiate retention by level and importance.
Missing correlation IDs. Without correlation IDs, logs from different services are isolated islands of information. Propagate correlation IDs across every service boundary, including async flows.
Logging too much during normal operation. Verbose logging in the hot path degrades application performance and floods the aggregation pipeline. Use INFO for business events, DEBUG only when investigating.

Key Takeaways

Structured logging with consistent field names across services is the foundation for log search and analysis at scale.
Log aggregation centralizes scattered logs into a searchable system. Choose between full-text indexing (Elasticsearch) and label-based indexing (Loki) based on your query patterns and budget.
Tiered retention with differential policies by log level can reduce storage costs by roughly 60-80%, depending on the level mix, while keeping critical data available.
Correlation IDs link logs across services and async boundaries, enabling end-to-end request tracking without distributed tracing infrastructure.
Sample aggressively for high-volume success logs. Retain 100% of error logs. This balances cost with debugging capability.