Logging at Scale
Logs are the most detailed record of what happened in a system. At scale, the challenge shifts from producing logs to managing the volume: collecting, transporting, storing, and searching billions of log entries per day without drowning in cost or complexity.
Structured Logging
Structured logging outputs log entries as key-value pairs or JSON objects instead of free-form text. This is the single most impactful logging practice for systems at scale.
Unstructured vs Structured
Unstructured (bad for machines):
2025-03-15 14:32:07 ERROR PaymentService - Payment failed for user u-789 order abc-123: timeout after 30s connecting to Stripe
Structured (good for machines):
{
"timestamp": "2025-03-15T14:32:07.445Z",
"level": "ERROR",
"service": "payment-service",
"instance": "payment-service-7b4d-02",
"message": "Payment processing failed",
"error": "Timeout connecting to payment gateway",
"userId": "u-789",
"orderId": "abc-123",
"duration_ms": 30004,
"gateway": "stripe",
"traceId": "trace-456",
"correlationId": "req-012"
}
Structured logs can be parsed, indexed, and queried automatically. Finding all errors for user u-789 becomes a simple field query instead of a regex against unpredictable text formats.
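As a minimal sketch of how a service might emit entries like the one above, the following uses Python's standard logging module with a custom formatter. The class name JsonFormatter, the helper build_logger, and the convention of passing extra fields under a "fields" key are assumptions made for illustration, not part of any standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}} and
        # those keys are merged in as top-level fields.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger("payment-service")
logger.error(
    "Payment processing failed",
    extra={"fields": {"userId": "u-789", "orderId": "abc-123", "duration_ms": 30004}},
)
```

Keeping field names such as userId and orderId identical across services is what makes the later field queries work.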
Log Levels
Standard levels (from most to least severe):
FATAL: system is unusable, immediate shutdown
ERROR: operation failed, requires attention
WARN: unexpected condition, system still functioning
INFO: significant business events (request completed, order placed)
DEBUG: detailed diagnostic information
TRACE: extremely detailed, step-by-step execution
Production recommendations:
Default level: INFO
During incident investigation: temporarily set to DEBUG
Never run TRACE in production (volume is overwhelming)
Volume impact (rough, illustrative figures):
INFO only: ~1 KB per request
DEBUG: ~10 KB per request (10x more volume)
TRACE: ~100 KB per request (100x more volume)
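To see what these per-request sizes mean per day, a back-of-the-envelope sketch follows. The byte sizes are the approximate figures listed above; the function name and the traffic level of 1,000 requests/second are assumptions chosen for illustration.

```python
# Approximate log bytes written per request at each level (figures from the list above).
BYTES_PER_REQUEST = {"INFO": 1_000, "DEBUG": 10_000, "TRACE": 100_000}
SECONDS_PER_DAY = 86_400


def daily_log_volume_gb(requests_per_second: float, level: str) -> float:
    """Estimate the log volume in GB/day for a steady request rate."""
    return requests_per_second * SECONDS_PER_DAY * BYTES_PER_REQUEST[level] / 1e9


for level in BYTES_PER_REQUEST:
    print(f"{level}: {daily_log_volume_gb(1_000, level):,.1f} GB/day")
```

At 1,000 requests/second this works out to about 86 GB/day at INFO, 864 GB/day at DEBUG, and 8.6 TB/day at TRACE.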
What to Log
Always log:
- Request entry and exit (with duration)
- Errors and exceptions (with stack traces)
- Authentication events (login, logout, failed attempts)
- Business-critical operations (payment, order state changes)
- External service calls (with latency and status)
Never log:
- Passwords, API keys, tokens, or secrets
- Full credit card numbers or SSNs (use masking)
- Large request/response bodies (use sampling)
- Health check requests (noise at high frequency)
- Per-iteration loop details in production
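One way to enforce the "never log" rules for secrets and card numbers is to scrub entries before they are written. The sketch below is an assumption-laden illustration: the key list SENSITIVE_KEYS and the card-number pattern are hypothetical and would need tuning for real data.

```python
import re

# Hypothetical list of field names whose values must never reach the logs.
SENSITIVE_KEYS = {"password", "api_key", "token", "secret", "authorization"}

# 13 to 19 digits, optionally separated by spaces or hyphens. This is a crude
# heuristic and will also match other long digit runs.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def mask_card_numbers(text: str) -> str:
    """Replace all but the last four digits of anything that looks like a card number."""
    def _mask(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]

    return CARD_PATTERN.sub(_mask, text)


def scrub(entry: dict) -> dict:
    """Return a copy of a log entry with secrets redacted and card numbers masked."""
    clean = {}
    for key, value in entry.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = mask_card_numbers(value)
        else:
            clean[key] = value
    return clean


print(scrub({"password": "hunter2", "message": "card 4111 1111 1111 1111 declined"}))
```

Scrubbing in the application is a first line of defense; it does not replace scrubbing and auditing in the pipeline.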
GitHub processes hundreds of terabytes of logs daily. Their structured logging standard ensures every service emits logs in the same format, enabling cross-service investigation through a unified search interface.
Log Aggregation
In a distributed system, logs are scattered across hundreds or thousands of instances. Log aggregation collects them into a centralized system for unified search and analysis.
Aggregation Pipeline
Log aggregation architecture:
Service A instances (20 pods) --> log shipper (Fluentd/Filebeat)
Service B instances (50 pods) --> log shipper
Service C instances (10 pods) --> log shipper
|
v
Message buffer (Kafka)
|
v
Processing (parse, enrich, filter)
|
v
Storage & Index (Elasticsearch / Loki)
|
v
Query UI (Kibana / Grafana)
Collection Approaches
Sidecar pattern (Kubernetes):
Each pod has a log collector sidecar container
Sidecar reads from shared log volume or stdout
Ships to central aggregation
Pro: per-pod configuration, isolation
Con: resource overhead per pod
DaemonSet pattern (Kubernetes):
One log collector per node (not per pod)
Collects logs from all pods on the node via node filesystem
Pro: lower resource overhead
Con: shared configuration, less isolation
Direct shipping:
Application sends logs directly to aggregation service
Pro: simplest architecture
Con: application blocked if aggregation is slow, coupling
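The blocking risk of direct shipping can be reduced by putting an in-process queue between the application and the slow handler. The sketch below uses QueueHandler and QueueListener from Python's standard library; the helper name build_nonblocking_logger and the buffer size are assumptions.

```python
import logging
import logging.handlers
import queue


def build_nonblocking_logger(name: str, slow_handler: logging.Handler, max_buffered: int = 10_000):
    """Route records through a bounded queue so the caller never waits on the slow handler."""
    log_queue = queue.Queue(maxsize=max_buffered)

    # The application thread only enqueues. If the queue is full the record is
    # not delivered (the handler's error hook is called instead of blocking).
    queue_handler = logging.handlers.QueueHandler(log_queue)

    # A background thread drains the queue and calls the slow handler.
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()

    logger = logging.getLogger(name)
    logger.addHandler(queue_handler)
    logger.setLevel(logging.INFO)
    return logger, listener


# StreamHandler stands in for a handler that ships logs over the network.
logger, listener = build_nonblocking_logger("order-service", logging.StreamHandler())
logger.info("Order placed")
listener.stop()  # flushes queued records and stops the background thread
```

The trade-off is explicit: under sustained backpressure, log entries are lost rather than requests being delayed.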
ELK Stack
The Elasticsearch, Logstash, Kibana stack is one of the most widely deployed log aggregation solutions.
ELK pipeline:
Application --> Filebeat (lightweight shipper on each host)
--> Logstash (parse, transform, enrich)
--> Elasticsearch (index and store)
--> Kibana (search and visualize)
Elasticsearch log index:
Logs are stored in daily indices: logs-2025.03.15
Each index has configurable shard count and replication
Full-text search on any field
Aggregations for dashboards
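The daily index naming above can be derived from each entry's timestamp. This is a small sketch, assuming ISO 8601 timestamps and a hypothetical helper name daily_index_name; a real pipeline would normally do this in its shipper or ingest configuration.

```python
from datetime import datetime, timezone


def daily_index_name(timestamp: str, prefix: str = "logs") -> str:
    """Map an ISO 8601 timestamp to its daily index name, using the UTC date."""
    # Replace a trailing "Z" so that older Python versions can parse the value.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return f"{prefix}-{parsed.astimezone(timezone.utc):%Y.%m.%d}"


print(daily_index_name("2025-03-15T14:32:07.445Z"))
```

Normalizing to UTC keeps entries from hosts in different time zones in the same daily index.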
Grafana Loki
Loki takes a different approach: it indexes only metadata labels, not full log content, making it much cheaper to operate.
Loki vs Elasticsearch:
| Aspect | Elasticsearch | Loki |
|-----------------|------------------------|---------------------------|
| Indexing | Full text of every log | Labels only (service, pod)|
| Query speed | Fast on any field | Fast by label, grep body |
| Storage cost | High (full index) | Low (compressed chunks) |
| Operational cost| Complex (cluster mgmt) | Simple (minimal state) |
| Best for | Ad-hoc search | Known query patterns |
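The "fast by label, grep body" row can be illustrated with a toy in-memory store. This is not how Loki is implemented; the class LabelIndexedStore and its methods are hypothetical and only show the idea of indexing labels while scanning log bodies.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy store: streams are keyed by their label set, log bodies are not indexed."""

    def __init__(self):
        self._streams = defaultdict(list)  # frozenset of (label, value) pairs -> lines

    def append(self, labels: dict, line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self._streams.items():
            # Cheap step: pick streams whose labels match the selector.
            if wanted <= stream_labels:
                # Expensive step: scan every line in the selected streams.
                results.extend(line for line in lines if contains in line)
        return results


store = LabelIndexedStore()
store.append({"service": "payment", "pod": "p1"}, "timeout connecting to gateway")
store.append({"service": "payment", "pod": "p1"}, "payment ok")
store.append({"service": "order", "pod": "o1"}, "timeout reading cart")
print(store.query({"service": "payment"}, contains="timeout"))
```

The narrower the label selector, the less text has to be scanned, which is why this model suits known query patterns.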
Retention
Log storage costs grow continuously. Retention policies balance the need for historical data against storage costs.
Tiered Retention
Tiered retention strategy:
Hot tier (0-7 days):
Fast SSD storage, full indexing
All log levels, all fields searchable
Used for active incident investigation
Warm tier (7-30 days):
Standard storage, reduced replicas
Queries are slower but still reasonable
Used for recent trend analysis
Cold tier (30-90 days):
Compressed object storage (S3, GCS)
Must be rehydrated before searching
Used for compliance and post-mortems
Archive (90 days - 7 years):
Deep archive storage (S3 Glacier)
Hours to retrieve (up to roughly two days for bulk restores from the deepest archive class)
Used for regulatory compliance only
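The tier boundaries above translate directly into a lookup by log age. A minimal sketch, with the function name storage_tier and the "expired" label as assumptions:

```python
from datetime import timedelta

# Upper age bound (exclusive) for each tier, taken from the strategy above.
TIERS = [
    (timedelta(days=7), "hot"),
    (timedelta(days=30), "warm"),
    (timedelta(days=90), "cold"),
    (timedelta(days=7 * 365), "archive"),
]


def storage_tier(age: timedelta) -> str:
    """Return the tier a log entry of the given age belongs to."""
    for limit, tier in TIERS:
        if age < limit:
            return tier
    return "expired"  # older than the longest retention window


print(storage_tier(timedelta(days=45)))
```

A scheduled job could run this check per index or per object and move data whose tier has changed.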
Retention by Log Level
Differential retention:
ERROR/FATAL: retain 1 year (critical for post-mortems)
WARN: retain 90 days
INFO: retain 30 days
DEBUG: retain 7 days (only enabled temporarily)
This can reduce storage by 60-80% compared to flat retention across all levels, depending on the level mix and the flat retention period used as the baseline.
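How much differential retention saves depends on the volume of each level and on the flat baseline. The sketch below makes that arithmetic explicit; the daily volumes per level and the 90-day flat baseline are hypothetical inputs, and the function names are assumptions.

```python
# Retention per level in days, from the policy above.
RETENTION_DAYS = {"FATAL": 365, "ERROR": 365, "WARN": 90, "INFO": 30, "DEBUG": 7}


def stored_volume(daily_gb_by_level: dict, retention_days: dict) -> float:
    """Steady-state GB stored: daily volume times days retained, summed over levels."""
    return sum(gb * retention_days[level] for level, gb in daily_gb_by_level.items())


def savings_vs_flat(daily_gb_by_level: dict, flat_days: int) -> float:
    """Fraction of storage saved compared to keeping every level for flat_days."""
    flat = sum(daily_gb_by_level.values()) * flat_days
    tiered = stored_volume(daily_gb_by_level, RETENTION_DAYS)
    return 1 - tiered / flat


# Hypothetical mix: mostly INFO, few warnings and errors, DEBUG disabled.
mix = {"ERROR": 2, "WARN": 8, "INFO": 90}
print(f"{savings_vs_flat(mix, flat_days=90):.0%} saved vs 90-day flat retention")
```

For this particular mix the saving against a 90-day flat policy is about 54%; a longer flat baseline or a larger share of short-lived levels pushes the figure higher.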
Netflix retains billions of log entries across tiered storage. Their hot tier covers the last few days for active debugging, while cold storage in S3 holds months of compressed logs for long-term analysis.
Search
Searching logs efficiently at scale requires both good indexing and good query patterns.
Query Patterns
Common log search queries:
By service and time range:
service="payment-service" AND timestamp > "2025-03-15T14:00:00Z"
By error type:
level="ERROR" AND message CONTAINS "timeout"
By user (incident investigation):
userId="u-789" AND timestamp BETWEEN "14:30" AND "14:35"
By trace (following a request):
traceId="trace-456"
Aggregation (pattern detection):
SELECT error_message, service, COUNT(*) AS count
WHERE level="ERROR" AND timestamp > now() - 1h
GROUP BY error_message, service
ORDER BY count DESC
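The same query shapes can be expressed over parsed entries in plain Python, which is sometimes handy for scripts working on an exported slice of logs. The helper names search and top_errors and the sample entries are assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample of already-parsed structured log entries.
LOGS = [
    {"level": "ERROR", "service": "payment-service", "userId": "u-789",
     "error": "timeout", "traceId": "trace-456"},
    {"level": "INFO", "service": "order-service", "userId": "u-789",
     "traceId": "trace-456"},
    {"level": "ERROR", "service": "payment-service", "userId": "u-111",
     "error": "timeout", "traceId": "trace-999"},
]


def search(entries: list, **equals) -> list:
    """Return entries whose fields equal all the given values."""
    return [e for e in entries if all(e.get(k) == v for k, v in equals.items())]


def top_errors(entries: list, limit: int = 10) -> list:
    """Count ERROR entries grouped by (error, service), most frequent first."""
    counts = Counter(
        (e.get("error"), e.get("service")) for e in entries if e.get("level") == "ERROR"
    )
    return counts.most_common(limit)


print(search(LOGS, traceId="trace-456"))
print(top_errors(LOGS))
```

Field queries like these are only possible because the entries are structured; the unstructured format would need a regex per query.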
Search Performance
Performance strategies:
Time-based partitioning:
Logs stored in daily/hourly indices
Queries always include time range to limit scan scope
Searching 1 hour of logs instead of a 30-day month scans roughly 720x less data (30 x 24 = 720)
Field indexing:
Index high-value fields: service, level, traceId, userId
Do not index high-cardinality free text (full message body)
Balance: more indexes = faster queries but higher storage
Query optimization:
Always include time range (usually the most selective filter)
Filter by indexed fields before full-text search
Use aggregations instead of retrieving millions of raw entries
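Time-based partitioning pays off because a query's time range determines which indices need to be touched at all. A minimal sketch, assuming daily indices named logs-YYYY.MM.DD and a hypothetical helper indices_for_range:

```python
from datetime import datetime, timedelta


def indices_for_range(start: datetime, end: datetime, prefix: str = "logs") -> list:
    """List the daily indices that can contain entries between start and end."""
    names = []
    day = start.date()
    while day <= end.date():
        names.append(f"{prefix}-{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


# A one-hour window touches a single daily index instead of the whole dataset.
print(indices_for_range(datetime(2025, 3, 15, 14, 0), datetime(2025, 3, 15, 15, 0)))
```

Queries without a time range cannot be pruned this way and fall back to scanning every index.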
Correlation IDs
Correlation IDs link related log entries across services, enabling end-to-end request tracking through logs alone.
Types of Correlation IDs
Trace ID:
Generated at the entry point, propagated through all services
Links all logs for a single end-to-end request
Same as the distributed tracing trace ID
Correlation ID (request ID):
Similar to trace ID, sometimes used interchangeably
May be client-generated (for idempotency tracking)
Session ID:
Links all requests from a single user session
Useful for understanding user journeys
Causation ID:
Links an effect to its cause
Event B was caused by Event A: causationId = A's eventId
Implementing Correlation IDs
Propagation flow:
1. API Gateway generates correlationId: "req-012"
2. Gateway adds to response header: X-Correlation-ID: req-012
3. Gateway logs: { correlationId: "req-012", message: "Request received" }
4. Gateway passes to Order Service via header
5. Order Service logs: { correlationId: "req-012", message: "Creating order" }
6. Order Service calls Payment Service with same header
7. Payment Service logs: { correlationId: "req-012", message: "Processing payment" }
Searching for correlationId="req-012" returns logs from ALL three services,
in chronological order, showing the complete request story.
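Inside each service, the propagation flow above usually comes down to three steps: read or generate the ID on entry, attach it to every log record, and copy it onto outgoing calls. The sketch below uses contextvars and a logging filter from Python's standard library; the helper names accept_request and outgoing_headers are assumptions, and header lookup is simplified to an exact-case dictionary key.

```python
import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlationId = correlation_id.get()
        return True


def accept_request(headers: dict) -> str:
    """Reuse the caller's correlation ID if present, otherwise generate one."""
    cid = headers.get(HEADER) or f"req-{uuid.uuid4().hex[:12]}"
    correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to add to any downstream call made while handling this request."""
    return {HEADER: correlation_id.get()}


handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(correlationId)s %(message)s"))
log = logging.getLogger("order-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

accept_request({HEADER: "req-012"})
log.info("Creating order")
print(outgoing_headers())
```

Using a context variable rather than a global keeps IDs from leaking between concurrent requests handled in different threads or async tasks.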
Correlation in Async Flows
Async correlation (message queues):
Order Service:
1. Process request (correlationId: "req-012")
2. Publish to Kafka with correlationId in message header
Notification Service (consumes later):
3. Extract correlationId from message header
4. Log: { correlationId: "req-012", message: "Sending email" }
Even though these execute minutes apart, the correlation ID
links them as part of the same logical operation.
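The async flow can be sketched with an in-memory queue standing in for the message broker. No Kafka client API is used here; the Message class and the publish and consume helpers are hypothetical and only show where the correlation ID travels.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: dict
    headers: dict = field(default_factory=dict)


def publish(topic: queue.Queue, payload: dict, correlation_id: str) -> None:
    """Producer side: carry the correlation ID in a message header, not the payload."""
    topic.put(Message(payload, {"correlationId": correlation_id}))


def consume(topic: queue.Queue) -> dict:
    """Consumer side: restore the correlation ID before logging anything."""
    message = topic.get()
    cid = message.headers.get("correlationId", "unknown")
    return {
        "correlationId": cid,
        "message": "Sending email",
        "orderId": message.payload.get("orderId"),
    }


orders = queue.Queue()  # stands in for a Kafka topic
publish(orders, {"orderId": "abc-123"}, correlation_id="req-012")
print(consume(orders))
```

Keeping the ID in a header means consumers can restore it without knowing the payload schema.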
Stripe includes a request ID in every API response. When merchants report issues, Stripe support can search for that request ID across all internal service logs to reconstruct exactly what happened.
Log Processing Patterns
Sampling High-Volume Logs
When a service handles 100K requests/second:
Full logging: 100K log entries/second = 8.6 billion/day
1% sampling: 1K entries/second = 86 million/day (99% volume reduction)
Sampling strategies:
Random: log 1% of all requests
Error-biased: log 100% of errors, 1% of successes
Head-based: decide at request start, log all or nothing for that request
Dynamic: increase sampling during incidents
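Error-biased and head-based sampling can be combined in a few lines. In this sketch the decision is derived from a hash of the trace ID, so every log entry for the same request gets the same answer; the function name should_log and the default rate are assumptions, and dynamic sampling would amount to raising success_rate during an incident.

```python
import hashlib


def should_log(trace_id: str, level: str, success_rate: float = 0.01) -> bool:
    """Keep all errors; keep a deterministic fraction of other requests."""
    if level in ("ERROR", "FATAL"):
        return True
    # Hash the trace ID to a number in [0, 1). The same trace ID always lands
    # in the same bucket, so a request is logged in full or not at all.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


kept = sum(should_log(f"trace-{i}", "INFO") for i in range(100_000))
print(f"kept {kept} of 100000 successful requests")
```

Because the decision depends only on the trace ID, different services can apply it independently and still agree on which requests are kept.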
Log Enrichment
Enriching logs at collection time:
Raw log from application:
{ "userId": "u-789", "action": "purchase" }
Enriched by log pipeline:
{
"userId": "u-789",
"action": "purchase",
"environment": "production",
"region": "us-east-1",
"kubernetes_pod": "order-svc-7b4d-02",
"kubernetes_node": "node-15",
"deploy_version": "v2.34.1",
"team": "commerce"
}
Environment metadata added without application code changes.
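In code, enrichment is a merge of pipeline-side metadata into each entry. A minimal sketch, with the function name enrich and the choice that application fields win on conflict both being assumptions:

```python
def enrich(entry: dict, metadata: dict) -> dict:
    """Return a copy of the entry with pipeline metadata added.

    Fields already set by the application are left untouched.
    """
    enriched = dict(entry)
    for key, value in metadata.items():
        enriched.setdefault(key, value)
    return enriched


# Metadata known to the collector (values taken from the example above).
PIPELINE_METADATA = {
    "environment": "production",
    "region": "us-east-1",
    "deploy_version": "v2.34.1",
    "team": "commerce",
}

raw = {"userId": "u-789", "action": "purchase"}
print(enrich(raw, PIPELINE_METADATA))
```

Returning a copy rather than mutating the entry keeps the raw log available for other pipeline stages.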
Common Pitfalls
- Logging secrets. Passwords, API keys, and tokens in logs are a security breach waiting to happen. Implement log scrubbing in the pipeline and audit regularly.
- Unstructured log formats. Free-form log messages cannot be reliably parsed or queried. Enforce structured logging standards across all services.
- No time range in queries. Searching all logs without a time range scans the entire dataset. Always include a time range as the first filter.
- Flat retention for all log levels. Retaining DEBUG logs for a year costs orders of magnitude more than retaining only ERROR logs long-term. Differentiate retention by level and importance.
- Missing correlation IDs. Without correlation IDs, logs from different services are isolated islands of information. Propagate correlation IDs across every service boundary, including async flows.
- Logging too much during normal operation. Verbose logging in the hot path degrades application performance and floods the aggregation pipeline. Use INFO for business events, DEBUG only when investigating.
Key Takeaways
- Structured logging with consistent field names across services is the foundation for log search and analysis at scale.
- Log aggregation centralizes scattered logs into a searchable system. Choose between full-text indexing (Elasticsearch) and label-based indexing (Loki) based on your query patterns and budget.
- Tiered retention with differential policies by log level can reduce storage costs by 60-80% while keeping critical data available.
- Correlation IDs link logs across services and async boundaries, enabling end-to-end request tracking without distributed tracing infrastructure.
- Sample aggressively for high-volume success logs. Retain 100% of error logs. This balances cost with debugging capability.