Structured Logging
Plain text logs were fine when you had one server and could tail -f a file. In a distributed system with dozens of services, you need logs that machines can parse, index, and query. Structured logging means emitting logs as key-value pairs -- typically JSON -- instead of free-form strings.
JSON Logs Over Plain Text
The Problem with Plain Text
2025-03-15 14:22:03 INFO Processing order 12345 for user john@example.com
2025-03-15 14:22:03 ERROR Failed to charge card for order 12345: timeout connecting to payment gateway
This is readable by humans but painful for machines. To find all errors for order 12345, you need regex. To count errors per service, you need pattern matching. To correlate across services, you need luck.
The Structured Alternative
{"timestamp":"2025-03-15T14:22:03.412Z","level":"INFO","service":"order-service","message":"Processing order","order_id":"12345","user_id":"u-789","trace_id":"abc-def-123"}
{"timestamp":"2025-03-15T14:22:03.891Z","level":"ERROR","service":"order-service","message":"Failed to charge card","order_id":"12345","error":"timeout connecting to payment gateway","trace_id":"abc-def-123","duration_ms":5003}
Now you can filter by order_id, group by level, search by trace_id, and aggregate by service -- all without regex. Most log aggregation systems (Elasticsearch, Loki, Datadog) can ingest or parse JSON logs without custom patterns.
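To make the difference concrete, here is a minimal sketch of querying JSON log lines with nothing but the standard library. The sample lines are shortened versions of the examples above, plus one made-up line for a second order; the helper names (`parse_lines`, `errors_for_order`) are illustrative, not part of any library.

```python
import json
from collections import Counter

# Sample newline-delimited JSON logs (the third line is invented for illustration)
RAW_LOGS = """\
{"level":"INFO","service":"order-service","message":"Processing order","order_id":"12345","trace_id":"abc-def-123"}
{"level":"ERROR","service":"order-service","message":"Failed to charge card","order_id":"12345","trace_id":"abc-def-123","duration_ms":5003}
{"level":"INFO","service":"order-service","message":"Processing order","order_id":"67890","trace_id":"xyz-000-999"}
"""


def parse_lines(raw: str) -> list:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def errors_for_order(records: list, order_id: str) -> list:
    """Return ERROR records for one order using plain field comparisons."""
    return [
        r for r in records
        if r.get("level") == "ERROR" and r.get("order_id") == order_id
    ]


records = parse_lines(RAW_LOGS)
order_errors = errors_for_order(records, "12345")
levels = Counter(r["level"] for r in records)

print(order_errors[0]["message"])
print(dict(levels))
```

In practice the aggregation system runs these queries for you; the point is that each one is a field comparison rather than a pattern match.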
Standard Fields
Consistency matters more than perfection. Agree on field names across your organization and stick to them.
Required Fields
| Field | Description | Example |
|---|---|---|
| timestamp | ISO 8601 with timezone | 2025-03-15T14:22:03.412Z |
| level | Log severity | INFO, ERROR |
| message | Human-readable description | Failed to charge card |
| service | Which service emitted this | order-service |
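One way to guarantee the required fields on every line is to set them in the formatter rather than at each call site. The sketch below uses only Python's standard `logging` module; the `JsonFormatter` class and the `required_fields_demo` logger name are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the four required fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # Build an ISO 8601 UTC timestamp with millisecond precision
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
        }
        return json.dumps(payload)


# Write to an in-memory stream so the output can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="order-service"))

logger = logging.getLogger("required_fields_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("Processing order")
entry = json.loads(stream.getvalue())
print(entry)
```

Because the service name and timestamp format live in one place, individual log calls cannot forget them or format them differently.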
Recommended Fields
| Field | Description | Example |
|---|---|---|
| trace_id | Distributed trace identifier | abc-def-123 |
| span_id | Current span in trace | span-456 |
| request_id | Unique ID for this request | req-789 |
| user_id | Who triggered this action | u-789 |
| environment | Deployment environment | production |
| version | Application version | v2.3.1 |
| duration_ms | How long the operation took | 142 |
| error | Error message or type | TimeoutError |
Field Naming Conventions
Pick one convention and enforce it. Snake_case is the most common in logging:
Good: trace_id, user_id, order_id, duration_ms
Bad: traceId, TraceID, trace-id (mixing conventions)
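A convention is easier to enforce when a check can run in tests or CI. The following is a rough sketch of two hypothetical helpers: `to_snake_case` converts the mixed styles shown above, and `non_snake_case_fields` reports offending keys in a log event. The regular expressions cover the common cases only and are not a complete naming linter.

```python
import re

_CAMEL_WORD = re.compile(r"(.)([A-Z][a-z]+)")
_CAMEL_EDGE = re.compile(r"([a-z0-9])([A-Z])")
_SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def to_snake_case(name: str) -> str:
    """Convert camelCase, PascalCase, or kebab-case to snake_case."""
    name = name.replace("-", "_")
    name = _CAMEL_WORD.sub(r"\1_\2", name)
    name = _CAMEL_EDGE.sub(r"\1_\2", name)
    # Collapse any doubled underscores introduced by the steps above
    return re.sub(r"_+", "_", name).lower()


def non_snake_case_fields(event: dict) -> list:
    """Return the keys of a log event that violate the snake_case convention."""
    return sorted(key for key in event if not _SNAKE.match(key))


print(to_snake_case("traceId"))
print(non_snake_case_fields({"trace_id": 1, "userId": 2, "order-id": 3}))
```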
Log Levels
Use log levels consistently across all services. Each level has a purpose.
DEBUG
Detailed information useful during development. Never enable in production by default -- the volume will overwhelm your log aggregation system and your budget.
{"level":"DEBUG","message":"Cache lookup","key":"user:789","hit":true,"latency_ms":2}
INFO
Normal operations worth recording. The heartbeat of your application.
{"level":"INFO","message":"Order created","order_id":"12345","items":3,"total":149.99}
WARN
Something unexpected happened but the operation continued. Investigate if the rate increases.
{"level":"WARN","message":"Retry succeeded","service":"payment-gateway","attempt":2,"order_id":"12345"}
ERROR
An operation failed. Requires attention. Link to enough context to diagnose.
{"level":"ERROR","message":"Payment failed","order_id":"12345","error":"connection refused","gateway":"stripe","trace_id":"abc-def-123"}
Choosing the Right Level
Ask: "If I see 1000 of these in a minute, should someone investigate?"
- If yes immediately, it is ERROR.
- If yes eventually, it is WARN.
- If no, it is INFO.
- If only a developer debugging cares, it is DEBUG.
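As an illustration of how these rules play out inside one code path, here is a sketch of a retry helper that picks the level from the outcome. The function `charge_with_retry`, the `flaky_charge` stand-in, and the `LevelCapture` handler are all hypothetical and exist only to make the example runnable.

```python
import logging

logger = logging.getLogger("payment_demo")


def charge_with_retry(charge, order_id: str, max_attempts: int = 3):
    """Call `charge` until it succeeds, choosing the log level by outcome."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Only useful to a developer debugging: DEBUG
        logger.debug("Charge attempt", extra={"order_id": order_id, "attempt": attempt})
        try:
            result = charge()
        except ConnectionError as exc:
            last_error = exc
            continue
        if attempt > 1:
            # Unexpected, but the operation continued: WARN
            logger.warning("Retry succeeded", extra={"order_id": order_id, "attempt": attempt})
        else:
            # Normal operation: INFO
            logger.info("Payment succeeded", extra={"order_id": order_id})
        return result
    # The operation failed and needs attention: ERROR
    logger.error("Payment failed", extra={"order_id": order_id, "error": str(last_error)})
    raise RuntimeError(f"payment failed for order {order_id}") from last_error


class LevelCapture(logging.Handler):
    """Collect the level of every record so the behaviour can be inspected."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def emit(self, record):
        self.levels.append(record.levelname)


capture = LevelCapture()
logger.addHandler(capture)
logger.setLevel(logging.DEBUG)
logger.propagate = False

# Stand-in for a payment call: fails once, then succeeds
outcomes = iter([ConnectionError("connection refused"), "charged"])


def flaky_charge():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = charge_with_retry(flaky_charge, order_id="12345")
print(result, capture.levels)
```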
Context Propagation
The most powerful aspect of structured logging is carrying context across the request lifecycle.
Request IDs
Generate a unique ID at the entry point (API gateway, load balancer) and pass it through every service call:
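import uuid
import logging
from contextvars import ContextVar
logger = logging.getLogger(__name__)
# Holds the current request's ID; works across threads and asyncio tasks
request_id_var = ContextVar("request_id", default="-")
def middleware(request, call_next):
    # Reuse the caller's ID when present; otherwise generate a new one
    request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id_var.set(request_id)
    response = call_next(request)
    # Echo the ID back so clients can quote it in bug reports
    response.headers["X-Request-ID"] = request_id
    return response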
Every log line from every service includes the same request_id. When a user reports a problem, one ID traces the entire journey.
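Storing the ID is only half the job; it also has to reach every log line without each call site passing it by hand. One possible approach, sketched here with the standard library, is a `logging.Filter` that copies the ID from a `contextvars.ContextVar` onto each record. The names `request_id_var`, `RequestIdFilter`, and `JsonFormatter` are illustrative assumptions, and the variable is set directly here where a middleware would normally do it.

```python
import io
import json
import logging
from contextvars import ContextVar

# Assumed to be set by request middleware; "-" marks logs outside any request
request_id_var = ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": record.request_id,
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("request_id_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Simulate one request, then a log line emitted outside any request
token = request_id_var.set("req-789")
logger.info("Order created")
request_id_var.reset(token)
logger.info("Background job tick")

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```

A context variable is used instead of a thread-local so the same code behaves correctly under asyncio, where many requests share one thread.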
Trace Context
If you use distributed tracing (OpenTelemetry), the trace ID and span ID are automatically available. Include them in logs:
from opentelemetry import trace
def log_with_trace(logger, message, **kwargs):
    # Read the IDs from the span that is active in the current context
    span = trace.get_current_span()
    ctx = span.get_span_context()
    logger.info(
        message,
        extra={
            # Hex-encode the IDs the way tracing backends display them
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
            **kwargs,
        },
    )
This bridges logs and traces. Click a trace ID in your log aggregation tool and jump directly to the trace view.
Do Not Log PII
Personally identifiable information (PII) in logs creates legal liability and compliance violations (GDPR, HIPAA, CCPA). Sanitize before logging.
Never log:
- Email addresses
- Phone numbers
- IP addresses (in many jurisdictions)
- Credit card numbers
- Social security numbers
- Passwords or tokens
Instead:
- Log user IDs, not usernames or emails
- Log order IDs, not shipping addresses
- Log token hashes, not tokens
- Log error types, not full stack traces containing user data
# Bad
logger.info("User login", extra={"email": "john@example.com", "ip": "192.168.1.1"})
# Good
logger.info("User login", extra={"user_id": "u-789", "region": "us-east-1"})
If you must log PII for debugging, use a separate log stream with stricter access controls and shorter retention.
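Sanitizing is more reliable when it happens in one place instead of relying on every call site. Below is a minimal sketch of a `sanitize` function that redacts or hashes fields by key before an event is logged. The key lists and the `[REDACTED]` placeholder are assumptions to adapt to your own schema, and the function only inspects top-level keys, so PII embedded in message strings or nested objects would still get through.

```python
import hashlib

# Hypothetical field names; adjust to match your organization's schema
SENSITIVE_KEYS = {"email", "phone", "ip", "card_number", "ssn", "password"}
HASHED_KEYS = {"token"}


def sanitize(event: dict) -> dict:
    """Return a copy of the event that is safer to log."""
    clean = {}
    for key, value in event.items():
        if key in SENSITIVE_KEYS:
            # Keep the key so readers can see a value was present
            clean[key] = "[REDACTED]"
        elif key in HASHED_KEYS:
            # Log a short hash so occurrences can be correlated without the secret
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            clean[f"{key}_sha256"] = digest[:16]
        else:
            clean[key] = value
    return clean


safe = sanitize({
    "user_id": "u-789",
    "email": "john@example.com",
    "token": "secret-session-token",
})
print(safe)
```

A function like this can be wired into a logging filter or a processor chain so that it runs on every event automatically.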
Structured Logging Libraries
Python (structlog)
import structlog
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()
logger.info("order.created", order_id="12345", items=3, total=149.99)
Example output (key order and timestamp precision may vary):
{"timestamp":"2025-03-15T14:22:03.412Z","level":"info","event":"order.created","order_id":"12345","items":3,"total":149.99}
Go (zerolog)
import "github.com/rs/zerolog/log"
log.Info().
    Str("order_id", "12345").
    Int("items", 3).
    Float64("total", 149.99).
    Msg("order created")
Node.js (pino)
const pino = require("pino");
const logger = pino({ level: "info" });
logger.info({ order_id: "12345", items: 3, total: 149.99 }, "order created");
Java (Logback + Logstash Encoder)
<!-- logback.xml -->
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
  <encoder class="net.logstash.logback.encoder.LogstashEncoder" />
</appender>
import org.slf4j.Logger;
import static net.logstash.logback.argument.StructuredArguments.*;
logger.info("order created", kv("order_id", "12345"), kv("items", 3));
Dynamic Log Levels
In production, you typically run at INFO level. But when debugging an issue, you need DEBUG logs for a specific service or even a specific request. Support runtime log level changes without redeploying:
# Change the log level at runtime through an admin HTTP endpoint
curl -X POST http://myapp:8080/admin/loglevel -d '{"level": "DEBUG"}'
Some frameworks support per-request log level overrides via a header:
X-Log-Level: DEBUG
This gives you debug-level logging for one request without flooding the system.
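If your framework has no such feature, a per-request override can be approximated with the standard library. The sketch below is one possible design, not a framework API: the logger itself stays at DEBUG, and a handler-level filter drops records below a threshold read from a context variable. `level_override`, `PerRequestLevelFilter`, `level_from_header`, and `handle_request` are hypothetical names.

```python
import io
import logging
from contextvars import ContextVar
from typing import Optional

DEFAULT_LEVEL = logging.INFO
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
}

# None means "no override for this request"
level_override = ContextVar("level_override", default=None)


class PerRequestLevelFilter(logging.Filter):
    """Drop records below the request's threshold, or the default if unset."""

    def filter(self, record: logging.LogRecord) -> bool:
        threshold = level_override.get()
        if threshold is None:
            threshold = DEFAULT_LEVEL
        return record.levelno >= threshold


def level_from_header(headers: dict) -> Optional[int]:
    """Map an X-Log-Level header value to a logging level, if recognized."""
    return LEVELS.get(headers.get("X-Log-Level", "").upper())


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(PerRequestLevelFilter())
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("per_request_level_demo")
logger.setLevel(logging.DEBUG)  # let everything through; the filter decides
logger.addHandler(handler)
logger.propagate = False


def handle_request(headers: dict) -> None:
    token = level_override.set(level_from_header(headers))
    try:
        logger.debug("Cache lookup")
        logger.info("Order created")
    finally:
        level_override.reset(token)


handle_request({})                        # default: INFO and above
handle_request({"X-Log-Level": "DEBUG"})  # override: DEBUG for this request only
output = stream.getvalue().splitlines()
print(output)
```

Two trade-offs are worth noting: every DEBUG record is still created before being filtered, which costs some CPU, and the header should only be honored from trusted callers so that outside clients cannot raise your log volume.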
Common Pitfalls
- Inconsistent field names. One service uses userId, another uses user_id, a third uses user. Standardize and enforce with linting.
- Logging too much. Logging every database query at INFO level creates enormous volume and cost. Use DEBUG for verbose output.
- Logging too little. An ERROR log that says "something went wrong" is useless. Include the context: what operation, what input, what error.
- Logging PII. Emails, IPs, and card numbers in logs create compliance risk. Audit your log output regularly.
- Not including trace IDs. Without correlation IDs, debugging across services requires matching timestamps and guessing. Always propagate trace context.
- Mixing structured and unstructured logs. One library outputs JSON, another outputs plain text. The aggregation system cannot parse both consistently.
- Not testing log output. Verify that your structured logs parse correctly. A missing closing brace in JSON breaks the entire log line.
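The last pitfall is cheap to guard against. As a rough sketch, a check like the one below can run in unit tests against captured log output; `validate_log_line` is a hypothetical helper, and the required field set mirrors the one listed earlier in this guide.

```python
import json

REQUIRED_FIELDS = {"timestamp", "level", "message", "service"}


def validate_log_line(line: str) -> list:
    """Return a list of problems; an empty list means the line is valid."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(entry, dict):
        return ["top-level value is not an object"]
    missing = sorted(REQUIRED_FIELDS - entry.keys())
    return [f"missing field: {name}" for name in missing]


good = '{"timestamp":"2025-03-15T14:22:03.412Z","level":"INFO","message":"Order created","service":"order-service"}'
truncated = good[:-1]  # simulate a missing closing brace
no_service = '{"timestamp":"2025-03-15T14:22:03.412Z","level":"INFO","message":"Order created"}'

print(validate_log_line(good))
print(validate_log_line(truncated))
print(validate_log_line(no_service))
```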
Key Takeaways
- Emit logs as JSON with consistent field names. Every log line should be machine-parseable.
- Standardize on required fields: timestamp, level, message, service. Add trace_id and request_id for correlation.
- Use log levels deliberately. ERROR means an operation failed. WARN means something unexpected happened but the operation continued. INFO means normal operations. DEBUG is for development.
- Propagate request IDs and trace IDs through every service call. This is what makes debugging possible in distributed systems.
- Never log PII. Log identifiers, not personal data.
- Choose a structured logging library for your language and configure it at application startup. The effort is minimal and the payoff is immediate.