Structured Logging

Plain text logs were fine when you had one server and could tail -f a file. In a distributed system with dozens of services, you need logs that machines can parse, index, and query. Structured logging means emitting logs as key-value pairs -- typically JSON -- instead of free-form strings.

JSON Logs Over Plain Text

The Problem with Plain Text

2025-03-15 14:22:03 INFO Processing order 12345 for user john@example.com
2025-03-15 14:22:03 ERROR Failed to charge card for order 12345: timeout connecting to payment gateway

This is readable by humans but painful for machines. To find all errors for order 12345, you need regex. To count errors per service, you need pattern matching. To correlate across services, you need luck.

The Structured Alternative

{"timestamp":"2025-03-15T14:22:03.412Z","level":"INFO","service":"order-service","message":"Processing order","order_id":"12345","user_id":"u-789","trace_id":"abc-def-123"}
{"timestamp":"2025-03-15T14:22:03.891Z","level":"ERROR","service":"order-service","message":"Failed to charge card","order_id":"12345","error":"timeout connecting to payment gateway","trace_id":"abc-def-123","duration_ms":5003}

Now you can filter by order_id, group by level, search by trace_id, and aggregate by service -- all without regex. The major log aggregation systems (Elasticsearch, Loki, Datadog, and others) can ingest and query JSON logs with little or no custom parsing.
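To make "without regex" concrete, here is a minimal sketch that parses the two JSON lines above with Python's standard json module and queries them by field. The helper names parse_lines and errors_for_order are invented for this illustration; in practice your aggregation system runs the equivalent query for you.

```python
import json
from collections import Counter

RAW_LOGS = """\
{"timestamp":"2025-03-15T14:22:03.412Z","level":"INFO","service":"order-service","message":"Processing order","order_id":"12345","user_id":"u-789","trace_id":"abc-def-123"}
{"timestamp":"2025-03-15T14:22:03.891Z","level":"ERROR","service":"order-service","message":"Failed to charge card","order_id":"12345","error":"timeout connecting to payment gateway","trace_id":"abc-def-123","duration_ms":5003}
"""

def parse_lines(raw):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

def errors_for_order(records, order_id):
    """Return ERROR records for one order using plain field comparisons."""
    return [
        r for r in records
        if r.get("level") == "ERROR" and r.get("order_id") == order_id
    ]

records = parse_lines(RAW_LOGS)
errors = errors_for_order(records, "12345")
per_level = Counter(r["level"] for r in records)

print(errors[0]["message"])  # Failed to charge card
print(dict(per_level))       # {'INFO': 1, 'ERROR': 1}
```

The same two lookups against the plain-text version would each need a hand-written pattern that breaks as soon as the message wording changes.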

Standard Fields

Consistency matters more than perfection. Agree on field names across your organization and stick to them.

Required Fields

Field        Description                   Example
timestamp    ISO 8601 with timezone        2025-03-15T14:22:03.412Z
level        Log severity                  INFO, ERROR
message      Human-readable description    Failed to charge card
service      Which service emitted this    order-service

Recommended Fields

Field        Description                   Example
trace_id     Distributed trace identifier  abc-def-123
span_id      Current span in trace         span-456
request_id   Unique ID for this request    req-789
user_id      Who triggered this action     u-789
environment  Deployment environment        production
version      Application version           v2.3.1
duration_ms  How long the operation took   142
error        Error message or type         TimeoutError
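One way to guarantee that the four required fields (timestamp, level, message, service) appear on every line is to centralize them in a formatter. The sketch below uses only Python's standard logging module; JsonFormatter and SERVICE_NAME are names made up for this example, and a dedicated library would normally replace this hand-rolled class.

```python
import io
import json
import logging
from datetime import datetime, timezone

SERVICE_NAME = "order-service"  # assumption: normally read from config or env

# Attributes every LogRecord has; anything else arrived via `extra`
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": SERVICE_NAME,
        }
        # Copy caller-supplied fields (order_id, trace_id, ...) from `extra`
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS and key not in entry:
                entry[key] = value
        return json.dumps(entry, default=str)

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("fields_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("Order created", extra={"order_id": "12345", "duration_ms": 142})
print(stream.getvalue().strip())
```

Because the required fields are filled in by the formatter rather than by each call site, a developer cannot forget them.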

Field Naming Conventions

Pick one convention and enforce it; snake_case is the most common in logging:

Good: trace_id, user_id, order_id, duration_ms
Bad:  traceId, TraceID, trace-id (mixing conventions)
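Enforcement can be as simple as a check that runs in tests or CI. The following is a rough sketch, assuming snake_case is the agreed convention; the function name non_snake_case_keys is hypothetical.

```python
import re

# Lowercase words separated by single underscores, e.g. trace_id, duration_ms
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

def non_snake_case_keys(record):
    """Return the field names in a log record that are not snake_case."""
    return [key for key in record if not SNAKE_CASE.fullmatch(key)]

print(non_snake_case_keys({"trace_id": "abc", "duration_ms": 5}))
# []
print(non_snake_case_keys({"traceId": "abc", "TraceID": 1, "trace-id": 2}))
# ['traceId', 'TraceID', 'trace-id']
```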

Log Levels

Use log levels consistently across all services. Each level has a purpose.

DEBUG

Detailed information useful during development. Never enable in production by default -- the volume will overwhelm your log aggregation system and your budget.

{"level":"DEBUG","message":"Cache lookup","key":"user:789","hit":true,"latency_ms":2}

INFO

Normal operations worth recording. The heartbeat of your application.

{"level":"INFO","message":"Order created","order_id":"12345","items":3,"total":149.99}

WARN

Something unexpected happened but the operation continued. Investigate if the rate increases.

{"level":"WARN","message":"Retry succeeded","service":"payment-gateway","attempt":2,"order_id":"12345"}

ERROR

An operation failed and requires attention. Include enough context to diagnose it.

{"level":"ERROR","message":"Payment failed","order_id":"12345","error":"connection refused","gateway":"stripe","trace_id":"abc-def-123"}

Choosing the Right Level

Ask: "If I see 1000 of these in a minute, should someone investigate?"

  • If yes, and right away, it is ERROR.
  • If yes, but eventually, it is WARN.
  • If no, it is INFO.
  • If only a developer debugging cares, it is DEBUG.

Context Propagation

The most powerful aspect of structured logging is carrying context across the request lifecycle.

Request IDs

Generate a unique ID at the entry point (API gateway, load balancer) and pass it through every service call:

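import uuid
from contextvars import ContextVar

# Holds the current request's ID; works across threads and async tasks
request_id_var = ContextVar("request_id", default="-")

def middleware(request, call_next):
    # Reuse the caller's ID if one was sent, otherwise generate a new one
    request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        response = call_next(request)
    finally:
        # Restore the previous value so IDs never leak between requests
        request_id_var.reset(token)
    response.headers["X-Request-ID"] = request_id
    return response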

Once each service attaches this ID to its log records and forwards the header on outgoing calls, every log line for the request carries the same request_id. When a user reports a problem, one ID traces the entire journey.
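Storing the ID is only half the job; something still has to copy it onto each log record. One possible approach with the standard library is a logging.Filter that reads a context variable. In this sketch, request_id_var, RequestIdFilter, and the small JsonFormatter are illustrative names, and the set/reset calls stand in for what a middleware would do.

```python
import io
import json
import logging
from contextvars import ContextVar

# Hypothetical context variable that the middleware would populate
request_id_var = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Annotate every record with the current request ID."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": record.request_id,
        })

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("request_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

token = request_id_var.set("req-789")  # what the middleware would do
logger.info("Processing order")
request_id_var.reset(token)
logger.info("Idle")                    # outside any request

print(stream.getvalue())
```

Call sites stay unchanged: they log as usual, and the filter supplies the correlation field.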

Trace Context

If you use distributed tracing (OpenTelemetry), the trace ID and span ID are automatically available. Include them in logs:

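from opentelemetry import trace

def log_with_trace(logger, message, **kwargs):
    # With no active span, this returns an invalid span whose IDs are all zeros
    span = trace.get_current_span()
    ctx = span.get_span_context()
    logger.info(
        message,
        extra={
            # Hex-encode the IDs so they match what tracing backends display
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
            **kwargs,
        },
    )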

This bridges logs and traces. Click a trace ID in your log aggregation tool and jump directly to the trace view.

Do Not Log PII

Personally identifiable information (PII) in logs can create legal liability and compliance violations under regulations such as GDPR, HIPAA, and CCPA. Sanitize before logging.

Never log:

  • Email addresses
  • Phone numbers
  • IP addresses (in many jurisdictions)
  • Credit card numbers
  • Social security numbers
  • Passwords or tokens

Instead:

  • Log user IDs, not usernames or emails
  • Log order IDs, not shipping addresses
  • Log token hashes, not tokens
  • Log error types, not full stack traces containing user data

# Bad
logger.info("User login", extra={"email": "john@example.com", "ip": "192.168.1.1"})

# Good
logger.info("User login", extra={"user_id": "u-789", "region": "us-east-1"})

If you must log PII for debugging, use a separate log stream with stricter access controls and shorter retention.
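As a safety net, fields can be passed through a sanitizer before they reach the logger. The sketch below is one possible shape, not a complete PII detector: the key lists, the sanitize function, and the email pattern are all assumptions for illustration, and the first line of defense remains not passing personal data to the logger at all.

```python
import hashlib
import re

SENSITIVE_KEYS = {"email", "phone", "ip", "card_number", "ssn", "password"}
TOKEN_KEYS = {"token", "api_key"}
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def sanitize(fields):
    """Return a copy of `fields` that is safer to log."""
    clean = {}
    for key, value in fields.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif key in TOKEN_KEYS:
            # Log a short hash so the same token can still be correlated
            digest = hashlib.sha256(str(value).encode()).hexdigest()
            clean[key + "_hash"] = digest[:12]
        elif isinstance(value, str):
            # Catch emails embedded in free-form text such as error messages
            clean[key] = EMAIL_PATTERN.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean

print(sanitize({
    "user_id": "u-789",
    "email": "john@example.com",
    "error": "no account for john@example.com",
}))
# {'user_id': 'u-789', 'email': '[REDACTED]', 'error': 'no account for [EMAIL]'}
```

Hashing is only appropriate for high-entropy values such as tokens; hashing an email or phone number does not anonymize it, which is why those keys are redacted outright here.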

Structured Logging Libraries

Python (structlog)

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()
logger.info("order.created", order_id="12345", items=3, total=149.99)

Example output (key order and timestamp precision may differ):

{"timestamp":"2025-03-15T14:22:03.412Z","level":"info","event":"order.created","order_id":"12345","items":3,"total":149.99}

Go (zerolog)

package main

import "github.com/rs/zerolog/log"

func main() {
    // Each chained call adds one typed field; Msg writes the JSON line
    log.Info().
        Str("order_id", "12345").
        Int("items", 3).
        Float64("total", 149.99).
        Msg("order created")
}

Node.js (pino)

const pino = require("pino");
const logger = pino({ level: "info" });

// Use snake_case keys so fields match those emitted by other services
logger.info({ order_id: "12345", items: 3, total: 149.99 }, "order created");

Java (Logback + Logstash Encoder)

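<!-- logback.xml -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder" />
  </appender>
  <root level="INFO"><appender-ref ref="JSON" /></root>
</configuration>

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import static net.logstash.logback.argument.StructuredArguments.kv;

public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);
    void create() { logger.info("order created", kv("order_id", "12345"), kv("items", 3)); }
}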

Dynamic Log Levels

In production, you typically run at INFO level. But when debugging an issue, you need DEBUG logs for a specific service or even a specific request. Support runtime log level changes without redeploying:

# Change the log level at runtime through an admin endpoint (example URL)
curl -X POST http://myapp:8080/admin/loglevel -H 'Content-Type: application/json' -d '{"level": "DEBUG"}'

Some setups also support per-request log level overrides via a custom header (the name is a convention, not a standard):

X-Log-Level: DEBUG

This gives you debug-level logging for one request without flooding the system.
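Both mechanisms can be sketched with the standard library: a filter decides per record, using a per-request override when one is set and a mutable default otherwise. Everything named here (level_override, PerRequestLevelFilter, set_default_level) is hypothetical; the middleware that reads the header and the admin endpoint that calls set_default_level are left out.

```python
import io
import logging
from contextvars import ContextVar

LEVELS = {"DEBUG": logging.DEBUG, "INFO": logging.INFO,
          "WARN": logging.WARNING, "ERROR": logging.ERROR}

# Hypothetical per-request override, set by middleware from the header
level_override = ContextVar("level_override", default=None)

class PerRequestLevelFilter(logging.Filter):
    """Pass records at or above the override level, else the default level."""

    def __init__(self, default_level=logging.INFO):
        super().__init__()
        self.default_level = default_level

    def filter(self, record):
        threshold = level_override.get() or self.default_level
        return record.levelno >= threshold

def set_default_level(level_filter, level_name):
    """What an admin endpoint handler might call to change the level live."""
    level_filter.default_level = LEVELS[level_name.upper()]

stream = io.StringIO()  # stands in for stdout so the output can be inspected
level_filter = PerRequestLevelFilter()
handler = logging.StreamHandler(stream)
handler.addFilter(level_filter)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("dynamic_demo")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)  # let everything through; the filter decides
logger.propagate = False

logger.debug("cache lookup (normal request)")  # dropped: below INFO
token = level_override.set(logging.DEBUG)      # request carried the header
logger.debug("cache lookup (debug request)")   # kept
level_override.reset(token)

print(stream.getvalue())  # DEBUG cache lookup (debug request)
```

If you adopt the header approach, honor it only for trusted callers; otherwise anyone can raise your log volume at will.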

Common Pitfalls

  • Inconsistent field names. One service uses userId, another uses user_id, a third uses user. Standardize and enforce with linting.
  • Logging too much. Logging every database query at INFO level creates enormous volume and cost. Use DEBUG for verbose output.
  • Logging too little. An ERROR log that says "something went wrong" is useless. Include the context: what operation, what input, what error.
  • Logging PII. Emails, IPs, and card numbers in logs create compliance risk. Audit your log output regularly.
  • Not including trace IDs. Without correlation IDs, debugging across services requires matching timestamps and guessing. Always propagate trace context.
  • Mixing structured and unstructured logs. One library outputs JSON, another outputs plain text. The aggregation system cannot parse both consistently.
  • Not testing log output. Verify that your structured logs parse correctly. A missing closing brace in JSON breaks the entire log line.
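The last two pitfalls lend themselves to an automated check. The sketch below, which could run in a test suite against captured log output, flags lines that are not valid JSON objects or that lack the required fields. The function name check_log_output is an assumption, and the required field set follows the table earlier on this page.

```python
import json

REQUIRED_FIELDS = {"timestamp", "level", "message", "service"}

def check_log_output(raw_output):
    """Return a list of problems found in newline-delimited JSON log output."""
    problems = []
    for number, line in enumerate(raw_output.splitlines(), start=1):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: not valid JSON ({exc.msg})")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"line {number}: missing {sorted(missing)}")
    return problems

good = '{"timestamp":"2025-03-15T14:22:03.412Z","level":"INFO","message":"ok","service":"order-service"}'
truncated = '{"timestamp":"2025-03-15T14:22:03.412Z","level":"INFO"'
plain = "2025-03-15 14:22:03 INFO Processing order 12345"

print(check_log_output(good))                                     # []
print(len(check_log_output("\n".join([good, truncated, plain])))) # 2
```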

Key Takeaways

  • Emit logs as JSON with consistent field names. Every log line should be machine-parseable.
  • Standardize on required fields: timestamp, level, message, service. Add trace_id and request_id for correlation.
  • Use log levels deliberately. ERROR means something is broken. INFO means normal operations. DEBUG is for development.
  • Propagate request IDs and trace IDs through every service call. This is what makes debugging possible in distributed systems.
  • Never log PII. Log identifiers, not personal data.
  • Choose a structured logging library for your language and configure it at application startup. The effort is minimal and the payoff is immediate.