Log Aggregation

When you run 50 services across 200 containers, you cannot SSH into each one to read logs. Log aggregation collects logs from every source, ships them to a central store, and provides a search interface. This is not optional infrastructure -- it is how you debug production.

The Architecture

Every log aggregation system follows the same pattern:

Application -> Log Shipper -> Central Store -> Query Interface

Applications write logs to stdout (containers) or files.
A log shipper collects and forwards them.
A central store indexes and retains them.
An interface lets you search and analyze.
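
The first stage of this pipeline, an application writing structured logs to stdout, can be sketched in Python. This is a minimal illustration using only the standard library; the JsonFormatter and build_logger names and the field set (timestamp, level, service, message) are assumptions chosen for this example, not a required schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            # "service" is a hypothetical field passed via `extra=`.
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

Calling build_logger("order-service").info("order created", extra={"service": "order-service"}) emits a single JSON line that a shipper can parse without custom regexes.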

Central Logging Stacks

ELK (Elasticsearch, Logstash, Kibana)

The original open-source log aggregation stack.

Elasticsearch: Stores and indexes logs. Full-text search. Powerful but resource-hungry.
Logstash: Ingests, transforms, and routes logs. Heavy -- often replaced by lighter alternatives.
Kibana: Web UI for searching and visualizing logs.

# Logstash pipeline configuration
input {
  beats {
    port => 5044
  }
}

filter {
  json {
    source => "message"
  }
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
  mutate {
    remove_field => ["agent", "ecs", "host"]
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
  }
}

ELK is powerful but operationally expensive. Elasticsearch clusters need careful capacity planning, and storage costs scale linearly with log volume.

Loki + Grafana

Grafana Loki is the lightweight alternative. It indexes only labels (metadata), not the full log content. Because the index stays small and log chunks are stored compressed in object storage, it is much cheaper to run than a full-text index.

# Loki configuration (Loki 2.x syntax; Loki 3.x removes shared_store and favors the tsdb index with schema v13)
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: s3
  aws:
    s3: s3://us-east-1/my-loki-bucket

Query with LogQL in Grafana:

{service="order-service"} |= "ERROR" | json | level="ERROR" | duration_ms > 1000

Loki is the natural choice if you already run Grafana for metrics. The trade-off: full-text search is slower than Elasticsearch because content is not pre-indexed.
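
To make the trade-off concrete, here is a toy model of label-only indexing. It is not how Loki is implemented; the LabelIndexedStore class and its methods are hypothetical names for this sketch. Selecting streams by label is a cheap lookup, while matching text inside the lines requires scanning every line of the selected streams.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy store: indexes label sets only, scans line content at query time."""

    def __init__(self) -> None:
        # Maps a frozen label set to the raw lines of that stream.
        self._streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        """Append a line to the stream identified by its label set."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str) -> list:
        """Return lines from streams matching `selector` that contain the text."""
        wanted = set(selector.items())
        matches = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap: match on indexed labels
                # Expensive: brute-force scan of the unindexed content.
                matches.extend(line for line in lines if contains in line)
        return matches
```

The narrower the label selector, the fewer lines the scan has to touch, which is why queries that start with specific labels tend to perform best in this model.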

Datadog, Splunk & Cloud Providers

Managed services handle infrastructure for you. Datadog Logs, Splunk Cloud, AWS CloudWatch Logs, and Google Cloud Logging all provide collection, storage, search, and alerting.

The trade-off is cost. At high volume (hundreds of GB per day), managed services become expensive. But for many teams, the operational simplicity is worth it -- you are paying to not manage Elasticsearch clusters.

Log Shippers

Log shippers run on every node or as sidecars, collecting logs and forwarding them to the central store.

Fluent Bit

Lightweight, designed for containers. Low memory footprint (typically under 10MB). The standard choice for Kubernetes.

# Fluent Bit configuration
[SERVICE]
    Flush        1
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name             tail
    Tag              kube.*
    Path             /var/log/containers/*.log
    # On containerd or CRI-O nodes, use the cri parser instead of docker
    Parser           docker
    Refresh_Interval 10
    Mem_Buf_Limit    5MB

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_Tag_Prefix     kube.var.log.containers.
    Merge_Log           On

[OUTPUT]
    Name            loki
    Match           *
    Host            loki.monitoring.svc.cluster.local
    Port            3100
    Labels          job=fluent-bit, namespace=$kubernetes['namespace_name'], pod=$kubernetes['pod_name']

Fluentd

The more feature-rich sibling. More plugins, more transformation capability, but uses more resources. Use Fluentd when you need complex log processing (multi-line parsing, routing to multiple destinations, custom transformations).

# Fluentd configuration
<source>
  @type forward
  port 24224
</source>

<filter **>
  @type parser
  key_name log
  reserve_data true
  <parse>
    @type json
  </parse>
</filter>

<match **>
  @type elasticsearch
  host elasticsearch.monitoring.svc.cluster.local
  port 9200
  logstash_format true
  logstash_prefix logs
  <buffer>
    # Memory buffers are lost if Fluentd crashes; use @type file with a path when durability matters
    @type memory
    flush_interval 5s
    chunk_limit_size 8MB
  </buffer>
</match>

Vector

A newer open-source option, now maintained by Datadog. Written in Rust, high performance, can replace both Fluent Bit and Logstash in many architectures.

# vector.yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs

transforms:
  parse_json:
    type: remap
    inputs: [kubernetes_logs]
    source: |
      # Merge the parsed JSON into the event so the .kubernetes metadata is kept
      . = merge(., object!(parse_json!(.message)))
      .namespace = .kubernetes.pod_namespace
      .pod = .kubernetes.pod_name

sinks:
  loki:
    type: loki
    inputs: [parse_json]
    endpoint: http://loki:3100
    labels:
      service: "{{ service }}"
      namespace: "{{ namespace }}"
    encoding:
      codec: json

Log Retention

Logs have a lifecycle. Not all logs need to be instantly searchable forever. Define tiers based on access patterns and cost.

Retention Tiers

Tier	Duration	Storage	Use Case
Hot	7 days	Elasticsearch / Loki	Active debugging, recent incidents
Warm	30 days	Cheaper index, slower queries	Incident investigation, recent trends
Cold	90 days	Object storage (S3, GCS)	Compliance, audit, rare lookups
Archive	1+ years	Glacier, deep archive	Legal hold, regulatory requirements

Implement retention with Elasticsearch's Index Lifecycle Management (ILM) to automatically roll over, shrink, and delete indices by age. For Loki, set retention_period in limits_config and run the compactor with retention_enabled: true; without the compactor, the retention setting has no effect.
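
The tier table can be expressed as a small lookup, which is handy when scripting lifecycle jobs. This sketch assumes the durations are cumulative ages (a log leaves the hot tier after day 7, the warm tier after day 30, and so on); RETENTION_TIERS and tier_for_age are hypothetical names for illustration.

```python
# (tier name, maximum age in days); None means no upper bound.
# Assumption: durations from the retention table are cumulative ages.
RETENTION_TIERS = [
    ("hot", 7),
    ("warm", 30),
    ("cold", 90),
    ("archive", None),
]


def tier_for_age(age_days: float) -> str:
    """Return the retention tier a log of the given age belongs to."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for name, max_age in RETENTION_TIERS:
        if max_age is None or age_days <= max_age:
            return name
    raise AssertionError("unreachable: the last tier has no upper bound")
```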

Cost Management

Logs are expensive at scale. A busy service producing 10KB per request at 1000 requests per second generates about 864GB per day in decimal units (10KB x 1000 x 86,400 seconds). Multiply by the number of services. The storage and indexing costs add up fast.

Strategies to Control Cost

Drop unnecessary fields. Remove verbose metadata that nobody queries. Strip Kubernetes annotations, agent information, and redundant fields at the shipper level.

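# Fluent Bit: skip Kubernetes annotations and labels at enrichment time
# (the modify filter only removes top-level keys, not nested ones)
[FILTER]
    Name         kubernetes
    Match        kube.*
    Annotations  Off
    Labels       Off

# Remove a top-level field
[FILTER]
    Name         modify
    Match        *
    Remove       stream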
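
The same field-dropping idea, sketched in Python for pipelines that process records in application code. drop_fields is a hypothetical helper, and the dotted-path convention is an assumption of this example rather than a shipper feature.

```python
import copy


def drop_fields(record: dict, paths: list) -> dict:
    """Return a copy of `record` without the given dotted paths."""
    result = copy.deepcopy(record)
    for path in paths:
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node.get(key)
            if not isinstance(node, dict):
                break  # the path does not exist; nothing to remove
        else:
            node.pop(leaf, None)
    return result
```

The input record is left untouched, so the original can still be routed elsewhere (for example to an archive sink) with all fields intact.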

Sample verbose logs. You do not need every DEBUG log in production. Sample them -- keep 1 in 10 or 1 in 100.

# Vector sampling transform
transforms:
  sample_debug:
    type: sample
    inputs: [parse_json]
    rate: 10  # Keep 1 in 10 DEBUG events
    exclude:  # Events matching this condition bypass sampling and are always kept
      type: vrl
      source: .level != "DEBUG"
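
The sampling rule (keep every non-DEBUG event, keep 1 in N DEBUG events) can be sketched as a small counter-based filter. DebugSampler is a hypothetical class for illustration; real shippers may sample randomly or by key instead of by position.

```python
class DebugSampler:
    """Keep every non-DEBUG record and 1 in `rate` DEBUG records."""

    def __init__(self, rate: int) -> None:
        if rate < 1:
            raise ValueError("rate must be >= 1")
        self.rate = rate
        self._debug_seen = 0

    def keep(self, record: dict) -> bool:
        """Return True if the record should be forwarded."""
        if record.get("level") != "DEBUG":
            return True
        self._debug_seen += 1
        # Keep the 1st, (rate+1)th, (2*rate+1)th ... DEBUG record.
        return (self._debug_seen - 1) % self.rate == 0
```

Counter-based sampling is deterministic, which makes it easy to test; the cost is that bursts of identical messages are thinned by position rather than by content.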

Set appropriate retention. Do not keep 90 days of hot storage if nobody looks at logs older than 7 days. Move old data to cold storage or delete it.

Use Loki over Elasticsearch for most workloads. If you do not need full-text search on every field, Loki's label-only indexing is significantly cheaper.

Aggregate instead of log. If you are logging to count something, use a metric instead. A counter increment costs almost nothing compared to a log line.
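
A rough illustration of the difference in footprint: counting events in memory keeps one integer per distinct key, while logging them stores one line per event. The function names and the route/status fields are assumptions made for this sketch.

```python
import json
from collections import Counter


def count_events(events: list) -> Counter:
    """Aggregate events into one counter per (route, status) pair."""
    counts = Counter()
    for event in events:
        counts[(event["route"], event["status"])] += 1
    return counts


def logged_bytes(events: list) -> int:
    """Approximate bytes written if every event became one JSON log line."""
    return sum(len(json.dumps(event)) + 1 for event in events)  # +1 for newline
```

For a thousand identical requests, the counter holds a single entry, whereas the logged form grows with every request.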

Cost Estimation

A rough formula: daily storage (GB) = avg log size (KB) x logs/sec x 86400 / 1024 / 1024. Note that this uses binary units (1 GB = 1024 x 1024 KB). A service generating 5KB logs at 500/sec produces roughly 206 GB/day. With 30-day retention, that is about 6,180 GB (roughly 6 TB) stored, before replication and index overhead. Evaluate whether you need all those logs or whether sampling and shorter retention would suffice.
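
The rough formula translates directly into a helper for quick what-if estimates. The function names are hypothetical, and the result ignores compression, replication, and index overhead, so treat it as an order-of-magnitude guide only.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_gb(avg_log_kb: float, logs_per_sec: float) -> float:
    """Daily log volume in GB, using binary units (1 GB = 1024 * 1024 KB)."""
    return avg_log_kb * logs_per_sec * SECONDS_PER_DAY / 1024 / 1024


def retained_volume_gb(daily_gb: float, retention_days: int) -> float:
    """Steady-state storage once the retention window is full."""
    return daily_gb * retention_days


if __name__ == "__main__":
    daily = daily_volume_gb(avg_log_kb=5, logs_per_sec=500)
    print(f"daily: {daily:.0f} GB")  # roughly 206 GB
    print(f"30-day retention: {retained_volume_gb(daily, 30):.0f} GB")
```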

Kubernetes Log Collection

In Kubernetes, containers write to stdout/stderr. The container runtime captures these to files on the node. Deploy Fluent Bit as a DaemonSet that reads from /var/log/containers/ on each node and forwards to your central store. Mount the host's /var/log directory and provide configuration via a ConfigMap.
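
Shippers can recover pod metadata from the log file name itself. The sketch below assumes the common /var/log/containers/ naming layout of pod, namespace, and container name separated by underscores, followed by a 64-character container ID; parse_container_log_path is a hypothetical helper, not part of any shipper's API.

```python
import re
from typing import Optional

# Assumed layout: <pod>_<namespace>_<container>-<64 hex container id>.log
_CONTAINER_LOG = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_path(path: str) -> Optional[dict]:
    """Extract pod, namespace, container, and container id from a log path."""
    filename = path.rsplit("/", 1)[-1]
    match = _CONTAINER_LOG.match(filename)
    return match.groupdict() if match else None
```

Paths that do not follow the assumed layout return None, so callers can fall back to shipping the line without Kubernetes metadata.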

Common Pitfalls

No centralized logging. If you are still SSHing into containers to read logs, you are losing hours per incident. Set up aggregation.
Logging everything at DEBUG. The volume overwhelms storage and search. Use INFO as the default production level.
No retention policy. Logs grow forever. Disks fill up. Costs spiral. Define retention tiers from day one.
Ignoring cost until the bill arrives. Estimate log volume before choosing a stack. A 500GB/day workload on managed Elasticsearch is a very different proposition from the same workload on Loki.
Shipping logs without parsing. Raw unstructured logs in your aggregation system are hard to query. Parse JSON at the shipper level.
Single point of failure. If the log shipper crashes and its in-memory logs are lost, you have a gap. Use persistent buffers and acknowledgment-based delivery.
Not sampling verbose logs. You do not need every health check log line. Sample or drop repetitive low-value logs.
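
The persistent-buffer and acknowledgment idea from the pitfalls above can be sketched as a spool file that only shrinks after the central store confirms receipt. DiskBuffer, flush, and the send callback are hypothetical names for this illustration; production shippers add batching limits, rotation, and retry backoff.

```python
import json
import os
from pathlib import Path


class DiskBuffer:
    """Append-only spool file; records stay on disk until acknowledged."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.touch(exist_ok=True)

    def append(self, record: dict) -> None:
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the write durable before returning

    def pending(self) -> list:
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def ack(self, count: int) -> None:
        """Drop the first `count` records after the store confirmed them."""
        remaining = self.pending()[count:]
        tmp = self.path.with_suffix(".tmp")
        with tmp.open("w", encoding="utf-8") as f:
            for record in remaining:
                f.write(json.dumps(record) + "\n")
        os.replace(tmp, self.path)  # atomic swap of the spool file


def flush(buffer: DiskBuffer, send) -> int:
    """Send pending records; acknowledge only if `send` reports success."""
    batch = buffer.pending()
    if batch and send(batch):
        buffer.ack(len(batch))
        return len(batch)
    return 0
```

If send fails or the process restarts, the records are still in the spool file and go out on the next flush, which gives at-least-once delivery at the cost of possible duplicates.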

Key Takeaways

Centralize your logs. Pick a stack (ELK, Loki + Grafana, or a managed service) and ship everything there.
Use lightweight shippers like Fluent Bit in Kubernetes. Reserve Fluentd or Vector for complex transformations.
Define retention tiers: hot (7 days), warm (30 days), cold (90 days). Match storage cost to access frequency.
Manage cost actively. Drop unnecessary fields, sample verbose logs, and use metrics instead of logs when counting things.
Parse logs at the shipper level so they are searchable in the central store.
Estimate your daily log volume before committing to a stack. The right choice at 10GB/day may be wrong at 500GB/day.