Log Aggregation
When you run 50 services across 200 containers, you cannot SSH into each one to read logs. Log aggregation collects logs from every source, ships them to a central store, and provides a search interface. This is not optional infrastructure -- it is how you debug production.
The Architecture
Every log aggregation system follows the same pattern:
Application -> Log Shipper -> Central Store -> Query Interface
- Applications write logs to stdout (containers) or files.
- A log shipper collects and forwards them.
- A central store indexes and retains them.
- An interface lets you search and analyze.
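The first stage of this pipeline can be sketched in Python: an application that writes one JSON object per line to stdout, which a shipper can then parse without custom rules. This is a minimal sketch using only the standard library; the `JsonFormatter` class, the `build_logger` helper, and the field names (`timestamp`, `level`, `service`, `message`) are illustrative assumptions, not a required schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # ISO 8601 timestamp in UTC, easy for shippers to parse
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            # "service" is a hypothetical field passed via `extra=`
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger("order-service")
logger.info("order created", extra={"service": "order-service"})
```

Writing to stdout keeps the application unaware of where logs end up; the shipper and central store can change without touching application code.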
Central Logging Stacks
ELK (Elasticsearch, Logstash, Kibana)
The original open-source log aggregation stack.
- Elasticsearch: Stores and indexes logs. Full-text search. Powerful but resource-hungry.
- Logstash: Ingests, transforms, and routes logs. Heavy -- often replaced by lighter alternatives.
- Kibana: Web UI for searching and visualizing logs.
# Logstash pipeline configuration
input {
  beats {
    port => 5044
  }
}
filter {
  json {
    source => "message"
  }
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
  mutate {
    remove_field => ["agent", "ecs", "host"]
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
  }
}
ELK is powerful but operationally expensive. Elasticsearch clusters need careful capacity planning, and storage costs scale linearly with log volume.
Loki + Grafana
Grafana Loki is the lightweight alternative. It indexes only labels (metadata), not the full log content, so the index stays small and the compressed log chunks can live in object storage. This makes it much cheaper to run than a full-text index.
# Loki configuration
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: s3
  aws:
    s3: s3://us-east-1/my-loki-bucket
Query with LogQL in Grafana:
{service="order-service"} |= "error" | json | level="ERROR" | duration_ms > 1000
Loki is the natural choice if you already run Grafana for metrics. The trade-off: full-text search is slower than Elasticsearch because content is not pre-indexed.
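The trade-off can be illustrated with a toy model in Python. This is a hypothetical in-memory sketch, not Loki's actual storage format: the `LabelIndexedStore` class and its methods are invented for illustration. Selecting streams by label is a cheap lookup, while the content filter has to scan every line in the selected streams.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy store that indexes label sets only; log lines are opaque."""

    def __init__(self) -> None:
        # Key: frozen set of (label, value) pairs. Value: raw log lines.
        self.streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        """Append a line to the stream identified by its label set."""
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str) -> list:
        """Select streams by label, then brute-force scan their lines."""
        wanted = set(selector.items())
        matches = []
        for labels, lines in self.streams.items():
            if wanted <= labels:  # cheap: decided from the label index alone
                # expensive: every line in the stream is inspected
                matches.extend(line for line in lines if contains in line)
        return matches


store = LabelIndexedStore()
store.push({"service": "order-service"}, "error: payment timeout")
store.push({"service": "order-service"}, "order created")
store.push({"service": "cart-service"}, "error: cart missing")
print(store.query({"service": "order-service"}, "error"))
```

The narrower the label selector, the fewer lines the scan has to touch, which is why selective labels matter for query speed in a label-indexed store.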
Datadog, Splunk & Cloud Providers
Managed services handle infrastructure for you. Datadog Logs, Splunk Cloud, AWS CloudWatch Logs, and Google Cloud Logging all provide collection, storage, search, and alerting.
The trade-off is cost. At high volume (hundreds of GB per day), managed services become expensive. But for many teams, the operational simplicity is worth it -- you are paying to not manage Elasticsearch clusters.
Log Shippers
Log shippers run on every node or as sidecars, collecting logs and forwarding them to the central store.
Fluent Bit
Lightweight, designed for containers. Low memory footprint (typically under 10MB). The standard choice for Kubernetes.
# Fluent Bit configuration
[SERVICE]
    Flush        1
    Log_Level    info
    Parsers_File parsers.conf
[INPUT]
    Name             tail
    Tag              kube.*
    Path             /var/log/containers/*.log
    # The docker parser expects Docker's JSON log format; on containerd or CRI-O nodes use the cri parser
    Parser           docker
    Refresh_Interval 10
    Mem_Buf_Limit    5MB
[FILTER]
    Name            kubernetes
    Match           kube.*
    Kube_URL        https://kubernetes.default.svc:443
    Kube_Tag_Prefix kube.var.log.containers.
    Merge_Log       On
[OUTPUT]
    Name   loki
    Match  *
    Host   loki.monitoring.svc.cluster.local
    Port   3100
    Labels job=fluent-bit, namespace=$kubernetes['namespace_name'], pod=$kubernetes['pod_name']
Fluentd
The more feature-rich sibling. More plugins, more transformation capability, but uses more resources. Use Fluentd when you need complex log processing (multi-line parsing, routing to multiple destinations, custom transformations).
# Fluentd configuration
<source>
  @type forward
  port 24224
</source>
<filter **>
  @type parser
  key_name log
  reserve_data true
  <parse>
    @type json
  </parse>
</filter>
<match **>
  @type elasticsearch
  host elasticsearch.monitoring.svc.cluster.local
  port 9200
  logstash_format true
  logstash_prefix logs
  <buffer>
    @type memory
    flush_interval 5s
    chunk_limit_size 8MB
  </buffer>
</match>
Vector
A newer open-source option, originally built by Timber and now maintained by Datadog. Written in Rust, high performance, can replace both Fluent Bit and Logstash in many architectures.
# vector.yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
transforms:
  parse_json:
    type: remap
    inputs: [kubernetes_logs]
    source: |
      # Save pod metadata first: replacing the event with the parsed message drops .kubernetes
      namespace = .kubernetes.pod_namespace
      pod = .kubernetes.pod_name
      . = parse_json!(.message)
      .namespace = namespace
      .pod = pod
sinks:
  loki:
    type: loki
    inputs: [parse_json]
    endpoint: http://loki:3100
    labels:
      service: "{{ service }}"
      namespace: "{{ namespace }}"
    encoding:
      codec: json
Log Retention
Logs have a lifecycle. Not all logs need to be instantly searchable forever. Define tiers based on access patterns and cost.
Retention Tiers
| Tier | Duration | Storage | Use Case |
|---|---|---|---|
| Hot | 7 days | Elasticsearch / Loki | Active debugging, recent incidents |
| Warm | 30 days | Cheaper index, slower queries | Incident investigation, recent trends |
| Cold | 90 days | Object storage (S3, GCS) | Compliance, audit, rare lookups |
| Archive | 1+ years | Glacier, deep archive | Legal hold, regulatory requirements |
Implement retention with Elasticsearch's Index Lifecycle Management (ILM) to automatically roll over, shrink, and delete indices by age. For Loki, set retention_period in limits_config and enable retention on the compactor (retention_enabled: true); without the compactor, the retention setting has no effect.
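The tier boundaries in the table can be expressed as a small lookup, which is handy when scripting a cleanup job or checking a policy. This is a sketch in Python; the `retention_tier` function and `TIERS` constant are hypothetical names, and the cut-offs are the ones from the table, not a recommendation for every workload.

```python
from datetime import timedelta

# Upper age bound for each tier, taken from the retention table.
TIERS = [
    (timedelta(days=7), "hot"),
    (timedelta(days=30), "warm"),
    (timedelta(days=90), "cold"),
]


def retention_tier(age: timedelta) -> str:
    """Return the storage tier for a log record of the given age."""
    for limit, tier in TIERS:
        if age <= limit:
            return tier
    # Anything older than the last bound goes to long-term archive.
    return "archive"


print(retention_tier(timedelta(days=3)))    # hot
print(retention_tier(timedelta(days=45)))   # cold
```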
Cost Management
Logs are expensive at scale. A busy service producing 10KB of logs per request at 1000 requests per second writes about 10MB per second, which is roughly 864GB per day (10MB x 86,400 seconds, in decimal units). Multiply by the number of services. The storage and indexing costs add up fast.
Strategies to Control Cost
Drop unnecessary fields. Remove verbose metadata that nobody queries. Strip Kubernetes annotations, agent information, and redundant fields at the shipper level.
# Fluent Bit filters to remove fields
[FILTER]
    Name        kubernetes
    Match       kube.*
    # modify only matches top-level keys, so drop nested Kubernetes metadata here
    Annotations Off
    Labels      Off
[FILTER]
    Name   modify
    Match  *
    Remove stream
Sample verbose logs. You do not need every DEBUG log in production. Sample them -- keep 1 in 10 or 1 in 100.
# Vector sampling transform
transforms:
  sample_debug:
    type: sample
    inputs: [parse_json]
    rate: 10 # Keep 1 in 10
    # Events matching exclude bypass sampling, so only DEBUG logs are sampled
    exclude:
      type: vrl
      source: '.level != "DEBUG"'
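The 1-in-N idea itself is simple enough to sketch in Python. This is a minimal counter-based sampler, assuming records are dicts with a `level` field; the `DebugSampler` class is a hypothetical name, and real shippers may sample differently (for example by hashing a key).

```python
class DebugSampler:
    """Keep every non-DEBUG record, and 1 in `rate` DEBUG records."""

    def __init__(self, rate: int) -> None:
        if rate < 1:
            raise ValueError("rate must be >= 1")
        self.rate = rate
        self._debug_seen = 0

    def keep(self, record: dict) -> bool:
        """Return True if the record should be forwarded."""
        if record.get("level") != "DEBUG":
            return True  # never drop INFO, WARN, ERROR
        self._debug_seen += 1
        # Forward the Nth, 2Nth, 3Nth ... DEBUG record.
        return self._debug_seen % self.rate == 0


sampler = DebugSampler(rate=10)
records = [{"level": "DEBUG"}] * 100 + [{"level": "ERROR"}] * 5
kept = [r for r in records if sampler.keep(r)]
print(len(kept))  # 15: 10 sampled DEBUG records plus all 5 ERROR records
```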
Set appropriate retention. Do not keep 90 days of hot storage if nobody looks at logs older than 7 days. Move old data to cold storage or delete it.
Use Loki over Elasticsearch for most workloads. If you do not need full-text search on every field, Loki's label-only indexing is significantly cheaper.
Aggregate instead of log. If you are logging to count something, use a metric instead. A counter increment costs almost nothing compared to a log line.
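A rough way to see the difference, sketched in Python: a thousand identical failure events written as log lines versus the same events folded into a counter. The event shape and the `checkout_failed` name are made up for illustration, and a real metrics client would replace the `Counter`.

```python
import json
from collections import Counter

# Hypothetical events: the same failure observed 1000 times.
events = [{"event": "checkout_failed", "reason": "card_declined"}] * 1000

# Logging approach: one JSON line (plus newline) per event.
log_bytes = sum(len(json.dumps(e)) + 1 for e in events)

# Metric approach: one counter per (event, reason) pair.
counts = Counter((e["event"], e["reason"]) for e in events)

print(log_bytes)    # grows with every event
print(len(counts))  # stays at 1 no matter how many events arrive
```

The log volume grows linearly with traffic, while the counter's size depends only on the number of distinct label combinations.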
Cost Estimation
A rough formula: daily storage (GB) = avg log size (KB) x logs/sec x 86400 / 1024 / 1024, where 86400 is the number of seconds in a day and the two divisions convert KB to GB. A service generating 5KB logs at 500/sec produces roughly 206 GB/day. With 30-day retention, that is about 6.2 TB stored. Evaluate whether you need all those logs or whether sampling and shorter retention would suffice.
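The same estimate as a small Python helper, useful for plugging in your own numbers. The function names are illustrative; the formula is the one from the text, and the terabyte conversion assumes decimal units (1 TB = 1000 GB).

```python
SECONDS_PER_DAY = 86_400


def daily_volume_gb(avg_log_kb: float, logs_per_sec: float) -> float:
    """Daily log volume in GB: KB per log x logs/sec x seconds/day, KB -> GB."""
    return avg_log_kb * logs_per_sec * SECONDS_PER_DAY / 1024 / 1024


def stored_tb(daily_gb: float, retention_days: int) -> float:
    """Total stored volume in TB for a given retention window."""
    return daily_gb * retention_days / 1000


daily = daily_volume_gb(avg_log_kb=5, logs_per_sec=500)
print(round(daily))                    # 206
print(round(stored_tb(daily, 30), 1))  # 6.2
```

This covers raw volume only; index overhead, replication, and compression will move the real number in either direction.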
Kubernetes Log Collection
In Kubernetes, containers write to stdout/stderr. The container runtime captures these to files on the node. Deploy Fluent Bit as a DaemonSet that reads from /var/log/containers/ on each node and forwards to your central store. Mount the host's /var/log directory and provide configuration via a ConfigMap.
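Files under /var/log/containers/ are named `<pod>_<namespace>_<container>-<container id>.log`, which is how a shipper can recover pod metadata from the path alone. The following Python sketch parses that name; the `parse_container_log_path` helper is hypothetical, and it assumes a 64-character hexadecimal container ID.

```python
import re
from typing import Optional

# <pod>_<namespace>_<container>-<64 hex char container id>.log
_CONTAINER_LOG = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_path(path: str) -> Optional[dict]:
    """Extract pod, namespace, container and container ID from a log path.

    Returns None when the file name does not follow the expected pattern.
    """
    file_name = path.rsplit("/", 1)[-1]
    match = _CONTAINER_LOG.match(file_name)
    return match.groupdict() if match else None


example = "/var/log/containers/order-service-7d9f_prod_app-" + "a" * 64 + ".log"
print(parse_container_log_path(example))
```

In practice the shipper enriches this with labels and annotations from the Kubernetes API, but the file name already gives enough to route logs by namespace and pod.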
Common Pitfalls
- No centralized logging. If you are still SSHing into containers to read logs, you are losing hours per incident. Set up aggregation.
- Logging everything at DEBUG. The volume overwhelms storage and search. Use INFO as the default production level.
- No retention policy. Logs grow forever. Disks fill up. Costs spiral. Define retention tiers from day one.
- Ignoring cost until the bill arrives. Estimate log volume before choosing a stack. A 500GB/day workload on managed Elasticsearch is a very different proposition than on Loki.
- Shipping logs without parsing. Raw unstructured logs in your aggregation system are hard to query. Parse JSON at the shipper level.
- Single point of failure. If the log shipper crashes while holding logs only in memory, those logs are lost and you have a gap. Use persistent (on-disk) buffers and acknowledgement-based delivery.
- Not sampling verbose logs. You do not need every health check log line. Sample or drop repetitive low-value logs.
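The persistent-buffer and acknowledgement idea from the list above can be sketched in Python. This is a deliberately simplified, single-threaded model: the `DiskBuffer` class and the `send` callback are hypothetical, and production shippers add chunking, retries with backoff, and size limits.

```python
import json
import os
from typing import Callable


class DiskBuffer:
    """Append-only on-disk queue; records are removed only after an ack."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, record: dict) -> None:
        """Persist a record before it is considered accepted."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # survive a process crash

    def pending(self) -> list:
        """Return all records that have not been acknowledged yet."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def flush(self, send: Callable[[list], bool]) -> bool:
        """Send pending records; delete them only if the sink acknowledges."""
        batch = self.pending()
        if not batch:
            return True
        if send(batch):  # True means the central store confirmed receipt
            os.remove(self.path)
            return True
        return False  # keep the data on disk and retry later
```

If the sink is down, `flush` returns False and the records stay on disk, so a shipper restart picks up where it left off instead of leaving a gap.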
Key Takeaways
- Centralize your logs. Pick a stack (ELK, Loki + Grafana, or a managed service) and ship everything there.
- Use lightweight shippers like Fluent Bit in Kubernetes. Reserve Fluentd or Vector for complex transformations.
- Define retention tiers: hot (7 days), warm (30 days), cold (90 days). Match storage cost to access frequency.
- Manage cost actively. Drop unnecessary fields, sample verbose logs, and use metrics instead of logs when counting things.
- Parse logs at the shipper level so they are searchable in the central store.
- Estimate your daily log volume before committing to a stack. The right choice at 10GB/day may be wrong at 500GB/day.