Alerting That Works
Alerting is the bridge between monitoring and action. Good alerting wakes you up when users are affected. Bad alerting wakes you up at 3 AM because a single node's CPU touched 82% for thirty seconds. The difference between the two is the difference between a sustainable on-call rotation and burnout.
Alert on Symptoms, Not Causes
This is the single most important principle. Alert on what users experience, not on internal system state.
Symptom-based (good):
- Error rate exceeds 1% for 5 minutes
- p99 latency exceeds 2 seconds for 10 minutes
- Successful login rate drops below 95%
- Payment processing failures exceed 0.1%
Cause-based (usually bad):
- CPU usage above 80%
- Memory usage above 90%
- Disk usage above 75%
- A single pod restarted
CPU at 80% might be perfectly normal during peak traffic. A pod restart might be Kubernetes doing its job. These are not problems until they affect users. Cause-based metrics are useful on dashboards for investigation, but they should rarely trigger pages.
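As a rough illustration, the login and payment conditions from the symptom-based list could be written as Prometheus alerting rules (the rule format is covered in more detail later in this article). The metric names `login_attempts_total` and `payment_attempts_total`, their `result` label, the `for` durations, and the severity choices are all hypothetical. Substitute whatever your services actually export.

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: LoginSuccessRateLow
        # Hypothetical counter with a result="success" label for successful logins.
        expr: |
          sum(rate(login_attempts_total{result="success"}[5m]))
          /
          sum(rate(login_attempts_total[5m]))
          < 0.95
        for: 5m  # assumed duration; the list above does not specify one
        labels:
          severity: critical
        annotations:
          summary: "Successful login rate is below 95%"
      - alert: PaymentFailureRateHigh
        # Hypothetical counter with a result="failure" label for failed payments.
        expr: |
          sum(rate(payment_attempts_total{result="failure"}[5m]))
          /
          sum(rate(payment_attempts_total[5m]))
          > 0.001
        for: 5m  # assumed duration
        labels:
          severity: critical
        annotations:
          summary: "Payment failures exceed 0.1% of attempts"
```

Both rules are ratios, so they get noisy when traffic is very low. A handful of failures out of a few dozen attempts can cross the threshold, which is one reason the `for` duration matters.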
The Exception
Some cause-based alerts are legitimate because they predict imminent user impact:
- Disk will be full in 4 hours (based on linear prediction)
- Certificate expires in 7 days
- Database connection pool is 95% exhausted
These are not alerting on current state -- they are alerting on trajectory. That distinction matters.
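The certificate and connection pool examples can be sketched in the same Prometheus rule syntax used later in this article. This is a minimal sketch, not a drop-in config. `probe_ssl_earliest_cert_expiry` is the certificate-expiry metric exposed by the Prometheus blackbox exporter, which this article does not otherwise assume you run. `db_pool_connections_in_use` and `db_pool_connections_max` are hypothetical names standing in for whatever your database client exports. The `for` durations and runbook paths are also assumptions.

```yaml
groups:
  - name: trajectory-alerts
    rules:
      - alert: CertificateExpiresIn7Days
        # Seconds remaining until the earliest certificate in the chain expires.
        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires within 7 days"
          runbook: "https://wiki.example.com/runbooks/certificate-expiry"
      - alert: ConnectionPoolNearlyExhausted
        # Hypothetical gauges; both must carry identical labels for the division to match.
        expr: db_pool_connections_in_use / db_pool_connections_max > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connection pool on {{ $labels.instance }} is more than 95% used"
          runbook: "https://wiki.example.com/runbooks/connection-pool"
```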
Writing Alert Rules
Prometheus Alerting Rules
```yaml
# alerts.yml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency on {{ $labels.service }}"
          description: "{{ $labels.service }} p99 latency is {{ $value }}s."
          runbook: "https://wiki.example.com/runbooks/high-latency"
      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"
          runbook: "https://wiki.example.com/runbooks/disk-full"
```
Key elements:
- `for`: The condition must be true for this duration before the alert fires. This prevents alerting on brief spikes. A 5-minute `for` means the error rate must stay above 1% for a full 5 minutes.
- `labels.severity`: Used for routing. Critical pages go to on-call. Warnings go to a Slack channel.
- `annotations.runbook`: A direct link to what to do when this alert fires.
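One way to see how `for` behaves without waiting for a real incident is a rule unit test run with `promtool test rules`. The sketch below assumes the rules above are saved as `alerts.yml` next to the test file. The file name `alerts_test.yml` and the `api` service are made up for illustration. It feeds in synthetic counters with a steady 6% error rate and checks that the alert is still pending at 3 minutes and firing at 12 minutes.

```yaml
# alerts_test.yml (hypothetical file name)
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 6 errors and 94 successes per minute: a constant 6% error rate.
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+6x20'
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+94x20'
    alert_rule_test:
      # The condition is already true, but the 5m "for" has not elapsed yet.
      - eval_time: 3m
        alertname: HighErrorRate
        exp_alerts: []
      # The condition has now held for well over 5 minutes, so the alert fires.
      - eval_time: 12m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: api
            exp_annotations:
              summary: "High error rate on api"
              description: "api error rate is 6% over the last 5 minutes."
              runbook: "https://wiki.example.com/runbooks/high-error-rate"
```

Running `promtool test rules alerts_test.yml` should pass as long as the rule and its annotations match the file above. If you change the annotation text or the `for` duration, the expectations need to change with them.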
Severity Levels
Not every alert is equal. Define clear severity levels and route them appropriately.
Critical (Page)
User-facing impact is happening right now. Someone needs to respond within minutes.
Examples:
- Service is returning 5xx errors to users
- Payment processing is failing
- The application is completely down
Route to: PagerDuty/OpsGenie, phone call to on-call engineer.
Warning (Notify)
Something is degraded or heading toward a problem. Response within hours is appropriate.
Examples:
- Latency is elevated but within SLO
- Disk usage is trending toward full
- Certificate expires in 14 days
Route to: Slack channel, email to the team.
Info (Log)
Nothing is broken, but something noteworthy happened. No response required.
Examples:
- A deployment completed
- Auto-scaling added instances
- A scheduled job finished
Route to: Monitoring channel, dashboard annotation.
```yaml
# Alertmanager routing configuration
route:
  receiver: default-slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: team-slack
      repeat_interval: 4h
    - match:
        severity: info
      receiver: monitoring-slack
      repeat_interval: 24h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "your-pagerduty-integration-key"
        severity: critical
  - name: team-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#platform-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
  - name: monitoring-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#monitoring"
  - name: default-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#alerts-default"
```
Runbooks
Every alert should link to a runbook. A runbook is a document that tells the on-call engineer what to do when the alert fires. It is not a novel. It is a checklist.
A good runbook includes:
```markdown
## Alert: HighErrorRate

### What it means
The service is returning HTTP 5xx errors above the threshold (1% of traffic).

### Impact
Users are seeing errors. Affected functionality: user authentication, API requests.

### Investigation steps
1. Check which endpoint is failing:
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)
2. Check recent deployments:
   kubectl rollout history deployment/myapp
3. Check application logs:
   kubectl logs -l app=myapp --tail=100
4. Check downstream dependencies:
   curl -s http://database-service:8080/health
   curl -s http://cache-service:8080/health

### Remediation
- If a recent deploy caused it: kubectl rollout undo deployment/myapp
- If a downstream service is down: check that service's alerts and status
- If the database is slow: check connection pool and query performance

### Escalation
If not resolved in 30 minutes, page the service owner: @backend-team
```
Without runbooks, every alert becomes a research project at 3 AM. With runbooks, the on-call engineer has a starting point.
Alert Fatigue
Alert fatigue is what happens when your alerting system sends so many notifications that the on-call engineer stops paying attention. Every false positive erodes trust. After enough false alarms, real alerts get ignored.
Signs of Alert Fatigue
- On-call engineers acknowledge alerts without investigating
- Alerts are routinely silenced or snoozed
- The alert channel has hundreds of unread messages
- New team members ask "do we actually look at these?"
How to Fix It
Delete alerts nobody acts on. If an alert has fired 50 times and no one has ever taken action, it is noise. Remove it or fix the underlying flakiness.
Increase thresholds and durations. CPU at 80% for 30 seconds is not a problem. Raise the threshold or lengthen the `for` duration so that only sustained conditions fire.
Group related alerts. If one failing database triggers 15 alerts from 15 services, group them. Alertmanager's group_by helps:
```yaml
route:
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
```
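Grouping reduces the number of notifications. Alertmanager's `inhibit_rules` complement it by muting lower-severity alerts while a related higher-severity alert is firing. The following is a minimal sketch, assuming alerts carry the `severity` and `service` labels used earlier in this article. It uses the older `source_match` / `target_match` keys to stay consistent with the `match` syntax in the routing example; newer Alertmanager versions also offer `source_matchers` and `target_matchers`.

```yaml
inhibit_rules:
  # While a critical alert fires for a service, mute warnings for that same service.
  - source_match:
      severity: critical
    target_match:
      severity: warning
    # Only inhibit when both alerts have the same value for these labels.
    equal: [service]
```

Note that this only suppresses alerts that share a `service` label. It does not collapse the fan-out from one failing database to many dependent services, which is what `group_by` on a broader label such as `cluster` is for.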
Measure alert quality. Track how many alerts fire per week, how many result in action, and how many are false positives. Review monthly.
Good targets:
- On-call gets paged fewer than 2 times per shift
- >80% of pages result in meaningful action
- Zero false-positive pages per week
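Part of this measurement can come from Prometheus itself. The sketch below uses two series that Prometheus and Alertmanager expose: the synthetic `ALERTS` series, which has one sample per evaluation for each pending or firing alert, and `alertmanager_notifications_total`, which is only available if Prometheus scrapes Alertmanager. The group name, recording rule names, and the 7-day window are assumptions.

```yaml
groups:
  - name: alert-quality
    interval: 5m
    rules:
      # Count of firing samples over the past 7 days, summed across all
      # instances of each alert. Multiply by the alert rule's evaluation
      # interval to estimate total time spent firing.
      - record: alertname_severity:alerts_firing:count_over_time_7d
        expr: |
          sum by (alertname, severity) (
            count_over_time(ALERTS{alertstate="firing"}[7d])
          )
      # Notifications sent per integration (e.g. pagerduty, slack) over 7 days.
      # This includes repeat notifications, so it over-counts distinct pages.
      - record: integration:alertmanager_notifications:increase_7d
        expr: |
          sum by (integration) (
            increase(alertmanager_notifications_total[7d])
          )
```

These rules show how often alerts fire and notify. They cannot show whether anyone acted on them. That part still has to come from your paging tool's reports or from the on-call handoff notes.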
PagerDuty & OpsGenie Integration
Alertmanager ships native receivers for both PagerDuty (`pagerduty_configs`) and OpsGenie (`opsgenie_configs`). A generic webhook receiver is also available for tools without a native integration.
PagerDuty
```yaml
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "your-integration-key"
        # GroupLabels only holds the labels listed in group_by, so read
        # severity and service from CommonLabels instead.
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
        details:
          service: "{{ .CommonLabels.service }}"
          runbook: "{{ .CommonAnnotations.runbook }}"
```
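For OpsGenie, an equivalent receiver might look like the sketch below. The receiver name, the placeholder API key, and the mapping from `severity` to OpsGenie priorities (P1 for critical, P3 otherwise) are assumptions; adjust them to your own escalation policy.

```yaml
receivers:
  - name: opsgenie-oncall
    opsgenie_configs:
      - api_key: "your-opsgenie-api-key"
        message: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
        # Assumed mapping: critical alerts become P1, everything else P3.
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
        details:
          service: "{{ .CommonLabels.service }}"
          runbook: "{{ .CommonAnnotations.runbook }}"
```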
On-Call Rotation
On-call is a responsibility shared across the team. No single person should be on-call permanently.
Principles:
- Rotate weekly. Each engineer takes one week of primary on-call.
- Have a secondary. If primary does not respond in 10 minutes, escalate to secondary.
- Compensate fairly. On-call is work. Compensate it with time off, pay, or both.
- Limit pages. If on-call gets paged more than twice per shift regularly, fix the alerts or the system.
- Handoff notes. At rotation handoff, the outgoing engineer shares what happened during their shift -- active issues, recent changes, things to watch.
Blameless Post-Mortems
When an incident happens, conduct a post-mortem. The goal is to learn, not to assign blame. A post-mortem covers: summary, timeline, root cause, and action items. The focus is on the system, not the person. "The deploy lacked a performance test" is actionable. "Bob should have caught this" is not.
Common Pitfalls
- Alerting on causes instead of symptoms. CPU at 80% is not a user-facing problem until it causes latency or errors.
- No `for` duration. Without it, a 10-second spike pages someone at 2 AM. Set `for: 5m` minimum for most alerts.
- Missing runbooks. An alert without a runbook is a puzzle. The on-call engineer should not have to figure out what to do from scratch.
- Too many critical alerts. If everything is critical, nothing is. Reserve critical for actual user-facing impact.
- Never reviewing alert quality. Alerts accumulate. Review them monthly. Delete the ones that produce noise.
- Skipping post-mortems. Without post-mortems, you repeat the same incidents. The 45 minutes spent writing one saves hours of future debugging.
Key Takeaways
- Alert on symptoms (error rate, latency) not causes (CPU, memory). Users do not care about your CPU usage.
- Define severity levels and route accordingly. Critical pages the on-call. Warning goes to Slack.
- Every alert needs a runbook. The runbook tells the on-call engineer exactly where to start.
- Fight alert fatigue aggressively. Delete alerts that are not actionable, raise thresholds, and measure alert quality.
- Blameless post-mortems turn incidents into improvements. Focus on systems, not people.
- On-call should be sustainable. If it is not, fix the alerts or fix the system.