Alerting That Works

Alerting is the bridge between monitoring and action. Good alerting wakes you up when users are affected. Bad alerting wakes you up at 3 AM because a single node's CPU touched 82% for thirty seconds. The difference between the two is the difference between a sustainable on-call rotation and burnout.

Alert on Symptoms, Not Causes

This is the single most important principle. Alert on what users experience, not on internal system state.

Symptom-based (good):

Error rate exceeds 1% for 5 minutes
p99 latency exceeds 2 seconds for 10 minutes
Successful login rate drops below 95%
Payment processing failures exceed 0.1%

Cause-based (usually bad):

CPU usage above 80%
Memory usage above 90%
Disk usage above 75%
A single pod restarted

CPU at 80% might be perfectly normal during peak traffic. A pod restart might be Kubernetes doing its job. These are not problems until they affect users. Cause-based metrics are useful on dashboards for investigation, but they should rarely trigger pages.

The Exception

Some cause-based alerts are legitimate because they predict imminent user impact:

Disk will be full in 4 hours (based on linear prediction)
Certificate expires in 7 days
Database connection pool is 95% exhausted

These are not alerting on current state -- they are alerting on trajectory. That distinction matters.
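
To make the trajectory idea concrete, here is a minimal Python sketch of the kind of linear extrapolation behind a "disk will be full in 4 hours" alert. It fits a least-squares line through recent free-space samples and extends it forward. The function names and the sample layout are assumptions made for illustration, and the sketch extrapolates from the last sample rather than from the evaluation time, so treat it as an approximation of the technique, not a reimplementation of any particular tool.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, free bytes)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Fit a least-squares line through the samples and return the value
    it predicts horizon_s seconds after the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_s) + intercept


def disk_will_fill(samples: Sequence[Sample], horizon_s: float = 4 * 3600) -> bool:
    """True when the fitted trend reaches zero free bytes within the horizon."""
    return predict_linear(samples, horizon_s) < 0
```

A disk that is 60% free but losing space quickly trips this check, while a disk that is 10% free and stable does not. That is the difference between alerting on trajectory and alerting on current state.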

Writing Alert Rules

Prometheus Alerting Rules

# alerts.yml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency on {{ $labels.service }}"
          description: "{{ $labels.service }} p99 latency is {{ $value }}s."
          runbook: "https://wiki.example.com/runbooks/high-latency"

      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"
          runbook: "https://wiki.example.com/runbooks/disk-full"

Key elements:

for: The condition must hold continuously for this duration before the alert fires, which prevents alerting on brief spikes. With for: 5m, the error rate must stay above 1% for a full 5 minutes; until then the alert is pending, not firing.
labels.severity: Used for routing. Critical alerts page the on-call engineer. Warnings go to a Slack channel.
annotations.runbook: A direct link to what to do when this alert fires.
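
The behavior of for is easier to see as a small state machine. The sketch below is a simplified Python model, not Prometheus code: ForClause and its field names are hypothetical, and timestamps are passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForClause:
    """Tracks how long a condition has been continuously true."""

    hold_s: float                         # e.g. 300 for "for: 5m"
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            # Any false evaluation resets the clock.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_s:
            return "firing"
        return "pending"
```

A condition that is true at 0s and 60s, false at 120s, and true again at 180s starts counting from 180s. It would not fire until 480s, which is why a brief dip below the threshold is enough to keep a flapping signal from paging anyone.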

Severity Levels

Not every alert is equal. Define clear severity levels and route them appropriately.

Critical (Page)

User-facing impact is happening right now. Someone needs to respond within minutes.

Examples:

Service is returning 5xx errors to users
Payment processing is failing
The application is completely down

Route to: PagerDuty/OpsGenie, phone call to on-call engineer.

Warning (Notify)

Something is degraded or heading toward a problem. Response within hours is appropriate.

Examples:

Latency is elevated but within SLO
Disk usage is trending toward full
Certificate expires in 14 days

Route to: Slack channel, email to the team.

Info (Log)

Nothing is broken, but something noteworthy happened. No response required.

Examples:

A deployment completed
Auto-scaling added instances
A scheduled job finished

Route to: Monitoring channel, dashboard annotation.

# Alertmanager routing configuration
route:
  receiver: default-slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # matchers replaces the deprecated match field (Alertmanager 0.22 and later)
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
      repeat_interval: 15m
    - matchers:
        - severity="warning"
      receiver: team-slack
      repeat_interval: 4h
    - matchers:
        - severity="info"
      receiver: monitoring-slack
      repeat_interval: 24h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "your-pagerduty-integration-key"
        severity: critical

  - name: team-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#platform-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

  - name: monitoring-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#monitoring"

  - name: default-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#alerts-default"
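
The routing logic in this configuration boils down to "first matching route wins, otherwise use the default receiver". The Python sketch below models only that idea, using the receiver names from the configuration above. ROUTES and pick_receiver are hypothetical names, and the model ignores features of the real routing tree such as nested routes and the continue flag.

```python
from typing import Dict, List, Tuple

# (labels that must match, receiver name), checked in order
Route = Tuple[Dict[str, str], str]

ROUTES: List[Route] = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "warning"}, "team-slack"),
    ({"severity": "info"}, "monitoring-slack"),
]
DEFAULT_RECEIVER = "default-slack"


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver of the first route whose labels all match."""
    for required, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in required.items()):
            return receiver
    # No route matched, for example an alert with no severity label at all.
    return DEFAULT_RECEIVER
```

The fallback matters in practice: an alert rule that forgets its severity label lands in the default channel instead of paging someone or vanishing.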

Runbooks

Every alert should link to a runbook. A runbook is a document that tells the on-call engineer what to do when the alert fires. It is not a novel. It is a checklist.

A good runbook includes:

## Alert: HighErrorRate

### What it means
The service is returning HTTP 5xx errors above the threshold (1% of traffic).

### Impact
Users are seeing errors. Affected functionality: user authentication, API requests.

### Investigation steps
1. Check which endpoint is failing:
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)

2. Check recent deployments:
   kubectl rollout history deployment/myapp

3. Check application logs:
   kubectl logs -l app=myapp --tail=100

4. Check downstream dependencies:
   curl -s http://database-service:8080/health
   curl -s http://cache-service:8080/health

### Remediation
- If a recent deploy caused it: kubectl rollout undo deployment/myapp
- If a downstream service is down: check that service's alerts and status
- If the database is slow: check connection pool and query performance

### Escalation
If not resolved in 30 minutes, page the service owner: @backend-team

Without runbooks, every alert becomes a research project at 3 AM. With runbooks, the on-call engineer has a starting point.
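
One way to keep this rule from eroding is to check it automatically. The following Python sketch assumes the alert rules have already been parsed from YAML into plain dictionaries (the parsing step is not shown) and reports any rule that lacks a runbook annotation or a for duration. The function name lint_rules is hypothetical.

```python
from typing import Any, Dict, List


def lint_rules(rules: List[Dict[str, Any]]) -> List[str]:
    """Return one human-readable problem per missing runbook or for duration."""
    problems: List[str] = []
    for rule in rules:
        name = rule.get("alert", "<unnamed>")
        annotations = rule.get("annotations") or {}
        if not annotations.get("runbook"):
            problems.append(f"{name}: missing runbook annotation")
        if "for" not in rule:
            problems.append(f"{name}: missing for duration")
    return problems
```

Run as part of code review or CI, a check like this fails the change that introduces an alert without a runbook, which is cheaper than discovering the gap during an incident.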

Alert Fatigue

Alert fatigue is what happens when your alerting system sends so many notifications that the on-call engineer stops paying attention. Every false positive erodes trust. After enough false alarms, real alerts get ignored.

Signs of Alert Fatigue

On-call engineers acknowledge alerts without investigating
Alerts are routinely silenced or snoozed
The alert channel has hundreds of unread messages
New team members ask "do we actually look at these?"

How to Fix It

Delete alerts nobody acts on. If an alert has fired 50 times and no one has ever taken action, it is noise. Remove it or fix the underlying flakiness.

Increase thresholds and durations. CPU at 80% for 30 seconds is not a problem. Raise the threshold or increase the for duration.

Group related alerts. If one failing database triggers 15 alerts from 15 services, group them. Alertmanager's group_by helps:

route:
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
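
The effect of the group_by choice can be shown with a few lines of Python. This is a simplified model of grouping by label values, not Alertmanager's implementation. The function name group_alerts and the DatabaseUnreachable alert used in the usage note are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Labels = Dict[str, str]


def group_alerts(
    alerts: Iterable[Labels], group_by: Sequence[str]
) -> Dict[Tuple[str, ...], List[Labels]]:
    """Bucket alerts by the values of the group_by labels.

    Alerts missing a label are grouped under the empty string for it."""
    groups: Dict[Tuple[str, ...], List[Labels]] = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

With 15 services in one cluster all raising a hypothetical DatabaseUnreachable alert, grouping by alertname and cluster yields a single notification, while grouping by alertname and service yields 15. Pick the labels that describe the shared failure, not the labels that distinguish the victims.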

Measure alert quality. Track how many alerts fire per week, how many result in action, and how many are false positives. Review monthly.

Good target:
- On-call gets paged fewer than 2 times per shift
- >80% of pages result in meaningful action
- Zero false-positive pages per week
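
These targets are easy to compute once pages are recorded somewhere. The sketch below assumes a hypothetical PageRecord that someone fills in after each page, and a review window of one week so that the false-positive count lines up with the weekly target. The field names and the summarize and meets_targets helpers are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PageRecord:
    alertname: str
    action_taken: bool    # did the responder do anything meaningful?
    false_positive: bool  # did the page fire with no real problem?


def summarize(pages: Sequence[PageRecord], shifts: int) -> Dict[str, float]:
    """Compute the three quality numbers for one week of pages."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    total = len(pages)
    acted = sum(1 for p in pages if p.action_taken)
    return {
        "pages_per_shift": total / shifts,
        # With no pages at all, treat the ratio as perfect rather than undefined.
        "actionable_ratio": acted / total if total else 1.0,
        "false_positives": float(sum(1 for p in pages if p.false_positive)),
    }


def meets_targets(summary: Dict[str, float]) -> bool:
    """Check a summary against the three targets listed above."""
    return (
        summary["pages_per_shift"] < 2
        and summary["actionable_ratio"] > 0.8
        and summary["false_positives"] == 0
    )
```

The useful output is less the pass or fail result than the per-alert breakdown you get when a week misses the targets: the alerts with no action taken are the first candidates for deletion.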

PagerDuty & OpsGenie Integration

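Alertmanager has native receiver types for both PagerDuty (pagerduty_configs) and OpsGenie (opsgenie_configs). A generic webhook receiver is available as a fallback for anything without native support.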

PagerDuty

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "your-integration-key"
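        # severity is not one of the group_by labels, so read it from CommonLabels;
        # .GroupLabels.severity would be empty and every page would be sent as a warning
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'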
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
        details:
          service: "{{ .GroupLabels.service }}"
          runbook: "{{ .CommonAnnotations.runbook }}"

On-Call Rotation

On-call is a responsibility shared across the team. No single person should be on-call permanently.

Principles:

Rotate weekly. Each engineer takes one week of primary on-call.
Have a secondary. If the primary does not acknowledge a page within 10 minutes, escalate to the secondary.
Compensate fairly. On-call is work. Compensate it with time off, pay, or both.
Limit pages. If on-call gets paged more than twice per shift regularly, fix the alerts or the system.
Handoff notes. At rotation handoff, the outgoing engineer shares what happened during their shift -- active issues, recent changes, things to watch.
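
The weekly rotation and the 10-minute escalation rule can be sketched in a few lines of Python. Paging tools normally handle this for you; the code only illustrates the logic. The function names are hypothetical, and using next week's primary as this week's secondary is an assumption made for the example, not a rule from this guide.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def on_call_for(
    day: date, engineers: Sequence[str], rotation_start: date
) -> Tuple[str, str]:
    """Return (primary, secondary) for the week containing the given day.

    Assumption: the secondary is whoever is primary the following week."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    week_index = (day - rotation_start).days // 7
    primary = engineers[week_index % len(engineers)]
    secondary = engineers[(week_index + 1) % len(engineers)]
    return primary, secondary


def who_to_page(
    minutes_since_page: float,
    acknowledged: bool,
    primary: str,
    secondary: str,
    escalate_after_min: float = 10,
) -> Optional[str]:
    """Return who should be paged now, or None once the page is acknowledged."""
    if acknowledged:
        return None
    if minutes_since_page < escalate_after_min:
        return primary
    return secondary
```

Keeping the schedule computable from a start date and a list of names also makes handoffs predictable: everyone can see weeks in advance when their shift falls.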

Blameless Post-Mortems

When an incident happens, conduct a post-mortem. The goal is to learn, not to assign blame. A post-mortem covers: summary, timeline, root cause, and action items. The focus is on the system, not the person. "The deploy lacked a performance test" is actionable. "Bob should have caught this" is not.

Common Pitfalls

Alerting on causes instead of symptoms. CPU at 80% is not a user-facing problem until it causes latency or errors.
No for duration. Without it, a 10-second spike pages someone at 2 AM. Set for: 5m minimum for most alerts.
Missing runbooks. An alert without a runbook is a puzzle. The on-call engineer should not have to figure out what to do from scratch.
Too many critical alerts. If everything is critical, nothing is. Reserve critical for actual user-facing impact.
Never reviewing alert quality. Alerts accumulate. Review them on a fixed schedule (monthly works well) and delete the ones that produce noise.
Skipping post-mortems. Without post-mortems, you repeat the same incidents. The 45 minutes spent writing one saves hours of future debugging.

Key Takeaways

Alert on symptoms (error rate, latency) not causes (CPU, memory). Users do not care about your CPU usage.
Define severity levels and route accordingly. Critical pages the on-call. Warning goes to Slack.
Every alert needs a runbook. The runbook tells the on-call engineer exactly where to start.
Fight alert fatigue aggressively. Delete alerts that are not actionable, raise thresholds, and measure alert quality.
Blameless post-mortems turn incidents into improvements. Focus on systems, not people.
On-call should be sustainable. If it is not, fix the alerts or fix the system.