Metrics & Alerting

Metrics are the first line of defense in system monitoring. They quantify system behavior into numbers that can be graphed, compared, and alerted on. A well-designed metrics and alerting system catches problems before users notice them.

Metric Types

Prometheus defines four core metric types that cover virtually every measurement need.

Counter

A counter is a monotonically increasing value. It only goes up (or resets to zero on restart). Counters track cumulative totals.

Counter examples:
  http_requests_total: 1,245,892
  errors_total: 3,417
  bytes_transmitted_total: 89,432,001,556

Usage:
  Rate of change: rate(http_requests_total[5m]) = 423 requests/second
  Error rate: rate(errors_total[5m]) / rate(http_requests_total[5m]) = 0.8%
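
The rate calculation above can be sketched in plain Python. This is a simplified illustration, not Prometheus's actual rate() implementation (which also extrapolates to the window boundaries); the function name counter_rate and the sample values are assumptions made for this example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: List[Sample]) -> float:
    """Per-second increase across the samples, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the process restarted and the counter began again
            # at zero, so the whole current value counts as new increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical scrapes 15 seconds apart, with a restart before the last one.
scrapes = [(0.0, 1000.0), (15.0, 7345.0), (30.0, 13690.0), (45.0, 6345.0)]
print(f"{counter_rate(scrapes):.0f} requests/second")
```

Handling the reset explicitly is what keeps the result positive across a restart; a naive "last minus first" would report a negative rate for the same data.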

Common mistake:
  Avoid displaying or alerting on a counter's raw value.
  "1,245,892 total requests" is meaningless without a time context.
  "423 requests/second" is actionable.

Gauge

A gauge is a value that can go up or down. It represents a snapshot of current state.

Gauge examples:
  cpu_utilization: 72.5%
  active_connections: 1,247
  queue_depth: 8,932
  memory_used_bytes: 3,421,000,000
  temperature_celsius: 68.3

Usage:
  Current value is directly meaningful
  "1,247 active connections right now" is useful as-is
  Can compute averages, min, max over time windows

Histogram

A histogram samples observations and counts them in configurable buckets. It is essential for measuring distributions, especially latency.

Histogram example: http_request_duration_seconds

Buckets and cumulative counts (each bucket includes all faster requests):
  le="0.01"   (10ms):   45,230 requests
  le="0.05"   (50ms):   89,445 requests
  le="0.1"    (100ms):  95,112 requests
  le="0.25"   (250ms):  98,890 requests
  le="0.5"    (500ms):  99,650 requests
  le="1.0"    (1s):     99,890 requests
  le="+Inf":            100,000 requests

Derived percentiles (linear interpolation within the matching bucket):
  p50 (median): ~14ms
  p90: ~55ms
  p99: ~286ms
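
Percentiles are estimated from cumulative bucket counts by locating the bucket that contains the target rank and interpolating linearly inside it, which is the general approach Prometheus's histogram_quantile takes. The sketch below follows that idea in plain Python; it is a simplified stand-in rather than the real implementation, and the name bucket_quantile is an assumption for illustration.

```python
import math
from typing import List, Tuple

# (upper bound in seconds, cumulative count), copied from the example above.
BUCKETS: List[Tuple[float, int]] = [
    (0.01, 45_230),
    (0.05, 89_445),
    (0.1, 95_112),
    (0.25, 98_890),
    (0.5, 99_650),
    (1.0, 99_890),
    (math.inf, 100_000),
]


def bucket_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile (0 < q < 1) from cumulative histogram buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # No observations, so there is nothing to estimate.
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Nothing is known above the last finite bound, so report it.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


for q in (0.5, 0.9, 0.99):
    print(f"p{q * 100:g}: {bucket_quantile(q, BUCKETS) * 1000:.0f}ms")
```

The estimate can never be more precise than the bucket it falls in, which is why bucket boundaries should be placed close to the latency thresholds you care about.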

Why not just track the average?
  Average of [10ms, 10ms, 10ms, 10ms, 5000ms] = 1008ms
  p50 = 10ms, p99 = 5000ms
  The average hides that 80% of requests are fast and one is very slow.
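
The comparison above is easy to reproduce. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; the helper name nearest_rank_percentile is an assumption for this example.

```python
import math
import statistics
from typing import List


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Smallest value such that at least pct percent of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [10, 10, 10, 10, 5000]
print("mean:", statistics.mean(latencies_ms))
print("p50: ", nearest_rank_percentile(latencies_ms, 50))
print("p99: ", nearest_rank_percentile(latencies_ms, 99))
```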

Summary

Summaries calculate quantiles on the client side and expose the results to the metrics backend. Unlike histograms, they need no bucket boundaries and report quantiles within a configurable error, but the precomputed quantiles cannot be aggregated across instances.

Summary vs Histogram:
  | Aspect          | Histogram                 | Summary                            |
  |-----------------|---------------------------|------------------------------------|
  | Percentile calc | Server-side (approximate) | Client-side (configurable error)   |
  | Aggregatable    | Yes (across instances)    | No                                 |
  | Configuration   | Define buckets upfront    | Define quantiles upfront           |
  | Recommended for | Most use cases            | Accurate single-instance quantiles |

For most production systems, histograms are preferred because you can aggregate them across multiple service instances to get system-wide percentiles.
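
Histograms aggregate cleanly because bucket counts with identical boundaries can simply be added together. A minimal sketch of that merge step, assuming every instance uses the same bucket layout (the code does not verify this) and using hypothetical counts:

```python
from collections import Counter
from typing import Dict, Iterable

# Cumulative bucket counts keyed by the `le` upper bound (as a string label).
Histogram = Dict[str, int]


def merge_histograms(instances: Iterable[Histogram]) -> Histogram:
    """Sum bucket counts across instances that share the same bucket bounds."""
    merged: Counter = Counter()
    for histogram in instances:
        merged.update(histogram)  # Counter.update adds counts per key.
    return dict(merged)


# Hypothetical counts from two instances of the same service.
instance_a = {"0.1": 900, "0.5": 990, "+Inf": 1000}
instance_b = {"0.1": 200, "0.5": 700, "+Inf": 1000}
print(merge_histograms([instance_a, instance_b]))
```

A system-wide percentile is then estimated from the merged buckets. Precomputed per-instance quantiles offer no equivalent operation: averaging two p99 values does not give the p99 of the combined traffic.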

SLIs, SLOs & SLAs

These three concepts form a hierarchy from technical measurement to business commitment.

Service Level Indicators (SLIs)

An SLI is a specific metric that measures one aspect of service quality. It is a number, not a target.

Common SLIs:
  Availability: percentage of successful requests
    SLI = (total requests - server errors) / total requests
    Current value: 99.97%

  Latency: percentage of requests faster than a threshold
    SLI = requests completing in under 200ms / total requests
    Current value: 95.2%

  Throughput: requests processed per second
    SLI = rate(successful_requests_total[5m])
    Current value: 12,450 req/s

  Error rate: percentage of requests returning errors
    SLI = 5xx responses / total responses
    Current value: 0.03%
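
The ratio-style SLIs above reduce to simple arithmetic over request counts. A minimal sketch, with hypothetical function names and counts chosen to reproduce the example values:

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a server error."""
    if total_requests == 0:
        return 1.0  # Assumption: a window with no traffic counts as available.
    return (total_requests - server_errors) / total_requests


def latency_sli(fast_requests: int, total_requests: int) -> float:
    """Fraction of requests that completed under the latency threshold."""
    if total_requests == 0:
        return 1.0  # Same assumption as above.
    return fast_requests / total_requests


# Hypothetical counts for one measurement window.
total = 1_000_000
print(f"availability: {availability_sli(total, server_errors=300):.2%}")
print(f"latency:      {latency_sli(fast_requests=952_000, total_requests=total):.1%}")
```

How to treat a window with zero traffic is a policy decision; returning 1.0 here is only one reasonable choice.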

Service Level Objectives (SLOs)

An SLO is a target value for an SLI over a time window. It defines "good enough" for the service.

SLO examples:
  "99.9% of requests will succeed over a 30-day rolling window"
  "95% of requests will complete in under 200ms over a 7-day window"
  "The error rate will remain below 0.1% over each calendar month"

Error budget:
  SLO: 99.9% availability over 30 days
  Total minutes in 30 days: 43,200
  Allowed downtime: 43.2 minutes (error budget)
  
  If 30 minutes of downtime have occurred:
    Remaining budget: 13.2 minutes
    Budget consumed: 69.4%
    Action: slow down risky deployments
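
The error budget arithmetic above can be expressed as a small helper. This is a sketch for a time-based availability SLO; the class name ErrorBudget and its fields are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float        # Target as a fraction, e.g. 0.999 for 99.9%
    window_days: int  # Length of the SLO window

    @property
    def total_minutes(self) -> float:
        return self.window_days * 24 * 60

    @property
    def budget_minutes(self) -> float:
        """Downtime the SLO tolerates over the whole window."""
        return self.total_minutes * (1 - self.slo)

    def remaining_minutes(self, downtime_minutes: float) -> float:
        return self.budget_minutes - downtime_minutes

    def consumed_fraction(self, downtime_minutes: float) -> float:
        return downtime_minutes / self.budget_minutes


budget = ErrorBudget(slo=0.999, window_days=30)
print(f"budget:    {budget.budget_minutes:.1f} minutes")
print(f"remaining: {budget.remaining_minutes(30):.1f} minutes")
print(f"consumed:  {budget.consumed_fraction(30):.1%}")
```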

Google pioneered error budgets as part of their SRE practices. When a team exhausts its error budget, feature development pauses in favor of reliability work. This creates a natural balance between velocity and stability.

Service Level Agreements (SLAs)

An SLA is a contractual commitment with consequences for failure, typically financial penalties or service credits.

SLA vs SLO:
  SLO: internal target, 99.95% availability (aspirational)
  SLA: external contract, 99.9% availability (legally binding)
  
  The SLO should be stricter than the SLA.
  If you target 99.95% internally, you have a buffer before breaching
  the 99.9% SLA commitment.

AWS SLA example:
  S3 (Standard): 99.9% monthly availability SLA
    Below 99.9% but at or above 99.0%: 10% service credit
    Below 99.0% but at or above 95.0%: 25% service credit
    Below 95.0%: 100% service credit

Designing Effective Alerts

Alerting connects metrics to human action. Bad alerting either misses real problems or wakes people up for non-issues.

Alert Design Principles

Good alert criteria:
  1. Actionable: someone can do something about it right now
  2. Relevant: it indicates a real user-facing problem
  3. Urgent: it requires attention within the alert's time window
  4. Unique: it is not a duplicate of another alert

Bad alert examples:
  "CPU usage above 80%"        --> not necessarily a problem
  "One request returned 500"   --> noise at scale
  "Disk usage above 60%"       --> not urgent
  "Service restarted"          --> expected during deployments

Symptom-Based vs Cause-Based Alerts

Cause-based (fragile, noisy):
  "Database CPU above 90%"
  "Memory usage above 85%"
  "Thread pool 80% utilized"
  Problem: high resource usage might be normal under load

Symptom-based (user-focused, reliable):
  "Error rate above 1% for 5 minutes"
  "p99 latency above 500ms for 10 minutes"
  "Availability dropped below 99.9% SLO threshold"
  Benefit: fires only when users are actually affected

Alert on symptoms (what users experience) and use cause-based metrics for investigation dashboards, not alerts.
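
Symptom-based rules such as "error rate above 1% for 5 minutes" combine a threshold with a duration, so a brief spike does not page anyone. A minimal sketch of that check, assuming samples arrive sorted by time; the function name sustained_breach and the sample data are hypothetical:

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, error rate as a fraction)


def sustained_breach(samples: List[Sample], threshold: float, for_seconds: float) -> bool:
    """True if the value has stayed above threshold for at least for_seconds.

    Samples must be sorted by timestamp, oldest first.
    """
    breach_started: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
        else:
            breach_started = None  # Any recovery resets the clock.
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started >= for_seconds


# Hypothetical one-minute samples: a sustained breach and a brief spike.
sustained = [(0, 0.002), (60, 0.015), (120, 0.02), (180, 0.03),
             (240, 0.012), (300, 0.018), (360, 0.02)]
spike = [(0, 0.002), (60, 0.05), (120, 0.003)]
print(sustained_breach(sustained, threshold=0.01, for_seconds=300))
print(sustained_breach(spike, threshold=0.01, for_seconds=300))
```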

Alert Severity Levels

Severity levels:
  P1 (Critical): user-facing outage, immediate page
    Example: "Payment processing error rate above 10%"
    Response: on-call engineer paged immediately
  
  P2 (High): degraded experience, page during business hours
    Example: "Search latency p99 above 2 seconds"
    Response: investigate within 1 hour
  
  P3 (Medium): potential issue, ticket created
    Example: "Error budget 80% consumed with 10 days remaining"
    Response: investigate within 1 business day
  
  P4 (Low): informational, logged
    Example: "Certificate expires in 30 days"
    Response: handle during normal work

Alert Fatigue

Alert fatigue occurs when too many alerts, especially false positives, cause operators to ignore or delay responding to real incidents.

Alert fatigue symptoms:
  - On-call engineers silence alerts without investigating
  - Alert channels have hundreds of unread messages
  - Pages are acknowledged but not acted on
  - Real incidents are missed because they blend into the noise

Common causes:
  - Thresholds set too aggressively (alert on any deviation)
  - Alerting on causes instead of symptoms
  - No deduplication (one incident triggers 50 related alerts)
  - Alerts that cannot be acted upon (informational masquerading as alerts)

Combating Alert Fatigue

Strategies:
  1. Regularly review alert volume (aim for fewer than 2 pages per on-call shift)
  2. Delete alerts that have not been actionable in 30 days
  3. Group related alerts into a single notification
  4. Use escalation chains: notification first, page only if unacknowledged
  5. Require a runbook for every alert
  6. Track signal-to-noise ratio: real incidents / total alerts
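
Two of these strategies, grouping related alerts and tracking signal-to-noise, are straightforward to prototype. The sketch below assumes each alert carries a service name to group on and an "actionable" flag assigned during alert review; both fields and all sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Alert(NamedTuple):
    name: str
    service: str
    actionable: bool  # Judged after the fact, during alert review.


def group_by_service(alerts: List[Alert]) -> Dict[str, List[str]]:
    """Collapse alerts that share a service into one notification each."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        grouped[alert.service].append(alert.name)
    return dict(grouped)


def signal_to_noise(alerts: List[Alert]) -> float:
    """Actionable alerts divided by total alerts; 1.0 means every alert mattered."""
    if not alerts:
        return 1.0
    return sum(alert.actionable for alert in alerts) / len(alerts)


alerts = [
    Alert("payment-error-rate-high", "payments", True),
    Alert("payment-db-cpu-high", "payments", False),
    Alert("payment-thread-pool-high", "payments", False),
    Alert("search-disk-usage-high", "search", False),
]
print(group_by_service(alerts))
print(f"signal-to-noise: {signal_to_noise(alerts):.2f}")
```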

PagerDuty reports that organizations with effective alert management resolve incidents 70% faster than those drowning in alert noise.

Runbooks

A runbook documents the response procedure for a specific alert. It transforms an alert from "something is wrong" into "here is what to do."

Runbook structure:

Alert: payment-error-rate-high
Severity: P1
Triggers when: payment error rate > 5% for 5 minutes

What this means:
  More than 5% of payment attempts are failing.
  Users cannot complete purchases.

Immediate actions:
  1. Check payment-service dashboard for error breakdown
  2. Check downstream payment gateway status page (stripe.com/status)
  3. Check recent deployments: did a deploy happen in the last hour?
  
If gateway is down:
  4. Enable fallback payment gateway (see playbook link)
  5. Notify customer support team
  
If recent deploy caused it:
  4. Roll back the deployment (see rollback procedure link)
  5. Create incident ticket

Escalation:
  If not resolved in 15 minutes, escalate to payments team lead
  Contact: #payments-oncall Slack channel
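
One way to enforce a runbook for every alert is a check that runs before alert rules are deployed. The sketch below assumes rules are available as dictionaries with "name" and "runbook_url" fields; those field names, the rule names other than the example above, and the URL are illustrative only.

```python
from typing import Dict, List


def alerts_missing_runbooks(rules: List[Dict[str, str]]) -> List[str]:
    """Return the names of alert rules that have no runbook link."""
    return [
        rule["name"]
        for rule in rules
        if not rule.get("runbook_url", "").strip()
    ]


# Hypothetical rule definitions.
rules = [
    {
        "name": "payment-error-rate-high",
        "runbook_url": "https://wiki.example.com/runbooks/payment-error-rate-high",
    },
    {"name": "search-latency-high", "runbook_url": ""},
    {"name": "queue-depth-high"},
]
print(alerts_missing_runbooks(rules))
```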

Metrics Infrastructure

Collection & Storage

Metrics pipeline:
  Application --> metrics library (Prometheus client) 
    --> Prometheus scrapes every 15s 
    --> Time-series database (Prometheus TSDB)
    --> Long-term storage (Thanos, Cortex, or Mimir)
    --> Grafana dashboards

Push vs Pull:
  Pull (Prometheus): server scrapes metrics from applications
    Pro: server controls collection rate, easy to detect down targets
    Con: requires service discovery, harder with short-lived jobs
  
  Push (StatsD, Datadog Agent): application sends metrics to collector
    Pro: works with ephemeral jobs, simpler network setup
    Con: can overwhelm collector, harder to detect missing sources
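
In the pull model, the application exposes its current values as plain text and the server scrapes that text on a schedule. The sketch below renders one metric in the Prometheus text exposition format using only the standard library. It is a simplification: it does not escape special characters in label values, and the function name render_metric is an assumption. Production services would normally rely on an official client library instead.

```python
from typing import Dict, List, Tuple

LabelSet = Tuple[Tuple[str, str], ...]  # e.g. (("method", "GET"), ("status", "200"))


def render_metric(name: str, metric_type: str, help_text: str,
                  samples: Dict[LabelSet, float]) -> str:
    """Render one metric family as scrape-ready text."""
    lines: List[str] = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Series without labels are written as the bare metric name.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter values held in memory by the application.
samples = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "POST"), ("status", "500")): 3,
}
print(render_metric("http_requests_total", "counter", "Total HTTP requests.", samples))
```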

Cardinality Management

High cardinality (too many unique label combinations) is one of the most common causes of metrics system failure, because every unique combination of label values is stored as its own time series.

Cardinality explosion:
  http_requests_total{method, status, endpoint, userId}
  
  methods: 5
  statuses: 10
  endpoints: 100
  userIds: 1,000,000
  
  Total time series: 5 * 10 * 100 * 1,000,000 = 5 billion
  This will crash your metrics system.

Fix: never use high-cardinality values (user IDs, request IDs) as metric labels.
  Use logs or traces for per-request data.
  Use metrics for aggregate data only.
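
The worst-case series count is the product of the number of distinct values per label, so it can be estimated before a label is ever added. A minimal sketch using the figures above; the function name series_count is an assumption:

```python
import math
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


with_user_id = {"method": 5, "status": 10, "endpoint": 100, "userId": 1_000_000}
without_user_id = {"method": 5, "status": 10, "endpoint": 100}

print(f"with userId:    {series_count(with_user_id):,}")
print(f"without userId: {series_count(without_user_id):,}")
```

This is an upper bound, since not every combination occurs in practice, but it shows how a single unbounded label dominates the total.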

Common Pitfalls

Alerting on raw metric values instead of rates. A counter value of 500 errors is meaningless without knowing the time period and total request count. Alert on error rate, not error count.
Setting SLOs without measuring SLIs first. Measure your current performance for at least a month before committing to an SLO. Setting aspirational targets leads to immediate budget exhaustion.
Using averages for latency. Averages hide outliers. A 50ms average can mask a p99 of 5 seconds. Always use percentiles for latency metrics.
Too many dashboards that no one looks at. A wall of green dashboards that no one checks provides false comfort. Dashboards should be linked to alerts and used during investigation, not passively monitored.
Metric label cardinality explosion. Adding user IDs, request IDs, or other unbounded values as metric labels creates millions of time series and crashes the metrics backend.
No runbooks for alerts. An alert without a runbook tells the responder that something is wrong but not what to do about it. Every alert must have documented response steps.

Key Takeaways

The four metric types (counter, gauge, histogram, summary) cover virtually every measurement need. Use counters for rates, gauges for current state, and histograms for distributions; reserve summaries for cases where per-instance quantiles are enough.
SLIs measure service quality. SLOs set targets. SLAs make contractual commitments. Your SLO should be stricter than your SLA.
Alert on user-facing symptoms, not internal causes. Cause-based metrics belong on investigation dashboards, not in alert rules.
Alert fatigue is the silent killer of incident response. Fewer, higher-quality alerts are dramatically more effective than comprehensive alerting on everything.
Every alert needs a runbook. If you cannot write response steps for an alert, the alert should not exist.