Metrics & Alerting
Metrics are the first line of defense in system monitoring. They quantify system behavior into numbers that can be graphed, compared, and alerted on. A well-designed metrics and alerting system catches problems before users notice them.
Metric Types
Prometheus defines four core metric types that cover virtually every measurement need.
Counter
A counter is a monotonically increasing value. It only goes up (or resets to zero on restart). Counters track cumulative totals.
Counter examples:
http_requests_total: 1,245,892
errors_total: 3,417
bytes_transmitted_total: 89,432,001,556
Usage:
Rate of change: rate(http_requests_total[5m]) = 423 requests/second
Error rate: rate(errors_total[5m]) / rate(http_requests_total[5m]) = 0.8%
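The rate calculation above can be sketched in a few lines of Python. This is a simplified illustration, not Prometheus's actual `rate()` implementation (which also extrapolates to the window boundaries); the function name `counter_rate` and the sample layout are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Hypothetical sample layout: (unix_timestamp_seconds, counter_value)
Sample = Tuple[float, float]


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase over the window, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the process restarted and the counter began again
            # at zero, so the whole current value counts as new increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    return increase / elapsed


# One sample per minute; the counter resets between the second and third sample.
window = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(f"{counter_rate(window):.3f} requests/second")
```

Handling the reset explicitly is what lets a rate stay meaningful across restarts, whereas the raw counter value would appear to go backwards.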
Common mistake:
Never use a counter's raw value for display.
"1,245,892 total requests" is meaningless without a time context.
"423 requests/second" is actionable.
Gauge
A gauge is a value that can go up or down. It represents a snapshot of current state.
Gauge examples:
cpu_utilization: 72.5%
active_connections: 1,247
queue_depth: 8,932
memory_used_bytes: 3,421,000,000
temperature_celsius: 68.3
Usage:
Current value is directly meaningful
"1,247 active connections right now" is useful as-is
Can compute averages, min, max over time windows
Histogram
A histogram samples observations and counts them in configurable buckets. It is essential for measuring distributions, especially latency.
Histogram example: http_request_duration_seconds
Buckets and cumulative counts (each bucket includes every observation at or below its upper bound):
le="0.01" (10ms): 45,230 requests
le="0.05" (50ms): 89,445 requests
le="0.1" (100ms): 95,112 requests
le="0.25" (250ms): 98,890 requests
le="0.5" (500ms): 99,650 requests
le="1.0" (1s): 99,890 requests
le="+Inf": 100,000 requests
Derived percentiles (estimated by linear interpolation within the matching bucket):
p50 (median): ~14ms
p90: ~55ms
p99: ~286ms
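To make the derivation concrete, here is a minimal sketch of estimating a quantile from cumulative bucket counts by linear interpolation, which is the approach PromQL's `histogram_quantile` takes. The function name `estimate_quantile` and the `BUCKETS` constant are assumptions for this example; the counts are the ones listed above.

```python
from typing import List, Tuple

# (upper_bound_seconds, cumulative_count), sorted by upper bound.
BUCKETS: List[Tuple[float, int]] = [
    (0.01, 45_230), (0.05, 89_445), (0.1, 95_112), (0.25, 98_890),
    (0.5, 99_650), (1.0, 99_890), (float("inf"), 100_000),
]


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile (0..1) in seconds from cumulative buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into the unbounded bucket;
                # fall back to the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


for q in (0.5, 0.9, 0.99):
    print(f"p{round(q * 100)}: {estimate_quantile(q, BUCKETS) * 1000:.0f} ms")
```

Because the estimate assumes an even spread inside each bucket, its accuracy depends on how well the bucket boundaries match the real latency distribution.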
Why not just track the average?
Average of [10ms, 10ms, 10ms, 10ms, 5000ms] = 1008ms
p50 = 10ms, p99 = 5000ms
The average hides the fact that four of the five requests (80%) are fast and one is very slow.
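The comparison above can be reproduced with a short script. The helper `nearest_rank_percentile` is a hypothetical name, and nearest-rank is only one of several common percentile definitions; it is used here because it is the simplest.

```python
import math
from statistics import mean
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [10, 10, 10, 10, 5000]
print("mean:", mean(latencies_ms))                         # 1008
print("p50: ", nearest_rank_percentile(latencies_ms, 50))  # 10
print("p99: ", nearest_rank_percentile(latencies_ms, 99))  # 5000
```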
Summary
Summaries calculate quantiles on the client side before the values reach the metrics backend. Unlike histograms, they do not depend on bucket boundaries, so the configured quantiles are accurate to within a small configurable error, but they cannot be aggregated across instances.
Summary vs Histogram:
| Aspect          | Histogram               | Summary                                           |
|-----------------|-------------------------|---------------------------------------------------|
| Percentile calc | Server-side (approx)    | Client-side (accurate)                            |
| Aggregatable    | Yes (across instances)  | No                                                |
| Configuration   | Buckets defined upfront | Quantiles defined upfront                         |
| Recommended for | Most use cases          | When accurate per-instance quantiles are critical |
For most production systems, histograms are preferred because you can aggregate them across multiple service instances to get system-wide percentiles.
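The reason histograms aggregate cleanly is that bucket counts are plain counters, so combining instances is just addition per bucket (the PromQL equivalent is a `sum by (le)` over the bucket series). The sketch below illustrates this; `merge_histograms` and the per-instance dictionaries are assumptions for the example, and it only works when every instance uses the same bucket boundaries.

```python
from collections import Counter
from typing import Dict, Iterable


def merge_histograms(instances: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum cumulative bucket counts, keyed by the `le` upper bound, across instances."""
    merged: Counter = Counter()
    for buckets in instances:
        merged.update(buckets)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical per-instance bucket counts for the same metric.
instance_a = {"0.1": 900, "0.5": 990, "+Inf": 1000}
instance_b = {"0.1": 100, "0.5": 600, "+Inf": 1000}
print(merge_histograms([instance_a, instance_b]))
```

Precomputed quantiles from a summary have no equivalent operation: averaging two instances' p99 values does not yield the p99 of their combined traffic.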
SLIs, SLOs & SLAs
These three concepts form a hierarchy from technical measurement to business commitment.
Service Level Indicators (SLIs)
An SLI is a specific metric that measures one aspect of service quality. It is a number, not a target.
Common SLIs:
Availability: percentage of successful requests
SLI = (total requests - server errors) / total requests
Current value: 99.97%
Latency: percentage of requests faster than a threshold
SLI = requests completing in under 200ms / total requests
Current value: 95.2%
Throughput: requests processed per second
SLI = rate(successful_requests_total[5m])
Current value: 12,450 req/s
Error rate: percentage of requests returning errors
SLI = 5xx responses / total responses
Current value: 0.03%
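The ratio-based SLIs above reduce to simple arithmetic over request counts. A minimal sketch, assuming the counts have already been gathered for the measurement window; the `RequestCounts` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    total: int            # all requests in the window
    server_errors: int    # requests that returned a 5xx response
    under_threshold: int  # requests that completed faster than the latency threshold


def _ratio(numerator: int, denominator: int) -> float:
    if denominator <= 0:
        raise ValueError("an SLI is undefined for a window with no requests")
    return numerator / denominator


def availability_sli(c: RequestCounts) -> float:
    return _ratio(c.total - c.server_errors, c.total)


def latency_sli(c: RequestCounts) -> float:
    return _ratio(c.under_threshold, c.total)


def error_rate_sli(c: RequestCounts) -> float:
    return _ratio(c.server_errors, c.total)


counts = RequestCounts(total=1_000_000, server_errors=300, under_threshold=952_000)
print(f"availability: {availability_sli(counts):.2%}")
print(f"latency:      {latency_sli(counts):.1%}")
print(f"error rate:   {error_rate_sli(counts):.2%}")
```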
Service Level Objectives (SLOs)
An SLO is a target value for an SLI over a time window. It defines "good enough" for the service.
SLO examples:
"99.9% of requests will succeed over a 30-day rolling window"
"95% of requests will complete in under 200ms over a 7-day window"
"The error rate will remain below 0.1% over each calendar month"
Error budget:
SLO: 99.9% availability over 30 days
Total minutes in 30 days: 43,200
Allowed downtime: 43.2 minutes (error budget)
If 30 minutes of downtime have occurred:
Remaining budget: 13.2 minutes
Budget consumed: 69.4%
Action: slow down risky deployments
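The budget arithmetic above can be written as a small helper. This is a sketch under the assumption that the budget is measured in minutes of downtime; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of downtime allowed by an availability SLO (e.g. 0.999) over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


def budget_consumed(downtime_minutes: float, slo: float, window_days: int) -> float:
    """Fraction of the error budget already spent."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


budget = error_budget_minutes(0.999, 30)
print(f"budget:    {budget:.1f} min")
print(f"remaining: {budget - 30:.1f} min")
print(f"consumed:  {budget_consumed(30, 0.999, 30):.1%}")
```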
Google pioneered error budgets as part of its SRE practices. When a team exhausts its error budget, feature development pauses in favor of reliability work. This creates a natural balance between velocity and stability.
Service Level Agreements (SLAs)
An SLA is a contractual commitment with consequences for failure, typically financial penalties or service credits.
SLA vs SLO:
SLO: internal target, 99.95% availability (aspirational)
SLA: external contract, 99.9% availability (legally binding)
The SLO should be stricter than the SLA.
If you target 99.95% internally, you have a buffer before breaching
the 99.9% SLA commitment.
AWS SLA example:
S3: 99.9% availability SLA
Below 99.9% but above 99.0%: 10% service credit
Below 99.0%: 25% service credit
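A tiered credit schedule like this maps directly onto a lookup function. The sketch below encodes only the two tiers listed above and is not a statement of the provider's current terms; the function name and the treatment of the exact 99.0% boundary are assumptions.

```python
def service_credit_percent(monthly_availability_percent: float) -> int:
    """Return the service credit (as a percentage of the bill) for a month's availability."""
    if monthly_availability_percent >= 99.9:
        return 0   # SLA met, no credit owed
    if monthly_availability_percent >= 99.0:
        return 10  # below 99.9% but not below 99.0%
    return 25      # below 99.0%


for availability in (99.95, 99.5, 98.0):
    print(availability, "->", service_credit_percent(availability), "% credit")
```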
Designing Effective Alerts
Alerting connects metrics to human action. Bad alerting either misses real problems or wakes people up for non-issues.
Alert Design Principles
Good alert criteria:
1. Actionable: someone can do something about it right now
2. Relevant: it indicates a real user-facing problem
3. Urgent: it requires attention within the alert's time window
4. Unique: it is not a duplicate of another alert
Bad alert examples:
"CPU usage above 80%" --> not necessarily a problem
"One request returned 500" --> noise at scale
"Disk usage above 60%" --> not urgent
"Service restarted" --> expected during deployments
Symptom-Based vs Cause-Based Alerts
Cause-based (fragile, noisy):
"Database CPU above 90%"
"Memory usage above 85%"
"Thread pool 80% utilized"
Problem: high resource usage might be normal under load
Symptom-based (user-focused, reliable):
"Error rate above 1% for 5 minutes"
"p99 latency above 500ms for 10 minutes"
"Availability dropped below 99.9% SLO threshold"
Problem detected: users are actually affected
Alert on symptoms (what users experience) and use cause-based metrics for investigation dashboards, not alerts.
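A symptom-based rule such as "error rate above 1% for 5 minutes" only fires when the breach is sustained, which is the role the `for` clause plays in Prometheus alerting rules. The following sketch shows that logic in isolation; `sustained_breach` and the sample layout are assumptions for this example, and samples are expected in time order.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],  # (timestamp_seconds, value), oldest first
    threshold: float,
    for_seconds: float,
) -> bool:
    """True if the value has stayed above the threshold for at least
    `for_seconds`, up to and including the latest sample."""
    breach_start: Optional[float] = None
    firing = False
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            firing = ts - breach_start >= for_seconds
        else:
            # Any recovery resets the timer, which filters out brief spikes.
            breach_start = None
            firing = False
    return firing


error_rates = [(0, 0.002), (60, 0.02), (120, 0.03), (180, 0.02),
               (240, 0.015), (300, 0.02), (360, 0.02)]
print(sustained_breach(error_rates, threshold=0.01, for_seconds=300))
```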
Alert Severity Levels
Severity levels:
P1 (Critical): user-facing outage, immediate page
Example: "Payment processing error rate above 10%"
Response: on-call engineer paged immediately
P2 (High): degraded experience, page during business hours
Example: "Search latency p99 above 2 seconds"
Response: investigate within 1 hour
P3 (Medium): potential issue, ticket created
Example: "Error budget 80% consumed with 10 days remaining"
Response: investigate within 1 business day
P4 (Low): informational, logged
Example: "Certificate expires in 30 days"
Response: handle during normal work
Alert Fatigue
Alert fatigue occurs when too many alerts, especially false positives, cause operators to ignore or delay responding to real incidents.
Alert fatigue symptoms:
- On-call engineers silence alerts without investigating
- Alert channels have hundreds of unread messages
- Pages are acknowledged but not acted on
- Real incidents are missed because they blend into the noise
Common causes:
- Thresholds set too aggressively (alert on any deviation)
- Alerting on causes instead of symptoms
- No deduplication (one incident triggers 50 related alerts)
- Alerts that cannot be acted upon (informational masquerading as alerts)
Combating Alert Fatigue
Strategies:
1. Regularly review alert volume (aim for fewer than 2 pages per on-call shift)
2. Delete alerts that have not been actionable in 30 days
3. Group related alerts into a single notification
4. Use escalation chains: notification first, page only if unacknowledged
5. Require a runbook for every alert
6. Track signal-to-noise ratio: real incidents / total alerts
PagerDuty reports that organizations with effective alert management resolve incidents 70% faster than those drowning in alert noise.
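Two of the strategies above, grouping related alerts and tracking signal-to-noise, are easy to prototype against an export of past alerts. This is a rough sketch; the `Alert` record, its fields, and grouping by service name are assumptions rather than the behavior of any particular alerting tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    was_real_incident: bool  # filled in during post-incident review


def group_by_service(alerts: Iterable[Alert]) -> Dict[str, List[Alert]]:
    """Collapse related alerts so each service produces one notification."""
    groups: Dict[str, List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[alert.service].append(alert)
    return dict(groups)


def signal_to_noise(alerts: Iterable[Alert]) -> float:
    """Real incidents divided by total alerts."""
    alerts = list(alerts)
    if not alerts:
        raise ValueError("no alerts to evaluate")
    return sum(a.was_real_incident for a in alerts) / len(alerts)


history = [
    Alert("payment-error-rate-high", "payments", True),
    Alert("payment-latency-high", "payments", False),
    Alert("search-cpu-high", "search", False),
    Alert("search-disk-60-percent", "search", False),
]
print({svc: len(items) for svc, items in group_by_service(history).items()})
print(f"signal-to-noise: {signal_to_noise(history):.0%}")
```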
Runbooks
A runbook documents the response procedure for a specific alert. It transforms an alert from "something is wrong" into "here is what to do."
Runbook structure:
Alert: payment-error-rate-high
Severity: P1
Triggers when: payment error rate > 5% for 5 minutes
What this means:
More than 5% of payment attempts are failing.
Users cannot complete purchases.
Immediate actions:
1. Check payment-service dashboard for error breakdown
2. Check downstream payment gateway status page (stripe.com/status)
3. Check recent deployments: did a deploy happen in the last hour?
If gateway is down:
4. Enable fallback payment gateway (see playbook link)
5. Notify customer support team
If recent deploy caused it:
4. Roll back the deployment (see rollback procedure link)
5. Create incident ticket
Escalation:
If not resolved in 15 minutes, escalate to payments team lead
Contact: #payments-oncall Slack channel
Metrics Infrastructure
Collection & Storage
Metrics pipeline:
Application --> metrics library (Prometheus client)
--> Prometheus scrapes every 15s
--> Time-series database (Prometheus TSDB)
--> Long-term storage (Thanos, Cortex, or Mimir)
--> Grafana dashboards
Push vs Pull:
Pull (Prometheus): server scrapes metrics from applications
Pro: server controls collection rate, easy to detect down targets
Con: requires service discovery, harder with short-lived jobs
Push (StatsD, Datadog Agent): application sends metrics to collector
Pro: works with ephemeral jobs, simpler network setup
Con: can overwhelm collector, harder to detect missing sources
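In the pull model, the application exposes its current values over HTTP (conventionally at a `/metrics` path) and the server scrapes them. The sketch below renders a counter in the Prometheus text exposition format using only the standard library. Real services would normally use an official client library instead, and `render_counter` is a hypothetical helper that skips label-value escaping for brevity.

```python
from typing import Dict, List, Tuple


def render_counter(
    name: str,
    help_text: str,
    samples: List[Tuple[Dict[str, str], float]],
) -> str:
    """Render one counter metric family as Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "POST", "status": "500"}, 3)],
))
```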
Cardinality Management
High cardinality (too many unique label combinations) is one of the most common causes of metrics system failure, because every distinct combination of label values becomes its own time series.
Cardinality explosion:
http_requests_total{method, status, endpoint, userId}
methods: 5
statuses: 10
endpoints: 100
userIds: 1,000,000
Total time series: 5 * 10 * 100 * 1,000,000 = 5 billion
This will crash your metrics system.
Fix: never use high-cardinality values (user IDs, request IDs) as metric labels.
Use logs or traces for per-request data.
Use metrics for aggregate data only.
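The worst-case series count is the product of the per-label cardinalities, which makes it easy to check a proposed label set before shipping it. A minimal sketch; `max_series` is a hypothetical helper and the numbers are the ones from the example above.

```python
from math import prod
from typing import Mapping


def max_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values())


with_user_id = {"method": 5, "status": 10, "endpoint": 100, "userId": 1_000_000}
without_user_id = {k: v for k, v in with_user_id.items() if k != "userId"}

print(f"with userId:    {max_series(with_user_id):,}")
print(f"without userId: {max_series(without_user_id):,}")
```

Dropping the single unbounded label brings the upper bound from billions of series down to a few thousand.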
Common Pitfalls
- Alerting on raw metric values instead of rates. A counter value of 500 errors is meaningless without knowing the time period and total request count. Alert on error rate, not error count.
- Setting SLOs without measuring SLIs first. Measure your current performance for at least a month before committing to an SLO. A target chosen without a baseline can exhaust its error budget immediately.
- Using averages for latency. Averages hide outliers. A 50ms average can mask a p99 of 5 seconds. Always use percentiles for latency metrics.
- Too many dashboards that no one looks at. A wall of green dashboards that no one checks provides false comfort. Dashboards should be linked to alerts and used during investigation, not passively monitored.
- Metric label cardinality explosion. Adding user IDs, request IDs, or other unbounded values as metric labels creates millions of time series and crashes the metrics backend.
- No runbooks for alerts. An alert without a runbook is an alarm clock with no purpose. Every alert must have documented response steps.
Key Takeaways
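- The four metric types (counter, gauge, histogram, summary) cover virtually every measurement need. Use counters for rates, gauges for current state, and histograms for distributions; reserve summaries for cases where accurate per-instance quantiles matter more than aggregation.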
- SLIs measure service quality. SLOs set targets. SLAs make contractual commitments. Your SLO should be stricter than your SLA.
- Alert on user-facing symptoms, not internal causes. Cause-based metrics belong on investigation dashboards, not in alert rules.
- Alert fatigue is the silent killer of incident response. Fewer, higher-quality alerts are dramatically more effective than comprehensive alerting on everything.
- Every alert needs a runbook. If you cannot write response steps for an alert, the alert should not exist.