SLIs, SLOs & Error Budgets

Reliability is not a binary state. Your service is not "up" or "down." It exists on a spectrum, and you need a framework to decide how reliable it needs to be and what to do when reliability drops. That framework is SLIs, SLOs, and error budgets. Google's SRE team formalized these concepts, and they have become the standard vocabulary for talking about reliability.

SLI: What You Measure

A Service Level Indicator (SLI) is a quantitative measure of some aspect of your service. It is the raw measurement, not the target. Common SLIs include:

Availability:
  Percentage of requests that succeed (HTTP 2xx or 3xx; decide explicitly how to count 4xx client errors)
  Formula: successful requests / total requests

Latency:
  How long requests take to complete
  Usually measured at p50, p95, and p99
  Example: p99 latency = 250ms (99% of requests complete in under 250ms)

Error rate:
  Percentage of requests that fail (HTTP 5xx)
  Formula: failed requests / total requests

Throughput:
  Number of requests processed per second
  Important for batch processing and data pipelines

Freshness:
  How up-to-date the data is
  Important for caches, search indexes, and data pipelines
  Example: search index is at most 5 minutes behind primary database
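
The request-based SLIs above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Request` record, the function names, and the sample data are all hypothetical, and the percentile uses the simple nearest-rank method rather than whatever your monitoring system applies.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the caller
    duration_ms: float  # time taken to complete the request


def availability(requests):
    """Fraction of requests that returned a 2xx or 3xx status."""
    ok = sum(1 for r in requests if 200 <= r.status < 400)
    return ok / len(requests)


def error_rate(requests):
    """Fraction of requests that returned a 5xx status."""
    failed = sum(1 for r in requests if 500 <= r.status < 600)
    return failed / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at
    least pct percent of all samples are at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical sample of five requests.
sample = [
    Request(200, 80), Request(200, 120), Request(302, 95),
    Request(500, 900), Request(404, 60),
]

print(availability(sample))
print(error_rate(sample))
print(percentile([r.duration_ms for r in sample], 99))
```

With this sample, availability is 0.6 and error rate is 0.2. They do not sum to 1 because the 404 counts toward neither, which is why it pays to decide up front how client errors are treated.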

Choosing Good SLIs

Good SLIs measure what users care about. Users care about whether the page loads, whether it loads fast, and whether the data is correct. They do not care about CPU utilization.

Good SLIs (user-facing):
  - Request success rate
  - Request latency (p99)
  - Data freshness
  - Checkout completion rate

Bad SLIs (infrastructure-focused):
  - CPU utilization
  - Memory usage
  - Disk I/O
  - Number of pods running

Infrastructure metrics are useful for debugging but not for defining reliability from the user's perspective.

SLO: Your Target

A Service Level Objective (SLO) is the target value for an SLI. It defines how reliable your service should be. Not how reliable it can be -- how reliable it should be.

Example SLOs:
  - 99.9% of requests succeed (availability)
  - 99% of requests complete in under 500ms (latency)
  - 99.95% of data is fresh within 1 minute (freshness)

Choosing the Right SLO

The temptation is to set SLOs as high as possible. Resist this. Higher reliability costs more and slows development.

Availability  Allowed downtime per year  Cost and complexity
99%           3.65 days                  Low
99.9%         8.76 hours                 Moderate
99.95%        4.38 hours                 High
99.99%        52.6 minutes               Very high
99.999%       5.26 minutes               Extreme

Going from 99.9% to 99.99% is not a 0.09% improvement. It requires a 10x reduction in downtime. This means redundancy, sophisticated failover, multi-region deployments, and significantly more engineering effort.
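
The downtime figures in the table follow directly from the availability target. As a rough sketch (the function name is illustrative, and a 365-day year is assumed):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted in the period for a given availability target."""
    return period_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")
```

Each extra nine divides the allowed downtime by ten, which is why the cost column climbs so steeply.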

Setting SLOs in Practice

Step 1: Measure your current performance
  "Our service currently succeeds 99.7% of the time"

Step 2: Understand user expectations
  "Our users notice when the service is down for more than 10 minutes"

Step 3: Set an SLO slightly above current performance
  "We will target 99.9% availability"

Step 4: Build alerting around the SLO
  "Alert when we are burning error budget faster than expected"

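Do not set your SLO higher than your dependencies can support. If every request needs your database and that database has 99.95% availability, your service cannot achieve 99.99% unless it can serve some requests without the database (for example, from a cache).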
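
To see why dependencies cap your SLO, multiply the availabilities of everything a request must pass through. The sketch below assumes that every dependency is required for every request and that failures are independent; real systems with caching, retries, or correlated failures will behave differently. The function name and the figures are illustrative.

```python
import math


def serial_availability(*availabilities):
    """Best-case availability when a request needs every component
    and component failures are independent of each other."""
    return math.prod(availabilities)


# Hypothetical: service code at 99.99%, database at 99.95%.
combined = serial_availability(0.9999, 0.9995)
print(f"{combined:.4%}")
```

The combined figure lands below the weakest component, so a 99.99% target on top of a 99.95% database is out of reach without some way to serve requests when the database is down.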

Error Budget: How Much Failure You Can Tolerate

The error budget is the complement of your SLO: 100% minus the target. If your SLO is 99.9% availability, your error budget is 0.1% of requests. This is how much failure you can afford.

SLO: 99.9% availability over a 30-day window

Total requests in 30 days: 10,000,000
Allowed failures: 10,000,000 * 0.001 = 10,000 failed requests

That is your error budget: 10,000 failures per month.
Every deployment that causes errors, every outage, every bug
consumes error budget.
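
The same arithmetic as a small Python sketch. The function names are hypothetical, and rounding the budget to a whole number of requests is a simplification chosen here to sidestep floating-point noise.

```python
def allowed_failures(total_requests, slo_target):
    """Number of failed requests the SLO tolerates in the window."""
    return round(total_requests * (1 - slo_target))


def budget_remaining(total_requests, failed_requests, slo_target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = allowed_failures(total_requests, slo_target)
    return (budget - failed_requests) / budget


print(allowed_failures(10_000_000, 0.999))        # 10000
print(budget_remaining(10_000_000, 2_000, 0.999))  # 0.8
```

Here 2,000 failures against a budget of 10,000 leaves 80% of the budget for the rest of the window.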

Why Error Budgets Matter

Error budgets create a shared language between reliability and velocity. Instead of arguing about whether to ship a risky feature, teams can look at the error budget.

Error budget remaining: 80%
  → Ship features. You have room for some risk.

Error budget remaining: 30%
  → Be careful. Test more thoroughly. Deploy in smaller batches.

Error budget remaining: 5%
  → Freeze features. Focus on reliability work.

Error budget exhausted: 0%
  → Feature freeze. All engineering effort goes to reliability
    until the budget recovers.

This is the key insight: the error budget gives the team permission to take risks when reliability is good and forces them to prioritize reliability when it is not. It removes the subjective argument and replaces it with data.
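
One common way to express "burning error budget faster than expected" (Step 4 in Setting SLOs in Practice) is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is illustrative only; the function names and the alert threshold of 2.0 are assumptions, not values from any particular tool.

```python
def burn_rate(observed_error_rate, slo_target):
    """How fast the budget is being spent relative to plan.

    1.0 means the budget would last exactly the SLO window;
    2.0 means it would be gone in half the window.
    """
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate


def should_alert(observed_error_rate, slo_target, threshold=2.0):
    """Alert once the burn rate reaches a chosen threshold (hypothetical value)."""
    return burn_rate(observed_error_rate, slo_target) >= threshold


# Hypothetical: 0.5% of requests failing against a 99.9% SLO.
print(burn_rate(0.005, 0.999))
print(should_alert(0.005, 0.999))
```

A 0.5% error rate against a 99.9% SLO burns the budget roughly five times faster than planned, so it trips the alert even though 99.5% of requests still succeed.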

Error Budget Policy

An error budget policy defines what happens at different thresholds. Write it down and get buy-in from leadership before an incident forces the conversation.

Error budget policy:
  > 50% remaining:
    - Normal development velocity
    - Standard deployment process

  20-50% remaining:
    - Increase test coverage for new deployments
    - Require canary deployments
    - Review recent changes for reliability risks

  < 20% remaining:
    - Feature freeze for the service
    - All engineering time dedicated to reliability
    - Root cause analysis for error budget consumption
    - Unfreeze when budget recovers to 50%
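
A policy like this is simple enough to encode, which helps keep it from being reinterpreted mid-incident. The following is a sketch under assumptions: the mode names are made up for illustration, and the boundary handling (exactly 50% counts as "cautious") is one reasonable reading of the thresholds above.

```python
def next_mode(current_mode, remaining_pct):
    """Map the remaining error budget (in percent) to a policy mode.

    Modes (hypothetical names): "normal", "cautious", "freeze".
    A freeze is sticky: once frozen, the service stays frozen
    until the budget recovers to 50%.
    """
    if remaining_pct < 20:
        return "freeze"
    if current_mode == "freeze" and remaining_pct < 50:
        return "freeze"  # not yet recovered enough to unfreeze
    if remaining_pct > 50:
        return "normal"
    return "cautious"


mode = "normal"
for remaining in (80, 30, 10, 35, 60):
    mode = next_mode(mode, remaining)
    print(remaining, mode)
```

The sticky freeze matters: without it, a service hovering around 20% would flip between freezing and shipping every few days.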

SLA: The Contract with Customers

A Service Level Agreement (SLA) is a contract between you and your customers. It specifies the level of service you promise to deliver and the penalties if you fail.

SLO vs SLA:

SLO: Internal target. "We aim for 99.9% availability."
     No financial consequences if missed.
     Used to guide engineering priorities.

SLA: External contract. "We guarantee 99.9% availability.
     If we miss it, we credit your account."
     Legal and financial consequences if missed.

Why Most Teams Need SLOs, Not SLAs

SLAs are for contracts with paying customers. They involve legal review, financial penalties, and careful negotiation. Most internal services and most early-stage products do not need SLAs.

SLOs, on the other hand, are for every team. They give you a framework for making reliability decisions without the legal complexity.

When you need an SLA:
  - You sell a service to external customers (AWS, Stripe, Twilio)
  - Your contract requires availability guarantees
  - Customers demand financial accountability for downtime

When you need SLOs (but not SLAs):
  - Internal services consumed by other teams
  - Early-stage products where the priority is speed
  - Any service where you want to balance reliability and velocity

The SLA Buffer

If you offer an SLA, your SLO should be stricter. This gives you a buffer.

SLA to customers: 99.9% availability (contractual guarantee)
Internal SLO:     99.95% availability (engineering target)

The 0.05% gap is your safety margin.
If you hit 99.92%, you have missed your SLO but not your SLA.
The SLO violation triggers reliability work before the SLA is breached.
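
The buffer can be checked mechanically. A minimal sketch, using the targets from the example above; the function name and return strings are illustrative:

```python
def classify(measured, slo=0.9995, sla=0.999):
    """Compare measured availability against the internal SLO
    and the (looser) contractual SLA."""
    if measured >= slo:
        return "within SLO"
    if measured >= sla:
        return "SLO missed, SLA intact"
    return "SLA breached"


print(classify(0.9997))  # within SLO
print(classify(0.9992))  # SLO missed, SLA intact
print(classify(0.9980))  # SLA breached
```

The middle state is the useful one: it is the early warning that lets you act before penalties apply.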

Implementing SLOs

Step 1: Define SLIs

# Example SLI definitions
slis:
  availability:
    description: "Percentage of non-5xx responses"
    query: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))

  latency_p99:
    description: "99th percentile request duration"
    query: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      )

Step 2: Set SLO Targets

# Example SLO configuration
slos:
  - name: "Payment service availability"
    sli: availability
    target: 0.999
    window: 30d

  - name: "Payment service latency"
    sli: latency_p99
    target_threshold: 0.5  # 500ms
    target: 0.99           # 99% of requests under 500ms
    window: 30d
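
The latency SLO reads as "at least 99% of requests finish in under 500ms". Outside of a monitoring system, that check can be sketched directly over raw durations. The function names and sample data below are hypothetical; a real setup would evaluate this from histogram buckets over the 30-day window.

```python
def fraction_within(durations_s, threshold_s):
    """Fraction of requests that completed in under threshold_s seconds."""
    fast = sum(1 for d in durations_s if d < threshold_s)
    return fast / len(durations_s)


def latency_slo_met(durations_s, threshold_s=0.5, target=0.99):
    """True if at least `target` of requests beat the latency threshold."""
    return fraction_within(durations_s, threshold_s) >= target


# Hypothetical window: 990 fast requests and 10 slow ones.
durations = [0.2] * 990 + [0.8] * 10
print(fraction_within(durations, 0.5))
print(latency_slo_met(durations))
```

Ten slow requests in a thousand is exactly at the target; an eleventh would break the SLO.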

Step 3: Calculate Error Budget

# Error budget calculation
# SLO: 99.9% availability over 30 days

# Total budget in minutes: 30 * 24 * 60 = 43,200 minutes
# Error budget: 43,200 * 0.001 = 43.2 minutes of allowed downtime

# Current month: 15 minutes of downtime consumed
# Remaining budget: 43.2 - 15 = 28.2 minutes (65.3% remaining)
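
The same time-based calculation as runnable Python, so it can be repeated for other targets and windows. Function names are illustrative, and this version treats all downtime as full outages, which is a simplification.

```python
def downtime_budget_minutes(slo_target, window_days=30):
    """Minutes of downtime the SLO allows over the window."""
    return window_days * 24 * 60 * (1 - slo_target)


def remaining_budget(slo_target, downtime_minutes, window_days=30):
    """Return (minutes left, fraction of budget left)."""
    budget = downtime_budget_minutes(slo_target, window_days)
    left = budget - downtime_minutes
    return left, left / budget


budget = downtime_budget_minutes(0.999)
left, fraction = remaining_budget(0.999, downtime_minutes=15)
print(f"budget: {budget:.1f} min, left: {left:.1f} min ({fraction:.1%})")
```

Partial degradations do not fit neatly into minutes of downtime, which is one reason request-based budgets are often easier to reason about.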

Real-World Example

A payment processing company set an SLO of 99.95% availability for their checkout API. In the first month, they measured 99.87% -- missing the SLO due to three deployment-related outages totaling 56 minutes.

The error budget policy kicked in: they froze new features for the checkout service and spent two weeks on reliability improvements.

  - They added canary deployments that rolled back automatically on elevated error rates
  - They improved health check endpoints to detect issues faster
  - They added circuit breakers for downstream dependencies

The next month, availability was 99.97%. The error budget was healthy, and development velocity resumed. The SLO framework gave the team a clear, data-driven reason to invest in reliability -- and an equally clear signal for when to resume shipping features.

Common Pitfalls

  - Setting SLOs too high -- A 99.99% SLO for an internal dashboard is wasteful; match the SLO to user expectations
  - Not measuring SLIs before setting SLOs -- If you do not know your current reliability, you cannot set a meaningful target; measure first, then set the SLO
  - Error budget without a policy -- An error budget that nobody acts on is just a number; write the policy before the first breach
  - SLOs that nobody checks -- If the SLO dashboard is not part of the team's weekly review, it might as well not exist
  - Confusing SLOs with SLAs -- SLOs are internal engineering targets; SLAs are contractual guarantees; mixing them up leads to either over-engineering or legal risk
  - Too many SLIs -- Track three to five SLIs per service, not twenty; more metrics means more noise and less clarity

Key Takeaways

  - SLIs are what you measure (latency, error rate, availability), SLOs are your targets, and error budgets are your tolerance for failure
  - Error budgets create a shared language: ship features when the budget is healthy, fix reliability when it is not
  - SLAs are contractual guarantees with penalties; most teams need SLOs, not SLAs
  - Set your SLO based on user expectations and current performance, not on aspirational targets
  - An error budget policy must be written and agreed upon before an incident forces the conversation
  - If you offer an SLA, your internal SLO should be stricter to provide a safety margin