Observability and SRE
Observability is the ability to understand a system's internal state from its external outputs. SRE (Site Reliability Engineering) applies software engineering to operations.
Three Pillars of Observability
Logs
Structured records of discrete events.
{
  "timestamp": "2024-03-15T10:30:45Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "abc123",
  "message": "Payment failed",
  "user_id": 42,
  "error": "card_declined",
  "duration_ms": 250
}
Structured logging: JSON or key-value format. Machine-parseable. Searchable.
Log levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL.
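A structured logger producing JSON lines like the example above can be sketched with Python's standard logging module (the JsonFormatter class, the hard-coded service name, and the context field list are illustrative choices, not a specific library's API):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line: machine-parseable, searchable."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "order-service",  # would come from config in practice
            "message": record.getMessage(),
        }
        # Key-value context passed via `extra=` is attached to the record.
        for key in ("trace_id", "user_id", "error", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("order-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("Payment failed",
          extra={"trace_id": "abc123", "user_id": 42, "error": "card_declined"})
```

Because every line is valid JSON with consistent field names, the aggregation system can index it without custom parsing rules.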
Log aggregation: Collect logs from all instances into a central system. Search, filter, alert. Tools: Elasticsearch + Kibana (ELK), Loki + Grafana, Datadog, Splunk.
Metrics
Numerical measurements aggregated over time.
Metric types:
- Counter: Monotonically increasing value (total requests, errors). Rate = derivative.
- Gauge: Value that can go up and down (temperature, queue length, active connections).
- Histogram: Distribution of values (request latency percentiles). Bucketed counts.
- Summary: Similar to a histogram but computes quantiles client-side. Pre-computed quantiles cannot be meaningfully aggregated across instances, so histograms are usually preferred.
# Prometheus format
http_requests_total{method="GET", path="/api/users", status="200"} 12345
http_request_duration_seconds_bucket{le="0.1"} 8000
http_request_duration_seconds_bucket{le="0.5"} 11000
http_request_duration_seconds_bucket{le="1.0"} 11800
node_cpu_seconds_total{mode="idle"} 98765.4
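Latency percentiles are estimated from those cumulative bucket counts by interpolating within the bucket where the target rank falls. A sketch of the arithmetic, using the bucket values above plus an assumed +Inf (total) count of 12000 (PromQL's histogram_quantile() applies the same idea):

```python
# Cumulative bucket counts (upper bound, count), as in the exposition above.
# The +Inf bucket count of 12000 is an assumption for illustration.
buckets = [(0.1, 8000), (0.5, 11000), (1.0, 11800), (float("inf"), 12000)]

def estimate_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside the bucket
    that contains the q-th observation."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if rank <= count:
            if le == float("inf"):
                return prev_le  # cannot interpolate into +Inf; return lower bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + fraction * (le - prev_le)
        prev_le, prev_count = le, count

print(estimate_quantile(0.95, buckets))  # 0.75 -> estimated p95 ≈ 750ms
```

The estimate is only as precise as the bucket boundaries: within a bucket the distribution is assumed uniform, which is why bucket layout should match the latencies you care about.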
Tools: Prometheus (pull-based), StatsD (push-based), OpenTelemetry, Datadog, CloudWatch.
Traces
End-to-end request flow across services.
Trace ID: abc-123
├── Span: API Gateway (2ms)
│   └── Span: Auth Service (5ms)
├── Span: Order Service (100ms)
│   ├── Span: Database Query (80ms)
│   └── Span: Cache Lookup (2ms)
└── Span: Notification Service (50ms)
    └── Span: Email Provider (45ms)
Total: 157ms
OpenTelemetry: Vendor-neutral standard for traces, metrics, and logs. SDKs for all major languages. Export to Jaeger, Zipkin, Datadog, Grafana Tempo.
Propagation: Trace context (trace ID, span ID) passed in HTTP headers (W3C Trace Context standard).
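A minimal parser for the W3C traceparent header, which carries the context between services (the regex and helper name are illustrative; real services would use an OpenTelemetry propagator rather than hand-rolling this):

```python
import re

# W3C Trace Context: traceparent = version "-" trace-id "-" parent-id "-" flags
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Return the trace context fields, or None if the header is invalid."""
    m = TRACEPARENT.match(header)
    if not m:
        return None
    ctx = m.groupdict()
    # All-zero trace IDs and span IDs are invalid per the spec.
    if set(ctx["trace_id"]) == {"0"} or set(ctx["parent_id"]) == {"0"}:
        return None
    return ctx

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Each service reads the incoming trace ID, creates its own span ID, and forwards both, which is what lets the backend stitch spans into the tree shown above.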
Alerting
Good Alerts
- Actionable: Someone needs to do something about it.
- Relevant: Indicates a real problem affecting users.
- Timely: Fires before users notice.
- Not noisy: Avoid alert fatigue. If an alert fires and nobody cares → delete it.
Alert on Symptoms, Not Causes
Bad: Alert on CPU > 90%. (Maybe the CPU is supposed to be busy.)
Good: Alert on p99 latency > 500ms. (Users are affected.)
Better: Alert on error budget burn rate. (See SLO section.)
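Burn rate measures how fast the error budget is being consumed relative to the SLO window; a sketch of the calculation:

```python
def burn_rate(error_rate, slo):
    """How fast the error budget is being spent.
    1.0 = spent exactly over the SLO window; >1 = spent early."""
    budget = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

# A 1% error rate against a 99.9% SLO burns the budget ≈10x too fast:
print(burn_rate(0.01, 0.999))
```

Alerting on a high burn rate over a short window (plus a lower burn rate over a long window) catches both fast outages and slow leaks without paging on every blip.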
SLI, SLO, SLA
SLI (Service Level Indicator)
A metric that measures service quality from the user's perspective.
| SLI | Measurement |
|---|---|
| Availability | % of requests returning non-5xx |
| Latency | p50, p95, p99 response time |
| Throughput | Requests per second |
| Error rate | % of failed requests |
| Freshness | Age of data (for caches, replicas) |
SLO (Service Level Objective)
A target for an SLI.
SLO: 99.9% of requests return 200 within 500ms over a 30-day window
Error budget = 1 - SLO. For 99.9% SLO: error budget = 0.1% = ~43 minutes/month of downtime.
Error budget policy: If error budget is exhausted → freeze deployments, focus on reliability. If error budget is healthy → deploy faster, take more risk.
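The budget arithmetic, as a sketch:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime (in minutes) within the SLO window."""
    return (1.0 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))   # ≈ 43.2 minutes over 30 days
print(error_budget_minutes(0.9999))  # ≈ 4.3 minutes ("four nines")
```

Each extra nine cuts the budget by 10x, which is why very high SLOs demand automated mitigation: a human cannot reliably respond inside a 4-minute budget.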
SLA (Service Level Agreement)
A contract with consequences (refunds, penalties) if the SLO is not met. External-facing.
The SLA target is typically looser than the internal SLO (e.g., a 99.5% SLA backing a 99.9% SLO), so the team breaches its internal target well before contractual penalties are at risk.
Incident Management
On-Call
Rotation: Engineers take turns being "on-call" (responsible for responding to alerts). Typically 1 week on, several weeks off.
Escalation: If the primary on-call doesn't respond → escalate to secondary → escalate to manager.
Runbooks: Pre-written procedures for common incidents. "If X alert fires, do Y."
Incident Response
- Detect: Alert fires or user reports.
- Triage: Assess severity. Communicate to stakeholders.
- Mitigate: Stop the bleeding (rollback, scale up, disable feature flag).
- Diagnose: Find root cause.
- Resolve: Fix the underlying issue.
- Follow up: Postmortem.
Blameless Postmortems
After an incident, write a postmortem:
## Incident: Database overload
**Duration**: 45 minutes (14:30 - 15:15 UTC)
**Impact**: 30% of API requests returned 503
## Timeline
- 14:30: Monitoring alert: p99 latency > 2s
- 14:35: On-call investigates. Database CPU at 100%.
- 14:40: Identified: bulk import job running without rate limiting.
- 14:45: Killed the bulk import job. Latency recovering.
- 15:15: Fully recovered. Error rate back to normal.
## Root Cause
Bulk import job queried 10M rows without pagination, overwhelming the database.
## Action Items
1. Add rate limiting to bulk import jobs (owner: Alice, due: March 25)
2. Add database CPU alert at 80% (owner: Bob, due: March 20)
3. Implement query timeout for batch jobs (owner: Carol, due: March 28)
## Lessons Learned
- Batch jobs should always have rate limits and timeouts.
- Need better visibility into background job resource usage.
Blameless: Focus on systems and processes, not individuals. "How did the system allow this to happen?" not "Who made the mistake?"
Chaos Engineering
Deliberately inject failures to test system resilience.
Principles (Netflix):
- Define "steady state" (normal behavior metrics).
- Hypothesize that steady state continues during the experiment.
- Introduce real-world events (server crash, network partition, latency spike).
- Try to disprove the hypothesis.
Tools: Chaos Monkey (kill random instances), Litmus (Kubernetes chaos), Gremlin, toxiproxy (network faults).
Start small: Kill one instance. Then: network partition. Then: datacenter failover. Build confidence gradually.
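The idea can be prototyped in-process with a toy fault-injecting decorator (purely illustrative: Chaos Monkey, Litmus, and Gremlin inject faults at the infrastructure level, not by wrapping functions):

```python
import functools
import random
import time

def chaotic(failure_rate=0.1, max_delay_s=0.0, rng=random.Random()):
    """Toy fault injector: randomly delays or fails the wrapped call so that
    callers' retry and timeout handling can be exercised in tests."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if max_delay_s:
                time.sleep(rng.uniform(0, max_delay_s))  # latency spike
            if rng.random() < failure_rate:
                raise ConnectionError("chaos: injected failure")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# A seeded RNG makes the injected failures reproducible in tests.
@chaotic(failure_rate=0.5, rng=random.Random(42))
def fetch_user(user_id):
    return {"id": user_id}
```

This mirrors the experimental method above in miniature: define the steady state (calls succeed), inject a real-world event (errors, latency), and check whether the surrounding code preserves the steady state.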
Toil Reduction
Toil: Manual, repetitive, automatable operational work that doesn't provide lasting value.
Examples: Manual deployments, manual scaling, copy-paste configuration, manual certificate rotation, hand-crafted monitoring alerts.
SRE goal: Automate toil. Spend ≤ 50% of time on toil. The rest on engineering work that reduces future toil.
Applications in CS
- Production operations: Observability is essential for running reliable services at scale.
- Debugging: Distributed tracing reveals where time is spent. Logs provide event-level detail.
- Capacity planning: Metrics drive scaling decisions and resource allocation.
- Reliability engineering: SLOs quantify reliability. Error budgets balance reliability with velocity.
- Team practices: Blameless postmortems improve systems. On-call rotations distribute responsibility.