Circuit Breakers & Retries

When a downstream service fails, naive retry logic can turn a minor outage into a cascading catastrophe. Circuit breakers and well-designed retry strategies protect systems from amplifying failures. This file covers the patterns, their interactions, and how to implement them correctly.

The Problem: Cascading Failures

Service A calls Service B. Service B becomes slow or unresponsive.

Without protection:
  Service A sends requests to Service B
  Service B is slow -> requests pile up
  Service A's thread pool fills up waiting for B
  Service A stops responding to its own clients
  Service C, which depends on A, also fails
  Cascade continues...

A single slow service can take down an entire distributed system. Circuit breakers and retries are the primary defenses.

Circuit Breaker Pattern

The circuit breaker pattern wraps calls to a remote service and monitors for failures. When failures exceed a threshold, the circuit "opens" and calls fail immediately without reaching the remote service.

States

CLOSED (normal operation)
  -> All calls pass through to the service
  -> Failures are counted
  -> If failure count exceeds threshold: transition to OPEN

OPEN (protecting the system)
  -> All calls fail immediately (no network call)
  -> Returns a fallback response or error
  -> After a timeout period: transition to HALF-OPEN

HALF-OPEN (testing recovery)
  -> A limited number of calls pass through
  -> If they succeed: transition to CLOSED
  -> If they fail: transition back to OPEN

Configuration Parameters

Failure threshold: Number or percentage of failures to trigger opening (e.g., 50% failures in the last 10 calls).
Timeout period: How long the circuit stays open before testing with half-open (e.g., 30 seconds).
Success threshold: Number of successful calls in half-open state required to close the circuit (e.g., 3 consecutive successes).
Monitoring window: The time window over which failures are counted (e.g., last 60 seconds).

What Counts as a Failure

HTTP 5xx responses
Timeouts
Connection refused
Exceptions

Do NOT count 4xx responses (client errors) as failures — those indicate bad input, not a broken service.
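
As a rough sketch of the rule above, the classification can be a single predicate. The function name and signature here are hypothetical, chosen only for illustration; a real client library will surface status codes and errors in its own way.

```python
from typing import Optional


def counts_as_failure(status: Optional[int] = None,
                      error: Optional[BaseException] = None) -> bool:
    """Decide whether one call outcome should count against the breaker.

    Hypothetical helper: pass `error` if the call raised (timeout,
    connection refused, other exception), otherwise pass the HTTP status.
    """
    if error is not None:
        # Timeouts, refused connections, and other exceptions all count.
        return True
    if status is None:
        raise ValueError("need either a status code or an error")
    # Only server-side errors count; 4xx means the request itself was bad.
    return 500 <= status <= 599
```

For example, `counts_as_failure(status=503)` is true, while `counts_as_failure(status=404)` is false.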

Example Implementation Logic

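function callService(request):
  if circuit is OPEN:
    if open_timeout has elapsed:
      circuit = HALF_OPEN          // let trial calls through
      success_count = 0
    else:
      return fallback_response     // fail fast, no network call

  try:
    response = service.call(request)
    record_success()
    if circuit is HALF_OPEN:
      success_count += 1
      if success_count >= success_threshold:
        circuit = CLOSED
        reset_failure_counts()
    return response
  catch error:
    record_failure()
    // Any failure while HALF_OPEN reopens the circuit immediately
    if circuit is HALF_OPEN or failure_rate >= failure_threshold:
      circuit = OPEN
      start_timeout_timer()
    throw error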
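
The logic above could be sketched in Python roughly as follows. This is a simplified illustration, not a production implementation: it counts consecutive failures rather than a failure rate over a window, it is not thread-safe, and every name (`CircuitBreaker`, `CircuitOpenError`, `open_timeout`, the injectable `clock`) is an assumption made for the example.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_timeout: float = 30.0,
                 success_threshold: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0   # consecutive failures (simplification)
        self.successes = 0  # successes seen while HALF_OPEN
        self.opened_at = 0.0

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()

    def call(self, fn: Callable[[], T]) -> T:
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: allow trial calls through.
            self.state = State.HALF_OPEN
            self.successes = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Any failure while HALF_OPEN reopens immediately.
            if (self.state is State.HALF_OPEN
                    or self.failures >= self.failure_threshold):
                self._open()
            raise
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the failure streak
        return result
```

A caller would wrap each outbound call, for example `breaker.call(lambda: client.get(url))`, and catch `CircuitOpenError` to return a fallback. Here `client.get` stands in for whatever call is being protected.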

Real-World: Netflix Hystrix (Legacy) & Resilience4j

Netflix pioneered the circuit breaker pattern at scale with Hystrix. Each service-to-service call was wrapped in a Hystrix command with its own circuit breaker, thread pool, and fallback. Hystrix is now in maintenance mode; Resilience4j is the modern Java replacement with the same concepts.

Real-World: Envoy Proxy

Envoy offers two related mechanisms at the proxy level. Its circuit breaking settings cap connections, pending requests, and concurrent retries per upstream cluster. Its outlier detection ejects a backend instance from the load balancer pool for a configurable duration if that instance returns too many 5xx errors. Outlier detection acts as a per-instance circuit breaker without application code changes.

Retry Strategies

Retries handle transient failures — short-lived issues that resolve on their own (network blip, brief overload, temporary unavailability).

Simple Retry

Retry immediately, a fixed number of times.

attempt 1: failed
attempt 2: failed
attempt 3: success

Problem: Immediate retries hit an already-struggling service with more load, potentially making the problem worse.

Retry with Fixed Delay

Wait a fixed amount of time between retries.

attempt 1: failed
wait 1 second
attempt 2: failed
wait 1 second
attempt 3: success

Better than immediate retry, but clients that failed together also retry together, producing "retry storms" at every interval.

Exponential Backoff

Double the delay between each retry.

attempt 1: failed
wait 1 second
attempt 2: failed
wait 2 seconds
attempt 3: failed
wait 4 seconds
attempt 4: success

This gives the failing service progressively more time to recover. The standard formula:

delay = base_delay * 2^(attempt - 1)

With base_delay = 1s, the wait after each failed attempt is:
  Attempt 1: 1s
  Attempt 2: 2s
  Attempt 3: 4s
  Attempt 4: 8s
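
A minimal sketch of the formula in Python. The `max_delay` cap is an assumption added here (not part of the formula above) so that the delay cannot grow without bound; the function name is likewise illustrative.

```python
def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 60.0) -> float:
    """Seconds to wait after the given 1-based failed attempt.

    Implements base_delay * 2^(attempt - 1), capped at max_delay
    (the cap is an assumption for this sketch).
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return min(max_delay, base_delay * 2 ** (attempt - 1))
```

With the defaults, attempts 1 through 4 yield 1, 2, 4, and 8 seconds, matching the table above.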

Exponential Backoff with Jitter

Even with exponential backoff, if 1000 clients all start retrying at the same time (after a service goes down and comes back), they all retry at the same moments, overwhelming the recovering service.

Jitter adds randomness to the delay.

delay = random(0, base_delay * 2^(attempt - 1))

Client A attempt 2: wait 1.3 seconds
Client B attempt 2: wait 0.7 seconds
Client C attempt 2: wait 1.9 seconds

This spreads retries over time, reducing the thundering herd effect. The AWS Architecture Blog's analysis of backoff strategies calls this variant "full jitter" (a random delay between 0 and the calculated backoff) and found it to be a strong default.
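
The jittered formula can be dropped into a small retry loop, sketched below. The helper names, the choice of which exceptions count as transient, and the `max_delay` cap are all assumptions for illustration; `sleep` and `uniform` are injectable only so the loop can be tested without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def full_jitter_delay(attempt: int, base_delay: float = 1.0,
                      max_delay: float = 60.0,
                      uniform: Callable[[float, float], float] = random.uniform
                      ) -> float:
    """random(0, base_delay * 2^(attempt - 1)), capped at max_delay."""
    ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
    return uniform(0.0, ceiling)


def retry_with_jitter(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient errors with exponential backoff + jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base_delay))
    raise AssertionError("unreachable")
```

Errors outside `retry_on` propagate immediately, so a non-transient failure is never retried.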

Retry Budget

Instead of a per-request retry count, set a system-wide retry budget: "retry at most 10% of total requests." This prevents retry amplification under sustained failure.

Normal: 1000 requests/sec, 0 retries
Failure: 1000 requests/sec, 100 retries (10% budget)
  Total load on downstream: 1100 req/sec (manageable)

Without budget: 1000 requests/sec, 3000 retries (3 retries each)
  Total load: 4000 req/sec (4x amplification, makes outage worse)

gRPC implements a similar idea as retry throttling, and Envoy (the proxy that Istio is built on) and Linkerd support retry budgets.
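
A retry budget can be sketched as a pair of counters. This is a deliberately simplified illustration: real implementations typically track the ratio over a sliding time window and must be thread-safe, and the class and method names here are hypothetical.

```python
class RetryBudget:
    """Allow retries only while they stay within a fraction of requests.

    Simplified sketch: counts since creation, no time window, no locking.
    """

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio   # e.g. 0.1 means "retry at most 10% of requests"
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Call once for every original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Return True and spend budget if one more retry is allowed."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True
```

A caller checks `try_acquire_retry()` before each retry and gives up (or falls back) when it returns false, so retries never add more than the configured fraction of extra load.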

Timeouts

Every outbound call must have a timeout. Without one, a hung service causes the caller to wait indefinitely, tying up threads and connections.

Setting Timeouts

Connection timeout: How long to wait for a TCP connection to establish (typically 1-5 seconds).
Read timeout: How long to wait for a response after the request is sent. This depends on the operation: perhaps 1 second for a cache lookup, 30 seconds for generating a report.

Timeout Guidelines

Set timeouts based on the P99 latency of the downstream service, not the average. If P99 is 200ms, a 500ms timeout gives comfortable headroom.
A timeout too short causes false failures during normal slow periods.
A timeout too long defeats the purpose — the caller is still blocked for too long.

Service B P50: 50ms, P99: 200ms, P99.9: 500ms

Good timeout: 500ms-1s (at or above P99.9, well above P99)
Too short: 100ms (fails on normal slow requests)
Too long: 30s (caller blocks for half a minute on failure)
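
One way to turn these guidelines into a number is to derive the timeout from observed latencies. The sketch below uses a nearest-rank percentile and a safety margin; the function names, the default percentile, and the 1.5x margin are assumptions for illustration, not recommendations from any particular library.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in (0, 100])."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout(latencies_ms: Sequence[float], pct: float = 99.0,
                    margin: float = 1.5) -> float:
    """Timeout in ms: the chosen tail percentile times a safety margin."""
    return percentile(latencies_ms, pct) * margin
```

Feeding it recent latency measurements for a dependency gives a starting point that tracks the tail rather than the average; the result should still be checked against what callers and users can tolerate.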

Bulkheads

The bulkhead pattern isolates components so that a failure in one doesn't drain resources from others. Named after the watertight compartments in a ship's hull.

Thread Pool Isolation

Assign each downstream dependency its own thread pool. If Service B is slow, only its thread pool fills up; Service C's thread pool is unaffected.

Without bulkheads:
  Shared thread pool: 200 threads
  Service B hangs -> all 200 threads waiting on B -> no threads for C

With bulkheads:
  Service B pool: 100 threads
  Service C pool: 100 threads
  Service B hangs -> B's 100 threads full, C's 100 threads still available
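
The same isolation can be sketched with a per-dependency concurrency limit. Note that this uses a semaphore rather than a dedicated thread pool: it caps how many callers can be inside a dependency at once and rejects the rest, which is a lighter-weight variant of the idea. The class and exception names are hypothetical.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already reached."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: reject immediately instead of queueing,
        # so a hung dependency cannot tie up additional callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"no capacity left for {self.name}")
        try:
            return fn()
        finally:
            self._slots.release()
```

With one `Bulkhead` per dependency (say, 100 slots for Service B and 100 for Service C), a hang in B can exhaust only B's slots; calls to C keep their own capacity.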

Combining the Patterns

1. Set a TIMEOUT on every outbound call (prevents indefinite blocking)
2. RETRY transient failures with exponential backoff + jitter (handles blips)
3. Wrap retries in a CIRCUIT BREAKER (stops retries when the service is down)
4. Isolate dependencies with BULKHEADS (prevents cascade to unrelated calls)
5. Return a FALLBACK when the circuit is open (maintain partial functionality)

Order of Operations

Incoming request
  -> Bulkhead: is there capacity for this dependency? No -> reject
  -> Circuit breaker: is the circuit open? Yes -> fallback
  -> Make the call with timeout
  -> Success? -> return response
  -> Failure? -> retry with backoff + jitter (within circuit breaker)
  -> All retries failed? -> circuit breaker records failures
  -> Return fallback or error
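
The flow above could be wired together roughly as follows. This is a compressed sketch under several assumptions: `TinyBreaker` is a deliberately minimal stand-in (consecutive failures only, no half-open state), the protected `call` is assumed to accept a timeout and enforce it itself, one failure is recorded per exhausted call rather than per attempt, and all names are hypothetical.

```python
import random
import threading
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RejectedError(Exception):
    """Raised when the bulkhead has no capacity."""


class TinyBreaker:
    """Minimal breaker: opens after N consecutive failed calls."""

    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.open_seconds:
            # Simplification: go straight back to closed after the timeout.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def guarded_call(call: Callable[[float], T], slots: threading.BoundedSemaphore,
                 breaker: TinyBreaker, fallback: Callable[[], T],
                 timeout: float = 0.5, max_attempts: int = 3,
                 base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    # 1. Bulkhead: reject if this dependency has no capacity.
    if not slots.acquire(blocking=False):
        raise RejectedError("bulkhead full")
    try:
        # 2. Circuit breaker: fail fast with the fallback if open.
        if breaker.is_open():
            return fallback()
        # 3-4. Call with a timeout; retry transient errors with jitter.
        for attempt in range(1, max_attempts + 1):
            try:
                result = call(timeout)
            except (TimeoutError, ConnectionError):
                if attempt < max_attempts:
                    sleep(random.uniform(0.0, base_delay * 2 ** (attempt - 1)))
                continue
            breaker.record(ok=True)
            return result
        # 5. All retries failed: tell the breaker, then fall back.
        breaker.record(ok=False)
        return fallback()
    finally:
        slots.release()
```

Recording one failure per exhausted call keeps the breaker from tripping on a single unlucky request; recording every attempt instead would make it open sooner.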

Common Pitfalls

Retrying non-idempotent operations. Retrying a payment charge can charge the customer twice. Only retry operations that are safe to repeat (reads, idempotent writes).
Retrying on all errors. Don't retry 400 Bad Request or 404 Not Found. Retry only on transient errors (503, timeout, connection reset).
No jitter. Exponential backoff without jitter still creates synchronized retry waves. Always add randomness.
Circuit breaker thresholds too sensitive. Opening the circuit on a single failure causes unnecessary outages. Use a percentage over a window (e.g., 50% of 20 calls in 60 seconds).
No fallback when circuit is open. If the circuit opens and the caller just returns an error, the user experience is terrible. Provide degraded functionality: cached data, default values, or a friendly error message.
Timeouts too generous. A 30-second timeout means a user stares at a spinner for 30 seconds. Set timeouts that match user expectations.
Testing only the happy path. Inject failures in staging to verify that circuit breakers open, retries back off, and bulkheads isolate correctly.

Key Takeaways

Circuit breakers prevent calls to a known-failing service, giving it time to recover and protecting the caller.
Retries handle transient failures. Use exponential backoff with jitter to avoid thundering herds.
Timeouts are mandatory on every outbound call. Set them based on P99 latency, not averages.
Bulkheads isolate dependencies so one slow service can't drain all resources.
Combine all four patterns: timeout, retry, circuit breaker, bulkhead. Each handles a different failure mode.
Only retry idempotent operations. Non-idempotent retries cause duplicate side effects.