Graceful Degradation

A system that either works perfectly or fails completely is brittle. Graceful degradation means maintaining partial functionality when components fail or load exceeds capacity. Users get a reduced experience rather than a broken one. This is the difference between "recommendations are temporarily unavailable" and "the entire site is down."

The Principle

Not all features are equally important. When resources are constrained, shed non-essential work and protect the critical path.

E-commerce site under extreme load:
  Critical:     Browse products, add to cart, complete checkout
  Important:    Search, order history, account settings
  Nice-to-have: Personalized recommendations, reviews, recently viewed

Degradation strategy:
  Stage 1: Disable recommendations, serve cached reviews
  Stage 2: Disable search, show curated product lists
  Stage 3: Disable everything except checkout for users with items in cart

Feature Flags

Feature flags (feature toggles) let you enable or disable functionality at runtime without deploying new code. They are the primary mechanism for graceful degradation.

Types of Feature Flags

  • Kill switches: Disable a feature instantly when it causes problems.
  • Operational flags: Control system behavior (enable read-only mode, disable background jobs).
  • Release flags: Gradually roll out new features to a percentage of users.
  • Experiment flags: A/B tests.

Implementation

if feature_flag("recommendations_enabled"):
  recommendations = recommendation_service.get(user_id)
else:
  recommendations = cached_popular_items()

if feature_flag("search_enabled"):
  results = search_service.query(term)
else:
  results = []  # keep the result type consistent; the UI shows the notice instead
  notice = "Search is temporarily unavailable. Browse categories instead."

Feature Flag Service

In production, feature flags are stored in a centralized service (LaunchDarkly, Unleash, or a custom system backed by Redis/database). Application servers poll or subscribe for flag changes.

[Feature Flag Service] <-- Admin dashboard (operators flip flags)
       |
       v (push/poll)
[App Server 1] [App Server 2] [App Server 3]
  Each server evaluates flags against a local in-memory copy, so a flag check adds no network round trip
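
The local-evaluation idea in the diagram can be sketched as follows. This is a minimal illustration, not the API of LaunchDarkly or Unleash: the FlagClient class, its fetch callback, and fake_fetch are hypothetical names. The point it shows is that a failed refresh keeps the last known flag values, so the application keeps working when the flag service itself is unreachable.

```python
import threading
from typing import Callable, Dict


class FlagClient:
    """Hypothetical client that keeps an in-memory copy of all flags."""

    def __init__(self, fetch: Callable[[], Dict[str, bool]], defaults: Dict[str, bool]):
        self._fetch = fetch            # assumed call to the central flag service
        self._flags = dict(defaults)   # used until the first successful refresh
        self._lock = threading.Lock()

    def refresh(self) -> bool:
        """Pull the latest flags; keep the last known values if the fetch fails."""
        try:
            latest = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._flags.update(latest)
        return True

    def is_enabled(self, name: str, default: bool = False) -> bool:
        """Local lookup only: no network call on the request path."""
        with self._lock:
            return self._flags.get(name, default)


def fake_fetch() -> Dict[str, bool]:
    # Stand-in for the flag service: an operator has switched recommendations off.
    return {"recommendations_enabled": False}


client = FlagClient(fake_fetch, defaults={"recommendations_enabled": True, "search_enabled": True})
before = client.is_enabled("recommendations_enabled")
client.refresh()
after = client.is_enabled("recommendations_enabled")
```

In a real deployment, refresh would run on a background timer or be driven by a push subscription; that wiring is omitted here.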

Real-World: Facebook

Facebook's Gatekeeper system manages thousands of feature flags. During incidents, engineers can disable non-critical features within seconds. During the October 2021 outage, however, the same network failure that took the site down also cut engineers off from many internal tools, which slowed recovery. The lesson: control systems such as flag services must stay reachable even during catastrophic failures.

Fallbacks

When a dependency fails, a fallback provides an alternative response instead of an error.

Fallback Strategies

Cached data: Serve the last known good response from cache, even if it's stale.

try:
  product_price = pricing_service.get_price(product_id)
except ServiceUnavailable:
  # Stale but better than nothing; may be empty if the price was never cached
  product_price = cache.get("price:" + product_id)

Default values: Return a sensible default when the real value is unavailable.

try:
  shipping_estimate = shipping_service.estimate(address)
except ServiceUnavailable:
  shipping_estimate = "3-5 business days"  # generic default

Static content: Serve a pre-generated static page when the dynamic system is down.

try:
  homepage = render_dynamic_homepage(user)
except ServiceUnavailable:
  homepage = serve_static_homepage()  # pre-built, cached at CDN

Simplified computation: Use a simpler algorithm when the full one is too expensive.

Normal: ML-based personalized recommendations (calls model server)
Fallback: Most popular items in the user's country (cached list, no ML call)

Real-World: Netflix

Netflix's fallback hierarchy for its homepage:

  1. Personalized recommendations (full ML pipeline)
  2. Cached personalized recommendations (from last successful computation)
  3. Regional popular titles (precomputed, cached)
  4. Global popular titles (static list)

The user always sees something. The experience degrades gracefully through each level.
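
A layered fallback like the one above can be sketched as an ordered list of sources, tried in turn until one returns data. This is an illustration under assumptions, not Netflix's implementation: first_available and the three loader functions are hypothetical, and a source counts as failed if it raises or returns nothing.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Source = Tuple[str, Callable[[], Optional[List[str]]]]


def first_available(sources: Sequence[Source], last_resort: List[str]) -> Tuple[str, List[str]]:
    """Try each source in order; return the name and items of the first that works."""
    for name, load in sources:
        try:
            items = load()
        except Exception:
            continue  # this level failed; degrade to the next one
        if items:
            return name, items
    return "static", last_resort  # the static list never depends on a live service


# Hypothetical loaders standing in for real services.
def personalized() -> Optional[List[str]]:
    raise TimeoutError("model server unavailable")


def cached_personalized() -> Optional[List[str]]:
    return None  # nothing cached for this user


def regional_popular() -> Optional[List[str]]:
    return ["Title A", "Title B"]


level, titles = first_available(
    [("personalized", personalized), ("cached", cached_personalized), ("regional", regional_popular)],
    last_resort=["Global Hit"],
)
```

Returning the level name alongside the items makes it easy to record how often each fallback level is actually served.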

Read-Only Mode

When the write path (database primary, payment processor, or message queue) fails, switching to read-only mode preserves the read experience.

How It Works

Normal mode:
  Users can browse, search, add to cart, checkout, write reviews

Write path failure detected:
  -> Switch to read-only mode
  -> Users can browse and search (reads still work)
  -> Add-to-cart and checkout are disabled with a friendly message
  -> Existing reviews are shown (from cache if needed); submitting new reviews is disabled

Implementation

  • Feature flag toggles write endpoints
  • API returns 503 with a Retry-After header for write requests
  • UI shows a banner: "Some features are temporarily unavailable"
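
The three points above might fit together as in the sketch below. It is framework-free and hypothetical: the Response type, the handle function, and the 30-second Retry-After value are assumptions for illustration, and read_only stands in for the feature flag.

```python
from dataclasses import dataclass, field
from typing import Dict

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
BANNER = "Some features are temporarily unavailable"


@dataclass
class Response:
    status: int
    body: str
    headers: Dict[str, str] = field(default_factory=dict)


def handle(method: str, path: str, read_only: bool, retry_after_s: int = 30) -> Response:
    """Reject writes with 503 + Retry-After while read-only mode is on."""
    if read_only and method.upper() in WRITE_METHODS:
        return Response(503, BANNER, {"Retry-After": str(retry_after_s)})
    # Reads (and writes in normal mode) go through to the real handler, stubbed here.
    return Response(200, f"{method.upper()} {path} ok")


blocked = handle("POST", "/cart", read_only=True)
allowed = handle("GET", "/products", read_only=True)
```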

When to Use

  • Database primary is down but replicas are serving reads
  • Payment processor is offline
  • Message queue is full and writes would block

Queue-Based Load Leveling

Instead of rejecting requests during traffic spikes, put them in a queue and process them at a sustainable rate.

The Pattern

Without load leveling:
  Spike: 10,000 req/sec -> Server capacity: 2,000 req/sec -> failures

With load leveling:
  Spike: 10,000 req/sec -> Queue -> Workers process at 2,000 req/sec
  Requests are delayed but not dropped
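
The cost of "delayed but not dropped" can be estimated with simple arithmetic. The sketch below uses the rates from the example; the 30-second spike duration and the 500 req/sec baseline after the spike are assumed values chosen only for illustration.

```python
def backlog_after_spike(arrival_rps: int, capacity_rps: int, spike_seconds: int) -> int:
    """Requests left in the queue when the spike ends."""
    return max(0, arrival_rps - capacity_rps) * spike_seconds


def drain_seconds(backlog: int, capacity_rps: int, baseline_rps: int) -> float:
    """Time to clear the backlog once traffic returns to its baseline rate."""
    spare = capacity_rps - baseline_rps
    if spare <= 0:
        raise ValueError("no spare capacity: the backlog never drains")
    return backlog / spare


backlog = backlog_after_spike(arrival_rps=10_000, capacity_rps=2_000, spike_seconds=30)
wait = drain_seconds(backlog, capacity_rps=2_000, baseline_rps=500)
# backlog == 240000 requests, wait == 160.0 seconds
```

Under these assumptions the queue grows by 8,000 requests per second during the spike, so even a short spike leaves a backlog that takes minutes to clear.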

Appropriate Workloads

Load leveling works for operations where the user doesn't need an immediate result:

  • Order processing (user gets "order received" immediately; processing happens async)
  • Image/video processing
  • Report generation
  • Email/notification sending

It does NOT work for synchronous, latency-sensitive operations like page loads or search queries.

Queue Depth as a Signal

Monitor queue depth. If it grows continuously, the system is under-provisioned. Use queue depth as an auto-scaling trigger for workers.

Queue depth > 1000 for 5 minutes -> scale up workers
Queue depth < 100 for 10 minutes -> scale down workers
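
The two rules above can be written as a small decision function. This sketch assumes one queue-depth sample per minute, oldest first; scaling_decision and the returned action names are hypothetical. Requiring every sample in the window to cross the threshold keeps a single noisy reading from triggering a scaling action.

```python
from typing import Sequence


def scaling_decision(depth_per_minute: Sequence[int]) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' from recent queue-depth samples."""
    last5 = depth_per_minute[-5:]
    last10 = depth_per_minute[-10:]
    if len(last5) == 5 and all(d > 1000 for d in last5):
        return "scale_up"      # depth above 1000 for 5 consecutive minutes
    if len(last10) == 10 and all(d < 100 for d in last10):
        return "scale_down"    # depth below 100 for 10 consecutive minutes
    return "hold"


sustained_high = scaling_decision([1200, 1500, 1300, 1100, 1400])
brief_dip = scaling_decision([1200, 1200, 900, 1200, 1200])
```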

Real-World: Shopify Flash Sales

During flash sales, Shopify queues checkout requests. Users see a "waiting room" (queue position) rather than an error page. The checkout system processes orders at its maximum sustainable rate. Users wait longer but ultimately succeed.

Priority Shedding

When the system can't handle all requests, shed low-priority work to protect high-priority work.

Priority Classification

Priority 1 (Critical):  Checkout, payment processing, authentication
Priority 2 (Important): Product pages, search, cart operations
Priority 3 (Normal):    Account settings, order history, recommendations
Priority 4 (Low):       Analytics collection, background sync, pre-fetching

Shedding Strategy

System at 80% capacity:  Process all priorities
System at 90% capacity:  Drop Priority 4
System at 95% capacity:  Drop Priority 3 and 4
System at 99% capacity:  Drop Priority 2, 3, and 4 (only Priority 1 is served)

Implementation

Tag requests with priority at the edge (load balancer or API gateway). Under load, the gateway rejects low-priority requests with 503 before they consume backend resources.

[API Gateway]
  -> Check system load (CPU, queue depth, error rate)
  -> Compare request priority against current threshold
  -> Request is less important than the threshold: reject with 503 + Retry-After
  -> Request is at or above the threshold's importance: forward to backend
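
The gateway check can be sketched using the thresholds from the shedding strategy above. The function names, the single load figure between 0.0 and 1.0, and the 5-second Retry-After value are assumptions for illustration; a real gateway would combine several signals. Note that priority 1 is the most important, so a higher number means the request is shed earlier.

```python
from typing import Dict, Tuple


def max_admitted_priority(load: float) -> int:
    """Largest priority number still admitted at the given load (0.0 to 1.0)."""
    if load >= 0.99:
        return 1   # only critical requests
    if load >= 0.95:
        return 2
    if load >= 0.90:
        return 3
    return 4       # everything is admitted


def admit(priority: int, load: float) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers): 200 to forward the request, 503 to shed it."""
    if priority > max_admitted_priority(load):
        return 503, {"Retry-After": "5"}
    return 200, {}


analytics_status, analytics_headers = admit(priority=4, load=0.92)
checkout_status, _ = admit(priority=1, load=0.995)
```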

Real-World: Google

Google's load balancing system implements priority-based admission control. During overload, lower-priority requests (prefetch, background sync) are shed before user-facing search queries. The Google SRE book describes this approach in its discussion of load shedding and request criticality.

Designing for Degradation

Step 1: Classify Features by Criticality

Map every feature to a criticality tier. Get agreement from product and engineering.

Step 2: Define Degradation Modes

For each criticality tier, define what happens when it is shed: fallback response, cached data, disabled UI element, error message.

Step 3: Implement Feature Flags

Wire each degradation mode to a feature flag that can be flipped manually or automatically.

Step 4: Automate Triggers

Connect system metrics (error rate, latency, queue depth) to automatic flag changes. For example, when P99 latency exceeds 2 seconds, automatically disable recommendations.
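
A latency-based trigger might look like the sketch below. It assumes latency samples in seconds from a recent window and a plain dictionary of flags; p99 and apply_latency_trigger are hypothetical helpers, and the percentile uses the nearest-rank method. The trigger only turns the feature off; turning it back on is left to an operator or a separate recovery rule so the flag does not flap.

```python
from typing import Dict, Sequence


def p99(samples: Sequence[float]) -> float:
    """99th percentile by the nearest-rank method; samples must be non-empty."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100   # integer form of ceil(0.99 * n)
    return ordered[rank - 1]


def apply_latency_trigger(
    flags: Dict[str, bool], latencies_s: Sequence[float], limit_s: float = 2.0
) -> Dict[str, bool]:
    """Return a copy of flags with recommendations disabled if P99 exceeds the limit."""
    updated = dict(flags)
    if latencies_s and p99(latencies_s) > limit_s:
        updated["recommendations_enabled"] = False
    return updated


flags = {"recommendations_enabled": True, "search_enabled": True}
slow_window = [0.2] * 98 + [2.5, 3.0]   # P99 is 2.5 s
fast_window = [0.2] * 99 + [3.0]        # P99 is 0.2 s; one outlier is not enough
after_slow = apply_latency_trigger(flags, slow_window)
after_fast = apply_latency_trigger(flags, fast_window)
```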

Step 5: Test Degradation

Regularly test each degradation mode. Verify that disabling a feature doesn't crash the system. Typical failures are missing null checks and unhandled exceptions from calls to disabled services.

Common Pitfalls

  • No fallback for critical dependencies. If the payment service is down and there's no fallback (even a "try again later" message), the checkout page crashes instead of degrading.
  • Feature flags without testing. A flag that hasn't been flipped in 6 months may not work when you need it. Test flags regularly.
  • Degradation that's invisible to users. If features silently disappear, users think the site is broken. Show clear messaging: "Recommendations temporarily unavailable."
  • All-or-nothing design. Many systems have two modes: fully operational and completely down. Design intermediate modes.
  • Shedding without priority. If all requests are treated equally, critical operations fail alongside non-critical ones. Classify and prioritize.
  • Queue without backpressure. A load-leveling queue that grows without bound will eventually exhaust memory. Set a maximum depth and reject new work beyond it.
  • No runbooks for degradation. Operators need to know which flags to flip and in what order. Document the degradation playbook.
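
The "queue without backpressure" pitfall above can be avoided with a bounded queue that refuses work once it is full. The sketch below uses Python's standard queue module; the maximum depth of 3 and the enqueue helper are illustrative assumptions. A False return is the caller's cue to respond with 503 rather than buffer more work.

```python
import queue

MAX_DEPTH = 3  # illustrative limit; size it from memory budget and acceptable wait time
jobs: "queue.Queue[str]" = queue.Queue(maxsize=MAX_DEPTH)


def enqueue(job: str) -> bool:
    """Accept the job if there is room; otherwise signal backpressure."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False


accepted = [enqueue(f"job-{i}") for i in range(5)]
# accepted == [True, True, True, False, False]
```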

Key Takeaways

  • Graceful degradation means keeping partial functionality instead of failing completely.
  • Feature flags are the control plane for degradation. Every non-critical feature should have a kill switch.
  • Fallbacks (cached data, defaults, static content) keep the user experience alive when dependencies fail.
  • Read-only mode preserves the read path when the write path is unavailable.
  • Queue-based load leveling absorbs traffic spikes for asynchronous workloads.
  • Priority shedding ensures critical operations survive when the system is overloaded.
  • Design, implement, and test degradation modes before you need them.