Fault Tolerance Fundamentals

Systems fail. Hardware dies, networks partition, software crashes, humans make mistakes. Fault tolerance is the ability of a system to continue operating correctly — possibly at reduced capacity — when components fail. The goal is not to prevent all failures but to ensure no single failure brings down the whole system.

Circuit breaker state machine pattern

Redundancy

Redundancy means having more capacity or components than strictly necessary, so that a failure can be absorbed.

Types of Redundancy

Active-active: All instances handle traffic simultaneously. If one fails, the others absorb its share.

[Load Balancer]
  -> [Server A] (handling requests)
  -> [Server B] (handling requests)
  -> [Server C] (handling requests)

Server B fails -> A and C absorb B's traffic

Active-passive: One instance handles traffic; the standby waits idle. On failure, the standby takes over.

[Server A] (active, handling requests)
[Server B] (passive, idle, receiving replication)

Server A fails -> Server B promoted to active

Active-active trade-offs:

Better resource utilization (all nodes work)
More complex (state synchronization, conflict resolution)
Faster failover (no promotion step)

Active-passive trade-offs:

Simpler (one writer, no conflicts)
Wasted capacity (standby is idle)
Slower failover (promotion takes time)

Redundancy at Every Layer

Layer           Redundancy approach
DNS             Multiple NS records, multiple DNS providers
Load balancer   Active-passive pair or anycast
Application     Multiple stateless instances
Cache           Replicated Redis cluster
Database        Primary + replicas, multi-AZ
Message queue   Clustered brokers, replicated topics
Storage         S3 (11 nines durability), RAID, replicated volumes

Replication

Replication copies data across multiple nodes so that no single node holds the only copy.

Synchronous Replication

The writer waits for replicas to acknowledge before confirming the write.

Pros: No data loss on primary failure (all replicas are up to date).
Cons: Higher write latency. If a replica is slow or unreachable, writes stall.

Asynchronous Replication

The writer confirms immediately; replicas catch up in the background.

Pros: Low write latency.
Cons: Replicas may lag. If the primary fails before replication completes, recent writes are lost.

Quorum Replication

Write to W of N replicas; read from R of N replicas. As long as W + R > N, the reader is guaranteed to see the latest write.

N = 3 replicas
W = 2 (write to 2 replicas before confirming)
R = 2 (read from 2 replicas, take latest)

W + R = 4 > 3 -> guaranteed overlap with the latest write

Used by DynamoDB, Cassandra, and other leaderless databases.

Real-World: PostgreSQL Streaming Replication

PostgreSQL streams WAL (Write-Ahead Log) records to replicas. In synchronous mode, the primary waits for at least one replica to write the WAL to disk. In asynchronous mode, the primary doesn't wait. Most production deployments use async with one synchronous standby as a compromise.

Failover

Failover is the process of switching from a failed component to a healthy backup.

Automatic Failover

A health-checking mechanism detects failure and triggers promotion of a standby.

1. Health checker pings primary every 5 seconds
2. Primary misses 3 consecutive checks
3. Health checker declares primary dead
4. Standby is promoted to primary
5. DNS or load balancer updated to point to new primary
6. Clients reconnect

Manual Failover

An operator manually promotes the standby. Slower but avoids false positives (split-brain scenarios where both nodes think they are primary).

Failover Challenges

Detection time: How quickly is a failure detected? Faster detection means shorter downtime but higher risk of false positives.
Split brain: If the old primary is not actually dead (just slow or network-partitioned), you may end up with two primaries accepting conflicting writes. Fencing (STONITH — Shoot The Other Node In The Head) prevents this by forcibly shutting down the old primary.
Data loss window: With async replication, the new primary may be missing the most recent writes from the old primary.
Connection reset: Clients with open connections to the old primary must reconnect. Connection poolers and smart drivers handle this automatically.

Real-World: Amazon RDS Multi-AZ

RDS Multi-AZ maintains a synchronous standby in a different Availability Zone. On primary failure, RDS automatically promotes the standby and updates the DNS endpoint. Typical failover time is 60-120 seconds.

Single Points of Failure

A single point of failure (SPOF) is any component whose failure brings down the entire system.

Common SPOFs

A single database instance with no replicas
A single load balancer with no failover pair
A single DNS provider
A single network path between data centers
A configuration service (etcd, ZooKeeper) with no cluster
A single person who knows how to deploy (human SPOF)

Finding SPOFs

Walk through the request path and ask: "If this component fails, does the system stop working?"

User -> DNS -> CDN -> Load Balancer -> App Server -> Cache -> Database

Is DNS redundant?         Multiple NS records, multiple providers?
Is the LB redundant?      Active-passive pair?
Are app servers redundant? Multiple instances behind the LB?
Is the cache redundant?    Replicated cluster? Can the system work without it?
Is the DB redundant?       Primary + replica? Automatic failover?

Eliminating SPOFs

Every critical component should have at least one backup. The backup should be tested regularly — an untested backup is not a backup.

Blast Radius

Blast radius is the scope of impact when a failure occurs. Good architecture minimizes blast radius so that a failure affects the smallest possible portion of the system.

Reducing Blast Radius

Cell-based architecture: Divide users into cells (independent, isolated groups). Each cell has its own infrastructure. A failure in Cell A doesn't affect Cell B.

Cell 1: Users 1-1M      [App servers] [Database] [Cache]
Cell 2: Users 1M-2M     [App servers] [Database] [Cache]
Cell 3: Users 2M-3M     [App servers] [Database] [Cache]

Database failure in Cell 2 affects only users 1M-2M

Service isolation: Each microservice runs independently. A failure in the recommendation service doesn't prevent users from browsing the catalog.

Availability zone isolation: Deploy across multiple AZs. An AZ outage affects only the instances in that zone; the other AZs continue serving.

Regional isolation: Deploy across multiple regions. A full regional outage (rare but possible) only affects users routed to that region.

Real-World: AWS Cell-Based Architecture

AWS itself uses cell-based architecture internally. DynamoDB, for example, partitions its control plane into cells. An issue in one cell's control plane doesn't affect tables in other cells. This is why DynamoDB outages tend to be partial rather than global.

Real-World: Slack

Slack partitions its infrastructure by workspace. Each workspace is assigned to a specific shard of infrastructure. An incident on one shard affects only the workspaces assigned to it, limiting the blast radius.

Failure Domains

A failure domain is a group of components that can fail together due to a shared dependency.

Failure domain examples:
  - All servers on one physical rack (shared power, top-of-rack switch)
  - All instances in one Availability Zone (shared data center)
  - All services depending on one database
  - All clients using one DNS resolver

Spread critical components across failure domains. Don't put all replicas on the same rack, in the same AZ, or in the same region.

Common Pitfalls

Assuming hardware is reliable. Disks fail, servers crash, entire data centers go dark. Design for failure from day one.
Untested failover. If you've never tested failover, it doesn't work. Schedule regular failover drills.
Split brain after failover. Without proper fencing, both the old and new primary accept writes, causing data divergence. Implement fencing mechanisms.
Redundancy without diversity. Running all replicas on the same cloud provider, same OS, same version means a single bug can take them all down. Consider multi-provider or multi-version diversity for critical systems.
Ignoring human SPOFs. If one engineer is the only person who can deploy or debug the system, that person is a single point of failure. Document procedures and cross-train.
Large blast radius by default. A single shared database backing many services means one DB failure takes everything down. Isolate failure domains.

Key Takeaways

Redundancy is the foundation of fault tolerance. Every critical component needs a backup.
Replication keeps data safe. Choose synchronous, asynchronous, or quorum based on your durability and latency requirements.
Failover must be fast and tested. An untested failover plan is no plan at all.
Identify and eliminate single points of failure at every layer of the stack.
Minimize blast radius through cell-based architecture, service isolation, and multi-AZ/multi-region deployment.
Spread components across failure domains. Don't let a shared dependency take down everything at once.