Disaster Recovery

Disaster recovery (DR) is the set of policies, tools, and procedures for restoring a system after a catastrophic failure — data center outage, region-wide cloud failure, data corruption, or a severe security breach. The goal is to resume operations within acceptable time and data-loss limits.

Every system needs a DR plan proportional to the cost of downtime. For some businesses, an hour of downtime is an inconvenience. For others, it's millions in lost revenue and regulatory penalties.

RPO & RTO

The two metrics that define every DR plan:

RPO (Recovery Point Objective)

The maximum acceptable amount of data loss, measured in time.

RPO = 0: No data loss. Requires synchronous replication.
RPO = 1 hour: Can lose up to 1 hour of data. Hourly backups are sufficient.
RPO = 24 hours: Daily backups are sufficient.
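
As a rough sketch of how a backup schedule maps to an RPO, the worst case can be estimated from the backup interval and how long each backup takes. The function names are hypothetical, and the model assumes each backup captures the state at the moment it starts and is usable only once it finishes.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Worst case: the failure hits just before the next backup finishes,
    so the newest usable backup reflects state from one interval plus one
    backup duration ago (assumed model, see lead-in)."""
    return backup_interval + backup_duration

def meets_rpo(rpo, backup_interval, backup_duration=timedelta(0)):
    """True if the schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo

hourly = timedelta(hours=1)
print(meets_rpo(timedelta(hours=1), hourly))                          # instant backups
print(meets_rpo(timedelta(hours=1), hourly, timedelta(minutes=10)))   # 10-minute backups
```

Under this model, an hourly schedule only meets a 1-hour RPO if the backup itself is effectively instantaneous, so a slightly shorter interval may be needed in practice.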

RTO (Recovery Time Objective)

The maximum acceptable downtime, measured from the moment of failure to full service restoration.

RTO = 0 (near-zero): Automatic failover to a hot standby. Seconds of downtime.
RTO = 1 hour: Manual failover to a warm standby, plus validation.
RTO = 24 hours: Restore from backups, rebuild infrastructure.

The Cost Trade-Off

Lower RPO and RTO cost more. Zero data loss and zero downtime require synchronous replication to a hot standby in a separate region — expensive infrastructure and complex operations.

RPO / RTO     Approach                                Relative cost
24h / 24h     Daily backups to S3                     $
4h / 4h       Frequent backups + warm DR              $$
1h / 1h       Continuous replication + warm standby   $$$
~0 / ~0       Multi-region active-active              $$$$

Choose RPO and RTO based on business impact. Not every system needs active-active multi-region.
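
One way to read the table is as a lookup: pick the cheapest approach whose RPO and RTO are both within the business requirement. The sketch below encodes the four rows as data. The `Tier` class and `cheapest_tier` function are illustrative names, and "~0" is simplified to 0 hours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    rpo_hours: float
    rto_hours: float
    cost_rank: int  # 1 = "$" ... 4 = "$$$$"

# The four rows of the cost table, with "~0" treated as 0 (a simplification).
TIERS = [
    Tier("Daily backups to S3", 24, 24, 1),
    Tier("Frequent backups + warm DR", 4, 4, 2),
    Tier("Continuous replication + warm standby", 1, 1, 3),
    Tier("Multi-region active-active", 0, 0, 4),
]

def cheapest_tier(required_rpo_hours, required_rto_hours):
    """Return the lowest-cost tier that satisfies both objectives."""
    candidates = [
        t for t in TIERS
        if t.rpo_hours <= required_rpo_hours and t.rto_hours <= required_rto_hours
    ]
    if not candidates:
        raise ValueError("objectives must be non-negative")
    return min(candidates, key=lambda t: t.cost_rank)

print(cheapest_tier(24, 24).name)
print(cheapest_tier(6, 2).name)
```

Note that the stricter of the two objectives drives the choice: a relaxed RPO does not help if the RTO still demands a warm standby.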

Backup Strategies

Full Backups

Copy the entire dataset. Simple to restore but large and slow to create.

Monday: Full backup (100 GB)
Tuesday: Full backup (101 GB)
Wednesday: Full backup (102 GB)

Storage cost: ~300 GB for 3 days of backups

Incremental Backups

After the initial full backup, only back up the data that changed since the last backup (full or incremental).

Monday: Full backup (100 GB)
Tuesday: Incremental (2 GB of changes since Monday)
Wednesday: Incremental (3 GB of changes since Tuesday)

Storage cost: ~105 GB for 3 days
Restore: Apply Monday full + Tuesday incremental + Wednesday incremental

Faster backups, less storage, but slower restores (must replay the chain).

Differential Backups

After the initial full backup, back up everything that changed since the last full backup.

Monday: Full backup (100 GB)
Tuesday: Differential (2 GB of changes since Monday)
Wednesday: Differential (5 GB of changes since Monday — includes Tuesday's changes)

Restore: Apply Monday full + Wednesday differential (just two steps)

A middle ground: faster restores than incremental, less storage than full.
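
The storage and restore trade-offs of the three strategies can be checked with a little arithmetic. This sketch reuses the figures from the examples above and assumes that changes on different days do not overlap, which is why Wednesday's differential is 2 + 3 = 5 GB. The function names are illustrative.

```python
# Figures from the three-day examples, in GB.
FULL_GB = [100, 101, 102]   # full backup size on Monday, Tuesday, Wednesday
CHANGED_GB = [0, 2, 3]      # new changes on each day (none counted for Monday)

def storage_gb(strategy):
    """Total storage after three days for the given strategy."""
    if strategy == "full":
        return sum(FULL_GB)
    if strategy == "incremental":
        return FULL_GB[0] + sum(CHANGED_GB[1:])
    if strategy == "differential":
        # Each differential holds every change since Monday's full backup.
        return FULL_GB[0] + sum(
            sum(CHANGED_GB[1:day + 1]) for day in range(1, len(CHANGED_GB))
        )
    raise ValueError(f"unknown strategy: {strategy}")

def restore_chain(strategy, day):
    """Backups that must be applied, in order, to restore the given day (0 = Monday)."""
    if strategy == "full":
        return [f"full[{day}]"]
    if strategy == "incremental":
        return ["full[0]"] + [f"incremental[{d}]" for d in range(1, day + 1)]
    if strategy == "differential":
        return ["full[0]"] + ([f"differential[{day}]"] if day > 0 else [])
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("full", "incremental", "differential"):
    print(s, storage_gb(s), "GB,", len(restore_chain(s, 2)), "restore step(s)")
```

The incremental chain grows by one step per day until the next full backup, while the differential chain stays at two steps but each differential grows larger.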

Point-in-Time Recovery (PITR)

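Combine a base backup with continuous archiving of the database's change log: the write-ahead log (WAL) in PostgreSQL, or the binary log in MySQL. Replaying the log on top of the base backup restores the database to any specific point in time covered by the archive.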

Sunday 00:00: Base backup
WAL segments archived continuously

Restore to Wednesday 14:30:
  1. Restore Sunday's base backup
  2. Replay WAL segments up to Wednesday 14:30
  3. Database reflects exact state at that moment

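PostgreSQL, MySQL, and managed services such as Amazon RDS support PITR. It is widely treated as the gold standard for database disaster recovery because it can also undo logical errors, such as an accidental delete, that replication would simply copy to every replica.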
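
The restore procedure above amounts to choosing the newest base backup at or before the target time, then replaying every log segment between the two. The sketch below plans such a restore. It is a simplified model, not a database API: `plan_pitr` is a hypothetical helper, segments are represented as time ranges rather than real WAL files, and the dates are made up so that the base backup falls on a Sunday.

```python
from datetime import datetime, timedelta

def plan_pitr(base_backups, wal_segments, target):
    """Pick the base backup and the log segments needed to reach `target`.

    base_backups: datetimes at which base backups were taken.
    wal_segments: (start, end) datetime pairs covered by each archived segment.
    """
    eligible = [b for b in base_backups if b <= target]
    if not eligible:
        raise ValueError("no base backup at or before the target time")
    base = max(eligible)  # the most recent base backup shortens the replay

    needed = sorted(s for s in wal_segments if s[1] > base and s[0] <= target)

    # Replay only works if the archive is continuous from base to target.
    covered_until = base
    for start, end in needed:
        if start > covered_until:
            raise ValueError(f"gap in log archive at {covered_until}")
        covered_until = max(covered_until, end)
    if covered_until < target:
        raise ValueError("log archive ends before the target time")
    return base, needed

# Hypothetical timeline: base backup on a Sunday, six-hour segments after it.
sunday = datetime(2024, 1, 7, 0, 0)
segments = [
    (sunday + timedelta(hours=6 * i), sunday + timedelta(hours=6 * (i + 1)))
    for i in range(16)
]
base, replay = plan_pitr([sunday], segments, datetime(2024, 1, 10, 14, 30))
print(base, len(replay))
```

The continuity check matters: a single missing segment makes every later point in time unreachable, which is one reason archive monitoring belongs in the DR plan.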

Backup Locations

Same region, different AZ: Protects against single-AZ failure. Does not protect against regional disaster.
Different region: Protects against regional outage. Higher storage and transfer costs.
Different cloud provider: Protects against provider-wide incidents. Highest complexity.
Offline/air-gapped: Protects against ransomware and compromise. Essential for critical data.

Real-World: GitLab Database Incident (2017)

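In January 2017, a GitLab engineer accidentally deleted the data directory of the primary production database while troubleshooting replication. Five backup and replication methods were nominally in place, but none had been tested and none worked reliably when needed. Recovery relied on a snapshot taken by chance about six hours earlier, so roughly six hours of data were lost. Lesson: backup systems must be verified regularly.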

Multi-Region Architecture

Pilot Light

Minimal DR infrastructure is always running in the DR region: core services (database replicas) but no application servers.

Primary region: Full stack running
DR region: Database replica running, app servers stopped

Failover:
  1. Start app servers in DR region (~minutes)
  2. Promote the database replica to primary
  3. Point DNS to DR region
  4. Validate and serve traffic

RTO: 10-30 minutes (depends on startup time)
RPO: Depends on replication lag (typically seconds with async replication)
Cost: Low — only database replica is always running

Warm Standby

A scaled-down version of the full environment runs in the DR region.

Primary region: 20 app servers, full database
DR region: 2 app servers, database replica, all services at minimum scale

Failover:
  1. Scale up DR app servers
  2. Promote database replica to primary
  3. Update DNS
  4. Validate

RTO: 5-15 minutes
RPO: Seconds (async replication)
Cost: Moderate
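
A failover procedure like the ones above can be scripted as an ordered list of steps that stops at the first failure and reports the elapsed time against the RTO. This is a minimal sketch: `run_failover` is a hypothetical helper, and the step actions are stubs standing in for real infrastructure calls.

```python
import time

def run_failover(steps, rto_seconds):
    """Run (name, action) steps in order; each action returns True on success.

    Stops at the first failing step so later steps never run against a
    half-prepared environment.
    """
    started = time.monotonic()
    completed = []
    for name, action in steps:
        if not action():
            return {
                "ok": False,
                "failed_step": name,
                "completed": completed,
                "elapsed_seconds": time.monotonic() - started,
            }
        completed.append(name)
    elapsed = time.monotonic() - started
    return {
        "ok": True,
        "completed": completed,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }

# Stub actions (assumptions): replace with real scaling, promotion, and DNS calls.
warm_standby_steps = [
    ("Scale up DR app servers", lambda: True),
    ("Promote database replica to primary", lambda: True),
    ("Update DNS", lambda: True),
    ("Validate", lambda: True),
]
report = run_failover(warm_standby_steps, rto_seconds=15 * 60)
print(report["ok"], report["completed"])
```

Recording the elapsed time on every drill turns the RTO from a claim into a measurement.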

Active-Active (Multi-Region)

Both regions handle live traffic simultaneously. Each region has a full copy of the data.

US-East: Full stack, serving US users, writing to local DB
EU-West: Full stack, serving EU users, writing to local DB
Cross-region replication keeps both in sync

Region failure:
  DNS automatically routes all traffic to the surviving region
  No manual intervention needed

RTO: Seconds to a few minutes (automatic DNS failover, bounded by health-check interval and DNS TTL)
RPO: Near-zero (synchronous or fast async replication)
Cost: High — double infrastructure plus cross-region replication
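
The routing behavior described above can be modeled as a small decision function: send each user to their home region while it is healthy, otherwise to a surviving region. In practice this logic lives in health-checked DNS or a global load balancer. `pick_region` and the health map are illustrative stand-ins.

```python
def pick_region(home_region, healthy):
    """Return the region that should serve a user.

    healthy: mapping of region name -> bool from health checks (assumed input).
    """
    if healthy.get(home_region, False):
        return home_region
    survivors = sorted(region for region, ok in healthy.items() if ok)
    if not survivors:
        raise RuntimeError("no healthy region available")
    return survivors[0]  # deterministic choice among surviving regions

health = {"us-east": True, "eu-west": True}
print(pick_region("eu-west", health))   # normal operation: stay local

health["eu-west"] = False               # simulated regional failure
print(pick_region("eu-west", health))   # traffic moves to the survivor
```

The surviving region must have enough capacity to absorb the redirected traffic, which is part of why active-active costs more than double a single-region deployment's baseline.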

Active-Active Challenges

Conflict resolution: Two regions writing to the same record simultaneously. Strategies: last-write-wins, merge functions, conflict-free replicated data types (CRDTs).
Data residency: GDPR may require EU data to stay in EU. Active-active must respect geographic constraints.
Latency: Cross-region replication adds latency to writes if synchronous, or introduces consistency lag if asynchronous.
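
Two of the conflict-resolution strategies named above are easy to illustrate. The sketch below shows last-write-wins on a single value and a grow-only counter (G-Counter), one of the simplest CRDTs. The names and the tuple layout are assumptions for illustration, and last-write-wins as written trusts the regions' clocks.

```python
def lww_merge(a, b):
    """Last-write-wins: each write is (timestamp, region, value).

    The higher timestamp wins; the region name breaks ties so both
    regions reach the same answer. The losing write is silently dropped.
    """
    return max(a, b, key=lambda w: (w[0], w[1]))

class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by per-slot maximum."""

    def __init__(self):
        self.counts = {}

    def increment(self, region, n=1):
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other):
        merged = GCounter()
        for region in self.counts.keys() | other.counts.keys():
            merged.counts[region] = max(
                self.counts.get(region, 0), other.counts.get(region, 0)
            )
        return merged

    def value(self):
        return sum(self.counts.values())

# Both regions update concurrently, then exchange state.
us, eu = GCounter(), GCounter()
us.increment("us-east", 3)
eu.increment("eu-west", 2)
print(us.merge(eu).value(), eu.merge(us).value())

print(lww_merge((10, "us-east", "draft"), (12, "eu-west", "final")))
```

The counter never loses an update because merging is commutative and idempotent, whereas last-write-wins is simpler but discards one of the concurrent writes.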

Real-World: CockroachDB

CockroachDB is designed for multi-region active-active from the ground up. It uses the Raft consensus protocol to replicate data across regions, with configurable replica placement and survival goals. Writes to a local region are fast; cross-region writes incur consensus latency. Tables can be pinned to specific regions for data residency.

Failover Testing

An untested disaster recovery plan should be assumed not to work until a test proves otherwise.

Types of Tests

Tabletop exercise: Walk through the DR plan on paper. Identify gaps, missing steps, and unclear responsibilities. Low cost, no risk.

Backup restoration test: Regularly restore backups to a separate environment and verify data integrity. Confirm that the restore process actually works and meets the RTO.

Failover drill: Actually fail over to the DR region. Verify that traffic is served correctly, data is intact, and the system meets its RPO/RTO targets.

Quarterly failover drill:
  1. Announce the drill (stakeholders informed)
  2. Simulate primary region failure (stop routing traffic)
  3. Execute failover runbook
  4. Verify system health in DR region
  5. Run test traffic and compare results
  6. Fail back to primary region
  7. Post-drill review: what went well, what didn't

Game day: A broader exercise that simulates a real disaster scenario, including communication, escalation, and coordination across teams. Amazon popularized this concept with its GameDay program.
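
The backup restoration test described above can be automated in outline: run the restore, time it, and compare a checksum of the restored data against one recorded at backup time. This is a sketch under assumptions: `verify_restore` and `sha256_of` are hypothetical helpers, and the `restore` callable stands in for whatever the real restore command is.

```python
import hashlib
import time

def sha256_of(path):
    """Checksum a file in 1 MiB chunks so large backups do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(restore, restored_path, expected_sha256, rto_seconds):
    """Run `restore()`, then report integrity and timing against the RTO."""
    started = time.monotonic()
    restore()
    elapsed = time.monotonic() - started
    return {
        "intact": sha256_of(restored_path) == expected_sha256,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }
```

A checksum only proves the bytes match; for a database, a follow-up query against the restored instance (row counts, recent records) is a stronger integrity check.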

Testing Frequency

Backup restoration: monthly
Tabletop exercise: quarterly
Failover drill: quarterly or semi-annually
Full game day: annually

Chaos Engineering

Chaos engineering deliberately injects failures into a running system, ultimately in production, to verify that the system handles them correctly.

Principles

Define "steady state" — the normal behavior of the system (latency, error rate, throughput).
Hypothesize that steady state continues during the experiment.
Introduce a real-world failure (kill a server, inject latency, drop packets).
Observe whether steady state is maintained.
If not, fix the weakness.
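
The principles above form a loop that can be sketched in a few lines: check steady state, inject the failure, observe, and always roll back. Everything here is illustrative. `SteadyState`, `run_experiment`, and the metric names are assumptions, and the "system" is a dictionary standing in for real monitoring.

```python
from dataclasses import dataclass

@dataclass
class SteadyState:
    max_error_rate: float
    max_p99_latency_ms: float

    def holds(self, metrics):
        return (
            metrics["error_rate"] <= self.max_error_rate
            and metrics["p99_latency_ms"] <= self.max_p99_latency_ms
        )

def run_experiment(steady, read_metrics, inject, rollback):
    """Inject a failure and report whether steady state survived it."""
    if not steady.holds(read_metrics()):
        return "skipped: system not in steady state"
    inject()
    try:
        during = read_metrics()
    finally:
        rollback()  # always undo the failure, even if observation fails
    return "hypothesis held" if steady.holds(during) else "weakness found"

# Simulated system (assumption): injecting latency pushes p99 past the limit.
state = {"error_rate": 0.001, "p99_latency_ms": 120.0}

def inject_latency():
    state["p99_latency_ms"] = 900.0

def remove_latency():
    state["p99_latency_ms"] = 120.0

outcome = run_experiment(
    SteadyState(max_error_rate=0.01, max_p99_latency_ms=300.0),
    lambda: dict(state),
    inject_latency,
    remove_latency,
)
print(outcome)
```

Refusing to start when the system is already unhealthy keeps an experiment from compounding a real incident.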

Types of Chaos Experiments

Experiment               What it tests
Kill a random instance   Auto-scaling, load balancer health checks
Inject network latency   Timeouts, circuit breakers, retries
Block external service   Fallbacks, graceful degradation
Fill disk                Disk space alerts, log rotation
CPU stress               Auto-scaling, priority shedding
Kill a database replica  Replication failover, read routing
Simulate AZ failure      Multi-AZ redundancy

Real-World: Netflix Chaos Monkey

Netflix's Chaos Monkey randomly terminates production instances during business hours. This forces every team to build services that survive instance failure. The broader Simian Army included:

Chaos Monkey: Kills instances
Chaos Gorilla: Simulates AZ failures
Latency Monkey: Injects artificial latency
Chaos Kong: Simulates entire region failure

Real-World: Gremlin

Gremlin is a commercial chaos engineering platform that provides a controlled way to run chaos experiments without building custom tooling. It offers a library of attacks (CPU, memory, network, disk, process) with safety controls (automatic rollback, blast radius limits).

Safety Guidelines

Start with non-production environments
Limit blast radius (one service, one AZ, a percentage of traffic)
Have a kill switch to abort the experiment instantly
Run during business hours when engineers are available to respond
Start small and increase scope as confidence grows
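
Two of these guidelines, limiting blast radius and having a kill switch, translate directly into code. The sketch below picks at most a fixed fraction of instances from a single AZ and checks an abort flag before each attack. The helper names and the instance inventory are hypothetical.

```python
import random

def pick_targets(instances, az, max_fraction, rng):
    """Choose at most `max_fraction` of the instances in one AZ.

    instances: mapping of instance id -> AZ name (assumed inventory format).
    """
    pool = sorted(i for i, zone in instances.items() if zone == az)
    limit = int(len(pool) * max_fraction)  # rounds down, so never exceeds the cap
    return rng.sample(pool, limit)

def run_attack(targets, attack, kill_switch):
    """Attack targets one by one, checking the kill switch before each."""
    hit = []
    for target in targets:
        if kill_switch():
            break  # abort immediately; remaining targets are left alone
        attack(target)
        hit.append(target)
    return hit

inventory = {f"a-{n}": "az-a" for n in range(10)}
inventory.update({f"b-{n}": "az-b" for n in range(5)})

targets = pick_targets(inventory, "az-a", max_fraction=0.2, rng=random.Random(7))
terminated = run_attack(targets, attack=lambda t: None, kill_switch=lambda: False)
print(len(targets), all(t.startswith("a-") for t in targets))
```

Rounding the limit down means a small pool may yield no targets at all, which is the safe default when the blast radius cap is tight.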

DR Runbooks

Document every step of the recovery process. During a real disaster, stress and time pressure make it easy to forget steps.

Runbook Contents

1. Detection: How do we know a disaster has occurred?
2. Assessment: What is the scope of impact?
3. Communication: Who to notify (on-call, leadership, customers)?
4. Decision: Who authorizes failover?
5. Execution: Step-by-step failover procedure
6. Validation: How do we verify the DR environment is healthy?
7. Monitoring: What to watch after failover
8. Failback: How to return to the primary region
9. Post-incident: Review and update the plan

Common Pitfalls

Untested backups. A backup that can't be restored is not a backup. Test restores regularly.
RPO/RTO without measurement. If you claim 1-hour RTO but have never timed a failover, you don't actually know your RTO.
Backups in the same region. A regional outage takes out both the primary and the backups. Replicate to another region.
No runbook. During a disaster, people panic. A step-by-step runbook prevents mistakes and saves time.
Failback is often harder than failover. Most teams practice failover but forget that returning to the primary region is at least as complex: data written in the DR region must be replicated back before traffic moves. Test failback too.
DR infrastructure drift. The DR region's configuration slowly diverges from production. Use infrastructure-as-code and automated validation to keep them in sync.
Ignoring data corruption. Replication faithfully copies corrupted data to all replicas. Backups with retention (keep 30 days of daily backups) let you restore to before the corruption.
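
The last pitfall can be made concrete: with a retention window, recovery means finding the newest backup taken before the corruption began. The sketch below assumes one backup per day and treats a backup taken on the corruption date as suspect. The function names and dates are illustrative.

```python
from datetime import date, timedelta

def backups_retained(today, retention_days):
    """Dates of the daily backups still held under the retention policy."""
    return [today - timedelta(days=n) for n in range(retention_days)]

def restore_point(backups, corruption_date):
    """Newest backup strictly before the corruption, or None if none survive."""
    clean = [b for b in backups if b < corruption_date]
    return max(clean) if clean else None

today = date(2024, 3, 31)
kept = backups_retained(today, retention_days=30)

print(restore_point(kept, date(2024, 3, 20)))  # corruption found within the window
print(restore_point(kept, date(2024, 2, 25)))  # corruption older than the retention
```

The second case shows the limit of retention: corruption that goes unnoticed for longer than the window leaves no clean backup, so detection time should inform the retention period.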

Key Takeaways

RPO defines acceptable data loss; RTO defines acceptable downtime. These drive every DR design decision.
Backup strategies range from full daily backups to continuous WAL archiving with PITR. Choose based on RPO.
Multi-region architectures range from pilot light (cheapest, slowest recovery) to active-active (most expensive, fastest recovery).
Test your DR plan regularly. Untested plans fail when you need them most.
Chaos engineering builds confidence by proving resilience in production before a real disaster strikes.
Document everything in a runbook. During a real incident, clear procedures save hours.