Disaster Recovery
Disaster recovery (DR) is the set of policies, tools, and procedures for restoring a system after a catastrophic failure — data center outage, region-wide cloud failure, data corruption, or a severe security breach. The goal is to resume operations within acceptable time and data-loss limits.
Every system needs a DR plan proportional to the cost of downtime. For some businesses, an hour of downtime is an inconvenience. For others, it's millions in lost revenue and regulatory penalties.
RPO & RTO
The two metrics that define every DR plan:
RPO (Recovery Point Objective)
The maximum acceptable amount of data loss, measured in time.
- RPO = 0: No data loss. Requires synchronous replication.
- RPO = 1 hour: Can lose up to 1 hour of data. Hourly backups are sufficient.
- RPO = 24 hours: Daily backups are sufficient.
RTO (Recovery Time Objective)
The maximum acceptable downtime, measured from the moment of failure to full service restoration.
- RTO = 0 (near-zero): Automatic failover to a hot standby. Seconds of downtime.
- RTO = 1 hour: Manual failover to a warm standby, plus validation.
- RTO = 24 hours: Restore from backups, rebuild infrastructure.
The Cost Trade-Off
Lower RPO and RTO cost more. Zero data loss and zero downtime require synchronous replication to a hot standby in a separate region — expensive infrastructure and complex operations.
| RPO / RTO | Approach | Relative cost |
|---|---|---|
| 24h / 24h | Daily backups to S3 | $ |
| 4h / 4h | Frequent backups + warm DR | $$ |
| 1h / 1h | Continuous replication + warm | $$$ |
| ~0 / ~0 | Multi-region active-active | $$$$ |
Choose RPO and RTO based on business impact. Not every system needs active-active multi-region.
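As a rough illustration of that choice, the sketch below picks the cheapest tier from the table above that still satisfies a required RPO and RTO. The `Tier` class and `cheapest_tier` function are hypothetical names, the "~0" row is modeled as exactly 0 hours, and the cost column is reduced to a count of "$" signs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_hours: float
    rto_hours: float
    relative_cost: int  # number of "$" signs in the table


# Transcribed from the table; "~0" is simplified to 0.0 (an assumption).
TIERS = [
    Tier("Daily backups to S3", 24, 24, 1),
    Tier("Frequent backups + warm DR", 4, 4, 2),
    Tier("Continuous replication + warm", 1, 1, 3),
    Tier("Multi-region active-active", 0, 0, 4),
]


def cheapest_tier(required_rpo_hours, required_rto_hours):
    """Return the lowest-cost tier whose RPO and RTO both meet the requirement."""
    candidates = [
        t for t in TIERS
        if t.rpo_hours <= required_rpo_hours and t.rto_hours <= required_rto_hours
    ]
    if not candidates:
        raise ValueError("no tier satisfies the requested RPO/RTO")
    return min(candidates, key=lambda t: t.relative_cost)


print(cheapest_tier(6, 12).name)
```

A system that can tolerate 6 hours of data loss and 12 hours of downtime lands on the second row rather than on active-active, which is the point of the trade-off.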
Backup Strategies
Full Backups
Copy the entire dataset. Simple to restore but large and slow to create.
Monday: Full backup (100 GB)
Tuesday: Full backup (101 GB)
Wednesday: Full backup (102 GB)
Storage cost: ~300 GB for 3 days of backups
Incremental Backups
After the initial full backup, only back up the data that changed since the last backup (full or incremental).
Monday: Full backup (100 GB)
Tuesday: Incremental (2 GB of changes since Monday)
Wednesday: Incremental (3 GB of changes since Tuesday)
Storage cost: ~105 GB for 3 days
Restore: Apply Monday full + Tuesday incremental + Wednesday incremental
Faster backups, less storage, but slower restores (must replay the chain).
Differential Backups
After the initial full backup, back up everything that changed since the last full backup.
Monday: Full backup (100 GB)
Tuesday: Differential (2 GB of changes since Monday)
Wednesday: Differential (5 GB of changes since Monday — includes Tuesday's changes)
Restore: Apply Monday full + Wednesday differential (just two steps)
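Storage cost: ~107 GB for 3 days
A middle ground: restores are faster than incremental (always two steps, however long the chain) and storage is far lower than full backups, though each differential keeps growing until the next full backup.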
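The arithmetic behind the three examples can be checked with a small sketch. It assumes the numbers used above (dataset sizes of 100, 101, and 102 GB, with 2 GB and 3 GB of changed data on the later days) and assumes the changes on different days do not overlap, so a differential is simply the running sum. `backup_plan` is a hypothetical helper, not part of any backup tool.

```python
from itertools import accumulate


def backup_plan(dataset_sizes_gb, daily_changes_gb, strategy):
    """Return (total_storage_gb, restore_steps) for restoring the latest day.

    dataset_sizes_gb: full dataset size per day, starting on the full-backup day.
    daily_changes_gb: data changed on each later day relative to the day before.
    """
    if len(daily_changes_gb) != len(dataset_sizes_gb) - 1:
        raise ValueError("need one change figure per day after the first")
    first_full = dataset_sizes_gb[0]
    if strategy == "full":
        # Every day stores the whole dataset; restore needs only the latest copy.
        return sum(dataset_sizes_gb), 1
    if strategy == "incremental":
        # Each day stores only that day's changes; restore replays the whole chain.
        return first_full + sum(daily_changes_gb), 1 + len(daily_changes_gb)
    if strategy == "differential":
        # Each day stores all changes since the full backup (running sum).
        cumulative = list(accumulate(daily_changes_gb))
        return first_full + sum(cumulative), 2 if cumulative else 1
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("full", "incremental", "differential"):
    print(name, backup_plan([100, 101, 102], [2, 3], name))
```

With these inputs the totals come out to 303 GB, 105 GB, and 107 GB, with 1, 3, and 2 restore steps respectively.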
Point-in-Time Recovery (PITR)
Combine a base backup with continuous WAL (Write-Ahead Log) archiving. Restore to any specific point in time.
Sunday 00:00: Base backup
WAL segments archived continuously
Restore to Wednesday 14:30:
1. Restore Sunday's base backup
2. Replay WAL segments up to Wednesday 14:30
3. Database reflects exact state at that moment
PostgreSQL (via WAL archiving), MySQL (via binary logs), and managed services like AWS RDS support PITR. This is the gold standard for database disaster recovery.
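The replay idea can be shown with a toy model. This is not how any real database implements recovery: it assumes the "database" is a dictionary and each log record is a hypothetical `(timestamp, key, value)` tuple, where a value of `None` means the key was deleted.

```python
from datetime import datetime


def restore_to_point_in_time(base_snapshot, log_records, target_time):
    """Start from the base backup and replay log records up to target_time."""
    state = dict(base_snapshot)  # never mutate the stored backup itself
    for timestamp, key, value in sorted(log_records, key=lambda r: r[0]):
        if timestamp > target_time:
            break  # everything after the target moment is ignored
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


base = {"balance": 100}  # Sunday 00:00 base backup
log = [
    (datetime(2024, 1, 8, 9, 0), "balance", 120),
    (datetime(2024, 1, 10, 14, 0), "balance", 90),
    (datetime(2024, 1, 10, 15, 0), "balance", None),  # accidental delete
]
restored = restore_to_point_in_time(base, log, datetime(2024, 1, 10, 14, 30))
print(restored)
```

Restoring to Wednesday 14:30 stops just before the accidental delete at 15:00, which is exactly the situation PITR is meant for.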
Backup Locations
- Same region, different AZ: Protects against single-AZ failure. Does not protect against regional disaster.
- Different region: Protects against regional outage. Higher storage and transfer costs.
- Different cloud provider: Protects against provider-wide incidents. Highest complexity.
- Offline/air-gapped: Protects against ransomware and compromise. Essential for critical data.
Real-World: GitLab Database Incident (2017)
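In January 2017, a GitLab engineer accidentally deleted the data directory of the primary production database. Five backup and replication mechanisms were in place, but none was working reliably and they had not been tested. Recovery relied on a snapshot roughly six hours old that happened to have been taken manually before the incident, so about six hours of data was lost. Lesson: backup systems must be verified regularly.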
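One cheap layer of verification is an automated check that each backup file exists, is not empty, and matches the checksum recorded when it was written. The sketch below assumes backups are plain files and that the expected SHA-256 digest was stored at backup time; `verify_backup` is a hypothetical helper. It does not replace an actual restore test, since a file can pass all three checks and still fail to restore.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_path, expected_sha256, min_size_bytes=1):
    """Return (ok, reason) for a single backup file."""
    path = Path(backup_path)
    if not path.is_file():
        return False, "backup file is missing"
    if path.stat().st_size < min_size_bytes:
        return False, "backup file is smaller than expected"
    if sha256_of(path) != expected_sha256:
        return False, "checksum mismatch"
    return True, "ok"
```

Running a check like this after every backup job, and alerting on any failure, would catch silently empty or missing backups long before they are needed.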
Multi-Region Architecture
Pilot Light
Minimal DR infrastructure is always running in the DR region: core services (database replicas) but no application servers.
Primary region: Full stack running
DR region: Database replica running, app servers stopped
Failover:
1. Promote the database replica in the DR region to primary
2. Start app servers in DR region (~minutes)
3. Point DNS to DR region
4. Validate and serve traffic
- RTO: 10-30 minutes (depends on startup time)
- RPO: Depends on replication lag (typically seconds with async replication)
- Cost: Low — only database replica is always running
Warm Standby
A scaled-down version of the full environment runs in the DR region.
Primary region: 20 app servers, full database
DR region: 2 app servers, database replica, all services at minimum scale
Failover:
1. Scale up DR app servers
2. Promote database replica to primary
3. Update DNS
4. Validate
- RTO: 5-15 minutes
- RPO: Seconds (async replication)
- Cost: Moderate
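A failover sequence like the ones above can be expressed as an ordered list of steps that stops at the first failure and reports how long it took. This is only a sketch: `run_failover` is a hypothetical helper, and the step callables are placeholders standing in for real calls to a cloud provider, database, and DNS service.

```python
import time


def run_failover(steps, rto_seconds, clock=time.monotonic):
    """Run (name, action) steps in order; stop at the first action returning False."""
    started = clock()
    completed = []
    failed_step = None
    for name, action in steps:
        if not action():
            failed_step = name
            break
        completed.append(name)
    elapsed = clock() - started
    ok = failed_step is None
    return {
        "ok": ok,
        "failed_step": failed_step,
        "completed": completed,
        "elapsed_seconds": elapsed,
        "within_rto": ok and elapsed <= rto_seconds,
    }


# Placeholder actions; each would wrap a real operation in practice.
warm_standby_steps = [
    ("scale up DR app servers", lambda: True),
    ("promote database replica", lambda: True),
    ("update DNS", lambda: True),
    ("validate", lambda: True),
]
report = run_failover(warm_standby_steps, rto_seconds=15 * 60)
print(report["ok"], report["completed"])
```

Recording the elapsed time on every run gives a measured figure to compare against the stated RTO instead of an estimate.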
Active-Active (Multi-Region)
Both regions handle live traffic simultaneously. Each region has a full copy of the data.
US-East: Full stack, serving US users, writing to local DB
EU-West: Full stack, serving EU users, writing to local DB
Cross-region replication keeps both in sync
Region failure:
DNS automatically routes all traffic to the surviving region
No manual intervention needed
- RTO: Seconds to a few minutes (automatic DNS failover, bounded by the health-check interval and DNS TTL)
- RPO: Near-zero (synchronous or fast async replication)
- Cost: High — double infrastructure plus cross-region replication
Active-Active Challenges
- Conflict resolution: Two regions writing to the same record simultaneously. Strategies: last-write-wins, merge functions, conflict-free replicated data types (CRDTs).
- Data residency: GDPR may require EU data to stay in EU. Active-active must respect geographic constraints.
- Latency: Cross-region replication adds latency to writes if synchronous, or introduces consistency lag if asynchronous.
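Two of the conflict-resolution strategies named above can be sketched in a few lines. The example assumes each region stamps writes with a timestamp and its own region name; `LWWRegister`, `merge_gcounters`, and `gcounter_value` are illustrative names, not a real library. The first is last-write-wins, the second is a grow-only counter, one of the simplest CRDTs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: float
    region: str

    def merge(self, other):
        # Later timestamp wins; the region name breaks exact ties so that
        # both regions reach the same answer regardless of merge order.
        return max(self, other, key=lambda r: (r.timestamp, r.region))


def merge_gcounters(a, b):
    """Grow-only counter: keep the per-region maximum, never lose an increment."""
    return {region: max(a.get(region, 0), b.get(region, 0)) for region in a.keys() | b.keys()}


def gcounter_value(counter):
    return sum(counter.values())


us = LWWRegister("alice@old.example", 10.0, "us-east")
eu = LWWRegister("alice@new.example", 12.0, "eu-west")
print(us.merge(eu).value)

views_us = {"us-east": 3, "eu-west": 1}
views_eu = {"us-east": 2, "eu-west": 4}
print(gcounter_value(merge_gcounters(views_us, views_eu)))
```

Last-write-wins silently discards the losing write and depends on reasonably synchronized clocks, while the counter merge keeps every increment; which trade-off is acceptable depends on the data.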
Real-World: CockroachDB
CockroachDB is designed for multi-region active-active deployments from the ground up. It uses the Raft consensus protocol to replicate data across nodes and regions, with configurable replica placement and survival goals (surviving a zone failure or a region failure). Writes that can reach a quorum within the local region are fast; writes that need a cross-region quorum incur consensus latency. Tables can be pinned to specific regions for data residency.
Failover Testing
A disaster recovery plan that hasn't been tested doesn't work. Period.
Types of Tests
Tabletop exercise: Walk through the DR plan on paper. Identify gaps, missing steps, and unclear responsibilities. Low cost, no risk.
Backup restoration test: Regularly restore backups to a separate environment and verify data integrity. Confirm that the restore process actually works and meets the RTO.
Failover drill: Actually fail over to the DR region. Verify that traffic is served correctly, data is intact, and the system meets its RPO/RTO targets.
Quarterly failover drill:
1. Announce the drill (stakeholders informed)
2. Simulate primary region failure (stop routing traffic)
3. Execute failover runbook
4. Verify system health in DR region
5. Run test traffic and compare results
6. Fail back to primary region
7. Post-drill review: what went well, what didn't
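The drill only produces useful numbers if the timestamps are captured. A minimal sketch of turning them into achieved RTO and RPO follows; it assumes three moments are recorded during the drill (when the failure was simulated, when service was fully restored, and the time of the last write confirmed present in the DR region), and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def measure_drill(failure_at, service_restored_at, last_replicated_write_at):
    """Return (achieved_rto, achieved_rpo) as timedeltas."""
    achieved_rto = service_restored_at - failure_at
    # Data written after the last replicated write and before the failure is lost.
    achieved_rpo = max(failure_at - last_replicated_write_at, timedelta(0))
    return achieved_rto, achieved_rpo


def meets_targets(achieved_rto, achieved_rpo, target_rto, target_rpo):
    return achieved_rto <= target_rto and achieved_rpo <= target_rpo


rto, rpo = measure_drill(
    failure_at=datetime(2024, 6, 1, 10, 0, 0),
    service_restored_at=datetime(2024, 6, 1, 10, 12, 0),
    last_replicated_write_at=datetime(2024, 6, 1, 9, 59, 55),
)
print(rto, rpo, meets_targets(rto, rpo, timedelta(minutes=15), timedelta(seconds=30)))
```

Here the drill achieved a 12-minute RTO and a 5-second RPO, which can go straight into the post-drill review.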
Game day: A broader exercise that simulates a real disaster scenario, including communication, escalation, and coordination across teams. Amazon popularized this concept with its internal GameDay exercises in the early 2000s.
Testing Frequency
- Backup restoration: monthly
- Tabletop exercise: quarterly
- Failover drill: quarterly or semi-annually
- Full game day: annually
Chaos Engineering
Chaos engineering proactively injects failures into production to verify that the system handles them correctly.
Principles
- Define "steady state" — the normal behavior of the system (latency, error rate, throughput).
- Hypothesize that steady state continues during the experiment.
- Introduce a real-world failure (kill a server, inject latency, drop packets).
- Observe whether steady state is maintained.
- If not, fix the weakness.
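These principles map onto a small experiment loop. The sketch below assumes steady state is summarized by a single error-rate number and that the caller supplies three hypothetical callables: one to measure the error rate, one to inject the failure, and one to undo it.

```python
def run_experiment(measure_error_rate, inject_failure, rollback, max_error_rate):
    """Check steady state, inject a failure, observe, and always roll back."""
    baseline = measure_error_rate()
    if baseline > max_error_rate:
        # Do not experiment on a system that is already unhealthy.
        return {"ran": False, "reason": "not in steady state before the experiment"}
    inject_failure()
    try:
        during = measure_error_rate()
    finally:
        rollback()  # runs even if the measurement raises
    return {
        "ran": True,
        "baseline": baseline,
        "during": during,
        "steady_state_held": during <= max_error_rate,
    }


# Toy system whose error rate jumps when the failure is injected.
system = {"error_rate": 0.001}
result = run_experiment(
    measure_error_rate=lambda: system["error_rate"],
    inject_failure=lambda: system.update(error_rate=0.2),
    rollback=lambda: system.update(error_rate=0.001),
    max_error_rate=0.01,
)
print(result)
```

In this toy run the hypothesis is disproved (`steady_state_held` is false), which is the signal to go and fix the weakness.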
Types of Chaos Experiments
| Experiment | What it tests |
|---|---|
| Kill a random instance | Auto-scaling, load balancer health checks |
| Inject network latency | Timeouts, circuit breakers, retries |
| Block external service | Fallbacks, graceful degradation |
| Fill disk | Disk space alerts, log rotation |
| CPU stress | Auto-scaling, priority shedding |
| Kill a database replica | Replication failover, read routing |
| Simulate AZ failure | Multi-AZ redundancy |
Real-World: Netflix Chaos Monkey
Netflix's Chaos Monkey randomly terminates production instances during business hours. This forces every team to build services that survive instance failure. The broader Simian Army included:
- Chaos Monkey: Kills instances
- Chaos Gorilla: Simulates AZ failures
- Latency Monkey: Injects artificial latency
- Chaos Kong: Simulates entire region failure
Real-World: Gremlin
Gremlin is a commercial chaos engineering platform that provides a controlled way to run chaos experiments without building custom tooling. It offers a library of attacks (CPU, memory, network, disk, process) with safety controls (automatic rollback, blast radius limits).
Safety Guidelines
- Start with non-production environments
- Limit blast radius (one service, one AZ, a percentage of traffic)
- Have a kill switch to abort the experiment instantly
- Run during business hours when engineers are available to respond
- Start small and increase scope as confidence grows
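Two of these guidelines, limiting blast radius and having a kill switch, can be sketched directly. The example assumes targets are plain instance identifiers and that the kill switch is a callable checked before every attack; `pick_targets` and `run_with_kill_switch` are hypothetical names and do not reflect any particular tool's API.

```python
import math
import random


def pick_targets(instances, max_fraction, rng=None):
    """Choose at most max_fraction of the instances at random (rounded down)."""
    if not 0 <= max_fraction <= 1:
        raise ValueError("max_fraction must be between 0 and 1")
    rng = rng or random.Random()
    limit = math.floor(len(instances) * max_fraction)
    return rng.sample(list(instances), limit)


def run_with_kill_switch(targets, attack, kill_switch_engaged):
    """Attack targets one by one, aborting as soon as the kill switch is on."""
    attacked = []
    for target in targets:
        if kill_switch_engaged():
            break
        attack(target)
        attacked.append(target)
    return attacked


fleet = [f"i-{n:03d}" for n in range(10)]  # made-up instance IDs
targets = pick_targets(fleet, max_fraction=0.2, rng=random.Random(42))
terminated = run_with_kill_switch(targets, attack=lambda t: None, kill_switch_engaged=lambda: False)
print(len(targets), len(terminated))
```

Rounding down means a very small fleet may yield no targets at all, which errs on the side of safety.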
DR Runbooks
Document every step of the recovery process. During a real disaster, stress and time pressure make it easy to forget steps.
Runbook Contents
1. Detection: How do we know a disaster has occurred?
2. Assessment: What is the scope of impact?
3. Communication: Who to notify (on-call, leadership, customers)?
4. Decision: Who authorizes failover?
5. Execution: Step-by-step failover procedure
6. Validation: How do we verify the DR environment is healthy?
7. Monitoring: What to watch after failover
8. Failback: How to return to the primary region
9. Post-incident: Review and update the plan
Common Pitfalls
- Untested backups. A backup that can't be restored is not a backup. Test restores regularly.
- RPO/RTO without measurement. If you claim 1-hour RTO but have never timed a failover, you don't actually know your RTO.
- Backups in the same region. A regional outage takes out both the primary and the backups. Replicate to another region.
- No runbook. During a disaster, people panic. A step-by-step runbook prevents mistakes and saves time.
- Failback is harder than failover. Most teams practice failover but forget that returning to the primary region is at least as complex: data written in the DR region must be replicated back before traffic moves. Test failback too.
- DR infrastructure drift. The DR region's configuration slowly diverges from production. Use infrastructure-as-code and automated validation to keep them in sync.
- Ignoring data corruption. Replication faithfully copies corrupted data to all replicas. Backups with retention (keep 30 days of daily backups) let you restore to before the corruption.
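The retention point in the last item can be made concrete: recovery from corruption means finding the newest backup taken before the corruption began, and that only works if retention reaches back far enough. The sketch assumes backup times are known `datetime` values and that the start of the corruption has been identified; `latest_clean_backup` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def latest_clean_backup(backup_times, corruption_started_at):
    """Return the newest backup taken strictly before the corruption, or None."""
    clean = [t for t in backup_times if t < corruption_started_at]
    return max(clean) if clean else None


today = datetime(2024, 3, 31)
corruption_started = datetime(2024, 3, 21, 12, 0)  # found ten days late

# Daily midnight backups with 30-day and 7-day retention.
retained_30_days = [today - timedelta(days=d) for d in range(1, 31)]
retained_7_days = [today - timedelta(days=d) for d in range(1, 8)]

print(latest_clean_backup(retained_30_days, corruption_started))
print(latest_clean_backup(retained_7_days, corruption_started))
```

With 30 days of retention there is a clean backup from the morning of March 21; with 7 days, every retained backup already contains the corruption.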
Key Takeaways
- RPO defines acceptable data loss; RTO defines acceptable downtime. These drive every DR design decision.
- Backup strategies range from full daily backups to continuous WAL archiving with PITR. Choose based on RPO.
- Multi-region architectures range from pilot light (cheapest, slowest recovery) to active-active (most expensive, fastest recovery).
- Test your DR plan regularly. Untested plans fail when you need them most.
- Chaos engineering builds confidence by proving resilience in production before a real disaster strikes.
- Document everything in a runbook. During a real incident, clear procedures save hours.