Disaster Recovery Planning

When Your Primary AZ Goes Down at 3 AM

Your phone buzzes. The on-call alert says the primary database is unreachable. You check the cloud provider status page: an entire availability zone is experiencing a network partition. Your database is in that zone. Every request to your application is failing. Customers are seeing errors. Revenue is dropping by the minute.

What you do in the next 30 minutes depends entirely on what you planned in the last 30 days. Disaster recovery is not something you figure out during the disaster. It is a plan you write, test, and drill so that when the inevitable happens, the response is mechanical rather than panicked.

RTO & RPO: The Two Numbers That Define Your DR Strategy

Every disaster recovery strategy is defined by two metrics:

RPO (Recovery Point Objective): How much data can you afford to lose? If your RPO is 1 hour, you accept that up to 1 hour of transactions might be gone after recovery. If your RPO is zero, you need synchronous replication.

RTO (Recovery Time Objective): How fast must you recover? If your RTO is 5 minutes, you need automated failover with a hot standby. If your RTO is 4 hours, restoring from a backup might be acceptable.

RPO	Meaning	Required Technology
0	No data loss tolerated	Synchronous replication
Seconds	Minimal data loss	Streaming async replication
Minutes	Some loss acceptable	WAL archiving
Hours	Significant loss tolerable	Periodic pg_dump/pg_basebackup

RTO	Meaning	Required Technology
Seconds	Near-zero downtime	Automated failover with hot standby
Minutes	Brief outage acceptable	Manual failover with warm standby
Hours	Extended outage tolerable	Restore from backup
Days	Non-critical system	Cold standby or rebuild

Your business requirements determine these numbers. A payment processing system might need RPO=0, RTO=30 seconds. An internal analytics dashboard might tolerate RPO=24 hours, RTO=4 hours.
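The two tables can be read as a lookup from a target to the cheapest technology that meets it. Here is a minimal sketch of that lookup. The function names and the exact cut-offs (one minute, one hour, one day) are assumptions made for illustration, since the tables only give orders of magnitude.

```python
from datetime import timedelta

# Cut-offs below are illustrative assumptions, not hard limits.

def replication_for_rpo(rpo: timedelta) -> str:
    """Map an RPO target to the replication approach from the RPO table."""
    if rpo <= timedelta(0):
        return "synchronous replication"
    if rpo < timedelta(minutes=1):
        return "streaming async replication"
    if rpo < timedelta(hours=1):
        return "WAL archiving"
    return "periodic pg_dump/pg_basebackup"


def recovery_for_rto(rto: timedelta) -> str:
    """Map an RTO target to the recovery approach from the RTO table."""
    if rto < timedelta(minutes=1):
        return "automated failover with hot standby"
    if rto < timedelta(hours=1):
        return "manual failover with warm standby"
    if rto < timedelta(days=1):
        return "restore from backup"
    return "cold standby or rebuild"


# The two systems mentioned above.
payments = (replication_for_rpo(timedelta(0)),
            recovery_for_rto(timedelta(seconds=30)))
analytics = (replication_for_rpo(timedelta(hours=24)),
             recovery_for_rto(timedelta(hours=4)))
print(payments)
print(analytics)
```

The point of writing it down this way is that the business picks the inputs and the technology falls out of them, not the other way around.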

DR Tiers: Cold, Warm & Hot Standby

Cold Standby

A cold standby is a server that is provisioned but not running. Recovery means: start the server (or provision a new one), restore a base backup, replay archived WAL, and start Postgres. This is the cheapest option and the slowest.

# Cold standby recovery (manual process)
# 1. Provision new server (or start the standby VM)
# 2. Restore base backup
cp -r /backups/base/latest /var/lib/postgresql/16/main

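# 3. Configure recovery (append to the postgresql.conf your server actually reads)
touch /var/lib/postgresql/16/main/recovery.signal
echo "restore_command = 'wal-g wal-fetch %f %p'" >> postgresql.conf
# Do not set recovery_target = 'immediate' here: it stops recovery at the end
# of the base backup and skips all later WAL. With no recovery target,
# Postgres replays every archived WAL segment and then promotes.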

# 4. Start Postgres
sudo systemctl start postgresql

RTO: hours. RPO: depends on backup frequency and WAL archiving.

Warm Standby

A warm standby continuously receives WAL but does not accept connections. It is almost up to date and can be promoted quickly.

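# On the standby: postgresql.conf (plus an empty standby.signal file in the data directory)
hot_standby = off    # default is on; off makes this a warm standby that refuses connections
primary_conninfo = 'host=primary.db.internal port=5432 user=replication_user'
restore_command = 'wal-g wal-fetch %f %p'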

# Promote the warm standby
pg_ctl promote -D /var/lib/postgresql/16/main

# Or via SQL, but only on a standby that accepts connections (hot_standby = on)
SELECT pg_promote();

RTO: minutes. RPO: seconds (the lag between primary and standby).

Hot Standby

A hot standby receives WAL continuously and accepts read-only queries. It is functionally identical to the primary except it cannot accept writes. This is the standard for production systems that need high availability.

# On the standby: postgresql.conf (plus an empty standby.signal file in the data directory)
hot_standby = on
primary_conninfo = 'host=primary.db.internal port=5432 user=replication_user'

-- On the primary: check replication lag
SELECT client_addr,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

RTO: seconds to minutes. RPO: seconds (async) or zero (sync).
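It helps to know what the byte lag actually is. An LSN such as 0/3000148 is a 64-bit position in the WAL stream, printed as two hexadecimal halves, and pg_wal_lsn_diff subtracts two of those positions. The sketch below reproduces that arithmetic outside the database, which is handy when you only have LSNs from logs or monitoring exports. The function names are made up for this example.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN ('high/low', both hex) to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff_bytes(ahead: str, behind: str) -> int:
    """Bytes of WAL between two LSNs, like pg_wal_lsn_diff(ahead, behind)."""
    return lsn_to_int(ahead) - lsn_to_int(behind)


# Example LSNs invented for illustration.
lag = lsn_diff_bytes("0/3000148", "0/3000060")
print(lag)  # 232
```

Bytes of lag are not seconds of lag. A replica that is 232 bytes behind on an idle system may be minutes behind in wall-clock terms, so track a time-based measure alongside the byte count.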

Cross-Region Replicas

A standby in the same data center protects against server failure but not against data center failure. Cross-region replicas protect against regional outages.

# Cross-region async replica configuration
# Higher latency means more replication lag
primary_conninfo = 'host=primary.us-east.db port=5432 user=repl sslmode=require'

Cross-region replication adds latency. Synchronous cross-region replication is usually impractical because every write waits for a round trip across regions (50-200 ms). Async cross-region replication means you might lose the last few seconds of transactions during a regional failover.

-- Monitor cross-region replication lag
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication
WHERE application_name = 'us-west-replica';

The trade-off is clear: synchronous cross-region replication gives you RPO=0 but degrades write performance. Async cross-region replication preserves performance but accepts a small data loss window.
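To put a number on "degrades write performance": if every commit waits for one cross-region round trip, a single session that commits one transaction after another cannot exceed one commit per round trip. The sketch below is back-of-the-envelope arithmetic under that assumption. It ignores local fsync time and says nothing about aggregate throughput, which scales with the number of concurrent sessions.

```python
def max_serial_commits_per_second(round_trip_ms: float) -> float:
    """Upper bound on commits/second for ONE session committing sequentially,
    assuming each commit waits for a full round trip to the sync standby."""
    if round_trip_ms <= 0:
        raise ValueError("round_trip_ms must be positive")
    return 1000.0 / round_trip_ms


# The 50-200 ms range quoted above.
for rtt_ms in (50, 200):
    bound = max_serial_commits_per_second(rtt_ms)
    print(f"{rtt_ms} ms round trip -> at most {bound:.0f} commits/s per session")
```

A batch job that commits row by row feels this bound directly. Workloads with many independent sessions are hurt less, but every individual write still pays the added latency.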

Automated Failover vs Manual

Manual Failover

An operator decides when to fail over, verifies the standby is healthy, and promotes it. Manual failover is simpler and, because a human confirms the primary is really down, less likely to cause split-brain from a false alarm. The cost is that your RTO depends on how fast a human responds.

# Manual failover runbook (simplified)
# 1. Verify primary is actually down (not just a network blip)
pg_isready -h primary.db.internal -p 5432

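# 2. Check how much WAL the standby has received and replayed
psql -h standby.db.internal -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"

# 3. Promote standby (first make sure the old primary cannot come back and accept writes)
psql -h standby.db.internal -c "SELECT pg_promote();"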

# 4. Update DNS or connection string to point to new primary
# 5. Verify application connectivity
# 6. Notify the team
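Step 1 of the runbook ("not just a network blip") is worth making precise. One common approach is to require several consecutive failed checks, spaced a few seconds apart, before treating the primary as down. A minimal sketch follows. The function names, the three attempts, and the five-second spacing are assumptions for illustration. The only external call is pg_isready, whose exit status is 0 when the server is accepting connections.

```python
import subprocess
import time
from typing import Callable


def pg_isready_check(host: str, port: int = 5432) -> bool:
    """True if pg_isready reports the server is accepting connections."""
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port)],
        capture_output=True,
    )
    return result.returncode == 0


def primary_is_down(check: Callable[[], bool],
                    attempts: int = 3,
                    delay_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if every one of `attempts` checks fails.

    A single successful check means the failure was transient.
    """
    for attempt in range(attempts):
        if check():
            return False
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before retrying; no wait after the last try
    return True


# Usage (hostname taken from the runbook above):
# if primary_is_down(lambda: pg_isready_check("primary.db.internal")):
#     print("Primary unreachable on all attempts; continue with step 2")
```

Run the check from more than one network location if you can. A primary that is unreachable from your laptop but reachable from the application servers is not down.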

Automated Failover

Tools like Patroni, repmgr, or pg_auto_failover detect primary failure and promote a standby automatically. Automated failover achieves lower RTO but risks false positives: a network blip or an overloaded primary can trigger a failover that was not needed.

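# Patroni configuration fragment (etcd-based consensus)
# Omitted for brevity: restapi, postgresql.connect_address, postgresql.authentication
scope: myapp-cluster
name: node1

postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/16/main

etcd:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # bytes (1 MiB)
    # Cluster-wide settings such as wal_level and max_wal_senders belong in the
    # DCS configuration; Patroni ignores local overrides for them.
    postgresql:
      parameters:
        wal_level: replica
        max_wal_senders: 5
        hot_standby: on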

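Patroni uses a distributed configuration store (etcd, Consul, or ZooKeeper) to agree on which node is the leader. If the leader stops renewing its lock and the lock expires (after ttl seconds), Patroni promotes the most up-to-date healthy standby, skipping any standby that is more than maximum_lag_on_failover bytes behind. It is one of the most widely used tools for automated PostgreSQL failover.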
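The selection rule is easier to reason about in code. The sketch below is a simplified model of "promote the most up-to-date standby within the lag limit", not Patroni's actual implementation, which also involves a race for the leader lock and per-node health checks. The Standby class and pick_candidate function are invented for this example, and positions are plain byte offsets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Standby:
    name: str
    replay_position: int  # WAL position in bytes


def pick_candidate(last_leader_position: int,
                   standbys: List[Standby],
                   maximum_lag_on_failover: int = 1048576) -> Optional[Standby]:
    """Return the most advanced standby whose lag is within the limit,
    or None if every standby is too far behind to promote safely."""
    eligible = [
        s for s in standbys
        if last_leader_position - s.replay_position <= maximum_lag_on_failover
    ]
    return max(eligible, key=lambda s: s.replay_position, default=None)


# Illustrative positions: node2 is 1,000 bytes behind, node3 is 2,000,000.
candidates = [Standby("node2", 9_999_000), Standby("node3", 8_000_000)]
chosen = pick_candidate(10_000_000, candidates)
print(chosen.name if chosen else "no eligible standby")
```

The None case matters. With a tight lag limit and a lagging replica, automated failover can decline to promote anything, and the outage then lasts until a human intervenes. Pick maximum_lag_on_failover with your RPO in mind.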

# Check Patroni cluster status
patronictl -c /etc/patroni.yml list

+ Cluster: myapp-cluster ----+---------+---------+----+-----------+
| Member | Host              | Role    | State   | TL | Lag in MB |
+--------+-------------------+---------+---------+----+-----------+
| node1  | 10.0.1.10         | Leader  | running |  3 |           |
| node2  | 10.0.1.11         | Replica | running |  3 |         0 |
| node3  | 10.0.2.10         | Replica | running |  3 |         0 |
+--------+-------------------+---------+---------+----+-----------+

The Disaster Recovery Runbook

A DR runbook is a step-by-step document that anyone on the team can follow during an outage. It removes decision-making from the crisis. Here is the structure:

Detection - How do you know there is a problem? Alert thresholds, escalation paths.
Assessment - Is this a real outage or a monitoring false positive? How to verify.
Communication - Who to notify, what status page to update, what Slack channel to use.
Failover decision - Criteria for initiating failover. Who has authority to approve.
Failover execution - Exact commands to run, in order, with expected outputs.
Verification - How to confirm the failover succeeded. Which queries to run.
Application cutover - How to point the application to the new primary.
Post-incident - How to rebuild the old primary as a new standby. Timeline for postmortem.

# Example: Failover verification queries
# Run these after promoting the standby

# Verify the server is accepting writes
psql -c "CREATE TEMP TABLE failover_test (id int); DROP TABLE failover_test;"

# Check the timeline advanced (new timeline = successful promotion).
# The control file reports the new timeline only after the first checkpoint following promotion.
psql -c "SELECT pg_current_wal_lsn(), timeline_id FROM pg_control_checkpoint();"

# Verify application tables are accessible
psql -c "SELECT count(*) FROM orders WHERE created_at > now() - interval '1 hour';"

Quarterly DR Drills

A DR plan that has never been executed is a collection of assumptions. Quarterly drills validate:

The runbook is accurate and complete
Team members can execute it under pressure
Failover time meets your RTO target
Data loss stays within your RPO target
The application handles the cutover gracefully

# DR drill checklist
# 1. Schedule the drill (business hours, with stakeholder awareness)
# 2. Simulate primary failure (stop Postgres, block network, etc.)
# 3. Start the clock (measure RTO)
# 4. Execute the runbook step by step
# 5. Verify recovery (check data, run application smoke tests)
# 6. Record actual RTO and RPO
# 7. Rebuild the original primary as a standby
# 8. Document what went wrong and update the runbook

Track drill results over time. Your RTO should improve or stay stable. If it is getting worse, your infrastructure or team familiarity is degrading.
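Tracking can be as simple as a list of measurements and a check that runs after every drill. The sketch below flags two things: a drill that was slower than the previous one, and a latest drill that misses its target. The record layout, function name, and sample numbers are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrillResult:
    quarter: str
    rto_seconds: float  # measured time from simulated failure to verified recovery
    rpo_seconds: float  # measured data loss window


def review_drills(results: List[DrillResult],
                  rto_target_s: float,
                  rpo_target_s: float) -> List[str]:
    """Return findings to discuss; an empty list means nothing to flag."""
    findings = []
    # Compare each drill with the one before it to catch a worsening trend.
    for prev, cur in zip(results, results[1:]):
        if cur.rto_seconds > prev.rto_seconds:
            findings.append(
                f"{cur.quarter}: RTO regressed from "
                f"{prev.rto_seconds:.0f}s to {cur.rto_seconds:.0f}s"
            )
    if results:
        latest = results[-1]
        if latest.rto_seconds > rto_target_s:
            findings.append(
                f"{latest.quarter}: RTO {latest.rto_seconds:.0f}s "
                f"exceeds target {rto_target_s:.0f}s"
            )
        if latest.rpo_seconds > rpo_target_s:
            findings.append(
                f"{latest.quarter}: RPO {latest.rpo_seconds:.0f}s "
                f"exceeds target {rpo_target_s:.0f}s"
            )
    return findings


# Made-up drill history, for illustration only.
history = [
    DrillResult("Q1", 240, 5),
    DrillResult("Q2", 180, 4),
    DrillResult("Q3", 300, 6),
]
findings = review_drills(history, rto_target_s=270, rpo_target_s=10)
for line in findings:
    print(line)
```

A regression is not automatically a failure. It is a prompt to ask what changed since the last drill: new infrastructure, a stale runbook step, or someone running the procedure for the first time.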

Common Pitfalls

Having a plan but never testing it. A DR plan is a hypothesis until it is tested. Quarterly drills convert it into knowledge.
Assuming automated failover means no manual process. Automated failover handles the database promotion. You still need to handle DNS changes, application reconnection, customer communication, and post-incident recovery.
Ignoring replication lag in RPO calculations. Your RPO is not zero just because you have a replica. If the replica is 30 seconds behind when the primary fails, you lose 30 seconds of transactions, so the best RPO you can actually meet is 30 seconds.
No cross-zone or cross-region strategy. A primary and replica in the same availability zone both go down in a zone failure. Put a replica in a different zone at minimum; cross-region replication is what protects you against a regional outage.
Split-brain scenarios. If both the old primary and the promoted standby accept writes simultaneously, you have conflicting data. Fencing the old primary (STONITH) is critical.
DR runbook is outdated. Infrastructure changes. IP addresses change. Hostnames change. Review and update the runbook quarterly, not just after drills.
Underestimating DNS propagation. Changing a DNS record does not instantly redirect traffic. TTLs, caching, and client-side DNS caching can delay the cutover. Use low TTLs for database endpoints.
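Several of these pitfalls come down to the same arithmetic: the time your users experience is the sum of every stage, not just the database promotion. Here is a rough sketch of that budget. The additive model and the sample durations are assumptions for illustration, and clients that cache DNS beyond the TTL can take longer than this estimate.

```python
def worst_case_cutover_seconds(detection_s: float,
                               promotion_s: float,
                               dns_ttl_s: float,
                               app_reconnect_s: float = 0.0) -> float:
    """Rough worst case: each stage finishes before the next one helps,
    and a client may have cached the old DNS answer for the full TTL."""
    return detection_s + promotion_s + dns_ttl_s + app_reconnect_s


def meets_rto(rto_target_s: float, **stages: float) -> bool:
    """True if the worst-case cutover estimate fits within the RTO target."""
    return worst_case_cutover_seconds(**stages) <= rto_target_s


# Same failover, two DNS TTLs (numbers invented for illustration).
slow = worst_case_cutover_seconds(detection_s=30, promotion_s=10, dns_ttl_s=300)
fast = worst_case_cutover_seconds(detection_s=30, promotion_s=10, dns_ttl_s=15)
print(slow, fast)  # 340 55
print(meets_rto(60, detection_s=30, promotion_s=10, dns_ttl_s=300))  # False
print(meets_rto(60, detection_s=30, promotion_s=10, dns_ttl_s=15))   # True
```

With a 300-second TTL the database is ready long before the clients find it. Measuring each stage separately during drills shows where the time actually goes.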

Key Takeaways

RPO and RTO define your DR strategy. Know your numbers before choosing technology.
Cold standby is cheap but slow. Hot standby with automated failover is expensive but meets aggressive RTO targets. Choose based on business requirements.
Cross-region replicas protect against data center and regional failures. Accept the async replication trade-off or pay the synchronous latency penalty.
Automated failover (Patroni, pg_auto_failover) is the usual choice for production PostgreSQL when the RTO is measured in seconds. Manual failover is reasonable when an outage of minutes is acceptable.
A disaster recovery runbook must be written, maintained, and drilled quarterly. If the team cannot follow it under stress without improvising, it is not ready.
The worst time to write a DR plan is during a disaster. The second worst time is tomorrow.