High Availability Patterns
Replication gives you a copy of your data. High availability gives you automated failover — when the primary goes down, a replica takes over and the application continues with minimal downtime.
The Failover Process
Every HA solution follows the same basic steps:
- Detect failure: determine that the primary is actually down, not just slow.
- Promote a replica: one replica becomes the new primary and starts accepting writes.
- Update connections: applications and other replicas connect to the new primary.
- Prevent split-brain: ensure the old primary cannot accept writes if it comes back online.
The complexity is in getting each step right under real-world failure conditions.
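The first step, telling "down" apart from "slow", is commonly handled by requiring several consecutive failed checks before acting. The sketch below is a minimal illustration of that idea, not how any particular HA tool implements it; the class name and the `fall`/`rise` thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Flip state only after enough consecutive contradicting health checks."""
    fall: int = 3          # consecutive failures before marking the node down
    rise: int = 2          # consecutive successes before marking it up again
    healthy: bool = True
    _streak: int = 0       # checks in a row that contradict the current state

    def observe(self, check_ok: bool) -> bool:
        """Record one health-check result and return the current verdict."""
        if check_ok == self.healthy:
            self._streak = 0              # result agrees with the current state
            return self.healthy
        self._streak += 1
        threshold = self.rise if check_ok else self.fall
        if self._streak >= threshold:
            self.healthy = check_ok       # enough evidence: change state
            self._streak = 0
        return self.healthy


detector = FailureDetector()
results = [detector.observe(ok) for ok in (False, False, True, False, False, False)]
print(results)  # a single slow check does not trigger failover
```

A lower `fall` value shortens detection time but makes false positives more likely, which is the trade-off every failure detector has to pick.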
Patroni
Patroni is the most widely used open-source HA solution for PostgreSQL. It manages the PostgreSQL lifecycle and coordinates failover through a distributed consensus store (etcd, ZooKeeper, or Consul).
How Patroni Works
- Each PostgreSQL node runs a Patroni agent.
- Patroni uses etcd (or similar) as a distributed key-value store for leader election.
- The leader holds a lock in etcd. If the leader fails to renew the lock, another node acquires it and promotes its PostgreSQL instance.
- Patroni handles replica creation, configuration management, and switchover (planned failover).
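The leader lock can be pictured as a lease with an expiry time. The following is a toy in-memory model of that behaviour, written only to illustrate the idea; `LeaderLock` and its method are hypothetical names and do not reflect the etcd or Patroni APIs.

```python
from typing import Optional


class LeaderLock:
    """In-memory stand-in for a leader key with a TTL (not a real DCS client)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Take the lock if it is free or expired, or renew it if `node` holds it."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


lock = LeaderLock(ttl=30)
print(lock.acquire_or_renew("node1", now=0))    # True: node1 becomes leader
print(lock.acquire_or_renew("node2", now=10))   # False: lock is held
print(lock.acquire_or_renew("node1", now=10))   # True: renewal, now expires at 40
# node1 crashes here and stops renewing
print(lock.acquire_or_renew("node2", now=39))   # False: lease has not expired yet
print(lock.acquire_or_renew("node2", now=40))   # True: node2 may promote itself
print(lock.acquire_or_renew("node1", now=45))   # False: old leader must rejoin as a replica
```

In a real deployment the compare-and-set and the expiry are enforced by the consensus store, which is what makes the lock safe across nodes.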
Patroni Configuration
# patroni.yml
scope: my-cluster
name: node1
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008
etcd3:
  hosts: 10.0.2.1:2379,10.0.2.2:2379,10.0.2.3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5
        hot_standby: on
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    superuser:
      username: postgres
      password: secret
    replication:
      username: replicator
      password: secret
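The timing values in this file are not independent. The Patroni documentation states the rule loop_wait + 2 * retry_timeout <= ttl, and maximum_lag_on_failover excludes replicas that are too far behind from leader election. The helper names below are hypothetical; this is only a quick sanity check of the numbers, not part of Patroni.

```python
def timing_is_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the documented rule: loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


def may_be_promoted(lag_bytes: int, maximum_lag_on_failover: int) -> bool:
    """A replica lagging by more than maximum_lag_on_failover bytes is skipped."""
    return lag_bytes <= maximum_lag_on_failover


# Values from the configuration above: 10 + 2 * 10 <= 30
print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10))   # True
# Lowering ttl alone breaks the rule: 10 + 2 * 10 > 20
print(timing_is_consistent(ttl=20, loop_wait=10, retry_timeout=10))   # False
# A replica 2 MB behind exceeds the 1 MB limit and is not a candidate
print(may_be_promoted(lag_bytes=2 * 1024 * 1024, maximum_lag_on_failover=1048576))  # False
```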
Patroni Operations
# Check cluster status
patronictl -c /etc/patroni.yml list
+--------+----------+---------+---------+----+-----------+
| Member | Host     | Role    | State   | TL | Lag in MB |
+--------+----------+---------+---------+----+-----------+
| node1  | 10.0.1.1 | Leader  | running |  3 |           |
| node2  | 10.0.1.2 | Replica | running |  3 |         0 |
| node3  | 10.0.1.3 | Replica | running |  3 |         0 |
+--------+----------+---------+---------+----+-----------+
# Planned switchover (graceful); releases before Patroni 3.0 spell the option --master
patronictl -c /etc/patroni.yml switchover --leader node1 --candidate node2
# Emergency failover
patronictl -c /etc/patroni.yml failover
# Reinitialize a failed node
patronictl -c /etc/patroni.yml reinit my-cluster node1
pg_rewind
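When a former primary comes back online, it may have WAL that diverged from the new primary. pg_rewind resynchronizes the old primary's data directory with the new primary by copying only the blocks that changed after the point of divergence, so no full base backup is needed. It requires either wal_log_hints = on or data checksums on the cluster.
With use_pg_rewind: true, Patroni runs pg_rewind automatically when the old primary rejoins.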
pg_auto_failover
A lighter-weight alternative from Citus/Microsoft. It uses a dedicated monitor node instead of an external consensus store.
Architecture
- One monitor node runs the pg_autoctl monitor.
- Each PostgreSQL node runs pg_autoctl.
- The monitor tracks node health and orchestrates failover.
Setup
# On the monitor node
pg_autoctl create monitor \
  --pgdata /var/lib/postgresql/monitor \
  --pgport 5000
# On the primary (autoctl_node is the default user that nodes use to reach the monitor)
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/16/main \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover
# On the replica
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/16/main \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover
Monitoring
pg_autoctl show state
 Name  | Node  | Host:Port      | TLI: LSN       | Connection   | State
-------+-------+----------------+----------------+--------------+-----------
 node1 | 1     | 10.0.1.1:5432  | 3: 0/5000060   | read-write   | primary
 node2 | 2     | 10.0.1.2:5432  | 3: 0/5000060   | read-only    | secondary
The trade-off: pg_auto_failover is simpler to set up than Patroni but the monitor is a single point of failure (though it can be made HA itself).
Connection Routing
After failover, applications need to connect to the new primary. There are several approaches.
HAProxy
HAProxy sits in front of PostgreSQL and routes connections based on health checks.
# haproxy.cfg
listen postgres-primary
    bind *:5432
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    # on-marked-down shutdown-sessions drops client connections to a node that is no longer primary
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 10.0.1.1:5432 check port 8008
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008
listen postgres-replicas
    bind *:5433
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008
The health check endpoint (port 8008) is the Patroni REST API. It returns 200 for the primary on /primary and 200 for replicas on /replica.
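The same check can be run by hand or from a monitoring script. The sketch below assumes only what the text states, that the endpoint answers 200 when the node has the requested role; the function name is hypothetical and any non-200 response or connection error is treated as "not this role".

```python
import urllib.error
import urllib.request


def patroni_role_check(base_url: str, role: str = "primary", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/{role} answers 200, the same test HAProxy applies."""
    try:
        with urllib.request.urlopen(f"{base_url}/{role}", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses raise HTTPError (a URLError subclass);
        # refused connections and timeouts also end up here.
        return False
```

For example, `patroni_role_check("http://10.0.1.1:8008")` would be expected to return True only while node1 is the leader.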
PgBouncer with Patroni
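Patroni does not configure PgBouncer itself, but its callback hooks (for example on_role_change) can run a script that points PgBouncer at the new primary and reloads it on failover.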
libpq Multi-Host Connection Strings
PostgreSQL's client library supports listing multiple hosts:
postgresql://host1:5432,host2:5432,host3:5432/mydb?target_session_attrs=read-write
The client tries each host in order and connects to the one that accepts read-write sessions (the primary). This works without any proxy but requires client-side retry logic on failover.
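A minimal sketch of the client-side retry logic mentioned above. Both helper names are hypothetical, and the retry wrapper is driver-agnostic: it takes any zero-argument callable that opens a connection, so no specific PostgreSQL driver is assumed.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def build_dsn(hosts: Sequence[str], dbname: str, port: int = 5432) -> str:
    """Build a libpq multi-host URI that only accepts a read-write server."""
    host_list = ",".join(f"{h}:{port}" for h in hosts)
    return f"postgresql://{host_list}/{dbname}?target_session_attrs=read-write"


def connect_with_retry(connect: Callable[[], T], attempts: int = 5,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return connect()
        except Exception:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")


print(build_dsn(["host1", "host2", "host3"], "mydb"))
```

With a libpq-based driver such as psycopg2 (an assumption, not something the text prescribes), the call would look like `connect_with_retry(lambda: psycopg2.connect(build_dsn(hosts, "mydb")))`. The total backoff should comfortably exceed the expected failover time, otherwise the client gives up before a new primary exists.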
Split-Brain Prevention
Split-brain occurs when two nodes both think they are the primary and accept writes. This causes data divergence that is extremely difficult to resolve.
How Patroni Prevents Split-Brain
- The leader lock in etcd has a TTL. If the primary cannot renew the lock (e.g., network partition), it voluntarily demotes itself.
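- A node is promoted only after its Patroni agent has acquired the leader lock, so at most one node at a time is told to accept writes.
- The watchdog covers the case where Patroni itself hangs or crashes: if the agent stops sending keepalives to the watchdog device, the device resets the whole node before the leader lock can expire and be taken by another node.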
# patroni.yml — watchdog configuration
watchdog:
  mode: required # required, automatic, or off
  device: /dev/watchdog
  safety_margin: 5
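As a rough worked example of how safety_margin interacts with ttl and loop_wait: the formulas below follow my reading of the Patroni watchdog documentation (the watchdog is set to expire safety_margin seconds before the TTL) and should be treated as an assumption to verify against your Patroni version. The function name is hypothetical.

```python
from typing import Tuple


def watchdog_budget(ttl: int, loop_wait: int, safety_margin: int) -> Tuple[int, int]:
    """Return (watchdog timeout, time left for one HA loop run), both in seconds."""
    watchdog_timeout = ttl - safety_margin        # node is reset before the lock expires
    ha_loop_budget = watchdog_timeout - loop_wait  # time the HA loop has to finish and ping
    return watchdog_timeout, ha_loop_budget


# ttl: 30 and loop_wait: 10 from the cluster config, safety_margin: 5 from above
print(watchdog_budget(ttl=30, loop_wait=10, safety_margin=5))  # (25, 15)
```

A larger safety_margin makes fencing more conservative but leaves the HA loop less time before a healthy but slow node gets reset.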
Network Partition Scenarios
In a three-node cluster (one primary, two replicas) with etcd:
- If the primary loses connectivity to etcd but not to replicas: Patroni demotes the primary. A replica takes over.
- If a replica is partitioned: no impact on writes. The replica falls behind and catches up when reconnected.
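- If etcd loses quorum: no node can renew or acquire the leader lock, so no failover can happen. By default Patroni also demotes the primary, because it can no longer prove that it holds the lock, and the cluster stays read-only until etcd recovers. This is safe from split-brain, but it means an etcd outage is also a write outage. Patroni 3.0 and later offer failsafe_mode, which lets the primary keep accepting writes as long as it can still reach every other cluster member through the REST API.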
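Whether etcd still has quorum is simple majority arithmetic. The two helpers below (hypothetical names) show how many member failures a cluster of a given size survives.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

One node tolerates no failures, three tolerate one, and five tolerate two. An even member count adds no fault tolerance over the next lower odd count, which is why odd sizes are the usual recommendation.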
Testing Failover
Test failover before you need it. The first time should not be during an actual outage.
What to Test
- Kill the primary process: does a replica promote within your target time?
- Network partition the primary: does it demote itself? Does a replica take over?
- Kill a replica: does the cluster continue operating?
- Slow disk on the primary: does the health check detect the degradation?
- Restart the old primary: does it rejoin as a replica correctly (pg_rewind)?
Measuring Recovery Time
-- On the application side, measure:
-- 1. Time from primary failure to error detection
-- 2. Time from error to successful write on new primary
-- 3. Total downtime experienced by the application
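A well-configured Patroni cluster typically achieves failover in 15 to 30 seconds. When the whole primary node disappears, replicas have to wait for the leader lock to expire, so ttl largely sets the upper end of that range.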
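One way to capture the third measurement is a probe loop that attempts a small write at a fixed interval and records the longest stretch of failures. This is a minimal sketch: `probe` is a hypothetical callable you supply (for example, one that inserts a row into a heartbeat table and returns False on any error), and the clock and sleep parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable, Optional


def measure_downtime(probe: Callable[[], bool], duration: float, interval: float = 0.5,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> float:
    """Probe for `duration` seconds and return the longest outage seen, in seconds."""
    start = clock()
    outage_start: Optional[float] = None
    longest = 0.0
    while True:
        now = clock()
        if now - start >= duration:
            break
        ok = probe()
        if not ok and outage_start is None:
            outage_start = now                            # first failed write
        elif ok and outage_start is not None:
            longest = max(longest, now - outage_start)    # first write after recovery
            outage_start = None
        sleep(interval)
    if outage_start is not None:                          # still down when the run ended
        longest = max(longest, clock() - outage_start)
    return longest
```

The result is accurate only to within one probe interval, so choose an interval well below the recovery time you are trying to verify.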
RTO & RPO Trade-offs
RTO (Recovery Time Objective): how long can the database be down?
- Patroni with etcd: 15-30 seconds typical.
- Manual failover: minutes to hours depending on the operator.
RPO (Recovery Point Objective): how much data can you lose?
- Asynchronous replication: RPO > 0. You can lose transactions that were committed on the primary but not yet replicated.
- Synchronous replication: RPO = 0. No committed transaction is lost. But write latency increases and a replica failure can block writes on the primary.
# Patroni synchronous mode configuration
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false # true = block writes if no sync replica
With synchronous_mode_strict: false, Patroni falls back to asynchronous replication if all synchronous replicas are down. This preserves availability at the cost of potential data loss during that window.
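The interaction of the two settings can be summarised as a small decision function. This is a simplified model of the behaviour described above, not Patroni code; the function name and return labels are hypothetical.

```python
def commit_behaviour(synchronous_mode: bool, synchronous_mode_strict: bool,
                     sync_replicas_available: int) -> str:
    """Return how commits behave: 'sync', 'async', or 'blocked'."""
    if not synchronous_mode:
        return "async"                 # RPO > 0, writes never wait for a replica
    if sync_replicas_available > 0:
        return "sync"                  # RPO = 0, commit waits for a replica
    # No synchronous replica left: strict mode blocks, non-strict degrades
    return "blocked" if synchronous_mode_strict else "async"


print(commit_behaviour(True, False, 1))   # sync
print(commit_behaviour(True, False, 0))   # async
print(commit_behaviour(True, True, 0))    # blocked
```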
Choosing Your Trade-off
- Most web applications: asynchronous replication, RPO of a few seconds. The simplicity and performance are worth the small risk.
- Financial systems: synchronous replication, RPO = 0. Accept the latency cost.
- Regulated industries: synchronous replication with synchronous_mode_strict: true. Writes stop if no replica is available.
Common Pitfalls
- Testing failover for the first time during an actual outage. Run failover drills regularly — monthly at minimum.
- Using a single etcd node for Patroni. A single etcd node is a single point of failure. Run 3 or 5 etcd nodes so the store keeps quorum when one of them fails.
- Not configuring pg_rewind. Without it, a former primary requires a full base backup to rejoin the cluster, which takes hours for large databases.
- Ignoring the application's connection behavior during failover. If the application caches DNS or holds stale connections, it may not reconnect to the new primary promptly.
- Setting synchronous_mode_strict without understanding that writes will block. If all synchronous replicas are down, the primary stops accepting commits.
- Running HAProxy on the same nodes as PostgreSQL. If the node fails, both the database and the proxy go down. Run proxies on separate infrastructure.
Key Takeaways
- Patroni with etcd is the most widely used open-source approach to PostgreSQL HA. It handles leader election, failover, and node reinitialization.
- pg_auto_failover is a simpler alternative that uses a monitor node instead of a distributed consensus store.
- Connection routing (HAProxy, PgBouncer, or multi-host connection strings) ensures applications find the current primary after failover.
- Split-brain prevention is the most critical aspect of HA. Patroni uses etcd leader locks and optional watchdog fencing.
- Test failover regularly. Measure your actual RTO and RPO against your targets.
- The synchronous vs asynchronous replication choice directly determines your RPO. Most applications can tolerate asynchronous replication.