High Availability Patterns

Replication gives you a copy of your data. High availability gives you automated failover — when the primary goes down, a replica takes over and the application continues with minimal downtime.

The Failover Process

Every HA solution follows the same basic steps:

Detect failure: determine that the primary is actually down, not just slow.
Promote a replica: one replica becomes the new primary and starts accepting writes.
Update connections: applications and other replicas connect to the new primary.
Prevent split-brain: ensure the old primary cannot accept writes if it comes back online.

The complexity is in getting each step right under real-world failure conditions.
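
The first step, telling a dead primary from a slow one, is often handled by requiring several consecutive missed probes before acting. The sketch below illustrates that idea only; the FailureDetector class and its threshold of 3 are hypothetical and not taken from any particular HA tool.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Declares the primary down only after several consecutive missed probes."""
    threshold: int = 3   # hypothetical: missed probes required before failover
    misses: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True once the primary counts as down."""
        if probe_ok:
            self.misses = 0      # any success resets the streak (slow, not dead)
        else:
            self.misses += 1
        return self.misses >= self.threshold


detector = FailureDetector(threshold=3)
probes = [False, False, True, False, False, False]
results = [detector.observe(ok) for ok in probes]
print(results)  # [False, False, False, False, False, True]
```

A single successful probe in the middle resets the count, so a primary that is merely slow does not trigger promotion. A higher threshold reduces false positives but lengthens detection time.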

Patroni

Patroni is the most widely used open-source HA solution for PostgreSQL. It manages the PostgreSQL lifecycle and coordinates failover through a distributed consensus store (etcd, ZooKeeper, or Consul).

How Patroni Works

Each PostgreSQL node runs a Patroni agent.
Patroni uses etcd (or similar) as a distributed key-value store for leader election.
The leader holds a lock in etcd. If the leader fails to renew the lock, another node acquires it and promotes its PostgreSQL instance.
Patroni handles replica creation, configuration management, and switchover (planned failover).
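
The leader lock described above behaves like a lease with a time-to-live. The following is a minimal in-memory sketch of that behavior, assuming a single process and a caller-supplied clock; LeaseStore is a hypothetical stand-in and not the etcd or Patroni API, where the same check is done with an atomic compare-and-set.

```python
from typing import Optional


class LeaseStore:
    """In-memory stand-in for a DCS key with a TTL (not the etcd API)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Take the lock if it is free or expired, or renew it if we hold it."""
        expired = now >= self.expires_at
        if self.holder is None or expired or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


store = LeaseStore(ttl=30)
assert store.acquire_or_renew("node1", now=0)        # node1 becomes leader
assert not store.acquire_or_renew("node2", now=10)   # lock still held by node1
assert store.acquire_or_renew("node1", now=20)       # node1 renews until t=50
# node1 crashes and stops renewing; the lease runs out at t=50.
assert not store.acquire_or_renew("node2", now=45)
assert store.acquire_or_renew("node2", now=50)       # node2 takes over
print(store.holder)  # node2
```

The gap between the last renewal and the takeover is why the TTL sets a floor on how quickly a dead leader can be replaced.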

Patroni Configuration

# patroni.yml
scope: my-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008

etcd3:
  hosts: 10.0.2.1:2379,10.0.2.2:2379,10.0.2.3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5
        hot_standby: on

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    superuser:
      username: postgres
      password: secret
    replication:
      username: replicator
      password: secret
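
The three timing values in the dcs section are related. The Patroni documentation describes the constraint loop_wait + 2 * retry_timeout <= ttl, so that a leader has time to retry a failed lock renewal before the lock expires. A small sketch that checks the values used above (the function name is hypothetical):

```python
def timing_is_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the documented relationship loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


# Values from the patroni.yml above: 10 + 2 * 10 = 30, which just fits.
print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10))  # True

# Lowering ttl alone breaks the relationship: 10 + 2 * 10 = 30 > 20.
print(timing_is_consistent(ttl=20, loop_wait=10, retry_timeout=10))  # False
```

When shortening ttl to speed up failover, loop_wait and retry_timeout usually need to shrink with it.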

Patroni Operations

# Check cluster status
patronictl -c /etc/patroni.yml list

+--------+----------+---------+---------+----+-----------+
| Member | Host     | Role    | State   | TL | Lag in MB |
+--------+----------+---------+---------+----+-----------+
| node1  | 10.0.1.1 | Leader  | running |  3 |           |
| node2  | 10.0.1.2 | Replica | running |  3 |         0 |
| node3  | 10.0.1.3 | Replica | running |  3 |         0 |
+--------+----------+---------+---------+----+-----------+

# Planned switchover (graceful); older Patroni releases spell the flag --master
patronictl -c /etc/patroni.yml switchover --leader node1 --candidate node2

# Emergency failover (prompts for a candidate if none is given)
patronictl -c /etc/patroni.yml failover

# Reinitialize a failed node
patronictl -c /etc/patroni.yml reinit my-cluster node1
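
The same status information is available to scripts through the Patroni REST API. The sketch below parses a payload shaped like a GET /cluster response and reports the leader and any lagging replicas. The sample JSON is hypothetical, and the field names and the assumption that lag is reported in bytes should be checked against the Patroni version in use.

```python
import json

# Hypothetical sample shaped like a Patroni GET /cluster response.
SAMPLE = """
{"members": [
  {"name": "node1", "role": "leader",  "state": "running", "timeline": 3},
  {"name": "node2", "role": "replica", "state": "running", "timeline": 3, "lag": 0},
  {"name": "node3", "role": "replica", "state": "running", "timeline": 3, "lag": 2097152}
]}
"""


def summarize(payload: str, max_lag_bytes: int = 1048576):
    """Return (leader name, replicas whose lag exceeds max_lag_bytes)."""
    members = json.loads(payload)["members"]
    leader = next((m["name"] for m in members if m["role"] == "leader"), None)
    lagging = [m["name"] for m in members
               if m["role"] != "leader" and m.get("lag", 0) > max_lag_bytes]
    return leader, lagging


leader, lagging = summarize(SAMPLE)
print(leader, lagging)  # node1 ['node3']
```

The default threshold mirrors the maximum_lag_on_failover value from the configuration above, so the list shows replicas that would currently be skipped as failover candidates.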

pg_rewind

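When a former primary comes back online, its WAL may have diverged from the new primary's. pg_rewind resynchronizes the old primary's data directory with the new primary by copying only the blocks that changed after the divergence point, which avoids a full base backup. It requires wal_log_hints = on or data checksums on the old primary.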

# Patroni handles this automatically with use_pg_rewind: true

pg_auto_failover

A lighter-weight alternative from Citus/Microsoft. It uses a dedicated monitor node instead of an external consensus store.

Architecture

One monitor node runs the pg_autoctl monitor.
Each PostgreSQL node runs pg_autoctl.
The monitor tracks node health and orchestrates failover.

Setup

# On the monitor node
pg_autoctl create monitor \
  --pgdata /var/lib/postgresql/monitor \
  --pgport 5000

# On the primary (nodes connect to the monitor as the autoctl_node user)
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/16/main \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover

# On the replica
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/16/main \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover

Monitoring

pg_autoctl show state
  Name |  Node |      Host:Port |       TLI: LSN |   Connection |      State
-------+-------+----------------+----------------+--------------+-----------
 node1 |     1 | 10.0.1.1:5432  |   3: 0/5000060 |   read-write |   primary
 node2 |     2 | 10.0.1.2:5432  |   3: 0/5000060 |    read-only | secondary

The trade-off: pg_auto_failover is simpler to set up than Patroni, but the monitor is a single point of failure for failover decisions unless it is itself made highly available.

Connection Routing

After failover, applications need to connect to the new primary. There are several approaches.

HAProxy

HAProxy sits in front of PostgreSQL and routes connections based on health checks.

# haproxy.cfg
listen postgres-primary
    bind *:5432
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
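    # on-marked-down shutdown-sessions closes existing connections to a demoted primary
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions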
    server node1 10.0.1.1:5432 check port 8008
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008

listen postgres-replicas
    bind *:5433
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008

The health check endpoint (port 8008) is the Patroni REST API. It returns 200 for the primary on /primary and 200 for replicas on /replica.
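
The same check can be reproduced from a script, which is useful when debugging why HAProxy marks a node down. The sketch below treats HTTP 200 as "has this role" and anything else as "does not". Since no real cluster is assumed here, it runs against a local stub server that behaves like a replica; FakeReplicaAPI and role_check are hypothetical names.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Ignore proxy environment variables so the local probe is sent directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def role_check(base_url: str, path: str, timeout: float = 2.0) -> bool:
    """Return True when the endpoint answers 200, like 'http-check expect status 200'."""
    try:
        with _opener.open(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # non-2xx answers raise HTTPError, a URLError subclass


class FakeReplicaAPI(BaseHTTPRequestHandler):
    """Local stub that behaves like a replica: 200 on /replica, 503 elsewhere."""

    def do_GET(self):
        self.send_response(200 if self.path == "/replica" else 503)
        self.end_headers()

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), FakeReplicaAPI)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

primary_ok = role_check(base, "/primary")
replica_ok = role_check(base, "/replica")
print(primary_ok, replica_ok)  # False True

server.shutdown()
server.server_close()
```

Against a real node, base would be the Patroni REST address, for example http://10.0.1.2:8008 from the configuration above.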

PgBouncer with Patroni

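Patroni does not reconfigure PgBouncer by itself, but its callback hooks (such as on_role_change) can run a script that points PgBouncer at the new primary and reloads it on failover.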

libpq Multi-Host Connection Strings

PostgreSQL's client library supports listing multiple hosts:

postgresql://host1:5432,host2:5432,host3:5432/mydb?target_session_attrs=read-write

The client tries each host in order and connects to the one that accepts read-write sessions (the primary). This works without any proxy but requires client-side retry logic on failover.
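
The retry logic can be as simple as exponential backoff around the driver's connect call. The sketch below is driver-agnostic: connect is any callable that takes the URI, and the use of ConnectionError is an assumption, since real drivers raise their own exception types (psycopg raises OperationalError, for instance). The helper names are hypothetical.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def multi_host_uri(hosts: Sequence[str], dbname: str) -> str:
    """Build a libpq multi-host URI that only accepts the read-write node."""
    return f"postgresql://{','.join(hosts)}/{dbname}?target_session_attrs=read-write"


def connect_with_retry(connect: Callable[[str], T], uri: str,
                       attempts: int = 5, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect(uri) until it succeeds, doubling the delay between attempts."""
    for attempt in range(attempts):
        try:
            return connect(uri)
        except ConnectionError:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demonstration with a fake driver that fails twice while failover completes.
calls = []

def fake_connect(uri: str) -> str:
    calls.append(uri)
    if len(calls) < 3:
        raise ConnectionError("no read-write host yet")
    return "connected"


uri = multi_host_uri(["host1:5432", "host2:5432", "host3:5432"], "mydb")
delays = []
result = connect_with_retry(fake_connect, uri, sleep=delays.append)
print(result, delays)  # connected [0.5, 1.0]
```

The total backoff budget should exceed the expected failover time, otherwise the application gives up just before the new primary becomes available.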

Split-Brain Prevention

Split-brain occurs when two nodes both think they are the primary and accept writes. This causes data divergence that is extremely difficult to resolve.

How Patroni Prevents Split-Brain

The leader lock in etcd has a TTL. If the primary cannot renew the lock (e.g., network partition), it voluntarily demotes itself.
Patroni checks the lock before allowing PostgreSQL to accept writes.
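With a watchdog configured, the kernel or hardware watchdog resets the whole node if Patroni stops refreshing it in time. This fences a primary whose Patroni process is hung or too slow to demote PostgreSQL before the leader lock expires.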

# patroni.yml — watchdog configuration
watchdog:
  mode: required  # required, automatic, or off
  device: /dev/watchdog
  safety_margin: 5
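
The safety_margin setting interacts with the ttl, loop_wait, and retry_timeout values from the main configuration. As the Patroni watchdog documentation describes it, the watchdog is set to expire safety_margin seconds before the leader lock's TTL. The arithmetic below is a sketch of those time budgets under that reading; the function and key names are hypothetical.

```python
def watchdog_budget(ttl: int, loop_wait: int, retry_timeout: int,
                    safety_margin: int) -> dict:
    """Derive the time budgets implied by the watchdog settings, in seconds."""
    watchdog_timeout = ttl - safety_margin          # watchdog fires before the lock expires
    ha_loop_budget = watchdog_timeout - loop_wait   # time one HA loop has to complete
    demote_budget = ha_loop_budget - retry_timeout  # time left after a DCS timeout
    return {
        "watchdog_timeout": watchdog_timeout,
        "ha_loop_budget": ha_loop_budget,
        "demote_budget": demote_budget,
    }


budget = watchdog_budget(ttl=30, loop_wait=10, retry_timeout=10, safety_margin=5)
print(budget)
# {'watchdog_timeout': 25, 'ha_loop_budget': 15, 'demote_budget': 5}
```

With these values, a primary that loses the DCS has roughly five seconds to terminate client connections before the node is reset, which is why shrinking ttl without adjusting the other settings can leave no margin at all.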

Network Partition Scenarios

In a three-node cluster (one primary, two replicas) with etcd:

If the primary loses connectivity to etcd but not to replicas: Patroni demotes the primary. A replica takes over.
If a replica is partitioned: no impact on writes. The replica falls behind and catches up when reconnected.
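If etcd loses quorum: no failover can happen, and by default the primary demotes itself to read-only because it can no longer renew the leader lock. This is safe (no split-brain), but the cluster accepts no writes until etcd recovers. Patroni 3.0 and later offer an optional DCS failsafe mode (failsafe_mode: true) that lets the primary keep running as long as it can still reach every other cluster member.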
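
Whether etcd keeps quorum in a given partition is simple majority arithmetic. A short sketch (function names are hypothetical):

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-member cluster tolerates no more failures than a three-member one, which is why odd cluster sizes are the usual recommendation.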

Testing Failover

Test failover before you need it. The first time should not be during an actual outage.

What to Test

Kill the primary process: does a replica promote within your target time?
Network partition the primary: does it demote itself? Does a replica take over?
Kill a replica: does the cluster continue operating?
Slow disk on the primary: does the health check detect the degradation?
Restart the old primary: does it rejoin as a replica correctly (pg_rewind)?

Measuring Recovery Time

-- On the application side, measure:
-- 1. Time from primary failure to error detection
-- 2. Time from error to successful write on new primary
-- 3. Total downtime experienced by the application

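A well-configured Patroni cluster typically completes failover in 15-30 seconds. Detection dominates that figure: when the whole primary node dies, its leader lock must expire (up to ttl, 30 seconds in the configuration above) before a replica can promote. A lower ttl shortens failover at the cost of more false positives on an unreliable network.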
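
One way to obtain the application-side numbers is to issue a small write at a fixed interval during the drill and record whether each attempt succeeded. The sketch below analyzes such a log; the (timestamp, succeeded) sample format and the function name are assumptions for illustration, and the resolution of the result is limited by the probe interval.

```python
from typing import Iterable, Optional, Tuple


def measure_outage(samples: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Return seconds between the first failed write and the next successful one.

    Returns None if no write failed or if writes never recovered.
    """
    first_failure = None
    for timestamp, ok in samples:
        if not ok and first_failure is None:
            first_failure = timestamp          # application first sees an error
        elif ok and first_failure is not None:
            return timestamp - first_failure   # first write on the new primary
    return None


# Hypothetical drill log: one probe per second, writes fail from t=2 until t=20.
log = [(t, not (2 <= t < 20)) for t in range(25)]
downtime = measure_outage(log)
print(downtime)  # 18
```

Comparing this figure across drills shows whether configuration changes actually move the recovery time the application experiences.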

RTO & RPO Trade-offs

RTO (Recovery Time Objective): how long can the database be down?

Patroni with etcd: 15-30 seconds typical.
Manual failover: minutes to hours depending on the operator.

RPO (Recovery Point Objective): how much data can you lose?

Asynchronous replication: RPO > 0. You can lose transactions that were committed on the primary but not yet replicated.
Synchronous replication: RPO = 0. No committed transaction is lost. But write latency increases and a replica failure can block writes on the primary.

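# Patroni synchronous mode configuration
# (bootstrap.dcs is applied only when the cluster is first initialized;
#  on a running cluster, change these with: patronictl edit-config)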
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false  # true = block writes if no sync replica

With synchronous_mode_strict: false, Patroni falls back to asynchronous replication if all synchronous replicas are down. This preserves availability at the cost of potential data loss during that window.
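
The interaction of the two settings can be summarized as a small decision function. This is a sketch of the behavior described above, not Patroni code; the function name and the returned labels are hypothetical.

```python
def commit_behaviour(synchronous_mode: bool, strict: bool,
                     sync_replicas_up: int) -> str:
    """Describe how commits behave for a given configuration and replica count."""
    if not synchronous_mode:
        return "async"                  # commits never wait for a replica
    if sync_replicas_up > 0:
        return "sync"                   # commits wait for a synchronous replica
    # No synchronous replica is available.
    return "blocked" if strict else "async-fallback"


print(commit_behaviour(True, False, 1))   # sync
print(commit_behaviour(True, False, 0))   # async-fallback
print(commit_behaviour(True, True, 0))    # blocked
print(commit_behaviour(False, False, 0))  # async
```

Only the "sync" and "blocked" outcomes preserve RPO = 0; "async-fallback" keeps the application writing but reopens the data-loss window until a synchronous replica returns.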

Choosing Your Trade-off

Most web applications: asynchronous replication, RPO of a few seconds. The simplicity and performance are worth the small risk.
Financial systems: synchronous replication, RPO = 0. Accept the latency cost.
Regulated industries: synchronous replication with synchronous_mode_strict: true. Writes stop if no replica is available.

Common Pitfalls

Testing failover for the first time during an actual outage. Run failover drills regularly — monthly at minimum.
Using a single etcd node for Patroni. etcd needs a quorum (3 or 5 nodes). A single etcd node is a single point of failure.
Not configuring pg_rewind. Without it, a former primary requires a full base backup to rejoin the cluster, which takes hours for large databases.
Ignoring the application's connection behavior during failover. If the application caches DNS or holds stale connections, it may not reconnect to the new primary promptly.
Setting synchronous_mode_strict without understanding that writes will block. If all synchronous replicas are down, the primary stops accepting commits.
Running HAProxy on the same nodes as PostgreSQL. If the node fails, both the database and the proxy go down. Run proxies on separate infrastructure.

Key Takeaways

Patroni with etcd is the most widely used open-source approach to PostgreSQL HA. It handles leader election, failover, and node reinitialization.
pg_auto_failover is a simpler alternative that uses a monitor node instead of a distributed consensus store.
Connection routing (HAProxy, PgBouncer, or multi-host connection strings) ensures applications find the current primary after failover.
Split-brain prevention is the most critical aspect of HA. Patroni uses etcd leader locks and optional watchdog fencing.
Test failover regularly. Measure your actual RTO and RPO against your targets.
The synchronous vs asynchronous replication choice directly determines your RPO. Most applications can tolerate asynchronous replication.