WAL & Durability

Write-Ahead Logging

Write-Ahead Logging is the mechanism that makes PostgreSQL durable. The rule is absolute: every change to data must be written to the WAL before the corresponding data page is written to disk. If the server crashes, Postgres replays the WAL from the last checkpoint to bring the database back to a consistent state.

Without WAL, Postgres would need to flush every dirty buffer to disk on every commit. That would be catastrophically slow. WAL converts random writes (scattered data pages across the filesystem) into sequential writes (append-only log), which is orders of magnitude faster on both spinning disks and SSDs.

How WAL Works

When a transaction modifies data:

The backend writes a WAL record describing the change into the WAL buffer (shared memory).
On COMMIT, the backend (or the WAL writer) flushes the WAL buffer to disk. The fsync call guarantees the data is on stable storage.
The dirty data page in shared buffers is not flushed yet. It remains in memory with the modification applied.
Eventually, the checkpointer writes the dirty page to the data file.

The critical insight: the WAL record is written first. If the server crashes after step 2 but before step 4, the data file is stale. But the WAL contains everything needed to reconstruct the change. On recovery, Postgres replays WAL records to bring data files up to date.

-- Current WAL position
SELECT pg_current_wal_lsn();

 pg_current_wal_lsn
---------------------
 0/1A3B4C00

The LSN (Log Sequence Number) is a pointer into the WAL stream. It is a 64-bit value represented as two 32-bit hex numbers separated by a slash.

WAL Segments

WAL is stored as a series of 16MB files (called segments) in the pg_wal/ directory:

$ ls -la pg_wal/
-rw------- 1 postgres postgres 16777216 Apr 15 10:30 000000010000000000000001
-rw------- 1 postgres postgres 16777216 Apr 15 10:35 000000010000000000000002
-rw------- 1 postgres postgres 16777216 Apr 15 10:40 000000010000000000000003

The filename encodes the timeline and segment number. Postgres writes to segments sequentially, and after a checkpoint, old segments that have been fully checkpointed can be recycled (renamed and reused) or removed.

The number of WAL segments on disk is bounded by min_wal_size and max_wal_size:

min_wal_size = 1GB      # keep at least this much WAL on disk
max_wal_size = 4GB      # trigger a checkpoint if WAL exceeds this

If WAL production exceeds max_wal_size, Postgres forces a checkpoint to allow old segments to be recycled.

Checkpoints

A checkpoint is the event where all dirty buffers in shared memory are written to the data files on disk. After a checkpoint completes, Postgres knows that all changes up to a specific WAL position are safely in the data files. WAL before that position is no longer needed for crash recovery.

What Triggers a Checkpoint

checkpoint_timeout = 5min      # Time-based trigger
max_wal_size = 4GB             # WAL volume trigger

A checkpoint occurs when either condition is met, or when you run the CHECKPOINT command manually.

Checkpoint Spread

Writing all dirty buffers at once creates an I/O spike. The checkpoint_completion_target setting spreads the writes:

checkpoint_completion_target = 0.9

With a 5-minute checkpoint interval and a target of 0.9, Postgres spreads the writes over 4.5 minutes, leaving 0.5 minutes of buffer before the next checkpoint.

Monitoring Checkpoints

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend
FROM pg_stat_bgwriter;

checkpoints_timed: Checkpoints triggered by checkpoint_timeout (normal)
checkpoints_req: Checkpoints triggered by max_wal_size or manual request (too many means WAL is filling too fast)
buffers_backend: Buffers written by backends (not the background writer). If this is high, the background writer is not keeping up.

Enable checkpoint logging to see timing details:

log_checkpoints = on

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12483 buffers (9.5%); 0 WAL file(s) added, 0 removed, 2 recycled;
      write=269.035 s, sync=0.012 s, total=269.082 s; sync files=94, longest=0.004 s, average=0.001 s;
      distance=128394 kB, estimate=256788 kB

Full-Page Writes

There is a subtlety in crash recovery. Operating systems write disk pages (typically 4KB) atomically, but Postgres pages are 8KB. If the server crashes mid-write, a data page could be partially written: the first 4KB is new, the second 4KB is old. This is a torn page.

To handle this, Postgres writes a full copy of the page to WAL the first time it is modified after a checkpoint. This is called a full-page write (FPW). If a torn page occurs during recovery, Postgres restores the full page from WAL before applying incremental changes.

full_page_writes = on      # default and strongly recommended

Full-page writes increase WAL volume significantly (sometimes 2-3x). This is the cost of correctness. Disabling FPW risks undetectable data corruption after a crash. The only scenario where disabling is safe is when the filesystem guarantees atomic 8KB writes (e.g., ZFS with matching block size), and even then most DBAs leave it on.

Reducing FPW Overhead

Increase checkpoint_timeout. Longer intervals mean fewer checkpoints, and FPW only happens once per page per checkpoint cycle.
Use wal_compression. Compresses full-page images in WAL, reducing I/O at the cost of CPU.

wal_compression = on

Crash Recovery

When Postgres starts after an unclean shutdown, it automatically runs crash recovery:

Reads the pg_control file to find the last completed checkpoint.
Reads WAL from that checkpoint's redo point forward.
Replays each WAL record, applying changes to data pages.
When all WAL is replayed, the database is consistent and ready to accept connections.

Recovery time depends on the amount of WAL since the last checkpoint. With a 5-minute checkpoint interval, recovery replays at most 5 minutes of WAL, which typically takes seconds to a few minutes.

LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/1A000060
LOG:  redo done at 0/1A3B4C00
LOG:  database system is ready to accept connections

Reducing Recovery Time

Shorter checkpoint intervals mean less WAL to replay, but more I/O during normal operation. This is a fundamental trade-off:

# Aggressive checkpoints (faster recovery, more I/O)
checkpoint_timeout = 1min
max_wal_size = 256MB

# Relaxed checkpoints (slower recovery, less I/O)
checkpoint_timeout = 15min
max_wal_size = 8GB

Most production systems use 5-15 minutes with max_wal_size between 2GB and 8GB.

WAL Archiving & Point-in-Time Recovery

WAL archiving copies completed WAL segments to an external location (local directory, S3, GCS, etc.). Combined with a base backup, archived WAL enables point-in-time recovery (PITR): you can restore the database to any moment in time.

Enabling Archiving

archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'

The %p is the path to the WAL segment file. The %f is the filename. The archive command must return exit code 0 on success. Postgres retains WAL segments until the archive command succeeds.

For production, use a tool like pgBackRest, WAL-G, or Barman instead of a raw cp:

archive_command = 'pgbackrest --stanza=mydb archive-push %p'

Point-in-Time Recovery

To restore to a specific point in time:

Restore a base backup to the data directory.
Create a recovery.signal file.
Configure restore_command and recovery_target_time in postgresql.conf.

restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-03-15 14:30:00'
recovery_target_action = 'promote'

Start Postgres. It replays archived WAL up to the target time and then promotes to a read-write server.

This is how you recover from accidental data deletion. If someone drops a table at 14:35, you restore to 14:29 and the table is back.

WAL Levels

The wal_level setting controls how much information is written to WAL:

wal_level = minimal      # Crash recovery only. No archiving, no replication.
wal_level = replica       # Default. Supports streaming replication and archiving.
wal_level = logical       # Adds information for logical replication/decoding.

replica is the default since Postgres 10 and is appropriate for most deployments. Set logical only if you use logical replication or change data capture.

The Durability-Performance Trade-Off

WAL provides durability, but every fsync on commit is an I/O operation. Postgres offers knobs to trade durability for speed:

synchronous_commit

synchronous_commit = on       # Default: wait for WAL flush to disk
synchronous_commit = off      # Return immediately, WAL flushes asynchronously

With synchronous_commit = off, a committed transaction might be lost if the server crashes within the WAL writer flush interval (up to 600ms by default). The database remains consistent (no corruption), but recent commits may vanish. This is acceptable for high-throughput workloads where losing a fraction of a second of data is tolerable.

You can set this per transaction:

SET LOCAL synchronous_commit = off;
INSERT INTO logs (message) VALUES ('not critical');
COMMIT;  -- returns immediately without waiting for WAL flush

wal_sync_method

Controls the system call used for WAL flushes:

wal_sync_method = fdatasync   # Linux default: sync data but not metadata
wal_sync_method = fsync        # Full fsync

On Linux, fdatasync is typically the best choice. On some storage configurations, open_datasync or open_sync may be faster.

commit_delay & commit_siblings

Groups multiple concurrent commits into a single WAL flush:

commit_delay = 10             # Microseconds to wait for other commits
commit_siblings = 5           # Only delay if this many transactions are active

This reduces the number of fsync calls under high concurrency.

Common Pitfalls

Disabling full_page_writes. This saves WAL volume but risks silent data corruption after a crash. Leave it on unless your filesystem guarantees atomic 8KB writes.
Setting max_wal_size too low. Triggers frequent checkpoints, which cause I/O spikes and write amplification. Set it to at least 2GB for production.
Not monitoring WAL generation rate. If you produce WAL faster than archiving can keep up, the pg_wal directory fills the disk. Monitor pg_wal directory size.
Archive command that silently fails. If the archive command fails, Postgres retains WAL segments indefinitely, filling the disk. Test your archive command and monitor pg_stat_archiver for failures.
Forgetting that synchronous_commit=off can lose data. It does not corrupt the database, but committed transactions can vanish. Only use it for data you can afford to lose.
Not testing recovery. Backups are worthless if you have never tested restoring from them. Schedule regular recovery tests.

Key Takeaways

WAL ensures durability by writing changes to a sequential log before modifying data files. Crash recovery replays WAL to restore consistency.
Checkpoints flush dirty buffers to data files. The checkpoint interval controls recovery time versus I/O overhead.
Full-page writes prevent torn-page corruption at the cost of increased WAL volume. Do not disable them.
WAL archiving combined with base backups enables point-in-time recovery to any moment in the past.
synchronous_commit can be relaxed per-transaction for workloads where losing a fraction of a second of data is acceptable.
The trade-off is always between durability guarantees and write throughput. Understand where your application sits on that spectrum.