WAL & Durability
Write-Ahead Logging
Write-Ahead Logging is the mechanism that makes PostgreSQL durable. The rule is absolute: every change to data must be written to the WAL before the corresponding data page is written to disk. If the server crashes, Postgres replays the WAL from the last checkpoint to bring the database back to a consistent state.
Without WAL, Postgres would need to flush every dirty buffer to disk on every commit. That would be catastrophically slow. WAL converts random writes (scattered data pages across the filesystem) into sequential writes (append-only log), which is orders of magnitude faster on both spinning disks and SSDs.
How WAL Works
When a transaction modifies data:
- The backend writes a WAL record describing the change into the WAL buffer (shared memory).
- On COMMIT, the backend (or the WAL writer) flushes the WAL buffer to disk. The
fsynccall guarantees the data is on stable storage. - The dirty data page in shared buffers is not flushed yet. It remains in memory with the modification applied.
- Eventually, the checkpointer writes the dirty page to the data file.
The critical insight: the WAL record is written first. If the server crashes after step 2 but before step 4, the data file is stale. But the WAL contains everything needed to reconstruct the change. On recovery, Postgres replays WAL records to bring data files up to date.
-- Current WAL position
SELECT pg_current_wal_lsn();
pg_current_wal_lsn
---------------------
0/1A3B4C00
The LSN (Log Sequence Number) is a pointer into the WAL stream. It is a 64-bit value represented as two 32-bit hex numbers separated by a slash.
WAL Segments
WAL is stored as a series of 16MB files (called segments) in the pg_wal/ directory:
$ ls -la pg_wal/
-rw------- 1 postgres postgres 16777216 Apr 15 10:30 000000010000000000000001
-rw------- 1 postgres postgres 16777216 Apr 15 10:35 000000010000000000000002
-rw------- 1 postgres postgres 16777216 Apr 15 10:40 000000010000000000000003
The filename encodes the timeline and segment number. Postgres writes to segments sequentially, and after a checkpoint, old segments that have been fully checkpointed can be recycled (renamed and reused) or removed.
The number of WAL segments on disk is bounded by min_wal_size and max_wal_size:
min_wal_size = 1GB # keep at least this much WAL on disk
max_wal_size = 4GB # trigger a checkpoint if WAL exceeds this
If WAL production exceeds max_wal_size, Postgres forces a checkpoint to allow old segments to be recycled.
Checkpoints
A checkpoint is the event where all dirty buffers in shared memory are written to the data files on disk. After a checkpoint completes, Postgres knows that all changes up to a specific WAL position are safely in the data files. WAL before that position is no longer needed for crash recovery.
What Triggers a Checkpoint
checkpoint_timeout = 5min # Time-based trigger
max_wal_size = 4GB # WAL volume trigger
A checkpoint occurs when either condition is met, or when you run the CHECKPOINT command manually.
Checkpoint Spread
Writing all dirty buffers at once creates an I/O spike. The checkpoint_completion_target setting spreads the writes:
checkpoint_completion_target = 0.9
With a 5-minute checkpoint interval and a target of 0.9, Postgres spreads the writes over 4.5 minutes, leaving 0.5 minutes of buffer before the next checkpoint.
Monitoring Checkpoints
SELECT checkpoints_timed, checkpoints_req,
buffers_checkpoint, buffers_clean, buffers_backend
FROM pg_stat_bgwriter;
- checkpoints_timed: Checkpoints triggered by
checkpoint_timeout(normal) - checkpoints_req: Checkpoints triggered by
max_wal_sizeor manual request (too many means WAL is filling too fast) - buffers_backend: Buffers written by backends (not the background writer). If this is high, the background writer is not keeping up.
Enable checkpoint logging to see timing details:
log_checkpoints = on
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 12483 buffers (9.5%); 0 WAL file(s) added, 0 removed, 2 recycled;
write=269.035 s, sync=0.012 s, total=269.082 s; sync files=94, longest=0.004 s, average=0.001 s;
distance=128394 kB, estimate=256788 kB
Full-Page Writes
There is a subtlety in crash recovery. Operating systems write disk pages (typically 4KB) atomically, but Postgres pages are 8KB. If the server crashes mid-write, a data page could be partially written: the first 4KB is new, the second 4KB is old. This is a torn page.
To handle this, Postgres writes a full copy of the page to WAL the first time it is modified after a checkpoint. This is called a full-page write (FPW). If a torn page occurs during recovery, Postgres restores the full page from WAL before applying incremental changes.
full_page_writes = on # default and strongly recommended
Full-page writes increase WAL volume significantly (sometimes 2-3x). This is the cost of correctness. Disabling FPW risks undetectable data corruption after a crash. The only scenario where disabling is safe is when the filesystem guarantees atomic 8KB writes (e.g., ZFS with matching block size), and even then most DBAs leave it on.
Reducing FPW Overhead
- Increase checkpoint_timeout. Longer intervals mean fewer checkpoints, and FPW only happens once per page per checkpoint cycle.
- Use wal_compression. Compresses full-page images in WAL, reducing I/O at the cost of CPU.
wal_compression = on
Crash Recovery
When Postgres starts after an unclean shutdown, it automatically runs crash recovery:
- Reads the
pg_controlfile to find the last completed checkpoint. - Reads WAL from that checkpoint's redo point forward.
- Replays each WAL record, applying changes to data pages.
- When all WAL is replayed, the database is consistent and ready to accept connections.
Recovery time depends on the amount of WAL since the last checkpoint. With a 5-minute checkpoint interval, recovery replays at most 5 minutes of WAL, which typically takes seconds to a few minutes.
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/1A000060
LOG: redo done at 0/1A3B4C00
LOG: database system is ready to accept connections
Reducing Recovery Time
Shorter checkpoint intervals mean less WAL to replay, but more I/O during normal operation. This is a fundamental trade-off:
# Aggressive checkpoints (faster recovery, more I/O)
checkpoint_timeout = 1min
max_wal_size = 256MB
# Relaxed checkpoints (slower recovery, less I/O)
checkpoint_timeout = 15min
max_wal_size = 8GB
Most production systems use 5-15 minutes with max_wal_size between 2GB and 8GB.
WAL Archiving & Point-in-Time Recovery
WAL archiving copies completed WAL segments to an external location (local directory, S3, GCS, etc.). Combined with a base backup, archived WAL enables point-in-time recovery (PITR): you can restore the database to any moment in time.
Enabling Archiving
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'
The %p is the path to the WAL segment file. The %f is the filename. The archive command must return exit code 0 on success. Postgres retains WAL segments until the archive command succeeds.
For production, use a tool like pgBackRest, WAL-G, or Barman instead of a raw cp:
archive_command = 'pgbackrest --stanza=mydb archive-push %p'
Point-in-Time Recovery
To restore to a specific point in time:
- Restore a base backup to the data directory.
- Create a
recovery.signalfile. - Configure
restore_commandandrecovery_target_timeinpostgresql.conf.
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-03-15 14:30:00'
recovery_target_action = 'promote'
- Start Postgres. It replays archived WAL up to the target time and then promotes to a read-write server.
This is how you recover from accidental data deletion. If someone drops a table at 14:35, you restore to 14:29 and the table is back.
WAL Levels
The wal_level setting controls how much information is written to WAL:
wal_level = minimal # Crash recovery only. No archiving, no replication.
wal_level = replica # Default. Supports streaming replication and archiving.
wal_level = logical # Adds information for logical replication/decoding.
replica is the default since Postgres 10 and is appropriate for most deployments. Set logical only if you use logical replication or change data capture.
The Durability-Performance Trade-Off
WAL provides durability, but every fsync on commit is an I/O operation. Postgres offers knobs to trade durability for speed:
synchronous_commit
synchronous_commit = on # Default: wait for WAL flush to disk
synchronous_commit = off # Return immediately, WAL flushes asynchronously
With synchronous_commit = off, a committed transaction might be lost if the server crashes within the WAL writer flush interval (up to 600ms by default). The database remains consistent (no corruption), but recent commits may vanish. This is acceptable for high-throughput workloads where losing a fraction of a second of data is tolerable.
You can set this per transaction:
SET LOCAL synchronous_commit = off;
INSERT INTO logs (message) VALUES ('not critical');
COMMIT; -- returns immediately without waiting for WAL flush
wal_sync_method
Controls the system call used for WAL flushes:
wal_sync_method = fdatasync # Linux default: sync data but not metadata
wal_sync_method = fsync # Full fsync
On Linux, fdatasync is typically the best choice. On some storage configurations, open_datasync or open_sync may be faster.
commit_delay & commit_siblings
Groups multiple concurrent commits into a single WAL flush:
commit_delay = 10 # Microseconds to wait for other commits
commit_siblings = 5 # Only delay if this many transactions are active
This reduces the number of fsync calls under high concurrency.
Common Pitfalls
- Disabling full_page_writes. This saves WAL volume but risks silent data corruption after a crash. Leave it on unless your filesystem guarantees atomic 8KB writes.
- Setting max_wal_size too low. Triggers frequent checkpoints, which cause I/O spikes and write amplification. Set it to at least 2GB for production.
- Not monitoring WAL generation rate. If you produce WAL faster than archiving can keep up, the pg_wal directory fills the disk. Monitor
pg_waldirectory size. - Archive command that silently fails. If the archive command fails, Postgres retains WAL segments indefinitely, filling the disk. Test your archive command and monitor
pg_stat_archiverfor failures. - Forgetting that synchronous_commit=off can lose data. It does not corrupt the database, but committed transactions can vanish. Only use it for data you can afford to lose.
- Not testing recovery. Backups are worthless if you have never tested restoring from them. Schedule regular recovery tests.
Key Takeaways
- WAL ensures durability by writing changes to a sequential log before modifying data files. Crash recovery replays WAL to restore consistency.
- Checkpoints flush dirty buffers to data files. The checkpoint interval controls recovery time versus I/O overhead.
- Full-page writes prevent torn-page corruption at the cost of increased WAL volume. Do not disable them.
- WAL archiving combined with base backups enables point-in-time recovery to any moment in the past.
- synchronous_commit can be relaxed per-transaction for workloads where losing a fraction of a second of data is acceptable.
- The trade-off is always between durability guarantees and write throughput. Understand where your application sits on that spectrum.