MVCC & Visibility
What MVCC Is
Multi-Version Concurrency Control is the mechanism that allows readers and writers to operate simultaneously without blocking each other. In Postgres, when you update a row, the database does not overwrite the old data. Instead, it creates a new version of the row and marks the old version as expired. Readers see whichever version was current at the time their transaction started.
This is the single most important concept in PostgreSQL internals. Every performance behavior, every vacuum discussion, and every concurrency issue traces back to MVCC.
Row Versions: xmin & xmax
Every row in Postgres carries hidden system columns. The two most important are:
- xmin: The transaction ID that created this row version
- xmax: The transaction ID that deleted or updated this row version (0 if still live)
-- You can see these hidden columns explicitly
SELECT xmin, xmax, id, name FROM users WHERE id = 1;
xmin | xmax | id | name
--------+------+----+--------
100042 | 0 | 1 | Alice
When a transaction updates this row:
BEGIN; -- Transaction ID 100050
UPDATE users SET name = 'Alice Smith' WHERE id = 1;
COMMIT;
Postgres does not modify the existing row. It marks the old row with xmax = 100050 and inserts a new row version with xmin = 100050:
-- Old version (now dead): xmin=100042, xmax=100050
-- New version (now live): xmin=100050, xmax=0
Both versions coexist on disk until VACUUM removes the dead one.
How Visibility Works
When a backend reads a row, it checks the row's xmin and xmax against its own snapshot (the set of transaction IDs that were in-progress when its transaction started). The visibility rules are roughly:
- If xmin was committed before my snapshot, and xmax is 0 or was not committed, the row is visible.
- If xmin was committed before my snapshot, and xmax was committed before my snapshot, the row is dead (I see neither version here; I might see the newer version elsewhere).
- If xmin is my own transaction, the row is visible (I can see my own changes).
- If xmin was in-progress at snapshot time or was aborted, the row is invisible.
This is simplified. The actual rules account for subtransactions, combo CIDs, and other edge cases. But the core idea is: every row version carries enough metadata for any transaction to determine if it should see that version.
Snapshot Isolation in Practice
-- Session 1
BEGIN;
SELECT name FROM users WHERE id = 1; -- sees 'Alice'
-- Session 2
BEGIN;
UPDATE users SET name = 'Bob' WHERE id = 1;
COMMIT;
-- Session 1 (still in the same transaction)
SELECT name FROM users WHERE id = 1; -- still sees 'Alice'
COMMIT;
Session 1 sees a consistent snapshot from the start of its transaction. Session 2's committed change is invisible to Session 1 until Session 1 starts a new transaction. This is the REPEATABLE READ behavior. The default READ COMMITTED isolation level takes a new snapshot for each statement instead.
Readers Never Block Writers
This is the fundamental advantage of MVCC. In a lock-based system, a SELECT would acquire a read lock, and an UPDATE would need a write lock, causing one to wait for the other. In Postgres:
- SELECT never acquires row-level locks (it uses snapshots instead)
- UPDATE creates a new version rather than modifying in place
- The only blocking occurs when two transactions try to UPDATE the same row (they contend on the row-level lock for the update)
This is why Postgres handles mixed read/write workloads so well. Reporting queries do not block OLTP transactions.
Dead Tuples & VACUUM
MVCC creates dead tuples: old row versions that no active transaction needs to see. These accumulate over time and cause two problems:
- Table bloat: Dead tuples occupy disk space. A table with 1 million live rows might have 3 million dead rows, tripling its size on disk.
- Index bloat: Indexes still point to dead tuples. Index scans follow pointers to dead rows and must skip them.
VACUUM is the process that reclaims dead tuples.
-- Manual vacuum
VACUUM users;
-- Vacuum with verbose output
VACUUM VERBOSE users;
-- Vacuum and reclaim space to the OS (locks the table)
VACUUM FULL users;
INFO: vacuuming "public.users"
INFO: "users": removed 45000 dead row versions in 200 pages
INFO: "users": found 45000 removable, 120000 nonremovable row versions in 500 pages
What VACUUM Does
- Scans table pages for dead tuples
- Removes dead tuples and marks the space as reusable
- Updates the free space map (so future INSERTs can reuse the space)
- Updates the visibility map (marks pages where all tuples are visible to all transactions)
- Freezes old transaction IDs to prevent wraparound
Regular VACUUM does not return disk space to the OS. It marks space as reusable within the table file. Only VACUUM FULL compacts the table, but it takes an exclusive lock.
Autovacuum
Postgres runs autovacuum by default. The autovacuum launcher checks tables periodically and spawns worker processes when dead tuples exceed a threshold:
# Trigger autovacuum when dead tuples exceed:
# autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * n_live_tup
# Default: 50 + 0.2 * n_live_tup (20% of table is dead)
For large tables, the default 20% threshold is too aggressive. A 100-million-row table waits until 20 million dead tuples before vacuuming. Override per table:
ALTER TABLE large_events SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 10000
);
Transaction IDs & Wraparound
Transaction IDs in Postgres are 32-bit unsigned integers. They wrap around at approximately 4.2 billion. Postgres uses modular arithmetic: a transaction can see roughly 2 billion transactions in the past and 2 billion in the future.
The danger: if a transaction ID is never "frozen" (marked as definitively in the past), it could wrap around and appear to be in the future, making committed data invisible.
VACUUM prevents this by freezing old transaction IDs. When a tuple's xmin is old enough, VACUUM replaces it with FrozenTransactionId, a special value that is always considered "in the past."
-- Check how close tables are to wraparound
SELECT relname,
age(relfrozenxid) AS xid_age,
pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
If age(relfrozenxid) approaches 2 billion, Postgres enters emergency mode: it refuses to accept new transactions and forces a VACUUM. This is a production outage. The fix is to ensure autovacuum runs consistently and is not being blocked by long-running transactions.
WARNING: database "mydb" must be vacuumed within 10000000 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
Why VACUUM Is Critical
When VACUUM does not run, the consequences compound:
- Dead tuples accumulate. Table size grows without bound.
- Index bloat increases. Index scans slow down.
- Sequential scans slow down. They must read and skip dead tuples.
- Transaction ID wraparound approaches. The clock is ticking.
- Emergency autovacuum kicks in. It runs as an aggressive, blocking operation.
The most common cause of VACUUM not running: long-running transactions. VACUUM cannot remove tuples that any active transaction might need. A forgotten BEGIN in a psql session or a leaked connection holding an open transaction will pin the VACUUM horizon.
-- Find long-running transactions blocking vacuum
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
Kill the offender:
SELECT pg_terminate_backend(12345);
The Visibility Map
The visibility map is a bitmap with one bit per table page. A set bit means every tuple on that page is visible to all current and future transactions. This enables two optimizations:
- VACUUM skips pages. If a page is all-visible, VACUUM does not need to scan it for dead tuples.
- Index-only scans. When the planner sees that all required columns are in the index, it can skip fetching the heap page if the visibility map confirms all tuples are visible. This avoids heap access entirely.
-- Check visibility map coverage
SELECT relname,
n_dead_tup,
pg_stat_get_tuples_hot_updated(oid) AS hot_updates,
pg_relation_size(oid) AS table_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
HOT Updates
Heap-Only Tuple (HOT) updates are a critical optimization. When an UPDATE meets two conditions:
- The new row version fits on the same page as the old version
- No indexed columns were modified
Postgres creates the new version on the same page and chains it to the old version. The index does not need updating because the index still points to the original tuple, which now forwards to the new version.
-- Check HOT update ratio
SELECT relname,
n_tup_upd AS updates,
n_tup_hot_upd AS hot_updates,
round(n_tup_hot_upd::numeric / nullif(n_tup_upd, 0) * 100, 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;
A high HOT update percentage (above 90%) is ideal. If it is low, check whether:
- The
fillfactoris too high (default 100%). Lowering it to 80-90% leaves room on each page for HOT updates. - Indexed columns are being updated unnecessarily.
-- Lower fillfactor to enable more HOT updates
ALTER TABLE users SET (fillfactor = 90);
-- Then VACUUM FULL or pg_repack to reorganize existing data
Common Pitfalls
- Long-running transactions. A single open transaction prevents VACUUM from cleaning up any tuples created after that transaction started. Monitor and kill stale transactions.
- Disabling autovacuum. Some teams disable it to "reduce overhead." This is catastrophic. The overhead of not vacuuming is orders of magnitude worse.
- Ignoring wraparound warnings. The WARNING messages are real. If you see them, VACUUM immediately or face a forced shutdown.
- Using VACUUM FULL as routine maintenance. VACUUM FULL locks the entire table and rewrites it. Use regular VACUUM or pg_repack for online compaction.
- Not monitoring dead tuple counts. If n_dead_tup is consistently high on critical tables, autovacuum is falling behind. Tune its parameters or add manual vacuum jobs.
- Updating indexed columns unnecessarily. This defeats HOT updates and doubles the index maintenance cost. Only SET columns that actually changed.
Key Takeaways
- MVCC means every update creates a new row version. Old versions remain until VACUUM removes them.
- xmin and xmax on every row determine which transactions can see which row versions. Readers never block writers.
- Dead tuples from updates and deletes must be cleaned up by VACUUM. Without it, tables bloat, indexes bloat, and performance degrades.
- Transaction ID wraparound is a real threat. VACUUM freezes old transaction IDs. Long-running transactions prevent this.
- The visibility map enables index-only scans and makes VACUUM efficient by skipping all-visible pages.
- HOT updates avoid index bloat when non-indexed columns change and space is available on the same page. Lower fillfactor to enable them.