MVCC and Concurrency Control
Overview
Multi-Version Concurrency Control (MVCC) allows readers and writers to operate concurrently without blocking each other. Instead of locking rows, the database maintains multiple versions of each row, and each transaction sees a consistent snapshot. MVCC is the foundation of concurrency in PostgreSQL, MySQL/InnoDB, Oracle, and most modern databases.
MVCC Fundamentals
Core Principle
Traditional Locking:
Reader blocks Writer, Writer blocks Reader -> low concurrency
MVCC:
Writer creates new version, Reader sees old version
Readers never block Writers, Writers never block Readers
Writers only block other Writers on the same row
Timestamp Ordering
Each transaction receives a timestamp (or transaction ID). A row version is visible to a transaction if it was created by a committed transaction with a lower timestamp and not yet deleted (or deleted by a transaction with a higher timestamp).
Transaction T5 reads row R:
Version history of R:
v1: created by T1 (committed), deleted by T3 (committed)
v2: created by T3 (committed), not deleted
v3: created by T7 (not yet committed)
T5 sees v2:
- v1 deleted by T3 < T5, so invisible
- v2 created by T3 < T5, not deleted, so VISIBLE
- v3 created by T7 > T5, so invisible
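The visibility rule above can be sketched in Python. This is a simplified model, not any specific engine's code; the Version fields, the committed set, and the helper names are illustrative:

```python
# Simplified MVCC visibility: a version is visible to a reader if its
# creator committed with a lower ID, and any deleter either hasn't
# committed or has a higher ID than the reader.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    created_by: int            # transaction ID that created this version
    deleted_by: Optional[int]  # transaction ID that deleted it, or None
    data: str

def visible(v: Version, reader: int, committed: set) -> bool:
    # Creator must be committed and ordered before the reader.
    if v.created_by >= reader or v.created_by not in committed:
        return False
    # A deletion hides the version only if the deleter is committed
    # and ordered before the reader.
    if v.deleted_by is not None and v.deleted_by < reader and v.deleted_by in committed:
        return False
    return True

# The walkthrough above: T5 reads R; T1 and T3 are committed, T7 is not.
history = [
    Version(1, 3, "v1"),      # created by T1, deleted by T3
    Version(3, None, "v2"),   # created by T3, still live
    Version(7, None, "v3"),   # created by uncommitted T7
]
seen = [v.data for v in history if visible(v, reader=5, committed={1, 3})]
print(seen)  # ['v2']
```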
PostgreSQL MVCC: xmin/xmax
PostgreSQL stores version visibility information directly in each tuple (row) using system columns.
Tuple Header Fields
Every tuple contains hidden columns:
xmin - transaction ID that created this tuple version
xmax - transaction ID that deleted/updated this tuple (0 if alive)
cmin - command counter within creating transaction
cmax - command counter within deleting transaction
ctid - physical location (page, offset) of current/next version
Visibility check for transaction T reading a tuple:
1. xmin must be committed as of T's snapshot AND xmin < T's snapshot
2. xmax must be 0 (not deleted)
   OR xmax not committed as of T's snapshot
   OR xmax >= T's snapshot
Example Walkthrough
-- Initial state: T100 inserts a row
INSERT INTO accounts(id, balance) VALUES (1, 1000);
-- Tuple: xmin=100, xmax=0, data=(1, 1000)
-- T200 updates the row
UPDATE accounts SET balance = 900 WHERE id = 1;
-- Old tuple: xmin=100, xmax=200, data=(1, 1000)
-- New tuple: xmin=200, xmax=0, data=(1, 900)
-- T150 (snapshot between T100 and T200) reads:
-- Sees old tuple: xmin=100 < 150 (creation visible); xmax=200 > 150 (deletion not visible, tuple still live)
-- Result: balance = 1000
-- T250 reads:
-- Old tuple: xmax=200 < 250 (deleted, invisible)
-- New tuple: xmin=200 < 250 (visible)
-- Result: balance = 900
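The walkthrough can be condensed into a single visibility predicate. This is a hedged sketch of the xmin/xmax rules, ignoring the snapshot's in-progress list, command IDs, and hint bits that real PostgreSQL also consults:

```python
# Simplified xmin/xmax tuple visibility for a transaction whose snapshot
# was taken at snapshot_xid, given the set of committed transaction IDs.
def tuple_visible(xmin: int, xmax: int, snapshot_xid: int,
                  committed: set) -> bool:
    # Rule 1: creator committed and before our snapshot.
    if xmin not in committed or xmin >= snapshot_xid:
        return False
    # Rule 2: no deletion, or the deletion is not visible to us.
    if xmax == 0:
        return True
    return xmax not in committed or xmax >= snapshot_xid

committed = {100, 200}
old = dict(xmin=100, xmax=200)   # (1, 1000) before the UPDATE
new = dict(xmin=200, xmax=0)     # (1, 900) after the UPDATE

# T150 sees the old tuple; T250 sees the new one.
print(tuple_visible(**old, snapshot_xid=150, committed=committed))  # True
print(tuple_visible(**new, snapshot_xid=150, committed=committed))  # False
print(tuple_visible(**old, snapshot_xid=250, committed=committed))  # False
print(tuple_visible(**new, snapshot_xid=250, committed=committed))  # True
```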
VACUUM
Dead tuples (old versions no longer visible to any transaction) must be reclaimed. PostgreSQL's VACUUM process removes them.
Before VACUUM:
Page: [live_tuple_1] [dead_tuple] [live_tuple_2] [dead_tuple]
After VACUUM:
Page: [live_tuple_1] [free_space] [live_tuple_2] [free_space]
Free space map updated, dead tuple space reusable
autovacuum: background process triggered by modification thresholds
threshold = autovacuum_vacuum_threshold +
            autovacuum_vacuum_scale_factor * reltuples (estimated row count)
A vacuum is triggered when the dead-tuple count exceeds this threshold.
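As a quick illustration, here is the trigger check with PostgreSQL's default settings (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2); `reltuples` stands in for the planner's estimated row count:

```python
# Autovacuum fires when dead tuples exceed a fixed threshold plus a
# fraction of the table's estimated row count.
def needs_autovacuum(dead_tuples: int, reltuples: int,
                     threshold: int = 50, scale_factor: float = 0.2) -> bool:
    return dead_tuples > threshold + scale_factor * reltuples

# For a 100,000-row table the trigger point is 50 + 0.2 * 100000 = 20050.
print(needs_autovacuum(dead_tuples=10_000, reltuples=100_000))  # False
print(needs_autovacuum(dead_tuples=25_000, reltuples=100_000))  # True
```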
MySQL/InnoDB MVCC: Undo Log
InnoDB maintains only the latest version of each row on the data page. Previous versions are reconstructed on-demand from the undo log.
Undo Log Architecture
Clustered Index Page (latest version):
Row: [trx_id=200, roll_ptr -> undo_log] (id=1, balance=900)
Undo Log:
Entry: [trx_id=100, data=(id=1, balance=1000), prev_ptr=NULL]
To reconstruct version for trx_id=150:
1. Read current row: trx_id=200 > 150, too new
2. Follow roll_ptr to undo log
3. Apply undo record: trx_id=100 < 150, visible
4. Return reconstructed row: (id=1, balance=1000)
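The reconstruction walk can be sketched in Python. Assumptions are simplified: each undo record holds a full prior row image rather than a diff, and trx IDs double as visibility timestamps:

```python
# Walk an InnoDB-style undo chain until we reach a version old enough
# for the reader to see.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    trx_id: int                       # transaction that wrote this version
    data: dict
    roll_ptr: Optional["RowVersion"]  # previous version in the undo log

def read_as_of(row: RowVersion, reader_trx_id: int) -> Optional[dict]:
    v = row
    while v is not None and v.trx_id > reader_trx_id:
        v = v.roll_ptr                # too new: follow the undo pointer
    return v.data if v else None

undo = RowVersion(trx_id=100, data={"id": 1, "balance": 1000}, roll_ptr=None)
current = RowVersion(trx_id=200, data={"id": 1, "balance": 900}, roll_ptr=undo)

print(read_as_of(current, reader_trx_id=150))  # {'id': 1, 'balance': 1000}
print(read_as_of(current, reader_trx_id=250))  # {'id': 1, 'balance': 900}
```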
Read Views
InnoDB Read View contains:
- m_low_limit_id: highest trx_id at view creation + 1
- m_up_limit_id: lowest active trx_id at view creation
- m_ids: list of active (uncommitted) transaction IDs
Visibility: row with trx_id T is visible if:
T < m_up_limit_id -> definitely committed
T >= m_low_limit_id -> definitely not visible
m_up_limit_id <= T < m_low_limit_id:
T NOT IN m_ids -> committed, visible
T IN m_ids -> uncommitted, not visible
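The three-way test above maps directly onto code. A hedged Python sketch following the field names listed (this mirrors the logic of InnoDB's `ReadView::changes_visible`, not its actual implementation):

```python
# InnoDB-style Read View visibility check for a row stamped with trx_id.
def rv_visible(trx_id: int, m_up_limit_id: int, m_low_limit_id: int,
               m_ids: set) -> bool:
    if trx_id < m_up_limit_id:
        return True             # committed before any active transaction
    if trx_id >= m_low_limit_id:
        return False            # started after the view was created
    return trx_id not in m_ids  # in-between: visible only if committed

# View created while trx 150 and 170 were active; next trx id was 201.
view = dict(m_up_limit_id=150, m_low_limit_id=201, m_ids={150, 170})
print(rv_visible(100, **view))  # True  (old, definitely committed)
print(rv_visible(170, **view))  # False (still active at view creation)
print(rv_visible(160, **view))  # True  (committed before view creation)
print(rv_visible(300, **view))  # False (too new)
```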
Purge Thread
Unlike PostgreSQL's VACUUM, InnoDB uses a background purge thread to clean up undo log entries that are no longer needed by any active read view.
Snapshot Isolation
Snapshot Isolation (SI) gives each transaction a consistent snapshot of the database at its start time. Transactions can only commit if their write set does not conflict with concurrent committed writes.
SI Rules:
1. Reads see the snapshot (no dirty reads, no non-repeatable reads)
2. First-committer-wins: if T1 and T2 both modify row R,
the first to commit succeeds; the other aborts
SI Anomaly - Write Skew:
T1: reads A=10, B=10; checks A+B >= 10; sets A=0
T2: reads A=10, B=10; checks A+B >= 10; sets B=0
Both commit: A=0, B=0, violating A+B >= 10 invariant
Neither sees the other's write (disjoint write sets)
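The anomaly is easy to reproduce in a toy simulation: both transactions validate the invariant against the same stale snapshot, and first-committer-wins never fires because the write sets are disjoint (the variable names here are illustrative):

```python
# Toy write-skew demonstration under snapshot isolation.
db = {"A": 10, "B": 10}

def run_txn(snapshot: dict, read_keys, write_key, new_value):
    # The invariant check runs against the shared snapshot, not live data.
    assert sum(snapshot[k] for k in read_keys) >= 10
    return {write_key: new_value}          # buffered write set

snap = dict(db)                            # both transactions share one snapshot
w1 = run_txn(snap, ("A", "B"), "A", 0)
w2 = run_txn(snap, ("A", "B"), "B", 0)

# First-committer-wins only aborts on overlapping write sets:
assert not (w1.keys() & w2.keys())         # disjoint, so both commit
db.update(w1); db.update(w2)
print(db, "invariant holds:", db["A"] + db["B"] >= 10)
# {'A': 0, 'B': 0} invariant holds: False
```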
Serializable Snapshot Isolation (SSI)
SSI extends snapshot isolation to detect and prevent serialization anomalies (including write skew) without the performance penalty of traditional serializability.
Detection Mechanism
SSI tracks read-write dependencies between concurrent transactions. If a dangerous structure is detected (a "pivot" transaction with both an incoming and an outgoing rw-dependency, which is the only way a serialization cycle can arise under SI), one transaction is aborted.
Dangerous Structure Detection:
T1 --rw--> T2 --rw--> T3 where each adjacent pair is concurrent
T1 reads X (the version before T2's write)
T2 writes X, reads Y (the version before T3's write)
T3 writes Y
If T3 commits first, a cycle back to T1 is possible
(Fekete et al. showed T3 must be the first of the three to commit
in any non-serializable SI execution)
SSI conservatively aborts one of the transactions, typically the pivot T2
PostgreSQL implementation (since 9.1):
- SIREAD locks track what each transaction has read
- rw-conflict edges detected when a write overlaps a SIREAD lock
- Transactions with two consecutive rw-edges are aborted
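The bookkeeping can be sketched as an edge-tracking structure. This is a toy model of the dangerous-structure check, not PostgreSQL's predicate-lock machinery; the function and variable names are illustrative:

```python
# Track rw-antidependency edges and flag any "pivot" transaction that
# has both an incoming and an outgoing rw-edge.
from collections import defaultdict

rw_in = defaultdict(set)   # transactions with an rw-edge INTO this one
rw_out = defaultdict(set)  # transactions this one has an rw-edge OUT to

def add_rw_edge(reader: str, writer: str) -> set:
    """reader read a version that writer overwrote: reader --rw--> writer.
    Returns the transactions now sitting at the center of a dangerous
    structure (candidates for abort)."""
    rw_out[reader].add(writer)
    rw_in[writer].add(reader)
    return {t for t in (reader, writer) if rw_in[t] and rw_out[t]}

# T1 reads X before T2 overwrites it; T2 reads Y before T3 overwrites it.
print(add_rw_edge("T1", "T2"))  # set(): no pivot yet
print(add_rw_edge("T2", "T3"))  # {'T2'}: T2 is the pivot, abort someone
```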
Write Skew Prevention with SSI
-- PostgreSQL SSI prevents this automatically
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id IN (1, 2);
-- Returns A=10, B=10; app checks sum >= 10
UPDATE accounts SET balance = 0 WHERE id = 1;
COMMIT; -- SSI may abort if concurrent T2 is doing the symmetric update
-- ERROR: could not serialize access due to read/write dependencies
Deterministic Concurrency Control
Deterministic databases (Calvin, BOHM, FaunaDB) agree on a global transaction order before execution. Because every replica applies the same transactions in the same order, there is no distributed commit protocol and no conflict- or deadlock-induced aborts.
Calvin Architecture
Sequencer Layer:
Batch incoming transactions (every 10ms epoch)
Assign deterministic order within batch
Replicate batch to all replicas via Paxos
Scheduler Layer:
Analyze read/write sets of each transaction
Execute non-conflicting transactions in parallel
Conflicting transactions execute in predetermined order
Execution Layer:
Execute transactions according to schedule
No aborts needed (order predetermined)
All replicas reach same state (deterministic)
Trade-offs:
+ No distributed commit protocol (2PC not needed)
+ No aborts due to conflicts
- Must know read/write sets in advance
- Interactive transactions not supported
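The scheduler layer's job can be sketched as grouping a pre-ordered batch into parallel "waves". This is a toy illustration, not Calvin's actual lock-based scheduler, and the conservative conflict rule is an assumption:

```python
# Deterministic scheduling sketch: each transaction runs in the earliest
# wave that preserves its sequencer order relative to every conflicting
# predecessor. Non-conflicting transactions run in parallel.
def conflicts(t1, t2):
    _, r1, w1 = t1
    _, r2, w2 = t2
    # Conservative: any write overlapping the other's read/write footprint.
    return bool(w1 & (r2 | w2)) or bool(w2 & r1)

def schedule_batch(batch):
    """batch: list of (name, read_set, write_set) in sequencer order."""
    waves, placed = [], []          # placed: (txn, wave_index)
    for txn in batch:
        earliest = max((i + 1 for prev, i in placed if conflicts(prev, txn)),
                       default=0)
        while len(waves) <= earliest:
            waves.append([])
        waves[earliest].append(txn[0])
        placed.append((txn, earliest))
    return waves

batch = [
    ("T1", {"x"}, {"x"}),       # read-modify-write on x
    ("T2", {"y"}, {"y"}),       # disjoint: can run alongside T1
    ("T3", {"x", "y"}, {"z"}),  # reads x and y: must follow both
]
print(schedule_batch(batch))    # [['T1', 'T2'], ['T3']]
```

Knowing the read/write sets up front is exactly the requirement the trade-off list mentions: without them, this wave assignment could not be computed before execution.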
Optimistic vs Pessimistic Concurrency
Pessimistic (2PL - Two-Phase Locking):
Acquire locks before accessing data
Hold all locks until commit
Deadlock detection/timeout needed
Best when conflicts are frequent
GROWING phase: acquire locks, never release
SHRINKING phase: release locks, never acquire
(strict 2PL, the common variant, releases all locks only at commit)
Optimistic (OCC):
Read phase: read freely, buffer writes
Validate phase: check for conflicts
Write phase: apply writes if validation passes
Best when conflicts are rare
MVCC sits between:
- Optimistic for reads (no locks)
- Pessimistic for writes (row-level locks or first-committer-wins)
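The OCC validate phase described above can be sketched with backward validation: a committing transaction aborts if anything it read was overwritten by a transaction that committed during its read phase. The class and variable names here are illustrative:

```python
# Minimal OCC backward validation sketch.
class OCCTransaction:
    def __init__(self, start_ts, read_set, write_set):
        self.start_ts = start_ts
        self.read_set = set(read_set)
        self.write_set = set(write_set)

committed_writes = []  # (commit_ts, write_set) of committed transactions

def try_commit(txn, now):
    for commit_ts, writes in committed_writes:
        # Did a concurrent committer write something we read?
        if commit_ts > txn.start_ts and writes & txn.read_set:
            return False               # validation fails: abort and retry
    committed_writes.append((now, txn.write_set))
    return True                        # write phase: apply buffered writes

t1 = OCCTransaction(start_ts=1, read_set={"x"}, write_set={"x"})
t2 = OCCTransaction(start_ts=2, read_set={"x"}, write_set={"y"})
print(try_commit(t1, now=3))  # True: nothing committed concurrently
print(try_commit(t2, now=4))  # False: t1 overwrote x, which t2 read
```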
Isolation Levels Summary
| Level              | Dirty Read | Non-Repeatable Read | Phantom   | Write Skew |
|--------------------|------------|---------------------|-----------|------------|
| Read Uncommitted   | Possible   | Possible            | Possible  | Possible   |
| Read Committed     | No         | Possible            | Possible  | Possible   |
| Repeatable Read    | No         | No                  | Possible* | Possible   |
| Snapshot Isolation | No         | No                  | No        | Possible   |
| Serializable (SSI) | No         | No                  | No        | No         |
*PostgreSQL's "Repeatable Read" is actually Snapshot Isolation, which prevents phantoms.
Applications
- PostgreSQL uses xmin/xmax MVCC with SSI for full serializability
- MySQL/InnoDB uses undo-log MVCC; its "SERIALIZABLE" uses gap locking (not SSI)
- CockroachDB uses MVCC with timestamp ordering and a hybrid logical clock
- Spanner uses TrueTime for globally consistent snapshots across data centers
- FaunaDB uses Calvin-style deterministic concurrency for global transactions