MVCC and Concurrency Control
Overview
Multi-Version Concurrency Control (MVCC) allows readers and writers to operate concurrently without blocking each other. Instead of locking rows, the database maintains multiple versions of each row, and each transaction sees a consistent snapshot. MVCC is the foundation of concurrency in PostgreSQL, MySQL/InnoDB, Oracle, and most modern databases.
MVCC Fundamentals
Core Principle
Traditional Locking:
Reader blocks Writer, Writer blocks Reader -> low concurrency
MVCC:
Writer creates new version, Reader sees old version
Readers never block Writers, Writers never block Readers
Writers only block other Writers on the same row
Timestamp Ordering
Each transaction receives a timestamp (or transaction ID). A row version is visible to a transaction if it was created by a committed transaction with a lower timestamp and not yet deleted (or deleted by a transaction with a higher timestamp).
Transaction T5 reads row R:
Version history of R:
v1: created by T1 (committed), deleted by T3 (committed)
v2: created by T3 (committed), not deleted
v3: created by T7 (not yet committed)
T5 sees v2:
- v1 deleted by T3 < T5, so invisible
- v2 created by T3 < T5, not deleted, so VISIBLE
- v3 created by T7 > T5, so invisible
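The visibility rule above can be sketched in Python. This is a simplified model, not any specific engine's code; the Version fields, the committed set, and the helper names are illustrative:

```python
# Simplified MVCC visibility: a version is visible to a reader if its
# creator committed with a lower ID, and any deleter either hasn't
# committed or has a higher ID than the reader.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    created_by: int            # transaction ID that created this version
    deleted_by: Optional[int]  # transaction ID that deleted it, or None
    data: str

def visible(v: Version, reader: int, committed: set) -> bool:
    # Creator must be committed and ordered before the reader.
    if v.created_by >= reader or v.created_by not in committed:
        return False
    # A deletion hides the version only if the deleter is committed
    # and ordered before the reader.
    if v.deleted_by is not None and v.deleted_by < reader and v.deleted_by in committed:
        return False
    return True

# The walkthrough above: T5 reads R; T1 and T3 are committed, T7 is not.
history = [
    Version(1, 3, "v1"),      # created by T1, deleted by T3
    Version(3, None, "v2"),   # created by T3, still live
    Version(7, None, "v3"),   # created by uncommitted T7
]
seen = [v.data for v in history if visible(v, reader=5, committed={1, 3})]
print(seen)  # ['v2']
```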
PostgreSQL MVCC: xmin/xmax
PostgreSQL stores version visibility information directly in each tuple (row) using system columns.
Tuple Header Fields
Every tuple contains hidden columns:
xmin - transaction ID that created this tuple version
xmax - transaction ID that deleted/updated this tuple (0 if alive)
cmin - command counter within creating transaction
cmax - command counter within deleting transaction
ctid - physical location (page, offset) of current/next version
Visibility check for transaction T reading a tuple:
1. xmin must be committed as of T's snapshot AND xmin < T's snapshot
2. xmax must be 0 (not deleted)
   OR xmax not committed as of T's snapshot
   OR xmax >= T's snapshot
Example Walkthrough
-- Initial state: T100 inserts a row
INSERT INTO accounts(id, balance) VALUES (1, 1000);
-- Tuple: xmin=100, xmax=0, data=(1, 1000)
-- T200 updates the row
UPDATE accounts SET balance = 900 WHERE id = 1;
-- Old tuple: xmin=100, xmax=200, data=(1, 1000)
-- New tuple: xmin=200, xmax=0, data=(1, 900)
-- T150 (snapshot between T100 and T200) reads:
-- Sees old tuple: xmin=100 < 150 (creation visible); xmax=200 > 150 (deletion not visible, tuple still live)
-- Result: balance = 1000
-- T250 reads:
-- Old tuple: xmax=200 < 250 (deleted, invisible)
-- New tuple: xmin=200 < 250 (visible)
-- Result: balance = 900
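The walkthrough can be condensed into a single visibility predicate. This is a hedged sketch of the xmin/xmax rules, ignoring the snapshot's in-progress list, command IDs, and hint bits that real PostgreSQL also consults:

```python
# Simplified xmin/xmax tuple visibility for a transaction whose snapshot
# was taken at snapshot_xid, given the set of committed transaction IDs.
def tuple_visible(xmin: int, xmax: int, snapshot_xid: int,
                  committed: set) -> bool:
    # Rule 1: creator committed and before our snapshot.
    if xmin not in committed or xmin >= snapshot_xid:
        return False
    # Rule 2: no deletion, or the deletion is not visible to us.
    if xmax == 0:
        return True
    return xmax not in committed or xmax >= snapshot_xid

committed = {100, 200}
old = dict(xmin=100, xmax=200)   # (1, 1000) before the UPDATE
new = dict(xmin=200, xmax=0)     # (1, 900) after the UPDATE

# T150 sees the old tuple; T250 sees the new one.
print(tuple_visible(**old, snapshot_xid=150, committed=committed))  # True
print(tuple_visible(**new, snapshot_xid=150, committed=committed))  # False
print(tuple_visible(**old, snapshot_xid=250, committed=committed))  # False
print(tuple_visible(**new, snapshot_xid=250, committed=committed))  # True
```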
VACUUM
Dead tuples (old versions no longer visible to any transaction) must be reclaimed. PostgreSQL's VACUUM process removes them.
Before VACUUM:
Page: [live_tuple_1] [dead_tuple] [live_tuple_2] [dead_tuple]
After VACUUM:
Page: [live_tuple_1] [free_space] [live_tuple_2] [free_space]
Free space map updated, dead tuple space reusable
autovacuum: background process triggered by modification thresholds
threshold = autovacuum_vacuum_threshold +
            autovacuum_vacuum_scale_factor * reltuples (estimated row count)
A vacuum is triggered when the dead-tuple count exceeds this threshold.
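As a quick illustration, here is the trigger check with PostgreSQL's default settings (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2); `reltuples` stands in for the planner's estimated row count:

```python
# Autovacuum fires when dead tuples exceed a fixed threshold plus a
# fraction of the table's estimated row count.
def needs_autovacuum(dead_tuples: int, reltuples: int,
                     threshold: int = 50, scale_factor: float = 0.2) -> bool:
    return dead_tuples > threshold + scale_factor * reltuples

# For a 100,000-row table the trigger point is 50 + 0.2 * 100000 = 20050.
print(needs_autovacuum(dead_tuples=10_000, reltuples=100_000))  # False
print(needs_autovacuum(dead_tuples=25_000, reltuples=100_000))  # True
```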
MySQL/InnoDB MVCC: Undo Log
InnoDB maintains only the latest version of each row on the data page. Previous versions are reconstructed on-demand from the undo log.
Undo Log Architecture
Clustered Index Page (latest version):
Row: [trx_id=200, roll_ptr -> undo_log] (id=1, balance=900)
Undo Log:
Entry: [trx_id=100, data=(id=1, balance=1000), prev_ptr=NULL]
To reconstruct version for trx_id=150:
1. Read current row: trx_id=200 > 150, too new
2. Follow roll_ptr to undo log
3. Apply undo record: trx_id=100 < 150, visible
4. Return reconstructed row: (id=1, balance=1000)
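The reconstruction walk can be sketched in Python. Assumptions are simplified: each undo record holds a full prior row image rather than a diff, and trx IDs double as visibility timestamps:

```python
# Walk an InnoDB-style undo chain until we reach a version old enough
# for the reader to see.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    trx_id: int                       # transaction that wrote this version
    data: dict
    roll_ptr: Optional["RowVersion"]  # previous version in the undo log

def read_as_of(row: RowVersion, reader_trx_id: int) -> Optional[dict]:
    v = row
    while v is not None and v.trx_id > reader_trx_id:
        v = v.roll_ptr                # too new: follow the undo pointer
    return v.data if v else None

undo = RowVersion(trx_id=100, data={"id": 1, "balance": 1000}, roll_ptr=None)
current = RowVersion(trx_id=200, data={"id": 1, "balance": 900}, roll_ptr=undo)

print(read_as_of(current, reader_trx_id=150))  # {'id': 1, 'balance': 1000}
print(read_as_of(current, reader_trx_id=250))  # {'id': 1, 'balance': 900}
```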
Read Views
InnoDB Read View contains:
- m_low_limit_id: highest trx_id at view creation + 1
- m_up_limit_id: lowest active trx_id at view creation
- m_ids: list of active (uncommitted) transaction IDs
Visibility: row with trx_id T is visible if:
T < m_up_limit_id -> definitely committed
T >= m_low_limit_id -> definitely not visible
m_up_limit_id <= T < m_low_limit_id:
T NOT IN m_ids -> committed, visible
T IN m_ids -> uncommitted, not visible
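The three-way test above maps directly onto code. A hedged Python sketch following the field names listed (this mirrors the logic of InnoDB's `ReadView::changes_visible`, not its actual implementation):

```python
# InnoDB-style Read View visibility check for a row stamped with trx_id.
def rv_visible(trx_id: int, m_up_limit_id: int, m_low_limit_id: int,
               m_ids: set) -> bool:
    if trx_id < m_up_limit_id:
        return True             # committed before any active transaction
    if trx_id >= m_low_limit_id:
        return False            # started after the view was created
    return trx_id not in m_ids  # in-between: visible only if committed

# View created while trx 150 and 170 were active; next trx id was 201.
view = dict(m_up_limit_id=150, m_low_limit_id=201, m_ids={150, 170})
print(rv_visible(100, **view))  # True  (old, definitely committed)
print(rv_visible(170, **view))  # False (still active at view creation)
print(rv_visible(160, **view))  # True  (committed before view creation)
print(rv_visible(300, **view))  # False (too new)
```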
Purge Thread
Unlike PostgreSQL's VACUUM, InnoDB uses a background purge thread to clean up undo log entries that are no longer needed by any active read view.
Snapshot Isolation
Snapshot Isolation (SI) gives each transaction a consistent snapshot of the database at its start time. Transactions can only commit if their write set does not conflict with concurrent committed writes.
SI Rules:
1. Reads see the snapshot (no dirty reads, no non-repeatable reads)
2. First-committer-wins: if T1 and T2 both modify row R,
the first to commit succeeds; the other aborts
SI Anomaly - Write Skew:
T1: reads A=10, B=10; checks A+B >= 10; sets A=0
T2: reads A=10, B=10; checks A+B >= 10; sets B=0
Both commit: A=0, B=0, violating A+B >= 10 invariant
Neither sees the other's write (disjoint write sets)
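The anomaly is easy to reproduce in a toy simulation: both transactions validate the invariant against the same stale snapshot, and first-committer-wins never fires because the write sets are disjoint (the variable names here are illustrative):

```python
# Toy write-skew demonstration under snapshot isolation.
db = {"A": 10, "B": 10}

def run_txn(snapshot: dict, read_keys, write_key, new_value):
    # The invariant check runs against the shared snapshot, not live data.
    assert sum(snapshot[k] for k in read_keys) >= 10
    return {write_key: new_value}          # buffered write set

snap = dict(db)                            # both transactions share one snapshot
w1 = run_txn(snap, ("A", "B"), "A", 0)
w2 = run_txn(snap, ("A", "B"), "B", 0)

# First-committer-wins only aborts on overlapping write sets:
assert not (w1.keys() & w2.keys())         # disjoint, so both commit
db.update(w1); db.update(w2)
print(db, "invariant holds:", db["A"] + db["B"] >= 10)
# {'A': 0, 'B': 0} invariant holds: False
```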
Serializable Snapshot Isolation (SSI)
SSI extends snapshot isolation to detect and prevent serialization anomalies (including write skew) without the performance penalty of traditional serializability.
Detection Mechanism
SSI tracks read-write dependencies between concurrent transactions. If a dangerous structure is detected (a "pivot" transaction with both an incoming and an outgoing rw-dependency, which is the only way a serialization cycle can arise under SI), one transaction is aborted.
Dangerous Structure Detection:
T1 --rw--> T2 --rw--> T3 where each adjacent pair is concurrent
T1 reads X (the version before T2's write)
T2 writes X, reads Y (the version before T3's write)
T3 writes Y
If T3 commits first, a cycle back to T1 is possible
(Fekete et al. showed T3 must be the first of the three to commit
in any non-serializable SI execution)
SSI conservatively aborts one of the transactions, typically the pivot T2
PostgreSQL implementation (since 9.1):
- SIREAD locks track what each transaction has read
- rw-conflict edges detected when a write overlaps a SIREAD lock
- Transactions with two consecutive rw-edges are aborted
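The bookkeeping can be sketched as an edge-tracking structure. This is a toy model of the dangerous-structure check, not PostgreSQL's predicate-lock machinery; the function and variable names are illustrative:

```python
# Track rw-antidependency edges and flag any "pivot" transaction that
# has both an incoming and an outgoing rw-edge.
from collections import defaultdict

rw_in = defaultdict(set)   # transactions with an rw-edge INTO this one
rw_out = defaultdict(set)  # transactions this one has an rw-edge OUT to

def add_rw_edge(reader: str, writer: str) -> set:
    """reader read a version that writer overwrote: reader --rw--> writer.
    Returns the transactions now sitting at the center of a dangerous
    structure (candidates for abort)."""
    rw_out[reader].add(writer)
    rw_in[writer].add(reader)
    return {t for t in (reader, writer) if rw_in[t] and rw_out[t]}

# T1 reads X before T2 overwrites it; T2 reads Y before T3 overwrites it.
print(add_rw_edge("T1", "T2"))  # set(): no pivot yet
print(add_rw_edge("T2", "T3"))  # {'T2'}: T2 is the pivot, abort someone
```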
Write Skew Prevention with SSI
-- PostgreSQL SSI prevents this automatically
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id IN (1, 2);
-- Returns A=10, B=10; app checks sum >= 10
UPDATE accounts SET balance = 0 WHERE id = 1;
COMMIT; -- SSI may abort if concurrent T2 is doing the symmetric update
-- ERROR: could not serialize access due to read/write dependencies
Deterministic Concurrency Control
Deterministic databases (Calvin, BOHM, FaunaDB) agree on a global transaction order before execution. Because every replica applies the same transactions in the same order, there is no distributed commit protocol and no conflict- or deadlock-induced aborts.
Calvin Architecture
Sequencer Layer:
Batch incoming transactions (every 10ms epoch)
Assign deterministic order within batch
Replicate batch to all replicas via Paxos
Scheduler Layer:
Analyze read/write sets of each transaction
Execute non-conflicting transactions in parallel
Conflicting transactions execute in predetermined order
Execution Layer:
Execute transactions according to schedule
No aborts needed (order predetermined)
All replicas reach same state (deterministic)
Trade-offs:
+ No distributed commit protocol (2PC not needed)
+ No aborts due to conflicts
- Must know read/write sets in advance
- Interactive transactions not supported
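The scheduler layer's job can be sketched as grouping a pre-ordered batch into parallel "waves". This is a toy illustration, not Calvin's actual lock-based scheduler, and the conservative conflict rule is an assumption:

```python
# Deterministic scheduling sketch: each transaction runs in the earliest
# wave that preserves its sequencer order relative to every conflicting
# predecessor. Non-conflicting transactions run in parallel.
def conflicts(t1, t2):
    _, r1, w1 = t1
    _, r2, w2 = t2
    # Conservative: any write overlapping the other's read/write footprint.
    return bool(w1 & (r2 | w2)) or bool(w2 & r1)

def schedule_batch(batch):
    """batch: list of (name, read_set, write_set) in sequencer order."""
    waves, placed = [], []          # placed: (txn, wave_index)
    for txn in batch:
        earliest = max((i + 1 for prev, i in placed if conflicts(prev, txn)),
                       default=0)
        while len(waves) <= earliest:
            waves.append([])
        waves[earliest].append(txn[0])
        placed.append((txn, earliest))
    return waves

batch = [
    ("T1", {"x"}, {"x"}),       # read-modify-write on x
    ("T2", {"y"}, {"y"}),       # disjoint: can run alongside T1
    ("T3", {"x", "y"}, {"z"}),  # reads x and y: must follow both
]
print(schedule_batch(batch))    # [['T1', 'T2'], ['T3']]
```

Knowing the read/write sets up front is exactly the requirement the trade-off list mentions: without them, this wave assignment could not be computed before execution.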
Optimistic vs Pessimistic Concurrency
Pessimistic (2PL - Two-Phase Locking):
Acquire locks before accessing data
Hold all locks until commit
Deadlock detection/timeout needed
Best when conflicts are frequent
GROWING phase: acquire locks, never release
SHRINKING phase: release locks, never acquire
(strict 2PL, the common variant, releases all locks only at commit)
Optimistic (OCC):
Read phase: read freely, buffer writes
Validate phase: check for conflicts
Write phase: apply writes if validation passes
Best when conflicts are rare
MVCC sits between:
- Optimistic for reads (no locks)
- Pessimistic for writes (row-level locks or first-committer-wins)
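The OCC validate phase described above can be sketched with backward validation: a committing transaction aborts if anything it read was overwritten by a transaction that committed during its read phase. The class and variable names here are illustrative:

```python
# Minimal OCC backward validation sketch.
class OCCTransaction:
    def __init__(self, start_ts, read_set, write_set):
        self.start_ts = start_ts
        self.read_set = set(read_set)
        self.write_set = set(write_set)

committed_writes = []  # (commit_ts, write_set) of committed transactions

def try_commit(txn, now):
    for commit_ts, writes in committed_writes:
        # Did a concurrent committer write something we read?
        if commit_ts > txn.start_ts and writes & txn.read_set:
            return False               # validation fails: abort and retry
    committed_writes.append((now, txn.write_set))
    return True                        # write phase: apply buffered writes

t1 = OCCTransaction(start_ts=1, read_set={"x"}, write_set={"x"})
t2 = OCCTransaction(start_ts=2, read_set={"x"}, write_set={"y"})
print(try_commit(t1, now=3))  # True: nothing committed concurrently
print(try_commit(t2, now=4))  # False: t1 overwrote x, which t2 read
```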
Isolation Levels Summary
| Level              | Dirty Read | Non-Repeatable Read | Phantom   | Write Skew |
|--------------------|------------|---------------------|-----------|------------|
| Read Uncommitted   | Possible   | Possible            | Possible  | Possible   |
| Read Committed     | No         | Possible            | Possible  | Possible   |
| Repeatable Read    | No         | No                  | Possible* | Possible   |
| Snapshot Isolation | No         | No                  | No        | Possible   |
| Serializable (SSI) | No         | No                  | No        | No         |
*PostgreSQL's "Repeatable Read" is actually Snapshot Isolation, which prevents phantoms.
Applications
- PostgreSQL uses xmin/xmax MVCC with SSI for full serializability
- MySQL/InnoDB uses undo-log MVCC; its "SERIALIZABLE" uses gap locking (not SSI)
- CockroachDB uses MVCC with timestamp ordering and a hybrid logical clock
- Spanner uses TrueTime for globally consistent snapshots across data centers
- FaunaDB uses Calvin-style deterministic concurrency for global transactions