
Distributed Transactions

The Problem

A distributed transaction spans multiple nodes. All nodes must either commit or abort -- partial completion leaves the system in an inconsistent state.

Transfer $100 from Account A (Node 1) to Account B (Node 2):

  Node 1: A -= 100     Node 2: B += 100
     ✓                    ✗ (crash!)

  Result: $100 vanished. Atomicity violated.
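The failure above is easy to reproduce in a few lines. This sketch simulates the Node 2 crash with an exception; the account names and crash flag are illustrative:

```python
# Sketch: why partial completion breaks atomicity. The "crash" between the
# two local writes is simulated by raising an exception mid-transfer.
accounts = {"A": 500, "B": 500}

def transfer_no_atomicity(amount, crash_after_debit=False):
    accounts["A"] -= amount          # local write on Node 1
    if crash_after_debit:
        raise RuntimeError("Node 2 crashed before applying the credit")
    accounts["B"] += amount          # local write on Node 2

try:
    transfer_no_atomicity(100, crash_after_debit=True)
except RuntimeError:
    pass

print(sum(accounts.values()))  # 900 -- $100 vanished
```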

Two-Phase Commit (2PC)

The classical protocol. A coordinator orchestrates participants to ensure atomic commit.

Phase 1: PREPARE (Voting)
  Coordinator                 Participants
       │── PREPARE ──────────> P1: writes to WAL, locks, votes YES
       │── PREPARE ──────────> P2: writes to WAL, locks, votes YES
       │── PREPARE ──────────> P3: not ready, votes NO
       │                        │
       │<── YES ─────────────  P1
       │<── YES ─────────────  P2
       │<── NO ──────────────  P3

Phase 2: COMMIT/ABORT (Decision)
  If ALL voted YES:
       │── COMMIT ───────────> P1: apply, release locks, ACK
       │── COMMIT ───────────> P2: apply, release locks, ACK
       │── ABORT  ───────────> P3: rollback, release locks, ACK
  If ANY voted NO:
       │── ABORT ────────────> ALL: rollback, release locks, ACK

2PC Failure Scenarios

Scenario 1: Participant crashes before voting
  → Coordinator times out, aborts transaction ✓

Scenario 2: Participant crashes after voting YES
  → Must recover and check coordinator for decision ✓

Scenario 3: Coordinator crashes after collecting votes
  → BLOCKING: participants that voted YES are stuck holding locks
  → Cannot safely abort (coordinator might have sent COMMIT to others)
  → Cannot safely commit (coordinator might have decided ABORT)
  → Must wait for coordinator recovery ✗

The blocking problem is 2PC's fundamental weakness. During coordinator failure, resources remain locked indefinitely.

ENUMERATION TxState ← {INIT, PREPARING, PREPARED, COMMITTED, ABORTED}

STRUCTURE TwoPhaseCoordinator
    tx_id: integer
    participants: list of ParticipantHandle
    state: TxState
    wal: WriteAheadLog

ASYNC PROCEDURE EXECUTE(coord) → success or error
    // Phase 1: Prepare
    coord.state ← PREPARING
    AWAIT WAL_LOG(coord.wal, coord.tx_id, "PREPARING")

    votes ← AWAIT ALL IN PARALLEL(
        FOR EACH p IN coord.participants: p.PREPARE(coord.tx_id)
    )

    all_yes ← ALL votes ARE YES

    // Phase 2: Commit or Abort
    IF all_yes THEN
        // Log decision BEFORE sending -- crash recovery point
        AWAIT WAL_LOG(coord.wal, coord.tx_id, "COMMIT")
        coord.state ← COMMITTED

        FOR EACH p IN coord.participants DO
            // Retry until ACK (participant must eventually respond)
            REPEAT
                result ← AWAIT p.COMMIT(coord.tx_id)
            UNTIL result IS success
    ELSE
        AWAIT WAL_LOG(coord.wal, coord.tx_id, "ABORT")
        coord.state ← ABORTED

        FOR EACH p IN coord.participants DO
            AWAIT p.ABORT(coord.tx_id)   // best-effort
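The pseudocode above covers the coordinator side. A matching participant sketch (hypothetical class, with an in-memory list standing in for a durable WAL) shows why a YES vote is binding: once PREPARED is logged, the participant holds its locks until the coordinator's decision arrives.

```python
# Hypothetical 2PC participant. After logging PREPARED it may not decide
# unilaterally: it must wait for (or ask the coordinator for) the outcome.
class Participant:
    def __init__(self):
        self.wal = []          # stand-in for a durable write-ahead log
        self.locks = set()

    def prepare(self, tx_id, keys):
        if any(k in self.locks for k in keys):
            return "NO"                    # cannot lock -> vote NO
        self.locks |= set(keys)
        self.wal.append((tx_id, "PREPARED", tuple(keys)))
        return "YES"                       # binding promise to commit

    def commit(self, tx_id, keys):
        self.wal.append((tx_id, "COMMITTED"))
        self.locks -= set(keys)            # apply, then release locks

    def abort(self, tx_id, keys):
        self.wal.append((tx_id, "ABORTED"))
        self.locks -= set(keys)            # rollback, then release locks

p = Participant()
assert p.prepare(1, ["A"]) == "YES"
assert p.prepare(2, ["A"]) == "NO"   # lock conflict until tx 1 resolves
p.commit(1, ["A"])
```

The lock conflict on the second `prepare` is Scenario 3 in miniature: if the coordinator never returns, tx 2 is blocked behind tx 1's locks indefinitely.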

Three-Phase Commit (3PC)

Adds a PRE-COMMIT phase between PREPARE and COMMIT to avoid blocking. Assumes bounded message delays (synchronous model).

Phase 1: CAN-COMMIT?
  Coordinator ──> Participants: "Can you commit?"
  Participants ──> Coordinator: YES/NO

Phase 2: PRE-COMMIT
  Coordinator ──> Participants: "Prepare to commit"
  Participants ──> Coordinator: ACK
  (At this point, all participants know the decision)

Phase 3: DO-COMMIT
  Coordinator ──> Participants: "Commit now"

Key difference: if coordinator crashes after Phase 2,
  participants know the decision was COMMIT.
  Any participant can act as recovery coordinator.

In practice, 3PC is rarely used because it assumes synchronous communication (bounded delays). In asynchronous networks (the real world), it can violate safety.
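3PC's non-blocking recovery rule can be sketched as a pure function over the states of the surviving participants (a deliberate simplification: it ignores network partitions, which is exactly where the synchrony assumption bites):

```python
# Simplified 3PC recovery rule: a surviving participant inspects peer
# states. If anyone reached PRE_COMMIT (or beyond), the decision must have
# been COMMIT; if no one did, it is safe to abort.
def recover_decision(peer_states):
    if "COMMITTED" in peer_states or "PRE_COMMIT" in peer_states:
        return "COMMIT"
    return "ABORT"

print(recover_decision(["PREPARED", "PRE_COMMIT"]))  # COMMIT
print(recover_decision(["PREPARED", "PREPARED"]))    # ABORT
```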

Saga Pattern

Breaks a long-lived transaction into a sequence of local transactions, each with a compensating action for rollback.

Choreography

Services communicate directly via events. No central coordinator.

Order Service         Payment Service      Inventory Service
     │                       │                     │
     │── OrderCreated ─────> │                     │
     │                       │── PaymentProcessed ─>
     │                       │                     │── InventoryReserved ──>
     │<──────────────────────│<────────────────────│
     │   OrderCompleted      │                     │

On failure (e.g., PaymentFailed):
     │                       │                     │
     │<── PaymentFailed ─────│                     │
     │── OrderCancelled ───> │                     │
     │                       │── RefundIssued ────> │
     │                       │                     │── InventoryReleased
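A choreographed saga is just services reacting to one another's events. A toy in-process event bus (all service and event names illustrative) makes the failure path above concrete:

```python
# Toy event bus: each "service" subscribes to the events it cares about.
handlers = {}
log = []

def on(event):
    def register(fn):
        handlers.setdefault(event, []).append(fn)
        return fn
    return register

def publish(event, **data):
    log.append(event)
    for fn in handlers.get(event, []):
        fn(**data)

@on("OrderCreated")
def process_payment(ok=True):
    publish("PaymentProcessed" if ok else "PaymentFailed", ok=ok)

@on("PaymentProcessed")
def reserve_inventory(ok=True):
    publish("InventoryReserved", ok=ok)

@on("PaymentFailed")
def cancel_order(ok=False):
    publish("OrderCancelled", ok=ok)

publish("OrderCreated", ok=False)   # simulate a failed payment
print(log)  # ['OrderCreated', 'PaymentFailed', 'OrderCancelled']
```

Note that no component sees the whole flow: the order of events emerges from the subscriptions, which is both choreography's appeal and its observability problem.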

Orchestration

A central saga orchestrator coordinates the steps.

                    Saga Orchestrator
                    ┌─────────────┐
                    │ Step 1: Pay │──> Payment Service
                    │ Step 2: Rsv │──> Inventory Service
                    │ Step 3: Ship│──> Shipping Service
                    └─────────────┘
                         │
                    On failure at Step 2:
                         │
                    ┌─────────────┐
                    │ Comp 1: Rfd │──> Payment Service (refund)
                    └─────────────┘

STRUCTURE SagaStep
    name: string
    action: procedure → success or error
    compensation: procedure → success or error

STRUCTURE SagaOrchestrator
    steps: list of SagaStep

PROCEDURE EXECUTE(saga) → success or error
    completed ← empty list

    FOR i ← 0 TO LENGTH(saga.steps) - 1 DO
        step ← saga.steps[i]
        result ← step.action()
        IF result IS success THEN
            APPEND i TO completed
        ELSE
            // Compensate in reverse order
            FOR j ← LENGTH(completed) - 1 DOWNTO 0 DO
                idx ← completed[j]
                comp_result ← saga.steps[idx].compensation()
                IF comp_result IS error THEN
                    // Log for manual resolution
                    LOG_ERROR("Compensation failed for step "
                              + saga.steps[idx].name + ": "
                              + MESSAGE(comp_result))
            RETURN error(result)
    RETURN success
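The orchestrator pseudocode translates almost line-for-line into Python. A minimal sketch (step names hypothetical; compensations should be idempotent, since a crashed orchestrator may re-run them):

```python
# Minimal saga orchestrator: run steps in order; on failure, run the
# completed steps' compensations in reverse order.
class SagaStep:
    def __init__(self, name, action, compensation):
        self.name, self.action, self.compensation = name, action, compensation

def execute(steps):
    completed = []
    for step in steps:
        if step.action():
            completed.append(step)
        else:
            for done in reversed(completed):
                if not done.compensation():
                    # Compensation failed -> flag for manual resolution
                    print(f"compensation failed for {done.name}")
            return False
    return True

trace = []
steps = [
    SagaStep("pay",     lambda: trace.append("pay") or True,
                        lambda: trace.append("refund") or True),
    SagaStep("reserve", lambda: trace.append("reserve-fail") or False,
                        lambda: True),
]
assert execute(steps) is False
print(trace)  # ['pay', 'reserve-fail', 'refund']
```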

Choreography vs Orchestration:

| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose (event-driven) | Tighter (orchestrator knows all steps) |
| Visibility | Hard to track flow | Central point of observation |
| Complexity | Grows with services | Contained in orchestrator |
| Single point of failure | None | Orchestrator (mitigate with replication) |

Google Spanner and TrueTime

Spanner achieves externally consistent (linearizable) distributed transactions using TrueTime, a clock API backed by GPS receivers and atomic clocks.

TrueTime: TT.now() = [earliest, latest]

Commit protocol:
  1. Acquire locks (2PL)
  2. Choose commit timestamp s >= TT.now().latest
  3. Wait until TT.after(s) is true  (commit-wait)
  4. Release locks and apply

  The commit-wait ensures that by the time the transaction
  is visible, its timestamp is definitely in the past.

Timeline:
  ├── acquire locks ──── choose s ──── WAIT ──── commit ──>
                                       ↑
                              uncertainty window (ε)
                              typically 1-7ms

This is why Spanner invests heavily in GPS receivers and atomic clocks in every data center -- smaller uncertainty means shorter wait times.
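The commit-wait loop can be simulated with an artificial uncertainty bound; `tt_now` and `EPSILON` below are stand-ins for the real TrueTime API, not its actual interface:

```python
import time

EPSILON = 0.005  # 5 ms stand-in for TrueTime's uncertainty bound

def tt_now():
    """Return (earliest, latest) around the local clock, like TT.now()."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit_with_wait():
    s = tt_now()[1]                 # step 2: s >= TT.now().latest
    while tt_now()[0] <= s:         # step 3: wait until TT.after(s)
        time.sleep(0.0005)          # commit-wait (~2 * EPSILON here)
    return s                        # step 4: safe to release locks

s = commit_with_wait()
assert tt_now()[0] > s              # s is now definitely in the past
```

The wait is roughly twice the uncertainty bound, which is why shrinking ε with better clocks directly shortens transaction latency.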

Calvin: Deterministic Database

Eliminates nondeterminism from distributed transactions. All replicas execute the same transactions in the same pre-determined order.

Traditional:                    Calvin:
  Client → Lock → Execute       Client → Sequencer → Execute
  (order depends on lock         (order fixed BEFORE execution)
   acquisition timing)

Calvin Architecture:
  ┌───────────┐     ┌───────────┐     ┌───────────┐
  │ Sequencer │────>│ Scheduler │────>│ Executor  │
  │ (global   │     │ (local    │     │ (local    │
  │  order)   │     │  locks)   │     │  execute) │
  └───────────┘     └───────────┘     └───────────┘

  Sequencer batches txns, assigns global order (via Paxos)
  All replicas execute same batch in same order
  Deterministic → no 2PC needed for replication!

Trade-off: Must know read/write sets upfront. Reconnaissance queries needed for dependent reads. High abort rate under contention with dependent transactions.

Deterministic Databases

Extend Calvin's idea. If execution is deterministic and all replicas agree on input order, they produce the same output without cross-replica coordination for each transaction.

Input:  [Tx1, Tx2, Tx3]  (globally ordered)
         │     │     │
Replica A: execute in order → State A
Replica B: execute in order → State B

State A == State B  (guaranteed by determinism)

Systems: Calvin, FaunaDB.
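Determinism is easy to demonstrate: two replicas applying the same globally ordered batch reach identical state with no per-transaction coordination. A sketch (real systems additionally handle multi-partition transactions and dependent reads):

```python
# Two replicas, one globally ordered input log, no coordination per txn.
def apply_batch(state, txns):
    for op, key, amount in txns:   # deterministic: no clocks, no races
        delta = amount if op == "credit" else -amount
        state[key] = state.get(key, 0) + delta
    return state

ordered_log = [("credit", "A", 100), ("debit", "A", 30), ("credit", "B", 50)]

replica_a = apply_batch({}, ordered_log)
replica_b = apply_batch({}, ordered_log)
assert replica_a == replica_b == {"A": 70, "B": 50}
```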

Comparison

| Approach | Consistency | Latency | Availability | Complexity |
|---|---|---|---|---|
| 2PC | Atomic | High (2 RTT + locks) | Blocks on coordinator failure | Low |
| 3PC | Atomic | Higher (3 RTT) | Non-blocking (sync model) | Medium |
| Saga | Eventual (compensating) | Lower per step | High | Medium-High |
| Spanner | External consistency | Medium (commit-wait ~7ms) | Majority quorum | High |
| Calvin | Serializable | Medium (batch + execute) | Majority quorum | Medium |

Choosing a Strategy

Need strong atomicity across databases?
  ├── Yes: Can you tolerate blocking?
  │    ├── Yes: 2PC (simplest, widely supported)
  │    └── No:  Spanner-style (requires accurate clocks)
  └── No: Saga pattern
       ├── Few services, clear flow → Choreography
       └── Many services, complex flow → Orchestration

Real-World Systems

| System | Approach | Notes |
|---|---|---|
| PostgreSQL (FDW) | 2PC | Foreign data wrapper transactions |
| Google Spanner | 2PC + TrueTime | External consistency |
| CockroachDB | 2PC + HLC | Serializable, no atomic clocks needed |
| Temporal / Cadence | Saga orchestration | Workflow engine for sagas |
| Amazon Step Functions | Saga orchestration | AWS-native workflow |
| FaunaDB | Calvin-inspired | Deterministic transactions |

Key Takeaways

  • 2PC is the workhorse of distributed transactions but blocks when the coordinator fails. Use it when transactions are short and coordinator failures are rare.
  • Sagas trade atomicity for availability and are the standard pattern in microservices. Design compensating actions carefully -- they must be idempotent.
  • Spanner's TrueTime is an engineering marvel that turns clock synchronization into a consistency mechanism. CockroachDB approximates this with HLC.
  • Deterministic databases (Calvin) avoid 2PC entirely by fixing execution order upfront -- a fundamentally different approach worth understanding.