Distributed Transactions
The Problem
A distributed transaction spans multiple nodes. All nodes must either commit or abort -- partial completion leaves the system in an inconsistent state.
Transfer $100 from Account A (Node 1) to Account B (Node 2):
Node 1: A -= 100   ✓
Node 2: B += 100   ✗ (crash!)
Result: $100 vanished. Atomicity violated.
Two-Phase Commit (2PC)
The classical protocol. A coordinator orchestrates participants to ensure atomic commit.
Phase 1: PREPARE (Voting)
Coordinator Participants
│── PREPARE ──────────> P1: writes to WAL, locks, votes YES
│── PREPARE ──────────> P2: writes to WAL, locks, votes YES
│── PREPARE ──────────> P3: not ready, votes NO
│ │
│<── YES ───────────── P1
│<── YES ───────────── P2
│<── NO ────────────── P3
Phase 2: COMMIT/ABORT (Decision)
If ALL voted YES:
│── COMMIT ───────────> P1: apply, release locks, ACK
│── COMMIT ───────────> P2: apply, release locks, ACK
│── COMMIT ───────────> P3: apply, release locks, ACK
If ANY voted NO (as P3 did above):
│── ABORT ────────────> ALL: rollback, release locks, ACK
2PC Failure Scenarios
Scenario 1: Participant crashes before voting
→ Coordinator times out, aborts transaction ✓
Scenario 2: Participant crashes after voting YES
→ Must recover and check coordinator for decision ✓
Scenario 3: Coordinator crashes after collecting votes
→ BLOCKING: participants that voted YES are stuck holding locks
→ Cannot safely abort (coordinator might have sent COMMIT to others)
→ Cannot safely commit (coordinator might have decided ABORT)
→ Must wait for coordinator recovery ✗
The blocking problem is 2PC's fundamental weakness. During coordinator failure, resources remain locked indefinitely.
ENUMERATION TxState ← {INIT, PREPARING, PREPARED, COMMITTED, ABORTED}

STRUCTURE TwoPhaseCoordinator
    tx_id: integer
    participants: list of ParticipantHandle
    state: TxState
    wal: WriteAheadLog

ASYNC PROCEDURE EXECUTE(coord) → success or error
    // Phase 1: Prepare
    coord.state ← PREPARING
    AWAIT WAL_LOG(coord.wal, coord.tx_id, "PREPARING")
    votes ← AWAIT ALL IN PARALLEL(
        FOR EACH p IN coord.participants: p.PREPARE(coord.tx_id)
    )
    all_yes ← ALL votes ARE OK(YES)

    // Phase 2: Commit or Abort
    IF all_yes THEN
        // Log decision BEFORE sending -- crash recovery point
        AWAIT WAL_LOG(coord.wal, coord.tx_id, "COMMIT")
        coord.state ← COMMITTED
        FOR EACH p IN coord.participants DO
            // Retry until ACK (participant must eventually respond)
            REPEAT
                result ← AWAIT p.COMMIT(coord.tx_id)
            UNTIL result IS success
    ELSE
        AWAIT WAL_LOG(coord.wal, coord.tx_id, "ABORT")
        coord.state ← ABORTED
        FOR EACH p IN coord.participants DO
            AWAIT p.ABORT(coord.tx_id)  // best-effort
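The pseudocode above can be sketched as runnable Python. This is a toy, single-process model: `Participant`, its `ready` flag, and the list-based WALs are illustrative stand-ins for real RPC endpoints and durable logs.

```python
from enum import Enum, auto

class TxState(Enum):
    INIT = auto()
    PREPARING = auto()
    PREPARED = auto()
    COMMITTED = auto()
    ABORTED = auto()

class Participant:
    """Toy in-memory participant; `ready` controls its PREPARE vote."""
    def __init__(self, name, ready=True):
        self.name = name
        self.ready = ready
        self.log = []                 # stands in for the participant's WAL

    def prepare(self, tx_id):
        if self.ready:
            self.log.append(("PREPARED", tx_id))
            return True               # vote YES
        return False                  # vote NO

    def commit(self, tx_id):
        self.log.append(("COMMIT", tx_id))
        return True                   # ACK

    def abort(self, tx_id):
        self.log.append(("ABORT", tx_id))
        return True

class TwoPhaseCoordinator:
    def __init__(self, tx_id, participants):
        self.tx_id = tx_id
        self.participants = participants
        self.state = TxState.INIT
        self.wal = []                 # coordinator's WAL, as a list

    def execute(self):
        # Phase 1: ask every participant to prepare (vote).
        self.state = TxState.PREPARING
        self.wal.append((self.tx_id, "PREPARING"))
        votes = [p.prepare(self.tx_id) for p in self.participants]

        # Phase 2: log the decision BEFORE telling anyone -- recovery point.
        if all(votes):
            self.wal.append((self.tx_id, "COMMIT"))
            self.state = TxState.COMMITTED
            for p in self.participants:
                while not p.commit(self.tx_id):   # retry until ACK
                    pass
            return True
        else:
            self.wal.append((self.tx_id, "ABORT"))
            self.state = TxState.ABORTED
            for p in self.participants:
                p.abort(self.tx_id)               # best-effort
            return False
```

A participant that crashed after voting YES would, on restart, replay its WAL and consult the coordinator's WAL for the decision; that recovery path is omitted here.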
Three-Phase Commit (3PC)
Adds a PRE-COMMIT phase between PREPARE and COMMIT to avoid blocking. Assumes bounded message delays (synchronous model).
Phase 1: CAN-COMMIT?
Coordinator ──> Participants: "Can you commit?"
Participants ──> Coordinator: YES/NO
Phase 2: PRE-COMMIT
Coordinator ──> Participants: "Prepare to commit"
Participants ──> Coordinator: ACK
(At this point, all participants know the decision)
Phase 3: DO-COMMIT
Coordinator ──> Participants: "Commit now"
Key difference: if coordinator crashes after Phase 2,
participants know the decision was COMMIT.
Any participant can act as recovery coordinator.
In practice, 3PC is rarely used because it assumes synchronous communication (bounded delays). In asynchronous networks (the real world), it can violate safety.
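The non-blocking recovery rule can be illustrated with a small helper. This is a sketch under 3PC's synchronous assumption, and the state names are illustrative rather than from any real implementation: if any surviving participant reached PRE-COMMIT, the coordinator must have collected all YES votes, so the decision was COMMIT; if none did, no participant can have committed yet, so it is safe to abort.

```python
def recover_decision(participant_states):
    """Decide on behalf of a crashed 3PC coordinator.

    participant_states: list of strings from
    {"CAN-COMMIT", "PRE-COMMIT", "COMMITTED"} reported by survivors.
    """
    # PRE-COMMIT is only ever sent after a unanimous YES vote,
    # so seeing it anywhere proves the decision was COMMIT.
    if any(s in ("PRE-COMMIT", "COMMITTED") for s in participant_states):
        return "COMMIT"
    # No one reached PRE-COMMIT, hence no one committed: abort is safe.
    return "ABORT"
```

In an asynchronous network this rule breaks: a slow (not crashed) participant may hold a PRE-COMMIT the survivors never saw, which is exactly why 3PC's safety depends on bounded delays.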
Saga Pattern
Breaks a long-lived transaction into a sequence of local transactions, each with a compensating action for rollback.
Choreography
Services communicate directly via events. No central coordinator.
Order Service Payment Service Inventory Service
│ │ │
│── OrderCreated ─────> │ │
│ │── PaymentProcessed ─>
│ │ │── InventoryReserved ──>
│<──────────────────────│<────────────────────│
│ OrderCompleted │ │
On failure (e.g., PaymentFailed):
│ │ │
│<── PaymentFailed ─────│ │
│── OrderCancelled ───> │ │
│ │── RefundIssued ────> │
│ │ │── InventoryReleased
Orchestration
A central saga orchestrator coordinates the steps.
Saga Orchestrator
┌────────────┐
│ Step 1: Pay │──> Payment Service
│ Step 2: Rsv │──> Inventory Service
│ Step 3: Ship│──> Shipping Service
└────────────┘
│
On failure at Step 2:
│
┌────────────┐
│ Comp 1: Rfd│──> Payment Service (refund)
└────────────┘
STRUCTURE SagaStep
    name: string
    action: procedure → success or error
    compensation: procedure → success or error

STRUCTURE SagaOrchestrator
    steps: list of SagaStep

PROCEDURE EXECUTE(saga) → success or error
    completed ← empty list
    FOR i ← 0 TO LENGTH(saga.steps) - 1 DO
        step ← saga.steps[i]
        result ← step.action()
        IF result IS success THEN
            APPEND i TO completed
        ELSE
            // Compensate in reverse order
            FOR j ← LENGTH(completed) - 1 DOWNTO 0 DO
                idx ← completed[j]
                comp_result ← saga.steps[idx].compensation()
                IF comp_result IS error THEN
                    // Log for manual resolution
                    LOG_ERROR("Compensation failed for step "
                              + saga.steps[idx].name)
            RETURN error(result)
    RETURN success
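A minimal Python version of the orchestrator above, assuming each action and compensation is a callable returning True on success (the names and the trace-based demo below are illustrative):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SagaStep:
    name: str
    action: Callable[[], bool]        # returns True on success
    compensation: Callable[[], bool]

@dataclass
class SagaOrchestrator:
    steps: List[SagaStep]

    def execute(self) -> bool:
        completed = []
        for i, step in enumerate(self.steps):
            if step.action():
                completed.append(i)
            else:
                # Undo already-completed steps in reverse order.
                for idx in reversed(completed):
                    if not self.steps[idx].compensation():
                        # Stand-in for LOG_ERROR: surface for manual resolution.
                        print(f"compensation failed: {self.steps[idx].name}")
                return False
        return True

# Demo: record the call order; the "reserve" step fails.
trace = []
def ok(name):   return lambda: (trace.append(name), True)[1]
def fail(name): return lambda: (trace.append(name), False)[1]

saga = SagaOrchestrator([
    SagaStep("pay", ok("pay"), ok("refund")),
    SagaStep("reserve", fail("reserve"), ok("release")),
])
```

Running `saga.execute()` attempts "pay", fails at "reserve", then runs the "refund" compensation -- the reverse-order unwind from the pseudocode.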
Choreography vs Orchestration:
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose (event-driven) | Tighter (orchestrator knows all steps) |
| Visibility | Hard to track flow | Central point of observation |
| Complexity | Grows with services | Contained in orchestrator |
| Single point of failure | None | Orchestrator (mitigate with replication) |
Google Spanner and TrueTime
Spanner achieves externally consistent (linearizable) distributed transactions using GPS/atomic clock-synchronized TrueTime.
TrueTime: TT.now() = [earliest, latest]
Commit protocol:
1. Acquire locks (2PL)
2. Choose commit timestamp s >= TT.now().latest
3. Wait until TT.after(s) is true (commit-wait)
4. Release locks and apply
The commit-wait ensures that by the time the transaction
is visible, its timestamp is definitely in the past.
Timeline:
├── acquire locks ──── choose s ──── WAIT ──── commit ──>
↑
uncertainty window (ε)
typically 1-7ms
This is why Spanner invests heavily in GPS receivers and atomic clocks in every data center -- smaller uncertainty means shorter wait times.
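Commit-wait can be sketched in Python with the wall clock standing in for TrueTime. EPSILON is an assumed uncertainty bound; real TrueTime derives the interval from GPS/atomic-clock infrastructure rather than widening a local clock:

```python
import time

EPSILON = 0.007  # assumed clock uncertainty bound (7 ms)

def tt_now():
    """Toy TrueTime: returns an (earliest, latest) interval that is
    guaranteed to contain the true time, here by widening time.time()."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(s):
    """True once timestamp s is definitely in the past."""
    earliest, _ = tt_now()
    return earliest > s

def commit_wait():
    """Steps 2-3 of the commit protocol: pick s >= TT.now().latest,
    then block until TT.after(s) holds before making the commit visible."""
    _, latest = tt_now()
    s = latest
    while not tt_after(s):
        time.sleep(0.001)
    return s
```

With this toy clock the wait is roughly 2 * EPSILON, which shows directly why shrinking the uncertainty window shortens commit latency.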
Calvin: Deterministic Database
Eliminates nondeterminism from distributed transactions. All replicas execute the same transactions in the same pre-determined order.
Traditional: Client → Lock → Execute      (order depends on lock acquisition timing)
Calvin:      Client → Sequencer → Execute (order fixed BEFORE execution)
Calvin Architecture:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Sequencer│────>│ Scheduler│────>│ Executor │
│ (global │ │ (local │ │ (local │
│ order) │ │ locks) │ │ execute) │
└──────────┘ └──────────┘ └──────────┘
Sequencer batches txns, assigns global order (via Paxos)
All replicas execute same batch in same order
Deterministic → no 2PC needed for replication!
Trade-off: Must know read/write sets upfront. Reconnaissance queries needed for dependent reads. High abort rate under contention with dependent transactions.
Deterministic Databases
Extend Calvin's idea. If execution is deterministic and all replicas agree on input order, they produce the same output without cross-replica coordination for each transaction.
Input: [Tx1, Tx2, Tx3] (globally ordered)
│ │ │
Replica A: execute in order → State A
Replica B: execute in order → State B
State A == State B (guaranteed by determinism)
Systems: Calvin, FaunaDB.
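The guarantee can be demonstrated in a few lines of Python: given the same globally ordered input log and a deterministic apply function, every replica reaches identical state without per-transaction coordination (the account/delta transactions are illustrative):

```python
def apply(state, tx):
    """A transaction is a pure, deterministic function of the current state.
    Here a tx is an (account, delta) pair applied to a dict of balances."""
    account, delta = tx
    new = dict(state)
    new[account] = new.get(account, 0) + delta
    return new

def run_replica(txs):
    """Every replica applies the ordered input log from the same start state."""
    state = {}
    for tx in txs:
        state = apply(state, tx)
    return state

log = [("A", -100), ("B", +100), ("A", +50)]  # the agreed global order
replica_a = run_replica(log)
replica_b = run_replica(log)
assert replica_a == replica_b  # identical state, no cross-replica coordination
```

All of the coordination cost is paid once, up front, when the sequencer fixes the order of `log`; execution itself needs no 2PC.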
Comparison
| Approach | Consistency | Latency | Availability | Complexity |
|---|---|---|---|---|
| 2PC | Atomic | High (2 RTT + locks) | Blocks on coordinator failure | Low |
| 3PC | Atomic | Higher (3 RTT) | Non-blocking (sync model) | Medium |
| Saga | Eventual (compensating) | Lower per step | High | Medium-High |
| Spanner | External consistency | Medium (commit-wait ~7ms) | Majority quorum | High |
| Calvin | Serializable | Medium (batch + execute) | Majority quorum | Medium |
Choosing a Strategy
Need strong atomicity across databases?
├── Yes: Can you tolerate blocking?
│ ├── Yes: 2PC (simplest, widely supported)
│ └── No: Spanner-style (requires accurate clocks)
└── No: Saga pattern
├── Few services, clear flow → Choreography
└── Many services, complex flow → Orchestration
Real-World Systems
| System | Approach | Notes |
|---|---|---|
| PostgreSQL (FDW) | 2PC | Foreign data wrapper transactions |
| Google Spanner | 2PC + TrueTime | External consistency |
| CockroachDB | 2PC + HLC | Serializable, no atomic clocks needed |
| Temporal / Cadence | Saga orchestration | Workflow engine for sagas |
| Amazon Step Functions | Saga orchestration | AWS-native workflow |
| FaunaDB | Calvin-inspired | Deterministic transactions |
Key Takeaways
- 2PC is the workhorse of distributed transactions but blocks when the coordinator fails. Use it when transactions are short and coordinator failures are rare.
- Sagas trade atomicity for availability and are the standard pattern in microservices. Design compensating actions carefully -- they must be idempotent.
- Spanner's TrueTime is an engineering marvel that turns clock synchronization into a consistency mechanism. CockroachDB approximates this with HLC.
- Deterministic databases (Calvin) avoid 2PC entirely by fixing execution order upfront -- a fundamentally different approach worth understanding.