Time and Ordering in Distributed Systems
The Problem
There is no global clock in a distributed system. Each node has its own local clock with drift, skew, and potential discontinuities. Yet many operations require a notion of "this happened before that."
Node A clock: |----1----2----3----4----5----6----| (runs fast)
Node B clock: |----1------2------3------4--------| (runs slow)
Node C clock: |----1----2--[jump]--5----6----7---| (NTP correction)
Which event came first? A's event at t=3 or B's event at t=3?
Physical Clocks
NTP (Network Time Protocol)
Synchronizes clocks over the network. Achieves 1-50ms accuracy on the internet, sub-millisecond on LANs.
Client Server
│ │
│──── t1: request ────> │
│ │ t2: receive
│ │ t3: send
│ <──── t4: response ── │
│ │
Round-trip delay: δ = (t4 - t1) - (t3 - t2)
Clock offset: θ = ((t2 - t1) + (t3 - t4)) / 2
Assumes symmetric network delay (often violated)
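The offset/delay computation above can be sketched in a few lines of Python; the timestamp values below are invented for illustration.

```python
# Minimal sketch of the NTP offset/delay formulas; values are illustrative.

def ntp_offset_delay(t1, t2, t3, t4):
    """t1: client send, t2: server receive, t3: server send, t4: client receive."""
    delay = (t4 - t1) - (t3 - t2)          # round-trip time spent on the wire
    offset = ((t2 - t1) + (t3 - t4)) / 2   # assumes symmetric one-way delay
    return offset, delay

# Client clock runs 0.5 s behind the server; one-way delay is 0.1 s each way:
offset, delay = ntp_offset_delay(t1=100.0, t2=100.6, t3=100.7, t4=100.3)
assert abs(offset - 0.5) < 1e-9    # client should step forward by 0.5 s
assert abs(delay - 0.2) < 1e-9     # 2 x 0.1 s one-way delay
```

If the forward and return paths have different delays, the error in the estimated offset can be as large as half the delay asymmetry, which is why the symmetric-delay assumption matters.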
PTP (Precision Time Protocol)
Hardware-timestamped packets, achieves sub-microsecond accuracy. Used in financial trading, telecom, and industrial control.
GPS Clocks
Direct satellite time reference. ~10ns accuracy. Used by Google Spanner's TrueTime API.
TrueTime API:
TT.now() -> [earliest, latest] (interval, not point)
TT.after(t) -> bool (is t definitely in the past?)
TT.before(t) -> bool (is t definitely in the future?)
Uncertainty interval: typically 1-7ms
Spanner waits out the uncertainty before committing.
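The commit-wait idea can be sketched with a toy interval clock. This is an illustrative simulation, not Spanner's actual API; `TrueTimeSim` and `commit_wait` are invented names.

```python
import time

class TrueTimeSim:
    """Toy stand-in for a TrueTime-like interval clock (illustrative only)."""
    def __init__(self, uncertainty_s=0.002):
        self.eps = uncertainty_s
    def now(self):
        t = time.time()
        return (t - self.eps, t + self.eps)   # [earliest, latest]
    def after(self, t):
        return t < self.now()[0]              # True -> t is definitely past

def commit_wait(tt, commit_ts):
    # Spanner picks commit_ts >= TT.now().latest, then waits until
    # TT.after(commit_ts) holds, so the timestamp order of committed
    # transactions matches their real-time order (external consistency).
    while not tt.after(commit_ts):
        time.sleep(0.0005)

tt = TrueTimeSim(uncertainty_s=0.002)
commit_ts = tt.now()[1]       # latest bound of the current interval
commit_wait(tt, commit_ts)    # blocks for roughly 2 * uncertainty
assert tt.after(commit_ts)    # now safe to make the commit visible
```

The cost of this design is latency: every commit pays up to twice the uncertainty bound, which is why Spanner invests in keeping the interval small.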
Clock Problems
1. Clock Skew: offset between two clocks at a point in time
2. Clock Drift: rate at which a clock gains/loses time
3. Leap Seconds: UTC occasionally inserts/removes a second
4. NTP Jumps: clock steps backward after correction
5. Monotonic vs Wall Clock:
- Wall clock: can jump (NTP corrections)
- Monotonic: always moves forward (good for measuring durations)
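In Python, for example, this distinction maps to `time.time()` versus `time.monotonic()`:

```python
import time

start_wall = time.time()        # wall clock: subject to NTP steps
start_mono = time.monotonic()   # monotonic: immune to clock adjustments

time.sleep(0.05)

wall_elapsed = time.time() - start_wall        # could be negative after a clock step
mono_elapsed = time.monotonic() - start_mono   # never negative

assert mono_elapsed > 0   # the monotonic clock cannot run backward
```

Timeouts, rate limiters, and latency measurements should use the monotonic clock; the wall clock is only for timestamps meant to be compared across machines.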
Logical Clocks
Lamport Timestamps (1978)
A single counter per process, incremented on local events and message exchanges.
Rules:
- Before each local event, increment counter
- When sending: increment counter, attach it to message
- When receiving: counter = max(local, received) + 1
Process A: (1)────(2)────────────(5)────(6)──>
│ send ↑
└──────> (3)──(4) recv
Process B: recv (3)────(4)────────>
↑ │ send
Process C: │ └───>(5)──(6)──>
recv recv
If a → b (a happens-before b), then L(a) < L(b)
BUT: L(a) < L(b) does NOT imply a → b (partial order)
STRUCTURE LamportClock
time: integer
node_id: integer
PROCEDURE NEW_LAMPORT_CLOCK(node_id) → LamportClock
RETURN LamportClock(time ← 0, node_id ← node_id)
PROCEDURE TICK(clock) → integer
clock.time ← clock.time + 1
RETURN clock.time
PROCEDURE SEND(clock) → integer
RETURN TICK(clock)
PROCEDURE RECEIVE(clock, msg_timestamp)
clock.time ← MAX(clock.time, msg_timestamp) + 1
// Total order: break ties with node_id
PROCEDURE TIMESTAMP(clock) → (integer, integer)
RETURN (clock.time, clock.node_id)
Limitation: Cannot distinguish concurrent events from causally ordered ones.
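The pseudocode above translates directly into Python; the end of this sketch also demonstrates the limitation: two concurrent events still get ordered timestamps.

```python
# Direct Python translation of the Lamport-clock pseudocode above.

class LamportClock:
    def __init__(self, node_id):
        self.time = 0
        self.node_id = node_id

    def tick(self):                      # local event
        self.time += 1
        return self.time

    def send(self):                      # attach the result to the message
        return self.tick()

    def receive(self, msg_timestamp):    # merge on message arrival
        self.time = max(self.time, msg_timestamp) + 1

    def timestamp(self):                 # (time, node_id): ties broken by id
        return (self.time, self.node_id)

a, b = LamportClock(0), LamportClock(1)
ts = a.send()                  # a.time = 1
b.receive(ts)                  # b.time = max(0, 1) + 1 = 2
assert b.timestamp() > a.timestamp()   # causal order is reflected

# Limitation: c's and d's events are concurrent, yet their timestamps order them.
c, d = LamportClock(0), LamportClock(1)
c.tick()                       # c.time = 1
d.tick(); d.tick()             # d.time = 2
assert c.timestamp() < d.timestamp()   # ordered timestamps, no causal link
```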
Vector Clocks
Each process maintains a vector of counters, one per process. Captures the full causal history.
Process A: [1,0,0]──[2,0,0]──────────────────[3,2,0]──>
│ send ↑ recv
Process B: [0,0,0]──[0,1,0]──[2,2,0]──[2,3,0]──┘ send
↑ recv │ send
Process C: [0,0,0]────────────[0,0,1]──[2,3,2]──>
↑ recv
Comparison:
V(a) < V(b) iff ∀i: V(a)[i] ≤ V(b)[i] and V(a) ≠ V(b) → a happened-before b
V(a) ∥ V(b) iff neither V(a) ≤ V(b) nor V(b) ≤ V(a) → concurrent
STRUCTURE VectorClock
clock: array of integer (one per node)
node_id: integer
PROCEDURE NEW_VECTOR_CLOCK(node_id, num_nodes) → VectorClock
RETURN VectorClock(clock ← array of num_nodes zeros, node_id ← node_id)
PROCEDURE TICK(vc)
vc.clock[vc.node_id] ← vc.clock[vc.node_id] + 1
PROCEDURE SEND(vc) → array of integer
TICK(vc)
RETURN COPY(vc.clock)
PROCEDURE RECEIVE(vc, other_clock)
FOR i ← 0 TO LENGTH(vc.clock) - 1 DO
vc.clock[i] ← MAX(vc.clock[i], other_clock[i])
vc.clock[vc.node_id] ← vc.clock[vc.node_id] + 1
PROCEDURE HAPPENS_BEFORE(vc, other) → boolean
at_least_one_less ← FALSE
FOR i ← 0 TO LENGTH(vc.clock) - 1 DO
IF vc.clock[i] > other.clock[i] THEN RETURN FALSE
IF vc.clock[i] < other.clock[i] THEN at_least_one_less ← TRUE
RETURN at_least_one_less
PROCEDURE CONCURRENT(vc, other) → boolean
RETURN NOT HAPPENS_BEFORE(vc, other) AND NOT HAPPENS_BEFORE(other, vc)
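A minimal Python sketch of the pseudocode above, showing both a causal relationship and a detected concurrency:

```python
# Python version of the vector-clock pseudocode above.

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.clock = [0] * num_nodes
        self.node_id = node_id

    def tick(self):
        self.clock[self.node_id] += 1

    def send(self):
        self.tick()
        return list(self.clock)          # a copy travels with the message

    def receive(self, other_clock):
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], other_clock[i])
        self.clock[self.node_id] += 1

def happens_before(v1, v2):
    # v1 < v2: every component <=, at least one strictly <
    return all(x <= y for x, y in zip(v1, v2)) and v1 != v2

def concurrent(v1, v2):
    return not happens_before(v1, v2) and not happens_before(v2, v1)

a, b = VectorClock(0, 2), VectorClock(1, 2)
msg = a.send()                     # a: [1, 0]
b.receive(msg)                     # b: [1, 1] -- causally after a's send
assert happens_before([1, 0], b.clock)

a.tick()                           # a: [2, 0]
b.tick()                           # b: [1, 2]
assert concurrent(a.clock, b.clock)   # neither dominates the other
```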
Downside: Vector size grows with number of nodes. Impractical for systems with millions of clients.
Version Vectors
Similar to vector clocks but track data versions rather than events. Used in Dynamo-style systems for conflict detection.
Key "user:42" replicated on nodes A, B, C:
Client writes to A: {A:1}
Client writes to A: {A:2}
Replicated to B: {A:2}
Network partition occurs
Client writes to A: {A:3}
Client writes to B: {A:2, B:1} ← concurrent!
On read, client sees both versions:
{A:3} and {A:2, B:1} → conflict, application resolves
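The scenario above can be replayed with plain dicts mapping node name to counter; the helper names here (`merge_for_write`, `dominates`) are illustrative.

```python
# Version-vector sketch of the partition scenario above.

def merge_for_write(vv, node):
    vv = dict(vv)                  # new version descends from the old one
    vv[node] = vv.get(node, 0) + 1
    return vv

def dominates(v1, v2):
    # v1 supersedes v2 if it has seen every write v2 has seen
    return all(v1.get(n, 0) >= c for n, c in v2.items())

def conflict(v1, v2):
    return not dominates(v1, v2) and not dominates(v2, v1)

v = {}
v = merge_for_write(v, "A")        # {A:1}
v = merge_for_write(v, "A")        # {A:2}
replica_b = dict(v)                # replicated to B: {A:2}

# Network partition: both replicas accept writes independently.
v_a = merge_for_write(v, "A")           # {A:3}
v_b = merge_for_write(replica_b, "B")   # {A:2, B:1}

assert conflict(v_a, v_b)          # siblings: the application must resolve
```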
Hybrid Logical Clocks (HLC)
Combines physical time and logical counters. Provides causal ordering while staying close to wall-clock time.
HLC = (physical_time, logical_counter, node_id)
Rules:
Local/send event:
l' = max(l, pt) // pt = physical time
if l' == l: c' = c + 1
else: c' = 0
Receive event:
l' = max(l, msg.l, pt)
if l' == l == msg.l: c' = max(c, msg.c) + 1
elif l' == l: c' = c + 1
elif l' == msg.l: c' = msg.c + 1
else: c' = 0
Advantage: Fits in a 64-bit integer (e.g., 48 bits physical, 16 bits logical). Used in CockroachDB and MongoDB.
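The update rules can be sketched as follows. Physical time (`pt`) is passed in explicitly so the example is deterministic; a real implementation would read the wall clock.

```python
# Sketch of the HLC update rules above; pt is injected for determinism.

class HLC:
    def __init__(self):
        self.l = 0   # logical copy of the largest physical time seen
        self.c = 0   # logical counter, breaks ties within one l value

    def local_or_send(self, pt):
        l_old = self.l
        self.l = max(self.l, pt)
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, msg_l, msg_c, pt):
        l_old = self.l
        self.l = max(self.l, msg_l, pt)
        if self.l == l_old and self.l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        return (self.l, self.c)

a, b = HLC(), HLC()
ts = a.local_or_send(pt=100)           # (100, 0)
# b's physical clock lags at 95, yet its timestamp still moves past a's:
ts_b = b.receive(ts[0], ts[1], pt=95)  # (100, 1)
assert ts_b > ts                       # causal order preserved despite skew
```

Note how the logical counter absorbs clock skew: `b` never issues a timestamp below one it has observed, while `l` stays pinned to the largest physical time seen.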
The Happens-Before Relation
Lamport's happens-before relation (denoted a → b) is defined by three rules:
- If a and b are events in the same process and a comes before b, then a → b
- If a is the send of a message and b is the receipt of that message, then a → b
- Transitivity: if a → b and b → c, then a → c
Events not related by happens-before are concurrent (a || b).
happens-before
┌─────────┐
a ──────> b │ a → b │ Causal: a might have influenced b
└─────────┘
concurrent
┌─────────┐
a │ a ∥ b │ Independent: neither influenced the other
b └─────────┘
Causal Ordering
Causal ordering ensures that if event a causally precedes event b, all nodes see a before b. Concurrent events may be delivered in any order.
Causal broadcast guarantees:
If the broadcast of m1 happens-before the broadcast of m2
(e.g. the sender of m2 delivered m1 first),
then no process delivers m2 before m1.
Node A: send(m1) ──────────────────── deliver(m2)
\ /
Node B: deliver(m1) ── send(m2) ────/
\
Node C: ──── deliver(m1) ── deliver(m2) ← m1 always before m2
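One common way to enforce this is a vector-clock deliverability check: hold back a message until all of its causal predecessors have been delivered locally. This sketch assumes the variant where each sender increments only its own vector entry when broadcasting.

```python
# Causal-delivery check: is a message from `sender` safe to deliver,
# given the vector clock of messages this node has already delivered?

def deliverable(msg_vc, local_vc, sender):
    # It must be exactly the next message from `sender`...
    if msg_vc[sender] != local_vc[sender] + 1:
        return False
    # ...and must not depend on anything we have not delivered yet.
    return all(msg_vc[k] <= local_vc[k]
               for k in range(len(local_vc)) if k != sender)

# Node C has delivered nothing yet:
local = [0, 0, 0]
m1 = [1, 0, 0]           # A's broadcast (sender 0)
m2 = [1, 1, 0]           # B's broadcast after delivering m1 (sender 1)

assert not deliverable(m2, local, sender=1)   # m2 is buffered: waits for m1
assert deliverable(m1, local, sender=0)
local = m1                                    # after delivering m1...
assert deliverable(m2, local, sender=1)       # ...m2 becomes deliverable
```

Messages that fail the check are buffered and retried as other deliveries advance the local vector, which yields exactly the behavior in the diagram above.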
Comparison of Clock Mechanisms
| Mechanism      | Size | Detects Causality              | Wall-Clock Close | Use Case               |
|----------------|------|--------------------------------|------------------|------------------------|
| Lamport        | O(1) | No (one direction only)        | No               | Total ordering         |
| Vector Clock   | O(n) | Yes                            | No               | Conflict detection     |
| Version Vector | O(n) | Yes                            | No               | Replicated data stores |
| HLC            | O(1) | No (one direction, like Lamport) | Yes            | Distributed databases  |
| TrueTime       | O(1) | N/A (interval)                 | Yes              | Spanner                |
Real-World Applications
- CockroachDB: HLC for transaction ordering, ensures serializable isolation
- Amazon Dynamo (the original paper, not the DynamoDB service): vector clocks for conflict detection during partitions
- Google Spanner: TrueTime intervals + commit-wait for external consistency
- Cassandra: microsecond wall-clock timestamps (client- or server-supplied) for last-writer-wins
- Riak: Dotted version vectors (compressed version vectors for many clients)
Key Takeaways
- Physical clocks are unreliable for ordering in distributed systems due to drift, skew, and jumps.
- Lamport timestamps provide total order but lose concurrency information.
- Vector clocks capture full causality but don't scale to large numbers of nodes.
- HLCs offer the best practical trade-off: causal ordering in a compact representation with wall-clock proximity.
- Choose your clock mechanism based on what your system actually needs: total order, causal order, conflict detection, or real-time correlation.