3 min read
On this page

Time and Ordering in Distributed Systems

The Problem

There is no global clock in a distributed system. Each node has its own local clock with drift, skew, and potential discontinuities. Yet many operations require a notion of "this happened before that."

Node A clock:  |----1----2----3----4----5----6----|  (runs fast)
Node B clock:  |----1------2------3------4--------|  (runs slow)
Node C clock:  |----1----2--[jump]--5----6----7---|  (NTP correction)

Which event came first?  A's event at t=3 or B's event at t=3?

Physical Clocks

NTP (Network Time Protocol)

Synchronizes clocks over the network. Achieves 1-50ms accuracy on the internet, sub-millisecond on LANs.

Client                    Server
  │                          │
  │──── t1: request ────>    │
  │                          │  t2: receive
  │                          │  t3: send
  │   <──── t4: response ── │
  │                          │

  Round-trip delay: δ = (t4 - t1) - (t3 - t2)
  Clock offset:    θ = ((t2 - t1) + (t3 - t4)) / 2

  Assumes symmetric network delay (often violated)

PTP (Precision Time Protocol)

Hardware-timestamped packets, achieves sub-microsecond accuracy. Used in financial trading, telecom, and industrial control.

GPS Clocks

Direct satellite time reference. ~10ns accuracy. Used by Google Spanner's TrueTime API.

TrueTime API:
  TT.now()  -> [earliest, latest]   (interval, not point)
  TT.after(t) -> bool               (is t definitely in the past?)
  TT.before(t) -> bool              (is t definitely in the future?)

  Uncertainty interval: typically 1-7ms
  Spanner waits out the uncertainty before committing.

Clock Problems

1. Clock Skew:     offset between two clocks at a point in time
2. Clock Drift:    rate at which a clock gains/loses time
3. Leap Seconds:   UTC occasionally inserts/removes a second
4. NTP Jumps:      clock steps backward after correction
5. Monotonic vs Wall Clock:
   - Wall clock: can jump (NTP corrections)
   - Monotonic: always moves forward (good for measuring durations)

Logical Clocks

Lamport Timestamps (1978)

A single counter per process, incremented on local events and message exchanges.

Rules:

  1. Before each local event, increment counter
  2. When sending: increment counter, attach it to message
  3. When receiving: counter = max(local, received) + 1
Process A:  (1)────(2)────────────(5)────(6)──>
                    │  send          ↑
                    └──────>  (3)──(4)  recv
Process B:          recv     (3)────(4)────────>
                              ↑      │  send
Process C:                    │      └───>(5)──(6)──>
                             recv           recv

If a → b (a happens-before b), then L(a) < L(b)
BUT: L(a) < L(b) does NOT imply a → b (partial order)
STRUCTURE LamportClock
    time: integer
    node_id: integer

PROCEDURE NEW_LAMPORT_CLOCK(node_id) → LamportClock
    RETURN LamportClock(time ← 0, node_id ← node_id)

PROCEDURE TICK(clock) → integer
    clock.time ← clock.time + 1
    RETURN clock.time

PROCEDURE SEND(clock) → integer
    RETURN TICK(clock)

PROCEDURE RECEIVE(clock, msg_timestamp)
    clock.time ← MAX(clock.time, msg_timestamp) + 1

// Total order: break ties with node_id
PROCEDURE TIMESTAMP(clock) → (integer, integer)
    RETURN (clock.time, clock.node_id)

Limitation: Cannot distinguish concurrent events from causally ordered ones.

Vector Clocks

Each process maintains a vector of counters, one per process. Captures the full causal history.

Process A:  [1,0,0]──[2,0,0]──────────────────[3,2,0]──>
                       │ send                     ↑ recv
Process B:  [0,0,0]──[0,1,0]──[2,2,0]──[2,3,0]──┘ send
                                 ↑ recv   │ send
Process C:  [0,0,0]────────────[0,0,1]──[2,3,2]──>
                                          ↑ recv

Comparison:
  V(a) ≤ V(b) iff ∀i: V(a)[i] ≤ V(b)[i]  →  a happened-before b
  V(a) ∥ V(b) iff neither V(a) ≤ V(b) nor V(b) ≤ V(a)  →  concurrent
STRUCTURE VectorClock
    clock: array of integer (one per node)
    node_id: integer

PROCEDURE NEW_VECTOR_CLOCK(node_id, num_nodes) → VectorClock
    RETURN VectorClock(clock ← array of num_nodes zeros, node_id ← node_id)

PROCEDURE TICK(vc)
    vc.clock[vc.node_id] ← vc.clock[vc.node_id] + 1

PROCEDURE SEND(vc) → array of integer
    TICK(vc)
    RETURN COPY(vc.clock)

PROCEDURE RECEIVE(vc, other_clock)
    FOR i ← 0 TO LENGTH(vc.clock) - 1 DO
        vc.clock[i] ← MAX(vc.clock[i], other_clock[i])
    vc.clock[vc.node_id] ← vc.clock[vc.node_id] + 1

PROCEDURE HAPPENS_BEFORE(vc, other) → boolean
    at_least_one_less ← FALSE
    FOR i ← 0 TO LENGTH(vc.clock) - 1 DO
        IF vc.clock[i] > other.clock[i] THEN RETURN FALSE
        IF vc.clock[i] < other.clock[i] THEN at_least_one_less ← TRUE
    RETURN at_least_one_less

PROCEDURE CONCURRENT(vc, other) → boolean
    RETURN NOT HAPPENS_BEFORE(vc, other) AND NOT HAPPENS_BEFORE(other, vc)

Downside: Vector size grows with number of nodes. Impractical for systems with millions of clients.

Version Vectors

Similar to vector clocks but track data versions rather than events. Used in Dynamo-style systems for conflict detection.

Key "user:42" replicated on nodes A, B, C:

  Client writes to A:     {A:1}
  Client writes to A:     {A:2}
  Replicated to B:        {A:2}
  Network partition occurs
  Client writes to A:     {A:3}
  Client writes to B:     {A:2, B:1}    ← concurrent!

  On read, client sees both versions:
    {A:3}  and  {A:2, B:1}  → conflict, application resolves

Hybrid Logical Clocks (HLC)

Combines physical time and logical counters. Provides causal ordering while staying close to wall-clock time.

HLC = (physical_time, logical_counter, node_id)

Rules:
  Local/send event:
    l' = max(l, pt)      // pt = physical time
    if l' == l: c' = c + 1
    else:       c' = 0

  Receive event:
    l' = max(l, msg.l, pt)
    if l' == l == msg.l:  c' = max(c, msg.c) + 1
    elif l' == l:         c' = c + 1
    elif l' == msg.l:     c' = msg.c + 1
    else:                 c' = 0

Advantage: Fits in a 64-bit integer (e.g., 48 bits physical, 16 bits logical). Used in CockroachDB and MongoDB.

The Happens-Before Relation

Lamport's happens-before (denoted a -> b) is defined:

  1. If a and b are events in the same process and a comes before b, then a -> b
  2. If a is the send of a message and b is the receipt of that message, then a -> b
  3. Transitivity: if a -> b and b -> c, then a -> c

Events not related by happens-before are concurrent (a || b).

                    happens-before
                    ┌─────────┐
    a ──────> b     │ a → b   │   Causal: a might have influenced b
                    └─────────┘

                    concurrent
                    ┌─────────┐
    a               │ a ∥ b   │   Independent: neither influenced the other
         b          └─────────┘

Causal Ordering

Causal ordering ensures that if event a causally precedes event b, all nodes see a before b. Concurrent events may be delivered in any order.

Causal broadcast guarantees:

  If process p broadcasts m1 before receiving m2,
  then no process delivers m2 before m1.

  Node A: send(m1) ──────────────────── deliver(m2)
                \                        /
  Node B:  deliver(m1) ── send(m2) ────/
                                \
  Node C:  ──── deliver(m1) ── deliver(m2)   ← m1 always before m2

Comparison of Clock Mechanisms

Mechanism Size Detects Causality Wall-Clock Close Use Case
Lamport O(1) No (only one direction) No Total ordering
Vector Clock O(n) Yes No Conflict detection
Version Vector O(n) Yes No Replicated data stores
HLC O(1) Yes (probabilistic) Yes Distributed databases
TrueTime O(1) N/A (interval) Yes Spanner

Real-World Applications

  • CockroachDB: HLC for transaction ordering, ensures serializable isolation
  • Amazon DynamoDB: Version vectors for conflict detection during partitions
  • Google Spanner: TrueTime intervals + commit-wait for external consistency
  • Cassandra: Lamport-like timestamps (wall clock + counter) for last-writer-wins
  • Riak: Dotted version vectors (compressed version vectors for many clients)

Key Takeaways

  • Physical clocks are unreliable for ordering in distributed systems due to drift, skew, and jumps.
  • Lamport timestamps provide total order but lose concurrency information.
  • Vector clocks capture full causality but don't scale to large numbers of nodes.
  • HLCs offer the best practical trade-off: causal ordering in a compact representation with wall-clock proximity.
  • Choose your clock mechanism based on what your system actually needs: total order, causal order, conflict detection, or real-time correlation.