
Time and Ordering in Distributed Systems

The Problem

There is no global clock in a distributed system. Each node has its own local clock with drift, skew, and potential discontinuities. Yet many operations require a notion of "this happened before that."

Node A clock:  |----1----2----3----4----5----6----|  (runs fast)
Node B clock:  |----1------2------3------4--------|  (runs slow)
Node C clock:  |----1----2--[jump]--5----6----7---|  (NTP correction)

Which event came first?  A's event at t=3 or B's event at t=3?

Physical Clocks

NTP (Network Time Protocol)

Synchronizes clocks over the network. Achieves 1-50ms accuracy on the internet, sub-millisecond on LANs.

Client                    Server
  │                          │
  │──── t1: request ────>    │
  │                          │  t2: receive
  │                          │  t3: send
  │    <──── t4: response ───│
  │                          │

  Round-trip delay: δ = (t4 - t1) - (t3 - t2)
  Clock offset:    θ = ((t2 - t1) + (t3 - t4)) / 2

  Assumes symmetric network delay (often violated)
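The two formulas above are easy to evaluate directly. A small Python sketch with made-up timestamps (`ntp_offset_delay` is an illustrative name; the server clock is assumed to run about 50 ms ahead of the client):

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Return (offset, delay) per the NTP formulas.

    t1, t4 are read from the client clock; t2, t3 from the server clock.
    """
    delay = (t4 - t1) - (t3 - t2)          # time spent on the wire
    offset = ((t2 - t1) + (t3 - t4)) / 2   # estimated server-minus-client offset
    return offset, delay

# Client sends at 100.000, server receives at 100.060 and replies at 100.061,
# client receives at 100.021: offset ≈ +0.050 s, round-trip delay = 0.020 s.
offset, delay = ntp_offset_delay(100.000, 100.060, 100.061, 100.021)
```

Note how the offset estimate is only exact when the request and response paths take equal time; an asymmetric path shifts the estimate by half the asymmetry.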

PTP (Precision Time Protocol)

Hardware-timestamped packets, achieves sub-microsecond accuracy. Used in financial trading, telecom, and industrial control.

GPS Clocks

Direct satellite time reference. ~10ns accuracy. Used by Google Spanner's TrueTime API.

TrueTime API:
  TT.now()  -> [earliest, latest]   (interval, not point)
  TT.after(t) -> bool               (is t definitely in the past?)
  TT.before(t) -> bool              (is t definitely in the future?)

  Uncertainty interval: typically 1-7ms
  Spanner waits out the uncertainty before committing.
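TrueTime itself requires GPS and atomic-clock hardware, but the commit-wait idea can be sketched in plain Python with an assumed fixed uncertainty bound. All names here (`EPSILON`, `tt_now`, `tt_after`, `commit_wait`) are hypothetical, not Spanner's actual API:

```python
import time

EPSILON = 0.004  # assumed uncertainty bound (~4 ms); real TrueTime derives this dynamically

def tt_now():
    """Return an (earliest, latest) interval around the local wall clock."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(t):
    """True iff t is definitely in the past on every node's clock."""
    earliest, _ = tt_now()
    return t < earliest

def commit_wait(commit_ts):
    """Spanner-style commit-wait: block until commit_ts is definitely past."""
    while not tt_after(commit_ts):
        time.sleep(0.001)

# A transaction picks commit_ts = TT.now().latest, then waits out the
# uncertainty (roughly 2 * EPSILON here) before acknowledging the commit.
_, commit_ts = tt_now()
commit_wait(commit_ts)
```

The wait guarantees that by the time the commit is acknowledged, its timestamp is in the past for every observer, which is what makes external consistency possible.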

Clock Problems

1. Clock Skew:     offset between two clocks at a point in time
2. Clock Drift:    rate at which a clock gains/loses time
3. Leap Seconds:   UTC occasionally inserts/removes a second
4. NTP Jumps:      clock steps backward after correction
5. Monotonic vs Wall Clock:
   - Wall clock: can jump (NTP corrections)
   - Monotonic: always moves forward (good for measuring durations)
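In Python, the distinction maps directly to `time.time()` versus `time.monotonic()`:

```python
import time

# Wall clock: meaningful as a date, but can jump after NTP corrections.
wall_start = time.time()

# Monotonic clock: meaningless as a date, but never goes backward,
# which makes it the right tool for measuring durations.
mono_start = time.monotonic()
time.sleep(0.05)
elapsed = time.monotonic() - mono_start  # guaranteed >= 0, even across clock jumps
```

Measuring the same interval with `time.time()` can yield a negative duration if NTP steps the clock backward mid-measurement.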

Logical Clocks

Lamport Timestamps (1978)

A single counter per process, incremented on local events and message exchanges.

Rules:

  1. Before each local event, increment counter
  2. When sending: increment counter, attach it to message
  3. When receiving: counter = max(local, received) + 1

Process A:  (1)────(2)────────────(5)────(6)──>
                    │  send          ↑
                    └──────>  (3)──(4)  recv
Process B:          recv     (3)────(4)────────>
                              ↑      │  send
Process C:                    │      └───>(5)──(6)──>
                             recv           recv

If a → b (a happens-before b), then L(a) < L(b)
BUT: L(a) < L(b) does NOT imply a → b (partial order)

STRUCTURE LamportClock
    time: integer
    node_id: integer

PROCEDURE NEW_LAMPORT_CLOCK(node_id) → LamportClock
    RETURN LamportClock(time ← 0, node_id ← node_id)

PROCEDURE TICK(clock) → integer
    clock.time ← clock.time + 1
    RETURN clock.time

PROCEDURE SEND(clock) → integer
    RETURN TICK(clock)

PROCEDURE RECEIVE(clock, msg_timestamp)
    clock.time ← MAX(clock.time, msg_timestamp) + 1

// Total order: break ties with node_id
PROCEDURE TIMESTAMP(clock) → (integer, integer)
    RETURN (clock.time, clock.node_id)

Limitation: Cannot distinguish concurrent events from causally ordered ones.

Vector Clocks

Each process maintains a vector of counters, one per process. Captures the full causal history.

Process A:  [1,0,0]──[2,0,0]──────────────────[3,3,0]──>
                       │ send                     ↑ recv
Process B:  [0,0,0]──[0,1,0]──[2,2,0]──[2,3,0]──┘ send
                                 ↑ recv   │ send
Process C:  [0,0,0]────────────[0,0,1]──[2,3,2]──>
                                          ↑ recv

Comparison:
  V(a) < V(b) iff ∀i: V(a)[i] ≤ V(b)[i] and V(a) ≠ V(b)  →  a happened-before b
  V(a) ∥ V(b) iff neither V(a) < V(b) nor V(b) < V(a)  →  concurrent

STRUCTURE VectorClock
    clock: array of integer (one per node)
    node_id: integer

PROCEDURE NEW_VECTOR_CLOCK(node_id, num_nodes) → VectorClock
    RETURN VectorClock(clock ← array of num_nodes zeros, node_id ← node_id)

PROCEDURE TICK(vc)
    vc.clock[vc.node_id] ← vc.clock[vc.node_id] + 1

PROCEDURE SEND(vc) → array of integer
    TICK(vc)
    RETURN COPY(vc.clock)

PROCEDURE RECEIVE(vc, other_clock)
    FOR i ← 0 TO LENGTH(vc.clock) - 1 DO
        vc.clock[i] ← MAX(vc.clock[i], other_clock[i])
    vc.clock[vc.node_id] ← vc.clock[vc.node_id] + 1

PROCEDURE HAPPENS_BEFORE(vc, other) → boolean
    at_least_one_less ← FALSE
    FOR i ← 0 TO LENGTH(vc.clock) - 1 DO
        IF vc.clock[i] > other.clock[i] THEN RETURN FALSE
        IF vc.clock[i] < other.clock[i] THEN at_least_one_less ← TRUE
    RETURN at_least_one_less

PROCEDURE CONCURRENT(vc, other) → boolean
    RETURN NOT HAPPENS_BEFORE(vc, other) AND NOT HAPPENS_BEFORE(other, vc)

Downside: Vector size grows with number of nodes. Impractical for systems with millions of clients.
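A minimal Python translation of the pseudocode above, exercised on a three-node example:

```python
class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.clock = [0] * num_nodes
        self.node_id = node_id

    def tick(self):
        self.clock[self.node_id] += 1

    def send(self):
        """Tick, then return a copy of the vector to attach to the message."""
        self.tick()
        return list(self.clock)

    def receive(self, other_clock):
        """Element-wise max with the incoming vector, then tick our own entry."""
        self.clock = [max(a, b) for a, b in zip(self.clock, other_clock)]
        self.clock[self.node_id] += 1

    def happens_before(self, other):
        """True iff self <= other component-wise, with at least one strict <."""
        return (all(a <= b for a, b in zip(self.clock, other.clock))
                and self.clock != other.clock)

    def concurrent(self, other):
        return not self.happens_before(other) and not other.happens_before(self)

# A ticks, sends at [2,0,0]; B merges and ticks its own entry -> [2,1,0].
a, b = VectorClock(0, 3), VectorClock(1, 3)
a.tick()
msg = a.send()
b.receive(msg)
```

Here `a.happens_before(b)` holds, while an independent event on a third node would be `concurrent` with both.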

Version Vectors

Similar to vector clocks but track data versions rather than events. Used in Dynamo-style systems for conflict detection.

Key "user:42" replicated on nodes A, B, C:

  Client writes to A:     {A:1}
  Client writes to A:     {A:2}
  Replicated to B:        {A:2}
  Network partition occurs
  Client writes to A:     {A:3}
  Client writes to B:     {A:2, B:1}    ← concurrent!

  On read, client sees both versions:
    {A:3}  and  {A:2, B:1}  → conflict, application resolves
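Conflict detection on version vectors reduces to a dominance check. A sketch (`dominates` and `conflict` are illustrative names) that reproduces the partition scenario above, representing each vector as a dict with missing entries counting as zero:

```python
def dominates(v1, v2):
    """True iff v1 >= v2 in every component, i.e. v1 supersedes v2."""
    return all(v1.get(node, 0) >= count for node, count in v2.items())

def conflict(v1, v2):
    """Concurrent versions: neither dominates the other."""
    return not dominates(v1, v2) and not dominates(v2, v1)

# The partitioned writes from the example:
version_a = {"A": 3}
version_b = {"A": 2, "B": 1}
conflict(version_a, version_b)  # True: the application must resolve
```

By contrast, `{"A": 2}` versus `{"A": 2, "B": 1}` is not a conflict: the second vector dominates the first and simply supersedes it.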

Hybrid Logical Clocks (HLC)

Combines physical time and logical counters. Provides causal ordering while staying close to wall-clock time.

HLC = (physical_time, logical_counter, node_id)

Rules:
  Local/send event:
    l' = max(l, pt)      // pt = physical time
    if l' == l: c' = c + 1
    else:       c' = 0

  Receive event:
    l' = max(l, msg.l, pt)
    if l' == l == msg.l:  c' = max(c, msg.c) + 1
    elif l' == l:         c' = c + 1
    elif l' == msg.l:     c' = msg.c + 1
    else:                 c' = 0

Advantage: Fits in a 64-bit integer (e.g., 48 bits physical, 16 bits logical). Used in CockroachDB and MongoDB.
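The update rules above can be sketched in Python with an injectable physical clock, so the logical counter's tie-breaking behavior is visible (the frozen clock is purely for illustration; a real deployment would use the system clock):

```python
import time

class HLC:
    """Hybrid logical clock holding (l, c), per the rules above."""
    def __init__(self, now=time.time):
        self.l = 0      # logical copy of physical time
        self.c = 0      # counter for events sharing the same l
        self.now = now  # injectable physical-clock source

    def tick(self):
        """Local or send event."""
        pt = self.now()
        l_old = self.l
        self.l = max(self.l, pt)
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, msg_l, msg_c):
        pt = self.now()
        l_old = self.l
        self.l = max(self.l, msg_l, pt)
        if self.l == l_old and self.l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        return (self.l, self.c)

# With a frozen physical clock, successive events share l and bump c:
clock = HLC(now=lambda: 100.0)
clock.tick()  # -> (100.0, 0)
clock.tick()  # -> (100.0, 1)
```

HLC timestamps compare lexicographically as (l, c) pairs, which is what makes the 48-bit/16-bit packing into a single 64-bit integer work.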

The Happens-Before Relation

Lamport's happens-before (denoted a → b) is defined:

  1. If a and b are events in the same process and a comes before b, then a → b
  2. If a is the send of a message and b is the receipt of that message, then a → b
  3. Transitivity: if a → b and b → c, then a → c

Events not related by happens-before are concurrent (a ∥ b).

                    happens-before
                    ┌─────────┐
    a ──────> b     │ a → b   │   Causal: a might have influenced b
                    └─────────┘

                    concurrent
                    ┌─────────┐
    a               │ a ∥ b   │   Independent: neither influenced the other
         b          └─────────┘

Causal Ordering

Causal ordering ensures that if event a causally precedes event b, all nodes see a before b. Concurrent events may be delivered in any order.

Causal broadcast guarantees:

  If process p broadcasts m1 before receiving m2,
  then no process delivers m2 before m1.

  Node A: send(m1) ──────────────────── deliver(m2)
                \                        /
  Node B:  deliver(m1) ── send(m2) ────/
                                \
  Node C:  ──── deliver(m1) ── deliver(m2)   ← m1 always before m2
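This guarantee is commonly enforced with a vector-clock deliverability check. A sketch of the CBCAST-style variant, where each vector entry counts messages broadcast by that process rather than all events (`deliverable` is an illustrative name):

```python
# delivered[i] counts messages already delivered from process i.
# A message from `sender` stamped with vector V is deliverable when it is the
# very next message from its sender and every causal dependency has arrived.
def deliverable(V, sender, delivered):
    for i, v in enumerate(V):
        if i == sender:
            if v != delivered[i] + 1:   # must be exactly the next message
                return False
        elif v > delivered[i]:          # an undelivered causal dependency
            return False
    return True

# Node C starts with delivered = [0, 0, 0]. If m2 (sent by B after seeing
# A's m1) arrives first, it is buffered until m1 has been delivered:
deliverable([1, 1, 0], sender=1, delivered=[0, 0, 0])  # False: depends on m1
deliverable([1, 0, 0], sender=0, delivered=[0, 0, 0])  # True: m1 can go
```

Buffering undeliverable messages until the check passes is exactly how the diagram's "m1 always before m2" property is achieved.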

Comparison of Clock Mechanisms

| Mechanism | Size | Detects Causality | Wall-Clock Close | Use Case |
|---|---|---|---|---|
| Lamport | O(1) | No (only one direction) | No | Total ordering |
| Vector Clock | O(n) | Yes | No | Conflict detection |
| Version Vector | O(n) | Yes | No | Replicated data stores |
| HLC | O(1) | No (one direction, like Lamport) | Yes | Distributed databases |
| TrueTime | O(1) | N/A (interval) | Yes | Spanner |

Real-World Applications

  • CockroachDB: HLC for transaction ordering, ensures serializable isolation
  • Amazon Dynamo (the system described in the Dynamo paper): version vectors for conflict detection during partitions
  • Google Spanner: TrueTime intervals + commit-wait for external consistency
  • Cassandra: Lamport-like timestamps (wall clock + counter) for last-writer-wins
  • Riak: Dotted version vectors (compressed version vectors for many clients)

Key Takeaways

  • Physical clocks are unreliable for ordering in distributed systems due to drift, skew, and jumps.
  • Lamport timestamps provide total order but lose concurrency information.
  • Vector clocks capture full causality but don't scale to large numbers of nodes.
  • HLCs offer the best practical trade-off: causal ordering in a compact representation with wall-clock proximity.
  • Choose your clock mechanism based on what your system actually needs: total order, causal order, conflict detection, or real-time correlation.