
Consensus in Distributed Systems

Paxos vs Raft vs PBFT comparison

The Consensus Problem

All correct processes must agree on a single value, even when some processes fail.

Properties:

  • Agreement: All correct processes decide on the same value
  • Validity: The decided value was proposed by some process
  • Termination: All correct processes eventually decide
  • Integrity: Each process decides at most once
  P1 proposes: v1 ─┐
  P2 proposes: v2 ─┼──> Consensus ──> All decide: v2
  P3 proposes: v3 ─┤
  P4 (crashed)  ✗ ─┘

Paxos

Lamport's Paxos (1998) solves consensus in a partially synchronous model with crash failures.

Basic Paxos (Single-Decree)

Three roles: Proposers, Acceptors, Learners (a node can play multiple roles).

Phase 1: PREPARE
  Proposer                    Acceptors (majority needed)
     │── Prepare(n) ───────>  A1: if n > max_seen → Promise(n, prev_val)
     │── Prepare(n) ───────>  A2: if n > max_seen → Promise(n, prev_val)
     │── Prepare(n) ───────>  A3: if n > max_seen → Promise(n, prev_val)

Phase 2: ACCEPT
  Proposer                    Acceptors
     │── Accept(n, v) ─────>  A1: if n ≥ max_promised → Accepted(n, v)
     │── Accept(n, v) ─────>  A2: if n ≥ max_promised → Accepted(n, v)
     │── Accept(n, v) ─────>  A3: if n ≥ max_promised → Accepted(n, v)

  v = highest-numbered value from Phase 1 promises,
      or proposer's own value if none

Key insight: if any acceptor already accepted a value, the proposer must use that value (preserving previous decisions).
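The two phases above can be sketched as a minimal acceptor. This is an illustrative single-decree sketch, not a production implementation; the method and field names are invented for clarity:

```python
# Minimal single-decree Paxos acceptor (illustrative sketch).
class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest proposal number promised
        self.accepted_n = -1    # number of the accepted proposal, if any
        self.accepted_v = None  # accepted value, if any

    def on_prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Return any previously accepted proposal so the proposer
            # can preserve an earlier decision.
            return ("promise", n, self.accepted_n, self.accepted_v)
        return ("nack", self.promised_n)

    def on_accept(self, n, v):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n, v)
        return ("nack", self.promised_n)
```

Note how `on_prepare` reports back any previously accepted value: that is the mechanism behind the key insight above.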

Multi-Paxos

Optimizes repeated consensus by electing a stable leader who skips Phase 1 for subsequent slots.

Slot 1: Full Paxos (Prepare + Accept)  ← establishes leader
Slot 2: Accept only (leader skips Prepare)
Slot 3: Accept only
  ...
Slot N: Accept only

Leader failure → new leader runs full Paxos for current slot

Paxos Challenges

  • Hard to implement correctly (notoriously subtle)
  • No distinguished leader in basic Paxos leads to livelock (dueling proposers)
  • Multi-Paxos is not fully specified in the original paper
  • Gap filling needed when slots are decided out of order

Raft

Ongaro and Ousterhout (2014) designed Raft explicitly for understandability; it is equivalent in power to Multi-Paxos.

Core Components

┌──────────────────────────────────────┐
│              Raft Node               │
│  ┌──────────┐  ┌──────────────────┐  │
│  │  State    │  │  Log             │  │
│  │ Machine   │  │  [1][2][3][4][5] │  │
│  └──────────┘  └──────────────────┘  │
│  Role: Leader / Follower / Candidate │
│  Term: monotonically increasing      │
└──────────────────────────────────────┘

Leader Election

Follower ──(election timeout expires)──> Candidate
  │                                          │
  │                                     increments term
  │                                     votes for self
  │     <── RequestVote(term, log_info) ──   │
  │                                          │
  │  ── VoteGranted ──>  (if majority) ──> Leader
  │                                          │
  │  ── VoteRejected ──> (if lost)    ──> Follower
  │                                          │
  └──(receives AppendEntries from leader)──> Follower

Election timeout: randomized (e.g., 150-300ms)
  to avoid split votes
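The randomization itself is trivial but load-bearing: with distinct timeouts, one follower usually times out first and wins before others become candidates. A minimal sketch (the 150-300ms range is the example from the text, not a value mandated by Raft):

```python
import random

def election_timeout_ms(lo=150.0, hi=300.0):
    """Pick a fresh randomized election timeout, in milliseconds.

    Each follower draws independently, so timeouts rarely collide
    and split votes become unlikely.
    """
    return random.uniform(lo, hi)
```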

Vote granting rules:

  1. Candidate's term must be at least as large as voter's current term
  2. Candidate's log must be at least as up-to-date (last term, then length)
  3. Each node votes for at most one candidate per term

Log Replication

Client ──> Leader: "set x=5"

Leader log:  [1:set x=1] [1:set y=2] [2:set x=5]
                                        ↑ new entry

Leader ──AppendEntries──> Follower A: appends, ACKs
Leader ──AppendEntries──> Follower B: appends, ACKs
Leader ──AppendEntries──> Follower C: (slow, no ACK yet)
Leader ──AppendEntries──> Follower D: appends, ACKs

Majority (3/5) ACKs → Leader commits entry
                     → responds to client
                     → next AppendEntries tells followers to commit
STRUCTURE LogEntry
    term: integer
    index: integer
    command: bytes

STRUCTURE RaftNode
    id: integer
    current_term: integer
    voted_for: integer or NONE
    log: list of LogEntry
    commit_index: integer
    last_applied: integer
    // Leader state
    next_index: array of integer    // per follower
    match_index: array of integer   // per follower
    role: FOLLOWER | CANDIDATE | LEADER

PROCEDURE SHOULD_GRANT_VOTE(node, candidate_id, candidate_term,
                             candidate_last_log_term,
                             candidate_last_log_index) → boolean
    IF candidate_term < node.current_term THEN
        RETURN FALSE
    IF node.voted_for IS NOT NONE
        AND node.voted_for ≠ candidate_id THEN
        RETURN FALSE   // already voted for another candidate this term
    // Candidate's log must be at least as up-to-date
    IF node.log IS NOT empty THEN
        my_last_term ← LAST(node.log).term
        my_last_index ← LAST(node.log).index
    ELSE
        my_last_term ← 0
        my_last_index ← 0

    RETURN (candidate_last_log_term, candidate_last_log_index)
            ≥ (my_last_term, my_last_index)
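The leader-side half of log replication can be sketched in the same spirit. This is a hypothetical helper, not part of the pseudocode above: the leader advances `commit_index` to the highest index N replicated on a majority, and, per Raft's safety restriction, only when the entry at N was written in the leader's current term:

```python
# Leader commit rule (sketch): advance commit_index to the largest N
# such that a majority of nodes hold entry N, and the entry at N was
# created in the current term (Raft's commit-safety restriction).
def advance_commit_index(log_terms, match_index, commit_index, current_term):
    """log_terms: term of the entry at each 1-based log index.
    match_index: highest replicated index per follower.
    The leader always counts itself as holding every entry."""
    n_nodes = len(match_index) + 1  # followers plus the leader
    for n in range(len(log_terms), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index if m >= n)
        if replicas * 2 > n_nodes and log_terms[n - 1] == current_term:
            return n
    return commit_index
```

With the five-node example above (three ACKs out of five), an entry from the current term commits; an entry from an older term does not commit directly, even on a majority, until a current-term entry after it commits.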

Membership Changes

The original Raft paper describes joint consensus, a two-phase transition through an overlapping configuration:

C_old          C_old,new        C_new
  │                │               │
  │── add server ──│── commit ─────│
  │                │               │
  Old config    Transitional    New config
  decides       both configs    decides
                must agree

The simpler single-server approach (add/remove one node at a time) avoids the complexity of joint consensus and is used by most implementations.

PBFT (Practical Byzantine Fault Tolerance)

Castro and Liskov (1999). Tolerates f Byzantine faults among n ≥ 3f + 1 nodes.

Client ──request──> Primary (leader)
  │                    │
  │              PRE-PREPARE(v, n, m)
  │                    │──────────> R1, R2, R3
  │                    │
  │              PREPARE(v, n, d, i)
  │           R1 <──────────> R2 <──────────> R3
  │                    │
  │              COMMIT(v, n, d, i)
  │           R1 <──────────> R2 <──────────> R3
  │                    │
  │   <── REPLY ────── R1, R2, R3 (client waits for f+1 matching)

  Phases: PRE-PREPARE → PREPARE → COMMIT → REPLY
  Communication: O(n^2) messages per consensus round

View change replaces a faulty primary, similar to leader election in Raft but more complex due to Byzantine assumptions.
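The n ≥ 3f + 1 bound follows from simple quorum arithmetic: with quorums of 2f + 1, any two quorums intersect in at least f + 1 nodes, so at least one correct node witnesses both. A quick check:

```python
# Why n = 3f + 1: two quorums of size 2f + 1 overlap in at least
# (2f + 1) + (2f + 1) - n = f + 1 nodes, so even if all f faulty
# nodes sit in the intersection, one correct node remains.
def pbft_sizes(f):
    n = 3 * f + 1       # total replicas
    quorum = 2 * f + 1  # prepare/commit quorum size
    overlap = 2 * quorum - n
    return n, quorum, overlap
```

For f = 1 this gives n = 4, quorums of 3, and a guaranteed overlap of 2 nodes, of which at least one is correct.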

Modern BFT Protocols

Tendermint

Used in Cosmos blockchain. Round-based BFT with two voting phases:

Propose → Prevote → Precommit → Commit
   │         │           │
   │    2/3+ prevotes   2/3+ precommits
   │    for proposal    for proposal
   │                         │
   └─── round timeout ──────┘ (move to next round)

Advantages: deterministic finality, simpler than PBFT, O(n) message complexity with threshold signatures.
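The "2/3+" thresholds above are counted by voting power, not node count. A minimal sketch of the check (illustrative; real Tendermint tallies weighted validator votes per round and step):

```python
# Tendermint-style threshold (sketch): a step passes only when
# STRICTLY more than 2/3 of total voting power has voted for it.
# Integer arithmetic avoids floating-point edge cases.
def has_two_thirds(votes_power, total_power):
    return 3 * votes_power > 2 * total_power
```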

HotStuff

Used in Facebook's (now Meta's) Diem/Libra. Achieves O(n) communication per round using a star topology and threshold signatures.

Basic HotStuff Pipeline:
  Round k:     PREPARE    ──>
  Round k+1:   PRE-COMMIT ──>  (for round k)
  Round k+2:   COMMIT     ──>  (for round k)
  Round k+3:   DECIDE     ──>  (for round k)

  Chained HotStuff: pipeline all phases
  Round k:   GENERIC message serves as PREPARE for k,
             PRE-COMMIT for k-1, COMMIT for k-2, DECIDE for k-3

Key innovation: optimistic responsiveness, meaning the leader advances as soon as it collects n-f messages rather than waiting out a fixed timeout.
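The chained pipeline's commit rule can be sketched as a "three-chain" check: a block commits once it heads three blocks in consecutive rounds, each carrying a quorum certificate (QC) for its parent. This is a simplified model with invented data shapes, not HotStuff's actual message format:

```python
# Chained HotStuff commit rule (sketch): block b commits once a
# three-chain b <- b' <- b'' exists in consecutive rounds, each link
# certified by a QC for its parent.
def committed_round(chain):
    """chain: list of (round, parent_round) pairs for QC-certified
    blocks, oldest first. Returns the round of the newest committed
    block, or None if no three-chain exists yet."""
    for i in range(len(chain) - 1, 1, -1):
        r2, p2 = chain[i]
        r1, p1 = chain[i - 1]
        r0, _ = chain[i - 2]
        # QC links must match and rounds must be consecutive.
        if p2 == r1 and p1 == r0 and r2 == r1 + 1 and r1 == r0 + 1:
            return r0
    return None
```

A round gap (e.g. after a view change) breaks the consecutive-round condition, so the chain must rebuild before anything older commits.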

Comparison

| Protocol | Fault Model | Nodes Needed | Msg Complexity | Latency (rounds) |
|---|---|---|---|---|
| Paxos | Crash | 2f + 1 | O(n) | 2 |
| Raft | Crash | 2f + 1 | O(n) | 2 |
| PBFT | Byzantine | 3f + 1 | O(n^2) | 3 |
| Tendermint | Byzantine | 3f + 1 | O(n) | 3 |
| HotStuff | Byzantine | 3f + 1 | O(n) | 4 (3 pipelined) |

Real-World Deployments

| System | Protocol | Use Case |
|---|---|---|
| etcd | Raft | Kubernetes cluster state |
| CockroachDB | Multi-Raft | Distributed SQL |
| Consul | Raft | Service discovery |
| Cosmos/Tendermint | Tendermint BFT | Blockchain consensus |
| Diem (discontinued) | HotStuff | Permissioned blockchain |
| ZooKeeper | Zab (Paxos variant) | Distributed coordination |

Key Takeaways

  • Paxos is theoretically elegant but notoriously difficult to implement correctly. Raft provides equivalent guarantees with a clearer structure.
  • Raft separates concerns (leader election, log replication, safety) making it easier to reason about and implement.
  • BFT protocols handle malicious nodes but pay a cost in message complexity and latency. Modern variants (HotStuff) reduce this to O(n) with pipelining.
  • All these protocols assume partial synchrony for liveness. Under full asynchrony, the FLP impossibility result applies and only randomized protocols can guarantee termination.