Lock-Free Programming

Progress Guarantees

Three levels of non-blocking guarantees, from weakest to strongest:

| Guarantee | Definition |
|-----------|------------|
| Obstruction-free | A single thread, run in isolation, completes in a finite number of steps |
| Lock-free | At least one thread makes progress in a finite number of steps (system-wide progress) |
| Wait-free | Every thread completes in a bounded number of steps |

Lock-free algorithms avoid locks entirely, using atomic operations instead. They prevent deadlock and priority inversion but may allow individual thread starvation (unless wait-free).

Compare-and-Swap (CAS)

The fundamental building block of lock-free programming.

// Atomically: if *addr == expected, set *addr = desired, return true
//             else set expected = *addr, return false
bool compare_and_swap(atomic_t *addr, expected_t *expected, desired_t desired);

Hardware Support

| Architecture | Instruction |
|--------------|-------------|
| x86/x64 | CMPXCHG, CMPXCHG16B (double-width) |
| ARM | LDXR/STXR (LL/SC pair) |
| RISC-V | LR/SC (LL/SC pair) |

LL/SC (Load-Linked/Store-Conditional) is more general than CAS: SC fails if any write occurred to the cache line (not just the target word), which can cause spurious failures but avoids ABA naturally on some implementations.

CAS Loop Pattern

do {
    old = atomic_load(&shared);
    new_val = compute(old);
} while (!atomic_compare_exchange_weak(&shared, &old, new_val));

compare_exchange_weak may fail spuriously (on LL/SC architectures) but is cheaper in loops. compare_exchange_strong never fails spuriously.

The ABA Problem

Thread 1 reads value A, gets preempted. Thread 2 changes A to B then back to A. Thread 1's CAS succeeds despite intervening modifications, potentially corrupting the data structure.

Example: Lock-Free Stack

Stack: A -> B -> C
Thread 1: reads top=A, next=B (preempted)
Thread 2: pops A, pops B, pushes A back
Stack: A -> C
Thread 1: CAS(top, A, B) succeeds -- but B was freed!

Solution 1: Tagged Pointers

Append a monotonically increasing counter to the pointer. CAS on the pair (pointer, tag).

struct tagged_ptr {
    void *ptr;
    uint64_t tag;
};
// Use CMPXCHG16B (x86-64) or double-width CAS

The tag increments on every modification. ABA requires the tag to wrap around, which is practically impossible with 64-bit tags.

Solution 2: Hazard Pointers (Michael, 2004)

Each thread publishes pointers it is currently accessing. Memory cannot be freed if any thread's hazard pointer references it.

Per-thread: hazard_pointer[0..K-1]  (small fixed array)
Retired list: per-thread list of nodes to be freed

Access protocol:
  1. Load pointer into hazard pointer slot
  2. Verify pointer is still valid (re-read source)
  3. Use the object safely
  4. Clear hazard pointer when done

Reclamation:
  1. Retire node to local retired list
  2. When list exceeds threshold, scan all hazard pointers
  3. Free retired nodes not referenced by any hazard pointer

Properties:

  • O(N*K) memory overhead (N threads, K hazard pointers each)
  • Bounded memory usage with amortized reclamation
  • No false positives (never frees in-use memory)

Solution 3: Epoch-Based Reclamation (EBR)

Simpler than hazard pointers but provides weaker bounds on memory reclamation.

Global epoch counter: E (values cycle through 0, 1, 2)
Per-thread: local_epoch, active flag
Per-epoch: retired_list[3]

Enter critical section:
  active = true
  local_epoch = global_epoch

Exit critical section:
  active = false

Retire node:
  Add to retired_list[global_epoch]

Try advance epoch:
  If all active threads have local_epoch == global_epoch:
    Free all nodes in retired_list[(global_epoch + 1) % 3]
    global_epoch++

Limitation: a stalled thread prevents epoch advancement, potentially unbounded memory growth. Hybrid schemes (epoch + hazard pointers) address this.

Treiber Stack (Lock-Free Stack)

The simplest lock-free data structure. Uses a CAS on the head pointer.

#include <atomic>

struct node {
    void *data;
    node *next;
};
std::atomic<node*> head;

void push(node *n) {
    n->next = head.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads the current head
    // into n->next, so the loop body is just the retry.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
        ;
}

node* pop() {
    node *old_head;
    do {
        old_head = head.load(std::memory_order_acquire);
        if (!old_head) return nullptr;
    } while (!head.compare_exchange_weak(old_head, old_head->next,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed));
    return old_head;
}

Vulnerable to ABA on pop -- must combine with a reclamation scheme.

Performance Characteristics

  • High contention on the head pointer limits scalability
  • Suitable for low-contention scenarios or as a building block
  • Elimination backoff: under contention, pair a push and pop to cancel without touching the stack

Michael-Scott Queue (Lock-Free Queue)

Lock-free FIFO queue with separate head and tail pointers, reducing contention.

Structure:
  sentinel (dummy) node at head
  head -> sentinel -> node1 -> node2 -> ... -> tail

Enqueue:
  1. Allocate new node, set next=NULL
  2. Loop:
     a. Read tail, tail->next
     b. If tail->next == NULL:
        CAS(tail->next, NULL, new_node)
        If success: CAS(tail, old_tail, new_node)  // advance tail
        Break
     c. Else: CAS(tail, old_tail, tail->next)      // help advance tail

Dequeue:
  1. Loop:
     a. Read head, tail, head->next
     b. If head == tail:
        If head->next == NULL: return EMPTY
        Else: CAS(tail, old_tail, head->next)       // help lagging tail
     c. Else:
        Read value from head->next
        CAS(head, old_head, head->next)
        If success: free old_head, return value

Key design insight: any thread can help advance the tail pointer, ensuring lock-freedom.

Read-Copy-Update (RCU)

A synchronization mechanism optimized for read-heavy workloads. Readers never block.

Core Idea

Read:  enter RCU read-side critical section (nearly zero cost)
       dereference RCU-protected pointer
       exit critical section

Update:
  1. Copy the existing structure
  2. Modify the copy
  3. Atomically swap the pointer (rcu_assign_pointer)
  4. Wait for all pre-existing readers to finish (synchronize_rcu / call_rcu)
  5. Free old version

API (Linux kernel style)

rcu_read_lock();               // mark reader critical section start
p = rcu_dereference(gp);       // load pointer with dependency ordering
// use *p
rcu_read_unlock();             // mark reader critical section end

// Writer
struct obj *new = copy_and_modify(old);
rcu_assign_pointer(gp, new);  // publish with release semantics
synchronize_rcu();             // wait for all pre-existing readers
kfree(old);

Grace Period

A grace period is the interval during which all threads pass through at least one quiescent state (a point where they hold no RCU references). After a grace period, old data can be safely freed.

Properties

  • Readers: near-zero overhead (no atomic operations; no memory barriers on many architectures)
  • Writers: bear the cost of copying and waiting
  • Ideal when reads are orders of magnitude more frequent than writes
  • Used extensively in the Linux kernel (routing tables, module unloading)

Wait-Free Algorithms

Every thread completes in bounded steps regardless of other threads' behavior.

Techniques

  • Helping: fast threads help slow threads complete their operations
  • Universal construction: any sequential object can be made wait-free using a consensus object (theoretical, rarely practical)
  • Wait-free counters: use per-thread counters with periodic combining

Kogan-Petrank Wait-Free Queue

Extends Michael-Scott queue with a helping mechanism:

  1. Each operation gets a phase number
  2. Threads help pending operations from earlier phases before their own
  3. Guarantees every operation completes within O(N) steps

Wait-free algorithms are typically 2-5x slower than lock-free equivalents under low contention but provide hard real-time guarantees.

Practical Considerations

When to Use Lock-Free

  • High contention with many threads
  • Real-time requirements (no unbounded blocking)
  • Fault tolerance (no risk of a thread dying while holding a lock)

When NOT to Use Lock-Free

  • Simple, low-contention cases (mutex is simpler and often faster)
  • Complex data structures (correctness is extremely hard to verify)
  • When memory reclamation is a concern (adds significant complexity)

Debugging and Testing

  • Formal verification (TLA+, SPIN)
  • Stress testing with thread sanitizer (TSan)
  • Model checking with CDSChecker or GenMC
  • Linearizability testing with tools like Lincheck