Lock-Free Programming
Progress Guarantees
Three levels of non-blocking guarantees, from weakest to strongest:
| Guarantee | Definition |
|-----------|-----------|
| Obstruction-free | A single thread, running in isolation, completes in a finite number of steps |
| Lock-free | At least one thread makes progress in a finite number of steps (system-wide progress) |
| Wait-free | Every thread completes in a bounded number of steps |
Lock-free algorithms avoid locks entirely, using atomic operations instead. They prevent deadlock and priority inversion but may allow individual thread starvation (unless wait-free).
Compare-and-Swap (CAS)
The fundamental building block of lock-free programming.
// Atomically: if *addr == *expected, set *addr = desired and return true;
// otherwise, set *expected = *addr and return false
bool compare_and_swap(atomic_t *addr, expected_t *expected, desired_t desired);
Hardware Support
| Architecture | Instruction |
|-------------|-------------|
| x86/x64 | CMPXCHG, CMPXCHG16B (double-width) |
| ARM | LDXR/STXR (LL/SC pair) |
| RISC-V | LR/SC (LL/SC pair) |
LL/SC (Load-Linked/Store-Conditional) is more general than CAS: SC fails if any write occurred to the cache line (not just the target word), which can cause spurious failures but avoids ABA naturally on some implementations.
CAS Loop Pattern
do {
old = atomic_load(&shared);
new_val = compute(old);
} while (!atomic_compare_exchange_weak(&shared, &old, new_val));
compare_exchange_weak may fail spuriously (on LL/SC architectures) but is cheaper in loops. compare_exchange_strong never fails spuriously.
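The CAS loop above can be exercised with a small, self-contained example: a lock-free "update to maximum" helper. This is a sketch; `update_max` is an illustrative name, not a standard API.

```cpp
#include <atomic>
#include <cassert>

// Lock-free "update to maximum" using the CAS loop pattern.
// Returns the value stored after the attempt.
inline int update_max(std::atomic<int>& slot, int candidate) {
    int old = slot.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `old` with the current
    // value, so no explicit re-read is needed inside the loop.
    while (old < candidate &&
           !slot.compare_exchange_weak(old, candidate,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // retry: `old` now holds the freshly observed value
    }
    return slot.load(std::memory_order_relaxed);
}
```

Note that the weak variant is the right choice here: a spurious failure simply triggers one more loop iteration.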
The ABA Problem
Thread 1 reads value A, gets preempted. Thread 2 changes A to B then back to A. Thread 1's CAS succeeds despite intervening modifications, potentially corrupting the data structure.
Example: Lock-Free Stack
Stack: A -> B -> C
Thread 1: reads top=A, next=B (preempted)
Thread 2: pops A, pops B, pushes A back
Stack: A -> C
Thread 1: CAS(top, A, B) succeeds -- but B was freed!
Solution 1: Tagged Pointers
Append a monotonically increasing counter to the pointer. CAS on the pair (pointer, tag).
struct tagged_ptr {
void *ptr;
uint64_t tag;
};
// Use CMPXCHG16B (x86-64) or double-width CAS
The tag increments on every modification. ABA requires the tag to wrap around, which is practically impossible with 64-bit tags.
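A minimal sketch of the tagged-pointer idea: to stay portable (double-width CAS on a 16-byte struct may require CMPXCHG16B or fall back to a lock), this toy version packs a 32-bit value and a 32-bit tag into one 64-bit word so a single-width CAS covers both. `pack`/`tagged_cas` are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Pack a 32-bit value (standing in for a pointer) and a 32-bit tag into
// one 64-bit word. Real implementations tag full 64-bit pointers and use
// CMPXCHG16B or another double-width CAS.
inline uint64_t pack(uint32_t value, uint32_t tag) {
    return (uint64_t(tag) << 32) | value;
}
inline uint32_t value_of(uint64_t w) { return uint32_t(w); }
inline uint32_t tag_of(uint64_t w)   { return uint32_t(w >> 32); }

// CAS that succeeds only if both value and tag match, then bumps the tag.
// An ABA sequence leaves the same value but a different tag, so a stale
// expected word fails the compare as intended.
inline bool tagged_cas(std::atomic<uint64_t>& cell,
                       uint64_t expected, uint32_t new_value) {
    uint64_t desired = pack(new_value, tag_of(expected) + 1);
    return cell.compare_exchange_strong(expected, desired);
}
```

A thread holding a stale snapshot fails the CAS even if the value component has been restored, which is exactly the ABA defense.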
Solution 2: Hazard Pointers (Michael, 2004)
Each thread publishes pointers it is currently accessing. Memory cannot be freed if any thread's hazard pointer references it.
Per-thread: hazard_pointer[0..K-1] (small fixed array)
Retired list: per-thread list of nodes to be freed
Access protocol:
1. Load pointer into hazard pointer slot
2. Verify pointer is still valid (re-read source)
3. Use the object safely
4. Clear hazard pointer when done
Reclamation:
1. Retire node to local retired list
2. When list exceeds threshold, scan all hazard pointers
3. Free retired nodes not referenced by any hazard pointer
Properties:
- O(N*K) memory overhead (N threads, K hazard pointers each)
- Bounded memory usage with amortized reclamation
- No false positives (never frees in-use memory)
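The access and reclamation protocols above can be sketched as follows, assuming one hazard pointer per thread and a caller-managed retired list (`protect`, `clear`, and `scan` are illustrative names, not Michael's original API):

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Toy hazard-pointer scheme: MAX_THREADS slots, one hazard pointer each.
constexpr int MAX_THREADS = 8;
std::atomic<void*> hazard[MAX_THREADS];  // zero-initialized to nullptr

// Steps 1-2 of the access protocol: announce the pointer, then re-read
// the source until the announcement is known to cover a reachable node.
void* protect(int tid, std::atomic<void*>& src) {
    void* p;
    do {
        p = src.load(std::memory_order_acquire);
        hazard[tid].store(p, std::memory_order_seq_cst);
    } while (src.load(std::memory_order_acquire) != p);  // re-validate
    return p;
}

// Step 4: clear the slot once the object is no longer in use.
void clear(int tid) { hazard[tid].store(nullptr, std::memory_order_release); }

// Reclamation steps 2-3: remove from `retired` and return the nodes not
// referenced by any hazard pointer; only those may actually be freed.
std::vector<void*> scan(std::vector<void*>& retired) {
    std::vector<void*> freeable;
    auto it = retired.begin();
    while (it != retired.end()) {
        bool in_use = false;
        for (int t = 0; t < MAX_THREADS; ++t)
            if (hazard[t].load() == *it) { in_use = true; break; }
        if (in_use) ++it;
        else { freeable.push_back(*it); it = retired.erase(it); }
    }
    return freeable;
}
```

The re-validation loop in `protect` is the subtle part: without it, the pointer could be retired and scanned between the load and the hazard-pointer store.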
Solution 3: Epoch-Based Reclamation (EBR)
Simpler than hazard pointers but provides weaker bounds on memory reclamation.
Global epoch counter: E (values cycle through 0, 1, 2)
Per-thread: local_epoch, active flag
Per-epoch: retired_list[3]
Enter critical section:
active = true
local_epoch = global_epoch
Exit critical section:
active = false
Retire node:
Add to retired_list[global_epoch]
Try advance epoch:
If all active threads have local_epoch == global_epoch:
Free all nodes in retired_list[(global_epoch + 1) % 3]
global_epoch++
Limitation: a stalled thread prevents epoch advancement, potentially unbounded memory growth. Hybrid schemes (epoch + hazard pointers) address this.
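The pseudocode above can be mirrored in a compact single-process sketch. `Ebr` and its members are illustrative; a monotonically increasing epoch counter indexed mod 3 stands in for the cycling 0/1/2 epoch, and a `freed` count stands in for actually calling `free()`.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Toy epoch-based reclamation mirroring the pseudocode above.
struct Ebr {
    static constexpr int THREADS = 4;
    std::atomic<unsigned> global_epoch{0};
    unsigned local_epoch[THREADS] = {};
    bool     active[THREADS]      = {};
    std::vector<void*> retired[3];
    int freed = 0;  // counts reclaimed nodes (stand-in for free())

    void enter(int t) { active[t] = true; local_epoch[t] = global_epoch.load(); }
    void exit(int t)  { active[t] = false; }
    void retire(void* p) { retired[global_epoch.load() % 3].push_back(p); }

    // Advance only if every active thread has observed the current epoch;
    // nodes retired two epochs ago can then no longer be referenced.
    bool try_advance() {
        unsigned e = global_epoch.load();
        for (int t = 0; t < THREADS; ++t)
            if (active[t] && local_epoch[t] != e) return false;
        auto& old = retired[(e + 1) % 3];
        freed += static_cast<int>(old.size());
        old.clear();
        global_epoch.store(e + 1);
        return true;
    }
};
```

The stalled-reader limitation is visible directly: while any thread is active in an old epoch, `try_advance` returns false and retired lists only grow.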
Treiber Stack (Lock-Free Stack)
The simplest lock-free data structure. Uses a CAS on the head pointer.
struct node {
void *data;
struct node *next;
};
atomic<node*> head;
void push(node *n) {
do {
n->next = head.load(relaxed);
} while (!head.compare_exchange_weak(n->next, n, release, relaxed));
}
node* pop() {
node *old_head;
do {
old_head = head.load(acquire);
if (!old_head) return NULL;
} while (!head.compare_exchange_weak(old_head, old_head->next,
relaxed, relaxed));
return old_head;
}
Vulnerable to ABA on pop -- must combine with a reclamation scheme.
Performance Characteristics
- High contention on the head pointer limits scalability
- Suitable for low-contention scenarios or as a building block
- Elimination backoff: under contention, a colliding push and pop exchange values directly and cancel each other without touching the stack
Michael-Scott Queue (Lock-Free Queue)
Lock-free FIFO queue with separate head and tail pointers, reducing contention.
Structure:
sentinel (dummy) node at head
head -> sentinel -> node1 -> node2 -> ... -> tail
Enqueue:
1. Allocate new node, set next=NULL
2. Loop:
a. Read tail, tail->next
b. If tail->next == NULL:
CAS(tail->next, NULL, new_node)
If success: CAS(tail, old_tail, new_node) // advance tail
Break
c. Else: CAS(tail, old_tail, tail->next) // help advance tail
Dequeue:
1. Loop:
a. Read head, tail, head->next
b. If head == tail:
If head->next == NULL: return EMPTY
Else: CAS(tail, old_tail, head->next) // help lagging tail
c. Else:
Read value from head->next
CAS(head, old_head, head->next)
    If success: retire old_head (safe to free only via a reclamation scheme), return value
Key design insight: any thread can help advance the tail pointer, ensuring lock-freedom.
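The enqueue/dequeue steps above translate into the following sketch over `int` payloads. Dequeued sentinel nodes are intentionally leaked here: retiring them safely is exactly what hazard pointers or epochs are for.

```cpp
#include <atomic>
#include <cassert>

// Minimal Michael-Scott queue. Single-width CAS only; no reclamation.
struct MsQueue {
    struct Node {
        int value;
        std::atomic<Node*> next{nullptr};
        explicit Node(int v = 0) : value(v) {}
    };
    std::atomic<Node*> head, tail;

    MsQueue() {
        Node* dummy = new Node();   // sentinel node
        head.store(dummy);
        tail.store(dummy);
    }

    void enqueue(int v) {
        Node* n = new Node(v);
        for (;;) {
            Node* t    = tail.load();
            Node* next = t->next.load();
            if (t != tail.load()) continue;              // tail moved, retry
            if (next == nullptr) {
                if (t->next.compare_exchange_weak(next, n)) {
                    tail.compare_exchange_strong(t, n);  // advance tail
                    return;
                }
            } else {
                tail.compare_exchange_strong(t, next);   // help lagging tail
            }
        }
    }

    bool dequeue(int* out) {
        for (;;) {
            Node* h    = head.load();
            Node* t    = tail.load();
            Node* next = h->next.load();
            if (h != head.load()) continue;
            if (h == t) {
                if (next == nullptr) return false;       // queue empty
                tail.compare_exchange_strong(t, next);   // help lagging tail
            } else {
                int v = next->value;                     // read before CAS
                if (head.compare_exchange_strong(h, next)) {
                    *out = v;
                    // real code: retire h via hazard pointers / epochs
                    return true;
                }
            }
        }
    }
};
```

Note that the value is read from `head->next` before the CAS: after the CAS succeeds, another thread may already be dequeuing past that node.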
Read-Copy-Update (RCU)
A synchronization mechanism optimized for read-heavy workloads. Readers never block.
Core Idea
Read: enter RCU read-side critical section (nearly zero cost)
dereference RCU-protected pointer
exit critical section
Update:
1. Copy the existing structure
2. Modify the copy
3. Atomically swap the pointer (rcu_assign_pointer)
4. Wait for all pre-existing readers to finish (synchronize_rcu / call_rcu)
5. Free old version
API (Linux kernel style)
rcu_read_lock(); // mark reader critical section start
p = rcu_dereference(gp); // load pointer with acquire semantics
// use *p
rcu_read_unlock(); // mark reader critical section end
// Writer
struct obj *new = copy_and_modify(old);
rcu_assign_pointer(gp, new); // publish with release semantics
synchronize_rcu(); // wait for all pre-existing readers
kfree(old);
Grace Period
A grace period is the interval during which all threads pass through at least one quiescent state (a point where they hold no RCU references). After a grace period, old data can be safely freed.
Properties
- Readers: near-zero overhead (no atomic read-modify-writes; no memory barriers on many architectures)
- Writers: bear the cost of copying and waiting
- Ideal when reads are orders of magnitude more frequent than writes
- Used extensively in the Linux kernel (routing tables, module unloading)
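The update protocol can be sketched in userspace with per-reader flags standing in for the kernel's quiescent-state machinery (liburcu provides production implementations). This toy version is conservative: `synchronize_rcu` may also wait for readers that started after the new pointer was published, which is safe but can delay the writer. Names mirror the kernel API but are reimplemented here.

```cpp
#include <atomic>
#include <cassert>

// Toy userspace RCU: one writer, up to READERS readers.
constexpr int READERS = 4;
std::atomic<int>  reader_active[READERS];  // nonzero inside a read section
std::atomic<int*> gp{nullptr};             // the RCU-protected pointer

void rcu_read_lock(int tid)   { reader_active[tid].store(1, std::memory_order_seq_cst); }
void rcu_read_unlock(int tid) { reader_active[tid].store(0, std::memory_order_release); }
int* rcu_dereference()        { return gp.load(std::memory_order_acquire); }

// Publish the new version with release semantics.
void rcu_assign_pointer(int* p) { gp.store(p, std::memory_order_release); }

// Grace period: spin until each reader has been observed outside a read
// section at least once. A reader seen inactive and re-entering afterwards
// necessarily sees the newly published pointer, so it is safe to skip.
void synchronize_rcu() {
    for (int t = 0; t < READERS; ++t)
        while (reader_active[t].load(std::memory_order_acquire) != 0) {
            // wait for this pre-existing reader to finish
        }
}
```

After `synchronize_rcu` returns, no reader can still hold a reference to the old version, so the writer may free it.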
Wait-Free Algorithms
Every thread completes in bounded steps regardless of other threads' behavior.
Techniques
- Helping: fast threads help slow threads complete their operations
- Universal construction: any sequential object can be made wait-free using a consensus object (theoretical, rarely practical)
- Wait-free counters: use per-thread counters with periodic combining
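As a concrete instance of the last bullet, here is a sketch of a striped counter (`StripedCounter` is an illustrative name): each thread increments its own slot with a single atomic RMW, so an increment completes in a bounded number of steps; a read sums all slots and is eventually consistent rather than a linearizable snapshot.

```cpp
#include <atomic>
#include <cassert>

// Wait-free striped counter: per-thread slots combined on read.
constexpr int SLOTS = 8;
struct StripedCounter {
    std::atomic<long> slot[SLOTS] = {};

    // Wait-free: one fetch_add on the caller's own slot, bounded steps.
    void increment(int tid) {
        slot[tid % SLOTS].fetch_add(1, std::memory_order_relaxed);
    }

    // O(SLOTS) and also bounded; the sum may miss concurrent increments
    // but never loses them permanently.
    long read() const {
        long sum = 0;
        for (int i = 0; i < SLOTS; ++i)
            sum += slot[i].load(std::memory_order_relaxed);
        return sum;
    }
};
```

Striping trades an exact snapshot for the absence of a single contended cache line, which is the same space-for-progress trade-off the helping techniques above make.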
Kogan-Petrank Wait-Free Queue
Extends Michael-Scott queue with a helping mechanism:
- Each operation gets a phase number
- Threads help pending operations from earlier phases before their own
- Guarantees every operation completes within O(N) steps
Wait-free algorithms are typically 2-5x slower than lock-free equivalents under low contention but provide hard real-time guarantees.
Practical Considerations
When to Use Lock-Free
- High contention with many threads
- Real-time requirements (no unbounded blocking)
- Fault tolerance (no risk of a thread dying while holding a lock)
When NOT to Use Lock-Free
- Simple, low-contention cases (mutex is simpler and often faster)
- Complex data structures (correctness is extremely hard to verify)
- When memory reclamation is a concern (adds significant complexity)
Debugging and Testing
- Formal verification (TLA+, SPIN)
- Stress testing with thread sanitizer (TSan)
- Model checking with CDSChecker or GenMC
- Linearizability testing with tools like Lincheck