Shared Memory Programming

Threads

A thread is the smallest unit of execution within a process. Threads within a process share the same address space, file descriptors, and heap, but each has its own stack and register state.

Thread Lifecycle

Created -> Runnable -> Running -> Blocked/Waiting -> Terminated

Kernel threads are scheduled by the OS. User-level threads (green threads, goroutines, fibers) are scheduled by a runtime, often M:N mapped onto kernel threads.

Thread vs Process

| Aspect | Thread | Process |
|--------|--------|---------|
| Memory | Shared address space | Isolated |
| Creation cost | Low (~10us) | High (~1ms) |
| Context switch | Cheaper | More expensive (TLB flush) |
| Communication | Direct memory access | IPC required |
| Fault isolation | None (one crash kills all) | Full |

POSIX Threads (pthreads)

The standard C threading API on Unix-like systems.

Core API

#include <pthread.h>

// Creation and joining
int pthread_create(pthread_t *t, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);
int pthread_join(pthread_t t, void **retval);
int pthread_detach(pthread_t t);

// Self-identification
pthread_t pthread_self(void);

// Cancellation
int pthread_cancel(pthread_t t);
int pthread_setcancelstate(int state, int *oldstate);

Thread Attributes

pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
pthread_attr_setstacksize(&attr, 1 << 20);  // 1 MB stack
pthread_attr_destroy(&attr);

Mutexes

Mutual exclusion locks protect critical sections.

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&lock);    // blocking acquire
pthread_mutex_trylock(&lock); // non-blocking, returns EBUSY if held
pthread_mutex_unlock(&lock);
pthread_mutex_destroy(&lock);

Mutex Types

| Type | Behavior on re-lock by owner |
|------|------------------------------|
| PTHREAD_MUTEX_NORMAL | Deadlock (undefined) |
| PTHREAD_MUTEX_ERRORCHECK | Returns EDEADLK |
| PTHREAD_MUTEX_RECURSIVE | Increments lock count |

Recursive mutexes allow the same thread to acquire the lock multiple times (must unlock the same number of times). Use sparingly; they often mask design issues.

Spinlocks

Busy-wait locks. Efficient when critical sections are short and contention is low.

pthread_spinlock_t slock;
pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE);
pthread_spin_lock(&slock);
pthread_spin_unlock(&slock);
pthread_spin_destroy(&slock);

Test-and-Set Spinlock

Simplest implementation, but causes heavy cache-line bouncing under contention.

while (atomic_exchange(&lock, 1) == 1) {
    // spin
}
// critical section
atomic_store(&lock, 0);

Test-and-Test-and-Set (TTAS): spin on a read (cache hit) and only attempt exchange when the lock appears free. Reduces bus traffic.

Ticket Lock

Provides FIFO fairness, preventing starvation.

struct ticket_lock {
    atomic_uint next_ticket;
    atomic_uint now_serving;
};

void lock(struct ticket_lock *l) {
    unsigned int my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ; // spin
}

void unlock(struct ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);
}

Problem: all waiters spin on the same cache line (now_serving), causing invalidation storms.

MCS Lock

Each thread spins on its own local variable, eliminating cache-line contention. FIFO fair.

Structure: linked list of queue nodes
- Each node has a {locked, next} pair
- Tail pointer tracks end of queue

Lock:
  1. Allocate node, set locked=true, next=NULL
  2. Swap tail to point to my node (atomic)
  3. If predecessor exists, set predecessor->next = my node
  4. Spin on my own node->locked

Unlock:
  1. If next is NULL, try CAS tail from my node to NULL
  2. If CAS fails, wait for next to appear
  3. Set next->locked = false

MCS is optimal for NUMA systems where local spinning is critical.

Condition Variables

Allow threads to block until a condition becomes true.

pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

// Waiter
pthread_mutex_lock(&mutex);
while (!condition)                    // always use while, not if
    pthread_cond_wait(&cond, &mutex); // atomically releases mutex and sleeps
// condition is true, mutex is held
pthread_mutex_unlock(&mutex);

// Signaler
pthread_mutex_lock(&mutex);
condition = true;
pthread_cond_signal(&cond);   // wake one waiter
// pthread_cond_broadcast(&cond); // wake all waiters
pthread_mutex_unlock(&mutex);

Spurious wakeups can occur, which is why the while loop is mandatory.

signal vs broadcast: use signal when any single waiter can proceed; use broadcast when waiters check different conditions or when the condition is relevant to all.

Semaphores

A semaphore is a non-negative integer counter with atomic wait (decrement) and post (increment) operations.

#include <semaphore.h>

sem_t sem;
sem_init(&sem, 0, initial_value);  // 0 = not shared across processes
sem_wait(&sem);    // decrement; blocks if value is 0
sem_post(&sem);    // increment; wakes one waiter
sem_getvalue(&sem, &val);
sem_destroy(&sem);

Binary Semaphore vs Mutex

  • Binary semaphore (initial value 1) can serve as a lock, but any thread can post
  • Mutex has ownership: only the locking thread should unlock
  • Prefer mutexes for mutual exclusion; use semaphores for signaling and resource counting

Producer-Consumer with Semaphores

sem_t empty, full;
sem_t mutex;

sem_init(&empty, 0, BUFFER_SIZE);
sem_init(&full, 0, 0);
sem_init(&mutex, 0, 1);

// Producer
sem_wait(&empty);
sem_wait(&mutex);
buffer[in] = item; in = (in + 1) % BUFFER_SIZE;
sem_post(&mutex);
sem_post(&full);

// Consumer
sem_wait(&full);
sem_wait(&mutex);
item = buffer[out]; out = (out + 1) % BUFFER_SIZE;
sem_post(&mutex);
sem_post(&empty);

Barriers

Synchronization point where all participating threads must arrive before any can proceed.

pthread_barrier_t barrier;
pthread_barrier_init(&barrier, NULL, num_threads);

// Each thread:
// ... do work phase 1 ...
pthread_barrier_wait(&barrier);
// ... do work phase 2 (all threads have completed phase 1) ...

pthread_barrier_destroy(&barrier);

The return value of pthread_barrier_wait is PTHREAD_BARRIER_SERIAL_THREAD for exactly one (unspecified) thread and 0 for all others (useful for single-threaded cleanup).

Sense-Reversing Barrier

Reusable barrier implementation that avoids race conditions between consecutive barrier uses by toggling a sense flag.

Read-Write Locks

Allow concurrent readers or a single writer, but not both.

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

pthread_rwlock_rdlock(&rwlock);   // shared read access
pthread_rwlock_wrlock(&rwlock);   // exclusive write access
pthread_rwlock_unlock(&rwlock);
pthread_rwlock_destroy(&rwlock);

Policies

  • Reader-preference: readers never block if lock is read-held (can starve writers)
  • Writer-preference: pending writers block new readers (can starve readers)
  • Fair: FIFO ordering (pthreads leaves the preference policy implementation-defined)

Read-write locks are beneficial when reads vastly outnumber writes and the critical section is non-trivial. For short critical sections, a plain mutex is often faster due to rwlock overhead.

Thread-Local Storage (TLS)

Per-thread data that avoids sharing and synchronization.

POSIX TLS (pthread keys)

pthread_key_t key;
pthread_key_create(&key, destructor_fn);  // destructor called on thread exit
pthread_setspecific(key, value);
void *val = pthread_getspecific(key);
pthread_key_delete(key);

C11 / GCC Extension

_Thread_local int tls_var;        // C11
__thread int tls_var;             // GCC extension
thread_local int tls_var;         // C++11

Common uses: errno, per-thread allocators, per-thread caches, random number generator state.

Implementation

TLS is typically implemented via a per-thread block referenced through a segment register (e.g., %fs on x86-64 Linux). Access is nearly as fast as a global variable.

Deadlock

Four necessary conditions (Coffman conditions):

  1. Mutual exclusion: resource held exclusively
  2. Hold and wait: hold one resource while waiting for another
  3. No preemption: resources cannot be forcibly taken
  4. Circular wait: cycle in the wait-for graph

Prevention: break any one condition. Most practical: impose a total ordering on locks and always acquire in order.

Detection: build a wait-for graph at runtime; detect cycles.

Avoidance: Banker's algorithm (rarely used in practice due to overhead).