Systems Programming Patterns
Zero-Copy I/O
Traditional I/O involves multiple copies between user space, kernel space, and devices. Zero-copy techniques eliminate unnecessary memory copies.
The Problem: Traditional File-to-Socket Transfer
Traditional path (4 copies, 4 context switches):
1. read(): disk -> kernel page cache -> user buffer (1 DMA copy + 1 CPU copy)
2. write(): user buffer -> kernel socket buffer -> NIC (1 CPU copy + 1 DMA copy)
Zero-copy path (0 user-space copies):
sendfile(): disk -> kernel page cache -> NIC (1 CPU copy, or 0 with SG-DMA)
sendfile()
sendfile() transfers data between two file descriptors entirely in kernel space, avoiding user-space copies.
// Send a file over a socket without copying to user space
off_t offset = 0;  // updated by the kernel as bytes are sent
ssize_t sent = sendfile(socket_fd, file_fd, &offset, count);
With scatter-gather DMA: If the NIC supports SG-DMA, sendfile() can achieve true zero-copy -- the NIC reads directly from page cache pages via DMA descriptors. No CPU-driven copies at all.
Limitations: Only works file-to-socket (or file-to-file on some systems). Cannot transform data in flight.
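In practice sendfile() may transfer fewer bytes than requested, so real code loops. A minimal sketch under stated assumptions: the send_whole_file helper name is illustrative, and the destination may be a regular file only on Linux >= 2.6.33 (otherwise it must be a socket):

```c
// Sketch: transfer an entire file through sendfile(), looping because
// the kernel may move fewer bytes than requested per call.
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int send_whole_file(int out_fd, int in_fd) {
    struct stat st;
    if (fstat(in_fd, &st) < 0) return -1;
    off_t offset = 0;                     // sendfile() advances this for us
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, st.st_size - offset);
        if (n <= 0) return -1;            // production code retries on EINTR
    }
    return 0;
}
```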
splice() and tee()
splice() moves data between a file descriptor and a pipe without copying to user space. Pipes serve as an in-kernel buffer.
// Move data from a file to a pipe, then from the pipe to a socket
splice(file_fd, &off, pipe_write, NULL, len, SPLICE_F_MOVE);
splice(pipe_read, NULL, socket_fd, NULL, len, SPLICE_F_MOVE);
tee() duplicates pipe data without consuming it -- useful for logging a copy of data flowing through a pipeline.
vmsplice() maps user-space buffers into a pipe, enabling zero-copy from user memory to a pipe (and onward via splice).
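The two splice() calls above can be wrapped into a relay loop that drains the pipe completely on each iteration. A hedged sketch (the splice_relay helper is illustrative; at least one fd in each splice() call must be a pipe):

```c
// Sketch: relay data from in_fd to out_fd through an anonymous pipe
// with splice(), never touching a user-space buffer.
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Moves up to len bytes; returns bytes moved, or -1 on error.
ssize_t splice_relay(int in_fd, int out_fd, size_t len) {
    int p[2];
    if (pipe(p) < 0) return -1;
    ssize_t total = 0;
    while ((size_t)total < len) {
        ssize_t in = splice(in_fd, NULL, p[1], NULL, len - total, SPLICE_F_MOVE);
        if (in <= 0) break;                        // EOF or error
        ssize_t out = 0;
        while (out < in) {                         // drain the pipe completely
            ssize_t n = splice(p[0], NULL, out_fd, NULL, in - out, SPLICE_F_MOVE);
            if (n <= 0) { close(p[0]); close(p[1]); return -1; }
            out += n;
        }
        total += in;
    }
    close(p[0]);
    close(p[1]);
    return total;
}
```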
Memory-Mapped I/O (mmap)
mmap() maps files or devices directly into the process's virtual address space. File reads become memory loads; writes become stores to page cache pages.
void *data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Access file contents as memory -- page faults bring pages in
process(data, file_size);
munmap(data, file_size);
Advantages:
- No explicit read/write syscalls -- the page fault handler loads data on demand
- Shared mappings (MAP_SHARED) enable zero-copy sharing between processes
- Random access patterns benefit from demand paging
Disadvantages:
- Page fault overhead for each new page (TLB miss + potential I/O)
- Difficult to handle errors (SIGBUS on I/O failure)
- No control over I/O scheduling (kernel decides prefetch, writeback)
- For sequential access, read() with readahead can outperform mmap
madvise hints:
- MADV_SEQUENTIAL -- aggressive readahead, free pages behind
- MADV_RANDOM -- disable readahead
- MADV_WILLNEED -- prefetch pages
- MADV_DONTNEED -- release pages immediately
Lock-Free Data Structures
Lock-free algorithms guarantee system-wide progress: at least one thread makes progress in a finite number of steps, even if other threads are suspended, preempted, or crash.
Compare-and-Swap (CAS)
CAS is the fundamental building block. Atomically: if *addr == expected, set *addr = desired and return true; else return false.
// Lock-free stack push using CAS
void push(Stack *s, Node *node) {
Node *old_head;
do {
old_head = atomic_load(&s->head);
node->next = old_head;
} while (!atomic_compare_exchange_weak(&s->head, &old_head, node));
}
ABA problem: Thread reads A, gets preempted, another thread changes A->B->A, first thread's CAS succeeds incorrectly. Solutions:
- Tagged pointers (pack a version counter with the pointer)
- Double-width CAS (DWCAS) on supported architectures
- Hazard pointers or epoch-based reclamation for memory management
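To make the hazard concrete, here is a self-contained C11 sketch pairing the push above with a matching pop. It is correct single-threaded; under concurrent pops with free(), the unprotected read of old_head->next is exactly where ABA and use-after-free strike, which is why the reclamation schemes below exist:

```c
// Minimal C11 lock-free stack: push plus a naive pop.
// The pop is ABA-unsafe without tagged pointers or safe reclamation.
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node { struct Node *next; int value; } Node;
typedef struct Stack { _Atomic(Node *) head; } Stack;

void push(Stack *s, Node *node) {
    Node *old_head = atomic_load(&s->head);
    do {
        node->next = old_head;    // CAS failure refreshes old_head
    } while (!atomic_compare_exchange_weak(&s->head, &old_head, node));
}

Node *pop(Stack *s) {
    Node *old_head = atomic_load(&s->head);
    while (old_head != NULL &&
           !atomic_compare_exchange_weak(&s->head, &old_head,
                                         old_head->next))
        ;                         // CAS failure refreshes old_head
    return old_head;              // NULL means the stack was empty
}
```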
Hazard Pointers
Hazard pointers solve safe memory reclamation in lock-free structures. A thread publishes pointers to nodes it is currently accessing. Other threads check these publications before freeing memory.
Thread 1:                     Thread 2:
hp[0] = node_ptr;             // wants to free node_ptr
memory_fence();               // scans all hazard pointers
// safely read *node_ptr      // finds node_ptr in hp[0]
hp[0] = NULL;                 // defers freeing until hp[0] is cleared
Properties:
- Per-thread overhead: O(H) hazard pointers (H typically 1-2)
- Scan overhead: O(N * H) where N = number of threads
- Bounded memory overhead: at most O(N * H * R) retired nodes (R = reclamation batch size)
- Patent-free since 2023
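The publish/scan protocol can be modeled in a few lines. This is a deliberately simplified, single-threaded toy (fixed thread count, a global retired list, no memory fences); real implementations add per-thread retire batches and careful ordering:

```c
// Toy model of hazard pointers: scan published pointers before freeing.
#include <stdlib.h>

#define MAX_THREADS 4
#define MAX_RETIRED 64

static void *hazard[MAX_THREADS];     // one published pointer per thread
static void *retired[MAX_RETIRED];
static int n_retired;
static int n_freed;                   // instrumentation for the sketch

static int is_hazardous(void *p) {
    for (int i = 0; i < MAX_THREADS; i++)
        if (hazard[i] == p) return 1;
    return 0;
}

// Free every retired node that no reader has published.
static void scan(void) {
    int kept = 0;
    for (int i = 0; i < n_retired; i++) {
        if (is_hazardous(retired[i]))
            retired[kept++] = retired[i];   // still in use: keep for later
        else {
            free(retired[i]);
            n_freed++;
        }
    }
    n_retired = kept;
}

static void retire(void *p) {
    retired[n_retired++] = p;
    if (n_retired == MAX_RETIRED) scan();   // amortize scans over batches
}
```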
Epoch-Based Reclamation (EBR)
EBR is simpler than hazard pointers: threads mark entry and exit of read-side critical sections, and a global epoch counter tracks which epoch each active thread has observed.
Global epoch: 2
Thread enters critical section:
local_epoch = global_epoch // record current epoch
Thread retires a node:
add node to retire_list[current_epoch]
Reclamation:
if all threads have observed epoch 2:
advance epoch to 3
free all nodes retired in epoch 0 // two epochs ago, guaranteed no readers
Advantage: Very low per-operation overhead (no per-pointer publication)
Disadvantage: A single stalled thread (e.g., descheduled) prevents all reclamation, causing unbounded memory growth
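The three-epoch scheme can be modeled directly: retirements go into the bucket for the current epoch, and advancing the epoch reclaims the bucket from two epochs ago. A single-threaded toy sketch (fixed thread count, no atomics, unchecked bucket capacity):

```c
// Toy model of epoch-based reclamation with three epoch buckets.
#include <stdlib.h>

#define MAX_THREADS 4
#define BUCKET_CAP 64

static unsigned global_epoch;
static unsigned local_epoch[MAX_THREADS];
static int active[MAX_THREADS];
static void *bucket[3][BUCKET_CAP];
static int bucket_len[3];
static int n_freed;                   // instrumentation for the sketch

static void enter(int tid) { active[tid] = 1; local_epoch[tid] = global_epoch; }
static void leave(int tid) { active[tid] = 0; }

static void retire(void *p) {
    int b = global_epoch % 3;
    bucket[b][bucket_len[b]++] = p;
}

// Advance the epoch and reclaim, but only if every active reader caught up.
static void try_advance(void) {
    for (int i = 0; i < MAX_THREADS; i++)
        if (active[i] && local_epoch[i] != global_epoch)
            return;                   // a lagging reader blocks reclamation
    global_epoch++;
    int b = global_epoch % 3;         // bucket retired two epochs ago
    for (int i = 0; i < bucket_len[b]; i++) { free(bucket[b][i]); n_freed++; }
    bucket_len[b] = 0;                // bucket is now reused for new retirements
}
```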
Read-Copy-Update (RCU)
RCU is the Linux kernel's solution for read-mostly, lock-free data structures. It separates read-side access from updates and defers reclamation.
Principles:
- Readers access shared data without locks (just rcu_read_lock(), which disables preemption)
- Writers create a new version of the data structure (copy), update the pointer atomically, then wait for all pre-existing readers to finish before freeing the old version
- Grace period: The interval during which all pre-existing readers complete. After a grace period, old data can be freed.
// Reader (essentially free -- no locks, no atomics)
rcu_read_lock();
struct node *p = rcu_dereference(gp);
// read p->field
rcu_read_unlock();
// Writer
struct node *new = kmalloc(...);
*new = *old; // copy
new->field = new_value; // modify copy
rcu_assign_pointer(gp, new); // publish (store_release)
synchronize_rcu(); // wait for grace period
kfree(old); // safe to free
RCU variants in Linux:
- synchronize_rcu() -- block until a grace period completes
- call_rcu() -- register a callback invoked asynchronously after the grace period
- SRCU (Sleepable RCU) -- allows sleeping in read-side critical sections
- Tasks RCU -- for voluntary context switch-based grace periods
Performance: Read-side cost is near zero (no atomics, no cache-line bouncing). RCU is used throughout the Linux kernel: routing tables, dentry cache, module unloading, inode lists.
io_uring Patterns
Basic Submission/Completion Pattern
struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
// Prepare a read operation
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_sqe_set_data(sqe, user_data);
// Submit all prepared operations
io_uring_submit(&ring);
// Reap completions
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res;
void *data = io_uring_cqe_get_data(cqe);
io_uring_cqe_seen(&ring, cqe);
Advanced Patterns
Linked operations (dependency chains):
// Read a file, then write to a socket -- second op starts after first completes
sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe1, file_fd, buf, len, 0);
sqe1->flags |= IOSQE_IO_LINK; // link to next SQE
sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe2, sock_fd, buf, len, 0);
Fixed files and buffers:
// Register file descriptors -- avoid per-I/O fd lookup
int fds[] = {fd1, fd2, fd3};
io_uring_register_files(&ring, fds, 3);
// Register buffers -- avoid per-I/O page pinning
struct iovec iovs[] = {{buf, len}};
io_uring_register_buffers(&ring, iovs, 1);
// Use registered resources (IOSQE_FIXED_FILE flag, fixed buffer index)
io_uring_prep_read_fixed(sqe, 0 /*registered file index*/, buf, len, off, 0 /*buffer index*/);
sqe->flags |= IOSQE_FIXED_FILE;
SQPOLL mode: A kernel thread polls the submission ring, eliminating io_uring_submit() syscalls entirely. The application just writes SQEs to the ring and reads CQEs -- no syscalls in the hot path.
struct io_uring_params params = { .flags = IORING_SETUP_SQPOLL, .sq_thread_idle = 2000 };
io_uring_queue_init_params(DEPTH, &ring, &params);
Multishot Operations
Multishot accept/recv: a single SQE generates multiple CQEs. Install one accept SQE, get a CQE for each new connection -- eliminates re-arming overhead.
io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
High-Performance Networking
DPDK (Data Plane Development Kit)
DPDK bypasses the kernel networking stack entirely. The application runs in user space with direct access to NIC queues via UIO or VFIO.
Architecture:
Traditional: App -> socket API -> TCP/IP stack -> driver -> NIC
DPDK: App -> DPDK PMD (poll-mode driver) -> NIC (no kernel)
Key concepts:
- Poll-mode drivers (PMDs): busy-poll NIC queues, no interrupts
- Hugepage-backed mbufs: pre-allocated packet buffers in huge pages
- Dedicated cores: lcore pinning -- each core runs a tight poll loop
- rte_ring: lock-free multi-producer/multi-consumer ring buffer for inter-core communication
Performance: 100+ Gbps line rate, <1 us latency, 100M+ PPS on modern hardware.
Trade-off: Consumes dedicated CPU cores (100% utilization even when idle). No kernel protection -- bugs in DPDK can crash the process or corrupt NIC state.
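The rte_ring concept from the list above can be illustrated with a minimal single-producer/single-consumer ring in C11 atomics. This is a sketch of the core head/tail index scheme only; rte_ring itself is multi-producer/multi-consumer with batched operations:

```c
// Minimal SPSC lock-free ring: producer owns head, consumer owns tail.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8                      // power of two for cheap masking

typedef struct {
    void *slots[RING_SIZE];
    _Atomic size_t head;                 // next slot to write (producer)
    _Atomic size_t tail;                 // next slot to read (consumer)
} SpscRing;

static bool ring_push(SpscRing *r, void *item) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE) return false;      // full
    r->slots[head & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_pop(SpscRing *r, void **item) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;                  // empty
    *item = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The release store on head publishes the written slot to the consumer's acquire load, so no locks are needed with exactly one producer and one consumer.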
RDMA (Remote Direct Memory Access)
RDMA allows one machine to read or write another machine's memory without involving the remote CPU or OS in the data path.
Verbs API operations:
| Verb          | Description                                   |
|---------------|-----------------------------------------------|
| ibv_post_send | Post a send work request (write/read/send)    |
| ibv_post_recv | Post a receive buffer                         |
| ibv_poll_cq   | Poll completion queue for finished operations |
RDMA modes:
- Send/Receive: Two-sided -- receiver posts buffers, sender sends to them
- RDMA Write: One-sided -- write directly to remote memory (remote CPU uninvolved)
- RDMA Read: One-sided -- read remote memory directly
Transports: InfiniBand (native), RoCE (RDMA over Converged Ethernet), iWARP (RDMA over TCP)
Use cases: Distributed storage (Ceph with RDMA), databases (Oracle RAC), HPC (MPI), key-value stores (HERD, FaRM)
AF_XDP
AF_XDP brings XDP's performance to user-space applications via a special socket type.
Kernel (XDP program) -> UMEM (shared memory) -> User space (AF_XDP socket)
Components:
- UMEM: Shared memory region between kernel and user space, divided into frames
- Fill ring: User space provides empty frames to the kernel
- Completion ring: Kernel returns transmitted frames to user space
- RX/TX rings: Receive and transmit descriptor rings
Comparison to DPDK:

| Aspect          | DPDK                     | AF_XDP                    |
|-----------------|--------------------------|---------------------------|
| Kernel bypass   | Complete                 | Partial (XDP in kernel)   |
| Driver support  | Proprietary PMDs         | Any XDP-capable NIC       |
| Kernel features | None (no firewall, etc.) | Coexist with kernel stack |
| CPU dedication  | Required                 | Optional                  |
| Performance     | ~100M PPS                | ~24M PPS (improving)      |
AF_XDP provides a middle ground: near-DPDK performance while retaining kernel manageability.
Performance Comparison
| Technique        | Latency   | Throughput | CPU Overhead    | Kernel Bypass |
|------------------|-----------|------------|-----------------|---------------|
| Standard sockets | ~10-50 us | ~10 Gbps   | High            | No            |
| io_uring         | ~5-20 us  | ~25 Gbps   | Medium          | Partial       |
| AF_XDP           | ~2-5 us   | ~50 Gbps   | Medium          | Partial       |
| DPDK             | ~0.5-2 us | ~100+ Gbps | Low (busy-poll) | Full          |
| RDMA             | ~1-3 us   | ~100+ Gbps | Near-zero       | Full          |
Putting It All Together
Modern high-performance systems combine these patterns:
Example: High-performance key-value store
- Networking: DPDK or AF_XDP for packet I/O
- Storage: io_uring with registered buffers for persistence
- Data structures: Lock-free hash table with RCU for reads
- Memory: Huge pages, NUMA-aware allocation
- Reclamation: Epoch-based for retired entries
- Replication: RDMA write for low-latency replication
Key Takeaways
- Zero-copy (sendfile, splice, mmap) eliminates CPU-driven memory copies in the I/O path, saving both CPU cycles and memory bandwidth
- Lock-free programming requires solving the reclamation problem: hazard pointers provide bounded memory, EBR provides lower overhead, RCU provides near-zero read-side cost
- CAS is the fundamental lock-free primitive; the ABA problem is its primary pitfall
- io_uring's shared rings, fixed resources, and SQPOLL mode bring near-kernel-bypass performance while remaining in the kernel I/O stack
- DPDK trades kernel functionality for maximum throughput; AF_XDP offers a compromise with kernel coexistence
- RDMA enables CPU-bypass networking where remote memory reads/writes happen without involving either side's CPU
- Production systems layer these patterns: DPDK/AF_XDP for networking, io_uring for storage, RCU for data structures, huge pages for memory