Systems Programming Patterns
Zero-Copy I/O
Traditional I/O involves multiple copies between user space, kernel space, and devices. Zero-copy techniques eliminate unnecessary memory copies.
The Problem: Traditional File-to-Socket Transfer
Traditional path (4 copies, 4 context switches):
1. read(): disk -> kernel page cache -> user buffer (1 DMA copy + 1 CPU copy)
2. write(): user buffer -> kernel socket buffer -> NIC (1 CPU copy + 1 DMA copy)
Zero-copy path (0 user-space copies):
sendfile(): disk -> kernel page cache -> NIC (1 CPU copy, or 0 with SG-DMA)
sendfile()
sendfile() transfers data between two file descriptors entirely in kernel space, avoiding user-space copies.
// Send a file over a socket without copying to user space
off_t offset = 0;  // updated by the kernel as bytes are sent
ssize_t sent = sendfile(socket_fd, file_fd, &offset, count);
With scatter-gather DMA: If the NIC supports SG-DMA, sendfile() can achieve true zero-copy -- the NIC reads directly from page cache pages via DMA descriptors. No CPU-driven copies at all.
Limitations: Only works file-to-socket (or file-to-file on some systems). Cannot transform data in flight.
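In practice sendfile() may transfer fewer bytes than requested, so real code loops. A minimal sketch under stated assumptions: the send_whole_file helper name is illustrative, and the destination may be a regular file only on Linux >= 2.6.33 (otherwise it must be a socket):

```c
// Sketch: transfer an entire file through sendfile(), looping because
// the kernel may move fewer bytes than requested per call.
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int send_whole_file(int out_fd, int in_fd) {
    struct stat st;
    if (fstat(in_fd, &st) < 0) return -1;
    off_t offset = 0;                     // sendfile() advances this for us
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset, st.st_size - offset);
        if (n <= 0) return -1;            // production code retries on EINTR
    }
    return 0;
}
```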
splice() and tee()
splice() moves data between a file descriptor and a pipe without copying to user space. Pipes serve as an in-kernel buffer.
// Move data from a file to a pipe, then from the pipe to a socket
splice(file_fd, &off, pipe_write, NULL, len, SPLICE_F_MOVE);
splice(pipe_read, NULL, socket_fd, NULL, len, SPLICE_F_MOVE);
tee() duplicates pipe data without consuming it -- useful for logging a copy of data flowing through a pipeline.
vmsplice() maps user-space buffers into a pipe, enabling zero-copy from user memory to a pipe (and onward via splice).
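The two splice() calls above can be wrapped into a relay loop that drains the pipe completely on each iteration. A hedged sketch (the splice_relay helper is illustrative; at least one fd in each splice() call must be a pipe):

```c
// Sketch: relay data from in_fd to out_fd through an anonymous pipe
// with splice(), never touching a user-space buffer.
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Moves up to len bytes; returns bytes moved, or -1 on error.
ssize_t splice_relay(int in_fd, int out_fd, size_t len) {
    int p[2];
    if (pipe(p) < 0) return -1;
    ssize_t total = 0;
    while ((size_t)total < len) {
        ssize_t in = splice(in_fd, NULL, p[1], NULL, len - total, SPLICE_F_MOVE);
        if (in <= 0) break;                        // EOF or error
        ssize_t out = 0;
        while (out < in) {                         // drain the pipe completely
            ssize_t n = splice(p[0], NULL, out_fd, NULL, in - out, SPLICE_F_MOVE);
            if (n <= 0) { close(p[0]); close(p[1]); return -1; }
            out += n;
        }
        total += in;
    }
    close(p[0]);
    close(p[1]);
    return total;
}
```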
Memory-Mapped I/O (mmap)
mmap() maps files or devices directly into the process's virtual address space. File reads become memory loads; writes become stores to page cache pages.
void *data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Access file contents as memory -- page faults bring pages in
process(data, file_size);
munmap(data, file_size);
Advantages:
- No explicit read/write syscalls -- the page fault handler loads data on demand
- Shared mappings (MAP_SHARED) enable zero-copy sharing between processes
- Random access patterns benefit from demand paging
Disadvantages:
- Page fault overhead for each new page (TLB miss + potential I/O)
- Difficult to handle errors (SIGBUS on I/O failure)
- No control over I/O scheduling (kernel decides prefetch, writeback)
- For sequential access, read() with readahead can outperform mmap
madvise hints:
- MADV_SEQUENTIAL -- aggressive readahead, free pages behind
- MADV_RANDOM -- disable readahead
- MADV_WILLNEED -- prefetch pages
- MADV_DONTNEED -- release pages immediately
Lock-Free Data Structures
Lock-free algorithms guarantee system-wide progress: at least one thread makes progress in a finite number of steps, even if other threads are suspended, preempted, or crash.
Compare-and-Swap (CAS)
CAS is the fundamental building block. Atomically: if *addr == expected, set *addr = desired and return true; else return false.
// Lock-free stack push using CAS
void push(Stack *s, Node *node) {
Node *old_head;
do {
old_head = atomic_load(&s->head);
node->next = old_head;
} while (!atomic_compare_exchange_weak(&s->head, &old_head, node));
}
ABA problem: Thread reads A, gets preempted, another thread changes A->B->A, first thread's CAS succeeds incorrectly. Solutions:
- Tagged pointers (pack a version counter with the pointer)
- Double-width CAS (DWCAS) on supported architectures
- Hazard pointers or epoch-based reclamation for memory management
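To make the hazard concrete, here is a self-contained C11 sketch pairing the push above with a matching pop. It is correct single-threaded; under concurrent pops with free(), the unprotected read of old_head->next is exactly where ABA and use-after-free strike, which is why the reclamation schemes below exist:

```c
// Minimal C11 lock-free stack: push plus a naive pop.
// The pop is ABA-unsafe without tagged pointers or safe reclamation.
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node { struct Node *next; int value; } Node;
typedef struct Stack { _Atomic(Node *) head; } Stack;

void push(Stack *s, Node *node) {
    Node *old_head = atomic_load(&s->head);
    do {
        node->next = old_head;    // CAS failure refreshes old_head
    } while (!atomic_compare_exchange_weak(&s->head, &old_head, node));
}

Node *pop(Stack *s) {
    Node *old_head = atomic_load(&s->head);
    while (old_head != NULL &&
           !atomic_compare_exchange_weak(&s->head, &old_head,
                                         old_head->next))
        ;                         // CAS failure refreshes old_head
    return old_head;              // NULL means the stack was empty
}
```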
Hazard Pointers
Hazard pointers solve safe memory reclamation in lock-free structures. A thread publishes pointers to nodes it is currently accessing. Other threads check these publications before freeing memory.
Thread 1:                     Thread 2:
hp[0] = node_ptr;             // wants to free node_ptr
memory_fence();               // scans all hazard pointers
// safely read *node_ptr      // finds node_ptr in hp[0]
hp[0] = NULL;                 // defers freeing until hp[0] is cleared
Properties:
- Per-thread overhead: O(H) hazard pointers (H typically 1-2)
- Scan overhead: O(N * H) where N = number of threads
- Bounded memory overhead: at most O(N * H * R) retired nodes (R = reclamation batch size)
- Patent-free since 2023
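The publish/scan protocol can be modeled in a few lines. This is a deliberately simplified, single-threaded toy (fixed thread count, a global retired list, no memory fences); real implementations add per-thread retire batches and careful ordering:

```c
// Toy model of hazard pointers: scan published pointers before freeing.
#include <stdlib.h>

#define MAX_THREADS 4
#define MAX_RETIRED 64

static void *hazard[MAX_THREADS];     // one published pointer per thread
static void *retired[MAX_RETIRED];
static int n_retired;
static int n_freed;                   // instrumentation for the sketch

static int is_hazardous(void *p) {
    for (int i = 0; i < MAX_THREADS; i++)
        if (hazard[i] == p) return 1;
    return 0;
}

// Free every retired node that no reader has published.
static void scan(void) {
    int kept = 0;
    for (int i = 0; i < n_retired; i++) {
        if (is_hazardous(retired[i]))
            retired[kept++] = retired[i];   // still in use: keep for later
        else {
            free(retired[i]);
            n_freed++;
        }
    }
    n_retired = kept;
}

static void retire(void *p) {
    retired[n_retired++] = p;
    if (n_retired == MAX_RETIRED) scan();   // amortize scans over batches
}
```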
Epoch-Based Reclamation (EBR)
EBR is simpler than hazard pointers: threads mark entry and exit of read-side critical sections, and a global epoch counter tracks which epoch each active thread has observed.
Global epoch: 2
Thread enters critical section:
local_epoch = global_epoch // record current epoch
Thread retires a node:
add node to retire_list[current_epoch]
Reclamation:
if all threads have observed epoch 2:
advance epoch to 3
free all nodes retired in epoch 0 // two epochs ago, guaranteed no readers
Advantage: Very low per-operation overhead (no per-pointer publication)
Disadvantage: A single stalled thread (e.g., descheduled) prevents all reclamation, causing unbounded memory growth
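The three-epoch scheme can be modeled directly: retirements go into the bucket for the current epoch, and advancing the epoch reclaims the bucket from two epochs ago. A single-threaded toy sketch (fixed thread count, no atomics, unchecked bucket capacity):

```c
// Toy model of epoch-based reclamation with three epoch buckets.
#include <stdlib.h>

#define MAX_THREADS 4
#define BUCKET_CAP 64

static unsigned global_epoch;
static unsigned local_epoch[MAX_THREADS];
static int active[MAX_THREADS];
static void *bucket[3][BUCKET_CAP];
static int bucket_len[3];
static int n_freed;                   // instrumentation for the sketch

static void enter(int tid) { active[tid] = 1; local_epoch[tid] = global_epoch; }
static void leave(int tid) { active[tid] = 0; }

static void retire(void *p) {
    int b = global_epoch % 3;
    bucket[b][bucket_len[b]++] = p;
}

// Advance the epoch and reclaim, but only if every active reader caught up.
static void try_advance(void) {
    for (int i = 0; i < MAX_THREADS; i++)
        if (active[i] && local_epoch[i] != global_epoch)
            return;                   // a lagging reader blocks reclamation
    global_epoch++;
    int b = global_epoch % 3;         // bucket retired two epochs ago
    for (int i = 0; i < bucket_len[b]; i++) { free(bucket[b][i]); n_freed++; }
    bucket_len[b] = 0;                // bucket is now reused for new retirements
}
```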
Read-Copy-Update (RCU)
RCU is the Linux kernel's solution for read-mostly, lock-free data structures. It separates read-side access from updates and defers reclamation.
Principles:
- Readers access shared data without locks (just rcu_read_lock(), which disables preemption)
- Writers create a new version of the data structure (copy), update the pointer atomically, then wait for all pre-existing readers to finish before freeing the old version
- Grace period: The interval during which all pre-existing readers complete. After a grace period, old data can be freed.
// Reader (essentially free -- no locks, no atomics)
rcu_read_lock();
struct node *p = rcu_dereference(gp);
// read p->field
rcu_read_unlock();
// Writer
struct node *new = kmalloc(...);
*new = *old; // copy
new->field = new_value; // modify copy
rcu_assign_pointer(gp, new); // publish (store_release)
synchronize_rcu(); // wait for grace period
kfree(old); // safe to free
RCU variants in Linux:
- synchronize_rcu() -- block until a grace period completes
- call_rcu() -- register a callback invoked asynchronously after the grace period
- SRCU (Sleepable RCU) -- allows sleeping in read-side critical sections
- Tasks RCU -- for voluntary context switch-based grace periods
Performance: Read-side cost is near zero (no atomics, no cache-line bouncing). RCU is used throughout the Linux kernel: routing tables, dentry cache, module unloading, inode lists.
io_uring Patterns
Basic Submission/Completion Pattern
struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
// Prepare a read operation
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_sqe_set_data(sqe, user_data);
// Submit all prepared operations
io_uring_submit(&ring);
// Reap completions
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res;
void *data = io_uring_cqe_get_data(cqe);
io_uring_cqe_seen(&ring, cqe);
Advanced Patterns
Linked operations (dependency chains):
// Read a file, then write to a socket -- second op starts after first completes
sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe1, file_fd, buf, len, 0);
sqe1->flags |= IOSQE_IO_LINK; // link to next SQE
sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe2, sock_fd, buf, len, 0);
Fixed files and buffers:
// Register file descriptors -- avoid per-I/O fd lookup
int fds[] = {fd1, fd2, fd3};
io_uring_register_files(&ring, fds, 3);
// Register buffers -- avoid per-I/O page pinning
struct iovec iovs[] = {{buf, len}};
io_uring_register_buffers(&ring, iovs, 1);
// Use registered resources (IOSQE_FIXED_FILE flag, fixed buffer index)
io_uring_prep_read_fixed(sqe, 0 /*registered file index*/, buf, len, off, 0 /*buffer index*/);
sqe->flags |= IOSQE_FIXED_FILE;
SQPOLL mode: A kernel thread polls the submission ring, eliminating io_uring_submit() syscalls entirely. The application just writes SQEs to the ring and reads CQEs -- no syscalls in the hot path.
struct io_uring_params params = { .flags = IORING_SETUP_SQPOLL, .sq_thread_idle = 2000 };
io_uring_queue_init_params(DEPTH, &ring, &params);
Multishot Operations
Multishot accept/recv: a single SQE generates multiple CQEs. Install one accept SQE, get a CQE for each new connection -- eliminates re-arming overhead.
io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
High-Performance Networking
DPDK (Data Plane Development Kit)
DPDK bypasses the kernel networking stack entirely. The application runs in user space with direct access to NIC queues via UIO or VFIO.
Architecture:
Traditional: App -> socket API -> TCP/IP stack -> driver -> NIC
DPDK: App -> DPDK PMD (poll-mode driver) -> NIC (no kernel)
Key concepts:
- Poll-mode drivers (PMDs): busy-poll NIC queues, no interrupts
- Hugepage-backed mbufs: pre-allocated packet buffers in huge pages
- Dedicated cores: lcore pinning -- each core runs a tight poll loop
- rte_ring: lock-free multi-producer/multi-consumer ring buffer for inter-core communication
Performance: 100+ Gbps line rate, <1 us latency, 100M+ PPS on modern hardware.
Trade-off: Consumes dedicated CPU cores (100% utilization even when idle). No kernel protection -- bugs in DPDK can crash the process or corrupt NIC state.
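The rte_ring concept from the list above can be illustrated with a minimal single-producer/single-consumer ring in C11 atomics. This is a sketch of the core head/tail index scheme only; rte_ring itself is multi-producer/multi-consumer with batched operations:

```c
// Minimal SPSC lock-free ring: producer owns head, consumer owns tail.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8                      // power of two for cheap masking

typedef struct {
    void *slots[RING_SIZE];
    _Atomic size_t head;                 // next slot to write (producer)
    _Atomic size_t tail;                 // next slot to read (consumer)
} SpscRing;

static bool ring_push(SpscRing *r, void *item) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE) return false;      // full
    r->slots[head & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_pop(SpscRing *r, void **item) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;                  // empty
    *item = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The release store on head publishes the written slot to the consumer's acquire load, so no locks are needed with exactly one producer and one consumer.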
RDMA (Remote Direct Memory Access)
RDMA allows one machine to read or write another machine's memory without involving the remote CPU or OS in the data path.
Verbs API operations:
| Verb          | Description                                   |
|---------------|-----------------------------------------------|
| ibv_post_send | Post a send work request (write/read/send)    |
| ibv_post_recv | Post a receive buffer                         |
| ibv_poll_cq   | Poll completion queue for finished operations |
RDMA modes:
- Send/Receive: Two-sided -- receiver posts buffers, sender sends to them
- RDMA Write: One-sided -- write directly to remote memory (remote CPU uninvolved)
- RDMA Read: One-sided -- read remote memory directly
Transports: InfiniBand (native), RoCE (RDMA over Converged Ethernet), iWARP (RDMA over TCP)
Use cases: Distributed storage (Ceph with RDMA), databases (Oracle RAC), HPC (MPI), key-value stores (HERD, FaRM)
AF_XDP
AF_XDP brings XDP's performance to user-space applications via a special socket type.
Kernel (XDP program) -> UMEM (shared memory) -> User space (AF_XDP socket)
Components:
- UMEM: Shared memory region between kernel and user space, divided into frames
- Fill ring: User space provides empty frames to the kernel
- Completion ring: Kernel returns transmitted frames to user space
- RX/TX rings: Receive and transmit descriptor rings
Comparison to DPDK:

| Aspect          | DPDK                     | AF_XDP                    |
|-----------------|--------------------------|---------------------------|
| Kernel bypass   | Complete                 | Partial (XDP in kernel)   |
| Driver support  | Proprietary PMDs         | Any XDP-capable NIC       |
| Kernel features | None (no firewall, etc.) | Coexist with kernel stack |
| CPU dedication  | Required                 | Optional                  |
| Performance     | ~100M PPS                | ~24M PPS (improving)      |
AF_XDP provides a middle ground: near-DPDK performance while retaining kernel manageability.
Performance Comparison
| Technique        | Latency   | Throughput | CPU Overhead    | Kernel Bypass |
|------------------|-----------|------------|-----------------|---------------|
| Standard sockets | ~10-50 us | ~10 Gbps   | High            | No            |
| io_uring         | ~5-20 us  | ~25 Gbps   | Medium          | Partial       |
| AF_XDP           | ~2-5 us   | ~50 Gbps   | Medium          | Partial       |
| DPDK             | ~0.5-2 us | ~100+ Gbps | Low (busy-poll) | Full          |
| RDMA             | ~1-3 us   | ~100+ Gbps | Near-zero       | Full          |
Putting It All Together
Modern high-performance systems combine these patterns:
Example: High-performance key-value store
- Networking: DPDK or AF_XDP for packet I/O
- Storage: io_uring with registered buffers for persistence
- Data structures: Lock-free hash table with RCU for reads
- Memory: Huge pages, NUMA-aware allocation
- Reclamation: Epoch-based for retired entries
- Replication: RDMA write for low-latency replication
Key Takeaways
- Zero-copy (sendfile, splice, mmap) eliminates CPU-driven memory copies in the I/O path, saving both CPU cycles and memory bandwidth
- Lock-free programming requires solving the reclamation problem: hazard pointers provide bounded memory, EBR provides lower overhead, RCU provides near-zero read-side cost
- CAS is the fundamental lock-free primitive; the ABA problem is its primary pitfall
- io_uring's shared rings, fixed resources, and SQPOLL mode bring near-kernel-bypass performance while remaining in the kernel I/O stack
- DPDK trades kernel functionality for maximum throughput; AF_XDP offers a compromise with kernel coexistence
- RDMA enables CPU-bypass networking where remote memory reads/writes happen without involving either side's CPU
- Production systems layer these patterns: DPDK/AF_XDP for networking, io_uring for storage, RCU for data structures, huge pages for memory