I/O Systems
The I/O subsystem manages communication between the CPU and peripheral devices, balancing performance, generality, and device diversity.
I/O Software Layers
┌─────────────────────────────┐
│ User-level I/O (libraries) │ stdio, fread, fwrite
├─────────────────────────────┤
│ Device-independent OS I/O │ buffering, caching, scheduling
├─────────────────────────────┤
│ Device drivers │ hardware-specific code
├─────────────────────────────┤
│ Interrupt handlers │ bottom-half of I/O
├─────────────────────────────┤
│ Hardware │ controllers, DMA, devices
└─────────────────────────────┘
Device Drivers
Hardware-specific code that translates generic I/O requests into device commands. Each device type has a driver.
Linux driver model: Character devices (byte stream: serial, keyboard), block devices (random access: disk, SSD), network devices (packets).
Drivers are the largest part of the Linux kernel (roughly 70% of the source). Quality varies widely; empirical studies have repeatedly found drivers to be the leading source of kernel bugs.
I/O Scheduling
Disk Scheduling (HDD)
For mechanical hard drives, seek time dominates. Minimize total seek distance.
FCFS: Process requests in arrival order. Fair but poor seek performance.
SSTF (Shortest Seek Time First): Always go to the nearest request. Good average but starvation risk for far-away requests.
SCAN (Elevator): Move head in one direction, servicing requests. Reverse at the end. Like an elevator. Bounded waiting.
C-SCAN (Circular SCAN): Move in one direction only. Return to the beginning without servicing. More uniform wait times.
LOOK / C-LOOK: Like SCAN/C-SCAN but only go as far as the last request in each direction (don't go to the end of the disk).
SSD Scheduling
SSDs have no seek time, so reordering requests for locality gains little; scheduling instead balances fairness, latency, and overhead:
NOOP (called none in the multi-queue block layer): No reordering beyond request merging. Minimal overhead. Good for SSDs with internal parallelism.
Deadline: Assigns each request an expiry time (by default 500 ms for reads, 5 s for writes) and services expired requests first. Prevents starvation. Good for SSDs.
mq-deadline: Multi-queue version of deadline. Used in modern Linux for NVMe SSDs.
BFQ (Budget Fair Queueing): Assigns budgets to processes. Good for interactive workloads. Ensures responsive I/O.
kyber: Token-based scheduler for fast multi-queue devices. Keeps separate queues for reads and synchronous writes and throttles submissions to meet per-queue latency targets.
Linux I/O Schedulers
Linux 5.0+ uses multi-queue block layer (blk-mq): Per-CPU submission queues → hardware dispatch queues. Eliminates the single-queue bottleneck. Essential for NVMe SSDs (which have multiple hardware queues).
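On a running Linux system, the active scheduler for each block device can be inspected through sysfs (a sketch; device names and the set of available schedulers vary by machine):

```shell
# List the I/O scheduler for each block device; the active one is
# shown in brackets, e.g. "none [mq-deadline] kyber bfq".
for q in /sys/block/*/queue/scheduler; do
  [ -e "$q" ] || continue
  printf '%s: %s\n' "${q%/queue/scheduler}" "$(cat "$q")"
done
```

Writing a scheduler name into the same file (as root) switches schedulers at runtime.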
Buffering and Caching
Buffering
Temporary storage for I/O data to smooth speed differences.
Single buffering: One buffer. The device fills the buffer, then the CPU processes it; transfer and computation alternate rather than overlap.
Double buffering: Two buffers. Overlap I/O and processing (producer-consumer).
Circular buffering: Ring buffer with multiple slots. Used in network I/O, audio.
Page Cache (Linux)
The page cache keeps recently accessed file data in memory:
Application read(file, offset, size)
→ Check page cache
→ HIT: return from memory (fast)
→ MISS: read from disk → store in page cache → return to application
Write-back: Dirty pages are flushed in the background (the writeback thread wakes every 5 seconds by default and writes out pages that have been dirty for more than 30 seconds) or when memory pressure triggers writeback.
Unified cache: Linux uses the page cache for both file I/O and memory-mapped files. Same mechanism.
Cache eviction: Clock/LRU-based. Active and inactive lists with promotion/demotion.
Direct I/O
Bypass the page cache. Application manages its own buffer.
int fd = open("data", O_RDWR | O_DIRECT); // bypass page cache
Used by: Databases (PostgreSQL, MySQL InnoDB) that implement their own buffer pool.
Spooling
Simultaneous Peripheral Operations On-Line. Buffer output for slow devices (printers, tape drives).
Print jobs are queued in a spool directory. A background process (daemon) sends them to the printer one at a time.
I/O Performance
Metrics
| Metric | Definition |
|---|---|
| IOPS | I/O operations per second (random-access performance) |
| Throughput | MB/s or GB/s (sequential performance) |
| Latency | Time for a single I/O operation |
| Queue depth | Number of outstanding I/O requests |
Device Performance Comparison
| Device | Sequential Read | Random IOPS | Latency |
|---|---|---|---|
| HDD | 100-250 MB/s | 100-200 | 5-10 ms |
| SATA SSD | 500-560 MB/s | 50K-100K | 50-100 μs |
| NVMe SSD | 3-14 GB/s | 500K-2M | 10-20 μs |
| Intel Optane | 2-6 GB/s | 500K-1.5M | 5-10 μs |
| RAM | 30-50 GB/s | ∞ | 50-100 ns |
Optimizing I/O
- Batching: Combine many small I/Os into fewer large ones
- Sequential access: Much faster than random on all devices (especially HDD)
- Async I/O: Don't wait for I/O completion — submit and continue working
- Read-ahead: Pre-read sequential data before it's requested
- Write combining: Buffer small writes and flush as larger writes
- Alignment: Align I/O to device block boundaries (4K for most SSDs)
Modern I/O Interfaces
io_uring (Linux 5.1+)
High-performance async I/O framework. Two ring buffers shared between user space and kernel:
Submission Queue (SQ): User pushes I/O requests
Completion Queue (CQ): Kernel pushes completions
User space:                     Kernel:
SQ: [req1][req2][req3]  ──→     Process requests
CQ: [comp1][comp2]      ←──     Post completions
Advantages:
- No syscalls for submission/completion (shared memory rings)
- Supports all I/O types (read, write, accept, connect, poll, ...)
- Batching: Submit multiple requests in one syscall
- Linked operations: Chain operations (read → process → write)
- Fixed buffers: Pre-register buffers to avoid per-I/O mapping
Performance: Can achieve 1M+ IOPS from a single thread. 10-50× improvement over epoll for some workloads.
Used by: Tokio (optional), Glommio (Rust async runtime), RocksDB, TigerBeetle.
AIO (Linux Async I/O)
Older async I/O API. Only supports direct I/O. Fewer features than io_uring. Being replaced by io_uring.
epoll / kqueue
Event notification for file descriptors (sockets, pipes, files).
epoll (Linux): Efficient for monitoring thousands of file descriptors. O(1) for ready fd retrieval.
kqueue (BSD/macOS): Similar to epoll. Unified event framework.
Both are the foundation of high-performance network servers (nginx, Node.js).
Applications in CS
- Web servers: io_uring or epoll for handling thousands of concurrent connections.
- Databases: Custom I/O scheduling, direct I/O, pre-allocated files, aligned writes.
- File systems: I/O schedulers determine disk access patterns. Journaling requires ordered writes.
- Distributed storage: Network I/O + disk I/O interleaved. Async I/O essential for throughput.
- Streaming: Buffered I/O, read-ahead, and memory-mapped files for media playback.
- Logging: Buffered writes, periodic flush. fsync for durability guarantees.