
I/O Systems

The I/O subsystem manages communication between the CPU and peripheral devices, balancing performance, generality, and device diversity.

I/O Software Layers

┌─────────────────────────────┐
│  User-level I/O (libraries) │  stdio, fread, fwrite
├─────────────────────────────┤
│  Device-independent OS I/O  │  buffering, caching, scheduling
├─────────────────────────────┤
│    Device drivers           │  hardware-specific code
├─────────────────────────────┤
│  Interrupt handlers         │  bottom-half of I/O
├─────────────────────────────┤
│    Hardware                 │  controllers, DMA, devices
└─────────────────────────────┘

Device Drivers

Hardware-specific code that translates generic I/O requests into device commands. Each device type has a driver.

Linux driver model: Character devices (byte stream: serial, keyboard), block devices (random access: disk, SSD), network devices (packets).

Drivers are the largest part of the Linux kernel (roughly 70% of the source). Quality varies widely; drivers are the leading source of kernel bugs.

I/O Scheduling

Disk Scheduling (HDD)

For mechanical hard drives, seek time dominates access latency, so the scheduler's goal is to minimize total head movement.

FCFS: Process requests in arrival order. Fair but poor seek performance.

SSTF (Shortest Seek Time First): Always go to the nearest request. Good average but starvation risk for far-away requests.

SCAN (Elevator): Move head in one direction, servicing requests. Reverse at the end. Like an elevator. Bounded waiting.

C-SCAN (Circular SCAN): Move in one direction only. Return to the beginning without servicing. More uniform wait times.

LOOK / C-LOOK: Like SCAN/C-SCAN but only go as far as the last request in each direction (don't go to the end of the disk).

SSD Scheduling

SSDs have no mechanical seek; all blocks cost roughly the same to access. The concerns are different:

NOOP: No reordering. Minimal overhead. Good for SSDs with internal parallelism.

Deadline: Guarantee each request is served within a deadline. Prevents starvation. Good for SSDs.

mq-deadline: Multi-queue version of deadline. Used in modern Linux for NVMe SSDs.

BFQ (Budget Fair Queueing): Assigns budgets to processes. Good for interactive workloads. Ensures responsive I/O.

kyber: Simple two-queue scheduler for fast devices. Separate queues for reads and writes.

Linux I/O Schedulers

Since Linux 5.0, the block layer is multi-queue only (blk-mq): per-CPU submission queues feed hardware dispatch queues. This eliminates the single-queue bottleneck and is essential for NVMe SSDs, which expose multiple hardware queues.

Buffering and Caching

Buffering

Temporary storage for I/O data to smooth speed differences.

Single buffering: One buffer. I/O fills the buffer, then the CPU processes it; transfer and computation alternate rather than overlap.

Double buffering: Two buffers. Overlap I/O and processing (producer-consumer).

Circular buffering: Ring buffer with multiple slots. Used in network I/O, audio.

Page Cache (Linux)

The page cache keeps recently accessed file data in memory:

Application read(file, offset, size)
    → Check page cache
    → HIT: return from memory (fast)
    → MISS: read from disk → store in page cache → return to application

Write-back: Dirty pages are flushed to disk periodically (the writeback thread wakes every 5 seconds by default) or when memory pressure triggers writeback.

Unified cache: Linux uses the page cache for both file I/O and memory-mapped files. Same mechanism.

Cache eviction: Clock/LRU-based. Active and inactive lists with promotion/demotion.

Direct I/O

Bypass the page cache. Application manages its own buffer.

int fd = open("data", O_RDWR | O_DIRECT); // bypass page cache

Used by: Databases (PostgreSQL, MySQL InnoDB) that implement their own buffer pool.

Spooling

Simultaneous Peripheral Operations On-Line. Buffer output for slow devices (printers, tape drives).

Print jobs are queued in a spool directory. A background process (daemon) sends them to the printer one at a time.

I/O Performance

Metrics

| Metric | Definition |
|---|---|
| IOPS | I/O operations per second (random access performance) |
| Throughput | MB/s or GB/s (sequential performance) |
| Latency | Time for a single I/O operation |
| Queue depth | Number of outstanding I/O requests |

Device Performance Comparison

| Device | Sequential Read | Random IOPS | Latency |
|---|---|---|---|
| HDD | 100-250 MB/s | 100-200 | 5-10 ms |
| SATA SSD | 500-560 MB/s | 50K-100K | 50-100 μs |
| NVMe SSD | 3-14 GB/s | 500K-2M | 10-20 μs |
| Intel Optane | 2-6 GB/s | 500K-1.5M | 5-10 μs |
| RAM | 30-50 GB/s | ∞ | 50-100 ns |

Optimizing I/O

  • Batching: Combine many small I/Os into fewer large ones
  • Sequential access: Much faster than random on all devices (especially HDD)
  • Async I/O: Don't wait for I/O completion — submit and continue working
  • Read-ahead: Pre-read sequential data before it's requested
  • Write combining: Buffer small writes and flush as larger writes
  • Alignment: Align I/O to device block boundaries (4K for most SSDs)

Modern I/O Interfaces

io_uring (Linux 5.1+)

High-performance async I/O framework. Two ring buffers shared between user space and kernel:

Submission Queue (SQ): User pushes I/O requests
Completion Queue (CQ): Kernel pushes completions

User space:                    Kernel:
  SQ: [req1][req2][req3] ──→  Process requests
  CQ: ←── [comp1][comp2]      Post completions

Advantages:

  • Minimal syscalls: submission and completion go through shared-memory rings; one io_uring_enter can submit and reap many requests (with SQPOLL mode, submission needs no syscall at all)
  • Supports all I/O types (read, write, accept, connect, poll, ...)
  • Batching: Submit multiple requests in one syscall
  • Linked operations: Chain operations (read → process → write)
  • Fixed buffers: Pre-register buffers to avoid per-I/O mapping

Performance: Can achieve over 1M IOPS from a single thread; 10-50× improvements over epoll have been reported for some workloads.

Used by: Tokio (optional), Glommio (Rust async runtime), RocksDB, TigerBeetle.

AIO (Linux Async I/O)

Older async I/O API. Only supports direct I/O. Fewer features than io_uring. Being replaced by io_uring.

epoll / kqueue

Event notification for file descriptors (sockets, pipes, files).

epoll (Linux): Efficient for monitoring thousands of file descriptors. O(1) for ready fd retrieval.

kqueue (BSD/macOS): Similar to epoll. Unified event framework.

Both are the foundation of high-performance network servers (nginx, Node.js).

Applications in CS

  • Web servers: io_uring or epoll for handling thousands of concurrent connections.
  • Databases: Custom I/O scheduling, direct I/O, pre-allocated files, aligned writes.
  • File systems: I/O schedulers determine disk access patterns. Journaling requires ordered writes.
  • Distributed storage: Network I/O + disk I/O interleaved. Async I/O essential for throughput.
  • Streaming: Buffered I/O, read-ahead, and memory-mapped files for media playback.
  • Logging: Buffered writes, periodic flush. fsync for durability guarantees.