I/O Systems
The I/O subsystem manages communication between the CPU and peripheral devices, balancing performance, generality, and device diversity.
I/O Software Layers
┌─────────────────────────────┐
│ User-level I/O (libraries) │ stdio, fread, fwrite
├─────────────────────────────┤
│ Device-independent OS I/O │ buffering, caching, scheduling
├─────────────────────────────┤
│ Device drivers │ hardware-specific code
├─────────────────────────────┤
│ Interrupt handlers │ bottom-half of I/O
├─────────────────────────────┤
│ Hardware │ controllers, DMA, devices
└─────────────────────────────┘
Device Drivers
Hardware-specific code that translates generic I/O requests into device commands. Each device type has a driver.
Linux driver model: Character devices (byte stream: serial, keyboard), block devices (random access: disk, SSD), network devices (packets).
Drivers are the largest part of the Linux kernel (roughly 70% of the source). Quality varies widely; empirical studies have repeatedly found drivers to be the leading source of kernel bugs.
I/O Scheduling
Disk Scheduling (HDD)
For mechanical hard drives, seek time dominates. Minimize total seek distance.
FCFS: Process requests in arrival order. Fair but poor seek performance.
SSTF (Shortest Seek Time First): Always go to the nearest request. Good average but starvation risk for far-away requests.
SCAN (Elevator): Move head in one direction, servicing requests. Reverse at the end. Like an elevator. Bounded waiting.
C-SCAN (Circular SCAN): Move in one direction only. Return to the beginning without servicing. More uniform wait times.
LOOK / C-LOOK: Like SCAN/C-SCAN but only go as far as the last request in each direction (don't go to the end of the disk).
SSD Scheduling
SSDs have no seek time, so reordering requests for locality gains little; scheduling instead balances fairness, latency, and overhead:
NOOP (called none in the multi-queue block layer): No reordering beyond request merging. Minimal overhead. Good for SSDs with internal parallelism.
Deadline: Assigns each request an expiry time (by default 500 ms for reads, 5 s for writes) and services expired requests first. Prevents starvation. Good for SSDs.
mq-deadline: Multi-queue version of deadline. Used in modern Linux for NVMe SSDs.
BFQ (Budget Fair Queueing): Assigns budgets to processes. Good for interactive workloads. Ensures responsive I/O.
kyber: Token-based scheduler for fast multi-queue devices. Keeps separate queues for reads and synchronous writes and throttles submissions to meet per-queue latency targets.
Linux I/O Schedulers
Linux 5.0+ uses multi-queue block layer (blk-mq): Per-CPU submission queues → hardware dispatch queues. Eliminates the single-queue bottleneck. Essential for NVMe SSDs (which have multiple hardware queues).
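On a running Linux system, the active scheduler for each block device can be inspected through sysfs (a sketch; device names and the set of available schedulers vary by machine):

```shell
# List the I/O scheduler for each block device; the active one is
# shown in brackets, e.g. "none [mq-deadline] kyber bfq".
for q in /sys/block/*/queue/scheduler; do
  [ -e "$q" ] || continue
  printf '%s: %s\n' "${q%/queue/scheduler}" "$(cat "$q")"
done
```

Writing a scheduler name into the same file (as root) switches schedulers at runtime.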
Buffering and Caching
Buffering
Temporary storage for I/O data to smooth speed differences.
Single buffering: One buffer. The device fills the buffer, then the CPU processes it; transfer and computation alternate rather than overlap.
Double buffering: Two buffers. Overlap I/O and processing (producer-consumer).
Circular buffering: Ring buffer with multiple slots. Used in network I/O, audio.
Page Cache (Linux)
The page cache keeps recently accessed file data in memory:
Application read(file, offset, size)
→ Check page cache
→ HIT: return from memory (fast)
→ MISS: read from disk → store in page cache → return to application
Write-back: Dirty pages are flushed in the background (the writeback thread wakes every 5 seconds by default and writes out pages that have been dirty for more than 30 seconds) or when memory pressure triggers writeback.
Unified cache: Linux uses the page cache for both file I/O and memory-mapped files. Same mechanism.
Cache eviction: Clock/LRU-based. Active and inactive lists with promotion/demotion.
Direct I/O
Bypass the page cache. Application manages its own buffer.
int fd = open("data", O_RDWR | O_DIRECT); // bypass page cache
Used by: Databases (PostgreSQL, MySQL InnoDB) that implement their own buffer pool.
Spooling
Simultaneous Peripheral Operations On-Line. Buffer output for slow devices (printers, tape drives).
Print jobs are queued in a spool directory. A background process (daemon) sends them to the printer one at a time.
I/O Performance
Metrics
| Metric | Definition |
|---|---|
| IOPS | I/O operations per second (random-access performance) |
| Throughput | MB/s or GB/s (sequential performance) |
| Latency | Time for a single I/O operation |
| Queue depth | Number of outstanding I/O requests |
Device Performance Comparison
| Device | Sequential Read | Random IOPS | Latency |
|---|---|---|---|
| HDD | 100-250 MB/s | 100-200 | 5-10 ms |
| SATA SSD | 500-560 MB/s | 50K-100K | 50-100 μs |
| NVMe SSD | 3-14 GB/s | 500K-2M | 10-20 μs |
| Intel Optane | 2-6 GB/s | 500K-1.5M | 5-10 μs |
| RAM | 30-50 GB/s | ∞ | 50-100 ns |
Optimizing I/O
- Batching: Combine many small I/Os into fewer large ones
- Sequential access: Much faster than random on all devices (especially HDD)
- Async I/O: Don't wait for I/O completion — submit and continue working
- Read-ahead: Pre-read sequential data before it's requested
- Write combining: Buffer small writes and flush as larger writes
- Alignment: Align I/O to device block boundaries (4K for most SSDs)
Modern I/O Interfaces
io_uring (Linux 5.1+)
High-performance async I/O framework. Two ring buffers shared between user space and kernel:
Submission Queue (SQ): User pushes I/O requests
Completion Queue (CQ): Kernel pushes completions
User space:                     Kernel:
SQ: [req1][req2][req3]  ──→     Process requests
CQ: [comp1][comp2]      ←──     Post completions
Advantages:
- No syscalls for submission/completion (shared memory rings)
- Supports all I/O types (read, write, accept, connect, poll, ...)
- Batching: Submit multiple requests in one syscall
- Linked operations: Chain operations (read → process → write)
- Fixed buffers: Pre-register buffers to avoid per-I/O mapping
Performance: Can achieve 1M+ IOPS from a single thread. 10-50× improvement over epoll for some workloads.
Used by: Tokio (optional), Glommio (Rust async runtime), RocksDB, TigerBeetle.
AIO (Linux Async I/O)
Older async I/O API. Only supports direct I/O. Fewer features than io_uring. Being replaced by io_uring.
epoll / kqueue
Event notification for file descriptors (sockets, pipes, files).
epoll (Linux): Efficient for monitoring thousands of file descriptors. O(1) for ready fd retrieval.
kqueue (BSD/macOS): Similar to epoll. Unified event framework.
Both are the foundation of high-performance network servers (nginx, Node.js).
Applications in CS
- Web servers: io_uring or epoll for handling thousands of concurrent connections.
- Databases: Custom I/O scheduling, direct I/O, pre-allocated files, aligned writes.
- File systems: I/O schedulers determine disk access patterns. Journaling requires ordered writes.
- Distributed storage: Network I/O + disk I/O interleaved. Async I/O essential for throughput.
- Streaming: Buffered I/O, read-ahead, and memory-mapped files for media playback.
- Logging: Buffered writes, periodic flush. fsync for durability guarantees.