
Advanced File Systems

Log-Structured File Systems

Traditional file systems (ext4, XFS) perform in-place updates. Log-structured file systems (LFS) write all data and metadata sequentially to a log, converting random writes into sequential writes.

Design

  1. All writes are buffered in memory and flushed as a contiguous segment
  2. An inode map tracks the current location of each inode (since inodes move)
  3. A segment cleaner reclaims space by compacting live data from old segments
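The write path and cleaner described above can be sketched in a few lines of Python (a toy model: `SEGMENT_SIZE`, the `LFS` class, and its tiny in-memory "inode map" are all invented for illustration):

```python
SEGMENT_SIZE = 4  # blocks per segment (tiny, for illustration)

class LFS:
    """Toy log-structured store: every write appends to the tail segment."""
    def __init__(self):
        self.segments = []   # list of segments; each is a list of (key, value)
        self.inode_map = {}  # key -> (segment_idx, slot): latest live location

    def write(self, key, value):
        # Start a new segment when there is no usable tail.
        if not self.segments or self.segments[-1] is None \
                or len(self.segments[-1]) == SEGMENT_SIZE:
            self.segments.append([])
        self.segments[-1].append((key, value))
        self.inode_map[key] = (len(self.segments) - 1, len(self.segments[-1]) - 1)

    def read(self, key):
        seg, slot = self.inode_map[key]
        return self.segments[seg][slot][1]

    def clean(self, seg_idx):
        """Relocate live blocks out of a segment, then reclaim it."""
        live = [(k, v) for slot, (k, v) in enumerate(self.segments[seg_idx])
                if self.inode_map.get(k) == (seg_idx, slot)]  # skip dead blocks
        self.segments[seg_idx] = None  # segment reclaimed
        for k, v in live:
            self.write(k, v)           # live data moves to the log tail

fs = LFS()
fs.write("a", 1); fs.write("b", 2); fs.write("a", 3)  # the overwrite appends
fs.clean(0)
print(fs.read("a"), fs.read("b"))  # 3 2
```

Note that overwriting `"a"` leaves a dead block behind in segment 0; the cleaner copies only the live blocks forward, which is exactly the source of LFS's write amplification.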

Advantages and Drawbacks

| Advantage | Drawback |
|----------------------------------|---------------------------------------|
| Excellent sequential write speed | Segment cleaning overhead |
| Crash recovery is simple (replay log) | Read performance can degrade (data scattered) |
| Natural support for snapshots | Write amplification from cleaning |

Implementations: NILFS2 (Linux), F2FS (flash-optimized LFS used in Android), and Sprite LFS (the original research system by Rosenblum and Ousterhout)

F2FS: Flash-Optimized LFS

F2FS (Flash-Friendly File System) adapts LFS design for NAND flash:

  • Multi-head logging with hot/warm/cold data separation
  • Aligns writes to flash erase blocks to reduce garbage collection
  • Node Address Table (NAT) and Segment Information Table (SIT) for metadata
  • Default file system on many Android devices
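The multi-head logging idea can be sketched as routing each write to a log chosen by its expected temperature (a heavily simplified model: real F2FS maintains six logs, hot/warm/cold for both node and data blocks, and the `classify` heuristic here is invented for illustration):

```python
# Toy multi-head logging: separate append-only logs per data temperature.
logs = {"hot": [], "warm": [], "cold": []}

def classify(kind):
    # Invented simplification of F2FS's policy: metadata (node blocks) is
    # hot, directory entries warm, bulk file data cold.
    return {"node": "hot", "dirent": "warm", "data": "cold"}[kind]

for kind, payload in [("node", "inode#7"), ("data", "blk0"),
                      ("node", "inode#8"), ("data", "blk1")]:
    logs[classify(kind)].append(payload)

print(logs)
```

Separating temperatures means that when a segment is cleaned, its blocks tend to die together, which cuts the amount of live data the cleaner must copy.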

Copy-on-Write File Systems

COW file systems never overwrite existing data. Writes go to new locations, and metadata trees are updated with new pointers (also via COW). This provides atomic updates, cheap snapshots, and, when combined with checksums and redundancy, self-healing.
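The pointer-swap mechanics can be shown in a few lines (a toy model: `alloc`, the `blocks` dict, and the global counter are invented for illustration):

```python
# Minimal copy-on-write sketch: blocks are never overwritten in place.
blocks = {}   # block_id -> data
next_id = 0

def alloc(data):
    """Write data to a fresh block and return its id."""
    global next_id
    blocks[next_id] = data
    next_id += 1
    return next_id - 1

# A "file" is just a pointer to its current block.
root = alloc("version 1")
snapshot = root              # snapshot: remember the old pointer, copy nothing

root = alloc("version 2")    # update = new block, then an atomic pointer swap

print(blocks[snapshot])      # version 1  (the snapshot is untouched)
print(blocks[root])          # version 2
```

Because the old block is never modified, a crash between the block write and the pointer swap leaves the file system on the old, consistent version.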

ZFS

ZFS (originally Sun Microsystems, now OpenZFS) is a combined file system and volume manager.

Core concepts:

  • Vdevs (virtual devices): Abstraction over physical disks, organized into a zpool
  • Copy-on-write everywhere: All writes go to new blocks; no in-place update
  • Merkle tree integrity: Every block has a checksum stored in its parent. The entire tree is verified on read, detecting silent data corruption (bit rot)
  • ARC (Adaptive Replacement Cache): Balances recency and frequency for caching
  • ZIL (ZFS Intent Log): Synchronous write log for fsync semantics
  • SLOG (Separate Log): Fast device (NVMe) for ZIL to accelerate sync writes
ZFS data hierarchy:
  zpool -> vdev (mirror, raidz) -> disk
  zpool -> dataset (filesystem | zvol | snapshot | clone)
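The Merkle-tree integrity idea can be sketched with a two-level tree (a simplified model: real ZFS stores a checksum in every block pointer, all the way up to the uberblock, and the names here are invented):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Leaf blocks, with each block's checksum stored in its parent.
blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
parent_checksums = [h(b) for b in blocks]
root_checksum = h("".join(parent_checksums).encode())  # ...and so on upward

def verified_read(i: int) -> bytes:
    """Read a block and verify it against the checksum held by its parent."""
    data = blocks[i]
    if h(data) != parent_checksums[i]:
        raise IOError(f"checksum mismatch on block {i}: corruption detected")
    return data

verified_read(1)            # clean read passes
blocks[2] = b"bit rot!"     # simulate silent on-disk corruption
try:
    verified_read(2)
except IOError as e:
    print(e)
```

Because the checksum lives in the parent rather than next to the data, a block that is silently rewritten by a failing disk cannot validate itself; with a mirror or RAID-Z vdev, ZFS would repair the bad copy from a good one.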

Advanced features:

  • Send/receive: incremental replication of snapshots between pools
  • Native encryption (per-dataset, AES-256-GCM)
  • Inline deduplication and compression
  • Special allocation class: metadata and small blocks on fast SSDs

Btrfs

Btrfs is Linux's native COW file system.

Key features:

  • B-tree based metadata and data organization
  • Subvolumes: lightweight, independently snapshotable file system partitions
  • Built-in RAID (0, 1, 10, 5/6 -- though RAID 5/6 has had stability issues)
  • Online defragmentation, balance, resize
  • Transparent compression (zstd, lzo, zlib)
  • Send/receive for incremental backups
# Create a Btrfs subvolume and snapshot
btrfs subvolume create /mnt/data/@home
btrfs subvolume snapshot /mnt/data/@home /mnt/data/@home_snap

# Enable zstd compression
btrfs property set /mnt/data/@home compression zstd

ZFS vs. Btrfs

| Feature | ZFS | Btrfs |
|-------------------|----------------------------|---------------------------------|
| License | CDDL (not GPL-compatible) | GPL |
| RAID | RAID-Z1/Z2/Z3 (stable) | RAID 0/1/10 stable, 5/6 fragile |
| Dedup | In-memory DDT (RAM-hungry) | Offline via reflink |
| Max capacity | 256 ZiB | 16 EiB |
| Scrubbing | Comprehensive | Supported |
| Community | OpenZFS (FreeBSD, Linux) | Mainline Linux kernel |

Deduplication

Deduplication identifies and eliminates duplicate data blocks, storing each unique block only once.

Inline vs. Offline

  • Inline (ZFS): Deduplicate at write time. Requires an in-memory deduplication table (DDT). Rule of thumb: ~5 GB RAM per TB of data.
  • Offline/batch (Btrfs, Windows ReFS): Scan and dedup after data is written. Lower memory pressure but delayed savings.
  • Reflinks: cp --reflink=auto creates COW copies that share data blocks until modified. Near-instant copies with no DDT overhead.
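The DDT mechanics can be sketched as a content-addressed block store (the `DedupStore` class and its tiny block size are invented for illustration):

```python
import hashlib

class DedupStore:
    """Content-addressed block store: each unique block is stored once."""
    def __init__(self):
        self.ddt = {}    # digest -> block (the deduplication table)
        self.files = {}  # filename -> ordered list of block digests

    def write(self, name, data, block_size=4):
        digests = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            d = hashlib.sha256(block).hexdigest()
            self.ddt.setdefault(d, block)  # store the block only if unseen
            digests.append(d)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.ddt[d] for d in self.files[name])

s = DedupStore()
s.write("a.txt", b"AAAABBBBAAAA")  # first and last blocks are identical
s.write("b.txt", b"AAAACCCC")      # first block duplicates a.txt's
print(len(s.ddt))                  # 3 unique blocks backing 5 logical blocks
print(s.read("a.txt"))             # b'AAAABBBBAAAA'
```

The memory cost is visible here: every unique block needs a DDT entry, which is why inline dedup on ZFS consumes RAM in proportion to the pool's unique data.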

Compression

Modern file systems support transparent block-level compression.

| Algorithm | Ratio | Speed (compress) | Speed (decompress) | Typical Use |
|-----------|-------------|------------------------|---------------------|---------------------|
| LZ4 | Low-medium | Very fast (~700 MB/s) | Very fast | Default for speed |
| Zstd | High | Fast (~400 MB/s) | Very fast (~1 GB/s) | Best general choice |
| Zlib | High | Slow (~100 MB/s) | Medium | Legacy, archival |
| Gzip | Medium-high | Slow | Medium | Compatibility |

ZFS and Btrfs allow per-dataset or per-file compression selection.
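Of the algorithms above, only zlib ships with Python's standard library (LZ4 and zstd need third-party bindings), but it is enough to show the level-versus-ratio trade-off the table describes:

```python
import zlib

# Repetitive data compresses well; levels trade CPU time for ratio.
data = b"the quick brown fox jumps over the lazy dog " * 200

for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(data)} -> {len(out)} bytes "
          f"(ratio {len(data) / len(out):.1f}x)")
```

File systems make the same trade-off per block: Btrfs's `compression` property and ZFS's `compression=zstd-N` both let you pick a level per dataset.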

Snapshots

A snapshot captures the file system state at a point in time. With COW, snapshots are nearly free -- they simply preserve pointers to existing blocks.

Properties:

  • Creation is O(1) -- no data is copied
  • Space overhead is proportional to changes since the snapshot
  • Read-only snapshots (ZFS default) or writable clones (forked from a snapshot)

Use cases:

  • Pre-upgrade rollback (Btrfs on openSUSE, Fedora)
  • Consistent backups without downtime
  • Database point-in-time recovery
  • Container image layers (overlayfs with underlying Btrfs snapshots)

RAID Levels

| Level | Min Disks | Redundancy | Space Efficiency | Read Perf | Write Perf |
|---------|-----------|--------------|------------------|-----------|------------|
| RAID 0 | 2 | None | 100% | Nx | Nx |
| RAID 1 | 2 | 1 disk | 50% | Nx | 1x |
| RAID 5 | 3 | 1 disk | (N-1)/N | (N-1)x | Reduced |
| RAID 6 | 4 | 2 disks | (N-2)/N | (N-2)x | Reduced |
| RAID 10 | 4 | 1 per mirror | 50% | Nx | N/2 x |
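Single-parity levels like RAID 5 rely on XOR across the stripe: any one lost disk can be rebuilt from the survivors. A minimal sketch (`xor_blocks` is invented for this illustration):

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three data disks plus one parity disk (real RAID 5 rotates the parity
# position per stripe; one stripe is shown here).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Disk 1 fails: rebuild its stripe from the surviving disks plus parity.
recovered = xor_blocks([d0, d2, parity])
print(recovered == d1)  # True
```

RAID 6 extends this with a second, independent parity (Reed-Solomon) so that any two disks can fail.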

RAID-Z (ZFS)

ZFS RAID-Z eliminates the write hole problem (where a crash during parity update leaves RAID in an inconsistent state). Because ZFS uses COW, parity and data are always written together atomically.

| Level | Parity Disks | Equivalent |
|-----------|--------------|---------------|
| RAID-Z1 | 1 | RAID 5 |
| RAID-Z2 | 2 | RAID 6 |
| RAID-Z3 | 3 | Triple parity |

RAID-Z uses variable-width stripes, avoiding the read-modify-write penalty of traditional RAID 5.

NVMe Architecture

NVMe (Non-Volatile Memory Express) is a storage protocol designed for flash and persistent memory, replacing the AHCI/SATA stack.

Key Design Choices

  • Parallel submission queues: Up to 65,535 I/O queue pairs (submission + completion), mapped to CPU cores
  • Low overhead command set: 13 mandatory commands vs. SCSI's hundreds
  • No locks in the I/O path: Each core has its own queue, eliminating contention
  • Direct PCIe attachment: No SAS/SATA HBA -- lower latency

NVMe Namespaces

A single NVMe device can expose multiple namespaces (logical partitions), each appearing as a separate block device. NVMe-oF (over Fabrics) extends NVMe over RDMA or TCP for networked storage.

Performance Comparison

| Metric | SATA SSD | NVMe SSD | Optane PMEM |
|------------------|-----------|-------------|----------------|
| Latency (read) | ~100 us | ~10-20 us | ~300 ns (DAX) |
| IOPS (random 4K) | ~100K | ~500K-1M | ~10M+ |
| Bandwidth (seq) | ~550 MB/s | ~3-7 GB/s | ~6-8 GB/s |

io_uring

io_uring (Linux 5.1+) provides high-performance asynchronous I/O via shared memory rings between user space and kernel.

Architecture

User space                     Kernel
+-----------------+           +------------------+
| Submission Ring | --------> | Process entries  |
| (SQEs)          |           | (no syscall per  |
|                 |           |  I/O in SQPOLL)  |
| Completion Ring | <-------- | Post completions |
| (CQEs)          |           |                  |
+-----------------+           +------------------+

Key features:

  • Batched submission: multiple I/O operations per syscall (or zero syscalls with SQPOLL)
  • Linked operations: chain dependent I/Os
  • Fixed buffers/files: register resources once, avoid per-I/O lookup
  • Supports: read/write, fsync, poll, accept, connect, send/recv, open/close
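The two-ring model can be illustrated in plain Python (this is only a conceptual sketch, not the real API: io_uring is a C interface over mmap'd rings shared with the kernel, normally used via liburing, and `submit`/`io_uring_enter` here are stand-ins):

```python
from collections import deque

# Conceptual stand-ins for the shared rings.
sq = deque()  # submission ring: user space pushes SQEs
cq = deque()  # completion ring: kernel posts CQEs

def submit(op, *args):
    """Queue an SQE; note that no 'syscall' happens here."""
    sq.append((op, args))

def io_uring_enter():
    """One 'syscall' drains the whole submission ring (batching)."""
    while sq:
        op, args = sq.popleft()
        cq.append(op(*args))  # the kernel would perform the I/O asynchronously

# Batch three reads, then enter the kernel once for all of them.
data = {"a": b"1", "b": b"2", "c": b"3"}
for key in ("a", "b", "c"):
    submit(data.get, key)
io_uring_enter()
print(list(cq))  # [b'1', b'2', b'3']
```

With SQPOLL, even the single `io_uring_enter` call disappears: a kernel thread polls the submission ring, so steady-state I/O needs zero syscalls.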

Performance Impact

io_uring achieves near-SPDK (user-space) performance while remaining in the kernel I/O stack:

  • 1M+ IOPS from a single core on NVMe
  • 50-90% reduction in syscall overhead compared to libaio

Persistent Memory File Systems

PMEM-aware file systems are designed to exploit byte-addressable persistent memory with DAX.

| File System | Design | Status |
|-------------|------------------------------------|------------|
| ext4-DAX | ext4 with DAX mmap bypass | Production |
| XFS-DAX | XFS with DAX support | Production |
| NOVA | Log-structured, per-inode logging | Research |
| WineFS | Huge-page-aware for PMEM | Research |

NOVA provides per-inode logs for metadata (avoiding global journal bottleneck) and direct data access for file contents, achieving high concurrency on many-core systems.

Key Takeaways

  1. COW file systems (ZFS, Btrfs) enable atomic updates, cheap snapshots, and self-healing via checksummed Merkle trees
  2. Log-structured file systems excel at write-heavy workloads but require garbage collection; F2FS optimizes this for flash
  3. ZFS RAID-Z eliminates the write hole; Btrfs RAID 5/6 remains less mature
  4. Deduplication is powerful but memory-hungry when done inline (ZFS); reflinks (Btrfs) give COW block sharing at copy time with no dedup table at all
  5. NVMe's multi-queue architecture eliminates the storage stack bottleneck, enabling millions of IOPS
  6. io_uring closes the performance gap between kernel and user-space I/O by eliminating per-I/O syscall overhead
  7. DAX-enabled file systems on persistent memory bypass the entire page cache and block layer