Advanced File Systems

Log-Structured File Systems

Traditional file systems (ext4, XFS) perform in-place updates. Log-structured file systems (LFS) write all data and metadata sequentially to a log, converting random writes into sequential writes.

Design

  1. All writes are buffered in memory and flushed as a contiguous segment
  2. An inode map tracks the current location of each inode (since inodes move)
  3. A segment cleaner reclaims space by compacting live data from old segments
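
The three design points above can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions: the class name `TinyLFS`, the one-block-per-inode model, and the segment size are all hypothetical, and a real LFS cleans segment by segment rather than compacting the whole log.

```python
class TinyLFS:
    """Toy log-structured store: an append-only log plus an inode map."""

    def __init__(self, segment_size=4):
        self.segment_size = segment_size  # blocks per segment (assumed value)
        self.log = []        # the "disk": append-only list of (inode, data)
        self.buffer = []     # writes held in memory until a segment is full
        self.inode_map = {}  # inode -> position of its newest block in the log

    def write(self, inode, data):
        self.buffer.append((inode, data))
        if len(self.buffer) >= self.segment_size:
            self.flush()

    def flush(self):
        # Append the buffered writes as one contiguous run and repoint the map.
        for inode, data in self.buffer:
            self.inode_map[inode] = len(self.log)
            self.log.append((inode, data))
        self.buffer.clear()

    def read(self, inode):
        # The newest copy may still be in the write buffer.
        for buffered_inode, data in reversed(self.buffer):
            if buffered_inode == inode:
                return data
        return self.log[self.inode_map[inode]][1]

    def clean(self):
        # Cleaner: keep only blocks the inode map still references, in log order.
        self.flush()
        live = sorted(self.inode_map.items(), key=lambda item: item[1])
        self.log = [(inode, self.log[pos][1]) for inode, pos in live]
        self.inode_map = {inode: i for i, (inode, _) in enumerate(self.log)}


fs = TinyLFS(segment_size=2)
fs.write(1, "a")
fs.write(2, "b")    # fills a segment, so it is flushed
fs.write(1, "a2")   # supersedes the first copy of inode 1
fs.write(3, "c")    # second segment flushed; log now holds 4 blocks, 3 live
fs.clean()          # dead copy of inode 1 is reclaimed
```

Overwriting inode 1 never touches its old block; the old copy simply becomes garbage for the cleaner, which is where the cleaning overhead and write amplification listed below come from.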

Advantages and Drawbacks

Advantage                             | Drawback
Excellent sequential write speed      | Segment cleaning overhead
Crash recovery is simple (replay log) | Read performance can degrade (data scattered)
Natural support for snapshots         | Write amplification from cleaning

Implementations: NILFS2 (Linux), F2FS (flash-optimized LFS used in Android), Sprite LFS (original research)

F2FS: Flash-Optimized LFS

F2FS (Flash-Friendly File System) adapts LFS design for NAND flash:

  • Multi-head logging with hot/warm/cold data separation
  • Aligns writes to flash erase blocks to reduce garbage collection
  • Node Address Table (NAT) and Segment Information Table (SIT) for metadata
  • Default file system on many Android devices
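
Hot/warm/cold separation can be pictured as several append-only logs, with each block routed to one of them by how soon it is expected to be overwritten. The sketch below is illustrative only: the rules in `data_temperature` are a simplified assumption (directory blocks hot, multimedia and cleaner-moved blocks cold, everything else warm), the names are hypothetical, and the kernel's actual policy has more cases.

```python
from collections import defaultdict


def data_temperature(is_dir_block=False, is_multimedia=False, moved_by_cleaner=False):
    """Pick a log for a data block (simplified, illustrative rules)."""
    if is_dir_block:
        return "hot"    # directory entries tend to be rewritten often
    if is_multimedia or moved_by_cleaner:
        return "cold"   # unlikely to be rewritten soon
    return "warm"       # ordinary file data


class MultiHeadLog:
    """One append-only log per temperature class."""

    def __init__(self):
        self.logs = defaultdict(list)  # temperature -> list of blocks

    def append(self, block, **hints):
        temp = data_temperature(**hints)
        self.logs[temp].append(block)
        return temp


log = MultiHeadLog()
log.append("dentry-block", is_dir_block=True)
log.append("video.mp4#0", is_multimedia=True)
log.append("notes.txt#0")
```

Grouping blocks with similar lifetimes means a segment tends to become invalid all at once, so the cleaner has less live data to copy.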

Copy-on-Write File Systems

COW file systems never overwrite live data in place. Writes go to new locations, and metadata trees are updated with new pointers (also via COW). This provides atomic updates and cheap snapshots; combined with checksums and redundancy, it also enables self-healing.

ZFS

ZFS (originally Sun Microsystems, now OpenZFS) is a combined file system and volume manager.

Core concepts:

  • Vdevs (virtual devices): Abstraction over physical disks, organized into a zpool
  • Copy-on-write everywhere: All writes go to new blocks; no in-place update
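  • Merkle tree integrity: Every block's checksum is stored in its parent's block pointer, not next to the data. Each block is verified against that checksum when it is read, detecting silent data corruption (bit rot); with redundancy (mirror, RAID-Z, or multiple copies), ZFS repairs the bad copy from a good one. A scrub walks and verifies the entire tree.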
  • ARC (Adaptive Replacement Cache): Balances recency and frequency for caching
  • ZIL (ZFS Intent Log): Synchronous write log for fsync semantics
  • SLOG (Separate Log): Fast device (NVMe) for ZIL to accelerate sync writes

ZFS data hierarchy:
  zpool -> vdev (mirror, raidz) -> disk
  zpool -> dataset (filesystem | zvol | snapshot | clone)
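
To make the Merkle-tree idea concrete, here is a minimal sketch in which a parent block stores each child's address together with the child's checksum, so a corrupted child is caught when it is read through its parent. The names (`BlockStore`, `write_parent`, `read_child`), the use of SHA-256, and the JSON encoding are assumptions made for illustration and do not reflect ZFS's on-disk format.

```python
import hashlib
import json


class ChecksumError(Exception):
    pass


def checksum(data):
    return hashlib.sha256(data).hexdigest()


class BlockStore:
    """Never overwrites: every write lands at a fresh address."""

    def __init__(self):
        self.blocks = {}  # address -> bytes

    def write(self, data):
        addr = len(self.blocks)
        self.blocks[addr] = data
        return [addr, checksum(data)]  # a "block pointer": address + checksum

    def read_verified(self, pointer):
        addr, expected = pointer
        data = self.blocks[addr]
        if checksum(data) != expected:
            raise ChecksumError(f"block {addr} failed verification")
        return data


def write_parent(store, child_pointers):
    # The parent's payload is the list of child pointers, checksums included.
    return store.write(json.dumps(child_pointers).encode())


def read_child(store, parent_pointer, index):
    children = json.loads(store.read_verified(parent_pointer))
    return store.read_verified(children[index])


store = BlockStore()
leaves = [store.write(b"alpha"), store.write(b"beta")]
root = write_parent(store, leaves)

store.blocks[leaves[0][0]] = b"alphA"  # simulate bit rot in one leaf
try:
    read_child(store, root, 0)
except ChecksumError as err:
    print("detected:", err)
```

Because the checksum lives in the parent, a block that is corrupted or misdirected cannot vouch for itself. Detection alone does not repair anything; repair needs a second good copy.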

Advanced features:

  • Send/receive: incremental replication of snapshots between pools
  • Native encryption (per-dataset, AES-256-GCM)
  • Inline deduplication and compression
  • Special allocation class: metadata and small blocks on fast SSDs

Btrfs

Btrfs is Linux's native COW file system.

Key features:

  • B-tree based metadata and data organization
  • Subvolumes: lightweight, independently snapshottable file trees within one file system (not separate partitions)
  • Built-in RAID (0, 1, 10, 5/6 -- though RAID 5/6 has had stability issues)
  • Online defragmentation, balance, resize
  • Transparent compression (zstd, lzo, zlib)
  • Send/receive for incremental backups

# Create a Btrfs subvolume and snapshot
btrfs subvolume create /mnt/data/@home
btrfs subvolume snapshot /mnt/data/@home /mnt/data/@home_snap

# Enable zstd compression
btrfs property set /mnt/data/@home compression zstd

ZFS vs. Btrfs

Feature      | ZFS                                         | Btrfs
License      | CDDL (not GPL-compatible)                   | GPL
RAID         | RAID-Z1/Z2/Z3 (stable)                      | RAID 0/1/10 stable, 5/6 fragile
Dedup        | Inline, in-memory DDT (RAM-hungry)          | Offline, via shared (reflinked) extents
Max capacity | 256 quadrillion ZiB (2^128 bytes) per pool  | 16 EiB per volume
Scrubbing    | Comprehensive                               | Supported
Community    | OpenZFS (FreeBSD, Linux)                    | Mainline Linux kernel

Deduplication

Deduplication identifies and eliminates duplicate data blocks, storing each unique block only once.

Inline vs. Offline

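  • Inline (ZFS): Deduplicate at write time. Requires a deduplication table (DDT) that should fit in memory for acceptable performance. Rule of thumb: ~5 GB RAM per TB of data, depending on block size.
  • Offline/batch (Btrfs, Windows ReFS): Scan and deduplicate after data is written. Lower memory pressure but delayed savings.
  • Reflinks: cp --reflink=always creates a COW copy that shares data blocks with the source until either file is modified; cp --reflink=auto does the same where the file system supports it and falls back to a full copy otherwise. Near-instant copies with no DDT overhead.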
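
A minimal sketch of block-level deduplication, assuming fixed-size blocks keyed by SHA-256. The class `DedupStore` is hypothetical. The RAM estimate assumes roughly 320 bytes per DDT entry, a commonly quoted figure that is treated here as an assumption rather than a specification.

```python
import hashlib


class DedupStore:
    """Stores each unique block once, keyed by its hash (a toy DDT)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.ddt = {}  # digest -> [block bytes, reference count]

    def write_file(self, data):
        recipe = []  # the file is remembered as a list of block digests
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()
            entry = self.ddt.get(digest)
            if entry is None:
                self.ddt[digest] = [block, 1]  # first copy: store the data
            else:
                entry[1] += 1                  # duplicate: bump the refcount only
            recipe.append(digest)
        return recipe

    def read_file(self, recipe):
        return b"".join(self.ddt[digest][0] for digest in recipe)

    def physical_bytes(self):
        return sum(len(block) for block, _ in self.ddt.values())


def ddt_ram_bytes(data_bytes, avg_block_bytes, bytes_per_entry=320):
    """Rough DDT size: one entry per unique block (bytes_per_entry is assumed)."""
    return (data_bytes // avg_block_bytes) * bytes_per_entry


store = DedupStore(block_size=4)
first = store.write_file(b"AAAABBBBAAAA")
second = store.write_file(b"BBBBCCCC")
# 20 logical bytes, but only the blocks AAAA, BBBB, CCCC are stored.
```

Under these assumptions, 1 TiB of unique data with a 64 KiB average block size needs about 5 GiB for the table, which is consistent with the rule of thumb above; larger blocks shrink the table proportionally.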

Compression

Modern file systems support transparent block-level compression.

Algorithm | Ratio       | Speed (compress)      | Speed (decompress)  | Typical Use
LZ4       | Low-medium  | Very fast (~700 MB/s) | Very fast           | Default for speed
Zstd      | High        | Fast (~400 MB/s)      | Very fast (~1 GB/s) | Best general choice
Zlib      | High        | Slow (~100 MB/s)      | Medium              | Legacy, archival
Gzip      | Medium-high | Slow                  | Medium              | Compatibility

ZFS selects compression per dataset. Btrfs selects it per mount, per subvolume or directory, or per file.
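
The ratio-versus-speed trade-off in the table can be observed with a small measurement. The sketch below uses Python's standard `zlib` module (DEFLATE) at several levels as a stand-in, since LZ4 and Zstd bindings are not assumed to be installed; the helper name `compression_report` and the sample data are hypothetical, and absolute timings depend on the machine and the data.

```python
import time
import zlib


def compression_report(data, levels=(1, 6, 9)):
    """Return {level: {"ratio": ..., "seconds": ...}} for zlib at each level."""
    report = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        assert zlib.decompress(packed) == data  # must round-trip losslessly
        report[level] = {"ratio": len(data) / len(packed), "seconds": elapsed}
    return report


sample = b"GET /index.html HTTP/1.1 200 OK\n" * 2000
for level, stats in compression_report(sample).items():
    print(f"level {level}: ratio {stats['ratio']:.1f}x in {stats['seconds'] * 1e3:.2f} ms")
```

Measuring on a sample of the real workload matters more than any generic table, because already-compressed data (media, archives) gains little from a second pass.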

Snapshots

A snapshot captures the file system state at a point in time. With COW, snapshots are nearly free -- they simply preserve pointers to existing blocks.

Properties:

  • Creation is O(1) -- no data is copied
  • Space overhead is proportional to changes since the snapshot
  • Read-only snapshots (ZFS default) or writable clones (forked from a snapshot)
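
Why creation is O(1) can be shown with a toy two-level block tree built from immutable tuples: a snapshot just keeps a reference to the current root, and a later write copies only the path it touches. The class `CowVolume` and its layout are hypothetical and chosen for brevity.

```python
class CowVolume:
    """Toy COW volume: root tuple -> leaf tuples -> block contents."""

    LEAF = 4  # blocks per leaf (assumed value)

    def __init__(self, nblocks):
        nleaves = (nblocks + self.LEAF - 1) // self.LEAF
        empty_leaf = (b"",) * self.LEAF
        self.root = (empty_leaf,) * nleaves
        self.snapshots = {}

    def write(self, block_no, data):
        leaf_index, offset = divmod(block_no, self.LEAF)
        leaf = self.root[leaf_index]
        # Copy the touched leaf and the root; every other leaf is shared.
        new_leaf = leaf[:offset] + (data,) + leaf[offset + 1:]
        self.root = self.root[:leaf_index] + (new_leaf,) + self.root[leaf_index + 1:]

    def read(self, block_no, root=None):
        root = self.root if root is None else root
        leaf_index, offset = divmod(block_no, self.LEAF)
        return root[leaf_index][offset]

    def snapshot(self, name):
        # O(1): remember the current root; nothing is copied.
        self.snapshots[name] = self.root


vol = CowVolume(8)
vol.write(0, b"v1")
vol.write(5, b"x")
vol.snapshot("before-upgrade")
vol.write(0, b"v2")  # diverges from the snapshot in leaf 0 only
old = vol.read(0, vol.snapshots["before-upgrade"])  # still b"v1"
```

After the last write, leaf 1 is the same object in the live tree and in the snapshot, which is the sense in which space overhead is proportional to changes since the snapshot.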

Use cases:

  • Pre-upgrade rollback (Btrfs on openSUSE, Fedora)
  • Consistent backups without downtime
  • Database point-in-time recovery
  • Container image layers (for example, Docker's btrfs storage driver, which uses subvolumes and snapshots in place of overlayfs)

RAID Levels

Level   | Min Disks | Redundancy             | Space Efficiency              | Read Perf | Write Perf
RAID 0  | 2         | None                   | 100%                          | Nx        | Nx
RAID 1  | 2         | N-1 disks              | 1/N (50% for a 2-disk mirror) | Up to Nx  | 1x
RAID 5  | 3         | 1 disk                 | (N-1)/N                       | (N-1)x    | Reduced
RAID 6  | 4         | 2 disks                | (N-2)/N                       | (N-2)x    | Reduced
RAID 10 | 4         | 1 disk per mirror pair | 50%                           | Nx        | (N/2)x
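
Single-parity RAID relies on XOR: the parity block is the XOR of the data blocks in a stripe, so any one missing block equals the XOR of the survivors. The sketch below shows that reconstruction along with the space-efficiency column of the table; the function names are hypothetical, and RAID 6's second parity (which needs Galois-field arithmetic) is not modelled.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_fraction(level, n):
    """Fraction of raw capacity available for data with n disks."""
    return {
        "raid0": 1.0,
        "raid1": 1 / n,        # n-way mirror
        "raid5": (n - 1) / n,
        "raid6": (n - 2) / n,
        "raid10": 0.5,         # two-way mirrors, striped
    }[level]


stripe = [b"AAAA", b"BBBB", b"CCCC"]  # data blocks on three disks
parity = xor_blocks(stripe)           # stored on a fourth disk

# The disk holding stripe[1] fails: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]
```

The same identity explains the reduced write performance: updating one data block means the parity must be recomputed and rewritten as well.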

RAID-Z (ZFS)

ZFS RAID-Z eliminates the write hole problem (where a crash during parity update leaves RAID in an inconsistent state). Because ZFS uses COW, parity and data are always written together atomically.

Level   | Parity Disks | Equivalent
RAID-Z1 | 1            | RAID 5
RAID-Z2 | 2            | RAID 6
RAID-Z3 | 3            | Triple parity

RAID-Z uses variable-width stripes, avoiding the read-modify-write penalty of traditional RAID 5.

NVMe Architecture

NVMe (Non-Volatile Memory Express) is a storage protocol designed for flash and persistent memory, replacing the AHCI/SATA stack.

Key Design Choices

  • Parallel submission queues: Up to 65,535 I/O queue pairs (submission + completion), mapped to CPU cores
  • Low overhead command set: 13 mandatory commands vs. SCSI's hundreds
  • No locks in the I/O path: Each core has its own queue, eliminating contention
  • Direct PCIe attachment: No SAS/SATA HBA -- lower latency

NVMe Namespaces

A single NVMe device can expose multiple namespaces (logical partitions), each appearing as a separate block device. NVMe-oF (over Fabrics) extends NVMe over RDMA or TCP for networked storage.

Performance Comparison

Metric           | SATA SSD  | NVMe SSD  | Optane PMEM
Latency (read)   | ~100 µs   | ~10-20 µs | ~300 ns (DAX)
IOPS (random 4K) | ~100K     | ~500K-1M  | ~10M+
Bandwidth (seq)  | ~550 MB/s | ~3-7 GB/s | ~6-8 GB/s
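
The latency and IOPS rows are linked by Little's Law: requests in flight = throughput × latency. The helper below is a back-of-the-envelope sketch with a hypothetical name; it assumes latency stays constant as queue depth grows, which stops being true as a device approaches saturation.

```python
def iops_at(queue_depth, latency_us):
    """Little's Law: throughput = requests in flight / latency."""
    return 1_000_000 * queue_depth / latency_us


print(iops_at(1, 100))   # one request at a time at 100 µs: 10000.0
print(iops_at(32, 100))  # 32 in flight at 100 µs: 320000.0
print(iops_at(32, 20))   # 32 in flight at 20 µs: 1600000.0
```

A device with ~100 µs latency can only reach ~100K IOPS with about ten requests outstanding, which is why deep, parallel queues matter as much as raw latency.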

io_uring

io_uring (Linux 5.1+) provides high-performance asynchronous I/O via shared memory rings between user space and kernel.

Architecture

User space                     Kernel
+-----------------+           +------------------+
| Submission Ring | --------> | Process entries  |
| (SQEs)          |           | (no syscall per  |
|                 |           |  I/O in SQPOLL)  |
| Completion Ring | <-------- | Post completions |
| (CQEs)          |           |                  |
+-----------------+           +------------------+

Key features:

  • Batched submission: multiple I/O operations per syscall (or zero syscalls with SQPOLL)
  • Linked operations: chain dependent I/Os
  • Fixed buffers/files: register resources once, avoid per-I/O lookup
  • Supports: read/write, fsync, poll, accept, connect, send/recv, open/close
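
The submission/completion ring idea can be modelled in a few lines. This is a conceptual toy, not the real interface: `ToyUring`, `prep`, `enter`, and `reap` are hypothetical names, the "kernel" is an ordinary method call, and real programs would use the io_uring system calls (typically through liburing) with rings in shared memory.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring with head/tail counters."""

    def __init__(self, entries):
        assert entries > 0 and entries & (entries - 1) == 0, "power of two"
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced by the consumer
        self.tail = 0  # advanced by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


class ToyUring:
    def __init__(self, entries=8):
        self.sq = Ring(entries)      # submission queue entries (SQEs)
        self.cq = Ring(entries * 2)  # completion queue entries (CQEs)
        self.enter_calls = 0         # counts simulated syscalls

    def prep(self, op, user_data):
        # Queue a request; no "syscall" happens here.
        return self.sq.push((op, user_data))

    def enter(self):
        # Stand-in for the kernel: one call consumes every queued SQE.
        self.enter_calls += 1
        submitted = 0
        while (sqe := self.sq.pop()) is not None:
            op, user_data = sqe
            self.cq.push((user_data, op()))
            submitted += 1
        return submitted

    def reap(self):
        completions = []
        while (cqe := self.cq.pop()) is not None:
            completions.append(cqe)
        return completions


ring = ToyUring(entries=4)
for i in range(3):
    ring.prep(lambda i=i: i * i, user_data=f"req{i}")
ring.enter()               # three operations, one simulated syscall
completions = ring.reap()  # each completion carries its user_data tag
```

The `user_data` tag is how a completion is matched back to its request. With SQPOLL, a kernel thread would poll the submission ring so that even the `enter` step disappears from the application's hot path.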

Performance Impact

Reported benchmarks show io_uring approaching SPDK (user-space) performance while remaining in the kernel I/O stack:

  • 1M+ IOPS from a single core on NVMe
  • 50-90% reduction in syscall overhead compared to libaio

Persistent Memory File Systems

PMEM-aware file systems are designed to exploit byte-addressable persistent memory with DAX.
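
The programming model DAX exposes is memory mapping: the application maps a file and updates it with ordinary stores rather than write() calls. The sketch below shows that model with Python's standard `mmap` module on a regular temporary file, so here the stores still pass through the page cache; on a DAX-mounted file system the same kind of mapping would reach persistent memory directly. The helper name `store_bytes` is hypothetical.

```python
import mmap
import os
import tempfile


def store_bytes(path, offset, payload):
    """Update a file in place through a memory mapping."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mapping:
        mapping[offset:offset + len(payload)] = payload  # plain memory store
        mapping.flush()  # ask for the change to be made durable


fd, path = tempfile.mkstemp()
os.write(fd, bytes(4096))  # the file must be non-empty to be mapped
os.close(fd)

store_bytes(path, 128, b"hello")
with open(path, "rb") as f:
    f.seek(128)
    print(f.read(5))
os.remove(path)
```

On real persistent memory the durability step is typically cache-line flush instructions rather than a system call, which is part of what makes byte-granular updates cheap.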

File System | Design                            | Status
ext4-DAX    | ext4 with DAX mmap bypass         | Production
XFS-DAX     | XFS with DAX support              | Production
NOVA        | Log-structured, per-inode logging | Research
WineFS      | Huge-page-aware for PMEM          | Research

NOVA provides per-inode logs for metadata (avoiding global journal bottleneck) and direct data access for file contents, achieving high concurrency on many-core systems.

Key Takeaways

  1. COW file systems (ZFS, Btrfs) enable atomic updates and cheap snapshots; checksummed Merkle trees plus redundancy add self-healing
  2. Log-structured file systems excel at write-heavy workloads but require garbage collection; F2FS optimizes this for flash
  3. ZFS RAID-Z eliminates the write hole; Btrfs RAID 5/6 remains less mature
  4. Inline deduplication (ZFS) is powerful but memory-intensive; reflinks and offline deduplication (Btrfs) avoid the in-memory DDT
  5. NVMe's multi-queue architecture eliminates the storage stack bottleneck, enabling millions of IOPS
  6. io_uring closes the performance gap between kernel and user-space I/O by eliminating per-I/O syscall overhead
  7. DAX-enabled file systems on persistent memory bypass the page cache and block layer for data access