
Advanced File Systems

Log-Structured File Systems

Traditional file systems (ext4, XFS) perform in-place updates. Log-structured file systems (LFS) write all data and metadata sequentially to a log, converting random writes into sequential writes.

Design

  1. All writes are buffered in memory and flushed as a contiguous segment
  2. An inode map tracks the current location of each inode (since inodes move)
  3. A segment cleaner reclaims space by compacting live data from old segments
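The write path and cleaner described above can be sketched in a few lines of Python (a toy model: `SEGMENT_SIZE`, the `LFS` class, and its tiny in-memory "inode map" are all invented for illustration):

```python
SEGMENT_SIZE = 4  # blocks per segment (tiny, for illustration)

class LFS:
    """Toy log-structured store: every write appends to the tail segment."""
    def __init__(self):
        self.segments = []   # list of segments; each is a list of (key, value)
        self.inode_map = {}  # key -> (segment_idx, slot): latest live location

    def write(self, key, value):
        # Start a new segment when there is no usable tail.
        if not self.segments or self.segments[-1] is None \
                or len(self.segments[-1]) == SEGMENT_SIZE:
            self.segments.append([])
        self.segments[-1].append((key, value))
        self.inode_map[key] = (len(self.segments) - 1, len(self.segments[-1]) - 1)

    def read(self, key):
        seg, slot = self.inode_map[key]
        return self.segments[seg][slot][1]

    def clean(self, seg_idx):
        """Relocate live blocks out of a segment, then reclaim it."""
        live = [(k, v) for slot, (k, v) in enumerate(self.segments[seg_idx])
                if self.inode_map.get(k) == (seg_idx, slot)]  # skip dead blocks
        self.segments[seg_idx] = None  # segment reclaimed
        for k, v in live:
            self.write(k, v)           # live data moves to the log tail

fs = LFS()
fs.write("a", 1); fs.write("b", 2); fs.write("a", 3)  # the overwrite appends
fs.clean(0)
print(fs.read("a"), fs.read("b"))  # 3 2
```

Note that overwriting `"a"` leaves a dead block behind in segment 0; the cleaner copies only the live blocks forward, which is exactly the source of LFS's write amplification.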

Advantages and Drawbacks

| Advantage | Drawback |
|----------------------------------|---------------------------------------|
| Excellent sequential write speed | Segment cleaning overhead |
| Crash recovery is simple (replay log) | Read performance can degrade (data scattered) |
| Natural support for snapshots | Write amplification from cleaning |

Implementations: NILFS2 (Linux), F2FS (flash-optimized LFS used in Android), and Sprite LFS (the original research system by Rosenblum and Ousterhout)

F2FS: Flash-Optimized LFS

F2FS (Flash-Friendly File System) adapts LFS design for NAND flash:

  • Multi-head logging with hot/warm/cold data separation
  • Aligns writes to flash erase blocks to reduce garbage collection
  • Node Address Table (NAT) and Segment Information Table (SIT) for metadata
  • Default file system on many Android devices
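The multi-head logging idea can be sketched as routing each write to a log chosen by its expected temperature (a heavily simplified model: real F2FS maintains six logs, hot/warm/cold for both node and data blocks, and the `classify` heuristic here is invented for illustration):

```python
# Toy multi-head logging: separate append-only logs per data temperature.
logs = {"hot": [], "warm": [], "cold": []}

def classify(kind):
    # Invented simplification of F2FS's policy: metadata (node blocks) is
    # hot, directory entries warm, bulk file data cold.
    return {"node": "hot", "dirent": "warm", "data": "cold"}[kind]

for kind, payload in [("node", "inode#7"), ("data", "blk0"),
                      ("node", "inode#8"), ("data", "blk1")]:
    logs[classify(kind)].append(payload)

print(logs)
```

Separating temperatures means that when a segment is cleaned, its blocks tend to die together, which cuts the amount of live data the cleaner must copy.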

Copy-on-Write File Systems

COW file systems never overwrite existing data. Writes go to new locations, and metadata trees are updated with new pointers (also via COW). This provides atomic updates, cheap snapshots, and, when combined with checksums and redundancy, self-healing.
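The pointer-swap mechanics can be shown in a few lines (a toy model: `alloc`, the `blocks` dict, and the global counter are invented for illustration):

```python
# Minimal copy-on-write sketch: blocks are never overwritten in place.
blocks = {}   # block_id -> data
next_id = 0

def alloc(data):
    """Write data to a fresh block and return its id."""
    global next_id
    blocks[next_id] = data
    next_id += 1
    return next_id - 1

# A "file" is just a pointer to its current block.
root = alloc("version 1")
snapshot = root              # snapshot: remember the old pointer, copy nothing

root = alloc("version 2")    # update = new block, then an atomic pointer swap

print(blocks[snapshot])      # version 1  (the snapshot is untouched)
print(blocks[root])          # version 2
```

Because the old block is never modified, a crash between the block write and the pointer swap leaves the file system on the old, consistent version.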

ZFS

ZFS (originally Sun Microsystems, now OpenZFS) is a combined file system and volume manager.

Core concepts:

  • Vdevs (virtual devices): Abstraction over physical disks, organized into a zpool
  • Copy-on-write everywhere: All writes go to new blocks; no in-place update
  • Merkle tree integrity: Every block has a checksum stored in its parent. The entire tree is verified on read, detecting silent data corruption (bit rot)
  • ARC (Adaptive Replacement Cache): Balances recency and frequency for caching
  • ZIL (ZFS Intent Log): Synchronous write log for fsync semantics
  • SLOG (Separate Log): Fast device (NVMe) for ZIL to accelerate sync writes
ZFS data hierarchy:
  zpool -> vdev (mirror, raidz) -> disk
  zpool -> dataset (filesystem | zvol | snapshot | clone)
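The Merkle-tree integrity idea can be sketched with a two-level tree (a simplified model: real ZFS stores a checksum in every block pointer, all the way up to the uberblock, and the names here are invented):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Leaf blocks, with each block's checksum stored in its parent.
blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
parent_checksums = [h(b) for b in blocks]
root_checksum = h("".join(parent_checksums).encode())  # ...and so on upward

def verified_read(i: int) -> bytes:
    """Read a block and verify it against the checksum held by its parent."""
    data = blocks[i]
    if h(data) != parent_checksums[i]:
        raise IOError(f"checksum mismatch on block {i}: corruption detected")
    return data

verified_read(1)            # clean read passes
blocks[2] = b"bit rot!"     # simulate silent on-disk corruption
try:
    verified_read(2)
except IOError as e:
    print(e)
```

Because the checksum lives in the parent rather than next to the data, a block that is silently rewritten by a failing disk cannot validate itself; with a mirror or RAID-Z vdev, ZFS would repair the bad copy from a good one.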

Advanced features:

  • Send/receive: incremental replication of snapshots between pools
  • Native encryption (per-dataset, AES-256-GCM)
  • Inline deduplication and compression
  • Special allocation class: metadata and small blocks on fast SSDs

Btrfs

Btrfs is Linux's native COW file system.

Key features:

  • B-tree based metadata and data organization
  • Subvolumes: lightweight, independently snapshotable file system partitions
  • Built-in RAID (0, 1, 10, 5/6 -- though RAID 5/6 has had stability issues)
  • Online defragmentation, balance, resize
  • Transparent compression (zstd, lzo, zlib)
  • Send/receive for incremental backups
# Create a Btrfs subvolume and snapshot
btrfs subvolume create /mnt/data/@home
btrfs subvolume snapshot /mnt/data/@home /mnt/data/@home_snap

# Enable zstd compression
btrfs property set /mnt/data/@home compression zstd

ZFS vs. Btrfs

| Feature | ZFS | Btrfs |
|-------------------|----------------------------|---------------------------------|
| License | CDDL (not GPL-compatible) | GPL |
| RAID | RAID-Z1/Z2/Z3 (stable) | RAID 0/1/10 stable, 5/6 fragile |
| Dedup | In-memory DDT (RAM-hungry) | Offline via reflink |
| Max capacity | 256 ZiB | 16 EiB |
| Scrubbing | Comprehensive | Supported |
| Community | OpenZFS (FreeBSD, Linux) | Mainline Linux kernel |

Deduplication

Deduplication identifies and eliminates duplicate data blocks, storing each unique block only once.

Inline vs. Offline

  • Inline (ZFS): Deduplicate at write time. Requires an in-memory deduplication table (DDT). Rule of thumb: ~5 GB RAM per TB of data.
  • Offline/batch (Btrfs, Windows ReFS): Scan and dedup after data is written. Lower memory pressure but delayed savings.
  • Reflinks: cp --reflink=auto creates COW copies that share data blocks until modified. Near-instant copies with no DDT overhead.
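The DDT mechanics can be sketched as a content-addressed block store (the `DedupStore` class and its tiny block size are invented for illustration):

```python
import hashlib

class DedupStore:
    """Content-addressed block store: each unique block is stored once."""
    def __init__(self):
        self.ddt = {}    # digest -> block (the deduplication table)
        self.files = {}  # filename -> ordered list of block digests

    def write(self, name, data, block_size=4):
        digests = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            d = hashlib.sha256(block).hexdigest()
            self.ddt.setdefault(d, block)  # store the block only if unseen
            digests.append(d)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.ddt[d] for d in self.files[name])

s = DedupStore()
s.write("a.txt", b"AAAABBBBAAAA")  # first and last blocks are identical
s.write("b.txt", b"AAAACCCC")      # first block duplicates a.txt's
print(len(s.ddt))                  # 3 unique blocks backing 5 logical blocks
print(s.read("a.txt"))             # b'AAAABBBBAAAA'
```

The memory cost is visible here: every unique block needs a DDT entry, which is why inline dedup on ZFS consumes RAM in proportion to the pool's unique data.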

Compression

Modern file systems support transparent block-level compression.

| Algorithm | Ratio | Speed (compress) | Speed (decompress) | Typical Use |
|-----------|-------------|------------------------|---------------------|---------------------|
| LZ4 | Low-medium | Very fast (~700 MB/s) | Very fast | Default for speed |
| Zstd | High | Fast (~400 MB/s) | Very fast (~1 GB/s) | Best general choice |
| Zlib | High | Slow (~100 MB/s) | Medium | Legacy, archival |
| Gzip | Medium-high | Slow | Medium | Compatibility |

ZFS and Btrfs allow per-dataset or per-file compression selection.
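Of the algorithms above, only zlib ships with Python's standard library (LZ4 and zstd need third-party bindings), but it is enough to show the level-versus-ratio trade-off the table describes:

```python
import zlib

# Repetitive data compresses well; levels trade CPU time for ratio.
data = b"the quick brown fox jumps over the lazy dog " * 200

for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(data)} -> {len(out)} bytes "
          f"(ratio {len(data) / len(out):.1f}x)")
```

File systems make the same trade-off per block: Btrfs's `compression` property and ZFS's `compression=zstd-N` both let you pick a level per dataset.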

Snapshots

A snapshot captures the file system state at a point in time. With COW, snapshots are nearly free -- they simply preserve pointers to existing blocks.

Properties:

  • Creation is O(1) -- no data is copied
  • Space overhead is proportional to changes since the snapshot
  • Read-only snapshots (ZFS default) or writable clones (forked from a snapshot)

Use cases:

  • Pre-upgrade rollback (Btrfs on openSUSE, Fedora)
  • Consistent backups without downtime
  • Database point-in-time recovery
  • Container image layers (overlayfs with underlying Btrfs snapshots)

RAID Levels

| Level | Min Disks | Redundancy | Space Efficiency | Read Perf | Write Perf |
|---------|-----------|--------------|------------------|-----------|------------|
| RAID 0 | 2 | None | 100% | Nx | Nx |
| RAID 1 | 2 | 1 disk | 50% | Nx | 1x |
| RAID 5 | 3 | 1 disk | (N-1)/N | (N-1)x | Reduced |
| RAID 6 | 4 | 2 disks | (N-2)/N | (N-2)x | Reduced |
| RAID 10 | 4 | 1 per mirror | 50% | Nx | N/2 x |
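Single-parity levels like RAID 5 rely on XOR across the stripe: any one lost disk can be rebuilt from the survivors. A minimal sketch (`xor_blocks` is invented for this illustration):

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three data disks plus one parity disk (real RAID 5 rotates the parity
# position per stripe; one stripe is shown here).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Disk 1 fails: rebuild its stripe from the surviving disks plus parity.
recovered = xor_blocks([d0, d2, parity])
print(recovered == d1)  # True
```

RAID 6 extends this with a second, independent parity (Reed-Solomon) so that any two disks can fail.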

RAID-Z (ZFS)

ZFS RAID-Z eliminates the write hole problem (where a crash during parity update leaves RAID in an inconsistent state). Because ZFS uses COW, parity and data are always written together atomically.

| Level | Parity Disks | Equivalent |
|-----------|--------------|---------------|
| RAID-Z1 | 1 | RAID 5 |
| RAID-Z2 | 2 | RAID 6 |
| RAID-Z3 | 3 | Triple parity |

RAID-Z uses variable-width stripes, avoiding the read-modify-write penalty of traditional RAID 5.

NVMe Architecture

NVMe (Non-Volatile Memory Express) is a storage protocol designed for flash and persistent memory, replacing the AHCI/SATA stack.

Key Design Choices

  • Parallel submission queues: Up to 65,535 I/O queue pairs (submission + completion), mapped to CPU cores
  • Low overhead command set: 13 mandatory commands vs. SCSI's hundreds
  • No locks in the I/O path: Each core has its own queue, eliminating contention
  • Direct PCIe attachment: No SAS/SATA HBA -- lower latency

NVMe Namespaces

A single NVMe device can expose multiple namespaces (logical partitions), each appearing as a separate block device. NVMe-oF (over Fabrics) extends NVMe over RDMA or TCP for networked storage.

Performance Comparison

| Metric | SATA SSD | NVMe SSD | Optane PMEM |
|------------------|-----------|-------------|----------------|
| Latency (read) | ~100 us | ~10-20 us | ~300 ns (DAX) |
| IOPS (random 4K) | ~100K | ~500K-1M | ~10M+ |
| Bandwidth (seq) | ~550 MB/s | ~3-7 GB/s | ~6-8 GB/s |

io_uring

io_uring (Linux 5.1+) provides high-performance asynchronous I/O via shared memory rings between user space and kernel.

Architecture

User space                     Kernel
+-----------------+           +------------------+
| Submission Ring | --------> | Process entries  |
| (SQEs)          |           | (no syscall per  |
|                 |           |  I/O in SQPOLL)  |
| Completion Ring | <-------- | Post completions |
| (CQEs)          |           |                  |
+-----------------+           +------------------+

Key features:

  • Batched submission: multiple I/O operations per syscall (or zero syscalls with SQPOLL)
  • Linked operations: chain dependent I/Os
  • Fixed buffers/files: register resources once, avoid per-I/O lookup
  • Supports: read/write, fsync, poll, accept, connect, send/recv, open/close
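The two-ring model can be illustrated in plain Python (this is only a conceptual sketch, not the real API: io_uring is a C interface over mmap'd rings shared with the kernel, normally used via liburing, and `submit`/`io_uring_enter` here are stand-ins):

```python
from collections import deque

# Conceptual stand-ins for the shared rings.
sq = deque()  # submission ring: user space pushes SQEs
cq = deque()  # completion ring: kernel posts CQEs

def submit(op, *args):
    """Queue an SQE; note that no 'syscall' happens here."""
    sq.append((op, args))

def io_uring_enter():
    """One 'syscall' drains the whole submission ring (batching)."""
    while sq:
        op, args = sq.popleft()
        cq.append(op(*args))  # the kernel would perform the I/O asynchronously

# Batch three reads, then enter the kernel once for all of them.
data = {"a": b"1", "b": b"2", "c": b"3"}
for key in ("a", "b", "c"):
    submit(data.get, key)
io_uring_enter()
print(list(cq))  # [b'1', b'2', b'3']
```

With SQPOLL, even the single `io_uring_enter` call disappears: a kernel thread polls the submission ring, so steady-state I/O needs zero syscalls.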

Performance Impact

io_uring achieves near-SPDK (user-space) performance while remaining in the kernel I/O stack:

  • 1M+ IOPS from a single core on NVMe
  • 50-90% reduction in syscall overhead compared to libaio

Persistent Memory File Systems

PMEM-aware file systems are designed to exploit byte-addressable persistent memory with DAX.

| File System | Design | Status |
|-------------|------------------------------------|------------|
| ext4-DAX | ext4 with DAX mmap bypass | Production |
| XFS-DAX | XFS with DAX support | Production |
| NOVA | Log-structured, per-inode logging | Research |
| WineFS | Huge-page-aware for PMEM | Research |

NOVA provides per-inode logs for metadata (avoiding global journal bottleneck) and direct data access for file contents, achieving high concurrency on many-core systems.

Key Takeaways

  1. COW file systems (ZFS, Btrfs) enable atomic updates, cheap snapshots, and self-healing via checksummed Merkle trees
  2. Log-structured file systems excel at write-heavy workloads but require garbage collection; F2FS optimizes this for flash
  3. ZFS RAID-Z eliminates the write hole; Btrfs RAID 5/6 remains less mature
  4. Deduplication is powerful but memory-hungry when done inline (ZFS); reflinks (Btrfs) give COW block sharing at copy time with no dedup table at all
  5. NVMe's multi-queue architecture eliminates the storage stack bottleneck, enabling millions of IOPS
  6. io_uring closes the performance gap between kernel and user-space I/O by eliminating per-I/O syscall overhead
  7. DAX-enabled file systems on persistent memory bypass the entire page cache and block layer