File System Implementations
This file surveys major file system implementations, their design decisions, and when to use each.
ext2/ext3/ext4 (Linux)
ext2 (1993)
Traditional UNIX-style file system. Block groups divide the disk into manageable segments, each with its own inode table and block bitmap.
Structure:
[Boot Block][Block Group 0][Block Group 1]...[Block Group N]
Block Group:
[Superblock copy][Group Descriptor][Block Bitmap][Inode Bitmap][Inode Table][Data Blocks]
No journaling: an unclean shutdown requires a full fsck pass over the disk.
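The block-group layout above makes inode lookup simple arithmetic. A minimal sketch, with illustrative constants (a real ext2 superblock stores the actual inodes-per-group and inode-size values):

```python
# Sketch: locating an inode on an ext2-style block-group layout. The constants
# are illustrative, not read from a real superblock.
INODES_PER_GROUP = 8192   # hypothetical value from a superblock
INODE_SIZE = 128          # classic ext2 on-disk inode size in bytes

def locate_inode(ino: int) -> tuple[int, int, int]:
    """Return (block_group, index_within_group, byte_offset_in_inode_table).

    ext2 numbers inodes from 1, so subtract 1 before dividing.
    """
    group, index = divmod(ino - 1, INODES_PER_GROUP)
    return group, index, index * INODE_SIZE

# Inode 8193 is the first inode of block group 1:
print(locate_inode(8193))  # (1, 0, 0)
```

Because every group carries its own inode table and bitmaps, this lookup never consults a global structure, which is what keeps ext2's metadata access local to one region of the disk.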
ext3 (2001)
ext2 + journaling. Three journal modes: journal (data and metadata journaled), ordered (default; metadata journaled, data blocks written before their metadata commits), writeback (metadata journaled, no data ordering guarantee). Backward compatible with ext2.
ext4 (2008)
The Linux default file system. Major improvements over ext3:
Extents: Replace block lists with (start_block, length) pairs. Reduces metadata for large contiguous files. Up to 128 MB per extent (with 4K blocks).
Multiblock allocation: Allocate multiple blocks at once (reduces fragmentation).
Delayed allocation: Don't allocate blocks until data is flushed to disk. Better allocation decisions.
Maximum sizes: File: 16 TB. File system: 1 EB. Directory: unlimited (htree).
Checksums: Metadata checksums (superblock, group descriptors, inodes). Detects corruption.
Online resize: Grow the file system without unmounting.
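The metadata savings from extents can be made concrete. A sketch assuming 4 KiB blocks; a real ext4 inode also grows a small extent tree once a file needs more than four extents:

```python
# Sketch: why extents shrink metadata for large contiguous files.
BLOCK_SIZE = 4096
MAX_EXTENT_BLOCKS = 32768       # 15-bit length field -> 128 MiB per extent

def block_pointers_needed(file_size: int) -> int:
    """ext2/ext3 style: one block pointer per block (ignoring indirect-block overhead)."""
    return -(-file_size // BLOCK_SIZE)       # ceiling division

def min_extents_needed(file_size: int) -> int:
    """ext4 style: one (start_block, length) pair per contiguous run of up to 128 MiB."""
    blocks = -(-file_size // BLOCK_SIZE)
    return -(-blocks // MAX_EXTENT_BLOCKS)

one_gib = 1 << 30
print(block_pointers_needed(one_gib))  # 262144 pointers
print(min_extents_needed(one_gib))     # 8 extents, if the file is contiguous
```

A fully contiguous 1 GiB file needs 262,144 block pointers under the old scheme but only 8 extents, which is why delayed and multiblock allocation (which promote contiguity) pair so well with extents.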
XFS
High-performance file system originally from SGI (1994). Designed for large files and high throughput.
Key features:
- B+ tree for everything: inodes, free space, directory entries, extent maps
- Allocation groups: Parallel allocation across independent groups. Good SMP scalability
- Delayed allocation: Like ext4. Reduces fragmentation
- Real-time subvolume: Guaranteed-rate I/O for media streaming
- Excellent scalability: Handles very large files (8 EB) and file systems (8 EB)
Best for: Large files, media servers, HPC storage, data warehouses.
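The scalability win from allocation groups is that each group has independent free-space accounting, so allocations on different CPUs rarely contend. A toy sketch (per-group counters and locks stand in for XFS's per-group B+ trees; the group count and fallback policy are illustrative):

```python
# Sketch: independent allocation groups allow parallel allocation because each
# group has its own free-space state and its own lock, not one global lock.
import threading

class AllocationGroup:
    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks
        self.lock = threading.Lock()  # per-group lock

    def alloc(self, n: int) -> bool:
        with self.lock:
            if self.free_blocks >= n:
                self.free_blocks -= n
                return True
            return False

groups = [AllocationGroup(1000) for _ in range(4)]

def alloc_blocks(hint: int, n: int) -> int:
    """Try the hinted group first, then fall back to the others."""
    for i in range(len(groups)):
        g = (hint + i) % len(groups)
        if groups[g].alloc(n):
            return g
    raise OSError("ENOSPC")

print(alloc_blocks(2, 100))  # allocates from group 2
```

Two threads allocating with different hints touch disjoint locks, which is the essence of XFS's SMP scalability claim.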
Btrfs (B-tree File System)
Linux's "next-generation" file system. Copy-on-write (COW) design.
Key features:
- COW B-trees: All metadata and data in B-trees. Never overwrites in place
- Snapshots: Instant, space-efficient (share unchanged blocks). Writable snapshots (subvolume clones)
- Checksums: Data and metadata checksummed (CRC32C, xxHash, SHA-256). Detects and optionally repairs corruption
- RAID support: Built-in RAID 0/1/5/6/10 (RAID 5/6 still has known issues)
- Compression: Transparent per-file compression (zlib, LZO, Zstandard)
- Deduplication: Out-of-band (offline) deduplication via userspace tools; in-band (online) dedup exists but is experimental
- Subvolumes: Lightweight, independent file system trees within one file system. Can be mounted separately
- Send/receive: Incremental snapshot transfer (efficient backups)
Use cases: Desktop Linux, NAS (Synology uses Btrfs), snapshot-based backups, development environments.
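The COW snapshot mechanics can be sketched in a few lines. Here a "volume" is just a mapping from offset to an immutable block; real Btrfs shares at B-tree node granularity, but the sharing behavior is the same:

```python
# Sketch: copy-on-write snapshots share unchanged blocks. A snapshot copies
# the mapping (cheap), never the data; writes install new blocks instead of
# mutating old ones in place.
class CowVolume:
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})  # offset -> bytes (treated as immutable)

    def write(self, offset: int, data: bytes):
        self.blocks[offset] = data        # new block; the old one stays shared

    def snapshot(self) -> "CowVolume":
        return CowVolume(self.blocks)     # copies pointers only, no data copy

vol = CowVolume()
vol.write(0, b"hello")
snap = vol.snapshot()
vol.write(0, b"world")

print(snap.blocks[0])  # b'hello' -- the snapshot still sees the old block
print(vol.blocks[0])   # b'world'
```

Because nothing is overwritten in place, a snapshot costs only the mapping copy, and both trees keep diverging independently afterward, which is also why writable snapshots come for free in this design.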
ZFS
The most feature-rich file system. Originally from Sun Microsystems (2005). Available on FreeBSD, Linux (via OpenZFS), macOS.
Key features:
- Pooled storage: Combine multiple disks into a zpool. File systems draw from the pool
- Copy-on-write: Never overwrites live data. Atomic transactions
- End-to-end checksums: Every block checksummed. Detects silent data corruption (bit rot)
- RAID-Z: ZFS's RAID implementation (Z1/Z2/Z3 = 1/2/3 parity disks). No write hole
- Snapshots and clones: Instant, zero-cost snapshots. Writable clones
- Deduplication: Block-level dedup (memory-intensive — needs ~5 GB RAM per TB of data)
- Compression: LZ4 (default), Zstandard, gzip. Transparent
- Adaptive Replacement Cache (ARC): Advanced caching algorithm (better than LRU)
- Self-healing: With redundancy, automatically repairs corrupted blocks using good copies
Use cases: NAS, enterprise storage, backup servers, databases (PostgreSQL on ZFS is popular).
Limitation on Linux: CDDL license incompatible with GPL (loaded as kernel module, not compiled into kernel).
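End-to-end checksums plus redundancy is what enables self-healing, and the interaction is easy to demonstrate. A minimal sketch with two mirrored replicas and CRC32 (hypothetical structures, not ZFS's actual on-disk format, which stores checksums in parent block pointers):

```python
# Sketch: reads verify a checksum stored separately from the data; on a
# mismatch, the good replica both serves the read and repairs the bad one.
import zlib

class MirroredStore:
    def __init__(self):
        self.disks = [{}, {}]      # two replicas: block_id -> bytes
        self.checksums = {}        # block_id -> expected CRC, kept with the pointer

    def write(self, block_id: int, data: bytes):
        self.checksums[block_id] = zlib.crc32(data)
        for disk in self.disks:
            disk[block_id] = data

    def read(self, block_id: int) -> bytes:
        expected = self.checksums[block_id]
        for disk in self.disks:
            data = disk[block_id]
            if zlib.crc32(data) == expected:
                # Self-heal: rewrite any replica that fails verification.
                for other in self.disks:
                    if zlib.crc32(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError("all copies corrupt")

store = MirroredStore()
store.write(7, b"important")
store.disks[0][7] = b"importanX"      # simulate silent bit rot on disk 0
print(store.read(7))                  # b'important' -- served from disk 1
print(store.disks[0][7])              # b'important' -- disk 0 repaired
```

A plain RAID mirror cannot do this: without the checksum it has no way to know which of two differing copies is the good one.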
NTFS (Windows)
Microsoft's default file system since Windows NT.
Key features:
- Master File Table (MFT): Central structure. Each file has an MFT entry containing metadata and small file data
- Resident files: Small files stored directly in the MFT entry (< ~900 bytes)
- Journaling: Metadata journaling for crash recovery
- Access Control Lists (ACLs): Fine-grained permissions (richer than UNIX rwx)
- Alternate Data Streams (ADS): Multiple data streams per file (used for metadata, security concerns)
- Compression and encryption: Per-file transparent compression (LZ77) and encryption (EFS)
Maximum sizes: File: 16 EB (theoretical), 256 TB (practical). Volume: 256 TB.
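The resident/non-resident decision is a simple size threshold. A sketch with illustrative numbers (MFT entries are 1 KiB; the exact space left for data depends on how many attributes the entry already holds):

```python
# Sketch: NTFS stores tiny files inside the MFT entry itself ("resident");
# larger files get cluster runs pointing at external data ("non-resident").
MFT_ENTRY_SIZE = 1024
RESIDENT_THRESHOLD = 900  # approximate room left after headers and attributes

def store_file(data: bytes) -> str:
    if len(data) <= RESIDENT_THRESHOLD:
        return "resident"       # data lives in the MFT entry: zero extra seeks
    return "non-resident"       # MFT entry holds cluster runs to the data

print(store_file(b"x" * 100))    # resident
print(store_file(b"x" * 4096))   # non-resident
```

Resident storage means reading a small file costs no I/O beyond the MFT entry itself, a meaningful win for directories full of tiny configuration files.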
APFS (Apple File System)
Apple's file system since macOS 10.13 / iOS 10.3 (2017). Replaced HFS+.
Key features:
- Copy-on-write: Atomic operations, snapshots
- Space sharing: Multiple volumes share a container (pool of space)
- Encryption: Native per-file and per-volume encryption (hardware-accelerated)
- Snapshots: Instant snapshots (Time Machine uses these)
- Clones: Instant file copies (no data duplication until modified)
- Crash protection: COW metadata ensures consistency
- Optimized for SSD: TRIM support, no journaling needed (COW provides consistency)
F2FS (Flash-Friendly File System)
Samsung-designed file system optimized for NAND flash (SSDs, eMMC, SD cards).
Key features:
- Log-structured: Appends data sequentially (matches flash write patterns)
- Multi-head logging: Multiple log areas for concurrent writes
- Adaptive logging: Switches between append and in-place update based on utilization
- Node Address Table (NAT): Indirection for efficient garbage collection
- Designed for flash: Aligns I/O to erase block boundaries. Minimizes write amplification
Used in: Android phones (default on many Samsung and Google Pixel devices), SD cards.
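The log-structured idea plus NAT-style indirection can be sketched briefly. This toy version is loosely analogous to F2FS's design, not its actual on-disk format:

```python
# Sketch: updates append new versions to the log tail (sequential, flash-
# friendly); an indirection table maps logical block -> current log position,
# so garbage collection can relocate live data without chasing every pointer.
class LogFS:
    def __init__(self):
        self.log = []      # append-only: (logical_block, data)
        self.table = {}    # logical_block -> index into self.log

    def write(self, block: int, data: bytes):
        self.log.append((block, data))         # never updates in place
        self.table[block] = len(self.log) - 1  # redirect; old copy is garbage

    def read(self, block: int) -> bytes:
        return self.log[self.table[block]][1]

    def live_ratio(self) -> float:
        return len(self.table) / len(self.log)

fs = LogFS()
fs.write(1, b"v1")
fs.write(1, b"v2")       # "overwrite" appends a new version
print(fs.read(1))        # b'v2'
print(fs.live_ratio())   # 0.5 -- half the log is garbage, a GC candidate
```

When the live ratio of a segment drops, the garbage collector copies its remaining live blocks forward and erases it, which is exactly the pattern NAND flash wants.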
Distributed File Systems
NFS (Network File System)
Client-server protocol. Remote file access transparent to applications.
NFSv4: Stateful. Compound operations. Delegation (client caching). Security (Kerberos). Performance improvements over v3.
CIFS/SMB (Windows File Sharing)
Microsoft's network file protocol. Native on Windows and macOS; Linux supports it via Samba.
SMB3: Encryption, multichannel, directory leasing. Used by Azure Files.
HDFS (Hadoop Distributed File System)
Designed for large-scale data processing. Files split into large blocks (128 MB default), replicated across nodes (3× default).
Architecture: Single NameNode (metadata) + many DataNodes (data). Optimized for sequential reads of large files. Write-once, read-many model.
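The NameNode's block bookkeeping amounts to splitting by a fixed block size and choosing replica sets. A sketch with HDFS's default numbers; the round-robin placement here is a stand-in for HDFS's rack-aware policy:

```python
# Sketch: split a file into 128 MiB blocks and assign 3 replicas per block.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MiB default
REPLICATION = 3

def plan_blocks(file_size: int, datanodes: list[str]):
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    plan = []
    for b in range(n_blocks):
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        plan.append((b, replicas))
    return plan

nodes = ["dn1", "dn2", "dn3", "dn4"]
for block, replicas in plan_blocks(300 * 1024 * 1024, nodes):
    print(block, replicas)
# block 0 -> dn1, dn2, dn3; block 1 -> dn2, dn3, dn4; block 2 -> dn3, dn4, dn1
```

Large blocks keep the NameNode's metadata small (a 300 MiB file is only 3 entries) and make sequential reads long enough to amortize seek and scheduling overhead.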
Ceph
Distributed storage providing file, block, and object storage. No single point of failure.
CRUSH algorithm: Algorithmic data placement (no central metadata server for data placement). Self-healing, self-managing.
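The key property of CRUSH is that placement is computed, not looked up: any client derives an object's location from the object name and the cluster map alone. A simplified stand-in using rendezvous (highest-random-weight) hashing, which shares that property but is not the real CRUSH algorithm:

```python
# Sketch: algorithmic placement. Every client computes the same replica set
# with no metadata-server round trip; removing an unrelated OSD does not
# disturb existing placements.
import hashlib

def place(obj: str, osds: list[str], replicas: int = 3) -> list[str]:
    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{obj}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # Highest-scoring OSDs win; deterministic for a given (object, cluster map).
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(6)]
print(place("pg.object-42", osds))  # same answer on every client, every time
```

Real CRUSH adds weighted buckets and failure-domain hierarchies (host, rack, row) so replicas land in independent fault zones, but the "no central lookup" principle is the same.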
GlusterFS
Distributed file system aggregating storage from multiple servers. POSIX-compatible. Used for cloud storage and media streaming.
FUSE (Filesystem in Userspace)
Framework for implementing file systems in user space (no kernel code needed).
Application → VFS → FUSE kernel module → FUSE library → User-space FS daemon
Advantages: Easy development (any language). No kernel crashes from bugs. Portable. Disadvantages: Performance overhead (user-kernel context switches per operation).
Examples: sshfs (mount remote directories via SSH), s3fs (mount S3 buckets), ntfs-3g (NTFS on Linux), rclone mount.
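The daemon at the end of that chain is just an object implementing file operations. A sketch of the shape of such a daemon, minus the kernel plumbing; a real implementation would register these handlers through a binding such as libfuse or a FUSE library for your language, whereas here they are called directly:

```python
# Sketch: the user-space side of a FUSE-style file system. The kernel module
# would normally invoke these handlers in response to syscalls; this toy
# version just backs them with an in-memory dict.
import errno

class MemFS:
    def __init__(self):
        self.files = {"/hello.txt": b"Hello from user space\n"}

    def readdir(self, path: str) -> list[str]:
        return [p.lstrip("/") for p in self.files]

    def getattr(self, path: str) -> dict:
        if path not in self.files:
            raise OSError(errno.ENOENT, "no such file", path)
        return {"st_size": len(self.files[path]), "st_mode": 0o100644}

    def read(self, path: str, size: int, offset: int) -> bytes:
        return self.files[path][offset:offset + size]

fs = MemFS()
print(fs.readdir("/"))              # ['hello.txt']
print(fs.read("/hello.txt", 5, 0))  # b'Hello'
```

Each of these handlers corresponds to one hop through the Application → VFS → FUSE module → daemon path above, which is also where the per-operation context-switch overhead comes from.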
Comparison
| FS | COW | Snapshots | Checksums | Compression | Max File | Best For |
|---|---|---|---|---|---|---|
| ext4 | No | No | Metadata | No | 16 TB | General Linux |
| XFS | No | No | Metadata | No | 8 EB | Large files, HPC |
| Btrfs | Yes | Yes | Data+Meta | Yes | 16 EB | Desktop, NAS |
| ZFS | Yes | Yes | Data+Meta | Yes | 16 EB | Enterprise, NAS |
| NTFS | No | VSS | Metadata | Yes | 256 TB | Windows |
| APFS | Yes | Yes | Metadata | No | 8 EB | macOS/iOS |
| F2FS | Hybrid | No | Metadata | Yes | ~4 TB | Flash/Mobile |
Applications in CS
- System administration: Choosing the right file system for workload (ext4 for general, XFS for databases, ZFS for integrity).
- Backup and recovery: ZFS/Btrfs snapshots for instant backups. Send/receive for incremental replication.
- Cloud storage: Distributed file systems (Ceph, HDFS) underpin cloud infrastructure.
- Containers: OverlayFS provides Docker's layered image model. Each layer is a read-only file system snapshot.
- Database storage: Direct I/O bypasses file system caching. COW file systems can amplify writes under database WAL workloads, so admins often disable COW for data files (Btrfs nodatacow) or tune block sizes (ZFS recordsize).
- Embedded: FAT for compatibility (SD cards, USB). LittleFS for microcontrollers (wear leveling, power-loss resilient).