External Memory Algorithms
Overview
When data exceeds main memory, the cost of disk/SSD I/O dominates computation time. External memory (EM) algorithms minimize the number of I/O operations (block transfers) between slow external storage and fast internal memory, achieving orders-of-magnitude speedups over algorithms designed for the RAM model.
The I/O Model (Aggarwal-Vitter)
Parameters
- N: problem size (number of elements)
- M: internal memory size (number of elements that fit in RAM)
- B: block size (number of elements transferred per I/O operation)
- Assume M >= 2B (at least two blocks fit in memory)
Cost Metric
Count the number of I/O operations (block transfers). CPU computation is free.
Fundamental Bounds
- Scanning N elements: O(N/B) I/Os — read sequentially
- Sorting N elements: O((N/B) * log_{M/B}(N/B)) I/Os
- Searching among N elements: O(log_B N) I/Os (with B-tree)
- Permuting N elements: O(min(N, (N/B) * log_{M/B}(N/B))) I/Os
The sorting bound is the most important: it shows that external sorting is cheaper than arbitrary permutation, and many problems reduce to sorting in the EM model.
Tall Cache Assumption
M >= B^{1+epsilon} for some constant epsilon > 0. Required for many cache-oblivious results. Typically satisfied in practice (M >> B^2).
External Sorting
External Merge Sort
Phase 1 (Run Formation):
Read M elements at a time into memory
Sort each chunk internally (e.g., quicksort)
Write sorted runs of length M back to disk
Result: ceil(N/M) sorted runs
Phase 2 (Multiway Merge):
Merge (M/B - 1) runs at a time:
Allocate one input buffer (B elements) per run
Allocate one output buffer (B elements)
Repeatedly extract minimum element across all runs
Refill input buffers as needed
Repeat until single sorted run remains
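The two phases above can be sketched in Python. Here `heapq.merge` plays the role of the multiway merger with its lazily refilled input buffers, and `mem_size` (measured in lines) stands in for M; the file handling is illustrative, assuming newline-terminated integer records, not a production implementation.

```python
import heapq
import os
import tempfile

def external_merge_sort(input_path, output_path, mem_size=10_000):
    run_paths = []
    # Phase 1: run formation -- sort memory-sized chunks, write sorted runs.
    with open(input_path) as f:
        while True:
            chunk = [line for _, line in zip(range(mem_size), f)]
            if not chunk:
                break
            chunk.sort(key=int)  # internal sort of one memory-load
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    # Phase 2: one multiway merge (assumes the run count fits the fan-out).
    files = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        # heapq.merge streams the k-way merge, pulling from each run lazily.
        for line in heapq.merge(*files, key=int):
            out.write(line)
    for f in files:
        f.close()
    for p in run_paths:
        os.remove(p)
```

With 1,000 records and `mem_size=128` this forms 8 runs and merges them in a single pass.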
I/O Analysis
- Run formation: O(N/B) I/Os (scan once)
- Each merge pass: O(N/B) I/Os (scan all data)
- Number of passes: O(log_{M/B}(N/M))
- Total: O((N/B) * log_{M/B}(N/B)) — matches the lower bound
Practical Optimizations
- Replacement-selection: create runs of expected length 2M (double-length runs)
- Forecasting: prefetch blocks from runs that will be needed soon
- SSD awareness: exploit parallelism in flash storage
- Tournament trees for efficient k-way merging
Example
N = 10^9 elements, M = 10^7 (100 MB), B = 10^4 (page size):
- Runs: 100 sorted runs of 10M elements each
- Fan-out: ~999 (merge all in one pass)
- Total I/Os: two passes over the data (run formation + one merge), each reading and writing N/B = 10^5 blocks, so ~4 * 10^5 = 400K block transfers
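A quick back-of-the-envelope check of the example, counting both reads and writes per pass (pure arithmetic, no disk involved):

```python
import math

N, M, B = 10**9, 10**7, 10**4
runs = math.ceil(N / M)                 # sorted runs after phase 1
fan_out = M // B - 1                    # runs merged per pass (one buffer reserved for output)
merge_passes = math.ceil(math.log(runs, fan_out))
blocks_per_pass = 2 * (N // B)          # each pass reads and writes every block
total = (1 + merge_passes) * blocks_per_pass
```

Since `runs = 100` is below the fan-out of 999, a single merge pass suffices.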
B-Trees
The fundamental external memory search structure.
Structure
- Balanced tree with branching factor O(B)
- Each node occupies one disk block
- Internal node: up to B keys and B+1 child pointers
- Leaf node: up to B key-value pairs
- All leaves at the same depth: O(log_B N)
Operations
| Operation   | I/O Complexity   | Notes                        |
|-------------|------------------|------------------------------|
| Search      | O(log_B N)       | Follow root-to-leaf path     |
| Insert      | O(log_B N)       | Split full nodes on the way  |
| Delete      | O(log_B N)       | Merge/redistribute as needed |
| Range query | O(log_B N + K/B) | K = number of results        |
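The search cost in the table can be made concrete with a toy in-memory model where each `Node` stands for one disk block; `io_count` tallies the block reads a real search would perform (the class and function names here are illustrative):

```python
from bisect import bisect_left
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                   # sorted keys in this block
    children: list = field(default_factory=list)  # empty list => leaf

def search(root, key):
    io_count = 0
    node = root
    while True:
        io_count += 1                  # fetching this node = one block I/O
        i = bisect_let = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, io_count
        if not node.children:          # reached a leaf: key is absent
            return False, io_count
        node = node.children[i]        # descend one level
```

Each iteration touches one block, so the I/O count equals the depth of the node where the search terminates.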
B+ Tree Variant
- All data stored in leaves (linked list of leaves for range scans)
- Internal nodes contain only keys and pointers
- Standard in databases (MySQL InnoDB, PostgreSQL) and file systems (NTFS, ext4, Btrfs)
Bulk Loading
- Sort data, then build tree bottom-up
- O(N/B) I/Os — much faster than N individual insertions
- Fill nodes to capacity (or partial for anticipated insertions)
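A minimal sketch of the bottom-up build, assuming already-sorted input. Separator keys are elided for brevity, so this shows only the O(N/B) packing of leaves and index levels (`B` and `bulk_load` are illustrative names, not a library API):

```python
def bulk_load(sorted_keys, B=4):
    # Pack leaves to capacity with one sequential pass over the sorted data.
    leaves = [sorted_keys[i:i + B] for i in range(0, len(sorted_keys), B)]
    # Build index levels bottom-up until a single root remains; a real
    # B+-tree node would also record each child's minimum key as a separator.
    level, height = leaves, 1
    while len(level) > 1:
        level = [level[i:i + B] for i in range(0, len(level), B)]
        height += 1
    return level[0], leaves, height
```

Every node except possibly the last on each level is filled to capacity, which is what makes the build a single scan rather than N root-to-leaf insertions.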
Buffer Trees
- Attach buffers of size O(M) to each internal node (fan-out O(M/B))
- Batch updates: amortize I/O cost over B operations
- Amortized O((1/B) * log_{M/B}(N/B)) I/Os per operation
- Used for external priority queues and graph algorithms
External BFS
BFS on a graph stored on disk with V vertices and E edges.
Challenge
Standard BFS accesses neighbors of each vertex — random access pattern causes O(V + E) I/Os in the worst case (one I/O per edge).
Munagala-Ranade Algorithm
- Sort adjacency list by vertex ID
- For each BFS level:
- Sort frontier vertices
- Scan adjacency list, extract neighbors of frontier vertices (merge-join)
- Remove already-visited vertices (sort and scan)
- I/Os: O(V + sort(E)) — the V term comes from one random access per vertex to fetch its adjacency list
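The per-level pattern can be simulated in memory: everything is done with sorts and scans (merge-joins), never random probes into the graph. The in-memory `visited` set below stands in for the sorted scan against the previous two BFS levels; all names are illustrative:

```python
def mr_bfs_levels(adj_sorted, source):
    """adj_sorted: dict vertex -> sorted neighbor list (stands in for the
    adjacency file sorted by vertex id). Returns the list of BFS levels."""
    visited = {source}
    frontier = [source]
    levels = [[source]]
    while frontier:
        # "Scan adjacency list, extract neighbors of frontier vertices":
        neighbors = sorted(w for v in frontier for w in adj_sorted.get(v, []))
        # "Remove already-visited vertices" by a sorted scan with dedup:
        next_frontier = []
        prev = None
        for w in neighbors:
            if w != prev and w not in visited:
                next_frontier.append(w)
            prev = w
        visited.update(next_frontier)
        frontier = next_frontier
        if frontier:
            levels.append(frontier)
    return levels
```

Because `neighbors` is sorted, duplicate elimination is a single linear scan, mirroring the external sort-and-scan step.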
Mehlhorn-Meyer Algorithm
- Precompute clustering of vertices to minimize I/Os
- O(sqrt(V * E / B) + sort(E)) I/Os — the clustering avoids the O(V) term, a major win for sparse graphs
Practical Approaches
- Graph partitioning to improve locality
- Semi-external model: V fits in memory, edges on disk
- BFS in O(V + E/B) I/Os (fetch each adjacency list once; track visited vertices in memory)
External Graph Algorithms
External DFS
- More difficult than BFS externally
- O((V + E/B) * log(V/B) + sort(E)) I/Os [Arge-Meyer-Zeh]
External MST
- External Kruskal: sort edges O(sort(E)), then scan with union-find
- O(sort(E)) I/Os when V fits in memory (semi-external)
External Shortest Paths
- Tournament tree-based priority queue for Dijkstra
- O(V + (E/B) * log_2(E/B)) I/Os [Kumar-Schwabe]
List Ranking
- Fundamental primitive: given a linked list, compute rank of each element
- External: O(sort(N)) I/Os via independent set removal
- Key building block for external tree algorithms
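An in-memory sketch of independent-set removal, assuming the list is given as a successor map. In the external version, every step below (building predecessors, picking the independent set, splicing, reinserting) becomes a constant number of sorts and scans over the element file; the coin-flip selection is one standard way to find a large independent set:

```python
import random

def list_rank(succ):
    """succ: dict node -> successor (tail maps to None). Returns dict
    node -> rank, i.e. distance from the head (head has rank 0)."""
    def solve(succ, w):
        pred = {v: u for u, v in succ.items() if v is not None}
        head = next(u for u in succ if u not in pred)
        if len(succ) <= 64:                  # base case: walk the list
            rank, u, r = {}, head, 0
            while u is not None:
                rank[u] = r
                r += w[u]                    # w[u] = original hops to succ[u]
                u = succ[u]
            return rank
        # Independent set by coin flips: keep v if it flips heads and its
        # predecessor flips tails (so no two adjacent); never remove the head.
        coin = {u: random.random() < 0.5 for u in succ}
        I = [v for v in succ if v != head and coin[v] and not coin[pred[v]]]
        saved = {}
        for v in I:                          # splice out the independent set
            p = pred[v]
            saved[v] = (p, w[p])             # remember the splice point
            w[p] = w[p] + w[v]               # bridge the weight over v
            succ[p] = succ[v]
            del succ[v], w[v]
        rank = solve(succ, w)                # recurse on the contracted list
        for v, (p, wp) in saved.items():     # reinsert removed nodes
            rank[v] = rank[p] + wp
        return rank
    return solve(dict(succ), {u: 1 for u in succ})
```

Each round removes an expected quarter of the nodes, giving geometrically shrinking subproblems.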
Cache-Oblivious Algorithms
Algorithms that achieve optimal I/O complexity without knowing M or B. Analyzed in the ideal cache model (optimal replacement policy, two-level memory).
Advantages
- Portable across machines with different cache parameters
- Optimal for all levels of the memory hierarchy simultaneously
- No parameter tuning required
Cache-Oblivious Scanning and Reversal
- Scanning N elements: O(N/B) I/Os automatically (sequential access)
- Array reversal: O(N/B) I/Os
Funnel Sort
Cache-oblivious sorting algorithm matching the external sorting bound.
K-Funnel (Merger)
A K-funnel merges K sorted sequences. Recursive structure:
K-funnel:
Split into sqrt(K) sub-funnels of size sqrt(K)
Connect via a sqrt(K)-merger at the top
Buffers between levels hold sqrt(K)^3 elements
Funnel Sort Algorithm
FunnelSort(A, n):
if n <= constant: sort directly
Split A into n^(1/3) segments of size n^(2/3)
Recursively sort each segment
Merge all segments using an n^(1/3)-funnel
I/O Complexity
- O((N/B) * log_{M/B}(N/B)) — optimal, matches external merge sort
- Achieved without knowing M or B
- Recursive structure ensures good cache behavior at every level
Van Emde Boas Layout
Cache-oblivious static search tree layout achieving O(log_B N) search I/Os.
Construction
Given a complete binary search tree of height h:
- Split at height h/2 into a top tree and O(sqrt(N)) bottom trees
- Recursively lay out the top tree, then each bottom tree
- Concatenate in memory: top tree followed by bottom trees in order
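The recursive construction can be made concrete for an implicit complete tree in heap numbering (root = 1, children of node i at 2i and 2i+1). The sketch below, with the illustrative name `veb_order`, lists node indices in the order they would be laid out in memory, splitting floor(h/2) levels into the top tree as described above:

```python
def veb_order(height):
    """Heap indices of a complete tree with `height` levels, in vEB order."""
    def layout(root, h):
        if h == 1:
            return [root]
        top_h = h // 2            # levels in the top tree
        bot_h = h - top_h         # levels in each bottom tree
        order = layout(root, top_h)
        # Leaves of the top tree in heap numbering, left to right:
        first = root << (top_h - 1)
        for leaf in range(first, first + (1 << (top_h - 1))):
            order += layout(2 * leaf, bot_h)      # left bottom tree
            order += layout(2 * leaf + 1, bot_h)  # right bottom tree
        return order
    return layout(1, height)
```

For height 4 this yields [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]: the 3-node top tree first, then the four 3-node bottom trees, each stored contiguously.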
Properties
- Search follows a root-to-leaf path of length O(log N)
- At each level of the recursive layout, the subtree fits in O(1) blocks once the subtree size drops below B
- Search cost: O(log_B N) I/Os — optimal for comparison-based search
- Static: no efficient updates; use cache-oblivious B-trees for dynamic version
Analysis Intuition
The path visits O(log N / log B) = O(log_B N) subtrees of size >= B. Each such subtree is stored contiguously, costing O(1) I/Os.
Cache-Oblivious B-Trees
Dynamic search structure supporting insertions and deletions.
Packed Memory Array (PMA)
- Store N elements in sorted order in an array of size O(N)
- Gaps between elements allow insertions
- Maintain density invariants at each level of a virtual tree
- Amortized O(log^2(N) / B) I/Os per update
Cache-Oblivious B-Tree (Bender-Demaine-Farach-Colton)
- Combine van Emde Boas layout tree (for search) with PMA (for ordered storage)
- Search: O(log_B N) I/Os
- Insert/Delete: O(log_B N + (log^2 N) / B) amortized I/Os
- Range query: O(log_B N + K/B) I/Os
Comparison with B-Trees
| Operation  | B-Tree           | CO B-Tree                   |
|------------|------------------|-----------------------------|
| Search     | O(log_B N)       | O(log_B N)                  |
| Insert     | O(log_B N)       | O(log_B N + log^2(N)/B) am. |
| Range      | O(log_B N + K/B) | O(log_B N + K/B)            |
| Knows M,B? | Yes              | No                          |
Cache-Oblivious Matrix Algorithms
Matrix Transpose
Recursive block decomposition:
Transpose(A, n):
if n <= base: transpose directly
Partition into 4 quadrants
Recursively transpose each
Swap off-diagonal quadrants
- O(n^2/B) I/Os for an n x n matrix when M >= B^2 (tall cache)
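A runnable version of the recursive transpose, for square matrices with power-of-two side stored as lists of lists. It works in place by always pairing a block with its mirror across the diagonal; the base-case size 8 is an arbitrary stand-in for "fits in a block":

```python
def co_transpose(A, r0=0, c0=0, n=None):
    """Transpose the n x n block at (r0, c0) together with its mirror block
    at (c0, r0); call with defaults to transpose all of A in place."""
    if n is None:
        n = len(A)                  # assumes len(A) is a power of two
    if n <= 8:                      # base case: swap block with mirror directly
        for i in range(n):
            # On diagonal blocks, only touch pairs above the diagonal.
            for j in range(i + 1 if r0 == c0 else 0, n):
                A[r0+i][c0+j], A[c0+j][r0+i] = A[c0+j][r0+i], A[r0+i][c0+j]
        return
    h = n // 2
    co_transpose(A, r0, c0, h)          # top-left quadrant
    co_transpose(A, r0+h, c0+h, h)      # bottom-right quadrant
    co_transpose(A, r0, c0+h, h)        # top-right (mirrors bottom-left)
    if r0 != c0:                        # off-diagonal blocks need all four
        co_transpose(A, r0+h, c0, h)
```

Diagonal blocks recurse on three quadrants only, because transposing the top-right quadrant automatically handles its bottom-left mirror.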
Matrix Multiplication
Recursive 8-way divide-and-conquer:
- O(n^3 / (B * sqrt(M))) I/Os for n x n matrices — matches the lower bound
- Automatically utilizes all cache levels
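A pure-Python sketch of the 8-way recursion, assuming power-of-two n. The I/O benefit comes from the recursion itself: at some depth the three operand blocks fit in cache together, and every level above reuses them; the base-case size 8 is an arbitrary illustrative cutoff:

```python
def co_matmul(X, Y):
    """Multiply two n x n matrices (lists of lists), n a power of two."""
    n = len(X)
    if n <= 8:  # base case: ordinary triple loop on a small block
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    def block(Mat, r, c):        # copy out an h x h quadrant
        return [row[c:c+h] for row in Mat[r:r+h]]
    def add(P, Q):               # elementwise sum of two blocks
        return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]
    X11, X12, X21, X22 = block(X,0,0), block(X,0,h), block(X,h,0), block(X,h,h)
    Y11, Y12, Y21, Y22 = block(Y,0,0), block(Y,0,h), block(Y,h,0), block(Y,h,h)
    # Eight half-size products, two per output quadrant:
    Z11 = add(co_matmul(X11, Y11), co_matmul(X12, Y21))
    Z12 = add(co_matmul(X11, Y12), co_matmul(X12, Y22))
    Z21 = add(co_matmul(X21, Y11), co_matmul(X22, Y21))
    Z22 = add(co_matmul(X21, Y12), co_matmul(X22, Y22))
    # Stitch the four quadrants back together row by row:
    return [a + b for a, b in zip(Z11, Z12)] + \
           [a + b for a, b in zip(Z21, Z22)]
```

The copying makes this a didactic sketch; a tuned implementation would recurse on index ranges instead of materializing quadrants.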
Practical Considerations
SSD vs HDD
- HDD: large B (sequential is crucial), high seek time
- SSD: smaller effective B, random reads nearly as fast as sequential
- External memory algorithms still beneficial on SSDs for very large data
Real-World Systems
- Database engines: B-tree indexes, external sort for ORDER BY
- MapReduce / Spark: external shuffle is essentially external sorting
- Operating systems: virtual memory paging follows I/O model principles
When to Use External Memory Algorithms
- Data size exceeds available RAM
- Multiple levels of cache hierarchy matter
- Working with databases, large-scale analytics, scientific computing
Summary of I/O Complexities
| Operation       | I/O Complexity                    |
|-----------------|-----------------------------------|
| Scan            | O(N/B)                            |
| Sort            | O((N/B) log_{M/B}(N/B))           |
| Search (B-tree) | O(log_B N)                        |
| BFS             | O(V + E/B) semi-external          |
| Permute         | O(min(N, sort(N)))                |
| Priority queue  | O((1/B) log_{M/B}(N/B)) amortized |
Summary
External memory algorithms are essential for processing data that exceeds RAM. The I/O model provides a clean framework for analyzing block transfer costs. B-trees and external merge sort are workhorses of database systems. Cache-oblivious algorithms elegantly achieve optimal performance without knowing hardware parameters. As data continues to grow faster than memory capacity, these techniques remain fundamental to systems design.