External Memory Algorithms
Overview
When data exceeds main memory, the cost of disk/SSD I/O dominates computation time. External memory (EM) algorithms minimize the number of I/O operations (block transfers) between slow external storage and fast internal memory, achieving orders-of-magnitude speedups over algorithms designed for the RAM model.
The I/O Model (Aggarwal-Vitter)
Parameters
- N: problem size (number of elements)
- M: internal memory size (number of elements that fit in RAM)
- B: block size (number of elements transferred per I/O operation)
- Assume M >= 2B (at least two blocks fit in memory)
Cost Metric
Count the number of I/O operations (block transfers). CPU computation is free.
Fundamental Bounds
- Scanning N elements: O(N/B) I/Os — read sequentially
- Sorting N elements: O((N/B) * log_{M/B}(N/B)) I/Os
- Searching among N elements: O(log_B N) I/Os (with B-tree)
- Permuting N elements: O(min(N, (N/B) * log_{M/B}(N/B))) I/Os
The sorting bound is the most important: it shows that external sorting is cheaper than arbitrary permutation, and many problems reduce to sorting in the EM model.
Tall Cache Assumption
M >= B^{1+epsilon} for some constant epsilon > 0. Required for many cache-oblivious results. Typically satisfied in practice (M >> B^2).
External Sorting
External Merge Sort
Phase 1 (Run Formation):
Read M elements at a time into memory
Sort each chunk internally (e.g., quicksort)
Write sorted runs of length M back to disk
Result: ceil(N/M) sorted runs
Phase 2 (Multiway Merge):
Merge (M/B - 1) runs at a time:
Allocate one input buffer (B elements) per run
Allocate one output buffer (B elements)
Repeatedly extract minimum element across all runs
Refill input buffers as needed
Repeat until single sorted run remains
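The two phases above can be sketched in Python. Here `heapq.merge` plays the role of the multiway merger with its lazily refilled input buffers, and `mem_size` (measured in lines) stands in for M; the file handling is illustrative, assuming newline-terminated integer records, not a production implementation.

```python
import heapq
import os
import tempfile

def external_merge_sort(input_path, output_path, mem_size=10_000):
    run_paths = []
    # Phase 1: run formation -- sort memory-sized chunks, write sorted runs.
    with open(input_path) as f:
        while True:
            chunk = [line for _, line in zip(range(mem_size), f)]
            if not chunk:
                break
            chunk.sort(key=int)  # internal sort of one memory-load
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    # Phase 2: one multiway merge (assumes the run count fits the fan-out).
    files = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        # heapq.merge streams the k-way merge, pulling from each run lazily.
        for line in heapq.merge(*files, key=int):
            out.write(line)
    for f in files:
        f.close()
    for p in run_paths:
        os.remove(p)
```

With 1,000 records and `mem_size=128` this forms 8 runs and merges them in a single pass.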
I/O Analysis
- Run formation: O(N/B) I/Os (scan once)
- Each merge pass: O(N/B) I/Os (scan all data)
- Number of passes: O(log_{M/B}(N/M))
- Total: O((N/B) * log_{M/B}(N/B)) — matches the lower bound
Practical Optimizations
- Replacement-selection: create runs of expected length 2M (double-length runs)
- Forecasting: prefetch blocks from runs that will be needed soon
- SSD awareness: exploit parallelism in flash storage
- Tournament trees for efficient k-way merging
Example
N = 10^9 elements, M = 10^7 (100 MB), B = 10^4 (page size):
- Runs: 100 sorted runs of 10M elements each
- Fan-out: ~999 (merge all in one pass)
- Total I/Os: two passes over the data (run formation + one merge), each reading and writing N/B = 10^5 blocks, so ~4 * 10^5 = 400K block transfers
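A quick back-of-the-envelope check of the example, counting both reads and writes per pass (pure arithmetic, no disk involved):

```python
import math

N, M, B = 10**9, 10**7, 10**4
runs = math.ceil(N / M)                 # sorted runs after phase 1
fan_out = M // B - 1                    # runs merged per pass (one buffer reserved for output)
merge_passes = math.ceil(math.log(runs, fan_out))
blocks_per_pass = 2 * (N // B)          # each pass reads and writes every block
total = (1 + merge_passes) * blocks_per_pass
```

Since `runs = 100` is below the fan-out of 999, a single merge pass suffices.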
B-Trees
The fundamental external memory search structure.
Structure
- Balanced tree with branching factor O(B)
- Each node occupies one disk block
- Internal node: up to B keys and B+1 child pointers
- Leaf node: up to B key-value pairs
- All leaves at the same depth: O(log_B N)
Operations
| Operation   | I/O Complexity   | Notes                        |
|-------------|------------------|------------------------------|
| Search      | O(log_B N)       | Follow root-to-leaf path     |
| Insert      | O(log_B N)       | Split full nodes on the way  |
| Delete      | O(log_B N)       | Merge/redistribute as needed |
| Range query | O(log_B N + K/B) | K = number of results        |
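The search cost in the table can be made concrete with a toy in-memory model where each `Node` stands for one disk block; `io_count` tallies the block reads a real search would perform (the class and function names here are illustrative):

```python
from bisect import bisect_left
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                   # sorted keys in this block
    children: list = field(default_factory=list)  # empty list => leaf

def search(root, key):
    io_count = 0
    node = root
    while True:
        io_count += 1                  # fetching this node = one block I/O
        i = bisect_let = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, io_count
        if not node.children:          # reached a leaf: key is absent
            return False, io_count
        node = node.children[i]        # descend one level
```

Each iteration touches one block, so the I/O count equals the depth of the node where the search terminates.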
B+ Tree Variant
- All data stored in leaves (linked list of leaves for range scans)
- Internal nodes contain only keys and pointers
- Standard in databases (MySQL InnoDB, PostgreSQL) and file systems (NTFS, ext4, Btrfs)
Bulk Loading
- Sort data, then build tree bottom-up
- O(N/B) I/Os — much faster than N individual insertions
- Fill nodes to capacity (or partial for anticipated insertions)
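A minimal sketch of the bottom-up build, assuming already-sorted input. Separator keys are elided for brevity, so this shows only the O(N/B) packing of leaves and index levels (`B` and `bulk_load` are illustrative names, not a library API):

```python
def bulk_load(sorted_keys, B=4):
    # Pack leaves to capacity with one sequential pass over the sorted data.
    leaves = [sorted_keys[i:i + B] for i in range(0, len(sorted_keys), B)]
    # Build index levels bottom-up until a single root remains; a real
    # B+-tree node would also record each child's minimum key as a separator.
    level, height = leaves, 1
    while len(level) > 1:
        level = [level[i:i + B] for i in range(0, len(level), B)]
        height += 1
    return level[0], leaves, height
```

Every node except possibly the last on each level is filled to capacity, which is what makes the build a single scan rather than N root-to-leaf insertions.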
Buffer Trees
- Attach buffers of size O(M) to each internal node (fan-out O(M/B))
- Batch updates: amortize I/O cost over B operations
- Amortized O((1/B) * log_{M/B}(N/B)) I/Os per operation
- Used for external priority queues and graph algorithms
External BFS
BFS on a graph stored on disk with V vertices and E edges.
Challenge
Standard BFS accesses neighbors of each vertex — random access pattern causes O(V + E) I/Os in the worst case (one I/O per edge).
Munagala-Ranade Algorithm
- Sort adjacency list by vertex ID
- For each BFS level:
- Sort frontier vertices
- Scan adjacency list, extract neighbors of frontier vertices (merge-join)
- Remove already-visited vertices (sort and scan)
- I/Os: O(V + sort(E)) — the V term comes from one random access per vertex to fetch its adjacency list
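The per-level pattern can be simulated in memory: everything is done with sorts and scans (merge-joins), never random probes into the graph. The in-memory `visited` set below stands in for the sorted scan against the previous two BFS levels; all names are illustrative:

```python
def mr_bfs_levels(adj_sorted, source):
    """adj_sorted: dict vertex -> sorted neighbor list (stands in for the
    adjacency file sorted by vertex id). Returns the list of BFS levels."""
    visited = {source}
    frontier = [source]
    levels = [[source]]
    while frontier:
        # "Scan adjacency list, extract neighbors of frontier vertices":
        neighbors = sorted(w for v in frontier for w in adj_sorted.get(v, []))
        # "Remove already-visited vertices" by a sorted scan with dedup:
        next_frontier = []
        prev = None
        for w in neighbors:
            if w != prev and w not in visited:
                next_frontier.append(w)
            prev = w
        visited.update(next_frontier)
        frontier = next_frontier
        if frontier:
            levels.append(frontier)
    return levels
```

Because `neighbors` is sorted, duplicate elimination is a single linear scan, mirroring the external sort-and-scan step.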
Mehlhorn-Meyer Algorithm
- Precompute clustering of vertices to minimize I/Os
- O(sqrt(V * E / B) + sort(E)) I/Os — the clustering avoids the O(V) term, a major win for sparse graphs
Practical Approaches
- Graph partitioning to improve locality
- Semi-external model: V fits in memory, edges on disk
- BFS in O(V + E/B) I/Os (fetch each adjacency list once; track visited vertices in memory)
External Graph Algorithms
External DFS
- More difficult than BFS externally
- O((V + E/B) * log(V/B) + sort(E)) I/Os [Arge-Meyer-Zeh]
External MST
- External Kruskal: sort edges O(sort(E)), then scan with union-find
- O(sort(E)) I/Os when V fits in memory (semi-external)
External Shortest Paths
- Tournament tree-based priority queue for Dijkstra
- O(V + (E/B) * log_2(E/B)) I/Os [Kumar-Schwabe]
List Ranking
- Fundamental primitive: given a linked list, compute rank of each element
- External: O(sort(N)) I/Os via independent set removal
- Key building block for external tree algorithms
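An in-memory sketch of independent-set removal, assuming the list is given as a successor map. In the external version, every step below (building predecessors, picking the independent set, splicing, reinserting) becomes a constant number of sorts and scans over the element file; the coin-flip selection is one standard way to find a large independent set:

```python
import random

def list_rank(succ):
    """succ: dict node -> successor (tail maps to None). Returns dict
    node -> rank, i.e. distance from the head (head has rank 0)."""
    def solve(succ, w):
        pred = {v: u for u, v in succ.items() if v is not None}
        head = next(u for u in succ if u not in pred)
        if len(succ) <= 64:                  # base case: walk the list
            rank, u, r = {}, head, 0
            while u is not None:
                rank[u] = r
                r += w[u]                    # w[u] = original hops to succ[u]
                u = succ[u]
            return rank
        # Independent set by coin flips: keep v if it flips heads and its
        # predecessor flips tails (so no two adjacent); never remove the head.
        coin = {u: random.random() < 0.5 for u in succ}
        I = [v for v in succ if v != head and coin[v] and not coin[pred[v]]]
        saved = {}
        for v in I:                          # splice out the independent set
            p = pred[v]
            saved[v] = (p, w[p])             # remember the splice point
            w[p] = w[p] + w[v]               # bridge the weight over v
            succ[p] = succ[v]
            del succ[v], w[v]
        rank = solve(succ, w)                # recurse on the contracted list
        for v, (p, wp) in saved.items():     # reinsert removed nodes
            rank[v] = rank[p] + wp
        return rank
    return solve(dict(succ), {u: 1 for u in succ})
```

Each round removes an expected quarter of the nodes, giving geometrically shrinking subproblems.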
Cache-Oblivious Algorithms
Algorithms that achieve optimal I/O complexity without knowing M or B. Analyzed in the ideal cache model (optimal replacement policy, two-level memory).
Advantages
- Portable across machines with different cache parameters
- Optimal for all levels of the memory hierarchy simultaneously
- No parameter tuning required
Cache-Oblivious Scanning and Reversal
- Scanning N elements: O(N/B) I/Os automatically (sequential access)
- Array reversal: O(N/B) I/Os
Funnel Sort
Cache-oblivious sorting algorithm matching the external sorting bound.
K-Funnel (Merger)
A K-funnel merges K sorted sequences. Recursive structure:
K-funnel:
Split into sqrt(K) sub-funnels of size sqrt(K)
Connect via a sqrt(K)-merger at the top
Buffers between levels hold sqrt(K)^3 elements
Funnel Sort Algorithm
FunnelSort(A, n):
if n <= constant: sort directly
Split A into n^(1/3) segments of size n^(2/3)
Recursively sort each segment
Merge all segments using an n^(1/3)-funnel
I/O Complexity
- O((N/B) * log_{M/B}(N/B)) — optimal, matches external merge sort
- Achieved without knowing M or B
- Recursive structure ensures good cache behavior at every level
Van Emde Boas Layout
Cache-oblivious static search tree layout achieving O(log_B N) search I/Os.
Construction
Given a complete binary search tree of height h:
- Split at height h/2 into a top tree and O(sqrt(N)) bottom trees
- Recursively lay out the top tree, then each bottom tree
- Concatenate in memory: top tree followed by bottom trees in order
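The recursive construction can be made concrete for an implicit complete tree in heap numbering (root = 1, children of node i at 2i and 2i+1). The sketch below, with the illustrative name `veb_order`, lists node indices in the order they would be laid out in memory, splitting floor(h/2) levels into the top tree as described above:

```python
def veb_order(height):
    """Heap indices of a complete tree with `height` levels, in vEB order."""
    def layout(root, h):
        if h == 1:
            return [root]
        top_h = h // 2            # levels in the top tree
        bot_h = h - top_h         # levels in each bottom tree
        order = layout(root, top_h)
        # Leaves of the top tree in heap numbering, left to right:
        first = root << (top_h - 1)
        for leaf in range(first, first + (1 << (top_h - 1))):
            order += layout(2 * leaf, bot_h)      # left bottom tree
            order += layout(2 * leaf + 1, bot_h)  # right bottom tree
        return order
    return layout(1, height)
```

For height 4 this yields [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]: the 3-node top tree first, then the four 3-node bottom trees, each stored contiguously.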
Properties
- Search follows a root-to-leaf path of length O(log N)
- At each level of the recursive layout, the subtree fits in O(1) blocks once the subtree size drops below B
- Search cost: O(log_B N) I/Os — optimal for comparison-based search
- Static: no efficient updates; use cache-oblivious B-trees for dynamic version
Analysis Intuition
The path visits O(log N / log B) = O(log_B N) subtrees of size >= B. Each such subtree is stored contiguously, costing O(1) I/Os.
Cache-Oblivious B-Trees
Dynamic search structure supporting insertions and deletions.
Packed Memory Array (PMA)
- Store N elements in sorted order in an array of size O(N)
- Gaps between elements allow insertions
- Maintain density invariants at each level of a virtual tree
- Amortized O(log^2(N) / B) I/Os per update
Cache-Oblivious B-Tree (Bender-Demaine-Farach-Colton)
- Combine van Emde Boas layout tree (for search) with PMA (for ordered storage)
- Search: O(log_B N) I/Os
- Insert/Delete: O(log_B N + (log^2 N) / B) amortized I/Os
- Range query: O(log_B N + K/B) I/Os
Comparison with B-Trees
| Operation  | B-Tree           | CO B-Tree                   |
|------------|------------------|-----------------------------|
| Search     | O(log_B N)       | O(log_B N)                  |
| Insert     | O(log_B N)       | O(log_B N + log^2(N)/B) am. |
| Range      | O(log_B N + K/B) | O(log_B N + K/B)            |
| Knows M,B? | Yes              | No                          |
Cache-Oblivious Matrix Algorithms
Matrix Transpose
Recursive block decomposition:
Transpose(A, n):
if n <= base: transpose directly
Partition into 4 quadrants
Recursively transpose each
Swap off-diagonal quadrants
- O(n^2/B) I/Os for an n x n matrix when M >= B^2 (tall cache)
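A runnable version of the recursive transpose, for square matrices with power-of-two side stored as lists of lists. It works in place by always pairing a block with its mirror across the diagonal; the base-case size 8 is an arbitrary stand-in for "fits in a block":

```python
def co_transpose(A, r0=0, c0=0, n=None):
    """Transpose the n x n block at (r0, c0) together with its mirror block
    at (c0, r0); call with defaults to transpose all of A in place."""
    if n is None:
        n = len(A)                  # assumes len(A) is a power of two
    if n <= 8:                      # base case: swap block with mirror directly
        for i in range(n):
            # On diagonal blocks, only touch pairs above the diagonal.
            for j in range(i + 1 if r0 == c0 else 0, n):
                A[r0+i][c0+j], A[c0+j][r0+i] = A[c0+j][r0+i], A[r0+i][c0+j]
        return
    h = n // 2
    co_transpose(A, r0, c0, h)          # top-left quadrant
    co_transpose(A, r0+h, c0+h, h)      # bottom-right quadrant
    co_transpose(A, r0, c0+h, h)        # top-right (mirrors bottom-left)
    if r0 != c0:                        # off-diagonal blocks need all four
        co_transpose(A, r0+h, c0, h)
```

Diagonal blocks recurse on three quadrants only, because transposing the top-right quadrant automatically handles its bottom-left mirror.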
Matrix Multiplication
Recursive 8-way divide-and-conquer:
- O(n^3 / (B * sqrt(M))) I/Os for n x n matrices — matches the lower bound
- Automatically utilizes all cache levels
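A pure-Python sketch of the 8-way recursion, assuming power-of-two n. The I/O benefit comes from the recursion itself: at some depth the three operand blocks fit in cache together, and every level above reuses them; the base-case size 8 is an arbitrary illustrative cutoff:

```python
def co_matmul(X, Y):
    """Multiply two n x n matrices (lists of lists), n a power of two."""
    n = len(X)
    if n <= 8:  # base case: ordinary triple loop on a small block
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    def block(Mat, r, c):        # copy out an h x h quadrant
        return [row[c:c+h] for row in Mat[r:r+h]]
    def add(P, Q):               # elementwise sum of two blocks
        return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]
    X11, X12, X21, X22 = block(X,0,0), block(X,0,h), block(X,h,0), block(X,h,h)
    Y11, Y12, Y21, Y22 = block(Y,0,0), block(Y,0,h), block(Y,h,0), block(Y,h,h)
    # Eight half-size products, two per output quadrant:
    Z11 = add(co_matmul(X11, Y11), co_matmul(X12, Y21))
    Z12 = add(co_matmul(X11, Y12), co_matmul(X12, Y22))
    Z21 = add(co_matmul(X21, Y11), co_matmul(X22, Y21))
    Z22 = add(co_matmul(X21, Y12), co_matmul(X22, Y22))
    # Stitch the four quadrants back together row by row:
    return [a + b for a, b in zip(Z11, Z12)] + \
           [a + b for a, b in zip(Z21, Z22)]
```

The copying makes this a didactic sketch; a tuned implementation would recurse on index ranges instead of materializing quadrants.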
Practical Considerations
SSD vs HDD
- HDD: large B (sequential is crucial), high seek time
- SSD: smaller effective B, random reads nearly as fast as sequential
- External memory algorithms still beneficial on SSDs for very large data
Real-World Systems
- Database engines: B-tree indexes, external sort for ORDER BY
- MapReduce / Spark: external shuffle is essentially external sorting
- Operating systems: virtual memory paging follows I/O model principles
When to Use External Memory Algorithms
- Data size exceeds available RAM
- Multiple levels of cache hierarchy matter
- Working with databases, large-scale analytics, scientific computing
Summary of I/O Complexities
| Operation       | I/O Complexity                    |
|-----------------|-----------------------------------|
| Scan            | O(N/B)                            |
| Sort            | O((N/B) log_{M/B}(N/B))           |
| Search (B-tree) | O(log_B N)                        |
| BFS             | O(V + E/B) semi-external          |
| Permute         | O(min(N, sort(N)))                |
| Priority queue  | O((1/B) log_{M/B}(N/B)) amortized |
Summary
External memory algorithms are essential for processing data that exceeds RAM. The I/O model provides a clean framework for analyzing block transfer costs. B-trees and external merge sort are workhorses of database systems. Cache-oblivious algorithms elegantly achieve optimal performance without knowing hardware parameters. As data continues to grow faster than memory capacity, these techniques remain fundamental to systems design.