
External Memory Algorithms

Overview

When data exceeds main memory, the cost of disk/SSD I/O dominates computation time. External memory (EM) algorithms minimize the number of I/O operations (block transfers) between slow external storage and fast internal memory, achieving orders-of-magnitude speedups over algorithms designed for the RAM model.

The I/O Model (Aggarwal-Vitter)

Parameters

  • N: problem size (number of elements)
  • M: internal memory size (number of elements that fit in RAM)
  • B: block size (number of elements transferred per I/O operation)
  • Assume M >= 2B (at least two blocks fit in memory)

Cost Metric

Count the number of I/O operations (block transfers). CPU computation is free.

Fundamental Bounds

  • Scanning N elements: O(N/B) I/Os — read sequentially
  • Sorting N elements: O((N/B) * log_{M/B}(N/B)) I/Os
  • Searching among N elements: O(log_B N) I/Os (with B-tree)
  • Permuting N elements: O(min(N, (N/B) * log_{M/B}(N/B))) I/Os

The sorting bound is the most important: it shows that external sorting is cheaper than arbitrary permutation, and many problems reduce to sorting in the EM model.

Tall Cache Assumption

M >= B^{1+epsilon} for some constant epsilon > 0. Required for many cache-oblivious results. Typically satisfied in practice (M >> B^2).

External Sorting

External Merge Sort

Phase 1 (Run Formation):
    Read M elements at a time into memory
    Sort each chunk internally (e.g., quicksort)
    Write sorted runs of length M back to disk
    Result: ceil(N/M) sorted runs

Phase 2 (Multiway Merge):
    Merge (M/B - 1) runs at a time:
        Allocate one input buffer (B elements) per run
        Allocate one output buffer (B elements)
        Repeatedly extract minimum element across all runs
        Refill input buffers as needed
    Repeat until single sorted run remains
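The two phases can be sketched in Python, with temporary files standing in for on-disk runs and `heapq.merge` playing the role of the multiway merger (a simplified sketch, not a tuned implementation; `memory_size` mirrors M from the model):

```python
import heapq
import tempfile

def external_merge_sort(items, memory_size):
    """Two-phase external merge sort sketch. Temporary files stand in
    for on-disk runs; memory_size plays the role of M from the text."""
    # Phase 1: run formation. Read up to M elements, sort them in
    # memory, and write each sorted run to its own temporary file.
    runs = []
    buf = []

    def flush_run():
        if buf:
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(f"{x}\n" for x in sorted(buf))
            run.seek(0)
            runs.append(run)
            buf.clear()

    for x in items:
        buf.append(x)
        if len(buf) == memory_size:
            flush_run()
    flush_run()

    # Phase 2: multiway merge. heapq.merge stands in for the tournament
    # tree; a real implementation would merge at most M/B - 1 runs at a
    # time, with one B-element buffer per run plus one output buffer.
    streams = [(int(line) for line in run) for run in runs]
    return list(heapq.merge(*streams))
```

A real implementation would also cap the merge fan-out at M/B - 1 and recurse over passes when the run count exceeds it.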

I/O Analysis

  • Run formation: O(N/B) I/Os (scan once)
  • Each merge pass: O(N/B) I/Os (scan all data)
  • Number of passes: O(log_{M/B}(N/M))
  • Total: O((N/B) * log_{M/B}(N/B)) — matches the lower bound

Practical Optimizations

  • Replacement-selection: create runs of expected length 2M (double-length runs)
  • Forecasting: prefetch blocks from runs that will be needed soon
  • SSD awareness: exploit parallelism in flash storage
  • Tournament trees for efficient k-way merging

Example

N = 10^9 elements, M = 10^7 (100 MB), B = 10^4 (page size):

  • Runs: 100 sorted runs of 10M elements each
  • Fan-out: ~999 (merge all in one pass)
  • Total I/Os: run formation plus one merge pass read every block once each, ~2 * 10^5 = 200K block reads (roughly double that if writes are also counted)
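As a sanity check, the arithmetic behind these figures (counting block reads only; counting writes roughly doubles the transfer total):

```python
import math

N = 10**9   # elements
M = 10**7   # elements that fit in memory
B = 10**4   # elements per block

runs = math.ceil(N / M)           # sorted runs after phase 1
fan_out = M // B - 1              # runs mergeable per pass
passes = math.ceil(math.log(runs, fan_out))  # merge passes needed

# 1 run-formation pass + 1 merge pass, each reading N/B blocks.
read_ios = (1 + passes) * (N // B)
print(runs, fan_out, passes, read_ios)  # 100 999 1 200000
```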

B-Trees

The fundamental external memory search structure.

Structure

  • Balanced tree with branching factor O(B)
  • Each node occupies one disk block
  • Internal node: up to B keys and B+1 child pointers
  • Leaf node: up to B key-value pairs
  • All leaves at the same depth: O(log_B N)

Operations

| Operation   | I/O Complexity   | Notes                        |
|-------------|------------------|------------------------------|
| Search      | O(log_B N)       | Follow root-to-leaf path     |
| Insert      | O(log_B N)       | Split full nodes on the way  |
| Delete      | O(log_B N)       | Merge/redistribute as needed |
| Range query | O(log_B N + K/B) | K = number of results        |

B+ Tree Variant

  • All data stored in leaves (linked list of leaves for range scans)
  • Internal nodes contain only keys and pointers
  • Standard in databases (MySQL InnoDB, PostgreSQL) and file systems (NTFS, ext4, Btrfs)

Bulk Loading

  • Sort data, then build tree bottom-up
  • O(N/B) I/Os — much faster than N individual insertions
  • Fill nodes to capacity (or partial for anticipated insertions)
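A minimal sketch of the bottom-up build, with nodes as plain `(separator_keys, children)` tuples purely for illustration (real nodes would be fixed-size disk blocks):

```python
def bulk_load(sorted_keys, B):
    """Bottom-up B+-tree bulk load sketch: pack sorted keys into leaves
    of up to B entries, then build index levels until one root remains.
    One scan over the input, so O(N/B) block writes."""
    # Leaf level: contiguous chunks of the sorted input.
    leaves = [sorted_keys[i:i + B] for i in range(0, len(sorted_keys), B)]
    level = leaves
    while len(level) > 1:
        # Group up to B children per internal node; the separator key is
        # the smallest key under each child after the first.
        parents = []
        for i in range(0, len(level), B):
            group = level[i:i + B]
            seps = [first_key(c) for c in group[1:]]
            parents.append((seps, group))
        level = parents
    return level[0], leaves

def first_key(node):
    # Descend to the leftmost leaf entry under a node.
    while isinstance(node, tuple):
        node = node[1][0]
    return node[0]
```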

Buffer Trees

  • Attach buffers of size B to each internal node
  • Batch updates: amortize I/O cost over B operations
  • Amortized O((1/B) * log_{M/B}(N/B)) I/Os per operation
  • Used for external priority queues and graph algorithms

External BFS

BFS on a graph stored on disk with V vertices and E edges.

Challenge

Standard BFS visits adjacency lists in an unpredictable order. In the worst case every vertex and every edge incurs its own block transfer, for O(V + E) I/Os with no benefit from blocking.

Munagala-Ranade Algorithm

  1. Sort adjacency list by vertex ID
  2. For each BFS level:
    • Sort frontier vertices
    • Scan adjacency list, extract neighbors of frontier vertices (merge-join)
    • Remove already-visited vertices (sort and scan)
  3. I/Os: O(V + sort(E)): one unstructured access per vertex to fetch its adjacency list, plus sorting/scanning costs per level

Mehlhorn-Meyer Algorithm

  • Precompute clustering of vertices to minimize I/Os
  • O(sqrt(V * E / B) + sort(E)) I/Os, beating Munagala-Ranade on sparse graphs

Practical Approaches

  • Graph partitioning to improve locality
  • Semi-external model: V fits in memory, edges on disk
    • BFS in O(V + E/B) I/Os (fetch each adjacency list once, track visited vertices in memory)
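The semi-external idea can be sketched as follows, where `edge_stream` is a hypothetical callable returning a fresh sequential pass over the edge file. This naive version rescans the edges once per BFS level; a real implementation would keep per-vertex adjacency offsets in memory so each adjacency list is fetched only once.

```python
def semi_external_bfs(num_vertices, edge_stream, source):
    """Semi-external BFS sketch: the O(V) level array stays in memory,
    while edges are only ever scanned sequentially. edge_stream() stands
    in for a sequential read of a directed edge file of (u, v) pairs."""
    level = [None] * num_vertices
    level[source] = 0
    frontier = {source}
    depth = 0
    while frontier:
        next_frontier = set()
        # One sequential pass over the edge "file" per BFS level.
        for u, v in edge_stream():
            if u in frontier and level[v] is None:
                level[v] = depth + 1
                next_frontier.add(v)
        frontier = next_frontier
        depth += 1
    return level
```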

External Graph Algorithms

External DFS

  • More difficult than BFS externally
  • O((V + E/B) * log(V/B) + sort(E)) I/Os [Arge-Meyer-Zeh]

External MST

  • External Kruskal: sort edges O(sort(E)), then scan with union-find
  • O(sort(E)) I/Os when V fits in memory (semi-external)

External Shortest Paths

  • Tournament tree-based priority queue for Dijkstra
  • O((V + E/B) * log_{M/B}(V/B)) I/Os

List Ranking

  • Fundamental primitive: given a linked list, compute rank of each element
  • External: O(sort(N)) I/Os via independent set removal
  • Key building block for external tree algorithms

Cache-Oblivious Algorithms

Algorithms that achieve optimal I/O complexity without knowing M or B. Analyzed in the ideal cache model (optimal replacement policy, two-level memory).

Advantages

  • Portable across machines with different cache parameters
  • Optimal for all levels of the memory hierarchy simultaneously
  • No parameter tuning required

Cache-Oblivious Scanning and Reversal

  • Scanning N elements: O(N/B) I/Os automatically (sequential access)
  • Array reversal: O(N/B) I/Os

Funnel Sort

Cache-oblivious sorting algorithm matching the external sorting bound.

K-Funnel (Merger)

A K-funnel merges K sorted sequences. Recursive structure:

K-funnel:
    Split into sqrt(K) sub-funnels of size sqrt(K)
    Connect via a sqrt(K)-merger at the top
    Buffers between levels hold sqrt(K)^3 elements

Funnel Sort Algorithm

FunnelSort(A, n):
    if n <= constant: sort directly
    Split A into n^(1/3) segments of size n^(2/3)
    Recursively sort each segment
    Merge all segments using an n^(1/3)-funnel
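The recursion shape can be sketched in Python, with `heapq.merge` standing in for the k-funnel; this shows the segment structure but deliberately omits the funnel's cache-oblivious buffer management, which is where the I/O guarantee actually comes from:

```python
import heapq

def funnel_sort(a):
    """Structural sketch of funnel sort: split into ~n^(1/3) segments of
    size ~n^(2/3), sort each recursively, then k-way merge them."""
    n = len(a)
    if n <= 8:  # small base case, sorted directly
        return sorted(a)
    seg = max(1, round(n ** (2 / 3)))
    segments = [funnel_sort(a[i:i + seg]) for i in range(0, n, seg)]
    # heapq.merge stands in for the n^(1/3)-funnel merger.
    return list(heapq.merge(*segments))
```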

I/O Complexity

  • O((N/B) * log_{M/B}(N/B)) — optimal, matches external merge sort
  • Achieved without knowing M or B
  • Recursive structure ensures good cache behavior at every level

Van Emde Boas Layout

Cache-oblivious static search tree layout achieving O(log_B N) search I/Os.

Construction

Given a complete binary search tree of height h:

  1. Split at height h/2 into a top tree and O(sqrt(N)) bottom trees
  2. Recursively lay out the top tree, then each bottom tree
  3. Concatenate in memory: top tree followed by bottom trees in order
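The construction can be sketched by emitting the heap indices (root = 1) of a complete binary tree in van Emde Boas order, splitting the height in half at each step:

```python
def veb_layout(height):
    """Return heap indices (root = 1) of a complete binary tree of the
    given height, in van Emde Boas layout order."""
    out = []

    def bfs_leaves(root, h):
        # Bottom level of the height-h subtree rooted at `root`.
        nodes = [root]
        for _ in range(h - 1):
            nodes = [c for n in nodes for c in (2 * n, 2 * n + 1)]
        return nodes

    def lay(root, h):
        if h == 1:
            out.append(root)
            return
        top_h = h // 2
        bot_h = h - top_h
        lay(root, top_h)  # top tree first
        # Bottom trees hang off the children of the top tree's bottom
        # level, laid out left to right.
        for leaf in bfs_leaves(root, top_h):
            lay(2 * leaf, bot_h)
            lay(2 * leaf + 1, bot_h)

    lay(1, height)
    return out
```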

Properties

  • Search follows a root-to-leaf path of length O(log N)
  • At each level of the recursive layout, the subtree fits in O(1) blocks once the subtree size drops below B
  • Search cost: O(log_B N) I/Os — optimal for comparison-based search
  • Static: no efficient updates; use cache-oblivious B-trees for dynamic version

Analysis Intuition

The path visits O(log N / log B) = O(log_B N) subtrees of size >= B. Each such subtree is stored contiguously, costing O(1) I/Os.

Cache-Oblivious B-Trees

Dynamic search structure supporting insertions and deletions.

Packed Memory Array (PMA)

  • Store N elements in sorted order in an array of size O(N)
  • Gaps between elements allow insertions
  • Maintain density invariants at each level of a virtual tree
  • Amortized O(log^2(N) / B) I/Os per update

Cache-Oblivious B-Tree (Bender-Demaine-Farach-Colton)

  • Combine van Emde Boas layout tree (for search) with PMA (for ordered storage)
  • Search: O(log_B N) I/Os
  • Insert/Delete: O(log_B N + (log^2 N) / B) amortized I/Os
  • Range query: O(log_B N + K/B) I/Os

Comparison with B-Trees

| Operation  | B-Tree           | CO B-Tree                   |
|------------|------------------|-----------------------------|
| Search     | O(log_B N)       | O(log_B N)                  |
| Insert     | O(log_B N)       | O(log_B N + log^2(N)/B) am. |
| Range      | O(log_B N + K/B) | O(log_B N + K/B)            |
| Knows M,B? | Yes              | No                          |

Cache-Oblivious Matrix Algorithms

Matrix Transpose

Recursive block decomposition:

Transpose(A, n):
    if n <= base: transpose directly
    Partition into 4 quadrants
    Recursively transpose each
    Swap off-diagonal quadrants

  • O(N/B) I/Os when M >= B^2 (tall cache)
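A runnable version of the recursion above, using plain lists of lists; halving the larger dimension keeps base-case tiles nearly square, so each tile has good locality at any unknown block size (`base` is an illustrative cutoff):

```python
def co_transpose(a, b, r0=0, c0=0, rows=None, cols=None, base=16):
    """Cache-oblivious out-of-place transpose sketch: writes a^T into b.
    Recursively halves the larger dimension of the current tile."""
    if rows is None:
        rows, cols = len(a), len(a[0])
    if rows <= base and cols <= base:
        # Small tile: transpose directly.
        for i in range(r0, r0 + rows):
            for j in range(c0, c0 + cols):
                b[j][i] = a[i][j]
    elif rows >= cols:
        half = rows // 2
        co_transpose(a, b, r0, c0, half, cols, base)
        co_transpose(a, b, r0 + half, c0, rows - half, cols, base)
    else:
        half = cols // 2
        co_transpose(a, b, r0, c0, rows, half, base)
        co_transpose(a, b, r0, c0 + half, rows, cols - half, base)
```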

Matrix Multiplication

Recursive 8-way divide-and-conquer:

  • O(N^3 / (B * sqrt(M))) I/Os — matches the lower bound
  • Automatically utilizes all cache levels

Practical Considerations

SSD vs HDD

  • HDD: large B (sequential is crucial), high seek time
  • SSD: smaller effective B, random reads nearly as fast as sequential
  • External memory algorithms still beneficial on SSDs for very large data

Real-World Systems

  • Database engines: B-tree indexes, external sort for ORDER BY
  • MapReduce / Spark: external shuffle is essentially external sorting
  • Operating systems: virtual memory paging follows I/O model principles

When to Use External Memory Algorithms

  • Data size exceeds available RAM
  • Multiple levels of cache hierarchy matter
  • Working with databases, large-scale analytics, scientific computing

Summary of I/O Complexities

| Operation       | I/O Complexity                    |
|-----------------|-----------------------------------|
| Scan            | O(N/B)                            |
| Sort            | O((N/B) log_{M/B}(N/B))           |
| Search (B-tree) | O(log_B N)                        |
| BFS             | O(V + E/B) semi-external          |
| Permute         | O(min(N, sort(N)))                |
| Priority Queue  | O((1/B) log_{M/B}(N/B)) amortized |

Summary

External memory algorithms are essential for processing data that exceeds RAM. The I/O model provides a clean framework for analyzing block transfer costs. B-trees and external merge sort are workhorses of database systems. Cache-oblivious algorithms elegantly achieve optimal performance without knowing hardware parameters. As data continues to grow faster than memory capacity, these techniques remain fundamental to systems design.