B-Trees
B-trees are balanced search trees designed for disk-based storage. They minimize I/O by maximizing the number of keys per node (matching disk block size). Every major database and file system uses B-trees or their variants.
B-Tree Definition
A B-tree of order m (minimum degree t = ⌈m/2⌉) satisfies:
- Every node has at most m children (m-1 keys).
- Every non-root internal node has at least ⌈m/2⌉ children.
- The root has at least 2 children (unless it's a leaf).
- All leaves are at the same depth (perfectly balanced).
- Keys within a node are sorted.
Example (order 5, min degree t = 3):
[30 | 60]
/ | \
[10|20] [40|50] [70|80|90]
Each internal node has between t and 2t-1 keys and between t+1 and 2t children.
Why B-Trees?
Disk access: Reading one byte vs reading 4 KB takes almost the same time (disk seek dominates). So we should read large blocks and pack many keys into each node.
A B-tree node typically = one disk page (4 KB - 16 KB). A single node can hold hundreds of keys.
Height comparison (1 million keys):
| Tree | Branching | Height | Disk reads per search | |---|---|---|---| | Binary BST | 2 | ~20 | ~20 | | B-tree (order 101) | 101 | ~3 | ~3 | | B-tree (order 1001) | 1001 | ~2 | ~2 |
Search
Like binary search tree, but at each node, perform a binary search among the node's keys to find the appropriate child pointer.
Search(node, key):
Find i such that key[i-1] < key ≤ key[i] (binary search within node)
If key == key[i]: found
If node is leaf: not found
Else: recurse on child[i]
Time: O(log_t n) disk reads × O(t) comparisons per node = O(t · log_t n) = O(log n).
Insertion
- Search for the appropriate leaf.
- Insert the key into the leaf (maintaining sorted order).
- If the leaf exceeds m-1 keys, split it.
Splitting
When a node has m keys (overflow):
- Split into two nodes at the median key.
- The median key is promoted to the parent.
- If the parent overflows, split it too (may propagate to root).
- If the root splits, create a new root (tree grows taller by 1).
Before split (order 5, node has 5 keys):
[10 | 20 | 30 | 40 | 50]
After split:
Parent gets: 30
Left child: [10 | 20]
Right child: [40 | 50]
Proactive splitting: Split full nodes on the way down during insertion. This avoids the need to propagate splits upward (single-pass insertion).
Deletion
More complex than insertion. Three cases:
Case 1: Key is in a leaf → simply remove it. If the leaf underflows (fewer than t-1 keys), rebalance.
Case 2: Key is in an internal node → replace with predecessor (or successor) from a child, then delete the predecessor from the child.
Case 3: Rebalancing when a node has too few keys:
- Borrow from a sibling (rotate through parent) if sibling has spare keys.
- Merge with a sibling (pull down a key from parent) if sibling is at minimum.
- If parent underflows, repeat upward.
Deletion may cause the tree to shrink (root merge → height decreases by 1).
B+ Trees
The most common variant used in databases and file systems.
Differences from B-Trees
- All data in leaves: Internal nodes contain only keys (routing information), not data. This means more keys per internal node → lower height.
- Leaves are linked: A doubly linked list connects all leaves. Enables efficient range queries and sequential scans.
- Keys are duplicated: Internal node keys are copies of leaf keys (specifically, the smallest key of each subtree).
Internal: [30 | 60]
/ | \
Leaves: [10,20]↔[30,40,50]↔[60,70,80]
(linked list)
Range Queries
To find all keys in range [a, b]:
- Search for a → find the leaf containing a.
- Scan right along the leaf linked list until key > b.
Time: O(log n + k) where k is the number of results. Very efficient for range scans.
B+ Tree in Databases
A typical database B+ tree index:
- Page size: 8-16 KB
- Keys per node: 100-500 (depending on key size)
- Height: 3-4 levels for billions of rows
- Root cached in memory: First 1-2 levels often cached → 1-2 disk reads per lookup
B* Trees
A variant where nodes are kept at least 2/3 full (vs 1/2 for B-trees).
Approach: When a node is full, try to redistribute keys to a sibling before splitting. Split two full nodes into three 2/3-full nodes.
Benefit: Better space utilization. Fewer splits. Shallower tree.
Database Indexing Applications
Clustered vs Non-Clustered Index
Clustered index: Table rows are physically stored in index order. Only one per table. Range scans are very fast (sequential disk reads).
Non-clustered index: Index stores pointers to rows. Rows stored in arbitrary order. Multiple per table. Point lookups are fine; range scans may require random I/O.
Composite Index
Index on multiple columns: (lastname, firstname). Supports:
- Queries on (lastname) ✓
- Queries on (lastname, firstname) ✓
- Queries on (firstname) alone ✗ (leftmost prefix rule)
Covering Index
An index that includes all columns needed by a query. The query can be answered from the index alone without accessing the table (index-only scan).
Disk-Oriented Considerations
Page Size
Larger pages → more keys per node → shallower tree. But:
- More data to read per I/O
- More wasted space for partially filled nodes
- Typical: 4-16 KB (matching OS page size)
Fanout Optimization
Fanout = number of children per node. Higher fanout → shallower tree → fewer disk reads.
Maximize fanout: Use small keys. Prefix compression. Suffix truncation in internal nodes (only store enough of the key to route correctly).
Buffer Pool
Database buffer pool caches B-tree pages in memory:
- Root and upper levels: always cached.
- Hot leaf pages: cached based on access frequency.
- Hit rate: Typically 95-99% for well-sized buffer pools.
Bulk Loading
Building a B+ tree by inserting keys one at a time is slow (random I/O).
Bulk load: Sort all keys first, then build the tree bottom-up (fill leaves sequentially, build internal nodes). Much faster — sequential I/O throughout.
Write Optimization
B-trees have poor write amplification (each insert/update may require reading and writing multiple pages).
Write-optimized alternatives: LSM trees, Bε-trees, fractal trees. Trade read performance for better write throughput.
Concurrency
Latch Coupling (Crabbing)
To traverse the tree safely with concurrent writers:
- Latch (lock) the parent.
- Latch the child.
- If the child is "safe" (won't split or merge), release the parent latch.
- Continue down.
Optimistic: Assume no structural changes. Latch only the leaf. If a split is needed, restart with pessimistic locking.
B-link Trees (Lehman-Yao)
Add right-link pointers between siblings at each level. If a node is being split while you're traversing, follow the right link to the correct node. Enables higher concurrency (no need to hold parent latch during child operations).
Applications in CS
- Relational databases: MySQL (InnoDB), PostgreSQL, SQL Server, Oracle — all use B+ tree indexes.
- Key-value stores: BerkeleyDB, LMDB, SQLite — B-tree based.
- File systems: NTFS, HFS+, ext4 (extents), Btrfs, XFS, ReiserFS — use B-trees for directory and extent management.
- Document databases: MongoDB uses B-trees (WiredTiger engine) and LSM trees.
- Full-text search: Inverted indexes often stored as B-tree variants.
- Embedded databases: SQLite (the most widely deployed database engine) uses B-trees.