B-Trees

B-trees are balanced search trees designed for disk-based storage. They minimize I/O by maximizing the number of keys per node (matching disk block size). Every major database and file system uses B-trees or their variants.

B-Tree Definition

A B-tree of order m (so the minimum number of children of a non-root internal node is t = ⌈m/2⌉) satisfies:

  1. Every node has at most m children (m-1 keys).
  2. Every non-root internal node has at least ⌈m/2⌉ children.
  3. The root has at least 2 children (unless it's a leaf).
  4. All leaves are at the same depth (perfectly balanced).
  5. Keys within a node are sorted.

Example (order 5, min degree t = 3):

              [30 | 60]
             /    |    \
    [10|20]  [40|50]  [70|80|90]

Each non-root node has between ⌈m/2⌉ − 1 and m − 1 keys, and each non-root internal node has between ⌈m/2⌉ and m children (here, 2-4 keys and 3-5 children).
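The invariants above can be sketched as runnable checks. This is a minimal illustration, not a real implementation: the list-based `Node` class and the `check_node` helper are assumptions introduced here.

```python
# Illustrative list-based node: `keys` sorted; `children` empty for leaves.
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def check_node(node, m, is_root=False):
    """Check the order-m invariants for a single node."""
    assert len(node.keys) <= m - 1                     # rule 1: at most m-1 keys
    assert node.keys == sorted(node.keys)              # rule 5: keys sorted
    if node.children:                                  # internal node
        assert len(node.children) == len(node.keys) + 1
        min_children = 2 if is_root else (m + 1) // 2  # rules 2-3; (m+1)//2 = ceil(m/2)
        assert len(node.children) >= min_children

# The order-5 example above:
root = Node([30, 60], [Node([10, 20]), Node([40, 50]), Node([70, 80, 90])])
check_node(root, m=5, is_root=True)
for child in root.children:
    check_node(child, m=5)
```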

Why B-Trees?

Disk access: Reading one byte vs reading 4 KB takes almost the same time (disk seek dominates). So we should read large blocks and pack many keys into each node.

A B-tree node typically corresponds to one disk page (4-16 KB), so a single node can hold hundreds of keys.

Height comparison (1 million keys):

| Tree | Branching | Height | Disk reads per search |
|---|---|---|---|
| Binary BST | 2 | ~20 | ~20 |
| B-tree (order 101) | 101 | ~3 | ~3 |
| B-tree (order 1001) | 1001 | ~2 | ~2 |
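The heights in the table follow from a simple logarithm. A quick back-of-envelope check (the `approx_height` helper is illustrative and ignores partially filled nodes):

```python
import math

def approx_height(n_keys, branching):
    """Smallest h such that branching**h >= n_keys (ignores partial fill)."""
    return math.ceil(math.log(n_keys) / math.log(branching))

heights = {b: approx_height(1_000_000, b) for b in (2, 101, 1001)}
# heights -> {2: 20, 101: 3, 1001: 2}, matching the table above
```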

Search

Searching works as in a binary search tree, except that at each node we binary-search among the node's keys to find either the key itself or the appropriate child pointer.

Search(node, key):
    Find the smallest i such that key ≤ keys[i]  (binary search within node)
    If i < number of keys and keys[i] == key: found
    If node is a leaf: not found
    Else: recurse on children[i]

Time: O(log_t n) disk reads. With binary search inside each node, comparisons cost O(log t) per node, for O(log_t n · log t) = O(log n) total; a linear scan within nodes would cost O(t · log_t n) instead.
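A sketch of this search in Python, using `bisect` for the within-node binary search. The list-based `Node` class is an assumption for illustration, not a disk-backed implementation.

```python
from bisect import bisect_left

# Illustrative list-based node: `keys` sorted; `children` empty for leaves.
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def btree_search(node, key):
    """Binary search within each node, then descend one level per miss."""
    i = bisect_left(node.keys, key)        # smallest i with keys[i] >= key
    if i < len(node.keys) and node.keys[i] == key:
        return True                        # found in this node
    if not node.children:
        return False                       # leaf: key is absent
    return btree_search(node.children[i], key)

# The order-5 example tree from the diagram above:
root = Node([30, 60], [Node([10, 20]), Node([40, 50]), Node([70, 80, 90])])
assert btree_search(root, 50)
assert not btree_search(root, 55)
```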

Insertion

  1. Search for the appropriate leaf.
  2. Insert the key into the leaf (maintaining sorted order).
  3. If the leaf exceeds m-1 keys, split it.

Splitting

When a node has m keys (overflow):

  1. Split into two nodes at the median key.
  2. The median key is promoted to the parent.
  3. If the parent overflows, split it too (may propagate to root).
  4. If the root splits, create a new root (tree grows taller by 1).

Before split (order 5, node has 5 keys):
[10 | 20 | 30 | 40 | 50]

After split:
Parent gets: 30
Left child:  [10 | 20]
Right child: [40 | 50]
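The split above can be sketched as follows, using list-based nodes and a leaf-only example. This is an illustration under those assumptions, not a production implementation.

```python
# Illustrative list-based node: `children` empty for leaves.
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def split_child(parent, i):
    """Split the overflowing parent.children[i] at its median key."""
    child = parent.children[i]
    mid = len(child.keys) // 2
    median = child.keys[mid]
    left = Node(child.keys[:mid], child.children[:mid + 1])
    right = Node(child.keys[mid + 1:], child.children[mid + 1:])
    parent.keys.insert(i, median)          # median promoted into the parent
    parent.children[i:i + 1] = [left, right]

# The example above: an order-5 leaf overflowed to 5 keys.
parent = Node([60], [Node([10, 20, 30, 40, 50]), Node([70, 80])])
split_child(parent, 0)
assert parent.keys == [30, 60]
assert [c.keys for c in parent.children] == [[10, 20], [40, 50], [70, 80]]
```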

Proactive splitting: Split full nodes on the way down during insertion. This avoids the need to propagate splits upward (single-pass insertion).

Deletion

More complex than insertion. Three cases:

Case 1: Key is in a leaf → simply remove it. If the leaf underflows (fewer than t-1 keys), rebalance.

Case 2: Key is in an internal node → replace with predecessor (or successor) from a child, then delete the predecessor from the child.

Case 3: Rebalancing when a node has too few keys:

  • Borrow from a sibling (rotate through parent) if sibling has spare keys.
  • Merge with a sibling (pull down a key from parent) if sibling is at minimum.
  • If parent underflows, repeat upward.

Deletion may cause the tree to shrink (root merge → height decreases by 1).
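The "borrow from a sibling" rotation in case 3 can be sketched at the leaf level. The list-based nodes and the `borrow_from_left` helper are illustrative assumptions.

```python
# Illustrative list-based node: `children` empty for leaves.
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def borrow_from_left(parent, i):
    """children[i] underflowed: rotate one key through the parent separator."""
    left, node = parent.children[i - 1], parent.children[i]
    node.keys.insert(0, parent.keys[i - 1])   # separator key moves down
    parent.keys[i - 1] = left.keys.pop()      # sibling's largest key moves up

parent = Node([30], [Node([10, 20]), Node([40])])
borrow_from_left(parent, 1)
assert parent.keys == [20]
assert [c.keys for c in parent.children] == [[10], [30, 40]]
```

Note that the rotation goes *through* the parent: the separator descends into the underflowing node, and the sibling's extreme key replaces it, so the sort order of the whole tree is preserved.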

B+ Trees

The most common variant used in databases and file systems.

Differences from B-Trees

  1. All data in leaves: Internal nodes contain only keys (routing information), not data. This means more keys per internal node → lower height.
  2. Leaves are linked: A doubly linked list connects all leaves. Enables efficient range queries and sequential scans.
  3. Keys are duplicated: Internal node keys are copies of leaf keys (specifically, the smallest key of each subtree).

Internal:     [30 |  60]
             /    |    \
Leaves: [10,20]↔[30,40,50]↔[60,70,80]
         (linked list)

Range Queries

To find all keys in range [a, b]:

  1. Search for a → find the leaf containing a.
  2. Scan right along the leaf linked list until key > b.

Time: O(log n + k) where k is the number of results. Very efficient for range scans.
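The two-step range scan can be sketched over the linked leaves. The `Leaf` class is illustrative, and the initial descent to the leaf containing `a` is elided here (the scan simply starts from the leftmost leaf).

```python
from bisect import bisect_left

# Illustrative B+ tree leaf: sorted keys plus a right-sibling link.
class Leaf:
    def __init__(self, keys, nxt=None):
        self.keys, self.next = keys, nxt

def range_query(start_leaf, a, b):
    """Collect all keys in [a, b] by scanning right along the leaf chain."""
    out, leaf = [], start_leaf
    while leaf is not None:
        for k in leaf.keys[bisect_left(leaf.keys, a):]:  # skip keys below a
            if k > b:
                return out                               # passed b: done
            out.append(k)
        leaf = leaf.next                                 # follow the link
    return out

# Leaves from the B+ tree diagram above, linked left to right:
l3 = Leaf([60, 70, 80])
l2 = Leaf([30, 40, 50], l3)
l1 = Leaf([10, 20], l2)
assert range_query(l1, 25, 65) == [30, 40, 50, 60]
```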

B+ Tree in Databases

A typical database B+ tree index:

  • Page size: 8-16 KB
  • Keys per node: 100-500 (depending on key size)
  • Height: 3-4 levels for billions of rows
  • Root cached in memory: First 1-2 levels often cached → 1-2 disk reads per lookup

B* Trees

A variant where nodes are kept at least 2/3 full (vs 1/2 for B-trees).

Approach: When a node is full, try to redistribute keys to a sibling before splitting. Split two full nodes into three 2/3-full nodes.

Benefit: Better space utilization. Fewer splits. Shallower tree.

Database Indexing Applications

Clustered vs Non-Clustered Index

Clustered index: Table rows are physically stored in index order. Only one per table. Range scans are very fast (sequential disk reads).

Non-clustered index: Index stores pointers to rows. Rows stored in arbitrary order. Multiple per table. Point lookups are fine; range scans may require random I/O.

Composite Index

Index on multiple columns: (lastname, firstname). Supports:

  • Queries on (lastname) ✓
  • Queries on (lastname, firstname) ✓
  • Queries on (firstname) alone ✗ (leftmost prefix rule)
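The leftmost prefix rule falls out of how tuples sort. A sketch with a plain sorted list standing in for the index (the sample rows and sentinel are assumptions):

```python
from bisect import bisect_left, bisect_right

# A composite index keeps entries sorted by the full tuple (lastname, firstname).
index = sorted([("Doe", "Jane"), ("Doe", "John"), ("Kim", "Ana"), ("Lee", "Bo")])

# Leftmost prefix (lastname only): matches are contiguous -> one range scan.
lo = bisect_left(index, ("Doe",))                 # ("Doe",) sorts before ("Doe", x)
hi = bisect_right(index, ("Doe", chr(0x10FFFF)))  # sentinel above any firstname
assert index[lo:hi] == [("Doe", "Jane"), ("Doe", "John")]

# firstname alone: matches are scattered through the sort order, so the
# index cannot narrow the search; this degenerates to a full scan.
assert [r for r in index if r[1] == "Ana"] == [("Kim", "Ana")]
```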

Covering Index

An index that includes all columns needed by a query. The query can be answered from the index alone without accessing the table (index-only scan).

Disk-Oriented Considerations

Page Size

Larger pages → more keys per node → shallower tree. But:

  • More data to read per I/O
  • More wasted space for partially filled nodes
  • Typical: 4-16 KB (matching OS page size)

Fanout Optimization

Fanout = number of children per node. Higher fanout → shallower tree → fewer disk reads.

Maximize fanout: Use small keys. Prefix compression. Suffix truncation in internal nodes (only store enough of the key to route correctly).
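A rough feel for why key size drives fanout (all sizes here are illustrative assumptions; real pages also carry slot arrays, per-entry headers, and padding):

```python
def approx_fanout(page_bytes, key_bytes, ptr_bytes=8, header_bytes=64):
    """Rough children per internal node: each entry costs one key + one pointer."""
    return (page_bytes - header_bytes) // (key_bytes + ptr_bytes)

assert approx_fanout(8192, key_bytes=16) == 338  # 16-byte keys on an 8 KB page
assert approx_fanout(8192, key_bytes=4) == 677   # truncated/compressed 4-byte keys
```

Shrinking keys from 16 to 4 bytes roughly doubles the fanout here, which is exactly what prefix compression and suffix truncation buy.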

Buffer Pool

Database buffer pool caches B-tree pages in memory:

  • Root and upper levels: always cached.
  • Hot leaf pages: cached based on access frequency.
  • Hit rate: Typically 95-99% for well-sized buffer pools.

Bulk Loading

Building a B+ tree by inserting keys one at a time is slow (random I/O).

Bulk load: Sort all keys first, then build the tree bottom-up (fill leaves sequentially, build internal nodes). Much faster — sequential I/O throughout.
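The bottom-up build can be sketched with dict-based nodes. This illustrates only the level-by-level construction (routing by each child's smallest key, B+ style); real bulk loaders also leave slack in pages and write them sequentially.

```python
def bulk_load(sorted_keys, fanout):
    """Build a B+ tree bottom-up: fill leaves left to right, then build each
    internal level from the smallest key of every group of children."""
    level = [{"keys": sorted_keys[i:i + fanout]}
             for i in range(0, len(sorted_keys), fanout)]
    height = 1
    while len(level) > 1:
        level = [{"keys": [c["keys"][0] for c in level[i:i + fanout]],
                  "children": level[i:i + fanout]}
                 for i in range(0, len(level), fanout)]
        height += 1
    return level[0], height

root, height = bulk_load(list(range(1000)), fanout=10)
assert height == 3                 # 100 leaves -> 10 internal nodes -> 1 root
assert root["keys"][:3] == [0, 100, 200]
```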

Write Optimization

B-trees have poor write amplification (each insert/update may require reading and writing multiple pages).

Write-optimized alternatives: LSM trees, Bε-trees, fractal trees. Trade read performance for better write throughput.

Concurrency

Latch Coupling (Crabbing)

To traverse the tree safely with concurrent writers:

  1. Latch (lock) the parent.
  2. Latch the child.
  3. If the child is "safe" (won't split or merge), release the parent latch.
  4. Continue down.
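The four steps above can be sketched with per-node locks. This is illustrative only: the `Node` layout is assumed, and routing is simplified to always descend into the first child.

```python
import threading

# Illustrative node with a per-node latch; real code routes by key comparison.
class Node:
    def __init__(self, keys, children=None, m=5):
        self.keys, self.children, self.m = keys, children or [], m
        self.latch = threading.Lock()

def safe_for_insert(node):
    """'Safe' means one more key cannot force this node to split."""
    return len(node.keys) < node.m - 1

def descend_for_insert(root):
    """Latch coupling (steps 1-4): keep ancestor latches only while the
    current child might still split into them."""
    root.latch.acquire()               # step 1: latch the parent (root)
    node, held = root, [root]
    while node.children:
        child = node.children[0]       # simplified routing
        child.latch.acquire()          # step 2: latch the child
        if safe_for_insert(child):     # step 3: child can't split upward,
            for ancestor in held:      #         so ancestors are released
                ancestor.latch.release()
            held = []
        held.append(child)
        node = child                   # step 4: continue down
    return held                        # caller inserts, then releases these

root = Node([30], [Node([10, 20]), Node([40, 50])])
held = descend_for_insert(root)
assert held == [root.children[0]]      # only the safe leaf remains latched
for n in held:
    n.latch.release()
```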

Optimistic: Assume no structural changes. Latch only the leaf. If a split is needed, restart with pessimistic locking.

B-link Trees

Add right-link pointers between siblings at each level. If a node is being split while you're traversing, follow the right link to reach the correct node. This enables higher concurrency (no need to hold the parent latch during child operations).

Applications in CS

  • Relational databases: MySQL (InnoDB), PostgreSQL, SQL Server, Oracle — all use B+ tree indexes.
  • Key-value stores: BerkeleyDB, LMDB, SQLite — B-tree based.
  • File systems: NTFS, HFS+, ext4 (extents), Btrfs, XFS, ReiserFS — use B-trees for directory and extent management.
  • Document databases: MongoDB uses B-trees (WiredTiger engine) and LSM trees.
  • Full-text search: Inverted indexes often stored as B-tree variants.
  • Embedded databases: SQLite (the most widely deployed database engine) uses B-trees.