Disjoint Sets (Union-Find)
The Union-Find (disjoint set union, DSU) data structure tracks a collection of non-overlapping sets and supports two operations efficiently: finding which set an element belongs to, and merging two sets.
Operations
| Operation | Description | |---|---| | make_set(x) | Create a new set containing only x | | find(x) | Return the representative (root) of x's set | | union(x, y) | Merge the sets containing x and y |
Naive Implementation
Represent each set as a tree. Each element points to its parent. The root points to itself and serves as the set's representative.
Initial: {1} {2} {3} {4} {5}
After union(1,2), union(3,4), union(2,3):
1
/ \ 5
2 3
|
4
Find: Follow parent pointers to the root. O(n) worst case (degenerate tree). Union: Make one root point to the other. O(n) worst case (includes find).
Optimization 1: Union by Rank/Size
Union by Rank
Each root maintains a rank (upper bound on tree height). On union, attach the smaller-rank tree under the larger-rank root.
PROCEDURE UNION(x, y)
rx ← FIND(x)
ry ← FIND(y)
IF rx = ry THEN RETURN
IF rank[rx] < rank[ry] THEN
parent[rx] ← ry
ELSE IF rank[rx] > rank[ry] THEN
parent[ry] ← rx
ELSE
parent[ry] ← rx
rank[rx] ← rank[rx] + 1
Tree height: O(log n). Find and union are O(log n).
Union by Size
Attach the smaller-size tree under the larger. Same O(log n) bound.
Optimization 2: Path Compression
During find, make every node on the path point directly to the root.
FUNCTION FIND(x)
IF parent[x] ≠ x THEN
parent[x] ← FIND(parent[x]) // Path compression
RETURN parent[x]
Before find(4):
1 ← 2 ← 3 ← 4
After find(4) (with path compression):
1
/|\
2 3 4
All nodes now point directly to root. Future operations on these nodes are O(1).
Combined: Union by Rank + Path Compression
With both optimizations, the amortized time per operation is O(α(n)), where α is the inverse Ackermann function.
Inverse Ackermann Function
α(n) is an incredibly slowly growing function:
α(1) = 0
α(2) = 1
α(2^(2^(2^65536))) ≈ 5
For all practical purposes, α(n) ≤ 4 for any input that fits in the observable universe.
Effectively O(1) per operation.
Proof Sketch
Tarjan (1975) proved the α(n) bound using a potential function argument. The key insight: path compression flattens trees rapidly, and union by rank keeps trees balanced, so the combined effect allows only α(n) amortized work.
This bound is tight — Tarjan also proved that Ω(α(n)) is a lower bound for any pointer-based union-find.
Complete Implementation
STRUCTURE UnionFind
parent : array of integers
rank : array of integers
count : integer // number of components
FUNCTION NEW_UNION_FIND(n)
parent ← [0, 1, 2, ..., n-1] // each element is its own parent
rank ← array of n zeros
count ← n
RETURN new UnionFind(parent, rank, count)
FUNCTION FIND(x)
IF parent[x] ≠ x THEN
parent[x] ← FIND(parent[x]) // path compression
RETURN parent[x]
FUNCTION UNION(x, y)
rx ← FIND(x)
ry ← FIND(y)
IF rx = ry THEN RETURN false // already same set
IF rank[rx] < rank[ry] THEN
parent[rx] ← ry
ELSE IF rank[rx] > rank[ry] THEN
parent[ry] ← rx
ELSE
parent[ry] ← rx
rank[rx] ← rank[rx] + 1
count ← count - 1
RETURN true
FUNCTION CONNECTED(x, y)
RETURN FIND(x) = FIND(y)
FUNCTION COMPONENTS()
RETURN count
Applications
Kruskal's Algorithm
The classic application. Build MST by processing edges in weight order, adding an edge iff it connects different components.
FUNCTION KRUSKAL(n, edges)
SORT edges by weight
uf ← NEW_UNION_FIND(n)
total_weight ← 0
FOR EACH (w, u, v) IN edges DO
IF UNION(uf, u, v) THEN
total_weight ← total_weight + w
RETURN total_weight
Time: O(E log E + E · α(V)) ≈ O(E log E).
Connected Components
Process all edges with union. After processing, find(u) == find(v) iff u and v are in the same component.
Dynamic Connectivity (Offline)
Given a sequence of edge additions and connectivity queries, process all queries offline using union-find. O(n · α(n)) total.
Note: Union-find only supports adding edges, not removing them. For fully dynamic connectivity (add + remove), more complex data structures are needed (link-cut trees, Euler tour trees).
Percolation
Model: n×n grid, each cell open/closed randomly. Does a path exist from top to bottom through open cells?
Union-find tracks connected components of open cells. When a component spans top to bottom, percolation occurs.
Phase transition: For an n×n grid, percolation threshold ≈ 0.5927... (for site percolation on a square lattice). Sharp transition from "almost never" to "almost always" percolates.
Equivalence Classes
Any equivalence relation can be represented with union-find. Given pairs (a, b) meaning "a is equivalent to b," union(a, b) for each pair. Then find determines the equivalence class.
Examples: Alias analysis in compilers. Image segmentation. Social network communities.
Maze Generation
Randomly connect adjacent cells using union-find. An edge is added iff it connects different components (preventing cycles). Result: a spanning tree of the grid graph = a perfect maze.
Variants
Weighted Union-Find
Track the "distance" or "weight" from each element to its root.
find(x) returns (root, weight_to_root)
union(x, y, w) means weight(x) - weight(y) = w
Application: Detecting contradictions in a system of difference constraints. Competitive programming problems.
Persistent Union-Find
Support queries on past versions (after k-th union). Using persistent arrays or link-cut trees.
Rollback Union-Find
Support undo of the last union. Use union by rank (without path compression) and maintain a stack of changes.
Useful in divide-and-conquer algorithms on graphs.
Applications in CS
- Graph algorithms: Kruskal's MST, connected components, cycle detection.
- Image processing: Connected component labeling (flood fill). Segmenting regions.
- Compilers: Type unification in type inference (Hindley-Milner). Alias analysis. Register coalescing.
- Network connectivity: Determine if two nodes can communicate. Track network partitions.
- Physics simulation: Grouping rigid bodies in contact for stable stacking.
- Database query optimization: Equivalence classes for join predicates.
- Social networks: Community detection. Friend-of-friend grouping.
- Competitive programming: One of the most commonly used data structures in contests.