4 min read
On this page

Disjoint Sets (Union-Find)

The Union-Find (disjoint set union, DSU) data structure tracks a collection of non-overlapping sets and supports two operations efficiently: finding which set an element belongs to, and merging two sets.

Operations

| Operation | Description | |---|---| | make_set(x) | Create a new set containing only x | | find(x) | Return the representative (root) of x's set | | union(x, y) | Merge the sets containing x and y |

Naive Implementation

Represent each set as a tree. Each element points to its parent. The root points to itself and serves as the set's representative.

Initial: {1} {2} {3} {4} {5}

After union(1,2), union(3,4), union(2,3):
    1
   / \        5
  2   3
      |
      4

Find: Follow parent pointers to the root. O(n) worst case (degenerate tree). Union: Make one root point to the other. O(n) worst case (includes find).

Optimization 1: Union by Rank/Size

Union by Rank

Each root maintains a rank (upper bound on tree height). On union, attach the smaller-rank tree under the larger-rank root.

PROCEDURE UNION(x, y)
    rx ← FIND(x)
    ry ← FIND(y)
    IF rx = ry THEN RETURN

    IF rank[rx] < rank[ry] THEN
        parent[rx] ← ry
    ELSE IF rank[rx] > rank[ry] THEN
        parent[ry] ← rx
    ELSE
        parent[ry] ← rx
        rank[rx] ← rank[rx] + 1

Tree height: O(log n). Find and union are O(log n).

Union by Size

Attach the smaller-size tree under the larger. Same O(log n) bound.

Optimization 2: Path Compression

During find, make every node on the path point directly to the root.

FUNCTION FIND(x)
    IF parent[x] ≠ x THEN
        parent[x] ← FIND(parent[x])  // Path compression
    RETURN parent[x]

Before find(4):

1 ← 2 ← 3 ← 4

After find(4) (with path compression):

  1
 /|\
2  3  4

All nodes now point directly to root. Future operations on these nodes are O(1).

Combined: Union by Rank + Path Compression

With both optimizations, the amortized time per operation is O(α(n)), where α is the inverse Ackermann function.

Inverse Ackermann Function

α(n) is an incredibly slowly growing function:

α(1) = 0
α(2) = 1
α(2^(2^(2^65536))) ≈ 5

For all practical purposes, α(n) ≤ 4 for any input that fits in the observable universe.

Effectively O(1) per operation.

Proof Sketch

Tarjan (1975) proved the α(n) bound using a potential function argument. The key insight: path compression flattens trees rapidly, and union by rank keeps trees balanced, so the combined effect allows only α(n) amortized work.

This bound is tight — Tarjan also proved that Ω(α(n)) is a lower bound for any pointer-based union-find.

Complete Implementation

STRUCTURE UnionFind
    parent : array of integers
    rank   : array of integers
    count  : integer  // number of components

FUNCTION NEW_UNION_FIND(n)
    parent ← [0, 1, 2, ..., n-1]   // each element is its own parent
    rank ← array of n zeros
    count ← n
    RETURN new UnionFind(parent, rank, count)

FUNCTION FIND(x)
    IF parent[x] ≠ x THEN
        parent[x] ← FIND(parent[x])  // path compression
    RETURN parent[x]

FUNCTION UNION(x, y)
    rx ← FIND(x)
    ry ← FIND(y)
    IF rx = ry THEN RETURN false  // already same set

    IF rank[rx] < rank[ry] THEN
        parent[rx] ← ry
    ELSE IF rank[rx] > rank[ry] THEN
        parent[ry] ← rx
    ELSE
        parent[ry] ← rx
        rank[rx] ← rank[rx] + 1
    count ← count - 1
    RETURN true

FUNCTION CONNECTED(x, y)
    RETURN FIND(x) = FIND(y)

FUNCTION COMPONENTS()
    RETURN count

Applications

Kruskal's Algorithm

The classic application. Build MST by processing edges in weight order, adding an edge iff it connects different components.

FUNCTION KRUSKAL(n, edges)
    SORT edges by weight
    uf ← NEW_UNION_FIND(n)
    total_weight ← 0

    FOR EACH (w, u, v) IN edges DO
        IF UNION(uf, u, v) THEN
            total_weight ← total_weight + w
    RETURN total_weight

Time: O(E log E + E · α(V)) ≈ O(E log E).

Connected Components

Process all edges with union. After processing, find(u) == find(v) iff u and v are in the same component.

Dynamic Connectivity (Offline)

Given a sequence of edge additions and connectivity queries, process all queries offline using union-find. O(n · α(n)) total.

Note: Union-find only supports adding edges, not removing them. For fully dynamic connectivity (add + remove), more complex data structures are needed (link-cut trees, Euler tour trees).

Percolation

Model: n×n grid, each cell open/closed randomly. Does a path exist from top to bottom through open cells?

Union-find tracks connected components of open cells. When a component spans top to bottom, percolation occurs.

Phase transition: For an n×n grid, percolation threshold ≈ 0.5927... (for site percolation on a square lattice). Sharp transition from "almost never" to "almost always" percolates.

Equivalence Classes

Any equivalence relation can be represented with union-find. Given pairs (a, b) meaning "a is equivalent to b," union(a, b) for each pair. Then find determines the equivalence class.

Examples: Alias analysis in compilers. Image segmentation. Social network communities.

Maze Generation

Randomly connect adjacent cells using union-find. An edge is added iff it connects different components (preventing cycles). Result: a spanning tree of the grid graph = a perfect maze.

Variants

Weighted Union-Find

Track the "distance" or "weight" from each element to its root.

find(x) returns (root, weight_to_root)
union(x, y, w) means weight(x) - weight(y) = w

Application: Detecting contradictions in a system of difference constraints. Competitive programming problems.

Persistent Union-Find

Support queries on past versions (after k-th union). Using persistent arrays or link-cut trees.

Rollback Union-Find

Support undo of the last union. Use union by rank (without path compression) and maintain a stack of changes.

Useful in divide-and-conquer algorithms on graphs.

Applications in CS

  • Graph algorithms: Kruskal's MST, connected components, cycle detection.
  • Image processing: Connected component labeling (flood fill). Segmenting regions.
  • Compilers: Type unification in type inference (Hindley-Milner). Alias analysis. Register coalescing.
  • Network connectivity: Determine if two nodes can communicate. Track network partitions.
  • Physics simulation: Grouping rigid bodies in contact for stable stacking.
  • Database query optimization: Equivalence classes for join predicates.
  • Social networks: Community detection. Friend-of-friend grouping.
  • Competitive programming: One of the most commonly used data structures in contests.