Union-Find (Disjoint Set Union)
Union-Find (Disjoint Set Union)
A Union-Find (DSU) structure maintains a partition of n elements into disjoint sets by storing, for each element, a pointer toward a representative (root) of its set; find(x) walks parent pointers to the root, and union(x,y) links the two roots together — so two elements are in the same set exactly when their roots coincide, and merging sets costs only a pointer update rather than relabeling every member.
Recognize the pattern
- The problem repeatedly asks "are these two things connected / in the same group?" interleaved with "merge these two groups," and you never need to undo a merge.
- You are building a spanning structure incrementally and must detect a cycle cheaply — the textbook case is Kruskal's MST, and "count connected components while edges stream in."
- Offline connectivity: given a fixed batch of union operations, answer queries in any order without rebuilding a graph traversal each time.
- Problems literally named "number of islands," "accounts merge," "redundant connection," "friend circles," "smallest equivalent string."
Baseline vs. this algorithm
Naive baseline — graph traversal: model connectivity as an adjacency list and run BFS/DFS from a node whenever you need to answer "is x connected to y?" Each query costs O(V+E). If you instead try to maintain an explicit component-id array and merge two components by relabeling every node in one of them, a single merge costs O(size of the component being relabeled), which is O(n) in the worst case — and an adversarial sequence of n merges can cost O(n²) total.
Union-Find: represent each set as a small in-tree (parent pointers, not a full adjacency structure). find(x) costs O(height of x's tree). union(x,y) costs one find on each side plus O(1) to re-point a root. Two optimizations make the trees shallow:
- Union by rank/size: when merging, attach the root of the smaller tree under the root of the larger tree (by subtree size or by an upper-bound "rank"). This alone caps tree height at O(log n), since a node's depth only increases when its tree is the smaller one being absorbed, at least doubling that tree's size each time.
- Path compression: during
find, re-point every node visited on the path directly to the root. This flattens the tree for all future queries through those nodes.
Used together, the amortized cost per operation is O(α(n)), where α is the inverse Ackermann function — for any n that fits in the physical universe, α(n) ≤ 4, so this is "practically constant."
Complexity, derived from first principles
Without either optimization: a chain can form (union always attaches root of x under root of y, y under z, ...), giving tree height O(n), so find is O(n) worst case.
With union by size alone (no compression): every time a node's depth increases by 1, it just moved from a tree of size s into a merged tree of size ≥ 2s (because we always attach the smaller side, so the tree it joined was at least as big as its own). A node can only be on the "smaller side" O(log n) times before the total number of elements n is exhausted, since size at least doubles from that node's perspective each time its depth grows. Hence height ≤ O(log n) and find is O(log n) worst case.
Adding path compression: a single find can still touch O(log n) nodes on a cold path, but every node it touches gets re-attached directly to the root — so the next find through any of them is O(1). This combined bound is Tarjan's classic amortized-analysis result (a potential-function / block argument over ranks): any sequence of m union/find operations on n elements costs O(m·α(n)) total, i.e., O(α(n)) amortized per operation — cited here rather than re-derived, since the full block argument is a page of its own, but the O(log n) case above is fully derived from first principles. Space is O(n): one parent slot and one rank/size slot per element — no adjacency list, no visited array.
| Variant | find / union (amortized) | Space |
|---|---|---|
| Naive relabeling | O(n) per union | O(n) |
| Tree, no optimization | O(n) worst case | O(n) |
| Union by size/rank only | O(log n) | O(n) |
| Union by size/rank + path compression | O(α(n)) ≈ O(1) | O(n) |
Traced worked example
Elements 0..6, all start as singleton sets: parent[i] = i, size[i] = 1. Process unions in this order: union(0,1), union(1,2), union(3,4), union(5,6), union(2,3), find(6) with path compression.
| Step | Operation | Roots before | Action (smaller size attaches under larger) | Parent array after (index 0..6) |
|---|---|---|---|---|
| 1 | union(0,1) | root(0)=0 (size1), root(1)=1 (size1) | tie → attach 1 under 0; size[0]=2 | 0,0,2,3,4,5,6 |
| 2 | union(1,2) | root(1)=0 (size2), root(2)=2 (size1) | attach 2 under 0; size[0]=3 | 0,0,0,3,4,5,6 |
| 3 | union(3,4) | root(3)=3 (size1), root(4)=4 (size1) | tie → attach 4 under 3; size[3]=2 | 0,0,0,3,3,5,6 |
| 4 | union(5,6) | root(5)=5 (size1), root(6)=6 (size1) | tie → attach 6 under 5; size[5]=2 | 0,0,0,3,3,5,5 |
| 5 | union(2,3) | root(2)=0 (size3), root(3)=3 (size2) | attach smaller root 3 under 0; size[0]=5 | 0,0,0,0,3,5,5 |
| 6 | find(6) | 6→5→5 (root 5, depth 1) | no compression needed (already depth 1); returns 5 | unchanged |
Note element 4 sits at depth 2 (4→3→0). A subsequent find(4) would walk 4→3→0, then path-compression re-points parent[4] = 0 directly, so any later find(4) is O(1).
Java implementation
final class UnionFind {
private final int[] parent;
private final int[] size;
private int components;
UnionFind(int n) {
parent = new int[n];
size = new int[n];
for (int i = 0; i < n; i++) {
parent[i] = i;
size[i] = 1;
}
components = n;
}
// Path compression: point every visited node straight at the root.
int find(int x) {
while (parent[x] != x) {
parent[x] = parent[parent[x]]; // path halving
x = parent[x];
}
return x;
}
// Union by size: attach smaller tree under larger tree's root.
// Returns false if x and y were already in the same set (cycle detected).
boolean union(int x, int y) {
int rx = find(x), ry = find(y);
if (rx == ry) return false;
if (size[rx] < size[ry]) { int t = rx; rx = ry; ry = t; }
parent[ry] = rx;
size[rx] += size[ry];
components--;
return true;
}
boolean connected(int x, int y) {
return find(x) == find(y);
}
int componentCount() {
return components;
}
}
// Kruskal's MST sketch: sort edges by weight, then union(u, v) each edge
// only if union() returns true (i.e., it does not close a cycle).
Pitfalls
- Forgetting path compression or union by rank/size: either one alone is fine (O(log n)), but skipping both degrades to O(n) per find on an adversarial union order (e.g., always union(i, i+1) in increasing order builds a straight chain).
- Comparing indices instead of roots: a common bug is checking
x == yorparent[x] == parent[y]instead offind(x) == find(y)— two elements can be in the same set with different immediate parents. - Stale size/rank after swapping roots: if you swap rx/ry to enforce "attach smaller under larger" but forget to also swap which size you update, you corrupt future union-by-size decisions and can reintroduce chains.
- Off-by-one on component counting: only decrement the component counter when
unionactually merges two distinct sets, not on every call. - DSU has no built-in "split" or "remove": if the problem needs to undo a union or delete an edge, plain DSU cannot do it efficiently; that needs a different structure (e.g., a link-cut tree, or DSU-with-rollback for offline divide-and-conquer over time).
- Not resetting rank/size correctly when reusing a DSU instance across test cases — leftover state silently produces wrong connectivity answers.
When to use / when NOT — trade-offs vs. DFS-based connectivity
| Union-Find | DFS/BFS connectivity | |
|---|---|---|
| Best for | Incremental unions interleaved with queries; no need to traverse full structure; cycle detection while adding edges one at a time (Kruskal) | Static graph, connectivity known once graph is fully built; need actual paths, distances, or full component contents in traversal order |
| Query cost | O(α(n)) amortized per connectivity check | O(V+E) to build components once, then O(1) per query using precomputed labels — but O(V+E) again if edges keep arriving |
| Handles streaming edges well | Yes — designed for it | No — must re-run traversal (or incrementally patch labels, which is what DSU already does better) |
| Can report the actual path between two nodes | No — only "same set or not" | Yes — DFS/BFS naturally reconstructs a path |
| Can undo a merge / handle edge deletion | No (plain DSU) | Yes, by rebuilding or maintaining adjacency and re-traversing |
| Implementation weight | Two small arrays, ~15 lines | Adjacency list + visited array + recursion/stack |
Use Union-Find when the workload is "many unions, many connectivity queries, edges only ever added, no path needed" — Kruskal's MST, "redundant connection," incremental network connectivity, offline grid flood-fill counted via merges. Prefer DFS/BFS when you need the shortest/actual path, need to handle edge removal, or only need connectivity computed once on a fixed, complete graph (a single O(V+E) pass is simpler to write and reason about than standing up a DSU for a one-shot query).
Takeaways
- DSU stores a forest of parent pointers, not an adjacency list;
findwalks to the root,unionrelinks roots. - Union by rank/size bounds height at O(log n); adding path compression drops amortized cost to O(α(n)), effectively constant.
unionreturning "already connected" is exactly a cycle-detection signal — that is the mechanism behind Kruskal's MST.- DSU trades away path reconstruction and deletion support for near-O(1) incremental connectivity — reach for DFS/BFS when you need paths, distances, or edge removal.
Recall question
You are given m union operations to process on n elements, and only union by size is applied (no path compression). What is the tight worst-case bound on the height of any resulting tree, and why does always attaching the smaller-sized tree under the larger one's root guarantee that bound?
Authored for this guide; concepts and amortized-complexity analysis follow the standard treatment in Cormen/Leiserson/Rivest/Stein (CLRS), "Introduction to Algorithms," ch. 21 (Data Structures for Disjoint Sets), and Tarjan's original inverse-Ackermann analysis of union-find.
🤖 Don't fully get this? Learn it with Claude
Stuck on Union-Find (Disjoint Set Union)? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **Union-Find (Disjoint Set Union)** (DSA) and want to truly understand it. Explain Union-Find (Disjoint Set Union) from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **Union-Find (Disjoint Set Union)** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **Union-Find (Disjoint Set Union)** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **Union-Find (Disjoint Set Union)** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.