Английская Википедия:Disjoint-set data structure

Материал из Онлайн справочника
Перейти к навигацииПерейти к поиску

Шаблон:Short description Шаблон:Infobox data structure

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets (replacing them by their union), and finding a representative member of a set. The last operation makes it possible to find out efficiently if any two elements are in the same or different sets.

While there are several ways of implementing disjoint-set data structures, in practice they are often identified with a particular implementation called a disjoint-set forest. This is a specialized type of forest which performs unions and finds in near-constant amortized time. To perform a sequence of Шаблон:Mvar addition, union, or find operations on a disjoint-set forest with Шаблон:Mvar nodes requires total time Шаблон:Math, where Шаблон:Math is the extremely slow-growing inverse Ackermann function. Disjoint-set forests do not guarantee this performance on a per-operation basis. Individual union and find operations can take longer than a constant times Шаблон:Math time, but each operation causes the disjoint-set forest to adjust itself so that successive operations are faster. Disjoint-set forests are both asymptotically optimal and practically efficient.

Disjoint-set data structures play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph. The importance of minimum spanning trees means that disjoint-set data structures underlie a wide variety of algorithms. In addition, disjoint-set data structures also have applications to symbolic computation, as well as in compilers, especially for register allocation problems.

History

Disjoint-set forests were first described by Bernard A. Galler and Michael J. Fischer in 1964.[1] In 1973, their time complexity was bounded to <math>O(\log^{*}(n))</math>, the iterated logarithm of <math>n</math>, by Hopcroft and Ullman.[2] In 1975, Robert Tarjan was the first to prove the <math>O(m\alpha(n))</math> (inverse Ackermann function) upper bound on the algorithm's time complexity,[3]. He also proved it to be tight. In 1979, he showed that this was the lower bound for a certain class of algorithms, that include the Galler-Fischer structure.[4] In 1989, Fredman and Saks showed that <math>\Omega(\alpha(n))</math> (amortized) words of <math>O(\log n)</math> bits must be accessed by any disjoint-set data structure per operation,[5] thereby proving the optimality of the data structure in this model.

In 1991, Galil and Italiano published a survey of data structures for disjoint-sets.[6]

In 1994, Richard J. Anderson and Heather Woll described a parallelized version of Union–Find that never needs to block.[7]

In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a semi-persistent version of the disjoint-set forest data structure and formalized its correctness using the proof assistant Coq.[8] "Semi-persistent" means that previous versions of the structure are efficiently retained, but accessing previous versions of the data structure invalidates later ones. Their fastest implementation achieves performance almost as efficient as the non-persistent algorithm. They do not perform a complexity analysis.

Variants of disjoint-set data structures with better performance on a restricted class of problems have also been considered. Gabow and Tarjan showed that if the possible unions are restricted in certain ways, then a truly linear time algorithm is possible.[9]

Representation

Each node in a disjoint-set forest consists of a pointer and some auxiliary information, either a size or a rank (but not both). The pointers are used to make parent pointer trees, where each node that is not the root of a tree points to its parent. To distinguish root nodes from others, their parent pointers have invalid values, such as a circular reference to the node or a sentinel value. Each tree represents a set stored in the forest, with the members of the set being the nodes in the tree. Root nodes provide set representatives: Two nodes are in the same set if and only if the roots of the trees containing the nodes are equal.

Nodes in the forest can be stored in any way convenient to the application, but a common technique is to store them in an array. In this case, parents can be indicated by their array index. Every array entry requires Шаблон:Math bits of storage for the parent pointer. A comparable or lesser amount of storage is required for the rest of the entry, so the number of bits required to store the forest is Шаблон:Math. If an implementation uses fixed size nodes (thereby limiting the maximum size of the forest that can be stored), then the necessary storage is linear in Шаблон:Mvar.

Operations

Disjoint-set data structures support three operations: Making a new set containing a new element; Finding the representative of the set containing a given element; and Merging two sets.

Making new sets

The MakeSet operation adds a new element into a new set containing only the new element, and the new set is added to the data structure. If the data structure is instead viewed as a partition of a set, then the MakeSet operation enlarges the set by adding the new element, and it extends the existing partition by putting the new element into a new subset containing only the new element.

In a disjoint-set forest, MakeSet initializes the node's parent pointer and the node's size or rank. If a root is represented by a node that points to itself, then adding an element can be described using the following pseudocode:

function MakeSet(x) is
    if x is not already in the forest then
        x.parent := x
        x.size := 1     // if nodes store size
        x.rank := 0     // if nodes store rank
    end if
end function

This operation has constant time complexity. In particular, initializing a disjoint-set forest with Шаблон:Mvar nodes requires Шаблон:Math time.

Lack of a parent assigned to the node implies that the node is not present in the forest.

In practice, MakeSet must be preceded by an operation that allocates memory to hold Шаблон:Math. As long as memory allocation is an amortized constant-time operation, as it is for a good dynamic array implementation, it does not change the asymptotic performance of the random-set forest.

Finding set representatives

The Find operation follows the chain of parent pointers from a specified query node Шаблон:Mvar until it reaches a root element. This root element represents the set to which Шаблон:Mvar belongs and may be Шаблон:Mvar itself. Find returns the root element it reaches.

Performing a Find operation presents an important opportunity for improving the forest. The time in a Find operation is spent chasing parent pointers, so a flatter tree leads to faster Find operations. When a Find is executed, there is no faster way to reach the root than by following each parent pointer in succession. However, the parent pointers visited during this search can be updated to point closer to the root. Because every element visited on the way to a root is part of the same set, this does not change the sets stored in the forest. But it makes future Find operations faster, not only for the nodes between the query node and the root, but also for their descendants. This updating is an important part of the disjoint-set forest's amortized performance guarantee.

There are several algorithms for Find that achieve the asymptotically optimal time complexity. One family of algorithms, known as path compression, makes every node between the query node and the root point to the root. Path compression can be implemented using a simple recursion as follows:

function Find(x) is
    if x.parent ≠ x then
        x.parent := Find(x.parent)
        return x.parent
    else
        return x
    end if
end function

This implementation makes two passes, one up the tree and one back down. It requires enough scratch memory to store the path from the query node to the root (in the above pseudocode, the path is implicitly represented using the call stack). This can be decreased to a constant amount of memory by performing both passes in the same direction. The constant memory implementation walks from the query node to the root twice, once to find the root and once to update pointers:

function Find(x) is
    root := x
    while root.parent ≠ root do
        root := root.parent
    end while

    while x.parent ≠ root do
        parent := x.parent
        x.parent := root
        x := parent
    end while

    return root
end function

Tarjan and Van Leeuwen also developed one-pass Find algorithms that retain the same worst-case complexity but are more efficient in practice.[3] These are called path splitting and path halving. Both of these update the parent pointers of nodes on the path between the query node and the root. Path splitting replaces every parent pointer on that path by a pointer to the node's grandparent:

function Find(x) is
    while x.parent ≠ x do
        (x, x.parent) := (x.parent, x.parent.parent)
    end while
    return x
end function

Path halving works similarly but replaces only every other parent pointer:

function Find(x) is
    while x.parent ≠ x do
        x.parent := x.parent.parent
        x := x.parent
    end while
    return x
end function

Merging two sets

Файл:Dsu disjoint sets init.svg
MakeSet creates 8 singletons.
Файл:Dsu disjoint sets final.svg
After some operations of Union, some sets are grouped together.

The operation Union(x, y) replaces the set containing Шаблон:Mvar and the set containing Шаблон:Mvar with their union. Union first uses Find to determine the roots of the trees containing Шаблон:Mvar and Шаблон:Mvar. If the roots are the same, there is nothing more to do. Otherwise, the two trees must be merged. This is done by either setting the parent pointer of Шаблон:Mvar's root to Шаблон:Mvar's, or setting the parent pointer of Шаблон:Mvar's root to Шаблон:Mvar's.

The choice of which node becomes the parent has consequences for the complexity of future operations on the tree. If it is done carelessly, trees can become excessively tall. For example, suppose that Union always made the tree containing Шаблон:Mvar a subtree of the tree containing Шаблон:Mvar. Begin with a forest that has just been initialized with elements <math>1, 2, 3, \ldots, n,</math> and execute Шаблон:Math, Шаблон:Math, ..., Шаблон:Math. The resulting forest contains a single tree whose root is Шаблон:Mvar, and the path from 1 to Шаблон:Mvar passes through every node in the tree. For this forest, the time to run Find(1) is Шаблон:Math.

In an efficient implementation, tree height is controlled using union by size or union by rank. Both of these require a node to store information besides just its parent pointer. This information is used to decide which root becomes the new parent. Both strategies ensure that trees do not become too deep.

Union by size

In the case of union by size, a node stores its size, which is simply its number of descendants (including the node itself). When the trees with roots Шаблон:Mvar and Шаблон:Mvar are merged, the node with more descendants becomes the parent. If the two nodes have the same number of descendants, then either one can become the parent. In both cases, the size of the new parent node is set to its new total number of descendants.

function Union(x, y) is
    // Replace nodes by roots
    x := Find(x)
    y := Find(y)

    if x = y then
        return  // x and y are already in the same set
    end if

    // If necessary, swap variables to ensure that
    // x has at least as many descendants as y
    if x.size < y.size then
        (x, y) := (y, x)
    end if

    // Make x the new root
    y.parent := x
    // Update the size of x
    x.size := x.size + y.size
end function

The number of bits necessary to store the size is clearly the number of bits necessary to store Шаблон:Mvar. This adds a constant factor to the forest's required storage.

Union by rank

For union by rank, a node stores its Шаблон:Em, which is an upper bound for its height. When a node is initialized, its rank is set to zero. To merge trees with roots Шаблон:Mvar and Шаблон:Mvar, first compare their ranks. If the ranks are different, then the larger rank tree becomes the parent, and the ranks of Шаблон:Mvar and Шаблон:Mvar do not change. If the ranks are the same, then either one can become the parent, but the new parent's rank is incremented by one. While the rank of a node is clearly related to its height, storing ranks is more efficient than storing heights. The height of a node can change during a Find operation, so storing ranks avoids the extra effort of keeping the height correct. In pseudocode, union by rank is:

function Union(x, y) is
    // Replace nodes by roots
    x := Find(x)
    y := Find(y)

    if x = y then
        return  // x and y are already in the same set
    end if

    // If necessary, rename variables to ensure that
    // x has rank at least as large as that of y
    if x.rank < y.rank then
        (x, y) := (y, x)
    end if

    // Make x the new root
    y.parent := x
    // If necessary, increment the rank of x
    if x.rank = y.rank then
        x.rank := x.rank + 1
    end if
end function

It can be shown that every node has rank <math>\lfloor \log n \rfloor</math> or less.[10] Consequently each rank can be stored in Шаблон:Math bits and all the ranks can be stored in Шаблон:Math bits. This makes the ranks an asymptotically negligible portion of the forest's size.

It is clear from the above implementations that the size and rank of a node do not matter unless a node is the root of a tree. Once a node becomes a child, its size and rank are never accessed again.

Time complexity

A disjoint-set forest implementation in which Find does not update parent pointers, and in which Union does not attempt to control tree heights, can have trees with height Шаблон:Math. In such a situation, the Find and Union operations require Шаблон:Math time.

If an implementation uses path compression alone, then a sequence of Шаблон:Mvar MakeSet operations, followed by up to Шаблон:Math Union operations and Шаблон:Math Find operations, has a worst-case running time of <math>\Theta(n+f \cdot \left(1 + \log_{2+f/n} n\right))</math>.[10]

Using union by rank, but without updating parent pointers during Find, gives a running time of <math>\Theta(m \log n)</math> for Шаблон:Mvar operations of any type, up to Шаблон:Mvar of which are MakeSet operations.[10]

The combination of path compression, splitting, or halving, with union by size or by rank, reduces the running time for Шаблон:Mvar operations of any type, up to Шаблон:Mvar of which are MakeSet operations, to <math>\Theta(m\alpha(n))</math>.[3][4] This makes the amortized running time of each operation <math>\Theta(\alpha(n))</math>. This is asymptotically optimal, meaning that every disjoint set data structure must use <math>\Omega(\alpha(n))</math> amortized time per operation.[5] Here, the function <math>\alpha(n)</math> is the inverse Ackermann function. The inverse Ackermann function grows extraordinarily slowly, so this factor is Шаблон:Math or less for any Шаблон:Mvar that can actually be written in the physical universe. This makes disjoint-set operations practically amortized constant time.

Proof of O(m log* n) time complexity of Union-Find

The precise analysis of the performance of a disjoint-set forest is somewhat intricate. However, there is a much simpler analysis that proves that the amortized time for any Шаблон:Mvar Find or Union operations on a disjoint-set forest containing Шаблон:Mvar objects is Шаблон:Math, where Шаблон:Math denotes the iterated logarithm.[11][12][13][14]

Шаблон:AnchorLemma 1: As the find function follows the path along to the root, the rank of node it encounters is increasing.

Шаблон:Math proof

Шаблон:AnchorLemma 2: A node Шаблон:Mvar which is root of a subtree with rank Шаблон:Mvar has at least <math>2^r</math> nodes.

Шаблон:Math proof

Файл:ProofOflogstarnRank.jpg

Lemma 3: The maximum number of nodes of rank Шаблон:Mvar is at most <math>\frac{n}{2^r}.</math>

Шаблон:Math proof

For convenience, we define "bucket" here: a bucket is a set that contains vertices with particular ranks.

We create some buckets and put vertices into the buckets according to their ranks inductively. That is, vertices with rank 0 go into the zeroth bucket, vertices with rank 1 go into the first bucket, vertices with ranks 2 and 3 go into the second bucket. If the Шаблон:Mvar-th bucket contains vertices with ranks from interval <math>\left[r, 2^r - 1\right] = [r, R - 1]</math> then the (B+1)st bucket will contain vertices with ranks from interval <math>\left[R, 2^R - 1\right].</math>

Файл:Proof of O(log*n) Union Find.jpg
Proof of <math>O(\log^*n)</math> Union Find

We can make two observations about the buckets.

  1. Шаблон:AnchorThe total number of buckets is at most Шаблон:Math
    Proof: When we go from one bucket to the next, we add one more two to the power, that is, the next bucket to <math>\left[B, 2^B - 1\right]</math> will be <math>\left[2^B, 2^{2^B} - 1\right]</math>
  2. Шаблон:AnchorThe maximum number of elements in bucket <math>\left[B, 2^B - 1\right]</math> is at most <math>\frac{2n}{2^B}</math>
    Proof: The maximum number of elements in bucket <math>\left[B, 2^B - 1\right]</math> is at most <math>\frac{n}{2^B} + \frac{n}{2^{B+1}} + \frac{n}{2^{B+2}} + \cdots + \frac{n}{2^{2^B - 1}} \leq \frac{2n}{2^B}.</math>

Let Шаблон:Mvar represent the list of "find" operations performed, and let

<math display=block>T_1 = \sum_F\text{(link to the root)}</math> <math display=block>T_2 = \sum_F\text{(number of links traversed where the buckets are different)}</math> <math display=block>T_3 = \sum_F\text{(number of links traversed where the buckets are the same).}</math>

Then the total cost of Шаблон:Mvar finds is <math>T = T_1 + T_2 + T_3.</math>

Since each find operation makes exactly one traversal that leads to a root, we have Шаблон:Math.

Also, from the bound above on the number of buckets, we have Шаблон:Math.

For Шаблон:Mvar, suppose we are traversing an edge from Шаблон:Mvar to Шаблон:Mvar, where Шаблон:Mvar and Шаблон:Mvar have rank in the bucket Шаблон:Math and Шаблон:Mvar is not the root (at the time of this traversing, otherwise the traversal would be accounted for in Шаблон:Mvar). Fix Шаблон:Mvar and consider the sequence <math>v_1, v_2, \ldots, v_k</math> that take the role of Шаблон:Mvar in different find operations. Because of path compression and not accounting for the edge to a root, this sequence contains only different nodes and because of Lemma 1 we know that the ranks of the nodes in this sequence are strictly increasing. By both of the nodes being in the bucket we can conclude that the length Шаблон:Mvar of the sequence (the number of times node Шаблон:Mvar is attached to a different root in the same bucket) is at most the number of ranks in the buckets Шаблон:Mvar, that is, at most <math>2^B - 1 - B < 2^B.</math>

Therefore, <math>T_3 \leq \sum_{[B, 2^B - 1]} \sum_u 2^B.</math>

From Observations 1 and 2, we can conclude that <math display="inline">T_3 \leq \sum_{B} 2^B \frac{2n}{2^B} \leq 2 n \log^* n.</math>

Therefore, <math>T = T_1 + T_2 + T_3 = O(m \log^*n).</math>

Other structures

Better worst-case time per operation

The worst-case time of the Find operation in trees with Union by rank or Union by weight is <math>\Theta(\log n)</math> (i.e., it is <math>O(\log n)</math> and this bound is tight). In 1985, N. Blum gave an implementation of the operations that does not use path compression, but compresses trees during <math>union</math>. His implementation runs in <math>O(\log n / \log\log n)</math> time per operation,[15] and thus in comparison with Galler and Fischer's structure it has a better worst-case time per operation, but inferior amortized time. In 1999, Alstrup et al. gave a structure that has optimal worst-case time <math>O(\log n / \log\log n)</math> together with inverse-Ackermann amortized time.[16]

Deletion

The regular implementation as disjoint-set forests does not react favorably to the deletion of elements, in the sense that the time for Find will not improve as a result of the decrease in the number of elements. However, there exist modern implementations that allow for constant-time deletion and where the time-bound for Find depends on the current number of elements[17][18]

Applications

Файл:UnionFindKruskalDemo.gif
A demo for Union-Find when using Kruskal's algorithm to find minimum spanning tree.

Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of unification.[19]

This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also a key component in implementing Kruskal's algorithm to find the minimum spanning tree of a graph.

The Hoshen-Kopelman algorithm uses a Union-Find in the algorithm.

See also

References

Шаблон:Reflist

External links

Шаблон:Data structures

  1. Шаблон:Cite journal. The paper originating disjoint-set forests.
  2. Шаблон:Cite journal
  3. 3,0 3,1 3,2 Шаблон:Cite journal
  4. 4,0 4,1 Шаблон:Cite journal
  5. 5,0 5,1 Шаблон:Cite book
  6. Шаблон:Cite journal
  7. Шаблон:Cite conference
  8. Шаблон:Cite conference
  9. Harold N. Gabow, Robert Endre Tarjan, "A linear-time algorithm for a special case of disjoint set union," Journal of Computer and System Sciences, Volume 30, Issue 2, 1985, pp. 209–221, ISSN 0022-0000, https://doi.org/10.1016/0022-0000(85)90014-5
  10. 10,0 10,1 10,2 Шаблон:Cite book
  11. Raimund Seidel, Micha Sharir. "Top-down analysis of path compression", SIAM J. Comput. 34(3):515–525, 2005
  12. Шаблон:Cite journal
  13. Шаблон:Cite journal
  14. Robert E. Tarjan and Jan van Leeuwen. Worst-case analysis of set union algorithms. Journal of the ACM, 31(2):245–281, 1984.
  15. Шаблон:Cite journal
  16. Шаблон:Cite book
  17. Шаблон:Cite journal
  18. Шаблон:Cite journal
  19. Шаблон:Cite journal