

• In set theory, 0 ≠ {0}, as 0 is a number and {0} is a set containing 0. Similarly, an inexperienced reader may mistake a forest for a set of trees.

• In graph theory, trees and forests are both graphs. There is no hierarchical relation between them, so the two can coincide: the forest of one tree is the tree itself, not a set of trees containing it. We will see later that Kruskal's algorithm operates on forests that are subgraphs of a tree.

Definition 2.15: A spanning forest F(V, E′) of a graph G(V, E) is the forest containing all the vertices of G and a subset of its edges (E′ ⊆ E) such that a disjoint union of trees is formed (any vertex or edge belongs to only one tree).

If a graph is disconnected, then there is no spanning tree (and consequently no minimum spanning tree), but there can be a set of spanning trees or MSTs, even if some of them contain only one vertex and no edges. For example, the graph formed by a spanning tree together with a vertex that is not joined to any other vertex is a spanning forest.

For a connected graph, there is always a (minimum) spanning tree, which is also a (minimum) spanning forest. However, a disconnected graph can only be associated with a spanning forest. Thus, algorithms applied to undirected graphs that may or may not be connected should either be supported by a data representation of forests or inform the user of the inability to find a spanning tree.

In addition, there are graph-theoretic proofs and algorithms that handle forests in the process of finding spanning trees or MSTs of a given graph (e.g. Kruskal's algorithm).

For example, in Figure 2.10 we can see a forest, which happens to be a subgraph and spanning forest of the tree in Figure 2.8 and of the complete graph in Figure 2.7.


Figure 2.10 A spanning forest

At each step the algorithm makes the best choice out of many possibilities. Greedy algorithms do not always guarantee a globally optimal solution, but fortunately the two algorithms find the unique solution when the edge weights are distinct, or one valid solution among the set of MSTs that attain the minimum total weight otherwise. They always result in a spanning tree of the input graph with total weight less than or equal to that of any other spanning tree (Cormen et al., 2009).

The general strategy behind those algorithms is to create a set of edges A that grows by one edge at a time while ensuring that the following loop invariant holds:

At the start of each iteration, A is a subset of a minimum spanning tree.

At each step, the algorithms determine a safe edge: an edge that can be added to A without the invariant being violated. Thus, each iteration will be a step closer to the final solution, the minimum spanning tree. See Algorithm 2.1 for the pseudocode of the generic greedy algorithm and Cormen et al. (2009) for the mathematical proof.

The critical part of any greedy MST algorithm and its distinctive characteristic is the method of finding safe edges. The next sections describe the methods used by the two algorithms we mentioned.

Algorithm 2.1: GreedyMST
Input: graph G(V, E)
Output: minimum spanning tree T(V, A)
1 A ← ∅
2 while T is not a spanning tree of G do
3     find an edge (u, v) that is safe
4     A ← A ∪ {(u, v)}
5 return T
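To make the template concrete, here is a minimal Python sketch of the same loop (our own illustration, not taken from Cormen et al.); the find_safe_edge parameter is a hypothetical callable standing in for the rule that distinguishes Kruskal's and Prim's algorithms.

def greedy_mst(vertices, edges, find_safe_edge):
    """Generic greedy MST template: grow A by one safe edge at a time."""
    A = set()
    # A spanning tree of a connected graph has exactly |V| - 1 edges.
    while len(A) < len(vertices) - 1:
        # Any edge returned here must keep A a subset of some MST,
        # so the loop invariant is preserved at every iteration.
        A.add(find_safe_edge(A))
    return A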

2.3.1 Kruskal’s algorithm

Kruskal's algorithm (Kruskal, 1956) begins with a forest containing all vertices of the input graph and no edges. At this stage we can consider the forest as a collection of one-vertex trees. Then, it gradually identifies safe edges in the edge set of the input graph, adds them to the forest, and combines the trees adjacent to the edges, until a spanning tree is formed.

The algorithm's method of finding safe edges is based on the fact that the lightest edge is always contained in the MST, so we can add it to the forest; the 2nd lightest can be included too, as there is no way to create a cycle at this point (only two vertices are connected); the 3rd lightest can be included if it does not form a cycle in the already formed forest; the 4th too; and so on, until the MST is formed or no edges are left to be checked.

Thus, sorting the edges in non-decreasing order and repeatedly adding the lightest not-yet-considered edge that creates no cycle eventually results in an MST of the input graph. The proof can be found in Cormen et al. (2009).

Safe edges

In order to find out whether the inclusion of an edge would create a cycle, we keep track of which vertices are connected to which by creating a set Vu for each vertex u that includes all vertices connected with u. Of course, each Vu contains only u initially.

Then, checking an edge for safety reduces to locating the subsets Vi and Vj that contain the two endpoints of the edge. We call this process FindSet.


If Vi and Vj are the same, then the currently checked edge connects vertices that were already connected indirectly, and it would create a cycle. If not, then the endpoints belong to different trees, so the edge can be included in the forest and the corresponding sets should be merged, as their vertices are now connected, a process we call UnionSet.

The combination of UnionSet and FindSet is called the Union–Find algorithm, which relies on disjoint-set data structures for efficient computation (Cormen et al., 2009).
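As an illustration, the following Python sketch shows one common tree-based realization of these operations, with the standard path-compression and union-by-rank heuristics (Cormen et al., 2009); the class and method names are our own choice.

class DisjointSet:
    """Disjoint-set forest supporting FindSet and UnionSet."""

    def __init__(self, vertices):
        # MakeSet: initially every vertex is the root of its own tree.
        self.parent = {u: u for u in vertices}
        self.rank = {u: 0 for u in vertices}

    def find_set(self, u):
        # Walk up to the root, flattening the path on the way (path compression).
        if self.parent[u] != u:
            self.parent[u] = self.find_set(self.parent[u])
        return self.parent[u]

    def union_set(self, u, v):
        ru, rv = self.find_set(u), self.find_set(v)
        if ru == rv:
            return False          # same tree: the edge (u, v) would close a cycle
        if self.rank[ru] < self.rank[rv]:
            ru, rv = rv, ru       # union by rank: attach the shallower tree lower
        self.parent[rv] = ru
        if self.rank[ru] == self.rank[rv]:
            self.rank[ru] += 1
        return True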

Ability to find MSFs

For a connected input graph, successive unions will merge the initial one-vertex sets into a minimum spanning tree of V vertices and E = V − 1 edges.

If a disconnected graph is given, a minimum spanning forest will be created, composed of T trees, V vertices, and E = V − T edges.

As the MSF is a generalization of the MST, in Algorithm 2.2 we will consider the output of Kruskal's algorithm to be an MSF.

Time complexity

The sorting phase of the algorithm dominates the execution time because

• sorting requires O(E log E) time, while

• the FindSet and UnionSet operations can be implemented using a tree-based implementation of the Union–Find algorithm, requiring less time: O(E α(E, V)) (Cormen et al., 2009).

We should note that α(n, m) is the two-parameter inverse Ackermann function, a very slowly increasing function that can be considered < 4 even for extremely large values of n and m.

Consequently, the time complexity is O(E log E), and since the number of edges cannot exceed ½V(V − 1),

E = O(V²) ⇒ O(E log E) → O(E log V)   (2.9)

because log E ≤ log V² = 2 log V.

Pseudocode

Algorithm 2.2: Kruskal's algorithm
Input: graph G(V, E)
Output: minimum spanning forest T(V, A)
1  A ← ∅
2  foreach u ∈ V do
3      MakeSet: Vu ← {u}
4  sort E in non-decreasing order
5  foreach (u, v) ∈ E do
6      Vi ← FindSet(u)
7      Vj ← FindSet(v)
8      if Vi ≠ Vj then
9          A ← A ∪ {(u, v)}
10         Vi ← UnionSet(Vi, Vj)
11         Vj ← ∅
12         if Vi = V then break
13 return T
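A compact Python rendering of Algorithm 2.2 might look as follows; this is a sketch under our own interface assumptions: the graph is passed as a list of (weight, u, v) tuples and a dictionary-based Union–Find is inlined.

def kruskal_msf(vertices, edges):
    """Minimum spanning forest of (vertices, edges); edges are (w, u, v) tuples."""
    parent = {u: u for u in vertices}      # MakeSet for every vertex

    def find_set(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving keeps the trees shallow
            u = parent[u]
        return u

    forest = []
    for w, u, v in sorted(edges):          # non-decreasing order of weight
        ru, rv = find_set(u), find_set(v)
        if ru != rv:                       # safe edge: endpoints in different trees
            forest.append((u, v, w))
            parent[rv] = ru                # UnionSet merges the two trees
    return forest

For a connected input the returned forest has V − 1 edges, i.e. it is the MST; for a disconnected input it has V − T edges, in line with the discussion above.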


2.3.2 Prim’s algorithm

Contrary to Kruskal's algorithm, where a merging of trees takes place, the idea behind Prim's algorithm (Prim, 1957) is to start with a null tree and grow it step by step by adding one safe edge at a time.

Safe edge

As the resulting MST will contain all vertices of the input graph, we can add an arbitrary vertex u to the null tree. Then, it is proven that

i. the lightest edge starting from u is safe and can be added to the tree together with its other endpoint v;

ii. the lightest edge connecting either u or v to any other vertex (V \ {u, v}) is also safe;

iii. at any point of the process, out of the non-included edges that join a vertex in the tree to one not yet in the tree, the lightest one is necessarily safe.

Finding lightest edges

The key in any implementation or modification of Prim's algorithm is to efficiently track the lightest edge at each step. The most common strategy is to assign two key values to each vertex u:

cu – the weight of the lightest edge connecting u to any vertex in T

eu – the corresponding light edge

Initially, all ci are set equal to ∞ so that subsequent inequalities of the form x < ci evaluate to true the first time. All ei are set equal to a flag value indicating they are undefined.

By including the first arbitrary vertex, we update the ci and ei values of the remaining vertices. Searching for the lightest edge is then trivial: we keep track of the minimum of the ci's and the index of the edge it corresponds to. The edge is included, and its other endpoint takes over the role of the first vertex, imposing updates to the ci and ei values of the rest of the graph. Ultimately, by keeping the lightest-edge table, we do not have to revisit included vertices.

The pseudocode of Prim's algorithm is given in Algorithm 2.3.

Time complexity

The time complexity of Prim's algorithm varies depending on the data structures used (Eisner, 2014):

Data structure     Time complexity
Adjacency matrix   O(V²)
Binary heap        O(E log V)
Fibonacci heap     O(E + V log V)

Table 2.1 Time complexity of Prim's algorithm

2.3.3 Comparison of the MST algorithms

The binary heap provides complexity asymptotically equal to that of Kruskal's algorithm.

The above complexities involve both E and V. To compare them, a function f connecting the number of edges to the number of vertices in terms of asymptotic analysis,

E = O(f(V)),

is required. In other words, an advantageous selection of the algorithm and the data structures depends on how dense the graph is expected to be. A sparse graph is considered to have a number of edges proportional to the number of vertices, E = O(V), while for a dense graph E = O(V²) should be regarded, with the complete graph being the densest.

We produced the following table for future reference:

Data structure        Sparse graph E = O(V)   Dense graph E = O(V²)
Prim's algorithm
  Adjacency matrix    O(V²)                   O(V²)
  Binary heap         O(V log V)              O(V² log V)
  Fibonacci heap      O(V log V)              O(V²)
Kruskal's algorithm
  Anything            O(V log V)              O(V² log V)

Table 2.2 Comparison of the computational complexities of the MST algorithms with respect to the density of the graph

Algorithm 2.3: Prim's algorithm
Input: graph G(V, E)
Output: minimum spanning tree T(S, A)
1  foreach u ∈ V do
2      cu ← +∞
3      eu ← FLAG
4  initialize T: S, A ← ∅
5  initialize the set of vertices not yet included in T: Q ← V
6  while Q ≠ ∅ do
7      find the vertex u ∈ Q having the minimum key value cu
8      Q ← Q \ {u}; S ← S ∪ {u}
9      if eu ≠ FLAG then A ← A ∪ {eu}
10     foreach v ∈ Q connected to u do
11         w ← weight of the edge (u, v)
12         if w < cv then
13             cv ← w
14             ev ← (u, v)
15 return T
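For comparison, a binary-heap variant (cf. Table 2.1) can be sketched in Python with the standard heapq module; instead of updating the cu/eu table in place, it uses the common lazy-deletion trick, and the adjacency-dictionary interface is our own assumption.

import heapq

def prim_mst(graph, start):
    """Binary-heap Prim; graph[u] maps each neighbour v of u to the weight w(u, v)."""
    in_tree = {start}
    tree_edges = []
    # Candidate edges (weight, u, v) leaving the current tree.
    frontier = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(frontier)
    while frontier and len(in_tree) < len(graph):
        w, u, v = heapq.heappop(frontier)  # lightest candidate edge
        if v in in_tree:
            continue                       # stale entry: v joined the tree earlier
        in_tree.add(v)                     # the popped edge is safe
        tree_edges.append((u, v, w))
        for x, wx in graph[v].items():
            if x not in in_tree:
                heapq.heappush(frontier, (wx, v, x))
    return tree_edges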



2.3.4 Delaunay Triangulation

Given N points in the plane, drawing line segments between pairs of them such that no two segments cross, until no more segments can be drawn, we arrive at a triangulation of the set of points.

There are many distinct triangulations for the same set of points. A Delaunay triangulation is the one that best avoids small angles and long sides (Press et al., 2007):

Among all triangulations of a set of points, a Delaunay triangulation has the largest minimum angles, as defined by sorting all angles from smallest to largest and comparing lexicographically to the angles of any other triangulation.

The same principle can be extended to more than two dimensions. For example, for points in 3D space there is a Delaunay triangulation, also called a Delaunay tetrahedralization, as the segments are organized in tetrahedra.

We include the DT in our study because the MST of a point set in Euclidean space can be quickly extracted from its DT, faster than using MST algorithms directly, by exploiting the existence of the metric of the space (§3.3). There are various DT algorithms, but for generic applications the computational complexity of the most efficient algorithms is O(V log V) in 2D and 3D space.
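As a sketch of this shortcut (assuming SciPy is available; the function name below is our own), one can triangulate the points, keep the triangulation edges weighted by their Euclidean length, and run an MST routine on the resulting sparse graph:

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import Delaunay

def emst_via_delaunay(points):
    """Euclidean MST of 2D/3D points extracted from their Delaunay triangulation."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:          # collect the unique edges of all simplices
        for i in range(len(simplex)):
            for j in range(i + 1, len(simplex)):
                a, b = sorted((int(simplex[i]), int(simplex[j])))
                edges.add((a, b))
    rows, cols = map(np.array, zip(*edges))
    weights = np.linalg.norm(points[rows] - points[cols], axis=1)
    n = len(points)
    graph = coo_matrix((weights, (rows, cols)), shape=(n, n))
    return minimum_spanning_tree(graph)    # sparse matrix holding the MST edges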

The art of progress is to preserve order amid change and to preserve change amid order.

Alfred North Whitehead, Process and Reality: An Essay in Cosmology (1929)

3

Related work

Outline of chapter

3.1 Two–point correlation function
3.2 Use of MSTs in Cosmology
3.3 Delaunay Triangulation and MST
3.4 DEUSS


In this chapter we present the two–point correlation function (2PCF), a statistical tool coming from the field of Crystallography, and its application to cosmological data.

However, its failure to distinguish differences in the clustering of matter in simulations with different initial conditions required the use of the higher-order N-point correlation functions (NPCFs).

But NPCFs are far from a panacea for clustering problems. Besides being computationally intensive, as N. K. Bose discusses in Coles (1992),

[...] correlation functions are insensitive to visually different patterns and morphology.

In the next sections we present how minimum spanning trees can address these issues. Furthermore, computed in O(V² log V) time at worst, the MST presents itself as a valuable shortcut around the heavy NPCFs.

3.1 Two–point correlation function

3.1.1 Definition

The organization of matter in the universe on large scales and the evolution of structures are connected to a wide range of processes from the early universe (quantum perturbations) up to recent times (mostly gravitation and dark energy).

Naturally, various clustering techniques have been applied. The mainstream approach is to use point field statistics in samples of concentrations of matter (galaxies, dark haloes, clusters).

The observation that the Universe is homogeneous on very large scales led to an approach of identifying the clustering of matter as fluctuations superposed on a mean field of the universe (Jones et al., 2004). Consequently, the statistical analysis of the spatial distribution of galaxies often involves the two–point correlation function (2PCF), ξ(r), which quantifies the clustering of galaxies with the following infinitesimal interpretation (Martínez & Saar, 2003):

dP12 = n̄² [1 + ξ(r)] dV1 dV2   (3.1)

where dP12 is the joint probability that at least one galaxy lies in each one of the two infinitesimal volumes dV1 and dV2, r is the separation vector of the latter, and n̄ is the mean number density or intensity. The correlation function is a measure of excess probability relative to a Poisson distribution.

A similar quantity, used for projected catalogues, is the angular two-point correlation function, w(θ), defined by

dP = N [1 + w(θ)] dΩ   (3.2)

where dP is the conditional probability of a galaxy residing within the solid angle dΩ at an angular distance θ from an arbitrarily chosen galaxy, and N is the mean number density of galaxies per unit area in the projection.

The use of the 2PCF was crucial in showing that the universe is homogeneous for r > 100 h⁻¹ Mpc and that it presents a fractal structure (with dimension close to 2) for r < 20 h⁻¹ Mpc. A complete review of the correlation functions in cosmology and their interpretation can be found in Jones et al. (2004).

3.1.2 Estimation

Under the assumption of a homogeneous and isotropic distribution of galaxies (invariant under translations and rotations), the probability depends only on the magnitude r = |r| of the separation vector.

The most commonly used estimator for ξ(r) is the LS estimator (Landy & Szalay, 1993),

ξ(r) = (DD − 2 DR + RR) / RR   (3.3)

where the three quantities are normalized counts of galaxy pairs of binned separation r for two catalogues: (i) the observed data with nd galaxies and (ii) a random distribution of nr galaxies with the same mean density and geometry as the former:

DD(r) = dd(r) / [nd(nd − 1)/2]   (data sample)   (3.4)

RR(r) = rr(r) / [nr(nr − 1)/2]   (random sample)   (3.5)

DR(r) = dr(r) / (nr nd)   (cross-correlation of the samples)   (3.6)

The LS estimator can be computed in O(N²) time with the brute-force approach of computing all pairs.
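As an illustration of Eqs. (3.3)–(3.6), a brute-force numpy/SciPy sketch could read as follows (our own function; pdist and cdist perform the O(N²) pair enumeration, and empty RR bins will produce undefined values):

import numpy as np
from scipy.spatial.distance import cdist, pdist

def ls_estimator(data, random, bins):
    """Landy-Szalay xi(r) for data and random catalogues given as (N, 3) arrays."""
    nd, nr = len(data), len(random)
    dd, _ = np.histogram(pdist(data), bins=bins)                   # dd(r)
    rr, _ = np.histogram(pdist(random), bins=bins)                 # rr(r)
    dr, _ = np.histogram(cdist(data, random).ravel(), bins=bins)   # dr(r)
    DD = dd / (nd * (nd - 1) / 2.0)   # Eq. (3.4)
    RR = rr / (nr * (nr - 1) / 2.0)   # Eq. (3.5)
    DR = dr / (nd * nr)               # Eq. (3.6)
    return (DD - 2.0 * DR + RR) / RR  # Eq. (3.3)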

3.1.3 Other measures

The 2PCF is merely the first descriptor, the lowest order of the infinite set of N-point correlation functions (NPCFs) that describe morphology. For example, the three-point correlation function (3PCF) takes into consideration all possible triples (instead of pairs) of objects in the sample.

Another clustering measure is the counts-in-cells probability P(N, r), defined as the probability that exactly N galaxies are contained in a random sphere of radius r. It is also related (for N = 0) to the void probability function and to the higher-order n-point correlation functions through the moments of counts-in-cells (Szapudi et al., 1999). A sketch of its estimation is given below.
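For instance, P(N, r) can be estimated by dropping random spheres into the sample volume and histogramming their occupancy; this sketch (our own, assuming a cubic box of side box with no boundary correction beyond keeping spheres inside) uses a k-d tree for the radius queries:

import numpy as np
from scipy.spatial import cKDTree

def counts_in_cells(points, r, n_spheres, box, rng=None):
    """Estimate P(N, r) as the histogram of counts in random spheres of radius r."""
    rng = rng or np.random.default_rng()
    tree = cKDTree(points)
    # Draw sphere centres far enough from the walls that the spheres fit in the box.
    centres = rng.uniform(r, box - r, size=(n_spheres, points.shape[1]))
    counts = np.array([len(tree.query_ball_point(c, r)) for c in centres])
    return np.bincount(counts) / n_spheres  # P(N, r) for N = 0, 1, 2, ...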

3.1.4 Computational issues

Many of the algorithms used in the spatial statistical analysis in Cosmology involve maximum likelihood estimators that are computationally expensive, as they require matrix inversions. Also, the brute-force method for computing the n-point correlation function involves the computation of

N(N − 1) · · · (N − n + 1)

combinations for the N points of a field, leading to a O(Nⁿ) time complexity.

However, the growth of computational power does not match the growth of data.

Szalay & Takahiko (2003) discuss the importance of O(N log N) algorithms and their hypothesis that slower algorithms will become impractical in the immediate future. Use of clever data structures like hierarchy trees (Zhang & Pen, 2005) and Fast Fourier Transforms can speed the computation up to complexities O(Nᵃ logᵇ N), where a ∈ ℝ, b ∈ ℕ, and 1 ≤ a, b < n, without avoiding the usual trade-off between speed and accuracy (Moore et al., 2001).