

Algorithm 2.3: Prim's algorithm

Output: minimum spanning tree T(S, A)

1   foreach u ∈ V do
2       c_u ← +∞
3       e_u ← FLAG
4   initialize T: S, A ← ∅
5   initialize the set of vertices not yet included in T: Q ← V
6   while Q ≠ ∅ do
7       find the vertex u ∈ Q having the minimum key value c_u
8       move u from Q to S: S ← S ∪ {u}, Q ← Q \ {u}
9       if e_u ≠ FLAG then A ← A ∪ {e_u}
10      foreach v ∈ Q connected to u do
11          w ← weight of the edge (u, v)
12          if w < c_v then
13              c_v ← w
14              e_v ← (u, v)
15  return T


2.3. Computational Graph Theory and Geometry

2.3.4 Delaunay Triangulation

Given N points in the plane, drawing line segments between pairs of them such that no two segments cross, until no more segments can be drawn, we arrive at a triangulation of the set of points.

There are many distinct triangulations for the same set of points. A Delaunay triangulation is the one that best avoids small angles and long sides (Press et al., 2007):

Among all triangulations of a set of points, a Delaunay triangulation has the largest minimum angles, as defined by sorting all angles from smallest to largest and comparing lexicographically to the angles of any other triangulation.

The same principle can be extended to more than two dimensions. For example, for points in 3D space there is a Delaunay triangulation, also called a Delaunay tetrahedralization, as the segments are organized into tetrahedra.

We include the DT in our study because the MST of a point set in Euclidean space can be extracted from its DT faster than by running MST algorithms directly, by exploiting the metric of the space (§3.3). There are various DT algorithms, but for generic applications the computational complexity of the most efficient algorithms is O(V log V) in 2D and 3D space.

The art of progress is to preserve order amid change and to preserve change amid order.

Alfred North Whitehead, "Process and Reality: An Essay in Cosmology" (1929)

3

Related work

Outline of chapter

3.1 Two–point correlation function
3.2 Use of MSTs in Cosmology
3.3 Delaunay Triangulation and MST
3.4 DEUSS


In this chapter we present the two–point correlation function (2PCF), a statistical tool originating in the field of Crystallography, and its application to cosmological data.

However, its failure to distinguish differences in the clustering of matter between simulations with different initial conditions required the use of the higher–order N–point correlation functions (NPCFs).

But NPCFs are far from a panacea for clustering problems. Besides being computationally intensive, as N. K. Bose discusses in Coles (1992),

[...] correlation functions are insensitive to visually different patterns and morphology.

In the next sections we present how minimum spanning trees can address these issues. Furthermore, the MST, computed in O(V² log V) time at worst, presents itself as a valuable shortcut around the heavy NPCFs.

3.1 Two–point correlation function

3.1.1 Definition

The organization of matter in the universe on large scales and the evolution of structures are connected to a wide range of processes, from the early universe (quantum perturbations) up to recent times (mostly gravitation and dark energy).

Naturally, various clustering techniques have been applied. The mainstream approach is to use point field statistics on samples of concentrations of matter (galaxies, dark haloes, clusters).

The observation that the Universe is homogeneous on very large scales led to an approach of identifying the clustering of matter as fluctuations superposed on a mean field of the universe (Jones et al., 2004). Consequently, the statistical analysis of the spatial distribution of galaxies often involves the two–point correlation function (2PCF), ξ(r), which quantifies the clustering of galaxies with the following infinitesimal interpretation (Martínez & Saar, 2003):

dP12 = n̄² [1 + ξ(r)] dV1 dV2    (3.1)

where dP12 is the joint probability that at least one galaxy lies in each of the two infinitesimal volumes dV1 and dV2, r is the separation vector of the latter, and n̄ is the mean number density or intensity. The correlation function is a measure of excess probability relative to a Poisson distribution.

A similar quantity, used for projected catalogues, is the angular two–point correlation function, w(θ), defined by

dP = N [1 + w(θ)] dΩ    (3.2)

where dP is the conditional probability of a galaxy residing within the solid angle dΩ at an angular distance θ from an arbitrarily chosen galaxy, and N is the mean number density of galaxies per unit area in the projection.

The use of the 2PCF was crucial in showing that the universe is homogeneous for r > 100 h⁻¹ Mpc and that it presents a fractal structure (with dimension close to 2) for r < 20 h⁻¹ Mpc. A complete review of correlation functions in cosmology and their interpretation can be found in Jones et al. (2004).

3.1.2 Estimation

Under the assumption of a homogeneous and isotropic distribution of galaxies (invariant under translations and rotations), the probability depends only on the magnitude r = |r| of the separation vector.

The most commonly used estimator for ξ(r) is the LS estimator (Landy & Szalay, 1993),

ξ(r) = (DD − 2DR + RR) / RR    (3.3)

where the three quantities are normalized counts of galaxy pairs at binned separation r for two catalogues: (i) the observed data with n_d galaxies and (ii) a random distribution of n_r galaxies with the same mean density and geometry as the former:

DD(r) = dd(r) / [n_d(n_d − 1)/2]    (data sample)    (3.4)

RR(r) = rr(r) / [n_r(n_r − 1)/2]    (random sample)    (3.5)

DR(r) = dr(r) / (n_r n_d)    (cross–correlation of the samples)    (3.6)

The LS estimator can be computed in O(N²) time with the brute–force approach of computing all pairs.

3.1.3 Other measures

The 2PCF is merely the first descriptor, the lowest order of the infinite set of N–point correlation functions (NPCFs) that describe morphology. For example, the three–point correlation function (3PCF) takes into consideration all possible triples (instead of pairs) of objects in the sample.

Another clustering measure is the counts–in–cells probability P(N, r), defined as the probability that exactly N galaxies are contained in a random sphere of radius r. It is also related (for N = 0) to the void probability function, and to the higher–order N–point correlation functions through the moments of counts–in–cells (Szapudi et al., 1999).

3.1.4 Computational issues

Many of the algorithms used in spatial statistical analysis in Cosmology involve maximum likelihood estimators that are computationally expensive, as they require matrix inversions. Also, the brute–force method for computing the n–point correlation function involves the computation of

N(N − 1) · · · (N − n + 1)

combinations for the N points of a field, leading to an O(Nⁿ) time complexity.

However, the growth of computational power does not match the growth of data.

Szalay & Takahiko (2003) discuss the importance of O(N log N) algorithms and hypothesize that slower algorithms will become impractical in the immediate future. The use of clever data structures like hierarchy trees (Zhang & Pen, 2005) and Fast Fourier Transforms can speed the computation up to complexities O(N^a log^b N), where a ∈ ℝ, b ∈ ℕ and 1 ≤ a, b < n, without avoiding the usual trade–off between speed and accuracy (Moore et al., 2001).

3.2 Use of MSTs in Cosmology

(a) The starting minimum spanning tree.

(b) Pruned at level 2. Six branches with two or fewer nodes were cut.

(c) Separated by cutting edges with length greater than 1.5 times the mean length.

Figure 3.1 Example of pruning and separation operators applied to an MST.


3.3 Delaunay Triangulation and MST

As we mentioned before, two more techniques are used in the analysis of cosmological point sets for their own merits: the Delaunay Triangulation (§2.3.4) and its dual graph, the Voronoi Diagram (van de Weygaert, 2003).

In this study we exploit Theorems 3.1 and 3.2, proved by Toussaint (Toussaint, 1980; Jaromczyk & Toussaint, 1992) and McMullen (McMullen, 1970) respectively:

Theorem 3.1: The edges of a minimum spanning tree of a set of points in Euclidean space are included in the edges of their Delaunay triangulation.

Thus, finding the MST for Euclidean graphs can be implemented as such:

1. For the vertices V, find the DT
2. Extract the edges E of the DT
3. Find the MST of (V, E)

Theorem 3.2 shows why this «pipeline» is computationally efficient and presents a great optimization opportunity.

Theorem 3.2: The number of simplices in the Delaunay triangulation of n points in dimension d is at most

C(n − ⌊(d+1)/2⌋, n − d) + C(n − ⌊(d+2)/2⌋, n − d)    (3.7)

where C(a, b) denotes the binomial coefficient.

It is rather simple to prove that in a 3D triangulation, E = O(V):

A simplex in dimension d is the convex hull of d + 1 vertices. Its edges connect each vertex to every other. Thus, the number of edges is

C(d + 1, 2) = d(d + 1)/2    (3.8)

The number of edges E in a Delaunay triangulation of V vertices in dimension d is less than the total number of edges over all the simplices forming it, because most of the edges are shared between simplices. Consequently,

E < [d(d + 1)/2] [ C(V − ⌊(d+1)/2⌋, V − d) + C(V − ⌊(d+2)/2⌋, V − d) ]    (3.9)

For d = 3, we arrive at the fact that

E < 12(V − 2)  ⇒  E = O(V)    (3.10)

Therefore, for 3D space, we can find the O(V) edges of the DT in O(V log V) time and then use them, instead of all O(V²) possible pairs, in an O(E log E) MST algorithm like Kruskal's. We thus arrive at the MST of the graph in O(V log V) time.

3.4 DEUSS

Our data sets are two catalogs of Dark Matter Halos provided by the DEUS Consortium. The numerical simulations from which the data were extracted were part of the Dark Energy Universe Simulation Series and numbered 2048³ particles confined in a 2592 h⁻¹ Mpc comoving box. Each set corresponds to a different «cosmology». Table 3.1 summarizes their properties. For more information consult Rasera et al. (2010).

Cosmology     ΛCDM (LCDM)    Ratra–Peebles (RPCDM)
Parameters    WMAP–5         Within 1σ from WMAP–5 & 7
h             0.72           0.72
Ω_m           0.26           0.23
Ω_Λ           0.74           0.77
Ω_b           0.044          0.044
σ_8           0.79           0.66
n_s           0.963          0.963

Table 3.1 Properties of the cosmological models of the DEUSS simulations from which our data were extracted.

Dark Energy Universe Simulation:http://www.deus-consortium.org/

Both catalogs cover the full sky and resolve all dark matter halos with minimum mass 10¹³ M⊙/h and redshift z ≲ 0.635, limited by the box length. Each halo catalog is described by 7 numerical fields:

Field       Description

Halo–ID     An integer for indexing.

M_FoF       Mass in M⊙ of the halo as a group of particles, joined using the Friends–of–Friends algorithm (Davis et al., 1985).

z_obs       Observed redshift.

φ           Azimuthal angle (degrees).

θ           Polar angle (degrees).

v_CDM       Particles' velocity according to the Cold Dark Matter model.

Distance    Comoving distance in h⁻¹ Mpc.


Figure 3.2 Black and white slice from a DEUSS simulation, depicting large–scale structure.

If a man will begin with certainties, he will end in doubts;

but if he will be content to begin with doubts, he will end in certainties.

Francis Bacon, “Advancement of Learning”

4

Implementation methodology

Outline of chapter

4.1 Selection of algorithms
4.2 The logic behind the library
4.3 Implementation of Kruskal's algorithm
4.4 Implementation of Prim's algorithm
4.5 Implementation of operators
4.6 Delaunay Triangulation
4.7 Statistics
4.8 Cosmological distance calculations


4.1 Selection of algorithms

In §2.3 we analyzed the computational complexity of Kruskal's and Prim's algorithms and discussed the optimization opportunity offered by Delaunay triangulation.

In the limited time frame for programming, debugging and applying MoravaPack, the options for algorithms and data structures were inevitably restricted. Still, a careful selection was made with the common applications of the library in mind.

From Table 2.2 and §3.3 we can see that:

• for sparse graphs, Prim's algorithm using a binary or Fibonacci heap, and Kruskal's algorithm, are optimal;

• for dense graphs, Prim's algorithm employing an adjacency matrix is the best solution;

• for dense graphs in Euclidean space, Delaunay triangulation produces a sparse graph to be subsequently used as input to an MST algorithm, offering a solution in O(V log V) time, otherwise unreachable by generic MST algorithms.

Furthermore, complete graphs dominate the clustering and spatial analysis of geometrical graphs: criteria for choosing certain edges out of all possible connections between points are rarely given when studying point distributions from observations or N–body simulations. An efficient treatment of complete graphs is therefore crucial.

As the implementation of binary and Fibonacci heaps is difficult, considerably increasing development time, we settled on the following subset of algorithms and data structures:

Sparse graphs            Kruskal's algorithm                            O(V log V)
Complete graphs          Prim's algorithm using an adjacency matrix     O(V²)
Complete graphs in E³    Delaunay triangulation + Kruskal's algorithm   O(V log V)

Table 4.1 Selected MST algorithms for MoravaPack for various cases of graphs.


4.2 The logic behind the library

• extraction of all weights (lengths) of edges (e.g. for statistics)

G.V[i]works as for a disconnected graph (vertices)

G.E[i]fori= 0,1,· · ·nE−1refers to thei-th edge of a weighted graph

GwithnEvertices.

MSFfind – a Minimum Spanning Forest/Tree finder class, encapsulating the Minimum Spanning Tree algorithms and all the necessary helper data structures, variables and functions.

An MSFfind object M is 'fed' a DisconnectedGraph or a WeightedGraph object, by copying the vertices and edges into the arrays M.V and M.E respectively.

Then it can be asked to produce an MSF or an MST using Kruskal's or Prim's algorithm. The resulting MST or MSF necessarily contains all the vertices of the original graph and a subset of the edges. Thus, it would be wasteful and redundant to represent the structure by lists of vertices and edges. Instead, we only keep the indices of the included edges in M.E:

• Each tree of the MSF is represented by a list of indices of edges, stored in a vector of non–negative integers, a type we call Indices.

• Thus, the forest as a whole is a vector of Indices, a type we call IndexForest. Because the IndexForest is a vector of vectors, it is not a rectangular array of indices: the Indices it contains may not all be of the same length, saving time and memory.

After the procedure is done, the user has full control of the resulting structure. If it is an MST, they can extract it from M into an MST object (defined below). Otherwise, if it was impossible to get one single tree from the original graph, the user can extract an MSF object containing all the trees, or request a specific tree (by its index) to be encoded in an MST object.

This is why we designed this class to provide methods for

• Setting the graph

• Producing the MSF/MST through Kruskal’s and Prim’s algorithms

• Extracting MST and MSF objects from the structure (the latter if the graph is not connected and consequently an MST is impossible)


• Getting the number of trees, so that the user's code can decide what to extract in each case (MST/MSF)

• I/O on screen and in files

MST – represents a Minimum Spanning Tree, independently of the original graph, holding all the information: vertices and edges. Of course, the standard I/O methods accompany the MST–specific operations:

• Separation operator, taking the separation length as an argument, with a switch between absolute units and mean lengths

• Pruning operators taking one argument: the pruning level

• Extraction of all edge weights

MSF – represents a Minimum Spanning Forest, providing all MST features plus the ability to extract any tree of the forest.

Extensive use of vectors

One can notice something rare for contemporary algorithm implementations: there is no use of data structures such as sets, linked lists, binary trees, heaps, multisets, etc. Only one–dimensional dynamic arrays are used, implemented for us by the C++ standard library vector class.

This choice was initially made to avoid the memory overhead required by pointers on 64–bit architectures. The library was designed to work for hundreds of thousands of vertices within the typical memory of contemporary computers (several GBs). There are also some positive consequences of this choice, as the library:

• can be easily translated to non–object–oriented languages with limited ready–made implementations of more complex data structures

• can be analyzed with respect to its memory and time complexity directly from the code, without requiring details of the implementation of the STL by specific compilers

• avoids 'surprises' from compiler optimizations that have the opposite effect (on size or speed)

For example, vertices and edges in graphs are sets, so std::set would be appropriate. We chose to use std::vector in order to avoid computational [...]

[...] in an IndexForest containing the minimum required information: which edges of the original graph are included in the i-th tree.

The logic behind Kruskal's algorithm (§2.3.1) is to grow a forest by combining trees in each iteration. Thus we require a representation of each tree. We do not have to refer to the coordinates of each vertex or the weight of each included edge again. Instead, we save time and space when accessing and merging the trees by encoding their components (vertices and edges) by their indices in the original graph:

KruskalTree – a structure containing two arrays of integers: (i) the indices of the included vertices and (ii) the indices of the edges (in the original graph) connecting the above vertices.

KruskalForest – a vector of KruskalTrees. The algorithm maintains one instance of this type, named Trees.

Finding vertices. When the algorithm checks whether an edge is safe to be included, its endpoints have to be located (i.e., in which KruskalTree each one is found) so that the proper decision can be made:

1. If both endpoints are already included, then

   • if they belong to the same tree, the edge must be avoided, as it would create a cycle;

   • else (the endpoints are found in different trees), the inclusion of the edge dictates that the trees be merged. We decided to merge the tree of the second endpoint into the tree of the first, as there is no preferable choice.

2. If only one endpoint is already included, then the edge and the other vertex should be included in the corresponding tree.

3. If neither endpoint was included, then both of them and the edge become the components of a new tree that will be merged later (if possible, of course).

Every check for a safe edge requires the above procedure for locating the endpoints in the current state of the forest. The usual approach is to use disjoint–set linked lists or other data structures that either require O(log N) time or extra memory for storing pointers. We try a different, 'old–school' approach for the reasons explained in §4.2.2: we use an array vertex_where to keep track of [...]

(The flag value for vertices not yet included avoids (i) 0, as C++ arrays are zero–indexed and it would be pointless to convert each index by adding or subtracting 1 each time, and (ii) −1, as it would require signed integers, cutting the set of possible values in half just for one flag value.)

Our Kruskal's algorithm implementation is sketched in Algorithm 4.1. Note that:


• IndexForest and KruskalForest refer to data types.

• The size of KF is used in the sense that the current size of a vector is also the index of a newly appended element: our array/vector notation is zero–indexed, so the index of a newly appended item is always equal to the previous size of the array/vector.

• itree and jtree are indices of trees in the Kruskal forest KF.

The algorithm is complemented by two functions, Algorithms 4.2 and 4.3, of no graph–theoretic value but crucial to the data structure we designed for our implementation.

4.3. Implementation of Kruskal’s algorithm

Algorithm 4.1: Kruskal's algorithm