A user – friendly M inimum S panning T ree

The "Minimum Spanning Tree" graph theoretic technique addresses some of these problems since an MST depends on the correlation functions of all ranks. The minimum spanning tree technique is used by many scientific disciplines for research or as part of technological solutions.

Thesis statement

Objectives

Contributions

Outline

A brief history
Large–scale structure of the universe
Redshift space and distance measures
Numerical Cosmology

The time evolution of the metric introduces H(z), the Hubble parameter as a function of redshift. Due to the lack of a theory of everything (Laughlin & Pines, 2000), the parameters involved in the evolution of the universe cannot be predicted.

Figure 2.1 Distribution of galaxies in redshift space from the first CfA survey (Davis et al., 1983)

Graph theory

Historical overview

Definitions & examples

For example, the handshake of the members of the social network in Figure 2.4 could be represented by the graph in Figure 2.7. Note that the spanning tree need not be a subgraph of the graph it spans, as it may contain edges not found in the latter.

Computational Graph Theory and Geometry

Kruskal’s algorithm
Prim’s algorithm
Comparison of the MST algorithms
Delaunay Triangulation
Definition
Estimation
Other measures
Computational issues

Kruskal's algorithm (Kruskal,1956) starts with a forest containing all vertices of the input graph, but no edges. Consequently, the time complexity is ElogE and due to the fact that the number of edges cannot exceed 12V (V −1).

Use of MSTs in Cosmology

Delaunay Triangulation and MST

DEUSS

In § 2.3 we analyzed the computational complexity of Kruskal and Prim's algorithms and discussed the optimization possibilities offered by Delaunay triangulation. In the limited time frame of programming, debugging and applying Morava-Pack, the options for algorithms and data structure were inevitably limited.

The logic behind the library

Programming paradigm
Representation of graphs, trees and forests
Input / Output
Conventions

The resulting MST or MSF necessarily contains all the vertices of the original graph and a subset of the edges. MST represents a Minimum Spanning Tree, independent of the original graph, that contains all information: vertices and edges.

Implementation of Kruskal’s algorithm

Pseudocode of our implementation

The size of KFis mentioned in the sense that the current size of a vector is also the index of a newly added element. Output: IndexForestF, so Fi is the list of edge indices (inE2) for the tenth tree in the minimum-spanning forest. Input:EdgesEwhere eKF indices refer to Input:Edges_includedfrom Kruskal algorithm Output:Edges in forestE2 ⊆E.

Implementation of Prim’s algorithm

Pseudocode

The pseudocode for our implementation of Prim's algorithm for complete geometric graphs is listed in Algorithm 4x4.

Implementation of operators

Pruning a tree
Pruning a MSF
Separating a MST
Separating a MSF

Start from the first edges of the old list of edges and proceed: if the endpoints of the i-th edge are both to be kept, add the edge to the new list of edges. Although the indices of the endpoint of each attached edge still refer to the old top list. Create an array holding the degrees of the vertices in the split tree, initialized to 0.

Delaunay Triangulation

Statistics

Decriptive statistics
Fast bootstrap method to estimate the SD of median
Pseudo–random number generation
Tree curvature / compactness

In the following sections, an application to the standard error of the median is presented. The sample standard deviation of these is an estimate of the standard error of the median of the original population. Find the smaller of the two values, a, by asking Quickselect to return the bN2c–th element.

Cosmological distance calculations

Numerical integration
Interpolant
Interpolation
Procedures
Results
Discussion

Input : a vectorf of N equal samples of a function Output : The integral of the function over the given range. The result :f vector holds the function evaluations that were used for the last integration. The points are about half (n2 + 1) of the evaluated points outside (n) due to the requirements of Simpson's Rule (even number of intervals.) The latter need not be evaluated as they are already calculated by Algorithm 4.14.

Figure 4.2 Quadratic interpolation using parabolas (blue) is more accurate than linear interpolation (red) in approximating a curve (black.)

Fast Bootstrap Median

Procedures

To avoid pathological cases of input and to increase the resolution of timing (for more details, §5.1.1), we measured the time required to process 30 different samples for each input length. By summing the previous values minus one, we can detect the difference due to the even-odd effect, since the input length effect would be negligible for such small lengths. Briefly, we do not provide all the plots for B = 50, as the results were similar to those for B = 1000, except for the errors which were slightly larger.

Results

Error bars were magnified 10-fold and are shown in pairs due to the adjacent input lengths.

Figure 5.6 Execution times for naïve approach (sort), Quickselect algorithm and FBM for 1000 resamples

Discussion

Distance Calculator

Procedures

The agreement of the DistanceCalcand DEUS-LCDM distance data was verified by producing the motion distance function from the redshift of the halo objects vs.

Figure 5.10 The cosmological distances computed by our algorithm for h = 0.74, and Ω m = 0.26.

Results

Due to the limited accuracy of the redshift values in the DEUSS data, there were objects with the same redshift but different distances. By sorting the redshifts, finding objects with the same redshift, and averaging their distances, we created a unique and interpolable set of z−dC pairs.

Figure 5.11 The relative difference of comoving distance computed by our algorithm using either linear or quadratic interpolation, and the one from Wright (2006).

Discussion

The first trend is clearly explained by the fact that the distances compared were 3 to 4 orders of magnitude higher than the accuracy (0.1Mpc) of CosmoCalc (Table 5.2.) The higher the redshift/distance, the higher the error in the relative difference. As we mentioned in the description of the experiment, the DEUSS catalog presented many objects with the same redshift and different distance. When we averaged the distances to create diez –dC interpolation, the «rounding» error in the catalog propagated to the RMSE.

Figure 5.13 Frequency of redshift groups of by the number of objects they contain.

DEUSS MSTs

Procedures

Results

Discussion

In this chapter, we will connect the main points of the thesis and parts of the discussions on numerical experiments (§5.4.3). Due to the limited time frame of the study, the use of cosmological data was not as broad as we would have liked. DistanceCalc was found to be consistent with the CosmoCalc and DEUS catalogs (§ 5.3) in terms of the accuracy level of the latter.

Future work

MoravaPack

The original Prim's algorithm must be available (currently only complete graphs are supported.) Also, choice for the data structure must be added, to suit the user's needs and availability of computing resources.

MoravaGUI

The best programs are written so that computing machines can execute them quickly and that humans can clearly understand them.

Introduction

OpenMP

TetGen

Morava files

Core header file ( mrv_core.h )

Variable types ( mrv_types.h )

In languages with zero-based arrays (such as C, C++, Java, Lisp), a function that searches an array and returns the index of the first occurrence of the key is a legal value. Single argument operators take the index of the coordinate we want to access: 0,1,2forx, y, z respectively. Parentheses operator refers to the index of the ends when the argument is 0 or 1 (from endpoints, respectively) or to the weight when not.

Input / output ( mrv_io.h )

3 Returns false if more than one graph is stored in the file, if count.

Timing ( mrv_time.h)

Built–in metric functions ( mrv_metrics.h )

EuclideanLength returns the Euclidean length (or distance) dE between two points which is equal to the L2 norm (or the Euclidean norm) returned by .

Unweighted graphs ( mrv_dgraph.h )

In the latter case, the function returns whether it succeeded in opening and writing the file. Again, in the latter case, a boolean is returned to confirm success in opening/reading the file.

Weighted graphs ( mrv_wgraph.h )

ClearVertices deletes all vertices but leaves the edges as they are (giving the user more freedom to create other graph organization paradigms). FillWeights(metric) replaces the weight of each edge with the result of applying the metric function to the endpoints.

Finding MSTs / MSFs ( mrv_msffind.h )

CompleteKruskal(metric) clears pre-existing edges, adds all pairs with the weight calculated using the thematic function, and finally calls Kruskal to produce the MST/MSF. CompletePrim(metric) deletes preexisting edges and finds the MST of the complete graph (defined by the given vertices.) The MST edges are preserved. KruskalMST(G, f) finds the MST of the complete graph with the vertices of the Gusing CompleteKruskal algorithm.

MST representation ( mrv_mst.h )

Separate(l, b) separates the tree by cutting edges heavier/longer than l.b is a boolean argument that sets whether it is an absolute length or a parameter multiplied by the mean length of the edges.

MSF representation ( mrv_msf.h )

This allows the user to load/save multiple MSTs in the same file by calling these methods. CombineTrees returns the weighted graph object (see §A.9) with the vertices and edges of all the trees in the forest. If an absolute (b=false) length is given, the separation operator is applied to all trees in the forest.

Delaunay triangulation ( mrv_dt.h )

Random numbers ( mrv_rand.h)

With less than 3 arguments, returns one number in the ranges .. 0.1)(standard uniform distribution) if there is no argument. The following code creates 1000 random numbers from the normal distribution with mean value 2.3 and standard deviation 0.01 and prints one chosen at random. This example also shows why we implemented a single-argument version of Rand_uInt that returns values less than the argument (zero-based arrays/vectors in C++).

General algorithms ( mrv_algo.h )

For example, it can be used for Edges structure since we accommodated the comparison operators in its definition.

Statistics ( mrv_stats.h )

If the sample mean is known, precomputed, etc., it can be passed as the second argument to avoid recalculation. Median calculates the median of the elements of a real vector using the faster method provided in MoravaPack. The second argument (if not given, false is implied) defines sorts the real vector using std::sortof C++STL and returns the median.

Histograms ( mrv_histogram.h )

Therefore, if the vector is not sorted, a partial rearrangement of the elements will occur. LowerBound(i), Midpoint(i), UpperBound(i) return the lower, middle and higher values in the range of the ith bin. RelativeFrequencies,ProbabilityDensities return vectors containing the relative frequencies and probability density estimates of the bins, respectively.

Cosmological distance calculator ( csm_dist.h )

ΛCDM cosmology by setting the hubble parameter and the cosmological density parameter ΩΛ with the assumption that Ωm = 1−ΩΛandΩr = 0 = Ωk. PrintSummary reports the cosmological parameters, the valid range for interpolation, and the number of points used for standard output (usually the screen).

Haloes catalog manipulation ( csm_halo.h)

Clarify the number of halo objects currently in the catalog. COSMOLOGY-ORIENTED CODE Subset(field, oper, key) takes the subset of the halo catalog that meets a comparison criterion: propertyfieldshould be<,=, >,≤or≥to the value key. PROCESS: Create MSTs, DTs, or MSF from the input, or apply operators to created objects.

Notes

INPUT tab

Check if Complete” informs us whether the graph is complete by confirming that the edge list consists of the expected pairs V(V2−1). Apparently, the speed is achieved by reducing the probability of rejections or equivalently the volume of the cube. The optimal way would be to place any shape so that the major axis of the ellipsoid is parallel to the diagonal of the cube.

PROCESS tab

MORAVAGUI APPLICATION words, we can repeatedly take random points from a cube containing the ellipsoid by treating their coordinates as random variations of uniform distribution, rejecting those outside the ellipsoid. A simple but not optimal way to do this is to fit the ellipsoid into a cube with its faces parallel to the coordinate system axes and sides equal to twice the maximum axis of the ellipsoid.

OUTPUT tab

Greedy MST
Prim’s algorithm
KF2F algorithm
MaintainKF algorithm
CompletePrim algorithm
Kahan’s compensated summation
Åke Björck’s modified two–pass algorithm
Quickselect algorithm
Median with sort
Median with QuickSelect
Fast Bootstrap for Median’s SE
Discrete uniform rejection method
Marsaglia Polar Method
Simpson’s Rule
Simpson’s Rule with Convergence Check
Make Interpolation
Quadratic Interpolation
Time complexity of Prim’s algorithm
Computational complexities of MST algorithms
Properties of DEUSS cosmological models
Selected MST algorithms for MoravaPack
Median examples
Average execution times of MST algorithms
Cosmic distance calculations
The DEUSS subcatalogs
Two sample Kolmogorov–Smirnov tests
LCDM-1 and RPCDM-1 statistics
LCDM-1-5 and RPCDM-1-5 statistics
Distribution of galaxies (CfA)
Distribution of galaxies (2dF)
Map of local large-scale structures
Example of a simple graph
Example of a directed graph
Example of a weighted graph
Example of a complete graph
Example of a spanning tree
Example of a minimum spanning tree
Example of a spanning forest
Example of pruning and separation operations
Black and white slice from a DEUSS simulation
Detecting a branch
Linear vs. Quadratic interpolation
Average execution times of MST algorithms (log–log)
Average time complexity of Kruskal
Avarage time complexity of DT + Kruskal
Avarage time complexity of Prim
Bootstrap median algorithms comparison (semilog)
Bootstrap median algorithms comparison (log–log)
Bootstrap median algorithms speedup (1000 resamples)
Bootstrap median algorithms speedup (50 resamples)
DistanceCalc output
Comparison of cosmic distances
DEUSS distance verification
Histogram of redshift groups
Histogram of redshift groups’ standard deviations
The MSTs of LCDM-3 and RPCDM-3 catalogs
LCDM-1 and RPCDM-1 edge length distributions
LCDM-1-5 and RPCDM-1-5 edge length distributions

The scene is projected on a plane perpendicular to the line joining the observer (behind the screen) and the center of gravity of the points. By pressing «Statistics», the user is prompted for the number of bootstrap samples to be taken to estimate the standard error of the descriptive statistics (see §4.7.2.) After all calculation, a report like the one in Figure C.7 is displayed. Parallel Computation of Median and Order Statistics on GPUs with Applications to Robust Regression.

Figure C.4 The OUTPUT tab allows saving, plotting and statistical analyses of created objects

MoravaGUI’s INPUT tab

Dialog for adding random points

MoravaGUI’s PROCESS tab

MoravaGUI’s OUTPUT tab

MoravaGUI’s 2D plots

MoravaGUI’s 3D plots

MoravaGUI’s statistics report