• Nenhum resultado encontrado

A user – friendly M inimum S panning T ree

N/A
N/A
Protected

Academic year: 2024

Share "A user – friendly M inimum S panning T ree"

Copied!
204
0
0

Texto

The "Minimum Spanning Tree" graph theoretic technique addresses some of these problems since an MST depends on the correlation functions of all ranks. The minimum spanning tree technique is used by many scientific disciplines for research or as part of technological solutions.

Thesis statement

Objectives

Contributions

Outline

  • A brief history
  • Large–scale structure of the universe
  • Redshift space and distance measures
  • Numerical Cosmology

The time evolution of the metric introduces H(z), the Hubble parameter as a function of redshift. Due to the lack of a theory of everything (Laughlin & Pines, 2000), the parameters involved in the evolution of the universe cannot be predicted.

Figure 2.1 Distribution of galaxies in redshift space from the first CfA survey (Davis et al., 1983)
Figure 2.1 Distribution of galaxies in redshift space from the first CfA survey (Davis et al., 1983)

Graph theory

Historical overview

Definitions & examples

For example, the handshake of the members of the social network in Figure 2.4 could be represented by the graph in Figure 2.7. Note that the spanning tree need not be a subgraph of the graph it spans, as it may contain edges not found in the latter.

2.2. Graph theory
2.2. Graph theory

Computational Graph Theory and Geometry

  • Kruskal’s algorithm
  • Prim’s algorithm
  • Comparison of the MST algorithms
  • Delaunay Triangulation
  • Definition
  • Estimation
  • Other measures
  • Computational issues

Kruskal's algorithm (Kruskal,1956) starts with a forest containing all vertices of the input graph, but no edges. Consequently, the time complexity is ElogE and due to the fact that the number of edges cannot exceed 12V (V −1).

Figure 2.10 A spanning forest
Figure 2.10 A spanning forest

Use of MSTs in Cosmology

Delaunay Triangulation and MST

DEUSS

In § 2.3 we analyzed the computational complexity of Kruskal and Prim's algorithms and discussed the optimization possibilities offered by Delaunay triangulation. In the limited time frame of programming, debugging and applying Morava-Pack, the options for algorithms and data structure were inevitably limited.

Figure 3.2 Black and white slice from a DEUSS simulation, depicting large–scale structure.
Figure 3.2 Black and white slice from a DEUSS simulation, depicting large–scale structure.

The logic behind the library

  • Programming paradigm
  • Representation of graphs, trees and forests
  • Input / Output
  • Conventions

The resulting MST or MSF necessarily contains all the vertices of the original graph and a subset of the edges. MST represents a Minimum Spanning Tree, independent of the original graph, that contains all information: vertices and edges.

Implementation of Kruskal’s algorithm

Pseudocode of our implementation

The size of KFis mentioned in the sense that the current size of a vector is also the index of a newly added element. Output: IndexForestF, so Fi is the list of edge indices (inE2) for the tenth tree in the minimum-spanning forest. Input:EdgesEwhere eKF indices refer to Input:Edges_includedfrom Kruskal algorithm Output:Edges in forestE2 ⊆E.

Implementation of Prim’s algorithm

Pseudocode

The pseudocode for our implementation of Prim's algorithm for complete geometric graphs is listed in Algorithm 4x4.

Implementation of operators

  • Pruning a tree
  • Pruning a MSF
  • Separating a MST
  • Separating a MSF

Start from the first edges of the old list of edges and proceed: if the endpoints of the i-th edge are both to be kept, add the edge to the new list of edges. Although the indices of the endpoint of each attached edge still refer to the old top list. Create an array holding the degrees of the vertices in the split tree, initialized to 0.

Delaunay Triangulation

Statistics

  • Decriptive statistics
  • Fast bootstrap method to estimate the SD of median
  • Pseudo–random number generation
  • Tree curvature / compactness

In the following sections, an application to the standard error of the median is presented. The sample standard deviation of these is an estimate of the standard error of the median of the original population. Find the smaller of the two values, a, by asking Quickselect to return the bN2c–th element.

Cosmological distance calculations

  • Numerical integration
  • Interpolant
  • Interpolation
  • Procedures
  • Results
  • Discussion

Input : a vectorf of N equal samples of a function Output : The integral of the function over the given range. The result :f vector holds the function evaluations that were used for the last integration. The points are about half (n2 + 1) of the evaluated points outside (n) due to the requirements of Simpson's Rule (even number of intervals.) The latter need not be evaluated as they are already calculated by Algorithm 4.14.

Figure 4.2 Quadratic interpolation using parabolas (blue) is more accurate than linear interpolation (red) in approximating a curve (black.)
Figure 4.2 Quadratic interpolation using parabolas (blue) is more accurate than linear interpolation (red) in approximating a curve (black.)

Fast Bootstrap Median

Procedures

To avoid pathological cases of input and to increase the resolution of timing (for more details, §5.1.1), we measured the time required to process 30 different samples for each input length. By summing the previous values ​​minus one, we can detect the difference due to the even-odd effect, since the input length effect would be negligible for such small lengths. Briefly, we do not provide all the plots for B = 50, as the results were similar to those for B = 1000, except for the errors which were slightly larger.

Results

Error bars were magnified 10-fold and are shown in pairs due to the adjacent input lengths.

Figure 5.6 Execution times for naïve approach (sort), Quickselect algorithm and FBM for 1000 resamples
Figure 5.6 Execution times for naïve approach (sort), Quickselect algorithm and FBM for 1000 resamples

Discussion

Distance Calculator

Procedures

The agreement of the DistanceCalcand DEUS-LCDM distance data was verified by producing the motion distance function from the redshift of the halo objects vs.

Figure 5.10 The cosmological distances computed by our algorithm for h = 0.74, and Ω m = 0.26.
Figure 5.10 The cosmological distances computed by our algorithm for h = 0.74, and Ω m = 0.26.

Results

Due to the limited accuracy of the redshift values ​​in the DEUSS data, there were objects with the same redshift but different distances. By sorting the redshifts, finding objects with the same redshift, and averaging their distances, we created a unique and interpolable set of z−dC pairs.

Figure 5.11 The relative difference of comoving distance computed by our algorithm using either linear or quadratic interpolation, and the one from Wright (2006).
Figure 5.11 The relative difference of comoving distance computed by our algorithm using either linear or quadratic interpolation, and the one from Wright (2006).

Discussion

The first trend is clearly explained by the fact that the distances compared were 3 to 4 orders of magnitude higher than the accuracy (0.1Mpc) of CosmoCalc (Table 5.2.) The higher the redshift/distance, the higher the error in the relative difference. As we mentioned in the description of the experiment, the DEUSS catalog presented many objects with the same redshift and different distance. When we averaged the distances to create diez –dC interpolation, the «rounding» error in the catalog propagated to the RMSE.

Figure 5.13 Frequency of redshift groups of by the number of objects they contain.
Figure 5.13 Frequency of redshift groups of by the number of objects they contain.

DEUSS MSTs

Procedures

Results

Discussion

In this chapter, we will connect the main points of the thesis and parts of the discussions on numerical experiments (§5.4.3). Due to the limited time frame of the study, the use of cosmological data was not as broad as we would have liked. DistanceCalc was found to be consistent with the CosmoCalc and DEUS catalogs (§ 5.3) in terms of the accuracy level of the latter.

Future work

MoravaPack

The original Prim's algorithm must be available (currently only complete graphs are supported.) Also, choice for the data structure must be added, to suit the user's needs and availability of computing resources.

MoravaGUI

The best programs are written so that computing machines can execute them quickly and that humans can clearly understand them.

Introduction

OpenMP

TetGen

Morava files

Core header file ( mrv_core.h )

Variable types ( mrv_types.h )

In languages ​​with zero-based arrays (such as C, C++, Java, Lisp), a function that searches an array and returns the index of the first occurrence of the key is a legal value. Single argument operators take the index of the coordinate we want to access: 0,1,2forx, y, z respectively. Parentheses operator refers to the index of the ends when the argument is 0 or 1 (from endpoints, respectively) or to the weight when not.

Input / output ( mrv_io.h )

3 Returns false if more than one graph is stored in the file, if count.

Timing ( mrv_time.h)

Built–in metric functions ( mrv_metrics.h )

EuclideanLength returns the Euclidean length (or distance) dE between two points which is equal to the L2 norm (or the Euclidean norm) returned by .

Unweighted graphs ( mrv_dgraph.h )

In the latter case, the function returns whether it succeeded in opening and writing the file. Again, in the latter case, a boolean is returned to confirm success in opening/reading the file.

Weighted graphs ( mrv_wgraph.h )

ClearVertices deletes all vertices but leaves the edges as they are (giving the user more freedom to create other graph organization paradigms). FillWeights(metric) replaces the weight of each edge with the result of applying the metric function to the endpoints.

Finding MSTs / MSFs ( mrv_msffind.h )

CompleteKruskal(metric) clears pre-existing edges, adds all pairs with the weight calculated using the thematic function, and finally calls Kruskal to produce the MST/MSF. CompletePrim(metric) deletes preexisting edges and finds the MST of the complete graph (defined by the given vertices.) The MST edges are preserved. KruskalMST(G, f) finds the MST of the complete graph with the vertices of the Gusing CompleteKruskal algorithm.

MST representation ( mrv_mst.h )

Separate(l, b) separates the tree by cutting edges heavier/longer than l.b is a boolean argument that sets whether it is an absolute length or a parameter multiplied by the mean length of the edges.

MSF representation ( mrv_msf.h )

This allows the user to load/save multiple MSTs in the same file by calling these methods. CombineTrees returns the weighted graph object (see §A.9) with the vertices and edges of all the trees in the forest. If an absolute (b=false) length is given, the separation operator is applied to all trees in the forest.

Delaunay triangulation ( mrv_dt.h )

Random numbers ( mrv_rand.h)

With less than 3 arguments, returns one number in the ranges .. 0.1)(standard uniform distribution) if there is no argument. The following code creates 1000 random numbers from the normal distribution with mean value 2.3 and standard deviation 0.01 and prints one chosen at random. This example also shows why we implemented a single-argument version of Rand_uInt that returns values ​​less than the argument (zero-based arrays/vectors in C++).

General algorithms ( mrv_algo.h )

For example, it can be used for Edges structure since we accommodated the comparison operators in its definition.

Statistics ( mrv_stats.h )

If the sample mean is known, precomputed, etc., it can be passed as the second argument to avoid recalculation. Median calculates the median of the elements of a real vector using the faster method provided in MoravaPack. The second argument (if not given, false is implied) defines sorts the real vector using std::sortof C++STL and returns the median.

Histograms ( mrv_histogram.h )

Therefore, if the vector is not sorted, a partial rearrangement of the elements will occur. LowerBound(i), Midpoint(i), UpperBound(i) return the lower, middle and higher values ​​in the range of the ith bin. RelativeFrequencies,ProbabilityDensities return vectors containing the relative frequencies and probability density estimates of the bins, respectively.

Cosmological distance calculator ( csm_dist.h )

ΛCDM cosmology by setting the hubble parameter and the cosmological density parameter ΩΛ with the assumption that Ωm = 1−ΩΛandΩr = 0 = Ωk. PrintSummary reports the cosmological parameters, the valid range for interpolation, and the number of points used for standard output (usually the screen).

Haloes catalog manipulation ( csm_halo.h)

Clarify the number of halo objects currently in the catalog. COSMOLOGY-ORIENTED CODE Subset(field, oper, key) takes the subset of the halo catalog that meets a comparison criterion: propertyfieldshould be<,=, >,≤or≥to the value key. PROCESS: Create MSTs, DTs, or MSF from the input, or apply operators to created objects.

Notes

INPUT tab

Check if Complete” informs us whether the graph is complete by confirming that the edge list consists of the expected pairs V(V2−1). Apparently, the speed is achieved by reducing the probability of rejections or equivalently the volume of the cube. The optimal way would be to place any shape so that the major axis of the ellipsoid is parallel to the diagonal of the cube.

Figure C.1 The INPUT tab of the application provides the tools to set the graph to be processed
Figure C.1 The INPUT tab of the application provides the tools to set the graph to be processed

PROCESS tab

MORAVAGUI APPLICATION words, we can repeatedly take random points from a cube containing the ellipsoid by treating their coordinates as random variations of uniform distribution, rejecting those outside the ellipsoid. A simple but not optimal way to do this is to fit the ellipsoid into a cube with its faces parallel to the coordinate system axes and sides equal to twice the maximum axis of the ellipsoid.

OUTPUT tab

  • Greedy MST
  • Kruskal’s algorithm
  • Prim’s algorithm
  • Kruskal’s algorithm
  • KF2F algorithm
  • MaintainKF algorithm
  • CompletePrim algorithm
  • Kahan’s compensated summation
  • Åke Björck’s modified two–pass algorithm
  • Quickselect algorithm
  • Median with sort
  • Median with QuickSelect
  • Fast Bootstrap for Median’s SE
  • Discrete uniform rejection method
  • Marsaglia Polar Method
  • Simpson’s Rule
  • Simpson’s Rule with Convergence Check
  • Make Interpolation
  • Quadratic Interpolation
  • Time complexity of Prim’s algorithm
  • Computational complexities of MST algorithms
  • Properties of DEUSS cosmological models
  • Selected MST algorithms for MoravaPack
  • Median examples
  • Average execution times of MST algorithms
  • Cosmic distance calculations
  • The DEUSS subcatalogs
  • Two sample Kolmogorov–Smirnov tests
  • LCDM-1 and RPCDM-1 statistics
  • LCDM-3 and RPCDM-3 statistics
  • LCDM-5 and RPCDM-5 statistics
  • LCDM-1-5 and RPCDM-1-5 statistics
  • Distribution of galaxies (CfA)
  • Distribution of galaxies (2dF)
  • Map of local large-scale structures
  • Example of a simple graph
  • Example of a directed graph
  • Example of a weighted graph
  • Example of a complete graph
  • Example of a spanning tree
  • Example of a minimum spanning tree
  • Example of a spanning forest
  • Example of pruning and separation operations
  • Black and white slice from a DEUSS simulation
  • Detecting a branch
  • Linear vs. Quadratic interpolation
  • Average execution times of MST algorithms (log–log)
  • Average time complexity of Kruskal
  • Avarage time complexity of DT + Kruskal
  • Avarage time complexity of Prim
  • Bootstrap median algorithms comparison (semilog)
  • Bootstrap median algorithms comparison (log–log)
  • Bootstrap median algorithms speedup (1000 resamples)
  • Bootstrap median algorithms speedup (50 resamples)
  • DistanceCalc output
  • Comparison of cosmic distances
  • DEUSS distance verification
  • Histogram of redshift groups
  • Histogram of redshift groups’ standard deviations
  • The MSTs of LCDM-3 and RPCDM-3 catalogs
  • LCDM-1 and RPCDM-1 edge length distributions
  • LCDM-3 and RPCDM-3 edge length distributions
  • LCDM-5 and RPCDM-5 edge length distributions
  • LCDM-1-5 and RPCDM-1-5 edge length distributions

The scene is projected on a plane perpendicular to the line joining the observer (behind the screen) and the center of gravity of the points. By pressing «Statistics», the user is prompted for the number of bootstrap samples to be taken to estimate the standard error of the descriptive statistics (see §4.7.2.) After all calculation, a report like the one in Figure C.7 is displayed. Parallel Computation of Median and Order Statistics on GPUs with Applications to Robust Regression.

Figure C.4 The OUTPUT tab allows saving, plotting and statistical analyses of created objects
Figure C.4 The OUTPUT tab allows saving, plotting and statistical analyses of created objects

MoravaGUI’s INPUT tab

Dialog for adding random points

MoravaGUI’s PROCESS tab

MoravaGUI’s OUTPUT tab

MoravaGUI’s 2D plots

MoravaGUI’s 3D plots

MoravaGUI’s statistics report

Referências

Documentos relacionados