4.7 Statistics
4.7.2 Fast bootstrap method to estimate the SD of median
Note that exact arithmetic would yield $S_1 = 0$.
Algorithm 4.6: Åke Björck's modified two-pass algorithm
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Output: Sample variance of X
if N = 0 then return NaN
if N = 1 then return Inf
S_1 ← 0
S_2 ← 0
m ← (1/N) Σ_{i=0}^{N-1} x_i   (compensated sum from Algorithm 4.5)
for i ∈ {0, ..., N-1} do
    x ← x_i - m
    S_1 ← S_1 + x
    S_2 ← S_2 + x^2
return (S_2 - S_1^2 / N) / (N - 1)
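The steps of Algorithm 4.6 can be sketched in C++ as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and Kahan summation is assumed as the compensated sum, since Algorithm 4.5 is not reproduced in this section.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Mean via Kahan compensated summation (assumed stand-in for Algorithm 4.5).
double compensated_mean(const std::vector<double>& x) {
    assert(!x.empty());
    double sum = 0.0, c = 0.0;          // c accumulates the lost low-order bits
    for (double v : x) {
        const double y = v - c;
        const double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum / static_cast<double>(x.size());
}

// Modified two-pass sample variance, following the steps of Algorithm 4.6.
double two_pass_variance(const std::vector<double>& x) {
    const std::size_t n = x.size();
    if (n == 0) return std::numeric_limits<double>::quiet_NaN();
    if (n == 1) return std::numeric_limits<double>::infinity();
    const double m = compensated_mean(x);
    double s1 = 0.0, s2 = 0.0;
    for (double v : x) {
        const double d = v - m;
        s1 += d;                        // exactly 0 in exact arithmetic
        s2 += d * d;
    }
    const double nd = static_cast<double>(n);
    return (s2 - s1 * s1 / nd) / (nd - 1.0);
}
```

The correction term `s1 * s1 / nd` only removes the rounding error left in the computed mean, which is why it vanishes in exact arithmetic.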
CHAPTER 4. IMPLEMENTATION METHODOLOGY
By taking all possible bootstrap samples from the same sample we can compute $\theta_1^*, \theta_2^*, \dots$, creating a bootstrap distribution for $\hat{\theta}$. Many parameters of this distribution are used as estimates of the corresponding parameters of $F$.
As $N$ is usually large, the number of possible bootstrap samples is $N^N$ when order is taken into account, and even the number of distinct (unordered) bootstrap samples, $\binom{2N-1}{N}$, is too large. It is computationally infeasible to consider them all. Thus, by taking $B$ resamples, we obtain a Monte Carlo approximation of the bootstrap estimate.
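As a small worked illustration (the numbers below are computed here and do not come from the source), the two counts grow as follows:

```latex
\begin{align*}
N = 3:  &\quad N^N = 3^3 = 27,     &\binom{2N-1}{N} &= \binom{5}{3} = 10 \\
N = 10: &\quad N^N = 10^{10},      &\binom{2N-1}{N} &= \binom{19}{10} = 92\,378
\end{align*}
```

Even for ten elements, exhaustive enumeration is already costly, which motivates drawing only $B$ random resamples.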
Bootstrapping can be used to estimate biases, standard errors, percentiles etc.
In the following paragraphs an application to the standard error of the median is presented.
Bootstrapping the median
Bootstrapping for standard error of the median can be summarized in these steps:
• Replace the population with the sample of $N$ elements.
• Sample with replacement to create $B$ bootstrap samples.
• Compute the medians of the bootstrap samples.
• The sample standard deviation of those medians is an estimate of the standard error of the median of the original population.
Care should be taken in the choice of the algorithm that computes the median, as it will be used $B$ times.
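The four steps above can be sketched directly in C++. This is a baseline sketch, not the author's implementation: the function names, the `seed` parameter and the use of `std::mt19937` are illustrative assumptions, and the median is computed by sorting each resample.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Median of a copy of the data (even sizes average the two middle values).
double median_by_sort(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Bootstrap estimate of the standard error of the median (hypothetical helper).
double naive_bootstrap_median_se(const std::vector<double>& sample,
                                 std::size_t B, unsigned seed = 42) {
    assert(!sample.empty() && B >= 2);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pick(0, sample.size() - 1);

    std::vector<double> medians(B);
    std::vector<double> resample(sample.size());
    for (std::size_t b = 0; b < B; ++b) {
        for (double& x : resample) x = sample[pick(gen)];  // with replacement
        medians[b] = median_by_sort(resample);
    }

    // Sample standard deviation of the B medians.
    double mean = 0.0;
    for (double m : medians) mean += m;
    mean /= static_cast<double>(B);
    double ss = 0.0;
    for (double m : medians) ss += (m - mean) * (m - mean);
    return std::sqrt(ss / static_cast<double>(B - 1));
}
```

Each resample is sorted here, so the median computation dominates the running time.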
The median
Among the various statistics on edge lengths (or weights), we included the median, which is the length separating the graph's set of lengths into two halves: the shortest and the longest. In other words, if we sort the lengths, the median is the length in the middle.
In the case of an even number of edges, there are two lengths $l_k$ and $l_m$ such that $l_k \le l_m$ and the number of lengths less than or equal to $l_k$ equals the number of lengths greater than or equal to $l_m$. The median in this case is the average of these two values. Example for two samples of weights:
Sample           Median
1, 4, 2, 7, 3    3

Table 4.2: Examples of medians in samples.
The standard error of the median, $\sigma_m$, for a large sample from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ is
$$\sigma_m \approx 1.253\,\frac{\sigma}{\sqrt{N}}$$
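The constant can be traced back to the standard asymptotic variance of the sample median $\tilde{x}$. This is a textbook result that the source does not state, and it assumes a density $f$ that is continuous and positive at the median:

```latex
\operatorname{Var}(\tilde{x}) \approx \frac{1}{4 N f(\mu)^2},
\qquad
f(\mu) = \frac{1}{\sigma\sqrt{2\pi}}
\;\Longrightarrow\;
\sigma_m \approx \sqrt{\frac{\pi}{2}}\,\frac{\sigma}{\sqrt{N}}
\approx 1.2533\,\frac{\sigma}{\sqrt{N}}
```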
Similar formulas have been found for most of the common distributions.
However, in the general case of data drawn from unknown distributions, non-parametric methods have to be applied.
Selection algorithms
The simplest but least efficient way to find the median is to sort the data (an operation requiring $O(N \log N)$ time in the general case) and then select the element(s) in the middle in $O(1)$ time. Thus, bootstrapping with $B$ resamples would require $O(BN \log N)$ time.
Various selection algorithms have been introduced in order to calculate (or estimate*) the $k$-th smallest element (such as the median) of an array in linear time.
Hoare's selection algorithm, or Quickselect (Hoare, 1961), has $O(N)$ average and $O(N^2)$ worst-case complexity. On the other hand, Median of Medians (Blum et al., 1973), which was based on Quickselect, has $O(N)$ worst-case complexity. However, because a randomized Quickselect has a probability close to zero of hitting the worst case, and Median of Medians carries a large computational overhead, the former is used more often: it is faster on average and also simpler to implement (Beliakov, 2011). When both speed and guaranteed linearity are required, Introselect (introspective selection) is used: a hybrid that starts with Quickselect and switches to Median of Medians once a worst-case pattern is detected (which rarely happens).
We implemented the Quickselect algorithm (as part of MoravaPack's tools) in the function mrv::QuickSelect. For the pseudocode, see Algorithm 4.7.
* Exact or approximate medians are excellent pivots for other operations such as sorting.
Algorithm 4.7: Quickselect algorithm
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Input: Indices of leftmost and rightmost elements, l and r
Input: Rank k (1-based, counted within x_l, ..., x_r)
Output: k-th smallest element in X
if l = r then return x_l
pivot index p ← Partition(X, l, r)
length ← p - l + 1
if length > k then return QuickSelect(X, l, p - 1, k)
if length < k then return QuickSelect(X, p + 1, r, k - length)
return x_p

Function Partition(X, l, r)
    choose a random pivot index p ← random integer in [l, r]
    swap x_p ↔ x_r
    p ← r
    i ← l
    for j = l, ..., r - 1 do
        if x_j ≤ x_p then
            swap x_i ↔ x_j
            i ← i + 1
    swap x_i ↔ x_p
    return i
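A C++ sketch of Algorithm 4.7 follows. It is not the code of mrv::QuickSelect, whose signature is not given in the text: the names `partition_random` and `quickselect_sketch` are hypothetical, the recursion is written as a loop, and `std::mt19937` is an assumed choice of random generator.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Lomuto partition around a random pivot; returns the pivot's final index.
std::size_t partition_random(std::vector<double>& x, std::size_t l,
                             std::size_t r, std::mt19937& gen) {
    std::uniform_int_distribution<std::size_t> pick(l, r);
    std::swap(x[pick(gen)], x[r]);       // move the pivot to the right end
    std::size_t i = l;
    for (std::size_t j = l; j < r; ++j) {
        if (x[j] <= x[r]) {
            std::swap(x[i], x[j]);
            ++i;
        }
    }
    std::swap(x[i], x[r]);               // pivot lands in its sorted position
    return i;
}

// k-th smallest element (k is 1-based) of x[l..r]; reorders x in place.
double quickselect_sketch(std::vector<double>& x, std::size_t l, std::size_t r,
                          std::size_t k, std::mt19937& gen) {
    assert(l <= r && r < x.size() && k >= 1 && k <= r - l + 1);
    while (l < r) {
        const std::size_t p = partition_random(x, l, r, gen);
        const std::size_t length = p - l + 1;   // size of x[l..p]
        if (length == k) return x[p];
        if (length > k) {
            r = p - 1;                          // continue left of the pivot
        } else {
            k -= length;                        // continue right of the pivot
            l = p + 1;
        }
    }
    return x[l];
}
```

The loop form avoids deep recursion in the rare unbalanced case while following the same three-way decision as the pseudocode.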
Adjustments for even number of elements
Care should be taken when using selection algorithms to find the median, as one must account for even-sized data*.
Using the sorting approach, after a small adjustment for even-sized vectors, we get the «Median with Sort» algorithm (Algorithm 4.8).
With the Quickselect algorithm, the situation is more complex. Extracting the $k$-th element does not give any information about the position of the $(k-1)$-th or $(k+1)$-th elements. We decided to use the following $O(N)$ procedure:
• If $N$ is odd, return the $\lfloor \frac{N+1}{2} \rfloor$-th element using Quickselect.
• If $N$ is even:
    1. Find the smaller of the two middle values, $a$, by requesting Quickselect to return the $\lfloor \frac{N}{2} \rfloor$-th element.
    2. Scan all elements and count how many of them, $LE$, are less than or equal to $a$.
    3. If $LE > \frac{N}{2}$, then $a$ is not unique and at least one other instance follows after the $\frac{N}{2}$-th position. Consequently, the median is $\frac{1}{2}(a + a) = a$. Return $a$.
    4. If not, then the next larger element, $b$, has to be found. This is easily implemented: we set $b$ equal to $\infty$ and, while iterating over the elements in Step 2, whenever an element larger than $a$ but smaller than the current value of $b$ comes up, we let $b$ be equal to it. At the end, the minimum element that is larger than $a$ is stored in $b$. Return $\frac{a+b}{2}$.
* In computer science applications it is common to select only one element for both odd and even numbers of elements: the $\lfloor \frac{n+1}{2} \rfloor$-th one.
Since the additional scan is an $O(N)$ operation, Quickselect with our modification remains $O(N)$ on average.
We call the algorithm «Median with QuickSelect» (Algorithm 4.9).
Algorithm 4.8: Median with sort
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Output: Median of X
sort X
if N ≡ 1 (mod 2) then
    return x_{⌊N/2⌋}
else
    return (x_{⌊N/2⌋} + x_{⌊N/2⌋-1}) / 2
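Both median routines, by sorting and by selection with the even-size adjustment, can be sketched in C++ as follows. This is a minimal sketch, not MoravaPack's code: `median_with_sort` and `median_with_select` are hypothetical names, and `std::nth_element` is assumed here as a stand-in for mrv::QuickSelect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// "Median with Sort": O(N log N). Works on a copy of the data.
double median_with_sort(std::vector<double> x) {
    assert(!x.empty());
    std::sort(x.begin(), x.end());
    const std::size_t n = x.size();
    if (n % 2 == 1) return x[n / 2];
    return 0.5 * (x[n / 2 - 1] + x[n / 2]);
}

// Median by selection: O(N) on average. Works on a copy of the data.
double median_with_select(std::vector<double> x) {
    assert(!x.empty());
    const std::size_t n = x.size();
    if (n % 2 == 1) {
        std::nth_element(x.begin(), x.begin() + n / 2, x.end());
        return x[n / 2];
    }
    const std::size_t half = n / 2;
    // a is the (N/2)-th smallest element, i.e. 0-based index N/2 - 1.
    std::nth_element(x.begin(), x.begin() + (half - 1), x.end());
    const double a = x[half - 1];

    std::size_t le = 0;                                  // elements <= a
    double b = std::numeric_limits<double>::infinity();  // smallest element > a
    for (double v : x) {
        if (v <= a) {
            if (++le > half) return a;   // a is repeated, so the median is a
        } else if (v < b) {
            b = v;
        }
    }
    return 0.5 * (a + b);
}
```

The single scan over the data both counts the elements not exceeding `a` and tracks the smallest element above it, so no second selection call is needed.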
Algorithm 4.9: Median with QuickSelect
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Output: Median of X
if N ≡ 1 (mod 2) then return QuickSelect(X, 0, N - 1, ⌊N/2⌋ + 1)
half the size, H ← N/2
first value a ← QuickSelect(X, 0, N - 1, H)
number of elements less than or equal to a, LE ← 0
second value b ← ∞
for i = 0, 1, ..., N - 1 do
    if x_i ≤ a then
        LE ← LE + 1
        if LE > H then
            return a
    else if x_i < b then
        b ← x_i
return (a + b) / 2

Fast bootstrapping for median
Here we present a method for computing the medians of bootstrap resamples (needed for computing the standard error) which is faster than using Quickselect repeatedly. We call it Fast Bootstrapping for Median (FBM), although its logic and advantages can be used for the estimation of any other order statistic (such as quantiles) through bootstrapping. It was inspired by the following observations:
• By replacing the population with the sample, we create a finite and discrete population no matter the nature of the former (continuous or discrete).
• Sampling with replacement from a finite population $x_0, \dots, x_{N-1}$ can be represented exactly by a histogram: a vector of integers $h_i$ which are the numbers of occurrences of $x_i$ in the resample.
• The resampling is required to be uniform, so that any change in the order of the $x_i$ does not affect the distribution of the resample; this allows us to exploit the following observation.
• If the $x_i$ are sorted in ascending order* before the sampling, so that $x_i \le x_j$ for $i < j$, then each resample's histogram will read:
    $h_0$: number of occurrences of the smallest element
    $h_1$: number of occurrences of the second smallest element
    ...
    $h_{N-1}$: number of occurrences of the greatest element
• Finding the median in a histogram is trivial: sum the histogram bin counts until $\frac{N}{2}$ is reached; the last bin considered corresponds to the median of the resample.
* Descending order could be used without loss of generality.
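The histogram representation and the cumulative-sum lookup can be illustrated with a short C++ sketch. The function names are hypothetical, and, as a simplifying assumption, the lookup returns the element of rank $\lceil N/2 \rceil$, which is the median for odd $N$ and the lower middle value for even $N$.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Histogram of a resample: h[i] counts how often index i was drawn.
std::vector<std::size_t> histogram_of_indices(
        const std::vector<std::size_t>& drawn, std::size_t n) {
    std::vector<std::size_t> h(n, 0);
    for (std::size_t idx : drawn) {
        assert(idx < n);
        ++h[idx];
    }
    return h;
}

// Element of rank ceil(N/2) of the resample encoded by h over the sorted sample.
double lower_median_from_histogram(const std::vector<double>& sorted_x,
                                   const std::vector<std::size_t>& h) {
    assert(!sorted_x.empty() && sorted_x.size() == h.size());
    const std::size_t target = (sorted_x.size() + 1) / 2;
    std::size_t count = 0;
    for (std::size_t j = 0; j < h.size(); ++j) {
        count += h[j];                      // integer additions only
        if (count >= target) return sorted_x[j];
    }
    return sorted_x.back();  // reached only if the counts do not sum to N
}
```

For the sorted sample (1, 2, 3, 4, 7) and drawn indices (4, 0, 0, 2, 4), the resample is (7, 1, 1, 3, 7), the histogram is (2, 0, 1, 0, 2), and the cumulative count reaches 3 at the third bin, giving the median 3.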
For simplicity's sake but without loss of generality, we did not account for even-sized data. The method can be described in the following steps:
Step 1 Sort the sample of $N$ elements, in $O(N \log N)$ time.
Step 2 Create a histogram of $N$ bins for the sample, in $O(N)$ time.
Step 3 Draw $N$ integers from the discrete uniform distribution $U(0, N-1)$: these are the indices of the original sample that are to be included in the resample. Do not create the resample itself; instead, construct the histogram by incrementing the corresponding bins, in $O(N)$ time.
Step 4 Compute the cumulative sum of bin counts until you reach the middle element (slightly modified for the case of even-sized samples), in $O(N)$ time.
Step 5 Repeat Steps 2 to 4 until $B$ medians are found, in $O(BN)$ time.
Step 6 Compute the standard deviation of the medians, in $O(B)$ time.
The above process has $O(BN + N \log N + B) = O(BN)$ time complexity: we reasonably assume that $B > \log N$, as the computing power available today pushes the values of $B$ used in published research higher and higher, usually above 1000 and below one million.
One can note that the method has several merits:
• Most of the computations are summations and comparisons of integers instead of the computationally heavier comparisons of real numbers. This is also memory efficient.
An extreme case would be numbers in exact arithmetic in software such as Wolfram Mathematica: the computational cost of comparisons and other operations on such numbers is confined to the first step (sorting).
• With Quickselect on each resample, the comparisons are redundantly repeated, while FBM compares the elements only once.
• In each step of the process, we operate only once on any new information:
    – the values and order of the elements, when sorting
    – each resample, when creating the histogram
    – each histogram, when counting to find the median
    – each median, when computing the standard deviation
Algorithm 4.10: Fast Bootstrap for Median's SE
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Input: Number B of bootstrap resamples
Output: Standard error of the median of X
sort X
create medians vector M_B = (m_0, m_1, ..., m_{B-1})
create histogram H_N = (h_0, h_1, ..., h_{N-1})
for i ∈ {0, 1, ..., B - 1} do
    for j ∈ {0, 1, ..., N - 1} do h_j ← 0
    for j ∈ {0, 1, ..., N - 1} do
        k ← random integer in {0, 1, ..., N - 1}
        h_k ← h_k + 1
    if N ≡ 0 (mod 2) then
        even ← true
        stop_at ← N/2
    else
        even ← false
        stop_at ← (N + 1)/2
    count ← 0
    for j ∈ {0, 1, ..., N - 1} do
        count ← count + h_j
        if count ≥ stop_at then
            median ← x_j
            if count = stop_at and even then
                j' ← smallest index greater than j with h_{j'} > 0
                median ← (median + x_{j'}) / 2
            break
    m_i ← median
return standard deviation of M
The pseudocode of the method is listed in Algorithm 4.10.
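The whole method can be sketched in C++ as follows. This is a sketch under stated assumptions rather than MoravaPack's implementation: the function names, the `seed` parameter and `std::mt19937` are illustrative choices. In this sketch, when $N$ is even and the cumulative count lands exactly on $N/2$, the upper middle value is read from the next non-empty bin, since an empty bin holds no element of the resample.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Median of the resample encoded by histogram h over the sorted sample x.
double median_from_histogram(const std::vector<double>& x,
                             const std::vector<std::size_t>& h) {
    const std::size_t n = x.size();
    assert(n > 0 && h.size() == n);
    const bool even = (n % 2 == 0);
    const std::size_t stop_at = even ? n / 2 : (n + 1) / 2;
    std::size_t count = 0;
    for (std::size_t j = 0; j < n; ++j) {
        count += h[j];
        if (count < stop_at) continue;
        if (even && count == stop_at) {
            std::size_t next = j + 1;          // exists because count < N
            while (h[next] == 0) ++next;
            return 0.5 * (x[j] + x[next]);
        }
        return x[j];
    }
    return x[n - 1];  // reached only if the counts do not sum to N
}

// Standard error of the median by Fast Bootstrapping for Median (FBM).
double fbm_median_se(std::vector<double> x, std::size_t B, unsigned seed = 42) {
    const std::size_t n = x.size();
    assert(n > 0 && B >= 2);
    std::sort(x.begin(), x.end());                             // Step 1
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);

    std::vector<std::size_t> h(n);
    std::vector<double> medians(B);
    for (std::size_t i = 0; i < B; ++i) {
        std::fill(h.begin(), h.end(), std::size_t{0});         // Step 2
        for (std::size_t d = 0; d < n; ++d) ++h[pick(gen)];    // Step 3
        medians[i] = median_from_histogram(x, h);              // Step 4
    }

    // Step 6: sample standard deviation of the B medians.
    double mean = 0.0;
    for (double m : medians) mean += m;
    mean /= static_cast<double>(B);
    double ss = 0.0;
    for (double m : medians) ss += (m - mean) * (m - mean);
    return std::sqrt(ss / static_cast<double>(B - 1));
}
```

After the initial sort, the inner loops touch only integer counters; the real-valued data are read once per resample, when the median is looked up.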
We should also note that, since Kruskal's algorithm sorts the edges, MoravaPack is designed to store the edges of the produced MSTs in that order, rendering Step 1 redundant. Nevertheless, we do not have to exclude the step from the code in MoravaPack or explicitly bypass it: many sorting implementations either check in $O(N)$ time whether the input is already sorted or run quickly on sorted arrays by design. Note that the C++ standard only guarantees $O(N \log N)$ comparisons for STL's std::sort (the routine we use), so the actual cost on sorted input depends on the library implementation.
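If one prefers not to rely on how the sorting routine behaves on sorted input, an explicit linear-time check is possible. The helper below is a hypothetical sketch and not part of MoravaPack; it uses only `std::is_sorted` and `std::sort` from the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sorts x only if needed; returns true when a sort was actually performed.
bool sort_if_needed(std::vector<double>& x) {
    if (std::is_sorted(x.begin(), x.end())) return false;  // O(N) check
    std::sort(x.begin(), x.end());
    assert(std::is_sorted(x.begin(), x.end()));  // postcondition (debug builds)
    return true;
}
```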