
4.7.2 Fast bootstrap method to estimate the SD of median


Note that exact arithmetic would yield S1 = 0.

Algorithm 4.6: Åke Björck's modified two-pass algorithm
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Output: Sample variance of X

  if N = 0 then return NaN
  if N = 1 then return Inf
  S1 ← 0
  S2 ← 0
  m ← mean of X_N (compensated sum from Algorithm 4.5, divided by N)
  for i ∈ {0, ..., N − 1} do
      x ← x_i − m
      S1 ← S1 + x
      S2 ← S2 + x²
  return (S2 − S1²/N) / (N − 1)
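For illustration, a minimal C++ rendering of this two-pass scheme might look as follows; the helper compensatedMean stands in for the compensated sum of Algorithm 4.5 and is only sketched here, so it is not the MoravaPack routine:

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Kahan-compensated mean, standing in for Algorithm 4.5 (sketch only).
double compensatedMean(const std::vector<double>& x) {
    double sum = 0.0, c = 0.0;
    for (double v : x) {
        double y = v - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum / static_cast<double>(x.size());
}

// Björck's modified two-pass sample variance (Algorithm 4.6).
double sampleVariance(const std::vector<double>& x) {
    const std::size_t n = x.size();
    if (n == 0) return std::numeric_limits<double>::quiet_NaN();
    if (n == 1) return std::numeric_limits<double>::infinity();
    const double m = compensatedMean(x);
    double s1 = 0.0, s2 = 0.0;          // s1 corrects for rounding error in m
    for (double v : x) {
        const double d = v - m;
        s1 += d;
        s2 += d * d;
    }
    return (s2 - s1 * s1 / n) / (n - 1);
}
```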

By taking all possible bootstrap samples from the same sample we can compute $\hat\theta_1, \hat\theta_2, \cdots$, creating a *bootstrap distribution* for $\hat\theta$. Many parameters of this distribution are used as estimates of the corresponding parameter of $F$.

As N is usually large, the possible ordered resamples number $N^N$, and even the number of distinct resamples, $\binom{2N-1}{N}$, is far too large. It is computationally impossible to consider them all. Thus, by drawing B resamples, we obtain a Monte Carlo approximation of the bootstrap estimate.

Bootstrapping can be used to estimate biases, standard errors, percentiles, etc.

In the following paragraphs an application to the standard error of the median is presented.

Bootstrapping the median

Bootstrapping for the standard error of the median can be summarized in the following steps:

• Replace the population with the sample of N elements

• Sample with replacement to create B bootstrap samples

• Compute the medians of the bootstrap samples

• The sample standard deviation of those medians is an estimate of the standard error of the median of the original population

Caution should be taken in the choice of algorithm to compute the median, as it will be used B times.
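To make the recipe concrete, here is a minimal C++ sketch of the naive baseline, which recomputes the median of every resample from scratch. The function name and fixed seed are illustrative; an odd-sized sample and a sort-based median are assumed for brevity:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Naive baseline: recompute the median of every bootstrap resample from scratch.
double naiveBootstrapMedianSE(const std::vector<double>& sample, std::size_t B) {
    std::mt19937 gen(42);                                  // fixed seed, for illustration
    std::uniform_int_distribution<std::size_t> pick(0, sample.size() - 1);

    std::vector<double> medians(B);
    std::vector<double> resample(sample.size());
    for (std::size_t b = 0; b < B; ++b) {
        for (double& v : resample) v = sample[pick(gen)];  // sample with replacement
        std::sort(resample.begin(), resample.end());       // O(N log N) per resample
        medians[b] = resample[resample.size() / 2];        // middle element (N odd)
    }

    // The sample standard deviation of the B medians estimates the SE of the median.
    double mean = 0.0;
    for (double m : medians) mean += m;
    mean /= B;
    double s2 = 0.0;
    for (double m : medians) s2 += (m - mean) * (m - mean);
    return std::sqrt(s2 / (B - 1));
}
```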

The median

Among the various statistics on edge lengths (or weights), we included the median, which is the length separating the graph's set of lengths into two halves: the shortest and the longest. In other words, if we sort the lengths, the median is the length in the middle.

In the case of an even number of edges, there are two lengths $l_k$ and $l_m$ with $l_k \le l_m$ such that the number of lengths less than or equal to $l_k$ equals the number of lengths greater than or equal to $l_m$. The median in this case is the average of these two values. An example for a sample of weights is shown in Table 4.2:

Sample            Median
1, 4, 2, 7, 3     3

Table 4.2: Examples of medians in samples.
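To illustrate the even case, consider a hypothetical sample of weights (not one of the samples of Table 4.2): 1, 4, 2, 7. Sorted, these read 1, 2, 4, 7, so that $l_k = 2$ and $l_m = 4$, and

$$\text{median} = \tfrac{1}{2}(2 + 4) = 3.$$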

The standard error of the median $\sigma_m$ for a large sample of size $N$ from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ is

$$\sigma_m \approx 1.253\,\frac{\sigma}{\sqrt{N}}.$$

Similar formulas have been found for most of the common distributions. However, in the general case of data drawn from an unknown distribution, non-parametric methods have to be applied.
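As a quick illustration of the magnitudes involved (a hypothetical example, not data from this work):

$$\sigma = 1,\; N = 10^4 \;\Rightarrow\; \sigma_m \approx 1.253 \cdot \tfrac{1}{100} \approx 0.0125,$$

which is about 25% larger than the standard error of the mean, $\sigma/\sqrt{N} = 0.01$; this reflects the asymptotic relative efficiency $2/\pi \approx 0.64$ of the sample median under normality.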

Selection algorithms

The simplest but least efficient way to find the median is to sort the data, an operation requiring O(N log N) time in the general case, and then select the element(s) in the middle in O(1) time. Thus, bootstrapping with B resamples would require O(BN log N) time.

Various selection algorithms have been introduced in order to calculate (or estimate) the k-th smallest element (such as the median) of an array in linear time.

Hoare's selection algorithm, or Quickselect (Hoare, 1961), has O(N) average and O(N²) worst-case complexity. Median of Medians (Blum et al., 1973), which builds on Quickselect, has O(N) worst-case complexity. However, since a randomized Quickselect has probability close to zero of hitting the worst case, while Median of Medians carries a large computational overhead, the former is used more often, being faster on average and also simpler to implement (Beliakov, 2011). When both speed and a guarantee of linearity are required, Introselect (introspective selection) is used: a hybrid that starts with Quickselect and switches to Median of Medians if worst-case behaviour is detected (which rarely happens).
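As an aside, the C++ standard library already exposes a selection routine with linear average time, std::nth_element, commonly implemented as a Quickselect/Introselect hybrid. A minimal usage sketch (the wrapper function is illustrative):

```cpp
#include <algorithm>
#include <vector>

// Select the k-th smallest element (0-based) using the standard library.
// The argument is taken by value because std::nth_element reorders it.
double kthSmallest(std::vector<double> v, std::size_t k) {
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v[k];   // after the call, v[k] holds the k-th smallest element
}
```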

The Quickselect algorithm was implemented by us (as part of MoravaPack's tools) in the function mrv::QuickSelect. For the pseudocode, see Algorithm 4.7.

(Exact or approximate medians also make excellent pivots for other operations, such as sorting.)

Algorithm 4.7: Quickselect algorithm
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Input: Indices of leftmost and rightmost elements, l and r
Input: Rank k of the element to select (1-based)
Output: k-th smallest element in X

  if l = r then return x_l
  pivot index p ← Partition(X, l, r)
  length ← p − l + 1
  if length > k then return QuickSelect(X, l, p − 1, k)
  if length < k then return QuickSelect(X, p + 1, r, k − length)
  return X_p

  Function Partition(X_N, l, r)
      set random pivot index p ← random integer in [l, r]
      swap X_p ↔ X_r
      p ← r
      i ← l
      for j = l, ..., r − 1 do
          if X_j ≤ X_p then
              swap X_i ↔ X_j
              i ← i + 1
      swap X_i ↔ X_p
      return i
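A direct and purely illustrative C++ transcription of Algorithm 4.7 might read as follows; this is a sketch, not the actual mrv::QuickSelect implementation:

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

namespace {
std::mt19937 rng(std::random_device{}());

// Lomuto partition around a randomly chosen pivot; returns the pivot's final index.
std::size_t partition(std::vector<double>& x, std::size_t l, std::size_t r) {
    std::uniform_int_distribution<std::size_t> dist(l, r);
    std::swap(x[dist(rng)], x[r]);          // move a random pivot to the end
    std::size_t i = l;
    for (std::size_t j = l; j < r; ++j)
        if (x[j] <= x[r]) std::swap(x[i++], x[j]);
    std::swap(x[i], x[r]);
    return i;
}
}  // namespace

// Return the k-th smallest element (1-based) of x[l..r], reordering x in place.
double quickSelect(std::vector<double>& x, std::size_t l, std::size_t r, std::size_t k) {
    if (l == r) return x[l];
    const std::size_t p = partition(x, l, r);
    const std::size_t length = p - l + 1;
    if (length > k) return quickSelect(x, l, p - 1, k);
    if (length < k) return quickSelect(x, p + 1, r, k - length);
    return x[p];
}
```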

Adjustments for even number of elements

Caution is needed when using selection algorithms to find the median, as one must account for even-sized data.

Using the sorting approach, after a small adjustment for even-sized vectors, we get the «Median with Sort» algorithm (Algorithm 4.8).

With the Quickselect algorithm, the situation is more complex. Extracting the k-th element gives no information about the position of the (k − 1)-th or (k + 1)-th elements. We decided to use the following O(N) procedure:

• If N is odd, return the ⌊(N + 1)/2⌋-th element using Quickselect. (In computer science applications it is common to select only this one element, the ⌊(n + 1)/2⌋-th, for both odd and even n.)

• If N is even,

  1. Find the smaller of the two middle values, a, by requesting Quickselect to return the ⌊N/2⌋-th element.

  2. Scan all elements and count how many of them, LE, are less than or equal to a.

  3. If LE > N/2, then a is not unique and at least one other instance of it follows after the (N/2)-th position. Consequently the median is ½(a + a) = a. Return a.

  4. Otherwise the next larger element, b, has to be found. This is easily implemented: set b equal to ∞ and, while iterating over the elements in Step 2, whenever an element larger than a but smaller than the current value of b comes up, assign it to b. At the end, b holds the smallest element that is larger than a. Return ½(a + b).

Quickselect is an O(N) operation, and with our modification it remains so.
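For illustration, a C++ sketch of this even-size adjustment, assuming the quickSelect function from the sketch given after Algorithm 4.7; it is not the MoravaPack implementation:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Declaration of the quickSelect sketch given after Algorithm 4.7.
double quickSelect(std::vector<double>& x, std::size_t l, std::size_t r, std::size_t k);

// Median of x using the even-size adjustment described above.
// The input is copied because quickSelect reorders its argument.
double medianWithQuickSelect(std::vector<double> x) {
    const std::size_t n = x.size();
    if (n % 2 == 1)                                    // odd: single middle element
        return quickSelect(x, 0, n - 1, (n + 1) / 2);

    const double a = quickSelect(x, 0, n - 1, n / 2);  // lower middle value
    std::size_t le = 0;                                // count of elements <= a
    double b = std::numeric_limits<double>::infinity();
    for (double v : x) {
        if (v <= a) {
            ++le;
        } else if (v < b) {
            b = v;                                     // smallest element larger than a
        }
    }
    if (le > n / 2) return a;                          // a repeats past position n/2
    return 0.5 * (a + b);
}
```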

We call the algorithm «Median with QuickSelect» (Algorithm 4.9).

Algorithm 4.8: Median with sort

Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Output: Median of X

  sort X
  if N ≡ 1 (mod 2) then
      return X_⌊N/2⌋
  else
      return ½ (X_⌊N/2⌋ + X_⌊N/2⌋−1)
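A straightforward C++ counterpart, given only for illustration:

```cpp
#include <algorithm>
#include <vector>

// Median by sorting (Algorithm 4.8); assumes a non-empty vector.
// The input is copied so the caller's data stays untouched.
double medianWithSort(std::vector<double> x) {
    std::sort(x.begin(), x.end());
    const std::size_t n = x.size();
    if (n % 2 == 1) return x[n / 2];
    return 0.5 * (x[n / 2] + x[n / 2 - 1]);
}
```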

Algorithm 4.9: Median with QuickSelect
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Output: Median of X

  if N ≡ 1 (mod 2) then return QuickSelect(X, ⌊N/2⌋ + 1)
  half the size, H ← N/2
  first value, a ← QuickSelect(X, N/2)
  number of elements less than or equal to a, LE ← 0
  second value, b ← ∞
  for i = 0, 1, ..., N − 1 do
      if X_i ≤ a then
          LE ← LE + 1
          if LE > H then return a
      else if X_i < b then
          b ← X_i
  return ½(a + b)

Fast bootstrapping for median

Here we present a method for computing the medians of bootstrap resamples (so that the standard error of the median can be computed) which is faster than applying Quickselect to every resample. We call it Fast Bootstrapping for Median (FBM), although its logic and advantages carry over to the bootstrap estimation of any other order statistic (such as quantiles). It was inspired by the following observations:

• By replacing the population with the sample, we create a finite and discrete population, no matter the nature of the former (continuous or discrete).

• Sampling with replacement from a finite population x_0, ..., x_{N−1} can be represented exactly by a histogram: a vector of integers h_i giving the number of occurrences of x_i in the resample.

• The resampling is required to be uniform, so changing the order of the x_i does not affect the distribution of the resample; this allows us to exploit the following observation.

• If the x_i are sorted in ascending order before the sampling (descending order could be used without loss of generality), so that x_i ≤ x_j for i < j, then each resample's histogram reads:

  h_0: number of occurrences of the smallest element
  h_1: number of occurrences of the second smallest element
  ...
  h_{N−1}: number of occurrences of the greatest element

• Finding the median in a histogram is trivial: sum the histogram bin counts until N/2 is reached; the last bin considered corresponds to the median of the resample.


For simplicity’s sake but without loss of generality, we did not account for even–sized data. The method can be described in the following steps:

Step 1: Sort the sample of N elements, in O(N log N) time.

Step 2: Create a histogram of N bins for the sample, in O(N) time.

Step 3: Compute N integers from the discrete uniform distribution U(0, N − 1): these are the indices of the original sample that are to be included in the resample. Do not create the resample explicitly; instead, construct its histogram by incrementing the corresponding bins, in O(N) time.

Step 4: Compute the cumulative sum of bin counts until the middle element is reached (slightly modified for even-sized samples), in O(N) time.

Step 5: Repeat Steps 2 to 4 until B medians are found, in O(BN) time.

Step 6: Compute the standard deviation of the B medians, in O(B) time.

The above process has O(BN + N log N + B) = O(BN) time complexity: we may reasonably assume that B > log N, as the computing power available today pushes published research toward ever larger values of B, usually above 1000 and below a million.

One can note that the method has several merits:

• Most of the computations take place in summations and comparisons of integers instead of the computationally heavier comparisons of real numbers. This is also memory efficient.

  An extreme case would be numbers kept in exact arithmetic in software such as Wolfram Mathematica: the computational resources spent on comparisons and other operations are then restricted to the first step (sorting).

• With Quickselect on each resample, comparisons are redundantly repeated, while FBM compares the elements only once.

• In each step of the process, we operate only once on each piece of new information:

  – the value and order of the sample's elements, when sorting
  – each resample, when creating its histogram
  – each histogram, when counting to find the median
  – each median, when computing the standard deviation

Algorithm 4.10: Fast Bootstrap for Median's SE
Input: Vector X_N = (x_0, x_1, ..., x_{N-1})
Input: Number B of bootstrap resamples
Output: Standard error of the median of X

  sort X
  create medians vector M_B = (m_0, m_1, ..., m_{B-1})
  create histogram H_N = (h_0, h_1, ..., h_{N-1})
  for i ∈ {0, 1, ..., B − 1} do
      for j ∈ {0, 1, ..., N − 1} do h_j ← 0
      for j ∈ {0, 1, ..., N − 1} do
          k ← random number in {0, 1, ..., N − 1}
          h_k ← h_k + 1
      if N ≡ 0 (mod 2) then
          even ← true
          stop_at ← ⌈N/2⌉
      else
          even ← false
          stop_at ← ⌈(N + 1)/2⌉
      count ← 0
      for j ∈ {0, 1, ..., N − 1} do
          count ← count + h_j
          if count ≥ stop_at then
              median ← x_j
              if count = stop_at and even then
                  j′ ← smallest index greater than j with h_{j′} > 0
                  median ← ½(median + x_{j′})
              break
      m_i ← median
  return standard deviation of M

The pseudocode of the method is listed in Algorithm 4.10.
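Under the assumptions stated above (odd-sized sample only, for brevity), a compact C++ sketch of FBM could look like this; the function name, fixed seed and use of std::mt19937 are illustrative, and this is not the MoravaPack implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Fast Bootstrap for Median's SE (Algorithm 4.10), odd-sized samples only.
double fastBootstrapMedianSE(std::vector<double> x, std::size_t B) {
    std::sort(x.begin(), x.end());                    // Step 1: sort once, up front
    const std::size_t n = x.size();
    const std::size_t stopAt = (n + 1) / 2;           // rank of the median (1-based), n odd

    std::mt19937 gen(42);                             // fixed seed, for illustration
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);

    std::vector<std::size_t> hist(n);
    std::vector<double> medians(B);
    for (std::size_t b = 0; b < B; ++b) {
        std::fill(hist.begin(), hist.end(), 0);       // Step 2: reset the histogram
        for (std::size_t j = 0; j < n; ++j)           // Step 3: resample into the histogram
            ++hist[pick(gen)];
        std::size_t count = 0;                        // Step 4: walk the bins up to the median
        for (std::size_t j = 0; j < n; ++j) {
            count += hist[j];
            if (count >= stopAt) { medians[b] = x[j]; break; }
        }
    }

    // Step 6: sample standard deviation of the B medians.
    double mean = 0.0;
    for (double m : medians) mean += m;
    mean /= B;
    double s2 = 0.0;
    for (double m : medians) s2 += (m - mean) * (m - mean);
    return std::sqrt(s2 / (B - 1));
}
```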

Also, we should note here that since Kruskal's algorithm sorts the edges, MoravaPack is designed to store the edges of the produced MSTs in that order, rendering Step 1 redundant. We do not, however, have to exclude the step from the MoravaPack code or explicitly bypass it: many sorting routines handle already-sorted input very cheaply, and for the one we use, STL's std::sort, an O(n) check with std::is_sorted is enough to skip the call entirely when the data are already ordered.
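A one-line guard of the kind just described; this is an assumption about how one might arrange it, not the actual MoravaPack code:

```cpp
#include <algorithm>
#include <vector>

// Skip the redundant sort when the edge lengths are already in Kruskal order.
void sortIfNeeded(std::vector<double>& lengths) {
    if (!std::is_sorted(lengths.begin(), lengths.end()))
        std::sort(lengths.begin(), lengths.end());
}
```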