We characterize an experiment by the treatments and experimental units to be used, the way we assign the treatments to units, and the responses we measure. An experiment is randomized if the method for assigning treatments to units involves a known, well-understood probabilistic scheme. The probabilistic scheme is called a randomization. As we will see, an experiment may have several randomized features in addition to the assignment of treatments to units. Randomization is one of the most important elements of a well-designed experiment.
Let’s emphasize first the distinction between a random scheme and a “haphazard” scheme. Consider the following potential mechanisms for assigning treatments to experimental units. In all cases suppose that we have four treatments that need to be assigned to 16 units.
• We use sixteen identical slips of paper, four marked with A, four with B, and so on to D. We put the slips of paper into a basket and mix them thoroughly. For each unit, we draw a slip of paper from the basket and use the treatment marked on the slip.
• Treatment A is assigned to the first four units we happen to encounter, treatment B to the next four units, and so on.
• As each unit is encountered, we assign treatments A, B, C, and D based on whether the “seconds” reading on the clock is between 1 and 15, 16 and 30, 31 and 45, or 46 and 60.
The first method clearly uses a precisely-defined probabilistic method. We understand how this method makes its assignments, and we can use this method to obtain statistically equivalent randomizations in replications of the experiment.
The second two methods might be described as “haphazard”; they are not predictable and deterministic, but they do not use a randomization. It is difficult to model and understand the mechanism that is being used. Assignment here depends on the order in which units are encountered, the elapsed time between encountering units, how the treatments were labeled A, B, C, and D, and potentially other factors. I might not be able to replicate your experiment, simply because I tend to encounter units in a different order, or I tend to work a little more slowly. The second two methods are not randomization.
Haphazard is not randomized.
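The slips-of-paper scheme in the first bullet can be mimicked on a computer. The following is only a minimal sketch, assuming Python's standard `random` module stands in for the well-mixed basket; the function name `assign_by_slips` and the seed value are our own illustrative choices, not part of the original description.

```python
import random

def assign_by_slips(units, treatments, per_treatment, seed=None):
    """Mimic drawing marked slips of paper from a well-mixed basket."""
    rng = random.Random(seed)
    # One slip per unit: per_treatment slips for each treatment label.
    slips = [t for t in treatments for _ in range(per_treatment)]
    if len(slips) != len(units):
        raise ValueError("need exactly one slip per unit")
    rng.shuffle(slips)  # the "mix them thoroughly" step
    # The i-th unit receives the i-th slip drawn.
    return dict(zip(units, slips))

# Four treatments, 16 units, four slips per treatment (seed is arbitrary).
assignment = assign_by_slips(range(1, 17), "ABCD", per_treatment=4, seed=2024)
for unit, treatment in assignment.items():
    print(unit, treatment)
```

Because the mechanism is fully specified, anyone can rerun it and obtain a statistically equivalent randomization, which is exactly what the haphazard schemes cannot offer.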
Introducing more randomness into an experiment may seem like a perverse thing to do. After all, we are always battling against random experimental error. However, random assignment of treatments to units has two useful consequences:
1. Randomization protects against confounding.
2. Randomization can form the basis for inference.
Randomization is rarely used for inference in practice, primarily due to computational difficulties. Furthermore, some statisticians (Bayesian statisticians in particular) disagree about the usefulness of randomization as a basis for inference.[1] However, the success of randomization in the protection against confounding is so overwhelming that randomization is almost universally recommended.
2.1 Randomization Against Confounding
We defined confounding as occurring when the effect of one factor or treatment cannot be distinguished from that of another factor or treatment. How does randomization help prevent confounding? Let’s start by looking at the trouble that can happen when we don’t randomize.
Consider a new drug treatment for coronary artery disease. We wish to compare this drug treatment with bypass surgery, which is costly and invasive. We have 100 patients in our pool of volunteers who have agreed via informed consent to participate in our study; they need to be assigned to the two treatments. We then measure five-year survival as a response.

[1] Statisticians don’t always agree on philosophy or methodology. This is the first of several ongoing little debates that we will encounter.
What sort of trouble can happen if we fail to randomize? Bypass surgery is a major operation, and patients with severe disease may not be strong enough to survive the operation. It might thus be tempting to assign the stronger patients to surgery and the weaker patients to the drug therapy. This confounds strength of the patient with treatment differences. The drug therapy would likely have a lower survival rate because it is getting the weakest patients, even if the drug therapy is every bit as good as the surgery.
Alternatively, perhaps only small quantities of the drug are available early in the experiment, so that we assign more of the early patients to surgery, and more of the later patients to drug therapy. There will be a problem if the early patients are somehow different from the later patients. For example, the earlier patients might be from your own practice, and the later patients might be recruited from other doctors and hospitals. The patients could differ by age, socioeconomic status, and other factors that are known to be associated with survival.
There are several potential randomization schemes for this experiment; here are two:
• Toss a coin for every patient: on heads the patient gets the drug, on tails the patient gets surgery.
• Make up a basket with 50 red balls and 50 white balls well mixed together. Each patient gets a randomly drawn ball; red balls lead to surgery, white balls lead to drug therapy.
Note that for coin tossing the numbers of patients in the two treatment groups are random, while the numbers are fixed for the colored ball scheme.
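The difference between the two schemes is easy to see in a small simulation. This is a sketch only, assuming Python's `random` module as the source of randomness; the function names and the seed are hypothetical and chosen for illustration.

```python
import random

def coin_toss_design(n_patients, rng):
    # Each patient independently gets the drug (heads) or surgery (tails).
    return [rng.choice(["drug", "surgery"]) for _ in range(n_patients)]

def colored_ball_design(n_patients, rng):
    # Half red balls (surgery) and half white balls (drug),
    # drawn one per patient without replacement.
    n_red = n_patients // 2
    balls = ["surgery"] * n_red + ["drug"] * (n_patients - n_red)
    rng.shuffle(balls)
    return balls

rng = random.Random(1)  # arbitrary seed, used only for reproducibility
coin = coin_toss_design(100, rng)
balls = colored_ball_design(100, rng)

# Group sizes under coin tossing depend on the tosses.
print(coin.count("drug"), coin.count("surgery"))
# Group sizes under the ball scheme are fixed at 50 and 50.
print(balls.count("drug"), balls.count("surgery"))
```

Rerunning with different seeds changes the coin-toss group sizes but never the ball-scheme group sizes.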
Here is how randomization has helped us. No matter which features of the population of experimental units are associated with our response, our randomizations put approximately half the patients with these features in each treatment group. Approximately half the men get the drug; approximately half the older patients get the drug; approximately half the stronger patients get the drug; and so on. These are not exactly 50/50 splits, but the deviation from an even split follows rules of probability that we can use when making inference about the treatments.
This example is, of course, an oversimplification. A real experimental design would include considerations for age, gender, health status, and so on. The beauty of randomization is that it helps prevent confounding, even for factors that we do not know are important.
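The "approximately half" claim can be checked by repeating a randomization many times. The sketch below assumes a made-up pool in which 40 of the 100 patients are "strong"; that number, the seed, and the variable names are illustrative assumptions only.

```python
import random

rng = random.Random(7)  # arbitrary seed

# Hypothetical pool: patients 0..39 are "strong", patients 40..99 are not.
is_strong = [True] * 40 + [False] * 60

counts = []
for _ in range(2000):
    patients = list(range(100))
    rng.shuffle(patients)
    drug_group = patients[:50]  # first 50 in the random order get the drug
    counts.append(sum(is_strong[p] for p in drug_group))

mean_strong_on_drug = sum(counts) / len(counts)
# The average should be close to 20, half of the 40 strong patients,
# while individual randomizations scatter around that value.
print(mean_strong_on_drug, min(counts), max(counts))
```

The same balancing happens for any feature we might substitute for strength, including features we never thought to record.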
Here is another example of randomization. A company is evaluating two different word processing packages for use by its clerical staff. Part of the evaluation is how quickly a test document can be entered correctly using the two programs. We have 20 test secretaries, and each secretary will enter the document twice, using each program once.
As expected, there are potential pitfalls in nonrandomized designs. Suppose that all secretaries did the evaluation in the order A first and B second. Does the second program have an advantage because the secretary will be familiar with the document and thus enter it faster? Or maybe the second program will be at a disadvantage because the secretary will be tired and thus slower.
Two randomized designs that could be considered are:
1. For each secretary, toss a coin: the secretary will use the programs in the orders AB and BA according to whether the coin is a head or a tail, respectively.
2. Choose 10 secretaries at random for the AB order, the rest get the BA order.
Both these designs are randomized and will help guard against confounding, but the designs are slightly different and we will see that they should be analyzed differently.
Cochran and Cox (1957) draw the following analogy:
Randomization is somewhat analogous to insurance, in that it is a precaution against disturbances that may or may not occur and that may or may not be serious if they do occur. It is generally advisable to take the trouble to randomize even when it is not expected that there will be any serious bias from failure to randomize. The experimenter is thus protected against unusual events that upset his expectations.
Randomization generally costs little in time and trouble, but it can save us from disaster.
2.2 Randomizing Other Things
We have taken a very simplistic view of experiments; “assign treatments to units and then measure responses” hides a multitude of potential steps and choices that will need to be made. Many of these additional steps can be randomized, as they could also lead to confounding. For example:
• If the experimental units are not used simultaneously, you can randomize the order in which they are used.
• If the experimental units are not used at the same location, you can randomize the locations at which they are used.
• If you use more than one measuring instrument for determining response, you can randomize which units are measured on which instruments.
When we anticipate that one of these might cause a change in the response, we can often design that into the experiment (for example, by using blocking; see Chapter 13). Thus I try to design for the known problems, and randomize everything else.
Example 2.1 One tale of woe
I once evaluated data from a study that was examining cadmium and other metal concentrations in soils around a commercial incinerator. The issue was whether the concentrations were higher in soils near the incinerator. They had eight sites selected (matched for soil type) around the incinerator, and took ten random soil samples at each site.
The samples were all sent to a commercial lab for analysis. The analysis was long and expensive, so they could only do about ten samples a day. Yes indeed, there was almost a perfect match of sites and analysis days. Several elements, including cadmium, were only present in trace concentrations, concentrations that were so low that instrument calibration, which was done daily, was crucial. When the data came back from the lab, we had a very good idea of the variability of their calibrations, and essentially no idea of how the sites differed.
The lab was informed that all the trace analyses, including cadmium, would be redone, all on one day, in a random order that we specified. Fortunately I was not a party to the question of who picked up the $75,000 tab for reanalysis.
2.3 Performing a Randomization
Once we decide to use randomization, there is still the problem of actually doing it. Randomizations usually consist of choosing a random order for a set of objects (for example, doing analyses in random order) or choosing random subsets of a set of objects (for example, choosing a subset of units for treatment A). Thus we need methods for putting objects into random orders and choosing random subsets. When the sample sizes for the subsets are fixed and known (as they usually are), we will be able to choose random subsets by first choosing random orders.
Randomization methods can be either physical or numerical. Physical randomization is achieved via an actual physical act that is believed to produce random results with known properties. Examples of physical randomization are coin tosses, card draws from shuffled decks, rolls of a die, and tickets in a hat. I say “believed to produce random results with known properties” because cards can be poorly shuffled, tickets in the hat can be poorly mixed, and skilled magicians can toss coins that come up heads every time. Large-scale embarrassments due to faulty physical randomization include poor mixing of Selective Service draft induction numbers during World War II (see Mosteller, Rourke, and Thomas 1970). It is important to make sure that any physical randomization that you use is done well.
Physical generation of random orders is most easily done with cards or tickets in a hat. We must order N objects. We take N cards or tickets, numbered 1 through N, and mix them well. The first object is then given the number of the first card or ticket drawn, and so on. The objects are then sorted so that their assigned numbers are in increasing order. With good mixing, all orders of the objects are equally likely.
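The ticket procedure translates directly into a few lines of code. This is a sketch under the assumption that `random.shuffle` plays the role of good physical mixing; the function name `random_order` and the seed are illustrative.

```python
import random

def random_order(objects, seed=None):
    """Order objects by drawing numbered tickets, as with a hat."""
    rng = random.Random(seed)
    n = len(objects)
    tickets = list(range(1, n + 1))  # tickets numbered 1 through N
    rng.shuffle(tickets)             # mix them well
    # The i-th object is given the number on the i-th ticket drawn.
    numbered = list(zip(tickets, objects))
    # Sort so that the assigned numbers are in increasing order.
    numbered.sort(key=lambda pair: pair[0])
    return [obj for _, obj in numbered]

order = random_order(["A", "B", "C", "D", "E"], seed=11)
print(order)
```

In practice one could shuffle the objects directly; the two-step version is kept here only because it mirrors the physical description.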
Once we have a random order, random subsets are easy. Suppose that the N objects are to be broken into g subsets with sizes n1, ..., ng, with n1 + ··· + ng = N. For example, eight students are to be grouped into one group of four and two groups of two. First arrange the objects in random order. Once the objects are in random order, assign the first n1 objects to group one, the next n2 objects to group two, and so on. If our eight students were randomly ordered 3, 1, 6, 8, 5, 7, 2, 4, then our three groups would be (3, 1, 6, 8), (5, 7), and (2, 4).
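Cutting a random order into consecutive groups can be sketched as follows, using the eight-student order from the text as input. The function name `split_into_groups` is our own.

```python
def split_into_groups(ordered, sizes):
    """Cut an already-randomized order into consecutive groups of given sizes."""
    if sum(sizes) != len(ordered):
        raise ValueError("group sizes must add up to the number of objects")
    groups, start = [], 0
    for size in sizes:
        groups.append(tuple(ordered[start:start + size]))
        start += size
    return groups

# The random order of the eight students given in the text.
groups = split_into_groups([3, 1, 6, 8, 5, 7, 2, 4], sizes=[4, 2, 2])
print(groups)  # [(3, 1, 6, 8), (5, 7), (2, 4)]
```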
Numerical randomization uses numbers taken from a table of “random” numbers or generated by a “random” number generator in computer software. For example, Appendix Table D.1 contains random digits. We use the table or a generator to produce a random ordering for our N objects, and then proceed as for physical randomization if we need random subsets.
We get the random order by obtaining a random number for each object, and then sorting the objects so that the random numbers are in increasing order. Start arbitrarily in the table and read numbers of the required size sequentially from the table. If any number is a repeat of an earlier number, replace the repeat by the next number in the list so that you get N different numbers. For example, suppose that we need 5 numbers and that the random numbers in the table are (4, 3, 7, 4, 6, 7, 2, 1, 9, ...). Then our 5 selected numbers would be (4, 3, 7, 6, 2), the duplicates of 4 and 7 being discarded. Now arrange the objects so that their selected numbers are in ascending order. For the sample numbers, the objects A through E would be reordered E, B, A, D, C. Obviously, you need numbers with more digits as N gets larger.
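The table-reading procedure, including the discarding of repeats, can be sketched as below. The digits are those quoted in the example (up to the ellipsis); the function names are hypothetical.

```python
def first_distinct(stream, n):
    """Read numbers in sequence, skipping repeats, until n distinct ones are found."""
    selected = []
    for value in stream:
        if value not in selected:
            selected.append(value)
        if len(selected) == n:
            return selected
    raise ValueError("ran out of numbers before finding n distinct values")

def order_by_keys(objects, keys):
    """Sort objects so that their attached random numbers are ascending."""
    return [obj for _, obj in sorted(zip(keys, objects))]

table_digits = [4, 3, 7, 4, 6, 7, 2, 1, 9]  # digits as read from the table
keys = first_distinct(table_digits, 5)
order = order_by_keys("ABCDE", keys)

print(keys)   # [4, 3, 7, 6, 2]
print(order)  # ['E', 'B', 'A', 'D', 'C']
```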
Getting rid of duplicates makes this procedure a little tedious. You will have fewer duplicates if you use numbers with more digits than are absolutely necessary. For example, for 9 objects, we could use two- or three-digit numbers, and for 30 objects we could use three- or four-digit numbers. The probabilities of 9 random one-, two-, and three-digit numbers having no duplicates are .004, .690, and .965; the probabilities of 30 random two-, three-, and four-digit numbers having no duplicates are .008, .644, and .957 respectively.
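These probabilities follow from the usual "birthday problem" product. The sketch below assumes that a d-digit number is equally likely to be any of the 10^d values from 00...0 to 99...9; the function name is our own.

```python
def prob_no_duplicates(n_numbers, n_digits):
    """Probability that n_numbers independent d-digit random numbers are all distinct."""
    m = 10 ** n_digits  # number of possible values
    p = 1.0
    for i in range(n_numbers):
        # The (i+1)-th number must avoid the i values already seen.
        p *= (m - i) / m
    return p

for n, digit_options in [(9, (1, 2, 3)), (30, (2, 3, 4))]:
    print(n, [round(prob_no_duplicates(n, d), 3) for d in digit_options])
# 9 [0.004, 0.69, 0.965]
# 30 [0.008, 0.644, 0.957]
```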
Many computer software packages (and even calculators) can produce “random” numbers. Some produce random integers, others numbers between 0 and 1. In either case, you use these numbers as you would numbers formed by a sequence of digits from a random number table. Suppose that we needed to put 6 units, labeled A through F, into random order, and that our random number generator produced the following numbers: .52983, .37225, .99139, .48011, .69382, .61181. Associate the 6 units with these random numbers in the order given. The second unit has the smallest random number, so the second unit is first in the ordering; the fourth unit has the next smallest random number, so it is second in the ordering; and so on. Thus the random order of the units is B, D, A, F, E, C.
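The six-unit example can be reproduced by sorting the units on their attached random numbers. In this sketch the six numbers are those quoted in the text; the helper `random_order_from_uniforms`, which draws fresh numbers with Python's `random` module, is a hypothetical addition.

```python
import random

units = ["A", "B", "C", "D", "E", "F"]
quoted = [.52983, .37225, .99139, .48011, .69382, .61181]

# Sort the units so that their random numbers are in increasing order.
order = [unit for _, unit in sorted(zip(quoted, units))]
print(order)  # ['B', 'D', 'A', 'F', 'E', 'C']

def random_order_from_uniforms(units, rng):
    """Same idea, but with freshly generated numbers between 0 and 1."""
    keys = [rng.random() for _ in units]
    return [unit for _, unit in sorted(zip(keys, units))]

fresh = random_order_from_uniforms(units, random.Random(3))  # arbitrary seed
print(fresh)
```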
The word random is quoted above because these numbers are not truly random. The numbers in the table are the same every time you read it; they don’t change unpredictably when you open the book. The numbers produced by the software package are from an algorithm; if you know the algorithm you can predict the numbers perfectly. They are technically pseudorandom numbers; that is, numbers that possess many of the attributes of random numbers so that they appear to be random and can usually be used in place of random numbers.
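The predictability of pseudorandom numbers is easy to demonstrate. The sketch below uses Python's `random.Random` as an example generator (an assumption on our part; any seeded generator behaves similarly), with an arbitrary seed of 12345.

```python
import random

# Two generators started from the same seed (the seed value is arbitrary).
gen1 = random.Random(12345)
gen2 = random.Random(12345)

seq1 = [gen1.random() for _ in range(5)]
seq2 = [gen2.random() for _ in range(5)]

# Knowing the algorithm and its starting state, the "random" numbers
# are perfectly predictable: both sequences are identical.
print(seq1 == seq2)  # True
```

This predictability is also useful: recording the seed lets someone else reproduce exactly the randomization that was used.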
2.4 Randomization for Inference
Nearly all the analysis that we will do in this book is based on the normal distribution and linear models and will use t-tests, F-tests, and the like. As we will see in great detail later, these procedures make assumptions such as “The responses in treatment group A are independent from unit to unit and follow a normal distribution with mean µ and variance σ².” Nowhere in the design of our experiment did we do anything to make this so; all we did was randomize treatments to units and observe responses.
Table 2.1: Auxiliary manual times runstitching a collar for 30 workers under standard (S) and ergonomic (E) conditions.
# S E # S E # S E
1 4.90 3.87 11 4.70 4.25 21 5.06 5.54
2 4.50 4.54 12 4.77 5.57 22 4.44 5.52
3 4.86 4.60 13 4.75 4.36 23 4.46 5.03
4 5.57 5.27 14 4.60 4.35 24 5.43 4.33
5 4.62 5.59 15 5.06 4.88 25 4.83 4.56
6 4.65 4.61 16 5.51 4.56 26 5.05 5.50
7 4.62 5.19 17 4.66 4.84 27 5.78 5.16
8 6.39 4.64 18 4.95 4.24 28 5.10 4.89
9 4.36 4.35 19 4.75 4.33 29 4.68 4.89
10 4.91 4.49 20 4.67 4.24 30 6.06 5.24
In fact, randomization itself can be used as a basis for inference. The advantage of this randomization approach is that it relies only on the randomization that we performed. It does not need independence, normality, and the other assumptions that go with linear models. The disadvantage of the randomization approach is that it can be difficult to implement, even in relatively small problems, though computers make it much easier. Furthermore, the inference that randomization provides is often indistinguishable from that of standard techniques such as ANOVA.
Now that computers are powerful and common, randomization inference procedures can be done with relatively little pain. These ideas of randomization inference are best shown by example. Below we introduce the ideas of randomization inference using two extended examples, one corresponding to a paired t-test, and one corresponding to a two sample t-test.
2.4.1 The paired t-test
Bezjak and Knez (1995) provide data on the length of time it takes garment workers to runstitch a collar on a man’s shirt, using a standard workplace and a more ergonomic workplace. Table 2.1 gives the “auxiliary manual time” per collar in seconds for 30 workers using both systems.
One question of interest is whether the times are the same on average for the two workplaces. Formally, we test the null hypothesis that the average runstitching time for the standard workplace is the same as the average runstitching time for the ergonomic workplace.
Table 2.2: Differences in runstitching times (standard − ergonomic).
1.03 -.04 .26 .30 -.97 .04 -.57 1.75 .01 .42
.45 -.80 .39 .25 .18 .95 -.18 .71 .42 .43
-.48 -1.08 -.57 1.10 .27 -.45 .62 .21 -.21 .82

A paired t-test is the standard procedure for testing this null hypothesis. We use a paired t-test because each worker was measured twice, once for each workplace, so the observations on the two workplaces are dependent. Fast workers are probably fast for both workplaces, and slow workers are slow for both. Thus what we do is compute the difference (standard − ergonomic) for each worker, and test the null hypothesis that the average of these differences is zero using a one sample t-test on the differences.
Table 2.2 gives the differences between standard and ergonomic times.
Recall the setup for a one sample t-test. Let d1, d2, ..., dn be the n differences in the sample. We assume that these differences are independent samples from a normal distribution with mean µ and variance σ², both unknown. Our null hypothesis is that the mean µ equals a prespecified value µ0 = 0 (H0: µ = µ0 = 0), and our alternative is H1: µ > 0 because we expect the workers to be faster in the ergonomic workplace.
The formula for a one sample t-test is
\[ t = \frac{\bar{d} - \mu_0}{s/\sqrt{n}} , \]
where $\bar{d}$ is the mean of the data (here the differences $d_1, d_2, \ldots, d_n$), $n$ is the sample size, and $s$ is the sample standard deviation (of the differences)
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2} . \]
If our null hypothesis is correct and our assumptions are true, then the t-statistic follows a t-distribution with n − 1 degrees of freedom.
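As a check on the arithmetic, the t-statistic for the differences in Table 2.2 can be computed directly from the formulas for t and s. This is a sketch using only the Python standard library; the variable names are our own, and the p-value is not computed because the standard library has no t-distribution function.

```python
import math

# Differences (standard - ergonomic) from Table 2.2.
diffs = [
    1.03, -.04, .26, .30, -.97, .04, -.57, 1.75, .01, .42,
    .45, -.80, .39, .25, .18, .95, -.18, .71, .42, .43,
    -.48, -1.08, -.57, 1.10, .27, -.45, .62, .21, -.21, .82,
]

n = len(diffs)
mu0 = 0.0                        # null hypothesis value of the mean
d_bar = sum(diffs) / n           # mean of the differences
# Sample standard deviation with the n - 1 divisor.
s = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
t = (d_bar - mu0) / (s / math.sqrt(n))

print(round(d_bar, 3), round(s, 3), round(t, 2))  # 0.175 0.645 1.49
```

The statistic would then be compared with a t-distribution with n − 1 = 29 degrees of freedom.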
The p-value for a test is the probability, assuming that the null hypothesis is true, of observing a test statistic as extreme or more extreme than the one we did observe. “Extreme” means away from the null hypothesis towards the alternative hypothesis. Our alternative here is that the true average is larger than the null hypothesis value, so larger values of the test statistic are extreme. Thus the p-value is the area under the t-curve with n − 1 degrees of freedom from the observed t-value to the right. (If the alternative had been µ < µ0, then the p-value is the area under the curve to the left of our test