2.4 EXTENSIONS FOR $I \times J$ TABLES
2.4.4 Ordinal Measure of Association: Gamma
Given that a pair is untied on both variables, $\Pi_c/(\Pi_c + \Pi_d)$ is the probability of concordance and $\Pi_d/(\Pi_c + \Pi_d)$ is the probability of discordance. The difference between these probabilities is
$$\gamma = \frac{\Pi_c - \Pi_d}{\Pi_c + \Pi_d}, \tag{2.14}$$
called gamma (Goodman and Kruskal 1954). The sample version is $\hat{\gamma} = (C - D)/(C + D)$.
Like the correlation, gamma treats the variables symmetrically; it is unnecessary to identify one classification as a response variable. Also like the correlation, gamma has range $-1 \le \gamma \le 1$. A reversal in the category orderings of one variable causes a change in the sign of $\gamma$. Whereas the absolute value of the correlation is 1 when the relationship between X and Y is perfectly linear, only monotonicity is required for $|\gamma| = 1$, with $\gamma = 1$ if $\Pi_d = 0$ and $\gamma = -1$ if $\Pi_c = 0$. Independence implies that $\gamma = 0$, but the converse is not true. For instance, a U-shaped joint distribution can have $\Pi_c = \Pi_d$ and hence $\gamma = 0$.
2.4.5 Gamma for Job Satisfaction Example
For Table 2.8, $C = 1331$ and $D = 849$. Hence,
$$\hat{\gamma} = (1331 - 849)/(1331 + 849) = 0.221.$$
Only a weak tendency exists for job satisfaction to increase as income increases. Of the untied pairs, the proportion of concordant pairs is 0.221 higher than the proportion of discordant pairs.
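The calculation above can be sketched in Python. This is a minimal illustration, not code from the text: the function names and the small 2 x 2 table are assumptions made for the example, and only the values C = 1331 and D = 849 are taken from the job satisfaction example.

```python
def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in a table of counts
    whose rows and columns are both in increasing category order."""
    C = D = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:
                C += count * sum(lower_row[j + 1:])  # higher on X and on Y
                D += count * sum(lower_row[:j])      # higher on X, lower on Y
    return C, D


def gamma_from_counts(C, D):
    """Sample gamma: (C - D) / (C + D)."""
    return (C - D) / (C + D)


# Values quoted in the text for Table 2.8.
gamma_job = gamma_from_counts(1331, 849)
print(round(gamma_job, 3))

# Hypothetical 2 x 2 table, used only to exercise the pair counting.
toy = [[3, 1],
       [1, 3]]
C_toy, D_toy = concordant_discordant(toy)
print(C_toy, D_toy, gamma_from_counts(C_toy, D_toy))
```

Pairs that share a row or a column are never counted, which matches the restriction to pairs untied on both variables.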
NOTES
Section 2.2: Comparing Two Proportions
2.1. Breslow (1996) presented an interesting overview of the development of methods for case–control studies.
2.2. For $2 \times 2$ tables, Edwards (1963) showed that functions of the odds ratio are the only statistics that are invariant both to row–column interchange and to multiplication within rows or within columns by a constant. For $I \times J$ tables, Altham (1970) gave related results. Yule (1912, p. 587) had argued that multiplicative invariance is a desirable property for measures of association, especially when proportions sampled in various marginal categories are arbitrary. Goodman (2000) showed five ways of viewing association in a $2 \times 2$ table and proposed a general measure that includes all five.
Section 2.3: Partial Association in Stratified $2 \times 2$ Tables
2.3. Paik (1985) proposed circle diagrams of the type in Figure 2.2 to summarize three-way tables. Friendly (2000) discussed graphical presentation of categorical data. For more on Simpson’s paradox and when it can happen, see Blyth (1972), Davis (1989), Dong (1998), Samuels (1993), and Simpson (1951). Good and Mittal (1989) extended it to an amalgamation paradox, whereby a marginal measure is greater than the maximum or less than the minimum of the partial table measures.
Section 2.4: Extensions for $I \times J$ Tables
2.4. For continuous variables, samples can be fully ranked (i.e., no ties occur), so $C + D = \binom{n}{2}$ and $\hat{\gamma} = (C - D)/\binom{n}{2}$. This is Kendall’s tau. Agresti (1984, Chaps. 9 and 10) and Kruskal (1958) surveyed ordinal measures of association. These also apply when one variable is ordinal and the other is binary. When Y is ordinal and X is nominal with $I > 2$, no measure presented in Section 2.4 is very helpful. Ordinal modeling approaches (Section 7.2) use a parameter for each category of X; comparing parameters compares the ordinal response for pairs of categories of X.
PROBLEMS
Applications
2.1 An article in the New York Times (Feb. 17, 1999) about the PSA blood test for detecting prostate cancer stated: “The test fails to detect prostate cancer in 1 in 4 men who have the disease (false-negative results), and as many as two-thirds of the men tested receive false-positive results.” Let $C$ ($\bar{C}$) denote the event of having (not having) prostate cancer, and let $+$ ($-$) denote a positive (negative) test result. Which is true: $P(- \mid C) = 1/4$ or $P(C \mid -) = 1/4$? $P(\bar{C} \mid +) = 2/3$ or $P(+ \mid \bar{C}) = 2/3$? Determine the sensitivity and specificity.
2.2 A diagnostic test has sensitivity = specificity = 0.80. Find the odds ratio between true disease status and the diagnostic test result.
2.3 Table 2.9 is based on records of accidents in 1988 compiled by the Department of Highway Safety and Motor Vehicles in Florida. Identify the response variable, and find and interpret the difference of proportions, relative risk, and odds ratio. Why are the relative risk and odds ratio approximately equal?
TABLE 2.9 Data for Problem 2.3
                                  Injury
Safety Equipment in Use     Fatal     Nonfatal
None                         1601      162,527
Seat belt                     510      412,368
Source: Florida Department of Highway Safety and Motor Vehicles.
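As a rough aid for Problem 2.3, the three measures can be computed from the counts in Table 2.9 as follows. This is a sketch under the assumption that "fatal injury" is treated as the outcome of interest; the function name is hypothetical.

```python
def two_by_two_measures(table):
    """Return (difference of proportions, relative risk, odds ratio).

    table = [[a, b], [c, d]]: rows are groups, and the first column is
    the outcome of interest (here, a fatal injury).
    """
    (a, b), (c, d) = table
    p1 = a / (a + b)
    p2 = c / (c + d)
    return p1 - p2, p1 / p2, (a * d) / (b * c)


# Counts from Table 2.9: rows = (None, Seat belt), columns = (Fatal, Nonfatal).
table_2_9 = [[1601, 162527],
             [510, 412368]]
diff, rr, odds_ratio = two_by_two_measures(table_2_9)
print(f"difference = {diff:.4f}, relative risk = {rr:.2f}, odds ratio = {odds_ratio:.2f}")
```

Both fatality proportions are close to zero here, which is the situation in which the odds ratio and the relative risk take similar values.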
2.4 Consider the following two studies reported in the New York Times.
a. A British study reported (Dec. 3, 1998) that of smokers who get lung cancer, “women were 1.7 times more vulnerable than men to get small-cell lung cancer.” Is 1.7 the odds ratio or the relative risk?
b. A National Cancer Institute study about tamoxifen and breast cancer reported (Apr. 7, 1998) that the women taking the drug were 45% less likely to experience invasive breast cancer than were women taking placebo. Find the relative risk for (i) those taking the drug compared to those taking placebo, and (ii) those taking placebo compared to those taking the drug.
2.5 A study (E. G. Krug et al., Internat. J. Epidemiol., 27: 214–221, 1998) reported that the number of gun-related deaths per 100,000 people in 1994 was 14.24 in the United States, 4.31 in Canada, 2.65 in Australia, 1.24 in Germany, and 0.41 in England and Wales. Use the relative risk to compare the United States with the other countries. Interpret.
2.6 A newspaper article preceding the 1994 World Cup semifinal match between Italy and Bulgaria stated that “Italy is favored 10–11 to beat Bulgaria, which is rated at 10–3 to reach the final.” Suppose that this means that the odds that Italy wins are 11/10 and the odds that Bulgaria wins are 3/10. Find the probability that each team wins, and comment.
2.7 In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals 0.001304 for current smokers and 0.000121 for nonsmokers (M. Pagano and K. Gauvreau, Principles of Biostatistics, Duxbury Press, Pacific Grove, CA, 1993, p. 134).
a. Find and interpret the difference of proportions and the relative risk. Which measure is more informative for these data? Why?
b. Find and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values.
2.8 For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4. (For data, see R. J. M. Dawson, J. Statist. Ed. 3, 1995.)
a. What is wrong with the interpretation, “The probability of survival for females was 11.4 times that for males”? Give the correct interpretation. When would the quoted interpretation be approximately correct?
b. The odds of survival for females equaled 2.9. For each gender, find the proportion who survived.
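The conversion from odds back to a proportion that part (b) calls for can be sketched as below. The helper name is an assumption; the inputs 2.9 and 11.4 are the values quoted in Problem 2.8.

```python
def odds_to_probability(odds):
    """Convert odds = p / (1 - p) back to the probability p."""
    return odds / (1 + odds)


odds_female = 2.9    # odds of survival for females, from part (b)
odds_ratio = 11.4    # female-to-male odds ratio, from the problem statement
odds_male = odds_female / odds_ratio

p_female = odds_to_probability(odds_female)
p_male = odds_to_probability(odds_male)
print(f"P(survive | female) = {p_female:.3f}")
print(f"P(survive | male)   = {p_male:.3f}")
print(f"ratio of probabilities = {p_female / p_male:.2f}")
```

The printed ratio of probabilities is well below 11.4, which illustrates why the odds ratio should not be read as a ratio of probabilities when the probabilities are not small.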
2.9 In an article about crime in the United States, Newsweek (Jan. 10, 1994) quoted FBI statistics for 1992 stating that of blacks slain, 94% were slain by blacks, and of whites slain, 83% were slain by whites. Let Y = race of victim and X = race of murderer. Which conditional distribution do these statistics refer to, $Y \mid X$ or $X \mid Y$? What additional information would you need to estimate the probability that the victim was white given that a murderer was white? Find and interpret the odds ratio.
2.10 A research study estimated that under a certain condition, the probability that a subject would be referred for heart catheterization was 0.906 for whites and 0.847 for blacks.
a. A press release about the study stated that the odds of referral for cardiac catheterization for blacks are 60% of the odds for whites. Explain how they obtained 60% (more accurately, 57%).
b. An Associated Press story later described the study and said “Doctors were only 60% as likely to order cardiac catheterization for blacks as for whites.” Explain what is wrong with this interpretation. Give the correct percentage for this interpretation. (In stating results to the general public, it is better to use the relative risk than the odds ratio. It is simpler to understand and less likely to be misinterpreted. For details, see New Engl. J. Med. 341: 279–283, 1999.)
2.11 A 20-year cohort study of British male physicians (R. Doll and R. Peto, British Med. J. 2: 1525–1536, 1976) noted that the proportion per year who died from lung cancer was 0.00140 for cigarette smokers and 0.00010 for nonsmokers. The proportion who died from coronary heart disease was 0.00669 for smokers and 0.00413 for nonsmokers.
a. Describe the association of smoking with each of lung cancer and heart disease, using the difference of proportions, relative risk, and odds ratio. Interpret.
b. Which response is more strongly related to cigarette smoking, in terms of the reduction in number of deaths that would occur with elimination of cigarettes? Explain.
2.12 Table 2.10 refers to applicants to graduate school at the University of California at Berkeley, for fall 1973. It presents admissions decisions by gender of applicant for the six largest graduate departments. Denote the three variables by A = whether admitted, G = gender, and D = department. Find the sample AG conditional odds ratios and the marginal odds ratio. Interpret, and explain why they give such different indications of the AG association.
TABLE 2.10 Data for Problem 2.12
                     Whether Admitted
                  Male             Female
Department     Yes      No      Yes      No
A              512     313       89      19
B              353     207       17       8
C              120     205      202     391
D              138     279      131     244
E               53     138       94     299
F               22     351       24     317
Total         1198    1493      557    1278
Source: Data from Freedman et al. (1978, p. 14). See also P. Bickel et al., Science 187: 398–403 (1975).
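One possible way to obtain the conditional and marginal odds ratios that Problem 2.12 asks for is sketched below. The orientation of the odds ratio (odds of admission for males divided by odds for females) and all names are assumptions made for this example; the counts are those in Table 2.10.

```python
# Counts from Table 2.10: department -> (male yes, male no, female yes, female no).
admissions = {
    "A": (512, 313, 89, 19),
    "B": (353, 207, 17, 8),
    "C": (120, 205, 202, 391),
    "D": (138, 279, 131, 244),
    "E": (53, 138, 94, 299),
    "F": (22, 351, 24, 317),
}


def odds_ratio(male_yes, male_no, female_yes, female_no):
    """Odds of admission for males divided by odds of admission for females."""
    return (male_yes * female_no) / (male_no * female_yes)


# One conditional (partial-table) odds ratio per department.
conditional = {dept: odds_ratio(*counts) for dept, counts in admissions.items()}

# Collapse over department by summing each column of counts.
totals = tuple(sum(column) for column in zip(*admissions.values()))
marginal = odds_ratio(*totals)

for dept, value in conditional.items():
    print(f"Department {dept}: {value:.2f}")
print(f"Marginal: {marginal:.2f}")
```

Comparing the six department-level values with the single marginal value shows the kind of discrepancy the problem asks the reader to explain.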
2.13 State three “real-world” variables X, Y, and Z for which you expect a marginal association between X and Y but conditional independence controlling for Z.
2.14 Based on 1987 murder rates in the United States, an Associated Press story reported that the probability that a newborn child has of eventually being a murder victim is 0.0263 for nonwhite males, 0.0049 for white males, 0.0072 for nonwhite females, and 0.0023 for white females.
a. Find the conditional odds ratios between race and whether a murder victim, given the gender. Interpret. Do these variables exhibit homogeneous association?
b. Half the newborns are of each gender, for each race. Find the marginal odds ratio between race and whether a murder victim.
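The two parts of Problem 2.14 can be sketched numerically as follows. The dictionary layout and names are illustrative assumptions; the four probabilities are those quoted in the problem, and part (b) uses the stated assumption that half the newborns of each race are of each gender.

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)


# Probabilities quoted in Problem 2.14, keyed by (race, gender).
p_victim = {
    ("nonwhite", "male"): 0.0263,
    ("white", "male"): 0.0049,
    ("nonwhite", "female"): 0.0072,
    ("white", "female"): 0.0023,
}

# Part (a): conditional odds ratios (nonwhite vs. white), given gender.
conditional = {
    gender: odds(p_victim[("nonwhite", gender)]) / odds(p_victim[("white", gender)])
    for gender in ("male", "female")
}

# Part (b): with equal numbers of each gender within each race, the
# marginal probability for a race is the simple average over gender.
p_nonwhite = (p_victim[("nonwhite", "male")] + p_victim[("nonwhite", "female")]) / 2
p_white = (p_victim[("white", "male")] + p_victim[("white", "female")]) / 2
marginal = odds(p_nonwhite) / odds(p_white)

print({g: round(v, 2) for g, v in conditional.items()}, round(marginal, 2))
```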
2.15 At each age level, the death rate is higher in South Carolina than in Maine, but overall, the death rate is higher in Maine. Explain how this could be possible. (For data, see H. Wainer, Chance 12: 44, 1999.)
2.16 A study of the death penalty for cases in Kentucky between 1976 and 1991 (T. Keil and G. Vito, Amer. J. Criminal Justice 20: 17–36, 1995) indicated that the defendant received the death penalty in 8% of the 391 cases in which a white killed a white, in 2% of the 108 cases in which a black killed a black, in 12% of the 57 cases in which a black killed a white, and in 0% of the 18 cases in which a white killed a black. Form the three-way contingency table, obtain the conditional odds ratios between the defendant’s race and the death penalty verdict, interpret those associations, study whether Simpson’s paradox occurs, and explain why the marginal association is so different from the conditional associations.
2.17 An estimated odds ratio for adult females between the presence of squamous cell carcinoma (yes, no) and smoking behavior (smoker, nonsmoker) equals 11.7 when the smoker category has subjects whose smoking level $s$ is $0 < s < 20$ cigarettes per day; it is 26.1 for smokers with $s \ge 20$ cigarettes per day (R. C. Brownson et al., Epidemiology 3: 61–64, 1992). Show that the estimated odds ratio between carcinoma (yes, no) and the smoking levels ($s \ge 20$, $0 < s < 20$) equals 2.2.
2.18 Table 2.11 refers to a retrospective study of lung cancer and tobacco smoking among patients in several English hospitals. The table compares male lung cancer patients with control patients having other diseases, according to the average number of cigarettes smoked daily over a 10-year period preceding the onset of the disease.
a. Find the sample odds of lung cancer at each smoking level and the five odds ratios that pair each level of smoking with no smoking. As smoking increases, is there a trend? Interpret.
b. If the log odds of lung cancer is linearly related to smoking level, the log odds in row $i$ satisfies $\log(\text{odds}_i) = \alpha + \beta i$. Show that this implies that the local odds ratios are identical.
c. Using these data, can you estimate the probability of lung cancer at each level of smoking? Are the estimated odds ratios in part (a) meaningful? Explain.
d. Show that the disease groups are stochastically ordered with respect to their distributions on smoking of cigarettes (see Problem 2.34 and Section 7.3.4). Interpret.
TABLE 2.11 Data for Problem 2.18
                                   Disease Group
Daily Average              Lung Cancer     Control
Number of Cigarettes       Patients        Patients
None                             7             61
<5                              55            129
5–14                           489            570
15–24                          475            431
25–49                          293            154
50+                             38             12
Source: Reprinted with permission from R. Doll and A. B. Hill, British Med. J. 2: 1271–1286 (1952).
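For part (a) of Problem 2.18, the sample odds and the five odds ratios relative to no smoking could be computed along these lines. The variable names are assumptions, and plain hyphens are used in the level labels; the counts are those in Table 2.11.

```python
# Counts from Table 2.11: (smoking level, lung cancer patients, control patients).
table_2_11 = [
    ("None", 7, 61),
    ("<5", 55, 129),
    ("5-14", 489, 570),
    ("15-24", 475, 431),
    ("25-49", 293, 154),
    ("50+", 38, 12),
]

# Sample odds of being a lung cancer patient rather than a control at each level.
sample_odds = {level: cases / controls for level, cases, controls in table_2_11}

# Odds ratios pairing each smoking level with the "None" level.
baseline = sample_odds["None"]
odds_ratios = {level: value / baseline
               for level, value in sample_odds.items() if level != "None"}

for level, value in odds_ratios.items():
    print(f"{level:>6} vs None: odds ratio = {value:.1f}")
```

Because the study sampled patients by disease group, the sample odds themselves depend on how many cases and controls were chosen, which is the issue raised in part (c).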
TABLE 2.12 Data for Problem 2.19
                                 Wife’s Rating of Sexual Fun
                           Never or        Fairly     Very      Almost
Husband’s Rating           Occasionally    Often      Often     Always
Never or occasionally           7             7          2         3
Fairly often                    2             8          3         7
Very often                      1             5          4         9
Almost always                   2             8          9        14
Source: Reprinted with permission from Hout et al. (1987).
2.19 Table 2.12 summarizes responses of 91 married couples in Arizona to a question about how often sex is fun. Find and interpret a measure of association between wife’s response and husband’s response.
2.20 Table 2.13 is from an early study on the death penalty in Florida.
Analyze these data and show that Simpson’s paradox occurs.
TABLE 2.13 Data for Problem 2.20
Victim’s     Defendant’s       Death Penalty
Race         Race              Yes       No
White        White              19      132
             Black              11       52
Black        White               0        9
             Black               6       97
Source: Reprinted with permission from M. L. Radelet, Amer. Sociol. Rev. 46: 918–927 (1981).
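A brief sketch of the comparison Problem 2.20 asks for: the proportion receiving the death penalty by defendant's race, first within each level of victim's race and then after collapsing over victim's race. The data layout and names are assumptions; the counts are those in Table 2.13.

```python
# Counts from Table 2.13: (victim's race, defendant's race) -> (yes, no).
table_2_13 = {
    ("white", "white"): (19, 132),
    ("white", "black"): (11, 52),
    ("black", "white"): (0, 9),
    ("black", "black"): (6, 97),
}


def proportion_yes(yes, no):
    """Proportion of cases that received the death penalty."""
    return yes / (yes + no)


# Conditional (partial-table) proportions, controlling for victim's race.
partial = {key: proportion_yes(*counts) for key, counts in table_2_13.items()}

# Marginal proportions: collapse over victim's race.
marginal = {}
for defendant in ("white", "black"):
    yes = sum(c[0] for (v, d), c in table_2_13.items() if d == defendant)
    no = sum(c[1] for (v, d), c in table_2_13.items() if d == defendant)
    marginal[defendant] = proportion_yes(yes, no)

for key, value in partial.items():
    print(f"victim {key[0]}, defendant {key[1]}: {value:.3f}")
print({d: round(p, 3) for d, p in marginal.items()})
```

The direction of the defendant's-race comparison in the two partial tables differs from the direction in the collapsed table, which is the reversal the problem refers to.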
Theory and Methods
2.21 For a diagnostic test of a certain disease, $\pi_1$ denotes the probability that the diagnosis is positive given that a subject has the disease, and $\pi_2$ denotes the probability that the diagnosis is positive given that a subject does not have it. Let $\rho$ denote the probability that a subject does have the disease.
a. Given that the diagnosis is positive, show that the probability that a subject does have the disease is
$$\pi_1 \rho / [\pi_1 \rho + \pi_2 (1 - \rho)].$$
b. Suppose that a diagnostic test for HIV+ status has both sensitivity and specificity equal to 0.95, and $\rho = 0.005$. Find the probability that a subject is truly HIV+, given that the diagnostic test is positive. To better understand this answer, find the joint probabilities relating diagnosis to actual disease status, and discuss their relative sizes.
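Part (b) can be checked numerically with a short sketch. Here `pi1`, `pi2`, and `rho` are names chosen for this example to stand for P(positive | disease), P(positive | no disease), and the probability of having the disease; the formula is the one stated in part (a).

```python
def positive_predictive_value(pi1, pi2, rho):
    """P(disease | positive) = pi1*rho / (pi1*rho + pi2*(1 - rho))."""
    return pi1 * rho / (pi1 * rho + pi2 * (1 - rho))


sensitivity = 0.95
specificity = 0.95
rho = 0.005                  # probability that a subject has the disease

pi1 = sensitivity            # P(positive | disease)
pi2 = 1 - specificity        # P(positive | no disease)
ppv = positive_predictive_value(pi1, pi2, rho)

# Joint probabilities for (true status, diagnosis).
joint = {
    ("disease", "positive"): pi1 * rho,
    ("disease", "negative"): (1 - pi1) * rho,
    ("no disease", "positive"): pi2 * (1 - rho),
    ("no disease", "negative"): (1 - pi2) * (1 - rho),
}

print(f"P(disease | positive) = {ppv:.4f}")
for key, value in joint.items():
    print(key, round(value, 5))
```

In the joint distribution, the false-positive cell is much larger than the true-positive cell, which is why the conditional probability is small despite the high sensitivity and specificity.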
2.22 Binomial parameters for two groups are graphed, with $\pi_1$ on the horizontal axis and $\pi_2$ on the vertical axis. Plot the locus of points for a $2 \times 2$ table having (a) relative risk = 0.5, (b) odds ratio = 0.5, and (c) difference of proportions = −0.5.
2.23 Let D denote having a certain disease and E denote having exposure to a certain risk factor. The attributable risk (AR) is the proportion of disease cases attributable to that exposure (see Benichou 1998).
a. Let $P(\bar{E}) = 1 - P(E)$. Explain why
$$AR = [P(D) - P(D \mid \bar{E})]/P(D).$$
b. Show that AR relates to the relative risk RR by
$$AR = [P(E)(RR - 1)]/[1 + P(E)(RR - 1)].$$
2.24 For a $2 \times 2$ table of counts $\{n_{ij}\}$, show that the odds ratio is invariant to (a) interchanging rows with columns, and (b) multiplication of cell counts within rows or within columns by $c \ne 0$. Show that the difference of proportions and the relative risk do not have these properties.
2.25 For given $\pi_1$ and $\pi_2$, show that the relative risk cannot be farther than the odds ratio from their independence value of 1.0.
2.26 Explain why for three events $E_1$, $E_2$, and $E_3$ and their complements, it is possible that $P(E_1 \mid E_2) > P(E_1 \mid \bar{E}_2)$ even if both $P(E_1 \mid E_2 E_3) < P(E_1 \mid \bar{E}_2 E_3)$ and $P(E_1 \mid E_2 \bar{E}_3) < P(E_1 \mid \bar{E}_2 \bar{E}_3)$. (Hint: Use Simpson’s paradox for a three-way table.)
2.27 Let $\pi_{ij|k} = P(X = i, Y = j \mid Z = k)$. Explain why XY conditional independence is
$$\pi_{ij|k} = \pi_{i+|k}\,\pi_{+j|k} \quad \text{for all } i \text{ and } j \text{ and } k.$$
2.28 For a $2 \times 2 \times 2$ table, show that homogeneous association is a symmetric property, by showing that equal XY conditional odds ratios is equivalent to equal YZ conditional odds ratios.
2.29 Smith and Jones are baseball players. Smith has a higher batting average than Jones in each of K years. Is it possible that for the combined data from the K years, Jones has the higher batting average? Explain, using an example to illustrate.
2.30 When X and Y are conditionally dependent at each level of Z yet marginally independent, Z is called a suppressor variable. Specify joint probabilities for a $2 \times 2 \times 2$ table to show that this can happen (a) when there is homogeneous association, and (b) when the association has opposite direction in the partial tables.
2.31 Show that the $\{\alpha_{ij}\}$ in (2.11) determine (a) all $\binom{I}{2}\binom{J}{2}$ odds ratios formed from pairs of rows and pairs of columns, (b) all $\{\theta_{ij}\}$ in (2.10), and vice versa.
2.32 Refer to Problem 2.31. When all rows and columns have positive probability, show that independence is equivalent to all $\{\alpha_{ij} = 1\}$.
2.33 For $I \times J$ contingency tables, explain why the variables are independent when the $(I - 1)(J - 1)$ differences $\pi_{j|i} - \pi_{j|I} = 0$, $i = 1, \ldots, I - 1$, $j = 1, \ldots, J - 1$.
2.34 A $2 \times J$ table has ordinal response. Let $F_{j|i} = \pi_{1|i} + \cdots + \pi_{j|i}$. When $F_{j|2} \le F_{j|1}$ for $j = 1, \ldots, J$, the conditional distribution in row 2 is stochastically higher than the one in row 1. Consider the cumulative odds ratios
$$\theta_j = \frac{F_{j|1}/(1 - F_{j|1})}{F_{j|2}/(1 - F_{j|2})}, \quad j = 1, \ldots, J - 1.$$
a. Show that $\log \theta_j \ge 0$ for all $j$ is equivalent to row 2 being stochastically higher than row 1. Explain why row 2 is then more likely than row 1 to have observations at the high end of the ordinal scale.
b. If all local log odds ratios are nonnegative, $\log \theta_j \ge 0$ for $1 \le j \le J - 1$ (Lehmann 1966). Show by counterexample that the converse is not true.
2.35 Suppose that $\{Y_{ij}\}$ are independent Poisson variates with means $\{\mu_{ij}\}$. Show that $P(Y_{ij} = n_{ij})$ for all $i, j$, conditional on $\{Y_{i+} = n_i\}$, satisfy independent multinomial sampling [i.e., the product of (2.2) for all $i$] within the rows.
2.36 For $2 \times 2$ tables, Yule (1900, 1912) introduced
$$Q = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}},$$
which he labeled Q in honor of the Belgian statistician Quetelet. It is now called Yule’s Q.
a. Show that for $2 \times 2$ tables, Goodman and Kruskal’s $\gamma = Q$.
b. Show that Q falls between −1 and 1.
c. State conditions under which $Q = -1$ or $Q = 1$.
d. Show that Q relates to the odds ratio by $Q = (\theta - 1)/(\theta + 1)$, a monotone transformation of $\theta$ from the $[0, \infty]$ scale onto the $[-1, +1]$ scale.
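The relation in part (d) can be checked numerically before proving it. The joint probabilities below are hypothetical values chosen only for illustration, and `theta` denotes the odds ratio of the 2 x 2 table.

```python
def yules_q(p11, p12, p21, p22):
    """Yule's Q from the four cell probabilities of a 2 x 2 table."""
    return (p11 * p22 - p12 * p21) / (p11 * p22 + p12 * p21)


def q_from_odds_ratio(theta):
    """Q written as a function of the odds ratio theta."""
    return (theta - 1) / (theta + 1)


# Hypothetical joint probabilities for a 2 x 2 table (they sum to 1).
p11, p12, p21, p22 = 0.4, 0.1, 0.2, 0.3
theta = (p11 * p22) / (p12 * p21)

print(yules_q(p11, p12, p21, p22), q_from_odds_ratio(theta))
```

A numerical agreement for one table is of course not a proof; dividing the numerator and denominator of Q by the product of the two off-diagonal probabilities gives the general argument.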
2.37 When X and Y are ordinal with counts $\{n_{ij}\}$:
a. Explain why the $\binom{n}{2}$ pairs of observations partition into $C + D + T_X + T_Y - T_{XY}$, where $T_X = \sum_i n_{i+}(n_{i+} - 1)/2$ pairs are tied on X, $T_Y$ pairs are tied on Y, and $T_{XY}$ pairs are tied on X and Y.
b. For each ordered pair of observations $(X_a, Y_a)$ and $(X_b, Y_b)$, let $X_{ab} = \text{sign}(X_a - X_b)$ and $Y_{ab} = \text{sign}(Y_a - Y_b)$. Show that the sample correlation for the $n(n - 1)$ distinct $(X_{ab}, Y_{ab})$ pairs is
$$\hat{\tau}_b = \frac{C - D}{\left\{\left[\binom{n}{2} - T_X\right]\left[\binom{n}{2} - T_Y\right]\right\}^{1/2}}.$$
This ordinal measure, called Kendall’s tau-b (Kendall 1945), is less sensitive than gamma to the choice of response categories.
c. Let $d = (C - D)/\left[\binom{n}{2} - T_X\right]$. Explain why $d$ is the difference between the proportions of concordant and discordant pairs out of those pairs untied on X (Somers 1962). (For $2 \times 2$ tables, $d$ equals the difference of proportions, and tau-b equals the correlation between X and Y.)
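A small sketch of tau-b and Somers' d as defined in Problem 2.37, assuming rows index X and columns index Y, both in increasing category order. The function names and the 2 x 2 table are hypothetical; the table is used only to check the closing remark that d equals the difference of proportions for 2 x 2 tables.

```python
from math import comb, sqrt


def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered table."""
    C = D = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:
                C += count * sum(lower_row[j + 1:])  # higher on X and on Y
                D += count * sum(lower_row[:j])      # higher on X, lower on Y
    return C, D


def tau_b_and_somers_d(table):
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    T_X = sum(comb(sum(row), 2) for row in table)        # pairs tied on X
    T_Y = sum(comb(sum(col), 2) for col in zip(*table))  # pairs tied on Y
    C, D = concordant_discordant(table)
    tau_b = (C - D) / sqrt((total_pairs - T_X) * (total_pairs - T_Y))
    d = (C - D) / (total_pairs - T_X)
    return tau_b, d


# Hypothetical 2 x 2 table of counts, chosen only for illustration.
toy = [[3, 1],
       [1, 3]]
tau_b, d = tau_b_and_somers_d(toy)
difference_of_proportions = toy[0][0] / sum(toy[0]) - toy[1][0] / sum(toy[1])
print(tau_b, d, difference_of_proportions)
```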
2.38 Goodman and Kruskal (1954) proposed an association measure (tau) for nominal variables based on variation measure
$$V(Y) = \sum_j \pi_{+j}(1 - \pi_{+j}) = 1 - \sum_j \pi_{+j}^2.$$
a. Show $V(Y)$ is the probability that two independent observations on Y fall in different categories (called the Gini concentration index). Show that $V(Y) = 0$ when $\pi_{+j} = 1$ for some $j$ and $V(Y)$ takes maximum value of $(J - 1)/J$ when $\pi_{+j} = 1/J$ for all $j$.
b. For the proportional reduction in variation, show that $E[V(Y \mid X)] = 1 - \sum_i \sum_j \pi_{ij}^2/\pi_{i+}$. [The resulting measure (2.12) is called the concentration coefficient. Like U, $\tau = 0$ is equivalent to independence. Haberman (1982) presented generalized concentration and uncertainty coefficients.]
2.39 The measure of association lambda for nominal variables (Goodman and Kruskal 1954) has $V(Y) = 1 - \max\{\pi_{+j}\}$ and $V(Y \mid i) = 1 - \max_j\{\pi_{j|i}\}$. Interpret lambda as a proportional reduction in prediction error for predictions which select the response category that is most likely. Show that independence implies $\lambda = 0$ but that the converse is not true.
CHAPTER 3
Inference for Contingency Tables
In this chapter we introduce inferential methods for contingency tables.
Many of these methods also play a vital role in analyses of later chapters for which categorical data need not have contingency table form. The methods assume Poisson, multinomial, or independent binomial sampling.
In Section 3.1 we present confidence intervals for measures of association for $2 \times 2$ tables such as the odds ratio. Section 3.2 covers chi-squared tests of the hypothesis of independence between two categorical variables. Like any significance test, these have limited usefulness. In Section 3.3 we show how to follow up the test using residuals or the partitioning property of chi-squared to extract components that describe the evidence about the association. In Section 3.4 we present more powerful inference applicable with ordered categories. The methods of Sections 3.1 through 3.4 assume large samples. In Sections 3.5 and 3.6 we introduce small-sample methods.
3.1 CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS