1 Prior and Likelihood Choices in the Analysis of Ecological Data
1.7 REGISTRATION–RACE EXAMPLE
In this section we return to the registration–race data. We first analyze in detail two counties that were considered by King, Rosen, and Tanner (1999). The data from these counties are given in Tables 1.3 and 1.4.
1.7.1 County 150
King, Rosen, and Tanner (1999) considered this county to demonstrate that their approach can detect multimodalities in the posterior distribution. The bounds on the fractions
34 Jonathan Wakefield
Figure 1.6. Various likelihood surfaces for the data of Table 1.2: (a) convolution likelihood function, (b) normal approximation to the convolution likelihood function, (c) binomial likelihood function of King, Rosen, and Tanner (1999), (d) tomography line “likelihood” of King (1997).
Figure 1.7. Posterior summaries for the data of Table 1.2 under a uniform prior: (a) π(p0|y), (b) π(y0|y), (c) π(p1|y), (d) π(y1|y), (e) π(p0, p1|y), (f) π(p1 − p0|y).
Table 1.3 Voter registration–race data for county 150 of King (1997)

                 Unregistered   Registered
                 Y = 0          Y = 1          Total
Black, x = 0                                   4,001
White, x = 1                                   11,199
Total            6,800          8,400          15,200
registered for blacks and whites are (0, 1) and (0.39, 0.75), respectively, with p̃1 − p̃0 ∈ (−0.61, 0.75). The MLEs for the convolution likelihood in Equation 1.6 are p̂0 = 0, p̂1 = 0.75.
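The bounds quoted above follow directly from the margins of Table 1.3. A minimal sketch of the calculation is given here; the function name `fraction_bounds` and its argument names are illustrative and not part of the original analysis.

```python
def fraction_bounds(n0, n1, y):
    """Deterministic bounds on the registered fractions in a 2 x 2 table.

    n0, n1: row totals (blacks, whites); y: total number registered.
    """
    p0_bounds = (max(0.0, (y - n1) / n0), min(1.0, y / n0))
    p1_bounds = (max(0.0, (y - n0) / n1), min(1.0, y / n1))
    # The tomography line has negative slope, so the smallest p1 pairs with
    # the largest p0, and vice versa.
    diff_bounds = (p1_bounds[0] - p0_bounds[1], p1_bounds[1] - p0_bounds[0])
    return p0_bounds, p1_bounds, diff_bounds


# County 150 margins from Table 1.3.
p0_b, p1_b, diff_b = fraction_bounds(n0=4001, n1=11199, y=8400)
print([round(v, 2) for v in p0_b + p1_b + diff_b])
```

Rounded to two decimal places, the six values agree with the bounds stated in the text.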
In Figure 1.8a we plot the convolution surface by evaluating the posterior (with independent uniform priors) at a series of grid points as described in Section 1.5.5. For such a large marginal total (15,200) we see that the likelihood is highly concentrated along the tomography line, though the curvature along the line is evident. Figure 1.8b shows the normal approximation, which is very accurate here; Figure 1.8c shows the binomial likelihood of King, Rosen, and Tanner (1999), which is flat along the tomography line; and Figure 1.8d shows the implicit “likelihood” of King (1997), which is a flat ridge.
Figure 1.9 shows posterior summaries for the baseline model based on the auxiliary scheme of Section 1.5.3 with two separate chains set off from different points. Panels (a)–(d) clearly show the slow mixing: (a) and (b) show the sample path for p0, and we see that after 30,000 iterations the chains have not come together. This is due to the nonidentifiability and the fact that the noncentral hypergeometric distribution has nonnegligible probability on a relatively small number of values, hence slowing movement around the space. Recently there has been interest in identifiability in Bayesian models, particularly from an MCMC perspective; see for example Gelfand and Sahu (1999). On the basis of this example and other analyses we have carried out, from this point onward for nonrare outcomes and for individual tables with large counts, we use the rejection algorithm, though we note that investigation of efficient computational schemes, including auxiliary variable schemes, is an important area of future research.
We obtained 1000 independent samples from the posterior π(p0, p1|y) using the rejection algorithm and sampling along the tomography line with the convolution likelihood of Equation 1.6; the acceptance rate was 0.80. Using the rejection algorithm and sampling from U(0, 1) × U(0, 1) with the convolution likelihood was very inefficient, and the accepted points fell almost exactly on the tomography line, as can be seen from Figure 1.9(o) and (p).
Figure 1.10 contains a number of graphical summaries; these may be compared with Figure 4 of King, Rosen, and Tanner (1999). Panels (a), (c), and (e) give representations of the univariate posteriors and bivariate posteriors of π(p0, p1|y), while panels (b) and (d) give the predictive distributions for Y0 and Y1. The univariate posterior distributions are
Table 1.4 Voter-registration–race data for county 50 of King (1997)

                 Unregistered   Registered
                 Y = 0          Y = 1          Total
Black, x = 0                                   29,494
White, x = 1                                   126,806
Total            10,800         145,500        156,300
Figure 1.8. Various likelihood surfaces for the data of Table 1.3: (a) convolution likelihood function, (b) normal approximation to the convolution likelihood function, (c) binomial likelihood function of King, Rosen, and Tanner (1999), (d) tomography line “likelihood” of King (1997).
Figure 1.9. Posterior plots from two chains of length 30,000 iterations from an auxiliary variable MCMC algorithm for county 150 of the registration–race data. Panels (a) and (b) give the time series plots for p0 for chains 1 and 2, respectively; panels (c) and (d) give the resultant histograms. The second and third rows show the equivalent plots for p1 and y0, respectively. Panels (m) and (n) give the time series plots for q = p0 × x + p1 × (1 − x) for the two chains, and panels (o) and (p) the (p0, p1) pairs under the two chains.
Figure 1.10. Posterior plots for county 150 of the registration–race data: (a) π(p0|y), (b) π(y0|y), (c) π(p1|y), (d) π(y1|y), (e) π(p0, p1|y), (f) normalized likelihoods along tomography line under convolution and approximating normal likelihood (indistinguishable, solid line), and binomial (dashed line).
close to uniform on the bounds, with the slight U-shape reflecting the shape of the variance Vi in Equation 1.8 along the tomography line. Panel (f) shows the scaled convolution and approximating normal likelihoods along the tomography line, and shows that they are virtually identical; the binomial likelihood is constant along this line and is also included as a dashed line. We note that the bimodality reported for p0 by King, Rosen, and Tanner (1999) has the same shape as that of the convolution likelihood in Figure 1.10f and is at first sight surprising, since the binomial likelihood utilized by these authors is constant along the tomography line. The explanation is that the bimodality arises because of the exponential prior with mean 2 that was used for a0, b0, a1, b1. As discussed in more detail by Wakefield (2004), this prior is highly U-shaped for p0 and p1, with spikes close to 0 and 1, and the spike at 1 is evident in the posterior for p1 that is reported in the upper panel of King, Rosen, and Tanner (1999: Figure 4). As we discuss in Section 1.7.3, with an MCMC approach, very large samples are required for reliable reporting of individual county probabilities.
It may at first seem nonintuitive that the convolution likelihood is not flat along the tomography line. However, whereas the tomography line of King (1997) is in terms of the fractions p̃0i and p̃1i, the likelihood is in terms of the probabilities p0i and p1i. In this example the nonconstancy of the likelihood is clear from examining the likelihood at the endpoints, which are given by
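\[
l(p_0 = 0,\ p_1 = 0.75) = P(Y = 8400 \mid p_0 = 0,\ p_1 = 0.75)
= \sum_{y_0 = 0}^{4001} \binom{4001}{y_0} \binom{11199}{8400 - y_0}\, 0.75^{\,8400 - y_0}\, 0.25^{\,2799 + y_0}
\]
and
\[
l(p_0 = 1,\ p_1 = 0.39) = P(Y = 8400 \mid p_0 = 1,\ p_1 = 0.39)
= \sum_{y_0 = 0}^{4001} \binom{4001}{y_0} \binom{11199}{8400 - y_0}\, 0.39^{\,8400 - y_0}\, 0.61^{\,2799 + y_0},
\]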
which are clearly different. Mathematically it is evident why the likelihood is not flat: the likelihood must average across the unobserved cell, and the required summation will produce different heights for different values of p0, p1.
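The difference in height at the two ends of the tomography line can be checked numerically. The sketch here assumes that the convolution likelihood of Equation 1.6 is the sum, over the unobserved count y0, of the product of two binomial probabilities; it uses the exact endpoint fractions 8400/11199 and 4399/11199 rather than the rounded values 0.75 and 0.39. The function names are illustrative only.

```python
import math


def log_binom_pmf(k, n, p):
    """Log binomial pmf, using the convention 0 * log(0) = 0."""
    if k < 0 or k > n:
        return -math.inf

    def xlogy(x, y):
        if x == 0:
            return 0.0
        return x * math.log(y) if y > 0 else -math.inf

    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + xlogy(k, p) + xlogy(n - k, 1.0 - p)


def log_convolution_lik(y, n0, n1, p0, p1):
    """Log of sum over y0 of Bin(y0 | n0, p0) * Bin(y - y0 | n1, p1)."""
    terms = [
        log_binom_pmf(y0, n0, p0) + log_binom_pmf(y - y0, n1, p1)
        for y0 in range(max(0, y - n1), min(n0, y) + 1)
    ]
    top = max(terms)
    if top == -math.inf:
        return -math.inf
    # Log-sum-exp for numerical stability.
    return top + math.log(sum(math.exp(t - top) for t in terms))


n0, n1, y = 4001, 11199, 8400            # margins of Table 1.3
end_a = log_convolution_lik(y, n0, n1, 0.0, y / n1)          # p0 = 0
end_b = log_convolution_lik(y, n0, n1, 1.0, (y - n0) / n1)   # p0 = 1
print(end_a, end_b)
```

At p0 = 0 only the term y0 = 0 contributes to the sum, and at p0 = 1 only the term y0 = 4001, so each endpoint reduces to a single binomial probability for the white row; the two log-likelihoods differ because the binomial variances at the two endpoints differ.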
1.7.2 County 50
We now examine county 50, which was also considered by King (1997). The rejection algorithm was implemented using the normal approximation along the tomography line and a uniform prior; the acceptance rate was 0.87. The bounds here are (0.63, 1) for p̃0 and (0.75, 1) for p̃1, and the MLEs are p̂0 = 0.63 and p̂1 = 1. The posterior means were estimated as 0.81, 0.96. The bound on p̂1 − p̂0 is (−0.09, 0.7), and the posterior probability Pr(p1 − p0 > 0|y) = 0.18. Figure 1.11 contains a number of graphical summaries for county 50; these summaries may be compared with Figure 3 of King, Rosen, and Tanner (1999). The latter plot shows a large mode around 0.65 which, when compared with Figure 1.11, would appear to be due to the hierarchical prior, showing how strongly inference for a particular table depends on the information from all of the tables. The bimodal nature of the posteriors induced by the poor choice of prior is also evident in Figure 3 of King, Rosen, and Tanner (1999). The normal approximation to the convolution likelihood is again accurate along the tomography line (Figure 1.11d).
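A rejection algorithm of the kind just described can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text: that proposals are uniform in p0 over its bounds, with p1 then fixed by the tomography line; that the normal approximation has mean N0 p0 + N1 p1 and variance N0 p0(1 − p0) + N1 p1(1 − p1); and that the likelihood bound is taken at an endpoint of the line. The function names and the seed are illustrative.

```python
import math
import random


def normal_lik(p0, p1, n0, n1, y):
    """Normal approximation to P(Y = y | p0, p1) (assumed form; needs var > 0)."""
    mean = n0 * p0 + n1 * p1
    var = n0 * p0 * (1.0 - p0) + n1 * p1 * (1.0 - p1)
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def sample_along_line(n0, n1, y, n_samples, rng):
    """Rejection sampling with uniform proposals for p0 along the tomography line."""
    lo, hi = max(0.0, (y - n1) / n0), min(1.0, y / n0)

    def point(p0):
        # Clamp p1 to [0, 1] to guard against floating-point overshoot.
        return p0, min(1.0, max(0.0, (y - n0 * p0) / n1))

    # On the line the mean equals y, so the likelihood is (2 * pi * var) ** -0.5.
    # The variance is concave in p0, so the likelihood is largest at an endpoint.
    lik_max = max(normal_lik(*point(lo), n0, n1, y),
                  normal_lik(*point(hi), n0, n1, y))

    samples, proposals = [], 0
    while len(samples) < n_samples:
        proposals += 1
        p0, p1 = point(rng.uniform(lo, hi))
        if rng.random() * lik_max <= normal_lik(p0, p1, n0, n1, y):
            samples.append((p0, p1))
    return samples, n_samples / proposals


rng = random.Random(2004)  # arbitrary seed
draws, acc_rate = sample_along_line(29494, 126806, 145500, 1000, rng)
mean_p0 = sum(d[0] for d in draws) / len(draws)
mean_p1 = sum(d[1] for d in draws) / len(draws)
print(round(acc_rate, 2), round(mean_p0, 2), round(mean_p1, 2))
```

The margins are those printed in Table 1.4. Because the accepted points are independent draws, the acceptance rate and posterior means can be estimated directly from the output, without the convergence concerns that arise with MCMC.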
Figure 1.11. Posterior plots for county 50 of the registration–race data: (a) π(p0|y), (b) π(p1|y), (c) π(p0, p1|y), (d) normalized likelihoods along tomography line under convolution and approximate normal likelihood (indistinguishable, solid line) and binomial likelihood (dashed line).
1.7.3 All Counties
We have already seen from examination of the bounds that there is far more information concerning p1i here, because whites are in the majority in most of the areas. For the registration–race data Figure 1.1 shows the weighted least squares line with weights Ni; we obtain estimates of p̂0 = 0.34, p̂1 = 0.89 from ecological regression. When the three outlying counties are removed, we obtain estimates for (p̂0, p̂1) of (0.41, 0.87) with weighted least squares. Hence we see some sensitivity, particularly for p0.
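The ecological regression estimates can be obtained from a weighted least squares fit of the registered fraction on the racial composition. The sketch here assumes the relation q = p0 × x + p1 × (1 − x), with x the proportion of the area belonging to the group indexed 0, so that the intercept estimates p1 and the intercept plus slope estimates p0. The data are synthetic and the function name is illustrative; the county-level data are not reproduced here.

```python
def ecological_regression(x, q, n):
    """Weighted least squares fit of q_i = p1 + (p0 - p1) * x_i with weights n_i."""
    w_sum = sum(n)
    x_bar = sum(w * xi for w, xi in zip(n, x)) / w_sum
    q_bar = sum(w * qi for w, qi in zip(n, q)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(n, x))
    sxq = sum(w * (xi - x_bar) * (qi - q_bar) for w, xi, qi in zip(n, x, q))
    slope = sxq / sxx
    intercept = q_bar - slope * x_bar
    p1_hat = intercept            # fitted value at x = 0
    p0_hat = intercept + slope    # fitted value at x = 1
    return p0_hat, p1_hat


# Synthetic, noise-free illustration (not the registration-race data).
x_demo = [0.1, 0.3, 0.5, 0.8]
n_demo = [1000, 2000, 1500, 500]
q_demo = [0.34 * xi + 0.89 * (1.0 - xi) for xi in x_demo]
print(ecological_regression(x_demo, q_demo, n_demo))
```

Nothing in the fit constrains the estimates to lie in (0, 1), and a few high-leverage areas can move the fitted line appreciably, which is in keeping with the sensitivity to the outlying counties noted above.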
BASELINE MODELS
We investigated the sensitivity to the prior by assuming Beta(3, 2) and Beta(4, 1) priors in addition to the (uniform) Beta(1, 1) prior. A Beta(3, 2) random variable has 2.5%, 50% and 97.5% points of 0.20, 0.61, 0.93; the equivalent quantities for a Beta(4, 1) random variable are 0.40, 0.84, 0.99. The aim of these analyses is to illustrate the sensitivity of inference to prior assumptions, and we used a very approximate rejection algorithm in which the likelihood was assumed to be constant along the tomography line. The rejection step, as described in Section 1.5.4, then consists in accepting a point based on the ratio of the density of the prior at the point to the density at the supremum of the prior (which is available in closed form).
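The percentage points of the two informative priors can be checked approximately without special software. For integer a and b, the Beta(a, b) distribution function at x equals the probability that a Binomial(a + b − 1, x) variable is at least a; the sketch here inverts that function by bisection. The function names are illustrative.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """Beta(a, b) distribution function for positive integers a, b."""
    n = a + b - 1
    return sum(comb(n, j) * x ** j * (1.0 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile_int(q, a, b, tol=1e-10):
    """Quantile of Beta(a, b) by bisection on the distribution function."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for a, b in [(3, 2), (4, 1)]:
    points = [beta_quantile_int(q, a, b) for q in (0.025, 0.5, 0.975)]
    print((a, b), [round(p, 3) for p in points])
```

For Beta(4, 1) the distribution function is simply x⁴, so the quantiles are fourth roots of the required probabilities.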
Figure 1.12 shows the histograms of the empirical distribution of the posterior medians of p0i, p1i under the three beta priors, and Table 1.5 provides numerical summaries. The sensitivity in the distribution of the medians of p0i is evident. Under the Beta(4, 1) prior, 49% of the areas have Pr(p0i < p1i|y) > 0.5, showing that the registration rates for blacks
Figure 1.12. Posterior medians for all areas for the registration–race data using the baseline model and independent Beta(a, b) × Beta(a, b) priors: (a) p0i, a = 1, b = 1, (b) p1i, a = 1, b = 1, (c) p0i, a = 3, b = 2, (d) p1i, a = 3, b = 2, (e) p0i, a = 4, b = 1, (f) p1i, a = 4, b = 1, i = 1, ..., 275.
Table 1.5 Summaries of distribution of posterior medians of (p0i, p1i) over i = 1, . . . , 275 counties of registration–race data of King (1997) under the baseline model and different prior specifications

Statistic      Beta(1, 1)   Beta(3, 2)   Beta(4, 1)
p̄0             0.66         0.71         0.81
s.d.{p0}       0.16         0.13         0.10
2.5%, p0       0.46         0.47         0.57
50%, p0        0.63         0.70         0.83
97.5%, p0      0.98         0.98         0.99
p̄1             0.80         0.79         0.77
s.d.{p1}       0.16         0.17         0.17
2.5%, p1       0.39         0.41         0.37
50%, p1        0.84         0.82         0.78
97.5%, p1      1.0          1.0          0.99
and whites are virtually identical. The drop in p1 in Table 1.5 may be attributed to the negative dependence between p0i and p1i in the likelihood.
For some areas, the acceptance rate r was very low under the nonuniform priors, in particular for the outlying areas. This is not surprising when one recognizes that the prior predictive Pr(yi) = r × M, where M is the maximized prior in this implementation and is constant for all i (Wakefield, 1996).
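The approximate rejection step used for the prior sensitivity analyses can be sketched as follows. The sketch assumes that proposals are uniform in p0 along the tomography line, that the likelihood is treated as constant on the line, and that M is the supremum of the joint Beta(a, b) × Beta(a, b) density over the unit square; these details are assumptions rather than a description of the original implementation, and the function names are illustrative.

```python
import math
import random


def beta_pdf(x, a, b):
    """Beta(a, b) density at x in [0, 1], for a, b >= 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * x ** (a - 1) * (1.0 - x) ** (b - 1)


def beta_sup(a, b):
    """Supremum of the Beta(a, b) density, for a, b >= 1 (closed form)."""
    if a == 1 and b == 1:
        return 1.0
    if b == 1:
        return beta_pdf(1.0, a, b)   # increasing density, maximum at 1
    if a == 1:
        return beta_pdf(0.0, a, b)   # decreasing density, maximum at 0
    return beta_pdf((a - 1) / (a + b - 2), a, b)   # interior mode


def prior_rejection(n0, n1, y, a, b, n_samples, rng):
    """Accept points on the tomography line with probability prior / M."""
    lo, hi = max(0.0, (y - n1) / n0), min(1.0, y / n0)
    bound = beta_sup(a, b) ** 2      # M, the same for every table
    samples, proposals = [], 0
    while len(samples) < n_samples:
        proposals += 1
        p0 = rng.uniform(lo, hi)
        p1 = min(1.0, max(0.0, (y - n0 * p0) / n1))
        if rng.random() * bound <= beta_pdf(p0, a, b) * beta_pdf(p1, a, b):
            samples.append((p0, p1))
    return samples, n_samples / proposals


# County 150 margins (Table 1.3) under the Beta(4, 1) prior.
draws, rate = prior_rejection(4001, 11199, 8400, 4, 1, 200, random.Random(1))
print(round(rate, 3))
```

Under the uniform prior every proposal is accepted. Under Beta(4, 1) the prior mass sits near p0 = p1 = 1, a point far from the tomography line of county 150, so the rate at which proposals are accepted is small; this is the behaviour described above for areas that are poorly supported by the prior.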
HIERARCHICAL MODELS
We first state summaries for the truncated normal model (from King, 1997). The posterior means of the averages of the normal for blacks and whites were 0.62 and 0.83, respectively. For the binomial–beta model of King, Rosen, and Tanner (1999), the means of the beta distributions for blacks and whites were 0.60 and 0.85, respectively (this analysis had exponential priors of mean 2 on the hyperparameters).
At the first stage of the hierarchical model we take the normal approximation to the convolution. At the second stage we assume the model θji ∼ N(· | µj, Σjj), j = 0, 1, with θji being the logit of pji, as in Equations 1.13 and 1.14 with Σ01 = 0 (so that the logits are independent across tables). At the third stage of the model we assumed independent logistic distributions with location 0 and scale 1 (as described in Section 1.4). The variances are more difficult. We report two analyses, one with the naive choice of Ga(0.001, 0.001) priors for Σjj⁻¹, and the other with Ga(1, 0.01) priors. The models were fitted using WinBUGS (Spiegelhalter, Thomas, and Best, 1998); the code for the normal approximation to the convolution model is given in the Appendix.
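For reference, the three stages may be collected in a single display. This is a restatement under assumptions, not a reproduction of Equations 1.13 and 1.14: the symbol Σ for the second-stage variances, the form of the first-stage variance V_i, and the reading of the third-stage logistic distributions as priors on the means µj are all assumed here, and Ga(a, b) stands for either of the two gamma priors considered.

```latex
\begin{align*}
\text{Stage 1:}\quad & Y_i \mid p_{0i}, p_{1i} \sim
   N\bigl(N_{0i}\,p_{0i} + N_{1i}\,p_{1i},\; V_i\bigr),
   \qquad V_i = N_{0i}\,p_{0i}(1 - p_{0i}) + N_{1i}\,p_{1i}(1 - p_{1i}), \\
\text{Stage 2:}\quad & \theta_{ji} = \operatorname{logit}(p_{ji}) \sim
   N(\mu_j, \Sigma_{jj}), \qquad j = 0, 1, \qquad \Sigma_{01} = 0, \\
\text{Stage 3:}\quad & \mu_j \sim \operatorname{Logistic}(0, 1), \qquad
   \Sigma_{jj}^{-1} \sim \operatorname{Ga}(a, b).
\end{align*}
```

A logistic distribution with location 0 and scale 1 on µj corresponds to a uniform distribution on the inverse logit of µj, since the standard logistic distribution function is the inverse logit.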
The individual probabilities display very poor mixing in an MCMC approach, and inference for (p0i, p1i) or (Y0i, Y1i) for a particular table would be more accurate using an empirical Bayes approach in which the table of interest was treated as a new table and the prior was taken as the posterior over the population parameters. For example, we could use a rejection algorithm with samples θi^(s) from π(θi | φ^(s)) with φ^(s) ∼ π(φ | y), s = 1, . . . , S, being samples from the posterior on the hyperparameters. Given the large margins and large number of tables, this will not be too poor an approximation unless the table has very large margins and/or is outlying.
The population summary parameters in Table 1.6 were obtained from chains that displayed slow convergence. The Markov chain was more stable when the more informative
Table 1.6 Posterior quantiles from hierarchical analyses of the race–registration data of King (1997)

               Ga(0.001, 0.001)          Ga(1, 0.01)
Parameter      2.5%    50%    97.5%      2.5%    50%    97.5%
p̄0             0.53    0.57   0.62       0.53    0.58   0.63
p̄1             0.84    0.86   0.87       0.83    0.85   0.87
s.d.{p0i}      0.30    0.32   0.34       0.29    0.32   0.34
s.d.{p1i}      0.16    0.18   0.20       0.16    0.18   0.20
(and plausible) prior was used for inference. For all analyses we ran the chains for 500,000 iterations of burn-in (to give the chain an opportunity to reach the main mass of the posterior and “forget” its starting position), and then samples from a further 2,500,000 iterations were used for inference (this latter number was greater than was needed but was used for safety). Figure 1.13 shows samples from the Markov chain for four population summary parameters (the names of which are given in the Appendix), and shows strong dependence across iteration.
From Table 1.6 we see that the results for p1 are relatively robust to the choice of second-stage distribution, as we would expect from Figure 1.12 and the analyses of the previous sections. The number of tables is large here, and so the impact of the prior is not so great.
We attempted to include a correlation parameter in the second-stage distributions, but the resultant Markov chain displayed extremely slow mixing, indicating that there was close to zero information in the data to estimate this parameter. Including xi as a covariate also produced a poorly mixing chain, corresponding to near-nonidentifiability. This phenomenon is noted by Rosen, Jiang, King, and Tanner (2001: Section 4.2).
In this example we have seen that the white registration probabilities are well estimated in many areas, while there is far more uncertainty associated with the registration probabilities for blacks. We conclude that under the assumption of no contextual effects there is evidence to suggest that, over all areas, the black probabilities are smaller than the white probabilities on average, but the extent of the difference cannot be precisely estimated without an informative prior distribution, or surveys from within a sample of areas. We emphasize that this analysis has not addressed the contextual aspect of the ecological fallacy.