1 Prior and Likelihood Choices in the Analysis of Ecological Data
1.5 COMPUTATION
Jonathan Wakefield

the right fork gives ecological regression. Hierarchical models allow 2m parameters but tie the pairs of probabilities together via the assumption of a common distribution from which they are drawn (possibly allowing contextual effects also). At the bottom of the nesting the baseline model is located. The latter is essentially a fixed effects model for each table, retaining the 2m parameters. As we discussed above, we do not advocate the use of this model, but it is useful to identify the extreme saturated model for ecological data.
All of the above hierarchical models result in posterior distributions that are analytically intractable (as we describe in the next section), but Markov chain Monte Carlo (MCMC) algorithms are relatively straightforward to implement (though convergence may be a problem), and all of the models but the truncated normal have been implemented in the WinBUGS software (Spiegelhalter, Thomas, and Best, 1998). The Appendix gives code for the logistic normal with the normal approximation to the convolution at stage 1. In our fairly limited experience we have found that the logit model is much more stable than the beta model, at least when used within the WinBUGS software. In particular we found that this software may crash with the beta model, because points very close to 0 and 1 are supported by ecological data and when sampled lead to numerical problems.
In the next section we briefly review the Bayesian approach to inference and give an overview of computation for the Bayesian models that have been described in the previous section. The Bayesian approach is particularly appealing in the context of ecological data because for such data modeling assumptions have to be made to enforce identifiability, and the most rigorous way of including such assumptions is via the adoption of a prior distribution.
parameters), we may further write
$$\pi(\theta_1,\ldots,\theta_m \mid \phi) = \prod_{i=1}^{m} \pi(\theta_i \mid \phi).$$
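Hence, under these assumptions, we have the posterior distribution

$$\pi(\theta_1,\ldots,\theta_m,\phi \mid y_1,\ldots,y_m) \propto \prod_{i=1}^{m} p(y_i \mid \theta_i) \times \prod_{i=1}^{m} \pi(\theta_i \mid \phi) \times \pi(\phi).$$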
Inference follows via consideration of marginal posterior distributions and predictive distributions. For example, $\pi(\theta_i \mid y_1,\ldots,y_m)$ is the marginal posterior distribution for the pair of parameters from table $i$. We may also be interested in imputing the missing counts in table $i$. This may be carried out via examination of the predictive distribution
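$$P(Y_{0i},Y_{1i} \mid y_1,\ldots,y_m) = \int P(Y_{0i},Y_{1i} \mid \theta_i,N_{0i},N_{1i},N_i-y_i,y_i) \times \pi(\theta_i \mid y_1,\ldots,y_m)\,d\theta_i.$$

If we can simulate from $P(Y_{0i},Y_{1i} \mid \theta_i,N_{0i},N_{1i},N_i-y_i,y_i)$, then it is straightforward to simulate from the predictive distribution, once samples $\theta_i^{(s)}$, $s=1,\ldots,S$, are available from $\pi(\theta_i \mid N_{0i},N_{1i},N_i-y_i,y_i)$, via

$$\frac{1}{S}\sum_{s=1}^{S} P(Y_{0i},Y_{1i} \mid \theta_i^{(s)},N_{0i},N_{1i},N_i-y_i,y_i).$$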
Each of the distributions within this sum is the distribution of $Y_{0i}$ given the row and column margins and the table probabilities, and is a noncentral (or extended) hypergeometric distribution (e.g. McCullagh and Nelder, 1989). Suppose the odds ratio in the table is given by $\psi_i = p_{0i}(1-p_{1i})/[p_{1i}(1-p_{0i})]$; then $Y_{0i}$ has a noncentral hypergeometric distribution if its distribution is of the form
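$$\Pr(Y_{0i}=y_{0i} \mid \psi_i,N_{0i},N_{1i},N_i-y_i,y_i) =
\begin{cases}
\dfrac{\binom{N_{0i}}{y_{0i}}\binom{N_{1i}}{y_i-y_{0i}}\,\psi_i^{y_{0i}}}{\sum_{u=l_i}^{u_i}\binom{N_{0i}}{u}\binom{N_{1i}}{y_i-u}\,\psi_i^{u}}, & y_{0i}=l_i,\ldots,u_i,\\[2ex]
0, & \text{otherwise},
\end{cases} \tag{1.15}$$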
where $l_i = \max(0, y_i - N_{1i})$ and $u_i = \min(N_{0i}, y_i)$. Hence the predictive distribution is an overdispersed noncentral hypergeometric distribution. The distribution of $Y_{1i}$ is obtained as $Y_{1i} = Y_i - Y_{0i}$, and produces $(y_{0i}/N_{0i}, y_{1i}/N_{1i})$ pairs that lie along the tomography line.
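As an illustration of Equation 1.15 and of the predictive simulation described above, the following is a minimal Python sketch using only the standard library. The function names (`noncentral_hypergeom_pmf`, `odds_ratio`, `predictive_draws`) and the argument `theta_samples` are hypothetical and are not taken from the chapter or from WinBUGS; the posterior samples of $(p_{0i}, p_{1i})$ are assumed to have been generated elsewhere.

```python
import math
import random


def noncentral_hypergeom_pmf(n0, n1, y, psi):
    """Support and probabilities of Y0 given margins (n0, n1, y) and odds ratio psi."""
    lo, hi = max(0, y - n1), min(n0, y)
    support = list(range(lo, hi + 1))
    # Log scale keeps the binomial coefficients manageable for larger margins.
    logw = [
        math.log(math.comb(n0, u)) + math.log(math.comb(n1, y - u)) + u * math.log(psi)
        for u in support
    ]
    top = max(logw)
    weights = [math.exp(v - top) for v in logw]
    total = sum(weights)
    return support, [w / total for w in weights]


def odds_ratio(p0, p1):
    """psi = p0 (1 - p1) / [p1 (1 - p0)], as defined in the text."""
    return p0 * (1 - p1) / (p1 * (1 - p0))


def predictive_draws(theta_samples, n0, n1, y, rng):
    """One (y0, y1) draw per sampled pair (p0, p1); theta_samples is assumed given."""
    draws = []
    for p0, p1 in theta_samples:
        support, probs = noncentral_hypergeom_pmf(n0, n1, y, odds_ratio(p0, p1))
        y0 = rng.choices(support, weights=probs, k=1)[0]
        draws.append((y0, y - y0))  # y1 follows from the column margin
    return draws
```

With `psi = 1` the function reduces to the ordinary (central) hypergeometric distribution, which gives a simple check on the implementation.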
1.5.2 Markov Chain Monte Carlo Algorithms
Unfortunately, the integrals required to calculate the posterior and predictive distributions just described are not analytically tractable, and so some form of approximation is required.
One such approximation, based on generating samples from the posterior distribution, is particularly well suited to the hierarchical model that we have generically described. This
approximation is based on constructing a Markov chain whose stationary distribution is the required posterior distribution. Specifically, we exploit the conditional independences that were used to derive the posterior distribution and simulate repeatedly from the distributions

$$p(\theta_i \mid \theta_j, j \neq i, \phi, y_1,\ldots,y_m) \propto p(y_i \mid \theta_i) \times \pi(\theta_i \mid \phi), \tag{1.16}$$

$i = 1,\ldots,m$, and

$$p(\phi \mid \theta_1,\ldots,\theta_m,y_1,\ldots,y_m) \propto \prod_{i=1}^{m} \pi(\theta_i \mid \phi) \times \pi(\phi). \tag{1.17}$$
This MCMC algorithm is used within the WinBUGS software; the manual (Spiegelhalter, Thomas, and Best, 1998) gives details of specific algorithms and advice on assessing convergence of the Markov chain. Due to the nonidentifiability of ecological models the Markov chain typically has to be run for a large number of iterations (1–3 million iterations were used for the examples of this paper). The sampled values also typically show extremely high autocorrelations, and so a large number of samples are required for reliable inference. When such samples are retained for all parameters, storage becomes an issue, and so instead the Markov chain may be thinned (that is, samples are only stored every 1000th (say) iteration). Great care must be taken when examining posterior quantities to gain some assurance that convergence has been attained, particularly for table-specific parameters such as $p_{0i}, p_{1i}$.
1.5.3 Auxiliary Variables Scheme
An alternative MCMC scheme suggests itself when the event of interest is rare (which is not typical in social science applications). The algorithm is based on introducing the counts $Y_{0i}$ as auxiliary variables. Byers and Besag (2000) describe a Markov chain in such a situation under a rare-event assumption (the latter allows the convolution to be replaced by a Poisson distribution). With the introduction of auxiliary variables $Y_{0i}$ we have a posterior distribution over not only the unknown parameters, but also the missing data:
$$\pi(\theta_1,\ldots,\theta_m,\phi,y_{01},\ldots,y_{0m} \mid y_1,\ldots,y_m) = p(y_{01},\ldots,y_{0m} \mid y_1,\ldots,y_m,\theta_1,\ldots,\theta_m) \times p(\theta_1,\ldots,\theta_m,\phi \mid y_1,\ldots,y_m). \tag{1.18}$$

This introduction seems unhelpful at first, since we have a more complex form than previously, but the benefit is that the conditional distributions for $\theta_i$, which are required for an MCMC algorithm, are now of standard form. Specifically, we may alternate between the conditional distributions given by

$$p(y_{0i} \mid y_i,\theta_i), \tag{1.19}$$

$i = 1,\ldots,m$, and

$$\pi(\theta_i \mid N_{0i},N_{1i},y_i,y_{0i}), \tag{1.20}$$

$i = 1,\ldots,m$, with $\pi(\phi \mid \theta_1,\ldots,\theta_m,y_1,\ldots,y_m,y_{01},\ldots,y_{0m})$ as in Equation 1.17. This algorithm is inefficient for nonrare outcomes; see Section 1.7 for a demonstration.
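To make the alternation between (1.19) and (1.20) concrete, here is a minimal Python sketch for a single table. It assumes independent Beta($a_j$, $b_j$) priors on $p_0$ and $p_1$ held fixed, so that (1.20) becomes a pair of beta distributions; the update for $\phi$ in Equation 1.17 is omitted, and the rare-event Poisson approximation of Byers and Besag (2000) is not used. The names `sample_y0` and `auxiliary_gibbs`, and the clamping constant `EPS`, are hypothetical choices for this illustration.

```python
import math
import random

EPS = 1e-12  # assumed guard: keeps sampled probabilities away from exactly 0 or 1


def sample_y0(n0, n1, y, p0, p1, rng):
    """Draw y0 from p(y0 | y, p0, p1), the noncentral hypergeometric of (1.15)."""
    lo, hi = max(0, y - n1), min(n0, y)
    log_psi = math.log(p0) + math.log1p(-p1) - math.log(p1) - math.log1p(-p0)
    support = list(range(lo, hi + 1))
    logw = [
        math.log(math.comb(n0, u)) + math.log(math.comb(n1, y - u)) + u * log_psi
        for u in support
    ]
    top = max(logw)
    weights = [math.exp(v - top) for v in logw]
    return rng.choices(support, weights=weights, k=1)[0]


def auxiliary_gibbs(n0, n1, y, a=(1.0, 1.0), b=(1.0, 1.0), n_iter=2000, seed=0):
    """Alternate between y0 | (p0, p1) and (p0, p1) | y0 for one table."""
    rng = random.Random(seed)
    p0, p1 = 0.5, 0.5  # arbitrary starting values
    chain = []
    for _ in range(n_iter):
        y0 = sample_y0(n0, n1, y, p0, p1, rng)
        y1 = y - y0
        # Given the completed table, each probability has a conjugate beta update.
        p0 = rng.betavariate(a[0] + y0, b[0] + n0 - y0)
        p1 = rng.betavariate(a[1] + y1, b[1] + n1 - y1)
        p0 = min(max(p0, EPS), 1 - EPS)
        p1 = min(max(p1, EPS), 1 - EPS)
        chain.append((p0, p1, y0))
    return chain
```

For example, `auxiliary_gibbs(20, 30, 15)` returns a list of `(p0, p1, y0)` triples; as the text notes, successive draws are dependent, so any summaries should be checked for autocorrelation.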
1.5.4 Rejection Algorithm for Individual Tables
For examination of the likelihood–posterior surface for a single table only via the baseline model, various schemes are possible. The auxiliary scheme just described may be implemented, or we can implement a rejection algorithm. The advantage of the latter is that it provides independent samples from the posterior distribution (as opposed to the dependent samples produced by MCMC schemes, including the auxiliary variable algorithm).
Since we only have two unknown parameters and a finite range for both, a rejection scheme is straightforward to implement. A generic rejection algorithm for sampling from a density $f(\cdot)$, given a proposal density $g(\cdot)$, proceeds as follows. First find

$$M = \sup_z \frac{f(z)}{g(z)};$$

then:
1. Sample $Z \sim g(\cdot)$ and, independently, $U \sim U(0, 1)$.
2. Accept $Z$ if
$$U < \frac{f(Z)}{M g(Z)};$$
otherwise return to 1.
The rejection algorithm depends on $M$ being finite, and the efficiency may be measured through the proportion of samples that are accepted. The latter is a function of how closely $g$ mimics $f$. A specific rejection algorithm that is useful in Bayesian inference is to choose $g$ to be the prior distribution $\pi(\cdot)$. In this case $M$ is the maximized likelihood (which we know is finite in the ecological context). Letting $L(\cdot)$ denote the likelihood and $M = L(\hat{Z})$ the maximized likelihood, the algorithm then becomes:
1. Sample $Z \sim \pi(\cdot)$ and, independently, $U \sim U(0, 1)$.
2. Accept $Z$ if
$$U < \frac{L(Z)}{M};$$
otherwise return to 1.

Here $Z = (p_0, p_1)$.
For small tables, the above rejection algorithm of sampling from the prior and rejecting according to the ratio of the convolution at the sampled point to that at the maximum (which is evaluated once only) is feasible. The maximum lies at one of the endpoints of the tomography line; see Wakefield (2004) and Chapter 2 for details.
As the table margins increase in size, the rejection algorithm may become very inefficient, since the likelihood concentrates upon the tomography line and so hardly any points are accepted. The computational expense is greatest for the convolution likelihood, due to the need to evaluate a large number of terms in the summation over the missing data $y_{0i}$. The normal likelihood is computationally inexpensive, and in our experience provides a very good approximation (examples follow in Sections 1.6 and 1.7).
For table $i$ and for a beta prior, the details of the algorithm are as follows. Let $M_i = L(\hat{p}_{0i}, \hat{p}_{1i})$ denote the supremum of the likelihood for $p_{0i}, p_{1i}$ for either the convolution of binomials or the approximating normal likelihood. The rejection algorithm is as follows:
1. Sample $p_{ji} \sim \text{Beta}(a_j, b_j)$, $j = 0, 1$, and $U \sim U(0, 1)$, with all generations being independent.
2. Accept $p_{0i}, p_{1i}$ if
$$U < \frac{L(p_{0i}, p_{1i})}{M_i},$$
with $L(p_{0i}, p_{1i})$ given by Equation 1.6 (if the convolution is used); otherwise return to 1.
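The two steps above can be sketched in Python for a small table. This is an illustration under stated assumptions rather than the chapter's own code: the likelihood is taken to be the convolution of two binomials, and $M_i$ is obtained by evaluating that likelihood at the two endpoints of the tomography line, relying on the result quoted above that the maximum lies at one of them. The function names (`convolution_likelihood`, `endpoint_maximum`, `rejection_sample`) are hypothetical.

```python
import math
import random


def convolution_likelihood(p0, p1, n0, n1, y):
    """Sum over the unobserved y0 of Bin(y0; n0, p0) * Bin(y - y0; n1, p1)."""
    lo, hi = max(0, y - n1), min(n0, y)
    total = 0.0
    for y0 in range(lo, hi + 1):
        y1 = y - y0
        total += (
            math.comb(n0, y0) * p0**y0 * (1 - p0) ** (n0 - y0)
            * math.comb(n1, y1) * p1**y1 * (1 - p1) ** (n1 - y1)
        )
    return total


def endpoint_maximum(n0, n1, y):
    """Likelihood at the two ends of the tomography line; the larger is used as M."""
    ends = [
        (max(0.0, (y - n1) / n0), min(1.0, y / n1)),
        (min(1.0, y / n0), max(0.0, (y - n0) / n1)),
    ]
    return max(convolution_likelihood(p0, p1, n0, n1, y) for p0, p1 in ends)


def rejection_sample(n0, n1, y, a=(1.0, 1.0), b=(1.0, 1.0), n_samples=500, seed=0):
    """Independent posterior draws of (p0, p1) under Beta(a_j, b_j) priors."""
    rng = random.Random(seed)
    m = endpoint_maximum(n0, n1, y)  # evaluated once only
    accepted = []
    while len(accepted) < n_samples:
        p0 = rng.betavariate(a[0], b[0])
        p1 = rng.betavariate(a[1], b[1])
        if rng.random() < convolution_likelihood(p0, p1, n0, n1, y) / m:
            accepted.append((p0, p1))
    return accepted
```

Direct powers are used for readability, which is adequate for small margins; for larger tables the terms would need to be computed on the log scale, and the acceptance rate falls quickly, as discussed above.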
To address the sensitivity to the prior, the accepted points may be reweighted via the ratio of the new to the original prior, or thinned via another rejection algorithm based on the supremum of the ratio of the priors. See Smith and Gelfand (1992) for details.
Another possibility that we have used as a quick approximation when the size of the table margins is large is to restrict the prior to sampling along the tomography line and then to test using the convolution likelihood. For large margins the likelihood will fall away very quickly to either side of the tomography line.
As we commented in Section 1.3, due to the nonidentifiability, as $N_i \to \infty$ the likelihood tends to a line, and not to a point as in the regular case. For large $N_i$, with uniform priors, the posterior distribution may therefore be approximated by a uniform distribution on the tomography line. In this situation the posterior medians of $p_{0i}, p_{1i}$ are therefore approximated by the midpoint of the method of bounds, that is,

$$\hat{p}_{0i} = 0.5 \times \left[ \min\left(1, \frac{y_i}{N_{0i}}\right) + \max\left(0, \frac{y_i - N_{1i}}{N_{0i}}\right) \right],$$

$$\hat{p}_{1i} = 0.5 \times \left[ \min\left(1, \frac{y_i}{N_{1i}}\right) + \max\left(0, \frac{y_i - N_{0i}}{N_{1i}}\right) \right],$$
since these bounds define the tomography line. We note again that estimates defined in this way will be biased, and the amount of bias depends on $x_i$. Consequently, examining the midpoints versus $x_i$ (say) will be deceptive. We again stress that examination of the data via the baseline model is an initial exploratory step with inference following from hierarchical modeling.
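The midpoint formulas are simple to compute directly. The short Python sketch below uses a hypothetical function name, `bounds_midpoints`, and takes the margins $N_{0i}$, $N_{1i}$ and the count $y_i$ as plain numbers.

```python
def bounds_midpoints(n0, n1, y):
    """Midpoints of the method-of-bounds intervals for p0 and p1."""
    lower0, upper0 = max(0.0, (y - n1) / n0), min(1.0, y / n0)
    lower1, upper1 = max(0.0, (y - n0) / n1), min(1.0, y / n1)
    return 0.5 * (lower0 + upper0), 0.5 * (lower1 + upper1)
```

Because the bounds are the endpoints of the tomography line, the returned pair always satisfies $N_{0i}\hat{p}_{0i} + N_{1i}\hat{p}_{1i} = y_i$, which is a convenient check.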
1.5.5 Discrete Approximation
If we wish to examine the likelihood surface for a single table, then we may simply evaluate the surface over a grid of $p_{0i}, p_{1i}$ values. If we wish to convert to a posterior surface, we may then multiply the likelihood by the prior and then normalize by dividing the product by the sum over all points in the grid. This approach is illustrated in the following sections.
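A minimal Python sketch of this grid calculation follows. It assumes the convolution-of-binomials likelihood and independent beta priors; the grid of cell midpoints, the grid size, and the function names (`convolution_likelihood`, `grid_posterior`) are illustrative choices that do not come from the chapter.

```python
import math


def convolution_likelihood(p0, p1, n0, n1, y):
    """Sum over the unobserved y0 of Bin(y0; n0, p0) * Bin(y - y0; n1, p1)."""
    lo, hi = max(0, y - n1), min(n0, y)
    return sum(
        math.comb(n0, y0) * p0**y0 * (1 - p0) ** (n0 - y0)
        * math.comb(n1, y - y0) * p1 ** (y - y0) * (1 - p1) ** (n1 - y + y0)
        for y0 in range(lo, hi + 1)
    )


def grid_posterior(n0, n1, y, a=(1.0, 1.0), b=(1.0, 1.0), grid_size=50):
    """Normalized likelihood-times-prior over a grid of (p0, p1) values."""
    # Cell midpoints avoid evaluating the beta kernel exactly at 0 or 1.
    grid = [(k + 0.5) / grid_size for k in range(grid_size)]

    def beta_kernel(p, aj, bj):
        # Unnormalized beta density; the constant cancels on normalization.
        return p ** (aj - 1) * (1 - p) ** (bj - 1)

    weights = [
        [
            convolution_likelihood(p0, p1, n0, n1, y)
            * beta_kernel(p0, a[0], b[0])
            * beta_kernel(p1, a[1], b[1])
            for p1 in grid
        ]
        for p0 in grid
    ]
    total = sum(sum(row) for row in weights)
    posterior = [[w / total for w in row] for row in weights]
    return grid, posterior
```

Here `posterior[j][k]` is the posterior mass assigned to the grid point `(grid[j], grid[k])`, so marginal summaries for $p_{0i}$ follow by summing across rows.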