Model Evaluation & Model Selection
Modeling process
1. Problem identification / data collection: identify scientific objectives; collect & understand data; draw upon existing theory/knowledge.
2. Model specification: visualize model (DAG); write model mathematically using probability notation & appropriate distributions.
3. Model implementation: write down (unnormalized) posterior, derive full-conditional distributions, and construct MCMC algorithm, or program model components in software; fit model (using MCMC & data).
4. Model evaluation & inference: evaluate models (posterior predictive checks); model selection; use output to make inferences.
Motivating issues
• How well does my model(s) fit my data? [evaluation]
– Just because the MCMC procedure “went smoothly,” doesn’t mean you have a “good model”
– Just because you got posterior stats for your parameters of interest doesn’t mean you have a “good model”
– Check ability of model to “replicate” observed data
– Potentially check ability of model to “predict” observed data (via cross-validation)
• Alternative model formulations = alternative hypotheses about system. Which model? [selection]
– Which model agrees the best with my data?
– Which model is simpler to interpret?
– Which model satisfies both criteria? Which model should I choose?
– Combine alternative models? Model averaging
Lecture content
• Posterior predictive checks:
– Observed vs “predicted”
– Replicated data
– Bayesian p-values
• Model selection/comparison:
– Deviance
– Akaike Information Criterion (AIC) in the likelihood framework
– Deviance information criterion (DIC)
– Posterior predictive loss (D)
The first question we should ask after fitting a model: Are the predictions of the model consistent with the data?
1. Is our process model a reasonable representation?
2. Have we made the right choices of distributions to represent the uncertainties?
Evaluate model fits
• Model evaluation and diagnostics are relatively under-developed in Bayesian analysis
• We often rely on relatively qualitative and informal methods that are based on ideas developed in “classical” analyses
• But, one particularly useful method for evaluating model fit is to compare posterior predictions of “replicated” data with observed data.
• Examine the ability of the model to produce “replicated” data that are consistent with the observed data
• Assume that “replicated” data (yrep) arise from the same sampling distribution used to define the likelihood of the observed data (y):
$$y_i \sim P(y \mid \theta), \qquad y^{rep}_i \sim P(y \mid \theta)$$
• Again, y is observed and yrep is not observed; thus, we obtain the posterior predictive distribution for each yrep.
• Compare the posterior predictive distribution for each yrep to the corresponding observed y value.
Posterior predictive checks
Posterior predictive distribution
• It is called posterior because it is conditional on the
observed y and predictive because it is a prediction for an observable ynew.
• It gives the probability of a new prediction of y
conditional on θ , which in turn is conditioned on the data at hand, y.
$$P(y^{new} \mid y) = \int P(y^{new} \mid \theta)\,P(\theta \mid y)\,d\theta$$
The mechanics
• We have a scientific model g(θ,x) that predicts a response y. We estimate the posterior distribution, P(θ|y). For any given value of x, we can simulate the posterior predictive distribution of ynew by making a draw θ = θ’ from P(θ|y) and then drawing ynew ~ P(g(θ’,x),σ).
• In MCMC, this simply means making draws from the data
model because each draw is conditional on the current value of the parameters. These draws define the posterior predictive
distribution in exactly the same way that draws allow us to define the posterior probability of the parameters.
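The mechanics can be sketched outside of JAGS. The following is a minimal Python illustration (standard library only), not the lecture's implementation: the "posterior draws" of (b0, b1, sigma) are synthetic stand-ins for MCMC output, and the straight-line g(θ, x) is an assumption made only for this example.

```python
import random
import statistics

random.seed(1)

def g(theta, x):
    """Hypothetical process model: a straight line in x."""
    b0, b1 = theta
    return b0 + b1 * x

# Stand-in for MCMC output: K synthetic "posterior draws" of (b0, b1, sigma).
K = 5000
draws = [(random.gauss(2.0, 0.1), random.gauss(0.5, 0.02), abs(random.gauss(1.0, 0.05)))
         for _ in range(K)]

x_new = 10.0
# One draw of y_new per posterior draw: y_new ~ Normal(g(theta', x), sigma').
y_new = [random.gauss(g((b0, b1), x_new), sigma) for b0, b1, sigma in draws]

# The y_new draws define the posterior predictive distribution; summarize them
# exactly as we would summarize draws of a parameter.
pp_mean = statistics.fmean(y_new)
pp_sd = statistics.stdev(y_new)
y_sorted = sorted(y_new)
pp_interval = (y_sorted[int(0.025 * K)], y_sorted[int(0.975 * K) - 1])
print(pp_mean, pp_sd, pp_interval)
```

Note that the predictive standard deviation is larger than sigma alone, because it also carries the uncertainty in b0 and b1.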
DAG: Back to hemlock trees
[Figure: DAG with data model, process model, and parameter model levels. y_i is the observed growth. μ_i is the true value of growth; it is “latent,” i.e., not observable.]
model{
  for(i in 1:length(y)){
    mu[i] <- b0 + b1*diam[i]
    y[i] ~ dnorm(mu[i],tau)
    # posterior predictive distribution of y.new[i], unobserved trees
    y.sim[i] ~ dnorm(mu[i],tau)
  }
  # Priors
  b0 ~ dnorm(0,.0001)
  b1 ~ dnorm(0,.0001)
  tau ~ dgamma(.0001,.0001)
  sigma <- 1/sqrt(tau)
}
$$P(b_0, b_1, \tau \mid y, diam) \propto \prod_{i=1}^{n} \text{normal}\big(y_i \mid g(b_0, b_1, diam_i), \tau\big)\,\text{normal}(b_0 \mid 0, .0001)\,\text{normal}(b_1 \mid 0, .0001)\,\text{gamma}(\tau \mid .001, .001)$$
$$g(b_0, b_1, diam_i) = b_0 + b_1 \times diam_i$$
Recall that by Monte Carlo Integration….
Marginal posterior distributions
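$$E(\theta \mid y) \approx \frac{1}{K}\sum_{k=1}^{K} \theta^{(k)} \quad \text{and} \quad var(\theta \mid y) \approx \frac{1}{K}\sum_{k=1}^{K} \big(\theta^{(k)} - E(\theta \mid y)\big)^2$$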
• Equivariance: Any quantity calculated from a random variable becomes a random variable with its own probability distribution.
• These quantities may be of scientific interest in themselves (e.g., biomass using allometric equations, Shannon Diversity Index, effect sizes…and so on).
• The derived quantity may involve model parameters, latent processes, or data.
• Equivariance is also incredibly useful in calculating goodness of fit of the model against observed data and in making forecasts about yet-unobserved quantities.
Derived quantities
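As a small illustration of Monte Carlo integration applied to a derived quantity, here is a Python sketch (standard library only). The gamma-distributed draws of a precision tau are synthetic stand-ins for an MCMC chain, and the derived quantity sigma = 1/sqrt(tau) is chosen only because it appears in the JAGS code above.

```python
import math
import random
import statistics

random.seed(2)

# Stand-in for MCMC output: K synthetic posterior draws of a precision, tau.
K = 10000
tau_draws = [random.gammavariate(50.0, 1.0 / 200.0) for _ in range(K)]  # shape 50, scale 1/200

# Derived quantity: transform every draw, then summarize the transformed draws.
sigma_draws = [1.0 / math.sqrt(t) for t in tau_draws]

# Monte Carlo integration: posterior mean and variance are averages over the K draws.
sigma_mean = statistics.fmean(sigma_draws)
sigma_var = statistics.pvariance(sigma_draws)

# For comparison: transforming the posterior mean of tau is NOT the posterior mean of sigma.
sigma_plugin = 1.0 / math.sqrt(statistics.fmean(tau_draws))
print(sigma_mean, sigma_var, sigma_plugin)
```

The design point is that the transformation is applied draw by draw, so the derived quantity gets a full posterior distribution rather than a single plug-in value.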
Bayesian p-values
• Let T(y,θ) be a test statistic (e.g., mean, standard deviation, CV, quantile, sum of squared discrepancies, etc.) associated with the observed data
• Likewise, let T(yrep,θ) be the corresponding test statistic associated with the replicated data
• We can calculate the “tail” probability p:
$$p = P\big(T(y^{rep},\theta) \geq T(y,\theta) \mid y\big) \quad \text{or} \quad p = P\big(T(y^{rep},\theta) \leq T(y,\theta) \mid y\big)$$
• If p is very large (e.g., p >> 0.5 or close to 1) or very small (i.e., p << 0.5 or close to 0), then the difference between the observed and simulated data is unlikely to be due to chance, indicating potential lack of fit.
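The tail probability can be estimated as the fraction of posterior draws in which the replicated statistic is at least as large as the observed one. The Python sketch below (standard library only) assumes a simple Poisson model with a gamma(0.001, 0.001) prior, so the posterior can be sampled directly through conjugacy instead of MCMC. The data are hypothetical over-dispersed counts, and `rpois` and `cv` are helper names introduced here.

```python
import math
import random
import statistics

random.seed(3)

def rpois(lam):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, random.random()
    while p > limit:
        k += 1
        p *= random.random()
    return k

def cv(values):
    """Coefficient of variation, the test statistic T used here."""
    return statistics.stdev(values) / statistics.fmean(values)

n = 60
# Hypothetical over-dispersed counts: each unit has its own gamma-distributed mean.
y = [rpois(random.gammavariate(1.0, 3.0)) for _ in range(n)]
cv_obs, mean_obs = cv(y), statistics.fmean(y)

# Simple model: y_i ~ Poisson(lambda), lambda ~ gamma(0.001, 0.001).
# Conjugacy gives lambda | y ~ gamma(shape = 0.001 + sum(y), rate = 0.001 + n).
shape, rate = 0.001 + sum(y), 0.001 + n

K = 2000
exceed_cv = exceed_mean = 0
for _ in range(K):
    lam = random.gammavariate(shape, 1.0 / rate)     # posterior draw
    y_rep = [rpois(lam) for _ in range(n)]           # replicated data set
    exceed_cv += cv(y_rep) >= cv_obs                 # same role as step() in JAGS
    exceed_mean += statistics.fmean(y_rep) >= mean_obs

p_cv = exceed_cv / K        # near 0: the model cannot reproduce the observed spread
p_mean = exceed_mean / K    # near 0.5: the model reproduces the mean
print(p_cv, p_mean)
```

A p-value near 0.5 for the mean together with an extreme p-value for the CV is the typical signature of over-dispersion relative to a single-mean Poisson model.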
R.A. Fisher’s ticks
A simple example: We want to know the average number of ticks on sheep. We round up 60 sheep and count ticks on each one. Does a Poisson distribution fit the data?
A single mean governs the pattern:
$$P(\lambda \mid y) \propto \prod_{i=1}^{60} P(y_i \mid \lambda)\,P(\lambda)$$
For each value in the MCMC chain, we generate a new data set, y^sim, by sampling from:
$$P(y^{sim} \mid \lambda)\,P(\lambda \mid y)$$
[Figure: DAG with a data node (y_i) and a single parameter node (λ).]
model{
  # prior
  lambda ~ dgamma(0.001,0.001)
  for(i in 1:60){
    y[i] ~ dpois(lambda)
    y.sim[i] ~ dpois(lambda) # simulate a new data set of 60 points
  }
  cv.y <- sd(y[])/mean(y[])
  cv.y.sim <- sd(y.sim[])/mean(y.sim[])
  mean.y <- mean(y[])
  mean.y.sim <- mean(y.sim[])
  # find Bayesian P value--the mean of many 0's and 1's returned by the step function, one for each step in the chain
  pvalue.cv <- step(cv.y.sim-cv.y)
  pvalue.mean <- step(mean.y.sim - mean.y)
  # Sums of squares
  for(j in 1:60){
    sq[j] <- (y[j]-lambda)^2
    sq.new[j] <- (y.sim[j]-lambda)^2
  }
  fit <- sum(sq[])
  fit.new <- sum(sq.new[])
  pvalue.fit <- step(fit.new-fit)
} # end of model
Key part: the step function, step(x), returns 1 if x ≥ 0 and 0 otherwise.
Simple Model
[Figure: histograms of the real data and the simulated data; x-axis: number of ticks (0 to 10), y-axis: density.]
Posterior predictive check
Simple Model: p-value for CV = 0.0015; p-value for mean = 0.51. Remember, this is a two-tailed probability, so values close to 0 and 1 indicate lack of fit.
[Figure: simulated data plotted against real data, both axes 0 to 10.]
Hierarchical Model
Each sheep has its own mean (a.k.a. random effect).
[Figure: DAG with three levels: data (y_i), parameter (λ_i), and hyperparameters (a, b).]
model{
  # Priors
  a ~ dgamma(.001,.001)
  b ~ dgamma(.001,.001)
  for(i in 1:60){
    lambda[i] ~ dgamma(a,b)
    y[i] ~ dpois(lambda[i])
    y.sim[i] ~ dpois(lambda[i])
  }
  cv.y <- sd(y[])/mean(y[])
  cv.y.sim <- sd(y.sim[])/mean(y.sim[])
  # find Bayesian P value--the mean of many 0's and 1's returned by the step function, one for each step in the chain
  pvalue.cv <- step(cv.y.sim-cv.y)
  mean.y <- mean(y[])
  mean.y.sim <- mean(y.sim[])
  pvalue.mean <- step(mean.y.sim - mean.y)
  for(j in 1:60){
    sq[j] <- (y[j]-lambda[j])^2
    sq.new[j] <- (y.sim[j]-lambda[j])^2
  }
  fit <- sum(sq[])
  fit.new <- sum(sq.new[])
  pvalue.fit <- step(fit.new-fit)
} # end of model
$$P(\lambda, a, b \mid y) \propto \prod_{i=1}^{60} P(y_i \mid \lambda_i)\,P(\lambda_i \mid a, b)\,P(a)\,P(b)$$
Hierarchical Model
Posterior predictive check
Hierarchical Model: p-value for CV = 0.46; p-value for mean = 0.51. Remember, this is a two-tailed probability, so values close to 0 and 1 indicate lack of fit.
[Figure: simulated data plotted against real data, both axes 0 to 10.]
Posterior predictive checks
• Gelman, A., and J. Hill. 2009. Data analysis using regression and multilevel / hierarchical models. Cambridge University Press, Cambridge, UK.
• Link, W. A., and R. J. Barker. 2010. Bayesian Inference with Ecological Applications. Academic Press.
• Kery, M. 2010. Introduction to WinBUGS for Ecologists: A Bayesian approach to regression, ANOVA, mixed models and related analyses. Academic Press.
“Model selection and model averaging are deep waters, mathematically, and no consensus has emerged in the substantial literature on a single
approach. Indeed, our only criticism of the wide use of AIC weights in wildlife and ecological statistics is with their uncritical acceptance and the view that this challenging problem has been simply resolved”
Link, W. A., and R. J. Barker. 2006. Model weights and the foundations of multi-model inference. Ecology 87:2626
The problem of model selection
• Up until now, we have been concerned with the uncertainties associated with a given model.
• What about the uncertainty that arises from our choice of models?
• How do we decide which model is best?
• How do we make inferences based on multiple
models?
Parsimony = Ockham’s razor
William of Ockham (1285-1349)
“Pluralitas non est ponenda sine necessitate”
“entities should not be multiplied unnecessarily”
“Parsimony: ... 2 : economy in the use of means to an end;
especially : economy of explanation in conformity with Occam's razor”
(Merriam-Webster Online Dictionary)
Information theory and the principle of
parsimony
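True model:
$$y = e^{(x-0.3)^2} - 1$$
Generated ten datasets by adding errors sampled from a normal distribution with mean = 0 and var = 0.01.
Fit five models to these ten datasets:
$$y = \beta_0 + \beta_1 x$$
$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$$
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5$$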
What creates noise in models?
Illustration of tradeoff
The Kullback-Leibler distance
Interpretation of Kullback-Leibler information (a.k.a. distance between 2 models)
• Given truth represented by f and a model approximating truth g, the K-L distance measures the information lost by using model g to approximate f.
• It measures the (asymmetric) distance between two models. Minimizing the information lost when using g(x) to approximate f(x) is the same as maximizing the likelihood.
Heuristic interpretation of K-L
[Figure: three histograms of counts over 0 to 20 (panels GAMMA, WEIBULL, LOGNORMAL), labeled as the truth f(x) and the approximations to truth g1(x) and g2(x).]
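For discrete distributions, the K-L information is $I(f, g) = \sum_x f(x)\,\ln\big(f(x)/g(x)\big)$. The Python sketch below evaluates it for a made-up "truth" f and two made-up approximations g1 and g2; all probabilities are hypothetical and serve only to show the calculation and its asymmetry.

```python
import math

def kl(f, g):
    """K-L information I(f, g) = sum_x f(x) * ln(f(x) / g(x)) for discrete distributions."""
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g) if fx > 0)

f = [0.10, 0.40, 0.30, 0.20]    # hypothetical "truth"
g1 = [0.15, 0.35, 0.30, 0.20]   # a close approximation to truth
g2 = [0.25, 0.25, 0.25, 0.25]   # a cruder approximation to truth

info_lost_g1 = kl(f, g1)        # information lost when g1 approximates f
info_lost_g2 = kl(f, g2)        # information lost when g2 approximates f
reverse_g2 = kl(g2, f)          # not equal to kl(f, g2): the "distance" is asymmetric
print(info_lost_g1, info_lost_g2, reverse_g2)
```

Less information is lost with g1 than with g2, so g1 is the better approximation in the K-L sense.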
Model comparison
• Within the classical modeling framework, we tradeoff a measure of complexity (typically deviance) for a
measure of complexity (typically number of
parameters).
How do we know the truth?
Akaike’s Information Criterion
Akaike defined “an information criterion” that relates K-L distance and the maximized log-likelihood as follows:
$$AIC = -2\ln\big(L(\hat{\theta} \mid y)\big) + 2K$$
This is an estimate of the expected, relative distance between the fitted model and the unknown true mechanism that generated the observed data.
K = number of estimable parameters
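A minimal Python sketch of the AIC calculation for two least-squares models with normal errors. The data are hypothetical, `gaussian_aic` is a helper name introduced here, and K counts the regression coefficients plus the error variance.

```python
import math

def gaussian_aic(residuals, n_coef):
    """AIC = -2 ln L(theta_hat | y) + 2K for a model with normal errors."""
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n           # MLE of the error variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1)
    k = n_coef + 1                                           # coefficients + variance
    return -2 * log_lik + 2 * k

# Hypothetical data with a clear linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n

# Model A: mean only.
res_mean = [yi - ybar for yi in y]

# Model B: straight line fitted by least squares (the MLE under normal errors).
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
res_line = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

aic_mean = gaussian_aic(res_mean, n_coef=1)
aic_line = gaussian_aic(res_line, n_coef=2)
print(aic_mean, aic_line)
```

The line pays a penalty of 2 for its extra parameter, but its much higher maximized likelihood gives it the lower AIC.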
Deviance
$$D(y, \theta) = -2\log P(y \mid \theta)$$
• Deviance (deviance) is a built-in node in JAGS; thus, you can:
– monitor deviance,
– look at its history plots (helpful for evaluating overall model convergence and potential “problem” chains),
– compute posterior statistics.
• Use the difference in AIC to compare competing models:
$$\Delta_r = AIC_r - \min(AIC), \qquad \min(AIC) = \min(AIC_1, \ldots, AIC_n)$$
As a rule of thumb, models having Δr ≤ 2 have sufficient support; they should receive consideration in making inferences. Models having Δr within about 3-7 have considerably less support, while models with Δr ≥ 10 have essentially no support.
But there is a better way….
Likelihood ratio from AIC
$$\Delta_r = AIC_r - \min(AIC)$$
• The likelihood of a model given the data decreases exponentially with increasing Δr. Note that the likelihood of the best model = 1 and all other likelihoods are relative to the likelihood of the best model.
$$L(\text{model}_r \mid y) \propto e^{-\frac{1}{2}\Delta_r}$$
Relative strength of evidence in data for model 1 over 2:
$$\frac{e^{-\frac{1}{2}\Delta_1}}{e^{-\frac{1}{2}\Delta_2}}$$
Akaike Weights
$$w_r = \frac{e^{-\frac{1}{2}\Delta_r}}{\sum_{i=1}^{R} e^{-\frac{1}{2}\Delta_i}} = \frac{\text{likelihood of model } r \mid \text{data}}{\text{total likelihood of all models} \mid \text{data}}$$
wr are Akaike weights: the likelihood of one of the candidate models divided by the sum of the likelihoods of all of the candidates. The wr for the best model does not equal 1. The wr sum to 1.
The wr can be thought of as “probabilities.” This is a frequentist interpretation derived from simulation. They are not “true” probabilities (Link, W. A., and R. J. Barker. 2006).
Interpretation of Akaike Weights
• w_i is the weight of evidence in favor of model i being the actual best K-L model, given that one of the R models must be the K-L best model of the candidate set.
• “probability” that model i is the actual best K-L model
• Last statement is quite controversial.
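The chain from AIC values to differences, relative likelihoods, evidence ratios, and Akaike weights is short enough to sketch in Python. The AIC values below are hypothetical.

```python
import math

# Hypothetical AIC values for a candidate set of R = 4 models.
aic = {"model 1": 100.0, "model 2": 101.5, "model 3": 104.0, "model 4": 112.0}

best = min(aic.values())
delta = {m: a - best for m, a in aic.items()}                 # Delta_r = AIC_r - min(AIC)
rel_lik = {m: math.exp(-0.5 * d) for m, d in delta.items()}   # best model has likelihood 1
total = sum(rel_lik.values())
weights = {m: l / total for m, l in rel_lik.items()}          # Akaike weights, sum to 1

# Relative strength of evidence for model 1 over model 2.
evidence_ratio_1_vs_2 = rel_lik["model 1"] / rel_lik["model 2"]

for m in aic:
    print(m, delta[m], round(rel_lik[m], 3), round(weights[m], 3))
```

With these numbers the best model receives a weight of about 0.62 rather than 1, and the evidence ratio for model 1 over model 2 is about 2.1.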
The raptors…moving towards model selection
0. Identify scientific problem/objectives; understand data; draw upon existing theory/knowledge.
Summary of problem and data: In most northern temperate regions, diurnal birds of prey (raptors) migrate seasonally between their breeding and wintering grounds. Most raptors are obligate or facultative soaring migrants that congregate along major thermal and orographic updraft corridors.
We might wish to analyze the raptor survey data to understand how temperature and wind speed affect the chance of observing birds of each species.
Data: Autumn migration counts of multiple species of raptors in NE U.S., conducted during 2010 (95 days, 15 species).
y_{d,s} = number of birds observed on day d for species s
x_d = total time of observation period (minutes) on day d
T_d = average air temperature (°C) during observation period on day d
WS_d = average wind speed (km/hr) during observation period on day d
Therrien et al. 2012. Ecology.
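Visualize model via DAG
[Figure: DAG with levels Day; Day, species; Species; Month; Population. Nodes include the data y and the covariates x, WS, and T, plus three σ terms. Legend: data (stochastic), latent process, data parameters, process parameters.]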
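Specify model
Likelihood:
$$y_{d,s} \sim \text{Poisson}(\lambda_{d,s}\,x_d)$$
Log link function (for log-linear Poisson regression):
$$\eta_{d,s} = \log(\lambda_{d,s}), \qquad \lambda_{d,s} = \exp(\eta_{d,s})$$
Stochastic model for linear predictor (accounts for over-dispersion):
$$\eta_{d,s} \sim \text{Normal}\big(\theta_{1,s} + \theta_{2,s} T_d + \theta_{3,s} WS_d + \theta_{4,s} T_d\,WS_d + \varepsilon_{m(d)},\ \sigma^2\big)$$
Hierarchical priors for species-level effects parameters:
$$\theta_{k,s} \sim N(\hat{\theta}_k, \sigma_k^2), \qquad k = 1, 2, 3, 4 \text{ parameters}$$
Zero-centered hierarchical prior for month random effect:
$$\varepsilon_m \sim N(0, \sigma_\varepsilon^2)$$
Conjugate, relatively non-informative priors for root nodes:
$$\hat{\theta}_k \sim N(0, 10000), \qquad \tau, \tau_k, \tau_\varepsilon \sim \text{gamma}(0.01, 0.01)$$
where σ2 = 1/τ for each σ2 term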
Is over-dispersion needed?
Log link function (for log-linear Poisson regression):
$$\eta_{d,s} = \log(\lambda_{d,s}), \qquad \lambda_{d,s} = \exp(\eta_{d,s})$$
Model for linear predictor (without over-dispersion):
$$\eta_{d,s} = \theta_{1,s} + \theta_{2,s} T_d + \theta_{3,s} WS_d + \theta_{4,s} T_d\,WS_d + \varepsilon_{m(d)}$$
Hierarchical priors for species-level effects parameters:
$$\theta_{k,s} \sim N(\hat{\theta}_k, \sigma_k^2), \qquad k = 1, 2, 3, 4 \text{ parameters}$$
Zero-centered hierarchical prior for month random effect:
$$\varepsilon_m \sim N(0, \sigma_\varepsilon^2)$$
Conjugate, relatively non-informative priors for root nodes:
$$\hat{\theta}_k \sim N(0, 10000), \qquad \tau_k, \tau_\varepsilon \sim \text{gamma}(0.01, 0.01)$$
where σ2 = 1/τ for each σ2 term
Implement (code) models: Model 1
BUGS code shown here.
Implement models: Model 2 (no over-dispersion)
BUGS code shown here.
Evaluate results/make inferences
Model 1: includes over-dispersion. Example (temperature effects at species level):
[Figure: effect of temperature on observation rate; x-axis: species ID, y-axis: theta.star[2,] (about -0.5 to 0.5).]
Population-level parameter   mean   sd        2.5%    median   97.5%
sig (σ)                      1.68   0.07601   1.538   1.678    1.84
$$\eta_{d,s} \sim \text{Normal}\big(\theta_{1,s} + \theta_{2,s} T_d + \theta_{3,s} WS_d + \theta_{4,s} T_d\,WS_d + \varepsilon_{m(d)},\ \sigma^2\big)$$
Do the posterior stats for the over-dispersion standard deviation term (σ) indicate the presence of “significant” over-dispersion?
Replicated data (Model 1)
[Figure: predicted # of birds (posterior mean & 95% CI for yrep) plotted against observed # of birds (y), one panel per species.]
Species 8 (northern goshawk): R2 = 0.965, coverage = 100%
Species 10 (broad-winged hawk): R2 = 1.0, coverage = 100%
Things to look for:
○ Bias/accuracy: Do the points fall around the 1:1 line, or is there some prediction bias?
○ Coverage: Do most of the observed values fall inside the 95% CIs for the replicated values (the Yrep’s)?
○ What percentage do you expect to fall outside of the Yrep CIs?
○ Is the variability in the observed data consistent with the variability in the replicated values?
○ Can also overlay plots of the observed Y values and the predicted Y values (i.e., posterior means for Yrep and 95% CI) as functions of a covariate.
○ Why do we get a “perfect” fit and 100% coverage when including over-dispersion here?
Replicated data (Model 2)
[Figure: predicted # of birds (posterior mean & 95% CI for yrep) plotted against observed # of birds (y), one panel per species.]
Species 8 (northern goshawk): R2 = 0.082, coverage = 97.9%
Species 10 (broad-winged hawk): R2 = 0.259, coverage = 52.6%
Things to look for:
○ Bias/accuracy: Do the points fall around the 1:1 line, or is there some prediction bias?
○ Coverage: Do most of the observed values fall inside the 95% CIs for the replicated values (the Yrep’s)?
○ What percentage do you expect to fall outside of the Yrep CIs?
○ Is the variability in the observed data consistent with the variability in the replicated values?
○ Can also overlay plots of the observed Y values and the predicted Y values (i.e., posterior means for Yrep and 95% CI) as functions of a covariate.
○ Why is the fit so much worse when we don’t include an over-dispersion term?
Point estimates of deviance
• Can get a point estimate of deviance by plugging in a point estimate of the parameters (e.g., the posterior means, θ̂):
$$\hat{D}(y) = D(y, \hat{\theta}), \qquad \hat{\theta} = \text{point estimate (usually posterior mean)}$$
• But, this doesn’t account for uncertainty in the parameters.
Posterior mean of deviance
• Compute an “expected” deviance, which may be used as an overall measure of model fit.
• Compute “expected” deviance by “averaging” over the posterior distribution of the parameters:
$$D_{ave}(y) = E\big(D(y,\theta) \mid y\big) = \int D(y,\theta)\,P(\theta \mid y)\,d\theta$$
• If we have L draws from the posterior, then an estimate of Dave(y) is:
$$\hat{D}_{ave}(y) = \frac{1}{L}\sum_{l=1}^{L} D(y, \theta^{l})$$
Model complexity
• To account for model complexity, compute the effective number of parameters. To do this we compute deviance in two ways:
– The posterior mean of the deviance.
– The deviance evaluated at the posterior mean values of model parameters.
$$p_D = \hat{D}_{ave}(y) - \hat{D}(y)$$
• Why should the first component be larger than the second component?
• In some situations, the above solution for pD can lead to pD < 0 (i.e., a negative # of effective parameters), which renders DIC and pD useless.
Computing DIC
• Thus, DIC is given by:
$$DIC = \hat{D}_{ave}(y) + p_D$$
model fit (lower better) + model penalty (lower better), where $\hat{D}_{ave}(y)$ is the posterior mean of the deviance and $p_D$ is the effective number of parameters. Equivalently, $DIC = \hat{D}(y) + 2p_D$, where $\hat{D}(y)$ is the deviance of the model evaluated at the means of the posterior distribution of parameters.
• Lower DIC -> “better” model, but what is an “important difference” in DIC?
• Interpretation of DIC values same as AIC
• DIC difference of 1-2: models this close to the “best” model deserve consideration
• DIC differences of 3-7: “considerably” more support for best model
• Differences can be affected by Monte Carlo error
• I look for differences > 10
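The pieces of DIC can be computed directly from posterior draws. The Python sketch below (standard library only) uses a deliberately simple, assumed setting: y_i ~ Normal(mu, sd = 1) with a flat prior on mu, so that mu | y ~ Normal(ybar, 1/n) and the draws can be generated without MCMC. The data and the `deviance` helper are hypothetical; with one free parameter, pD should come out close to 1.

```python
import math
import random
import statistics

random.seed(4)

# Hypothetical data: 20 observations from a normal distribution with known sd = 1.
y = [random.gauss(5.0, 1.0) for _ in range(20)]
n = len(y)
ybar = statistics.fmean(y)

def deviance(mu):
    """D(y, mu) = -2 log P(y | mu) for y_i ~ Normal(mu, sd = 1)."""
    return sum((yi - mu) ** 2 + math.log(2 * math.pi) for yi in y)

# Stand-in for MCMC output: with a flat prior, mu | y ~ Normal(ybar, 1/n).
L = 5000
mu_draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(L)]
dev_draws = [deviance(mu) for mu in mu_draws]

D_ave = statistics.fmean(dev_draws)              # posterior mean of the deviance
D_hat = deviance(statistics.fmean(mu_draws))     # deviance at the posterior mean of mu
pD = D_ave - D_hat                               # effective number of parameters
DIC = D_ave + pD                                 # equivalently D_hat + 2 * pD
pD_alt = 0.5 * statistics.variance(dev_draws)    # alternative: half the variance of the deviance
print(round(pD, 2), round(pD_alt, 2), round(DIC, 1))
```

The two estimates of the effective number of parameters agree only approximately; the variance-based version is noisier but cannot be negative.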
In R
# JAGS model (saved in the file "Bugs_light_example.R")
model{
  for (i in 1:n){
    mu[i] <- (alpha*x[i])/((alpha/gama)+x[i])
    y[i] ~ dnorm(mu[i],tau)
  }
  tau ~ dgamma(0.001,.001)
  sigma <- 1/sqrt(tau) # derived so that "sigma" can be monitored
  alpha ~ dgamma(0.001,.001)
  gama ~ dgamma(.001,.001)
} # end of model
# In R
library(rjags)
load.module("dic") # the dic module provides the deviance node
jm = jags.model("Bugs_light_example.R", data=data, inits=mod.inits, n.chains=3, n.adapt=n.adapt)
update(jm, n.iter=5000)
# generate coda object for parameters and deviance
zm = coda.samples(jm, variable.names=c("alpha","gama","sigma","deviance"), n.iter=5000)
dic.samples(jm, n.iter=5000, type="pD")
# Mean deviance: 529.8
# penalty: 2.97 = pD
# Penalized deviance: 532.8 = DIC
# Another way:
summary(zm[,"deviance"])$stat
# pD* = (1/2)*Var(deviance)
pd = 0.5*summary(zm[,"deviance"])$stat[2]^2
• DIC, AIC, and BIC all have the same general form:
xIC = model fit + model penalty
• So, why focus on DIC? Why not use AIC or BIC?
• BIC and AIC require us to count the number of parameters in a model, but informative or hierarchical priors make it impossible to count the number of “effective” parameters
• Thus, Spiegelhalter et al. (2002) developed DIC, which
computes an “effective” number of parameters that (should) capture the effect of “shrinkage” or “borrowing of strength”
due to informative priors or hierarchical priors
Why DIC?
Some intuition for DIC
• The problem is parameters that are “free” to be
influenced by noise in the data. How free are they?
• If a prior on a parameter is very informative—the
parameter is not free to respond to the data, it does not contribute to the effective number of parameters.
• If a prior is uninformative, the opposite is the case. It is free to respond and contributes to the effective number of parameters in the same way as in a likelihood analysis.
• If a parameter is part of a hierarchy, should it count the same way as a parameter that is part of a simpler model?
Posterior predictive loss
• Posterior predictive loss (i.e., Dk) is fairly widely used within statistics, but is not frequently used in ecology
• Dk provides an index of a model’s predictive ability by comparing observed data to replicated data
• The “best” model(s) is the one that performs the best under a “balanced loss function.”
• Similar to the DIC, the loss function penalizes for both departure from the observed data (measure of model “fit”) and departure from what we expect the replicated data to be (measure of “smoothness” – somewhat analogous to DIC’s effective number of parameters).
• The loss function puts a weight, which depends on k > 0, on the model fit component (G) and a weight of 1 (one) on the smoothness or penalty component (P); the value of k is determined by the user:
$$D_k = \frac{k}{k+1}\,G + P$$
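Posterior predictive loss (D∞)
• We often assume “k = ∞,” in which case Dk (call this D∞) is given by:
$$D_\infty = G + P$$
• Under squared-error loss, G and P are given by:
$$G = \sum_{i=1}^{N} \big(\mu_{rep_i} - y_i\big)^2, \quad \text{where } \mu_{rep_i} = E(y_{rep_i} \mid y)$$
$$P = \sum_{i=1}^{N} \sigma^2_{rep_i}, \quad \text{where } \sigma^2_{rep_i} = Var(y_{rep_i} \mid y)$$
• Thus, D∞ is equivalent to:
$$D_\infty = \sum_{i=1}^{N} \big(\mu_{rep_i} - y_i\big)^2 + \sum_{i=1}^{N} \sigma^2_{rep_i}$$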
Computing D in practice
• In your code, simply monitor the squared deviation, (yrep_i - y_i)^2, for each observation i, and outside of the i loop, compute the sum of the squared deviations:
Dsum.species[s] <- sum(sqdiff[,s]) # in the species loop
Dsum is NOT D; D is computed after you’ve run the model (after convergence) and it is approximated as the posterior mean (expected value) of Dsum.
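The two routes to D (summing G and P, or averaging Dsum over the chain) can be checked against each other. The Python sketch below is illustrative only: the observations are hypothetical, and the replicated data are synthetic normal draws standing in for monitored yrep values.

```python
import random
import statistics

random.seed(5)

y = [3.1, 4.0, 5.2, 5.9, 7.1]          # hypothetical observations
mu = [3.0, 4.0, 5.0, 6.0, 7.0]         # hypothetical predictive means
K = 4000

# Stand-in for monitored yrep: K rows (iterations), one column per observation.
yrep = [[random.gauss(m, 0.5) for m in mu] for _ in range(K)]

# Route 1: G and P from the posterior predictive mean and variance of each yrep[i].
cols = list(zip(*yrep))
G = sum((statistics.fmean(c) - yi) ** 2 for c, yi in zip(cols, y))   # fit component
P = sum(statistics.pvariance(c) for c in cols)                       # penalty component
D = G + P

# Route 2: Dsum at every iteration, then its posterior mean.
Dsum = [sum((r - yi) ** 2 for r, yi in zip(row, y)) for row in yrep]
D_from_Dsum = statistics.fmean(Dsum)
print(G, P, D, D_from_Dsum)
```

The two routes agree because the mean squared deviation of each yrep from its observation splits into a squared bias term (G) and a variance term (P).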
Back to the tree and the light
[Figure: scatterplot of observed growth rate (tree.data$Observed.growth.rate, about 10 to 40) against light (tree.data$Light, about 20 to 100).]
Model 1: mu[i] <- (alpha*x[i])/((alpha/gama)+x[i])
Model 2: mu[i] <- alpha*x[i]
In R
# JAGS model
model{
  for (i in 1:n){
    mu[i] <- (alpha*x[i])/((alpha/gama)+x[i])
    y[i] ~ dnorm(mu[i],tau)
    yrep[i] ~ dnorm(mu[i],tau) # replicated data
  }
  tau ~ dgamma(0.001,.001)
  sigma <- 1/sqrt(tau) # derived so that "sigma" can be monitored
  alpha ~ dgamma(0.001,.001)
  gama ~ dgamma(.001,.001)
} # end of model
# In R (jm is the compiled jags.model object)
zm = coda.samples(jm, variable.names=c("alpha","gama","sigma","deviance","yrep"), n.iter=5000)
y <- tree.data$Observed.growth.rate # assign response variable
ntree = nrow(tree.data)
yrep.stats = summary(zm[,paste("yrep[",1:ntree,"]",sep="")])$stat
G <- sum((yrep.stats[,1]-y)^2) # sq diff (yrep.mean - y)
P <- sum((yrep.stats[,2])^2) # var for each yrep
Dinf <- G + P
Interpreting D values
• For a “poor” model, we expect large predictive variance (large P) and poor fit (large G).
• Better models have a lower Dk (or lower D) associated with a smaller P and/or smaller G
• But, as we start to “overfit” (e.g., model with lots of parameters), G will continue to decrease (better fit), but P will start to increase (i.e., variance will be inflated due to multi-collinearity between parameters).
• The model with the smallest Dk (or D) is preferred.
• But, how small is “small”?
Linear vs non-linear
            pd     Mean(dev)   P         G         Dinf      DIC     R2
Nonlinear   2.94   529.8       4551.13   4236.13   8787.27   532.7   0.91
Linear      2.05   574.2       8019.67   7612.01   15631.6   576.3   0.82
Consider multiple comparison indices
Some conclusions:
• Posterior predictive checks suggest we should “pick” Model 1 (with over-dispersion) over Model 2 (Model 1 has much lower DIC, D, G)
• But, the results suggest that Model 1 is over-fitting the data: it has a very high P, the coverage of the replicated data is too high (should be ~95%), and the uncertainty in the replicated data (e.g., width of 95% CIs) is similar to Model 2.
• Perhaps we should evaluate a third model that lies somewhere between the complexity of Model 1 and Model 2? E.g., explore incorporation of day (nested in month) random effects without observation-level over-dispersion?
Indices computed across all species.
• In the index of Gelman et al. 1995: “Model selection, why we do not do it”
• In the index of Gelman et al. 2004: “Model selection, why we avoid it.”
– Gelman, A., J. B. Carlin, H. S. Stern, and D. B.
Rubin. 2004. Bayesian data analysis. Chapman and
Hall / CRC, London.
When should we avoid model selection?
• When we have a firm, theoretical / mechanistic basis for a particular model formulation.
• When our objectives for insight determine the form of the model.
• When we want to make forecasts and must include
known influences on future behavior of the system.
Concluding remarks
• Use multiple approaches for comparing / selecting between models
• Selection criteria may depend on scientific objectives:
• Use model to learn (heuristic tool)
• Use model to predict under novel conditions
• Other topics not covered
• Model averaging & Bayes factors
• E.g., can use posterior model weights to average derived or predicted quantities obtained from each model
• Can use BIC to approximate the Bayes factors (BF)
• See Link & Barker (2006) or Kass (1993)
• Evaluation of model assumptions
• Appropriate choice of distributions?
• Appropriate model structure
• E.g.: linear vs non-linear; choice of covariates; random vs fixed effects; hierarchical vs non-hierarchical priors, etc.
References
• Posterior predictive loss: Gelfand & Ghosh (1998) Model choice: A minimum posterior predictive loss approach. Biometrika 85:1-11.
• Bayes factors & BIC: Link & Barker (2006) Model weights and the foundations of multimodel inference. Ecology 87:2626-2635.
• Model checking & improvement: Gelman, Carlin, Stern, Rubin (2004) Bayesian Data Analysis. Chapman & Hall/CRC. (Chapter 6)
• Bayes factors: Kass (1993) Bayes factors in practice. The Statistician 42:551-560.
• Elements of hierarchical Bayes, including Bayes factors, DIC, D:
Carlin et al. (2006) Elements of hierarchical Bayesian inference, in Hierarchical Modelling for the Environmental Sciences: Statistical Methods and Applications, J.S. Clark & A.E. Gelfand (eds.) Oxford Univ Press.
Quantiles vs. HPD intervals
[Figure: a posterior density with the HPDI and the quantile-based interval marked.]
• HPDI (highest posterior density interval): The longest horizontal line that can be placed within the distribution such that the area between the vertical dashed lines and beneath the distribution curve = 1 - alpha.
• Equal-tailed intervals: the alpha/2 and 1 - alpha/2 quantiles of the distribution.
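Both intervals can be read off sorted posterior draws. The Python sketch below is a rough illustration that assumes a unimodal posterior and uses synthetic right-skewed draws in place of an MCMC chain; the HPD interval is approximated by the narrowest window that holds the required share of the sorted draws.

```python
import random

random.seed(6)

# Stand-in for MCMC output: synthetic draws from a right-skewed "posterior".
K = 10000
draws = sorted(random.gammavariate(2.0, 1.0) for _ in range(K))

alpha = 0.05
m = round((1 - alpha) * K)      # number of draws each interval must contain

# Equal-tailed interval: cut the same number of draws from each tail.
lo = (K - m) // 2
equal_tailed = (draws[lo], draws[lo + m - 1])

# HPD interval (unimodal case): the narrowest window of m consecutive sorted draws.
start = min(range(K - m + 1), key=lambda i: draws[i + m - 1] - draws[i])
hpd = (draws[start], draws[start + m - 1])

print("equal-tailed:", equal_tailed)
print("HPD:         ", hpd)
```

For a skewed distribution the HPD interval is shifted toward the mode and is never wider than the equal-tailed interval; for a symmetric distribution the two nearly coincide.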
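Likelihood Ratio Test
• For nested models, the likelihood ratio statistic (R), twice the difference in log-likelihoods, follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between models A and B (A being the model with more parameters).
$$R = 2\,\big[\ln L(Y \mid M_A) - \ln L(Y \mid M_B)\big]$$
[Figure: the likelihood ratio test shown on a likelihood profile for θ; axes: difference in log-likelihood and chi-square probability; reference value Chi-sq = 3.84.]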
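A minimal Python sketch of the test for models that differ by one parameter. The log-likelihood values are hypothetical, and `chi2_sf_1df` is a helper introduced here; it uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper tail can be written with the complementary error function.

```python
import math

def chi2_sf_1df(r):
    """Upper-tail probability P(X > r) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(r / 2.0))

# Hypothetical maximized log-likelihoods; model B is nested in A with one fewer parameter.
loglik_A = -45.2
loglik_B = -48.9

R = 2.0 * (loglik_A - loglik_B)        # likelihood ratio statistic
p_value = chi2_sf_1df(R)               # chi-square probability with 1 df
reject_simpler_model = R > 3.84        # 3.84 is the 5% critical value for 1 df

print(R, p_value, reject_simpler_model)
```

Here R is 7.4, well above 3.84, so the extra parameter in model A is supported at the 5% level.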