Model Evaluation & Model Selection

(1)

Model Evaluation & Model Selection

(2)

Modeling process

• Problem identification / data collection

– Identify scientific objectives

– Collect & understand data

– Draw upon existing theory/knowledge

• Model specification

– Visualize model (DAG)

– Write model mathematically using probability notation & appropriate distributions

• Model implementation

– Write down (unnormalized) posterior, derive full-conditional distributions, and construct MCMC algorithm; or program model components in software

– Fit model (using MCMC & data)

• Model evaluation & inference

– Evaluate models (posterior predictive checks)

– Use output to make inferences

– Model selection

(3)

Motivating issues

• How well does my model(s) fit my data? [evaluation]

– Just because the MCMC procedure “went smoothly,” doesn’t mean you have a “good model”

– Just because you got posterior stats for your parameters of interest doesn’t mean you have a “good model”

– Check ability of model to “replicate” observed data

– Potentially check ability of model to “predict” observed data (via cross-validation)

• Alternative model formulations = alternative hypotheses about system. Which model? [selection]

– Which model agrees the best with my data?

– Which model is simpler to interpret?

– Which model satisfies both criteria? Which model should I choose?

– Combine alternative models? Model averaging

(4)

Lecture content

• Posterior predictive checks:

– Observed vs “predicted”

– Replicated data

– Bayesian p-values

• Model selection/comparison:

– Deviance

– Akaike Information Criterion (AIC) in the likelihood framework

– Deviance information criterion (DIC)

– Posterior predictive loss (D)

(5)

The first question we should ask after fitting a model: Are the predictions of the model consistent with the data?

1. Is our process model a reasonable representation?

2. Have we made the right choices of distributions to represent the uncertainties?

(6)

Evaluate model fits

• Model evaluation and diagnostics are relatively under-developed in Bayesian analysis

• We often rely on relatively qualitative and informal methods that are based on ideas developed in “classical” analyses

• But, one particularly useful method for evaluating model fit is to compare posterior predictions of “replicated” data with observed data.

• Examine the ability of the model to produce “replicated” data that are consistent with the observed data

• Assume that “replicated” data (yrep) arise from the same sampling distribution used to define the likelihood of the observed data (y):

$$y_i \sim P(y \mid \theta), \qquad y^{rep}_i \sim P(y \mid \theta)$$

• Again, y is observed and yrep is not observed; thus, we obtain the posterior predictive distribution for each yrep_i.

• Compare the posterior predictive distribution for each yrep_i to the corresponding observed y_i value.

(7)

Posterior predictive checks

Posterior predictive distribution

• It is called posterior because it is conditional on the observed y, and predictive because it is a prediction for an observable y^new.

• It gives the probability of a new prediction of y conditional on θ, which in turn is conditioned on the data at hand, y.

$$P(y^{new} \mid y) = \int P(y^{new} \mid \theta)\, P(\theta \mid y)\, d\theta$$

(8)

The mechanics

• We have a scientific model g(θ, x) that predicts a response y. We estimate the posterior distribution, P(θ|y). For any given value of x, we can simulate the posterior predictive distribution of y^new by making a draw θ = θ’ from P(θ|y) and then drawing

y^new ~ P(g(θ’, x), σ).

• In MCMC, this simply means making draws from the data model, because each draw is conditional on the current value of the parameters. These draws define the posterior predictive distribution in exactly the same way that draws allow us to define the posterior distribution of the parameters.
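The mechanics above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the posterior draws are hypothetical stand-ins generated from made-up normal distributions (in practice they would come from the MCMC output), g is assumed to be the linear form b0 + b1*x, and the data model is assumed to be normal.

```python
import random
import statistics

random.seed(1)

def g(theta, x):
    """Process model, assumed linear in x for this sketch."""
    b0, b1 = theta
    return b0 + b1 * x

# Hypothetical posterior draws of (b0, b1, sigma); real draws would come
# from the MCMC chain, one tuple per iteration.
posterior_draws = [
    (random.gauss(2.0, 0.1), random.gauss(0.5, 0.02), abs(random.gauss(1.0, 0.05)))
    for _ in range(5000)
]

def posterior_predictive(x, draws):
    """One y_new per posterior draw: y_new ~ Normal(g(theta', x), sigma')."""
    return [random.gauss(g((b0, b1), x), sigma) for b0, b1, sigma in draws]

y_new = posterior_predictive(10.0, posterior_draws)
y_new_sorted = sorted(y_new)
pp_mean = statistics.fmean(y_new)
# Equal-tailed 95% posterior predictive interval from the sorted draws
pp_interval = (y_new_sorted[int(0.025 * len(y_new))],
               y_new_sorted[int(0.975 * len(y_new)) - 1])
```

Because each y_new is drawn conditional on a different θ’, the spread of y_new reflects both parameter uncertainty and sampling variability in the data model.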

(9)

DAG: Back to hemlock trees

[Figure: DAG with three levels (data model, process model, parameter model) and nodes y_i, μ_i, and σ_proc.]

• y_ij is the observed growth.

• μ_i is the true value of growth. It is “latent”, i.e., not observable.

(10)

model{
  for(i in 1:length(y)){
    mu[i] <- b0 + b1*diam[i]
    y[i] ~ dnorm(mu[i],tau)
    # posterior predictive distribution of y.new[i], unobserved trees
    y.sim[i] ~ dnorm(mu[i],tau)
  }
  # Priors
  b0 ~ dnorm(0,.0001)
  b1 ~ dnorm(0,.0001)
  tau ~ dgamma(.0001,.0001)
  sigma <- 1/sqrt(tau)
}

$$g(b_0, b_1, diam_i) = b_0 + b_1\, diam_i$$

$$P(b_0, b_1, \tau \mid y, diam) \propto \prod_{i=1}^{n} \text{normal}(y_i \mid g(b_0, b_1, diam_i), \tau) \times \text{normal}(b_0 \mid 0, .0001)\, \text{normal}(b_1 \mid 0, .0001)\, \text{gamma}(\tau \mid .001, .001)$$

(11)

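Marginal posterior distributions

Recall that by Monte Carlo integration, with K draws θ_k from the posterior:

$$E(\theta \mid y) \approx \frac{1}{K} \sum_{k=1}^{K} \theta_k \quad \text{and} \quad \mathrm{var}(\theta \mid y) \approx \frac{1}{K} \sum_{k=1}^{K} \left(\theta_k - E(\theta \mid y)\right)^2$$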
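As a small illustration of the Monte Carlo estimates of the marginal posterior mean and variance, the sketch below averages over a list of draws. The draws are hypothetical (generated from a normal distribution purely as a stand-in for MCMC output).

```python
import random

random.seed(2)

# Hypothetical stand-in for K MCMC draws of a single parameter theta
draws = [random.gauss(1.5, 0.4) for _ in range(20000)]

K = len(draws)
post_mean = sum(draws) / K                                 # E(theta | y)
post_var = sum((t - post_mean) ** 2 for t in draws) / K    # var(theta | y)
```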

(12)

Derived quantities

• Equivariance: Any quantity calculated from a random variable becomes a random variable with its own probability distribution.

• These quantities may be of scientific interest in themselves (e.g., biomass using allometric equations, Shannon Diversity Index, effect sizes, and so on).

• The derived quantity may involve model parameters, latent processes, or data.

• Equivariance is also incredibly useful in calculating goodness of fit of the model against observed data and in making forecasts about yet-unobserved quantities.
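A short sketch of a derived quantity: the standard deviation sigma = 1/sqrt(tau), computed draw by draw so that it has its own posterior distribution. The precision draws are hypothetical (sampled from a gamma distribution as a stand-in for MCMC output). The comparison at the end illustrates why the transformation should be applied to each draw rather than to a posterior summary.

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical posterior draws of a precision tau (gamma with shape 50, scale 0.02)
tau_draws = [random.gammavariate(50.0, 0.02) for _ in range(10000)]

# Derived quantity: transform every draw, giving a full posterior for sigma
sigma_draws = [1.0 / math.sqrt(t) for t in tau_draws]
sigma_mean = statistics.fmean(sigma_draws)

# Plugging the posterior mean of tau into the transformation is not the same thing
plug_in = 1.0 / math.sqrt(statistics.fmean(tau_draws))
```

Here sigma_mean exceeds plug_in because 1/sqrt(tau) is a convex function of tau (Jensen's inequality).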

(13)

Bayesian p-values

• Let T(y, θ) be a test statistic (e.g., mean, standard deviation, CV, quantile, sum of squared discrepancies, etc.) associated with the observed data.

• Likewise, let T(yrep, θ) be the corresponding test statistic associated with the replicated data.

• We can calculate the “tail” probability p:

$$p = P\left(T(y^{rep}, \theta) \ge T(y, \theta) \mid y\right) \quad \text{or} \quad p = P\left(T(y^{rep}, \theta) \le T(y, \theta) \mid y\right)$$

• If p is very large (e.g., p >> 0.5 or close to 1) or very small (i.e., p << 0.5 or close to 0), then the difference between the observed and simulated data cannot be attributed to chance, indicating potential lack of fit.
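The tail probability can be estimated as the fraction of MCMC iterations in which the replicated statistic is at least as large as the observed one. The helper below is a hypothetical illustration (the function name and the toy numbers are not from the lecture).

```python
def bayesian_p_value(t_rep, t_obs):
    """Fraction of iterations with T(y_rep, theta) >= T(y, theta)."""
    if len(t_rep) != len(t_obs):
        raise ValueError("need one replicated and one observed statistic per iteration")
    hits = sum(1 for rep, obs in zip(t_rep, t_obs) if rep >= obs)
    return hits / len(t_rep)

# Toy values, one pair per MCMC iteration (made up for illustration)
t_rep = [1.2, 0.8, 1.5, 0.9, 1.1]
t_obs = [1.0, 1.0, 1.0, 1.0, 1.0]
p = bayesian_p_value(t_rep, t_obs)   # 3 of 5 iterations satisfy rep >= obs
```

When the statistic depends only on the data (e.g., the mean of y), t_obs is the same value at every iteration; when it depends on θ (e.g., a sum of squared discrepancies), it changes from draw to draw.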

(14)

R.A. Fisher’s ticks

A simple example: We want to know the average number of ticks on sheep. We round up 60 sheep and count ticks on each one. Does a Poisson distribution fit the data?

$$P(\lambda \mid y) \propto \prod_{i=1}^{60} P(y_i \mid \lambda)\, P(\lambda)$$

For each value in the MCMC chain, we generate a new data set, y^sim, by sampling from:

$$P(y^{sim} \mid \lambda)\, P(\lambda \mid y)$$

(15)

[Figure: DAG for the simple model. Data: y_i; parameter: λ. A single mean governs the pattern.]

(16)

model{
  # prior
  lambda ~ dgamma(0.001,0.001)
  for(i in 1:60){
    y[i] ~ dpois(lambda)
    y.sim[i] ~ dpois(lambda) # simulate a new data set of 60 points
  }
  cv.y <- sd(y[])/mean(y[])
  cv.y.sim <- sd(y.sim[])/mean(y.sim[])
  mean.y <- mean(y[])
  mean.y.sim <- mean(y.sim[])
  # Key part: find Bayesian P value, the mean of many 0's and 1's returned by
  # the step function, one for each step in the chain.
  # step(x) = 1 if x >= 0, and 0 otherwise.
  pvalue.cv <- step(cv.y.sim - cv.y)
  pvalue.mean <- step(mean.y.sim - mean.y)
  # Sums of squares
  for(j in 1:60){
    sq[j] <- (y[j]-lambda)^2
    sq.new[j] <- (y.sim[j]-lambda)^2
  }
  fit <- sum(sq[])
  fit.new <- sum(sq.new[])
  pvalue.fit <- step(fit.new - fit)
} # end of model
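The same check can be sketched outside JAGS. The Python example below is not the lecture's implementation: instead of MCMC it uses the conjugate gamma posterior for lambda (a swapped-in shortcut that is valid for a Poisson likelihood with a gamma prior), and the tick counts are synthetic, deliberately over-dispersed values invented for illustration.

```python
import math
import random
import statistics

random.seed(4)

def rpois(lam):
    """Draw one Poisson(lam) count (Knuth's method; adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, random.random()
    while product > threshold:
        count += 1
        product *= random.random()
    return count

def cv(values):
    """Coefficient of variation: sd / mean."""
    return statistics.stdev(values) / statistics.fmean(values)

# Synthetic, deliberately over-dispersed tick counts for 60 sheep (not real data)
y = [0] * 20 + [1] * 10 + [2] * 8 + [3] * 6 + [5] * 6 + [8] * 5 + [12] * 5
n = len(y)
a, b = 0.001, 0.001   # gamma prior on lambda, as in the model above
n_iter = 2000

cv_obs, mean_obs = cv(y), statistics.fmean(y)
hits_cv = hits_mean = 0
for _ in range(n_iter):
    # Conjugate posterior: lambda | y ~ Gamma(shape = a + sum(y), rate = b + n)
    lam = random.gammavariate(a + sum(y), 1.0 / (b + n))
    y_sim = [rpois(lam) for _ in range(n)]
    # Analogue of step(): count 1 when the simulated statistic >= the observed one
    hits_cv += cv(y_sim) >= cv_obs
    hits_mean += statistics.fmean(y_sim) >= mean_obs

p_cv = hits_cv / n_iter
p_mean = hits_mean / n_iter
```

With these synthetic counts the single-mean Poisson model reproduces the mean (p_mean near 0.5) but not the variability (p_cv near 0), which is the pattern of lack of fit the check is designed to expose.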

(17)

Simple Model

[Figure: histograms of the real data and the simulated data; x-axis: number of ticks, y-axis: density.]

(18)

Posterior predictive check

Simple Model

p-value for CV = 0.0015

p-value for mean = 0.51

Remember, this is a two-tailed probability, so values close to 0 and 1 indicate lack of fit.

(19)

[Figure: simulated data plotted against real data.]

(20)

[Figure: DAG for the hierarchical model. Levels: data (y_i), parameter (λ_i), hyperparameter. Each sheep has its own mean (a.k.a. random effect).]

(21)

Hierarchical Model

model{
  # Priors
  a ~ dgamma(.001,.001)
  b ~ dgamma(.001,.001)
  for(i in 1:60){
    lambda[i] ~ dgamma(a,b)
    y[i] ~ dpois(lambda[i])
    y.sim[i] ~ dpois(lambda[i])
  }
  cv.y <- sd(y[])/mean(y[])
  cv.y.sim <- sd(y.sim[])/mean(y.sim[])
  # find Bayesian P value, the mean of many 0's and 1's returned by the
  # step function, one for each step in the chain
  pvalue.cv <- step(cv.y.sim - cv.y)
  mean.y <- mean(y[])
  mean.y.sim <- mean(y.sim[])
  pvalue.mean <- step(mean.y.sim - mean.y)
  for(j in 1:60){
    sq[j] <- (y[j]-lambda[j])^2
    sq.new[j] <- (y.sim[j]-lambda[j])^2
  }
} # end of model

$$P(\lambda, a, b \mid y) \propto \prod_{i=1}^{60} P(y_i \mid \lambda_i)\, P(\lambda_i \mid a, b)\, P(a)\, P(b)$$

(22)

Hierarchical Model

(23)

Posterior predictive check

Hierarchical Model

p-value for CV = 0.46

p-value for mean = 0.51

Remember, this is a two-tailed probability, so values close to 0 and 1 indicate lack of fit.

(24)

[Figure: simulated data plotted against real data.]

(25)

Posterior predictive checks

• Gelman, A., and J. Hill. 2009. Data analysis using regression and multilevel / hierarchical models. Cambridge University Press, Cambridge, UK.

• Link, W. A., and R. J. Barker. 2010. Bayesian Inference with Ecological Applications. Academic Press.

• Kery, M. 2010. Introduction to WinBUGS for Ecologists: A Bayesian approach to regression, ANOVA, mixed models and related analyses. Academic Press.


(28)

“Model selection and model averaging are deep waters, mathematically, and no consensus has emerged in the substantial literature on a single approach. Indeed, our only criticism of the wide use of AIC weights in wildlife and ecological statistics is with their uncritical acceptance and the view that this challenging problem has been simply resolved”

Link, W. A., and R. J. Barker. 2006. Model weights and the foundations of multi-model inference. Ecology 87:2626

(29)

The problem of model selection

• Up until now, we have been concerned with the uncertainties associated with a given model.

• What about the uncertainty that arises from our choice of models?

• How do we decide which model is best?

• How do we make inferences based on multiple models?

(30)

Parsimony==Ockham’s razor

William of Ockham (1285-1349)

“Pluralitas non est ponenda sine necessitate”

“entities should not be multiplied unnecessarily”

“Parsimony: ... 2 : economy in the use of means to an end; especially : economy of explanation in conformity with Occam's razor”

(Merriam-Webster Online Dictionary)

(31)

Information theory and the principle of parsimony

(32)

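True model:

$$y = e^{(x - 0.3)^2} - 1 + \varepsilon$$

Generated ten datasets, with errors ε sampled from a normal distribution with mean = 0 and var = 0.01.

Fit five models to these ten datasets:

$$\begin{aligned}
y &= \beta_0 + \beta_1 x \\
y &= \beta_0 + \beta_1 x + \beta_2 x^2 \\
y &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \\
y &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 \\
y &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5
\end{aligned}$$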

(33)

What creates noise in models?

(34)

Illustration of tradeoff

(35)

(36)

(37)

(38)

The Kullback-Leibler distance

(39)

Interpretation of Kullback-Leibler Information (a.k.a. distance between 2 models)

• Given truth represented by f and a model g approximating truth, the K-L distance measures the information lost by using model g to approximate f.

(40)

Interpretation of Kullback-Leibler Information (a.k.a. distance between 2 models)

Measures the (asymmetric) distance between two models. Minimizing the information lost when using g(x) to approximate f(x) is the same as maximizing the likelihood.

[Figure: three histograms of counts. Truth: f(x) (gamma). Approximations to truth: g₁(x) (Weibull) and g₂(x) (lognormal).]
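For intuition, the K-L information for discrete distributions is I(f, g) = Σ f(x) ln(f(x)/g(x)). The sketch below evaluates it for a made-up "truth" f and two made-up approximations; all probabilities are hypothetical and chosen only to show that the closer approximation loses less information and that the measure is asymmetric.

```python
import math

def kl_information(f, g):
    """I(f, g) = sum over x of f(x) * ln(f(x) / g(x)) for discrete distributions."""
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g) if fx > 0)

f = [0.10, 0.20, 0.40, 0.20, 0.10]    # hypothetical "truth"
g1 = [0.15, 0.20, 0.30, 0.20, 0.15]   # hypothetical approximation close to f
g2 = [0.40, 0.30, 0.15, 0.10, 0.05]   # hypothetical approximation far from f

loss_g1 = kl_information(f, g1)       # information lost using g1 for f
loss_g2 = kl_information(f, g2)       # information lost using g2 for f
reverse_g1 = kl_information(g1, f)    # differs from loss_g1: not a true distance
```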

(41)

Heuristic interpretation of K-L

(42)

Model comparison

• Within the classical modeling framework, we trade off a measure of fit (typically deviance) against a measure of complexity (typically number of parameters).

(43)

Akaike’s Information Criterion

Akaike defined “an information criterion” that relates K-L distance and the maximized log-likelihood as follows:

$$AIC = -2 \ln\left(L(\hat{\theta} \mid y)\right) + 2K$$

This is an estimate of the expected, relative distance between the fitted model and the unknown true mechanism that generated the observed data.

K = number of estimable parameters

How do we know the truth?
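A one-line function makes the trade-off in the criterion concrete. The maximized log-likelihoods and parameter counts below are hypothetical values chosen for illustration.

```python
def aic(max_log_lik, n_params):
    """AIC = -2 * ln(L(theta_hat | y)) + 2 * K."""
    return -2.0 * max_log_lik + 2.0 * n_params

# Hypothetical fits: the complex model has a slightly higher likelihood
# but pays for two extra parameters.
aic_simple = aic(max_log_lik=-120.5, n_params=3)
aic_complex = aic(max_log_lik=-119.9, n_params=5)
```

In this made-up case the small gain in log-likelihood does not offset the penalty, so the simpler model has the lower AIC.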

(44)

Deviance

$$D(y, \theta) = -2 \log P(y \mid \theta)$$

• Deviance (deviance) is a built-in node in JAGS; thus, you can:

• monitor deviance,

• look at its history plots (helpful for evaluating overall model convergence and potential “problem” chains),

• compute posterior statistics.

(45)

• Use the difference in AIC to compare competing models.

$$\Delta_r = AIC_r - \min(AIC), \qquad \min(AIC) = \min(AIC_1, \ldots, AIC_n)$$

(46)

$$\Delta_r = AIC_r - \min(AIC)$$

As a rule of thumb, models having Δr ≤ 2 have sufficient support: they should receive consideration in making inferences. Models having Δr within about 3-7 have considerably less support, while models with Δr ≥ 10 have essentially no support.

But there is a better way….

(47)

• The likelihood of a model given the data decreases exponentially with increasing Δr. Note that the likelihood of the best model = 1 and all other likelihoods are relative to the likelihood of the best model.

$$L(g_r \mid y) \propto e^{-\frac{1}{2}\Delta_r}$$

(48)

Likelihood ratio from AIC

$$\frac{e^{-\frac{1}{2}\Delta_1}}{e^{-\frac{1}{2}\Delta_2}} = \text{relative strength of evidence in data for model 1 over 2}$$

(49)

Akaike Weights

$$w_r = \frac{e^{-\frac{1}{2}\Delta_r}}{\sum_{i=1}^{R} e^{-\frac{1}{2}\Delta_i}} = \frac{\text{likelihood of model } r \mid \text{data}}{\text{total likelihood of all models} \mid \text{data}}$$

w_r are Akaike weights, the likelihood of one of the candidate models divided by the sum of the likelihoods of all of the candidates. The w_r for the best model does not equal 1. The w_r sum to 1.

The w_r can be thought of as “probabilities.” This is a frequentist interpretation derived from simulation. They are not “true” probabilities. (Link, W. A., and R. J. Barker. 2006)
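The chain from AIC values to differences, relative likelihoods, evidence ratios, and Akaike weights can be sketched as follows. The three AIC values are hypothetical, and the function name is an illustrative choice rather than anything from the lecture.

```python
import math

def akaike_weights(aic_values):
    """Return (deltas, relative likelihoods, weights) for a candidate set."""
    best = min(aic_values)
    deltas = [value - best for value in aic_values]
    rel_lik = [math.exp(-0.5 * d) for d in deltas]   # best model gets 1.0
    total = sum(rel_lik)
    weights = [r / total for r in rel_lik]
    return deltas, rel_lik, weights

aics = [247.0, 249.0, 257.0]   # hypothetical AIC values for three models
deltas, rel_lik, weights = akaike_weights(aics)

# Relative strength of evidence for model 1 over model 2
evidence_ratio = rel_lik[0] / rel_lik[1]
```

With a difference of 2 AIC units, the evidence ratio is e^1, roughly 2.7 to 1 in favor of the first model, while the model with a difference of 10 receives almost no weight.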

(50)

Interpretation of Akaike Weights

• w_i is the weight of evidence in favor of model i being the actual best K-L model, given that one of the R models must be the K-L best model of the candidate set.

• “probability” that model i is the actual best K-L model

• Last statement is quite controversial.

(51)

The raptors…moving towards model selection

0. Identify scientific problem/objectives; understand data; draw upon existing theory/knowledge.

Summary of problem and data: In most northern temperate regions, diurnal birds of prey (raptors) migrate seasonally between their breeding and wintering grounds. Most raptors are obligate or facultative soaring migrants that congregate along major thermal and orographic updraft corridors.

We might wish to analyze the raptor survey data to understand how temperature and wind speed affect the chance of observing birds of each species.

Data: Autumn migration counts of multiple species of raptors in NE U.S., conducted during 2010.

95 days, 15 species

y_d,s = number of birds observed on day d for species s

x_d = total time of observation period (minutes)

T_d = average air temperature (C) during observation period on day d

WS_d = average wind speed (km/hr) during observation period on day d

Therrien et al. 2012. Ecology.

(52)

Visualize model via DAG

[Figure: DAG. Nodes include y, x, T, WS, and several σ terms, arranged by level: day; day, species; species; month; population. Legend: data (stochastic), latent process, data parameters, process parameters.]

(53)

Specify model

Likelihood:

$$y_{d,s} \sim \text{Poisson}(\lambda_{d,s}\, x_d)$$

Log link function (for log-linear Poisson regression):

$$\eta_{d,s} = \log(\lambda_{d,s}), \qquad \lambda_{d,s} = \exp(\eta_{d,s})$$

Stochastic model for linear predictor (account for over-dispersion):

$$\eta_{d,s} \sim \text{Normal}\left(\theta_{1,s} + \theta_{2,s} T_d + \theta_{3,s} WS_d + \theta_{4,s} T_d\, WS_d + \varepsilon_{m(d)},\ \sigma^2\right)$$

Hierarchical priors for species-level effects parameters:

$$\theta_{k,s} \sim N(\hat{\theta}_k, \sigma^2_{\theta_k}), \quad k = 1, 2, 3, 4 \text{ parameters}$$

Zero-centered hierarchical prior for month random effect:

$$\varepsilon_m \sim N(0, \sigma^2_{\varepsilon})$$

Conjugate, relatively non-informative priors for root nodes:

$$\hat{\theta}_k \sim N(0, 10000), \qquad \tau, \tau_{\theta_k}, \tau_{\varepsilon} \sim \text{gamma}(0.01, 0.01)$$

where σ² = 1/τ for each σ² term

(54)

Is over-dispersion needed?

Log link function (for log-linear Poisson regression):

$$\eta_{d,s} = \log(\lambda_{d,s}), \qquad \lambda_{d,s} = \exp(\eta_{d,s})$$

Linear predictor (without over-dispersion):

$$\eta_{d,s} = \theta_{1,s} + \theta_{2,s} T_d + \theta_{3,s} WS_d + \theta_{4,s} T_d\, WS_d + \varepsilon_{m(d)}$$

Hierarchical priors for species-level effects parameters:

$$\theta_{k,s} \sim N(\hat{\theta}_k, \sigma^2_{\theta_k}), \quad k = 1, 2, 3, 4 \text{ parameters}$$

Zero-centered hierarchical prior for month random effect:

$$\varepsilon_m \sim N(0, \sigma^2_{\varepsilon})$$

Conjugate, relatively non-informative priors for root nodes:

$$\hat{\theta}_k \sim N(0, 10000), \qquad \tau_{\theta_k}, \tau_{\varepsilon} \sim \text{gamma}(0.01, 0.01)$$

where σ² = 1/τ for each σ² term

(55)

Implement (code) models: Model 1

BUGS code shown here.

(56)

(57)


Implement models: Model 2 (no overdispersion)

(58)

Evaluate results/make inferences

Model 1: includes over-dispersion; example (temperature effects at species level):

[Figure: Effect of temperature on observation rate. Posterior summaries of theta.star[2,] (axis from -0.5 to 0.5) by species ID, [2,1] through [2,15], plus the population-level parameter, [2,16].]

            mean    sd        val2.5pc   median   val97.5pc
sig (σ)     1.68    0.07601   1.538      1.678    1.84

$$\eta_{d,s} \sim \text{Normal}\left(\theta_{1,s} + \theta_{2,s} T_d + \theta_{3,s} WS_d + \theta_{4,s} T_d\, WS_d + \varepsilon_{m(d)},\ \sigma^2\right)$$

Do the posterior stats for the over-dispersion standard deviation term indicate the presence of “significant” over-dispersion?

(59)

Replicated data (Model 1)

[Figure: predicted # of birds (posterior mean & 95% CI for yrep) vs. observed # of birds (y), one panel per species.]

Species 8 (northern goshawk): R² = 0.965, coverage = 100%

Species 10 (broad-winged hawk): R² = 1.0, coverage = 100%

Things to look for:

○Bias/accuracy: Do the points fall around the 1:1 line, or is there some prediction bias?

○Coverage: Do most of the observed values fall inside the 95% CIs for the replicated values (the Yrep’s)?

○What percentage do you expect to fall outside of the Yrep CIs?

○Is the variability in the observed data consistent with the variability in the replicated values?

○Can also overlay plots of the observed Y values and the predicted Y values (i.e., posterior means for Yrep and 95% CI) as functions of a covariate.

○Why do we get a “perfect” fit and 100% coverage when including over-dispersion here?

(60)

Replicated data (Model 2)

[Figure: predicted # of birds (posterior mean & 95% CI for yrep) vs. observed # of birds (y), one panel per species.]

Species 8 (northern goshawk): R² = 0.082, coverage = 97.9%

Species 10 (broad-winged hawk): R² = 0.259, coverage = 52.6%

Things to look for:

○Bias/accuracy: Do the points fall around the 1:1 line, or is there some prediction bias?

○Coverage: Do most of the observed values fall inside the 95% CIs for the replicated values (the Yrep’s)?

○What percentage do you expect to fall outside of the Yrep CIs?

○Is the variability in the observed data consistent with the variability in the replicated values?

○Can also overlay plots of the observed Y values and the predicted Y values (i.e., posterior means for Yrep and 95% CI) as functions of a covariate.

○Why is the fit so much worse when we don’t include an over-dispersion term?

(61)

Point estimates of deviance

• Can get a point estimate of deviance by plugging in a point estimate of the parameters (e.g., the θ̂’s):

$$D_{\hat{\theta}}(y) = D(y, \hat{\theta}), \qquad \hat{\theta} = \text{point estimate (usually posterior mean)}$$

• But, this doesn’t account for uncertainty in the parameters.

Posterior mean of deviance

• Compute an “expected” deviance, which may be used as an overall measure of model fit.

• Compute “expected” deviance by “averaging” over the posterior distribution of the parameters:

$$D_{ave}(y) = E(D(y, \theta) \mid y) = \int D(y, \theta)\, P(\theta \mid y)\, d\theta$$

• If we have L draws from the posterior, then an estimate of D_ave(y) is:

$$\hat{D}_{ave}(y) = \frac{1}{L} \sum_{l=1}^{L} D(y, \theta^{l})$$

(62)

Model complexity

• To account for model complexity, compute the effective number of parameters, p_D. To do this we compute deviance in two ways:

• The posterior mean of the deviance.

• The deviance evaluated at the posterior mean values of model parameters.

$$p_D = \hat{D}_{ave}(y) - D_{\hat{\theta}}(y)$$

• Why should the first component be larger than the second component?

• In some situations, the above solution for p_D can lead to p_D < 0 (i.e., a negative # of effective parameters), which renders DIC and p_D useless.

(63)

Computing DIC

• Thus, DIC is given by:

$$DIC = \hat{D}_{ave}(y) + p_D$$

model fit (lower better) + model penalty (lower better)

• The model fit term is the posterior mean of the deviance; the penalty is the effective number of parameters. Equivalently, DIC = D_θ̂(y) + 2 p_D, i.e., the deviance of the model evaluated at the means of the posterior distribution of parameters plus twice the effective number of parameters.

• Lower DIC -> “better” model, but what is an “important difference” in DIC?

• Interpretation of DIC values same as AIC

• DIC difference of 1-2 from the “best” model: model deserves consideration

• DIC differences of 3-7: “considerably” less support than the best model

• Differences can be affected by Monte Carlo error

• I look for differences > 10
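To see how the pieces fit together, the sketch below computes the posterior mean deviance, the deviance at the posterior mean, p_D, and DIC for a deliberately simple case. Everything here is an assumption for illustration: the data are made up, the model is a normal mean with known sigma = 1, and the posterior draws are sampled directly from the known posterior under a flat prior rather than produced by MCMC.

```python
import math
import random
import statistics

random.seed(5)

y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4]   # hypothetical observations
n = len(y)
sigma = 1.0                                     # assumed known for this sketch

def deviance(data, mu, sd):
    """D(y, theta) = -2 log P(y | theta) for independent Normal(mu, sd) data."""
    log_lik = sum(-0.5 * math.log(2 * math.pi * sd ** 2)
                  - (yi - mu) ** 2 / (2 * sd ** 2) for yi in data)
    return -2.0 * log_lik

# Posterior of mu under a flat prior is Normal(ybar, sigma / sqrt(n));
# these draws stand in for MCMC output.
ybar = statistics.fmean(y)
mu_draws = [random.gauss(ybar, sigma / math.sqrt(n)) for _ in range(5000)]

d_ave = statistics.fmean([deviance(y, mu, sigma) for mu in mu_draws])  # posterior mean deviance
d_hat = deviance(y, statistics.fmean(mu_draws), sigma)                 # deviance at posterior mean
p_d = d_ave - d_hat                                                    # effective number of parameters
dic = d_ave + p_d
```

With one free parameter and an uninformative prior, p_d comes out close to 1, matching the count a likelihood analysis would use.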

(64)

In R

# JAGS model
model{
  for (i in 1:n){
    mu[i] <- (alpha*x[i])/((alpha/gama)+x[i])
    y[i] ~ dnorm(mu[i],tau)
  }
  tau ~ dgamma(0.001,.001)
  alpha ~ dgamma(0.001,.001)
  gama ~ dgamma(.001,.001)
  sigma <- 1/sqrt(tau)  # defined so that "sigma" can be monitored below
} # end of model

# In R
library(rjags)
load.module("dic")  # makes the "deviance" node available for monitoring
jm = jags.model("Bugs_light_example.R", data=data, inits=mod.inits, n.chains=3, n.adapt=n.adapt)
update(jm, n.iter=5000)
# generate coda object for parameters and deviance.
zm = coda.samples(jm, variable.names=c("alpha","gama","sigma","deviance"), n.iter=5000)
dic.samples(jm, n.iter=5000, type="pD")
# Mean deviance: 529.8
# penalty 2.97 = pd
# Penalized deviance: 532.8 = dic

# Another way:
summary(zm[,"deviance"])$statistics
# pd* = (1/2)*Var(deviance)
pd = 0.5*summary(zm[,"deviance"])$statistics["SD"]^2

(65)

Why DIC?

• DIC, AIC, and BIC all have the same general form:

xIC = model fit + model penalty

• So, why focus on DIC? Why not use AIC or BIC?

• BIC and AIC require us to count the number of parameters in a model, but an informative prior or hierarchical priors make it impossible to count the number of “effective” parameters.

• Thus, Spiegelhalter et al. (2002) developed DIC, which computes an “effective” number of parameters that (should) capture the effect of “shrinkage” or “borrowing of strength” due to informative priors or hierarchical priors.

(66)

Some intuition for DIC

• The problem is parameters that are “free” to be influenced by noise in the data. How free are they?

• If a prior on a parameter is very informative, the parameter is not free to respond to the data, and it does not contribute to the effective number of parameters.

• If a prior is uninformative, the opposite is the case. The parameter is free to respond and contributes to the effective number of parameters in the same way as in a likelihood analysis.

• If a parameter is part of a hierarchy, should it count the same way as a parameter that is part of a simpler model?

(67)

Posterior predictive loss

• Posterior predictive loss (i.e., D_k) is fairly widely used within statistics, but is not frequently used in ecology

• D_k provides an index of a model’s predictive ability by comparing observed data to replicated data

• The “best” model(s) is the one that performs the best under a “balanced loss function.”

• Similar to the DIC, the loss function penalizes for both departure from the observed data (measure of model “fit”) and departure from what we expect the replicated data to be (measure of “smoothness” – somewhat analogous to DIC’s effective number of parameters).

• The loss function puts a weight, which depends on k > 0, on the model fit component (G) and a weight of 1 (one) on the smoothness or penalty component (P); the value of k is determined by the user:

$$D_k = \frac{k}{k+1}\, G + P$$

(68)

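Posterior predictive loss (D∞)

• We often assume “k = ∞,” in which case D_k (call this D∞) is given by:

$$D_\infty = G + P$$

• Under squared-error loss, G and P are given by:

$$G = \sum_{i=1}^{N} \left(\mu_{rep_i} - y_i\right)^2, \quad \text{where } \mu_{rep_i} = E(y_{rep_i} \mid y)$$

$$P = \sum_{i=1}^{N} \sigma^2_{rep_i}, \quad \text{where } \sigma^2_{rep_i} = Var(y_{rep_i} \mid y)$$

• Thus, D∞ is equivalent to:

$$D_\infty = \sum_{i=1}^{N} \left(\mu_{rep_i} - y_i\right)^2 + \sum_{i=1}^{N} \sigma^2_{rep_i}$$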

(69)

Computing D∞ in practice

• In your code, simply monitor the squared deviation, (yrep_i - y_i)², for each observation i, and outside of the i loop, compute the sum of the squared deviations:

Dsum.species[s] <- sum(sqdiff[,s]) # in the species loop

Dsum is NOT D∞; D∞ is computed after you’ve run the model (after convergence) and it is approximated as the posterior mean (expected value) of Dsum.
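The two routes to D∞ (G + P from the posterior means and variances of yrep, versus the posterior mean of the monitored sum of squared deviations) can be checked against each other. The sketch below uses hypothetical observations and hypothetical replicated draws generated from normal distributions as a stand-in for MCMC output.

```python
import random
import statistics

random.seed(6)

y = [2.0, 3.5, 5.1, 6.2]   # hypothetical observations

# Hypothetical replicated data: one row per MCMC iteration, one column per observation
y_rep = [[random.gauss(mu, 0.5) for mu in (2.2, 3.4, 5.0, 6.5)]
         for _ in range(4000)]

# Route 1: G + P from the posterior mean and variance of each yrep_i
columns = list(zip(*y_rep))
mu_rep = [statistics.fmean(col) for col in columns]
var_rep = [statistics.pvariance(col) for col in columns]
G = sum((m - yi) ** 2 for m, yi in zip(mu_rep, y))
P = sum(var_rep)
d_inf = G + P

# Route 2: posterior mean of Dsum, the per-iteration sum of squared deviations
d_sum = [sum((rep - yi) ** 2 for rep, yi in zip(row, y)) for row in y_rep]
d_inf_from_dsum = statistics.fmean(d_sum)
```

The two values agree exactly when P uses the population variance (dividing by the number of draws); with the sample variance, as reported by most MCMC summaries, they differ only negligibly for long chains.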

(70)

Back to the tree and the light

[Figure: scatterplot of tree.data$Observed.growth.rate against tree.data$Light.]

Model 1: mu[i] <- (alpha*x[i])/((alpha/gama)+x[i])

Model 2: mu[i] <- alpha*x[i]

(71)

In R

# JAGS model
model{
  for (i in 1:n){
    mu[i] <- (alpha*x[i])/((alpha/gama)+x[i])
    y[i] ~ dnorm(mu[i],tau)
    yrep[i] ~ dnorm(mu[i],tau)
  }
  tau ~ dgamma(0.001,.001)
  alpha ~ dgamma(0.001,.001)
  gama ~ dgamma(.001,.001)
  sigma <- 1/sqrt(tau)  # defined so that "sigma" can be monitored below
} # end of model

# In R
zm = coda.samples(jm, variable.names=c("alpha","gama","sigma","deviance","yrep"), n.iter=5000)
y <- tree.data$Observed.growth.rate  # assign response variable
ntree = nrow(tree.data)
yrep.stats = summary(zm[,paste("yrep[",1:ntree,"]",sep="")])$statistics
G <- sum((yrep.stats[,1]-y)^2)  # sq diff (yrep.mean - y)
P <- sum((yrep.stats[,2])^2)    # var for each yrep
Dinf <- G + P

(72)

Interpreting D∞ values

• For a “poor” model, we expect large predictive variance (large P) and poor fit (large G).

• Better models have a lower D_k (or lower D∞), associated with a smaller P and/or smaller G.

• But, as we start to “overfit” (e.g., model with lots of parameters), G will continue to decrease (better fit), but P will start to increase (i.e., variance will be inflated due to multi-collinearity between parameters).

• The model with the smallest D_k (or D∞) is preferred.

• But, how small is “small”?

(73)

Linear vs non-linear

            pd     Mean(dev)   P         G         Dinf       DIC     R²
Nonlinear   2.94   529.8       4551.13   4236.13   8787.27    532.7   0.91
Linear      2.05   574.2       8019.67   7612.01   15631.6    576.3   0.82

(74)

Consider multiple comparison indices

Indices computed across all species.

Some conclusions:

• Posterior predictive checks suggest we should “pick” Model 1 (with over-dispersion) over Model 2 (Model 1 has much lower DIC, D∞, G).

• But, the results suggest that Model 1 is over-fitting the data: it has a very high P, the coverage of the replicated data is too high (should be ~95%), and the uncertainty in the replicated data (e.g., width of 95% CIs) is similar to Model 2.

• Perhaps we should evaluate a third model that lies somewhere between the complexity of Model 1 and Model 2? E.g., explore incorporation of day (nested in month) random effects without observation-level over-dispersion?

(75)

• In the index of Gelman et al. 1995: “Model selection, why we do not do it”

• In the index of Gelman et al. 2004: “Model selection, why we avoid it.”

– Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. 2004. Bayesian data analysis. Chapman and Hall / CRC, London.

(76)

When should we avoid model selection?

• When we have a firm, theoretical / mechanistic basis for a particular model formulation.

• When our objectives for insight determine the form of the model.

• When we want to make forecasts and must include known influences on future behavior of the system.

(77)

Concluding remarks

• Use multiple approaches for comparing / selecting between models

• Selection criteria may depend on scientific objectives:

• Use model to learn (heuristic tool)

• Use model to predict under novel conditions

• Other topics not covered

• Model averaging & Bayes factors

• E.g., can use posterior model weights to average derived or predicted quantities obtained from each model

• Can use BIC to approximate the Bayes factors (BF)

• See Link & Barker (2006) or Kass (1993)

• Evaluation of model assumptions

• Appropriate choice of distributions?

• Appropriate model structure

• E.g.: linear vs non-linear; choice of covariates; random vs fixed effects; hierarchical vs non-hierarchical priors, etc.

(78)

References

• Posterior predictive loss: Gelfand & Ghosh (1998) Model choice: A minimum posterior predictive loss approach. Biometrika 85:1-11.

• Bayes factors & BIC: Link & Barker (2006) Model weights and the foundations of multimodel inference. Ecology 87:2626-2635.

• Model checking & improvement: Gelman, Carlin, Stern, Rubin (2004) Bayesian Data Analysis. Chapman & Hall/CRC. (Chapter 6)

• Bayes factors: Kass (1993) Bayes factors in practice. The Statistician 42:551-560.

• Elements of hierarchical Bayes, including Bayes factors, DIC, D∞: Carlin et al. (2006) Elements of hierarchical Bayesian inference, in Hierarchical Modelling for the Environmental Sciences: Statistical Methods and Applications, J.S. Clark & A.E. Gelfand (eds.) Oxford Univ Press.

(79)

Quantiles vs. HPD intervals

[Figure: posterior density comparing the HPDI with the equal-tailed (quantile) interval.]

(80)

Quantiles vs. HPD intervals

• HPDI: The longest horizontal line that can be placed within the distribution such that the area between the vertical dashed lines and beneath the distribution curve = 1 - alpha.

• Equal-tailed intervals: the alpha/2 and 1 - alpha/2 quantiles of the distribution.
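Both intervals can be computed directly from posterior draws. The sketch below is an illustration under assumptions: the draws are hypothetical (a right-skewed gamma sample standing in for MCMC output), and the HPD interval is found as the shortest window of sorted draws, which is appropriate only for unimodal posteriors.

```python
import random

random.seed(7)

# Hypothetical right-skewed posterior sample (stand-in for MCMC output)
draws = sorted(random.gammavariate(2.0, 1.0) for _ in range(20000))

def equal_tailed(sorted_draws, alpha=0.05):
    """Interval between the alpha/2 and 1 - alpha/2 sample quantiles."""
    tail = round(len(sorted_draws) * alpha / 2)
    return sorted_draws[tail], sorted_draws[len(sorted_draws) - tail - 1]

def hpd(sorted_draws, alpha=0.05):
    """Shortest window holding the same number of draws as the equal-tailed interval."""
    n = len(sorted_draws)
    inside = n - 2 * round(n * alpha / 2)
    start = min(range(n - inside + 1),
                key=lambda i: sorted_draws[i + inside - 1] - sorted_draws[i])
    return sorted_draws[start], sorted_draws[start + inside - 1]

et_lo, et_hi = equal_tailed(draws)
hpd_lo, hpd_hi = hpd(draws)
```

For a skewed posterior like this one, the HPD interval is narrower and shifted toward the mode; for a symmetric posterior the two intervals nearly coincide.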

(81)

Likelihood Ratio Test

• For nested models A and B, twice the difference in log-likelihoods (R) approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between models A and B.

$$R = 2\left[\ln L(Y \mid M_A) - \ln L(Y \mid M_B)\right]$$
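A minimal sketch of the test for the one-degree-of-freedom case, where the chi-square tail probability has a closed form in terms of the complementary error function. The log-likelihood values are hypothetical, and the function name is an illustrative choice; other degrees of freedom would need a chi-square routine from a statistics library.

```python
import math

def likelihood_ratio_test_df1(loglik_a, loglik_b):
    """R = 2 * [lnL(Y|M_A) - lnL(Y|M_B)] and its chi-square p-value for df = 1."""
    r = 2.0 * (loglik_a - loglik_b)
    if r < 0:
        raise ValueError("model A should be the better-fitting (more complex) model")
    # For df = 1, P(chi-square > r) = erfc(sqrt(r / 2))
    p_value = math.erfc(math.sqrt(r / 2.0))
    return r, p_value

# Hypothetical maximized log-likelihoods: model A has one more parameter than B
r, p_value = likelihood_ratio_test_df1(loglik_a=-100.0, loglik_b=-101.92)
```

A log-likelihood difference of 1.92 gives R = 3.84, the familiar chi-square cutoff for a 5% test with one degree of freedom.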

(82)

The Likelihood Ratio Test

[Figure: likelihood profile over θ, showing the difference in log-likelihood and the corresponding chi-square probability; reference line at chi-sq = 3.84.]
