Bayes made simple

(2)

Significance is…

P(obtaining a test statistic more extreme than the one we observed | H₀ is true)

This is not a test of the strength of evidence supporting a hypothesis (or model). It is simply a statement about the probability of obtaining extreme values that we have not observed.

(3)

A frequentist confidence interval

In frequentist statistics, a 95% CI represents an interval such that if the experiment were repeated many times (say, 100), about 95% of the resulting CIs (e.g., average ± 1.96 SE) would contain the true parameter value.
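The repeated-experiment reading of a confidence interval can be checked by simulation. The sketch below is only an illustration: the true mean, standard deviation, sample size, and number of repetitions are all assumed values, not taken from the slides.

```r
# Simulated coverage of a 95% confidence interval.
# All numbers below are illustrative assumptions.
set.seed(1)
true.mean <- 10
true.sd <- 2
n <- 30          # observations per simulated experiment
n.rep <- 10000   # number of repeated experiments

covered <- replicate(n.rep, {
  y <- rnorm(n, true.mean, true.sd)
  se <- sd(y) / sqrt(n)
  lower <- mean(y) - 1.96 * se
  upper <- mean(y) + 1.96 * se
  lower <= true.mean && true.mean <= upper   # does this CI contain the truth?
})

coverage <- mean(covered)   # proportion of intervals containing the true mean
print(coverage)
```

The proportion should land close to 0.95. Note that this is a statement about the procedure over many repetitions, not about the probability that the parameter lies in any one interval.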

(4)

A new approach to insight

Pose a question and think about what is needed to answer it.

Ask:

•How do the data arise?

•What is the hypothesized process that produces them?

•What are the sources of randomness/uncertainty in the process and the way we observe it?

•How can we model the process and its associated uncertainty in a way that allows the data to speak informatively?

This approach is based on a firm intuitive understanding of the relationship between process models and probability models.

(5)

Why Bayes?

Light Limitation of Trees

[Figure: growth rate plotted against light availability]

γ = maximum growth rate at high light

c = minimum light requirement

α = slope of the curve at low light

$$\mu_i = \frac{\gamma\,(L_i - c)}{\frac{\gamma}{\alpha} + (L_i - c)}$$

$$y_i \mid \mu_i, \sigma_{proc} \sim \text{Normal}(\mu_i, \sigma_{proc})$$
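To make the light-limitation model concrete, here is a minimal sketch of the saturating growth curve with normally distributed scatter around it. The functional form is the usual one implied by the three parameter definitions (zero growth at the minimum light requirement, slope α at low light, asymptote γ at high light). The parameter values, the name `c.min` (used instead of `c` to avoid clashing with R's `c()`), and the simulated data are assumptions for illustration only.

```r
# Process model: mean growth rate as a function of light availability L.
# gamma = maximum growth rate at high light, alpha = slope at low light,
# c.min = minimum light requirement.
growth <- function(L, alpha, gamma, c.min) {
  gamma * (L - c.min) / (gamma / alpha + (L - c.min))
}

# Hypothetical parameter values, chosen only for illustration
alpha <- 5
gamma <- 1
c.min <- 0.05
sigma <- 0.05

set.seed(1)
L <- seq(0.05, 1, length.out = 50)             # light availability
mu <- growth(L, alpha, gamma, c.min)           # deterministic process model
y <- rnorm(length(L), mean = mu, sd = sigma)   # data scattered around mu

plot(L, y, xlab = "Light availability", ylab = "Growth rate")
lines(L, mu, lwd = 2)
```

The line is the process model and the scatter of points around it is the probability model. Keeping the two separate is the point of the slide.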







 

(6)

Where do uncertainties arise?

•Variation due to processes we failed to model.

•Error in our observations of the process, or of covariates or predictor variables?

•What about genetic variation among individuals? Geographic variation among sites?

•What does the current science tell us about the process we are modeling?

•How can we exploit what is already known about the processes we are modeling?

(7)

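[Diagram: the process model g(θ, L_i) gives μ_i; the parameter model supplies σ_proc, α, γ, c]

$$\mu_i = \frac{\gamma\,(L_i - c)}{\frac{\gamma}{\alpha} + (L_i - c)}$$

$$y_i \mid \mu_i, \sigma_{proc} \sim \text{Normal}(\mu_i, \sigma_{proc})$$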







 

(8)

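[Diagram: hierarchical model. The parameter model supplies σ_proc, α, γ, c to the process model g(θ, L_i), which gives μ_i. The data model links the response y_i and the predictor x_i to the data on the response (y) and on the predictor (x), each observed with uncertainty σ_obs.]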

(9)

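[Diagram: the same hierarchical model (parameter model, process model g(θ, L_i) giving μ_i, data model for y_i and x_i, data on the response y and the predictor x with observation uncertainty σ_obs), now with hyperparameters added above the parameter model.]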



(10)

Today

•Derivation of Bayes Law

•Understanding each piece

   •P(y|θ)

   •P(θ)

   •P(y|θ) P(θ)

   •P(y)

•Putting the pieces together

•The relationship between likelihood and Bayes

•Priors and conjugacy (…probably into Thursday)

(11)

Concept of Probability

[Diagram: event A drawn as a region inside the sample space S]

$$P(A) = \text{probability that event } A \text{ occurs} = \frac{\text{area of } A}{\text{area of } S}$$

(12)

Concept of Probability

[Diagram: events A and B drawn as regions inside the sample space S]

$$P(A) = \text{probability that event } A \text{ occurs} = \frac{\text{area of } A}{\text{area of } S}$$

(13)

Conditional Probabilities

[Diagram: event A and its overlap with event B]

Probability of B given that we know A occurred:

$$P(B \mid A) = \frac{\text{area of } B \text{ and } A}{\text{area of } A} = \frac{P(B \cap A)}{P(A)} = \frac{P(B, A)}{P(A)}$$

(14)

Conditional Probabilities

Event A

What is P(A|B), the probability that A occurred given that B occurred?

(15)

Bayes Law: Get this now and forever

1) $$P(B \mid A) = \frac{P(B, A)}{P(A)}$$

2) $$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

Solving 1 for P(B,A):

$$P(B, A) = P(B \mid A)\,P(A)$$

Substituting into 2 gives Bayes Law:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

What is P(B|A)?
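The derivation can be checked with numbers. The sketch below uses made-up probabilities for two events A and B (they are assumptions, not values from the slides) and confirms that Bayes Law returns the same P(A|B) as the definition of conditional probability.

```r
# Hypothetical probabilities for two events A and B (illustrative only)
p.A.and.B <- 0.12   # joint probability P(A, B) = P(B, A)
p.A <- 0.30         # P(A)
p.B <- 0.40         # P(B)

# Conditional probabilities from their definitions
p.B.given.A <- p.A.and.B / p.A   # equation 1
p.A.given.B <- p.A.and.B / p.B   # equation 2

# Bayes Law: P(A|B) = P(B|A) P(A) / P(B)
bayes <- p.B.given.A * p.A / p.B

print(c(definition = p.A.given.B, bayes = bayes))
```

Both routes give the same answer because the joint probability is the same whichever way it is factored.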

(16)

We are interested in P(θ|y)

•We have some new data (y) in hand; the data represent the “event” that has occurred.

•What is the probability of the parameters given the data? By symmetry, Bayes law is:

$$P(\theta \mid y) = \frac{P(\theta, y)}{P(y)} = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

Here P(θ, y) is the joint, P(y) is the marginal, and the second equality follows from the product rule.

(17)

The Holy Grail

[Figure: the posterior density P(θ|y) plotted against θ]

The posterior distribution specifies P(θ|y) as a function of θ. It returns a probability of the parameter value in light of the data.

(18)

Bayes Law

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

•P(θ|y): what we seek, the probability that a parameter takes on a particular value in light of the new data = the posterior distribution.

•P(y|θ): What is this? Haven’t we seen this before?

•P(θ): the probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.

•P(y): the probability of the data, aka the marginal distribution. More on this coming up.

(19)

Components

•Understanding P(θ) = the prior

•Understanding P(y|θ)P(θ) = the joint distribution

•Understanding P(y) = the marginal distribution

(20)

What is P(θ) (aka the prior)?

[Figure, two panels, each plotted for θ from 30 to 55: an informative prior, dnorm(x, 40, 2), and an uninformative prior, dnorm(x, 0, 100)]
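The two priors can be redrawn with a few lines of R. This is a sketch using the same two `dnorm` calls named on the slide; the grid spacing and plot layout are assumptions.

```r
# Compare an informative and an uninformative normal prior over the same range
theta <- seq(30, 55, by = 0.1)

informative <- dnorm(theta, mean = 40, sd = 2)     # sharply peaked near 40
vague <- dnorm(theta, mean = 0, sd = 100)          # nearly flat over this range

par(mfrow = c(2, 1))
plot(theta, informative, type = "l", lwd = 2,
     xlab = expression(theta), ylab = "Density", main = "Informative prior")
plot(theta, vague, type = "l", lwd = 2,
     xlab = expression(theta), ylab = "Density", main = "Uninformative prior")
```

Over the plotted range the second density changes by only about ten percent, which is why it carries almost no information about θ.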

(21)

Where do priors come from?

• If we have a mean and a standard deviation from earlier studies of θ, then we have a prior on θ.

P(θ|y) in our current study becomes P(θ) in future studies

• If we don’t have prior information, the prior will be uninformative.

(22)

The joint

So what is P(y|θ)P(θ)? (aka the joint distribution)

(23)

Exercise

•You have 8 observations of the standing crop of carbon in a grassland from 0.25 sq. m. plots. Assume the data are normally distributed.

y=(16.5,15.7,16,15.3,14.9,15.7,14.7,15.6)

•A previous estimate of carbon standing crop was mean=20, sd=2.2.

•Calculate and plot the prior, the likelihood, and the joint distribution.

(24)

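Point estimates vs. distribution

[Figure, first panel: the likelihood P(y|θ) plotted against θ (mean). Area under curve ≠ 1.]

P(y = 4 infected | θ = 0.12)

L(θ|y) = dnorm(y, θ, sigma)

The data are constant, the parameter varies.

[Figure, second panel: the posterior P(θ|y) plotted against θ (mean). Area under curve = 1.]

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$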

(25)

# Data
y=c(16.5,15.7,16,15.3,14.9,15.7,14.7,15.6)
y.sd<-sd(y)

# prior mean and sd on theta
p.mean=20
p.sd=2.2
D=NULL
theta=seq(0,30,.1) # set up a vector of potential values for theta

# Likelihood x prior = joint
for (i in 1:length(theta)){ # note we do this for all values of theta
  # prior
  P=dnorm(theta[i],p.mean,p.sd)
  # likelihood
  L=prod(dnorm(y,theta[i],y.sd)) # note the product (not log-likelihood)
  # likelihood x prior
  LP=L*P
  D=rbind(D,c(theta[i],LP,L,P))
}

(26)

D=as.data.frame(D)
names(D)=c("theta","LP","L","P")

# Plot everything
par(mfrow=c(3,1))

# prior
plot(D$theta,D$P,type="l",lwd=2,xlab=expression(theta),
     ylab=expression(paste("P(",theta,")")),main="Prior",col="blue")

# likelihood
plot(D$theta,D$L,type="l",lwd=2,xlab=expression(theta),
     ylab=expression(paste("P(y|",theta,")")),main="Likelihood",col="blue")

# prior * likelihood = joint
plot(D$theta,D$LP,type="l",lwd=2,xlab=expression(theta),
     ylab=expression(paste("P(y|",theta,")P(",theta,")")),main="Joint",col="blue")

(27)

[Figure, three panels plotted against θ from 0 to 30: Prior, P(θ); Likelihood, P(y|θ); Joint, P(y|θ)P(θ)]

(28)

What is P(y)?

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

Because P(y) is a constant,

$$P(\theta \mid y) \propto P(y \mid \theta)\,P(\theta)$$

So, without knowing the denominator, we can evaluate the relative support for each value of θ, but not the probability. This is what maximum likelihood does. To get at the probability, we must “normalize” the relative support by dividing by P(y).
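The normalization step can be sketched for the grassland carbon exercise. The grid spacing and the use of a simple Riemann sum to approximate P(y) are assumptions made here for illustration; the data and prior are those given in the exercise.

```r
# Normalize the joint P(y|theta) P(theta) on a grid to get the posterior.
y <- c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd <- sd(y)

step <- 0.01
theta <- seq(0, 30, by = step)           # candidate values of the mean

prior <- dnorm(theta, mean = 20, sd = 2.2)
like <- sapply(theta, function(th) prod(dnorm(y, th, y.sd)))
joint <- like * prior                    # relative support for each theta

p.y <- sum(joint) * step                 # approximate area under the joint = P(y)
posterior <- joint / p.y                 # now a proper density over theta

print(sum(posterior) * step)             # area under the posterior
print(theta[which.max(posterior)])       # most probable value of theta
```

Dividing by P(y) rescales the curve without moving its peak, so the joint and the posterior rank values of θ identically. Only the posterior can be read as a probability density.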

(29)

So what is P(y)?

The θ are collectively exhaustive, mutually exclusive hypotheses.

Sample space: all possible outcomes of observation, experiment, etc.

[Diagram: the sample space partitioned into θ₁, θ₂, and θ₃]

(30)

So what is P(y)?

The θ are collectively exhaustive, mutually exclusive hypotheses.

Sample space: all possible outcomes of observation, experiment, etc. (the green blob).

Data: the observed outcome (the blue blob).

[Diagram: the green sample space partitioned into θ₁, θ₂, and θ₃, with the blue data blob inside it]

$$P(y) = P(\text{data}) = \frac{\text{area of blue blob}}{\text{area of green blob}}$$

(31)

So what is P(y)?

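[Diagram: the data blob overlaps each of θ₁, θ₂, and θ₃; the overlaps are P(θ₁ ∩ y), P(θ₂ ∩ y), and P(θ₃ ∩ y)]

The probability of y is:

$$P(y) = \sum_{i=1}^{3} P(y \mid \theta_i)\,P(\theta_i)$$

because, for each hypothesis (shown here for θ₃),

$$P(\theta_3 \cap y) = P(y, \theta_3) = P(y \mid \theta_3)\,P(\theta_3)$$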

(32)

Bayes law for discrete parameters

P(θ_i|y) reads: in light of the data, the probability that the parameter has the value θ_i. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.

$$P(\theta_i \mid y) = \frac{P(y \mid \theta_i)\,P(\theta_i)}{P(y)}$$

$$P(\theta_i \mid y) = \frac{P(y \mid \theta_i)\,P(\theta_i)}{\sum_{j=1}^{J} P(y \mid \theta_j)\,P(\theta_j)}$$
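A small discrete example may help. The sketch below is entirely hypothetical: three candidate infection prevalences, equal prior weights, and an assumed observation of 4 infected individuals out of 20, with a binomial likelihood. None of these numbers come from the slides.

```r
# Bayes law for a discrete parameter with J = 3 candidate values (hypothetical)
theta <- c(0.05, 0.20, 0.50)          # mutually exclusive, exhaustive hypotheses
prior <- c(1, 1, 1) / 3               # equal prior probability for each

# Assumed data: 4 infected out of 20 sampled
like <- dbinom(4, size = 20, prob = theta)   # P(y | theta_j)

joint <- like * prior                 # P(y | theta_j) P(theta_j)
p.y <- sum(joint)                     # the denominator: sum over all hypotheses
posterior <- joint / p.y              # P(theta_j | y)

print(round(posterior, 3))
print(sum(posterior))
```

Because the hypotheses exhaust the sample space, the posterior probabilities sum to one, and the middle hypothesis receives most of the weight for these assumed data.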



 



 

(33)

An example from medical testing:

False positives in medical testing

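$$\text{Prob}(\text{ill}) = 10^{-6}$$

$$\text{Prob}(\text{test} + \mid \text{ill}) = 1$$

$$\text{Prob}(\text{test} + \mid \text{healthy}) = 10^{-3}$$

What is Prob(ill | test +)?

$$\text{Prob}(\text{ill} \mid \text{test} +) = \frac{\text{Prob}(\text{test} + \mid \text{ill})\,\text{Prob}(\text{ill})}{\text{Prob}(\text{test} +)}$$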


(35)

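[Diagram: the sample space split into “ill” and “not ill”, with the “Test +” region overlapping both]

$$\text{Prob}(+) = \text{prob}(+ \cap \text{ill}) + \text{prob}(+ \cap \text{healthy})$$

$$\text{Prob}(+) = \text{prob}(\text{ill}) \times \text{prob}(+ \mid \text{ill}) + \text{prob}(\text{healthy}) \times \text{Prob}(+ \mid \text{healthy})$$

$$\text{Prob}(+) = 10^{-6} \times 1 + (1 - 10^{-6}) \times 10^{-3} \approx 10^{-3}$$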
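The arithmetic for the medical test can be carried through to the answer. This sketch assumes the values read from the slides (a prevalence of 10⁻⁶, a test that always detects illness, and a false positive rate of 10⁻³); the variable names are introduced here for illustration.

```r
# Probability of being ill given a positive test, assuming the slide's values
p.ill <- 1e-6                  # prior probability of illness
p.healthy <- 1 - p.ill
p.pos.given.ill <- 1           # the test always detects illness
p.pos.given.healthy <- 1e-3    # false positive rate

# Denominator: total probability of a positive test
p.pos <- p.ill * p.pos.given.ill + p.healthy * p.pos.given.healthy

# Bayes law
p.ill.given.pos <- p.pos.given.ill * p.ill / p.pos

print(p.pos)
print(p.ill.given.pos)
```

Under these assumptions a positive result raises the probability of illness from one in a million to roughly one in a thousand, so nearly all positives are false positives.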

(36)

The Definite Integral

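The integral between a and b:

[Figure: the curve y = f(x) between x = a and x = b, with the area approximated by rectangles of width Δx as Δx → 0]

$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i)\,\Delta x$$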
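The limit can be seen numerically by letting the number of rectangles grow. In this sketch the function f(x) = x², the limits 0 and 3, and the choice of rectangle midpoints for the x_i are all assumptions for illustration; the helper `riemann` is hypothetical, while `integrate` is base R.

```r
# Riemann-sum approximation of a definite integral
f <- function(x) x^2
a <- 0
b <- 3

riemann <- function(f, a, b, n) {
  dx <- (b - a) / n                     # width of each rectangle
  x <- a + (seq_len(n) - 0.5) * dx      # midpoint of each rectangle
  sum(f(x) * dx)                        # total area of the rectangles
}

approx <- sapply(c(10, 100, 1000), function(n) riemann(f, a, b, n))
exact <- integrate(f, a, b)$value       # base R numerical integration

print(approx)
print(exact)
```

As n increases the sums approach the value of the integral, which is the same operation used later to obtain P(y) for a continuous parameter.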

(37)

Bayes law for continuous parameters

P(θ|y) reads: in light of the data, the probability that the parameter has the value θ. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.

[Figure: the posterior density P(θ|y) plotted against θ]

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{\int P(y \mid \theta)\,P(\theta)\,d\theta}$$

(38)

Bayes Law

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

•P(θ|y): the probability that a parameter takes on a particular value in light of the new data = the posterior distribution.

•P(y|θ): the likelihood.

•P(θ): the probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.

•P(y): the probability of the data, aka the marginal distribution.

(39)

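$$\Pr(\theta \mid y) = \frac{\Pr(y, \theta)}{\Pr(y)} = \frac{\Pr(y \mid \theta)\,\Pr(\theta)}{\Pr(y)} = \frac{p(y \mid \theta)\,p(\theta)}{\int p(y \mid \theta)\,p(\theta)\,d\theta}$$

•Pr(θ|y): conditional density of θ given y = y0

•Pr(y, θ): joint density of y and θ

•Pr(θ): marginal density of θ

•Pr(y): marginal density of y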

(40)

How do we derive a posterior distribution?

(41)

The prior distribution, P(θ), can be subjective or objective, informative or non-informative.

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

(42)

The likelihood function, aka the data distribution, P(y|θ).

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

(43)

The product of the prior and the likelihood function, P(θ)P(y|θ), is the joint, P(y, θ).

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

(44)

The denominator, the marginal distribution or normalization constant:

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)} = \frac{P(y \mid \theta)\,P(\theta)}{\int P(\theta)\,P(y \mid \theta)\,d\theta}$$
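For a continuous parameter the denominator is an integral, which base R's `integrate` can approximate. The sketch below reuses the grassland carbon data and prior from the exercise. The integration limits (10 to 25) and the rescaling of the integrand by its value near the peak are choices made here, because the unscaled joint is tiny and `integrate`'s default absolute tolerance could otherwise stop it too early.

```r
# Marginal P(y) as an integral over theta, then the normalized posterior
y <- c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd <- sd(y)

# Joint P(y|theta) P(theta), vectorized over theta as integrate() requires
joint <- function(theta) {
  sapply(theta, function(th) prod(dnorm(y, th, y.sd)) * dnorm(th, 20, 2.2))
}

# Rescale by the joint near its peak so the integrand is of order one
peak <- joint(mean(y))
p.y <- integrate(function(th) joint(th) / peak, lower = 10, upper = 25)$value * peak

posterior <- function(theta) joint(theta) / p.y

total <- integrate(posterior, lower = 10, upper = 25)$value
print(p.y)
print(total)
```

The posterior should integrate to approximately one over the chosen range, which is what dividing by the normalization constant is meant to guarantee.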

(45)

What we are seeking:

the posterior distribution P(θ|y)

Note that we are dividing each point on the dashed line by the area under the dashed line to obtain a probability reflecting our prior and current knowledge.

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$

(46)

Summary: Bayes vs likelihood

•The difference is not the use of prior information.

•In likelihood, we find parameter estimates by maximizing the likelihood. We have a likelihood profile, but it is somewhat cumbersome for developing confidence or support envelopes.

•In Bayes, we integrate or sum over the entire range of parameter values to get a PDF by dividing each point on the “likelihood profile” by the area beneath the profile. The estimate of our parameter is the mean or the median of the resulting PDF. We also obtain estimates of the mode, the variance, kurtosis, etc., which allows us to make statements about the probability of our parameter(s). Likelihood cannot make these statements.

•This posterior PDF in our current study forms the prior in subsequent studies.

•The real value of Bayes over likelihood emerges as our process and probability models become complex. In this case, we can exploit the product rule to simplify our problem, to break it up into manageable chunks that can be reassembled in a coherent way.
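As an illustration of the probability statements a posterior PDF supports, the sketch below summarizes a grid posterior for the grassland carbon exercise. The grid range and spacing, the equal-tailed 95% interval, and the threshold of 16 are assumptions chosen for this example.

```r
# Summaries of a posterior computed on a grid (illustrative choices throughout)
y <- c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd <- sd(y)

d.theta <- 0.001
theta <- seq(10, 25, by = d.theta)

joint <- sapply(theta, function(th) prod(dnorm(y, th, y.sd))) *
  dnorm(theta, mean = 20, sd = 2.2)
posterior <- joint / (sum(joint) * d.theta)      # normalize to a density

post.mean <- sum(theta * posterior) * d.theta
post.sd <- sqrt(sum((theta - post.mean)^2 * posterior) * d.theta)

cdf <- cumsum(posterior) * d.theta               # cumulative probability
post.median <- theta[which(cdf >= 0.5)[1]]
interval <- c(theta[which(cdf >= 0.025)[1]],     # equal-tailed 95% interval
              theta[which(cdf >= 0.975)[1]])

p.below.16 <- sum(posterior[theta < 16]) * d.theta   # P(theta < 16 | y)

print(c(mean = post.mean, median = post.median, sd = post.sd))
print(interval)
print(p.below.16)
```

Statements such as P(θ < 16 | y) are direct probability statements about the parameter, which a likelihood profile alone does not provide.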