Bayes made simple
Significance is…
P(obtaining a test statistic more extreme than the one we observed | H₀ is true)
This is not a test of the strength of evidence supporting a hypothesis (or model). It is simply a statement about the probability of obtaining extreme values that we have not observed.
A frequentist confidence interval
In frequentist statistics, a 95% CI represents an interval such that if the experiment were repeated 100 times, we would expect about 95 of the resulting CIs (e.g., average ± 1.96 SE) to contain the true parameter value.
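The repeated-sampling reading of a confidence interval can be checked by simulation. The sketch below assumes a normal population; the true mean, sd, sample size, and number of repetitions are illustrative values, not taken from the slides. It counts how often mean ± 1.96 SE covers the true mean.

```r
# Sketch: long-run coverage of a nominal 95% CI (mean +/- 1.96 SE).
# true.mean, true.sd, n and n.rep are illustrative assumptions.
set.seed(1)
true.mean <- 10
true.sd   <- 2
n         <- 50      # observations per simulated experiment
n.rep     <- 10000   # number of repeated experiments

covered <- replicate(n.rep, {
  y  <- rnorm(n, true.mean, true.sd)
  se <- sd(y) / sqrt(n)
  lo <- mean(y) - 1.96 * se
  hi <- mean(y) + 1.96 * se
  lo <= true.mean && true.mean <= hi   # TRUE if this CI contains the truth
})

coverage <- mean(covered)   # proportion of CIs that contain the true mean
print(coverage)
```

The statement is about the procedure over many repetitions: any single interval either contains the true value or it does not.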
A new approach to insight
Pose a question and think about what is needed to answer it.
Ask:
•How do the data arise?
•What is the hypothesized process that produces them?
•What are the sources of randomness/uncertainty in the process and the way we observe it?
•How can we model the process and its associated uncertainty in a way that allows the data to speak informatively?
This approach is based on a firm intuitive understanding of the relationship between process models and probability models.
Why Bayes?
Light Limitation of Trees
γ = max. growth rate at high light
c = minimum light requirement
α = slope of curve at low light
[Figure: growth rate (0 to 1) plotted against light availability (0 to 7).]
$$\mu_i = \frac{\gamma\,(L_i - c)}{\frac{\gamma}{\alpha} + (L_i - c)}$$
$$p(y_i \mid \mu_i, \sigma) = \text{Normal}(y_i \mid \mu_i, \sigma)$$
$$y_i \sim \text{Normal}(\mu_i, \sigma)$$
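The light-limitation model can be sketched in R to show how a process model and a probability model fit together. The saturating form used here, mu = gamma (L - c) / (gamma / alpha + (L - c)), is an assumption chosen to match the three parameter definitions on the slide. The function name `g` and every parameter value are hypothetical.

```r
# Sketch: light-limitation process model plus a normal data model.
# Assumed form: mu = gamma * (L - c) / (gamma / alpha + (L - c)).
# All parameter values below are hypothetical.
g <- function(L, gamma, alpha, c.min) {
  gamma * (L - c.min) / (gamma / alpha + (L - c.min))
}

set.seed(1)
gamma <- 1.0    # max. growth rate at high light
alpha <- 1.5    # slope of curve at low light
c.min <- 0.2    # minimum light requirement
sigma <- 0.05   # sd of the normal data model

L  <- seq(0.5, 7, by = 0.5)         # light availability
mu <- g(L, gamma, alpha, c.min)     # process model: expected growth rate
y  <- rnorm(length(L), mu, sigma)   # data model: y_i ~ Normal(mu_i, sigma)

plot(L, y, xlab = "Light availability", ylab = "Growth rate")
lines(L, mu, lwd = 2, col = "blue")
```

Under this form, growth is zero at L = c, the initial slope is alpha, and the curve approaches gamma at high light.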
Where do uncertainties arise?
•Variation due to processes we failed to model.
•Error in our observations
   •of the process
   •of covariates or predictor variables
•What about genetic variation among individuals? Geographic variation among sites?
•What does the current science tell us about the process we are modeling?
•How can we exploit what is already known about the processes we are modeling?
[Diagram, built up over three slides, with the light-limitation equations repeated beside it. Labels: Hyperparameters; Parameter model; Process model g(·, L_i), with parameters including c and σ_proc; Data model; Data on response (y_i, y.obs); Predictor (x_i, x.obs).]
Today
Derivation of Bayes Law
Understanding each piece
P(y|θ)
P(θ)
P(y|θ) P(θ)
P(y)
Putting the pieces together
The relationship between likelihood and Bayes
Priors and conjugacy (…probably into Thursday)
Concept of Probability
[Diagram: events A and B drawn as regions inside the sample space S.]
$$P(A) = \text{probability that event } A \text{ occurs} = \frac{\text{area of } A}{\text{area of } S}$$
Conditional Probabilities
[Diagram: events A and B as regions inside the sample space S.]
Probability of B given that we know A occurred:
$$P(B \mid A) = \frac{\text{area of } B \text{ and } A}{\text{area of } A} = \frac{P(B, A)}{P(A)}$$
Conditional Probabilities
[Diagram: events A and B as regions inside the sample space S.]
What is P(A|B), the probability that A occurred given that B occurred?
Bayes Law: Get this now and forever
1) $$P(B \mid A) = \frac{P(B, A)}{P(A)}$$
2) $$P(A \mid B) = \frac{P(A, B)}{P(B)}$$
Solving 1 for P(B,A):
$$P(B, A) = P(B \mid A)\,P(A)$$
Substituting into 2, and noting that P(A,B) = P(B,A), gives Bayes Law:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
What is P(B|A)?
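The derivation can be checked numerically on a small joint probability table. The sketch below uses made-up joint probabilities for two events, A and B. It computes P(A|B) directly from its definition and again through Bayes Law.

```r
# Sketch: verify Bayes law on a 2 x 2 joint probability table.
# The joint probabilities are made-up numbers that sum to 1.
joint <- matrix(c(0.10, 0.20,
                  0.30, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("A", "notA"), c("B", "notB")))

P.A  <- sum(joint["A", ])   # marginal P(A)
P.B  <- sum(joint[, "B"])   # marginal P(B)
P.AB <- joint["A", "B"]     # joint P(A, B) = P(B, A)

P.B.given.A <- P.AB / P.A   # equation 1
P.A.given.B <- P.AB / P.B   # equation 2

# Bayes law: P(A|B) = P(B|A) P(A) / P(B)
bayes <- P.B.given.A * P.A / P.B
print(c(direct = P.A.given.B, bayes = bayes))
```

Both routes give the same number, because Bayes Law is only a rearrangement of the definition of conditional probability.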
We are interested in P(θ|y)
•We have some new data (y) in hand—the data represent the
“event” that has occurred.
•What is the probability of the parameters given the data? By symmetry, Bayes law is:
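$$P(\theta \mid y) = \frac{P(y, \theta)}{P(y)} = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$
Joint: P(y, θ), the numerator. Marginal: P(y), the denominator. The second equality follows from the product rule, P(y, θ) = P(y|θ)P(θ).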
The Holy Grail
[Figure: the posterior distribution, P(θ|y), plotted against θ.]
The posterior distribution specifies P(θ|y) as a function of θ. It returns a probability of the parameter value in light of the data.
Bayes Law
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$
P(θ|y) is what we seek: the probability that a parameter takes on a particular value in light of the new data = the posterior distribution.
P(y|θ): What is this? Haven’t we seen this before?
P(θ): The probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.
P(y): The probability of the data, aka the marginal distribution. More on this coming up.
Components
Understanding P(θ) = the prior
Understanding P(y|θ)P(θ) = the joint distribution
Understanding P(y) = the marginal distribution
What is P(θ) (aka the prior)?
[Figure: two prior densities for θ, plotted for x from 30 to 55. Informative prior: dnorm(x, 40, 2). Uninformative prior: dnorm(x, 0, 100).]
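The two priors in the figure can be redrawn with a few lines of R. This is a minimal sketch that reuses the two `dnorm` calls shown on the axes. The ratio calculation at the end is an added illustration of how much each density changes over the plotted range.

```r
# Sketch: an informative and an uninformative normal prior on theta.
x <- seq(30, 55, by = 0.1)
informative   <- dnorm(x, 40, 2)    # concentrated near 40
uninformative <- dnorm(x, 0, 100)   # changes little over this range

par(mfrow = c(2, 1))
plot(x, informative, type = "l", lwd = 2, xlab = expression(theta),
     ylab = "dnorm(x, 40, 2)", main = "Informative prior")
plot(x, uninformative, type = "l", lwd = 2, xlab = expression(theta),
     ylab = "dnorm(x, 0, 100)", main = "Uninformative prior")

# Ratio of largest to smallest density over the plotted range
ratio.informative   <- max(informative) / min(informative)
ratio.uninformative <- max(uninformative) / min(uninformative)
print(c(ratio.informative, ratio.uninformative))
```

A ratio near 1 means the prior treats all plotted values of θ as almost equally plausible, while a very large ratio means the prior strongly favors some values over others.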
Where do priors come from?
• If we have a mean and a standard deviation from earlier studies of θ, then we have a prior on θ.
P(θ|y) in our current study becomes P(θ) in future studies
• If we don’t have prior information, the prior will be
uninformative
The joint
So what is P(y|θ)P(θ) (aka the joint distribution)?
Exercise
•You have 8 observations of the standing crop of carbon in a
grassland from 0.25 sq. m. plots. Assume the data are normally distributed.
y=(16.5,15.7,16,15.3,14.9,15.7,14.7,15.6)
•A previous estimate of carbon standing crop was mean=20, sd=2.2.
•Calculate and plot the prior, the likelihood, and the joint distribution.
[Figure: the likelihood, P(y|θ), plotted against θ (mean).]
P(y = 4 infected | θ = 0.12)
L(θ|y) = dnorm(y, θ, sigma), multiplied over the observations in y.
The data are constant, the parameter varies.
Area under curve ≠ 1
Point estimates vs. distribution
[Figure: the posterior, P(θ|y), plotted against θ (mean).]
Area under curve = 1
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$
# Data
y = c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd <- sd(y)
# prior mean and sd on theta
p.mean = 20
p.sd = 2.2
D = NULL
theta = seq(0, 30, .1) # set up a vector of potential values for theta
# Likelihood x prior = joint
for (i in 1:length(theta)) { # note we do this for all values of theta
  # prior
  P = dnorm(theta[i], p.mean, p.sd)
  # likelihood
  L = prod(dnorm(y, theta[i], y.sd)) # note the product (not log-likelihood)
  # likelihood x prior
  LP = L * P
  D = rbind(D, c(theta[i], LP, L, P))
}
D = as.data.frame(D)
names(D) = c("theta", "LP", "L", "P")
# Plot everything
par(mfrow = c(3, 1))
# prior
plot(D$theta, D$P, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(", theta, ")")), main = "Prior", col = "blue")
# likelihood
plot(D$theta, D$L, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(y|", theta, ")")), main = "Likelihood", col = "blue")
# prior * likelihood = joint
plot(D$theta, D$LP, type = "l", lwd = 2, xlab = expression(theta),
     ylab = expression(paste("P(y|", theta, ")P(", theta, ")")), main = "Joint", col = "blue")
[Figure: three stacked panels plotted against θ from 0 to 30. Prior: P(θ). Likelihood: P(y|θ). Joint: P(y|θ)P(θ).]
What is P(y)?
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$
Because P(y) is a constant,
$$P(\theta \mid y) \propto P(y \mid \theta)\,P(\theta)$$
So, without knowing the denominator, we can evaluate the relative support for each value of θ, but not the probability. This is what maximum likelihood does. To get at the probability, we must “normalize” the relative support by dividing by P(y).
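The normalization step can be sketched on a grid, using the carbon standing crop data and prior from the exercise. Approximating P(y) by a sum over grid points of width `step` is a choice made here for illustration, not something the slides prescribe.

```r
# Sketch: normalize the joint on a grid of theta values to get the posterior.
y      <- c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd   <- sd(y)
p.mean <- 20
p.sd   <- 2.2
step   <- 0.01
theta  <- seq(0, 30, by = step)

prior <- dnorm(theta, p.mean, p.sd)
like  <- sapply(theta, function(t) prod(dnorm(y, t, y.sd)))
joint <- like * prior           # relative support; area is not 1

P.y       <- sum(joint * step)  # grid approximation of P(y)
posterior <- joint / P.y        # normalized: area is 1

theta.map <- theta[which.max(posterior)]   # same location as the peak of the joint
print(c(P.y = P.y, area.posterior = sum(posterior * step), theta.map = theta.map))
```

Dividing by P(y) rescales the curve without moving its peak, which is why the joint alone is enough to rank values of θ but not to attach probabilities to them.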
So what is P(y)?
The θ are collectively exhaustive, mutually exclusive hypotheses.
[Diagram: the sample space (the green blob) contains all possible outcomes of observation, experiment, etc., and is partitioned into regions θ1, θ2, and θ3. The data, the observed outcome (the blue blob), lie inside it, with pieces labeled P(y, θ1), P(y, θ2), and P(y, θ3).]
$$P(y) = P(\text{data}) = \frac{\text{area of blue blob}}{\text{area of green blob}}$$
Each piece follows from the product rule, for example:
$$P(y, \theta_3) = P(y \mid \theta_3)\,P(\theta_3)$$
Because of this, the probability of y is:
$$P(y) = \sum_{i=1}^{3} P(y \mid \theta_i)\,P(\theta_i)$$
Bayes law for discrete parameters
P(θi|y) reads: in light of the data, the probability that the parameter has the value θi. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.
$$P(\theta_i \mid y) = \frac{P(y \mid \theta_i)\,P(\theta_i)}{P(y)} = \frac{P(y \mid \theta_i)\,P(\theta_i)}{\sum_{j=1}^{J} P(y \mid \theta_j)\,P(\theta_j)}$$
An example from medical testing: false positives in medical testing
Prob(ill) = 10⁻⁶
Prob(test+ | ill) = 1
Prob(test+ | healthy) = 10⁻³
What is Prob(ill | test+)?
[Diagram: the population divided into ill and not ill, with the region that tests positive (Test +) marked.]
$$\text{Prob}(\text{ill} \mid \text{test+}) = \frac{\text{Prob}(\text{test+} \mid \text{ill})\,\text{Prob}(\text{ill})}{\text{Prob}(\text{test+})}$$
$$\text{Prob}(\text{test+}) = \text{Prob}(\text{test+} \mid \text{ill})\,\text{Prob}(\text{ill}) + \text{Prob}(\text{test+} \mid \text{healthy})\,\text{Prob}(\text{healthy})$$
$$\text{Prob}(\text{ill} \mid \text{test+}) = \frac{1 \times 10^{-6}}{1 \times 10^{-6} + 10^{-3} \times (1 - 10^{-6})} \approx 10^{-3}$$
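The same calculation can be written out in R. This sketch assumes the values Prob(ill) = 10⁻⁶, Prob(test+ | ill) = 1, and Prob(test+ | healthy) = 10⁻³, and takes Prob(healthy) = 1 − Prob(ill). The variable names are illustrative.

```r
# Sketch: Bayes law for the medical-testing example.
p.ill         <- 1e-6        # Prob(ill)
p.pos.ill     <- 1           # Prob(test+ | ill)
p.pos.healthy <- 1e-3        # Prob(test+ | healthy), the false positive rate
p.healthy     <- 1 - p.ill   # Prob(healthy)

# Marginal probability of a positive test: sum over the two hypotheses
p.pos <- p.pos.ill * p.ill + p.pos.healthy * p.healthy

# Posterior probability of illness given a positive test
p.ill.pos <- p.pos.ill * p.ill / p.pos
print(p.ill.pos)
```

Even with a test that never misses a true case, a positive result leaves the probability of illness at roughly one in a thousand, because healthy people vastly outnumber ill people.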
The Definite Integral
The integral between a and b:
$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i)\,\Delta x$$
As n → ∞, Δx → 0.
[Figure: the curve y = f(x) between x = a and x = b.]
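The limit can be seen numerically by summing f(x)Δx over progressively finer slices. The sketch below uses the standard normal density as an example integrand and evaluates f at the midpoint of each slice. Both choices, and the helper name `riemann`, are illustrative.

```r
# Sketch: a definite integral as the limit of a Riemann sum.
f <- function(x) dnorm(x)   # example integrand: standard normal density
a <- -1.96
b <-  1.96

riemann <- function(f, a, b, n) {
  dx <- (b - a) / n
  x  <- a + (seq_len(n) - 0.5) * dx   # midpoint of each of the n slices
  sum(f(x) * dx)
}

n.slices <- c(10, 100, 1000)
est      <- sapply(n.slices, function(n) riemann(f, a, b, n))
exact    <- pnorm(b) - pnorm(a)
print(rbind(n = n.slices, estimate = est, error = est - exact))
```

As n grows and Δx shrinks, the sum converges on the exact area. The same idea underlies replacing the sum in the discrete form of Bayes law with an integral.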
Bayes law for continuous parameters
P(θ|y) reads: in light of the data, the probability that the parameter has the value θ. If we find this value for all possible values of the parameter θ, then we have the posterior distribution.
[Figure: P(θ|y) plotted against θ.]
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)} = \frac{P(y \mid \theta)\,P(\theta)}{\int P(y \mid \theta)\,P(\theta)\,d\theta}$$
Bayes Law
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$
P(θ|y): The probability that a parameter takes on a particular value in light of the new data = the posterior distribution.
P(y|θ): The likelihood.
P(θ): The probability that the parameter takes on a particular value in light of prior data on θ = the prior distribution.
P(y): The probability of the data, aka the marginal distribution.
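[Figure labels: marginal density of θ; marginal density of y; joint density of y and θ; conditional density of θ given y = y0.]
$$\Pr(\theta \mid y) = \frac{\Pr(\theta, y)}{\Pr(y)} = \frac{\Pr(y \mid \theta)\,\Pr(\theta)}{\Pr(y)}$$
$$p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$$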
How do we derive a posterior distribution?
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$
The prior distribution, P(θ), can be subjective or objective, informative or non-informative.
The likelihood function, aka the data distribution, is P(y|θ).
The product of the prior and the likelihood function, P(θ)P(y|θ), is the joint, P(y, θ).
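The denominator, P(y), is the marginal distribution or normalization constant:
$$P(y) = \int P(y \mid \theta)\,P(\theta)\,d\theta$$
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{\int P(y \mid \theta)\,P(\theta)\,d\theta}$$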
What we are seeking: the posterior distribution, P(θ|y).
Note that we are dividing each point on the dashed line by the area under the dashed line to obtain a probability reflecting our prior and current knowledge.
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$$
Summary: Bayes vs likelihood
•The difference is not the use of prior information.
•In likelihood, we find parameter estimates by maximizing the likelihood. We have a likelihood profile, but it is somewhat cumbersome for developing confidence or support envelopes.
•In Bayes, we integrate or sum over the entire range of parameter values to get a PDF by dividing each point on the “likelihood profile” by the area beneath the profile. The estimate of our parameter is the mean or the median of the resulting PDF. We also obtain estimates of the mode, the variance, kurtosis, etc., which allow us to make statements about the probability of our parameter(s). Likelihood cannot make these statements.
•This posterior PDF in our current study forms the prior in subsequent studies.
•The real value of Bayes over likelihood emerges as our process and probability models become complex. In this case, we can exploit the product rule to simplify our problem, breaking it up into manageable chunks that can be reassembled in a coherent way.
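The contrast between a maximum likelihood point estimate and summaries of a posterior PDF can be sketched with the carbon standing crop data and prior from the exercise. The grid approximation, the 95% interval, and the threshold of 16 in the last line are illustrative choices.

```r
# Sketch: maximum likelihood estimate vs. summaries of the posterior PDF.
y     <- c(16.5, 15.7, 16, 15.3, 14.9, 15.7, 14.7, 15.6)
y.sd  <- sd(y)
step  <- 0.001
theta <- seq(0, 30, by = step)

like  <- sapply(theta, function(t) prod(dnorm(y, t, y.sd)))
prior <- dnorm(theta, 20, 2.2)                      # prior from the exercise
post  <- like * prior / sum(like * prior * step)    # normalized posterior

mle       <- theta[which.max(like)]        # likelihood: a point estimate
post.mean <- sum(theta * post * step)      # Bayes: mean of the posterior
cdf       <- cumsum(post * step)
post.med  <- theta[which(cdf >= 0.5)[1]]   # posterior median
ci95      <- c(theta[which(cdf >= 0.025)[1]],
               theta[which(cdf >= 0.975)[1]])   # 95% credible interval
p.above16 <- sum(post[theta > 16] * step)  # P(theta > 16 | y)

print(c(mle = mle, mean = post.mean, median = post.med))
print(ci95)
print(p.above16)
```

The last quantity is a direct probability statement about the parameter, which is available only because the posterior has been normalized to a proper PDF.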