Posterior inference can be conducted using (at least) two different kinds of MCMC strategies:
1 Employing a truncation of the stick-breaking representation (Ishwaran and James, 2001, 2002).
2 Using a marginal/collapsed Gibbs sampler in which the mixing distribution is integrated out of the model (MacEachern 1998, Neal 2000).
Approach 1 is detailed throughout.
Blocked Gibbs sampler
The blocked Gibbs sampler relies on truncating the stick-breaking representation to a finite number of components.
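Hence
$$G_N(\cdot) = \sum_{k=1}^{N} \omega_k \, \delta_{(\mu_k, \sigma_k^2)}(\cdot).$$
The atoms $(\mu_k, \sigma_k^2)$ are iid from $G_0$, i.e.,
$$(\mu_k, \sigma_k^2) \overset{iid}{\sim} N(a_\mu, b_\mu^2)\, IG(a_{\sigma^2}, b_{\sigma^2}), \quad k = 1, \ldots, N.$$
The weights arise through a truncated stick-breaking construction: $\omega_1 = v_1$ and, for $k \geq 2$,
$$\omega_k = v_k \prod_{l<k} (1 - v_l), \qquad v_k \overset{iid}{\sim} \text{Beta}(1, \alpha), \quad k = 1, \ldots, N-1,$$
$$v_N = 1, \qquad \omega_N = \prod_{l=1}^{N-1} (1 - v_l).$$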
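The truncated stick-breaking construction can be sketched in a few lines of Python. This is only an illustration under assumed settings: the function name `truncated_stick_breaking` and the values of `alpha` and `N` are hypothetical choices, not part of the original material.

```python
import random

def truncated_stick_breaking(alpha, N, rng=random):
    """Draw weights (omega_1, ..., omega_N) from the truncated construction."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(1, N + 1):
        # v_k ~ Beta(1, alpha) for k < N; v_N = 1 so that the weights sum to one
        v = 1.0 if k == N else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

random.seed(1)
omega = truncated_stick_breaking(alpha=1.0, N=20)
print(sum(omega))
```

Setting the last input to one is what forces the final weight to absorb whatever is left of the stick.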
Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 48 / 95
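Ishwaran and James (2001) showed the following bound for the truncation error:
$$4\left[1 - E\left\{\left(\sum_{k=1}^{N-1} \omega_k\right)^{n}\right\}\right] \approx 4n \exp\left(-\frac{N-1}{\alpha}\right).$$
For instance, if $n = 500$ and we use a truncation level of $N = 20$, then for $\alpha = 1$ we get a bound of $\approx 1.1 \times 10^{-5}$.
In practice, $N = 20$ or $N = 50$ are commonly chosen as defaults.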
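The approximate bound is easy to evaluate numerically when choosing a truncation level. The helper below is a small sketch; the name `truncation_bound` is a hypothetical choice.

```python
import math

def truncation_bound(n, N, alpha):
    """Approximate bound 4 n exp(-(N - 1) / alpha) on the truncation error."""
    return 4.0 * n * math.exp(-(N - 1) / alpha)

# Setting quoted in the text: n = 500, N = 20, alpha = 1
print(truncation_bound(n=500, N=20, alpha=1.0))
```

Because the bound decays exponentially in $N$ but only at rate $1/\alpha$, larger values of $\alpha$ call for a larger truncation level.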
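Using the truncated version $G_N$ of $G$, the normal mixture density can be expressed as
$$f(y) = \sum_{k=1}^{N} \omega_k \, \phi(y \mid \mu_k, \sigma_k^2),$$
with the $\omega_k$ generated from the truncated stick-breaking representation, whereas $\mu_k \overset{iid}{\sim} N(a_\mu, b_\mu^2)$ and $\sigma_k^2 \overset{iid}{\sim} IG(a_{\sigma^2}, b_{\sigma^2})$.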
As was the case for the finite mixture model, derivation of the full conditionals for Gibbs sampling involves the data-augmented likelihood.
The means, variances, and latent component indicators are sampled in an identical manner to the finite mixture model.
The main difference is that, unlike in the finite mixture model, uncertainty in the component weights is shifted to $v$, the inputs into the construction of the stick-breaking weights.
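The full conditional distributions are
$$\mu_k \mid \text{else} \sim N\left( \frac{a_\mu / b_\mu^2 + \sum_{i: z_i = k} y_i / \sigma_k^2}{1 / b_\mu^2 + n_k / \sigma_k^2}, \; \frac{1}{1 / b_\mu^2 + n_k / \sigma_k^2} \right), \quad (11)$$
$$\sigma_k^2 \mid \text{else} \sim IG\left( a_{\sigma^2} + n_k / 2, \; b_{\sigma^2} + \sum_{i: z_i = k} (y_i - \mu_k)^2 / 2 \right), \quad (12)$$
for $k = 1, \ldots, N$, where $n_k$ denotes the number of observations with $z_i = k$.
For $i = 1, \ldots, n$, the full conditional distribution for $z_i$ is $z_i \mid \text{else} \sim \text{Mult}(p_i)$, with $p_i = (p_{i1}, \ldots, p_{iN})$ and
$$p_{ik} = \frac{\omega_k \, \phi(y_i \mid \mu_k, \sigma_k^2)}{\sum_{l=1}^{N} \omega_l \, \phi(y_i \mid \mu_l, \sigma_l^2)}, \quad k = 1, \ldots, N.$$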
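For $k = 1, \ldots, N-1$, update the inputs of the stick-breaking weights from
$$v_k \mid \text{else} \sim \text{Beta}\left( n_k + 1, \; \alpha + \sum_{l=k+1}^{N} n_l \right). \quad (13)$$
Regarding the precision parameter $\alpha$, it can be fixed at a small value; for instance, $\alpha = 1$ is widely used in applications.
Alternatively, one can place a prior on $\alpha$ and allow the data to inform about the appropriate value of $\alpha$.
Letting $\alpha \sim \text{Gamma}(a_\alpha, b_\alpha)$, the resulting full conditional for $\alpha$ is
$$\alpha \mid \text{else} \sim \text{Gamma}\left( a_\alpha + N - 1, \; b_\alpha - \sum_{k=1}^{N-1} \log(1 - v_k) \right). \quad (14)$$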
Set initial values for $\alpha$, $\omega$, and $\theta$, say $\alpha^{(0)}$, $\omega^{(0)}$, and $\theta^{(0)}$. For $t = 1, \ldots, T$, do:
1 For $i = 1, \ldots, n$ and $k = 1, \ldots, N$, compute the posterior probabilities of membership using equation (6).
2 For $i = 1, \ldots, n$, sample the latent component indicator, $z_i^{(t)} \sim \text{Mult}(p_i^{(t)})$.
3 Simulate the stick-breaking inputs $v^{(t)}$ from equation (13).
4 Given $v^{(t)}$, compute $\omega^{(t)}$ using the stick-breaking construction.
5 Conditional on $z^{(t)}$, update $\mu_k^{(t)}$ and $(\sigma_k^{(t)})^2$, for $k = 1, \ldots, N$, from (11) and (12), respectively.
6 Update the precision parameter $\alpha$ from (14).
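The six steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function name `blocked_gibbs`, the hyperparameter names and default values (`a_mu`, `b2_mu`, `a_s`, `b_s`, `a_al`, `b_al`), the initialization, and the simulated data are all assumptions made for the example. Only the final state of the chain is returned.

```python
import math
import random

def normal_pdf(y, mu, var):
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def blocked_gibbs(y, N=20, T=200, a_mu=0.0, b2_mu=100.0, a_s=2.0, b_s=1.0,
                  a_al=1.0, b_al=1.0, seed=0):
    rng = random.Random(seed)
    alpha = 1.0                              # assumed starting values
    w = [1.0 / N] * N
    mu = [rng.choice(y) for _ in range(N)]
    var = [1.0] * N
    for t in range(T):
        # Steps 1-2: membership probabilities and latent indicators z_i
        z = []
        for yi in y:
            # tiny constant guards against all probabilities underflowing to 0
            p = [w[k] * normal_pdf(yi, mu[k], var[k]) + 1e-300 for k in range(N)]
            z.append(rng.choices(range(N), weights=p)[0])
        counts = [z.count(k) for k in range(N)]
        # Step 3: stick-breaking inputs from (13), with v_N = 1
        v = [min(rng.betavariate(counts[k] + 1, alpha + sum(counts[k + 1:])),
                 1.0 - 1e-10) for k in range(N - 1)]
        v.append(1.0)
        # Step 4: weights from the stick-breaking construction
        remaining = 1.0
        for k in range(N):
            w[k] = v[k] * remaining
            remaining *= 1.0 - v[k]
        # Step 5: means from (11) and variances from (12)
        for k in range(N):
            yk = [yi for yi, zi in zip(y, z) if zi == k]
            prec = 1.0 / b2_mu + counts[k] / var[k]
            mean = (a_mu / b2_mu + sum(yk) / var[k]) / prec
            mu[k] = rng.gauss(mean, math.sqrt(1.0 / prec))
            shape = a_s + counts[k] / 2.0
            rate = b_s + sum((yi - mu[k]) ** 2 for yi in yk) / 2.0
            # if X ~ Gamma(shape, rate) then 1 / X ~ IG(shape, rate)
            var[k] = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        # Step 6: precision parameter from (14)
        log_sum = sum(math.log(1.0 - vk) for vk in v[:-1])
        alpha = rng.gammavariate(a_al + N - 1, 1.0 / (b_al - log_sum))
    return w, mu, var, alpha

data_rng = random.Random(1)
y = [data_rng.gauss(-3, 1) for _ in range(50)] + [data_rng.gauss(3, 1) for _ in range(50)]
w, mu, var, alpha = blocked_gibbs(y, N=10, T=100)
print(sorted(round(x, 3) for x in w)[-3:])  # the three largest weights
```

In practice one would store the draws at every iteration, discard a burn-in period, and evaluate the mixture density on a grid to obtain the posterior density estimate.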
A valid concern with the blocked Gibbs sampler is that by truncating the stick-breaking representation we are effectively fitting a finite (and hence parametric) mixture model.
Quoting Dunson (2011):
“For example, if we let N = 25 as a truncation level, a natural question is how this is better or intrinsically different than fitting a finite mixture model with 25 components. One answer is that N is not the number of components occupied by the subjects in your sample but is instead an upper bound on the number of subjects.”
Collapsed Gibbs sampler/ Polya urn scheme
The function DPdensity from the R package DPpackage can also be used to fit a DPM of normals.
The MCMC scheme behind this function is a marginalized/collapsed Gibbs sampler.
The collapsed Gibbs sampler avoids specifying a truncation level by marginalizing out $G$ and relying on the Polya urn scheme of Blackwell and MacQueen (1973).
Letting
$$y_i \sim \phi(\theta_i), \quad \theta_i = (\mu_i, \sigma_i^2) \sim G, \quad G \sim DP(\alpha, G_0),$$
and marginalizing out $G$, we obtain the Polya urn predictive rule
$$p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}) \propto \frac{\alpha}{\alpha + i - 1} G_0(\theta_i) + \frac{1}{\alpha + i - 1} \sum_{j=1}^{i-1} \delta_{\theta_j}(\theta_i).$$
The Polya urn rule forms the basis of the collapsed Gibbs sampler. For those interested in further details, see Escobar and West (1995), MacEachern (1998), and Neal (2000).
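To see how the Polya urn rule induces ties among the $\theta_i$, one can simulate it directly. The sketch below is illustrative only: the function name `polya_urn` and the base measure used in the example (a normal draw for the mean with the variance fixed at 1) are hypothetical choices.

```python
import random

def polya_urn(n, alpha, base_draw, rng):
    """Draw theta_1, ..., theta_n sequentially from the Polya urn rule."""
    thetas = []
    for i in range(1, n + 1):
        # New value from G0 with probability alpha / (alpha + i - 1)
        if rng.random() < alpha / (alpha + i - 1):
            thetas.append(base_draw(rng))
        else:
            # Otherwise copy one of the i - 1 previous values, chosen uniformly
            thetas.append(rng.choice(thetas))
    return thetas

rng = random.Random(42)
# Hypothetical G0: mu ~ N(0, 10^2) with sigma^2 fixed at 1 (illustration only)
draws = polya_urn(200, alpha=1.0, base_draw=lambda r: (r.gauss(0.0, 10.0), 1.0), rng=rng)
print(len(set(draws)), "distinct values among", len(draws))
```

The number of distinct values grows slowly with $n$, which is the clustering behavior the collapsed sampler exploits.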
Example
Same data as in the finite mixture example. DPM fit (left) and comparison against the fit of a 3-component mixture model, which corresponds to the true data-generating mechanism (right).
[Figure: two panels; x-axis: y, y-axis: Density.]
Example
True data-generating mechanism: $\phi_{SN}(y \mid \mu = 0, \sigma^2 = 1, \lambda = 8)$.
[Figure: two panels, n = 150 and n = 500; x-axis: y, y-axis: Density.]
Example
True data-generating mechanism: $\phi(y \mid 0, 1)$. DPM fit (blue) against normal fit (red).
[Figure: two panels, n = 150 and n = 500; x-axis: y, y-axis: Density.]
Censored responses
The DPM can easily handle censored responses (left, right, or interval).
Remember that under the hierarchical formulation
$$y_i \mid \mu, \sigma^2, z \sim \phi(y_i \mid \mu_{z_i}, \sigma_{z_i}^2),$$
$$\vdots$$
For instance, if $y_i$ is right censored, we know that its true value, say $y_i^*$, is greater than $y_i$, that is, $y_i^* > y_i$.
We can take care of these censored observations by simply adding an extra step in the blocked Gibbs algorithm.
In fact, we can simulate those observations from
$$y_i^* \mid y_i, z_i, \mu, \sigma^2 \sim \phi(y_i^* \mid \mu_{z_i}, \sigma_{z_i}^2) \, I(y_i^* > y_i).$$
This can be accomplished by simulating $y_i^*$ from a truncated normal distribution with lower limit equal to $y_i$.
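One simple way to draw from the truncated normal is to invert the cdf. The sketch below assumes this inverse-cdf approach (other samplers exist and behave better far in the tails); the function name `impute_right_censored` and the numerical guards are hypothetical choices.

```python
import random
from statistics import NormalDist

def impute_right_censored(c, mu, sigma, rng):
    """Draw y* ~ N(mu, sigma^2) restricted to y* > c by inverting the cdf."""
    std = NormalDist()
    lower = std.cdf((c - mu) / sigma)         # P(Y <= c)
    u = lower + (1.0 - lower) * rng.random()  # uniform on [lower, 1)
    u = min(max(u, 1e-12), 1.0 - 1e-12)       # keep inv_cdf away from 0 and 1
    # max() protects against round-off pushing the draw just below c
    return max(c, mu + sigma * std.inv_cdf(u))

rng = random.Random(0)
# Censoring value 1.0 for an observation allocated to a N(0, 1) component
draws = [impute_right_censored(1.0, 0.0, 1.0, rng) for _ in range(1000)]
print(min(draws), sum(draws) / len(draws))
```

Within the blocked Gibbs sampler, `mu` and `sigma` would be the current mean and standard deviation of the component indicated by $z_i$, and the imputed value replaces $y_i$ in the remaining updates of that iteration.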
DPM fit (blue line) against Kaplan–Meier fit (black line). The censored observations are represented by crosses.
[Figure: x-axis: y, y-axis: Survival.]
Density regression
So far we have focused on the problem of density estimation.
We will now move to the problem of density regression.
Traditional regression models allow just one or two characteristics (e.g., mean and/or variance) to change as a function of covariates.
Here we will explore tools that allow the entire density/distribution to change as a function of covariates.
Let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be regression data, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$. It is assumed that
$$y_i \mid x_i \overset{ind.}{\sim} f(\cdot \mid x_i), \quad i = 1, \ldots, n.$$
We specify a probability model for the entire collection of densities $\mathcal{F} = \{f(\cdot \mid x) : x \in \mathcal{X}\}$.
Further, one possibility is to model the conditional density using covariate-dependent mixtures of normal models
$$f(y \mid x) = \int \phi(y \mid \mu, \sigma^2) \, dG_x(\mu, \sigma^2).$$
The probability model for the conditional densities is induced by specifying a prior for the collection of mixing distributions
$$G_{\mathcal{X}} = \{G_x : x \in \mathcal{X}\} \sim \mathcal{G}.$$
$G_x$ is the random mixing distribution at covariate $x$ and $\mathcal{G}$ is the prior for the collection $G_{\mathcal{X}}$.
DDP prior
One possibility for $\mathcal{G}$ is the dependent DP (DDP) proposed by MacEachern (1999, 2000), which is built upon the constructive definition of the DP.
In its full generality, the DDP is specified as follows:
$$G_x(\cdot) = \sum_{k=1}^{\infty} \omega_{k,x} \, \delta_{\theta_{k,x}}(\cdot), \qquad \omega_{1,x} = v_{1,x}, \quad \omega_{k,x} = v_{k,x} \prod_{l<k} (1 - v_{l,x}).$$
Here
$\theta_{1,x}, \theta_{2,x}, \ldots$ are realizations of a stochastic process (e.g., a Gaussian process) over $\mathcal{X}$.
$v_{1,x}, v_{2,x}, \ldots$ are realizations from a stochastic process on $\mathcal{X}$ such that $v_{k,x} \sim \text{Beta}(1, \alpha_x)$.
Because of complications involved in allowing the weights to depend on covariates, the 'single weights' DDP, which assumes fixed weights, is commonly used.
Following De Iorio et al. (2009), a possibility for $G_x$ is
$$G_x(\cdot) = \sum_{k=1}^{\infty} \omega_k \, \delta_{\theta_{k,x}}(\cdot),$$
where the weights match those from a standard DP and $\theta_{k,x} = (\mu_{k,x}, \sigma_k^2)$, with $\mu_{k,x} = x^T \beta_k$.
Thus, under this formulation, the base stochastic processes are replaced with a base distribution $G_0$ that generates the component-specific regression coefficients and variances.
DPM of Gaussian regression models
Thus, the conditional density can be represented as a DP mixture of Gaussian regression models
$$f(y \mid x) = \int \phi(y \mid x^T \beta, \sigma^2) \, dG(\beta, \sigma^2), \quad G \sim DP(\alpha, G_0). \quad (15)$$
For example, $G_0$ could be $N_p(\mu_\beta, S_\beta) \, IG(a_{\sigma^2}, b_{\sigma^2})$.
This model is known as the linear dependent Dirichlet process (LDDP) (De Iorio et al., 2004, 2009).
Note that Eq. (15) can be equivalently written as
$$f(y \mid x) = \sum_{k=1}^{\infty} \omega_k \, \phi(y \mid x^T \beta_k, \sigma_k^2).$$
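For a finite number of components (for instance, after truncation), the mixture form of the conditional density is straightforward to evaluate. The sketch below uses hypothetical weights, coefficients, and variances, and assumes that $x$ includes a leading 1 for the intercept; the function name `lddp_density` is also a hypothetical choice.

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def lddp_density(y, x, weights, betas, variances):
    """f(y | x) = sum_k w_k * phi(y | x^T beta_k, sigma_k^2), finite set of components."""
    total = 0.0
    for w, beta, var in zip(weights, betas, variances):
        mean = sum(xj * bj for xj, bj in zip(x, beta))  # x^T beta_k
        total += w * normal_pdf(y, mean, var)
    return total

# Hypothetical two-component example; x = (1, covariate) includes an intercept
weights = [0.5, 0.5]
betas = [(-3.5, -1.5), (2.0, 3.5)]
variances = [1.0, 1.25 ** 2]
x = (1.0, 0.4)
grid = [-12.0 + 0.01 * i for i in range(2401)]
# Riemann-sum check that the conditional density integrates to about one
area = sum(lddp_density(y, x, weights, betas, variances) for y in grid) * 0.01
print(round(area, 4))
```

Changing `x` shifts each component mean linearly, so the shape of the whole conditional density, not just its location, can change with the covariate.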
DPM of Gaussian regression models - spline based version
The LDDP, although flexible, does not allow for nonlinear effects of the covariates.
An alternative is to replace $\mu_{k,x} = x^T \beta_k$ with an additive formulation based on B-splines, namely
$$\mu_{k,x} = \beta_{k0} + \sum_{r=1}^{p} \sum_{s=1}^{L_r} \beta_{krs} \Psi_s(x_r, d_r),$$
where $\Psi_s(x, d)$ corresponds to the $s$th B-spline basis function of degree $d$ evaluated at $x$. As in the density estimation setup, posterior inference can be conducted using a blocked Gibbs sampler or a collapsed Gibbs sampler.
Example
Data-generating mechanism: $y_i \mid x_i \sim 0.5 \, \phi(y_i \mid -3.5 - 1.5 x_i, 1^2) + 0.5 \, \phi(y_i \mid 2 + 3.5 x_i, 1.25^2)$, $x_i \sim U(0, 1)$, for $i = 1, \ldots, 500$. Results from the fit of an LDDP.
[Figure: three panels, x = 0.2, x = 0.4, and x = 0.7; x-axis: y, y-axis: Density.]
PT prior
We now focus on another popular nonparametric prior for density/distribution estimation.
The discussion closely follows Branscum, Johnson, and Baron (2013).
Polya tree priors have been discussed as early as Freedman (1963), Fabius (1964), and Ferguson (1974).
However, the natural starting point for understanding their potential use in modeling data is Lavine (1992, 1994), while Hanson (2006) considered some computational details.
Polya trees are far less popular than DPs, but they are also a very powerful tool.
Suppose that a random sample $y_1, \ldots, y_n$ is obtained from an unknown/random distribution $F$.
Polya tree priors, like Dirichlet process priors, place a distribution on a collection of distributions.
A Polya tree for a distribution $F$ is constructed by dividing the sample space into finer and finer disjoint sets using successive binary partitioning.
For instance, the first partition splits the sample space into two non-overlapping intervals.
In the second partition, those two intervals are each split, yielding a finer partition that contains four intervals.
Then, these four intervals are each split to give an eight-interval third-level partition of the sample space.
At level $j$ of the tree, the sample space is partitioned into $2^j$ intervals, $j = 1, \ldots$
Let $B_{j,k}$ denote the $k$th interval at level $j$ of the tree, for $j = 1, \ldots$, and $k = 1, \ldots, 2^j$.
Observe that the partitions are nested within one another, starting at the top of the tree and working down.
For example, by definition, $B_{1,1} = B_{2,1} \cup B_{2,2}$ and $B_{j-1,1} = B_{j,1} \cup B_{j,2}$, etc.
These intervals can be thought of as the bins of a histogram.
Finite PT prior
For a full tree the splitting continues ad infinitum.
However, in practice, we truncate to a fixed $J$, hence the term finite Polya tree.
Setting $J$ equal to 4, 5, or 6 often works well in practice.
Another option is to select $J$ so that roughly $J = \log_2 n$ (Hanson, 2006).
Informally speaking, the unknown distribution $F$ assigns the data points $y_i$ to the intervals at level $J$ of the tree, and the task is to use the observed distribution of the data across the intervals to estimate $F$.
Although all levels of the tree are important for the purpose of estimating $F$, of primary importance is level $J$.
The goal here is to produce a data-driven estimate of $F$ that assigns high probability to intervals that contain lots of data, assigns low probability to empty intervals, and assigns midrange probability to intervals that contain some (but not a lot) of the data.
Let us consider first the simplest case, $J = 1$.
Then data are assigned to either $B_{1,1}$ or $B_{1,2}$.
The probability of a data point $y_i$ being assigned to $B_{1,1}$ is $F(B_{1,1}) = \Pr(y_i \in B_{1,1})$, which, since $F$ is a cdf and $B_{1,1}$ is an interval of the form $(L, U)$, is to be interpreted as
$$F(B_{1,1}) = F(U) - F(L).$$
Denote this unknown probability by $\pi_{1,1}$.
Then, by the complement rule, $\pi_{1,2} = 1 - \pi_{1,1}$ is the probability assigned to set $B_{1,2}$. In notation, $\Pr(y_i \in B_{1,1}) = F(B_{1,1}) = \pi_{1,1}$ and $\Pr(y_i \in B_{1,2}) = F(B_{1,2}) = \pi_{1,2}$. Since $F$ is unknown, $\pi_{1,1}$ and $\pi_{1,2}$ are also unknown.
To help interpret these parameters, suppose the data arise from a right-skewed distribution (much more data in $B_{1,1}$ than in $B_{1,2}$); then $\pi_{1,1}$ would be large and hence $\pi_{1,2}$ would be small, and vice versa for left-skewed data.
Obviously, in practice, $J = 1$ is never used because it would lead to a crude estimate of the density function $f$, much like estimating a density using a relative frequency histogram that contains only two bins.
Let us now consider, again for simplicity, the case of $J = 2$. We now have four sets.
Data assignment is based on whether the data point was in $B_{1,1}$ or $B_{1,2}$ at the previous level $j = 1$.
If $y_i \in B_{1,1}$, then $y_i$ is assigned to interval $B_{2,1}$ with unknown probability $\pi_{2,1}$ or to interval $B_{2,2}$ with probability $\pi_{2,2} = 1 - \pi_{2,1}$.
Similarly, define $\pi_{2,3}$ and $\pi_{2,4}$ $(= 1 - \pi_{2,3})$ to be the probability of $y_i$ being assigned to set $B_{2,3}$ or $B_{2,4}$, respectively, given that $y_i$ was in $B_{1,2}$.
The $\pi_{j,k}$s are conditional parameters, since $\pi_{2,1} = \Pr(y_i \in B_{2,1} \mid y_i \in B_{1,1})$ and $\pi_{2,3} = \Pr(y_i \in B_{2,3} \mid y_i \in B_{1,2})$.
To relate the $\pi_{j,k}$s to $F$ we must determine the marginal probability of assignment to the various intervals at level $J$ $(= 2)$.
Observe that interval $B_{2,1}$ is nested in interval $B_{1,1}$, so the marginal probability of interval $B_{2,1}$ is
$$F(B_{2,1}) = \Pr(y_i \in B_{2,1}) = \Pr(y_i \in B_{2,1} \cap B_{1,1}) = \Pr(y_i \in B_{2,1} \mid y_i \in B_{1,1}) \Pr(y_i \in B_{1,1}) = \pi_{2,1} \pi_{1,1}.$$
Similar steps lead to $F(B_{2,2}) = (1 - \pi_{2,1}) \pi_{1,1}$, $F(B_{2,3}) = \pi_{2,3} \pi_{1,2}$, and $F(B_{2,4}) = (1 - \pi_{2,3}) \pi_{1,2}$.
Suppose again that $F$ is right skewed. Then the data will estimate $\pi_{1,1}$ to be large, and they will estimate $\pi_{2,1}$ to be large since most of the $n$ data points will be assigned to set $B_{2,1}$. Therefore, the estimate of $F(B_{2,1})$ will be (relatively) large.
Notice that level 1 has only one unique parameter, $\pi_{1,1}$, associated with it because $\pi_{1,2}$ is completely determined by $\pi_{1,1}$.
Similarly, level 2 has two unique parameters, $\pi_{2,1}$ and $\pi_{2,3}$, associated with it.
We can continue the partitioning to any level $J$.
For $J = 3$, we add eight conditional probability parameters, $\pi_{3,1}, \pi_{3,2}, \ldots, \pi_{3,8}$, but only four of these are unique.
For instance, $\pi_{3,1}$ is the probability that $y_i$ is in interval $B_{3,1}$ given that it is in interval $B_{2,1}$, and
$$F(B_{3,1}) = \Pr(y_i \in B_{3,1}) = \Pr(y_i \in B_{3,1} \cap B_{2,1}) = \Pr(y_i \in B_{3,1} \mid y_i \in B_{2,1}) \Pr(y_i \in B_{2,1}) = \pi_{3,1} \pi_{2,1} \pi_{1,1}.$$
In general, we have
$$F(B_{j,k}) = \prod_{l=1}^{j} \pi_{l, \text{Int}\{(k-1) 2^{l-j} + 1\}}, \quad j = 1, \ldots, J, \quad k = 1, \ldots, 2^j.$$
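The product formula can be checked numerically on a small tree. In the sketch below the conditional probabilities are stored in a dictionary keyed by (level, position); the helper names `interval_probability` and `complete`, and the numerical values, are illustrative assumptions.

```python
def interval_probability(pi, j, k):
    """F(B_{j,k}) as the product of conditional probabilities along the path to B_{j,k}.

    pi[(l, m)] holds pi_{l,m}; levels and positions are 1-based as in the text.
    """
    prob = 1.0
    for l in range(1, j + 1):
        # Ancestor of B_{j,k} at level l: Int((k - 1) * 2^(l - j) + 1)
        m = int((k - 1) * 2.0 ** (l - j) + 1)
        prob *= pi[(l, m)]
    return prob

def complete(pi_odd):
    """Fill in the even-indexed probabilities as 1 - pi_{l, m-1}."""
    pi = dict(pi_odd)
    for (l, m), value in pi_odd.items():
        pi[(l, m + 1)] = 1.0 - value
    return pi

# Illustrative values for a two-level tree
pi = complete({(1, 1): 0.45, (2, 1): 0.4, (2, 3): 0.6})
probs = [interval_probability(pi, 2, k) for k in range(1, 5)]
print(probs)
```

The four level-2 probabilities sum to one, as they must, because each pair of sibling conditional probabilities sums to one.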
The key point is that if we can estimate all of the $\pi_{j,k}$s, then we can estimate the probability that $F$ allocates to each interval at level $J$.
So far, we have modeled the probability of assignment to each set at level $J$, but we have not modeled how probability mass is distributed within each interval at level $J$.
For instance, all the $y_i$s can be clumped together in the center of the interval.
Alternatively, the data could be uniformly distributed, clumped to the right or left side of the interval, or have any other dispersion pattern within each interval.
To address this issue, we model the data according to how a user-specified parametric distribution $F_0$ allocates probability mass within the intervals at level $J$.
So, as was the case for the DP (or DPM), with finite Polya trees the user also needs to specify a probability distribution (and we will see in a few slides that $F_0$ is also a centering distribution).
The distribution $F_0$ is also used to determine the lower and upper endpoints of all intervals in the tree.
The median of $F_0$ is used to split the sample space into two intervals at level 1 of the tree.
The quartiles of $F_0$ define cut points for intervals at level 2.
Writing the 25th percentile as $F_0^{-1}(1/4)$, the median as $F_0^{-1}(2/4)$, and the 75th percentile as $F_0^{-1}(3/4)$, we have for a sample space that covers the real line
$$B_{2,1} = \left(-\infty, F_0^{-1}(1/4)\right), \quad B_{2,2} = \left(F_0^{-1}(1/4), F_0^{-1}(2/4)\right),$$
$$B_{2,3} = \left(F_0^{-1}(2/4), F_0^{-1}(3/4)\right), \quad B_{2,4} = \left(F_0^{-1}(3/4), \infty\right).$$
In general, the $(j,k)$th interval is
$$B_{j,k} = \left(F_0^{-1}\left(\frac{k-1}{2^j}\right), F_0^{-1}\left(\frac{k}{2^j}\right)\right), \quad j = 1, \ldots, J, \quad k = 1, \ldots, 2^j.$$
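The interval endpoints are simply quantiles of $F_0$. As a small sketch, assuming a standard normal $F_0$ (the function name `partition_endpoints` and the default centering distribution are hypothetical choices):

```python
import math
from statistics import NormalDist

def partition_endpoints(j, f0=NormalDist(0.0, 1.0)):
    """Endpoints F0^{-1}(k / 2^j), k = 0, ..., 2^j, of the level-j intervals."""
    cuts = [-math.inf]
    for k in range(1, 2 ** j):
        cuts.append(f0.inv_cdf(k / 2 ** j))
    cuts.append(math.inf)
    return cuts

cuts = partition_endpoints(2)
intervals = list(zip(cuts[:-1], cuts[1:]))  # B_{2,1}, ..., B_{2,4}
print(intervals)
```

For the standard normal the level-2 cut points are the quartiles, roughly -0.674, 0, and 0.674.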
The collection $\Pi = \{\pi_{j,k} : j = 1, \ldots, J, k = 1, \ldots, 2^j\}$ constitutes the unknown parameters corresponding to $F$.
The pairs of probabilities in $\Pi$ are assumed mutually independent. That is, for instance, $(\pi_{2,1}, \pi_{2,2})$ and $(\pi_{2,3}, \pi_{2,4})$ are independent.
We thus need to specify a prior distribution over this collection.
Because $\pi_{j,k} = 1 - \pi_{j,k-1}$ when $k$ is an even number between 2 and $2^j$, priors are needed only on $\pi_{j,k}$ for odd $k$.
Since the $\pi_{j,k}$s are probabilities, it is standard to use independent beta priors, specifically
$$\pi_{j,2k-1} \sim \text{Beta}(c\rho(j), c\rho(j)), \quad j = 1, \ldots, J, \quad k = 1, \ldots, 2^{j-1}.$$
In most applications $\rho(j) = j^2$, as this guarantees an absolutely continuous $F$ (in an infinite tree) (Ferguson, 1974).
Before proceeding to further considerations about the role of $c$, let us note that under this parametrization
$$E[F(B_{j,k})] = \frac{1}{2^j} = F_0(B_{j,k}).$$
Thus, $F_0$ is the prior expectation of the unknown distribution function $F$.
$F_0$ is usually selected based on our best prior assessment of the data-generating distribution $F$.
The parameter $c > 0$, also referred to as the weight parameter, acts much like the precision parameter $\alpha$ in the Dirichlet process.
As in the Dirichlet process, large values of $c$ lead to realizations of $F$ that are close to $F_0$. A very low value of $c$ (e.g., $c = 0.1$) will often lead to an estimate of $F$ that is similar to the empirical cdf. Usually $c = 1$ works well in practice.
Just like $\alpha$, $c$ can be treated as random by placing a prior on it.
Once we have selected $J$, $F_0$, and $c$, we have all the elements needed to specify a finite Polya tree for $F$.
The formula for the density function $f$ (our interest here) is given by
$$f(y) = 2^J p(B_{J,k(y)}) f_0(y). \quad (16)$$
Here $k(y) \in \{1, \ldots, 2^J\}$ identifies the interval at level $J$ containing $y$, and $p(B_{J,k(y)})$ is the probability of that interval (the product of $J$ of the $\pi_{j,k}$s) (Hanson, 2006).
The interval that contains $y$ at level $J$ can be determined using the formula $k(y) = \text{Int}(2^J F_0(y) + 1)$.
Note that the density at stage $J$ is just the product of a weighting factor, $2^J p(B_{J,k(y)})$, and the original user-specified parametric density $f_0$.
Prior or posterior distributions that focus high probability on regions around $\pi_{j,k} = 0.5$ for all $j$ and $k$ will behave very much like the $f_0$ density.
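Formula (16) translates directly into code. The sketch below assumes a standard normal $F_0$ and stores the conditional probabilities in a dictionary keyed by (level, position); the function name `fpt_density` and this storage layout are hypothetical choices. It also illustrates the last remark: with every $\pi_{j,k} = 0.5$, the density coincides with $f_0$.

```python
from statistics import NormalDist

def fpt_density(y, pi, J, f0=NormalDist(0.0, 1.0)):
    """f(y) = 2^J * p(B_{J,k(y)}) * f0(y), with k(y) = Int(2^J F0(y) + 1)."""
    # min() guards against F0(y) rounding to exactly 1 far in the right tail
    k = min(int(2 ** J * f0.cdf(y) + 1), 2 ** J)
    prob = 1.0
    for l in range(1, J + 1):
        m = int((k - 1) * 2.0 ** (l - J) + 1)  # level-l set containing y
        prob *= pi[(l, m)]
    return 2 ** J * prob * f0.pdf(y)

# With every conditional probability equal to 0.5 the density reduces to f0
J = 3
pi_half = {(l, m): 0.5 for l in range(1, J + 1) for m in range(1, 2 ** l + 1)}
print(fpt_density(0.3, pi_half, J), NormalDist().pdf(0.3))

# A one-level tree that puts probability 0.8 below the median of F0
pi_skew = {(1, 1): 0.8, (1, 2): 0.2}
print(fpt_density(-1.0, pi_skew, 1), fpt_density(1.0, pi_skew, 1))
```

The second example shows the jump at the median that a single-level tree produces: the same $f_0$ height is scaled by 1.6 on the left and by 0.4 on the right.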
Given $\Pi$, $F$ is known.
So, in order to compute posterior estimates of $F$ $(f)$ we only need to know how to update the $\pi_{j,k}$s.
Fortunately, just like the DP, Polya trees also enjoy a simple conjugacy result.
Specifically, if
$$y_1, \ldots, y_n \mid F \overset{iid}{\sim} F, \quad F \sim FPT_J(c, F_0),$$
then $F \mid y$ is updated through
$$\pi_{j,2k-1} \mid y \overset{ind.}{\sim} \text{Beta}\left( c j^2 + \sum_{i=1}^{n} I(y_i \in B_{j,2k-1}), \; c j^2 + \sum_{i=1}^{n} I(y_i \in B_{j,2k}) \right), \quad (17)$$
for $j = 1, \ldots, J$ and $k = 1, \ldots, 2^{j-1}$.
In words, we update the Beta parameters by counting the number of observations that fall in each set of each level of the tree.
That is, $F \mid y$ is a PT with Beta parameters updated through (17).
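The counting update in (17) can be sketched as follows, assuming a fixed standard normal $F_0$ and $\rho(j) = j^2$. The function names `posterior_beta_parameters` and `sample_pi`, and the toy data, are hypothetical choices made for the illustration.

```python
import random
from statistics import NormalDist

def posterior_beta_parameters(y, J, c, f0=NormalDist(0.0, 1.0)):
    """Return {(j, 2k-1): (a, b)} with the updated Beta parameters from (17)."""
    params = {}
    for j in range(1, J + 1):
        counts = [0] * (2 ** j)
        for yi in y:
            # 0-based index of the level-j interval containing yi
            idx = min(int(2 ** j * f0.cdf(yi)), 2 ** j - 1)
            counts[idx] += 1
        for k in range(1, 2 ** (j - 1) + 1):
            # counts for B_{j,2k-1} and B_{j,2k}
            left, right = counts[2 * k - 2], counts[2 * k - 1]
            params[(j, 2 * k - 1)] = (c * j ** 2 + left, c * j ** 2 + right)
    return params

def sample_pi(params, rng):
    """Draw each odd-indexed pi from its Beta posterior; the even one is 1 - pi."""
    pi = {}
    for (j, m), (a, b) in params.items():
        pi[(j, m)] = rng.betavariate(a, b)
        pi[(j, m + 1)] = 1.0 - pi[(j, m)]
    return pi

y = [-1.2, -0.3, 0.1, 0.4, 2.0]
params = posterior_beta_parameters(y, J=2, c=1.0)
pi = sample_pi(params, random.Random(0))
print(params)
```

With these five observations, two fall below the median and three above, so the level-1 parameters move from (1, 1) to (3, 4); the level-2 prior parameters start at $c \cdot 2^2 = 4$.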
Finite PT prior: example
[Figure: four panels, (a) to (d); x-axis ticks at µ − 2σ, µ, and µ + 2σ.]
FPT density estimates considering $F_0 = N(\mu, \sigma^2)$ and $J = 3$. (a) $N(\mu, \sigma^2)$ density. (b) $j = 1$; $\pi_{1,1} = 0.45$. (c) $j = 2$; $\pi_{2,1} = 0.4$, $\pi_{2,3} = 0.6$. (d) $j = 3$; $\pi_{3,1} = 0.3$, $\pi_{3,3} = 0.3$, $\pi_{3,5} = 0.6$, $\pi_{3,7} = 0.3$.
All densities in the previous figure are too jagged, which turns out to be the result of using a fixed $F_0$.
In fact, one of the major criticisms of Polya trees is that, unlike the DP, inferences are somewhat sensitive to the choice of a fixed partition.
A remedy is to place a prior distribution on the parameters of $F_0$, say $\theta$; we denote the resulting centering distribution as $F_{0,\theta}$ to emphasize the dependence on $\theta$.
A prior on $\theta$ implies that the starting points and endpoints of the sets of the tree are uncertain/random.
This has the effect of smoothing out the abrupt jumps at these points that are noticeable in the previous figure.
In fact, panel (d) of the previous figure also shows the estimate obtained by considering $\theta = (\mu, \sigma^2)$ as random (dashed line).
So, the final model is
$$y_1, \ldots, y_n \mid F \overset{iid}{\sim} F, \quad F \mid c, \theta \sim FPT_J(c, F_{0,\theta}), \quad \theta \sim p(\theta),$$
and it is known as a mixture of finite Polya trees.
It can be alternatively written as
$$F \sim \int FPT_J(c, F_{0,\theta}) \, p(\theta) \, d\theta.$$
The formula for the density function $f$ is identical to that given in (16).
To conduct posterior inference we will now need to know how to sample $\theta \mid y, \Pi$ (we already know how to sample $\Pi \mid y, \theta$).
We will make this concrete by considering the particular case of $F_{0,\theta} = N(\mu, \sigma^2)$.
We center the random $F$ at $F_{0,\theta} = N(\mu, \sigma^2)$, where $\theta = (\mu, \sigma^2)$.
Using (16), the likelihood $L(\Pi, \theta; y)$ is
$$\prod_{i=1}^{n} 2^J \phi(y_i \mid \mu, \sigma^2) \, p(k_\theta(J, y_i)).$$
We write $p(k_\theta(J, y_i))$ instead of $p(B_{J,k(y_i)})$ to alleviate notation and to make clear the dependence on $\theta$.
As in Branscum et al. (2008), we assume $\mu \sim N(a_\mu, b_\mu^2)$ and $\sigma \sim \Gamma(a_\sigma, b_\sigma)$.
Assuming further that $\theta$ and $\Pi$ are a priori independent, the joint posterior density is proportional to
$$p(\theta, \Pi \mid y) \propto L(\Pi, \theta; y) \, p(\theta) \, p(\Pi).$$
The full conditionals for $\mu$ and $\sigma$ are not recognizable as belonging to a parametric family; thus, these parameters are updated through Metropolis–Hastings steps.
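MCMC
Algorithm
1 $\mu$ is updated by sampling $\mu^* \sim N(\mu, s_1)$, which is accepted with probability
$$\min\left\{1, \; \frac{\exp\{-0.5 \, b_\mu^{-2} (\mu^* - a_\mu)^2\}}{\exp\{-0.5 \, b_\mu^{-2} (\mu - a_\mu)^2\}} \cdot \frac{\prod_{i=1}^{n} p(k_{\mu^*,\sigma}(J, y_i))}{\prod_{i=1}^{n} p(k_{\mu,\sigma}(J, y_i))} \times \frac{\exp\{-0.5 \, \sigma^{-2} \sum_{i=1}^{n} (y_i - \mu^*)^2\}}{\exp\{-0.5 \, \sigma^{-2} \sum_{i=1}^{n} (y_i - \mu)^2\}} \right\}.$$
Here $s_1$ is a tuning parameter that needs to be calibrated to achieve a desirable acceptance rate.
2 $\sigma$ is updated by sampling $\sigma^* \sim \Gamma(\sigma s_2, s_2)$, which is accepted with probability
$$\min\left\{1, \; \frac{f_\Gamma(\sigma^*; a_\sigma, b_\sigma)}{f_\Gamma(\sigma; a_\sigma, b_\sigma)} \cdot \frac{\prod_{i=1}^{n} p(k_{\mu,\sigma^*}(J, y_i))}{\prod_{i=1}^{n} p(k_{\mu,\sigma}(J, y_i))} \times \frac{\sigma^{n} \exp\{-0.5 \, (\sigma^*)^{-2} \sum_{i=1}^{n} (y_i - \mu)^2\}}{(\sigma^*)^{n} \exp\{-0.5 \, \sigma^{-2} \sum_{i=1}^{n} (y_i - \mu)^2\}} \cdot \frac{f_\Gamma(\sigma; \sigma^* s_2, s_2)}{f_\Gamma(\sigma^*; \sigma s_2, s_2)} \right\},$$
where $s_2$ has the same meaning as $s_1$.
3 Use (17) to update $\pi_{j,2k-1}$, for $k = 1, \ldots, 2^{j-1}$ and $j = 1, \ldots, J$.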
Example
Data generated from $0.5 \, \phi(y \mid -3, 1) + 0.5 \, \phi(y \mid 3, 1)$, $n = 500$.
[Figure: six panels, J = 3, 4, 5, 6, 7, and 8; x-axis: y, y-axis: Density.]
Example
True data-generating mechanism:
$0.3 \, \phi_{SN}(y \mid \mu = -2, \sigma^2 = 1.5^2, \lambda = 6) + 0.7 \, \phi_{SN}(y \mid \mu = 2, \sigma^2 = 1.5^2, \lambda = -3)$. $J = 4$ was considered.
[Figure: two panels, n = 150 and n = 500; x-axis: y, y-axis: Density.]
Example
True data-generating mechanism: $\phi_{SN}(y \mid \mu = 0, \sigma^2 = 1, \lambda = 10)$. MFPT estimate (solid blue line) against DPM estimate (dashed red line).
[Figure: two panels, n = 150 and n = 500; x-axis: y, y-axis: Density.]
A Polya tree assigns beta distributions to the conditional probabilities $(\pi_{j+1,2k-1}, \pi_{j+1,2k})$.
To accommodate covariates, in the spirit of density regression, Jara and Hanson (2011) proposed modeling these probabilities through logistic regression.
Specifically, given covariates $x$, the probabilities $(\pi_{j+1,2k-1}, \pi_{j+1,2k})$ are defined through
$$\log\left(\frac{\pi_{j+1,2k-1}}{\pi_{j+1,2k}}\right) = x^T \tau_{j,k}.$$
The resulting model is known as the linear dependent tail-free process. For further details, see Jara and Hanson (2011).
The function LDTFPdensity in DPpackage implements this model.
Throughout this presentation we have focused on Dirichlet process mixtures and mixtures of finite Polya trees.
We did not mean to be exhaustive. The aim was to provide, as the name says, an introduction.
Other popular Bayesian nonparametric models include:
Gaussian processes, Bernstein polynomials,
splines, wavelets, neural networks, etc.
I am happily in debt to...