MCMC

Posterior inference can be conducted using (at least) two different kinds of MCMC strategies:

1 Employing a truncation of the stick-breaking representation (Ishwaran and James, 2001, 2002).

2 Using a marginal/collapsed Gibbs sampler in which the mixing distribution is integrated out of the model (MacEachern 1998, Neal 2000).

Throughout, approach 1 will be detailed.

Blocked Gibbs sampler

The blocked Gibbs sampler relies on truncating the stick-breaking representation to a finite number of components.

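Hence

$$G_N(\cdot) = \sum_{k=1}^{N} \omega_k\, \delta_{(\mu_k, \sigma_k^2)}(\cdot).$$

The atoms $(\mu_k, \sigma_k^2)$ are iid from $G_0$, i.e.,

$$(\mu_k, \sigma_k^2) \overset{iid}{\sim} N(a_\mu, b_\mu^2)\, IG(a_{\sigma^2}, b_{\sigma^2}), \quad k = 1, \ldots, N.$$

The weights arise through a truncated stick-breaking construction:

$$\omega_1 = v_1, \qquad \omega_k = v_k \prod_{l<k} (1 - v_l) \ \text{ for } k \geq 2, \qquad v_k \overset{iid}{\sim} \text{Beta}(1, \alpha), \quad k = 1, \ldots, N-1,$$

$$v_N = 1, \qquad \omega_N = \prod_{l=1}^{N-1} (1 - v_l).$$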
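The truncated stick-breaking construction can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `truncated_stick_breaking` and the seed are hypothetical choices, not part of the original slides.

```python
import random

def truncated_stick_breaking(alpha, N, rng):
    """Return weights (omega_1, ..., omega_N) from a truncated stick-breaking construction."""
    # v_k ~ Beta(1, alpha) for k = 1, ..., N-1, and v_N = 1 so the weights sum to one.
    v = [rng.betavariate(1.0, alpha) for _ in range(N - 1)] + [1.0]
    weights = []
    remaining = 1.0  # length of stick not yet broken off: prod_{l<k} (1 - v_l)
    for v_k in v:
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return weights

rng = random.Random(2016)  # arbitrary seed, for reproducibility only
omega = truncated_stick_breaking(alpha=1.0, N=20, rng=rng)
print(len(omega), sum(omega))
```

Setting the last input to one is what forces the final weight to absorb all the remaining stick, so the N weights always form a proper probability vector.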

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 48 / 95

Blocked Gibbs sampler

Ishwaran and James (2001) showed the following bound for the truncation error:

$$4\left[1 - E\left\{\left(\sum_{k=1}^{N-1} \omega_k\right)^{n}\right\}\right] \approx 4n \exp\left(-\frac{N-1}{\alpha}\right).$$

For instance, if $n = 500$ and we use a truncation level of $N = 20$, then for $\alpha = 1$ we get a bound of $\approx 1.1 \times 10^{-5}$.

In practice, $N = 20$ or $N = 50$ is commonly chosen as a default.
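As a quick numerical check of the approximate bound, the sketch below evaluates $4n\exp\{-(N-1)/\alpha\}$ for the values quoted above. The function name `truncation_bound` is a hypothetical helper introduced for illustration.

```python
import math

def truncation_bound(n, N, alpha):
    """Approximate truncation error bound 4 * n * exp(-(N - 1) / alpha)."""
    return 4.0 * n * math.exp(-(N - 1) / alpha)

bound = truncation_bound(n=500, N=20, alpha=1.0)
print(f"{bound:.2e}")  # prints 1.12e-05
```

Because the bound decays exponentially in $N$ but only at rate $1/\alpha$, larger values of $\alpha$ call for a larger truncation level.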

Blocked Gibbs sampler

Using the truncated version $G_N$ of $G$, the normal mixture density can be expressed as

$$f(y) = \sum_{k=1}^{N} \omega_k\, \phi(y \mid \mu_k, \sigma_k^2),$$

with $\omega_k$ generated from the truncated stick-breaking representation, whereas $\mu_k \overset{iid}{\sim} N(a_\mu, b_\mu^2)$ and $\sigma_k^2 \overset{iid}{\sim} IG(a_{\sigma^2}, b_{\sigma^2})$.

As was the case for the finite mixture model, derivation of the full conditionals for Gibbs sampling involves the data-augmented likelihood.

The means, variances, and latent component indicators are sampled in an identical manner to the finite mixture model.

The main difference is that, unlike in the finite mixture model, uncertainty in the component weights is shifted to $v$, the inputs into the construction of the stick-breaking weights.


Blocked Gibbs sampler

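The full conditional distributions are

$$\mu_k \mid \text{else} \sim N\left(\frac{a_\mu / b_\mu^2 + \sum_{i: z_i = k} y_i / \sigma_k^2}{1/b_\mu^2 + n_k/\sigma_k^2},\ \frac{1}{1/b_\mu^2 + n_k/\sigma_k^2}\right), \tag{11}$$

$$\sigma_k^2 \mid \text{else} \sim IG\left(a_{\sigma^2} + n_k/2,\ b_{\sigma^2} + \sum_{i: z_i = k} (y_i - \mu_k)^2 / 2\right), \tag{12}$$

for $k = 1, \ldots, N$.

For $i = 1, \ldots, n$, the full conditional distribution for $z_i$ is $z_i \mid \text{else} \sim \text{Mult}(p_i)$, with $p_i = (p_{i1}, \ldots, p_{iN})$ and

$$p_{ik} = \frac{\omega_k\, \phi(y_i \mid \mu_k, \sigma_k^2)}{\sum_{l=1}^{N} \omega_l\, \phi(y_i \mid \mu_l, \sigma_l^2)}, \quad k = 1, \ldots, N.$$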

Blocked Gibbs sampler

For $k = 1, \ldots, N-1$, update the inputs of the stick-breaking weights from

$$v_k \mid \text{else} \sim \text{Beta}\left(n_k + 1,\ \alpha + \sum_{l=k+1}^{N} n_l\right). \tag{13}$$

Regarding the precision parameter $\alpha$, it can be fixed at a small value; for instance, $\alpha = 1$ is widely used in applications.

Alternatively, one can place a prior on $\alpha$ and allow the data to inform about the appropriate value of $\alpha$.

Letting $\alpha \sim \text{Gamma}(a_\alpha, b_\alpha)$, the resulting full conditional for $\alpha$ is

$$\alpha \mid \text{else} \sim \text{Gamma}\left(a_\alpha + N - 1,\ b_\alpha - \sum_{k=1}^{N-1} \log(1 - v_k)\right). \tag{14}$$


Set initial values for $\alpha$, $\omega$, and $\theta$, say, $\alpha^{(0)}$, $\omega^{(0)}$, and $\theta^{(0)}$. For $t = 1, \ldots, T$, do:

1 For $i = 1, \ldots, n$ and $k = 1, \ldots, N$, compute the posterior probabilities of membership using equation (6).

2 For $i = 1, \ldots, n$, sample the latent component indicator, $z_i^{(t)} \sim \text{Mult}(p_i^{(t)})$.

3 Simulate the stick-breaking inputs $v^{(t)}$ from equation (13).

4 Given $v^{(t)}$, compute $\omega^{(t)}$ using the stick-breaking construction.

5 Conditional on $z^{(t)}$, update $\mu_k^{(t)}$ and $(\sigma_k^{(t)})^2$, for $k = 1, \ldots, N$, from (11) and (12), respectively.

6 Update the precision parameter $\alpha$ from (14).
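The six steps can be put together in a compact sketch. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `blocked_gibbs`, the hyperparameter defaults, the initial values, and the simulated data are all hypothetical choices, and only the final state of the chain is returned.

```python
import math
import random

def blocked_gibbs(y, N=20, T=100, a_mu=0.0, b2_mu=100.0, a_s=2.0, b_s=1.0,
                  a_alpha=1.0, b_alpha=1.0, seed=1):
    """Run T sweeps of a blocked Gibbs sampler for a truncated DPM of normals."""
    rng = random.Random(seed)
    n = len(y)
    # Initial values (arbitrary illustrative choices).
    alpha, omega = 1.0, [1.0 / N] * N
    mu, sig2 = [rng.choice(y) for _ in range(N)], [1.0] * N
    z = [0] * n
    for _ in range(T):
        # Steps 1-2: membership probabilities (computed on the log scale), then z_i.
        for i, yi in enumerate(y):
            logp = [math.log(omega[k]) - 0.5 * math.log(sig2[k])
                    - 0.5 * (yi - mu[k]) ** 2 / sig2[k] for k in range(N)]
            top = max(logp)
            p = [math.exp(lp - top) for lp in logp]  # unnormalized; choices() normalizes
            z[i] = rng.choices(range(N), weights=p)[0]
        counts = [z.count(k) for k in range(N)]
        # Step 3: stick-breaking inputs from (13); v_N is fixed at 1.
        v, tail = [], n
        for k in range(N - 1):
            tail -= counts[k]  # observations allocated to components k+1, ..., N
            draw = rng.betavariate(counts[k] + 1.0, alpha + tail)
            v.append(min(max(draw, 1e-12), 1.0 - 1e-12))  # numerical guard only
        v.append(1.0)
        # Step 4: weights from the stick-breaking construction.
        remaining = 1.0
        for k in range(N):
            omega[k] = v[k] * remaining
            remaining *= 1.0 - v[k]
        # Step 5: component means from (11) and variances from (12).
        for k in range(N):
            yk = [yi for yi, zi in zip(y, z) if zi == k]
            prec = 1.0 / b2_mu + len(yk) / sig2[k]
            mean = (a_mu / b2_mu + sum(yk) / sig2[k]) / prec
            mu[k] = rng.gauss(mean, math.sqrt(1.0 / prec))
            shape = a_s + len(yk) / 2.0
            rate = b_s + sum((yi - mu[k]) ** 2 for yi in yk) / 2.0
            sig2[k] = rate / rng.gammavariate(shape, 1.0)  # inverse-gamma draw
        # Step 6: precision parameter from (14); gammavariate takes a scale, not a rate.
        log_sum = sum(math.log(1.0 - vk) for vk in v[:-1])
        alpha = rng.gammavariate(a_alpha + N - 1, 1.0 / (b_alpha - log_sum))
    return omega, mu, sig2, alpha, z

# Simulated two-component data, purely for illustration.
data_rng = random.Random(0)
y = [data_rng.gauss(-3.0, 1.0) if data_rng.random() < 0.5 else data_rng.gauss(3.0, 1.0)
     for _ in range(200)]
omega, mu, sig2, alpha, z = blocked_gibbs(y, N=20, T=100)
print("occupied components:", sorted(set(z)))
```

In practice one would store the draws from every sweep after a burn-in period and average the implied mixture densities, rather than keep only the last state as done here for brevity.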

Blocked Gibbs sampler

A valid concern with the blocked Gibbs sampler is that by truncating the stick-breaking representation we are effectively fitting a finite (and hence parametric) mixture model.

Quoting Dunson (2011):

“For example, if we let N = 25 as a truncation level, a natural question is how this is better or intrinsically different than fitting a finite mixture model with 25 components. One answer is that N is not the number of components occupied by the subjects in your sample but is instead an upper bound on the number of components.”


Collapsed Gibbs sampler/ Polya urn scheme

The function DPdensity from the R package DPpackage can also be used to fit a DPM of normals.

The MCMC scheme behind this function is a marginalized/collapsed Gibbs sampler.

The collapsed Gibbs sampler avoids specifying a truncation level by marginalizing out $G$ and relying on the Polya urn scheme of Blackwell and MacQueen (1973).

Letting

$$y_i \sim \phi(\theta_i), \qquad \theta_i = (\mu_i, \sigma_i^2) \sim G, \qquad G \sim DP(\alpha, G_0),$$

and marginalizing out $G$, we obtain the Polya urn predictive rule

$$p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}) \propto \frac{\alpha}{\alpha + i - 1}\, G_0(\theta_i) + \frac{1}{\alpha + i - 1} \sum_{j=1}^{i-1} \delta_{\theta_j}(\theta_i).$$

The Polya urn rule forms the basis of the collapsed Gibbs sampler. For those interested in further details, see Escobar and West (1995), MacEachern (1998), and Neal (2000).
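To make the predictive rule concrete, the following sketch simulates $\theta_1, \ldots, \theta_n$ sequentially from the Polya urn. The function name `polya_urn_draws` and the particular base measure (a normal for the mean and an inverse-gamma for the variance, with made-up hyperparameters) are assumptions for illustration only.

```python
import random

def polya_urn_draws(n, alpha, base_draw, rng):
    """Simulate theta_1, ..., theta_n from the Polya urn predictive rule."""
    thetas = []
    for i in range(n):  # i earlier values exist when the next one is drawn
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_draw(rng))      # new value from G0
        else:
            thetas.append(rng.choice(thetas))  # copy an earlier value, each with equal weight
    return thetas

def base_draw(rng):
    # Hypothetical G0: mu ~ N(0, 3^2) and sigma^2 ~ IG(2, 1).
    return (rng.gauss(0.0, 3.0), 1.0 / rng.gammavariate(2.0, 1.0))

rng = random.Random(1973)
thetas = polya_urn_draws(n=100, alpha=1.0, base_draw=base_draw, rng=rng)
print("distinct values among 100 draws:", len(set(thetas)))
```

The ties among the simulated values are what induce clustering: observations sharing the same $\theta$ belong to the same mixture component.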

Example

Same data from the finite mixture example. DPM fit (left) and comparison against the fit of a three-component mixture model, which corresponds to the true data generating mechanism (right).

[Figure: two density plots; horizontal axis from −10 to 10; vertical axis: Density, 0 to 0.15.]


Example

True data generating mechanism: $\phi_{SN}(y \mid \mu = 0, \sigma^2 = 1, \lambda = 8)$.

[Figure: two density plots, one for n = 150 and one for n = 500; horizontal axis from −0.5 to 3.0; vertical axis: Density, 0 to 1.0.]

Example

True data generating mechanism: $\phi(y \mid 0, 1)$. DPM fit (blue) against normal fit (red).

[Figure: two density plots, one for n = 150 and one for n = 500; horizontal axis from −3 to 3; vertical axis: Density.]


Censored responses

The DPM can easily handle censored responses (left, right, or interval).

Remember that under the hierarchical formulation

$$y_i \mid \mu, \sigma^2, z \sim \phi(y_i \mid \mu_{z_i}, \sigma_{z_i}^2).$$

For instance, if $y_i$ is right censored, we know that its true value, say $y_i^*$, is greater than $y_i$, that is, $y_i^* > y_i$.

We can take care of these censored observations by simply adding an extra step in the blocked Gibbs algorithm.

In fact, we can simulate those observations from

$$y_i^* \mid y_i, z_i, \mu, \sigma^2 \sim \phi(y_i^* \mid \mu_{z_i}, \sigma_{z_i}^2)\, I(y_i^* > y_i).$$

This can be accomplished by simulating $y_i^*$ from a truncated normal distribution with lower limit equal to $y_i$.
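One simple way to draw from the truncated normal is to invert the cdf, sketched below with the standard library. The function name `sample_right_censored` is hypothetical, and the inverse-cdf approach is an assumption of this sketch; it loses accuracy when the censoring point lies far in the upper tail of the component, where a dedicated truncated-normal sampler would be preferable.

```python
import random
from statistics import NormalDist

def sample_right_censored(y_cens, mu, sigma, rng):
    """Draw y* from N(mu, sigma^2) restricted to y* > y_cens by inverting the cdf."""
    dist = NormalDist(mu, sigma)
    lower = dist.cdf(y_cens)                  # probability mass below the censoring point
    u = lower + rng.random() * (1.0 - lower)  # uniform on [lower, 1)
    u = min(max(u, 1e-12), 1.0 - 1e-12)       # keep u strictly inside (0, 1) for inv_cdf
    return dist.inv_cdf(u)

rng = random.Random(7)
draws = [sample_right_censored(1.0, 0.0, 1.0, rng) for _ in range(5000)]
print(min(draws), sum(draws) / len(draws))
```

Within the blocked Gibbs sampler, `mu` and `sigma` would be the mean and standard deviation of the component currently allocated to the censored observation.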

Censored responses

DPM fit (blue line) against Kaplan–Meier fit (black line). The censored observations are represented by crosses.

[Figure: survival curves; horizontal axis from 0 to 6; vertical axis: Survival, 0 to 1.0.]


Density regression

So far we have focused on the problem of density estimation.

We will now move to the problem of density regression.

Traditional regression models allow just one or two characteristics (e.g., mean and/or variance) to change as a function of covariates.

Here we will explore tools that allow the entire density/distribution to change as a function of covariates.

Density regression

Let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be regression data, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$. It is assumed that

$$y_i \mid x_i \overset{ind.}{\sim} f(\cdot \mid x_i), \quad i = 1, \ldots, n.$$

We specify a probability model for the entire collection of densities $\mathcal{F} = \{f(\cdot \mid x) : x \in \mathcal{X}\}$.

Further, one possibility is to model the conditional density using covariate-dependent mixtures of normal models

$$f(y \mid x) = \int \phi(y \mid \mu, \sigma^2)\, dG_x(\mu, \sigma^2).$$

The probability model for the conditional densities is induced by specifying a prior for the collection of mixing distributions

$$G_{\mathcal{X}} = \{G_x : x \in \mathcal{X}\} \sim \mathcal{G}.$$

$G_x$ is the random mixing distribution at covariate $x$ and $\mathcal{G}$ is the prior for the collection $G_{\mathcal{X}}$.


DDP prior

One possibility for $\mathcal{G}$ is the dependent DP (DDP) proposed by MacEachern (1999, 2000), which is built upon the constructive definition of the DP.

In its full generality, the DDP is specified as follows:

$$G_x(\cdot) = \sum_{k=1}^{\infty} \omega_{k,x}\, \delta_{\theta_{k,x}}(\cdot), \qquad \omega_{1,x} = v_{1,x}, \qquad \omega_{k,x} = v_{k,x} \prod_{l<k} (1 - v_{l,x}).$$

Here

$\theta_{1,x}, \theta_{2,x}, \ldots$ are realizations of a stochastic process (e.g., a Gaussian process) over $\mathcal{X}$.

$v_{1,x}, v_{2,x}, \ldots$ are realizations from a stochastic process on $\mathcal{X}$ such that $v_{k,x} \sim \text{Beta}(1, \alpha_x)$.

DDP prior

Because of complications involved in allowing the weights to depend on covariates, the ‘single weights’ DDP, which assumes fixed weights, is commonly used.

Following De Iorio et al. (2009), a possibility for $G_x$ is

$$G_x(\cdot) = \sum_{k=1}^{\infty} \omega_k\, \delta_{\theta_{k,x}}(\cdot),$$

where the weights match those from a standard DP and $\theta_{k,x} = (\mu_{k,x}, \sigma_k^2)$, with $\mu_{k,x} = x^T \beta_k$.

Thus, under this formulation, the base stochastic processes are replaced with a base distribution $G_0$ that generates the component-specific regression coefficients and variances.


DPM of Gaussian regression models

The conditional density can therefore be represented as a DP mixture of Gaussian regression models

$$f(y \mid x) = \int \phi(y \mid x^T \beta, \sigma^2)\, dG(\beta, \sigma^2), \qquad G \sim DP(\alpha, G_0). \tag{15}$$

For example, $G_0$ could be $N_p(\mu_\beta, S_\beta)\, IG(a_{\sigma^2}, b_{\sigma^2})$.

This model is known as the linear dependent Dirichlet process (LDDP) (De Iorio et al. 2004, 2009).

Note that Eq. (15) can be equivalently written as

$$f(y \mid x) = \sum_{k=1}^{\infty} \omega_k\, \phi(y \mid x^T \beta_k, \sigma_k^2).$$
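The mixture form of the conditional density is straightforward to evaluate once weights, regression coefficients, and standard deviations are available. The sketch below uses a finite sum in place of the infinite one and plugs in a two-component mixture of linear regressions as illustrative values; the function name `mixture_regression_density` and all numerical values are assumptions of this example.

```python
from statistics import NormalDist

def mixture_regression_density(y, x, weights, betas, sigmas):
    """Evaluate sum_k omega_k * phi(y | x^T beta_k, sigma_k^2) for a finite number of terms."""
    total = 0.0
    for w, beta, s in zip(weights, betas, sigmas):
        mean = sum(b * xj for b, xj in zip(beta, x))  # linear predictor x^T beta_k
        total += w * NormalDist(mean, s).pdf(y)
    return total

# Illustrative values: two components with covariate-dependent means.
weights = [0.5, 0.5]
betas = [(-3.5, -1.5), (2.0, 3.5)]  # (intercept, slope) for each component
sigmas = [1.0, 1.25]
x = (1.0, 0.2)                      # leading 1 for the intercept, then the covariate

grid = [-12.0 + 0.01 * i for i in range(2401)]
area = sum(mixture_regression_density(g, x, weights, betas, sigmas) for g in grid) * 0.01
print(area)  # Riemann-sum check that the conditional density integrates to about one
```

Changing `x` shifts the component means while the weights stay fixed, which is exactly the ‘single weights’ structure of the LDDP.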

DPM of Gaussian regression models - spline based version

The LDDP, although flexible, does not allow for nonlinear effects of the covariates.

An alternative is to replace $\mu_{k,x} = x^T \beta_k$ with an additive formulation based on B-splines, namely

$$\mu_{k,x} = \beta_{k0} + \sum_{r=1}^{p} \left( \sum_{s=1}^{L_r} \beta_{krs}\, \Psi_s(x_r, d_r) \right),$$

where $\Psi_s(x, d)$ corresponds to the $s$th B-spline basis function of degree $d$ evaluated at $x$.

As in the density estimation setup, posterior inference can be conducted using a blocked Gibbs sampler or a collapsed Gibbs sampler.


Example

Data generating mechanism: $y_i \mid x_i \sim 0.5\, \phi(y_i \mid -3.5 - 1.5 x_i, 1^2) + 0.5\, \phi(y_i \mid 2 + 3.5 x_i, 1.25^2)$, $x_i \sim U(0, 1)$, for $i = 1, \ldots, 500$. Results from the fit of an LDDP.

[Figure: three conditional density plots, for x = 0.2, x = 0.4, and x = 0.7; horizontal axis from −5 to 5; vertical axis: Density.]

PT prior

We now focus on another popular nonparametric prior for density/distribution estimation.

The discussion closely follows Branscum, Johnson, and Baron (2013).

Polya tree priors have been discussed as early as Freedman (1963), Fabius (1964), and Ferguson (1974).

However, the natural starting point for understanding their potential use in modeling data is Lavine (1992, 1994), while Hanson (2006) considered some computational details.

Polya trees are far less popular than DPs, but they are a very powerful tool as well.


PT prior

Suppose that a random sample $y_1, \ldots, y_n$ is obtained from an unknown/random distribution $F$.

Polya tree priors, like Dirichlet process priors, place a distribution on a collection of distributions.

A Polya tree for a distribution $F$ is constructed by dividing the sample space into finer and finer disjoint sets using successive binary partitioning.

For instance, the first partition splits the sample space into two non-overlapping intervals.

In the second partition, those two intervals are each split, yielding a finer partition that contains four intervals.

Then, these four intervals are each split to give an eight-interval third-level partition of the sample space.

At level $j$ of the tree, the sample space is partitioned into $2^j$ intervals, $j = 1, 2, \ldots$

PT prior

Let $B_{j,k}$ denote the $k$th interval at level $j$ of the tree, for $j = 1, 2, \ldots$ and $k = 1, \ldots, 2^j$.

Observe that the partitions are nested within one another, starting at the top of the tree and working down.

For example, by definition, $B_{1,1} = B_{2,1} \cup B_{2,2}$ and $B_{j-1,1} = B_{j,1} \cup B_{j,2}$, etc.

These intervals can be thought of as the bins of a histogram.


Finite PT prior

For a full tree the splitting continues ad infinitum.

However, in practice, we truncate to a fixed $J$, hence the term finite Polya tree.

Setting $J$ equal to 4, 5, or 6 often works well in practice.

Another option is to select $J$ so that roughly $J = \log_2 n$ (Hanson, 2006).

Finite PT prior

Informally speaking, the unknown distribution $F$ assigns the data points $y_i$ to the intervals at level $J$ of the tree, and the task is to use the observed distribution of the data across the intervals to estimate $F$.

Although all levels of the tree are important for the purpose of estimating $F$, of primary importance is level $J$.

The goal here is to produce a data-driven estimate ofFthat assigns high probability to intervals that contain lots of data, assigns low probability to empty intervals, and assigns midrange probability to intervals that contain some (but not a lot) of the data.


Finite PT prior

Let us consider first the simplest case, $J = 1$.

Then data are assigned to either $B_{1,1}$ or $B_{1,2}$.

The probability of a data point $y_i$ being assigned to $B_{1,1}$ is $F(B_{1,1}) = \Pr(y_i \in B_{1,1})$, which, since $F$ is a cdf and $B_{1,1}$ is an interval of the form $(L, U)$, is to be interpreted as

$$F(B_{1,1}) = F(U) - F(L).$$

Denote this unknown probability by $\pi_{1,1}$.

Then, by the complement rule, $\pi_{1,2} = 1 - \pi_{1,1}$ is the probability assigned to set $B_{1,2}$. In notation, $\Pr(y_i \in B_{1,1}) = F(B_{1,1}) = \pi_{1,1}$ and $\Pr(y_i \in B_{1,2}) = F(B_{1,2}) = \pi_{1,2}$. Since $F$ is unknown, $\pi_{1,1}$ and $\pi_{1,2}$ are also unknown.

Finite PT prior

To help interpret these parameters, suppose the data arise from a right-skewed distribution (much more data in $B_{1,1}$ than in $B_{1,2}$); then $\pi_{1,1}$ would be large and hence $\pi_{1,2}$ would be small, and vice versa for left-skewed data.

Obviously, in practice, $J = 1$ is never used because it would lead to a crude estimate of the density function $f$, much like estimating a density using a relative frequency histogram that contains only two bins.


Finite PT prior

Let us now consider, again for simplicity, the case of $J = 2$. We now have four sets.

Data assignment is based on whether the data point was in $B_{1,1}$ or $B_{1,2}$ at the previous level $j = 1$.

If $y_i \in B_{1,1}$, then $y_i$ is assigned to interval $B_{2,1}$ with unknown probability $\pi_{2,1}$ or to interval $B_{2,2}$ with probability $\pi_{2,2} = 1 - \pi_{2,1}$.

Similarly, define $\pi_{2,3}$ and $\pi_{2,4}$ $(= 1 - \pi_{2,3})$ to be the probabilities of $y_i$ being assigned to set $B_{2,3}$ or $B_{2,4}$, respectively, given that $y_i$ was in $B_{1,2}$.

The $\pi_{j,k}$'s are conditional probabilities, since $\pi_{2,1} = \Pr(y_i \in B_{2,1} \mid y_i \in B_{1,1})$ and $\pi_{2,3} = \Pr(y_i \in B_{2,3} \mid y_i \in B_{1,2})$.

To relate the $\pi_{j,k}$'s to $F$, we must determine the marginal probability of assignment to the various intervals at level $J$ $(= 2)$.

Finite PT prior

Observe that interval $B_{2,1}$ is nested within interval $B_{1,1}$, so the marginal probability of interval $B_{2,1}$ is

$$\begin{aligned}
F(B_{2,1}) &= \Pr(y_i \in B_{2,1}) \\
&= \Pr(y_i \in B_{2,1} \cap B_{1,1}) \\
&= \Pr(y_i \in B_{2,1} \mid y_i \in B_{1,1}) \Pr(y_i \in B_{1,1}) \\
&= \pi_{2,1}\, \pi_{1,1}.
\end{aligned}$$

Similar steps lead to $F(B_{2,2}) = (1 - \pi_{2,1}) \pi_{1,1}$, $F(B_{2,3}) = \pi_{2,3} \pi_{1,2}$, and $F(B_{2,4}) = (1 - \pi_{2,3}) \pi_{1,2}$.

Suppose again that $F$ is right skewed. Then the data will estimate $\pi_{1,1}$ to be large, and they will estimate $\pi_{2,1}$ to be large since most of the $n$ data points will be assigned to set $B_{2,1}$. Therefore, the estimate of $F(B_{2,1})$ will be (relatively) large.


Finite PT prior

Notice that level 1 has only one unique parameter, $\pi_{1,1}$, associated with it because $\pi_{1,2}$ is completely determined by $\pi_{1,1}$.

Similarly, level 2 has two unique parameters, $\pi_{2,1}$ and $\pi_{2,3}$, associated with it.

We can continue the partitioning to any level $J$.

For $J = 3$, we add eight conditional probability parameters, $\pi_{3,1}, \pi_{3,2}, \ldots, \pi_{3,8}$, but only four of these are unique.

For instance, $\pi_{3,1}$ is the probability that $y_i$ is in interval $B_{3,1}$ given that it is in interval $B_{2,1}$, and

$$\begin{aligned}
F(B_{3,1}) &= \Pr(y_i \in B_{3,1}) \\
&= \Pr(y_i \in B_{3,1} \cap B_{2,1}) \\
&= \Pr(y_i \in B_{3,1} \mid y_i \in B_{2,1}) \Pr(y_i \in B_{2,1}) \\
&= \pi_{3,1}\, \pi_{2,1}\, \pi_{1,1}.
\end{aligned}$$

In general, we have

$$F(B_{j,k}) = \prod_{l=1}^{j} \pi_{l,\, \text{Int}\{(k-1) 2^{l-j} + 1\}}, \quad j = 1, \ldots, J, \quad k = 1, \ldots, 2^j.$$
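The general product formula can be checked numerically. In the sketch below, the conditional probabilities are stored in a dictionary keyed by (level, index); the helper names `complete` and `interval_probability` and the numerical values are illustrative assumptions.

```python
def complete(odd_probs):
    """Fill in pi_{j,2k} = 1 - pi_{j,2k-1} from the odd-indexed probabilities."""
    pi = {}
    for (j, k), value in odd_probs.items():
        pi[(j, k)] = value
        pi[(j, k + 1)] = 1.0 - value
    return pi

def interval_probability(pi, j, k):
    """F(B_{j,k}) as the product of conditional probabilities along the path to B_{j,k}."""
    prob = 1.0
    for l in range(1, j + 1):
        # Level-l ancestor of B_{j,k}: Int{(k - 1) 2^(l - j) + 1}.
        ancestor = (k - 1) // 2 ** (j - l) + 1
        prob *= pi[(l, ancestor)]
    return prob

# Illustrative values for a tree with J = 3.
pi = complete({(1, 1): 0.45,
               (2, 1): 0.4, (2, 3): 0.6,
               (3, 1): 0.3, (3, 3): 0.3, (3, 5): 0.6, (3, 7): 0.3})
level3 = [interval_probability(pi, 3, k) for k in range(1, 9)]
print(level3[0], sum(level3))
```

Here `level3[0]` is the product 0.3 × 0.4 × 0.45, matching the chain of conditional probabilities for $B_{3,1}$, and the eight level-3 probabilities sum to one.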

Finite PT prior

The key point is that if we can estimate all of the $\pi_{j,k}$'s, then we can estimate the probability that $F$ allocates to each interval at level $J$.

So far, we have modeled the probability of assignment to each set at level $J$, but we have not modeled how probability mass is distributed within each interval at level $J$.

For instance, all the $y_i$'s could be clumped together in the center of the interval.

Alternatively, the data could be uniformly distributed, clumped to the right or left side of the interval, or have any other dispersion pattern within each interval.

To address this issue, we model the data according to how a user-specified parametric distribution $F_0$ allocates probability mass within the intervals at level $J$.

So, as with the DP (or DPM), with finite Polya trees the user also needs to specify a probability distribution (and we will see in a few slides that $F_0$ is also a centering distribution).


Finite PT prior

The distribution $F_0$ is also used to determine the lower and upper endpoints of all intervals in the tree.

The median of $F_0$ is used to split the sample space into two intervals at level 1 of the tree.

The quartiles of $F_0$ define the cut points for the intervals at level 2.

Writing the 25th percentile as $F_0^{-1}(1/4)$, the median as $F_0^{-1}(2/4)$, and the 75th percentile as $F_0^{-1}(3/4)$, we have, for a sample space that covers the real line,

$$B_{2,1} = \left(-\infty, F_0^{-1}(1/4)\right), \qquad B_{2,2} = \left(F_0^{-1}(1/4), F_0^{-1}(2/4)\right),$$

$$B_{2,3} = \left(F_0^{-1}(2/4), F_0^{-1}(3/4)\right), \qquad B_{2,4} = \left(F_0^{-1}(3/4), \infty\right).$$

In general, the $(j,k)$th interval is

$$B_{j,k} = \left(F_0^{-1}\left(\frac{k-1}{2^j}\right),\ F_0^{-1}\left(\frac{k}{2^j}\right)\right), \quad j = 1, \ldots, J, \quad k = 1, \ldots, 2^j.$$
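The interval endpoints are just quantiles of the centering distribution, which the following sketch computes for a standard normal $F_0$. The function name `interval_endpoints` and the choice of $F_0$ are assumptions made for illustration.

```python
from statistics import NormalDist

def interval_endpoints(F0, j, k):
    """Endpoints of B_{j,k}: the (k-1)/2^j and k/2^j quantiles of F0."""
    lo = float("-inf") if k == 1 else F0.inv_cdf((k - 1) / 2 ** j)
    hi = float("inf") if k == 2 ** j else F0.inv_cdf(k / 2 ** j)
    return lo, hi

F0 = NormalDist(0.0, 1.0)  # illustrative centering distribution
level2 = [interval_endpoints(F0, 2, k) for k in range(1, 5)]
for k, (lo, hi) in enumerate(level2, start=1):
    print(f"B_2,{k} = ({lo:.4f}, {hi:.4f})")
```

By construction every interval at level $j$ has probability $1/2^j$ under $F_0$, whatever the shape of $F_0$.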

Finite PT prior

The collection $\Pi = \{\pi_{j,k} : j = 1, \ldots, J,\ k = 1, \ldots, 2^j\}$ constitutes the unknown parameters corresponding to $F$.

The probabilities in $\Pi$ are assumed mutually independent. That is, for instance, $(\pi_{2,1}, \pi_{2,2})$ and $(\pi_{2,3}, \pi_{2,4})$ are independent.

We thus need to specify a prior distribution over this collection.

Because $\pi_{j,k} = 1 - \pi_{j,k-1}$ when $k$ is an even number between 2 and $2^j$, priors are needed only on $\pi_{j,k}$ when $k$ is odd.

Since the $\pi_{j,k}$'s are probabilities, it is standard to use independent beta priors, specifically

$$\pi_{j,2k-1} \sim \text{Beta}(c\rho(j), c\rho(j)), \quad j = 1, \ldots, J, \quad k = 1, \ldots, 2^{j-1}.$$

In most applications $\rho(j) = j^2$, as this guarantees an absolutely continuous $F$ (in an infinite tree) (Ferguson, 1974).


Finite PT prior

Before proceeding to further considerations about the role of $c$, let us note that under this parametrization

$$E[F(B_{j,k})] = \frac{1}{2^j} = F_0(B_{j,k}).$$

Thus, $F_0$ is the prior expectation of the unknown distribution function $F$.

$F_0$ is usually selected based on our best prior assessment of the data-generating distribution $F$.

The parameter $c > 0$, also referred to as the weight parameter, acts much like the precision parameter $\alpha$ in the Dirichlet process.

As in the Dirichlet process, large values of $c$ lead to realizations of $F$ that are close to $F_0$. A very low value of $c$ (e.g., $c = 0.1$) will often lead to an estimate of $F$ that is similar to the empirical cdf. Usually $c = 1$ works well in practice.

Just like $\alpha$, $c$ can be regarded as random, with a prior placed on it.

Finite PT prior

Once we have selected $J$, $F_0$, and $c$, we have all the elements needed to specify a finite Polya tree for $F$.

The formula for the density function $f$ (our interest here) is given by

$$f(y) = 2^J\, p(B_{J,k(y)})\, f_0(y). \tag{16}$$

Here $k(y) \in \{1, \ldots, 2^J\}$ identifies the interval at level $J$ containing $y$ and $p(B_{J,k(y)})$ is the probability of that interval (the product of $J$ of the $\pi_{j,k}$'s) (Hanson, 2006).

The interval that contains $y$ at level $J$ can be determined using the formula $k(y) = \text{Int}(2^J F_0(y) + 1)$.

Note that the density at stage $J$ is just the product of a weighting factor $2^J p(B_{J,k(y)})$ and the original user-specified parametric density $f_0$.

Prior or posterior distributions that focus high probability on regions around $\pi_{j,k} = 0.5$ for all $j$ and $k$ will behave very much like the $f_0$ density.
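Formula (16) translates directly into code. The sketch below evaluates the finite Polya tree density for a given set of conditional probabilities and a normal centering distribution; the function names `complete` and `fpt_density`, the cap on $k(y)$ at $2^J$ (needed only when $F_0(y)$ rounds to one), and the numerical values are assumptions of this example.

```python
from statistics import NormalDist

def complete(odd_probs):
    """Fill in pi_{j,2k} = 1 - pi_{j,2k-1} from the odd-indexed probabilities."""
    pi = {}
    for (j, k), value in odd_probs.items():
        pi[(j, k)] = value
        pi[(j, k + 1)] = 1.0 - value
    return pi

def fpt_density(y, pi, J, F0):
    """Finite Polya tree density f(y) = 2^J * p(B_{J,k(y)}) * f0(y)."""
    k = min(int(2 ** J * F0.cdf(y) + 1), 2 ** J)  # k(y) = Int(2^J F0(y) + 1), capped at 2^J
    prob = 1.0
    for l in range(1, J + 1):
        prob *= pi[(l, (k - 1) // 2 ** (J - l) + 1)]  # conditional probability at level l
    return 2 ** J * prob * F0.pdf(y)

F0 = NormalDist(0.0, 1.0)
# Illustrative conditional probabilities for J = 3.
pi = complete({(1, 1): 0.45,
               (2, 1): 0.4, (2, 3): 0.6,
               (3, 1): 0.3, (3, 3): 0.3, (3, 5): 0.6, (3, 7): 0.3})
# A tree with every conditional probability equal to 0.5 reproduces f0 exactly.
flat = complete({(j, k): 0.5 for j in range(1, 4) for k in range(1, 2 ** j, 2)})
print(fpt_density(0.3, pi, 3, F0), fpt_density(0.3, flat, 3, F0), F0.pdf(0.3))
```

The second and third printed values coincide, which illustrates the remark that conditional probabilities near 0.5 make the tree behave like $f_0$.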


Finite PT prior

Given $\Pi$, $F$ is known.

So, in order to compute posterior estimates of $F$ (or $f$), we only need to know how to update the $\pi_{j,k}$'s.

Fortunately, just like the DP, Polya trees also enjoy a simple conjugacy result.

Specifically, if

$$y_1, \ldots, y_n \mid F \overset{iid}{\sim} F, \qquad F \sim FPT_J(c, F_0),$$

then $F \mid y$ is updated through

$$\pi_{j,2k-1} \mid y \overset{ind.}{\sim} \text{Beta}\left(cj^2 + \sum_{i=1}^{n} I(y_i \in B_{j,2k-1}),\ cj^2 + \sum_{i=1}^{n} I(y_i \in B_{j,2k})\right), \tag{17}$$

for $j = 1, \ldots, J$ and $k = 1, \ldots, 2^{j-1}$.

In words, we update the Beta parameters by counting the number of observations that fall in each set at each level of the tree.

That is, $F \mid y$ is a PT with Beta parameters updated through (17).
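The conjugate update amounts to counting observations in each interval. The sketch below computes the updated Beta parameters in (17) with $\rho(j) = j^2$ and then draws one posterior sample of the conditional probabilities; the function name `posterior_beta_parameters`, the simulated data, and the choice of $F_0$ are illustrative assumptions.

```python
import random
from statistics import NormalDist

def posterior_beta_parameters(y, J, c, F0):
    """Updated Beta parameters for pi_{j,2k-1}, j = 1..J, k = 1..2^(j-1), as in (17)."""
    counts = {}
    for yi in y:
        k_J = min(int(2 ** J * F0.cdf(yi) + 1), 2 ** J)  # level-J interval containing y_i
        for j in range(1, J + 1):
            k = (k_J - 1) // 2 ** (J - j) + 1            # its ancestor at level j
            counts[(j, k)] = counts.get((j, k), 0) + 1
    params = {}
    for j in range(1, J + 1):
        for k in range(1, 2 ** (j - 1) + 1):
            a = c * j ** 2 + counts.get((j, 2 * k - 1), 0)
            b = c * j ** 2 + counts.get((j, 2 * k), 0)
            params[(j, 2 * k - 1)] = (a, b)
    return params

rng = random.Random(17)
y = [rng.gauss(0.5, 1.0) for _ in range(200)]  # simulated data, for illustration
params = posterior_beta_parameters(y, J=3, c=1.0, F0=NormalDist(0.0, 1.0))
pi_draw = {key: rng.betavariate(a, b) for key, (a, b) in params.items()}
print(params[(1, 1)], pi_draw[(1, 1)])
```

Each even-indexed probability then follows as one minus its odd-indexed sibling, so a full posterior draw of $\Pi$ needs only these $2^J - 1$ Beta draws.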

Finite PT prior: example

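[Figure: four panels, (a) to (d); horizontal axis from µ − 2σ to µ + 2σ.]

FPT density estimates considering $F_0 = N(\mu, \sigma^2)$ and $J = 3$. (a) $N(\mu, \sigma^2)$ density. (b) $j = 1$; $\pi_{1,1} = 0.45$. (c) $j = 2$; $\pi_{2,1} = 0.4$, $\pi_{2,3} = 0.6$. (d) $j = 3$; $\pi_{3,1} = 0.3$, $\pi_{3,3} = 0.3$, $\pi_{3,5} = 0.6$, $\pi_{3,7} = 0.3$.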


All densities in the previous figure are too jagged, which turns out to be the result of using a fixed $F_0$.

In fact, one of the major criticisms of Polya trees is that, unlike the DP, inferences are somewhat sensitive to the choice of a fixed partition.

A remedy is to place a prior distribution on the parameters of $F_0$, say $\theta$. We denote the resulting centering distribution as $F_{0,\theta}$ to emphasize the dependence on $\theta$.

A prior on $\theta$ implies that the starting and end points of the sets of the tree are uncertain/random.

This has the effect of smoothing out the abrupt jumps at these points that are noticeable in the previous figure.

In fact, panel (d) of the previous figure also shows the estimate obtained by considering $\theta = (\mu, \sigma^2)$ as random (dashed line).

So, the final model is

$$y_1, \ldots, y_n \mid F \overset{iid}{\sim} F, \qquad F \mid c, \theta \sim FPT_J(F_{0,\theta}, c), \qquad \theta \sim p(\theta),$$

and it is known as a mixture of finite Polya trees.

It can be alternatively written as

$$F \sim \int FPT_J(F_{0,\theta}, c)\, p(\theta)\, d\theta.$$

The formula for the density function $f$ is identical to that given in (16).

To conduct posterior inference we will now need to know how to sample $\theta \mid y, \Pi$ (we already know how to sample $\Pi \mid y, \theta$).

We will make this concrete by considering the particular case of $F_{0,\theta} = N(\mu, \sigma^2)$.


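We center the random $F$ at $F_{0,\theta} = N(\mu, \sigma^2)$, where $\theta = (\mu, \sigma^2)$.

Using (16), the likelihood $L(\Pi, \theta; y)$ is

$$L(\Pi, \theta; y) = \prod_{i=1}^{n} 2^J\, \phi(y_i \mid \mu, \sigma^2)\, p(k_\theta(J, y_i)).$$

We write $p(k_\theta(J, y_i))$ instead of $p(B_{J,k(y_i)})$ to lighten the notation and to make clear the dependence on $\theta$.

As in Branscum et al. (2008), we assume $\mu \sim N(a_\mu, b_\mu^2)$ and $\sigma \sim \Gamma(a_\sigma, b_\sigma)$.

Assuming further that $\theta$ and $\Pi$ are a priori independent, the joint posterior density is proportional to

$$p(\theta, \Pi \mid y) \propto L(\Pi, \theta; y)\, p(\theta)\, p(\Pi).$$

The full conditionals for $\mu$ and $\sigma$ are not recognizable as belonging to a parametric family; thus these parameters are updated through Metropolis–Hastings steps.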

Algorithm

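1 $\mu$ is updated by sampling $\mu^* \sim N(\mu, s_1)$, which is accepted with probability

$$\min\left\{1,\ \frac{\exp\{-0.5\, b_\mu^{-2} (\mu^* - a_\mu)^2\}}{\exp\{-0.5\, b_\mu^{-2} (\mu - a_\mu)^2\}} \times \frac{\prod_{i=1}^{n} p(k_{\mu^*,\sigma}(J, y_i))}{\prod_{i=1}^{n} p(k_{\mu,\sigma}(J, y_i))} \times \frac{\exp\{-0.5\, \sigma^{-2} \sum_{i=1}^{n} (y_i - \mu^*)^2\}}{\exp\{-0.5\, \sigma^{-2} \sum_{i=1}^{n} (y_i - \mu)^2\}}\right\}.$$

Here $s_1$ is a tuning parameter that needs to be calibrated to achieve a desirable acceptance rate.

2 $\sigma$ is updated by sampling $\sigma^* \sim \Gamma(\sigma s_2, s_2)$, which is accepted with probability

$$\min\left\{1,\ \frac{f_\Gamma(\sigma^*; a_\sigma, b_\sigma)}{f_\Gamma(\sigma; a_\sigma, b_\sigma)} \times \frac{\prod_{i=1}^{n} p(k_{\mu,\sigma^*}(J, y_i))}{\prod_{i=1}^{n} p(k_{\mu,\sigma}(J, y_i))} \times \frac{\sigma^{n} \exp\{-0.5\, (\sigma^*)^{-2} \sum_{i=1}^{n} (y_i - \mu)^2\}}{(\sigma^*)^{n} \exp\{-0.5\, \sigma^{-2} \sum_{i=1}^{n} (y_i - \mu)^2\}} \times \frac{f_\Gamma(\sigma; \sigma^* s_2, s_2)}{f_\Gamma(\sigma^*; \sigma s_2, s_2)}\right\},$$

where $s_2$ has the same meaning as $s_1$.

3 Use (17) to update $\pi_{j,2k-1}$, for $k = 1, \ldots, 2^{j-1}$ and $j = 1, \ldots, J$.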
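Step 1 can be sketched as a random-walk Metropolis update carried out on the log scale. This is a minimal illustration under stated assumptions: the function names, the conditional probabilities in `pi`, the simulated data, and the reading of $s_1$ as the proposal variance are all choices made for this example rather than details confirmed by the slides.

```python
import math
import random
from statistics import NormalDist

def log_target_mu(mu, sigma, y, pi, J, a_mu, b_mu):
    """Log of prior(mu) * prod_i p(k_{mu,sigma}(J, y_i)) * normal kernel, up to a constant."""
    F0 = NormalDist(mu, sigma)
    total = -0.5 * (mu - a_mu) ** 2 / b_mu ** 2
    for yi in y:
        k_J = min(int(2 ** J * F0.cdf(yi) + 1), 2 ** J)  # level-J interval of y_i under F0
        for l in range(1, J + 1):
            total += math.log(pi[(l, (k_J - 1) // 2 ** (J - l) + 1)])
        total += -0.5 * (yi - mu) ** 2 / sigma ** 2
    return total

def update_mu(mu, sigma, y, pi, J, a_mu, b_mu, s1, rng):
    """One Metropolis step for mu; returns the new value and whether the proposal was accepted."""
    proposal = rng.gauss(mu, math.sqrt(s1))  # s1 is treated as the proposal variance (assumption)
    log_ratio = (log_target_mu(proposal, sigma, y, pi, J, a_mu, b_mu)
                 - log_target_mu(mu, sigma, y, pi, J, a_mu, b_mu))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return mu, False

# Illustrative conditional probabilities for J = 3 (even indices are complements).
odd = {(1, 1): 0.45, (2, 1): 0.4, (2, 3): 0.6,
       (3, 1): 0.3, (3, 3): 0.3, (3, 5): 0.6, (3, 7): 0.3}
pi = {}
for (j, k), value in odd.items():
    pi[(j, k)], pi[(j, k + 1)] = value, 1.0 - value

rng = random.Random(88)
y = [rng.gauss(2.0, 1.0) for _ in range(100)]  # simulated data, for illustration
mu, chain, accepted = 0.0, [], 0
for _ in range(500):
    mu, ok = update_mu(mu, 1.0, y, pi, 3, a_mu=0.0, b_mu=10.0, s1=0.05, rng=rng)
    chain.append(mu)
    accepted += ok
print("acceptance rate:", accepted / len(chain))
```

The symmetric normal proposal cancels in the acceptance ratio, which is why only the prior, the tree probabilities, and the normal kernel appear; the update for $\sigma$ would additionally need the ratio of the asymmetric gamma proposal densities.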


Example

Data generated from $0.5\, \phi(y \mid -3, 1) + 0.5\, \phi(y \mid 3, 1)$, $n = 500$.

[Figure: six density plots, for J = 3, 4, 5, 6, 7, and 8; horizontal axis from −6 to 6; vertical axis: Density, 0 to 0.30.]

Example

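True data generating mechanism: $0.3\, \phi_{SN}(y \mid \mu = -2, \sigma^2 = 1.5^2, \lambda = 6) + 0.7\, \phi_{SN}(y \mid \mu = 2, \sigma^2 = 1.5^2, \lambda = -3)$. $J = 4$ was considered.

[Figure: two density plots, one for n = 150 and one for n = 500; horizontal axis from −3 to 3; vertical axis: Density, 0 to 0.4.]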


Example

True data generating mechanism: $\phi_{SN}(y \mid \mu = 0, \sigma^2 = 1, \lambda = 10)$. MFPT estimate (solid blue line) against DPM estimate (dashed red line).

[Figure: two density plots, one for n = 150 and one for n = 500; horizontal axis from 0 to 2.5; vertical axis: Density, 0 to 1.0.]

A Polya tree defines the conditional probabilities $\pi_{j+1,2k-1}$, $\pi_{j+1,2k}$ through beta distributions.

To accommodate covariates, in the spirit of density regression, Jara and Hanson (2011) proposed modeling these probabilities through logistic regression.

Specifically, given covariates $x$, the probabilities $(\pi_{j+1,2k-1}, \pi_{j+1,2k})$ are defined through

$$\log\left(\frac{\pi_{j+1,2k-1}}{\pi_{j+1,2k}}\right) = x^T \tau_{j,k}.$$
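Since the two probabilities sum to one, the log-odds specification determines both of them. The sketch below inverts the logistic link for a given covariate vector; the function name `conditional_probabilities` and the coefficient values are hypothetical.

```python
import math

def conditional_probabilities(x, tau):
    """Return (pi_odd, pi_even) such that log(pi_odd / pi_even) = x^T tau."""
    eta = sum(xj * tj for xj, tj in zip(x, tau))  # linear predictor x^T tau_{j,k}
    p_odd = 1.0 / (1.0 + math.exp(-eta))          # inverse logit
    return p_odd, 1.0 - p_odd

x = (1.0, 0.4)     # leading 1 for the intercept, then one covariate
tau = (0.2, -1.0)  # illustrative coefficients
p_odd, p_even = conditional_probabilities(x, tau)
print(p_odd, p_even, math.log(p_odd / p_even))
```

With all coefficients equal to zero both probabilities are 0.5 for every $x$, so the tree collapses to its centering distribution regardless of the covariates.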

The resulting model is known as the linear dependent tailfree process. For further details, see Jara and Hanson (2011).

The function LDTFPdensity in DPpackage implements this model.


Throughout this presentation we have focused on Dirichlet process mixtures and mixtures of finite Polya trees.

We did not mean to be exhaustive. The aim was to provide, as the name says, an introduction.

Other popular Bayesian nonparametric models include:

Gaussian processes, Bernstein polynomials,

Splines/ wavelets/ neural networks, etc.

I am happily in debt to...


From the document Bayesian Nonparametrics: a Soft Introduction (pages 47-95)