• Nenhum resultado encontrado

Bayesian Nonparametrics: a Soft Introduction

N/A
N/A
Protected

Academic year: 2022

Share "Bayesian Nonparametrics: a Soft Introduction"

Copied!
95
0
0

Texto

(1)

Vanda Inácio

Pontificia Universidad Católica de Chile, Chile

XIII EBEB, February 22, 2016

(2)

1 Introduction/motivation

2 Finite mixture models

3 Dirichlet process mixture models

4 Mixtures of finite Polya trees

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 2 / 95

(3)
(4)

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 4 / 95

(5)

Hanson, Branscum and Johnson (2005). Bayesian nonparametric modeling and data analysis: an introduction. InHandbook of Statistics, Elsevier.

Hanson, and Jara, A (2013). Surviving fully Bayesian nonparametric regression models. In Bayesian Theory and Applications, Oxford University Press, UK.

Muller and Rodriguez (2013).Nonparametric Bayesian Inference. NSF-CBMS Regional Conference Series in Probability and Statistics, Volume 9.

http://projecteuclid.org/euclid.cbms/1362163742

Kottas and Rodriguez (2014). Unpublished lecture notes.

https://users.soe.ucsc.edu/ thanos/notes-1.pdf

(6)

Muller and Quintana (2004). Nonparametric Bayesian data analysis.Statistical Science 19, 95–110.

Muller and Mitra (2013). Bayesian nonparametric inference – why and how.Bayesian Analysis8, 269–302.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 6 / 95

(7)

DPpackage: Bayesian Semi- and Nonparametric Modeling in R

https://cran.r-project.org/web/packages/DPpackage/

Bayesian Regression: Nonparametric and Parametric Models

http://georgek.people.uic.edu/BayesSoftware.html

(8)

Motivation

The motivation is twofold:

1 Allowing for model flexibility.

2 Safeguarding against model misspecification.

Bayesian nonparametrics rely on parametric baseline models while allowing for data-driven deviations.

So that we can see if parametric models might actually fit by embedding them in nonparametric families (sensitivity analysis).

Because Bayesian nonparametric modeling is feasible due to modern MCMC methods.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 8 / 95

(9)

Motivation

The goal here is to provide a rich class of statistical models for data analysis.

Data distributions can easily be multimodal or have severe skewness.

The normal distribution is still widely used in practice.

But the normal family is symmetric and has only two parameters, one corresponding to location and the other corresponding to dispersion.

Common practice is to log transform the data. But if after the log, the data still have multimodality and/or skewness, the normal distribution would not be adequate.

Other parametric distributions are similarly constrained.

(10)

Motivation

We thus proceed to discuss broader families that allow for flexibility and robustness beyond what is achievable using parametric models.

This goal is accomplished by setting a particular parametric class and expanding it so that there are many more possibilities included.

In the parametric Bayesian approach, given a particular dataset, a family of models is selected.

In the normal case, this involves location and scale parameters.

We then select prior distributions for these two parameters. This induces a prior on the family of distributions.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 10 / 95

(11)

Motivation

In the nonparametric case, we do the same, except that, instead of having a small number of parameters that characterize the family of distributions for the data, conceptually there is an infinite number of parameters.

But since we live in a finite world, practically dictates that these models must be truncated to have a possibly large but finite number of parameters.

We call them richly parametric as a result.

Thus, nonparametric Bayesian modeling requires a prior distribution on the (potentially) infinite dimensional parameters.

Technically, this amounts to placing a prior distribution on the space of all distribution functions.

(12)

Motivation

We consider two approaches:

1 The first involves the specification of a Dirichlet process mixture model for the data distribution.

2 The second approach involves the specification of a mixture of finite Polya trees prior on the space of all distribution functions.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 12 / 95

(13)

Motivation

We start with finite mixture models since they, to a certain extent, provide the background for Dirichlet process mixture models.

The natural question is: why mixture models?

In a diversity of situations, the complexity of the observed data may render the use of a single parametric distribution insufficient for data modeling.

For example, in the context of medical tests, biomarker outcomes for some specific disease may consist of, at least, two subgroups, corresponding to mild diseased and severe diseased individuals.

(14)

Motivation

A mixture model assumes that data can be represented by a weighted sum of distributions, with each distribution representing a proportion of the data.

It is worth nothing that multimodality is not the sole motivation for the use of mixture models.

For instance, skew data can also be handled by mixtures.

Of course, for this latter case, one could use a skew distribution, but this would imply that we expect skewness in advance, while with the more general mixture model, we can handle this and other nonstandard features of the data without the need to know them in advance.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 14 / 95

(15)

Formulation

Throughout, we consider the particular case of mixtures of normal distributions.

General mixtures of normal densities can approximate any continuous density (e.g., Lo 1984).

We assume thus that the datay= (y1, . . . ,yn)follows a mixture of K normal distributions, whose density is given by

f(y|ω,θ) =

K

X

k=1

ωkφ(y|µk, σ2k), (1)

withω= (ω1, . . . , ωK)andθ= (µ1, . . . , µK, σ21, . . . , σ2K).

Hence, each data point would arise from one ofKmixture components, with each component having its own mean and variance.

(16)

Formulation

Model (1) is called a location-scale mixture of normals, since both the mean (location) and variance (scale) vary across components.

An alternative model would treat the variances across components as constants.

Generally, location-scale mixture of normals produce more accurate results and also correspond to more realistic representations of the data.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 16 / 95

(17)

Formulation

Model (1) can be equivalently written as f(y|ω,θ) =

Z

φ(y|µ, σ2)dG(µ, σ2).

Here G is a discrete finite mixing distribution given by

G(·) =

K

X

k=1

ωkδ

k2k)(·),

whereδadenotes a point mass ata.

The likelihood of the data is then

L(ω,θ;y) =

n

Y

i=1 K

X

k=1

ωkφ(yik, σk2),

which is not analytically tractable.

(18)

Data augmentation

This issue is overcomed by data augmentation.

To this end, consider the latent variablezi∈ {1, . . . ,K}withzi=kindicating that observationyicomes from componentk, i.e., from the normal component with meanµk

and varianceσ2k.

The mixture model can then be viewed hierarchically; the observed datayiis modeled conditionally onziandziis also given a probabilistic specification.

It can then be written

yi|ziind.∼ φ(yizi, σ2z

i), ziiid∼Mult(ω).

If we marginalize overziwe recover the original mixture formulation.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 18 / 95

(19)

Likelihood

By introducing,z= (z1, . . . ,zn)we obtain the complete/augmented data likelihood, which is given by

L(ω,θ;y,z) =

n

Y

i=1

ωziφ(yizi, σ2zi)

=

K

Y

k=1

Y

i:zi=k

ωkφ(yik, σ2k)

=

K

Y

k=1

ωnkk Y

i:zi=k

φ(yik, σk2). (2)

Herenk =Pn

i=1I(zi=k)counts the number of observations allocated to componentk.

(20)

Prior distributions

The model specification is completed by specifying prior distributions of the weights, means and variances

(ω,µ,σ2)∼p(ω,µ,σ2).

We consider prior independence and thus

p(ω,µ,σ2) =p(ω)p(µ)p(σ2).

We will make use of conjugate priors. Specifically, the weights are assigned a Dirichlet distribution

1, . . . , ωK)∼Dir(α1, . . . , αK).

In turn, the means and variances are assigned normal and inverse-gamma prior distributions, respectively. That is,

µk∼N(aµ,b2µ), σ2k ∼IG(aσ2,bσ2), k=1, . . . ,K.

Note that placing a prior on the collection({ωk},{µk, σ2k})is equivalent to placing a prior on the finite mixing distributionG

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 20 / 95

(21)

Posterior inference

By combining the likelihood in (2) with this prior specification, the joint posterior distribution is

p(ω,θ|y,z)∝L(ω,θ;y,z)p(ω)

K

Y

k=1

p(µk)p(σk2) (3) Although the posterior distribution in (3) does not have a recognizable form, all full conditional distributions have simple conjugate forms.

Specifically, for the mean and variance of the components we have

µk|else∼N aµ/b2µ+P

i:zi=kyi2k

1/bµ2+nk2k , 1 1/b2µ+nkk2

!

, (4)

σ2k|else∼IG

aσ2+nk/2,bσ2+ X

i:zi=k

(yi−µk)2/2

, (5)

fork=1, . . . ,K.

(22)

Posterior inference

Fori=1, . . . ,n, the full conditional distribution forziis zi|else∼Mult(pi).

Here,pi= (pi1, . . . ,piK), with

pik =Pr(zi=k|ω,θ,yi) = ωkφ(yik, σ2k) PK

l=1ωlφ(yil, σl2). (6)

Finally, the full conditional of the weights is given by

ω|else∼Dir(α1+n1, . . . , αK+nK).

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 22 / 95

(23)

Gibbs sampler algorithm

Algorithm

Set an initial value forω, andθ, say,ω(0), andθ(0). fort=1, . . . ,T ,do:

1 fori=1, . . . ,n, and k=1, . . . ,K compute posterior probabilities of membership using equation(6).

2 fori=1, . . . ,n, sample the latent component indicator, zi(t)∼Multi(p(t)i ).

3 Conditional onz(t), updateω,

ω(t)∼Dirichlet(α1+n(t)1 , . . . , αK+n(t)K ), n(t)1 =

n

X

i=1

zi(t).

4 Conditional onz(t), updateµ(t)k and(σ(t)k )2,fork=1, . . . ,K , from(4)and(5), respectively.

(24)

Identifiability issues

Due to identifiability issues, such as the so-called label switching problem (Jasra et al.

2005), it makes difference whether there is interest in making inferences about the mixture component-specific parameters and clustering.

The label switching problem (also known as label ambiguity) refers to the fact that there is nothing in the likelihood to distinguish mixture componentkfrom mixture componentk0. Permuting theKlabels in any ofK!ways results in the same model for the data.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 24 / 95

(25)

Identifiability issues

As a concrete example, in thek=2 case, consider

p1=0.3, µ1=1, p2=0.7, µ2=1.5, σ2122=1, (Scenario A).

Then, the model is equivalent to one with

p1=0.7, µ1=1.5, p2=0.3, µ2=1, σ2122=1, (Scenario B).

If one is only interested in density estimation, then everything is fine, because as illustrated in the example

fA(y) =0.3φ(y|1,1) +0.7φ(y|1.5,1)

=0.7φ(y|1.5,1) +0.3φ(y|1,1) =fB(y).

Usually, imposingµ1< µ2<· · ·< µKand using informative priors helps to mitigate the identifiability issues.

(26)

How to choose

K

?

A drawback of finite mixture models is that we must choose the number of componentsK, which is a non-trivial task in general.

One possibility is placing a prior onK, which leads to a model that changes dimension (number of parameters) withK.

One approach to fit such model is to use reversible jump type of algorithms, but these type of algorithms tend to be difficult to implement efficiently in practice.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 26 / 95

(27)

How to choose

K

?

Another possibility is to fit the model for different values ofKand assess the adequacy of the fit, through the LPML criterion, for instance.

Placing aKtoo small is much worst than placing aKtoo large.

For instance, one cannot get a trimodal density out of a mixture of two components but one can essentially ignore extra mixture terms and get bimodal densities from mixtures of three or more components.

Marin and Robert (2013) contains a good discussion of Bayesian finite mixture models.

(28)

Example

True data generating distribution: 0.3φ(y| −6,1) +0.3φ(y|0,1) +0.4φ(y|6,1),n=500.

K=2

y

Density

−10 −5 0 5 10

0.000.050.100.15

K=3

y

Density

−10 −5 0 5 10

0.000.050.100.15

K=4

y

Density

−10 −5 0 5 10

0.000.050.100.15

K=5

y

Density

−10 −5 0 5 10

0.000.050.100.15

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 28 / 95

(29)

Example

y

Density

−10 −5 0 5 10

0.000.050.100.15

(30)

DP prior

Another alternative is to use a DP prior (Ferguson 1973, 1974) for the mixing distribution G, resulting in a Dirichlet process mixture (DPM) model.

The DP prior forG, on one hand, offers the theoretical advantage of having full weak support on all mixing distributions and, on the other hand, the practical advantage of automatically determining the number of components that best fits a given dataset.

The resulting model is

f(y)≡f(y|G) = Z

k(y|θ)dG(θ), G∼DP(α,G0).

In the context of DPM of normals,k(y|θ) =φ(y|µ, σ2).

The full weak support means that, under mild conditions, any distribution with the same support asG0can be well approximated weakly by a DP.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 30 / 95

(31)

DP prior

But what does it meanG∼DP(α,G0)?

The DP was originally defined by Ferguson (1973, 1974).

The DP generates random probability measures (random distributions),G, defined on some sigma-algebra (collection of subsets) of a sample spaceΩ, such that the distribution of any finite partition ofΩis a Dirichlet distribution.

That is,G∼DP(α,G0)if for anykand any partitionB1, . . . ,BkofΩ, then [G(B1), . . . ,G(Bk)]∼Dir[αG0(B1), . . . , αG0(Bk)].

(32)

DP prior

The definition of the Dirichlet process and the properties of the Dirichlet distribution imply that for any subset B ofΩ

G(B)∼Beta(αG0(B), α(1−G0(B))).

Thus,

E{G(B)}=G0(B), Var{G(B)}=G0(B)(1−G0(B))

α+1 .

Hence, from the above, we have

Gis centered onG0(G0is also referred as baseline distribution).

α >0 is a precision parameter controlling the variance of the process. Ifαis large, Gis highly concentrated aroundG0.

Therefore, the DP prior is centered on a parametric model through the specification ofG0, while allowingαto control uncertainty in this choice.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 32 / 95

(33)

Stick-breaking representation

Although useful, the definition of the DP does not provide an intuition for what realizations of a DP actually look like.

Undoubtedly, the most useful definition of the DP is its constructive definition (Sethuraman, 1994), according to whichGhas an almost sure representation of the form

G(·) =

X

k=1

ωkδθk(·). (7)

In (7), we have θk

iid∼G0,k=1,2, . . .

ω1=v1, and fork≥2,ωk =vkQ

l<k(1−vl), withvkiidBeta(1, α)(independently of theθk)⇒Stick-breaking construction.

(34)

Stick-breaking representation

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 34 / 95

(35)

Stick-breaking representation

As the stick-breaking construction proceeds, the stick gets shorter and shorter and the lengths allocated to higher index atoms decrease stochastically, with the rate of decrease depending onα.

Sincevk ∼Beta(1, α), then

E(vk) = 1 1+α,

so that values ofαclose to zero lead to high weight on the first few atoms, with the remaining atoms being assigned small probabilities.

(36)

Stick-breaking representation

−3 −2 −1 0 1 2

0.00.20.40.60.81.0

α = 0.5

θ

ω

−3 −2 −1 0 1 2 3

0.00.20.40.60.81.0

α = 1

θ

ω

−3 −2 −1 0 1 2 3

0.00.20.40.60.81.0

α = 5

θ

ω

−3 −2 −1 0 1 2 3

0.00.20.40.60.81.0

α = 10

θ

ω

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 36 / 95

(37)

DP prior trajectories

30 trajectories from a DP withG0=N(0,1)(based on 1000 draws).

−4 −2 0 2 4

0.00.20.40.60.81.0

α = 1

CDF

−4 −2 0 2 4

0.00.20.40.60.81.0

α = 5

CDF

−4 −2 0 2 4

0.00.20.40.60.81.0

α = 50

CDF

−4 −2 0 2 4

0.00.20.40.60.81.0

α = 100

CDF

(38)

DP prior trajectories

30 trajectories from a DP withG0=Beta(2,10)(based on 1000 draws).

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

α = 1

CDF

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

α = 5

CDF

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

α = 50

CDF

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

α = 100

CDF

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 38 / 95

(39)

DP prior trajectories

30 trajectories from a DP withG0=Pois(1)(based on 1000 draws).

0 2 4 6 8 10

0.00.20.40.60.81.0

α = 1

CDF

0 2 4 6 8 10

0.00.20.40.60.81.0

α = 5

CDF

0 2 4 6 8 10

0.00.20.40.60.81.0

α = 50

CDF

0 2 4 6 8 10

0.00.20.40.60.81.0

α = 100

CDF

(40)

Conjugacy of the DP prior

The DP prior is closed under sampling. That is, the posterior distribution is also a DP.

Lety1, . . . ,yn|G∼iidGandG∼DP(α,G0). Then

G|y1, . . . ,yn∼DP α+n, α

α+nG0+ 1 α+n

n

X

i=1

δyi

!

The posterior mean is then

E(G|y1, . . . ,yn) = α

α+nG0+ n α+n

n

X

i=1

1 nδyi

Thus, the posterior mean is a weighted average between the centering distribution and the empirical cdf.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 40 / 95

(41)

Posterior inference using a DP prior/ Bayesian bootstrap

Whenα→0, the limiting posterior distribution is G|y1, . . . ,yn∼DP n,1

n

n

X

i=1

δyi

! , and thus

E(G|y1, . . . ,yn) = 1 n

n

X

i=1

δyi

This limiting posterior distribution is known as the Bayesian bootstrap (Rubin 1981, Gasparini 1995).

A Bayesian bootstrap estimator forGcan be computed as G=

n

X

i=1

piδyi, (p1, . . . ,pn)∼Dir(n;1, . . . ,1). (8) Again, from (8) it follows that

E(G|y1, . . . ,yn) =E

n

X

i=1

piδyi

!

=

n

X

i=1

E(piδyi =1 n

n

X

i=1

δyi

(42)

Posterior inference using a DP prior/ Bayesian bootstrap

Recall that in the frequentist bootstrap (Efron, 1979)pi∈ {0,1/n, . . . ,n/n}corresponding to the number of timesyiappears in a bootstrap sample.

Thus, the weights in the Bayesian bootstrap are smoother than those from Efron’s frequentist bootstrap.

This is justified by the fact that in the BB the weights arise from a Dirichlet (continuous distribution) while the weights in the classical bootstrap have a discrete distribution.

Note that in the BB the data should be regarded as fixed, so that we do not resample from it.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 42 / 95

(43)

Posterior inference using a DP prior

Estimating the posterior mean under a DP prior using simulated data. The true distribution generating the data isN(3,1), while the centering is aN(0,1)distribution.

−2 0 2 4 6

0.00.20.40.60.81.0

n=50,α = 1

CDF

True ECDF Post. Mean G0

−2 0 2 4 6

0.00.20.40.60.81.0

n=50,α = 50

CDF

True ECDF Post. Mean G0

−2 0 2 4 6

0.00.20.40.60.81.0

n=50,α = 100

CDF

True ECDF Post. Mean G0

(44)

Posterior inference using a DP prior

Estimating the posterior mean under a DP prior using simulated data. The true distribution generating the data isN(3,1), while the centering is aN(0,1)distribution.

−2 0 2 4 6

0.00.20.40.60.81.0

n=500,α = 1

CDF

True ECDF Post. Mean G0

−2 0 2 4 6

0.00.20.40.60.81.0

n=500,α = 50

CDF

True ECDF Post. Mean G0

−2 0 2 4 6

0.00.20.40.60.81.0

n=500,α = 100

CDF

True ECDF Post. Mean G0

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 44 / 95

(45)

DPM model

The DP generates flexible albeit discrete distributions.

Due to this fact, and as we have already anticipated, it is commonly used as a prior in mixture models.

The Dirichlet process mixture model can be specified as f(y) =

Z

k(y|θ)dG(θ), G∼DP(α,G0)

BecauseGis random, the mixture densityf is also random.

Further, the mixture density can be discrete or continuous, univariate or multivariate, depending on the nature ofk(y|θ).

(46)

DPM model

Hereby, we focus on Dirichlet process mixtures of normals f(y) =

Z

φ(y|µ, σ2)dG(µ, σ2), G∼DP(α,G0). (9) The centering distributionG0is defined onR×R+.

Due to conjugacy reasons,G0is usually taken to be the normal-inverse-gamma distribution, that is

G0≡N(aµ,bµ2)IG(aσ2,bσ2).

To allow for extra flexibility, hyperpriors can be placed onaµ,b2µ,aσ2,bσ2. The stick-breaking representation of the DP allows us to write (9) as the following countably infinite mixture of normals

f(y) =

X

k=1

ωkφ(y|µk, σk2), (10)

Note that equation (10) resembles the finite mixture model considered earlier but with the important difference that the number of mixture components is set to infinite and the weights now follow a stick-breaking construction.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 46 / 95

(47)

MCMC

Posterior inference can be conducted using (at least) two different kinds of MCMC strategies:

1 Employing a truncation of the stick-breaking representation (Ishwaran and James, 2001, 2002).

2 Using a marginal/collapsed Gibbs sampling where the mixing distribution is ingrates out from the model (MacEachern 1998, Neal 2000).

Throughout approach 1 will be detailed.

(48)

Blocked Gibbs sampler

The blocked Gibbs sampler relies on truncating the stick-breaking representation to a finite number of components.

Hence

GN(·) =

N

X

k=1

δ

k2k)(·).

The atoms(µk, σ2k)are iidG0, i.e.,(µk, σ2k)∼iidN(aµ,b2µ)IG(aσ2,bσ2),k=1, . . . ,N.

The weights arise through a truncated stick-breaking construction ω1=v1, fork≥2 ωk =vkY

l<k

(1−vl), vk iid∼Beta(1, α), k=1, . . . ,N−1

vN =1, ωN =

N−1

Y

l=1

(1−vl)

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 48 / 95

(49)

Blocked Gibbs sampler

Ishwaran and James (2001) showed the following bound for the truncation error

4

1−E

N−1

X

k=1

ωk

n

≈4nexp 1−N

α

For instance, ifn=500 and if we use a truncate value ofN=20, then forα=1, we get a bound of≈1.1×10−5.

In practice,N=20 orN=50 are commonly chosen as a default.

(50)

Blocked Gibbs sampler

Using the truncated versionGNofG, the normal mixture density can be expressed as

f(y) =

N

X

k=1

ωkφ(y|µk, σk2),

withωkgenerated from the truncated stick-breaking representation, whereas µkiidN(aµ,b2µ), andσk2iidIG(aσ2,bσ2).

As it was the case for the finite mixture model, derivation of the full conditionals for Gibbs sampling involves the data-augmented likelihood.

The means, variances, and latent component indicators are sampled in an identical manner to the finite mixture model.

The main difference is that, unlike in the finite mixture model, uncertainty in the component weights is shifted tov, the inputs into the construction of the stick-breaking weights.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 50 / 95

(51)

Blocked Gibbs sampler

The full conditional distributions are

µk|else∼N aµ/b2µ+P

i:zi=kyi2k

1/bµ2+nk2k , 1 1/b2µ+nkk2

!

, (11)

σ2k|else∼IG

aσ2+nk/2,bσ2+ X

i:zi=k

(yi−µk)2/2

, (12)

fork=1, . . . ,N.

Fori=1, . . . ,n, the full conditional distribution forziis zi|else∼Mult(pi), withpi= (pi1, . . . ,piN)andpik = ωkφ(yik

2 k) PK

l=1ωlφ(yill2),k=1, . . . ,N.

(52)

Blocked Gibbs sampler

Fork=1, . . . ,N−1, update the inputs of the stick-breaking weights have from

vk |else∼Beta

nk+1, α+

N

X

l=k+1

nl

. (13)

Regarding the precision parameterα, it can be fixed at a small value, for instance,α=1 is widely used in applications.

Alternatively, one can place a prior onαand allow the data to inform about the appropriate value ofα.

Lettingα∼Gamma(aα,bα), the resulting full conditional forαis

α|else∼Gamma(aα+N−1,bα

N−1

X

k=1

log(1−vk)). (14)

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 52 / 95

(53)

Set an initial value forα,ω, andθ, say,α(0)(0), andθ(0). fort=1, . . . ,T ,do:

1 fori=1, . . . ,n, and k=1, . . . ,K , compute posterior probabilities of membership using equation(6).

2 fori=1, . . . ,n, sample the latent component indicator, zi(t)∼Multi(p(t)i ).

3 Simulate stick-breaking inputsv(t)from equation(13).

4 Givenv(t), computeω(t)using the stick-breaking construction.

5 Conditional onz(t), updateµ(t)k and(σ(t)k )2,fork=1, . . . ,K , from(11)and(12), respectively.

6 Update the precision parameterαfrom(14).

(54)

Blocked Gibbs sampler

A valid concern with the blocked Gibbs sampler is that by truncating the stick-breaking representation we are effectively fitting a finite (and hence parametric) mixture model.

Quoting Dunson (2011):

“For example, if we let N=25as a truncation level, a natural question is how this is better or intrinsically different than fitting a finite mixture model with 25 components. One answer is that N is not the number of components occupied by the subjects in your sample but is

instead an upper bound on the number of subjects.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 54 / 95

(55)

Collapsed Gibbs sampler/ Polya urn scheme

The functionDPdensityfrom theRpackageDPpackagecan also to be used to fit a DPM of normals.

The MCMC scheme behind this function is a marginalized/collapsed Gibbs sampler.

The collapsed Gibbs sampler avoids specifying a truncation level by marginalizing outG and relying on the Polya urn scheme of Blackwell and MacQueen (1973).

Letting

yi∼φ(θi), θi= (µi, σi2)∼G, G∼DP(α,G0), and marginalizing outG, we obtain the Polya urn predictive rule

p(θi1, . . . ,θi−1)∝ α

α+i−1G0i) + 1 α+i−1

i−1

X

j=1

δθji)

The Polya urn rule form the basis of the collapsed Gibbs sampler. For those interested in further details, see (Escobar and West 1995, MacEachern 1998, Neal 2000).

(56)

Example

Same data from the finite mixture example. DPM fit (left) and comparison against the fit of a 3 component mixture model, which corresponds to the true data generating mechanism (right).

y

Density

−10 −5 0 5 10

0.000.050.100.15

y

Density

−10 −5 0 5 10

0.000.050.100.15

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 56 / 95

(57)

Example

True data generating mechanism:φSN(y|µ=0, σ2=1, λ=8)

n=150

y

Density

−0.5 0.5 1.0 1.5 2.0 2.5 3.0

0.00.20.40.60.81.0

n=500

y

Density

−0.5 0.5 1.0 1.5 2.0 2.5 3.0

0.00.20.40.60.81.0

(58)

Example

True data generating mechanism:φ(y|0,1). DPM fit (blue) against normal fit (red).

n=150

y

Density

−3 −2 −1 0 1 2 3

0.00.10.20.30.40.50.6

n=500

y

Density

−3 −2 −1 0 1 2 3

0.00.10.20.30.4

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 58 / 95

(59)

Censored responses

The DPM can easily handle censored responses (left,right, or interval).

Remember that under the hierarchical formulation yi|µ,σ2,z∼φ(yizi, σ2z

i) ..

.

For instance, ifyiis right censored, we know that its true value, sayyi, is greater thanyi, that is,yi>yi.

We can take care of these censored observations by simply adding an extra step in the blocked Gibbs algorithm.

In fact, we can simulate those observations from

yi|yi,zi,µ,σ2∼φ(yizi, σz2i)I(yi>yi).

This can be accomplished by simulatingyifrom a truncated normal distribution with lower limit equal toyi.

(60)

Censored responses

DPM fit (blue line) agains Kaplan–Meier fit (black line). The censored observations are represented by crosses.

0 1 2 3 4 5 6

0.00.20.40.60.81.0

y

Survival

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 60 / 95

(61)

Density regression

So far we have focused on the problem of density estimation.

We will now move to the problem of density regression.

Traditional regression models allow just one or two characteristics (e.g., mean and/or variance) to change as a function of covariates.

Here we will explore tools that allow the entire density/distribution to change as a function of covariates.

(62)

Density regression

Let{(x1,y1), . . . ,(xn,yn)}be regression data, wherexi∈ X ⊆Rp. It is assumed that

yi|xi

ind.∼ f(· |xi), i=1, . . . ,n.

We specify a probability model for the entire collection of densitiesF={f(· |x) :x∈X}.

Further, one possibility is to model the conditional density using covariate-dependent mixture of normal models

f(y|x) = Z

φ(y|µ, σ2)dGx(µ, σ2).

The probability model for the conditional densities is induced by specifying a prior for the collection of mixing distributions

GX={Gx:x∈ X } ∼ G.

Gxis the random mixing distribution at covariatexandGis the prior for the collectionGX.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 62 / 95

(63)

DDP prior

One possibility forGis the dependent DP (DDP) proposed by MacEachern (1999,2000), which is built upon the constructive definition of the DP.

In its full generality, the DDP is specified as follows:

Gx(·) =

X

k=1

ωk,xδθk,x(·), ω1,x=v1,x, ωk,x=vk,x

Y

l<k

(1−vk,x).

Here

θ1,x2,x, . . .are realizations of a stochastic process (e.g., a Gaussian process) over X.

v1,x,v2,x, . . .are realizations from a stochastic process onXsuch that vk,x∼Beta(1, αx)

(64)

DDP prior

Because of complications involved in allowing the weights to depend on covariates, the

‘single weights’ DDP, which assumes fixed weights, is commonly used.

Following De Iorio et al. (2009), a possibility forGxis

Gx(·) =

X

k=1

ωkδθk,x(·),

where the weights match those from a standard DP andθk,x= (µk,x, σ2k), with µk,x=xTβk.

Thus, under this formulation, the base stochastic processes are replaced with a base distributionG0that generates the component-specific regression coefficients and variances.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 64 / 95

(65)

DPM of Gaussian regression models

Thus, the conditional density can therefore be represented as a DP mixture of Gaussian regression models

f(y|x) = Z

φ(y|xTβ, σ2)dG(β, σ2), G∼DP(α,G0). (15)

For example,G0could beNpβ,Sβ)IG(aσ2,bσ2).

This model is known as the linear dependent Dirichlet process (LDDP) (De Iorio 2004, 2009).

Note that Eq. (15) can be equivalently written as

f(y|x) =

X

k=1

ωkφ(y|xTβk, σk2).

(66)

DPM of Gaussian regression models - spline based version

The LDDP although flexible, does not allow for nonlinear effects of the covariates.

An alternative is instead of consideringµk,x=xTβkto consider an additive formulation based on B-splines, namely

µk,xk0+

p

X

r=1

Lr

X

s=1

βkrsΨs(xr,dr)

,

whereΨ(x,d)corresponds to thesth B spline basis function of degreedevaluated atx. As in the density estimation setup, posterior inference can be conducted using a blocked Gibbs sampler or a collapsed Gibbs sampler.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 66 / 95

(67)

Example

Data generating mechanism:yi|xi∼0.5φ(yi| −3.5−1.5xi,12) +0.5φ(yi|2+3.5xi,1.252), xi∼U(0,1), fori=1, . . . ,500. Results from the fit of a LDDP.

−5 0 5

0.000.050.100.150.200.25

x=0.2

y

Density

−5 0 5

0.000.100.200.30

x=0.4

y

Density

−5 0 5

0.000.100.200.30

x=0.7

y

Density

(68)

PT prior

We now focus on another popular nonparametric prior for density/distribution estimation.

The discussion closely follows Branscum, Johnson, and Baron (2013).

Polya tree priors have been discussed as early as Freedman (1963), Fabius (1964), and Ferguson (1974).

However, the natural starting point for understanding their potential use in modeling data is Lavine (1992, 1994), while Hanson (2006) considered some computational details.

Polya trees are way less popular than DPs but they are a very powerful tool as well.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 68 / 95

(69)

PT prior

Suppose that a random sampley1, . . . ,ynis obtained from an unknown/random distributionF.

Polya tree priors, as Dirichlet process priors, place a distribution on a collection of distributions.

A Polya tree for a distributionFis constructed by dividing the sample space into finer-and-fines disjoint sets using successive binary partitioning.

For instance, the first partition splits the sample space into two non overlapping intervals.

In the second partition, those two intervals are each split, yielding a finer partition that contain four intervals.

Then, these four intervals are each split to give an eight interval third level partition of the sample space.

At leveljof the tree, the sample space is partitioned into 2jintervals,j=1, . . .

(70)

PT prior

LetBj,k denotes thekth interval at leveljof the tree, forj=1, . . ., andk=1, . . . ,2j.

Observe that the partitions are nested within one another, starting at the top of the tree and working up.

For example, by definition,B1,1=B2,1∪B2,2andBj−1,1=Bj,1∪Bj,2, etc.

These intervals can be thought as the bins in the histogram.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 70 / 95

(71)

Finite PT prior

For a full tree the splitting continues ad infinitum.

However, in practice, we truncate to a fixedJ, hence the term, finite Polya tree.

Generally, settingJequal to 4, 5, or 6 often works well in practice.

Another option is to selectJso that roughlyJ=log2n(Hanson, 2006).

(72)

Finite PT prior

Informally speaking, the unknown distributionFassigns the data pointsyis to the intervals at levelJof the tree and the task is to use the observed distribution of the data into the intervals to estimateF.

Although all levels of the tree are important for the purpose of estimatingF, of primary importance is levelJ.

The goal here is to produce a data-driven estimate ofFthat assigns high probability to intervals that contain lots of data, assigns low probability to empty intervals, and assigns midrange probability to intervals that contain some (but not a lot) of the data.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 72 / 95

(73)

Finite PT prior

Let us consider first the simplest caseJ=1.

Then data are assigned to eitherB1,1orB1,2.

The probability of a data pointyibeing assigned toB1,1isF(B1,1) =Pr(yi∈B1,1)which sinceFis a cdf andB1,1is an interval of the form(L,U)is to be interpreted as

F(B1,1) =F(U)−F(L).

Denote this unknown probability byπ1,1.

Then, by the complement rule,π1,2=1−π1,1is the probability assigned to setB1,2. In notation, Pr(yi∈B1,1) =F(B1,1) =π1,1and Pr(yi∈B1,2) =F(B1,2) =π1,2. SinceFis unknown,π1,1andπ1,2are also unknown.

(74)

Finite PT prior

To help interpret these parameters, suppose the data arise from a right-skewed distribution (lots of more data inB1,1than inB1,2), thenπ1,1would be large and henceπ1,2would be small, and vice versa for left-skewed data.

Obviously, in practice,J=1 is never used because it would lead to a crude estimate of the density functionf, much like estimating a density using a relative frequency histogram that contains only two bins.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 74 / 95

(75)

Finite PT prior

Let us now consider, again for simplicity, the case ofJ=2. We have now four sets.

Data assignment is based on whether the data point was inB1,1orB1,2at the previous levelj=1.

Ifyi∈B1,1thenyiis assigned to intervalB2,1with unknown probabilityπ2,1or to interval B2,2with probabilityπ2,2=1−π2,1.

Similarly, defineπ2,3andπ2,4(=1−π2,3)to be the probability ofyibeing assigned to set B2,3orB2,4, respectively, given thatyiwas onB1,2.

Theπj,ks are conditional parameters, sinceπ2,1=Pr(yi∈B2,1|yi∈B1,1)and π2,3=Pr(yi|yi∈B1,2).

To relate theπj,ks toFwe must determine the marginal probability of assignment to the various intervals at levelJ(=2).

(76)

Finite PT prior

Observe that intervalB2,1is nested on intervalB1,1, so the marginal probability of interval B2,1is

F(B2,1) =Pr(yi∈B2,1)

=Pr(yi∈B2,1∩B1,1)

=Pr(yi∈B2,1|yi∈B1,1)Pr(yi∈B1,1)

2,1π1,1.

Similar steps lead toF(B2,2) = (1−π2,11,1,F(B2,3) =π2,3π1,2, and F(B2,4) = (1−π2,31,2.

Suppose again thatF is right skewed . Then the data will estimateπ1,1to be large, and it will estimateπ2,1to be large since most of thendata points will be assigned to setB2,1. Therefore, the estimate ofF(B2,1)will be (relatively) large.

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 76 / 95

(77)

Finite PT prior

Notice that level 1 has only one unique parameter,π1,1, associated with it becauseπ1,2is completely determined byπ1,1.

Similarly, level 2 has two unique parameters,π2,1andπ2,3associated with it.

We can continue the partitioning to any levelJ.

ForJ=3, we add eight conditional probabilities parameters,π3,1, π3,2, . . . , π3,8, but only four of these are unique.

For instance,π3,1is the probability thatyiis in intervalB3,1given that it is in intervalB2,1, and

F(B3,1) =Pr(yi∈B3,1)

=Pr(yi∈B3,1∩B2,1)

=Pr(yi∈B3,1|yi∈B2,1)Pr(yi∈B2,1)

3,1π2,1π1,1. In general, we have

F(Bj,k) =

j

Y

l=1

πl,Int{(k−1)2l−j+1}, j=1, . . . ,J, k=1, . . . ,2j.

(78)

Finite PT prior

The key point is that if we can estimate all of theπj,ks, then we can estimate the probability that it is allocated byF to each interval at levelJ.

So far, we have modeled the probability of assignment to each set at levelJ, but we have not modeled how probability mass is distributed within each interval at levelJ.

For instance, all theyis can be clumped together in the center of the interval.

Alternatively, the data could be uniformly distributed, clumped to the right or left side of the interval or have any other dispersion pattern within each interval.

To address this issue, we model the data according to how a user-specified parametric distributionF0allocated probability mass within the intervals at levelJ.

So, as it was in the DP (or DPM), here with finite Polya trees, the user also needs to specify a probability distribution (and we will see in a few slides thatF0is also a centering distribution).

Vanda Inácio (PUCC) Bayesian Nonparametrics XIII EBEB, February 22, 2016 78 / 95

Referências

Outline

Documentos relacionados

BEAGLE: Broad-platform Evolutionary Analysis General Likelihood Evaluator; BEAST: Bayesian Evolutionary Analysis Sampling Trees; BF: Bayes Factor; BSSVS: Bayesian stochastic

A pós-modernidade trouxe a ideia do corpo como principal objeto de estudo de uma coreografia, desconstruindo a ideia do corpo como mera ferramenta para trans- missão de emoções e

parte do pessoal da empresa cliente. - A MAIOR PARTE DOS PROBLEMAS NO PROCESSO DE CONSULTORIA ÀS PEQUENAS EMPRESAS,. INDICADOS PELOS DIRIGENTES ENCONTRA, EM MAIOR OU

Foi notada, por exemplo, a valorização da utilização de elementos diegéticos na interface, bem como a percepção dos jogadores de que um design pouco apropriado pode

Nóvoa (coord.), As Organizações Escolares em Análise (pp. face ao trabalho escolar, enfim, de um ethos de escola. Ao recorrer ao termo ethos Rutter procura pôr em

Esta era a forma como a comunicação entre pares era realizada, para que se pudesse ter alguma notícia imparcial, ou pelo menos não viciada pelas ideias pré-concebidas

The present study clearly revealed that in terms of metabolic capacity, energy reserves and oxidative stress f-MWCNTs induced higher toxicity in clams in comparison to

Diversas técnicas têm, desta forma, sido utilizadas - desde curetagem subgengival, gengivectomia, retalho modificado de Widman e retalhos de espessura total ou parcial –