(1)

The EM Algorithm

Kenneth Lange

Departments of Biomathematics and Human Genetics

David Geffen School of Medicine at UCLA

May, 2003

(2)

Overview of the EM Algorithm

1. Maximum likelihood estimation is ubiquitous in statistics.

2. EM is a special case of the MM algorithm that relies on the notion of missing information.

3. The surrogate function is created by calculating a certain conditional expectation. Sometimes an MM and an EM algorithm coincide for the same problem; sometimes not.

4. Convexity enters through Jensen’s inequality.

5. Many examples were known before the general principle was enunciated.

(3)

Nature of Missing Information

1. Missing information can take the form of missing data.

2. Missing information can also be more abstract. Even with perfect data collection, there can be missing information.

3. For instance, in PET scanning, current machines cannot determine where along a projection line a decay event has taken place. If we knew the pixel of origin of each decay event, then estimating the concentration of a radio-labeled compound would be straightforward.

4. The complete data should be conceptualized to make maximization of the complete data loglikelihood as easy as possible.

(4)

Ingredients of the EM Algorithm

1. The observed data y with likelihood g(y | θ). Here θ is a parameter vector.

2. The complete data x with likelihood f(x | θ).

3. The conditional expectation

Q(θ | θn) = E[ln f(x | θ) | y, θn]

furnishes the minorizing function up to a constant. Here θn is the value of θ at iteration n of the EM algorithm.

4. Calculation of Q(θ | θn) constitutes the E step; maximization of Q(θ | θn) with respect to θ constitutes the M step.
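These ingredients translate directly into code. The following is a minimal, generic EM loop sketched in Python; the helper names estep and mstep, the convergence tolerance, and the iteration cap are illustrative assumptions rather than anything prescribed by the slides.

    import numpy as np

    def em(y, theta0, estep, mstep, tol=1e-8, max_iter=1000):
        """Generic EM loop.
        estep(y, theta) returns the conditional expectations (sufficient
        statistics) that define the surrogate Q(theta | theta_n);
        mstep(stats) returns the theta maximizing that surrogate."""
        theta = np.asarray(theta0, dtype=float)
        for n in range(1, max_iter + 1):
            stats = estep(y, theta)                 # E step
            theta_new = np.asarray(mstep(stats))    # M step
            if np.max(np.abs(theta_new - theta)) < tol:
                return theta_new, n
            theta = theta_new
        return theta, max_iter

The examples that follow in the slides specialize the E and M steps to particular models.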

(5)

Minorization Property of the EM Algorithm

1. The proof depends on Jensen’s inequality E[h(Z)] ≥ h[E(Z)] for a random variable Z and convex function h(z).

2. If p(z) and q(z) are probability densities with respect to a measure µ, then the convexity of −ln z implies the information inequality

Ep[ln p] − Ep[ln q] = Ep[−ln(q/p)] ≥ −ln Ep(q/p) = −ln ∫ (q/p) p dµ = −ln ∫ q dµ = 0,

with equality when p = q.

3. In the E step minorization, we apply the information inequality to the conditional densities p(x) = f(x | θn)/g(y | θn) and q(x) = f(x | θ)/g(y | θ) of the complete data x given the observed data y.

(6)

Minorization Property II

1. The information inequality Ep[ln p] ≥ Ep[ln q] now yields

Q(θ | θn) − ln g(y | θ) = E[ln(f(x | θ)/g(y | θ)) | y, θn]
≤ E[ln(f(x | θn)/g(y | θn)) | y, θn]
= Q(θn | θn) − ln g(y | θn),

with equality when θ = θn.

2. Thus, Q(θ | θn) − Q(θn | θn) + ln g(y | θn) minorizes ln g(y | θ).

3. In the M step it suffices to maximize Q(θ | θn) since the other two terms of the minorizing function do not depend on θ.
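The payoff of this minorization is the ascent property of the EM algorithm. The following derivation is a worked consequence of the display above, added here for completeness rather than reproduced from the slides: if θn+1 maximizes Q(θ | θn), then

ln g(y | θn+1) ≥ Q(θn+1 | θn) − Q(θn | θn) + ln g(y | θn) ≥ Q(θn | θn) − Q(θn | θn) + ln g(y | θn) = ln g(y | θn),

so the observed data loglikelihood can never decrease from one iteration to the next.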

(7)

Example 1: Twin Data

1. You are given a sample of m male twin pairs, f female twin pairs, and o opposite sex twin pairs. Estimate the probability p that a twin pair is identical and the probability q that a child is male.

2. Here y = (m, f, o) is the observed data and θ = (p, q) is the parameter vector. If we knew exactly which pairs of same-sex twins were identical, then it would be easy to estimate p and q. Thus, we postulate complete data x = (m1, m2, f1, f2, o), with m1 representing the number of male identical twin pairs and m2 the number of male non-identical twin pairs. f1 and f2 are defined similarly.

(8)

Example 1: Complete Data Loglikelihood

1. The multinomial complete data likelihood is

f(x | θ) = (m + f + o choose m1, m2, f1, f2, o) (pq)^m1 [(1 − p)q^2]^m2 [p(1 − q)]^f1 × [(1 − p)(1 − q)^2]^f2 [(1 − p)2q(1 − q)]^o,

since identical twins involve one choice of sex and non-identical twins two choices of sex.

2. The complete data loglikelihood is

ln f(x | θ) = (m1 + f1) ln p + (m2 + f2 + o) ln(1 − p) + (m1 + 2m2 + o) ln q + (f1 + 2f2 + o) ln(1 − q) + constant.

(9)

Example 1: E Step

To carry out the E step we calculate

m1^n = E(m1 | y, θn) = m pnqn / [pnqn + (1 − pn)(qn)^2]
m2^n = E(m2 | y, θn) = m (1 − pn)(qn)^2 / [pnqn + (1 − pn)(qn)^2]
f1^n = E(f1 | y, θn) = f pn(1 − qn) / [pn(1 − qn) + (1 − pn)(1 − qn)^2]
f2^n = E(f2 | y, θn) = f (1 − pn)(1 − qn)^2 / [pn(1 − qn) + (1 − pn)(1 − qn)^2]

by applying Bayes’ rule.

(10)

Example 1: M Step

1. The surrogate function is

Q(θ | θn) = (m1^n + f1^n) ln p + (m2^n + f2^n + o) ln(1 − p) + (m1^n + 2m2^n + o) ln q + (f1^n + 2f2^n + o) ln(1 − q) + constant.

2. Straightforward calculus shows the maximum occurs for

pn+1 = (m1^n + f1^n) / (m + f + o)
qn+1 = (m1^n + 2m2^n + o) / (m + f + o + m2^n + f2^n + o).

Note that Q(θ | θn) is separable in the parameters p and q and that m1 + m2 + f1 + f2 + o = m + f + o.

(11)

Example 1: Hidden Binomial Updates

1. Both the number of identical twins and the number of choices of the male sex involve hidden binomial trials.

2. The update in such circumstances takes the form

rn+1 = E(#successes | y, θn) / E(#trials | y, θn)

for r = p or r = q. In the first case the number of trials is fixed at m + f + o, and in the second case the number of trials is random because the number of choices of sex depends on the number of identical twins versus the number of non-identical twins.
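As a concrete illustration of Example 1, here is a short Python sketch of the E and M steps just described. The function name, starting values, and the made-up counts in the trailing comment are assumptions for illustration only.

    def twin_em(m, f, o, p=0.5, q=0.5, n_iter=500, tol=1e-10):
        """EM for the twin model: p = Pr(identical pair), q = Pr(child is male)."""
        total = m + f + o
        for _ in range(n_iter):
            # E step: expected identical/non-identical counts among same-sex pairs
            m1 = m * p * q / (p * q + (1 - p) * q ** 2)
            m2 = m - m1
            f1 = f * p * (1 - q) / (p * (1 - q) + (1 - p) * (1 - q) ** 2)
            f2 = f - f1
            # M step: hidden binomial proportions
            p_new = (m1 + f1) / total
            q_new = (m1 + 2 * m2 + o) / (total + m2 + f2 + o)
            if abs(p_new - p) + abs(q_new - q) < tol:
                return p_new, q_new
            p, q = p_new, q_new
        return p, q

    # Illustrative made-up counts:
    # print(twin_em(m=60, f=55, o=75))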

(12)

Example 2: Light Bulb Lifetimes

1. The random lifetime of a light bulb is postulated to be exponentially distributed with unknown mean 1/θ.

2. The lifetimes y1, . . . , yr of r independent bulbs are observed. A further s independent bulbs are inspected only once, at time t > 0: bulb r + i is registered as still burning, zr+i = 1, or expired, zr+i = 0. Thus, the lifetimes of the second set of bulbs are either right censored (still burning) or left censored (expired).

3. In this situation it is natural to view the complete data as the observed lifetimes y1, . . . , yr supplemented by the unobserved lifetimes xr+1, . . . , xr+s.

(13)

Example 2: Complete Data Loglikelihood

The complete data loglikelihood is

ln f(x | θ) = Σ_{i=1}^{r} (ln θ − θ yi) + Σ_{i=r+1}^{r+s} (ln θ − θ xi)
            = r ln θ − rθȳ + Σ_{i=r+1}^{r+s} (ln θ − θ xi)

for exponentially distributed lifetimes, where ȳ is the average value of y1, . . . , yr.

(14)

Example 2: E Step

1. Because the survival time of a light bulb lacks memory, right censored data gives E(xr+i | zr+i = 1, θn) = t + 1/θn.

2. For left censored data, integration by parts and the fundamental theorem of calculus yield

E(xr+i | zr+i = 0, θn) = ∫_0^t x θn e^{−θn x} dx / ∫_0^t θn e^{−θn x} dx = 1/θn − t e^{−θn t}/(1 − e^{−θn t}).

3. Weighted by their respective probabilities and summed, these two conditional expectations give back the unconditional mean 1/θn of xr+i.

(15)

Example 2: M Step

1. If we let µi^n = E(xi | zi, θn) for i = r + 1, . . . , r + s, then the surrogate function is

Q(θ | θn) = (r + s) ln θ − θ[rȳ + Σ_{i=r+1}^{r+s} µi^n].

2. It is now easy to differentiate and solve for the EM update

θn+1 = (r + s) / (rȳ + Σ_{i=r+1}^{r+s} µi^n).

3. In other words, we fill in the unknown times by their conditional expectations and then identify 1/θn+1 with the average of the observed and imputed lifetimes.
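A small Python sketch of this EM under the assumptions above (exponential lifetimes with rate θ, so mean 1/θ, and s bulbs inspected once at time t); the data in the trailing comment are made up.

    import numpy as np

    def bulb_em(y, z, t, theta=1.0, n_iter=500, tol=1e-12):
        """EM for r fully observed lifetimes y and s bulbs inspected at time t,
        with z[i] = 1 if bulb r+i is still burning at t and 0 otherwise."""
        y = np.asarray(y, dtype=float)
        z = np.asarray(z, dtype=int)
        r, s = len(y), len(z)
        for _ in range(n_iter):
            # E step: impute censored lifetimes by their conditional expectations
            mu_right = t + 1.0 / theta                       # still burning at t
            mu_left = 1.0 / theta - t * np.exp(-theta * t) / (1.0 - np.exp(-theta * t))
            mu = np.where(z == 1, mu_right, mu_left)
            # M step: (r + s) divided by the total of observed and imputed times
            theta_new = (r + s) / (y.sum() + mu.sum())
            if abs(theta_new - theta) < tol:
                return theta_new
            theta = theta_new
        return theta

    # Example with made-up data:
    # print(bulb_em(y=[0.5, 1.2, 0.9], z=[1, 0, 1, 1], t=1.0))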

(16)

Example 3: Binomial-Poisson Mixture

Thisted considers historic data on widows and their dependent children from a Swedish pension fund. If yk denotes the number of widows with k children, then the data values are y0 = 3062, y1 = 587, y2 = 284, y3 = 103, y4 = 33, y5 = 4, and y6 = 2. The fact that most widows have no dependent children suggests that a simple Poisson model would give a poor fit.

A better model is a mixture of a population of widows with no children, population A, and a population of widows having a Poisson number of children, population B. Suppose a widow falls into population A with probability p and into population B with probability 1 − p. Let µ be the mean of the Poisson distribution characterizing population B.

(17)

Example 3: Loglikelihood for the Binomial-Poisson Mixture

1. The parameter vector is θ = (p, µ).

2. Omitting the obvious multinomial coefficient, the loglikelihood for the observed data is

ln g(y | θ) = y0 ln[p + (1 − p)e^{−µ}] + Σ_{k≥1} yk[ln(1 − p) + k ln µ − µ − ln k!].

3. There is no closed-form maximum.

(18)

Example 3: The Complete Data

1. To generate the complete data, we split the y0 widows into xA widows from population A and xB widows from population B.

2. The loglikelihood for the complete data is

ln f(x | θ) = xA ln p + xB[ln(1 − p) − µ] + Σ_{k≥1} yk[ln(1 − p) + k ln µ − µ − ln k!].

3. In the E step, we calculate

xA^n = E(xA | y0, θn) = y0 pn / [pn + (1 − pn)e^{−µn}]
xB^n = E(xB | y0, θn) = y0 − xA^n.

(19)

Example 3: The M Step

1. The surrogate function is

Q(θ | θn) = xA^n ln p + xB^n[ln(1 − p) − µ] + Σ_{k≥1} yk[ln(1 − p) + k ln µ − µ − ln k!].

2. The maximum occurs for

pn+1 = xA^n / (y0 + Σ_{k≥1} yk)
µn+1 = Σ_{k≥1} k yk / (xB^n + Σ_{k≥1} yk).

(20)

Example 3: The Algorithm in Practice

1. The updates are hidden binomial and hidden Poisson updates.

2. The first few iterations are:

p0 = 0.75000, µ0 = 0.40000
p1 = 0.61418, µ1 = 1.03548
p2 = 0.61438, µ2 = 1.03601
p3 = 0.61453, µ3 = 1.03643
p4 = 0.61465, µ4 = 1.03675
p5 = 0.61474, µ5 = 1.03670

3. Convergence is fast at first and then slows.
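A minimal Python sketch of these hidden binomial and hidden Poisson updates, starting from the slide’s values p0 = 0.75 and µ0 = 0.40; the function name and iteration cap are illustrative assumptions.

    import numpy as np

    def zip_em(y_counts, p=0.75, mu=0.40, n_iter=100):
        """EM for the binomial-Poisson mixture; y_counts[k] = number of widows
        with k children."""
        y = np.asarray(y_counts, dtype=float)
        ks = np.arange(len(y))
        y0, n_pos = y[0], y[1:].sum()        # zero class and positive classes
        total_children = (ks * y).sum()
        for _ in range(n_iter):
            # E step: split the y0 widows between populations A and B
            xA = y0 * p / (p + (1 - p) * np.exp(-mu))
            xB = y0 - xA
            # M step
            p = xA / (y0 + n_pos)
            mu = total_children / (xB + n_pos)
        return p, mu

    # Widow data from the slides:
    # print(zip_em([3062, 587, 284, 103, 33, 4, 2]))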

(21)

Example 4: Transmission Tomography

1. Recall that the observed data consist of the photon counts yi for the various projection lines i. Along projection i the number of photons that begin the journey from source to detector follows a Poisson distribution with mean di. Pixel j is assigned attenuation coefficient θj, and projection i intersects pixel j over a distance of lij.

2. With this notation the loglikelihood of the observed data is

ln g(y | θ) = Σ_i [−di e^{−Σ_j lij θj} − yi Σ_j lij θj + yi ln di − ln yi!].

(22)

Example 4: Complete Data

1. The complete data consist of the number of photons that enter pixel j along projection i for all pairs i and j.

2. Since transmission acts independently along each projection, we focus on a single projection and drop the projection subscript. Let y = xm be the number of photons detected and xj the number of photons entering pixel j. Here we assume m − 1 pixels along the projection.

3. Each of these random variables is Poisson; x1 is Poisson by virtue of how X-rays are generated, and xj is Poisson because random thinning turns one Poisson process into another.

(23)

Example 4: Complete Data Likelihood

1. For the sake of simplicity, we omit a smoothing prior.

2. Given xj, the number of photons xj+1 passing through pixel j is binomial with xj trials and success probability e^{−lj θj}.

3. The complete data loglikelihood is therefore

ln f(x | θ) = −d + x1 ln d − ln x1! + Σ_{j=1}^{m−1} [ln (xj choose xj+1) − xj+1 lj θj + (xj − xj+1) ln(1 − e^{−lj θj})].

(24)

Example 4: E Step

1. To complete the E step, it suffices to calculate E(xj | xm, θn) for each j.

2. The unconditional mean is µj = E(xj) = d e^{−Σ_{k=1}^{j−1} lk θk}.

3. On the next slide we show that E(xj | xm, θn) = µj − µm + xm.

4. This will show in the original notation that

Q(θ | θn) = Σ_i Σ_j [−rij^n lij θj + (sij^n − rij^n) ln(1 − e^{−lij θj})]

for computable constants rij^n and sij^n depending on θn and the yi.

(25)

Example 4: Calculation of E(xj | xm, θn)

Suppose U and V are Poisson counts. If V is generated from U by randomly thinning each U point with probability 1 − p, then U − V is Poisson and independent of V. If U has mean µ, then

Pr(U − V = j | V = k) = {[µ^{j+k}/(j+k)!] e^{−µ} ((j+k) choose k) p^k (1 − p)^j} / {[(pµ)^k/k!] e^{−pµ}}
                      = {[(1 − p)µ]^j / j!} e^{−(1−p)µ}.

Thus, E(U − V | V) = (1 − p)µ and E(U | V) = V + (1 − p)µ. Now apply this to V = xm and U = xj.
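As a quick numerical sanity check of this conditional expectation, the short Python simulation below (an illustration, not part of the slides; the parameter values are arbitrary) estimates E(U | V = k) by Monte Carlo and compares it with k + (1 − p)µ.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, p, k = 20.0, 0.6, 10                      # arbitrary illustrative values
    U = rng.poisson(mu, size=200_000)             # parent Poisson counts
    V = rng.binomial(U, p)                        # keep each point with probability p
    print(U[V == k].mean(), k + (1 - p) * mu)     # both should be close to 18.0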

(26)

Example 4: M Step

1. The surrogate function

Q(θ | θn) = Σ_i Σ_j [−rij^n lij θj + (sij^n − rij^n) ln(1 − e^{−lij θj})]

separates the parameters.

2. To find the maximum of the part containing θj, one must solve a transcendental equation numerically.

3. This is not hard, but the EM algorithm is inferior to the MM algorithm posed earlier because of the work involved in computing the constants rij^n and sij^n. In essence, one must exponentiate all of the partial line integrals running from the source to each intermediate pixel along each projection.
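For point 2 above, setting the derivative of the θj part of Q(θ | θn) to zero gives the stationarity condition Σ_i lij[−rij^n + (sij^n − rij^n) e^{−lij θj}/(1 − e^{−lij θj})] = 0. The Python sketch below solves this one-dimensional equation with a bracketing root finder; the constants r, s, l for a single pixel are made-up placeholders, and the bracket endpoints are assumptions.

    import numpy as np
    from scipy.optimize import brentq

    def update_theta_j(r, s, l, lo=1e-8, hi=50.0):
        """M step for one pixel: r[i], s[i] are the E step constants for
        projection i and l[i] the intersection length."""
        r, s, l = map(np.asarray, (r, s, l))

        def dQ(theta):  # derivative of the theta_j part of the surrogate
            e = np.exp(-l * theta)
            return np.sum(l * (-r + (s - r) * e / (1.0 - e)))

        return brentq(dQ, lo, hi)  # dQ decreases from +inf to a negative limit

    # Illustrative made-up constants for a single pixel:
    # print(update_theta_j(r=[4.0, 3.5], s=[6.0, 5.0], l=[1.0, 0.8]))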

(27)

Concluding Comments on the EM Algorithm

1. It always involves missing information. Recognizing an appropriate complete data framework is often fairly natural.

2. The E step can involve tricky conditional expectations. Never guess at the form of the surrogate. Work through the recipe.

3. Convergence can be very slow on some problems and is intimately related to the amount of missing information.

4. Intermediate quantities in the algorithm often have useful statistical interpretations.

5. Every EM algorithm is an MM algorithm, so all convergence results for MM algorithms apply to EM algorithms as well.

(28)

References

1. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Stat Soc B 39:1–38

2. Lange K (1999) Numerical Analysis for Statisticians. Springer-Verlag, New York

3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. Wiley, New York

4. McLachlan GJ, Krishnan T (1996) The EM Algorithm and Extensions. Wiley, New York
