(1)

The EM Algorithm

Kenneth Lange

Departments of Biomathematics and Human Genetics

David Geffen School of Medicine at UCLA

May, 2003

(2)

Overview of the EM Algorithm

1. Maximum likelihood estimation is ubiquitous in statistics.

2. EM is a special case of the MM algorithm that relies on the notion of missing information.

3. The surrogate function is created by calculating a certain conditional expectation. Sometimes an MM and an EM algorithm coincide for the same problem; sometimes not.

4. Convexity enters through Jensen’s inequality.

5. Many examples were known before the general principle was enunciated.

(3)

Nature of Missing Information

1. Missing information can take the form of missing data.

2. Missing information can also be more abstract. Even with perfect data collection, there can be missing information.

3. For instance, in PET scanning, current machines cannot determine where along a projection line a decay event has taken place. If we knew the pixel of origin of each decay event, then estimating the concentration of a radio-labeled compound would be straightforward.

4. The complete data should be conceptualized to make maximization of the complete data loglikelihood as easy as possible.

(4)

Ingredients of the EM Algorithm

1. The observed data y with likelihood g(y | θ). Here θ is a parameter vector.

2. The complete data x with likelihood f(x | θ).

3. The conditional expectation

Q(θ | θn) = E[ln f(x | θ) | y, θn]

furnishes the minorizing function up to a constant. Here θn is the value of θ at iteration n of the EM algorithm.

4. Calculation of Q(θ | θn) constitutes the E step; maximization of Q(θ | θn) with respect to θ constitutes the M step.
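These ingredients translate directly into code. The following is a minimal, generic EM loop sketched in Python; the helper names estep and mstep, the convergence tolerance, and the iteration cap are illustrative assumptions rather than anything prescribed by the slides.

    import numpy as np

    def em(y, theta0, estep, mstep, tol=1e-8, max_iter=1000):
        """Generic EM loop.
        estep(y, theta) returns the conditional expectations (sufficient
        statistics) that define the surrogate Q(theta | theta_n);
        mstep(stats) returns the theta maximizing that surrogate."""
        theta = np.asarray(theta0, dtype=float)
        for n in range(1, max_iter + 1):
            stats = estep(y, theta)                 # E step
            theta_new = np.asarray(mstep(stats))    # M step
            if np.max(np.abs(theta_new - theta)) < tol:
                return theta_new, n
            theta = theta_new
        return theta, max_iter

The examples that follow in the slides specialize the E and M steps to particular models.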

(5)

Minorization Property of the EM Algorithm

1. The proof depends on Jensen’s inequality E[h(Z)] ≥ h[E(Z)] for a random variable Z and convex function h(z).

2. If p(z) and q(z) are probability densities with respect to a measure µ, then the convexity of −ln z implies the information inequality

Ep[ln p] − Ep[ln q] = Ep[−ln(q/p)] ≥ −ln Ep(q/p) = −ln ∫ (q/p) p dµ = −ln ∫ q dµ = 0,

with equality when p = q.

3. In the E step minorization, we apply the information inequality to the conditional densities p(x) = f(x | θn)/g(y | θn) and q(x) = f(x | θ)/g(y | θ) of the complete data x given the observed data y.

(6)

Minorization Property II

1. The information inequality Ep[ln p] ≥ Ep[ln q] now yields

Q(θ | θn) − ln g(y | θ) = E[ln(f(x | θ)/g(y | θ)) | y, θn]
≤ E[ln(f(x | θn)/g(y | θn)) | y, θn]
= Q(θn | θn) − ln g(y | θn),

with equality when θ = θn.

2. Thus, Q(θ | θn) − Q(θn | θn) + ln g(y | θn) minorizes ln g(y | θ).

3. In the M step it suffices to maximize Q(θ | θn) since the other two terms of the minorizing function do not depend on θ.
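The payoff of this minorization is the ascent property of the EM algorithm. The following derivation is a worked consequence of the display above, added here for completeness rather than reproduced from the slides: if θn+1 maximizes Q(θ | θn), then

ln g(y | θn+1) ≥ Q(θn+1 | θn) − Q(θn | θn) + ln g(y | θn) ≥ Q(θn | θn) − Q(θn | θn) + ln g(y | θn) = ln g(y | θn),

so the observed data loglikelihood can never decrease from one iteration to the next.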

(7)

Example 1: Twin Data

1. You are given a sample of m male twin pairs, f female twin pairs, and o opposite sex twin pairs. Estimate the probability p that a twin pair is identical and the probability q that a child is male.

2. Here y = (m, f, o) is the observed data and θ = (p, q) is the parameter vector. If we knew exactly which pairs of same-sex twins were identical, then it would be easy to estimate p and q. Thus, we postulate complete data x = (m1, m2, f1, f2, o), with m1 representing the number of male identical twin pairs and m2 the number of male non-identical twin pairs. f1 and f2 are defined similarly.

(8)

Example 1: Complete Data Loglikelihood

1. The multinomial complete data likelihood is

f(x | θ) = (m + f + o choose m1, m2, f1, f2, o) (pq)^m1 [(1 − p)q^2]^m2 [p(1 − q)]^f1 × [(1 − p)(1 − q)^2]^f2 [(1 − p)2q(1 − q)]^o,

since identical twins involve one choice of sex and non-identical twins two choices of sex.

2. The complete data loglikelihood is

ln f(x | θ) = (m1 + f1) ln p + (m2 + f2 + o) ln(1 − p) + (m1 + 2m2 + o) ln q + (f1 + 2f2 + o) ln(1 − q) + constant.

(9)

Example 1: E Step

To carry out the E step we calculate

m1^n = E(m1 | y, θn) = m pnqn / [pnqn + (1 − pn)(qn)^2]
m2^n = E(m2 | y, θn) = m (1 − pn)(qn)^2 / [pnqn + (1 − pn)(qn)^2]
f1^n = E(f1 | y, θn) = f pn(1 − qn) / [pn(1 − qn) + (1 − pn)(1 − qn)^2]
f2^n = E(f2 | y, θn) = f (1 − pn)(1 − qn)^2 / [pn(1 − qn) + (1 − pn)(1 − qn)^2]

by applying Bayes’ rule.

(10)

Example 1: M Step

1. The surrogate function is

Q(θ | θn) = (m1^n + f1^n) ln p + (m2^n + f2^n + o) ln(1 − p) + (m1^n + 2m2^n + o) ln q + (f1^n + 2f2^n + o) ln(1 − q) + constant.

2. Straightforward calculus shows the maximum occurs for

pn+1 = (m1^n + f1^n) / (m + f + o)
qn+1 = (m1^n + 2m2^n + o) / (m + f + o + m2^n + f2^n + o).

Note that Q(θ | θn) is separable in the parameters p and q and that m1 + m2 + f1 + f2 + o = m + f + o.

(11)

Example 1: Hidden Binomial Updates

1. Both the number of identical twins and the number of choices of the male sex involve hidden binomial trials.

2. The update in such circumstances takes the form

rn+1 = E(#successes | y, θn) / E(#trials | y, θn)

for r = p or r = q. In the first case the number of trials is fixed at m + f + o, and in the second case the number of trials is random because the number of choices of sex depends on the number of identical twins versus the number of non-identical twins.
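As a concrete illustration of Example 1, here is a short Python sketch of the E and M steps just described. The function name, starting values, and the made-up counts in the trailing comment are assumptions for illustration only.

    def twin_em(m, f, o, p=0.5, q=0.5, n_iter=500, tol=1e-10):
        """EM for the twin model: p = Pr(identical pair), q = Pr(child is male)."""
        total = m + f + o
        for _ in range(n_iter):
            # E step: expected identical/non-identical counts among same-sex pairs
            m1 = m * p * q / (p * q + (1 - p) * q ** 2)
            m2 = m - m1
            f1 = f * p * (1 - q) / (p * (1 - q) + (1 - p) * (1 - q) ** 2)
            f2 = f - f1
            # M step: hidden binomial proportions
            p_new = (m1 + f1) / total
            q_new = (m1 + 2 * m2 + o) / (total + m2 + f2 + o)
            if abs(p_new - p) + abs(q_new - q) < tol:
                return p_new, q_new
            p, q = p_new, q_new
        return p, q

    # Illustrative made-up counts:
    # print(twin_em(m=60, f=55, o=75))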

(12)

Example 2: Light Bulb Lifetimes

1. The random lifetime of a light bulb is postulated to be exponentially distributed with unknown mean 1/θ.

2. The lifetimes y1, . . . , yr of r independent bulbs are observed. A further s independent bulbs are inspected only once, at time t > 0: bulb r + i is registered as still burning, zr+i = 1, or expired, zr+i = 0. Thus, the lifetimes of the second set of bulbs are either right censored (still burning) or left censored (expired).

3. In this situation it is natural to view the complete data as the observed lifetimes y1, . . . , yr supplemented by the unobserved lifetimes xr+1, . . . , xr+s.

(13)

Example 2: Complete Data Loglikelihood

The complete data loglikelihood is

ln f(x | θ) = Σ_{i=1}^{r} (ln θ − θ yi) + Σ_{i=r+1}^{r+s} (ln θ − θ xi)
            = r ln θ − rθȳ + Σ_{i=r+1}^{r+s} (ln θ − θ xi)

for exponentially distributed lifetimes, where ȳ is the average value of y1, . . . , yr.

(14)

Example 2: E Step

1. Because the survival time of a light bulb lacks memory, right censored data gives E(xr+i | zr+i = 1, θn) = t + 1/θn.

2. For left censored data, integration by parts and the fundamental theorem of calculus yield

E(xr+i | zr+i = 0, θn) = ∫_0^t x θn e^{−θn x} dx / ∫_0^t θn e^{−θn x} dx = 1/θn − t e^{−θn t}/(1 − e^{−θn t}).

3. Weighted by their respective probabilities and summed, these two conditional expectations give back the unconditional mean 1/θn of xr+i.

(15)

Example 2: M Step

1. If we let µi^n = E(xi | zi, θn) for i = r + 1, . . . , r + s, then the surrogate function is

Q(θ | θn) = (r + s) ln θ − θ[rȳ + Σ_{i=r+1}^{r+s} µi^n].

2. It is now easy to differentiate and solve for the EM update

θn+1 = (r + s) / (rȳ + Σ_{i=r+1}^{r+s} µi^n).

3. In other words, we fill in the unknown times by their conditional expectations and then identify 1/θn+1 with the average of the observed and imputed lifetimes.
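A small Python sketch of this EM under the assumptions above (exponential lifetimes with rate θ, so mean 1/θ, and s bulbs inspected once at time t); the data in the trailing comment are made up.

    import numpy as np

    def bulb_em(y, z, t, theta=1.0, n_iter=500, tol=1e-12):
        """EM for r fully observed lifetimes y and s bulbs inspected at time t,
        with z[i] = 1 if bulb r+i is still burning at t and 0 otherwise."""
        y = np.asarray(y, dtype=float)
        z = np.asarray(z, dtype=int)
        r, s = len(y), len(z)
        for _ in range(n_iter):
            # E step: impute censored lifetimes by their conditional expectations
            mu_right = t + 1.0 / theta                       # still burning at t
            mu_left = 1.0 / theta - t * np.exp(-theta * t) / (1.0 - np.exp(-theta * t))
            mu = np.where(z == 1, mu_right, mu_left)
            # M step: (r + s) divided by the total of observed and imputed times
            theta_new = (r + s) / (y.sum() + mu.sum())
            if abs(theta_new - theta) < tol:
                return theta_new
            theta = theta_new
        return theta

    # Example with made-up data:
    # print(bulb_em(y=[0.5, 1.2, 0.9], z=[1, 0, 1, 1], t=1.0))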

(16)

Example 3: Binomial-Poisson Mixture

Thisted considers historic data on widows and their dependent children from a Swedish pension fund. If yk denotes the number of widows with k children, then the data values are y0 = 3062, y1 = 587, y2 = 284, y3 = 103, y4 = 33, y5 = 4, and y6 = 2. The fact that most widows have no dependent children suggests that a simple Poisson model would give a poor fit.

A better model is a mixture of a population of widows with no children, population A, and a population of widows having a Poisson number of children, population B. Suppose a widow falls into population A with probability p and into population B with probability 1 − p. Let µ be the mean of the Poisson distribution characterizing population B.

(17)

Example 3: Loglikelihood for the Binomial-Poisson Mixture

1. The parameter vector is θ = (p, µ).

2. Omitting the obvious multinomial coefficient, the loglikelihood for the observed data is

ln g(y | θ) = y0 ln[p + (1 − p)e^{−µ}] + Σ_{k≥1} yk[ln(1 − p) + k ln µ − µ − ln k!].

3. There is no closed-form maximum.

(18)

Example 3: The Complete Data

1. To generate the complete data, we split the y0 widows into xA widows from population A and xB widows from population B.

2. The loglikelihood for the complete data is

ln f(x | θ) = xA ln p + xB[ln(1 − p) − µ] + Σ_{k≥1} yk[ln(1 − p) + k ln µ − µ − ln k!].

3. In the E step, we calculate

xA^n = E(xA | y0, θn) = y0 pn / [pn + (1 − pn)e^{−µn}]
xB^n = E(xB | y0, θn) = y0 − xA^n.

(19)

Example 3: The M Step

1. The surrogate function is

Q(θ | θn) = xA^n ln p + xB^n[ln(1 − p) − µ] + Σ_{k≥1} yk[ln(1 − p) + k ln µ − µ − ln k!].

2. The maximum occurs for

pn+1 = xA^n / (y0 + Σ_{k≥1} yk)
µn+1 = Σ_{k≥1} k yk / (xB^n + Σ_{k≥1} yk).

(20)

Example 3: The Algorithm in Practice

1. The updates are hidden binomial and hidden Poisson updates.

2. The first few iterations are:

p0 = 0.75000, µ0 = 0.40000
p1 = 0.61418, µ1 = 1.03548
p2 = 0.61438, µ2 = 1.03601
p3 = 0.61453, µ3 = 1.03643
p4 = 0.61465, µ4 = 1.03675
p5 = 0.61474, µ5 = 1.03670

3. Convergence is fast at first and then slows.
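A minimal Python sketch of these hidden binomial and hidden Poisson updates, starting from the slide’s values p0 = 0.75 and µ0 = 0.40; the function name and iteration cap are illustrative assumptions.

    import numpy as np

    def zip_em(y_counts, p=0.75, mu=0.40, n_iter=100):
        """EM for the binomial-Poisson mixture; y_counts[k] = number of widows
        with k children."""
        y = np.asarray(y_counts, dtype=float)
        ks = np.arange(len(y))
        y0, n_pos = y[0], y[1:].sum()        # zero class and positive classes
        total_children = (ks * y).sum()
        for _ in range(n_iter):
            # E step: split the y0 widows between populations A and B
            xA = y0 * p / (p + (1 - p) * np.exp(-mu))
            xB = y0 - xA
            # M step
            p = xA / (y0 + n_pos)
            mu = total_children / (xB + n_pos)
        return p, mu

    # Widow data from the slides:
    # print(zip_em([3062, 587, 284, 103, 33, 4, 2]))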

(21)

Example 4: Transmission Tomography

1. Recall that the observed data consist of the photon counts yi for the various projection lines i. Along projection i the number of photons that begin the journey from source to detector follows a Poisson distribution with mean di. Pixel j is assigned attenuation coefficient θj, and projection i intersects pixel j over a distance of lij.

2. With this notation the loglikelihood of the observed data is

ln g(y | θ) = Σ_i [−di e^{−Σ_j lij θj} − yi Σ_j lij θj + yi ln di − ln yi!].

(22)

Example 4: Complete Data

1. The complete data consist of the number of photons that enter pixel j along projection i for all pairs i and j.

2. Since transmission acts independently along each projection, we focus on a single projection and drop the projection subscript. Let y = xm be the number of photons detected and xj the number of photons entering pixel j. Here we assume m − 1 pixels along the projection.

3. Each of these random variables is Poisson; x1 is Poisson by virtue of how X-rays are generated, and xj is Poisson because random thinning turns one Poisson process into another.

(23)

Example 4: Complete Data Likelihood

1. For the sake of simplicity, we omit a smoothing prior.

2. Given xj, the number of photons xj+1 passing through pixel j is binomial with xj trials and success probability e^{−lj θj}.

3. The complete data loglikelihood is therefore

ln f(x | θ) = −d + x1 ln d − ln x1! + Σ_{j=1}^{m−1} [ln (xj choose xj+1) − xj+1 lj θj + (xj − xj+1) ln(1 − e^{−lj θj})].

(24)

Example 4: E Step

1. To complete the E step, it suffices to calculate E(xj | xm, θn) for each j.

2. The unconditional mean is µj = E(xj) = d e^{−Σ_{k=1}^{j−1} lk θk}.

3. On the next slide we show that E(xj | xm, θn) = µj − µm + xm.

4. This will show in the original notation that

Q(θ | θn) = Σ_i Σ_j [−rij^n lij θj + (sij^n − rij^n) ln(1 − e^{−lij θj})]

for computable constants rij^n and sij^n depending on θn and the yi.

(25)

Example 4: Calculation of E(xj | xm, θn)

Suppose U and V are Poisson counts. If V is generated from U by randomly thinning each U point with probability 1 − p, then U − V is Poisson and independent of V. If U has mean µ, then

Pr(U − V = j | V = k) = {[µ^{j+k}/(j+k)!] e^{−µ} ((j+k) choose k) p^k (1 − p)^j} / {[(pµ)^k/k!] e^{−pµ}}
                      = {[(1 − p)µ]^j / j!} e^{−(1−p)µ}.

Thus, E(U − V | V) = (1 − p)µ and E(U | V) = V + (1 − p)µ. Now apply this to V = xm and U = xj.
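As a quick numerical sanity check of this conditional expectation, the short Python simulation below (an illustration, not part of the slides; the parameter values are arbitrary) estimates E(U | V = k) by Monte Carlo and compares it with k + (1 − p)µ.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, p, k = 20.0, 0.6, 10                      # arbitrary illustrative values
    U = rng.poisson(mu, size=200_000)             # parent Poisson counts
    V = rng.binomial(U, p)                        # keep each point with probability p
    print(U[V == k].mean(), k + (1 - p) * mu)     # both should be close to 18.0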

(26)

Example 4: M Step

1. The surrogate function

Q(θ | θn) = Σ_i Σ_j [−rij^n lij θj + (sij^n − rij^n) ln(1 − e^{−lij θj})]

separates the parameters.

2. To find the maximum of the part containing θj, one must solve a transcendental equation numerically.

3. This is not hard, but the EM algorithm is inferior to the MM algorithm posed earlier because of the work involved in computing the constants rij^n and sij^n. In essence, one must exponentiate all of the partial line integrals running from the source to each intermediate pixel along each projection.
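For point 2 above, setting the derivative of the θj part of Q(θ | θn) to zero gives the stationarity condition Σ_i lij[−rij^n + (sij^n − rij^n) e^{−lij θj}/(1 − e^{−lij θj})] = 0. The Python sketch below solves this one-dimensional equation with a bracketing root finder; the constants r, s, l for a single pixel are made-up placeholders, and the bracket endpoints are assumptions.

    import numpy as np
    from scipy.optimize import brentq

    def update_theta_j(r, s, l, lo=1e-8, hi=50.0):
        """M step for one pixel: r[i], s[i] are the E step constants for
        projection i and l[i] the intersection length."""
        r, s, l = map(np.asarray, (r, s, l))

        def dQ(theta):  # derivative of the theta_j part of the surrogate
            e = np.exp(-l * theta)
            return np.sum(l * (-r + (s - r) * e / (1.0 - e)))

        return brentq(dQ, lo, hi)  # dQ decreases from +inf to a negative limit

    # Illustrative made-up constants for a single pixel:
    # print(update_theta_j(r=[4.0, 3.5], s=[6.0, 5.0], l=[1.0, 0.8]))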

(27)

Concluding Comments on the EM Algorithm

1. It always involves missing information. Recognizing an appropriate complete data framework is often fairly natural.

2. The E step can involve tricky conditional expectations. Never guess at the form of the surrogate. Work through the recipe.

3. Convergence can be very slow on some problems and is intimately related to the amount of missing information.

4. Intermediate quantities in the algorithm often have useful statistical interpretations.

5. Every EM algorithm is an MM algorithm, so all convergence results for MM algorithms apply to EM algorithms as well.

(28)

References

1. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Stat Soc B 39:1–38

2. Lange K (1999) Numerical Analysis for Statisticians. Springer-Verlag, New York

3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. Wiley, New York

4. McLachlan GJ, Krishnan T (1996) The EM Algorithm and Extensions. Wiley, New York
