The Expectation-Maximization Algorithm: A Method for Modelling Mixtures of Distributions
To write this post on the expectation-maximization algorithm, I used the following academic sources: Bishop, C. (2006). Pattern recognition and machine learning. Springer New York. Retrieved from bit.ly/411YnEq; and Do, C. & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899. https://doi.org/10.1038/nbt1406. I also used the following lectures/blogs as aids: EM blog and EM lecture.
Introduction to Mixture Models
Consider a situation where a researcher entrusts a colleague to flip coins and record the results of 10 flips. Flips that result in heads are recorded as 1 and flips that result in tails are recorded as 0. Importantly, before each flip, the colleague picks one of two coins (according to another probability mass function) but does not tell the researcher which coin was flipped. When the colleague records the results of the 10 coin flips, they provide the researcher with only the recorded outcomes. From these data, the researcher wants to estimate the following parameters (a simulation sketch of this generative process follows the list):
- The probability of picking each of the two coins (i.e., a probability mass function). Given that these two probability values sum to one, only the probability of picking one coin must be estimated.
- The first coin's probability of heads.
- The second coin's probability of heads.
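To make this generative process concrete, the short simulation below mimics the colleague's procedure. The specific parameter values (the probability of picking the first coin and each coin's probability of heads) are illustrative assumptions and not values from the post.
import numpy as np

rng = np.random.default_rng(seed=1)

#illustrative (assumed) parameter values
prob_pick_coin_1 = 0.4        #probability of selecting the first coin before any flip
prob_heads = [0.2, 0.8]       #probability of heads for the first and second coin

flips = []
for _ in range(10):
    #the colleague secretly picks a coin according to the mixing probabilities...
    coin = rng.choice([0, 1], p = [prob_pick_coin_1, 1 - prob_pick_coin_1])
    #...flips it, and records only the outcome (1 = heads, 0 = tails)
    flips.append(rng.binomial(n = 1, p = prob_heads[coin]))
print(flips)  #the researcher sees only this list, never which coin produced each flip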
Mixture Models Cannot be Estimated With Maximum Likelihood Estimation
One way to estimate the parameters (
where
To find maximum likelihood estimates for the parameters, partial derivatives can be computed with respect to each parameter (
t indicates a coin flip that results in "tails" (i.e.,
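To see why no closed-form maximum likelihood solution exists, the sketch below uses SymPy to differentiate the incomplete-data log-likelihood of the two-coin mixture. The symbols are my own shorthand (mu1 for the probability of picking the first coin, p1 and p2 for each coin's probability of heads, h for the number of heads in the 10 flips) and may not match the post's original notation.
import sympy as sp

#symbols: probability of picking the first coin, each coin's probability of heads,
#and the number of heads observed across the 10 flips
mu1, p1, p2, h = sp.symbols('mu1 p1 p2 h', positive=True)

#incomplete-data log-likelihood of the two-coin mixture for h heads and 10 - h tails
log_like = (h * sp.log(mu1 * p1 + (1 - mu1) * p2)
            + (10 - h) * sp.log(mu1 * (1 - p1) + (1 - mu1) * (1 - p2)))

#the stationarity condition for p1 keeps the mixture sums in its denominators, so it
#cannot be solved for p1 without also knowing mu1 and p2 (and vice versa); the
#stationarity equations are coupled and admit no closed-form solution
print(sp.diff(log_like, p1))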
Mixture Models Can be Estimated with the Expectation-Maximization (EM) Algorithm
Unlike maximum likelihood estimation, the expectation-maximization (EM) algorithm provides viable parameter estimates for modelling mixture distributions. The EM algorithm works because it indirectly maximizes the incomplete-data log-likelihood; that is, it does not directly operate on the incomplete-data log-likelihood. To act as an indirect estimation method, the EM algorithm begins by modelling the uncertainty of the coin's identity on each flip as a latent variable: a variable that is assumed to exist but has not been directly measured, whether by choice or because direct measurement is impossible. To model the coin's identity on each
where
Although including a binary exponent on the likelihood computation seems inconsequential, it provides two desirable outcomes. First, it allows viable parameter estimates to be computed. In taking the logarithm of the complete-data likelihood to prevent underflow, the complete-data log-likelihood is obtained and is shown below in Equation
Second, using one-hot encoding creates an equivalence between the complete- and incomplete-data log-likelihoods. As shown below, the complete-data log-likelihood becomes the incomplete-data log-likelihood after summing over all
Despite the benefits of using one-hot encoding, one challenge remains. The complete-data log-likelihood cannot yet be computed because the latent variable memberships for each
Although the complete-data log-likelihood cannot be directly computed, the EM algorithm finds a clever way to circumvent this problem in the E step.
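Before turning to the E step, it may help to see what the complete-data log-likelihood would look like if the coin identities were known. In the sketch below, the one-hot membership vectors and the parameter values are purely hypothetical assumptions; in reality, the researcher never observes the memberships.
import numpy as np

#the 10 recorded flips and a hypothetical one-hot membership (coin identity) per flip
flips = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0],
              [0, 1], [0, 1], [1, 0], [0, 1], [0, 1]])

#assumed parameter values: mixing probabilities and each coin's probability of heads
mu = np.array([0.3, 0.7])
p = np.array([0.6, 0.8])

#complete-data log-likelihood: the one-hot exponents pick out, for each flip, only the
#term belonging to the coin that (hypothetically) generated it
log_terms = np.log(mu) + flips[:, None] * np.log(p) + (1 - flips[:, None]) * np.log(1 - p)
print(np.sum(z * log_terms))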
Expectation (E) Step: Using Expectations to Estimate the Distribution of the Latent Variable
The E step takes advantage of the equivalence between the incomplete- and complete-data log-likelihoods and applies two clever tricks to work with the complete-data log-likelihood. First, it applies what appears to be an inconsequential algebraic manipulation, whereby the complete-data log-likelihood is multiplied and divided by some distribution on the latent variable
Second, given that the logarithm is a concave function, Jensenβs inequality can be applied to convert Equation
In other words, Equation
where the summation over
Although we still do not have a way for computing
where I have used likelihood notation in Equation
Because these values represent the (posterior) probability of membership to each
To show that the lower bound is equal to the complete-data log-likelihood when
import numpy as np
import pandas as pd
from scipy.stats import binom
def e_step(data, mu, p, n = 1):
"""
Compute expectations (i.e., responsibilities) for each data point's membership to each mixture
Parameters:
- data: data set
- mu: Probability of each component
- p: Probabilities of success for each binomial distribution
Returns:
- pandas dataframe
"""
    assert len(mu) == len(p), "Number of estimates in mu must equal the number of success probabilities"
    assert np.isclose(sum(mu), 1), "Sum of mu should be equal to 1"
#unnormalized responsibilities for each data point for each mixture
unnormalized_responsibilities = [mu * binom.pmf(x, n=n, p= np.array(p)) for x in data]
#normalized responsibilities (i.e., probabilities)
normalized_responsibilities = [rp / np.sum(rp) for rp in unnormalized_responsibilities]
column_names = ['resp_mixture_{}'.format(mix+1) for mix in range(len(normalized_responsibilities[0]))]
df_responsibilities = pd.DataFrame(np.vstack(normalized_responsibilities),
columns = column_names)
#insert data column as the first one
df_responsibilities.insert(0, 'data', data)
return(df_responsibilities)
#incomplete/complete-data log-likelihood
def compute_incomplete_log_like(data, mu, p):
"""
Compute incomplete-data log-likelihood
Parameters:
- data: data set
- mu: Probability of each component
- p: Probability of success for each binomial distribution
"""
#probability of each data point coming from each distribution
    mixture_sums = [np.sum(mu * binom.pmf(flip_result, n=1, p= np.array(p))) for flip_result in data]
#log of mixture_sums
log_mixture_sums = np.log(mixture_sums)
#sums of log of mixture_sums
incomplete_like = np.sum(log_mixture_sums)
return(incomplete_like)
#lower bound = expected complete-data log-likelihood + entropy
def compute_lower_bound(responsibilities, mu, p):
#expected complete-data log-likelihood
expected_complete_data_like = responsibilities.apply(compute_expected_complete_like, mu = mu, p = p, axis=1).sum()
##compute entropy
entropy = compute_entropy(responsibilities = responsibilities)
return expected_complete_data_like + entropy
#entropy: negative sum of rs*log(rs) over all rs (responsibilities)
def compute_entropy(responsibilities):
    ##extract responsibility columns and then compute entropy
    resp_columns = responsibilities.filter(like = 'resp_mixture')
    ##take the negative sum of x*log(x) over each responsibility
    entropy = -np.sum(resp_columns.values * np.log(resp_columns.values))
    return entropy
#expected complete-data log-likelihood
def compute_expected_complete_like(row, mu, p):
resp_columns = [col for col in row.index if 'resp_mixture' in col]
resp_values = [row[col] for col in resp_columns]
return np.sum(
[resp_values * (np.log(mu) +
row['data'] * np.log(p) + #non-zero if flip result is success (i.e., 'heads')
(1 - row['data']) * np.log(1 - np.array(p)) #non-zero if flip result is failure (i.e., 'tails')
)]
)
#data given to researcher
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
#initial guesses for E step
mu = [0.3, 0.7] #mixture probabilities
p = [0.6, 0.8] #success probabilities
#E step
responsibilities = e_step(data = binom_mixture_data, mu = mu, p = p)
incomplete_log_likelihood = np.round(compute_incomplete_log_like(data = binom_mixture_data, mu = mu, p = p), 5)
evidence_lower_bound = np.round(compute_lower_bound(responsibilities = responsibilities, mu = mu, p = p), 5)
print('Incomplete-data log-likelihood:', incomplete_log_likelihood)
print('Lower bound:', evidence_lower_bound)
Incomplete-data log-likelihood: -9.28686 Lower bound: -9.28686
Therefore, by introducing a distribution of the latent variable,
Maximization (M) Step: Using Expectations to Obtain New Parameter Estimates
In this section, I first discuss the intuition of computing new parameter estimates with the expectations (see intuition of M step) and then show how these estimates are obtained by taking the appropriate partial derivatives (see computation of new estimates).
Understanding the Intuition of the M Step
Beginning with the intuition, the M step uses the responsibilities obtained in the E step to compute new parameter estimates for
To understand the logic behind Equation
A Brief Review of Entropy and the Kullback-Leibler (KL) Divergence
Consider an example where an analyst for a wine merchant records wine preferences among customers over a two-year period. Specifically, the analyst asks customers for their favourite type of wine from red, white, and sparkling. Figure 1 shows the customers' preferences in each year.
[Figure 1. Customers' wine preferences in each year.]
To quantify the difference between the probability distributions with a single value, the analyst uses the Kullback-Leibler (KL) divergence shown below in Equation
To understand the KL divergence, it is helpful to understand each of its three computations (for an excellent explanation, see KL divergence) that are presented below:
- The ratio of each wine type's Year 2 probability to its Year 1 probability measures the change in that wine type relative to Year 1.
- Taking the logarithm of each ratio gives equal weight to reciprocal changes. As an example, consider the change in preferences across the two years for red and white wine. Across the two years, the preference for red wine increases from 20% to 40%, whereas the preference for white wine decreases from 40% to 20%. Given that these changes are exactly the same in relative terms (one probability doubles while the other halves), they should contribute the same amount to the total difference between the years. Using the logarithm accomplishes this goal: whereas the raw ratios (2 and 1/2) differ, their logarithms are equal in magnitude and opposite in sign.
- Each log ratio is weighted by its current probability (i.e., its Year 2 probability).
Thus, the KL divergence measures the difference between two probability distributions,
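Translating the wine example into code, the sketch below computes the KL divergence between the two years' preference distributions. Only the red and white percentages are stated above, so the sparkling percentages (40% in both years) are an assumption made to complete the distributions.
import numpy as np

#wine preference distributions over [red, white, sparkling]
year_1 = np.array([0.20, 0.40, 0.40])  #sparkling value assumed
year_2 = np.array([0.40, 0.20, 0.40])  #sparkling value assumed

#KL divergence: weight each log ratio (change relative to Year 1) by its Year 2 probability
kl_divergence = np.sum(year_2 * np.log(year_2 / year_1))
print('KL divergence:', np.round(kl_divergence, 5))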
To understand why the KL divergence is always non-negative, it is important to understand entropy and cross-entropy (for an excellent explanation, see entropy & cross-entropy). If we expand the KL divergence expression of Equation
The first term of Equation
The second term of Equation
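Continuing the wine sketch with the same assumed distributions, the KL divergence can equivalently be computed as the cross-entropy minus the entropy, which mirrors the two terms discussed above.
import numpy as np

#same assumed distributions as in the previous sketch
year_1 = np.array([0.20, 0.40, 0.40])
year_2 = np.array([0.40, 0.20, 0.40])

#entropy of the Year 2 distribution: -sum of P(x)*log(P(x))
entropy_year_2 = -np.sum(year_2 * np.log(year_2))
#cross-entropy with P = Year 2 and Q = Year 1: -sum of P(x)*log(Q(x))
cross_entropy = -np.sum(year_2 * np.log(year_1))
#the KL divergence equals the cross-entropy minus the entropy
print('Cross-entropy minus entropy:', np.round(cross_entropy - entropy_year_2, 5))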
Computing New Parameter Estimates Increases the Incomplete-Data Log-Likelihood by at Least as Much as the Evidence Lower Bound
Returning to the M step, the incomplete-data log-likelihood increases by at least as much as the evidence lower bound increases. The inequality between the increase in the incomplete-data log-likelihood and the increase in the evidence lower bound after optimizing the lower bound with respect to
- Point 1: After the M step, the evidence lower bound increases only as much as the expected complete-data log-likelihood.
- Point 2: After the M step, the incomplete-data log-likelihood increases by as much as the evidence lower bound plus the Kullback-Leibler (KL) divergence of the new responsibilities, , with respect to the old responsibilities, .
Point 1: The Increase in the Evidence Lower Bound is Equal to the Increase in the Expected Complete-Data Log-Likelihood
In computing new parameter estimates in the M step,
Below, I show that, when determining the parameter values,
Therefore, optimizing the evidence lower bound with respect to
To more concisely represent the difference between the new and old evidence lower bounds, I use auxiliary function notation in Equation
Also note that I fix the values of
def m_step(responsibilities, n = 1):
#isolate columns that contain responsibilities
resp_cols = responsibilities.filter(like = 'resp_mixture')
#New estimate for the probability of heads
"""Weigh each data point by the corresponding responsibility and divide by the sum
of responsibilities for each coin (N_k)."""
#specify axis=1 so that operations are conducted along rows
p_new = np.sum(responsibilities.filter(regex='^resp_mixture').mul(responsibilities['data'],
axis=0))/(np.sum(resp_cols)*n)
#new mixture probabilities
mu_new = resp_cols.sum()/resp_cols.sum().sum()
return pd.DataFrame({'p_new': p_new, 'mu_new': mu_new})
#data given to researcher
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
#initial guesses for E step
mu_fixed = [0.5, 0.5] #mixture probabilities are fixed so that convergence does not occur in one trial
p = [0.6, 0.8] #success probabilities
#E step
responsibilities = e_step(data = binom_mixture_data, mu = mu_fixed, p = p)
#M step
estimates = m_step(responsibilities = responsibilities)
#amount that lower bound increased by after M step
optimized_lower_bound = compute_lower_bound(responsibilities = responsibilities, mu = mu_fixed, p = estimates['p_new'])
expectation_lower_bound = compute_lower_bound(responsibilities = responsibilities, mu = mu_fixed, p = p)
lower_bound_increase = np.round(optimized_lower_bound - expectation_lower_bound, 5)
#amount that expected complete-data log-likelihood increases
optimized_expected_complete = responsibilities.apply(compute_expected_complete_like, mu = mu_fixed, p = estimates['p_new'], axis=1).sum()
expectation_expected_complete = responsibilities.apply(compute_expected_complete_like, mu = mu_fixed, p = p, axis=1).sum()
expected_complete_increase = np.round(optimized_expected_complete - expectation_expected_complete, 5)
print('Increase in evidence lower bound = ', lower_bound_increase)
print('Increase in expected complete-data log-likelihood:', expected_complete_increase)
Increase in evidence lower bound = 1.81803 Increase in expected complete-data log-likelihood: 1.81803
Point 2: The Incomplete-Data Log-Likelihood Increases by at Least as Much as the Evidence Lower Bound
In computing new parameter estimates in the M step,
As a brief aside, we can see that, by setting
After computing new parameter estimates in the M step, the value of the incomplete-data log-likelihood increases at these new values
Note that Equation
def compute_cross_entropy(old_responsibilities, new_responsibilities):
    ##extract responsibility columns and then compute the cross-entropy, -\sum_x P(x)*log(Q(x))
    resp_columns_old = old_responsibilities.filter(like = 'resp_mixture')
    resp_columns_new = new_responsibilities.filter(like = 'resp_mixture')
    cross_entropy = -np.sum(resp_columns_old.values * np.log(resp_columns_new.values))
    return cross_entropy
#data given to researcher
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
#initial guesses for E step
mu_fixed = [0.5, 0.5] #mixture probabilities
p = [0.6, 0.8] #success probabilities
#First iteration of EM algorithm
old_responsibilities = e_step(data = binom_mixture_data, mu = mu_fixed, p = p) #E step
estimates = m_step(responsibilities = old_responsibilities) # M step
#Second iteration of EM algorithm (only E step)
new_responsibilities = e_step(data = binom_mixture_data, mu = mu_fixed, p = estimates['p_new'])
#amount that the incomplete-data log-likelihood increases after the M step: increase in lower bound + KL divergence
##1) Increase in lower bound
optimized_lower_bound = compute_lower_bound(responsibilities = old_responsibilities, mu = mu_fixed, p = estimates['p_new'])
expectation_lower_bound = compute_lower_bound(responsibilities = old_responsibilities, mu = mu_fixed, p = p)
lower_bound_increase = optimized_lower_bound - expectation_lower_bound
##2) KL divergence between new and old responsibilities
entropy = compute_entropy(responsibilities = old_responsibilities)
cross_entropy = compute_cross_entropy(old_responsibilities = old_responsibilities,
new_responsibilities = new_responsibilities)
KL_divergence = -entropy + cross_entropy
inc_lower_bound_plus_KL = np.round(lower_bound_increase + KL_divergence, 5)
#amount that incomplete-data log-likelihood increases
new_incomplete_like = compute_incomplete_log_like(data = binom_mixture_data, mu = mu_fixed, p = estimates['p_new'])
old_incomplete_like = compute_incomplete_log_like(data = binom_mixture_data, mu = mu_fixed, p = p)
incomplete_like_increase = np.round(new_incomplete_like - old_incomplete_like, 5)
print('Increase in evidence lower bound + KL divergence = ', inc_lower_bound_plus_KL)
print('Increase in incomplete-data log-likelihood:', incomplete_like_increase)
Increase in evidence lower bound + KL divergence = 1.91468 Increase in incomplete-data log-likelihood: 1.91468
Computation of New Parameter Estimates
Ending with the computation of the new parameter estimates, I will now show how the M step obtains new parameter values,
For clarity, I have superscripted each parameter with
Beginning with the probability of selecting each
In looking at Equation
We can then plug in the solution for
Ending with the probability of heads for each
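For reference, written in the notation of the code above (responsibilities r_nk for flip n and mixture k, flip results x_n, and N = 10 flips), the resulting M-step updates take the standard form shown below; the symbols are my own shorthand rather than the post's original equation labels.
\[
N_k = \sum_{n=1}^{N} r_{nk}, \qquad
\mu_k^{\text{new}} = \frac{N_k}{N}, \qquad
p_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{N_k}.
\]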
Conclusion
In conclusion, the expectation-maximization (EM) algorithm provides a method for indirectly optimizing the incomplete-data log-likelihood. To work with the incomplete-data log-likelihood, the EM algorithm uses a lower-bounding function in the E step that can be used to obtain responsibilities. The responsibilities are then used in the M step to optimize the lower-bounding function and obtain new parameter estimates. Because optimizing the lower-bounding function increases the incomplete-data log-likelihood by at least as much as the lower bound itself, another E step is needed to re-tighten the bound. The E and M steps repeat until the parameter values stop updating.
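Tying the pieces together, here is a minimal sketch of the full EM loop built from the e_step, m_step, and compute_incomplete_log_like functions defined above; the convergence tolerance and the iteration cap are my own arbitrary choices.
#full EM loop: alternate E and M steps until the parameter estimates stop changing
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
mu = [0.3, 0.7]    #initial mixture probabilities
p = [0.6, 0.8]     #initial success probabilities
tolerance = 1e-8   #assumed convergence tolerance
for iteration in range(100):  #assumed iteration cap
    responsibilities = e_step(data = binom_mixture_data, mu = mu, p = p)  #E step
    estimates = m_step(responsibilities = responsibilities)               #M step
    mu_new = estimates['mu_new'].tolist()
    p_new = estimates['p_new'].tolist()
    #stop once neither set of parameters changes appreciably
    if np.allclose(mu, mu_new, atol = tolerance) and np.allclose(p, p_new, atol = tolerance):
        break
    mu, p = mu_new, p_new
print('Estimated mixture probabilities:', np.round(mu, 3))
print('Estimated success probabilities:', np.round(p, 3))
print('Incomplete-data log-likelihood:', np.round(compute_incomplete_log_like(binom_mixture_data, mu, p), 5))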
References
Appendix A: Proof of Jensen's Inequality
Jensen's inequality applies to convex and concave functions. Beginning with convex functions, we can use the first-order condition to prove Jensen's inequality, which is shown below in Equation
We can then take the expectation of Equation
For concave functions, we simply change the direction of the inequality in Equation
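As a quick numerical check of Jensen's inequality for the concave logarithm, the sketch below compares log(E[X]) with E[log(X)] for a positive random variable; the particular distribution is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.gamma(shape = 2.0, scale = 1.0, size = 100_000)  #arbitrary positive random variable

#for the concave logarithm, Jensen's inequality states that log(E[X]) >= E[log(X)]
print('log(E[X]) =', np.log(np.mean(x)))
print('E[log(X)] =', np.mean(np.log(x)))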
Appendix B: Proof that KL Divergence is Always Non-Negative
As explained in the section on the KL divergence and entropy, the KL divergence measures the difference between two probability distributions,
Because the KL divergence can be written in terms of the negative logarithm, which is a convex function, we can use Jensen's inequality to transform the KL divergence into an inequality by moving the summation inside the logarithm (Equation
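To complement the proof, the small sketch below draws random pairs of discrete distributions and confirms that the KL divergence never goes negative; the number of draws and the number of categories are arbitrary choices.
import numpy as np

rng = np.random.default_rng(seed=2)

def kl_divergence(p, q):
    #KL divergence between two discrete distributions: sum of p*log(p/q)
    return np.sum(p * np.log(p / q))

#draw random pairs of three-category distributions and record each divergence
divergences = [kl_divergence(rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3)))
               for _ in range(1000)]
print('Smallest KL divergence observed:', np.min(divergences))  #never below zero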
Appendix C: Proof of Gibbs' Inequality
Gibbs' inequality states that the cross-entropy,
It should also be noted that maximum likelihood estimation can result in singularities with Gaussian mixture models. In short, if the estimate for a mixture's mean, , happens to exactly match an individual data point, , then the mixture in question can become "stuck" on this data point, with all the other data points being modelled by the other mixtures. With the mixture fixed on the one data point, the variance of the mixture will decrease to zero and the log-likelihood will increase to infinity.
The first value in Equation is technically the negative entropy. Because entropy (and cross-entropy) compute information/surprise, it makes conceptual sense to represent them as positive values. Unfortunately, the term returns negative values. To reflect the conceptualization that entropy computes information, a negative sign is included to convert the negative value returned by into a positive value.