The Expectation-Maximization Algorithm: A Method for Modelling Mixtures of Distributions

The expectation-maximization (EM) algorithm provides a method for modelling mixtures of distributions. I explain the EM algorithm in the context of a coin-flipping example where, for each flip, one of two coins is used, and so a mixture of binomial distributions underlies the data. Although maximum likelihood estimation cannot directly estimate mixture models, the EM algorithm can because it optimizes the likelihood function indirectly. In the expectation (E) step, a lower-bounding function is used to obtain responsibilities. In the maximization (M) step, the lower-bounding function is optimized, with the responsibilities being used to obtain new parameter estimates. By optimizing the lower-bounding function, the likelihood function increases by at least as much as the lower-bounding function, thus necessitating another E step. The E and M steps repeat until the parameter values stop updating.

To write this post on the expectation-maximization algorithm, I used the following academic sources: Bishop (2006), Pattern recognition and machine learning, Springer New York (retrieved from bit.ly/411YnEq), and Do & Batzoglou (2008), What is the expectation maximization algorithm?, Nature Biotechnology, 26(8), 897–899 (https://doi.org/10.1038/nbt1406). I also used the following lectures/blogs as aids: EM blog and EM lecture.

Introduction to Mixture Models

Consider a situation where a researcher entrusts a colleague to flip coins and record the results of 10 flips. Flips that result in heads are recorded as 1 and flips that result in tails are recorded as 0. Importantly, before each flip, the colleague picks one of two coins (according to another probability mass function) but does not tell the researcher which coin was flipped. When the colleague records the results of the 10 coin flips, they provide the researcher with the following data: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]. With these data, the researcher wants to estimate the following three parameters (a short simulation of this data-generating process is sketched after the list below):

  1. The probability of picking each of the two coins (i.e., a probability mass function), $\mu_1$ and $\mu_2$. Given that both these probability values sum to one, only the probability of picking one coin must be estimated, $\mu_1$, with $\mu_2 = 1 - \mu_1$.
  2. The first coin’s probability of heads, $p_1$.
  3. The second coin’s probability of heads, $p_2$.
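
To make this setup concrete, the short sketch below simulates the data-generating process just described. The specific values of `mu_true` and `p_true` are hypothetical choices of mine (the researcher never sees them); only the vector of flip results would be handed over.

import numpy as np

rng = np.random.default_rng(seed=1)

#hypothetical true parameter values (unknown to the researcher)
mu_true = [0.4, 0.6] #probability of picking coin 1 and coin 2
p_true = [0.2, 0.7]  #probability of heads for coin 1 and coin 2

n_flips = 10

#for each flip, first pick a coin, then flip the chosen coin
coin_ids = rng.choice([0, 1], size=n_flips, p=mu_true)
flips = rng.binomial(n=1, p=np.array(p_true)[coin_ids])

print(coin_ids) #hidden from the researcher
print(flips)    #the only data the researcher sees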

Mixture Models Cannot be Estimated With Maximum Likelihood Estimation

One way to estimate the parameters ($\mu_1$, $\mu_2$, $p_1$, $p_2$) is to use maximum likelihood estimation (for a review, see my post on maximum likelihood estimation). In maximum likelihood estimation, we solve for the parameter values that maximize the probability of observing the data. Because the values being maximized are not technically probabilities and are instead likelihoods, maximum likelihood estimation solves for the parameter values with the highest likelihood, as shown below in Equation :

In the current example, the likelihood values can be calculated by using Equation shown below:

where the binomial probability of each data point given a coin is weighted by the corresponding probability of selecting that coin. Importantly, because the researcher does not know which of the two coins produces the result of any flip, any flip could be the result of flipping the first or second coin. To model this uncertainty, the calculation of the likelihood for each coin flip result computes the sum of weighted binomial probabilities. The lack of information surrounding the identity of the coin that produces each flip result also explains why Equation above is often called the incomplete-data likelihood. To prevent underflow (the rounding of small numbers to zero in computers), the log-likelihood is taken, resulting in the incomplete-data log-likelihood shown below in Equation :
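
For reference, and using the notation assumed throughout the code in this post (mixture probabilities $\mu_k$ and heads probabilities $p_k$, with $k = 1, 2$ and $N = 10$ flips), the incomplete-data likelihood and log-likelihood take the following forms:

$$
p(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{p}) = \prod_{i=1}^{N} \sum_{k=1}^{2} \mu_k \, p_k^{x_i} (1 - p_k)^{1 - x_i}
$$

$$
\log p(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{p}) = \sum_{i=1}^{N} \log \sum_{k=1}^{2} \mu_k \, p_k^{x_i} (1 - p_k)^{1 - x_i}
$$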

To find maximum likelihood estimates for the parameters, partial derivatives can be computed with respect to each parameter and then set equal to zero. In computing the partial derivatives of the incomplete-data log-likelihood (Equation ) with respect to the parameters, it is important to note that the summation symbol within the logarithm will often yield a complex and lengthy derivative because the chain rule has to be applied for each data point. Although computing the partial derivatives does not yield overly complex equations in the current example, closed-form solutions for the parameters cannot be obtained. As an example, I will show how maximum likelihood estimation would be implemented for estimating the probability of selecting coin 1, $\mu_1$. To compute the likelihood, we can expand the binomial term in the incomplete-data log-likelihood function above (Equation ) to produce

To allow the partial derivative to be computed, I will apply the binomial calculation to each individual flip. Thus, each flip has $n = 1$ and $x_i \in \{0, 1\}$, which means that the binomial coefficient equals one and drops out of the expression. In expanding Equation over the summation sign within the logarithm, Equation is obtained

Because $x_i \in \{0, 1\}$ for any given flip, the term inside the logarithm will only ever take on one of the two following forms:

where h indicates a coin flip that results in 'heads' and t indicates a coin flip that results in 'tails' (i.e., $x_i = 1$ and $x_i = 0$, respectively). To expand Equation over the summation sign outside the logarithm, we can apply Equation and obtain a simplified expression of the incomplete-data log-likelihood shown below in Equation :

where H indicates the total number of heads and T indicates the total number of tails. In the current data set, [1, 1, 1, 1, 0, 0, 0, 0, 0, 0], four heads and six tails are obtained. Although we can compute the partial derivative of Equation with respect to $\mu_1$ and obtain a closed-form solution, Equation shows that it is inadmissible because it always yields negative values for $\mu_1$.

Therefore, although the summation symbol within the logarithm does not result in an overly complex partial derivative in the current example, maximum likelihood estimation results in inadmissible estimates for parameter values, and so is not a viable method for modelling mixture distributions.1

Mixture Models Can be Estimated with the Expectation-Maximization (EM) Algorithm

Unlike maximum likelihood estimation, the expectation-maximization (EM) algorithm provides viable parameter estimates for modelling mixture distributions. The EM algorithm works because it indirectly maximizes the incomplete-data log-likelihood; that is, it does not directly operate on the incomplete-data log-likelihood. To act as an indirect estimation method, the EM algorithm begins by modelling the uncertainty of the coin’s identity on each flip as a latent variable: a variable that is assumed to exist but has not been directly measured, whether by choice or because direct measurement is impossible. To model the coin’s identity on each flip as a latent variable, one-hot (or 1-of-K) encoding is used. In one-hot encoding, the levels of a categorical variable are represented numerically with a binary vector that sums to one and has a length equal to the number of levels in the categorical variable. In the current example, the categorical variable is the coin’s identity, and its two levels (i.e., coin 1, coin 2) can be represented for each flip as shown below in Equation :
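
As a sketch of the encoding (with $z_i$ denoting the latent vector for flip $i$, a notational choice of mine), the two possible values are:

$$
z_i = [z_{i1}, z_{i2}] =
\begin{cases}
[1, 0] & \text{if flip } i \text{ was produced by coin 1} \\
[0, 1] & \text{if flip } i \text{ was produced by coin 2}
\end{cases}
$$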

By modelling the coin’s status as a latent variable with one-hot encoding, the incomplete-data likelihood (Equation ) can be modified to produce the complete-data likelihood shown below in Equation :
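
Continuing with the $\mu_k$, $p_k$, and $z_{ik}$ notation assumed above, the complete-data likelihood would be written as:

$$
p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\mu}, \mathbf{p}) = \prod_{i=1}^{N} \prod_{k=1}^{2} \big[ \mu_k \, p_k^{x_i} (1 - p_k)^{1 - x_i} \big]^{z_{ik}}
$$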

where $k$ is used to index each mixture. Note that, because the likelihood of each data point is raised to an exponent value of either 0 or 1, we now take the product over the K classes. Thus, data points that do not belong to a mixture do not contribute to that mixture's likelihood computation; these likelihoods are raised to the power of zero and, therefore, contribute a redundant multiplication factor of one.

Although including a binary exponent on the likelihood computation seems inconsequential, it provides two desirable outcomes. First, it allows viable parameter estimates to be computed. In taking the logarithm of the complete-data likelihood to prevent underflow, the complete-data log-likelihood is obtained and is shown below in Equation :
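
In the notation assumed above, the complete-data log-likelihood would take the form:

$$
\log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\mu}, \mathbf{p}) = \sum_{i=1}^{N} \sum_{k=1}^{2} z_{ik} \big[ \log \mu_k + x_i \log p_k + (1 - x_i) \log(1 - p_k) \big]
$$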

Because the summation over K is now outside the logarithm, the partial derivatives will be much less complex and yield admissible solutions.

Second, using one-hot encoding creates an equivalence between the complete- and incomplete-data log-likelihoods. As shown below, the complete-data log-likelihood becomes the incomplete-data log-likelihood after summing over all possible latent variable representations for each data point (see Equation below). Therefore, we can optimize the incomplete-data log-likelihood by optimizing the complete-data log-likelihood.

As an aside, in computing the sum over the latent variable, we are marginalizing over it, which explains why the incomplete-data log-likelihood is often called the marginal log-likelihood and the complete-data log-likelihood is often called the joint log-likelihood.

Despite the benefits of using one-hot encoding, one challenge remains. The complete-data log-likelihood cannot yet be computed because the latent variable memberships for each data point, , are unknown. Recall that the researcher does not know which coin produced the result of any flip. Related, and as an aside, Equation is called the complete-data likelihood because it can only be computed if we know the mixture membership of each data point; that is, we must have the complete data.

Although the complete-data log-likelihood cannot be directly computed, the EM algorithm finds a clever way to circumvent this problem in the E step.

Expectation (E) Step: Using Expectations to Estimate the Distribution of the Latent Variable

The E step takes advantage of the equivalence between the incomplete- and complete-data log-likelihoods and applies two clever tricks to work with the complete-data log-likelihood. First, it applies what appears to be an inconsequential algebraic manipulation, whereby the complete-data log-likelihood is multiplied and divided by some distribution on the latent variable as shown below in Equation :

Second, given that the logarithm is a concave function, Jensen’s inequality can be applied to convert Equation into an inequality. Briefly, Jensen’s inequality states that, for concave functions, the function of the expected value is greater than or equal to the expected value of the function (for a proof, see Appendix A), as shown below in Equation :
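
As a reminder of the statement being applied here (with $f$ a concave function and $X$ a random variable):

$$
f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]
$$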

In other words, Equation above shows that the expected value of the function is a lower bound on the function of the expected value. Applying Jensen’s inequality to Equation , a lower bound for the incomplete-data log-likelihood is obtained, resulting in the inequality shown below in Equation :

where the summation over the latent variable, weighted by its distribution, is an expected value. As a note, the lower bound is often called the evidence lower bound because it is a lower bound on the marginal log-likelihood, which is often called the evidence in Bayesian inference.

Although we still do not have a way of computing the responsibilities, a closer inspection of the above inequality provides a way forward. In the E step, it is in our best interest to obtain the most accurate approximation of the distribution of the latent variable. By obtaining the best estimate of this distribution, the greatest improvements in the parameter estimates can be realized in the M step. To obtain the best estimate, and thus maximize the potential for improvement in the parameter estimates in the M step, we need to maximize the evidence lower bound with respect to the latent-variable distribution. Thus, the inequality of Equation must be transformed into an equality. To do so, we can choose the distribution such that the logarithm returns constant values, and this can be accomplished if the probability values computed for the latent variable are proportional to the numerator. Fortunately, Bayes’ theorem provides one way for us to compute this distribution and maximize the evidence lower bound such that

where I have used likelihood notation in Equation to highlight the equivalence with the calculation of probabilities and likelihoods. It is important to note that, because latent variable memberships exist for each data point for each mixture in the complete-data log-likelihood (Equation ), Equation above is computed for each data point such that
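
Written out for the two-coin example (again in the $\mu_k$, $p_k$ notation assumed from the code), the responsibility of coin $k$ for flip $i$ is:

$$
\gamma(z_{ik}) = q(z_{ik} = 1) = \frac{\mu_k \, p_k^{x_i} (1 - p_k)^{1 - x_i}}{\sum_{j=1}^{2} \mu_j \, p_j^{x_i} (1 - p_j)^{1 - x_i}}
$$

This is exactly the computation performed by the `e_step` function further below: the unnormalized responsibilities are the weighted binomial probabilities, and normalizing them over the two coins yields probabilities of membership.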

Because these values represent the (posterior) probability of membership to each mixture, they are often called responsibilities. Note that the responsibilities are often represented as $\gamma(z_{ik})$, the scalar form of the notation that I use throughout this post. Therefore, by setting the latent-variable distribution equal to the posterior, we can compute the responsibilities and also obtain a lower bound that is equal to the incomplete-data log-likelihood. Using Equation , we can rewrite the inequality of Equation as an equality in Equation below:

To show that the lower bound is equal to the incomplete-data log-likelihood when the responsibilities are computed with Bayes’ theorem (see Equation ), I provide the Python code block below. In order to better understand the Python code, I provide the function for the incomplete-data log-likelihood (Equation ) and the expansion for the evidence lower bound (Equation ). Importantly, and as I will discuss later on in this post, the first term in Equation is the expected complete-data log-likelihood, and the second term is the entropy of the responsibilities. Recall that the researcher’s data set is [1, 1, 1, 1, 0, 0, 0, 0, 0, 0].

import numpy as np
import pandas as pd
from scipy.stats import binom

def e_step(data, mu, p, n = 1):
    """
    Compute expectations (i.e., responsibilities) for each data point's membership to each mixture
    Parameters:
    - data: data set
    - mu: Probability of each component
    - p: Probabilities of success for each binomial distribution
    Returns:
    - pandas dataframe
    """

    assert len(mu) == len(p), "Number of estimates in mu should equal the number of success probabilities"
    assert np.isclose(sum(mu), 1), "Sum of mu should be equal to 1"

    #unnormalized responsibilities for each data point for each mixture
    unnormalized_responsibilities = [mu * binom.pmf(x, n=n, p=np.array(p)) for x in data]

    #normalized responsibilities (i.e., probabilities)
    normalized_responsibilities = [rp / np.sum(rp) for rp in unnormalized_responsibilities]

    column_names = ['resp_mixture_{}'.format(mix+1) for mix in range(len(normalized_responsibilities[0]))]

    df_responsibilities = pd.DataFrame(np.vstack(normalized_responsibilities),
                                       columns = column_names)

    #insert data column as the first one
    df_responsibilities.insert(0, 'data', data)

    return(df_responsibilities)


#incomplete-data log-likelihood
def compute_incomplete_log_like(data, mu, p):
    """
    Compute incomplete-data log-likelihood
    Parameters:
    - data: data set
    - mu: Probability of each component
    - p: Probability of success for each binomial distribution
    """

    #probability of each data point coming from each distribution, summed over the mixtures
    mixture_sums = [np.sum(mu * binom.pmf(flip_result, n=1, p=np.array(p))) for flip_result in data]

    #log of mixture_sums
    log_mixture_sums = np.log(mixture_sums)

    #sum of the logs across all data points
    incomplete_like = np.sum(log_mixture_sums)

    return(incomplete_like)


#lower bound = expected complete-data log-likelihood + entropy
def compute_lower_bound(responsibilities, mu, p):

    #expected complete-data log-likelihood
    expected_complete_data_like = responsibilities.apply(compute_expected_complete_like, mu = mu, p = p, axis=1).sum()

    ##compute entropy
    entropy = compute_entropy(responsibilities = responsibilities)

    return expected_complete_data_like + entropy


#entropy: negative sum of rs*log(rs) for all rs (responsibilities)
def compute_entropy(responsibilities):

    ##extract responsibility columns and then compute entropy
    resp_columns = responsibilities.filter(like = 'resp_mixture')

    ##take the negative sum of x*log(x) over each responsibility
    entropy = -np.sum(resp_columns.values * np.log(resp_columns.values))

    return entropy


#expected complete-data log-likelihood
def compute_expected_complete_like(row, mu, p):
    resp_columns = [col for col in row.index if 'resp_mixture' in col]
    resp_values = [row[col] for col in resp_columns]

    return np.sum(
        [resp_values * (np.log(mu) +
                        row['data'] * np.log(p) + #non-zero if flip result is success (i.e., 'heads')
                        (1 - row['data']) * np.log(1 - np.array(p)) #non-zero if flip result is failure (i.e., 'tails')
                        )]
    )


#data given to researcher
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

#initial guesses for E step
mu = [0.3, 0.7] #mixture probabilities
p = [0.6, 0.8] #success probabilities

#E step
responsibilities = e_step(data = binom_mixture_data, mu = mu, p = p)

incomplete_log_likelihood = np.round(compute_incomplete_log_like(data = binom_mixture_data, mu = mu, p = p), 5)
evidence_lower_bound = np.round(compute_lower_bound(responsibilities = responsibilities, mu = mu, p = p), 5)

print('Incomplete-data log-likelihood:', incomplete_log_likelihood)
print('Lower bound:', evidence_lower_bound)
Incomplete-data log-likelihood: -9.28686
Lower bound: -9.28686

Therefore, by introducing a distribution over the latent variable and taking its expectation, the evidence lower bound becomes equal to the incomplete-data log-likelihood. More importantly, however, responsibilities are obtained for each data point for each mixture, which allow new parameter estimates to be obtained in the M step.

Maximization (M) Step: Using Expectations to Obtain New Parameter Estimates

In this section, I first discuss the intuition of computing new parameter estimates with the expectations (see intuition of M step) and then show how these estimates are obtained by computing the appropriate partial derivatives (see computation of new estimates).

Understanding the Intuition of the M Step

Beginning with the intuition of the M step, it uses the responsibilities obtained in the E step to compute new parameter estimates. As in the E step, the M step also indirectly optimizes the incomplete-data log-likelihood by optimizing the evidence lower bound. In the M step, however, instead of optimizing the evidence lower bound with respect to the latent-variable distribution, the lower bound is optimized with respect to the parameters, resulting in new parameter estimates. Because new parameter estimates are obtained in the M step, I will represent them with $\theta^{new}$ and the old estimates with $\theta^{old}$. Thus, in optimizing the evidence lower bound with respect to the parameter values in the M step, we can say the lower bound is optimized with respect to $\theta^{new}$. In doing so, the incomplete-data log-likelihood evaluated at the new parameter values increases by at least as much as the lower bound increases, as shown below in Equation :
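
In symbols, writing $\ell(\theta)$ for the incomplete-data log-likelihood and $\mathcal{L}(q, \theta)$ for the evidence lower bound (notation I am assuming for this reconstruction), the claim is the following chain of (in)equalities:

$$
\ell(\theta^{new}) \;\geq\; \mathcal{L}(q, \theta^{new}) \;\geq\; \mathcal{L}(q, \theta^{old}) \;=\; \ell(\theta^{old}),
$$

where the final equality holds because the E step set $q$ equal to the posterior distribution of the latent variable under $\theta^{old}$.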

To understand the logic behind Equation above, a brief discussion of entropy, cross-entropy, and the Kullback-Leibler (KL) divergence is necessary; I provide an overview in the following section.

A Brief Review of Entropy and the Kullback-Leibler (KL) Divergence

Consider an example where an analyst for a wine merchant records wine preferences among customers over a two-year period. Specifically, the analyst asks customers for their favourite type of wine from red, white, and sparkling. Figure 1 shows the customers’ preferences in each year.

Figure 1
Favourite Wine Types by Customers in Each of Two Years

To quantify the difference between the probability distributions with a single value, the analyst uses the Kullback-Leibler (KL) divergence shown below in Equation :
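
Writing $P$ for the Year 2 distribution and $Q$ for the Year 1 distribution (consistent with the weighting described in the list below), the KL divergence is:

$$
D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$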

To understand the KL divergence, it is helpful to understand each of its three computations (for an excellent explanation, see KL divergence) that are presented below:

  1. The ratio of each wine type’s Year 2 probability to its Year 1 probability measures the change in each wine type relative to Year 1.
  2. Taking the logarithm of each ratio gives equal weightings to reciprocals. As an example, consider the change in preferences across the two years for red and white wine. Across the two years, the preference for red wine increases from 20% to 40%, whereas the preference for white wine decreases from 40% to 20%. Given that these changes are exactly the same (in opposite directions), they should contribute the same amount to the total difference between the years. Using the logarithm accomplishes this goal: whereas the raw ratios of 2 and 0.5 are not symmetric around one, $\log(2) \approx 0.69$ and $\log(0.5) \approx -0.69$ have equal magnitude and opposite sign.
  3. Each logarithmic ratio is weighted by its current probability (i.e., Year 2).

Thus, the KL divergence measures the difference between two probability distributions, $P$ and $Q$, by computing the sum of weighted logarithmic ratios. Intuitively, if the two distributions are the same, the KL divergence is zero, and if the distributions are different, the KL divergence is positive. Therefore, the KL divergence is always non-negative (for a proof, see Appendix B).
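
To make this concrete, the snippet below computes the KL divergence for the wine example. The red and white shares come from the text; the 40% sparkling share in both years is my assumption so that each year's preferences sum to one.

import numpy as np

#wine-preference distributions: [red, white, sparkling]
year_1 = np.array([0.20, 0.40, 0.40]) #Q in the KL divergence
year_2 = np.array([0.40, 0.20, 0.40]) #P in the KL divergence

#KL(P || Q) = sum of P(x) * log(P(x) / Q(x))
kl_divergence = np.sum(year_2 * np.log(year_2 / year_1))

print(np.round(kl_divergence, 5)) #positive (about 0.139) because the two distributions differ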

To understand why the KL divergence is always non-negative, it is important to understand entropy and cross-entropy (for an excellent explanation, see entropy & cross-entropy). If we expand the KL divergence expression of Equation , we obtain

The first term of Equation represents entropy,2 which can be conceptualized as the amount of information or surprise obtained for a given wine type from the distribution in Year 2, , when encoding it by itself.

The second term of Equation represents cross-entropy, which can be conceptualized as the amount of information or surprise obtained for a given wine type from the distribution in Year 1, , when encoded by the distribution in Year 2, . Because the distributions in each year are different, it is intuitive to think that cross-entropy is greater than the entropy; that is, it should be more surprising to encode values of one distribution, , with values of another distribution, , than with values of the same distribution. The conceptualization that cross-entropy is greater than entropy is formally represented by Gibbs’ inequality (for a proof, see Appendix C) below:
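
In the entropy and cross-entropy notation used here, Gibbs' inequality states:

$$
H(P, Q) = -\sum_{x} P(x) \log Q(x) \;\geq\; -\sum_{x} P(x) \log P(x) = H(P)
$$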

Using Gibbs’ inequality, it then becomes clear from the expression in Equation that the KL divergence will always be non-negative: the larger cross-entropy is added to the negative of the smaller entropy, so the sum cannot fall below zero.

Computing New Parameter Estimates Increases The Incomplete Log-Likelihood More Than the Evidence Lower Bound

Returning to the M step, the incomplete-data log-likelihood increases by at least as much as the evidence lower bound increases. The inequality between the increase in the incomplete-data log-likelihood and the increase in the evidence lower bound after the optimization of the lower bound with respect to the new parameter values occurs because the optimization results in a larger-valued cross-entropy term. Importantly, the cross-entropy term is absorbed by the incomplete-data log-likelihood and not the evidence lower bound. To understand how, after the M step, the increase in the incomplete-data log-likelihood is at least as large as the increase in the evidence lower bound, I will prove the following two points:

  • Point 1: After the M step, the evidence lower bound only increases as much as the expected complete-data log-likelihood.
  • Point 2: After the M step, the incomplete-data log-likelihood increases by as much as the evidence lower bound and the cross-entropy of the new responsibilities, , with respect to the old responsibilities, .
Point 1: The Increase in the Evidence Lower Bound is Equal to the Increase in the Expected Complete-Data Log-Likelihood

In computing new parameter estimates in the M step, , the evidence lower bound increases by the amount that the expected complete-data log-likelihood increases. To show this, I repeat the function for the evidence lower bound below (Equation ) and set to keep track of the iteration index.

Below, I show that, when determining the parameter values that maximize the evidence lower bound, the entropy term does not contribute to the derivative (Equation ), and so the maximization is equivalent to maximizing the expected complete-data log-likelihood (Equation ).
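
A compact way to see this (using $\mathcal{L}$ for the lower bound, $Q(\theta, \theta^{old})$ for the expected complete-data log-likelihood, and $H(q)$ for the entropy of the responsibilities, all notation assumed for this reconstruction) is:

$$
\mathcal{L}(q, \theta) = \underbrace{\sum_{i=1}^{N} \sum_{k=1}^{2} \gamma(z_{ik}) \log \big[ \mu_k \, p_k^{x_i} (1 - p_k)^{1 - x_i} \big]}_{Q(\theta, \, \theta^{old})} \; + \; \underbrace{\left( -\sum_{i=1}^{N} \sum_{k=1}^{2} \gamma(z_{ik}) \log \gamma(z_{ik}) \right)}_{H(q)}
$$

Because $H(q)$ depends only on the responsibilities and not on $\theta$, it vanishes from the partial derivatives, so maximizing $\mathcal{L}$ with respect to $\theta$ is the same as maximizing $Q(\theta, \theta^{old})$.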

Therefore, optimizing the evidence lower bound with respect to is equivalent to maximizing the expected complete-data log-likelihood with respect to . In other words, to obtain new parameter estimates, we only need to compute partial derivatives with respect to each parameter of the expected complete-data log-likelihood (see computation of estimates for these partial derivatives). Importantly, to compute the value of the lower bound after it has been optimized with respect to , , the entropy of is included, as shown below in Equation .

To more concisely represent the difference between the new and old evidence lower bounds, I use auxiliary function notation in Equation , where is the new lower bound and is the old lower bound.

The Python code block below shows that, after the M step, the evidence lower bound indeed only increases by as much as the expected complete-data log-likelihood. Note that, although I have not yet shown how to derive new parameter estimates, I do so in the section on computing new parameter estimates. To better understand how new values are computed for the mixture probabilities (probability of selecting each coin) and the success probabilities (probability of heads for each coin), I provide previews of the solutions in Equations and .

Also note that I fix the values of the mixture probabilities so that convergence does not occur in one trial.

def m_step(responsibilities, n = 1):

    #isolate columns that contain responsibilities
    resp_cols = responsibilities.filter(like = 'resp_mixture')

    #New estimate for the probability of heads
    """Weigh each data point by the corresponding responsibility and divide by the sum
    of responsibilities for each coin (N_k)."""
    #specify axis=0 so that each responsibility column is multiplied row-wise by the data column
    p_new = np.sum(responsibilities.filter(regex='^resp_mixture').mul(responsibilities['data'],
                                                                      axis=0))/(np.sum(resp_cols)*n)

    #new mixture probabilities
    mu_new = resp_cols.sum()/resp_cols.sum().sum()

    return pd.DataFrame({'p_new': p_new, 'mu_new': mu_new})


#data given to researcher
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


#initial guesses for E step
mu_fixed = [0.5, 0.5] #mixture probabilities are fixed so that convergence does not occur in one trial
p = [0.6, 0.8] #success probabilities

#E step
responsibilities = e_step(data = binom_mixture_data, mu = mu_fixed, p = p)

#M step
estimates = m_step(responsibilities = responsibilities)

#amount that the lower bound increases after the M step
optimized_lower_bound = compute_lower_bound(responsibilities = responsibilities, mu = mu_fixed, p = estimates['p_new'])
expectation_lower_bound = compute_lower_bound(responsibilities = responsibilities, mu = mu_fixed, p = p)

lower_bound_increase = np.round(optimized_lower_bound - expectation_lower_bound, 5)


#amount that the expected complete-data log-likelihood increases
optimized_expected_complete = responsibilities.apply(compute_expected_complete_like, mu = mu_fixed, p = estimates['p_new'], axis=1).sum()
expectation_expected_complete = responsibilities.apply(compute_expected_complete_like, mu = mu_fixed, p = p, axis=1).sum()

expected_complete_increase = np.round(optimized_expected_complete - expectation_expected_complete, 5)

print('Increase in evidence lower bound = ', lower_bound_increase)
print('Increase in expected complete-data log-likelihood:', expected_complete_increase)
Increase in evidence lower bound = 1.81803
Increase in expected complete-data log-likelihood: 1.81803
Point 2: The Incomplete-Data Log-Likelihood Increases by at Least as Much as the Evidence Lower Bound

In computing new parameter estimates in the M step, the incomplete-data log-likelihood increases by at least as much as the evidence lower bound increases. To show the inequality between the increase in the incomplete-data log-likelihood and the increase in the evidence lower bound, I first show below that the incomplete-data log-likelihood can be decomposed as the sum of the lower bound and a KL divergence (see Equation ). Importantly, I use probability notation in the beginning to highlight that the joint distribution of the data and the latent variable can be decomposed into the conditional distribution of the latent variable multiplied by the marginal distribution of the data. I denote the distribution of the latent variable with probability notation because it is a true probability distribution, but I denote the data term with likelihood notation because, when fixing the data and varying the parameters, those values become likelihoods (see my previous post on likelihood and probability). As mentioned before, because the M step computes new parameter estimates, $\theta^{new}$, I distinguish them from the current estimates, $\theta^{old}$.
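
For reference, the decomposition being described (in the notation assumed above) is:

$$
\log p(\mathbf{x} \mid \theta) = \mathcal{L}(q, \theta) + D_{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\big)
$$

Because the KL divergence is non-negative, the lower bound can never exceed the incomplete-data log-likelihood, and the two coincide exactly when $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \theta)$.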

As a brief aside, we can see that, by setting the latent-variable distribution equal to the posterior in the E step, the KL divergence goes to zero and the incomplete-data log-likelihood becomes equal to the evidence lower bound (Equation is repeated below with this substitution).

After computing new parameter estimates in the M step, the value of the incomplete-data log-likelihood increases at these new values such that it is the sum of the old evidence lower bound maximized with respect to $\theta^{new}$ and the KL divergence between the old responsibilities and some new distribution of the responsibilities (Equation ). From Equation , we know that we can use Bayes’ theorem to compute the new distribution of the latent variables, and so I set the new responsibilities equal to the posterior under the new parameter values in Equation .

In comparing the incomplete-data log-likelihood at the old parameter values with its value at the new parameter values, the incomplete-data log-likelihood increases by at least as much as the evidence lower bound because it absorbs the KL divergence between the old responsibilities and the new responsibilities. Therefore, we arrive at Equation presented at the beginning of this section.

Note that Equation can also be represented using auxiliary function and entropy notation, whereby one term is the entropy of the old responsibilities and another is the cross-entropy of the old responsibilities with respect to the new responsibilities (Equation ). Because Gibbs’ inequality states that cross-entropy is greater than or equal to entropy (for a proof, see Appendix C), their difference is non-negative, and removing it turns Equation into an inequality.

The Python code block below shows that, after the M step, the incomplete-data log-likelihood indeed increases by as much as the evidence lower bound plus the KL divergence between the new and old responsibilities. Note that I fix the values of the mixture probabilities so that convergence does not occur in one trial.

def compute_cross_entropy(old_responsibilities, new_responsibilities):

    ##extract responsibility columns and then compute the cross-entropy, -\sum_x P(x)*log(Q(x))
    resp_columns_old = old_responsibilities.filter(like = 'resp_mixture')
    resp_columns_new = new_responsibilities.filter(like = 'resp_mixture')

    cross_entropy = -np.sum(resp_columns_old.values * np.log(resp_columns_new.values))

    return cross_entropy


#data given to researcher
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

#initial guesses for E step
mu_fixed = [0.5, 0.5] #mixture probabilities
p = [0.6, 0.8] #success probabilities

#First iteration of EM algorithm
old_responsibilities = e_step(data = binom_mixture_data, mu = mu_fixed, p = p) #E step
estimates = m_step(responsibilities = old_responsibilities) # M step

#Second iteration of EM algorithm (only E step)
new_responsibilities = e_step(data = binom_mixture_data, mu = mu_fixed, p = estimates['p_new'])


#amount that the incomplete-data log-likelihood should increase by after the M step: increase in lower bound + KL divergence
##1) Increase in lower bound
optimized_lower_bound = compute_lower_bound(responsibilities = old_responsibilities, mu = mu_fixed, p = estimates['p_new'])
expectation_lower_bound = compute_lower_bound(responsibilities = old_responsibilities, mu = mu_fixed, p = p)
lower_bound_increase = optimized_lower_bound - expectation_lower_bound

##2) KL divergence between new and old responsibilities (cross-entropy minus entropy)
entropy = compute_entropy(responsibilities = old_responsibilities)
cross_entropy = compute_cross_entropy(old_responsibilities = old_responsibilities,
                                      new_responsibilities = new_responsibilities)

KL_divergence = -entropy + cross_entropy
inc_lower_bound_plus_KL = np.round(lower_bound_increase + KL_divergence, 5)


#amount that the incomplete-data log-likelihood increases
new_incomplete_like = compute_incomplete_log_like(data = binom_mixture_data, mu = mu_fixed, p = estimates['p_new'])
old_incomplete_like = compute_incomplete_log_like(data = binom_mixture_data, mu = mu_fixed, p = p)

incomplete_like_increase = np.round(new_incomplete_like - old_incomplete_like, 5)


print('Increase in evidence lower bound + KL divergence = ', inc_lower_bound_plus_KL)
print('Increase in incomplete-data log-likelihood:', incomplete_like_increase)
Increase in evidence lower bound + KL divergence = 1.91468
Increase in incomplete-data log-likelihood: 1.91468

Computation of New Parameter Estimates

Ending with the computation of the new parameter estimates, I will now show how the M step obtains new parameter values, . As shown in Equation , the new parameter estimates are obtained by maximizing the expected complete-data log-likelihood, . Therefore, to obtain new parameter values, I will compute the partial derivatives of with respect to each parameter in . As a reference, I repeat the expanded version of the expected complete-data log-likelihood in Equation below:

For clarity, I have superscripted each parameter with $old$ because the new parameter estimates are always obtained from the old values.

Beginning with the probability of selecting each coin, $\mu_k$, I compute the corresponding partial derivative of the expected complete-data log-likelihood below.

In looking at Equation , it is impossible to obtain a solution for $\mu_k$. Thankfully, this problem can be fixed by constraining the optimization problem with a Lagrange multiplier (for an explanation, see Lagrange multipliers). Specifically, we constrain the $\mu_k$ values to sum to one, add this constraint to the objective with the Lagrange multiplier, and then set the resulting partial derivative equal to zero.

Note that adding the responsibilities across all data points and all mixtures gives the total number of data points, $N$.

We can then plug the solution for the Lagrange multiplier (Equation ) into Equation to obtain the solution for $\mu_k$ shown below in Equation . Intuitively, the probability of each coin is obtained by dividing the effective number of data points assigned to that coin, $N_k$, by the total number of data points, $N$.
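
In the notation assumed above, the resulting update, which is what the `m_step` function computes for `mu_new`, is:

$$
\mu_k^{new} = \frac{N_k}{N}, \qquad N_k = \sum_{i=1}^{N} \gamma(z_{ik})
$$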

Ending with the probability of heads for each coin, $p_k$, I compute the corresponding partial derivative of the expected complete-data log-likelihood below. As shown in Equation , the value of $p_k$ that optimizes the lower bound (and, in turn, the incomplete-data log-likelihood) is obtained by dividing the responsibility-weighted sum of flips that result in heads by the effective number of data points associated with the coin, $N_k$.
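
Again in the notation assumed above, and matching what `m_step` computes for `p_new` (with one flip per observation), the update is:

$$
p_k^{new} = \frac{\sum_{i=1}^{N} \gamma(z_{ik}) \, x_i}{N_k}
$$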

Conclusion

In conclusion, the expectation-maximization (EM) algorithm provides a method for indirectly optimizing the incomplete-data log-likelihood. To work with the incomplete-data log-likelihood, the EM algorithm uses a lower-bounding function in the E step that can be used to obtain responsibilities. The responsibilities are then used in the M step to optimize the lower-bounding function and obtain new parameter estimates. By optimizing the lower-bounding function, the incomplete-data log-likelihood increases by at least as much as the lower-bounding function, thus necessitating another E step. The E and M steps repeat until the parameter values stop updating.
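
As a closing illustration, here is a minimal sketch of how the `e_step` and `m_step` functions defined earlier could be iterated until the estimates stop updating. The starting values and the convergence tolerance (`tol`) are arbitrary choices of mine.

#minimal EM loop using the e_step and m_step functions defined above
binom_mixture_data = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

mu = np.array([0.3, 0.7]) #initial mixture probabilities
p = np.array([0.6, 0.8])  #initial success probabilities
tol = 1e-8                #convergence tolerance (assumed)

for iteration in range(100):
    #E step: compute responsibilities under the current estimates
    responsibilities = e_step(data = binom_mixture_data, mu = mu, p = p)

    #M step: compute new estimates from the responsibilities
    estimates = m_step(responsibilities = responsibilities)
    mu_new = estimates['mu_new'].to_numpy()
    p_new = estimates['p_new'].to_numpy()

    #stop once the parameter values stop updating
    if np.max(np.abs(np.concatenate([mu_new - mu, p_new - p]))) < tol:
        break

    mu, p = mu_new, p_new

print('Estimated mixture probabilities:', np.round(mu, 3))
print('Estimated heads probabilities:', np.round(p, 3))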

References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer New York. Retrieved from bit.ly/411YnEq
Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899. https://doi.org/10.1038/nbt1406

Appendix A: Proof of Jensen’s Inequality

Jensen’s inequality applies to convex and concave functions. Beginning with convex functions, we can use the first-order condition to prove Jensen’s inequality, which is shown below in Equation :

Given that this definition holds for all points, we can evaluate it at the expected value, which gives

We can then take the expectation of Equation , which results in Jensen’s inequality in Equation .
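
Putting the three steps together (with $f$ convex and differentiable and $X$ a random variable, notation assumed for this sketch), the argument runs:

$$
f(y) \geq f(x) + f'(x)(y - x) \quad \text{for all } x, y
$$

$$
f(X) \geq f(\mathbb{E}[X]) + f'(\mathbb{E}[X]) \big( X - \mathbb{E}[X] \big) \quad \text{(setting } x = \mathbb{E}[X], \; y = X \text{)}
$$

$$
\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]) + f'(\mathbb{E}[X]) \big( \mathbb{E}[X] - \mathbb{E}[X] \big) = f(\mathbb{E}[X])
$$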

For concave functions, we simply change the direction of the inequality in Equation to obtain Jensen’s inequality for concave functions below (Equation ).

Appendix B: Proof that KL Divergence is Always Non-Negative

As explained in the section on the KL divergence and entropy, the KL divergence measures the difference between two probability distributions, $P$ and $Q$, by computing the sum of weighted logarithmic ratios, as shown below in a repetition of Equation . Note that I flip the fraction inside the logarithm to obtain the negative KL divergence in Equation

Because the negative logarithm is a convex function, we can use Jensen’s inequality to transform the expression into an inequality by moving the summation inside the logarithm (Equation ). From this point onward, it becomes clear that the KL divergence is always greater than or equal to zero (Equation ).
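
Sketching these steps in the $P$/$Q$ notation used above:

$$
-D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{Q(x)}{P(x)} \;\leq\; \log \sum_{x} P(x) \frac{Q(x)}{P(x)} = \log \sum_{x} Q(x) = \log 1 = 0,
$$

which implies $D_{KL}(P \,\|\, Q) \geq 0$.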

Appendix C: Proof of Gibbs’ Inequality

Gibbs’ inequality states that the cross-entropy, , is greater than or equal to the entropy, (see Equation ). Gibbs’ inequality can be easily proven using the fact that the KL divergence is always greater than or equal to zero.
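
In the entropy notation from earlier, the argument is short:

$$
0 \;\leq\; D_{KL}(P \,\|\, Q) = \underbrace{-\sum_{x} P(x) \log Q(x)}_{H(P, \, Q)} - \underbrace{\left( -\sum_{x} P(x) \log P(x) \right)}_{H(P)},
$$

and rearranging gives $H(P, Q) \geq H(P)$.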


  1. It should also be noted that maximum likelihood estimation can result in singularities with Gaussian mixture models. In short, if the estimate for a mixture’s mean happens to exactly match the value of an individual data point, then the mixture in question can become ‘stuck’ on this data point, with all the other data points being modelled by the other mixtures. With the mixture fixed on the one data point, the variance of the mixture will decrease to zero and the log-likelihood will increase to infinity. ↩︎

  2. The first value in Equation is technically the negative entropy. Because entropy (and cross-entropy) compute information/surprise, it makes conceptual sense to represent them as positive values. Unfortunately, the summation term returns negative values. To reflect the conceptualization that entropy computes information, a negative sign is included to turn the negative value returned by the summation into a positive value. ↩︎