Probability, Likelihood, and Maximum Likelihood Estimation

Probability and likelihood are discussed in the context of a coin flipping scenario and it is shown that only probabilities sum to one. Although likelihoods cannot be interpreted as probabilities, they can be used to determine the set of parameter values that most likely produced a data set (maximum likelihood estimates). Maximum likelihood estimation provides one efficient method for determining maximum likelihood estimates and is applied in the binomial and Gaussian cases.

Probability Mass Functions: The Probability of Observing Each Possible Outcome Given One Set of Parameter Values

Consider an example where a researcher obtains a coin and believes it to be unbiased, . To test this hypothesis, the researcher intends to flip the coin 10 times and record the result as a 1 for heads and 0 for tails. Thus, a vector of 10 observed scores is obtained, , where . Before collecting the data to test their hypothesis, the researcher would like to get an idea of the probability of observing any given number of heads given that the coin is unbiased and there are 10 coin flips, . Thus, the outcome of interest is the number of heads, , where . Because the coin flips have a dichotomous outcome and the result of any given flip is independent of all the other flips, the probability of obtaining any given number of heads will be distributed according to a binomial distribution, . To compute the probability of obtaining any given number of heads, the binomial function shown below in Equation can be used:

where gives the total number of ways in which heads (or successes) can be obtained in a series of attempts (i.e., coin flips) and gives the probability of obtaining a given number of heads and tails in a given set of flips. Thus, the binomial function (Equation ) has an underlying intuition: To compute the probability of obtaining a specific number of heads given flips and a certain probability of success, the probability of obtaining heads in a given set of coin flips, , is multiplied by the total number of ways in which heads can be obtained in coin flips, .

As an example, the probability of obtaining four heads () in 10 coin flips () is calculated below.

Unknown environment 'spreadliness' Thus, there are 210 possible ways of obtaining four heads in a series of 10 coin flips, with each way having a probability of of occurring. Altogether, four heads have a probability of .205 of occurring given a probability of heads of .50 and 10 coin flips.

In order to calculate the probability of obtaining each possible number of heads in a series of 10 coin flips, the binomial function (Equation ) can be computed for each number of heads, . The resulting probabilities of obtaining each number of heads can then be plotted to produce a probability mass function: A distribution that gives the probability of obtaining each possible value of a discrete random variable1 (see Figure 1). Importantly, probability mass functions have two conditions: 1) the probability of obtaining each value is non-negative and 2) the sum of all probabilities is one. The R code block below (see lines 1–65) produces a probability mass function for the current binomial example.

#create function that computes probability mass function with following arguments:
##num_trials = number of trials (10 [coin flips] in the current example)
##prob_success = probability of success (or heads; 0.50 in the current example)
##num_successes = number of successes (or heads; [1-10] in the current example)

compute_binom_mass_density <- function(num_trials, prob_success, num_successes){

#computation of binomial term (i.e., number of ways of obtaining a given number of successes)
num_success_patterns <- (factorial(num_trials)/(factorial(num_successes)*factorial(num_trials-num_successes)))

#computation of the number of possible ways of obtaining a given number of successes (i.e., heads)
prob_single_pattern <- (prob_success)^num_successes*(1-prob_success)^(num_trials-num_successes)


probability <- num_success_patterns*prob_single_pattern

pmf_df <- data.frame('probability' = probability,
'num_successes' = num_successes,
'prob_success' = prob_success,
'num_trials' = num_trials)

return(pmf_df)
}


num_trials <- 10
prob_success <- 0.5
num_successes <- 0:10 #manipulated variable

prob_distribution <- compute_binom_mass_density(num_trials, prob_success, num_successes)

library (tidyverse)
library(grDevices) #needed for italic()

#create data set for shaded rectangle that indicates the most likely value
##index of highest probability
highest_number_ind <- which.max(prob_distribution$probability)
##most likely number of successes
most_likely_number <- prob_distribution$num_successes[highest_number_ind]

#create pmf plot
pmf_plot <- ggplot(data = prob_distribution, aes(x = num_successes, y = probability)) +
geom_bar(stat = 'identity',
fill = ifelse(test = prob_distribution$num_successes == most_likely_number,
no = "#002241",
yes = "#00182d")) + #calculate sum of probability for each num_successes

scale_y_continuous(name = bquote(italic("P(h")*"|"*italic(theta == .(prob_success)*","~n == .(num_trials)*")"))) +
scale_x_continuous(name = bquote("Number of Heads ("*italic("h")~")"),
breaks = seq(from = 0, to = 10, by = 1)) +
theme_classic(base_family = "Helvetica", base_size = 18) +
theme(axis.title.y = element_text(face = 'italic'),

#embolden the most likely number of heads
axis.text.x =
element_text(face =
ifelse(test = prob_distribution$num_successes == most_likely_number,
no = "plain",
yes = "bold")),
text = element_text(color = "#002241"),
axis.line = element_line(color = "#002241"),
axis.ticks = element_line(color = "#002241"),
axis.text = element_text(color = "#002241"))

ggsave(filename = 'images/pmf_plot.png', plot = pmf_plot, height = 6, width = 8)
Figure 1
Probability Mass Function With an Unbiased Coin (θ = 0.50) and Ten Coin Flips (n = 10)
Note. Number emboldened on the x-axis indicates the number of heads that is most likely to occur with an unbiased coin and 10 coin flips, with the corresponding bar in darker blue indicating the corresponding probability.

Figure 1 shows the probability mass function that results with an unbiased coin () and ten coin flips (). In looking across the probability values of obtaining each number of heads (x-axis), 5 heads is the most likely value, as indicated by the emboldened number on the x-axis and the bar above it with a darker blue color. As an aside, the R code below (lines 66–70) verifies the two conditions of probability mass functions for the current example (for a mathematical proof, see Appendix A).

#Condition 1: All probability values have nonnegative values.
sum(prob_distribution$probability >= 0) #11 nonnegative values

#Condition 2: Sum of all probability values is 1.
sum(prob_distribution$probability) #1
[1] 11
[1] 1

With a probability mass function that shows the probability of obtaining each possible number of heads, the researcher now has an idea what outcomes to expect after flipping the coin ten times. Unfortunately, the probability mass function in Figure 1 gives no insight into the coin’s actual probability of heads, , after data have been collected; in computing the probability mass function, the probability of heads is fixed. Thus, the researcher must use a different type of distribution to determine the coin’s probability of heads.

Likelihood Distributions: The Probability of Observing Each Possible Set of Parameter Values Given a Specific Outcome

Continuing with the coin flipping example, the researcher flips the coin 10 times and obtains seven heads. With these data, the researcher wants to determine the probability value of heads, , that most likely produced the data, 2. Before continuing, it is important to explain why the researcher is no longer dealing with probabilities and is instead dealing with likelihoods.

Likelihoods are not Probabilities

Because we are interested in determining which value of most likely produced the data, the probability of observing the data must be computed for each of these values, . Thus, we now fix the data, , and vary the parameter value of . Although we also use the binomial function to compute for each , the resulting values are not probabilities because they do not sum to one. Indeed, the R code block below ( lines 73–80) shows that the values sum to 9.09. Thus, when fixing the data and varying the parameter values, the resulting values do not sum to one (for a mathematical proof with the binomial function, see Appendix B and are, therefore, not probabilities: they are likelihoods. To signify the shift from probabilities to likelihoods, a different notation is used. Instead of computing the probability of the data given a parameter value, , the likelihood of the parameter given the data is computed, .

num_trials <- 10
num_successes <- 7
prob_success <- seq(from = 0, to = 1, by = 0.01) #manipulated variable

#compute P(h, n|theta) for each theta in [0, 1].
likelihood_distribution <- compute_binom_mass_density(num_trials = num_trials, num_successes = num_successes, prob_success = prob_success)

sum(likelihood_distribution$probability)
[1] 9.09091

In computing likelihoods, it is important to note that, because they do not sum to one, they cannot be interpreted as probabilities. As an example, the likelihood of 0.108 obtained for does not mean that, given a probability of heads of .50, there is a 10.80% chance that seven heads will arise in 10 coin flips: The value of provides a measure of how strongly the data are expected under the hypothesis that . To gain a better understanding of whether the likelihood value of 0.108 is a high value, the likelihood values of all the other can be computed.

Creating a Likelihood Distribution to Find the Maximum Likelihood Estimate

Figure 2 shows the likelihood distribution of for all values of . By plotting the likelihoods, the parameter value that most likely produced the data or the maximum likelihood estimate can be identified. The maximum likelihood estimate of in this example is .70, which is emboldened on the x-axis and its likelihood indicated by the height of the vertical bar. The R code block below (lines 82–126) plots computes the likelihood values for all .

num_trials <- 10
num_successes <- 7
prob_success <- seq(from = 0, to = 1, by = 0.01) #manipulated variable

likelihood_distribution <- compute_binom_mass_density(num_trials = num_trials, num_successes = num_successes, prob_success = prob_success)

#create data set for shaded rectangle that indicates the most likely value
##index of highest probability
highest_number_ind <- which.max(likelihood_distribution$probability)
##most likely number of successes
maximum_likelihood_estimate <- likelihood_distribution$prob_success[highest_number_ind]
##probability value of most likely number of successes
highest_prob <- max(likelihood_distribution$probability)

rectangle_data <- data.frame(
'xmin' = maximum_likelihood_estimate - .005,
'xmax' = maximum_likelihood_estimate + .005,
'ymin' = -0.5,
'ymax' = highest_prob)


likelihood_plot <- ggplot(data = likelihood_distribution, aes(x = prob_success, y = probability)) +
geom_line(color = "#002241") +
geom_rect(inherit.aes = F,
fill = "#002241",
data = rectangle_data, mapping = aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax))+

scale_x_continuous(name = bquote(paste("Probability of Heads (", theta, ")")),
breaks = seq(0, 1, 0.10),
labels = scales::number_format(accuracy = 0.01)) +

scale_y_continuous(name = bquote(italic("L(")*italic(theta)*"| "*italic("h")== .(num_successes)*","~italic("n") == .(num_trials)*")"),
labels = scales::number_format(accuracy = 0.01),
breaks = seq(from = 0, to = .30, by = .10)) +
coord_cartesian(ylim = c(0, .30)) +

theme_classic(base_family = "Helvetica", base_size = 18) +

theme(axis.text.x = element_text(face = ifelse(seq(0, 1, 0.10) == maximum_likelihood_estimate, "bold", "plain")),
text = element_text(color = "#002241"),
axis.line = element_line(color = "#002241"),
axis.ticks = element_line(color = "#002241"),
axis.text = element_text(color = "#002241"))

ggsave(filename = 'images/likelihood_plot.png', plot = likelihood_plot, height = 6, width = 8)
Figure 2
Likelihood Distribution With Seven Heads (h = 7) and Ten Coin Flips (n = 10)
Note. Number emboldened on the x-axis indicates the maximum likelihood estimate for θ and the corresponding bar in dark blue indicates the likelihood value.

Although maximum likelihood estimates can be identified by creating likelihood distributions, this method is not efficient. Under many circumstances, creating such distributions is computationally demanding when a large range of parameter values must be considered. Even more important, many situations arise where many parameters are estimated, and this can make plotting the likelihood distribution impossible. As an example, if a researcher wants to estimate six parameters and plot the likelihood distribution, then six dimensions would have to be represented on a 2D plot, which is a nearly impossible task. Thus, a more efficient method is needed to find maximum likelihood estimates that does not rely on plotting.

Using Maximum Likelihood Estimation to Find the Most Likely Set of Parameter Values

Maximum likelihood estimation identifies maximum likelihood estimates by using calculus to find a peak on the likelihood distribution. In mathematical parlance, maximum likelihood estimation solves for the parameter value where the derivative (i.e., rate of change) is zero. Assuming the likelihood only has one peak (i.e., it is convex), then the parameter value at the zero-derivative point will have the highest likelihood and will, therefore, be the maximum likelihood estimate. In mathematical notation, then, the maximum likelihood estimate, , is the value of that maximizes the likelihood function

In the two sections that follow, I will apply maximum likelihood estimation for the binomial and Gaussian cases.

Maximum Likelihood Estimation for the Binomial Case

In the binomial case, there is only one parameter value of interest: the probability of heads, . Thus, maximum likelihood estimation will find the value that maximizes the likelihood function,

Before computing the maximum likelihood estimate, however, it is important to apply a transformation on Equation for two reasons. First, applying a transformation to the likelihood function of Equation greatly simplifies the computation of the derivative because taking the derivative of the log-likelihood does not involve a lengthy application of the quotient, product, and chain rules. Second, log-likelihoods are necessary to avoid underflow: the rounding of small numbers to zero in computers. As an example, in a coin flipping example with a moderate number of flips such as and , many likelihood values become extremely small (e.g., 1.20E-73) and can easily be rounded down to zero within computers. Instead of directly representing extremely small values, likelihoods can be used to retain numerical precision. For example, the value of 1.2E-73 becomes -72.9208188 on a log scale (base 10), . In applying a transformation to the likelihood function, the log-likelihood function shown below in Equation is obtained:

To solve for , the partial derivative of with respect to is computed below and then set to zero (at a peak, the likelihood function has a zero-value rate of change with respect to ).

Therefore, the maximum likelihood estimate for the probability of heads, , is found by dividing the number of observed headsby the number of flips, (see Equation ). In the current example where seven heads were obtained in ten coin flips, the probability value of heads that that maximizes the probability of observing the data is .70, .

Maximum Likelihood Estimation for Several Binomial Cases

To build on the current example, consider a more realistic example where a researcher decides to flip a coin over multiple sessions. Specifically, in each of 10 sessions, the researcher flips the coin 10 times. Across the 10 sessions, the following number of heads are obtained: . At this point, it may seem daunting to compute the partial derivative of the resulting likelihood function with respect to because the equation will contain terms. Thankfully, a simple equation can be derived that does not require a lengthy partial derivative computation. To derive a equation for multiple coin flipping sessions, I will compute the function for with only two coin flipping sessions that each have their corresponding number of flips, , and heads, .

Therefore, to obtain when there are coin flipping sessions, the sum of heads,, is divided by the sum of coin flips across the sessions, . In the current example where and each session has 10 coin flips, the maximum likelihood estimate for the probability of heads, , is .48 (see lines below).

h <- c(1, 6, 4, 7, 3, 4 ,5, 10, 5, 3)
theta_mle <- sum(h)/sum(rep(x = 10, times = 10))
theta_mle
[1] 0.48

Maximum Likelihood Estimation for the Gaussian Case

To explain maximum likelihood estimation for the Gaussian case, let’s consider a new example where a researcher measures the heights of 100 males, . From previous studies, the researcher believes heights to be normally distributed and, thus, estimates a mean, , and standard deviation, , for the population heights of males. To obtain population estimates for the mean and standard deviation, the Gaussian function shown below in Equation can be used:

where the probability of observing a score given a population mean, , and standard deviation, , is computed, . Because the researcher is interested in determining the parameter values that most likely produced the data, parameter values will be varied and the data will be fixed. Thus, likelihoods and not probabilities will be used (see Likelihood are not probabilities). Although Equation will still be used to compute likelihoods, I will rewrite Equation to explicitly indicate that likelihoods will be computed, as shown below in Equation :

Importantly, Equation above only computes the likelihood given one data point. Because the researcher wants to determine the parameter values that produced all the 100 data points, , Equation must be used each for each data point and all the resulting likelihood values must be multiplied together. Thus, a product of likelihoods must be computed to obtain the likelihood of the parameters given the entire data set, , as shown below in Equation :

As in the binomial case, the likelihood equation must be transformed to a scale to prevent underflow and to simplify the derivation of the partial derivatives. Given that the equation contains Euler’s number, , I will use log of base or the natural log, , to further simplify the derivatives. Before applying the log transformation, Equation can be simplified to yield Equation below:

With a simplified form of Equation , Equation can now be converted to a log scale by using the product rule and then the power rule to obtain the log-likelihood Gaussian function shown below in Equation .

The maximum likelihood estimate functions for the mean, , and standard deviation, , can now be obtained by taking the derivative of the log-likelihood function with respect to each parameter. The derivation below solves for .

Therefore, Equation above shows that the maximum likelihood estimate for the mean can be obtained by simply computing the mean of the observed scores. The derivation below solves for .

Therefore, Equation above shows that the the maximum likelihood estimate for the standard deviation parameter, , is the square root of the average squared deviation from the mean observed score.

Thus, as in the binomial case, maximum likelihood estimation provides a simple function for calculating maximum likelihood estimates for the Gaussian parameters.

Conclusion

In conclusion, probabilities and likelihoods are fundamentally different. Probabilities sum to one, whereas likelihoods do not sum to one. Thus, likelihoods cannot be interpreted as probabilities. Although likelihoods cannot be interpreted as probabilities, they can be used to determine parameter values that most likely produce observed data sets (maximum likelihood estimates). Maximum likelihood estimation provides an efficient method for determining maximum likelihood estimates and was applied in the binomial and Gaussian cases.

References

(). Introduction to the concept of likelihood and its applications. Advances in Methods and Practices in Psychological Science, 1(1), 60–69. https://doi.org/10.1177/251524591774

Appendix A: Proof That the Binomial Function is a Probability Mass Function

To prove that the binomial function is a probability mass function, two outcomes must be shown: 1) all probability values are non-negative and 2) the sum of all probabilities is one.

For the first condition, the impossibility of negative values occurring in the binomial function becomes obvious when individually considering the binomial coefficient, , and the binomial factors, . With respect to the binomial coefficient, , it is always nonnegative because it is the product of two non-negative numbers; the number of trials, , and the number of heads, , can never be negative. With respect to the binomial factors, the resulting value is always nonnegative because all the constituent terms are nonnegative; in addition to the number of trials and heads (, respectively), the probability of heads and tails are also always nonnegative (). Therefore, probabilities can be conceptualized as the product of a nonnegative binomial coefficient and a nonnegative binomial factor, and so are always nonnegative.

For the second condition, the equality stated below in Equation must be proven:

Importantly, it can be proven that all probabilities sum to one by using the binomial theorem, which states below in Equation that

Given the striking resemblance between the binomial function in Equation and the binomial theorem in Equation , it is possible to restate the binomial theorem with respect to the variables in the binomial function. Specifically, we can let and , which returns the proof as shown below:

For a proof of the binomial theorem, see Appendix E.

Appendix B: Proof That Likelihoods are not Probabilities

As a reminder, although the same formula is used to compute likelihoods and probabilities, the variables allowed to vary and those that are fixed differ when computing likelihoods and probabilities. With probabilities, the parameters are fixed (i.e., ) and the data are varied (). With likelihoods, however, the data are fixed () and the parameters are varied (). To prove that likelihoods are not probabilities, we have to prove that likelihoods do not satisfy one of the two conditions required by probabilities (i.e., likelihoods can have negative values or likelihoods do not sum to one). Given that likelihoods are calculated with the same function as probabilities and probabilities can never be negative (see Appendix A), likelihoods likewise can never be negative. Therefore, to prove that likelihoods are not probabilities, we must prove that likelihoods do not always sum to one. Thus, the following proposition must be proven:

That is, the integral of the binomial function with respect to does not equal one. To prove this proposition, it is important to realize that can be restated in terms of the beta function, , which is shown below. Therefore, the function in Equation can be restated below in Equation as At this point, another proof becomes important because it allows us to express the beta function in terms of another function that will, ultimately, allow us to simplify Equation and prove that likelihoods do not sum to one. Specifically, the beta function, can be stated in terms of the gamma function such that

For a proof of the beta-gamma relation, see Appendix C. Thus, Equation can be stated in terms of the gamma function such that

One nice feature of the gamma function is that it can be stated as a factorial (for a proof, see Appendix D) such that

Given that the gamma function can be stated as a factorial, Equation can be now be written with factorial terms and simplified to prove that likelihoods do not sum to one.

Therefore, binomial likelihoods sum to a multiple of , where the multiple is the number of integration steps. The R code block below provides an example where the integral can be shown to be a multiple of the value in Equation . In the example, the integral of the likelihood is taken over 100 equally spaced steps. Thus, the sum of likelihoods should be , and this turns out to be true in the R code block below (lines 131–136).

num_trials <- 10 #n
num_successes <- 7 #h
prob_success <- seq(from = 0, to = 1, by = 0.01) #theta; contains 100 values (i.e., there are 100 dtheta values)

likelihood_distribution <- compute_binom_mass_density(num_trials = num_trials, num_successes = num_successes, prob_success = prob_success)
sum(likelihood_distribution$probability) #compute integral
[1] 9.09091

Appendix C: Proof of Relation Between Beta and Gamma Functions

Equation below will be proven:

To begin, let’s write out the expansions of the gamma function, , and the numerator of Equation , , where Equation shows the gamma function, , which will be useful as a reference and Equation shows the expansion of the numerator in Equation . To prove Equation , we will begin by changing the variables of and in Equation by reexpressing them in terms of and . Importantly, when changing variables in a double integral, the formula below in Equation must be followed:

where is the Jacobian of and (for a great explanation, see Jacobian and change of variables). To apply Equation , we will first determine the expressions of and in terms of and to obtain and , which are, respectively, provided below in Equation and Equation .

With the expression for and , the determinant of the Jacobian of and can now be computed, as shown below and provided in Equation . With computed, we can no express the new function with the changed variables, as shown below in Equation .

At this point, we need to determine the integration limits of and by evaluating them at the limits of and , which is shown below.

Therefore, the original integration limits of 0 to of and produce integration limits 0 to for and 0 to 1 for . Recalling the gamma function (Equation and the beta function (Equation ), the beta function can now be expressed in terms of the gamma function, proving Equation .

Appendix D: Proof of Relation Between Gamma and Factorial Functions

To prove the following proposition in Equation that

it is first helpful to prove the proposition below in Equation that To prove Equation , we first expand Equation in Equation and then simplify Equation using integration by parts such that

To simplify Equation , I will first focus on the evaluation of between and below. At ,

and because approaches faster than approaches , Equation becomes zero. At , Therefore, Equation simplifies to

Having proven that , it becomes easy to prove Equation which states that . If I continue to expand the gamma function, , where , I will obtain

To evaluate , I write out its expansion and show that

Therefore, expands to because the last term will inevitably be .

Appendix E: Proof of Binomial Theorem

The binomial theorem provided below in Equation states that

I will prove the binomial theorem using induction. Thus, I will first prove the binomial theorem in a base where so that I can later generalize the proof with a larger number of . In the base case, the binomial theorem is proven such that

Now, I will prove the binomial theorem with . Thus,

I now expand the left-hand side of Equation , to obtain

Now I, respectively, remove and from the first and second terms of Equation so that the sums iterate over the same range of to .

I then apply Pascal’s rule to simplify the addition of summations in Equation to obtain


  1. Discrete variables have a countable number of discrete values. In the current example with ten coin flips (), the number of heads is a discrete variable because the number of heads, , has a countable number of outcomes, ↩︎

  2. It should be noted that Bayes’ formula can also be used to determine the value of that most likely produced the data. Instead of calculating, , however, Bayes’ formula uses prior information about an hypothesis to calculate the probability of given the data, (for a review, see Citation: (). Introduction to the concept of likelihood and its applications. Advances in Methods and Practices in Psychological Science, 1(1), 60–69. https://doi.org/10.1177/251524591774 ). ↩︎