Probability, Likelihood, and Maximum Likelihood Estimation
Probability and likelihood are discussed in the context of a coin flipping scenario and it is shown that only probabilities sum to one. Although likelihoods cannot be interpreted as probabilities, they can be used to determine the set of parameter values that most likely produced a data set (maximum likelihood estimates). Maximum likelihood estimation provides one efficient method for determining maximum likelihood estimates and is applied in the binomial and Gaussian cases.
Probability Mass Functions: The Probability of Observing Each Possible Outcome Given One Set of Parameter Values
Consider an example where a researcher obtains a coin and believes it to be unbiased, $\theta = P(head) = 0.50$. To test this hypothesis, the researcher intends to flip the coin 10 times and record the result as a 1 for heads and 0 for tails. Thus, a vector of 10 observed scores is obtained, $\mathbf{y} \in \{0,1\}^{n}$, where $n = 10$. Before collecting the data to test their hypothesis, the researcher would like to get an idea of the probability of observing any given number of heads given that the coin is unbiased and there are 10 coin flips, $P(\mathbf{y}|\theta,n)$. Thus, the outcome of interest is the number of heads, $h$, where $\{h \mid 0 \le h \le 10\}$. Because each coin flip has a dichotomous outcome and the result of any given flip is independent of all the other flips, the number of heads will be distributed according to a binomial distribution, $h \sim B(n, \theta)$. To compute the probability of obtaining any given number of heads, the binomial function shown below in Equation 1.1 can be used:
$$P(h|\theta,n) = \binom{n}{h}\theta^{h}(1-\theta)^{n-h}, \tag{1.1}$$
where $\binom{n}{h}$ gives the total number of ways in which $h$ heads (or successes) can be obtained in a series of $n$ attempts (i.e., coin flips) and $\theta^{h}(1-\theta)^{n-h}$ gives the probability of any one specific sequence of $h$ heads and $n-h$ tails in a given set of $n$ flips. Thus, the binomial function (Equation 1.1) has an underlying intuition: To compute the probability of obtaining $h$ heads given $n$ flips and a certain probability of success $\theta$, the probability of one specific sequence of $h$ heads and $n-h$ tails, $\theta^{h}(1-\theta)^{n-h}$, is multiplied by the total number of ways in which $h$ heads can be obtained in $n$ coin flips, $\binom{n}{h}$.
As an example, the probability of obtaining four heads (ℎ=4) in 10 coin flips (𝑛=10) is calculated below.
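The original calculation did not survive extraction; a minimal sketch of it in base R (using the built-in `choose` function for the binomial coefficient) might look like:

```r
#probability of four heads in ten flips of an unbiased coin (Equation 1.1)
num_ways <- choose(n = 10, k = 4)            #binomial coefficient: 210
prob_each_way <- 0.5^4 * (1 - 0.5)^(10 - 4)  #probability of any one sequence: (0.5)^10
num_ways * prob_each_way                     #0.2050781
```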
Thus, there are 210 possible ways of obtaining four heads in a series of 10 coin flips, with each way having a probability of $(0.5)^{10}$ of occurring. Altogether, four heads have a probability of .205 of occurring given a probability of heads of .50 and 10 coin flips.
In order to calculate the probability of obtaining each possible number of heads in a series of 10 coin flips, the binomial function (Equation 1.1) can be computed for each number of heads, ℎ. The resulting probabilities of obtaining each number of heads can then be plotted to produce a probability mass function: A distribution that gives the probability of obtaining each possible value of a discrete random variable¹ (see Figure 1). Importantly, probability mass functions have two conditions: 1) the probability of obtaining each value is non-negative and 2) the sum of all probabilities is one. The R code block below (see lines 1–65) produces a probability mass function for the current binomial example.
#create function that computes probability mass function with following arguments:
##num_trials = number of trials (10 [coin flips] in the current example)
##prob_success = probability of success (or heads; 0.50 in the current example)
##num_successes = number of successes (or heads; [0-10] in the current example)
compute_pmf <- function(num_trials, prob_success, num_successes) { #body reconstructed from the comments above
  choose(n = num_trials, k = num_successes) *
    prob_success^num_successes * (1 - prob_success)^(num_trials - num_successes)
}
Figure 1. Probability Mass Function With an Unbiased Coin (θ = 0.50) and Ten Coin Flips (n = 10)
Note. Number emboldened on the x-axis indicates the number of heads that is most likely to occur with an unbiased coin and 10 coin flips, with the corresponding bar in darker blue indicating the corresponding probability.
Figure 1 shows the probability mass function that results with an unbiased coin (𝜃=0.50) and ten coin flips (𝑛=10). In looking across the probability values of obtaining each number of heads (x-axis), 5 heads is the most likely value, as indicated by the emboldened number on the x-axis and the bar above it with a darker blue color. As an aside, the R code below (lines 66–70) verifies the two conditions of probability mass functions for the current example (for a mathematical proof, see Appendix A).
#Condition 1: All probability values have nonnegative values.
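Only the first comment of the verification code survived extraction; a minimal sketch of the check (assuming the pmf values are computed with base R's `dbinom`) might look like:

```r
#compute the pmf over all possible numbers of heads
probs <- dbinom(x = 0:10, size = 10, prob = 0.5)
#Condition 1: all probability values are nonnegative
all(probs >= 0) #TRUE
#Condition 2: the probabilities sum to one
sum(probs)      #1
```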
With a probability mass function that shows the probability of obtaining each possible number of heads, the researcher now has an idea what outcomes to expect after flipping the coin ten times. Unfortunately, the probability mass function in Figure 1 gives no insight into the coin’s actual probability of heads, 𝜃, after data have been collected; in computing the probability mass function, the probability of heads is fixed. Thus, the researcher must use a different type of distribution to determine the coin’s probability of heads.
Likelihood Distributions: The Probability of Observing Each Possible Set of Parameter Values Given a Specific Outcome
Continuing with the coin flipping example, the researcher flips the coin 10 times and obtains seven heads. With these data, the researcher wants to determine the probability value of heads, 𝜃, that most likely produced the data, 𝑃(ℎ,𝑛|𝜃).² Before continuing, it is important to explain why the researcher is no longer dealing with probabilities and is instead dealing with likelihoods.
Likelihoods are not Probabilities
Because we are interested in determining which value of 𝜃∈[0,1] most likely produced the data, the probability of observing the data must be computed for each of these values, 𝑃(ℎ=7,𝑛=10|𝜃). Thus, we now fix the data, ℎ=7,𝑛=10, and vary the parameter value of 𝜃. Although we also use the binomial function to compute 𝑃(ℎ=7,𝑛=10|𝜃) for each 𝜃∈[0,1], the resulting values are not probabilities because they do not sum to one. Indeed, the R code block below (lines 73–80) shows that the values sum to 9.09. Thus, when fixing the data and varying the parameter values, the resulting values do not sum to one (for a mathematical proof with the binomial function, see Appendix B) and are, therefore, not probabilities: they are likelihoods. To signify the shift from probabilities to likelihoods, a different notation is used. Instead of computing the probability of the data given a parameter value, 𝑃(ℎ=7,𝑛=10|𝜃), the likelihood of the parameter given the data is computed, 𝐿(𝜃|ℎ,𝑛).
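The referenced code block (lines 73–80) is not shown here; a minimal sketch of the computation, evaluating the binomial function over a grid of 𝜃 values with the data fixed, might look like:

```r
#likelihood of each theta on a grid, given fixed data (h = 7, n = 10)
theta <- seq(from = 0, to = 1, by = 0.01)
likelihoods <- choose(n = 10, k = 7) * theta^7 * (1 - theta)^(10 - 7)
sum(likelihoods) #about 9.09, not 1
```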
In computing likelihoods, it is important to note that, because they do not sum to one, they cannot be interpreted as probabilities. As an example, the likelihood of 0.117 obtained for 𝐿(𝜃=.50|ℎ=7,𝑛=10) does not mean that there is an 11.7% chance that .50 is the coin’s true probability of heads: The value of 𝐿(𝜃=.50|ℎ=7,𝑛=10)=0.117 only provides a measure of how strongly the data are expected under the hypothesis that 𝜃=.50. To gain a better understanding of whether the likelihood value of 0.117 is a high value, the likelihood values of all the other 𝜃 values can be computed.
Creating a Likelihood Distribution to Find the Maximum Likelihood Estimate
Figure 2 shows the likelihood distribution for all values of 𝜃∈[0,1]. By plotting the likelihoods, the parameter value that most likely produced the data, or the maximum likelihood estimate, can be identified. The maximum likelihood estimate of 𝜃 in this example is .70, which is emboldened on the x-axis, with its likelihood indicated by the height of the corresponding vertical bar. The R code block below (lines 82–126) computes and plots the likelihood values for all 𝜃∈[0,1].
Figure 2. Likelihood Distribution With Seven Heads (h = 7) and Ten Coin Flips (n = 10)
Note. Number emboldened on the x-axis indicates the maximum likelihood estimate for θ and the corresponding bar in dark blue indicates the likelihood value.
Although maximum likelihood estimates can be identified by creating likelihood distributions, this method is not efficient. Creating such distributions is computationally demanding when a large range of parameter values must be considered. More importantly, many situations arise where several parameters are estimated at once, which can make plotting the likelihood distribution impossible. As an example, if a researcher wants to estimate six parameters and plot the likelihood distribution, then six parameter dimensions (plus the likelihood itself) would have to be represented in a single plot, which is a nearly impossible task. Thus, a more efficient method is needed to find maximum likelihood estimates that does not rely on plotting.
Using Maximum Likelihood Estimation to Find the Most Likely Set of Parameter Values
Maximum likelihood estimation identifies maximum likelihood estimates by using calculus to find a peak of the likelihood distribution. In mathematical parlance, maximum likelihood estimation solves for the parameter value where the derivative (i.e., rate of change) is zero. Assuming the likelihood only has one peak (i.e., it is unimodal), the parameter value at the zero-derivative point will have the highest likelihood and will, therefore, be the maximum likelihood estimate. In mathematical notation, then, the maximum likelihood estimate, 𝜃𝑀𝐿𝐸, is the value of 𝜃 that maximizes the likelihood function
$$\theta_{MLE} = \operatorname*{arg\,max}_{\theta}\, L(\theta|D). \tag{3.1}$$
In the two sections that follow, I will apply maximum likelihood estimation for the binomial and Gaussian cases.
Maximum Likelihood Estimation for the Binomial Case
In the binomial case, there is only one parameter of interest: the probability of heads, 𝜃. Thus, maximum likelihood estimation will find the value of 𝜃 that maximizes the likelihood function shown below in Equation 3.2:

$$L(\theta|h,n) = \binom{n}{h}\theta^{h}(1-\theta)^{n-h}. \tag{3.2}$$
Before computing the maximum likelihood estimate, however, it is important to apply a log transformation to Equation 3.2 for two reasons. First, applying a log transformation to the likelihood function of Equation 3.2 greatly simplifies the computation of the derivative because taking the derivative of the log-likelihood does not involve a lengthy application of the quotient, product, and chain rules. Second, log-likelihoods are necessary to avoid underflow: the rounding of small numbers to zero in computers. As an example, in a coin flipping example with a moderate number of flips such as 𝑛=100 and ℎ=70, many likelihood values become extremely small (e.g., 1.2E-73) and can easily be rounded down to zero within computers. Instead of directly representing extremely small values, log-likelihoods can be used to retain numerical precision. For example, the value of 1.2E-73 becomes approximately −72.92 on a log scale (base 10), $\log_{10}(1.2\times10^{-73}) \approx -72.92$. In applying a log transformation to the likelihood function, the log-likelihood function shown below in Equation 3.3 is obtained:
$$\log L(\theta|h,n) = \log\binom{n}{h} + h\log(\theta) + (n-h)\log(1-\theta) \tag{3.3}$$
To solve for 𝜃𝑀𝐿𝐸, the partial derivative of $\log L(\theta|h,n)$ with respect to 𝜃 is computed below and then set to zero (at a peak, the likelihood function has a rate of change of zero with respect to 𝜃).
$$
\begin{aligned}
\frac{\partial \log L(\theta|h,n)}{\partial\theta} &= \frac{\partial}{\partial\theta}\left(\log\binom{n}{h} + h\log(\theta) + (n-h)\log(1-\theta)\right)\\
&= 0 + h\left(\frac{1}{\theta}\right) + (n-h)(-1)\left(\frac{1}{1-\theta}\right)\\
0 &= \frac{h}{\theta} - \frac{n-h}{1-\theta}\\
\frac{n-h}{1-\theta} &= \frac{h}{\theta}\\
\theta n - \theta h &= h - \theta h\\
\theta n &= h\\
\theta &= \frac{h}{n}
\end{aligned}
\tag{3.4}
$$
Therefore, the maximum likelihood estimate for the probability of heads, 𝜃, is found by dividing the number of observed heads by the number of flips, ℎ/𝑛 (see Equation 3.4). In the current example where seven heads were obtained in ten coin flips, the probability value of heads that maximizes the probability of observing the data is .70, 𝜃𝑀𝐿𝐸 = 7/10 = .70.
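As a numerical check on this closed-form result, the log-likelihood (Equation 3.3, with the constant binomial coefficient dropped) can be maximized with base R's `optimize`; this is a sketch, not part of the original code listing:

```r
#numerically maximize the binomial log-likelihood for h = 7, n = 10
log_lik <- function(theta) {7 * log(theta) + (10 - 7) * log(1 - theta)}
optimize(f = log_lik, interval = c(0, 1), maximum = TRUE)$maximum #approximately 0.70
```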
Maximum Likelihood Estimation for Several Binomial Cases
To build on the current example, consider a more realistic scenario where a researcher decides to flip a coin over multiple sessions. Specifically, in each of 𝑘=10 sessions, the researcher flips the coin 10 times. Across the 10 sessions, the following numbers of heads are obtained: 𝐡=[1,6,4,7,3,4,5,10,5,3]. At this point, it may seem daunting to compute the partial derivative of the resulting likelihood function with respect to 𝜃
because the equation will contain 𝑘=10 terms. Thankfully, a simple equation can be derived that does not require a lengthy partial derivative computation. To derive a 𝜃𝑀𝐿𝐸 equation for multiple coin flipping sessions, I will compute the function for 𝜃𝑀𝐿𝐸 with only two coin flipping sessions that each have their corresponding number of flips, 𝐧=[𝑛1,𝑛2], and heads, 𝐡=[ℎ1,ℎ2].
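Because the sessions are independent, the joint likelihood is the product of the per-session binomial likelihoods, so its log is a sum; a sketch of that two-session derivation (the constant binomial coefficients drop out of the derivative):

$$
\begin{aligned}
\log L(\theta|\mathbf{h},\mathbf{n}) &= \log\binom{n_1}{h_1} + \log\binom{n_2}{h_2} + (h_1+h_2)\log(\theta) + \big((n_1+n_2)-(h_1+h_2)\big)\log(1-\theta)\\
0 &= \frac{h_1+h_2}{\theta} - \frac{(n_1+n_2)-(h_1+h_2)}{1-\theta}\\
\theta_{MLE} &= \frac{h_1+h_2}{n_1+n_2}
\end{aligned}
$$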
Therefore, to obtain 𝜃𝑀𝐿𝐸 when there are 𝑘 coin flipping sessions, the sum of heads, $\sum_{i=1}^{k} h_i$, is divided by the sum of coin flips across the sessions, $\sum_{i=1}^{k} n_i$. In the current example where 𝐡=[1,6,4,7,3,4,5,10,5,3] and each session has 10 coin flips, the maximum likelihood estimate for the probability of heads, 𝜃𝑀𝐿𝐸, is .48 (see lines below).
h<-c(1,6,4,7,3,4,5,10,5,3)
theta_mle<-sum(h)/sum(rep(x=10,times=10))
theta_mle
[1] 0.48
Maximum Likelihood Estimation for the Gaussian Case
To explain maximum likelihood estimation for the Gaussian case, let’s consider a new example where a researcher measures the heights of 100 males, $\mathbf{y} \in \mathbb{R}^{100}$. From previous studies, the researcher believes heights to be normally distributed and, thus, estimates a mean, 𝜇, and standard deviation, 𝜎, for the population heights of males. To obtain population estimates for the mean and standard deviation, the Gaussian function shown below in Equation 3.6 can be used:
$$P(y_i|\sigma,\mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_i-\mu}{\sigma}\right)^{2}}, \tag{3.6}$$
where the probability of observing a $y_i$ score given a population mean, 𝜇, and standard deviation, 𝜎, is computed, $P(y_i|\sigma,\mu)$. Because the researcher is interested in determining the parameter values that most likely produced the data, the parameter values will be varied and the data will be fixed. Thus, likelihoods and not probabilities will be used (see Likelihoods are not Probabilities). Although Equation 3.6 will still be used to compute likelihoods, I will rewrite Equation 3.6 to explicitly indicate that likelihoods will be computed, as shown below in Equation 3.7:
$$L(\sigma,\mu|y_i) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_i-\mu}{\sigma}\right)^{2}}. \tag{3.7}$$
Importantly, Equation 3.7 above only computes the likelihood given one $y_i$ data point. Because the researcher wants to determine the parameter values that most likely produced all 100 data points, $y_i \in \mathbf{y}$, Equation 3.7 must be applied to each data point and all the resulting likelihood values must be multiplied together. Thus, a product of likelihoods must be computed to obtain the likelihood of the parameters given the entire data set, $L(\sigma,\mu|\mathbf{y})$, as shown below in Equation 3.8:
$$L(\sigma,\mu|\mathbf{y}) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_i-\mu}{\sigma}\right)^{2}}. \tag{3.8}$$
As in the binomial case, the likelihood function must be transformed to a log scale to prevent underflow and to simplify the derivation of the partial derivatives. Given that the equation contains Euler’s number, $e$, I will use the log of base $e$, or the natural log, $\ln$, to further simplify the derivatives. Before applying the log transformation, Equation 3.8 can be simplified to yield Equation 3.9 below:

$$L(\sigma,\mu|\mathbf{y}) = \sigma^{-n}(2\pi)^{-n/2}\, e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)^{2}}. \tag{3.9}$$
With a simplified form of Equation 3.8, Equation 3.9 can now be converted to a log scale by using the product rule and then the power rule to obtain the log-likelihood Gaussian function shown below in Equation 3.10.
$$
\begin{aligned}
\text{Apply product rule} \Rightarrow{}& \ln\left(\sigma^{-n}\right) + \ln\left((2\pi)^{-n/2}\right) + \ln\left(e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)^{2}}\right)\\
\text{Apply power rule} \Rightarrow{}& -n\ln(\sigma) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)^{2}\,\ln(e)\\
\ln L(\sigma,\mu|\mathbf{y}) ={}& -n\ln(\sigma) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)^{2}
\end{aligned}
\tag{3.10}
$$
The maximum likelihood estimate functions for the mean, 𝜇, and standard deviation, 𝜎, can now be obtained by taking the derivative of the log-likelihood function with respect to each parameter. The derivation below solves for 𝜇.
$$
\begin{aligned}
\frac{\partial \ln L(\sigma,\mu|\mathbf{y})}{\partial\mu} &= \frac{\partial}{\partial\mu}\left(-n\ln(\sigma) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)^{2}\right)\\
&= 0 - 0 - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}2(y_i-\mu)\frac{\partial}{\partial\mu}(y_i-\mu)\\
&= -\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}2(y_i-\mu)\cdot(-1)\\
&= \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)\\
\text{Set } \frac{\partial \ln L(\sigma,\mu|\mathbf{y})}{\partial\mu} &= 0\\
0 &= \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)\\
0 &= \sum_{i=1}^{n}y_i - \sum_{i=1}^{n}\mu\\
0 &= \sum_{i=1}^{n}y_i - n\mu\\
\mu_{MLE} &= \frac{1}{n}\sum_{i=1}^{n}y_i
\end{aligned}
\tag{3.11}
$$
Therefore, Equation 3.11 above shows that the maximum likelihood estimate for the mean can be obtained by simply computing the mean of the observed 𝑦𝑖 scores. The derivation below solves for 𝜎.
$$
\begin{aligned}
\frac{\partial \ln L(\sigma,\mu|\mathbf{y})}{\partial\sigma} &= \frac{\partial}{\partial\sigma}\left(-n\ln(\sigma) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i-\mu)^{2}\right)\\
&= -\frac{n}{\sigma} + 0 - \frac{1}{2}(-2\sigma^{-3})\sum_{i=1}^{n}(y_i-\mu)^{2}\\
&= -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(y_i-\mu)^{2}}{\sigma^{3}}\\
&= \frac{1}{\sigma^{3}}\left(\sum_{i=1}^{n}(y_i-\mu)^{2} - n\sigma^{2}\right)\\
\text{Set } \frac{\partial \ln L(\sigma,\mu|\mathbf{y})}{\partial\sigma} &= 0\\
0 &= \sum_{i=1}^{n}(y_i-\mu)^{2} - n\sigma^{2}\\
n\sigma^{2} &= \sum_{i=1}^{n}(y_i-\mu)^{2}\\
\sigma_{MLE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\mu)^{2}}
\end{aligned}
\tag{3.12}
$$
Therefore, Equation 3.12 above shows that the maximum likelihood estimate for the standard deviation parameter, 𝜎, is the square root of the average squared deviation of the observed scores from the mean.
Thus, as in the binomial case, maximum likelihood estimation provides a simple function for calculating maximum likelihood estimates for the Gaussian parameters.
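As a numerical check on Equations 3.11 and 3.12, the closed-form estimates can be compared against a direct numerical maximization of the Gaussian log-likelihood. The sketch below uses simulated heights; the mean of 175 cm and SD of 7 cm are arbitrary values, not figures from the text:

```r
#simulate heights and compute the closed-form maximum likelihood estimates
set.seed(27)
y <- rnorm(n = 100, mean = 175, sd = 7)
mu_mle <- sum(y) / length(y)                       #Equation 3.11
sigma_mle <- sqrt(sum((y - mu_mle)^2) / length(y)) #Equation 3.12
#compare with a numerical optimum of the log-likelihood (minimize its negative)
neg_log_lik <- function(par) {
  if (par[2] <= 0) return(Inf) #guard against invalid standard deviations
  -sum(dnorm(x = y, mean = par[1], sd = par[2], log = TRUE))
}
optim(par = c(150, 10), fn = neg_log_lik)$par #approximately mu_mle and sigma_mle
```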
Conclusion
In conclusion, probabilities and likelihoods are fundamentally different. Probabilities sum to one, whereas likelihoods do not sum to one. Thus, likelihoods cannot be interpreted as probabilities. Although likelihoods cannot be interpreted as probabilities, they can be used to determine parameter values that most likely produce observed data sets (maximum likelihood estimates). Maximum likelihood estimation provides an efficient method for determining maximum likelihood estimates and was applied in the binomial and Gaussian cases.
References
Etz, A. (2018). Introduction to the concept of likelihood and its applications. Advances in Methods and Practices in Psychological Science, 1(1), 60–69. https://doi.org/10.1177/251524591774
Appendix A: Proof That the Binomial Function is a Probability Mass Function
To prove that the binomial function is a probability mass function, two outcomes must be shown: 1) all probability values are non-negative and 2) the sum of all probabilities is one.
For the first condition, the impossibility of negative values occurring in the binomial function becomes obvious when individually considering the binomial coefficient, $\binom{n}{h}$, and the binomial factors, $\theta^{h}(1-\theta)^{n-h}$. With respect to the binomial coefficient, $\binom{n}{h}$, it is always nonnegative because it is a ratio of factorials of nonnegative integers; the number of trials, 𝑛, and the number of heads, ℎ, can never be negative. With respect to the binomial factors, the resulting value is always nonnegative because all the constituent terms are nonnegative; in addition to the number of trials and heads (𝑛 and ℎ, respectively), the probabilities of heads and tails are also always nonnegative ($\theta, (1-\theta) \in [0,1]$). Therefore, each probability is the product of a nonnegative binomial coefficient and nonnegative binomial factors, and so is always nonnegative.
For the second condition, the equality stated below in Equation A.1 must be proven:
$$1 = \sum_{h=0}^{n}\binom{n}{h}\theta^{h}(1-\theta)^{n-h}. \tag{A.1}$$
Importantly, it can be proven that all probabilities sum to one by using the binomial theorem, which states below in Equation A.2 that
$$(a+b)^{n} = \sum_{k=0}^{n}\binom{n}{k}a^{k}b^{n-k}. \tag{A.2}$$
Given the striking resemblance between the binomial function in Equation A.1 and the binomial theorem in Equation A.2, it is possible to restate the binomial theorem with respect to the variables in the binomial function. Specifically, we can let $a=\theta$ and $b=1-\theta$, which returns the proof as shown below:

$$\sum_{h=0}^{n}\binom{n}{h}\theta^{h}(1-\theta)^{n-h} = \big(\theta + (1-\theta)\big)^{n} = 1^{n} = 1. \qquad \blacksquare$$
For a proof of the binomial theorem, see Appendix E.
Appendix B: Proof That Likelihoods are not Probabilities
As a reminder, although the same formula is used to compute likelihoods and probabilities, the variables allowed to vary and those that are fixed differ when computing likelihoods and probabilities. With probabilities, the parameters are fixed (i.e., 𝜃) and the data are varied (ℎ,𝑛). With likelihoods, however, the data are fixed (ℎ,𝑛) and the parameters are varied (𝜃). To prove that likelihoods are not probabilities, we have to prove that likelihoods do not satisfy one of the two conditions required by probabilities (i.e., likelihoods can have negative values or likelihoods do not sum to one). Given that likelihoods are calculated with the same function as probabilities and probabilities can never be negative (see Appendix A), likelihoods likewise can never be negative. Therefore, to prove that likelihoods are not probabilities, we must prove that likelihoods do not always sum to one. Thus, the following proposition must be proven:
$$\int_{0}^{1}\binom{n}{h}\theta^{h}(1-\theta)^{n-h}\,d\theta \ne 1. \tag{B.1}$$
That is, the integral of the binomial function with respect to 𝜃 does not equal one. To prove this proposition, it is important to realize that $\int_{0}^{1}\theta^{h}(1-\theta)^{n-h}\,d\theta$ can be restated in terms of the beta function, $B(x,y)$, which is shown below.
$$
\begin{aligned}
B(x,y) &= \int_{0}^{1} t^{x-1}(1-t)^{y-1}\,dt && (B.2)\\
\text{Let } t=\theta \Rightarrow B(x,y) &= \int_{0}^{1} \theta^{x-1}(1-\theta)^{y-1}\,d\theta\\
\text{Let } x=h+1,\ y=n-h+1 \Rightarrow B(h+1,\,n-h+1) &= \int_{0}^{1} \theta^{h+1-1}(1-\theta)^{n-h+1-1}\,d\theta\\
&= \int_{0}^{1} \theta^{h}(1-\theta)^{n-h}\,d\theta && (B.3)
\end{aligned}
$$
Therefore, the function in Equation B.1 can be restated below in Equation B.4 as
$$\int_{0}^{1} L(\theta|h,n)\,d\theta = \binom{n}{h}\,B(h+1,\,n-h+1). \tag{B.4}$$
At this point, another proof becomes important because it allows us to express the beta function in terms of another function that will, ultimately, allow us to simplify Equation B.4 and prove that likelihoods do not sum to one. Specifically, the beta function, B(𝑥,𝑦) can be stated in terms of the gamma function Γ such that
$$B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}. \tag{B.5}$$
For a proof of the beta-gamma relation, see Appendix C. Thus, Equation B.4 can be stated in terms of the gamma function such that
$$\int_{0}^{1} L(\theta|h,n)\,d\theta = \binom{n}{h}\frac{\Gamma(h+1)\,\Gamma(n-h+1)}{\Gamma(n+2)}. \tag{B.6}$$
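This quantity can also be evaluated directly with base R's `beta` function; a minimal sketch for the running example (h = 7, n = 10):

```r
#integral of the binomial likelihood over theta: choose(n, h) * B(h + 1, n - h + 1)
n <- 10
h <- 7
choose(n = n, k = h) * beta(a = h + 1, b = n - h + 1) #1/11, about 0.0909, not 1
```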
One nice feature of the gamma function is that it can be stated as a factorial (for a proof, see Appendix D) such that
$$\Gamma(x) = (x-1)!. \tag{B.7}$$
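This identity can be checked numerically with base R's built-in `gamma` and `factorial` functions, as in this small sketch:

```r
#Equation B.7 numerically: gamma(x) equals factorial(x - 1)
gamma(x = 5)     #24
factorial(x = 4) #24
```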
Given that the gamma function can be stated as a factorial, Equation B.6 can now be written with factorial terms and simplified to prove that likelihoods do not sum to one:

$$\int_{0}^{1} L(\theta|h,n)\,d\theta = \frac{n!}{h!\,(n-h)!}\cdot\frac{h!\,(n-h)!}{(n+1)!} = \frac{n!}{(n+1)!} = \frac{1}{1+n}. \tag{B.8}$$
Therefore, binomial likelihoods integrate to $\frac{1}{1+n}$, and a Riemann sum of likelihood values is a multiple of this quantity, where the multiple is the number of integration steps. The R code block below provides an example. In it, the likelihood is summed over 100 equally spaced steps of 𝜃, so the sum of likelihoods should be $100 \cdot \frac{1}{1+n} = 9.09$, and this turns out to be true (lines 131–136).
num_trials<-10#n
num_successes<-7#h
prob_success<-seq(from=0,to=1,by=0.01)#theta; a grid of 101 values (100 steps of 0.01)
likelihoods<-dbinom(x=num_successes,size=num_trials,prob=prob_success)#likelihood of each theta
sum(likelihoods)#approximately 9.09
Appendix C: Proof of Relation Between Beta and Gamma Functions

To prove the following proposition in Equation C.1 that

$$B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}, \tag{C.1}$$
To begin, let’s write out the expansions of the gamma function, Γ(𝑥), and the numerator of Equation C.1, Γ(𝑥)Γ(𝑦), where
$$\Gamma(x) = \int_{0}^{\infty} t^{x-1}e^{-t}\,dt \tag{C.2}$$

$$\Gamma(x)\Gamma(y) = \int_{0}^{\infty} t^{x-1}e^{-t}\,dt \int_{0}^{\infty} s^{y-1}e^{-s}\,ds. \tag{C.3}$$
Equation C.2 shows the gamma function, Γ(𝑥), which will be useful as a reference and Equation C.3 shows the expansion of the numerator in Equation C.1. To prove Equation C.1, we will begin by changing the variables of 𝑠 and 𝑡 in Equation C.3 by reexpressing them in terms of 𝑢 and 𝑣. Importantly, when changing variables in a double integral, the formula below in Equation C.4 must be followed:
$$\iint_{G} f(x,y)\,dx\,dy = \iint f\big(g(u,v),\,h(u,v)\big)\,\det\!\big(\mathbf{J}(u,v)\big)\,du\,dv, \tag{C.4}$$
where $\det(\mathbf{J}(u,v))$ is the determinant of the Jacobian matrix of $u$ and $v$ (for a great explanation, see Jacobian and change of variables). To apply Equation C.4, we will first determine the expressions of $s$ and $t$ in terms of $u$ and $v$ to obtain $g(u,v)$ and $h(u,v)$, which are, respectively, provided below in Equations C.5 and C.6.
$$
\begin{aligned}
\text{Let } u &= s+t, \quad v = \frac{t}{s+t},\\
\text{then } s &= u - t = u - uv = g(u,v) && (C.5)\\
t &= u - s = u - (u - uv) = uv = h(u,v). && (C.6)
\end{aligned}
$$
With the expression for 𝑔(𝑢,𝑣) and ℎ(𝑢,𝑣), the determinant of the Jacobian of 𝑢 and 𝑣 can now be computed, as shown below and provided in Equation C.7.
$$\det\mathbf{J}(u,v) = \det\begin{bmatrix} \dfrac{\partial g}{\partial u} & \dfrac{\partial g}{\partial v}\\[4pt] \dfrac{\partial h}{\partial u} & \dfrac{\partial h}{\partial v} \end{bmatrix} = \det\begin{bmatrix} 1-v & -u\\ v & u \end{bmatrix} = (1-v)u - (-uv) = u - uv + uv = u \tag{C.7}$$
With $\det\mathbf{J}(u,v)$ computed, we can now express the new function with the changed variables, as shown below in Equation C.8.
$$
\begin{aligned}
\iint_{G} f\big(g(u,v),\,h(u,v)\big)\det\!\big(\mathbf{J}(u,v)\big)\,du\,dv &= \iint_{R} (uv)^{x-1}e^{-uv}\,(u-uv)^{y-1}e^{-(u-uv)}\,u\,du\,dv\\
&= \iint_{R} u^{x-1}v^{x-1}e^{-uv}\,u^{y-1}(1-v)^{y-1}e^{-(u-uv)}\,u\,du\,dv\\
&= \iint_{R} u^{x-1}u^{y-1}u\, e^{-uv}e^{-u+uv}\, v^{x-1}(1-v)^{y-1}\,du\,dv\\
&= \iint_{R} u^{x+y-1}e^{-u}\, v^{x-1}(1-v)^{y-1}\,du\,dv
\end{aligned}
\tag{C.8}
$$
At this point, we need to determine the integration limits of 𝑢 and 𝑣 by evaluating them at the limits of 𝑠 and 𝑡, which is shown below.
Recall $u = s+t$, $v = \frac{t}{s+t}$, and $s,t \in [0,\infty)$.

$$
\begin{aligned}
\text{If } s=0 &\Rightarrow u=t,\ v=1\\
\text{If } s=\infty &\Rightarrow u=\infty,\ v=0\\
\text{If } t=0 &\Rightarrow u=s,\ v=0\\
\text{If } t=\infty &\Rightarrow u=\infty,\ v=1
\end{aligned}
$$
Therefore, the original integration limits of 0 to ∞ for $s$ and $t$ produce integration limits of 0 to ∞ for $u$ and 0 to 1 for $v$. Recalling the gamma function (Equation C.2) and the beta function (Equation B.2), the double integral separates into the product of a gamma function and a beta function,

$$\Gamma(x)\Gamma(y) = \int_{0}^{\infty} u^{x+y-1}e^{-u}\,du \int_{0}^{1} v^{x-1}(1-v)^{y-1}\,dv = \Gamma(x+y)\,B(x,y),$$

proving Equation C.1. $\blacksquare$
Appendix D: Proof of Relation Between Gamma and Factorial Functions
To prove the following proposition in Equation D.1 that
$$\Gamma(x) = \int_{0}^{\infty} t^{x-1}e^{-t}\,dt = (x-1)!, \tag{D.1}$$
it is first helpful to prove the proposition below in Equation D.2 that
$$\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha). \tag{D.2}$$
To prove Equation D.2, we first expand Equation D.2 in Equation D.3 and then simplify Equation D.3 using integration by parts such that
$$
\begin{aligned}
\Gamma(\alpha+1) &= \int_{0}^{\infty} t^{\alpha}e^{-t}\,dt && (D.3)\\
\int u\,dv &= uv - \int v\,du. && (D.4)\\
\text{Let } u &= t^{\alpha},\ dv = e^{-t}dt,\ du = \alpha t^{\alpha-1}dt,\ v = -e^{-t}.\\
\int u\,dv &= -t^{\alpha}e^{-t}\Big|_{0}^{\infty} - \int_{0}^{\infty}(-e^{-t})\,\alpha t^{\alpha-1}\,dt && (D.5)
\end{aligned}
$$
To simplify Equation D.5, I will first focus on the evaluation of −𝑡𝛼𝑒−𝑡 between ∞ and 0 below. At 𝑡=∞,
$$-t^{\alpha}e^{-t} = -\frac{\infty^{\alpha}}{e^{\infty}}, \tag{D.6}$$
and because $e^{\infty}$ approaches ∞ faster than $\infty^{\alpha}$, Equation D.6 becomes zero. At $t=0$,
$$-t^{\alpha}e^{-t} = -0^{\alpha}e^{-0} = \frac{0}{1} = 0.$$
Therefore, Equation D.5 simplifies to
$$\int u\,dv = 0 - 0 - \int_{0}^{\infty}(-e^{-t})\,\alpha t^{\alpha-1}\,dt = \alpha\int_{0}^{\infty} t^{\alpha-1}e^{-t}\,dt = \alpha\,\Gamma(\alpha) \qquad \blacksquare$$
Having proven that Γ(𝛼+1)=𝛼Γ(𝛼), it becomes easy to prove Equation D.1 which states that Γ(𝑥)=(𝑥−1)!. If I continue to expand the gamma function, Γ(𝑥−𝑛), where 𝑛=𝑥−1, I will obtain
$$
\begin{aligned}
\Gamma(x) &= (x-1)\,\Gamma(x-1)\\
\Gamma(x-1) &= (x-2)\,\Gamma(x-2)\\
&\ \ \vdots\\
\Gamma(x-n) &= (1)\,\Gamma(1)
\end{aligned}
$$
To evaluate Γ(1), I write out its expansion and show that
$$\Gamma(1) = \int_{0}^{\infty} t^{1-1}e^{-t}\,dt = -e^{-t}\Big|_{0}^{\infty} = -e^{-\infty} + e^{0} = 0 + 1 = 1$$
Therefore, Γ(𝑥) expands to (𝑥−1)! because the last term will inevitably be 1×Γ(1)=1.
Appendix E: Proof of the Binomial Theorem

The binomial theorem provided below in Equation E.1 states that
$$(x+y)^{n} = \sum_{k=0}^{n}\binom{n}{k}x^{n-k}y^{k}. \tag{E.1}$$
I will prove the binomial theorem using induction. Thus, I will first prove the binomial theorem in a base case where 𝑛=1 and then show that, if the theorem holds for 𝑛, it also holds for 𝑛+1. In the base case, the binomial theorem holds such that
$$x+y = \binom{1}{0}x^{1-0}y^{0} + \binom{1}{1}x^{1-1}y^{1} = x+y.$$
Now, I will prove the binomial theorem with 𝑛+1. Thus,
$$(x+y)^{n+1} = \sum_{k=0}^{n+1}\binom{n+1}{k}x^{n+1-k}y^{k}. \tag{E.2}$$
I now expand the left-hand side of Equation E.2, to obtain
$$
\begin{aligned}
(x+y)^{n+1} &= (x+y)(x+y)^{n}\\
&= (x+y)\sum_{k=0}^{n}\binom{n}{k}x^{n-k}y^{k}\\
&= x\sum_{k=0}^{n}\binom{n}{k}x^{n-k}y^{k} + y\sum_{k=0}^{n}\binom{n}{k}x^{n-k}y^{k}\\
&= \sum_{k=0}^{n}\binom{n}{k}x^{n+1-k}y^{k} + \sum_{k=0}^{n}\binom{n}{k}x^{n-k}y^{k+1}\\
&= \sum_{k=0}^{n}\binom{n}{k}x^{(n+1)-k}y^{k} + \sum_{k=1}^{n+1}\binom{n}{k-1}x^{n-(k-1)}y^{k}
\end{aligned}
\tag{E.3}
$$
Now I, respectively, remove 𝑘=0 and 𝑘=𝑛+1 from the first and second terms of Equation E.3 so that the sums iterate over the same range of 𝑘=1 to 𝑘=𝑛:

$$(x+y)^{n+1} = x^{n+1} + \sum_{k=1}^{n}\left[\binom{n}{k} + \binom{n}{k-1}\right]x^{(n+1)-k}y^{k} + y^{n+1}.$$

By Pascal’s rule, $\binom{n}{k} + \binom{n}{k-1} = \binom{n+1}{k}$, and because $x^{n+1} = \binom{n+1}{0}x^{n+1}y^{0}$ and $y^{n+1} = \binom{n+1}{n+1}x^{0}y^{n+1}$, the right-hand side collapses to $\sum_{k=0}^{n+1}\binom{n+1}{k}x^{n+1-k}y^{k}$, proving Equation E.2. $\blacksquare$
Discrete variables have a countable number of discrete values. In the current example with ten coin flips (𝑛=10), the number of heads is a discrete variable because the number of heads, ℎ, has a countable number of outcomes, ℎ∈{0,1,2,…,𝑛}. ↩︎
It should be noted that Bayes’ formula can also be used to determine the value of 𝜃 that most likely produced the data. Instead of calculating 𝑃(ℎ,𝑛|𝜃), however, Bayes’ formula uses prior information about a hypothesis to calculate the probability of 𝜃 given the data, 𝑃(𝜃|ℎ,𝑛) (for a review, see Etz, 2018). ↩︎