I’ve been looking at the Expectation-Maximization (EM) algorithm, which is a really interesting tool for estimating parameters in multivariate models with latent (i.e. unobserved) variables. However, I found it quite hard to understand it from the formal definitions and explanations on Wikipedia. Even the more “intuitive” examples I found left me scratching my head a bit, mostly because I think they are a bit too complicated to get an intuition of what’s going on. So here I will run through the simplest possible example I can think of.

First, the purpose of the EM algorithm is to analyse a model with variables which you know exist but which are missing from the data. It allows you to find (as best as possible) what those missing values are, as well as the parameters for the whole model. That is, we take some measured variables, , some unobserved variables , and a parameter vector and calculate the expected values of and the maximum likelihood estimates of the parameters, written . The maximum likelihood estimate is a “guess” of the parameter values based on some observed data.

However, for this example, we will begin by ignoring the technical definition of “maximum likelihood estimate”, and go through a simple example in which the EM algorithm is essentially just the “obvious” way to get to the “right” guess.

Let’s start by assuming that we know what the model is, then build something to infer it. To begin with, lets take the go-to trivial model for statistics, flipping a coin. Let’s say that the random variable represents the coin toss, and takes the value 1 if the coin comes up heads and 0 otherwise. Thus if we performed a sequence of coin tosses we would get an output like:

z = 0,0,1,0,1,1,0,1,0

For the sake of argument. I find it easier to write the above than

T,T,H,T,H,H,T,H,T

but they obviously mean the same thing. One reason that the first version is nicer is that the obvious “guess” for the probability that the coin comes up heads, based on the data observed, is just the average value of the outcome variable:

The overbar-z meaning average value. We could equally write where is the number of times the coin came up heads. This will be useful later but we will stick with the formulation for now. It should be obvious that these two values are the same thing.

In statistical *inference*, we assume that the values are drawn from an underlying distribution belonging to the random variable , or alternatively there is some underlying process producing the which determines the probability that it will take different values. In the binary coin-toss example here, the only parameter we need is the probability that the coin will come up heads, which we will call . Thus we have the distribution function :

This should all be pretty obvious. It is also (I argue) pretty obvious that by some reasonable standard is the best guess you could have for . That is, the estimate based on the data, is, in this case, simply the average value of the outcome, or alternatively, the proportion of the time that the coin comes up heads. This guess – the average value – is also technically speaking the maximum likelihood estimate, but like I say for now there is no need to worry about what that means.

Now on to the real example. Suppose we have this one coin toss , and if that comes up tails () we perform two more coin tosses on a totally different coin which has probability of coming up heads, otherwise (if ) we do the second two tosses on yet another different coin which has probability of coming up heads. The two tosses of the second coin we will call and . Visualise it like this:

Or more compactly just in terms of the variables:

The model gives us these probabilities:

Lets consider a model where . If we “simulate” these coin tosses ten times with the computer random number generator we might get an output like this:

1 | 0 | 0 |

0 | 0 | 1 |

1 | 1 | 0 |

0 | 1 | 0 |

1 | 1 | 1 |

0 | 0 | 0 |

1 | 0 | 1 |

0 | 1 | 0 |

1 | 0 | 0 |

0 | 0 | 0 |

Now, let’s imagine that we already know the biases of the two secondary coins, i.e. we already know and , but we don’t know the bias of the original coin, . Obviously, if we had all the above data, we would just guess as as before – the information about how the other coin came up is pretty irrelevant. That is, the good guess for is the average of the column on the right. However, let’s suppose that we weren’t shown what the outcome of the first coin toss was:

1 | 0 | ? |

0 | 0 | ? |

1 | 1 | ? |

0 | 1 | ? |

1 | 1 | ? |

0 | 0 | ? |

1 | 0 | ? |

0 | 1 | ? |

1 | 0 | ? |

0 | 0 | ? |

Now there is no column on the right to average. Here is where the EM algorithm comes in, and we do something that is a little bit bizarre but also makes total sense. Let’s say we guess a value of totally at random. This isn’t a good guess, but it’s a guess. We’ll call the current guess . Arbitrarily, let’s pick . Given this guess and the data that we *do* know, namely , we can also make a guess of the average value that should have taken. This comes from Bayes theorem:

We substitute in the current guess for . So for example, if both of the secondary coin tosses came up as heads (1), then we calculate:

Thus we can re-do the table with these new guesses for , which will take a different value depending on what the result was for , but we just use the above formula in all cases:

Guess of assuming | ||
---|---|---|

1 | 0 | 0.100000 |

0 | 0 | 0.012195 |

1 | 1 | 0.500000 |

0 | 1 | 0.100000 |

1 | 1 | 0.500000 |

0 | 0 | 0.012195 |

1 | 0 | 0.100000 |

0 | 1 | 0.100000 |

1 | 0 | 0.100000 |

0 | 0 | 0.012195 |

But now that we’ve filled in the table with guesses for , we can calculate another guess for by just averaging the right-hand column again. This gives as a new guess which we’ll call of approximately 0.154. But now that we have , we could guess the values of again, and because is different to , we will get a different right hand column, leading to another guess . This is the expectation-maximization algorithm: we just keep iterating this process. If we do this we get a converging sequence of guesses for :

We can stop iterating when the algorithm converges, e.g. when where is some small value, in this case I used 0.001.

The correct value of was , and the algorithm has converged at about , so it hasn’t done too well here, but this was a very small sample. Here’s what happened when I tried again with 100 samples:

The dot-dash red line is the true value of in the middle. The dashed black line is the guess for that we would make if the values of were available to us (i.e. the mean value of ), and the solid lines are the output of the EM algorithm from five randomly chosen starting points: note that they all converge to the same estimate as each other, but the estimate is not necessarily right on the correct value (there is still some sampling error), nor is it the same as the estimate based on knowledge of .

**Formalization as maximum likelihood**

The above process, I argue, gives fairly “obviously” the right answer. We start with a guess of , then guess based on that, which gives us a new guess of and so on. We want to make sure this matches the formal definition of the EM algorithm.

To begin with, the technical definition of the maximum likelihood estimate: start with a likelihood function, which is a function of the parameter defined by the probability of some outcome according to that parameter. Here it is convenient to go back to thinking in terms of the number of heads that came up: let’s say there was trials, and in of them the first coin came up heads. The outcome variable now is – the total number of heads, which is a *sufficient statistic* for the parameter . That is, we don’t need to know exactly which of the trials gave a heads, just the total number of trials that gave heads. The likelihood function for is then just determined by the binomial distribution:

The maximum likelihood estimate (MLE) maximises this function: . The easy way to maximise this is to first note that the value of that maximises also maximises the log-likelihood because the log function increases whenever it’s argument increases. Thus we can just differentiate the log-likelihood function to get:

Set this equal to 0 and re-arrange to get as the MLE (the observant reader will notice that this is not actually sufficient to prove that is the maximum, it is left as an exercise to thus complete the argument, for now it will do). This MLE: is what we just called before the obvious “guess” for when the outcomes value are known.

Now, the expectation-maximization algorithm as defined by the Wikipedia page states that we first need to define a sort-of pseudo-likelihood function , based on our current guess of the parameter, namely:

That is, we find the expectation of the log-likelihood function according to our current guess and known observations, and . The EM algorithm proceeds by finding the value of that maximises . Notice that in we are taking the expectation *relative to the existing guess of * — i.e. relative to , but the log-likelihood function is in terms of the “free” variable (which can be different to ).

Finding the that maximises this is actually pretty easy: note that the expectation operation is linear, i.e. , which means that we can differentiate by just differentiating inside of the expectation, and we’ve already differentiated the log-likelihood function above, so:

Setting to 0 and using linearity of expectation again:

Thus, the next estimate for , being the value that maximises , is the average of the expected number of heads on coin according to our current estimate of and the values of and that we saw. It should be fairly easy to see that this is the same thing as the average value of the right-hand column in the tables above.

I won’t go through the proof that EM converges to the right answer (i.e. MLE) but it is out there in various forms on Wikipedia and references therein. What we’ve seen here is an intuitive example of it, and some discussion of how that example corresponds to the formal EM algorithm.

In most other examples like this, you assume that and are also unknown and need to be estimated. This, I find, overcomplicates things as an introductory example. Having seen how it works for one parameter, it should now be fairly obvious how to extend it to do all the parameters. The guessing procedure for doesn’t change, except that you have to also uses the guesses and when you apply Bayes’ theorem to get the hidden values . Having done that it’s fairly straightforward to come up with new estimates for all three parameters.

**Observations**

Looking at this, a couple of interesting things strike me.

**1. Convergence**

First is in the form of the update equation for , namely:

Notice that the expectation is conditioned on not just the previous parameter guess but on the observed data and . If we didn’t do this we would just have which is clearly just . I.e. if we did

Then we would always have — so no change on each iteration. The interesting thing is that this is precisely what we want under convergence — for to stop changing, implying that what the EM algorithm does (and hence what the MLE does) is find the value of such that

i.e. such that the expectation is “independent” of the descendent observed variables and . This makes an intuitive kind of sense to me: think about X1 and X2 as causal descendants of Z – since Z happened “first” (remember we toss the first coin to get Z before tossing the second coin to get X1 and X2), it makes sense that the outcomes X1 and X2 should only tell us something about the number of heads on the first coin if our current guess of is *wrong*.

**2. Not Bayesian**

I’ve seen it written that the EM algorithm implements Bayesian reasoning. It doesn’t, it just involves Bayes’ theorem. Using Bayes’ theorem is not (necessarily) Bayesian. For it to be Bayesian, there should be some kind of probability distribution defined on the parameter , usually both *prior* (before collecting the data) and *posterior* (), and there isn’t – at least not in the formulation we’ve used here. Any kind of MLE is (usually) not Bayesian. That said, I’m sure there are Bayesian variants of EM. It’s just best not to get confused. At least the Wikipedia page gets it right in this case.

*EDIT: I originally forgot about the denominator q(1-q) that you should get in the derivative of the log-likelihood function I think. This disappears when you set equal to 0 so does not really matter. Having fixed this I am moderately sure my derivations in the middle section are correct*