Intuitive example of expectation maximization

by James Thorniley

I’ve been looking at the Expectation-Maximization (EM) algorithm, which is a really interesting tool for estimating parameters in multivariate models with latent (i.e. unobserved) variables. However, I found it quite hard to understand from the formal definitions and explanations on Wikipedia. Even the more “intuitive” examples I found left me scratching my head, mostly because I think they are too complicated to give an intuition of what’s going on. So here I will run through the simplest possible example I can think of.

First, the purpose of the EM algorithm is to analyse a model with variables which you know exist but which are missing from the data. It allows you to estimate (as well as possible) what those missing values are, along with the parameters of the whole model. That is, we take some measured variables X, some unobserved variables Z, and a parameter vector \theta, and calculate the expected values of Z and the maximum likelihood estimate of the parameters, written \hat{\theta}(X). The maximum likelihood estimate is a “guess” of the parameter values based on some observed data.

However, for this example, we will begin by ignoring the technical definition of “maximum likelihood estimate”, and go through a simple example in which the EM algorithm is essentially just the “obvious” way to get to the “right” guess.

Let’s start by assuming that we know what the model is, then build something to infer it. To begin with, let’s take the go-to trivial model for statistics: flipping a coin. Let’s say that the random variable Z represents the coin toss, and takes the value 1 if the coin comes up heads and 0 otherwise. Thus if we performed a sequence of coin tosses we would get an output like:

z = 0,0,1,0,1,1,0,1,0

for the sake of argument. I find it easier to write the above than

T,T,H,T,H,H,T,H,T

but they obviously mean the same thing. One reason that the first version is nicer is that the obvious “guess” for the probability that the coin comes up heads, based on the data observed, is just the average value of the outcome variable:

\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i

The overbar in \bar{z} denotes the average value. We could equally write k/n, where k is the number of times the coin came up heads. This will be useful later, but we will stick with the \bar{z} formulation for now. It should be obvious that these two values are the same thing.

In statistical inference, we assume that the values z_i are drawn from an underlying distribution belonging to the random variable Z, or alternatively there is some underlying process producing the z_i which determines the probability that it will take different values. In the binary coin-toss example here, the only parameter we need is the probability that the coin will come up heads, which we will call q. Thus we have the distribution function f_Z:

f_Z(z;q) = \left\{ \begin{array}{l l}q & z=1 \\ (1-q) & z = 0 \end{array} \right.

This should all be pretty obvious. It is also (I argue) pretty obvious that by some reasonable standard \bar{z} is the best guess you could have for q. That is, the estimate based on the data, \hat{q}(z) is, in this case, simply the average value of the outcome, or alternatively, the proportion of the time that the coin comes up heads. This guess – the average value – is also technically speaking the maximum likelihood estimate, but like I say for now there is no need to worry about what that means.

Now on to the real example. Suppose we have this one coin toss Z, and if it comes up tails (z=0) we perform two more coin tosses with a totally different coin which has probability p_0 of coming up heads; otherwise (if z=1) we do the second two tosses with yet another coin which has probability p_1 of coming up heads. The two tosses of the second coin we will call X_1 and X_2. Visualise it like this:

[Figure: tree diagram of the two-stage coin-tossing process]

Or more compactly just in terms of the variables:

[Figure: common-cause graph with Z as the parent of X_1 and X_2]

The model gives us these probabilities:

P(Z=1) = q
P(X_1=1|Z=0) =P(X_2=1|Z=0) = p_0
P(X_1=1|Z=1) =P(X_2=1|Z=1) = p_1

Let’s consider a model where q=1/2, p_0=1/4, p_1=3/4. If we “simulate” these coin tosses ten times with the computer’s random number generator we might get an output like this:

x_1 x_2 z
1 0 0
0 0 1
1 1 0
0 1 0
1 1 1
0 0 0
1 0 1
0 1 0
1 0 0
0 0 0
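For concreteness, here is a minimal sketch of how such a simulation might be written, assuming Python with numpy; the variable names and the seed are my own choices, so the exact output will differ from the table above.

import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

q, p0, p1 = 0.5, 0.25, 0.75      # the true parameters from the example
n = 10                           # number of trials

z = rng.binomial(1, q, size=n)   # first coin: heads (1) with probability q
p = np.where(z == 1, p1, p0)     # choose the secondary coin's bias for each trial
x1 = rng.binomial(1, p)          # first toss of the chosen secondary coin
x2 = rng.binomial(1, p)          # second toss of the chosen secondary coin

for row in zip(x1, x2, z):
    print(*row)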

Now, let’s imagine that we already know the biases of the two secondary coins, i.e. we already know p_0=0.25 and p_1=0.75, but we don’t know the bias of the original coin, q. Obviously, if we had all the above data, we would just guess q as \bar{z} as before – the information about how the other coin came up is pretty irrelevant. That is, the good guess for q is the average of the column on the right. However, let’s suppose that we weren’t shown what the outcome of the first coin toss was:

x_1 x_2 z
1 0 ?
0 0 ?
1 1 ?
0 1 ?
1 1 ?
0 0 ?
1 0 ?
0 1 ?
1 0 ?
0 0 ?

Now there is no column on the right to average. Here is where the EM algorithm comes in, and we do something that is a little bit bizarre but also makes total sense. Let’s say we guess a value of q totally at random. This isn’t a good guess, but it’s a guess. We’ll call the current guess q^0. Arbitrarily, let’s pick q^0=0.1. Given this guess and the data that we do know, namely X_1 and X_2, we can also make a guess of the average value that Z should have taken. This comes from Bayes’ theorem:

\displaystyle P(Z=z|X_1=x_1,X_2=x_2) = \frac{P(X_1=x_1,X_2=x_2|Z=z)P(Z=z)}{\sum_{z_i=0}^{1}P(X_1=x_1,X_2=x_2|Z=z_i)P(Z=z_i)}

We substitute in the current guess q^0 for P(Z=1). So for example, if both of the secondary coin tosses came up as heads (1), then we calculate:

\begin{array}{l l} \displaystyle  P(Z=1|X_1=1,X_2=1) &= \frac{P(X_1=1,X_2=1|Z=1)P(Z=1)}{\sum_{z_i=0}^{1}P(X_1=1,X_2=1|Z=z_i)P(Z=z_i)} \\  &= \frac{p_1^2 q^0}{p_0^2 (1-q^0) + p_1^2 q^0} \\  &= \frac{9/16 \cdot 1/10}{1/16 \cdot 9/10 + 9/16 \cdot 1/10} \\  &= \frac{1}{2}  \end{array}

Thus we can re-do the table with these new guesses for z, which will take a different value depending on what the result was for x_1,x_2, but we just use the above formula in all cases:

x_1 x_2 Guess of z assuming q^0=0.1
1 0 0.100000
0 0 0.012195
1 1 0.500000
0 1 0.100000
1 1 0.500000
0 0 0.012195
1 0 0.100000
0 1 0.100000
1 0 0.100000
0 0 0.012195
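As a rough sketch (in Python with numpy, with p_0 and p_1 hard-coded as known values), the column above might be computed like this; the function name posterior_z is my own, and the body is just a direct transcription of the Bayes’ theorem formula:

import numpy as np

p0, p1 = 0.25, 0.75   # known biases of the two secondary coins

def posterior_z(x1, x2, q):
    """P(Z=1 | X1=x1, X2=x2) under the current guess q, via Bayes' theorem."""
    lik1 = p1**(x1 + x2) * (1 - p1)**(2 - x1 - x2)   # P(x1, x2 | Z=1)
    lik0 = p0**(x1 + x2) * (1 - p0)**(2 - x1 - x2)   # P(x1, x2 | Z=0)
    return lik1 * q / (lik0 * (1 - q) + lik1 * q)

# the observed data from the table above
x1 = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
x2 = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0, 0])

print(posterior_z(x1, x2, q=0.1))   # reproduces the right-hand column
print(posterior_z(1, 1, q=0.1))     # the worked example above: 0.5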

But now that we’ve filled in the table with guesses for z, we can calculate another guess for q by just averaging the right-hand column again. This gives us a new guess, which we’ll call q^1, of approximately 0.154. But now that we have q^1, we could guess the values of z again, and because q^1 is different to q^0, we will get a different right-hand column, leading to another guess q^2. This is the expectation-maximization algorithm: we just keep iterating this process. If we do this we get a converging sequence of guesses for q:

[Figure: convergence of the EM guesses q^t over iterations, 10 samples]

We can stop iterating when the algorithm converges, e.g. when |q^{t+1} - q^t| < \eta where \eta is some small value; in this case I used \eta = 0.001.
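In code, the whole iteration is just a loop around the posterior sketch above (reusing posterior_z and the data arrays x1 and x2 from it); again this is only a sketch, and em_for_q is my own name:

def em_for_q(x1, x2, q0, eta=1e-3, max_iter=100):
    """Iterate EM for q with p0 and p1 known, stopping when the change falls below eta."""
    q = q0
    for _ in range(max_iter):
        z_hat = posterior_z(x1, x2, q)   # E-step: guess each hidden z given the current q
        q_new = z_hat.mean()             # M-step: the new q is the average of those guesses
        if abs(q_new - q) < eta:
            return q_new
        q = q_new
    return q

print(em_for_q(x1, x2, q0=0.1))   # converges to roughly 0.375 on this ten-row sample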

The correct value of q was 0.5, and the algorithm has converged at about 0.375, so it hasn’t done too well here, but this was a very small sample. Here’s what happened when I tried again with 100 samples:

[Figure: EM convergence with 100 samples, five random starting points]

The dot-dash red line is the true value of q in the middle. The dashed black line is the guess for q that we would make if the values of z were available to us (i.e. the mean value of z), and the solid lines are the output of the EM algorithm from five randomly chosen starting points: note that they all converge to the same estimate as each other, but the estimate is not necessarily right on the correct value (there is still some sampling error), nor is it the same as the estimate based on knowledge of z.

Formalization as maximum likelihood

The above process, I argue, fairly obviously gives the right answer. We start with a guess of q, then guess z based on that, which gives us a new guess of q, and so on. We want to make sure this matches the formal definition of the EM algorithm.

To begin with, the technical definition of the maximum likelihood estimate: start with a likelihood function, which is a function of the parameter q defined by the probability of some outcome according to that parameter. Here it is convenient to go back to thinking in terms of the number of heads that came up: let’s say there were n trials, and in k_z of them the first coin came up heads. The outcome variable now is k_z, the total number of heads, which is a sufficient statistic for the parameter q. That is, we don’t need to know exactly which of the trials came up heads, just the total number of trials that did. The likelihood function for q is then just determined by the binomial distribution:

L(q;k_z) = P(k_z;q) = \left( \begin{array}{l}n\\k_z\end{array} \right) q^{k_z}(1-q)^{(n-k_z)}

The maximum likelihood estimate (MLE) maximises this function: \hat{q}(k_z)=\arg \max_{q} L(q;k_z). The easy way to maximise it is to first note that the value of q that maximises L(q;k_z) also maximises the log-likelihood \log L(q;k_z), because the log function increases whenever its argument increases. Thus we can just differentiate the log-likelihood function to get:

\begin{array}{l l}  \log L(q;k_z) &= \log\left( \begin{array}{l}n\\k_z\end{array} \right) + k_z\log(q) + (n-k_z)\log(1-q) \\   \frac{d}{dq} \log L(q;k_z) &= \frac{k_z}{q} - \frac{n-k_z}{1-q} \\  &= \frac{k_z-nq}{q(1-q)}  \end{array}

Set this equal to 0 and re-arrange to get q=k_z/n as the MLE (the observant reader will notice that this is not actually sufficient to prove that k_z/n is a maximum; completing the argument is left as an exercise, and for now it will do). This MLE, \hat{q}(k_z)=k_z/n, is what we earlier called the obvious “guess” for q when the outcome values z are known.
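If you would rather convince yourself numerically than complete the calculus, a quick grid search over q does the job (the counts n and k_z here are arbitrary, chosen just for illustration):

import numpy as np

n, k_z = 10, 4                      # e.g. 4 heads observed in 10 tosses
q_grid = np.linspace(0.001, 0.999, 999)
# the binomial coefficient does not depend on q, so it can be dropped
loglik = k_z * np.log(q_grid) + (n - k_z) * np.log(1 - q_grid)
print(q_grid[np.argmax(loglik)])    # approximately k_z / n = 0.4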

Now, the expectation-maximization algorithm, as defined on the Wikipedia page, says that we first need to define a sort-of pseudo-likelihood function Q, based on our current guess of the parameter, namely:

Q(q;q^t) = \mathrm{E}_{k_z|x_1,x_2,q^t}[\log L(q;k_z)]

That is, we find the expectation of the log-likelihood function according to our current guess q^t and known observations, x_1 and x_2. The EM algorithm proceeds by finding the value of q that maximises Q. Notice that in Q we are taking the expectation relative to the existing guess of q — i.e. relative to q^t, but the log-likelihood function is in terms of the “free” variable q (which can be different to q^t).

Finding the q that maximises this is actually pretty easy: note that the expectation operation \mathrm{E} is linear, i.e. \mathrm{E}[A+B]=\mathrm{E}[A]+\mathrm{E}[B], which means that we can differentiate Q by just differentiating inside of the expectation, and we’ve already differentiated the log-likelihood function above, so:

\frac{d}{dq}Q(q;q^t) = \mathrm{E}_{k_z|x_1,x_2,q^t}[\frac{k_z-nq}{q(1-q)}]

Setting to 0 and using linearity of expectation again:

\begin{array}{r l}   \mathrm{E}_{k_z|x_1,x_2,q^t}[\frac{k_z-nq}{q(1-q)}] &= 0\\  \mathrm{E}_{k_z|x_1,x_2,q^t}[k_z-nq]\frac{1}{q(1-q)} &= 0\\  \mathrm{E}_{k_z|x_1,x_2,q^t}[k_z-nq] &= 0\\  \mathrm{E}_{k_z|x_1,x_2,q^t}[k_z]-\mathrm{E}_{k_z|x_1,x_2,q^t}[nq] &= 0 \\  \mathrm{E}_{k_z|x_1,x_2,q^t}[k_z]-nq &= 0 \\  q &= \mathrm{E}_{k_z|x_1,x_2,q^t}[k_z]/n  \end{array}

Thus the next estimate for q, being the value that maximises Q(q;q^t), is the expected number of heads on the first coin, given our current estimate q^t and the values of x_1 and x_2 that we saw, divided by n. It should be fairly easy to see that this is the same thing as the average value of the right-hand column in the tables above.
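To connect this back to the tables, here is a quick numerical check (reusing posterior_z and the data arrays from the sketches above) that maximising Q(q;q^t) really does land on \mathrm{E}[k_z]/n:

q_t = 0.1
w = posterior_z(x1, x2, q_t)              # P(Z_i = 1 | x_i, q^t): the right-hand column
n = len(w)
q_grid = np.linspace(0.001, 0.999, 999)
# Q(q; q^t) up to a constant that does not depend on q:
# E[k_z] log q + (n - E[k_z]) log(1 - q), with E[k_z] = sum of w
Q = w.sum() * np.log(q_grid) + (n - w.sum()) * np.log(1 - q_grid)
print(q_grid[np.argmax(Q)])               # ~0.154, i.e. E[k_z | x, q^t] / n
print(w.sum() / n)                        # the same value computed directly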

I won’t go through the proof that EM converges to the right answer (i.e. MLE) but it is out there in various forms on Wikipedia and references therein. What we’ve seen here is an intuitive example of it, and some discussion of how that example corresponds to the formal EM algorithm.

In most other examples like this, you assume that p_0 and p_1 are also unknown and need to be estimated. This, I find, overcomplicates things as an introductory example. Having seen how it works for one parameter, it should now be fairly obvious how to extend it to all the parameters. The guessing procedure for q doesn’t change, except that you also have to use the guesses p_0^t and p_1^t when you apply Bayes’ theorem to estimate the hidden values z. Having done that, it’s fairly straightforward to come up with new estimates for all three parameters.
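As a rough sketch of what that might look like (function names are mine, the M-step for p_0 and p_1 is the standard weighted-average update rather than anything derived in this post, and x1 and x2 are assumed to be numpy arrays as in the earlier sketches):

def posterior_z3(x1, x2, q, p0, p1):
    """P(Z=1 | X1, X2) when p0 and p1 are also only current guesses."""
    lik1 = p1**(x1 + x2) * (1 - p1)**(2 - x1 - x2)
    lik0 = p0**(x1 + x2) * (1 - p0)**(2 - x1 - x2)
    return lik1 * q / (lik0 * (1 - q) + lik1 * q)

def em_all(x1, x2, q, p0, p1, eta=1e-3, max_iter=200):
    """EM updates for all three parameters from some initial guesses."""
    for _ in range(max_iter):
        w = posterior_z3(x1, x2, q, p0, p1)                         # E-step
        q_new = w.mean()                                            # M-step for q, as before
        p1_new = (w * (x1 + x2)).sum() / (2 * w.sum())              # weighted heads rate when z=1
        p0_new = ((1 - w) * (x1 + x2)).sum() / (2 * (1 - w).sum())  # weighted heads rate when z=0
        if max(abs(q_new - q), abs(p0_new - p0), abs(p1_new - p1)) < eta:
            return q_new, p0_new, p1_new
        q, p0, p1 = q_new, p0_new, p1_new
    return q, p0, p1

One thing worth noting: with all three parameters unknown the model is symmetric in the labels of z, so depending on the starting point the algorithm may converge with the roles of p_0 and p_1 swapped.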

Observations

Looking at this, a couple of interesting things strike me.

1. Convergence

The first is the form of the update equation for q, namely:

q^{t+1} = \mathrm{E}_{k_z|x_1,x_2,q^t}[k_z]/n

Notice that the expectation is conditioned not just on the previous parameter guess q^t but also on the observed data x_1 and x_2. If we didn’t do this we would just have \mathrm{E}_{k_z|q^t}[k_z], which is clearly just q^t n. That is, if we did

q^{t+1} = \mathrm{E}_{k_z|q^t}[k_z]/n

then we would always have q^{t+1}=q^t, i.e. no change on each iteration. The interesting thing is that this is precisely what we want at convergence: for q^t to stop changing. This implies that what the EM algorithm does (and hence what the MLE does) is find the value of q^t such that

\mathrm{E}_{k_z|x_1,x_2,q^t}[k_z] = \mathrm{E}_{k_z|q^t}[k_z]

i.e. such that the expectation is “independent” of the descendant observed variables x_1 and x_2. This makes an intuitive kind of sense to me: think of X_1 and X_2 as causal descendants of Z. Since Z happened “first” (remember we toss the first coin to get Z before tossing the second coin to get X_1 and X_2), it makes sense that the outcomes X_1 and X_2 should only tell us something about the number of heads on the first coin if our current guess of q is wrong.
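This fixed-point property is easy to check numerically with the earlier sketch (reusing posterior_z and the data arrays x1 and x2, where p_0 and p_1 were known):

# at the converged value the conditional and unconditional expectations agree...
print(posterior_z(x1, x2, 0.375).mean())   # 0.375 again: no further change
# ...but away from it the observed x_1, x_2 still pull the guess around
print(posterior_z(x1, x2, 0.1).mean())     # roughly 0.154, the first update from q^0 = 0.1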

2. Not Bayesian

I’ve seen it written that the EM algorithm implements Bayesian reasoning. It doesn’t; it just involves Bayes’ theorem. Using Bayes’ theorem is not (necessarily) Bayesian. For it to be Bayesian, there should be some kind of probability distribution defined on the parameter q, usually both a prior (before collecting the data) and a posterior P(q|X), and there isn’t, at least not in the formulation we’ve used here. Any kind of MLE is (usually) not Bayesian. That said, I’m sure there are Bayesian variants of EM. It’s just best not to get confused. At least the Wikipedia page gets it right in this case.

EDIT: I think I originally forgot about the denominator q(1-q) that you should get in the derivative of the log-likelihood function. This disappears when you set the derivative equal to 0, so it does not really matter. Having fixed this I am moderately sure my derivations in the middle section are correct.


5 Comments to “Intuitive example of expectation maximization”

  1. Can you add the curves representing N(z=heads)/(N(z=heads)+N(z=tails)) (i.e. the ML estimate of q when you know the z series) to the graph of the algorithm’s convergence?

    • As in, the mean z up to t not the value at t=100.

      • Not sure what you’re asking. So my plot shows q^t against t, do you mean

        1/t \sum_{i=0}^{t} q^i

      • Sorry, I think I see what you are asking. I think you are asking for the ML estimate of q based on observation of z (rather than the estimate from the EM algorithm). I think you mean how does the “normal” MLE based on observation of z change as data points are added, but the t axis on these graphs does not represent the number of data points included, it represents the number of iterations in the EM algorithm – as such none of the other estimates vary at all with t.
