I’m going to submit this to Oculus-Share, but for now you can download both the Oculus and Non-Oculus version here.

**Note:** one thing I should mention: when the program loads, or when you reset (right mouse button), the controls are zeroed on the direction you are looking. If you’re looking in an unusual direction, the demo will be hard to control.

After trying out a number of demos and games for the Rift, I came to some conclusions about what makes a good virtual experience: passive control, a simple but interesting environment, multisensory output, things that look cool in 3D, and being trippy without being too disorienting. I tried to follow these principles as best I could…

**Passive control**: Using a mouse or keyboard to control motion can be disorienting, as there is an unnatural disconnect between movement in the visual field and proprioceptive and vestibular sensation. This is always going to be a problem, but the move-where-you-look system I implemented minimises the disconnect – though not so much that it stops being fun to do barrel rolls and loops.

**Interesting but simple environment**: The cubes swarm, which is interesting to watch, but the swarming alone needed some more behaviour. I originally wanted the cubes to run away when you shouted into the microphone, but instead they get “scared” when you move quickly near them. The more scared they are, the more yellow they turn and the louder the noise they make. Detailed environments are still a bit much for the Rift – the spatial and temporal resolutions are too low. Some of the best demos are the ones where nothing is textured and there are just bold, plain shapes.

**Multisensory output**: Sound, movement and colour should be correlated. I was inspired by Greebles and Rez to have some coordinated sound – though I admit Rez is far cooler, but not as cool as it would be if it were controlled in the same way and worked with the Rift.

**Things that look cool in 3D**: Basically, things spinning and whizzing about in front of your face.

**Not too intensely disorienting**: Everything needs to be slow. If everything is fast the experience can be overwhelming, and not in a pleasant way. This is mainly because of the Rift’s difficulties with blurring (which they are fixing). The lack of a global frame of reference doesn’t seem to be too much of a problem.

**… and some unexpected but good outcomes**: The swarming algorithm uses the Boid rules (I was going to call it Cube-Boids at one point, or worse, QBoids). One of the rules is “alignment”, whereby nearby cubes face the same direction as each other. The way I coded this, quite unthinkingly, means that not only do the cubes move in the same direction, they must also have the same orientation. This leads to the cubes spinning on their axes. You can have two swarms (of any size, including 1) moving in the same direction but with the cubes in one swarm rotated about the direction of movement – they look like they are moving in unison, but when the swarms merge the cubes suddenly spin around. The alignment can seem very deliberate, and the cubes seem full of purpose.
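For the curious, an alignment rule that matches full orientation rather than just heading can be sketched roughly as follows. This is hypothetical illustration code (made-up names and representation), not the demo’s actual implementation:

```python
import numpy as np

def alignment_step(orientations, neighbours, rate=0.1):
    """Nudge each cube's full orientation (a 3x3 rotation matrix)
    towards the average orientation of its neighbours.

    orientations: (N, 3, 3) array of rotation matrices
    neighbours: neighbours[i] is a list of indices of cubes near cube i
    """
    new = orientations.copy()
    for i, nbrs in enumerate(neighbours):
        if not nbrs:
            continue
        # blend towards the neighbours' mean rotation matrix ...
        blend = (1 - rate) * orientations[i] + rate * orientations[nbrs].mean(axis=0)
        # ... then project back onto the rotation group via SVD, which
        # gives the nearest proper rotation matrix
        u, _, vt = np.linalg.svd(blend)
        r = u @ vt
        if np.linalg.det(r) < 0:  # keep it a rotation, not a reflection
            u[:, -1] *= -1
            r = u @ vt
        new[i] = r
    return new
```

Because the rule acts on whole rotation matrices, two swarms can share a heading but differ by a roll about it, and on merging each swarm’s cubes spin round to agree.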

---


This is the result of iterating, many times over, the coupled logistic maps $x$ and $y$:

$$x_{n+1} = f(x_n) + \gamma_{xy}(y_n - x_n)$$

$$y_{n+1} = f(y_n) + \gamma_{yx}(x_n - y_n)$$

where

$$f(u) = r\,u(1 - u)$$

and the $\gamma$ are coupling parameters. $\gamma_{yx}$ is set to 0.02 everywhere; the x-axis of the above plot follows $\gamma_{xy}$ varying between 0.0 and 0.35.

After initialising $x_0$ and $y_0$ randomly in (0,1), you get chaotic results. The y-axis is values of $x_n$, and the colour of each pixel represents the frequency / probability of an $(x_n, y_n)$ combination at that point.
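For illustration, here is a rough CPU sketch of how such a density picture can be computed (the original used PyCUDA – see the repo below – and I am assuming a simple diffusive coupling form, which may differ from the post’s actual equations):

```python
import numpy as np

def coupled_density(r=3.9, gamma_yx=0.02, n_gamma=200, n_bins=200,
                    n_transient=100, n_iter=2000, seed=0):
    """Density-coloured picture for two coupled logistic maps.

    Each column of the image corresponds to one swept coupling value
    (the horizontal axis) and is a histogram of the visited x values,
    so pixel brightness approximates visit frequency. One coupling is
    held fixed at 0.02, the other is swept from 0.0 to 0.35.
    """
    rng = np.random.default_rng(seed)
    image = np.zeros((n_bins, n_gamma))
    for j, gamma_xy in enumerate(np.linspace(0.0, 0.35, n_gamma)):
        x, y = rng.random(2)
        for i in range(n_transient + n_iter):
            x, y = (r * x * (1 - x) + gamma_xy * (y - x),
                    r * y * (1 - y) + gamma_yx * (x - y))
            # crude guard: coupling can push the state out of [0, 1],
            # where the bare logistic map diverges
            x, y = min(max(x, 0.0), 1.0), min(max(y, 0.0), 1.0)
            if i >= n_transient:
                image[int(x * (n_bins - 1)), j] += 1
    return image / n_iter  # each column is a frequency histogram
```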

You can also zoom in on a section to get more detail.

Rough and ready python code for this here:

https://github.com/jthorniley/pycuda_logmap

---

The blue points are the set of points I started with, and the black lines show the edges of the Voronoi tessellation. Doing a planar tessellation is quite easy, but I wanted to do it on the surface of a sphere. This is conceptually simple too, but the algorithm was really annoying to debug. So, to save other people the same frustrations, I thought I’d post my Python class.

**How it works**

The dual tessellation of the Voronoi tessellation is the Delaunay tessellation, which you can perform on a sphere by simply taking the convex hull of the points in 3D Cartesian coordinates. It turns out that finding the dual is quite easy, as the points defining the Voronoi tessellation (the Voronoi vertices) are just the normals to the faces of the convex hull. So, essentially, the procedure is *(1)* get the convex hull using a ready-made function, *(2)* find the normals to the faces of the hull, *(3)* assign the normals (= Voronoi vertices) to the correct initial points. The only hard part is figuring out what the ready-made function gives you and using it correctly – mainly making sure all the faces are oriented correctly.

Here is an example. Start with the vertices of an octahedron: taking the convex hull gives you the Delaunay triangulation. The points of the dual (a cube) lie in the direction of the normal to each face from the origin (if it is a unit sphere then they are exactly the normalised normal vectors). There is one face for each starting point.

**The code**

""" Voronoi Tesseltion on a sphere """ import operator from numpy import dot, cross, sqrt, sum, array, arctan2 from scipy.spatial import ConvexHull def list_dict_add(dic, key, val): if dic.has_key(key): dic[key].append(val) else: dic[key]=[val] class SphericalVoronoi: """ Object representing the Voronoi tesselation on a sphere. Takes cartesian coordinates of points in 3D as input. Fields ------ * vertices - vertices of the voronoi diagram * dual_vertices - original points (potentially re-ordered) * faces - voronoi regions for each input point * dual_face_indices - index of triangulation, normals all aligned * hull - scipy.spatial.ConvexHull object used for triangulation Note: If the convex hull of the points has faces that are not triangular it will give duplicate faces. """ def __init__(self, points): points = array(points) if not points.shape[1] == 3: raise ValueError("Points must be spcified in cartesian coordinates") points = (points.T/sqrt(sum(points**2,1))).T self.hull = ConvexHull(points) self.dual_vertices = self.hull.points[self.hull.vertices,:] # find face normals (voronoi verts) and map to the original points face_index = 0 self.dual_vertex_to_face_index = {} vertices = [] self.dual_face_indices = [] for tri in self.hull.simplices: tri_points = self.dual_vertices[tri,:] normal = cross(tri_points[1,:]-tri_points[0,:], tri_points[2,:]-tri_points[0,:]) normal /= sqrt(sum(normal**2)) # we have to make sure the triangles have outward facing normals # or the points will appear at the polar opposite point com = sum(tri_points,0) if dot(com, normal) < 0: vertices.append(-normal) else: vertices.append(normal) # store the index of the normal in a list for each face/orig. 
point for dual_index in tri: list_dict_add(self.dual_vertex_to_face_index, dual_index, face_index) face_index+=1 self.vertices = array(vertices) # store face coordinates faces = [] for dual_index in self.hull.vertices: face_inds = self.dual_vertex_to_face_index[dual_index] facepoints = self.vertices[face_inds,:] faces.append(facepoints) # Order the faces faces_ordered = [] for face in faces: # find centre of mass npoints = face.shape[0] com = sum(face,0)/npoints # find the maximum direction to use a a projection (fast and sufficent) index, value = max(enumerate(abs(com)), key=operator.itemgetter(1)) inds = [i for i in range(3) if not i==index] com2d = com[inds] points2d = face[:,inds] points2d_minus_com = points2d - com2d # find order angles = arctan2(points2d_minus_com[:,0], points2d_minus_com[:,1]) indices = sorted(range(npoints), key=lambda k: angles[k]) ordered = face[indices,:] # This will have ensured they are consistently either clockwise # or anticlockwise, but we don't know which # check order direction relative to normal using first three points order_const = dot(com, cross(ordered[0,:]-com, ordered[1,:]-com)) if order_const < 0: ordered = ordered[::-1,:] faces_ordered.append(ordered) self.faces = faces_ordered def plot(self, ax=None, push_out=True, triangulation=True): """ Plot the tesselation. 
triangulation=False - Show the triangulation used ax=None - Give an Axes3D to draw on push_out=True - Move the faces away from the center slightly """ if ax is None: from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt fig = plt.figure() ax3 = fig.add_subplot(111, projection='3d') x,y,z = zip(*self.vertices) ax3.scatter(x,y,z, color='k') for face in self.faces: if push_out: face += 0.05*sum(face,0) x,y,z = zip(*face) x,y,z = map(lambda l: l + (l[0],), (x,y,z)) ax3.plot(x,y,z,color='k') if triangulation: x,y,z = zip(*self.dual_vertices) ax3.scatter(x,y,z, color='r') for face_inds in self.hull.simplices: pts = self.dual_vertices[face_inds,:] if push_out: pts += 0.05*sum(pts,0) x,y,z = zip(*pts) x,y,z = map(lambda l: l + (l[0],), (x,y,z)) ax3.plot(x,y,z,color='r') if ax is None: plt.show() if __name__ == "__main__": # Examples from numpy import random, concatenate, eye # octahedron points = concatenate((eye(3), -eye(3))) sv = SphericalVoronoi(points) sv.plot() # Random points points = random.randn(20,3); sv = SphericalVoronoi(points) sv.plot(push_out=False, triangulation=False)

---

Mixing is a property of dynamical systems whereby the state of the system in the distant future cannot be predicted from its initial state (or any given state a long way in the past). This is pretty much the same as the kind of mixing you get when you put milk in a cup of tea and swirl it around: obviously when you first put the milk in, it stays roughly where you put it, but in time it spreads out evenly. The even spread of the milk will be the same no matter where you put the milk in originally. More formally, if

$$\rho_0(x)$$

is a “distribution” or density function of where the “particles” of milk are when you have just put them in the tea, and

$$\rho_t(x)$$

is the distribution after $t$ seconds, then “mixing” is formally defined as

$$\lim_{t \to \infty} p(x_t, x_0) = \rho_t(x_t)\,\rho_0(x_0)$$

where $p(x_t, x_0)$ is the joint distribution of the state at time $t$ and the state at time 0.

You don’t have to think of these distributions as probability distributions, but I find it easier if you do. For those who know probability, it is obvious that the above says the distribution of milk after a long time is probabilistically independent of its distribution at the start.

In cups of tea, this happens (mostly) because of the “random” Brownian motion of the milk (possibly enhanced by someone swirling it with a spoon).

The most obvious type of “non-mixing” system is one that does nothing, or does something very predictable. For example, the map

$$x_{n+1} = x_n$$

or

$$x_{n+1} = x_n + c$$

will produce either the same value of $x$ forever or values increasing by a known constant amount, respectively. So obviously for the first one, $x_n$ is just the same as all the previous values of $x$ and you would have no problem making predictions. Likewise, assuming you know how much time has passed (and we are assuming you do), the value of $x$ in the future is clearly predictable on the basis of what it started off as.

In chaotic systems you *can* get mixing, but you don’t *have* to. I find this fact a little surprising in itself – when you think of chaos you think of “sensitive dependence on initial conditions” and unpredictability of the future states as being pretty much the definition of chaos. And it is: usually you do get mixing. However, there are some interesting limitations.

The first kind of non-mixing chaotic system, which is more obvious, is something like the Arnold cat map. There is a great little simulator here which is worth playing around with. You start with an image of a cat (it doesn’t have to be a cat, but that’s what people use), then apply a simple rule which warps the image and maps the result back onto the original space. You get a warped picture of a cat. Apply the same rule a few more times and you get what appears to be a totally mixed up picture of the cat, as if it had just been warped randomly (but remember it hasn’t – the warping rule is actually deterministic). The weird thing that happens is if you keep iterating the map, then *eventually* the original cat image comes back. This much later state is an exact reconstruction of the original cat image. Clearly this *isn’t* mixing then: this later state is exactly the same as the initial one, so is fully predictable.

Well, you could argue that this means the cat map isn’t *really* chaotic: it’s a periodic system with a really, really long and complex cycle. I still think it’s interesting, but OK, you don’t have to view it as chaos. For this reason I naively assumed that “simpler” chaotic systems (ones without periodic cycles) must therefore be mixing. I was wrong.

One of the simplest chaotic systems to play with is the logistic map, obtained by picking a starting value $x_0$ and using this iteration to get a sequence:

$$x_{n+1} = r x_n (1 - x_n)$$

The value $r$ is a parameter that you can vary between 1 and 4 to get different behaviours. For low values of r, x settles down to a constant value. As you increase r, you first get a periodic oscillation, and for r above around 3.5 you get some chaotic behaviour. If you’ve played with this before, you’ve probably seen this characteristic diagram of what happens:

(This image comes from the wikipedia page and has a public domain creative commons license).

The plot shows that for low values of r like 2.8, the x sequence sits at one value all the time (after an initial transient in the first few steps – these have been discarded to make the plot). For r=3.2 (for example) you can see that the x sequence jumps between two values. At around r=3.55 you start to get a chaotic system – this is why the plot “spreads out” vertically: the x sequence jumps around all sorts of values within the grey region as the map is iterated. There are nice “islands of periodicity” at a few higher values of r, then as you approach r=4 you get the most uniform chaotic behaviour.

So first, let’s see how the logistic map *does* do mixing. Pick a high, very chaotic value of r – I will go with 3.9. First, pick a random starting value, then iterate 50 times to get rid of the “transient” at the start. If you then iterate for another 10000 steps you get an accurate picture of the *stationary distribution* of x. (By stationary distribution, I just mean the distribution you get by estimating probabilities from the proportion of time x spends in a given state.) For example, just plotting a histogram of the x values gives this:
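The recipe just described (random start, discard the transient, then record a long run) is only a few lines of Python – a minimal sketch, not the post’s own notebook code:

```python
import numpy as np

def stationary_sample(r=3.9, n=10000, transient=50, seed=1):
    """Approximate samples from the logistic map's stationary
    distribution: start at random, iterate past the transient,
    then record n successive values."""
    rng = np.random.default_rng(seed)
    x = rng.random()
    for _ in range(transient):       # discard the transient
        x = r * x * (1 - x)
    xs = np.empty(n)
    for i in range(n):
        x = r * x * (1 - x)
        xs[i] = x
    return xs

# plt.hist(stationary_sample(), bins=100) then plots the histogram
```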

Now, note that I have 10000 values of x. Let’s imagine that instead of that being a sequence drawn from one system, I start 10000 new systems using each of these values as the starting value $x_0$. For this new set of 10000 logistic maps, the above histogram represents $\rho_0(x)$ – it’s the distribution of $x_0$ values across all the systems.

If I iterated all of these 10000 maps at the same time for a large number of steps, I would get a new distribution $\rho_t(x)$, where $t$ is the number of steps. Remember that the mixing property says that $p(x_t, x_0) = \rho_t(x_t)\,\rho_0(x_0)$ provided that $t$ is large. So, let’s pick a fairly big value of $t$, say 100, and iterate all 10000 maps 100 times. This allows us to work out the new distribution $\rho_{100}(x)$.

Here is a key idea: *if the system is mixing then we can get the 2D histogram for the joint distribution $p(x_{100}, x_0)$ by just multiplying the individual histograms for $\rho_0$ and $\rho_{100}$.* This should be pretty obvious from the definition of mixing. So, here is what you get if you just multiply the two probability distributions. Remember, this is the 2D *histogram to expect if the system is mixing.*

So we can see if the system *is* mixing by plotting the actual 2D histogram of $(x_0, x_t)$. Here’s what that *actual* joint distribution looks like for various values of $t$.
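Both histograms – the actual joint and the product expected under mixing – can be computed like this (a sketch reimplementation, not the original notebook code):

```python
import numpy as np

def mixing_histograms(r=3.9, t=100, n=10000, bins=50, seed=1):
    """Run n logistic maps in parallel and compare the actual joint
    histogram of (x_0, x_t) with the product of the marginals, which
    is what the joint should look like if the system is mixing."""
    rng = np.random.default_rng(seed)
    x0 = rng.random(n)
    for _ in range(50):                    # discard the transient
        x0 = r * x0 * (1 - x0)
    xt = x0.copy()
    for _ in range(t):                     # iterate all n maps t steps
        xt = r * xt * (1 - xt)
    edges = np.linspace(0.0, 1.0, bins + 1)
    joint, _, _ = np.histogram2d(x0, xt, bins=[edges, edges], density=True)
    p0, _ = np.histogram(x0, bins=edges, density=True)
    pt, _ = np.histogram(xt, bins=edges, density=True)
    product = np.outer(p0, pt)             # the joint expected under mixing
    return joint, product
```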

I like this series of plots; it shows how the chaotic behaviour works. You can see that for small values of t the distribution gets “complexified”, as the different systems move in slightly different ways due to the differences in where they started off. But you can also see that by t=100 the system is totally mixed: the joint distribution looks pretty much exactly like what you would expect under mixing.

OK, so we conclude the logistic map *is* mixing, as expected. But what about other values of r? In particular, at r=3.6 you can see on the original diagram that there are two very distinct grey regions of “chaos” with a blank bit in the middle. If we plot the stationary distribution for r=3.6 we get this:

Note the two very separate regions of non-zero probability. We can go through the same steps as before: initialise 10000 logistic maps according to this distribution, then plot the joint distribution *we would expect under mixing:*

Obviously the two distinct sections multiply with each other to create four quadrants of non-zero probability in the product distribution. Now, look at the *actual* joint distribution:

The system doesn’t mix — clearly the final distribution is different to the one expected under mixing. In fact you can see that for r=3.6 the logistic map is a weird mix of chaotic and periodic. Notice how the bottom row of maps swap over in each iteration. If a system is in the lower region of x values on one step, then it will swap to the higher region on the next step, and back to the lower region on the step after. Of course this means that there is a certain amount of information about the future state of the system that you never lose: if the system started with a value of x such as 0.4 (i.e. in the lower region), you will always know that on odd steps its x value will be greater than 0.7, and on even steps it will be less than 0.7. But it *is* still chaotic: you don’t know exactly what this value will be: just which of the two regions it must be inside. Within the regions though, the dynamics are still chaotic and unpredictable.

Finally, to measure this more exactly, we can measure how much the system “deviates” from mixing with the mutual information between the starting values of the 10000 logistic maps and the values $t$ steps later, written $I(x_0; x_t)$. This value is 0 if and only if $x_0$ and $x_t$ are probabilistically independent. So if the system is mixing, the mutual information should be zero for large values of $t$. Greater values of mutual information mean that the two variables are “more dependent” (probabilistically) on each other.

Notice how both curves decrease at the start, because the chaotic nature of the systems means that some of the starting-state information is inevitably lost. But while the mutual information goes down to zero for r=3.9 (mixing), it levels off at a positive value for r=3.6.
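As an aside, the simplest (if biased) way to estimate mutual information from samples is a plug-in estimate on a 2D histogram. This is a sketch of that idea, not the Kraskov et al. nearest-neighbour estimator that was actually used for the plots:

```python
import numpy as np

def plugin_mi(x, y, bins=30):
    """Crude plug-in estimate of the mutual information I(x; y) in
    bits, from a 2D histogram of the samples. (Biased upwards for
    finite samples; the Kraskov estimator behaves better.)"""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                  # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y
    nz = pxy > 0                               # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```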

One final thing: like I say, the logistic map for r=3.6 is “like” a mixture of chaos and periodicity, but note that it is *not actually* periodic like the cat map, even with very long periods. The values of x never come back to exactly where they were; it is only periodic at a coarse-grained level where you look at just the two regions.

Note: the mutual information was calculated using the method from Kraskov, Stoegbauer and Grassberger here: http://arxiv.org/abs/cond-mat/0305641.

My python code as ipython notebook, html

---

It’s funny because it should be obvious, but actually distinguishing between small and far away based on visual information is slightly tricky to explain.

I think predictive coding, or something like that, perhaps offers at least some element of an explanation. (Jellymatter has another post about broadly similar principles). Basically, you have to combine your sensory data with a lot of “prior” information, knowledge and assumptions. Cows, as you probably know, are quite “big” in the physical sense that they are the same order of magnitude as people. So if a cow only takes up a small amount of space when its image appears on your retina, this suggests one of two possible explanations:

1. This cow is unusually small

2. This cow is far away

Naturally, 2 also relies on the knowledge that big things “look” small if they are far away.

Then, you need to work out which is more likely, or exclude one of the possibilities from other information. For example, if a cow is currently being picked up in someone’s hand, then that suggests the cow is small as well, reducing the probability that the cow is in fact far away.
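Working out “which is more likely” is just Bayes’ rule. Here is a toy version with entirely made-up numbers, purely to illustrate the structure of the inference:

```python
# Toy Bayesian version of the small-vs-far-away inference, with entirely
# made-up numbers. Both hypotheses explain the observation "the cow's
# image on the retina is small" equally well, so the priors decide.
priors = {"cow is unusually small": 0.01, "cow is far away": 0.50}
likelihood = {"cow is unusually small": 0.9, "cow is far away": 0.9}

evidence = sum(priors[h] * likelihood[h] for h in priors)
posterior = {h: priors[h] * likelihood[h] / evidence for h in priors}
# posterior["cow is far away"] is about 0.98: "far away" wins on priors
```

Extra information, like seeing the cow held in someone’s hand, would change the likelihoods and flip the conclusion.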

Here is another example:

In the Ames room, the small person is not really “small”, they are far away. But it doesn’t look that way. The explanation usually offered is that your brain is mixing in the prior knowledge that rooms have a certain shape, and the combination of that assumption and the size of the people relative to the size of the room distorts your perception of how big the people are.

My question is: what exactly is it about your prior knowledge of the size of a room that makes it “override” your prior knowledge of the size of people? You know that two people can’t possibly be that different in size. You also know that they don’t change size as they walk around rooms. So how can your brain fail to work out what’s happening? Why don’t you see the room as a funny shape? Intuitively, a strangely shaped room is much easier to conceive of than people changing size before your eyes. I think most people, with hardly a moment’s thought, immediately realise that there must be some kind of “illusion”, even if they don’t know how it works. Moreover, even once you know how the Ames room illusion is done, it doesn’t stop working – you still see one person as unusually small and the other as unusually big.

---

**The Spaces**

The first two spaces we printed were chosen by me, partly because I think they are theoretically interesting, and partly because, unlike some other colour spaces, they are finite-sized 3D objects.

The first is a theoretically predicted uniform colour space (a bit like the human L*a*b* space) where Euclidean distances between the points within it (colours) correspond to a bee’s ability to tell the difference between those colours. The shape shows the locus of the colours of all possible lights in terms of such a perceptually uniform coordinate system (which happens to lie inside a cube).

It’s a good shape.

The bottom corresponds to black, and the top corresponds to white. It is biased away from blue, towards green and UV. If you view it from above it is very similar to (though more correct than) the honeybee hexagon space used in bee colour studies.

The second shape is the object colour solid for bees. This is a particular kind of colour solid that shows the locus of the relative proportions to which the different types of cone cells are excited by all theoretically possible reflecting surfaces. Some combinations of excitations are impossible because the sensitivities of the cone cells overlap, as they do in humans:

This space also lies within a cube, but for different reasons: basically, there is a maximum achievable value (a 100% reflecting surface). This shape has nice symmetry properties (it is centrally symmetric) and has the rounded parallelepiped shape characteristic of the object colour solid for all species in all conditions (though other species may require a different number of dimensions). It is displayed much like the space above: black at the bottom, white at the top.

It is also a good shape.

We also printed some tetrachromatic chromaticity spaces, which are also 3D. We had trouble with the human tetrachromat space, but an idealisation of tetrachromat chromaticity spaces (a 3-sphericon) was quite successful … as were the other sphericons … they roll funny!

**Printing**

This was my first experience of 3D printing. We were using RepRaps.

The first thing you think when you see 3D printing is “I can print anything!”, but what you should be thinking about is resolution, tolerances, and “can I print this without having to lay down filament in thin air?”.

Printing the actual shapes was pretty straightforward: we just made them in two halves and glued them together. We made them nice and smooth by suspending them in acetone vapour (a good solvent for ABS) – well, in one case, by accidentally dropping the whole thing in a vat of acetone (which actually worked very well, probably because it was still attached to a bit of wire and I fished it out immediately).

The purple frames were printed in a different material, PLA. Each is made from six (different) individual parts glued into two halves of a cube, which you can take apart. They were a bit of a pain: because of the steps made by the layering, it was fairly hard to make the parts meet up and glue nicely. But a really nice thing about 3D printing is that you can easily make your own tools, so I “quickly” printed a “glue jig” that I could use to hold the parts in the right place.

In hindsight, it was a bit big. Given that the parts were glued together by “dead reckoning”, they came out really nicely.

---

First, the purpose of the EM algorithm is to analyse a model with variables which you know exist but which are missing from the data. It allows you to find (as best as possible) what those missing values are, as well as the parameters of the whole model. That is, we take some measured variables $x$, some unobserved variables $z$, and a parameter vector $\theta$, and calculate the expected values of $z$ and the maximum likelihood estimates of the parameters, written $\hat{\theta}$. The maximum likelihood estimate is a “guess” of the parameter values based on some observed data.

However, for this example, we will begin by ignoring the technical definition of “maximum likelihood estimate”, and go through a simple example in which the EM algorithm is essentially just the “obvious” way to get to the “right” guess.

Let’s start by assuming that we know what the model is, then build something to infer it. To begin with, let’s take the go-to trivial model of statistics: flipping a coin. Let’s say that the random variable $z$ represents the coin toss, taking the value 1 if the coin comes up heads and 0 otherwise. Thus if we performed a sequence of coin tosses we would get an output like:

z = 0,0,1,0,1,1,0,1,0

For the sake of argument. I find it easier to write the above than

T,T,H,T,H,H,T,H,T

but they obviously mean the same thing. One reason that the first version is nicer is that the obvious “guess” for the probability that the coin comes up heads, based on the data observed, is just the average value of the outcome variable:

$$\hat{\theta} = \bar{z}$$

The overbar on $\bar{z}$ means the average value. We could equally write $\hat{\theta} = h/n$, where $h$ is the number of times the coin came up heads and $n$ is the total number of tosses. This will be useful later, but we will stick with the $\bar{z}$ formulation for now. It should be obvious that these two values are the same thing.

In statistical *inference*, we assume that the $z$ values are drawn from an underlying distribution belonging to the random variable $Z$ – or, alternatively, that there is some underlying process producing the $z$ values which determines the probability that they take different values. In the binary coin-toss example here, the only parameter we need is the probability that the coin will come up heads, which we will call $\theta$. Thus we have the distribution function:

$$p(z \mid \theta) = \theta^z (1 - \theta)^{1-z}$$

This should all be pretty obvious. It is also (I argue) pretty obvious that, by some reasonable standard, $\bar{z}$ is the best guess you could make for $\theta$. That is, the estimate based on the data, $\hat{\theta}$, is in this case simply the average value of the outcome – or, alternatively, the proportion of the time that the coin comes up heads. This guess – the average value – is also, technically speaking, the maximum likelihood estimate, but like I say, for now there is no need to worry about what that means.

Now on to the real example. Suppose we have this one coin toss $z$, and if it comes up tails ($z = 0$) we perform two more coin tosses on a totally different coin, which has probability $p_0$ of coming up heads; otherwise (if $z = 1$) we do the second two tosses on yet another coin, which has probability $p_1$ of coming up heads. The two tosses of the second coin we will call $x_1$ and $x_2$. Visualise it like this:

Or more compactly just in terms of the variables:

The model gives us these probabilities:

$$P(z = 1) = \theta$$

$$P(x_i = 1 \mid z = 0) = p_0, \qquad P(x_i = 1 \mid z = 1) = p_1$$

Let’s consider a model where $\theta = 0.3$, $p_0 = 0.25$ and $p_1 = 0.75$. If we “simulate” these coin tosses ten times with the computer’s random number generator we might get an output like this:

$x_1$ | $x_2$ | $z$
---|---|---
1 | 0 | 0
0 | 0 | 1
1 | 1 | 0
0 | 1 | 0
1 | 1 | 1
0 | 0 | 0
1 | 0 | 1
0 | 1 | 0
1 | 0 | 0
0 | 0 | 0

Now, let’s imagine that we already know the biases of the two secondary coins, i.e. we already know $p_0$ and $p_1$, but we don’t know the bias of the original coin, $\theta$. Obviously, if we had all the above data, we would just guess $\hat{\theta} = \bar{z}$ as before – the information about how the other coins came up is pretty irrelevant. That is, the good guess for $\theta$ is the average of the column on the right. However, let’s suppose that we weren’t shown the outcome of the first coin toss:

$x_1$ | $x_2$ | $z$
---|---|---
1 | 0 | ?
0 | 0 | ?
1 | 1 | ?
0 | 1 | ?
1 | 1 | ?
0 | 0 | ?
1 | 0 | ?
0 | 1 | ?
1 | 0 | ?
0 | 0 | ?

Now there is no column on the right to average. Here is where the EM algorithm comes in, and we do something that is a little bit bizarre but also makes total sense. Let’s guess a value of $\theta$ totally at random. This isn’t a good guess, but it’s a guess. We’ll call the current guess $\hat{\theta}_1$. Arbitrarily, let’s pick $\hat{\theta}_1 = 0.1$. Given this guess and the data that we *do* know, namely $x_1$ and $x_2$, we can also make a guess of the average value that $z$ should have taken. This comes from Bayes’ theorem:

$$P(z = 1 \mid x_1, x_2) = \frac{P(x_1, x_2 \mid z = 1)\,\theta}{P(x_1, x_2 \mid z = 1)\,\theta + P(x_1, x_2 \mid z = 0)\,(1 - \theta)}$$

We substitute in the current guess $\hat{\theta}_1$ for $\theta$. So, for example, if both of the secondary coin tosses came up heads (1), then we calculate:

$$P(z = 1 \mid 1, 1) = \frac{p_1^2\,\hat{\theta}_1}{p_1^2\,\hat{\theta}_1 + p_0^2\,(1 - \hat{\theta}_1)} = \frac{0.75^2 \times 0.1}{0.75^2 \times 0.1 + 0.25^2 \times 0.9} = 0.5$$

Thus we can re-do the table with these new guesses for $z$, which take a different value depending on the results for $x_1$ and $x_2$, but we just use the above formula in all cases:

$x_1$ | $x_2$ | Guess of $z$ assuming $\hat{\theta}_1 = 0.1$
---|---|---
1 | 0 | 0.100000
0 | 0 | 0.012195
1 | 1 | 0.500000
0 | 1 | 0.100000
1 | 1 | 0.500000
0 | 0 | 0.012195
1 | 0 | 0.100000
0 | 1 | 0.100000
1 | 0 | 0.100000
0 | 0 | 0.012195

But now that we’ve filled in the table with guesses for $z$, we can calculate another guess for $\theta$ by just averaging the right-hand column again. This gives a new guess, which we’ll call $\hat{\theta}_2$, of approximately 0.154. And now that we have $\hat{\theta}_2$, we could guess the values of $z$ again, and because $\hat{\theta}_2$ is different to $\hat{\theta}_1$, we will get a different right-hand column, leading to another guess $\hat{\theta}_3$. This is the expectation-maximization algorithm: we just keep iterating this process. If we do this we get a converging sequence of guesses for $\theta$:
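The whole loop is only a few lines of Python. This is a sketch reimplementation (not the post’s code), assuming secondary-coin biases of 0.25 and 0.75, which are the values consistent with the posterior guesses in the table:

```python
def em_coin(x_pairs, p0=0.25, p1=0.75, theta=0.1, tol=0.001):
    """EM for the hidden first coin: the E-step guesses each z from
    Bayes' theorem, the M-step averages the guesses to update theta.
    Returns the whole sequence of guesses."""
    def posterior_z(x1, x2, theta):
        s = x1 + x2
        like1 = p1**s * (1 - p1)**(2 - s)   # P(x1, x2 | z = 1)
        like0 = p0**s * (1 - p0)**(2 - s)   # P(x1, x2 | z = 0)
        return theta * like1 / (theta * like1 + (1 - theta) * like0)

    history = [theta]
    while True:
        # E-step: expected value of z for each trial
        guesses = [posterior_z(x1, x2, theta) for x1, x2 in x_pairs]
        # M-step: the new theta is just the average of the guesses
        new_theta = sum(guesses) / len(guesses)
        history.append(new_theta)
        if abs(new_theta - theta) < tol:
            return history
        theta = new_theta

# the ten (x1, x2) pairs from the table
data = [(1, 0), (0, 0), (1, 1), (0, 1), (1, 1),
        (0, 0), (1, 0), (0, 1), (1, 0), (0, 0)]
```

Starting from a guess of 0.1, the first update gives roughly 0.154, matching the averaged column above.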

We can stop iterating when the algorithm converges, e.g. when $|\hat{\theta}_{i+1} - \hat{\theta}_i| < \epsilon$, where $\epsilon$ is some small value – in this case I used 0.001.

The correct value of $\theta$ was 0.3, and the algorithm has converged at about 0.375, so it hasn’t done too well here, but this was a very small sample. Here’s what happened when I tried again with 100 samples:

The dot-dash red line in the middle is the true value of $\theta$. The dashed black line is the guess for $\theta$ we would make if the values of $z$ were available to us (i.e. the mean value of $z$), and the solid lines are the output of the EM algorithm from five randomly chosen starting points. Note that they all converge to the same estimate as each other, but the estimate is not necessarily right on the correct value (there is still some sampling error), nor is it the same as the estimate based on knowledge of $z$.

**Formalization as maximum likelihood**

The above process, I argue, fairly “obviously” gives the right answer. We start with a guess of $\theta$, then guess $z$ based on that, which gives us a new guess of $\theta$, and so on. We now want to check that this matches the formal definition of the EM algorithm.

To begin with, the technical definition of the maximum likelihood estimate: start with a likelihood function, which is a function of the parameter, defined by the probability of some outcome according to that parameter. Here it is convenient to go back to thinking in terms of the number of heads that came up: let’s say there were $n$ trials, and in $h$ of them the first coin came up heads. The outcome variable now is $h$ – the total number of heads – which is a *sufficient statistic* for the parameter $\theta$. That is, we don’t need to know exactly which of the trials gave a heads, just the total number that did. The likelihood function for $\theta$ is then just determined by the binomial distribution:

$$L(\theta; h) = \binom{n}{h} \theta^h (1 - \theta)^{n - h}$$

The maximum likelihood estimate (MLE) maximises this function: it is argmax_q L(q). The easy way to maximise this is to first note that the value of q that maximises L(q) also maximises the log-likelihood log L(q), because the log function increases whenever its argument increases. Thus we can just differentiate the log-likelihood function to get:

d/dq log L(q) = h/q - (n - h)/(1 - q) = (h - nq) / (q(1 - q))

Set this equal to 0 and re-arrange to get h/n as the MLE (the observant reader will notice that this is not actually sufficient to prove that it is the maximum; completing the argument is left as an exercise, for now it will do). This MLE, h/n, is what we just called before the obvious “guess” for q when the outcome values are known.
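As a quick numerical sanity check (a throwaway snippet of my own, not from the post), we can confirm that h/n maximises the binomial log-likelihood by evaluating it on a grid:

```python
from math import comb, log

def binom_log_lik(q, n, h):
    # log of C(n, h) * q^h * (1 - q)^(n - h)
    return log(comb(n, h)) + h * log(q) + (n - h) * log(1 - q)

n, h = 40, 13
grid = [i / 1000 for i in range(1, 1000)]  # avoid q = 0 and q = 1
best = max(grid, key=lambda q: binom_log_lik(q, n, h))
# the maximiser on the grid sits at h / n = 0.325
```

The grid point with the highest log-likelihood is exactly h/n, as the derivative argument predicts.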

Now, the expectation-maximization algorithm as defined by the Wikipedia page states that we first need to define a sort-of pseudo-likelihood function Q(q | q_t), based on our current guess q_t of the parameter, namely:

Q(q | q_t) = E[log L(q) | X1, X2, q_t]

That is, we find the expectation of the log-likelihood function according to our current guess q_t and the known observations X1 and X2. The EM algorithm proceeds by finding the value of q that maximises Q(q | q_t). Notice that in Q(q | q_t) we are taking the expectation *relative to the existing guess of q* — i.e. relative to q_t — but the log-likelihood function is in terms of the “free” variable q (which can be different to q_t).

Finding the q that maximises this is actually pretty easy: note that the expectation operation is linear, i.e. E[aX + bY] = aE[X] + bE[Y], which means that we can differentiate Q(q | q_t) by just differentiating inside of the expectation, and we’ve already differentiated the log-likelihood function above, so:

d/dq Q(q | q_t) = E[(H - nq) / (q(1 - q)) | X1, X2, q_t]

Setting this to 0 and using linearity of expectation again:

q_(t+1) = E[H | X1, X2, q_t] / n

Thus, the next estimate for q, being the value that maximises Q(q | q_t), is the expected number of heads on the first coin (according to our current estimate q_t and the values of X1 and X2 that we saw), divided by the number of trials. It should be fairly easy to see that this is the same thing as the average value of the right-hand column in the tables above.

I won’t go through the proof that EM converges to the right answer (i.e. MLE) but it is out there in various forms on Wikipedia and references therein. What we’ve seen here is an intuitive example of it, and some discussion of how that example corresponds to the formal EM algorithm.

In most other examples like this, you assume that the biases of the other coins are also unknown and need to be estimated. This, I find, overcomplicates things as an introductory example. Having seen how it works for one parameter, it should now be fairly obvious how to extend it to do all the parameters. The guessing procedure for q doesn’t change, except that you also have to use the current guesses of the other biases when you apply Bayes’ theorem to get the hidden values of Z. Having done that, it’s fairly straightforward to come up with new estimates for all three parameters.

**Observations**

Looking at this, a couple of interesting things strike me.

**1. Convergence**

First is the form of the update equation for q, namely:

q_(t+1) = E[H | X1, X2, q_t] / n

Notice that the expectation is conditioned not just on the previous parameter guess q_t but also on the observed data X1 and X2. If we didn’t condition on the data we would just have E[H | q_t] / n, which is clearly just q_t. I.e. if we did

q_(t+1) = E[H | q_t] / n

Then we would always have q_(t+1) = q_t, so no change on each iteration. The interesting thing is that this is precisely what we want at convergence: for q to stop changing. This implies that what the EM algorithm does (and hence what the MLE does) is find the value of q such that

E[H | X1, X2, q] = E[H | q]

i.e. such that the expectation is “independent” of the descendant observed variables X1 and X2. This makes an intuitive kind of sense to me: think about X1 and X2 as causal descendants of Z. Since Z happened “first” (remember we toss the first coin to get Z before tossing the second coin to get X1 and X2), it makes sense that the outcomes X1 and X2 should only tell us something about the number of heads on the first coin if our current guess of q is *wrong*.

**2. Not Bayesian**

I’ve seen it written that the EM algorithm implements Bayesian reasoning. It doesn’t; it just involves Bayes’ theorem, and using Bayes’ theorem is not (necessarily) Bayesian. For it to be Bayesian, there should be some kind of probability distribution defined on the parameter q, usually both a *prior* P(q) (before collecting the data) and a *posterior* P(q | X1, X2), and there isn’t — at least not in the formulation we’ve used here. Any kind of MLE is (usually) not Bayesian. That said, I’m sure there are Bayesian variants of EM. It’s just best not to get confused. At least the Wikipedia page gets it right in this case.

*EDIT: I originally forgot about the denominator q(1 - q) that you should get in the derivative of the log-likelihood function, I think. This disappears when you set the derivative equal to 0, so it does not really matter. Having fixed this, I am moderately sure my derivations in the middle section are correct.*


**Some notes…**

- Rotations happen *in planes*, not *around axes*. In 3D, being in a plane and being perpendicular to an axis amount to the same thing, but this is not generally true. In n dimensions there are n(n-1)/2 coordinate planes of rotation, and we can decompose any rotation into a combination of rotations in each of these planes.
- *There is no unique solution.* For a given vector there is a set of rotations that do not change it (all rotations in planes perpendicular to it). One can obtain another solution by additionally rotating in any plane that is orthogonal to the “target” vector.
- *It is easy to rotate a vector to (1, 0, 0, …)*, and this is how the algorithm here works. First we find the rotation to (1, 0, 0, …) for both the “starting” vector and the “target” vector. If these rotations are R2 (for the start) and R1 (for the target), then the rotation from “start” to “target” is given by R1^-1 R2. This is computationally simple, as they are rotation matrices and their inverse is equal to their transpose, so the result is just R1^T R2.
- *To rotate to (1, 0, 0, …)* we can do the following. If our vector is (v1, v2, …, vn), then we find the rotations one coordinate plane at a time: rotate the vector in one plane, then find the rotation for the next plane. In other words, find the matrix that rotates the second component into the first, then the one that rotates the third component into the first, and so on, until we have zeros everywhere except in the first dimension. The overall rotation matrix is the product (in order) of these individual rotations.
- *Numerical functions are not 100% accurate*, and errors are compounded in this algorithm. In the implementation below they will generally be of the order of machine precision (around 10^-16), but the higher the dimension, the bigger the error. It would not be hard to implement this so that it works with a symbolic algebra package (sympy, for example) and produces exact solutions.
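The per-plane step in the last two bullets can be tried in isolation. This standalone sketch (the function name is mine) applies a single plane operation, laid out the same way as `rotation_matrix_inds` in the code below, and moves all of the weight shared between coordinate 0 and coordinate i into coordinate 0:

```python
import numpy as np

def zero_component(v, i):
    """Apply one plane operation that zeroes component i of v by
    moving its weight into component 0; same 2x2 block layout as
    rotation_matrix_inds below."""
    angle = np.arctan2(v[0], v[i])
    s, c = np.sin(angle), np.cos(angle)
    m = np.eye(len(v))
    m[0, 0], m[0, i] = s, c
    m[i, 0], m[i, i] = c, -s
    return m @ v
```

For example, `zero_component(np.array([3.0, 0.0, 4.0]), 2)` gives `[5.0, 0.0, 0.0]` (up to rounding): the length is preserved and the chosen component is zeroed.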

**Code…**

```python
from numpy import sin, cos, arctan2, dot, eye

def rotation_matrix(vector, target):
    """
    Rotation matrix from one vector to another target vector.

    The solution is not unique, as any additional rotation
    perpendicular to the target vector will also yield a solution.
    However, the output is deterministic.
    """
    R1 = rotation_to_pole(target)
    R2 = rotation_to_pole(vector)
    return dot(R1.T, R2)

def rotation_to_pole(target):
    """ Rotate to 1,0,0... """
    n = len(target)
    working = target
    rm = eye(n)
    for i in range(1, n):
        angle = arctan2(working[0], working[i])
        rm = dot(rotation_matrix_inds(angle, n, 0, i), rm)
        working = dot(rm, target)
    return rm

def rotation_matrix_inds(angle, n, ax1, ax2):
    """
    'n'-dimensional rotation matrix 'angle' radians in coordinate
    plane with indices 'ax1' and 'ax2'
    """
    s = sin(angle)
    c = cos(angle)
    i = eye(n)
    i[ax1, ax1] = s
    i[ax1, ax2] = c
    i[ax2, ax1] = c
    i[ax2, ax2] = -s
    return i
```

This part verifies it and demonstrates the accuracy.

```python
if __name__ == "__main__":
    from numpy import random, linalg, arccos, sqrt

    def short_vec_print(v):
        return ("[" + ("%.4g, " * len(v))[:-2] + "]") % tuple(v)

    # Verify rotation_to_pole function
    for i in range(8):
        v = random.randn(i)
        length = linalg.norm(v)
        rm = rotation_to_pole(v)
        vrot = dot(rm, v)
        print("to_pole:", short_vec_print(vrot), "(length %g)" % length)

    # Verify rotation_matrix function
    for i in range(2, 8):
        v = random.randn(i)
        t = random.randn(i)
        len_v = linalg.norm(v)
        rm = rotation_matrix(v, t)
        vrot = dot(rm, v)
        cosang = dot(t, vrot) / sqrt(dot(vrot, vrot) * dot(t, t))
        print("\nrotation_matrix:", short_vec_print(v), "to", short_vec_print(t))
        print("  length of rotated relative to original:", len_v / linalg.norm(vrot))
        if cosang > 1:
            print("  *** Rounding error of", cosang - 1, "***")
            ang = 0
        else:
            ang = arccos(cosang)
        print("  cos angular difference:", cosang, "(%g radians)" % ang)
```


The idea of rationality is often thought of as consistency within a collection of behaviours or beliefs. For example, if I think it is absolutely wrong to eat meat, I should think it is wrong to eat beef. As beef is a meat, it would be irrational for me to think it was wrong to eat meat but also believe that it is OK to eat beef. Similarly, it would be irrational to think that eating meat was absolutely wrong, and then eat a plate of steak. To be rational, my behaviour should also match what I think.

How do we know when we are being irrational? Asking about the rationality of other people seems easier at first. When we observe others, we can spot things that they ought not be doing if they were rational. For example, they might claim to be vegetarian whilst eating a steak. According to most people, I would say, what they are doing is inconsistent. But this doesn’t mean that they do not have a perfectly consistent way of thinking that accounts for their actions and it is us, the majority, who have failed to grasp it. *How do you know that someone else is being irrational, not you yourself?*

I don’t know how it could be possible to tell. We cannot absolutely rule out the possibility of our own irrationality. But while we cannot rule it out, we can sometimes positively detect it: we have explicit evidence of “irrational moments” from our own experience. We realise that what we are doing is not what we think we should be doing, or we suddenly realize that this or that belief does not match up with others which we hold. We have a direct, conscious appreciation of it. Sometimes, at least.

Oh, shit! What am I saying!

One would have to be *insane* to experience a moment of irrationality without the slightest attempt to “fix” it. In fact, I would go as far as to say that the experience of being irrational is indistinguishable from some attempt to remedy it. Having an “irrational moment” means understanding that you *ought* to be thinking or doing something differently. We have little else to judge our irrationality by other than our own experience of these moments. Sure, we can make models of behaviour and hold people up to those standards. Sure, we could, if we wanted, deem all belief in Gods irrational. But our decision to do so would be founded on our own personal experience of our capacity to err and adjust in the direction of subjective consistency. Such models are informed by our own experience of what rationality is.

So, from the point of view of our own experience the only time when we are irrational is when we are attempting to make our thoughts or actions more consistent. Such irrationality is something we can observe in others – people exclaim it, and we can sense it in body language and manner. When someone realizes that they are doing something not quite right by their own standards, we can see it in their face and hear it in their words and tone.

Given all this, how should we interpret a person’s claim that someone is being irrational? It is one of two things: either they are just saying that they think the other is wrong, or they are saying something far less pompous. They’re saying, in effect, that the other person has the capacity to experience an irrational moment in some particular context. *They can undergo a valuable transformation.* I consider the latter to be the correct way of speaking about irrationality, because outside of a highly polemic debate (and even in one) the former serves little purpose.

Understanding that you are mistaken is a powerful thing. It signals the bringing about of profound and useful changes in thought and action. Being informed that you are being irrational can, at its best, lead to this kind of powerful experience, one which anyone (sane) would welcome. As social animals, we can also use it to deepen our understanding of each other by highlighting which thoughts and actions we consider consistent, bringing our beliefs and actions into alignment with those of others. But claims of irrationality are often only worthwhile among friends who share a roughly similar outlook. If there is no empathy or mutual understanding – if one cannot comprehend the experience of the other sufficiently – then one cannot say what will induce the personal transformation implicit in realizing one’s irrationality.

Without acquaintance, a claim of irrationality leads only to disappointment. If someone without due skill or empathy calls someone irrational (or implies it), one of two things goes wrong. Either the accused perceive them as overbearing and are completely put off, or they give the benefit of the doubt only to be disappointed by their accuser’s inability to provide the promised experience. One only needs to type “creationism” into Google to observe this in action.

Ultimately, we can only judge the rationality of others with ourselves as the standard, and all we can tell about it is learned from those places where we have experienced its failure. We usually consider this experience to be a valuable one, and with skill and empathy we can guide others towards such experiences. Telling someone that they are irrational can be a powerful tool for personal and social transformation, but without kindness and respect it becomes either impotent or fierce and alienating.


One reason for the weaker than expected results was the higher number of younger students taking GCSE papers. The JCQ figures showed a 39% increase in the number of GCSE exams taken by those aged 15 or younger, for a total of 806,000.

This effect could be seen in English, with a 42% increase in entries from those younger than 16. While results for 16-year-olds remained stable, JCQ said the decline in top grades “can, therefore, be explained by younger students not performing as strongly as 16-year-olds”.

Newspapers seem to get worried, whenever educational results come out, that there might be some dreadful societal decline going on, and that any change in educational outcomes might be a predictor of the impending collapse of civilisation. This alternative explanation of reduced age is therefore quite interesting, and I thought it would be worth trying to analyse it formally to see if it stands up.

First, I think we can draw a graph, essentially a causal Bayes net, that looks like this:

The “decline of civilisation” is perhaps a bit hyperbolic; a less exaggerated alternative cause might be “falling educational standards” or something like that. Whatever it is, it is something that we aren’t sure of. The point is to try and use the GCSE results to work out whether it is happening or not.

On the other hand, the ages of people taking exams is clearly known quite precisely, as is the actual outcome measure, so these aren’t controversial. The other thing that isn’t particularly controversial is the *arrow* from “decline of civilisation” to “GCSE grades” — while we don’t know if the decline itself is happening or not, we can be fairly sure (I think) that if it did, it would tend to produce worse GCSE grades. That is to say, *all else equal*, falling standards increase the probability that students will fail their GCSEs.

The first interesting question is whether the arrow from age to GCSE grades should be there. I.e., can we also say that, all else equal, younger students get worse GCSE grades?* It seems plausible, but is there any empirical evidence? The quote above in a roundabout way tells us that age and GCSE grades were correlated in this year’s data, because it said that the average results go down if you include all ages rather than looking only at a particular age (namely 16). This is a roundabout way of saying that grades (G) are statistically dependent on age (A):

P(G | A) ≠ P(G)

This, according to Reichenbach’s *principle of common cause* tells us one of two things:

- Age affects GCSE grades; *or*
- There is a common cause of age and GCSE grades.

(There is a third theoretical possibility that the GCSE grades someone gets causally affect their age when they take them, but I’m going to exclude this on temporal grounds — your age is determined before you get your GCSE results).

So in order to be sure that Age causes GCSE grades, we need to exclude the second possibility, that there is a common cause.

It might be worth pointing out here that when we say “age” as a variable, we mean “the age at which people take GCSEs”, not people’s age in general, which is presumably “caused” by their birth date and the passage of time. The age at which they take GCSEs, however, is caused by factors such as their own decisions, educational and school policies, etc. This means that it may even (we don’t know) be caused by the very same “decline in standards” we have been talking about, but I’ll leave out that possibility for now.

So the alternative “common cause” graph (that we want to rule out) looks like this:

I think we can rule this out by the following logic: we know not only that age is correlated with GCSE grades, but that the statistical dependence seems to have remained exactly the same over two years. The article said not only that conditioning on entrants’ age affected the overall GCSE grades this year, but that the grades for 16-year-olds were exactly the same year on year. However, we also know that the age distribution of entrants *did* change year-on-year. Since everything is caused by something, this, I think, strongly implies that the causes of the age distribution changed year on year. It isn’t a stretch to think that the hypothesised common causes here therefore also changed year on year. But if the age-grade correlation depended on a common cause, we would *not* expect the statistical dependence to remain exactly the same when the common cause changed. This is because age isn’t just dependent on the common cause, but presumably also has some of its own exogenous causes (far left of the diagram), which means it won’t be perfectly correlated with its common causes.

Here is another way to put it: we haven’t measured the common causes, but if the above graph is right, then if we did know the common causes C, they would “screen off” the age-grade (A-G) correlation. I.e.

P(G | A, C) = P(G | C)

Like I say, we haven’t measured the common causes, but we do have good reason to believe that they would be different for the two years of data: in year one they took some unknown value C1, and in year two some other value C2. The statement from the article is then effectively saying that what we have determined empirically is this:

P(G | A, C1) = P(G | A, C2)

If the screening-off relationship is true, then this would mean

P(G | C1) = P(G | C2)

But if C1 and C2 are different, and they really are causes of G, then this should not happen: our definition of cause is that altering the causal variable changes the probability of its effect. It makes far more sense if it is A that screens-off C from G (i.e. A is the direct cause of G, not simply correlated by virtue of a common cause).
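To make this concrete, here is a toy calculation (entirely my own construction, with made-up noise levels) showing that under the common-cause graph, shifting the distribution of C shifts the age-conditioned grade distribution P(G = 1 | A = 1), which is precisely what the data did not show:

```python
def p_g_given_a1(p_c, eps=0.1):
    """P(G=1 | A=1) in a toy common-cause model:
    C ~ Bernoulli(p_c); A and G each copy C with error rate eps.
    The 0.1 error rate is an arbitrary illustrative choice."""
    num = den = 0.0
    for c in (0, 1):
        pc = p_c if c else 1 - p_c      # P(C = c)
        pa = (1 - eps) if c else eps    # P(A = 1 | C = c)
        pg = (1 - eps) if c else eps    # P(G = 1 | C = c)
        num += pc * pa * pg             # accumulates P(A = 1, G = 1)
        den += pc * pa                  # accumulates P(A = 1)
    return num / den

year1 = p_g_given_a1(0.3)  # common causes at one level (C1)
year2 = p_g_given_a1(0.7)  # common causes shifted (C2)
# year1 and year2 differ: under this graph, a change in C would have
# shown up in the age-conditioned grade distribution
```

By contrast, if A directly caused G, P(G | A) would be fixed by the A-to-G mechanism and would not move when the causes of A changed.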

I’m not 100% sure about this argument. But let’s say it will do for now. This justifies our original graph. Once we have that, the only thing we don’t know on the original graph is the value of the “decline of civilisation” variable D. I think we can rule out any change in D quite easily if this graph is correct, because here we can see that age A does not (according to the graph) screen off decline D from grades G. I.e. it should generally be the case that, for different values D1 and D2,

P(G | A, D1) ≠ P(G | A, D2)

But if the decline in year 1 was D1 and in year 2 it was D2, we have seen:

P(G | A, D1) = P(G | A, D2)

Since there is no screening-off relationship, the only obvious way to explain this is that D1 and D2 are the same. Remember that D1 and D2 are some value representing educational standards that we can neither (directly) control nor measure. Thus the “conditioning” on D has been done “naturally” in the data (whereas conditioning on A was done by the choice to split the data by age).

There could be many problems with this line of reasoning. Also, there may be other plausible graphs that would fit with the data and not exclude the possibility of changing standards. So if anyone has any other ideas in the comments…

* EDIT: Just to add to this: it is obvious from the data that younger students *did* get lower GCSE grades, but this data does not fulfil the *all else equal* condition – it is pretty certain that all else was not equal (there are all sorts of confounding factors with age – they may have gone to different types of schools, etc.). The point is to work out whether the claim that age causes a change in grades is reasonable. Note, though, that we can’t easily do a randomized controlled trial (and if we tried, I would bet it would be pretty useless and highly unethical), so the question is whether any causal conclusions can be drawn without trying to randomize the age at which people take GCSEs.
