## Visualizing the mutual information and an introduction to information geometry

For a while now I have had an interest in information geometry. The maxims that geometry is intuitive maths and information theory is intuitive statistics seem pretty fair to me, so it’s quite surprising to find a lack of easy-to-understand introductions to information geometry. This is my first attempt: the idea is to get a geometric understanding of the mutual information and to introduce a few select concepts from information geometry.

Mutual information

Most people who have heard of it will know that the mutual information is a measure of how correlated two variables are. It is usually written as:

$MI(X,Y)=-\sum_x \sum_y p(x,y) \log \frac{p(x)p(y)}{p(x,y)}$

but it can also be written as a kind-of-distance-measure (a Kullback-Leibler divergence) between a joint distribution $p(x,y)$ and a distribution in which the probabilities of each $x$ or $y$ are the same as in $p(x,y)$ but the variables $X$ and $Y$ are independent; in other words, the product of the marginal distributions $p(x)p(y)$. This distance measure, the Kullback-Leibler divergence between distributions $P$ and $Q$, is usually written $D_{KL}(P||Q)$ and is equal to (for discrete variables)

$D_{KL}(P||Q) = -\sum_x p(x) \log \frac{q(x)}{p(x)}$

Which allows us to write the mutual information as

$MI(X,Y)=D_{KL}(p(x,y)||p(x)p(y))$

with the sum in $D_{KL}$ being over both $x$ and $y$.
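As a quick sanity check, we can compute the mutual information both ways for a small example (the joint distribution below is a made-up example of mine):

```python
import math

# A made-up joint distribution over two binary variables.
p = {('A', 'B'): 0.55, ('A', '~B'): 0.05, ('~A', 'B'): 0.15, ('~A', '~B'): 0.25}

# Marginal distributions p(A) and p(B)
pA = {}
pB = {}
for (a, b), v in p.items():
    pA[a] = pA.get(a, 0.0) + v
    pB[b] = pB.get(b, 0.0) + v

def kl(P, Q):
    """Kullback-Leibler divergence D_KL(P || Q), in nats."""
    return sum(P[k] * math.log(P[k] / Q[k]) for k in P)

# Mutual information, directly from the double-sum definition
mi_direct = sum(v * math.log(v / (pA[a] * pB[b])) for (a, b), v in p.items())

# ... and as the KL divergence from the joint to the product of the marginals
product = {(a, b): pA[a] * pB[b] for (a, b) in p}
mi_as_kl = kl(p, product)

print(mi_direct, mi_as_kl)  # identical by construction
```

The two numbers agree exactly, because the second expression is just the first with the terms regrouped.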

For the example here I will look at the mutual information between two binary variables: $\mathcal{A}$, which can be either $A$ or $\neg A$, and $\mathcal{B}$, which can be either $B$ or $\neg B$. In this case there are only four different probabilities to consider; as these must sum to one, there are only three degrees of freedom and we can visualize the probability space in three dimensions.

The standard geometry bit

To begin with it is useful to introduce two coordinate systems. Firstly, one that is a ‘natural coordinate system’ in a sense that I will explain later (the $i$ in $\xi^i$ is a superscript index not a power):

$p(A,B) = \xi^1$
$p(A,\neg B) = \xi^2$
$p(\neg A,B) = \xi^3$
$p(\neg A,\neg B) = 1-\xi^1-\xi^2-\xi^3$

in this coordinate system $p(A) = p(A,B)+p(A,\neg B) = \xi^1+\xi^2$ and $p(B) = p(A,B)+p(\neg A,B) = \xi^1+\xi^3$. The other coordinate system that I will use is one for visualizing things in Cartesian coordinates:

$x = 1-\xi^1-\xi^2 = 1-p(A) = p(\neg A)$
$y = 1-\xi^1-\xi^3 = 1-p(B) = p(\neg B)$
$z = 1-\xi^2-\xi^3 = 1-p(A \neq B) = p(A = B)$
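The two coordinate systems are easy to play with numerically. A minimal sketch (the distribution values are my own example):

```python
# Natural coordinates (xi1, xi2, xi3) for a made-up joint distribution,
# and the corresponding Cartesian (x, y, z) used for visualization.
xi1, xi2, xi3 = 0.55, 0.05, 0.15          # p(A,B), p(A,~B), p(~A,B)
xi4 = 1 - xi1 - xi2 - xi3                 # p(~A,~B)

x = 1 - xi1 - xi2                         # = p(~A)
y = 1 - xi1 - xi3                         # = p(~B)
z = 1 - xi2 - xi3                         # = p(A = B)

# z really is the probability that the two variables agree
assert abs(z - (xi1 + xi4)) < 1e-12
print(x, y, z)
```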

$x$, $y$ and $z$ correspond to probabilities (as do the $\xi$s), so every distribution must lie inside the cube with edges $[0,1]$; but the space of probabilities is more constrained than this, and in fact lies within a tetrahedron made out of alternate corners of this cube.

The cases where the variables $\mathcal{A}$ and $\mathcal{B}$ are independent form a surface in the tetrahedron. We can get an equation for it from $p(A,B) = p(A)p(B)$; in the Cartesian coordinates we have:

$p(A)p(B) = p(A,B)$
$(1-x)(1-y) = \tfrac{1}{2}(1-x-y+z)$
$z = 1-x-y+2xy$

(using $p(A,B) = \xi^1 = \tfrac{1}{2}(1-x-y+z)$, which follows from inverting the coordinate definitions above)

For the natural coordinates we have:
$p(A)p(B) = p(A,B)$
$(\xi^1+\xi^2)(\xi^1+\xi^3) = \xi^1$
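We can verify numerically that every product distribution lands on this surface, in both coordinate forms (a sketch; the random sampling of marginals is my own choice):

```python
import random

# Every independent (product) distribution should land on the surface
# z = 1 - x - y + 2xy; the random marginals here are arbitrary.
random.seed(0)
for _ in range(100):
    pA, pB = random.random(), random.random()
    xi1, xi2, xi3 = pA * pB, pA * (1 - pB), (1 - pA) * pB
    x, y, z = 1 - xi1 - xi2, 1 - xi1 - xi3, 1 - xi2 - xi3
    # Cartesian form of the surface
    assert abs(z - (1 - x - y + 2 * x * y)) < 1e-12
    # Natural-coordinate form: (xi1 + xi2)(xi1 + xi3) = xi1
    assert abs((xi1 + xi2) * (xi1 + xi3) - xi1) < 1e-12
print("all product distributions lie on the surface")
```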

The surface looks like this:

There is also a set of lines formed by all the distributions that have the same marginal distributions ($p(A)$ and $p(B)$). These are lines that are parallel with the $z$-axis. Putting all this together we get a picture that looks like this:

Each line intersects the surface exactly once, which we can see when we look at it along the $z$-axis. You can see that there is an intersection for every possible value of $x$ ($=1-p(A)$) and $y$ ($=1-p(B)$).

Arc lengths (Euclidean distances between points)

The mutual information, as we will see, can be thought of as measuring a distance to the curved surface along one of the lines in $z$. This can be done in standard geometry, although it does not yield a standard information measure. The basic idea is that, to measure the length of a curve, we add up lots of little bits of the curve.

The picture shows the case of a 2D Euclidean geometry with a curve between $A$ and $B$ in a plane with coordinates $x$ and $y$. We can work out the length of the curve by integrating $ds$ along the curve – $||s|| = \int_A^B ds$. Another way of writing this is roughly:

$||s|| = \int_A^B \sqrt{dx^2 + dy^2}$

The standard thing to do is to write the coordinates of the curve as a function of some arbitrary measure ($t$) of how far along the curve we have gone; usually we make it so that the parameter is $0$ at the start and $1$ at the end:

$(x,y) = (x(t),y(t))$
$A = (x(0),y(0))$
$B = (x(1),y(1))$

Then we can write the integral as

$\int_A^B ds = \int_0^1 \sqrt{\left(\frac{\partial x(t)}{\partial t} dt\right)^2 + \left(\frac{\partial y(t)}{\partial t} dt\right)^2 } = \int_0^1 \sqrt{\left(\frac{\partial x(t)}{\partial t}\right)^2 + \left(\frac{\partial y(t)}{\partial t} \right)^2 } dt$

To make this look a bit more like differential geometry we can write the bit in the square root in a different way. Number $x$ and $y$ as dimensions $1$ and $2$ and give them new names $x = q^1$ and $y = q^2$ (the raised numbers are just there to say which one is which, i.e. $q^2$ does not mean ‘q squared’ but ‘q number 2’). Expressing the bit in the square root in these terms gives:

$\left(\frac{\partial x}{\partial t}(t) dt\right)^2 + \left(\frac{\partial y}{\partial t}(t) dt \right)^2 = \left(\frac{\partial q^1}{\partial t}(t)dt\right)^2 + \left(\frac{\partial q^2}{\partial t}(t)dt \right)^2$
$= \sum_i \sum_j \delta_{ij}\frac{\partial q^i}{\partial t}(t)\frac{\partial q^j}{\partial t}(t)dt^2 = \sum_i \sum_j \delta_{ij} dq^i dq^j$

I have introduced a new thing here, the $\delta$. This is basically the identity matrix: it takes the value $1$ when $i=j$ and $0$ when $i \neq j$. The benefit of doing this is that we can change the coordinates to something different and keep the distance the same by changing $\delta_{ij}$ for something else. For example, if we have new coordinates $(m^1, m^2)$ that are related to $(q^1, q^2)$ by $m^1 = (q^1)^2$ (read ‘q one squared’) and $m^2 = (q^2)^3$ (read ‘q two cubed’), then we replace $q$ with $m$ and $\delta_{ij}(q)$ with a new quantity $g_{ij}(m)$ which is given by (I won’t show the working):

$g_{1,1} = \frac{1}{4} (m^1)^{-1}$
$g_{2,2} = \frac{1}{9} (m^2)^{-\frac{4}{3}}$
$g_{1,2} = g_{2,1} = 0$

The quantity $g$ is known as the metric tensor (or the coordinate-based representation of it). Ultimately, it is the quantity that defines the distance between points in Riemannian geometry. When we change coordinates we have to change the value of $g$ so that distances stay the same. Really, it is the relationship between $g$ and the coordinates that defines distances. I’m going to skip how to do the coordinate transformations, but you can find a description here – it’s not particularly hard but I’ve waffled about this a bit too much already.

So, in general, the distance (arc-length) between two points can be written as:

$\int_A^B \sqrt{\sum_i \sum_j g_{ij}dq^idq^j}$ or $\int_A^B \sqrt{\sum_i \sum_j g_{ij}\frac{\partial q^i}{\partial t}\frac{\partial q^j}{\partial t}}dt$

Later, it will be useful if we call the bit in the square root $G(t)$ so that:

$\int_A^B \sqrt{G(t)} dt$
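A small numerical sketch (my own) of why the transformed metric is the right one: the straight segment $q^1 \in [1,2]$ with $q^2$ held constant has Euclidean length $1$, and integrating with $g_{11} = \frac{1}{4}(m^1)^{-1}$ in the $m$ coordinates gives the same answer:

```python
import math

# The segment q1 in [1, 2] (q2 held fixed) has Euclidean length 1.
# In the coordinates m1 = (q1)**2 the metric is g11 = 1/(4*m1), and the
# arc length must come out the same.
N = 100_000
length = 0.0
for k in range(N):
    t = (k + 0.5) / N                # midpoint rule on [0, 1]
    q1 = 1 + t                       # the curve, parametrized by t
    m1 = q1 ** 2                     # the same point in the m coordinates
    dm1_dt = 2 * q1                  # dm1/dt along the curve
    g11 = 1 / (4 * m1)               # transformed metric component
    length += math.sqrt(g11 * dm1_dt ** 2) / N

print(length)  # = 1, the same as the Euclidean length in q coordinates
```

Here the integrand works out to exactly $1$ at every point: the metric shrinks by precisely the factor that the coordinate stretch grows.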

The metric tensor and divergences

When the Kullback-Leibler divergence was first published, Kullback and Leibler noticed that, approximately (for small $\Delta\xi$), for distributions with parameters $\xi$, written here as $p(x;\xi)$:

$D_{KL}(p(x; \xi) || p(x; \xi + \Delta\xi)) \approx \frac{1}{2}\sum_i \sum_j f_{ij}(\xi) \Delta\xi^i \Delta\xi^j$

where $f_{ij}(\xi)$ is the Fisher information, it is a metric tensor (a valid $g_{ij}$) which can be calculated by:

$f_{ij}(\xi) = \sum_x p(x;\xi) \frac{\partial}{\partial \xi^i} \log{p(x;\xi)} \frac{\partial}{\partial \xi^j} \log{p(x;\xi)}$
or equally (though not obviously)
$f_{ij}(\xi) = - \sum_x p(x;\xi) \frac{\partial}{\partial \xi^i} \frac{\partial}{\partial \xi^j} \log{p(x;\xi)}$

This can be seen from the Taylor expansion, which is a bit of a pain to do but quite straightforward. Indeed, we can look at higher-order terms too:

$D_{KL}(p(x; \xi) || p(x; \xi + d\xi)) = \frac{1}{2}\sum_i \sum_j f_{ij}(\xi) d\xi^i d\xi^j + \frac{1}{6}\sum_i \sum_j \sum_k h_{ijk}(\xi) d\xi^i d\xi^j d\xi^k + O(|d\xi|^4)$
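Both expressions for the Fisher information, and the half-squared-step approximation of the divergence, can be checked numerically for a Bernoulli distribution (my own example; the derivatives are taken by finite differences):

```python
import math

# Bernoulli distribution p(x; xi): p(1) = xi, p(0) = 1 - xi.
# Its Fisher information is 1/(xi*(1-xi)); check both formulas by
# finite differences, plus the quadratic approximation of the divergence.
xi, h = 0.3, 1e-4
exact = 1 / (xi * (1 - xi))

# f = sum_x p * (d log p / d xi)^2, derivatives by central differences
dlog1 = (math.log(xi + h) - math.log(xi - h)) / (2 * h)            # x = 1
dlog0 = (math.log(1 - xi - h) - math.log(1 - xi + h)) / (2 * h)    # x = 0
f1 = xi * dlog1 ** 2 + (1 - xi) * dlog0 ** 2

# f = -sum_x p * d^2 log p / d xi^2
d2log1 = (math.log(xi + h) - 2 * math.log(xi) + math.log(xi - h)) / h ** 2
d2log0 = (math.log(1 - xi - h) - 2 * math.log(1 - xi) + math.log(1 - xi + h)) / h ** 2
f2 = -(xi * d2log1 + (1 - xi) * d2log0)

# D_KL(p(xi) || p(xi + d)) ~ f * d^2 / 2 for a small step d
d = 1e-3
kl_step = (xi * math.log(xi / (xi + d))
           + (1 - xi) * math.log((1 - xi) / (1 - xi - d)))
approx = 0.5 * exact * d ** 2

print(exact, f1, f2, kl_step, approx)
```

Both finite-difference estimates land on $1/(\xi(1-\xi))$, and the small divergence step matches the quadratic approximation to within the cubic correction.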

It turns out, for the Kullback-Leibler divergence (among others), that if we choose the coordinates $\xi$ correctly, then we can recover the divergence by integrating. Basically, we can do this if $h_{ijk} = \frac{\partial}{\partial \xi^k} g_{ij}$. There are a number of ways of checking this, which can be found in this book; the simplest is checking that:
$\frac{\partial}{\partial \xi^i}\frac{\partial}{\partial \xi^j} p(x;\xi) = 0$
or for the reverse Kullback-Leibler divergence:
$\frac{\partial}{\partial \xi^i}\frac{\partial}{\partial \xi^j} \log{p(x;\xi)} = 0$
for all $x$.

It’s like a half square arc-length

From the section on classical geometry, we can see that small arc-lengths $\Delta s$ are approximately given by:

$\Delta s \approx \sqrt{G(t)}\Delta t$

But the divergences are approximately given by

$\Delta D \approx \frac{1}{2}G(t)\Delta t^2$

So, the divergence is a kind of half-square-arc-length. But the actual squared arc-length is something different; we can see this by direct calculation:

$\frac{1}{2}s^2 = \frac{1}{2} \left( \int_A^B \sqrt{G(s)}ds \right) \left( \int_A^B \sqrt{G(t)} dt \right) = \frac{1}{2} \int_A^B \int_A^B \sqrt{G(s)G(t)} ds dt$

We can visualize this as an integral over a square with sides $s$ and $t$. If we represent the magnitude of $\sqrt{G}$ as the intensity of a greenness within the square, we can picture the area that we are integrating to get the square arc-length as:

Notice the symmetry in the pattern that is formed: it is symmetric about the axis where $s = t$. This makes it possible to write the half-square arc-length as an integral over a triangle:

$\int_{A \leq s \leq t \leq B} \sqrt{G(s)G(t)} ds dt$

Which is symmetric in the sense that it stays the same if we swap $s$ and $t$. We can view this integral in the same way as before.

The divergence, on the other hand, is calculated by (where $b$ is the parameter value at $B$):
$D = \int_{A \leq s \leq t \leq B} G(s) \, ds \, dt = \int_A^B (b-s) G(s) \, ds$
which is asymmetric: there is a ‘forwards’ and a ‘backwards’ integral. It looks like this:

The asymmetry of the divergence is well known. The breaking of this symmetry corresponds to the fundamental difference between information measures and normal geometric measures.
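A sketch of this (my own) for the Bernoulli family, whose Fisher metric in the mixture coordinate is $1/(s(1-s))$: the ‘forwards’ triangle integral recovers the KL divergence in one direction and the ‘backwards’ integral in the other:

```python
import math

# Bernoulli family in the mixture coordinate s: Fisher metric 1/(s*(1-s)).
def kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher(s):
    return 1 / (s * (1 - s))

a, b, N = 0.2, 0.6, 200_000
ds = (b - a) / N
fwd = bwd = 0.0
for k in range(N):
    s = a + (b - a) * (k + 0.5) / N          # midpoint rule
    fwd += (b - s) * fisher(s) * ds          # 'forwards' triangle
    bwd += (s - a) * fisher(s) * ds          # 'backwards' triangle

print(fwd, kl(b, a))  # forwards integral -> D_KL(p_b || p_a)
print(bwd, kl(a, b))  # backwards integral -> D_KL(p_a || p_b)
```

The choice of which corner of the square the weight $(b-s)$ or $(s-a)$ leans towards is exactly the forwards/backwards asymmetry described above.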

We are now at a point where we can use an information theoretic integral instead of a geometric one to calculate the distance between points. This method has (nearly) all the features of differential geometry, but the quantities are those of information theory.

Back to the Mutual Information

So, let us go back to the description of the mutual information. We know that it is the Kullback-Leibler divergence from the joint distribution to the point with the same $x$ and $y$ coordinates on the independence surface. But we can also say a little more. Lines parallel with the $z$-axis are straight lines in terms of divergences (unlike in normal geometry, this is not the same as the metric being constant along them). We can see this by checking that

$\frac{\partial}{\partial z} \frac{\partial}{\partial z} p(a,b; z) = 0$

for all $a$ and $b$. This is just the check I mentioned before. This holds because the $x$, $y$ and $z$ are linearly related to the probabilities. The result of this is that we can integrate Fisher information in a straight line along the $z$-axis in the Euclidean coordinates and get the Kullback-Leibler divergence.

In other words, for a probability distribution written in terms of $(x,y,z)$, we can write the mutual information as:

$MI = \int_{1-x-y+2xy}^z \int_s^z f(s) \, dt \, ds = \int_{1-x-y+2xy}^z (z-s) f(s) \, ds$

where the Fisher information metric along the line, $f(s)$, is found by summing $(\partial p/\partial s)^2/p$ over the four joint probabilities (each of which is linear in $s$ with slope $\pm\frac{1}{2}$):

$f(s) = \frac{1}{2}\left(\frac{1}{1-x-y+s} + \frac{1}{1-x+y-s} + \frac{1}{1+x-y-s} + \frac{1}{x+y+s-1}\right)$
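We can check this numerically (the point $(x, y, z) = (0.4, 0.3, 0.8)$ is an arbitrary example of mine); here $f(s)$ is computed directly as $\sum (\partial p/\partial s)^2/p$ over the four joint probabilities rather than from a closed form:

```python
import math

x, y, z = 0.4, 0.3, 0.8                      # an arbitrary example point

def probs(s):
    # The four joint probabilities along the line of constant x and y
    return [(1 - x - y + s) / 2, (1 - x + y - s) / 2,
            (1 + x - y - s) / 2, (x + y + s - 1) / 2]

def fisher(s):
    # dp/ds = +-1/2 for every outcome, so f = sum((1/2)^2 / p)
    return sum(0.25 / p for p in probs(s))

z0 = 1 - x - y + 2 * x * y                   # intersection with the surface

# integral of (z - s) * f(s) ds from z0 to z, midpoint rule
N = 200_000
ds = (z - z0) / N
integral = sum((z - (z0 + (k + 0.5) * ds)) * fisher(z0 + (k + 0.5) * ds) * ds
               for k in range(N))

# mutual information computed directly from the joint distribution at z
pAB, pAb, paB, pab = probs(z)
pA, pB = pAB + pAb, pAB + paB
mi = sum(v * math.log(v / m) for v, m in
         [(pAB, pA * pB), (pAb, pA * (1 - pB)),
          (paB, (1 - pA) * pB), (pab, (1 - pA) * (1 - pB))])

print(integral, mi)  # the two agree
```

The integral along the $z$-line from the independence surface reproduces the directly computed mutual information (in nats).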

The fact that the straight lines for the divergence are straight lines in the Cartesian space is important. In a very real sense, we are only comparing the joint distribution ($p(a,b)$) with others with different $z$ value (different $p(A=B)$ – different degrees of independence – but with the same value of $p(A)$ and $p(B)$). Roughly, we could do anything to the geometry of the space that isn’t on the $z$ line and the divergence would be unaffected.

I think that’s about all I have to say for now. I’m working on a more detailed tutorial for information geometry in general, I’ll post a link when I have finished it.

### 6 Comments to “Visualizing the mutual information and an introduction to information geometry”

1. This is great stuff. But I still really wish I knew what the information distance really *is* in information terms, other than being something that’s a bit like the KL divergence but which is a metric. By analogy, out of all the possible divergences one can define, the KL divergence has the unique property that minimising it results in least biased estimates, in a formally definable way. It also has the property that it can be interpreted as the information gained by a Bayesian agent in updating from a prior to a posterior. It seems like there should be something similar for the information distance: out of all the possible metrics that one could define on a parameterised set of probability distributions, the information distance has the desirable set of properties $X$. If you have any insights into what $X$ might be, it’d be really great to hear them :)

2. What do you mean by information distance? Do you mean the arc-length as calculated by using the Fisher information metric? If that is what you mean, then basically, it is the (not squared) Hellinger distance ($\sqrt{\sum_x (\sqrt{p(x)} -\sqrt{q(x)})^2}$). In the mutual information case here along $z$, it is the $\chi$ statistic used when testing for independence (I think).

However, the basic idea is that different information measures correspond to different notions of flatness and affinity. Affinity (the quality of being affine) in this case means that we can get a divergence by integrating the Fisher metric over a straight (geodesic) line. The different versions of flatness can be characterised by different numbers ($\alpha$): KL-divergences correspond to $\alpha = 1,-1$, Hellinger to $\alpha=0$.

For each version of flatness, there is a corresponding version of affinity, i.e. there is a different coordinate system that makes calculating information by integrating the Fisher information along a straight line the correct thing to do. For the KL divergence, either it is the representation above with the $\xi$s or a linear transformation of it. For the KL-divergence the other way, it is the representation used in exponential families. Basically, in statistics we are always using representations that are affine with respect to the KL divergence.

I’m not convinced by your claim:

the KL divergence has the unique property that minimising it results in least biased estimates

Do you have a reference?

• I only just saw this comment – didn’t get a notification for some reason.

But yes, I meant the Fisher-Rao distance. I know what it is mathematically, but I don’t know what it really *means* in information terms. I don’t find the notion of something being a measure of something to be a satisfactory explanation of its meaning; I’m only happy when it can be shown to be *the uniquely correct* measure of something, according to properly specified criteria.

I guess the point is, I know how to use the KL divergence to calculate the answers to questions that don’t involve the KL divergence. The question might be of the form “what is the optimal way for an agent to behave if it wants to do X?” and the answer might be “calculate the KL divergence of Y and do something based on that.” But so far I don’t know a question of the form “what is the optimal way to do X?” for which the answer is “do something based on the Fisher-Rao distance between Z”. I’m fairly sure questions like that must exist, and if I could see and understand one of them I would be a lot more inclined to feel that information geometry is more than a mathematical curiosity.

About the unique property of the KL divergence. It depends on how you define “least biased” of course, and I probably shouldn’t have used that phrase because it probably has a definite technical meaning that’s not what I intend. But I can try to explain what I did mean.

I don’t have a reference for the following because I worked it out myself, although I’m certain I’m not the first to do so. The proof is analogous to Shannon’s proof that the entropy is the only reasonable measure of the uncertainty in a discrete probability distribution (http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf, section 6 beginning on page 10; see http://www-biba.inrialpes.fr/Jaynes/cc11g.pdf for a better exposition by Jaynes that puts it in terms of Bayesian updating rather than stochastic processes). I can’t remember the exact criteria you need to derive the KL divergence, but they are quite similar to Shannon’s, except that you have to demand that it goes to a minimum when every $p_i = q_i$ instead of being a function of $n$ when the $p_i$ are all equal to each other. The proof is much the same as Shannon’s – you start with the chosen criteria and use them to write down functional equations, which you solve to show that only monotonically increasing functions of $D_\text{KL}(P\|Q)$ can meet them.

The upshot is that you can make the same types of argument about the KL divergence as Jaynes made about entropy. If you minimise $D_\text{KL}(P\|Q)$ with respect to P (subject to constraints on P) while keeping Q constant then you’re calculating the posterior P that is closest to Q in the sense of requiring the least amount of additional information. If you choose a posterior P’ that satisfies the constraints but does not minimise the KL divergence then you’re effectively making $D_\text{KL}(P'\|Q) - D_\text{KL}(P\|Q)$ bits of additional unwarranted assumptions.

• I replied in the wrong place, see below…

3. It does have an interpretation, however I am going to keep it mostly to myself for a bit, as I am (and have been for a while) writing a paper on this. It’s certainly possible to give it a very nice interpretation in terms of risk and (arguably, neuroscientifically consistent) estimation procedures. There is a whole class of distances (which includes the symmetrised KL and the squared Fisher-Rao distance) that correspond to ‘rational’ choices, but there are some special ones too. Will send it to you when I’m done.

As for what information geometry is about, perhaps I can be a little clearer. First of all, it has very little to do with the Fisher-Rao distance – though it has some nice properties. The main reason for attention to be paid to the Fisher-Rao distance is that it corresponds to normal Riemannian geometry, which more people understand, letting us talk about how information measures are (slightly) different from it and providing an intuitive entry point for understanding them. Information geometry is more about things like “how is the forwards KL-divergence related to the backwards KL-divergence?” ($\alpha$-duality), “what’s with all these distributions with exponentials in them?” (the relationship between distributions and their parametrizations), “what counts as a distance, or as a measure of information?”, “what is the relationship to convex analysis and other workhorses of economic theory?” (duality again), etc.

Let’s start with unbiasedness, as it illustrates the importance of coordinate systems (which is why information geometry, which doesn’t care about coordinate systems, isn’t particularly concerned with it!). The problem with unbiasedness in general is that it is not a coordinate-free notion. For example, it’s well established that the square root of an unbiased estimate of the variance is not an unbiased estimate of the standard deviation, i.e. we have a simple counter-example showing, for some estimator $\hat{\xi}$ of a statistic (coordinate!) $\xi$, that $E[\hat{\xi} - \xi] = 0 \nRightarrow E[f(\hat{\xi}) - f(\xi)] = 0$, which is kind of obvious when you write it like that. There is nothing in particular about probabilities themselves that means that we must choose a particular coordinate system for representing probability distributions. The Kolmogorov formalization is designed so that one does not need coordinates. If we have a probability space $(X, \Sigma, \mu)$, then for $E \in \Sigma$ all we have is
$\mu(E) = \int_E d\mu(x)$
The $p_i$ come about when you choose a particular measure (the discrete measure, $\delta$) to create a representation of the probabilities by letting
$\mu(E) = \int_E p(x_i) d\delta(x)$
In a technical sense, the $p_i$s are probability densities, not probabilities – if one (justifiably) requires probabilities to be strictly a mapping from the $\sigma$-algebra to the interval $[0,1]$, and densities a mapping from some set (topologically) continuous with the support $X$ to the interval $[0,1]$. Having to work in a particular measure is problematic, the most obvious case being when one tries to generalize from a discrete support to a continuous one.

So, one of the motivations behind information geometry is to deal with this. We want our mechanism to be invariant to transformations of measure and, more generally, to how we choose to represent our distributions (the coordinates). An abstract justification being: if we have a mechanism that does not deal in coordinates, then when things depend on coordinates we know that we have introduced some kind of asymmetry into our working from outside (to borrow an idea from theoretical physics).

There are loads of divergences that deal with this invariance correctly. But we can say more than this. Often we can relate the coordinates we choose to a divergence, for example, those that fall out of using the discrete measure, to the KL divergence. Information geometry shows some important facts about certain coordinate systems, and for example, describes why the KL-divergence and the coordinates associated with it should be of particular importance – but this is fairly advanced stuff.

This means that we can, for example, discard some of Shannon’s requirements that are a little arbitrary and not so very justifiable outside of communication channels (e.g. additivity). If, like me, you care about living things, we can get rid of the idea that nature cares about bits and talk about things that are more ecologically relevant, whilst still maintaining the more general notion of information, which isn’t specifically tailored to communication channels (Shannon understood this need all too well).

• Well, I guess if by information geometry you mean “stuff to do with KL divergences” and not “stuff to do with the Fisher metric” then fair enough I guess – but then, why not just call it information theory – where does the geometry come in? (I’m sure there’s a lot of stuff I’m not getting. All this stuff about measures and supports and so on doesn’t mean a lot to me. One day I’ll try to find time to get to grips with it.)

I wouldn’t say that additivity is an arbitrary requirement. Rather, it is an arbitrary convention. Although Shannon didn’t express it this way, what his proof actually shows is that the other requirements determine the entropy up to an arbitrary monotonically increasing function. Shannon saw this and said “well, let’s choose the version that combines additively for independent systems” and ended up with $-\sum_i p_i \log p_i$, but he could have said “let’s choose the version that combines multiplicatively for independent systems”, and then we’d all be talking about $\prod_i p_i ^ {-p_i}$ instead. But since the point of entropy is usually to maximise it, it doesn’t make a lot of difference.
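(A quick numerical aside of my own: the multiplicative form is just $e^H$, so the two versions always agree about which distribution is more uncertain.)

```python
import math

# exp(entropy) equals the multiplicative form prod(p_i ** -p_i)
p = [0.5, 0.25, 0.125, 0.125]
entropy = -sum(pi * math.log(pi) for pi in p)     # additive form, in nats
product = math.prod(pi ** -pi for pi in p)        # multiplicative form
print(product, math.exp(entropy))                 # identical
```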

I also completely agree that Shannon’s work was very specifically tailored to communications channels. In biology, people have made quite a lot of mistakes in the past by not realising this. But I’m following Jaynes, who took Shannon’s work and did the hard work of removing the stuff that’s specific to that application domain, and re-casting it in terms of the reasoning of a Bayesian agent. I would strongly recommend you read the relevant chapter of his book (chapter pdf: http://www-biba.inrialpes.fr/Jaynes/cc11g.pdf). In all likelihood it will make about as much sense to you as the Kolmogorov stuff does to me, but perhaps you will find it gives you a useful new perspective on what Shannon’s proof actually shows, when you take away the specific communications engineering context.