## Friston’s Free Energy for Dummies

People always want an explanation of Friston’s Free Energy that doesn’t have any maths. This is quite a challenge, but I hope I have managed to produce something comprehensible.

This is basically a summary of Friston’s Entropy paper (available here). A friend of jellymatter was instrumental in its production, and for this reason I am fairly confident that my summary is going in the right direction, even if I have not emphasised exactly the same things as Friston.

I’ve made a point of writing this without any maths, and I have highlighted what I consider to be the main assumptions of the paper and marked them with a P.

### Optimisation Theories

The basic structure of Friston’s theory is not particularly unusual: it is one of many theories that work by assuming that something is optimised – in this case, a “free energy” (a bit like the physics one). Much of modern physics is based on some kind of optimisation: in mechanics one minimises an action, and in thermodynamics one minimises the (thermodynamic) free energy. Further afield it is found in economic decision making, where a risk is minimised, and in population dynamics, where, depending on one’s interpretation, fitness is maximised.

In fact, all theories, right or wrong, can be formulated in terms of an optimisation. The theory “summer is warmer than winter” could be expressed as “physics works so as to maximise summer temperature minus winter temperature”. Though of course we do not usually speak in this way.

Formulating a problem in this way always raises a question: “why should this quantity be optimised?” There are two distinct but non-exclusive responses: (1) some other theory suggests that it should be optimised; and (2) it is an elegant summary of other models.

Risk and fitness optimisation are of the first type: in economics, the notion of utility justifies the minimisation of risk, and the notion of replicators is often used to motivate the maximisation of fitness. Contrariwise, the action principles of physics are of the second type: they unify various principles in one coherent and elegant framework.

The argument that Friston provides addresses both. In the first case, it is motivated by a notion that agents are “coherent” in some sense (active systems). In the second, it generalises a number of concepts in machine learning and statistical inference.

Here, I will not worry about its ability to generalise mathematical theorems, but attempt to restate the argument for it following from biological principles. Friston’s presentation is usually aimed at those who prefer overarching, general, mathematical theories, but this seems to me the source of many of the difficulties that people have when trying to understand it.

Importantly, it is the justification according to other biological theories that matters to the non-theoretician: it is what they weigh when deciding whether or not to attempt an understanding of the mathematical details.

### Agent and Environment

The model begins with a sensory-motor feedback loop. The agent affects its environment and the environment affects the agent – these are modelled as physical systems. In the environment there is “added noise”, and this noise motivates us to talk about probabilities and information.

P1: The internal states of an organism react deterministically to random (but correlated) sensory information from the environment.

Because of the noise implicit in the environment, the environment behaves randomly. And because of this the agent – whose state is affected by the environment – behaves randomly in a corresponding fashion. The agent then affects the environment with (delayed and transformed) noise, which affects the agent… etc. etc. The consequence of all of this is that there is now a probability distribution over the possible physical states of the agent and the environment.

P2: Organisms act against the environment’s randomness, towards being in a definite state.

The stated motivation for this is homeostasis: we act so as to not decay into disordered molecules, by eating, avoiding danger, not exploding, and what have you.

The most obvious (but not the only) way of measuring the amount of randomness is the entropy, or in this case the surprisal, which for our purposes here we can consider to be the same thing as entropy. The surprisal measures the number of states that the system can be in: if it is low, the system stays in one of very few states; if it is high, the system is in one of many. So, the self-maintaining-ness/homeostaticity/distinctness is measured by the surprisal, which, according to Friston, should be minimised.
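Numerically, the surprisal of a single state is the negative log of its probability, and the entropy is the average surprisal. A minimal sketch of the low-versus-high randomness contrast (the distributions here are mine, purely illustrative):

```python
import math

def entropy(p):
    """Shannon entropy in bits: the average surprisal -log2 p(x)."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# A system pinned to one of very few states has low entropy...
near_certain = [0.97, 0.01, 0.01, 0.01]
# ...while one wandering over many states has high entropy.
spread_out = [0.25, 0.25, 0.25, 0.25]

print(entropy(near_certain))  # ~0.24 bits
print(entropy(spread_out))    # exactly 2 bits
```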

At first glance it may seem that one should apply this measure to the organism’s internal states, but it turns out that this doesn’t work: a rock would have very definite states and a low surprisal. Instead, the proposed solution is for the organism to minimise the surprisal associated with the external world* – Friston calls systems that do this active systems.

An active system acts so as to make its sensory inputs as predictable and unsurprising as possible. This means we can make a modification to P2:

P2b: Organisms act against the environment’s randomness, towards obtaining sensory evidence that suggests that they are in a well-defined state.

Doing this solves the rock problem. And because the randomness of the inputs affects the internal state, it is measuring a very similar thing.

Depending on one’s philosophy, P2b may be either a refinement of or a change to P2. One’s opinion about this is crucial for deciding whether the notion of homeostasis is a justification for this theory. Either way, with P2b we still have a problem much like the rock example from before: sitting in a dark room with your fingers in your ears would be an excellent way of minimising surprisal – and we obviously don’t do that. Much.

The optimality of sensory deprivation can be seen as a motivational problem for free energy, but first I must go through some stuff about inference.

### Best Inference

P3: Organisms make good inferences.

Organisms can be considered to make inferences, and making good inferences has different requirements from the notion of surprise described above.

To discuss inference we must introduce some more probability distributions. Let’s say that the probabilities of sensory inputs are determined by things that we can’t directly observe. If we observe a coin landing on heads five times and tails five times, then we can make inferences about some hidden parameter with a value somewhere around one half – a statistician would say the outcome of a fair coin flip is drawn from a Bernoulli distribution with a parameter of 1/2. The more we observe the coin landing, the more sure we can be about the parameter.
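The coin example can be made concrete. Assuming a standard conjugate (Beta–Bernoulli) update – my choice of machinery, not something the post specifies – the best guess for the hidden parameter sits near one half, and watching more flips shrinks the uncertainty:

```python
def beta_posterior(heads, tails, a=1.0, b=1.0):
    """Posterior mean and variance for the coin's hidden parameter,
    starting from a flat Beta(1, 1) prior (conjugate update)."""
    a_post, b_post = a + heads, b + tails
    mean = a_post / (a_post + b_post)
    var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    return mean, var

# Five heads and five tails: best guess is one half, still quite unsure.
m1, v1 = beta_posterior(5, 5)
# Fifty of each: same best guess, much more confidence.
m2, v2 = beta_posterior(50, 50)
print(m1, v1)  # 0.5, ~0.019
print(m2, v2)  # 0.5, ~0.0024
```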

Of course in this example the choice of parameter has no obvious physical basis: one could easily choose another, related parameter – say by adding one, squaring it, taking the logarithm etc – and have it describe the same coin but in a different way. The choice of parameter is kind of arbitrary; and it is for this reason that Friston describes them as fictive. We can choose them how we like as long as they describe the same thing.

The fictive parameters are used to model the world. Consider a brick. This brick could have parameters width, height and depth. Associated with the parameters there would be a confidence in each parameter: there is a probability that the brick is between 9 and 10 cm long, 8 and 9 cm, 7 and 8, as well as for any other pair of numbers, 8.51234… and 8.51333… or whatever. But we needn’t have chosen width, height, and depth – we could equally have chosen surface area, volume and perimeter and still describe the same brick, and with these there would be a corresponding probability distribution which you could work out from the former.
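The arbitrariness of the parameter choice can be illustrated with a toy simulation (the spread over the brick’s side is invented for the example). Describing the same bricks by a side length, or by that length cubed, assigns the same probability to the same physical event:

```python
import random

random.seed(0)
# Hypothetical spread of one brick dimension, in cm (a stand-in for a
# fictive parameter; the numbers are illustrative only).
lengths = [random.uniform(5, 15) for _ in range(100_000)]

# "Is the side between 9 and 10 cm?" asked in two parameterisations:
# directly, and via the cubed side. Same event, same probability.
p_by_length = sum(9 <= x <= 10 for x in lengths) / len(lengths)
p_by_cube = sum(729 <= x ** 3 <= 1000 for x in lengths) / len(lengths)

print(p_by_length, p_by_cube)  # identical by construction
```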

It is both the power and the failure of information theory that it talks about probabilities with complete indifference for what they are.

Because the parameters don’t necessarily have any specific meaning or interpretation, at this point we simply give up trying to work out what they are or what they mean – all we care about is the thing they describe. Friston argues that, whatever they happen to be, we can still talk about them abstractly. The mathematical tools he uses are then chosen so as to make this aspect of his probabilities unproblematic (measure invariance).

The probability of the fictive parameters comes in two main flavours. One is the probability of the parameters as determined by the state of the sensory system (an “objective” probability in some sense), related to Friston’s “generative model”. The other is the probability of the parameters as determined by an internal model (the subject’s probability), which Friston calls a “proposal density”. The former is the world as it is best described; the latter, the product of an organism’s attempt to describe it.

The main idea of making inferences is that the organism tries to find the probability distribution (proposal density) that best matches the “real” probability distribution (generative model). The better the internal model, the better it matches the world. The better the model matches the world, the better it is for making predictions about it.

Choosing to minimise the difference between probability distributions entails lots of things that people want from inferential systems, such as the maximum entropy principle**. To measure this difference, Friston uses a standard tool: the Kullback–Leibler divergence. There are plenty of other measures that he could have used, but this one is usually preferred by information theory and Bayesian types.
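A minimal discrete sketch of the Kullback–Leibler divergence, measuring how badly a proposal density matches a generative model (the toy distributions are mine):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits: zero iff the proposal q matches the
    generative distribution p, positive otherwise."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

generative = [0.5, 0.3, 0.2]  # the "real" distribution over outcomes
good_model = [0.5, 0.3, 0.2]  # a perfect internal model
poor_model = [0.1, 0.1, 0.8]  # a badly mismatched internal model

print(kl_divergence(generative, good_model))  # 0.0
print(kl_divergence(generative, poor_model))  # ~1.24 bits
```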

### Conflict Between Surprise and Inference

The next step is to acknowledge that both inferential ability and surprisal minimisation are “good”, and to define a quantity that measures this goodness. Depending on how you go about constructing this quantity, you end up with different things. But the quantity that is nice to work with, once one has settled on the surprisal and the Kullback–Leibler divergence, is Friston’s free energy, which basically just adds the two together. The result is certainly elegant, but there is no motivation for this particular form beyond mathematical tractability (the reason for the mathematical niceness is the subject of information geometry).
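Taking the description above literally – free energy as the surprisal plus the KL “inference error” – a toy discrete version might look as follows. This is only an illustration of adding the two terms, not Friston’s actual continuous formulation:

```python
import math

def kl(p, q):
    """KL divergence in bits between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def free_energy(p_sensed, generative, proposal):
    """Toy reading of the text: surprisal of the current sensory state,
    plus the mismatch between proposal and generative model."""
    return -math.log2(p_sensed) + kl(generative, proposal)

# A predictable input and a good internal model: low free energy.
low = free_energy(0.9, [0.5, 0.5], [0.5, 0.5])
# A surprising input and a poor internal model: high free energy.
high = free_energy(0.1, [0.5, 0.5], [0.9, 0.1])
print(low, high)
```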

Importantly, when optimal inference and minimising surprisal do not conflict, minimising the free energy minimises both of them. This is an “all things being equal” justification:

P4: If it were possible, an organism would minimise both the surprisal and the inferential “error” of their prediction independently of each other.

But no things are ever equal. In practice, the two quantities are not independent. This is because the “subject” probability must have some physical basis in the internal states of the organism, and is thereby constrained by this physicality. This is essentially the idea that the brain represents probabilities, and is what Friston calls entailment. The internal states are the same thing as the subject’s probability distribution, but viewed, as it were, through a different lens. The exact manner in which brain states are mapped to probabilities is not discussed directly, but there is an implicit notion that the brain cannot represent just any old probability distribution.

The usual consequence of entailment is that it is no longer possible to simultaneously minimise surprise and maximise inference; instead, there is a trade-off between the two:

P5: Due to the physical nature of organisms, maximising inferential ability and minimising surprise are in conflict with each other.

This motivates the need for free energy as one singular quantity, rather than two separate ones. It is also how one solves the “dark room with fingers in ears” problem, though for a slightly technical reason: implicit in the formalisation of maximised inferential ability is the notion that making the best inferences about lots of things is better than making the best inferences about fewer things. Whilst in the state of sensory deprivation I mentioned, you can make rather good inferences about what you see, but you cannot make inferences about the other things you would if you opened your eyes and took your fingers out of your ears***.

The relationship between the internal states and the internal/subject probabilities is of fundamental importance. It is the very heart of what Friston calls the free energy principle. Here he elaborates on the nature of the physical constraint on the probability distribution the brain encodes. Basically, the constraint is simply the number of states that the brain can be in. So, we have a motivation for P5:

P6: The world is very big, the brain is relatively small. The brain just does not have the capacity to match the complexity of the real world as provided in sensation.

Between them P1-P6 are the main motivations for the free energy in what I called type (1) terms – those that do not appeal to the generality of the theory. I hope they provide a good outline of the reasons for using free energy.

I will skip the applications sections of the paper as they are of the other kind. As I said, whilst these are important for those theoreticians who already use those techniques, it is not my concern here.

### Summary

So, to summarise the notion of free energy: it is one way of quantifying the trade-off between making one’s environment predictable and the ability to make predictions where one cannot. The quantity is chosen so as to fit easily with established formalisms in information theory and Bayesian probability.

Throughout this summary I have omitted the mathematical assumptions, such as additivity and the relevance of the surprisal and KL divergence, as I do not think including these helps with readability. But we must not forget that they are fundamental to how the particular form of the free energy is formulated. The wordy argument I have presented applies equally to all other free-energy-like formulations.

### Footnotes

* However, the internal and external surprisals are related because the state of the environment determines the internal state of the agent.

** Interestingly, the same version of the maximum entropy principle is also a property of Friston’s free energy. One should probably view Friston’s corollary 3 as showing that using free energy does not break this property of the KL divergence. Also, there are some very interesting results in information geometry along exactly these lines.

*** This raises obvious questions about what fictive variables are valid, but I shall skip over this problem mentioning only that it is a problem that may potentially be solved empirically.

### 9 Responses to “Friston’s Free Energy for Dummies”

1. This is great! I hadn’t appreciated the conflict between minimizing entropy and maximizing inferential ability, and that minimizing free energy represents a compromise between the two. I’m still not sure how the two are balanced in the free energy formulation, or how it applies to the activity of nervous systems, but I feel like I have some overall understanding of Friston now. Thanks

2. Having spoken to Karl Friston about this recently, I think his resolution of the dark room problem is different from what you describe, at least if I understood you correctly. To me the key concept of Friston’s idea is that some aspects of the agent’s model are updated according to Bayes’ rule, but others aren’t – they’re just fixed. (Set by evolution I guess, or maybe in some circumstances they’re set by learning processes that take place on a slower time scale than the one being considered.) The things that can be updated are the agent’s beliefs about the world, whereas the fixed things are the agent’s goals – but Friston calls both of these things “beliefs”.

So Friston will say (and quite often did say in our meeting) that the reason you don’t sit in a dark room until you starve is that you “expect” that you will go out and explore the world and find food and survive. This is the type of expectation that can’t be changed – so you can’t just (metaphorically) say to yourself, “well, I’m sitting in a dark room and not eating, so therefore I should expect to starve”, and then starve to death while not being surprised by it. Instead you’re forced to (metaphorically) say “hang on, I’m getting hungry. This is counter to my expectation that I will eat, and therefore surprising! I can’t change this expectation, so I must go out and get some food in order to minimise this surprise.” I guess you could say you have expectations about your own behaviour as well as about the world, and it’s these that allow you to have goals and motivations other than just minimising the surprisal of your sensory input.

It may be that Friston didn’t invent this idea – Andy Clark has a recent paper on “active inference” which I’m part way through reading, and that paper talks about versions of the idea that pre-date Friston. However, I think it’s very important to understand this idea in order to understand what Friston says.

Friston’s model is specifically about layers of neurons that try to predict features of their input. The only thing they send to the next layer is the error in their prediction, so each layer (apart from the first one) is trying to make predictions about the previous layer’s error signal. Simon McGregor pointed out to me that this “sending on the error signal” is exactly what a classical negative feedback control system does. A thermostat can be thought of as something that always “predicts” that a room will be at $20^\circ\,\text{C}$, and it sends the “error” in that prediction (i.e. the difference between 20 degrees and the actual temperature) to the heating element. The similarity between these two things is the real basis of Friston’s theory. I think he wants to say that every neuron, from the visual cortex to the motor neurons, is carrying out this same unified procedure: it’s making a “prediction” (which might be something it can update to match reality, or it might not) and then sending on the error. The only tricky bit is wiring things up so that this actually results in sensible behaviour. (If you wire a thermostat up “the wrong way around” you’ll get positive feedback instead of negative feedback, and the room will either get very hot or very cold.) For this reason Friston talked a lot in our meeting about “giving the agent a reflex” such that (in my words) the error signals from its motor neurons tend to make the “beliefs” of those neurons become reality.
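The thermostat analogy can be sketched as a tiny simulation (the gain and temperatures are arbitrary): the device “predicts” the setpoint, sends only the prediction error onward, and acting on that error drives the room toward the prediction.

```python
def thermostat_step(setpoint, actual, gain=0.5):
    """The thermostat's "prediction" is the setpoint; it passes on only
    the prediction error, scaled by a gain -- negative feedback."""
    error = setpoint - actual
    return gain * error  # heating power: positive heats, negative cools

# Acting on the error signal pulls the room toward the 20-degree
# "prediction", whatever temperature it starts at.
temp = 14.0
for _ in range(20):
    temp += thermostat_step(20.0, temp)
print(round(temp, 3))  # 20.0
```

Flipping the sign of the gain turns the same loop into positive feedback – the “wired up the wrong way around” case – and the temperature runs away instead of settling.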

On the other hand, I wouldn’t go as far as to say that I’m 100% sure this is the correct interpretation of what Friston says. I’m not an expert by any means, and it’s mostly not written in a way that makes it easy for someone outside the field to understand. But this understanding does seem to match up with the things that Friston says.

• What you describe is different in a sense. But the difference is one of emphasis. I simply chose to speak of a physically necessitated trade off and completely skipped over how that trade off is manifested. It’s exactly those details that confuse people…and the maths. It was definitely not my aim to say what Friston would say, as I’m trying to explain it to the people that find understanding him difficult. I’ve pitched it at a very different, possibly orthogonal, level/plane/manifold/universe/cucumber. I do not think the description I gave contradicts what you say Friston would say.

3. Have either of you come across this paper?

Free-energy minimization and the dark-room problem
Karl Friston, Christopher Thornton and Andy Clark
http://www.frontiersin.org/Perception_Science/10.3389/fpsyg.2012.00130/abstract

It addresses the issue in a very interesting format: a discussion between the three authors. No maths. Some jokes. It’s well worth the read.

• Argh, this is why I had always just given up. Relentless conflation.

The “model” (to be fair, he calls it the “generative model”) in the entropy paper is connected with the description of what the organism is “thinking” (it is a joint distribution of the sensory system and the fictive variables given the dynamical system), whereas here the “model” is the dynamical system.

Friston’s explanation for the dark room problem in this paper, where we always “have our eyes closed” (i.e. we are selective), is not in my mind as good as talking about “entailment” as he does in the Entropy paper.

4. Ok, I’m part way through. I’m stuck on what an “inferential ability” is, and how to get one? How can one organism be more or less inferentially able than another? I’m not sure if this means inferential performance, as in the actual number / size of errors the organism makes in inference in the long run, or something like avoiding “losing information” in the physical sensor processes.

• … So my question is can you expand on that?

• Perhaps inferential quality would have been a better choice of words. The idea is that you identify the internal state of the organism with a probabilistic prediction of the sensory states, then, there is an “actual” sensory state, also a probability distribution. Then you say that the difference (KL divergence) between them is minimised.

The KL divergence is basically $D_{KL}(\text{fictive parameters} ; \text{current internal state} \; || \; \text{fictive parameters} ; \text{current sensory state}, \text{structure of system dynamics})$. Friston tends to use “givens” ($|$) in a way that I wouldn’t, so I have changed them to parameterisations so that the dimensions of the densities match. If you choose the right fictive parameters then this can always be zero unless the state spaces forming the parameters of this divergence have different dimensionalities. This is all done at a specific time: it is not under an ergodic assumption. This all “sits on top” of the state space and only has a non-trivial solution when it conflicts with the surprise.

Does that help?