People always want an explanation of Friston’s Free Energy that doesn’t have any maths. This is quite a challenge, but I hope I have managed to produce something comprehensible.
This is basically a summary of Friston’s Entropy paper (available here). A friend of jellymatter was instrumental in its production, and for this reason I am fairly confident that my summary is going in the right direction, even if I have not emphasised exactly the same things as Friston.
I’ve made a point of writing this without any maths, and I have highlighted what I consider to be the main assumptions of the paper and marked them with a P.
The basic structure of Friston’s theory is not particularly unusual: it is one of many theories that work by assuming that something is optimised – in this case, a “free energy” (a bit like the physics one). Much of modern physics is based on some kind of optimisation: in mechanics one minimises an action, and in thermodynamics one minimises the (thermodynamic) free energy. Further afield it is found in economic decision making, where a risk is minimised, and in population dynamics, where, depending on one’s interpretation, fitness is maximised.
In fact, all theories, right or wrong, can be formulated in terms of an optimisation. The theory “summer is warmer than winter” could be expressed as “physics works so as to maximise summer temperature minus winter temperature”. Though, of course, we do not usually speak in this way.
Formulating a problem in this way always raises a question: “why should this quantity be optimised?” There are two distinct but non-exclusive responses: (1) some other theory suggests that it should be optimised; and (2) it is an elegant summary of other models.
Risk and fitness optimisation are of the first type: in economics, the notion of utility justifies the minimisation of risk, and the notion of replicators is often used to motivate the maximisation of fitness. Contrariwise, the action principles of physics are of the second type: they unify various principles in one coherent and elegant framework.
The argument that Friston provides addresses both. In the first case, it is motivated by a notion that agents are “coherent” in some sense (active systems). In the second, it generalises a number of concepts in machine learning and statistical inference.
Here, I will not worry about its ability to generalise mathematical theorems, but attempt to restate the argument for it following from biological principles. Friston’s presentation is usually aimed at those who prefer overarching, general, mathematical theories, but this seems to me the source of many of the difficulties that people have when trying to understand it.
Importantly, it is the justification in terms of other biological theories that matters to the non-theoretician: it is what they weigh up when deciding whether or not to attempt an understanding of the mathematical details.
Agent and Environment
The model begins with a sensory-motor feedback loop. The agent affects its environment and the environment affects the agent – these are modelled as physical systems. In the environment there is “added noise”, and this noise motivates us to talk about probabilities and information.
P1: The internal states of an organism react deterministically to random (but correlated) sensory information from the environment.
Because of the noise implicit in the environment, the environment behaves randomly. And because of this the agent – whose state is affected by the environment – behaves randomly in a corresponding fashion. The agent then affects the environment with (delayed and transformed) noise, which affects the agent… etc. etc. The consequence of all of this is that there is now a probability distribution over the possible physical states of the agent and the environment.
P2: Organisms act against the environment’s randomness, towards being in a definite state.
The stated motivation for this is homeostasis: we act so as to not decay into disordered molecules, by eating, avoiding danger, not exploding, and what have you.
The most obvious (but not the only) way of measuring the amount of randomness is the entropy, or in this case the surprisal, which for our purposes here we can consider to be the same thing. The surprisal measures the spread of states that the system can be in: if it is low, the system stays in one of very few states; if it is high, the system could be in any one of many. So the self-maintaining-ness/homeostaticity/distinctness is measured by the surprisal, which, according to Friston, should be minimised.
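Although this post avoids maths, for the code-inclined reader the idea can be sketched in a few lines. This is purely illustrative – the distributions below are made up – but it shows how entropy (average surprisal) distinguishes a system concentrated in a few states from one spread across many:

```python
import math

def entropy(p):
    """Shannon entropy in bits: the average surprisal -log2(p) over states."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A "definite" system: almost always in one state -> low entropy.
definite = [0.97, 0.01, 0.01, 0.01]
# A "random" system: equally likely to be in any of four states -> high entropy.
spread = [0.25, 0.25, 0.25, 0.25]

print(entropy(definite))  # low (about 0.24 bits)
print(entropy(spread))    # 2.0 bits, the maximum for four states
```

On Friston’s reading, an organism “prefers” to be like the first distribution rather than the second.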
At first glance it may seem that one should apply this measure to the organism’s internal states, but it turns out that this doesn’t work. For example, a rock has very definite states and a low surprisal. Instead, the proposed solution is for the organism to minimise the surprisal associated with the external world* – he calls systems that do this active systems.
An active system acts so as to make its sensory inputs as predictable and unsurprising as possible. This means we can make a modification to P2:
P2b: Organisms act against the environment’s randomness, towards obtaining sensory evidence that suggests that they are in a well-defined, definite state.
Doing this solves the rock problem. And because the randomness of the inputs affects the internal state, it is measuring a very similar thing.
Depending on one’s philosophy, P2b may be either a refinement of or a change to P2. One’s opinion about this is crucial for deciding if the notion of homeostasis is a justification for this theory. Either way, with P2b we still have a problem much like the rock example from before. Sitting in a dark room with your fingers in your ears would be an excellent way of minimising surprisal – and we obviously don’t do that. Much.
The optimality of sensory deprivation can be seen as a motivational problem for free energy, but first I must go through some stuff about inference.
P3: Organisms make good inferences.
Organisms can be considered to make inferences, and making good inferences has different requirements from the notion of surprise described above.
To discuss inference we must introduce some more probability distributions. Let’s say that the probabilities of sensory inputs are determined by things that we can’t directly observe. If we observe a coin landing on heads five times and tails five times, then we can make inferences about some hidden parameter that has a value somewhere around one half – a statistician would say the outcome of a fair coin flip is drawn from a Bernoulli distribution with a parameter of 1/2. The more we observe the coin landing, the more sure we can be about the parameter.
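For those who like code, the coin example can be sketched with a standard Beta–Bernoulli update (my choice of illustration, not something from Friston’s paper). The point is only that the best guess stays near one half while the uncertainty shrinks as observations accumulate:

```python
import math

def beta_posterior(heads, tails, prior_a=1.0, prior_b=1.0):
    """Conjugate Beta update for a hidden Bernoulli parameter.
    Returns the posterior mean and standard deviation."""
    a, b = prior_a + heads, prior_b + tails
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Five heads and five tails: best guess near 1/2, but still quite uncertain.
m1, s1 = beta_posterior(5, 5)
# Five hundred of each: the same guess, now far more certain.
m2, s2 = beta_posterior(500, 500)
print(m1, s1)  # 0.5, roughly 0.14
print(m2, s2)  # 0.5, roughly 0.016
```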
Of course, in this example the choice of parameter has no obvious physical basis: one could easily choose another, related parameter – say by adding one, squaring it, taking the logarithm, etc. – and have it describe the same coin in a different way. The choice of parameter is somewhat arbitrary, and it is for this reason that Friston describes such parameters as fictive. We can choose them how we like as long as they describe the same thing.
The fictive parameters are used to model the world. Consider a brick. This brick could have the parameters width, height, and depth. Associated with each parameter there would be a confidence: a probability that the brick is between 9 and 10 cm long, between 8 and 9 cm, between 7 and 8 cm, as well as for any other pair of numbers – 8.51234… and 8.51333… or whatever. But we needn’t have chosen length, width, and height: we could equally have chosen surface area, volume, and perimeter. We could still describe the same brick, and these parameters would have a corresponding probability distribution which you could work out from the former.
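A tiny sketch may help here. The dimensions and probabilities below are entirely made up for illustration; the point is that a belief stated over one set of parameters (the brick’s dimensions) can be mechanically restated over another (its volume), and it is the same belief about the same brick:

```python
# Hypothetical discrete belief over a brick's (length, width, height) in cm.
belief = {
    (8.0, 4.0, 2.0): 0.5,
    (9.0, 4.0, 2.0): 0.3,
    (10.0, 4.0, 2.0): 0.2,
}

# Re-describe the very same belief using a different parameter: volume (cm^3).
volume_belief = {}
for (l, w, h), p in belief.items():
    v = l * w * h
    volume_belief[v] = volume_belief.get(v, 0.0) + p

print(volume_belief)  # {64.0: 0.5, 72.0: 0.3, 80.0: 0.2}
```

Nothing about the brick has changed – only the fictive parameters we used to describe it.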
It is both the power and the failure of information theory that it talks about probabilities with complete indifference for what they are.
Because the parameters don’t necessarily have any specific meaning or interpretation, at this point we simply stop trying to work out what they are or what they mean – all we care about is the thing they describe. Friston argues that whatever they happen to be, we can still talk about them abstractly. The mathematical tools he uses are then chosen so as to make this aspect of his probabilities unproblematic (measure invariance).
The probability of the fictive parameters comes in two main flavours. One is the probability of the parameters as determined by the state of the sensory system (an “objective” probability in some sense), related to Friston’s “generative model”. The other is the probability of the parameters as determined by an internal model (the subject’s probability), which Friston calls a “proposal density”. The former is the world as it is best described; the latter, the product of an organism’s attempt to describe it.
The main idea of making inferences is that the organism tries to find the probability distribution (proposal density) that best matches the “real” probability distribution (generative model). The better the internal model, the better it matches the world; and the better the model matches the world, the better it is for making predictions about it.
Choosing to minimise the difference between probability distributions entails many of the things that people want from inferential systems, such as the maximum entropy principle**. To measure this difference, Friston uses a standard tool: the Kullback-Leibler divergence. There are plenty of other measures that he could have used, but this one is usually preferred by information theory and Bayesian types.
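For concreteness, here is a minimal sketch of the Kullback-Leibler divergence over discrete distributions (the example distributions are my own invention). It is zero when the internal model agrees with the target, and grows as they come apart:

```python
import math

def kl_divergence(q, p):
    """KL(q || p) in bits: how far the internal model q is from the target p.
    Zero iff they agree; assumes p > 0 wherever q > 0."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

true_dist = [0.7, 0.2, 0.1]     # stand-in for the "generative" distribution
good_model = [0.65, 0.25, 0.1]  # a proposal density close to it
bad_model = [0.1, 0.2, 0.7]     # a proposal density far from it

print(kl_divergence(good_model, true_dist))  # small (about 0.01 bits)
print(kl_divergence(bad_model, true_dist))   # much larger (about 1.7 bits)
```

Note the asymmetry: KL(q‖p) and KL(p‖q) are in general different, which is one of the things that distinguishes it from the other measures Friston could have used.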
Conflict Between Surprise and Inference
The next step is to acknowledge that both inferential ability and surprisal minimisation are “good”, and to define a quantity that measures this goodness. Depending on how you go about constructing this quantity, you end up with different things. But the quantity that is nice to work with, once one has settled on the surprisal and the Kullback-Leibler divergence, is Friston’s free energy. This basically just adds the two together. The result is certainly elegant, but there is no motivation for this particular form beyond mathematical tractability (the reason for the mathematical niceness is the subject of information geometry).
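The “just add the two together” structure can be shown in toy form. To be clear, this is not Friston’s actual functional – his terms are defined over the generative model and proposal density in a specific way – it only illustrates the additive shape of the quantity, with made-up numbers:

```python
import math

def surprisal(p_obs):
    """Surprisal of an observation with probability p_obs: -log2(p_obs)."""
    return -math.log2(p_obs)

def kl_divergence(q, p):
    """KL(q || p) in bits between two discrete distributions."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def toy_free_energy(p_obs, q, p):
    """Toy 'free energy': surprisal of what was sensed, plus the divergence
    between the internal model q and the target distribution p."""
    return surprisal(p_obs) + kl_divergence(q, p)

# A probable observation and a matching model give a low score...
low = toy_free_energy(0.8, [0.5, 0.5], [0.5, 0.5])
# ...a surprising observation and a mismatched model give a high one.
high = toy_free_energy(0.1, [0.9, 0.1], [0.5, 0.5])
print(low, high)  # low is about 0.32, high is about 3.85
```

Minimising the sum pushes down on both terms at once, which is the whole point of the construction.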
Importantly, when optimal inference and minimising surprisal are not in conflict, minimising the free energy minimises both of them. This is an “all things being equal” justification:
P4: If it were possible, an organism would minimise both the surprisal and the inferential “error” of its predictions, independently of each other.
But no things are ever equal. In practice, the two quantities cannot be treated separately. This is because the “subject” probability must have some physical basis in the internal states of the organism, and is thereby constrained by this physicality. This is essentially the idea that the brain represents probabilities, and is what Friston calls entailment. The internal states are the same thing as the subject’s probability distribution but viewed, as it were, through a different lens. The exact manner in which the brain states are mapped to probabilities is not discussed directly, but there is an implicit notion that the brain cannot represent just any old probability distribution.
The usual consequence of entailment is that it is no longer possible to simultaneously minimise surprisal and maximise inferential ability; instead, there is a trade-off between the two:
P5: Due to the physical nature of organisms, maximising inferential ability and minimising surprise are in conflict with each other.
This motivates the need for free energy as one singular quantity, rather than two separate ones. It is also how one solves the “dark room with fingers in ears” problem, though for a slightly technical reason: implicit in the formalisation of maximised inferential ability is the notion that making the best inferences about many things is better than making the best inferences about fewer things. In the state of sensory deprivation I mentioned, you can make rather good inferences about what you see, but you cannot make inferences about the other things you would see if you opened your eyes and took your fingers out of your ears***.
The relationship between the internal states and the internal/subject probabilities is of fundamental importance. It is the very heart of what Friston calls the free energy principle. Here he elaborates on the nature of the physical constraint on the probability distribution the brain encodes. Basically, the constraint is simply the number of states that the brain can be in. So, we have a motivation for P5:
P6: The world is very big, the brain is relatively small. The brain just does not have the capacity to match the complexity of the real world as provided in sensation.
Between them, P1–P6 are the main motivations for the free energy in what I called type (1) terms – those that do not appeal to the generality of the theory. I hope they provide a good outline of the reasons for using free energy.
I will skip the applications sections of the paper as they are of the other kind. As I said, whilst these are important for those theoreticians who already use those techniques, it is not my concern here.
So, to summarise the notion of free energy: it is one way that one may quantify the trade-off between making one’s environment predictable, and the ability to make predictions where one cannot. The quantity is chosen so as to fit easily with established formalisms in information theory and Bayesian probability.
Throughout this summary I have omitted the mathematical assumptions, such as additivity and the relevance of the surprisal and KL-divergence, as I do not think including these helps with readability. But we must not forget that they are fundamental to how the particular form of the free energy is formulated. The wordy argument I have presented applies equally to all other free-energy-like formulations.
* However, the internal and external surprisals are related because the state of the environment determines the internal state of the agent.
** Interestingly, the same version of the maximum entropy principle is also a property of Friston’s free energy. One should probably view Friston’s corollary 3 as showing that using free energy does not break this property of the KL divergence. Also, there are some very interesting results in information geometry along exactly these lines.
*** This raises obvious questions about what fictive variables are valid, but I shall skip over this problem mentioning only that it is a problem that may potentially be solved empirically.