A while ago I wrote a little rant on the (mis)interpretation of P-values. I’d like to return to this subject having investigated a little more. First, in this post, I’m going to point to an interesting little subtlety raised by Fisher that I hadn’t thought about before; in the second post, I will argue that P-values aren’t as bad as they are sometimes made out to be.
So, last time, I stressed the point that you can’t interpret a P-value as a probability or frequency of anything, unless you say “given that the null hypothesis is true”. Most misinterpretations, e.g. “the probability that you would accept the null hypothesis if you tried the experiment again”, make this error. But there is one common interpretation that is less obviously false: “A P-value is the probability that the data would deviate as or more strongly from the null hypothesis in another experiment, than they did in the current experiment, given that the null hypothesis is true”. This is something that you might think is a more careful statement, but the problem is that in fact when we calculate P values we take into account aspects of the data not necessarily related to how strongly they deviate from the prediction of the null hypothesis. This could be misleading, so we’ll build it up more precisely in this post.
To begin with, let’s make up some symbolic language. “The data” are taken to be a random variable $D$, and the particular instance of the data observed in the current experiment is $d$. This evidence itself is likely to consist of a number of observations, e.g. a collection of real values $d_1, d_2, d_3$ etc. The null hypothesis $H_0$ will specify a value for some parameter of interest, call it $\theta$. For example:

$$H_0: \theta = 0$$

Generally, in order to investigate this parameter, we find a “sufficient statistic”, that is, a function(al) that operates on the observed data to produce (usually) a single real number that contains all the relevant information that allows the data to discriminate between different possible values of the parameter of interest. We’ll call this $T(d)$, so for example we might use

$$T(d) = \frac{1}{n}\sum_{i=1}^{n} d_i$$

that is, the arithmetic mean of the $n$ data values produced by an experiment (if the parameter of interest is the expectation value of the process that is being investigated). This statistic has a “sampling distribution”, i.e. let $T(D)$ be a random variable representing the randomness of the statistic rather than the data itself.

Now we need to say what “deviate as or more strongly” means. For this, assume there is a distance function $\delta$ which describes how far the statistic is from the expectation under the null hypothesis. So if the null hypothesis predicts a zero population mean or expectation and $T(d)$ is the sample mean from the experiment, we could have:

$$\delta(T(d)) = |T(d)|$$

(i.e. the absolute value of the statistic). The null hypothesis says that the parameter is zero, so it implies that the statistic (in this case an estimator of the parameter) is also (expected to be) zero, and we just take how far the statistic is from zero as a measure of deviation. Then, we can encode the above interpretation of the P-value symbolically:

$$P = \Pr\big(\delta(T(D)) \geq \delta(T(d)) \mid H_0\big)$$

“The probability that the distance of the statistic from the prediction is at least as great as the distance observed, given that the null hypothesis is true”.
By the law of large numbers, this could be read as the relative frequency with which the statistic would take a more extreme value than the one observed if we repeat the experiment many times on a process where the null hypothesis applies. This might look fairly reasonable, so what is wrong? In fact, for many tests, it might be right, but I’ll borrow an example from Fisher of a case where it’s not.
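As a rough check that the naive reading really can hold for some tests, here is a minimal Python sketch. It assumes a case that is my own choice rather than Fisher's: a two-tailed z-test of a zero mean with a known standard deviation, where the only statistic involved is the sample mean. The function names `z_test_p` and `exceed_fraction`, the sample size of 20 and the seed are all just illustrative.

```python
import random
from statistics import NormalDist, fmean

def z_test_p(sample, sigma=1.0):
    """Two-tailed P-value for H0: mean = 0, assuming sigma is known."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

def exceed_fraction(observed_mean, n, reruns, rng, sigma=1.0):
    """Fraction of re-runs under H0 whose |sample mean| is at least as large."""
    hits = 0
    for _ in range(reruns):
        rerun_mean = fmean(rng.gauss(0.0, sigma) for _ in range(n))
        hits += abs(rerun_mean) >= abs(observed_mean)
    return hits / reruns

rng = random.Random(1)   # fixed seed so the sketch is repeatable
n = 20                   # assumed sample size
sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # data generated under H0

p = z_test_p(sample)
frac = exceed_fraction(fmean(sample), n, reruns=20000, rng=rng)
print(f"P = {p:.3f}, fraction of more extreme re-runs = {frac:.3f}")
```

Here the two numbers should agree up to sampling noise, because nothing other than the sample mean enters the calculation of P.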
Suppose we are doing a linear regression on two Gaussian random variables. We take a model with $X$ and $Y$ as two random variables:

$$Y = \alpha + \beta X + \epsilon$$

The parameter $\alpha$ represents an offset which is more or less irrelevant for our purposes. The “slope” $\beta$ is the strength of the relation, and will be the subject of our hypothesis test. There is also some random deviation $\epsilon$ such that $Y$ is distributed conditionally (for any given value of $X$) as a Gaussian with standard deviation $\sigma$.

We do an experiment and obtain $N$ pairs of values, $(x_1, y_1), \ldots, (x_N, y_N)$. Presenting the regression in the same way as Fisher, let’s define three intermediate statistics:

$$A = \sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad B = \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}), \qquad C = \sum_{i=1}^{N} (y_i - \bar{y})^2$$

where $\bar{x}$ and $\bar{y}$ are the sample means. Say that the null hypothesis again says $\beta = 0$, i.e. the data are uncorrelated. A suitable estimator for $\beta$ is:

$$b = \frac{B}{A}$$

Of course the null hypothesis predicts $b = 0$ (in expectation), so a reasonable distance value is $|b|$. According to the “misinterpretation”, the P value is

$$P = \Pr\big(|\tilde{b}| \geq |b| \mid H_0\big)$$

where $\tilde{b}$ is the estimator statistic as a random variable (i.e. representing the values the estimator would take if you sampled over and over from a system where the two variables were uncorrelated).

But in fact, the usual way to test this null hypothesis is to do a t-test. This involves first getting the estimator for $\sigma$:

$$s = \sqrt{\frac{C - B^2/A}{N - 2}}$$

Then calculating the t-value:

$$t = \frac{b\sqrt{A}}{s}$$

And the P-value is the probability of the observed $t$ or one more extreme given that the null hypothesis is correct:

$$P = 2\,\big(1 - F_{N-2}(|t|)\big)$$

where $F_{N-2}$ is the cumulative distribution function of the t-distribution with $N - 2$ degrees of freedom. We multiply by 2 because we want a “two-tailed” test. But because $t$ is calculated using the actual value of $A$ (observed in the experiment), it is dependent on this actual value, even though it would not be constant across all experiments. That is, the $P$ value represents the probability that the test statistic would take a more extreme value under the null hypothesis:

$$P = \Pr\big(|\tilde{t}| \geq |t| \mid H_0\big)$$

where $\tilde{t}$ is the t-value as a random variable.
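To make the recipe concrete, here is a rough Python sketch of the t-test for the slope. It assumes Fisher's sums of squares and products (called `A`, `B` and `C` in the code), the slope estimate b = B/A, the residual estimate s² = (C − B²/A)/(N − 2) and t = b√A/s. The helper names, the seed and N = 20 are my own choices, and the Simpson's-rule integration is only a standard-library stand-in for a proper t-distribution routine.

```python
import math
import random

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed P-value for Student's t with df degrees of freedom.

    Integrates the t density over [0, |t|] with Simpson's rule (steps must
    be even); a stdlib stand-in for a library t-distribution CDF.
    """
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return norm * (1 + u * u / df) ** (-(df + 1) / 2)

    h = abs(t) / steps
    total = pdf(0.0) + pdf(abs(t))
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    central_mass = 2 * total * h / 3  # Pr(-|t| < T < |t|)
    return max(0.0, 1 - central_mass)

def slope_t_test(xs, ys):
    """Return (A, b, s, t, P) for the null hypothesis that the slope is zero."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    A = sum((x - x_bar) ** 2 for x in xs)                       # spread of x
    B = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # co-variation
    C = sum((y - y_bar) ** 2 for y in ys)                       # spread of y
    b = B / A                                  # slope estimate
    s = math.sqrt((C - B * B / A) / (n - 2))   # residual standard deviation
    t = b * math.sqrt(A) / s
    return A, b, s, t, t_two_tailed_p(t, n - 2)

rng = random.Random(0)  # fixed seed, illustrative only
xs = [rng.gauss(0.0, 1.0) for _ in range(20)]
ys = [rng.gauss(0.0, 1.0) for _ in range(20)]  # independent of xs, so H0 holds
A, b, s, t, p = slope_t_test(xs, ys)
print(f"A = {A:.2f}, b = {b:.3f}, s = {s:.3f}, t = {t:.3f}, P = {p:.3f}")
```

Note that `A` enters the t-value directly, which is the dependence the argument turns on.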
We can see how this is different from the probability that the data deviate more strongly from the null with a little computational experiment (Python code). First, generate N data points using a Gaussian random number generator (so X and Y are genuinely two independent random variables). Then calculate the slope estimate $b$ and the $P$ value as above. Then, if we re-run the process (generate lots, say 500, of sets of N uncorrelated data), the original (mis)interpretation of the P value says that P should be the proportion of the time that $b$ takes a larger absolute value in one of the re-runs than it did in the original experiment.
If we repeat this process a few times, we can plot the P values obtained against the proportion of the time that b took a greater value in one of the re-runs than it did in the original experiment:
If it was the case that the P value was the proportion of the time that the data would be observed to deviate more strongly from the prediction under the null hypothesis, the points should lie very close to the diagonal, but they don’t. I’ve also coloured the points blue if, in the original experiment, $A$ (the sum of squared deviations of the x values) was very low (less than 14) and red if it was very high (more than 27; the values were chosen empirically). Notice that the blue values tend to lie below the diagonal, and the red values above it, so you can see that the probability of observing more extreme data is dependent not only on the P value, but on the value of $A$ in the original experiment.
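The experiment can be sketched roughly as follows. This is not the original script, just a minimal standard-library version under my own assumptions: N = 20, standard Gaussian X and Y, 500 re-runs per original experiment, a fixed seed, and a Simpson's-rule helper standing in for a library t-distribution CDF. For each original experiment it prints the sum of squares `A`, the P value, the fraction of re-runs with a more extreme slope estimate, and the fraction with a more extreme t-value.

```python
import math
import random

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed Student's t P-value via Simpson's rule (steps must be even)."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return norm * (1 + u * u / df) ** (-(df + 1) / 2)

    h = abs(t) / steps
    total = pdf(0.0) + pdf(abs(t))
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 1 - 2 * total * h / 3)

def run_experiment(n, rng):
    """Draw n independent Gaussian (x, y) pairs; return (A, b, t)."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    A = sum((x - x_bar) ** 2 for x in xs)
    B = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    C = sum((y - y_bar) ** 2 for y in ys)
    b = B / A
    s = math.sqrt((C - B * B / A) / (n - 2))
    return A, b, b * math.sqrt(A) / s

rng = random.Random(7)  # fixed seed, illustrative only
n, reruns = 20, 500     # assumed sample size; 500 re-runs as in the text
results = []
for _ in range(8):      # eight "original" experiments
    A, b, t = run_experiment(n, rng)
    p = t_two_tailed_p(t, n - 2)
    rerun_stats = [run_experiment(n, rng) for _ in range(reruns)]
    # Naive reading: how often is the slope estimate more extreme?
    frac_b = sum(abs(rb) >= abs(b) for _, rb, _ in rerun_stats) / reruns
    # Test-statistic reading: how often is the t-value more extreme?
    frac_t = sum(abs(rt) >= abs(t) for _, _, rt in rerun_stats) / reruns
    results.append((A, p, frac_b, frac_t))
    print(f"A = {A:5.1f}  P = {p:.3f}  frac |b| = {frac_b:.3f}  frac |t| = {frac_t:.3f}")
```

If the sketch is faithful, the t-value fraction should track P up to sampling noise, while the slope fraction should tend to drift away from it depending on how small or large `A` happened to be in the original experiment.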
So to sum up, we have a few different kinds of statistics:
- The estimator for the relevant parameter – i.e. a sufficient statistic for the parameter of interest in the hypothesis
- Ancillary statistics – give information about parameters not relevant to the null hypothesis (e.g. the variance in the linear regression model)
- Test statistics – make use of the relevant statistic and potentially ancillary statistics.
So the P value is in fact conditional not only on the null hypothesis, but on the ancillary statistics:
$$P = \Pr(\text{data more extreme than observed} \mid H_0,\ \text{ancillary statistics as observed})$$
Fisher points out that of course this can’t reasonably be interpreted as any mechanical procedure of repeated sampling, since it would imply that you would have to select for those samples where the ancillary statistics were exactly the same as the ones you observed in the original test (which is never going to happen).
I’m not going to draw any grand conclusions from this. I’m not even quite sure what to make of it at the moment. However, it does seem to me like quite an important point if you want to understand what a P value is.


