## More wrong interpretations of P values – “repeated sampling”

A while ago I wrote a little rant on the (mis)interpretation of P-values. I’d like to return to the subject, having investigated a little more. In this first post, I’m going to point to an interesting little subtlety, pointed out by Fisher, that I hadn’t thought about before; in the second post, I will argue that P-values aren’t as bad as they are sometimes made out to be.

So, last time, I stressed the point that you can’t interpret a P-value as a probability or frequency of anything unless you add “given that the null hypothesis is true”. Most misinterpretations, e.g. “the probability that you would accept the null hypothesis if you tried the experiment again”, make this error. But there is one common interpretation that is less obviously false: “a P-value is the probability that, in another experiment, the data would deviate as or more strongly from the null hypothesis than they did in the current experiment, given that the null hypothesis is true”. You might think this is a more careful statement, but the problem is that when we calculate P-values we in fact take into account aspects of the data that are not necessarily related to how strongly they deviate from the prediction of the null hypothesis. This can be misleading, so we’ll build it up more precisely in this post.

To begin with, let’s set up some symbolic language. “The data” are taken to be a random variable $E$, and the particular instance of the data observed in the current experiment is $e$. This evidence will typically consist of a number of observations, e.g. a collection of real values $x_1, x_2$, etc. The null hypothesis $H_0$ will specify a value for some parameter of interest, call it $\beta$. For example:

$H_0 \implies \beta = 0$

Generally, in order to investigate this parameter, we find a “sufficient statistic”, that is, a function(al) that operates on the observed data $e$ to produce (usually) a single real number containing all the relevant information that allows the data to discriminate between different possible values of the parameter of interest. We’ll call this $R$; for example we might use $R(e) = \bar{x}$, the arithmetic mean of the data values produced by an experiment (if the parameter of interest is the expectation value of the process being investigated). This statistic has a “sampling distribution”: let $R(E)$ be the random variable representing the randomness of the statistic rather than of the data itself.

Now we need to say what “deviate as or more strongly” means. For this, assume there is a distance function $d$ which describes how far the statistic is from its expectation under the null hypothesis. So if the null hypothesis predicts a zero population mean or expectation, and $R(e)$ is the sample mean from the experiment, we could have:

$d(R(e)) = |R(e)|$

(i.e. the absolute value of the statistic). The null hypothesis says that the parameter is zero, so it implies that the statistic (in this case an estimator of the parameter) is also (expected to be) zero – we just take how far the statistic is from zero as a measure of deviation. Then, we can encode the above interpretation of the P-value symbolically:

$\Pr( d(R(E)) \ge d(R(e)) | H_0 )$

“The probability that the distance of the statistic from the prediction is greater than the distance observed, given that the null hypothesis is true”.

By the law of large numbers, this could be read as the relative frequency with which the statistic would take a more extreme value than the one observed if we repeat the experiment many times on a process where $H_0$ applies. This might look fairly reasonable, so what is wrong? In fact, for many tests, it might be right, but I’ll borrow an example from Fisher of a case where it’s not.
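For the simple sample-mean example, this repeated-sampling reading can be checked directly by simulation. A minimal sketch, assuming $H_0$ says the data are i.i.d. standard Gaussian (so the sampling distribution of the mean is fully known); the sample size and simulation count are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 20

# One observed experiment e; here the data really are generated under H0,
# so we know the "true" state of affairs.
e = rng.standard_normal(N)

def R(data):
    """The statistic: the arithmetic mean of the data values."""
    return data.mean(axis=-1)

def d(r):
    """Distance from the null prediction (an expectation of zero)."""
    return np.abs(r)

# Monte Carlo estimate of Pr( d(R(E)) >= d(R(e)) | H0 ): simulate many
# experiments under H0 and count how often the statistic deviates at
# least as strongly as it did in the observed experiment.
sims = rng.standard_normal((100_000, N))
p_mc = np.mean(d(R(sims)) >= d(R(e)))

# Exact value for comparison: the mean of N standard Gaussians is
# N(0, 1/N), so the two-tailed probability is 2*(1 - Phi(|mean|*sqrt(N))).
p_exact = 2 * stats.norm.sf(d(R(e)) * np.sqrt(N))
```

Here the Monte Carlo proportion agrees with the exact tail probability, which is the sense in which the repeated-sampling reading works for this test.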

Suppose we are doing a linear regression on two Gaussian random variables. We take a model with $Y$ and $X$ as two random variables:

$y = \alpha + \beta x + \varepsilon$

The parameter $\alpha$ represents an offset which is more or less irrelevant for our purposes. The “slope” $\beta$ is the strength of the relation, and will be the subject of our hypothesis test. There is also some random deviation $\varepsilon$ such that $Y$ is distributed conditionally (for any given value of $x$) as a Gaussian with standard deviation $\sigma$.

We do an experiment and obtain $N$ pairs of values, $e = ( (x_1,y_1),(x_2,y_2),...(x_N,y_N) )$. Presenting the regression in the same way as Fisher, let’s define three intermediate statistics:

$C_x = \sum (x_i-\bar{x})^2$

$C_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y})$

$C_y = \sum (y_i-\bar{y})^2$

Say that the null hypothesis says $\beta = 0$ again – the data are uncorrelated. A suitable estimator for $\beta$ is:

$R(e) = b = C_{xy}/C_x$

Of course, the null hypothesis implies that $b$ is expected to be zero, so a reasonable distance value is $|b|$. According to the “misinterpretation”, the P-value is

$\Pr( |B| \ge |b| | H_0)$

Where $B$ is the estimator statistic as a random variable (i.e. representing the values the estimator would take if you sampled over and over from a system where the two variables were uncorrelated).

But in fact, the usual way to test this null hypothesis is to do a t-test. This involves first getting the estimator for $\sigma^2$:

$s^2 = (C_y-C_{xy}^2/C_x) / (N-2)$

Then calculating the t-value:

$t = b \sqrt{C_x}/s$

And the P-value is the probability of observing the $t$-value obtained, or one more extreme, given that the null hypothesis is correct:

$P = 2 (1 - F_t(|t|;N-2))$

Where $F_t(x;\lambda)$ is the cumulative distribution function of the t-distribution with $\lambda$ degrees of freedom. We multiply by 2 because we want a “two-tailed” test. But because $t$ is calculated using the actual value of $C_x$ observed in the experiment, it depends on this value, even though $C_x$ would not be constant across repeated experiments. That is, the $P$ value represents the probability that the test statistic would take a more extreme value under the null hypothesis:

$P = \Pr( |T| \ge |t| | H_0 )$
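The whole Fisherian calculation can be written out directly. A sketch, following the formulas above (the data here are arbitrary, and `scipy.stats.t.sf` gives the upper tail $1 - F_t$):

```python
import numpy as np
from scipy import stats

def regression_p_value(x, y):
    """Two-tailed t-test of H0: beta = 0, using the formulas above."""
    N = len(x)
    Cx = np.sum((x - x.mean()) ** 2)
    Cxy = np.sum((x - x.mean()) * (y - y.mean()))
    Cy = np.sum((y - y.mean()) ** 2)
    b = Cxy / Cx                          # slope estimator
    s2 = (Cy - Cxy ** 2 / Cx) / (N - 2)   # estimator of sigma^2
    t = b * np.sqrt(Cx) / np.sqrt(s2)     # t-value
    P = 2 * stats.t.sf(abs(t), df=N - 2)  # 2 * (1 - F_t(|t|; N-2))
    return b, t, P

rng = np.random.default_rng(1)
x, y = rng.standard_normal(30), rng.standard_normal(30)
b, t, P = regression_p_value(x, y)
```

The result should agree with `scipy.stats.linregress(x, y).pvalue`, which performs the same two-tailed test of a zero slope.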

We can see how this differs from the probability that the data deviate more strongly from the null with a little computational experiment (Python code). First, generate $N$ data points using a Gaussian random number generator (so X and Y are genuinely independent random variables). Then calculate the $b$ and $P$ values as above. Then, if we re-run the process (generate many, say 500, sets of $N$ uncorrelated data points), the original (mis)interpretation of the P-value says that $P$ should be the proportion of the time that $b$ takes a larger absolute value in one of the re-runs than it did in the original experiment.
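A sketch of that experiment, using the formulas above (the sample size, number of re-runs, and seed are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, RERUNS = 10, 500

def slope_Cx_P(x, y):
    """Slope estimate b, the ancillary C_x, and the t-test P-value."""
    Cx = np.sum((x - x.mean()) ** 2)
    Cxy = np.sum((x - x.mean()) * (y - y.mean()))
    Cy = np.sum((y - y.mean()) ** 2)
    b = Cxy / Cx
    s2 = (Cy - Cxy ** 2 / Cx) / (N - 2)
    t = b * np.sqrt(Cx) / np.sqrt(s2)
    return b, Cx, 2 * stats.t.sf(abs(t), df=N - 2)

# Original experiment: X and Y genuinely independent, so H0 is true.
x, y = rng.standard_normal(N), rng.standard_normal(N)
b0, Cx0, P = slope_Cx_P(x, y)

# Re-run the whole process under H0 and count how often the slope
# estimate comes out more extreme than in the original experiment.
more_extreme = sum(
    abs(slope_Cx_P(rng.standard_normal(N), rng.standard_normal(N))[0]) >= abs(b0)
    for _ in range(RERUNS)
)
proportion = more_extreme / RERUNS

# Under the "misinterpretation", proportion should be close to P;
# repeating the whole procedure and plotting (P, proportion) pairs
# shows that it is not.
```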

If we repeat this process a few times, we can plot the P values obtained against the proportion of the time that b took a greater value in one of the re-runs than it did in the original experiment:

If it were the case that the P-value was the proportion of the time that the data would be observed to deviate more strongly from the prediction under the null hypothesis, the points should lie very close to the diagonal, but they don’t. I’ve also coloured the points blue if, in the original experiment, $C_x$ was very low (less than 14), and red if it was very high (more than 27; the values were chosen empirically). Notice that the blue values tend to lie below the diagonal, and the red values above it, so you can see that the probability of observing more extreme data depends not only on the P-value, but on the value of $C_x$ in the original experiment.

So to sum up, we have a few different kinds of statistics:

• The estimator for the relevant parameter – i.e. a sufficient statistic for the parameter of interest in the hypothesis
• Ancillary statistics – give information about parameters not relevant to the null hypothesis (e.g. the variance in the linear regression model)
• Test statistics – make use of the relevant statistic and potentially ancillary statistics.

So the P value is in fact conditional not only on the null hypothesis, but on the ancillary statistics:

P = Pr( data more extreme than observed | $H_0$, ancillary statistics as observed )

Fisher points out that of course this can’t reasonably be interpreted as any mechanical procedure of repeated sampling, since it would imply that you would have to select for those samples where the ancillary statistics were exactly the same as the one you observed in the original test (which is never going to happen).

I’m not going to draw any grand conclusions from this. I’m not even quite sure what to make of it at the moment. However, it does seem to me like quite an important point if you want to understand what a P value is.

### 14 Comments to “More wrong interpretations of P values – “repeated sampling””

1. This looks like an interesting observation, and I hope to find time to think carefully about it at some point soon. But in the meantime I have a naïve question: why can’t I just say that the $t$-value is the relevant measure of deviation in your example, rather than the $b$-value? After all, the $t$-value is something that you can actually calculate from the data, whereas the $b$-value is not, since you’d have to know the population values of the ancillary statistics in order to do so. (Right?)

Then “the probability that the data would deviate as or more strongly from the null hypothesis in another experiment, than they did in the current experiment, given that the null hypothesis is true” would still be a reasonable interpretation, it’s just that you’d have to be careful to be clear about what measure of “deviation” is involved.

2. Yes, as far as I can tell, if you are careful to ensure you are talking about the test statistic when you say deviation, that should be ok (because that’s what the test statistic is for). The problem is that the test statistic accounts for things that aren’t affected by the truth/falsity of the null hypothesis, but do affect how strongly the data support / conflict with the hypothesis.

An analogy would be where you are looking at two sets of data with the same regression slope, but one is more noisy (spread away from the diagonal) than the other; you would be less confident about a conclusion from the noisier graph than from the less noisy one.

In the graph I put up, the red dots are high overall variance for the X variable ($C_x$ higher than average), and blue dots are low variance in X. X is the “independent variable” in the regression, so when you have a wider set of values for X and a given regression slope b, you should get a lower variance of Y conditional on X (the Y points will be closer to the line, I think). (I.e. $b=C_{xy}/C_x$, so to keep b fixed as the marginal variance $C_x$ goes up, the covariance $C_{xy}$ must also go up, so the X and Y pairs must be deviating from their means in the same way.)

That would explain why for a given b value, blue dots get a higher P value than red dots – in spite of having the same slope, they don’t contradict the null hypothesis of zero “real” slope as strongly.

(Points should have roughly the same b value in the initial experiment if they have the same latitude on my graph)

I need to check that, but I think that’s the explanation!

• That all seems completely reasonable and intuitive to me – I guess my way of thinking about this just makes it hard to see what the problem is. To me, a p-value is just p( some function of the data = 1 | H_0 ), where “some function of the data” might be “1 if the test statistic is greater than 4.1, 0 if it isn’t.”

If you have a prior p( H_i ) (that is, probabilities for many possible hypotheses including H_0), you can use Bayes’ theorem to calculate p( H_0 | the function of the data = 1 ) = p-value * p(H_0)/sum_i ( p( the function of the data = 1 | H_i )). So the posterior probability of the hypothesis is proportional to the p-value, with the constant of proportionality depending on the prior. But usually you don’t have such a prior, at least not in a form that can easily be written down, so you quote the p-value. That way the reader can assess the posterior for herself according to her own prior, if she has one. So p-values are useful because they tell you something about the probability of H_0 given the value of the test statistic, even though they don’t tell you the probability itself. Or at least, that would be the explanation if classical statistics were done properly.

So from my point of view it just sort of sounds like Fisher’s pointing out that the p-value isn’t equal to p( some other thing that’s not a function of the data = 1 | H_0), and trying to make it sound like it’s a problem. But as I said it does sound like there’s something interesting in this all the same, and I’ll think about it a bit more.

3. … and thus information geometry is motivated!

Nathaniel: Were this Bayesian, it would be very similar to the two envelope paradox. With the two envelope paradox, the discrimination between the “change envelope” and “don’t change envelope” is rendered incalculable by the inclusion of a ‘poorly defined’ ancillary statistic (the sufficient statistic being [amount in chosen envelope > amount in other envelope]) – in this case the contrast (function) between the two cases is totally broken; but for other less pathological cases there is still a problem – how do we quantify the difference between two values of a statistic associated with hypotheses (or, at least, two samples taken to be representative of one). Often, this is the same as asking: is the KL divergence between the things we care about (the hypotheses) related nicely to the KL divergence of the things we measure. If we can use ancillary statistics, then we make the KL divergences between the hypotheses a projection of the KL divergence between a more general probability distribution that includes the ancillary statistics. The problem is that we do not know the more general distribution including the ancillary statistics, and even if we did, there would be yet another set of ancillary statistics for that, which we then do not know the distribution of, and then for those too, etc. etc.

4. Of course, the reason that the two envelope paradox is so weird, is that the ancillary statistic does not seem to have any effect on the contrast other than when it fails to be defined nicely itself.

5. I think you can’t use the t or p value in a “Bayesian” way, possibly for the reasons Lucas is saying, but for another couple of ways of looking at it:

1. If you report t or P, you are “throwing away information”. Note that the hypothesis only talks about beta, and thus doesn’t predict anything about the variance of the process. You can see that t is calculated from b (which is affected by the hypothesis) but also from C_x and s, which are not constrained by anything. That means that any given t could correspond to an arbitrary b so long as they have the same sign (C_x and s are always positive), so reporting t alone loses all the information about b apart from its sign, unless you also report the ancillaries. In calculating the P value, I used the absolute value of t, so in fact you don’t even have the sign information from that! (Of course, losing the sign information is intended; that’s the meaning of a two-tailed test.)

2. Again, since the hypothesis only concerns beta, I suspect

P(b,C_x|H_i)

Is not defined, since H_i doesn’t constrain the probabilities enough. Moreover, say there are two universes and in one you get the result:

P(b=b0, C_x=c0|H_i)

and in the other you get

P(b=b0, C_x=c1|H_i)

With c0 != c1. Since the hypothesis H_i says nothing about C_x, even if these probabilities were defined, they would surely be exactly the same, therefore by the “Strong Likelihood Principle”, a Bayesian must make exactly the same inference in either universe, but as we’ve discussed the Fisherian gets a different p value in the two universes.

3. I might have been a bit wrong to say

P(|T|>|t| | H_i)

is a reasonable way of putting it. Since the t-values depend on the ancillaries, you probably have to make the further assumption that the initial experiment gave a particular value of the ancillaries before this statement is correct, otherwise the asymptotic distribution of the t statistic isn’t what you think it is when you calculate the P value.

4. I think to do a Bayesian analysis you therefore have to also specify a prior distribution for $\sigma$ or something equivalent, so your hypotheses H_i would be two-dimensional. For an actual Bayesian that should be fine, but for a Fisherian, it isn’t.

5. Another related problem, I think, to Lucas’s point is that because frequentists don’t have the “Strong Likelihood Principle”, you have to be more careful about how you define distances to be relevant. For example, you can easily create absurdities like saying “tossing five heads in a row on a fair coin has probability < 0.05, therefore in order to test that the earth is round, I will throw a fair coin five times, and if it comes up heads every time, conclude that the earth is not round”. For a Bayesian, the likelihood of five heads is the same whether the earth is round or not, so you don’t make an inference. For a Fisherian, you have to insist that the prediction of the hypothesis is related to the statistic you base your test on. Exactly how this works is a bit weird given the above, but essentially, because as I say the t statistic is underconstrained by the hypothesis, you can’t define distance in terms of the t statistic alone; you have to define the distance on the estimator of the relevant statistic, then if necessary calculate the P value using ancillaries.

p(something not a function of the data=1|H_0) is essentially what this problem is, but the thing that's not a function of the data is the t statistic (or rather, t is only a function of the data if you constrain some other things), the b and C_x are genuinely functions of the data.

• I think I must be misunderstanding how the Fisherian calculation is performed. In your post it looked to me as if C_x and C_xy were functions of the data, b and s were functions of C_x and C_xy, and t is a function of C_x, b and s — therefore t is a function of the data. Where did I go wrong?

6. I want to say in regards to 1, when I say you are “throwing away information”, I only mean if you try to use the t statistic in a Bayesian way – the Fisherian takes it as read that the t statistic was calculated using the ancillaries, because they regard that as the “correct” way (Fisher’s word), so you can make sense of it, but for a Bayesian, the likelihood function would need to be defined without adding any (from a Bayesian point of view) “arbitrary” assumptions. I think that’s it anyway…

7. And on 2. again: P(b,C_x|H_i) quite possibly does have an analytic statement in terms of beta and sigma, but you have to know beta and sigma from the hypothesis if you want a numeric value for this (and you don’t) which is what I mean by “not defined”.

8. Sorry, yes, t *is* a function of the data in a sense; it is calculated in the way you say. The problem is more that the asymptotic distribution of it (which you need to work out the P-value) isn’t defined by the null hypothesis, whereas the distribution of b alone is (I think).

At least, for a given hypothesis which specifies $\beta=\beta_0$, you know that $E[b]=\beta_0$. But you don’t know anything about the distribution of the t statistic unless you also take into account some assumed value of the ancillary.

9. Oh no, you don’t actually know the sampling variance of b unless you know C_x (from the paper).

• Right, but C_x is a function of the data – it’s the sample variance, not the population variance. Right?

• Yes, that’s right. I’m not saying it’s not weird. (You also need to know the population variance to know the exact distribution of b.) I need to think about this some more.

10. Ah, OK, I think I have a bit more of a handle on this now. Let me try to express it in my own language:

What I was saying before was slightly wrong, I think. I was saying that, essentially, we have a hypothesis H (doesn’t matter what it is, for now) and some data d. We calculate some (typically lossy) function of the data F(d) (doesn’t matter what that function is for now, or how it’s interpreted) and see if it passes a threshold, i.e. we calculate whether F(d)>f for some f. If it does, we then calculate p(F(D)>=f | H) (I’m going to call this “the likelihood”, even though I’m aware that that phrase is usually reserved for just p(d|H)) and quote that.

This would be a fine way of working, I think. As long as the readers understand the distinction between likelihoods and posteriors, they can calculate their own posteriors using their own priors. So quoting likelihoods removes the responsibility of the author to argue for the use of a particular prior, and seems like an all round Good Thing.

However, after a bit of thought, this isn’t how classical hypothesis testing works after all. In your regression example, the hypothesis H_0 is *just* that beta=0. The problem is that in order to calculate p(F(D)>=f | H_0) you need to know the relevant aspects of the conditional distribution p(D|H_0), but just knowing beta=0 isn’t enough to give you a probability distribution for the data. For that you also need to know the variances (i.e. the ancillary statistics) in addition to H_0.

If you were doing it the Bayesian way you would need a prior distribution A over the ancillary statistics, and the likelihood you’d quote would be p(F(D)>=f | H_0,A), to be interpreted as “the probability of getting data at least as extreme as the actual data, given the null hypothesis beta=0 *and* the additional assumptions we’ve made about the ancillary statistics”. Instead of doing this, classical stats tries to “estimate” the ancillaries without making any assumptions about them. From the Bayesian perspective this is actually impossible, and I suspect that the classical way does actually correspond to doing it the Bayesian way, just with some particular prior. So they are actually quoting p(F(D)>=f | H_0,A) after all, it’s just that the extra assumptions ‘A’ are very implicit and a bit obscure.