## Wednesday, May 23, 2012

Earlier, in 'The insignificance of significance tests', I wrote about the silliness of the P-value, one of the main metrics scientists use to determine whether or not they have made a discovery. I explained my view that a far better description of our state of knowledge is the posterior probability. C.R. Weinberg, however, presumably does not agree with all my arguments, having written a commentary some years ago with the title 'It's time to rehabilitate the P-value.' [1] Weinberg acknowledges the usefulness of the posterior probability in this article, and I agree with various other parts of it, but a major part of her argument in favour of P-values is the following statement, from the start of the article's second paragraph:
'The P-value quantifies the discrepancy between a given data set and the null hypothesis,'
Let's take a moment to consider whether or not this is true.

Firstly, let's note that the word 'discrepancy' has a precise meaning and an imprecise one. The discrepancy between 5 and 7 is 2, and has a precise formal meaning. What is the discrepancy, though, between an apple and an orange? When we say something like 'there is a discrepancy between what he said and what he did,' this makes use of the general, imprecise meaning of 'disagreement'. What is the discrepancy between a data set and a hypothesis? These are different kinds of thing and cannot be measured against one another, so there is clearly no exact formal meaning of the discrepancy in this case. It is a pity, then, that the main case made here for the use of P-values is based on an informal concept, rather than the kind of exact idea that scientists normally prefer.

So it is clear, then, that the discrepancy between a data set and the null hypothesis has no useful, clearly definable scientific meaning (and, therefore, cannot be quantified, as was the claim). Let's examine, though, what the term 'discrepancy' is trying to convey in this case. Presumably, we are talking about the extent to which the data set alters our belief in the null hypothesis - the degree of inconsistency between the null hypothesis and the observed facts. Surely, we are not talking about any effects on our belief in the data, as the data is (to a fair approximation [2]) an empirical fact. Anyway, belief in the data was not the issue that the experiment was performed to resolve, right? The truth or falsity of a particular data set is a very trivial thing, is it not, next to the truth or falsity of some general principle?

But the P-value does not quantify our rational belief in any hypothesis. This is impossible. For starters, as I hinted at in that earlier article, by performing the tail integrals required to calculate P-values, one commits an unnecessary and gratuitous violation of the likelihood principle. This simple (and obvious) principle states that when evaluating rational changes to our belief, one uses only the data that are available. By evaluating tail integrals, one effectively integrates over a host of hypothetical data sets that have never actually been observed.

Even if we eliminate the tail integration, however, and calculate only P(D | H0), we are still not much closer to evaluating the information content of our data, D, with respect to H0. To do this, we need to specify a set of hypotheses against which H0 is to be tested, and obtain the direct probability, P(D | H), for all hypotheses involved. To think that you can test a hypothesis without considering alternatives is, as I have mentioned once or twice in other articles, the base-rate fallacy.
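To make the distinction concrete, here is a minimal Python sketch. Everything specific in it is a hypothetical illustration rather than anything from Weinberg's article: the data (8 heads in 10 coin tosses), the null hypothesis H0 (a fair coin), the single alternative H1 (heads probability 0.8), and the function names. It computes the one-sided tail P-value, which sums over outcomes that were never observed, and then the direct probabilities P(D | H0) and P(D | H1) for the data actually seen.

```python
from math import comb


def binomial_likelihood(k, n, p):
    """P(exactly k successes in n trials | success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail_p_value(k, n, p):
    """One-sided P-value: probability of k OR MORE successes under p.

    The terms with j > k are hypothetical data sets that were never observed.
    """
    return sum(binomial_likelihood(j, n, p) for j in range(k, n + 1))


# Hypothetical data, assumed purely for illustration.
n_tosses, n_heads = 10, 8

p_value = upper_tail_p_value(n_heads, n_tosses, 0.5)   # tail integral under H0
like_h0 = binomial_likelihood(n_heads, n_tosses, 0.5)  # P(D | H0)
like_h1 = binomial_likelihood(n_heads, n_tosses, 0.8)  # P(D | H1), assumed alternative

print(f"tail P-value under H0: {p_value:.4f}")
print(f"P(D | H0):             {like_h0:.4f}")
print(f"P(D | H1):             {like_h1:.4f}")
print(f"likelihood ratio H1:H0 = {like_h1 / like_h0:.2f}")
```

Neither the P-value nor P(D | H0) on its own says how far the data should shift our belief. Only the comparison against a stated alternative does, and a different choice of H1 would give a different ratio.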

One of my favorite illustrations of this basic principle, that the probability for a hypothesis has no meaning when divorced from the context of a set of competing hypotheses, is something called the raven paradox. Devised in 1945 by C.G. Hempel, this supposed paradox (in truth, there seem to be no real paradoxes in probability theory) makes use of two simple premises:

(1) An instance of a hypothesis is evidence for that hypothesis.

(2) The statement 'all ravens are black' is logically equivalent to the statement 'everything that is not black is not a raven.'

Taking these two together, it follows that observation of a blue teapot constitutes further evidence for the hypothesis that all ravens are black. While it seems to most that the colour of an observed teapot can convey little information about ravens, this conclusion is claimed to be logically deduced. This paradox got several philosophers tied up in knots, but it took a mathematician and computer scientist, I.J. Good (he worked with Turing both at Bletchley Park and in Manchester, making him one of the world's first ever computer scientists) to point out its resolution.

The obvious solution is that premise number (1) is just plain wrong. You cannot evaluate evidence for a hypothesis by considering only that hypothesis. It is necessary to specify all the premises you wish to consider, and specify them in a quantifiable way. Quantifiable, here, means in a manner that allows P(D | H) to be evaluated. To drive this point home, Good provided a thought experiment (a sanity check, as I like to call it) where, given the provided background information, we are forced to conclude that observation of a black raven lowers the expectation that all ravens are black. [3]

Good imagined that we have very strong reasons to suppose that we live in one of only two possible classes of universe, U1 and U2, defined as follows:
U1 ≡ Exactly 100 ravens exist, all black. There are 1 million other birds.
U2 ≡ 1,000 black ravens exist, as well as 1 white raven, and 1 million other birds.
Now we can see that if a bird is selected randomly from the population of all birds, and found to be a black raven, then we are most probably in U2. From Bayes' theorem:

$$
P(U_1 \mid D I) = \frac{P(U_1 \mid I)\, P(D \mid U_1 I)}{P(U_1 \mid I)\, P(D \mid U_1 I) + P(U_2 \mid I)\, P(D \mid U_2 I)}
$$

No information has been supplied concerning the relative likelihoods of these universes, so symmetry requires us to give them equal prior probabilities, which then cancel, leaving

$$
P(U_1 \mid D I) \approx \frac{10^{-4}}{10^{-4} + 10^{-3}} = \frac{1}{11}
$$

which makes it about 10 times more likely that we are in U2, with some non-black ravens.
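The arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation: the function name is my own, and I am assuming that '1 million other birds' means birds that are not ravens, so that the white raven in U2 is counted separately. Exact fractions are used so that the rounding to powers of ten is not hidden.

```python
from fractions import Fraction


def posterior_u1(prior_u1=Fraction(1, 2)):
    """Posterior probability of U1 after drawing one black raven at random.

    Assumes 'other birds' means non-raven birds (an interpretation of the text).
    """
    other_birds = 1_000_000

    # U1: 100 ravens, all black.
    like_u1 = Fraction(100, 100 + other_birds)

    # U2: 1,000 black ravens plus 1 white raven.
    like_u2 = Fraction(1_000, 1_000 + 1 + other_birds)

    prior_u2 = 1 - prior_u1
    numerator = prior_u1 * like_u1
    return numerator / (numerator + prior_u2 * like_u2)


p_u1 = posterior_u1()
odds_u2_to_u1 = (1 - p_u1) / p_u1

print(f"P(U1 | D, I) = {float(p_u1):.4f}")
print(f"odds U2 : U1 = {float(odds_u2_to_u1):.2f}")
```

Passing a different `prior_u1` shows how strongly the conclusion depends on the background information, which is the whole point of Good's example.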

[1] C.R. Weinberg, 'It's time to rehabilitate the P-value,' Epidemiology, May 2001, Vol. 12, No. 3 (p. 288)

[2] We can imagine the following conversation between a PhD student and his supervisor:

Supervisor:
This new data is fantastic! And so different to anything we have seen before. Are you sure everything was normal in the lab when you measured these?
Student:
Well, I was a little bit drunk that day....

But this kind of thing is rare, right? Come on, back me up here guys.

[3] I.J. Good, 'The White Shoe is a Red Herring,' British Journal for the Philosophy of Science, 1967, Vol. 17, No. 4 (p. 322) (I haven't read this paper (paywall again), but I have it from several reliable sources that this is where the thought experiment, or an equivalent version of it, appears.)