Sunday, May 6, 2012

The Insignificance of Significance Tests



In an earlier post on the base-rate fallacy, I made use of two important terms, ‘false-positive rate’ and ‘false-negative rate,’ without taking much time to explain them. These are concepts we need to be careful with, because, simple though they are, they have been given terrible names.

Let's start with the false-positive rate for a test. This could mean any one of a number of things:

(1) it could be the expected proportion of results produced by the test that are both false and positive 
(2) it could be the proportion of positive results that are false 
(3) it could be the proportion of false results that are positive

So which one is it? Are you ready?

None of the above.

The false-positive rate is the proportion of negative cases that are registered as positive by the test. This definition is widespread and agrees with the Wikipedia article ‘Type I and Type II Errors’. For a diagnostic test for some disease, it is the fraction of healthy people who will be told that they have the disease (or be referred for further diagnosis).

My claim that the term is confusing, however, is supported by looking at another Wikipedia article, ‘False Positive Rate,’ which provides the definition: “the probability of falsely rejecting the null hypothesis.” The null hypothesis is the proposition that there is no effect to measure – what I have called a negative case. The probability of falsely rejecting the null hypothesis, therefore, depends on the probability that there is no effect to measure, while the proportion of negative cases that are registered as positive does not. The alternate definition in this second Wikipedia article is the same as my number (1), above.

Number (2) on the list above is the posterior probability that the null hypothesis is true, given that the test has indicated it to be false. It is obtained using Bayes’ theorem. Confusion between this posterior probability and the false-positive rate is the base-rate fallacy, yet again. Unfortunately, there is little about the term ‘false-positive rate’ to steer one away from these misconceptions.
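To make the distinction concrete, here is a minimal sketch (in Python, with made-up numbers for a hypothetical diagnostic test – none of these figures come from a real test) of how the posterior probability differs from the false-positive rate:

```python
# Minimal sketch: the posterior probability that a positive result is false
# is NOT the false-positive rate. All numbers here are illustrative.

def prob_null_given_positive(prior_null, fpr, fnr):
    """P(no effect | positive result), via Bayes' theorem.

    prior_null : prior probability that there is no effect (e.g. healthy)
    fpr        : P(positive | no effect) -- the false-positive rate
    fnr        : P(negative | effect)    -- the false-negative rate
    """
    numerator = prior_null * fpr                                # negatives flagged positive
    denominator = numerator + (1.0 - prior_null) * (1.0 - fnr)  # all positive results
    return numerator / denominator

# Hypothetical test: 1% of people have the disease, fpr = 5%, fnr = 10%.
print(prob_null_given_positive(prior_null=0.99, fpr=0.05, fnr=0.10))
# ~0.85 -- most positive results are false, even though the fpr is only 0.05.
```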

What makes the situation much worse is that most scientific research is assessed using the false-positive rate, while what we should really be interested in is assigning a posterior probability to the proposition under investigation.

The false-positive rate is often denoted by the Greek letter α. In classical significance testing, α is used to define a significance level – or rather the significance level defines α. The significance level is chosen such that the probability that a negative case triggers the alarm is α. A negative case, once again, is an instance where the null hypothesis is true, and there is no effect. For example, if two groups are being treated for some affliction, one group with an experimental drug and the other with a placebo, the null hypothesis might be that both groups recover at the same rate, the new treatment having no specific impact on the disease. If the null hypothesis is correct, then in an ideal measurement, there will be no difference between the recovery times of the two groups.

But measurements are not ideal: there is ‘noise’ – random fluctuations in the system under study that produce a non-zero difference between the groups. If we can assign a probability distribution for this noise, however, we can define limits which, if exceeded by the measurement, suggest that the null hypothesis is false. If x is the measured difference in recovery time for the two groups, then there are two points, ±x_α/2, in the tails of the distribution, such that the integral of each tail up to its limit is α/2. The total probability contained in these two tail areas, therefore, sums to α. The x_α/2 points are chosen so that α is equal to some desired significance level, some acceptably low false-positive rate. (We cannot make the false-positive rate too small, because then we would too often fail to spot a real effect – the false-negative rate would be too high.) The integration is performed on each side of the error distribution, as we have no certainty in which direction the alternate hypothesis operates: the recovery time with the new drug might be worse than with no treatment, which could still lead to rejection of the null hypothesis.
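As a small illustration of this recipe (assuming, purely for the sake of example, Gaussian noise with a known scale σ – an assumption not made explicitly above), the x_α/2 points come straight from the inverse of the noise distribution:

```python
# Sketch: two-tailed critical values under an assumed Gaussian noise model.
from scipy.stats import norm

alpha = 0.05   # desired false-positive rate
sigma = 1.0    # illustrative noise scale for the measured difference x

# Each tail beyond +/- x_crit contains alpha/2 of the probability.
x_crit = norm.ppf(1.0 - alpha / 2.0, loc=0.0, scale=sigma)
print(x_crit)  # ~1.96 * sigma; |x| beyond this is declared 'significant'
```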


The null hypothesis predicts a value of x at the peak of the curve, but measurement noise leads to a probability distribution for the experiment as plotted. The x_α/2 points are chosen so that the tail areas on either side (the shaded regions) sum to 5%, or whatever the desired significance level is. In classical hypothesis testing, any measured x further from the centre than either of these points is considered statistically significant.

A common value chosen for α is 0.05. That is, if the null hypothesis is true, then the measured value of x will fall beyond one of the x_α/2 points about one time in twenty. This is the basis of how results are reported in probably most of science – if x falls outside these limits, then the null hypothesis is rejected, and the result is classified as statistically significant and considered a finding. If the measured value of x is between the two x_α/2 points, then the null hypothesis is not rejected, and the outcome of the study is probably never reported (a fact that contributes hugely to the problem of publication bias in the scientific literature). This system for interpreting scientific data and reporting outcomes of research is, let's be honest, a travesty.
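A quick simulation (my own illustration, again assuming Gaussian noise) shows the ‘one time in twenty’ behaviour directly: generate many experiments in which the null hypothesis is true and count how often the threshold is crossed.

```python
# Sketch: how often a true null hypothesis is declared 'significant' at alpha = 0.05.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha = 0.05
sigma = 1.0
x_crit = norm.ppf(1.0 - alpha / 2.0, scale=sigma)

# 100,000 null experiments: the measured difference x is pure noise.
x = rng.normal(loc=0.0, scale=sigma, size=100_000)
print(np.mean(np.abs(x) > x_crit))   # ~0.05, i.e. about one 'finding' in twenty
```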

Firstly, the whole rationale of the significance test seems to me to be deeply flawed. The idea is that there is some cutoff, beyond which we can declare the matter final: ‘Yes, we have a finding here. Grand. Nothing more to do on that question.’ How manifestly absurd! Such a binary declaration about a hypothesis, necessary if one wishes to take real-world action based on scientific research, needs to be based on decision theory, which combines probability theory with some appropriate loss function (something that specifies the cost of making a wrong decision). But the declarations arising in decision theory are not of the form ‘A is true,’ but rather ‘we must act, and rationality demands that we act as if A is true.’ Using some standard α-level is about the crudest substitute for decision theory you could imagine.

Why should our test be biased so much in favour of the null hypothesis, anyway? The alternate hypothesis, HA, almost always represents an infinity of hypotheses about the magnitude of the possible non-null effect, so a truly 'unbiased' starting point would seem to be one that de-emphasizes H0. Remember that at the core of the frequentist philosophy (not the approach I wish to promote) is the dictum "let the data speak for themselves": don't let prior impressions taint the measurement.

Secondly, when a finding is reported with a false-positive probability of 0.05, there appears to be a feeling of satisfaction among the scientific community that 0.05 is the probability that the positive finding is false. But meta-analyses regularly do their best to dispel this myth. For example, looking only at genetic association studies, Ioannidis et al. [1] reported that 'results of the first study correlate only modestly with subsequent research on the same association,' while Hirschorn et al. [2] write that 'of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated.'

Thinking again about how we calculated the positive predictive value of a diagnostic test, the probability that a positive finding is false is determined not only by the false-positive rate, α, but also by the false-negative rate and the prior probability. The false-negative rate, by analogy with the false-positive rate, is the proportion of real effects that will be registered as null findings. It depends on the number of samples taken in the study and on the magnitude of the effect. In the medical-trial example, if the new drug helps people recover twice as fast, then there will be fewer false negatives than if the difference is only 10%, and a trial with 100 patients in each group will give a more powerful test than a trial with only 10 in each group.
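To make the dependence on effect size and sample size concrete, here is a rough sketch of the false-negative rate for a two-group comparison, using a simple normal approximation; the effect sizes and group sizes below are illustrative only, not taken from any real trial.

```python
# Sketch: approximate false-negative rate (beta) for a two-sided, two-sample
# comparison of means, under a normal approximation. Illustrative values only.
from scipy.stats import norm

def false_negative_rate(effect, sigma, n_per_group, alpha=0.05):
    """Approximate P(test misses the effect) for a true difference 'effect'."""
    se = sigma * (2.0 / n_per_group) ** 0.5   # standard error of the difference
    z_crit = norm.ppf(1.0 - alpha / 2.0)
    # Probability the measured difference stays below the significance
    # threshold when the true difference is 'effect' (far tail neglected).
    return norm.cdf(z_crit - effect / se)

# Large effect, 100 patients per group: beta is tiny.
print(false_negative_rate(effect=1.0, sigma=1.0, n_per_group=100))
# Small effect, 10 patients per group: beta is large -- most real effects are missed.
print(false_negative_rate(effect=0.1, sigma=1.0, n_per_group=10))
```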

If Df represents data corresponding to a finding, with |x| > x_α/2, then from Bayes’ theorem, the probability that the null hypothesis will have been incorrectly rejected is

P(H0 | Df, I) = P(H0 | I) P(Df | H0, I) / [ P(H0 | I) P(Df | H0, I) + P(HA | I) P(Df | HA, I) ]     (1)

P(Df | H0, I) is α, and P(Df | HA, I) is 1 − β, where β is the false-negative rate. Using the standard α level of 0.05, I plot below this posterior probability as a function of the prior probability for the alternate hypothesis, HA, for several false-negative rates.

To estimate the posterior error probability for a specific experiment, Wacholder et al. [3] propose to use the p-value instead of α (the p-value is twice the tail integral up to x_m, where x_m is the measured value of x, rather than x_α/2). Strictly, one shouldn’t use the p-value, but P(x_m | H0, I). We use α to evaluate the method, and so integrate over all possible outcomes that would be registered as findings, while we use P(x_m | H0, I) to investigate a particular experiment – there is only one outcome, so no integration is required. I’m fairly sure Wacholder et al. are aware of this: their goal seems to be to provide an approximate methodology capable of salvaging something useful from an almost ubiquitous and highly flawed set of statistical practices. In this regard, I think they can probably be credited with having made a valuable contribution. The problem with their approach, though, is that for a specific data set, P(D | HA, I) is not the same as 1 − β, and cannot be determined. The correct procedure, of course, requires resolving HA into a set of quantifiable hypotheses and calculating the appropriate sampling distributions for each of them.


Posterior probability for H0 following a statistically significant result, with α set at 0.05, plotted vs the prior probability that H0 is false.

We can see readily from the above plot that the posterior probability varies hugely, even for a fixed α, and that α alone (or the p-value) is next to useless for predicting it. As the prior probability gets smaller, the posterior error probability associated with a positive finding approaches unity. Looking closely at a prior of 0.01, which is generous for many experiments (especially, for example, large-scale genomics studies, where data and analysis are now cheap enough to permit almost every conceivable relationship to be tested, regardless of whether or not it is suggested by other known facts), we can see that for a low-power test, with β = 0.8, P(H0 | Df, I) is over 96%. So we crank up the number of samples, improve our instruments, do everything we can to reduce the experimental noise, until, miracle of miracles, we have reduced the false-negative rate to almost zero. What is P(H0 | Df, I) now? Still 83%. Bugger.
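Those two numbers come straight out of equation (1); here is a minimal sketch of the calculation (my own, in Python), reproducing the two cases just quoted:

```python
# Sketch: equation (1) -- the probability that a 'finding' is a false rejection
# of the null hypothesis, given the prior on the alternate hypothesis.

def prob_null_given_finding(prior_alt, alpha=0.05, beta=0.8):
    """P(H0 | Df, I), with P(Df | H0, I) = alpha and P(Df | HA, I) = 1 - beta."""
    prior_null = 1.0 - prior_alt
    numerator = prior_null * alpha
    return numerator / (numerator + prior_alt * (1.0 - beta))

# Prior of 0.01 on the alternate hypothesis:
print(prob_null_given_finding(prior_alt=0.01, beta=0.8))    # ~0.96 (low-power test)
print(prob_null_given_finding(prior_alt=0.01, beta=1e-6))   # ~0.83 (beta nearly zero)
```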

This is not to say that the experiments are not worth doing. Science evidently makes tremendous advances despite these difficulties, and the technology that follows from the science is the most obvious proof of this. (Besides, any substantial success in reducing β will also permit an associated reduction of α.) What it does mean, however, is that the standard ways of reporting ‘findings,’ with alphas and p-values, are desperately inadequate. They fail to represent the information that a data set contains and to convey what should be our rational degree of belief in a hypothesis, given the empirical evidence available. Evaluation of this information content and elucidation of these rational degrees of belief (probabilities) should be the goal of every scientist, and the communication of these things should be viewed as a privilege to take delight in.




Update (19-5-2012)


Previously I stated that:

'Colhoun et al. [4] found that 95% of reported findings concerning the genetic causes of disease are subsequently found to be false.'

This was based on a statement by Wacholder et al. [3] that:
'Colhoun et al. estimated the fraction of false-positive findings in studies of association between a genetic variant and a disease to be at least .95.'
I have subsequently checked this paper by Colhoun et al., and could not find this estimate. I have adjusted the text accordingly, adding new references that support my original point. I apologize for the error. My only excuse is that, locked as it was behind a paywall, I was unable to access this paper for fact-checking until I made a special trip to my local university library. I still recommend the Colhoun et al. paper for its discussion of the unsatisfactory nature of evidence evaluation in their field.







[1] J.P. Ioannidis et al. ‘Replication validity of genetic association studies,’ Nature Genetics 2001, 29 (p. 306)

[2] J.N. Hirschorn et al. ‘A comprehensive review of genetic association studies,’ Genetics in Medicine 2002, 4 (p. 45)

[3] S. Wacholder et al. ‘Assessing the probability that a positive report is false: an approach for molecular epidemiology studies,’ Journal of the National Cancer Institute 2004, 96 (p. 434)

[4] H.M. Colhoun et al. ‘Problems of reporting genetic associations with complex outcomes,’ Lancet 2003, 361 (pp. 865–72)


