Saturday, May 26, 2012

One of the things that I believe very firmly is that scientific method is not just for scientists, it's for everybody. Because science is just the systematic evaluation of what is and is not likely to be true, it's the right way to go in all fields where we're interested in making effective decisions, or obtaining new knowledge. These fields include engineering, economics, law and criminal justice, politics and social policy, history, and everyday life. Science, after all, is just the systematization of common sense. In this post, I'll discuss how to employ a scientific approach to evaluating stories we read in the news.

With training and a clear idea of what common sense actually is, we can often do a surprisingly good job of evaluating issues where we have little or no expertise and very limited information. A good starting point when modeling common sense is Bayes' theorem. It's not that I think that our brains necessarily employ Bayes' theorem strictly, but to me it is perfectly obvious that natural selection of chance variations has equipped us with intellectual apparatus that is capable of mimicking to a high degree the results of Bayes' theorem in a broad range of commonly encountered circumstances.

Here's an example that shows how easy it can be to apply formal Bayesian reasoning to the news. A few months ago, a story came out that scientists had apparently observed neutrinos traveling faster than the speed of light in vacuum. Physicist and blogger Ted Bunn has outlined a breathtakingly simple estimation of the probability that this finding was accurate. Really, anybody who understands Bayes' theorem and has moderately well trained judgement could have performed this calculation. It juxtaposes two competing hypotheses: (1) physics as we know it is severely wrong, and (2) somebody got one of their measurements wrong. The result is overwhelmingly in favor of neutrinos that cannot exceed the speed of light, and it holds even if the assigned prior probabilities are adjusted by orders of magnitude. Of course, we now know that the neutrinos in question were behaving perfectly in concert with known physics, and the measurement was indeed in error.
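The spirit of such a calculation can be sketched in a few lines of Python. The prior and likelihood values below are illustrative placeholders of my own, not Bunn's actual numbers; the point is that the conclusion survives changes of several orders of magnitude in them:

```python
def posterior_odds(prior_wrong, prior_error, lh_wrong=1.0, lh_error=1.0):
    """Posterior odds of 'physics is severely wrong' vs 'a measurement
    was wrong', given the reported faster-than-light result.  Both
    hypotheses predict the reported data about equally well, so the
    likelihoods largely cancel and the prior odds dominate."""
    return (prior_wrong * lh_wrong) / (prior_error * lh_error)

# Illustrative priors: even granting 'physics is wrong' a far more
# generous prior than most physicists would, the odds remain tiny
# against a modest prior of 0.01 for 'some measurement was wrong'.
for prior_wrong in (1e-8, 1e-6, 1e-4):
    print(f"P(physics wrong) = {prior_wrong:g} "
          f"-> posterior odds {posterior_odds(prior_wrong, 1e-2):g}")
```

Adjusting either prior by a factor of a hundred still leaves the measurement-error hypothesis overwhelmingly favoured, which is the robustness noted above.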

We can do a lot to train our intuition and improve our ability to discern merit, or its absence, in the news by learning something of the way that news journalism functions. For example, from examination of Bayes' theorem, it is clear that the probability, P(T | R), that a story is true, T, given that it has been reported, R, depends on P(R | T), P(R | F), and the prior probability, P(T). Anything that helps you estimate these and similar quantities, therefore, enhances the performance of your inner Bayesian reasoner, and has got to be a good thing.

With particular regard to stories about science in the media, a gold mine of insight is to be found by browsing the archive of Ben Goldacre's Bad Science blog. Here, you can develop a healthy skepticism by reading, among other things, dozens of examples of how journalists distort stories to make them more sexy, how they misunderstand technical details, and how they play into the hands of PR companies.

Looking beyond science stories, an excellent book, 'Flat Earth News,' by seasoned newspaper journalist Nick Davies goes into terrifying detail about the extent to which news journalism in general is broken. Davies outlines ten rules of production, which he argues have important impacts on the quality and content of reported news. Examining them offers a stark vision, but also provides invaluable data in the quest to estimate what is true, what is important, how details are likely to be distorted, and what kinds of things are possibly being kept from you. Very briefly, these rules of production described by Davies are:

(1) Run cheap stories
- no long investigations; use information that's readily available. Most news stories are written in minutes.

(2) Select safe facts
- prefer official sources

(3) Don't upset the wrong people
- the more powerful somebody is, the more likely they are to sue, for example

(4) Select safe ideas
- don't print anything that goes against the widely held consensus

(5) Give both sides of the story
- if you never actually say anything definite, then nobody can accuse you of being wrong

(6) Give them what they want
- if it increases readership, then tell it. Tell it in a way that increases readership.

(7) Bias against truth
- the story can't be too complex, therefore suppress as many details as possible

(8) Give them what they want to believe in
- again, don't upset the readers

(9) Go with the moral panic
- in times of crisis, gauge the public opinion and make sure to amplify it

(10) Give them what everybody else is giving them
- don't lose customers just because somebody else stoops lower than you are willing to

Understanding these things enables a skeptical mind to penetrate somewhat beyond what is actually presented in the news. A story can be judged, for example, on where the information probably comes from, what alternative sources were probably ignored, how much effort went into producing the story, why this story is in the news anyway - what is its real importance? This is skepticism. Science is healthy skepticism: scrutinizing everything, trying to find the flaws in every piece of evidence. Obvious, I know, but one or two people out there don't seem to have embraced it fully yet.

Wednesday, May 23, 2012

Earlier, in 'The insignificance of significance tests', I wrote about the silliness of the P-value, one of the main metrics scientists use to determine whether or not they have made a discovery. I explained my view that a far better description of our state of knowledge is the posterior probability. C.R. Weinberg, however, presumably does not agree with all my arguments, having written a commentary some years ago with the title 'It's time to rehabilitate the P-value.'1 Weinberg acknowledges the usefulness of the posterior probability in this article, and I agree with various other parts of it, but a major part of her argument in favour of P-values is the following statement, from the start of the article's second paragraph:
'The P-value quantifies the discrepancy between a given data set and the null hypothesis,'
Let's take a moment to consider whether or not this is true.

Firstly, let's note that the word 'discrepancy' has a precise meaning and an imprecise one. The discrepancy between 5 and 7 is 2, and has a precise formal meaning. What is the discrepancy, though, between an apple and an orange? When we say something like 'there is a discrepancy between what he said and what he did,' we are using the general, imprecise meaning of 'disagreement'. What is the discrepancy between a data set and a hypothesis? These are different kinds of things and cannot be measured against one another, so there is clearly no exact formal meaning of 'discrepancy' in this case. It is a pity, then, that the main case made here for the use of P-values rests on an informal concept, rather than the kind of exact idea that scientists normally prefer.

So it is clear, then, that the discrepancy between a data set and the null hypothesis has no useful, clearly definable scientific meaning (and, therefore, cannot be quantified, as was the claim). Let's examine, though, what the term 'discrepancy' is trying to convey in this case. Presumably, we are talking about the extent to which the data set alters our belief in the null hypothesis - the degree of inconsistency between the null hypothesis and the observed facts. Surely, we are not talking about any effects on our belief in the data, as the data is (to a fair approximation2) an empirical fact. Anyway, belief in the data was not the issue that the experiment was performed to resolve, right? The truth or falsity of a particular data set is a very trivial thing, is it not, next to the truth or falsity of some general principle?

But the P-value does not quantify our rational belief in any hypothesis. This is impossible. For starters, as I hinted at in that earlier article, by performing the tail integrals required to calculate P-values, one commits an unnecessary and gratuitous violation of the likelihood principle. This simple (and obvious) principle states that when evaluating rational changes to our belief, one uses only the data that are available. By evaluating tail integrals, one effectively integrates over a host of hypothetical data sets, that have never actually been observed.

Even if we eliminate the tail integration, however, and calculate only P(D | H0), we are still not much closer to evaluating the information content of our data, D, with respect to H0. To do this, we need to specify a set of hypotheses against which H0 is to be tested, and obtain the direct probability, P(D | H), for all hypotheses involved. To think that you can test a hypothesis without considering alternatives is, as I have mentioned once or twice in other articles, the base-rate fallacy.

One of my favorite illustrations of this basic principle, that the probability for a hypothesis has no meaning when divorced from the context of a set of competing hypotheses, is something called the raven paradox. Devised in 1945 by C.G. Hempel, this supposed paradox (in truth, there seem to be no real paradoxes in probability theory) makes use of two simple premises:

(1) An instance of a hypothesis is evidence for that hypothesis.

(2) The statement 'all ravens are black' is logically equivalent to the statement 'everything that is not black is not a raven.'

Taking these two together, it follows that observation of a blue teapot constitutes further evidence for the hypothesis that all ravens are black. While it seems to most that the colour of an observed teapot can convey little information about ravens, this conclusion is claimed to be logically deduced. This paradox got several philosophers tied up in knots, but it took a mathematician and computer scientist, I.J. Good (he worked with Turing both at Bletchley Park and in Manchester, making him one of the world's first ever computer scientists) to point out its resolution.

The obvious solution is that premise number (1) is just plain wrong. You cannot evaluate evidence for a hypothesis by considering only that hypothesis. It is necessary to specify all the hypotheses you wish to consider, and to specify them in a quantifiable way. Quantifiable, here, means in a manner that allows P(D | H) to be evaluated. To drive this point home, Good provided a thought experiment (a sanity check, as I like to call it) where, given the provided background information, we are forced to conclude that observation of a black raven lowers the probability that all ravens are black.3

Good imagined that we have very strong reasons to suppose that we live in one of only two possible classes of universe, U1 and U2, defined as follows:
U1 ≡ Exactly 100 ravens exist, all black. There are 1 million other birds.
U2 ≡ 1,000 black ravens exist, as well as 1 white raven, and 1 million other birds.
Now we can see that if a bird is selected randomly from the population of all birds, and found to be a black raven, then we are most probably in U2. From Bayes' theorem:

P(U1 | DI) = P(U1 | I) × P(D | U1 I) / [ P(U1 | I) × P(D | U1 I) + P(U2 | I) × P(D | U2 I) ]

No information has been supplied concerning the relative likelihoods of these universes, so symmetry requires us to give them equal prior probabilities, so

P(U1 | DI) ≈ 10⁻⁴ / (10⁻⁴ + 10⁻³)

which makes it about 10 times more likely that we are in U2, with some non-black ravens.
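Good's numbers are easy to check directly. A minimal sketch, using exact fractions and the populations defined above:

```python
from fractions import Fraction

# Bird populations in Good's two hypothetical universes
U1 = {"black_ravens": 100, "white_ravens": 0, "other_birds": 1_000_000}
U2 = {"black_ravens": 1_000, "white_ravens": 1, "other_birds": 1_000_000}

def p_black_raven(universe):
    """P(a randomly selected bird is a black raven | universe)."""
    return Fraction(universe["black_ravens"], sum(universe.values()))

prior = Fraction(1, 2)                  # symmetric prior over U1 and U2
num = prior * p_black_raven(U1)
post_U1 = num / (num + prior * p_black_raven(U2))

print(float(post_U1))   # ≈ 0.091: seeing a black raven argues against U1
```

So the posterior probability of U1, the universe in which all ravens are black, drops from 0.5 to about 0.09 on observing a black raven, exactly as claimed.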

1. C.R. Weinberg, 'It's time to rehabilitate the P-value,' Epidemiology, May 2001, Vol. 12, No. 3 (p. 288) (Available here)

2. We can imagine the following conversation between a PhD student and his advisor:

Supervisor:
This new data is fantastic! And so different to anything we have seen before. Are you sure everything was normal in the lab when you measured these?
Student:
Well, I was a little bit drunk that day....

But this kind of thing is rare, right? Come on, back me up here guys.

3. I.J. Good, 'The White Shoe is a Red Herring,' British Journal for the Philosophy of Science, 1967, Vol. 17, No. 4 (p. 322) (I haven't read this paper (paywall again), but I have it from several reliable sources that this is where the thought experiment, or an equivalent version of it, appears.)

Wednesday, May 16, 2012

Nuisance Parameters

A few days ago, I was racking my brain trying to think of a suitable example for this piece, when one landed in my lap unexpectedly. On his statistics blog, Andrew Gelman has been posting the questions from an exam he set on the design and analysis of surveys. Question 1 was the following:
Suppose that, in a survey of 1000 people in a state, 400 say they voted in a recent primary election. Actually, though, the voter turnout was only 30%. Give an estimate of the probability that a nonvoter will falsely state that he or she voted. (Assume that all voters honestly report that they voted.)
Now the requested estimate is simple enough to produce, and needs little greater insight than the product rule and the law of large numbers. But an important part of rationally advancing our knowledge is to be able to quantify the degree of uncertainty in that new knowledge. A parameter estimate without a confidence interval tells us very little, because it is just that, an estimate. We can't empirically determine that the estimated value is the truth, only that it is the most probable value. If we want to make decisions based on some estimated parameter, or give a measurement any kind of substance, we need to know how much more probable it is than other possible values. We can convey this information conveniently by supplying an error bar - a region of values either side of the most likely value, in which the true value is very likely to reside. This is why the error bar is considered to be one of the most important concepts in science.

If we look at the question above, with an ambition to provide not just the estimate, but a region of high confidence on either side, then it becomes one of the simplest possible examples of marginalization, the topic of this post. It is also a really nice example to use here, because it utilizes technology we have already played with, in some of my earlier posts. These technologies are the binomial distribution, used in 'Fly papers and photon detectors' (Equation 5 in that post), and the beta function, which I introduced without naming it in 'How to make ad-hominem arguments' (Equation 4 in that article).

Define V to be the proposition that a given person voted, and Y the statement that they declared that they did vote. We want to know the probability that a person says they voted when in fact they did not, which is P(Y | V'). From the product rule:

P(YV') = P(V') × P(Y | V')

P(V') = 0.7. That's the probability that a person did not vote.

If we are experienced in such things, we know that random deviations from expected behaviour decrease in relative magnitude as the size of the sample increases (that's the law of large numbers). For a sample of 1000, with P(V) = 0.3, we can therefore be confident that the number of voters will not be much different from 300. Hence approximately 100 non-voters lied that they had voted, out of a total sample of 1000, so

P(YV') ≈ 0.1

The crude estimate for P(Y | V'), therefore, is 1/7.

A more rigorous calculation acknowledges the uncertainty in P(YV'), and at the same time automatically provides a means to get the desired confidence interval. Let's suppose that, to start with, we have no information about the proportion of non-voters who lie in surveys; then we are justified in using a uniform prior distribution. It then follows from Bayes' theorem that the posterior probability distribution for the true fraction is given by the beta function, in close analogy with the parable of Weety Peebles. If we knew the number of people who didn't vote but lied in the survey, this would be a piece of cake, but we don't know it. It is what's called a nuisance parameter. Fortunately, there is a procedure for dealing with this.

If we have a model with 2 free parameters, θ and n, then the joint probability for any pair of values for these parameters is

P(θn | DI) = P(θn | I) P(D | θnI) / P(D | I)
(1)

But if n is a nuisance parameter, in which we have no direct interest, then we just integrate it out. The so-called marginal probability distribution, P(θ | DI), is the sum over Equation (1) for all possible values of n. If n is a continuous parameter, then the sum becomes an integral:

P(θ | DI) = ∫ P(θn | DI) dn
(2)

In our example, we have one desired parameter, the fraction of non-voters who say that they voted, f, and one nuisance parameter, the actual number of liars in the sample of 1000 people, so to get the distribution over all possible f, we need to calculate a two-dimensional array of numbers, something that is still amenable to a spreadsheet calculation. Down a column, I listed all the possible numbers of liars, n, from 0 to 400 (there can't be more than 400 as all voters tell the truth, according to the provided background information). For each of these n, the total number of non-voters is 600 plus that number (600 is the number of non-lying non-voters). The probability for each of these numbers of non-voters, P(n), was calculated in an adjacent column, using the binomial distribution, with p = 0.7.

Along the top of the spreadsheet, I listed all the hypotheses I wanted to test concerning the value of the desired fraction, f. I divided the full range [0, 1] into 1000 slices of width Δf = 0.001. The probability that the true value of f lies in any given range [f, f + Δf] is estimated as P(f) × Δf. Each P(f) was calculated using the beta function:

P(f | nI) = [(N + 1)! / (n! (N - n)!)] f^n (1 - f)^(N-n)
(3)

Here N is the number of non-voters, 600 + n, corresponding to the hypothesis in question (f is a fraction of the non-voters, so they, rather than the full sample of 1000, constitute the relevant population; this is what places the most probable f near 100/700 = 1/7). Each P(f | nI) was multiplied by the calculated P(n) to give the joint probability specified in Equation (1). At the bottom, along another row, I calculated the sum of each column, which gave the desired marginal probability distribution, which I plot below:

According to my calculation, the peak of this curve is at 0.143 (which is 1/7, as expected). As an error bar, let's identify points on either side of the peak, such that the enclosed area is 0.95. This means that there is a 95% probability that the true value of f lies between these points. To find these points, just integrate the curve in each direction from the peak, until the area in each case first reaches 0.475. Performing this integration gives a 95% confidence interval of [0.102, 0.180].
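The whole spreadsheet calculation is compact enough to reproduce in a short script. The following is a sketch of the same procedure (a grid over f, binomial weights over the nuisance parameter n, and a beta distribution for f given n), not the original spreadsheet; for each n, the beta distribution is taken over the 600 + n non-voters implied by that n, which reproduces the quoted peak and interval:

```python
from math import comb

N = 1000           # survey sample size
SAID_NO = 600      # honest non-voters (those who said they did not vote)
P_NONVOTER = 0.7   # probability that a randomly chosen person did not vote

df = 0.001
fs = [i * df for i in range(1, 1000)]     # grid of hypotheses for f
post = [0.0] * len(fs)

for n in range(401):                      # n = number of liars (nuisance)
    m = SAID_NO + n                       # total non-voters, given n
    p_n = comb(N, m) * P_NONVOTER**m * (1 - P_NONVOTER)**(N - m)
    if p_n < 1e-12:                       # skip negligible binomial weights
        continue
    c = (m + 1) * comb(m, n)              # beta normalization, (m+1)!/(n!(m-n)!)
    for i, f in enumerate(fs):
        post[i] += p_n * c * f**n * (1 - f)**(m - n)

total = sum(post) * df                    # normalize the marginal
post = [p / total for p in post]

peak = max(range(len(fs)), key=post.__getitem__)

# 95% interval: integrate outward from the peak until each side holds 0.475
lo, hi, a_lo, a_hi = peak, peak, 0.0, 0.0
while a_lo < 0.475 and lo > 0:
    lo -= 1
    a_lo += post[lo] * df
while a_hi < 0.475 and hi < len(fs) - 1:
    hi += 1
    a_hi += post[hi] * df

print(fs[peak], fs[lo], fs[hi])   # peak ≈ 0.143, interval ≈ [0.102, 0.180]
```

The inner loop over the grid plays the role of the spreadsheet columns, and the sum over n plays the role of the bottom row of column totals.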

Now we know not only the most likely value of f, but also how confident we are that the true value of f is near to that estimate. This is what good science is all about.

The process of eliminating nuisance parameters is termed marginalization. It's an important concept in Bayesian statistics. In maximum-likelihood model fitting, all free parameters in a model must be fitted at once, but use of Bayes' theorem not only permits important prior information to enter the calculation and enables confidence-interval estimation without a separate calculation, but also allows us to reduce the number of parameters that must be calculated to only those that interest us. During my PhD work, for example, most of my time was spent measuring the temporal responses of nanocrystals to short laser pulses. My fitting model included an offset (displacement up the y-axis), a shift (displacement along the time axis), and a scale parameter (dependent on how long I measured for, and how many photons my detector picked up). That's three parameters giving information only about the behaviour of the measurement apparatus. The physical model pertaining to the behaviour of the nanocrystals typically consisted of only two time constants. That's three out of five model variables that are nuisance parameters.

A really excellent read for those with an interest in the technicalities of Bayesian stats, is a text book called 'Bayesian Spectrum Analysis and Parameter Estimation,' by G. Larry Bretthorst (available for free download here). This book describes some stunning work. While discussing the advantages of eliminating nuisance parameters, Bretthorst produces one of the sexiest lines in the whole of the statistical literature:

In a typical small problem, this might reduce the search dimensions from ten to two; in one "large" problem the reduction was from thousands to six or seven.

He goes on: "This represents many orders of magnitude reduction in computation, the difference between what is feasible and what is not."

Thanks to Andrew Gelman for providing valuable inspiration for this post!


Thursday, May 10, 2012

Dodgy Dice

Following up on an earlier post, here is another example of how thinking about conditional probabilities in terms of causation rather than logical dependence can lead to cognitive dissonance. This example comes from 'Entropy Demystified,' by Arieh Ben-Naim.

Let A, B, and C be three hypotheses, such that P(B|A) > P(B) and P(C|B) > P(C). Suppose I tell you that P(C|A) < P(C). How does this strike you?

Knowledge of A makes B more likely, and knowledge of B makes C more likely, so shouldn't knowledge of A necessarily also make C more likely?

If you are thinking along these lines and are struggling to accept that the situation I describe is possible, then I suspect it is because, despite my warning, you can't help thinking about P(x|y) in terms of causation. The specific property of causality that seems to invade our heads when we contemplate problems like this is its transitivity: if A causes B and B causes C, then A necessarily is the cause of C.

Logical dependence, however, is not required to exhibit this transitivity property. Let A, B, and C be hypotheses about the outcome of rolling a six-sided die, such that for each hypothesis, the face showing is one of the four specified:

A: {1, 2, 3, 4}
B: {2, 3, 4, 5}
C: {3, 4, 5, 6}

P(A) = P(B) = P(C) = 4/6. If you are told that A is true, then the probability  associated with B becomes 3/4, which is greater than 4/6. Similarly, P(C|B) is also greater than P(C). But P(C|A) = 2/4, which is less than P(C), as promised.
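These numbers are quick to confirm by direct enumeration over the die's faces; a minimal sketch:

```python
from fractions import Fraction

DIE = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
C = {3, 4, 5, 6}

def p(event, given=DIE):
    """P(event | given) for a single roll of a fair six-sided die."""
    return Fraction(len(event & given), len(given))

print(p(B, A), p(B))   # 3/4 > 2/3: knowing A raises P(B)
print(p(C, B), p(C))   # 3/4 > 2/3: knowing B raises P(C)
print(p(C, A), p(C))   # 1/2 < 2/3: yet knowing A lowers P(C)
```

Counting the overlaps makes the intransitivity unmysterious: A and B share three faces, B and C share three, but A and C share only two.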

Here's another example of unexpected intransitivity for dice, operating on a different level, taken from Ian Stewart's entertaining "Cabinet of Mathematical Curiosities."

Mr. A proposes to his friend Mr. B that they play dice for money with his special dice. He has three of them, and each player is to pick one to play with against the other. Whoever rolls the higher number on each throw wins the stake. To convince Mr. B that the dice are fair, Mr. A insists that he take first choice from the three dice - that way, if one of the dice is better than the others, Mr. B should have a fair chance of picking the better one.

The dice do not have the usual numbers marked on their faces, though. The red one has the numbers {3, 3, 4, 4, 8, 8}. The yellow one is marked with the numbers {1, 1, 5, 5, 9, 9}. Finally, the blue one has the numbers {2, 2, 6, 6, 7, 7}.

Which one, if any, should Mr. B choose?

Check the numbers for yourself. Whichever die Mr. B selects, Mr. A can select one of the others that gives a higher probability of winning.
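The check amounts to enumerating the 36 equally likely outcomes for each pairing:

```python
from fractions import Fraction
from itertools import product

red = (3, 3, 4, 4, 8, 8)
yellow = (1, 1, 5, 5, 9, 9)
blue = (2, 2, 6, 6, 7, 7)

def p_win(a, b):
    """Probability that die a shows a higher number than die b."""
    wins = sum(1 for x, y in product(a, b) if x > y)
    return Fraction(wins, len(a) * len(b))

# The dice form a cycle: each one is beaten by another with P = 5/9
print(p_win(yellow, red))   # 5/9
print(p_win(blue, yellow))  # 5/9
print(p_win(red, blue))     # 5/9
```

Yellow beats red, blue beats yellow, and red beats blue, each with probability 5/9, so whichever die Mr. B picks, Mr. A has a winning reply.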

Sunday, May 6, 2012

The Insignificance of Significance Tests

In an earlier post on the base-rate fallacy, I made use of two important terms, ‘false-positive rate’ and ‘false-negative rate,’ without taking much time to explain them. These are concepts we need to be careful with, because, simple though they are, they have been given terrible names.

Let's start with the false positive rate for a test. This could mean any one of a number of things:

(1) it could be the expected proportion of results produced by the test that are both false and positive
(2) it could be the proportion of positive results that are false
(3) it could be the proportion of false results that are positive

So which one is it? Are you ready?

None of the above.

The false positive rate is the proportion of negative cases that are registered as positive by the test. This definition is widespread and agrees with the Wikipedia article ‘Type I and Type II Errors’. If it is a diagnostic test for some disease, it is the fraction of healthy people who will be told that they have the disease (or referred for further diagnosis).

My claim that the term is confusing, however, is supported by looking at another Wikipedia article, ‘False Positive Rate,’ which provides the definition: “the probability of falsely rejecting the null hypothesis.” The null hypothesis is the proposition that there is no effect to measure – what I have called a negative case. The probability of falsely rejecting the null hypothesis, therefore, depends on the probability that there is no effect to measure, while the proportion of negative cases that are registered as positive does not. The alternate definition in this second Wikipedia article is the same as my number (1), above.

Number (2) on the list above is the posterior probability that the null hypothesis is true, given that the test has indicated it to be false. It is obtained using Bayes’ theorem. Confusion between this posterior probability and the false-positive rate is the base-rate fallacy, yet again. Unfortunately, there is little about the term ‘false-positive rate’ that strives to steer one away from these misconceptions.

What makes the situation much worse is that most scientific research is assessed using the false-positive rate, while what we should really be interested in is assigning a posterior probability to the proposition under investigation.

The false-positive rate is often denoted by the Greek letter α. In classical significance testing, α is used to define a significance level – or rather the significance level defines α. The significance level is chosen such that the probability that a negative case triggers the alarm is α. A negative case, once again, is an instance where the null hypothesis is true, and there is no effect. For example, if two groups are being treated for some affliction, one group with an experimental drug and the other with a placebo, the null hypothesis might be that both groups recover at the same rate, as the new treatment has no specific impact on the disease. If the null hypothesis is correct, then in an ideal measurement, there will be no difference between the recovery times of the two groups.

But measurements are not ideal: there is ‘noise’ – random fluctuations in the system under study that produce a non-zero difference between the groups. If we can assign a probability distribution for this noise, however, we can define limits which, if exceeded by the measurement, suggest that the null hypothesis is false. If x is the measured difference in recovery time for the two groups, then there are two points, xα/2, on the tails of the distribution, such that the area of each tail beyond its point is α/2. The total probability contained in these two tail areas, therefore, sums to α. The xα/2 points are chosen so that α is equal to some desired significance level, some acceptably low false-positive rate. (We cannot make the false-positive rate too small, because then we would too often fail to spot a real effect – the false-negative rate would be too high.) The integration is performed on each side of the error distribution because we cannot be certain in which direction the alternate hypothesis operates: the recovery time with the new drug might be worse than with no treatment, which could still lead to rejection of the null hypothesis. In classical hypothesis testing, any measured x further from the centre than either of these points is considered statistically significant.

[Figure: the null hypothesis predicts a value of x at the peak of the noise distribution; the xα/2 points are chosen so that the shaded tail areas on either side sum to 5%, or whatever the desired significance level is.]
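The construction of the xα/2 points is easy to sketch numerically. Gaussian noise with unit standard deviation is assumed here purely for illustration:

```python
from statistics import NormalDist

alpha = 0.05
noise = NormalDist(mu=0.0, sigma=1.0)    # assumed noise model under H0

x_crit = noise.inv_cdf(1 - alpha / 2)    # the upper x_{alpha/2} point
tail_area = 2 * (1 - noise.cdf(x_crit))  # both shaded tails together

print(round(x_crit, 2))     # 1.96: reject H0 when |x| exceeds ~2 sigma
print(round(tail_area, 3))  # 0.05, the chosen significance level
```

This is where the familiar "two standard deviations" rule of thumb for 5% significance comes from.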

A common value chosen for α is 0.05. That is, if the null hypothesis is true, then the measured value of x will exceed xα/2 about one time in twenty. This is the basis of how results are reported in probably most of science – if x exceeds the prescribed significance level, then the null hypothesis is rejected, and the result is classified as statistically significant and considered a finding. If the measured value of x is between the two xα/2 points, then the null hypothesis is not rejected, and the outcome of the study is probably never reported (a fact that contributes hugely to the problem of publication bias in the scientific literature). This system for interpreting scientific data and reporting outcomes of research is, let's be honest, a travesty.

Firstly, the whole rationale of the significance test seems to me to be deeply flawed. The idea is that there is some cutoff, beyond which we can declare the matter final: ‘Yes, we have a finding here. Grand. Nothing more to do on that question.’ How manifestly absurd! Such a binary declaration about a hypothesis, necessary if one wishes to take real-world action based on scientific research, needs to be based on decision theory, which combines both probability theory and some appropriate loss function (something that specifies the cost of making a wrong decision). But the declarations arising in decision theory are not of the form ‘A is true,’ but rather ‘we must act, and rationality demands that we act as if A is true.’ Using some standard α-level is about the crudest substitute for decision theory you could imagine.

Why should our test be biased so much in favour of the null hypothesis, anyway? The alternate hypothesis, HA, almost always represents an infinity of hypotheses about the magnitude of the possible non-null effect, so a truly 'unbiased' starting point would seem to be one that de-emphasizes H0. Remember that at the core of the frequentist philosophy (not the approach I wish to promote) is the dictum "let the data speak for themselves": don't let prior impressions taint the measurement.

Secondly, when a finding is reported with a false-positive probability of 0.05, there appears to be a feeling of satisfaction among the scientific community that 0.05 is the probability that the positive finding is false. But meta-analyses regularly do their best to dispel this myth. For example, looking only at genetic association studies, Ioannidis et al.1 reported that 'results of the first study correlate only modestly with subsequent research on the same association,' while Hirschorn et al.2 write that 'of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated.'

Thinking again about the method for calculating the positive predictive value for a diagnostic test, the probability that a positive finding is false is not determined only by the false-positive rate, α, but also by the false negative rate, and the prior probability. The false negative rate, by analogy with the false positive rate, is the proportion of real effects that will be registered as null findings. It depends on the number of samples taken in the study and the magnitude of the effect. In the medical trial example, if the new drug helps people recover twice as fast, then there will be fewer false negatives than if the difference is only 10%, and a trial with 100 patients in each group will give a more powerful test than a trial with only 10 in each group.

If Df represents data corresponding to a finding, with x > xα/2, then from Bayes’ theorem, the probability that the null hypothesis will have been incorrectly rejected is

P(H0 | Df, I) = P(H0 | I) P(Df | H0, I) / [ P(H0 | I) P(Df | H0, I) + P(HA | I) P(Df | HA, I) ]
(1)

P(Df | H0, I) is α, and P(Df | HA, I) is 1 - β, where β is the false negative rate. Using the standard α level of 0.05, I plot below this posterior probability as a function of the prior probability for the alternate hypothesis, HA, for several false-negative rates. To estimate the posterior error probability for a specific experiment, Wacholder et al.3 propose to use the p-value instead of α (the p-value is twice the tail integral up to xm, where xm is the measured value of x, rather than xα/2). Strictly, one shouldn't use the p-value, but P(xm | H0, I): we use α to evaluate the method, and so integrate over all possible outcomes that would be registered as findings, while we use P(xm) to investigate a particular experiment – there is only one outcome, so no integration is required. I'm fairly sure Wacholder et al. are aware of this: their goal seems to be to provide an approximate methodology capable of salvaging something useful from an almost ubiquitous and highly flawed set of statistical practices. In this regard, I think they can probably be credited with having made a valuable contribution. The problem, though, is that for a specific data set, P(Df | HA, I) is not the same as 1 - β, and cannot be determined. The correct procedure, of course, requires resolving HA into a set of quantifiable hypotheses and calculating the appropriate sampling distributions for each of them.

[Figure: posterior probability for H0 following a statistically significant result, with α set at 0.05, plotted against the prior probability that H0 is false.]

We can see readily from the above plot that the posterior probability varies hugely, even for a fixed α, and that α alone (or the p-value) is next to useless for predicting it. As the prior probability gets smaller, the posterior error probability associated with a positive finding approaches unity. Look closely at a prior of 0.01, which is generous for many experiments (especially, for example, large-scale genomics studies, where the data and its analysis are now cheap enough to permit almost every conceivable relationship to be tested, regardless of whether or not it is suggested by other known facts): for a low-power test, with β = 0.8, P(H0 | Df, I) is over 96%. So we crank up the number of samples, improve our instruments, and do everything we can to reduce the experimental noise, until, miracle of miracles, we have reduced the false-negative rate to almost zero. What is P(H0 | Df, I) now? Still 83%. Bugger.
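The two figures quoted above follow directly from equation (1); a quick arithmetic check:

```python
# Equation (1) with P(HA | I) = 0.01 and alpha = 0.05.
alpha, prior_ha = 0.05, 0.01
for beta in (0.8, 0.0):              # low power, then near-perfect power
    numerator = (1 - prior_ha) * alpha
    posterior = numerator / (numerator + prior_ha * (1 - beta))
    print(f"beta = {beta}: P(H0 | Df, I) = {posterior:.3f}")
```

With β = 0.8 the posterior error probability comes out just above 0.96, and even with β driven to zero it only drops to about 0.83: at this prior, the denominator is dominated by the false-positive term no matter how powerful the test.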

This is not to say that the experiments are not worth doing. Science evidently makes tremendous advances despite these difficulties, and the technology that follows from the science is the most obvious proof of this. (Besides, any substantial success in reducing β will also permit an associated reduction of α.) What it does mean, however, is that the standard ways of reporting ‘findings,’ with alphas and p-values, are desperately inadequate. They fail to represent the information that a data set contains and to convey what should be our rational degree of belief in a hypothesis, given the empirical evidence available. Evaluation of this information content and elucidation of these rational degrees of belief (probabilities) should be the goal of every scientist, and the communication of these things should be viewed as a privilege to take delight in.

Update (19-5-2012)

Previously I stated that:

'Colhoun et al.4 found that 95% of reported findings concerning the genetic causes of disease are subsequently found to be false.'

This was based on a statement by Wacholder et al.3 that:
'Colhoun et al. estimated the fraction of false-positive findings in studies of association between a genetic variant and a disease to be at least .95.'
I have subsequently checked this paper by Colhoun et al., and could not find this estimate. I have adjusted the text accordingly, adding new references that support my original point. I apologize for the error. My only excuse is that the paper was locked behind a paywall, so I was unable to fact-check it without making a special trip to my local university library. I still recommend the Colhoun et al. paper for its discussion of the unsatisfactory nature of evidence evaluation in the field.

1. J.P. Ioannidis et al. ‘Replication validity of genetic association studies,’ Nature Genetics 2001, 29 (p. 306)

2. J.N. Hirschorn et al. ‘A comprehensive review of genetic association studies,’ Genetics in Medicine 2002, 4 (p. 45)

3. S. Wacholder et al. ‘Assessing the probability that a positive report is false: an approach for molecular epidemiology studies,’ Journal of the National Cancer Institute 2004, Vol. 96, No. 6 (p. 434)

4. H.M. Colhoun et al. ‘Problems of reporting genetic associations with complex outcomes,’ Lancet 2003, 361 (pp. 865–72)