
Wednesday, May 23, 2012

The Raven Paradox


Earlier, in 'The insignificance of significance tests', I wrote about the silliness of the P-value, one of the main metrics scientists use to determine whether or not they have made a discovery. I explained my view that a far better description of our state of knowledge is the posterior probability. C.R. Weinberg, however, presumably does not agree with all my arguments, having written a commentary some years ago with the title 'It's time to rehabilitate the P-value.'1 Weinberg acknowledges the usefulness of the posterior probability in this article, and I agree with various other parts of it, but a major part of her argument in favour of P-values is the following statement, from the start of the article's second paragraph:
'The P-value quantifies the discrepancy between a given data set and the null hypothesis,'
Let's take a moment to consider whether or not this is true.

Firstly, let's note that the word 'discrepancy' has a precise meaning and an imprecise one. The discrepancy between 5 and 7 is 2, and has a precise formal meaning. What is the discrepancy, though, between an apple and an orange? When we say something like 'there is a discrepancy between what he said and what he did,' we are using the general, imprecise meaning of 'disagreement'. What is the discrepancy between a data set and a hypothesis? These are different kinds of thing and can not be measured against one another, so there is clearly no exact formal meaning of 'discrepancy' in this case. It is a pity, then, that the main case made here for the use of P-values rests on an informal concept, rather than the exact ideas that scientists normally prefer.

So it is clear, then, that the discrepancy between a data set and the null hypothesis has no useful, clearly definable scientific meaning (and, therefore, can not be quantified, as was claimed). Let's examine, though, what the term 'discrepancy' is trying to convey in this case. Presumably, we are talking about the extent to which the data set alters our belief in the null hypothesis - the degree of inconsistency between the null hypothesis and the observed facts. Surely, we are not talking about any effect on our belief in the data, as the data are (to a fair approximation 2) an empirical fact. In any case, belief in the data was not the issue that the experiment was performed to resolve, right? The truth or falsity of a particular data set is a very trivial thing, is it not, next to the truth or falsity of some general principle?

But the P-value does not quantify our rational belief in any hypothesis. This is impossible. For starters, as I hinted at in that earlier article, by performing the tail integrals required to calculate P-values, one commits an unnecessary and gratuitous violation of the likelihood principle. This simple (and obvious) principle states that when evaluating rational changes to our belief, one uses only the data that are available. By evaluating tail integrals, one effectively integrates over a host of hypothetical data sets that have never actually been observed.

Even if we eliminate the tail integration, however, and calculate only P(D | H0), we are still not much closer to evaluating the information content of our data, D, with respect to H0. To do this, we need to specify a set of hypotheses against which H0 is to be tested, and obtain the direct probability, P(D | H), for all hypotheses involved. To think that you can test a hypothesis without considering alternatives is, as I have mentioned once or twice in other articles, the base-rate fallacy.

One of my favorite illustrations of this basic principle, that the probability for a hypothesis has no meaning when divorced from the context of a set of competing hypotheses, is something called the raven paradox. Devised in 1945 by C.G. Hempel, this supposed paradox (in truth, there seem to be no real paradoxes in probability theory) makes use of two simple premises:

(1) An instance of a hypothesis is evidence for that hypothesis.

(2) The statement 'all ravens are black' is logically equivalent to the statement 'everything that is not black is not a raven.'

Taking these two together, it follows that observation of a blue teapot constitutes further evidence for the hypothesis that all ravens are black. While it seems to most that the colour of an observed teapot can convey little information about ravens, this conclusion is claimed to be logically deduced. This paradox got several philosophers tied up in knots, but it took a mathematician and computer scientist, I.J. Good (he worked with Turing both at Bletchley Park and in Manchester, making him one of the world's first ever computer scientists) to point out its resolution.

The obvious solution is that premise number (1) is just plain wrong. You can not evaluate evidence for a hypothesis by considering only that hypothesis. It is necessary to specify all the hypotheses you wish to consider, and to specify them in a quantifiable way. Quantifiable, here, means in a manner that allows P(D | H) to be evaluated. To drive this point home, Good provided a thought experiment (a sanity check, as I like to call it) where, given the provided background information, we are forced to conclude that observation of a black raven lowers the probability that all ravens are black.3

Good imagined that we have very strong reasons to suppose that we live in one of only two possible classes of universe, U1 and U2, defined as follows:
U1 ≡ Exactly 100 ravens exist, all black. There are 1 million other birds.
U2 ≡ 1,000 black ravens exist, as well as 1 white raven, and 1 million other birds.
Now we can see that if a bird is selected randomly from the population of all birds, and found to be a black raven, then we are most probably in U2. From Bayes' theorem:

$$P(U_1 \mid D I) = \frac{P(U_1 \mid I)\, P(D \mid U_1 I)}{P(U_1 \mid I)\, P(D \mid U_1 I) + P(U_2 \mid I)\, P(D \mid U_2 I)}$$


No information has been supplied concerning the relative likelihoods of these universes, so symmetry requires us to give them equal prior probabilities. The sampling probabilities are P(D | U1 I) = 100/1,000,100 ≈ 10⁻⁴ and P(D | U2 I) = 1,000/1,001,001 ≈ 10⁻³, so

$$P(U_1 \mid D I) \approx \frac{10^{-4}}{10^{-4} + 10^{-3}}$$


which makes it about 10 times more likely that we are in U2, with some non-black ravens.
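For anyone who wants to check the arithmetic, here is a minimal Python sketch, assuming (as the thought experiment stipulates) that the bird is drawn uniformly at random from all the birds in whichever universe we inhabit:

```python
# Good's two-universe raven example: posterior probability of U1
# after drawing one bird at random and finding it to be a black raven.

# Universe definitions (from the text)
birds_u1 = 100 + 1_000_000          # 100 black ravens + 1,000,000 other birds
birds_u2 = 1_000 + 1 + 1_000_000    # 1,000 black ravens + 1 white raven + 1,000,000 other birds

# Sampling probabilities P(D | U, I): chance a randomly drawn bird is a black raven
p_d_u1 = 100 / birds_u1      # ~1e-4
p_d_u2 = 1_000 / birds_u2    # ~1e-3

# Equal prior probabilities, by symmetry
prior_u1 = prior_u2 = 0.5

# Bayes' theorem
posterior_u1 = (prior_u1 * p_d_u1) / (prior_u1 * p_d_u1 + prior_u2 * p_d_u2)
print(posterior_u1)   # ~0.09: observing a black raven favours U2, where non-black ravens exist
```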







[1] C.R. Weinberg, 'It's time to rehabilitate the P-value,' Epidemiology May 2001, Vol. 12, No. 3 (p. 288)

[2] We can imagine the following conversation between a PhD student and his advisor:

Supervisor:
This new data is fantastic! And so different to anything we have seen before. Are you sure everything was normal in the lab when you measured these?
Student:
Well, I was a little bit drunk that day....

But this kind of thing is rare, right? Come on, back me up here guys.


[3] I.J. Good, 'The White Shoe is a Red Herring,' British Journal for the Philosophy of Science, 1967, Vol. 17, No. 4 (p. 322) (I haven't read this paper (paywall again), but I have it from several reliable sources that this is where the thought experiment, or an equivalent version of it, appears.)


Sunday, May 6, 2012

The Insignificance of Significance Tests



In an earlier post on the base-rate fallacy, I made use of two important terms, ‘false-positive rate’ and ‘false-negative rate,’ without taking much time to explain them. These are concepts we need to be careful with, because, simple though they are, they have been given terrible names.

Let's start with the false positive rate for a test. This could mean any one of a number of things:

(1) it could be the expected proportion of results produced by the test that are both false and positive 
(2) it could be the proportion of positive results that are false 
(3) it could be the proportion of false results that are positive

So which one is it? Are you ready?

None of the above.

The false positive rate is the proportion of negative cases that are registered as positive by the test. This definition is widespread and agrees with the Wikipedia article ‘Type I and Type II Errors’. If it is a diagnostic test for some disease, it is the fraction of healthy people who will be told that they have the disease (or referred for further diagnosis).

My claim that the term is confusing, however, is supported by looking at another Wikipedia article, ‘False Positive Rate,’ which provides the definition: “the probability of falsely rejecting the null hypothesis.” The null hypothesis is the proposition that there is no effect to measure – what I have called a negative case. The probability of falsely rejecting the null hypothesis, therefore, depends on the probability that there is no effect to measure, while the proportion of negative cases that are registered as positive does not. The alternate definition in this second Wikipedia article is the same as my number (1), above.

Number (2) on the list above is the posterior probability that the null hypothesis is true, given that the test has indicated it to be false. It is obtained using Bayes’ theorem. Confusion between this posterior probability and the false-positive rate is the base-rate fallacy, yet again. Unfortunately, there is little about the term ‘false-positive rate’ that strives to steer one away from these misconceptions.

What makes the situation much worse is that most scientific research is assessed using the false-positive rate, while what we should really be interested in is assigning a posterior probability to the proposition under investigation.

The false-positive rate is often denoted by the Greek letter α. In classical significance testing, α is used to define a significance level – or rather the significance level defines α. The significance level is chosen such that the probability that a negative case triggers the alarm is α. A negative case, once again, is an instance where the null hypothesis is true, and there is no effect. For example, if two groups are being treated for some affliction, one group with an experimental drug and the other with a placebo, the null hypothesis might be that both groups recover at the same rate, because the new treatment has no specific impact on the disease. If the null hypothesis is correct, then in an ideal measurement, there will be no difference between the recovery times of the two groups.

But measurements are not ideal: there is ‘noise’ – random fluctuations in the system under study that produce a non-zero difference between the groups. If we can assign a probability distribution for this noise, however, we can define limits, which, if exceeded by the measurement, suggest that the null hypothesis is false. If x is the measured difference in recovery time for the two groups, then there are two points, ±xα/2, on the tails of the distribution, such that the integral of each tail beyond this limit is α/2. The total probability contained in these two tail areas, therefore, sums to α. The xα/2 points are chosen so that α is equal to some desired significance level, some acceptably low false positive rate. (We can not make the false positive rate too small, because then we would too often fail to spot a real effect – the false negative rate would be too high.) The integration is performed on each side of the error distribution, as we have no certainty in which direction the alternate hypothesis operates: the recovery time with the new drug might be worse than with no treatment, which could still lead to rejection of the null hypothesis.


The null hypothesis predicts a value of x in the peak of the curve, but measurement noise leads to a probability distribution for the experiment as plotted. The xα/2 points are chosen so that the tail areas either side (the shaded regions) sum to 5%, or whatever the desired significance level is. In classical hypothesis testing, any measured x further from the centre than either of these points is considered statistically significant.
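As a concrete (and simplified) illustration of this recipe, here is a short Python sketch that computes the two-tailed critical points and the significance verdict, assuming the noise on the measured difference x is Gaussian with a known standard deviation; the numbers, and the Gaussian assumption itself, are mine, purely for illustration.

```python
from scipy import stats

# Assumed (illustrative) noise model for the measured difference x under H0:
# zero mean, known standard deviation.
sigma = 1.0          # standard deviation of the measurement noise (assumed known)
alpha = 0.05         # desired false-positive rate (significance level)

# Two-tailed critical point x_alpha/2: each tail beyond +/- this value holds alpha/2.
x_crit = stats.norm.ppf(1 - alpha / 2, loc=0.0, scale=sigma)

x_measured = 2.3     # a hypothetical measured difference in recovery time
significant = abs(x_measured) > x_crit

print(f"critical value = {x_crit:.3f}")        # ~1.96 sigma
print(f"statistically significant: {significant}")
```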

A common value chosen for α is 0.05. That is, if the null hypothesis is true, then the measured value of x will exceed xα/2 about one time in twenty. This is the basis of how results are reported in probably most of science – if x exceeds the prescribed threshold, then the null hypothesis is rejected, and the result is classified as statistically significant and considered a finding. If the measured value of x is between the two xα/2 points, then the null hypothesis is not rejected, and the outcome of the study is probably never reported (a fact that contributes hugely to the problem of publication bias in the scientific literature). This system for interpreting scientific data and reporting outcomes of research is, let's be honest, a travesty.

Firstly, the whole rationale of the significance test seems to me to be deeply flawed. The idea is that there is some cutoff, beyond which we can declare the matter final: ‘Yes, we have a finding here. Grand. Nothing more to do on that question.’ How manifestly absurd! Such a binary declaration about a hypothesis, necessary if one wishes to take real-world action based on scientific research, needs to be based on decision theory, which combines probability theory with some appropriate loss function (something that specifies the cost of making a wrong decision). But the declarations arising in decision theory are not of the form ‘A is true,’ but rather ‘we must act, and rationality demands that we act as if A is true.’ Using some standard α-level is about the crudest substitute for decision theory you could imagine.

Why should our test be biased so much in favour of the null hypothesis, anyway? The alternate hypothesis, HA, almost always represents an infinity of hypotheses about the magnitude of the possible non-null effect, so a truly 'unbiased' starting point would seem to be one that de-emphasizes H0. Remember that at the core of the frequentist philosophy (not the approach I wish to promote) is the dictum "let the data speak for themselves": don't let prior impressions taint the measurement.

Secondly, when a finding is reported with a false-positive probability of 0.05, there appears to be a feeling of satisfaction among the scientific community that 0.05 is the probability that the positive finding is false. But meta-analyses regularly do their best to dispel this myth. For example, looking only at genetic association studies, Ioannidis et al.1 reported that 'results of the first study correlate only modestly with subsequent research on the same association,' while Hirschorn et al.2 write that 'of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated.'

Thinking again about the method for calculating the positive predictive value for a diagnostic test, the probability that a positive finding is false is not determined only by the false-positive rate, α, but also by the false negative rate, and the prior probability. The false negative rate, by analogy with the false positive rate, is the proportion of real effects that will be registered as null findings. It depends on the number of samples taken in the study and the magnitude of the effect. In the medical trial example, if the new drug helps people recover twice as fast, then there will be fewer false negatives than if the difference is only 10%, and a trial with 100 patients in each group will give a more powerful test than a trial with only 10 in each group.
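To make the dependence of β on effect size and sample size concrete, here is a rough Python sketch for the two-group example, assuming Gaussian noise with a known standard deviation and a simple two-sample z-test; the effect sizes and noise level are illustrative assumptions of mine, not taken from any real trial.

```python
from scipy import stats

def false_negative_rate(effect, sigma, n_per_group, alpha=0.05):
    """Approximate beta for a two-sample z-test with known noise sigma.

    effect: true difference in mean recovery time between the groups
    sigma: standard deviation of an individual patient's recovery time
    n_per_group: number of patients in each group
    """
    se = sigma * (2.0 / n_per_group) ** 0.5   # standard error of the measured difference
    z_crit = stats.norm.ppf(1 - alpha / 2)    # two-tailed critical point
    # Probability that the measured difference fails to reach significance,
    # given the true effect (power = 1 - beta).
    beta = (stats.norm.cdf(z_crit - effect / se)
            - stats.norm.cdf(-z_crit - effect / se))
    return beta

# Larger effects and larger samples both shrink the false-negative rate:
print(false_negative_rate(effect=2.0, sigma=4.0, n_per_group=10))    # ~0.8: low power
print(false_negative_rate(effect=2.0, sigma=4.0, n_per_group=100))   # ~0.06: much better
print(false_negative_rate(effect=0.4, sigma=4.0, n_per_group=100))   # ~0.89: small effect, weak again
```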

If Df represents data corresponding to a finding, with x > xα/2, then from Bayes’ theorem, the probability that the null hypothesis will have been incorrectly rejected is

$$P(H_0 \mid D_f, I) = \frac{P(H_0 \mid I)\, P(D_f \mid H_0, I)}{P(H_0 \mid I)\, P(D_f \mid H_0, I) + P(H_A \mid I)\, P(D_f \mid H_A, I)} \tag{1}$$

P(Df | H0, I) is α, and P(Df | HA, I) is 1 − β, where β is the false negative rate. Using the standard α level of 0.05, I plot below this posterior probability as a function of the prior probability for the alternate hypothesis, HA, for several false-negative rates. To estimate the posterior error probability for a specific experiment, Wacholder et al.3 propose using the p-value instead of α (the p-value is twice the tail integral up to xm, where xm is the measured value of x, rather than xα/2). Strictly, one shouldn't use the p-value, but P(xm | H0, I): we use α to evaluate the method, and so integrate over all possible outcomes that would be registered as findings, while we use P(xm) to investigate a particular experiment, where there is only one outcome, so no integration is required. A further problem is that, for a specific data set, P(D | HA, I) is not the same as 1 − β, and can not be determined without more work; the correct procedure requires resolving HA into a set of quantifiable hypotheses and calculating the appropriate sampling distributions for each of them. I'm fairly sure Wacholder et al. are aware of all this: their goal seems to be to provide an approximate methodology capable of salvaging something useful from an almost ubiquitous and highly flawed set of statistical practices, and in that regard I think they can be credited with having made a valuable contribution.


Posterior probability for H0 following a statistically significant result, with α set at 0.05, plotted vs the prior probability that H0 is false.

We can see readily from the above plot that the posterior probability varies hugely, even for a fixed α, and that α alone (or the p-value) is next to useless for predicting it. As the prior probability gets smaller, the posterior error probability associated with a positive finding approaches unity. Looking closely at a prior of 0.01, which is generous for many experiments (especially, for example, large-scale genomics studies, where the data and their analysis are now cheap enough to permit almost every conceivable relationship to be tested, regardless of whether or not they are suggested by other known facts), we can see that for a low-power test, with β = 0.8, P(H0 | Df, I) is over 96%. So we crank up the number of samples, improve our instruments, do everything we can to reduce the experimental noise, until, miracle of miracles, we have reduced the false negative rate to almost zero. What is P(H0 | Df, I) now? Still 83%. Bugger.
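Here is a small Python sketch of equation (1) that reproduces the numbers just quoted; the prior, α, and β are the only inputs.

```python
def posterior_h0(prior_ha, alpha=0.05, beta=0.8):
    """P(H0 | Df, I) from equation (1): probability the null is true
    despite a 'statistically significant' result."""
    prior_h0 = 1.0 - prior_ha
    p_find_given_h0 = alpha          # false-positive rate
    p_find_given_ha = 1.0 - beta     # power of the test
    return (prior_h0 * p_find_given_h0) / (
        prior_h0 * p_find_given_h0 + prior_ha * p_find_given_ha)

print(posterior_h0(prior_ha=0.01, beta=0.8))    # ~0.96: low-power test
print(posterior_h0(prior_ha=0.01, beta=0.0))    # ~0.83: even with perfect power
```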

This is not to say that the experiments are not worth doing. Science evidently makes tremendous advances despite these difficulties, and the technology that follows from the science is the most obvious proof of this. (Besides, any substantial success in reducing β will also permit an associated reduction of α.) What it does mean, however, is that the standard ways of reporting ‘findings,’ with alphas and p-values, are desperately inadequate. They fail to represent the information that a data set contains and to convey what should be our rational degree of belief in a hypothesis, given the empirical evidence available. Evaluation of this information content and elucidation of these rational degrees of belief (probabilities) should be the goal of every scientist, and the communication of these things should be viewed as a privilege to take delight in.




Update (19-5-2012)


Previously I stated that:

'Colhoun et al.4 found that 95% of reported findings concerning the genetic causes of disease are subsequently found to be false.'

This was based on a statement by Wacholder et al.3 that:
'Colhoun et al. estimated the fraction of false-positive findings in studies of association between a genetic variant and a disease to be at least .95.'
I have subsequently checked this paper by Colhoun et al., and could not find this estimate. I have adjusted the text accordingly, adding new references that support my original point. I apologize for the error. My only excuse is that, locked as it was behind a paywall, I was unable to access this paper for fact checking, before making a special trip to my local university library. I still recommend the Colhoun et al. paper for their discussion of the unsatisfactory nature of evidence evaluation in their field.







[1] J.P. Ioannidis et al., ‘Replication validity of genetic association studies,’ Nature Genetics 2001, 29 (p. 306)

[2] J.N. Hirschorn et al., ‘A comprehensive review of genetic association studies,’ Genetics in Medicine 2002, 4 (p. 45)

[3] S. Wacholder et al., ‘Assessing the probability that a positive report is false: an approach for molecular epidemiology studies,’ Journal of the National Cancer Institute, Vol. 96, No. 6 (p. 434), March 17, 2004.

[4] H.M. Colhoun et al. ‘Problems of reporting genetic associations with complex outcomes,’ Lancet 2003 361:865–72



Tuesday, April 17, 2012

Fly papers and photon detectors: another base-rate fallacy


Here is my version of a simple riddle posed by Jaynes in ‘Probability Theory: The Logic of Science’:
A constant Poissonian light source emits photons at an average rate of 100 per second. The photons strike a detector with 10% efficiency. In one particular second, the detector registers 15 counts. What is the expectation for the number of photons actually emitted in that second? (Assume a detector with no dark counts and zero dead time.)
Let's denote the number of emissions by n, the average number of emissions (the source strength) by s, the number of detected photons by c, and the detector efficiency by ϕ.

In case some of the terminology in the question is unfamiliar, ‘Poissonian’ just means that the number of events in any second conforms to a Poisson probability distribution with a mean of s:
$$P(n \mid s) = \frac{e^{-s}\, s^{n}}{n!} \tag{1}$$
Make an effort to work out your best answer to the above question before reading on. Consider writing it down - feel free to record it in the comments below! 


Many textbooks on statistics specify 'maximum likelihood' as the state of the art for parameter estimation, so let's see what result it produces here. The likelihood function for a particular model is the probability of obtaining the observed data, assuming that model to be true:

$$L(M) = P(D \mid M, I) \tag{2}$$

where D is the set of data, M is the model in question, and I is the prior information available.

In the cases where the prior information effectively specifies the form of the model, but not the model parameters, then denoting the parameters of the model by θ, this equation becomes

$$L(\theta) = P(D \mid \theta, I) \tag{3}$$




and we have the problem of determining the best estimate of the actual value of θ from the observed data, which is often assumed to be the value at which the likelihood function is maximized:

$$\hat{\theta} = \operatorname{argmax}_{\theta}\, L(\theta \mid D) \tag{4}$$


In the present case, θ is a single parameter, the number of emitted photons, n, and D is the fact that we have observed c photocounts. P(D | θI) becomes P(c|nϕs). But if n is known, then knowledge of s is irrelevant, so P(c|nϕs) = P(c|nϕ), which is given by the binomial distribution:

$$P(c \mid n\varphi) = \frac{n!}{c!\,(n-c)!}\; \varphi^{c}\, (1-\varphi)^{n-c} \tag{5}$$

For ϕ = 0.1 and c = 15, it is easy enough to verify that the value of n that maximizes this likelihood is 150, which is just what many of us would expect.
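A quick numerical check of this maximization, using scipy's binomial pmf (a sketch of mine; note that the discrete likelihood is actually flat between n = 149 and n = 150, both of which match the intuitive estimate c/ϕ = 150):

```python
import numpy as np
from scipy import stats

phi = 0.1   # detector efficiency
c = 15      # observed counts

# Likelihood L(n) = P(c | n, phi) for a range of candidate emission numbers n.
n_values = np.arange(c, 400)
likelihood = stats.binom.pmf(c, n_values, phi)

best = n_values[np.argmax(likelihood)]
print(best)   # 149 or 150: the likelihood ties across these two values,
              # matching the intuitive maximum-likelihood estimate c/phi = 150
```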

Lets examine this solution, though, by considering a different problem. Suppose I am interested in the number of flies in Utrecht, the city I live in. I hang a fly paper in my kitchen in order to gather some data (don’t worry, lab experiments on flies don’t require ethical approval). After one day, there are 3 flies trapped. Unfortunately, I don’t know the efficiency of my detector (the fly paper), but suppose I have good reasons to accept that the average number trapped is proportional to the total number in the population. On day 2, I count 6 new casualties on the fly paper, and announce the astonishing result that in one day, the number of flies in Utrecht has doubled. You’d hopefully respond that I am bonkers - clearly the experiment is nowhere near sensitive enough to determine such a thing. Yet this is essentially the result that we have declared above, under the advice of those who champion the method of maximum likelihood for all parameter estimation problems.


The following data are the result of a Monte-Carlo simulation of the emitter and detector described in the original question:



These data represent 32,000 seconds of simulated emissions. First, I generated 32,000 random numbers, uniformly distributed between 0 and 1. Then, for each of these, I scanned along the Poisson distribution in equation (1) until the cumulative probability first exceeded that random number. This gave the number of emitted photons, n, in each simulated second. Then, for each of these n, I generated a further n uniform random numbers; for each one less than or equal to the detector efficiency, ϕ, one detected photon was added to the count for that second. Each green dot in the graph above, therefore, displays the emissions and counts for a single second.
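For anyone wanting to reproduce this, here is a compact Python sketch of the same simulation; I use numpy's built-in Poisson and binomial samplers rather than the inverse-CDF scan described above, which is equivalent but shorter.

```python
import numpy as np

rng = np.random.default_rng(0)

s = 100        # average emission rate (photons per second)
phi = 0.1      # detector efficiency
seconds = 32_000

# Number of photons emitted in each simulated second (Poisson, mean s).
emitted = rng.poisson(s, size=seconds)

# Each emitted photon is detected independently with probability phi,
# so the counts in each second are binomially thinned emissions.
counts = rng.binomial(emitted, phi)

# Slice through the data at 15 detected counts, as in the histogram below.
emissions_given_15 = emitted[counts == 15]
print(len(emissions_given_15))     # how many seconds gave exactly 15 counts (order of a thousand)
print(emissions_given_15.mean())   # ~105, not 150
print((emitted == 150).sum())      # seconds with exactly 150 emissions (almost certainly zero)
```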

The data confirm that 150 emissions in a single second is an extremely rare event (didn’t occur once in the experiment), while 15 photo-counts occurred on very many (more than a thousand) occasions. From the graph above, it appears that the ‘middle’ of the distribution for 15 counts corresponds to somewhere only slightly above 100 emissions. We can get a better view by taking a slice through this graph along 15 counts, and histogramming the numbers of emissions that were responsible for those counts:






The peak of this histogram is at 105 emitted photons, which is also the average number (rounded) of emitted photons resulting in 15 counts. This is evidently our approximate solution.

The exact solution follows from Bayes’ theorem, and has been provided by Jaynes. The general statement of Bayes’ theorem, again, is


$$P(A \mid B I) = \frac{P(A \mid I)\, P(B \mid A I)}{P(B \mid I)} \tag{6}$$


Filling in the notation I have defined above, this becomes

$$P(n \mid \varphi c s) = \frac{P(n \mid \varphi s)\; P(c \mid n \varphi s)}{P(c \mid \varphi s)} \tag{7}$$

As noted, if we know the exact number of photons emitted, then no amount of information about the source strength can improve our ability to estimate the expected number of counts. Also, the efficiency of the detector is irrelevant to the number of photons emitted, so
$$P(n \mid \varphi c s) = \frac{P(n \mid s)\; P(c \mid n \varphi)}{P(c \mid \varphi s)} \tag{8}$$

Two of these distributions are specified above, in equations (1) and (5).

P(n|s) and P(c|nϕ) can be combined to derive the probability distribution for the number of counts given the source strength, which, in line with intuition, is again Poissonian:
$$P(c \mid \varphi s) = \frac{e^{-s\varphi}\, (s\varphi)^{c}}{c!} \tag{9}$$

Combining all these, and canceling terms, we get
$$P(n \mid \varphi c s) = \frac{e^{-s(1-\varphi)}\, \left[\, s(1-\varphi) \,\right]^{\,n-c}}{(n-c)!} \tag{10}$$

This is, yet again, a Poisson distribution - in the variable n − c, with mean s(1-ϕ) - shifted up by c, since, with no dark counts, the number emitted can not be less than the number detected. The mean of this distribution is, therefore:

$$\langle n \rangle = c + s(1 - \varphi) \tag{11}$$


This gives 15 + 100×0.9 = 105 expected emissions, in the case of 15 detected photons. Counterintuitively, the 5 additional counts, above the expected 10, are only evidence for 5 more photons emitted, above the average.
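The same answer can be checked numerically by applying Bayes' theorem on a grid of candidate n, without using the closed form in equation (10) at all; a quick sketch, using only the distributions defined above:

```python
import numpy as np
from scipy import stats

s, phi, c = 100, 0.1, 15

# Posterior over n is proportional to P(n | s) * P(c | n, phi)  (equation 8).
n = np.arange(c, 400)                          # n cannot be less than c
prior = stats.poisson.pmf(n, s)                # P(n | s), equation (1)
likelihood = stats.binom.pmf(c, n, phi)        # P(c | n, phi), equation (5)

posterior = prior * likelihood
posterior /= posterior.sum()                   # normalization replaces P(c | phi s)

print((n * posterior).sum())                   # expectation ~105, as in equation (11)
print(n[np.argmax(posterior)])                 # posterior peak, ~105 (cf. the histogram above)
```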

The figure obtained by maximum likelihood can be seen to be another instance of the base-rate fallacy: 150 is the value of n that maximizes P(c|nϕs), as n is varied. This is the essence of what maximum likelihood parameter estimation does - given a model with parameter θ, the maximum likelihood estimate for θ is given by argmax[P(data | θ)]. In this case, it takes no account of the probability distribution for the number of emissions, n. What we really needed was the expectation (approximately the maximum) of P(n | cϕs), which is proportional to P(n | s) × P(c | nϕ). As with the fly paper, the efficiency of the photo detector (at 10%) was too low to permit us to claim anything close to a 50% excess over the average 100 emissions per second.

Under the right circumstances, maximum likelihood can do a very good job of parameter estimation, but we see here a situation where it fails miserably. As with all rational procedures, its success is determined entirely by its capability to mimic the results of Bayes’ theorem. The excess over the average emission rate that it inferred here (50 photons, rather than 5) differed from the Bayesian answer by an order of magnitude, and the reason for that failure was the important information contained in the base rates, P(n|s). In a future post, I’ll discuss another kind of problem where maximum likelihood can give a badly flawed impression, even when the prior distribution conveys little or no information.


Thursday, March 29, 2012

The Base Rate Fallacy



Here is a simple puzzle:

A man takes a diagnostic test for a certain disease and the result is positive. The false positive rate for the test in this case is the same as the false negative rate, 0.001. The background prevalence of the disease is 1 in 10,000. What is the probability that he has the disease?

This problem is one of the simplest possible examples of a broad class of problems, known as hypothesis testing, concerned with defining a set of mutually contradictory statements about the world (hypotheses) and figuring out some kind of measure of the faith we can have in each of them.

It might be tempting to think that the desired probability is just 1 − (false-positive rate), which would be 0.999. Be warned, however, that this is quite an infamous problem. In 1982, a study was published1 in which 100 physicians had been asked to solve an equivalent question. All but 5 got the answer wrong by a factor of about 10. Maybe it’s a good idea, then, to go through the logic carefully.

Think about the following:

  • What values should the correct answer depend on?

  • Other than reducing the false-positive rate, what would increase the probability that a person receiving a positive test result would have the disease?



The correct calculation needs to find some kind of balance between the likelihood that the person has the disease (the frequency with which the disease is contracted by similar people) and the likelihood that the positive test result was a mistake (the false positive rate). We should see intuitively that if the prevalence of the disease is high, the probability that any particular positive test result is a true positive is higher than if the disease is extremely rare.

The rate at which the disease is contracted is 1 in 10,000 people, so, to make it simple, we will imagine that we have tested 10,000 people. We therefore expect 1 true case of the disease. We also expect about 10 false positives, so our estimate drops from 0.999 to 1 in 11, or 0.09091. This answer is very close, but not precisely correct.
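In code, this frequency argument looks like the following (a rough sketch, using only the numbers given in the puzzle):

```python
population = 10_000
prevalence = 1 / 10_000
false_positive_rate = 0.001

expected_true_cases = population * prevalence        # 1 person actually has the disease
expected_false_positives = (population - expected_true_cases) * false_positive_rate  # ~10 healthy people flagged

rough_answer = expected_true_cases / (expected_true_cases + expected_false_positives)
print(rough_answer)    # ~1 in 11, about 0.091 - an order of magnitude below 0.999
```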

The frequency with which we see true positives must also be reduced slightly by the possibility of false negatives. How do we encode that in our calculation?

We require the conditional probability that the man has the disease, given that his test result was positive, P(D|R+). This is the number of ways of getting a positive result and having the disease, divided by the total number of ways of getting a positive test result,

$$P(D \mid R^{+}) = \frac{P(R^{+} D)}{P(R^{+} D) + P(R^{+} C)} \tag{1}$$

where D is the proposition that he has the disease, C means he is clear, and R+ denotes the positive test result.

If we ask what is the probability of drawing the ace of hearts on the first draw from a deck of cards and the ace of spades on the second, without replacing the first card before the second draw, we have $P(A_H A_S) = P(A_H)\, P(A_S \mid A_H)$. The probability for the second draw is modified by what we know to have taken place on the first.

Similarly, P(R+D) = P(D)P(R+|D), and P(R+C) = P(C)P(R+|C), so


$$P(D \mid R^{+}) = \frac{P(D)\, P(R^{+} \mid D)}{P(D)\, P(R^{+} \mid D) + P(C)\, P(R^{+} \mid C)} \tag{2}$$



  • P(D) is the background rate for the disease.
  • P(R+|D) is the true positive rate, equal to 1 – (false negative rate).
  • P(C) = 1 – P(D).
  • P(R+|C) = false positive rate



So

$$P(D \mid R^{+}) = \frac{0.0001 \times 0.999}{0.0001 \times 0.999 + 0.9999 \times 0.001} \tag{3}$$

which is approximately 0.0908.
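Equation (2) is simple enough to check directly; here is a short Python sketch with the numbers from the puzzle plugged in:

```python
def posterior_disease(prevalence, false_positive_rate, false_negative_rate):
    """P(D | R+): probability of disease given a positive result (equation 2)."""
    p_d = prevalence
    p_c = 1.0 - p_d
    p_pos_given_d = 1.0 - false_negative_rate   # true-positive rate
    p_pos_given_c = false_positive_rate
    return (p_d * p_pos_given_d) / (p_d * p_pos_given_d + p_c * p_pos_given_c)

print(posterior_disease(prevalence=1e-4,
                        false_positive_rate=0.001,
                        false_negative_rate=0.001))   # ~0.0908, not 0.999
```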

The formula we have arrived at above, by simple application of common sense, is known as Bayes’ theorem. Many people assume the answer to be more like 0.999, but the correct answer is an order of magnitude smaller. As mentioned, most medical doctors also get questions like this wrong by about an order of magnitude. The correct answer to the question, roughly 0.09, is called in medical science the positive-predictive value of the test. Generally, it is known as the posterior probability.

Bayes’ theorem has been a controversial idea during the development of statistical reasoning, with many authorities dismissing it as an absurdity. This has led to the consequence that orthodox statistics, still today, does not employ this vitally important technique. Here, we have developed a special case of Bayes’ theorem by simple reasoning. In full generality, it follows as a straightforward re-arrangement of probabilistic laws (the product and sum rules) that are so simple that most authors treat them as axioms, but which in fact can be rigorously derived (with a little effort) from extremely simple and perfectly reasonable principles. It is overwhelmingly one of my central beliefs about science that a logical calculus of probability can only be achieved, and the highest quality inferences extracted from data, when Bayes’ theorem is accepted and applied whenever appropriate.

The general statement of Bayes’ theorem is


$$P(A \mid B I) = \frac{P(A \mid I)\, P(B \mid A I)}{P(B \mid I)} \tag{4}$$


Here 'I' represents the background information: a set of statements concerning the scope of the problem that are considered true for the purposes of the calculation. In working through the medical testing problem, above, I have omitted the 'I', but in every case where I write down a probability without including the 'I', this is to be recognized as shorthand - the 'I' is always really there, and the calculation makes no sense without it.

The error that leads many people to overestimate, by an order of magnitude, probabilities such as the one required in this question is known as the base-rate fallacy. Specifically in this case, the base rate, or expected incidence, of the disease has been ignored, leading to a calamitous miscalculation. The base-rate fallacy amounts to believing that P(A|B) = P(B|A). In the above calculation this corresponds to saying that P(D|R+), which was desired, is the same as P(R+|D), the latter being equal to 1 − (false negative rate) = 0.999.

In frequentist statistics, a probability is identified with a frequency. In this framework, therefore, it makes no sense to ask what the probability is that a hypothesis H is true, since there is no sense in which a relative frequency for the truth of H can be obtained. As a measure of faith in the proposition H in light of data, D, therefore, the frequentist habitually uses not P(H|D), but P(D|H), and so commits himself to the base-rate fallacy.

In case it is still not completely clear that the base-rate fallacy is indeed a fallacy, let's employ a thought experiment with an extreme case. (These extreme cases, while not necessarily realistic, allow the desired outcome of a theory to be obtained directly and compared with the result of the theory - something computer scientists call a 'sanity check'.) Imagine the case where the base rate is higher than the sensitivity of the test. For example, let the sensitivity be 98% (i.e. a 2% false negative rate) and let the background prevalence of the disease be 99%. Then P(B|A) is 0.98, and substituting this for P(A|B) gives an answer that is lower than P(A) = 0.99: the positive result of a high-quality test would leave us less confident that the test subject has the disease than we were before the test result was known, which is clearly absurd.




[1] Eddy, D. M. (1982). Probabilistic reasoning in clinical medicine: Problems and opportunities. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 249–267). Cambridge, England: Cambridge University Press. (In this study 95 out of 100 physicians answered between 0.7 and 0.8 to a similar question, to which the correct answer was 0.078.)