Saturday, April 18, 2015

The Fundamental Confidence Fallacy

The title of this post comes from an excellent recent paper (as far as I can tell, still in draft form) on misunderstandings of confidence intervals. The paper, 'The fallacy of placing confidence in confidence intervals', by R. D. Morey et al.1, is by almost exactly the same set of authors whose earlier paper on a very similar topic I criticized before, but the current paper does a far better job of explaining the authors' position and arguing for it.

The authors identify the fundamental confidence fallacy (FCF) as believing automatically that,
If the probability that a random interval contains the true value is X%, then the plausibility (or probability) that a particular observed interval contains the true value is also X%.

A confidence interval, a kind of error bar, is a device used in problems of parameter estimation (e.g. what is the age of the universe? how much rain will fall tomorrow? which gene is most responsible for making me so damn handsome?). As well as calculating a point estimate, such as the most probable value of a parameter, a researcher working on some relevant data may also provide a confidence interval, indicating a region around the parameter's point estimate, in which (hopefully) one can expect the true value of the parameter to reside. This is done because sampling errors (noise) in the data collection process will typically result in the point estimate being not exactly equal to the true value of the parameter.

In general, the point of such error bars is that a very broad error bar indicates a not very precise measurement, where our confidence that the point estimate is very close to the true value is low, and vice versa.

Morey et al. show, however, that the traditional definition of the confidence interval is too broad to automatically satisfy the general requirements of error bars - hence their contention that the belief summarized in FCF (above) is indeed a fallacy.

Here's the conventional definition of confidence intervals that they take from the literature:
A X% confidence interval for a parameter θ is an interval (L, U) generated by an algorithm that in repeated sampling has an X% probability of containing the true value of θ.
The problem that they rightly identify about this definition is that it fails to account for differences in how informed one is, before and after the data have been gathered. They demonstrate this with a beautifully simple thought experiment about a lost submarine, which I'll try to explain:

The crew of a boat want to drop a rescue line down to the hatch of a 10 m long submarine. They don't know the exact location of the sub, but they know its length, and they know that it produces distinctive bubbles from uniformly distributed locations along its length. They also know that the hatch is exactly half way along the sub's length. They decide to watch for bubbles, to infer the sub's position, but they want to launch their rescue attempt quickly, so they decide to do so as soon as their 50% confidence interval for the location of the hatch is sufficiently narrow.

They reason that for 2 randomly positioned bubbles, the hatch is equally likely to be between them as not, so for a two-bubble data set, they devise the following confidence interval:

⟨x⟩ ± Δx / 2

where ⟨x⟩ is the average position of the 2 bubbles, and Δx is their separation - i.e. the 50% confidence interval is defined exactly by the positions of the 2 bubbles. Note that this confidence interval satisfies perfectly the definition given above.

Unfortunately, when the bubbles rise, they do so with a separation of only 1 cm. The rescuers calculate a 1 cm wide confidence interval, and, falling for the fundamental confidence fallacy, they infer that the hatch has a 50% probability to be in this very narrow region.

In reality, though, this extremely (and spuriously) precise inference has been drawn from almost maximally uninformative data. The bubbles could have arisen from either end of the submarine, or anywhere in between, meaning that the hatch could be located anywhere within a 10 m interval. The probability that the hatch is between the 2 bubbles is about as low as it can be.

On the other hand, had the bubbles been 10 m apart (indicating that they came from opposite ends of the sub), the rescuers would have been able to infer the exact location of the hatch, but from their adopted confidence procedure would have obtained a 10 m wide C. I., and hence would have wanted to wait for more data, perhaps losing their only chance to complete the rescue.
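The failure of the FCF here is easy to check numerically. The following sketch (a Monte Carlo simulation of the bubble procedure, with the hatch placed at the origin purely for convenience) confirms that the interval has exactly 50% coverage over repeated sampling, as the conventional definition promises, while its coverage conditional on having obtained a narrow interval is far lower:

```python
import random

random.seed(0)

HATCH = 0.0        # true hatch location (centre of the 10 m sub)
HALF_LEN = 5.0     # bubbles arise uniformly on [HATCH - 5, HATCH + 5]

def one_trial():
    x1 = random.uniform(HATCH - HALF_LEN, HATCH + HALF_LEN)
    x2 = random.uniform(HATCH - HALF_LEN, HATCH + HALF_LEN)
    lo, hi = min(x1, x2), max(x1, x2)  # the 50% CI is exactly the bubble span
    return hi - lo, (lo <= HATCH <= hi)

N = 200_000
trials = [one_trial() for _ in range(N)]

overall = sum(hit for _, hit in trials) / N
narrow = [(w, hit) for w, hit in trials if w < 0.5]  # 'narrow' intervals only
narrow_cov = sum(hit for _, hit in narrow) / len(narrow)

print(f"unconditional coverage:          {overall:.3f}")     # ~0.50
print(f"coverage given width < 0.5 m:    {narrow_cov:.3f}")  # far below 0.50
```

The first figure is the long-run property the rescuers relied on; the second shows what that property is actually worth once a narrow interval is in hand.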

The problem with the conventional definition of confidence intervals is that it is set up with respect to the set of all possible measurement outcomes, rather than the specific measurement outcome that occurred. An inference that is valid when the data are not known can hardly be expected to remain necessarily valid when the data are known, but the standard wisdom regarding confidence intervals ignores this.

To remedy this, I've always advocated a somewhat different, unconventional definition of confidence intervals:
An X% confidence interval is a subset of the hypothesis space that has an X% posterior probability to contain the true state of the world.
(On a one-dimensional hypothesis space, this subset would be defined by lower and upper bounds, (L, U), exactly as appear in the above conventional definition.)

This definition also satisfies the conventional one, but is narrower in a way that eliminates its worst problems.

This is essentially the same recommendation made by Morey et al. (they prefer to call it a credence interval). Under a definition of this kind, the fundamental confidence fallacy disappears - if I give you a parameter estimate with a 95% confidence interval, then (i) 95% is the probability that the true value of the parameter lies within the interval and (ii) a narrow confidence interval necessarily corresponds to a precise determination of the parameter (and vice versa).
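For the submarine, an interval of this posterior kind is easy to write down. Assuming a flat prior for the hatch location, each bubble at position x implies the hatch lies within 5 m of x, so the posterior is uniform on the intersection of those constraints. A minimal sketch (the function name and the choice of a central interval are my own):

```python
def credible_interval_50(x1, x2, half_len=5.0):
    """Central 50% posterior interval for the hatch location, flat prior.

    Each bubble at position x constrains the hatch to [x - half_len,
    x + half_len]; with two bubbles the posterior is uniform on the
    overlap of the two constraints.
    """
    lo = max(x1, x2) - half_len
    hi = min(x1, x2) + half_len
    mid = 0.5 * (lo + hi)
    q = 0.25 * (hi - lo)  # central 50% of a uniform distribution
    return mid - q, mid + q

# 1 cm separation: almost no information, so the interval is honestly wide
print(credible_interval_50(0.0, 0.01))   # ~5 m wide
# 10 m separation: the hatch location is pinned down exactly
print(credible_interval_50(-5.0, 5.0))   # zero width
```

With 1 cm bubble separation this yields an interval roughly 5 m wide, and with 10 m separation it collapses to a point, exactly reversing the pathological behaviour of the confidence procedure above.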

Under favourable conditions, many of the common frequentist confidence procedures do a reasonable job of approximating these desiderata, but I feel strongly (as, presumably, do Morey et al.) that it is far better to start with and understand a sensible definition, and hence understand when our approximate methods are (and are not) valid, than to sweep such validity issues under the carpet - pretend they don't exist - and proceed ab initio from nonsense.

Why my criticism of the earlier paper is still valid:

Now that Morey et al. have produced this very good paper, I can better appreciate what they were trying to get at, in their earlier paper, 'Robust misinterpretation of confidence intervals'2, and I can better explain the nature of the error they made in it.

In the earlier paper, the authors described providing a sample of scientists with a questionnaire to assess their understanding of confidence intervals. They asked:
Professor Bumbledorf conducts an experiment, analyzes the data and reports, "the 95% confidence interval for the mean ranges from 0.1 to 0.4." Which of the following statements are true:
 They then listed a number of statements, including
There is a 95% probability that the true mean lies between 0.1 and 0.4.
They went on to claim that this statement, and several others of related forms, were false. They were wrong. The reason they were wrong is essentially the same reason that the statement of FCF, above, is indeed a fallacy: they failed to appreciate that a person who has seen the data may make a different probability assignment from a person who has not. I have not seen Prof. Bumbledorf's data, so the statement immediately above this paragraph is correct as far as I'm concerned (as I proved straightforwardly in the earlier post, using the Bernoulli urn rule). For Bumbledorf, however, who is aware of the data, the statement is not guaranteed to be true (depending on the confidence procedure he used).

The authors fell foul of a common form of mind-projection fallacy, by acting as if there is one true probability distribution, independent of one's state of information.

For their questionnaire to have worked the way they wanted, the authors should have asked something like '... which of the following statements are necessarily valid for Bumbledorf to make?'


The traditional definition of the confidence interval allows for methods of calculating error bars that do not satisfy the basic requirements of error bars:

  1. The posterior probability contained in an X% interval may not be X%, and (as seen in the submarine example) may be drastically smaller: the probability for the parameter to be inside the confidence interval may be much less than the claimed confidence level.
  2. The definition allows for procedures that produce a narrow interval in cases where the measurement is very imprecise, and a very broad interval in cases where the parameter can be inferred exactly.
As always - in all aspects of life - correct reasoning is Bayesian reasoning. By calculating error bars from posterior distributions, or algorithms that deliberately strive to approximate them, these problems with conventionally defined confidence intervals are immediately dissolved.


 [1]  R.D. Morey, R. Hoekstra, M.D. Lee, J.N. Rouder, and E.-J. Wagenmakers, 'The fallacy of placing confidence in confidence intervals,' available in draft form, here
 [2]  R. Hoekstra, R.D. Morey, J.N. Rouder, and E.-J. Wagenmakers, 'Robust misinterpretation of confidence intervals', Psychonomic Bulletin & Review, January 2014 (link)


  1. Hi Tom, I just saw your blog post here. Thanks for the summary. I would like to object to your characterisation of the Hoekstra et al. survey: note that in the survey, the participants were asked to note which of the responses *logically follow* from the information given. In the sense of the paper, "false" means "this statement is false in the sense that I cannot infer the statement from the information about the CI." It is certainly true that none of the statements logically follow. It is also certainly true that under some conditions the statements might be true, but inferring this would require information that was not stated in the problem. Best, Richard Morey

    1. Hi Richard

      Many thanks for your comment. Strictly, you are correct that we cannot say that a probability assignment follows logically, without specifying a probability model. As there was no probability model prescribed in the survey question, however, I was free to supply my own, and as there was no information supplied to adjust my probability estimate either above or below the 95% confidence reported, I shot down the middle, as indifference dictates. The situation is analogous to the submarine example, where instead of being supplied the separation between a pair of bubbles, we are only aware of the calculated confidence interval, and we have to determine the desired probability - all we have to go on is the defined 'long-run' behaviour of confidence intervals.

      You say that "under some conditions the statements might be true, but inferring this would require information that was not stated in the problem." Maybe I misunderstand your meaning, but this strikes me as strange. It sounds as if you feel that a probability is something out there, waiting to be discovered, as soon as the required evidence comes to light.

      My view is that probability theory is the machine we use for quantifying how much we know, in the presence of missing information. As long as a question is meaningfully constructed, there can be no situation in which there is not enough information to form a probability estimate. Otherwise, how would we ever accumulate enough knowledge to get started?

      Thanks again for your comment.

    2. I don't feel that "probability is something out there". What I'm skeptical of is that there is a meaningful way of marginalizing over the possible probability models and confidence procedures, and hence I'm skeptical that anyone is compelled to adopt a probability assignment. There are, after all, an uncountable number of probability models, and for each of these probability models there are an uncountable number of 95% confidence procedures. "Many" of these are trivial, having 0 probability of containing the true value. These can't be described in any sort of "space" that I'm aware of. In typical scenarios where the principle of indifference is applied, there are natural symmetries or invariances in the problem that allow one to apply the principle. I don't see that here, but maybe I'm missing something obvious.

    3. Maybe I didn't explain myself clearly. I don't think we need to marginalize over probability models and confidence procedures.

      From your paper:

      "A X% confidence interval for a parameter theta is an interval (L, U) generated by an algorithm that in repeated sampling has an X% probability of containing the true value of theta."

      Thus, if Professor Bumbledorf conducts 100 experiments, and reports a (valid) 95% CI for each, and I draw one of them at random from a hat, then 95 times out of 100, I expect to get one that contains the true value of theta - there is a 95% probability that it will contain the true value (the Bernoulli urn rule). After a single measurement, but without seeing his data, I'm in the same position of randomly sampling from the hat. All I have to go on is the CI and its associated properties. Regardless of the probability model and the shape of the posterior, the integral from L to U is thus 0.95, by definition.
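      To make the urn picture concrete, here is a quick simulation of the long-run property appealed to above. The normal-mean model and the standard z-interval are purely my illustrative choice of a valid 95% procedure, standing in for whatever Bumbledorf actually did:

      ```python
      import random
      import statistics

      random.seed(1)
      TRUE_MEAN, SIGMA, N_OBS = 0.25, 1.0, 30
      Z95 = 1.959964  # two-sided 95% point of the standard normal

      def one_ci():
          """One experiment: sample data, return the standard 95% z-interval."""
          data = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N_OBS)]
          m = statistics.fmean(data)
          half = Z95 * SIGMA / N_OBS ** 0.5
          return m - half, m + half

      # Bumbledorf runs many experiments; we draw the reported intervals
      # 'from the hat' and count how often they contain the true mean.
      TRIALS = 20_000
      hits = sum(lo <= TRUE_MEAN <= hi for lo, hi in
                 (one_ci() for _ in range(TRIALS)))

      print(f"fraction containing the true mean: {hits / TRIALS:.3f}")  # ~0.95
      ```

      Knowing only that the procedure is a valid 95% CI, and nothing about any individual data set, 95% is the probability one is entitled to assign to a randomly drawn interval.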