Maximum Entropy: The Fundamental Confidence Fallacy

The title of this post comes from an excellent recent paper (as far as I can tell, still in draft form) on misunderstandings of confidence intervals. The paper, 'The fallacy of placing confidence in confidence intervals', by R. D. Morey et al.¹ is by almost exactly the same set of authors whose earlier paper on a very similar topic I criticized, before, but the current paper does a far better job of explaining the authors' position, and arguing for it.

The authors identify the fundamental confidence fallacy (FCF) as believing automatically that,

If the probability that a random interval contains the true value is X%, then the plausibility (or probability) that a particular observed interval contains the true value is also X%.

A confidence interval, a kind of error bar, is a device used in problems of parameter estimation (e.g. what is the age of the universe? how much rain will fall tomorrow? which gene is most responsible for making me so damn handsome?). As well as calculating a point estimate, such as the most probable value of a parameter, a researcher working on some relevant data may also provide a confidence interval, indicating a region around the parameter's point estimate, in which (hopefully) one can expect the true value of the parameter to reside. This is done because sampling errors (noise) in the data collection process will typically result in the point estimate being not exactly equal to the true value of the parameter.

In general, the point of such errors bars is that a very broad error bar indicates a not very precise measurement, where our confidence that the point estimate is very close to the true value is low, and vice versa.

Morey et al. show, however, that the traditional definition of the confidence interval is too broad to automatically satisfy the general requirements of error bars - hence their contention that the belief summarized in FCF (above) is indeed a fallacy.

Here's the conventional definition of confidence intervals that they take from the literature:

A X% confidence interval for a parameter θ is an interval (L, U) generated by an algorithm that in repeated sampling has an X% probability of containing the true value of θ.

The problem that they rightly identify about this definition is that it fails to account for differences in how informed one is, before and after the data have been gathered. They demonstrate this with a beautifully simple thought experiment about a lost submarine, which I'll try to explain:

The crew of a boat want to drop a rescue line down to the hatch of a 10 m long submarine. They don't know the exact location of the sub, but they know its length, and they know that it produces distinctive bubbles from uniformly distributed locations along its length. They also know that the hatch is exactly half way along the sub's length. They decide to watch for bubbles, to infer the sub's position, but they want to launch their rescue attempt quickly, so they decide to do so as soon as their 50% confidence interval for the location of the hatch is sufficiently narrow.

They reason that for 2 randomly positioned bubbles, the hatch is equally likely to be between them as not, so for a two-bubble data set, they devise the following confidence interval:

⟨x⟩ ± Δx / 2

where ⟨x⟩ is the average position of the 2 bubbles, and Δx is their separation - i.e. the 50% confidence interval is defined exactly by the positions of the 2 bubbles. Note that this confidence interval satisfies perfectly the definition given above.

Unfortunately, when the bubbles rise, they do so with a separation of only 1 cm. The rescuers calculate a 1 cm wide confidence interval, and, falling for the fundamental confidence fallacy, they infer that the hatch has a 50% probability to be in this very narrow region.

In reality, though, this extremely (and spuriously) precise inference has been drawn from almost maximally uninformative data. The bubbles could have arisen from either end of the submarine, or anywhere in between, meaning that the hatch could be located anywhere within a 10 m interval. The probability that the hatch is between the 2 bubbles is about as low as it can be.

On the other hand, had the bubbles been 10 m apart - indicating that they came from opposite ends, the rescuers would have been able to infer the exact location of the hatch, but from their adopted confidence procedure would have obtained a 10 m wide C. I., and hence would have wanted to wait for more data, perhaps losing their only chance to complete the rescue.

The problem with the conventional definition of confidence intervals is that it is set up with respect to the set of all possible measurement outcomes, rather than the specific measurement outcome that occurred. An inference that is valid when the data are not known can hardly be expected to remain necessarily valid when the data are known, but the standard wisdom regarding confidence intervals ignores this.

To remedy this, I've always advocated a somewhat different, unconventional definition of confidence intervals:

An X% confidence interval is a subset of the hypothesis space that has an X% posterior probability to contain the true state of the world.

(On a one-dimensional hypothesis space, this subset would be defined by lower and upper bounds, (L, U), exactly as appear in the above conventional definition.)

This definition also satisfies the conventional one, but is narrower in a way that eliminates its worst problems.

This is essentially the same recommendation made by Morey et al. (they prefer to call it a credence interval). Under a definition of this kind, the fundamental confidence fallacy disappears - if I give you a parameter estimate with a 95% confidence interval, then (i) 95% is the probability that the true value of the parameter lies within the interval and (ii) a narrow confidence interval necessarily corresponds to a precise determination of the parameter (and vice versa).

Under favourable conditions, many of the common frequentist confidence procedures do a reasonable job of approximating these desiderata, but I feel strongly (as, presumably, do Morey et al.) that it is far better to start with and understand a sensible definition, and hence understand when our approximate methods are (and are not) valid, than to sweep such validity issues under the carpet - pretend they don't exist - and proceed ab initio from nonsense.

Why my criticism of the earlier paper is still valid:

Now that Morey et al. have produced this very good paper, I can better appreciate what they were trying to get at, in their earlier paper, 'Robust misinterpretation of confidence intervals'², and I can better explain the nature of the error they made in it.

In the earlier paper, the authors described providing a sample of scientists with a questionnaire to assess their understanding of confidence intervals. The asked:

Professor Bumbledorf conducts an experiment, analyzes the data and reports, "the 95% confidence interval for the mean ranges from 0.1 to 0.4." Which of the following statements are true:

They then listed a number of statements, including

There is a 95% probability that the true mean lies between 0.1 and 0.4.

They went on to claim that this statement, and several others of related forms were false. They were wrong. The reason they were wrong is essentially the same reason that the statement of FCF, above is indeed a fallacy: they failed to appreciate that a person who has seen the data may have a different probability assignment to a person who has not. I have not seen Prof. Bumbledorf's data, so the statement immediately above this paragraph is correct as far as I'm concerned (as I proved straightforwardly in the earlier post, using the Bernoulli urn rule). For Bumbledorf, however, who is aware of the data, the statement is not guaranteed to be true (depending on the confidence procedure he used).

The authors fell foul of a common form of mind-projection fallacy, by acting as if there is one true probability distribution, independent of one's state of information.

For their questionnaire to have worked the way they wanted, the authors should have asked something like '... which of the following statements are necessarily valid for Bumbledorf to make?'

Conclusion

The traditional definition of the confidence interval allows for methods of calculating errors bars that do not satisfy the basic requirements of error bars:

The integral over an X% interval may not be X%, and (as seen in the submarine example) may be drastically smaller. The probability for a parameter to be inside the confidence interval may be much less than the claimed confidence level.
The definition allows for procedures that produce a narrow interval in cases where the measurement is very imprecise, and a very broad interval, where the parameter can be inferred exactly.

As always - in all aspects of life - correct reasoning is Bayesian reasoning. By calculating error bars from posterior distributions, or algorithms that deliberately strive to approximate them, these problems with conventionally defined confidence intervals are immediately dissolved.

References

[1]	R.D. Morey, R. Hoekstra, M.D. Lee, J.N. Rouder, and E.-J. Wagenmakers, 'The fallacy of placing confidence in confidence intervals,' available in draft form, here
[2]	R. Hoekstra, R.D. Morey, J.N. Rouder, and E.-J. Wagenmakers, 'Robust misinterpretation of confidence intervals', Psyconomic Bulletin & Review, January 2014 (link)

4 comments:

Richard MoreyJuly 21, 2015 at 7:34 AM
Hi Tom, I just saw your blog post here. Thanks for the summary. I would like to object to your characterisation of the Hoekstra et al survey: note that in the survey, the participants were asked to note which of the responses *logically follow* from the information given. In the sense of the paper, "false" is "this statement does is false in the sense that I cannot infer the statement from the information about the CI." It is certainly true that none of the statements logically follow. It is also certainly true that under some conditions the statements might be true, but inferring this would require information that was not stated in the problem. Best, Richard Morey

Saturday, April 18, 2015

The Fundamental Confidence Fallacy

4 comments: