Saturday, March 22, 2014

Whose confidence interval is this?

This week, yet again, I was confronted by another facet of the nonsensical nature of the frequentist approach to statistics. The blog of Andrew Gelman drew my attention to a recent peer-reviewed paper studying the extent of misunderstanding of the meaning of confidence intervals among students and researchers. What shocked me, though, was not only the findings of the study.

Confidence intervals are a relatively simple idea in statistics, used to quantify the precision of a measurement. When a measurement is subject to statistical noise, the result is not going to be exactly equal to the parameter under investigation. For a high quality measurement, where the impact of the noise is relatively low, we can expect the result of the measurement to be close to the true value. We can express this expected closeness to the truth by supplying a narrow confidence interval. If the noise is more dominant, then the confidence interval will be wider - we will be less sure that the truth is close to the result of the measurement. Confidence intervals are also known as error bars.

Hoekstra et al., the authors of the paper (see References), asked students and experienced researchers to mark as true or false a number of statements interpreting the meaning of a confidence interval. The results of their survey appear shocking, with large numbers of wrong answers reported, and veteran researchers apparently doing little better than students yet to receive any formal training in statistics.

This is bad. Confidence intervals are good things. Quantifying our knowledge is exactly what science is about, and an assessment of precision is vital to that process. It goes without saying that understanding what has been assessed is also vital, especially for the people doing the assessing!

Just for fun, then, have a go at the survey questions that formed the basis for the data in the paper:

Professor Bumbledorf conducts an experiment, analyzes the data and reports, "the 95% confidence interval for the mean ranges from 0.1 to 0.4." Which of the following statements are true:
1. The probability that the true mean is greater than 0 is at least 95%
2. The probability that the true mean equals 0 is smaller than 5%
3. The null hypothesis that the true mean equals 0 is likely to be incorrect.
4. There is a 95% probability that the true mean lies between 0.1 and 0.4.
5. We can be 95% confident that the true mean lies between 0.1 and 0.4.
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.

The results of the survey are given in the table below. The numbers are the proportion of survey participants who assessed each item to be true:

Item   1st year students (n = 442)   Masters students (n = 34)   Researchers (n = 118)
1      51%                           32%                         38%
2      55%                           44%                         47%
3      73%                           68%                         86%
4      58%                           50%                         59%
5      49%                           50%                         55%
6      66%                           79%                         58%

So how do you think you did, in comparison to the survey participants?

According to the authors of the paper, all six statements about the confidence interval are false.

Did you do a double take just now? Did you feel momentarily confused, perplexed, humiliated? I hope so. I certainly felt confused when I read the study description. It seemed to me, on examining the 6 statements, that exactly one of them was false. Go on, take another scan through the list, see if you can pick out the one I identified as false ...

I'll tell you in a moment.

Though Gelman's assessment of the study had vaguely endorsed its message, I judged the study's design to be very seriously flawed. Could I have been badly confused? Did I really know the technical definition of the confidence interval?

Helpfully, the authors of the paper have supplied that for us:

[If] a particular procedure, when used repeatedly across a series of hypothetical data sets, yields intervals that contain the true parameter value in 95% of cases ... the resulting interval is said to be a 95% CI.

'CI,' of course, means 'confidence interval.' This description pretty much meets my expectations. To be sure, I checked a few other sources, and they match this definition perfectly.

So, armed with this technical information, let's work our way through the list. The job is easier, I feel, if we start at item number 4: "There is a 95% probability that the true mean lies between 0.1 and 0.4."

Some basics, to help us out: imagine an urn (an opaque jar) filled with balls of 2 different colours. Suppose there are 100 balls in total, 95 of which are green, the remaining 5 being red. I insert my hand into the urn and blindly extract a ball. What is the probability that the extracted ball is green? Of course, P(green) = 0.95. This follows from very basic symmetry considerations. The indifference principle gives us equal probability to draw any of the balls. Applying the extended sum rule to this result trivially gives us 0.95 as the probability to draw one of the green balls. This result is readily generalized, and is known as the Bernoulli urn rule. It is one of the earliest results of probability theory.
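As a quick sanity check on the urn arithmetic, here is a minimal Python sketch. The function name `green_fraction`, the seed, and the number of simulated draws are illustrative assumptions; only the 95 green and 5 red balls come from the example above.

```python
import random

def green_fraction(n_draws, n_green=95, n_red=5, seed=0):
    """Draw one ball at random (with replacement) n_draws times and
    return the fraction of draws that came up green."""
    rng = random.Random(seed)
    urn = ["green"] * n_green + ["red"] * n_red
    greens = sum(rng.choice(urn) == "green" for _ in range(n_draws))
    return greens / n_draws

# Urn rule: favourable balls divided by total balls.
p_green = 95 / (95 + 5)
print(p_green)

# A long run of simulated draws should land close to the same number.
print(green_fraction(100_000))
```

The first number is the probability assigned by symmetry before any ball is drawn; the second is only a frequency that happens to approximate it.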

We can treat our experiment like an urn. The experiment is a data-generating process, just like the urn. It spits out a sample - not a ball this time, but a sample of data points from a noisy distribution - a sample of data points with an associated confidence interval. We have from the authors' own pen: the 95% CI is produced by a procedure that yields intervals containing the true parameter value on 95% of occasions. So here is the question: what is the probability that the confidence interval we obtained is one of the 95% that contain the true parameter value? Trivially, it is 0.95, a.k.a. 95%, and statement number 4 on the survey is true.
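The "urn of intervals" picture can be simulated directly. The sketch below is an illustration under stated assumptions, not anything taken from the paper: the noise is taken to be normal with a known standard deviation, the interval is the usual mean plus or minus 1.96 standard errors, and the true mean, sample size and seed are arbitrary choices.

```python
import math
import random

def z_interval(sample, sigma, z=1.96):
    """95% interval for the mean of normal data with known sigma."""
    n = len(sample)
    mean = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

def coverage(true_mean, sigma=1.0, n=20, n_experiments=20_000, seed=1):
    """Repeat the experiment many times and return the fraction of
    intervals that contain the (fixed) true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma)
        hits += lo <= true_mean <= hi
    return hits / n_experiments

# Fraction of "green balls": intervals that contain the true mean.
print(coverage(true_mean=0.25))
```

Under these assumptions the printed fraction should sit close to 0.95, which is the proportion of "green" intervals the urn argument relies on.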

With number 4 settled, number 1 is also trivially true. Since there is 95% probability that the true parameter value lies between two positive numbers, there cannot be less than 95% probability that the true parameter value is greater than 0.

Number 2 must also be true, for similar reasons. (In fact, if the parameter space is continuous, then the probability to be some discrete value is zero, anyway.)

If item 2 is true, then I interpret 3 to be also true - probability less than 5% is my idea of unlikely.

Item 5 is somewhat vague, but to my mind it is the same as number 4. Probability provides a numerical measure of confidence.

That leaves number 6, "If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4." This statement is absurdly false. The true mean does not move around. If I am repeating the experiment, i.e. measuring the same parameter again, then its true value is the same as previously. Oddly, this uniquely false statement from the list received the second highest level of endorsement from the survey participants (a strong majority, in fact). There is clearly something to the paper's claim of 'robust misinterpretation of confidence intervals'.

But how could the authors have been so wrong about the other 5 items?

In the frequentist tradition, a probability is a frequency. The probability that a tossed coin lands heads up is 0.5 exactly because in a large number of tosses, half of them will come up heads. Actually, it probably won't be exactly half that are heads, but we might momentarily overcome the resulting feeling of queasiness this definition produces, to consider it as a serious candidate for an understanding of probability.

The big problem we quickly hit, though, becomes apparent when we ask a perfectly reasonable question like, what is the probability that the universe is between 13.7 and 13.9 billion years old? There is no frequency with which this is true; its truth does not vary. Thus, in the frequentist tradition, facts do not have associated probabilities, because facts are either true or false. One really has to wonder, then, what it is the frequentists think they are assigning probabilities to. In this tradition, therefore, one cannot say that some parameter lies in some interval with some probability. It either does or it doesn't.

This raises an obvious question: if the frequentist is barred from calculating the probability that a parameter lies in some interval, how can they calculate their confidence intervals, which, as I showed, amount to the same thing? How can they effectively say that the confidence interval from a repeated experiment will probably contain the parameter's true value? The fact is, they can't. Not without cheating. Not without grossly violating their own system.

The Wikipedia page for Confidence Interval has a simple example of a frequentist calculation of a 95% confidence interval. I actually don't mind this calculation; I think it's a reasonable way (under the right circumstances - i.e. normal approximation is valid) to estimate the precision of a parameter estimate. But, not surprisingly, the calculation produces an equation of the form (where θ is the parameter being estimated, and x1 and x2 are the limits of the confidence interval):

P(x1 ≤ θ ≤ x2) = 0.95

Guess what: this is a probability assignment about the value of θ. Something the frequentist system does not allow. Even though x1 and x2 have been calculated from the current estimate for θ, the Wikipedia article currently includes the somewhat ad hoc looking statement:
This does not mean that there is 0.95 probability of meeting the parameter [θ] in the interval obtained by using the currently computed value of the [estimate for θ].

The offending expression is made not a probability (and therefore not a violation of frequentist dogma) by simply declaring it so. Yay! That was easy.

Of course, to get any probability assignment, the frequentist must, just like anybody else, assume (explicitly or implicitly) a prior distribution, which also violates the frequentist methodology. A safer way to get point estimates and confidence bounds, therefore, involves explicitly formulating a suitable prior, and then operating on a posterior distribution, obtained from Bayes' theorem. If you'd like to see a simple example of such a calculation of a confidence interval, you could try my earlier article on nuisance parameters.
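To give a flavour of what "formulate a prior, then operate on the posterior" can look like, here is a minimal sketch. It is not the calculation from the nuisance-parameters article; it assumes normal noise with a known standard deviation and a normal prior on the mean, for which Bayes' theorem gives a normal posterior in closed form. The function name, the data values and the prior settings are all made up for illustration.

```python
import math

def posterior_interval(sample, sigma, prior_mean, prior_sd, z=1.96):
    """Central 95% posterior interval for the mean of normal data with
    known sigma, under a normal prior N(prior_mean, prior_sd**2)."""
    n = len(sample)
    xbar = sum(sample) / n
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    # Normal prior times normal likelihood gives a normal posterior:
    # precisions add, and the mean is the precision-weighted average.
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * xbar)
    half_width = z * math.sqrt(post_var)
    return post_mean - half_width, post_mean + half_width

sample = [0.31, 0.12, 0.40, 0.22, 0.18, 0.27]  # made-up measurements

# Very broad prior: the interval is roughly (0.09, 0.41), essentially
# the sample mean plus or minus 1.96 standard errors.
print(posterior_interval(sample, sigma=0.2, prior_mean=0.0, prior_sd=100.0))

# Informative prior centred on zero: the interval is pulled toward zero
# and narrows, roughly (0.03, 0.27).
print(posterior_interval(sample, sigma=0.2, prior_mean=0.0, prior_sd=0.1))
```

With a very broad prior the result coincides numerically with the normal-approximation interval, which makes the implicit prior behind that calculation visible; a different prior gives a different interval from the same data.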

References

  R. Hoekstra, R.D. Morey, J.N. Rouder, and E.-J. Wagenmakers, 'Robust misinterpretation of confidence intervals', Psychonomic Bulletin & Review, January 2014

1. To be really precise you should note that sentences of the form "[0.1,0.4] is a 95% confidence interval" are never legit. You can only assign the property of being a "95% confidence interval" to procedures that calculate an interval given the data. The best you can say is "[0.1,0.4] was generated by a procedure that (before I knew the data) I assigned a 95% probability of creating an interval that contained the true value".

So we have:

False: 4a. There is a 95% probability that the true mean lies between 0.1 and 0.4.

True: 4b. There is a 95% probability that the true mean lies between the lower end of the confidence interval and the higher end (provided we haven't yet seen the data that determines what they are).

Our problem here is that the frequentists refuse to treat the true value as an r.v. So before we get the data the Bayesian views both the true value and data as random, but the frequentist views only the data as random. So at this point the frequentist can say "performing the following procedure on the data will yield an interval that with 95% probability contains the true value". When they say this it is the interval that they are treating as random, and they are saying that the statement is true for each possible true value. The Bayesian is treating both the true value and interval as random, but actually agrees with the frequentist's statement since if the statement is true for each possible true value then it must also be true for the random true value weighted according to the Bayesian's prior.
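The last point, that 95% coverage for every fixed true value carries over to a true value drawn from any prior, can be illustrated with a small simulation. This is only a sketch: it assumes a single measurement with Gaussian noise of s.d. 1, and the three "priors" (a fixed value, a uniform and an exponential) are arbitrary illustrative choices.

```python
import random

def covers(theta, rng, sd=1.0, z=1.96):
    """Take one noisy measurement of theta and report whether the
    interval 'measurement +/- z*sd' contains theta."""
    x = rng.gauss(theta, sd)
    return x - z * sd <= theta <= x + z * sd

def coverage_under_prior(draw_theta, n=100_000, seed=3):
    """Draw a true value from the prior, then the data, n times over,
    and return the fraction of intervals containing the true value."""
    rng = random.Random(seed)
    return sum(covers(draw_theta(rng), rng) for _ in range(n)) / n

print(coverage_under_prior(lambda rng: 0.25))                  # fixed true value
print(coverage_under_prior(lambda rng: rng.uniform(0.0, 1.0))) # uniform prior
print(coverage_under_prior(lambda rng: rng.expovariate(1.0)))  # exponential prior
```

All three fractions should come out close to 0.95, since the pre-data statement holds whatever the true value turns out to be.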

After we find out the data, the frequentist calculates her confidence interval and then has no random variables left at all. Thus the frequentist is now incapable of making any probabilistic statements at all. And so all six of the given statements are false.

Now, what does the Bayesian do when we find the data? She updates her probability distribution for the true value. Also, just like for the frequentist, the interval can be calculated and so ceases to be an r.v. But the Bayesian can still make some probabilistic statements since for her the true value is still random. In particular the Bayesian can calculate the probability, based on her posterior, that the true value lies in the interval. But this needn't be 95%, and so even from a Bayesian perspective we must judge all six propositions to be false.

An amusing example is to consider an experiment where the true value is known to be positive (perhaps it is a scale parameter) but it is being measured with some Gaussian noise (with say s.d. 1). Then it is clear that a 95% confidence interval will be given by taking the measured value plus or minus 1.96. So suppose by misfortune we get the measurement "-2"; then our confidence interval is [-3.96,-0.04]. Certainly no one would claim that our true value was 95% certain to lie in there!
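The positive-parameter example can be checked numerically. The sketch below assumes a true value of 0.5, which is a hypothetical choice (the example only requires it to be positive), and counts how often the "measurement plus or minus 1.96" interval covers the true value, and how often it lies entirely below zero.

```python
import random

def interval(measurement, sd=1.0, z=1.96):
    """Interval 'measurement +/- z*sd' for one Gaussian measurement."""
    return measurement - z * sd, measurement + z * sd

def simulate(true_value=0.5, n_experiments=200_000, seed=2):
    """Return (fraction of intervals containing the true value,
    fraction of intervals lying entirely below zero)."""
    rng = random.Random(seed)
    covered = 0
    all_negative = 0
    for _ in range(n_experiments):
        lo, hi = interval(rng.gauss(true_value, 1.0))
        covered += lo <= true_value <= hi
        all_negative += hi < 0
    return covered / n_experiments, all_negative / n_experiments

# The unlucky measurement of -2 gives approximately (-3.96, -0.04).
print(tuple(round(v, 2) for v in interval(-2.0)))

coverage, fraction_all_negative = simulate()
print(coverage)               # close to 0.95 over many repetitions
print(fraction_all_negative)  # small, but not zero
```

Across repetitions the procedure still covers the true value about 95% of the time, yet a small fraction of the individual intervals contain no positive numbers at all, so they cannot contain a value known to be positive.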

Gosh, frequentist thinking is complicated, isn't it?

1. Thanks for elaborating.

Yes, frequentist thinking is complicated - it needs to be to disguise the fact that it is wrong. The device that the frequentists use to make e.g. item 4 false is completely ad hoc, declared out of thin air, and has no basis.