Maximum Entropy

Friday, June 29, 2012

The Mind Projection Fallacy

There is an argument that I have noted a couple of times when scientific colleagues with religious beliefs have tried to explain to me how they reconcile these seemingly contradictory things. Reality, they say, is divided into two classes of phenomena: the natural and the supernatural. Natural phenomena, they say, are the things that fall into the scope of science, while the supernatural lies outside of science’s grasp, and can not be addressed by rational investigation. This is completely muddle-headed, and seems to me to be based on an example of something called the mind projection fallacy.

A similar argument also crops up occasionally when advocates of alternative medicine try to rationalize the complete failure of their favorite pseudoscientific therapy to provide any evidence of efficacy in rigorous trials.

The very word ‘supernatural,’ at its heart, though, is one of those utterly self-defeating terms, like ‘free will’ and ‘alternative medicine,’ completely devoid of meaning and philosophically bankrupt. What is this free will that people keep going on about? Is it the freedom to break the laws of physics? No, and since every particle in your brain obeys the laws of physics, you are not free to make non-mechanistic decisions, so can we please shut up about free will? (Granted, I am using a restricted meaning of the term ‘free will.’)

And what is alternative medicine? Medicine is the use of interventions that are known to work in order to lessen the effects of disease. If it doesn’t work, or is not known to work, then it’s not medicine, full stop. There is no alternative. Let’s please shut up about alternative medicine.

What is the supernatural? Nature is by definition everything that exists and happens. What is outside nature is therefore necessarily an empty set.

The etymology of the word ‘supernatural’ is the result of an error of thinking. This error is the mind projection fallacy: falsely assuming that the properties of one’s model of reality necessarily exhibit correspondence with the actual properties of reality. The following dictionary definition of ‘supernatural’ was quoted to me in a recent discussion of the term (reportedly from Webster’s):

Supernatural. [adjective:] 1. of, pertaining to, or being ‘above or beyond what is natural or explainable by natural law’. 2. of, pertaining to, or attributed to God or a deity. 3. of a superlative degree; preternatural. 4. pertaining to or attributed to ghosts, goblins, or other unearthly beings; eerie; occult.

Number 1 is where we have to focus. Numbers 2 and 4 are, in origin at least, based on erroneous application of number 1, while number 3 is just weird. ‘Of a superlative degree’? That’s not supernatural, by any reasonable standard. ‘Preternatural’? This word has two meanings (according to Dictionary.com): one is ‘supernatural’ (wow, that’s helpful) and the other is ‘exceptional or abnormal.’ Finding a hundred euros on the pavement would be both exceptional and abnormal, but again, not supernatural unless we are willing to debase the meanings of words to the level of uselessness.

So what about the primary meaning of supernatural, ‘above or beyond what is natural or explainable by natural law’? The first part poses a problem, since there is no supplied procedure for determining what is natural, other than the obvious definition: ‘whatever is not supernatural.’ Now I’m aware that all word definitions are ultimately circular, but this is a case where the radius of curvature is clearly far too small to represent any useful addition to the language. The second part stipulates ‘explainable by natural law,’ which succumbs to exactly the same objection, but I strongly suspect that many people have failed to see this exactly because they have committed the mind projection fallacy – in this case, conflation of natural law with our description of it. Natural law is the set of principles, whatever they may be, that determine how real phenomena evolve. If a phenomenon is real, then it would be explainable by natural law. If a phenomenon is not real, then what is the point in debating whether or not it is supernatural? I feel, however, that too many people think that natural law is some set of equations, like E = mc², written down in textbooks – but this is merely our model of natural law. I see no other convincing way to account for the appearance of this phrase in the quoted dictionary definition, than to assume that natural law is being commonly confused with known science in this way, since there seems to be no other good reason to postulate that a phenomenon is not governed by natural law (I know, the exact word was ‘explainable,’ but I think it is hard to rescue the situation by invoking this subtle difference).

By this common understanding of ‘supernatural,’ the photoelectric effect would have been supernatural in and prior to 1904, but natural before the end of 1905. A strange state of affairs, you might think.

Of course, word meanings don’t have to stick exactly to their original literal meanings, and anybody is free to apply the word ‘supernatural’ to any putative phenomenon they wish: gods, ghosts, whatever (as long as they are clear in what they are doing), but I argue firstly, that this is a misnomer, as nothing can be literally beyond nature (supernatural) and secondly, that use of this misguided word leads to horrendous confusions, such as those allowing highly educated and otherwise rational people to claim that religious phenomena (or homeopathy or chi) are by definition supernatural, and therefore by definition beyond the scrutiny of science.

The mind projection fallacy also raises its head in science, all too often, such as in quantum physics, and, in my opinion, in thermodynamics. It has also had very substantial consequences in the development and application of probability theory. Since scientific method generally strives to avoid fallacious reasoning, I feel that it is important to get well acquainted with this particular mental glitch, and to recognize some of the fields in which it still extends a corrupting influence.

Looking at thermodynamics, the famous second law, stating that the entropy of a closed system tends to increase, is often explained by the experts as resulting from our inability to distinguish between individual molecules (or other particles). My conviction, however, is that the mechanical evolution of an ensemble of such particles is unchanged if we are granted a means to identify them after they have evolved. The real reason for the second law seems to be that the proportion of possible initial microstates that result in non-increased entropy is very tiny (and such states appear, therefore with vanishing probability), but polluted with the standard language of the discipline, one can find it hard to grasp this. Why is this standard language an example of the mind projection fallacy? Because there is something unknown to us (the identities of the particles), and the fact of it being unknown is attributed as the cause of the physical evolution of the system, and therefore a physical property of the system. It is not, though, it is a property of our knowledge of the system.

With regard to quantum mechanics, I am open minded on the matter of whether or not nature evolves deterministically or non-deterministically. As I try to be a good scientist, I wait for the evidence to favour one hypothesis strongly over the other before casting my judgement. As much as we know about quantum mechanics already, that evidence is not yet in. I am, however, highly uncomfortable and skeptical about the possibility of something operating without a causal mechanism, yet exhibiting clear tendencies. It seems I am not alone in this, as several other well regarded thinkers have apparently shared this view, notably among them, the exceptional theoretical physicists Albert Einstein, Louis de Broglie, David Bohm, John Bell, and Edwin Jaynes. Imagine my delight, therefore, when I discovered the following passage by Jaynes in a book of conference proceedings¹, articulating magnificently (and far better than I ever could) many of my own long-felt misgivings about the language of quantum mechanics:

The current literature on quantum theory is saturated with the Mind Projection Fallacy. Many of us were first told, as undergraduates, about Bose and Fermi statistics by an argument like this: “You and I cannot distinguish between the particles; therefore the particles behave differently than if we could.” Or the mysteries of the uncertainty principle were explained to us thus: “The momentum of the particle is unknown; therefore it has a high kinetic energy.” A standard of logic that would be considered a psychiatric disorder in other fields, is the accepted norm in quantum theory. But this is really a form of arrogance, as if one were claiming to control Nature by psychokinesis.

Whether or not the position and momentum of a particle (as related in the most familiar version of the Heisenberg principle) are truly ‘undetermined’ or merely unknowable to us, I am unsure, but there is a commonly encountered assumption that these two possibilities must be the same, and this results from the mind projection fallacy. It is indeed a mighty challenge to reconcile quantum phenomena with a fully deterministic mechanics. Some have succeeded, but it remains a challenge to pin down whether or not this is nature’s way. Many, however, follow Bohr and assert that there can be no underlying mechanism behind quantum phenomena. Let me quote Jaynes again (this time from ‘Probability theory: the logic of science’):

the ‘central dogma’ [of quantum theory]… draws the conclusion that belief in causes, and searching for them, is philosophically naïve. If everybody accepted this and abided by it, no further advances in understanding of physical law would ever be made… it seems to us that this attitude places a premium on stupidity.

The field in which the mind projection fallacy has had its most significant practical consequences is perhaps probability theory, which is a colossal shame, as probability is the king of theories: the meta-theory that decides how all other theories are obtained.

If you scan through the articles I have posted here on probability, you’ll observe that most if not all make use of Bayes’ theorem. It is an incredibly important and useful part of statistical reasoning, and represents the core of how human knowledge advances. It is also derived simply, as a trivial rearrangement of two of the most basic principles of probability theory: the product and sum rules. Yet, for a significant portion of the 20th century, when statistical theory was undergoing explosive development, Bayes’ theorem was rejected by the majority of authorities and practitioners in the field. How on Earth could this have come about? The mind projection fallacy, of course.

Because the theory models real phenomena in terms of probabilities, it was assumed that these probabilities must be real properties of the phenomena. Yet Bayes’ theorem converts a prior probability into a posterior probability by the addition of mere information. And since merely changing the amount of information cannot affect the physical properties of a system, then Bayes’ theorem must simply be wrong. QED.

The property that probability was thought to correspond to was frequency. For example, a coin has a 50% probability to land heads up because the relative frequency with which it does so is one half. For this reason, the orthodox school of statistical thought has become known as frequentist statistics.

One of the most extraordinary scientists of the 20th century, Ronald Fisher, for example, was one of the people who dominated the development of statistical theory during his lifetime. In his highly influential book, ‘The design of experiments,’² he gave three reasons for rejecting Bayes’ theorem, foremost of which is:

… advocates of inverse probability [Bayes’ theorem] seem forced to regard probability not as an objective quantity measured by observable frequencies….

Clearly, he meant that the impossibility of reconciling Bayes’ theorem with the view of probability as a physical property of real objects (the frequencies with which different events occur) made it impossible to accept the theorem. (His other two reasons are just as bad.) It was Fisher’s deeply held objection to the logical foundations of probability theory that led him to do some of the most important work developing and popularizing the frequentist significance tests, which, as I have argued in detail here and here, are a poor method for assessing data.

Another influential textbook, by Harald Cramér³, asserts that ‘any random variable has a unique probability distribution.’ Again, this assumes that the probability is something objective and immutable, a physical property. The randomness is assumed to be necessarily a property of the system under study, rather than a statement of our lack of information – our inability to predict it beforehand. To instantly recognize the ridiculousness of both Fisher’s and Cramér’s views, consider that I have just tossed a coin, which has landed, and I am asking you to assess the probability that the face of the coin pointing up is the one depicting the head: your only rational answer is 0.5, and it is the correct answer, for you. For me though, the correct answer is 1, because I am looking at the coin, and I can see the head facing up. Same physical system, different probabilities, dependent on the available information.

On the subject of probabilities as physical properties of the systems we study, I can again quote Jaynes, who has summarized the situation beautifully:

It is therefore illogical to speak of verifying [the Bernoulli urn rule, a law for determining probabilities] by performing experiments with the urn; that would be like trying to verify a boy’s love for his dog by performing experiments on the dog.

We can easily identify other instances of the mind projection fallacy in probability reasoning, some of which I have already discussed in earlier posts. For example, the error of thinking discussed in Logical v’s Causal Dependence, consisting of the belief that the expression P(A|B) can only be different from P(A) if B exerts a causal effect on A (an error that has made it into a number of influential textbooks on statistical mechanics) seems to arise from the conviction that a probability is an objective property of the system under study. If B changes the probability for A, then according to this belief, B changes the physical properties of A, and must therefore be, at least partially, the cause of A.
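A tiny enumeration makes the point concrete. The sketch below is my own illustration with a hypothetical urn (two red balls, two white, two draws without replacement); the function name and the urn contents are assumptions, not something taken from the earlier post. A is ‘the first ball is red’ and B is ‘the second ball is red’:

```python
from fractions import Fraction
from itertools import permutations


def urn_probabilities(balls=("red", "red", "white", "white")):
    """Return P(A) and P(A | B) for two draws without replacement.

    A = first ball drawn is red, B = second ball drawn is red.
    Every ordered pair of distinct balls is equally likely.
    """
    draws = list(permutations(range(len(balls)), 2))  # (first, second) indices
    first_red = [d for d in draws if balls[d[0]] == "red"]
    second_red = [d for d in draws if balls[d[1]] == "red"]
    both_red = [d for d in second_red if balls[d[0]] == "red"]
    p_a = Fraction(len(first_red), len(draws))
    p_a_given_b = Fraction(len(both_red), len(second_red))
    return p_a, p_a_given_b


p_a, p_a_given_b = urn_probabilities()
print(p_a, p_a_given_b)
```

Learning the colour of the second ball changes the probability assigned to the first, even though the second draw cannot possibly have influenced the first. The dependence is logical, not causal.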

Another instance is to be found in The Raven Paradox, and consists of the belief that whether or not a particular piece of evidence supports a hypothesis is an objective property of the hypothesis, or the real system to which the hypothesis relates. In that post, we examined the supposition that observation of a sequence of exclusively black ravens supports the hypothesis that all ravens are black. We discovered an instance where such observations actually support the opposite hypothesis, illustrating that the relationship between the hypothesis and the data is entirely dependent on the model we chose. To think otherwise was shown to lead to disturbing and indefensible conclusions about ravens.

[1] 'Maximum Entropy and Bayesian Methods,' edited by J. Skilling, Kluwer Publishing, 1989

[2] 'The Design of Experiments,' R.A. Fisher, Oliver and Boyd, 1935

[3] 'Mathematical Methods of Statistics,' H. Cramér, Princeton University Press, 1946

Saturday, May 26, 2012

How to read a newspaper

One of the things that I believe very firmly is that scientific method is not just for scientists, it's for everybody. Because science is just the systematic evaluation of what is and is not likely to be true, it's the right way to go in all fields where we're interested in making effective decisions, or obtaining new knowledge. These fields include engineering, economics, law and criminal justice, politics and social policy, history, and everyday life. Science, after all, is just the systematization of common sense. In this post, I'll discuss how to employ a scientific approach to evaluating stories we read in the news.

With training and a clear idea of what common sense actually is, we can often do a surprisingly good job of evaluating issues where we have little or no expertise and very limited information. A good starting point when modeling common sense is Bayes' theorem. It's not that I think that our brains necessarily employ Bayes' theorem strictly, but to me it is perfectly obvious that natural selection of chance variations has equipped us with intellectual apparatus that is capable of mimicking to a high degree the results of Bayes' theorem in a broad range of commonly encountered circumstances.

Here's an example that shows how easy it can be to apply formal Bayesian reasoning to the news. A few months ago, a story came out that scientists had apparently observed neutrinos traveling faster than the speed of light in vacuum. Physicist and blogger Ted Bunn has outlined a breathtakingly simple estimation of the probability that this finding was accurate. Really, anybody who understands Bayes' theorem and has moderately well trained judgement could have performed this calculation. It juxtaposes two competing hypotheses: (1) physics as we know it is severely wrong, and (2) somebody got one of their measurements wrong. The result is overwhelmingly in favor of neutrinos that can not exceed the speed of light, and holds even if the assigned prior probabilities are adjusted by orders of magnitude. Of course, we now know that the neutrinos in question were behaving perfectly in concert with known physics, and the measurement was indeed in error.

We can do a lot to train our intuition and improve our ability to discern merit or its absence in the news by learning something of the way that news journalism functions. For example, from examination of Bayes' theorem, it is clear that the probability, P(T | R), that a story is true, T, given that it has been reported, R, depends on P(R | T) and P(R | F), the probabilities that the story would be reported if it were true and if it were false, F, together with the prior probability, P(T). Anything that helps you to estimate these, and similar quantities, therefore, enhances the performance of your inner Bayesian reasoner, and has got to be a good thing.
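To see how these quantities combine, here is a minimal sketch of Bayes' theorem for the two hypotheses 'true' and 'false'. The function name and all of the numbers are hypothetical, picked only to show the mechanics; they are not estimates for any real story.

```python
def prob_true_given_reported(prior_true, p_report_if_true, p_report_if_false):
    """Bayes' theorem for P(T | R), with T and F the only two hypotheses."""
    reported_and_true = prior_true * p_report_if_true
    reported_and_false = (1.0 - prior_true) * p_report_if_false
    return reported_and_true / (reported_and_true + reported_and_false)


# Hypothetical inputs: P(R | T) = 0.9, P(R | F) = 0.3, and three different priors.
for prior in (0.5, 0.05, 0.005):
    print(prior, prob_true_given_reported(prior, 0.9, 0.3))
```

Note that if a story is just as likely to be printed whether or not it is true, so that P(R | T) = P(R | F), the function hands back the prior unchanged: the mere fact of publication then carries no information.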

With particular regard to stories about science in the media, a gold mine of insight is to be found by browsing the archive of Ben Goldacre's Bad Science blog. Here, you can develop a healthy skepticism by reading, among other things, dozens of examples of how journalists distort stories to make them more sexy, how they misunderstand technical details, and how they play into the hands of PR companies.

Looking beyond science stories, an excellent book, 'Flat Earth News,' by seasoned newspaper journalist Nick Davies goes into terrifying detail about the extent to which news journalism in general is broken. Davies outlines ten rules of production, which he argues have important impacts on the quality and content of reported news. Examining them offers a stark vision, but also provides invaluable data in the quest to estimate what is true, what is important, how details are likely to be distorted, and what kinds of things are possibly being kept from you. Very briefly, these rules of production described by Davies are:

(1) Run cheap stories

- no long investigations, use information that's readily available. Most news stories are written in minutes.

(2) Select safe facts

- prefer official sources

(3) Don't upset the wrong people

- the more powerful somebody is, the more likely they are to sue, for example

(4) Select safe ideas

- don't print anything that goes against the widely held consensus

(5) Give both sides of the story

- if you never actually say anything definite, then nobody can accuse you of being wrong

(6) Give them what they want

- if it increases readership, then tell it. Tell it in a way that increases readership.

(7) Bias against truth

- the story can't be too complex, therefore suppress as many details as possible

(8) Give them what they want to believe in

- again, don't upset the readers

(9) Go with the moral panic

- in time of crisis, gauge the public opinion and make sure to amplify it

(10) Give them what everybody else is giving them

- don't lose customers just because somebody else stoops lower than you are willing to

Understanding these things enables a skeptical mind to penetrate somewhat beyond what is actually presented in the news. A story can be judged, for example, on where the information probably comes from, what alternative sources were probably ignored, how much effort went into producing the story, why this story is in the news anyway - what is its real importance? This is skepticism. Science is healthy skepticism: scrutinizing everything, trying to find the flaws in every piece of evidence. Obvious, I know, but one or two people out there don't seem to have embraced it fully yet.

Wednesday, May 23, 2012

The Raven Paradox

Earlier, in 'The insignificance of significance tests', I wrote about the silliness of the P-value, one of the main metrics scientists use to determine whether or not they have made a discovery. I explained my view that a far better description of our state of knowledge is the posterior probability. C.R. Weinberg, however, presumably does not agree with all my arguments, having written a commentary some years ago with the title 'It's time to rehabilitate the P-value.'¹ Weinberg acknowledges the usefulness of the posterior probability in this article, and I agree with various other parts of it, but a major part of her argument in favour of P-values is the following statement, from the start of the article's second paragraph:

'The P-value quantifies the discrepancy between a given data set and the null hypothesis,'

Let's take a moment to consider whether or not this is true.

Firstly, let's note that the word 'discrepancy' has a precise meaning and an imprecise one. The discrepancy between 5 and 7 is 2, and has a precise formal meaning. What is the discrepancy, though, between an apple and an orange? When we say something like 'there is a discrepancy between what he said and what he did,' this makes use of the general, imprecise meaning of 'disagreement'. What is the discrepancy between a data set and a hypothesis? These are different things and can not be measured against one another, so there is clearly no exact formal meaning of the discrepancy in this case. It is a pity then that the main case made for use of P-values, in this case, is based on an informal concept, rather than an exact idea, which scientists normally prefer.

So it is clear then, that the discrepancy between a data set and the null hypothesis has no useful, clearly definable scientific meaning (and, therefore, can not be quantified, as was the claim). Let's examine, though, what the term 'discrepancy' is trying to convey in this case. Presumably, we are talking about the extent to which the data set alters our belief in the null hypothesis - the degree of inconsistency between the null hypothesis and the observed facts. Surely, we are not talking about any effects on our belief in the data, as the data is (to a fair approximation ²) an empirical fact. Anyway, belief in the data was not the issue that the experiment was performed to resolve, right? The truth or falsity of a particular data set is a very trivial thing, is it not, next to the truth or falsity of some general principle?

But the P-value does not quantify our rational belief in any hypothesis. This is impossible. For starters, as I hinted at in that earlier article, by performing the tail integrals required to calculate P-values, one commits an unnecessary and gratuitous violation of the likelihood principle. This simple (and obvious) principle states that when evaluating rational changes to our belief, one uses only the data that are available. By evaluating tail integrals, one effectively integrates over a host of hypothetical data sets, that have never actually been observed.

Even if we eliminate the tail integration, however, and calculate only P(D | H₀), we are still not much closer to evaluating the information content of our data, D, with respect to H₀. To do this, we need to specify a set of hypotheses against which H₀ is to be tested, and obtain the direct probability, P(D | H), for all hypotheses involved. To think that you can test a hypothesis without considering alternatives is, as I have mentioned once or twice in other articles, the base-rate fallacy.
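The difference between these quantities is easy to see numerically. The sketch below uses a made-up example of my own (60 heads in 100 coin tosses, a null hypothesis of a fair coin, and a single alternative with heads probability 0.6); the data, the alternative, and the function names are all assumptions for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def compare(k=60, n=100, p_null=0.5, p_alt=0.6):
    # One-sided tail probability: sums over data sets (k, k+1, ..., n heads)
    # of which only the first was actually observed.
    tail = sum(binom_pmf(j, n, p_null) for j in range(k, n + 1))
    # Direct probabilities of the observed data under each hypothesis.
    like_null = binom_pmf(k, n, p_null)
    like_alt = binom_pmf(k, n, p_alt)
    return tail, like_null, like_alt / like_null


tail, like_null, ratio = compare()
print(tail, like_null, ratio)
```

Neither the tail probability nor P(D | H₀) on its own says how far the data should move us; it is only the ratio of direct probabilities, which requires naming an alternative, that enters Bayes' theorem.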

One of my favorite illustrations of this basic principle, that the probability for a hypothesis has no meaning when divorced from the context of a set of competing hypotheses, is something called the raven paradox. Devised in 1945 by C.G. Hempel, this supposed paradox (in truth, there seem to be no real paradoxes in probability theory) makes use of two simple premises:

(1) An instance of a hypothesis is evidence for that hypothesis.

(2) The statement 'all ravens are black' is logically equivalent to the statement 'everything that is not black is not a raven.'

Taking these two together, it follows that observation of a blue teapot constitutes further evidence for the hypothesis that all ravens are black. While it seems to most that the colour of an observed teapot can convey little information about ravens, this conclusion is claimed to be logically deduced. This paradox got several philosophers tied up in knots, but it took a mathematician and computer scientist, I.J. Good (he worked with Turing both at Bletchley Park and in Manchester, making him one of the world's first ever computer scientists) to point out its resolution.

The obvious solution is that premise number (1) is just plain wrong. You can not evaluate evidence for a hypothesis by considering only that hypothesis. It is necessary to specify all the premises you wish to consider, and specify them in a quantifiable way. Quantifiable, here, means in a manner that allows P(D | H) to be evaluated. To drive this point home, Good provided a thought experiment (a sanity check, as I like to call it) where, given the provided background information, we are forced to conclude that observation of a black raven lowers the expectation that all ravens are black.³

Good imagined that we have very strong reasons to suppose that we live in one of only two possible classes of universe, U₁ and U₂, defined as follows:

U₁ ≡ Exactly 100 ravens exist, all black. There are 1 million other birds.

U₂ ≡ 1,000 black ravens exist, as well as 1 white raven, and 1 million other birds.

Now we can see that if a bird is selected randomly from the population of all birds, and found to be a black raven, then we are most probably in U₂. From Bayes' theorem:

$$P(U_1 \mid DI) = \frac{P(U_1 \mid I)\, P(D \mid U_1 I)}{P(U_1 \mid I)\, P(D \mid U_1 I) + P(U_2 \mid I)\, P(D \mid U_2 I)}$$

No information has been supplied concerning the relative likelihoods of these universes, so symmetry requires us to give them equal prior probabilities, so

$$P(U_1 \mid DI) \approx \frac{10^{-4}}{10^{-4} + 10^{-3}}$$

which makes it about 10 times more likely that we are in U₂, with some non-black ravens.
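The same numbers can be checked without rounding. This is a small sketch of my own, assuming that '1 million other birds' means one million non-raven birds, so that the total bird population is the ravens plus 1,000,000; the function name is hypothetical.

```python
from fractions import Fraction


def posterior_u1(prior_u1=Fraction(1, 2)):
    """P(U1 | D I), where D = a randomly selected bird is a black raven."""
    other_birds = 1_000_000
    # Probability of drawing a black raven in each universe.
    like_u1 = Fraction(100, 100 + other_birds)
    like_u2 = Fraction(1000, 1000 + 1 + other_birds)
    numerator = prior_u1 * like_u1
    return numerator / (numerator + (1 - prior_u1) * like_u2)


p_u1 = posterior_u1()
print(float(p_u1), float((1 - p_u1) / p_u1))  # posterior for U1, odds in favour of U2
```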

[1] C.R. Weinberg, 'It's time to rehabilitate the P-value,' Epidemiology May 2001, Vol. 12 No. 3 (p. 288) (Available here)

[2] We can imagine the following conversation between a PhD student and his advisor:

Supervisor:

This new data is fantastic! And so different to anything we have seen before. Are you sure everything was normal in the lab when you measured these?

Student:

Well, I was a little bit drunk that day....

But this kind of thing is rare, right? Come on, back me up here guys.

[3] I.J. Good, 'The White Shoe is a Red Herring,' British Journal for the Philosophy of Science, 1967, Vol. 17, No. 4 (p. 322) (I haven't read this paper (paywall again), but I have it from several reliable sources that this is where the thought experiment, or an equivalent version of it, appears.)

Wednesday, May 16, 2012

Nuisance Parameters

A few days ago, I was racking my brain trying to think of a suitable example for this piece, when one landed in my lap unexpectedly. On his own statistics blog, Andrew Gelman has been posting the questions from an exam he set on design and analysis of surveys. Question 1 was the following:

Suppose that, in a survey of 1000 people in a state, 400 say they voted in a recent primary election. Actually, though, the voter turnout was only 30%. Give an estimate of the probability that a nonvoter will falsely state that he or she voted. (Assume that all voters honestly report that they voted.)

Now the requested estimate is simple enough to produce, and needs little greater insight than the product rule and the law of large numbers. But an important part of rationally advancing our knowledge is to be able to quantify the degree of uncertainty in that new knowledge. A parameter estimate without a confidence interval tells us very little, because it is just that, an estimate. We can't empirically determine that the estimated value is the truth, only that it is the most probable value. If we want to make decisions based on some estimated parameter, or give a measurement any kind of substance, we need to know how much more probable it is than other possible values. We can convey this information conveniently by supplying an error bar - a region of values either side of the most likely value, in which the true value is very likely to reside. This is why the error bar is considered to be one of the most important concepts in science.

If we look at the question above, with an ambition to provide not just the estimate, but a region of high confidence on either side, then it becomes one of the simplest possible examples of marginalization, the topic of this post. It is also a really nice example to use here, because it utilizes technology we have already played with, in some of my earlier posts. These technologies are the binomial distribution, used in 'Fly papers and photon detectors' (Equation 5 in that post), and the beta distribution, which I introduced without naming it in 'How to make ad-hominem arguments' (Equation 4 in that article).

Let V be the proposition that a given person voted, and Y the statement that they have declared that they did vote, with V' denoting the negation of V. We want to know the probability that a person says they voted when in fact they did not, which is P(Y | V'). From the product rule

P(YV') = P(V') × P(Y | V')

P(V') = 0.7. That's the probability that a person did not vote.

If we are experienced in such things, we know that random deviations from expected behaviour decrease in relative magnitude as the size of the sample increases (that's the law of large numbers). This means that for a sample of 1000, we can be confident that the number of voters will not be much different from 300, when P(V) = 0.3. Since 400 people said they voted, approximately 100 non-voters must have falsely claimed to have voted, out of a total sample of 1000, so

P(YV') ≈ 0.1

The crude estimate for P(Y | V'), therefore, is 1/7.
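Written out, the rearrangement of the product rule behind this figure is:

$$P(Y \mid V') = \frac{P(YV')}{P(V')} \approx \frac{0.1}{0.7} = \frac{1}{7} \approx 0.143$$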

A more rigorous calculation acknowledges the uncertainty in P(YV'), and at the same time automatically provides a means to get the desired confidence interval. Let's suppose that, to start with, we have no information about the proportion of non-voters who lie in surveys; then we are justified in using a uniform prior distribution. It then follows from Bayes' theorem that the posterior probability distribution for the true fraction is given by the beta distribution, in close analogy with the parable of Weety Peebles. If we knew the number of people who didn't vote, but lied in the survey, then this would be a piece of cake, but we don't know it. It is what's called a nuisance parameter. But there is a procedure for dealing with this.

If we have a model with 2 free parameters, θ and n, then the joint probability for any pair of values for these parameters is

$$P(\theta n \mid D I) = \frac{P(\theta n \mid I)\, P(D \mid \theta n I)}{P(D \mid I)}$$

(1)

But if n is a nuisance parameter, in which we have no direct interest, then we just integrate it out. The so-called marginal probability distribution, P(θ | DI), is the sum over Equation (1) for all possible values of n. If n is a continuous parameter, then the sum becomes an integral:

P(θ | DI) = ∫ P(θn | DI) dn

(2)

In our example, we have one desired parameter, the fraction of non-voters who say that they voted, f, and one nuisance parameter, the actual number of liars in the sample of 1000 people, so to get the distribution over all possible f, we need to calculate a two-dimensional array of numbers, something that is still amenable to a spreadsheet calculation. Down a column, I listed all the possible numbers of liars, n, from 0 to 400 (there can't be more than 400 as all voters tell the truth, according to the provided background information). For each of these n, the total number of non-voters is 600 plus that number (600 is the number of non-lying non-voters). The probability for each of these numbers of non-voters, P(n), was calculated in an adjacent column, using the binomial distribution, with p = 0.7.

Along the top of the spreadsheet, I listed all the hypotheses I wanted to test concerning the value of the desired fraction, f. I divided the full range [0, 1] into 1000 slices of width Δf = 0.001. The probability that the true value of f lies in any given range [f, f + Δf] is estimated as P(f) × Δf. Each P(f) was calculated using the beta distribution:

$$P(f \mid n I) = \frac{(N + 1)!}{n!\,(N - n)!}\, f^{n} (1 - f)^{N - n}$$

(3)

Here N is the number of non-voters implied by n, that is, N = 600 + n, out of the sample of 1000. Each P(f | nI) was multiplied by the calculated P(n) to give the joint probability, specified in Equation (1). At the bottom, along another row, I calculated the sum of each column, which gave the desired marginal probability distribution, which I plot below:

According to my calculation, the peak of this curve is at 0.143 (which is 1/7, as expected). As an error bar, let's identify points on either side of the peak, such that the enclosed area is 0.95. This means that there is 95% probability that the true value of f lies between these points. To find these points, just integrate the curve in each direction from the peak, until the area in each case first reaches 0.475. Performing this integration gives a 95% confidence interval of [0.102, 0.180].
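For readers who would rather not build the spreadsheet, the calculation can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the spreadsheet itself: it assumes that 400 of the 1000 respondents said they voted (which is what the limits of 400 liars and 600 honest non-voters imply), and it takes N in Equation (3) to be the number of non-voters, 600 + n. All function and variable names are invented for illustration.

```python
import math

SAMPLE = 1000      # respondents in the survey
SAID_YES = 400     # respondents who said they voted (assumed from the 600 / 400 split)
P_NONVOTER = 0.7   # P(V'), the probability that a person did not vote


def log_binomial(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def log_posterior_f(f, n, N):
    """Log of Equation (3): density for f given n liars among N non-voters."""
    return (math.lgamma(N + 2) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
            + n * math.log(f) + (N - n) * math.log(1 - f))


def marginal_posterior(slices=1000):
    """Return (grid, density) for P(f | DI), with the number of liars summed out."""
    honest = SAMPLE - SAID_YES
    df = 1.0 / slices
    grid = [(i + 0.5) * df for i in range(slices)]  # slice midpoints avoid log(0)
    # P(n): probability that the sample holds honest + n non-voters, n = 0..SAID_YES
    p_n = [math.exp(log_binomial(honest + n, SAMPLE, P_NONVOTER))
           for n in range(SAID_YES + 1)]
    norm = sum(p_n)  # renormalise over the allowed range of n
    density = [sum(w * math.exp(log_posterior_f(f, n, honest + n))
                   for n, w in enumerate(p_n)) / norm
               for f in grid]
    return grid, density


def interval_around_peak(grid, density, area_each_side=0.475):
    """Step outwards from the peak until each side first encloses the given area."""
    df = grid[1] - grid[0]
    peak = max(range(len(grid)), key=density.__getitem__)
    lo, area = peak, 0.0
    while lo > 0 and area < area_each_side:
        lo -= 1
        area += density[lo] * df
    hi, area = peak, 0.0
    while hi < len(grid) - 1 and area < area_each_side:
        hi += 1
        area += density[hi] * df
    return grid[peak], grid[lo], grid[hi]


grid, density = marginal_posterior()
peak, lower, upper = interval_around_peak(grid, density)
print(f"peak = {peak:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

The inner sum over n is the discrete form of Equation (2). Working with logarithms keeps the large factorials in Equation (3) from overflowing. The printed numbers should land close to the values quoted above, though small differences from the spreadsheet are to be expected, since the grid and the handling of the peak slice are my own choices.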

Now we know not only the most likely value of f, but also how confident we are that the true value of f is near to that estimate. This is what good science is all about.

The process of eliminating nuisance parameters is termed marginalization. It's an important concept in Bayesian statistics. In maximum-likelihood model fitting, all free parameters in a model must be fitted at once, but use of Bayes' theorem not only permits important prior information to enter the calculation and enables confidence-interval estimation without a separate calculation, but also allows us to reduce the number of parameters that must be calculated to only those that interest us. During my PhD work, for example, most of my time was spent measuring the temporal responses of nanocrystals to short laser pulses. My fitting model included an offset (displacement up the y-axis), a shift (displacement along the time axis), and a scale parameter (dependent on how long I measured for, and how many photons my detector picked up). That's 3 parameters giving information only about the behaviour of the measurement apparatus. The physical model pertaining to the behaviour of the nanocrystals typically consisted of only 2 time constants. That's three out of five model variables that are nuisance parameters.

A really excellent read for those with an interest in the technicalities of Bayesian stats is a textbook called 'Bayesian Spectrum Analysis and Parameter Estimation,' by G. Larry Bretthorst (available for free download here). This book describes some stunning work. While discussing the advantages of eliminating nuisance parameters, Bretthorst produces one of the sexiest lines in the whole of the statistical literature:

In a typical small problem, this might reduce the search dimensions from ten to two; in one "large" problem the reduction was from thousands to six or seven.

He goes on: "This represents many orders of magnitude reduction in computation, the difference between what is feasible and what is not."

Thanks to Andrew Gelman for providing valuable inspiration for this post!

'Bayesian Spectrum Analysis and Parameter Estimation' by G. Larry Bretthorst
(free download here)

Thursday, May 10, 2012

Dodgy Dice

Following up on an earlier post, here is another example of how thinking about conditional probabilities in terms of causation rather than logical dependence can lead to cognitive dissonance. This example comes from 'Entropy Demystified,' by Arieh Ben-Naim.

Let A, B, and C be three hypotheses, such that P(B|A) > P(B) and P(C|B) > P(C). Suppose I tell you that P(C|A) < P(C). How does this strike you?

Knowledge of A makes B more likely, and knowledge of B makes C more likely, so shouldn't knowledge of A necessarily also make C more likely?

If you are thinking along these lines and are struggling to accept that the situation I describe is possible, then I suspect it is because, despite my warning, you can't help thinking about P(x|y) in terms of causation. The specific property of causality that seems to invade our heads when we contemplate problems like this is its transitivity: if A causes B and B causes C, then A necessarily is the cause of C.

Logical dependence, however, is not required to exhibit this transitivity property. Let A, B, and C be hypotheses about the outcome of rolling a six-sided die, such that for each hypothesis, the face showing is one of the four specified:

A: {1, 2, 3, 4}

B: {2, 3, 4, 5}

C: {3, 4, 5, 6}

P(A) = P(B) = P(C) = 4/6. If you are told that A is true, then the probability associated with B becomes 3/4, which is greater than 4/6. Similarly, P(C|B) is also greater than P(C). But P(C|A) = 2/4, which is less than P(C), as promised.
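The numbers above are easy to check by counting faces. Here is a minimal sketch, assuming a fair die so that every face is equally probable; the helper name `prob` is my own.

```python
from fractions import Fraction

FACES = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
C = {3, 4, 5, 6}


def prob(event, given=FACES):
    """P(event | given) for a fair die: favourable faces over allowed faces."""
    return Fraction(len(event & given), len(given))


print("P(B)   =", prob(B), "  P(B|A) =", prob(B, A))
print("P(C)   =", prob(C), "  P(C|B) =", prob(C, B))
print("P(C|A) =", prob(C, A))
```

Using exact fractions rather than floats makes the comparisons with 4/6 unambiguous.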

Here's another example of unexpected intransitivity for dice, operating on a different level, taken from Ian Stewart's entertaining "Cabinet of Mathematical Curiosities."

Mr. A proposes to his friend Mr. B that they play dice for money with his special dice. He has three of them, and they are each to pick one to play with against the other. Whoever rolls the higher number on each throw wins the stake. To convince Mr. B that the dice are fair, Mr. A insists that he take first choice from the 3 dice - that way, if one of the dice is better than the others, Mr. B should have a fair chance of picking the better one.

The dice do not have the usual numbers marked on their faces, though. The red one has the numbers {3, 3, 4, 4, 8, 8}. The yellow one is marked with the numbers {1, 1, 5, 5, 9, 9}. Finally, the blue one has the numbers {2, 2, 6, 6, 7, 7}.

Which one, if any, should Mr. B choose?

Check the numbers for yourself. Whichever die Mr. B selects, Mr. A can select one of the others that gives a higher probability of winning.
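If you would rather let a computer check the numbers, the following sketch enumerates all 36 equally likely pairs of faces for each pairing of dice. It assumes only that each face of each die is equally probable; the function names are my own.

```python
from fractions import Fraction
from itertools import product

DICE = {
    "red": [3, 3, 4, 4, 8, 8],
    "yellow": [1, 1, 5, 5, 9, 9],
    "blue": [2, 2, 6, 6, 7, 7],
}


def win_probability(mine, theirs):
    """Probability that a roll of `mine` shows a higher number than `theirs`."""
    wins = sum(1 for a, b in product(mine, theirs) if a > b)
    return Fraction(wins, len(mine) * len(theirs))


def best_reply(opponent_colour):
    """The die that does best against the opponent's choice, with its win probability."""
    others = [c for c in DICE if c != opponent_colour]
    best = max(others, key=lambda c: win_probability(DICE[c], DICE[opponent_colour]))
    return best, win_probability(DICE[best], DICE[opponent_colour])


for colour in DICE:
    reply, p = best_reply(colour)
    print(f"If Mr. B takes {colour}, Mr. A takes {reply} and wins with probability {p}")
```

No two dice share a number, so there are no ties to worry about.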

Sunday, May 6, 2012

The Insignificance of Significance Tests

In an earlier post on the base-rate fallacy, I made use of two important terms, ‘false-positive rate’ and ‘false-negative rate,’ without taking much time to explain them. These are concepts we need to be careful with, because, simple though they are, they have been given terrible names.

Let's start with the false positive rate for a test. This could mean any one of a number of things:

(1) it could be the expected proportion of results produced by the test that are both false and positive

(2) it could be the proportion of positive results that are false

(3) it could be the proportion of false results that are positive

So which one is it? Are you ready?

None of the above.

The false positive rate is the proportion of negative cases that are registered as positive by the test. This definition is widespread and agrees with the Wikipedia article ‘Type I and Type II Errors’. If it is a diagnostic test for some disease, it is the fraction of healthy people who will be told that they have the disease (or referred for further diagnosis).

My claim that the term is confusing, however, is supported by looking at another Wikipedia article, ‘False Positive Rate,’ which provides the definition: “the probability of falsely rejecting the null hypothesis.” The null hypothesis is the proposition that there is no effect to measure – what I have called a negative case. The probability of falsely rejecting the null hypothesis, therefore, depends on the probability that there is no effect to measure, while the proportion of negative cases that are registered as positive does not. The alternate definition in this second Wikipedia article is the same as my number (1), above.

Number (2) on the list above is the posterior probability that the null hypothesis is true, given that the test has indicated it to be false. It is obtained using Bayes’ theorem. Confusion between this posterior probability and the false-positive rate is the base-rate fallacy, yet again. Unfortunately, there is little about the term ‘false-positive rate’ that strives to steer one away from these misconceptions.
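To see how different these quantities can be, here is a small sketch that computes all four from the same table of outcomes. The counts are hypothetical, invented purely for illustration, and the variable names are my own.

```python
from fractions import Fraction

# Hypothetical outcome counts for 1000 tested cases (invented for illustration)
TRUE_POS, FALSE_NEG = 8, 2      # real effects: detected / missed
FALSE_POS, TRUE_NEG = 50, 940   # no effect: flagged anyway / correctly passed

total = TRUE_POS + FALSE_NEG + FALSE_POS + TRUE_NEG

# The false positive rate: proportion of negative cases registered as positive
false_positive_rate = Fraction(FALSE_POS, FALSE_POS + TRUE_NEG)

# The three tempting misreadings listed above
reading_1 = Fraction(FALSE_POS, total)                  # results both false and positive
reading_2 = Fraction(FALSE_POS, FALSE_POS + TRUE_POS)   # positive results that are false
reading_3 = Fraction(FALSE_POS, FALSE_POS + FALSE_NEG)  # false results that are positive

for name, value in [("false positive rate", false_positive_rate),
                    ("(1)", reading_1), ("(2)", reading_2), ("(3)", reading_3)]:
    print(f"{name:>20}: {float(value):.3f}")
```

With these made-up counts, the false positive rate and reading (1) are both close to 5%, while reading (2), the posterior probability that a positive result is false, is far larger. That gap is the base-rate fallacy in miniature.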

What makes the situation much worse is that most scientific research is assessed using the false-positive rate, while what we should really be interested in is assigning a posterior probability to the proposition under investigation.

The false-positive rate is often denoted by the Greek letter α. In classical significance testing, α is used to define a significance level – or rather the significance level defines α. The significance level is chosen such that the probability that a negative case triggers the alarm is α. A negative case, once again, is an instance where the null hypothesis is true, and there is no effect. For example, if two groups are being treated for some affliction, one group with an experimental drug and the other with a placebo, the null hypothesis might be that both groups recover at the same rate, as the new treatment has no specific impact on the disease. If the null hypothesis is correct, then in an ideal measurement, there will be no difference between the recovery times of the two groups.

But measurements are not ideal: there is ‘noise’ – random fluctuations in the system under study that produce a non-zero difference between the groups. If we can assign a probability distribution for this noise, however, we can define limits, which, if exceeded by the measurement, suggest that the null hypothesis is false. If x is the measured difference in recovery time for the 2 groups, then there are two points, x_α/2, on the tails of the distribution, such that the integral of one of the tails up to this limit is α/2. The total probability contained in these two tail areas, therefore, sums to α. The x_α/2 points are chosen so that α is equal to some desired significance level, some acceptably low false positive rate. (We can not make the false positive rate too small, because then we would too often fail to spot a real effect – the false negative rate would be too high.) The integration is performed on each side of the error distribution, as we have no certainty in which direction the alternate hypothesis operates: the recovery time with the new drug might be worse than with no treatment, which could still lead to rejection of the null hypothesis.

The null hypothesis predicts a value of x in the peak of the curve, but measurement noise leads to a probability distribution for the experiment as plotted. The x_α/2 points are chosen so that the tail areas on either side (the shaded regions) sum to 5%, or whatever the desired significance level is. In classical hypothesis testing, any measured x further from the centre than either of these points is considered statistically significant.
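As a concrete illustration of how the x_α/2 points are found, here is a minimal sketch that assumes the noise is Gaussian with zero mean. That choice of noise model is my assumption for the example, not something the argument depends on, and the function names are invented.

```python
from statistics import NormalDist


def critical_points(alpha, noise_sd=1.0):
    """Return the two x_alpha/2 points for zero-mean Gaussian noise (an assumed model)."""
    noise = NormalDist(mu=0.0, sigma=noise_sd)
    upper = noise.inv_cdf(1 - alpha / 2)  # area alpha/2 lies above this point
    return -upper, upper                  # the distribution is symmetric about zero


def is_significant(x, alpha=0.05, noise_sd=1.0):
    """True if the measured difference x falls in either tail region."""
    lower, upper = critical_points(alpha, noise_sd)
    return x < lower or x > upper


lower, upper = critical_points(0.05)
tails = NormalDist().cdf(lower) + (1 - NormalDist().cdf(upper))
print(f"x_alpha/2 = ±{upper:.3f}, total tail area = {tails:.3f}")
print(is_significant(2.5), is_significant(1.0))
```

For unit noise and α = 0.05 this gives the familiar thresholds near ±1.96 standard deviations.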

A common value chosen for α is 0.05. That is, if the null hypothesis is true, then the measured value of x will fall beyond one of the x_α/2 points about one time in twenty. This is the basis of how results are reported in probably most of science – if x lies beyond either of these points, then the null hypothesis is rejected, and the result is classified as statistically significant and considered a finding. If the measured value of x is between the two x_α/2 points, then the null hypothesis is not rejected, and the outcome of the study is probably never reported (a fact that contributes hugely to the problem of publication bias in the scientific literature). This system for interpreting scientific data and reporting outcomes of research is, let's be honest, a travesty.

Firstly, the whole rationale of the significance test seems to me to be deeply flawed. The idea is that there is some cutoff, beyond which we can declare the matter final: ‘Yes, we have a finding here. Grand. Nothing more to do on that question.’ How manifestly absurd? Such a binary declaration about a hypothesis, necessary if one wishes to take real-world action based on scientific research, needs to be based on decision theory, which combines both probability theory and some appropriate loss function (something that specifies the cost of making a wrong decision). But the declarations arising in decision theory are not of the form ‘A is true,’ but rather ‘we must act, and rationality demands that we act as if A is true.’ Using some standard α-level is about the crudest substitute for decision theory you could imagine.

Why should our test be biased so much in favour of the null hypothesis, anyway? The alternate hypothesis, H_A, almost always represents an infinity of hypotheses about the magnitude of the possible non-null effect, so a truly 'unbiased' starting point would seem to be one that de-emphasizes H₀. Remember that at the core of the frequentist philosophy (not the approach I wish to promote) is the dictum "let the data speak for themselves": don't let prior impressions taint the measurement.

Secondly, when a finding is reported with a false-positive probability of 0.05, there appears to be a feeling of satisfaction among the scientific community that 0.05 is the probability that the positive finding is false. But meta-analyses regularly do their best to dispel this myth. For example, looking only at genetic association studies, Ioannidis et al.¹ reported that 'results of the first study correlate only modestly with subsequent research on the same association,' while Hirschhorn et al.² write that 'of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated.'

Thinking again about the method for calculating the positive predictive value for a diagnostic test, the probability that a positive finding is false is not determined only by the false-positive rate, α, but also by the false negative rate, and the prior probability. The false negative rate, by analogy with the false positive rate, is the proportion of real effects that will be registered as null findings. It depends on the number of samples taken in the study and the magnitude of the effect. In the medical trial example, if the new drug helps people recover twice as fast, then there will be fewer false negatives than if the difference is only 10%, and a trial with 100 patients in each group will give a more powerful test than a trial with only 10 in each group.

If D_f represents data corresponding to a finding, with x > x_α/2, then from Bayes’ theorem, the probability that the null hypothesis will have been incorrectly rejected is

$$P(H_0 \mid D_f, I) = \frac{P(H_0 \mid I)\, P(D_f \mid H_0, I)}{P(H_0 \mid I)\, P(D_f \mid H_0, I) + P(H_A \mid I)\, P(D_f \mid H_A, I)}$$

(1)

P(D_f | H₀, I) is α, and P(D_f | H_A, I) is 1-β, where β is the false negative rate. Using the standard α level of 0.05, I plot below this posterior probability as a function of the prior probability for the alternate hypothesis, H_A, for several false-negative rates. To estimate the posterior error probability for a specific experiment, Wacholder et al.³ propose to use the p-value instead of α (the p-value is twice the tail integral up to x_m, where x_m is the measured value of x, rather than x_α/2). Strictly, one shouldn't use the p-value, but P(x_m | H₀, I). We use α to evaluate the method and so integrate over all possible outcomes that would be registered as findings, while we use P(x_m) to investigate a particular experiment – there is only one outcome, so no integration is required. I’m fairly sure Wacholder et al. are aware of this: their goal seems to be to provide an approximate methodology capable of salvaging something useful from an almost ubiquitous and highly flawed set of statistical practices. In this regard, I think they can probably be credited with having made a valuable contribution. The problem with this, though, is that for a specific data set, P(D | H_A, I) is not the same as 1-β, and cannot be determined. The correct procedure, of course, requires resolving H_A into a set of quantifiable hypotheses and calculating the appropriate sampling distributions for each of them.

Posterior probability for H₀ following a statistically significant result, with α set at 0.05, plotted vs the prior probability that H₀ is false.

We can see readily from the above plot that the posterior probability varies hugely, even for a fixed α, and that α alone (or the p-value) is next to useless for predicting it. As the prior probability gets smaller, the posterior error probability associated with a positive finding approaches unity. Looking closely at a prior of 0.01, which is generous for many experiments, (especially, for example, large-scale genomics studies, where the data and its analysis are now cheap enough to permit almost every conceivable relationship to be tested, regardless of whether or not they are suggested by other known facts) we can see that for a low-power test, with β = 0.8, then P(H₀ | D_f, I) is over 96%. So we crank up the number of samples, improve our instruments, do everything we can to reduce the experimental noise, until, miracle of miracles, we have reduced the false negative rate to almost zero. What is P(H₀ | D_f, I) now? Still 83%. Bugger.
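The two figures just quoted follow directly from Equation (1), with P(D_f | H₀, I) = α and P(D_f | H_A, I) = 1 - β. A minimal sketch, with a function name of my own choosing:

```python
def posterior_null(prior_alt, alpha=0.05, beta=0.2):
    """P(H0 | D_f, I): probability that a 'significant' finding is a false alarm.

    prior_alt is the prior probability of the alternate hypothesis H_A,
    alpha the false positive rate and beta the false negative rate.
    """
    prior_null = 1.0 - prior_alt
    false_alarms = prior_null * alpha      # P(H0 | I) P(D_f | H0, I)
    detections = prior_alt * (1.0 - beta)  # P(H_A | I) P(D_f | H_A, I)
    return false_alarms / (false_alarms + detections)


for beta in (0.8, 0.5, 0.2, 0.0):
    print(f"prior 0.01, beta = {beta}: P(H0 | D_f, I) = {posterior_null(0.01, 0.05, beta):.3f}")
```

Sweeping `prior_alt` from 0 to 1 for a few values of `beta` is enough to redraw the curves described above.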

This is not to say that the experiments are not worth doing. Science evidently makes tremendous advances despite these difficulties, and the technology that follows from the science is the most obvious proof of this. (Besides, any substantial success in reducing β will also permit an associated reduction of α.) What it does mean, however, is that the standard ways of reporting ‘findings,’ with alphas and p-values, are desperately inadequate. They fail to represent the information that a data set contains and to convey what should be our rational degree of belief in a hypothesis, given the empirical evidence available. Evaluation of this information content and elucidation of these rational degrees of belief (probabilities) should be the goal of every scientist, and the communication of these things should be viewed as a privilege to take delight in.

Update (19-5-2012)

Previously I stated that:

'Colhoun et al.⁴ found that 95% of reported findings concerning the genetic causes of disease are subsequently found to be false.'

This was based on a statement by Wacholder et al.³ that:

'Colhoun et al. estimated the fraction of false-positive findings in studies of association between a genetic variant and a disease to be at least .95.'

I have subsequently checked this paper by Colhoun et al., and could not find this estimate. I have adjusted the text accordingly, adding new references that support my original point. I apologize for the error. My only excuse is that, locked as it was behind a paywall, I was unable to access this paper for fact checking, before making a special trip to my local university library. I still recommend the Colhoun et al. paper for their discussion of the unsatisfactory nature of evidence evaluation in their field.

[1] J.P. Ioannidis et al. ‘Replication validity of genetic association studies,’ Nature Genetics 2001, 29 (p. 306)

[2] J.N. Hirschhorn et al. ‘A comprehensive review of genetic association studies,’ Genetics in Medicine 2002, 4 (p. 45) Available here.

[3] S. Wacholder, et al. ‘Assessing the probability that a positive report is false: an approach for molecular epidemiology studies,’ Journal of the National Cancer Institute, Vol. 96, No. 6 (p. 434), March 17, 2004. Available here.

[4] H.M. Colhoun et al. ‘Problems of reporting genetic associations with complex outcomes,’ Lancet 2003 361:865–72

Friday, April 27, 2012

Logical v's Causal Dependence

At the end of a previous post, I promised to discuss the difference between logical dependence, the substance of probability theory, and causal dependence, which is often assumed to be the thing that probability is directly concerned with. Let's get the ball rolling with a simple example:

A box contains 10 balls: 4 black, 3 white, and 3 red. A man extracts exactly one ball, ‘randomly.’ The extracted ball is never replaced in the box. Consider the following 2 situations:

a) You know that the extracted ball was red. What is the probability that another ball extracted in the same way, will also be red?

b) A second ball has been extracted in the same manner as the first, and is known to be black. The colour of the first ball is not known to you. What is the probability that it was white?

You should try to verify that the answer in the first case is 2/9, and in the second case is 1/3.
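One way to verify both answers without any algebra is to list every ordered pair of draws and count. The sketch below assumes only that every ordered pair of distinct balls is equally likely; the helper names are my own.

```python
from fractions import Fraction
from itertools import permutations

BALLS = ["black"] * 4 + ["white"] * 3 + ["red"] * 3

# All 90 equally likely (first colour, second colour) outcomes, drawn without replacement
DRAWS = [(BALLS[i], BALLS[j]) for i, j in permutations(range(len(BALLS)), 2)]


def conditional(event, given):
    """P(event | given), counted over the equally likely ordered draws."""
    allowed = [d for d in DRAWS if given(d)]
    hits = [d for d in allowed if event(d)]
    return Fraction(len(hits), len(allowed))


# (a) the first ball is known to be red: probability that the second is red
answer_a = conditional(lambda d: d[1] == "red", lambda d: d[0] == "red")

# (b) the second ball is known to be black: probability that the first was white
answer_b = conditional(lambda d: d[0] == "white", lambda d: d[1] == "black")

print(answer_a, answer_b)
```

Notice that the counting procedure treats the two positions in each pair symmetrically: conditioning on the second draw is no different from conditioning on the first.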

Nearly everybody will agree with my answer to situation (a), but some may hesitate about the answer for situation (b). This hesitation seems to result from the feeling that when we write P(A|B) ≠ P(A), then B is, at least partially, the cause of A. (P(A|B) means ‘the probability for A given the assumption that B is true.’) If true, then there would be no possibility for knowledge of B to influence the probability for A, because the colour of the second ball can have had no causal influence on the colour of the first ball.

In fact, it makes no difference at all in what order the balls are drawn, in such cases. The labels ‘first,’ ‘second,’ ‘n^th,’ are really just arbitrary labels, and we can exchange them as we please, without affecting the outcome of the calculation.

In case there is still a doubt, consider a simplified version of our thought experiment:

The box had exactly 2 balls, 1 black and 1 white. Both balls were drawn, ‘at random.’ The second ball drawn was black. What is the probability that the first was black?

The product rule can be written P(AB) = P(A|B)P(B). With this formulation, we can account for cases where A depends on B. When thinking about this dependence, however, it is often tempting to think in terms of causal dependence. But probability theory is concerned with calculations of plausibility with incomplete knowledge, and so what we really need to consider is not causal dependence, but logical dependence. We can verify that P(A|B) does not imply that B is the cause of A, since, thanks to the commutativity of Boolean algebra, AB = BA, and we could just as easily have written the product rule as P(AB) = P(B|A)P(A).

What is the probability that X committed a crime yesterday, given that he confessed to it today? Surely it is altered by our knowledge of the confession, indicating that the propositions are not independent in the sense we need for probability calculations. But it is also clear that a crime committed yesterday was not caused by a confession today.

Edwin Jaynes, in ‘Probability Theory: The Logic of Science,’ gave the following technical example of the errors that can occur by focusing on causal dependence, rather than logical dependence. Consider multiple hypothesis testing with a set of n hypotheses, H₁, H₂, …, H_n, being examined in the light of m datasets, D₁, D₂, …, D_m. When the data sets are logically independent, the direct probability for the totality of the data given any one of the hypotheses, H_i, satisfies a factorization condition,

P(D₁...D_m | H_i, I) = ∏ _j P(D_j | H_i, I)

(1)

(The capital 'pi' means multiply for all 'j'.) It can be shown, however, that the corresponding condition for the alternate hypothesis, H_i'

P(D₁...D_m | H_i', I) = ∏ _j P(D_j | H_i', I)

(2)

does not hold except in highly trivial cases, though some authors have assumed it to be generally true, based on the fact that no D_i has any causal effect on any other D_j. (Equation (2) requires that P(D_j|D_i) = P(D_j).) The datasets maintain their causal independence, as they must, but they are no longer logically independent. This is because the amount that equivalent units of new information change the relative plausibilities of multiple hypotheses depends on the data that has gone before: the effect of new data on a hypothesis depends on which other hypothesis it competes with most directly.

In Jaynes’ example, he imagined a machine producing some component in large quantities and an effort to determine the fraction of components fabricated that are faulty by randomly sampling 1 component at a time and examining it for faults. The prior information is supposed specific enough to narrow the number of possible hypotheses to 3:

A ≡ ‘The fraction of components that are faulty is 1/3.’

B ≡ ‘The fraction of components that are faulty is 1/6.’

C ≡ ‘The fraction of components that are faulty is 99/100.’

The prior probabilities for these hypotheses are as shown at the extreme left of the graph below. The graph is the calculation of the evolution of the probabilities for each hypothesis as the number of tested components increases. Recall that, from Bayes’ theorem, each P(H_i|D, I) depends on both P(D|H_i, I) and P(D|H_i', I). Each tested component is found to be faulty, so the information added is identical with each sample, but the rates of change of the 3 curves (plotted logarithmically) are not constant.

Evolution of the probabilities of 3 hypotheses as constant new data are added.

Taken from E.T. Jaynes, ‘Probability theory: the logic of science,’ chapter 4.

The ‘Evidence,’ plotted on the vertical axis, is perhaps an unfamiliar expression of probability information. It is the log odds, given by

$$E = 10 \log_{10} \frac{P(H)}{P(H')}$$

(3)

with the factor of 10 because we choose to measure evidence in decibels. The base 10 is used because of a perceived psychological advantage (our brains seem to be good at thinking in terms of factors of 10). Because we have used a logarithmic scale, the products expressed above in equations (1) and (2) become sums, and for constant pieces of new information, we expect to add a constant amount to the evidence, if both factorization conditions hold. The slopes of the curves are not constant, however, indicating that this is not the case, and consecutive items of data are not independent: ΔE depends on what data have preceded that point. Specifically, wherever a pair of hypotheses cross on the graph, there is a change of slope of the remaining hypothesis.
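The changing slopes can be reproduced with a short calculation. The sketch below is mine, not Jaynes' code. The prior probabilities are an assumption: they are the values I recall from chapter 4 of Jaynes' book, corresponding to roughly -10 dB, +10 dB and -60 dB of prior evidence for A, B and C. Every tested component is taken to be faulty, as in the text.

```python
import math

# Fraction of faulty components asserted by each hypothesis
FAULT_FRACTION = {"A": 1 / 3, "B": 1 / 6, "C": 99 / 100}

# Assumed priors (about -10 dB, +10 dB and -60 dB), recalled from Jaynes, chapter 4
PRIOR = {"A": (1 / 11) * (1 - 1e-6), "B": (10 / 11) * (1 - 1e-6), "C": 1e-6}


def posteriors(num_faulty):
    """Posterior probability of each hypothesis after num_faulty faulty components in a row."""
    unnormalised = {h: PRIOR[h] * FAULT_FRACTION[h] ** num_faulty for h in PRIOR}
    total = sum(unnormalised.values())
    return {h: u / total for h, u in unnormalised.items()}


def evidence_db(p):
    """Equation (3): evidence in decibels, 10 log10 of the odds."""
    return 10 * math.log10(p / (1 - p))


for m in range(0, 21, 2):
    row = {h: round(evidence_db(p), 1) for h, p in posteriors(m).items()}
    print(m, row)
```

Printing the successive differences of any one column shows that identical data do not add identical amounts of evidence: the step size for a hypothesis changes as the identity of its strongest competitor changes.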

When we calculate P(D|H_i, I) we are supposing for the purposes of calculation that H_i is true, and so the result we get is independent of P(H_i), which is why P(D|H_i, I) factorizes. But P(D|H_i', I) is different, because when the total number of hypotheses is greater than 2, then H_i' is composite and decomposes into at least 2 hypotheses, so P(D|H_i', I) relies upon the relative probabilities for those component propositions.