Saturday, October 27, 2012

Parameter Estimation and the Relativity of Wrong




It's not enough for a theory of epistemology to consist of mathematically valid, yet totally abstract theorems. If the theory is to be taken seriously, there has to be a demonstrable correspondence between those theorems and the real world - the theory must make sense. It has to feel right.

This idea has been captured by Jaynes in his exposition of and development upon Cox's theorems1, in which the sum and product rules of probability theory (formerly, and often still, considered as axioms themselves) were rigorously derived. Jaynes built up the derivation from a small set of basic principles, which he called desiderata, rather than axioms, among which was the requirement quite simply for 'qualitative correspondence with common sense.' If you read Cox, I think it is clear that this is very much in line with his original reasoning.

Feeling right and overlapping strongly with intuition are therefore crucial tests of the validity and consistency of our theory of probability (in fact, of any theory of probability), which, let me reiterate, is really the theory of all science. This is one of the reasons why a query I received recently from reader Yair is a great question, and a really important one, worthy of a full-blown post (this one) to explore. This excellent question was about one theory being considered closer to the truth than another, and was phrased in terms of the example of the shape of the Earth: if the Earth is neither flat nor spherical, where does the idea come from that one of these hypotheses is closer to the truth? They are both false, after all, within the Boolean logic of our Bayesian system. How can Bayes' theorem replicate this idea (of one false proposition being more correct than another false proposition), as any serious theory of science surely ought to?

Discussing this issue briefly in the comments following my previous post was an important lesson for me. The answer was something that I thought was obvious, but Yair's question reminded me that at some point in the past I had also considered the matter, and expended quite some effort getting to grips with it. Like so many things, it is only obvious after you have seen it. There is a story about the great mathematician G. H. Hardy: he was on stage at a mathematics conference, giving a talk. At some point, while saying 'it is trivial to show that...,' he ground to a halt, stared at his notes for a moment, scratched his head, then walked absent-mindedly off the stage and into another room. Because of his greatness, and the respect the conference attendees had for him, they all waited patiently. He returned after half an hour, said 'yes, it is trivial,' and continued with the rest of his talk exactly as planned.

Yair's question also reminded me of an excellent little something I read by Isaac Asimov concerning the exact same issue of the shape of the Earth, and degrees of wrongness. This piece is called 'The Relativity of Wrong2.' It consists of a reply to a correspondent who expressed the opinion that since all scientific theories are ultimately replaced by newer theories, then they are all demonstrably wrong, and since all theories are wrong, any claim of progress in science must be a fantasy. Asimov did a great job of demonstrating that this opinion is absurd, but he did not point out the specific fallacy committed by his correspondent. In a moment, I'll redress this minor shortcoming, but first, I'll give a bit of detail concerning the machinery with which Bayes' theorem sets about assessing the relative wrongness of a theory.

The technique we are concerned with is model comparison, which I have introduced already. To perform model comparison, however, we need to grasp parameter estimation, which I probably ought to have discussed in more detail before now.

Suppose we are fitting some curve through a series of measured data points, D (e.g. fitting a straight line or a circle to the outline of the Earth), then in general, our fitting model will involve some list of model parameters, which we'll call θ. If the model is represented by the proposition, M, and I represents our background information, as usual, then the probability for any given set of numerical values for the parameters, θ, is given by

P(θ | DMI)  =  P(θ | MI) × P(D | θMI) / P(D | MI)        (1)

If the model has only one parameter, then this is simple to interpret: θ is just a single number. If the model has two parameters, then the probability distribution P(θ) ranges over two dimensions, and is still quite easy to visualize. For more parameters, we just add more dimensions - harder to visualize, but the maths doesn't change.

The term P(θ | MI) is the prior probability for some specific value of the model parameters, our degree of belief before the data were obtained. There are various ways we could arrive at this prior, including ignorance, measured frequencies, and a previous use of Bayes' theorem.

The term P(D | θMI), known as the likelihood function, needs to be calculated from some sampling distribution. I'll describe how this is most often done. If we assume the correctness of θMI, then we know exactly the path traversed by the model curve. Very naively, we'd think that each data point, di, in D must lie on this curve, but of course, there is some measurement error involved: the d's should be close to the model curve, but will not typically lie exactly on it. Small errors will be more probable than large errors. The probability for each di, therefore, is the probability associated with the discrepancy between the data point and the expected curve, di - y(xi), where y(x) is the value of the theoretical model curve at the relevant location. This difference, di - y(xi), is called a residual.

Very often, it will be highly justified to assume a Gaussian distribution for the sampling distribution of these errors. There are two reasons for this. One is that the actual frequencies of the errors are very often well approximated as Gaussian. This is due to the overlapping of numerous physical error mechanisms, and is explained by the central limit theorem (a central theorem about limits, rather than a theorem about central limits (whatever they might be)). This also explains why Francis Galton coined the term 'normal distribution' (which we ought to prefer over 'Gaussian,' as Gauss was not the discoverer (de Moivre discovered it, and Laplace popularized it, after finding a clever alternative derivation by Gauss (note to self: use fewer nested parentheses))).
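As a quick illustration (the numbers here are my own toy choices, not anything from the argument above), here is the central limit theorem at work: each measurement error is built from many small, independent, decidedly non-Gaussian disturbances, yet the total error comes out looking normal.

```python
import numpy as np

# Toy illustration of the central limit theorem: each measurement error is the
# sum of many small, independent disturbances, each drawn from a decidedly
# non-Gaussian (uniform) distribution. The summed error is close to Gaussian.
rng = np.random.default_rng(0)

n_mechanisms = 30         # arbitrary number of independent error sources
n_measurements = 100_000  # arbitrary number of simulated measurements

# Each row is one measurement; its total error is the sum of 30 uniform kicks.
kicks = rng.uniform(-1.0, 1.0, size=(n_measurements, n_mechanisms))
total_error = kicks.sum(axis=1)

# The CLT predicts mean 0 and variance 30 × var(U(-1, 1)) = 30 × 1/3 = 10.
print("sample mean:", total_error.mean())
print("sample std :", total_error.std(), " CLT prediction:", np.sqrt(10.0))
# A histogram of total_error is visually indistinguishable from the
# corresponding normal curve.
```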

The other reason the assumption of normality is legitimate is an obscure little idea called maximum entropy. If all we know about a distribution is its location and width (mean and standard deviation), then the only function we can use to describe it, without implicitly assuming more information than we have, is the Gaussian function.

Here's what the normal sampling distribution for the error at a single data point looks like:

P(di | θMI)  =  (1 / (σi√(2π))) exp( -(di - y(xi))² / (2σi²) )        (2)

For all n d's in D, the total probability is just the product of all these terms given by equation (2), and since e^a × e^b = e^(a+b), then

P(D | θMI)  =  [ ∏ 1/(σi√(2π)) ] exp( -Σ[(di - y(xi))² / (2σi²)] )        (3)

This, along with our priors, is all we typically need to perform Bayesian parameter estimation.

If we start from ignorance, or if for any other reason the priors are uniform, then finding the most probable values for θ simply becomes a matter of maximizing the exponential function in equation (3), and the procedure reduces to the method of maximum likelihood. Because of the minus sign in the exponent, maximizing this function requires minimizing Σ[(di - y(xi))² / (2σi²)]. Furthermore, if the standard deviation, σ, is the same for all d, then we just have to minimize Σ[di - y(xi)]², which is the least squares method, beloved of physicists.
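Here is a minimal sketch of that equivalence, using an invented straight-line example with equal, known σ: the parameters that maximize the Gaussian likelihood of equation (3) are exactly the ones that minimize the sum of squared residuals.

```python
import numpy as np

# Minimal sketch: with a flat prior and equal, known sigma, maximizing the
# Gaussian likelihood of equation (3) is the same as minimizing the sum of
# squared residuals. The straight-line model and the data are invented.
rng = np.random.default_rng(1)

x = np.linspace(0.0, 10.0, 50)
sigma = 0.5
d = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=x.size)  # 'true' line plus noise

def log_likelihood(theta):
    """Log of equation (3) for a straight-line model, theta = (slope, intercept)."""
    slope, intercept = theta
    residuals = d - (slope * x + intercept)
    return (-0.5 * np.sum(residuals**2) / sigma**2
            - x.size * np.log(sigma * np.sqrt(2.0 * np.pi)))

def sum_of_squares(theta):
    slope, intercept = theta
    return np.sum((d - (theta[0] * x + theta[1]))**2)

# Brute-force search over a parameter grid: the same theta wins both contests.
grid = [(a, b) for a in np.linspace(1.5, 2.5, 201) for b in np.linspace(0.5, 1.5, 201)]
print("maximum likelihood:", max(grid, key=log_likelihood))
print("least squares     :", min(grid, key=sum_of_squares))
```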

Staying, for simplicity, with the assumption of a uniform prior, then it is clear that when comparing two different fitting models, the one that achieves smaller residuals will be the favoured one, according to probability theory. (See, for example, equation (4) in my article on the Ockham Factor.) P(D | θMI) is larger for the model with smaller residuals, as just described.

The whole point of this post was to figure out how to quantify closeness to truth. The residuals we've just been looking at measure how wrong the model is: the data point, d, is reality; y(x) is the model; and the difference between them is the amount of wrongness we wanted to quantify. And by Bayes' theorem, more wrongness leads to less probability, exactly as desired.

Within a system of only two models, 'flat Earth' versus 'spherical Earth,' there is no scope for knowing that both models are actually false, but even working with such a system, we would probably keep in mind the strong potential for a third, more accurate model (e.g. the oblate spheroid that Asimov discussed). Such mindfulness is really a manifestation of a broader 'supermodel.' In the two-model system, 'spherical Earth' is closer to the truth because it manifests much smaller residuals. It is also closer to the truth than 'flat Earth,' even after the third model is introduced, because its residuals are still smaller than those for 'flat Earth.' 'Oblate spheroid' will be even closer to the truth in the three-model system, but spherical and flat will still have non-zero probability - strictly, we can not rule them out completely, thanks to the unavoidable measurement uncertainty, and so the statement that we know them to be false is not rigorously valid.
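To make this concrete, here is a toy two-dimensional sketch of the example (all numbers invented): noisy survey points on a slightly flattened 'Earth outline' are fitted with a straight line and with a circle. Both models are strictly false, but the circle's residuals are vastly smaller, so its likelihood - and, with equal priors, its posterior probability - is vastly larger.

```python
import numpy as np

# Toy two-dimensional version of the flat/spherical/oblate Earth example.
# The 'true' outline is a slightly flattened ellipse; neither the straight
# line ('flat') nor the circle ('spherical') is exactly right, but the circle
# is far less wrong. All numbers are invented for illustration.
rng = np.random.default_rng(2)

a, b = 6378.0, 6357.0                       # ellipse semi-axes, roughly Earth-like (km)
angles = np.linspace(0.1, np.pi - 0.1, 40)  # survey points over one hemisphere
sigma = 5.0                                 # assumed measurement noise (km)
px = a * np.cos(angles)
py = b * np.sin(angles) + rng.normal(0.0, sigma, size=angles.size)

# Model 1, 'flat Earth': the outline is a horizontal line y = c.
c = py.mean()                               # best-fitting flat level
resid_flat = py - c

# Model 2, 'spherical Earth': the outline is a circle of radius r about the
# origin, so the predicted height at each px is sqrt(r**2 - px**2).
r = np.mean(np.hypot(px, py))               # simple estimate of the best radius
resid_sphere = py - np.sqrt(r**2 - px**2)

for name, res in [("flat", resid_flat), ("sphere", resid_sphere)]:
    loglike = -0.5 * np.sum(res**2 / sigma**2)   # Gaussian log-likelihood, eq. (3)
    print(f"{name:7s} RMS residual = {np.sqrt(np.mean(res**2)):9.1f} km,"
          f" log-likelihood = {loglike:14.1f}")
```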

I promised earlier to identify the fallacy perpetrated by Asimov's misguided correspondent. I discussed it a few months ago: it is the mind-projection fallacy, the false assumption that aspects of our model of reality must be manifested in reality itself. In this case: if wrongness (relating to our knowledge of reality) can be graded, then so must 'true' and 'false' (relating to reality itself) also be graded. There are two ways to reason from here: (1) truth must be fuzzy, or (2) our idea of continuous degrees of wrong must be mistaken.

The idea that all models that are wrong are necessarily all equally wrong, as expressed in the letter with which poor Asimov was confronted, is fallacious in the extreme. Wrong does not have this black/white feature. 'Wrong' and 'false' are not the same. Of course, a wrong theory is also false, but if I'm walking to the shop, I'd rather find my location to be wrong by half a mile than by a hundred miles.

We can say that a theory is less wrong (i.e. produces smaller residuals), without implying that it is more true. 'True' and 'false' retain their black-and-white character, as I believe they must, but our knowledge of what is true is necessarily fuzzy. This is precisely why we use probabilities. As our theories get incrementally less wrong and closer to the truth, so the probabilities we are allowed to assign to them get larger.

There often seems to be a kind of bait-and-switch con trick going on with many of the world's least respectable 'philosophies.' The 'philosopher' makes a trivial but correct observation, then makes a subtle shift, often via the mind-projection fallacy, to produce an equivalent-looking statement that is both revolutionary and utter garbage. In post-modern relativism (a popular movement in certain circles), we can see this shifting between right/wrong and true/false. The observation is made that all scientific theories are ultimately wrong, then, hoping you won't notice the switch, the next thing you hear is that all theories are equally wrong. They can't seem to make up their minds, however, about which side of the fallacy they are on, because the next thing you'll probably hear from them is that because knowledge is mutable, then so are the facts themselves: truth is relative to your point of view and the mood you happen to be in, and science is nought but a social construct. Part of the joy of familiarity with scientific reasoning is the clarity of thought to see through the fog of such nonsense.







[1] 'Probability, Frequency, and Reasonable Expectation,' R. T. Cox, American Journal of Physics 1946, Vol. 14, No. 1, Pages 1-13. (Available here.)


[2] 'The Relativity of Wrong,' Isaac Asimov, The Skeptical Inquirer, Fall 1989, Vol. 14, No. 1, Pages 35-44. (Download the text here.) (And no, I don't find it spooky that both references are from the same volume and number.)



Monday, October 8, 2012

Total Bayesianism



If you've read even a small sample of the material I've posted so far, you'll recognize that one of my main points concerns the central importance of Bayes' theorem. You might think that the most basic statement of this importance is something like "Bayes' theorem is the most logical method for all data analysis." For me, though, this falls far short of capturing the most general importance of Bayes' rule.

Bayes' theorem is more than just a method of data analysis, a means of crunching the numbers. It represents the rational basis for every aspect of scientific method. And since science is simply the methodical application of common sense, Bayes' theorem can be seen to be (together with decision theory) a good model for all rational behaviour. Indeed, it may be more appropriate to invert that, and say that your brain is a superbly adapted mechanism, evolved for the purpose of simulating the results of Bayes' theorem. 

Because I equate scientific method with all rational behaviour, I am no doubt opening myself up to the accusation of scientism1, but my honest response is: so what? If I am more explicit than some about the necessary and universal validity of science, this is only because reason has led me in this direction. For example, P.Z. Myers, author of the Pharyngula blog (vastly better known than mine, but you probably knew that already), is one of the great contemporary advocates of scientific method - clear-headed and craftsmanlike in the way he constructs his arguments - but in my evidently extreme view, even he can fall short, on occasion, of recognizing the full potential and scope of science. In one instance I recall, when the league of nitwits farted in Myers' general direction, and he himself stood accused of scientism, he deflected the accusation, claiming it was a mistake. My first thought, though, was "hold on, there's no mistake." Myers wrote:
The charge of scientism is a common one, but it’s not right: show us a different, better path to knowledge and we’ll embrace it.
But how is one to show a better path to knowledge? In principle, it can not be done. If Mr. X claims that he can predict the future accurately by banging his head with a stone until visions appear, does that suffice as showing? Of course not; a rigorous scientific test is required. Now, if under the best possible tests X's predictions appear to be perfectly accurate, any further inferences based on them are only rational to the extent that science is capable of furnishing us (formally, or informally) with a robust probability estimate that his statements represent the truth. Sure, we can use X's weird methodology, but we can only do so rationally if we do so scientifically. X's head-smashing trick will never be better than science (a sentence I did not anticipate writing).

To put it another way, X may yield true statements, but if we have no confidence in their truth, then they might as well be random. Science is the engine generating that confidence.

So, above I claimed that all scientific activity is ultimately driven by Bayes' theorem. Let's look at it again in all its glory:

P(H | DI)  =  P(H | I) × P(D | H I) / [ P(H | I) × P(D | H I)  +  P(H' | I) × P(D | H' I) ]        (1)

(As usual, H is a hypothesis we want to evaluate, D is some data, I is the background information, and H' means "H is not true.")
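In code, equation (1) amounts to nothing more than the following trivial calculation (the function name and the example numbers are mine, purely for illustration):

```python
def posterior(prior_h, like_h, like_not_h):
    """Equation (1): P(H | DI) from P(H | I), P(D | H I) and P(D | H' I)."""
    prior_not_h = 1.0 - prior_h          # P(H' | I)
    numerator = prior_h * like_h         # P(H | I) × P(D | H I)
    return numerator / (numerator + prior_not_h * like_not_h)

# A hypothesis with prior 0.5 whose predictions fit the data ten times better
# than the alternative's ends up with a posterior of about 0.91.
print(posterior(0.5, 0.10, 0.01))
```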

The goal of science, whether one accepts it or not, is to calculate the term on the left hand side of equation (1). Now, most, if not all, accepted elements of experimental design are actually adapted to manipulate the terms on the right hand side of this equation, in order to enhance the result. I'll illustrate with a few examples.

Firstly, and most obviously, the equation calls for data, D. We have to look at the world, in order to learn about it. We must perform experiments to probe nature's secrets. We can not make inferences about the real world by thought alone. (Some may appear to do this, but no living brain is completely devoid of stored experiences - the best philosophers are simply very efficient at applying Bayes' theorem (usually without knowing it) to produce powerful inferences from mundane and not very well controlled data. This is why philosophy should never be seen as lying outside empirical science.)

Secondly, the equation captures perfectly what we recognize as the rational course of action when evaluating a theory - we have to ask 'what should I expect to see if this theory is true? - what are its testable hypotheses?' In other words, what data can I make use of in order to calculate P(D | HI)?

Once we've figured out what kind of data we need, the next question is how much data? Bayes' rule informs us: we need P(D | HI) to be as high as possible if H is true, and as low as possible if H is false. Let's look at a numerical example:

Suppose I know, on average, how tall some species of flower gets, when I grow the plants in my home. Suppose I suspect that picking off the aphids that live on these flowers will make the plants more healthy, and cause them to grow higher. My crude hypothesis is that the relative frequency with which these specially treated flowers exceed the average height is more than 50 %. My crude data set results from growing N flowers, applying the special treatment to all of them, and recording the number, x, that exceed the known average height.

To check whether P(D | HI) is high when H is true and low when H is false, we'll take the ratio

P(D | H I) / P(D | H' I)        (2)

If Hf says that the frequency with which the flowers exceed their average height is f, then P(D | HfI) (where D is the number of tall flowers, x, and the total number grown, N) is given by the binomial distribution. But our real hypothesis, H, asserts that f is in the range 0.5 < f ≤ 1. This means we're going to have to sum up a whole load of P(D | HfI)s. We could do the integral exactly, but to avoid the algebra, let's treat the smoothly varying function like a staircase, and split the f-space into 50 parts: f = 0.51, 0.52, ..., 0.99, 1.00. To calculate P(D | H'I), we'll do the same, with f = 0.01, 0.02, ..., 0.50.

What we want, e.g. for P(D | HI), is P(D | [H0.51 + H0.52 + ...] I).

Generally, where all hypotheses involved are mutually exclusive, it can be shown (see appendix below) that,

P(D | [H1 + H2 + ...] I)  =  [ P(H1 | I) P(D | H1 I)  +  P(H2 | I) P(D | H2 I)  +  ... ] / [ P(H1 | I)  +  P(H2 | I)  +  ... ]        (3)


But we're starting from ignorance, so we'll take all the priors, P(Hf | I), to be the same. We'll also have the same number of them, 50, in both numerator and denominator, so when we take the desired ratio, all the priors will cancel out (as will the width, Δf = 0.01, of each of the intervals on our grid), and all we need to do is sum up P(D | Hf1I) + P(D | Hf2I) + ... for each relevant range. Each term will come straight from the binomial distribution:

P(x | N, f)  =  [ N! / (x! (N - x)!) ] f^x (1 - f)^(N - x)        (4)

If we do that for, say, 10 test plants, with seven flowers growing beyond the average height, then ratio (2) is 7.4. If we increase the number of trials, keeping the ratio of N to x constant, what will happen?

If we try N = 20, x = 14, not too surprisingly, ratio (2) improves. The result now is 22.2, an increase of 14.8. Furthermore, if we try N = 30, x = 21, ratio (2) increases again, but this time more quickly: now the ratio is 58.3, a further increase of 36.1.
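Here is a sketch of the calculation behind those numbers, following the staircase recipe above (a grid of width 0.01, the binomial likelihood at each grid point, and the ratio of the two sums); it should reproduce roughly 7.4, 22.2 and 58.3, give or take the discretization and rounding.

```python
import numpy as np
from math import comb

def likelihood_ratio(N, x):
    """Ratio (2): sum of binomial likelihoods over the grid f = 0.51, ..., 1.00,
    divided by the sum over f = 0.01, ..., 0.50 (the 'staircase' approximation)."""
    f_upper = np.arange(0.51, 1.001, 0.01)
    f_lower = np.arange(0.01, 0.501, 0.01)
    binom = lambda f: comb(N, x) * f**x * (1.0 - f)**(N - x)   # equation (4)
    return binom(f_upper).sum() / binom(f_lower).sum()

for N, x in [(10, 7), (20, 14), (30, 21)]:
    print(f"N = {N:2d}, x = {x:2d}: ratio = {likelihood_ratio(N, x):.1f}")
# Expected output: ratios of roughly 7.4, 22.2 and 58.3.
```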

So, to maximize the contrast between the hypotheses under test, H and H', what we should do is take as many measurements as practically possible. Something every scientist knows already, but something nonetheless demanded by Bayes' theorem.

How is our experimental design working out, then? Well, not that great so far, actually. Presumably the point of the experiment was to decide whether removing the parasites from the flowers provided a mechanism enabling them to grow bigger, but all we have really shown is that they did grow bigger. We can see this by resolving, for example, H' into a set of mutually exclusive and exhaustive (within some limited model) sub-hypotheses:

H'  =  H'A1  +  H'A2  +  H'A3  +  ...        (5)

where H' is, as before, 'removing aphids did not improve growth,' and some of the A's represent alternative causal agencies capable of effecting a change in growth. For example, A1 is the possibility that a difference in ambient temperature tended to make the plants grow differently. Let's look again at equation (3). This time, instead of the Hf's, we have all the H'Ai, but the principle is the same. Previously, the priors were all the same, but this time we can exploit the fact that they need not be. We need to manipulate those priors so that the P(D | H'I) term in the denominator of Bayes' theorem is always low when the number of tall plants in the experiment is large. We can do this by reducing the priors for some of the Ai corresponding to the alternate causal mechanisms. To achieve this, we'll introduce a radical improvement to our methodology: control.

Instead of relying on past data for plants not treated by having their aphids removed, we'll grow two sets of plants, treated identically in all respects except the one that we are investigating with our study. The temperature will be the same for both groups of plants, so P(A1 | I) will be zero - there will be no difference in temperature to possibly affect the result. The same will happen to all the other Ai (if we have really controlled for all confounding variables) that correspond to additional agencies offering explanations for taller plants.

This process of increasing the degree of control can, of course, undergo numerous improvements. Suppose, for example, that after a number of experiments, I begin to wonder whether it's not actually removing the aphids that affects the plants, but simply the rubbing of the leaves with my fingers that I perform in order to squish the little parasites. So, as part of my control procedure, I devise a way to rub the leaves of the plants in the untreated group, while carefully avoiding those villainous arthropods. Not a very plausible scenario, I suppose, but if we give a tentative name to this putative phenomenon, we can appreciate how analogous processes might be very important in other fields. For the sake of argument, let's call it a placebo effect.

Next, I begin to worry that I might be subconsciously influencing the outcome of my experiments. Because I'm keen on the hypothesis I'm testing (think of the agricultural benefits such knowledge could offer!), I worry that I am inadvertently biasing my seed selection, so that healthier-looking seeds go into the treatment group more than into the control group. I can fix this, however, by randomly allocating which group each seed goes into, thereby setting the prior for yet another alternate mechanism to zero. The vital role of randomization, when available, in collecting good-quality scientific data is something we noted already when looking at Simpson's paradox, and is something that has been well appreciated for at least a hundred years.

Randomization isn't only for alleviating experimenter biases, either. Suppose that my flower pots are filled with soil by somebody else, with no interest in or knowledge of my experimental program. I might be tempted to use every second pot for the control group, but suppose my helper is also filling the pots in pairs, using one hand for each. Suppose also that the pots filled with his left hand inadvertently receive less soil than those filled with his right hand. Unexpected periodicities such as these are also taken care of by proper randomization.

Making real-world observations, and lots of them; control groups; placebo controls; and randomization: some exceedingly obvious measures, some less so, but all contained in that beautiful little theorem. Add these to our Bayesian formalization of Ockham's razor, and its extension, resulting in an explanation for the principle of falsifiability, and we can not avoid noticing that science is a thoroughly Bayesian affair.





Appendix


You might like to look again at the 3 basic rules of probability theory, if your memory needs refreshing.

To derive equation (3), above, we can write down Bayes' theorem in a slightly strange way:

P(D | [H1 + H2 + ...] I)  =  P(D | I) × P([H1 + H2 + ...] | D I) / P([H1 + H2 + ...] | I)        (A1)


This might look a bit backward, but thinking about it a little abstractly, before any particular meaning is attached to the symbols, we see that it is perfectly valid. If you're not used to Boolean algebra, or anything similar, let me reassure you that it's perfectly fine for a combination of propositions, such as A + B + C (where the + sign means 'or'), to be treated as a proposition in its own right. If equation (A1) looks like too much, just replace everything in the square brackets with another symbol, X.

As long as all the various sub-hypotheses, Hi, are mutually exclusive, then when we apply the extended sum rule above and below the line, the cross terms vanish, and (A1) becomes:

P(D | [H1 + H2 + ...] I)  =  P(D | I) × [ P(H1 | D I)  +  P(H2 | D I)  +  ... ] / [ P(H1 | I)  +  P(H2 | I)  +  ... ]        (A2)



We can multiply out the top line, and also make note that for each hypothesis, Hi, we can make two separate applications of the product rule to the expression P(Hi D | I), to show that

P(D | I)  =  P(Hi | I) P(D | Hi I) / P(Hi | D I)        (A3)


(This is actually exactly the technique by which Bayes' theorem itself can be derived.)

Substituting (A3) into (A2), we see that

P(D | [H1 + H2 + ...] I)  =  [ P(H1 | I) P(D | H1 I)  +  P(H2 | I) P(D | H2 I)  +  ... ] / [ P(H1 | I)  +  P(H2 | I)  +  ... ]        (A4)


which is the result we wanted.
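If you prefer a numerical sanity check to the algebra, here is a quick Monte Carlo sketch (the priors and likelihoods are made up): simulate which of three mutually exclusive hypotheses holds and whether D then occurs, and compare the empirical P(D | H1 + H2) with the prior-weighted average given by (A4).

```python
import numpy as np

# Monte Carlo check of equation (A4), with made-up numbers. Three mutually
# exclusive 'hypotheses' (states of the world) with priors P(Hi | I) and
# likelihoods P(D | Hi I):
rng = np.random.default_rng(3)
priors = np.array([0.2, 0.5, 0.3])      # sum to 1, so they define a simulation
likes = np.array([0.05, 0.40, 0.70])    # P(D | Hi I)

n = 2_000_000
which = rng.choice(3, size=n, p=priors)       # which hypothesis holds
d_occurs = rng.random(n) < likes[which]       # does D occur on this trial?

# Empirical P(D | H1 + H2): restrict attention to trials where H1 or H2 holds.
empirical = d_occurs[which < 2].mean()

# Equation (A4): prior-weighted average of the likelihoods of H1 and H2.
formula = (priors[:2] * likes[:2]).sum() / priors[:2].sum()

print(empirical, formula)   # agree to within Monte Carlo noise (both ≈ 0.3)
```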





[1]  From Wikipedia:
Scientism is a term used, usually pejoratively, to refer to belief in the universal applicability of the scientific method and approach, and the view that empirical science constitutes the most authoritative worldview or most valuable part of human learning to the exclusion of other viewpoints.