Maximum Entropy: Parameter Estimation and the Relativity of Wrong

It's not enough for a theory of epistemology to consist of mathematically valid, yet totally abstract theorems. If the theory is to be taken seriously, there has to be a demonstrable correspondence between those theorems and the real world - the theory must make sense. It has to feel right.

This idea has been captured by Jaynes in his exposition of and development upon Cox's theorems¹, in which the sum and product rules of probability theory (formerly, and often still, considered as axioms themselves) were rigorously derived. Jaynes built up the derivation from a small set of basic principles, which he called desiderata, rather than axioms, among which was the requirement quite simply for 'qualitative correspondence with common sense.' If you read Cox, I think it is clear that this is very much in line with his original reasoning.

Feeling right and overlapping strongly with intuition are therefore crucial tests of the validity and consistency of our theory of probability (in fact, of any theory of probability), which, let me reiterate, is really the theory of all science. This is one of the reasons why a query I received recently from reader Yair is a great question, and a really important one, worthy of a full-blown post (this one) to explore. The question was about one theory being considered closer to the truth than another, and was phrased in terms of the example of the shape of the Earth: if the Earth is neither flat nor spherical, where does the idea come from that one of these hypotheses is closer to the truth? They are both false, after all, within the Boolean logic of our Bayesian system. How can Bayes' theorem replicate this idea (of one false proposition being more correct than another false proposition), as any serious theory of science surely ought to?

Discussing this issue briefly in the comments following my previous post was an important lesson for me. The answer was something that I thought was obvious, but Yair's question reminded me that at some point in the past I had also considered the matter, and expended quite some effort getting to grips with it. Like so many things, it is only obvious after you have seen it. There is a story about the great mathematician, G. H. Hardy: Hardy was on stage at some conference of mathematics giving a talk. At some point, when he was saying 'it is trivial to show that.....,' he ground to a halt, stared at his notes for a moment, scratched his head, then walked off the stage absent-mindedly, into another room. Because of his greatness, and the respect the conference attendees had for him, they all waited patiently. He returned after half an hour, to say 'yes, it is trivial,' before continuing with the rest of his talk, exactly as planned.

Yair's question also reminded me of an excellent little something I read by Isaac Asimov concerning the exact same issue of the shape of the Earth, and degrees of wrongness. This piece is called 'The Relativity of Wrong².' It consists of a reply to a correspondent who expressed the opinion that since all scientific theories are ultimately replaced by newer theories, then they are all demonstrably wrong, and since all theories are wrong, any claim of progress in science must be a fantasy. Asimov did a great job of demonstrating that this opinion is absurd, but he did not point out the specific fallacy committed by his correspondent. In a moment, I'll redress this minor shortcoming, but first, I'll give a bit of detail concerning the machinery with which Bayes' theorem sets about assessing the relative wrongness of a theory.

The technique we are concerned with is model comparison, which I have introduced already. To perform model comparison, however, we need to grasp parameter estimation, which I probably ought to have discussed in more detail before now.

Suppose we are fitting some curve through a series of measured data points, D (e.g. fitting a straight line or a circle to the outline of the Earth). In general, our fitting model will involve some list of model parameters, which we'll call θ. If the model is represented by the proposition M, and I represents our background information, as usual, then the probability for any given set of numerical values for the parameters, θ, is given by Bayes' theorem:

$$P(\theta \mid DMI) = \frac{P(\theta \mid MI)\, P(D \mid \theta MI)}{P(D \mid MI)} \tag{1}$$

If the model has only one parameter, then this is simple to interpret: θ is just a single number. If the model has two parameters, then the probability distribution P(θ) ranges over two dimensions, and is still quite easy to visualize. For more parameters, we just add more dimensions - harder to visualize, but the maths doesn't change.

The term P(θ | MI) is the prior probability for some specific value of the model parameters, our degree of belief before the data were obtained. There are various ways we could arrive at this prior, including ignorance, measured frequencies, and a previous use of Bayes' theorem.

The term P(D | θMI), known as the likelihood function, needs to be calculated from some sampling distribution. I'll describe how this is most often done. Assuming the correctness of θMI, we know exactly the path traversed by the model curve. Very naively, we'd think that each data point, d_i, in D must lie on this curve, but of course, there is some measurement error involved: the d's should be close to the model curve, but will not typically lie exactly on it. Small errors will be more probable than large errors. The probability for each d_i, therefore, is the probability associated with the discrepancy between the data point and the expected curve, d_i - y(x_i), where y(x_i) is the value of the theoretical model curve at the relevant location, x_i. This difference, d_i - y(x_i), is called a residual.

Very often, it will be highly justified to assume a Gaussian distribution for the sampling distribution of these errors. There are two reasons for this. One is that the actual frequencies of the errors are very often well approximated as Gaussian. This is due to the overlapping of numerous physical error mechanisms, and is explained by the central limit theorem (a central theorem about limits, rather than a theorem about central limits (whatever they might be)). This also explains why Francis Galton coined the term 'normal distribution' (which we ought to prefer over 'Gaussian,' as Gauss was not the discoverer (de Moivre discovered it, and Laplace popularized it, after finding a clever alternative derivation by Gauss (note to self: use fewer nested parentheses))).

The other reason the assumption of normality is legitimate is an obscure little idea called maximum entropy. If all we know about a distribution is its location and width (mean and standard deviation), then the only function we can use to describe it, without implicitly assuming more information than we have, is the Gaussian function.
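As a small numerical illustration of the maximum entropy claim, here is a sketch that compares the differential entropy of three familiar distributions, all forced to have the same standard deviation. The closed-form entropy expressions are standard textbook results; the choice of the uniform and Laplace distributions as competitors is mine, purely for illustration, and three examples obviously do not amount to a proof.

```python
import math

def gaussian_entropy(sigma):
    # Differential entropy of a normal distribution: 0.5 * ln(2*pi*e*sigma^2)
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def uniform_entropy(sigma):
    # A uniform distribution of width w has variance w^2 / 12,
    # so w = sigma * sqrt(12), and its entropy is ln(w)
    return math.log(sigma * math.sqrt(12))

def laplace_entropy(sigma):
    # A Laplace distribution with scale b has variance 2 * b^2,
    # so b = sigma / sqrt(2), and its entropy is 1 + ln(2*b)
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)

sigma = 1.0  # arbitrary common width for the comparison
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "uniform": uniform_entropy(sigma),
    "laplace": laplace_entropy(sigma),
}
for name, h in entropies.items():
    print(f"{name:8s} entropy = {h:.4f} nats")
```

Of the three, the Gaussian comes out with the largest entropy, which is the sense in which it assumes the least beyond the stated location and width.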

Here's what the normal sampling distribution for the error at a single data point looks like:

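$$P(d_i \mid \theta MI) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[-\frac{(d_i - y(x_i))^2}{2\sigma_i^2}\right] \tag{2}$$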

For all n d's in D, assuming the errors are independent, the total probability is just the product of all these terms given by equation (2), and since e^a × e^b = e^(a+b), then

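$$P(D \mid \theta MI) = \left(\prod_{i=1}^{n} \frac{1}{\sigma_i \sqrt{2\pi}}\right) \exp\left[-\sum_{i=1}^{n} \frac{(d_i - y(x_i))^2}{2\sigma_i^2}\right] \tag{3}$$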

This, along with our priors, is all we typically need to perform Bayesian parameter estimation.
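To make the procedure concrete, here is a minimal sketch of parameter estimation on a grid, assuming a one-parameter model (a straight line through the origin), a uniform prior, and the Gaussian sampling distribution described above. The data values, the common σ, and the grid limits are all invented for illustration; nothing here comes from real measurements.

```python
import math

# Hypothetical measurements (x_i, d_i) with a common standard deviation
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ds = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma = 0.2

def model(x, theta):
    # One-parameter model curve: a straight line through the origin
    return theta * x

def log_likelihood(theta):
    # Log of the product of Gaussian terms, one term per data point
    total = 0.0
    for x, d in zip(xs, ds):
        residual = d - model(x, theta)
        total += -0.5 * (residual / sigma) ** 2
        total -= math.log(sigma * math.sqrt(2 * math.pi))
    return total

# Uniform prior over a grid of candidate parameter values, 1.5 to 2.5
thetas = [1.5 + 0.001 * k for k in range(1001)]
log_prior = [0.0 for _ in thetas]

# Unnormalized log posterior = log prior + log likelihood
log_post = [lp + log_likelihood(t) for lp, t in zip(log_prior, thetas)]

# Normalize over the grid, subtracting the peak first for numerical stability
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]
norm = sum(weights)
posterior = [w / norm for w in weights]

best_theta = thetas[posterior.index(max(posterior))]
mean_theta = sum(t * p for t, p in zip(thetas, posterior))
print(f"most probable theta = {best_theta:.3f}, posterior mean = {mean_theta:.3f}")
```

Normalizing over the grid plays the role of the denominator in Bayes' theorem, so it never has to be calculated separately. A grid is only practical for one or two parameters, but the logic is the same in any number of dimensions.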

If we start from ignorance, or if for any other reason the priors are uniform, then finding the most probable values for θ simply becomes a matter of maximizing the exponential function in equation (3), and the procedure reduces to the method of maximum likelihood. Because of the minus sign in the exponent, maximizing this function requires minimizing Σ[(d_i - y(x_i))²/(2σ_i²)]. Furthermore, if the standard deviation, σ, is the same for all d, then we just have to minimize Σ[d_i - y(x_i)]², which is the least squares method, beloved of physicists.
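The equivalence between least squares and maximum likelihood is easy to check numerically. The sketch below fits a straight line, y(x) = a + bx, using the usual closed-form least squares solution, then confirms that stepping away from that solution in any direction lowers the Gaussian likelihood. The data, the value of σ, and the step sizes are hypothetical, chosen only to make the point.

```python
import math

# Hypothetical data with the same standard deviation for every point
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ds = [1.1, 2.9, 5.2, 6.8, 9.1]
sigma = 0.3

def sum_sq_residuals(a, b):
    # Sum of squared residuals for the straight line y(x) = a + b*x
    return sum((d - (a + b * x)) ** 2 for x, d in zip(xs, ds))

def log_likelihood(a, b):
    # Gaussian log-likelihood: a constant minus (sum of squares) / (2 sigma^2)
    const = -len(xs) * math.log(sigma * math.sqrt(2 * math.pi))
    return const - sum_sq_residuals(a, b) / (2 * sigma**2)

# Closed-form least squares solution for a straight line
n = len(xs)
x_mean = sum(xs) / n
d_mean = sum(ds) / n
s_xy = sum((x - x_mean) * (d - d_mean) for x, d in zip(xs, ds))
s_xx = sum((x - x_mean) ** 2 for x in xs)
b_hat = s_xy / s_xx
a_hat = d_mean - b_hat * x_mean

# Evaluate the likelihood at the fit and at eight nearby parameter settings
best = log_likelihood(a_hat, b_hat)
nearby = [
    log_likelihood(a_hat + da, b_hat + db)
    for da in (-0.05, 0.0, 0.05)
    for db in (-0.05, 0.0, 0.05)
    if (da, db) != (0.0, 0.0)
]
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
print("least squares fit maximizes the likelihood:", all(best > v for v in nearby))
```

Because σ is common to all the points, it only scales the log-likelihood and cannot move the location of its peak, which is why it drops out of the least squares recipe.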

Staying, for simplicity, with the assumption of a uniform prior, it is clear that when comparing two different fitting models, probability theory favours the one that achieves smaller residuals: P(D | θMI) is larger for the model with smaller residuals, as just described. (See, for example, equation (4) in my article on the Ockham Factor.)

The whole point of this post was to figure out how to quantify closeness to truth. The residuals we've just been looking at measure how wrong the model is: d, the data point, is reality; y(x) is the model; and the difference between them is the amount of wrongness of the model, which we wanted to quantify. And by Bayes' theorem, more wrongness leads to less probability, exactly as desired.

Within a system of only two models, 'flat Earth' vs. 'spherical Earth,' there is no scope for knowing that both models are actually false, but even working with such a system, we would probably keep in mind the strong potential for a third, more accurate model (e.g. the oblate spheroid that Asimov discussed). Such mindfulness is really a manifestation of a broader 'supermodel.' In the two-model system, 'spherical Earth' is closer to the truth because it manifests much smaller residuals. It is also closer to the truth than 'flat Earth' even after the third model is introduced, because its residuals are still smaller than those for 'flat Earth.' 'Oblate spheroid' will be even closer to the truth in the three-model system, but spherical and flat will still have non-zero probability. Strictly, we cannot rule them out completely, thanks to the unavoidable measurement uncertainty, and so the statement that we know them to be false is not rigorously valid.
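Here is a toy version of the flat vs. spherical comparison, offered as a sketch under loudly stated assumptions. The 'measurements' are synthetic: the drop of the surface below a tangent line, generated from a hypothetical local radius of curvature (R_LOCAL) plus Gaussian noise, so that neither model is exactly right. The sphere model uses the commonly quoted mean radius of 6371 km. The measurement geometry, the noise level, and the equal prior probabilities are all my inventions for illustration. Neither model has free parameters here, so the likelihood alone does the comparing.

```python
import math
import random

random.seed(0)  # fixed seed so the illustration is repeatable

R_SPHERE = 6371.0  # km, radius used by the 'spherical Earth' model
R_LOCAL = 6390.0   # km, hypothetical curvature used to fake the 'real' data
SIGMA = 0.05       # km, assumed measurement uncertainty

def drop(x, radius):
    # Distance the surface falls below a tangent line, x km from the tangent point
    return radius - math.sqrt(radius**2 - x**2)

# Synthetic measurements at 50, 100, ..., 500 km
xs = [50.0 * k for k in range(1, 11)]
ds = [drop(x, R_LOCAL) + random.gauss(0.0, SIGMA) for x in xs]

models = {
    "flat": lambda x: 0.0,                   # no drop at all
    "sphere": lambda x: drop(x, R_SPHERE),   # drop for a sphere of fixed radius
}

def log_likelihood(predict):
    # Gaussian log-likelihood built from the residuals d_i - y(x_i)
    total = 0.0
    for x, d in zip(xs, ds):
        residual = d - predict(x)
        total += -0.5 * (residual / SIGMA) ** 2
        total -= math.log(SIGMA * math.sqrt(2 * math.pi))
    return total

log_like = {name: log_likelihood(f) for name, f in models.items()}

# Equal priors for the two models; stay in logs to avoid numerical underflow
peak = max(log_like.values())
log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_like.values()))
log_post = {name: v - log_norm for name, v in log_like.items()}

for name in models:
    print(f"{name:6s} log-likelihood = {log_like[name]:12.1f}, "
          f"log-posterior = {log_post[name]:12.1f}")
```

Both models are wrong by construction, yet the sphere's residuals are tiny next to those of the flat model, and the probabilities follow. The log-posterior for 'flat' is hugely negative but finite: the probability is absurdly small, not zero.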

I promised earlier to identify the fallacy perpetrated by Asimov's misguided correspondent. I discussed it a few months ago. It is the mind-projection fallacy, the false assumption that aspects of our model of reality must be manifested in reality itself. In this case: if wrongness (relating to our knowledge of reality) can be graded, then so must 'true' and 'false' (relating to reality itself) also be graded. There are two ways to reason from here: (1) truth must be fuzzy, or (2) our idea of continuous degrees of wrong must be mistaken.

The idea that all models that are wrong are necessarily all equally wrong, as expressed in the letter with which poor Asimov was confronted, is fallacious in the extreme. Wrong does not have this black/white feature. 'Wrong' and 'false' are not the same. Of course, a wrong theory is also false, but if I'm walking to the shop, I'd rather find my location to be wrong by half a mile than by a hundred miles.

We can say that a theory is less wrong (i.e. produces smaller residuals), without implying that it is more true. 'True' and 'false' retain their black-and-white character, as I believe they must, but our knowledge of what is true is necessarily fuzzy. This is precisely why we use probabilities. As our theories get incrementally less wrong and closer to the truth, so the probabilities we are allowed to assign to them get larger.

There often seems to be a kind of bait-and-switch con trick going on with many of the world's least respectable 'philosophies.' The 'philosopher' makes a trivial but correct observation, then makes a subtle shift, often via the mind-projection fallacy, to produce an equivalent-looking statement that is both revolutionary, and utter garbage. In post-modern relativism (a popular movement in certain circles), we can see this shifting between right/wrong and true/false. The observation is made that all scientific theories are ultimately wrong, then hoping you won't notice the switch, the next thing you hear is that all theories are equally wrong. They can't seem to make their minds up, however, which side of the fallacy they are on, because the next thing you'll probably hear from them is that because knowledge is mutable, then so are the facts themselves: truth is relative to your point of view and the mood you happen to be in, science is nought but a social construct. Part of the joy of familiarity with scientific reasoning is the clarity of thought to see through the fog of such nonsense.

[1] 'Probability, Frequency, and Reasonable Expectation,' R. T. Cox, American Journal of Physics 1946, Vol. 14, No. 1, Pages 1-13. (Available here.)

[2] 'The Relativity of Wrong,' Isaac Asimov, The Skeptical Inquirer, Fall 1989, Vol. 14, No. 1, Pages 35-44. (Download the text here.) (And no, I don't find it spooky that both references are from the same volume and number.)

2 comments:

יאיר רזק, November 14, 2012 at 12:12 PM
Thanks for another great post. I am afraid I'm still not convinced, however. You did a marvelous job explaining how "less wrong" can be fitted into the Bayesian framework. But you didn't explain why our epistemological framework isn't based on it in the first place.

You say that "The whole point of this post was to figure out how to quantify closeness to truth." But the "truth" or "falsity" of the theory doesn't matter at all for your measure! Indeed, you rely on not knowing it to save your calculus; instead of helping you, knowing more will hinder you! If you were to suddenly come to know that the earth is not a sphere, with absolute certainty, then you would need to revise your measure of how "close" the theory is to the truth - to zero. Yet, clearly, the theory is still as close to the truth as it was before this additional information - it is still "as wrong" as it was before. The residuals didn't change.

You said that "'True' and 'false' retain their black-and-white character, as I believe they must". I too share this metaphysical intuition, but I'm beginning to doubt it. If all we care about is how wrong a theory is, why should we also care about whether or not it is false? In practice, it won't be true anyway; or in other words, it will always be false. It'll surely only be an approximation. But, of course, putting that into the Bayesian calculus will wreck it. So why base our entire epistemology on a bivalent logic which we must not really use? Why not start off from the idea of being less or more wrong - from fuzzy logic - and try to build a formal epistemology that estimates how close a theory is to the truth, rather than how likely the theory is to be the truth?

I should mention that on the flip-side, I suspect that bivalent logic underlies all rational thought, including all of mathematics - fuzzy logic included. So that I'm not sure that there actually is an option to construct something on any other basis. I have a feeling all information is ultimately in bits (the Church-Turing thesis, I think?). But this is merely an intuition at this point.

I'm still highly ambivalent about the foundation of Logic and Epistemology, so I'm inflicting my rambling thoughts on you - sorry for that.

Yair

Saturday, October 27, 2012
