Thursday, August 16, 2012

Bayes' Theorem: All You Need to Know About Theology

At the end of my previous post, I argued that an intuitive grasp of Bayesian model comparison is an invaluable asset to those wishing to apply scientific method. Even if one never executes rigorous calculations of the form described in that article, it is still possible to gain important insight into the plausibility of different descriptions of reality, merely by exercising one’s familiarity with the general structure of the formal theory. Here, I’ll illustrate this kind of informal reasoning, which, while not making use of a single line of algebra, can still be seen to be quite watertight.

One thing I have tried to get across in my writing here is that scientific method is for everybody and can address all meaningful questions concerning fact. If a matter of fact has real consequences for the state of reality, then it is something that can be scrutinized by science. If it has no real consequences, then even the most charitable assessment leaves it an extremely poor class of fact. In this article, I’ll apply probability theory to the question of whether or not the universe is the product of some omnipotent deity. It’s a lot simpler to do than you might think.

Now there are some (including, sadly, some scientists) who maintain that what I am going to do here is inappropriate and meaningless. To many of these people, reality is divided into two classes of phenomena: the natural and the supernatural. Natural phenomena, they say, are the things that fall into the scope of science, while the supernatural lies outside of science’s grasp, and cannot be addressed by rational investigation. This is completely muddle-headed, as I have argued elsewhere. If we can measure it, then it falls within science’s domain. If we can’t measure it, then postulating its existence achieves nothing.

Other disguised forms of this argument exist. I was once asked by another physicist (and good friend): ‘How can science be so arrogant to think that it can address all aspects of reality?’ To which the answer is obvious: if you wish to claim that there is something real that can not be investigated scientifically, how can you be so arrogant to think that you know what it is? What could possibly be the basis for this knowledge?

As I said, addressing the existence of God with probability theory is quite simple to achieve. In fact, it is something that one of my mathematical heroes, Pierre-Simon Laplace, achieved with a single sentence, in a conversation with Napoleon I. The conversation occurred when the emperor was congratulating the scientist on his new book on celestial mechanics, and proceeded as follows:


Napoleon:

You made the system of the world, you explain the laws of all creation, but in all your book you speak not once of the existence of God!

Laplace:

Sire, I had no need of that hypothesis.

Lagrange (another mathematician who was also present):

Ah, but that is such a good hypothesis. It explains so many things!

Laplace:

Indeed, Sire, Monsieur Lagrange has, with his usual sagacity, put his finger on the precise difficulty with the hypothesis: it explains everything, but predicts nothing.

I believe this might be the world’s earliest recorded application of Bayesian model comparison.

By the way, a quick note of thanks: when I first came across the full version of this exchange, I struggled to find strong reason to treat it as more than a legend, but the historian (and Bayesian) Richard Carrier has pointed me to sources that strongly boost the odds that this conversation was a real event. Richard also presents arguments from probability theory pertaining to religious matters. See, for example, this video.

Now, to see what Laplace was on about, we should think about model comparison in the terms that I have introduced here and discussed further in the article linked to above. It’s true that Bayesian model comparison would not be formally described until more than a hundred years after Laplace’s death, but as the founder of Bayesian inference and a mathematician of extraordinary genius and natural insight, he must have been capable of perceiving the required logic. (Part of the beauty of Bayesian statistics is that hypothesis testing, parameter estimation, and model comparison are really only slightly different versions of the same problem – this gives it a logical unity and coherence that other approaches can only enviously dream of.)

To approach the problem, let’s imagine a data set of just a few points – let’s say 6 points – which we would like to fit with a polynomial function. The obvious first choice is to try a straight line. Illustrated below are the imagined data and the fitted straight line, which is the maximum likelihood estimate.

Because there is noise in the data, the fitted line misses all the data points, so there are some residuals associated with this fit. Is there a way to reduce the residuals, i.e. to have a model that passes closer to the measured data points? Of course there is: just increase the number of free parameters in the fitting model. In fact, with only six data points, a polynomial with terms up to and including the fifth power is already sufficient to guarantee that the residuals are reduced to exactly zero, as illustrated below, with exactly the same data as before.
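To make this concrete, here is a toy numerical sketch (the six data points are made up, and numpy is assumed to be available): a straight line leaves visible residuals, while a fifth-order polynomial threads every point exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six hypothetical data points: a straight line plus noise.
x = np.arange(6.0)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=6)

# Least-squares fits: 2 free parameters versus 6 free parameters.
line = np.polynomial.Polynomial.fit(x, y, deg=1)
quintic = np.polynomial.Polynomial.fit(x, y, deg=5)

# Worst-case residual for each model.
line_resid = float(np.abs(y - line(x)).max())
quintic_resid = float(np.abs(y - quintic(x)).max())

print(line_resid)     # clearly nonzero: the line misses the noisy points
print(quintic_resid)  # essentially zero: 6 parameters can hit 6 points exactly
```

The quintic “wins” on residuals every single time, no matter what the data are – which is exactly why raw goodness of fit cannot be the whole story.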

Is this sufficient to make the fifth-order polynomial the more likely model? Certainly not. This more complex model has 6 fitting parameters, as opposed to only 2 for the linear fit. As I explained previously, though, each additional degree of freedom adds another dimension to the parameter sample space, which necessarily reduces the amount of prior probability for the parameters in the maximum likelihood region – the available prior probability needs to be spread much more thinly in order to cover the extended sample space. This is the penalty introduced in the form of the Ockham factor. This reduced prior probability, of course, results in a lower posterior probability for the model in most cases.
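A back-of-envelope version of that penalty, with entirely hypothetical widths: if each parameter has a broad prior of width W but the data only tolerate a narrow range of width w around the best fit, then each extra parameter multiplies the model’s evidence by roughly w/W – one thinning-out factor per dimension.

```python
# Hypothetical widths: each parameter's prior spans 100 units, but the data
# pin it down to within about 1 unit of the best-fit value.
prior_width = 100.0
posterior_width = 1.0

def ockham_factor(n_params: int) -> float:
    """Rough evidence penalty: one factor of (w / W) per free parameter."""
    return (posterior_width / prior_width) ** n_params

print(ockham_factor(2))  # straight line: around 1e-4
print(ockham_factor(6))  # quintic: around 1e-12
```

Unless the quintic’s maximum likelihood beats the line’s by something like the ratio of those two penalties, the simpler model comes out with the higher posterior probability.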

Now the hypothesis that Napoleon and Lagrange wanted Laplace to take seriously, the one about that omnipotent deity, is one with infinite degrees of freedom. That’s the definition of omnipotent: there is nothing that God can’t do if it wants to. That means infinitely many dimensions in the parameter sample space, and therefore infinitely low prior probability at all points. To see the probability that God exists vanish to zero, we only need to postulate any alternative model of reality with finite degrees of freedom. If my interpretation of Laplace’s comment is correct, this is the fact that he was able to perceive: that there is simply no amount of evidence that could raise the hypothesis of an omnipotent deity to a level of plausibility competitive with other theories of reality.

And what if we relax the requirement for God to be strictly omnipotent? It makes bugger all difference. Every prayer supposedly answered or not answered, every person saved from tragedy or not saved, every event attributed to God’s will represents another degree of freedom. That’s still a tremendous number of degrees of freedom, and while it may be finite, it’s still many orders of magnitude greater than the number of free parameters that most specialists would claim to be sufficient for a complete theory of the universe.

At this point, we can take note of yet another important scientific principle that can be recognized as just a special case of Bayes’ theorem, this time Karl Popper’s principle of falsifiability. Popper recognized that in order for a hypothesis to be treated as scientific, and worthy of rational investigation, it must be vulnerable to falsification. That means that a theory must be capable of making specific predictions, which, if they fail to arise in a suitable experiment, will identify the theory as false. If a theory is not falsifiable, then any data nature throws our way can be accommodated by it. This means that the theory predicts nothing whatsoever. Popper applied the term ‘pseudoscience’ to theories like this, such as psychoanalysis and astrology.

Now, if a theory is consistent with all conceivable data sets (i.e. unfalsifiable), this means that the associated model curve is capable of traversing all possible paths through the sample space for the data - just like the 5th order polynomial was able to land exactly on all 6 data points, above, regardless of where they were. Assuming that there is no limit to the number of observations we can accrue, this implies that the model has infinite degrees of freedom, which, as we have just discovered, is really bad news: thanks to the penalty introduced by the Ockham factor, this leaves you with a theory with zero credibility.

The fact that we can derive important and well-known principles of common sense and scientific methodology, such as Ockham’s razor and the principle of falsifiability, as consequences of Bayes’ theorem illustrates further what I have said above about the logical unity of this system. This is why I believe that Bayesian inference, along with the broader theory that it fits into, constitutes the most comprehensive and coherent theory of how knowledge is acquired. (I’ll get round to that broader theory some day, but it should be clear already that Bayes’ theorem is a consequence of more general principles.)

Much of my interest in science comes from deriving great pleasure from knowledge. Real respect for knowledge, however, demands an assessment of its quality. It’s not enough to know that experts say that a meteor hitting the Earth killed off the dinosaurs - I want to know how convincingly that explanation stands up beside competing hypotheses. That’s why I’m interested in probability. This is the theory that permits this necessary appraisal of knowledge, the theory of how we know what we know, and how well we know it. Science is the systematic attempt to maximize the quality of our knowledge, and probability is therefore also the underlying theory of science.

Let’s recap the main points, in terms as simple as I can manage. If I try to fit a sequence of data points with a straight line, Ax + B, then there are 2 adjustable model parameters, A and B. So any particular choice of parameter values can be represented by coordinates, (x, y), on a two-dimensional plane (these coordinates label points in parameter space, not the data). If I want a more complicated model, with one more degree of freedom, then the chosen point becomes (x, y, z), in a 3D space. Each free parameter results in an additional dimension for the parameter space. In the 2D case, for example, the prior probability for the point (x, y) is the product of the individual prior probabilities:

P(x | I) × P(y | I)

Since these prior probabilities are all less than one, the more degrees of freedom there are, the smaller the prior probability will be for any particular point, (x, y, z, ….). If there are infinitely many degrees of freedom, then the prior probability associated with any point in the parameter space will be zero.
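A tiny numeric illustration (with a made-up per-parameter prior of 0.1): the joint prior is a product of factors below one, so it shrinks geometrically as the number of parameters grows.

```python
# Hypothetical prior probability for any particular value of one parameter.
p = 0.1

# The joint prior at a point in parameter space is one factor of p per dimension.
for n in (2, 6, 20):
    print(n, p ** n)

# As n grows without bound, p ** n tends to zero: a model with infinitely many
# degrees of freedom is left with zero prior probability at every point.
```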

The posterior probability for a model depends strongly on this prior probability distribution over the parameter space, as shown by equation (4) in my article on the Ockham factor. If the prior probabilities for the points (x, y, z, …) in the parameter space are all zero, then the probability for the model is also zero.

Any unfalsifiable theory must have infinite degrees of freedom in order to be able to remain consistent with all conceivable observations. With limited degrees of freedom, the complexity of the path traced by the model curve will also be limited, and the theory will be vulnerable to falsification – the model curve will not be guaranteed to be able to find a path that travels to each data point. Any unfalsifiable theory, therefore, has zero posterior probability. This includes the hypothesis of an omnipotent deity. Because of its unlimited powers, such an entity is capable of producing any sequence of events it chooses, meaning that we need a model curve with infinitely many free parameters to be guaranteed access to all data points.