Wednesday, March 28, 2012

Maximum Entropy


In 1948 Claude Shannon forged a link between the thermodynamic concept of entropy and a new formal concept of information. This event marked the beginning of information theory. The discovery captured the imagination of Ed Jaynes, a physicist with a strong interest in statistical mechanics and probability theory. His expertise in statistical mechanics meant that he understood entropy better than many. His recognition of probability theory as an extended form of logic meant that he understood that probability calculations (and therefore all of science) are concerned not directly with truths about reality, as many have supposed, but with information about truths.

The distinction may seem strange – science accepts that there are statements about nature that are objectively either true or false, and definitely not some combination of true and false, so the most desirable goal must be to know which of the options, ‘true’ or ‘false’, is the case. But the truth values of such statements are not accessible to human sensation, and therefore remain hidden also from human science. This is a difficult fact for intelligent animals like us to deal with, but we have learned to do so, partly by inventing a set of procedures called science. Science acknowledges that the truth of a proposition cannot be known with certainty, and so it sets out instead to determine the probability of truth. For this purpose, it combines empirical information and logic.

For Ed Jaynes, therefore, Shannon’s new information theory was instantly recognizable as a breakthrough of massive importance. Jaynes thought about this new tool, meditated on it, digested it, and played with it intensely. One of the outcomes of this meditation was a beautiful idea known as maximum entropy. The title of this blog, then, is a tribute to Edwin Jaynes, to this beautiful idea of his, and to the many more exceptional ideas he produced.

As a physicist, I never received much education in statistics and probability – we know the sum and product rules, we know how to write down the formulae for the Poisson and normal distributions and how to calculate a mean and a standard deviation, and that’s about it, really. Oh, and some typically badly understood model fitting by maximum likelihood (we call it the ‘method of least squares’, which, if you know stats, tells you how limited our understanding is).

During my PhD studies in semiconductor physics, I became very dissatisfied with this situation, as it gradually dawned on me that scientific method and statistical inference must rightly be considered synonymous: they are both the rational procedure for estimating what is likely to be true, given our necessarily limited information. I set out to teach myself as much as I could about statistics. Not surprisingly, my first investigations led me to what is often referred to as orthodox methodology. I laboured with the traditional hypothesis tests – t-tests and so forth – but I found the whole framework very unpalatable: confused, disjointed, self-contradictory – just ugly. Then I stumbled on Bayes’ theorem, and my world view was elevated to a higher plane. Some time after that I discovered Ed Jaynes’ book, ‘Probability Theory: The Logic of Science,’ and my horizon was expanded again, by another order of magnitude. Problems that I had thought approachable only by orthodox methods became recognizable as simple extensions of Bayes’ theorem, and any nagging doubts I had about the validity of the Bayesian program were banished by Jaynes’ clearly formulated logic.

It is not that I am totally against orthodox (sometimes called frequentist) methods. But the success of frequentist techniques is limited to the range of circumstances in which they do a reasonable job of approximating Bayes’ theorem. The range of applications, however, in which the two approaches diverge is unfortunately quite large, while orthodox theory seems to have nothing fundamental to say about when to expect such divergence.

Bayes’ theorem works by taking a prior probability distribution and combining it with some data to produce an updated distribution, known as the posterior probability. After the next set of data comes in, the posterior probability is treated as the new prior, and another update is performed. The process goes on as long as we wish, with the posterior probability distributions presumably narrowing ever more closely around a particular hypothesis.
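
To make the updating cycle concrete, here is a minimal sketch in Python (a hypothetical example of my own, not anything taken from Jaynes): estimating the bias of a coin over a discrete grid of candidate values, with each posterior serving as the prior for the next observation.

```python
import numpy as np

# Hypothetical example: sequential Bayesian updating of belief about a coin's
# bias, over a discrete grid of candidate values for P(heads).
biases = np.linspace(0.01, 0.99, 99)          # hypotheses for P(heads)
prior = np.ones_like(biases) / biases.size    # start from a uniform prior

def update(belief, heads_observed):
    """One application of Bayes' theorem: posterior proportional to likelihood times prior."""
    likelihood = biases if heads_observed else 1.0 - biases
    posterior = likelihood * belief
    return posterior / posterior.sum()        # normalize so probabilities sum to 1

# Each new observation treats the previous posterior as the new prior.
flips = [1, 1, 0, 1, 1, 1, 0, 1]              # 1 = heads, 0 = tails
belief = prior
for flip in flips:
    belief = update(belief, flip)

print("most probable bias:", biases[np.argmax(belief)])
```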

One of the problems we might anticipate with this procedure, however, is: where does the process start? What do we use as our original prior? The principle of indifference works in many cases. Indifference works like this: if I am told that a six-sided die is to be thrown, with no additional information about the die or the method of throwing, then symmetry considerations require that the probability for any of the sides to end up facing upwards is 1/6. For some more complex situations, however, indifference fails. One of the things that the principle of maximum entropy achieves is to provide a technique for assigning priors in a huge range of new problems that are beyond the reach of the principle of indifference.
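
One way to see why maximum entropy generalizes indifference (a small numerical illustration of my own, not part of the original argument): when nothing beyond the six possibilities is known, the uniform assignment demanded by indifference is also the assignment with the greatest Shannon entropy.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in nats: H = -sum(p_i * log(p_i))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                             # convention: 0 * log(0) = 0
    return -(p * np.log(p)).sum()

uniform = np.full(6, 1 / 6)                  # the indifference assignment
skewed = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]      # any other assignment

print(shannon_entropy(uniform))   # ~1.792 nats (= log 6), the maximum for six outcomes
print(shannon_entropy(skewed))    # strictly smaller
```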

As Shannon discovered, information can be considered the flip side of entropy, a thermodynamic idea representing disorder – the more information, the less entropy. Why then should science be interested in maximizing entropy? What we are looking for is the probability distribution that incorporates whatever information we have, without inadvertently incorporating any assumed information that we do not have. We need the probability distribution with the maximum entropy possible, given the constraints set by our available information. Maximum entropy, therefore, is a tool for specifying exactly how much information we possess on a given matter, which is evidently one of the highest possible goals of honest, rational science. This is why I feel that ‘maximum entropy’ is an appropriate title for this blog about scientific method.
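
As a sketch of how a constraint shapes the answer, here is the well-known ‘Brandeis dice’ exercise associated with Jaynes: among all distributions over the six faces whose mean is 4.5 (rather than the unconstrained 3.5), find the one with maximum entropy. The exponential form of the solution is standard; the specific numbers and the scipy-based root-finding below are just my own convenient illustration.

```python
import numpy as np
from scipy.optimize import brentq

# Brandeis dice sketch: maximum entropy distribution over faces 1..6,
# subject to the single constraint that the mean equals 4.5.
# The maxent solution has the exponential form p_i = exp(lam * i) / Z.
faces = np.arange(1, 7)
target_mean = 4.5

def mean_error(lam):
    """Difference between the mean implied by lam and the constrained mean."""
    w = np.exp(lam * faces)
    p = w / w.sum()
    return p @ faces - target_mean

# Solve for the Lagrange multiplier that reproduces the constraint.
lam = brentq(mean_error, -10.0, 10.0)
w = np.exp(lam * faces)
p = w / w.sum()

print("maxent probabilities:", np.round(p, 4))    # weights tilt toward the higher faces
print("entropy (nats):", -(p * np.log(p)).sum())
```

With no constraint beyond normalization, the same procedure returns the uniform 1/6 assignment, recovering the principle of indifference as a special case.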



Substantial portions of Probability Theory: The Logic of Science, by E. T. Jaynes, can be viewed here.

More resources relating to Ed Jaynes and his writings can be found at http://bayes.wustl.edu/

2 comments:

  1. The scientific method is normally defined as being a social phenomenon - due to the so-called "confirmation" phase. Inductive inference is not necessarily social. So: these things can't really be synonyms, unless you define "the scientific method" in a rather unusual manner.

    1. Hi Tim

      I'm not really sure what your point is, but I define scientific method to be the systematic evaluation of empirical evidence in order to distinguish what is probably true from what is not. I find that perfectly reasonable.

      There are social aspects to science, in that, for example, one person's ideas can inspire the experiments or theoretical developments of another. Peer review and replication are also social, but they are rooted in the need to reduce the probability for systematic error and bias, and are therefore part of the process of inductive inference.

      Science depends, however, on many processes that are not scientific. For example, the research that gets done depends heavily on who gets funding. This does not imply that the funding mechanisms are part of scientific method, or are necessarily logical.

      The ease with which an idea is accepted as part of the scientific consensus depends on peer review, replication, etc., as well as several processes that are not scientific. These latter provide some measure of the extent to which scientists betray their own methodology.
