Maximum Entropy: Total Bayesianism

Monday, October 8, 2012

Total Bayesianism

If you've read even a small sample of the material I've posted so far, you'll recognize that one of my main points concerns the central importance of Bayes' theorem. You might think, though, that the most basic statement of this importance is something like "Bayes' theorem is the most logical method for all data analysis." This, for me though, falls far short of capturing the most general importance of Bayes' rule.

Bayes' theorem is more than just a method of data analysis, a means of crunching the numbers. It represents the rational basis for every aspect of scientific method. And since science is simply the methodical application of common sense, Bayes' theorem can be seen to be (together with decision theory) a good model for all rational behaviour. Indeed, it may be more appropriate to invert that, and say that your brain is a superbly adapted mechanism, evolved for the purpose of simulating the results of Bayes' theorem.

Because I equate scientific method with all rational behaviour, I am no doubt opening myself up to the accusation of scientism¹, but my honest response is: so what? If I am more explicit than some about the necessary and universal validity of science, this is only because reason has led me in this direction. For example, P.Z. Myers, author of the Pharyngula blog (vastly more well known than mine, but you probably knew that already), is one of the great contemporary advocates of scientific method - clear headed and craftsmanlike in the way he constructs his arguments - but in my evidently extreme view, even he can fall short, on occasion, of recognizing the full potential and scope of science. In one instance I recall, when the league of nitwits farted in Myers' general direction, and he himself stood accused of scientism, he deflected the accusation, claiming it was a mistake. My first thought, though, is "hold on, there's no mistake." Myers wrote:

The charge of scientism is a common one, but it’s not right: show us a different, better path to knowledge and we’ll embrace it.

But how is one to show a better path to knowledge? In principle, it can not be done. If Mr. X claims that he can predict the future accurately by banging his head with a stone until visions appear, does that suffice as showing? Of course not, a rigorous scientific test is required. Now, if under the best possible tests, X's predictions appear to be perfectly accurate, any further inferences based on them are only rational to the extent that science is capable of furnishing us (formally, or informally) with a robust probability estimate that his statements represent the truth. Sure, we can use X's weird methodology, but we can only do so rationally, if we do so scientifically. X's head smashing trick will never be better than science (a sentence I did not anticipate writing).

To put it another way, X may yield true statements, but if we have no confidence in their truth, then they might as well be random. Science is the engine generating that confidence.

So, above I claimed that all scientific activity is ultimately driven by Bayes' theorem. Let's look at it again in all its glory:

$$P(H \mid D I) = \frac{P(H \mid I)\, P(D \mid H I)}{P(H \mid I)\, P(D \mid H I) + P(H' \mid I)\, P(D \mid H' I)} \tag{1}$$

(As usual, H is a hypothesis we want to evaluate, D is some data, I is the background information, and H' means "H is not true.")
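For readers who prefer code to symbols, here is a minimal sketch of equation (1) in Python. The function name and the input numbers are illustrative assumptions, not anything taken from a real experiment; the only thing assumed about the priors is that P(H' | I) = 1 - P(H | I).

```python
def posterior(prior_h, like_h, like_not_h):
    """Equation (1): P(H | D I) for a hypothesis H and its negation H'.

    prior_h    -- P(H | I); P(H' | I) is taken to be 1 - prior_h
    like_h     -- P(D | H I)
    like_not_h -- P(D | H' I)
    """
    numerator = prior_h * like_h
    denominator = numerator + (1.0 - prior_h) * like_not_h
    return numerator / denominator


# Made-up inputs: an even prior, and data four times more probable under H.
example = posterior(prior_h=0.5, like_h=0.8, like_not_h=0.2)
print(example)
```

With an even prior, the posterior depends only on how strongly the two likelihoods differ, which is the quantity the experimental design tricks discussed below are meant to influence.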

The goal of science, whether one accepts it or not, is to calculate the term on the left hand side of equation (1). Now, most, if not all, accepted elements of experimental design are actually adapted to manipulate the terms on the right hand side of this equation, in order to enhance the result. I'll illustrate with a few examples.

Firstly, and most obviously, the equation calls for data, D. We have to look at the world, in order to learn about it. We must perform experiments to probe nature's secrets. We can not make inferences about the real world by thought alone. (Some may appear to do this, but no living brain is completely devoid of stored experiences - the best philosophers are simply very efficient at applying Bayes' theorem (usually without knowing it) to produce powerful inferences from mundane and not very well controlled data. This is why philosophy should never be seen as lying outside empirical science.)

Secondly, the equation captures perfectly what we recognize as the rational course of action when evaluating a theory - we have to ask 'what should I expect to see if this theory is true? - what are its testable hypotheses?' In other words, what data can I make use of in order to calculate P(D | HI)?

Once we've figured out what kind of data we need, the next question is how much data? Bayes' rule informs us: we need P(D | HI) to be as high as possible if H is true, and as low as possible if H is false. Let's look at a numerical example:

Suppose I know, on average, how tall some species of flower gets, when I grow the plants in my home. Suppose I suspect that picking off the aphids that live on these flowers will make the plants more healthy, and cause them to grow higher. My crude hypothesis is that the relative frequency with which these specially treated flowers exceed the average height is more than 50 %. My crude data set results from growing N flowers, applying the special treatment to all of them, and recording the number, x, that exceed the known average height.

To check whether P(D | HI) is high when H is true and low when H is false, we'll take the ratio

$$\frac{P(D \mid H I)}{P(D \mid H' I)} \tag{2}$$

If H_f says that the frequency with which the flowers exceed their average height is f, then P(D | H_fI) (where D is the number of tall flowers, x, and the total number grown, N) is given by the binomial distribution. But our real hypothesis, H, asserts that f is in the range 0.5 < f ≤ 1. This means we're going to have to sum up a whole load of P(D | H_fI)s. We could do the integral exactly, but to avoid the algebra, let's treat the smoothly varying function like a staircase, and split the f-space into 50 parts: f = 0.51, 0.52, ..., 0.99, 1.0. To calculate P(D | H'I), we'll do the same, with f = 0.01, 0.02, ..., 0.50.

What we want, e.g. for P(D | HI), is P(D | [H_0.51 + H_0.52 + ...] I).

Generally, where all hypotheses involved are mutually exclusive, it can be shown (see appendix below) that,

$$P(D \mid [H_1 + H_2 + \dots]\, I) = \frac{P(H_1 \mid I)\, P(D \mid H_1 I) + P(H_2 \mid I)\, P(D \mid H_2 I) + \dots}{P(H_1 \mid I) + P(H_2 \mid I) + \dots} \tag{3}$$

But we're starting from ignorance, so we'll take all the priors, P(H_f| I), to be the same. We'll also have the same number of them, 50, in both numerator and denominator, so when we take the desired ratio, all the priors will cancel out (as will the width, Δf = 0.01, of each of the intervals on our grid), and all we need to do is sum up P(D | H_f1I) + P(D | H_f2I) + ....., for each relevant range. Each term will come straight from the binomial distribution:

$$P(x \mid N, f) = \frac{N!}{x!\,(N - x)!}\, f^{x}\, (1 - f)^{N - x} \tag{4}$$

If we do that for say 10 test plants, with seven flowers growing beyond average height, then ratio (2) is 7.4. If we increase the number of trials, keeping the ratio of N to x constant, what will happen?

If we try N = 20, x = 14, not too surprisingly, ratio (2) improves. The result now is 22.2, an increase of 14.8. Furthermore, if we try N = 30, x = 21, ratio (2) increases again, but this time more quickly: now the ratio is 58.3, a further increase of 36.1.
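The staircase calculation is easy to reproduce. The sketch below is one way to do it in Python, assuming the grid described above (f = 0.51 to 1.00 for H, f = 0.01 to 0.50 for H') and equal priors, so that only the binomial terms need summing. The function names are my own choices for illustration.

```python
from math import comb


def binomial(x, n, f):
    """Binomial probability of x tall plants out of n, given frequency f."""
    return comb(n, x) * f**x * (1.0 - f) ** (n - x)


def likelihood_ratio(x, n):
    """Ratio (2) on the staircase grid, with all priors equal."""
    f_h = [k / 100 for k in range(51, 101)]    # 0.51, 0.52, ..., 1.00
    f_not_h = [k / 100 for k in range(1, 51)]  # 0.01, 0.02, ..., 0.50
    top = sum(binomial(x, n, f) for f in f_h)
    bottom = sum(binomial(x, n, f) for f in f_not_h)
    return top / bottom


# The three cases discussed in the text: x/N is held at 0.7 while N grows.
for n, x in [(10, 7), (20, 14), (30, 21)]:
    print(n, x, round(likelihood_ratio(x, n), 1))
```

The printed ratios should land close to the figures quoted above (small differences in the last digit are possible, depending on rounding and exactly how the grid edges are handled), and they grow faster than linearly with N.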

So, to maximize the contrast between the hypotheses under test, H and H', what we should do is take as many measurements as practically possible. Something every scientist knows already, but something nonetheless demanded by Bayes' theorem.

How is our experimental design working out, then? Well, not that great so far, actually. Presumably the point of the experiment was to decide if removing the parasites from the flowers provided a mechanism enabling them to grow bigger, but all we have really shown is that they did grow bigger. We can show this by resolving e.g. H' into a set of mutually exclusive and exhaustive (within some limited model) sub-hypotheses:

H' = H'A₁ + H'A₂ + H'A₃ + ......

(5)

where H' is, as before, 'removing aphids did not improve growth,' and some of the A's represent alternative causal agencies capable of effecting a change in growth. For example, A₁ is the possibility that a difference in ambient temperature tended to make the plants grow differently. Let's look again at equation (3). This time instead of H_f's, we have all the H'A_i, but the principle is the same. Previously, the priors were all the same, but this time, we can exploit the fact that they need not be. We need to manipulate those priors so that the P(D | H'I) term in the denominator of Bayes' theorem is always low if the number of tall plants in the experiment is large. We can do this by reducing the priors for some of the A_i corresponding to the alternate causal mechanisms. To achieve this, we'll introduce a radical improvement to our methodology: control.

Instead of relying on past data for plants not treated by having their aphids removed, we'll grow 2 sets of plants, treated identically in all respects, except the one that we are investigating with our study. The temperature will be the same for both groups of plants, so P(A₁ | I) will be zero - there will be no difference in temperature to possibly affect the result. The same will happen to all (if we have really controlled for all confounding variables) the other A_i that corresponded to additional agencies offering explanations for taller plants.
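To see how control acts through the priors, here is a small sketch of equation (3) applied to the sub-hypotheses of H'. Every number in it is invented purely for illustration: three hypothetical sub-hypotheses (no other agency at work, a temperature difference, and one further confounder), each with a made-up probability of producing lots of tall plants.

```python
def likelihood_of_disjunction(priors, likelihoods):
    """Equation (3): P(D | [H1 + H2 + ...] I) for mutually exclusive H_i."""
    weighted = sum(p * l for p, l in zip(priors, likelihoods))
    return weighted / sum(priors)


# Hypothetical P(D | H'A_i I) for D = "many tall plants":
# [no other agency, temperature difference, some other confounder]
likelihoods = [0.05, 0.60, 0.50]

# Without a control group, the confounders keep non-zero priors.
uncontrolled = likelihood_of_disjunction([0.6, 0.2, 0.2], likelihoods)

# With a control group, the priors for the confounders are driven to zero.
controlled = likelihood_of_disjunction([0.6, 0.0, 0.0], likelihoods)

print(uncontrolled, controlled)
```

With these invented inputs, P(D | H'I) falls from about 0.25 to about 0.05 once the confounders are controlled, so a crop of tall plants counts much more strongly against H'.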

This process of increasing the degree of control can, of course, undergo numerous improvements. Suppose, for example, that after a number of experiments, I begin to wonder if it's not actually removing the aphids that affects the plants, but simply the rubbing of the leaves with my fingers that I perform in order to squish the little parasites. So as part of my control procedure, I devise a way to rub the leaves of the plants in the untreated group, while carefully avoiding those villainous arthropods. Not a very plausible scenario, I suppose, but if we give a tentative name to this putative phenomenon, we can appreciate how analogous processes might be very important in other fields. For the sake of argument, let's call it a placebo effect.

Next I begin to worry that I might be subconsciously influencing the outcome of my experiments. Because I'm keen on the hypothesis I'm testing, (think of the agricultural benefits such knowledge could offer!) I worry that I am inadvertently biasing my seed selection, so that healthier looking seeds go into the treatment group, more than into the control group. I can fix this, however, by randomly allocating which group each seed goes into, thereby setting the prior for yet another alternate mechanism to zero. The vital nature of randomization, when available, in collecting good quality scientific data is something we noted already, when looking at Simpson's paradox, and is something that has been well appreciated for at least a hundred years.

Randomization isn't only for alleviating experimenter biases, either. Suppose that my flower pots are filled with soil by somebody else, with no interest in or knowledge of my experimental program. I might be tempted to use every second pot for the control group, but suppose my helper is also filling the pots in pairs, using one hand for each. Suppose also that the pots filled with his left hand receive inadvertently less soil than those filled with his right hand. Unexpected periodicities such as these are also taken care of by proper randomization.
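A rough sketch of the difference between the two allocation schemes, using Python's standard `random` module. The pot count, the seed, and the convention that even-numbered pots were filled by the left hand are all assumptions made up for the illustration.

```python
import random


def allocate(n_pots, rng):
    """Randomly split pot indices into treatment and control groups."""
    pots = list(range(n_pots))
    rng.shuffle(pots)
    half = n_pots // 2
    return sorted(pots[:half]), sorted(pots[half:])


n_pots = 20
rng = random.Random(2012)  # fixed seed only so the sketch is repeatable
treatment, control = allocate(n_pots, rng)

# "Every second pot" allocation: the control group is exactly the even pots,
# i.e. every pot filled by the (hypothetical) left hand.
alternating_control = list(range(0, n_pots, 2))

# Under random allocation, left-hand pots are typically spread across both groups.
left_in_control = sum(1 for p in control if p % 2 == 0)
print(control, left_in_control, len(alternating_control))
```

The alternating scheme is guaranteed to line up with the hidden periodicity, while random allocation has no systematic way to track it, which is what pushes the prior for that alternate mechanism towards zero.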

Making real-world observations, and lots of them; control groups; placebo controls; and randomization: some exceedingly obvious measures, some less so, but all contained in that beautiful little theorem. Add these to our Bayesian formalization of Ockham's razor, and its extension, resulting in an explanation for the principle of falsifiability, and we can not avoid noticing that science is a thoroughly Bayesian affair.

Appendix

You might like to look again at the 3 basic rules of probability theory, if your memory needs refreshing.

To derive equation (3), above, we can write down Bayes' theorem in a slightly strange way:

$$P(D \mid [H_1 + H_2 + \dots]\, I) = \frac{P(D \mid I) \times P([H_1 + H_2 + \dots] \mid D I)}{P([H_1 + H_2 + \dots] \mid I)} \tag{A1}$$

This might look a bit backward, but thinking about it a little abstractly, before any particular meaning is attached to the symbols, we see that it is perfectly valid. If you're not used to Boolean algebra, or anything similar, let me reassure you that it's perfectly fine for a combination of propositions, such as A + B + C (where the + sign means 'or'), to be treated as a proposition in its own right. If equation (A1) looks too much, just replace everything in the square brackets with another symbol, X.

As long as all the various sub-hypotheses, H_i, are mutually exclusive, then when we apply the extended sum rule above and below the line, the cross terms vanish, and (A1) becomes:

$$P(D \mid [H_1 + H_2 + \dots]\, I) = \frac{P(D \mid I) \times \left[ P(H_1 \mid D I) + P(H_2 \mid D I) + \dots \right]}{P(H_1 \mid I) + P(H_2 \mid I) + \dots} \tag{A2}$$

We can multiply out the top line, and also make note that for each hypothesis, H_i, we can make two separate applications of the product rule to the expression P(H_iD | I), to show that

$$P(D \mid I) = \frac{P(H_i \mid I)\, P(D \mid H_i I)}{P(H_i \mid D I)} \tag{A3}$$

(This is actually exactly the technique by which Bayes' theorem itself can be derived.)

Substituting (A3) into (A2), we see that

$$P(D \mid [H_1 + H_2 + \dots]\, I) = \frac{P(H_1 \mid I)\, P(D \mid H_1 I) + P(H_2 \mid I)\, P(D \mid H_2 I) + \dots}{P(H_1 \mid I) + P(H_2 \mid I) + \dots} \tag{A4}$$

which is the result we wanted.
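If the algebra feels slippery, the step from (A2) to (A4) can be checked numerically. The sketch below uses three made-up hypotheses, assumed to be mutually exclusive and exhaustive (so that P(D | I) can be obtained by summing over all of them), and evaluates both forms for the disjunction H₁ + H₂. All the numbers are invented.

```python
# Hypothetical priors P(H_i | I) and likelihoods P(D | H_i I)
priors = [0.5, 0.3, 0.2]
likelihoods = [0.9, 0.4, 0.1]

# P(D | I), summing over all hypotheses (this relies on them being exhaustive)
p_d = sum(p * l for p, l in zip(priors, likelihoods))

# Posteriors P(H_i | D I), from equation (A3) rearranged
posteriors = [p * l / p_d for p, l in zip(priors, likelihoods)]

subset = [0, 1]  # indices of the disjunction H1 + H2
prior_sum = sum(priors[i] for i in subset)

# Form (A2): P(D | I) times the summed posteriors, over the summed priors
a2 = p_d * sum(posteriors[i] for i in subset) / prior_sum

# Form (A4): prior-weighted likelihoods, over the summed priors
a4 = sum(priors[i] * likelihoods[i] for i in subset) / prior_sum

print(a2, a4)
```

Both expressions should agree to within floating-point error, here at roughly 0.7125.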

[1] From Wikipedia:

Scientism is a term used, usually pejoratively, to refer to belief in the universal applicability of the scientific method and approach, and the view that empirical science constitutes the most authoritative worldview or most valuable part of human learning to the exclusion of other viewpoints.

6 comments:

יאיר רזק, October 9, 2012 at 11:19 AM
A very nice Bayesian explanation of control and randomization, thanks.

I do have a few issues with your introduction, though.

<< If I am more explicit than some about the necessary and universal validity of science, this is only because reason has led me in this direction. >>

I submit that the arguments justifying Bayesianism are rational, analytic arguments rather than scientific ones. So you are not, actually, supporting the position the "*universal* applicability of the scientific method and approach". There is room for philosophy, mathematics, and so on - part of which is justifying Bayesian/scientific thinking itself! The point of real scientism, rather, is that these methods do indeed establish science as the way to think *about reality*. Hardly anyone claims real universality.

What I am concerned with recently is, however, precisely that - and one key point against Bayesianism is that it does not capture well the scientific method in that it is only concerned about which theories we ought to believe, rather than which theories we have evidence for. Dempster Shafer theory allows one to calculate the Bayesian probabilities, or levels of credence, as needed; but it is primarily concerned with determining the degree of support, which is a slightly different beast. Do you have any arguments for the Bayesian as opposed to the DS formalism?

Another basic feature of Bayesianism that is problematic is that it is based on Boolean logic, and I have long wondered whether fuzzy logic would better serve to capture statements about reality. It surely isn't simply "true" that the earth is a sphere, for example - rather, it is closer to the truth than the position that the earth is flat. The calculus based on fuzzy sets, however, looks quite intimidating and I haven't even begun to consider it.

For a more thorough review of formal options, see the Stanford Encyclopedia of Philosophy on "Formal Representations of Belief",

http://plato.stanford.edu/entries/formal-belief/

I'm quite at a loss at this problem of riches. I confess I have not considered the myriad of options raised there, and am not sure if any are indeed advantageous over Bayesianism and why, as the article essentially maintains.

Yair
יאיר רזק, October 12, 2012 at 3:41 PM
<< OK, when I wrote that these axioms are derived from experience, thats really my rather terse shorthand for derived from experience, repeatedly, uncountably put to the test, with always glowing results, and found to form a robust system, leading to the derivation of a coherent and comprehensive system of thought. Before we had Bayes' theorem, that was the best criterion for science we had. In hindsight, we can see that all of this is just informal Bayesianism anyway - come up with a model and compare it against reality. >>

A maxim of empiricism is that all knowledge is rooted in experience. I agree. A priori knowledge is rooted in evolutionary experience, as you said. However, this does not make it scientific, or Bayesian. There is a difference between pruning ways-of-thought that don't work and adjusting their probabilities by Bayes' theorem. There is a difference between forming new hypotheses and manifesting new mutations or gene combinations. It's just not the same thing. And there is a difference between understanding the genesis of our ways of thinking and justifying them non-cyclically.

I still maintain that our basic logical intuitions are what justifies scientific principles of thought, and that these intuitions are a priori in that while their genesis can be understood on the basis of an evolutionary trial-and-error process this does not justify them, but rather this understanding can only be had in light of them.

<< Yes, evidence and probability are numerically different, but they contain the same information. >>

That's not quite true in DS. Let me put it a bit more formally: you can have a DS degree of support S(A) for proposition A. Let's say we have S(A)=S(~A)=0. Then, the Plausibility (probability) you'll calculate will be P(A)=P(~A)=0.5. Yet, you can also have S(A)=S(~A)=0.2, representing some evidence and arguments for both sides. The probabilities will still be P(A)=P(~A)=0.5, but the epistemic situation is quite different: in the first case, we have no information, we are totally in ignorance; in the second case, we have some information, it's just inconclusive. This seems the sort of thing that a formalization of science should represent, no?

<< This is precisely what it means when we say it is closer to the truth: it is less wrong. >>

But where is this "closer to the truth" in the Bayesian model? We're only speaking of True or False. Let me, again, be more formal - we both agree that the proposition "The Earth is a sphere"=A_S is False. Under the rules for Bayesianism, given this information P(A_S)=0, as in general P(FALSE)=0. Clearly, this proposition is also "less wrong" - yet we have just removed it from any Bayesian calculation.

If we only care about being "less wrong", why are we tracking things as either wholly wrong (FALSE) or wholly true (TRUE)? It just makes more sense to start off with a system of underlying logic that is already talking of things being more or less wrong - in other words, fuzzy logic.

Realize that I'm not married to these ideas. It's just that I have my doubts about the Bayesian approach, and I'm raising them. Maybe I'm mistaken, and maybe Bayesianism is correct - I don't have a firm grasp on the right epistemology yet.

Cheers,

Yair