If you've read even a small sample of the material I've posted so far, you'll recognize that one of my main points concerns the central importance of Bayes' theorem. You might think, though, that the most basic statement of this importance is something like "Bayes' theorem is the most logical method for all data analysis." For me, however, this falls far short of capturing the most general importance of Bayes' rule.
Bayes' theorem is more than just a method of data analysis, a means of crunching the numbers. It represents the rational basis for every aspect of scientific method. And since science is simply the methodical application of common sense, Bayes' theorem can be seen to be (together with decision theory) a good model for all rational behaviour. Indeed, it may be more appropriate to invert that, and say that your brain is a superbly adapted mechanism, evolved for the purpose of simulating the results of Bayes' theorem.
Because I equate scientific method with all rational behaviour, I am no doubt opening myself up to the accusation of scientism [1], but my honest response is: so what? If I am more explicit than some about the necessary and universal validity of science, this is only because reason has led me in this direction. For example, P.Z. Myers, author of the Pharyngula blog (vastly better known than mine, but you probably knew that already), is one of the great contemporary advocates of scientific method - clear-headed and craftsmanlike in the way he constructs his arguments - but in my evidently extreme view, even he can fall short, on occasion, of recognizing the full potential and scope of science. In one instance I recall, when the league of nitwits farted in Myers' general direction, and he himself stood accused of scientism, he deflected the accusation, claiming it was a mistake. My first thought, though, was "hold on, there's no mistake." Myers wrote:
The charge of scientism is a common one, but it’s not right: show us a different, better path to knowledge and we’ll embrace it.
But how is one to show a better path to knowledge? In principle, it cannot be done. If Mr. X claims that he can predict the future accurately by banging his head with a stone until visions appear, does that suffice as showing? Of course not: a rigorous scientific test is required. Now, if under the best possible tests, X's predictions appear to be perfectly accurate, any further inferences based on them are only rational to the extent that science is capable of furnishing us (formally, or informally) with a robust probability estimate that his statements represent the truth. Sure, we can use X's weird methodology, but we can only do so rationally if we do so scientifically. X's head smashing trick will never be better than science (a sentence I did not anticipate writing).
To put it another way, X may yield true statements, but if we have no confidence in their truth, then they might as well be random. Science is the engine generating that confidence.
So, above I claimed that all scientific activity is ultimately driven by Bayes' theorem. Let's look at it again in all its glory:
$$P(H \mid D I) = \frac{P(H \mid I)\, P(D \mid H I)}{P(H \mid I)\, P(D \mid H I) + P(H' \mid I)\, P(D \mid H' I)} \tag{1}$$

(As usual, H is a hypothesis we want to evaluate, D is some data, I is the background information, and H' means "H is not true.")
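To make the roles of the terms in equation (1) concrete, here is a minimal Python sketch. The function name `posterior` and the numbers fed into it are illustrative assumptions, not taken from any real experiment.

```python
def posterior(prior_h, like_h, like_not_h):
    """Return P(H | D I) as given by equation (1).

    prior_h    : P(H | I)
    like_h     : P(D | H I)
    like_not_h : P(D | H' I)
    """
    if not 0.0 <= prior_h <= 1.0:
        raise ValueError("prior_h must be a probability")
    prior_not_h = 1.0 - prior_h              # P(H' | I)
    numerator = prior_h * like_h
    evidence = numerator + prior_not_h * like_not_h
    if evidence == 0.0:
        raise ValueError("the data are impossible under both H and H'")
    return numerator / evidence


# Made-up inputs: an even prior, and data four times more probable under H.
print(posterior(prior_h=0.5, like_h=0.8, like_not_h=0.2))
```

With an even prior, the posterior is driven entirely by the contrast between the two likelihoods, which is the quantity the rest of this post tries to maximize.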
The goal of science, whether one accepts it or not, is to calculate the term on the left-hand side of equation (1). Now, most, if not all, accepted elements of experimental design are actually adapted to manipulate the terms on the right-hand side of this equation, in order to enhance the result. I'll illustrate with a few examples.
Firstly, and most obviously, the equation calls for data, D. We have to look at the world in order to learn about it. We must perform experiments to probe nature's secrets. We cannot make inferences about the real world by thought alone. (Some may appear to do this, but no living brain is completely devoid of stored experiences. The best philosophers are simply very efficient at applying Bayes' theorem, usually without knowing it, to produce powerful inferences from mundane and not very well controlled data. This is why philosophy should never be seen as lying outside empirical science.)
Secondly, the equation captures perfectly what we recognize as the rational course of action when evaluating a theory - we have to ask 'what should I expect to see if this theory is true? What are its testable hypotheses?' In other words, what data can I make use of in order to calculate $P(D \mid H I)$?
Once we've figured out what kind of data we need, the next question is how much data? Bayes' rule informs us: we need $P(D \mid H I)$ to be as high as possible if H is true, and as low as possible if H is false. Let's look at a numerical example:
Suppose I know, on average, how tall some species of flower gets when I grow the plants in my home. Suppose I suspect that picking off the aphids that live on these flowers will make the plants more healthy, and cause them to grow taller. My crude hypothesis is that the relative frequency with which these specially treated flowers exceed the average height is more than 50%. My crude data set results from growing N flowers, applying the special treatment to all of them, and recording the number, x, that exceed the known average height.
To check whether $P(D \mid H I)$ is high when H is true and low when H is false, we'll take the ratio

$$\frac{P(D \mid H I)}{P(D \mid H' I)} \tag{2}$$

If $H_f$ says that the frequency with which the flowers exceed their average height is f, then $P(D \mid H_f I)$ (where D is the number of tall flowers, x, and the total number grown, N) is given by the binomial distribution. But our real hypothesis, H, asserts that f is in the range 0.5 < f ≤ 1. This means we're going to have to sum up a whole load of $P(D \mid H_f I)$s. We could do the integral exactly, but to avoid the algebra, let's treat the smoothly varying function like a staircase, and split the f-space into 50 parts: f = 0.51, 0.52, ..., 0.99, 1.0. To calculate $P(D \mid H' I)$, we'll do the same, with f = 0.01, 0.02, ..., 0.50.
What we want, e.g. for $P(D \mid H I)$, is $P(D \mid [H_{0.51} + H_{0.52} + \ldots]\, I)$.
Generally, where all hypotheses involved are mutually exclusive, it can be shown (see appendix below) that

$$P(D \mid [H_1 + H_2 + \ldots]\, I) = \frac{P(H_1 \mid I)\, P(D \mid H_1 I) + P(H_2 \mid I)\, P(D \mid H_2 I) + \ldots}{P(H_1 \mid I) + P(H_2 \mid I) + \ldots} \tag{3}$$
But we're starting from ignorance, so we'll take all the priors, $P(H_f \mid I)$, to be the same. We'll also have the same number of them, 50, in both numerator and denominator, so when we take the desired ratio, all the priors will cancel out (as will the width, Δf = 0.01, of each of the intervals on our grid), and all we need to do is sum up $P(D \mid H_{f_1} I) + P(D \mid H_{f_2} I) + \ldots$ for each relevant range. Each term will come straight from the binomial distribution:

$$P(x \mid N, f) = \frac{N!}{x!\,(N - x)!}\, f^{x} (1 - f)^{N - x} \tag{4}$$
If we do that for, say, 10 test plants, with seven flowers growing beyond average height, then ratio (2) is 7.4. If we increase the number of trials, keeping the ratio of N to x constant, what will happen?
If we try N = 20, x = 14, not too surprisingly, ratio (2) improves. The result now is 22.2, an increase of 14.8. Furthermore, if we try N = 30, x = 21, ratio (2) increases again, but this time more quickly: now the ratio is 58.3, a further increase of 36.1.
So, to maximize the contrast between the hypotheses under test, H and H', what we should do is take as many measurements as practically possible. Something every scientist knows already, but something nonetheless demanded by Bayes' theorem.
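The grid calculation described above is easy to reproduce. The following is a minimal sketch, assuming the same 0.01-spaced grid and equal priors, so that ratio (2) reduces to a ratio of two sums of binomial terms. The function names are illustrative choices.

```python
from math import comb


def binomial_likelihood(x, n, f):
    """P(x | N, f): probability of x tall plants out of n when the true frequency is f."""
    return comb(n, x) * f**x * (1.0 - f) ** (n - x)


def likelihood_ratio(x, n):
    """Ratio (2), P(D | H I) / P(D | H' I), on the staircase grid.

    H  covers f = 0.51, 0.52, ..., 1.00
    H' covers f = 0.01, 0.02, ..., 0.50
    Equal priors and equal grid widths cancel, leaving a ratio of sums.
    """
    under_h = sum(binomial_likelihood(x, n, k / 100) for k in range(51, 101))
    under_not_h = sum(binomial_likelihood(x, n, k / 100) for k in range(1, 51))
    return under_h / under_not_h


# Same proportion of tall plants (70%) at three sample sizes.
for n, x in [(10, 7), (20, 14), (30, 21)]:
    print(f"N = {n}, x = {x}: ratio = {likelihood_ratio(x, n):.1f}")
```

The printed values should land close to the figures quoted above (the last digit may differ slightly, depending on rounding), and they grow faster than linearly with N.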
How is our experimental design working out, then? Well, not that great so far, actually. Presumably the point of the experiment was to decide if removing the parasites from the flowers provided a mechanism enabling them to grow bigger, but all we have really shown is that they did grow bigger. We can show this by resolving e.g. H' into a set of mutually exclusive and exhaustive (within some limited model) sub-hypotheses:

$$H' = H' A_1 + H' A_2 + H' A_3 + \ldots \tag{5}$$
where H' is, as before, 'removing aphids did not improve growth,' and some of the A's represent alternative causal agencies capable of effecting a change in growth. For example, $A_1$ is the possibility that a difference in ambient temperature tended to make the plants grow differently. Let's look again at equation (3). This time instead of the $H_f$'s, we have all the $H' A_i$, but the principle is the same. Previously, the priors were all the same, but this time, we can exploit the fact that they need not be. We need to manipulate those priors so that the $P(D \mid H' I)$ term in the denominator of Bayes' theorem is always low if the number of tall plants in the experiment is large. We can do this by reducing the priors for some of the $A_i$ corresponding to the alternate causal mechanisms. To achieve this, we'll introduce a radical improvement to our methodology: control.
Instead of relying on past data for plants not treated by having their aphids removed, we'll grow two sets of plants, treated identically in all respects, except the one that we are investigating with our study. The temperature will be the same for both groups of plants, so $P(A_1 \mid I)$ will be zero - there will be no difference in temperature to possibly affect the result. The same will happen to all (if we have really controlled for all confounding variables) the other $A_i$ that corresponded to additional agencies offering explanations for taller plants.
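The effect of control on $P(D \mid H' I)$ can be seen by plugging numbers into equation (3). The sketch below is purely illustrative: the three sub-hypotheses, their priors and their likelihoods are all invented for the example, and `grouped_likelihood` is a hypothetical helper name.

```python
def grouped_likelihood(priors, likelihoods):
    """Equation (3): P(D | [H1 + H2 + ...] I) for mutually exclusive hypotheses."""
    total_prior = sum(priors)
    if total_prior <= 0.0:
        raise ValueError("at least one sub-hypothesis needs a non-zero prior")
    weighted = sum(p * l for p, l in zip(priors, likelihoods))
    return weighted / total_prior


# Invented sub-hypotheses of H' (aphid removal does nothing):
#   H'A1: a temperature difference made the plants taller
#   H'A2: some other uncontrolled agency made them taller
#   H'A3: no alternative agency at work, only chance
# P(D | H'Ai I) for D = "many tall plants" (made-up values):
likelihoods = [0.6, 0.6, 0.05]

uncontrolled_priors = [0.1, 0.1, 0.3]   # P(H'Ai | I) without a control group
controlled_priors = [0.0, 0.0, 0.3]     # control sets the A1 and A2 priors to zero

print("P(D | H' I) without control:", grouped_likelihood(uncontrolled_priors, likelihoods))
print("P(D | H' I) with control:   ", grouped_likelihood(controlled_priors, likelihoods))
```

With the alternative agencies ruled out, a large number of tall plants becomes much harder to explain under H', which is exactly what sharpens ratio (2).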
This process of increasing the degree of control can, of course, undergo numerous improvements. Suppose, for example, that after a number of experiments, I begin to wonder if it's not actually removing the aphids that affects the plants, but simply the rubbing of the leaves with my fingers that I perform in order to squish the little parasites. So as part of my control procedure, I devise a way to rub the leaves of the plants in the untreated group, while carefully avoiding those villainous arthropods. Not a very plausible scenario, I suppose, but if we give a tentative name to this putative phenomenon, we can appreciate how analogous processes might be very important in other fields. For the sake of argument, let's call it a placebo effect.
Next I begin to worry that I might be subconsciously influencing the outcome of my experiments. Because I'm keen on the hypothesis I'm testing (think of the agricultural benefits such knowledge could offer!), I worry that I am inadvertently biasing my seed selection, so that healthier looking seeds go into the treatment group, more than into the control group. I can fix this, however, by randomly allocating which group each seed goes into, thereby setting the prior for yet another alternate mechanism to zero. The vital nature of randomization, when available, in collecting good quality scientific data is something we noted already, when looking at Simpson's paradox, and is something that has been well appreciated for at least a hundred years.
Randomization isn't only for alleviating experimenter biases, either. Suppose that my flower pots are filled with soil by somebody else, with no interest in or knowledge of my experimental program. I might be tempted to use every second pot for the control group, but suppose my helper is also filling the pots in pairs, using one hand for each. Suppose also that the pots filled with his left hand receive inadvertently less soil than those filled with his right hand. Unexpected periodicities such as these are also taken care of by proper randomization.
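A short Python sketch shows the difference between "every second pot" and a random allocation. Everything here is hypothetical: the pot numbering, the assumption that even-numbered pots were the left-hand ones, and the helper name `randomize_allocation`.

```python
import random


def randomize_allocation(pot_ids, seed=None):
    """Randomly split the pots into equally sized treatment and control groups."""
    rng = random.Random(seed)      # a seed is used here only to make the example repeatable
    shuffled = list(pot_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


pots = list(range(20))
# Assumed nuisance pattern: even-numbered pots were filled with the left hand.
left_hand = {p for p in pots if p % 2 == 0}

alternating_control = pots[::2]    # "every second pot" goes to the control group
treatment, control = randomize_allocation(pots, seed=1)

print("left-hand pots in alternating control:", len(left_hand & set(alternating_control)))
print("left-hand pots in randomized control: ", len(left_hand & set(control)))
```

Under the alternating scheme, every control pot is a left-hand pot, so soil volume is perfectly confounded with the treatment. Random allocation breaks that link on average, without my needing to know that the periodicity exists.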
Making real-world observations, and lots of them; control groups; placebo controls; and randomization: some exceedingly obvious measures, some less so, but all contained in that beautiful little theorem. Add these to our Bayesian formalization of Ockham's razor, and its extension, resulting in an explanation for the principle of falsifiability, and we cannot avoid noticing that science is a thoroughly Bayesian affair.
Appendix
You might like to look again at the 3 basic rules of probability theory, if your memory needs refreshing.
To derive equation (3), above, we can write down Bayes' theorem in a slightly strange way:

$$P(D \mid [H_1 + H_2 + \ldots]\, I) = \frac{P(D \mid I)\, P([H_1 + H_2 + \ldots] \mid D I)}{P([H_1 + H_2 + \ldots] \mid I)} \tag{A1}$$
This might look a bit backward, but thinking about it a little abstractly, before any particular meaning is attached to the symbols, we see that it is perfectly valid. If you're not used to Boolean algebra, or anything similar, let me reassure you that it's perfectly fine for a combination of propositions, such as A + B + C (where the + sign means 'or'), to be treated as a proposition in its own right. If equation (A1) looks too much, just replace everything in the square brackets with another symbol, X.
As long as all the various sub-hypotheses, $H_i$, are mutually exclusive, then when we apply the extended sum rule above and below the line, the cross terms vanish, and (A1) becomes:

$$P(D \mid [H_1 + H_2 + \ldots]\, I) = \frac{P(D \mid I) \left[ P(H_1 \mid D I) + P(H_2 \mid D I) + \ldots \right]}{P(H_1 \mid I) + P(H_2 \mid I) + \ldots} \tag{A2}$$
We can multiply out the top line, and also make note that for each hypothesis, $H_i$, we can make two separate applications of the product rule to the expression $P(H_i D \mid I)$, to show that

$$P(D \mid I) = \frac{P(H_i \mid I)\, P(D \mid H_i I)}{P(H_i \mid D I)} \tag{A3}$$
(This is actually exactly the technique by which Bayes' theorem itself can be derived.)
Substituting (A3) into (A2), we see that

$$P(D \mid [H_1 + H_2 + \ldots]\, I) = \frac{P(H_1 \mid I)\, P(D \mid H_1 I) + P(H_2 \mid I)\, P(D \mid H_2 I) + \ldots}{P(H_1 \mid I) + P(H_2 \mid I) + \ldots} \tag{A4}$$
which is the result we wanted.
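For anyone who prefers a numerical sanity check to algebra, here is a small sketch comparing the two routes on a toy problem. The priors and likelihoods are invented, the three hypotheses are assumed to be mutually exclusive and exhaustive, and `check_grouping` is a hypothetical helper name.

```python
def check_grouping(priors, likelihoods, group):
    """Compute P(D | [sum of H_i in group] I) two ways: via (A2) and via (A4).

    priors      : P(H_i | I) for an exclusive, exhaustive set of hypotheses
    likelihoods : P(D | H_i I)
    group       : indices of the hypotheses being OR-ed together
    """
    # P(D | I) by summing over the full exhaustive set.
    p_d = sum(p * l for p, l in zip(priors, likelihoods))
    # Posteriors P(H_i | D I) from Bayes' theorem.
    posteriors = [p * l / p_d for p, l in zip(priors, likelihoods)]

    group_prior = sum(priors[i] for i in group)
    via_a2 = p_d * sum(posteriors[i] for i in group) / group_prior
    via_a4 = sum(priors[i] * likelihoods[i] for i in group) / group_prior
    return via_a2, via_a4


# Made-up numbers; the priors sum to one.
a2, a4 = check_grouping(priors=[0.2, 0.3, 0.5],
                        likelihoods=[0.9, 0.4, 0.1],
                        group=[0, 1])
print(a2, a4)
```

The two values should agree up to floating-point rounding, as the derivation requires.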
[1] Scientism is a term used, usually pejoratively, to refer to belief in the universal applicability of the scientific method and approach, and the view that empirical science constitutes the most authoritative worldview or most valuable part of human learning to the exclusion of other viewpoints.