If you've read even a small sample of the material I've posted so far, you'll recognize that one of my main points concerns the central importance of Bayes' theorem. You might think that the most basic statement of this importance is something like "Bayes' theorem is the most logical method for all data analysis." For me, though, this falls far short of capturing the most general importance of Bayes' rule.
Bayes' theorem is more than just a method of data analysis, a means of crunching the numbers. It represents the rational basis for every aspect of scientific method. And since science is simply the methodical application of common sense, Bayes' theorem can be seen to be (together with decision theory) a good model for all rational behaviour. Indeed, it may be more appropriate to invert that, and say that your brain is a superbly adapted mechanism, evolved for the purpose of simulating the results of Bayes' theorem.
Because I equate scientific method with all rational behaviour, I am no doubt opening myself up to the accusation of scientism^{1}, but my honest response is: so what? If I am more explicit than some about the necessary and universal validity of science, this is only because reason has led me in this direction. For example, P.Z. Myers, author of the Pharyngula blog (vastly better known than mine, but you probably knew that already), is one of the great contemporary advocates of scientific method, clear-headed and craftsmanlike in the way he constructs his arguments, but in my evidently extreme view even he can fall short, on occasion, of recognizing the full potential and scope of science. In one instance I recall, when the league of nitwits farted in Myers' general direction and he himself stood accused of scientism, he deflected the accusation, claiming it was a mistake. My first thought, though, was "hold on, there's no mistake." Myers wrote:
The charge of scientism is a common one, but it’s not right: show us a different, better path to knowledge and we’ll embrace it.
But how is one to show a better path to knowledge? In principle, it cannot be done. If Mr. X claims that he can predict the future accurately by banging his head with a stone until visions appear, does that suffice as showing? Of course not; a rigorous scientific test is required. Now, if under the best possible tests, X's predictions appear to be perfectly accurate, any further inferences based on them are only rational to the extent that science is capable of furnishing us (formally, or informally) with a robust probability estimate that his statements represent the truth. Sure, we can use X's weird methodology, but we can only do so rationally if we do so scientifically. X's head smashing trick will never be better than science (a sentence I did not anticipate writing).
To put it another way, X may yield true statements, but if we have no confidence in their truth, then they might as well be random. Science is the engine generating that confidence.
So, above I claimed that all scientific activity is ultimately driven by Bayes' theorem. Let's look at it again in all its glory:

P(H | DI) = P(D | HI) P(H | I) / [ P(D | HI) P(H | I) + P(D | H'I) P(H' | I) ]     (1)
(As usual, H is a hypothesis we want to evaluate, D is some data, I is the background information, and H' means "H is not true.")
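Stated as code, equation (1) is only a few lines. Here is a minimal Python sketch, with illustrative numbers of my own choosing:

```python
def posterior(prior_h, like_h, like_not_h):
    """Equation (1): P(H | DI) from the prior P(H | I) and the two
    likelihoods P(D | HI) and P(D | H'I)."""
    numerator = like_h * prior_h
    evidence = numerator + like_not_h * (1.0 - prior_h)
    return numerator / evidence

# With an even prior, data three times more probable under H than
# under H' pushes the posterior up to 0.75:
print(posterior(0.5, 0.75, 0.25))  # 0.75
```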
The goal of science, whether one accepts it or not, is to calculate the term on the left hand side of equation (1). Now, most, if not all, accepted elements of experimental design are actually adapted to manipulate the terms on the right hand side of this equation, in order to enhance the result. I'll illustrate with a few examples.
Firstly, and most obviously, the equation calls for data, D. We have to look at the world in order to learn about it. We must perform experiments to probe nature's secrets. We cannot make inferences about the real world by thought alone. (Some may appear to do this, but no living brain is completely devoid of stored experiences; the best philosophers are simply very efficient at applying Bayes' theorem, usually without knowing it, to produce powerful inferences from mundane and not very well controlled data. This is why philosophy should never be seen as lying outside empirical science.)
Secondly, the equation captures perfectly what we recognize as the rational course of action when evaluating a theory: we have to ask 'what should I expect to see if this theory is true? What are its testable hypotheses?' In other words, what data can I make use of in order to calculate P(D | HI)?
Once we've figured out what kind of data we need, the next question is: how much data? Bayes' rule informs us: we need P(D | HI) to be as high as possible if H is true, and as low as possible if it is false. Let's look at a numerical example:
Suppose I know, on average, how tall some species of flower gets when I grow the plants in my home. Suppose I suspect that picking off the aphids that live on these flowers will make the plants healthier, and cause them to grow taller. My crude hypothesis is that the relative frequency with which these specially treated flowers exceed the average height is more than 50%. My crude data set results from growing N flowers, applying the special treatment to all of them, and recording the number, x, that exceed the known average height.
To check whether P(D | HI) is high when H is true and low when H is false, we'll take the ratio

P(D | HI) / P(D | H'I)     (2)
If H_f says that the frequency with which the flowers exceed their average height is f, then P(D | H_f I) (where D comprises the number of tall flowers, x, and the total number grown, N) is given by the binomial distribution. But our real hypothesis, H, asserts that f is in the range 0.5 < f ≤ 1. This means we're going to have to sum up a whole load of P(D | H_f I) terms. We could do the integral exactly, but to avoid the algebra, let's treat the smoothly varying function like a staircase and split the f-space into 50 parts: f = 0.51, 0.52, ..., 0.99, 1.0. To calculate P(D | H'I), we'll do the same, with f = 0.01, 0.02, ..., 0.50.
What we want, e.g. for P(D | HI), is P(D | [H_0.51 + H_0.52 + ...] I).
Generally, where all hypotheses involved are mutually exclusive, it can be shown (see appendix below) that,

P(D | [H_1 + H_2 + ... + H_n] I) = [ Σ_i P(D | H_i I) P(H_i | I) ] / [ Σ_i P(H_i | I) ]     (3)
But we're starting from ignorance, so we'll take all the priors, P(H_f | I), to be the same. We'll also have the same number of them, 50, in both numerator and denominator, so when we take the desired ratio, all the priors will cancel out (as will the width, Δf = 0.01, of each of the intervals on our grid), and all we need to do is sum up P(D | H_f1 I) + P(D | H_f2 I) + ... for each relevant range. Each term will come straight from the binomial distribution:

P(D | H_f I) = [ N! / (x! (N - x)!) ] f^x (1 - f)^(N - x)     (4)
If we do that for, say, 10 test plants, with seven flowers growing beyond average height, then ratio (2) is 7.4. If we increase the number of trials, keeping the ratio of N to x constant, what will happen?
If we try N = 20, x = 14, then, not too surprisingly, ratio (2) improves. The result is now 22.2, an increase of 14.8. Furthermore, if we try N = 30, x = 21, ratio (2) increases again, but this time more quickly: now the ratio is 58.3, a further increase of 36.1.
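Grid sums like these are easy to script. Here is a Python sketch of ratio (2) on the 50-point grid described above; since the priors and the interval width cancel in the ratio, only the binomial terms of equation (4) need summing:

```python
from math import comb

def likelihood(f, x, N):
    """The binomial term of equation (4): P(D | H_f I)."""
    return comb(N, x) * f**x * (1 - f)**(N - x)

def ratio(x, N):
    """Ratio (2), P(D | HI) / P(D | H'I), on the 50-point grid."""
    top = sum(likelihood(k / 100, x, N) for k in range(51, 101))   # f = 0.51 ... 1.00
    bottom = sum(likelihood(k / 100, x, N) for k in range(1, 51))  # f = 0.01 ... 0.50
    return top / bottom

for N, x in [(10, 7), (20, 14), (30, 21)]:
    print(N, x, round(ratio(x, N), 1))
```

For N = 10, x = 7 this gives a ratio in the region of the 7.4 quoted above (the exact figure depends slightly on how the grid endpoints are handled), and the ratio climbs quickly as N grows with x/N fixed.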
So, to maximize the contrast between the hypotheses under test, H and H', what we should do is take as many measurements as practically possible. Something every scientist knows already, but something nonetheless demanded by Bayes' theorem.
How is our experimental design working out, then? Not that great so far, actually. Presumably the point of the experiment was to decide whether removing the parasites from the flowers provided a mechanism enabling them to grow bigger, but all we have really shown is that they did grow bigger. We can see this by resolving, e.g., H' into a set of mutually exclusive and exhaustive (within some limited model) sub-hypotheses:

H' = A_1 + A_2 + ... + A_n     (5)
where H' is, as before, 'removing aphids did not improve growth,' and some of the A's represent alternative causal agencies capable of effecting a change in growth. For example, A_1 is the possibility that a difference in ambient temperature tended to make the plants grow differently. Let's look again at equation (3). This time, instead of the H_f's, we have all the H'A_i, but the principle is the same. Previously the priors were all the same, but this time we can exploit the fact that they need not be. We need to manipulate those priors so that the P(D | H'I) term in the denominator of Bayes' theorem is always low if the number of tall plants in the experiment is large. We can do this by reducing the priors for some of the A_i corresponding to the alternative causal mechanisms. To achieve this, we'll introduce a radical improvement to our methodology: control.
Instead of relying on past data for plants not treated by having their aphids removed, we'll grow two sets of plants, treated identically in all respects except the one that we are investigating with our study. The temperature will be the same for both groups of plants, so P(A_1 | I) will be zero: there will be no difference in temperature to possibly affect the result. The same will happen to all the other A_i that correspond to additional agencies offering explanations for taller plants (if we have really controlled for all confounding variables).
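The effect of zeroing a prior is easy to see numerically. Below is a small Python sketch of equation (3) applied to the decomposition (5); the likelihoods and priors are invented purely for illustration:

```python
def mixture_likelihood(terms):
    """Equation (3): P(D | H'I) as a prior-weighted average over the
    sub-hypotheses A_i of decomposition (5).
    `terms` is a list of (P(D | H'A_i I), P(A_i | I)) pairs."""
    total_prior = sum(prior for _, prior in terms)
    return sum(lik * prior for lik, prior in terms) / total_prior

# Invented numbers: A_1 (a temperature difference) would explain tall
# plants well; the remaining mechanisms, lumped together, would not.
uncontrolled = [(0.6, 0.2), (0.05, 0.8)]
controlled = [(0.6, 0.0), (0.05, 1.0)]   # temperature equalized: P(A_1 | I) = 0
print(round(mixture_likelihood(uncontrolled), 3))  # 0.16
print(round(mixture_likelihood(controlled), 3))    # 0.05
```

With the temperature confounder controlled away, P(D | H'I) drops, and by equation (1) the posterior for H rises for the same data.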
This process of increasing the degree of control can, of course, undergo numerous improvements. Suppose, for example, that after a number of experiments, I begin to wonder if it's not actually removing the aphids that affects the plants, but simply the rubbing of the leaves with my fingers that I perform in order to squish the little parasites. So, as part of my control procedure, I devise a way to rub the leaves of the plants in the untreated group while carefully avoiding those villainous arthropods. Not a very plausible scenario, I suppose, but if we give a tentative name to this putative phenomenon, we can appreciate how analogous processes might be very important in other fields. For the sake of argument, let's call it a placebo effect.
Next, I begin to worry that I might be subconsciously influencing the outcome of my experiments. Because I'm keen on the hypothesis I'm testing (think of the agricultural benefits such knowledge could offer!), I worry that I am inadvertently biasing my seed selection, so that healthier-looking seeds go into the treatment group more than into the control group. I can fix this, however, by randomly allocating each seed to a group, thereby setting the prior for yet another alternative mechanism to zero. The vital nature of randomization, when available, in collecting good-quality scientific data is something we noted already when looking at Simpson's paradox, and is something that has been well appreciated for at least a hundred years.
Randomization isn't only for alleviating experimenter biases, either. Suppose that my flower pots are filled with soil by somebody else, with no interest in or knowledge of my experimental program. I might be tempted to use every second pot for the control group, but suppose my helper is also filling the pots in pairs, using one hand for each. Suppose also that the pots filled with his left hand inadvertently receive less soil than those filled with his right hand. Unexpected periodicities such as these are also taken care of by proper randomization.
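A randomized allocation takes one line to implement. A Python sketch (the pot indices and the fixed seed are arbitrary choices for illustration):

```python
import random

def randomize(pots, seed=0):
    """Randomly split pot indices into treatment and control groups, so
    that any periodic pattern in how the pots were prepared (e.g. the
    left-hand/right-hand filling above) cannot track the group labels."""
    rng = random.Random(seed)
    shuffled = list(pots)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Unlike taking every second pot, a random split breaks any periodicity:
treatment, control = randomize(range(20))
print(sorted(treatment + control) == list(range(20)))  # True
```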
Making real-world observations, and lots of them; control groups; placebo controls; and randomization: some exceedingly obvious measures, some less so, but all contained in that beautiful little theorem. Add these to our Bayesian formalization of Ockham's razor, and its extension, resulting in an explanation for the principle of falsifiability, and we cannot avoid noticing that science is a thoroughly Bayesian affair.
Appendix
You might like to look again at the 3 basic rules of probability theory, if your memory needs refreshing.
To derive equation (3), above, we can write down Bayes' theorem in a slightly strange way:

P(D | [H_1 + H_2 + ... + H_n] I) = P([H_1 + H_2 + ... + H_n] | DI) P(D | I) / P([H_1 + H_2 + ... + H_n] | I)     (A1)
This might look a bit backward, but thinking about it a little abstractly, before any particular meaning is attached to the symbols, we see that it is perfectly valid. If you're not used to Boolean algebra, or anything similar, let me reassure you that it's perfectly fine for a combination of propositions such as A + B + C (where the + sign means 'or') to be treated as a proposition in its own right. If equation (A1) looks like too much, just replace everything in the square brackets with another symbol, X.
As long as all the various sub-hypotheses, H_i, are mutually exclusive, then when we apply the extended sum rule above and below the line, the cross terms vanish, and (A1) becomes:

P(D | [H_1 + H_2 + ... + H_n] I) = [ Σ_i P(H_i | DI) ] P(D | I) / [ Σ_i P(H_i | I) ]     (A2)
We can multiply out the top line, and also note that for each hypothesis, H_i, we can make two separate applications of the product rule to the expression P(H_i D | I), to show that

P(H_i | DI) P(D | I) = P(D | H_i I) P(H_i | I)     (A3)
(This is actually exactly the technique by which Bayes' theorem itself can be derived.)
Substituting (A3) into (A2), we see that

P(D | [H_1 + H_2 + ... + H_n] I) = [ Σ_i P(D | H_i I) P(H_i | I) ] / [ Σ_i P(H_i | I) ]     (A4)
which is the result we wanted.
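The result is easy to check numerically. The Python sketch below (with made-up priors and likelihoods) computes P(D | XI), where X = H_1 + H_2, once via Bayes' theorem in the form (A1) and once via the sum formula (A4); the two routes agree:

```python
# A complete set of mutually exclusive hypotheses H_1..H_4; X = H_1 + H_2.
priors = [0.1, 0.3, 0.4, 0.2]          # P(H_i | I), made-up, summing to 1
likelihoods = [0.9, 0.5, 0.2, 0.7]     # P(D | H_i I), also made-up
in_x = [True, True, False, False]

p_d = sum(p * l for p, l in zip(priors, likelihoods))   # P(D | I)
p_x = sum(p for p, m in zip(priors, in_x) if m)         # P(X | I)
p_x_given_d = sum(p * l for p, l, m in zip(priors, likelihoods, in_x) if m) / p_d

# Route 1, equation (A1): Bayes' theorem applied to the compound proposition X.
lhs = p_x_given_d * p_d / p_x
# Route 2, equation (A4): the prior-weighted sum of likelihoods.
rhs = sum(p * l for p, l, m in zip(priors, likelihoods, in_x) if m) / p_x

print(round(lhs, 12) == round(rhs, 12))  # True
```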
[1] From Wikipedia:
Scientism is a term used, usually pejoratively, to refer to belief in the universal applicability of the scientific method and approach, and the view that empirical science constitutes the most authoritative worldview or most valuable part of human learning to the exclusion of other viewpoints.
A very nice Bayesian explanation of control and randomization, thanks.
I do have a few issues with your introduction, though.
<< If I am more explicit than some about the necessary and universal validity of science, this is only because reason has led me in this direction. >>
I submit that the arguments justifying Bayesianism are rational, analytic arguments rather than scientific ones. So you are not, actually, supporting the position of the "*universal* applicability of the scientific method and approach". There is room for philosophy, mathematics, and so on, part of which is justifying Bayesian/scientific thinking itself! The point of real scientism, rather, is that these methods do indeed establish science as the way to think *about reality*. Hardly anyone claims real universality.
What I am concerned with recently is, however, precisely that, and one key point against Bayesianism is that it does not capture the scientific method well, in that it is only concerned with which theories we ought to believe, rather than which theories we have evidence for. Dempster-Shafer theory allows one to calculate the Bayesian probabilities, or levels of credence, as needed; but it is primarily concerned with determining the degree of support, which is a slightly different beast. Do you have any arguments for the Bayesian as opposed to the DS formalism?
Another basic feature of Bayesianism that is problematic is that it is based on Boolean logic, and I have long wondered whether fuzzy logic would better serve to capture statements about reality. It surely isn't simply "true" that the earth is a sphere, for example; rather, it is closer to the truth than the position that the earth is flat. The calculus based on fuzzy sets, however, looks quite intimidating and I haven't even begun to consider it.
For a more thorough review of formal options, see the Stanford Encyclopedia of Philosophy on "Formal Representations of Belief",
http://plato.stanford.edu/entries/formalbelief/
I'm quite at a loss at this problem of riches. I confess I have not considered the myriad of options raised there, and am not sure if any are indeed advantageous over Bayesianism and why, as the article essentially maintains.
Yair
Hi Yair
Thanks for commenting.
The axioms that support this system are not arbitrary; they are derived from experience, either directly accumulated since birth (and presumably somewhat earlier) or else inherited from our ancestors, in the form of our naturally selected physiology biasing us toward certain assumptions (I guess that the former is more important). In any case, they must correspond to nature in some way, or else they could not contribute to science/philosophy. Because of this, inferences drawn from these axioms are, in my view, scientific.
You make a distinction between theories we should believe in and theories with supporting evidence. These are the same. Evidence is the only rational reason for believing. But I have never looked at this DS formalism, so I can't comment on that, specifically.
Finally, about fuzzy logic. There is the trivial kind, which is just Boolean algebra using more bits, and then there is the controversial kind, which is the assertion that atomic statements can have varying degrees of truth. This latter, I do not accept. (The statement "my name is Tom and I am 12 feet tall" is partly true, but molecular.) The reason that saying the Earth is spherical appears closer to the truth (not more true) than saying it is flat is that if I take these two fitting models, then P(D | HI) will vastly favour the spherical model; the experimental errors I would need in order to accommodate the flat theory quickly explode as soon as any kind of sophistication enters the measurement.
Thanks for the links,
Tom
Hi Tom,
<< The axioms that support this system are not arbitrary, they are derived from experience... Because of this, inferences drawn from these axioms are, in my view, scientific. >>
I'm sorry, but I don't follow. The axioms underlying rational argumentation, logic, and so on were certainly shaped by biological and cultural evolution. I fail to see, however, why that makes them scientific. Our belief that the sky is a dome above us (what do you mean it isn't? Look at it. Look at it! Of course it is!) was also created by these factors, and was eradicated by scientific thinking. The fact that a thought process has such evolutionary roots is irrelevant to the question of whether it is scientific or not. The general application of Reason, which can be used to justify Bayesian reasoning about the content of reality (perhaps), is not itself Bayesian or scientific. It is proto-scientific, proto-Bayesian. Formally, it utilizes the Boolean logic that Bayesianism assumes, not the full Bayesian calculus built on top of it.
<< You make a distinction between theories we should believe in and theories with supporting evidence. ...But I have never looked at this DS formalism, so I can't comment on that, specifically. >>
Not quite: the distinction is between our degree of belief in a theory and the degree to which we have evidential support for that belief. These are quite different things. If there are only two possible theories, for example, and no further information, we would say that our (Bayesian) degree of belief in each is 0.5, whereas our support for each is 0.0. This approach allows one to distinguish between ignorance and uncertainty.
DS theory is the main contestant to Bayesianism, it seems. I couldn't find a good summary/introduction to it. I can't pretend to understand it yet.
<< the assertion that atomic statements can have varying degrees of truth. This latter, I do not accept. ... The reason that saying the Earth is spherical appears closer to the truth (not more true) than saying it is flat is that if I take these 2 fitting models, then P(D  HI) will vastly favour the spherical model >>
But why would the measurements support the Spherical Earth over the Flat Earth theory? Strictly speaking, they are both False. Clearly, however, and in measurable and definable ways, the Spherical Earth hypothesis more accurately models reality. And yet, it does not model reality perfectly. I suggest that this distinction is generic: better scientific theories will model reality better, but should not be understood as simply True descriptions of it. And this gradation of accuracy and comprehensiveness leads me to suspect that atomic statements do have a measure of truth to them, a measure of the accuracy and completeness of the correspondence between the description and reality.
Thanks for the discussion, and of course the post,
Yair Rezek
OK, when I wrote that these axioms are "derived from experience", that's really my rather terse shorthand for "derived from experience, repeatedly, uncountably put to the test, with always glowing results, and found to form a robust system, leading to the derivation of a coherent and comprehensive system of thought." Before we had Bayes' theorem, that was the best criterion for science we had. In hindsight, we can see that all of this is just informal Bayesianism anyway: come up with a model and compare it against reality.
Yes, evidence and probability are numerically different, but they contain the same information. For example, you will never convince me that model X has stronger evidence supporting it, but that I should rationally believe Y rather than X. By arriving at the value of zero in your example, you seem to measure evidence in dB, in which case my point can be made explicit: E(X) = 10 log(P(X)/[1 - P(X)]).
The reason the spherical model is supported, despite not being true, is that the residuals are vastly smaller with that fitting function. This is precisely what it means when we say it is closer to the truth: it is less wrong.
Asimov has written a nice piece on this (though without invoking formal Bayesian reasoning). The relativity he speaks of is not degrees of truth, but degrees of our grasp of truth (which we call probabilities). The confusion on the part of his correspondent seems to be the mind projection fallacy: "wrong" and "false" are not the same.
<< OK, when I wrote that these axioms are derived from experience, that's really my rather terse shorthand for derived from experience, repeatedly, uncountably put to the test, with always glowing results, and found to form a robust system, leading to the derivation of a coherent and comprehensive system of thought. Before we had Bayes' theorem, that was the best criterion for science we had. In hindsight, we can see that all of this is just informal Bayesianism anyway: come up with a model and compare it against reality. >>
A maxim of empiricism is that all knowledge is rooted in experience. I agree. A priori knowledge is rooted in evolutionary experience, as you said. However, this does not make it scientific, or Bayesian. There is a difference between pruning ways-of-thought that don't work and adjusting their probabilities by Bayes' theorem. There is a difference between forming new hypotheses and manifesting new mutations or gene combinations. It's just not the same thing. And there is a difference between understanding the genesis of our ways of thinking and justifying them non-cyclically.
I still maintain that our basic logical intuitions are what justify scientific principles of thought, and that these intuitions are a priori in the sense that while their genesis can be understood on the basis of an evolutionary trial-and-error process, this does not justify them; rather, this understanding can only be had in light of them.
<< Yes, evidence and probability are numerically different, but they contain the same information. >>
That's not quite true in DS. Let me put it a bit more formally: you can have a DS degree of support S(A) for proposition A. Let's say we have S(A)=S(~A)=0. Then the Plausibility (probability) you'll calculate will be P(A)=P(~A)=0.5. Yet you can also have S(A)=S(~A)=0.2, representing some evidence and arguments for both sides. The probabilities will still be P(A)=P(~A)=0.5, but the epistemic situation is quite different: in the first case we have no information, we are totally in ignorance; in the second case we have some information, it's just inconclusive. This seems the sort of thing that a formalization of science should represent, no?
<< This is precisely what it means when we say it is closer to the truth: it is less wrong. >>
But where is this "closer to the truth" in the Bayesian model? We're only speaking of True or False. Let me, again, be more formal: we both agree that the proposition "The Earth is a sphere"=A_S is False. Under the rules of Bayesianism, given this information P(A_S)=0, as in general P(FALSE)=0. Clearly, this proposition is also "less wrong", yet we have just removed it from any Bayesian calculation.
If we only care about being "less wrong", why are we tracking things as either wholly wrong (FALSE) or wholly true (TRUE)? It just makes more sense to start off with a system of underlying logic that is already talking of things being more or less wrong  in other words, fuzzy logic.
Realize that I'm not married to these ideas. It's just that I have my doubts about the Bayesian approach, and I'm raising them. Maybe I'm mistaken, and maybe Bayesianism is correct; I don't have a firm grasp on the right epistemology yet.
Cheers,
Yair
Your question about relative wrongness is a very important one, touching on two crucial topics. It's something I thought about in the past, and then pretty much forgot about, having satisfied myself. You have reminded me that it probably took quite a while to get the ideas straight, so I have decided to tackle it with a full-blown post (first thing, when I have a little spare time).
In the meantime, consider this: when we talk about a proposition being true or false, we do so only really for the purpose of enabling certain thought experiments. As rationalists, we must leave some room for doubt; to declare a proposition about reality definitively true, we would need to step outside the system, like God, or the Q continuum. We both agree that humans can't do that.
Now ask yourself which theory would require the least persuasion for you to accept: 'flat Earth' or 'spherical (on average) Earth'? Next, do the maths, and you'll see that Bayes' theorem gives the same answer when you do the model comparison.
Thanks for a really good question.