Maximum Entropy

Saturday, March 9, 2013

Scientific Morality

Elsewhere, I have argued that all questions relating to facts about the real world can be addressed by scientific method, and that if we are truly interested in approaching the correct answers to such questions, then we are foolish to rely on any other method. One important class of questions most people, including, it seems, many scientists, think can’t be answered scientifically concerns how we should determine our moral goals. It seems to me that only a modest amount of reflection is required to dispel this pervasive myth.

The purpose of science is to provide ever more precise plausibility estimates for propositions concerning the state of reality. Rightness and wrongness, though, are not among the objective properties of matter. To attribute ideas such as good and evil to external nature is to commit the mind projection fallacy – to assume that properties of our internal model of reality correspond to properties of reality itself.

Consider that classic image of natural selection in action, a lion chasing a zebra, with intention to devour it. There is no sense in which we can apply concepts of right and wrong, or good and evil to this situation. It is not evil that the zebra suffers agony at the teeth and claws of the lion, any more than it is evil if the lion suffers misery and protracted death from starvation. What we have here is simply chemistry unfolding in a mildly interesting manner – genes either preparing to replicate or losing the opportunity to do so. Furthermore, to assert that humanity is something more than this mildly interesting chemistry is certainly not founded on any rational procedure.

That we experience pain, misery, and torment are mind-bending, heart-wrenching facts about life, for us. But the universe does not care. There is no sense in which these values can be attributed as properties of the universe. To the universe, we are just yet more chemistry, a collection of slightly unusual aggregates of matter that dance about on the surface of an otherwise insignificant chunk of rock, one of an estimated 10²⁰ planets in the known universe.

It is not that natural selection, or the universe, wants us to maximize our happiness. It is simply that a genome capable of generating an algorithm for producing the sensation of such a value apparently gains an additional advantage in the struggle to preserve its information. In fact, it is possible that our genes would prefer us to be unhappy – it's that ‘Oh shit, I’d better not do that again,’ effect that seems to give the mechanism its selective advantage. If you are really happy, then you will stop striving to do better, and your cousin, who has the ‘try harder’ mutation, will walk all over you.

This might seem an odd line of reasoning, given the agenda I have defined in the opening paragraph, but we have already established enough to demonstrate undeniably that morality concerns matters of fact with objective truth, and that there is a clear scientific route to take in pursuit of those truths. In fact, it was formulating an argument along the above lines, a couple of years ago, while trying to refute the conclusions of this infamous TED talk by Sam Harris, that I came to my current understanding of the topic. I had formerly assumed the standard position, that science can not specify moral goals. (Harris's ideas were shortly afterward expanded to book length, in 'The moral landscape.')

The above paragraphs can be condensed into one very simple statement, one of only two simple principles (the other being a trivial tautology) needed to establish the validity of a potential science-based morality:

Principle (1):

‘Good’ and ‘evil’ are concepts with no reality outside minds.

To obtain principle (2), we simply must remind ourselves what the word ‘morality’ means:

Principle (2):

Morality is doing what is good.

Principle (1) states that good and evil do not exist outside minds. This does not, however, mean that they have no objective existence. Since they are words used to describe our mental reactions to different situations, then it is clear that they are values with some physical representation inside our minds. We forget this too easily, partly, perhaps, because of the scientific doctrine of objectivity, which typically means that our mental state should be kept as separate from the system under study and the process of gathering evidence as possible. This is normally a good principle – we have dozens of documented cognitive biases that make it essential to eliminate the influence of our preconceptions and emotional responses when conducting science. But what happens when our minds are the system under study? Crude application of the doctrine of objectivity can cause confusion. A mental state is not something lacking objective reality, as we might erroneously infer from this doctrine – we are, after all, physical entities. Our brains are made of atoms, and our thoughts and emotions necessarily correspond to specific configurations of matter and energy in our neurological hardware.

Because the mental states pertaining to right and wrong have no existence outside minds, then we can be confident that anything there is to be known about them can only be discovered by looking inside minds.

We already have instrumentation capable of measuring important data pertaining to mental states (functional magnetic resonance imaging being currently very popular), and since enormous improvements in this technology are quite likely, then it is in principle possible that we will at some time be capable of generating extremely detailed scientific information on the subject of what stimuli correspond to value judgements of right and wrong. I believe we already know enough to make a highly functional first approximation.

We can correlate measured mental states with people’s self-reported happiness, and other surrogate measures, and we can stochastically map those mental states to the stimuli that caused them. And since morality is simply doing good, we can scientifically optimize our behaviour to maximize the occurrence of the relevant mental states (experienced good), and scientific morality has in principle entirely achieved its objectives.

Objections:

1. The value problem

Goes something like this: ‘How can we know that we should value wellbeing? Surely science can’t tell us what to value, can it?’ These questions are confused. It is not the goal for science to tell us what to value (not at the highest level, at least). We want science to tell us what we actually do value. Note that value and perceived good are completely synonymous, and there is no good other than the perceived kind. Similarly, the declaration that we value wellbeing is another unavoidable tautology. We are talking about measuring the perception of value in people’s brains, and trying to bring about the circumstances that enhance the frequency of the observed mental states. There is no ambiguity about whether or not we should do this: to say you don’t value wellbeing, or happiness, or whatever is good would be self-contradictory. Goodness is the extent to which something is appropriate, so there is no sense asking why we should do good: you might as well ask for proof that we should do what we should do.

2. The measurement problem

Another commonly voiced objection is: ‘How do we know we are measuring the right thing?’ We can’t, but this does not nullify the existence of objective answers to questions about morality. While being one of the commonest complaints against objective morality, it is also one of the most unfair and idiotic. The same objection applies to all of science. Science systematically evaluates knowledge, in the form of direct sensory experience, in order to ascribe probabilities to propositions about phenomena in nature. There is no direct logical link between these sensory experiences and the phenomena we wish to learn about - the actual truth values of these scientific propositions remain forever obscured. This is exactly why probabilities constitute the ultimate expression of our knowledge.

Consider an apparently simple physical concept, temperature, and its measurement. The original concept of temperature related entirely to a subjective experience – if I put my hand on something hot, it feels hot, and this is how we originally knew that something possessed a high temperature. (There are occasionally other indicators as well, such as combustion, but all of these come down to subjective experience in the end.) At some point, a clever person realized that they could correlate these subjective symptoms of high temperature with the expansion of a fluid in a narrow glass capillary. The first accurate thermometer was born. How did they know that the rising fluid in the thermometer corresponded to the same phenomenon they were experiencing when they felt an object’s high temperature? They did carefully controlled experiments. But ultimately, they had no way of knowing for certain.

As science advanced, people realized that the assumed linear thermal expansion upon which calibration of these simple thermometers is based breaks down at extreme temperatures, and new technologies were developed to overcome the difficulty, giving progressively more accurate measurements, and greater ranges of validity.

Now note a curious thing. Even with the crude glass capillary style of thermometer, subjective experience is no longer the ultimate arbiter of temperature. A concept originally devised to account for subjective experience was found (with very high probability) to have an underlying physical mechanism that could be characterized accurately with external instruments. Given two objects of similar temperature, even a practiced human can easily be mistaken as to which one is hotter. A precise thermometer, however, will not be mistaken, and will give a result in direct conflict with the human. There has never been a serious suggestion, however, that such occurrences invalidate the use of the thermometer – instead, it is obvious that subjective experience is not perfectly correlated with the physics that it tries to characterize. So it is with feelings of wellbeing, and we must expect that neuroscience will quickly advance to a point where many kinds of mental states can be determined far more accurately than using a person’s self-reported state of mind (I would think this condition already holds in several cases). There is no contradiction here, as these mental states are real states of matter inside the substrates of minds (usually brains). People can be mistaken about what they want.

3. The persuasion problem

You can not persuade somebody who is committed to the contrary view that wellbeing should be valued. Does this mean that value is not an objective property of reality, or that wellbeing can’t be the basic guide of morality? Of course not. That wellbeing is valued is a tautology, as pointed out already. Being well, happy, and satisfied is just the fortunate condition of possessing much of what you value. That some people will not accept this analytical truth is a consequence of the failure of their minds to operate efficiently. The often-quoted fact that nearly half of all Americans do not accept the theory of evolution (link) has absolutely no impact on the validity of this model of how life reaches its various states of diversity.

4. What about psychopaths?

Is one morality less true than another? Since morality is rationally figuring out how to improve your state of happiness, no, symmetry demands that there is no privileged authority on ethical behaviour.

But this implies that psychopaths are also moral, right? Wrong, actually. It doesn’t imply that anybody’s actions are moral; the requirement for rationality needs to be met.

Ultimately, though, it is possible to imagine that there might be psychopaths who are perfectly rational, and highly successful at maximizing their own fulfillment by acting abhorrently to others. This still poses no real philosophical problem. Psychopaths are outnumbered by an estimated 99 to 1 (in the US), so our morality trumps theirs (actually, by those odds there seems a reasonable chance that some psychopath will read this). Our (non-psychopathic) morality dictates that we do what we can to limit their harmful tendencies. Even the psychopaths, by the way, presumably do not want to be the victims of other psychopaths.

Additionally, a person can be mistaken about what their highest-level desires are. A person can be mistaken about the relationship between their lower-level desires and their ultimate goals, and most obviously, a person can be mistaken about the likely outcomes of their actions and what will be the results on their happiness. This means that even the ethics of a person acting rationally can be improved by furnishing them with more accurate facts.

All rational, equally well-informed behaviour is equally valid. Pretending that there is something wrong with this principle, because it fails to unambiguously condemn the foul actions of sadists is to ignore the obvious truth that the appropriateness of one’s conduct is tautologically determined by its capacity to generate mental states corresponding to appropriateness. Furthermore, it is to insist that room be left open for some principle that, besides being wrong, will, almost by definition, not have the slightest impact on the behaviour of psychopaths, who don’t seem to care what is expected of them by moral philosophers.

5. Lack of uniqueness

It is not clear that there is a unique solution to the mathematical problem of maximizing wellbeing. There may very well be some upper limit to how happy I can be, and there may be many ways to arrive at that upper limit (but I’m not holding my breath). This does nothing to negate the arguments laid out. Whatever class of conditions enables that upper limit to be attained, it is determined by objective facts about nature.

6. Lack of universality

Different people have different values, but as discussed at number 4, this poses no philosophical threat to my thesis. Human beings, however, as members of a single species, are more similar than we are different, so substantial overlap between the ambitions of different individuals is to be highly expected. Furthermore, direct commonality of goals can also be confidently predicted – what’s good for you is good for me. The fact that our morality is ultimately selfish does not need to stand in the way of a high degree of cooperation.

A society that protects all people indiscriminately is very likely to protect me. A global economy that flourishes has a good chance to be one in which I am not poor. A society where learning and technology thrive is a good choice if I hope to have a comfortable life, and complex technology is very difficult to attain without close, structured cooperation between thousands of people.

In ‘The selfish gene,’ Richard Dawkins has written much on how cooperative behaviour emerges naturally among selfish agents, by mechanisms such as kin selection and reciprocal altruism. He has also made a nice TV documentary, 'Nice guys finish first,' on the topic, specifically to answer critics who unintelligently insisted that selfish genes imply selfish people. Further, by initiating the concept of memes, he has made it clear how human behaviour has evolved far beyond the consequences of mere population genetics. We are not bound by genetic evolution anymore, and cultural innovations capable of enhancing mutually beneficial technological and social developments are to be positively expected.

7. Which utility function?

There are different ways to model value, e.g. a rational utility function, or prospect theory (which more accurately models how humans naturally assess utility), but which one should we use? The answer is simple: the one that works. Briefly, (a bit late for that, I think) the point is that one day we might be smart enough to measure utility directly, by studying the physical states of brains (and potentially other substrates supporting minds).

8. Kahneman’s alternate selves

Daniel Kahneman and coworkers have repeatedly shown that there are different measures of happiness within a single human mind (see this TED talk, for example): the experiencing self, and the remembering self. Which one is correct for our purposes? Again, (and again very briefly) the one that works. Measure the success of basing your behaviour on one choice and compare to outcomes with the other choice, then adopt the policy that produces the optimal result.

9. How does wellbeing aggregate?

Assuming that my happiness is affected by some measure of global happiness (see number 6), how should we combine happiness measures for different people? Should they be added, multiplied, or what? Isn’t the final choice ultimately arbitrary?

No, it's not. Once more, measure the outcome of some likely aggregation function and compare it with other candidates. Did we need some a priori basis for accurately combining temperatures in order to accept its validity as a physical concept? Did the cave man need to know that two objects placed in contact would equilibrate to some intermediate temperature, rather than their temperatures adding algebraically, in order to know that the embers in his fire were hot?

Friday, March 8, 2013

What is Randomness?

Random variables play an important part in the vocabulary of probability theory. I think there's a lot of confusion, though, about what randomness actually is. A few days ago, I found an expert statistician trying to distinguish between mere statistical fluctuation and actual changes in the causal environment. Another example that always bugs me comes from computer science, and is the ubiquitous insistence that a deterministic algorithm can not produce random numbers, but only pseudo-random numbers.

Each of these examples commits a fallacy. The first may be an isolated slip-up, or else a deliberate attempt to gloss over technicalities with sloppy language (I may even be guilty (gasp!) of either of these myself, on occasion), but the second is almost universal within the entire profession of computer scientists, which represents a significant sample of the world's technically disciplined. If any one of those computer scientists understood what randomness is, they would recognize immediately that the need to distinguish between random and pseudo-random is entirely fictional.

It's not that I have anything against computer scientists. Information technology, after all, is what this blog is all about: systematically processing knowledge, and modern society owes its existence to computer science. Nor do I think that computer scientists are excessively prone to the fallacy I'm talking about. In fact, the statistical literature, compiled by those who, of all people, should have dedicated considerable effort to understanding this topic, is crammed with instances, of which my initial example is representative. As another example, the current Wikipedia entry on randomness contains a very confused section that starts with the statement: 'Randomness, as opposed to unpredictability, is an objective property.'

So what is the problem? Let's look at pseudo-random numbers, first. The point about pseudo-random numbers is that they are produced in a clever way to 'replicate' a random variable - they come appropriately distributed and they appear uncorrelated, which is to say that even knowing the n^th, (n-1)^th, ... numbers in a list, it will be impossible to predict what the (n+1)^th number will be. They are considered to be not truly random, however, because they are produced by mechanical operations of a computer on fixed states of its circuits. If we only knew the states of the circuit and the operations, then we would know the number that will come out next. To say that this prevents us from describing the numbers as random, however, is an instance of the mind-projection fallacy, as I have discussed before.

The mind projection fallacy consists of assuming that properties of our model of reality necessarily correspond to properties of reality. We often talk about a phenomenon being random. This makes it tempting to conclude that randomness is a property of the phenomenon itself, but this assumes too much, and also demands that the events we are talking about take place in some kind of bubble, where physics doesn't operate.

When I dip my hand into an urn to draw out a ball whose colour can't be predicted, the colour that comes out is rightly considered to be a random variable. But we are not talking about some quantum-mechanical wavefunction that collapses the moment the first photon from the ball hits my retina (or the moment the nerve impulse reaches my visual cortex, or any of a million other candidate moments). We are talking about a process with real and definite cause and effect relationships. The layout of the coloured balls inside the urn, the trajectory of my hand into the urn, and the exact moment I decide to close my hand uniquely determine the outcome of the draw. Randomness is not the occurrence of causeless events, but is a consequence of our incomplete information. It's not a property of the balls in the urn, but a property of our prior state of knowledge.

It might be that at microscopic scales, quantum stochastic variability really does emerge from an absence of causation, but there are 2 important points to note in relation to this discussion. Firstly, good scientists recognize the need for agnosticism on this front - there really isn't enough evidence yet to decide one way or the other (BBC Radio 4's excellent 'In Our Time' has an episode, entitled 'The measurement problem,' with an interesting discussion on the topic). Secondly, the vast majority of cases where the concept of randomness is applied concern macroscopic phenomena, where classical mechanics is a perfectly adequate model. For these reasons, the only sensible general usage of the word 'random' is when referring to missing information, rather than as a description of uncaused events. That wikipedia article I quoted from, in apparent recognition of this, later cites Brownian motion and chaos as examples of randomness, thereby contradicting the earlier quote (though inexplicably, the two are identified as separate classes of random behaviour).

Getting back to the urn, if I knew precisely the coordinates (relative to my hand) and colours of the balls inside, the colour of the extracted sphere would not be a surprise, and therefore wouldn't be a random variable. But under the standard drawing conditions, in which these are not known, it is a random variable. Similarly, knowing the state and operations of a deterministic computer algorithm would render its output non-random (provided I have the computational resources elsewhere (and the inclination) to replicate those operations), but this does not affect the randomness of its output when we don't know these things.
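To make the point concrete, here is a minimal sketch in Python of a deterministic generator seen by two observers. The class name LCG, the helper draws, and the seed value are my own illustrative choices; the multiplier and increment are the ones commonly quoted for the 'Numerical Recipes' linear congruential generator, used here only as an example and not as a recommendation.

```python
class LCG:
    """A tiny linear congruential generator: a purely mechanical update rule."""

    def __init__(self, seed):
        self.state = seed

    def next(self):
        # Deterministic update of the internal state (illustrative parameters).
        self.state = (1664525 * self.state + 1013904223) % 2**32
        # Scale the state to a number in [0, 1).
        return self.state / 2**32


def draws(seed, n):
    """Return the first n outputs of a generator started from the given seed."""
    gen = LCG(seed)
    return [gen.next() for _ in range(n)]


secret_seed = 20130308  # hypothetical seed, known only to the informed observer

# What actually comes out of the machine.
observed = draws(secret_seed, 5)

# An observer who knows the state and the operations replicates them exactly,
# so for this observer nothing about the output is uncertain.
predicted = draws(secret_seed, 5)

# An observer without that knowledge can only summarize the output statistically,
# for example by the fraction of a long run of values that falls below one half.
many = draws(secret_seed, 10000)
fraction_below_half = sum(x < 0.5 for x in many) / len(many)
```

The same list of numbers is certain for one observer and uncertain for the other, and the only thing that differs between them is what they know.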

And finally, how can there be a distinction between statistical fluctuations and changes in the causal environment? If I repeat the experiment with the urn and get a different coloured ball the second time, which is that, sampling variability, or a difference in causes? Both, of course. What could sampling variability (at the macroscopic scale) be the result of, if not mechanical differences in the evolution of the experiment? If I toss a coin 4 times and get 4 heads in a row, in one sense, that's a statistical fluke, but in another, it's the inevitable result of a system obeying completely deterministic mechanical laws. All that decides the level at which we find ourselves discussing the matter is our degree of awareness of the states and operations of nature.

Ok, so we live in a world polluted by some sloppy terminology, but does it really matter? I think it does. Every statistical model is an attempt to describe some physical process. As long as we systematically deny the action of physics on any aspects of these processes, then we close off access to potentially valuable physical insight. This is actually the problem that the discussed attempt to separate sampling fluctuation and causal variation was trying to address, but this half-hearted formulation serves only to postpone the problem until some later date.

Thursday, February 28, 2013

Legally Insane

David Spiegelhalter's blog, Understanding Uncertainty, informs us of a recent insane ruling from the England and Wales Court of Appeal, concerning the usability of probabilities as evidence in court cases. Lord Justice Toulson's ruling contains the following wisdom:

The chances of something happening in the future may be expressed in terms of percentage... But you cannot properly say that there is a 25 per cent chance that something has happened... Either it has or it has not.

This is wrong. Shockingly, scarily wrong. The judge is saying that probabilities only apply to future events, and not to past events, and is effectively decreeing that such evidence is inadmissible in a court of law. This ruling is based, it seems, on an earlier case, in which a judge ruled:

It is not, in my opinion, correct to say that on arrival at the hospital he had a 25 per cent. chance of recovery. If insufficient blood vessels were left intact by the fall he had no prospect of avoiding complete avascular necrosis, whereas if sufficient blood vessels were left intact on the judge's findings no further damage to the blood supply would have resulted if he had been given immediate treatment, and he would not have suffered the avascular necrosis.

It seems that Justice Toulson is not alone among judges in his profound ignorance of probability, logic and the basic principles of how knowledge is acquired. In fact, for almost 30 years, the legal profession has had its own special probabilistic fallacy, the prosecutor's fallacy, named after it. You might think that by now they should have made an effort to get to grips with basic methodology, but instead, they just keep on making stupid judgements.

I have pointed out that a judge who can claim that probabilities in general can't be assigned to past events is incompetent and not fit to perform their duties. Though it might seem harsh, I stand by this analysis. You might think that this is some obscure point of epistemology with little or no practical importance, but it is much more than that.

There are three main reasons blunders like this, made by a high court judge, should scare the crap out of society at large. Firstly, as pointed out, it demonstrates a serious ignorance of what probabilities are. Probability theory is completely symmetric with regard to time, and probabilities are simply a systematic way of quantifying our current knowledge. Just an obscure mathematical point? Not at all. Such ignorance shows off a complete disregard for what knowledge is, what it means to be rational, and what is actually involved when evidence is evaluated. What has been asserted is that calculations of the kind I performed when I worked out the probability that somebody has contracted a certain disease are completely meaningless. Maybe the judge won't find it meaningless next time he is in his doctor's office trying to plan his future. Call me overly strict, but I expect somebody in such a position of power, whose job consists to such a high degree of evaluating evidence, to be able to wield a modest understanding of what evidence actually is. How can any reasonable standard of statistical evidence be enforced, when judges are so ignorant about probability?

But it's not just ignorance that's on display here - also a shocking disrespect for logic. In the same paragraph as his pronouncement on the meaninglessness of probabilities, when applied to past events, the judge manages to immediately contradict himself rather blatantly:

In deciding a question of past fact the court will, of course, give the answer which it believes is more likely to be (more probably) the right answer than the wrong answer

How can anybody capable of holding such obviously incompatible positions at the same time, on any topic, be capable of presiding over a court? What is clear, is that every single judgement of fact, in every single sphere of life relies on some kind of probability assessment. The question that remains, then, is whether we want to make that assessment as systematic and rigorous as we can, or are we happy relying on unexamined instinct and faulty logic? For high-ranked judges to favour the latter is a nightmare scenario.

Secondly, this ruling, if implemented, would make an enormous variety of important types of evidence impossible to use in legal cases, and would severely hinder the capacity of courts to efficiently determine what is the likely truth. Genetic evidence, for example, is based on Bayesian calculations, as it must be, in order to attain validity.

To present probabilities is to make the highest quality of inference available. Why is this judge against high-quality inference? Indeed, why is he so overtly opposed to scientific method? In Toulson's ruling, we also have this:

When judging whether a case for believing that an event was caused in a particular way is stronger that the case for not so believing, the process is not scientific...

Why not scientific? Why not demand the highest standards of logic and inference? Why is he not complaining that the process is not scientific enough, instead of insisting that we rely on some inefficient and non-systematic procedure? The mind boggles. There is only one correct way to assess the implications of evidence, and to quantitatively combine multiple pieces of evidence, and that is Bayes' theorem (techniques that successfully replicate its outcomes can occasionally be used also). Judge Toulson's ruling constitutes a rejection of Bayesian reasoning, and thereby demands that the legal profession turn its back on the rational evaluation of empirical facts.
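As a rough illustration of what it means to combine multiple pieces of evidence about a past event, here is a minimal sketch in Python using the odds form of Bayes' theorem. The function name posterior_probability and all of the numbers are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch assumes the pieces of evidence are independent of one another given the hypothesis, which real evidence often is not.

```python
def posterior_probability(prior, likelihood_ratios):
    """Update a prior probability with a sequence of likelihood ratios.

    Each likelihood ratio is P(evidence | hypothesis) / P(evidence | not hypothesis).
    Assumes the pieces of evidence are conditionally independent given the hypothesis.
    """
    # Convert the prior probability to odds.
    odds = prior / (1 - prior)
    # Each piece of evidence multiplies the odds by its likelihood ratio.
    for ratio in likelihood_ratios:
        odds *= ratio
    # Convert the posterior odds back to a probability.
    return odds / (1 + odds)


# Hypothetical case: a 25 per cent prior that some past event occurred,
# followed by two pieces of evidence that are 4 and 3 times more likely
# if the event occurred than if it did not.
updated = posterior_probability(0.25, [4.0, 3.0])
```

With these made-up inputs the prior odds of 1 to 3 become posterior odds of 4 to 1, a probability of about 0.8. Nothing in the calculation refers to whether the event lies in the past or the future, only to what is currently known about it.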

Thirdly, the judges on this case have made a serious blunder with a technical issue, while obviously being unaware of their incompetence to reason about the topic. Certainly, a judge doesn't need to be an expert in all the technical subjects that may be relevant to any given case. But then they must be able to appreciate that the technical issues are beyond them. They can not perform technical analyses that they are unqualified to perform. If they want to base their decisions on philosophy, probability, mathematical theorems, or whatever, they must damn well get it right, or ask somebody else, who knows what they are doing. In what other technical forensic issues are these judges hopelessly unaware of their complete lack of understanding? A society that aspires to be a free and enlightened society must not tolerate such oblivious overconfidence among people with such an important job.

Saturday, February 9, 2013

Inductive inference or deductive falsification?

Andrew Gelman and Cosma Rohilla Shalizi have just published an interesting paper¹, 'Philosophy and the practice of Bayesian statistics,' which is all about the underlying nature of science and Bayesian statistics, as well as practical elements of scientific inference that some researchers seem to be reluctant to indulge in. The paper comes accompanied by an introduction, five comments, and a final authors' response to comments (link to journal issue). I found the paper to be a highly thought-provoking read, and its extensive list of references serves as a summary of much of what's worth knowing about at the cutting edge of epistemology research. While I'm recommending reading this article, there are major components of its thesis that I disagree with.

Firstly, I love the title. Many would have been satisfied with 'The philosophy and practice of Bayesian statistics,' but clearly the authors have broader ambitions than that. Actually, I really applaud the sentiment.

In terms of their practical guidelines, Gelman and Rohilla Shalizi are spot on. They are talking about model checking - graphical or statistical techniques using simulated data generated according to the model supported by a statistical analysis, with the goal of assessing the appropriateness of that model. This is necessary, since any probability calculation is dependent on the context of the hypothesis space within which the problem is formulated. These hypothesis spaces, however, are not derived from some divine, infallible formula, but find their genesis in the whims of our imaginations. There is no guarantee, or even strong reason to suppose, that the chosen set of alternative propositions actually contains a single true statement. A totally inappropriate theory can therefore attain a very high posterior probability, depending on the environment of models in which it finds itself competing. Within a given system of models, Bayes' theorem has no way to alert us to such calamities, and something additional to the standard Bayesian protocol is appropriate.
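To make this concrete, here is a minimal Python sketch of a totally inappropriate model winning a badly chosen contest. Everything in it is a made-up illustration rather than anything from the paper: the data (16 heads in 20 tosses), the two candidate coin biases, the equal priors, and the crude tail-probability check standing in for a proper model check.

```python
from math import comb

def binomial_likelihood(p, heads, tosses):
    """Probability of `heads` heads in `tosses` tosses of a coin with bias p."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

def posterior(priors, likelihoods):
    """Bayes' theorem over a finite set of mutually exclusive hypotheses."""
    joint = [pr * lk for pr, lk in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical data, invented for this illustration.
heads, tosses = 16, 20

# A poorly imagined hypothesis space: the bias is either 0.1 or 0.2.
biases = [0.1, 0.2]
priors = [0.5, 0.5]
likelihoods = [binomial_likelihood(p, heads, tosses) for p in biases]
post = posterior(priors, likelihoods)

# Crude model check: how probable is data at least this extreme
# under the model that "won" the Bayesian comparison?
tail = sum(binomial_likelihood(0.2, k, tosses) for k in range(heads, tosses + 1))

print(f"P(bias = 0.2 | data)         = {post[1]:.5f}")
print(f"P(>= {heads} heads | bias = 0.2) = {tail:.2e}")
```

Within this two-model world the bias-0.2 hypothesis is almost certain, yet the check shows the observed data would be wildly improbable under it, which is the signal that the hypothesis space itself needs extending.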

Some researchers committed to the validity of the Bayesian program feel, apparently, that this additional process is inappropriate, because it seems to step outside the confines of Bayesian logic. I contend that this is mistaken, which I will explain in laying out my disagreement with the paper under discussion.

Edwin Jaynes (in honour of whose work the title of this blog was chosen) was also a strong advocate of model checking, and he pointed out many times that Bayesian methodology exerts its greatest power when it shows us that we need to discard a theory that can serve us no more. I'm reminded of one of my favorite Jaynes quotes²:

To reject a Bayesian calculation because it has given us an incorrect prediction is like disconnecting a fire alarm because that annoying bell keeps ringing.

In terms of the philosophical discussion, Gelman and Rohilla Shalizi argue that model checking, while vital, is outside Bayesian logic and furthermore is not part of inductive inference. They claim that falsification of scientific models is deductive in nature. These three claims represent the main points of departure between my understanding and theirs.

I posted a comment on Professor Gelman's blog, so I'll just paste in directly from there (the details of the specific model from their discussion that I refer to are not important; it's just an example they used):

You wrote:

“It turned out that this varying-intercept model did not fit our data, … We found this not through any process of Bayesian induction but rather through model checking.”

I agree on the value of model checking, but I wonder if this is really distinct from inductive inference. In order to say that the model was inappropriate, don’t you think that you must, at least informally, have assigned it a low probability? In which case, your model checking procedure seems to be a heuristic designed to mimic efficiently the logic of Bayesian induction.

Even if you didn’t formally specify a hypothesis space, what you seem to have done is to say ‘look, this model mis-matches the data so much that it must be easy to find an alternate model that would achieve a much higher posterior.’ As such, the process of model checking attains absolutely strict validity only when that extended hypothesis space is explicitly examined, and your intuition numerically confirmed. Granted, many cases will be so obvious that the full analysis isn’t needed, but hopefully you get the point.

There certainly is a strong asymmetry between verification and falsification, but I can’t accept your thesis that falsification is deductive. Sure, it's typically harder for a model with an assigned probability near zero to be brought back into contention than it is for a model with currently very high probability to be crushed by new evidence, but it's not in principle impossible. Newtonian mechanics might be the real way of the world, and all that evidence against it might just have been a dream. The problem is that this requires not just Newtonian mechanics, but Newtonian mechanics + some other implausible stuff, which as intuition warns (and mathematics can confirm) deserves very small prior weight. (A currently favorable model can always be superseded by another model with not significantly greater complexity, which accounts for the asymmetry between falsification and verification.) The mathematics that verifies this is Bayesian and, it seems to me, inductive.

That we can apparently falsify a theory without considering alternatives seems to be simply this strong asymmetry allowing Bayesian logic to be reliably but informally approximated without specifying the entire (super)model.

By the way, the mathematics that confirms the low prior for propositions like Newton + all that extra weird stuff is Ockham's razor (a.k.a. Bayes' theorem).

I have pointed out previously that non-Bayesian techniques certainly have their usefulness, but that their validity is limited to the extent that they succeed in replicating the result of a Bayesian calculation. Model checking seems to me to be no exception. Indeed, the recommended model checks can only work when the outcome is so obvious that the full, rigorous analysis is not needed. If the case is too close to call by these techniques, then you must roll out Bayes' theorem again, or stick with your current model. This should be obvious.

I have laid out the Bayesian basis for falsificationism elsewhere, but I did not discuss this asymmetry between falsifying and verifying theories, which I also think is important. Some Bayesian methodologists seem to hold the view that they have equal status, but they do not. They are not, however, as asymmetric as Popper felt - he did not accept that any form of verification was ever valid. One must wonder, then, why he had any interest whatsoever in science. Science does not, however, verify by making absolute statements about a theory's truth, but rather, its statements are to be seen as of the 'less wrong' type.

The idea that falsification is deductive seems to me indefensible. It is entirely statistical, and is just as incapable of absolute certainty as any other reasoning about phenomena in the real world (except perhaps where propositions are falsified on the basis that they are incoherently expressed, though perhaps it is better not to say such things are false, but rather null statements). If falsification is deductive, how big do the residuals between data and model need to be in order to reach that magic tipping point?

Oh, and Professor Gelman's response to my comment?

'Fair enough.'

Actually, he said a little more than that, but I think it's fair to say he conceded that, strictly, I had a good point. (You may judge for yourself if you wish.)

Anyway, I'm looking forward to dipping into some of the published comments and responses, as well as some of the many materials cited in this paper. Recommended reading for anyone interested in epistemology - what we can know, and under what circumstances we can know it. By the way, all the articles in that journal seem to be open access, so bravo, hats off, and three cheers to the British Journal of Mathematical and Statistical Psychology.

[1] 'Philosophy and the practice of Bayesian statistics,' British Journal of Mathematical and Statistical Psychology, February 2013, Vol. 66, Issue 1, pages 8 - 38 (link)

[2] 'Clearing up mysteries, the original goal,' by E.T. Jaynes, in 'Maximum entropy and Bayesian methods,' edited by J. Skilling, Kluwer Publishing, 1989

Wednesday, January 16, 2013

Natural Selection By Proxy

Here, I'll give a short summary of one of my favourite studies of recent decades: John Endler's ingenious field and laboratory experiments on small tropical fish¹, which in my (distinctly non-expert) opinion constitute one of the most compelling and 'slam-dunk' proofs available of the theory of biological evolution by natural selection. After I've done that, in an act of unadulterated vanity, I'll suggest an extension to these experiments that I feel would considerably boost the information content of their results. That will be what I have dubbed 'selection by proxy'.

Don't get me wrong, Endler's experiments are brilliant. I first read about them in Richard Dawkins' delightful book, 'The Greatest Show on Earth,' and they captured my imagination, which is why I'm writing about them now.

Endler worked on guppies, small tropical fish, the males of which are decorated with coloured spots of varying hues and sizes. Different populations of guppies in the wild were found to exhibit different tendencies with regard to these spot patterns. Some populations show predominantly bright colours, while others prefer more subtle pigments. Some have large spots, while others have small ones. It's easy to contemplate the possibility that these differences in appearance are adaptive under different conditions. Two competing factors capable of contributing a great deal to the fitness of a male guppy are (1) ability to avoid getting eaten by predatory fish, and (2) ability to attract female guppies for baby making. Vivid colourful spots might contribute much to (2), but could be a distinct disadvantage where (1) is a major problem, and if coloration is determined by natural selection, then we would expect different degrees of visibility to be manifested in environments with different levels of predation. And so colour differences might be accounted for.

Furthermore, the idea suggests itself to the insightful observer that in the gravel-bottomed streams in which guppies often live, a range of spot sizes that's matched to the predominant particle size of the gravel in the stream bed would help a guppy to avoid being eaten, and that the tendency for particle and spot sizes to match will be greater where predators are more of a menace, and greater crypsis is an advantage.

These considerations lead to several testable predictions concerning the likely outcomes if populations of guppies are transplanted to environments with different degrees of predation and different pebble sizes in their stream beds. These predicted outcomes are extremely unlikely under the hypothesis that natural selection is false. Such transplantations, both into carefully crafted laboratory environments, and into natural streams with no pre-existing guppy populations, constituted the punch line of Endler's experiments, and the observed results matched the predictions extraordinarily closely, after only a few months of naturally selected breeding.

It's the high degree of preparatory groundwork and the many careful controls in these experiments, however, that result in both the high likelihood, P(D_p | H I), for the predicted outcome under natural selection, and the very low likelihood, P(D_p | H' I), under the natural-selection-false hypothesis. These likelihoods, under almost any prior, lead to only one possible logical outcome when plugged into Bayes' theorem, and make the results conclusive.
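The "under almost any prior" claim can be sketched numerically. The likelihood values below are invented purely for illustration (they are not estimates from Endler's data), and the hypothesis space is assumed to contain only H and H'.

```python
def posterior_h(prior_h, like_h, like_not_h):
    """P(H | D_p I) from Bayes' theorem, for the binary hypothesis space {H, H'}."""
    numerator = prior_h * like_h
    return numerator / (numerator + (1.0 - prior_h) * like_not_h)

# Hypothetical likelihoods, assumed for illustration only.
like_h = 0.9        # stands in for P(D_p | H I): prediction well matched under selection
like_not_h = 1e-6   # stands in for P(D_p | H' I): same data very hard to credit otherwise

# Try priors ranging from even-handed to strongly sceptical of H.
for prior in (0.5, 0.1, 0.001):
    print(f"prior P(H | I) = {prior:<6} -> posterior P(H | D_p I) = "
          f"{posterior_h(prior, like_h, like_not_h):.6f}")
```

With a likelihood ratio this lopsided, even a prior of one in a thousand is overwhelmed, which is the sense in which well-controlled experiments make the conclusion insensitive to where one starts.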

The established fact that the patterning of male guppies is genetically controlled served both causes. Of course, natural selection cannot act in a constructive way if the selected traits are not passed on to the next generation, so the likelihood under H goes up with this knowledge. At the same time, alternate ways to account for any observed evolution of guppy appearance, such as developmental polymorphisms or phenotypic plasticity (such as the colour variability of chameleons, to take an extreme example), are ruled out, hitting P(D_p | H' I) quite hard.

Observations of wild populations had established the types of spot pattern frequent in areas with known levels of predation - there was no need to guess what kind of patterns would be easy and difficult for predators to see, if natural selection is the underlying cause. The expected outcome under this kind of selection could be forecast quite precisely, again enhancing the likelihood function under natural selection.

Selection between genotypes obviously requires the presence of different genotypes to select from, and in the laboratory experiments, this was ensured by several measures leading to broad genetic diversity within the breeding population. This, yet again, increased P(D_p | H I). (Genetic diversity in the wild is often ensured by the tendency for individuals to occasionally get washed downstream to areas with different selective pressures, which is one of the factors that made these fish such a fertile topic for research.)

The experiment employed a 3 × 2 factorial design. Three predation levels (strong, weak, and none) were combined with 2 gravel sizes, giving 6 different types of selection. The production of results appropriate for each of these selection types constitutes a very well defined prediction and would certainly be hard to credit under any alternate hypothesis, and P(D_p | H' I) suffers further at the hands of the expected (and realized) data.

Finally, additional blows were dealt to the likelihood under H', by prudent controls eliminating the possibility of effects due to population density and body size variations under differing predation conditions.

With this clever design and extensive controls, the data that Endler's guppies have yielded offer totally compelling evidence for the role of natural selection. Stronger predation led unmistakably to guppies with less vivid coloration, and greater ability to blend inconspicuously with their environment, after a relatively small number of generations.

I first read about these experiments with great enjoyment, but there was another thing that came to my mind: what the data did not say. It is quite inescapable from the results that natural selection of genetic differences was responsible for observed phenotypic changes arising in populations placed in different environments, but the data say nothing about the mechanism leading to those genetic differences. This, of course, is something that is central to the theory of natural selection. Indeed, we might consider the full name of this theory to be 'biological evolution by natural selection of random genetic mutations.' For the sake of completeness, we would like to have data that speak not only of the natural selection part, but also of the random basis for the genetic transformation.

I'm not saying that there is any serious doubt about this, but neither was there serious doubt about natural selection prior to Endler's result. (In fact, there is some legitimate uncertainty about the relative importance of natural selection vs. other processes, such as genetic drift - uncertainty that work of Endler's kind can alleviate.) The theory of biological evolution, though, is a wonderful and extremely important theory. It stands out for a special reason: every other scientific theory we have is ultimately guaranteed to be wrong (though the degree of wrongness is often very small). Evolution by natural selection is the only theory I can think of that in principle could be strictly correct (and with great probability is), and so deserves to have all its major components tested as harshly as we reasonably can. This is how science honours a really great idea.

To test the randomness of genetic mutation, we need to consider alternative hypotheses. I can think of only one with non-vanishing plausibility: that at the molecular level, biology is adaptive in some goal-seeking way, with the cellular machinery striving, somehow, to generate mutations that make its future lineages more suitably adapted to their environment. I'll admit the prior probability is quite low, but I (as an amateur in the field) think it's not impossible to imagine a world in which this happens, and as the only remotely credible contender, we should perhaps test it.

We could perform such a test by arranging for natural selection by proxy. That is, an experiment much like Endler's, but with a twist: at each generation, the individuals to breed are not the ones that were selected (e.g. by mates or (passively) by predators), but their genetically identical clones. At each generation, pairs of clones are produced, one of which is added to the experimental population, inhabiting the selective environment. The other clone is kept in selection-free surroundings, and is therefore never exposed to any of the influences that might make goal-seeking mutations work. Any goal-seeking mechanism can only plausibly be based on feedback from the environment, so if we eliminate that feedback and observe no difference in the tendency for phenotypes to adapt (compared to a control experiment executed with the original method), then we have the bonus of having verified all the major components of the theory. And if, against all expectation, there turned out to be a significant difference between the direct and proxy experiments, it would be the discovery of the century, which for its own sake might be worth the gamble. Just a thought.

[1]	Natural Selection on Color Patterns in Poecilia reticulata, Endler, J.A., Evolution, 1980, Vol. 34, Pages 76-91 (Downloadable here)

Saturday, January 5, 2013

Great Expectations

Thinking ahead to some of the things I'd like to write about in the near future, I expect I'm going to want to make use of the technical concept of expectation. So I'll introduce it now, gently. Some readers will find much of this quite elementary, but perhaps some of the examples will be amusing. This post will be something of a moderate blizzard of equations, and I have some doubts about it, but algebra is important - it's the grease that lubricates the gears of all science. Anyway, I hope you can have some fun with it.

Expectation is just a technical term for average, or mean. It goes by several notations: statisticians like E(X), but physicists use the angle brackets, <X>. As physicists are best, and since E(X) is something I've already used to denote the evidence for a proposition, I'll go with the angle brackets.

We've dealt with the means of several probability distributions, usually denoting them by the Greek letter μ, but how does one actually determine the mean of a distribution? A simple argument should be enough to convince you of what the procedure is. The average of a list of numbers is just the sum, divided by the number of entries in the list. But if some numbers appear more than once in the list, then we can simplify the expression of this sum by just multiplying these by how often they show up. For n unique numbers, x₁, x₂, ..., x_n, appearing k₁, k₂, ..., k_n times, respectively, in the list, the average is given by

$$\text{average} = \frac{k_1 x_1 + k_2 x_2 + \dots + k_n x_n}{k_1 + k_2 + \dots + k_n} \tag{1}$$

But if we imagine drawing one of the numbers at random from the list, the probability to draw any particular number, x_i, by symmetry is just k_i/ (k₁+ k₂+ ... + k_n), and so the expectation of the random variable X is given by

$$\langle X \rangle = \sum_{i=1}^{n} x_i \, P(x_i) \tag{2}$$

In the limit where the adjacent points in the probability distribution are infinitely close to one another, this sum over a discrete sample space converts to an integral over a continuum:

$$\langle X \rangle = \int x \, P(x) \, dx \tag{3}$$

In the technical jargon, when we say that we expect a particular value, we don't quite mean the same thing one usually means. Of course, the actual realized value may be different from what our expectation is, without any contradiction. The expectation serves to summarize our knowledge of what values can arise. In fact, it may even be impossible for the expected value to be realized - the probability distribution may vanish to zero at the expectation if, for example, the rest of the distribution consists of two equally sized humps, equally distant from <X>. (This occurs, for instance, with the first excited state of a quantum-mechanical particle in a box - such a particle in this state is never found at the centre of the box, even though that is the expectation for its position.) This illustrates starkly that it is by no means automatic that the point of expectation is the same as the point of maximum probability.

We can easily extend the expectation formula in an important way. Not only is it useful to know the expectation of the random variable X, but also it can be important to quantify the mean of another random variable, Y, equal to f(X), some function of the first. The probability for some realization, y, of Y is given by the sum of all the P(x) corresponding to the values of x that produce that y:

$$P(y) = \sum_{x \,:\, f(x) = y} P(x) \tag{4}$$

This follows very straightforwardly from the sum rule: if a particular value for y is the outcome of our function f operating on any of the x's in the sub-list {x_j, x_k, ...}, then P(y) is equal to P(x_j or x_k or ...), and since the x's are all mutually exclusive, this becomes P(x_j) + P(x_k) + ....

Noting the generality of equation (2), it is clear that

$$\langle Y \rangle = \sum_{y} y \, P(y) \tag{5}$$

which from equation (4) is

$$\langle Y \rangle = \sum_{y} y \sum_{x \,:\, f(x) = y} P(x) \tag{6}$$

which by distributivity gives

$$\langle Y \rangle = \sum_{y} \; \sum_{x \,:\, f(x) = y} y \, P(x) \tag{7}$$

and since all x map to some y, and replacing y with f(x), this becomes

$$\langle f(X) \rangle = \sum_{x} f(x) \, P(x) \tag{8}$$

which was our goal.

Suppose a random number generator produces integers from 1 to 9, inclusive, all equally probable. What is the expectation for x, the next random digit to come out? Do the sums and verify that the answer is 5. Now, what is the expectation of x²? Perhaps we feel comfortable guessing 25, but better than guessing, let's use equation (8), which we went to so much trouble to derive. The number we want is given by Σ(x²/9), which is 31.667. We see that <x²> ≠ <x>², which is a damn good thing, because the difference between these two is a special number: it is the variance of the distribution for x, i.e. the square of the standard deviation, which (latter) is the root-mean-square distance of a measurement from the mean of its distribution. This result is quite general.
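For anyone who would rather let a machine do the sums, here is a small Python sketch of the expectation of a function of a discrete random variable. The function name `expectation` and the dictionary representation of the distribution are my own choices for illustration; exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

def expectation(dist, f=lambda x: x):
    """<f(X)> = sum over x of f(x) P(x), for a distribution given as {x: P(x)}."""
    return sum(f(x) * p for x, p in dist.items())

# Integers 1 to 9, inclusive, all equally probable.
digits = {x: Fraction(1, 9) for x in range(1, 10)}

mean = expectation(digits)                      # <x>
mean_sq = expectation(digits, lambda x: x * x)  # <x^2>
variance = mean_sq - mean**2                    # <x^2> - <x>^2

print("<x>      =", mean)
print("<x^2>    =", mean_sq, "≈", round(float(mean_sq), 3))
print("variance =", variance)
```

The exact value of <x²> comes out as 95/3, which is the 31.667 quoted above, and the variance is 95/3 - 25 = 20/3.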

As another example, let's think about the expected number of coin tosses, n, one needs to perform before the first head comes up. The probability for any particular n is the probability to get n - 1 tails in a row, followed by 1 head, which is (1/2)ⁿ. We're going to need an infinite sum of these terms, each multiplied by n. There's another fair chunk of algebra following, but nothing really of any difficulty, so don't be put off. Take it one line at a time, and challenge yourself to find a flaw - I'm hoping you won't. I'll start by examining this series:

$$\sum_{k=0}^{n} p^k = 1 + p + p^2 + \dots + p^n \tag{9}$$

multiplying both sides by a convenient factor,

$$(1 - p) \sum_{k=0}^{n} p^k = (1 - p)\left(1 + p + p^2 + \dots + p^n\right) \tag{10}$$

then removing all terms that annihilate each other, and dividing by that funny factor,

$$\sum_{k=0}^{n} p^k = \frac{1 - p^{n+1}}{1 - p} \tag{11}$$

But p is going to be the probability to obtain a head (or a tail) on some coin toss, so we know it’s less than 1, which means that as n goes to infinity, pⁿ⁺¹ goes to 0, so

$$\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p} \tag{12}$$

Now the point of this comes when we take the derivative of equation (12) with respect to p:

$$\sum_{k=1}^{\infty} k \, p^{k-1} = \frac{1}{(1 - p)^2} \tag{13}$$

Finally, to bump that power of p on the left hand side up from k-1 to k, we just need to multiply by p again, which gives us, from equation (2), the expectation for the number of tosses required to first observe a head:

$$\sum_{k=1}^{\infty} k \, p^{k} = \frac{p}{(1 - p)^2} \tag{14}$$

Noting that p = 1/2, we find that the expected number of tosses is a neat 2.

Perhaps you already guessed it, but I find this result quite extraordinarily regular. Of all the plausible numbers it could have been, 3.124, 1.583, etc., it turned out to be simple old 2. Quite an exceptionally well behaved system. Unfortunately, all the clunky algebra I've used makes it far from obvious why the result is so simple, but we can generalize the result with a cuter derivation that will also leave us closer to this understanding. We can do this using something called the theorem of total expectation:

$$\langle X \rangle = \sum_{y} \langle X \mid y \rangle \, P(y) \tag{15}$$

It’s starting to look a little like quantum mechanics, but don’t worry about that. This theorem is both intuitively highly plausible and relatively easy to verify formally. I won't derive it now, but I'll use it to work out the expected number of trials required to first obtain an outcome whose probability in any trial is p. I'll set x equal to the number of trials required, n, and set y equal to the outcome of the first trial:

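$$\langle n \rangle = \langle n \mid n = 1 \rangle \, P(n = 1) + \langle n \mid n \neq 1 \rangle \, P(n \neq 1) \tag{16}$$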

The expectation of n, when n is known to be 1 is, not surprisingly, 1. Not only that, but the expectation when n is known to be not 1 is just 1 added to the usual expectation of n since one of our chances out of all those trials has been robbed away:

$$\langle n \rangle = p + (1 - p)\left(1 + \langle n \rangle\right) \tag{17}$$

which is very easily solved,

$$\langle n \rangle = \frac{1}{p} \tag{18}$$

So the result for the tossed coin, where p = (1 - p) = 0.5, generalizes to any value for p: the expected number of trials required is 1/p.
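A quick numerical sanity check of the 1/p result, as a rough Python sketch. It simply adds up the first couple of thousand terms of the series, using the fact that the first success arrives on trial n with probability (1 - p)ⁿ⁻¹p; the function name and the cutoff of 2000 terms are arbitrary choices of mine.

```python
def expected_trials(p, terms=2000):
    """Partial sum of n * (1 - p)**(n - 1) * p over n = 1..terms.

    Approximates the expected number of trials needed to first obtain
    an outcome whose probability in any single trial is p.
    """
    return sum(n * (1 - p) ** (n - 1) * p for n in range(1, terms + 1))

for p in (0.5, 0.25, 0.1):
    print(f"p = {p:<5} partial sum = {expected_trials(p):.6f}   1/p = {1 / p:.6f}")
```

For these values of p the neglected tail of the series is far below floating-point precision, so the partial sums agree with 1/p to the digits printed.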

Finally, a head-scratching example, not because of its difficulty (it's the simplest so far), but because of its consequences. Imagine a game in which the participant tosses a coin repeatedly until a head first appears, and receives as a prize 2ⁿ dollars, where n is the number of tosses required. This famous problem is known as the St. Petersburg paradox. The expected prize is Σ(2ⁿ/2ⁿ), which clearly diverges to infinity, since every term in the sum is equal to 1. Firstly, this illustrates that not all expectations are well behaved. Further, the paradox emerges when we consider how much money we should be willing to pay for the chance to participate in this game. Would you be keen to pay everything you own for an expected infinite reward? Why not? Meditation on this conundrum led Daniel Bernoulli to invent decision theory, but how and why will have to wait until some point in the future.
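The divergence is easy to see by truncating the game. The sketch below assumes a hypothetical capped version, which is not part of the original problem: the game stops after at most `max_tosses` tosses and pays nothing if no head has appeared by then.

```python
from fractions import Fraction

def truncated_expected_prize(max_tosses):
    """Expected prize in a capped St. Petersburg game (the cap is an assumption).

    A first head on toss n has probability (1/2)**n and pays 2**n dollars,
    so every term of the sum contributes exactly 1 dollar.
    """
    return sum(Fraction(2**n) * Fraction(1, 2**n) for n in range(1, max_tosses + 1))

for cap in (10, 100, 1000):
    print(f"cap = {cap:<5} expected prize = {truncated_expected_prize(cap)} dollars")
```

The expected prize just equals the cap, growing without limit as the cap is lifted, even though half of all games end with a prize of only 2 dollars.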