Wednesday, May 22, 2013

Signal and Noise




Here's a useful thought experiment (slightly reworded) from Ronald Fisher's well-known 1925 textbook, 'Statistical Methods for Research Workers':
In each of two nearly identical universes, agricultural researchers wanted to compare 2 fertilizers (I know, it sounds like bullshit). In each universe, similar protocols were performed (this really happened, I swear): 2 plots of land were each divided into two parts, and the different parts treated with the different fertilizers. The same crop plant was cultivated on each part of each plot, and the individual yields recorded. The yields, in tons, were:
Universe 1:

plot   fertilizer A (tons)   fertilizer B (tons)
1      20                    28
2      23                    32


Universe 2:

plot   fertilizer A (tons)   fertilizer B (tons)
1      20                    28
2      23                    41

In which universe is there stronger evidence for the advantage of fertilizer B over fertilizer A?

In universe 1, the average advantage is 8.5 tons per plot, while in universe 2, the advantage is 13 tons. Seems like the farmers in universe 2 can place greater justified faith in fertilizer B.

But that conclusion is a bit too quick. There's more to the strength of evidence than just the magnitudes of the averages. We also have to consider the quality of the evidence, the signal-to-noise ratio. In universe 1, the numbers for each fertilizer are tightly clustered (20 is not much different from 23, and 28 is not much different from 32), supporting the idea that the experiment was well controlled - random factors probably contributed little to the outcomes.

In universe 2, however, the experiment doesn't look as well controlled. There's a big difference between the 2 results for fertilizer B, indicating that there is much more going on than just the choice of fertilizer. There's apparently more noise in this case, and if there's that much noise, then maybe the outcome of the experiment is purely down to random chance.

To analyze the relationship between the signal and the noise, Fisher recommends a null-hypothesis significance test (well, he invented significance tests, after all). Modelling the two sets of samples (A and B) as drawn from the same normally distributed population (the null hypothesis), we can calculate the plausibility of the observed difference between their means under the null hypothesis, H0. If the observed data, D, is too implausible under H0, then, as tradition goes, H0 is rejected. The problem is, we only have a few samples from which to calculate the width of that normal distribution, so another distribution, Student's t-distribution, which accounts for the uncertainty of the standard deviation, is used instead. To get a p-value, we have to calculate a t-statistic, and integrate the t-distribution from that statistic out to infinity to obtain the desired implausibility of D under H0.

To get the t-statistic, we can first calculate an aggregate sample standard deviation, s:



where xi, mi, and ni are respectively yields, averages, and the numbers of samples, for fertilizer i.

The t-statistic for comparison of two means (where H0 states that the 2 means are the same) is then given by


Taking account of the number of degrees of freedom in the experiment, nA + nB - 2, equal to 2, tables or almost any mathematical software can be consulted to perform the required integral, which gives the p-value directly. For universe 1, the p-value I get from this test (two-tailed integration) is 0.041 - quite significant, the null hypothesis is on shaky ground. For universe 2, however, where we noticed that there was apparently a far greater random component to the data, the p-value is 0.11. This is almost 3 times larger, meaning that the data are here more believable under H0. We have weaker grounds for supposing that the fertilizers perform any differently in universe 2.
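As a rough check on the procedure just described, here is a minimal sketch in Python, assuming the textbook pooled-variance form of the test: s² is the summed squared deviations divided by nA + nB - 2, and t is the difference of the means divided by s·√(1/nA + 1/nB). The function names are illustrative only, and the closed-form tail integral used here is valid only for 2 degrees of freedom. Under these assumptions the p-values come out somewhat larger than the ones quoted above (the quoted values are reproduced only if the standard error is taken to be s/√2), but the ordering of the two universes is the same.

```python
import math

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance (textbook form, assumed here)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Summed squared deviations of each sample about its own mean
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    df = na + nb - 2
    s = math.sqrt(ss / df)                      # aggregate sample standard deviation
    t = (mb - ma) / (s * math.sqrt(1 / na + 1 / nb))
    return t, df

def two_tailed_p_df2(t):
    """Two-tailed p-value for Student's t with exactly 2 degrees of freedom.

    Uses the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)).
    """
    return 1 - abs(t) / math.sqrt(t * t + 2)

t1, df1 = pooled_t([20, 23], [28, 32])   # universe 1
t2, df2 = pooled_t([20, 23], [28, 41])   # universe 2
p1, p2 = two_tailed_p_df2(t1), two_tailed_p_df2(t2)
print(t1, p1)   # t = 3.4, p roughly 0.077
print(t2, p2)   # t roughly 1.95, p roughly 0.19

# Assumption for comparison only: a standard error of s / sqrt(2) scales t by sqrt(2)
p1_alt = two_tailed_p_df2(t1 * math.sqrt(2))   # roughly 0.041
p2_alt = two_tailed_p_df2(t2 * math.sqrt(2))   # roughly 0.11
```

Either way, the noisier data of universe 2 are the more believable under H0.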

This rare foray into the realm of orthodox stats has been hopefully sufficient to illustrate the point about quality of information, in terms of signal and noise (don't ask for guarantees that I did everything correctly, I may never understand the mindset under which these significance tests make sense). What I don't like, though, about the null-hypothesis significance test (among other things) is that no alternative hypotheses are formulated or evaluated. Without comparing H0 to any other H's, the whole process is frankly rather hollow. What's more, if H0 is rejected, then further machinery is required to figure out how large the effect is. What I want, generally, is a set of proper probabilities, worked out for a whole range of possibilities, including H0.

Just for a laugh, then, I'll work through an approximate model, that'll allow me to plot a continuous probability distribution for a whole range of values for the average difference between the yields for the 2 fertilizers. I'll use the t-statistic again, but I won't be integrating tails (at least, not until after I have a posterior distribution). This t-statistic will help me get around the ambiguity concerning the width of the noise distribution, when calculating the likelihood function.

We can test the significance of the mean of a single set of samples from a single population, relative to some hypothetical mean, μ0, using another formula for the t-statistic:

$$ t = \frac{\langle x \rangle - \mu_0}{s / \sqrt{n}} $$

where s is the regular sample standard deviation and n is the number of samples. (For emphasis, the reason for the different formula is that we are doing a different test - looking at a single mean, <x>, as opposed to comparing 2 means.)

The likelihood function, P(D | HI),  is calculated by evaluating the t-distribution with this statistic and the number of degrees of freedom, n - 1, which is 1.

The single population we're looking at is the population of differences between fertilizer B and fertilizer A. The parameter to vary in order to generate the likelihood function is the hypothesis, μ0.

As a prior density, I'll use a normal distribution centered at zero. To assign a width, I'll set the standard deviation to 15 tons, which means that there is a very small probability (< 5%) that the absolute value of the difference in yields for the two fertilizers exceeds 30 tons. The posterior probability is then simply the normalized product of this prior and the likelihood function, from Bayes' Theorem.

The graph below shows the posterior probability density as a function of μB-A, for each of our universes. The curves illustrate clearly the impact of decreasing the signal to noise ratio in universe 2: though the peak is further to the right, the tail extends further to the left, due to the lower sensitivity of the experiment in that universe. Integrating the two curves from -∞ to 0, we see that the probability that fertilizer B is actually not better than fertilizer A is twice as large in universe 2 as it is in universe 1, which is similar to the result above, in terms of p-values. As the old saying goes: garbage in, garbage out. Precise inference demands a well-controlled experiment.
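The posterior calculation described above can be sketched numerically. This is only a minimal illustration under stated assumptions: the differences are formed by pairing the yields within each plot (B minus A), the integration is done on a simple grid truncated at ±100 tons, and the function names are made up for the example. The exact probabilities depend on these choices and need not match the factor of two quoted above, but the qualitative picture (a fatter left tail in universe 2) should survive.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + t * t / df) ** (-(df + 1) / 2)

def prob_b_not_better(diffs, prior_sd=15.0, lo=-100.0, hi=100.0, step=0.01):
    """Posterior probability that the mean difference (B - A) is <= 0."""
    n = len(diffs)
    mean = sum(diffs) / n
    s = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    below, total = 0.0, 0.0
    steps = int(round((hi - lo) / step))
    for i in range(steps + 1):
        mu0 = lo + i * step                               # hypothesized mean difference
        t = (mean - mu0) / (s / math.sqrt(n))             # one-sample t-statistic
        prior = math.exp(-0.5 * (mu0 / prior_sd) ** 2)    # unnormalized normal prior
        weight = prior * t_pdf(t, n - 1)                  # prior times likelihood
        total += weight
        if mu0 <= 0:
            below += weight
    return below / total                                  # normalization cancels the grid step

# Differences paired by plot (an assumption of this sketch)
p_universe1 = prob_b_not_better([28 - 20, 32 - 23])
p_universe2 = prob_b_not_better([28 - 20, 41 - 23])
print(p_universe1, p_universe2)
```

Because the result is a ratio of two sums over the same grid, neither the prior nor the likelihood needs to be normalized beforehand.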



It's only human that quite often in the quest for knowledge we'll derive greater confidence from results like universe 2, rather than universe 1. It takes care not to be seduced by a greater overall difference, before taking time to consider how much of that difference is likely to be due to random fluctuations. Often, for brevity, we'll summarize an experiment by recording only the mean result (or in less formal circumstances subconsciously estimate the mean, and forget all the other details), but as we've seen, to draw good-quality inferences we need to note not only the mean but also the dispersion and the number of samples (roughly, confidence in a result scales [1] according to SNR × √n). How many figures quoted by politicians or newspapers (or anybody else with influence) lose their sting when we notice that no error bar has been provided? Context is all-important. Sometimes, only moderately careful analysis is enough to overturn an intuitively appealing conclusion, and it's results like this that show the importance of a cultivated awareness of the mathematical machinery of rational inference.
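The rough scaling rule in that paragraph can be connected to the one-sample t-statistic. A short derivation, assuming the hypothetical mean is zero and reading SNR as the ratio of the sample mean to the sample standard deviation (both are assumptions made here for illustration):

```latex
t = \frac{\langle x \rangle - 0}{s / \sqrt{n}}
  = \frac{\langle x \rangle}{s}\,\sqrt{n}
  = \mathrm{SNR} \times \sqrt{n}
```

Under that reading, doubling the signal-to-noise ratio buys as much as quadrupling the number of samples.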





[1] 'Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!),' D.L. Sackett, CMAJ, October 30, 2001, vol. 165, no. 9



Thursday, April 18, 2013

AllTrials Campaign



About two weeks ago, I got a personal email from Ben Goldacre and Tracey Brown. I and 39,999 other people, that is. It was an update about a petition I signed a while back, demanding higher standards in evidence-based medicine. They've just reached 40,000 signatures, which is really great, but they are aiming for 1 million. This petition is good for medicine, and good for science generally.

The campaign calls for the publication of the results of all clinical trials. Vast amounts of clinical research are carried out, predominantly by drug companies, who have traditionally reserved the right to publish only what they choose. This practice (not entirely limited to drug companies) is harmful and unethical for many reasons. 

If we think about the traditional standards of statistical significance, a systematic causal effect is deemed to have been observed in any case where the results of an experiment are sufficiently unlikely to have arisen (by random variability alone) if there were no effect present. 'Sufficiently unlikely' often means a p-value below 0.05, meaning that if there is no effect to observe, random noise will produce a statistically significant impression of a real effect in an expected 1 out of 20 similar experiments.

This means that even if there is no true effect, the more individual experiments you do, the greater the chance of sooner or later producing a trial suggesting an interesting positive result. In medical science (as in any other discipline), therefore, partial reporting of outcomes leaves substantial room for distortion of the data and the conclusions they lead to. Quoting from Goldacre's Bad Science website:
The scale of this problem is enormous. It exposes patients to unnecessary harm, because the wrong treatment may be prescribed when the evidence is distorted. It also affects some very expensive drugs. For example, governments around the world have spent billions on a drug called Tamiflu: the UK alone spent £500 million on this drug in 2009, which is 5% of the total £10bn NHS drugs budget, on one drug. But Roche, the drug’s manufacturer, published fewer than half of the clinical trials conducted on it, and continue to withhold vitally important information about these trials from doctors and researchers even today. Some academics now suspect that the drug may be no better than paracetamol.
And quoting from the AllTrials website:
Around half of all clinical trials have not been published; some trials have not even been registered. If action is not taken urgently, information on what was done and what was found in trials could be lost forever, leading to bad treatment decisions, missed opportunities for good medicine, and trials being repeated unnecessarily.
The situation may not always be quite as bad as incorrectly reporting efficacy for a treatment. When cherry-picked data are reported, effect sizes can be exaggerated, so that, for example, a new drug appears to do a better job than a rival product when in fact it performs no better. Even if there is no deliberate cherry picking, data from a larger number of subjects is obviously more informative, so if that data exists, it must be made available in order to maximize the benefits for society.
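The earlier point about repeated experiments can be made concrete with a little arithmetic. The sketch below assumes independent trials of a treatment with no true effect and a significance threshold of 0.05; the function name is illustrative only.

```python
def prob_any_false_positive(n_trials, alpha=0.05):
    """Chance that at least one of n_trials independent null experiments
    crosses the significance threshold alpha purely by chance."""
    return 1 - (1 - alpha) ** n_trials

for n in (1, 5, 14, 20):
    print(n, round(prob_any_false_positive(n), 3))
# 1 trial: 0.05, 5 trials: 0.226, 14 trials: 0.512, 20 trials: 0.642
```

With 14 such trials it is already more likely than not that at least one looks 'significant', which is why publishing only the flattering subset is so misleading.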

You might think that this is data that belongs to the drug companies, and they can do what they like with it, but these are trials carried out on real people, who are led to believe that sacrifices they are making are contributing to the advancement of science. Anyway, one couldn't reasonably argue that reporting totally made up data would be ok - there's a justified obligation to tell the truth, and any omission of relevant data constitutes a violation of that obligation just as much as actively lying. 

I wholeheartedly support the AllTrials initiative, which seeks to make it obligatory for all findings from clinical trials to be placed in the public domain. Medical science is literally about life-and-death decisions, so if any discipline demands the full power of scientific method, this is it. Rational decision making, of course, demands high quality inference from the best available data.

Visit the AllTrials website, and consider adding your name to the petition.


By adding our names to a petition like this, we not only make an important statement about how we think medical research ought to be conducted for the better, we add extra weight to the political credibility of science and evidence generally, which is an important additional bonus. If it is seen to be politically unacceptable to allow medical data to be swept under the carpet, impeding the ability of doctors and patients to make informed decisions, then at some point, awareness must grow of the many other areas where decisions are routinely made, based on no data whatsoever. And as awareness grows, so too must incredulity and outrage.

I'm proud to be a small part of a growing movement of people who want to stand up and say 'Look, science is bloody great. You either agree with us, or you are deluded.'

Ben Goldacre, one of the people most involved in this drive to make the drug companies honest, is not only a sterling advocate of evidence-based medicine, but also a committed campaigner for evidence-based... well everything really, and I think he is absolutely right. One of the most right people on the planet, perhaps. If we think about the alternatives to scientific method, with its logical evaluation of empirical experience, all we have is guesswork and superstition, and let's face it, these are fairly stupid things to base important decisions on.

Goldacre's admirable multi-pronged attack on society's failures to apply proper scientific method when it should includes co-authoring this excellent article, 'Test, learn, adapt: developing public policy with randomized controlled trials,' on using RCTs to assess social policies. The extension of scientific method into politics is a logical consequence of any serious commitment to serving society, but bizarrely, politicians have traditionally denied any substantial relevance of science to their discipline. This great and practical paper gently destroys this myth, as well as outlining many of the steps required to make evidence-based politics work. This radio broadcast, also by Goldacre, discusses the same topic. It's a subject very closely allied to my post on moral science, which goes beyond the truism that science is vital for working towards our moral goals, to argue that science is necessary and sufficient for establishing with confidence what our moral goals actually are.

More recently, Goldacre has authored another paper on evidence-based education (downloadable here), extending his righteous agenda into yet another important direction. One of the nice things about this particular direction is that by enabling systematic improvement of education, we can expect better awareness of science among the future members of society, thereby feeding back into the wider program.

These, then, are the main reasons we should all sign the AllTrials petition:
  • we demand the best possible health care, and we understand that this standard cannot be reached if data from clinical trials is withheld from publication
  • we definitely don't want to make it easy for multi-billion dollar drug companies to make money by being dishonest
  • we want to send a loud message that there is a growing tide of opinion in favour of evidence and its scientific evaluation, in all areas of human endeavour, to tell our leaders that their scorn for science is a measure of their scorn for society, which they hold at their peril





Saturday, March 9, 2013

Scientific Morality





Elsewhere, I have argued that all questions relating to facts about the real world can be addressed by scientific method, and that if we are truly interested in approaching the correct answers to such questions, then we are foolish to rely on any other method. One important class of questions most people, including, it seems, many scientists, think can’t be answered scientifically concerns how we should determine our moral goals. It seems to me that only a modest amount of reflection is required to dispel this pervasive myth.

The purpose of science is to provide ever more precise plausibility estimates for propositions concerning the state of reality. Rightness and wrongness, though, are not among the objective properties of matter. To attribute ideas such as good and evil to external nature is to commit the mind projection fallacy – to assume that properties of our internal model of reality correspond to properties of the reality.

Consider that classic image of natural selection in action, a lion chasing a zebra, with intention to devour it. There is no sense in which we can apply concepts of right and wrong, or good and evil to this situation. It is not evil that the zebra suffers agony at the teeth and claws of the lion, any more than it is evil if the lion suffers misery and protracted death from starvation. What we have here is simply chemistry unfolding in a mildly interesting manner – genes either preparing to replicate or losing the opportunity to do so. Furthermore, to assert that humanity is something more than this mildly interesting chemistry is certainly not founded on any rational procedure.

That we experience pain, misery, and torment are mind-bending, heart-wrenching facts about life, for us. But the universe does not care. There is no sense in which these values can be attributed as properties of the universe. To the universe, we are just yet more chemistry, a collection of slightly unusual aggregates of matter that dance about on the surface of an otherwise insignificant chunk of rock, one of an estimated 10^20 planets in the known universe.

It is not that natural selection, or the universe, wants us to maximize our happiness. It is simply that a genome capable of generating an algorithm for producing the sensation of such a value apparently gains an additional advantage in the struggle to preserve its information. In fact, it is possible that our genes would prefer us to be unhappy – it's that ‘Oh shit, I’d better not do that again,’ effect that seems to give the mechanism its selective advantage. If you are really happy, then you will stop striving to do better, and your cousin, who has the ‘try harder’ mutation, will walk all over you.

This might seem an odd line of reasoning, given the agenda I have defined in the opening paragraph, but we have already established enough to demonstrate undeniably that morality concerns matters of fact with objective truth, and that there is a clear scientific route to take in pursuit of those truths. In fact, it was formulating an argument along the above lines, a couple of years ago, while trying to refute the conclusions of this infamous TED talk by Sam Harris, that I came to my current understanding of the topic. I had formerly assumed the standard position, that science can not specify moral goals. (Harris's ideas were shortly afterward expanded to book length, in 'The moral landscape.')

The above paragraphs can be condensed into one very simple statement, one of only two simple principles (the other being a trivial tautology) needed to establish the validity of a potential science-based morality:

Principle (1):

‘Good’ and ‘evil’ are concepts with no reality outside minds.


To obtain principle (2), we simply must remind ourselves what the word ‘morality’ means:

Principle (2):

            Morality is doing what is good.


Principle (1) states that good and evil do not exist outside minds. This does not, however, mean that they have no objective existence. Since they are words used to describe our mental reactions to different situations, then it is clear that they are values with some physical representation inside our minds. We forget this too easily, partly, perhaps, because of the scientific doctrine of objectivity, which typically means that our mental state should be kept as separate from the system under study and the process of gathering evidence as possible. This is normally a good principle – we have dozens of documented cognitive biases that make it essential to eliminate the influence of our preconceptions and emotional responses when conducting science. But what happens when our minds are the system under study? Crude application of the doctrine of objectivity can cause confusion. A mental state is not something lacking objective reality, as we might erroneously infer from this doctrine – we are, after all, physical entities. Our brains are made of atoms, and our thoughts and emotions necessarily correspond to specific configurations of matter and energy in our neurological hardware.

Because the mental states pertaining to right and wrong have no existence outside minds, then we can be confident that anything there is to be known about them can only be discovered by looking inside minds.

We already have instrumentation capable of measuring important data pertaining to mental states (functional magnetic resonance imaging being currently very popular), and since enormous improvements in this technology are quite likely, then it is in principle possible that we will at some time be capable of generating extremely detailed scientific information on the subject of what stimuli correspond to value judgements of right and wrong. I believe we already know enough to make a highly functional first approximation.

We can correlate measured mental states with people’s self-reported happiness, and other surrogate measures, and we can stochastically map those mental states to the stimuli that caused them. And since morality is simply doing good, we can scientifically optimize our behaviour to maximize the occurrence of the relevant mental states (experienced good), and scientific morality has in principle entirely achieved its objectives.


Objections:

1.     The value problem

Goes something like this: ‘How can we know that we should value wellbeing? Surely science can’t tell us what to value, can it?’ These questions are confused. It is not the goal for science to tell us what to value (not at the highest level, at least). We want science to tell us what we actually do value. Note that value and perceived good are completely synonymous, and there is no good other than the perceived kind. Similarly, the declaration that we value wellbeing is another unavoidable tautology. We are talking about measuring the perception of value in people’s brains, and trying to bring about the circumstances that enhance the frequency of the observed mental states. There is no ambiguity about whether or not we should do this: to say you don’t value wellbeing, or happiness, or whatever is good would be self-contradictory. Goodness is the extent to which something is appropriate, so there is no sense asking why we should do good: you might as well ask for proof that we should do what we should do.


2.     The measurement problem

Another commonly voiced objection is: ‘How do we know we are measuring the right thing?’ We can’t, but this does not nullify the existence of objective answers to questions about morality. While being one of the commonest complaints against objective morality, it is also one of the most unfair and idiotic. The same objection applies to all of science. Science systematically evaluates knowledge, in the form of direct sensory experience, in order to ascribe probabilities to propositions about phenomena in nature. There is no direct logical link between these sensory experiences and the phenomena we wish to learn about - the actual truth values of these scientific propositions remain forever obscured. This is exactly why probabilities constitute the ultimate expression of our knowledge.

Consider an apparently simple physical concept, temperature, and its measurement. The original concept of temperature related entirely to a subjective experience – if I put my hand on something hot, it feels hot, and this is how we originally knew that something possessed a high temperature. (There are occasionally other indicators as well, such as combustion, but all of these come down to subjective experience in the end.) At some point, a clever person realized that they could correlate these subjective symptoms of high temperature with the expansion of a fluid in a narrow glass capillary. The first accurate thermometer was born. How did they know that the rising fluid in the thermometer corresponded to the same phenomenon they were experiencing when they felt an object’s high temperature? They did carefully controlled experiments. But ultimately, they had no way of knowing for certain.

As science advanced, people realized that the assumed linear thermal expansion upon which calibration of these simple thermometers is based breaks down at extreme temperatures, and new technologies were developed to overcome the difficulty, giving progressively more accurate measurements, and greater ranges of validity.    

Now note a curious thing. Even with the crude glass capillary style of thermometer, subjective experience is no longer the ultimate arbiter of temperature. A concept originally devised to account for subjective experience was found (with very high probability) to have an underlying physical mechanism that could be characterized accurately with external instruments. Given two objects of similar temperature, even a practiced human can easily be mistaken as to which one is hotter. A precise thermometer, however, will not be mistaken, and will give a result in direct conflict with the human. There has never been a serious suggestion, however, that such occurrences invalidate the use of the thermometer – instead, it is obvious that subjective experience is not perfectly correlated with the physics that it tries to characterize. So it is with feelings of wellbeing, and we must expect that neuroscience will quickly advance to a point where many kinds of mental states can be determined far more accurately than using a person’s self-reported state of mind (I would think this condition already holds in several cases). There is no contradiction here, as these mental states are real states of matter inside the substrates of minds (usually brains). People can be mistaken about what they want.


3.     The persuasion problem

You cannot persuade somebody who is committed to the contrary view that wellbeing should be valued. Does this mean that value is not an objective property of reality, or that wellbeing can’t be the basic guide of morality? Of course not. That wellbeing is valued is a tautology, as pointed out already. Being well, happy, and satisfied is just the fortunate condition of possessing much of what you value. That some people will not accept this analytical truth is a consequence of the failure of their minds to operate efficiently. The often-quoted fact that nearly half of all Americans do not accept the theory of evolution has absolutely no impact on the validity of this model of how life reaches its various states of diversity.


4.     What about psychopaths?

Is one morality less true than another? Since morality is rationally figuring out how to improve your state of happiness, no, symmetry demands that there is no privileged authority on ethical behaviour.

But this implies that psychopaths are also moral, right? Wrong, actually. It doesn’t imply that anybody’s actions are moral; the requirement for rationality needs to be met.

Ultimately, though, it is possible to imagine that there might be psychopaths who are perfectly rational, and highly successful at maximizing their own fulfillment by acting abhorrently to others. This still poses no real philosophical problem. Psychopaths are outnumbered by an estimated 99 to 1 (in the US), so our morality trumps theirs (actually, by those odds there seems a reasonable chance that some psychopath will read this). Our (non-psychopathic) morality dictates that we do what we can to limit their harmful tendencies. Even the psychopaths, by the way, presumably do not want to be the victims of other psychopaths.

Additionally, a person can be mistaken about what their highest-level desires are. A person can be mistaken about the relationship between their lower-level desires and their ultimate goals, and most obviously, a person can be mistaken about the likely outcomes of their actions and what will be the results on their happiness. This means that even the ethics of a person acting rationally can be improved by furnishing them with more accurate facts.

All rational, equally well-informed behaviour is equally valid. Pretending that there is something wrong with this principle, because it fails to unambiguously condemn the foul actions of sadists is to ignore the obvious truth that the appropriateness of one’s conduct is tautologically determined by its capacity to generate mental states corresponding to appropriateness. Furthermore, it is to insist that room be left open for some principle that, besides being wrong, will, almost by definition, not have the slightest impact on the behaviour of psychopaths, who don’t seem to care what is expected of them by moral philosophers.


5.     Lack of uniqueness

It is not clear that there is a unique solution to the mathematical problem of maximizing wellbeing. There may very well be some upper limit to how happy I can be, and there may be many ways to arrive at that upper limit (but I’m not holding my breath). This does nothing to negate the arguments laid out. Whatever class of conditions enables that upper limit to be attained, it is determined by objective facts about nature. 


6.     Lack of universality

Different people have different values, but as discussed at number 4, this poses no philosophical threat to my thesis. Human beings, however, as members of a single species, are more similar than we are different, so substantial overlap between the ambitions of different individuals is to be highly expected. Furthermore, direct commonality of goals can also be confidently predicted – what’s good for you is good for me. The fact that our morality is ultimately selfish does not need to stand in the way of a high degree of cooperation.

A society that protects all people indiscriminately is very likely to protect me. A global economy that flourishes has a good chance to be one in which I am not poor. A society where learning and technology thrive is a good choice if I hope to have a comfortable life, and complex technology is very difficult to attain without close, structured cooperation between thousands of people.

In ‘The selfish gene,’ Richard Dawkins has written much on how cooperative behaviour emerges naturally among selfish agents, by mechanisms such as kin selection and reciprocal altruism. He has also made a nice TV documentary, 'Nice guys finish first,' on the topic, specifically to answer critics who unintelligently insisted that selfish genes imply selfish people. Further, by initiating the concept of memes, he has made it clear how human behaviour has evolved far beyond the consequences of mere population genetics. We are not bound by genetic evolution anymore, and cultural innovations capable of enhancing mutually beneficial technological and social developments are to be positively expected.


7.     Which utility function?

There are different ways to model value, e.g. a rational utility function, or prospect theory (which more accurately models how humans naturally assess utility), but which one should we use? The answer is simple: the one that works. Briefly, (a bit late for that, I think) the point is that one day we might be smart enough to measure utility directly, by studying the physical states of brains (and potentially other substrates supporting minds).


8.     Kahneman’s alternate selves

Daniel Kahneman and coworkers have repeatedly shown that there are different measures of happiness within a single human mind (see this TED talk, for example): the experiencing self, and the remembering self. Which one is correct for our purposes? Again, (and again very briefly) the one that works. Measure the success of basing your behaviour on one choice and compare to outcomes with the other choice, then adopt the policy that produces the optimal result.


9.     How does wellbeing aggregate?

Assuming that my happiness is affected by some measure of global happiness (see number 6), how should we combine happiness measures for different people? Should they be added, multiplied, or what? Isn’t the final choice ultimately arbitrary?

No, it's not. Once more, measure the outcome of some likely aggregation function and compare it with other candidates. Did we need some a-priori basis for accurately combining temperatures in order to accept the validity of temperature as a physical concept? Did the cave man need to know that two objects placed in contact would equilibrate to some intermediate temperature, rather than their temperatures adding algebraically, in order to know that the embers in his fire were hot?





Friday, March 8, 2013

What is Randomness?



Random variables play an important part in the vocabulary of probability theory. I think there's a lot of confusion, though, about what randomness actually is. A few days ago, I found an expert statistician trying to distinguish between mere statistical fluctuation and actual changes in the causal environment. Another example that always bugs me comes from computer science, and is the ubiquitous insistence that a deterministic algorithm cannot produce random numbers, but only pseudo-random numbers.

Each of these examples commits a fallacy. The first may be an isolated slip-up, or else a deliberate attempt to gloss over technicalities with sloppy language (I may even be guilty (gasp!) of either of these myself, on occasion), but the second is almost universal within the entire profession of computer scientists, which represents a significant sample of the world's technically disciplined. If any one of those computer scientists understood what randomness is, they would recognize immediately that the need to distinguish between random and pseudo-random is entirely fictional. 

It's not that I have anything against computer scientists. Information technology, after all, is what this blog is all about: systematically processing knowledge, and modern society owes its existence to computer science. Nor do I think that computer scientists are excessively prone to the fallacy I'm talking about. In fact, the statistical literature, compiled by those who, of all people, should have dedicated considerable effort to understanding this topic, is crammed with instances, of which my initial example is representative. As another example, the current Wikipedia entry on randomness contains a very confused section that starts with the statement: 'Randomness, as opposed to unpredictability, is an objective property.'

So what is the problem? Let's look at pseudo-random numbers, first. The point about pseudo-random numbers is that they are produced in a clever way to 'replicate' a random variable - they come appropriately distributed and they appear uncorrelated, which is to say that even knowing the nth, (n-1)th, ... numbers in a list, it will be impossible to predict what the (n+1)th number will be. They are considered to be not truly random, however, because they are produced by mechanical operations of a computer on fixed states of its circuits. If we only knew the states of the circuit and the operations, then we would know the number that will come out next. To say that this prevents us from describing the numbers as random, however, is an instance of the mind-projection fallacy, as I have discussed before.

The mind projection fallacy consists of assuming that properties of our model of reality necessarily correspond to properties of reality. We often talk about a phenomenon being random. This makes it tempting to conclude that randomness is a property of the phenomenon itself, but this assumes too much, and also demands that the events we are talking about take place in some kind of bubble, where physics doesn't operate.

When I dip my hand into an urn to draw out a ball whose colour can't be predicted, the colour that comes out is rightly considered to be a random variable. But we are not talking about some quantum-mechanical wavefunction that collapses the moment the first photon from the ball hits my retina (or the moment the nerve impulse reaches my visual cortex, or any of a million other candidate moments). We are talking about a process with real and definite cause and effect relationships. The layout of the coloured balls inside the urn, the trajectory of my hand into the urn, and the exact moment I decide to close my hand uniquely determine the outcome of the draw. Randomness is not the occurrence of causeless events, but is a consequence of our incomplete information. It's not a property of the balls in the urn, but a property of our prior state of knowledge.

It might be that at microscopic scales, quantum stochastic variability really does emerge from an absence of causation, but there are two important points to note in relation to this discussion. Firstly, good scientists recognize the need for agnosticism on this front - there really isn't enough evidence yet to decide one way or the other (BBC Radio 4's excellent 'In Our Time' has an episode, entitled 'The measurement problem,' with an interesting discussion on the topic). Secondly, the vast majority of cases where the concept of randomness is applied concern macroscopic phenomena, where classical mechanics is a perfectly adequate model. For these reasons, the only sensible general usage of the word 'random' is when referring to missing information, rather than as a description of uncaused events. That Wikipedia article I quoted from, in apparent recognition of this, later cites Brownian motion and chaos as examples of randomness, thereby contradicting the earlier quote (though inexplicably, the two are identified as separate classes of random behaviour).

Getting back to the urn, if I knew precisely the coordinates (relative to my hand) and colours of the balls inside, the colour of the extracted sphere would not be a surprise, and therefore wouldn't be a random variable. But under the standard drawing conditions, in which these are not known, it is a random variable. Similarly, knowing the state and operations of a deterministic computer algorithm would render its output non-random (provided I have the computational resources elsewhere (and the inclination) to replicate those operations), but this does not affect the randomness of its output when we don't know these things.
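To make the point about deterministic algorithms concrete, here is a minimal sketch of a linear congruential generator in Python. The generator, its constants (the widely quoted 'Numerical Recipes' parameters) and the seed are my own choices for illustration, not anything a particular library is claimed to use. An observer who knows the seed and the recipe can reproduce every number in advance, while an observer who sees only the output has nothing better than a uniform distribution to describe the next value.

```python
# A toy linear congruential generator (LCG): state -> (A * state + C) mod M.
# The constants are illustrative; any reasonable LCG would make the same point.
M = 2**32
A = 1664525
C = 1013904223


def lcg(seed, n):
    """Return n successive outputs in [0, 1) starting from the given seed."""
    state = seed
    outputs = []
    for _ in range(n):
        state = (A * state + C) % M
        outputs.append(state / M)
    return outputs


secret_seed = 20130308  # hypothetical seed, known only to the first observer

# Observer 1 knows the seed and the algorithm, so can replicate the stream
# exactly: for this observer the output is not a random variable at all.
stream = lcg(secret_seed, 5)
prediction = lcg(secret_seed, 5)
print(prediction == stream)

# Observer 2 lacks the seed. A wrong guess at the seed gives a different
# stream, so the missing information is exactly what makes the output random.
wrong_guess = lcg(secret_seed + 1, 5)
print(wrong_guess == stream)
```

The algorithm is the same in both cases. All that differs between the two observers is their state of knowledge.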

And finally, how can there be a distinction between statistical fluctuations and changes in the causal environment? If I repeat the experiment with the urn and get a different coloured ball the second time, which is that, sampling variability, or a difference in causes? Both, of course. What could sampling variability (at the macroscopic scale) be the result of, if not mechanical differences in the evolution of the experiment? If I toss a coin 4 times and get 4 heads in a row, in one sense, that's a statistical fluke, but in another, it's the inevitable result of a system obeying completely deterministic mechanical laws. All that decides the level at which we find ourselves discussing the matter is our degree of awareness of the states and operations of nature.

Ok, so we live in a world polluted by some sloppy terminology, but does it really matter? I think it does. Every statistical model is an attempt to describe some physical process. As long as we systematically deny the action of physics on any aspects of these processes, then we close off access to potentially valuable physical insight. This is actually the problem that the discussed attempt to separate sampling fluctuation and causal variation was trying to address, but this half-hearted formulation serves only to postpone the problem until some later date.



Thursday, February 28, 2013

Legally Insane



David Spiegelhalter's blog, Understanding Uncertainty, informs us of a recent insane ruling from the England and Wales Court of Appeal, concerning the usability of probabilities as evidence in court cases. Lord Justice Toulson's ruling contains the following wisdom:
The chances of something happening in the future may be expressed in terms of percentage... But you cannot properly say that there is a 25 per cent chance that something has happened... Either it has or it has not.
This is wrong. Shockingly, scarily wrong. The judge is saying that probabilities only apply to future events, and not to past events, and is effectively decreeing that such evidence is inadmissible in a court of law. This ruling is based, it seems, on an earlier case, in which a judge ruled:
It is not, in my opinion, correct to say that on arrival at the hospital he had a 25 per cent. chance of recovery. If insufficient blood vessels were left intact by the fall he had no prospect of avoiding complete avascular necrosis, whereas if sufficient blood vessels were left intact on the judge's findings no further damage to the blood supply would have resulted if he had been given immediate treatment, and he would not have suffered the avascular necrosis.
It seems that Justice Toulson is not alone among judges in his profound ignorance of probability, logic and the basic principles of how knowledge is acquired. In fact, for almost 30 years, the legal profession has had its own special probabilistic fallacy, the prosecutor's fallacy, named after it. You might think that by now they should have made an effort to get to grips with basic methodology, but instead, they just keep on making stupid judgements.

I have pointed out that a judge who can claim that probabilities in general can't be assigned to past events is incompetent and not fit to perform their duties. Though it might seem harsh, I stand by this analysis. You might think that this is some obscure point of epistemology with little or no practical importance, but it is much more than that.

There are three main reasons blunders like this, made by a high court judge, should scare the crap out of society at large. Firstly, as pointed out, it demonstrates a serious ignorance of what probabilities are. Probability theory is completely symmetric with regard to time, and probabilities are simply a systematic way of quantifying our current knowledge. Just an obscure mathematical point? Not at all. Such ignorance shows off a complete disregard for what knowledge is, what it means to be rational, and what is actually involved when evidence is evaluated. What has been asserted is that calculations of the kind I performed when I worked out the probability that somebody has contracted a certain disease are completely meaningless. Maybe the judge won't find it meaningless next time he is in his doctor's office trying to plan his future. Call me overly strict, but I expect somebody in such a position of power, whose job consists to such a high degree of evaluating evidence, to be able to wield a modest understanding of what evidence actually is. How can any reasonable standard of statistical evidence be enforced, when judges are so ignorant about probability?
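For anyone who doubts that a probability can sensibly be attached to something that has already happened, here is a minimal sketch of the disease calculation in Python. The numbers (a 1% prior, 99% sensitivity, a 5% false-positive rate) and the function name are hypothetical, picked only to keep the arithmetic easy. The patient either has the disease or does not, and that was settled before the test was taken, yet the calculation is perfectly meaningful.

```python
def posterior_disease(prior, sensitivity, false_positive_rate):
    """P(disease | positive test), from Bayes' theorem."""
    # Total probability of a positive result, with or without the disease.
    p_positive = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_positive


# Hypothetical numbers, chosen only for illustration.
p = posterior_disease(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"P(disease | positive test) = {p:.3f}")
```

With these assumed inputs the posterior comes out at one in six, a statement about a past event that is neither 0 nor 1, and one that a doctor and patient can actually act on.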

But it's not just ignorance that's on display here - also a shocking disrespect for logic. In the same paragraph as his pronouncement on the meaninglessness of probabilities, when applied to past events, the judge manages to immediately contradict himself rather blatantly:
In deciding a question of past fact the court will, of course, give the answer which it believes is more likely to be (more probably) the right answer than the wrong answer 
How can anybody capable of holding such obviously incompatible positions at the same time, on any topic, be capable of presiding over a court? What is clear, is that every single judgement of fact, in every single sphere of life relies on some kind of probability assessment. The question that remains, then, is whether we want to make that assessment as systematic and rigorous as we can, or are we happy relying on unexamined instinct and faulty logic? For high-ranked judges to favour the latter is a nightmare scenario.

Secondly, this ruling, if implemented, would make an enormous variety of important types of evidence impossible to use in legal cases, and would severely hinder the capacity of courts to efficiently determine what is the likely truth. Genetic evidence, for example, is based on Bayesian calculations, as it must be, in order to attain validity. 

To present probabilities is to make the highest quality of inference available. Why is this judge against high-quality inference? Indeed, why is he so overtly opposed to scientific method? In Toulson's ruling, we also have this:
When judging whether a case for believing that an event was caused in a particular way is stronger that the case for not so believing, the process is not scientific...
Why not scientific? Why not demand the highest standards of logic and inference? Why is he not complaining that the process is not scientific enough, instead of insisting that we rely on some inefficient and non-systematic procedure? The mind boggles. There is only one correct way to assess the implications of evidence, and to quantitatively combine multiple pieces of evidence, and that is Bayes' theorem (techniques that successfully replicate its outcomes can occasionally be used also). Judge Toulson's ruling constitutes a rejection of Bayesian reasoning, and thereby demands that the legal profession turn its back on the rational evaluation of empirical facts.

Thirdly, the judges on this case have made a serious blunder with a technical issue, while obviously being unaware of their incompetence to reason about the topic. Certainly, a judge doesn't need to be an expert in all the technical subjects that may be relevant to any given case. But then they must be able to appreciate that the technical issues are beyond them. They cannot perform technical analyses that they are unqualified to perform. If they want to base their decisions on philosophy, probability, mathematical theorems, or whatever, they must damn well get it right, or ask somebody else, who knows what they are doing. In what other technical forensic issues are these judges hopelessly unaware of their complete lack of understanding? A society that aspires to be a free and enlightened society must not tolerate such oblivious overconfidence among people with such an important job.





Saturday, February 9, 2013

Inductive inference or deductive falsification?



Andrew Gelman and Cosma Rohilla Shalizi have just published an interesting paper [1], 'Philosophy and the practice of Bayesian statistics,' which is all about the underlying nature of science and Bayesian statistics, as well as practical elements of scientific inference that some researchers seem to be reluctant to indulge in. The paper comes accompanied by an introduction, five comments, and a final authors' response to comments (link to journal issue). I found the paper to be a highly thought-provoking read, and its extensive list of references serves as a summary of much of what's worth knowing about at the cutting edge of epistemology research. While I'm recommending reading this article, there are major components of its thesis that I disagree with.

Firstly, I love the title. Many would have been satisfied with 'The philosophy and practice of Bayesian statistics,' but clearly the authors have broader ambitions than that. Actually, I really applaud the sentiment.

In terms of their practical guidelines, Gelman and Rohilla Shalizi are spot on. They are talking about model checking - graphical or statistical techniques using simulated data generated according to the model supported by a statistical analysis, with the goal of assessing the appropriateness of that model. This is necessary, since any probability calculation is dependent on the context of the hypothesis space within which the problem is formulated. These hypothesis spaces, however, are not derived from some divine, infallible formula, but find their genesis in the whims of our imaginations. There is no guarantee, or even strong reason to suppose that the chosen set of alternative propositions actually contains a single true statement. A totally inappropriate theory, therefore, can attain a very high posterior probability, depending on the environment of models in which it finds itself competing. Within a given system of models, Bayes' theorem has no way to alert us to such calamities, and something additional to the standard Bayesian protocol is appropriate.
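A toy calculation shows how a bad model can look excellent in poor company. The sketch below is my own illustration in Python, with hypothetical data (9 heads in 10 tosses of a coin) and hypothetical candidate biases, each given equal prior weight. In a hypothesis space containing only the biases 0.1 and 0.3, the second takes almost all of the posterior probability, even though it fits the data badly. Extend the space to include a bias of 0.9 and the former favourite collapses.

```python
def posteriors(biases, heads, tosses):
    """Posterior over candidate coin biases, assuming equal prior weights."""
    # The binomial coefficient is common to all models, so it cancels out.
    likelihoods = [p**heads * (1 - p) ** (tosses - heads) for p in biases]
    total = sum(likelihoods)
    return {p: lik / total for p, lik in zip(biases, likelihoods)}


heads, tosses = 9, 10  # hypothetical data, for illustration only

# A narrow hypothesis space in which neither model describes the data well.
narrow = posteriors([0.1, 0.3], heads, tosses)

# The same data, with one more plausible alternative added to the space.
wide = posteriors([0.1, 0.3, 0.9], heads, tosses)

print(f"P(bias = 0.3) in the narrow space: {narrow[0.3]:.5f}")
print(f"P(bias = 0.3) in the wide space:   {wide[0.3]:.5f}")
```

Nothing inside the narrow calculation signals that anything is wrong. The warning only appears when the model's predictions are compared with the data, or when the hypothesis space is enlarged.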

Some researchers committed to the validity of the Bayesian program feel, apparently, that this additional process is inappropriate, because it seems to step outside the confines of Bayesian logic. I contend that this is mistaken, which I will explain in laying out my disagreement with the paper under discussion.

Edwin Jaynes (in honour of whose work the title of this blog was chosen) was also a strong advocate of model checking, and he pointed out many times that Bayesian methodology exerts its greatest power when it shows us that we need to discard a theory that can serve us no more. I'm reminded of one of my favorite Jaynes quotes [2]:

To reject a Bayesian calculation because it has given us an incorrect prediction is like disconnecting a fire alarm because that annoying bell keeps ringing. 

In terms of the philosophical discussion, Gelman and Rohilla Shalizi argue that model checking, while vital, is outside Bayesian logic and furthermore is not part of inductive inference. They claim that falsification of scientific models is deductive in nature. These three claims represent the main points of departure between my understanding and theirs.

I posted a comment on Professor Gelman's blog, so I'll just paste in directly from there (the details of that specific model from their discussion I refer to are not important, it's just an example they used):

You wrote: 
“It turned out that this varying-intercept model did not fit our data, … We found this not through any process of Bayesian induction but rather through model checking.” 
I agree on the value of model checking, but I wonder if this is really distinct from inductive inference. In order to say that the model was inappropriate, don’t you think that you must, at least informally, have assigned it a low probability? In which case, your model checking procedure seems to be a heuristic designed to mimic efficiently the logic of Bayesian induction. 
Even if you didn’t formally specify a hypothesis space, what you seem to have done is to say ‘look, this model mis-matches the data so much that it must be easy to find an alternate model that would achieve a much higher posterior.’ As such, the process of model checking attains absolutely strict validity only when that extended hypothesis space is explicitly examined, and your intuition numerically confirmed. Granted, many cases will be so obvious that the full analysis isn’t needed, but hopefully you get the point. 
There certainly is a strong asymmetry between verification and falsification, but I can’t accept your thesis that falsification is deductive. Sure, it’s typically harder for a model with an assigned probability near zero to be brought back into contention than it is for a model with currently very high probability to be crushed by new evidence, but it’s not in principle impossible. Newtonian mechanics might be the real way of the world, and all that evidence against it might just have been a dream. The problem is that this requires not just Newtonian mechanics, but Newtonian mechanics + some other implausible stuff, which as intuition warns (and mathematics can confirm) deserves very small prior weight. (A currently favorable model can always be superseded by another model with not significantly greater complexity, which accounts for the asymmetry between falsification and verification.) The mathematics that verifies this is Bayesian and, it seems to me, inductive.
That we can apparently falsify a theory without considering alternatives seems to be simply this strong asymmetry allowing Bayesian logic to be reliably but informally approximated without specifying the entire (super)model.

By the way, the mathematics that confirms the low prior for propositions like Newton + all that extra weird stuff is Ockham's razor (a.k.a. Bayes' theorem).

I have pointed out previously that non-Bayesian techniques certainly have their usefulness, but that their validity is limited to the extent that they succeed in replicating the result of a Bayesian calculation. Model checking seems to me to be no exception. Indeed, the recommended model checks can only work when the outcome is so obvious that the full, rigorous analysis is not needed. If the case is too close to call by these techniques, then you must roll out Bayes' theorem again, or stick with your current model. This should be obvious.

I have laid out the Bayesian basis for falsificationism elsewhere, but I did not discuss this asymmetry between falsifying and verifying theories, which I also think is important. Some Bayesian methodologists seem to hold the view that they have equal status, but they do not. They are not, however, as asymmetric as Popper felt - he did not accept that any form of verification was ever valid. One must wonder, then, why he had any interest whatsoever in science. Science does not, however, verify by making absolute statements about a theory's truth, but rather, its statements are to be seen as of the 'less wrong' type.

The idea that falsification is deductive seems to me indefensible. It is entirely statistical, and is just as incapable of absolute certainty as any other reasoning about phenomena in the real world (except perhaps where propositions are falsified on the basis that they are incoherently expressed, though perhaps it is better not to say such things are false, but rather null statements). If falsification is deductive, how big do the residuals between data and model need to be in order to reach that magic tipping point? 

Oh, and Professor Gelman's response to my comment?


'Fair enough.'

Actually, he said a little more than that, but I think it's fair to say he conceded that strictly, I had a good point. (You may judge for yourself if you wish.)

Anyway, I'm looking forward to dipping into some of the published comments and responses, as well as some of the many materials cited in this paper. Recommended reading for anyone interested in epistemology - what we can know, and under what circumstances we can know it. By the way, all the articles in that journal seem to be open access, so bravo, hats off, and three cheers to the British Journal of Mathematical and Statistical Psychology.






[1]  'Philosophy and the practice of Bayesian statistics,' British Journal of Mathematical and Statistical Psychology, February 2013, Vol. 66, Issue 1, pages 8 - 38 (link)


[2] 'Clearing up mysteries, the original goal,' by E.T. Jaynes, in 'Maximum entropy and Bayesian methods,' edited by J. Skilling, Kluwer Publishing, 1989



Wednesday, January 16, 2013

Natural Selection By Proxy



Here, I'll give a short summary of one of my favourite studies of recent decades: John Endler's ingenious field and laboratory experiments on small tropical fish [1], which in my (distinctly non-expert) opinion constitute one of the most compelling and 'slam-dunk' proofs available of the theory of biological evolution by natural selection. After I've done that, in an act of unadulterated vanity, I'll suggest an extension to these experiments that I feel would considerably boost the information content of their results. That will be what I have dubbed 'selection by proxy'.

Don't get me wrong, Endler's experiments are brilliant. I first read about them in Richard Dawkins' delightful book, 'The Greatest Show on Earth,' and they captured my imagination, which is why I'm writing about them now.

Endler worked on guppies, small tropical fish, the males of which are decorated with coloured spots of varying hues and sizes. Different populations of guppies in the wild were found to exhibit different tendencies with regard to these spot patterns. Some populations show predominantly bright colours, while others prefer more subtle pigments. Some have large spots, while others have small ones. It's easy to contemplate the possibility that these differences in appearance are adaptive under different conditions. Two competing factors capable of contributing a great deal to the fitness of a male guppy are (1) ability to avoid getting eaten by predatory fish, and (2) ability to attract female guppies for baby making. Vivid colourful spots might contribute much to (2), but could be a distinct disadvantage where (1) is a major problem, and if coloration is determined by natural selection, then we would expect different degrees of visibility to be manifested in environments with different levels of predation. And so colour differences might be accounted for.

Furthermore, the idea suggests itself to the insightful observer that in the gravel-bottomed streams in which guppies often live, a range of spot sizes that's matched to the predominant particle size of the gravel in the stream bed would help a guppy to avoid being eaten, and that the tendency for particle and spot sizes to match will be greater where predators are more of a menace, and greater crypsis is an advantage. 

These considerations lead to several testable predictions concerning the likely outcomes if populations of guppies are transplanted to environments with different degrees of predation and different pebble sizes in their stream beds. These predicted outcomes are extremely unlikely under the hypothesis that natural selection is false. Such transplantations, both into carefully crafted laboratory environments, and into natural streams with no pre-existing guppy populations, constituted the punch line of Endler's experiments, and the observed results matched the predictions extraordinarily closely, after only a few months of naturally selected breeding. 

It's the high degree of preparatory groundwork and the many careful controls in these experiments, however, that result in both the high likelihood, P(Dp | H I), for the predicted outcome under natural selection, and the very low likelihood, P(Dp | H' I), under the natural-selection-false hypothesis. These likelihoods, under almost any prior, lead to only one possible logical outcome, when plugged into Bayes' theorem, and make the results conclusive.
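The phrase 'under almost any prior' can be checked with a few lines of arithmetic. The sketch below, in Python, uses made-up stand-ins for the two likelihoods (0.9 for P(Dp | H I) and one in a million for P(Dp | H' I)). These are assumptions of mine for illustration, not values taken from Endler's paper. It then runs Bayes' theorem for priors on H ranging from even odds down to one in a thousand.

```python
def posterior_h(prior_h, like_h, like_not_h):
    """P(H | Dp I) from Bayes' theorem with two exhaustive hypotheses."""
    evidence = prior_h * like_h + (1 - prior_h) * like_not_h
    return prior_h * like_h / evidence


# Hypothetical likelihoods, standing in for P(Dp | H I) and P(Dp | H' I).
LIKE_H = 0.9
LIKE_NOT_H = 1e-6

# Try a generous, a sceptical and a very sceptical prior for H.
results = {}
for prior in (0.5, 0.01, 0.001):
    results[prior] = posterior_h(prior, LIKE_H, LIKE_NOT_H)
    print(f"prior = {prior:<6} posterior = {results[prior]:.5f}")
```

With likelihoods this lopsided, even the most sceptical of these priors ends up with a posterior above 0.99, which is the sense in which the prior hardly matters.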

The established fact that the patterning of male guppies is genetically controlled served both causes. Of course, natural selection cannot act in a constructive way if the selected traits are not passed on to the next generation, so the likelihood under H goes up with this knowledge. At the same time, alternate ways to account for any observed evolution of guppy appearance, such as developmental polymorphisms or phenotypic plasticity (such as the colour variability of chameleons, to take an extreme example), are ruled out, hitting P(Dp | H' I) quite hard.

Observations of wild populations had established the types of spot pattern frequent in areas with known levels of predation - there was no need to guess what kind of patterns would be easy and difficult for predators to see, if natural selection is the underlying cause. The expected outcome under this kind of selection could be forecast quite precisely, again enhancing the likelihood function under natural selection.

Selection between genotypes obviously requires the presence of different genotypes to select from, and in the laboratory experiments, this was ensured by several measures leading to broad genetic diversity within the breeding population. This, yet again, increased P(Dp | H I). (Genetic diversity in the wild is often ensured by the tendency for individuals to occasionally get washed downstream to areas with different selective pressures, which is one of the factors that made these fish such a fertile topic for research.)

The experiment employed a 3 × 2 factorial design. Three predation levels (strong, weak, and none) were combined with 2 gravel sizes, giving 6 different types of selection. The production of results appropriate for each of these selection types constitutes a very well defined prediction and would certainly be hard to credit under any alternate hypothesis, and P(Dp | H' I) suffers further at the hands of the expected (and realized) data.
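For readers unfamiliar with factorial designs, the six selection regimes are simply every pairing of a predation level with a gravel size. A short Python sketch enumerates them. The predation labels are those given above, while the gravel labels 'coarse' and 'fine' are placeholders of my own.

```python
from itertools import product

predation_levels = ["strong", "weak", "none"]
gravel_sizes = ["coarse", "fine"]  # hypothetical labels for the 2 gravel sizes

# Every combination of one predation level with one gravel size.
treatments = list(product(predation_levels, gravel_sizes))

for predation, gravel in treatments:
    print(f"predation: {predation:<6} gravel: {gravel}")
```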

Finally, additional blows were dealt to the likelihood under H', by prudent controls eliminating the possibility of effects due to population density and body size variations under differing predation conditions.

With this clever design and extensive controls, the data that Endler's guppies have yielded offer totally compelling evidence for the role of natural selection. Stronger predation led unmistakably to guppies with less vivid coloration, and greater ability to blend inconspicuously with their environment, after a relatively small number of generations.

I first read about these experiments with great enjoyment, but there was another thing that came to my mind: what the data did not say. It is quite inescapable from the results that natural selection of genetic differences was responsible for observed phenotypic changes arising in populations placed in different environments, but the data say nothing about the mechanism leading to those genetic differences. This, of course, is something that is central to the theory of natural selection. Indeed, we might consider the full name of this theory to be 'biological evolution by natural selection of random genetic mutations.' For the sake of completeness, we would like to have data that speak not only of the natural selection part, but also of the random basis for the genetic transformation.

I'm not saying that there is any serious doubt about this, but neither was there serious doubt about natural selection prior to Endler's result. (In fact, there is some legitimate uncertainty about the relative importance of natural selection vs. other processes, such as genetic drift - uncertainty that work of Endler's kind can alleviate.) The theory of biological evolution, though, is a wonderful and extremely important theory. It stands out for a special reason: every other scientific theory we have is ultimately guaranteed to be wrong (though the degree of wrongness is often very small). Evolution by natural selection is the only theory I can think of that in principle could be strictly correct (and with great probability is), and so deserves to have all its major components tested as harshly as we reasonably can. This is how science honours a really great idea.

To test the randomness of genetic mutation, we need to consider alternative hypotheses. I can think of only one with non-vanishing plausibility: that at the molecular level, biology is adaptive in some goal-seeking way. That the cellular machinery strives, somehow, to generate mutations that make their future lineages more suitably adapted to their environment. I'll admit the prior probability is quite low, but I (as an amateur in the field) think it's not impossible to imagine a world in which this happens, and as the only remotely credible contender, we should perhaps test it.

We could perform such a test by arranging for natural selection by proxy. That is, an experiment much like Endler's, but with a twist: at each generation, the individuals to breed are not the ones that were selected (e.g. by mates or (passively) by predators), but their genetically identical clones. At each generation, pairs of clones are produced, one of which is added to the experimental population, inhabiting the selective environment. The other clone is kept in selection-free surroundings, and is therefore never exposed to any of the influences that might make goal-seeking mutations work. Any goal-seeking mechanism can only plausibly be based on feedback from the environment, so if we eliminate that feedback and observe no difference in the tendency for phenotypes to adapt (compared to a control experiment executed with the original method), then we have the bonus of having verified all the major components of the theory. And if, against all expectation, there turned out to be a significant difference between the direct and proxy experiments, it would be the discovery of the century, which for its own sake might be worth the gamble. Just a thought.







[1] 'Natural Selection on Color Patterns in Poecilia reticulata,' Endler, J.A., Evolution, 1980, Vol. 34, pages 76 - 91 (Downloadable here)