Saturday, May 24, 2014

Pass / Fail Mentality



Recently, I was talking about calibration (here and here), and how it should be more than just identifying the most likely cause of the output of a measuring instrument. The calibration process should strive to characterize the range of true conditions that might produce such an output, along with any major asymmetries (bias) in the relationship between the truth and the instrument's reading. In short, we need to identify all the major characteristics of the probability distribution over states of the world, given the condition of our measuring device.
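
As a minimal sketch of what that amounts to in practice (the instrument model, its bias, its noise, and the flat prior below are all invented for illustration): given a model of how readings depend on the true state, invert it to obtain a posterior distribution over true states, and then summarize that whole distribution, not just its peak.

```python
import numpy as np

# Hypothetical instrument model (all numbers invented): the reading sits
# above the true value by a bias that grows with the true value, and the
# noise also grows with the true value, so the inversion is asymmetric.
def likelihood(reading, true_values):
    bias = 0.05 * true_values
    noise_sd = 0.3 + 0.05 * true_values
    z = (reading - (true_values + bias)) / noise_sd
    return np.exp(-0.5 * z**2) / noise_sd

true_grid = np.linspace(0.0, 20.0, 2001)   # candidate states of the world
dx = true_grid[1] - true_grid[0]
prior = np.ones_like(true_grid)            # flat prior, for simplicity

reading = 5.1                              # what the instrument says
posterior = likelihood(reading, true_grid) * prior
posterior /= posterior.sum() * dx          # normalize over the grid

# Characterize the whole distribution, not just its most probable value.
mean = (true_grid * posterior).sum() * dx
var = ((true_grid - mean) ** 2 * posterior).sum() * dx
skew = ((true_grid - mean) ** 3 * posterior).sum() * dx / var**1.5
peak = true_grid[np.argmax(posterior)]
print(f"peak {peak:.2f}, mean {mean:.2f}, sd {var**0.5:.2f}, skewness {skew:+.2f}")
```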

Failure to go beyond simply identifying each possible instrument reading with a single most probable cause is a special case of a very general problem that, in my opinion, plagues scientific practice. Such a major failure mode should have a special name, so let's call it 'pass / fail mentality.' It is the most extreme possible form of a fallacy known as spurious precision, and it involves needlessly throwing away information.

In The Calibration Problem, I argued that science is necessarily inductive, and consists, ideally, of producing probability distributions over sets of exclusive and hopefully exhaustive propositions. To present a result with an error bar that is too narrow to accurately characterize one of these probability distributions is to be guilty of spurious precision. To collapse that error bar to zero width: 'X is 5.1,' is to take this fallacy as far as it can go.

Science is often considered to consist of tests of Boolean propositions (molecule M speeds up recovery from disease D, planet Earth is getting warmer, humans and dandelions share a common ancestor, etc.), and so presenting a finding with zero recognized uncertainty often consists of a statement of the form, "proposition X passed the test," or "proposition X failed the test." Hence the name I'm giving to this fallacy. What we would prefer is a statement of the form, "proposition X, achieving a probability of 0.87, looks quite reliable."

Better still would be to examine an uncollapsed hypothesis space, such that instead of saying, "yes, the Earth is getting warmer," or even, "yes, the Earth is very probably getting warmer," we would say something like, "the rate of change of the Earth's surface temperature is X ± Y."

Unfortunately, there seems to be a tendency in human nature to prefer statements of exaggerated precision. Probability theory gives us the tools to manage uncertainty. Until we properly understand probability, therefore, we do not understand uncertainty. And ill-understood uncertainty is a scary thing. Yet, we know that the function of science is to help banish our fear of the unknown. Thus, it's often overwhelmingly tempting to assume that the function of science is to banish uncertainty completely.

This is the temptation that many a radical skeptic has succumbed to and / or manipulated in others. The climate-change denier, the young-Earth creationist, the anti-vaccination lobbyist - all use the same tactic: Look, they can't decide if the Earth is 4.53 or 4.54 billion years old! They have no certainty about anything!

So strong is this desire to see uncertainty eliminated (and so strong, perhaps, the desire not to give ammunition to the radical skeptic) that much of the way science is conducted and reported is built around this flawed model: Based on our results, we have decided that P is true. Effect Q has passed the test of statistical significance.

Often it has been said by gurus of scientific method that science must proceed by asking specific questions, and that these questions must be of the yes / no variety. Hence, we have seen debates between the rival philosophies of verificationism vs. falsificationism: should we proceed by proving our theories true, or by proving them false? This debate is misguided - strictly speaking, we can do neither.

Many a data set has been left unpublished because it was insufficient to answer any question 'conclusively.' But this inconclusiveness is itself a useful piece of information. It indicates, for example, that any possible effect size is likely to be small, and in combination with other weakly-informative studies in meta-analysis, it can be used to help aggregate a more informative result.
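
To illustrate (with invented numbers, not data from any actual study), here is the standard fixed-effect, inverse-variance pooling that a meta-analysis might apply to three individually 'inconclusive' results:

```python
import numpy as np

# Three hypothetical studies: each 95% interval comfortably includes zero,
# so each would be reported as a 'failure' under pass / fail thinking.
effects = np.array([0.30, 0.18, 0.25])   # estimated effect sizes (invented)
std_errs = np.array([0.20, 0.19, 0.22])  # their standard errors (invented)

# Fixed-effect, inverse-variance pooling.
weights = 1.0 / std_errs**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect: {pooled:.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```

Three results that individually 'failed' combine into a pooled interval noticeably tighter than any of them, which is only possible if their full error bars were reported rather than a bare verdict of 'not significant.'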

The tendency to not publish ambiguous results is not always the fault of the practicing researcher either. Most scientific journals will publish only conclusive results, and of those, positive results usually receive far more favorable attention.

Null-hypothesis significance testing (one of the most common forms of data analysis in use in contemporary scientific literature) is a classic example of the pass / fail mentality running rampant in scientific endeavour. A threshold is set, and if some data-derived metric exceeds that threshold, then the data go on to fame and fortune. Otherwise, they go to the back of the filing cabinet. This madness is not limited to frequentist methods, though: any procedure that sets a hard boundary between finding and non-finding will exhibit the same flaw, and it won't matter whether that boundary is a p-value, a posterior probability, or a likelihood ratio.

Odd things occur when experimental data are examined using threshold-like criteria, such as the α-levels applied to p-values. Meta-analysts have found that the scientific literature can exhibit an excess of results that only just cross the significance threshold, relative to what statistical analysis predicts the distribution of p-values should look like. For example, Masicampo and Lalande [1] found an anomalous spike in the number of reported p-values just below 0.05 in the field of psychology. Very many studies use the same arbitrary significance level of p = 0.05, so that a p-value greater than 0.05 is considered inconclusive. One interpretation is that many researchers achieving p-values close to, but not quite crossing, the significance barrier tended to 're-work' their analyses in various ways until the magic α-level was crossed. Use of thresholds leads to distortion of the evidence.
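
To see how such a spike can arise, here is a deliberately crude simulation (not a model of any real study): every 'experiment' tests a null effect with a two-sample t-test, and whenever the p-value lands a little above 0.05 the analysis is 're-worked' by re-running the test on randomly trimmed subsets of the same data, stopping as soon as the threshold is crossed or patience runs out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n=30, n_tries=5):
    """A null experiment, with crude selective re-analysis near p = 0.05."""
    x, y = rng.normal(size=n), rng.normal(size=n)
    p = stats.ttest_ind(x, y).pvalue
    tries = 0
    # 'Re-work' the analysis only when the result is tantalizingly close:
    # a stand-in for trying different covariates, outlier rules, etc.
    while 0.05 < p < 0.10 and tries < n_tries:
        keep = rng.random(n) > 0.1            # drop ~10% of points at random
        p = stats.ttest_ind(x[keep], y[keep]).pvalue
        tries += 1
    return p

pvals = np.array([one_study() for _ in range(20000)])
print(f"fraction in (0.04, 0.05]: {np.mean((pvals > 0.04) & (pvals <= 0.05)):.4f}")
print(f"fraction in (0.05, 0.06]: {np.mean((pvals > 0.05) & (pvals <= 0.06)):.4f}")
# With honest analysis both fractions would be about 0.01; selective
# re-analysis piles up probability just below the threshold.
```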

Applied probability is called decision theory. It is the endeavour to determine how to act, based on our empirical findings. Where actions are to be taken, a probability distribution will often need to be collapsed somewhere. We can't always distribute our actions - it doesn't usually work to undergo 70% of a surgery. (Though mixed strategies very often are possible, and advisable.) Thus, on superficial analysis, the introduction of a threshold of significance, or the expression of a parameter estimate as an infinitely narrow spike in probability space, may seem reasonable: if action requires us to gamble on a single state of the world (e.g. "only surgery can save your life"), what else are we to do?

The problem is that the pass / fail mentality, perhaps motivated by a crude appreciation of decision-theoretic considerations, does not explicitly acknowledge any elements of decision theory, and totally fails to implement the most basic aspects of decision analysis.

Many a frequentist chastises the Bayesian for introducing personal bias, in the form of the prior probability distribution, without realizing that frequentist techniques do exactly this, arbitrarily, and without acknowledgement. By implementing arbitrary α-levels, the significance tester is effectively setting a decision threshold without ever formally, or even approximately, evaluating any utility function, which is one of the things that any decision analysis must have. In fact, it's even worse: the introduction of the decision threshold goes so unacknowledged that the urge, upon completing the calculation, is not to say, "for economic reasons, we should behave as if X is true," but rather to simply say, "X is true." Almost nobody actually, explicitly thinks this way, but the conventions of reporting have been set up in such a way that this non sequitur represents a real point of psychological attraction, which easily impacts on the thinking and behaviour of the insufficiently wary.
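
For contrast, here is what even a back-of-the-envelope decision analysis looks like when the utility function is made explicit. The payoffs below are invented; the point is only that the probability threshold for acting falls out of them, rather than being fixed at 0.05 by convention.

```python
# A minimal sketch of the decision analysis that an arbitrary alpha-level
# quietly skips. All payoffs below are invented for illustration only.

def expected_utility(p_effect, utilities, action):
    """Expected payoff of `action`, where p_effect is the probability that
    the effect is real and utilities[action][state] is the payoff of taking
    `action` when the world is in `state`."""
    u = utilities[action]
    return p_effect * u["effect"] + (1.0 - p_effect) * u["no_effect"]

utilities = {
    "act":        {"effect": 100.0, "no_effect": -40.0},  # e.g. deploy a treatment
    "do_nothing": {"effect": -20.0, "no_effect":   0.0},
}

for p in (0.1, 0.2, 0.3, 0.9):
    best = max(utilities, key=lambda a: expected_utility(p, utilities, a))
    print(f"P(effect is real) = {p:.1f} -> best action: {best}")

# Setting the two expected utilities equal gives the break-even point:
# 140*p - 40 = -20*p, i.e. p = 0.25. The 'significance threshold' falls
# out of the utility function; it is not a universal constant like 0.05.
```

With a different utility function (a cheaper intervention, a more dangerous disease) the break-even probability moves, which is exactly the point: the threshold belongs to the decision problem, not to the evidence.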

One of the implicit assumptions in pass / fail thinking is that the best point estimate is the peak (or sometimes the mean) of the probability distribution. Partly this arises because there remains confusion as to whether the exercise is one of decision, or one of pure science: determining what is true. But a stupidly simple toy example of decision making serves to show that the optimum, with respect to action, depends on the utility function, and can be arbitrarily far from the probability peak.

Suppose we play a gambling game, but it's not much of a gamble, as the cost of entry into the game is zero. An exotic roulette wheel produces outcomes that are Poisson distributed, with mean equal to 10, such that the probability to land on 38 (the highest number on the wheel) is extremely low (about 9 × 10⁻¹², in fact). If the outcome matches our prediction, we win a prize that depends on what the outcome is. It just happens that only one of the prizes is non-zero (and positive) - the prize awarded for a correctly predicted outcome of 38. What outcome should we predict? Obviously, in this trivial game, our expected utility is maximized by betting on 38, even though it has a very low probability of arising.
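
A few lines of Python confirm both the tiny probability and the conclusion; the prize value below is an arbitrary stand-in, since only the fact that it is the sole non-zero prize matters.

```python
from math import exp, factorial

def poisson_pmf(k, mean=10):
    """Probability that the exotic wheel lands on k."""
    return mean**k * exp(-mean) / factorial(k)

# Prize schedule from the toy game: only a correctly predicted 38 pays out.
# The actual prize value is arbitrary; call it a million units.
def prize(outcome):
    return 1_000_000.0 if outcome == 38 else 0.0

def expected_winnings(prediction):
    # We are paid only if the wheel lands exactly on our prediction.
    return poisson_pmf(prediction) * prize(prediction)

print(f"P(outcome = 38) = {poisson_pmf(38):.1e}")              # ~ 8.7e-12
best = max(range(39), key=expected_winnings)
print(f"prediction that maximizes expected winnings: {best}")  # 38
print(f"expected winnings at 38: {expected_winnings(38):.2e}")
print(f"expected winnings at 10 (near the peak): {expected_winnings(10):.2e}")  # 0.0
```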

In the post on the calibration problem, I showed in detail why science will always be fundamentally concerned with calculating probability distributions. Hopefully the above considerations have helped to illustrate further why as much of the detail of those probability distributions as possible should be retained when the final analysis is assessed and reported. To present a point estimate without an error bar is frankly unthinkable for any conscientious scientist. To neglect to mention any strong asymmetry of the resulting probability curve is to needlessly discard valuable information, and such fine details, once eliminated, represent a lost opportunity when evidence from multiple studies is to be aggregated. Sometimes action requires a hard decision, but do the decision analysis explicitly, so that you know what you have done, and so that others can review your assumptions and determine whether your utility function is the same as theirs.





References



[1] Masicampo, E. J. and Lalande, D. R., 'A peculiar prevalence of p values just below .05,' Quarterly Journal of Experimental Psychology, 65 (11), 2271-2279, 2012.



