Friday, December 21, 2012


I often marvel at the achievements of the early scientific pioneers, the Galileos, the Newtons, and the like. Their degree of understanding would have been extraordinary under any circumstances, but as if it wasn't hard enough, they had almost no technical vocabulary to build their ideas from. They had to develop the vocabulary themselves. How did they even know what to think, without a theoretical framework already in place? Amazing. But at other times, I wonder if their situation was not one of incredible intellectual liberty, almost entirely unchained by technical jargon and untrammelled by rigorous notation. Perhaps it was a slight advantage for them, not to have those vast regions of concept space effectively cut off from possible exploration by the focusing effects of a mature scientific language. Standardized scientific language may or may not limit the ease with which novel ideas are explored, but I think there are strong grounds for believing that jargon can actively inhibit comprehension of communicated ideas, as I now want to explore.

It's certainly true that beyond a certain elementary point, scientific progress, or any kind of intellectual advance, is severely hindered without the existence of a robust technical vocabulary, but we should not conflate the proliferation of jargon with the advance of understanding. Standardized terminology is vital for ‘high-level’ thought and debate, but all too often, we seem to see this terminology as an indicator of technical progress or sophisticated thought, when it is the content of ideas we should be examining for such indications.

There is a common con trick, one is almost expected to use in order to advance oneself, which consists of enhancing credibility by expanding the number of words one uses and the complexity of the phrases they are fitted into. It seems as though one is trying to create the illusion of intellectual rigour and content, and perhaps it's not a bad guess to suggest that jargon proliferates most wildly where intellectual rigour is least supported by the content of the ideas being expressed. Richard Dawkins relates somewhere (possibly in ‘Unweaving the Rainbow’) a story of a post-modernist philosopher who gave a talk, and in reply to a questioner who said that he wasn't able to understand some point, replied ‘oh, thank you very much.’ This suggests that the content of the idea was not important, otherwise the speaker would certainly have been unhappy that it was not understandable. Instead, it was the level of difficulty of the language that gave the talk its merit.

It has been shown experimentally that adding vacuous additional words can have a powerful psychological effect. Ellen Langer's famous study [1], for example, consisted of approaching people in the middle of a photocopying job, and asking to butt in. If the experimenter (blinded to the purpose of the experiment) said “Excuse me, I have 5 pages. May I use the xerox machine?,” a modest majority of people let her (60%), but if she said “Excuse me, I have 5 pages. May I use the xerox machine, because I have to make copies?” the number of people persuaded to step aside was much greater (93%). This shows clearly how words that add zero information can greatly enhance credibility - an effect that is exploited much too often, and not just by charmers, business people, sports commentators, and post-modernists, but by scientists as well. The other day I was reading an academic article on hyperspectral imaging, a phrase that made me uneasy - I wondered what it was - until I realised that ‘hyperspectral imaging’ is exactly the same thing as, yup, ‘spectral imaging.’

Even if we have excised the redundancy from jargon-rich language, I often suspect that technical jargon can actually impede understanding. Just as an unnecessary multiplicity of terms can enhance credibility at the photocopier, I suspect that recognition of familiar jargon gives one an easy feeling which is too often mistaken for comprehension. You can test this with skilled scientists, by tinkering just a little bit with their beloved terminology, and observing their often blank or slightly panicked expressions. Once, when preparing a manuscript on the lifetimes of charged particles in semiconductors (the lifetime is similar to the half-life in radioactivity), in one place I substituted ‘lifetime’ with the phrase ‘survival time.’ When I showed the text to a close colleague (and a far better experimentalist than me) for comments, he was very uncomfortable with this tiny change. He seemed unable to relate the new phrase to his established technical lexicon.

You might think that this uneasiness is due to the need for each scientific term to be rigorously defined and used precisely, but it's not. Scientists mix up their jargon all the time quite freely, and without anybody batting an eyelid most of the time. I have read, for example, an extremely technical textbook in which an expert author copiously uses the term ‘cross-section’ (something related to a particle's interactability, and necessarily with units of area) in place of frequency, reaction probability, lifetime, mean free path, and a whole host of concepts, all somewhat related to the tendency of a pair of particles to bump into each other. Nobody minds (except for grumpy arses like me), simply because the word is familiar in the context.

Tversky and Kahneman have provided what I interpret as strong experimental evidence [2] for my theory that jargon substitutes familiarity for comprehension. Two groups of study participants were asked to estimate a couple of probable outcomes from some imaginary health survey. One group was asked two questions in the form ‘what percentage of survey participants do you think had had heart attacks?’ and ‘what percentage of the survey participants were over 55 and had had heart attacks?’ By simple logic, the latter percentage cannot be larger than the first, as ‘over 55 and has had a heart attack’ is a subset of ’has had a heart attack,’ but 65% of subjects estimated the latter percentage as the larger. This is called the conjunction fallacy. Apparently, the greater detail, all parts of which sit comfortably together, creates a false sense of psychological coherence that messes with our ability to gauge probabilities properly.

The other group was asked the same questions but worded differently: ‘out of a hundred survey participants, how many do you think had had heart attacks?’ and ‘how many do you think were over 55 and had had heart attacks?’ Subjects in the second group turned out to be much less likely to commit the conjunction fallacy - only 25% this time. This seems to me to show that many people can comfortably use a technical word, such as ‘percentage,’ almost every day, without ever forming a clear idea in their heads of what it means. If the people asked to think in terms of percentages had properly examined the meaning of the word, they would necessarily have found themselves answering exactly the same question as the subjects in the other group, and there should have been no difference between the two groups’ abilities to reason correctly. Having this familiar word, ‘percentage,’ which everyone recognizes instantly, seems to stand in the way of a full comprehension of the question being asked. Over-reliance on technical jargon actually does impede understanding of technical concepts. This seems to be particularly true when familiar abstract ideas are not deliberately translated into the concrete realm.

When I read a piece of technical literature, I have a deliberate policy with regard to jargon that greatly enhances my comprehension. As with the ‘hyperspectral imaging’ example, redundancy upsets me, so I mentally remove it, allowing myself to focus on the actual (uncrowded) information content. In that case, I actually had to perform a quick internet search to convince myself that the ‘hyper’ bit really was just hype, before I could comfortably continue reading. Once all the unnecessary words have been removed, I typically reread each difficult or important sentence, with technical terms mentally replaced with synonyms. This forces me to think beyond the mere recognition of beguiling catchphrases, and coerces an explicit relation of the abstract to the real. It's only after I can make sense of the text with the jargon tinkered with in this way that I feel my understanding is at an acceptable level. And if I can't understand it after this exercise, then at least I have the advantage of knowing that I don't.

For writers, I wonder if there is some profit to be had, in terms of depth of appreciation, by occasionally using terms that are unfamiliar in the given context. The odd wacky metaphor might be just the thing to fire up the reader's sparkle circuits.

[1] The Mindlessness of Ostensibly Thoughtful Action: The Role of "Placebic" Information in Interpersonal Interaction, Langer E., Blank A., and Chanowitz B., Journal of Personality and Social Psychology, 1978, Vol. 36, No. 6, Pages 635-42 (Sorry, the link is paywalled.)

[2] Extension versus intuitive reasoning: The conjunction fallacy in probability judgment, Tversky, A., and Kahneman, D., Psychological Review, 1983, Vol. 90, No. 4, Pages 293–315

Monday, December 10, 2012

The Regression Fallacy

Consider a teacher, keen to apply rational techniques to maximize the effectiveness of his didactic program. For some time, he has been gathering data on the outcomes of certain stimuli aimed at improving his pupils' performance. He has been punishing those students that under-perform in discrete tasks, and rewarding those that excel. The results show, unexpectedly, that performances were improved, on average, only for pupils that received punishments, while those that were rewarded did worse subsequently. The teacher is seriously considering desisting from future rewards, and continuing only with punishments. What would be your advice to him?

First note that a pupil's performance in any given task will have some significant random component. Luck in knowing a particular topic very well, mood at the time of execution of the task, degree of tiredness, preoccupation with something else, or other haphazard effects could conspire to affect the student's performance. The second thing to note is this: if a random variable is sampled twice, and the first sample is far from average, then the second is most likely to be closer to the average. Neglect of this simple fact is common, and is a special case of the regression fallacy.
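The teacher's puzzling data can be reproduced with a toy simulation. Everything below is invented for illustration (the ability distribution, the noise level, the class size): a pupil's score is modelled as a fixed ability plus random noise, and no intervention of any kind happens between the two tests.

```python
import random

random.seed(1)

# Invented model: each pupil has a fixed ability, and each test score
# is that ability plus independent random noise.
abilities = [random.gauss(50, 10) for _ in range(10_000)]
test1 = [a + random.gauss(0, 10) for a in abilities]
test2 = [a + random.gauss(0, 10) for a in abilities]

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

# "Punished" pupils: the worst 10% on test 1; "rewarded": the best 10%.
pairs = sorted(zip(test1, test2))
punished = pairs[:1000]
rewarded = pairs[-1000:]

# Nothing was done to either group, yet the extremes drift back
# toward the middle on the second test.
print(mean(t2 - t1 for t1, t2 in punished))  # positive: "improvement"
print(mean(t2 - t1 for t1, t2 in rewarded))  # negative: "decline"
```

The punished group 'improves' and the rewarded group 'declines' purely by chance - exactly the pattern the teacher observed.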

If a pupil achieves an outstanding result in some test, then probably this was partly due to the quality of the student, and partly due to random factors. It is also most likely that the random factors contributed positively to the result. So a particular sample of this random variable has produced a result far into the right-hand tail of its probability distribution. The odds of a subsequent sample from the same probability distribution being lower than the first are given by the ratio of the areas of the distribution on either side of the initial value. These odds are relatively high.

Imagine a random number generator that produces an integer from 1 to 100 inclusive, all equally probable. Suppose that in the first of two draws, the number 90 comes up. There are now 89 ways to get a smaller number on the second draw, and only 10 ways to get a larger number. In a very similar way, a student who performs very well in a test (and therefore receives a reward) has the odds stacked against them, if they hope to score better in the next test. The regression fallacy, in this case, is to assume that the administered reward is the cause of the eventual decline in performance.
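The counting argument can be spelled out directly; a minimal sketch:

```python
from fractions import Fraction

# First draw came up 90; count the ways an independent second draw
# (uniform on 1..100) can come out lower, equal, or higher.
outcomes = range(1, 101)
lower = sum(1 for n in outcomes if n < 90)    # 89 ways
equal = sum(1 for n in outcomes if n == 90)   # 1 way
higher = sum(1 for n in outcomes if n > 90)   # 10 ways

print(Fraction(lower, 100))   # prints 89/100
print(Fraction(higher, 100))  # prints 1/10
```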

The argument works exactly the same way for a poorly performing pupil - a really bad outcome is most likely, by chance alone, to be followed by an improvement. This tendency for extreme results to be followed by more ordinary results is called regression to the mean. It is not impossible that an intervention such as a punishment could cause improved future performance, but the automatic assumption that an observed improvement is caused by the administered punishment is fallacious.

Another common example comes from medical science. It's when my sinusitis is worst that I sleep with a freshly severed rabbit's foot under my pillow. I almost always feel better the next morning.

These before-after scenarios are a special case, as I mentioned. In general, all we need in order to see regression to the mean is to sample two correlated random variables. They may be from the same distribution (before-after), or they may be from different distributions.

If I tell you that Hans is 6 foot, 4 inches tall (193 cm), and ask you what you expect to be the most likely height of his fully-grown son, Ezekiel, you might correctly reason that fathers' heights and sons' heights are correlated. You might think, therefore, that the best guess for Ezekiel's height is also 6' 4", but you would be forgetting about regression to the mean - Ezekiel's height is actually most likely to be closer to average. This is because the correlation between fathers' and sons' heights is not perfect. On a scale where 0 represents no correlation whatsoever, and 1 indicates perfect correlation (knowledge of one necessarily fixes the other precisely [1]), the correlation coefficient for father-son stature is about 0.5. Reasoning informally, therefore, we might adjust our estimate for Ezekiel's still unknown height to half way between 6' 4'' and the population average. It turns out we'd be bang on with this estimate. (I don't mean, of course, that this is guaranteed to be the fellow's height, but that this would be our best possible guess, most likely to be near his actual height.)
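The halfway adjustment is simple arithmetic. In this sketch, the population mean of 177.8 cm (5' 10'') is an assumed round figure; the 0.5 correlation is the one quoted above.

```python
# Assumptions: population mean height 177.8 cm (a round figure I've
# chosen for illustration), father-son correlation 0.5.
mu = 177.8       # assumed population mean height, cm
rho = 0.5        # father-son correlation
father = 193.0   # Hans, 6' 4''

# Regress the naive guess toward the mean by the factor rho.
son_estimate = mu + rho * (father - mu)
print(round(son_estimate, 1))  # prints 185.4
```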

The success of this simple revision follows from the normal (Gaussian) probability distribution for people's heights. The normal distribution can be applied in a great many circumstances, both for physical reasons (central limit theorem), and for the reason that we often lack any information required to assign a more complicated distribution (maximum entropy). If two variables, x and y, are each assigned a normal distribution (with respective means and standard deviations μi and σi), and provided that certain not-too-exclusive conditions are met (all linear combinations of x and y are also normal), then their joint distribution, P(xy | I), follows the bivariate normal distribution, which I won't type out, but follow the link if you'd like to see it. (As usual, I is our background information.) To get the conditional probability for y, given a known value of x, we can make use of the product rule, to give

P(y | x, I) = P(xy | I) / P(x | I)        (1)

P(x | I) is the marginal distribution for x, just the familiar normal distribution for a single variable. If one goes through the slightly awkward algebra, it is found that for xy bivariate normal, y|x is also normally distributed [2], with mean

μy|x = μy + ρ (σy / σx)(x − μx)        (2)

and standard deviation

σy|x = σy √(1 − ρ²)        (3)

where ρ is the correlation coefficient, given by

ρ = cov(x, y) / (σx σy)        (4)

Knowing this mean and standard deviation, we can now make a good estimate of how much regression to the mean to expect in any given situation. We can state our best guess for y and its error bar.

We can rearrange equation (2) to give

(μy|x − μy) / σy = ρ (x − μx) / σx        (5)

which says that the expected number of standard deviations between y - given our information about x - and μy (the mean of y when nothing is known about x) is the same as the number of standard deviations between the observed value of x and μx, only multiplied by the correlation coefficient. A bit of a mouthful, perhaps, but actually a fairly easy estimate to perform, even in an informal context. In fact, this is just what we did when estimating Ezekiel's height.
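The conditional mean and standard deviation given above can be checked against a brute-force simulation. The particular means, spreads, and correlation below are arbitrary choices, invented for the check:

```python
import random
import statistics

random.seed(0)

# Arbitrary example parameters (not from the text):
# x ~ N(0, 1), y ~ N(10, 2), correlation 0.6.
mu_x, sigma_x = 0.0, 1.0
mu_y, sigma_y = 10.0, 2.0
rho = 0.6

# Conditional mean and standard deviation of y given x.
def conditional(x):
    mean = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)
    sd = sigma_y * (1.0 - rho**2) ** 0.5
    return mean, sd

# Simulate correlated normal pairs and keep the y values whose x
# landed in a narrow window around a chosen value x0.
x0 = 1.5
ys = []
for _ in range(200_000):
    x = random.gauss(mu_x, sigma_x)
    # standard construction of a pair with correlation rho
    y = mu_y + sigma_y * (rho * (x - mu_x) / sigma_x
                          + (1.0 - rho**2) ** 0.5 * random.gauss(0.0, 1.0))
    if abs(x - x0) < 0.05:
        ys.append(y)

pred_mean, pred_sd = conditional(x0)   # 11.8 and 1.6
print(pred_mean, statistics.mean(ys))  # the two should agree closely
print(pred_sd, statistics.stdev(ys))
```

The simulated moments of y near x = 1.5 land close to the predicted 11.8 and 1.6, as the formulas say they should.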

When reasoning informally, we can ask the simplified question, 'what value of y is roughly equally as improbable as the known value of x?' The human mind is actually not bad at performing such estimates. Next, we need to figure out how far that value is from the expected value of y (μy), and multiply its distance from μy by the factor ρ, which again (with a bit of practice, perhaps), we can also estimate not too badly. In any case, an internet search will often be as complicated as any data-gathering exercise needed to calculate ρ more accurately.

Here are a few correlation coefficients for some familiar phenomena:

Life expectancy by nation (data from Wikipedia here and here):

life expectancy vs GDP per capita: 0.53
male life expectancy vs female life expectancy: 0.98

IQ scores (data from Wikipedia again):

same person tested twice: 0.95
identical twins raised together: 0.86
identical twins raised separately: 0.76
unrelated children raised together: 0.3

Amount of rainfall (in US) and frequency of 'flea' searches on Google: 0.87
(from Google Correlate)

With the above procedure for estimating y|x, we can get better, more rational estimates for a whole host of important things: how will our company perform this year, given our profits last year? How will our company perform if we hire this new manager, given how his previous company performed? What are my shares going to do next? What will the weather do tomorrow? How significant is this person's psychological assessment? Or criminal record?

In summary, when two partially correlated, random variables are sampled, there is a tendency for an extreme value of one to be accompanied by a less extreme value for the other. This is simple to the point of tautology, and is termed regression to the mean. The regression fallacy is a failure to account for this effect when making predictions, or investigating causation. One common form is the erroneous assumption of cause and effect in 'before-after' type experiments. Rabbits' feet do not cure sinusitis (in case you were still wondering). Another kind of fallacious reasoning is the failure to regress to the mean an estimate or prediction of one variable based on another known fact. For two normally distributed, correlated variables, the ratio of the expected distance (in standard deviations) of one variable from its marginal mean to the actual distance of the other from its mean is the correlation coefficient.

[1] Note: there are also cases where this condition holds for zero correlation, i.e. situations where y is completely determined by x, even though their correlation coefficient is zero. Lack of correlation can not be taken to imply independence, though if x and y are jointly (bivariate) normal, lack of correlation does strictly imply independence.

[2] I've been a little cavalier with the notation, but you can just read y|x as 'the value of y, given x.' Here, y is to be understood as a number, not a proposition.