Tuesday, April 10, 2012

Booze, sexual discrimination, and the best pills to prevent being murdered

The following data were used in a case of alleged gender bias in graduate admissions at an American university:


          % Admitted
Male      44 %
Female    35 %

Assume that we know that male and female applicants are equally capable. Is the allegation of discrimination against female applicants proved by the data?

The data come from a real case, from 1973 at the University of California, Berkeley. The data looked damning, but a subsequent analysis actually demonstrated a slight but statistically significant bias in favor of women. The original data are incomplete and do not account for confounding factors. This is an example of Simpson’s paradox.

An imaginary example

Before looking into the resolution of this paradox, we'll examine some hypothetical data on lung cancer rates. This clever example is taken from a ‘Bad Science’ article, ‘Any set of figures needs adjusting before it can be usefully reported,’ by Ben Goldacre, and gives us a good chance to recognize quickly the likely 'true' explanation for the data. Again, these are imaginary numbers. 

Our hypothetical researchers first produce the following results from an epidemiological study:

                Lung cancer rates
Drinkers        13.7 % (366 / 2666)
Non-drinkers     5.0 % (98 / 1954)

The data suggest a causal relationship between drinking alcohol and contracting lung cancer.
Does it seem plausible that such a relationship exists?
Is there a likely alternative explanation?

It is well established, indeed it was one of the early triumphs of epidemiology, that the risk of lung cancer is increased by smoking. Thinking along these lines, we imagine the investigators going back to the study participants and asking them whether or not they smoked. The revised results are shown below:

               Drinkers               Non-drinkers
Smokers        23.1 % (330 / 1430)    23.2 % (47 / 203)
Non-smokers     2.9 % (36 / 1236)      2.9 % (51 / 1751)

Here we discover the cause of the problem with the original data: the actual cause of the difference in cancer rates between the two groups was the difference in the occurrence of smoking. Smoking acted as a confounder in the original study. Looking at the new data, we see that within each of the two groups, smokers and non-smokers, the rate of lung cancer is essentially independent of whether a person drinks alcohol. We also see that the number of people taking part in the study who smoke but do not drink is very small. Since most of the drinkers smoke and most of the non-drinkers do not, the pooled figures gave the appearance of a higher risk for drinkers, compared to non-drinkers.
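The arithmetic can be checked with a short Python sketch. It uses only the imaginary counts quoted above; the names `counts`, `rate` and `pooled` are my own choices for illustration. It computes the rate within each smoking/drinking cell, and then pools smokers and non-smokers to recover the first table.

```python
# Imaginary counts from the revised table: (cancer cases, participants)
counts = {
    ("smoker", "drinker"): (330, 1430),
    ("smoker", "non-drinker"): (47, 203),
    ("non-smoker", "drinker"): (36, 1236),
    ("non-smoker", "non-drinker"): (51, 1751),
}


def rate(cases, total):
    """Return the cancer rate as a percentage."""
    return 100.0 * cases / total


def pooled(drinking):
    """Add up cases and participants over both smoking groups."""
    cases = sum(c for (_, d), (c, _) in counts.items() if d == drinking)
    total = sum(n for (_, d), (_, n) in counts.items() if d == drinking)
    return cases, total


# Stratified view: one rate per smoking/drinking combination.
for (smoking, drinking), (cases, total) in counts.items():
    print(f"{smoking}, {drinking}: {rate(cases, total):.1f} % ({cases} / {total})")

# Pooled view: smoking status is ignored, as in the first table.
for drinking in ("drinker", "non-drinker"):
    cases, total = pooled(drinking)
    print(f"all {drinking}s: {rate(cases, total):.1f} % ({cases} / {total})")
```

The pooled totals come out as 366 / 2666 and 98 / 1954, the figures in the first table, even though the rates within each smoking group barely differ between drinkers and non-drinkers.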

So, what is Simpson’s Paradox?

Simpson’s paradox is a trap that one can fall into when determining cause and effect relationships from frequency data.

The paradox is that the degree of correlation between two variables can change considerably when the data are divided into different sub-groups, i.e. when the data are adjusted for confounders.

As in the case above considering smokers and drinkers, it may appear that A caused B, but when A is resolved into separate statements, such as AC and A~C ('~' meaning 'not'), the causal influence of A is seen to disappear.

It may even be, as in the UCB sexual discrimination case, that the observed effect reverses direction when the apparent causal agency (e.g. the individual being male or female) is resolved according to possible values of a confounding variable, as we'll now see.
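One way to see how this can happen is to write the pooled rate as a weighted average of the sub-group rates. The sketch below is my own gloss on the notation above, taking A as 'drinks', B as 'gets lung cancer' and C as 'smokes', and plugging in the imaginary counts from the lung cancer example.

```latex
% Law of total probability, conditioning on the confounder C
P(B \mid A) = P(B \mid A, C)\,P(C \mid A) + P(B \mid A, {\sim}C)\,P({\sim}C \mid A)

% Drinkers (A): mostly smokers, so the high-rate term dominates
P(B \mid A) = \frac{330}{1430}\cdot\frac{1430}{2666}
            + \frac{36}{1236}\cdot\frac{1236}{2666}
            = \frac{366}{2666} \approx 0.137

% Non-drinkers (~A): mostly non-smokers, so the low-rate term dominates
P(B \mid {\sim}A) = \frac{47}{203}\cdot\frac{203}{1954}
                  + \frac{51}{1751}\cdot\frac{1751}{1954}
                  = \frac{98}{1954} \approx 0.050
```

The sub-group rates P(B | A, C) and P(B | ~A, C) are almost identical. What differs between drinkers and non-drinkers is the weights, P(C | A) and P(C | ~A), and the weights alone are enough to produce the gap in the pooled rates.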

Back to the initial problem

We had the following data on graduate admissions:


          % Admitted
Male      44 %
Female    35 %

These data suggest very strongly that there is a problem of women being unfairly treated. (In frequentist terms, the null hypothesis is rejected at a very high level of significance.) But take a moment to consider this: what kind of confounders might be active in such a situation? Are male and female applications necessarily the same in all relevant respects? How might they differ?

The following extended data set clears up a lot. The applications have been broken down by department:


Now when male and female admission rates are compared for each department individually, it is seen that for most departments, female applicants had a higher probability of being admitted. The difference is small but unlikely to be due to chance.

We see that the percentages accepted get smaller going down the list, i.e. the hardest departments to get into are near the bottom. Also, the numbers of female applicants tend to be lower near the top of the list, while the opposite holds for the male applicants.

The reason, therefore, that it appeared at first that women were being unfairly treated was that most of the female applications were to departments that are harder to get into, while most of the men were applying to departments with larger percentages of applicants accepted.
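The mechanism can be reproduced with a toy calculation. The numbers below are entirely hypothetical: they are not the Berkeley figures, and were chosen only so that women do better in each department yet worse overall. The names `applications`, `admit_rate` and `overall` are likewise my own.

```python
# HYPOTHETICAL numbers, invented for illustration (not the Berkeley data).
# department -> sex -> (admitted, applicants)
applications = {
    "easy department": {"men": (480, 800), "women": (130, 200)},
    "hard department": {"men": (40, 200), "women": (200, 800)},
}


def admit_rate(admitted, applicants):
    """Return the admission rate as a percentage."""
    return 100.0 * admitted / applicants


def overall(sex):
    """Total admitted and total applicants of one sex over all departments."""
    admitted = sum(dept[sex][0] for dept in applications.values())
    applicants = sum(dept[sex][1] for dept in applications.values())
    return admitted, applicants


# Department by department, women have the higher admission rate.
for name, dept in applications.items():
    for sex, (admitted, applicants) in dept.items():
        print(f"{name}, {sex}: {admit_rate(admitted, applicants):.0f} %")

# Pooled over departments, the comparison reverses.
for sex in ("men", "women"):
    admitted, applicants = overall(sex)
    print(f"overall, {sex}: {admit_rate(admitted, applicants):.0f} %")
```

In this made-up case the women's rate is five points higher in each department, but because four fifths of the women apply to the hard department and four fifths of the men to the easy one, the pooled rate is 52 % for men against 33 % for women.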

How can we guard against Simpson’s paradox?

This analysis illustrates an important general point about the presentation and reporting of statistics. The explanation for how the original study was being confounded would have been masked had we only inserted the observed rates (the percentage figures) in the above tables. It is only by looking at the raw numbers that we can fully appreciate what is going on. This leads on to a much more general principle: the raw data are always worth examining, and should not be discarded after the statistical analysis has been performed. Researchers should continue to pay attention to their raw data, and the data should also, where possible, be made available to reviewers and to anybody interested in reading the published findings resulting from a study.

Where possible, the best guard against Simpson’s paradox is randomization. If the subjects in a trial have been randomized into 'treatment groups', then the potential confounders that go with them will also be randomly distributed between groups, so that any systematic difference in outcome between the groups can be attributed to the treatments. This illustrates the vast superiority of data from a carefully designed randomized trial over observational data. Epidemiological 'findings' from studies with tens of thousands of subjects have, on several occasions, been overturned as soon as data from randomized trials have become available. This article by Gary Taubes, 'Do we really know what makes us healthy?', includes several examples from medical science, including the once-held belief that hormone replacement therapy reduces a woman's risk of death by heart attack: the presence of confounders in one study with 16,500 participants was revealed when somebody noticed that women receiving HRT were also much less likely to be murdered than those not receiving the treatment.
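A small simulation gives a feel for why randomization helps. This is only a sketch under invented assumptions: a population in which 30 % of subjects smoke, and an observational setting in which smokers are much more likely than non-smokers to end up in the 'treated' group. None of these probabilities come from a real study, and the function and variable names are mine.

```python
import random


def smoker_share(smokers, treated, group):
    """Fraction of smokers among subjects whose treatment flag equals `group`."""
    members = [s for s, t in zip(smokers, treated) if t == group]
    return sum(members) / len(members)


def imbalance(smokers, treated):
    """Smoker share in the treated group minus that in the untreated group."""
    return smoker_share(smokers, treated, True) - smoker_share(smokers, treated, False)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical population: 30 % smokers (the hidden confounder).
smokers = [rng.random() < 0.3 for _ in range(n)]

# Observational setting: subjects sort themselves into groups, and in this
# made-up world smokers are far more likely to end up "treated".
self_selected = [rng.random() < (0.8 if s else 0.3) for s in smokers]

# Randomized setting: a fair coin decides, ignoring smoking status.
randomized = [rng.random() < 0.5 for _ in smokers]

obs_gap = imbalance(smokers, self_selected)
rand_gap = imbalance(smokers, randomized)
print(f"smoker imbalance, self-selected groups: {obs_gap:+.3f}")
print(f"smoker imbalance, randomized groups:    {rand_gap:+.3f}")
```

With self-selection the treated group contains far more smokers than the untreated group, so any effect of smoking would be wrongly credited to the treatment. With the coin flip, the two groups end up with nearly the same share of smokers, and the remaining difference is down to chance alone.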

This is not to say that epidemiological data should be ignored completely. Sometimes it is the best information we can get, and epidemiologists are clever people who understand the pitfalls of their profession far better than I do. In fact, many forms of science are of the epidemiological type, where randomization is impossible.

What does Simpson's paradox teach us about probability theory?

We might feel that the evident problems with the simple datasets in the above examples undermine the logical foundations of statistical analysis. How can it be that a clear relationship between two variables, as evidenced by the analysis of empirical data, turned out not to be a causal one? Contrary to what many have assumed, however, probability theory is only indirectly concerned with causal dependence. Probability theory deals with the logical investigation of information, and so, as I will discuss in a future post, is fundamentally only concerned with the logical dependence of variables, rather than with matters of direct causation.

1 comment:

  1. I think the table with (non)drinkers vs (non)smokers should have the 2nd column in reverse order, i.e. so that smokers/nondrinkers are 2.9% and non-smokers/non-drinkers are 23.2%

    Other than that, thanks for this article!