Friday, April 27, 2012

Logical v's Causal Dependence

At the end of a previous post, I promised to discuss the difference between logical dependence, the substance of probability theory, and causal dependence, which is often assumed to be the thing that probability is directly concerned with. Lets get the ball rolling with a simple example:

A box contains 10 balls: 4 black, 3 white, and 3 red. A man extracts exactly one ball, ‘randomly.’ The extracted ball is never replaced in the box. Consider the following 2 situations:

a) You know that the extracted ball was red. What is the probability that another ball extracted in the same way, will also be red?

b) A second ball has been extracted in the same manner as the first, and is known to be black. The colour of the first ball is not known to you. What is the probability that it was white?

You should try to verify that he answer in the first case is 2/9, and in the second case is 1/3.

Nearly everybody will agree with my answer to situation (a), but some may hesitate about the answer for situation (b). This hesitation seems to result from the feeling that when we write P(A|B) ≠ P(A), then B is, at least partially, the cause of A. (P(A|B) means ‘the probability for A given the assumption that B is true.’) If  true, then there would be no possibility for knowledge of B to influence the probability for A, because the colour of the second ball can have had no causal influence on the colour of the first ball.

In fact, it makes no difference at all in what order the balls are drawn, in such cases. The labels ‘first,’ ‘second,’ ‘nth,’ are really just arbitrary labels, and we can exchange them as we please, without affecting the outcome of the calculation.

In case there is still a doubt, consider a simplified version of our thought experiment:

The box had exactly 2 balls, 1 black and 1 white. Both balls were drawn, ‘at random.’ The second ball drawn was black. What is the probability that the first was black?

The product rule can be written P(AB) = P(A|B)P(B). With this formulation, can we account for cases where A depends on B. When thinking about this dependence, however, it is often tempting to think in terms of causal dependence. But probability theory is concerned with calculations of plausibility with incomplete knowledge, and so what we really need to consider is not causal dependence, but logical dependence. We can verify that P(A|B) does not imply that B is the cause of A, since, thanks to the commutativity of Boolean algebra, AB = BA, and we could just as easily have written the product rule as P(AB) = P(B|A)P(A).

What is the probability that X committed a crime yesterday, given that he confessed to it today? Surely it is altered by our knowledge of the confession, indicating that the propositions are not independent in the sense we need for probability calculations. But it is also clear that a crime committed yesterday was not caused by a confession today. 

Edwin Jaynes in ‘Probability Theory: The logic of Science,’ gave the following technical example of the errors that can occur by focusing on causal dependence, rather than logical dependence. Consider multiple hypothesis testing with a set of n hypotheses, H1, H2, …, Hn, being examined in the light of m datasets, D1, D2, …., Dm. When the data sets are logically independent, the direct probability for the totality of the data given any one of the hypotheses, Hi, satisfies a factorization condition,

P(D1...Dm | Hi, I)  =   ∏ j P(Dj | Hi, I)

(The capital 'pi' means multiply for all 'j'.) It can be shown, however, that the corresponding condition for the alternate hypothesis, Hi'

P(D1...Dm | Hi', I)  =   ∏ j P(Dj | Hi', I)

does not hold except in highly trivial cases, though some authors have assumed it to be generally true, based on the fact that no Di has any causal effect on any other Dj. (Equation (2) requires that P(Dj|Di) = P(Dj).) The datasets maintain their causal independence, as they must, but they are no longer logically independent. This is because the amount that equivalent units of new information change the relative plausibilities of multiple hypotheses depends on the data that has gone before: the effect of new data on a hypothesis depends on which other hypothesis it competes with most directly.

In Jaynes’ example, he imagined a machine producing some component in large quantities and an effort to determine the fraction of components fabricated that are faulty by randomly sampling 1 component at a time and examining it for faults. The prior information is supposed specific enough to narrow the number of possible hypotheses to 3:

A ≡ ‘The fraction of components that are faulty is 1/3.’
B ≡ ‘The fraction of components that are faulty is 1/6.’
C ≡ ‘The fraction of components that are faulty is 99/100.’

The prior probabilities for these hypotheses are as shown at the extreme left of the graph below. The graph is the calculation of the evolution of the probabilities for each hypothesis as the number of tested components increases. Recall that, from Bayes’ theorem, each P(Hi|D, I) depends on both P(D|Hi, I) and P(D|Hi', I). Each tested component is found to be faulty, so the information added is identical with each sample, but the rates of change of the 3 curves (plotted logarithmically) are not constant.

Evolution of the probabilities of 3 hypotheses as constant new data are added.
Taken from E.T. Jaynes, ‘Probability theory: the logic of science,’ chapter 4. 

The ‘Evidence,’ plotted on the vertical axis, is perhaps an unfamiliar expression of probability information. It is the log odds, given by

E  =   10 log10   P( H )
P( H' )

with the factor of 10 because we choose to measure evidence in decibels. The base 10 is used because of a perceived psychological advantage (our brains seem to be good at thinking in terms of factors of 10). Because we have used a logarithmic scale, the products expressed above in equations (1) and (2) becomes sums, and for constant pieces of new information, we expect to add a constant amount to the evidence, if both factorization conditions hold. The slopes of the curves are not constant, however, indicating that this is not the case, and consecutive items of data are not independent: ΔE depends on what data have preceded that point. Specifically, wherever a pair of hypotheses cross on the graph, there is a change of slope of the remaining hypothesis.

When we calculate P(D|Hi, I) we are supposing for the purposes of calculation that Hi is true, and so the result we get is independent of P(Hi), which is why P(D|Hi, I) factorizes. But P(D|Hi', I)  is different, because when the total number of hypotheses is greater then 2, then Hi' is composite and decomposes into at least 2 hypotheses, so P(D|Hi', I) relies upon the relative probabilities for those component propositions.

Tuesday, April 17, 2012

Fly papers and photon detectors: another base-rate fallacy

Here is my version of a simple riddle posed by Jaynes in ‘Probability Theory: The Logic of Science’:
A constant Poissonian light source emits photons at an average rate of 100 per second. The photons strike a detector with 10% efficiency. In one particular second, the detector registers 15 counts. What is the expectation for the number of photons actually emitted in that second? (Assume a detector with no dark counts and zero dead time.)
Lets denote the number of emissions by n, the average number of emissions (the source strength) by s, the number of detected photons by c, and the detector efficiency by ϕ.

In case some of the terminology in the question is unfamiliar, ‘Poissonian’ just means that the number of events in any second conforms to a Poisson probability distribution with a mean of s:
P(n | s)  =  e-s  sn
Make an effort to work out your best answer to the above question before reading on. Consider writing it down - feel free to record it in the comments below! 

Many textbooks on statistics specify 'maximum likelihood' as the state of the art for parameter estimation, so lets see what result it produces here. The likelihood function for a particular model is the probability to obtain the observed data, assuming that model to be true:

L(M) = P(D | M, I) (2)

where D is the set of data, M is the model in question, and I is the prior information available.

In the cases where the prior information effectively specifies the form of the model, but not the model parameters, then denoting the parameters of the model by θ, this equation becomes

L(θ) = P(D | θ, I) (3)

and we have the problem of determining the best estimate of the actual value of the θ from the observed data, which is often assumed to be the value at which the likelihood function is maximized:

θ  = argmax(L(θ | D))

In the present case, θ is a single parameter, the number of emitted photons, n, and D is the fact that we have observed c photocounts. P(D | θI) becomes P(c|nϕs). But if n is known, then knowledge of s is irrelevant, so P(c|nϕs) = P(c|nϕ), which is given by the binomial distribution:

P(c | nφ)  =  n!
c!(n - c)!
  ×  φc (1-φ)n-c (5)

For ϕ = 0.1 and c  = 15, then it is easy enough to verify that the value of n that maximizes this equation is 150, which is just what many of us would expect.

Lets examine this solution, though, by considering a different problem. Suppose I am interested in the number of flies in Utrecht, the city I live in. I hang a fly paper in my kitchen in order to gather some data (don’t worry, lab experiments on flies don’t require ethical approval). After one day, there are 3 flies trapped. Unfortunately, I don’t know the efficiency of my detector (the fly paper), but suppose I have good reasons to accept that the average number trapped is proportional to the total number in the population. On day 2, I count 6 new casualties on the fly paper, and announce the astonishing result that in one day, the number of flies in Utrecht has doubled. You’d hopefully respond that I am bonkers - clearly the experiment is nowhere near sensitive enough to determine such a thing. Yet this is essentially the result that we have declared above, under the advice of those who champion the method of maximum likelihood for all parameter estimation problems.

The following data are the result of a Monte-Carlo simulation of the emitter and detector described in the original question:

These data represent 32,000 seconds of simulated emissions. First, I generated 32000 random numbers, uniformly distributed between 0 and 1. Then, for each of these, I scanned along the Poisson distribution in equation (1), until the integrated probability first exceeded that random number. This gave the number of emitted photons, n, in each simulated second. Then, for each of these n, I generated another n uniform random numbers. For any of these random numbers less than or equal to the detector efficiency, ϕ, one detected photon was added to the count for that second. Each green dot in the graph above, therefore, displays the emissions and counts for a single second.

The data confirm that 150 emissions in a single second is an extremely rare event (didn’t occur once in the experiment), while 15 photo-counts occurred on very many (more than a thousand) occasions. From the graph above, it appears that the ‘middle’ of the distribution for 15 counts corresponds to somewhere only slightly above 100 emissions. We can get a better view by taking a slice through this graph along 15 counts, and histogramming the numbers of emissions that were responsible for those counts:

The peak of this histogram is at 105 emitted photons, which is also the average number (rounded) of emitted photons resulting in 15 counts. This is evidently our approximate solution.

The exact solution follows from Bayes’ theorem, and has been provided by Jaynes. The general statement of Bayes’ theorem, again, is

P(A | BI)  =  P(A | I)  P(B | AI)
P(B | I)

Filling in the notation I have defined above, this becomes

P(n | φcs)  =  P(n | φs)   P(c | nφs)
P(c | φs)

As noted, if we know the exact number of photons emitted, then no amount of information about the source strength can improve our ability to estimate the expected number of counts. Also, the efficiency of the detector is irrelevant to the number of photons emitted, so
P(n | φcs)  = P(n | s)   P(c | nφ)
P(c | φs)

Two of these distributions are specified above, in equations (1) and (5).

P(n|s) and P(c|nϕ) can be combined to derive the probability distribution for the number of counts given the source strength, which, in line with intuition, is again Poissonian:
P(c | φs)  =  e-sφ  (sφ)c

Combining all these, and canceling terms, we get
P(n | φcs)  =   e-s(1-φ) [ s(1-φ) ] n - c

This is, yet again, a Poisson distribution, with mean, s(1-ϕ), only shifted up by c, since with no dark counts, the number emitted can not be less than the number detected. The mean of this distribution is, therefore:

〈n〉 =   c + s(1 - φ) (11)

This gives 15 + 100×0.9 = 105 expected emissions, in the case of 15 detected photons. Counter intuitively, the 5 additional counts, above the expected 10, are only evidence for 5 more photons emitted, above the average.

The figure obtained by maximum likelihood can be seen to be another instance of the base-rate fallacy: 150 is equal to the maximum of P(c|nϕs), as n is varied. This is the essence of what maximum likelihood parameter estimation does - given a model with parameter θ, the maximum likelihood estimate for θ is given by argmax[P(data | θ)]. In this case, it takes no account of the probability distribution for the number of emissions, n. What we really needed was the expectation (approximately the maximum) of P(n | cϕs), which is the expectation of P(n | s)×P(c | nϕs). As with the fly paper, the efficiency of the photo detector, (at 10%) was too low to permit us to claim anything close to a 50% excess over the average 100 emissions per second.

Under the right circumstances, maximum likelihood can do a very good job of parameter estimation, but we see here a situation where it fails miserably. As with all rational procedures, its success is determined entirely by its capability to mimic the results of Bayes’ theorem. The answer it gave here differed from Bayes by an order of magnitude, and the reason for that failure was the important information contained in the base rates, P(n|s). In a future post, I’ll discuss another kind of problem where maximum likelihood can give a badly flawed impression, even when the prior distribution conveys little or no information. 

Tuesday, April 10, 2012

Booze, sexual discrimination, and the best pills to prevent being murdered

The following data were used in a case of alleged gender bias in graduate admissions at an American university:


% Admitted
44 %
35 %

Assume that we know that male and female applicants are equally capable. Is the allegation of discrimination against female applicants proved by the data?

The data come from a real case, from 1973 at the University of California, Berkeley. The data looked damning, but a subsequent analysis actually demonstrated a slight but statistically significant bias in favor of women. The original data are incomplete and do not account for confounding factors. This is an example of Simpson’s paradox.

An imaginary example

Before looking into the resolution of this paradox, we'll examine some hypothetical data on lung cancer rates. This clever example is taken from a ‘Bad Science’ article, ‘Any set of figures needs adjusting before it can be usefully reported,’ by Ben Goldacre, and gives us a good chance to recognize quickly the likely 'true' explanation for the data. Again, these are imaginary numbers. 

Our hypothetical researchers first produce the following results from an epidemiological study:

Lung cancer rates
13.7 %
(366 / 2666)
5.0 %
(98 / 1954)

The data suggest a causal relationship between drinking alcohol and contracting lung cancer.
Does it seem plausible that such a relationship exists?
Is there a likely alternative explanation?

It is well established, indeed it was one of the early triumphs of epidemiology, that the risk of lung cancer is increased by smoking. Thinking along these lines, we imagine the investigators going back to the study participants and asking them whether or not they smoked. The revised results are shown below:

23.1 %
(330 / 1430)
23.2 %
(47 / 203)
2.9 %
(36 / 1236)
(51 / 1751)

Here we discover the cause of the problem with the original data: the actual cause of the difference in cancer rates between the two groups was the difference in the occurrence of smoking. Smoking acted as a confounder in the original study. Looking at the new data, we see that for the two groups, smokers and non-smokers, the rate of lung cancer is completely independent of whether a person drinks alcohol. We also see that the number of people taking part in the study who smoke but do not drink is very small. This is the fact that gave the appearance of a higher risk for drinkers, compared to non drinkers. 

So, what is Simpson’s Paradox?

Simpson’s paradox is a trap that one can fall into when determining cause and effect relationships from frequency data.

It is that a degree of correlation can be changed considerably when the data are divided into different sub-groups, i.e. when the data are adjusted for confounders.

As in the case above considering smokers and drinkers, it may appear that A caused B, but when A is resolved into separate statements, such as AC and A~C ('~' meaning 'not'), the causal influence of A is seen to disappear.

It may even be, as in the UCB sexual discrimination case, that the observed effect reverses direction when the apparent causal agency (e.g. the individual being male or female) is resolved according to possible values of a confounding variable, as we'll now see.

Back to the initial problem

We had the following data on graduate admissions:


% Admitted
44 %
35 %

These data suggest very strongly that there is a problem of women being unfairly treated. (In frequentist terms, the null hypothesis is rejected at a very high level of significance.) But take a moment to consider this: what kind of confounders might be active in such a situation? Are male and female applications necessarily the same in all relevant respects? How might they differ?

The following extended data set clears up a lot. The applications have been broken down by department:


Now when male and female admittance is compared for each department individually, it is seen that for most departments, female applicants had a higher probability to be admitted. The difference is small but unlikely to be due to chance.

We see that the percentages accepted get smaller going down the list, i.e. the hardest departments to get into are near the bottom. Also the numbers of female of applicants tend to be lower near the top of the list, while the opposite holds for the male applicants.

The reason, therefore, that it appeared at first that women were being unfairly treated was that most of the female applications were to departments that are harder to get into, while most of the men were applying to departments with larger percentages of applicants accepted.

How can we guard against Simpson’s paradox?

This analysis illustrates an important general point about the presentation and reporting of statistics. The explanation for how the original study was being confounded would have been masked, had we only inserted the observed rates (the percentage figures) in the above tables. It is only by looking at the raw numbers that we can fully appreciate what is going on. This leads on to a much more general principle, that the raw data are always beneficial to examine, and should not be discarded after the statistical analysis has been performed. Researchers should continue to pay attention to their raw data, and it should also, where possible, be made available to reviewers and anybody interested in reading the published findings resulting from a study.

Where possible, the best guard against Simpson’s paradox is randomization. If the subjects in a trial have been randomized into 'treatment groups', then the potential confounders that go with them should also be randomly distributed between groups, and only the true effects of the treatments will be observed. This illustrates the vast superiority of data from a carefully designed randomized trial over observational data. Epidemiological 'findings' from studies with tens of thousands of subjects have, on several occasions, been overturned as soon as data from randomized trials have become available. This article by Gary Taubes, 'Do we really know what makes us healthy?' includes several examples from medical science, including the once-held belief that hormone replacement therapy reduces a woman's risk of death by heart attack: the presence of confounders in one study with 16,500 participants was revealed when somebody noticed that women receiving HRT were also much less likely to be murdered than those not receiving the treatment.

This is not to say that epidemiological data should be ignored completely. Sometimes it is the best information we can get, and epidemiologists are clever people who understand the pitfalls of their profession far better than I do. In fact, many forms of science are of the epidemiological type, where randomization is impossible.

What does Simpson's paradox teach us about probability theory?

We might feel that the evident problems with the simple datasets in the above examples undermine the logical foundations of statistical analysis. How can it be that a clear relationship between two variables, as evidenced by the analysis of empirical data, turned out not to be a causal one? Contrary to what many have assumed, however, probability theory is only indirectly concerned with causal dependence. Probability theory deals with the logical investigation of information, and so, as I will discuss in a future post, is fundamentally only concerned with the logical dependence of variables, rather then with matters of direct causation. 

Friday, April 6, 2012

Capture-recapture and the Charles Darwin problem

In chapter 9 of ‘Probability Theory: The Logic of Science’ by Edwin Jaynes, the following problem is offered as an exercise:

When Charles Darwin first landed on the Galapagos Islands in September 1835, he had no idea how many different species of plants he would find there. Having examined n = 122 specimens and finding that they can be classified into m = 19 different species what is the probability that there are still more species as yet unobserved?

In previous posts on hypothesis testing, I described a binary hypothesis test, and a test with infinite hypotheses. In this case, however, we are in the unfortunate position of being unable to say how many hypotheses we should be testing. There are other unknowns involved, and we might feel that the problem is impossible to solve due to the missing information. Missing information, however, is exactly what probability theory was developed for. Here, I’ll propose a simple model that will allow a sensible and remarkably robust answer to be obtained.

The first step is to recognize that the problem is equivalent to a common problem in ecology: how to estimate the size of a population of animals, without being able to count them all. A method that is commonly used in such cases is known as 'capture-recapture'.

Suppose you want to know how many fish there are in a lake, but there is no hope, and no way of knowing whether or not you have succeeded in catching and counting all the  fish present. So this is what you might do: you spend a day fishing using some non-lethal method. The fish captured are counted and marked, then, at the end of the day, released. The next day, you start fishing again, and make note of the proportion of fish caught on this day that have already been marked from the previous day.

Intuitively, we can recognize that if most of the fish obtained on the second day have tags on them, then the total number should not be much larger than the number first caught. Alternatively, if none of the specimens from the second day has a tag, then the number first caught is probably a tiny proportion of the total population. With a few simple assumptions, we can easily produce a more quantitative answer than these intuitive impressions.

Lets assume that the population we want to count is constant over the period within which we are performing our investigation. Lets also assume that whenever an animal is caught, each individual within the population has the same probability to be the one trapped. This means, for example, that catching an individual once does not affect its likelihood to be caught again, which may or may not be strictly correct. One that is trapped may learn something from the process, and be less likely to be caught again. Alternatively, an animal that gets caught might be one of the stupid ones that can’t recognize the trap as easily as some of the others. These considerations represent areas where we might refine our model in future, depending how successful the simple model is.

If on day two of the operation, we find that 1 in 10 of the animals trapped already has a tag, then we can estimate that the number caught on day one is approximately 10% of the total population. To determine how precise this is, we should use Bayes’ theorem. This is a little cumbersome, though, since the probabilities to catch a previously caught specimen change with each animal captured.

Luckily, the current problem featuring Mr. Darwin is a relatively simple case, in which we make several repeated visits, but on each visit we restrict ourselves to catching only one specimen. The experiment is only mildly different from traditional capture-recapture: normally one captures individuals from a real population of animals, while in this case Darwin has been busy obtaining specimens from a more abstract population of species. He has not been literally marking or tagging those species that he has found, but he has recorded them in his notebook, so that he will know for certain if he encounters the same species again. By saying that he releases his catch after each individual specimen has been logged, I really only mean that at all times each member of the population of species has a chance to be caught. 

Let HN be the hypothesis that N is the total number of species in the population. At any time, let r be the number of species previously recorded and n be the total number of specimens previously examined. Let T represent a Boolean variable which equals 1 if the current, (n+1)th species has already been recorded, and 0 otherwise. Bayes’ theorem states that

P(HN | TrnI) = A × P(HN | I) × P(T | rnHNI)

The constant A is the usual normalization constant that ensures that the sum over all mutually exclusive hypotheses is 1. (The r×n goes on the right-hand side of the ‘|’ in the last term, above, because at each update, it does not constitute part of the new data.)

For each specimen found, therefore, we simply update the prior distribution over N by multiplying by P(T | rnHNI) and then dividing by the sum of all those terms. If T = 1, and the current species has already been recorded, then P(T | rnHNI) is r / N, otherwise it is (N - r) / N. With 122 specimens, as stated in the definition of the problem, this leaves us simply with 121 updates to perform, which is child’s play as soon as we know how to write simple computer programs.

All that remains is to decide what kind of initial prior to use. This seems like a big problem, because we don’t know how big the largest possible N should be. We have infinitely many hypotheses, but now unbounded. This means that if we want to use a uniform prior (if we really have ‘no idea how many different species...’), then we can not normalize it - each of the P(HN | I) is infinitely small. But we can use common sense to support some upper limit: presumably there is some number of species that is physically impossible due to there being insufficient space available. Ultimately our choice will be somewhat arbitrary, but we will find that with a reasonable amount of data, the end result is practically independent of the number of hypotheses chosen. 

Here is some Python code I quickly typed up to solve the problem:

def special_capture_recapture(n, k, max_N):

    # The Charles Darwin problem is a special case
    # of the capture-recapture problem, with multiple visits,
    # and only 1 specimen captured on each visit.

    # n is total number of specimens examined
    # k is number of different species identified
    # max_N is the number of hypotheses
    #   (actually our maximum is max_N – 1, 
    #       since N=0 is also possible)

    # **** STEP 1: set up the priors (uniform) ****

    N = range(max_N)  # a vector representing 
                      # the assumed hypothesis space
    pN = [(1.0/float(max_N)) for x in N]

    # **** STEP 2: perform n-1 updates to pN ****

    # (first specimen does not alter probabilities
    #  except that P(0) -> 0 )

    for i in range(2, k+1):
        # first the previously un-tagged species
        for number in N:
            if number < k:
                pN[number] = 0.0
                pN[number] *= 1.0 - (float(i-1))/float(number)
        normalization = sum(pN)
        for number in N:
            pN[number] = pN[number] / normalization
    for i in range(k, n + 1):                    
        # and now the non-new species
        for number in N:
            if number < k:
                pN[number] = 0.0
                pN[number] *= (float(k)/float(number))
        normalization = sum(pN)
        for number in N:
            pN[number] = pN[number] / normalization

    # **** STEP 3: display the result ****

    print 'P(N > ' + str(k) + ') = ' + str(sum(pN) - pN[k])

Choosing initially 1000 hypotheses, representing the numbers 0 to 999, we find that the probability that Darwin did not find all the species present is 0.036017924.

We can do a sensitivity analysis by varying the number of hypotheses:

for 100 hypotheses we get P(N>19) = 0.036017924

for 10000 hypotheses we get P(N>19) = 0.036017924

To see any deviation of the end result at this level of precision (8 significant figures), we have to go down to only 25 initial hypotheses, for which we get P = 0.036017923. So as long as the initial set of hypotheses is large enough to ensure that there is no chance that the real number of species is not covered, then we can be confident that the result obtained is a reasonable one. Even if the amount of data is reduced from 122 samples to only 30, P(N>19) ranges from 0.9983 to 0.9984 for numbers of hypotheses ranging from 50 to 1000, again showing minimal sensitivity to the number of initial hypotheses. We can, therefore, safely use the truncated, rectangular prior in sequential testing to tell us when we can be confident that we have found almost certainly everything there is to find.