Friday, April 6, 2012

Capture-recapture and the Charles Darwin problem

In chapter 9 of ‘Probability Theory: The Logic of Science’ by Edwin Jaynes, the following problem is offered as an exercise:

When Charles Darwin first landed on the Galapagos Islands in September 1835, he had no idea how many different species of plants he would find there. Having examined n = 122 specimens and finding that they can be classified into m = 19 different species what is the probability that there are still more species as yet unobserved?

In previous posts on hypothesis testing, I described a binary hypothesis test, and a test with infinite hypotheses. In this case, however, we are in the unfortunate position of being unable to say how many hypotheses we should be testing. There are other unknowns involved, and we might feel that the problem is impossible to solve due to the missing information. Missing information, however, is exactly what probability theory was developed for. Here, I’ll propose a simple model that will allow a sensible and remarkably robust answer to be obtained.

The first step is to recognize that the problem is equivalent to a common problem in ecology: how to estimate the size of a population of animals, without being able to count them all. A method that is commonly used in such cases is known as 'capture-recapture'.

Suppose you want to know how many fish there are in a lake, but there is no hope, and no way of knowing whether or not you have succeeded in catching and counting all the  fish present. So this is what you might do: you spend a day fishing using some non-lethal method. The fish captured are counted and marked, then, at the end of the day, released. The next day, you start fishing again, and make note of the proportion of fish caught on this day that have already been marked from the previous day.

Intuitively, we can recognize that if most of the fish obtained on the second day have tags on them, then the total number should not be much larger than the number first caught. Alternatively, if none of the specimens from the second day has a tag, then the number first caught is probably a tiny proportion of the total population. With a few simple assumptions, we can easily produce a more quantitative answer than these intuitive impressions.

Lets assume that the population we want to count is constant over the period within which we are performing our investigation. Lets also assume that whenever an animal is caught, each individual within the population has the same probability to be the one trapped. This means, for example, that catching an individual once does not affect its likelihood to be caught again, which may or may not be strictly correct. One that is trapped may learn something from the process, and be less likely to be caught again. Alternatively, an animal that gets caught might be one of the stupid ones that can’t recognize the trap as easily as some of the others. These considerations represent areas where we might refine our model in future, depending how successful the simple model is.

If on day two of the operation, we find that 1 in 10 of the animals trapped already has a tag, then we can estimate that the number caught on day one is approximately 10% of the total population. To determine how precise this is, we should use Bayes’ theorem. This is a little cumbersome, though, since the probabilities to catch a previously caught specimen change with each animal captured.

Luckily, the current problem featuring Mr. Darwin is a relatively simple case, in which we make several repeated visits, but on each visit we restrict ourselves to catching only one specimen. The experiment is only mildly different from traditional capture-recapture: normally one captures individuals from a real population of animals, while in this case Darwin has been busy obtaining specimens from a more abstract population of species. He has not been literally marking or tagging those species that he has found, but he has recorded them in his notebook, so that he will know for certain if he encounters the same species again. By saying that he releases his catch after each individual specimen has been logged, I really only mean that at all times each member of the population of species has a chance to be caught.

Let HN be the hypothesis that N is the total number of species in the population. At any time, let r be the number of species previously recorded and n be the total number of specimens previously examined. Let T represent a Boolean variable which equals 1 if the current, (n+1)th species has already been recorded, and 0 otherwise. Bayes’ theorem states that

P(HN | TrnI) = A × P(HN | I) × P(T | rnHNI)

The constant A is the usual normalization constant that ensures that the sum over all mutually exclusive hypotheses is 1. (The r×n goes on the right-hand side of the ‘|’ in the last term, above, because at each update, it does not constitute part of the new data.)

For each specimen found, therefore, we simply update the prior distribution over N by multiplying by P(T | rnHNI) and then dividing by the sum of all those terms. If T = 1, and the current species has already been recorded, then P(T | rnHNI) is r / N, otherwise it is (N - r) / N. With 122 specimens, as stated in the definition of the problem, this leaves us simply with 121 updates to perform, which is child’s play as soon as we know how to write simple computer programs.

All that remains is to decide what kind of initial prior to use. This seems like a big problem, because we don’t know how big the largest possible N should be. We have infinitely many hypotheses, but now unbounded. This means that if we want to use a uniform prior (if we really have ‘no idea how many different species...’), then we can not normalize it - each of the P(HN | I) is infinitely small. But we can use common sense to support some upper limit: presumably there is some number of species that is physically impossible due to there being insufficient space available. Ultimately our choice will be somewhat arbitrary, but we will find that with a reasonable amount of data, the end result is practically independent of the number of hypotheses chosen.

Here is some Python code I quickly typed up to solve the problem:

def special_capture_recapture(n, k, max_N):

# The Charles Darwin problem is a special case
# of the capture-recapture problem, with multiple visits,
# and only 1 specimen captured on each visit.

# n is total number of specimens examined
# k is number of different species identified
# max_N is the number of hypotheses
#   (actually our maximum is max_N – 1,
#       since N=0 is also possible)

# **** STEP 1: set up the priors (uniform) ****

N = range(max_N)  # a vector representing
# the assumed hypothesis space
pN = [(1.0/float(max_N)) for x in N]

# **** STEP 2: perform n-1 updates to pN ****

# (first specimen does not alter probabilities
#  except that P(0) -> 0 )

for i in range(2, k+1):
# first the previously un-tagged species
for number in N:
if number < k:
pN[number] = 0.0
else:
pN[number] *= 1.0 - (float(i-1))/float(number)
normalization = sum(pN)
for number in N:
pN[number] = pN[number] / normalization

for i in range(k, n + 1):
# and now the non-new species
for number in N:
if number < k:
pN[number] = 0.0
else:
pN[number] *= (float(k)/float(number))
normalization = sum(pN)
for number in N:
pN[number] = pN[number] / normalization

# **** STEP 3: display the result ****

print 'P(N > ' + str(k) + ') = ' + str(sum(pN) - pN[k])

Choosing initially 1000 hypotheses, representing the numbers 0 to 999, we find that the probability that Darwin did not find all the species present is 0.036017924.

We can do a sensitivity analysis by varying the number of hypotheses:

for 100 hypotheses we get P(N>19) = 0.036017924

for 10000 hypotheses we get P(N>19) = 0.036017924

To see any deviation of the end result at this level of precision (8 significant figures), we have to go down to only 25 initial hypotheses, for which we get P = 0.036017923. So as long as the initial set of hypotheses is large enough to ensure that there is no chance that the real number of species is not covered, then we can be confident that the result obtained is a reasonable one. Even if the amount of data is reduced from 122 samples to only 30, P(N>19) ranges from 0.9983 to 0.9984 for numbers of hypotheses ranging from 50 to 1000, again showing minimal sensitivity to the number of initial hypotheses. We can, therefore, safely use the truncated, rectangular prior in sequential testing to tell us when we can be confident that we have found almost certainly everything there is to find.