In chapter 9 of
‘Probability Theory: The Logic of Science’ by Edwin Jaynes, the following
problem is offered as an exercise:
When Charles Darwin first landed on the Galapagos Islands in
September 1835, he had no idea how many different species of plants he would
find there. Having examined n = 122 specimens and finding that they can be
classified into m = 19 different species, what is the
probability that there are still more species as yet unobserved?
In previous
posts on hypothesis testing, I described a binary hypothesis test, and a test
with infinite hypotheses. In this case, however, we are in the unfortunate
position of being unable to say how many hypotheses we should be testing. There
are other unknowns involved, and we might feel that the problem is impossible
to solve due to the missing information. Missing information, however, is exactly
what probability theory was developed for. Here, I’ll propose a simple model
that will allow a sensible and remarkably robust answer to be obtained.
The first step
is to recognize that the problem is equivalent to a common problem in ecology:
how to estimate the size of a population of animals, without being able to
count them all. A method that is commonly used in such cases is known as 'capture-recapture'.
Suppose you want to know how many fish there are in a lake, but there is no
hope of catching every single one, and no way of knowing whether or not you have succeeded in
counting them all. So this is what you might do: you spend a
day fishing, using some non-lethal method. The fish captured are counted and
marked, then, at the end of the day, released. The next day, you start fishing
again, and make note of the proportion of fish caught on this day that have
already been marked from the previous day.
Intuitively, we
can recognize that if most of the fish obtained on the second day have tags on
them, then the total number should not be much larger than the number first
caught. Alternatively, if none of the specimens from the second day has a tag,
then the number first caught is probably a tiny proportion of the total
population. With a few simple assumptions, we can easily produce a more
quantitative answer than these intuitive impressions.
Let's assume that
the population we want to count is constant over the period within which we are
performing our investigation. Let's also assume that whenever an animal is
caught, each individual in the population has the same probability of being
the one trapped. This means, for example, that catching an individual once does
not affect its likelihood of being caught again, which may or may not be strictly
correct. An animal that is trapped may learn something from the process, and be less
likely to be caught again. Alternatively, an animal that gets caught might be
one of the stupid ones that can't recognize the trap as easily as some of the
others. These considerations represent areas where we might refine our model in
future, depending on how successful the simple model is.
If on day two of
the operation we find that 1 in 10 of the animals trapped already has a tag, then we
can estimate that the number caught on day one is approximately 10% of the
total population. To determine how precise this estimate is, we
should use Bayes' theorem. This is a little cumbersome, though, since the
probability of catching a previously caught specimen changes with each animal captured.
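To make that rough estimate concrete before doing anything Bayesian, here is a minimal sketch of the corresponding point estimate; the particular numbers are invented purely for illustration:

# Rough capture-recapture point estimate (no uncertainty attached).
# All of the numbers below are made up, purely for illustration.
marked_day_one = 50     # fish caught, tagged and released on day one
caught_day_two = 40     # fish caught on day two
tagged_day_two = 4      # of those, how many already carried a tag

# Tags make up 1 in 10 of the day-two catch, so the day-one catch is
# estimated to be about 10% of the whole population:
estimated_population = marked_day_one * caught_day_two / float(tagged_day_two)
print(estimated_population)    # 500.0

Attaching a measure of precision to this estimate is where Bayes' theorem, and the awkwardness just mentioned, come in.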
Luckily, the
current problem featuring Mr. Darwin is a relatively simple case, in which we
make several repeated visits, but on each visit we restrict ourselves to
catching only one specimen. The experiment is only mildly different from
traditional capture-recapture: normally one captures individuals from a real
population of animals, while in this case Darwin has been busy obtaining
specimens from a more abstract population of species. He has not been literally
marking or tagging those species that he has found, but he has recorded them in
his notebook, so that he will know for certain if he encounters the same species
again. By saying that he releases his catch after each individual specimen has
been logged, I really only mean that at all times each member of the population
of species has a chance to be caught.
Let H_N
be the hypothesis that N is the total number of species in the population. At
any time, let r be the number of species previously recorded and n be the total
number of specimens previously examined. Let T represent a Boolean variable
which equals 1 if the species of the current, (n+1)th specimen has already been recorded,
and 0 otherwise. Bayes' theorem states that

P(H_N | T r n I) = A × P(H_N | I) × P(T | r n H_N I)
The constant A
is the usual normalization constant that ensures that the sum over all mutually
exclusive hypotheses is 1. (The r and n go on the right-hand side of the '|' in
the last term, above, because at each update they do not constitute part of
the new data.)
For each
specimen found, therefore, we simply update the prior distribution over N by
multiplying by P(T | r n H_N I) and then dividing by the sum of all those
terms. If T = 1, meaning that the current specimen belongs to a species already recorded, then
P(T | r n H_N I) is r / N; otherwise it is (N - r) / N. With 122
specimens, as stated in the definition of the problem, this leaves us simply
with 121 updates to perform (the first specimen cannot favour any particular non-zero N,
since for it (N - r) / N = N / N = 1), which is child's play as soon as we know how to
write simple computer programs.
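To make the update rule concrete, here is a minimal sketch of a single update step; the function name and the example numbers are my own, not part of the original problem:

def update(pN, r, already_recorded):
    # One Bayesian update of the distribution over N.
    # pN[N] is the current probability of hypothesis H_N, and r is the
    # number of species recorded before this specimen was examined.
    for N in range(len(pN)):
        if N == 0 or N < r:
            pN[N] = 0.0                   # impossible hypotheses
        elif already_recorded:
            pN[N] *= float(r) / N         # P(T = 1 | r n H_N I) = r / N
        else:
            pN[N] *= float(N - r) / N     # P(T = 0 | r n H_N I) = (N - r) / N
    total = sum(pN)
    return [p / total for p in pN]        # divide by the sum (the constant A)

# For example, if 5 species have been recorded so far and the next specimen
# belongs to a new, sixth species:
# pN = update(pN, r=5, already_recorded=False)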
All that remains
is to decide what kind of initial prior to use. This seems like a big problem,
because we don't know how big the largest possible N should be. Again we have
infinitely many hypotheses, but this time they are unbounded. This means that if we want to
use a uniform prior (if we really have 'no idea how many different species...'),
then we cannot normalize it: each of the P(H_N | I) would be infinitely small.
But we can use common sense to support some upper limit: presumably there is
some number of species that is physically impossible due to there being
insufficient space available. Ultimately our choice will be somewhat arbitrary,
but we will find that, with a reasonable amount of data, the end result is
practically independent of the number of hypotheses chosen.
Here is some Python code I quickly typed up to solve the problem:
def special_capture_recapture(n, k, max_N):
    # The Charles Darwin problem is a special case
    # of the capture-recapture problem, with multiple visits,
    # and only 1 specimen captured on each visit.
    # n is total number of specimens examined
    # k is number of different species identified
    # max_N is the number of hypotheses
    # (actually our maximum is max_N - 1, since N = 0 is also possible)

    # **** STEP 1: set up the priors (uniform) ****
    N = range(max_N)    # a vector representing the assumed hypothesis space
    pN = [1.0 / float(max_N) for x in N]

    # **** STEP 2: perform n-1 updates to pN ****
    # (first specimen does not alter probabilities, except that P(0) -> 0)
    for i in range(2, k + 1):
        # first the previously un-tagged species
        for number in N:
            if number < k:
                pN[number] = 0.0
            else:
                pN[number] *= 1.0 - float(i - 1) / float(number)
        normalization = sum(pN)
        for number in N:
            pN[number] = pN[number] / normalization

    for i in range(k, n + 1):
        # and now the non-new species
        for number in N:
            if number < k:
                pN[number] = 0.0
            else:
                pN[number] *= float(k) / float(number)
        normalization = sum(pN)
        for number in N:
            pN[number] = pN[number] / normalization

    # **** STEP 3: display the result ****
    print('P(N > ' + str(k) + ') = ' + str(sum(pN) - pN[k]))
Choosing
initially 1000 hypotheses, representing the numbers 0 to 999, we find that the probability that Darwin did not find all the species present is
0.036017924.
We can do a
sensitivity analysis by varying the number of hypotheses:
for 100 hypotheses we get P(N > 19) = 0.036017924
for 10000 hypotheses we get P(N > 19) = 0.036017924
To see any
deviation of the end result at this level of precision (8 significant figures),
we have to go down to only 25 initial hypotheses, for which we get P = 0.036017923. So
as long as the initial set of hypotheses is large enough to ensure that the real
number of species is covered, we can be
confident that the result obtained is a reasonable one. Even if the amount of
data is reduced from 122 specimens to only 30, P(N > 19) ranges from 0.9983 to
0.9984 for numbers of hypotheses ranging from 50 to 1000, again showing minimal sensitivity to the number of initial hypotheses. We can, therefore, safely use the truncated, rectangular prior in sequential testing to tell us when we can be confident that we have almost certainly found everything there is to find.