Saturday, October 31, 2015

Multi-level modeling

In a post last year, I went through some inference problems concerning a hypothetical medical test. For example, using the known rate of occurrence of some disease, and the known characteristics of a diagnostic test (false-positive and false-negative rates), we were able to obtain the probability that a subject has the disease, based on the test result.

In this post, I'll demonstrate some hierarchical modeling, in a similar context of medical diagnosis. Suppose we know the characteristics of the diagnostic test, but not the frequency of occurrence of the disease, can we figure this out from a set of test results?
A medical screening test has a false-positive rate of 0.15 and a false-negative rate of 0.1. One thousand randomly sampled subjects were tested, resulting in 213 positive test results. What is the posterior distribution over the background prevalence of the disease in this population?

This is technically quite similar to an example in Allen Downey's book, 'Think Bayes: Bayesian Statistics Made Simple.' Allen was kind enough to credit me with having inspired his Geiger-counter example, though it is really a different kind problem to the particle-counting example of mine that he was referring to. The current question seems to me more similar to Allen's problem, even though the scenario of medical screening seems quite different.

In Allen's Geiger-counter problem, a particle counter with a certain sensitivity counts n particles in a given time interval, and the challenge is to figure out the (on average) constant emission rate of the radioactive sample that emitted those particles. The solution has to marginalize over (integrate out) a nuisance parameter, which is the actual number of particles emitted during the interval, in order to work back to the activity of the sample. This number of emissions sits between the average emission rate and the number of registered counts, in the chain of causation, but we don't need to know exactly what the number is, hence the term, 'nuisance parameter.'

The current problem is analogous in that we have to similarly work backwards to the rate of occurrence of the disease (similar to the emission rate of the radioactive sample) from the known test results (number of detected counts). The false-negative rate for the medical test plays a similar role to the Geiger-counter efficiency. Both encode the expected fraction of 'real events' that get registered by the instrument. In this case, however, there is an additional nuisance parameter to deal with, because now, we have to cope with false positives, as well as false negatives.

(We could generalize the Geiger-counter problem, to bring it to the same level of complexity, by positing that in each detection interval, the detector will pick up some number of background events - cosmic rays, detector dark counts, etc. - and that this number has a known, fixed long-term average.)

Defining r to be the rate of occurrence of the disease, and p to be the number of positive test results, Bayes' theorem for the present situation looks like this:

(1)

In the statement of the problem, I didn't specify any prior distribution for the rate of occurrence, r. In keeping with the information content in the problem specification, we'll adopt a uniform distribution over the interval (0, 1). (In the real world, this is typically a very bad thing to do - there are very few things that we no absolutely nothing about - but here it'll give us the advantage of making it easier to check that our analysis makes sense.)

The trick, then, is entirely wrapped up in calculating the likelihood function. We have to evaluate the consequences of each possible value of r. For these purposes, the state of the world consists of a conjunction of 3 variables: number of positive test results, number of those positive tests caused by presence of the disease, and number of people who were tested that had the disease. The first of these is the known result of our measurement. The other two are unknown, but affect how the measurement result was arrived at, so we need to integrate over them.

(At first glance, it might seem that of these 2 nuisance parameters, if I know one, then I know the other. Don't forget, however, that there are two ways to get a positive test result: an accurate diagnosis, and a false positive.)

Because we have 3 variables that determine the relevant properties of the world, the true state of reality (assuming we know it) can be represented as a point in a three-dimensional possibility space. The likelihood function, however, only cares about one of those 3 coordinates, the number of positive test results (our measurement result), so the relevant region of possibility space is a plane, parallel to the other 2 axes. The total probability to have obtained this number of test results is the sum of the probabilities for all points on this plane.

In general, the total probability for some proposition, x, can be written as a sum of probabilities, in terms of some set of mutually exclusive propositions, y1, y2, y3, ....

(2)

This follows from the extended sum rule. From the product rule, each term in the sum can be re-written, yielding the marginalization result:

(3)

Returning to the medical screening problem, let's use pt to represent the number of true positives, d to represent the number of people in the cohort who had the disease. The total number of participants, N, has been taken out of the background information, I, to be explicit. Thus, each point in our 3D hypothesis space has an associated probability that looks like this:

(4)

Treating p and pt as a single proposition, we can marginalize over d, using equation 3:

(5)

Next we just do the same thing again on the term containing p, pt:

(6)

Each of the terms on the right hand side is given by the binomial distribution. The first is the probability to obtain p positive test results, when there are pt true positives. In other words, it is the probability to get (p -  pt) false positives from the (N - d) people in the group who did not have the disease, when the false-positive rate is rfp:

(7)
The second term is the probability to get pt true positives, when the number with the disease is d. This depends on the probability for an afflicted individual to receive a positive test result, which is 1 minus the false-negative rate, (1 - rfn):

(8)

The third term is the probability to get d people with the condition from a sample of N, when the rate of occurrence is r:

(9)

The double sum in equation (6), at each possible value for r, is a task best done by a computer. The python code I wrote for this is given in the appendix, below. The code, of course does not really perform the calculation at every possible value of r, (there are an uncountable infinity of them) but takes a series of little hops through the hypothesis space, in steps 0.002 apart. Because this step size is much narrower than the resulting probability peak, the approximation that the probability varies linearly between steps does no harm to the outcome.

Recall that we're using a uniform prior, and so, from equation (1) , once we calculate the likelihood from equation (6) at all possible values for r, the resulting curve is proportional to the posterior distribution. Thus, a plot of the likelihoods produced by my python function (after normalization) gives the required distribution:

The figure gives a point estimate and a confidence interval. Because the posterior distribution is symmetric, my point estimate is obtained by taking the highest point on the curve.

To get the 50% confidence interval (using my unorthodox, but completely sensible definition of confidence intervals), I just kept moving one step to the left and to the right from the high point, until the sum of the probability between my left-hand and right-hand markers first reached 0.5. Again, this is a good procedure in this case, because the curve is symmetric - in the case of asymmetry, we would need a different procedure, if, for example, we required an interval with equal amounts of probability on either side of the point estimate.

The centre of the posterior distribution is at r = 0.15. Let's see if that makes sense. With 1000 test subjects, at this rate, we expect 150 cases of the disease. With 85% of positive cases correctly identified (false-negative rate = 0.15) then we should have 127.5 true positive test results (on average). Also we should get (1000 - 150) × 0.1 = 85 false positive test results. Adding these together, we get 212.5 expected positive test results, which is, to the nearest integer, what was specified at the top. It looks like all the theory and coding have done what they should have done.

For fun, we can also check that the calculated confidence interval makes sense. I've specified a 50% confidence interval, which means that if we do the experiment multiple times, about half of the calculated confidence intervals should contain the true value for the incidence rate of the disease. With only a little additional python code, I ran a Monte-Carlo simulation of several measurements of 1000 test subjects each.

The true incidence rate was fixed at 0.15, and the number of positive test results was randomly generated for each iteration of the simulated measurement. The python package, numpy, can generate binomially distributed random numbers. I used the following lines of code to generate each number of positive test results, p:

d = numpy.random.binomial(1000, 0.15)           # number with disease
p_t = numpy.random.binomial(d, (1 - r_fn))       # number of true positives
p_f = numpy.random.binomial(N - d, r_fp)         # number of false positives
p = p_t + p_f

Then, using each p, I calculated the 50% confidence limits, as before, and counted the occasions on which the true value for r, 0.15, fell between these limits. I ran 100 simulated experiments, and amazingly, exactly 50 produced errors bars that contained the true value. There certainly was a little luck here (standard deviation here is 5, so 32% of the time, such a set of 100 measurements will produce true-value-containing C.I.'s either fewer than 45 times, or more than 55 times), but still this result serves as a robust validation of my numerical procedures. Probability theory works!

(Note: to keep the code presented in the appendix as understandable as possible, I didn't do much optimization. To have done the 100 measurement Monte-Carlo run with this exact code would have taken days, I think. The code I ran was a little different. In particular, the ranges over r, d, and pt were truncated, as large regions of these parameter spaces contribute negligibly. This allowed my simulation to run in just under an hour.)

Appendix

# Python source code for the calculation
# Warning: this algorithm is very slow - considerable optimization is possible

import numpy as np
from scipy.stats import binom      # calculates binomial distributions

def get_likelihoods(N, p):

# inputs:
# N is total number of people tested
# p is number of positive test results

r_fn = 0.15                # false-negative rate
r_fp = 0.1                  # false-positive rate

# number with disease can be anything up to number of people tested:
d_range = range(N + 1)

# number of true positives can be anything up to total number of positives:
p_t_range = range(p + 1)

likelihoods = [ ]

delta_r = 0.002
rList = np.arange(0, 1 + delta_r, delta_r)

for r in rList:                                                         # scan over hypothesis space
temp = 0
for d in d_range:                                                # these 2 for loops do the double summation
for p_t in p_t_range:
p1 = binom.pmf(p - p_t, N - d, r_fp)           # equation (7)
p2 = binom.pmf(p_t, d, (1 - r_fn))              # equation (8)
p3 = binom.pmf(d, N, r)                              # equation (9)

temp += (p1 * p2 * p3)

likelihoods.append(temp)

return likelihoods