## Wednesday, May 16, 2012

### Nuisance Parameters

A few days ago, I was racking my brain trying to think of a suitable example for this piece, when one landed in my lap unexpectedly. On his own statistics blog, Andrew Gelman has been posting the questions from an exam he set on the design and analysis of surveys. Question 1 was the following:
> Suppose that, in a survey of 1000 people in a state, 400 say they voted in a recent primary election. Actually, though, the voter turnout was only 30%. Give an estimate of the probability that a nonvoter will falsely state that he or she voted. (Assume that all voters honestly report that they voted.)
Now the requested estimate is simple enough to produce, and needs little more insight than the product rule and the law of large numbers. But an important part of rationally advancing our knowledge is being able to quantify the degree of uncertainty in that new knowledge. A parameter estimate without a confidence interval tells us very little, because it is just that, an estimate. We can't empirically determine that the estimated value is the truth, only that it is the most probable value. If we want to make decisions based on some estimated parameter, or give a measurement any kind of substance, we need to know how much more probable it is than other possible values. We can convey this information conveniently by supplying an error bar - a region of values on either side of the most likely value, in which the true value is very likely to reside. This is why the error bar is considered to be one of the most important concepts in science.

If we look at the question above, with an ambition to provide not just the estimate, but a region of high confidence on either side, then it becomes one of the simplest possible examples of marginalization, the topic of this post. It is also a really nice example to use here, because it utilizes technology we have already played with, in some of my earlier posts. These technologies are the binomial distribution, used in 'Fly papers and photon detectors' (Equation 5 in that post), and the beta function, which I introduced without naming it in 'How to make ad-hominem arguments' (Equation 4 in that article).

Define V to be the proposition that a given person voted, and Y to be the statement that they declared that they did vote. We want to know the probability that a person says they voted when in fact they did not, which is P(Y | V'). From the product rule,

P(YV') = P(V') × P(Y | V')

P(V') = 0.7. That's the probability that a given person did not vote.

If we are experienced in such things, we know that random deviations from expected behaviour decrease in relative magnitude as the size of the sample increases (that's the law of large numbers). This means that for a sample of 1000, we can be confident that the number of voters will not be much different from 300, since P(V) = 0.3. This means that approximately 100 non-voters lied that they had voted, out of a total sample of 1000, so

P(YV') ≈ 0.1

The crude estimate for P(Y | V'), therefore, is 0.1 / 0.7 = 1/7.
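As a quick check, this crude estimate can be reproduced in a couple of lines. It is just the product-rule arithmetic above, under the assumption that the sample proportions match the population proportions exactly:

```python
# Crude estimate of P(Y | V') via the product rule: P(Y | V') = P(YV') / P(V')
p_not_voted = 1 - 0.30        # P(V'): turnout was only 30%
p_lied = 400 / 1000 - 0.30    # P(YV'): said "voted" but didn't, roughly 0.1
estimate = p_lied / p_not_voted

print(round(estimate, 4))     # 0.1429, i.e. 1/7
```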

A more rigorous calculation acknowledges the uncertainty in P(YV'), and at the same time automatically provides a means to get the desired confidence interval. Let's suppose that, to start with, we have no information about the proportion of non-voters who lie in surveys; then we are justified in using a uniform prior distribution. It then follows from Bayes' theorem that the posterior probability distribution for the true fraction is given by the beta function, in close analogy with the parable of Weety Peebles. If we knew the number of people who didn't vote but lied in the survey, then this would be a piece of cake, but we don't know it. It is what's called a nuisance parameter. Fortunately, there is a standard procedure for dealing with this.

If we have a model with 2 free parameters, θ and n, then the joint probability for any pair of values for these parameters is

P(θn | DI) = P(θn | I) P(D | θnI) / P(D | I)   (1)

But if n is a nuisance parameter, in which we have no direct interest, then we just integrate it out. The so-called marginal probability distribution, P(θ | DI), is the sum over Equation (1) for all possible values of n. If n is a continuous parameter, then the sum becomes an integral:

 P(θ | DI) = ∫  P(θn | DI) dn (2)

In our example, we have one desired parameter, the fraction of non-voters who say that they voted, f, and one nuisance parameter, the actual number of liars in the sample of 1000 people, so to get the distribution over all possible f, we need to calculate a two-dimensional array of numbers, something that is still amenable to a spreadsheet calculation. Down a column, I listed all the possible numbers of liars, n, from 0 to 400 (there can't be more than 400 as all voters tell the truth, according to the provided background information). For each of these n, the total number of non-voters is 600 plus that number (600 is the number of non-lying non-voters). The probability for each of these numbers of non-voters, P(n), was calculated in an adjacent column, using the binomial distribution, with p = 0.7.
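The P(n) column just described can be sketched in a few lines of Python instead of a spreadsheet. This is a sketch, not the author's actual spreadsheet; `binom_pmf` is a hand-rolled helper built on the standard library's `math.comb`:

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Each hypothesis n (number of liars, 0..400) implies 600 + n non-voters in
# the sample of 1000; weight it by the Binomial(1000, 0.7) probability of
# drawing that many non-voters from a population with 70% non-voters.
p_n = {n: binom_pmf(600 + n, 1000, 0.7) for n in range(401)}

most_likely_n = max(p_n, key=p_n.get)
print(most_likely_n)   # 100, i.e. 700 non-voters, as the law of large numbers suggests
```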

Along the top of the spreadsheet, I listed all the hypotheses I wanted to test concerning the value of the desired fraction, f. I divided the full range [0, 1] into 1000 slices of width Δf = 0.001. The probability that the true value of f lies in any given range [f, f + Δf] is estimated as P(f) × Δf. Each P(f) was calculated using the beta function:

P(f | nI) = [(N + 1)! / (n! (N - n)!)] f^n (1 - f)^(N-n)   (3)

Here N = 600 + n is the number of non-voters in the sample under the hypothesis being considered. Each P(f | nI) was multiplied by the calculated P(n) to give the joint probability, specified in Equation (1). At the bottom, along another row, I calculated the sum of each column, which gave the desired marginal probability distribution, which I plot below:

According to my calculation, the peak of this curve is at 0.143 (which is 1/7, as expected). As an error bar, let's identify points on either side of the peak such that the enclosed area is 0.95. This means that there is a 95% probability that the true value of f lies between these points. To find these points, just integrate the curve in each direction from the peak, until the area on each side first reaches 0.475. Performing this integration gives a 95% confidence interval of [0.102, 0.180].
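The whole spreadsheet calculation fits comfortably in a short script. The following is a sketch under the same assumptions as above (uniform prior on f, binomial weights for the nuisance parameter n); the variable names are mine, and terms with negligible weight are skipped for speed:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

df = 0.001
fs = [i * df for i in range(1001)]          # hypotheses for f covering [0, 1]
posterior = [0.0] * len(fs)

# Marginalize over the nuisance parameter: P(f | D) = sum over n of P(n) P(f | n)
for n in range(401):                        # n = number of liars
    N = 600 + n                             # non-voters in the sample
    p_n = binom_pmf(N, 1000, 0.7)           # weight from the 30% turnout
    if p_n < 1e-12:
        continue                            # negligible contribution
    c = (N + 1) * math.comb(N, n)           # beta-distribution normalization
    for i, f in enumerate(fs):
        posterior[i] += p_n * c * f**n * (1 - f)**(N - n)

total = sum(posterior)
posterior = [p / total for p in posterior]  # normalize over the grid

i_peak = max(range(len(fs)), key=lambda i: posterior[i])
peak = fs[i_peak]

# Walk outward from the peak until each side encloses 47.5% of the probability
lo_i, hi_i, area_lo, area_hi = i_peak, i_peak, 0.0, 0.0
while area_lo < 0.475 and lo_i > 0:
    lo_i -= 1
    area_lo += posterior[lo_i]
while area_hi < 0.475 and hi_i < len(fs) - 1:
    hi_i += 1
    area_hi += posterior[hi_i]
lo, hi = fs[lo_i], fs[hi_i]

print(peak, lo, hi)   # approximately 0.143, 0.10, 0.18
```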

Now we know not only the most likely value of f, but also how confident we are that the true value of f is near to that estimate. This is what good science is all about.

The process of eliminating nuisance parameters is termed marginalization. It's an important concept in Bayesian statistics. In maximum-likelihood model fitting, all free parameters in a model must be fitted at once, but use of Bayes' theorem not only permits important prior information to enter the calculation and enables confidence-interval estimation without a separate calculation, but also allows us to reduce the number of parameters that must be calculated to only those that interest us. During my PhD work, for example, most of my time was spent measuring the temporal responses of nanocrystals to short laser pulses. My fitting model included an offset (displacement up the y-axis), a shift (displacement along the time axis), and a scale parameter (dependent on how long I measured for, and how many photons my detector picked up). That's three parameters giving information only about the behaviour of the measurement apparatus. The physical model pertaining to the behaviour of the nanocrystals typically consisted of only two time constants. That's three out of five model variables that are nuisance parameters.

A really excellent read for those with an interest in the technicalities of Bayesian stats is a textbook called 'Bayesian Spectrum Analysis and Parameter Estimation,' by G. Larry Bretthorst (available for free download here). This book describes some stunning work. While discussing the advantages of eliminating nuisance parameters, Bretthorst produces one of the sexiest lines in the whole of the statistical literature:

> In a typical small problem, this might reduce the search dimensions from ten to two; in one "large" problem the reduction was from thousands to six or seven.

He goes on: "This represents many orders of magnitude reduction in computation, the difference between what is feasible and what is not."

Thanks to Andrew Gelman for providing valuable inspiration for this post!

'Bayesian Spectrum Analysis and Parameter Estimation' by G. Larry Bretthorst

1. Great post!

I've been reading your blog and it is a great help to understand this stuff. Congratulations and thank you!

2. I don't understand the first part of the article. Why don't you just start off with p(f | D) to see which f is more likely?

1. The point is that we don't know the number from the survey of 1000 that actually voted, we only know that over the population, 30% voted. Consequently, we don't have the exact number of liars in the sample. We can still generate a probability distribution over the proportion of lying non-voters, by integrating over the possible numbers of liars in the sample.

I go over this integration process (marginalization) briefly but formally in the glossary, http://maximum-entropy-blog.blogspot.com/p/glossary.html#marginal

Hope this helps.