Showing posts with label Error Bars.

Thursday, August 3, 2017

Standard Error



In the quantification of uncertainty, there is an important distinction that's often overlooked. This is the distinction between the dispersion of a distribution, and the dispersion of the mean of the distribution. 

By 'dispersion of a distribution,' I mean how poorly the mass of that probability distribution is localized in hypothesis space. If half the employees in Company A are aged between 30 and 40, and half the employees in Company B are aged between 25 and 50, then (all else being equal) the probability distribution over the age of a randomly sampled employee from Company B has a wider dispersion than the corresponding distribution for Company A.

A common measure of dispersion is the standard deviation, which is, roughly speaking, the typical distance between the parts of the distribution and the mean of that distribution (more precisely, it is the root-mean-square of those distances).
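To make the distinction concrete, here is a minimal sketch in Python (the ages below are simulated, purely for illustration): the standard deviation describes the spread of individual values, while the standard error of the mean describes the uncertainty in the estimated average, and it shrinks as the sample grows.

import numpy as np

rng = np.random.default_rng(0)

# Simulated ages for a hypothetical company (illustration only)
ages = rng.normal(loc=37.0, scale=6.0, size=500)

# Dispersion of the distribution itself: the standard deviation
sd = ages.std(ddof=1)

# Dispersion of the *mean* of the distribution: the standard error,
# which shrinks like the square root of the sample size
sem = sd / np.sqrt(len(ages))

print(f"standard deviation of ages: {sd:.2f}")
print(f"standard error of the mean: {sem:.2f}")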

Saturday, October 31, 2015

Multi-level modeling




In a post last year, I went through some inference problems concerning a hypothetical medical test. For example, using the known rate of occurrence of some disease, and the known characteristics of a diagnostic test (false-positive and false-negative rates), we were able to obtain the probability that a subject has the disease, based on the test result.

In this post, I'll demonstrate some hierarchical modeling, in a similar context of medical diagnosis. Suppose we know the characteristics of the diagnostic test, but not the frequency of occurrence of the disease: can we figure this out from a set of test results?
A medical screening test has a false-positive rate of 0.15 and a false-negative rate of 0.1. One thousand randomly sampled subjects were tested, resulting in 213 positive test results. What is the posterior distribution over the background prevalence of the disease in this population?
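One way to set the calculation up - a sketch of the general approach, not necessarily the exact model used in the post - is a grid over the unknown prevalence with a uniform prior: for a prevalence r, the probability that a randomly chosen subject tests positive is r(1 - 0.1) + (1 - r)(0.15), and the 213 positives out of 1000 then enter through a binomial likelihood.

import numpy as np
from scipy.stats import binom

# Grid over the unknown prevalence r, with a uniform prior (an assumption)
r = np.linspace(0.0, 1.0, 1001)

# Probability of a positive result given prevalence r:
# true positives at rate r * (1 - false_negative_rate),
# false positives at rate (1 - r) * false_positive_rate
p_pos = r * (1 - 0.1) + (1 - r) * 0.15

# Binomial likelihood of 213 positives in 1000 tests, times the flat prior
posterior = binom.pmf(213, 1000, p_pos)
posterior /= posterior.sum() * (r[1] - r[0])   # normalize as a density

print("posterior peak near r =", r[np.argmax(posterior)])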

Saturday, April 18, 2015

The Fundamental Confidence Fallacy


The title of this post comes from an excellent recent paper (as far as I can tell, still in draft form) on misunderstandings of confidence intervals. The paper, 'The fallacy of placing confidence in confidence intervals', by R. D. Morey et al., comes from almost exactly the same set of authors whose earlier paper on a very similar topic I criticized before, but the current paper does a far better job of explaining the authors' position, and arguing for it.

The authors identify the fundamental confidence fallacy (FCF) as believing automatically that,
If the probability that a random interval contains the true value is X%, then the plausibility (or probability) that a particular observed interval contains the true value is also X%.

Saturday, May 24, 2014

Pass / Fail Mentality



Recently, I was talking about calibration (here and here), and how it should be more than just identifying the most likely cause of the output of a measuring instrument. The calibration process should strive to characterize the range of true conditions that might produce such an output, along with any major asymmetries (bias) in the relationship between the truth and the instrument's reading. In short, we need to identify all the major characteristics of the probability distribution over states of the world, given the condition of our measuring device.

Failure to go beyond simply identifying each possible instrument reading with a single most probable cause is a special case of a very general problem that, in my opinion, plagues scientific practice. Such a major failure mode should have a special name, so let's call it 'pass / fail mentality.' It is the most extreme possible form of a fallacy known as spurious precision, and involves needlessly throwing away information.

Saturday, March 22, 2014

Whose confidence interval is this?




This week I was confronted by yet another facet of the nonsensical nature of the frequentist approach to statistics. The blog of Andrew Gelman drew my attention to a recent peer-reviewed paper studying the extent of misunderstanding of the meaning of confidence intervals among students and researchers. What shocked me, though, was not only the findings of the study.

Confidence intervals are a relatively simple idea in statistics, used to quantify the precision of a measurement. When a measurement is subject to statistical noise, the result is not going to be exactly equal to the parameter under investigation. For a high-quality measurement, where the impact of the noise is relatively low, we can expect the result of the measurement to be close to the true value. We can express this expected closeness to the truth by supplying a narrow confidence interval. If the noise is more dominant, then the confidence interval will be wider - we will be less sure that the truth is close to the result of the measurement. Confidence intervals are also known as error bars.
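For the simplest textbook case - a mean measured under Gaussian noise of known size - the interval is easy to construct; here is a small illustration with invented numbers, separate from the study discussed above.

import numpy as np

rng = np.random.default_rng(1)

true_value = 10.0
noise_sd = 2.0                                       # known noise level (an assumption)
data = true_value + noise_sd * rng.standard_normal(25)

estimate = data.mean()
half_width = 1.96 * noise_sd / np.sqrt(len(data))    # 95% interval for known sigma

print(f"estimate: {estimate:.2f}, 95% confidence interval: "
      f"[{estimate - half_width:.2f}, {estimate + half_width:.2f}]")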

Saturday, October 5, 2013

Error Bars for Binary Parameters




Propositions about real phenomena are either true or false. For some logical proposition, e.g. "there is milk in the fridge", let A be the binary parameter denoting its truth value. Now, truth values are not in the habit of marching themselves up to us and announcing their identity. In fact, for propositions about specific things in the real world, there is normally no way whatsoever to gain direct access to these truth values, and we must make do with inferences drawn from our raw experiences. We need a system, therefore, to assess the reliability of our inferences, and that system is probability theory. When we do parameter estimation, a convenient way to summarize the results of the probability calculations is the error bar, and it would seem to be necessary to have some corresponding tool to capture our degree of confidence when we estimate a binary parameter, such as A. But what could this error bar possibly look like? The hypothesis space consists of only two discrete points, and there isn't enough room to convey the required information.   

Let me pose a different question: how easy is it to change your mind? One of the important functions of probability theory is to quantify evidence in terms of how easy it would be for future evidence to change our minds. Suppose I stand at the side of a not-too-busy road, and wonder in which direction the next car to pass me will be travelling. Let A now represent the proposition that any particular observed vehicle is travelling to the left. Suppose that, upon my arrival at the scene, I'm in a position of extreme ignorance about the patterns of traffic on the road, that my ignorance is best represented (for symmetry reasons) by indifference, and that my resulting probability estimate for A is 50%.

Suppose that after a large number of observations in this situation, I find that almost equal numbers of vehicles have been going right as have been going left. This results in a probability assignment for A that is again 50%. Here's the curious thing, though: in my initial state of indifference, only a small number of observations would have been sufficient for me to form a strong opinion that the frequency with which A is true, fA, is close to either 0 or 1. But now, having made a large number of observations, I have accumulated substantial evidence that fA is in fact close to 0.5, and it would take a comparably large number of observations to convince me otherwise. The appropriate response to possible future evidence has changed considerably, but I used the same number, 50%, to summarize my state of information. How can this be?

In fact, the solution is quite automatic. In order to calculate P(A), it is first necessary to assign a probability distribution over frequency space, P(fA). I did this in one of my earliest blog posts, in which I solved a thinly disguised version of Laplace's sunrise problem. Let's treat this traffic problem in the same way. My starting position in the traffic problem, indifference, meant that my information about the relative frequency with which an observed vehicle travels to the left was best encoded with a prior probability distribution that has the same value at all points within the hypothesis space. Let's assume also that we start with the conviction (from whatever source) that the frequency, fA, is constant in the long run and that consecutive events are independent. Laplace's solution (this is, yet again, identical to the sunrise problem he solved just over 200 years ago) provides a neat expression for P(A), known as the rule of succession (p is the probability that the next event is of type X, n is the number of observed occurrences of type X events, and N is the total number of observed events):


p = (n + 1) / (N + 2)    (1)
but his method follows the same route I took when predicting a person's behaviour from past observations: at each possible frequency (between 0 and 1), calculate P(fA) from Bayes' theorem, using the binomial distribution to calculate the likelihood function. The proposition A can be resolved into a set of mutually exclusive and exhaustive propositions about the frequency, fA, giving P(A) = P(A[f1 + f2 + f3 + ...]), so that the product rule, applied directly after the extended sum rule, means that the final assignment of P(A) consists of integrating over the product fA × P(fA), which we recognize as obtaining the expectation, 〈fA〉.

The figure below depicts the evolution of the distribution, P(fA  | DI), for the first N observations, for several N. The data all come from a single sequence of binary uniform random variables, and the procedure follows equation (4), from my earlier article. We started, at N = 0, from indifference, and the distribution was flat. Gradually, as more and more data was added, a peak emerged, and got steadily sharper and sharper:  
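In case the original figure doesn't survive the page format, something very like it can be regenerated with the short sketch below (simulated 50/50 binary data, a uniform prior, and the binomial likelihood evaluated on a grid of frequencies; the seed and grid size are arbitrary choices):

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(42)

# A single sequence of binary observations (1 = "travelling left"), p = 0.5
data = rng.integers(0, 2, size=1000)

f = np.linspace(0.0, 1.0, 501)          # grid over the frequency fA
df = f[1] - f[0]

for N in (0, 3, 10, 30, 100, 1000):
    n = int(data[:N].sum())             # number of "left" observations so far
    # Uniform prior times binomial likelihood, normalized to a density
    post = binom.pmf(n, N, f) if N > 0 else np.ones_like(f)
    post = post / (post.sum() * df)
    print(f"N = {N:4d}: peak at f = {f[np.argmax(post)]:.2f}, "
          f"P(A) = <f> = {np.sum(f * post) * df:.3f}")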




(The numbers on the y-axis are larger than 1, but that's OK because they are probability densities - once the curve is integrated, which involves multiplying each value by a differential element, df, the result is exactly 1.) The probability distribution, P(fA  | DI), is therefore the answer to our initial question: P(fA  | DI) contains all the information we have about the robustness of P(A) against new evidence, and we get our error bar by somehow characterizing the width of P(fA  | DI).

Now, an important principle of probability theory requires that the order in which we incorporate different elements of the data, D, does not affect the final posterior supplied by Bayes' theorem. For D = {d1, d2, d3, ...}, we could work the initial prior over to a posterior using only d1, then, using this posterior as the new prior, repeat for d2, and so on through the list. We could do the same thing, only taking the d's in any order we choose. We could bundle them into sub-units, or we could process the whole damn lot in a single batch. The final probability assignment must be the same in each case. Violation of this principle would invalidate our theory (assuming there is no causal path from knowledge of some of the d's to subsequently observed d's, as there could be if, for example, I'm observing my own mental state).

For example, each curve on the graph above shows the result from a single application of Bayes' theorem, though I could just as well have processed each individual observation separately, producing the same result. This works because the prior distribution is changing with each new bit of data added, gradually recording the combined effect of all the evidence. Each di becomes subsumed into the background information, I, before the next one is treated.
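This order-independence is easy to verify numerically. Here is a small sketch (grid posterior over the frequency f, with made-up binary data): processing the observations one at a time, re-using each posterior as the next prior, lands on exactly the same distribution as processing the whole batch at once.

import numpy as np

rng = np.random.default_rng(7)
data = rng.integers(0, 2, size=200)      # 1 = left, 0 = right (simulated)

f = np.linspace(0.001, 0.999, 999)       # grid over the frequency

# One observation at a time: prior -> posterior -> new prior, repeatedly
p_seq = np.ones_like(f)
for d in data:
    p_seq *= f if d == 1 else (1 - f)    # single-observation likelihood
    p_seq /= p_seq.sum()

# The whole batch in one go, via the binomial form of the likelihood
n, N = int(data.sum()), len(data)
p_batch = f**n * (1 - f)**(N - n)
p_batch /= p_batch.sum()

print(np.allclose(p_seq, p_batch))       # True: order and batching don't matter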

But we might have the feeling that something peculiar happens if we try to carry this principle over to the calculation of P(A | DI). What is the result of observing 9 consecutive cars travelling to the left? It depends what has happened before, obviously. Suppose D1 is now the result of 1 million observations, consisting of exactly 500,000 vehicles moving in each direction. The posterior assignment is almost exactly 50%. Now I see D2, those 9 cars travelling to the left - what is the outcome? The new prior is 50%, the same as it was before the first observation.

What the hell is going on here? How do we account for the fact that these 9 vehicles have a much weaker effect on our rational belief now, than they would have done if they had arrived right at the beginning of the experiment? The outcome of Bayes' theorem is proportional to prior times likelihood: P(A | I)×P(D | AI). Looking at 2 very different situations, 9 observations after 1 million, and 9 observations after zero, the prior is the same, the proposition, A, is the same, and D is the same. The rule of succession with n = N = 9 gives the same result in each case. It seems like we have a problem. We might solve the problem by recognizing that the correct answer comes by first getting P(fA  | DI) then finding its expectation, but how did we recognize this? Is it possible that we rationally reached out to something external to probability theory to figure out that direct calculation of P(A | DI) would not work? Could it be that probability theory is not the complete description of rationality? (Whatever that means.)

Of course, such flights of fancy aren't necessary. The direct calculation of P(A | DI) works perfectly fine, as long as we follow the procedure correctly. Let's define 2 new propositions,

L = A = "the next vehicle to pass will be travelling to the left,"
R = "the next vehicle to pass will be travelling to the right."

With D1 and D2 as before:

D1  = "500,000 out of 1 million vehicles were travelling to the left"
D2  = "Additional to D1, 9 out of 9 vehicles were travelling to the left"

Background information is given by

I1 = "prior distribution over f is uniform, f is constant in the long run, 
and subsequent events are independent"

From this we have the first posterior,


P(L | D1I1) = 0.5    (2)

Now comes the crucial step: we must fully incorporate the information in D1 into the background information,


I2 = I1D1    (3)

Now, after obtaining D2, the posterior for L becomes

P(L | D2I2) = P(L | I2) P(D2 | LI2) / [ P(L | I2) P(D2 | LI2) + P(R | I2) P(D2 | RI2) ]    (4)

When we pose and solve a problem that's explicitly about the frequency, f, of the data-generating process, we often don't pay much heed to the updating of I in equation (3), because it is mathematically irrelevant to the likelihood, P(D2 | fD1I1). Assuming a particular value for the frequency renders all the information in D1 powerless to influence this number. But if we are being strict, we must make this substitution, as I is necessarily defined as all the information we have relevant to the problem, apart from the current batch of data (D2, in this case).

The priors in equation (4) are equal, so they cancel out. The likelihood is not hard to calculate; remember what it means: the probability to see 9 out of 9 travelling to the left, given that 500,000 out of 1,000,000 were travelling to the left previously, and given that the next one will be travelling to the left. That is, what is the probability to have 9 out of 9 travelling to the left, given that in total n = 500,001 out of N = 1,000,001 travel to the left. We can use the same procedure as before to calculate the probability distribution over the possible frequencies, P(f | LI2). For any given frequency, the assumption of independence in I1 means that the only information we have about the probability for any given vehicle's direction is this frequency, and so the probability and the frequency have the same numerical value. This means that for any assumed frequency, the probability to have 9 in a row going to the left is f^9, from the product rule. But since we have a probability distribution over a range of frequencies, we take the expectation by integrating over the product P(f) × f^9.

We can do that integration numerically, and we get a small number: 0.00195321. The counterpart of the likelihood, the one conditioned on R rather than L, is obtained by an analogous process. It produces another small, but very similar number: 0.00195318. From these numbers, the ratio in equation (4) gives 0.5000045, which does not radically disagree with the 0.5000005 we already had. (For comparison, if N = n = 9 was the complete data set, the result would be P(L) = 0.9091, as you can easily confirm.) Thus, when we do the calculation properly, a sample of only 9 makes almost no difference after a sample of 1 million, and peace can be restored in the cosmos.

Using the same procedure, we can confirm also that combining D1 and D2 into a single data set, with N = 1,000,009 and n = 500,009, gives precisely the same outcome for P(L | DI), 0.5000045, exactly as it must.
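These numbers can be checked without any numerical integration. With a uniform prior, the distribution P(f | DI) is a beta distribution, whose moments have a simple closed form, so the two likelihoods in equation (4) are just expectations of f^9. A quick sketch of that check, in exact rational arithmetic:

from fractions import Fraction

def expected_f_power(n_left, n_total, power):
    # E[f^power] under the posterior for f, given n_left 'lefts' out of
    # n_total observations and a uniform prior: a Beta(n_left + 1,
    # n_total - n_left + 1) distribution.
    a, b = n_left + 1, n_total - n_left + 1
    result = Fraction(1)
    for k in range(power):
        result *= Fraction(a + k, a + b + k)
    return result

# Likelihoods of D2 (9 more lefts) given L and given R, each conditioned
# on D1 plus the vehicle referred to by L or R (hence 1,000,001 observations)
like_L = expected_f_power(500_001, 1_000_001, 9)   # ~0.00195321
like_R = expected_f_power(500_000, 1_000_001, 9)   # ~0.00195318

print(float(like_L / (like_L + like_R)))           # ~0.5000045, as in equation (4)

# The same answer from lumping D1 and D2 into one batch: rule of succession
print((500_009 + 1) / (1_000_009 + 2))             # ~0.5000045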




Wednesday, May 16, 2012

Nuisance Parameters


A few days ago, I was racking my brain trying to think of a suitable example for this piece, when one landed in my lap unexpectedly. On his own statistics blog, Andrew Gelman has been posting the questions from an exam he set on design and analysis of surveys. Question 1 was the following:
Suppose that, in a survey of 1000 people in a state, 400 say they voted in a recent primary election. Actually, though, the voter turnout was only 30%. Give an estimate of the probability that a nonvoter will falsely state that he or she voted. (Assume that all voters honestly report that they voted.)
Now the requested estimate is simple enough to produce, and needs little more insight than the product rule and the law of large numbers. But an important part of rationally advancing our knowledge is being able to quantify the degree of uncertainty in that new knowledge. A parameter estimate without a confidence interval tells us very little, because it is just that: an estimate. We can't empirically determine that the estimated value is the truth, only that it is the most probable value. If we want to make decisions based on some estimated parameter, or give a measurement any kind of substance, we need to know how much more probable it is than other possible values. We can convey this information conveniently by supplying an error bar - a region of values on either side of the most likely value, in which the true value is very likely to reside. This is why the error bar is considered to be one of the most important concepts in science.

If we look at the question above, with an ambition to provide not just the estimate, but a region of high confidence on either side, then it becomes one of the simplest possible examples of marginalization, the topic of this post. It is also a really nice example to use here, because it utilizes technology we have already played with, in some of my earlier posts. These technologies are the binomial distribution, used in 'Fly papers and photon detectors' (Equation 5 in that post), and the beta function, which I introduced without naming it in 'How to make ad-hominem arguments' (Equation 4 in that article).

Define V to be the proposition that a given person voted, and Y the statement that they declared that they voted. We want to know the probability that a person says they voted when in fact they did not, which is P(Y | V'). From the product rule,

P(YV') = P(V') × P(Y | V')

P(V') = 0.7. That's the probability that a person did not vote.

If we are experienced in such things, we know that random deviations from expected behaviour decrease in relative magnitude as the size of the sample increases (that's the law of large numbers). This means that for a sample of 1000, we can be confident that the number of voters will not be much different from 300, when P(V) = 0.3. This means that approximately 100 non-voters lied that they had voted, out of a total sample of 1000, so

 P(YV') ≈ 0.1 

The crude estimate for P(Y | V'), therefore, is P(YV') / P(V') ≈ 0.1 / 0.7 = 1/7.

A more rigorous calculation acknowledges the uncertainty in P(YV'), and at the same time automatically provides a means to get the desired confidence interval. Let's suppose that, to start with, we have no information about the proportion of non-voters who lie in surveys; then we are justified in using a uniform prior distribution. It then follows from Bayes' theorem that the posterior probability distribution for the true fraction is given by the beta function, in close analogy with the parable of Weety Peebles. If we knew the number of people who didn't vote but lied in the survey, this would be a piece of cake, but we don't know it. It is what's called a nuisance parameter. But there is a procedure for dealing with this.

If we have a model with 2 free parameters, θ and n, then the joint probability for any pair of values for these parameters is



P(θn | DI) = P(θn | I) P(D | θnI) / P(D | I)    (1)

But if n is a nuisance parameter, in which we have no direct interest, then we just integrate it out. The so-called marginal probability distribution, P(θ | DI), is the sum over Equation (1) for all possible values of n. If n is a continuous parameter, then the sum becomes an integral:



P(θ | DI) = ∫ P(θn | DI) dn    (2)

In our example, we have one desired parameter, the fraction of non-voters who say that they voted, f, and one nuisance parameter, the actual number of liars in the sample of 1000 people, so to get the distribution over all possible f, we need to calculate a two-dimensional array of numbers, something that is still amenable to a spreadsheet calculation. Down a column, I listed all the possible numbers of liars, n, from 0 to 400 (there can't be more than 400 as all voters tell the truth, according to the provided background information). For each of these n, the total number of non-voters is 600 plus that number (600 is the number of non-lying non-voters). The probability for each of these numbers of non-voters, P(n), was calculated in an adjacent column, using the binomial distribution, with p = 0.7.

Along the top of the spreadsheet, I listed all the hypotheses I wanted to test concerning the value of the desired fraction, f. I divided the full range [0, 1] into 1000 slices of width Δf = 0.001. The probability that the true value of f lies in any given range [f, f + Δf] is estimated as P(f) × Δf. Each P(f) was calculated using the beta function:

P(f | nI) = [(N + 1)! / (n! (N - n)!)] f^n (1 - f)^(N - n)    (3)

Here N is the number of non-voters for the row in question, 600 + n - the relevant sample size when estimating the fraction of non-voters who lie. Each P(f | nI) was multiplied by the calculated P(n) to give the joint probability, specified in Equation (1). At the bottom, along another row, I calculated the sum of each column, which gave the desired marginal probability distribution, which I plot below:




According to my calculation, the peak of this curve is at 0.143 (which is 1/7, as expected). As an error bar, let's identify points on either side of the peak such that the enclosed area is 0.95. This means that there is a 95% probability that the true value of f lies between these points. To find these points, just integrate the curve in each direction from the peak, until the area on each side first reaches 0.475. Performing this integration gives a 95% confidence interval of [0.102, 0.180].
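For readers without a spreadsheet to hand, here is a Python sketch of the same marginalization as described above (the grid resolution and the outward-from-the-peak interval search are incidental implementation choices); it should land close to the numbers just quoted.

import numpy as np
from scipy.stats import binom, beta

f = np.linspace(0.0, 1.0, 1001)          # hypotheses for the liar fraction f
df = f[1] - f[0]
marginal = np.zeros_like(f)

# n = number of non-voters who claimed to have voted (the nuisance parameter)
for n in range(0, 401):
    non_voters = 600 + n                 # 600 truthful "no" answers plus the liars
    p_n = binom.pmf(non_voters, 1000, 0.7)                     # P(n): binomial, p = 0.7
    marginal += p_n * beta.pdf(f, n + 1, non_voters - n + 1)   # beta posterior for f given n

marginal /= marginal.sum() * df          # normalize the marginal density

peak = int(np.argmax(marginal))
cdf = np.cumsum(marginal) * df

# Integrate outward from the peak until each side holds 0.475 of the probability
lo, hi = peak, peak
while lo > 0 and cdf[peak] - cdf[lo] < 0.475:
    lo -= 1
while hi < len(f) - 1 and cdf[hi] - cdf[peak] < 0.475:
    hi += 1

print("peak at f =", round(f[peak], 3))                        # close to 1/7
print("95% interval:", (round(f[lo], 3), round(f[hi], 3)))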

Now we know not only the most likely value of f, but also how confident we are that the true value of f is near to that estimate. This is what good science is all about.

The process of eliminating nuisance parameters is termed marginalization. It's an important concept in Bayesian statistics. In maximum-likelihood model fitting, all free parameters in a model must be fitted at once, but use of Bayes' theorem not only permits important prior information to enter the calculation and enables confidence-interval estimation without a separate calculation, but also allows us to reduce the number of parameters that must be calculated to only those that interest us. During my PhD work, for example, most of my time was spent measuring the temporal responses of nanocrystals to short laser pulses. My fitting model included an offset (displacement up the y-axis), a shift (displacement along the time axis), and a scale parameter (dependent on how long I measured for, and how many photons my detector picked up). That's three parameters giving information only about the behaviour of the measurement apparatus. The physical model pertaining to the behaviour of the nanocrystals typically consisted of only two time constants. That's three out of five model variables that are nuisance parameters.

A really excellent read for those with an interest in the technicalities of Bayesian stats is a textbook called 'Bayesian Spectrum Analysis and Parameter Estimation,' by G. Larry Bretthorst (available for free download here). This book describes some stunning work. While discussing the advantages of eliminating nuisance parameters, Bretthorst produces one of the sexiest lines in the whole of the statistical literature:

In a typical small problem, this might reduce the search dimensions from ten to two; in one "large" problem the reduction was from thousands to six or seven.

He goes on: "This represents many orders of magnitude reduction in computation, the difference between what is feasible and what is not."







Thanks to Andrew Gelman for providing valuable inspiration for this post!


'Bayesian Spectrum Analysis and Parameter Estimation' by G. Larry Bretthorst
(free download here)