Saturday, October 26, 2013

Entropy Games




In 1948, Claude Shannon, an electrical engineer working at Bell Labs, was interested in the problem of communicating messages along physical channels, such as telephone wires. He was particularly interested in questions like how many bits of data are needed to communicate a message, how much redundancy is appropriate when the channel is noisy, and how much a message can be safely compressed.

In that year, Shannon figured out¹ that he could mathematically specify the minimum number of bits required to convey any message. You see, every message, every proposition, in fact, whether actively digitized or not, can be expressed as some sequence of answers to yes/no questions, and every string of binary digits is exactly that: a sequence of answers to yes/no questions. So if you know the minimum number of bits required to send a message, you know everything you need to know about the amount of information it contains.
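As a rough illustration (a minimal sketch with toy numbers of my own, not Shannon's worked examples), the entropy of a discrete source, in bits per symbol, is exactly this minimum average number of yes/no answers needed to identify each symbol:

```python
import math

# Shannon entropy of a discrete source, in bits per symbol: the minimum
# average number of bits (yes/no answers) needed to convey each symbol.
def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy four-symbol sources: the uniform one needs the full 2 bits per symbol,
# while the skewed one can, on average, be compressed below that.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0
print(entropy_bits([0.5, 0.25, 0.125, 0.125]))  # 1.75
```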

Friday, October 18, 2013

Entropy of Kangaroos



All this discussion of scientific method, the deep roots of probability theory, mathematics, and morality is well and good, but what about kangaroos? As I'm sure most of my more philosophically sophisticated readers appreciate, kangaroos play a necessarily central and vital role in any valid epistemology. To celebrate this fact, I'd like to consider a mathematical calculation that first appeared in the image-analysis literature, just coming up to 30 years ago¹. I'll paraphrase the original problem in my own words:

We all know that two thirds of kangaroos are right handed, and that one third of kangaroos drink beer (the remaining two thirds preferring whisky). These are true facts. What is the probability that a randomly encountered kangaroo is a left-handed beer drinker? Find a unique answer.
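The post's title hints at one way to pin down a unique answer. As a sketch (my own toy calculation, not lifted from the post), suppose we pick, from all joint tables consistent with the two given marginals, the one with maximum entropy. Writing z for the probability of a left-handed beer drinker, the four cells of the table are forced to be z, 1/3 - z, 1/3 - z and 1/3 + z, and a crude grid search over z finds the maximum:

```python
import math

# Entropy of the 2x2 table (handedness x drink) as a function of the single
# free cell z = P(left-handed and beer drinker); the marginals fix the rest.
def entropy(z):
    cells = [z, 1/3 - z, 1/3 - z, 1/3 + z]
    return -sum(p * math.log(p) for p in cells if p > 0)

zs = [k / 300000 for k in range(1, 100000)]   # z must lie between 0 and 1/3
best = max(zs, key=entropy)
print(best)                                    # ~0.1111, i.e. 1/9 = (1/3) * (1/3)
```

The grid search lands on the independent table, z = 1/9, which is what maximizing the entropy gives whenever only the marginals are constrained.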

Friday, October 11, 2013

No Such Thing as a Probability for a Probability





In the previous post, I discussed a problem of parameter estimation, in which the parameter of interest is a frequency: the relative frequency with which some data-generating process produces observations of some given type. In the example I chose (mathematically equivalent to Laplace's sunrise problem), we assumed a frequency that is fixed in the long term, and we assumed logical independence between successive observations. As a result, the frequency with which the process produces X, if known, has the same numerical value as the probability that any particular event will be an X. Many authors covering this problem exploit this correspondence, and describe the sought-after parameter directly as a probability. This seems to me to be confusing, unnecessary, and incorrect.

We perform parameter estimation by calculating probability distributions, but if the parameter we are after is itself a probability, then we have the following weird riddle to solve: What is a probability for a probability? What could this mean? 

A probability is a rational account of one's state of knowledge, contingent upon some model. Subject to the constraints of that model (e.g. the necessary assumption that probability theory is correct), there is no wiggle room with regard to a probability - its associated distribution, if such existed, would be a two-valued function, being everywhere either on or off, and being on in exactly one location. What I have described, however, is not a probability distribution, as the probability at a discrete location in a continuous hypothesis space has no meaning. This opens up a few potential philosophical avenues, but in any case, this 'distribution' is clearly not the one the problem was about, so we don't need to pursue them.

In fact, we never need to discuss the probability for a probability. Where a probability is obtained as the expectation of some other nuisance parameter, that parameter will always be a frequency. To begin to appreciate the generality of this, suppose I'm fitting a mathematical function, y = f(x), with model parameters, θ, to some set of observed data pairs, (x, y). None of the θi can be a probability, since each (x, y) pair is a real observation of some actual physical process - each parameter is chosen to describe some aspect of the physical nature of the system under scrutiny. 

Suppose we ask a question concerning the truth of a proposition, Q: "If x is 250, y(x) is in the interval, a = [a1, a2]." 

We proceed first to calculate the multi-dimensional posterior distribution over θ-space. Then, at each point in θ-space, we evaluate the probability distribution for the frequency with which y(250) ∈ [a1, a2]. If y(x) is deterministic, this frequency will be exactly 1 or exactly 0 at each point in θ-space. Regardless of whether or not y is deterministic, the product of this function with the distribution, P(θ), gives the probability distribution over (f, θ), and the integral over this product is the final probability for Q. We never needed a probability distribution over probability space, only over f and θ space, and since every inverse problem in probability theory can be expressed as an exercise in parameter estimation, we have highly compelling reasons to say that this will always hold.
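As a minimal sketch of that recipe (toy model, toy data, and a one-dimensional θ, all assumptions of mine rather than anything from the post): with a deterministic fitting function the frequency at each grid point is just an indicator, and P(Q) falls out as its expectation over the posterior for θ.

```python
import numpy as np

# Toy setup: y = theta * x with Gaussian noise of known sigma, a flat prior on
# a grid over theta-space, and Q: "y(250) lies in [a1, a2]".
rng = np.random.default_rng(0)
sigma, theta_true = 20.0, 0.8
x_obs = np.linspace(0.0, 100.0, 30)
y_obs = theta_true * x_obs + rng.normal(0.0, sigma, x_obs.size)

theta = np.linspace(0.0, 2.0, 2001)                # grid over theta-space
dtheta = theta[1] - theta[0]
log_like = np.array([-0.5 * np.sum((y_obs - t * x_obs) ** 2) / sigma ** 2
                     for t in theta])
post = np.exp(log_like - log_like.max())           # flat prior: posterior ∝ likelihood
post /= post.sum() * dtheta                        # normalized P(theta | D)

# Frequency with which y(250) falls in [a1, a2], at each point in theta-space.
# With a deterministic y(x) this is exactly 1 or 0 for every theta.
a1, a2 = 180.0, 220.0
freq = ((theta * 250.0 >= a1) & (theta * 250.0 <= a2)).astype(float)

# P(Q) is the expectation of that frequency over the posterior for theta.
p_Q = (freq * post).sum() * dtheta
print("P(Q) ~", round(p_Q, 4))
```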

It might seem as though multi-level, hierarchical modeling presents a counter-example to this. In the hierarchical case, the function y(x) (or some function higher still up the ladder) becomes itself one of several possibilities in some top-level hypothesis space. We may, for example, suspect that our data pairs could be fitted by either a linear function or a quadratic, in which case our job is to find out which is more suitable. In this case, the probability that y(250) is in some particular range depends on which fitting function is correct, which is itself expressible as a probability distribution, and we seem to be back to having a probability for a probability.

But every multi-level model can be expressed as a simple parameter estimation problem. For a fitting function, yA(x), we might have parameters θA = {θA1, θA2, ....}, and for another function, yB(x), parameters θB = {θB1, θB2, ....}. The entire problem is thus mathematically indistinguishable from a single parameter estimation problem with θ = {θA1, θA2, ...., θB1, θB2, ...., θN}, where θN is an additional hypothesis specifying the name of the true fitting function. By the above argument, none of the θ's here can be a probability. (What does θB1 mean in model A? It is irrelevant: for a given point in the sub-space, θA, the probability is uniform over θB.)
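A minimal sketch of this flattening (toy data and grids of my own choosing): stack a model-naming parameter alongside the θA and θB coordinates, and note that under model A the likelihood simply ignores the θB direction, so the posterior is uniform along it, exactly as described.

```python
import numpy as np
from itertools import product

# Toy version of folding two fitting functions into one flat parameter space.
# Model "A": y = a1 * x; model "B": y = b1 * x + b2 * x^2.
# The model name plays the role of the extra parameter theta_N.
rng = np.random.default_rng(1)
sigma = 5.0
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 0.3 * x ** 2 + rng.normal(0.0, sigma, x.size)

a1_grid = np.linspace(0.0, 6.0, 31)
b1_grid = np.linspace(0.0, 6.0, 31)
b2_grid = np.linspace(-1.0, 1.0, 21)

post = {}
for name, a1, b1, b2 in product(["A", "B"], a1_grid, b1_grid, b2_grid):
    pred = a1 * x if name == "A" else b1 * x + b2 * x ** 2
    # Under model A the likelihood ignores (b1, b2): the posterior is uniform
    # along those directions, as argued in the text (and vice versa for a1).
    post[(name, a1, b1, b2)] = np.exp(-0.5 * np.sum((y - pred) ** 2) / sigma ** 2)

z = sum(post.values())
p_B = sum(v for (name, *_), v in post.items() if name == "B") / z
print("P(model B | data) ~", round(p_B, 3))
```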

Often, though, it is conceptually advantageous to use the language of multi-level modeling. In fact, this is exactly what happened previously, when we studied various incarnations of the sunrise problem. Here is how we coped: 

We had a parameter (see previous post), which we called A, denoting the truth value of some binary proposition. That parameter was itself determined by a frequency, f, for which we devised a means to calculate a probability distribution. When we needed to know the probability that a system with internal frequency, f, would produce 9 events of type X in a row, we made use of the logical independence of subsequent events to say that P(X) is numerically the same as f (the Bernoulli urn rule). Thus, we were able to make use of the laws of probability (the product rule in this case) to calculate P(9 in a row | this f is temporarily assumed correct) = f⁹. Under the assumptions of the model, therefore, for any assumed f, the value f⁹ is the frequency with which this physical process produces 9 X's out of 9 samples, and our result was again an expectation over frequency space (though this time a different frequency). We actually made two translations: from frequency to probability and then from probability back to frequency, before calculating the final probability. It may seem unnecessarily cumbersome, but by doing this, we avoid the nonsense of a probability for a probability.

(There are at least two reasons why I think avoiding such nonsense is important. Firstly, when we teach, we should not leave our students harbouring the justified suspicion that we are telling them nonsense. The student does not have to be fully conscious that any nonsense was transmitted for the teaching process to be badly undermined. Secondly, when we do actual work with the probability calculus, there may be occasions when we solve problems of an exotic nature, where arming ourselves with normally harmless nonsense could lead to a severe failure of the calculation, perhaps even seeming to produce an instance where the entire theory implodes.)

What if nature is telling us that we shouldn't impose the assumption of logical independence? No big deal, we just need to add a few more gears to the machine. For example, we might introduce some high-order autoregression model to predict how an event depends on those that came before it. Such a model will have a set of n + 1 coefficients, but for each point in the space of those coefficients, we will be able to form the desired frequency distribution. We can then proceed to solve the problem: with what frequency does this system produce an X, given that the previous n events were thing1, thing2, ....? The frequency of interest will typically be different to the global frequency for the system (if such exists), but the final probability will always be an expectation of a frequency.
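As a small sketch of what such a machine might look like (a toy construction of mine - the post doesn't specify a particular model), an order-n logistic autoregression maps the previous n binary outcomes, through n + 1 coefficients, to a conditional frequency for the next event:

```python
import math

def conditional_frequency(coeffs, history):
    """coeffs = [c0, c1, ..., cn]; history = the last n outcomes (0 or 1),
    most recent first. Returns the frequency with which the next event is an X,
    for this particular point in coefficient space."""
    c0, rest = coeffs[0], coeffs[1:]
    z = c0 + sum(c * x for c, x in zip(rest, history))
    return 1.0 / (1.0 + math.exp(-z))          # squashed into [0, 1]

# For each assumed coefficient vector we get a definite conditional frequency,
# which can then be averaged over the posterior for the coefficients.
print(conditional_frequency([0.1, 1.5, -0.4], [1, 1]))
print(conditional_frequency([0.1, 1.5, -0.4], [0, 0]))
```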

The same kind of argument applies if subsequent events are independent, but f varies with time in some other way. There is no level of complexity that changes the overall thesis.

It might look like we have strayed dangerously close to the dreaded frequency interpretation of probability, but really we haven't. As I pointed out in the linked-to glossary article, every probability can be considered an expected frequency, but owing to the theory-ladenness of the procedure that arrives at those expected frequencies, whenever we reach the designated top level of our calculation, we are prevented from identifying probability with actual frequency. To make this identification is to claim to be omniscient. It is thus incorrect to talk, as some authors do, of physical probabilities, as opposed to epistemic probabilities.





Saturday, October 5, 2013

Error Bars for Binary Parameters




Propositions about real phenomena are either true or false. For some logical proposition, e.g. "there is milk in the fridge", let A be the binary parameter denoting its truth value. Now, truth values are not in the habit of marching themselves up to us and announcing their identity. In fact, for propositions about specific things in the real world, there is normally no way whatsoever to gain direct access to these truth values, and we must make do with inferences drawn from our raw experiences. We need a system, therefore, to assess the reliability of our inferences, and that system is probability theory. When we do parameter estimation, a convenient way to summarize the results of the probability calculations is the error bar, and it would seem to be necessary to have some corresponding tool to capture our degree of confidence when we estimate a binary parameter, such as A. But what could this error bar possibly look like? The hypothesis space consists of only two discrete points, and there isn't enough room to convey the required information.   

Let me pose a different question: how easy is it to change your mind? One of the important functions of probability theory is to quantify evidence in terms of how easy it would be for future evidence to change our minds. Suppose I stand at the side of a not-too-busy road, and wonder in which direction the next car to pass me will be travelling. Let A now represent the proposition that any particular observed vehicle is travelling to the left. Suppose that, upon my arrival at the scene, I'm in a position of extreme ignorance about the patterns of traffic on the road, and that my ignorance is best represented (for symmetry reasons) by indifference, and my resulting probability estimate for A is 50%.

Suppose that after a large number of observations in this situation, I find that almost equal numbers of vehicles have been going right as have been going left. This results in a probability assignment for A that is again 50%. Here's the curious thing, though: in my initial state of indifference, only a small number of observations would have been sufficient for me to form a strong opinion that the frequency with which A is true, fA, is close to either 0 or 1. But now, having made a large number of observations, I have accumulated substantial evidence that fA is in fact close to 0.5, and it would take a comparably large number of observations to convince me otherwise. The appropriate response to possible future evidence has changed considerably, but I used the same number, 50%, to summarize my state of information. How can this be?

In fact, the solution is quite automatic. In order to calculate P(A), it is first necessary to assign a probability distribution over frequency space, P(fA). I did this in one of my earliest blog posts, in which I solved a thinly disguised version of Laplace's sunrise problem. Let's treat this traffic problem in the same way. My starting position in the traffic problem, indifference, meant that my information about the relative frequency with which an observed vehicle travels to the left was best encoded with a prior probability distribution that has the same value at all points within the hypothesis space. Let's assume also that we start with the conviction (from whatever source) that the frequency, fA, is constant in the long run and that consecutive events are independent. Laplace's solution (this is, yet again, identical to the sunrise problem he solved just over 200 years ago) provides a neat expression for P(A), known as the rule of succession (p is the probability that the next event is of type X, n is the number of observed occurrences of type X events, and N is the total number of observed events):


p = (n + 1) / (N + 2)
(1)
but his method follows the same route I took when predicting a person's behaviour from past observations: at each possible frequency (between 0 and 1) calculate P(fA) from Bayes' theorem, using the binomial distribution to calculate the likelihood function. The proposition A can be resolved into a set of mutually exclusive and exhaustive propositions about the frequency, fA, giving P(A) = P(A[f1 + f2 + f3 +....]), so that the product rule, applied directly after the extended sum rule, means that the final assignment of P(A) consists of integrating over the product fA×P(fA), which we recognize as obtaining the expectation, 〈fA〉.
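A quick numerical check of that route (a sketch of mine, using a simple grid rather than anything from the original post): the expectation of f under the flat-prior binomial posterior lands on the rule of succession, (n + 1)/(N + 2).

```python
import numpy as np

# Expectation of f_A under the posterior obtained from a uniform prior and a
# binomial likelihood, evaluated on a grid of possible frequencies.
def p_next(n, N, grid=100001):
    f = np.linspace(1e-12, 1.0 - 1e-12, grid)
    logpost = n * np.log(f) + (N - n) * np.log(1.0 - f)
    post = np.exp(logpost - logpost.max())
    post /= post.sum()
    return (f * post).sum()

print(p_next(9, 9), (9 + 1) / (9 + 2))          # both ~0.9091
print(p_next(500000, 1000000))                   # ~0.5, as in the traffic example
```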

The figure below depicts the evolution of the distribution, P(fA  | DI), for the first N observations, for several N. The data all come from a single sequence of binary uniform random variables, and the procedure follows equation (4), from my earlier article. We started, at N = 0, from indifference, and the distribution was flat. Gradually, as more and more data was added, a peak emerged, and got steadily sharper and sharper:  




(The numbers on the y-axis are larger than 1, but that's OK because they are probability densities - once the curve is integrated, which involves multiplying each value by a differential element, df, the result is exactly 1.) The probability distribution, P(fA  | DI), is therefore the answer to our initial question: P(fA  | DI) contains all the information we have about the robustness of P(A) against new evidence, and we get our error bar by somehow characterizing the width of P(fA  | DI).
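Something like the figure can be reproduced with a few lines (my own simulated data, not the original sequence); the printed standard deviation of P(fA | DI) is one plausible candidate for the 'error bar' on the binary parameter:

```python
import numpy as np

# Posterior over f_A after the first N observations of a fair binary process,
# starting from a flat prior, for several N.
rng = np.random.default_rng(42)
data = rng.integers(0, 2, 1000)                  # 1 = "travelling left"
f = np.linspace(1e-6, 1.0 - 1e-6, 2001)
df = f[1] - f[0]

for N in (0, 10, 100, 1000):
    n = int(data[:N].sum())
    logpost = n * np.log(f) + (N - n) * np.log(1.0 - f)   # flat prior
    post = np.exp(logpost - logpost.max())
    post /= post.sum() * df                      # normalize as a density
    mean = (post * f).sum() * df
    sd = np.sqrt((post * (f - mean) ** 2).sum() * df)
    print(f"N = {N:5d}: posterior sd of f_A ~ {sd:.3f}")
```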

Now, an important principle of probability theory requires that the order in which we incorporate different elements of the data, D, does not affect the final posterior supplied by Bayes' theorem. For D = {d1, d2, d3, ...}, we could work the initial prior over to a posterior using only d1, then, using this posterior as the new prior, repeat for d2, and so on through the list. We could do the same thing, only taking the d's in any order we choose. We could bundle them into sub-units, or we could process the whole damn lot in a single batch. The final probability assignment must be the same in each case. Violation of this principle would invalidate our theory (assuming there is no causal path from knowledge of some of the d's to subsequently observed d's, as there could be, for example, if I'm observing my own mental state).

For example, each curve on the graph above shows the result from a single application of Bayes' theorem, though I could just as well have processed each individual observation separately, producing the same result. This works because the prior distribution is changing with each new bit of data added, gradually recording the combined effect of all the evidence. Each di becomes subsumed into the background information, I, before the next one is treated.
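Here is a tiny check of that principle (toy data of my own): processing the observations one at a time, against processing them all in one batch, lands on numerically identical posteriors over f.

```python
import numpy as np

f = np.linspace(1e-6, 1.0 - 1e-6, 2001)
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]            # 1 = "travelling left"

# Batch: flat prior times the binomial likelihood for the whole data set.
n, N = sum(data), len(data)
batch = f ** n * (1.0 - f) ** (N - n)
batch /= batch.sum()

# Sequential: each observation's likelihood folded in one at a time, with the
# posterior from each step serving as the prior for the next.
seq = np.ones_like(f)
for d in data:
    seq *= f if d == 1 else (1.0 - f)
    seq /= seq.sum()

print(np.allclose(batch, seq))                   # True
```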

But we might have the feeling that something peculiar happens if we try to carry this principle over to the calculation of P(A | DI). What is the result of observing 9 consecutive cars travelling to the left? It depends what has happened before, obviously. Suppose D1 is now the result of 1 million observations, consisting of exactly 500,000 vehicles moving in each direction. The posterior assignment is almost exactly 50%. Now I see D2, those 9 cars travelling to the left - what is the outcome? The new prior is 50%, the same as it was before the first observation.

What the hell is going on here? How do we account for the fact that these 9 vehicles have a much weaker effect on our rational belief now than they would have done if they had arrived right at the beginning of the experiment? The outcome of Bayes' theorem is proportional to prior times likelihood: P(A | I)×P(D | AI). Looking at two very different situations, 9 observations after 1 million, and 9 observations after zero, the prior is the same, the proposition, A, is the same, and D is the same. The rule of succession with n = N = 9 gives the same result in each case. It seems like we have a problem. We might solve the problem by recognizing that the correct answer comes by first getting P(fA | DI) and then finding its expectation, but how did we recognize this? Is it possible that we rationally reached out to something external to probability theory to figure out that direct calculation of P(A | DI) would not work? Could it be that probability theory is not the complete description of rationality? (Whatever that means.)

Of course, such flights of fancy aren't necessary. The direct calculation of P(A | DI) works perfectly fine, as long as we follow the procedure correctly. Let's define two new propositions,

L = A = "the next vehicle to pass will be travelling to the left,"
R = "the next vehicle to pass will be travelling to the right."

With D1 and D2 as before:

D1  = "500,000 out of 1 million vehicles were travelling to the left"
D2  = "Additional to D1, 9 out of 9 vehicles were travelling to the left"

Background information is given by

I1 = "prior distribution over f is uniform, f is constant in the long run, 
and subsequent events are independent"

From this we have the first posterior,


P(L | D1I1) = 0.5 
(2)

Now comes the crucial step: we must fully incorporate the information in D1,


I2 = I1D1
(3)

Now, after obtaining D2, the posterior for L becomes

P(L | D2I2) = P(L | I2)P(D2 | LI2) / [P(L | I2)P(D2 | LI2) + P(R | I2)P(D2 | RI2)]
(4)

When we pose and solve a problem that's explicitly about the frequency, f, of the data-generating process, we often don't pay much heed to the updating of I in equation (3), because D1 is mathematically irrelevant to the likelihood, P(D2 | fD1I1): assuming a particular value for the frequency renders all the information in D1 powerless to influence this number. But if we are being strict, we must make this substitution, as I is necessarily defined as all the information we have relevant to the problem, apart from the current batch of data (D2, in this case).

The priors in equation (4) are equal, so they cancel out. The likelihood is not hard to calculate, remember what it means: the probability to see 9 out of 9 travelling to the left, given that 500,000 out of 1,000,000 were travelling to the left, previously, and given that the next one will be travelling to the left. That is, what is the probability to have 9 out of 9 travelling to the left, given that in total n = 500,001 out of N = 1,000,001 travel to the left. We can use the same procedure as before to calculate the probability distribution over the possible frequencies, P(f | LI2). For any given frequency, the assumption of independence in I1 means that the only information we have about the probability for any given vehicle's direction is this frequency, and so the probability and the frequency have the same numerical value. This means that for any assumed frequency, the probability to have 9 in a row going to the left is f⁹, from the product rule. But since we have a probability distribution over a range of frequencies, we take the expectation by integrating over the product P(f)×f⁹.

We can do that integration numerically, and we get a small number: 0.00195321. The counterpart of the likelihood, the one conditioned on R rather than L, is obtained by an analogous process. It produces another small, but very similar, number: 0.00195318. From these numbers, the ratio in equation (4) gives 0.5000045, which does not radically disagree with the 0.5000005 we already had. (For comparison, if N = n = 9 were the complete data set, the result would be P(L) = 0.9091, as you can easily confirm.) Thus, when we do the calculation properly, a sample of only 9 makes almost no difference after a sample of 1 million, and peace can be restored in the cosmos.
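Those numbers can be checked with a few lines (a sketch of mine, using the exact Beta-expectation product rather than explicit numerical integration): with a uniform prior, conditioning on D1 plus L or R gives a Beta posterior over f, and the likelihood for the 9 extra left-movers is the expectation of f⁹ under it.

```python
from fractions import Fraction

# With a uniform prior, "n left out of N" gives a Beta posterior over f, and
# the expectation of f^9 has the exact product form
#   E[f^9] = prod_{k=0..8} (n + 1 + k) / (N + 2 + k).
def expect_f9(n_left, n_total):
    a, ab = n_left + 1, n_total + 2              # Beta(a, b) with a + b = ab
    p = Fraction(1)
    for k in range(9):
        p *= Fraction(a + k, ab + k)
    return p

# Conditioning on L adds one more "left" to D1's 500,000 of 1,000,000;
# conditioning on R adds one more "right".
like_L = expect_f9(500001, 1000001)
like_R = expect_f9(500000, 1000001)
print(float(like_L), float(like_R))              # ~0.00195321, ~0.00195318
print(float(like_L / (like_L + like_R)))         # ~0.5000045
```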

Using the same procedure, we can confirm also that combining D1 and D2 into a single data set, with N = 1,000,009 and n = 500,009, gives precisely the same outcome for P(L | DI), 0.5000045, exactly as it must.