Maximum Entropy: No Such Thing as a Probability for a Probability

In the previous post, I discussed a problem of parameter estimation, in which the parameter of interest is a frequency: the relative frequency with which some data-generating process produces observations of some given type. In the example I chose (mathematically equivalent to Laplace's sunrise problem), we assumed a frequency that is fixed in the long term, and we assumed logical independence between successive observations. As a result, the frequency with which the process produces X, if known, has the same numerical value as the probability that any particular event will be an X. Many authors covering this problem exploit this correspondence, and describe the sought-after parameter directly as a probability. This seems to me to be confusing, unnecessary, and incorrect.

We perform parameter estimation by calculating probability distributions, but if the parameter we are after is itself a probability, then we have the following weird riddle to solve: What is a probability for a probability? What could this mean?

A probability is a rational account of one's state of knowledge, contingent upon some model. Subject to the constraints of that model (e.g. the necessary assumption that probability theory is correct), there is no wiggle room with regard to a probability - its associated distribution, if such existed, would be a two-valued function, being everywhere either on or off, and being on in exactly one location. What I have described, however, is not a probability distribution, as the probability at a discrete location in a continuous hypothesis space has no meaning. This opens up a few potential philosophical avenues, but in any case, this 'distribution' is clearly not the one the problem was about, so we don't need to pursue them.

In fact, we never need to discuss the probability for a probability. Where a probability is obtained as the expectation of some other nuisance parameter, that parameter will always be a frequency. To begin to appreciate the generality of this, suppose I'm fitting a mathematical function, y = f(x), with model parameters, θ, to some set of observed data pairs, (x, y). None of the θ_i can be a probability, since each (x, y) pair is a real observation of some actual physical process - each parameter is chosen to describe some aspect of the physical nature of the system under scrutiny.

Suppose we ask a question concerning the truth of a proposition, Q: "If x is 250, y(x) is in the interval, a = [a₁, a₂]."

We proceed first to calculate the multi-dimensional posterior distribution over θ-space. Then we evaluate at each point in θ-space the probability distribution for the frequency, f, with which y(250) ∈ [a₁, a₂]. If y(x) is deterministic, at all frequencies this will be either 1 or 0. Regardless of whether y is deterministic, the product of this function with the distribution, P(θ), gives the probability distribution over (f, θ), and the integral over this product is the final probability for Q. We never needed a probability distribution over probability space, only over f and θ space, and since every inverse problem in probability theory can be expressed as an exercise in parameter estimation, we have highly compelling reasons to say that this will always hold.
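The calculation just described can be sketched numerically. This is only an illustration under assumptions that are mine, not part of the argument: a one-parameter model y = θx with Gaussian noise of known width, a flat prior over a grid of θ values, and made-up function names. Given θ, the frequency with which y(250) falls in the interval is fully determined, so the distribution over f at each θ is a spike and the integral over f collapses to a single term.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def frequency_in_interval(theta, x, a1, a2, sigma):
    """Frequency with which y(x) lands in [a1, a2] for a given theta.

    Assumed model: y = theta * x + Gaussian noise of standard deviation sigma.
    sigma == 0 is the deterministic case, where the frequency is 0 or 1.
    """
    mean = theta * x
    if sigma == 0:
        return 1.0 if a1 <= mean <= a2 else 0.0
    return normal_cdf((a2 - mean) / sigma) - normal_cdf((a1 - mean) / sigma)

def posterior_over_theta(data, thetas, sigma):
    """Normalized posterior on a grid of theta values (flat prior assumed)."""
    log_w = [sum(-0.5 * ((y - t * x) / sigma) ** 2 for x, y in data)
             for t in thetas]
    top = max(log_w)  # subtract the maximum to avoid underflow
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [v / total for v in w]

def probability_of_q(data, x, a1, a2, sigma):
    """P(Q): the expectation of the frequency over the posterior for theta."""
    thetas = [i / 1000 for i in range(4001)]  # grid on [0, 4], an arbitrary choice
    post = posterior_over_theta(data, thetas, sigma)
    return sum(p * frequency_in_interval(t, x, a1, a2, sigma)
               for t, p in zip(thetas, post))

# Invented data pairs, purely for illustration.
data = [(1, 2.0), (2, 4.1), (3, 5.9)]
p_q = probability_of_q(data, 250, 400.0, 600.0, sigma=1.0)
```

Nothing in the sum is a probability for a probability: each term is a frequency, weighted by the posterior probability of the θ that produces it.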

It might seem as though multi-level, hierarchical modeling presents a counterexample to this. In the hierarchical case, the function y(x) (or some function higher still up the ladder) becomes itself one of several possibilities in some top-level hypothesis space. We may, for example, suspect that our data pairs could be fitted by either a linear function or a quadratic, in which case our job is to find out which is more suitable. In this case, the probability that y(250) is in some particular range depends on which fitting function is correct, which is itself expressible as a probability distribution, and we seem to be back to having a probability for a probability.

But every multi-level model can be expressed as a simple parameter estimation problem. For a fitting function, y_A(x), we might have parameters θ_A = {θ_A1, θ_A2, ....}, and for another function, y_B(x), parameters θ_B = {θ_B1, θ_B2, ....}. The entire problem is thus mathematically indistinguishable from a single parameter estimation problem with θ = {θ_A1, θ_A2, ...., θ_B1, θ_B2, ...., θ_N}, where θ_N is an additional hypothesis specifying the name of the true fitting function. By the above argument, none of the θ's here can be a probability. (What does θ_B1 mean in model A? It is irrelevant: for a given point in the sub-space, θ_A, the probability is uniform over θ_B.)

Often, though, it is conceptually advantageous to use the language of multi-level modeling. In fact, this is exactly what happened previously, when we studied various incarnations of the sunrise problem. Here is how we coped:

We had a parameter (see previous post), which we called A, denoting the truth value of some binary proposition. That parameter was itself determined by a frequency, f, for which we devised a means to calculate a probability distribution. When we needed to know the probability that a system with internal frequency, f, would produce 9 events of type X in a row, we made use of the logical independence of successive events to say that P(X) is numerically the same as f (the Bernoulli urn rule). Thus, we were able to make use of the laws of probability (the product rule in this case) to calculate P(9 in a row | this f is temporarily assumed correct) = f⁹. Under the assumptions of the model, therefore, for any assumed f, the value f⁹ is the frequency with which this physical process produces 9 X's out of 9 samples, and our result was again an expectation over frequency space (though this time a different frequency). We actually made 2 translations: from frequency to probability and then from probability back to frequency, before calculating the final probability. It may seem unnecessarily cumbersome, but by doing this, we avoid the nonsense of a probability for a probability.
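As a rough numerical sketch of this expectation, assuming a uniform prior over f and independent trials (the Laplace setup), the probability of 9 X's in a row is the posterior average of f⁹. The function names and the grid size here are my own choices for illustration. The closed form used as a cross-check is the standard moment of a Beta distribution, which is what the posterior becomes under these assumptions.

```python
import math

def posterior_grid(successes, trials, n_grid=20001):
    """Posterior over the frequency f on a uniform grid (flat prior assumed)."""
    fs = [i / (n_grid - 1) for i in range(n_grid)]
    # Likelihood of the observed counts for each candidate frequency.
    w = [f ** successes * (1 - f) ** (trials - successes) for f in fs]
    total = sum(w)
    return fs, [v / total for v in w]

def prob_run(successes, trials, run=9):
    """Expectation of f**run over the posterior: P(run X's in a row)."""
    fs, post = posterior_grid(successes, trials)
    return sum(p * f ** run for f, p in zip(fs, post))

def prob_run_exact(successes, trials, run=9):
    """Same quantity in closed form, via the Beta(a, b) moment E[f**run]."""
    a, b = successes + 1, trials - successes + 1
    log_ratio = (math.lgamma(a + run) + math.lgamma(a + b)
                 - math.lgamma(a) - math.lgamma(a + b + run))
    return math.exp(log_ratio)

# With no data at all, the flat prior gives E[f**9] = 1/10.
no_data = prob_run_exact(0, 0)
# After 10 X's in 10 trials, the posterior is Beta(11, 1) and E[f**9] = 11/20.
ten_of_ten = prob_run_exact(10, 10)
```

In both functions the quantity being averaged, f⁹, is a frequency, and the weights are probabilities for f. At no point is a distribution placed over a probability.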

(There are at least 2 reasons why I think avoiding such nonsense is important. Firstly, when we teach, we should avoid our students harboring the justified suspicion that we are telling them nonsense. The student does not have to be fully conscious that any nonsense was transmitted, for the teaching process to be badly undermined. Secondly, when we do actual work with probability calculus, there may be occasions when we solve problems of an exotic nature, where arming ourselves with normally harmless nonsense could lead to a severe failure of the calculation, perhaps even seeming to produce an instance where the entire theory implodes.)

What if nature is telling us that we shouldn't impose the assumption of logical independence? No big deal, we just need to add a few more gears to the machine. For example, we might introduce some high-order autoregression model to predict how an event depends on those that came before it. Such a model will have a set of n + 1 coefficients, but for each point in the space of those coefficients, we will be able to form the desired frequency distribution. We can then proceed to solve the problem: with what frequency does this system produce an X, given that the previous n events were thing₁, thing₂, .... The frequency of interest will typically be different to the global frequency for the system (if such exists), but the final probability will always be an expectation of a frequency.
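To make this concrete, here is a deliberately reduced sketch. Instead of a high-order autoregression, it assumes the simplest possible dependence, a first-order chain in which an event depends only on the one before it, and a flat prior over the conditional frequency g with which an X follows an X. The function names and the example sequence are invented for illustration.

```python
def transition_counts(seq, prev="X", target="X"):
    """Count how often `target` follows `prev` in the observed sequence."""
    follows = [b for a, b in zip(seq, seq[1:]) if a == prev]
    hits = sum(1 for b in follows if b == target)
    return hits, len(follows)

def conditional_probability(seq, prev="X", target="X", n_grid=10001):
    """P(next is target | previous was prev): the posterior expectation of
    the conditional frequency g, computed on a grid with a flat prior."""
    hits, total = transition_counts(seq, prev, target)
    gs = [i / (n_grid - 1) for i in range(n_grid)]
    w = [g ** hits * (1 - g) ** (total - hits) for g in gs]
    norm = sum(w)
    return sum(g * v for g, v in zip(gs, w)) / norm

# Invented sequence: X is followed by X in 4 of its 6 recorded transitions.
seq = "XXYXXXYYXX"
p_next_x = conditional_probability(seq)
```

The conditional frequency g differs from the overall frequency of X in the sequence, but the final answer is still an expectation of a frequency. A higher-order model would only enlarge the space over which that expectation is taken.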

The same kind of argument applies if subsequent events are independent, but f varies with time in some other way. There is no level of complexity that changes the overall thesis.

It might look like we have strayed dangerously close to the dreaded frequency interpretation of probability, but really we haven't. As I pointed out in the linked-to glossary article, every probability can be considered an expected frequency, but owing to the theory-ladenness of the procedure that arrives at those expected frequencies, whenever we reach the designated top level of our calculation, we are prevented from identifying probability with actual frequency. To make this identification is to claim to be omniscient. It is thus incorrect to talk, as some authors do, of physical probabilities, as opposed to epistemic probabilities.

Friday, October 11, 2013
