Sunday, December 22, 2013
Koalas are not as exclusive as kangaroos. At least, when it comes to their drinking habits. As I explained before, kangaroos drink beer or whisky, but not both. Koalas like to mix things up a bit more in their choice of drink, but how much exactly? What is the probability, for example, that any given koala who drinks beer on any given night will also drink whisky on the same night? These are the sorts of urgent questions that science must seek to answer with the utmost speed and accuracy.
Saturday, November 16, 2013
The Acid Test of Indifference
In recent posts, I've looked at the interpretation of the Shannon entropy, and the justification for the maximum entropy principle in inference under uncertainty. In the latter case, we looked at how mathematical investigation of the entropy function can help with establishing prior probability distributions from first principles.
There are some prior distributions, however, that we know automatically, without having to give the slightest thought to entropy. If the maximum entropy principle is really going to work, the first thing it has got to be able to do is to reproduce those distributions that we can deduce already, using other methods.
Friday, November 1, 2013
Monkeys and Multiplicity
Monkeys love to make a mess. Monkeys like to throw stones. Give a monkey a bucket of small pebbles, and before too long, those pebbles will be scattered indiscriminately in all directions. These are true facts about monkeys, facts we can exploit for the construction of a random number generator.
Set up a room full of empty buckets. Add one bucket full of pebbles and one mischievous monkey. Once the pebbles have been scattered, the number of little stones in each bucket is a random variable. We're going to use this random number generator for an unusual purpose, though. In fact, we could call it a 'calculus of probability,' because we're going to use this exotic apparatus for figuring out probability distributions from first principles.
Saturday, October 26, 2013
Entropy Games
In 1948, Claude Shannon, an electrical engineer working at Bell Labs, was interested in the problem of communicating messages along physical channels, such as telephone wires. He was particularly interested in issues like how many bits of data are needed to communicate a message, how much redundancy is appropriate when the channel is noisy, and how much a message can be safely compressed.
In that year, Shannon figured out that he could mathematically specify the minimum number of bits required to convey any message. You see, every message - every proposition, in fact, whether actively digitized or not - can be expressed as some sequence of answers to yes / no questions, and every string of binary digits is exactly that: a sequence of answers to yes / no questions. So if you know the minimum number of bits required to send a message, you know everything you need to know about the amount of information it contains.
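For reference, the quantity Shannon arrived at is his famous entropy, H = -Σ pi log2(pi), the minimum average number of bits per symbol. A minimal sketch, just to make the numbers concrete (mine, for illustration, not from the original post):

import numpy as np

def shannon_entropy(probs): # minimum average bits per symbol: H = -sum(p * log2(p))
    p = np.asarray(probs, dtype=float)
    p = p[p > 0] # symbols with p = 0 contribute nothing
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.5, 0.5])) # 1.0 bit: one fair yes / no question
print(shannon_entropy([0.9, 0.1])) # ~0.469 bits: less uncertainty, fewer bits needed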
Friday, October 18, 2013
Entropy of Kangaroos
All this discussion of scientific method, the deep roots of probability theory, mathematics, and morality is all well and good, but what about kangaroos? As I'm sure most of my more philosophically sophisticated readers appreciate, kangaroos play a necessarily central and vital role in any valid epistemology. To celebrate this fact, I'd like to consider a mathematical calculation that first appeared in the image-analysis literature, just coming up to 30 years ago. I'll paraphrase the original problem in my own words:
We all know that two thirds of kangaroos are right handed, and that one third of kangaroos drink beer (the remaining two thirds preferring whisky). These are true facts. What is the probability that a randomly encountered kangaroo is a left-handed beer drinker? Find a unique answer.
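One route to a unique answer, previewed here with a sketch of mine (not the original post's calculation), is the maximum entropy principle. Once the two given marginals are imposed, the 2x2 table of joint probabilities has a single free parameter, and maximizing the Shannon entropy over that parameter lands on the independent solution, P(left-handed beer drinker) = 1/3 × 1/3 = 1/9:

import numpy as np

# One free parameter, z = P(left-handed AND beer), fixes the whole 2x2 table,
# given the marginals P(left-handed) = 1/3 and P(beer) = 1/3.
z = np.linspace(1e-6, 1.0/3 - 1e-6, 100001)
table = np.array([z, 1.0/3 - z, 1.0/3 - z, 1.0/3 + z]) # LB, LW, RB, RW
H = -(table * np.log(table)).sum(axis=0) # entropy of each candidate table
print(z[np.argmax(H)]) # ~0.1111, i.e. 1/9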
Friday, October 11, 2013
No Such Thing as a Probability for a Probability
In the previous post, I discussed a problem of parameter estimation, in which the parameter of interest is a frequency: the relative frequency with which some data-generating process produces observations of some given type. In the example I chose (mathematically equivalent to Laplace's sunrise problem), we assumed a frequency that is fixed in the long term, and we assumed logical independence between successive observations. As a result, the frequency with which the process produces X, if known, has the same numerical value as the probability that any particular event will be an X. Many authors covering this problem exploit this correspondence, and describe the sought-after parameter directly as a probability. This seems to me to be confusing, unnecessary, and incorrect.
We perform parameter estimation by calculating probability distributions, but if the parameter we are after is itself a probability, then we have the following weird riddle to solve: What is a probability for a probability? What could this mean?
A probability is a rational account of one's state of knowledge, contingent upon some model. Subject to the constraints of that model (e.g. the necessary assumption that probability theory is correct), there is no wiggle room with regard to a probability - its associated distribution, if such existed, would be a two-valued function, being everywhere either on or off, and being on in exactly one location. What I have described, however, is not a probability distribution, as the probability at a discrete location in a continuous hypothesis space has no meaning. This opens up a few potential philosophical avenues, but in any case, this 'distribution' is clearly not the one the problem was about, so we don't need to pursue them.
In fact, we never need to discuss the probability for a probability. Where a probability is obtained as the expectation of some other nuisance parameter, that parameter will always be a frequency. To begin to appreciate the generality of this, suppose I'm fitting a mathematical function, y = f(x), with model parameters, θ, to some set of observed data pairs, (x, y). None of the θi can be a probability, since each (x, y) pair is a real observation of some actual physical process - each parameter is chosen to describe some aspect of the physical nature of the system under scrutiny.
Suppose we ask a question concerning the truth of a proposition, Q: "If x is 250, y(x) is in the interval, a = [a1, a2]."
We proceed first to calculate the multi-dimensional posterior distribution over θ-space. Then we evaluate at each point in θ-space the probability distribution for the frequency with which y(250) ∈ [a1, a2]. If y(x) is deterministic, this frequency will be either 1 or 0 at each such point. Regardless of whether or not y is deterministic, the product of this function with the distribution, P(θ), gives the probability distribution over (f, θ), and the integral over this product is the final probability for Q. We never needed a probability distribution over probability space, only over f and θ space, and since every inverse problem in probability theory can be expressed as an exercise in parameter estimation, we have highly compelling reasons to say that this will always hold.
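To make the recipe concrete, here is a minimal grid-based sketch (all specifics - the linear model, the known noise level sigma = 2, and the interval - are invented for illustration): compute the posterior over θ, evaluate at each grid point the frequency with which y(250) lands in [a1, a2], and integrate the product.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma = 2.0 # Gaussian noise level, assumed known
x = np.linspace(0.0, 300.0, 30) # invented data from y = 0.02x + 1 + noise
y = 0.02 * x + 1.0 + rng.normal(0.0, sigma, x.size)

b = np.linspace(0.0, 2.0, 200)[:, None, None] # grid over the intercept
m = np.linspace(0.0, 0.04, 200)[None, :, None] # grid over the slope
resid = y[None, None, :] - (b + m * x[None, None, :])
logL = -0.5 * np.sum((resid / sigma)**2, axis=2) # Gaussian likelihood, flat prior
post = np.exp(logL - logL.max())
post /= post.sum() # normalized posterior over theta = (intercept, slope)

a1, a2 = 5.0, 7.0 # the interval named in Q
mu = (b + m * 250.0)[:, :, 0] # y(250) at each point in theta-space
freq = norm.cdf((a2 - mu)/sigma) - norm.cdf((a1 - mu)/sigma) # frequency of Q there
print(np.sum(post * freq)) # P(Q): the expectation of that frequency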
It might seem as though multi-level, hierarchical modeling presents a counterexample to this. In the hierarchical case, the function y(x) (or some function higher still up the ladder) becomes itself one of several possibilities in some top-level hypothesis space. We may, for example, suspect that our data pairs could be fitted by either a linear function or a quadratic, in which case our job is to find out which is more suitable. In this case, the probability that y(250) is in some particular range depends on which fitting function is correct, which is itself expressible as a probability distribution, and we seem to be back to having a probability for a probability.
But every multi-level model can be expressed as a simple parameter estimation problem. For a fitting function, yA(x), we might have parameters θA = {θA1, θA2, ....}, and for another function, yB(x), parameters θB = {θB1, θB2, ....}. The entire problem is thus mathematically indistinguishable from a single parameter estimation problem with θ = {θA1, θA2, ...., θB1, θB2, ...., θN}, where θN is an additional hypothesis specifying the name of the true fitting function. By the above argument, none of the θ's here can be a probability. (What does θB1 mean in model A? It is irrelevant: for a given point in the sub-space, θA, the probability is uniform over θB.)
Often, though, it is conceptually advantageous to use the language of multi-level modeling. In fact, this is exactly what happened previously, when we studied various incarnations of the sunrise problem. Here is how we coped:
We had a parameter (see previous post), which we called A, denoting the truth value of some binary proposition. That parameter was itself determined by a frequency, f, for which we devised a means to calculate a probability distribution. When we needed to know the probability that a system with internal frequency, f, would produce 9 events of type X in a row, we made use of the logical independence of subsequent events to say that the P(X) is numerically the same as f (the Bernoulli urn rule). Thus, we were able to make use of the laws of probability (the product rule in this case) to calculate P(9 in a row | this f is temporarily assumed correct) = f^9. Under the assumptions of the model, therefore, for any assumed f, the value f^9 is the frequency with which this physical process produces 9 X's out of 9 samples, and our result was again an expectation over frequency space (though this time a different frequency). We actually made 2 translations: from frequency to probability and then from probability back to frequency, before calculating the final probability. It may seem unnecessarily cumbersome, but by doing this, we avoid the nonsense of a probability for a probability.
(There are at least 2 reasons why I think avoiding such nonsense is important. Firstly, when we teach, we should avoid our students harboring the justified suspicion that we are telling them nonsense. The student does not have to be fully conscious that any nonsense was transmitted, for the teaching process to be badly undermined. Secondly, when we do actual work with probability calculus, there may be occasions when we solve problems of an exotic nature, where arming ourselves with normally harmless nonsense could lead to a severe failure of the calculation, perhaps even seeming to produce an instance where the entire theory implodes.)
What if nature is telling us that we shouldn't impose the assumption of logical independence? No big deal, we just need to add a few more gears to the machine. For example, we might introduce some high-order autoregression model to predict how an event depends on those that came before it. Such a model will have a set of n + 1 coefficients, but for each point in the space of those coefficients, we will be able to form the desired frequency distribution. We can then proceed to solve the problem: with what frequency does this system produce an X, given that the previous n events were thing1, thing2, ...? The frequency of interest will typically be different to the global frequency for the system (if such exists), but the final probability will always be an expectation of a frequency.
The same kind of argument applies if subsequent events are independent, but f varies with time in some other way. There is no level of complexity that changes the overall thesis.
It might look like we have strayed dangerously close to the dreaded frequency interpretation of probability, but really we haven't. As I pointed out in the linked-to glossary article, every probability can be considered an expected frequency, but owing to the theory-ladenness of the procedure that arrives at those expected frequencies, whenever we reach the designated top level of our calculation, we are prevented from identifying probability with actual frequency. To make this identification is to claim to be omniscient. It is thus incorrect to talk, as some authors do, of physical probabilities, as opposed to epistemic probabilities.
Saturday, October 5, 2013
Error Bars for Binary Parameters
Propositions about real phenomena are either true or false. For some logical proposition, e.g. "there is milk in the fridge", let A be the binary parameter denoting its truth value. Now, truth values are not in the habit of marching themselves up to us and announcing their identity. In fact, for propositions about specific things in the real world, there is normally no way whatsoever to gain direct access to these truth values, and we must make do with inferences drawn from our raw experiences. We need a system, therefore, to assess the reliability of our inferences, and that system is probability theory. When we do parameter estimation, a convenient way to summarize the results of the probability calculations is the error bar, and it would seem to be necessary to have some corresponding tool to capture our degree of confidence when we estimate a binary parameter, such as A. But what could this error bar possibly look like? The hypothesis space consists of only two discrete points, and there isn't enough room to convey the required information.
Let me pose a different question: how easy is it to change your mind? One of the important functions of probability theory is to quantify evidence in terms of how easy it would be for future evidence to change our minds. Suppose I stand at the side of a not-too-busy road, and wonder in which direction the next car to pass me will be travelling. Let A now represent the proposition that any particular observed vehicle is travelling to the left. Suppose that, upon my arrival at the scene, I'm in a position of extreme ignorance about the patterns of traffic on the road, and that my ignorance is best represented (for symmetry reasons) by indifference, and my resulting probability estimate for A is 50%.
Suppose that after a large number of observations in this situation, I find that almost as many vehicles have been going right as have been going left. This results in a probability assignment for A that is again 50%. Here's the curious thing, though: in my initial state of indifference, only a small number of observations would have been sufficient for me to form a strong opinion that the frequency with which A is true, fA, is close to either 0 or 1. But now, having made a large number of observations, I have accumulated substantial evidence that fA is in fact close to 0.5, and it would take a comparably large number of observations to convince me otherwise. The appropriate response to possible future evidence has changed considerably, but I used the same number, 50%, to summarize my state of information. How can this be?
In fact, the solution is quite automatic. In order to calculate P(A), it is first necessary to assign a probability distribution over frequency space, P(fA). I did this in one of my earliest blog posts, in which I solved a thinly disguised version of Laplace's sunrise problem. Let's treat this traffic problem in the same way. My starting position in the traffic problem, indifference, meant that my information about the relative frequency with which an observed vehicle travels to the left was best encoded with a prior probability distribution that has the same value at all points within the hypothesis space. Let's assume also that we start with the conviction (from whatever source) that the frequency, fA, is constant in the long run and that consecutive events are independent. Laplace's solution (this is, yet again, identical to the sunrise problem he solved just over 200 years ago) provides a neat expression for P(A), known as the rule of succession (p is the probability that the next event is type X, n is the number of observed occurrences of type X events, and N is the total number of observed events):

p = (n + 1)/(N + 2)    (1)
but his method follows that same route I took when predicting a person's behaviour from past observations: at each possible frequency (between 0 and 1), calculate P(fA) from Bayes' theorem, using the binomial distribution to calculate the likelihood function. The proposition A can be resolved into a set of mutually exclusive and exhaustive propositions about the frequency, fA, giving P(A) = P(A[f1 + f2 + f3 +....]), so that the product rule, applied directly after the extended sum rule, means that the final assignment of P(A) consists of integrating over the product fA×P(fA), which we recognize as obtaining the expectation, 〈fA〉.
The figure below depicts the evolution of the distribution, P(fA | DI), for the first N observations, for several N. The data all come from a single sequence of binary uniform random variables, and the procedure follows equation (4), from my earlier article. We started, at N = 0, from indifference, and the distribution was flat. Gradually, as more and more data was added, a peak emerged, and got steadily sharper and sharper:
(The numbers on the y-axis are larger than 1, but that's OK because they are probability densities - once the curve is integrated, which involves multiplying each value by a differential element, df, the result is exactly 1.) The probability distribution, P(fA | DI), is therefore the answer to our initial question: P(fA | DI) contains all the information we have about the robustness of P(A) against new evidence, and we get our error bar by somehow characterizing the width of P(fA | DI).
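For anyone who wants to reproduce the figure, here is a minimal sketch (mine, with simulated data standing in for the original sequence): a uniform prior multiplied by a binomial likelihood, renormalized after each batch of observations.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=1000) # simulated binary uniform 'vehicles'
f = np.linspace(0.0, 1.0, 501) # hypothesis space for the frequency, fA

for N in [0, 3, 10, 100, 1000]:
    n = data[:N].sum() # 'left' count after N observations
    posterior = binom.pmf(n, N, f) # uniform prior x binomial likelihood
    posterior /= np.trapz(posterior, f) # normalize to a probability density
    plt.plot(f, posterior, label='N = %d' % N)

plt.xlabel('fA')
plt.ylabel('P(fA | DI)')
plt.legend()
plt.show()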
Now, an important principle of probability theory requires that the order in which we incorporate different elements of the data, D, does not affect the final posterior supplied by Bayes' theorem. For D = {d1, d2, d3, ...}, we could work the initial prior over to a posterior using only d1, then, using this posterior as the new prior, repeat for d2, and so on through the list. We could do the same thing, only taking the d's in any order we choose. We could bundle them into sub-units, or we could process the whole damn lot in a single batch. The final probability assignment must be the same in each case. Violation of this principle would invalidate our theory (assuming there is no causal path from knowledge of some of the d's to subsequent observed d's, e.g. if I'm observing my own mental state).
For example, each curve on the graph above shows the result from a single application of Bayes' theorem, though I could just as well have processed each individual observation separately, producing the same result. This works because the prior distribution is changing with each new bit of data added, gradually recording the combined effect of all the evidence. Each di becomes subsumed into the background information, I, before the next one is treated.
But we might have the feeling that something peculiar happens if we try to carry this principle over to the calculation of P(A | DI). What is the result of observing 9 consecutive cars travelling to the left? It depends what has happened before, obviously. Suppose D1 is now the result of 1 million observations, consisting of exactly 500,000 vehicles moving in each direction. The posterior assignment is almost exactly 50%. Now I see D2, those 9 cars travelling to the left - what is the outcome? The new prior is 50%, the same as it was before the first observation.
What the hell is going on here? How do we account for the fact that these 9 vehicles have a much weaker effect on our rational belief now, than they would have done if they had arrived right at the beginning of the experiment? The outcome of Bayes' theorem is proportional to prior times likelihood: P(A | I)×P(D | AI). Looking at 2 very different situations, 9 observations after 1 million, and 9 observations after zero, the prior is the same, the proposition, A, is the same, and D is the same. The rule of succession with n = N = 9 gives the same result in each case. It seems like we have a problem. We might solve the problem by recognizing that the correct answer comes by first getting P(fA | DI) then finding its expectation, but how did we recognize this? Is it possible that we rationally reached out to something external to probability theory to figure out that direct calculation of P(A | DI) would not work? Could it be that probability theory is not the complete description of rationality? (Whatever that means.)
Of course, such flights of fancy aren't necessary. The direct calculation of P(A | DI) works perfectly fine, as long as we follow the procedure correctly. Let's define 2 new propositions:
L = A = "the next vehicle to pass will be travelling to the left,"
R = "the next vehicle to pass will be travelling to the right."
D1 = "500,000 out of 1 million vehicles were travelling to the left"
D2 = "Additional to D1, 9 out of 9 vehicles were travelling to the left"
Background information is given by
I1 = "prior distribution over f is uniform, f is constant in the long run,
and subsequent events are independent"
and subsequent events are independent"
From this we have the first posterior,

P(L | D1I1) = (n + 1)/(N + 2), with n = 500,000 and N = 1,000,000    (2)
Now comes the crucial step: we must fully incorporate the information in D1 into the background information,

I2 = D1I1    (3)
Now, after obtaining D2, the posterior for L becomes

P(L | D2I2) = P(L | I2)×P(D2 | LI2) / [ P(L | I2)×P(D2 | LI2) + P(R | I2)×P(D2 | RI2) ]    (4)
When we pose and solve a problem that's explicitly about the frequency, f, of the data-generating process, we often don't pay much heed to the updating of I in equation (3), because it is mathematically irrelevant to the likelihood, P(D2 | fD1I1). Assuming a particular value for the frequency renders all the information in D1 powerless to influence this number. But if we are being strict, we must make this substitution, as I is necessarily defined as all the information we have relevant to the problem, apart from the current batch of data (D2, in this case).
The priors in equation (4) are equal, so they cancel out. The likelihood is not hard to calculate; remember what it means: the probability to see 9 out of 9 travelling to the left, given that 500,000 out of 1,000,000 were travelling to the left, previously, and given that the next one will be travelling to the left. That is, what is the probability to have 9 out of 9 travelling to the left, given that in total n = 500,001 out of N = 1,000,001 travel to the left. We can use the same procedure as before to calculate the probability distribution over the possible frequencies, P(f | LI2). For any given frequency, the assumption of independence in I1 means that the only information we have about the probability for any given vehicle's direction is this frequency, and so the probability and the frequency have the same numerical value. This means that for any assumed frequency, the probability to have 9 in a row going to the left is f^9, from the product rule. But since we have a probability distribution over a range of frequencies, we take the expectation by integrating over the product P(f)×f^9.
We can do that integration numerically, and we get a small number: 0.00195321. The counterpart of the likelihood, the one conditioned on R rather than L, is obtained by an analogous process. It produces another small, but very similar number: 0.00195318. From these numbers, the ratio in equation (4) gives 0.5000045, which does not radically disagree with the 0.5000005 we already had. (For comparison, if N = n = 9 were the complete data set, the result would be P(L) = 0.9091, as you can easily confirm.) Thus, when we do the calculation properly, a sample of only 9 makes almost no difference after a sample of 1 million, and peace can be restored in the cosmos.
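For anyone wanting to check these numbers, here is a minimal sketch (mine, not from the original calculation). With a uniform prior, the distribution over f given 'n of N to the left' is a beta distribution, and the expectation of f^9 then has a standard closed form, which saves the explicit numerical integration:

import numpy as np

def expect_f_pow(n, N, k): # E[f^k] when P(f | n of N) is Beta(n+1, N-n+1)
    a, b = n + 1, N - n + 1
    j = np.arange(k)
    return np.prod((a + j) / (a + b + j)) # standard beta-moment formula

pL = expect_f_pow(500001, 1000001, 9) # conditioned on L: 500,001 of 1,000,001 left
pR = expect_f_pow(500000, 1000001, 9) # conditioned on R: 500,000 of 1,000,001 left
print(pL, pR, pL / (pL + pR)) # ~0.0019532, ~0.0019532, ~0.5000045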
Using the same procedure, we can confirm also that combining D1 and D2 into a single data set, with N = 1,000,009 and n = 500,009, gives precisely the same outcome for P(L | DI), 0.5000045, exactly as it must.
Friday, September 13, 2013
Is Rationality Desirable?
Seriously, why all this fuss about rationality and science, and all that? Can we be just as happy, or even more so, by being irrational as by being rational? Are there aspects of our lives where rationality doesn't help? Might rationality actually be a danger?
Think for a moment about what it means to desire something. To desire something trivially entails desiring an efficient means to attain it. To desire X is to expect my life to be better if I add X to my possessions. To desire X, and not to avail of a known opportunity to increase my probability to add X to my possessions, therefore, is either (1) to do something counter to my desires, or (2) to desire my life not to get better. Number (2) is strictly impossible – a better life for me is, by definition, one in which more of my desires are fulfilled. Number (1) is incoherent – there can be no motivation for anyone to do anything against their own interests. Behaviour mode (1) is not impossible, but it can only be the result of a malfunction.
Let's consider some complicating circumstances to check the robustness of this.
(i) Suppose I desire a cigarette. Not to smoke a cigarette, however, is clearly in my interests. There is no contradiction here. Besides (hypothetically) wanting to smoke something, I also have other goals, such as a long healthy life, which are of greater importance to me. To desire a cigarette is to be aware of part of my mind that mistakenly thinks this will make my life better, even though in expectation, it will not. This is not really an example of desiring what I do not desire, because a few puffs of nicotine are not my highest desire – when all desires that can be compared on the same dimension are accounted for, the net outcome is what counts. Neither is it an example of acting against my desires if I turn down the offer of a smoke, for the same reason.
(ii) Suppose I desire to reach the top of a mountain, but I refuse to take the cable car that conveniently departs every 30 minutes, preferring instead to scale the steep and difficult cliffs by hand and foot. Simplistically, this looks like genuinely desiring to not avail of an efficient means to attain my desires, but in reality, it is clearly the case that reaching the summit is only part of the goal, another part being the pleasure derived from the challenging method of getting there.
Despite complications arising from the inner structure of our desires, therefore, for me to knowingly refuse to adopt behaviour that would increase my probability to fulfill my desires is undeniably undesirable. Now, behaviour that we know increases our chances to get what we desire has certain general features. For example, it requires an ability to accumulate reliable information about the world. It is not satisfactory to take a wild guess at the best course of action, and just hope that it works. This might work, but it will not work reliably. My rational expectation to achieve my goal is no better than if I do nothing. Reliability begins to enter the picture when I can make informed guesses. I must be able to make reliable predictions about what will happen as a result of my actions, and to make these predictions, I need a model of reality with some fidelity. Not just fidelity, but known fidelity - to increase the probability to achieve my goals, I need a strategy that I have good reasons to trust.
It happens that there is a procedure capable of supplying the kinds of reliable information and models of reality that enable the kinds of predictions we desire to make, in the pursuit of our desires. Furthermore, we all know what it is. It is called scientific method. Remember the reliability criterion? This is what makes science scientific. The gold standard for assessing the reliability of a proposition about the real world is probability theory – a kind of reasoning from empirical experience. Thus the ability of science to say anything worthwhile about the structure of reality comes from its application of probability theory, or any of several approximations that are demonstrably good in certain special cases. If there is something that is better than today's science, then 'better' is the result of a favorable outcome under probabilistic analysis (since 'better' implies 'reliably better'); thus, whatever it is, it is tomorrow's science.
So, if I desire a thing, then I desire a means to maximize my expectation to get it, so I desire a means to make reliable predictions of the outcomes of my actions, meaning that I desire a model of the world in which I can justifiably invest a high level of belief, thus I desire to employ scientific method, the set of procedures best qualified to identify reliable propositions about reality. Therefore, rationality is desirable. Full stop.
We cannot expect to be as happy by being irrational as by being rational. We might be lucky, but by definition, we cannot rely on luck, and our desires entail also desiring reliable strategies.
Items (A) to (D), below, detail some subtleties related to these conclusions.
(A) Where’s the fun in that?
Seriously? Being rational is always desirable? Seems like an awfully dry, humorless existence, always having to consult a set of equations before deciding what to do!
What this objection amounts to is another instance of example (ii), above, where the climber chooses to take the difficult route to the top of the mountain. What is really meant by a dry existence is something like the elimination of pleasant surprises, spontaneity, and ad-hoc creativity - things that are actually part of what we value.
Of course, there are also unpleasant surprises possible, and we do value minimizing those. The capacity to increase the frequency of pleasant surprises, while not dangerously exposing ourselves to trouble, is something that, of course, is best delivered through being rational. Being irrational in a contained way may be one of our goals, but as always, the best way to achieve this is by being rational about it. (I won't have much opportunity to continue my pursuit of irrationality tomorrow, if I die recklessly today.)
(B) Sophistication effect
To be rational (and thus make maximal use of scientific method, as required by a coherent pursuit of our desires) means to make a study of the likely failure modes of human reasoning (if you are human). This reduces the probability of committing fallacies of reasoning yourself, thus increasing the probability that your model of reality is correct. But there is a recognized failure mode of human reasoning that actually results from increased awareness of failure modes of reasoning. It goes like this: knowing many of the mechanisms by which seemingly intelligent people can be misled by their own flawed heuristic reasoning methods makes it easy for me to hypothesize reasons to ignore good evidence, when it supports a proposition that I don't like – "Oh sure, he says he has seen 20 cases of X and no cases of Y, but that's probably a confirmation bias."
Does this undermine my argument? Not at all. This is not really a danger of rationality. If anything, it is a danger of education (though one that I confidently predict that a rational analysis will reveal to be not sufficient to argue for reduced education). What has happened in the above example is, of course, itself a form of flawed reasoning: it is reasoning based on what I desire to be true, and thus isn't rational. It may be a pursuit of rationality that led me to reason in this way, but this is only because my quest has been (hopefully temporarily) derailed. Thus my desire to be rational (entailed trivially by my possession of desire for anything) makes it often desirable for me to have the support of like-minded rational people, capable of pointing out the error, when even the honest quest for reliable information leads me into a trap of fallacious inference.
(C) Where does it stop?
The assessment of probability is open ended. If there is anything about probability theory that sucks, this is it, but no matter how brilliant the minds that come to work on this problem, no way around it can ever be found, in principle. It is just something we have to live with - pretending it's not there won't make it go away. What it means, though, is that no probability can be divorced from the model within which it is calculated. There is always a possibility that my hypothesis space does not contain a true hypothesis. For example, I can use probability theory to determine the most likely coefficients, A and B, in a linear model used to fit some data, but investigation of the linear model will say nothing about other possible fitting functions. I can repeat a similar analysis using, say, a three-parameter quadratic fit, and then decide which fitting model is the most likely using Ockham's razor, but then what about some third candidate?
Or what if the Gaussian noise model I used in my assessment of the fits is wrong? What if I suspect that some of the measurements in my data set are flawed? Perhaps the whole experiment was just a dream. These things can all be checked in essentially the same way as all the previously considered possibilities (using probability theory), but it is quite clear that the process can continue indefinitely.
Rationality is thus a slippery concept: how much does it take to be rational? Since the underlying procedure of rationality, the calculation of probabilities, can always be improved by adding another level, won't it go on forever, precluding the possibility to ever reach a decision?
To answer this, let us note that to execute a calculation capable of deciding how to achieve maximal happiness and prosperity for all of humanity and all other life on Earth is not a rational thing to do if the calculation is so costly that its completion results in the immediate extinction of all humanity and all other life on Earth.
Rationality is necessarily a reflexive process, both (as described above) in that it requires analysis of the potential failure modes of the particular hardware/software combination being utilized (awareness of cognitive biases), and in that it must try to monitor its own cost. Recall that rationality owes its ultimate justification to the fulfillment of desires. These desires necessarily supersede the desire to be rational itself. An algorithm designed to do nothing other than be rational would do literally nothing - so without a higher goal above it, rationality is literally nothing.
Thus, if the cost of the chosen rational procedure is expected to prevent the necessarily higher-level desire being fulfilled, then rationality dictates that the calculation be stopped (or better, not started). Furthermore, the (necessary) desire to employ a procedure that doesn't diminish the likelihood to achieve the highest goals entails a procedure capable of assessing and flagging when such an occurrence is likely.
(D) Going with your gut feeling
On a related issue, concerning again the contingency (the lack of guarantee that the hypothesis space actually contains a true hypothesis) and potential difficulty of a rational calculation, do we need to worry that the possible computational difficulty and, ultimately, the possibility that we will be wrong in the end will make rationality uncompetitive with our innate capabilities of judgment? Only in a very limited sense.
Yes, we have superbly adapted computational organs, with efficiencies far exceeding any artificial hardware that we can so far devise, and capable of solving problems vastly more difficult than any rigorous probability-crunching machine that we can now build. And yes, it probably is rational under many circumstances to favor the rough and ready output of somebody's bias-ridden squishy brain over the hassle of a near-impossible, but oh-so rigorous calculation. But under what circumstances? Either, as noted, when the cost of the calculation prohibits the attainment of the ultimate goal, or when rationally evaluated empirical evidence indicates that it is probably safe to do so.
Human brain function is at least partially rational, after all. Our brains are adapted for and (I am highly justified in believing) quite successful at making self-serving judgments, which, as noted, is founded upon an ability to form a reliable impression of the workings of our environment. And, as also noted, the degree of rigor called for in any rational calculation is determined by the costs of the possible calculations, the costs of not doing the calculations, and the amount we expect to gain from them.
This is not to downplay the importance of scientific method. Let me emphasize: a reliable estimate of when it is acceptable to rely on heuristics, rather than full-blown analysis, can only come from a rational procedure. The list of known cognitive biases that interfere with sound reasoning is unfortunately rather extensive, and presumably still growing. The science informs us that, rather often, our innate judgement exhibits significantly less success than rational procedure does.
Sunday, July 28, 2013
Forward Problems
Recently, I unveiled a collection of mathematical material, intended partly as an easy entry point for people interested in learning probability from scratch. One thing that struck me as conspicuously missing, though, was a set of simple examples of basic forward problems. To correct this, the following is a brief tutorial, illustrating application of some of the most basic elements of probability theory.
Virtually anybody who has studied probability at school has studied forward problems. Many, though, will never have learned about inverse problems, and so the term 'forward problem' is often not even introduced. What are referred to as forward problems are what many identify as the entirety of probability theory, but in reality, this is only half of it.
Forward problems start from specification of the mechanics of some experiment, and work from there to calculate the probabilities with which the experiment will have certain outcomes. A simple example: a deck of cards has 52 cards, including exactly 4 aces; what is the probability to obtain an ace on a single draw from the deck? The name for the discipline covering these kinds of problems is sampling theory.
Inverse problems, on the other hand, turn this situation around: the outcome of the experiment is known, and the task is to tease out the likely mechanics that lead to that outcome. Inverse problems are the primary occupation of scientists. They are of great interest to everybody else, as well, whether we realize it or not, as all understanding about the real world results from analysing inverse problems. Solution of inverse problems, though, is tackled using Bayes' theorem, and therefore also requires the implementation of sampling theory.
To answer the example above (drawing an ace from a full deck), we need only one of the basic laws: the Bernoulli urn rule. Due to the lack of any information to the contrary, symmetry requires that each card is equally probable to be drawn on a single draw, and you can quickly verify that the answer is P(Ace) = 4/52 = 1/13. Here is another example, requiring mildly greater thought:
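A quick simulation (purely illustrative, not part of the derivation) confirms the number:

import numpy as np

rng = np.random.default_rng(0)
draws = rng.integers(0, 52, size=1000000) # a million single draws; cards 0-3 are the aces
print((draws < 4).mean()) # ~0.0769, i.e. close to 1/13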
The Birthday Paradox
Not really a paradox, but supposedly many find the result surprising. The problem is the following:
What is the minimum number of people needed to be gathered together to ensure that the probability that at least two of them share a birthday exceeds one half?
It doesn't look immediately like my definition of a forward problem, but actually it is several - to get the solution, we'll calculate the desired probability for all possible numbers of people, until we pass the 50% requirement. A 'birthday' is a date consisting of a day and a month, like January 1st - the year is unimportant. Assume also that each year has 365 days. To maximize our exposure to as many basic rules as possible, I'll demonstrate 2 equivalent solutions.
Solution 1
Having at least 1 shared birthday in a group of people can be fulfilled in several possible ways. There may be exactly 2 matching birthdays, or exactly 3 matching, or exactly 4, and so on. We don't want to calculate all these probabilities. Instead, we can invoke the sum rule to observe that
P(at least one shared birthday) = 1 - P(no shared birthdays)    (1)
To get P(none shared), we'll use the Bernoulli urn rule again. If the number of people in the group is n, then we want the number of ways to have n different dates, divided by the number of ways to have any n dates. To have n different dates, we must select n objects from a pool of 365, where each object can be selected once at most. Thus we need the number of permutations, 365Pn, which, from the linked glossary entry, is
365Pn = 365!/(365 - n)! = 365 × 364 × ... × (365 - n + 1)    (2)
That's our numerator; the denominator is the total number of possible lists of n birthdays. Well, there are 365 ways to choose the first, and 365 ways to choose the second, thus 365 × 365 ways to choose the first two. Similarly 365 × 365 × 365 ways to choose the first three, thus 365^n ways to choose n dates that may or may not be different. Taking the ratio, then,

P(no shared birthdays) = 365Pn / 365^n    (3)

Thus, when we apply the sum rule, the desired expression is

P(at least one shared birthday) = 1 - 365Pn / 365^n    (4)
And now we only need to crunch the numbers for different values of n, to find out where P first exceeds 50%. To calculate this, I've written a short Python module, which is in the appendix, below (a spreadsheet program can also do this kind of calculation). Python is open-source, so in principle anybody with a computer and an internet connection (e.g. you) can use it. The graph below shows the output for n equal to 1 to 60:
The answer, contrary to Douglas Adams, is 23.
Solution 2
The second method is equivalent to the first: it still employs that trick in equation (1) with the sum rule, and exactly the same formula, equation (4), is derived, but this time we'll use the sum rule, the product rule, and the logical independence of different people's birthdays.
To visualize this solution more easily, let's imagine people entering a room, one by one, and checking each time whether or not there is a shared birthday. When the first person enters the room, the probability that there is no shared birthday in the room is 1. When the second person enters the room, the probability that he shares his birthday with the person already there is 1/365, so the probability that he does not share the other person's birthday, from the sum rule, is (1 - 1/365).
The probability for no shared birthdays, with n = 2, is given by the product rule:

P(AB | I) = P(A | I) × P(B | AI)    (5)

We can simplify this, though. Logical independence means that if we know proposition A to be true, the probability associated with proposition B is the same as if we know nothing about A: P(B | I) = P(B | AI). Thus the probability that person B's birthday is on a particular date is not changed by any assumptions about person A's birthday.
So the product rule reduces in this case to

P(AB | I) = P(A | I) × P(B | I)    (6)

Then the probability of no matching birthdays with n = 2 is 1 × (1 - 1/365). For n = 3, we repeat the same procedure, getting as our answer 1 × (1 - 1/365) × (1 - 2/365).
For n people, therefore, we get

P(no shared birthdays) = 1 × (1 - 1/365) × (1 - 2/365) × ... × (1 - (n - 1)/365)

which re-arranges to give the same result as before:

P(no shared birthdays) = 365Pn / 365^n    (7)

As before, we subtract this from 1 to get the probability to have at least 1 match among the n people present.
Appendix
import numpy as np # package for numerical calculations
import matplotlib.pyplot as plt # plotting package

def permut(n, k): # count the number of permutations, nPk
    answer = 1
    for i in np.arange(n, n-k, -1): # product n * (n-1) * ... * (n-k+1)
        answer *= i # float inputs keep this product within floating-point range
    return answer

def share_prob(n): # calculate P(some_shared_bDay | n)
    p = []
    for i in n:
        p.append(1.0 - permut(365, i)/np.power(365.0, i))
    p = np.array(p)
    return p

def plot_prob(): # plot distribution & find 50% point
    x = np.arange(1.0, 61.0, 1.0) # the values n = 1 to 60 inclusive (floats, see permut)
    y = share_prob(x) # get P(share|n)
    x50 = x[y>0.5][0] # calculate x, y coordinates where 50%
    y50 = y[y>0.5][0] # threshold first crossed
    fig = plt.figure() # create plot
    ax = fig.add_subplot(111)
    ax.plot(x, y, linewidth=2)
    # mark out the solution on the plot:
    ax.plot([x50, x50], [0, y50], '-k', linewidth=1)
    ax.plot([0, x50], [y50, y50], '-k', linewidth=1)
    ax.plot(x50, y50, 'or', markeredgecolor='r', markersize=10)
    ax.text(x50*1.1, y50*1.05, 'Smallest n giving P > 0.5:', fontsize=14)
    ax.text(x50*1.1, y50*0.95, '(%d, %.4f)' %(x50, y50), fontsize=14)
    # add some other information:
    ax.set_title('P(some_shared_birthday | n)', fontsize=15)
    ax.set_ylabel('Probability', fontsize=14)
    ax.set_xlabel('Number of people, n', fontsize=14)
    ax.axis([1, 60, 0, 1.05])
    plt.show() # display the figure

if __name__ == '__main__':
    plot_prob() # run the calculation and show the plot
Saturday, July 20, 2013
Greatness, By Definition
The goals for this blog have always been two-fold: (1) to bring students and professionals in the sciences into close acquaintance with centrally important topics in Bayesian statistics and rational scientific method, and (2) to bring the universal scope and beauty of science to the awareness of as many as possible, both within and outside the field - if you have any kind of problem in the real world, then science is the tool for you.
Effective communication to scientists, though, runs the risk of being impenetrable for non-scientists, while my efforts to simplify make me feel that the mathematically adept reader will be quickly bored.
Helping to make the material on the blog more accessible, therefore, and as part of a very, very slow but steady plan to achieve world domination, here are two new resources I've put together, which I am very happy to announce:
- Glossary - definitions of technical terms used on the blog
- Mathematical Resource - a set of links explaining the basics of probability from the beginning. Blog articles and glossary entries are linked in a logical order, starting from zero assumed prior expertise.
Both of these now appear in the links list on the right-hand sidebar.
The new resources are partially complete. Some of the names of entries in the glossary, for example, do not yet correspond to existing entries. Regular updates are planned.
The mathematical resource is a near-instantaneous extension of the material compiled for the glossary, and is actually my glacially slow response to a highly useful suggestion made by Richard Carrier, almost one year ago. The material has been organized in what seems to me to be a logical order, and for those interested, may be viewed as a short course in statistics, delivering, I hope, real practical skills in the topic. Its main purpose, though, is to provide an entry point for those interested in the blog, but unfamiliar with some of the important technical concepts.
The glossary may also be useful to those already familiar with the topics. Terms are used on the blog, for example, with meanings different to those of many other authors. Hypothesis testing is one such case, limited by some to denoting frequentist tests of significance, but used here to refer more generally to any ranking of the reliability of propositions.
The new glossary, then, is an attempt to rationalize the terminology, bringing the vocabulary back in line with what it was always intended to mean, not to reflect some flawed surrogates for those original intentions. For the same reason, in some cases alternate terms are used, such as 'falsifiability principle', in preference to the more common 'falsification principle'.
Important distinctions are also highlighted. Morality, for example, is differentiated from moral fact. Philosophers are found to be distinct from 'nominal philosophers'. Equally importantly, science is explicitly distanced from 'the activity of scientists'. As a result, morality, philosophy, and science are found to be different words for exactly the same phenomenon.
In a previous article, I warned against excessive reliance on jargon, so it's perhaps worth explaining why the current initiative is not hypocrisy. That article was concerned with overuse of unnecessary jargon, which often serves as an impediment to understanding. Symptoms of this include: (1) replacing jargon terms with direct (but less familiar) synonyms causes confusion, and (2) swapping in familiar terms with inapplicable meanings goes unnoticed. By providing a precise lexicon, we can help to prevent exactly these problems, and others, thus whetting the edge of our analytical blade and accelerating our philosophical progress.
On a closely related topic, there is a fashionable notion going along the lines that to argue from the definitions of words is fallacious - as if definitions were useless. This is not correct: argument from definition is a valid form of reasoning, but one that is very commonly misused.
Eliezer Yudkowsky's highly recommended sequence, A Human's Guide to Words, covers the fallacious application very well. His principal example is the ancient riddle: if a tree falls in a forest, where nobody is present to hear it, does it make a sound? Yudkowsky imagines two people arguing over this riddle. One asserts, "yes, by definition: acoustic vibrations travel through the air"; the other responds, "no, by definition: no auditory sensation occurs in anybody's brain." The two are clearly applying different definitions to the same word. In order to reach consensus, they must agree on a single definition.
These two haven't committed any fallacy yet; each is reasoning correctly from their own definitions. But it is a pointless argument - as pointless as me arguing in English, when I only understand English, with a person who speaks and understands only Japanese. The fallacy begins, however, with examples like the following.
Suppose you and I both understand and agree on what the word 'rainbow' refers to. One day, though, I'm writing a dictionary, and under 'rainbow,' I include the innocent-looking phrase: "occurs shortly after rain." (Well, duh, rain is even in the name.) So we go visit a big waterfall and see colours in the spray, and I say, "look, it must have recently rained here." Con artists term this tactic 'bait and switch.' I cannot legitimately reason in this way, because I have arbitrarily attached not a symbol to a meaning, but attributes to a real physical object.
To show trivially that there is a valid form of argument from definition, though, consider the following truism: "black things are black." This is necessarily true, because blackness is exactly the property I'm talking about when I invoke the phrase "black things." It is not that I am hoping to alter the contents of reality by asserting the necessary truth of the statement, but that I am referring to a particular class of entities, and the entities I have in mind just happen to all be black - by definition.
One might complain, "but I prefer to use the phrase 'black things' not just for things that are black, but also for things that are nearly black." This would certainly be perverse, but it's not, in any sense I can think of, illegal. Fine: if you want to use the term that way, you may do so. I'll continue to implement my definition, which I find the most reasonable, and every time you hear me say the words "black things," you may replace them with any symbol you like that conveys the required meaning to you. Your symbol might be a sequence of 47 charcoal 'z's marked on papyrus, or a mental image of a yak - I don't care.
Yes, our definitions are arbitrary, but arbitrary in the sense that there is no prior privileged status of the symbols we end up using, and not in the sense that the meanings we attach to those symbols are unimportant.
Here's an example from my own experience. Several times, I have tried unsuccessfully to explain to people my discovery that ethics is a scientific discipline. (By the way, I'm not claiming priority for this discovery.) The objections typically go through three phases. First comes the feeling of hand-waviness, which is understandable, given how ridiculously simple the argument is:
them: No way, it's too simple. You can't possibly claim such an unexpected result with such a thin argument.
me: OK, show me which details I've glossed over.
them: [pause...] All right, the argument looks logically sound, but I don't believe it - look at your axioms: why should I accept those? Those aren't the axioms I choose.
me: Those aren't axioms at all. I don't need to assume their truth. These are basic statements that are true by definition. If you don't like the words I've attached to those definitions, then pick your own words; I'm happy to accommodate them.
them: YOU CAN'T DO THAT! You're trying to alter reality by your choice of definitions....
And that final point is the stumbling block people seem to have the biggest trouble getting over.
Definitions are important. If you think that making definitions is a bogus attempt to alter reality, then be true to your beliefs: see how much intellectual progress you can make without assigning meanings to words. The new <fanfare!> Maximum Entropy Glossary </fanfare!> is an attempt to streamline intellectual progress. If you engage with anything I have written on this blog, then you engage with meanings I have attached to strings of typed characters. In some important cases, I have tried to make those meanings clear and precise. If you find yourself disagreeing with strings of characters, then you are making a mistake. If you disagree with the way I manipulate meanings then we can discuss it like adults, confident that we are talking about the same things.