All this discussion of scientific method, the deep roots of probability theory, mathematics, and morality is all well and good, but what about kangaroos? As I'm sure most of my more philosophically sophisticated readers appreciate, kangaroos play a necessarily central and vital role in any valid epistemology. To celebrate this fact, I'd like to consider a mathematical calculation that first appeared in the image-analysis literature, just coming up to 30 years ago [1]. I'll paraphrase the original problem in my own words:
We all know that two thirds of kangaroos are right handed, and that one third of kangaroos drink beer (the remaining two thirds preferring whisky). These are true facts. What is the probability that a randomly encountered kangaroo is a left-handed beer drinker? Find a unique answer.
The last part of the statement of the puzzle, 'find a unique answer,' is implicit in the question, but I've emphasized it, as it highlights the difference between the sensible view of probability and the old, frequentist school of thought. The ardent frequentist will deny that any unique solution to the kangaroo problem exists. He will tell you that there is not enough information in the statement of the question to produce an answer. But reasoning under uncertainty (missing information) is exactly and entirely what probability theory is for, so of course there is an answer to this question. Quite possibly, if you're good with numbers, your intuition is leaning toward the correct number already.
To help focus your mind, below is a beautiful portrait of a left-handed, beer-swilling kangaroo (identity unknown). This sketch appeared in the original paper by Gull and Skilling that introduced this problem. I love the stylish way this kangaroo is showing its tongue to the artist. Ask yourself: "how plausible is this kangaroo?" This is the number we are looking for.
It's not the first time kangaroos have appeared in the statistical literature, either (I told you they're philosophically significant). William Gosset, better known to many of us as Student, used the following diagram [2] as a tool to help remember elements of mathematical terminology. The animal on the left is a platypus, representing the general shape of a platykurtic distribution, while the more leptokurtic curve is depicted with kangaroos, naturally, because of all their "lepping about."
All right, to analyze our problem, we can draw the hypothesis space as a matrix:
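It looks something like this (rows for handedness, columns for drinking preference):

                  beer (B)    whisky (W)    total
    left (L)        p11          p12         1/3
    right (R)       p21          p22         2/3
    total           1/3          2/3          1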
This kind of matrix is also known as a contingency table. The numbers opposite each atomic proposition give the total for that case, e.g. the overall proportion of left-handed 'roos, p11 + p12, is 1/3, as initially stated. (I used to have difficulty remembering the conventional ordering of those little numbered subscripts, ij, on the elements of a matrix, so I also developed a mnemonic. If you ever learned basic electronics, you might remember that the response time of a capacitor in a simple circuit is given by something called the RC constant. So that's how I remember it - RC constant: first the Row, then the Column, (pretty much) always.)
We can easily reduce the number of variables in the above table. Whatever p21 is, p11 + p21 = 1/3, so p21 = 1/3 - p11. All the pij can be expressed in terms of p11:
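p12 = 1/3 - p11,    p21 = 1/3 - p11,    p22 = 1 - p11 - p12 - p21 = 1/3 + p11.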
This is where the die-hard frequentist stops working and breaks into tears, for though it is obvious that p11 is constrained to the range 0 ≤ p11 ≤ 1/3, he can see no way to single out one value from the uncountably infinite number of possibilities within that range.
The thing that stops the frequentist in his tracks is that he doesn't know the correct correlation between left-handedness and beverage preference. It could be that all left-handers drink whisky, or none do, or that the proportions of left-handed kangaroos that drink whisky and beer are equal, or anything in between.
But for us, this is not a problem. What we do not know can't interfere with our calculation, as we simply want a number that characterizes what we do know. There may be some non-zero correlation, but we have no information about that, so symmetry demands that we remain indifferent. Any correlation could be positive or negative, so any arbitrary choice has a 50% probability of being in the wrong direction. So, denoting a beer drinker as B and a left-hander as L:
P(B | L) = P(B | R),
at least until we have better information. We are not proclaiming something about actual proportions (frequencies) here. We are merely acknowledging that the handedness and the drinking habits of these marsupials are at present logically independent.
From the product rule, though, P(BL) = P(B | L) × P(L), etc., so:
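P(BL) / P(L) = P(BR) / P(R),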
or, from the contingency table:
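p11 / (1/3) = (1/3 - p11) / (2/3),    i.e.    2 p11 = 1/3 - p11,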
yielding our desired result:
P(BL) = p11 = 1/9
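(This is just P(B) × P(L) = 1/3 × 1/3, exactly what independence would lead you to expect.)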
________________
There is, though, another way to reach this result: as the title of this post suggests, by calculating entropy. This is, of course, the real reason Gull and Skilling invented this problem. I'm not going to explain today what entropy is. That'll have to wait for a future post. I'm just going to state a formula for it, and tell you a truly remarkable thing that can be done with it. If you don't already know the rationale behind it, then until I offer up an explanation, I expect your time will be pretty much fully occupied with contemplating why the number provided by this formula should have the extraordinary properties I claim for it.
The formula for the Shannon entropy, H, is:
H = - Σ pi log(pi)
where the p's are probabilities and the sum is over the entire hypothesis space.
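To get a quick feel for it: a fair coin, with probabilities (1/2, 1/2), has H = log 2 ≈ 0.69 (using natural logarithms), while a coin that is certain to land heads has H = 0. Spreading the probability out evenly, when nothing favours one outcome over another, is what makes H big.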
The principle of maximum entropy asserts that for a given problem, the distribution of probabilities that maximizes the number H, subject to the constraints of that problem, is the one that best characterizes our knowledge. This maximum entropy distribution is the one that is maximally non-committal, without violating the information established in the problem. Thus, adopting any distribution of lower than maximum entropy amounts to assuming information that we do not really have, and is therefore an irrational inference, of a type often referred to as 'spurious precision.'
Calculating the entropy in this case is quite fun and simple. All probabilities are fixed, for a given p11, so we can vary this parameter and see where the entropy function is maximized. The plotted curve below shows this calculation. The vertical line is drawn in to show where our previous argument declared that p11 should be.
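If you'd like to reproduce the curve yourself, here's a minimal sketch of the calculation in Python (assuming numpy and matplotlib are available; the function and variable names are just my own choices):

    import numpy as np
    import matplotlib.pyplot as plt

    def entropy(p11):
        # The four cell probabilities of the contingency table, all fixed once p11 is chosen.
        ps = np.array([p11, 1/3 - p11, 1/3 - p11, 1/3 + p11])
        return -np.sum(ps * np.log(ps))

    # p11 is constrained to 0 <= p11 <= 1/3; stay just inside the ends to avoid log(0).
    p11_grid = np.linspace(1e-6, 1/3 - 1e-6, 10001)
    H = np.array([entropy(p) for p in p11_grid])

    print("H is maximized at p11 =", p11_grid[np.argmax(H)])   # comes out at ~0.1111, i.e. 1/9

    plt.plot(p11_grid, H)
    plt.axvline(1/9, linestyle="--")    # where the independence argument put p11
    plt.xlabel("p11")
    plt.ylabel("entropy, H")
    plt.show()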
The maximum-entropy point corresponds perfectly with the earlier analysis, yielding P(BL) = 1/9, just as I claimed it would. I don't claim this is a proof of the maximum entropy principle, but perhaps you'll feel inspired to at least wonder what this H stands for, and how its maximization comes to have this special property. I'll be covering these things shortly.
The maximum entropy principle is certainly not the main inspiration for this blog, but Ed Jaynes' profound understanding of the mechanics of rational inference, with which he was able to recognize this principle, is easily good enough to have inspired the title.
References
[1] Gull, S.F. and Skilling, J., 'The maximum entropy method,' in 'Indirect Imaging,' edited by Roberts, J.A., Cambridge University Press, 1984.
[2] Student, 'Errors of routine analysis,' Biometrika 19, no. 1, p. 160, July 1927.