Maximum Entropy: Causality

Sunday, December 22, 2013

Confounded Koalas

Koalas are not as exclusive as kangaroos. At least, when it comes to their drinking habits. As I explained before, kangaroos drink beer or whisky, but not both. Koalas like to mix things up a bit more, when it comes to their choice of drink, but how much exactly? What is the probability, for example, that any given koala who drinks beer on any given night will also drink whisky on the same night? These are the sorts of urgent questions that science must seek to answer with the utmost speed and accuracy.

The Regression Fallacy

Consider a teacher, keen to apply rational techniques to maximize the effectiveness of his didactic program. For some time, he has been gathering data on the outcomes of certain stimuli aimed at improving his pupils' performance. He has been punishing those students that under-perform in discrete tasks, and rewarding those that excel. The results show, unexpectedly, that performances were improved, on average, only for pupils that received punishments, while those that were rewarded did worse subsequently. The teacher is seriously considering desisting from future rewards, and continuing only with punishments. What would be your advice to him?

First note that a pupil's performance in any given task will have some significant random component. Luck in knowing a particular topic very well, mood at the time of execution of the task, degree of tiredness, pre-occupation with something else, or other haphazard effects could conspire to affect the student's performance. The second thing to note is this: if a random variable is sampled twice, and the first case is far from average, then the second is most likely to be closer to the average. Neglect of this simple fact is common, and is a special case of the regression fallacy.

If a pupil achieves an outstanding result in some test, then probably this is partly due to the quality of the student, and partly due to random factors. It is also most likely that the random factors contributed positively to the result. So a particular sample of this random variable has produced a result far into the right-hand tail of its probability distribution. The odds of a subsequent sample from the same probability distribution being lower than the first are simply the ratio of the areas of the distribution on either side of the initial value. These odds are relatively high.

Imagine a random number generator that produces an integer from 1 to 100 inclusive, all equally probable. Suppose that in the first of two draws, the number 90 comes up. There are now 89 ways to get a smaller number on the second draw, and only 10 ways to get a larger number. In a very similar way, a student who performs very well in a test (and therefore receives a reward) has the odds stacked against them, if they hope to score better in the next test. The regression fallacy, in this case, is to assume that the administered reward is the cause of the eventual decline in performance.
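The counting in the random-number example can be checked by brute force. The following is only a small sketch; the function name `odds_of_lower` is made up for illustration.

```python
from fractions import Fraction

def odds_of_lower(first_draw, lowest=1, highest=100):
    """Count the ways a second uniform draw can fall below or above first_draw."""
    outcomes = range(lowest, highest + 1)
    lower = sum(1 for n in outcomes if n < first_draw)
    higher = sum(1 for n in outcomes if n > first_draw)
    return lower, higher

lower, higher = odds_of_lower(90)
print(lower, higher)                    # 89 10
print(float(Fraction(lower, higher)))   # 8.9, i.e. odds of roughly 9 to 1 for a lower draw
```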

The argument works exactly the same way for a poorly performing pupil - a really bad outcome is most likely, by chance alone, to be followed by an improvement. This tendency for extreme results to be followed by more ordinary results is called regression to the mean. It is not impossible that an intervention such as a punishment could cause improved future performance, but the automatic assumption that an observed improvement is caused by the administered punishment is fallacious.
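To see regression to the mean without any reward or punishment at all, one can simulate pupils whose score is a fixed ability plus fresh luck on each test. This is a sketch under invented assumptions: the normal distributions, the score scale (mean 50, spreads of 10) and the 10% cut-offs are all illustrative choices, not data.

```python
import random

def simulate_scores(n_pupils=10_000, ability_sd=10.0, luck_sd=10.0, seed=0):
    """Return (first, second) test scores; no reward or punishment is modelled."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pupils):
        ability = rng.gauss(50.0, ability_sd)        # stable part of performance
        first = ability + rng.gauss(0.0, luck_sd)    # ability plus luck on the day
        second = ability + rng.gauss(0.0, luck_sd)   # same ability, fresh luck
        scores.append((first, second))
    return scores

def mean_change(scores, selected):
    """Average of (second - first) over pupils whose first score satisfies `selected`."""
    changes = [second - first for first, second in scores if selected(first)]
    return sum(changes) / len(changes)

scores = simulate_scores()
ranked = sorted(first for first, _ in scores)
bottom_cut = ranked[len(ranked) // 10]       # 10th percentile of first scores
top_cut = ranked[9 * len(ranked) // 10]      # 90th percentile of first scores

change_top = mean_change(scores, lambda s: s >= top_cut)
change_bottom = mean_change(scores, lambda s: s <= bottom_cut)
print(f"top 10%: {change_top:+.1f}, bottom 10%: {change_bottom:+.1f}")
```

The top group gets worse on average and the bottom group improves, even though nothing was done to either group between the two tests.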

Another common example comes from medical science. It's when my sinusitis is worst that I sleep with a freshly severed rabbit's foot under my pillow. I almost always feel better the next morning.

These before-after scenarios are a special case, as I mentioned. In general, all we need in order to see regression to the mean is to sample two correlated random variables. They may be from the same distribution (before-after), or they may be from different distributions.

If I tell you that Hans is 6 foot, 4 inches tall (193 cm), and ask you what you expect to be the most likely height of his fully-grown son, Ezekiel, you might correctly reason that fathers' heights and sons' heights are correlated. You might think, therefore, that the best guess for Ezekiel's height is also 6' 4", but you would be forgetting about regression to the mean - Ezekiel's height is actually most likely to be closer to average. This is because the correlation between fathers' and sons' heights is not perfect. On a scale where 0 represents no correlation whatsoever, and 1 indicates perfect correlation (knowledge of one necessarily fixes the other precisely¹), the correlation coefficient for father-son stature is about 0.5. Reasoning informally, therefore, we might adjust our estimate for Ezekiel's still unknown height to half way between 6' 4" and the population average. It turns out we'd be bang on with this estimate. (I don't mean, of course, that this is guaranteed to be the fellow's height, but that this would be our best possible guess, most likely to be near his actual height.)

The success of this simple revision follows from the normal (Gaussian) probability distribution for people's heights. The normal distribution can be applied in a great many circumstances, both for physical reasons (central limit theorem), and for the reason that we often lack any information required to assign a more complicated distribution (maximum entropy). If two variables, x and y, are each assigned a normal distribution (with respective means and standard deviations μ_i and σ_i), and provided that certain not-too-exclusive conditions are met (all linear combinations of x and y are also normal), then their joint distribution, P(xy | I), follows the bivariate normal distribution, which I won't type out, but follow the link if you'd like to see it. (As usual, I is our background information.) To get the conditional probability for y, given a known value of x, we can make use of the product rule, to give

P(y | x, I) = P(xy | I) / P(x | I)

(1)

P(x | I) is the marginal distribution for x, just the familiar normal distribution for a single variable. If one goes through the slightly awkward algebra, it is found that for xy bivariate normal, y|x is also normally distributed², with mean

μ_{y|x} = μ_y + ρ σ_y (x − μ_x) / σ_x

(2)

and standard deviation

σ_{y|x} = σ_y √(1 − ρ²)

(3)

where ρ is the correlation coefficient, given by

ρ = cov(x, y) / (σ_x σ_y)

(4)

Knowing this mean and standard deviation, we can now make a good estimate of how much regression to the mean to expect in any given situation. We can state our best guess for y and its error bar.
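As a sketch of how such an estimate might be computed, the standard bivariate-normal results (conditional mean μ_y + ρ σ_y (x − μ_x)/σ_x and conditional standard deviation σ_y √(1 − ρ²)) can be wrapped in a small function and applied to Hans and Ezekiel. The population mean of 178 cm and standard deviation of 7 cm are assumptions made up for this illustration; only the 193 cm and ρ ≈ 0.5 come from the text above.

```python
import math

def conditional_normal(x, mu_x, sigma_x, mu_y, sigma_y, rho):
    """Mean and standard deviation of y given x, for bivariate normal x and y."""
    mean = mu_y + rho * sigma_y * (x - mu_x) / sigma_x
    sd = sigma_y * math.sqrt(1.0 - rho ** 2)
    return mean, sd

# Assumed (illustrative) population figures for adult male height, in cm.
POP_MEAN, POP_SD = 178.0, 7.0

mean, sd = conditional_normal(x=193.0, mu_x=POP_MEAN, sigma_x=POP_SD,
                              mu_y=POP_MEAN, sigma_y=POP_SD, rho=0.5)
print(f"best guess for the son: {mean:.1f} cm, error bar: {sd:.1f} cm")
```

With these assumed figures the best guess lands half way between the father's height and the population mean, and the error bar is only a little narrower than the spread of the whole population, because a correlation of 0.5 leaves most of the uncertainty in place.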

We can rearrange equation (2) to give

(μ_{y|x} − μ_y) / σ_y = ρ (x − μ_x) / σ_x

(5)

which says that the expected number of standard deviations between y - given our information about x - and μ_y (the mean of y when nothing is known about x) is the same as the number of standard deviations between the observed value of x and μ_x, only multiplied by the correlation coefficient. A bit of a mouthful, perhaps, but actually a fairly easy estimate to perform, even in an informal context. In fact, this is just what we did when estimating Ezekiel's height.

When reasoning informally, we can ask the simplified question, 'what value of y is roughly as improbable as the known value of x?' The human mind is actually not bad at performing such estimates. Next, we need to figure out how far that value is from the expected value of y (μ_y) and shrink that distance by multiplying it by ρ, which again (with a bit of practice, perhaps), we can also estimate not too badly. In any case, an internet search is often the most complicated data-gathering exercise needed to pin down ρ more accurately.
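When paired data are to hand, ρ can be estimated directly as the sample covariance divided by the product of the sample standard deviations. The function below is a minimal sketch of that calculation; the two small data sets are invented purely to exercise it.

```python
import math

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented data: y rises with x, but not perfectly.
print(round(correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 2))   # 0.8

# y = x**2 is completely determined by x, yet the correlation is zero.
xs = [-2, -1, 0, 1, 2]
print(correlation(xs, [x ** 2 for x in xs]))                     # 0.0
```

The second call illustrates why a correlation of zero cannot, in general, be read as independence.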

Here are a few correlation coefficients for some familiar phenomena:

Life expectancy by nation (data from Wikipedia here and here):

life expectancy vs GDP per capita:	0.53
male life expectancy vs female life expectancy:	0.98

IQ scores (data from Wikipedia again):

same person tested twice:	0.95
identical twins raised together:	0.86
identical twins raised separately:	0.76
unrelated children raised together:	0.3

Amount of rainfall (in US) and frequency of 'flea' searches on Google: 0.87
(from Google Correlate)

With the above procedure for estimating y|x, we can get better, more rational estimates for a whole host of important things: how will our company perform this year, given our profits last year? How will our company perform if we hire this new manager, given how his previous company performed? What are my shares going to do next? What will the weather do tomorrow? How significant is this person's psychological assessment? Or criminal record?

In summary, when two partially correlated, random variables are sampled, there is a tendency for an extreme value of one to be accompanied by a less extreme value for the other. This is simple to the point of tautology, and is termed regression to the mean. The regression fallacy is a failure to account for this effect when making predictions, or investigating causation. One common form is the erroneous assumption of cause and effect in 'before-after' type experiments. Rabbits' feet do not cure sinusitis (in case you were still wondering). Another kind of fallacious reasoning is the failure to regress to the mean an estimate or prediction of one variable based on another known fact. For two normally distributed, correlated variables, the ratio of the expected distance (in standard deviations) of one variable from its marginal mean to the actual distance of the other from its mean is the correlation coefficient.

[1]	Note: there are also cases where this condition holds for zero correlation, i.e. situations where y is completely determined by x, even though their correlation coefficient is zero. Lack of correlation cannot be taken to imply independence, though if x and y are jointly (bivariate) normal, lack of correlation does strictly imply independence.
[2]	I've been a little cavalier with the notation, but you can just read y|x as 'the value of y, given x.' Here, y is to be understood as a number, not a proposition.

Thursday, May 10, 2012

Dodgy Dice

Following up on an earlier post, here is another example of how thinking about conditional probabilities in terms of causation rather than logical dependence can lead to cognitive dissonance. This example comes from 'Entropy Demystified,' by Arieh Ben-Naim.

Let A, B, and C be three hypotheses, such that P(B|A) > P(B) and P(C|B) > P(C). Suppose I tell you that P(C|A) < P(C). How does this strike you?

Knowledge of A makes B more likely, and knowledge of B makes C more likely, so shouldn't knowledge of A necessarily also make C more likely?

If you are thinking along these lines and are struggling to accept that the situation I describe is possible, then I suspect it is because, despite my warning, you can't help thinking about P(x|y) in terms of causation. The specific property of causality that seems to invade our heads when we contemplate problems like this is its transitivity: if A causes B and B causes C, then A necessarily is the cause of C.

Logical dependence, however, is not required to exhibit this transitivity property. Let A, B, and C be hypotheses about the outcome of rolling a six-sided die, such that for each hypothesis, the face showing is one of the four specified:

A: {1, 2, 3, 4}

B: {2, 3, 4, 5}

C: {3, 4, 5, 6}

P(A) = P(B) = P(C) = 4/6. If you are told that A is true, then the probability associated with B becomes 3/4, which is greater than 4/6. Similarly, P(C|B) is also greater than P(C). But P(C|A) = 2/4, which is less than P(C), as promised.
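These numbers are easy to confirm by enumerating the six faces. A minimal sketch, in which the helper `prob` is a made-up name:

```python
from fractions import Fraction

FACES = frozenset(range(1, 7))
A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
C = {3, 4, 5, 6}

def prob(event, given=FACES):
    """P(event | given) for a fair die: favourable faces over faces allowed by `given`."""
    return Fraction(len(event & given), len(given))

print(prob(B), prob(B, A))   # 2/3 3/4  -> knowing A raises the probability of B
print(prob(C), prob(C, B))   # 2/3 3/4  -> knowing B raises the probability of C
print(prob(C), prob(C, A))   # 2/3 1/2  -> knowing A lowers the probability of C
```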

Here's another example of unexpected intransitivity for dice, operating on a different level, taken from Ian Stewart's entertaining "Cabinet of Mathematical Curiosities."

Mr. A proposes to his friend Mr. B that they play dice for money with his special dice. He has three of them, and they are each to pick one to play with against the other. Whoever rolls the higher number on each throw wins the stake. To convince Mr. B that the dice are fair, Mr. A insists that he take first choice from the 3 dice - that way, if one of the dice is better than the others, Mr. B should have a fair chance of picking the better one.

The dice do not have the usual numbers marked on their faces, though. The red one has the numbers {3, 3, 4, 4, 8, 8}. The yellow one is marked with the numbers {1, 1, 5, 5, 9, 9}. Finally, the blue one has the numbers {2, 2, 6, 6, 7, 7}.

Which one, if any, should Mr. B choose?

Check the numbers for yourself. Whichever die Mr. B selects, Mr. A can select one of the others that gives a higher probability of winning.
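For anyone who would rather let a computer do the checking, all 36 face pairings for each pair of dice can be enumerated. A short sketch (the function name `win_probability` is just for illustration):

```python
from fractions import Fraction
from itertools import product

DICE = {
    "red": [3, 3, 4, 4, 8, 8],
    "yellow": [1, 1, 5, 5, 9, 9],
    "blue": [2, 2, 6, 6, 7, 7],
}

def win_probability(first, second):
    """Probability that die `first` shows a higher number than die `second`."""
    pairs = list(product(DICE[first], DICE[second]))
    wins = sum(1 for a, b in pairs if a > b)
    return Fraction(wins, len(pairs))

print(win_probability("yellow", "red"))    # 5/9
print(win_probability("blue", "yellow"))   # 5/9
print(win_probability("red", "blue"))      # 5/9
```

Yellow beats red, blue beats yellow, and red beats blue, each with probability 5/9, so whoever chooses second can always pick the die that beats the first choice.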

Friday, April 27, 2012

Logical v's Causal Dependence

At the end of a previous post, I promised to discuss the difference between logical dependence, the substance of probability theory, and causal dependence, which is often assumed to be the thing that probability is directly concerned with. Let's get the ball rolling with a simple example:

A box contains 10 balls: 4 black, 3 white, and 3 red. A man extracts exactly one ball, ‘randomly.’ The extracted ball is never replaced in the box. Consider the following 2 situations:

a) You know that the extracted ball was red. What is the probability that another ball extracted in the same way, will also be red?

b) A second ball has been extracted in the same manner as the first, and is known to be black. The colour of the first ball is not known to you. What is the probability that it was white?

You should try to verify that the answer in the first case is 2/9, and in the second case is 1/3.
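One way to verify both answers is to list every equally likely ordered pair of draws and count. The sketch below does this; the helper `conditional` is a hypothetical name. Notice that the code treats 'first' and 'second' simply as positions in a pair, with no notion of one causing the other.

```python
from fractions import Fraction
from itertools import permutations

BALLS = ["black"] * 4 + ["white"] * 3 + ["red"] * 3

def conditional(event, given, balls=BALLS):
    """P(event | given) over all equally likely ordered pairs drawn without replacement."""
    draws = list(permutations(balls, 2))            # 10 * 9 = 90 ordered pairs
    consistent = [d for d in draws if given(d)]     # pairs compatible with what we know
    favourable = [d for d in consistent if event(d)]
    return Fraction(len(favourable), len(consistent))

# (a) first ball known to be red: probability that the second is red
print(conditional(lambda d: d[1] == "red", lambda d: d[0] == "red"))      # 2/9

# (b) second ball known to be black: probability that the first was white
print(conditional(lambda d: d[0] == "white", lambda d: d[1] == "black"))  # 1/3
```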

Nearly everybody will agree with my answer to situation (a), but some may hesitate about the answer for situation (b). This hesitation seems to result from the feeling that when we write P(A|B) ≠ P(A), then B is, at least partially, the cause of A. (P(A|B) means ‘the probability for A given the assumption that B is true.’) If true, then there would be no possibility for knowledge of B to influence the probability for A, because the colour of the second ball can have had no causal influence on the colour of the first ball.

In fact, it makes no difference at all in what order the balls are drawn, in such cases. The labels ‘first,’ ‘second,’ ‘n^th,’ are really just arbitrary labels, and we can exchange them as we please, without affecting the outcome of the calculation.

In case there is still a doubt, consider a simplified version of our thought experiment:

The box had exactly 2 balls, 1 black and 1 white. Both balls were drawn, ‘at random.’ The second ball drawn was black. What is the probability that the first was black?

The product rule can be written P(AB) = P(A|B)P(B). With this formulation, we can account for cases where A depends on B. When thinking about this dependence, however, it is often tempting to think in terms of causal dependence. But probability theory is concerned with calculations of plausibility with incomplete knowledge, and so what we really need to consider is not causal dependence, but logical dependence. We can verify that P(A|B) does not imply that B is the cause of A, since, thanks to the commutativity of Boolean algebra, AB = BA, and we could just as easily have written the product rule as P(AB) = P(B|A)P(A).

What is the probability that X committed a crime yesterday, given that he confessed to it today? Surely it is altered by our knowledge of the confession, indicating that the propositions are not independent in the sense we need for probability calculations. But it is also clear that a crime committed yesterday was not caused by a confession today.

Edwin Jaynes, in ‘Probability Theory: The Logic of Science,’ gave the following technical example of the errors that can occur by focusing on causal dependence, rather than logical dependence. Consider multiple hypothesis testing with a set of n hypotheses, H₁, H₂, …, H_n, being examined in the light of m datasets, D₁, D₂, …, D_m. When the data sets are logically independent, the direct probability for the totality of the data given any one of the hypotheses, H_i, satisfies a factorization condition,

P(D₁...D_m | H_i, I) = ∏ _j P(D_j | H_i, I)

(1)

(The capital 'pi' means multiply for all 'j'.) It can be shown, however, that the corresponding condition for the alternate hypothesis, H_i'

P(D₁...D_m | H_i', I) = ∏ _j P(D_j | H_i', I)

(2)

does not hold except in highly trivial cases, though some authors have assumed it to be generally true, based on the fact that no D_i has any causal effect on any other D_j. (Equation (2) requires that P(D_j|D_i) = P(D_j).) The datasets maintain their causal independence, as they must, but they are no longer logically independent. This is because the amount that equivalent units of new information change the relative plausibilities of multiple hypotheses depends on the data that has gone before: the effect of new data on a hypothesis depends on which other hypothesis it competes with most directly.

In Jaynes’ example, he imagined a machine producing some component in large quantities and an effort to determine the fraction of components fabricated that are faulty by randomly sampling 1 component at a time and examining it for faults. The prior information is supposed specific enough to narrow the number of possible hypotheses to 3:

A ≡ ‘The fraction of components that are faulty is 1/3.’

B ≡ ‘The fraction of components that are faulty is 1/6.’

C ≡ ‘The fraction of components that are faulty is 99/100.’

The prior probabilities for these hypotheses are as shown at the extreme left of the graph below. The graph is the calculation of the evolution of the probabilities for each hypothesis as the number of tested components increases. Recall that, from Bayes’ theorem, each P(H_i|D, I) depends on both P(D|H_i, I) and P(D|H_i', I). Each tested component is found to be faulty, so the information added is identical with each sample, but the rates of change of the 3 curves (plotted logarithmically) are not constant.

Evolution of the probabilities of 3 hypotheses as constant new data are added.

Taken from E.T. Jaynes, ‘Probability theory: the logic of science,’ chapter 4.

The ‘Evidence,’ plotted on the vertical axis, is perhaps an unfamiliar expression of probability information. It is the log odds, given by

E = 10 log₁₀ [ P(H) / P(H') ]

(3)

with the factor of 10 because we choose to measure evidence in decibels. The base 10 is used because of a perceived psychological advantage (our brains seem to be good at thinking in terms of factors of 10). Because we have used a logarithmic scale, the products expressed above in equations (1) and (2) become sums, and for constant pieces of new information, we expect to add a constant amount to the evidence, if both factorization conditions hold. The slopes of the curves are not constant, however, indicating that this is not the case, and consecutive items of data are not independent: ΔE depends on what data have preceded that point. Specifically, wherever a pair of hypotheses cross on the graph, there is a change of slope of the remaining hypothesis.
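The changing slopes can be reproduced numerically. The sketch below applies Bayes' theorem to the three hypotheses after n consecutive faulty components and converts each posterior to evidence in decibels. The prior probabilities are not stated in the text, so the values used here (roughly -10 dB for A, +10 dB for B and -60 dB for C) are an assumption based on Jaynes' chapter 4 and should be checked against the book.

```python
import math

# Fraction of faulty components asserted by each hypothesis (from the text).
FAULT_FRACTION = {"A": 1 / 3, "B": 1 / 6, "C": 99 / 100}

# ASSUMED priors: about -10 dB for A, +10 dB for B and -60 dB for C.
PRIOR = {"A": (1 / 11) * (1 - 1e-6), "B": (10 / 11) * (1 - 1e-6), "C": 1e-6}

def evidence_db(hypothesis, n_faulty):
    """Evidence, 10*log10 of the posterior odds, after n_faulty faulty components in a row."""
    weight = {h: PRIOR[h] * FAULT_FRACTION[h] ** n_faulty for h in PRIOR}
    rivals = sum(w for h, w in weight.items() if h != hypothesis)   # weight of 'not H'
    return 10 * math.log10(weight[hypothesis] / rivals)

for n in range(0, 21, 4):
    row = "  ".join(f"{h}: {evidence_db(h, n):+7.2f}" for h in "ABC")
    print(f"n={n:2d}  {row}")

# Change in evidence for A produced by one more faulty component, early and late.
early_step = evidence_db("A", 1) - evidence_db("A", 0)
late_step = evidence_db("A", 20) - evidence_db("A", 19)
print(f"early step: {early_step:+.2f} dB, late step: {late_step:+.2f} dB")
```

Under these assumed priors, an identical observation adds about 3 dB to the evidence for A while B is its main rival, but removes about 4.7 dB once C has become the main rival, which is the change of slope described above.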

When we calculate P(D|H_i, I) we are supposing for the purposes of calculation that H_i is true, and so the result we get is independent of P(H_i), which is why P(D|H_i, I) factorizes. But P(D|H_i', I) is different, because when the total number of hypotheses is greater than 2, then H_i' is composite and decomposes into at least 2 hypotheses, so P(D|H_i', I) relies upon the relative probabilities for those component propositions.