Friday, March 30, 2012

How to make ad-hominem arguments

My previous post on the base rate fallacy dealt with an example of binary hypothesis testing - hypothesis testing where there are exactly two competing hypotheses, such as either a person has a disease or they do not.

Traditional frequentist hypothesis testing is often considered (slightly erroneously) to be limited to binary hypothesis testing, the two hypotheses being the null and the alternative. In fact, the alternative hypothesis is never formulated and takes no part in any calculation, so frequentist hypothesis testing typically tests only a single hypothesis. If this seems bizarre to you, then welcome to the club.

In the Bayesian framework, however, binary hypothesis testing is just a special case. In general there can be any number of hypotheses. For example, the murder was committed by Mr. A, or Mr. B, or Mr. C,.....

In this post I will demonstrate an example of testing an infinite number of hypotheses. In doing so, I will investigate the logical position of ad-hominem arguments - arguments based on who is making a particular point, rather than arguments that address that point. We are often told that such arguments are illogical and should take no place in rational discussion and decision making. A Bayesian, however, acknowledges that it is illogical to ignore relevant prior information when trying to reach a decision about something - anything, including a person. We'll get to the nitty gritty in a moment.

First though, to get warmed up, an amusing example concerning a book, ‘Logic and Its Limits,’ by P. Shaw. It is generally a good book, introducing many forms of logical error that can occur, particularly in verbal arguments and in the written word, rather than emphasizing the algebraic manipulation of formal logic. What I describe here should not be taken as a negative review of this book; it's really just a little glitch - perhaps more like a typo than anything else. But the author of the book is described in its front matter as ‘senior lecturer in philosophy at the University of Glasgow,’ and so this mistake from a section on ad hominem arguments is deliciously ironic. One of the exercises posed to the reader in this section is to analyze the following statement:

"As Descartes said, appeals to authority are never really reliable."

At the back of the book, in the answers-to-exercises section, the author tells us that the statement is:

"self-refuting"

It seems to me that the only way one might think this is self-refuting is if it is assumed that Descartes is himself an authority. Suppose, then, that the statement is false, as claimed. Then an authority (Descartes) has told us something unreliable. Since we would have proof that authorities do make mistakes, we would have to accept that authority can never be relied upon, as there would be no means of knowing when an authority is or is not reliable. Thus, by assuming the statement false, we are forced to accept that the statement is true. Reductio ad absurdum.

How should we treat the authority of Mr. Shaw on the subject of logic?

(To clarify: the reason we can assert that Descartes’ statement is correct is that we have performed the appropriate logical analysis ourselves, and so our statement is not that ‘it is true because Descartes said it,’ - which would be an appeal to authority - but that ‘since Descartes, an expert, said it, the falsity of this particular statement would be a logical contradiction.’ By proving that appeals to authority are never reliable, we also demonstrate that the statement beginning “As Descartes said” is not an appeal to authority.)



Now, on with the main program. Consider that it is possible that the details of an argument may be difficult to access or analyze, but we still would like to assess the probability that the argument is sound. Here is a simple example: a man (let's call him Weety Peebles) has a penchant and a talent for getting his provocative ideas published in the press. In the past you have observed ten occasions when his wild ideas have made the headlines. On each occasion you have analyzed his claims, and on all occasions but one found them to be totally devoid of any truth, and to have no logical merit whatsoever. On an eleventh occasion you see in your newspaper a headline '[something interesting if true], declares Weety Peebles.' You don't really want to bother reading it, however, because of your experience. What is the probability that the story is worth reading?

Suppose that when Mr. Peebles goes public with one of his stories he tells the truth with some relative frequency - unknown, but fixed by the many variables that define his nature. In other words, over long stretches of time the fraction of his stories that are true remains about the same. Our first job is to determine the probability distribution that describes the information we have about this frequency. The possibilities for the desired frequency, f, include all numbers between 0 and 1 - an uncountably infinite number of possible frequencies. For each of these values, f, let H be the hypothesis that the true frequency is f. By analogy with the principle of indifference for discrete sample spaces, we'll start with a prior that is uniform over the range 0 to 1. This is how we encode the fact that we start with no reason to favor one frequency more than any other. Now all we need to do is use Bayes' theorem to update the prior distribution using the data we have relating to those ten occasions of examining Mr. Peebles' output.

For a set of hypotheses numbered 1 to n, the general form of Bayes' theorem for the first of those hypotheses, H1, is


P(H1 | DI) = P(H1 | I) × P(D | H1I) / P(D | I)    (1)

When all n hypotheses are exhaustive and mutually exclusive, the denominator in this equation can be expanded as follows:


P(H1 | DI) = P(H1 | I) × P(D | H1I) / P(DH1 + DH2 + ... + DHn | I)    (2)

(The product, DHi, means that both D and Hi are true, while any sum, A + B, means that A or B is true.) Next, applying the product and extended sum rules (see appendix below), this becomes:

P(H1 | DI) = P(H1 | I) × P(D | H1I) / [Σi=1..n P(D | HiI) × P(Hi | I)]    (3)
Finally, if the hypotheses are drawn from a continuous sample space, then the sum in the denominator simply becomes an integral.
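Spelled out for the continuous case we need below, with the hypotheses labelled by the frequency f and the prior expressed as a probability density, equation (3) reads

P(f | DI) = P(f | I) × P(D | fI) / ∫ P(f' | I) × P(D | f'I) df'

where the integral runs over the full range of f', here from 0 to 1.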


To perform this update we can divide the range of possible frequencies into, say, 200 intervals, each of width 0.005, and let these approximate our continuous sample space. Then, for each of these intervals, with midpoint fi, we need to calculate P(Hi | DI).

This means calculating P(D | HiI) for each frequency, which for a single occasion is just fi if it was an occasion where Weety spoke the truth, or (1 - fi) if it was not. We can perform 10 such updates, one for each of the occasions in our experience. Alternatively, we can perform a single update for all 10 occasions in one go, using a result obtained by Thomas Bayes some time in the 18th century:
P(f | DI) = (N + 1)! / [n! (N - n)!] × f^n × (1 - f)^(N - n)    (4)

Here, N represents the number of occasions we have experience of and n represents the number of times Mr. Peebles was not talking rubbish. (Note that this formula applies only when we start with a uniform prior, and only when f is constant.)

I performed this calculation very quickly using a spreadsheet (because I'm integrating over little rectangles, I multiply each P by Δf, the width of each rectangle), obtaining the following probability distribution for f, given 10 past experiences, including only one occasion where there was some merit to Weety's story:

[Figure: the resulting posterior probability distribution for f]
Recall the interpretation of the numbers: a frequency of 0 means he never tells the truth, while a frequency of 1 means he always tells the truth. We see that nearly all the 'mass' is located between 0 and 0.5 - as we might have anticipated.
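For anyone who wants to reproduce the distribution above, here is a minimal Python sketch of the discretized update (hypothetical code - the original calculation was done in a spreadsheet, but the arithmetic is identical):

    import numpy as np
    from math import factorial

    # 200 intervals of width 0.005 approximate the continuous range of frequencies f.
    df = 0.005
    f = np.arange(df / 2, 1, df)          # interval midpoints

    N, n = 10, 1                          # 10 past stories, only 1 with any truth in it

    # Equation (4), multiplied by the interval width so the probabilities sum to ~1.
    P = factorial(N + 1) / (factorial(n) * factorial(N - n)) * f**n * (1 - f)**(N - n) * df

    print(P.sum())                        # ~1.0 (sanity check)
    print(P[f < 0.5].sum())               # nearly all of the mass lies below f = 0.5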

To complete the exercise, it only remains to determine the probability that on our current, 11th occasion, the news item in question is worth reading - in other words, the probability that Weety Peebles is not lying again. We still don't know the actual frequency that determines how often Peebles is truthful, so we have to integrate over the total sample space. We divided the continuous sample space into 200 intervals, and we approximate each narrow interval as corresponding to a discrete frequency. For each of these frequencies we take the product of that frequency, f, with its probability, P(f); adding these products up gives the total probability that the current news story is true. The result of this summation is 0.167, almost a 17% chance that the story is true. (Note that if I had taken f to be simply 1/10, the exact fraction corresponding to our experience, we would have wrongly estimated only a 10% chance of a story worth reading, which would have done Peebles a slight disservice.)
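The final summation can be sketched in the same way (again hypothetical code, repeating the setup so that it runs on its own):

    import numpy as np
    from math import factorial

    df = 0.005
    f = np.arange(df / 2, 1, df)
    N, n = 10, 1
    P = factorial(N + 1) / (factorial(n) * factorial(N - n)) * f**n * (1 - f)**(N - n) * df

    # Probability that the 11th story is true: the sum over all f of f * P(f).
    print(np.sum(f * P))                  # ~0.167, i.e. (n + 1) / (N + 2), Laplace's rule of succession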

Without me even specifying anything about the content of Weety Peebles' story, you are now in a position to say: 'Mr. Peebles, you are very probably talking out of your arse.'

And that is an ad hom with some seriously refined logic behind it.


In general, we know from our experience that we can learn from our experience. Our common sense, therefore, should have warned us that ad-hominem arguments can be rational. In hindsight, we probably recognize that we use them all the time. What we see about Bayes' theorem is that it provides a formal way of quantifying our common sense.


Thursday, March 29, 2012

The Base Rate Fallacy



Here is a simple puzzle:

A man takes a diagnostic test for a certain disease and the result is positive. The false positive rate for the test in this case is the same as the false negative rate, 0.001. The background prevalence of the disease is 1 in 10,000. What is the probability that he has the disease?

This problem is one of the simplest possible examples of a broad class of problems, known as hypothesis testing, concerned with defining a set of mutually contradictory statements about the world (hypotheses) and figuring out some kind of measure of the faith we can have in each of them.

It might be tempting to think that the desired probability is just 1 - (false-positive rate), which would be 0.999. Be warned, however, that this is quite an infamous problem. In 1982, a study was published [1] in which 100 physicians were asked to solve an equivalent question. All but 5 got the answer wrong by a factor of about 10. Maybe it’s a good idea, then, to go through the logic carefully.

Think about the following:

  • What values should the correct answer depend on?

  • Other than reducing the false-positive rate, what would increase the probability that a person receiving a positive test result would have the disease?



The correct calculation needs to find some kind of balance between the likelihood that the person has the disease (the frequency with which the disease is contracted by similar people) and the likelihood that the positive test result was a mistake (the false positive rate). We should see intuitively that if the prevalence of the disease is high, the probability that any particular positive test result is a true positive is higher than if the disease is extremely rare.

The rate at which the disease is contracted is 1 in 10,000 people, so to make it simple, we will imagine that we have tested 10,000 people. We therefore expect 1 true case of the disease. We also expect about 10 false positives, so our estimate drops from 0.999 to roughly 1 in 11, or 0.0909. This answer is very close, but not precisely correct.
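Put as a few lines of Python (a sketch of the counting argument only, rounding to whole people):

    n_tested = 10000
    true_cases = n_tested * (1 / 10000)        # about 1 person actually has the disease
    false_positives = n_tested * 0.001         # about 10 healthy people test positive anyway
    print(true_cases / (true_cases + false_positives))   # ~0.0909, roughly 1 in 11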

The frequency with which we see true positives must also be reduced to account for the possibility of false negatives. How do we encode all of this in the calculation?

We require the conditional probability that the man has the disease, given that his test result was positive, P(D|R+). This is the number of ways of getting a positive result and having the disease, divided by the total number of ways of getting a positive test result,

P(D|R+) = P(R+D)/(P(R+D)+P(R+C))    (1),

where D is the proposition that he has the disease, C means he is clear, and R+ denotes the positive test result.

If we ask what is the probability of drawing the ace of hearts on the first draw from a deck of cards and the ace of spades on the second, without replacing the first card before the second draw, we have P(AHAS) = P(AH)P(AS|AH). The probability for the second draw is modified by what we know to have taken place on the first.

Similarly, P(R+D) = P(D)P(R+|D), and P(R+C) = P(C)P(R+|C), so


P(D|R+) = P(D)P(R+|D) / (P(D)P(R+|D) + P(C)P(R+|C))    (2).



  • P(D) is the background rate for the disease.
  • P(R+|D) is the true positive rate, equal to 1 – (false negative rate).
  • P(C) = 1 – P(D).
  • P(R+|C) is the false positive rate.



So

P(D|R+) = (0.0001 × 0.999) / (0.0001 × 0.999 + 0.9999 × 0.001)    (3),

which is approximately 0.0908.
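As a quick check on the arithmetic, the same calculation in Python (a hypothetical sketch, not part of the original derivation):

    # Equation (3) with the numbers from the puzzle.
    p_disease = 1 / 10000       # P(D), the background prevalence
    fpr = 0.001                 # P(R+|C), the false positive rate
    fnr = 0.001                 # false negative rate, so P(R+|D) = 1 - fnr

    p_clear = 1 - p_disease     # P(C)
    posterior = p_disease * (1 - fnr) / (p_disease * (1 - fnr) + p_clear * fpr)
    print(posterior)            # ~0.0908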

The formula we have arrived at above, by simple application of common sense, is known as Bayes’ theorem. Many people assume the answer to be more like 0.999, but the correct answer is an order of magnitude smaller. As mentioned, most medical doctors also get questions like this wrong by about an order of magnitude. The correct answer to the question, 0.0908, is called in medical science the positive-predictive value of the test. More generally, it is known as the posterior probability.

Bayes’ theorem has been a controversial idea during the development of statistical reasoning, with many authorities dismissing it as an absurdity. The consequence is that orthodox statistics, still today, does not employ this vitally important technique. Here, we have developed a special case of Bayes’ theorem by simple reasoning. In full generality, it follows as a straightforward rearrangement of the laws of probability (the product and sum rules), which are so simple that most authors treat them as axioms, but which can in fact be rigorously derived (with a little effort) from extremely simple and perfectly reasonable principles. It is one of my central beliefs about science that a logical calculus of probability, and the highest quality inferences from data, can only be achieved when Bayes’ theorem is accepted and applied whenever appropriate.

The general statement of Bayes’ theorem is


P(H|DI) = P(H|I)P(D|HI) / P(D|I)    (4).


Here 'I' represents the background information: a set of statements concerning the scope of the problem that are considered true for the purposes of the calculation. In working through the medical testing problem above, I have omitted the 'I', but in every case where I write down a probability without including the 'I', this is to be recognized as shorthand - the 'I' is always really there, and the calculation makes no sense without it.

The error that leads many people to overestimate, by an order of magnitude, probabilities such as the one required in this question is known as the base-rate fallacy. Specifically, in this case the base rate, or expected incidence, of the disease has been ignored, leading to a calamitous miscalculation. The base-rate fallacy amounts to believing that P(A|B) = P(B|A). In the above calculation this corresponds to saying that P(D|R+), which was desired, is the same as P(R+|D), the latter being equal to 1 – (false negative rate).

In frequentist statistics, a probability is identified with a frequency. In this framework, therefore, it makes no sense to ask for the probability that a hypothesis H is true, since there is no sense in which a relative frequency for the truth of H can be obtained. As a measure of faith in the proposition H in light of data, D, the frequentist therefore habitually uses not P(H|D) but P(D|H), and so commits himself to the base-rate fallacy.

In case it is still not completely clear that the base-rate fallacy is indeed a fallacy, let's employ a thought experiment with an extreme case. (These extreme cases, while not necessarily realistic, allow the expected outcome to be seen directly and compared with the result of the theory - something computer scientists call a 'sanity check'.) Imagine the case where the base rate is higher than the sensitivity of the test. For example, let the sensitivity be 98% (i.e. a 2% false negative rate) and let the background prevalence of the disease be 99%. Then P(B|A) is 0.98, and substituting this for P(A|B) gives an answer that is lower than P(A) = 0.99. The positive result of a high-quality test (98% sensitivity) would then leave us less confident that the subject has the disease than we were before the test result was known - an absurd conclusion.
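The same sanity check can be run numerically (a hypothetical sketch; the thought experiment does not fix the false positive rate, so the 2% used below is an assumed value):

    # Extreme case: base rate (99%) higher than the test's sensitivity (98%).
    p_disease = 0.99            # P(A), the background prevalence
    sensitivity = 0.98          # P(B|A), the true positive rate
    fpr = 0.02                  # assumed false positive rate, P(B|not-A)

    # The base-rate fallacy: report P(B|A) as if it were P(A|B).
    print(sensitivity)          # 0.98 - lower than the prior P(A) = 0.99, which is absurd

    # What Bayes' theorem actually gives:
    correct = p_disease * sensitivity / (p_disease * sensitivity + (1 - p_disease) * fpr)
    print(correct)              # ~0.9998 - the positive result raises the probability, as it should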




[1] Eddy, D. M. (1982). Probabilistic reasoning in clinical medicine: Problems and opportunities. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 249–267). Cambridge, England: Cambridge University Press. (In this study 95 out of 100 physicians answered between 0.7 and 0.8 to a similar question, to which the correct answer was 0.078.)


Wednesday, March 28, 2012

Maximum Entropy


In 1948 Claude Shannon forged a link between the thermodynamic concept of entropy and a new formal concept of information. This event marked the beginning of information theory. This discovery captured the imagination of Ed Jaynes, a physicist with strong interest in statistical mechanics and probability theory. His expertise in statistical mechanics meant that he understood entropy better than many. His recognition of probability theory as an extended form of logic meant that he understood that probability calculations (and therefore all of science) are concerned not directly with truths about reality, as many have supposed, but with information about truths.

The distinction may seem strange – science accepts that there are statements about nature that are objectively either true or false, and definitely not some combination of true and false, so the most desirable goal must be to know which of the options, ‘true’ or ‘false’ is the case. But the truth values of such statements are not accessible to human sensation, and therefore remain hidden also from human science. This is a difficult fact for intelligent animals like us to deal with, but we have learned to do so, partly by inventing a set of procedures called science. Science acknowledges that the truth of a proposition can not be known with certainty, and so it sets out instead to determine the probability of truth. For this purpose, it combines empirical information and logic.

For Ed Jaynes, therefore, Shannon’s new information theory was instantly recognizable as a breakthrough of massive importance. Jaynes thought about this new tool, meditated on it, digested it, and played with it intensely. One of the outcomes of this meditation was a beautiful idea known as maximum entropy. The title of this blog, then, is a tribute to Edwin Jaynes, to this beautiful idea of his, and to the many more exceptional ideas he produced.

As a physicist, I never received much education in statistics and probability – we know the sum and product rules, we know how to write down the formulae for the Poisson and normal distributions and how to calculate a mean and a standard deviation, and that’s about it really. Oh and some typically badly understood model fitting by maximum likelihood (we call it ‘method of least squares’, which if you know stats, tells you how limited our understanding is).

During my PhD studies in semiconductor physics, I became very dissatisfied with this situation, as it gradually dawned on me that scientific method and statistical inference must rightly be considered as synonymous: they are both the rational procedure for estimating what is likely to be true, given our necessarily limited information. I set out to teach myself as much as I could about statistics. Not surprisingly, my first investigations led me to what is often referred to as orthodox methodology. I laboured with the traditional hypothesis tests – t-tests and so forth – but I found the whole framework very unpalatable: confused, disjointed, self-contradicting – just ugly. Then I stumbled on Bayes’ theorem, and my world view was elevated to a higher plane. Some time after that I discovered Ed Jaynes’ book, ‘Probability Theory: The Logic of Science,’ and my horizon was expanded again, by another order of magnitude. Problems that I had thought to be only approachable by the orthodox methods became recognizable as simple extensions of Bayes’ theorem, and any nagging doubts I had about the validity of the Bayesian program were banished by Jaynes’ clearly formulated logic.

It is not that I am totally against orthodox (sometimes called frequentist) methods. But the success of frequentist techniques is limited to the range of circumstances in which they do a reasonable job of approximating Bayes’ theorem. The range of applications, however, in which the two approaches diverge is unfortunately quite large, while orthodox theory seems to have nothing fundamental to say about when to expect such divergence.

Bayes’ theorem works by taking a prior probability distribution and combining it with some data to produce an updated distribution, known as the posterior probability. After the next set of data comes in, the posterior probability is treated as the new prior, and another update is performed. The process goes on as long as we wish, with presumably the posterior probability distributions narrowing down ever closer upon a particular hypothesis.
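As a toy illustration of that updating cycle (a hypothetical sketch, reusing the discretized-frequency idea from the Weety Peebles post above):

    import numpy as np

    # Discretized prior over an unknown frequency f, updated one observation at a time.
    df = 0.005
    f = np.arange(df / 2, 1, df)
    prior = np.full_like(f, df)                  # uniform prior; probabilities sum to ~1

    for outcome in [True, False, False, True]:   # a made-up stream of observations
        likelihood = f if outcome else 1 - f
        posterior = prior * likelihood
        posterior /= posterior.sum()             # normalize: the denominator of Bayes' theorem
        prior = posterior                        # the posterior becomes the next prior

    print(np.sum(f * prior))                     # current best estimate of the frequency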

One of the problems we might anticipate with this procedure, however, is where does the process start? What do we use as our original prior? The principle of indifference works in many cases. Indifference works like this: if I am told that a 6 sided die is to be thrown, with no additional information about the die or the method of throwing, then symmetry considerations require that the probability for any of the sides to end up facing upwards is 1/6. For some more complex situations, however, indifference fails. One of the things that the principle of maximum entropy achieves is to provide a technique for assigning priors in a huge range of new problems, unassailable using the principle of indifference.

As Shannon discovered, information can be considered as the flip side of entropy, a thermodynamic idea representing disorder – the more information, the less entropy. Why then should science be interested in maximizing entropy? What we are looking for is the probability distribution that incorporates whatever information we have, without inadvertently incorporating any assumed information that we do not have. We need that probability distribution with the maximum amount of entropy possible, given the constraints set by our available information. Maximum entropy, therefore, is a tool for specifying exactly how much information we possess on a given matter, which is evidently one of the highest possible goals of honest, rational science. This is why I feel that ‘maximum entropy’ is an appropriate title for this blog about scientific method.
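To give a flavour of how this works in practice: with no information beyond the six faces of a die, maximum entropy returns the uniform distribution, in agreement with indifference; add the single piece of information that the mean is, say, 4.5, and it picks out the unique distribution that encodes that constraint and nothing more. Here is a minimal sketch of the constrained case using scipy (the mean value 4.5 is just an assumed constraint for illustration):

    import numpy as np
    from scipy.optimize import minimize

    faces = np.arange(1, 7)
    target_mean = 4.5                       # the only information we are claiming to have

    def neg_entropy(p):
        return np.sum(p * np.log(p))        # minimizing this maximizes the entropy

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1},                      # normalization
        {"type": "eq", "fun": lambda p: np.dot(p, faces) - target_mean},   # mean constraint
    ]
    start = np.full(6, 1 / 6)               # begin at the uniform distribution
    result = minimize(neg_entropy, start, bounds=[(1e-9, 1)] * 6, constraints=constraints)
    print(result.x.round(3))                # probabilities shifted towards the higher faces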