## Friday, March 30, 2012

### How to make ad-hominem arguments

My previous post on the base rate fallacy dealt with an example of binary hypothesis testing - hypothesis testing where there are exactly two competing hypotheses, such as either a person has a disease or they do not.

Traditional frequentist hypothesis testing is often considered (slightly erroneously) to be limited to binary hypothesis testing, the two hypotheses being the null and the alternative. In fact, the alternative hypothesis is never formulated and takes no part in any calculation, so frequentist hypothesis testing typically tests only a single hypothesis. If this seems bizarre to you, then welcome to the club.

In the Bayesian framework, however, binary hypothesis testing is just a special case. In general there can be any number of hypotheses. For example, the murder was committed by Mr. A, or Mr. B, or Mr. C,.....

In this post I will demonstrate an example of testing an infinite number of hypotheses. In doing so, I will investigate the logical position of ad-hominem arguments - arguments based on who is making a particular point, rather than arguments that address that point. We are often told that such arguments are illogical and should take no place in rational discussion and decision making. A Bayesian, however, acknowledges that it is illogical to ignore relevant prior information when trying to reach a decision about something - anything, including a person. We'll get to the nitty gritty in a moment.

First though, to get warmed up, an amusing example concerning a book, ‘Logic and Its Limits,’ by P. Shaw. It is generally a good book, introducing many forms of logical error that can occur, particularly in verbal arguments and in the written word, rather than emphasizing the algebraic manipulation of formal logic. What I describe here should not be taken as a negative review of this book; it's really just a little glitch - perhaps more like a typo than anything else. But the author of the book is described in its front matter as ‘senior lecturer in philosophy at the University of Glasgow,’ and so this mistake from a section on ad hominem arguments is deliciously ironic. One of the exercises posed to the reader in this chapter is to analyze the following statement:

"As Descartes said, appeals to authority are never really reliable."

At the back of the book, in the answers-to-exercises section, the author tells us that the statement is:

"self-refuting"

It seems to me that the only way one might think this is self-refuting is if one assumes that Descartes is himself an authority. Suppose then that the statement is false, as claimed. Then an authority (Descartes) has told us something unreliable. Since we would have proof that authorities do make mistakes, we would have to accept that authority can never be relied upon, as there can be no means to know when an authority is or is not reliable. Thus, by assuming the statement false, we are forced to accept that the statement is true. Reductio ad absurdum.

How should we treat the authority of Mr. Shaw on the subject of logic?

(To clarify: the reason we can assert that Descartes’ statement is correct is that we have performed the appropriate logical analysis ourselves, and so our statement is not that ‘it is true because Descartes said it,’ - which would be an appeal to authority - but that ‘since Descartes, an expert, said it, the falsity of this particular statement would be a logical contradiction.’ By proving that appeals to authority are never reliable, we also demonstrate that the statement beginning “As Descartes said” is not an appeal to authority.)

Now, on with the main program. Consider that it is possible that the details of an argument may be difficult to access or analyze, but we still would like to assess the probability that the argument is sound. Here is a simple example: a man (let's call him Weety Peebles) has a penchant and a talent for getting his provocative ideas published in the press. In the past you have observed ten occasions when his wild ideas have made the headlines. On each occasion, you have analyzed his claims, and on all occasions but one, found them to be totally devoid of any truth, and to have no logical merit whatsoever. On an eleventh occasion you see in your newspaper a headline '[something interesting if true], declares Weety Peebles.' You don't really want to bother reading it, however, because of your experience. What is the probability that the story is worth reading?

Suppose that when Mr. Peebles goes public with one of his stories he tells the truth with some relative frequency - unknown, but fixed by the many variables that define his nature. In other words, over long stretches of time the fraction of his stories that are true remains about the same. Our first job is to determine the probability distribution that describes the information we have about this frequency. The possibilities for the desired frequency, f, include all numbers between 0 and 1 - an uncountably infinite number of possible frequencies. For each of these values, f, let Hf be the hypothesis that the true frequency is f. By analogy with the principle of indifference for discrete sample spaces, we'll start with a prior that is uniform over the range 0 to 1. This is how we encode the fact that we start with no reason to favor one frequency more than any other. Now all we need to do is use Bayes' theorem to update the prior distribution using the data we have relating to those ten occasions of examining Mr. Peebles' output.

For a set of hypotheses numbered 1 to n, the general form of Bayes' theorem for the first of those hypotheses, H1, is

 P(H1 | DI) = P(H1 | I) × P(D | H1I) / P(D | I)    (1)

Where all n propositions are exhaustive and mutually exclusive, the denominator in this equation can be resolved as shown:

 P(H1 | DI) = P(H1 | I) × P(D | H1I) / P(DH1 + DH2 + .... + DHn | I)    (2)

(The product, DH,  means that both D and H are true, while any sum, A+B, means that A or B is true.) Next, applying the product and extended sum rules (see appendix below), this becomes:

 P(H1 | DI) = P(H1 | I) × P(D | H1I) / Σi=1..n [P(D | HiI) × P(Hi | I)]    (3)

Finally, if the hypotheses are drawn from a continuous sample space, then the sum in the denominator simply becomes an integral.
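As a minimal sketch of equation (3) in action, here is a discrete update over three mutually exclusive, exhaustive hypotheses (the priors and likelihoods are invented purely for illustration - think of three murder suspects):

```python
# Discrete Bayes update over three mutually exclusive, exhaustive
# hypotheses. All numbers are made up for illustration.
priors = [0.5, 0.3, 0.2]        # P(Hi | I)
likelihoods = [0.1, 0.4, 0.9]   # P(D | HiI)

# Denominator of equation (3): sum of likelihood x prior over all hypotheses.
evidence = sum(l * p for l, p in zip(likelihoods, priors))

# Posterior for each hypothesis, per equation (3).
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
print(posteriors)   # the posteriors sum to 1
```

Note that the same normalizing denominator serves every hypothesis, which is why the posteriors automatically sum to one.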

To perform this update we can divide the range of possible frequencies into, say, 200 intervals, each of width 0.005, and let these approximate our continuous sample space. Then, for each of these intervals, we need to calculate P(Hf | DI).

This means calculating P(D | HiI) for each frequency, which is just fi if it was an occasion where Weety spoke the truth, or (1 - fi) if it was not. We can perform 10 such updates, one for each of the occasions in our experience. Alternatively, we can perform a single update for the 10 occasions in one go, using a result obtained by Thomas Bayes, some time in the 18th century:
 P(f | DI) = [(N + 1)! / (n! (N - n)!)] × f^n (1 - f)^(N - n)    (4)

Here, N represents the number of occasions we have experience of and n represents the number of times Mr. Peebles was not talking rubbish. (Note that this formula applies only when we start with a uniform prior, and only when f is constant.)
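The original calculation was done in a spreadsheet; here is an equivalent sketch in Python, discretizing equation (4) over the same 200 intervals with N = 10 and n = 1:

```python
from math import factorial

# Posterior over the truth-telling frequency f, per equation (4),
# discretized over 200 intervals of width 0.005.
N, n = 10, 1                  # ten stories observed, one with merit
bins = 200
df = 1.0 / bins               # width of each interval
fs = [(i + 0.5) * df for i in range(bins)]   # interval midpoints

# The (N + 1)! / (n! (N - n)!) coefficient from equation (4).
coef = factorial(N + 1) / (factorial(n) * factorial(N - n))

# Probability mass in each interval: density x interval width.
mass = [coef * f**n * (1 - f)**(N - n) * df for f in fs]

print(sum(mass))                                     # close to 1
print(sum(m for f, m in zip(fs, mass) if f < 0.5))   # mass below f = 0.5
```

The second printed number confirms the observation below: nearly all of the probability mass sits between 0 and 0.5.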

I performed this calculation very quickly using a spreadsheet (because I'm integrating over little rectangles, I multiply each P by Δf, the width of each rectangle), obtaining the following probability distribution for f, given 10 past experiences, including only one occasion where there was some merit to Weety's story:

Recall the interpretation of the numbers: a frequency of 0 means he never tells the truth, while a frequency of 1 means he always tells the truth. We see that nearly all the 'mass' is located between 0 and 0.5 - as we might have anticipated.

To complete the exercise, it only remains to determine the probability that on our current, eleventh occasion, the news item in question is worth reading - in other words, the probability that Weety Peebles is not lying again. We still don't know the actual frequency with which Peebles is truthful, so we have to integrate over the whole sample space. We divided the continuous sample space into 200 narrow intervals, each of which we approximate as a discrete frequency. For each of these frequencies, we take the product of that frequency, f, with its probability, P(f); adding them all up gives the total probability that the current news story is true. The result of this summation is 0.167 - almost a 17% chance that the story is true. (Note that if I had taken f to be simply 1/10, the exact fraction corresponding to our experience, we would have wrongly estimated only a 10% chance of a story worth reading, which would have done Peebles a slight disservice.)
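This final summation can be sketched in Python too, reusing the discretized posterior from equation (4):

```python
from math import factorial

# Expected probability that story number 11 is true: sum over all
# frequencies of f times its posterior probability mass, P(f).
N, n = 10, 1
bins = 200
df = 1.0 / bins
fs = [(i + 0.5) * df for i in range(bins)]
coef = factorial(N + 1) / (factorial(n) * factorial(N - n))
mass = [coef * f**n * (1 - f)**(N - n) * df for f in fs]

p_true = sum(f * m for f, m in zip(fs, mass))   # Σ f × P(f)
print(round(p_true, 3))   # 0.167, matching (n + 1) / (N + 2) = 2/12
```

The closed-form answer (n + 1) / (N + 2) is Laplace's rule of succession, which is exactly what this integral evaluates to under a uniform prior - hence 2/12 ≈ 0.167 rather than the naive 1/10.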

Without me even specifying anything about the content of Weety Peebles' story, you are now in a position to say: 'Mr. Peebles, you are very probably talking out of your arse.'

And that is an ad hom with some seriously refined logic behind it.

In general, we know from our experience that we can learn from our experience. Our common sense, therefore, should have warned us that ad-hominem arguments can be rational. In hindsight, we probably recognize that we use them all the time. What we see here is that Bayes' theorem provides a formal way of quantifying that common sense.

#### Appendix: the basic rules

The product rule:            P(AB | C) = P(A | BC) × P(B | C)

The sum rule:                P(A | B) + P(Ā | B) = 1

The extended sum rule:       P(A+B | C) = P(A | C) + P(B | C) - P(AB | C)
(derived from the two above)
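A quick numerical check of these rules on a toy joint distribution over two binary propositions, A and B (the four joint probabilities are arbitrary invented numbers that sum to 1):

```python
# Arbitrary toy joint distribution over two binary propositions.
p_ab      = 0.2   # P(A and B)
p_a_notb  = 0.3   # P(A and not-B)
p_nota_b  = 0.4   # P(not-A and B)
p_nota_nb = 0.1   # P(not-A and not-B)

p_a = p_ab + p_a_notb        # marginal P(A)
p_b = p_ab + p_nota_b        # marginal P(B)
p_not_a = p_nota_b + p_nota_nb

# Product rule: P(AB) = P(A | B) × P(B)
assert abs(p_ab - (p_ab / p_b) * p_b) < 1e-12
# Sum rule: P(A) + P(Ā) = 1
assert abs(p_a + p_not_a - 1.0) < 1e-12
# Extended sum rule: P(A+B) = P(A) + P(B) - P(AB)
p_a_or_b = 1.0 - p_nota_nb
assert abs(p_a_or_b - (p_a + p_b - p_ab)) < 1e-12
print("all three rules check out")
```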

#### 1 comment:

1. Many thanks to HCR for linking and generating some traffic. And thanks also to those that left feedback over there (you can also comment here too).