Maximum Entropy

Disruptive Writing Style

2017-09-02T00:36:00.001-05:00

In the pursuit of science, under whose umbrella I consider all intellectually rigorous activity to fall, the formulation and communication of ideas are critical. Here, I'll outline aspects of my own attitude to scientific communication.

In an earlier post on jargon, I advocated against over-reliance on familiar terminology, as this can often give a false impression of understanding. I recommended occasionally throwing out unusual pieces of vocabulary, in the hope of ensuring that one's audience is engaged in the concepts, and not just semi-consciously following the signposts. This technique is a significant part of a communication strategy we might call 'disruptive writing style.'

When I say disruptive, I'm not talking about the content of an essay, article, speech, or whatever. I don't mean that the matter under discussion is disruptive, the way the birth of digital electronics represented a disruptive technology. Instead, I'm talking about the vehicle by which one conveys one's ideas to a wider appreciation. I'm talking about a style that occasionally strives to prevent the smooth progress of the reader (or listener) from beginning to end of your piece, in order to ensure that your thesis is being taken in.

Now, there are obviously good ways and bad ways of being disruptive, and in proposing a disruptive style, it is important to try to identify what it is that makes a disruption constructive. Clearly, anything that makes a piece of writing harder to read ought to be carefully justified. When colleagues send me drafts of articles for comment, I often advise them to remove small elements that seem to interrupt the smooth flow of the text, so what is the difference between these elements and what I consider to be part of a good disruptive style?

In general, we can say that any effective disruptive element should add something, in terms of focusing the mind of the reader, or bringing a new perspective, and hence better understanding. There are a number of devices that can achieve such goals, examples of which I'll give shortly.

By corollary, anything that stops the reader in his or her tracks without satisfying either of these criteria should probably be avoided. Such things are likely to achieve nothing more than to serve as distractions. An example of such a distraction might be a word used in genuinely the wrong place, such that its meaning is distinct from what was intended. In such cases, the best that can happen is that the reader recognises the mistake and is quickly able to substitute the intended meaning and continue reading. Possibly, such a reader is left with a slightly diminished opinion of the writer's level of concentration while they were writing. In a worse case, the reader may be unable to unambiguously reconstruct what the writer had in mind, which might dilute the message or, worse still, leave the reader feeling generally confused.

Other examples of badly flowing text might include poor grammar that fails to pin down unambiguously the meaning of the text, or prevents the reader from effortlessly interpreting what is written. In this regard, it is worth noting that a piece of writing can sometimes be made to flow better by deliberately violating the accepted rules of grammar. Remember that the ultimate goal is not adherence to the rules, but readability and ease of understanding.

As a final example, one thing that disturbs me, as a reader, is redundancy. This is presumably because I often have too much respect for the writer, taking seriously the proposition that each word in the text is there for a reason. As a writer, however, my ideal reader will have some of this respect for me, and I like my writing style to reflect this. When I encounter redundancy, I often get the feeling that there is something that I'm not understanding. For instance, a very popular phrase in scientific discussion is 'dynamic range.' When I traverse a phrase like this, I'm left wondering why the word 'dynamic' is there. What was the writer trying to get across to me, that the word 'range' on its own wasn't enough for? What is the additional missing element? Why am I apparently too stupid to be able to see what it means? In truth, there is no intended meaning behind that little 'dynamic' part. The phrase 'dynamic range' has exactly the same meaning as the word 'range' on its own. But in the process of figuring this out, I've forgotten much of what the writer had said beforehand, and will be left with an uneasy feeling as I proceed through the rest.

So let's look at some of the positive elements of the disruptive writing style. Firstly, there is that method I wrote about before, which is to avoid using the same familiar words over and over again. In this case (if the technique is not overused), the available profit may well be an order-preserving function (monotonically rising) of the unusualness of the applied vocabulary. Here is my reasoning for this: if two words are synonymous, but are not commonly used in the same context, one cannot simply invoke a familiar rule that they mean the same thing. One must, instead, recognise that the underlying concept under discussion is adequately described by the two terms in question. This will require a degree of mindfulness concerning the properties of the concept. Thus, the reader is less likely to make it all the way through your treatise with a mere illusion of understanding, brought about through familiarity, but instead, has been forced to engage seriously with the subject matter.

This technique has, in my opinion, a valuable self-regulating mechanism, in that the magnitude of the disruptive effect can be expected to grow somewhat in proportion to the extent to which it is needed. The reader who already has a clear understanding of the concepts you are referring to, who instinctively looks behind the veil of words at the core substance of their meaning, is unlikely to have difficulty placing the significance of an unfamiliar metaphor, and will pass through your terminological minefield undisturbed. They may even derive a little unexpected pleasure from the small intellectual kick of being reminded of an obscure connection that they were not directly thinking about at the time.

On the other hand, readers who plan to utilise their conditioned reflex that certain words belong in a discussion of a certain topic to breeze through your piece without much engagement, will probably experience the greatest disruptive impact from this device. Similarly, the reader who hasn't taken the time to absorb the basic consequences of object X having properties Y and Z, will have more hardship recognising that such-and-such unfamiliar description refers to so-and-so concept. It is in these cases, however, when it would seem to be most necessary to utilise some rhetorical technique to shout to the reader, 'Hey, pay attention, this is important!' The reader who has to work hardest to make the connection between some familiar piece of jargon and your chosen synonym is exactly the person to whom you most need to say, 'OK, hold up a minute, this is an article about such-and-such, pause for a moment and think about what that means.'

Another way to explain why I like to mix up unusual and unexpected terminology is that I feel it helps to steer people (including me, the writer) away from one of the most common forms of intellectual error: the mind-projection fallacy. This is the treacherous tendency to assume that aspects of our models of reality must necessarily be part of reality. By frequently changing the language used to describe key concepts, one is implicitly pointing out that it doesn't matter what we call things, and what we think of them. What matters is what they really are.

Other techniques for arresting your consumers' too-rapid progress through your written product exist. Often, these techniques will combine a double effect of (1) halting, and thereby focussing the reader's attention upon a key point, creating greater investment in comprehension, while (2) by virtue of the novelty of the delivery method, rendering this key point more memorable.

One possibility to achieve both of these results is through a brief injection of humour in an otherwise serious exposition. A joke is often most effective when one wishes to be taken most seriously. (Conversely, in a work of end-to-end slapstick, a short moment of seriousness can be devastatingly effective.)

A highly flamboyant metaphor can achieve both of these focussing functions, while also capitalising on many of the advantages of the unusual-synonym technique, described above. A little flourish of verbal petals, dripping with imaginative nectar, could be just the thing to fertilize your readers' thoughts with the pollen of your intellectual endeavour.

If looking for other options, note that an unexpected element of foul language might be just the thing to stir up some much-needed shit. This can go a long way towards increasing the half-life of your critical notion in the minds of others. If done well, this can be like running a highlighter pen across your text, and saying to the reader, 'even if you remember nothing else, this bit is important.'

Sometimes, pseudo-directly addressing the reader (I'm looking at you, in particular) is an interesting way of bridging the gap between your remote, abstract ideas, and the real world inhabited by your audience. This technique was used to excellent effect in the 2015 movie, 'The Big Short,' about the 2007-2008 financial crisis. By dint of various actors occasionally facing the camera and talking directly to the viewers, this movie does a wonderful job of repeatedly reminding us that 'yes folks, while it sounds fantastic, this insane stuff really did happen!'

In summary, basically anything that constitutes an unexpected departure from your usual writing style can be an effective way to grab some enhanced attention. An unusually short sentence or paragraph, for example, can be a good way to emphasize an important point.

This works.

Sums and Differences of Random Numbers

2017-08-18T22:28:00.001-05:00

Here's a problem that cropped up in the course of some calculations I have been working on recently. In stating the problem, I'll include the physical context, though this context isn't important for the rest of the discussion, which applies generally to certain manipulations on probability distributions:

A high-energy massive particle (e.g. a proton or α-particle), whose initial kinetic energy is governed by a certain probability distribution, passes through a slab of material. As it travels through the material, it scatters some of the electrons inside the slab, dissipating a small fraction of its energy. The amount of its energy that it loses also has some probability distribution (the so-called 'straggling function'). What is the probability distribution over the energy of the particle as it exits the slab of material?

I'm sure all of us wrestle with exactly this question several times each day. The problem concerns the probability distribution over the difference between two independent random variables¹ (in this case the particle's initial energy and the energy it deposits in the slab of material).

The route to solving this problem involves utilizing the solution to a related problem, namely the probability distribution over the sum of two random variables, so first let's look at that.

The sum of two independent random numbers

In the previous post, I stated that the distribution over this sum is the convolution of the two distributions from which the two random variables are sampled. First, I'll give my explanation of why this is so. Later, to mitigate the risk that my reasoning is wrong, I'll show the results of some computational experiments that test this reasoning.

Suppose we have a parameter, x, and two probability distributions, which we model using continuous functions f(x) and g(x), from which the numbers x₁ and x₂ are respectively sampled. Let's call the sum of these two x_sum.

For any given value of x_sum, there are many possibilities for how we got it. We could have a small value for x₁ added to a large value for x₂. Or, for example, we could add any small number, δ, to x₁, and then subtract the same δ from x₂, to get the same outcome for x_sum.

For a given value of x_sum and a given value of x₁, x₂ must be given by (x_sum - x₁), and the probability for that event is given by the product rule, simplified by our assumption of independence²:

$$P(x_1, x_2) = f(x_1)\, g(x_{sum} - x_1)$$

But, as we noted, there are many combinations of x₁ and x₂ that could produce any given x_sum. Thus, the probability for any particular x_sum will be given by the extended sum rule (noting the exclusivity of the different events). We just add up all the probabilities for all the contributing possibilities:

$$P(x_{sum}) = \int_{-\infty}^{\infty} f(x_1)\, g(x_{sum} - x_1)\, dx_1 \qquad (1)$$

The integral given in Equation 1 is the definition of the convolution of the two functions f(x) and g(x), denoted f(x)*g(x).
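For distributions tabulated on a grid, the integral becomes a sum over all the ways of making a given total. As a small sketch (the pair of fair dice is my own illustrative choice, not part of the original calculation), numpy.convolve() evaluates exactly that sum:

```python
import numpy as np

# Probability of each face of a fair die; index 0 corresponds to a value of 1.
die = np.full(6, 1.0 / 6.0)

# Discrete analogue of the convolution integral:
# P(sum = k) = sum over i of f[i] * g[k - i]
p_sum = np.convolve(die, die)

# Both arrays start at a value of 1, so index k of p_sum is a total of k + 2.
totals = np.arange(2, 13)
for total, p in zip(totals, p_sum):
    print(total, round(p, 4))
```

The most probable total is 7, which can be made in 6 of the 36 equally likely ways.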

Quality assurance

In principle, this argument should be enough, but since computers are now powerful enough that running a decent Monte-Carlo simulation costs very little, it is reasonable to test our reasoning using this technique.

The easiest way to perform this Monte-Carlo calculation is to convert the PDFs f(x) and g(x) to CDFs, then use a numerical random number generator, uniform between 0 and 1, to sample x₁ and x₂. At each iteration, we get two random numbers, one for each of f(x) and g(x), then we simply find the value of x at which the appropriate CDF matches the random number. We then build a histogram of the generated sums.
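The sampling step just described can also be written without explicit loops. The following is only a sketch under my own assumptions: the two small arrays standing in for f(x) and g(x) are made up, and sample_from_pdf is a hypothetical helper name. It uses numpy.cumsum() for the CDF and numpy.searchsorted() to find the first grid point at which the CDF reaches each random number:

```python
import numpy as np

def sample_from_pdf(pdf, n_samples, rng):
    """Draw grid indices distributed according to a tabulated PDF."""
    cdf = np.cumsum(pdf, dtype=float)
    cdf /= cdf[-1]                          # the CDF must end at exactly 1
    r = rng.uniform(0.0, 1.0, n_samples)
    # first index at which the CDF is >= each random number
    return np.searchsorted(cdf, r, side="left")

rng = np.random.default_rng(0)
f = np.array([0.0, 1.0, 2.0, 1.0, 0.0])     # made-up PDFs on a common grid
g = np.array([0.0, 3.0, 1.0, 0.0, 0.0])

n_samples = 100_000
x1 = sample_from_pdf(f, n_samples, rng)
x2 = sample_from_pdf(g, n_samples, rng)

# Histogram of the sums, normalized so that it can be compared with the convolution.
hist = np.bincount(x1 + x2, minlength=len(f) + len(g) - 1) / n_samples
exact = np.convolve(f / f.sum(), g / g.sum())
print(np.round(hist, 3))
print(np.round(exact, 3))
```

With this many samples, the two printed rows should agree to within the sampling noise.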

The results of this procedure are plotted below for arbitrary functions, f(x) and g(x), together with the result of direct calculation of the integral in Equation 1 (see appendix below for my simple numerical implementation of this).

Evaluation of Equation 1 (solid red curve, marked 'Direct integration') is consistent, firstly, with common-sense expectation: the result has its maximum at x = 9,000, i.e. the sum of the locations where f(x) and g(x) are maximized (7,000 and 2,000, respectively), and has the expected shape, with the same long right-hand tail as g(x) and the same rounded peak as f(x).

Secondly, the result matches perfectly the Monte-Carlo prediction (source code also shown in the appendix). Here I sampled 10,000 sums, x₁ + x₂, and rescaled the result to the height of the normalized PDF.

Logical analysis and simple experimentation agree. Our result can be considered robust! (And a couple of centuries' worth of mathematicians, who already knew the thing to be correct, can continue to rest in peace.)

Then why does the analytical result matter?

In some special cases, analytical formulas can be derived from Equation 1. Going beyond those special cases, though, we usually need to think about computational speed. The Monte-Carlo technique is certainly cheap now compared to 50 or 100 years ago, when scientists and engineers had to spend much more of their time finding analytical or approximate analytical solutions to all sorts of problems, but thanks to a special property of the convolution operation, there is an even quicker way to numerically calculate convolutions.

This property, termed the convolution theorem, is that f(x)*g(x) is the same as the inverse Fourier transform of the product of the Fourier transforms of f(x) and g(x). Thanks to Fast Fourier Transform (FFT) algorithms, numerical methods for convolution can be executed in very little time.
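The theorem is easy to check numerically. In this sketch (the random test arrays are arbitrary, chosen only for illustration), both inputs are zero-padded to the length of the full convolution, transformed with numpy.fft.rfft(), multiplied, and transformed back:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.random(64)                  # arbitrary test arrays
g = rng.random(48)

n = len(f) + len(g) - 1             # length of the full linear convolution

# Zero-pad both inputs to length n, multiply their transforms, and invert.
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

# The same quantity, evaluated as a direct sum.
direct = np.convolve(f, g)

print(np.allclose(via_fft, direct))
```

The zero-padding matters: without it, the FFT route computes a circular convolution, in which the tail of the result wraps around onto its beginning.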

The convolve() function in the numpy package for Python makes a handy point of comparison, although, strictly speaking, it evaluates the convolution sum directly in compiled code; the FFT-based version lives in the scipy package, as scipy.signal.fftconvolve(). When I repeated the calculation in the above figure, using numpy.convolve() instead of my (admittedly crude) integration procedure, I got pretty much exactly the same result, but it executed about 1,000 times faster. (This performance improvement is due largely to implementation rather than algorithm: numpy calls code that is written in C, C++, and Fortran, and can therefore execute the same logic much faster than it could in native Python.)

The difference between two independent random numbers

Now we just need to find a way to adapt our result to problems of the form stated at the beginning of this article. Modifying our earlier reasoning to consider x_diff = x₁ - x₂, we note that for a given x₂ and x_diff, it must be that x₁ = x_diff + x₂. Consequently, we find that our required probability distribution is

$$P(x_{diff}) = \int_{-\infty}^{\infty} f(x_{diff} + x_2)\, g(x_2)\, dx_2 \qquad (2)$$

The only real difference³ between Equations 1 and 2 is that the minus sign in Equation 1 has become a plus sign in Equation 2.

Now Equation 2 is technically sufficient to solve the problem, but as we have seen, direct integrations like those of Equations 1 and 2 can be pretty slow. Since Equation 2 is so tantalisingly similar to Equation 1, it would be great if there were some mathematical trick to make Equation 2 look more like Equation 1. This would allow us to calculate P(x_diff) using our super-speedy numerical convolution procedure. It turns out there is such a trick.

Instead of subtracting a random number drawn from g(x), what we can do is add a number drawn from g(-x), which is g(x) reflected through the y-axis. This is the function we'd like to convolve with f(x).

Unfortunately, convolution algorithms, such as numpy.convolve(), don't know anything about the x-axis - they only receive a set of y-values, so as far as the algorithm is concerned, x = 0 corresponds to the number at the beginning of the list, and so on. There is no direct way for us to tell numpy.convolve() that these y-values correspond to negative x-values.

All is not lost, however. If we simply reverse the list of numbers specifying the values of g(x), and convolve with that, it will be like we are calculating the distribution over the sum of numbers drawn from f(x) and g(-x), except that the x₂ we'll be adding will be too large, by one less than the length of the array defining g(x) (reversal moves the entry for x = 0 from the start of the list to its end). Thus to recover the correct position of the convolved distribution, we must chop off, from the beginning of the list of numbers specifying the result, a section one element shorter than the supplied g(x). My code to execute this is shown in the appendix.
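The bookkeeping is worth checking on a case small enough to follow by hand. In this sketch (the two tiny distributions are invented for the purpose), x₁ is 3 or 4 with equal probability and x₂ is always 1, so the difference must be 2 or 3:

```python
import numpy as np

f = np.array([0.0, 0.0, 0.0, 0.5, 0.5])    # x1 is 3 or 4, equally likely
g = np.array([0.0, 1.0, 0.0, 0.0, 0.0])    # x2 is always 1

# Convolving with the reversed g gives the distribution over x1 - x2,
# shifted to the right by len(g) - 1 grid points.
full = np.convolve(f, g[::-1])
offset = len(g) - 1

# Difference represented by each index of the full result.
diffs = np.arange(len(full)) - offset
for d, p in zip(diffs, full):
    print(d, p)
```

Entries to the left of the offset correspond to negative differences; they are discarded when the beginning of the list is chopped off, which is harmless only if they carry negligible probability.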

Testing the procedure is now super easy. We already have the Monte-Carlo procedure that we used above, and we know that it works. Converting it to our current purposes requires only changing one '+' to a '-'. The verification looks like this:

The overlap between the random experiment and the nifty convolution procedure is good enough to conclude that my reasoning was again sound. For your interest, my Monte-Carlo took almost 4 times longer to run, but gives a spectrum with only 50 bins, to cover a range covered by about 4,000 bins with the convolution method. Thus, I'd need to run the Monte-Carlo about 80 times longer again, to achieve the same resolution, while not compromising on the signal-to-noise ratio.

Notes

[1]	In fact, this is an approximation. The straggling function is really a function of the particle's initial energy, and our two random variables are not strictly independent. If the distribution over the initial energy is narrow enough, however, it'll be safe to assume that any changes to the straggling function over the range of energies will be too small to significantly affect the outcome of the calculation.
[2]	Events A and B are conditionally independent if and only if P(A|B,I) = P(A|I) and P(B|A,I) = P(B|I).
[3]	The convolution operation has the property of commutativity, such that f(x)*g(x) is identical to g(x)*f(x).

Appendix: Python source code

Here is the function I used to calculate the distribution over the sum of 2 numbers, by direct calculation of the integral in Equation 1 (this method was about 1,000 times slower than the numpy.convolve() function):

import numpy

def sum_by_integration(f, g):

    # f and g are numpy arrays giving the magnitude of the functions at each location.
    # Note that f and g are defined on the same grid

    n = len(f) + len(g)
    s = numpy.zeros(n)
    steps = len(f)

    # ctr is the location of the sum; step runs over every location of f
    # that can contribute to it
    for ctr in range(n):
        for step in range(steps):
            if ((ctr - step) >= 0) and ((ctr - step) < len(g)):
                s[ctr] += f[step] * g[ctr - step]

    return s

Here's my function to calculate the CDF, for the Monte-Carlo method. Numpy has its own function to do this, numpy.cumsum(), but I'll give you this recipe, which can be translated into other languages:

def CDF(pdf):

    # pdf is an array giving the magnitude of the probability distribution function at each location.

    cdf = numpy.zeros(numpy.shape(pdf))
    cdf[0] = pdf[0]

    counter = 0
    for p in pdf[1:]:
        counter += 1
        cdf[counter] = cdf[counter - 1] + p

    cdf = cdf / float(cdf[-1])    # (cdf always goes to unity)

    return cdf

And here's the Monte-Carlo implementation, itself:

def sum_by_MC(f, g, N=10000):

    cdf1 = CDF(f)
    cdf2 = CDF(g)

    sums = []

    counter = 0
    while counter < N:
        counter += 1
        r1, r2 = numpy.random.uniform(0, 1, 2)

        # use nonzero() to find index where CDF first exceeds random number
        x1 = numpy.nonzero(cdf1 >= r1)[0][0]
        x2 = numpy.nonzero(cdf2 >= r2)[0][0]
        sums.append(x1 + x2)

    hist, edges = numpy.histogram(sums, bins=50)

    return hist, edges

My code to calculate the distribution of the difference between two random variables is given next:

def difference_by_FFT(f, g):

    # calculate distribution over f - g

    g = g[::-1]                                        # reverse g(x) to look like a shifted g(-x)
    difference = numpy.convolve(f, g, mode='full')     # numpy's fast convolution routine
    difference = difference[len(g) - 1:]               # chop off the beginning, so index 0 is a difference of zero
    difference = difference / numpy.sum(difference)    # normalize

    return difference

Standard Error

2017-08-03T21:09:00.001-05:00

In the quantification of uncertainty, there is an important distinction that's often overlooked. This is the distinction between the dispersion of a distribution, and the dispersion of the mean of the distribution.

By 'dispersion of a distribution,' I mean how poorly the mass of that probability distribution is localized in hypothesis space. If half the employees in Company A are aged between 30 and 40, and half the employees in Company B are aged between 25 and 50, then (all else equal) the probability distribution over the age of a randomly sampled employee from Company B has a wider dispersion than the corresponding distribution for Company A.

A common measure of dispersion is the standard deviation, which is the root-mean-square distance between the parts of the distribution and the mean of that distribution.

Often in science (and in unconscious thought), we will attempt to estimate the mean of some randomly variable parameter (e.g. the age of a company employee) by sampling a subset of the entities to which the parameter applies. This sampling is often done because it is too costly or impossible to measure all such entities. Consequently, it is typically unlikely that an estimate of a parameter's mean (from measurement of a sample) will be exactly the same as the actual mean of the entire population. Thus, out of respect for the limitations of our data, we will (when we are being honest) estimate not only the desired mean, but also the uncertainty associated with that estimate. In doing so, we recognize that the mean we are interested in has a probability distribution over the relevant parameter space.

This is why we have error bars - to quantify the dispersion of the mean of the distribution we are measuring.

Often, however, when estimating the desired error bar (particularly when operating in less formal contexts), it is tempting to fall into the trap of assuming that the distribution over the mean of a variable is the same as the distribution over the variable itself.

These distributions are not the same, and consequently, their dispersions and standard deviations will be different. To see this, consider the following example:

We imagine two experiments consisting of drawing samples from a normal distribution with mean of 0 and standard deviation of 1. In each experiment, the distribution from which we draw is the same, but we draw different numbers of samples. Each experiment is illustrated in the figure below. On the left is the case where 100 simulated samples were produced (using the numpy.random.normal() function, in Python), and on the right, the case where 100,000 samples were generated. The samples are histogrammed (with the same binning in each case), and the black line shows the same underlying distribution (scaled by the number of samples).

Because, on the left, we have far fewer measurements to go on, the overall result is affected by much greater sampling error (noise), and the shape of the underlying distribution is far less faithfully reproduced. We can appreciate that the mean obtained from this small sample is likely to be further from the true mean of the underlying process than the mean obtained from the larger sample. Consequently, the error bars for the means should be different, depending on how many samples were collected. The means from these experiments are indicated above each histogram, and indeed, the sample mean on the right is much closer to 0, the mean of the generating distribution, than that on the left, as our reasoning predicts.

For fun (and because an experiment with a respectable random number generator often gives vastly more statistical insight than any amount of algebra), let's continue the analysis further, by repeating our experiments many times, and logging the means of the samples generated in each iteration. Histogramming these means, separately for each sample size, will allow us to visualize the dispersions of the means, and to begin to see how it depends on the number of samples obtained:

(Instead of taking 100,000 samples for the high-n case, I used only 1,000. This was to allow the two resulting histograms to be reasonably well appreciated on the same plot. The histograms are drawn partially transparent - the darkest part is just where the 2 histograms overlap.)

One thing is clear from this result: the dispersion of the mean really is different when we take a different number of samples. Certainly, regardless of whether we take 100 samples or 1,000, the standard deviation of the result is clearly much smaller than the standard deviation of the generating process, which, recall, was 1 (look at the numbers on the x-axes of the 2 figures). Thus, when looking for an error bar for our estimated population mean, we will quite likely be horribly under-selling our work, if we choose the standard deviation of our sample values.
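The repeated experiment takes only a few lines to reproduce. This is a sketch rather than the code behind the figure: the number of repetitions and the random seed are my own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2017)    # seed chosen arbitrarily
n_repeats = 2000                     # how many times each experiment is repeated

spread_of_means = {}
for n in (100, 1000):
    # Each row is one experiment: n draws from a normal distribution (mean 0, sd 1).
    samples = rng.normal(loc=0.0, scale=1.0, size=(n_repeats, n))
    means = samples.mean(axis=1)     # one sample mean per experiment
    spread_of_means[n] = means.std()
    print(f"n = {n}: standard deviation of the means = {spread_of_means[n]:.4f}")
```

Both printed values should come out far below 1, the standard deviation of the generating process, with the larger sample size giving the smaller of the two.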

To serve as a reminder of this, statisticians have a special term for the standard deviation of a parameter upon which the form of some other probability distribution depends: they call it the standard error (a fitting name, since it's exactly what they'd like to prevent you from making). The most common standard error used is, not surprisingly, the standard error of the mean (SEM). The standard error is a standard deviation, just like any other, and describes a distribution's dispersion in exactly the usual way. But we use the term to remind us that it is not the uncertainty of the entire distribution we are talking about, but the uncertainty of one of its parameters.

A special case

As I've mentioned before, the normal distribution is very important in statistics for both an ontological and an epistemological reason. The ontological reason is from the physics of complicated systems - the more moving parts in your system, the closer its behaviour matches a normal distribution (the central limit theorem). The epistemological reason is that under many conditions, the probability distribution that makes full use of our available information, without including any unjustified assumptions, is the normal distribution (the principle of maximum entropy).

Fortunately, the normal distribution has several convenient symmetry properties that make its mathematical analysis less arduous than it might be. So, it'll be both informative and convenient to investigate the standard error of the mean of a sample generated from an independent Gaussian (normally distributed) process, with standard deviation, σ. This standard error will be very straightforward to calculate, and will provide an easy means of characterizing uncertainty (providing an error bar).

Firstly, note that if we add together two independent random variables, the result will be another random variable, whose distribution is the convolution of the distributions for the two variables added.

One of those handy symmetry properties now: the convolution of 2 Gaussians is another Gaussian, with standard deviation given by Pythagoras' theorem:

$$\sigma_{1+2} = \sqrt{\sigma_1^2 + \sigma_2^2}$$

Iterating this fact as many times as we need to, we see that the sum of n independent random numbers, each drawn from a Gaussian process with the same standard deviation, σ, has the following standard deviation

$$\sigma_{tot} = \sqrt{n}\,\sigma \qquad (1)$$

This is the result for the sum of the samples. To go from here to the standard deviation for the mean of our samples, we start with a little proof of a more general result for the standard deviation of some parameter, x, multiplied by a constant, a, σ_ax.

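Since

$$\sigma_x = \sqrt{\langle (x - \bar{x})^2 \rangle}$$

where x̄ is the mean of x (and the angle brackets denote an average over the distribution), simple substitution yields

$$\sigma_{ax} = \sqrt{\langle (ax - \overline{ax})^2 \rangle}$$

Now, the mean of ax is

$$\overline{ax} = \langle ax \rangle = a \langle x \rangle = a\bar{x}$$

which is just a times the mean of x, and so

$$\sigma_{ax} = \sqrt{\langle a^2 (x - \bar{x})^2 \rangle}$$

or simply, when a is a constant,

$$\sigma_{ax} = |a|\,\sigma_x$$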

Thus, to go from σ_tot, the standard deviation of the sum, S, to the standard deviation of the mean, which is S/n, we just divide σ_tot by n. Applying this to equation (1), we find:

$$\sigma_{\bar{x}} = \frac{\sqrt{n}\,\sigma}{n}$$

Simplifying this, we arrive finally at an expression for the standard error of the mean of n samples of a normally distributed variable:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad (2)$$

And that's it, a cheap and straightforward way to calculate an error bar, applicable to many circumstances.

Note that when x is normally distributed, an error bar given as ± σ_x̄ corresponds to a 68.3% confidence interval.
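The 68.3% figure can be checked by simulation. In this sketch I take the standard error to be σ/√n, and count how often the sample mean lands within one standard error of the true mean. The sample size, number of repetitions, and seed are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(5)
true_mean, sigma = 0.0, 1.0
n, n_repeats = 100, 5000

sem = sigma / np.sqrt(n)             # standard error of the mean for n samples

# One sample mean per repeated experiment.
means = rng.normal(true_mean, sigma, size=(n_repeats, n)).mean(axis=1)

# Fraction of experiments whose mean lies within one standard error of the truth.
coverage = np.mean(np.abs(means - true_mean) <= sem)
print(f"fraction within one standard error: {coverage:.3f}")
```

The printed fraction should sit close to 0.683, up to the noise expected from a finite number of repetitions.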

A Little Caveat

When the dispersion of the generating process is not initially known, we have to estimate σ from the data. If, however, the number of samples, n, is fairly small (say, less than 50 or so), the standard formula doesn't work out perfectly. This is because the appropriate prior distribution over σ is not the flat distribution that we tacitly assume when we use

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Instead, we should use the sample standard deviation, s, when calculating the standard error of the mean:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \sigma_{\bar{x}} = \frac{s}{\sqrt{n}}$$

In other words, the corrected SEM, accounting for bias in the estimation of σ due to limited sample size, n, will differ from the uncorrected SEM by a simple factor:

$$\sqrt{\frac{n}{n-1}}$$

(Thus, at n = 50, our standard error will be a tiny bit over 1% too small, if we use the uncorrected version of the standard deviation.)
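In numpy, the two versions of the standard deviation differ only in the ddof argument of std(). In the sketch below, the helper name sem and the simulated data are mine, for illustration only. It compares the two standard errors for a sample of 50 and confirms that their ratio is the factor √(n/(n-1)):

```python
import numpy as np

def sem(samples, corrected=True):
    """Standard error of the mean, estimated from the data alone."""
    samples = np.asarray(samples, dtype=float)
    ddof = 1 if corrected else 0     # ddof=1 divides by n - 1 instead of n
    return samples.std(ddof=ddof) / np.sqrt(samples.size)

rng = np.random.default_rng(3)
n = 50
data = rng.normal(0.0, 1.0, n)       # simulated measurements

ratio = sem(data, corrected=True) / sem(data, corrected=False)
factor = np.sqrt(n / (n - 1))
print(ratio, factor)
```

At n = 50 the factor is about 1.0102, which is the 'tiny bit over 1%' quoted above.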

Multi-level modeling

2015-10-31T02:18:00.001-05:00

In a post last year, I went through some inference problems concerning a hypothetical medical test. For example, using the known rate of occurrence of some disease, and the known characteristics of a diagnostic test (false-positive and false-negative rates), we were able to obtain the probability that a subject has the disease, based on the test result.

In this post, I'll demonstrate some hierarchical modeling, in a similar context of medical diagnosis. Suppose we know the characteristics of the diagnostic test, but not the frequency of occurrence of the disease. Can we figure this out from a set of test results?

A medical screening test has a false-positive rate of 0.15 and a false-negative rate of 0.1. One thousand randomly sampled subjects were tested, resulting in 213 positive test results. What is the posterior distribution over the background prevalence of the disease in this population?

This is technically quite similar to an example in Allen Downey's book, 'Think Bayes: Bayesian Statistics Made Simple.' Allen was kind enough to credit me with having inspired his Geiger-counter example, though it is really a different kind of problem to the particle-counting example of mine that he was referring to. The current question seems to me more similar to Allen's problem, even though the scenario of medical screening seems quite different.

In Allen's Geiger-counter problem, a particle counter with a certain sensitivity counts n particles in a given time interval, and the challenge is to figure out the (on average) constant emission rate of the radioactive sample that emitted those particles. The solution has to marginalize over (integrate out) a nuisance parameter, which is the actual number of particles emitted during the interval, in order to work back to the activity of the sample. This number of emissions sits between the average emission rate and the number of registered counts, in the chain of causation, but we don't need to know exactly what the number is, hence the term, 'nuisance parameter.'

The current problem is analogous in that we have to similarly work backwards to the rate of occurrence of the disease (similar to the emission rate of the radioactive sample) from the known test results (number of detected counts). The false-negative rate for the medical test plays a similar role to the Geiger-counter efficiency. Both encode the expected fraction of 'real events' that get registered by the instrument. In this case, however, there is an additional nuisance parameter to deal with, because now, we have to cope with false positives, as well as false negatives.

(We could generalize the Geiger-counter problem, to bring it to the same level of complexity, by positing that in each detection interval, the detector will pick up some number of background events - cosmic rays, detector dark counts, etc. - and that this number has a known, fixed long-term average.)

Defining r to be the rate of occurrence of the disease, and p to be the number of positive test results, Bayes' theorem for the present situation looks like this:

P(r | p, I) = P(r | I) P(p | r, I) / P(p | I)        (1)

In the statement of the problem, I didn't specify any prior distribution for the rate of occurrence, r. In keeping with the information content in the problem specification, we'll adopt a uniform distribution over the interval (0, 1). (In the real world, this is typically a very bad thing to do - there are very few things that we know absolutely nothing about - but here it'll give us the advantage of making it easier to check that our analysis makes sense.)

The trick, then, is entirely wrapped up in calculating the likelihood function. We have to evaluate the consequences of each possible value of r. For these purposes, the state of the world consists of a conjunction of 3 variables: number of positive test results, number of those positive tests caused by presence of the disease, and number of people who were tested that had the disease. The first of these is the known result of our measurement. The other two are unknown, but affect how the measurement result was arrived at, so we need to integrate over them.

(At first glance, it might seem that of these 2 nuisance parameters, if I know one, then I know the other. Don't forget, however, that there are two ways to get a positive test result: an accurate diagnosis, and a false positive.)

Because we have 3 variables that determine the relevant properties of the world, the true state of reality (assuming we know it) can be represented as a point in a three-dimensional possibility space. The likelihood function, however, only cares about one of those 3 coordinates, the number of positive test results (our measurement result), so the relevant region of possibility space is a plane, parallel to the other 2 axes. The total probability to have obtained this number of test results is the sum of the probabilities for all points on this plane.

In general, the total probability for some proposition, x, can be written as a sum of probabilities, in terms of some set of mutually exclusive propositions, y₁, y₂, y₃, ....

P(x | I) = P(x, y₁ | I) + P(x, y₂ | I) + P(x, y₃ | I) + ... = Σᵢ P(x, yᵢ | I)        (2)

This follows from the extended sum rule. From the product rule, each term in the sum can be re-written, yielding the marginalization result:

P(x | I) = Σᵢ P(x | yᵢ, I) P(yᵢ | I)        (3)
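To make the marginalization result concrete, here is a tiny numerical sketch. The three probabilities for the yᵢ and the conditional probabilities for x are made-up numbers, chosen only so that the yᵢ are mutually exclusive and exhaustive (their probabilities sum to 1).

```python
import numpy as np

# Hypothetical numbers: three mutually exclusive, exhaustive propositions y_i
p_y = np.array([0.5, 0.3, 0.2])            # P(y_i | I), sums to 1
p_x_given_y = np.array([0.9, 0.4, 0.1])    # P(x | y_i, I)

# Product rule: P(x, y_i | I) = P(x | y_i, I) P(y_i | I)
p_xy = p_x_given_y * p_y

# Sum over the y_i to get the total probability for x
p_x = p_xy.sum()

print(p_x)   # 0.45 + 0.12 + 0.02 = 0.59, up to floating-point rounding
```

The same pattern, applied twice, is all that the double sum in the screening problem amounts to.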

Returning to the medical screening problem, let's use p_t to represent the number of true positives, d to represent the number of people in the cohort who had the disease. The total number of participants, N, has been taken out of the background information, I, to be explicit. Thus, each point in our 3D hypothesis space has an associated probability that looks like this:

P(p, p_t, d | r, N, I)        (4)

Treating p and p_t as a single proposition, we can marginalize over d, using equation 3:

P(p, p_t | r, N, I) = Σ_d P(p, p_t | d, r, N, I) P(d | r, N, I)        (5)

Next we just do the same thing again on the term containing p, p_t:

P(p | r, N, I) = Σ_d Σ_{p_t} P(p | p_t, d, N, I) P(p_t | d, I) P(d | r, N, I)        (6)

Each of the terms on the right-hand side is given by the binomial distribution. The first is the probability to obtain p positive test results, when there are p_t true positives. In other words, it is the probability to get (p - p_t) false positives from the (N - d) people in the group who did not have the disease, when the false-positive rate is r_fp:

P(p | p_t, d, N, I) = C(N - d, p - p_t) r_fp^(p - p_t) (1 - r_fp)^(N - d - p + p_t)        (7)

where C(n, k) = n! / (k! (n - k)!) is the binomial coefficient.

The second term is the probability to get p_t true positives, when the number with the disease is d. This depends on the probability for an afflicted individual to receive a positive test result, which is 1 minus the false-negative rate, (1 - r_fn):

P(p_t | d, I) = C(d, p_t) (1 - r_fn)^p_t r_fn^(d - p_t)        (8)

The third term is the probability to get d people with the condition from a sample of N, when the rate of occurrence is r:

P(d | r, N, I) = C(N, d) r^d (1 - r)^(N - d)        (9)

The double sum in equation (6), at each possible value for r, is a task best done by a computer. The python code I wrote for this is given in the appendix, below. The code, of course, does not really perform the calculation at every possible value of r (there is an uncountable infinity of them), but takes a series of little hops through the hypothesis space, in steps 0.002 apart. Because this step size is much narrower than the resulting probability peak, the approximation that the probability varies linearly between steps does no harm to the outcome.

Recall that we're using a uniform prior, and so, from equation (1), once we calculate the likelihood from equation (6) at all possible values for r, the resulting curve is proportional to the posterior distribution. Thus, a plot of the likelihoods produced by my python function (after normalization) gives the required distribution:

The figure gives a point estimate and a confidence interval. Because the posterior distribution is symmetric, my point estimate is obtained by taking the highest point on the curve.

To get the 50% confidence interval (using my unorthodox, but completely sensible definition of confidence intervals), I just kept moving one step to the left and to the right from the high point, until the sum of the probability between my left-hand and right-hand markers first reached 0.5. Again, this is a good procedure in this case, because the curve is symmetric - in the case of asymmetry, we would need a different procedure, if, for example, we required an interval with equal amounts of probability on either side of the point estimate.

The centre of the posterior distribution is at r = 0.15. Let's see if that makes sense. With 1000 test subjects, at this rate, we expect 150 cases of the disease. With 85% of positive cases correctly identified (false-negative rate = 0.15) then we should have 127.5 true positive test results (on average). Also we should get (1000 - 150) × 0.1 = 85 false positive test results. Adding these together, we get 212.5 expected positive test results, which is, to the nearest integer, what was specified at the top. It looks like all the theory and coding have done what they should have done.
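The arithmetic of this sanity check can be written down in a couple of lines. The sketch below uses the rates quoted in the check above and in the appendix code (false-positive rate 0.1, false-negative rate 0.15). The function names are my own, and the second function simply inverts the first to give a rough point estimate of r from the observed count. It is not part of the original calculation.

```python
def expected_positives(r, n_subjects=1000, r_fp=0.1, r_fn=0.15):
    """Average number of positive tests when the disease rate is r."""
    true_pos = n_subjects * r * (1 - r_fn)       # diseased and detected
    false_pos = n_subjects * (1 - r) * r_fp      # healthy but flagged
    return true_pos + false_pos

def prevalence_from_positives(p, n_subjects=1000, r_fp=0.1, r_fn=0.15):
    """Invert expected_positives: the r whose expected count equals p."""
    return (p / n_subjects - r_fp) / (1 - r_fn - r_fp)

print(expected_positives(0.15))          # 127.5 + 85 = 212.5 (up to rounding)
print(prevalence_from_positives(213))    # roughly 0.1507
```

The inverted estimate lands within one grid step (0.002) of the peak of the posterior, which is reassuring.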

For fun, we can also check that the calculated confidence interval makes sense. I've specified a 50% confidence interval, which means that if we do the experiment multiple times, about half of the calculated confidence intervals should contain the true value for the incidence rate of the disease. With only a little additional python code, I ran a Monte-Carlo simulation of several measurements of 1000 test subjects each.

The true incidence rate was fixed at 0.15, and the number of positive test results was randomly generated for each iteration of the simulated measurement. The python package, numpy, can generate binomially distributed random numbers. I used the following lines of code to generate each number of positive test results, p:

d = numpy.random.binomial(1000, 0.15)           # number with disease
p_t = numpy.random.binomial(d, (1 - r_fn))       # number of true positives
p_f = numpy.random.binomial(N - d, r_fp)         # number of false positives
p = p_t + p_f

Then, using each p, I calculated the 50% confidence limits, as before, and counted the occasions on which the true value for r, 0.15, fell between these limits. I ran 100 simulated experiments, and amazingly, exactly 50 produced error bars that contained the true value. There certainly was a little luck here (standard deviation here is 5, so 32% of the time, such a set of 100 measurements will produce true-value-containing C.I.'s either fewer than 45 times, or more than 55 times), but still this result serves as a robust validation of my numerical procedures. Probability theory works!
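For anyone wanting to repeat this kind of check without waiting an hour, here is a rough sketch. It is not the code I ran. It relies on a shortcut that I am introducing here: since each subject independently tests positive with probability q = r(1 - r_fn) + (1 - r) r_fp, the number of positives is itself binomially distributed, with N trials and probability q. The function names, the seed and the number of runs are all arbitrary choices.

```python
import numpy as np
from scipy.stats import binom

N, R_FP, R_FN = 1000, 0.1, 0.15
R_GRID = np.linspace(0.0, 1.0, 501)        # hypothesis space, steps of 0.002

def posterior(p):
    """Normalized posterior over R_GRID for p positives (uniform prior)."""
    q = R_GRID * (1 - R_FN) + (1 - R_GRID) * R_FP   # P(positive test | r)
    like = binom.pmf(p, N, q)
    return like / like.sum()

def interval(post, mass=0.5):
    """Step outwards from the peak, one grid point each side, until the
    enclosed probability first reaches 'mass'."""
    lo = hi = int(np.argmax(post))
    while post[lo:hi + 1].sum() < mass and (lo > 0 or hi < post.size - 1):
        lo, hi = max(lo - 1, 0), min(hi + 1, post.size - 1)
    return R_GRID[lo], R_GRID[hi]

def coverage(true_r=0.15, runs=2000, seed=0):
    """Fraction of simulated experiments whose interval contains true_r."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(runs):
        d = rng.binomial(N, true_r)            # number with disease
        p_t = rng.binomial(d, 1 - R_FN)        # number of true positives
        p_f = rng.binomial(N - d, R_FP)        # number of false positives
        lo, hi = interval(posterior(p_t + p_f))
        # small tolerance guards against rounding at the grid points
        hits += int(lo - 1e-9 <= true_r <= hi + 1e-9)
    return hits / runs

print(interval(posterior(213)))
print(coverage())
```

Because the interval is grown in whole grid steps until it first reaches 50%, it usually encloses slightly more than 50% of the posterior, so the coverage from a sketch like this can be expected to sit a little above one half.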

(Note: to keep the code presented in the appendix as understandable as possible, I didn't do much optimization. To have done the 100-measurement Monte-Carlo run with this exact code would have taken days, I think. The code I ran was a little different. In particular, the ranges over r, d, and p_t were truncated, as large regions of these parameter spaces contribute negligibly. This allowed my simulation to run in just under an hour.)

Appendix

# Python source code for the calculation

# Warning: this algorithm is very slow - considerable optimization is possible

import numpy as np
from scipy.stats import binom  # calculates binomial distributions

def get_likelihoods(N, p):

    # inputs:
    # N is total number of people tested
    # p is number of positive test results

    r_fn = 0.15  # false-negative rate
    r_fp = 0.1   # false-positive rate

    # number with disease can be anything up to number of people tested:
    d_range = range(N + 1)

    # number of true positives can be anything up to total number of positives:
    p_t_range = range(p + 1)

    likelihoods = []

    delta_r = 0.002
    rList = np.arange(0, 1 + delta_r, delta_r)

    for r in rList:  # scan over hypothesis space

        temp = 0
        for d in d_range:  # these 2 for loops do the double summation
            for p_t in p_t_range:
                p1 = binom.pmf(p - p_t, N - d, r_fp)  # equation (7)
                p2 = binom.pmf(p_t, d, (1 - r_fn))    # equation (8)
                p3 = binom.pmf(d, N, r)               # equation (9)

                temp += (p1 * p2 * p3)

        likelihoods.append(temp)

    return likelihoods
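One cheap way to test a calculation like this is to compare it against an independent route to the same number. The sketch below is an addition of mine, not part of the original calculation. Each subject tests positive independently with probability q = r(1 - r_fn) + (1 - r) r_fp, so the likelihood of p positives out of N should equal a single binomial probability with parameter q. The code evaluates the double sum (vectorized, for a small hypothetical cohort so it runs instantly) and the single-binomial shortcut, so the two can be compared.

```python
import numpy as np
from scipy.stats import binom

def likelihood_double_sum(N, p, r, r_fp=0.1, r_fn=0.15):
    """The double sum over d and p_t, done with array broadcasting."""
    d = np.arange(N + 1)[:, None]        # number with disease (column)
    p_t = np.arange(p + 1)[None, :]      # number of true positives (row)
    p1 = binom.pmf(p - p_t, N - d, r_fp)     # false positives, equation (7)
    p2 = binom.pmf(p_t, d, 1 - r_fn)         # true positives, equation (8)
    p3 = binom.pmf(d, N, r)                  # number with disease, equation (9)
    return float(np.sum(p1 * p2 * p3))

def likelihood_shortcut(N, p, r, r_fp=0.1, r_fn=0.15):
    """Single binomial with the per-subject probability of a positive test."""
    q = r * (1 - r_fn) + (1 - r) * r_fp
    return float(binom.pmf(p, N, q))

# Small hypothetical cohort: 30 subjects, 7 positives, trial value r = 0.2
print(likelihood_double_sum(30, 7, 0.2), likelihood_shortcut(30, 7, 0.2))

# Full-size problem via the shortcut: locate the peak on the 0.002 grid
r_grid = np.linspace(0.0, 1.0, 501)
like = np.array([likelihood_shortcut(1000, 213, r) for r in r_grid])
r_peak = r_grid[np.argmax(like)]
print(r_peak)
```

If the two routes agree on small cases, the shortcut can stand in for the double sum at N = 1000, which is what makes a fast Monte-Carlo check practical.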

Mean vs median - a careful balancing act

2015-04-25T01:02:00.000-05:00

Two common measures of the location of a probability distribution are the mean and the median. While generally, they are quite different things, some familiar distributions have their mean and median at the same point (~~all such distributions are symmetric~~, (see comment, below) and vice versa).

The mean of a distribution, as we all know, is its average, while the median is, roughly speaking, the point at which the amount of probability mass to one side is the same as the amount on the other side. Upon hasty consideration, these definitions can appear to denote the same thing, and so confusion between the two concepts is common. Annoyingly, my own PhD thesis contains a sentence¹ that explicitly confuses the mean for the median (and furthermore, none of the half dozen eminent scientists whose job it was to assess my thesis (who otherwise all did an excellent job!) reported noticing this blunder).

Confusion between the mean and the median is highly analogous to a difficulty experienced by many young children when they try to balance asymmetric blocks on top of one another, as has been reported by cognitive scientist Annette Karmiloff-Smith².

In a radio interview a couple of years ago (BBC Radio 4, 'The Life Scientific', Jan. 22, 2013), Karmiloff-Smith described briefly the finding (starting about 12 minutes into the interview): young children were asked to try to balance various blocks - some symmetric, others invisibly loaded on one side - on a narrow beam. Children of a particular age group were, it seems, old enough to expect the balance point to correspond to the geometric midpoint of the object, and tried first to balance the blocks there. Obviously, in the case of the asymmetrically weighted blocks, the midpoint would not work, and the blocks would fall. Despite repeated attempts with the same outcome, however, often a child would remain apparently unshakable in its faith in the midpoint, and continue to try to balance the item there.

Interestingly, many slightly younger children, perhaps not yet old enough to have learned the significance of the midpoint of an object, had an easier time adjusting from the geometric centre to the actual centre of mass after a few trials.

The task of finding the centre of mass of a physical object is mathematically identical to the matter of locating the mean of a probability distribution. The mean of a distribution over x, given as

⟨x⟩ = ∫ x P(x) dx

is the point at which (treating distances from the mean, to the left, as negative and distances to the right as positive) the products of the individual probabilities with their corresponding distances from the mean sum up to zero. (If we shifted the origin of our coordinate system to the mean of the distribution, in the above formula, the integral would be zero.)

From the law of the lever, however, the force with which a mass tends to tip an object on a fulcrum is given by the product of the mass with its distance from the fulcrum. Since an object will balance when the forces on one side have the same magnitude as the forces on the other side, the centre of balance also corresponds to the point at which the sum of these mass × distance products comes to zero. So, the centre of mass is also the mean of the mass distribution.

The midpoint of an object is also closely analogous to the median. If we reduce an object to a one-dimensional mental model, then the correspondence becomes exact. At the median, m, (assuming a continuous distribution) the amounts of mass on either side are equal:

∫_{−∞}^{m} P(x) dx = ∫_{m}^{∞} P(x) dx

In one dimension, length stands in for mass, and the median is the point equidistant from each end.

Note, however, that even encoding for differences in density along the length of an object / distribution, the mean and the median will only be the same in the case of symmetry about the centre of mass. The median involves the integral of mass, while the mean integrates the product between mass and distance. The mean pays more attention to masses situated further out, thanks to this product, while the median doesn't care where the mass happens to be. If a distribution has an extended tail on one side only, then the mean will typically be positioned further out into the tail than the median.
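A small numerical sketch may help here. I've picked the exponential distribution as an example of a one-sided tail. That choice, the grid, and the variable names are mine, purely for illustration. The code finds the mean and the median on a grid, then computes the net 'torque' (probability × distance, summed) about each point.

```python
import numpy as np

# Discretized exponential density on a fine grid (illustrative choice)
x = np.linspace(0.0, 50.0, 500_001)
dx = x[1] - x[0]
density = np.exp(-x)
density /= density.sum() * dx            # normalize on the grid

# Mean: the probability-weighted average position (centre of mass)
mean = np.sum(x * density) * dx

# Median: the first grid point with half of the mass to its left
cumulative = np.cumsum(density) * dx
median = x[np.searchsorted(cumulative, 0.5)]

# Net torque about each candidate balance point
balance_about_mean = np.sum((x - mean) * density) * dx
balance_about_median = np.sum((x - median) * density) * dx

print(mean, median)
print(balance_about_mean, balance_about_median)
```

The mean comes out close to 1 and the median close to ln 2 ≈ 0.693, so the mean sits further into the tail. The torque about the mean vanishes (to rounding), while about the median it is clearly positive: a block 'balanced' at its median would tip towards the tail.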

At some point in their development, children seem to learn to expect symmetry in the objects and phenomena they encounter. This is quite reasonable, as without symmetry, there can be no physics (all physical laws are realizations of symmetry of one kind or another).

The devil is in the details, though, and the symmetry need not always be of the simplest forms. As we approach adulthood, we presumably come to appreciate this, and I suspect that as adults we can look forward to much faster success in balancing exercises such as the ones those children described earlier struggled with. No doubt our continually built up experiences of mechanical interaction with reality contribute much to this attainment of maturity.

But in the course of our day-to-day existence, we have far less cause to experience and interact with explicit probability distributions, so the lessons pertaining to them can be harder to win (particularly given our excess exposure to symmetric distributions such as the Gaussian). An intuitive grasp of the difference between mean and median is one presumably almost all adults possess, when it comes to simple physical objects, but banishing this confusion can be more than child's play when it comes to statistics. Hopefully, by noting (as I've tried to do here) the similarities between the mechanical and the abstract, we can ease the process.

References

[1]	Really, you think I'm going to give you a page number? Go find it yourself!
[2]	Annette Karmiloff-Smith and Bärbel Inhelder, 'If you want to get ahead, get a theory,' Cognition, volume 3, issue 3, p 195-212 (1975)

The Fundamental Confidence Fallacy

2015-04-18T19:02:00.001-05:00

The title of this post comes from an excellent recent paper (as far as I can tell, still in draft form) on misunderstandings of confidence intervals. The paper, 'The fallacy of placing confidence in confidence intervals', by R. D. Morey et al.¹, is by almost exactly the same set of authors whose earlier paper on a very similar topic I criticized before, but the current paper does a far better job of explaining the authors' position, and arguing for it.

The authors identify the fundamental confidence fallacy (FCF) as believing automatically that,

If the probability that a random interval contains the true value is X%, then the plausibility (or probability) that a particular observed interval contains the true value is also X%.

A confidence interval, a kind of error bar, is a device used in problems of parameter estimation (e.g. what is the age of the universe? how much rain will fall tomorrow? which gene is most responsible for making me so damn handsome?). As well as calculating a point estimate, such as the most probable value of a parameter, a researcher working on some relevant data may also provide a confidence interval, indicating a region around the parameter's point estimate, in which (hopefully) one can expect the true value of the parameter to reside. This is done because sampling errors (noise) in the data collection process will typically result in the point estimate being not exactly equal to the true value of the parameter.

In general, the point of such error bars is that a very broad error bar indicates a not very precise measurement, where our confidence that the point estimate is very close to the true value is low, and vice versa.

Morey et al. show, however, that the traditional definition of the confidence interval is too broad to automatically satisfy the general requirements of error bars - hence their contention that the belief summarized in FCF (above) is indeed a fallacy.

Here's the conventional definition of confidence intervals that they take from the literature:

An X% confidence interval for a parameter θ is an interval (L, U) generated by an algorithm that in repeated sampling has an X% probability of containing the true value of θ.

The problem that they rightly identify about this definition is that it fails to account for differences in how informed one is, before and after the data have been gathered. They demonstrate this with a beautifully simple thought experiment about a lost submarine, which I'll try to explain:

The crew of a boat want to drop a rescue line down to the hatch of a 10 m long submarine. They don't know the exact location of the sub, but they know its length, and they know that it produces distinctive bubbles from uniformly distributed locations along its length. They also know that the hatch is exactly half way along the sub's length. They decide to watch for bubbles, to infer the sub's position, but they want to launch their rescue attempt quickly, so they decide to do so as soon as their 50% confidence interval for the location of the hatch is sufficiently narrow.

They reason that for 2 randomly positioned bubbles, the hatch is equally likely to be between them as not, so for a two-bubble data set, they devise the following confidence interval:

⟨x⟩ ± Δx / 2

where ⟨x⟩ is the average position of the 2 bubbles, and Δx is their separation - i.e. the 50% confidence interval is defined exactly by the positions of the 2 bubbles. Note that this confidence interval satisfies perfectly the definition given above.

Unfortunately, when the bubbles rise, they do so with a separation of only 1 cm. The rescuers calculate a 1 cm wide confidence interval, and, falling for the fundamental confidence fallacy, they infer that the hatch has a 50% probability to be in this very narrow region.

In reality, though, this extremely (and spuriously) precise inference has been drawn from almost maximally uninformative data. The bubbles could have arisen from either end of the submarine, or anywhere in between, meaning that the hatch could be located anywhere within a 10 m interval. The probability that the hatch is between the 2 bubbles is about as low as it can be.

On the other hand, had the bubbles been 10 m apart - indicating that they came from opposite ends - the rescuers would have been able to infer the exact location of the hatch, but from their adopted confidence procedure would have obtained a 10 m wide C.I., and hence would have wanted to wait for more data, perhaps losing their only chance to complete the rescue.
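The submarine thought experiment is easy to simulate. The following is a minimal sketch under the assumptions of the story (a 10 m sub, hatch at the centre, two bubbles drawn uniformly along its length). The seed, the number of trials, and the 1 m and 5 m thresholds for 'narrow' and 'wide' intervals are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 200_000
half_length = 5.0        # 10 m submarine, hatch exactly half way along
hatch = 0.0              # true hatch position (arbitrary origin)

# Two bubbles per trial, uniformly distributed along the sub's length
bubbles = rng.uniform(hatch - half_length, hatch + half_length,
                      size=(trials, 2))
lower = bubbles.min(axis=1)
upper = bubbles.max(axis=1)
width = upper - lower

# Does the two-bubble interval contain the hatch?
covered = (lower <= hatch) & (hatch <= upper)

overall = covered.mean()                  # over all possible data sets
narrow = covered[width < 1.0].mean()      # bubbles less than 1 m apart
wide = covered[width > 5.0].mean()        # bubbles more than 5 m apart

print(overall, narrow, wide)
```

Over all data sets, the interval contains the hatch about half the time, just as the conventional definition demands. Among the narrow intervals, though, the hatch is inside only about 5% of the time, while any interval wider than 5 m is certain to contain it, since two bubbles more than half a sub-length apart must lie on opposite sides of the centre.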

The problem with the conventional definition of confidence intervals is that it is set up with respect to the set of all possible measurement outcomes, rather than the specific measurement outcome that occurred. An inference that is valid when the data are not known can hardly be expected to remain necessarily valid when the data are known, but the standard wisdom regarding confidence intervals ignores this.

To remedy this, I've always advocated a somewhat different, unconventional definition of confidence intervals:

An X% confidence interval is a subset of the hypothesis space that has an X% posterior probability to contain the true state of the world.

(On a one-dimensional hypothesis space, this subset would be defined by lower and upper bounds, (L, U), exactly as appear in the above conventional definition.)

This definition also satisfies the conventional one, but is narrower in a way that eliminates its worst problems.

This is essentially the same recommendation made by Morey et al. (they prefer to call it a credence interval). Under a definition of this kind, the fundamental confidence fallacy disappears - if I give you a parameter estimate with a 95% confidence interval, then (i) 95% is the probability that the true value of the parameter lies within the interval and (ii) a narrow confidence interval necessarily corresponds to a precise determination of the parameter (and vice versa).

Under favourable conditions, many of the common frequentist confidence procedures do a reasonable job of approximating these desiderata, but I feel strongly (as, presumably, do Morey et al.) that it is far better to start with and understand a sensible definition, and hence understand when our approximate methods are (and are not) valid, than to sweep such validity issues under the carpet - pretend they don't exist - and proceed ab initio from nonsense.

Why my criticism of the earlier paper is still valid:

Now that Morey et al. have produced this very good paper, I can better appreciate what they were trying to get at, in their earlier paper, 'Robust misinterpretation of confidence intervals'², and I can better explain the nature of the error they made in it.

In the earlier paper, the authors described providing a sample of scientists with a questionnaire to assess their understanding of confidence intervals. They asked:

Professor Bumbledorf conducts an experiment, analyzes the data and reports, "the 95% confidence interval for the mean ranges from 0.1 to 0.4." Which of the following statements are true:

They then listed a number of statements, including

There is a 95% probability that the true mean lies between 0.1 and 0.4.

They went on to claim that this statement, and several others of related forms, were false. They were wrong. The reason they were wrong is essentially the same reason that the statement of FCF, above, is indeed a fallacy: they failed to appreciate that a person who has seen the data may have a different probability assignment from a person who has not. I have not seen Prof. Bumbledorf's data, so the statement immediately above this paragraph is correct as far as I'm concerned (as I proved straightforwardly in the earlier post, using the Bernoulli urn rule). For Bumbledorf, however, who is aware of the data, the statement is not guaranteed to be true (depending on the confidence procedure he used).

The authors fell foul of a common form of mind-projection fallacy, by acting as if there is one true probability distribution, independent of one's state of information.

For their questionnaire to have worked the way they wanted, the authors should have asked something like '... which of the following statements are necessarily valid for Bumbledorf to make?'

Conclusion

The traditional definition of the confidence interval allows for methods of calculating error bars that do not satisfy the basic requirements of error bars:

The integral over an X% interval may not be X%, and (as seen in the submarine example) may be drastically smaller. The probability for a parameter to be inside the confidence interval may be much less than the claimed confidence level.
The definition allows for procedures that produce a narrow interval in cases where the measurement is very imprecise, and a very broad interval, where the parameter can be inferred exactly.

As always - in all aspects of life - correct reasoning is Bayesian reasoning. By calculating error bars from posterior distributions, or algorithms that deliberately strive to approximate them, these problems with conventionally defined confidence intervals are immediately dissolved.

References

[1]	R.D. Morey, R. Hoekstra, M.D. Lee, J.N. Rouder, and E.-J. Wagenmakers, 'The fallacy of placing confidence in confidence intervals,' available in draft form, here
[2]	R. Hoekstra, R.D. Morey, J.N. Rouder, and E.-J. Wagenmakers, 'Robust misinterpretation of confidence intervals', Psychonomic Bulletin & Review, January 2014 (link)

Science is for Everyone

2014-12-12T22:00:00.001-06:00

In the previous post, I explained that science is suitable for investigating all matters. Pursuing a similar theme, I want now to discuss how science is for all people, not just bearded academics with white lab coats. (Pardon the stereotype, and let me emphasize that there is no good reason why 50% of all scientists should not be women.)

I mentioned something in that last post that is also central to this discussion: scientific method is a graded affair - not black or white. Whatever we can learn by implementing a low level of scientific rigour, we can learn a little more, in a little more detail, and with a little more confidence, by applying a slightly more systematic procedure.

It is simply not the case that one goes to university to study for a degree in science, passes one's exams, then receives a degree certificate, inscribed with the words, 'Now you are imbued with the power of science. All your future endeavours will be fully scientific. None without this certificate will possess the sacred scientific touch.'

Understanding of and ability to implement scientific method is something that is built up and refined over many years. I have three degrees in physics, with a couple of years of postdoctoral experience, and I'm still learning - much to my perpetual delight!

But this is by no means to claim that those without my kind of training are barred from joining the party. The basic principles of scientific thought are really pretty simple, and can be grasped quite easily. There are good reasons for this. The thing that ultimately makes science so special is: it works! And the thing about humans, as organisms evolved through natural selection, is that we come equipped with brains that, for the most part, also work. Thus, scientific method and human brains can readily form an easy, comfortable partnership. Scientific method is just cultivated common sense.

Popularizers of science often draw on the purely curiosity-driven aspects of science. Glamorous, big-science instruments, such as the Hubble Space Telescope and the Large Hadron Collider, play a prominent part. This is good, but it is not enough. We also need to draw particular attention to the extraordinary practical advantages of being able to figure stuff out. Knowing things (forming rationally supportable high levels of confidence) means being able to make very effective decisions.

And no matter what level of scientific rigour you are currently working at, you can always achieve infinitesimally more robust conclusions, and hence infinitesimally more effective decisions, by applying an infinitesimally more scientific approach. You don't have to be a rocket surgeon to use and benefit from a little scientific optimization.

As long as you are a decision-making entity with values (an intelligent being), then you desire to be able to make effective decisions (see Is rationality desirable?). No matter what question is relevant to your pursuit of value, whether it is how to make good bread, or how to judge the merits of newspaper stories, scientific method can get you the answer most efficiently.

A couple of nights ago, in the wee small hours, I was sitting in a nuclear research facility near Tokyo, blasting a detector with some extremely frisky atomic nuclei, when the conversation with my boss (while we fought off the urge to sleep) turned to the topic of science-fair projects. Having been educated in Ireland, I didn't grow up with the science-fair tradition, which is a big shame, though I did do plenty of exciting experiments with my dad, which probably played a big part in forming my eventual disposition as an adult.

My boss recalled that of all the projects that his kids had done, the one that they most enjoyed and remembered most vividly was one for which they tested a range of electric batteries to find out which would last longest in their toys. The reason they enjoyed it, of course, was that they really wanted to know the answer. It was a serious practical question. I think this is a fantastic lesson for a child to learn: not only does it encourage fascination and familiarity with the scientific process, but it also conveys the practical advantages of systematic investigation.

Familiarity with fundamentals of experimental design, such as control and randomization, and awareness of basic data collection and reduction techniques, coupled with a willingness to use them (and to constantly improve one's use of them!) is an immensely powerful thing.

Beyond that, a moderate interest in technical topics, such as medical research, can help one to understand research results reported in the media, and so make better informed decisions concerning one's use of and attitude towards various technologies. This convenient NHS guide to understanding medical research stories is a good place to start, if you're interested in that particular thing, but also introduces several general concepts in the field of estimating the merits of evidence.

Ultimately, a society in which individuals value science becomes a society that collectively values science, and a society that is better equipped to face its many challenges. Science is for scientists. Science is also for the common man and woman (not that there is anything uncommon about scientists!). Science is also for the politicians, who have to design a mutually beneficial path through the difficult territory of being human. Democratically elected politicians will continue to ignore evidence in favour of their personal agenda, as long as the voting public continue telling them that this is OK.

Scientism

2014-12-12T19:09:00.001-06:00

It perplexes me that the word 'scientism' is predominantly used as a slur to put people down and criticize their world view and methodology. I realized something recently, however, that helped me understand the error that is often being made, and how that error compounds the problem that is often being called out when people make the accusation of scientism.

First off, let's settle what scientism is. Wikipedia gives a good definition that fits well with the contexts in which I see the term used:

Scientism is belief in the universal applicability of the scientific method and approach, and the view that empirical science constitutes the most authoritative worldview or most valuable part of human learning to the exclusion of other viewpoints.

Well, that's a strange accusation. I've made it clear in numerous places that this is exactly my position, and I've repeatedly defended that position with robust logical arguments.

On the universal applicability of scientific method: yes, absolutely. If a thing has meaningful consequences, then why should science not be a good way to learn about its properties? Whether it is something normally associated with science (astronomy, atomic physics, medicine, evolutionary biology, or whatever), or something more concerned with politics, law (here and here), morality, issues of religion, or the supernatural, all things can be investigated scientifically.

To be clear on my position on the supernatural: there is no such thing, it's a necessarily empty set (I'll come back to this point in a future post), but there are certain putative entities often identified as supernatural: ghosts, goblins, fairies, gods, and such like. If such things did exist, however (it's hard to say absolutely categorically that they don't, at least without being more precise about what they are, though there is, of course, no evidence supporting belief in any of them), then they would necessarily be physical beings, amenable to scientific investigation.

On the maximally authoritative nature of science-based investigation: again, this is necessarily so. Being scientific really just means being systematic. The only alternatives are, at best, ignorance and, at worst, fantasy passed off as fact. Why would any person wish to know about something, and choose to be non-systematic in the manner of their formation of beliefs about it? One cannot, while being coherent (see my post, Is rationality desirable?). To think that one can achieve rationally supportable degrees of belief without using a rationally supportable procedure is a clear mistake. And this is identically what scientific method does: produce rationally supportable degrees of belief.

Now some might be tempted to argue that science isn't always necessary. Some things, for example, are just obvious. But let me emphasize: scientific method is a graded affair - not black or white. Whatever we can learn by implementing a low level of scientific rigour, we can learn a little more, in a little more detail, and with a little more confidence, by applying a slightly more systematic procedure.

___

Now, here's the thing I noticed when I recently saw a little scientism bomb being dropped, elsewhere on the internet, by a person whose awareness of the scope and meaning of scientific method I have good reasons to trust. It seemed to me that what this person was complaining of was actually the opposite of scientism, i.e. dismissal as irrelevant or intractable, certain valid philosophical questions, because they are perceived not to fall within the scope of scientific method.

Well, let's be clear about what philosophy is: philosophy is love of wisdom. Wisdom can be thought of as divisible into three categories (where by 'knowing', in the following, I mean having a rationally supportable high level of confidence in some proposition):

1. Knowing what things are true
2. Knowing what procedures are effective for discerning what things are true
3. Knowing how to behave effectively under certain circumstances

Note, however, that items 2 and 3 are really special cases of item 1. The question of whether it is valid, for example, to use probability theory when trying to attain rationally supportable degrees of belief is a question of fact about the nature of the real world (spoiler alert: it IS valid to use probability theory). The question of how to behave is a duo of empirical problems: (i) what is my utility function? (what do I actually value? - yes, this is an empirical question, my values are physical properties of my mind), and (ii) what actions will lead to consequences that will maximize my expected utility? (In fact, 2 is also a special case of 3: how should I behave if I value knowledge of X?) So all of philosophy is about figuring out what is probably true.

But, as I just argued, all meaningful questions of fact are best answered using science, and so love of wisdom entails a desire to follow scientific method. Thus, philosophy (defined as an endeavour, and not in terms of the traditional type of education received by the typical practitioner) is identical to science.

Thus, all philosophical questions fall under the scope of scientific method, and the scientist who dismisses such issues as unscientific is failing to appreciate the range of validity, the very meaning, of their own profession. This is a mistake that it's important to call out. Part of the reason we have a society run by politicians and law makers who believe that they can divine correct policy, without implementing scientific procedure, is that prominent scientists are repeatedly telling them that they can. ('Oh, that's not a scientific question, that's a matter of human affairs,' or, 'there's the evidence, now it's for you, the politicians, to decide what it means,' or perhaps worst of all, the Nuremberg defence: 'don't ask me if it's right or wrong, I'm just a scientist.')

But notice that the accusation of scientism completely misses the mark, here. Scientism, recall, is believing that all questions fall under science's magisterium, while the actual error being committed is the claim that certain problems are not in this category.

The cry of scientism, therefore, fails to draw proper attention to the fallacy that has been committed, and, in fact, is quite likely to reinforce it. Faced with this charge, one is, of course, free to refer to sources such as the Wikipedia definition, quoted above. Many a scientist who is somewhat on the ball, philosophically, however, is likely to look at such definitions and say, 'yup, that's me, and proud of it!' And of whatever mistake they might have made that prompted the rebuke, they are likely to conclude, 'if that's scientism, then I'm perfectly happy to continue committing the crime.'

Probability Trees and Marginal Distributions

2014-11-08T00:36:00.000-06:00

In a blog post earlier this year about medical screening, On the hazards of significance testing. Part 1: the screening problem, statistical expert David Colquhoun demonstrates a simple way of visualizing the structure of certain probabilistic problems. This diagram, which we might call a probability tree, makes the sometimes counter-intuitive solutions to such problems far easier to grasp (and in the process, helps put over-inflated claims about the effectiveness of screening into perspective).

I discussed exactly this medical screening problem in my first ever technical blog post, The Base-Rate Fallacy (though my numbers were entirely made up, while David's come more-or-less directly from reality). I was therefore extremely jealous of his simple diagram, which conveyed instantly the structure and solution of the problem. How much more direct an explanation than all the words in my earlier blog post.

The problem concerns a medical diagnostic test with certain false-negative and false-positive rates (David's diagram uses the related (complementary) terms sensitivity and specificity). Given that the medical condition under test has a certain prevalence (base rate), what is the probability that an individual receiving a positive test result actually has the disease? It turns out that when the prevalence of the disease is low (often the case), this probability is usually much lower than one imagines.

In David's tree diagram (direct link to the diagram), with a false-negative rate of 0.2, a false-positive rate of 0.05, and prevalence of 0.01, one can clearly see why the desired probability in the case he examined is 14%, rather than the 95% that many people naturally gravitate towards.
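The 14% figure can be checked with a few lines of arithmetic. The following is only a minimal sketch of the calculation, not a reproduction of David's diagram; the function name `posterior_given_positive` is made up for illustration.

```python
def posterior_given_positive(prevalence, false_negative_rate, false_positive_rate):
    """Return P(disease | positive test) from the three quoted rates."""
    sensitivity = 1.0 - false_negative_rate            # P(positive | disease)
    true_pos = prevalence * sensitivity                # P(disease and positive)
    false_pos = (1.0 - prevalence) * false_positive_rate  # P(no disease and positive)
    return true_pos / (true_pos + false_pos)


p = posterior_given_positive(
    prevalence=0.01, false_negative_rate=0.2, false_positive_rate=0.05
)
print(f"P(disease | positive test) = {p:.3f}")
```

With these inputs the result comes out a little under 0.14, because the false positives from the large healthy group outnumber the true positives from the small afflicted group.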

Because of the extreme clarity offered by this kind of visualization, I've decided to translate my earlier example into a similar diagram. In my imagined case, we had a far more accurate test, with equal false-positive and -negative rates at 0.001, but also a rarer condition, with background prevalence of 1 in 10,000. To avoid having to think about outcomes for fractions of people, we'll imagine that exactly 10 million people are tested:

This problem represents a fairly simple example of a marginal probability distribution extracted from a hypothesis space over two (binary) propositions: X = receives a [positive, negative] test result, and Y = [has, doesn't have] the disease. In the linked glossary entry, I derive some general properties of such marginal distributions, but as algebra has the annoying habit of often refusing to impress a clear intuitive understanding on our minds, we can use the tree diagram to confirm our grasp on these things.

For example, from the above diagram, we can tell instantly that the overall probability for a person to receive a positive test result is simply 10,998 divided by 10,000,000. What we've done to obtain this number, however, is to implement, without noticing, the same procedure specified by the marginalization formula I derived in the glossary. That formula is:

P(x) = Σ_y P(x | y) P(y)

This just means, to get the probability for x, regardless what y is, add up for each possible value of y, the product of the probability for y and the probability for x given that value of y. This is effectively what the above diagram allows us to do by inspection.

In this case, the y's are the propositions that the subject respectively has and doesn't have the disease. x is the condition that the subject's test result is positive. The overall probability for a positive test result, from the above formula, is the sum of two terms. The first of these terms is the probability to not have the disease, i.e. 1 - prevalence, multiplied by P(positive result | subject not afflicted), which is the false-positive rate = 0.001. Multiplying these together gives 0.0009999. This is exactly how we arrived at the number in the top right box in the diagram (remember that the numbers in the diagram are multiplied by 10 million).

The second term we need is obtained by similar means, only this time, y = 'has the disease,' so P(y) is the prevalence, and P(x | y) is now 1 - false-negative rate, giving P = 0.0000999, corresponding to the top box in the lower half of the right-hand column in the diagram. Adding these together gives P(x) = 0.0010998, corresponding to the top line in green on the diagram.
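For anyone who prefers to see the marginalization done mechanically, here is a small sketch using exact fractions, so that the expected frequencies come out as whole people. The dictionary names are my own illustrative choices.

```python
from fractions import Fraction

population = 10_000_000
prevalence = Fraction(1, 10_000)
error_rate = Fraction(1, 1_000)  # false-positive and false-negative rates are equal

# P(y) for each value of y, and P(x = positive | y)
p_y = {"disease": prevalence, "no disease": 1 - prevalence}
p_pos_given_y = {"disease": 1 - error_rate, "no disease": error_rate}

# One term P(x | y) P(y) per value of y; their sum is the marginal P(x)
terms = {y: p_pos_given_y[y] * p_y[y] for y in p_y}
p_positive = sum(terms.values())
expected_positives = p_positive * population

for y, term in terms.items():
    print(f"{y:>10}: P = {float(term):.7f}, expected count = {term * population}")
print(f"P(positive) = {float(p_positive):.7f}, expected count = {expected_positives}")
```

The two terms correspond to the two 'positive' boxes in the tree, and their sum is the 10,998 in 10,000,000 read off the diagram.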

Another statistical expert with a blog, David Spiegelhalter, has recently used a similar tree diagram to solve a related problem of weather forecasting, here. Part of the success of diagrams like David Colquhoun's (and my copycat graph) is that they lay out a multidimensional problem in a visually accessible way. As David Spiegelhalter explains, another important element is that these diagrams exploit a translation from probabilities to expected frequencies, and back, which eases the workload on our conceptual machinery.

As Spiegelhalter shows, such problems as the medical screening puzzle and his weather forecasting riddle can also be represented using contingency tables. For these 2 dimensional hypothesis spaces over binary propositions, the contingency tables are drawn as 2 by 2 arrays (with totals usually added for good measure). The table for my diagnosis test problem looks like this:

These tables generalize easily to non-binary hypotheses, e.g. 'the patient has a temperature of [34, 35, 36, ... , 40] degree centigrade.' Instead of having two columns for [pass, fail] the test, we would have one column for each outcome of the thermometer reading.
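A 2 by 2 table of this kind can be rebuilt from the numbers already given for my diagnosis example (10 million people, prevalence of 1 in 10,000, both error rates 0.001). The sketch below works in expected frequencies; the layout, with test outcomes as columns, is an assumption on my part.

```python
population = 10_000_000
n_disease = population // 10_000        # prevalence of 1 in 10,000
n_healthy = population - n_disease

# Both error rates are 1 in 1,000
false_neg = n_disease // 1_000
true_pos = n_disease - false_neg
false_pos = n_healthy // 1_000
true_neg = n_healthy - false_pos

rows = [
    ("disease", true_pos, false_neg),
    ("no disease", false_pos, true_neg),
    ("total", true_pos + false_pos, false_neg + true_neg),
]

print(f"{'':<12}{'positive':>12}{'negative':>12}{'total':>12}")
for label, pos, neg in rows:
    print(f"{label:<12}{pos:>12,}{neg:>12,}{pos + neg:>12,}")
```

Each row and column total is a marginal: dividing the 'positive' column total by the grand total gives the same overall probability for a positive result as the tree diagram.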

One way that contingency tables don't readily generalize, however, is when there are more than two dimensions in the probability space. This is when probability tree diagrams become especially useful. On a tree diagram, higher dimensionality is added by simply increasing the number of columns of nodes (boxes). Each column but the left-most represents one dimension in probability space. Let's take an example that sticks with binary hypotheses.

Suppose that the background prevalence of the disease in my original example is itself a random variable. Let's say that 4% of the population possess a genetic mutation that makes them an unfortunate 65 times more susceptible to that medical condition (assume that people without the mutation still have the same 1 in 10,000 risk of having the disease as before). Using a tree diagram, we can proceed to easily solve non-trivial problems, such as:

(i) What is the overall probability for a person to receive a positive test result?

(ii) What is the probability that I have the condition, assuming that I tested positive?

(iii) What is the probability that I have the mutation, given that I tested positive for the disease?

This time, my diagram will omit the left-most rectangle, which doesn't really convey any information, anyway, and I'll leave the translations to and from expected frequencies as an exercise for you, if you're bothered:

Notice that the probabilities in the middle and right-hand columns are conditional probabilities, of the form P(x | y). So the number 0.9999 in the top middle box is the probability, P(un-afflicted | no mutation). From the product rule, this number multiplied by the probability to not have the mutation, P(y), is the joint probability, P(xy), i.e. the probability to not have the mutation and not be afflicted by the disease.

For question (i) we just have to add up 4 different cases. Working from top to bottom, the first is the probability to receive a (false) positive test, and not have the disease, and not have the mutation:

P = 0.96 × 0.9999 × 0.001 = 0.0009599

Obtaining the other terms by the same means (each corresponding to a positive test result in the right-hand column), and accumulating them:

P(positive test) = 0.0009599 + 0.0000959 + 0.00003974 + 0.00025974,

i.e.

P(positive test) = 0.00136

For question (ii), there are 2 ways to receive a positive test, and have the disease, corresponding to having and not having the genetic mutation. So we sum the probabilities for these two, and divide by the total, just obtained:

P(disease | positive test) = (0.0000959 + 0.00025974) / P(positive test)

which gives

P(disease | positive test) = 0.2624

Finally, for (iii), instead of adding on the top line the second and fourth terms, we add the last two:

P(mutation | positive test) = (0.00003974 + 0.00025974) / P(positive test)

yielding,

P(mutation | positive test) = 0.22097

Notice that the method of going backwards through the graph (solving for parameters towards the left hand side), in questions (ii) and (iii), is exactly the same as solving Bayes' theorem.
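All three answers can be reproduced by enumerating the four branches of the tree that end in a positive test. This is just a sketch of that bookkeeping, with variable names of my own choosing; it assumes, as in the text, that the test's error rates do not depend on the mutation.

```python
from itertools import product

p_mutation = {True: 0.04, False: 0.96}
base_risk = 0.0001
p_disease_given_mutation = {True: 65 * base_risk, False: base_risk}
error_rate = 0.001  # false-positive and false-negative rates


def p_positive_given_disease(disease):
    """P(positive test | disease status)."""
    return 1 - error_rate if disease else error_rate


# Joint probability of each (mutation, disease) branch AND a positive test
joint = {}
for mutation, disease in product([True, False], repeat=2):
    p_d = p_disease_given_mutation[mutation]
    p_branch = p_mutation[mutation] * (p_d if disease else 1 - p_d)
    joint[(mutation, disease)] = p_branch * p_positive_given_disease(disease)

# (i) marginalize over everything except the test result
p_positive = sum(joint.values())
# (ii) and (iii): keep the branches of interest, divide by the total
p_disease_given_positive = sum(p for (m, d), p in joint.items() if d) / p_positive
p_mutation_given_positive = sum(p for (m, d), p in joint.items() if m) / p_positive

print(f"(i)   P(positive test)            = {p_positive:.5f}")
print(f"(ii)  P(disease | positive test)  = {p_disease_given_positive:.4f}")
print(f"(iii) P(mutation | positive test) = {p_mutation_given_positive:.5f}")
```

Letting the false-positive rate depend on the mutation would only mean passing `mutation` into `p_positive_given_disease`; the summing and dividing stay the same.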

This problem can be made much more complicated, without introducing any further difficulty, other than the number of arithmetic operations required. As mentioned, we could have any number of columns in the diagram (dimensions to the problem), and any number of hypotheses in each dimension. Also, it would not have mattered at all if, for example, the false-positive rate depended upon whether or not the subject has the mutation. As long as we have numbers to put in those little boxes, the solution is close to automatic.

Fear of Science

2014-09-20T03:52:00.000-05:00

Many people react negatively to the idea that moral principles can be inferred entirely using scientific method. There is a general feeling that this is impossible. This seems to be partly why quite a lot of people view the decline of traditional sources of moral instruction as a serious threat. This is a major, double mistake.

In August last year, I attended an event, 'Answers in Science,' at Houston Museum of Natural Science, aimed at raising awareness of the way that a number of christian fundamentalists have been trying to sabotage the quality of scientific education in Texas schools. Among several that spoke there, two people raised points that struck me as highly significant, given the line of thought I've been pursuing for some time, with regard to the relationship between science and morality. They were Kathy Miller, from Texas Freedom Network, and Mike Aus, a former pastor.

Kathy spoke very informatively about the mechanisms and procedures of education review boards in Texas. She explained how a disturbing amount of what they do is protected from public scrutiny, and how a large proportion of the people on such boards are radically skeptical christian fundamentalists, who often believe in the literal truth of the bible, and aim to have topics such as biological evolution, one of the most important of scientific theories, removed from the school curriculum.

Teaching that the theory of evolution is false is not only factually wrong (and therefore a huge disservice to any student who wishes to pursue a scientific career, or just understand science), but also conveys a hideously distorted message concerning how knowledge and understanding are obtained, giving a corrupted view of what constitutes evidence, and seriously undermining the pursuit of rationality. This hinders society's thriving in important ways.

What is interesting about these boards is that their members are publicly elected, and the science deniers who sit on them do so because they receive popular support. Kathy discussed briefly how this happens. She suggested that the people voting for these radical religious fanatics are themselves predominantly not religious fundamentalists, just culturally christian, and generally sympathetic to christian values.

It seemed from Miller's remarks that many who give support to those who would remove the theory of evolution from the school curriculum, and would teach that the Earth (indeed, the entire universe) is less than 10,000 years old, that dinosaurs and humans were contemporary, that global warming is not a problem, and so on, do so out of the simple desire for their children to learn to be good people.

When I heard this, I immediately drew parallels with the MORI poll, conducted in 2011 in the UK for the Richard Dawkins Foundation. Quoting from the linked press release:

"Asked why they had been recorded as Christian in the 2011 Census, only three in ten (31%) said it was because they genuinely try to follow the Christian religion, with four in ten (41%) saying it was because they try to be a good person and associate that with Christianity."

There seems to be a coherent message emerging: religious identification, and possibly support for radical religious teaching, may be associated more with a broad wish to retain moral integrity than with belief in specific religious doctrines.

But the tendency of many to rely on traditional religious teachings for provision of a moral foundation is a double mistake, as I mentioned. Firstly, such teachings are based on superstition and dogma, constructed to further the selfish aims of their authors - not healthy indicators. As issues such as evolution, the age of the universe, and countless other matters show, religious dogma has a terrible track record in the sphere of getting things right. And predictably so - there is absolutely nothing about the methodology of its construction to predispose it towards accuracy.

Even in the realm of morality, there is ample evidence for the poor track record of popular religions. Aficionados of these religions have been steadily accepting the continual erosion of their traditional teachings for centuries. Uncounted practices, once encouraged by popular religions, such as the keeping of slaves, persecution of other races, pursuit of holy wars, subjugation of women, proscription of homosexuality, and animal sacrifices, to name but a few, are now considered unacceptable in civilized society.

Secondly, by rejecting science in favor of religion for moral guidance, one is implicitly declaring that morality must be determined by irrational means, which is a clear absurdity (see for example Is Rationality Desirable?). Indeed, whatever there is to be discovered about morality can only be discovered efficiently and reliably by employing scientific method (see Scientific Morality). This applies not only to the methods of achieving our moral objectives, but also to the process of inferring what our core moral objectives are. In the quest to learn how best to behave, by turning one's nose up at scientific method, one is immediately taking on a serious and unnecessary handicap.

This is a mistake made not only by many sympathetic to religious teachings (and let's not forget that there are some good things about such value systems), but also by many in the scientific community. Popular misconceptions about the limitations of science are hardly blameworthy, when the intellectual consensus encourages belief in those same alleged limitations. Many scientists I've spoken to, and many more I've read on the internet, are repulsed by the idea that science can venture into the realm of human value, and consider it ridiculous. This is a traditional view, presumably owing much to thousands of years of religious dogma. Upon moderate reflection, however, it can be seen to be an obvious travesty (see Practical Morality, Part 2). It is now crucial for the scientific community as a whole to undertake that moderate reflection, and throw off the shackles of dogma (things that scientists are usually quite good at). One only needs to ponder how it can possibly be that something meaningful can be not measurable, or that something measurable can not be investigated scientifically, to realize that there is something seriously amiss with the traditional viewpoint.

Only after these issues are widely understood within the scientific and intellectual communities can we expect popular reliance on often harmful superstitions for moral guidance to be significantly diminished. This will not only reduce the skepticism and fear of scientific method, but will also enable the flourishing of a new (but long overdue) discipline of moral science, which in all reasonable expectation must lead to an enhanced understanding of ethics.
_

Mike Aus's point (the other speaker I noted at that Houston event), from his personal perspective as a non-scientist, was that the theory of evolution, when partially understood, can be damn scary. Success, in terms of natural selection, is highly dependent on one's ability to outmaneuver one's competitors. This seems to give the impression that if another person has a resource that I could benefit from, then it would be natural, and therefore (according to the theory) proper, for me to stab them in the back and take it. This suggests that some of the skepticism concerning the role of science in matters of morality comes from a feeling that science (in particular, the theory of biological evolution) entails immoral behaviour.

This comes down to another problem of the popular understanding of science. Again, a problem most easily solved after scientists themselves understand the issues more fully. I'm not claiming that biologists don't understand evolution, but as long as scientists don't grasp the origin of morality, they won't be able to explain the relationships between morality and biology, and they won't be able to identify the regions of overlap and non-overlap.

Broadly, there are two problems with this simplistic perception of evolutionary theory. Firstly, whether or not an organism is a rival, in terms of natural selection, depends on very many factors. In particular, for a species with a technology as sophisticated as ours, there are very strong reasons why cooperation usually works out far better for us than brutal, short-sighted bullying. We term this effect 'the social contract' (see Practical Morality, Part 1).

Secondly, whatever behavior maximizes our evolutionary fitness, this is not identical with what is morally optimal. 'Survival of the fittest' may prefer X, but we are not 'survival of the fittest,' we are humans. Chance effects in our evolution (e.g. 'unexpected' side effects of having brains like ours), together with systematic effects in our upbringing can equip us with values that conflict with the algorithm of natural selection (of genes, at least) in significant ways. There is no reason that propagation of my genes must be my ultimate source of value.

_

Morality, as a discipline, is entirely contained within scientific method. Problems of moral decision demand decent estimates of 2 classes of facts: (i) what it is we want and (ii) what the outcomes of various actions are likely to be. Consequently, such problems can only be solved by analyzing real experiences in a coherent manner - something we call science. To get the best and most reliable solution to any decision problem, we need controlled data, and we need to analyze it using sound methodology. Anything else is just guess work. Scientists and decision makers need to consider these things, and overcome the cultural biases that so often stand in the way of the obvious. Once we reach widespread understanding in this sphere, we'll have gone a long way towards reducing the fear of science that presently holds back society.

Pass / Fail Mentality

2014-05-24T10:46:00.000-05:00

(Following on from The Calibration Problem: Why Science Is Not Deductive)

Recently, I was talking about calibration (here and here), and how it should be more than just identifying the most likely cause of the output of a measuring instrument. The calibration process should strive to characterize the range of true conditions that might produce such an output, along with any major asymmetries (bias) in the relationship between the truth and the instrument's reading. In short, we need to identify all the major characteristics of the probability distribution over states of the world, given the condition of our measuring device.

Failure to go beyond simply identifying each possible instrument reading with a single most probable cause is a special case of a very general problem that in my opinion plagues scientific practice. Such a major failure mode should have a special name, so let's call it 'pass / fail mentality.' It is the most extreme possible form of a fallacy known as spurious precision, and involves needlessly throwing away information.

In The Calibration Problem, I argued that science is necessarily inductive, and consists, ideally, of producing probability distributions over sets of exclusive and hopefully exhaustive propositions. To present a result with an error bar that is too narrow to accurately characterize one of these probability distributions is to be guilty of spurious precision. To collapse that error bar to zero width: 'X is 5.1,' is to take this fallacy as far as it can go.

Science is often considered to consist of tests of Boolean propositions (molecule M speeds up recovery from disease D, planet Earth is getting warmer, humans and dandelions share a common ancestor, etc), and so presenting a finding with zero recognized uncertainty often consists of a statement of the form, "proposition X passed the test," or "proposition X failed the test." Hence the name I'm giving to this fallacy. What we would prefer is a statement of the form, "proposition X, achieving a probability of 0.87, looks quite reliable."

Better still, would be to examine an uncollapsed hypothesis space, such that instead of saying, "yes the Earth is getting warmer," or even, "yes, the Earth is very probably getting warmer," we would say something like, "the rate of change of the Earth's surface temperature is X ± Y."

Unfortunately, there seems to be a tendency in human nature to prefer statements of exaggerated precision. Probability theory gives us the tools to manage uncertainty. Until we properly understand probability, therefore, we do not understand uncertainty. And ill-understood uncertainty is a scary thing. Yet, we know that the function of science is to help banish our fear of the unknown. Thus, it's often overwhelmingly tempting to assume that the function of science is to banish uncertainty completely.

This is the temptation that many a radical skeptic has succumbed to and / or manipulated in others. The climate-change denier, the young-Earth creationist, the anti-vaccination lobbyist - all use the same tactic: Look, they can't decide if the Earth is 4.53 or 4.54 billion years old! They have no certainty about anything!

So strong is this desire to see uncertainty eliminated, (and so strong, perhaps, the desire not to give ammunition to the radical skeptic) that much of the way science is conducted and reported is built around this flawed model: Based on our results, we have decided that P is true. Effect Q has passed the test of statistical significance.

Often it has been said by gurus of scientific method that science must proceed by asking specific questions, and that these questions must be of the yes / no variety. Hence, we have seen debates between the rival philosophies of verificationism vs. falsificationism: should we proceed by proving our theories true or by proving them false? This debate is wrong - strictly speaking, we can do neither.

Many a data set has been left unpublished because it was insufficient to answer any question 'conclusively.' But this inconclusiveness is itself a useful piece of information. It indicates, for example, that any possible effect size is likely to be small, and in combination with other weakly-informative studies in meta-analysis, it can be used to help aggregate a more informative result.
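To illustrate how several individually inconclusive studies can add up to something more informative, here is a sketch of one simple pooling scheme: fixed-effect, inverse-variance weighting. This is my choice of illustration, not a claim about how any particular meta-analysis was done, and the effect estimates and standard errors are entirely made up.

```python
import math

# Hypothetical (effect estimate, standard error) pairs from four small studies.
# Each one alone is compatible with zero effect.
studies = [(0.30, 0.25), (0.10, 0.20), (0.25, 0.30), (0.18, 0.22)]

# Inverse-variance weights: more precise studies count for more
weights = [1.0 / se**2 for _, se in studies]
pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

for est, se in studies:
    print(f"study:  {est:.2f} ± {se:.2f}")
print(f"pooled: {pooled:.2f} ± {pooled_se:.2f}")
```

The pooled standard error is smaller than that of any single study, which is exactly the information that is thrown away when the 'inconclusive' data sets stay in the filing cabinet.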

The tendency to not publish ambiguous results is not always the fault of the practicing researcher either. Most scientific journals will publish only conclusive results, and of those, positive results usually receive far more favorable attention.

Null-hypothesis significance testing (one of the most common forms of data analysis in use in contemporary scientific literature) is a classic example of the pass / fail mentality running rampant in scientific endeavour. A threshold is set, and if some data-derived metric exceeds that threshold, then the data go on to fame and fortune. Otherwise, they go to the back of the filing cabinet. This madness is not necessarily limited to frequentist methods, though. Anywhere that some hard boundary between finding and non-finding has been set will exhibit the same flaw, and it won't matter if that boundary is a p-value, a posterior probability, or a likelihood ratio.

Odd things occur when experimental data is examined using threshold-like criteria, such as the α-levels applied to p-values. Meta-analysts have found that the scientific literature can exhibit excess results that just barely cross the significance threshold, contradicting statistical analysis that predicts what the distribution of p-values should be. For example, Masicampo and Lalande¹ found an anomalous spike in the number of reported p-values just less than 0.05, in the field of psychology. Very many studies use the same arbitrary significance level of p = 0.05, so that a p-value greater than 0.05 is considered inconclusive. One interpretation is that many researchers achieving p-values close to, but not quite crossing the significance barrier tended to 're-work' their analyses in various ways, until the magic α-level was crossed. Use of thresholds leads to distortion of the evidence.

Applied probability is called decision theory. It is the endeavour to determine how to act, based on our empirical findings. Where actions are to be taken, often a probability distribution is going to need to be collapsed, somewhere. We can't always distribute our actions - it doesn't usually work to only 70% undergo surgery. (Though, mixed strategies very often are possible, and advisable.) Thus, on superficial analysis, the introduction of a threshold of significance, or the expression of a parameter estimate as an infinitely narrow spike in probability space may seem reasonable: if action requires us to gamble on a single state of the world (e.g. "only surgery can save your life"), what else are we to do?

The problem is that the pass/fail mentality, perhaps motivated by a crude appreciation of decision-theoretic considerations, does not explicitly acknowledge any elements of decision theory, and totally fails to implement the most basic aspects of decision analysis.

Many a frequentist chastises the Bayesian for introducing personal bias, in the form of the prior probability distribution, without realizing that frequentist techniques do exactly this, arbitrarily, and without acknowledgement. By implementing arbitrary α-levels, the significance tester is effectively setting a decision threshold, without ever formally or even approximately evaluating any utility function, which is one of the things that any decision analysis must have. In fact, it's even worse: the introduction of the decision threshold goes so unacknowledged that the urge, upon completing the calculation, is not to say, "for economic reasons, we should behave as if X is true," but rather to simply say, "X is true." Nobody (pretty much) actually, explicitly thinks this way, but the conventions of reporting have been set up in such a way that this non sequitur represents a real point of psychological attraction, which easily impacts on the thinking and behaviour of the insufficiently wary.

One of the implicit assumptions in pass / fail thinking is that the best point estimate is the peak (or sometimes the mean) of the probability distribution. Partly, this arises because there remains confusion as to whether the exercise is one of decision, or one of pure science, to determine what is true. But a stupidly simple toy example of decision making serves to show that the optimum, with respect to action, depends on the utility function, and can be arbitrarily far from the probability peak.

Suppose we play a gambling game, but it's not much of a gamble, as the cost of entry into the game is zero. An exotic roulette wheel produces outcomes that are Poisson distributed, with mean equal to 10, such that the probability to land on 38 (the highest number on the wheel) is extremely low (about 9 × 10^-12, in fact). If the outcome matches our prediction, we win a prize that is dependent on what the outcome is. It just happens that only one of the prizes is non-zero (and positive) - the prize awarded for a correctly predicted outcome of 38. What outcome should we predict? Obviously, in this trivial game, our expected utility is maximized by betting on 38, even though it has a low probability to arise.
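The toy game is small enough to check directly. The sketch below is illustrative only: it assumes the wheel is numbered 0 to 38, ignores the tiny amount of Poisson probability above 38, and uses an arbitrary prize of 1 for a correct call of 38. The names `prize` and `expected_utility` are mine.

```python
import math


def poisson_pmf(k, mean):
    """Probability of outcome k under a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


MEAN = 10
OUTCOMES = range(0, 39)  # assumed numbering: 0 to 38 inclusive

# Hypothetical prize table: only a correctly predicted 38 pays anything.
prize = {k: 0.0 for k in OUTCOMES}
prize[38] = 1.0  # arbitrary positive value


def expected_utility(bet):
    """We collect prize[bet] only when the wheel lands on our bet."""
    return poisson_pmf(bet, MEAN) * prize[bet]


most_probable = max(OUTCOMES, key=lambda k: poisson_pmf(k, MEAN))
best_bet = max(OUTCOMES, key=expected_utility)
p_38 = poisson_pmf(38, MEAN)

print(most_probable, best_bet, p_38)
```

The most probable outcome sits near the mean of the distribution, while the bet that maximizes expected utility is 38, the only outcome with a non-zero payoff.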

In the post on the calibration problem, I showed in detail why it is that science will always be fundamentally concerned with calculating probability distributions. Hopefully the above considerations have helped to illustrate further why as much of the details of those probability distributions as possible should be retained, when the final analysis is being assessed and reported. To present a point estimate without an error bar is frankly unthinkable, for any conscientious scientist. To neglect to mention any strong asymmetry of the resulting probability curve is to needlessly discard valuable information, and such fine details, once eliminated, represent a lost opportunity when evidence from multiple studies is to be aggregated. Sometimes action requires a hard decision, but do the decision analysis explicitly, so that you know what you have done, and so that others can review your assumptions, and determine whether your utility function is the same as theirs.

References

[1]	Masicampo, E. J. and Lalande, D. R., 'A peculiar prevalence of p values just below .05,' Quarterly Journal of Experimental Psychology, 65 (11), 2271-2279, 2012 (Link to paper, paywalled, unfortunately. See a short discussion with a plot of Masicampo & Lalande's data here.)

Announcing: Moral Science Index

2014-05-20T18:16:00.000-05:00

Continuing the paradigm established by my glossary and my mathematical index, I've put together an index to and summary of the material I've accumulated on the topic of moral science. The index can be reached here, or from the link, 'Moral Science', on the right-hand side, beneath my profile.

The idea is simply to provide a point of entry for people interested in knowing what I have to say on this topic. People can see everything I have presented on this theme, the order in which the different pieces were published (and hence, approximately their dependency), a short description of each piece's function, together with some global motivating and qualifying remarks.

The relationship between science and morality represents a significant percentage of the material on my blog. It's an important (by definition) and highly overlooked topic, so I think it is important for people to have a single point of access to this material, the same way that the mathematical index provides a consolidated resource for learning about statistics, and the same way that the glossary represents the most definitive statement of my philosophy available, anywhere. (In some respects, I now view the blog as secondary to the glossary.)

I will try to keep the moral science index current - as I release more material, I'll update the index accordingly.

As always, I welcome your comments, questions, criticisms, outraged indignation, etc. If anything needs clarification, the fault is mine. If you're curious about some detail I can help with, then I'm delighted to do so (that's the whole point of the website, actually). Comments are open here and on the index itself, and alternative contact details exist on the right hand side of this page.

Some Highlights:

For your convenience, I'll reproduce here some of the major points from the moral science page.

(1) As of the publication date of this blog post, the index stands at:

Blog entries on this topic (in order of publication):

Scientific Morality

Crime and Punishment

Is Rationality Desirable?

Practical Morality, Part 1

Practical Morality, Part 2

Glossary entries on relevant concepts:

Absolutism

Consequentialism

Morality

Rationality

(2) To disclaim any extraordinary expertise in any specific realm of moral decision:

My writing on ethics is not to prescribe how to behave, but to inform on how to know how to behave.

(3) Quoting from the overview:

The founding principle behind my writing on this blog is that there is no better method to learn about anything than science. If a thing is meaningful - has consequences - science can measure it, by virtue of those consequences....

It is often said that science has nothing to say on the matter of what constitutes moral behaviour. If correct, this leaves us with only one option: morality has no meaning, it is a non-concept. It seems to me absurdly trivial that this is not so. Anyway, only a moderate amount of reflection is required to prove it. Thus, it is equally trivial to prove that science can guide us - in fact, is the optimal guide - concerning moral prescription.

(4) Another feature on the moral science page is a short list of blog articles I expect to write on the topic in the near future, covering (in no particular order):

the correspondence, if any, between correct consequentialism and classic utilitarianism

the correspondence, if any, between correct consequentialism and political libertarianism

(Spoiler alert: the answer in both cases is, not so much.)

some necessary aspects of the nature of human decision criteria

the limited insight offered by the classic thought experiments in the philosophy of ethics

the potential for correct moral realism to significantly reduce reliance on superstition, leading to a better informed and more rationally directed society

The Calibration Problem: Why Science Is Not Deductive

2014-05-17T01:21:00.002-05:00

Here is perhaps the most important fact about scientific method that anybody can ever learn: the optimal course of a scientific investigation is to provide probability assignments for propositions about the universe, and when scientific method deviates from this optimum path, it is valid only to the extent that it successfully approximates this ideal. There is a simple reason for this:

We would love to be able to say that we are 100% certain about X, that Y is guaranteed to be true, or that fact Z about the universe has somehow entered my head and impressed infallible knowledge of its necessary truth on my mind, but of course, except for the most trivial propositions, none of these is possible.

Firstly, every measurement is subject to noise, so there will always be a degree of uncertainty about what caused a particular experience.

Secondly, and far more fundamentally, calibration of any instrument requires certain symmetries of physical law to be hypothesized. Here's what I mean:

If a 63.7 kg weight caused my machine to go 'beep' yesterday, I might postulate that a 63.7 kg weight will do the same today, because I'm assuming that the relevant laws of physics (and my device) are the same. If a mass on a spring oscillates at a given frequency, allowing me to count out a certain number of seconds, I might assume that changing my location on the surface of the Earth will not change that frequency. (And by the way, how would I know that the frequency was fixed at all?) These are hypotheses that may or may not be true. The only way to test such hypotheses is by the development of further instrumentation. Such instrumentation, though, is subject to a similar calibration problem, reliant on some other kind of analogical reasoning.

Until I realize that solid objects expand and contract as the average kinetic energy of their atoms changes, it might never occur to me that length measurements taken with a simple ruler vary slightly, but systematically, with the ambient temperature. Furthermore, in order to discern that such a bias is present, I need to make a comparison against some other calibrated standard. Such auxiliary standards, though, will always suffer the same type of vulnerability.

Thus, we cannot prove with 100% certainty exactly which symmetries hold in nature (though the assumption that some symmetries hold is a priori sound, which I might get round to in a future post). We can only demonstrate that up to now, our experiences are consistent with some set of postulated symmetries. The process of testing assumed principles of calibration (laws of physics) against empirical experience in this way is known as induction.

So, faced with the impossibility of knowing with absolute certainty any but the most trivial facts about the world (e.g. that only things that exist are affected by gravity), we must fall back on the next best thing: to quantify our justifiable degrees of belief in the various propositions we are interested in. Prescribing the manner in which this is achieved is the task of probability theory. Essentially, induction works by applying Bayes' theorem, or some reasonable approximation.
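As a minimal sketch of what such an update looks like in practice, consider two rival hypotheses about an instrument. All of the numbers are invented for illustration, and the function name `posterior` is my own.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem for a discrete set of mutually exclusive hypotheses.

    priors[i]      : P(H_i | I)
    likelihoods[i] : P(data | H_i, I)
    returns        : P(H_i | data, I) for each i
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(data | I), the normalizing constant
    return [j / evidence for j in joint]


# Hypothetical example: H_0 = "the ruler reads true", H_1 = "the ruler is biased".
# Before the comparison against another standard we are indifferent (0.5 each),
# and the observed discrepancy is four times more probable under H_1.
beliefs = posterior([0.5, 0.5], [0.2, 0.8])
print(beliefs)
```

Neither hypothesis is proved or refuted. The comparison simply shifts the degree of belief, which is all that any calibration exercise can do.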

As for those very trivial statements that we can deduce with total confidence, what do we get from those? Nothing, really. Take a look at my example: only things that exist are affected by gravity. Does this tell us anything about things that actually exist? In fact, no. It only tells us about things that don't exist. This is both why it is so trivial, and why it can be known without scrutinizing any evidence. Let's think about one of the most famous examples, Descartes' cogito: 'I think, therefore I am.' Again, this doesn't really tell us anything. It doesn't say what I am, what it is to think, or what it is to be, only that thinking, like gravitational attraction, is a property limited to things that exist. It doesn't even suggest, for example, that I am in any way a separate entity to every other thinking object in the universe.

For completeness, there is strictly no way out of the calibration problem. Consider the following thought experiment: imagine for the sake of argument that at some time, some clever scientist somehow devises an argument that identifies a unique set of symmetries in physical law, such that all other possible sets of symmetries lead to statements that are self contradicting, and therefore cannot be true. Imagine that this miraculous argument is actually correct. Obviously, the validity of such an argument, and our knowledge of that validity are not the same thing. This latter relies on our ability to confidently check the required logic, implying that there is yet another instrument in need of calibration: our own intellectual faculties, the fidelity of which can not, by any possible means, be established a priori.

Inductive inference is often contrasted with deductive logic. Deductive logic performs trivial operations on assumed premises, to draw conclusions that, according to the system, can not be false if the premises are true. The classic example starts from two premises, (i) 'all men are mortal,' and (ii) 'Socrates is a man,' to reach the unavoidable result, 'Socrates is mortal'.

Some of the well-known philosophers of science believed that because inductively derived information is not capable of guaranteeing truth, it must be inferior to deductive logic, or worse (e.g. with Karl Popper) it must be strictly useless - another fine example of mind-projection fallacy: because a statement about reality can only be true or false, then any degree of belief in it that I possess must be all or nothing - a position I have refuted elsewhere. (Popper is best known for his falsifiability criterion - in Inductive inference or deductive falsification? I show that, contrary to Popper and others, falsification must be inductive, but it's also important to note that falsification is not the only direction in which science can progress.)

Deduction often feels far more steadfast than inductive inference, because of its power to guarantee the conclusion from the employed premises, but really, deduction on its own tells us absolutely nothing about the world. Because of the calibration problem, the premises of any useful deduction can not be guaranteed by any means. To put it another way, one may argue that mathematical theorems possess necessary truth, but to the extent that this is true, they apply only to abstract, mathematical objects, x's and y's, but not real entities inhabiting the universe.

There may seem to be a problem, in that probability is a mathematical theory, meaning that all its theorems are derived using deductive logic. How can probabilistic reasoning be more powerful than deduction, if probability theory depends on deduction? It's what probability theory is about that allows it to lay legitimate claim to a uniquely privileged position among mathematical theories. The theory of differential calculus, for example, is a theory about x's and y's - entities with no real existence, not even in the mind of the person who fully perceives the theory. Using the theory of differential calculus, however, I can use those x's and y's to represent, for example, space and time, and formulate a theory of gravitation. We might start from Newton's inverse square law and use the theory to predict that the planets will adopt elliptical orbits around the sun, but could we then know with deductive certainty that this is the truth? Of course not. Could we even infer that this is probably the truth? No, not without probability theory.

Probability theory is still a theory of abstract x's and y's, but the objects in this theory are now not surrogates for masses on springs or aeroplane wings, they are rational agents and their rankings of believability. The theory of probability, therefore, provides a bridge between a mechanical theory and the thing that it is a theory of. It allows us - real agents - to quantify the correspondence between model and reality. On its own, a mechanical model, such as a theory of gravity, has no knowable relationship with what is actually going on. It is inductive inference that allows us to say, 'yes, the assumptions of this model are reasonable,' and 'yes, the predictions of this model match my experiences well.'

Finally, while it looks like inductive inference is founded on deductive logic, where do we suppose the axioms needed to derive our mathematical systems derive from? It is surely perverse to suggest that they come from anywhere other than our experience of the world, and what works, intellectually. Such experience is derived in 3 ways:

(i) our population-genetic history - our brains are the way they are, because the way they regulate our behaviour is a good match for the way the world operates, leading to efficient propagation of the genes that prescribe our brains' construction

(ii) our cultural history - early philosophers experimented with all manner of intellectual systems, eliminating all sorts of obvious mistakes along the way, and passing on a treasure trove of useful heuristics

(iii) our personal history - our direct contact with nature makes certain axiomatic systems feel highly unpalatable, because they just don't match what we see

To spell it out explicitly: deductive inference is in fact founded upon inductively derived principles.

The idea that inductive learning is more powerful than deductive logic has been recognized at least as far back as 1620, when probability theory was still in its infancy (just a few meager decades of faltering development). In that year, Francis Bacon, one of the founders of empirical scientific method, published his great work on the subject, 'Novum Organum.' This title means 'New Instrument,' and was a reference to Aristotle's 'Organon.' This was Aristotle's book on deductive logic, which had stood for centuries as the accepted model for all epistemology. Bacon's title was carefully chosen to send the message, "Aristotle is now obsolete." Bacon's great contribution was to say that deduction alone gets you nowhere. If you want to know what the world is actually made of, and how it behaves, he argued, you must make observations and do experiments. Science, in fact all knowledge, is based on experience, not pure thought.

Calibrating an X-ray Spectrometer - Spectral Distortion

2014-05-06T22:38:00.000-05:00

Calibration is a process whereby a relationship is inferred between the output of some measuring instrument and the physical process responsible for that output. An instrument may be something as simple as a ruler, or something as complicated as the Human Genome Project or the Planck cosmic background survey. Calibration is fundamental to science. We might even say that it is science.

When we think about calibration, we often think simply about finding the most probable value for some physical parameter, given some reading from an instrument. In the previous part, I described this simple process for a device used to characterize the distribution of photon energies in a stream of x-rays.

But we really ought to think of calibration as more than this. To make the best inferences possible from a reading, we should formulate the entire probability distribution, not just the location of its maximum, for the state of the world when the machine goes "bing," or when the display reads "42." When the readout says 7, it's good to know that I've most probably just found a black hole (perhaps), but it's also good to know what alternative explanations there are, and what amounts of probability mass they command.

At the end of the previous post, I showed this measurement of an x-ray spectrum, made with a cadmium telluride (CdTe) detector:

As I explained, those step-like drops in intensity at 27 and 32 keV are artifacts of the detector, and are consequences of an asymmetric probability distribution, P(state of the world | this instrument reading). Such asymmetry makes it all the more important to look beyond determining just the distribution's peak, as it means that our instrument is systematically distorting the truth. In this post, I'll describe my analysis aimed at converting this repeatably distorted detected spectrum to a representation of the true spectrum of the source.

The detector, of course, is made of atoms (Cd and Te), and these atoms have K-edges just like all others, as I described previously. That means that when a photon with more energy than these K-edges is absorbed, there is a chance for a fluorescence photon to leave the detector, taking away its characteristic energy with it. This energy, of course, does not contribute to the number of photo-generated electrons in the current pulse corresponding to such a detection event, and the detector registers photons with systematically lower energy than they really have. The two steps in the spectrum correspond to the points at which the incoming photons first exceed the Cd and Te K-edges. Every time a fluorescence photon escapes, the registered energy equals the true energy minus that of the fluorescence photon.

We want to work out the probability for a K-photon to escape the detector, so we can correct this asymmetric distortion. Let’s start by defining a few propositions:

A	≡	an x-ray photon from the source was absorbed in the detector

Cd	≡	an x-ray was absorbed by a cadmium atom in the detector

z	≡	an x-ray from source was absorbed at depth = z

E	≡	a fluorescence photon escaped the detector

K	≡	a K-shell emission occurred

K_i	≡	a K-shell emission into the i^th recombination channel occurred

θ	≡	a fluorescence photon was emitted at angle θ, with respect to the optic axis

E_i	≡	a fluorescence photon from the i^th recombination channel escaped the detector

D	≡	an absorbed photon from the source is detected in full (the full amount of energy absorbed is converted to charges collected at the readout electrode)

In the derivation below, I'll only consider the case where a photon is absorbed by a cadmium atom, but the full calculation will consist of also examining the alternate case, where absorption occurs in tellurium. For this, we just need to replace P(Cd | A, I) with P(Te | A, I) = 1 - P(Cd | A, I), from the sum rule.

We want to know the number of x-ray photons absorbed in the detector, N_A. What we actually know is the number detected, N_D.

On average, the number detected is

N_D = N_A × P(D | A, I)

Once a photon has been absorbed, our probability model, I, assumes that the only possibility for it to be not detected in full is for some of the energy to escape as a fluorescence photon. Thus, from the sum rule,

P(D | A, I) = 1 - P(E | A, I)

Also, our model assumes exactly 4 fluorescence channels:

i	channel	fluorescence energy (keV)
1	Cd, K_α	23.2082
2	Cd, K_β	26.1586
3	Te, K_α	27.3773
4	Te, K_β	31.091

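So, the detection probability is

P(D | A, I) = 1 - P(E_1 + E_2 + E_3 + E_4 | A, I)

Or, invoking the extended sum rule for disjoint propositions,

P(D | A, I) = 1 - Σ_i P(E_i | A, I)

From which, we can estimate the number of absorbed x-rays from the number detected:

N_A = N_D / [1 - Σ_i P(E_i | A, I)]    (1)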
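In code, this correction is a single division: the number absorbed is the number detected, divided by the probability of full detection (one minus the summed escape probabilities of the four channels). The sketch below uses made-up escape probabilities purely for illustration, and the function name is hypothetical.

```python
def absorbed_from_detected(n_detected, escape_probs):
    """Estimate the number of absorbed photons from the number detected in full.

    escape_probs holds one escape probability per fluorescence channel; the
    channels are treated as mutually exclusive, so their probabilities add.
    """
    p_detected_in_full = 1.0 - sum(escape_probs)
    if p_detected_in_full <= 0.0:
        raise ValueError("escape probabilities must sum to less than 1")
    return n_detected / p_detected_in_full


# Hypothetical escape probabilities for the four channels (not measured values).
example_escape_probs = [0.04, 0.01, 0.03, 0.01]
n_absorbed = absorbed_from_detected(910, example_escape_probs)
print(n_absorbed)
```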

A fluorescence photon can only escape the detector if it has been emitted, so the proposition E_i is the same as the proposition E_i K_i. Thus,

P(E_i | A, I) = P(E_i K_i | A, I)

which, from the product rule, can be decomposed:

P(E_i | A, I) = P(E_i | K_i, A, I) × P(K_i | A, I)    (2)

The probability for K-emission in the i^th channel can also be similarly decomposed. For example, for i = 1, K_α emission from cadmium, K₁ is a conjunction of three things: (a) absorption by a cadmium atom, (b) emission of a fluorescence photon (i.e. no non-radiative relaxation, such as Auger recombination, where the energy goes into another electron), and (c) emission into the α line:

P(K₁ | A, I) = P(K_α | K, Cd, A, I) × P(K | Cd, A, I) × P(Cd | A, I)

The first term is the relative intensity (material dependent) of the K_α line, obtainable from published tables,

The second term is the overall K-fluorescence yield, termed ω_K, obtained from literature¹, and the third term is given by the photoelectric absorption coefficients (discussed two posts ago):

We need one other term in order to evaluate Eq. (2). The escape probability is dependent on the unknown absorption depth, z, of the incident x-ray photon. We will integrate this nuisance parameter out, to give the desired marginal distribution:

P(E_i | K_i, A, I) = ∫ P(E_i z | K_i, A, I) dz

The absorption depth is independent of the ensuing fluorescence channel, so

P(E_i | K_i, A, I) = ∫ P(E_i | z, K_i, A, I) × P(z | A, I) dz    (3)

Note that, from the product rule,

P(z | A, I) = P(z A | I) / P(A | I)

i.e. the probability to be absorbed at a given depth (obtained from the exponential distribution) must be normalized by dividing by the overall absorption probability in the 1 mm thickness of the detector.

The depth-dependent escape probability, P(E_i | z, K_i, A, I), is also dependent on the unknown emission angle of the fluorescence photon, θ, relative to the direction of travel of the detected photon (perpendicular to detector surface). This is because different angles (for a given emission depth) correspond to different distances required to exit the CdTe detector. Again, let’s marginalize over this nuisance parameter:

P(E_i | z, K_i, A, I) = ∫ P(E_i | θ, z, K_i, A, I) × P(θ | I) dθ    (4)

(θ is independent of all other variables, and is uniform over 0 ≤ θ ≤ π, hence P(θ | I) × dθ is 1 divided by the number of samples over the half circle.)

Strictly, we should integrate over two angles, θ and φ (the detector has a square profile, and escape distances from the side depend on φ), but because the detector is heavily masked, so that only the centre is illuminated, and because the detector is very wide, compared to the penetration depth at the K-photon energies, we treat escape from the bottom or top as the only escape paths. This is supported by noting that reducing the size of the mask aperture has no effect on these observed artifacts. For the same reason, my model does not integrate over the x- and y-coordinates in the detector volume.

For emission angles less than 90° (moving downwards), the distance required to exit the detector through the bottom surface is

d_esc = (t - z) / cos θ

(t being the 1 mm thickness of the detector), while for photons moving upwards, the distance to escape through the top is

d_esc = z / |cos θ|

For each angle, and each depth, the desired escape probability is the exponential function: exp(-µ_PE × d_esc), where µ_PE is the photo-electric absorption coefficient at the relevant energy.

Finally, combining Eq’s (2), (3), and (4), for the Cd K_α fluorescence channel, we have (with similar formulae for the other channels):

P(E₁ | A, I) = P(K₁ | A, I) × ∫ ∫ exp(-µ_PE × d_esc) × P(θ | I) × P(z | A, I) dθ dz    (5)
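The double integral over depth and angle can be sketched numerically as follows. This is not the code used to produce the results in this post, only a minimal illustration under stated assumptions: a midpoint rule in both variables, depth z measured from the illuminated surface, an exponential absorption-depth distribution normalized over the detector thickness, and hypothetical attenuation coefficients in units of 1/mm. The function and parameter names are mine. The function returns the escape probability given that a fluorescence photon was emitted, which still has to be multiplied by the emission probability for the channel in question.

```python
import math


def escape_probability(mu_incident, mu_fluor, thickness, n_z=200, n_theta=180):
    """Probability that an emitted fluorescence photon leaves the detector.

    mu_incident : attenuation coefficient of the incoming photon (1/mm),
                  which sets the distribution of absorption depths
    mu_fluor    : attenuation coefficient at the fluorescence energy (1/mm)
    thickness   : detector thickness (mm)
    """
    dz = thickness / n_z
    dtheta = math.pi / n_theta
    p_absorbed = 1.0 - math.exp(-mu_incident * thickness)  # overall absorption probability

    total = 0.0
    for k in range(n_z):
        z = (k + 0.5) * dz  # midpoint of the k-th depth slice
        # Probability of absorption in this slice, given absorption somewhere.
        p_z = mu_incident * math.exp(-mu_incident * z) * dz / p_absorbed

        p_escape_at_z = 0.0
        for m in range(n_theta):
            theta = (m + 0.5) * dtheta  # midpoints never land exactly on 90 degrees
            c = math.cos(theta)
            # Downward-going photons exit through the bottom, upward through the top.
            d_esc = (thickness - z) / c if c > 0.0 else z / -c
            p_escape_at_z += math.exp(-mu_fluor * d_esc) / n_theta  # uniform in theta
        total += p_z * p_escape_at_z
    return total


# Hypothetical coefficients, chosen only to exercise the function.
p_example = escape_probability(mu_incident=5.0, mu_fluor=10.0, thickness=1.0)
print(p_example)
```

Sampling θ at bin midpoints is one simple way to avoid the division by zero at 90°. Real attenuation coefficients would come from published tables.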

Using Eq’s (5) and (1), we can get the number of photons absorbed at each energy, from the number counted by the detector. The intensity at the j^th energy, I_j, needs to be enhanced according to Eq. (1). The intensities at each of the depleted energies, I_{j - n}, where n is the number of detector channels spanned by the energy of the relevant fluorescence photon, need to have a number of counts, I_j × P(E_i | A, I), subtracted.

The procedure starts at the highest energy in the spectrum. Once the j^th detector channel has been processed, we move on to the (j-1)^th, until all channels have been adjusted. By traversing backwards through the spectrum, any contribution from higher energies will have been removed before the number of escape photons is calculated for each energy channel.
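The backwards traversal can be sketched like this. It is an illustration rather than the code behind the results in this post, and it makes a few assumptions of its own: channel j sits at energy j × channel width, the escape probabilities are supplied by a caller-provided function `escape_prob(energy, i)` (hypothetical, and expected to return zero below the relevant K-edge), and the counts removed from a lower channel are computed from the already-enhanced count of the higher channel.

```python
def strip_escape_events(detected, channel_width_kev, fluor_energies_kev, escape_prob):
    """Convert a detected spectrum into an estimate of the absorbed spectrum.

    detected           : counts per channel, lowest energy first
    fluor_energies_kev : energy carried away in each fluorescence channel
    escape_prob        : function (energy_kev, i) -> escape probability for channel i
    """
    corrected = [float(c) for c in detected]
    # Work from the highest energy down, so that spurious counts deposited by
    # higher channels are removed before a channel is itself corrected.
    for j in range(len(corrected) - 1, -1, -1):
        energy = j * channel_width_kev
        probs = [escape_prob(energy, i) for i in range(len(fluor_energies_kev))]
        absorbed = corrected[j] / (1.0 - sum(probs))  # enhance the j-th channel
        corrected[j] = absorbed
        for e_fluor, p in zip(fluor_energies_kev, probs):
            shift = round(e_fluor / channel_width_kev)  # channels spanned by the lost energy
            if p > 0.0 and j - shift >= 0:
                corrected[j - shift] -= absorbed * p  # remove the displaced counts
    return corrected


# Toy check: a single line at channel 50, 10% of which escaped via a 23 keV photon.
def toy_escape_prob(energy_kev, i):
    return 0.1 if energy_kev > 26.7 else 0.0  # hypothetical step at a K-edge


toy_detected = [0] * 60
toy_detected[50] = 900  # detected in full
toy_detected[27] = 100  # displaced downwards by 23 channels
toy_corrected = strip_escape_events(toy_detected, 1.0, [23.0], toy_escape_prob)
print(toy_corrected[50], toy_corrected[27])
```

In the toy case, the displaced counts are returned to the channel they came from, and the spurious channel is emptied.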

Note that the fluorescence yields are step functions of the incident energy – nothing is emitted for incident energies below the K-edges.

Once the detected spectrum has been converted to an absorbed spectrum, the spectrum incident on the detector is obtained by dividing the absorbed spectrum by the energy-dependent quantum efficiency of the 1 mm CdTe detector, and finally, the spectrum incident on the outer window of the detector is obtained by dividing again, by the transmission efficiency of the 250 µm beryllium window. This allows characterization of the spectrum emitted by the source, which is the ultimate goal of the activity. These transmission and absorption efficiencies are obtained again, using the appropriate energy-dependent attenuation coefficients, all of which can be obtained from this convenient NIST database.
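For a single energy bin, the two divisions can be sketched as below. I am assuming simple exponential attenuation for both layers, so that the quantum efficiency is the fraction absorbed in 1 mm of CdTe and the window transmission is the fraction surviving 250 µm of beryllium. The attenuation coefficients passed in are placeholders, and real values would be looked up per energy in tabulated data.

```python
import math


def incident_on_window(absorbed, mu_cdte, mu_be, t_cdte_mm=1.0, t_be_mm=0.25):
    """Estimate counts arriving at the beryllium window for one energy bin.

    absorbed : counts absorbed in the detector at this energy
    mu_cdte  : attenuation coefficient of CdTe at this energy (1/mm)
    mu_be    : attenuation coefficient of Be at this energy (1/mm)
    """
    quantum_efficiency = 1.0 - math.exp(-mu_cdte * t_cdte_mm)  # fraction absorbed in CdTe
    window_transmission = math.exp(-mu_be * t_be_mm)           # fraction passing the window
    incident_on_detector = absorbed / quantum_efficiency
    return incident_on_detector / window_transmission


# Placeholder coefficients, not tabulated values.
print(incident_on_window(absorbed=50.0, mu_cdte=2.0, mu_be=0.05))
```

Applying this bin by bin, with energy-dependent coefficients, gives the spectrum at the outer window.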

All the calculations I described were carried out by my computer. I numerically integrated over z, taking 200 samples, and θ, with 180 samples (taking care not to let θ = 90°, to avoid division by zero when the cosine is taken). The result is as follows, and exhibits partial success:

It seems to me that all the logic I described is correct. All the simplifying assumptions seem reasonable, and I have checked the code and not found any errors, but sadly, the result is not quite what it should be. Those spurious steps in the spectrum have been successfully removed, but have been replaced by a couple of sharp spikes. Evidently my model of the device is not quite adequate. These spikes could relatively easily be removed, by simply noting that they shouldn't be there - they are narrow enough that a linear or quadratic interpolation between the points on either side would be a fair fix, though a highly unsatisfying one. There's still some work to do here, before total victory can be declared.

In the literature, I see others working with a similar spectrometer encountered similar difficulties, which they also couldn't explain². Here is my best (though at present, highly speculative) guess for what might be happening:

My model assumes the onset of the K-edge is immediate, in line with basic known physics. But the original spectrum shows a gradual onset over almost 2 keV. Naturally, the detector exhibits measurement uncertainty in the form of a symmetric broadening, but at < 0.2 keV (known from the fluorescence measurements) this is insufficiently broad to account for the observed effect. The detector, however, is placed under a high voltage, to drive the photo-generated electrons to the readout electrode - several hundred volts, which results in a strong electric field that could conceivably affect measurably the binding energies of the atoms' electrons. Furthermore, CdTe technology is not as well developed as that of other semiconductors, such as silicon, and the CdTe crystals that can be grown are not of such high quality. Because of this, defects in the CdTe crystal can lead to local and transient distortion of the applied electric field, which just might lead to small differences in the effective K-edges at different locations (and different times). There could be some really interesting device physics going on here. If so, remember, you heard it here first!

If I make significant progress with this, I'll try to post a follow-up. If you know how to solve this problem, please drop me an email!

References

[1]	A. Markowicz, in Handbook of X-Ray Spectrometry, edited by Van Grieken and Markowicz, (Marcel Dekker) 2002
[2]	R. Redus, J. Pantazis, T. Pantazis, A. Huber, and B. Cross, Characterization of CdTe detectors for quantitative X-ray spectroscopy, presented at the 2007 Denver X-ray Conference and submitted to IEEE Trans. Nucl. Sci, 2008

Acknowledgement

Big thanks to Charles Willis and Bill Erwin at M.D. Anderson for lending me their spectrometer. It's a nice piece of kit.

Calibrating an X-ray Spectrometer - First Steps

2014-05-03T01:36:00.000-05:00

Recently, I've been working with a borrowed piece of equipment - an x-ray spectrometer - whose response I need to understand, so I can take measurements with it. This is a special case of the general problem of calibration, which is a crucial topic in science, so I'd like to take some time to describe the procedure I went through. As you'll see later, the problem is not fully solved yet, which I suppose illustrates the trial-and-error nature of scientific work. Regardless of the degree of ultimate success, though, the process I'll describe strikes me as a fine illustration of the basic logic of experimental science.

Digital x-ray detectors work because when x-rays are absorbed in the detector, the energy goes into liberating large numbers of electrons, which get collected in the detector's read-out circuitry. To make a detector that can record the energy of the incoming x-ray, we can exploit the fact that on average, each electron gets a certain amount of energy, so that the number of electrons liberated is proportional to the energy in the absorbed x-ray photon. This will work as long as the detector sampling rate is high compared to the photon flux (a condition whose violation we call 'pile-up').

Each absorbed x-ray, therefore, creates a current pulse proportional to the x-ray's energy. For a multi-channel x-ray spectrometer, each current pulse is analyzed and assigned by the electronics to one of a range of available channels, each corresponding to a particular range of energies. Each time a pulse is assigned to a channel, a counter corresponding to that channel is incremented by 1. The calibration problem for such an instrument, therefore, is to find the relationship between the channel being incremented and the energy of the photon.
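The counting scheme can be sketched in a few lines. This is a toy model of a multi-channel analyzer, not a description of the electronics in the borrowed spectrometer: it assumes equal-width channels covering pulse heights from zero up to some maximum, and it simply discards pulses outside that range.

```python
import math


def accumulate_spectrum(pulse_heights, n_channels, max_height):
    """Histogram pulse heights into equal-width channels."""
    counts = [0] * n_channels
    width = max_height / n_channels
    for height in pulse_heights:
        channel = math.floor(height / width)
        if 0 <= channel < n_channels:  # out-of-range pulses are dropped
            counts[channel] += 1       # increment that channel's counter by 1
    return counts


# Hypothetical pulse heights in arbitrary units.
toy_counts = accumulate_spectrum([0.1, 0.1, 2.5, 9.99, 10.0], n_channels=10, max_height=10.0)
print(toy_counts)
```

Calibration then amounts to working out which photon energies each of these counters corresponds to.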

Most often, the calibration problem is considered to consist of finding the mean (or very often, just the mode) of the probability distribution, P(energy | channel), though a more complete calibration consists of characterizing the entire distribution, not just its peak or mean. This becomes particularly important if the probability distribution is notably asymmetric.

Finding the peak is a good place to start, however. In the present case, one method to do this relies on the phenomenon of K-shell fluorescence, which I'll briefly explain. The diagram below represents the spectrum of energies available to the electrons in an atom:

The number subscripts on the right indicate the so-called principal quantum number, n. At n = 1, the electron has the lowest energy it can have for that atom - its orbit is also closest (on average) to the nucleus. Higher energy levels exist, getting closer together (energetically) as n increases, until a certain critical energy, at which the electron is no longer bound to the nucleus - the electron is liberated to wander the vacuum, hence the 'V' subscript.

The n = 1 orbital is also referred to as the K-shell, n = 2 is known as the L-shell, and so on. If an electron in the K-shell absorbs a photon and is given enough energy to exceed E_V, then the electron leaves the atom altogether. The minimum energy required for this, the difference between E₁ and E_V, is known as the K-edge.

If an atom with several electrons is ionized in this way, an electron from a higher orbital must drop down to the vacated level to restore equilibrium, and this process often produces K-shell fluorescence - the excess energy of the electron that moves down to fill the K-shell is released in the form of a photon. Most often, the relaxing electron will come from either the n = 2 or the n = 3 orbital, and these are the transitions I've marked on the diagram - the emitted light is depicted as the green oscillations. For the transition n = 2 to n = 1, a K_α photon is emitted, while relaxation from n = 3 to n = 1 produces a K_β photon.

For hydrogen, the transitions terminating at the K-shell produce ultraviolet photons, and are termed the Lyman series (the Balmer series are the transitions terminating at the L-shell (n = 2), and so on). For larger atoms, the K-photon energies are in the x-ray range. Because of the discrete nature of the energy levels participating in these fluorescence events, a fluorescence spectrum for a pure metal will consist of a series of very sharp lines, whose energies are unvarying properties of the atoms of the metal. Here is a fluorescence spectrum I measured for a pure sample of tin using my borrowed cadmium-telluride (CdTe) spectrometer. The tin was exposed to the photon flux coming from my tungsten x-ray tube, and the fluorescence was collected in a 90° back-scattering geometry:

The energies of these peaks can be looked up (for example, in tables in this x-ray data booklet, from Lawrence Berkeley Lab) and compared to the channels at which the recorded signal peaks. Repeating for several fluorescent metals (zinc, zirconium, and tungsten, in my case) gives a series of channel-energy pairs, which can be fitted with some calibration model using maximum likelihood, or some other method. The spectrometer I was using is quite well designed, with the consequence that a linear fitting model was suitable for finding the expected energy for photons registered in each channel.

Because the measurements are noisy, simply taking the channel at which the signal is maximum is not the best way to find the peak channel. To find the peak channel, then, the fluorescence spectra were fitted with reasonable line shape functions, in this case a Gaussian function for each emission line, using maximum likelihood. In each case, the fitting software I used gave an error bar for the fitted mean of each Gaussian, which gives the standard deviation of the assumed Gaussian error distribution for each inferred peak position. From this information, the following table was drawn up:

One thing to notice is that the α and β emissions actually can have substructure, such that for tungsten, three different β lines take part, though two of them are not resolved (that's why I used their average position for the calibration).

The third and fourth columns in the table give the fitted peak positions and their associated standard deviations. The known peak energies are plotted against these fitted peak channels, and fitted with a linear model. The linear model uses the a and b parameters given in the little box on the right of the main table. The sixth column in the table has the values of the linear model at each channel number in the third column.

From my earlier description of parameter estimation, the joint likelihood function for any set of model parameters, θ, can often be calculated from

P(D | θ, I) ∝ exp[ -½ Σᵢ (dᵢ - yᵢ)² / σᵢ² ]

where the d's are the data (the known peak energies, corresponding to the measured channels), and the y's are the model values. We can therefore maximize the likelihood function by minimizing the sum of the squared residuals, divided by the square of the standard deviation (from column 4). These weighted residuals are in the last column, and their sum is given as the χ² parameter, in the little box. This χ² is optimized numerically (it can also be done analytically, using linear algebra), by adjusting the a and b parameters until χ² is at its minimum. The resulting fit is given in the table, and is shown in the plot below:

Each data point has been plotted with its associated error bar, but most of the error bars are smaller than the data markers.
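For a straight-line model, the χ² minimization can also be carried out in closed form rather than numerically. The sketch below is a minimal Python version of that weighted least-squares calculation, assuming the model energy = a × channel + b. The function name and the numbers fed to it are hypothetical, and are not the values from the table.

```python
def weighted_linear_fit(channels, energies, sigmas):
    """Fit energy = a * channel + b by minimizing the weighted chi-squared.

    Each squared residual is divided by the square of its standard deviation.
    Returns (a, b, chi2).
    """
    weights = [1.0 / s**2 for s in sigmas]
    S = sum(weights)
    Sx = sum(w * c for w, c in zip(weights, channels))
    Sy = sum(w * e for w, e in zip(weights, energies))
    Sxx = sum(w * c * c for w, c in zip(weights, channels))
    Sxy = sum(w * c * e for w, c, e in zip(weights, channels, energies))
    # Solve the two normal equations for the slope a and the offset b.
    det = S * Sxx - Sx**2
    a = (S * Sxy - Sx * Sy) / det
    b = (Sxx * Sy - Sx * Sxy) / det
    chi2 = sum(w * (e - (a * c + b))**2
               for w, c, e in zip(weights, channels, energies))
    return a, b, chi2


# Hypothetical, noise-free calibration points lying exactly on a line.
channels = [100.0, 200.0, 300.0, 400.0]
energies = [0.1 * c + 0.5 for c in channels]
sigmas = [0.1, 0.2, 0.1, 0.3]
a, b, chi2 = weighted_linear_fit(channels, energies, sigmas)
```

With real data, the points scatter about the line, and the minimum χ² gives a rough indication of whether a linear model is adequate.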

That was the easy part. The difficulty appears when we look to see if there is any systematic distortion of a measured spectrum - that asymmetry in P(energy | channel) I was talking about. Take a look at this spectrum I measured directly for the tungsten x-ray tube:

The spectrum has quite a bit of structure, and most of it reflects the true nature of the source very well. The x-rays are produced by firing a stream of electrons at a tungsten target. In this case, the electrons are accelerated by a 100 kV potential difference (giving them each exactly 100 kilo electron volts of energy). These electrons can continuously lose energy as they fly through the metal, causing a continuum of radiation to be emitted. That's the main, broad peak in the spectrum, known as 'bremsstrahlung.'

These accelerated electrons can also knock away inner electrons from the tungsten atoms, leading to rearrangement of the outer electrons, and associated fluorescence emissions, exactly as described above for photo-absorption. At a little over 10 keV, there are two sharp emission lines corresponding to the L-shell characteristic fluorescence from the tungsten in the x-ray tube. At about 59 and 67 keV, two more groups of lines appear, due to the K-shell transitions for tungsten.

There are, however, some step-like artifacts in the spectrum, at about 27 and 32 keV, which are not characteristics of the spectrum from the x-ray tube. Instead, these are properties of the detector. These energies happen to match the K-edges for the cadmium and tellurium atoms in the detector, and it's a safe bet that these step-like drops in intensity are due to fluorescent photons carrying absorbed energy out of the detector, before it gets a chance to be collected at the readout electrode. In the next part, I'll describe my efforts so far to correct such effects, by calculating sampling distributions for these and a number of other spectral distortion mechanisms that I confidently believe to be influencing the detected signal.

Big thanks to Charles Willis and Bill Erwin at M.D. Anderson for lending me their spectrometer. It's a nice piece of kit.

The Exponential Distribution

2014-04-26T00:37:00.002-05:00

The exponential distribution holds a special significance for me. My PhD thesis was all about optical transients, the simplest mathematical models of which are exponential distributions. Currently, I work in x-ray science, which is heavily concerned with the depletion of an (x-ray) optical field as it traverses some distribution of matter (both in an object being imaged, and in the detector) - this time the exponential distribution is over space, rather than time, but the mathematics is the same.

Any kind of involvement with mathematical science quickly brings us into intimate contact with exponential functions, as these arise left, right, and centre, in the solutions of differential equations. The reason for this is related to the fact that the exponential is the only mathematical function (up to a constant factor) that is its own derivative. This is closely related to a special property of the exponential distribution, known as memorylessness (what will happen next - its rate of change - is entirely governed by the current state). So let's take a quick look into how the exponential distribution comes about, and what its major characteristics are.

Imagine a stream of photons incident on some distribution of matter. It's no surprise to learn that some of those photons are going to be absorbed or scattered, so that they no longer continue on their original path. The number that are scattered will depend on the thickness of the matter that they pass through, which is why, on a foggy day, things that are close to you can be easily seen, things not too far off can be somewhat made out, while objects a bit further away can't be seen at all. We'd like to know exactly what the dependence on distance is.

Let's denote as P_L(U) the probability (dependent on length, L) that a photon will remain unabsorbed by its surrounding medium. For any infinitesimally thin strip of that medium (at a depth L into the medium), the probability to be absorbed at that location is the product P_L(U) × P(A | U), where P(A | U) is the probability to be absorbed in that strip, given that it was not absorbed in any earlier strip. This follows from the product rule applied to the necessary conjunction, 'unabsorbed, before now' AND 'absorbed here', required for an absorption to occur at a particular place. The probability P(A | U) is independent of where the photon has been up to now - adding the U after the vertical bar ensures this. There is no physical reason for P(A | U) to depend on the photon's history, and this is the property of memorylessness I mentioned a moment ago. To put it another way, we are dealing with a Markov process, which can be a useful fact to remember.

Because P(A | U) is unchanging, we have invented a special symbol for it, μ, which we call the absorption coefficient. As each consecutive layer of the absorbing medium is traversed by the photon, the probability for the photon not to have been absorbed is reduced by the amount, P_L(U) × μ (from the sum rule). Or, to put it another way, the rate of change of P_L(U) with respect to the path length traversed is:

dP_L(U)/dL = -μ P_L(U)    (1)

We can rearrange this equation, then take the integral of each side:

∫ dP_L(U) / P_L(U) = -μ ∫ dL    (2)

The left-hand side is solved using item 5 in my table of integrals, while the right-hand side is given by item 3:

ln(P_L(U)) = -μL + constant    (3)

As always, ln(.) represents the natural logarithm. Since this equation is true for all distances, we can form equations for distances, L and 0, and then subtract the 2 equations:

ln(P_L(U)) - ln(P_0(U)) = -μL    (4)

From the laws of logs, this becomes

ln(P_L(U) / P_0(U)) = -μL    (5)

or, taking the exponential of each side

P_L(U) / P_0(U) = I / I₀ = exp(-μL)    (6)

where we have finally identified the proportionality of P(U) and the intensity, I, of the optical field. This describes the exponential decay of the photon flux. This equation is called the Beer-Lambert Law. P_L(U) is not a probability distribution over L, however, as the set of propositions, unabsorbed at L₁, unabsorbed at L₂, ... etc., are not exclusive. P_L(A), though, is a distribution over a set of disjoint (non-overlapping) propositions (a photon can not be absorbed in more than one place), and as we found, is proportional to P_L(U). As noted above, the constant of proportionality is μ, so the absorption probability density as a function of distance, L, is (setting the photon's initial existence probability, I₀, to 1):

P_L(A) = μ exp(-μL)    (7)

It's easy enough to verify that this function is normalized (e.g. check Eq. 12, for L = ∞).
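The layer-by-layer argument can be checked numerically. The following is a minimal Python sketch, assuming a hypothetical attenuation coefficient and thickness: it removes a fraction μ × dL of the surviving probability at each thin layer, and compares the result with the exponential survival probability, exp(-μL).

```python
import math


def survival_by_layers(mu, length, n_layers):
    """Survival probability after n_layers thin strips of thickness dL."""
    dL = length / n_layers
    p_unabsorbed = 1.0
    for _ in range(n_layers):
        # Each strip reduces P(U) by the amount P(U) * mu * dL.
        p_unabsorbed -= p_unabsorbed * mu * dL
    return p_unabsorbed


mu = 0.02       # hypothetical attenuation coefficient, in mm^-1
length = 50.0   # hypothetical thickness, in mm
stepwise = survival_by_layers(mu, length, 100_000)
closed_form = math.exp(-mu * length)
```

As the strips are made thinner, `stepwise` approaches `closed_form`, and the fraction absorbed somewhere within the thickness is one minus this survival probability.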

In general, if a parameter, x, is assigned an exponential distribution, with decay constant, λ, then the normalized PDF is

f(x) = λ exp(-λx),  x ≥ 0    (8)

To maintain unit consistency, the units of λ are the inverse of the units of x. If x is distance, in mm, then λ has units mm^-1. If x is time, in seconds, then λ is a rate (or frequency) with units s^-1.

Below, I've plotted an exponential decay (not normalized), following exp(-x/300), from x = 0 to 1000:

We can visualize the memorylessness of the thing, and appreciate how some of the exponential distribution's spooky symmetry comes about by starting at any point further along the x-axis and advancing along x by a distance of another 1000 units, and expanding the y-axis to fill the same amount of space on the screen. Below, I chose to start at x = 900, near the end of the previous plot. The curve looks identical to before. Note that the numbers on the x- and y-axes are different, but the functional form is the same. It is as if those first 900 units on the x-axis had never happened.
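The same point can be made with a few numbers instead of a plot. This short Python sketch uses the exp(-x/300) curve from above and compares the fractional drop over an interval of 250 units (an arbitrary choice of mine) when starting at x = 0 and at x = 900.

```python
import math


def decay(x, scale=300.0):
    """Unnormalized exponential decay, exp(-x / scale)."""
    return math.exp(-x / scale)


interval = 250.0  # arbitrary interval length, chosen for illustration
ratio_from_0 = decay(0.0 + interval) / decay(0.0)
ratio_from_900 = decay(900.0 + interval) / decay(900.0)
```

The two ratios agree to within floating-point rounding: the fractional drop depends only on the interval, not on the starting point.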

Any two-level, time-invariant decay process is exponential. The photon is a two-level system: it goes from unabsorbed to absorbed, then it's game over, and as long as its environment isn't changing, it exhibits the required temporal symmetry. A radioactive nucleus is a similar two-level system - not decayed followed by decayed. Very many other physical systems follow the same pattern. The process is still exponential if there are several decay channels between the two levels of the system. More complex dynamics can be described by various combinations of exponential functions.

Beyond photons and atoms, many other phenomena are exponential. Even some human affairs, such as the time that a hospital bed remains occupied, follow this remarkable formula.

The mean of the exponential distribution is obtained in the usual way, by evaluating the definite integral from 0 to ∞ (the exponential distribution has no density below x = 0):

⟨x⟩ = ∫₀^∞ x λ exp(-λx) dx    (9)

This can be tackled easily using integration by parts, yielding

⟨x⟩ = 1/λ    (10)

In another amazing display of symmetry, the standard deviation for the exponential distribution is the same as the mean:

σ = 1/λ    (11)
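A quick Monte Carlo check of these two results, as a minimal Python sketch: it draws samples with the standard library's `random.expovariate`, then compares the sample mean and standard deviation with 1/λ. The decay constant, sample size, and seed are arbitrary assumptions.

```python
import random
import statistics

rate = 0.25                # hypothetical decay constant, lambda
rng = random.Random(0)     # fixed seed, so the run is repeatable
samples = [rng.expovariate(rate) for _ in range(100_000)]

sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
expected = 1.0 / rate      # both the mean and the standard deviation
```

Both `sample_mean` and `sample_std` should land close to `expected`, with a scatter that shrinks as the number of samples grows.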

Obtaining the cumulative distribution function for the exponential distribution is as easy as it ever gets. Where f(L) = μ × exp(-μL) was the probability for a photon to be absorbed at L (Eq. 7), recall that exp(-μL) was also the probability for the photon to be unabsorbed prior to reaching L. But the statement 'unabsorbed up to L' is complementary to the statement, 'absorbed anywhere between 0 and L,' so the CDF is simply

F(L) = 1 - exp(-μL)    (12)

When an electron in an atom is given a jolt of extra energy, and promoted to a higher orbital, the time for which it stays in that high energy state, before relaxing down to its equilibrium state, also follows the exponential distribution. The average lifetime of the excited state is 1/λ, which is termed the time constant, τ, and the evolution of an ensemble of N excited atoms is written

N(t) = N(0) exp(-t/τ)    (13)

It is straightforward to see that τ is the expected time it takes for the number of excited atoms to fall to 1/e times the initial number.

Radioactive nuclei are more usually characterized by their half life, T_1/2, rather than τ. The half life is the time it takes N(t) to reach half its initial value. It is the median of the exponential distribution, as can be seen directly from Eq. 12. It is found easily by setting t = T_1/2 in Eq. 13: N(T_1/2) / N(0) = exp(-T_1/2/τ) = 1/2, and solving:

T_1/2 = τ ln(2)    (14)

This particular formula highlights the general difference between the mean and the median: the mean is the centre of mass (depending on the sum of the products of mass times distance), while the median is the point at which the mass to the left equals the mass to the right (depending only on the sum of the masses).

Note: we mustn't fall into the trap of thinking that after 2 half lives, all the radioactive nuclei will have decayed. Remember, the process is memoryless - in 2 half lives, the population drops to one quarter, in 3 half lives it drops to one eighth, and so on.
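The halving pattern is easy to confirm numerically. This minimal Python sketch assumes a hypothetical time constant, converts it to a half life using T_1/2 = τ ln(2), and evaluates the surviving fraction exp(-t/τ) after 1, 2, and 3 half lives.

```python
import math


def half_life(tau):
    """Half life corresponding to a time constant tau."""
    return tau * math.log(2.0)


def fraction_remaining(t, tau):
    """Fraction of the initial population still undecayed at time t."""
    return math.exp(-t / tau)


tau = 10.0  # hypothetical time constant, arbitrary time units
t_half = half_life(tau)
fractions = [fraction_remaining(n * t_half, tau) for n in (1, 2, 3)]
```

The entries of `fractions` come out as one half, one quarter, and one eighth, to within rounding.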

Of further interest is that for any continuous parameter restricted to non-negative values, with a specified mean, the exponential distribution has the property of maximum entropy.

Update, 05/12/2021

Years later, looking over this post again, I see that there is a crucially important extension of these concepts, that I somehow failed to include. For completeness, it makes sense to add it on now.

Going back to the example of photons undergoing absorption in some medium, we have given the probability that a photon will be absorbed at some location given that it has remained unabsorbed prior to reaching that location, P(A | U), a special symbol, μ. We also term this a 'linear attenuation coefficient.' The reason we use the word 'linear' is not only good to know in general, but also allows simplification of our calculations when multiple interaction processes are at work.

Suppose that a region of space is occupied by two different species of atoms, 1 and 2, and that they are each evenly distributed over that region. Their respective photo-absorption processes, we'll denote A₁ and A₂.

Recall that P_L(A) = P_L(U)P(A | U).

Now, however, we note that the proposition, 'absorbed' is the disjunction 'absorbed by an atom of type 1' OR 'absorbed by an atom of type 2.'

Thus,

P(A | U) = P(A₁ | U) + P(A₂ | U)

from the extended sum rule for disjoint propositions (a photon will never be absorbed by different atoms).

Consequently,

P_L(A) = P_L(U) [ P(A₁ | U ) + P(A₂ | U) ]

P_L(A) = P_L(U) . (μ₁ + μ₂) (15)

Equation (15) has a couple of important consequences:

If we have 2 or more attenuation mechanisms present, the overall attenuation is that obtained by adding the individual attenuation coefficients. We draw attention to this convenient fact by referring to the coefficients as 'linear.' This is also why a 2-state system remains exponentially distributed, regardless of how many ways there are to switch from 'upper' to 'lower' state.
If atoms 1 and 2 are actually of the same type, and I have simply doubled the number of atoms within the fixed volume, then the modified attenuation coefficient is simply double that in the previous case. Obviously, this generalizes to any change in density of the medium. Whatever factor we change the density by is the same factor we must modify μ by.
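The additivity of the coefficients can be illustrated with a small simulation. In this minimal Python sketch, each photon is given an independent, exponentially distributed candidate absorption depth for each species, and it is absorbed by whichever comes first. That 'first one wins' picture is my modelling assumption, and the coefficient values are hypothetical. If the coefficients add, the mean absorption depth should be 1/(μ₁ + μ₂).

```python
import random
import statistics


def absorption_event(mu_1, mu_2, rng):
    """Return (depth, species) for one photon, with two competing absorbers."""
    depth_1 = rng.expovariate(mu_1)
    depth_2 = rng.expovariate(mu_2)
    # The photon is absorbed by whichever species gets to it first.
    return (depth_1, 1) if depth_1 < depth_2 else (depth_2, 2)


mu_1, mu_2 = 0.03, 0.01    # hypothetical coefficients, in mm^-1
rng = random.Random(42)    # fixed seed, so the run is repeatable
events = [absorption_event(mu_1, mu_2, rng) for _ in range(100_000)]

mean_depth = statistics.fmean([depth for depth, _ in events])
fraction_by_1 = sum(1 for _, species in events if species == 1) / len(events)
expected_depth = 1.0 / (mu_1 + mu_2)
```

In the same picture, the share of absorptions due to species 1 should be close to μ₁ / (μ₁ + μ₂).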

Whose confidence interval is this?

2014-03-22T03:41:00.001-05:00

This week, yet again, I was confronted by yet another facet of the nonsensical nature of the frequentist approach to statistics. The blog of Andrew Gelman drew my attention to a recent peer-reviewed paper studying the extent of misunderstanding of the meaning of confidence intervals, among students and researchers. What shocked me, though, was not only the findings of the study.

Confidence intervals are a relatively simple idea in statistics, used to quantify the precision of a measurement. When a measurement is subject to statistical noise, the result is not going to be exactly equal to the parameter under investigation. For a high quality measurement, where the impact of the noise is relatively low, we can expect the result of the measurement to be close to the true value. We can express this expected closeness to the truth by supplying a narrow confidence interval. If the noise is more dominant, then the confidence interval will be wider - we will be less sure that the truth is close to the result of the measurement. Confidence intervals are also known as error bars.

Hoekstra et al., the authors of the paper¹, asked students and experienced researchers to mark as true or false a number of statements interpreting the meaning of a confidence interval. The results of their survey appear shocking, with large numbers of wrong answers reported, and veteran researchers apparently doing little better than students yet to receive any formal training in statistics.

This is bad. Confidence intervals are good things. Quantifying our knowledge is exactly what science is about, and an assessment of precision is vital to that process. It goes without saying that understanding what has been assessed is also vital, especially for the people doing the assessing!

Just for fun, then, have a go at the survey questions that formed the basis for the data in the paper:

Professor Bumbledorf conducts an experiment, analyzes the data and reports, "the 95% confidence interval for the mean ranges from 0.1 to 0.4." Which of the following statements are true:

1. The probability that the true mean is greater than 0 is at least 95%.
2. The probability that the true mean equals 0 is smaller than 5%.
3. The null hypothesis that the true mean equals 0 is likely to be incorrect.
4. There is a 95% probability that the true mean lies between 0.1 and 0.4.
5. We can be 95% confident that the true mean lies between 0.1 and 0.4.
6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.

The results of the survey are given in the table below. The numbers are the proportion of survey participants who assessed each item to be true:

Item	1st year students (n = 442)	masters students (n = 34)	researchers (n = 118)
1	51%	32%	38%
2	55%	44%	47%
3	73%	68%	86%
4	58%	50%	59%
5	49%	50%	55%
6	66%	79%	58%

So how do you think you did, in comparison to the survey participants?

According to the authors of the paper, all six statements about the confidence interval are false.

Did you do a double take just now? Did you feel momentarily confused, perplexed, humiliated? I hope so. I certainly felt confused, when I read the study description. It seemed to me, on examining the 6 statements, that exactly one of them was false. Go on, take another scan through the list, see if you can pick out the one I identified as false ......

I'll tell you in a moment.

Though Gelman's assessment of the study had been vaguely endorsing of its message, I inferred the study's design to be very seriously flawed. Could I have been badly confused? Did I really know the technical definition of the confidence interval?

Helpfully, the authors of the paper have supplied that for us:

[If] a particular procedure, when used repeatedly across a series of hypothetical data sets, yields intervals that contain the true parameter value in 95% of cases ... the resulting interval is said to be a 95% CI.

'CI,' of course, means 'confidence interval.' This description pretty much meets my expectations. To be sure, I checked a few other sources, and they match this definition perfectly.
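The quoted definition describes a property of the procedure under repetition, which is easy to simulate. The following is a minimal Python sketch under assumptions of my own choosing: normally distributed noise with a known standard deviation, and the interval 'sample mean ± 1.96 σ/√n'. It repeats a hypothetical experiment many times and counts how often the interval contains the true mean.

```python
import math
import random


def coverage(true_mean, sigma, n_points, n_experiments, seed=0):
    """Fraction of simulated 95% intervals that contain true_mean."""
    rng = random.Random(seed)
    # Known-sigma normal interval: sample mean +/- 1.96 * sigma / sqrt(n).
    half_width = 1.96 * sigma / math.sqrt(n_points)
    hits = 0
    for _ in range(n_experiments):
        data = [rng.gauss(true_mean, sigma) for _ in range(n_points)]
        estimate = sum(data) / n_points
        if estimate - half_width <= true_mean <= estimate + half_width:
            hits += 1
    return hits / n_experiments


# Hypothetical experiment: true mean 0.25, noise sigma 0.3, 10 points each.
fraction_covering = coverage(0.25, 0.3, n_points=10, n_experiments=20_000)
```

The returned fraction should sit close to 0.95. The sketch only demonstrates the repeated-sampling property in the definition. It takes no position on how a single interval should be interpreted.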

So, armed with this technical information, let's work our way through the list. The job is easier, I feel, if we start at item number 4: "There is a 95% probability that the true mean lies between 0.1 and 0.4."

Some basics, to help us out: imagine an urn (an opaque jar) filled with balls of 2 different colours. Suppose there are 100 balls in total, 95 of which are green, the remaining 5 being red. I insert my hand into the urn and blindly extract a ball. What is the probability that the extracted ball is green? Of course, P(green) = 0.95. This follows from very basic symmetry considerations. The indifference principle gives us equal probability to draw any of the balls. Applying the extended sum rule to this result trivially gives us 0.95 as the probability to draw one of the green balls. This result is readily generalized, and is known as the Bernoulli urn rule. It is one of the earliest results of probability theory.

We can treat our experiment like an urn. The experiment is a data-generating process, just like the urn. It spits out a sample - not a ball this time, but a sample of data points from a noisy distribution - a sample of data points with an associated confidence interval. We have from the authors' own pen: the 95% CI is produced by a procedure that yields intervals containing the true parameter value on 95% of occasions. So here is the question: what is the probability that the confidence interval we obtained is one of the 95% that contain the true parameter value? Trivially, it is 0.95, a.k.a. 95%, and statement number 4 on the survey is true.

With number 4 settled, number 1 is also trivially true. Since there is 95% probability that the true parameter value lies between two positive numbers, there cannot be less than 95% probability that the true parameter value is greater than 0.

Number 2 must also be true, for similar reasons. (In fact, if the parameter space is continuous, then the probability to be some discrete value is zero, anyway.)

If item 2 is true, then I interpret 3 to be also true - probability less than 5% is my idea of unlikely.

Item 5 is somewhat vague, but to my mind it is the same as number 4. Probability provides a numerical measure of confidence.

That leaves number 6, "if we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4." This statement is absurdly false. The true mean does not move around. If I am repeating the experiment, i.e. measuring the same parameter again, then its true value is the same as previously. Oddly, this uniquely false statement from the list received the second highest level of endorsement from the survey participants (a strong majority, in fact). There is clearly something to the paper's claim of 'robust misinterpretation of confidence intervals'.

But how could the authors have been so wrong about the other 5 items?

In the frequentist tradition, a probability is a frequency. The probability that a tossed coin lands heads up is 0.5 exactly because in a large number of tosses, half of them will come up heads. Actually, it probably won't be exactly half that are heads, but we might momentarily overcome the resulting feeling of queasiness this definition produces, to consider it as a serious candidate for an understanding of probability.

The big problem we quickly hit, though, becomes apparent when we ask a perfectly reasonable question like, what is the probability that the universe is between 13.7 and 13.9 billion years old? There is no frequency with which this is true; its truth does not vary. Thus, in the frequentist tradition, facts do not have associated probabilities, because facts are either true or false. One really has to wonder, then, what it is the frequentists think they are assigning probabilities to. In this tradition, therefore, one can not say that some parameter lies in some interval with some probability. It either does or it doesn't.

This raises an obvious question: if the frequentist is barred from calculating the probability that a parameter lies in some interval, how can they calculate their confidence intervals, which, as I showed, amount to the same thing? How can they effectively say that the confidence interval from a repeated experiment will probably contain the parameter's true value? The fact is, they can't. Not without cheating. Not without grossly violating their own system.

The Wikipedia page for Confidence Interval has a simple example of a frequentist calculation of a 95% confidence interval. I actually don't mind this calculation, I think it's a reasonable way (under the right circumstances - i.e. normal approximation is valid) to estimate the precision of a parameter estimate. But, not surprisingly, the calculation produces an equation of the form (where θ is the parameter being estimated, and x₁ and x₂ are the limits of the confidence interval)

P(x₁ ≤ θ ≤ x₂) = 0.95

Guess what, this is a probability assignment about the value of θ. Something the frequentist system does not allow. Even though x₁ and x₂ have been calculated from the current estimate for θ, the Wikipedia article currently includes the somewhat ad hoc looking statement

This does not mean that there is 0.95 probability of meeting the parameter [θ] in the interval obtained by using the currently computed value of the [estimate for θ].

The offending expression is made not a probability (and therefore not a violation of frequentist dogma) by simply declaring it so. Yay! That was easy.

Of course, to get any probability assignment, the frequentist must, just like anybody else, assume (explicitly or implicitly) a prior distribution, which also violates the frequentist methodology. A safer way to get point estimates and confidence bounds, therefore, involves explicitly formulating a suitable prior, and then operating on a posterior distribution, obtained from Bayes' theorem. If you'd like to see a simple example of such a calculation of a confidence interval, you could try my earlier article on nuisance parameters.

References

[1]	R. Hoekstra, R.D. Morey, J.N. Rouder, and E.-J. Wagenmakers, 'Robust misinterpretation of confidence intervals', Psychonomic Bulletin & Review, January 2014 (link)

The Full Adder Circuit

2014-02-27T21:10:00.002-06:00

I recently wrote a very brief introduction to Boolean algebra for the glossary, so I thought it would be worth describing a very simple but important application example. There are two main reasons why I'm interested in Boolean algebra. The first is that in probability theory, the hypotheses we investigate are assumed to be Boolean in character (true or false, with no intermediates allowed). The second is that Boolean algebra is an important branch of logic, and therefore intimately linked to science and rationality.

In an earlier post, I discussed how all transfer of information comes down to a sequence of answers to yes/no questions. In this spirit, therefore, consider the following:

By answering only yes/no type questions, calculate the sum 234 + 111. In other words, if you were a digital computer, how would you perform this calculation?

Expressing numbers by answering yes/no questions

The numbers 234 and 111 are expressed (though I haven't specified it) in the conventional base 10 form. You're probably at the very least dimly aware that to solve this problem, we'll need to convert these numbers to base 2, or binary form.

In base ten, each digit is the answer to a question, 'how many multiples of 10 to the power n are there?', where n is one subtracted from the digit's place in the number. For instance, in 234, the 4 is in the first place and therefore signals how many times 10⁰ appears in the number (once all the higher powers of 10 have been accounted for). Any number greater than 0 raised to the power 0 is 1, so 4 times 10⁰ is just 4. Similarly, 3 times 10¹ is 30, and 2 times 10² is 200. Add them all up, and you get 234.

Actually, in technical circles, what I referred to as the first digit is normally called the zeroth digit - digits are counted the same way Europeans count the floors of a building (Europeans are more logical than Americans!). Therefore, the 3 in 234 is in the first position and the 2 is in the second position. From here on, I'll switch to the standard zero-based indexing. In this nomenclature, the power to which the base is raised is the relevant index, and not 1 subtracted from the index.

Expressing a number in binary uses almost exactly the same scheme, except that now each digit answers a yes/no type question. So if the number (base 10) is 15, we could start by asking 'is it larger than or equal to 2⁴?' The answer is no, so a 0 goes into the left-most (in this case the fourth) digit. Next, 'does it contain 2³?' Yes, 2³ is 8, and 8 is less than 15, so the next digit is a 1: 01.

Having accounted for those 8, that leaves 7 remaining to be accounted for. Of those remaining 7, is there a 2² present? Yes: 011. And 3 left over. Of those remaining 3, is there a 2¹ present? Yes: 0111. And there is also a 2⁰ left, so we end up with 01111. That's 15 expressed in binary.
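The sequence of yes/no questions translates directly into a loop. Here is a minimal Python sketch (the function name and the fixed number of digits are my own choices): working from the highest power of 2 downward, it asks whether that power is present in what remains, and records a 1 or a 0 accordingly.

```python
def to_binary(number, n_digits):
    """Express a non-negative integer in binary by yes/no questions."""
    if number < 0 or number >= 2**n_digits:
        raise ValueError("number does not fit in the requested digits")
    digits = []
    remaining = number
    for power in range(n_digits - 1, -1, -1):
        # 'Is there a 2**power present in what remains?'
        if remaining >= 2**power:
            digits.append("1")
            remaining -= 2**power
        else:
            digits.append("0")
    return "".join(digits)


fifteen = to_binary(15, 5)        # the worked example above
first_input = to_binary(234, 8)
second_input = to_binary(111, 8)
```

Here `fifteen` is '01111', matching the worked example, while 234 and 111 become '11101010' and '01101111'.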

Adding binary numbers

A device that adds together just two binary digits is called a half adder circuit. We'll understand why in a moment. Its inputs are I₁ and I₂. If I₁ is 1 and I₂ is 0, then the sum is 1. Similarly, if the values are reversed. If, however, the inputs are I₁ = 1, I₂ = 1, the sum (in decimal) is 2, which in binary is 10. We can't express this with a single output bit so we need two outputs, which we call 'sum' and 'carry'. In this case, the sum is 0 and the carry is 1.

This half adder circuit is not sufficient to add together strings of more than 1 bit. Suppose we are adding 11 and 01. For the ~~first~~ zeroth (right-most) digit, the sum is 0, and the carry is 1. If we use another half adder to add the ~~second~~ first digits, we get 1, giving as combined output the string 10. Obviously, the output should be larger than the inputs, when both numbers are non-zero and positive. The reason we fell short is that we did not include the carry from summing the zeroth digit. Thus we need a circuit with three inputs, I₁, I₂, and carry_in. This new circuit is called a full adder.
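To see the shortfall concretely, here is a rough Python sketch of a half adder, followed by a naive attempt to add two bit strings with half adders alone. Writing the sum as exclusive-or and the carry as 'and' is the standard way to express a half adder; the helper `add_without_carry` is a hypothetical name introduced only for this demonstration.

```python
def half_adder(i1, i2):
    """Add two bits; return (sum, carry)."""
    return i1 ^ i2, i1 & i2


# The four possible input combinations
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, half_adder(i1, i2))


def add_without_carry(a, b):
    """Add equal-length bit strings digit by digit, discarding every carry."""
    return "".join(str(half_adder(int(x), int(y))[0]) for x, y in zip(a, b))


print(add_without_carry("11", "01"))  # 10, whereas the true total is 100
```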

We don't need to know physically how such a circuit might be implemented (of course, digital electronics implements all logical operations using networks of transistors), but we do need to work out the logic of its operation. We can do this by drawing up a table.

I₁	I₂	carry_in	decimal result	binary result	sum	carry_out
0	0	0	0	0	0	0
0	0	1	1	1	1	0
0	1	0	1	1	1	0
0	1	1	2	10	0	1
1	0	0	1	1	1	0
1	0	1	2	10	0	1
1	1	0	2	10	0	1
1	1	1	3	11	1	1

All possible combinations of the 3 input bits are given in the 3 left-most columns. The fourth column states the standard decimal results when each of these combinations of 0's and 1's are added up. The fifth column converts these decimal results to binary. Finally, the last 2 columns are the zeroth (right-most) and the first digits of this binary result (the sum and carry_out, respectively).

We need two logical expressions. One to give us the 'sum' output bit, and another to yield the 'carry_out' output bit.

Let's start with the sum. There are four cases where this bit is 1, so the logical expression we need is the disjunction (A or B or C... ) of these four cases. Each case is a conjunction of some set of values for each of the 3 input bits. For example, the first case where the sum bit is 1 is "I₁ is false and I₂ is false and carry_in is true," which I'll denote as $\bar{I}_1 \bar{I}_2 C$, where an overbar marks an input that is false and C stands for carry_in.

Denoting disjunction as '+', then the full expression we're after is:

$$\text{sum} = \bar{I}_1 \bar{I}_2 C + \bar{I}_1 I_2 \bar{C} + I_1 \bar{I}_2 \bar{C} + I_1 I_2 C$$

Similarly, for the carry_out bit we have:

$$\text{carry\_out} = \bar{I}_1 I_2 C + I_1 \bar{I}_2 C + I_1 I_2 \bar{C} + I_1 I_2 C$$
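The two expressions can be read straight off the rows of the table where the relevant output is 1. Below is a minimal Python sketch that writes them out as a disjunction of conjunctions and checks every row against ordinary arithmetic. The variable names (`n1`, `n2`, `nc` for the negated inputs) are my own shorthand.

```python
from itertools import product


def full_adder(i1, i2, c):
    """Return (sum, carry_out) as a disjunction of conjunctions."""
    n1, n2, nc = 1 - i1, 1 - i2, 1 - c   # negated inputs
    # Rows of the table where sum is 1: 001, 010, 100, 111
    sum_bit = (n1 & n2 & c) | (n1 & i2 & nc) | (i1 & n2 & nc) | (i1 & i2 & c)
    # Rows of the table where carry_out is 1: 011, 101, 110, 111
    carry_out = (n1 & i2 & c) | (i1 & n2 & c) | (i1 & i2 & nc) | (i1 & i2 & c)
    return sum_bit, carry_out


for i1, i2, c in product((0, 1), repeat=3):
    s, carry = full_adder(i1, i2, c)
    # carry_out is the first digit and sum the zeroth digit of the total
    assert 2 * carry + s == i1 + i2 + c
    print(i1, i2, c, s, carry)
```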

To add two strings of binary digits, then, start with I₁ and I₂ equal to the zeroth (right-most) digits in each string. The initial carry_in is 0. The sum and carry_out bits are calculated according to the above two expressions. For the next step, I₁ and I₂ are assigned the values of the first digits (next to right-most position) in each of the input strings, and the carry_in is now assigned the carry_out value from the previous step. Finally, the result is the string generated by all the sum bits. We need a cascade of 9 full-adder circuits to perform the required logic.
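The cascade just described can be sketched in Python as follows. For brevity the full adder here is computed arithmetically rather than from the logical expressions, and the function name `add_binary` and its `n_bits` parameter are assumptions made for this example. Any carry out of the last stage is simply dropped, so the inputs must be small enough for the result to fit.

```python
def full_adder(i1, i2, carry_in):
    """Return (sum, carry_out) for three input bits."""
    total = i1 + i2 + carry_in
    return total % 2, total // 2


def add_binary(a, b, n_bits=9):
    """Add two bit strings with a cascade of n_bits full adders."""
    a_bits = a.zfill(n_bits)
    b_bits = b.zfill(n_bits)
    carry = 0                      # the initial carry_in is 0
    sum_bits = []
    for k in range(n_bits):        # k = 0 is the zeroth (right-most) digit
        i1 = int(a_bits[-1 - k])
        i2 = int(b_bits[-1 - k])
        s, carry = full_adder(i1, i2, carry)
        sum_bits.append(str(s))
    return "".join(reversed(sum_bits))


# 234 is 011101010 and 111 is 001101111 in nine binary digits
result = add_binary("011101010", "001101111")
print(result, int(result, 2))  # 101011001 345
```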

I made a spreadsheet implementation of this procedure, which can be found here. You don't have write permission for that file, but if you save your own copy, then you shouldn't have any problem experimenting with it.

And that's all we need in order to do logical computations. We can gain some efficiency improvements by applying axioms and theorems of Boolean algebra to simplify the logical expressions we obtain (particularly when we have more complex expressions). This is called Boolean minimization (interestingly enough, this is often most easily done using a computer). But apart from that, the process described here is pretty much all there is to it.
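As a hedged illustration of what such simplification buys, the sketch below compares the four-term expressions for a full adder with two shorter, well-known equivalents: the sum as a three-way exclusive-or, and the carry as 'at least two inputs are 1'. The function names `canonical` and `minimized` are mine, and checking all eight input combinations by brute force stands in for an algebraic proof.

```python
from itertools import product


def canonical(i1, i2, c):
    """The four-term disjunctions read directly off the truth table."""
    n1, n2, nc = 1 - i1, 1 - i2, 1 - c
    s = (n1 & n2 & c) | (n1 & i2 & nc) | (i1 & n2 & nc) | (i1 & i2 & c)
    carry = (n1 & i2 & c) | (i1 & n2 & c) | (i1 & i2 & nc) | (i1 & i2 & c)
    return s, carry


def minimized(i1, i2, c):
    """Shorter equivalent forms of the same two outputs."""
    s = i1 ^ i2 ^ c                           # odd number of 1s
    carry = (i1 & i2) | (i1 & c) | (i2 & c)   # at least two 1s
    return s, carry


same = all(canonical(*bits) == minimized(*bits)
           for bits in product((0, 1), repeat=3))
print(same)  # True
```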

By the way, the answer to the question, 234 + 111:

yes, no, yes, no, yes, yes, no, no, yes
(101011001)

Practical Morality, Part 2

2014-02-02T23:47:00.001-06:00

It has been said that democracy is the worst form of government, except all those others that have been tried.

Winston Churchill

(The second of two parts. Read the first installment here.)

Politics & Science

I have a funny little feeling that Churchill actually knew a small bit about politics. According to dear, old Winston, democracy sucks. But why does it suck? And does it necessarily suck?

A full analysis of these questions could run into thousands of pages, and obviously stretches far beyond any area in which I could claim expertise, but for now at least, I want to point out just one aspect of democracy's poor performance to date that can most definitely be fixed. That is, the failure so far of both politicians and the electorate to explicitly recognize the necessarily rational basis for morality.

Time and again, we see scientific experts consulted in order to obtain the best quality data possible to support some process of policy decision, only to see the elected politicians ignoring what they have been told, in favor of the decision that always suited their prior ideology. This is bad enough, but very often, the scientific analysis is never even sought. Somehow, this is seen by the voting public as acceptable. Worse still, it seems to be often treated as desirable. Certainly, it is something built into the contemporary political culture of many democracies.

This perverse situation is made possible, almost inevitable, in fact, by the widespread, mistaken belief that science has absolutely nothing to say about what is morally desirable. Under this insidious assumption, how could the expert scientist possibly have anything conclusive to say about morality? Morality is not the art of what is, but of what should be done, so evidently, we must enforce a clear division of labour, such that the data gathering is left to the expert scientist, while morality is left to the ethicist and the expert politician. Seriously, it's not as if politics can be reduced to questions of fact, is it?

There, the absurdity of the prevailing position exposed.

This position is so ubiquitous, it seems to be held even by many of the most respected (and powerful) scientists around. For example, in an episode of BBC Radio 4's "The Life Scientific," broadcast on October 2nd, 2012, Mark Walport was interviewed by Jim Al-Khalili. Walport, who at the time was about to assume the position of chief scientific adviser to the British government, spoke about numerous things that made brilliant sense to me, but about 16 minutes in, he was asked what his attitude would be, should his advice be ignored by the politicians. This was his response:

It’s very important for an adviser to distinguish between what is the science, and then recognize that there may be a series of different decisions that you can take, and that’s then politics.

Quite clearly, he is making the point that at some instant, the science ends and then politics takes over, that the politician may choose to ignore the best quality advice (in favour of what? gut feeling? divine inspiration?), and that this is just fine: the scientist, incapable of judging human affairs, must deliver his evidence then keep his mouth shut. A little later, Walport elaborated further on (in his apparent view) the strict divide between science and politics:

That's absolutely right, [politics isn't always based on reason,] politics is based on all sorts of things, it’s based on political ideology, it’s based sometimes on pragmatism, it’s based on choosing which battles to fight and which battles not to fight, but that’s as it were the distinction between scientific advice and political decisions.

From somebody with Walport's scientific credentials, I'd have expected to hear these comments followed by something to the effect that this is wrong, and this culture has to be changed. But it seems that this chief scientific adviser holds the view that this perceived distinction between rationally acquired understanding and the running of a country is right and proper.

It may well be that to some extent during this interview, Walport felt unable to express his true views on this matter, having already seen the outcome when another prominent scientific adviser dared cross the UK government. In 2009 David Nutt was sacked from his position as chairman of the Advisory Council on the Misuse of Drugs, by Home Secretary Alan Johnson. Nutt's apparent crime was to point out that the legal classification of recreational drugs in the UK was incommensurate with the best scientific measures of harm caused by drugs. Johnson's political career received no visible setbacks as a result of this action (and its strong inherent suggestion that he likes to play without the net).

The moment we realize that the question of what ought to be done is a question concerning matters of fact, and that matters of fact can only be answered more reliably when investigated more scientifically, then we begin to wonder how on Earth it can be acceptable for policy decisions affecting potentially millions of people to go against the best quality scientific advice available. What procedure could possibly justify such decisions? (At some point, a decision has been made, without using the decision procedure, that the decision procedure is broken!)

The politicians feel it is appropriate to ignore scientific advice, partly because the top scientists (like Walport) are telling them this is so. They feel they have understanding and expertise that the scientist cannot tap into, because this is the prevailing culture: human needs cannot be assessed by evidence and logic. Politicians are encouraged to invent their own dubious epistemologies, because society persistently fails to recognize the truth about moral realism, and the logical relationship between morality and science.

Within this culture, the politician is free, even expected, to employ his deliberately non-scientific judgement, often citing a mandate from the masses to justify manifestly unsound policies: 'who am I, a servant of the people, to defy popular opinion?' Well, sorry folks, but there are some things you just don't get to vote on. If I'm feeling unwell and go to the doctor, he will not say, "your test results are in, you either have 6 months to live, or it's just a minor cold, you decide!"

Indeed, the elected politician is a servant of the people, and as such, trivially, has a duty to serve the interests of the population. This can only be done when a rational procedure enabling reliable predictions about the social outcomes of policy decisions is utilized. Another historic British statesman, Edmund Burke, said this, in 1774, which is apt:

Your representative owes you, not his industry only, but his judgment; and he betrays, instead of serving you, if he sacrifices it to your opinion.

Honesty As A Meta-Virtue

As I mentioned in Part 1, the possibility of moral relativism scares the living crap out of people. The kind of moral relativism I describe (we've got to be careful, there are other kinds that make little sense) follows as a trivial and necessary consequence of the moral realism I have outlined here and in earlier essays. What makes an act I perform moral is a combination of (a) the likely (real) future state of the world, with versus without the act, and (b) my utility function (the algorithm that assigns for me relative value to different states of existence), which is a real and objective property of the matter that composes my mind. We thus arrive at realism. We also arrive at the obvious conclusion that another decision-making entity with a different utility function, even if placed under identical circumstances to mine, may have radically different actions that count for it as moral.

In short, what makes it moral for me to pursue goal X is the trivial fact that I desire X (supported by a sound rational procedure - this is crucial!). I admit, this does have some chilling-sounding consequences, particularly when the parenthetical qualification is omitted.

Whether it is out of fear for what we ourselves might do should our assessment of value happen to change or out of alarm at the prospect that others might not share the same values as us (I suspect the latter as dominant), a major industry has grown up over the centuries to firmly establish a certain dogma. According to this dogma, the determination of morality is universal and absolute. There is no sense in which X could be moral for me, but immoral for another. In particular, the determination of morality consists of no self-serving component - what you desire counts for nothing, all that matters is the rules (Kant's categorical imperative).

The objective of this dogma is clear: a moral code can be established such that the capacity for a person to think 'outside the box' is completely eliminated. In our fearful state, we might selfishly breathe a sigh of relief, but essentially, it is a technology for destroying a person's autonomy.

In a famous paper of 1972¹, Philippa Foot has exposed the ridiculous absurdity of this dogma. According to the dogma (quoting from Foot):

Actions that are truly moral must be done "for their own sake," "because they are right," and not some ulterior purpose.

But what then motivates a person to be moral?

"Stupid question," snorts the dogmatist, "it is the obvious fact that to be moral is good."

But what if I have no interest in what is good?

"But you ought to be interested in what is good, if you were not, then you would not be a moral person!"

Um...., so the only possible reason to be moral is that it is immoral not to be.

Great. All that remains is to arbitrarily decide what is moral, so we can all toe the line. And it must be an arbitrary decision, mind you, for if it were not, then whatever non-arbitrary procedure might be used would constitute a motivating principle, violating the central moral dogma. In fact, the decision to decide what is moral must also be arbitrary - we could equally arbitrarily decide not to decide this. Now fuck off, and don't ask any more questions.

Ladies and gentlemen, this travesty, this grotesque parody of what the human mind is capable of, remains to this day the orthodox and default view of morality. For centuries, respectable intellectuals have been proudly going round in circles, confidently marching right up their own backsides, with arguments of exactly this type.

Utter nonsense as it is, might we not yet draw comfort from the status quo that this dogma provides? For as long as it is universally accepted, does it not serve well to minimize the incidence of crime and immorality? Let's not be too hasty.

What is universally accepted? Crime is a major problem for society, and crime ain't committing itself! Somebody is breaking the rules, and if we looked inside the mind of a criminal, it seems self-evident that the message we'd receive would be something like: "Frankly, my dear, I couldn't give a toss about being moral or about being good. I don't give a damn about the rules."

The cases where the moral dogma has failed are perhaps exactly the cases where it (or something) was most needed. And I think it's a good bet that in many cases this failure is because what has been unsuccessfully drummed into the potential criminal's head has been such profoundly manifest gibberish, worthless circular garbage, not suitable to convince any self-respecting person with the tiniest inclination towards independent thought. An opportunity to correct an antisocial tendency has been lost, in a way that in hindsight seems almost inevitable.

So here's my crazy proposal: instead of making up any old absurd crap, and teaching that to our kids as the basis for appropriate behaviour, why don't we just tell them the truth? I think this policy has some serious potential advantages.

So maybe you don't care about rules for their own sake. This is good and proper - society needs more free thinkers. Disconnected from any consideration of consequences, rules are nothing more than sounds leaking out of people's mouths, and trails of ink on pieces of paper. But do you care about yourself? Ultimately, this is all you need to care about, in order to be good.

If you actually do care about your own wellbeing, (which a priori you must), then if you are being consistent, you must also care about adopting a sound strategy for achieving your goals, and it turns out that for reasons touched upon in Part 1, such strategies overwhelmingly involve cooperating with other people - fulfilling one's obligations laid out in the social contract. The profound effect of the social contract is that for me, as a fundamentally selfish entity, I do not merely need to act as if I care about other people. Rather, I actually do care, for naturally selected reasons, both biological and cultural. This, we can expect to hold true for the vast majority of humans, under the vast majority of conceivable circumstances - we have solid mathematics (game theory) capable of explaining how this comes about.

It seems to me that what I've just argued cannot be refuted. However, whether people would tend to behave better or worse if such a message were to be adopted as the principal method of teaching morality is ultimately an empirical question. I do not know for certain. I don't claim to know exactly why people misbehave. But that this message ought to be better than the traditional dogma, I consider to be supported by very powerful arguments.

The principal advantages I see for my proposal, as opposed to continued appeal to the absolutist dogma, are (i) honesty, (ii) appeal to self interest, and (iii) coherence. If the main argument used in an attempt to stop me doing something I believe I want to do is a lie, then it seems there is a good chance that I will recognize it as a lie, and ignore it. This seems like a strong general tendency.

Similarly, if what I'm told is the basis for morality sounds suspiciously very much like the incoherent babbling of an imbecile, then I think the risk that I will not follow the prescribed moral code is enhanced. The absolutist dogma manifestly makes no sense, and must be expected to lose significant credibility as a result. For all I know, there may be a large population of law-abiding psychopaths, who avoid antisocial behaviour principally because of their indoctrination since birth in the old moral dogma (this seems to be a major fear people have when I discuss openly recognizing the truth of morality based on self interest), but is there any good reason to think that indoctrination in utter nonsense should be more effective than indoctrination in a moral methodology that makes natural good sense? Realistic moral relativism has the enormous advantage of exactly this kind of coherence, such that we do not need to anesthetize our brains to believe it.

This brings us to a further advantage of honest moral teaching: its success does not depend on the cultivated suppression of free thought and critical evaluation (behaviours above which, few can be ranked as higher virtues). When the people we trust most repeatedly tell us nonsense, and try to pass it off as truth, it is hard not to believe it. But the price of believing manifest gibberish is an internal crisis, known to psychologists as cognitive dissonance. It seems quite reasonable to suppose that a mind committed to holding incoherent propositions as beliefs must become adept at suppressing its ability to recognize that incoherence. I think we can easily anticipate the dangers of this talent.

I opened this essay with a slightly depressing quotation from Winston Churchill about democracy. The ultimate goal, however, of considering moral realism in these two posts has been a fuller democratization of the social contract. With this goal in mind, then, let me end by offsetting that quote with a more positive piece of advice from the same extraordinary man:

Never, never, never quit.

References

[1]	Philippa Foot, "Morality as a system of hypothetical imperatives," Philosophical Review Vol. 81, pages 305 to 316, 1972. (Link)

Practical Morality, Part 1

2014-01-28T23:38:00.000-06:00

(The first of two parts. Part 2 is here.)

The Social Contract

Wherever you are right now, take a quick look around. Do a quick survey of all the stuff you can see. Think about the number of things you have around you that other people have made. If you are in your own home, then great, the experiment works even better - the things around you probably belong to you, you make some kind of use of them, and quite possibly your life would be less satisfying without them. Some of these things may even be, if not essential for life, indispensable for a comfortable modern existence.

Now try to count up the number of these things that you could make yourself. I'm trying this now (the counting, not the actual making), and there is really very little that I could contemplate building, perhaps some of the simpler bits of wooden furniture, if really pushed. Quite possibly (and this is not meant as an insult) there is nothing of the stuff that you have around you right now that you could build yourself.

Alternatively, perhaps you are quite adept with your hands, and there are several things you could put together yourself. But presumably, you would need tools for this. And possibly even tools to make those tools. You would need materials, for the thing you want to make and for the tools to make them.

Even if you have the skill and energy to actually make some of the things you own, starting from nothing but a pile of raw materials (ignoring for the moment the complexity of acquiring that pile of raw materials), you wouldn't be very efficient. So many processes are involved in converting raw materials into desirable objects, and so many of them require expert knowledge, skill, practice, and dedicated specialization, that you could never approach the efficiency of a large community of individuals, each pretty much focused on some limited domain of expertise. Economies of scale emerge naturally from such specialization. If all I do is cut down trees, then before too long, I predict that I am going to be better at cutting down trees than somebody who just cuts down a tree whenever he needs a bit of wood. If all I do is cut down trees, then I can invest a lot into having the best possible tools, tailor made for the job of cutting down trees - I don't need my tools to be also good for digging copper out of the ground, because I don't do that, somebody else does that.

And we haven't even begun to think about anything technologically advanced. Getting your laptop to you, for the low price you paid for it, took the research and development efforts of literally thousands of engineers and scientists, practically all of whom got to a position to be able to do that work by engaging in years of dedicated study during which it was impossible to support themselves through full-time employment. For the technology to reach this stage, it took support structures to enable those people to engage in full-time study, and it took efficient dissemination of knowledge, sufficient to make research results readily available and synthesizable in all corners of the world. It also took stable international trade to have all the rare-earth elements and other necessary materials readily available for the manufacture of this product, and it took robust legal protection of intellectual property, in order for all those R&D hours at the product development stage to be worth the investment.

These mechanisms together with many others, all evolved to help make society function as much as possible to everybody's benefit, are known collectively as "the social contract." They are, for example, what make it reasonable for me to exchange items of real economic value for a few trivial-looking pieces of paper, or a few bits of information on some (to me unknown) computer. They make it possible for me to drive a car in confidence that another vehicle coming towards me will stay on its designated side of the road, allowing us to pass each other without injury. They make it probable that the foods I buy won't poison me, the machines I use won't kill me, and the politicians I vote for won't throw me in jail if I refuse to vote for them next time.

Biologically evolved behaviours, such as my tendency to care much more about close family members than about people I've never met, play a major part in defining our core moral objectives. These may include elements of social cooperation, such as, perhaps, a fundamental desire to live in proximity with other people, leading naturally to a desire to live peacefully with other people. Such genetically determined traits help to make us intrinsically caring about others. The social contract does not necessarily define our core moral values, but, by virtue of the colossal technological benefits it brings us, serves as an indispensable aid for achieving the things we value most. It has the profound effect of making my personal, selfish values intimately entangled with the values of more or less every other human on the planet.

Realistic Moral Relativism: A Practical Matter

Fact (1): your core moral values are completely determined by the real physical properties of the matter out of which your mind is built. This is what I mean by moral realism. (See my earlier article, and supporting arguments)

Fact (2): there are no principles of moral value that necessarily hold for all beings in all parts of the universe. This is what I mean by moral relativism. There is one moral meta-principle that holds absolutely, namely Fact (1), above, but it doesn't specify any value that any being must hold.

Fact (1) is the foundation of our moral science. It has the dual advantages of: (i) phenomenal empirical support, and (ii) being logically inescapable. Fact (2) follows as a trivial consequence: since values are determined by arrangements of matter, then different arrangements of matter may support different values.

Fact (2) scares the bejeezus out of people, which I'll discuss more in Part 2. Whatever the reason, however, there is extraordinary resistance to acceptance of Fact (1). When those of us who have come to recognize the potential to develop a moral science try to explain these findings, there is a tendency to present thought experiments aimed at demonstrating the possibility to measure moral value. These typically involve some highly advanced neurological apparatus, maybe some kind of ultra-high-resolution, perfectly calibrated magnetic resonance scanner, capable of recording all relevant details of a person's evolving brain states, and using the data to precisely quantify value. If we could do that, we explain, we would know everything about the human condition, and the moral facts would be laid out before us.

There is nothing incorrect about this (once 1 or 2 matters of interpretation have been clarified), but the argument often fails to have its desired impact. There seem to be two common reasons for this, both of which I sympathize with. Both follow from the extreme implausibility of the described measurement: to record a complete description of a person's brain state. Person A complains that such a measurement, resulting in complete knowledge of a person's private state of mind is strictly impossible, thus invalidating the principle we hoped to illustrate. Person B doesn't have the time to philosophically investigate the limits of epistemology, but recognizes the practical impossibility of this - it ain't gonna happen in the foreseeable future - and so dismisses the whole thing as a fanciful science fiction, not worthy of a second thought.

Both person A and person B have missed the point. It was supposed to be a thought experiment, illustrating the kinds of information that we might strive to access, in order to advance a moral science. But the truth of Fact (1) does not depend on the ability to attain such complete knowledge, and neither, in fact, does our ability to develop a moral science from it.

To think that a truth does not exist, simply because we are prevented, in principle, from uncertainty-free knowledge of it is mind-projection fallacy, pure and simple. Facts exist. Propositions about the real world are either true or false, and their truth state is independent of how confident we feel about them (we might feel that the recursive proposition, X: "I believe confidently that X is true", is a counter example, but X is not a coherent proposition about the real world - what could it possibly mean?). No science can deliver knowledge that is completely free of uncertainty. This is why the gold standard for expressing scientific advance consists of calculating probabilities. And because we have probability theory, that exquisite invention that saves us from the misery of complete epistemological crisis, science does not need absolute certainty in order to make concrete advances: with incomplete knowledge we continue, often in baby steps, to make real advances in understanding, enabling actual technological gains.

We don't need the full blown complexity of the above thought experiment in order to establish our moral science, or indeed to produce from it incremental strategic advances for society. In fact there are only two things we need, in order to make progress in moral science:

to measure our moral goals
to measure the universe

"Really?" you ask, "that's all?"

Stay with me. First I'll explain why those two, then I'll say a bit about what they mean.

To work efficiently towards our goals, it is a requirement that we have reliable estimates of what our goals are, hence the first requirement. Without these estimates, any effort we expend relies on luck to achieve its objectives - we might just as well do nothing. About the second requirement, to maximize the probability to attain one's goals, one has to choose strategically between some set of possible actions. The outcomes of those actions, though, are entirely dependent on the content and behaviour of one's environment - if my goal is to boil a pan of water, then attempting to light a fire under it is not a good strategy, if the pan of water happens to be currently at the bottom of a swimming pool. So we need some kind of reasonable model of reality - an estimate of the stuff that populates it, and a somewhat accurate account of the mechanisms by which that stuff interacts.

Now it's time to qualify what I mean by measure, when I say for example, "measure the universe." A measurement consists of two steps: step one is collecting some set of empirical data, and step two consists of some procedure to draw inferences from that data, usually by combination with previous inferences from previous data. That's it. Notice that there is no statement in there about the quality of the data, or the degree of uncertainty in the resulting inference. Note, though, that as long as a good procedure of inference (a scientific procedure) has been used, we can always construct a model of reality according to which our uncertainty is reduced by the new data. The new data always tells us something new about the world.

Thus, I can measure the universe, simply by opening my eyes. If this is my first time to measure the universe, then it's quite a good start! Every time I open my eyes, I can make new inferences, with an expected increase in confidence about the contents of my environment and the mechanisms by which those contents transform themselves.

Similarly, I can interrogate my moral goals, simply by asking myself, 'what do you actually want?' It is perhaps the crudest experiment we can imagine, but we can also easily imagine improved experimental designs. This is the principal activity of the scientist. There is no sense in which we can invalidate a set of raw data, so what we do instead is try to think of ways in which our inference procedure might have failed to capture what is really going on. And once we have thought of some possible failure modes, we can add controls to our experiments. If the machine says "24," then the machine says "24," and it only remains to discover how well the machine is calibrated - what is the correspondence between the machine saying "24," and the thing I actually want to draw inferences about?

So a better protocol for measuring a person's goals might control for the fact that a person may be mistaken about what their goals are. Luckily, psychologists have already developed such protocols, capable of investigating a subject's state of mind, even when she is probably not very well aware of it herself. We can look at other behaviours that according to separate evidence are more intimately connected to the aspects of interest of a person's state of mind. Pupil dilation, for example, typically happens without the person's awareness, and is extremely hard to fake. With careful observations of the pupil, one can determine, for example, that a subject was very interested in a particular stimulus, without them having had the faintest idea of it.

The range and sophistication of the protocols and controls we might apply to the problem of measuring value are open-ended. Dozens of powerful methodologies exist already for the investigation of mental states (all of which can be cross-checked against each other), from the carefully worded questionnaire, right up to the heroic machines of neuroscience, such as the famed fMRI scanner, which, while often problematic to interpret (see many articles by Neuroskeptic), still provides an immense richness of data. Thus, we should not let anybody tell us that at the present time, moral value cannot be measured - the only challenge for the future is to reduce the error bars.

One might complain that the fMRI experiment only measures neural activity, whereas what we want to know is the subject's mental state, but in this regard, how is fMRI different from measuring pupil dilation? It is the same calibration problem. This problem is solved in essentially the same way in all science: by trying to think of ways that our inference procedure might be too naive, and doing experiments to test them. To say that there is no way to manage the calibration problem is to suppose that all technological advance to date is the result of pure good fortune.

In practice, we very often already act as if we know that knowledge of and progress towards our moral objectives are both attainable through rational means. Regardless of what we consciously profess, we do this because experience informs us that it works. For instance, I believe, with a high level of confidence, that my future satisfaction will be compromised if money I have now leaves my possession without me getting something of value in return, and I therefore make conscious efforts to ensure that I do not lose track of my wallet [compulsively checks pocket again ...].

In run-of-the-mill, day-to-day activity, our unconscious, not-too-rigorous adherence to and application of Fact (1), from above, serves us very well. We know this, because whatever departures there exist between what really happens and what we would be inclined to describe with the phrase, 'serves us very well,' are small enough for us not to find them particularly obvious. This, of course, is no accident. If such departures were very obvious, then we would tend to modify our behaviour, accordingly - this is actually exactly what has happened, and we call the process 'growing up,' though the phrase can apply equally well to the history of a given individual as to the application of selective pressures on gene populations, over time scales of thousands of millennia.

In a similar way, in run-of-the-mill, day-to-day activity, it is enough for me to know that there is some force of gravity tending to make things go down, but if I want to establish a communications satellite in orbit, I had better know about the inverse-square law that describes that force, and a few other complicated things besides. Thus, as we strive to answer ever more exacting moral questions, and as we place ever more challenging demands on our moral technology, it is obvious that we must be ever more rigorous and deliberate in the development of our moral science. This can only happen effectively, if the people who are to work on these grand problems can acknowledge the truth and practical importance of Fact (1).

Our future flourishing, therefore, must be expected to be greatly enhanced by widespread taking on board of the principles of moral science - Fact (1) and its corollaries - through ever improving estimated answers to such questions as (1) what values are primary, as opposed to secondary? (2) what secondary values do we hold in the mistaken belief that they support our primary goals? (3) what common moral values are held by almost all people? (4) how much does the perception of value differ from person to person? and (5) at what point do software engineers need to worry about possible suffering experienced by their algorithms? That is for the future (though it can start today). In the remaining two sections, in Part 2, I'll give good reasons to suppose that a broad acceptance by society of the validity of moral science should have immediate, important, and valuable consequences for almost everybody.

I'm not claiming that a precise determination of core human values is an easy measurement to take - it is fraught with difficulty - but it will never be possible until it is accepted as something we can aspire to. Understanding Fact (1) and its inevitable truth is a crucial first step, and to begin taking such steps is practically guaranteed to lead to some kind of improved understanding. Not only that, but merely recognizing that this moral science is in principle possible has substantial immediate practical consequences, as I will argue, next.

Consider also this: there are, no doubt, some truths about the universe that are forever obscured from science (such as what Euclid's grandfather had for breakfast 3 days before his 10th birthday), but this can only be because these things make no significant difference to anything today. Conversely, if a thing makes a big difference, then by definition, we can detect it and measure it relatively easily. This must apply equally well to the things that affect the outcomes of our moral decisions. The prospects for an applied science of human ethics, producing practical technological benefits, are not so bleak.

Find Part 2 here.

The Dennett Tennis Test

2014-01-16T09:35:00.001-06:00

This is a simple little tactic to consider employing next time you find yourself in conversation with somebody who doubts the efficacy of scientific method, particularly anybody who wants to propose any kind of alternative.

You know the sort, I’m sure. The kind of person who says “sure, scientific data argues that acupuncture is total and utter garbage, but maybe acupuncture is one of those things science isn’t equipped to investigate.” Or the kind of person who says “absolutely, you’re right, I have no evidence that my god exists, but frankly I’m offended at your crass insistence that everything has to come down to evidence.” Or “how ridiculous to suggest that science has anything to say about the supernatural.”

“Oh, you’re so reductionistic,” another might say, shaking their head with earnest sympathy, “there is so much more to mother nature than your sterile lab experiments.” (Yes there is, evidently, but what exactly is your point?)

Or nodding with diplomatic wisdom, one will say, “yes, yes, I see your point, but don’t you think your scientific paradigm is just a social construct, one of many equally valid points of view?”

To all such people, I propose the following reply, paraphrasing philosopher Daniel Dennett:

You know what, your argument is completely convincing, except for one detail: everything you say proves unambiguously that you are a ham sandwich, wrapped in tin foil.

This is what I'm calling the Dennett Tennis Test (DTT). I’ll leave it to you to decide if the phrase is ugly or poetic. Faced with DTT, your conversation partner has 2 options:

(1) They can agree with you, at which point it is clearly time to end the conversation - they have declared themself to be mad.

(2) They can protest that your conclusion is unsupportable, thus proving that they are dishonest - they do not really believe in the efficacy of their own argument.

So how does this work, and where the hell does the weird name come from?

The thing about tennis is that, quite like the two options listed above, there are 2 ways you can play the game: with or without the net. To make the game fair, though, if one player plays with a net, then so does the other.

I picked the phrase ‘Dennett Tennis Test’ in honour of an argument made by Dan Dennett (in his excellent book “Darwin’s Dangerous Idea”), which he built around a remark that he attributes to Ronald de Sousa, likening philosophical theology to intellectual tennis without the net. Just as the tennis net filters out bad serves and bad returns, so in a reasonable discussion does rationality filter out the crummy arguments. The point of DTT is to say “oh, I didn’t know you wanted to play without the net, well, never mind, I’m a sportsman, I’ll play the way you want.” If your opponent agrees to this arrangement, and its most explicitly displayed outrageous consequences, then they signal that they have no commitment to approaching the truth. If they protest, then you have the right to ask why they expect the rules to apply asymmetrically. Either logic is abandoned, allowing their arguments and yours to pass unfiltered, or reason steps in, exerting the same selection pressure on everybody’s statements. When your opponent protests that your conclusion is unjustifiable, they establish exactly the standard of evidence that is sufficient for their own position to instantly crumble.

All the arguments I have taken as examples at the top (these are not straw men of my own concoction, they occur regularly, even within academia) take the form of asserting that there is a way of knowing that bypasses the need for evidence and its logical evaluation. The alternative-medicine advocate who sees the scientific research, but still clings to the belief that the hocus-pocus treatment works is claiming access to knowledge that science can’t deliver. They’re not just saying the science was done badly, but that science is the wrong tool entirely. For example, Dr. Peter Fisher, homeopath to Queen Elizabeth II and prominent homeopathy researcher, (here reaching a stunning level of perversity): "'Inherent implausibility' is a poor guide to future understanding."

The religious enthusiast, clinging to the notion of faith as an alternative to reason, decides to believe, as if that mere decision were enough to shape the structure of reality - as if wanting to have faith is enough to make it true. Many religious believers actively boast of their non-reliance on evidence, claiming it virtuous to place their faith in … well in faith itself actually. But by this "logic", I can just as legitimately put my faith in absolutely any bloody thing I please. In reality, the blatant circularity of this kind of epistemology cannot convince any honest thinker. It is really just an obfuscation, where somebody really wants a thing to be true (or wants others to believe it), but knows deep down that the evidence they have is insufficient. With a complex enough wording, repeated sufficiently often, the believer hides the fact that their belief really is based on evidence - mainly testimony from trusted people (have you noticed how different belief systems exhibit clear spatio-temporal correlation? Is it coincidence that most religious people follow the same religion as their parents?). Such evidence that people have for their religious faith, however, crumbles when subjected to the tiniest scrutiny, (for example, exactly similar evidence provides exactly as reliable support for a host of other contradictory hypotheses) and they invent this capability to know without evidence, which they call faith, in a desperate attempt to avoid the obvious, uncomfortable conclusion. Logic is suspended, without any justification whatsoever - ham sandwiches, all round.

In 1996 physicist Alan Sokal wrote a preposterous spoof paper¹, which he submitted to a ‘serious’ philosophical journal, ‘Social Text’. The paper was crammed with utter nonsense statements (and copious flattering references to the works of the journal’s editors), and they loved it. They published it, never once suspecting that it was a load of deliberate trash. The subject of the paper? That science is a social construct, that all belief systems are equally valid, and that quantum theory proves it. The academic (nominal) philosophers behind the movement this journal epitomizes were committed to the view that if I believe that I can step out of a 10th floor window and float away to Jupiter to enjoy cups of Jovian tea with the locals, before returning home to commune holistically with all the insects on the planet, obtaining crucial knowledge about the birth of time from them, then that belief is as valid as the belief that the world is approximately spherical. This is a philosophy that explicitly refuses to rank the believability of propositions based on the observed behaviour of reality - a form of radical skepticism in which logic is eagerly shunned. How have these people managed to create an academic discipline based on the deliberate application of no intellectual discipline? How would they respond to DTT? Sokal has effectively tried it, and they went for option (1).

In case you think my characterization of the views of the 'strong sociologists' and postmodern relativists - the sorts of academics that Sokal was ridiculing - is far too absurd to be accurate, Sokal's book, written with Jean Bricmont, "Fashionable nonsense," (also published as "Intellectual Impostures") is crammed with quotations from people at the forefront of this movement that again and again prove exactly this. Here are two brief examples:

Barnes and Bloor²:

"It is those who ... grant certain forms of knowledge privileged status, who pose the real threat to a scientific understanding of knowledge and cognition."

Paul Feyerabend³:

"All methodologies have their limitations and the only 'rule' that survives is 'anything goes'."

The proposition that the person you are arguing with is a ham sandwich may be good for raising a cheap laugh, but one might feel that it's far too ridiculous to serve as a serious parody of anybody's actual views. Under close examination, though, it doesn't take long to see that the arguments that I propose to target with the Dennett Tennis Test propose a system of belief formation according to which this proposition is every bit as valid as those being argued for - gods, the healing power of crystals, truth as a social construct, or whatever it happens to be.

References

[1]	Alan Sokal, 'Transgressing the boundaries: toward a transformative hermeneutics of quantum gravity', Social Text #46/47, pages 217 to 252, 1996
[2]	Barnes and Bloor, 'Relativism, rationalism, and the sociology of knowledge', in 'Rationality and relativism', edited by Hollis and Lukes
[3]	Paul Feyerabend, 'Against method', 1975

Confounded Koalas

2013-12-22T01:45:00.000-06:00

Koalas are not as exclusive as kangaroos. At least, when it comes to their drinking habits. As I explained before, kangaroos drink beer or whisky, but not both. Koalas like to mix things up a bit more, when it comes to their choice of drink, but how much exactly? What is the probability, for example, that any given koala who drinks beer on any given night will also drink whisky on the same night? These are the sorts of urgent questions that science must seek to answer with the utmost speed and accuracy.

It's an empirical question, so we'll need some empirical evidence. Luckily, I've been out in the field already. I went deep into the outback one day, and asked lots of koalas what they had been drinking the night before. To save time, though, I didn't bother questioning animals that I believed hadn't been drinking anything on the previous evening. To do this, I polled only hung-over koalas. You see, when a koala gets a hang over, its nose turns bright red, which can be seen from quite a distance. This clever strategy saved me a lot of time walking through the Australian scrub, trying to catch up with subjects who couldn't add anything to my required data set. Here are the numbers I obtained, after interviewing 1222 red-nosed koalas:

Beer but no whisky:	505
Whisky but no beer:	436
Beer and whisky:	281

So the total number of whisky drinkers, for example, came to 59% of the 1222 study participants. Of the subset of those 1222 individuals who drank beer on the preceding evening, however, the number that also drank whisky came to only 36%. Thus we conclude that consumption of whisky is anti-correlated with consumption of beer - if I know that an individual has consumed beer, I consider it less likely to have drunk whisky than otherwise. Letting A represent the proposition that a koala drank whisky and B the proposition that it drank beer, we conclude that P(A) > P(A | B).

Ok, confession time. Brace yourself, this is going to come as quite a shock. These are completely made up data. I've never even been to Australia.

But here is the really weird thing:

The numbers above were actually produced by a randomized model that assumed a complete lack of correlation between beer drinking and whisky consumption. The two were assumed independent, meaning that in fact, P(A) = P(A | B), in contrast with the strong impression given by the generated data.

The number of koalas simulated is quite large, so this isn't a case of random noise producing a spurious finding. In fact, what we've been the victim of here is a kind of biased sampling, known as Berkson's paradox. Our attempt to investigate the relationship between two variables, A and B, has been confounded in an interesting way by a third variable, C.

A fairly trivial special case of the product rule states that (where, as always, X+Y denotes 'X or Y')

P(A.[A+B]) = P(A | A+B) × P(A+B)

and because the conjunction of 2 propositions, XY, is identical to YX, then also

P(A.[A+B]) = P(A+B | A) × P(A)

Now, A+B is a sure thing, if A is already known to be true, so combining these two results gives

P(A) = P(A | A+B) × P(A+B)

Assuming that neither P(A) nor P(B) is 0 or 1, this means that

P(A) < P(A | A+B)        (1)

which is something we ought to have expected already - knowledge that at least one of the propositions A and B is true constitutes good evidence that A is true.

But a good way to be confident that A+B is true is if events A and B are both separately known to cause another event C, which is known to have occurred. This is what has happened on this occasion: C is the hangover I used to select subjects for the study. So while we were trying to estimate P(A), what we actually measured was more indicative of P(A | A+B), and thus our result of 59% was, from equation (1), an overestimate.
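To put a rough number on this, here's a minimal sketch that computes P(A | A+B) exactly, assuming A and B are independent with P(A) = 0.36 and P(B) = 0.4 (the values used by the generating model). The function name is made up purely for illustration.

```python
def prob_a_given_a_or_b(p_a: float, p_b: float) -> float:
    """P(A | A+B) for independent A and B, where A+B means 'A or B'."""
    # For independent propositions, P(A+B) = 1 - P(not A) * P(not B).
    p_a_or_b = 1.0 - (1.0 - p_a) * (1.0 - p_b)
    # Rearranging the product-rule result P(A) = P(A | A+B) * P(A+B).
    return p_a / p_a_or_b


p_whisky, p_beer = 0.36, 0.40
inflated = prob_a_given_a_or_b(p_whisky, p_beer)
print(f"P(A) = {p_whisky:.3f}, P(A | A+B) = {inflated:.3f}")
```

Under these assumptions, the conditional probability comes out close to the 59% seen in the survey, well above the true 36%.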

What happened when we were estimating P(A | B)?

A simple way to get to grips with this is by drawing a truth table, to compare the 2 propositions, "A or B" and "B and (A or B)":

A	B	A+B	B.(A+B)
0	0	0	0
0	1	1	1
1	0	1	0
1	1	1	1

From the truth table, it's quite clear that "B.(A+B)" has, under all circumstances, the same values as "B" - it is the same proposition. Thus in obtaining an estimate for P(A | B.[A+B]), which we inadvertently did, when what we wanted was P(A | B), the addition of the extra information, B, erased any effect of our prior knowledge of A+B that isn't preserved in our knowledge of just B. Therefore, P(A | B.[A+B]) = P(A | B), and our measured proportion of 36% of all beer drinkers who drink whisky did not suffer any distortion due to my chosen selection method.
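The same check can be done mechanically. This is just a small sketch that enumerates the four rows of the table and confirms that "B.(A+B)" and "B" never disagree; the variable names are my own.

```python
from itertools import product

# Enumerate every truth assignment for A and B and compare
# "B and (A or B)" against plain "B".
rows = []
for a, b in product([False, True], repeat=2):
    a_or_b = a or b
    b_and_a_or_b = b and a_or_b
    rows.append((a, b, a_or_b, b_and_a_or_b))
    print(int(a), int(b), int(a_or_b), int(b_and_a_or_b))

# If the last column matches the B column on every row,
# the two expressions are the same proposition.
same_proposition = all(row[3] == row[1] for row in rows)
print("B.(A+B) identical to B:", same_proposition)
```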

Because my figure for P(A) was overestimated, however, while the result for P(A | B) was not, the effect was a spurious impression that P(A | B) < P(A), implying negative correlation, contrary to the reality of the data-generating process.

In fact, the numbers I gave above came from 2000 simulated koalas, randomly assigned as beer drinkers with probability 0.4, and independently assigned as whisky drinkers with probability 0.36. This is perfectly reflected in the observed proportions for beer drinkers (786/2000 = 0.393) and whisky drinkers (717/2000 = 0.359). The result for P(A | B) was also perfectly consistent with independence - the proportion of beer drinkers who indulged in whisky was the same as the proportion for the entire population, 36%.
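For anyone who wants to play with this, here is a rough sketch of a simulation along these lines. It is not the code that produced the numbers above (the seed, function names and structure are all assumptions for illustration), so the exact counts will differ, but the pattern should be the same.

```python
import random


def simulate_koalas(n=2000, p_beer=0.4, p_whisky=0.36, seed=0):
    """Draw independent (beer, whisky) habits for n hypothetical koalas."""
    rng = random.Random(seed)
    return [(rng.random() < p_beer, rng.random() < p_whisky) for _ in range(n)]


def fraction(flags):
    """Proportion of True values in a list of booleans."""
    return sum(flags) / len(flags)


koalas = simulate_koalas()
# Only koalas that drank something get a red nose and enter the survey.
hungover = [(beer, whisky) for beer, whisky in koalas if beer or whisky]

p_a_population = fraction([w for _, w in koalas])      # estimates P(A)
p_a_surveyed = fraction([w for _, w in hungover])      # estimates P(A | A+B)
p_a_given_b = fraction([w for b, w in hungover if b])  # estimates P(A | B)

print(f"whisky drinkers, whole population: {p_a_population:.3f}")
print(f"whisky drinkers, red noses only:   {p_a_surveyed:.3f}")
print(f"whisky drinkers among beer fans:   {p_a_given_b:.3f}")
```

Restricting attention to the hung-over animals inflates the apparent rate of whisky drinking, while the rate among beer drinkers is untouched, since every beer drinker is in the survey anyway.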

With Berkson's paradox, our attempt to draw inferences about the relationship between two variables, A and B, was confounded by a third, correlated variable, C. Something very similar was going on, when we examined Simpson's paradox, but the effect of the confounder was slightly different. With Simpson's paradox, two non-independent variables, A and B, are rendered conditionally independent upon receipt of information concerning C (A is "screened off" from B by C, in the language of the graph theorists) - without knowing C, we are lulled into incorrectly thinking that A is a direct cause of B.

With Berkson's fallacy, the effect is opposite: knowledge of C (or inadvertently selecting a biased sample such that C was true on an excess number of occasions) made two otherwise uncorrelated variables appear to be dependent upon one another. The effect was such that occurrence of A seemed to suppress the occurrence of B. (Note that even if I hadn't consciously decided to look only for hung-over animals, their bright red noses would have been easier to spot in the undergrowth, leading to their being over-represented in the survey, which would have had a similar effect.)

While the third variable, C, is ignored, it can confound our scientific efforts, but once brought to our attention, figuring out what causes what actually becomes easier. Thanks to differences in effects, such as those differences between Simpson's and Berkson's paradoxes just sketched, it can actually help us distinguish between different classes of causal relationships. If C screens off A from B, then certain distributions of cause and effect can be ruled out, while certain other causal relationships are excluded when C introduces dependence between A and B. [This paragraph was modified slightly on 12-23-2013 to remove an error.]

Full causal analysis is only possible when we perform controlled interventions (e.g. randomized controlled clinical trials), but if we stretch our intelligence, there is a lot that can still be done when intervention is difficult or impossible to implement - a situation many scientists have to live with. (Cosmology, anyone? geology? archaeology? Just a few examples.)

The Acid Test of Indifference

2013-11-16T11:30:00.000-06:00

In recent posts, I've looked at the interpretation of the Shannon entropy, and the justification for the maximum entropy principle in inference under uncertainty. In the latter case, we looked at how mathematical investigation of the entropy function can help with establishing prior probability distributions from first principles.

There are some prior distributions, however, that we know automatically, without having to give the slightest thought to entropy. If the maximum entropy principle is really going to work, the first thing it has got to be able to do is to reproduce those distributions that we can deduce already, using other methods.

There's one case, in particular, that I'm thinking of, and it's the uniform distribution of Laplace's principle of indifference: with n hypotheses and no information to the contrary, each must rationally be assigned the same probability, 1/n. This principle is pretty much self-evident. If we really need to check that it's correct, we just need to consider the symmetry of the situation: suppose we have a small cube with sides labelled with the numbers 1 to 6 (a die). Without any stronger information, these numbers really are just arbitrary labels - we could, for example, decide instead to denote the side marked with a 1 by the label "6" and vice versa. But nothing physical about the die will have been changed by this change of convention, so no alteration of the probability assignment concerning the outcome of the usual experiment is called for. Thus each of these two outcomes ("1" or "6") must be equally probable, with further similar arguments applying to all pairs of sides.

So if the entropy principle is valid, it must also arrive at this uniform distribution under the circumstances of us being maximally uninformed. Let's check if it works.

We start with a discrete probability distribution over X, f(x₁, x₂, …, x_n), equal to (p₁, p₂, …, p_n). The entropy for this distribution is, as usual, given by

H = −Σ_{i=1}^{n} p_i log(p_i)

which, for reasons that’ll soon become clear, I’ll express as

H = −p₁ log(p₁) − p₂ log(p₂) − Σ_{i=3}^{n} p_i log(p_i)

Suppose that for f(x₁, x₂, …, x_n) the probability at x₁, p₁, is smaller than p₂. Imagine another distribution, f’(X), in which p₃ to p_n are identical to f(X), but p₁ and p₂ have been made more similar, by adding a tiny number, ε, to p₁ and subtracting the same number from p₂ (this latter subtraction is necessary, so as not to violate the normalization condition). We want to examine the entropy of this distribution, relative to the other:

H’ = −(p₁ + ε) log(p₁ + ε) − (p₂ − ε) log(p₂ − ε) − Σ_{i=3}^{n} p_i log(p_i)

And so the difference in entropy is:

ΔH = H’ − H = p₁ log(p₁) + p₂ log(p₂) − (p₁ + ε) log(p₁ + ε) − (p₂ − ε) log(p₂ − ε)

We can consolidate this, using the laws of logs:

ΔH = p₁ log[p₁ / (p₁ + ε)] + p₂ log[p₂ / (p₂ − ε)] + ε log[(p₂ − ε) / (p₁ + ε)]

In the 17th century, Nicholas Mercator was a multi-talented Danish mathematician. His many accomplishments include the design and construction of chronometers for kings and fountains for palaces, as well as making theoretical contributions to the field of music. He seems to be the first person to have made use of the natural base for logarithms, and in 1668 he published a convenient series expansion of the expression log_e(1+x):

log_e(1 + x) = x − x²/2 + x³/3 − x⁴/4 + …

We now know this as the Mercator series. (Remarkably, this work predates by 47 years the introduction of the Taylor series, of which the above expansion is an example.) This looks promising to us in the current investigation, as for very small x, only the first term will remain significant, so we’d like to express the logarithms in the above entropy difference in this form.
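A quick numerical sanity check of that claim about small x, as a sketch only: the helper below sums the first few terms of x − x²/2 + x³/3 − … (the function name is mine) and compares the first term alone against the logarithm computed by the standard library.

```python
import math


def mercator_partial_sum(x: float, terms: int) -> float:
    """Sum the first `terms` terms of x - x^2/2 + x^3/3 - ... ."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, terms + 1))


for x in (0.1, 0.01, 0.001):
    exact = math.log1p(x)  # accurate log(1 + x) for small x
    first_order = mercator_partial_sum(x, 1)
    print(f"x={x}: log(1+x)={exact:.8f}, first term={first_order:.8f}, "
          f"error={abs(exact - first_order):.2e}")
```

The error of the one-term approximation shrinks roughly like x²/2, which is why only the first term matters once x is tiny.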

(We might wonder why this Danish dude had a Latin name, but apparently that was often done in those days, to enhance one’s academic standing and whatnot – his original name was Kauffman, by the way, which means the same thing: shopkeeper (I don't understand why he felt it wasn't intellectual enough). Actually, since moving to the US last year, I see a similar thing going on at universities here, where it is apparently often considered a highly coveted marker of status to be able to prove that you know at least three letters of the Greek alphabet.)

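Taking one of those terms in ΔH, factorizing and again applying the laws of logs:

log(p₁ + ε) = log[p₁ (1 + ε/p₁)] = log(p₁) + log(1 + ε/p₁)

with a similar result for log(p₂ - ε), so

ΔH = ε log(p₂/p₁) − (p₁ + ε) log(1 + ε/p₁) − (p₂ − ε) log(1 − ε/p₂)

Supposing we have made ε arbitrarily small (without actually equaling zero), then when we implement the Mercator expansion, all terms with ε squared, or raised to higher powers, will be negligibly small, and the first-order contributions from the two expansions, −ε and +ε, cancel, so

ΔH ≈ ε log(p₂/p₁)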

For p₂ > p₁ and ε > 0, this expression is necessarily positive, meaning that H’ > H, i.e. the distribution formed by taking 2 unequal probabilities in f, and adjusting them both to make them more nearly equal results in a new distribution, f’, with higher entropy.
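Here's a small numerical sketch of this step. It nudges an arbitrary example distribution (my own choice, not from anywhere in particular) and compares the actual change in entropy with ε log(p₂/p₁), which is what the algebra gives to first order in ε by my working.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def nudge(probs, eps):
    """Move eps of probability from p2 to p1, leaving the rest alone."""
    p1, p2, *rest = probs
    return [p1 + eps, p2 - eps, *rest]


f = [0.1, 0.4, 0.2, 0.3]  # arbitrary example with p1 < p2
eps = 1e-6
delta_h = entropy(nudge(f, eps)) - entropy(f)
first_order = eps * math.log(f[1] / f[0])
print(f"actual change in H: {delta_h:.6e}")
print(f"eps * log(p2/p1):   {first_order:.6e}")
```

The two numbers agree closely, and both are positive: making the two probabilities more nearly equal raises the entropy.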

This procedure of adjusting unequal probabilities by some tiny ε can, of course, continue for as long as it takes to end up with all probabilities in the distribution equal, and the result will necessarily have higher entropy than any of the distributions that preceded it. This proves that the uniform distribution is globally the one with maximum entropy.

Another approach we could have taken to check that the entropy principle produces the appropriate uninformative prior elegantly uses the method of Lagrange multipliers, which is a powerful technique employed frequently at the business end of entropy applications. (Yes, that's the same Lagrange, reportedly teased by Laplace in front of Napoleon - but we mustn't think that Lagrange was a fool, he was one of the all-time greats of mathematics and physics, and it probably says more about Laplace that he so easily got the better of Lagrange on that occasion.)
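For the curious, here is a brief sketch of how that Lagrange-multiplier argument might run when normalization is the only constraint. The multiplier λ and the auxiliary function 𝓛 are notation introduced here purely for illustration.

```latex
\begin{aligned}
\mathcal{L}(p_1,\dots,p_n,\lambda)
  &= -\sum_{i=1}^{n} p_i \log p_i + \lambda\Big(\sum_{i=1}^{n} p_i - 1\Big) \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= -\log p_i - 1 + \lambda = 0
  \quad\Longrightarrow\quad p_i = e^{\lambda - 1} \ \text{for every } i \\
\sum_{i=1}^{n} p_i = 1
  &\quad\Longrightarrow\quad n\,e^{\lambda - 1} = 1
  \quad\Longrightarrow\quad p_i = \frac{1}{n}
\end{aligned}
```

Every p_i is forced to the same value, and normalization then fixes that value at 1/n, in agreement with the principle of indifference.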

The favorite joke that non-Bayesians use to taunt us is the claim that we pull our priors out of our posteriors. They think our prior distributions are arbitrary and unjustified, and that as a consequence our entire epistemology crumbles. But this only showcases their own ignorance. Never mind the obvious fact that if it were true, no kind of learning whatsoever would ever be possible, for us and them alike. In reality, to derive our priors we make use of simple and obvious symmetry considerations (such as indifference), which not only work just fine, but provide results that stand verified when we apply far more rigorous formalisms, such as maximum entropy and group theory (which I haven't discussed yet).

Monkeys and Multiplicity

2013-11-01T22:38:00.000-05:00

Monkeys love to make a mess. Monkeys like to throw stones. Give a monkey a bucket of small pebbles, and before too long, those pebbles will be scattered indiscriminately in all directions. These are true facts about monkeys, facts we can exploit for the construction of a random number generator.

Set up a room full of empty buckets. Add one bucket full of pebbles and one mischievous monkey. Once the pebbles have been scattered, the number of little stones in each bucket is a random variable. We're going to use this random number generator for an unusual purpose, though. In fact, we could call it a 'calculus of probability,' because we're going to use this exotic apparatus for figuring out probability distributions from first principles¹.

Let's say we have a hypothesis space consisting of n mutually exclusive and exhaustive propositions. We need to find a prior probability distribution over this set of hypotheses, so we can begin the process of Bayesian updating, using observed data. But nobody can tell us what the prior distribution ought to be - all we have is a limited set of structural constraints: we know the whole distribution must sum to 1, and perhaps we can deduce other properties, such as an appropriate mean or standard deviation. So we design an unusual experiment to help us out. We need to lay out n empty buckets - one bucket for each hypothesis in the entire set of possibilities (better nail them to the floor).

Each little pebble that we give to the monkey represents a small unit of probability, so the arrangement of stones in buckets at the end represents a candidate probability distribution. We've been very clever, beforehand, and we've made sure that the resting place of each pebble is uniformly randomly distributed over the set of buckets - in keeping with everything we've heard about the chaotic nature of monkeys, this one exhibits no systematic bias - so we decide that this candidate probability distribution is a fair sample from among all the possibilities.

We examine the candidate distribution, and check that it doesn't violate any of our known constraints. If it does violate any constraints, then it's no use to us, and we have to forget about it. If not, we record the distribution, then go tidy everything up, so we can repeat the whole process again. And again. And again and again and again, and again. After enough cycles, there should be a clear winner, one distribution that occurs more often than any other. This will be the probability distribution that best approximates our state of knowledge.
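No monkeys were available at the time of writing, so here is a toy stand-in: a minimal sketch of the procedure with only 6 pebbles and 3 buckets, and a placeholder constraint check that accepts everything. The sizes, the seed and the function names are all assumptions chosen to keep the example small.

```python
import random
from collections import Counter


def throw_pebbles(n_pebbles, n_buckets, rng):
    """One run: each pebble lands in a uniformly random bucket."""
    counts = [0] * n_buckets
    for _ in range(n_pebbles):
        counts[rng.randrange(n_buckets)] += 1
    return tuple(counts)


def satisfies_constraints(counts):
    """Placeholder constraint check; here every arrangement is allowed."""
    return True


rng = random.Random(42)
tally = Counter()
for _ in range(20000):
    candidate = throw_pebbles(6, 3, rng)
    if satisfies_constraints(candidate):
        tally[candidate] += 1  # record the candidate, then "tidy up"

winner, hits = tally.most_common(1)[0]
print("most frequent arrangement:", winner, "seen", hits, "times")
```

With no constraint beyond normalization, the evenly spread arrangement wins. Swapping in a stricter `satisfies_constraints` would restrict the contest to the arrangements that respect whatever else is known.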

To convince ourselves that some distributions really will occur significantly more frequently than others, we just need to do a little combinatorics. Suppose the total supply of ammo given to the little beast comes to N pebbles.

The number of possible ways to end up with exactly N₁ pebbles in the first bucket is given by the binomial coefficient:

N! / (N₁! (N − N₁)!)

and with those N₁ pebbles accounted for, the number of ways to get exactly N₂ in the second bucket is

(N − N₁)! / (N₂! (N − N₁ − N₂)!)

and so on.

So the total number of opportunities to get some particular distribution of stones (probability mass) over all n buckets (hypotheses) is given by the product

N! / (N₁! (N − N₁)!) × (N − N₁)! / (N₂! (N − N₁ − N₂)!) × … × (N − N₁ − … − N_{n−1})! / (N_n! (N − N₁ − … − N_n)!)

For each i-th term in this product (up to n−1), the term in brackets in the denominator is the same as the numerator of the (i+1)-th term, so these all cancel out, and for the n-th term, the corresponding part of the denominator comes to 0!, which is 1. So any particular outcome of this experiment can, in a long run of similar trials, be expected to occur with frequency proportional to

W = N! / (N₁! N₂! … N_n!)

which we call the multiplicity of the particular distribution defined by the numbers N₁, N₂, ...., N_n.
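As a check on the cancellation argument, here's a short sketch that computes the multiplicity two ways: directly as N! divided by the product of the N_i!, and bucket by bucket as a product of binomial coefficients. The function names and the example counts are mine.

```python
from math import comb, factorial, prod


def multiplicity(counts):
    """W = N! / (N1! * N2! * ... * Nn!) for a tuple of bucket counts."""
    total = sum(counts)
    return factorial(total) // prod(factorial(c) for c in counts)


def multiplicity_from_binomials(counts):
    """The same quantity built up one bucket at a time."""
    remaining = sum(counts)
    ways = 1
    for c in counts:
        ways *= comb(remaining, c)  # choose c of the pebbles still unplaced
        remaining -= c
    return ways


example = (3, 1, 2, 4)
print(multiplicity(example), multiplicity_from_binomials(example))
```

Both routes give the same integer, as the telescoping product says they must.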

The number of buckets, n, only needs to be moderately large for it to be quite difficult to visualize how W varies as a function of the N_i. We can get around this by looking into the case of only 2 buckets. Below, I've plotted the multiplicity as a function of the fraction of stones in the first of two available buckets, for three different total numbers of stones: 50, 500, and 1000 (blue and green curves enhanced to come up to the same height as the red curve).

Each case is characterized by a fairly sharp peak, centered at 50%. As the number of stones available increases, so does the sharpness of the peak - it becomes less and less plausible for the original supply of pebbles to end up very unequally divided between the available resting places.
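The sharpening of the peak can be checked numerically. Here is a minimal Python sketch for the two-bucket case; the function names and the 40% to 60% window are my own choices for illustration, not part of the original experiment.

```python
from math import comb

def two_bucket_multiplicity(total):
    """Multiplicity W for every split (k, total - k) of stones over two buckets."""
    return [comb(total, k) for k in range(total + 1)]

def central_mass(total, lo=0.4, hi=0.6):
    """Fraction of all arrangements whose first-bucket share lies in [lo, hi]."""
    w = two_bucket_multiplicity(total)
    inside = sum(w[k] for k in range(total + 1) if lo <= k / total <= hi)
    return inside / sum(w)

for total in (50, 500, 1000):
    w = two_bucket_multiplicity(total)
    # Index of the largest multiplicity, i.e. the most frequent split
    peak = max(range(total + 1), key=w.__getitem__)
    print(total, peak / total, central_mass(total))
```

For every total the peak sits at an even split, and the share of arrangements falling inside the central window climbs towards 1 as the number of stones grows.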

Calculating large factorials is quite tricky work, though (just look at the numbers: 2.7 × 10²⁹⁹, for N = 1000). To find which distribution is expected to occur most frequently, we need to locate the arrangement for which W is maximized, but any monotonically increasing function of W will be maximized by the same distribution, so instead let's maximize another function, which I'll arbitrarily call H:
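To see both the size of these numbers and why taking a logarithm helps, here is a small Python sketch. It compares the exact two-bucket multiplicity for N = 1000 with its natural log computed from log-gamma values; `log_multiplicity` is a hypothetical helper name.

```python
from math import comb, lgamma, log

def log_multiplicity(counts):
    """Natural log of W = N! / (N_1! ... N_n!), using lgamma(k + 1) = ln(k!)."""
    total = sum(counts)
    return lgamma(total + 1) - sum(lgamma(c + 1) for c in counts)

exact = comb(1000, 500)                  # exact (very large) integer
via_logs = log_multiplicity([500, 500])  # the same quantity as an ordinary float
print(f"{float(exact):.3e}", via_logs, log(exact))
```

Working with ln W keeps everything in a comfortable numerical range, and because the logarithm is monotonically increasing, it is maximized by the same arrangement as W itself.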

But since each N_i is N×p_i (where p_i is the probability for any given stone to land in the i^th bucket, according to this distribution), and repeatedly using the quotient rule for logs,

Now, because probabilities vary over a continuous range between 0 and 1, and because we don't want to impose overly artificial constraints on the outcome of the experiment, we will have given the monkey a really, really large number of pebbles. This means that we can simplify the above expression for H using the Stirling approximation, which states generally that when k is very large (where we're using the natural base),

with the approximation getting better as k gets larger. In our particular case, this yields

and suddenly, we can see why that 1/N was inexplicably sitting at the front of our expression for H:

The Σp_i term at the end is 1, so

at which point the product rule for logs produces

finally, yielding

which, oh my goodness gracious, is exactly the same as the quantity we had previously labelled H: the Shannon entropy.
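For reference, the chain of steps described above can be written out as follows. This is a reconstruction from the surrounding text, assuming that H is defined as (1/N) log W with natural logs, and that Stirling's approximation is taken in its simplest form, log k! ≈ k log k − k.

```latex
% Definition: a monotonically increasing function of the multiplicity W
H = \frac{1}{N} \log W = \frac{1}{N} \log \frac{N!}{N_1!\,N_2!\,\cdots\,N_n!}

% Substitute N_i = N p_i and apply the quotient rule for logs
H = \frac{1}{N} \left[ \log N! - \sum_{i=1}^{n} \log\,(N p_i)! \right]

% Stirling approximation (natural log), for large k
\log k! \approx k \log k - k

% Applied to every factorial
H \approx \frac{1}{N} \left[ N \log N - N
      - \sum_{i=1}^{n} \bigl( N p_i \log (N p_i) - N p_i \bigr) \right]

% The factor 1/N cancels every N in the brackets
H \approx \log N - 1 - \sum_{i=1}^{n} p_i \log (N p_i) + \sum_{i=1}^{n} p_i

% \sum_i p_i = 1 cancels the -1
H \approx \log N - \sum_{i=1}^{n} p_i \log (N p_i)

% Product rule: \log (N p_i) = \log N + \log p_i
H \approx \log N - \log N \sum_{i=1}^{n} p_i - \sum_{i=1}^{n} p_i \log p_i

% Using \sum_i p_i = 1 once more
H \approx - \sum_{i=1}^{n} p_i \log p_i
```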

So, it turns out that the probability distribution most likely to come out favorite in the repeated monkey experiment, the one corresponding to the arrangement of stones with the highest multiplicity, and therefore the one best expressing our state of ignorance, is the one with the highest entropy. And this is the maximum entropy principle. It means that any distribution with lower entropy than is permitted by the constraints present in the problem has somehow incorporated more information than is actually available to us, and so adoption of any such distribution constitutes an irrational inference.
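The link between multiplicity and entropy can also be checked numerically. The sketch below assumes that the scaled quantity is (1/N) ln W and uses an arbitrary four-bucket distribution of my own choosing; it is an illustration of the limit, not a proof.

```python
from math import lgamma, log

def entropy_nats(p):
    """Shannon entropy in nats, -sum(p_i * ln(p_i))."""
    return -sum(q * log(q) for q in p if q > 0)

def scaled_log_multiplicity(p, total):
    """(1/N) ln W for counts N_i = N * p_i (assumed here to be whole numbers)."""
    counts = [round(total * q) for q in p]
    log_w = lgamma(total + 1) - sum(lgamma(c + 1) for c in counts)
    return log_w / total

p = [0.5, 0.25, 0.125, 0.125]   # hypothetical distribution, for illustration only
for total in (80, 8_000, 800_000):
    print(total, scaled_log_multiplicity(p, total), entropy_nats(p))
```

As the number of pebbles increases, the first printed value creeps up towards the second, which is the behaviour the Stirling approximation predicts.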

Note, though, that if your gut feeling is that the maximum entropy distribution you've calculated is not specific enough, this is your gut expressing the opinion that you actually do have some prior information that you've neglected to include. It's probably worth exploring the possibility that your gut is right.

It might seem, from this multiplicity argument, that for a given number of points, n, in the hypothesis space, there is only one possible maximum entropy distribution, but recall that sometimes (as in the kangaroo problem) we have information from which we can formulate additional constraints. Sometimes the unconstrained maximum-multiplicity distribution will be ruled out by these constraints, and we have to select a distribution from those that aren't ruled out. It's in such cases that the method actually gets interesting and useful.
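To make the role of constraints concrete, here is a toy Python sketch. The three buckets, their values 1, 2, 3, and the required mean of 2.5 are all invented for illustration (this is not the kangaroo problem); the search simply enumerates every arrangement of 60 stones and keeps the one with the highest multiplicity among those the constraint allows.

```python
from math import comb

def multiplicity(counts):
    """W = N! / (N_1! ... N_n!), built up as a product of binomial coefficients."""
    remaining, w = sum(counts), 1
    for c in counts:
        w *= comb(remaining, c)
        remaining -= c
    return w

def best_arrangement(total, constraint):
    """Highest-multiplicity split of `total` stones over 3 buckets allowed by `constraint`."""
    candidates = [
        (a, b, total - a - b)
        for a in range(total + 1)
        for b in range(total + 1 - a)
        if constraint(a, b, total - a - b)
    ]
    return max(candidates, key=multiplicity)

# Hypothetical constraint: buckets carry the values 1, 2, 3 and the mean value must be 2.5
N = 60
unconstrained = best_arrangement(N, lambda a, b, c: True)
constrained = best_arrangement(N, lambda a, b, c: a + 2 * b + 3 * c == 2.5 * N)
print(unconstrained, constrained)
```

Without the constraint the even split wins. With it, the even split is ruled out, and the winner is the most nearly even arrangement that still gives the required mean.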

This isn't Jaynes' original derivation of the maximum entropy principle (the logic presented here was brought to Jaynes' attention by G. Wallis), and neither is it a rigorous mathematical proof, but it has the substantial advantage of intuitive appeal. It even makes concrete the abstract link between the maximum entropy principle and the second law of thermodynamics.

Thinking again about the expanding gas example from the previous post, we can see the remarkable similarity between the simple universe we considered then, a box containing 100 gas molecules, with the space inside the box conceptually divided into 32 equal-sized regions, and the apparatus described here. The expanding gas scenario corresponds very closely to a possible instance of the monkey experiment with 32 buckets and 100 pebbles:

As the multiplicity curves above testify, the state with the highest multiplicity is the one with as close as possible to equal numbers of molecules in each region, so a state like the one above is far more likely than one in which one half of the box is empty. Given that the gas molecules move about in a highly uncorrelated way, we can therefore see that the second law of thermodynamics (the fact that a closed system in a low-entropy state now will tend to move to and stay in a high-entropy state) amounts to a tautology: if there is a more probable state available, then expect to find the system in that state next time you look.

We also saw above that increasing the number of pebbles / molecules reduces the relative width of the multiplicity curve. But the numbers of particles encountered in typical physical situations are astronomically huge. In a cubic meter of air, for example, at sea level and around room temperature, there are (from the ideal-gas approximation) around 2.5 × 10²⁵ molecules (which, by the way, weighs roughly 1 kg). This means that once a system like this gas-in-a-box experiment has evolved to its state of highest entropy, the chances of seeing the entropy decrease again by any appreciable amount vanish completely. Thus the second tendency of thermodynamics receives a well-earned promotion, and we call it instead 'the second law'.
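The molecule count quoted above is easy to reproduce. This Python sketch assumes standard sea-level pressure (101,325 Pa) and a room temperature of 20 °C, neither of which is specified exactly in the text, and uses the binomial standard deviation as one reasonable measure of the relative width of the two-bucket peak.

```python
from math import sqrt

K_B = 1.380649e-23      # Boltzmann constant, J/K
PRESSURE = 101_325.0    # assumed sea-level pressure, Pa
TEMPERATURE = 293.15    # assumed room temperature (20 degrees C), K

def molecules_per_cubic_meter(pressure=PRESSURE, temperature=TEMPERATURE):
    """Ideal-gas number density, n = P / (k_B * T)."""
    return pressure / (K_B * temperature)

def relative_peak_width(n_particles):
    """Standard deviation of the first-bucket fraction for N fair throws: 1 / (2 * sqrt(N))."""
    return 0.5 / sqrt(n_particles)

n = molecules_per_cubic_meter()
print(f"{n:.2e}", relative_peak_width(1000), relative_peak_width(n))
```

With a thousand stones the peak has a relative width of a percent or two; with a cubic meter's worth of molecules it is narrower by roughly eleven orders of magnitude.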

Please note: no primates were harmed during the making of this post.

References

[1]

I can't claim credit for designing this monkey experiment. Versions of it are dotted about the literature, possibly originating with:

Gull, S.F. and Daniell, G.J., Image Reconstruction from Incomplete and Noisy Data, Nature 272, no. 20, page 686, 1978

Entropy Games

2013-10-26T01:45:00.000-05:00

In 1948, Claude Shannon, an electrical engineer working at Bell Labs, was interested in the problem of communicating messages along physical channels, such as telephone wires. He was particularly interested in issues like how many bits of data are needed to communicate a message, how much redundancy is appropriate when the channel is noisy, and how much a message can be safely compressed.

In that year, Shannon figured out¹ that he could mathematically specify the minimum number of bits required to convey any message. You see, every message, every proposition, in fact, whether actively digitized or not, can be expressed as some sequence of answers to yes / no questions, and every string of binary digits is exactly that: a sequence of answers to yes / no questions. So if you know the minimum number of bits required to send a message, you know everything you need to know about the amount of information it contains.

Many possible sets of yes / no questions can yield the same message, but they're not all the same length. Consider the game of 20 questions, in which one player has to guess or deduce the name of a person that the other player is thinking of, simply by asking questions to which the answer will be either 'yes' or 'no.' A really crappy strategy when playing this game would be to ask the following set of questions:

Is it Dorothy Hodgkin? [Nope.]
Is it Emmy Noether? [No!]
Is it Caroline Herschel? [No, dumb-ass!]
Is it Margaret Cavendish? [Oh, for crying out loud.]
etc.

A strategy like this is likely to take an awfully long time to finish the game, particularly as the correct answer may be a man with no outstanding scientific credentials. A better way to play the game is to devise questions that divide the set of possibilities as nearly in half as possible. A good first question is therefore, 'is it a male?' From there, we could proceed with, 'is it a famous person?' and so on, producing a significantly smaller set of possibilities with each answer. With good interrogation, the size of the hypothesis space decays exponentially, while naive guessing, such as the examples above, leaves the range of possibilities virtually unchanged.

So an optimum string of bits for a message is one such that each bit reduces the number of possible messages the sender might be sending by 50% (with a caveat we'll see in a moment). Thus, if we have a set of 128 symbols, say lower- and upper-case English alphabet, digits 0-9, several punctuation marks, and various other symbols (the ASCII system), we don't need 128 bits to transmit each symbol, only 7. The first bit answers the question: 'is it in the first half of the list?' The second bit answers the question: 'of the remaining list of possibilities, is it in the first half of the list?' and so on. The 7th bit, therefore, leaves a list containing exactly one symbol.
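The halving procedure can be sketched in a few lines of Python. The function names are hypothetical, and the list of 128 symbols is simply the characters with codes 0 to 127.

```python
from math import ceil, log2

def questions_needed(n_symbols):
    """Count the yes/no questions needed when each answer halves the candidate list."""
    questions = 0
    remaining = n_symbols
    while remaining > 1:
        remaining = ceil(remaining / 2)   # worst case: the larger half survives
        questions += 1
    return questions

def locate(symbols, target):
    """Return the answers (True = 'it is in the first half') that single out `target`."""
    answers = []
    candidates = list(symbols)
    while len(candidates) > 1:
        half = len(candidates) // 2
        in_first_half = target in candidates[:half]
        answers.append(in_first_half)
        candidates = candidates[:half] if in_first_half else candidates[half:]
    return answers

ascii_symbols = [chr(code) for code in range(128)]
print(questions_needed(128), len(locate(ascii_symbols, "k")), log2(128))
```

Each answer recorded by `locate` is one transmitted bit, so the seven answers are exactly the 7-bit code for the symbol.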

Suppose our alphabet - the complete set of symbols out of which our messages are to be constructed - consists of n symbols. If all symbols in our alphabet occur equally frequently, then the number of bits, call it x, needed to transmit one symbol is given by n = 2^x. The solution to this equation, of course, is

x = log₂(n)

But if all symbols occur equally frequently, then the probability, p, for the i^th symbol in a given message to be one in particular is 1/n, from the principle of indifference, so

x = log₂(1/p)

or, from the laws of logs (special case of the quotient rule, noting that log(1) = 0)

x = -log₂(p)

In normal communication, however, the symbols do not usually occur equally frequently. Furthermore, a message will often contain symbols that are unnecessary. Consider the text you are reading now. This text consists of an arrangement of white and black pixels. The state of each pixel is determined by answering a single yes / no question, but do you actually need to receive all those bits at your visual cortex in order to read the text? Hopefully, this next game will answer that question in the negative. Try to read the following partially obscured text:

Tests (unpublished work, conducted by me, very small sample) reveal that most people can cope with reading this. Similarly, most people can manage to interpret SMS text, composed with many of the vowels excluded from their proper places.

Luckily, as is not too hard to see, the above formula generalizes to the case where symbols don't carry equal amounts of information, which happens when they don't occur with equal frequency, or where some parts of a message are superfluous (the obscured pixels along the top of that message, above, evidently didn't convey any additional useful information).

Suppose, for example, that we have a set of only 4 symbols: A, B, C, and D. Suppose that in an average message, A appears 50% of the time, and the remaining letters are half B's, a quarter C's, and a quarter D's. Dividing the list of possibilities into 2 halves with the first bit is not optimally efficient here (that caveat I warned you about). Such a strategy would take 2 bits to transfer each symbol, even though the C's and D's carry more information than the A's.

Rather than splitting the list in half, it is clearly the probability distribution over the list we should split in half, and because A's occur half the time, our first question is, 'is it an A?' If the answer is yes, then we have a symbol from only one bit. If the answer is no, then the next question is, 'is it a B?' since B's occur on 50% of the occasions when it's not an A. Sometimes we'll need to ask a third question, 'is it a C?' But that'll only happen 25% of the time, and the average number of bits needed to transfer 1 symbol will be (0.5 × 1) + (0.25 × 2) + (0.25 × 3) = 1.75 bits.
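The arithmetic behind the 1.75 bits can be spelled out in Python. The overall probabilities below follow from the description (A half the time, then the remainder split half, quarter, quarter); the names are my own.

```python
PROBABILITIES = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
QUESTION_ORDER = ["A", "B", "C"]   # 'is it an A?', then 'is it a B?', then 'is it a C?'

def questions_asked(symbol):
    """Number of yes/no questions needed to identify `symbol` with this strategy."""
    for count, guess in enumerate(QUESTION_ORDER, start=1):
        if symbol == guess:
            return count
    return len(QUESTION_ORDER)   # three 'no' answers leave only D

average_bits = sum(p * questions_asked(s) for s, p in PROBABILITIES.items())
print(average_bits)
```

Both C and D cost three questions, and together they turn up a quarter of the time, which is where the (0.25 × 3) term comes from.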

So the number of bits needed to transfer a symbol is still a function of the prior probability that that symbol would have been the one sent. If the i^th symbol has an associated prior probability of p_i, then:

x_i = -log₂(p_i)

The average of x, over the entire alphabet of symbols, is just the expectation over this expression, and is what we call the entropy:

H = 〈x_i〉 = -Σ p_i log₂(p_i)

which, thankfully, is the same expression we used to solve the kangaroo problem, before. In fact, when calculating entropy, the base is not all that important - a different base will give a different number, but if we consistently use the same base, then our numbers will vary consistently. Often the natural base is preferred, and sometimes base 10 is used, in which case the units are not bits but, respectively, nats or hartleys. Note that this formula (with base 2) reproduces the 1.75 obtained for the A's, B's, C's, and D's, above.
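Here is a minimal Python version of the entropy formula, with the base left as a parameter. The function name and the convention of skipping zero-probability symbols are my own choices.

```python
from math import e, log

def entropy(probabilities, base=2.0):
    """Shannon entropy H = -sum(p_i * log(p_i)); terms with p_i = 0 contribute nothing."""
    return -sum(p * log(p, base) for p in probabilities if p > 0)

p = [0.5, 0.25, 0.125, 0.125]   # the A, B, C, D example
bits = entropy(p, base=2)
nats = entropy(p, base=e)
hartleys = entropy(p, base=10)
print(bits, nats, hartleys)
```

The three numbers differ only by constant factors (for instance, nats = bits × ln 2), which is the sense in which the choice of base doesn't matter much.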

Why does this formula from communication theory have anything to do with science and inference?

Well, every scientific experiment, in fact every observation or experience, can be considered as a message from Mother Nature to us. Every bit of information from Mother Nature reduces the number of plausible universes by half, and we can count the bits using Shannon's theory. (By the way, I should be careful: the universe hates it when we personify her.) In the next post, I'll give some insight into why the maximum entropy principle (which I've already applied to kangaroos) is a valid tool in statistical inference.

Why do we call it entropy?

Good question, thank you very much for asking. Legend has it, Shannon initially adopted the term entropy primarily because of the remarkable similarity of his formula to an equation already used by physicists to calculate thermodynamic entropy. It was supposed to be an analogy, but several authors see it as more. There is nothing resembling consensus on this issue, but it seems clear to me that the information theoretic and thermodynamic uses of the term entropy are essentially the same.

Borrowing from Arieh Ben-Naim's examples in Entropy Demystified², we can look at one of the canonical illustrations of physical entropy changing: the expansion of a gas in a closed container. The box depicted below has a partition down the middle. The left side contains gas molecules, while the right side has been evacuated. For each gas molecule, we play again the 20 questions game to figure out where it is - except that 4 questions are enough to fix its location to one of the 16 regions (to the left of the partition) marked with the dotted lines.

At a certain point in time, we cause the partition to dissolve, allowing the gas to spread to all parts of the box, as shown in the second picture:

The physicist and the information theorist are agreed: entropy has increased. From physics, because the temperature hasn't changed, the change in entropy is proportional to the log of the ratio of the volumes occupied by the gas after and before the partition was removed. That is, log₂(2) = 1 more unit of entropy per molecule has been added (in bits, rather than nats).

From the point of view of information theory, in order to locate any particular molecule, I now have to find the right place from among 32 little regions - I need one more bit of information than I needed before (the total entropy of a message is the average symbol entropy multiplied by the length of the message).
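The bookkeeping for the expanding gas is simple enough to check in Python. The figure of 100 molecules is an assumption made for illustration; the 16 and 32 regions are as described above.

```python
from math import log2

def bits_to_locate(n_regions):
    """Yes/no questions needed to pin one molecule to one of n equally likely regions."""
    return log2(n_regions)

N_MOLECULES = 100   # assumed number of molecules in the box
before = bits_to_locate(16)   # gas confined to the left half
after = bits_to_locate(32)    # gas free to occupy the whole box
extra_bits_total = N_MOLECULES * (after - before)
print(before, after, extra_bits_total)
```

The extra bit per molecule is the log₂ of the volume ratio, and the total increase is that one bit multiplied by the number of molecules, in the same way that the entropy of a message is the symbol entropy multiplied by the message length.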

(We could also have achieved an increase in physical entropy by increasing the temperature rather than the volume, but then, according to the Maxwell distribution, the width of the probability curve associated with each particle's velocity would also increase, thus increasing the number of bits needed to pinpoint each particle's momentum.)

It is important to note that the even dispersal of the gas throughout the box, after the partition is removed, is not caused by our amount of information having been reduced, as some of my teachers suggested when I was an undergraduate. Of course, this belief commits the mind-projection fallacy, and gets the situation exactly backwards.

As we'll see in the next post, the reason that physical systems tend to be found in or evolving towards states of maximal entropy (the second law of thermodynamics) is exactly the same reason that maximum-entropy distributions are appropriate for inference under missing information (the maximum entropy principle): those maximum-entropy states / distributions have the highest multiplicity.

References

[1]	Shannon, C.E., "A Mathematical Theory of Communication," Bell System Technical Journal 27 (3), pages 379–423, 1948
[2]	Ben-Naim, A., "Entropy Demystified," World Scientific Publishing Company, 2008