Here's a useful thought experiment (slightly reworded) from Ronald Fisher's well-known 1925 textbook, 'Statistical Methods for Research Workers':
In each of two nearly identical universes, agricultural researchers wanted to compare 2 fertilizers. In each universe, the same protocol was followed: 2 plots of land were each divided into two parts, and the different parts treated with the different fertilizers. The same crop plant was cultivated on each part of each plot, and the individual yields recorded. The yields, in tons, were:
Universe 1:

| plot | fertilizer A | fertilizer B |
|------|--------------|--------------|
| 1    | 20           | 28           |
| 2    | 23           | 32           |
Universe 2:

| plot | fertilizer A | fertilizer B |
|------|--------------|--------------|
| 1    | 20           | 28           |
| 2    | 23           | 41           |
In which universe is there stronger evidence for the advantage of fertilizer B over fertilizer A?
In universe 1, the average advantage is 8.5 tons per plot, while in universe 2, the advantage is 13 tons. Seems like the farmers in universe 2 can place greater justified faith in fertilizer B.
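Explicitly, averaging the per-plot advantages of B:

$$ \text{universe 1: } \frac{(28 - 20) + (32 - 23)}{2} = 8.5, \qquad \text{universe 2: } \frac{(28 - 20) + (41 - 23)}{2} = 13. $$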
But that conclusion is a bit too quick. There's more to the strength of evidence than just the magnitudes of the averages. We also have to consider the quality of the evidence, the signal-to-noise ratio. In universe 1, the numbers for each fertilizer are tightly clustered (20 is not much different from 23, and 28 is not much different from 32), supporting the idea that the experiment was well controlled: random factors probably contributed little to the outcomes.
In universe 2, however, the experiment doesn't look as well controlled. There's a big difference between the 2 results for fertilizer B, indicating that there is much more going on than just the choice of fertilizer. There's apparently more noise in this case, and if there's that much noise, then maybe the outcome of the experiment is purely down to random chance.
To analyze the relationship between the signal and the noise, Fisher recommends a null-hypothesis significance test (well, he invented significance tests, after all). Modelling the two sets of samples (A and B) as drawn from the same normally distributed population (the null hypothesis), we can calculate the plausibility of the observed difference between their means under the null hypothesis, H0. If the observed data, D, is too implausible under H0, then, as tradition goes, H0 is rejected. The problem is, we only have a few samples from which to calculate the width of that normal distribution, so another distribution, Student's t-distribution, which accounts for the uncertainty of the standard deviation, is used instead. To get a p-value, we have to calculate a t-statistic, and integrate the t-distribution from that statistic out to infinity to obtain the desired implausibility of D under H0.
To get the t-statistic, we can first calculate an aggregate sample standard deviation, s:

$$ s = \sqrt{\frac{\sum_{i \in \{A,B\}} \sum_{x_i} \left( x_i - m_i \right)^2}{n_A + n_B}} $$
where x_i, m_i, and n_i are respectively the yields, averages, and numbers of samples for fertilizer i.
The t-statistic for comparison of two means (where H0 states that the 2 means are the same) is then given by

$$ t = \frac{m_B - m_A}{s \sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}} $$
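Plugging in our data (note that with n_A = n_B = 2, the factor √(1/n_A + 1/n_B) is just 1), by my arithmetic:

$$ s_1 = \sqrt{\tfrac{4.5 + 8}{4}} \approx 1.77, \quad t_1 = \tfrac{8.5}{1.77} \approx 4.8; \qquad s_2 = \sqrt{\tfrac{4.5 + 84.5}{4}} \approx 4.72, \quad t_2 = \tfrac{13}{4.72} \approx 2.8. $$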
Taking account of the number of degrees of freedom in the experiment, n_A + n_B − 2 = 2, we can consult tables or almost any mathematical software to perform the required integral, which directly gives the p-value. For universe 1, the p-value I get from this test (two-tailed integration) is 0.041: quite significant, so the null hypothesis is on shaky ground. For universe 2, however, where we noticed an apparently far greater random component in the data, the p-value is 0.11. This is almost 3 times larger, meaning that the data are more believable here under H0. We have weaker grounds for supposing that the fertilizers perform any differently in universe 2.
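For concreteness, here's a minimal Python sketch of the whole recipe (scipy's `t.sf` does the tail integral); by my reckoning it reproduces the p-values quoted above:

```python
import numpy as np
from scipy import stats

def two_sample_p(a, b):
    """Significance test for the difference of two means, as described above."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n_a, n_b = len(a), len(b)
    # aggregate sample standard deviation s (divisor n_A + n_B, per the formula above)
    s = np.sqrt((((a - a.mean()) ** 2).sum()
                 + ((b - b.mean()) ** 2).sum()) / (n_a + n_b))
    # t-statistic for the comparison of the two means
    t = (b.mean() - a.mean()) / (s * np.sqrt(1 / n_a + 1 / n_b))
    df = n_a + n_b - 2
    # two-tailed: integrate the t-distribution from |t| out to infinity, both tails
    return 2 * stats.t.sf(abs(t), df)

print(two_sample_p([20, 23], [28, 32]))  # universe 1 -> ~0.041
print(two_sample_p([20, 23], [28, 41]))  # universe 2 -> ~0.11
```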
This rare foray into the realm of orthodox stats has hopefully been sufficient to illustrate the point about quality of information, in terms of signal and noise (don't ask for guarantees that I did everything correctly, I may never understand the mindset under which these significance tests make sense). What I don't like about the null-hypothesis significance test, though (among other things), is that no alternative hypotheses are formulated or evaluated. Without comparing H0 to any other H's, the whole process is frankly rather hollow. What's more, if H0 is rejected, then further machinery is required to figure out how large the effect is. What I want, generally, is a set of proper probabilities, worked out for a whole range of possibilities, including H0.
Just for a laugh, then, I'll work through an approximate model that'll allow me to plot a continuous probability distribution over a whole range of values for the average difference between the yields for the 2 fertilizers. I'll use the t-statistic again, but I won't be integrating tails (at least, not until after I have a posterior distribution). This t-statistic will help me get around the ambiguity concerning the width of the noise distribution when calculating the likelihood function.
We can test the significance of the mean of a single set of samples from a single population, relative to some hypothetical mean, μ0, using another formula for the t-statistic:

$$ t = \frac{\langle x \rangle - \mu_0}{s / \sqrt{n}} $$
where s is the regular sample standard deviation. (For emphasis: the reason for the different formula is that we are doing a different test, looking at a single mean, ⟨x⟩, as opposed to comparing 2 means.)
The likelihood function, P(D | HI), is calculated by evaluating the t-distribution at this statistic, with n − 1 = 1 degree of freedom.
The single population we're looking at is the population of differences between fertilizer B and fertilizer A. The parameter to vary in order to generate the likelihood function is the hypothesis, μ0.
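For universe 1, for example, the differences are 8 and 9 tons, so ⟨x⟩ = 8.5 and s = √0.5 ≈ 0.71, giving (by my arithmetic)

$$ t(\mu_0) = \frac{8.5 - \mu_0}{\sqrt{0.5} / \sqrt{2}} = \frac{8.5 - \mu_0}{0.5}. $$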
As a prior density, I'll use a normal distribution centered at zero. To assign a width, I'll set the standard deviation to 15 tons, which means that there is a very small probability (< 5%) that the absolute value of the difference in yields for the two fertilizers exceeds 30 tons. The posterior probability is then simply the normalized product of this prior and the likelihood function, from Bayes' Theorem.
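Here's a rough numerical Python sketch of that posterior calculation, under the prior and likelihood just described; the grid limits and resolution are arbitrary choices of mine, and the final lines integrate each posterior over μ0 ≤ 0 (used below):

```python
import numpy as np
from scipy import stats

def posterior(diffs, mu0, prior_sd=15.0):
    """Posterior density over the mean yield difference, on a grid of mu0 values."""
    d = np.asarray(diffs, float)
    n = len(d)
    s = d.std(ddof=1)                    # regular sample standard deviation
    # t-distribution with n - 1 degrees of freedom, evaluated at the one-sample
    # t-statistic for each hypothesized mean mu0: proportional to the likelihood
    t = (d.mean() - mu0) / (s / np.sqrt(n))
    like = stats.t.pdf(t, df=n - 1)
    post = stats.norm.pdf(mu0, 0.0, prior_sd) * like   # prior times likelihood
    return post / (post.sum() * (mu0[1] - mu0[0]))     # normalize numerically

mu0 = np.linspace(-40.0, 60.0, 2001)         # grid of hypothesized differences
post1 = posterior([28 - 20, 32 - 23], mu0)   # universe 1 differences, B - A
post2 = posterior([28 - 20, 41 - 23], mu0)   # universe 2
# probability that B is actually not better than A: posterior mass at mu0 <= 0
dmu = mu0[1] - mu0[0]
print(post1[mu0 <= 0].sum() * dmu, post2[mu0 <= 0].sum() * dmu)
```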
The graph below shows the posterior probability density as a function of μ_{B−A}, for each of our universes. The curves illustrate clearly the impact of the lower signal-to-noise ratio in universe 2: though the peak is further to the right, the tail extends further to the left, due to the lower sensitivity of the experiment in that universe. Integrating the two curves from −∞ to 0, we see that the probability that fertilizer B is actually not better than fertilizer A is twice as large in universe 2 as in universe 1, which is similar to the result above, in terms of p-values. As the old saying goes: garbage in, garbage out. Precise inference demands a well-controlled experiment.
It's only human that, quite often in the quest for knowledge, we'll derive greater confidence from results like universe 2's than from universe 1's. It takes care not to be seduced by a greater overall difference before taking time to consider how much of that difference is likely to be due to random fluctuations. Often, for brevity, we'll summarize an experiment by recording only the mean result (or, in less formal circumstances, subconsciously estimate the mean and forget all the other details), but as we've seen, to draw good-quality inferences we need to note not only the mean but also the dispersion and the number of samples (roughly, confidence in a result scales[1] according to SNR × √n). How many figures quoted by politicians or newspapers (or anybody else with influence) lose their sting when we notice that no error bar has been provided? Context is all important. Sometimes only moderately careful analysis is enough to overturn an intuitively appealing conclusion, and it's results like this that show the importance of a cultivated awareness of the mathematical machinery of rational inference.
[1] D.L. Sackett, 'Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!),' CMAJ, October 30, 2001, vol. 165, no. 9.