tag:blogger.com,1999:blog-7153393418031337342017-03-25T09:08:54.285-05:00Maximum Entropya blog about science, statistics, and rationality - one of my favorite thingsTom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.comBlogger60125tag:blogger.com,1999:blog-715339341803133734.post-13391830098107924772015-10-31T02:18:00.001-05:002015-10-31T02:18:34.350-05:00Multi-level modeling<style> td.upper_line { border-top:solid 1px black; } table.fraction { text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } </style> <style> table.num_eqn { width:99%; text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } td.eqn_number { text-align:right; width:2em; } </style> <br /><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In a <a href="http://maximum-entropy-blog.blogspot.com/2014/11/probability-trees-and-marginal.html">post</a> last year, I went through some inference problems concerning a hypothetical medical test. For example, using the known rate of occurrence of some disease, and the known characteristics of a diagnostic test (<a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#false-positive">false-positive</a> and <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#false-negative">false-negative</a> rates), we were able to obtain the probability that a subject has the disease, based on the test result.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In this post, I'll demonstrate some hierarchical modeling, in a similar context of medical diagnosis. Suppose we know the characteristics of the diagnostic test, but not the frequency of occurrence of the disease. Can we figure this out from a set of test results? 
</div><blockquote class="tr_bq"><div style="text-align: justify;"><i>A medical screening test has a false-positive rate of 0.1 and a false-negative rate of 0.15. One thousand randomly sampled subjects were tested, resulting in 213 positive test results. What is the posterior distribution over the background prevalence of the disease in this population? </i></div></blockquote><div style="text-align: justify;"><br /></div><div style="text-align: justify;"></div><a name='more'></a><br /><div style="text-align: justify;">This is technically quite similar to an example in Allen Downey's <a href="http://greenteapress.com/thinkbayes/html/index.html">book</a>, 'Think Bayes: Bayesian Statistics Made Simple.' Allen was kind enough to credit me with having inspired his <a href="http://greenteapress.com/thinkbayes/html/thinkbayes015.html#toc115">Geiger-counter example</a>, though it is really a different kind of problem from the <a href="http://maximum-entropy-blog.blogspot.com/2012/04/fly-papers-and-photon-detectors-another.html">particle-counting example</a> of mine that he was referring to. The current question seems to me more similar to Allen's problem, even though the scenario of medical screening seems quite different.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In Allen's Geiger-counter problem, a particle counter with a certain <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#sensitivity">sensitivity</a> counts <i>n</i> particles in a given time interval, and the challenge is to figure out the (on average) constant emission rate of the radioactive sample that emitted those particles. The solution has to <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#marginal">marginalize</a> over (integrate out) a nuisance parameter, which is the actual number of particles emitted during the interval, in order to work back to the activity of the sample. 
This number of emissions sits between the average emission rate and the number of registered counts, in the chain of causation, but we don't need to know exactly what the number is, hence the term, 'nuisance parameter.'</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The current problem is analogous in that we have to similarly work backwards to the rate of occurrence of the disease (similar to the emission rate of the radioactive sample) from the known test results (number of detected counts). The false-negative rate for the medical test plays a similar role to the Geiger-counter efficiency. Both encode the expected fraction of 'real events' that get registered by the instrument. In this case, however, there is an additional nuisance parameter to deal with, because now, we have to cope with false positives, as well as false negatives.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">(We could generalize the Geiger-counter problem, to bring it to the same level of complexity, by positing that in each detection interval, the detector will pick up some number of background events - cosmic rays, detector dark counts, etc. 
- and that this number has a known, fixed long-term average.)</div><div style="text-align: justify;"><br />Defining <i>r</i> to be the rate of occurrence of the disease, and <i>p</i> to be the number of positive test results, <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bayes-theorem">Bayes' theorem</a> for the present situation looks like this:</div><div style="text-align: justify;"><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-6AvGt7j7HJ8/VjOVjFsrG4I/AAAAAAAACmQ/iJvPw23XxlY/s1600/eq1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="86" src="http://3.bp.blogspot.com/-6AvGt7j7HJ8/VjOVjFsrG4I/AAAAAAAACmQ/iJvPw23XxlY/s320/eq1.png" width="320" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(1) </td> </tr></tbody></table><br /></div><div style="text-align: justify;"><br />In the statement of the problem, I didn't specify any prior distribution for the rate of occurrence, <i>r</i>. In keeping with the information content in the problem specification, we'll adopt a uniform distribution over the interval (0, 1). (In the real world, this is typically a very bad thing to do - there are very few things that we know absolutely nothing about - but here it'll give us the advantage of making it easier to check that our analysis makes sense.)</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The trick, then, is entirely wrapped up in calculating the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#likelihood">likelihood function</a>. We have to evaluate the consequences of each possible value of <i>r</i>. 
For these purposes, the state of the world consists of a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#conjunction">conjunction</a> of 3 variables: number of positive test results, number of those positive tests caused by presence of the disease, and number of people who were tested that had the disease. The first of these is the known result of our measurement. The other two are unknown, but affect how the measurement result was arrived at, so we need to integrate over them.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">(At first glance, it might seem that of these 2 nuisance parameters, if I know one, then I know the other. Don't forget, however, that there are two ways to get a positive test result: an accurate diagnosis, and a false positive.)</div><div style="text-align: justify;"><br />Because we have 3 variables that determine the relevant properties of the world, the true state of reality (assuming we know it) can be represented as a point in a three-dimensional possibility space. The likelihood function, however, only cares about one of those 3 coordinates, the number of positive test results (our measurement result), so the relevant region of possibility space is a plane, parallel to the other 2 axes. The total probability to have obtained this number of test results is the sum of the probabilities for all points on this plane. <br /><br />In general, the total probability for some proposition, x, can be written as a sum of probabilities, in terms of some set of mutually exclusive propositions, y<sub>1</sub>, y<sub>2</sub>, y<sub>3</sub>, .... 
</div><div style="text-align: justify;"><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-0MrX51Tu_Dk/VjOYmSYiEdI/AAAAAAAACmc/PwLWjpd4RwA/s1600/eq2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="40" src="http://4.bp.blogspot.com/-0MrX51Tu_Dk/VjOYmSYiEdI/AAAAAAAACmc/PwLWjpd4RwA/s400/eq2.png" width="480" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(2) </td> </tr></tbody></table><br /></div><div style="text-align: justify;"><br />This follows from the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#extended-sum-rule">extended sum rule</a>. From the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a>, each term in the sum can be re-written, yielding the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#marginal">marginalization</a> result:<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-sXPh3AUDrX0/VjOZxvEtR9I/AAAAAAAACmo/fLrNt-XkgX4/s1600/eq3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="45" src="http://4.bp.blogspot.com/-sXPh3AUDrX0/VjOZxvEtR9I/AAAAAAAACmo/fLrNt-XkgX4/s320/eq3.png" width="400" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(3) </td> </tr></tbody></table><br />Returning to the medical screening problem, let's use p<sub>t</sub> to represent the number of true positives, d to represent the number of people in the cohort who had the disease. 
The total number of participants, N, has been taken out of the background information, <span style="font-family: Times,"Times New Roman",serif;">I</span>, to be explicit. Thus, each point in our 3D hypothesis space has an associated probability that looks like this:</div><div style="text-align: justify;"><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-32XmCtuhvds/VjOch0LJIyI/AAAAAAAACnA/j4-l_i0ix6c/s1600/eq4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="56" src="http://4.bp.blogspot.com/-32XmCtuhvds/VjOch0LJIyI/AAAAAAAACnA/j4-l_i0ix6c/s200/eq4.png" width="200" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(4) </td> </tr></tbody></table><br />Treating p and p<sub>t</sub> as a single proposition, we can marginalize over d, using equation 3:</div><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-TxmdjK29IWE/VjOh7KnzvkI/AAAAAAAACnQ/wLSl0PPIX18/s1600/eq5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="50" src="http://3.bp.blogspot.com/-TxmdjK29IWE/VjOh7KnzvkI/AAAAAAAACnQ/wLSl0PPIX18/s640/eq5.png" width="561" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(5) </td> </tr></tbody></table><br /><div style="text-align: justify;">Next we just do the same thing again on the term containing p, p<sub>t</sub>:</div><div style="text-align: justify;"><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a 
href="http://2.bp.blogspot.com/-GOhYsDfEW7Y/VjOiAvdKEuI/AAAAAAAACnY/2iQOSKPChSY/s1600/eq6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="48" src="http://2.bp.blogspot.com/-GOhYsDfEW7Y/VjOiAvdKEuI/AAAAAAAACnY/2iQOSKPChSY/s640/eq6.png" width="640" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(6) </td> </tr></tbody></table><br />Each of the terms on the right hand side is given by the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#binomial">binomial distribution</a>. The first is the probability to obtain p positive test results, when there are p<sub>t</sub> true positives. In other words, it is the probability to get (p - p<sub>t</sub>) false positives from the (<i>N - d</i>) people in the group who did not have the disease, when the false-positive rate is r<sub>fp</sub>: <br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-GLzPtTND-QE/VjOmsxW2GZI/AAAAAAAACnk/2M-JN7cJXKU/s1600/eq7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="35" src="http://2.bp.blogspot.com/-GLzPtTND-QE/VjOmsxW2GZI/AAAAAAAACnk/2M-JN7cJXKU/s200/eq7.png" width="152" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(7) </td> </tr></tbody></table>The second term is the probability to get p<sub>t</sub> true positives, when the number with the disease is <i>d</i>. 
This depends on the probability for an afflicted individual to receive a positive test result, which is 1 minus the false-negative rate, (1 - r<sub>fn</sub>):<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-a65zmgxVuk0/VjOotVWwNCI/AAAAAAAACn8/PIJnF--qlQU/s1600/eq8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="35" src="http://4.bp.blogspot.com/-a65zmgxVuk0/VjOotVWwNCI/AAAAAAAACn8/PIJnF--qlQU/s200/eq8.png" width="152" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(8) </td> </tr></tbody></table><br />The third term is the probability to get d people with the condition from a sample of N, when the rate of occurrence is <i>r</i>:<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-xdoXXDSI-vU/VjOpgQTFzTI/AAAAAAAACoI/FDQLZATQW7o/s1600/eq9.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="" border="0" height="35" src="http://4.bp.blogspot.com/-xdoXXDSI-vU/VjOpgQTFzTI/AAAAAAAACoI/FDQLZATQW7o/s200/eq9.png" title="" width="127" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(9) </td> </tr></tbody></table><br />The double sum in equation (6), at each possible value for <i>r</i>, is a task best done by a computer. The python code I wrote for this is given in the appendix, below. The code, of course, does not really perform the calculation at every possible value of <i>r</i> (there are an uncountable infinity of them), but takes a series of little hops through the hypothesis space, in steps 0.002 apart. 
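<br /><br />As a cross-check on that double sum, here is a minimal sketch, not the code from the appendix. It rests on one observation that is an assumption of this sketch rather than something stated in the post: each subject independently tests positive with probability q = r(1 - r_fn) + (1 - r) r_fp, so the double sum in equation (6) should collapse to a single binomial probability for p positives out of N. The function names are hypothetical, the rates are the ones set in the appendix code, and only the standard library is used.<br /><br />

```python
from math import exp, lgamma, log, log1p

R_FN = 0.15  # false-negative rate, as set in the appendix code
R_FP = 0.10  # false-positive rate, as set in the appendix code


def binom_pmf(k, n, q):
    """Probability of k successes in n trials, each with success rate q."""
    if k < 0 or k > n:
        return 0.0
    if q <= 0.0:
        return 1.0 if k == 0 else 0.0
    if q >= 1.0:
        return 1.0 if k == n else 0.0
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(q) + (n - k) * log1p(-q))


def likelihood_double_sum(N, p, r):
    """Equation (6): sum over d (number with disease) and p_t (true positives)."""
    total = 0.0
    for d in range(N + 1):
        p3 = binom_pmf(d, N, r)                       # equation (9)
        for p_t in range(min(p, d) + 1):
            p1 = binom_pmf(p - p_t, N - d, R_FP)      # equation (7)
            p2 = binom_pmf(p_t, d, 1.0 - R_FN)        # equation (8)
            total += p1 * p2 * p3
    return total


def likelihood_single_binomial(N, p, r):
    """Shortcut assumed by this sketch: p is Binomial(N, q)."""
    q = r * (1.0 - R_FN) + (1.0 - r) * R_FP
    return binom_pmf(p, N, q)


def grid_posterior(N, p, n_points=501):
    """Normalized posterior on an evenly spaced grid over r (uniform prior)."""
    r_grid = [i / (n_points - 1) for i in range(n_points)]
    like = [likelihood_single_binomial(N, p, r) for r in r_grid]
    norm = sum(like)
    return r_grid, [x / norm for x in like]
```

For a small case such as N = 30, p = 7, r = 0.2, the two likelihood functions can be compared directly. For N = 1000 and p = 213, grid_posterior(1000, 213) returns the whole curve on the 0.002 grid without any nested loops.<br /><br />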
Because this step size is much narrower than the resulting probability peak, the approximation that the probability varies linearly between steps does no harm to the outcome. <br /><br />Recall that we're using a uniform prior, and so, from equation (1) , once we calculate the likelihood from equation (6) at all possible values for <i>r</i>, the resulting curve is proportional to the posterior distribution. Thus, a plot of the likelihoods produced by my python function (after <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normalization">normalization</a>) gives the required distribution:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Y5mOUblc7kI/VjPPOF1PUrI/AAAAAAAACoY/hrsh2cr_PAs/s1600/posterior.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="427" src="http://3.bp.blogspot.com/-Y5mOUblc7kI/VjPPOF1PUrI/AAAAAAAACoY/hrsh2cr_PAs/s640/posterior.png" width="640" /></a></div><br /><br />The figure gives a point estimate and a confidence interval. Because the posterior distribution is symmetric, my point estimate is obtained by taking the highest point on the curve.<br /><br />To get the 50% confidence interval (using my unorthodox, but completely sensible definition of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#confidence">confidence intervals</a>), I just kept moving one step to the left and to the right from the high point, until the sum of the probability between my left-hand and right-hand markers first reached 0.5. Again, this is a good procedure in this case, because the curve is symmetric - in the case of asymmetry, we would need a different procedure, if, for example, we required an interval with equal amounts of probability on either side of the point estimate.<br /><br />The centre of the posterior distribution is at r = 0.15. Let's see if that makes sense. 
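<br /><br />To see whether r = 0.15 is plausible, the forward calculation from an assumed prevalence to the expected number of positives can be written out, and then run in reverse to get a quick estimate of r from the observed count. This is only a minimal sketch, assuming the rates set in the appendix code (r_fn = 0.15, r_fp = 0.1); the function names are hypothetical.<br /><br />

```python
def expected_positives(N, r, r_fn, r_fp):
    """Expected number of positive results among N subjects at prevalence r."""
    diseased = N * r
    true_positives = diseased * (1.0 - r_fn)
    false_positives = (N - diseased) * r_fp
    return true_positives + false_positives


def prevalence_from_count(p, N, r_fn, r_fp):
    """Invert the expectation above: the r at which the expected count equals p."""
    return (p / N - r_fp) / (1.0 - r_fn - r_fp)


forward = expected_positives(1000, 0.15, r_fn=0.15, r_fp=0.1)
backward = prevalence_from_count(213, 1000, r_fn=0.15, r_fp=0.1)
print(f"expected positives at r = 0.15: {forward:.1f}")
print(f"prevalence matching 213 positives: {backward:.4f}")
```

The forward number should come out at about 212.5 and the backward one at about 0.151, in line with the peak of the posterior.<br /><br />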
With 1000 test subjects, at this rate, we expect 150 cases of the disease. With 85% of those cases correctly identified (false-negative rate = 0.15), we should have 127.5 true positive test results (on average). Also we should get (1000 - 150) × 0.1 = 85 false positive test results. Adding these together, we get 212.5 expected positive test results, which is, to the nearest integer, what was specified at the top. It looks like all the theory and coding have done what they should have done.<br /><br />For fun, we can also check that the calculated confidence interval makes sense. I've specified a 50% confidence interval, which means that if we do the experiment multiple times, about half of the calculated confidence intervals should contain the true value for the incidence rate of the disease. With only a little additional python code, I ran a Monte-Carlo simulation of several measurements of 1000 test subjects each.<br /><br />The true incidence rate was fixed at 0.15, and the number of positive test results was randomly generated for each iteration of the simulated measurement. The python package, numpy, can generate binomially distributed random numbers. I used the following lines of code to generate each number of positive test results, p:<br /><br />d = numpy.random.binomial(1000, 0.15) # number with disease<br />p_t = numpy.random.binomial(d, (1 - r_fn)) # number of true positives<br />p_f = numpy.random.binomial(N - d, r_fp) # number of false positives<br />p = p_t + p_f<br /><br />Then, using each <i>p</i>, I calculated the 50% confidence limits, as before, and counted the occasions on which the true value for <i>r</i>, 0.15, fell between these limits. I ran 100 simulated experiments, and amazingly, exactly 50 produced error bars that contained the true value. 
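<br /><br />The coverage experiment described above can be sketched as follows. This is not the code that produced the result of 50 out of 100; it is a standard-library illustration with hypothetical function names. It assumes the rates set in the appendix code, a true rate that lies on the 0.002 grid, and a shortcut of this sketch: the likelihood is taken as a single binomial in p with per-subject positive probability q = r(1 - r_fn) + (1 - r) r_fp, rather than the double sum of equation (6).<br /><br />

```python
import random
from math import exp, log, log1p

R_FN = 0.15     # false-negative rate, as set in the appendix code
R_FP = 0.10     # false-positive rate, as set in the appendix code
N_POINTS = 501  # grid over r from 0 to 1 in steps of 0.002


def simulate_positives(N, r, rng):
    """One simulated screening: count positive results among N subjects."""
    positives = 0
    for _ in range(N):
        diseased = rng.random() < r
        p_positive = (1.0 - R_FN) if diseased else R_FP
        positives += rng.random() < p_positive
    return positives


def grid_posterior(N, p):
    """Posterior over r on the grid (uniform prior, single-binomial shortcut)."""
    log_like = []
    for i in range(N_POINTS):
        r = i / (N_POINTS - 1)
        q = r * (1.0 - R_FN) + (1.0 - r) * R_FP  # chance one subject tests positive
        log_like.append(p * log(q) + (N - p) * log1p(-q))
    top = max(log_like)
    weights = [exp(v - top) for v in log_like]
    norm = sum(weights)
    return [w / norm for w in weights]


def interval_around_peak(posterior, mass=0.5):
    """Step outward from the peak, one grid point each side, until mass is reached."""
    peak = max(range(len(posterior)), key=posterior.__getitem__)
    lo = hi = peak
    total = posterior[peak]
    last = len(posterior) - 1
    while total < mass and (lo > 0 or hi < last):
        if lo > 0:
            lo -= 1
            total += posterior[lo]
        if hi < last:
            hi += 1
            total += posterior[hi]
    return lo, hi


def coverage_count(n_experiments, true_r=0.15, N=1000, seed=0):
    """Number of simulated experiments whose interval contains true_r."""
    rng = random.Random(seed)
    true_index = round(true_r * (N_POINTS - 1))  # grid index of the true rate
    hits = 0
    for _ in range(n_experiments):
        p = simulate_positives(N, true_r, rng)
        lo, hi = interval_around_peak(grid_posterior(N, p))
        hits += lo <= true_index <= hi
    return hits
```

Calling coverage_count(100) mirrors the 100-experiment run, and the count it returns depends on the seed. Because the interval grows by whole grid steps, it holds slightly more than half of the posterior mass, so a count a little above 50 would not be surprising.<br /><br />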
There certainly was a little luck here (standard deviation here is 5, so 32% of the time, such a set of 100 measurements will produce true-value-containing C.I.'s either fewer than 45 times, or more than 55 times), but still this result serves as a robust validation of my numerical procedures. Probability theory works! <br /><br />(Note: to keep the code presented in the appendix as understandable as possible, I didn't do much optimization. To have done the 100 measurement Monte-Carlo run with this exact code would have taken days, I think. The code I ran was a little different. In particular, the ranges over <i>r</i>, <i>d</i>, and <i>p</i><sub><i>t</i> </sub>were truncated, as large regions of these parameter spaces contribute negligibly. This allowed my simulation to run in just under an hour.) <br /><br /><br /><hr /><br /><h3><span style="color: blue; font-size: large;">Appendix</span></h3><br /><div style="text-align: left;"># Python source code for the calculation</div><div style="text-align: left;"># Warning: this algorithm is very slow - considerable optimization is possible</div><div style="text-align: left;"><br /></div><div style="text-align: left;">import numpy as np<br />from scipy.stats import binom # calculates binomial distributions</div><div style="text-align: left;"><br /></div><div style="text-align: left;">def get_likelihoods(N, p):</div><div style="text-align: left;"> </div><div style="text-align: left;"> # inputs:</div><div style="text-align: left;"> # N is total number of people tested<br /> # p is number of positive test results</div><div style="text-align: left;"></div><div style="text-align: left;"></div><div style="text-align: left;"></div><div style="text-align: left;"> </div><div style="text-align: left;"></div><div style="text-align: left;"> r_fn = 0.15 # false-negative rate<br /> r_fp = 0.1 # false-positive rate</div><div style="text-align: left;"></div><div style="text-align: left;"><br /> # number with disease can be anything up to 
number of people tested: </div><div style="text-align: left;"> d_range = range(N + 1) </div><div style="text-align: left;"><br /></div><div style="text-align: left;"> # number of true positives can be anything up to total number of positives:</div><div style="text-align: left;"> p_t_range = range(p + 1)</div><div style="text-align: left;"><br /></div><div style="text-align: left;"> likelihoods = [ ]<br /><br /> delta_r = 0.002<br /> rList = np.arange(0, 1 + delta_r, delta_r)<br /><br /> for r in rList: # scan over hypothesis space</div><div style="text-align: left;"> temp = 0<br /> for d in d_range: # these 2 for loops do the double summation<br /> for p_t in p_t_range: <br /> p1 = binom.pmf(p - p_t, N - d, r_fp) # equation (7)<br /> p2 = binom.pmf(p_t, d, (1 - r_fn)) # equation (8)<br /> p3 = binom.pmf(d, N, r) # equation (9)<br /> <br /> temp += (p1 * p2 * p3)<br /> <br /> likelihoods.append(temp)</div></div><div style="text-align: left;"><br /> return likelihoods<br /><br /><br /><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-44530454495393299142015-04-25T01:02:00.000-05:002015-04-28T18:05:53.266-05:00Mean vs median - a careful balancing act<br /><br /><div style="text-align: justify;">Two common measures of the location of a probability distribution are the mean and the median. 
While generally, they are quite different things, some familiar distributions have their mean and median at the same point (<strike>all such distributions are symmetric</strike>, (see comment, below) and <i>vice versa</i>).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The mean of a distribution, as we all know, is its average, while the median is, roughly speaking, the point at which the amount of probability mass to one side is the same as the amount on the other side. Upon hasty consideration, these definitions can appear to denote the same thing, and so confusion between the two concepts is common. Annoyingly, my own PhD thesis contains a sentence<sup>1</sup> that explicitly confuses the mean for the median (and furthermore, none of the half dozen eminent scientists whose job it was to assess my thesis (who otherwise all did an excellent job!) reported noticing this blunder).</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Confusion between the mean and the median is highly analogous to a difficulty experienced by many young children when they try to balance asymmetric blocks on top of one another, as has been reported by cognitive scientist Annette Karmiloff-Smith<sup>2</sup>.<br /><br /><a name='more'></a>In a radio interview a couple of years ago (BBC Radio 4, '<a href="http://www.bbc.co.uk/programmes/b01pzr00">The Life Scientific</a>', Jan. 22, 2013), Karmiloff-Smith described briefly the finding (starting about 12 minutes into the interview): young children were asked to try to balance various blocks - some symmetric, others invisibly loaded on one side - on a narrow beam. Children of a particular age group were, it seems, old enough to expect the balance point to correspond to the geometric midpoint of the object, and tried first to balance the blocks there. Obviously, in the case of the asymmetrically weighted blocks, the midpoint would not work, and the blocks would fall. 
Despite repeated attempts with the same outcome, however, often a child would remain apparently unshakable in its faith in the midpoint, and continue to try to balance the item there.<br /><br />Interestingly, many slightly younger children, perhaps not yet old enough to have learned the significance of the midpoint of an object, had an easier time adjusting from the geometric centre to the actual centre of mass after a few trials.<br /><br />The task of finding the centre of mass of a physical object is mathematically identical to the matter of locating the mean of a probability distribution. The <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#mean">mean</a> of a distribution over x, given as<br /><div style="text-align: center;"><br /></div><div style="text-align: center;"><img src="http://4.bp.blogspot.com/-0_-YkSS941I/UOfA_uYMePI/AAAAAAAABVM/MJ37nGrVmnc/s1600/eq3.tiff" /></div>is the point at which (treating distances from the mean, to the left, as negative and distances to the right as positive) the products of the individual probabilities with their corresponding distances from the mean sum up to zero. (If we shifted the origin of our coordinate system to the mean of the distribution, in the above formula, the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#integral">integral</a> would be zero.)<br /><br />From the law of the lever, however, the force with which a mass tends to tip an object on a fulcrum is given by the product of the mass with its distance from the fulcrum. Since an object will balance when the forces on one side have the same magnitude as the forces on the other side, the centre of balance also corresponds to the point at which the sum of these mass × distance products comes to zero. So, the centre of mass is also the mean of the mass distribution.<br /><br />The midpoint of an object is also closely analogous to the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#median">median</a>. 
If we reduce an object to a one-dimensional mental model, then the correspondence becomes exact. At the median, m, (assuming a continuous distribution) the amounts of mass on either side are equal:<br /><br /><div style="text-align: center;"><img border="0" src="http://3.bp.blogspot.com/-l73saYZ6_1E/UddClaQThfI/AAAAAAAAB9E/OO6sMCFUTYs/s1600/median.JPG" /> </div>In one dimension, length stands in for mass, and the median is the point equidistant from each end.<br /><br />Note, however, that even encoding for differences in density along the length of an object / distribution, the mean and the median will only be the same in the case of symmetry about the centre of mass. The median involves the integral of mass, while the mean integrates the product between mass and distance. The mean pays more attention to masses situated further out, thanks to this product, while the median doesn't care where the mass happens to be. If a distribution has an extended tail on one side only, then the mean will typically be positioned further out into the tail than the median. <br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-pE_7AD55XC0/VTsplVh1DyI/AAAAAAAACkE/TF41o7ERLrw/s1600/mean_median_fig.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-pE_7AD55XC0/VTsplVh1DyI/AAAAAAAACkE/TF41o7ERLrw/s1600/mean_median_fig.png" height="395" width="640" /></a></div>At some point in their development, children seem to learn to expect symmetry in the objects and phenomena they encounter. This is quite reasonable, as without symmetry, there can be no physics (all physical laws are realizations of symmetry of one kind or another).<br /><br />The devil is in the details, though, and the symmetry need not always be of the simplest forms. 
As we approach adulthood, we presumably come to appreciate this, and I suspect that as adults we can look forward to much faster success in balancing exercises such as the ones that the children described earlier struggled with. No doubt our continually built up experiences of mechanical interaction with reality contribute much to this attainment of maturity.<br /><br />But in the course of our day-to-day existence, we have far less cause to experience and interact with explicit probability distributions, so the lessons pertaining to them can be harder to win (particularly given our excess exposure to symmetric distributions such as the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normal-dist">Gaussian</a>). An intuitive grasp of the difference between mean and median is one presumably almost all adults possess, when it comes to simple physical objects, but banishing this confusion can be more than child's play when it comes to statistics. Hopefully, by noting (as I've tried to do here) the similarities between the mechanical and the abstract, we can ease the process. <br /><br /><br /><hr /><br /><br /><span style="color: blue;"><b>References</b></span><br /><br /><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"> [1] </td> <td>Really, you think I'm going to give you a page number? 
Go find it yourself!</td> </tr><tr> <td valign="top"> [2] </td> <td>Annette Karmiloff-Smith and Bärbel Inhelder, '<i>If you want to get ahead, get a theory,</i>' Cognition, volume 3, issue 3, p 195-212 (1975)</td> </tr></tbody></table><br /><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com3tag:blogger.com,1999:blog-715339341803133734.post-63362018218663120322015-04-18T19:02:00.001-05:002015-04-18T19:02:11.941-05:00The Fundamental Confidence Fallacy<div style="text-align: justify;"><br /></div><div style="text-align: justify;">The title of this post comes from an excellent recent paper (as far as I can tell, still in draft form) on misunderstandings of confidence intervals. The paper, '<i>The fallacy of placing confidence in confidence intervals</i>', by R. D. Morey <i>et al.</i><sup>1</sup> is by almost exactly the same set of authors whose earlier paper on a very similar topic I criticized, <a href="http://maximum-entropy-blog.blogspot.com/2014/03/whose-confidence-interval-is-this.html">before</a>, but the current paper does a far better job of explaining the authors' position, and arguing for it.<br /><br />The authors identify the fundamental confidence fallacy (FCF) as believing automatically that,<br /><blockquote class="tr_bq"><i>If the probability that a random interval contains the true value is X%, then the plausibility (or probability) that a particular observed interval contains the true value is also X%. </i></blockquote><br /><a name='more'></a>A confidence interval, a kind of error bar, is a device used in problems of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#param-est">parameter estimation</a> (e.g. what is the age of the universe? how much rain will fall tomorrow? which gene is most responsible for making me so damn handsome?). 
As well as calculating a point estimate, such as the most probable value of a parameter, a researcher working on some relevant data may also provide a confidence interval, indicating a region around the parameter's point estimate, in which (hopefully) one can expect the true value of the parameter to reside. This is done because sampling errors (<a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#noise">noise</a>) in the data collection process will typically result in the point estimate being not exactly equal to the true value of the parameter.<br /><br />In general, the point of such error bars is that a very broad error bar indicates a not very precise measurement, where our confidence that the point estimate is very close to the true value is low, and <i>vice versa</i>.<br /><br />Morey <i>et al</i>. show, however, that the traditional definition of the confidence interval is too broad to automatically satisfy the general requirements of error bars - hence their contention that the belief summarized in FCF (above) is indeed a fallacy.<br /><br />Here's the <b>conventional definition</b> of confidence intervals that they take from the literature:<br /><blockquote class="tr_bq"><i>A X% confidence interval for a parameter θ is an interval (L, U) generated by an algorithm that in repeated sampling has an X% probability of containing the true value of θ.</i></blockquote>The problem that they rightly identify about this definition is that it fails to account for differences in how informed one is, before and after the data have been gathered. They demonstrate this with a beautifully simple thought experiment about a lost submarine, which I'll try to explain:<br /><br />The crew of a boat want to drop a rescue line down to the hatch of a 10 m long submarine. They don't know the exact location of the sub, but they know its length, and they know that it produces distinctive bubbles from uniformly distributed locations along its length. 
They also know that the hatch is exactly half way along the sub's length. They decide to watch for bubbles, to infer the sub's position, but they want to launch their rescue attempt quickly, so they decide to do so as soon as their 50% confidence interval for the location of the hatch is sufficiently narrow.<br /><br />They reason that for 2 randomly positioned bubbles, the hatch is equally likely to be between them as not, so for a two-bubble data set, they devise the following confidence interval:<br /><br /><div style="text-align: center;">⟨x⟩ ± Δx / 2 </div><br />where ⟨x⟩ is the average position of the 2 bubbles, and Δx is their separation - i.e. the 50% confidence interval is defined exactly by the positions of the 2 bubbles. Note that this confidence interval satisfies perfectly the definition given above.<br /><br />Unfortunately, when the bubbles rise, they do so with a separation of only 1 cm. The rescuers calculate a 1 cm wide confidence interval, and, falling for the fundamental confidence fallacy, they infer that the hatch has a 50% probability to be in this very narrow region.<br /><br />In reality, though, this extremely (and spuriously) precise inference has been drawn from almost maximally uninformative data. The bubbles could have arisen from either end of the submarine, or anywhere in between, meaning that the hatch could be located anywhere within a 10 m interval. The probability that the hatch is between the 2 bubbles is about as low as it can be.<br /><br />On the other hand, had the bubbles been 10 m apart - indicating that they came from opposite ends, the rescuers would have been able to infer the exact location of the hatch, but from their adopted confidence procedure would have obtained a 10 m wide C. 
I., and hence would have wanted to wait for more data, perhaps losing their only chance to complete the rescue.<br /><br />The problem with the conventional definition of confidence intervals is that it is set up with respect to the set of all possible measurement outcomes, rather than the specific measurement outcome that occurred. An inference that is valid when the data are not known can hardly be expected to remain necessarily valid when the data are known, but the standard wisdom regarding confidence intervals ignores this.<br /><br />To remedy this, I've always advocated a somewhat different, <b>unconventional definition</b> of confidence intervals:<br /><blockquote class="tr_bq"><i>An X% confidence interval is a subset of the hypothesis space that has an X% posterior probability to contain the true state of the world.</i> </blockquote>(On a one-dimensional hypothesis space, this subset would be defined by lower and upper bounds, (L, U), exactly as appear in the above conventional definition.)<br /><br />This definition also satisfies the conventional one, but is narrower in a way that eliminates its worst problems. <br /><br />This is essentially the same recommendation made by Morey <i>et al</i>. (they prefer to call it a credence interval). Under a definition of this kind, the fundamental confidence fallacy disappears - if I give you a parameter estimate with a 95% confidence interval, then (i) 95% <i>is</i> the probability that the true value of the parameter lies within the interval and (ii) a narrow confidence interval necessarily corresponds to a precise determination of the parameter (and <i>vice versa</i>). <br /><br />Under favourable conditions, many of the common frequentist confidence procedures do a reasonable job of approximating these desiderata, but I feel strongly (as, presumably, do Morey <i>et al</i>.) 
that it is far better to start with and understand a sensible definition, and hence understand when our approximate methods are (and are not) valid, than to sweep such validity issues under the carpet - pretend they don't exist - and proceed <i>ab initio</i> from nonsense.<br /><br /><br /><span style="color: blue;"><b>Why my criticism of the earlier paper is still valid:</b></span><br /><br />Now that Morey <i>et al</i>. have produced this very good paper, I can better appreciate what they were trying to get at, in their earlier paper, '<i>Robust misinterpretation of confidence intervals</i>'<sup>2</sup>, and I can better explain the nature of the error they made in it. <br /><br />In the earlier paper, the authors described providing a sample of scientists with a questionnaire to assess their understanding of confidence intervals. They asked:<br /><blockquote class="tr_bq"><i>Professor Bumbledorf conducts an experiment, analyzes the data and reports, "the 95% confidence interval for the mean ranges from 0.1 to 0.4." Which of the following statements are true:</i> </blockquote> They then listed a number of statements, including<br /><blockquote class="tr_bq"><i>There is a 95% probability that the true mean lies between 0.1 and 0.4.</i></blockquote>They went on to claim that this statement, and several others of related forms, were false. They were wrong. The reason they were wrong is essentially the same reason that the statement of FCF, above, is indeed a fallacy: they failed to appreciate that a person who has seen the data may have a different probability assignment to a person who has not. I have not seen Prof. 
Bumbledorf's data, so the statement immediately above this paragraph is correct as far as I'm concerned (as I proved straightforwardly in the <a href="http://maximum-entropy-blog.blogspot.com/2014/03/whose-confidence-interval-is-this.html">earlier post</a>, using the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bernoulli-urn">Bernoulli urn rule</a>). For Bumbledorf, however, who is aware of the data, the statement is not guaranteed to be true (depending on the confidence procedure he used).<br /><br />The authors fell foul of a common form of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#mind-projection">mind-projection fallacy</a>, by acting as if there is one true probability distribution, independent of one's state of information.<br /><br />For their questionnaire to have worked the way they wanted, the authors should have asked something like '<i>... which of the following statements are necessarily valid for Bumbledorf to make?</i>'<br /><br /><br /><span style="color: blue;"><b>Conclusion</b></span><br /><br />The traditional definition of the confidence interval allows for methods of calculating error bars that do not satisfy the basic requirements of error bars:<br /><br /><ol><li>The integral over an X% interval may not be X%, and (as seen in the submarine example) may be drastically smaller. The probability for a parameter to be inside the confidence interval may be much less than the claimed confidence level. </li><li>The definition allows for procedures that produce a narrow interval in cases where the measurement is very imprecise, and a very broad interval, where the parameter can be inferred exactly. </li></ol></div><div style="text-align: justify;"> </div><div style="text-align: justify;">As always - in all aspects of life - correct reasoning is <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bayes-theorem">Bayesian</a> reasoning. 
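<br /><br />As a rough numerical check on the submarine example, here is a minimal Python sketch. It is not taken from Morey <i>et al</i>.: the function names, the simulated range of hatch positions, and the flat prior over the hatch position are all my own assumptions. The first function estimates how often the two-bubble interval contains the hatch over many repeated trials; the second gives the posterior probability that the hatch lies between the bubbles once their separation is known.

```python
import random

SUB_LENGTH = 10.0  # submarine length in metres, as in the thought experiment


def two_bubble_coverage(n_trials=200_000, seed=0):
    """Fraction of simulated trials in which the hatch lies between the two bubbles."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        hatch = rng.uniform(-100.0, 100.0)  # arbitrary true hatch position (assumption)
        bow = hatch - SUB_LENGTH / 2        # hatch is exactly half way along the sub
        b1 = rng.uniform(bow, bow + SUB_LENGTH)
        b2 = rng.uniform(bow, bow + SUB_LENGTH)
        if min(b1, b2) <= hatch <= max(b1, b2):
            hits += 1
    return hits / n_trials


def prob_hatch_between(separation):
    """Posterior probability that the hatch lies between the observed bubbles.

    Assumes a flat prior on the hatch position. The hatch is then uniformly
    distributed over an interval of width SUB_LENGTH - separation, centred
    on the midpoint of the two bubbles.
    """
    allowed_width = SUB_LENGTH - separation
    if separation >= allowed_width:
        return 1.0
    return separation / allowed_width


if __name__ == "__main__":
    print(two_bubble_coverage())      # long-run coverage of the procedure
    print(prob_hatch_between(0.01))   # bubbles 1 cm apart
    print(prob_hatch_between(10.0))   # bubbles at opposite ends of the sub
```

Under these assumptions the procedure covers the hatch in about half of all trials, as the conventional definition requires, yet for bubbles 1 cm apart the posterior probability is only about 0.001, and for any separation of 5 m or more it is 1.<br /><br />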
By calculating error bars from <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#posterior">posterior distributions</a>, or algorithms that deliberately strive to approximate them, these problems with conventionally defined confidence intervals are immediately dissolved.<br /> <br /><br /><hr /><br /><br /><span style="color: blue;"><b>References</b></span><br /><br /><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"> [1] </td> <td>R.D. Morey, R. Hoekstra, M.D. Lee, J.N. Rouder, and E.-J. Wagenmakers, '<i>The fallacy of placing confidence in confidence intervals</i>,' available in draft form, <a href="http://pcl.missouri.edu/sites/default/files/Morey.etal_.2014.CI_.pdf">here</a></td> </tr><tr> <td valign="top"> [2] </td> <td>R. Hoekstra, R.D. Morey, J.N. Rouder, and E.-J. Wagenmakers, <i>'Robust misinterpretation of confidence intervals'</i>, Psychonomic Bulletin & Review, January 2014 (<a href="http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf">link</a>)</td> </tr></tbody></table><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com4tag:blogger.com,1999:blog-715339341803133734.post-42911092256324587242014-12-12T22:00:00.001-06:002014-12-15T17:10:12.182-06:00Science is for Everyone<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In the <a href="http://maximum-entropy-blog.blogspot.jp/2014/12/scientism.html">previous post</a>, I explained that science is suitable for investigating all matters. Pursuing a similar theme, I want now to discuss how science is for all people, not just bearded academics with white lab coats. 
(Pardon the stereotype, and let me emphasize that there is no good reason why 50% of all scientists should not be women.)</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I mentioned something in that last post that is also central to this discussion: scientific method is a graded affair - not black or white. Whatever we can learn by implementing a low level of scientific rigour, we can learn a little more, in a little more detail, and with a little more confidence, by applying a slightly more systematic procedure.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"></div><a name='more'></a>It is simply not the case that one goes to university to study for a degree in science, passes one's exams, then receives a degree certificate, inscribed with the words, 'Now you are imbued with the power of science. All your future endeavours will be fully scientific. None without this certificate will possess the sacred scientific touch.'<br /><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Understanding of and ability to implement scientific method is something that is built up and refined over many years. I have three degrees in physics, with a couple of years of postdoctoral experience, and I'm still learning - much to my perpetual delight!</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">But this is by no means to claim that those without my kind of training are barred from joining the party. The basic principles of scientific thought are really pretty simple, and can be grasped quite easily. There are good reasons for this. The thing that ultimately makes science so special is: it works! And the thing about humans, as organisms evolved through natural selection, is that we come equipped with brains that, for the most part, also work. Thus, scientific method and human brains can readily form an easy, comfortable partnership. 
Scientific method is just cultivated common sense.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Popularizers of science often draw on the purely curiosity-driven aspects of science. Glamorous, big-science instruments, such as the Hubble space telescope and the large hadron collider, play a prominent part. This is good, but it is not enough. We also need to draw particular attention to the extraordinary practical advantages of being able to figure stuff out. Knowing things (forming rationally supportable high levels of confidence) means being able to make very effective decisions.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">And no matter what level of scientific rigour you are currently working at, you can always achieve infinitesimally more robust conclusions, and hence infinitesimally more effective decisions, by applying an infinitesimally more scientific approach. You don't have to be a rocket surgeon to use and benefit from a little scientific optimization.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">As long as you are a decision-making entity with values (an intelligent being), then you desire to be able to make effective decisions (see <a href="http://maximum-entropy-blog.blogspot.com/2013/09/is-rationality-desirable.html">Is rationality desirable?</a>). 
No matter what question is relevant to your pursuit of value, whether it is how to make good bread, or <a href="http://maximum-entropy-blog.blogspot.com/2012/05/how-to-read-newspaper.html">how to judge the merits of newspaper stories</a>, scientific method can get you the answer most efficiently.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">A couple of nights ago, in the wee small hours, I was sitting in a nuclear research facility near Tokyo, blasting a detector with some extremely frisky atomic nuclei, when the conversation with my boss (while we fought off the urge to sleep) turned to the topic of science-fair projects. Having been educated in Ireland, I didn't grow up with the science-fair tradition, which is a big shame, though I did do plenty of exciting experiments with my dad, which probably played a big part in forming my eventual disposition as an adult.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">My boss recalled that of all the projects that his kids had done, the one that they most enjoyed and remembered most vividly was one for which they tested a range of electric batteries to find out which would last longest in their toys. The reason they enjoyed it, of course, was that they really wanted to know the answer. It was a serious practical question. I think this is a fantastic lesson for a child to learn: not only does it encourage fascination and familiarity with the scientific process, but it also conveys the practical advantages of systematic investigation.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Familiarity with fundamentals of experimental design, such as control and randomization, and awareness of basic data collection and reduction techniques, coupled with a willingness to use them (and to constantly improve one's use of them!) 
is an immensely powerful thing.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Beyond that, a moderate interest in technical topics, such as medical research, can help one to understand research results reported in the media, and so make better informed decisions concerning one's use of and attitude towards various technologies. This convenient NHS <a href="http://www.nhs.uk/news/Pages/Howtoreadarticlesabouthealthandhealthcare.aspx">guide to understanding medical research stories</a> is a good place to start, if you're interested in that particular thing, but also introduces several general concepts in the field of estimating the merits of evidence. </div><div style="text-align: justify;"><br />Ultimately, a society in which individuals value science becomes a society that collectively values science, and a society that is better equipped to face its many challenges. Science is for scientists. Science is also for the common man and woman (not that there is anything uncommon about scientists!). Science is also for the politicians, who have to design a mutually beneficial path through the difficult territory of being human. Democratically elected politicians will continue to ignore evidence in favour of their personal agenda, as long as the voting public continue telling them that this is OK. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-39307259325977403922014-12-12T19:09:00.001-06:002014-12-12T19:09:15.696-06:00Scientism<br /><br /><div style="text-align: justify;">It perplexes me that the word 'scientism' is predominantly used as a slur to put people down and criticize their world view and methodology. 
I realized something recently, however, that helped me understand the error that is often being made, and how that error compounds the problem that is often being called out when people make the accusation of scientism.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">First off, let's settle what scientism is. <a href="http://en.wikipedia.org/wiki/Scientism">Wikipedia</a> gives a good definition, that fits well with the contexts in which I see the term used:</div><div style="text-align: justify;"><blockquote class="tr_bq"><i>Scientism is belief in the universal applicability of the scientific method and approach, and the view that empirical science constitutes the most authoritative worldview or most valuable part of human learning to the exclusion of other viewpoints. </i></blockquote></div><br /><a name='more'></a><br /><div style="text-align: justify;">Well, that's a strange accusation. I've made it clear in numerous places that this is exactly my position, and I've repeatedly defended that position with robust logical arguments. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">On the universal applicability of scientific method: yes, absolutely. If a thing has meaningful consequences, then why should science not be a good way to learn about its properties? 
Whether it is something normally associated with science (astronomy, atomic physics, medicine, evolutionary biology, or whatever), or something more concerned with <a href="http://maximum-entropy-blog.blogspot.com/2014/02/practical-morality-part-2.html">politics</a>, law (<a href="http://maximum-entropy-blog.blogspot.com/2013/02/legally-insane.html">here</a> and <a href="http://maximum-entropy-blog.blogspot.com/2013/06/crime-and-punishment.html">here</a>), <a href="http://maximum-entropy-blog.blogspot.com/p/rational-ethics.html">morality</a>, issues of <a href="http://maximum-entropy-blog.blogspot.com/2012/08/bayes-theorem-all-you-need-to-know.html">religion</a>, or the <a href="http://maximum-entropy-blog.blogspot.com/2012/06/the-mind-projection-fallacy.html">supernatural</a>, all things can be investigated scientifically. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">To be clear on my position on the supernatural: there is no such thing, it's a necessarily empty set (I'll come back to this point in a future post), but there are certain putative entities often identified as supernatural: ghosts, goblins, fairies, gods, and such like. If such things did exist, however, (it's hard to say absolutely categorically that they don't, at least without being more precise about what they are, though there is, of course, no evidence supporting belief in any of them) then they would necessarily be physical beings, amenable to scientific investigation.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">On the maximally authoritative nature of science-based investigation: again, this is necessarily so. Being scientific really just means being systematic. The only alternatives are at best, ignorance and at worst, fantasy passed off as fact. Why would any person wish to know about something, and choose to be non-systematic in the manner of their formation of beliefs about it? 
One cannot, while being coherent (see my post, <a href="http://maximum-entropy-blog.blogspot.com/2013/09/is-rationality-desirable.html">Is rationality desirable?</a>). To think that one can achieve rationally supportable degrees of belief, without using a rationally supportable procedure is a clear mistake. And this is identically what scientific method does: produce rationally supportable degrees of belief.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Now some might be tempted to argue that science isn't always necessary. Some things, for example, are just obvious. But let me emphasize: scientific method is a graded affair - not black or white. Whatever we can learn by implementing a low level of scientific rigour, we can learn a little more, in a little more detail, and with a little more confidence, by applying a slightly more systematic procedure.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">___</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Now, here's the thing I noticed when I recently saw a little scientism bomb being dropped, elsewhere on the internet, by a person whose awareness of the scope and meaning of scientific method I have good reasons to trust. It seemed to me that what this person was complaining of was actually the opposite of scientism, i.e. dismissal of certain valid philosophical questions as irrelevant or intractable, because they are perceived not to fall within the scope of scientific method. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Well, let's be clear about what <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#philosophy">philosophy</a> is: philosophy is love of wisdom. 
Wisdom can be thought of as dividable into three categories (where by 'knowing', in the following, I mean having a rationally supportable high level of confidence in some proposition):</div><ol style="text-align: justify;"><li>Knowing what things are true</li><li>Knowing what procedures are effective for discerning what things are true</li><li>Knowing how to behave effectively under certain circumstances</li></ol><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Note, however, that items 2 and 3 are really special cases of item 1. The question of whether it is valid, for example, to use probability theory when trying to attain rationally supportable degrees of belief is a question of fact about the nature of the real world (spoiler alert: it <i><b>IS</b></i> valid to use probability theory). The question of how to behave is a duo of empirical problems: (i) what is my utility function? (what do I actually value? - yes, this is an empirical question, my values are physical properties of my mind), and (ii) what actions will lead to consequences that will maximize my <a href="http://maximum-entropy-blog.blogspot.com/2013/01/great-expectations.html">expected</a> utility? (In fact, 2 is also a special case of 3: how should I behave if I value knowledge of X?) So all of philosophy is about figuring out what is probably true.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">But, as I just argued, all meaningful questions of fact are best answered using science, and so love of wisdom entails a desire to follow scientific method. 
Thus, philosophy (defined as an endeavour, and not in terms of the traditional type of education received by the typical practitioner) is identical to science.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Thus, all philosophical questions fall under the scope of scientific method, and the scientist who dismisses such issues as unscientific is failing to appreciate the range of validity, the very meaning, of their own profession. This is a mistake that it's important to call out. Part of the reason we have a society run by politicians and law makers who believe that they can divine correct policy, without implementing scientific procedure, is that prominent scientists are repeatedly telling them that they can. ('Oh, that's not a scientific question, that's a matter of human affairs,' or, 'there's the evidence, now it's for you, the politicians, to decide what it means,' or perhaps worst of all, the Nuremberg defence: 'don't ask me if it's right or wrong, I'm just a scientist.')</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">But notice that the accusation of scientism completely misses the mark, here. Scientism, recall, is believing that all questions fall under science's magisterium, while the actual error being committed is the claim that certain problems are not in this category. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The cry of scientism, therefore, fails to draw proper attention to the fallacy that has been committed, and, in fact, is quite likely to reinforce it. Faced with this charge, one is, of course, free to refer to sources such as the Wikipedia definition, quoted above. Many a scientist who is somewhat on the ball, philosophically, however, is likely to look at such definitions and say, 'yup, that's me, and proud of it!' 
And of whatever mistake they might have made that prompted the rebuke, they are likely to conclude, 'if that's scientism, then I'm perfectly happy to continue committing the crime.' </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><br />Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com3tag:blogger.com,1999:blog-715339341803133734.post-63332454348480045012014-11-08T00:36:00.000-06:002014-11-08T00:36:52.694-06:00Probability Trees and Marginal Distributions<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In a blog post earlier this year about medical screening, <a href="http://www.dcscience.net/?p=6473">On the hazards of significance testing. Part 1: the screening problem</a>, statistical expert David Colquhoun demonstrates a simple way of visualizing the structure of certain probabilistic problems. This diagram, which we might call a probability tree, makes the sometimes counter-intuitive solutions to such problems far more easy to grasp (and in the process, helps put over-inflated claims about the effectiveness of screening into perspective). </div><div style="text-align: justify;"></div><a name='more'></a><br /><div style="text-align: justify;">I discussed exactly this medical screening problem in my first ever technical blog post, <a href="http://maximum-entropy-blog.blogspot.com/2012/03/base-rate-fallacy.html">The Base-Rate Fallacy</a> (though my numbers were entirely made up, while David's come more-or-less directly from reality). I was therefore extremely jealous of his simple diagram, which conveyed instantly the structure and solution of the problem. 
How much more direct an explanation than all the words in my earlier blog post.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The problem concerns a medical diagnostic test with certain <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#false-negative">false-negative</a> and <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#false-positive">false-positive</a> rates (David's diagram uses the related (complementary) terms <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#sensitivity">sensitivity</a> and <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#specificity">specificity</a>). Given that the medical condition under test has a certain prevalence (base rate), what is the probability that an individual receiving a positive test result actually has the disease? It turns out that when the prevalence of the disease is low (often the case), this probability is usually much lower than one imagines. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In David's tree diagram (direct <a href="http://www.dcscience.net/significance-screening-Fig-1.jpg">link to the diagram</a>), with a false-negative rate of 0.2, a false-positive rate of 0.05, and prevalence of 0.01, one can clearly see why the desired probability in the case he examined is 14%, rather than the 95% that many people naturally gravitate towards. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Because of the extreme clarity offered by this kind of visualization, I've decided to translate my earlier example into a similar diagram. In my imagined case, we had a far more accurate test, with equal false-positive and -negative rates at 0.001, but also a rarer condition, with background prevalence of 1 in 10,000. 
To avoid having to think about outcomes for fractions of people, we'll imagine that exactly 10 million people are tested:</div><div style="text-align: justify;"><br /></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-NJjAv2bRjr8/VFhj_1GGlGI/AAAAAAAACh4/ZMExHxrCoOk/s1600/positive_predictive_value.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-NJjAv2bRjr8/VFhj_1GGlGI/AAAAAAAACh4/ZMExHxrCoOk/s1600/positive_predictive_value.png" height="480" width="640" /></a></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">This problem represents a fairly simple example of a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#marginal">marginal probability distribution</a> extracted from a hypothesis space over two (binary) propositions: X = receives a [positive, negative] test result, and Y = [has, doesn't have] the disease. In the linked glossary entry, I derive some general properties of such marginal distributions, but as algebra has the annoying habit of often refusing to impress a clear intuitive understanding on our minds, we can use the tree diagram to confirm our grasp on these things. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">For example, from the above diagram, we can tell instantly that the overall probability for a person to receive a positive test result is simply 10,998 divided by 10,000,000. What we've done to obtain this number, however, is to implement, without noticing, the same procedure specified by the marginalization formula I derived in the glossary. 
That formula is:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-pBIJmUEeQ24/U1Z-pNeRDTI/AAAAAAAACVk/-ltRkkTt63o/s1600/marginal_2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-pBIJmUEeQ24/U1Z-pNeRDTI/AAAAAAAACVk/-ltRkkTt63o/s1600/marginal_2.PNG" /></a></div></div><div style="text-align: justify;"><br />This just means, to get the probability for x, regardless of what y is, add up, for each possible value of y, the product of the probability for y and the probability for x given that value of y. This is effectively what the above diagram allows us to do by inspection.<br /><br />In this case, the y's are the propositions that the subject respectively has and doesn't have the disease. x is the condition that the subject's test result is positive. The overall probability for a positive test result, from the above formula, is the sum of two terms. The first of these terms is the probability to not have the disease, i.e. 1 - prevalence, multiplied by P(positive result | subject not afflicted), which is the false-positive rate = 0.001. Multiplying these together gives 0.0009999. This is exactly how we arrived at the number in the top right box in the diagram (remember that the numbers in the diagram are multiplied by 10 million).<br /><br />The second term we need is obtained by similar means, only this time, y = 'has the disease,' so P(y) is the prevalence, and P(x | y) is now 1 - false-negative rate, giving P = 0.0000999, corresponding to the top box in the lower half of the right-hand column in the diagram. 
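<br /><br />The two-term sum just described can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable and function names are my own, and the numbers are the ones assumed in the example (prevalence of 1 in 10,000, with false-positive and false-negative rates both 0.001).<br /><br />

```python
# Inputs assumed from the worked example in the text.
prevalence = 1e-4            # P(disease): 1 in 10,000
false_positive_rate = 1e-3   # P(positive | no disease)
false_negative_rate = 1e-3   # P(negative | disease)


def positive_terms(prevalence, fpr, fnr):
    """Return the two terms of P(positive) = sum over y of P(y) * P(positive | y)."""
    term_no_disease = (1.0 - prevalence) * fpr   # y = doesn't have the disease
    term_disease = prevalence * (1.0 - fnr)      # y = has the disease
    return term_no_disease, term_disease


term_no_disease, term_disease = positive_terms(
    prevalence, false_positive_rate, false_negative_rate
)
p_positive = term_no_disease + term_disease

print(f"{term_no_disease:.7f}")  # 0.0009999
print(f"{term_disease:.7f}")     # 0.0000999
print(f"{p_positive:.7f}")       # 0.0010998
```

Multiplying each printed value by 10 million should recover the corresponding expected frequency in the tree diagram.<br /><br />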
Adding these together gives P(x) = 0.0010998, corresponding to the top line in green on the diagram.<br /><br /></div><div style="text-align: justify;">Another statistical expert with a blog, David Spiegelhalter, has recently used a similar tree diagram to solve a related problem of weather forecasting, <a href="http://understandinguncertainty.org/using-expected-frequencies-when-teaching-probability?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+UnderstandingUncertainty+%28Understanding+Uncertainty%29">here</a>. Part of the success of diagrams like David Colquhoun's (and my copycat graph) is that they lay out a multidimensional problem in a visually accessible way. As David Spiegelhalter explains, another important element is that these diagrams exploit a translation from probabilities to expected frequencies, and back, which eases the workload on our conceptual machinery.<br /><br />As Spiegelhalter shows, such problems as the medical screening puzzle and his weather forecasting riddle can also be represented using contingency tables. For these two-dimensional hypothesis spaces over binary propositions, the contingency tables are drawn as 2 by 2 arrays (with totals usually added for good measure). The table for my diagnosis test problem looks like this:</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-nihucbXgXyQ/VFpLRmrWRvI/AAAAAAAACi0/xhIAPhmHbco/s1600/contingency_table.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-nihucbXgXyQ/VFpLRmrWRvI/AAAAAAAACi0/xhIAPhmHbco/s1600/contingency_table.PNG" height="136" width="640" /></a></div><br /></div><div style="text-align: justify;">These tables generalize easily to non-binary hypotheses, e.g. 'the patient has a temperature of [34, 35, 36, ... , 40] degrees centigrade.' 
Instead of having two columns for [pass, fail] the test, we would have one column for each outcome of the thermometer reading.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">One way that contingency tables don't readily generalize, however, is when there are more than two dimensions in the probability space. This is when probability tree diagrams become especially useful. On a tree diagram, higher dimensionality is added by simply increasing the number of columns of nodes (boxes). All columns but the left-most represent one dimension in probability space. Let's take an example that sticks with binary hypotheses. <br /><br />Suppose that the background prevalence of the disease in my original example is itself a random variable. Let's say that 4% of the population possess a genetic mutation that makes them an unfortunate 65 times more susceptible to that medical condition (assume that people without the mutation still have the same 1 in 10,000 risk of having the disease as before). 
Using a tree diagram, we can proceed to easily solve non-trivial problems, such as:<br /><br />(i) What is the overall probability for a person to receive a positive test result?<br /><br />(ii) What is the probability that I have the condition, assuming that I tested positive?<br /><br />(iii) What is the probability that I have the mutation, given that I tested positive for the disease?<br /><br />This time, my diagram will omit the left-most rectangle, which doesn't really convey any information, anyway, and I'll leave the translations to and from expected frequencies as an exercise for you, if you're bothered:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-VDXiGI72Q10/VFm5yMzy71I/AAAAAAAACiI/COl7TbqGjkM/s1600/problem_3d.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-VDXiGI72Q10/VFm5yMzy71I/AAAAAAAACiI/COl7TbqGjkM/s1600/problem_3d.png" height="480" width="640" /></a></div> </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Notice that the probabilities in the middle and right-hand columns are conditional probabilities, of the form P(x | y). So the number 0.9999 in the top middle box is the probability, P(un-afflicted | no mutation). From the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a>, this number multiplied by the probability to not have the mutation, P(y), is the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#joint-probability">joint probability</a>, P(xy), i.e. the probability to not have the mutation and not be afflicted by the disease. <br /><br />For question (i) we just have to add up 4 different cases. 
Working from top to bottom, the first is the probability to receive a (false) positive test, and not have the disease, and not have the mutation:<br /><br /><div style="text-align: center;">P = 0.96 × 0.9999 × 0.001 = 0.0009599</div><div style="text-align: center;"><br /></div><div style="text-align: justify;">Obtaining the other terms by the same means (each corresponding to a positive test result in the right-hand column), and accumulating them:</div><div style="text-align: justify;"><br /></div><div style="text-align: center;">P(positive test) = 0.0009599 + 0.0000959 + 0.00003974 + 0.00025974,</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">i.e.</div><div style="text-align: center;">P(positive test) = 0.00136</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">For question (ii), there are 2 ways to receive a positive test, and have the disease, corresponding to having and not having the genetic mutation. So we sum the probabilities for these two, and divide by the total, just obtained:</div><div style="text-align: justify;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-8JRRRDRaqkw/VFpLVFESLbI/AAAAAAAACjA/DVIyzptv0UY/s1600/ppv_disease.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-8JRRRDRaqkw/VFpLVFESLbI/AAAAAAAACjA/DVIyzptv0UY/s1600/ppv_disease.PNG" height="52" width="400" /></a></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">which gives</div><div style="text-align: justify;"><br /></div><div style="text-align: center;">P(disease | positive test) = 0.2624</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Finally, for (iii), instead of adding on the top line the second and fourth terms, we add the last two:</div><div style="text-align: justify;"><br /></div><div class="separator" style="clear: both; 
text-align: center;"><a href="http://4.bp.blogspot.com/-JoDK4lsWbCY/VFpLVNWjVEI/AAAAAAAACi8/YRUX37bKzV8/s1600/ppv_mutation.PNG" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-JoDK4lsWbCY/VFpLVNWjVEI/AAAAAAAACi8/YRUX37bKzV8/s1600/ppv_mutation.PNG" height="48" width="400" /></a><a href="http://2.bp.blogspot.com/-nk4kyTSVkvY/VFnB7DaXUII/AAAAAAAACik/xxLd8UwZlC8/s1600/ppv_eqn_2.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">yielding,</div><div style="text-align: justify;"><br /></div><div style="text-align: center;">P(mutation | positive test) = 0.22097</div><div style="text-align: justify;"><br /></div></div><div style="text-align: justify;">Notice that the method of going backwards through the graph (solving for parameters towards the left-hand side), in questions (ii) and (iii), is exactly the same as solving <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bayes-theorem">Bayes' theorem</a>.<br /><br /></div><div style="text-align: justify;">This problem can be made much more complicated, without introducing any further difficulty, other than the number of arithmetic operations required. As mentioned, we could have any number of columns in the diagram (dimensions to the problem), and any number of hypotheses in each dimension. Also, it would not have mattered at all if, for example, the false-positive rate depended upon whether or not the subject has the mutation. 
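<br /><br />To illustrate how mechanical this is, here is a rough Python sketch that enumerates every leaf of the three-column tree and answers questions (i) to (iii) by summing joint probabilities. The function names are hypothetical, chosen just for this illustration, and the numbers are those assumed in the example above (4% mutation rate, 65-fold increased risk, error rates of 0.001).<br /><br />

```python
from itertools import product

# Inputs assumed from the worked example in the text.
P_MUTATION = 0.04
BASE_PREVALENCE = 1e-4      # P(disease | no mutation)
RISK_MULTIPLIER = 65        # mutation carriers are 65 times more susceptible
FALSE_POSITIVE_RATE = 1e-3  # P(positive | no disease)
FALSE_NEGATIVE_RATE = 1e-3  # P(negative | disease)


def joint(mutation, disease, positive):
    """Product rule along one path of the tree: P(mutation, disease, test result)."""
    p_m = P_MUTATION if mutation else 1 - P_MUTATION
    prevalence = BASE_PREVALENCE * RISK_MULTIPLIER if mutation else BASE_PREVALENCE
    p_d = prevalence if disease else 1 - prevalence
    p_pos = 1 - FALSE_NEGATIVE_RATE if disease else FALSE_POSITIVE_RATE
    p_t = p_pos if positive else 1 - p_pos
    return p_m * p_d * p_t


def prob(event):
    """Marginalize: sum the joint probability over every leaf where event(...) holds."""
    leaves = product([False, True], repeat=3)
    return sum(joint(m, d, t) for m, d, t in leaves if event(m, d, t))


p_positive = prob(lambda m, d, t: t)                              # question (i)
p_disease_given_pos = prob(lambda m, d, t: d and t) / p_positive  # question (ii)
p_mutation_given_pos = prob(lambda m, d, t: m and t) / p_positive # question (iii)

print(f"P(positive test)            = {p_positive:.5f}")
print(f"P(disease | positive test)  = {p_disease_given_pos:.4f}")
print(f"P(mutation | positive test) = {p_mutation_given_pos:.5f}")
```

Letting the false-positive rate depend on the mutation would only change the line that sets <code>p_pos</code>. The enumeration and the sums stay the same.<br /><br />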
As long as we have numbers to put in those little boxes, the solution is close to automatic.<br /><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-68413585425168740002014-09-20T03:52:00.000-05:002014-09-20T03:52:57.701-05:00Fear of Science <div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Many people react negatively to the idea that moral principles can be inferred entirely using scientific method. There is a general feeling that this is impossible. This seems to be partly why quite a lot of people view the decline of traditional sources of moral instruction as a serious threat. This is a major, double mistake.<br /><br /></div><div style="text-align: justify;">In August last year, I attended an event, 'Answers in Science,' at Houston Museum of Natural Science, aimed at raising awareness of the way that a number of christian fundamentalists have been trying to sabotage the quality of scientific education in Texas schools. Among several that spoke there, two people raised points that struck me as highly significant, given the line of thought I've been pursuing for some time, with regard to the relationship between <a href="http://maximum-entropy-blog.blogspot.com/p/rational-ethics.html">science and morality</a>. They were Kathy Miller, from <a href="http://www.tfn.org/site/PageServer?pagename=TFN_homepage">Texas Freedom Network</a>, and Mike Aus, a former pastor. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><a name='more'></a>Kathy spoke very informatively about the mechanisms and procedures of education review boards in Texas. 
She explained how a disturbing amount of what they do is protected from public scrutiny, and how a large proportion of the people on such boards are radically skeptical christian fundamentalists, who often believe in the literal truth of the bible, and aim to have topics such as biological evolution, one of the most important of scientific theories, removed from the school curriculum.<br /><br />Teaching that the theory of evolution is false is not only factually wrong (and therefore a huge disservice to any student who wishes to pursue a scientific career, or just understand science), but also conveys a hideously distorted message concerning how knowledge and understanding are obtained, giving a corrupted view of what constitutes evidence, and seriously undermining the pursuit of rationality. This hinders society's thriving in important ways.<br /><br /></div><div style="text-align: justify;">What is interesting about these boards is that their members are publicly elected, and the science deniers who sit on them do so because they receive popular support. Kathy discussed briefly how this happens. 
She suggested that the people voting for these radical religious fanatics are themselves predominantly not religious fundamentalists, just culturally christian, and generally sympathetic to christian values.</div><div style="text-align: justify;"><br />It seemed from Miller's remarks that many who give support to those who would remove the theory of evolution from the school curriculum, and would teach that the Earth (indeed, the entire universe) is less than 10,000 years old, that dinosaurs and humans were contemporary, that global warming is not a problem, and so on, do so out of the simple desire for their children to learn to be good people.<br /><br /></div><div style="text-align: justify;">When I heard this, I immediately drew parallels with the <a href="http://old.richarddawkins.net/articles/644942-rdfrs-uk-ipsos-mori-poll-2-uk-christians-oppose-special-influence-for-religion-in-public-policy">MORI poll</a>, conducted in 2011 in the UK for the Richard Dawkins Foundation. Quoting from the linked press release:</div><div style="text-align: justify;"><blockquote class="tr_bq">"<span style="text-align: start;">Asked why they had been recorded as Christian in the 2011 Census, only three in ten (31%) said it was because they genuinely try to follow the Christian religion, with four in ten (41%) saying it was because they try to be a good person and associate that with Christianity.</span>"</blockquote>There seems to be a coherent message emerging: religious identification, and possibly support for radical religious teaching, may be associated more with a broad wish to retain moral integrity than with belief in specific religious doctrines.<br /><br /></div><div style="text-align: justify;">But the tendency of many to rely on traditional religious teachings for provision of a moral foundation is a double mistake, as I mentioned. 
Firstly, such teachings are based on superstition and dogma, constructed to further the selfish aims of their authors - not healthy indicators. As issues such as evolution, the age of the universe, and countless other matters show, religious dogma has a terrible track record in the sphere of getting things right. And predictably so - there is absolutely nothing about the methodology of its construction to predispose it towards accuracy.<br /><br />Even in the realm of morality, there is ample evidence for the poor track record of popular religions. Aficionados of these religions have been steadily accepting the continual erosion of their traditional teachings for centuries. Uncounted practices, once encouraged by popular religions, such as the keeping of slaves, persecution of other races, pursuit of holy wars, subjugation of women, proscription of homosexuality, and animal sacrifices, to name but a few, are now considered unacceptable in civilized society.<br /><br />Secondly, by rejecting science in favor of religion for moral guidance, one is implicitly declaring that morality must be determined by irrational means, which is a clear absurdity (see for example <a href="http://maximum-entropy-blog.blogspot.com/2013/09/is-rationality-desirable.html">Is Rationality Desirable?</a>). Indeed, whatever there is to be discovered about morality can only be discovered efficiently and reliably by employing scientific method (see <a href="http://maximum-entropy-blog.blogspot.com/2013/03/scientific-morality.html">Scientific Morality</a>). This applies not only to the methods of achieving our moral objectives, but also to the process of inferring what our core moral objectives are. 
In the quest to learn how best to behave, by turning one's nose up at scientific method, one is immediately taking on a serious and unnecessary handicap.<br /><br />This is a mistake made not only by many sympathetic to religious teachings (and let's not forget that there are some good things about such value systems), but also by many in the scientific community. Popular misconceptions about the limitations of science are hardly blameworthy, when the intellectual consensus encourages belief in those same alleged limitations. Many scientists I've spoken to, and many more I've read on the internet, are repulsed by the idea that science can venture into the realm of human value, and consider it ridiculous. This is a traditional view, presumably owing much to thousands of years of religious dogma. Upon moderate reflection, however, it can be seen to be an obvious travesty (see <a href="http://maximum-entropy-blog.blogspot.com/2014/02/practical-morality-part-2.html">Practical Morality, Part 2</a>). It is now crucial for the scientific community as a whole to undertake that moderate reflection, and throw off the shackles of dogma (things that scientists are usually quite good at). One only needs to ponder how it can possibly be that something meaningful can be not measurable, or that something measurable can not be investigated scientifically, to realize that there is something seriously amiss with the traditional viewpoint.<br /><br />Only after these issues are widely understood within the scientific and intellectual communities can we expect popular reliance on often harmful superstitions for moral guidance to be significantly diminished. This will not only reduce the skepticism and fear of scientific method, but will also enable the flourishing of a new (but long overdue) discipline of moral science, which in all reasonable expectation must lead to an enhanced understanding of ethics. 
<br />_<br /><br />Mike Aus's point (the other speaker I noted at that Houston event), from his personal perspective as a non-scientist, was that the theory of evolution, when partially understood, can be damn scary. Success, in terms of natural selection, is highly dependent on one's ability to outmaneuver one's competitors. This seems to give the impression that if another person has a resource that I could benefit from, then it would be natural, and therefore (according to the theory) proper, for me to stab them in the back and take it. This suggests that some of the skepticism concerning the role of science in matters of morality comes from a feeling that science (in particular, the theory of biological evolution) entails immoral behaviour.<br /><br />This comes down to another problem of the popular understanding of science. Again, a problem most easily solved after scientists themselves understand the issues more fully. I'm not claiming that biologists don't understand evolution, but as long as scientists don't grasp the origin of morality, they won't be able to explain the relationships between morality and biology, and they won't be able to identify the regions of overlap and non-overlap.<br /><br />Broadly, there are two problems with this simplistic perception of evolutionary theory. Firstly, whether or not an organism is a rival, in terms of natural selection, depends on very many factors. In particular, for a species with a technology as sophisticated as ours, there are very strong reasons why cooperation usually works out far better for us than brutal, short-sighted bullying. We term this effect 'the social contract' (see <a href="http://maximum-entropy-blog.blogspot.com/2014/01/practical-morality-part-1.html">Practical Morality, Part 1</a>).<br /><br />Secondly, whatever behavior maximizes our evolutionary fitness, this is not identical with what is morally optimal. 
'Survival of the fittest' may prefer X, but we are not 'survival of the fittest,' we are humans. Chance effects in our evolution (e.g. 'unexpected' side effects of having brains like ours), together with systematic effects in our upbringing, can equip us with values that conflict with the algorithm of natural selection (of genes, at least) in significant ways. There is no reason that propagation of my genes must be my ultimate source of value. </div><div style="text-align: justify;">_<br /><br />Morality, as a discipline, is entirely contained within scientific method. Problems of moral decision demand decent estimates of 2 classes of facts: (i) what it is we want and (ii) what the outcomes of various actions are likely to be. Consequently, such problems can only be solved by analyzing real experiences in a coherent manner - something we call science. To get the best and most reliable solution to any decision problem, we need controlled data, and we need to analyze it using sound methodology. Anything else is just guesswork. Scientists and decision makers need to consider these things, and overcome the cultural biases that so often stand in the way of the obvious. Once we reach widespread understanding in this sphere, we'll have gone a long way towards reducing the fear of science that presently holds back society. 
<br /><br /><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com2tag:blogger.com,1999:blog-715339341803133734.post-73953906561655292492014-05-24T10:46:00.000-05:002014-05-24T10:46:08.011-05:00Pass / Fail Mentality<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /><br />(Following on from <a href="http://maximum-entropy-blog.blogspot.com/2014/05/the-calibration-problem-why-science-is.html">The Calibration Problem: Why Science Is Not Deductive</a>)</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Recently, I was talking about calibration (<a href="http://maximum-entropy-blog.blogspot.com/2014/05/calibrating-x-ray-spectrometer-first.html">here</a> and <a href="http://maximum-entropy-blog.blogspot.com/2014/05/calibrating-x-ray-spectrometer-spectral.html">here</a>), and how it should be more than just identifying the most likely cause of the output of a measuring instrument. The calibration process should strive to characterize the range of true conditions that might produce such an output, along with any major asymmetries (bias) in the relationship between the truth and the instrument's reading. In short, we need to identify all the major characteristics of the probability distribution over states of the world, given the condition of our measuring device.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Failure to go beyond simply identifying each possible instrument reading with a single most probable cause is a special case of a very general problem that in my opinion plagues scientific practice. Such a major failure mode should have a special name, so let's call it 'pass / fail mentality.' 
It is the most extreme possible form of a fallacy known as spurious precision, and involves needlessly throwing away information.</div><div style="text-align: justify;"><br /><a name='more'></a>In <a href="http://maximum-entropy-blog.blogspot.com/2014/05/the-calibration-problem-why-science-is.html">The Calibration Problem</a>, I argued that science is necessarily inductive, and consists, ideally, of producing probability distributions over sets of exclusive and hopefully exhaustive propositions. To present a result with an error bar that is too narrow to accurately characterize one of these probability distributions is to be guilty of spurious precision. To collapse that error bar to zero width: 'X <i>is</i> 5.1,' is to take this fallacy as far as it can go.<br /><br />Science is often considered to consist of tests of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#boolean">Boolean</a> propositions (molecule M speeds up recovery from disease D, planet Earth is getting warmer, humans and dandelions share a common ancestor, etc), and so presenting a finding with zero recognized uncertainty often consists of a statement of the form, "proposition X passed the test," or "proposition X failed the test." Hence the name I'm giving to this fallacy. What we would prefer is a statement of the form, "proposition X, achieving a probability of 0.87, looks quite reliable."<br /><br />Better still, would be to examine an uncollapsed hypothesis space, such that instead of saying, "yes the Earth is getting warmer," or even, "yes, the Earth is very probably getting warmer," we would say something like, "the rate of change of the Earth's surface temperature is X ± Y."<br /><br />Unfortunately, there seems to be a tendency in human nature to prefer <a href="http://en.wikipedia.org/wiki/Precision_bias">statements of exaggerated precision</a>. Probability theory gives us the tools to manage uncertainty. 
Until we properly understand probability, therefore, we do not understand uncertainty. And ill-understood uncertainty is a scary thing. Yet, we know that the function of science is to help banish our fear of the unknown. Thus, it's often overwhelmingly tempting to assume that the function of science is to banish uncertainty completely.<br /><br />This is the temptation that many a radical skeptic has succumbed to and / or manipulated in others. The climate-change denier, the young-Earth creationist, the anti-vaccination lobbyist - all use the same tactic: Look, they can't decide if the Earth is 4.53 or 4.54 billion years old! They have no certainty about anything!<br /><br />So strong is this desire to see uncertainty eliminated (and so strong, perhaps, the desire not to give ammunition to the radical skeptic) that much of the way science is conducted and reported is built around this flawed model: Based on our results, we have decided that P is true. Effect Q has passed the test of statistical significance.<br /><br />Often it has been said by gurus of scientific method that science must proceed by asking specific questions, and that these questions must be of the yes / no variety. Hence, we have seen debates between the rival philosophies of verificationism versus <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#falsifiability-principle">falsificationism</a>; should we proceed by proving our theories true or by proving them false? This debate is wrong - strictly speaking, we can do neither.<br /><br />Many a data set has been left unpublished because it was insufficient to answer any question 'conclusively.' But this inconclusiveness is itself a useful piece of information. 
It indicates, for example, that any possible <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#effect-size">effect size</a> is likely to be small, and in combination with other weakly-informative studies in <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#meta-analysis">meta-analysis</a>, it can be used to help aggregate a more informative result.<br /><br />The tendency to not publish ambiguous results is not always the fault of the practicing researcher either. Most scientific journals will publish only conclusive results, and of those, positive results usually receive far more favorable attention.<br /><br /><a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#null-hypothesis-significance-test">Null-hypothesis significance testing</a> (one of the most common forms of data analysis in use in contemporary scientific literature) is a classic example of the pass / fail mentality running rampant in scientific endeavour. A threshold is set, and if some data-derived metric exceeds that threshold, then the data go on to fame and fortune. Otherwise, they go to the back of the filing cabinet. This madness is not necessarily limited to <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#frequency-interpretation">frequentist</a> methods, though. Anywhere that some hard boundary between finding and non-finding has been set will exhibit the same flaw, and it won't matter if that boundary is a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#p-value">p-value</a>, a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#posterior">posterior probability</a>, or a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#likelihood">likelihood</a> ratio.<br /><br />Odd things occur when experimental data is examined using threshold-like criteria, such as the α-levels applied to p-values. 
Meta-analysts have found that the scientific literature can exhibit excess results that just barely cross the significance threshold, contradicting statistical analysis that predicts what the distribution of p-values should be. For example, Masicampo and Lalande<sup>1</sup> found an anomalous spike in the number of reported p-values just less than 0.05, in the field of psychology. Very many studies use the same arbitrary significance level of p = 0.05, so that a p-value greater than 0.05 is considered inconclusive. One interpretation is that many researchers achieving p-values close to, but not quite crossing the significance barrier tended to 're-work' their analyses in various ways, until the magic α-level was crossed. Use of thresholds leads to distortion of the evidence.<br /><br />Applied probability is called decision theory. It is the endeavour to determine how to act, based on our empirical findings. Where actions are to be taken, often a probability distribution is going to need to be collapsed, somewhere. We can't always distribute our actions - it doesn't usually work to only 70% undergo surgery. (Though, mixed strategies very often are possible, and advisable.) Thus, on superficial analysis, the introduction of a threshold of significance, or the expression of a parameter estimate as an infinitely narrow spike in probability space may seem reasonable: if action requires us to gamble on a single state of the world (e.g. 
"only surgery can save your life"), what else are we to do?<br /><br />The problem is that the pass/fail mentality, perhaps motivated by a crude appreciation of decision-theoretic considerations, does not explicitly acknowledge any elements of decision theory, and totally fails to implement the most basic aspects of decision analysis.<br /><br />Many a frequentist chastises the Bayesian for introducing personal bias, in the form of the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#prior">prior probability distribution</a>, without realizing that frequentist techniques do exactly this, arbitrarily, and without acknowledgement. By implementing arbitrary α-levels, the significance tester is effectively setting a decision threshold, without ever formally or even approximately evaluating any utility function, which is one of the things that any decision analysis must have. In fact, it's even worse: the introduction of the decision threshold goes so unacknowledged, that the urge, upon completing the calculation is not to say, "for economical reasons, we should behave as if X is true," but rather to simply say, "X is true." Nobody (pretty much) actually, explicitly thinks this way, but the conventions of reporting have been set up in such a way that this <i>non sequitur</i> represents a real point of psychological attraction, which easily impacts on the thinking and behaviour of the insufficiently wary.<br /><br />One of the implicit assumptions in pass / fail thinking is that the best point estimate is the peak (or sometimes the mean) of the probability distribution. Partly, this arises because there remains confusion as to whether the exercise is one of decision, or one of pure science, to determine what is true. 
But a stupidly simple toy example of decision making serves to show that the optimum, with respect to action, depends on the utility function, and can be arbitrarily far from the probability peak.<br /><br />Suppose we play a gambling game, but it's not much of a gamble, as the cost of entry into the game is zero. An exotic roulette wheel produces outcomes that are <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#poisson">Poisson distributed</a>, with mean equal to 10, such that the probability of landing on 38 (the highest number on the wheel) is extremely low (about 9 × 10<sup>-12</sup>, in fact). If the outcome matches our prediction, we win a prize that is dependent on what the outcome is. It just happens that only one of the prizes is non-zero (and positive) - the prize awarded for a correctly predicted outcome of 38. What outcome should we predict? Obviously, in this trivial game, our expected utility is maximized by betting on 38, even though that outcome has a very low probability of occurring. </div><div style="text-align: justify;"><br />In the post on the <a href="http://maximum-entropy-blog.blogspot.com/2014/05/the-calibration-problem-why-science-is.html">calibration problem</a>, I showed in detail why it is that science will always be fundamentally concerned with calculating probability distributions. Hopefully the above considerations have helped to illustrate further why as much of the detail of those probability distributions as possible should be retained, when the final analysis is being assessed and reported. To present a point estimate without an error bar is frankly unthinkable, for any conscientious scientist. To neglect to mention any strong asymmetry of the resulting probability curve is to needlessly discard valuable information, and such fine details, once eliminated, represent a lost opportunity when evidence from multiple studies is to be aggregated. 
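<br /><br />The roulette game described above can be checked numerically. What follows is only a minimal sketch, under assumptions that are mine rather than part of the game as stated: the wheel's outcomes run from 0 to 38, the truncation of the Poisson tail above 38 is ignored as negligible, and the prize for a correctly predicted 38 is one arbitrary unit. The function names are hypothetical, chosen for the illustration.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of outcome k, computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def expected_utilities(mean, prizes):
    """Expected utility of predicting each outcome k.

    prizes[k] is the payout received when we predict k and k occurs;
    a wrong prediction pays nothing, and entry to the game is free.
    """
    return [poisson_pmf(k, mean) * prize for k, prize in enumerate(prizes)]


MEAN = 10          # mean of the wheel's outcome distribution
TOP = 38           # highest number on the wheel (assumed range: 0..38)

# Only a correct prediction of 38 pays out (assumed prize: 1 unit).
prizes = [0.0] * (TOP + 1)
prizes[TOP] = 1.0

utilities = expected_utilities(MEAN, prizes)

# The action that maximizes expected utility...
best_bet = max(range(TOP + 1), key=utilities.__getitem__)
# ...versus the outcome at the peak of the probability distribution.
most_probable = max(range(TOP + 1), key=lambda k: poisson_pmf(k, MEAN))

print("best bet:", best_bet)
print("most probable outcome:", most_probable)
print("P(38):", poisson_pmf(TOP, MEAN))
```

The best bet and the probability peak come out far apart: the peak sits at 9 and 10 (for a mean of 10 these two outcomes are equally probable, so either may be reported), while the only prediction with non-zero expected utility is 38. Changing the entries of <i>prizes</i> moves the optimal action without touching the probability distribution at all, which is the point of the example.<br /><br />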
Sometimes action requires a hard decision, but do the decision analysis explicitly, so that you know what you have done, and so that others can review your assumptions, and determine whether your utility function is the same as theirs.<br /><br /><br /><hr /><br /><br /><span style="color: blue;"><b>References</b></span><br /><br /><br /><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"> [1] </td> <td><span style="text-align: justify;">Masicampo, E. J. and Lalande, D. R.</span>, '<i>A peculiar prevalence of p values just below .05</i>,' Quarterly Journal of Experimental Psychology, 65 (11), 2271-2279, 2012 (<a href="http://www.tandfonline.com/doi/pdf/10.1080/17470218.2012.711335">Link</a> to paper, paywalled, unfortunately. See a short discussion with a plot of Masicampo & Lalande's data <a href="http://www.graphpad.com/www/data-analysis-resource-center/blog/a-peculiar-prevalence-of-p-values-just-below-051/">here</a>.)</td> </tr></tbody></table><br /><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com1tag:blogger.com,1999:blog-715339341803133734.post-43303184416370833232014-05-20T18:16:00.000-05:002014-05-20T18:16:15.216-05:00Announcing: Moral Science Index<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Continuing the paradigm established by my <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html">glossary</a> and my <a href="http://maximum-entropy-blog.blogspot.com/p/mathematical.html">mathematical index</a>, I've put together an index to and summary of the material I've accumulated on the topic of moral science. 
The index can be reached <a href="http://maximum-entropy-blog.blogspot.com/p/rational-ethics.html">here</a>, or from the link, 'Moral Science', on the right-hand side, beneath my profile.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The idea is simply to provide a point of entry for people interested in knowing what I have to say on this topic. People can see everything I have presented on this theme, the order in which the different pieces were published (and hence, approximately their dependency), a short description of each piece's function, together with some global motivating and qualifying remarks. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The relationship between science and morality represents a significant percentage of the material on my blog. It's an important (by definition) and highly overlooked topic, so I think it is important for people to have a single point of access to this material, the same way that the mathematical index provides a consolidated resource for learning about statistics, and the same way that the glossary represents the most definitive statement of my philosophy available, anywhere. (In some respects, I now view the blog as secondary to the glossary.)</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I will try to keep the moral science index current - as I release more material, I'll update the index accordingly.</div><div style="text-align: justify;"><br />As always, I welcome your comments, questions, criticisms, outraged indignation, etc. If anything needs clarification, the fault is mine. If you're curious about some detail I can help with, then I'm delighted to do so (that's the whole point of the website, actually). 
Comments are open here and on the index itself, and alternative contact details exist on the right hand side of this page.</div><div style="text-align: justify;"><br /><br /></div><div style="text-align: justify;"><span style="color: blue;"><b>Some Highlights:</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">For your convenience, I'll reproduce here some of the major points from the moral science page. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">(1) As of the publication date of this blog post, the index stands at:</div><blockquote class="tr_bq" style="text-align: justify;"><div>Blog entries on this topic (in order of publication):</div><div></div><ul style="text-align: start;"><li><a href="http://maximum-entropy-blog.blogspot.com/2013/03/scientific-morality.html">Scientific Morality</a></li><li><a href="http://maximum-entropy-blog.blogspot.com/2013/06/crime-and-punishment.html">Crime and Punishment</a></li><li><a href="http://maximum-entropy-blog.blogspot.com/2013/09/is-rationality-desirable.html">Is Rationality Desirable?</a></li><li><a href="http://maximum-entropy-blog.blogspot.com/2014/01/practical-morality-part-1.html">Practical Morality, Part 1</a></li><li><a href="http://maximum-entropy-blog.blogspot.com/2014/02/practical-morality-part-2.html">Practical Morality, Part 2</a></li></ul><div style="text-align: start;"><br /></div><div style="text-align: start;">Glossary entries on relevant concepts:</div><ul style="text-align: start;"><li><a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#Absolutism">Absolutism</a></li><li><a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#consequentialism">Consequentialism</a></li><li><a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#morality">Morality</a></li><li><a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#rationality">Rationality</a></li></ul></blockquote><div 
style="text-align: justify;"><br /></div><div style="text-align: justify;">(2) To disclaim any extraordinary expertise in any specific realm of moral decision:</div><blockquote class="tr_bq" style="text-align: justify;"><i>My writing on ethics is not to prescribe how to behave, but to inform on how to know how to behave.</i></blockquote><div style="text-align: justify;"><br /></div><div style="text-align: justify;">(3) Quoting from the overview:</div><blockquote class="tr_bq" style="text-align: justify;"><div><i>The founding principle behind my writing on this blog is that there is no better method to learn about anything than science. If a thing is meaningful - has consequences - science can measure it, by virtue of those consequences.... </i></div><div><i><br /></i></div><div><i>It is often said that science has nothing to say on the matter of what constitutes moral behaviour. If correct, this leaves us with only one option: morality has no meaning, it is a non-concept. It seems to me absurdly trivial that this is not so. Anyway, only a moderate amount of reflection is required to prove it. 
Thus, it is equally trivial to prove that science can guide us - in fact, is the optimal guide - concerning moral prescription.</i></div></blockquote><br />(4) Another feature on the moral science page is a short list of blog articles I expect to write on the topic in the near future, covering (in no particular order):<br /><blockquote class="tr_bq"><ul style="text-align: justify;"><li>the correspondence, if any, between correct consequentialism and classic utilitarianism</li></ul><ul style="text-align: justify;"><li>the correspondence, if any, between correct consequentialism and political libertarianism</li></ul><div style="text-align: justify;">(Spoiler alert: the answer in both cases is, not so much.)</div><ul style="text-align: justify;"><li>some necessary aspects of the nature of human decision criteria</li></ul><ul style="text-align: justify;"><li>the limited insight offered by the classic thought experiments in the philosophy of ethics</li></ul><ul style="text-align: justify;"><li>the potential for correct moral realism to significantly reduce reliance on superstition, leading to a better informed and more rationally directed society </li></ul></blockquote><br /><br /><br />Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-43090939371328446802014-05-17T01:21:00.002-05:002014-05-17T01:21:52.180-05:00The Calibration Problem: Why Science Is Not Deductive<div class="MsoNormal" style="margin-bottom: 13.5pt;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;">Here is perhaps the most important fact about scientific method that anybody can ever learn: </span><span style="font-family: inherit;">the optimal course of a scientific investigation is to provide probability assignments for propositions 
about the universe, and when scientific method deviates from this optimum path, it is valid only to the extent that it successfully approximates this ideal. There is a simple reason for this:</span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;">We would love to be able to say that we are 100% certain about X, that Y is guaranteed to be true, or that fact Z about the universe has somehow entered my head and impressed infallible knowledge of its necessary truth on my mind, but of course, except for the most trivial propositions, none of these is possible.<o:p></o:p></span><br /><span style="font-family: inherit;"><br /></span></div><div align="center" class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: center;"></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"></div><span style="font-family: inherit; text-align: justify;">Firstly, every measurement is subject to </span><a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#noise" style="font-family: inherit; text-align: justify;"><span style="color: blue;">noise</span></a><span style="font-family: inherit; text-align: justify;">, so there will always be a degree of uncertainty about what caused a particular experience.</span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;">Secondly, and far more fundamentally, calibration of any instrument requires certain symmetries of physical law to be hypothesized. 
Here's what I mean:</span><br /><span style="font-family: inherit;"></span><br /><a name='more'></a><span style="font-family: inherit;">If a 63.7 kg weight caused my machine to go 'beep' yesterday, I might postulate that a 63.7 kg weight will do the same today, because I'm assuming that the relevant laws of physics (and my device) are the same. If a mass on a spring oscillates at a given frequency, allowing me to count out a certain number of seconds, I might assume that changing my location on the surface of the Earth will not change that frequency. (And by the way, how would I know that the frequency was fixed at all?) These are hypotheses that may or may not be true. The only way to test such hypotheses is by the development of further instrumentation. Such instrumentation, though, is subject to a similar calibration problem, reliant on some other kind of analogical reasoning. </span><br /><br />Until I realize that solid objects expand and contract as the average kinetic energy of their atoms changes, it might never occur to me that length measurements taken with a simple ruler vary slightly, but systematically, with the ambient temperature. Furthermore, in order to discern that such a bias is present, I need to make a comparison against some other calibrated standard. Such auxiliary standards, though, will always suffer the same type of vulnerability.<br /><br /><span style="font-family: inherit;">Thus, we cannot prove with 100% certainty exactly which symmetries hold in nature (though the assumption that <i>some</i> symmetries hold is <i>a priori</i> sound, which I might get round to in a future post). We can only demonstrate that up to now, our experiences are consistent with some set of postulated symmetries. </span><span style="font-family: inherit;">The process of testing assumed principles of calibration (laws of physics) against empirical experience in this way is known as induction. 
</span><span style="font-family: inherit;"> </span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;">So, faced with the impossibility of knowing with absolute certainty any but the most trivial facts about the world (e.g. that only things that exist are affected by gravity), we must fall back on the next best thing: to quantify our justifiable degrees of belief in the various propositions we are interested in. Prescribing the manner in which this is achieved is the task of probability theory. </span>Essentially, induction works by applying <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bayes-theorem">Bayes' theorem</a>, or some reasonable approximation. </div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;">As for those very trivial statements that we can deduce with total confidence, what do we get from those? Nothing, really. Take a look at my example: only things that exist are affected by gravity. Does this tell us anything about things that actually exist? In fact, no. It only tells us about things that don't exist. This is both why it is so trivial, and why it can be known without scrutinizing any evidence. Let's think about one of the most famous examples, Descartes' <i>cogito</i>: 'I think, therefore I am.' Again, this doesn't really tell us anything. It doesn't say what I am, what it is to think, or what it is to be, only that thinking, like gravitational attraction, is a property limited to things that exist. It doesn't even suggest, for example, that I am in any way a separate entity from every other thinking object in the universe. 
<o:p></o:p></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;">For completeness, there is strictly no way out of the calibration problem. Consider the following thought experiment: imagine for the sake of argument that at some time, some clever scientist somehow devises an argument that identifies a unique set of symmetries in physical law, such that all other possible sets of symmetries lead to statements that are self contradicting, and therefore cannot be true. Imagine that this miraculous argument is actually correct. Obviously, the validity of such an argument, and our knowledge of that validity are not the same thing. This latter relies on our ability to confidently check the required logic, implying that there is yet another instrument in need of calibration: our own intellectual faculties, the fidelity of which can not, by any possible means, be established <i>a priori</i>. </span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br />Inductive inference is often contrasted with deductive logic. Deductive logic performs trivial operations on assumed premises, to draw conclusions that, according to the system, can not be false if the premises are true. The classic example starts from two premises, (i) 'all men are mortal,' and (ii) 'Socrates is a man,' to reach the unavoidable result, 'Socrates is mortal'.<br /><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;">Some of the well-known philosophers of science believed that because inductively derived information is not capable of guaranteeing truth, it must be inferior to deductive logic, or worse (e.g. 
with Karl Popper) it must be strictly useless - another fine example of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#mind-projection">mind-projection fallacy</a>: because a statement about reality can only be true or false, then any degree of belief in it that I possess must be all or nothing - a position I have refuted <a href="http://maximum-entropy-blog.blogspot.com/2012/10/parameter-estimation-and-relativity-of.html">elsewhere</a>. (Popper is best known for his <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#falsifiability-principle">falsifiability</a> criterion - <span style="font-family: inherit;">in <span style="text-align: left;"><a href="http://maximum-entropy-blog.blogspot.com/2013/02/inductive-inference-or-deductive.html">Inductive inference or deductive falsification?</a> I show that, contrary to Popper and others, falsification must be inductive, but it's also important to note that falsification is not the only direction in which science can progress.</span>)</span><br /><br />Deduction often feels far more steadfast than inductive inference, because of its power to guarantee the conclusion from the employed premises, but really, deduction on its own tells us absolutely nothing about the world. Because of the calibration problem, the premises of any useful deduction can not be guaranteed <span style="font-family: inherit;">by any means. To put it another way, one may argue that mathematical theorems possess necessary truth, but to the extent that this is true, they apply only to abstract, mathematical objects, x's and y's, but not real entities inhabiting the universe.</span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;">There may seem to be a problem, in that probability is a mathematical theory, meaning that all its theorems are derived using deductive logic. 
How can probabilistic reasoning be more powerful than deduction, if probability theory depends on deduction? It's what probability theory is about that allows it to lay legitimate claim to a uniquely privileged position among mathematical theories. The theory of differential calculus, for example, is a theory about x's and y's - entities with no real existence, not even in the mind of the person who fully perceives the theory. Using the theory of differential calculus, however, I can use those x's and y's to represent, for example, space and time, and formulate a theory of gravitation. We might start from Newton's inverse-square law and use the theory to predict that the planets will adopt elliptical orbits around the sun, but could we then know with deductive certainty that this is the truth? Of course not. Could we even infer that this is probably the truth? No, not without probability theory. </span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: inherit;">Probability theory is still a theory of abstract x's and y's, but the objects in this theory are now not surrogates for masses on springs or aeroplane wings; they are rational agents and their rankings of believability. The theory of probability, therefore, provides a bridge between a mechanical theory and the thing that it is a theory of. It allows us - <i>real</i> agents - to quantify the correspondence between model and reality. On its own, a mechanical model, such as a theory of gravity, has no knowable relationship with what is actually going on. 
It is inductive inference that allows us to say, 'yes, the assumptions of this model are reasonable,' and 'yes, the predictions of this model match my experiences well.'</span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br />Finally, while it looks like inductive inference is founded on deductive logic, where do we suppose the axioms needed to build our mathematical systems come from? It is surely perverse to suggest that they come from anywhere other than our experience of the world, and what works, intellectually. Such experience is derived in three ways:<br /><br />(i) our population-genetic history - our brains are the way they are, because the way they regulate our behaviour is a good match for the way the world operates, leading to efficient propagation of the genes that prescribe our brains' construction<br /><br />(ii) our cultural history - early philosophers experimented with all manner of intellectual systems, eliminating all sorts of obvious mistakes along the way, and passing on a treasure trove of useful heuristics<br /><br />(iii) our personal history - our direct contact with nature makes certain axiomatic systems feel highly unpalatable, because they just don't match what we see<br /><br />To spell it out explicitly: deductive inference is in fact founded upon inductively derived principles.</div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;">The idea that inductive learning is more powerful than deductive logic has been recognized at least as far back as 1620, when probability theory was still in its infancy (just a few meager decades of faltering development). 
In that year, Francis Bacon, one of the founders of empirical scientific method, published his great work on the subject, '<i>Novum Organum,</i>' (<a href="http://oll.libertyfund.org/titles/bacon-novum-organum">full text</a>). This title means 'New Instrument,' and was a reference to Aristotle's '<i>Organon,</i>' (<a href="https://archive.org/details/AristotleOrganon">full text</a>). This was Aristotle's book on deductive logic, which had stood for centuries as the accepted model for all epistemology. Bacon's title was carefully chosen to send the message, "Aristotle is now obsolete." Bacon's great contribution was to say that deduction alone gets you nowhere. If you want to know what the world is actually made of, and how it behaves, he argued, you must make observations and do experiments. Science, in fact all knowledge, is based on experience, not pure thought.</div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><br /></div><br /><div class="MsoNormal"><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com4tag:blogger.com,1999:blog-715339341803133734.post-87635795363133927102014-05-06T22:38:00.000-05:002014-05-06T22:38:07.454-05:00Calibrating an X-ray Spectrometer - Spectral Distortion<style> td.upper_line { border-top:solid 1px black; } table.fraction { text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } </style> <style> table.num_eqn { width:99%; text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } td.eqn_number { text-align:right; width:2em; } </style> <br /><br /><br /><div style="text-align: justify;">Calibration is a process whereby a relationship is inferred between the output of some measuring instrument and the physical process responsible for that 
output. An instrument may be something as simple as a ruler, or something as complicated as the <a href="http://en.wikipedia.org/wiki/Human_Genome_Project">Human Genome Project</a> or the <a href="http://scienceblogs.com/startswithabang/2013/03/21/what-the-entire-universe-is-made-of-thanks-to-planck/">Planck cosmic background survey</a>. Calibration is fundamental to science. We might even say that it <i>is</i> science.<br /><br />When we think about calibration, we often think simply about finding the most probable value for some physical parameter, given some reading from an instrument. In the <a href="http://maximum-entropy-blog.blogspot.com/2014/05/calibrating-x-ray-spectrometer-first.html">previous part</a>, I described this simple process for a device used to characterize the distribution of photon energies in a stream of x-rays.<br /><br />But we really ought to think of calibration as more than this. To make the best inferences possible from a reading, we should formulate the entire probability distribution, not just the location of its maximum, for the state of the world when the machine goes "bing," or when the display reads "42." 
When the readout says 7, it's good to know that I've most probably just found a black hole (perhaps), but it's also good to know what alternative explanations there are, and what amounts of probability mass they command.<br /><br /><a name='more'></a>At the end of the previous post, I showed this measurement of an x-ray spectrum, made with a cadmium telluride (CdTe) detector:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Ie_-w78gN0s/U2R-E4zn2uI/AAAAAAAACbY/JW71wkh9MJ4/s1600/tube_spec.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-Ie_-w78gN0s/U2R-E4zn2uI/AAAAAAAACbY/JW71wkh9MJ4/s1600/tube_spec.png" height="382" width="640" /></a></div><br /><br />As I explained, those step-like drops in intensity at 27 and 32 keV are artifacts of the detector, and are consequences of an asymmetric probability distribution, P(state of the world | this instrument reading). Such asymmetry makes it all the more important to look beyond determining just the distribution's peak, as it means that our instrument is systematically distorting the truth. In this post, I'll describe my analysis aimed at converting this repeatably distorted detected spectrum to a representation of the true spectrum of the source.<br /><br />The detector, of course, is made of atoms (Cd and Te), and these atoms have K-edges just like all others, as I described <a href="http://maximum-entropy-blog.blogspot.com/2014/05/calibrating-x-ray-spectrometer-first.html">previously</a>. That means that when a photon with more energy than these K-edges is absorbed, there is a chance for a fluorescence photon to leave the detector, taking away its characteristic energy with it. 
This energy, of course, does not contribute to the number of photo-generated electrons in the current pulse corresponding to such a detection event, and the detector registers photons with systematically lower energy than they really have. The two steps in the spectrum correspond to the points at which the incoming photons first exceed the Cd and Te K-edges. Every time a fluorescence photon escapes, the registered energy equals the true energy minus that of the fluorescence photon. </div><div style="text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">We want to work out the probability for a K-photon to escape the detector, so we can correct this asymmetric distortion. Let’s start by defining a few propositions:<o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div align="center"><table border="0" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: none; margin-left: 7.3pt;"><tbody><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">A<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">an x-ray photon from the source was absorbed in the detector<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; 
width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">Cd<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">an x-ray was absorbed by a cadmium atom in the detector<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">z<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">an x-ray from source was absorbed at depth = z<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 
5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">E<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">a fluorescence photon escaped the detector<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">K<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">a K-shell emission occurred<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" 
valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">K<sub>i</sub><o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin: 0in 0.3in 0.0001pt 0in; text-align: justify;">a K-shell emission into the i<sup>th</sup> recombination channel occurred<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin: 0in 0.3in 0.0001pt 0in; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">θ<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" 
width="522"><div class="MsoNormal" style="margin: 0in 0.3in 0.0001pt 0in; text-align: justify;">a fluorescence photon was emitted at angle θ, with respect to the optic axis<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin: 0in 0.3in 0.0001pt 0in; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">E<sub>i</sub><o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">a fluorescence photon from the i<sup>th</sup> recombination channel escaped the detector<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: 
justify;">D<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">≡<o:p></o:p></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;">an absorbed photon from the source is detected in full (the full amount of energy absorbed is converted to charges collected at the readout electrode)<o:p></o:p></div></td></tr><tr><td style="padding: 0in 5.4pt; width: 18.15pt;" valign="top" width="24"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 16.15pt;" valign="top" width="22"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /></div></td><td style="padding: 0in 5.4pt; width: 391.5pt;" valign="top" width="522"><div class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: justify;"><br /><br /></div></td></tr></tbody></table></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">In the derivation below, I'll only consider the case where a photon is absorbed by a cadmium atom, but the full calculation must also cover the alternative case, where absorption occurs in tellurium. For this, we just need to replace <span style="font-family: Times, Times New Roman, serif;">P(Cd | A, I)</span> with <span style="font-family: Times, Times New Roman, serif;">P(Te | A, I) = 1 - P(Cd | A, I)</span>, from the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#sum-rule">sum rule</a>.</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">We want to know the number of x-ray photons absorbed in the detector, N<sub>A</sub>. 
What we actually know is the number detected, N<sub>D</sub>.<o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">On average, the number detected is<span style="text-align: center;"> </span></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-73is9LDrCls/U2UYw2mQRYI/AAAAAAAACbs/wzkT0S--DKA/s1600/eq1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-73is9LDrCls/U2UYw2mQRYI/AAAAAAAACbs/wzkT0S--DKA/s1600/eq1.PNG" height="51" width="200" /></a></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Once a photon has been absorbed, our probability model, <span style="font-family: 'Times New Roman', serif;">I</span>, assumes that the only way for it not to be detected in full is for some of the energy to escape as a fluorescence photon. 
Thus, from the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#sum-rule">sum rule</a>,</div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-GeODIVqCsSE/U2UYy1HiPzI/AAAAAAAACdA/0T0C8vOyZ_s/s1600/eq2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-GeODIVqCsSE/U2UYy1HiPzI/AAAAAAAACdA/0T0C8vOyZ_s/s1600/eq2.PNG" /></a></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Also, our model assumes exactly 4 fluorescence channels:</div><div align="center" class="MsoNormal" style="margin-bottom: 0in; text-align: center;"><br /><table align="center" border="1" cellpadding="5" cellspacing="0"><tbody><tr><td><b> i </b></td><td><b> channel</b></td><td><b> fluorescence</b><br /><b>energy (keV)</b></td></tr><tr><td> 1 </td><td> Cd, K<sub>α</sub> </td><td><div style="text-align: center;">23.2082</div></td></tr><tr><td> 2 </td><td> Cd, K<sub>β</sub></td><td><div style="text-align: center;">26.1586</div></td></tr><tr><td> 3 </td><td> Te, K<sub>α</sub></td><td><div style="text-align: center;">27.3773</div></td></tr><tr><td> 4</td><td> Te, K<sub>β</sub></td><td><div style="text-align: center;">31.091</div></td></tr></tbody></table><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">So, the detection probability is</div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-t9IsnFZVz70/U2UYzLwyBII/AAAAAAAACc8/mMkupgdXW8M/s1600/eq3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-t9IsnFZVz70/U2UYzLwyBII/AAAAAAAACc8/mMkupgdXW8M/s1600/eq3.PNG" /></a></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Or, invoking the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#extended-sum-rule">extended sum rule</a> for disjoint propositions,<o:p></o:p></div><div 
class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-ioW3_hn4lFU/U2UYzSM9KlI/AAAAAAAACc4/5xKhg3BivZY/s1600/eq4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-ioW3_hn4lFU/U2UYzSM9KlI/AAAAAAAACc4/5xKhg3BivZY/s1600/eq4.PNG" /></a></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">From this, we can estimate the number of absorbed x-rays, given the number detected:</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><!--**************************************************************************--><!-- ************************ A Numbered Equation ************************--><br /><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-UwsIV-q3e84/U2UYzvEsz7I/AAAAAAAACcY/TCdgxIVhVZY/s1600/eq5--equation_1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-UwsIV-q3e84/U2UYzvEsz7I/AAAAAAAACcY/TCdgxIVhVZY/s1600/eq5--equation_1.PNG" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(1) </td> </tr></tbody></table>A fluorescence photon can only escape the detector if it has been emitted, so the 
proposition E<sub>i</sub> is the same as the proposition E<sub>i</sub> K<sub>i</sub>. Thus,</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-3DsrTtk-7mA/U2UYz_7Gh5I/AAAAAAAACcg/fCDLjbFzKAg/s1600/eq6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-3DsrTtk-7mA/U2UYz_7Gh5I/AAAAAAAACcg/fCDLjbFzKAg/s1600/eq6.PNG" height="53" width="200" /></a></div></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">which, from the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a>, can be decomposed:</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-9YphqHhMUSE/U2UY0CSi-jI/AAAAAAAACc0/zQPCZ0YoaOU/s1600/eq7--equation_2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-9YphqHhMUSE/U2UY0CSi-jI/AAAAAAAACc0/zQPCZ0YoaOU/s1600/eq7--equation_2.PNG" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(2) </td> </tr></tbody></table>The probability for K-emission in the i<sup>th</sup> channel can also be similarly decomposed. For example, for i = 1, K<sub>α</sub> emission from cadmium, K<sub>1</sub> is a conjunction of three things: (a) absorption by a cadmium atom, (b) emission of a fluorescence photon (i.e. 
no non-radiative relaxation, such as Auger recombination, where the energy goes into another electron), and (c) emission into the α line:</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-01c-0ujbRV8/U2UY0F45v5I/AAAAAAAACcs/jyhn8mt-rIQ/s1600/eq8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-01c-0ujbRV8/U2UY0F45v5I/AAAAAAAACcs/jyhn8mt-rIQ/s1600/eq8.PNG" /></a></div><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">The first term is the relative intensity (material dependent) of the K<sub>α</sub> line, obtainable from <a href="http://xdb.lbl.gov/Section1/Table_1-3.pdf">published tables</a>,</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-YvJDz2sE9tc/U2UY0hTtNfI/AAAAAAAACcw/s8RFKAQvH0M/s1600/eq9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-YvJDz2sE9tc/U2UY0hTtNfI/AAAAAAAACcw/s8RFKAQvH0M/s1600/eq9.PNG" /></a></div></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">The second term is the overall K-fluorescence yield, termed ω<sub>K</sub>, obtained from literature<sup>1</sup>, and the third term is given by the photoelectric absorption coefficients (discussed <a href="http://maximum-entropy-blog.blogspot.com/2014/04/the-exponential-distribution.html">two posts ago</a>):<br /><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><a href="https://www.blogger.com/blogger.g?blogID=715339341803133734" imageanchor="1" style="clear: 
left; float: left; margin-bottom: 1em; margin-right: 1em;"></a><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-z0Ho7O3aZCk/U2UYw43c6rI/AAAAAAAACdo/EzDo_26Dv7Q/s1600/eq10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-z0Ho7O3aZCk/U2UYw43c6rI/AAAAAAAACdo/EzDo_26Dv7Q/s1600/eq10.PNG" height="63" width="200" /></a></div></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">We need one other term in order to evaluate Eq. (2). The escape probability is dependent on the unknown absorption depth, z, of the incident x-ray photon. We will integrate this <a href="http://maximum-entropy-blog.blogspot.com/2012/05/nuisance-parameters.html">nuisance parameter</a> out, to give the desired <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#marginal">marginal distribution</a>:</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-gEY7nelCr-Q/U2UYw3lQZRI/AAAAAAAACdk/V98NLsUo-fM/s1600/eq11.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-gEY7nelCr-Q/U2UYw3lQZRI/AAAAAAAACdk/V98NLsUo-fM/s1600/eq11.PNG" height="65" width="400" /></a></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">The absorption depth is independent of the ensuing fluorescence channel, so<br /><br /></div><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><div class="separator" style="clear: both; text-align: center;"><a 
href="http://3.bp.blogspot.com/-ThFo48oklW8/U2UYyiC3F_I/AAAAAAAACdE/RejVU7MvNV4/s1600/eq12--equation_3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-ThFo48oklW8/U2UYyiC3F_I/AAAAAAAACdE/RejVU7MvNV4/s1600/eq12--equation_3.PNG" height="75" width="400" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(3)</td></tr></tbody></table><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Note that, from the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a>,<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-aKvWACeW7oo/U2UYxilsKPI/AAAAAAAACdc/506wJ-cqv70/s1600/eq13.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-aKvWACeW7oo/U2UYxilsKPI/AAAAAAAACdc/506wJ-cqv70/s1600/eq13.PNG" height="85" width="320" /></a></div>i.e. the probability to be absorbed at a given depth (obtained from the exponential distribution) must be <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normalization">normalized</a> by dividing by the overall absorption probability in the 1 mm thickness of the detector.<o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><a href="https://www.blogger.com/blogger.g?blogID=715339341803133734" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"></a><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="https://www.blogger.com/blogger.g?blogID=715339341803133734" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a></div>The depth-dependent escape probability, <span style="font-family: Times, Times New Roman, serif;">P(E<sub>i</sub> | z, K<sub>i</sub>, A, I)</span>, is also dependent on the unknown 
emission angle of the fluorescence photon, θ, relative to the direction of travel of the detected photon (perpendicular to detector surface). This is because different angles (for a given emission depth) correspond to different distances required to exit the CdTe detector. Again, let’s <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#marginal">marginalize</a> over this nuisance parameter:<br /><!--**************************************************************************--><!-- ************************ A Numbered Equation ************************--><br /><a href="https://www.blogger.com/null" id="eq1"></a><br /><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><a href="http://4.bp.blogspot.com/-wR7pZsXAFlY/U2UYxt21WJI/AAAAAAAACdU/thM37UJwaCE/s1600/eq14--equation_4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://4.bp.blogspot.com/-wR7pZsXAFlY/U2UYxt21WJI/AAAAAAAACdU/thM37UJwaCE/s1600/eq14--equation_4.PNG" height="62" width="400" /></a></td> </tr></tbody></table></td> <td class="eqn_number">(4)</td></tr></tbody></table><!--*****************************************************************************--> </div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">(θ is independent of all other variables, and is <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#indifference">uniform</a> over 0 ≤ θ ≤ π, hence <span style="font-family: Times, Times New Roman, serif;">P(θ | I) × d</span>θ is 1 divided by the number of samples over the half circle.)<br /><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Strictly, we should integrate over two angles, θ and φ (the 
detector has a square profile, and escape distances from the side depend on φ), but because the detector is heavily masked, so that only the centre is illuminated, and because the detector is very wide compared to the penetration depth at the K-photon energies, we treat escape from the bottom or top as the only escape paths. This is supported by the observation that reducing the size of the mask aperture has no effect on these observed artifacts. For the same reason, my model does not integrate over the x- and y-coordinates in the detector volume.<br /><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">For emission angles less than 90° (moving downwards), the distance required to exit the detector through the bottom surface is<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-CgW31fq_NEU/U2UYyFtI8pI/AAAAAAAACdQ/Ik0mXEAjgzY/s1600/eq15.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-CgW31fq_NEU/U2UYyFtI8pI/AAAAAAAACdQ/Ik0mXEAjgzY/s1600/eq15.PNG" height="68" width="200" /></a></div><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">while for photons moving 
upwards, the distance to escape through the top is</div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-oE1R1hMRQD0/U2UYyeTnArI/AAAAAAAACdI/_DJb1gc_1Mg/s1600/eq16.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-oE1R1hMRQD0/U2UYyeTnArI/AAAAAAAACdI/_DJb1gc_1Mg/s1600/eq16.PNG" height="77" width="200" /></a></div><div style="text-align: justify;">For each angle, and each depth, the desired escape probability is given by the <a href="http://maximum-entropy-blog.blogspot.com/2014/04/the-exponential-distribution.html" style="text-align: justify;">exponential function</a><span style="text-align: justify;">: exp(-µ<sub>PE</sub> × d<sub>esc</sub>), where µ</span><sub style="text-align: justify;">PE</sub> is the photo-electric absorption coefficient at the energy of the fluorescence photon. </div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Finally, combining Eqs. (2), (3), and (4), for the Cd <span style="text-align: start;">K</span><sub style="text-align: start;">α</sub> fluorescence channel, we have (with similar formulae for the other channels):<o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /><!--**************************************************************************--><!-- ************************ A Numbered Equation ************************--><br /><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-3cFZcjkAVeU/U2UYyrDpsYI/AAAAAAAACdY/jQlCFR7SmQQ/s1600/eq17--equation_5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" 
src="http://4.bp.blogspot.com/-3cFZcjkAVeU/U2UYyrDpsYI/AAAAAAAACdY/jQlCFR7SmQQ/s1600/eq17--equation_5.PNG" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(5) </td> </tr></tbody></table><br />Using Eqs. (5) and (1), we can get the number of photons absorbed at each energy from the number counted by the detector. The intensity at the j<sup>th</sup> energy,<span style="font-family: inherit;"> I<sub>j</sub></span>, needs to be scaled up according to Eq. (1). The intensities at each of the depleted energies, <span style="font-family: inherit;">I<sub>j - n</sub></span>, where n is the number of detector channels spanned by the energy of the relevant fluorescence photon, need to have a number of counts, <span style="font-family: inherit;">I<sub>j</sub> × </span><span style="font-family: Times, Times New Roman, serif;">P(E</span><sub style="font-family: Times, 'Times New Roman', serif;">i</sub><span style="font-family: Times, Times New Roman, serif;"> | A, I)</span>, subtracted.</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">The procedure starts at the highest energy in the spectrum. Once the j<sup>th</sup> detector channel has been processed, we move on to the (j-1)<sup>th</sup>, until all channels have been adjusted. 
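<br /><br />As a minimal sketch of the backward pass just described (an illustration only, not the code used to produce the results in this post), the function below walks down from the highest channel, rescales each channel, and subtracts the corresponding escape counts from the channels n below. Eq. (1) is read here as N<sub>A</sub> = N<sub>D</sub> / (1 - Σ<sub>i</sub> P(E<sub>i</sub> | A, I)), which is an assumption, as is the use of the rescaled I<sub>j</sub> when computing the subtracted counts. The names strip_escape_events, escape_probs and channel_shifts are hypothetical.<br /><br />

```python
def strip_escape_events(counts, escape_probs, channel_shifts):
    """Convert a detected spectrum into an estimate of the absorbed spectrum.

    counts          -- detected counts per channel, lowest energy first
    escape_probs    -- escape_probs[j][i] is P(E_i | A, I) for the incident
                       energy of channel j (zero below the relevant K-edge)
    channel_shifts  -- channel_shifts[i] is the number of channels n spanned
                       by the energy of fluorescence line i
    All three names are hypothetical, chosen for this sketch.
    """
    corrected = [float(c) for c in counts]
    # Start at the highest energy and work downwards.
    for j in range(len(corrected) - 1, -1, -1):
        p_escape = escape_probs[j]
        # Assumed form of Eq. (1): absorbed = detected / (1 - total escape prob.)
        absorbed = corrected[j] / (1.0 - sum(p_escape))
        corrected[j] = absorbed
        # Remove the escape events that channel j deposited n channels lower.
        for p, n in zip(p_escape, channel_shifts):
            if j - n >= 0:
                corrected[j - n] -= absorbed * p
    return corrected
```

Each channel is rescaled only after the escape counts from every higher channel have been subtracted from it, which is the reason for running the loop from the top of the spectrum.<br /><br />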
Because we traverse the spectrum backwards, any contribution from higher energies has already been removed by the time the number of escape photons is calculated for a given energy channel.<o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Note that the fluorescence yields are step functions of the incident energy – nothing is emitted for incident energies below the K-edges.<o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;">Once the detected spectrum has been converted to an absorbed spectrum, the spectrum incident on the detector is obtained by dividing the absorbed spectrum by the energy-dependent quantum efficiency of the 1 mm CdTe detector, and finally, the spectrum incident on the outer window of the detector is obtained by dividing again, by the transmission efficiency of the 250 µm beryllium window. This allows characterization of the spectrum emitted by the source, which is the ultimate goal of the exercise. These transmission and absorption efficiencies are, again, calculated from the appropriate energy-dependent attenuation coefficients, all of which can be obtained from this convenient <a href="http://physics.nist.gov/PhysRefData/FFast/html/form.html">NIST database</a>.<o:p></o:p></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br />All the calculations I described were carried out by my computer. I numerically integrated over z, taking 200 samples, and over θ, with 180 samples (taking care not to let θ = 90°, to avoid division by zero when dividing by the cosine). 
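<br /><br />A minimal sketch of this double numerical integration, for a single fluorescence channel, might look like the following. It is an illustration under stated assumptions rather than the code behind the results here: the escape distances are taken to be (T - z)/cos θ through the bottom face and z/|cos θ| through the top face, where T is the detector thickness; θ is sampled uniformly at bin midpoints (which, for an even number of samples, never fall on 90°); and the attenuation coefficients passed in are placeholders to be looked up in the tables, not real CdTe values. All function names are hypothetical.<br /><br />

```python
import math


def depth_pdf(z, mu_in, thickness):
    """P(z | A, I): exponential absorption density, normalized over the detector."""
    return mu_in * math.exp(-mu_in * z) / (1.0 - math.exp(-mu_in * thickness))


def escape_distance(z, theta, thickness):
    """Assumed path length to the exit face for emission at angle theta."""
    c = math.cos(theta)
    if c > 0.0:
        return (thickness - z) / c   # moving downwards: exit through the bottom
    return z / (-c)                  # moving upwards: exit through the top


def escape_given_depth(z, mu_fl, thickness, n_theta=180):
    """P(E_i | z, K_i, A, I): exp(-mu * d_esc) averaged over theta in [0, pi]."""
    d_theta = math.pi / n_theta
    total = 0.0
    for k in range(n_theta):
        theta = (k + 0.5) * d_theta  # bin midpoints; weight 1 / n_theta each
        total += math.exp(-mu_fl * escape_distance(z, theta, thickness))
    return total / n_theta


def escape_probability(mu_in, mu_fl, thickness=1.0, n_z=200, n_theta=180):
    """P(E_i | K_i, A, I): the depth-dependent escape probability summed over z.

    mu_in -- attenuation coefficient at the incident energy (per mm)
    mu_fl -- attenuation coefficient at the fluorescence energy (per mm)
    """
    dz = thickness / n_z
    total = 0.0
    for m in range(n_z):
        z = (m + 0.5) * dz
        weight = depth_pdf(z, mu_in, thickness) * dz
        total += weight * escape_given_depth(z, mu_fl, thickness, n_theta)
    return total
```

The value returned is the escape probability given that a K-photon was emitted into the channel; following Eq. (2), it would still need to be multiplied by P(K<sub>i</sub> | A, I) to give P(E<sub>i</sub> | A, I).<br /><br />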
The result, shown below, is a partial success:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-ixg3aRGYL-0/U2md7xX494I/AAAAAAAACeM/QcuGJfBZXPI/s1600/before_and_after_correction.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-ixg3aRGYL-0/U2md7xX494I/AAAAAAAACeM/QcuGJfBZXPI/s1600/before_and_after_correction.png" height="377" width="640" /></a></div><br /><br />It seems to me that all the logic I described is correct. All the simplifying assumptions seem reasonable, and I have checked the code and not found any errors, but sadly, the result is not quite what it should be. Those spurious steps in the spectrum have been successfully removed, but have been replaced by a couple of sharp spikes. Evidently my model of the device is not quite adequate. These spikes could be removed relatively easily, simply by noting that they shouldn't be there - they are narrow enough that a linear or quadratic interpolation between the points on either side would be a fair fix, though a highly unsatisfying one. There's still some work to do here, before total victory can be declared.<br /><br />In the literature, I see that others working with a similar spectrometer have encountered similar difficulties, which they also couldn't explain<sup>2</sup>. 
Here is my best (though at present, highly speculative) guess for what might be happening:<br /><br />My model assumes the onset of the K-edge is immediate, in line with basic known physics. But the original spectrum shows a gradual onset over almost 2 keV. Naturally, the detector exhibits measurement uncertainty in the form of a symmetric broadening, but at &lt; 0.2 keV (known from the <a href="http://maximum-entropy-blog.blogspot.com/2014/05/calibrating-x-ray-spectrometer-first.html">fluorescence measurements</a>) this is insufficiently broad to account for the observed effect. The detector, however, is held at a high voltage (several hundred volts) to drive the photo-generated electrons to the readout electrode, which results in a strong electric field that could conceivably have a measurable effect on the binding energies of the atoms' electrons. Furthermore, CdTe technology is not as well developed as that of other semiconductors, such as silicon, and the CdTe crystals that can be grown are not of such high quality. Because of this, defects in the CdTe crystal can lead to local and transient distortion of the applied electric field, which just might lead to small differences in the effective K-edges at different locations (and different times). There could be some really interesting device physics going on here. If so, remember, you heard it here first!<br /><br />If I make significant progress with this, I'll try to post a follow-up. If you know how to solve this problem, please drop me an email!<br /><br /></div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><hr /><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /><span style="color: blue;"><b>References</b></span><br /><br /><br /><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"> [1] </td> <td>A. 
Markowicz, in <i>Handbook of X-Ray Spectrometry</i>, edited by Van Grieken and Markowicz, (Marcel Dekker) 2002 </td> </tr><tr> <td valign="top"> [2] </td> <td><span style="font-family: inherit;"><span style="text-align: justify;">R. Redus, J. Pantazis, T. Pantazis, A. Huber, and B. Cross, <i>Characterization of CdTe detectors for quantitative X-ray spectroscopy</i>, presented at the 2007 Denver X-ray Conference and submitted to IEEE Trans. Nucl. Sci, 2008 </span></span></td> </tr></tbody></table><br /><br /><br /><b style="color: blue;">Acknowledgement</b><br /><br />Big thanks to Charles Willis and Bill Erwin at M.D. Anderson for lending me their spectrometer. It's a nice piece of kit.</div><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"><br /></div><br /><div class="MsoNormal" style="margin-bottom: 0in; text-align: justify;"></div><br /><div style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; margin: 0px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;"><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-31124306042592598722014-05-03T01:36:00.000-05:002014-05-03T01:36:03.375-05:00Calibrating an X-ray Spectrometer - First Steps<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Recently, I've been working with a borrowed piece of equipment - an x-ray spectrometer - whose response I need to understand, so I can take measurements with it. 
This is a special case of the general problem of calibration, which is a crucial topic in science, so I'd like to take some time to describe the procedure I went through. As you'll see later, the problem is not fully solved yet, which I suppose illustrates the trial-and-error nature of scientific work. Regardless of the degree of ultimate success, though, the process I'll describe strikes me as a fine illustration of the basic logic of experimental science.</div><div style="text-align: justify;"><br /><a name='more'></a></div><div style="text-align: justify;">Digital x-ray detectors work because when x-rays are absorbed in the detector, the energy goes into liberating large numbers of electrons, which get collected in the detector's read-out circuitry. To make a detector that can record the energy of the incoming x-ray, we can exploit the fact that on average, each electron gets a certain amount of energy, so that the number of electrons liberated is proportional to the energy in the absorbed x-ray photon. This will work as long as the detector sampling rate is high compared to the photon flux (a condition whose violation we call 'pile-up').</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Each absorbed x-ray, therefore, creates a current pulse proportional to the x-ray's energy. For a multi-channel x-ray spectrometer, each current pulse is analyzed and assigned by the electronics to one of a range of available channels, each corresponding to a particular range of energies. Each time a pulse is assigned to a channel, a counter corresponding to that channel is incremented by 1. 
The calibration problem for such an instrument, therefore, is to find the relationship between the channel being incremented and the energy of the photon.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Most often, the calibration problem is considered to consist of finding the mean (or very often, just the mode) of the probability distribution, P(energy | channel), though a more complete calibration consists of characterizing the entire distribution, not just its peak or mean. This becomes particularly important if the probability distribution is notably asymmetric.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Finding the peak is a good place to start, however. In the present case, one method to do this relies on the phenomenon of K-shell fluorescence, which I'll briefly explain. The diagram below represents the spectrum of energies available to the electrons in an atom:</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-9TWgxi0Kpgs/U2Jkka21BXI/AAAAAAAACZ8/eXd2Pe2lk9c/s1600/K-fluo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-9TWgxi0Kpgs/U2Jkka21BXI/AAAAAAAACZ8/eXd2Pe2lk9c/s1600/K-fluo.png" height="480" width="640" /></a></div>The number subscripts on the right indicate the so-called principal quantum number, n. At n = 1, the electron has the lowest energy it can have for that atom - its orbit is also closest (on average) to the nucleus. Higher-energy levels exist, getting closer together (energetically) as n increases, until a certain critical energy, at which the electron is no longer bound to the nucleus - the electron is liberated to wander the vacuum, hence the 'V' subscript. 
</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The n = 1 orbital is also referred to as the K-shell, n = 2 is known as the L-shell, and so on. If an electron in the K-shell absorbs a photon and is given enough energy to exceed E<sub>V</sub>, then the electron leaves the atom altogether. The minimum energy required for this, the difference between E<sub>1</sub> and E<sub>V</sub> , is known as the K-edge.<br /><br />If an atom with several electrons is ionized in this way, an electron from a higher orbital must drop down to the vacated level to restore equilibrium, and this process often produces K-shell fluorescence - the excess energy of the electron that moves down to fill the K-shell is released in the form of a photon. Most often, the relaxing electron will come from either the n = 2 or the n = 3 orbital, and these are the transitions I've marked on the diagram - the emitted light is depicted as the green oscillations. For the transition n = 2 to n = 1, a K<sub>α</sub> photon is emitted, while relaxation from n = 3 to n = 1 produces a K<sub>β </sub>photon.<br /><br />For hydrogen, the transitions terminating at the K-shell produce ultraviolet photons, and are termed the Lyman series (the Balmer series are the transitions terminating at the L-shell (n = 2), and so on). For larger atoms, the K-photon energies are in the x-ray range. Because of the discrete nature of the energy levels participating in these fluorescence events, a fluorescence spectrum for a pure metal will consist of a series of very sharp lines, whose energies are unvarying properties of the atoms of the metal. Here is a fluorescence spectrum I measured for a pure sample of tin using my borrowed cadmium-telluride (CdTe) spectrometer. 
The tin was exposed to the photon flux coming from my tungsten x-ray tube, and the fluorescence was collected in a 90° back-scattering geometry:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-s226020TG1I/U2R3rShXNNI/AAAAAAAACa8/pExV066sL9I/s1600/tin_spec.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-s226020TG1I/U2R3rShXNNI/AAAAAAAACa8/pExV066sL9I/s1600/tin_spec.png" height="480" width="640" /></a></div><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The energies of these peaks can be looked up (for example, in <a href="http://xdb.lbl.gov/Section1/Table_1-3.pdf">tables</a> in this <a href="http://xdb.lbl.gov/">x-ray data booklet</a>, from Lawrence Berkeley Lab) and compared to the channels at which the recorded signal peaks. Repeating for several fluorescent metals (zinc, zirconium, and tungsten, in my case) gives a series of channel-energy pairs, which can be fitted with some calibration model using <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#param-est">maximum likelihood</a>, or some other method. The spectrometer I was using is quite well designed, with the consequence that a linear fitting model was suitable for finding the expected energy for photons registered in each channel.<br /><br />Because the measurements are <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#noise">noisy</a>, simply taking the channel at which the signal is maximum is not the best way to find the peak channel. To find the peak channel, then, the fluorescence spectra were fitted with reasonable line shape functions, in this case a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normal-dist">Gaussian</a> function for each emission line, using maximum likelihood. 
In each case, the fitting software I used gave an error bar for the fitted mean of each Gaussian, which gives the standard deviation of the assumed Gaussian error distribution for each inferred peak position. From this information, the following table was drawn up:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-2s5o3r8ZGZc/U2KyHb7U6wI/AAAAAAAACaM/ZqNAeNWd-cQ/s1600/calib_table.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-2s5o3r8ZGZc/U2KyHb7U6wI/AAAAAAAACaM/ZqNAeNWd-cQ/s1600/calib_table.PNG" height="278" width="640" /></a></div><br />One thing to notice is that the α and β emissions actually can have substructure, such that for tungsten, three different β lines take part, though two of them are not resolved (that's why I used their average position for the calibration).<br /><br />The third and fourth columns in the table give the fitted peak positions and their associated standard deviations. The known peak energies are plotted against these fitted peak channels, and fitted with a linear model. The linear model uses the a and b parameters given in the little box on the right of the main table. The sixth column in the table has the values of the linear model at each channel number in the third column.<br /><br />From my earlier description of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#param-est">parameter estimation</a>, the joint likelihood function for any set of model parameters, θ, can often be calculated from<br /><br /><div style="text-align: center;"><img src="http://2.bp.blogspot.com/-xXK_TzaynuY/UIdnYQx8zQI/AAAAAAAABTk/37ziaffsd-4/s400/equation+3.tiff" height="74" width="400" /></div><br /><br />where the d's are the data (the known peak energies, corresponding to the measured channels), and the y's are the model values. 
We can therefore maximize the likelihood function by minimizing the sum of the squared residuals, divided by the square of the standard deviation (from column 4). These weighted residuals are in the last column, and their sum is given as the χ<sup>2</sup> parameter, in the little box. This χ<sup>2</sup> is optimized numerically (it can also be done analytically, using <a href="http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares">linear algebra</a>), by adjusting the a and b parameters until χ<sup>2</sup> is at its minimum. The resulting fit is given in the table, and is shown in the plot below:<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-_PJAPrZfRBw/U2K-dSnnhKI/AAAAAAAACao/W68O81Pm9wE/s1600/calib_line.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-_PJAPrZfRBw/U2K-dSnnhKI/AAAAAAAACao/W68O81Pm9wE/s1600/calib_line.PNG" height="418" width="640" /></a></div><br /><br />Each data point has been plotted with its associated error bar, but most of the error bars are smaller than the data markers.<br /><br />That was the easy part. The difficulty appears when we look to see if there is any systematic distortion of a measured spectrum - that asymmetry in P(energy | channel), I was talking about. 
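<br /><br />As an aside on the easy part: for a straight-line model, the χ<sup>2</sup> minimization described above has a closed-form solution. The following is a minimal Python sketch of that weighted least-squares fit, not the procedure actually used to build the table. The function name <code>weighted_line_fit</code> and the channel, energy, and sigma numbers are invented for illustration and are not the values from the calibration table.

```python
def weighted_line_fit(x, d, sigma):
    """Fit d = a*x + b by minimizing chi2 = sum(((d - (a*x + b)) / sigma)**2).

    x, d and sigma are equal-length sequences: x holds the fitted peak
    channels, d the known energies, sigma the standard deviation assigned
    to each point. Returns (a, b, chi2) at the minimum.
    """
    w = [1.0 / s ** 2 for s in sigma]  # weights are inverse variances
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sd = sum(wi * di for wi, di in zip(w, d))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxd = sum(wi * xi * di for wi, xi, di in zip(w, x, d))
    det = S * Sxx - Sx ** 2
    a = (S * Sxd - Sx * Sd) / det    # slope
    b = (Sxx * Sd - Sx * Sxd) / det  # intercept
    chi2 = sum(wi * (di - (a * xi + b)) ** 2 for wi, xi, di in zip(w, x, d))
    return a, b, chi2


# Hypothetical channel/energy pairs, for illustration only
channels = [100.0, 200.0, 300.0, 400.0]
energies = [5.1, 9.9, 15.2, 19.8]  # keV
sigmas = [0.1, 0.1, 0.2, 0.2]
a, b, chi2 = weighted_line_fit(channels, energies, sigmas)
print("slope:", a, "intercept:", b, "chi2:", chi2)
```

A numerical optimizer that adjusts a and b should converge to the same values, since χ<sup>2</sup> is a quadratic function of both parameters and has a single minimum.<br /><br />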
Take a look at this spectrum I measured directly for the tungsten x-ray tube:<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Ie_-w78gN0s/U2R-E4zn2uI/AAAAAAAACbU/v5m6KfC4VYU/s1600/tube_spec.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-Ie_-w78gN0s/U2R-E4zn2uI/AAAAAAAACbU/v5m6KfC4VYU/s1600/tube_spec.png" height="382" width="640" /></a></div><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The spectrum has quite a bit of structure, and most of it reflects the true nature of the source very well. The x-rays are produced by firing a stream of electrons at a tungsten target. In this case, the electrons are accelerated by a 100 kV potential difference (giving them each exactly 100 kilo electron volts of energy). These electrons can continuously lose energy as they fly through the metal, causing a continuum of radiation to be emitted. That's the main, broad peak in the spectrum, known as 'bremsstrahlung.'<br /><br />These accelerated electrons can also knock away inner electrons from the tungsten atoms, leading to rearrangement of the outer electrons, and associated fluorescence emissions, exactly as described above for photo-absorption. At a little over 10 keV, there are two sharp emission lines corresponding to the L-shell characteristic fluorescence from the tungsten in the x-ray tube. At about 59 and 67 keV, two more groups of lines appear, due to the K-shell transitions for tungsten. <br /><br />There are, however, some step-like artifacts in the spectrum, at about 27 and 32 keV, which are not characteristics of the spectrum from the x-ray tube. Instead, these are properties of the detector. 
These energies happen to match the K-edges for the cadmium and tellurium atoms in the detector, and it's a safe bet that these step-like drops in intensity are due to fluorescent photons carrying absorbed energy out of the detector, before it gets a chance to be collected at the readout electrode. In the next part, I'll describe my efforts so far to correct such effects, by calculating sampling distributions for these and a number of other spectral distortion mechanisms that I confidently believe to be influencing the detected signal.<br /><br /><br /><hr /><br /></div>Big thanks to Charles Willis and Bill Erwin at M.D. Anderson for lending me their spectrometer. It's a nice piece of kit.<br /><br /><br />Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-39670135629254065022014-04-26T00:37:00.001-05:002014-04-26T00:37:35.354-05:00The Exponential Distribution<style> td.upper_line { border-top:solid 1px black; } table.fraction { text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } </style> <style> table.num_eqn { width:99%; text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } td.eqn_number { text-align:right; width:2em; } </style> <br /><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The exponential distribution holds a special significance for me. My PhD thesis was all about optical transients, the simplest mathematical models of which are exponential distributions. 
Currently, I work in x-ray science, which is heavily concerned with the depletion of an (x-ray) optical field as it traverses some distribution of matter (both in an object being imaged, and in the detector) - this time the exponential distribution is over space, rather than time, but the mathematics is the same.<br /><br />Any kind of involvement with mathematical science quickly brings us into intimate contact with exponential functions, as these arise left, right, and centre, in the solutions of differential equations. The reason for this is related to the fact that the exponential is the only mathematical function (up to a constant factor) that is its own derivative. This is closely related to a special property of the exponential distribution, known as memorylessness (what will happen next - its rate of change - is entirely governed by the current state). So let's take a quick look into how the exponential distribution comes about, and what its major characteristics are. </div><div style="text-align: justify;"><br /><a name='more'></a></div><div style="text-align: justify;">Imagine a stream of photons incident on some distribution of matter. It's no surprise to learn that some of those photons are going to be absorbed or scattered, so that they no longer continue on their original path. The number that are scattered will depend on the thickness of the matter that they pass through, which is why, on a foggy day, things that are close to you can be easily seen, things not too far off can be somewhat made out, while objects a bit further away can't be seen at all. We'd like to know exactly what the dependence on distance is.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Let's denote as P<sub>L</sub>(U) the probability (dependent on length, L) that a photon will remain unabsorbed by its surrounding medium. 
For any infinitesimally thin strip of that medium (at a depth L into the medium), the probability to be absorbed at that location is the product P<sub>L</sub>(U) × P(A | U), where P(A | U) is the probability to be absorbed in that strip, given that it was not absorbed in any earlier strip. This follows from the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a> applied to the necessary <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#conjunction">conjunction</a>, 'unabsorbed, before now' AND 'absorbed here', required for an absorption to occur at a particular place. The probability P(A | U) is independent of where the photon has been up to now - adding the U after the vertical bar ensures this. There is no physical reason for P(A | U) to depend on the photon's history, and this is the property of memorylessness I mentioned a moment ago. To put it another way, we are dealing with a <a href="http://en.wikipedia.org/wiki/Markov_process">Markov process</a>, which can be a useful fact to remember.<br /><br />Because P(A | U) is unchanging, we have invented a special symbol for it, μ, which we call the absorption coefficient. As each consecutive layer of the absorbing medium is traversed by the photon, the probability for the photon not to have been absorbed is reduced by the amount, P<sub>L</sub>(U) × μ (from the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#sum-rule">sum rule</a>). 
Or, to put it another way, the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#derivative">rate of change</a> of P<sub>L</sub>(U) with respect to the path length traversed is:<br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody></tbody></table><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://2.bp.blogspot.com/-noiNYzUThHw/U1nXNWOE7dI/AAAAAAAACWQ/0SWDKDmUxcg/s1600/eq1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://2.bp.blogspot.com/-noiNYzUThHw/U1nXNWOE7dI/AAAAAAAACWQ/0SWDKDmUxcg/s1600/eq1.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(1)</td></tr></tbody></table>We can rearrange this equation, then take the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#integral">integral</a> of each side:</div><div style="text-align: justify;"><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-9n5vVRPzo0k/U1nXmwgwGaI/AAAAAAAACWY/gT2osbwVFwE/s1600/eq2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-9n5vVRPzo0k/U1nXmwgwGaI/AAAAAAAACWY/gT2osbwVFwE/s1600/eq2.PNG" /></a><span style="text-align: justify;"> </span></div></td></tr></tbody></table></td><td class="eqn_number">(2)</td></tr></tbody></table><br />The left-hand side is solved using <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#integral5">item 5</a> in my table of integrals, while the right-hand side is given by <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#integral3">item 3</a>:<br /><br /><table cellpadding="0" cellspacing="0" 
class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://3.bp.blogspot.com/-OpwdmuJ67v8/U1nX6199IwI/AAAAAAAACWg/yI8FBd0b3Bg/s1600/eq3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://3.bp.blogspot.com/-OpwdmuJ67v8/U1nX6199IwI/AAAAAAAACWg/yI8FBd0b3Bg/s1600/eq3.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(3)</td></tr></tbody></table>As always, ln(.) represents the natural logarithm. Since this equation is true for all distances, we can form equations for distances, L and 0, and then subtract the 2 equations:<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://3.bp.blogspot.com/-_6E5TONXtac/U1nYHrvzAYI/AAAAAAAACWo/MJPGnrmu9-M/s1600/eq4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://3.bp.blogspot.com/-_6E5TONXtac/U1nYHrvzAYI/AAAAAAAACWo/MJPGnrmu9-M/s1600/eq4.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(4)</td></tr></tbody></table><br />From the <a href="http://en.wikipedia.org/wiki/Logarithm#Logarithmic_identities">laws of logs</a>, this becomes<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://1.bp.blogspot.com/-UMQfvLRPsIE/U1nYUlwtUkI/AAAAAAAACWw/ArL8YIUxgK0/s1600/eq5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://1.bp.blogspot.com/-UMQfvLRPsIE/U1nYUlwtUkI/AAAAAAAACWw/ArL8YIUxgK0/s1600/eq5.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(5)</td></tr></tbody></table>or, taking the exponential of each 
side<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://4.bp.blogspot.com/-VkaJjYcMxlw/U1nYjvdPDNI/AAAAAAAACW4/osT7yjxuyFw/s1600/eq6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://4.bp.blogspot.com/-VkaJjYcMxlw/U1nYjvdPDNI/AAAAAAAACW4/osT7yjxuyFw/s1600/eq6.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(6)</td></tr></tbody></table>where we have finally identified the proportionality of P(U) and the intensity, I, of the optical field. This describes the exponential decay of the photon flux. This equation is called the <a href="http://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law">Beer-Lambert Law</a>. P<sub>L</sub>(U) is not a probability distribution over L, however, as the set of propositions, unabsorbed at L<sub>1</sub>, unabsorbed at L<sub>2</sub>, ... etc., are not exclusive. P<sub>L</sub>(A), though, is a distribution over a set of disjoint (non-overlapping) propositions (a photon can not be absorbed in more than one place), and as we found, is proportional to P<sub>L</sub>(U). 
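<br /><br />The step from the thin-strip argument to the exponential law can be checked numerically. The sketch below (in Python; the function name <code>survival_through_layers</code> and the values of μ and L are made up for illustration) multiplies together the survival probabilities, 1 - μ ΔL, of many thin layers and compares the product with exp(-μL). It assumes a uniform medium, so that μ is the same in every layer.

```python
import math


def survival_through_layers(mu, length, n_layers):
    """Probability of passing n_layers equal strips of total thickness length.

    Each strip of thickness dL absorbs with probability mu * dL, independent
    of the photon's history, so the survival probabilities simply multiply.
    """
    d_l = length / n_layers
    return (1.0 - mu * d_l) ** n_layers


mu = 0.5      # hypothetical absorption coefficient, in mm^-1
length = 4.0  # hypothetical thickness, in mm

# As the strips get thinner, the product approaches the Beer-Lambert value
for n in (10, 100, 10000):
    print(n, survival_through_layers(mu, length, n))
print("exp(-mu * L):", math.exp(-mu * length))
```
<br /><br />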
As noted above, the constant of proportionality is μ, so the absorption probability density as a function of distance, L, is (setting the photon's initial existence probability, I<sub>0</sub>, to 1):<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://1.bp.blogspot.com/-DeOcUDDjkTM/U1nYz749rYI/AAAAAAAACXA/IRLuzTD-JUE/s1600/eq7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://1.bp.blogspot.com/-DeOcUDDjkTM/U1nYz749rYI/AAAAAAAACXA/IRLuzTD-JUE/s1600/eq7.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(7)</td></tr></tbody></table>It's easy enough to verify that this function is <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normalization">normalized</a> (e.g. check Eq. 12, for L = ∞).<br /><br />In general, if a parameter, x, is assigned an exponential distribution, with decay constant, λ, then the normalized PDF is<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://3.bp.blogspot.com/-iHGU3RUxHLs/U1nZB0eEQtI/AAAAAAAACXI/o-yX0wLfxsA/s1600/eq8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://3.bp.blogspot.com/-iHGU3RUxHLs/U1nZB0eEQtI/AAAAAAAACXI/o-yX0wLfxsA/s1600/eq8.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(8)</td></tr></tbody></table>To maintain unit consistency, the units of λ are the inverse of the units of x. If x is distance, in mm, then λ has units mm<sup>-1</sup>. 
If x is time, in seconds, then λ is a rate (or frequency) with units s<sup>-1</sup>.<br /><br />Below, I've plotted an exponential decay (not normalized), following exp(-x/300), from x = 0 to 1000:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Pjbs_BJx0cA/U1rbGJVUjzI/AAAAAAAACZQ/QRtQTVfuO0M/s1600/figure_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-Pjbs_BJx0cA/U1rbGJVUjzI/AAAAAAAACZQ/QRtQTVfuO0M/s1600/figure_1.png" height="480" width="640" /></a></div><br /><br />We can visualize the memorylessness of the thing, and appreciate how some of the exponential distribution's spooky symmetry comes about by starting at any point further along the x-axis and advancing along x by a distance of another 1000 units, and expanding the y-axis to fill the same amount of space on the screen. Below, I chose to start at x = 900, near the end of the previous plot. The curve looks identical to before. Note that the numbers on the x- and y-axes are different, but the functional form is the same. It is as if those first 900 units on the x-axis had never happened.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-J9408dehC_I/U1rbP3kcl8I/AAAAAAAACZY/KJPpRGm9yzc/s1600/figure_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-J9408dehC_I/U1rbP3kcl8I/AAAAAAAACZY/KJPpRGm9yzc/s1600/figure_2.png" height="480" width="640" /></a></div><br /><br />Any two-level, time-invariant decay process is exponential. The photon is a two-level system, it goes from unabsorbed to absorbed, then it's game over, and as long as its environment isn't changing, it exhibits the required temporal symmetry. A radioactive nucleus is a similar two level system - not decayed followed by decayed. Very many other physical systems follow the same pattern. 
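<br /><br />The memorylessness shown in the two plots above can also be stated as a ratio: for the decay exp(-x/300), the fraction that survives an extra 1000 units is the same wherever we start. A small Python check (the helper name <code>surviving_fraction</code> is purely illustrative):

```python
import math


def surviving_fraction(start, step, scale=300.0):
    """Ratio f(start + step) / f(start) for f(x) = exp(-x / scale)."""
    return math.exp(-(start + step) / scale) / math.exp(-start / scale)


# The ratio depends only on the step, not on the starting point
for start in (0.0, 900.0, 5000.0):
    print(start, surviving_fraction(start, 1000.0))
```
<br /><br />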
The process is still exponential if there are several decay channels between the two levels of the system. More complex dynamics can be described by various combinations of exponential functions.<br /><br />Beyond photons and atoms, many other phenomena are exponential. Even some human affairs, such as the <a href="http://www.dcscience.net/?p=2369">time that a hospital bed remains occupied</a>, follow this remarkable formula.<br /><br />The <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#mean">mean</a> of the exponential distribution is obtained in the usual way, by evaluating the definite integral from 0 to ∞ (the exponential distribution has no density below x = 0):<br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-_0bxXtlLgHA/U1tCzxw2KAI/AAAAAAAACZo/yM6sofnUPEI/s1600/eq9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-_0bxXtlLgHA/U1tCzxw2KAI/AAAAAAAACZo/yM6sofnUPEI/s1600/eq9.PNG" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div></td></tr></tbody></table></td><td class="eqn_number">(9)<br /><div><br /></div></td></tr></tbody></table>This can be tackled easily using integration by parts, yielding<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-h34jvdrTp8E/U1ncMlV2umI/AAAAAAAACX0/CoP-U2SJQzM/s1600/eq10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-h34jvdrTp8E/U1ncMlV2umI/AAAAAAAACX0/CoP-U2SJQzM/s1600/eq10.PNG" /></a></div><div class="separator" style="clear: both; text-align: center;"><br 
/></div></td></tr></tbody></table></td><td class="eqn_number">(10)<br /><div><br /></div></td></tr></tbody></table>In another amazing display of symmetry, the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#standard-deviation">standard deviation</a> for the exponential distribution is the same as the mean:<br /><br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-KKTxfX2GDrc/U1nysNl8iXI/AAAAAAAACZA/MFIpeajUgTc/s1600/std.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-KKTxfX2GDrc/U1nysNl8iXI/AAAAAAAACZA/MFIpeajUgTc/s1600/std.PNG" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div></td></tr></tbody></table></td><td class="eqn_number">(11)<br /><div><br /></div></td></tr></tbody></table>Obtaining the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#cdf">cumulative distribution function</a> for the exponential distribution is as easy as it ever gets. Where f(L) = μ × exp(-μL) was the probability for a photon to be absorbed at L (Eq. 7), recall that exp(-μL) was also the probability for the photon to be unabsorbed prior to reaching L. 
But the statement 'unabsorbed up to L' is <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#not">complementary</a> to the statement, 'absorbed anywhere between 0 and L,' so the CDF is simply<br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><a href="http://4.bp.blogspot.com/-GOTJtVRr9jE/U1nZYalhn5I/AAAAAAAACXQ/aVBAUumc344/s1600/eqCDF.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://4.bp.blogspot.com/-GOTJtVRr9jE/U1nZYalhn5I/AAAAAAAACXQ/aVBAUumc344/s1600/eqCDF.PNG" /></a><span style="text-align: justify;"> </span></td></tr></tbody></table></td><td class="eqn_number">(12)<br /><div><br /></div></td></tr></tbody></table>When an electron in an atom is given a jolt of extra energy, and promoted to a higher orbital, the time for which it stays in that high-energy state before relaxing down to its equilibrium state also follows the exponential distribution. 
The average lifetime of the excited state is 1/λ, which is termed the time constant, τ, and the evolution of an ensemble of N excited atoms is written<br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-qIbXzqlICm0/U1neMSYvx3I/AAAAAAAACYA/VrSMcBYgjWU/s1600/eq11.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-qIbXzqlICm0/U1neMSYvx3I/AAAAAAAACYA/VrSMcBYgjWU/s1600/eq11.PNG" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div></td></tr></tbody></table></td><td class="eqn_number">(13)<br /><div><br /></div></td></tr></tbody></table>It is straightforward to see that τ is the expected time it takes for the number of excited atoms to fall to 1/e times the initial number.<br /><br />Radioactive nuclei are more usually characterized by their half life, T<sub>1/2</sub>, rather than τ. The half life is the time it takes N(t) to reach half its initial value. It is the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#median">median</a> of the exponential distribution, as can be seen directly from Eq. 12. It is found easily by setting t = T<sub>1/2</sub> in Eq. 
13: N(T<sub>1/2</sub>) / N(0) = exp(-T<sub>1/2</sub>/τ) = 1/2, and solving:<br /><table cellpadding="0" cellspacing="0" class="num_eqn"><tbody><tr><td><table align="center" cellpadding="0" cellspacing="0"><tbody><tr><td><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-6Si5oCBkh9Y/U1ngRJClYvI/AAAAAAAACYQ/J5EawDHPJRw/s1600/eq12.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-6Si5oCBkh9Y/U1ngRJClYvI/AAAAAAAACYQ/J5EawDHPJRw/s1600/eq12.PNG" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div></td></tr></tbody></table></td><td class="eqn_number">(14)<br /><div><br /></div></td></tr></tbody></table>This particular formula highlights the general difference between the mean and the median: the mean is the centre of mass (depending on the sum of the products of mass times distance), while the median is the point at which the mass to the left equals the mass to the right (depending only on the sum of the masses).<br /><br />Note: we mustn't fall into the trap of thinking that after 2 half lives, all the radioactive nuclei will have decayed. 
Remember, the process is memoryless - in 2 half lives, the population drops to one quarter, in 3 half lives it drops to one eighth, and so on.<br /><br />Of further interest is that, among all distributions for a continuous parameter restricted to non-negative values and with a specified mean, the exponential distribution has the property of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#mep">maximum entropy</a>.<br /><br /><br /><br /><br /></div><div style="text-align: justify;"><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-7922280134858918042014-03-22T03:41:00.001-05:002014-03-22T03:41:25.414-05:00Whose confidence interval is this?<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">This week, yet again, I was confronted by another facet of the nonsensical nature of the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#frequency-interpretation">frequentist</a> approach to statistics. The blog of <a href="http://andrewgelman.com/2014/03/15/problematic-interpretations-confidence-intervals/">Andrew Gelman</a> drew my attention to a recent peer-reviewed paper studying the extent of misunderstanding of the meaning of confidence intervals among students and researchers. What shocked me, though, was not only the findings of the study. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Confidence intervals are a relatively simple idea in statistics, used to quantify the precision of a measurement. When a measurement is subject to statistical <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#noise">noise</a>, the result is not going to be exactly equal to the parameter under investigation. 
For a high quality measurement, where the impact of the noise is relatively low, we can expect the result of the measurement to be close to the true value. We can express this expected closeness to the truth by supplying a narrow confidence interval. If the noise is more dominant, then the confidence interval will be wider - we will be less sure that truth is close to the result of the measurement. Confidence intervals are also known as error bars.</div><div style="text-align: justify;"><a name='more'></a><br /></div><div style="text-align: justify;">Hoekstra <i>et al.</i>, the authors of the <a href="http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf">paper</a><sup>1</sup>, asked students and experienced researchers to mark as true or false a number of statements interpreting the meaning of a confidence interval. The results of their survey appear shocking, with large numbers of wrong answers reported, and veteran researchers apparently doing little better than students yet to receive any formal training in statistics.<br /><br />This is bad. Confidence intervals are good things. Quantifying our knowledge is exactly what science is about, and an assessment of precision is vital to that process. It goes without saying that understanding what has been assessed is also vital, especially for the people doing the assessing!<br /><br />Just for fun, then, have a go at the survey questions that formed the basis for the data in the paper:<br /><br /><blockquote class="tr_bq"><i>Professor Bumbledorf conducts an experiment, analyzes the data and reports, "the 95% confidence interval for the mean ranges from 0.1 to 0.4." 
Which of the following statements are true:</i></blockquote><ol><li><i>The probability that the true mean is greater than 0 is at least 95%</i></li><li><i>The probability that the true mean equals 0 is smaller than 5%</i></li><li><i>The <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#null-hypothesis">null hypothesis</a> that the true mean equals 0 is likely to be incorrect.</i></li><li><i>There is a 95% probability that the true mean lies between 0.1 and 0.4.</i></li><li><i>We can be 95% confident that the true mean lies between 0.1 and 0.4.</i></li><li><i>If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.</i></li></ol><div><br />The results of the survey are given in the table below. The numbers are the proportion of survey participants who assessed each item to be true: </div><br /><br /><table align="center" border="1" cellpadding="5" cellspacing="0" valign="middle"><tbody><tr> <td>Item</td><td>1st year students<br /> (n = 442)</td><td>masters students<br /> (n = 34)</td><td>researchers<br /> (n = 118)</td></tr><tr> <td> 1</td><td> 51%</td><td> 32%</td><td> 38%</td></tr><tr> <td> 2</td><td> 55%</td><td> 44%</td><td> 47%</td></tr><tr> <td> 3</td><td> 73%</td><td> 68% </td><td> 86%</td></tr><tr> <td> 4</td><td> 58%</td><td> 50%</td><td> 59%</td></tr><tr> <td> 5</td><td> 49% </td><td> 50%</td><td> 55%</td></tr><tr> <td> 6</td><td> 66%</td><td> 79%</td><td> 58%</td></tr></tbody></table><br /><br />So how do you think you did, in comparison to the survey participants?<br /><br />According to the authors of the paper, all six statements about the confidence interval are false.<br /><br />Did you do a double take just now? Did you feel momentarily confused, perplexed, humiliated? I hope so. I certainly felt confused when I read the study description. It seemed to me, on examining the 6 statements, that exactly one of them was false. 
Go on, take another scan through the list, see if you can pick out the one I identified as false ......<br /><br />I'll tell you in a moment.<br /><br />Though Gelman's assessment of the study had been vaguely endorsing of its message, I inferred the study's design to be very seriously flawed. Could I have been badly confused? Did I really know the technical definition of the confidence interval?<br /><br />Helpfully, the authors of the paper have supplied that for us:<br /><br /><blockquote class="tr_bq">[If] a particular procedure, when used repeatedly across a series of hypothetical data sets, yields intervals that contain the true parameter value in 95% of cases ... the resulting interval is said to be a 95% CI.</blockquote><br />'CI,' of course, means 'confidence interval.' This description pretty much meets my expectations. To be sure, I checked a few other sources, and they match this definition perfectly.<br /><br />So, armed with this technical information, let's work our way through the list. The job is easier, I feel, if we start at item number 4: "There is a 95% probability that the true mean lies between 0.1 and 0.4."<br /><br />Some basics, to help us out: imagine an urn (an opaque jar) filled with balls of 2 different colours. Suppose there are 100 balls in total, 95 of which are green, the remaining 5 being red. I insert my hand into the urn and blindly extract a ball. What is the probability that the extracted ball is green? Of course, P(green) = 0.95. This follows from very basic symmetry considerations. The <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#indifference">indifference principle</a> gives us equal probability to draw any of the balls. Applying the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#extended-sum-rule">extended sum rule</a> to this result trivially gives us 0.95 as the probability to draw one of the green balls. 
This result is readily generalized, and is known as the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bernoulli-urn">Bernoulli urn rule</a>. It is one of the earliest results of probability theory.<br /><br />We can treat our experiment like an urn. The experiment is a data-generating process, just like the urn. It spits out a sample - not a ball this time, but a sample of data points from a noisy distribution - a sample of data points with an associated confidence interval. We have from the authors' own pen: the 95% CI is produced by a procedure that yields intervals containing the true parameter value on 95% of occasions. So here is the question: what is the probability that the confidence interval we obtained is one of the 95% that contain the true parameter value? Trivially, it is 0.95, a.k.a. 95%, and statement number 4 on the survey is true.<br /><br />With number 4 settled, number 1 is also trivially true. Since there is a 95% probability that the true parameter value lies between two positive numbers, there can not be less than 95% probability that the true parameter value is greater than 0.<br /><br />Number 2 must also be true, for similar reasons. (In fact, if the parameter space is continuous, then the probability to be some discrete value is zero, anyway.)<br /><br />If item 2 is true, then I interpret 3 to be also true - probability less than 5% is my idea of unlikely.<br /><br />Item 5 is somewhat vague, but to my mind it is the same as number 4. Probability provides a numerical measure of confidence.<br /><br />That leaves number 6, "if we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4." This statement is absurdly false. The true mean does not move around. If I am repeating the experiment, i.e. measuring the same parameter again, then its true value is the same as previously. 
Oddly, this uniquely false statement from the list received the second highest level of endorsement from the survey participants (a strong majority, in fact). There is clearly something to the paper's claim of 'robust misinterpretation of confidence intervals'.<br /><br />But how could the authors have been so wrong about the other 5 items?<br /><br />In the frequentist tradition, a probability <i>is</i> a frequency. The probability that a tossed coin lands heads up is 0.5 exactly because in a large number of tosses, half of them will come up heads. Actually, it probably won't be exactly half that are heads, but we might momentarily overcome the resulting feeling of queasiness this definition produces, to consider it as a serious candidate for an understanding of probability.<br /><br />The big problem we quickly hit, though, becomes apparent when we ask a perfectly reasonable question like, what is the probability that the universe is between 13.7 and 13.9 billion years old? There is no frequency with which this is true, its truth does not vary. Thus, in the frequentist tradition, facts do not have associated probabilities, because facts are either true or false. One really has to wonder, then, what it is the frequentists think they are assigning probabilities to. In this tradition, therefore, one can not say that some parameter lies in some interval with some probability. It either does or it doesn't.<br /><br />This raises an obvious question: if the frequentist is barred from calculating the probability that a parameter lies in some interval, how can they calculate their confidence intervals, which, as I showed amount to the same thing? How can they effectively say that the confidence interval from a repeated experiment will probably contain the parameter's true value? The fact is, they can't. Not without cheating. 
Not without grossly violating their own system.<br /><br />The Wikipedia page for <a href="http://en.wikipedia.org/wiki/Confidence_interval">Confidence Interval</a> has a simple <a href="http://en.wikipedia.org/wiki/Confidence_interval#Practical_example">example</a> of a frequentist calculation of a 95% confidence interval. I actually don't mind this calculation; I think it's a reasonable way (under the right circumstances - i.e. normal approximation is valid) to estimate the precision of a parameter estimate. But, not surprisingly, the calculation produces an equation of the form (where θ is the parameter being estimated, and x<sub>1</sub> and x<sub>2</sub> are the limits of the confidence interval)<br /><br /><div style="text-align: center;"><span style="font-size: large;"> P(x<sub>1</sub> ≤ θ ≤ x<sub>2</sub>) = 0.95</span></div><br />Guess what: this is a probability assignment about the value of θ. Something the frequentist system does not allow. Even though x<sub>1</sub> and x<sub>2</sub> have been calculated from the current estimate for θ, the Wikipedia article currently includes the somewhat <i>ad hoc</i> looking statement<br /><blockquote class="tr_bq"><span style="font-family: inherit;"><i><span style="background-color: white; line-height: 19.200000762939453px; text-align: start;">This does not mean that there is 0.95 probability of meeting the parameter [</span></i></span><i>θ</i><i style="font-family: inherit;"><span style="background-color: white; line-height: 19.200000762939453px; text-align: start;">]</span><span style="background-color: white; line-height: 19.200000762939453px; text-align: start;"> in the interval obtained by using the currently computed value of the [estimate for </span></i><i>θ</i><i style="font-family: inherit;"><span style="background-color: white; line-height: 19.200000762939453px; text-align: start;">].</span></i></blockquote><br />The offending expression is made not a probability (and therefore not a violation of frequentist 
dogma) by simply declaring it so. Yay! that was easy.<br /><br />Of course, to get any probability assignment, the frequentist must, just like anybody else, assume (explicitly or implicitly) a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#prior">prior distribution</a>, which also violates the frequentist methodology. A safer way to get point estimates and confidence bounds, therefore, involves explicitly formulating a suitable prior, and then operating on a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#posterior">posterior distribution</a>, obtained from <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bayes-theorem">Bayes' theorem</a>. If you'd like to see a simple example of such a calculation of a confidence interval, you could try my earlier article on <a href="http://maximum-entropy-blog.blogspot.com/2012/05/nuisance-parameters.html">nuisance parameters</a>.<br /><br /><br /><br /><hr /><br /><br /><span style="color: blue;"><b>References</b></span><br /><br /><br /><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"> [1] </td> <td>R. Hoekstra, R.D. Morey, J.N. Rouder, and E.-J. 
Wagenmakers, <i>'Robust misinterpretation of confidence intervals'</i>, Psychonomic Bulletin & Review, January 2014 (<a href="http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf">link</a>)</td> </tr><tr> <td valign="top"><br /></td> <td><br /></td> </tr></tbody></table><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com2tag:blogger.com,1999:blog-715339341803133734.post-63344476658065870042014-02-27T21:10:00.002-06:002014-02-27T21:11:59.827-06:00The Full Adder Circuit<br /><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I recently wrote a very brief introduction to <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#boolean">Boolean algebra</a> for the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html">glossary</a>, so I thought it would be worth describing a very simple but important application example. There are two main reasons why I'm interested in Boolean algebra. The first is that in probability theory, the hypotheses we investigate are assumed to be Boolean in character (true or false, with no intermediates allowed). The second is that Boolean algebra is an important branch of logic, and therefore intimately linked to science and rationality.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span style="text-align: start;">In an <a href="http://maximum-entropy-blog.blogspot.com/2013/10/entropy-games.html">earlier post</a>, I discussed how all transfer of information comes down to a sequence of answers to yes/no questions. In this spirit, therefore, consider the following:</span></div><blockquote class="tr_bq"><i>By answering only yes/no type questions, calculate the sum 234 + 111. 
In other words, if you were a digital computer, how would you perform this calculation?</i></blockquote><div style="text-align: justify;"><br /><a name='more'></a><br /></div><div style="text-align: justify;"><b>Expressing numbers by answering yes/no questions</b><br /><br /></div><div style="text-align: justify;">The numbers 234 and 111 are expressed (though I haven't specified it) in the conventional base 10 form. You're probably at the very least dimly aware that to solve this problem, we'll need to convert these numbers to base 2, or binary form.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In base ten, each digit is the answer to a question, 'how many multiples of 10 to the power n are there?', where n is one subtracted from the digit's place in the number. For instance, in 234, the 4 is in the first place and therefore signals how many times 10<sup>0</sup> appears in the number (once all the higher powers of 10 have been accounted for). Any number greater than 0 raised to the power 0 is 1, so 4 times 10<sup>0</sup> is just 4. Similarly, 3 times 10<sup>1</sup> is 30, and 2 times 10<sup>2</sup> is 200. Add them all up, and you get 234.<br /><br />Actually, in technical circles, what I referred to as the first digit is normally called the zeroth digit - digits are counted the same way Europeans count the floors of a building (Europeans are more logical than Americans!). Therefore, the 3 in 234 is in the first position and the 2 is in the second position. From here on, I'll switch to the standard zero-based indexing. In this nomenclature, the power to which the base is raised is the relevant index, and not 1 subtracted from the index.<br /><br />Expressing a number in binary uses almost exactly the same scheme, except that now each digit answers a yes/no type question. So if the number (base 10) is 15, we could start by asking 'is it larger than or equal to 2<sup>4</sup>?' 
The answer is no, so a 0 goes into the left-most (in this case the fourth) digit. Next, 'does it contain 2<sup>3</sup>?' Yes, 2<sup>3</sup> is 8, and 8 is less than 15, so the next digit is a 1: 01.<br /><br />Having accounted for those 8, that leaves 7 remaining to be accounted for. Of those remaining 7, is there a 2<sup>2</sup> present? Yes: 011. And 3 left over. Of those remaining 3, is there a 2<sup>1</sup> present? Yes: 0111. And there is also a 2<sup>0</sup> left, so we end up with 01111. That's 15 expressed in binary.<br /><br /><br /><b>Adding binary numbers</b><br /><br />A device that adds together just two binary digits is called a half adder circuit. We'll understand why in a moment. Its inputs are I<sub>1</sub> and I<sub>2</sub>. If I<sub>1</sub> is 1 and I<sub>2</sub> is 0, then the sum is 1. Similarly, if the values are reversed. If, however, the inputs are I<sub>1</sub> = 1, I<sub>2</sub> = 1, the sum (in decimal) is 2, which in binary is 10. We can't express this with a single output bit so we need two outputs, which we call 'sum' and 'carry'. In this case, the sum is 0 and the carry is 1.<br /><br />This half adder circuit is not sufficient to add together strings of more than 1 bit. Suppose we are adding 11 and 01. For the <strike>first</strike> zeroth (right-most) digit, the sum is 0, and the carry is 1. If we use another half adder to add the <strike>second</strike> first digits, we get 1 again, giving as combined output the string 11. Obviously, the output should be larger than the inputs, when both numbers are non-zero and positive. The reason we fell short is that we did not include the carry from summing the zeroth digit. Thus we need a circuit with three inputs, I<sub>1</sub>, I<sub>2</sub>, and carry_in. 
This new circuit is called a full adder.<br /><br />We don't need to know physically how such a circuit might be implemented (of course, digital electronics implements all logical operations using networks of transistors), but we do need to work out the logic of its operation. We can do this by drawing up a table.<br /><br /><br /><table align="center" border="1" cellpadding="5" cellspacing="0"><tbody><tr><td><b> <span style="text-align: justify;">I</span><sub style="text-align: justify;">1 </sub></b></td><td><b> <span style="text-align: justify;">I</span><sub style="text-align: justify;">2 </sub></b></td><td><b> carry_in </b></td><td><b>decimal </b><br /><b> result</b></td><td><b> binary </b><br /><b> result </b></td><td><b>sum</b></td><td><b>carry_out</b></td></tr><tr><td> 0 </td><td> 0 </td><td><div style="text-align: center;">0</div></td><td> 0</td><td> 0</td><td> 0</td><td> 0</td></tr><tr><td> 0</td><td> 0</td><td><div style="text-align: center;">1</div></td><td> 1</td><td> 1</td><td> 1</td><td> 0</td></tr><tr><td> 0</td><td> 1</td><td><div style="text-align: center;">0</div></td><td> 1</td><td> 1</td><td> 1</td><td> 0</td></tr><tr><td> 0</td><td> 1</td><td><div style="text-align: center;">1</div></td><td> 2</td><td> 10</td><td> 0</td><td> 1</td></tr><tr><td> 1</td><td> 0</td><td> 0</td><td> 1</td><td> 1</td><td> 1</td><td> 0</td></tr><tr><td> 1</td><td> 0</td><td> 1</td><td> 2</td><td> 10</td><td> 0</td><td> 1</td></tr><tr><td> 1</td><td> 1</td><td> 0</td><td> 2</td><td> 10</td><td> 0</td><td> 1</td></tr><tr><td> 1</td><td> 1</td><td> 1</td><td> 3</td><td> 11</td><td> 1</td><td> 1</td></tr></tbody></table><br /><br />All possible combinations of the 3 input bits are given in the 3 left-most columns. The fourth column states the standard decimal results when each of these combinations of 0's and 1's are added up. The fifth column converts these decimal results to binary. 
Finally, the last 2 columns are the zeroth (right-most) and the first digits of this binary result (the sum and carry_out, respectively).<br /><br />We need two logical expressions. One to give us the 'sum' output bit, and another to yield the 'carry_out' output bit.<br /><br />Let's start with the sum. There are four cases where this bit is 1, so the logical expression we need is the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#disjunction">disjunction</a> (A or B or C... ) of these four cases. Each case is a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#conjunction">conjunction</a> of some set of values for each of the 3 input bits. For example, the first case where the sum bit is 1 is "I<sub>1</sub> is <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#not">false</a> and I<sub>2 </sub>is false and carry_in is true," which I'll denote as <span style="text-decoration: overline;">I</span><sub>1</sub><span style="text-decoration: overline;">I</span><sub>2</sub>C. 
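<br /><br />The table above, and the cases in which each output bit is 1, can be reproduced with a few lines of code. What follows is only an illustrative sketch in Python, not the spreadsheet implementation: the function name full_adder_table and the layout of each row are hypothetical choices made for this example. It adds the three input bits as ordinary integers and then splits the result into its zeroth and first binary digits.

```python
from itertools import product


def full_adder_table():
    """Return rows (i1, i2, carry_in, total, sum_bit, carry_out) for all inputs.

    Hypothetical helper for illustration only.
    """
    rows = []
    for i1, i2, carry_in in product((0, 1), repeat=3):
        total = i1 + i2 + carry_in   # decimal result: 0, 1, 2 or 3
        sum_bit = total % 2          # zeroth (right-most) binary digit
        carry_out = total // 2       # first binary digit
        rows.append((i1, i2, carry_in, total, sum_bit, carry_out))
    return rows


table = full_adder_table()

# Input combinations (I1, I2, carry_in) for which each output bit is 1.
# Each combination is one conjunction; the output bit is their disjunction.
sum_cases = [row[:3] for row in table if row[4] == 1]
carry_cases = [row[:3] for row in table if row[5] == 1]

for row in table:
    print(row)
print("sum is 1 for:", sum_cases)
print("carry_out is 1 for:", carry_cases)
```

Each tuple in sum_cases corresponds to one conjunction of input values; the first of them, (0, 0, 1), is the case "I<sub>1</sub> is false and I<sub>2</sub> is false and carry_in is true" described above.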
<br /><br />Denoting disjunction as '+', then the full expression we're after is:<br /><br /><div style="text-align: center;"><span style="font-size: large;">sum = <span style="text-align: justify; text-decoration: overline;">I</span><sub style="text-align: justify;">1</sub><span style="text-align: justify; text-decoration: overline;">I</span><sub style="text-align: justify;">2</sub><span style="text-align: justify;">C + </span><span style="text-align: justify; text-decoration: overline;">I</span><sub style="text-align: justify;">1</sub><span style="text-align: justify;">I</span><sub style="text-align: justify;">2</sub></span><span style="font-size: large; text-align: justify; text-decoration: overline;">C</span><span style="font-size: large; text-align: justify;"> + </span><span style="font-size: large; text-align: justify;">I</span><sub style="text-align: justify;">1</sub><span style="font-size: large; text-align: justify; text-decoration: overline;">I</span><sub style="text-align: justify;">2</sub><span style="font-size: large; text-align: justify;"><span style="text-decoration: overline;">C</span></span><span style="font-size: large; text-align: justify;"> </span><span style="font-size: large; text-align: justify;">+</span><span style="font-size: large; text-align: justify;"> </span><span style="font-size: large; text-align: justify;">I</span><sub style="text-align: justify;">1</sub><span style="font-size: large; text-align: justify;">I</span><sub style="text-align: justify;">2</sub><span style="font-size: large; text-align: justify;">C</span></div><br /><br />Similarly, for the carry_out bit we have:<br /><br /><div style="text-align: center;"><span style="font-size: large;">carry_out = <span style="text-align: justify; text-decoration: overline;">I</span><sub style="text-align: justify;">1</sub><span style="text-align: justify;">I</span><sub style="text-align: justify;">2</sub></span><span style="font-size: large; text-align: justify;">C + </span><span 
style="font-size: large; text-align: justify;">I</span><sub style="text-align: justify;">1</sub><span style="font-size: large; text-align: justify; text-decoration: overline;">I</span><sub style="text-align: justify;">2</sub><span style="font-size: large; text-align: justify;">C + </span><span style="font-size: large; text-align: justify;">I</span><sub style="text-align: justify;">1</sub><span style="font-size: large;"><span style="text-align: justify;">I</span><sub style="text-align: justify;">2</sub></span><span style="font-size: large; text-align: justify; text-decoration: overline;">C</span><span style="font-size: large; text-align: justify;"> +</span><span style="font-size: large; text-align: justify;"> </span><span style="font-size: large; text-align: justify;">I</span><sub style="text-align: justify;">1</sub><span style="font-size: large; text-align: justify;">I</span><sub style="text-align: justify;">2</sub><span style="font-size: large; text-align: justify;">C</span></div><br /><br />To add two strings of binary digits, then, start with I<sub>1</sub> and I<sub>2 </sub>equal to the zeroth (right-most) digits in each string. The initial carry_in is 0. The sum and carry_out bits are calculated according to the above two expressions. For the next step, I<sub>1</sub> and I<sub>2 </sub>are assigned the values of the first digits (next to right-most position) in each of the input strings, and the carry_in is now assigned the carry_out value from the previous step. Finally, the result is the string generated by all the sum bits. We need a cascade of 9 full-adder circuits to perform the required logic.<br /><br />I made a spreadsheet implementation of this procedure, which can be found <a href="https://docs.google.com/spreadsheet/ccc?key=0Ag8YfHJyCwTGdERTXzVYNTExODhEeG1qR3l1Ym1TT3c&usp=sharing">here</a>. 
You don't have write permission for that file, but if you save your own copy, then you shouldn't have any problem experimenting with it.<br /><br />And that's all we need in order to do logical computations. We can gain some efficiency improvements by applying axioms and theorems of Boolean algebra to simplify the logical expressions we obtain (particularly when we have more complex expressions). This is called Boolean minimization (interestingly enough, this is often most easily done using a computer). But apart from that, the process described here is pretty much all there is to it.<br /><br /><br />By the way, the answer to the question, 234 + 111:<br /><br /><div style="text-align: center;">yes, no, yes, no, yes, yes, no, no, yes <br />(101011001)</div><br /><br /><br /> </div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-31588871005084771272014-02-02T23:47:00.001-06:002014-02-02T23:55:32.700-06:00Practical Morality, Part 2<br /><div style="text-align: justify;"><blockquote class="tr_bq"><i>It has been said that democracy is the worst form of government, except all those others that have been tried.</i></blockquote><blockquote class="tr_bq">Winston Churchill </blockquote><div><br />(The second of two parts. Read the first installment <a href="http://maximum-entropy-blog.blogspot.com/2014/01/practical-morality-part-1.html">here</a>.)</div><br /><br /><b><span style="color: blue;">Politics & Science</span></b><br /><br />I have a funny little feeling that Churchill actually knew a small bit about politics. According to dear, old Winston, democracy sucks. But why does it suck? 
And does it necessarily suck?<br /><br />A full analysis of these questions could run into thousands of pages, and obviously stretches far beyond any area in which I could claim expertise, but for now at least, I want to point out just one aspect of democracy's poor performance to date that can most definitely be fixed. That is, the failure so far of both politicians and the electorate to explicitly recognize the <a href="http://maximum-entropy-blog.blogspot.com/2013/09/is-rationality-desirable.html">necessarily rational basis for morality</a>.<br /><a name='more'></a><br />Time and again, we see scientific experts consulted in order to obtain the best quality data possible to support some process of policy decision, only to see the elected politicians ignoring what they have been told, in favor of the decision that always suited their prior ideology. This is bad enough, but very often, the scientific analysis is never even sought. Somehow, this is seen by the voting public as acceptable. Worse still, it seems to be often treated as desirable. Certainly, it is something built into the contemporary political culture of many democracies.<br /><br />This perverse situation is made possible, almost inevitable, in fact, by the widespread, mistaken belief that science has absolutely nothing to say about what is morally desirable. Under this insidious assumption, how could the expert scientist possibly have anything conclusive to say about morality? Morality is not the art of what is, but of what should be done, so evidently, we must enforce a clear division of labour, such that the data gathering is left to the expert scientist, while morality is left to the ethicist and the expert politician. Seriously, it's not as if politics can be reduced to questions of fact, is it?<br /><br />There, the absurdity of the prevailing position exposed.<br /><br />This position is so ubiquitous, it seems to be held even by many of the most respected (and powerful) scientists around. 
For example, in an <a href="http://www.bbc.co.uk/iplayer/episode/b01n111x/The_Life_Scientific_Sir_Mark_Walport/">episode</a> of BBC Radio 4's "The Life Scientific," broadcast on October 2nd, 2012, Mark Walport was interviewed by Jim Al-Khalili. Walport, who at the time was about to assume the position of chief scientific adviser to the British government, spoke about numerous things that made brilliant sense to me, but about 16 minutes in, he was asked what his attitude would be, should his advice be ignored by the politicians. This was his response:<br /><blockquote class="tr_bq"><i>It’s very important for an adviser to distinguish between what is the science, and then recognize that there may be a series of different decisions that you can take, and that’s then politics.</i></blockquote>Quite clearly, he is making the point that at some instant, the science ends and then politics takes over, that the politician may choose to ignore the best quality advice (in favour of what? gut feeling? divine inspiration?), and that this is just fine: the scientist, incapable of judging human affairs, must deliver his evidence then keep his mouth shut. A little later, Walport elaborated further on what is, in his apparent view, the strict divide between science and politics:<br /><blockquote class="tr_bq"><i>That's absolutely right, [politics isn't always based on reason,] politics is based on all sorts of things, it’s based on political ideology, it’s based sometimes on pragmatism, it’s based on choosing which battles to fight and which battles not to fight, but that’s as it were the distinction between scientific advice and political decisions.</i> </blockquote>From somebody with Walport's scientific credentials, I'd have expected to hear these comments followed by something to the effect that this is wrong, and this culture has to be changed. 
But it seems that this chief scientific adviser holds the view that this perceived distinction between rationally acquired understanding and the running of a country is right and proper.<br /><br />It may well be that to some extent during this interview, Walport felt unable to express his true views on this matter, having already seen the outcome when another prominent scientific adviser dared cross the UK government. In 2009 David Nutt was sacked from his position as chairman of the Advisory Council on the Misuse of Drugs, by Home Secretary Alan Johnson. Nutt's apparent crime was to point out that the legal classification of recreational drugs in the UK was incommensurate with the best scientific measures of harm caused by drugs. Johnson's political career received no visible setbacks as a result of this action (and its strong inherent suggestion that he likes to <a href="http://maximum-entropy-blog.blogspot.com/2014/01/the-dennett-tennis-test.html">play without the net</a>).<br /><br />The moment we realize that the question of what ought to be done is a question concerning matters of fact, and that matters of fact can only be answered more reliably when investigated more scientifically, then we begin to wonder how on Earth it can be acceptable for policy decisions affecting potentially millions of people to go against the best quality scientific advice available. What procedure could possibly justify such decisions? (At some point, a decision has been made, without using the decision procedure, that the decision procedure is broken!)<br /><br />The politicians feel it is appropriate to ignore scientific advice, partly because the top scientists (like Walport) are telling them this is so. They feel they have understanding and expertise that the scientist cannot tap into, because this is the prevailing culture: human needs can not be assessed by evidence and logic. 
Politicians are encouraged to invent their own <a href="http://maximum-entropy-blog.blogspot.com/2014/01/the-dennett-tennis-test.html">dubious epistemologies</a>, because society persistently fails to recognize the truth about moral realism, and the logical relationship between morality and science.<br /><br />Within this culture, the politician is free, even expected, to employ his deliberately non-scientific judgement, often citing a mandate from the masses to justify manifestly unsound policies: 'who am I, a servant of the people, to defy popular opinion?' Well, sorry folks, but there are some things you just don't get to vote on. If I'm feeling unwell and go to the doctor, he will not say, "your test results are in, you either have 6 months to live, or it's just a minor cold, you decide!"<br /><br />Indeed, the elected politician is a servant of the people, and as such, trivially, has a duty to serve the interests of the population. This can only be done when a rational procedure enabling reliable predictions about the social outcomes of policy decisions is utilized. Another historic British statesman, Edmund Burke, said this, in 1774, which is apt:<br /><blockquote class="tr_bq"><span style="background-color: white; line-height: 19.200000762939453px; text-align: start;"><i><span style="font-family: inherit;">Your representative owes you, not his industry only, but his judgment; and he betrays, instead of serving you, if he sacrifices it to your opinion.</span></i></span></blockquote><br /><br /><b><span style="color: blue;">Honesty As A Meta-Virtue</span></b><br /><br />As I mentioned in Part 1, the possibility of moral relativism scares the living crap out of people. The kind of moral relativism I describe, (we've got to be careful, there are other kinds, that make little sense) follows as a trivial and necessary consequence of the moral realism I have outlined here and in earlier essays. 
What makes an act I perform moral is a combination of (a) the likely (real) future state of the world, with versus without the act, and (b) my utility function (the algorithm that assigns for me relative value to different states of existence), which is a real and objective property of the matter that composes my mind. We thus arrive at realism. We also arrive at the obvious conclusion that another decision-making entity with a different utility function, even if placed under identical circumstances to mine, may have radically different actions that count for it as moral.<br /><br />In short, what makes it moral for me to pursue goal X is the trivial fact that I desire X (supported by a sound rational procedure - this is crucial!). I admit, this does have some chilling-sounding consequences, particularly when the parenthetical qualification is omitted.<br /><br />Whether it is out of fear for what we ourselves might do should our assessment of value happen to change or out of alarm at the prospect that others might not share the same values as us (I suspect the latter as dominant), a major industry has grown up over the centuries to firmly establish a certain dogma. According to this dogma, the determination of morality is universal and absolute. There is no sense in which X could be moral for me, but immoral for another. In particular, the determination of morality consists of no self-serving component - what you desire counts for nothing, all that matters is the rules (Kant's <a href="http://en.wikipedia.org/wiki/Categorical_imperative">categorical imperative</a>).<br /><br />The objective of this dogma is clear: a moral code can be established such that the capacity for a person to think 'outside the box' is completely eliminated. 
In our fearful state, we might selfishly breathe a sigh of relief, but essentially, it is a technology for destroying a person's autonomy.<br /><br />In a famous paper of 1972<sup>1</sup>, Philippa Foot has exposed the ridiculous absurdity of this dogma. According to the dogma (quoting from Foot):<br /><blockquote class="tr_bq"><i>Actions that are truly moral must be done "for their own sake," "because they are right," and not some ulterior purpose.</i></blockquote><br />But what then motivates a person to be moral?<br /><br /> "Stupid question," snorts the dogmatist, "it is the obvious fact that to be moral is good."<br /><br />But what if I have no interest in what is good?<br /><br /> "But you <i>ought</i> to be interested in what is good, if you were not, then you would not be a moral person!"<br /><br />Um...., so the <i>only possible </i>reason to be moral is that it is immoral not to be.<br /><br />Great. All that remains is to arbitrarily decide what is moral, so we can all toe the line. And it must be an arbitrary decision, mind you, for if it were not, then whatever non-arbitrary procedure might be used would constitute a motivating principle, violating the central moral dogma. In fact, the decision to decide what is moral must also be arbitrary - we could equally arbitrarily decide not to decide this. Now fuck off, and don't ask any more questions.<br /><br /><br />Ladies and gentlemen, this travesty, this grotesque parody of what the human mind is capable of, remains to this day the orthodox and default view of morality. For centuries, respectable intellectuals have been proudly going round in circles, confidently marching right up their own backsides, with arguments of exactly this type.<br /><br />Utter nonsense as it is, might we not yet draw comfort from the <i>status quo</i> that this dogma provides? For as long as it is universally accepted, does it not serve well to minimize the incidence of crime and immorality? 
Let's not be too hasty.<br /><br />What is universally accepted? Crime is a major problem for society, and crime ain't committing itself! Somebody is breaking the rules, and if we looked inside the mind of a criminal, it seems self evident that the message we'd receive would be something like: "Frankly, my dear, I couldn't give a toss about being moral or about being good. I don't give a damn about the rules."<br /><br />The cases where the moral dogma has failed are perhaps exactly the cases where it (or <i>something</i>) was most needed. And I think it's a good bet that in many cases this failure is because what has been unsuccessfully drummed into the potential criminal's head has been such profoundly manifest gibberish, worthless circular garbage, not suitable to convince any self-respecting person with the tiniest inclination towards independent thought. An opportunity to correct an antisocial tendency has been lost, in a way that in hindsight seems almost inevitable.<br /><br />So here's my crazy proposal: instead of making up any old absurd crap, and teaching that to our kids as the basis for appropriate behaviour, why don't we just tell them the truth? I think this policy has some serious potential advantages.<br /><br />So maybe you don't care about rules for their own sake. This is good and proper - society needs more free thinkers. Disconnected from any consideration of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#consequentialism">consequences</a>, rules are nothing more than sounds leaking out of people's mouths, and trails of ink on pieces of paper. But do you care about yourself? 
Ultimately, this is all you need to care about, in order to be good.<br /><br />If you actually do care about your own wellbeing, (which <i>a priori</i> you must), then if you are being consistent, you must also care about adopting a sound strategy for achieving your goals, and it turns out that for reasons touched upon in <a href="http://maximum-entropy-blog.blogspot.com/2014/01/practical-morality-part-1.html">Part 1</a>, such strategies overwhelmingly involve cooperating with other people - fulfilling one's obligations laid out in the social contract. The profound effect of the social contract is that for me, as a fundamentally selfish entity, I do not merely need to <i>act as if</i> I care about other people. Rather, I actually do care, for naturally selected reasons, both biological and cultural. This, we can expect to hold true for the vast majority of humans, under the vast majority of conceivable circumstances - we have solid mathematics (game theory) capable of explaining how this comes about.<br /><br />It seems to me that what I've just argued can not be refuted. However, whether people would tend to behave better or worse if such a message were to be adopted as the principal method of teaching morality is ultimately an empirical question. I do not know for certain. I don't claim to know exactly why people misbehave. But that this message ought to be better than the traditional dogma, I consider to be supported by very powerful arguments.<br /><br />The principal advantages I see for my proposal, as opposed to continued appeal to the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#Absolutism">absolutist</a> dogma, are (i) honesty, (ii) appeal to self interest, and (iii) coherence. If the main argument used in an attempt to stop me doing something I believe I want to do is a lie, then it seems there is a good chance that I will recognize it as a lie, and ignore it. 
This seems like a strong general tendency.<br /><br />Similarly, if what I'm told is the basis for morality sounds suspiciously very much like the incoherent babbling of an imbecile, then I think the risk that I will not follow the prescribed moral code is enhanced. The absolutist dogma manifestly makes no sense, and must be expected to lose significant credibility as a result. For all I know, there may be a large population of law-abiding psychopaths, who avoid antisocial behaviour principally because of their indoctrination since birth in the old moral dogma (this seems to be a major fear people have when I discuss openly recognizing the truth of morality based on self interest), but is there any good reason to think that indoctrination in utter nonsense should be more effective than indoctrination in a moral methodology that makes natural good sense? Realistic moral relativism has the enormous advantage of exactly this kind of coherence, such that we do not need to anesthetize our brains to believe it.<br /><br />This brings us to a further advantage of honest moral teaching: its success does not depend on the cultivated suppression of free thought and critical evaluation (behaviours above which, few can be ranked as higher virtues). When the people we trust most repeatedly tell us nonsense, and try to pass it off as truth, it is hard not to believe it. But the price of believing manifest gibberish is an internal crisis, known to psychologists as cognitive dissonance. It seems quite reasonable to suppose that a mind committed to holding incoherent propositions as beliefs must become adept at suppressing its ability to recognize that incoherence. I think we can easily anticipate the dangers of this talent.<br /><br />I opened this essay with a slightly depressing quotation from Winston Churchill about democracy. The ultimate goal, however, of considering moral realism in these two posts has been a fuller democratization of the social contract. 
With this goal in mind, then, let me end by offsetting that quote with a more positive piece of advice from the same extraordinary man:<br /><br /><blockquote class="tr_bq"><i>Never, never, never quit.</i> </blockquote><br /><br /><br /><hr /><br /><span style="color: blue;"><b>References</b></span><br /><br /><br /><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"> [1] </td> <td>Philippa Foot, "<i>Morality as a system of hypothetical imperatives</i>," Philosophical Review Vol. 81, pages 305 to 316, 1972. (<a href="http://philosophyfaculty.ucsd.edu/faculty/dbrink/courses/other%20pdf%20articles/Morality%20as%20a%20Sytem%20of%20Hypothetical%20Imperatives.pdf">Link</a>)</td> </tr><tr> <td valign="top"></td> <td><br /></td> </tr></tbody></table><br /><br /></div><div style="text-align: justify;"></div><br /><div style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;"><div style="margin: 0px;"><br /></div></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com5tag:blogger.com,1999:blog-715339341803133734.post-53691059562048320472014-01-28T23:38:00.000-06:002014-02-02T23:51:21.626-06:00Practical Morality, Part 1<div style="text-align: justify;"><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;">(The first of two parts. 
Part 2 is <a href="http://maximum-entropy-blog.blogspot.com/2014/02/practical-morality-part-2.html">here</a>.)</span><br /><span style="font-family: inherit;"><br /></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><span style="color: blue; font-family: inherit;"><b>The Social Contract</b></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Wherever you are right now, take a quick look around. Do a quick survey of all the stuff you can see. Think about the number of things you have around you that other people have made. If you are in your own home, then great, the experiment works even better - the things around you probably belong to you, you make some kind of use of them, and quite possibly your life would be less satisfying without them. Some of these things may even be, if not essential for life, indispensable for a comfortable modern existence.</div><div style="text-align: justify;"><a name='more'></a><br /></div><div style="text-align: justify;">Now try to count up the number of these things that you could make yourself. I'm trying this now (the counting, not the actual making), and there is really very little that I could contemplate building, perhaps some of the simpler bits of wooden furniture, if really pushed. Quite possibly (and this is not meant as an insult) there is none of the stuff that you have around you right now that you could build yourself. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Alternatively, perhaps you are quite adept with your hands, and there are several things you could put together yourself. But presumably, you would need tools for this. And possibly even tools to make those tools. You would need materials, for the thing you want to make and for the tools to make them. 
</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Even if you have the skill and energy to actually make some of the things you own, starting from nothing but a pile of raw materials (ignoring for the moment the complexity of acquiring that pile of raw materials), you wouldn't be very efficient. So many processes are involved in converting raw materials into desirable objects, and so many of them require expert knowledge, skill, practice, and dedicated specialization, that you could never approach the efficiency of a large community of individuals, each pretty much focused on some limited domain of expertise. Economies of scale emerge naturally from such specialization. If all I do is cut down trees, then before too long, I predict that I am going to be better at cutting down trees than somebody who just cuts down a tree whenever he needs a bit of wood. If all I do is cut down trees, then I can invest a lot into having the best possible tools, tailor made for the job of cutting down trees - I don't need my tools to be also good for digging copper out of the ground, because I don't do that, somebody else does that.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">And we haven't even begun to think about anything technologically advanced. Getting your laptop to you, for the low price you paid for it, took the research and development efforts of literally thousands of engineers and scientists, practically all of whom got to a position to be able to do that work by engaging in years of dedicated study during which it was impossible to support themselves through full-time employment. For the technology to reach this stage, it took support structures to enable those people to engage in full-time study, and it took efficient dissemination of knowledge, sufficient to make research results readily available and synthesizable in all corners of the world. 
It also took stable international trade to have all the rare-earth elements and other necessary materials readily available for the manufacture of this product, and it took robust legal protection of intellectual property, in order for all those R&D hours at the product development stage to be worth the investment.<br /><br />These mechanisms together with many others, all evolved to help make society function as much as possible to everybody's benefit, are known collectively as "the social contract." They are, for example, what make it reasonable for me to exchange items of real economic value for a few trivial-looking pieces of paper, or a few bits of information on some (to me unknown) computer. They make it possible for me to drive a car in confidence that another vehicle coming towards me will stay on its designated side of the road, allowing us to pass each other without injury. They make it probable that the foods I buy won't poison me, the machines I use won't kill me, and the politicians I vote for won't throw me in jail if I refuse to vote for them next time.<br /><br />Biologically evolved behaviours, such as my tendency to care much more about close family members than about people I've never met, play a major part in defining our core moral objectives. These may include elements of social cooperation, such as, perhaps, a fundamental desire to live in proximity with other people, leading naturally to a desire to live peacefully with other people. Such genetically determined traits help to make us intrinsically caring about others. The social contract does not necessarily define our core moral values, but, by virtue of the colossal technological benefits it brings us, serves as an indispensable aid for achieving the things we value most. 
It has the profound effect of making my personal, selfish values intimately entangled with the values of more or less every other human on the planet.<br /><br /><br /><span style="color: blue;"><b>Realistic Moral Relativism: A Practical Matter</b></span> </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><u>Fact (1)</u>: your core <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#morality">moral values</a> are completely determined by the real physical properties of the matter out of which your mind is built. This is what I mean by moral realism. (See my <a href="http://maximum-entropy-blog.blogspot.com/2013/03/scientific-morality.html">earlier article</a>, and <a href="http://maximum-entropy-blog.blogspot.com/2013/06/crime-and-punishment.html">supporting arguments</a>)<br /><br /><u>Fact (2)</u>: there are no principles of moral value that necessarily hold for all beings in all parts of the universe. This is what I mean by moral relativism. There is one moral meta-principle that holds absolutely, namely Fact (1), above, but it doesn't specify any value that any being must hold.<br /><br />Fact (1) is the foundation of our moral science. It has the dual advantages of: (i) phenomenal empirical support, and (ii) being logically inescapable. Fact (2) follows as a trivial consequence: since values are determined by arrangements of matter, then different arrangements of matter may support different values.<br /><br />Fact (2) scares the bejeezus out of people, which I'll discuss more in Part 2. Whatever the reason, however, there is extraordinary resistance to acceptance of Fact (1). When those of us who have come to recognize the potential to develop a moral science try to explain these findings, there is a tendency to present thought experiments aimed at demonstrating the possibility to measure moral value. 
These typically involve some highly advanced neurological apparatus, maybe some kind of ultra-high-resolution, perfectly calibrated magnetic resonance scanner, capable of recording all relevant details of a person's evolving brain states, and using the data to precisely quantify value. If we could do that, we explain, we would know everything about the human condition, and the moral facts would be laid out before us.<br /><br />There is nothing incorrect about this (once 1 or 2 matters of interpretation have been clarified), but the argument often fails to have its desired impact. There seem to be two common reasons for this, both of which I sympathize with. Both follow from the extreme implausibility of the described measurement: to record a complete description of a person's brain state. Person A complains that such a measurement, resulting in complete knowledge of a person's private state of mind is <i>strictly impossible</i>, thus invalidating the principle we hoped to illustrate. Person B doesn't have the time to philosophically investigate the limits of epistemology, but recognizes the <i>practical impossibility</i> of this - it ain't gonna happen in the foreseeable future - and so dismisses the whole thing as a fanciful science fiction, not worthy of a second thought.<br /><br />Both person A and person B have missed the point. It was supposed to be a thought experiment, illustrating the kinds of information that we might strive to access, in order to advance a moral science. But the truth of Fact (1) does not depend on the ability to attain such complete knowledge, and neither, in fact, does our ability to develop a moral science from it.<br /><br />To think that a truth does not exist, simply because we are prevented, in principle, from uncertainty-free knowledge of it is <a href="http://maximum-entropy-blog.blogspot.com/2012/06/the-mind-projection-fallacy.html">mind-projection fallacy</a>, pure and simple. Facts exist. 
Propositions about the real world are either true or false, and their truth state is independent of how confident we feel about them (we might feel that the recursive proposition, X: "I believe confidently that X is true", is a counter example, but X is not a coherent proposition about the real world - what could it possibly mean?). No science can deliver knowledge that is completely free of uncertainty. This is why the gold standard for expressing scientific advance consists of calculating probabilities. And because we have probability theory, that exquisite invention that saves us from the misery of complete epistemological crisis, science does not need absolute certainty in order to make concrete advances: with incomplete knowledge we continue, often in baby steps, to make real advances in understanding, enabling actual technological gains.<br /><br />We don't need the full blown complexity of the above thought experiment in order to establish our moral science, or indeed to produce from it incremental strategic advances for society. In fact there are only two things we need, in order to make progress in moral science:<br /><br /><ol><li>to measure our moral goals</li><li>to measure the universe</li></ol><br />"Really?" you ask, "that's <i>all</i>?"<br /><br />Stay with me. First I'll explain why those two, then I'll say a bit about what they mean.<br /><br />To work efficiently towards our goals, it is a requirement that we have reliable estimates of what our goals are, hence the first requirement. Without these estimates, any effort we expend relies on luck to achieve its objectives - we might just as well do nothing. About the second requirement, to maximize the probability to attain one's goals, one has to choose strategically between some set of possible actions. 
The outcomes of those actions, though, are entirely dependent on the content and behaviour of one's environment - if my goal is to boil a pan of water, then attempting to light a fire under it is not a good strategy, if the pan of water happens to be currently at the bottom of a swimming pool. So we need some kind of reasonable model of reality - an estimate of the stuff that populates it, and a somewhat accurate account of the mechanisms by which that stuff interacts.<br /><br />Now it's time to qualify what I mean by measure, when I say for example, "measure the universe." A measurement consists of two steps: step one is collecting some set of empirical data, and step two consists of some procedure to draw inferences from that data, usually by combination with previous inferences from previous data. That's it. Notice that there is no statement in there about the quality of the data, or the degree of uncertainty in the resulting inference. Note, though, that as long as a good procedure of inference (a scientific procedure) has been used, we can always construct a model of reality according to which our uncertainty is reduced by the new data. The new data always tells us something new about the world.<br /><br />Thus, I can measure the universe, simply by opening my eyes. If this is my first time to measure the universe, then it's quite a good start! Every time I open my eyes, I can make new inferences, with an expected increase in confidence about the contents of my environment and the mechanisms by which those contents transform themselves.<br /><br />Similarly, I can interrogate my moral goals, simply by asking myself, 'what do you actually want?' It is perhaps the crudest experiment we can imagine, but we can also easily imagine improved experimental designs. This is the principal activity of the scientist. 
There is no sense in which we can invalidate a set of raw data, so what we do instead is try to think of ways in which our inference procedure might have failed to capture what is really going on. And once we have thought of some possible failure modes, we can add controls to our experiments. If the machine says "24," then the machine says "24," and it only remains to discover how well the machine is calibrated - what is the correspondence between the machine saying "24," and the thing I actually want to draw inferences about?<br /><br />So a better protocol for measuring a person's goals might control for the fact that a person may be mistaken about what their goals are. Luckily, psychologists have already developed such protocols, capable of investigating a subject's state of mind, even when she is probably not very well aware of it herself. We can look at other behaviours that according to separate evidence are more intimately connected to the aspects of interest of a person's state of mind. <a href="http://en.wikipedia.org/wiki/Pupillometry">Pupil dilation</a>, for example, typically happens without the person's awareness, and is extremely hard to fake. With careful observations of the pupil, one can determine, for example, that a subject was very interested in a particular stimulus, without them having had the faintest idea of it. <br /><br />The range and sophistication of the protocols and controls we might apply to the problem of measuring value are open ended. 
Dozens of powerful methodologies exist already for the investigation of mental states (all of which can be cross-checked against one another), from the carefully worded questionnaire, right up to the heroic machines of neuroscience, such as the famed fMRI scanner, which, while often <a href="http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf">problematic</a> to interpret (or see many articles by <a href="http://discovermagazine.com/tags?tag=fMRI#.UkRpe4akoU6">Neuroskeptic</a>), still provides an immense richness of data. Thus, we should not let anybody tell us that at the present time, moral value can not be measured - the only challenge for the future is to reduce the error bars.<br /><br />One might complain that the fMRI experiment only measures neural activity, whereas what we want to know is the subject's mental state, but in this regard, how is fMRI different from measuring pupil dilation? It is the same calibration problem. This problem is solved in essentially the same way in all science: by trying to think of ways that our inference procedure might be too naive, and doing experiments to test them. To say that there is no way to manage the calibration problem is to suppose that all technological advance to date is the result of pure good fortune.<br /><br />In practice, we very often already act as if we know that knowledge of and progress towards our moral objectives are both attainable through rational means. Regardless what we consciously profess, we do this because experience informs us that it works. 
For instance, I believe, with a high level of confidence, that my future satisfaction will be compromised if money I have now leaves my possession without me getting something of value in return, and I therefore make conscious efforts to ensure that I do not lose track of my wallet [compulsively checks pocket again ...].<br /><br />In run-of-the-mill, day-to-day activity, our unconscious, not-too-rigorous adherence to and application of Fact (1), from above, serves us very well. We know this, because whatever departures there exist between what really happens and what we would be inclined to describe with the phrase, 'serves us very well,' are small enough for us not to find them particularly obvious. This, of course, is no accident. If such departures were very obvious, then we would tend to modify our behaviour, accordingly - this is actually exactly what has happened, and we call the process 'growing up,' though the phrase can apply equally well to the history of a given individual as to the application of selective pressures on gene populations, over time scales of thousands of millennia.<br /><br />In a similar way, in run-of-the-mill, day-to-day activity, it is enough for me to know that there is some force of gravity tending to make things go down, but if I want to establish a communications satellite in orbit, I had better know about the inverse-square law that describes that force, and a few other complicated things besides. Thus, as we strive to answer ever more exacting moral questions, and as we place ever more challenging demands on our moral technology, it is obvious that we must be ever more rigorous and deliberate in the development of our moral science. 
This can only happen effectively, if the people who are to work on these grand problems can acknowledge the truth and practical importance of Fact (1).<br /><br />Our future flourishing, therefore, must be expected to be greatly enhanced by widespread taking on board of the principles of moral science - Fact (1) and its corollaries - through ever improving estimated answers to such questions as (1) what values are primary, as opposed to secondary? (2) what secondary values do we hold in the mistaken belief that they support our primary goals? (3) what common moral values are held by almost all people? (4) how much does the perception of value differ from person to person? and (5) at what point do software engineers need to worry about possible suffering experienced by their algorithms? That is for the future (though it can start today). In the remaining two sections, in Part 2, I'll give good reasons to suppose that a broad acceptance by society of the validity of moral science should have immediate, important, and valuable consequences for almost everybody.<br /><br />I'm not claiming that a precise determination of core human values is an easy measurement to take - it is fraught with difficulty - but it will never be possible until it is accepted as something we can aspire to. Understanding Fact (1) and its inevitable truth is a crucial first step, and to begin taking such steps is practically guaranteed to lead to some kind of improved understanding. Not only that, but merely recognizing that this moral science is in principle possible has substantial immediate practical consequences, as I will argue, next.<br /><br />Consider also this: there are, no doubt, some truths about the universe that are forever obscured from science (such as what Euclid's grandfather had for breakfast 3 days before his 10th birthday), but this can only be because these things make no significant difference to anything today. 
Conversely, if a thing makes a big difference, then by definition, we can detect it and measure it relatively easily. This must apply equally well to the things that affect the outcomes of our moral decisions. The prospects for an applied science of human ethics, producing practical technological benefits, are not so bleak.<br /><br /><br /></div><hr /><br /><div style="text-align: justify;">Find Part 2 <a href="http://maximum-entropy-blog.blogspot.com/2014/02/practical-morality-part-2.html">here</a>.<br /><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-49568146966030202002014-01-16T09:35:00.001-06:002014-01-16T09:35:18.789-06:00The Dennett Tennis Test<div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">This is a simple little tactic to consider employing next time you find yourself in conversation with somebody who doubts the efficacy of scientific method, particularly anybody who wants to propose any kind of alternative.</span><span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><div style="text-align: justify;"><span style="font-family: 'Times New Roman', serif;"></span></div><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; 
font-size: 11.5pt;">You know the sort, I’m sure. The kind of person who says “sure, scientific data argues that acupuncture is total and utter garbage, but maybe acupuncture is one of those things science isn’t equipped to investigate.” Or the kind of person who says “absolutely, you’re right, I have no evidence that my god exists, but frankly I’m offended at your crass insistence that everything has to come down to evidence.” Or “how ridiculous to suggest that science has anything to say about the supernatural.”</span></div><a name='more'></a><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">“Oh, you’re so reductionistic,” another might say, shaking their head with earnest sympathy, “there is so much more to mother nature than your sterile lab experiments.” (Yes there is, evidently, but what exactly is your point?) </span><span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><span style="font-family: 'Times New Roman', serif; font-size: 12pt;"><o:p></o:p></span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">Or nodding with diplomatic wisdom, one will say, “yes, yes, I see your point, but don’t you think your scientific paradigm is just a social construct, one of many equally valid points of view?”</span><span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><span style="font-family: 'Times New Roman', serif; font-size: 12pt;"><o:p></o:p></span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; 
margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">To all such people, I propose the following reply, paraphrasing philosopher Daniel Dennett:</span><br /><blockquote class="tr_bq"><i><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">You know what, your argument is completely convincing, except for one detail:</span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"> everything you say proves unambiguously that you are a ham sandwich, wrapped in tin foil.</span></i></blockquote></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">This is what I'm calling the Dennett Tennis Test (DTT). I’ll leave it to you to decide if the phrase is ugly or poetic. Faced with DTT, your conversation partner has 2 options:</span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"></div><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">(1) They can agree with you, at which point it is clearly time to end the conversation - they have declared themself to be mad.</span><br /><div><span style="font-family: Arial, sans-serif; font-size: 11.5pt; text-align: justify; text-indent: 0in;"><br /></span></div><div><span style="font-family: Arial, sans-serif; font-size: 11.5pt; text-align: justify; text-indent: 0in;">(2) They can protest that your conclusion is unsupportable, thus proving that they are dishonest - they do not really believe in the efficacy of their own argument.</span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><div 
style="text-align: justify;"><br /></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">So how does this work, and where the hell does the weird name come from? </span><span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><span style="font-family: 'Times New Roman', serif; font-size: 12pt;"><o:p></o:p></span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">The thing about tennis is that, quite like the two options listed above, there are 2 ways you can play the game: with or without the net. To make the game fair, though, if one player plays with a net, then so does the other.</span><span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><span style="font-family: 'Times New Roman', serif; font-size: 12pt;"><o:p></o:p></span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">I picked the phrase ‘Dennett Tennis Test’ in honour of an argument made by Dan Dennett, (in his excellent book “Darwin’s dangerous idea”) which he built around a remark that he attributes to Ronald de Sousa, likening philosophical theology to intellectual tennis without the net. Just as the tennis net filters out bad serves and bad returns, so in a reasonable discussion does rationality filter out the crummy arguments. 
The point of DTT is to say “oh, I didn’t know you wanted to play without the net, well, never mind, I’m a sportsman, I’ll play the way you want.” If your opponent agrees to this arrangement, and its most explicitly displayed outrageous consequences, then they signal that they have no commitment to approaching the truth. If they protest, then you have the right to ask why they expect the rules to apply asymmetrically. Either logic is abandoned, allowing their arguments and yours to pass unfiltered, or reason steps in, exerting the same selection pressure on everybody’s statements. When your opponent protests that your conclusion is unjustifiable, they</span><span style="font-family: Arial, sans-serif; font-size: 15px;"> establish exactly the standard of evidence that is sufficient for their own position to instantly crumble. </span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><div style="text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">All the arguments I have taken as examples at the top (these are not straw men of my own concoction, they occur regularly, even within academia) take the form of asserting that there is a way of knowing that bypasses the need for evidence and its logical evaluation. The alternative-medicine advocate who sees the scientific research, but still clings to the belief that the hocus-pocus treatment works is claiming access to knowledge that science can’t deliver. They’re not just saying the science was done badly, but that science is the wrong tool entirely. For example, <a href="http://homeoinst.org/sites/default/files/newsletters/HRI_Newsletter13_Summer2011.pdf">Dr. 
Peter Fisher</a>, homeopath to Queen Elizabeth II and prominent homeopathy researcher, (here reaching a stunning level of perversity): "'Inherent implausibility' is a poor guide to future understanding."</span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><span style="font-family: 'Times New Roman', serif; font-size: 12pt;"><o:p></o:p></span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">The religious enthusiast, clinging to the notion of faith as an alternative to reason, decides to believe, as if that mere decision were enough to shape the structure of reality - as if wanting to have faith is enough to make it true. Many religious believers actively boast of their non-reliance on evidence, claiming it virtuous to place their faith in … well in faith itself actually. But by this "logic", I can just as legitimately put my faith in absolutely any bloody thing I please. In reality, the blatant circularity of this kind of epistemology cannot convince any honest thinker. It is really just an obfuscation, where somebody really wants a thing to be true (or wants others to believe it), but knows deep down that the evidence they have is insufficient. With a complex enough wording, repeated sufficiently often, the believer hides the fact that their belief really is based on evidence - mainly testimony from trusted people (have you noticed how different belief systems exhibit clear spatio-temporal correlation? Is it coincidence that most religious people follow the same religion as their parents?). 
Such evidence that people have for their religious faith, however, crumbles when subjected to the tiniest scrutiny, (for example, exactly similar evidence provides exactly as reliable support for a host of other contradictory hypotheses) and they invent this capability to know without evidence, which they call faith, in a desperate attempt to avoid the obvious, uncomfortable conclusion. Logic is suspended, without any justification whatsoever - ham sandwiches, all round.</span><span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-fareast-font-family: "Times New Roman";"><o:p></o:p></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><div style="text-align: justify;"><span style="font-family: 'Times New Roman', serif;"><br /></span></div><div style="text-align: justify;"><span style="font-family: Arial, sans-serif;"><span style="font-size: 11.5pt;">In 1996 physicist Alan Sokal wrote a preposterous spoof paper</span><sup style="font-size: 11.5pt;">1</sup><span style="font-size: 11.5pt;">, which he submitted to a ‘serious’ philosophical journal, ‘Social Text’. The paper was crammed with utter nonsense statements (and copious flattering references to the works of the journal’s editors), and they loved it. They published it, never once suspecting that it was a load of deliberate trash. The subject of the paper? That science is a social construct, that all belief systems are equally valid, and that quantum theory proves it. 
The academic (nominal) philosophers behind the movement this journal epitomizes were committed to the view that if I believe that I can step out of a 10th floor window and float away to Jupiter to enjoy cups of Jovian tea with the locals, before returning home to commune holistically with all the insects on the planet, obtaining crucial knowledge about the birth of time from them, then that belief is as valid as the belief that the world is </span><a href="http://maximum-entropy-blog.blogspot.com/2012/10/parameter-estimation-and-relativity-of.html" style="font-size: 11.5pt;">approximately spherical</a><span style="font-size: 11.5pt;">. This is a philosophy that explicitly refuses to rank the believability of propositions based on the observed behaviour of reality - a form of radical </span><span style="font-size: 15px;">skepticism</span><span style="font-size: 11.5pt;"> in which logic is eagerly shunned. How have these people managed to create an academic discipline based on the deliberate application of no intellectual discipline? How would they respond to DTT? Sokal has effectively tried it, and they went for option (1).</span></span></div><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in;"></div><span style="font-family: 'Times New Roman', serif; font-size: 12pt;"><o:p></o:p></span><br /><div class="MsoNormal" style="margin-bottom: .0001pt; margin-bottom: 0in; text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">In case you think my characterization of the views of the 'strong sociologists' and postmodern relativists - the sorts of academics that Sokal was ridiculing - is far too absurd to be accurate, Sokal's book, written with Jean Bricmont, "Fashionable nonsense," (also published as "Intellectual Impostures") is crammed with quotations from people at the forefront of this movement that again and again prove exactly this. 
Here are two brief examples: </span><br /><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">Barnes and Bloor</span><sup style="font-family: Arial, sans-serif;">2</sup><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">: </span><br /><blockquote class="tr_bq"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">"It is those who ... grant certain forms of knowledge privileged status, who pose the real threat to a scientific understanding of knowledge and cognition.</span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">"</span></blockquote><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">Paul Feyerabend</span><span style="font-family: Arial, sans-serif; font-size: 13px;"><sup>3</sup></span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">:</span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"> </span><br /><blockquote class="tr_bq"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">"All methodologies have their limitations and the only 'rule' that survives is 'anything goes'.</span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">"</span></blockquote><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span><br /><span style="font-family: Arial, sans-serif; font-size: 11.5pt;"><br /></span></div></div><div style="text-align: justify;"><span style="font-family: Arial, sans-serif; font-size: 11.5pt;">The proposition that the person you are arguing with is a ham sandwich may be good for raising a cheap laugh, but one might feel that it's far too ridiculous to serve as a serious parody of anybody's actual views. 
Under close examination, though, it doesn't take long to see that the arguments that I propose to target with the Dennett Tennis Test imply a system of belief formation according to which this proposition is every bit as valid as those being argued for - gods, the healing power of crystals, truth as a social construct, or whatever it happens to be. </span></div><br /><br /><br /><hr /><br /><span style="color: blue;"><b>References</b></span><br /><span style="font-family: inherit;"><br /></span><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"><span style="font-family: inherit;"> [1] </span></td> <td><span style="font-family: inherit;">Alan Sokal, '<i>Transgressing the boundaries: toward a transformative hermeneutics of quantum gravity</i>', Social Text #46/47, pages 217 to 252, 1996</span><br /><span style="font-family: inherit;"><br /></span></td> </tr><tr> <td valign="top"><span style="font-family: inherit;"> [2] </span></td> <td><span style="font-family: inherit;">Barry Barnes and David Bloor, '</span><i style="font-family: inherit;">Relativism, rationalism, and the sociology of knowledge</i><span style="font-family: inherit;">', in '</span><i style="font-family: inherit;">Rationality and relativism'</i><span style="font-family: inherit;">, edited by Hollis and Lukes.</span><br /><br /></td> </tr><tr> <td valign="top"><span style="font-family: inherit;"> [3] </span></td> <td><span style="font-family: inherit;">'</span><i style="font-family: inherit;">Against method,</i><span style="font-family: inherit;">' Paul Feyerabend, 1975</span></td> </tr></tbody></table><span style="font-family: inherit;"><br /></span><br /><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;"><br /></span><span style="font-family: inherit;"><br /></span>Tom 
Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-5113404441108743652013-12-22T01:45:00.000-06:002013-12-23T16:00:23.084-06:00Confounded Koalas<style> td.upper_line { border-top:solid 1px black; } table.fraction { text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } </style> <style> table.num_eqn { width:99%; text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } td.eqn_number { text-align:right; width:2em; } </style> <br /><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Koalas are not as exclusive as kangaroos. At least, when it comes to their drinking habits. As I explained <a href="http://maximum-entropy-blog.blogspot.com/2013/10/entropy-of-kangaroos.html">before</a>, kangaroos drink beer or whisky, but not both. Koalas like to mix things up a bit more, when it comes to their choice of drink, but how much exactly? What is the probability, for example, that any given koala who drinks beer on any given night will also drink whisky on the same night? These are the sorts of urgent questions that science must seek to answer with the utmost speed and accuracy.</div><div style="text-align: justify;"></div><a name='more'></a><div style="text-align: justify;"><br />It's an empirical question, so we'll need some empirical evidence. Luckily, I've been out in the field already. I went deep into the outback one day, and asked lots of koalas what they had been drinking the night before. To save time, though, I didn't bother questioning animals that I believed hadn't been drinking anything on the previous evening. To do this, I polled only hung-over koalas. You see, when a koala gets a hangover, its nose turns bright red, which can be seen from quite a distance. 
This clever strategy saved me a lot of time walking through the Australian scrub, trying to catch up with subjects who couldn't add anything to my required data set. Here are the numbers I obtained, after interviewing 1222 red-nosed koalas:</div><div style="text-align: justify;"><br /></div><table align="center"> <tbody><tr><td><div style="text-align: right;">Beer but no whisky:</div></td><td> 505</td></tr><tr><td><div style="text-align: right;">Whisky but no beer:</div></td><td> 436</td></tr><tr><td><div style="text-align: right;">Beer and whisky:</div></td><td> 281</td></tr></tbody></table><div style="text-align: justify;"><br /><br />So the total number of whisky drinkers, for example, came to 59% of the 1222 study participants. Of the subset of those 1222 individuals who drank beer on the preceding evening, however, the number that also drank whisky came to only 36%. Thus we conclude that consumption of whisky is anti-correlated with consumption of beer - if I know that an individual has consumed beer, I consider that individual less likely to have drunk whisky than I otherwise would. Letting A represent the proposition that a koala drank whisky and B stand for a beer drinker, we conclude that P(A) > P(A | B). </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Ok, confession time. Brace yourself, this is going to come as quite a shock. These are completely made up data. I've never even been to Australia. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">But here is the really weird thing:<br /><br />The numbers above were actually produced by a randomized model that assumed a complete lack of correlation between beer drinking and whisky consumption. The two were assumed independent, meaning that in fact, P(A) = P(A | B), in contrast with the strong impression given by the generated data. 
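<br /><br />A minimal sketch of a simulation of this kind, in Python, may make the selection effect concrete. The population size (2000) and the two probabilities (0.4 for beer, 0.36 for whisky) are the ones quoted further down in this post; the function names, the random seed, and the rule that a koala is hung over exactly when it drank anything at all are assumptions made for illustration. This is not the code that generated the table above, so the counts it prints will differ somewhat from those numbers.

```python
import random


def simulate_koalas(n=2000, p_beer=0.4, p_whisky=0.36, seed=0):
    """Return n (beer, whisky) pairs, each drink drawn independently."""
    rng = random.Random(seed)
    return [(p_beer > rng.random(), p_whisky > rng.random()) for _ in range(n)]


def whisky_rates(koalas):
    """Estimate P(whisky) and P(whisky | beer) from a list of (beer, whisky) pairs."""
    n_whisky = sum(1 for _, whisky in koalas if whisky)
    n_beer = sum(1 for beer, _ in koalas if beer)
    n_both = sum(1 for beer, whisky in koalas if beer and whisky)
    return n_whisky / len(koalas), n_both / n_beer


population = simulate_koalas()

# Assumed selection rule: a red nose (hangover) means the koala drank something.
hung_over = [koala for koala in population if koala[0] or koala[1]]

p_all, p_given_beer_all = whisky_rates(population)
p_sel, p_given_beer_sel = whisky_rates(hung_over)

print(f"whole population: P(A) = {p_all:.2f}, P(A | B) = {p_given_beer_all:.2f}")
print(f"hung-over only:   P(A) = {p_sel:.2f}, P(A | B) = {p_given_beer_sel:.2f}")
```

Under these assumptions, the whisky rate among beer drinkers is the same in both samples, because every beer drinker is hung over and so survives the selection. The overall whisky rate is higher in the hung-over sample, because the koalas that drank nothing have been removed from the denominator.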
</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The number of koalas simulated is quite large, so this isn't a case of random noise producing a spurious finding. In fact, what we've been the victim of here is a kind of biased sampling, known as Berkson's paradox. Our attempt to investigate the relationship between two variables, A and B, has been confounded in an interesting way by a third variable, C. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">A fairly trivial special case of the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a> states that (where, as always, X+Y denotes 'X or Y') </div><div style="text-align: justify;"><br /></div><div style="text-align: center;">P(A.[A+B]) = P(A | A+B) × P(A+B)</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">and because the conjunction of 2 propositions, XY, is identical to YX, then also<br /><br /><div style="text-align: center;"> <span style="text-align: center;">P(A.[A+B]) = P(A+B | A) × P(A)</span></div><div style="text-align: justify;"><span style="text-align: center;"><br /></span></div><div style="text-align: justify;"><span style="text-align: center;">Now, A+B is a sure thing, if A is already known to be true, so combining these two results gives </span></div><div style="text-align: justify;"><br /></div></div><div style="text-align: justify;"><div style="text-align: center;">P(A) = P(A | A+B) × P(A+B)</div></div><div style="text-align: justify;"> </div><div style="text-align: justify;">Assuming that neither P(A) nor P(B) is 0 or 1, this means that</div><br /><a href="https://www.blogger.com/blogger.g?blogID=715339341803133734" id="eq1"></a><br /><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><span style="text-align: center;">P(A) < P(A | A+B)</span></td> 
</tr></tbody></table></td> <td class="eqn_number">(1) </td> </tr></tbody></table><br /><div style="text-align: justify;"><br />which is something we ought to have expected already - knowledge that at least one of the propositions A and B is true constitutes good evidence that A is true.<br /><br />But a good way to be confident that A+B is true is if events A and B are both separately known to cause another event C, which is known to have occurred. This is what has happened on this occasion: C is the hangover I used to select subjects for the study. So while we were trying to estimate P(A), what we actually measured was more indicative of P(A | A+B), and thus our result of 59% was, from equation (1), an overestimate.<br /><br />What happened when we were estimating P(A | B)?<br /><br />A simple way to get to grips with this is by drawing a <a href="http://en.wikipedia.org/wiki/Truth_table">truth table</a>, to compare the 2 propositions, "A or B" and "B and (A or B)":<br /><br /><table align="center" border="1" cellpadding="5" cellspacing="0"> <tbody><tr> <td><b> A </b></td><td><b> B </b></td><td><b> A+B </b></td><td><b> B.(A+B) </b></td> </tr><tr> <td> 0 </td><td> 0 </td><td><div style="text-align: center;">0</div></td><td><div style="text-align: center;">0</div></td> </tr><tr> <td> 0</td><td> 1</td><td><div style="text-align: center;">1</div></td><td><div style="text-align: center;">1</div></td> </tr><tr> <td> 1</td><td> 0</td><td><div style="text-align: center;">1</div></td><td><div style="text-align: center;">0</div></td> </tr><tr> <td> 1</td><td> 1</td><td><div style="text-align: center;">1</div></td><td><div style="text-align: center;">1</div></td> </tr></tbody></table><br /><br />From the truth table, it's quite clear that "B.(A+B)" has, under all circumstances, the same values as "B" - it is the same proposition. 
Thus in obtaining an estimate for P(A | B.[A+B]), which we inadvertently did, when what we wanted was P(A | B), the addition of the extra information, B, erased any effect of our prior knowledge of A+B that isn't preserved in our knowledge of just B. Therefore, P(A | B.[A+B]) = P(A | B), and our measured proportion of 36% of all beer drinkers who drink whisky did not suffer any distortion due to my chosen selection method.<br /><br />Because my figure for P(A) was overestimated, however, while the result for P(A | B) was not, the effect was a spurious impression that P(A | B) < P(A), implying negative correlation, contrary to the reality of the data-generating process.<br /><br />In fact, the numbers I gave above came from 2000 simulated koalas, randomly assigned as beer drinkers with probability 0.4, and independently assigned as whisky drinkers with probability 0.36. This is perfectly reflected in the observed proportions for beer drinkers (786/2000 = 0.393) and whisky drinkers (717/2000 = 0.359). The result for P(A | B) was also perfectly consistent with independence - the proportion of beer drinkers who indulged in whisky was the same as the proportion for the entire population, 36%. <br /><br />With Berkson's paradox, our attempt to draw inferences about the relationship between two variables, A and B, was confounded by a third, correlated variable, C. Something very similar was going on, when we examined <a href="http://maximum-entropy-blog.blogspot.com/2012/04/booze-sexual-discrimination-and-best.html">Simpson's paradox</a>, but the effect of the confounder was slightly different. 
With Simpson's paradox, two non-independent variables, A and B, are rendered conditionally independent upon receipt of information concerning C (A is "screened off" from B by C, in the language of the graph theorists) - without knowing C, we are lulled into incorrectly thinking that A is a direct cause of B.<br /><br />With Berkson's fallacy, the effect is opposite: knowledge of C (or inadvertently selecting a biased sample such that C was true on an excess number of occasions) made two otherwise uncorrelated variables appear to be dependent upon one another. The effect was such that occurrence of A seemed to suppress the occurrence of B. (Note that even if I hadn't consciously decided to look only for hung-over animals, their bright red noses would have been easier to spot in the undergrowth, leading to their being over-represented in the survey, which would have had a similar effect.) <br /><br />While the third variable, C, is ignored, it can confound our scientific efforts, but once brought to our attention, figuring out what causes what actually becomes easier. Thanks to differences in effects, such as those differences between Simpson's and Berkson's paradoxes just sketched, it can actually help us distinguish between different classes of causal relationships. If C screens off A from B, then certain distributions of cause and effect can be ruled out, while certain other causal relationships are excluded when C introduces dependence between A and B. [This paragraph was modified slightly on 12-23-2013 to remove an error.]<br /><br />Full causal analysis is only possible when we perform controlled interventions (e.g. randomized controlled clinical trials), but if we stretch our intelligence, there is a lot that can still be done when intervention is difficult or impossible to implement - a situation many scientists have to live with. (Cosmology, anyone? geology? archaeology? 
Just a few examples.)<br /><br /><br /><br /><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com1tag:blogger.com,1999:blog-715339341803133734.post-33312367975938391482013-11-16T11:30:00.000-06:002013-11-18T12:14:07.155-06:00The Acid Test of Indifference <div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: justify;"><span style="font-family: inherit;">In recent posts, I've looked at the <a href="http://maximum-entropy-blog.blogspot.com/2013/10/entropy-games.html">interpretation of the Shannon entropy</a>, and the <a href="http://maximum-entropy-blog.blogspot.com/2013/11/monkeys-and-multiplicity.html">justification for the maximum entropy principle</a> in inference under uncertainty. In the latter case, we looked at how mathematical investigation of the entropy function can help with establishing prior probability distributions from first principles. </span></div><div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: justify;"><span style="font-family: inherit;">There are some prior distributions, however, that we know automatically, without having to give the slightest thought to entropy. If the maximum entropy principle is really going to work, the first thing it has got to be able to do is to reproduce those distributions that we can deduce already, using other methods. 
</span></div><div style="text-align: justify;"></div><a name='more'></a><br /><div style="text-align: justify;"><span style="font-family: inherit;">There's one case, in particular, that I'm thinking of, and it's the uniform distribution of Laplace's <a href="http://en.wikipedia.org/wiki/Principle_of_indifference">principle of indifference</a>: with n hypotheses and no information to the contrary, each must rationally be assigned the same probability, 1/n. This principle is pretty much self-evident. If we really need to check that it's correct, we just need to consider the symmetry of the situation: suppose we have a small cube with sides labelled with the numbers 1 to 6 (a die). Without any stronger information, these numbers really are just arbitrary labels - we could, for example, decide instead to denote the side marked with a 1 by the label "6" and <i>vice versa</i>. But nothing physical about the die will have been changed by this change of convention, so no alteration of the probability assignment concerning the outcome of the usual experiment is called for. Thus each of these two outcomes ("1" or "6") must be equally probable, with further similar arguments applying to all pairs of sides.</span></div><div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: justify;"><span style="font-family: inherit;">So if the entropy principle is valid, it must also arrive at this uniform distribution under the circumstances of us being maximally uninformed. Let's check if it works.</span></div><div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">We start with a discrete probability distribution over X, f(x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>), equal to (p<sub>1</sub>, p<sub>2</sub>, …, p<sub>n</sub>). 
The entropy for this distribution is, as usual, given by<o:p></o:p></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-P8K3dfAvvNk/Uoeel2I30pI/AAAAAAAACRA/XyrLt5Cltkc/s1600/eq1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-P8K3dfAvvNk/Uoeel2I30pI/AAAAAAAACRA/XyrLt5Cltkc/s1600/eq1.PNG" /></a></div><br /></div><div style="text-align: justify;"><span style="font-family: inherit;">which, for reasons that’ll soon become clear, I’ll express as </span></div><div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-IkjTEYeNgAc/UoefiHxgyaI/AAAAAAAACRM/WRm9TU-Mgfk/s1600/eq2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-IkjTEYeNgAc/UoefiHxgyaI/AAAAAAAACRM/WRm9TU-Mgfk/s1600/eq2.PNG" /></a></div></div><div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">Suppose that for f(x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>) the probability at x<sub>1</sub>, p<sub>1</sub>, is smaller than p<sub>2</sub>. Imagine another distribution, f’(X), in which p<sub>3</sub> to p<sub>n</sub> are identical to f(X), but p<sub>1</sub> and p<sub>2</sub> have been made more similar, by adding a tiny number, ε, to p<sub>1</sub> and subtracting the same number from p<sub>2</sub> (this latter subtraction is necessary, so as not to violate the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normalization">normalization condition</a>). 
We want to examine the entropy of this distribution, relative to the other:<o:p></o:p></span></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><div style="text-align: center;"><a href="http://1.bp.blogspot.com/--QUnei81P_w/UoefiBVJ9vI/AAAAAAAACRk/4st8-y099Cw/s1600/eq3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/--QUnei81P_w/UoefiBVJ9vI/AAAAAAAACRk/4st8-y099Cw/s1600/eq3.PNG" /></a></div></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">And so the difference in entropy is:<o:p></o:p></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-h_Vv671s9Mk/UoefiD1sGlI/AAAAAAAACRI/r_OL58cLiKs/s1600/eq4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-h_Vv671s9Mk/UoefiD1sGlI/AAAAAAAACRI/r_OL58cLiKs/s1600/eq4.PNG" /></a></div></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">We can consolidate this, using the <a href="http://en.wikipedia.org/wiki/Logarithm#Product.2C_quotient.2C_power_and_root">laws of logs</a>:</span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-JDRtq_RxEqA/Uoefim0dysI/AAAAAAAACRQ/LWPRiDPh5no/s1600/eq5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-JDRtq_RxEqA/Uoefim0dysI/AAAAAAAACRQ/LWPRiDPh5no/s1600/eq5.PNG" /></a></div></div><div class="MsoNormal" style="text-align: 
justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">In the 17<sup>th</sup> century, Nicholas Mercator was a multi-talented Danish mathematician. His many accomplishments include the design and construction of chronometers for kings and fountains for palaces, as well as making theoretical contributions to the field of music. He seems to be the first person to have made use of the natural base for logarithms, and in 1668 he published a convenient series expansion of the expression log<sub>e</sub>(1+x):</span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-5O6MnSRCXLY/UoefiqYLSQI/AAAAAAAACRc/XMNx7UpuWsQ/s1600/eq6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-5O6MnSRCXLY/UoefiqYLSQI/AAAAAAAACRc/XMNx7UpuWsQ/s1600/eq6.PNG" /></a></div></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">We now know this as the <a href="http://en.wikipedia.org/wiki/Mercator_series">Mercator series</a>. (Remarkably, this work predates by 47 years the introduction of the <a href="http://en.wikipedia.org/wiki/Taylor_series">Taylor series</a>, of which the above expansion is an example.) This looks promising to us in the current investigation, as for very small x, only the first term will remain significant, so we’d like to express the logarithms in the above entropy difference in this form. 
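<br /><br />As a quick numerical sanity check of that claim, here is a minimal sketch (the function name mercator_partial_sum is my own label for illustration, not something from a library) comparing the one-term approximation with Python's built-in math.log1p, which computes log<sub>e</sub>(1+x) directly:<br /><br />

```python
import math

def mercator_partial_sum(x, terms):
    """Sum the first `terms` terms of x - x^2/2 + x^3/3 - ... (for -1 < x <= 1)."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, terms + 1))

for x in (0.5, 0.1, 0.001):
    exact = math.log1p(x)                # log_e(1 + x)
    first = mercator_partial_sum(x, 1)   # the first term alone, i.e. just x
    rel_err = abs(first - exact) / exact
    print(f"x = {x}: log(1+x) = {exact:.6f}, first term = {first:.6f}, relative error = {rel_err:.4f}")
```

The relative error of the first term alone shrinks roughly like x/2, which is why dropping the higher-order terms is harmless once x is tiny.<br /><br />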
<o:p></o:p></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">(We might wonder why this Danish dude had a Latin name, but apparently that was often done in those days, to enhance one’s academic standing and whatnot – his original name was Kauffman, by the way, which means the same thing: shopkeeper (I don't understand why he felt it wasn't intellectual enough). Actually, since moving to the US last year, I see a similar thing going on at universities here, where it is apparently often considered a highly coveted marker of status to be able to prove that you know at least three letters of the Greek alphabet.)<o:p></o:p></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><o:p></o:p></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">Taking one of those terms in ΔH, factorizing and again applying the laws of logs:</span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-LfmLWjLlxU0/UoehUn1sM7I/AAAAAAAACR0/ymgQNhm8-d8/s1600/eq7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-LfmLWjLlxU0/UoehUn1sM7I/AAAAAAAACR0/ymgQNhm8-d8/s1600/eq7.PNG" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><o:p></o:p></span></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;">with a similar 
result for log(p<sub>2</sub> - ε), so </span><br /><br /><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-ZokhF6LriKg/UoehUppcRDI/AAAAAAAACR4/Nz7is0KsI4Y/s1600/eq8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-ZokhF6LriKg/UoehUppcRDI/AAAAAAAACR4/Nz7is0KsI4Y/s1600/eq8.PNG" /></a></div><br /><div class="MsoNormal"><span style="font-family: inherit;">Supposing we have made ε arbitrarily small (without actually equaling zero), then when we implement the Mercator expansion, all terms with ε squared, or raised to higher powers, will be negligibly small, so<o:p></o:p></span></div><div class="MsoNormal"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal"><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-0cwOIWgoBCk/UoehUhoX2sI/AAAAAAAACR8/vM9DN1nBkBU/s1600/eq9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-0cwOIWgoBCk/UoehUhoX2sI/AAAAAAAACR8/vM9DN1nBkBU/s1600/eq9.PNG" /></a></div></div><div class="MsoNormal"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal"><span style="font-family: inherit;">For p<sub>2</sub> > p<sub>1</sub> and ε > 0, this expression is necessarily positive, meaning that H’ > H, i.e. 
the distribution formed by taking 2 unequal probabilities in f, and adjusting them both to make them more nearly equal, results in a new distribution, f’, with higher entropy.<o:p></o:p></span></div></div><div class="MsoNormal" style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div class="MsoNormal" style="text-align: justify;"><div class="MsoNormal"><span style="font-family: inherit;">This procedure of adjusting unequal probabilities by some tiny ε can, of course, continue for as long as it takes to end up with all probabilities in the distribution equal, and the result will necessarily have higher entropy than any of the distributions that preceded it. This proves that the uniform distribution is globally the one with maximum entropy.</span> </div><div><br /></div><div><a href="http://en.wikipedia.org/wiki/Lagrange_multiplier#Example_3:_Entropy">Another approach</a> we could have taken to check that the entropy principle produces the appropriate uninformative prior elegantly uses the method of Lagrange multipliers, which is a powerful technique employed frequently at the business end of entropy applications. (Yes, that's the same Lagrange, reportedly <a href="http://maximum-entropy-blog.blogspot.com/2012/08/bayes-theorem-all-you-need-to-know.html">teased by Laplace</a> in front of Napoleon - but we mustn't think that Lagrange was a fool, he was one of the all-time greats of mathematics and physics, and it probably says more about Laplace that he so easily got the better of Lagrange on that occasion.)</div><br />The favorite joke that non-Bayesians use to taunt us is the claim that we pull our priors out of our posteriors. They think our prior distributions are arbitrary and unjustified, and that as a consequence our entire epistemology crumbles. But this only showcases their own ignorance. Never mind the obvious fact that if it were true, no kind of learning whatsoever would ever be possible, for us and them alike. 
In reality, to derive our priors we make use of simple and obvious symmetry considerations (such as indifference), which not only work just fine, but provide results that stand verified when we apply far more rigorous formalisms, such as maximum entropy and group theory (which I haven't discussed yet).</div><br /><br /><br /><br />Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com2tag:blogger.com,1999:blog-715339341803133734.post-89105984100773544482013-11-01T22:38:00.000-05:002013-11-01T22:38:30.629-05:00Monkeys and Multiplicity<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Monkeys love to make a mess. Monkeys like to throw stones. Give a monkey a bucket of small pebbles, and before too long, those pebbles will be scattered indiscriminately in all directions. These are true facts about monkeys, facts we can exploit for the construction of a random number generator.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Set up a room full of empty buckets. Add one bucket full of pebbles and one mischievous monkey. Once the pebbles have been scattered, the number of little stones in each bucket is a random variable. We're going to use this random number generator for an unusual purpose, though. In fact, we could call it a 'calculus of probability,' because we're going to use this exotic apparatus for figuring out probability distributions from first principles<sup>1</sup>.<br /><a name='more'></a><br />Let's say we have a hypothesis space consisting of n mutually exclusive and exhaustive propositions. We need to find a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#prior">prior probability distribution</a> over this set of hypotheses, so we can begin the process of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bayes-theorem">Bayesian updating</a>, using observed data. 
But nobody can tell us what the prior distribution ought to be, all we have is a limited set of structural constraints - we know the whole distribution <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normalization">must sum to 1</a>, and perhaps we can deduce other properties, such as an appropriate mean or standard deviation. So we design an unusual experiment to help us out. We need to lay out n empty buckets - one bucket for each hypothesis in the entire set of possibilities (better nail them to the floor).<br /><br />Each little pebble that we give to the monkey represents a small unit of probability, so the arrangement of stones in buckets at the end represents a candidate probability distribution. We've been very clever, beforehand, and we've made sure that the resting place of each pebble is uniformly randomly distributed over the set of buckets - in keeping with everything we've heard about the chaotic nature of monkeys, this one exhibits no systematic bias - so we decide that this candidate probability distribution is a fair sample from among all the possibilities.<br /><br />We examine the candidate distribution, and check that it doesn't violate any of our known constraints. If it does violate any constraints, then it's no use to us, and we have to forget about it. If not, we record the distribution, then go tidy everything up, so we can repeat the whole process again. And again. And again and again and again, and again. After enough cycles, there should be a clear winner, one distribution that occurs more often than any other. This will be the probability distribution that best approximates our state of knowledge. </div><div style="text-align: justify;"><br />To convince ourselves that some distributions really will occur significantly more frequently than others, we just need to do a little combinatorics. 
Suppose the total supply of ammo given to the little beast comes to N pebbles.<br /><br />The number of possible ways to end up with exactly N<sub>1</sub> pebbles in the first bucket is given by the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#combination">binomial coefficient</a>:<br /><br /></div><div style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-joWJ7Fv8c4c/Um6zqZ_98EI/AAAAAAAACOI/TqzNcpxjyfs/s1600/eq1.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-joWJ7Fv8c4c/Um6zqZ_98EI/AAAAAAAACOI/TqzNcpxjyfs/s1600/eq1.JPG" /></a></div><br />and with those N<sub>1</sub> pebbles accounted for, the number of ways to get exactly N<sub>2 </sub>in the second bucket is<br /><br /><div style="text-align: center;"><a href="http://1.bp.blogspot.com/-RE3W_3__k_A/Um6zpG6lgQI/AAAAAAAACNw/7QtHZeeoWvM/s1600/eq2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-RE3W_3__k_A/Um6zpG6lgQI/AAAAAAAACNw/7QtHZeeoWvM/s1600/eq2.JPG" /></a></div><div style="text-align: center;"><br /></div><div style="text-align: justify;">and so on.</div><br />So the total number of opportunities to get some particular distribution of stones (probability mass) over all n buckets (hypotheses) is given by the product<br /><br /><div style="text-align: center;"><a href="http://1.bp.blogspot.com/-81Wbpvq-aTM/Um6zpmyJySI/AAAAAAAACN0/GXk8VLwmqdo/s1600/eq3.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-81Wbpvq-aTM/Um6zpmyJySI/AAAAAAAACN0/GXk8VLwmqdo/s1600/eq3.JPG" /></a></div><br />For each i<sup>th</sup> term in this product (up to n-1), the term in brackets in the denominator is the same as the numerator of the (i+1)<sup>th</sup> term, so these all cancel out, and for the n<sup>th</sup> term, the corresponding part of the 
denominator comes to 0!, which is 1. So any particular outcome of this experiment can, in a long run of similar trials, be expected to occur with frequency proportional to<br /><br /><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-MLgfCrn_emI/Um6zp_bFKKI/AAAAAAAACN8/Op4uvYwPKHY/s1600/multiplicity.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-MLgfCrn_emI/Um6zp_bFKKI/AAAAAAAACN8/Op4uvYwPKHY/s1600/multiplicity.JPG" /></a></div><br />which we call the multiplicity of the particular distribution defined by the numbers N<sub>1</sub>, N<sub>2</sub>, ...., N<sub>n</sub>.<br /><br />The number of buckets, n, only needs to be moderately large for it to be quite difficult to visualize how W varies as a function of the N<sub>i</sub>. We can get around this by looking into the case of only 2 buckets. Below, I've plotted the multiplicity as a function of the fraction of stones in the first of two available buckets, for three different total numbers of stones: 50, 500, and 1000 (blue and green curves enhanced to come up to the same height as the red curve).<br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-LCAJTRwVsv0/UnPlJLVXaAI/AAAAAAAACQw/YpDHJEUCAmA/s1600/multiplicity_vs_N.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-LCAJTRwVsv0/UnPlJLVXaAI/AAAAAAAACQw/YpDHJEUCAmA/s1600/multiplicity_vs_N.png" height="476" width="640" /></a></div><br />Each case is characterized by a fairly sharp peak, centered at 50%. 
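<br /><br />Curves like these are easy to reproduce. The following is only a rough sketch, assuming the multiplicity W = N! / (N<sub>1</sub>! N<sub>2</sub>! ... N<sub>n</sub>!) derived above; the function names are my own, and I work with the logarithm of W (via math.lgamma) so that the huge factorials never have to be evaluated directly:<br /><br />

```python
import math

def log_multiplicity(counts):
    """Natural log of W = N! / (N_1! N_2! ... N_n!), using lgamma(k + 1) = log(k!)."""
    n_total = sum(counts)
    return math.lgamma(n_total + 1) - sum(math.lgamma(c + 1) for c in counts)

def relative_width(n_total):
    """Two buckets: fraction of possible splits whose W is at least half the peak value."""
    logs = [log_multiplicity((k, n_total - k)) for k in range(n_total + 1)]
    peak = max(logs)
    above_half = sum(1 for value in logs if value >= peak - math.log(2))
    return above_half / (n_total + 1)

for n_total in (50, 500, 1000):
    logs = [log_multiplicity((k, n_total - k)) for k in range(n_total + 1)]
    best_k = max(range(n_total + 1), key=lambda k: logs[k])
    print(n_total, best_k / n_total, relative_width(n_total))
```

For each total, the most probable split comes out at one half, and the fraction of the axis lying above half the peak height gets smaller as the number of stones grows.<br /><br />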
As the number of stones available increases, so does the sharpness of the peak - it becomes less and less plausible for the original supply of pebbles to end up very unequally divided between the available resting places.<br /><br />Calculating large factorials is quite tricky work, though (just look at the numbers: 2.7 × 10<sup>299</sup>, for N = 1000). To find which distribution is expected to occur most frequently, we need to locate the arrangement for which W is maximized, but any monotonically increasing function of W will be maximized by the same distribution, so instead let's maximize another function, which I'll arbitrarily call H:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-X0LYyico1NQ/Um608QjCN_I/AAAAAAAACOc/f1qrkbYamWo/s1600/eq4.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-X0LYyico1NQ/Um608QjCN_I/AAAAAAAACOc/f1qrkbYamWo/s1600/eq4.JPG" /></a></div></div><div style="text-align: justify;">But since each N<sub>i</sub> is N×p<sub>i</sub> (where p<sub>i</sub> is the probability for any given stone to land in the i<sup>th</sup> bucket, according to this distribution), and repeatedly using the <a href="http://en.wikipedia.org/wiki/Logarithm#Product.2C_quotient.2C_power_and_root">quotient rule for logs</a>,<br /><br /><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-YUg_Zy41kzA/Um608VQ6smI/AAAAAAAACOg/ZHeMfRtgCwE/s1600/eq5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-YUg_Zy41kzA/Um608VQ6smI/AAAAAAAACOg/ZHeMfRtgCwE/s1600/eq5.PNG" /></a></div><br />Now, because probabilities vary over a continuous range between 0 and 1, and because we don't want to impose overly artificial constraints on the outcome of the experiment, we will have given the monkey a really, really large number of pebbles. 
This means that we can simplify the above expression for H using the <a href="http://en.wikipedia.org/wiki/Stirling's_approximation">Stirling approximation</a>, which states generally that when k is very large (where we're using the natural base),<br /><br /><div style="text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-kWX9qtpAE3k/Um7kBqqYRNI/AAAAAAAACPU/P348EKTiBhA/s1600/stirling.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-kWX9qtpAE3k/Um7kBqqYRNI/AAAAAAAACPU/P348EKTiBhA/s1600/stirling.PNG" /></a></div></div><br />with the approximation getting better as k gets larger. In our particular case, this yields<br /><br /><div style="text-align: center;"><a href="http://4.bp.blogspot.com/-1j5daGyxg0M/Um7kAeZw_JI/AAAAAAAACPg/QautaC7dH9o/s1600/eq6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-1j5daGyxg0M/Um7kAeZw_JI/AAAAAAAACPg/QautaC7dH9o/s1600/eq6.PNG" /></a></div><br />and suddenly, we can see why that 1/N was inexplicably sitting at the front of our expression for H:<br /><br /><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-epOSo-yHBIM/Um7kA6DaLqI/AAAAAAAACO8/EXXL52aI9oU/s1600/eq7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-epOSo-yHBIM/Um7kA6DaLqI/AAAAAAAACO8/EXXL52aI9oU/s1600/eq7.PNG" /></a></div><br />The Σp<sub>i</sub> term at the end <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#normalization">is 1</a>, so<br /><br /><div style="text-align: center;"><a href="http://4.bp.blogspot.com/-05N_rOYhvdU/Um7kAxHMozI/AAAAAAAACPQ/O_uPX5i7BJI/s1600/eq8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-05N_rOYhvdU/Um7kAxHMozI/AAAAAAAACPQ/O_uPX5i7BJI/s1600/eq8.PNG" /></a></div><br />at which point the <a 
href="http://en.wikipedia.org/wiki/Logarithm#Product.2C_quotient.2C_power_and_root">product rule for logs</a> produces<br /><br /><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-tTUzbITWQ3I/Um7kBOYxC6I/AAAAAAAACPM/forZ1A7SIA0/s1600/eq9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-tTUzbITWQ3I/Um7kBOYxC6I/AAAAAAAACPM/forZ1A7SIA0/s1600/eq9.PNG" /></a></div><br />finally, yielding<br /><br /><div style="text-align: center;"><a href="http://4.bp.blogspot.com/-7Ih5j7NMavM/Um7kAS3X6PI/AAAAAAAACO0/NmzgnO9Jl6U/s1600/eq10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-7Ih5j7NMavM/Um7kAS3X6PI/AAAAAAAACO0/NmzgnO9Jl6U/s1600/eq10.PNG" /></a></div><br />which, oh my goodness gracious, is exactly the same as the quantity we had <a href="http://maximum-entropy-blog.blogspot.com/2013/10/entropy-games.html">previously</a> labelled H: the Shannon entropy.<br /><br />So, it turns out that the probability distribution most likely to come out favorite in the repeated monkey experiment, the one corresponding to the arrangement of stones with the highest multiplicity, and therefore the one best expressing our state of ignorance, is the one with the highest entropy. And this is the maximum entropy principle. It means that any distribution with lower entropy than is permitted by the constraints present in the problem has somehow incorporated more information than is actually available to us, and so adoption of any such distribution constitutes an irrational inference.<br /><br />Note, though, that if your gut feeling is that the maximum entropy distribution you've calculated is not specific enough, this is your gut expressing the opinion that you actually do have some prior information that you've neglected to include. 
It's probably worth exploring the possibility that your gut is right.<br /><br />It might seem, from this multiplicity argument, that for a given number of points, n, in the hypothesis space, there is only one possible maximum entropy distribution, but recall that sometimes (as in the <a href="http://maximum-entropy-blog.blogspot.com/2013/10/entropy-of-kangaroos.html">kangaroo problem</a>) we have information from which we can formulate additional constraints. Sometimes the unconstrained maximum-multiplicity distribution will be ruled out by these constraints, and we have to select a distribution from those that aren't ruled out. It's in such cases that the method actually gets interesting and useful.<br /><br />This isn't Jaynes' original derivation of the maximum entropy principle (the logic presented here was brought to Jaynes' attention by G. Wallis), and neither is it a rigorous mathematical proof, but it has the substantial advantage of intuitive appeal. It even makes concrete the abstract link between the maximum entropy principle and the second law of thermodynamics. <br /><br />Thinking again about the expanding gas example from the previous post, we can see the remarkable similarity between the simple universe we considered then, a box containing 100 gas molecules, with the space inside the box conceptually divided into 32 equal-sized regions, and the apparatus described here. 
The expanding gas scenario corresponds very closely to a possible instance of the monkey experiment with 32 buckets and 100 pebbles:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-SOCI3YqGw0Q/Um_9bwC9w5I/AAAAAAAACP0/J8C3_VCpu4U/s1600/box_without_partition.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-SOCI3YqGw0Q/Um_9bwC9w5I/AAAAAAAACP0/J8C3_VCpu4U/s1600/box_without_partition.png" height="165" width="320" /></a></div><br /><br />As the multiplicity curves above testify, the state with the highest multiplicity is the one with as close as possible to equal numbers of molecules in each region, so a state like the one above is far more likely than one in which one half of the box is empty. Given the fact that the gas molecules move about, in a highly uncorrelated way, therefore, we can see that the second law of thermodynamics (the fact that a closed system in a low-entropy state now will tend to move to and stay in a high-entropy state) amounts to a tautology: if there is a more probable state available, then expect to find the system in that state next time you look.<br /><br />We also saw above that increasing the number of pebbles / molecules reduces the relative width of the multiplicity curve. But the numbers of particles encountered in typical physical situations are astronomically huge. In a cubic meter of air, for example, at sea level and around room temperature, there are (from the ideal-gas approximation) around 2.5 × 10<sup>25</sup> molecules (which, by the way, weighs roughly 1 kg). This means that once a system like this gas-in-a-box experiment has evolved to its state of highest entropy, the chances of seeing the entropy decrease again by any appreciable amount vanish completely. 
Thus the second tendency of thermodynamics receives a well earned promotion, and we call it instead 'the second law'.<br /><br />Please note: no primates were harmed during the making of this post.<br /><br /><br /><hr /><br /><span style="color: blue;"><b>References</b></span><br /><br /><!-- ************************ Table of references ************************--> <br /><table> <tbody><tr> <td valign="top"> [1] </td> <td><div style="text-align: justify;"><span style="font-family: inherit;">I can't claim credit for designing this monkey experiment. Versions of it are dotted about the literature, possibly originating with: </span></div><br /><div style="text-align: justify;"><span style="font-family: inherit;"><span style="font-family: inherit; text-align: justify;">Gull, S.F. and Daniell, G.J., </span><i style="font-family: inherit; text-align: justify;">Image Reconstruction from Incomplete and Noisy Data</i><span style="font-family: inherit; text-align: justify;">, Nature 272, no. 20, page 686, 1978 </span></span></div><span style="font-family: inherit;"></span><br /><div style="text-align: justify;"><span style="font-family: inherit;">See also</span></div><div style="text-align: justify;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: justify;"><span style="font-family: inherit;">Jaynes, E.T., <i>Monkeys, Kangaroos, and N</i>, in <i>Maximum-Entropy and Bayesian Methods in Applied Statistics</i>, edited by Justice, J.H., Cambridge University Press, 1986 (full text of the paper <a href="http://bayes.wustl.edu/etj/articles/cmonkeys.pdf">available here</a>)</span></div><br /></td> </tr></tbody></table><br /><br /><br /><br /><br /></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com2tag:blogger.com,1999:blog-715339341803133734.post-64800770285926944732013-10-26T01:45:00.000-05:002013-10-26T01:45:40.659-05:00Entropy Games<div style="text-align: justify;"><br /></div><div style="text-align: 
justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In 1948, Claude Shannon, an electrical engineer working at Bell Labs, was interested in the problem of communicating messages along physical channels, such as telephone wires. He was particularly interested in issues like how many bits of data are needed to communicate a message, how much redundancy is appropriate when the channel is noisy, and how much a message can be safely compressed. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In that year, Shannon figured out<sup>1</sup> that he could mathematically specify the minimum number of bits required to convey any message. You see, every message, every proposition, in fact, whether actively digitized or not, can be expressed as some sequence of answers to yes / no questions, and every string of binary digits is exactly that: a sequence of answers to yes / no questions. So if you know the minimum number of bits required to send a message, you know everything you need to know about the amount of information it contains.</div><div style="text-align: justify;"><br /><a name='more'></a></div><div style="text-align: justify;">Many possible sets of yes / no questions can yield the same message, but they're not all the same length. Consider the game of 20 questions, in which one player has to guess or deduce the name of a person that the other player is thinking of, simply by asking questions to which the answer will be either 'yes' or 'no.' A really crappy strategy when playing this game would be to ask the following set of questions:<br /><ol><li>Is it Dorothy Hodgkin? [Nope.]</li><li>Is it Emmy Noether? [No!]</li><li>Is it Caroline Herschel? [No, dumb-ass!]</li><li>Is it Margaret Cavendish? [Oh, for crying out loud.]</li><li> etc. 
</li></ol><div>A strategy like this is likely to take an awfully long time to finish the game, particularly as the correct answer may be a man with no outstanding scientific credentials. A better way to play the game is to devise questions that divide the set of possibilities as nearly in half as possible. A good first question is therefore, 'is it a male?' From there, we could proceed with, 'is it a famous person?' and so on, producing a significantly smaller set of possibilities with each answer. With good interrogation, the size of the hypothesis space decays exponentially, while naive guessing, such as the examples above, leaves the range of possibilities virtually unchanged.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">So an optimum string of bits for a message is one such that each bit reduces the number of possible messages the sender might be sending by 50% (with a caveat we'll see in a moment). Thus, if we have a set of 128 symbols, say lower- and upper-case English alphabet, digits 0-9, several punctuation marks, and various other symbols (the ASCII system), we don't need 128 bits to transmit each symbol, only 7. The first bit answers the question: 'is it in the first half of the list?' The second bit answers the question: 'of the remaining list of possibilities, is it in the first half of the list?' and so on. The 7th bit, therefore, leaves a list containing exactly one symbol.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Suppose our alphabet - the complete set of symbols out of which our messages are to be constructed - consists of n symbols. If all symbols in our alphabet occur equally frequently, then the number of bits, call it x, needed to transmit one symbol is given by n = 2<sup>x</sup>. 
The solution to this equation, of course, is<br /><br /><div style="text-align: center;"><span style="font-size: large;">x = log<sub>2</sub>(n)</span> </div></div><br />But if all symbols occur equally frequently, then the probability, p, for the i<sup>th</sup> symbol in a given message to be one in particular is 1/n, from the principle of indifference, so<br /><br /><div style="text-align: center;"><span style="font-size: large;">x = log<sub>2</sub></span><span style="font-size: large;">(1/p)</span></div><br />or, from the <a href="http://en.wikipedia.org/wiki/Logarithm#Product.2C_quotient.2C_power_and_root">laws of logs</a> (special case of the quotient rule, noting that log(1) = 0)<br /><br /><div style="text-align: center;"><span style="font-size: large;">x = -log<sub>2</sub>(p)</span></div></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In normal communication, however, the symbols do not usually occur equally frequently. Furthermore, a message will often contain symbols that are unnecessary. Consider the text you are reading now. This text consists of an arrangement of white and black pixels. The state of each pixel is determined by answering a single yes / no question, but do you actually need to receive all those bits at your visual cortex in order to read the text? Hopefully, this next game will answer that question in the negative. Try to read the following partially obscured text:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-8Ur6vaXqOho/UmgreE-nG5I/AAAAAAAACMw/_RaCBNDjSC0/s1600/obscured_text.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-8Ur6vaXqOho/UmgreE-nG5I/AAAAAAAACMw/_RaCBNDjSC0/s1600/obscured_text.JPG" height="31" width="400" /></a></div><br />Tests (unpublished work, conducted by me, very small sample) reveal that most people can cope with reading this. 
Similarly, most people can manage to interpret SMS text, composed with many of the vowels excluded from their proper places.<br /><br /></div><div style="text-align: justify;">Luckily, as is not too hard to see, the above formula generalizes to the case where symbols don't carry equal amounts of information, which happens when they don't occur with equal frequency, or where some parts of a message are superfluous (the obscured pixels along the top of that message, above, evidently didn't convey any additional useful information).<br /><br />Suppose, for example, that we have a set of only 4 symbols: A, B, C, and D. Suppose that in an average message, A appears 50% of the time, and the remaining letters are half B's, a quarter C's, and a quarter D's (so, overall, B appears 25% of the time, and C and D 12.5% each). Dividing the list of possibilities into 2 halves with the first bit is not optimally efficient here (that caveat I warned you about). Such a strategy would take 2 bits to transfer each symbol, even though the C's and D's carry more information than the A's.<br /><br />Rather than splitting the list in half, it is clearly the probability distribution over the list we should split in half, and because A's occur half the time, our first question is, 'is it an A?' If the answer is yes, then we have a symbol from only one bit. If the answer is no, then the next question is, 'is it a B?' since B's occur on 50% of the occasions when it's not an A. Sometimes we'll need to ask a third question, 'is it a C?' But that'll only happen 25% of the time, and the average number of bits needed to transfer 1 symbol will be (0.5 × 1) + (0.25 × 2) + (0.25 × 3) = 1.75 bits.<br /><br />So the number of bits needed to transfer a symbol is still a function of the prior probability that that symbol would have been the one sent. 
If the i<sup>th</sup> symbol has an associated prior probability of p<sub>i</sub>, then:<br /><br /><div style="text-align: center;"><span style="font-size: large;">x<sub>i</sub> = -log<sub>2</sub>(p<sub>i</sub>)</span></div><br />The average of x, over the entire alphabet of symbols, is just the <a href="http://maximum-entropy-blog.blogspot.com/2013/01/great-expectations.html">expectation</a> over this expression, and is what we call the entropy:<br /><br /><div style="text-align: center;"><span style="font-size: large;">H = 〈x<sub>i</sub>〉 = -Σ p<sub>i </sub>log<sub>2</sub>(p<sub>i</sub>)</span></div><br />which, thankfully, is the same expression we used to solve the <a href="http://maximum-entropy-blog.blogspot.com/2013/10/entropy-of-kangaroos.html">kangaroo problem</a>, before. In fact, when calculating entropy, the base is not all that important - a different base will give a different number, but if we consistently use the same base, then our numbers will vary consistently. Often, the natural base is preferred, while sometimes base 10 is used, but then the units are not bits but, respectively, nats or hartleys. Note that this formula (with base 2) reproduces the 1.75 obtained for the A's, B's, C's, and D's, above.<br /><br />Why does this formula from communication theory have anything to do with science and inference?<br /><br />Well, every scientific experiment, in fact every observation or experience, can be considered as a message from Mother Nature to us. Every bit of information from Mother Nature reduces the number of plausible universes by half, and we can count the bits using Shannon's theory. (By the way, I should be careful: the universe hates it when we personify her.) In the next post, I'll give some insight into why the maximum entropy principle (which I've already applied to kangaroos) is a valid tool in statistical inference.<br /><br />Why do we call it entropy?<br /><br />Good question, thank you very much for asking. 
Legend has it, Shannon initially adopted the term entropy primarily because of the remarkable similarity of his formula to an equation already used by physicists to calculate thermodynamic entropy. It was supposed to be an analogy, but several authors see it as more. There is nothing resembling consensus on this issue, but it seems clear to me that the information theoretic and thermodynamic uses of the term entropy are essentially the same.<br /><br /></div><div style="text-align: justify;">Borrowing from Arieh Ben-Naim's examples in <i>Entropy Demystified </i><sup>2</sup>, we can look at one of the canonical illustrations of physical entropy changing: the expansion of a gas in a closed container. The box depicted below has a partition down the middle. The left side contains gas molecules, while the right side has been evacuated. For each gas molecule, we play again the 20 questions game to figure out where it is - except that 4 questions are enough to fix its location to one of the 16 regions (to the left of the partition) marked with the dotted lines.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Msw7rlxRYDo/Umk44-v8jdI/AAAAAAAACNY/r4tC3KKtzhA/s1600/box_with_partition.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-Msw7rlxRYDo/Umk44-v8jdI/AAAAAAAACNY/r4tC3KKtzhA/s1600/box_with_partition.png" height="207" width="400" /></a></div><br /><br /><div class="separator" style="clear: both; text-align: center;"></div>At a certain point of time, we cause the partition to dissolve, allowing the gas to spread to all parts of the box, as shown in the second picture:<br /><div class="separator" style="clear: both; text-align: center;"></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-aPQ906Gevn4/Umk4-KSfC9I/AAAAAAAACNg/A4Wf5o2mRjg/s1600/box_without_partition.png" imageanchor="1" 
style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-aPQ906Gevn4/Umk4-KSfC9I/AAAAAAAACNg/A4Wf5o2mRjg/s1600/box_without_partition.png" height="206" width="400" /></a></div><br /><br />The physicist and the information theorist are agreed: entropy has increased. From physics (see <a href="http://galileo.phys.virginia.edu/classes/152.mf1i.spring02/MolecularEntropy.htm">for example</a>), because the temperature hasn't changed, the change in entropy is proportional to the log of the ratio of the volumes occupied by the gas after and before the partition was removed. That is, log<sub>2</sub>(2) = 1 more unit of entropy per molecule has been added (in bits, rather than nats).<br /><br />From the point of view of information theory, in order to locate any particular molecule, I now have to find the right place from among 32 little regions - I need one more bit of information than I needed before (the total entropy of a message is the average symbol entropy multiplied by the length of the message).<br /><br />(We could also have achieved an increase in physical entropy by increasing the temperature rather than the volume, but then, according to the <a href="http://en.wikipedia.org/wiki/Maxwell%E2%80%93Boltzmann_distribution">Maxwell distribution</a>, the width of the probability curve associated with each particle's velocity would also increase, thus increasing the number of bits needed to pinpoint each particle's momentum.)<br /><br />It is important to note that the even dispersal of the gas throughout the box, after the partition is removed, is not caused by our amount of information having been reduced, as some of my teachers suggested when I was an undergraduate. 
Of course, this belief commits the <a href="http://maximum-entropy-blog.blogspot.com/2012/06/the-mind-projection-fallacy.html">mind-projection fallacy</a>, and gets the situation exactly backwards.<br /><br />As we'll see in the next post, the reason that physical systems tend to be found in or evolving towards states of maximal entropy (the second law of thermodynamics) is exactly the same reason that maximum-entropy distributions are appropriate for inference under missing information (the maximum entropy principle): those maximum-entropy states / distributions have the highest multiplicity. </div><div style="text-align: justify;"><br /><br /><hr /><br /><span style="color: blue;"><b>References</b></span><br /><br /><br /><table> <tbody><tr> <td valign="top"> [1] </td> <td>Shannon, C.E., "A Mathematical Theory of Communication," Bell System Technical Journal 27 (3), pages 379–423, 1948 (<a href="http://www3.alcatel-lucent.com/bstj/vol27-1948/articles/bstj27-3-379.pdf">Download here</a>.)</td> </tr><tr> <td valign="top"> [2] </td> <td>Ben-Naim, A., "Entropy Demystified," World Scientific Publishing Company, 2008</td> </tr></tbody></table><br /><br /><br /><br /><span style="color: blue;"><b><br /></b></span></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com6tag:blogger.com,1999:blog-715339341803133734.post-634731420276709862013-10-18T22:44:00.000-05:002013-10-18T22:44:13.814-05:00Entropy of Kangaroos<div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">All this discussion of scientific method, the deep roots of probability theory, mathematics, and morality is all well and good, but what about kangaroos? As I'm sure most of my more philosophically sophisticated readers appreciate, kangaroos play a necessarily central and vital role in any valid epistemology. 
To celebrate this fact, I'd like to consider a mathematical calculation that first appeared in the image-analysis literature, just coming up to 30 years ago<sup>1</sup>. I'll paraphrase the original problem in my own words:<br /><br /><blockquote class="tr_bq"><i>We all know that two thirds of kangaroos are right handed, and that one third of kangaroos drink beer (the remaining two thirds preferring whisky). These are true facts. What is the probability that a randomly encountered kangaroo is a left-handed beer drinker? Find a unique answer.</i></blockquote><br /><a name='more'></a><br />The last part of the statement of the puzzle, 'find a unique answer,' is implicit in the question, but I've emphasized it, as it highlights the difference between the sensible view of probability and the old, <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#frequency-interpretation">frequentist</a> school of thought. The ardent frequentist will deny that any unique solution to the kangaroo problem exists. They will tell you that there is not enough information in the statement of the question to produce an answer. But reasoning under uncertainty (missing information) is totally and utterly, completely exactly what probability theory is for, so of course there is an answer to this question. Quite possibly, if you're good with numbers, your intuition is leaning toward the correct number already.<br /><br />To help focus your mind, below is a beautiful portrait of a left-handed, beer-swilling kangaroo (identity unknown). This sketch appeared in the original paper by Gull and Skilling that introduced this problem. I love the stylish way this kangaroo is showing its tongue to the artist. Ask yourself: "how plausible is this kangaroo?" 
This is the number we are looking for.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-eqwvX41Ntfg/UmA62Q7sOeI/AAAAAAAACLc/0OY08Tg6vdI/s1600/left-handed_beer-drinking_kangaroo.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-eqwvX41Ntfg/UmA62Q7sOeI/AAAAAAAACLc/0OY08Tg6vdI/s1600/left-handed_beer-drinking_kangaroo.JPG" height="296" width="320" /></a></div><br />It's not the first time kangaroos have appeared in the statistical literature, either (I told you they're philosophically significant). William Gosset, better known to many of us as <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#student">Student</a>, used the following diagram<sup>2</sup> as a tool to help remember elements of mathematical terminology. The animal on the left is a platypus, representing the general shape of a <a href="http://en.wikipedia.org/wiki/Kurtosis">platykurtic</a> distribution, while the more leptokurtic curve is depicted with kangaroos, naturally, because of all their "lepping about."<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-XnXNrMc7hAU/UmA6qhssVwI/AAAAAAAACLU/N1MLhlRXEOs/s1600/student_kangaroo_pic2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-XnXNrMc7hAU/UmA6qhssVwI/AAAAAAAACLU/N1MLhlRXEOs/s1600/student_kangaroo_pic2.JPG" height="171" width="640" /></a></div><br /><br />All right, to analyze our problem, we can draw the hypothesis space as a matrix:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Nwf4EFGxzAY/UmCnbq60WyI/AAAAAAAACLs/1Vulai792DQ/s1600/kangaroo_table1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" 
src="http://2.bp.blogspot.com/-Nwf4EFGxzAY/UmCnbq60WyI/AAAAAAAACLs/1Vulai792DQ/s1600/kangaroo_table1.PNG" height="246" width="320" /></a></div><br />This kind of matrix is also known as a <a href="http://en.wikipedia.org/wiki/Contingency_table">contingency table</a>. The numbers opposite each atomic proposition give the total for that case, e.g. the overall proportion of left-handed 'roos, p<sub>11</sub> + p<sub>12</sub>, is 1/3, as initially stated. (I used to have difficulty remembering the conventional ordering of those little ij numbered subscripts on the elements of a matrix so I also developed a mnemonic. If you ever learned basic electronics, you might remember that the response time of a capacitor in a simple circuit is given by something called the <a href="http://en.wikipedia.org/wiki/RC_time_constant">RC constant</a>. So that's how I remember - RC constant: first the <u>R</u>ow, then the <u>C</u>olumn, (pretty much) <u>always</u>.)<br /><br />We can easily reduce the number of variables in the above table. Whatever p<sub>21</sub> is, p<sub>11</sub> + p<sub>21</sub> = 1/3, so p<sub>21</sub> = 1/3 - p<sub>11</sub>. 
All the p<sub>ij</sub> can be expressed in terms of p<sub>11</sub> :<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-q7eKIehYiDI/UmCnl2KwT2I/AAAAAAAACL0/lYXOUlluss4/s1600/kangaroo_table2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-q7eKIehYiDI/UmCnl2KwT2I/AAAAAAAACL0/lYXOUlluss4/s1600/kangaroo_table2.PNG" height="235" width="320" /></a></div><br />This is where the die-hard frequentist stops working and breaks into tears, for though it is obvious that p<sub>11</sub> is constrained to the range 0 ≤ p<sub>11</sub> ≤ 1/3, he can see no way to single out 1 from the uncountably infinite number of possibilities within that range.<br /><br />The thing that stops the frequentist in his tracks is that he doesn't know the correct correlation between left handedness and beverage preference. It could be that all left-handers drink whisky, or none do, or that the proportions of left-handed kangaroos that drink whisky and beer are equal, or anything in between.<br /><br />But for us, this is not a problem. What we do not know can't interfere with our calculation, as we simply want a number that characterizes what we do know. There may be some non-zero correlation, but we have no information about that, so symmetry demands that we remain indifferent. Any correlation could be positive or negative, so any arbitrary choice has 50% probability to be in the wrong direction. So, denoting a beer drinker as B and a left hander as L:<br /><br /><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif; font-size: large;">P(B | L) = P(B | R)</span>, </div><br />at least until we have better information. We are not proclaiming something about actual proportions (frequencies) here. 
We are merely acknowledging that the handedness and the drinking habits of these marsupials are at present logically independent.<br /><br />From the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a>, though, P(BL) = P(B | L) × P(L), etc., so:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-CXJPgRwRn_8/UmCpS3APl2I/AAAAAAAACL8/oq0aj9XYI2E/s1600/eq2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-CXJPgRwRn_8/UmCpS3APl2I/AAAAAAAACL8/oq0aj9XYI2E/s1600/eq2.PNG" /></a></div><br />or, from the contingency table:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-DFQx56JrUBA/UmCpYyJotvI/AAAAAAAACME/naN4O-MYok8/s1600/eq3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-DFQx56JrUBA/UmCpYyJotvI/AAAAAAAACME/naN4O-MYok8/s1600/eq3.PNG" /></a></div><br />yielding our desired result:<br /><br /><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif; font-size: large;">P(BL) = p<sub>11</sub> = 1/9</span></div><br />________________<br /><br />There is, though, another way to reach this result: as the title of this post suggests, by calculating entropy. This is, of course, the real reason Gull and Skilling invented this problem. I'm not going to explain today what entropy is. That'll have to wait for a future post. I'm just going to state a formula for it, and tell you a truly remarkable thing that can be done with it. 
If you don't already know the rationale behind it, until I offer up an explanation, I expect your time will be pretty much fully occupied with contemplating why the number provided by this formula should have the extraordinary properties I claim for it.<br /><br />The formula for the Shannon entropy, H, is:<br /><br /><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif; font-size: large;">H = - Σ p×log(p)</span></div><br />where the p's are probabilities and the sum is over the entire hypothesis space.<br /><br />The principle of maximum entropy asserts that for a given problem, the distribution of probabilities that maximizes the number H, subject to the constraints of that problem, is the one that best characterizes our knowledge. This maximum entropy distribution is the one that is maximally non-committal, without violating the information established in the problem. Thus, any adoption of a distribution of lower than maximum entropy constitutes an assumption of information that we do not really have, thereby constituting an irrational inference, of a type often referred to as 'spurious precision.'<br /><br />Calculating the entropy in this case is quite fun and simple. All probabilities are fixed, for a given p<sub>11</sub>, so we can vary this parameter, and see where the entropy function is maximized. The plotted curve below shows this calculation. 
The vertical line is drawn in to show where our previous argument declared that p<sub>11</sub> should be.<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-d1yfqg-NbVc/UmDFeezuVGI/AAAAAAAACMg/8BDe72XVzTM/s1600/entropy_solution.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="" border="0" src="http://1.bp.blogspot.com/-d1yfqg-NbVc/UmDFeezuVGI/AAAAAAAACMg/8BDe72XVzTM/s1600/entropy_solution.jpeg" height="417" title="" width="640" /></a></div><br /><br />The maximum-entropy point corresponds perfectly with the earlier analysis, yielding P(BL) = 1/9, just as I claimed it would. I don't claim this is a proof of the maximum entropy principle, but perhaps you'll feel inspired to at least wonder what this H stands for, and how its maximization comes to have this special property. I'll be covering these things shortly.<br /><br />The maximum entropy principle is certainly not the main inspiration for this blog, but Ed Jaynes' profound understanding of the mechanics of rational inference, with which he was able to recognize this principle, is easily good enough to have inspired the title.<br /><br /><br /><br /><hr /><br /><span style="color: blue;"><b>References</b></span><br /><br /><table cellpadding="3" valign="top"> <tbody><tr> <td>[1]</td> <td><br /><span style="text-align: justify;">Gull, S.F. and Skilling, J., 'The maximum entropy method,' in</span><span style="text-align: justify;"> 'Indirect imaging,' edited by Roberts, J.A. (Cambridge Univ. Press), </span><span style="text-align: justify;">1984</span></td> </tr><tr> <td>[2]</td> <td>Student, 'Errors of routine analysis,' Biometrika 19, no. 
1, page 160, July 1927 </td> </tr></tbody></table></div><br /><br /><br /><br />Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com2tag:blogger.com,1999:blog-715339341803133734.post-54701097328465133032013-10-11T22:57:00.000-05:002013-10-11T22:57:41.847-05:00No Such Thing as a Probability for a Probability<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;"><br /><br /></span><br /><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;">In the <a href="http://maximum-entropy-blog.blogspot.com/2013/10/error-bars-for-binary-parameters.html">previous post</a>, I discussed</span><span style="font-family: Arial, Helvetica, sans-serif;"> a problem of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#param-est">parameter estimation</a>, in which the parameter of interest is a frequency: the relative frequency with which some data-generating process produces observations of some given type. In the example I chose (mathematically equivalent to Laplace's sunrise problem), we assumed a frequency that is fixed in the long term, and we assumed logical independence between successive observations. As a result, the frequency with which the</span><span style="font-family: Arial, Helvetica, sans-serif;"> process produces X , if known, has the same numerical value as the probability that any particular event will be an X. Many authors covering this problem exploit this correspondence, and describe the sought after parameter directly as a probability. 
This seems to me to be confusing, unnecessary, and incorrect.</span></div><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;">We perform parameter estimation by calculating probability distributions, but if the parameter we are after is itself a probability, then we have the following weird riddle to solve: What is a probability for a probability? What could this mean? </span></div><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><br /></div><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;">A probability is a rational account of one's state of knowledge, contingent upon some model. Subject to the constraints of that model (e.g. the necessary assumption that probability theory is correct), there is no wiggle room with regard to a probability - its associated distribution, if such existed, would be a two-valued function, being everywhere either on or off, and being on in exactly one location. What I have described, however, is not a probability distribution, as the probability at a discrete location in a continuous hypothesis space has no meaning. This opens up a few potential philosophical avenues, but in any case, this 'distribution' is clearly not the one the problem was about, so we don't need to pursue them.</span></div><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;">In fact, we never need to discuss the probability for a probability. Where a probability is obtained as the expectation of some other nuisance parameter, that parameter will always be a frequency. 
To begin to appreciate the generality of this, suppose I'm fitting a mathematical function, y = f(x), with model parameters, θ, to some set of observed data pairs, (x, y). None of the </span><span style="font-family: Arial, Helvetica, sans-serif;">θ<sub>i</sub> can be a probability, since each (x, y) pair is a real observation of some actual physical process - each parameter is chosen to describe some aspect of the physical nature of the system under scrutiny. </span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">Suppose we ask a question concerning the truth of a proposition, Q: "If x is 250, y(x) is in the interval, a = [a</span><sub style="font-family: Arial, Helvetica, sans-serif;">1</sub><span style="font-family: Arial, Helvetica, sans-serif;">, a</span><sub style="font-family: Arial, Helvetica, sans-serif;">2</sub><span style="font-family: Arial, Helvetica, sans-serif;">]." </span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">We proceed first to calculate the multi-dimensional posterior distribution over θ-space. Then we evaluate at each point in θ-space the probability distribution for the <span style="background-color: white;">frequency with which y(250) ∈ [a</span></span><span style="background-color: white;"><sub style="font-family: Arial, Helvetica, sans-serif;">1</sub><span style="font-family: Arial, Helvetica, sans-serif;">, a</span><sub style="font-family: Arial, Helvetica, sans-serif;">2</sub></span><span style="font-family: Arial, Helvetica, sans-serif;"><span style="background-color: white;">]. If y(x) is deterministic, at all frequencies this will be either 1 or 0</span>. 
Regardless of whether or not y is deterministic, t</span><span style="font-family: Arial, Helvetica, sans-serif;">he product of this function with the distribution, P(θ), gives the probability distribution over (f, θ), and the integral over this product is the final probability for</span><span class="apple-converted-space" style="font-family: Arial, Helvetica, sans-serif;"> </span><span style="font-family: Arial, Helvetica, sans-serif;">Q. We never needed a probability distribution over probability space, only over f and θ space, and since every inverse problem in probability theory can be expressed as an exercise in parameter estimation, we have highly compelling reasons to say that this will always hold. </span></div><div style="margin: 0in 0in 0.0001pt; text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">It might seem as though multi-level, <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#model-comparison">hierarchical modeling</a> presents a counterexample to this. In the hierarchical case, the function y(x) (or some function higher still up the ladder) becomes itself one of several possibilities in some top-level hypothesis space. We may, for example, suspect that our data pairs could be fitted by either a linear function, or a quadratic, in which case our job is to find out which is more suitable. In this case, the probability that y(250) is in some particular range depends on which fitting function is correct, which is itself expressible as a probability distribution, and we seem to be back to having a probability for a probability.</span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">But every multi-level model can be expressed as a simple parameter estimation problem. 
For a fitting function, y<sub>A</sub>(x), we might have parameters θ<sub>A</sub> = {θ<sub>A1</sub>, θ<sub>A2</sub>, ....}, and for another </span><span style="font-family: Arial, Helvetica, sans-serif;">function, y</span><sub style="font-family: Arial, Helvetica, sans-serif;">B</sub><span style="font-family: Arial, Helvetica, sans-serif;">(x), parameters θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">B</sub><span style="font-family: Arial, Helvetica, sans-serif;"> = {θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">B1</sub><span style="font-family: Arial, Helvetica, sans-serif;">, θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">B2</sub><span style="font-family: Arial, Helvetica, sans-serif;">, ....}. The entire problem is thus mathematically indistinguishable from a single parameter estimation problem with </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><span style="font-family: Arial, Helvetica, sans-serif;"> = {θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">A1</sub><span style="font-family: Arial, Helvetica, sans-serif;">, θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">A2</sub><span style="font-family: Arial, Helvetica, sans-serif;">, ...., </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">B1</sub><span style="font-family: Arial, Helvetica, sans-serif;">, θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">B2</sub><span style="font-family: Arial, Helvetica, sans-serif;">, ...., </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">N</sub><span style="font-family: Arial, Helvetica, sans-serif;">}, where </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">N</sub><span style="font-family: Arial, Helvetica, sans-serif;"> is an 
additional hypothesis specifying the name of the true fitting function. By the above argument, none of the </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><span style="font-family: Arial, Helvetica, sans-serif;">'s here can be a probability. </span><span style="font-family: Arial, Helvetica, sans-serif;">(What does </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">B1</sub><span style="font-family: Arial, Helvetica, sans-serif;"> mean in model A? It is irrelevant: for a given point in the sub-space, </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">A</sub><span style="font-family: Arial, Helvetica, sans-serif;">, the probability is uniform over </span><span style="font-family: Arial, Helvetica, sans-serif;">θ</span><sub style="font-family: Arial, Helvetica, sans-serif;">B</sub><span style="font-family: Arial, Helvetica, sans-serif;">.)</span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">Often, though, it is conceptually advantageous to use the language of multi-level modeling. In fact, this is exactly what happened previously, when we studied various incarnations of the sunrise problem. Here is how we coped: </span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">We had a parameter (see <a href="http://maximum-entropy-blog.blogspot.com/2013/10/error-bars-for-binary-parameters.html">previous post</a>), which we called A, denoting the truth value of some binary proposition. That parameter was itself determined by a frequency, f, for which we devised a means to calculate a probability distribution. 
When we needed to know the probability that a system with internal frequency, f, would produce 9 events of type X in a row, we made use of the logical independence of subsequent events to say that P(X) is numerically the same as f (the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bernoulli-urn">Bernoulli urn rule</a>). Thus, we were able to make use of the laws of probability (the product rule in this case) to calculate P(9 in a row | this f is temporarily assumed correct) = f <sup>9</sup>. Under the assumptions of the model, therefore, for any assumed f, the value </span><span style="font-family: Arial, Helvetica, sans-serif;">f </span><sup style="font-family: Arial, Helvetica, sans-serif;">9</sup><span style="font-family: Arial, Helvetica, sans-serif;"> is the frequency with which this physical process produces 9 X's out of 9 samples, and our result was again an expectation over frequency space (though this time a different frequency). We actually made 2 translations: from frequency to probability and then from probability back to frequency, before calculating the final probability. It may seem unnecessarily cumbersome, but by doing this, we avoid the nonsense of a probability for a probability. </span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">(There are at least 2 reasons why I think avoiding such nonsense is important. Firstly, when we teach, we should avoid our students harboring the justified suspicion that we are telling them nonsense. The student does not have to be fully conscious that any nonsense was transmitted, for the teaching process to be badly undermined. 
Secondly, when we do actual work with probability calculus, there may be occasions when we solve problems of an exotic nature, where arming ourselves with normally harmless nonsense could lead to a severe failure of the calculation, perhaps even seeming to produce an instance where the entire theory implodes.)</span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">What if nature is telling us that we shouldn't impose the assumption of logical independence? No big deal, we just need to add a few more gears to the machine. For example, we might introduce some high-order autoregression model to predict how an event depends on those that came before it. Such a model will have a set of n + 1 coefficients, but for each point in the space of those coefficients, we will be able to form the desired frequency distribution. We can then proceed to solve the problem: with what frequency does this system produce an X, given that the previous n events were thing<sub>1</sub>, thing<sub>2</sub>, .... The frequency of interest will typically be different to the global frequency for the system (if such exists), but the final probability will always be an expectation of a frequency. </span><br /><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">The same kind of argument applies if subsequent events are independent, but f varies with time in some other way. There is no level of complexity that changes the overall thesis.</span><br /><br /></div><div class="MsoNormal"><div style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;">It might look like we have strayed dangerously close to the dreaded <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#frequency-interpretation">frequency interpretation</a> of probability, but really we haven't. 
As I pointed out in the linked-to glossary article, every probability can be considered an expected frequency, but owing to the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#theory-ladenness">theory ladenness</a> of the procedure that arrives at those expected frequencies, whenever we reach the designated top level of our calculation, we are prevented from identifying probability with actual frequency. To make this identification is to claim to be omniscient. It is thus incorrect to talk, as some authors do, of physical probabilities, as opposed to epistemic probabilities.</span></div><br /><div style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div style="text-align: justify;"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div></div>Tom Campbell-Rickettshttp://www.blogger.com/profile/07387943617652130729noreply@blogger.com0tag:blogger.com,1999:blog-715339341803133734.post-76226194875739721352013-10-05T00:06:00.000-05:002013-10-05T00:06:14.851-05:00Error Bars for Binary Parameters<style> td.upper_line { border-top:solid 1px black; } table.fraction { text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } </style> <style> table.num_eqn { width:99%; text-align: center; vertical-align: middle; margin-top:0.5em; margin-bottom:0.5em; line-height: 2em; } td.eqn_number { text-align:right; width:2em; } </style> <br /><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Propositions about real phenomena are either true or false. For some logical proposition, e.g. 
"there is milk in the fridge", let <span style="font-family: Times, Times New Roman, serif;">A</span> be the binary parameter denoting its truth value. Now, truth values are not in the habit of marching themselves up to us and announcing their identity. In fact, for propositions about specific things in the real world, there is normally no way whatsoever to gain direct access to these truth values, and we must make do with inferences drawn from our raw experiences. We need a system, therefore, to assess the reliability of our inferences, and that system is probability theory. When we do <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#param-est">parameter estimation</a>, a convenient way to summarize the results of the probability calculations is the error bar, and it would seem to be necessary to have some corresponding tool to capture our degree of confidence when we estimate a binary parameter, such as <span style="font-family: Times, Times New Roman, serif;">A</span>. But what could this error bar possibly look like? The hypothesis space consists of only two discrete points, and there isn't enough room to convey the required information. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Let me pose a different question: how easy is to change your mind? One of the important functions of probability theory is to quantify evidence in terms of how easy it would be for future evidence to change our minds. Suppose I stand at the side of a not-too-busy road, and wonder in which direction the next car to pass me will be travelling. Let <span style="font-family: Times, Times New Roman, serif;">A</span> now represent the proposition that any particular observed vehicle is traveling to the left. 
Suppose that, upon my arrival at the scene, I'm in a position of extreme ignorance about the patterns of traffic on the road, and that my ignorance is best represented (for symmetry reasons) by indifference, and my resulting probability estimate for <span style="font-family: Times, Times New Roman, serif;">A</span> is 50%. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Suppose that after a large number of observations in this situation, I find that almost equal numbers of vehicles have been going right as have been going left. This results in a probability assignment for <span style="font-family: Times, Times New Roman, serif;">A</span> that is again 50%. Here's the curious thing, though: in my initial state of indifference, only a small number of observations would have been sufficient for me to form a strong opinion that the frequency with which <span style="font-family: Times, Times New Roman, serif;">A</span> is true, f<sub>A</sub>, is close to either 0 or 1. But now, having made a large number of observations, I have accumulated substantial evidence that f<sub>A</sub> is in fact close to 0.5, and it would take a comparably large number of observations to convince me otherwise. The appropriate response to possible future evidence has changed considerably, but I used the same number, 50%, to summarize my state of information. How can this be?</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">In fact, the solution is quite automatic. In order to calculate <span style="font-family: Times, Times New Roman, serif;">P(A)</span>, it is first necessary to assign a probability distribution over frequency space, <span style="font-family: Times, Times New Roman, serif;">P(f<sub>A</sub>)</span>. 
I did this in one of my <a href="http://maximum-entropy-blog.blogspot.com/2012/03/how-to-make-ad-hominem-arguments.html">earliest blog posts</a>, in which I solved a thinly disguised version of Laplace's <a href="http://en.wikipedia.org/wiki/Sunrise_problem">sunrise problem</a>. Let's treat this traffic problem in the same way. My starting position in the traffic problem, indifference, meant that my information about the relative frequency with which an observed vehicle travels to the left was best encoded with a <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#prior">prior probability</a> distribution that is the same value at all points within the hypothesis space. Let's assume also that we start with the conviction (from whatever source) that the frequency, f<sub>A</sub>, is constant in the long run and that consecutive events are independent. Laplace's solution (this is, yet again, identical to the sunrise problem he solved just over 200 years ago) provides a neat expression for <span style="font-family: Times, Times New Roman, serif;">P(A)</span>, known as the <b>rule of succession</b> (p is the probability that the next event is of type X, n is the number of observed occurrences of type X events, and N is the total number of observed events):<br /><div style="text-align: center;"><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-YYiqE6VW0WU/Uk7XKvXsxEI/AAAAAAAACKk/msW4S3uo5GE/s1600/succession.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-YYiqE6VW0WU/Uk7XKvXsxEI/AAAAAAAACKk/msW4S3uo5GE/s1600/succession.JPG" /></a></div><br /></td> </tr></tbody></table></td> <td class="eqn_number">(1)</td></tr></tbody></table></div>but his method follows the same route I took when predicting a person's behaviour from past 
observations: at each possible frequency (between 0 and 1) calculate <span style="font-family: Times, Times New Roman, serif;">P(f<sub>A</sub>)</span> from <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#bayes-theorem">Bayes' theorem</a>, using the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#binomial">binomial distribution</a> to calculate the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#likelihood">likelihood function</a>. The proposition <span style="font-family: Times, Times New Roman, serif;">A</span> can be resolved into a set of mutually exclusive and exhaustive propositions about the frequency, <span style="font-family: Times, 'Times New Roman', serif;">f</span><sub style="font-family: Times, 'Times New Roman', serif;">A</sub>, giving <span style="font-family: Times, Times New Roman, serif;">P(A) = P(A[f<sub>1</sub> + f<sub>2</sub> + f<sub>3</sub> +....])</span>, so that the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#product-rule">product rule</a>, applied directly after the <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#extended-sum-rule">extended sum rule</a> means that the final assignment of <span style="font-family: Times, Times New Roman, serif;">P(A)</span> consists of <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#integral">integrating</a> over the product f<sub>A</sub><span style="font-family: Times, 'Times New Roman', serif;">×</span>P(f<sub>A</sub>), which we recognize as obtaining the <a href="http://maximum-entropy-blog.blogspot.com/2013/01/great-expectations.html">expectation</a>, <span style="font-family: Times, Times New Roman, serif;">〈f<sub>A</sub>〉</span>.<br /><br />The figure below depicts the evolution of the distribution, <span style="font-family: Times, 'Times New Roman', serif;">P(f</span><sub style="font-family: Times, 'Times New Roman', serif;">A </sub><span style="font-family: Times, 'Times New Roman', 
serif;">| DI), </span><span style="font-family: inherit;">for the first N observations, for several N. The data all come from a single sequence of binary uniform random variables, and the procedure follows equation (4), from my <a href="http://maximum-entropy-blog.blogspot.com/2012/03/how-to-make-ad-hominem-arguments.html">earlier article</a>. We started, at N = 0, from indifference, and the distribution was flat. Gradually, as more and more data was added, a peak emerged, and got steadily sharper and sharper: </span><br /><span style="font-family: Times, 'Times New Roman', serif;"><br /></span><br /><div style="text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-v7vnzy72zfQ/Uk3x3BX0fbI/AAAAAAAACKU/yE_WDG42nao/s1600/sunrise_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="374" src="http://2.bp.blogspot.com/-v7vnzy72zfQ/Uk3x3BX0fbI/AAAAAAAACKU/yE_WDG42nao/s1600/sunrise_1.png" width="640" /></a></div><br /></div><br />(The numbers on the y-axis are larger than 1, but that's OK because they are <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#distribution-function">probability densities</a> - once the curve is integrated, which involves multiplying each value by a differential element, df, the result is exactly 1.) 
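<br /><br />Curves of this kind can be reproduced with a short script. What follows is a minimal Python sketch, assuming a flat prior and a binomial likelihood as described above. The function names, the grid size, and the simulated sequence of fair binary outcomes are illustrative assumptions, not the code that produced the figure.<br /><br />

```python
import math
import random


def trapezoid(values, step):
    """Trapezoidal-rule integral of equally spaced samples."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def grid_posterior(n_left, n_total, grid_size=1001):
    """Return (freqs, density): P(f | D, I) on a uniform grid, flat prior."""
    freqs = [i / (grid_size - 1) for i in range(grid_size)]
    n_right = n_total - n_left
    log_like = []
    for f in freqs:
        # Binomial likelihood up to a constant: f^n_left * (1 - f)^n_right,
        # evaluated in log space so that large counts do not underflow.
        if (f == 0.0 and n_left > 0) or (f == 1.0 and n_right > 0):
            log_like.append(float("-inf"))
            continue
        value = 0.0
        if n_left > 0:
            value += n_left * math.log(f)
        if n_right > 0:
            value += n_right * math.log(1.0 - f)
        log_like.append(value)
    peak = max(log_like)
    unnorm = [math.exp(v - peak) for v in log_like]
    area = trapezoid(unnorm, 1.0 / (grid_size - 1))
    return freqs, [u / area for u in unnorm]


def posterior_mean(freqs, density):
    """Expectation of f, which is the probability assigned to A."""
    step = freqs[1] - freqs[0]
    return trapezoid([f * d for f, d in zip(freqs, density)], step)


def posterior_std(freqs, density):
    """Standard deviation of f: one possible measure of the curve's width."""
    step = freqs[1] - freqs[0]
    mean = posterior_mean(freqs, density)
    spread = [(f - mean) ** 2 * d for f, d in zip(freqs, density)]
    return math.sqrt(trapezoid(spread, step))


# Hypothetical data: a simulated sequence of fair binary outcomes.
random.seed(0)
outcomes = [random.random() < 0.5 for _ in range(1000)]
for n_total in (0, 10, 100, 1000):
    n_left = sum(outcomes[:n_total])
    freqs, density = grid_posterior(n_left, n_total)
    print(n_total, n_left,
          round(posterior_mean(freqs, density), 4),
          round(posterior_std(freqs, density), 4))
```

In this sketch the mean stays near 0.5 throughout while the standard deviation shrinks as N grows, which is the sharpening of the peak seen in the figure.<br /><br />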
The probability distribution, <span style="font-family: Times, 'Times New Roman', serif;">P(f</span><sub style="font-family: Times, 'Times New Roman', serif;">A </sub><span style="font-family: Times, 'Times New Roman', serif;">| DI)</span>, is therefore the answer to our initial question: <span style="font-family: Times, 'Times New Roman', serif;">P(f</span><sub style="font-family: Times, 'Times New Roman', serif;">A </sub><span style="font-family: Times, 'Times New Roman', serif;">| DI)</span> contains all the information we have about the robustness of <span style="font-family: Times, Times New Roman, serif;">P(A)</span> against new evidence, and we get our error bar by somehow characterizing the width of <span style="font-family: Times, 'Times New Roman', serif;">P(f</span><sub style="font-family: Times, 'Times New Roman', serif;">A </sub><span style="font-family: Times, 'Times New Roman', serif;">| DI)</span>.<br /><br />Now, an important principle of probability theory requires that the order with which we incorporate different elements of the data, <span style="font-family: Times, Times New Roman, serif;">D</span>, does not affect the final <a href="http://maximum-entropy-blog.blogspot.com/p/glossary.html#posterior">posterior</a> supplied by Bayes' theorem. For <span style="font-family: Times, Times New Roman, serif;">D = {d<sub>1</sub>, d<sub>2</sub>, d<sub>3</sub>, ...}</span>, we could work the initial prior over to a posterior using only <span style="font-family: Times, 'Times New Roman', serif;">d</span><sub style="font-family: Times, 'Times New Roman', serif;">1</sub>, then, using this posterior as the new prior, repeat for <span style="font-family: Times, 'Times New Roman', serif;">d</span><sub style="font-family: Times, 'Times New Roman', serif;">2</sub>, and so on through the list. We could do the same thing, only taking the d's in any order we choose. We could bundle them into sub-units, or we could process the whole damn lot in a single batch. 
The final probability assignment must be the same in each case. Violation of this principle would invalidate our theory (assuming there is no causal path, e.g. if I'm observing my own mental state, from knowledge of some of the d's to subsequent observed d's).<br /><br />For example, each curve on the graph above shows the result from a single application of Bayes' theorem, though I could just as well have processed each individual observation separately, producing the same result. This works because the prior distribution is changing with each new bit of data added, gradually recording the combined effect of all the evidence. Each <span style="font-family: Times, 'Times New Roman', serif;">d</span><sub style="font-family: Times, 'Times New Roman', serif;">i</sub> becomes subsumed into the background information, <span style="font-family: Times, Times New Roman, serif;">I</span>, before the next one is treated.<br /><br />But we might have the feeling that something peculiar happens if we try to carry this principle over to the calculation of <span style="font-family: Times, Times New Roman, serif;">P(A | DI)</span>. What is the result of observing 9 consecutive cars travelling to the left? It depends what has happened before, obviously. Suppose D<sub>1</sub> is now the result of 1 million observations, consisting of exactly 500,000 vehicles moving in each direction. The posterior assignment is almost exactly 50%. Now I see D<sub>2</sub>, those 9 cars travelling to the left - what is the outcome? The new prior is 50%, the same as it was before the first observation.<br /><br />What the hell is going on here? How do we account for the fact that these 9 vehicles have a much weaker effect on our rational belief now, than they would have done if they had arrived right at the beginning of the experiment? 
The outcome of Bayes' theorem is proportional to prior times likelihood: <span style="font-family: Times, Times New Roman, serif;">P(A | I)×P(D | AI).</span> Looking at 2 very different situations, 9 observations after 1 million, and 9 observations after zero, the prior is the same, the proposition, <span style="font-family: Times, Times New Roman, serif;">A</span>, is the same, and <span style="font-family: Times, Times New Roman, serif;">D</span> is the same. The rule of succession with n = N = 9 gives the same result in each case. It seems like we have a problem. We might solve the problem by recognizing that the correct answer comes by first getting <span style="font-family: Times, 'Times New Roman', serif;">P(f</span><sub style="font-family: Times, 'Times New Roman', serif;">A </sub><span style="font-family: Times, 'Times New Roman', serif;">| DI)</span> then finding its expectation, but how did we recognize this? Is it possible that we rationally reached out to something external to probability theory to figure out that direct calculation of <span style="font-family: Times, Times New Roman, serif;">P(A | DI)</span> would not work? Could it be that probability theory is not the complete description of rationality? (Whatever that means.)<br /><br />Of course, such flights of fancy aren't necessary. The direct calculation of <span style="font-family: Times, 'Times New Roman', serif;">P(A | DI)</span> works perfectly fine, as long as we follow the procedure correctly. 
Let's define 2 new propositions,<br /><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif;"><br /></span></div><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif;">L = A = "the next vehicle to pass will be travelling to the left,"</span></div><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif;">R = "the next vehicle to pass will be travelling to the right."</span></div><div style="text-align: center;"><br /></div>With D<sub>1</sub> and D<sub>2</sub> as before:<br /><br /><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif;">D<sub>1</sub> = "500,000 out of 1 million vehicles were travelling to the left"</span></div><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif;">D<sub>2</sub> = "Additional to </span><span style="font-family: Times, 'Times New Roman', serif;">D</span><sub style="font-family: Times, 'Times New Roman', serif;">1</sub><span style="font-family: Times, 'Times New Roman', serif;">, 9 out of 9 vehicles were travelling to the left"</span></div><br />Background information is given by<br /><br /><div style="text-align: center;"><span style="font-family: Times, Times New Roman, serif;">I<sub>1</sub> = "prior distribution over f is uniform, f is constant in the long run, </span><br /><span style="font-family: Times, 'Times New Roman', serif;">and subsequent events are independent"</span></div><br />From this we have the first posterior,<br /><!--**************************************************************************--><!-- ************************ A Numbered Equation ************************--><br /><a href="http://www.blogger.com/blogger.g?blogID=715339341803133734" id="eq2"></a><br /><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><span style="text-align: center;">P(L | 
</span><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">I</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub><span style="text-align: center;">) = 0.5</span><span style="text-align: justify;"> </span></td> </tr></tbody></table></td> <td class="eqn_number">(2) </td> </tr></tbody></table><br />Now comes the crucial step, we must fully incorporate the information in <span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub><br /><!--**************************************************************************--><!-- ************************ A Numbered Equation ************************--><br /><a href="http://www.blogger.com/blogger.g?blogID=715339341803133734" id="eq3"></a><br /><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">I</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">2</sub><span style="font-family: Times, 'Times New Roman', serif; text-align: center;"> = </span><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">I</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub></td> </tr></tbody></table></td> <td class="eqn_number">(3) </td> </tr></tbody></table><br />Now, after obtaining <span style="font-family: Times, 'Times New Roman', serif; text-align: 
center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">2</sub>, the posterior for <span style="font-family: Times, Times New Roman, serif;">L</span> becomes<br /><table cellpadding="0" cellspacing="0" class="num_eqn"> <tbody><tr> <td><table align="center" cellpadding="0" cellspacing="0"> <tbody><tr> <td><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-zL62ROuvg4A/Uk7hc8NGILI/AAAAAAAACK0/NnbOYVvAuKw/s1600/sunrise_posterior.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="50" src="http://2.bp.blogspot.com/-zL62ROuvg4A/Uk7hc8NGILI/AAAAAAAACK0/NnbOYVvAuKw/s1600/sunrise_posterior.JPG" width="400" /></a></div></td> </tr></tbody></table></td> <td class="eqn_number">(4) </td> </tr></tbody></table><br /><!--*****************************************************************************--> When we pose and solve a problem that's explicitly about the frequency, f, of the data-generating process, we often don't pay much heed to the updating of <span style="font-family: Times, Times New Roman, serif;">I </span><span style="font-family: inherit;">in equation (3), because</span> it is mathematically irrelevant to the likelihood, P(<span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">2</sub> | <span style="font-family: Times, 'Times New Roman', serif;">f</span><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">I</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub>). 
Assuming a particular value for the frequency renders all the information in <span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub> powerless to influence this number. But if we are being strict, we must make this substitution, as <span style="font-family: Times, 'Times New Roman', serif;">I</span> is necessarily defined as all the information we have relevant to the problem, apart from the current batch of data (<span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">2</sub>, in this case).<br /><br />The priors in equation (4) are equal, so they cancel out. The likelihood is not hard to calculate, remember what it means: the probability to see 9 out of 9 travelling to the left, given that 500,000 out of 1,000,000 were travelling to the left, previously, and given that the next one will be travelling to the left. That is, what is the probability to have 9 out of 9 travelling to the left, given that in total n = 500,001 out of N = 1,000,001 travel to the left. We can use the same procedure as before to calculate the probability distribution over the possible frequencies, <span style="font-family: Times, Times New Roman, serif;">P(f<span style="text-align: center;"> | L</span><span style="text-align: center;">I</span><sub style="text-align: center;">2</sub>). 
</span><span style="font-family: inherit;">For any given frequency, the assumption of independence in</span><span style="font-family: Times, Times New Roman, serif;"> </span><span style="font-family: Times, 'Times New Roman', serif; text-align: center;">I</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub><span style="font-family: Times, Times New Roman, serif;"> </span><span style="font-family: inherit;">means that the only information we have about the probability for any given vehicle's direction is this frequency, and so the probability and the frequency have the same numerical value. This means that for any assumed frequency, the probability to have 9 in a row going to the left is </span>f <sup>9</sup>, <span style="font-family: inherit;">from the product rule. But since we have a probability distribution over a range of frequencies, we take the expectation by integrating over the product</span> P(f)<span style="font-family: Times, 'Times New Roman', serif;">×</span>f <sup>9</sup>.<br /><br />We can do that integration numerically, and we get a small number: 0.00195321. The counterpart of the likelihood, the one conditioned on R rather than L, is obtained by an analogous process. It produces another small, but very similar number: 0.00195318. From these numbers, the ratio in equation (4) gives 0.5000045, which does not radically disagree with the 0.5 we already had from equation (2). (For comparison, if N = n = 9 was the complete data set, the result would be <span style="font-family: Times, Times New Roman, serif;">P(L)</span> = 0.9091, as you can easily confirm.) Thus, when we do the calculation properly, a sample of only 9 makes almost no difference after a sample of 1 million, and peace can be restored in the cosmos. 
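<br /><br />These numbers can also be checked without numerical integration. Here is a minimal Python sketch. It assumes the uniform prior of I<sub>1</sub>, so that the distribution over f after n lefts in N observations is a Beta(n + 1, N - n + 1) distribution whose moments have a closed form, and it assumes that equation (4) is the usual two-hypothesis form of Bayes' theorem. The function names are illustrative.<br /><br />

```python
from fractions import Fraction


def rule_of_succession(n_left, n_total):
    """Laplace's rule: probability that the next vehicle goes left."""
    return Fraction(n_left + 1, n_total + 2)


def run_probability(n_left, n_total, run_length):
    """Expectation of f**run_length under the Beta(n_left + 1,
    n_total - n_left + 1) distribution that follows from a flat prior."""
    a = n_left + 1
    b = n_total - n_left + 1
    result = Fraction(1)
    for j in range(run_length):
        # Closed-form Beta moment: product of (a + j) / (a + b + j).
        result *= Fraction(a + j, a + b + j)
    return result


# Prior for L after D1 alone (500,000 lefts out of 1,000,000).
prior_left = rule_of_succession(500000, 1000000)

# Likelihoods of D2 (9 lefts in a row), conditioned on D1 plus one further
# vehicle assumed to go left (L) or right (R).
like_left = run_probability(500001, 1000001, 9)
like_right = run_probability(500000, 1000001, 9)

# Assumed form of equation (4): two-hypothesis Bayes' theorem.
posterior_left = (prior_left * like_left) / (
    prior_left * like_left + (1 - prior_left) * like_right
)

print(float(like_left), float(like_right), float(posterior_left))
print(float(rule_of_succession(9, 9)))  # the N = n = 9 comparison case
```

Exact fractions are used here only to keep rounding error out of a comparison between numbers that differ in the sixth decimal place; ordinary floating point would serve for a rough check.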
<br /><br />Using the same procedure, we can confirm also that combining <span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">1</sub> and <span style="font-family: Times, 'Times New Roman', serif; text-align: center;">D</span><sub style="font-family: Times, 'Times New Roman', serif; text-align: center;">2</sub> into a single data set, with N = 1,000,009 and n = 500,009, gives precisely the same outcome for <span style="font-family: Times, Times New Roman, serif;">P(L | DI)</span>, 0.5000045, exactly as it must.<br /><br /><br /><br /><br /></div>