Here's a problem that cropped up in the course of some calculations I have been working on recently. In stating the problem, I'll include the physical context, though this context isn't important for the rest of the discussion, which applies generally to certain manipulations on probability distributions:
A high-energy massive particle (e.g. a proton or α-particle), whose initial kinetic energy is governed by a certain probability distribution, passes through a slab of material. As it travels through the material, it scatters some of the electrons inside the slab, dissipating a small fraction of its energy. The amount of energy it loses also has some probability distribution (the so-called 'straggling function'). What is the probability distribution over the energy of the particle as it exits the slab of material?
I'm sure all of us wrestle with exactly this question several times each day. The problem concerns the probability distribution over the difference between two independent random variables^{1} (in this case the particle's initial energy and the energy it deposits in the slab of material).
The route to solving this problem involves utilizing the solution to a related problem, namely the probability distribution over the sum of two random variables, so first let's look at that.
The sum of two independent random numbers
In the previous post, I stated that the distribution over this sum is the convolution of the two distributions from which the two random variables are sampled. First, I'll give my explanation of why this is so. Later, to mitigate the risk that my reasoning is wrong, I'll show the results of some computational experiments that test this reasoning.
Suppose we have a parameter, x, and two probability distributions, which we model using continuous functions f(x) and g(x), from which the numbers x_{1} and x_{2} are respectively sampled. Let's call the sum of these two x_{sum}.
For any given value of x_{sum}, there are many possibilities for how we got it. We could have a small value for x_{1} added to a large value for x_{2}. Or, for example, we could add any small number, δ, to x_{1}, and then subtract the same δ from x_{2}, to get the same outcome for x_{sum}.
For a given value of x_{sum} and a given value of x_{1}, x_{2} must be given by (x_{sum} − x_{1}), and the probability for that event is given by the product rule, simplified by our assumption of independence^{2}:

P(x_{1}, x_{2}) = f(x_{1}) g(x_{sum} − x_{1})
But, as we noted, there are many combinations of x_{1} and x_{2} that could produce any given x_{sum}. Thus, the probability for any particular x_{sum} will be given by the extended sum rule (noting the exclusivity of the different events). We just add up all the probabilities for all the contributing possibilities:
P(x_{sum}) = ∫ f(x_{1}) g(x_{sum} − x_{1}) dx_{1}        (1)
The integral given in Equation 1 is the definition of the convolution of the two functions f(x) and g(x), denoted f(x)*g(x).
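As a concrete example (unrelated to the physics above), the distribution of the sum of two fair dice is the convolution of the two single-die distributions; a minimal sketch:

```python
import numpy as np

# Probability mass function of one fair die: faces 1-6, stored at indices 0-5
die = np.ones(6) / 6

# Distribution of the sum of two dice: the convolution of the two PMFs.
# Index k of the result corresponds to a sum of k + 2.
two_dice = np.convolve(die, die)
```

The result is the familiar triangular distribution over sums 2 to 12, peaking at 7 with probability 6/36.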
Quality assurance
In principle, this argument should be enough, but since computers are now powerful enough that running a decent Monte Carlo simulation costs very little, it is reasonable to test our reasoning using this technique.
The easiest way to perform this Monte Carlo calculation is to convert the PDFs f(x) and g(x) to CDFs, then use a numerical random number generator, uniform between 0 and 1, to sample x_{1} and x_{2}. At each iteration, we get two random numbers, one for each of f(x) and g(x), then we simply find the value of x at which the appropriate CDF matches the random number. We then build a histogram of the generated sums.
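That CDF-matching step can be sketched in a few lines of numpy (the arrays here are illustrative, not the f(x) and g(x) used in the figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary illustrative PDF defined on a grid of integer x-values
pdf = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
pdf = pdf / pdf.sum()
cdf = np.cumsum(pdf)
cdf[-1] = 1.0  # guard against floating-point round-off

# For each uniform random number r, find the first grid index at which
# the CDF reaches r -- this is the sampled value of x
r = rng.uniform(0, 1, size=100_000)
samples = np.searchsorted(cdf, r)
```

With enough samples, the histogram of `samples` reproduces the shape of the input PDF.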
The results of this procedure are plotted below for arbitrary functions, f(x) and g(x), together with the result of direct calculation of the integral in Equation 1 (see appendix below for my simple numerical implementation of this).
Evaluation of Equation 1 (solid red curve, marked 'Direct integration') agrees, firstly, with common-sense expectations: the result has its maximum at x = 9,000, i.e. the sum of the locations where f(x) and g(x) are maximized (7,000 and 2,000, respectively), and has the expected shape, sharing the long right-hand tail of g(x) and the rounded peak of f(x).
Secondly, the result matches the Monte Carlo prediction perfectly (source code also shown in the appendix). Here I sampled 10,000 sums, x_{1} + x_{2}, and rescaled the resulting histogram to the height of the normalized PDF.
Logical analysis and simple experimentation agree. Our result can be considered robust! (And a couple of centuries' worth of mathematicians, who already knew the result to be correct, can continue to rest in peace.)
Then why does the analytical result matter?
In some special cases, analytical formulas can be derived from Equation 1. Going beyond those special cases, though, we usually need to think about computational speed. The Monte Carlo technique is certainly cheap now compared to 50 or 100 years ago, when scientists and engineers had to spend much more of their time finding analytical or approximate analytical solutions to all sorts of problems, but thanks to a special property of the convolution operation, there is an even quicker way to numerically calculate convolutions.
This property, termed the convolution theorem, is that f(x)*g(x) is the same as the inverse Fourier transform of the product of the Fourier transforms of f(x) and g(x). Thanks to Fast Fourier Transform (FFT) algorithms, numerical convolution can be executed in very little time.
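The theorem is easy to check numerically; here is a minimal sketch, with small, arbitrary arrays, comparing direct convolution against the product of Fourier transforms:

```python
import numpy as np

# Two small, arbitrary arrays standing in for sampled f(x) and g(x)
f = np.array([0.2, 0.5, 0.3])
g = np.array([0.1, 0.6, 0.2, 0.1])

# Direct (definition-based) convolution
direct = np.convolve(f, g)

# Convolution theorem: pad both arrays to the full output length,
# multiply their Fourier transforms, and transform back
n = len(f) + len(g) - 1
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)
```

The two results agree to floating-point precision; the padding to length len(f) + len(g) − 1 is what turns the FFT's circular convolution into the linear convolution we want.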
The convolve() function in the numpy package for Python, for example, runs in fast compiled code (numpy.convolve itself computes the sum directly; FFT-based routines such as scipy.signal.fftconvolve are available for large arrays). When I repeated the calculation in the above figure, using numpy.convolve() instead of my (admittedly crude) integration procedure, I got pretty much exactly the same result, but it executed about 1,000 times faster. (Admittedly, much of this performance improvement is due not to the algorithm's logic, but to its implementation: numpy calls code that is written in C, C++, and Fortran, and can therefore execute the same logic much faster than it could in native Python.)
The difference between two independent random numbers
Now we just need to find a way to adapt our result to problems of the form stated at the beginning of this article. Modifying our earlier reasoning to consider x_{diff} = x_{1} − x_{2}, we note that for a given x_{2} and x_{diff}, it must be that x_{1} = x_{diff} + x_{2}. Consequently, we find that our required probability distribution is
P(x_{diff}) = ∫ f(x_{diff} + x_{2}) g(x_{2}) dx_{2}        (2)
Now, Equation 2 is technically sufficient to solve the problem, but as we have seen, direct integrations like those in Equations 1 and 2 can be pretty slow. Since Equation 2 is so tantalisingly similar to Equation 1, it would be great if there were some mathematical trick to make Equation 2 look more like Equation 1. This would allow us to calculate P(x_{diff}) using our super-speedy FFT-based numerical procedure. It turns out there is such a trick.
Instead of subtracting a random number drawn from g(x), what we can do is add a number drawn from g(−x), which is g(x) reflected through the y-axis. This is the function we'd like to convolve with f(x).
Unfortunately, convolution algorithms, such as numpy.convolve(), don't know anything about the x-axis: they only receive a set of y-values, so as far as the algorithm is concerned, x = 0 corresponds to the number at the beginning of the list, and so on. There is no direct way for us to tell numpy.convolve() that these y-values correspond to negative x-values.
All is not lost, however. If we simply reverse the list of numbers specifying the values of g(x), and convolve f(x) with that, it is as if we were calculating the distribution over the sum of numbers drawn from f(x) and g(−x), except that the result is shifted to the right by the length of the array defining g(x), minus one. Thus, to recover the correct position of the convolved distribution, we must chop that many elements off the beginning of the list of numbers specifying the result. My code to execute this is shown in the appendix.
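The index bookkeeping involved in this reverse-and-shift trick can be verified on small, arbitrary arrays by comparing against a brute-force enumeration of all (x_{1}, x_{2}) pairs:

```python
import numpy as np

# Two small, arbitrary PMFs on the grid x = 0, 1, 2, ...
f = np.array([0.1, 0.3, 0.4, 0.2])  # distribution of x1
g = np.array([0.6, 0.3, 0.1])       # distribution of x2

# Convolve f with the reversed g; index k of the full result then
# corresponds to the difference x1 - x2 = k - (len(g) - 1)
conv = np.convolve(f, g[::-1])

# Brute-force check: enumerate every (x1, x2) pair
brute = np.zeros(len(f) + len(g) - 1)
for x1, pf in enumerate(f):
    for x2, pg in enumerate(g):
        brute[(x1 - x2) + (len(g) - 1)] += pf * pg
```

The two arrays agree exactly, confirming that the first len(g) − 1 elements of the convolution hold the negative differences.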
Testing the procedure is now super easy. We already have the Monte Carlo procedure that we used above, and we know that it works. Converting it to our current purposes requires only changing one '+' to a '−'. The verification looks like this:
The overlap between the random experiment and the nifty FFT procedure is good enough to conclude that my reasoning was again sound. For your interest, my Monte Carlo took almost 4 times longer to run, but gives a spectrum with only 50 bins, covering a range spanned by about 4,000 bins with the FFT method. Thus, I'd need to run the Monte Carlo about 80 times longer again to achieve the same resolution without compromising the signal-to-noise ratio.
Notes
[1] 
In fact, this is an approximation. The straggling function is really a function of the particle's initial energy, and our two random variables are not strictly independent. If the distribution over the initial energy is narrow enough, however, it'll be safe to assume that any changes to the straggling function over the range of energies will be too small to significantly affect the outcome of the calculation.

[2] 
Events A and B are conditionally independent if and only if P(A|B,I) = P(A|I) and P(B|A,I) = P(B|I).

[3] The convolution operation is commutative: f(x)*g(x) is identical to g(x)*f(x).
Appendix: Python source code
Here is the function I used to calculate the distribution over the sum of 2 numbers, by direct calculation of the integral in Equation 1 (this method was about 1,000 times slower than the numpy.convolve() function):
import numpy

def sum_by_integration(f, g):
    # f and g are numpy arrays giving the magnitude of the functions at each location.
    # Note that f and g are defined on the same grid.
    n = len(f) + len(g) - 1  # length of the full convolution
    s = numpy.zeros(n)
    for ctr in range(n):
        for step in range(len(f)):
            if ((ctr - step) >= 0) and ((ctr - step) < len(g)):
                s[ctr] += f[step] * g[ctr - step]
    return s
Here's my function to calculate the CDF for the Monte Carlo method. Numpy has its own function for this, numpy.cumsum(), but I'll give you this recipe, which can be translated into other languages:
def CDF(pdf):
    # pdf is an array giving the magnitude of the probability distribution function at each location.
    cdf = numpy.zeros(numpy.shape(pdf))
    cdf[0] = pdf[0]
    counter = 0
    for p in pdf[1:]:
        counter += 1
        cdf[counter] = cdf[counter - 1] + p
    cdf = cdf / float(cdf[-1])  # normalize so that the CDF ends at unity
    return cdf
And here's the Monte Carlo implementation itself:
def sum_by_MC(f, g, N=10000):
    cdf1 = CDF(f)
    cdf2 = CDF(g)
    sums = []
    counter = 0
    while counter < N:
        counter += 1
        r1, r2 = numpy.random.uniform(0, 1, 2)
        # use nonzero() to find the index where each CDF first exceeds the random number
        x1 = numpy.nonzero(cdf1 >= r1)[0][0]
        x2 = numpy.nonzero(cdf2 >= r2)[0][0]
        sums.append(x1 + x2)
    hist, edges = numpy.histogram(sums, bins=50)
    return hist, edges
My code to calculate the distribution of the difference between two random variables is given next:
def difference_by_FFT(f, g):
    # calculate the distribution over f - g
    g = g[::-1]  # reverse g(x) so that it represents a shifted g(-x)
    difference = numpy.convolve(f, g, mode='full')  # numpy's fast convolution routine
    difference = difference[len(g) - 1:]  # chop off the shift introduced by reversing g
    difference = difference / numpy.sum(difference)  # normalize
    return difference