Friday, March 8, 2013

What is Randomness?

Random variables play an important part in the vocabulary of probability theory. I think there's a lot of confusion, though, about what randomness actually is. A few days ago, I found an expert statistician trying to distinguish between mere statistical fluctuation and actual changes in the causal environment. Another example that always bugs me comes from computer science: the ubiquitous insistence that a deterministic algorithm cannot produce random numbers, only pseudo-random numbers.

Each of these examples commits a fallacy. The first may be an isolated slip-up, or else a deliberate attempt to gloss over technicalities with sloppy language (I may even be guilty (gasp!) of either of these myself, on occasion), but the second is almost universal within the entire profession of computer scientists, which represents a significant sample of the world's technically minded. If any of those computer scientists understood what randomness is, they would recognize immediately that the need to distinguish between random and pseudo-random is entirely fictional.

It's not that I have anything against computer scientists. Information technology, after all, is what this blog is all about: systematically processing knowledge, and modern society owes its existence to computer science. Nor do I think that computer scientists are excessively prone to the fallacy I'm talking about. In fact, the statistical literature, compiled by those who, of all people, should have dedicated considerable effort to understanding this topic, is crammed with instances, of which my initial example is representative. As another example, the current Wikipedia entry on randomness contains a very confused section that starts with the statement: 'Randomness, as opposed to unpredictability, is an objective property.'

So what is the problem? Let's look at pseudo-random numbers first. The point about pseudo-random numbers is that they are produced in a clever way to 'replicate' a random variable - they come appropriately distributed and they appear uncorrelated, which is to say that even knowing the nth, (n-1)th, ... numbers in a list, it will be impossible to predict what the (n+1)th number will be. They are considered to be not truly random, however, because they are produced by mechanical operations of a computer on fixed states of its circuits. If we only knew the states of the circuits and the operations, then we would know the number that will come out next. To say that this prevents us from describing the numbers as random, however, is an instance of the mind-projection fallacy, as I have discussed before.
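To make this concrete, here is a toy sketch (a minimal linear congruential generator, purely illustrative - the constants are glibc's, but any would do). The 'random' output is fixed entirely by the internal state, so two copies with the same state agree forever:

```python
class TinyLCG:
    """Toy linear congruential generator: state -> (a*state + c) mod m."""

    def __init__(self, seed):
        self.state = seed  # the 'fixed state of the circuits'

    def next(self):
        # Purely deterministic arithmetic on the current state.
        self.state = (1103515245 * self.state + 12345) % 2**31
        return self.state

gen1 = TinyLCG(seed=42)
gen2 = TinyLCG(seed=42)  # same state => same 'unpredictable' sequence

a = [gen1.next() for _ in range(5)]
b = [gen2.next() for _ in range(5)]
assert a == b  # knowing the state removes all surprise
```

To an observer who does not know the seed, the sequence is (for practical purposes) unpredictable; to one who does, there is no uncertainty at all.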

The mind projection fallacy consists of assuming that properties of our model of reality necessarily correspond to properties of reality. We often talk about a phenomenon being random. This makes it tempting to conclude that randomness is a property of the phenomenon itself, but this assumes too much, and also demands that the events we are talking about take place in some kind of bubble, where physics doesn't operate.

When I dip my hand into an urn to draw out a ball whose colour can't be predicted, the colour that comes out is rightly considered to be a random variable. But we are not talking about some quantum-mechanical wavefunction that collapses the moment the first photon from the ball hits my retina (or the moment the nerve impulse reaches my visual cortex, or any of a million other candidate moments). We are talking about a process with real and definite cause-and-effect relationships. The layout of the coloured balls inside the urn, the trajectory of my hand into the urn, and the exact moment I decide to close my hand uniquely determine the outcome of the draw. Randomness is not the occurrence of causeless events, but is a consequence of our incomplete information. It's not a property of the balls in the urn, but a property of our prior state of knowledge.

It might be that at microscopic scales, quantum stochastic variability really does emerge from an absence of causation, but there are two important points to note in relation to this discussion. Firstly, good scientists recognize the need for agnosticism on this front - there really isn't enough evidence yet to decide one way or the other (BBC Radio 4's excellent 'In Our Time' has an episode, entitled 'The measurement problem,' with an interesting discussion on the topic). Secondly, the vast majority of cases where the concept of randomness is applied concern macroscopic phenomena, where classical mechanics is a perfectly adequate model. For these reasons, the only sensible general usage of the word 'random' is when referring to missing information, rather than as a description of uncaused events. That Wikipedia article I quoted from, in apparent recognition of this, later cites Brownian motion and chaos as examples of randomness, thereby contradicting the earlier quote (though inexplicably, the two are identified as separate classes of random behaviour).

Getting back to the urn, if I knew precisely the coordinates (relative to my hand) and colours of the balls inside, the colour of the extracted sphere would not be a surprise, and therefore wouldn't be a random variable. But under the standard drawing conditions, in which these are not known, it is a random variable. Similarly, knowing the state and operations of a deterministic computer algorithm would render its output non-random (provided I have the computational resources elsewhere (and the inclination) to replicate those operations), but this does not affect the randomness of its output when we don't know these things.
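The same point can be demonstrated with any standard PRNG. A sketch using Python's random module, which conveniently exposes its internal state: once the state is known, every future draw is predictable, and the 'random' variable carries no surprise at all.

```python
import random

rng = random.Random(2013)
state = rng.getstate()  # peek at the 'states of the circuits'

# Ten die rolls: random to anyone ignorant of the state...
observed = [rng.randrange(6) for _ in range(10)]

# ...but perfectly predictable to anyone who knows it.
clone = random.Random()
clone.setstate(state)
predicted = [clone.randrange(6) for _ in range(10)]

assert predicted == observed
```

Nothing about the machine changed between the two viewpoints; only the observer's information did.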

And finally, how can there be a distinction between statistical fluctuations and changes in the causal environment? If I repeat the experiment with the urn and get a different coloured ball the second time, which is that, sampling variability, or a difference in causes? Both, of course. What could sampling variability (at the macroscopic scale) be the result of, if not mechanical differences in the evolution of the experiment? If I toss a coin four times and get four heads in a row, in one sense, that's a statistical fluke, but in another, it's the inevitable result of a system obeying completely deterministic mechanical laws. All that decides the level at which we find ourselves discussing the matter is our degree of awareness of the states and operations of nature.
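At the level of incomplete information, the calculation is trivial: P(four heads) = (1/2)^4 = 1/16. A quick simulation (a hypothetical sketch, using a deterministic PRNG - which, fittingly, is the whole point) agrees:

```python
import random

rng = random.Random(0)  # a deterministic 'coin', per the discussion above
trials = 100_000

hits = 0
for _ in range(trials):
    # One experiment: four tosses; count runs of four heads.
    if all(rng.random() < 0.5 for _ in range(4)):
        hits += 1

freq = hits / trials
print(freq)  # close to 1/16 = 0.0625
```

Every one of those simulated flukes was the inevitable output of deterministic arithmetic; the 1/16 describes our ignorance, not the machine.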

Ok, so we live in a world polluted by some sloppy terminology, but does it really matter? I think it does. Every statistical model is an attempt to describe some physical process. As long as we systematically deny the action of physics on any aspect of these processes, we close off access to potentially valuable physical insight. This is actually the problem that the attempt, discussed above, to separate sampling fluctuation from causal variation was trying to address, but that half-hearted formulation serves only to postpone the problem until some later date.


  1. I share your point of view. I also agree that it is much more than just sloppy terminology. Language has a critical influence on how we think and perceive the world. The following quote from Poincaré warns us about the costs we may still be paying in many scientific areas:

    "A well-made language is no indifferent thing; not to go beyond physics, the unknown man who invented the word heat devoted many generations to error. Heat has been treated as a substance, simply because it was designated by a substantive, and it has been thought indestructible" [The Foundations of Science]

    The argument that made me consider this perspective, and accept it, was E. T. Jaynes' posthumous book. Do you know of other books that deal with this subject and are worth reading?

  2. I agree absolutely with the spirit of Poincaré's remark, but I'm not completely convinced by his example - it seems like a bit of a chicken/egg problem to me. Certainly, though, there are bad formulations that severely impede physical insight.

    We also need to keep in mind, though, that the dependence of understanding on our choice of vocabulary has no bearing on the truth values of objective facts.

    Reading Jaynes' discussion of randomness was the first time I found evidence of somebody else thinking the same way as me. I'm sure there are others, but I can't think of any authors at the moment. If your research turns up other sources, please consider letting us know.

  3. The distinction between pseudo-random and "truly" random, as I use the terms, is indeed useful, but it has more to do with ease of replicability and control than anything else. Specifically, a process is pseudo-random if I entirely control its parameters and so can recreate its results exactly (which is usually accomplished using a known algorithm and a known seed), as distinct from a process that has inputs I cannot control (e.g. reading bits from an EGD or from a third party). A coin-flipping robot can be pseudo-random in this sense, but coin-flipping done by humans is typically not. Pseudo-random processes can of course be used interchangeably with "true" random processes for most applications, but sometimes (as in game tournaments) replicability is desirable and sometimes (as in most security applications) it is not.

    I'll grant you that there is a lot of misunderstanding out there around what randomness and probability are (those are some gems on the Wikipedia page), including among computer scientists, and the terminology is indeed vague and inconsistent, but to me this example feels like attacking a strawman.

    1. Certainly, degree of controllability is a very important technological parameter - the Mersenne Twister is fine for my Monte-Carlo simulations, but I would never use it for cryptography, as only a few hundred consecutive observations are enough for a mathematically literate person to be able to predict all future numbers exactly. In a sense, we make the sequence random artificially, by deliberately not examining information that is in principle available to us, and I accept the point that 'pseudo-random' could have a valid usage.
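The prediction alluded to above is a well-known exercise: MT19937's output 'tempering' is an invertible bitwise transform, so 624 consecutive 32-bit outputs suffice to reconstruct the generator's entire internal state. A sketch (assuming direct access to raw 32-bit outputs from Python's random module, which uses the Mersenne Twister):

```python
import random

def undo_right(y, shift):
    # Invert y ^= y >> shift by repeated re-substitution.
    result = y
    for _ in range(5):
        result = y ^ (result >> shift)
    return result

def undo_left(y, shift, mask):
    # Invert y ^= (y << shift) & mask.
    result = y
    for _ in range(5):
        result = y ^ ((result << shift) & mask)
    return result & 0xFFFFFFFF

def untemper(y):
    # Reverse MT19937's tempering steps, in reverse order,
    # recovering the raw state word behind each output.
    y = undo_right(y, 18)
    y = undo_left(y, 15, 0xEFC60000)
    y = undo_left(y, 7, 0x9D2C5680)
    y = undo_right(y, 11)
    return y

victim = random.Random(12345)
observed = [victim.getrandbits(32) for _ in range(624)]  # the observations

# Rebuild the 624-word state and install it in a clone
# (index 624 means 'state fully consumed; twist before the next output').
clone = random.Random()
clone.setstate((3, tuple(untemper(y) for y in observed) + (624,), None))

# The clone now predicts the victim's future output exactly.
assert [clone.getrandbits(32) for _ in range(10)] == \
       [victim.getrandbits(32) for _ in range(10)]
```

To the observer who has done this, the stream is no longer random; to everyone else, it remains so - which is exactly the point of the post.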

      Describing a process as 'truly random,' however, without specifying the observer's state of knowledge is the mind projection fallacy, yet this does appear to be the usual intention of the random/pseudo-random distinction, e.g. from Wikipedia's article on pseudorandom number generators:

      'The sequence is not truly random in that it is completely determined by a relatively small set of initial values, called the PRNG's state, which includes a truly random seed.'

  4. "It might be that at microscopic scales, quantum stochastic variability really does emerge from an absence of causation"

    Recent developments (e.g. the PBR theorem) certainly don't seem to leave much room for causation but, as you say, I don't think it matters anyway. The one thing I really dislike in Jaynes' book is that "But What About Quantum Theory" section. Rovelli's relational interpretation is a much better way of arguing against naïve conceptions of "physical probabilities", IMHO.