Maximum Entropy: Entropy Games

In 1948, Claude Shannon, an electrical engineer working at Bell labs, was interested in the problem of communicating messages along physical channels, such as telephone wires. He was particularly interested in issues like how many bits of data are needed to communicate a message, how much redundancy is appropriate when the channel is noisy, and how much a message can be safely compressed.

In that year, Shannon figured out¹ that he could mathematically specify the minimum number of bits required to convey any message. You see every message, every proposition, in fact, whether actively digitized or not, can be expressed as some sequence of answers to yes / no questions, and every string of binary digits is exactly that: a sequence of answers to yes / no questions. So if you know the minimum number of bits required to send a message, you know everything you need to know about the amount of information it contains.

Many possible sets of yes / no questions can yield the same message, but they're not all the same length. Consider the game of 20 questions, in which one player has to guess or deduce the name of a person that the other player is thinking of, simply by asking questions to which the answer will be either 'yes' or 'no.' A really crappy strategy when playing this game would be to ask the following set of questions:

Is it Dorothy Hodgekin? [Nope.]
Is it Emily Noether? [No!]
Is it Caroline Herschel? [No, dumb-ass!]
Is it Margaret Cavendish? [Oh, for crying out loud.]
etc.

A strategy like this is likely to take an awfully long time to finish the game, particularly as the correct answer may be a man with no outstanding scientific credentials. A better way to play the game is to devise questions that divide the set of possibilities as nearly in half as possible. A good first question is therefore, 'is it a male?' From there, we could proceed with, 'is it a famous person?' and so on, producing a significantly smaller set of possibilities with each answer. With good interrogation, the size of the hypothesis space decays exponentially, while naive guessing, such as the examples above, leaves the range of possibilities virtually unchanged.

So an optimum string of bits for a message is one such that each bit reduces the number of possible messages the sender might be sending by 50% (with a caveat we'll see in a moment). Thus, if we have a set of 128 symbols, say lower- and upper-case English alphabet, digits 0-9, several punctuation marks, and various other symbols (the ASCII system), we don't need 128 bits to transmit each symbol, only 7. The first bit answers the question: 'is it in the first half of the list?' The second bit answers the question: 'of the remaining list of possibilities, is it in the first half of the list?' and so on. The 7th bit, therefore, leaves a list containing exactly one symbol.

Suppose our alphabet - the complete set of symbols out of which our messages are to be constructed - consists of n symbols. If all symbols in our alphabet occur equally frequently, then the number of bits, call it x, needed to transmit one symbol is given by n = 2^x. The solution to this equation, of course, is

x = log₂(n)

But if all symbols occur equally frequently, then the probability, p, for the i^th symbol in a given message to be one in particular is 1/n, from the principle of indifference, so

x = log₂(1/p)

or, from the laws of logs (special case of the quotient rule, noting that log(1) = 0)

x = -log₂(p)

In normal communication, however, the symbols do not usually occur equally frequently. Furthermore, a message will often contain symbols that are unnecessary. Consider the text you are reading now. This text consists of an arrangement of white and black pixels. The state of each pixel is determined by answering a single yes / no question, but do you actually need to receive all those bits at your visual cortex in order to read the text? Hopefully, this next game will answer that question in the negative. Try to read the following partially obscured text:

Tests (upublished work, conducted by me, very small sample) reveal that most people can cope with reading this. Similarly, most people can manage to interpret SMS text, composed with many of the vowels excluded from their proper places.

Luckily, as is not too hard to see, the above formula generalizes to the case where symbols don't carry equal amounts of information, which happens when they don't occur with equal frequency, or where some parts of a message are superfluous (the obscured pixels along the top of that message, above, evidently didn't convey any additional useful information).

Suppose, for example, that we have a set of only 4 symbols: A, B, C, and D. Suppose that in an average message, A appears 50% of the time, and the remaining letters are half B's, a quarter C's, and a quarter D's. Dividing the list of possibilities into 2 halves with the first bit is not optimally efficient here (that caveat I warned you about). Such a strategy would take 2 bits to transfer each symbol, even though the C's and D's carry more information than the A's.

Rather than splitting the list in half, it is clearly the probability distribution over the list we should split in half, and because A's occur half the time, our first question is, 'is it an A?' If the answer is yes, then we have a symbol from only one bit. If the answer is no, then the next question is, 'is it a B?' since B's occur on 50% of the occasions when its not an A. Sometimes we'll need to ask a third question, 'is it a C?' But that'll only happen 25% of the time, and the average number of bits needed to transfer 1 symbol will be (0.5 × 1) + (0.25 × 2) + (0.25 × 3) = 1.75 bits.

So the number of bits needed to transfer a symbol is still a function of the prior probability that that symbol would have been the one sent. If the i^th symbol has prior probability associated it of p_i, then:

x_i = -log₂(p_i)

The average of x, over the entire alphabet of symbols, is just the expectation over this expression, and is what we call the entropy:

H = 〈x_i〉 = -Σ p_ilog₂(p_i)

which, thankfully, is the same expression we used to solve the kangaroo problem, before. In fact, when calculating entropy, the base is not all that important - a different base will give a different number, but if we consistently use the same base, then our numbers will vary consistently. Often, the natural base is preferred, while sometimes base 10 is used, but then the units are not bits, but respectively, nats or Harts. Note that this formula (with base 2) reproduces the 1.75 obtained for the A's, B's, C's, and D's, above.

Why does this formula from communication theory have anything to do with science and inference?

Well, every scientific experiment, in fact every observation or experience, can be considered as a message from Mother Nature to us. Every bit of information from Mother Nature reduces the number of plausible universes by half, and we can count the bits using Shannon's theory. (By the way, I should be careful: the universe hates it when we personify her.) In the next post, I'll give some insight into why the maximum entropy principle (which I've already applied to kangaroos) is a valid tool in statistical inference.

Why do we call it entropy?

Good question, thank you very much for asking. Legend has it, Shannon initially adopted the term entropy primarily because of the remarkable similarity of his formula to an equation already used by physicists to calculate thermodynamic entropy. It was supposed to be an analogy, but several authors see it as more. There is nothing resembling consensus on this issue, but it seems clear to me that the information theoretic and thermodynamic uses of the term entropy are essentially the same.

Borrowing from Arieh Ben-Naim's examples in Entropy Demystified ², we can look at one of the canonical illustrations of physical entropy changing: the expansion of a gas in a closed container. The box depicted below has a partition down the middle. The left side contains gas molecules, while the right side has been evacuated. For each gas molecule, we play again the 20 questions game to figure out where it is - except that 4 questions are enough to fix its location to one of the 16 regions (to the left of the partition) marked with the dotted lines.

At a certain point of time, we cause the partition to dissolve, allowing the gas to spread to all parts of the box, as shown in the second picture:

The physicist and the information theorist are agreed: entropy has increased. From physics (see for example), because the temperature hasn't changed, the change in entropy is proportional to the log of the ratio of the volumes occupied by the gas after and before the partition was removed. That is, log(2) = 1 more unit of entropy per molecule has been added (in bits, rather than nats).

From the point of view of information theory, in order to locate any particular molecule, I now have to find the right place from among 32 little regions - I need one more bit of information than I needed before (the total entropy of a message is the average symbol entropy multiplied by the length of the message).

(We could also have achieved an increase in physical entropy by increasing the temperature rather than the volume, but then, according to the Maxwell distribution, the width of the probability curve associated with each particle's velocity would also increase, thus increasing the number of bits needed to pinpoint each particle's momentum.)

It is important to note that the even dispersal of the gas throughout the box, after the partition is removed, is not caused by our amount of information having been reduced, as some of my teachers suggested when I was an undergraduate. Of course, this belief commits the mind-projection fallacy, and gets the situation exactly backwards.

As we'll see in the next post, the reason that physical systems tend to be found in or evolving towards states of maximal entropy (the second law of thermodynamics) is exactly the same reason that maximum-entropy distributions are appropriate for inference under missing information (the maximum entropy principle): those maximum-entropy states / distributions have the highest multiplicity.

References

[1]	Shannon, C.E., "A Mathematical Theory of Communication," Bell System Technical Journal 27 (3), pages 379–423, 1948 (Download here.)
[2]	Ben-Naim, A., "Entropy Demystified," World Scientific Publishing Company, 2008

6 comments:

Rasmus BååthOctober 26, 2013 at 4:44 AM
Again a great post! For an introduction to entropy would you recomend "Entropy demystified" or some other book? I've seen that on Amazon Ben-Naim have written many books on the same topic...
UnknownOctober 29, 2013 at 9:45 AM
Hi Tom,

Can you please expand on the following passage:

"It is important to note that the even dispersal of the gas throughout the box, after the partition is removed, is not caused by our amount of information having been reduced, as some of my teachers suggested when I was an undergraduate. Of course, this belief commits the mind-projection fallacy, and gets the situation exactly backwards."

Do you mean to say that the right way to think about it is that segmenting the gas in the left chamber generates more information rather than saying that information is reduced through dispersion?

Thanks :)

Saturday, October 26, 2013

Entropy Games

6 comments: