Rationality- From AI to Zombies

Page 68

by Eliezer Yudkowsky

You cannot hide behind a comforting shield of correct-by-definition. Both extensional definitions and intensional definitions can be wrong, can fail to carve reality at the joints.

Categorizing is a guessing endeavor, in which you can make mistakes; so it’s wise to be able to admit, from a theoretical standpoint, that your definition-guesses can be “mistaken.”

*

174

Entropy, and Short Codes

(If you aren’t familiar with Bayesian inference, this may be a good time to read An Intuitive Explanation of Bayes’s Theorem.)

Suppose you have a system X that’s equally likely to be in any of 8 possible states:

{X1, X2, X3, X4, X5, X6, X7, X8}.

There’s an extraordinarily ubiquitous quantity—in physics, mathematics, and even biology—called entropy; and the entropy of X is 3 bits. This means that, on average, we’ll have to ask 3 yes-or-no questions to find out X’s value. For example, someone could tell us X’s value using this code:

X1 : 001

X2 : 010

X3 : 011

X4 : 100

X5 : 101

X6 : 110

X7 : 111

X8 : 000.

So if I asked “Is the first symbol 1?” and heard “yes,” then asked “Is the second symbol 1?” and heard “no,” then asked “Is the third symbol 1?” and heard “no,” I would know that X was in state 4.

Now suppose that the system Y has four possible states with the following probabilities:

Y1 : 1∕2 (50%)

Y2 : 1∕4 (25%)

Y3 : 1∕8 (12.5%)

Y4 : 1∕8 (12.5%).

Then the entropy of Y would be 1.75 bits, meaning that we can find out its value by asking 1.75 yes-or-no questions.

What does it mean to talk about asking one and three-fourths of a question? Imagine that we designate the states of Y using the following code:

Y1 : 1

Y2 : 01

Y3 : 001

Y4 : 000.

First you ask, “Is the first symbol 1?” If the answer is “yes,” you’re done: Y is in state 1. This happens half the time, so 50% of the time, it takes 1 yes-or-no question to find out Y’s state.

Suppose that instead the answer is “No.” Then you ask, “Is the second symbol 1?” If the answer is “yes,” you’re done: Y is in state 2. The system Y is in state 2 with probability 1/4, and each time Y is in state 2 we discover this fact using two yes-or-no questions, so 25% of the time it takes 2 questions to discover Y’s state.

If the answer is “No” twice in a row, you ask “Is the third symbol 1?” If “yes,” you’re done and Y is in state 3; if “no,” you’re done and Y is in state 4. The 1/8 of the time that Y is in state 3, it takes three questions; and the 1/8 of the time that Y is in state 4, it takes three questions.

(1/2 × 1) + (1/4 × 2) + (1/8 × 3) + (1/8 × 3)

= 0.5 + 0.5 + 0.375 + 0.375

= 1.75.

The general formula for the entropy H(S) of a system S is the sum, over all Si, of -P(Si)log2(P(Si)).

For example, the log (base 2) of 1/8 is -3. So -(1∕8 × -3) = 0.375 is the contribution of state S4 to the total entropy: 1/8 of the time, we have to ask 3 questions.

You can’t always devise a perfect code for a system, but if you have to tell someone the state of arbitrarily many copies of S in a single message, you can get arbitrarily close to a perfect code. (Google “arithmetic coding” for a simple method.)

Now, you might ask: “Why not use the code 10 for Y4, instead of 000? Wouldn’t that let us transmit messages more quickly?”

But if you use the code 10 for Y4, then when someone answers “Yes” to the question “Is the first symbol 1?,” you won’t know yet whether the system state is Y1 (1) or Y4 (10). In fact, if you change the code this way, the whole system falls apart—because if you hear “1001,” you don’t know if it means “Y4, followed by Y2” or “Y1, followed by Y3.”

The moral is that short words are a conserved resource.

The key to creating a good code—a code that transmits messages as compactly as possible—is to reserve short words for things that you’ll need to say frequently, and use longer words for things that you won’t need to say as often.

When you take this art to its limit, the length of the message you need to describe something corresponds exactly or almost exactly to its probability. This is the Minimum Description Length or Minimum Message Length formalization of Occam’s Razor.

And so even the labels that we use for words are not quite arbitrary. The sounds that we attach to our concepts can be better or worse, wiser or more foolish. Even apart from considerations of common usage!

I say all this, because the idea that “You can X any way you like” is a huge obstacle to learning how to X wisely. “It’s a free country; I have a right to my own opinion” obstructs the art of finding truth. “I can define a word any way I like” obstructs the art of carving reality at its joints. And even the sensible-sounding “The labels we attach to words are arbitrary” obstructs awareness of compactness. Prosody too, for that matter—Tolkien once observed what a beautiful sound the phrase “cellar door” makes; that is the kind of awareness it takes to use language like Tolkien.

The length of words also plays a nontrivial role in the cognitive science of language:

Consider the phrases “recliner,” “chair,” and “furniture.” Recliner is a more specific category than chair; furniture is a more general category than chair. But the vast majority of chairs have a common use—you use the same sort of motor actions to sit down in them, and you sit down in them for the same sort of purpose (to take your weight off your feet while you eat, or read, or type, or rest). Recliners do not depart from this theme. “Furniture,” on the other hand, includes things like beds and tables which have different uses, and call up different motor functions, from chairs.

In the terminology of cognitive psychology, “chair” is a basic-level category.

People have a tendency to talk, and presumably think, at the basic level of categorization—to draw the boundary around “chairs,” rather than around the more specific category “recliner,” or the more general category “furniture.” People are more likely to say “You can sit in that chair” than “You can sit in that recliner” or “You can sit in that furniture.”

And it is no coincidence that the word for “chair” contains fewer syllables than either “recliner” or “furniture.” Basic-level categories, in general, tend to have short names; and nouns with short names tend to refer to basic-level categories. Not a perfect rule, of course, but a definite tendency. Frequent use goes along with short words; short words go along with frequent use.

Or as Douglas Hofstadter put it, there’s a reason why the English language uses “the” to mean “the” and “antidisestablishmentarianism” to mean “antidisestablishmentarianism” instead of antidisestablishmentarianism other way around.

*

175

Mutual Information, and Density in Thingspace

Suppose you have a system X that can be in any of 8 states, which are all equally probable (relative to your current state of knowledge), and a system Y that can be in any of 4 states, all equally probable.

The entropy of X, as defined in the previous essay, is 3 bits; we’ll need to ask 3 yes-or-no questions to find out X’s exact state. The entropy of Y is 2 bits; we have to ask 2 yes-or-no questions to find out Y’s exact state. This may seem obvious since 23 = 8 and 22 = 4, so 3 questions can distinguish 8 possibilities and 2 questions can distinguish 4 possibilities; but remember that if the possibilities were not all equally likely, we could use a more clever code to discover Y’s state using e.g. 1.75 questions on average. In this case, though, X’s probability mass is evenly distributed over all its possible states, and likewise Y, so we can’t use any clever codes.

What is the entropy of the combined system (X,Y)?

You might be tempted to answer, “It takes 3 que
stions to find out X, and then 2 questions to find out Y, so it takes 5 questions total to find out the state of X and Y.”

But what if the two variables are entangled, so that learning the state of Y tells us something about the state of X?

In particular, let’s suppose that X and Y are either both odd or both even.

Now if we receive a 3-bit message (ask 3 questions) and learn that X is in state X5, we know that Y is in state Y1 or state Y3, but not state Y2 or state Y4. So the single additional question “Is Y in state Y3?,” answered “No,” tells us the entire state of (X,Y): X = X5,Y = Y1. And we learned this with a total of 4 questions.

Conversely, if we learn that Y is in state Y4 using two questions, it will take us only an additional two questions to learn whether X is in state X2, X4, X6, or X8. Again, four questions to learn the state of the joint system.

The mutual information of two variables is defined as the difference between the entropy of the joint system and the entropy of the independent systems: I(X;Y) = H(X) + H(Y) - H(X,Y).

Here there is one bit of mutual information between the two systems: Learning X tells us one bit of information about Y (cuts down the space of possibilities from 4 possibilities to 2, a factor-of-2 decrease in the volume) and learning Y tells us one bit of information about X (cuts down the possibility space from 8 possibilities to 4).

What about when probability mass is not evenly distributed? Last essay, for example, we discussed the case in which Y had the probabilities 1/2, 1/4, 1/8, 1/8 for its four states. Let us take this to be our probability distribution over Y, considered independently—if we saw Y, without seeing anything else, this is what we’d expect to see. And suppose the variable Z has two states, Z1 and Z2, with probabilities 3/8 and 5/8 respectively.

Then if and only if the joint distribution of Y and Z is as follows, there is zero mutual information between Y and Z:

Z1Y1 : 3/16

Z1Y2 : 3/32

Z1Y3 : 3/64

Z1Y4 : 3/64

Z2Y1 : 5/16

Z2Y2 : 5/32

Z2Y3 : 5/64

Z2Y4 : 5/64.

This distribution obeys the law

P(Y,Z) = P(Y)P(Z).

For example, P(Z1Y2) = P(Z1)P(Y2) = 3∕8 × 1∕4 = 3∕32.

And observe that we can recover the marginal (independent) probabilities of Y and Z just by looking at the joint distribution:

P(Y1) = total probability of all the different ways Y1 can happen

= P (Z1Y1) + P(Z2Y1)

= 3/16 + 5/16

= 1/2.

So, just by inspecting the joint distribution, we can determine whether the marginal variables Y and Z are independent; that is, whether the joint distribution factors into the product of the marginal distributions; whether, for all Y and Z, we have P(Y,Z) = P(Y)P(Z).

This last is significant because, by Bayes’s Rule,

P(ZjYi) = P(Yi)P(Zj)

P(ZjYi)/P(Zj) = P(Yi)

P(Yi|Zj) = P(Yi).

In English: “After you learn Zj, your belief about Yi is just what it was before.”

So when the distribution factorizes—when P(Y,Z) = P(Y)P(Z)—this is equivalent to “Learning about Y never tells us anything about Z or vice versa.”

From which you might suspect, correctly, that there is no mutual information between Y and Z. Where there is no mutual information, there is no Bayesian evidence, and vice versa.

Suppose that in the distribution (Y,Z) above, we treated each possible combination of Y and Z as a separate event—so that the distribution (Y,Z) would have a total of 8 possibilities, with the probabilities shown—and then we calculated the entropy of the distribution (Y,Z) the same way we would calculate the entropy of any distribution:

P(Z1Y1)log2(P(Z1Y1)) + P(Z1Y2)log2(P(Z1Y2)) + P(Z1Y3)log2(P(Z1Y3)) + . . . + P (Z2Y4)log2(P(Z2Y4))

=(3/16)log2(3/16) + (3/32)log2(3/32) + (3/64)log2(3/64) + . . . + (5/64)log2(5/64).

You would end up with the same total you would get if you separately calculated the entropy of Y plus the entropy of Z. There is no mutual information between the two variables, so our uncertainty about the joint system is not any less than our uncertainty about the two systems considered separately. (I am not showing the calculations, but you are welcome to do them; and I am not showing the proof that this is true in general, but you are welcome to Google on “Shannon entropy” and “mutual information.”)

What if the joint distribution doesn’t factorize? For example:

Z1Y1 : 12/64

Z1Y2 : 8/64

Z1Y3 : 1/64

Z1Y4 : 3/64

Z2Y1 : 20/64

Z2Y2 : 8/64

Z2Y3 : 7/64

Z2Y4 : 5/64.

If you add up the joint probabilities to get marginal probabilities, you should find that P(Y1) = 1∕2, P(Z1) = 3∕8, and so on—the marginal probabilities are the same as before.

But the joint probabilities do not always equal the product of the marginal probabilities. For example, the probability P(Z1Y2) equals 8/64, where P(Z1)P(Y2) would equal 3∕8 × 1∕4 = 6∕64. That is, the probability of running into Z1Y2 together is greater than you’d expect based on the probabilities of running into Z1 or Y2 separately.

Which in turn implies:

P(Z1Y2) > P(Z1)P(Y2)

P(Z1Y2)/P(Y2) > P(Z1)

P(Z1|Y2) > P(Z1).

Since there’s an “unusually high” probability for P(Z1Y2)—defined as a probability higher than the marginal probabilities would indicate by default—it follows that observing Y2 is evidence that increases the probability of Z1. And by a symmetrical argument, observing Z1 must favor Y2.

As there are at least some values of Y that tell us about Z (and vice versa) there must be mutual information between the two variables; and so you will find—I am confident, though I haven’t actually checked—that calculating the entropy of (Y,Z) yields less total uncertainty than the sum of the independent entropies of Y and Z. That is, H(Y,Z) = H(Y) + H(Z) - I(Y;Z), with all quantities necessarily positive.

(I digress here to remark that the symmetry of the expression for the mutual information shows that Y must tell us as much about Z, on average, as Z tells us about Y. I leave it as an exercise to the reader to reconcile this with anything they were taught in logic class about how, if all ravens are black, being allowed to reason Raven(x) ⇒ Black(x) doesn’t mean you’re allowed to reason Black(x) ⇒ Raven(x). How different seem the symmetrical probability flows of the Bayesian, from the sharp lurches of logic—even though the latter is just a degenerate case of the former.)

“But,” you ask, “what has all this to do with the proper use of words?”

In Empty Labels and then Replace the Symbol with the Substance, we saw the technique of replacing a word with its definition—the example being given:

All [mortal, ¬feathers, bipedal] are mortal.

Socrates is a [mortal, ¬feathers, bipedal].

Therefore, Socrates is mortal.

Why, then, would you even want to have a word for “human”? Why not just say “Socrates is a mortal featherless biped”?

Because it’s helpful to have shorter words for things that you encounter often. If your code for describing single properties is already efficient, then there will not be an advantage to having a special word for a conjunction—like “human” for “mortal featherless biped”—unless things that are mortal and featherless and bipedal, are found more often than the marginal probabilities would lead you to expect.

In efficient codes, word length corresponds to probability—so the code for Z1Y2 will be just as long as the code for Z1 plus the code for Y2, unless P(Z1Y2) > P(Z1)P(Y2), in which case the code for the word can be shorter than the codes for its parts.

And this in turn corresponds exactly to the case where we can infer some of the properties of the thing from seeing its other properties. It must be more likely than the default that featherless bipedal things will also be mortal.

Of course the word “human” really describes many, man
y more properties—when you see a human-shaped entity that talks and wears clothes, you can infer whole hosts of biochemical and anatomical and cognitive facts about it. To replace the word “human” with a description of everything we know about humans would require us to spend an inordinate amount of time talking. But this is true only because a featherless talking biped is far more likely than default to be poisonable by hemlock, or have broad nails, or be overconfident.

Having a word for a thing, rather than just listing its properties, is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like “red,” that we would use to send an entirely uncompressed description of our sensory experiences. But by the time you encounter a bug, or even a rock, you’re dealing with nonsimple property collections, far above the primitive level.)

So having a word “wiggin” for green-eyed black-haired people is more useful than just saying “green-eyed black-haired person” precisely when:

Green-eyed people are more likely than average to be black-haired (and vice versa), meaning that we can probabilistically infer green eyes from black hair or vice versa; or

Wiggins share other properties that can be inferred at greater-than-default probability. In this case we have to separately observe the green eyes and black hair; but then, after observing both these properties independently, we can probabilistically infer other properties (like a taste for ketchup).

One may even consider the act of defining a word as a promise to this effect. Telling someone, “I define the word ‘wiggin’ to mean a person with green eyes and black hair,” by Gricean implication, asserts that the word “wiggin” will somehow help you make inferences / shorten your messages.

If green-eyes and black hair have no greater than default probability to be found together, nor does any other property occur at greater than default probability along with them, then the word “wiggin” is a lie: The word claims that certain people are worth distinguishing as a group, but they’re not.

‹ Prev Next ›