Rationality: From AI to Zombies

by Eliezer Yudkowsky


  * * *

  Imagine an experiment which produces an integer result between zero and 99. For example, the experiment might be a particle counter that tells us how many particles have passed through in a minute. Or the experiment might be to visit the supermarket on Wednesday, check the price of a 10 oz bag of crushed walnuts, and write down the last two digits of the price.

  We are testing several different hypotheses that try to predict the experimental result. Each hypothesis produces a probability distribution over all possible results; in this case, the integers between zero and 99. The possibilities are mutually exclusive, so the probability mass in the distribution must sum to one (or less); we cannot predict a 90% probability of seeing 42 and also a 90% probability of seeing 43.

  Suppose there is a precise hypothesis that predicts a 90% chance of seeing the result 51. (I.e., the hypothesis is that the supermarket usually prices walnuts with a price of “X dollars and 51 cents.”) The precise theory has staked 90% of its probability mass on the outcome 51. This leaves 10% probability mass remaining to spread over 99 other possible outcomes—all the numbers between zero and 99 except 51. The theory makes no further specification, so we spread the remaining 10% probability mass evenly over 99 possibilities, assigning a probability of 1/990 to each non-51 result. For ease of writing, we’ll approximate 1/990 as 0.1%.

  This probability distribution is analogous to the likelihood or conditional probability of the result given the hypothesis. Let us call it the likelihood distribution for the hypothesis, our chance of seeing each specified outcome if the hypothesis is true. The likelihood distribution for a hypothesis H is a function composed of all the conditional probabilities for each possible result: P(0|H) = 0.001, P(1|H) = 0.001, . . . , P(51|H) = 0.9, . . . , P(99|H) = 0.001.
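
  (As a sketch, this likelihood distribution can be written out in a few lines of Python; the function name and representation here are illustrative, not anything specified in the text.)

```python
# Likelihood distribution for the precise hypothesis:
# 90% on the outcome 51, the remaining 10% spread evenly
# over the other 99 outcomes (1/990 each, roughly 0.1%).

def precise_likelihood(outcome: int) -> float:
    """P(outcome | precise hypothesis), for outcomes 0..99."""
    if outcome == 51:
        return 0.9
    return 0.1 / 99  # = 1/990, approximately 0.001

# The probability mass over all 100 possible results sums to 1.
assert abs(sum(precise_likelihood(r) for r in range(100)) - 1.0) < 1e-9
```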

  The precise theory predicts a 90% probability of seeing 51. Let there be also a vague theory, which predicts “a 90% probability of seeing a number in the fifties.”

  Seeing the result 51, we do not say the outcome confirms both theories equally. Both theories made predictions, and both assigned probabilities of 90%, and the result 51 confirms both predictions. But the precise theory has an advantage because it concentrates its probability mass into a sharper point. If the vague theory makes no further specification, we count “a 90% probability of seeing a number in the fifties” as a 9% probability of seeing each number between 50 and 59.

  Suppose we started with even odds in favor of the precise theory and the vague theory—odds of 1:1, or 50% probability for either hypothesis being true. After seeing the result 51, what are the posterior odds of the precise theory being true? The predictions of the two theories are analogous to their likelihood assignments—the conditional probability of seeing the result, given that the theory is true. What is the likelihood ratio between the two theories? The first theory allocated 90% probability mass to the exact outcome. The vague theory allocated 9% probability mass to the exact outcome. The likelihood ratio is 10:1. So if we started with even 1:1 odds, the posterior odds are 10:1 in favor of the precise theory. The differential pressure of the two conditional probabilities pushed our prior confidence of 50% to a posterior confidence of about 91% that the precise theory is correct. Assuming that these are the only hypotheses being tested, that this is the only evidence under consideration, and so on.
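
  (The arithmetic of this update, sketched in Python; the variable names are illustrative only.)

```python
# Bayesian odds update after seeing the result 51.

prior_odds = 1.0                    # 1:1 for precise vs. vague
likelihood_precise = 0.90           # P(51 | precise theory)
likelihood_vague = 0.09             # P(51 | vague theory)

likelihood_ratio = likelihood_precise / likelihood_vague   # 10
posterior_odds = prior_odds * likelihood_ratio             # 10:1

posterior_probability = posterior_odds / (posterior_odds + 1)
print(posterior_probability)        # ~0.909, i.e. about 91%
```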

  Why did the vague theory lose when both theories fit the evidence? The vague theory is timid; it makes a broad prediction, hedges its bets, allows many possibilities that would falsify the precise theory. This is not the virtue of a scientific theory. Philosophers of science tell us that theories should be bold, and subject themselves willingly to falsification if their prediction fails.6 Now we see why. The precise theory concentrates its probability mass into a sharper point and thereby leaves itself vulnerable to falsification if the real outcome hits elsewhere; but if the predicted outcome is correct, precision has a tremendous likelihood advantage over vagueness.

  The laws of probability theory provide no way to cheat, to make a vague hypothesis such that any result between 50 and 59 counts for as much favorable confirmation as the precise theory receives, for that would require probability mass summing to 900%. There is no way to cheat, providing you record your prediction in advance, so you cannot claim afterward that your theory assigns a probability of 90% to whichever result arrived. Humans are very fond of making their predictions afterward, so the social process of science requires an advance prediction before we say that a result confirms a theory. But how humans may move in harmony with the way of Bayes, and so wield the power, is a separate issue from whether the math works. When we’re doing the math, we just take for granted that likelihood density functions are fixed properties of a hypothesis and the probability mass sums to 1 and you’d never dream of doing it any other way.

  You may want to take a moment to visualize that, if we define probability in terms of calibration, Bayes’s Theorem relates the calibrations. Suppose I guess that Theory 1 is 50% likely to be true, and I guess that Theory 2 is 50% likely to be true. Suppose I am well-calibrated; when I utter the words “fifty percent,” the event happens about half the time. And then I see a result R which would happen around nine-tenths of the time given Theory 1, and around nine-hundredths of the time given Theory 2, and I know this is so, and I apply Bayesian reasoning. If I was perfectly calibrated initially (despite the poor discrimination of saying 50/50), I will still be perfectly calibrated (and better discriminated) after I say that my confidence in Theory 1 is now 91%. If I repeated this kind of situation many times, I would be right around ten-elevenths of the time when I said “91%.” If I reason using Bayesian rules, and I start from well-calibrated priors, then my conclusions will also be well-calibrated. This only holds true if we define probability in terms of calibration! If “90% sure” is instead interpreted as, say, the strength of the emotion of surety, there is no reason to expect the posterior emotion to stand in an exact Bayesian relation to the prior emotion.
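
  (A rough Monte Carlo sketch of the calibration claim, entirely my own construction: if each theory really is true half the time, then on just those occasions where R is observed, Theory 1 comes out true about ten-elevenths of the time.)

```python
import random

# Check of the calibration claim: with a well-calibrated 50/50 prior,
# P(R | Theory 1) = 0.9 and P(R | Theory 2) = 0.09, Theory 1 should be
# true about ten-elevenths (~91%) of the time when R is observed.

random.seed(0)
r_count = 0
theory1_true_given_r = 0

for _ in range(1_000_000):
    theory1_true = random.random() < 0.5      # Theory 1 true half the time
    p_r = 0.9 if theory1_true else 0.09       # chance of seeing result R
    if random.random() < p_r:                 # R is observed this round
        r_count += 1
        theory1_true_given_r += theory1_true

print(theory1_true_given_r / r_count)         # ~0.909
```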

  Let the prior odds be ten to one in favor of the vague theory. Why? Suppose our way of describing hypotheses allows us to either specify a precise number, or to just specify a first-digit; we can say “51,” “63,” “72,” or “in the fifties/sixties/seventies.” Suppose we think that the real answer is about equally liable to be an answer of the first kind or the second. However, given the problem, there are a hundred possible hypotheses of the first kind, and only ten hypotheses of the second kind. So if we think that either class of hypotheses has about an equal prior chance of being correct, we have to spread out the prior probability mass over ten times as many precise theories as vague theories. The precise theory that predicts exactly 51 would thus have one-tenth as much prior probability mass as the vague theory that predicts a number in the fifties. After seeing 51, the odds would go from 1:10 in favor of the vague theory to 1:1, even odds for the precise theory and the vague theory.
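
  (The same update sketched with the 1:10 prior odds; the names again are illustrative.)

```python
# Update with prior odds of 1:10 against the precise theory, reflecting
# one hundred possible precise hypotheses versus ten vague ones.

prior_odds = 1 / 10                  # precise : vague = 1 : 10
likelihood_ratio = 0.90 / 0.09       # 10, from seeing the result 51
posterior_odds = prior_odds * likelihood_ratio
print(posterior_odds)                # ~1.0 -- even odds for the two theories
```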

  If you look at this carefully, it’s exactly what common sense would expect. You start out uncertain of whether a phenomenon is the kind of phenomenon that produces exactly the same result every time, or if it’s the kind of phenomenon that produces a result in the Xties every time. (Maybe the phenomenon is a price range at the supermarket, if you need some reason to suppose that 50–59 is an acceptable range but 49–58 isn’t.) You take a single measurement and the answer is 51. Well, that could be because the phenomenon is exactly 51, or because it’s in the fifties. So the remaining precise theory has the same odds as the remaining vague theory, which requires that the vague theory must have started out ten times as probable as that precise theory, since the precise theory has a sharper fit to the evidence.

  If we just see one number, like 51, it doesn’t change the prior probability that the phenomenon itself was “precise” or “vague.” But, in effect, it concentrates all the probability mass of those two classes of hypothesis into a single surviving hypothesis of each class.

  Of course, it is a severe error to say that a phenomenon is precise or vague, a case of what Jaynes calls the Mind Projection Fallacy.7 Precision or vagueness is a property of maps, not territories. Rather we should ask if the price in the supermarket stays constant or shifts about. A hypothesis of the “vague” sort is a good description of a price that shifts about. A precise map will suit a constant territory.

  Another example: You flip a coin ten times and see the sequence HHTTH:TTTTH. Maybe you started out thinking there was a 1% chance this coin was fixed. Doesn’t the hypothesis “This coin is fixed to produce HHTTH:TTTTH” assign a thousand times the likelihood mass to the observed outcome, compared to the fair coin hypothesis? Yes. Don’t the posterior odds that the coin is fixed go to 10:1? No. The 1% prior probability that “the coin is fixed” has to cover every possible kind of fixed coin—a coin fixed to produce HHTTH:TTTTH, a coin fixed to produce TTHHT:HHHHT, etc. The prior probability the coin is fixed to produce HHTTH:TTTTH is not 1%, but a thousandth of one percent. Afterward, the posterior probability the coin is fixed to produce HHTTH:TTTTH is one percent. Which is to say: You thought the coin was probably fair but had a one percent chance of being fixed to some random sequence; you flipped the coin; the coin produced a random-looking sequence; and that doesn’t tell you anything about whether the coin is fair or fixed. It does tell you, if the coin is fixed, which sequence it is fixed to.
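
  (A sketch of this bookkeeping, assuming, as the text implies, that the 1% prior on “fixed” is spread evenly over all 1,024 possible ten-flip sequences; variable names are mine.)

```python
# Fixed-coin example: a 1% prior that the coin is fixed, spread evenly
# over all 2**10 = 1024 possible ten-flip sequences.

p_fixed = 0.01
n_sequences = 2 ** 10
p_fixed_to_observed = p_fixed / n_sequences        # prior ~0.001%

p_seq_given_fair = 0.5 ** 10                       # 1/1024
p_seq_given_fixed_to_it = 1.0                      # that coin always produces it

# Posterior that the coin is fixed to the observed sequence.  The other
# 1,023 fixed-sequence hypotheses assign ~0 probability to this sequence.
numerator = p_fixed_to_observed * p_seq_given_fixed_to_it
denominator = numerator + (1 - p_fixed) * p_seq_given_fair
print(numerator / denominator)                     # ~0.01 -- still about 1%
```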

  This parable helps illustrate why Bayesians must think about prior probabilities. There is a branch of statistics, sometimes called “orthodox” or “classical” statistics, which insists on paying attention only to likelihoods. But if you only pay attention to likelihoods, then eventually some fixed-coin hypothesis will always defeat the fair coin hypothesis, a phenomenon known as “overfitting” the theory to the data. After thirty flips, the likelihood is a billion times as great for the fixed-coin hypothesis with that sequence, as for the fair coin hypothesis. Only if the fixed-coin hypothesis (or rather, that specific fixed-coin hypothesis) is a billion times less probable a priori can the fixed-coin hypothesis possibly lose to the fair coin hypothesis.

  If you shake the coin to reset it, and start flipping the coin again, and the coin produces HHTTH:TTTTH again, that is a different matter. That does raise the posterior odds of the fixed-coin hypothesis to 10:1, even if the starting probability was only 1%.

  Similarly, if we perform two successive measurements of the particle counter (or the supermarket price on Wednesdays), and both measurements return 51, the precise theory wins by odds of 10:1.
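
  (Both repeated-observation updates can be checked in a few lines; the coin calculation uses the exact 1,024:1 ratio, which comes out at roughly 10:1 as stated.)

```python
# Repeating an observation applies the likelihood ratio a second time.

# Coin: prior odds ~1:99 that it is fixed to the observed sequence;
# a second identical run of ten flips multiplies the odds by 2**10.
prior_odds_fixed = 0.01 / 0.99
posterior_odds_fixed = prior_odds_fixed * 2 ** 10
print(posterior_odds_fixed)                        # ~10.3, roughly 10:1

# Walnut price: prior odds 1:10 for precise vs. vague, and two results
# of 51 apply the 10:1 likelihood ratio twice.
posterior_odds_precise = (1 / 10) * (0.90 / 0.09) ** 2
print(posterior_odds_precise)                      # ~10.0, i.e. 10:1
```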

  So the precise theory wins, but the vague theory would still score better than no theory at all. Consider a third theory, the hypothesis of zero knowledge or maximum-entropy distribution, which makes equally probable any result between zero and 99. Suppose we see the result 51. The vague theory produced a better prediction than the maximum-entropy distribution—assigned a greater likelihood to the outcome we observed. The vague theory is, literally, better than nothing. Suppose we started with odds of 1:20 in favor of the hypothesis of complete ignorance. (Why odds of 1:20? There is only one hypothesis of complete ignorance, and moreover, it’s a particularly simple and intuitive kind of hypothesis, as Occam’s Razor would have it.) After seeing the result of 51, predicted at 9% by the vague theory versus 1% by complete ignorance, the posterior odds go to roughly 10:20 or 1:2. If we then see another result of 51, the posterior odds go to roughly 10:2, or about 83% probability for the vague theory, assuming there is no more precise theory under consideration.
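
  (A sketch of this update using the exact 9% likelihood rather than the rounded 10%, which gives odds of 9:20 and then 81:20, or about 80% rather than 83%.)

```python
# Vague theory versus the maximum-entropy ("complete ignorance") theory,
# starting from odds of 1:20 against the vague theory.

odds_vague_vs_ignorant = 1 / 20
lr = 0.09 / 0.01                     # likelihood ratio of 9 per result of 51

after_one = odds_vague_vs_ignorant * lr            # 9:20, roughly 1:2
after_two = after_one * lr                         # 81:20
print(after_one)                                   # 0.45
print(after_two / (after_two + 1))                 # ~0.80 for the vague theory
```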

  Yet the timidity of the vague theory—its unwillingness to produce an exact prediction and accept falsification on any other result—renders it vulnerable to the bold, precise theory. (Providing, of course, that the bold theory correctly guesses the outcome!) Suppose the prior odds were 1:10:200 for the precise, vague, and ignorant theories—prior probabilities of 0.5%, 4.7%, and 94.8% for the precise, vague and ignorant theories. This figure reflects our prior probability distribution over classes of hypotheses, with the probability mass distributed over entire classes as follows: 50% that the phenomenon shifts across all digits, 25% that the phenomenon shifts around within some decimal bracket, and 25% that the phenomenon repeats the same number each time. One hypothesis of complete ignorance, 10 possible hypotheses for a decimal bracket, 100 possible hypotheses for a repeating number. Thus, prior odds of 1:10:200 for the precise hypothesis 51, the vague hypothesis “fifties,” and the hypothesis of complete ignorance.

  After seeing a result of 51, with assigned probability of 90%, 9%, and 1%, the posterior odds go to 90:90:200 = 9:9:20. After seeing an additional result of 51, the posterior odds go to 810:81:20, or 89%, 9%, and 2%. The precise theory is now favored over the vague theory, which in turn is favored over the ignorant theory.

  Now consider a stupid theory, which predicts a 90% probability of seeing a result between zero and nine. The stupid theory assigns a probability of 0.1% to the actual outcome, 51. If the odds were initially 1:10:200:10 for the precise, vague, ignorant, and stupid theories, the posterior odds after seeing 51 once would be 90:90:200:1. The stupid theory has been falsified (posterior probability of 0.2%).
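
  (A sketch of the joint update over all four theories, covering both results of 51 as well as the stupid theory’s falsification; the hypothesis labels and helper functions are mine, not the author’s.)

```python
# Joint update over the four theories, from prior odds of 1:10:200:10.

priors = {"precise": 1, "vague": 10, "ignorant": 200, "stupid": 10}
likelihood_of_51 = {"precise": 0.90, "vague": 0.09,
                    "ignorant": 0.01, "stupid": 0.001}

def posterior_odds(priors, likelihoods, n_observations=1):
    """Multiply each prior by its likelihood for n identical observations."""
    return {h: priors[h] * likelihoods[h] ** n_observations for h in priors}

def normalize(odds):
    total = sum(odds.values())
    return {h: o / total for h, o in odds.items()}

print(normalize(posterior_odds(priors, likelihood_of_51)))
# ~{precise: 0.24, vague: 0.24, ignorant: 0.52, stupid: 0.003}

print(normalize(posterior_odds(priors, likelihood_of_51, n_observations=2)))
# ~{precise: 0.89, vague: 0.09, ignorant: 0.02, stupid: 0.00001}
```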

  It is possible to have a model so bad that it is worse than nothing, if the model concentrates its probability mass away from the actual outcome, makes confident predictions of wrong answers. Such a hypothesis is so poor that it loses against the hypothesis of complete ignorance. Ignorance is better than anti-knowledge.

  Side note: In the field of Artificial Intelligence, there is a sometime fad that praises the glory of randomness. Occasionally an AI researcher discovers that if they add noise to one of their algorithms, the algorithm works better. This result is reported with great enthusiasm, followed by much fulsome praise of the creative powers of chaos, unpredictability, spontaneity, ignorance of what your own AI is doing, et cetera. (See The Imagination Engine for an example; according to their sales literature they sell wounded and dying neural nets.8) But how sad is an algorithm if you can increase its performance by injecting entropy into intermediate processing stages? The algorithm must be so deranged that some of its work goes into concentrating probability mass away from good solutions. If injecting randomness results in a reliable improvement, then some aspect of the algorithm must do reliably worse than random. Only in AI would people devise algorithms literally dumber than a bag of bricks, boost the results slightly back toward ignorance, and then argue for the healing power of noise.

  Suppose that in our experiment we see the results 52, 51, and 58. The precise theory gives this conjunctive event a probability of a thousand to one times 90% times a thousand to one, while the vaguer theory gives this conjunctive event a probability of 9% cubed, which works out to . . . oh . . . um . . . let’s see . . . a million to one given the precise theory, versus a thousand to one given the vague theory. Or thereabouts; we are counting rough powers of ten. Versus a million to one given the zero-knowledge distribution that assigns an equal probability to all outcomes. Versus a billion to one given a model worse than nothing, the stupid hypothesis, which claims a 90% probability of seeing a number less than 10. Using these approximate numbers, the vague theory racks up a score of -30 decibels (a probability of 1/1000 for the whole experimental outcome), versus scores of -60 for the precise theory, -60 for the ignorant theory, and -90 for the stupid theory. It is not always true that the highest score wins, because we need to take into account our prior odds of 1:10:200:10, confidences of -23, -13, 0, and -13 decibels. The vague theory still comes in with the highest total score at -43 decibels. (If we ignored our prior probabilities, each new experiment would override the accumulated results of all the previous experiments; we could not accumulate knowledge. Furthermore, the fixed-coin hypothesis would always win.)
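
  (A sketch of this decibel bookkeeping using the exact likelihoods rather than rough powers of ten; the totals come out within a decibel or two of the figures above, with the vague theory still on top at about -44 decibels.)

```python
import math

# Decibel scoring for the three-result experiment (52, 51, 58): combine
# each theory's prior odds with its likelihood for the whole outcome.

def decibels(x):
    return 10 * math.log10(x)

likelihood_of_data = {                  # P(52) * P(51) * P(58)
    "precise": 0.001 * 0.9 * 0.001,     # ~1e-6
    "vague": 0.09 ** 3,                 # ~1e-3
    "ignorant": 0.01 ** 3,              # 1e-6
    "stupid": 0.001 ** 3,               # 1e-9
}
prior_odds = {"precise": 1, "vague": 10, "ignorant": 200, "stupid": 10}

for h in likelihood_of_data:
    prior_db = decibels(prior_odds[h] / 200)       # -23, -13, 0, -13
    evidence_db = decibels(likelihood_of_data[h])
    print(h, round(prior_db), round(evidence_db), round(prior_db + evidence_db))
# The vague theory has the highest total, about -44 decibels
# (the rough-powers-of-ten figure in the text is -43).
```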

  As always, we should not be alarmed that even the best theory still has a low score—recall the parable of the fair coin. Theories are approximations. In principle we might be able to predict the exact sequence of coinflips. But it would take better measurement and more computing power than we’re willing to expend. Maybe we could achieve 60/40 prediction of coinflips, with a good enough model . . . ? We go with the best approximation we have, and try to achieve good calibration even if the discrimination isn’t perfect.

  * * *

  We’ve conducted our analysis so far under the rules of Bayesian probability theory, in which there’s no way to have more than 100% probability mass, and hence no way to cheat so that any outcome can count as “confirmation” of your theory. Under Bayesian law, play money may not be counterfeited; you only have so much clay.

  Unfortunately, human beings are not Bayesians. Human beings bizarrely attempt to defend hypotheses, making a deliberate effort to prove them or prevent disproof. This behavior has no analogue in the laws of probability theory or decision theory. In formal probability theory the hypothesis is, and the evidence is, and either the hypothesis is confirmed or it is not. In formal decision theory, an agent may make an effort to investigate some issue of which the agent is currently uncertain, not knowing whether the evidence shall go one way or the other. In neither case does one ever deliberately try to prove an idea, or try to avoid disproving it. One may test ideas of which one is genuinely uncertain, but not have a “preferred” outcome of the investigation. One may not try to prove hypotheses, nor prevent their proof. I cannot properly convey just how ridiculous the notion would be, to a true Bayesian; there are not even words in Bayes-language to describe the mistake . . .

  For every expectation of evidence there is an equal and opposite expectation of counterevidence. If A is evidence in favor of B, then not-A must be evidence in favor of not-B. The strengths of the evidences may not be equal; rare but strong evidence in one direction may be balanced by common but weak evidence in the other direction. But it is not possible for both A and not-A to be evidence in favor of B. That is, it’s not possible under the laws of probability theory.

 
