
Rationality- From AI to Zombies


by Eliezer Yudkowsky


  Cyan directs us to chapter 37 of MacKay’s excellent statistics book, free online, for a more thorough explanation of this problem.2

  According to old-fashioned statistical procedure—which I believe is still being taught today—the two researchers have performed different experiments with different stopping conditions. The two experiments could have terminated with different data, and therefore represent different tests of the hypothesis, requiring different statistical analyses. It’s quite possible that the first experiment will be “statistically significant,” the second not.

  Whether or not you are disturbed by this says a good deal about your attitude toward probability theory, and indeed, rationality itself.

  Non-Bayesian statisticians might shrug, saying, “Well, not all statistical tools have the same strengths and weaknesses, y’know—a hammer isn’t like a screwdriver—and if you apply different statistical tools you may get different results, just like using the same data to compute a linear regression or train a regularized neural network. You’ve got to use the right tool for the occasion. Life is messy—”

  And then there’s the Bayesian reply: “Excuse you? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher’s private thoughts? And you have the nerve to accuse us of being ‘too subjective’?”

  If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If Nature is another way, the likelihood of the data coming out that way will be something else. But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher’s private intentions. So whatever our hypotheses about Nature, the likelihood ratio is the same, and the evidential impact is the same, and the posterior belief should be the same, between the two experiments. At least one of the two Old Style methods must discard relevant information—or simply do the wrong calculation—for the two methods to arrive at different answers.
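
  A minimal sketch of this point in Python, under assumptions that are not in the text: the data are hypothetically 70 successes and 30 failures, one researcher fixed the number of trials in advance (binomial sampling), the other planned to stop at the 30th failure (negative binomial sampling), and the two hypotheses about Nature are success rates of 0.7 and 0.5. The function names are illustrative only.

```python
from math import comb

def binomial_likelihood(p, successes, failures):
    """P(data | p) when the number of trials was fixed in advance."""
    n = successes + failures
    return comb(n, successes) * p**successes * (1 - p)**failures

def negative_binomial_likelihood(p, successes, failures):
    """P(data | p) when the plan was to stop at the final failure."""
    n = successes + failures
    # The last trial must be a failure, so only the first n - 1 trials can be reordered.
    return comb(n - 1, failures - 1) * p**successes * (1 - p)**failures

def likelihood_ratio(likelihood, p1, p2, successes, failures):
    """How strongly the data favor success rate p1 over success rate p2."""
    return likelihood(p1, successes, failures) / likelihood(p2, successes, failures)

successes, failures = 70, 30   # hypothetical data
p1, p2 = 0.7, 0.5              # hypothetical hypotheses about Nature

ratio_fixed_n = likelihood_ratio(binomial_likelihood, p1, p2, successes, failures)
ratio_stop_rule = likelihood_ratio(negative_binomial_likelihood, p1, p2, successes, failures)

print(ratio_fixed_n, ratio_stop_rule)
```

  The two sampling plans give different raw likelihoods, because the counting factor in front differs. That factor does not depend on p, so it cancels in the ratio, and the two ratios agree up to floating-point rounding.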

  The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I’m not going to try to recount that elder history in this essay.

  But one of the central conflicts is that Bayesians expect probability theory to be . . . what’s the word I’m looking for? “Neat?” “Clean?” “Self-consistent?”

  As Jaynes says, the theorems of Bayesian probability are just that, theorems in a coherent proof system. No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent—every theorem compatible with every other theorem.

  If you want to know the sum 10 + 10, you can redefine it as (2 × 5) + (7 + 3) or as (2 × (4 + 6)) or use whatever other legal tricks you like, but the result always has to come out to be the same, in this case, 20. If it comes out as 20 one way and 19 the other way, then you may conclude you did something illegal on at least one of the two occasions. (In arithmetic, the illegal operation is usually division by zero; in probability theory, it is usually an infinity that was not taken as the limit of a finite process.)

  If you get the result 19 = 20, look hard for that error you just made, because it’s unlikely that you’ve sent arithmetic itself up in smoke. If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory—like, say, two different evidential impacts from the same experimental method yielding the same results—then the whole edifice goes up in smoke. Along with set theory, ’cause I’m pretty sure ZF provides a model for probability theory.

  Math! That’s the word I was looking for. Bayesians expect probability theory to be math. That’s why we’re interested in Cox’s Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory. Coherent math is great, but unique math is even better.

  And yet . . . should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy—so shouldn’t you need messy reasoning to handle it? Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox. It’s nice when problems are clean, but they usually aren’t, and you have to live with that.

  After all, it’s a well-known fact that you can’t use Bayesian methods on many problems because the Bayesian calculation is computationally intractable. So why not let many flowers bloom? Why not have more than one tool in your toolbox?

  That’s the fundamental difference in mindset. Old School statisticians thought in terms of tools, tricks to throw at particular problems. Bayesians—at least this Bayesian, though I don’t think I’m speaking only for myself—we think in terms of laws.

  Looking for laws isn’t the same as looking for especially neat and pretty tools. The Second Law of Thermodynamics isn’t an especially neat and pretty refrigerator.

  The Carnot cycle is an ideal engine—in fact, the ideal engine. No engine powered by two heat reservoirs can be more efficient than a Carnot engine. As a corollary, all thermodynamically reversible engines operating between the same heat reservoirs are equally efficient.

  But, of course, you can’t use a Carnot engine to power a real car. A real car’s engine bears the same resemblance to a Carnot engine that the car’s tires bear to perfect rolling cylinders.

  Clearly, then, a Carnot engine is a useless tool for building a real-world car. The Second Law of Thermodynamics, obviously, is not applicable here. It’s too hard to make an engine that obeys it, in the real world. Just ignore thermodynamics—use whatever works.

  This is the sort of confusion that I think reigns over they who still cling to the Old Ways.

  No, you can’t always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn’t mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation—and fails to the extent that it departs.

  Bayesianism’s coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox’s coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).
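
  A toy illustration of a Dutch book, as a sketch only. The agent and its prices are hypothetical: it is willing to pay 0.6 for a bet that pays 1 if A happens, and also 0.6 for a bet that pays 1 if A does not happen. Those prices imply "probabilities" that sum to 1.2, which violates the coherence requirement.

```python
def net_payoff(price_a, price_not_a, a_happens):
    """Net result of buying one unit-payoff bet on A and one on not-A."""
    cost = price_a + price_not_a
    payout_on_a = 1.0 if a_happens else 0.0
    payout_on_not_a = 0.0 if a_happens else 1.0
    return payout_on_a + payout_on_not_a - cost

incoherent = {"price_a": 0.6, "price_not_a": 0.6}   # hypothetical prices, sum to 1.2
coherent = {"price_a": 0.6, "price_not_a": 0.4}     # hypothetical prices, sum to 1.0

# Evaluate both possible worlds: A happens, A does not happen.
incoherent_outcomes = [net_payoff(**incoherent, a_happens=h) for h in (True, False)]
coherent_outcomes = [net_payoff(**coherent, a_happens=h) for h in (True, False)]

print(incoherent_outcomes, coherent_outcomes)
```

  With the incoherent prices the agent loses money whichever way A turns out. With prices that sum to one, the same pair of bets breaks even in both worlds.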

  You may not be able to compute the optimal answer. But whatever approximation you use, both its failures and successes will be explainable in terms of Bayesian probability theory. You may not know the explanation; that does not mean no explanation exists.

  So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

  You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.
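
  The correspondence between regression and Bayesian point estimates can be checked numerically. The sketch below assumes a one-parameter model y = w * x with Gaussian noise, hypothetical data, and hypothetical noise and prior widths. It compares the closed-form least-squares and ridge slopes against a crude grid search for the posterior mode. All names are illustrative.

```python
def least_squares_slope(xs, ys):
    """Ordinary least squares for y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, penalty):
    """Minimizer of sum((y - w*x)**2) + penalty * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

def log_posterior(w, xs, ys, noise_sd, prior_sd=None):
    """Log posterior of w up to a constant; prior_sd=None means a flat prior."""
    log_lik = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_sd**2)
    log_prior = 0.0 if prior_sd is None else -w**2 / (2 * prior_sd**2)
    return log_lik + log_prior

def map_estimate(xs, ys, noise_sd, prior_sd=None, step=0.001, upper=4.0):
    """Crude grid search for the posterior mode on [0, upper]."""
    grid = [i * step for i in range(int(upper / step) + 1)]
    return max(grid, key=lambda w: log_posterior(w, xs, ys, noise_sd, prior_sd))

xs = [1.0, 2.0, 3.0, 4.0]          # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8]
noise_sd, prior_sd = 1.0, 0.5      # hypothetical noise and prior widths

ols = least_squares_slope(xs, ys)
ridge = ridge_slope(xs, ys, penalty=noise_sd**2 / prior_sd**2)
flat_map = map_estimate(xs, ys, noise_sd)
gaussian_map = map_estimate(xs, ys, noise_sd, prior_sd)

print(ols, flat_map, ridge, gaussian_map)
```

  Under these assumptions the flat-prior mode lands on the least-squares slope, and the Gaussian-prior mode lands on the ridge slope when the penalty equals the noise variance divided by the prior variance. Both matches hold to within the grid spacing.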

  Sometimes you can’t use Bayesian methods literally; often, indeed. But when you can use the exact Bayesian calculation that uses every scrap of available knowledge, you are done. You will never find a statistical method that yields a better answer. You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate. Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.

  When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow. But when you can directly use a calculation that mirrors the Bayesian law, you’re done, like managing to put a Carnot heat engine into your car. It is, as the saying goes, “Bayes-optimal.”

  It seems to me that the toolboxers are looking at the sequence of cubes {1, 8, 27, 64, 125, . . . } and pointing to the first differences {7, 19, 37, 61, . . . } and saying “Look, life isn’t always so neat—you’ve got to adapt to circumstances.” And the Bayesians are pointing to the third differences, the underlying stable level {6, 6, 6, 6, 6, . . . }. And the critics are saying, “What the heck are you talking about? It’s 7, 19, 37 not 6, 6, 6. You are oversimplifying this messy problem; you are too attached to simplicity.”
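
  The arithmetic behind the analogy is easy to check. A short sketch (the helper name is illustrative) that takes successive differences of the first eight cubes:

```python
def differences(seq):
    """Differences between consecutive elements of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]

cubes = [n**3 for n in range(1, 9)]   # 1, 8, 27, 64, 125, 216, 343, 512
first = differences(cubes)            # 7, 19, 37, 61, ...
second = differences(first)           # 12, 18, 24, ...
third = differences(second)           # 6, 6, 6, ...

print(first, second, third)
```

  The first differences look irregular, the second differences grow steadily, and the third differences are constant at 6.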

  It’s not necessarily simple on a surface level. You have to dive deeper than that to find stability.

  Think laws, not tools. Needing to calculate approximations to a law doesn’t change the law. Planes are still atoms, they aren’t governed by special exceptions in Nature for aerodynamic calculations. The approximation exists in the map, not in the territory. You can know the Second Law of Thermodynamics, and yet apply yourself as an engineer to build an imperfect car engine. The Second Law does not cease to be applicable; your knowledge of that law, and of Carnot cycles, helps you get as close to the ideal efficiency as you can.

  We aren’t enchanted by Bayesian methods merely because they’re beautiful. The beauty is a side effect. Bayesian theorems are elegant, coherent, optimal, and provably unique because they are laws.

  *

  1. Edwin T. Jaynes, “Probability Theory as Logic,” in Maximum Entropy and Bayesian Methods, ed. Paul F. Fougère (Springer Netherlands, 1990).

  2. David J. C. MacKay, Information Theory, Inference, and Learning Algorithms (New York: Cambridge University Press, 2003).

  185

  Outside the Laboratory

  “Outside the laboratory, scientists are no wiser than anyone else.” Sometimes this proverb is spoken by scientists, humbly, sadly, to remind themselves of their own fallibility. Sometimes this proverb is said for rather less praiseworthy reasons, to devalue unwanted expert advice. Is the proverb true? Probably not in an absolute sense. It seems much too pessimistic to say that scientists are literally no wiser than average, that there is literally zero correlation.

  But the proverb does appear true to some degree, and I propose that we should be very disturbed by this fact. We should not sigh, and shake our heads sadly. Rather we should sit bolt upright in alarm. Why? Well, suppose that an apprentice shepherd is laboriously trained to count sheep, as they pass in and out of a fold. Thus the shepherd knows when all the sheep have left, and when all the sheep have returned. Then you give the shepherd a few apples, and say: “How many apples?” But the shepherd stares at you blankly, because they weren’t trained to count apples—just sheep. You would probably suspect that the shepherd didn’t understand counting very well.

  Now suppose we discover that a PhD economist buys a lottery ticket every week. We have to ask ourselves: Does this person really understand expected utility, on a gut level? Or have they just been trained to perform certain algebra tricks?

  One thinks of Richard Feynman’s account of a failing physics education program:

  The students had memorized everything, but they didn’t know what anything meant. When they heard “light that is reflected from a medium with an index,” they didn’t know that it meant a material such as water. They didn’t know that the “direction of the light” is the direction in which you see something when you’re looking at it, and so on. Everything was entirely memorized, yet nothing had been translated into meaningful words. So if I asked, “What is Brewster’s Angle?” I’m going into the computer with the right keywords. But if I say, “Look at the water,” nothing happens—they don’t have anything under “Look at the water”!

  Suppose we have an apparently competent scientist, who knows how to design an experiment on N subjects; the N subjects will receive a randomized treatment; blinded judges will classify the subject outcomes; and then we’ll run the results through a computer and see if the results are significant at the 0.05 significance level. Now this is not just a ritualized tradition. This is not a point of arbitrary etiquette like using the correct fork for salad. It is a ritualized tradition for testing hypotheses experimentally. Why should you test your hypothesis experimentally? Because you know the journal will demand so before it publishes your paper? Because you were trained to do it in college? Because everyone else says in unison that it’s important to do the experiment, and they’ll look at you funny if you say otherwise?

  No: because, in order to map a territory, you have to go out and look at the territory. It isn’t possible to produce an accurate map of a city while sitting in your living room with your eyes closed, thinking pleasant thoughts about what you wish the city was like. You have to go out, walk through the city, and write lines on paper that correspond to what you see. It happens, in miniature, every time you look down at your shoes to see if your shoelaces are untied. Photons arrive from the Sun, bounce off your shoelaces, strike your retina, are transduced into neural firing frequencies, and are reconstructed by your visual cortex into an activation pattern that is strongly correlated with the current shape of your shoelaces. To gain new information about the territory, you have to interact with the territory. There has to be some real, physical process whereby your brain state ends up correlated to the state of the environment. Reasoning processes aren’t magic; you can give causal descriptions of how they work. Which all goes to say that, to find things out, you’ve got to go look.

  Now what are we to think of a scientist who seems competent inside the laboratory, but who, outside the laboratory, believes in a spirit world? We ask why, and the scientist says something along the lines of: “Well, no one really knows, and I admit that I don’t have any evidence—it’s a religious belief, it can’t be disproven one way or another by observation.” I cannot but conclude that this person literally doesn’t know why you have to look at things. They may have been taught a certain ritual of experimentation, but they don’t understand the reason for it—that to map a territory, you have to look at it—that to gain information about the environment, you have to undergo a causal process whereby you interact with the environment and end up correlated to it. This applies just as much to a double-blind experimental design that gathers information about the efficacy of a new medical device, as it does to your eyes gathering information about your shoelaces.

  Maybe our spiritual scientist says: “But it’s not a matter for experiment. The spirits spoke to me in my heart.” Well, if we really suppose that spirits are speaking in any fashion whatsoever, that is a causal interaction and it counts as an observation. Probability theory still applies. If you propose that some personal experience of “spirit voices” is evidence for actual spirits, you must propose that there is a favorable likelihood ratio for spirits causing “spirit voices,” as compared to other explanations for “spirit voices,” which is sufficient to overcome the prior improbability of a complex belief with many parts. Failing to realize that “the spirits spoke to me in my heart” is an instance of “causal interaction,” is analogous to a physics student not realizing that a “medium with an index” means a material such as water.
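
  The requirement that the likelihood ratio overcome the prior can be written out with the odds form of Bayes’s rule: posterior odds equal prior odds times the likelihood ratio. The sketch below uses purely hypothetical numbers (a prior of one in a million and a likelihood ratio of 10), and the function names are illustrative.

```python
def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability using the odds form of Bayes's rule."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

def ratio_needed_for_even_odds(prior):
    """Likelihood ratio that would lift the hypothesis to a probability of 0.5."""
    return (1 - prior) / prior

prior = 1e-6                 # hypothetical prior for a complex hypothesis
modest_evidence = 10.0       # hypothetical likelihood ratio

updated = posterior_probability(prior, modest_evidence)
needed = ratio_needed_for_even_odds(prior)

print(updated, needed)
```

  With these made-up numbers, evidence that favors the hypothesis ten to one still leaves it very improbable, and the ratio needed to reach even odds is close to a million.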

  It is easy to be fooled, perhaps, by the fact that people wearing lab coats use the phrase “causal interaction” and that people wearing gaudy jewelry use the phrase “spirits speaking.” Discussants wearing different clothing, as we all know, demarcate independent spheres of existence—“separate magisteria,” in Stephen J. Gould’s immortal blunder of a phrase. Actually, “causal interaction” is just a fancy way of saying, “Something that makes something else happen,” and probability theory doesn’t care what clothes you wear.

  In modern society there is a prevalent notion that spiritual matters can’t be settled by logic or observation, and therefore you can have whatever religious beliefs you like. If a scientist falls for this, and decides to live their extralaboratorial life accordingly, then this, to me, says that they only understand the experimental principle as a social convention. They know when they are expected to do experiments and test the results for statistical significance. But put them in a context where it is socially conventional to make up wacky beliefs without looking, and they just as happily do that instead.

  The apprentice shepherd is told that if “seven” sheep go out, and “eight” sheep go out, then “fifteen” sheep had better come back in. Why “fifteen” instead of “fourteen” or “three”? Because otherwise you’ll get no dinner tonight, that’s why! So that’s professional training of a kind, and it works after a fashion—but if social convention is the only reason why seven sheep plus eight sheep equals fifteen sheep, then maybe seven apples plus eight apples equals three apples. Who’s to say that the rules shouldn’t be different for apples?

  But if you know why the rules work, you can see that addition is the same for sheep and for apples. Isaac Newton is justly revered, not for his outdated theory of gravity, but for discovering that—amazingly, surprisingly—the celestial planets, in the glorious heavens, obeyed just the same rules as falling apples. In the macroscopic world—the everyday ancestral environment—different trees bear different fruits, different customs hold for different people at different times. A genuinely unified universe, with stationary universal laws, is a highly counterintuitive notion to humans! It is only scientists who really believe it, though some religions may talk a good game about the “unity of all things.”

 
