Rationality: From AI to Zombies


by Eliezer Yudkowsky


  But for any non-Vast amount of training data—any training data that does not include the exact bitwise image now seen—there are superexponentially many possible concepts compatible with previous classifications.

  For the AI, choosing or weighting from among superexponential possibilities is a matter of inductive bias. Which may not match what the user has in mind. The gap between these two example-classifying processes—induction on the one hand, and the user’s actual goals on the other—is not trivial to cross.
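  (To make “superexponentially many” concrete, here is a minimal counting sketch, under the crude assumption that a “concept” is any boolean classification of images. The image size and example count below are arbitrary stand-ins for illustration, not anything drawn from Hibbard’s proposal.)

    # Counting sketch: how many boolean concepts remain compatible with the
    # training data, if a concept is any function from images to {+, -}.
    pixels = 4 * 4                     # a deliberately tiny binary image format
    possible_images = 2 ** pixels      # 65,536 distinct images
    # One independent label choice per possible image:
    possible_concepts = 2 ** possible_images

    labeled_examples = 1000            # distinct images labeled during training
    # Each labeled example pins the concept's value on exactly one image:
    still_compatible = 2 ** (possible_images - labeled_examples)

    print(f"{possible_images} possible images, 2^{possible_images} possible concepts,")
    print(f"2^{possible_images - labeled_examples} concepts still fit the training data.")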

  Let’s say the AI’s training data is:

  Dataset 1:

  +: Smile_1, Smile_2, Smile_3

  -: Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5.

  Now the AI grows up into a superintelligence, and encounters this data:

  Dataset 2:

  Frown_6, Cat_3, Smile_4, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2.

  It is not a property of these datasets that the inferred classification you would prefer is:

  +: Smile_1, Smile_2, Smile_3, Smile_4

  -: Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2.

  rather than

  +: Smile_1, Smile_2, Smile_3, Molecular_Smileyface_1, Molecular_Smileyface_2, Smile_4

  -: Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Cat_4, Galaxy_2, Nanofactory_2.

  Both of these classifications are compatible with the training data. The number of concepts compatible with the training data will be much larger, since more than one concept can project the same shadow onto the combined dataset. If the space of possible concepts includes the space of possible computations that classify instances, the space is infinite.
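  (As a purely illustrative sketch of “more than one concept projecting the same shadow,” here are two toy rules that agree on every label in Dataset 1 yet part ways on an instance that never appeared during training. The string-matching rules are inventions of this sketch, not a model of any real learner.)

    # Two concepts that cast identical shadows on the training data.
    training = {
        "Smile_1": True,  "Smile_2": True,  "Smile_3": True,
        "Frown_1": False, "Cat_1": False,   "Frown_2": False,
        "Frown_3": False, "Cat_2": False,   "Boat_1": False,
        "Car_1": False,   "Frown_5": False,
    }

    def concept_a(name):
        # Stand-in for "the kind of smile the user actually cares about."
        return name.startswith("Smile")

    def concept_b(name):
        # Stand-in for "smiley-face-shaped configurations of matter."
        return name.startswith("Smile") or name.startswith("Molecular_Smileyface")

    # Both are perfectly consistent with every label in Dataset 1 ...
    assert all(concept_a(x) == label for x, label in training.items())
    assert all(concept_b(x) == label for x, label in training.items())

    # ... and disagree only once the never-before-seen instances arrive.
    print(concept_a("Molecular_Smileyface_1"))  # False
    print(concept_b("Molecular_Smileyface_1"))  # True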

  Which classification will the AI choose? This is not an inherent property of the training data; it is a property of how the AI performs induction.

  Which is the correct classification? This is not a property of the training data; it is a property of your preferences (or, if you prefer, a property of the idealized abstract dynamic you name “right”).

  The concept that you wanted cast its shadow onto the training data as you yourself labeled each instance + or -, drawing on your own intelligence and preferences to do so. That’s what supervised learning is all about—providing the AI with labeled training examples that project a shadow of the causal process that generated the labels.

  But unless the training data is drawn from exactly the same context as the real-life application, the training data will be “shallow” in some sense, a projection from a much higher-dimensional space of possibilities.

  The AI never saw a tiny molecular smileyface during its dumber-than-human training phase, nor a tiny little agent with a happiness counter set to a googolplex. Now you, finally presented with a tiny molecular smiley—or perhaps a very realistic tiny sculpture of a human face—know at once that this is not what you want to count as a smile. But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values. It is your own plans and desires that are at work when you say “No!”

  Hibbard knows instinctively that a tiny molecular smileyface isn’t a “smile,” because he knows that’s not what he wants his putative AI to do. If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling—as opposed to frowning, say—even though it’s only paint.

  As the case of Terry Schiavo illustrates, technology enables new borderline cases that throw us into new, essentially moral dilemmas. Showing an AI pictures of living and dead humans as they existed during the age of Ancient Greece will not enable the AI to make a moral decision as to whether switching off Terry’s life support is murder. That information isn’t present in the dataset even inductively! Terry Schiavo raises new moral questions, appealing to new moral considerations, that you wouldn’t need to think about while classifying photos of living and dead humans from the time of Ancient Greece. No one was on life support then, still breathing with a brain half fluid. So such considerations play no role in the causal process that you use to classify the ancient-Greece training data, and hence cast no shadow on the training data, and hence are not accessible by induction on the training data.

  As a matter of formal fallacy, I see two anthropomorphic errors on display.

  The first fallacy is underestimating the complexity of a concept we develop for the sake of its value. The borders of the concept will depend on many values and probably on-the-fly moral reasoning, if the borderline case is of a kind we haven’t seen before. But all that takes place invisibly, in the background; to Hibbard it just seems that a tiny molecular smileyface is just obviously not a smile. And we don’t generate all possible borderline cases, so we don’t think of all the considerations that might play a role in redefining the concept, but haven’t yet played a role in defining it. Since people underestimate the complexity of their concepts, they underestimate the difficulty of inducing the concept from training data. (And also the difficulty of describing the concept directly—see The Hidden Complexity of Wishes.)

  The second fallacy is anthropomorphic optimism. Since Bill Hibbard uses his own intelligence to generate options and plans ranking high in his preference ordering, he is incredulous at the idea that a superintelligence could classify never-before-seen tiny molecular smileyfaces as a positive instance of “smile.” As Hibbard uses the “smile” concept (to describe desired behavior of superintelligences), extending “smile” to cover tiny molecular smileyfaces would rank very low in his preference ordering; it would be a stupid thing to do—inherently so, as a property of the concept itself—so surely a superintelligence would not do it; this is just obviously the wrong classification. Certainly a superintelligence can see which heaps of pebbles are correct or incorrect.

  Why, Friendly AI isn’t hard at all! All you need is an AI that does what’s good! Oh, sure, not every possible mind does what’s good—but in this case, we just program the superintelligence to do what’s good. All you need is a neural network that sees a few instances of good things and not-good things, and you’ve got a classifier. Hook that up to an expected utility maximizer and you’re done!

  I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate “winning” sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

  The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good,” that can’t be fully delineated in any training data you can give the AI during its childhood. Relative to the full space of possibilities the Future encompasses, we ourselves haven’t imagined most of the borderline cases, and would have to engage in full-fledged moral arguments to figure them out. To solve the FAI problem you have to step outside the paradigm of induction on human-labeled training data and the paradigm of human-generated intensional definitions.

  Of course, even if Hibbard did succeed in conveying to an AI a concept that covers exactly every human facial expression that Hibbard would label a “smile,” and excludes every facial expression that Hibbard wouldn’t label a “smile” . . .

  Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers.

  When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.

  The deep answers to such problems are beyond the scope of this essay, but it is a general principle of Friendly AI that there are no bandaids. In 2004, Hibbard modified his proposal to assert that expressions of human agreement should reinforce the definition of happiness, and then happiness should reinforce other behaviors. Which, even if it worked, just leads to the AI xeroxing a horde of things similar-in-its-conceptspace to programmers saying “Yes, that’s happiness!” about hydrogen atoms—hydrogen atoms are easy to make.

  Link to my discussion with Hibbard here. You already got the important parts.

  *

  1. Bill Hibbard, “Super-Intelligent Machines,” ACM SIGGRAPH Computer Graphics 35, no. 1 (2001): 13–15, http://www.siggraph.org/publications/newsletter/issues/v35/v35n1.pdf.

  2. Eliezer Yudkowsky, “Artificial Intelligence as a Positive and Negative Factor in Global Risk,” in Bostrom and Ćirković, Global Catastrophic Risks, 308–345.

  275

  The True Prisoner’s Dilemma

  It occurred to me one day that the standard visualization of the Prisoner’s Dilemma is fake.

  The core of the Prisoner’s Dilemma is this symmetric payoff matrix:

              1: C       1: D

  2: C       (3, 3)     (5, 0)

  2: D       (0, 5)     (2, 2)

  Player 1 and Player 2 can each choose C or D. Player 1’s and Player 2’s utilities for the final outcome are given by the first and second number in the pair. For reasons that will become apparent, “C” stands for “cooperate” and “D” stands for “defect.”

  Observe that a player in this game (regarding themselves as the first player) has this preference ordering over outcomes: (D,C) > (C,C) > (D,D) > (C,D).

  Option D, it would seem, dominates C: If the other player chooses C, you prefer (D,C) to (C,C); and if the other player chooses D, you prefer (D,D) to (C,D). So you wisely choose D, and as the payoff table is symmetric, the other player likewise chooses D.

  If only you’d both been less wise! You both prefer (C,C) to (D,D). That is, you both prefer mutual cooperation to mutual defection.
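  (For readers who want the dominance argument checked mechanically, here is a small sketch; the dictionary simply transcribes the matrix above, keyed from Player 1’s point of view.)

    # payoff[(my_move, their_move)] = (my_utility, their_utility)
    payoff = {
        ("C", "C"): (3, 3),
        ("C", "D"): (0, 5),
        ("D", "C"): (5, 0),
        ("D", "D"): (2, 2),
    }

    # D strictly dominates C: whatever the other player does, D pays more.
    for their_move in ("C", "D"):
        u_c = payoff[("C", their_move)][0]
        u_d = payoff[("D", their_move)][0]
        print(f"Other player plays {their_move}: U(D) = {u_d} > U(C) = {u_c}")

    # Yet mutual cooperation beats mutual defection for both players.
    print(payoff[("C", "C")], "versus", payoff[("D", "D")])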

  The Prisoner’s Dilemma is one of the great foundational issues in decision theory, and enormous volumes of material have been written about it. Which makes it an audacious assertion of mine, that the usual way of visualizing the Prisoner’s Dilemma has a severe flaw, at least if you happen to be human.

  The classic visualization of the Prisoner’s Dilemma is as follows: you are a criminal, and you and your confederate in crime have both been captured by the authorities.

  Independently, without communicating, and without being able to change your mind afterward, you have to decide whether to give testimony against your confederate (D) or remain silent (C).

  Both of you, right now, are facing one-year prison sentences; testifying (D) takes one year off your prison sentence, and adds two years to your confederate’s sentence.
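  (A quick sketch of the arithmetic, if you want to confirm that this story reproduces the same preference ordering, with fewer years in prison standing in for higher utility.)

    # Each prisoner starts with a 1-year sentence; testifying (D) takes 1 year
    # off your own sentence and adds 2 years to your confederate's.
    def sentences(my_move, their_move):
        mine, theirs = 1, 1
        if my_move == "D":
            mine -= 1
            theirs += 2
        if their_move == "D":
            theirs -= 1
            mine += 2
        return mine, theirs

    for moves in [("D", "C"), ("C", "C"), ("D", "D"), ("C", "D")]:
        print(moves, "-> my sentence:", sentences(*moves)[0], "years")
    # 0 < 1 < 2 < 3 years, matching (D,C) > (C,C) > (D,D) > (C,D).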

  Or maybe you and some stranger are, only once, and without knowing the other player’s history or finding out who the player was afterward, deciding whether to play C or D, for a payoff in dollars matching the standard chart.

  And, oh yes—in the classic visualization you’re supposed to pretend that you’re entirely selfish, that you don’t care about your confederate criminal, or the player in the other room.

  It’s this last specification that makes the classic visualization, in my view, fake.

  You can’t avoid hindsight bias by instructing a jury to pretend not to know the real outcome of a set of events. And without a complicated effort backed up by considerable knowledge, a neurologically intact human being cannot pretend to be genuinely, truly selfish.

  We’re born with a sense of fairness, honor, empathy, sympathy, and even altruism—the result of our ancestors’ adapting to play the iterated Prisoner’s Dilemma. We don’t really, truly, absolutely and entirely prefer (D,C) to (C,C), though we may entirely prefer (C,C) to (D,D) and (D,D) to (C,D). The thought of our confederate spending three years in prison does not entirely fail to move us.

  In that locked cell where we play a simple game under the supervision of economic psychologists, we are not entirely and absolutely without sympathy for the stranger who might cooperate. We aren’t entirely happy to think that we might defect and the stranger cooperate, getting five dollars while the stranger gets nothing.

  We fixate instinctively on the (C,C) outcome and search for ways to argue that it should be the mutual decision: “How can we ensure mutual cooperation?” is the instinctive thought. Not “How can I trick the other player into playing C while I play D for the maximum payoff?”

  For someone with an impulse toward altruism, or honor, or fairness, the Prisoner’s Dilemma doesn’t really have the critical payoff matrix—whatever the financial payoff to individuals. The outcome (C,C) is preferable to the outcome (D,C), and the key question is whether the other player sees it the same way.

  And no, you can’t instruct people being initially introduced to game theory to pretend they’re completely selfish—any more than you can instruct human beings being introduced to anthropomorphism to pretend they’re expected paperclip maximizers.

  To construct the True Prisoner’s Dilemma, the situation has to be something like this:

  Player 1: Human beings, Friendly AI, or other humane intelligence.

  Player 2: Unfriendly AI, or an alien that only cares about sorting pebbles.

  Let’s suppose that four billion human beings—not the whole human species, but a significant part of it—are currently progressing through a fatal disease that can only be cured by substance S.

  However, substance S can only be produced by working with a paperclip maximizer from another dimension—substance S can also be used to produce paperclips. The paperclip maximizer only cares about the number of paperclips in its own universe, not in ours, so we can’t offer to produce or threaten to destroy paperclips here. We have never interacted with the paperclip maximizer before, and will never interact with it again.

  Both humanity and the paperclip maximizer will get a single chance to seize some additional part of substance S for themselves, just before the dimensional nexus collapses; but the seizure process destroys some of substance S.

  The payoff matrix is as follows:

              1: C                                                  1: D

  2: C       (2 billion human lives saved, 2 paperclips gained)    (+3 billion lives saved, +0 paperclips)

  2: D       (+0 lives, +3 paperclips)                             (+1 billion lives, +1 paperclip)

  I’ve chosen this payoff matrix to produce a sense of indignation at the thought that the paperclip maximizer wants to trade off billions of human lives against a couple of paperclips. Clearly the paperclip maximizer should just let us have all of substance S. But a paperclip maximizer doesn’t do what it should; it just maximizes paperclips.

  In this case, we really do prefer the outcome (D,C) to the outcome (C,C), leaving aside the actions that produced it. We would vastly rather live in a universe where 3 billion humans were cured of their disease and no paperclips were produced, rather than sacrifice a billion human lives to produce 2 paperclips. It doesn’t seem right to cooperate, in a case like this. It doesn’t even seem fair—so great a sacrifice by us, for so little gain by the paperclip maximizer? And let us specify that the paperclip-agent experiences no pain or pleasure—it just outputs actions that steer its universe to contain more paperclips. The paperclip-agent will experience no pleasure at gaining paperclips, no hurt from losing paperclips, and no painful sense of betrayal if we betray it.

  What do you do then? Do you cooperate when you really, definitely, truly and absolutely do want the highest reward you can get, and you don’t care a tiny bit by comparison about what happens to the other player? When it seems right to defect even if the other player cooperates?

  That’s what the payoff matrix for the true Prisoner’s Dilemma looks like—a situation where (D,C) seems righter than (C,C).

  But all the rest of the logic—everything about what happens if both agents think that way, and both agents defect—is the same. For the paperclip maximizer cares as little about human deaths, or human pain, or a human sense of betrayal, as we care about paperclips. Yet we both prefer (C,C) to (D,D).
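  (The parallel structure can be checked the same way as the earlier matrix. In this sketch each agent scores an outcome only in its own currency, lives for us and paperclips for the paperclip maximizer, with the numbers transcribed from the payoff matrix above.)

    # outcome[(human_move, clippy_move)] = (billions of lives saved, paperclips)
    outcome = {
        ("C", "C"): (2, 2),
        ("C", "D"): (0, 3),
        ("D", "C"): (3, 0),
        ("D", "D"): (1, 1),
    }

    def lives(o): return o[0]   # humanity cares only about lives saved
    def clips(o): return o[1]   # the paperclip maximizer cares only about paperclips

    # D dominates C for humanity, whatever the paperclip maximizer does ...
    for c in ("C", "D"):
        assert lives(outcome[("D", c)]) > lives(outcome[("C", c)])
    # ... and D dominates C for the paperclip maximizer, whatever humanity does.
    for h in ("C", "D"):
        assert clips(outcome[(h, "D")]) > clips(outcome[(h, "C")])

    # Yet both agents prefer (C,C) to the (D,D) that dominance reasoning yields.
    assert lives(outcome[("C", "C")]) > lives(outcome[("D", "D")])
    assert clips(outcome[("C", "C")]) > clips(outcome[("D", "D")])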

  So if you’ve ever prided yourself on cooperating in the Prisoner’s Dilemma . . . or questioned the verdict of classical game theory that the “rational” choice is to defect . . . then what do you say to the True Prisoner’s Dilemma above?

  PS: In fact, I don’t think rational agents should always defect in one-shot Prisoner’s Dilemmas, when the other player will cooperate if it expects you to do the same. I think there are situations where two agents can rationally achieve (C,C) as opposed to (D,D), and reap the associated benefits.1

  I’ll explain some of my reasoning when I discuss Newcomb’s Problem. But we can’t talk about whether rational cooperation is possible in this dilemma until we’ve dispensed with the visceral sense that the (C,C) outcome is nice or good in itself. We have to see past the prosocial label “mutual cooperation” if we are to grasp the math. If you intuit that (C,C) trumps (D,D) from Player 1’s perspective, but don’t intuit that (D,C) also trumps (C,C), you haven’t yet appreciated what makes this problem difficult.
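  (One toy construction, offered only as a sketch and not as the argument being deferred to the Newcomb’s Problem discussion: if each player can inspect the other’s decision procedure, then a program that cooperates exactly when its counterpart is identical to itself makes mutual cooperation stable while giving a would-be defector nothing to exploit.)

    import inspect

    def clique_bot(opponent_source: str) -> str:
        """Cooperate iff the opponent's source code is character-for-character mine."""
        my_source = inspect.getsource(clique_bot)
        return "C" if opponent_source == my_source else "D"

    def defect_bot(opponent_source: str) -> str:
        return "D"

    # Two copies of clique_bot reach (C,C); against a defector it plays (D,D).
    cb_src, db_src = inspect.getsource(clique_bot), inspect.getsource(defect_bot)
    print(clique_bot(cb_src), clique_bot(cb_src))   # C C
    print(clique_bot(db_src), defect_bot(cb_src))   # D D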

  *

  1. Eliezer Yudkowsky, Timeless Decision Theory, Unpublished manuscript (Machine Intelligence Research Institute, Berkeley, CA, 2010), http://intelligence.org/files/TDT.pdf.

  276

  Sympathetic Minds

  “Mirror neurons” are neurons that are active both when performing an action and observing the same action—for example, a neuron that fires when you hold up a finger or see someone else holding up a finger. Such neurons have been directly recorded in primates, and consistent neuroimaging evidence has been found for humans.

 
