The Most Human Human

by Brian Christian

ME: well,

ELBOT: yes!

ME: um …

ELBOT: That would be a great name for a robot!

With a number of customer-representative-type bots, like the ones at alaskaair.com, continental.com, and goarmy.com, the input screen locks into a “Please wait …” message after each carriage return:

YOU ASKED: uh

JENN: That’s ok, what else can I help you with?

YOU ASKED: well

JENN: I’m not sure how to answer that. Please try rewording your question.

YOU ASKED: um

JENN: Children traveling alone are welcome on Alaska Airlines and Horizon Air. Here’s information regarding our Unaccompanied Minor service.

It’s fascinating that for all the trenchant questioning that many of these programs can deal with, mere mumbling destroys them.

Leveraging the Medium

I often judge a work of art by how necessary it is for that work to be in that medium. In part, that means how much it leverages the medium’s unique capacities: a symphony, of course, can do things that an essay can’t, and vice versa.

Different written media, for instance, leverage their unique characteristics with time to produce different brands of intimacy, different communicative possibilities—the quick-moving text message saying one was just now being thought of and producing the empathic delight of a moment shared across space between two minds; the slow-moving postal letter or handmade gift saying that one was being thought of at length without knowing it, imbuing future days with the sheen of possibility.

In some sense the story of instant messaging—the medium of the Turing test—is the story of the telegram, accelerated to the breaking point.

I discover, though, that the protocol the Loebner Prize was using in 2009 was unlike emails, text messages, and standard instant-messaging systems in a very crucial way. The Loebner Prize chat protocols transmitted typing keystroke by keystroke. You’re watching each other type, typos and backspacing and all.

Part of what I needed to figure out was how exactly to use the Loebner Prize competition’s unusual “live typing” medium. What did this enable and disable, compared to the standard walkie-talkie, telegram-like, turn-taking style?

To make typing visible also, then, makes typing’s “negative space” visible: hesitation. In a chat conversation where text is transmitted with every carriage return, only egregiously long pauses are taken to be “part” of the interaction. With more fluid and immediate feedback, silence acquires meaning. Failing to quickly answer a question in a face-to-face conversation, for instance, is tantamount in many cases to answering it. I recall asking a friend of mine how things were going with the girl he’d started seeing; the tiny pause and “um” in which he searched for the right words made it clear they were having problems. In other words, it wasn’t his choice of words, but the choosing itself, that told the tale.3

I remember some Internet chat programs back in the ’90s trying out this character-at-a-time approach, but people for the most part rejected it. It was too invasive, was the feeling: what people like about writing is the time and space to compose and edit before sharing it with the other person. The advantage of the character-at-a-time transmission, though, is that it starts to approach much more closely the condition of speech, with its fluidity of turn-taking and its choppy grammar, where what’s lost in eloquence is made up for in agility.

Long missives weren’t going to work, as they have in certain years, where programs were able to steamroller the judges by eating up the clock and delivering ridiculously prolix answers. If two parties are taking strict turns under time pressure, they are putting themselves at the mercy of the length of the other’s turns—here, for instance, is one program from 1996:

REMOTE: How about telling me a joke?

JUDGE: No.

REMOTE: Here’s my favorite joke … A dog limps into a bar with a bloody bandage on his foot. He’s wearing a 10-gallon hat, six-shooters in his holsters and chaps. He growled, “I’m a-lookin’ for the man that shot my paw.”

“The joke’s not funny …,” the judge writes, seemingly prompting the program to tell another one—which it does (“A knotty, worn-out old string walks into a bar …”). Meanwhile, three full minutes have elapsed.

Confederates, too, sometimes err on the epistolary side, as did a confederate in 1992, speaking on the topic of “Health”:

JUDGE: well i generally eat various garbage that happens my way. to say i have a program would be really pushin it. Doritos are far and above my favorite feel good and groovy chow.

REMOTE: You have to be careful with garbage. That brings out another point. Natural health is about intuition to a great extent. When you eat too much garbage, you have to be careful of maintaining balance by your own intuition because sometimes, if you don’t control it at this initial stage, you don’t know until you see the results in your health that you have been harmed. High cholesterol is perfect example of this, and we all know its results; clogged arteries that are then very difficult to remedy, and there is controversy, over whether it is reversible.

This judge voted her a computer.

If a computer (or confederate) started rambling on too long in a “barge-in-able” test, the judge would just cut them off.

I realized something else about the character-by-character transmission, and what that might allow. Sometimes spoken dialogue becomes slightly nonlinear—as in, “I went to the store and bought milk and eggs, and on my way home I ran into Shelby—oh, and bread too,” where we understand that bread goes with “bought” and not “ran into.” (This is part of the function of “oh,” another one of those words that traditional linguistics has had no truck with.) For the most part, though, there is so little lag time between the participants, and between the composition of a sentence in their minds and their speaking it out loud, that the subject matter rarely branches entirely into two parallel threads. In an instant-message conversation, the small window of time in which one person is typing, but the other cannot see what’s being typed, is frequently enough to send the conversation in two directions at once:

A: how was your trip?

A: oh, and did you get to see the volcano?

B: good! how’ve things been back at the homestead?

A: oh, you know, the usual

B: yes we did get to see it!

Here the conversation starts to develop separate and parallel threads, such that each person’s remark isn’t necessarily about the most recent remark. It’s possible that watching each other type eliminates the lag that creates this branching, although I had reason to believe it would do something else altogether …

Talking simultaneously for extended periods simply doesn’t work, as our voice—emanating just inches away from our ears—mixes confusingly with our interlocutor’s in the air and makes it hard to hear what they are saying. I was fascinated to learn that the deaf don’t encounter this problem: they can comfortably sign while watching someone else sign. In large groups it still makes sense to have one “speaker” at a time, because people cannot look in more than one direction at a time, but conversations between pairs of signers, as Rochester Institute of Technology researcher Jonathan Schull observed, “involve more continuous simultaneous and overlapping signing among interlocutors” than spoken conversations. Signers, in other words, talk and listen at the same time. Schull and his collaborators conclude that turn-taking, even turn negotiation, far from being an essential and necessary property of communication, “is a reluctant accommodation to channel-contingent constraints.”

One major difference between the Loebner protocols and traditional instant messaging is that, because the text is being created without any obvious ordering that would enable it to be arranged together on the screen, each user’s typing appears in a separate area of the screen. Like sign language, this makes group conversation rather difficult, but offers fascinating possibilities for two-person exchange.

Another piece of my confederate strategy fell into place. I would treat the Turing test’s strange and unfamiliar textual medium more like spoken and signed, and less like written, English. I would attempt to disrupt the turn-taking “wait and parse” pattern that computers understand and create a single, flowing duet of verbal behavior, emphasizing timing: whatever little computers understand about verbal “harmony,” it still dwarfs what they understand about rhythm.

I would talk in a way that would, like a Ferneyhough piece, force satisficing over optimization. If nothing was happening on my screen, whether or not it was my turn, I’d elaborate a little on my answer, or add a parenthetical, or throw a question back at the judge—just as we offer and/or fill audible silence when we talk out loud. If the judge took too long considering his next question, I’d keep talking. I’m the one (unlike the bots) with something to prove. If I understood what the judges were writing, I’d spare them the keystrokes or seconds and jump in.

There’s a trade-off, of course, between the number of opportunities for interaction and response, on the one hand, and the sophistication of the responses themselves, on the other. The former thrives with brevity, the latter with length. It seemed to me, though, that so much of the difficulty and nuance in conversation comes from understanding the question and offering an appropriate response—thus it makes sense to maximize the number of interchanges.

Some judges, I would discover, would be startled or confused at this jumping of the gun, and I saw them pause, hesitate, yield, even start backspacing what they had half written. Other judges cottoned on immediately, and leaped right in after.4

In the first round of the 2009 contest, judge Shalom Lappin—computational linguist at King’s College London—spoke with Cleverbot, and then myself. My strategy of verbosity was clearly in evidence: I made 1,089 keystrokes in five minutes (3.6 keystrokes a second) to Cleverbot’s 356 (1.2/sec), and Lappin made 548 keystrokes (1.8/sec) in my conversation, compared to 397 (1.3/sec) with Cleverbot. Not only did I say three times as much as my silicon adversary, but I engaged the judge more, to the tune of 38 percent more typing from Lappin.
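The arithmetic behind these figures can be checked with a short sketch. It assumes only the numbers quoted above (a five-minute round and the four keystroke counts); the helper names `per_second` and `percent_more` are made up for illustration and are not taken from the contest logs.

```python
SESSION_SECONDS = 5 * 60  # one five-minute round


def per_second(keystrokes: int, seconds: int = SESSION_SECONDS) -> float:
    """Average typing rate over the round, rounded to one decimal place."""
    return round(keystrokes / seconds, 1)


def percent_more(a: int, b: int) -> int:
    """How much larger a is than b, as a whole-number percentage."""
    return round((a / b - 1) * 100)


print(per_second(1089))         # 3.6  (confederate)
print(per_second(356))          # 1.2  (Cleverbot)
print(per_second(548))          # 1.8  (judge, talking with the confederate)
print(per_second(397))          # 1.3  (judge, talking with Cleverbot)
print(percent_more(548, 397))   # 38   (extra typing from the judge)
print(round(1089 / 356, 1))     # 3.1  (roughly "three times as much")
```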

Looking back at the logs, though, I wanted to see if there was a way to quantify the fluidity of the human interactions against the rigidity of the machine ones. It occurred to me that you could create a benchmark—let’s call it “swaps”—for the number of times that the party who typed the most recent keystroke changes.

I did manage to type three times as much as Cleverbot, but the real story, as it turns out, is in the swaps. Lappin’s conversation with Cleverbot had 33 such events; his conversation with me had 492, almost fifteen times that.
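One way the “swaps” benchmark might be computed is sketched below. The log layout (a sequence of `(party, key)` pairs in the order typed) and the names `count_swaps` and `as_log` are assumptions made for illustration; the actual Loebner Prize log format is not described here.

```python
from typing import Iterable, List, Optional, Tuple

# Each log entry is (party, key). This layout is a hypothetical one.
Keystroke = Tuple[str, str]


def count_swaps(log: Iterable[Keystroke]) -> int:
    """Count how often the party who typed the most recent keystroke changes."""
    swaps = 0
    last_party: Optional[str] = None
    for party, _key in log:
        if last_party is not None and party != last_party:
            swaps += 1
        last_party = party
    return swaps


def as_log(party: str, text: str) -> List[Keystroke]:
    """Expand a typed string into one (party, key) entry per character."""
    return [(party, ch) for ch in text]


# Strict turn-taking: each party finishes a line before the other starts.
turn_taking = (
    as_log("JUDGE", "hi\n") + as_log("REMOTE", "yo\n") + as_log("JUDGE", "ok\n")
)

# Overlapping typing: the two parties' keystrokes interleave.
overlapping = [("JUDGE", "h"), ("REMOTE", "y"), ("JUDGE", "i"), ("REMOTE", "o")]

print(count_swaps(turn_taking))  # 2
print(count_swaps(overlapping))  # 3
```

Under this definition a strictly alternating conversation scores one swap per change of turn, while overlapping typing can score a swap on nearly every keystroke. The reported totals give 492 / 33 ≈ 14.9, which matches “almost fifteen times.”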

There’s no single obvious way to show, on the page, what that kind of interaction looks like (and this is partially the point, the gap between performance and transcript). His Cleverbot conversation, written in the traditional script style, looks a bit like this:

JUDGE: What language are you written in?

REMOTE: English.

JUDGE: Give me a break!

REMOTE: Ok.

JUDGE: How do you enjoy the conference?

Even if we notate it more like musical staves, keystroke by keystroke,5 there’s very little more to the story. The keystrokes line up in tidy lines: statement/response/statement/response:

Four carriage returns, four swaps. And our conversation, as a script, looks like this:

REMOTE: sweet, picking up an instrument

JUDGE: I meant Stones, Dylan, Beatles …

But the musical-staff-looking keystroke logs look utterly unlike the Cleverbot logs, and they tell a much different story:

Two carriage returns, fifty-one swaps.

Alternately, we might try a third notation, which makes the difference even clearer: to string all the letters together, bolding the judge’s keystrokes and leaving the computer’s and my own unbolded. You get this from the human-computer dialogues:

And this from the human-human dialogues:

Now if that difference isn’t night and day, I don’t know what is. Over.

1. Some equations (the Newtonian parabolas that projectiles follow, for instance) are such that you can just plug in any old future value for time and get a description of the future state of events. Other calculations (e.g., some cellular automata) contain no such shortcuts. Such processes are called “computationally irreducible.” Future time values cannot simply be “plugged in”; rather, you have to run the simulation all the way from point A to point Z, including all intermediate steps. Stephen Wolfram, in A New Kind of Science, attempts to reconcile free will and determinism by conjecturing that the workings of the human brain are “irreducible” in this way: that is, there are no Newtonian-style “laws” that allow us shortcuts to knowing in advance what people will do. We simply have to observe them.

2. Linguists have dubbed this “back-channel feedback.”

3. Apparently the world of depositions is changing as a result of the move from written transcripts to video. After being asked an uncomfortable question, one expert witness, I was told, rolled his eyes and glowered at the deposing attorney, then shifted uncomfortably in his chair for a full fifty-five seconds, before saying, smugly and with audible venom, “I don’t recall.” He had the transcript in mind. But when a video of that conversation was shown in court, he went down in flames.

4. As Georgetown University linguist Deborah Tannen notes: “This all-together-now interaction-focused approach to conversation is more common throughout the world than our one-at-a-time information-focused approach.”

5. We’ll use “_” to mean a space, “” to mean carriage return/enter, and “»” to mean backspace.

8. The World’s Worst Deponent

Body (&) Language

Language is an odd thing. We hear communication experts telling us time and again about things like the “7-38-55 rule,” first posited in 1971 by UCLA psychology professor Albert Mehrabian: 55 percent of what you convey when you speak comes from your body language, 38 percent from your tone of voice, and a paltry 7 percent from the words you choose.

Yet it’s that 7 percent that can and will be held against you in a court of law: we are held, legally, to our diction much more than we are held to our tone or posture. These things may speak louder than words, but they are far harder to transcribe or record. Likewise, it’s harder to defend against an accusation of using a certain word than it is to defend against an accusation of using a certain tone; also, it’s much more permissible for an attorney quoting a piece of dialogue to superimpose her own body language and intonation—because they cannot be reproduced completely accurately in the first place—than to supply her own diction.

It’s that same, mere 7 percent that is all you have to prove your humanity in a Turing test.

Lie Detection

One way to think about the Turing test is as a lie-detection test. Most of what the computer says—notably, what it says about itself—is false. In fact, depending on your philosophical bent, you might say that the software is incapable of expressing truth at all (in the sense that we usually insist that a liar must understand the meaning of his words for it to count as lying). I became interested, as a confederate, in examples where humans have to confront other humans in situations where one is attempting to obtain information that the other one doesn’t want to give out, or one is attempting to prove that the other one is lying.

One of the major arenas in which these types of encounters and interactions play out is the legal world. In a deposition, for instance, most any question is fair game—the lawyer is, often, trying to be moderately sneaky or tricky, the deponent knows to expect this, and the lawyer knows to expect them expecting this, and so on. There are some great findings that an attorney can use to her advantage—for example, telling a story backward is almost impossible if the story is false. (Falsehood would not appear to be as modular and flexible as truth.) However, certain types of questions are considered “out of bounds,” and the deponent’s attorney can make what’s called a “form objection.”

There are several types of questions that can be objected to at a formal level. Leading questions, which suggest an answer (“You were at the park, weren’t you?”), are out of bounds, as are argumentative questions (“How do you expect the jury to believe that?”), which challenge the witness without actually attempting to discover any particular facts or information. Other formally objectionable structures include compound questions, ambiguous questions, questions assuming facts not yet established, speculative questions, questions that improperly characterize the person’s earlier testimony, and cumulative or repetitive questions.

In the courtroom, verbal guile of this nature is off-limits, but it may be that we find this very borderline—between appropriate and inappropriate levels of verbal gamesmanship—is precisely the place where we want to position ourselves in a Turing test. The Turing test has no rules of protocol—anything is permissible, from obscenity to nonsense—and so interrogative approaches deemed too cognitively taxing or indirect or theatrical for the legal process may, in fact, be perfect for teasing apart human and machine responses.

Questions Deserving Mu

To take one example, asking a “simple” yes-or-no question might prompt an incorrect answer, which might provide evidence that the respondent is a computer. In 1995, a judge responded to “They have most everything on Star Trek” by asking, “Including [rock band] Nine Inch Nails?” The answer: an unqualified “Yes.” “What episode was that?” says the judge. “I can’t remember.” This line of questioning goes some way toward establishing that the interlocutor is just answering at random (and is thereby probably a machine that simply doesn’t understand the questions), but even so, it takes some digging to make sure that your conversant didn’t simply misunderstand what you asked, isn’t simply being sarcastic, etc.—all of which takes time.
