by Dan Ariely
As for Weizenbaum, appalled and horrified, he did something almost unheard-of: an about-face on his entire career. He pulled the plug on the Eliza project, encouraged his own critics, and became one of science’s most outspoken opponents of AI research. But the genie was out of the bottle, and there was no going back. The basic “template matching” skeleton and approach of Eliza has been reworked and implemented in some form or another in almost every chat program since, including the contenders at the 2009 Loebner Prize competition. The enthusiasm—as well as the unease—about these programs has only grown.
The Jerk as Genius
The first Loebner Prize competition was held on November 8, 1991, at the Boston Computer Museum. In its first few years, the contest required each program and human confederate to choose a topic, as a means of limiting the conversation. One of the confederates in 1991 was the Shakespeare expert Cynthia Clay, who was famously deemed a computer by three different judges after a conversation about the playwright. The consensus seemed to be: “No one knows that much about Shakespeare.” (For this reason, Clay took her misclassifications as a compliment.)
The programmer Joseph Weintraub chose “Whimsical Conversation” for his PC Therapist III and went on to earn the very first Most Human Computer award. As the program practiced it, “whimsical conversation” was not a topic at all but a manner—a crucial distinction. Weintraub was able to play off a looseness in the contest rules, the natural behavior patterns of chatbots, and people’s stereotypes about computers and humans, all at once. Strategically, this was brilliant.
By dutifully and scrupulously providing information in response to the questions asked, Clay demonstrated her knowledge and understanding—but sometimes in a factual, encyclopedic way commonly associated with computer systems. Here is a sample of Clay’s conversation:
JUDGE: What is your opinion on Shakespeare’s plays?
CONFEDERATE: That’s pretty general; would you be more specific? Otherwise, I’ll just say I like them.
The computer at the “Whimsical Conversation” terminal stood in stark contrast:
JUDGE: It has been a long day! Now I think I’m ready for some fun. The weather isn’t very pleasant today. Do you have any jokes that could lift the clouds?
COMPUTER: Everybody talks about the weather but nobody seems to do much about it. Why do you need to tell me now you think you’re ready for some fun? Groannnnnn . . . . . . . . . . . .
JUDGE: What is the definition of whimsical conversation?
COMPUTER: Didn’t you say well anyway I would imagine whimsical conversation is like a series of knight’s moves in chess-skrt of zig-zag rather than following any nparticular line.? [Weintraub’s program, like many that followed it, faked typos. Any unintentional typing errors in the transcripts that follow have been corrected.] Why do you ask? Was that a clever comment, or what? Are such questions much on your mind?
JUDGE: I like the image of knights moving haphazardly across the chess board, does that mean there is no thought to whimsical conversation?
COMPUTER: I could swear you just said how do you define whimsical?? How do yku define whimsical? Were you always so sick sick sick? Why do you need to tell me you like the image of knights moving haphazardly across the chess board?
“I really thought [PC Therapist] was human, because it . . . was weird and funny, in a normal sort of way,” said one of the judges, a Harvard graduate student. Weintraub’s program, shifting topics wildly and spouting non sequiturs and canned one-liners, came off as zany, a jokester, a much more “human” personality type. At least I used to think so—before I learned how easy this was to mimic.
As Richard Wallace, three-time winner of the Most Human Computer award (’00, ’01, and ’04), explains: “Experience with [Wallace’s chatbot] ALICE indicates that most casual conversation is “state-less,” that is, each reply depends only on the current query, without any knowledge of the history of the conversation required to formulate the reply.” Many human conversations function in this way, and it behooves AI researchers to determine which types of conversation are stateless—with each remark depending only on the last—and try to create these very sorts of interactions. It’s our job as confederates, as humans, to resist them.
One of the classic stateless-conversation types is the kind of zany free-associative riffing that Weintraub’s program, PC Therapist III, employed. Another, it turns out, is verbal abuse.
In May 1989, Mark Humphrys, a twenty-one-year-old University College Dublin undergraduate, put online an Eliza-style program he’d written, called MGonz, and left the building for the day. A user (screen name “Someone”) at Drake University in Iowa tentatively sent the message “finger” to Humphrys’s account—an early-Internet command that acted as a request for basic information about a user. To Someone’s surprise, a response came back immediately: “cut this cryptic shit speak in full sentences.” This began an argument between Someone and MGonz that lasted almost an hour and a half. (The best part was undoubtedly when Someone said, “you sound like a goddamn robot that repeats everything.”)
Returning to the lab the next morning, Humphrys was stunned to find the log and felt a strange, ambivalent emotion. His program might have just shown how to pass the Turing Test, he thought—but the evidence was so profane that he was afraid to publish it.
Humphrys’s twist on the Eliza paradigm was to abandon the therapist persona for that of an abusive jerk; when it lacked any clear cue for what to say, MGonz fell back not on therapy clichés like “How does that make you feel?” but on things like “You are obviously an asshole,” or “Ah type something interesting or shut up.” It’s a stroke of genius because, as becomes painfully clear from reading the MGonz transcripts, argument is stateless—that is, unanchored from all context, a kind of Markov chain of riposte, meta-riposte, meta-meta-riposte. Each remark after the first is only about the previous remark. If a program can induce us to sink to this level, of course it can pass the Turing Test.
Once again, the question of what types of human behavior computers can imitate shines light on how we conduct our own, human lives. Verbal abuse is simply less complex than other forms of conversation. In fact, since reading the papers on MGonz and transcripts of its conversations, I find myself much more able to constructively manage heated conversations. Aware of the stateless, knee-jerk character of the terse remark I want to blurt out, I recognize that that remark has far more to do with a reflex reaction to the very last sentence of the conversation than with either the issue at hand or the person I’m talking to. All of a sudden, the absurdity and ridiculousness of this kind of escalation become quantitatively clear, and, contemptuously unwilling to act like a bot, I steer myself toward a more “stateful” response: better living through science.
Beware of Banality
Entering the Brighton Centre, I found my way to the Loebner Prize contest room. I saw rows of seats, where a handful of audience members had already gathered; up front, what could only be the bot programmers worked hurriedly, plugging in tangles of wires and making the last flurries of keystrokes. Before I could get too good a look at them, this year’s test organizer, Philip Jackson, greeted me and led me behind a velvet curtain to the confederate area. Out of view of the audience and the judges, the four of us confederates sat around a rectangular table, each at a laptop set up for the test: Doug, a Canadian linguistics researcher; Dave, an American engineer working for Sandia National Laboratories; Olga, a speech-research graduate student from South Africa; and me. As we introduced ourselves, we could hear the judges and audience members slowly filing in, but couldn’t see them around the curtain. A man zoomed by in a green floral shirt, talking a mile a minute and devouring finger sandwiches. Though I had never met him before, I knew instantly he could be only one person: Hugh Loebner. Everything was in place, he told us between bites, and the first round of the test would start momentarily. We four confederates grew quiet, staring at the blinking cursors on our laptops. My hands were poised over the
keyboard, like a nervous gunfighter’s over his holsters.
The cursor, blinking. I, unblinking. Then all at once, letters and words began to materialize:
Hi how are you doing?
The Turing Test had begun.
I had learned from reading past Loebner Prize transcripts that judges come in two types: the small-talkers and the interrogators. The latter go straight in with word problems, spatial-reasoning questions, deliberate misspellings. They lay down a verbal obstacle course, and you have to run it. This type of conversation is extraordinarily hard for programmers to prepare against, because anything goes—and this is why Turing had language and conversation in mind as his test, because they are really a test of everything. The downside to the give-’em-the-third-degree approach is that it doesn’t leave much room to express yourself, personality-wise.
The small-talk approach has the advantage of making it easier to get a sense of who a person is—if you are indeed talking to a person. And this style of conversation comes more naturally to layperson judges. For one reason or another, small talk has been explicitly and implicitly encouraged among Loebner Prize judges. It’s come to be known as the “strangers on a plane” paradigm. The downside is that these conversations are, in some sense, uniform—familiar in a way that allows a programmer to anticipate a number of the questions.
I started typing back.
CONFEDERATE: hey there!
CONFEDERATE: i’m good, excited to actually be typing
CONFEDERATE: how are you?
I could imagine the whole lackluster conversation spread out before me: Good. Where are you from? / Seattle. How about yourself? / London.
Four minutes and 43 seconds left. My fingers tapped and fluttered anxiously.
I could just feel the clock grinding away while we lingered over the pleasantries. I felt this desperate urge to go off script, cut the crap, cut to the chase—because I knew that the computers could do the small-talk thing, which played directly into their preparation. As the generic civilities stretched forebodingly out before me, I realized that this very kind of conversational boilerplate was the enemy, every bit as much as the bots. How, I was thinking as I typed another unassuming pleasantry, do I get an obviously human connection to happen?
Taking Turns
Part of what I needed to figure out was how to exploit the Loebner Prize’s unusual “live typing” medium. The protocol being used was unlike e-mails, text messages, and standard instant-messaging systems in a very crucial way: it transmitted our typing keystroke by keystroke. The judge and I were watching each other type, typos and backspacing and all. I remember some Internet chat programs back in the nineties trying out this character-at-a-time approach, but people for the most part rejected it. It was too invasive, was the feeling: what people like about writing is the time and space to compose and edit a message before sharing it with the other person. The advantage of the character-at-a-time transmission, though, is that it approaches much more closely the condition of speech, with its fluidity of turn-taking and its choppy grammar: what’s lost in eloquence is made up for in agility.
It also, then, lets us see typing’s “negative space”: hesitation. In a chat conversation where text is transmitted with every carriage return, only egregiously long pauses are taken to be part of the interaction. With more fluid and immediate feedback, silence acquires meaning. Failing to quickly answer a question in a face-to-face conversation, for instance, is tantamount in many cases to answering it. I recall asking a friend of mine how things were going with the woman he’d started seeing; the um and the tiny pause in which he searched for the right words made it clear they were having problems.
So what did the Loebner Prize’s unusual (and recently implemented) protocols enable and disable, compared with the standard, walkie-talkie, turn-taking style? Long missives weren’t going to work, as they had in previous years, when programs were able to steamroll the judges by eating up the clock and delivering ridiculously prolix answers. If two parties are taking strict turns under time pressure, they are putting themselves at the mercy of the length of the other’s turns. Here, for instance, is one program’s conversation transcript from 1996:
COMPUTER: How about telling me a joke?
JUDGE: No.
COMPUTER: Here’s my favorite joke . . . A dog limps into a bar with a bloody bandage on his foot. He’s wearing a 10-gallon hat, six-shooters in his holsters and chaps. He growled, “I’m a-lookin’ for the man that shot my paw.”
“The joke’s not funny . . .” the judge writes, giving the program an opening to tell another one—which it does (“A knotty, worn-out old string walks into a bar . . .”). Meanwhile, three full minutes have elapsed. If a computer (or confederate) started rambling on too long under the new, live-typing protocols, the judge could and would just cut it off.
And so another piece of my confederate strategy fell into place. I would treat the Turing Test’s strange and unfamiliar textual medium more like spoken English and less like the written language. I would attempt to disrupt the turn-taking “wait and parse” pattern that computers understand and create a single, flowing duet of verbal behavior, emphasizing timing. If computers understand little about verbal “harmony,” they understand even less about rhythm.
If nothing was happening on my screen, whether or not it was my turn, I’d elaborate a little on my answer, or add a parenthetical, or throw a question back at the judge—just as we offer and/or fill audible silence when we talk out loud. If the judge took too long considering the next question, I’d keep talking. I would be the one (unlike the bots) with something to prove. If I knew what the judge was about to write, I’d spare him the keystrokes and jump in.
There’s a trade-off, of course, between the number of opportunities for serve and volley and the sophistication of the responses themselves. The former thrives with brevity, the latter with length. It seemed to me, though, that so much of the nuance (or difficulty) in conversation comes from understanding (or misunderstanding) a question and offering an appropriate (or inappropriate) response—thus, it made sense to maximize the number of interchanges.
Some judges, I discovered, would be startled or confused at this jumping of the gun, and I saw them pause, hesitate, yield, even start backspacing what they had half-written. Other judges cottoned on immediately and leapt right in after me.
In the first round of the 2009 contest, judge Shalom Lappin—a computational linguist at King’s College London—spoke with a computer program called Cleverbot and then with me. My strategy of verbosity was clearly in evidence: I made 1,089 keystrokes in five minutes (3.6 keystrokes a second) to Cleverbot’s 356 (1.2/sec), and Lappin made 548 keystrokes (1.8/sec) in my conversation, compared with 397 (1.3/sec) with Cleverbot. Not only did I say three times as much as my silicon adversary, but I engaged the judge more, to the tune of 38 percent more typing from Lappin.
Reviewing the logs later, though, I looked for a way to quantify the fluidity of the human-to-human interactions against the rigidity of the human-to-machine ones. It occurred to me that you could count the changes—let’s call them “swaps”—of the party who typed the latest keystroke. I did manage to type three times as much as Cleverbot, but the real story, as it turns out, is in the swaps. Lappin’s conversation with Cleverbot had 33; his conversation with me had 492, almost fifteen times as many.
The Confederate’s Gambit
You don’t converse with Google, or with most computer systems; you depose them. There’s a crucial difference. Interestingly, many Loebner Prize judges approach the Turing Test as a kind of cross-examination; strangely, a number of confederates also seem to approach it with that dynamic in mind. One of the human-to-human conversations in 2008 never manages to get out of that stiff question-and-response mode:
JUDGE: Do you have a long drive?
CONFEDERATE: fairly long
JUDGE: so do I :( ah well, do you think you could have used public transport?
CONFEDERATE: i could haver />
JUDGE: and why not?
CONFEDERATE: i chose not to
JUDGE: that’s fair. Do you think we have too many cars on the road or not enough today?
CONFEDERATE: it’s not for me to say
Yawn! But the computer in this pair is playful with the judge from the get-go:
JUDGE: HI
COMPUTER: Amen to that.
JUDGE: quite the evangelist
COMPUTER: Our Father, who art in cyberspace, give us today our daily bandwidth.
JUDGE: evangelist/nerd lol. So how are things with you today?
And with that, the program has practically sealed up the judge’s confidence in its humanity with its second sentence. Note that the confederate’s stiff answers prompt more grilling and forced conversation—what’s your opinion on such-and-such political topic? But with the computer, the judge, misled by the opening wisecracks into assuming it’s the real person, remains utterly casual: how are things? This makes the contest easier for the computer and harder for the confederate.