by Scott Weems
This is the classic issue in computer science: computers find it easy to create new things but nearly impossible to assess their usefulness or novelty. This failing is most obvious in the realm of humor, because knowing how funny a joke is takes world knowledge—something that most computers lack, even Watson. Consider, for example, a joke made by The Joking Computer’s predecessor, the Joke Analysis and Production Engine (JAPE): What kind of device has wings? An airplane hangar. The reason JAPE thought this joke was funny was that it classified hangars both as places for storing aircraft and as devices for hanging clothes. That’s accurate (to the extent that we accept the misspelling of hangers), but most humans know that a long piece of wire holding a shirt isn’t much of a “device.”
Even though it followed its formula correctly, JAPE failed precisely because it couldn’t recognize the lack of humor in the final product. This challenge might also explain why there are so many joke production programs but so few specialized for joke recognition. To write a joke, all you need is a strategy, such as manipulation of rhymes or replacement of words with synonyms. That’s the tool used by the online program Hahacronym, which uses a stored database of potential replacements to generate funny alterations of existing acronyms. What does FBI stand for? Fantastic Bureau of Intimidation. MIT? Mythical Institute of Theology.
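To make the substitution strategy concrete, here is a minimal sketch of the idea in Python. It is not the real Hahacronym system, which draws on a large lexical database; the substitution table and function below are invented purely for illustration.

```python
# Toy sketch of acronym re-spelling by word substitution, in the spirit of
# Hahacronym. The real system uses a large lexical database; this table is
# a made-up stand-in that happens to reproduce the two examples above.
SUBSTITUTIONS = {
    "Federal": "Fantastic",
    "Investigation": "Intimidation",
    "Massachusetts": "Mythical",
    "Technology": "Theology",
}

def respell(expansion: str) -> str:
    """Swap each word for a same-initial 'funny' alternative when one exists."""
    return " ".join(SUBSTITUTIONS.get(word, word) for word in expansion.split())

print(respell("Federal Bureau of Investigation"))        # Fantastic Bureau of Intimidation
print(respell("Massachusetts Institute of Technology"))  # Mythical Institute of Theology
```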
Of course, identifying good humor requires more than simple tricks, since there are no shortcuts for classifying the myriad ways to make a joke. Typically, humor recognition programs meet this challenge through massive computing power, like Watson did when answering Jeopardy! questions. Such programs look for language patterns, especially contradictions and incongruities. In this sense they’re pattern detectors. But to be effective, they must access vast amounts of material—as in, millions of pieces of text. (As a comparison, since starting this book you’ve read about forty thousand words yourself.)
One example of a pattern detection program is Double Entendre via Noun Transfer, also known as DEviaNT. Developed by Chloé Kiddon and Yuriy Brun at the University of Washington in Seattle, it identifies words in natural speech that have the potential for both sexual and nonsexual meanings. Specifically, it searches text and inserts the phrase “That’s What She Said” during instances of double entendres (a task of great practical importance to frat houses and fans of The Office). DEviaNT is distinctive in that it’s not just a joke creator but a humor recognition program too, because it takes a sense of humor to know when to “interrupt.”
DEviaNT was first taught to recognize the seventy-six nouns most commonly used in sexual contexts, with special attention to the sixty-one best candidates for euphemisms. Then it read more than a million sentences from an erotica database, as well as tens of thousands of nonerotic sentences. Each word in these sentences was assigned a “sexiness” value, which, in turn, was entered into an algorithm that differentiated the erotic versus nonerotic sentences. As a test, the model was later exposed to a huge library of quotes, racy stories, and text messages as well as user-submitted “That’s What She Said” jokes. The goal was to identify instances of potential double entendre—a particularly interesting challenge, noted the authors, because DEviaNT hadn’t actually been taught what a double entendre was. It had been given only lots of single entendres, and then was trained to have a dirty mind.
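The word-level scoring that DEviaNT relies on can be sketched in a few lines. This is only a rough illustration of the corpus-ratio idea described above, not the published system, which combines noun euphemism scores, adjective counts, and other structural features in a trained classifier; the two corpora here are placeholders.

```python
from collections import Counter

# Rough sketch of a corpus-ratio "sexiness" score: how much more often a
# word appears in erotica than in ordinary text. The placeholder lists
# stand in for the erotica database and the nonerotic sentences.
erotic_sentences = ["..."]   # stand-in for the erotica corpus
neutral_sentences = ["..."]  # stand-in for the nonerotic corpus

def word_counts(sentences):
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts

erotic_counts = word_counts(erotic_sentences)
neutral_counts = word_counts(neutral_sentences)

def sexiness(word, smoothing=1.0):
    """Ratio of (smoothed) counts; higher means the word skews erotic."""
    return (erotic_counts[word] + smoothing) / (neutral_counts[word] + smoothing)

def sentence_score(sentence):
    """Average per-word score; thresholding this could flag double entendre candidates."""
    words = sentence.lower().split()
    return sum(sexiness(w) for w in words) / max(len(words), 1)
```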
The researchers were quite pleased when DEviaNT recognized most of the double entendres it was presented, plus two phrases from the nonerotic sentences that had acquired sexual innuendo completely by accident (“Yes, give me all the cream and he’s gone” and “Yeah, but his hole really smells sometimes”). DEviaNT’s high degree of accuracy was especially impressive given that most of the language it was tested on wasn’t sexual. In effect, it was trying to spot needles in haystacks.
But that’s cheating, you might claim. DEviaNT didn’t actually understand the sexual nature of the jokes. It didn’t even know what it was reading. All it did was look for language patterns, and a very specific type at that. True, but these arguments also assume that “understanding” involves some special mental state, in addition to coming up with the right answer. (Or, when recognizing bawdy jokes, knowing when to exclaim “That’s what she said!”) As we’ll soon see, that’s a human-centric perspective. Maybe we underestimate computers because we assume too much about how they should think. To explore that possibility, let’s turn to one last computer humor program—the University of North Texas’s one-liner program, developed by the computer scientist Rada Mihalcea.
Like DEviaNT, this program was trained to recognize humor by reading vast amounts of humorous and nonhumorous material. Specifically, it was shown sixteen thousand humorous “one-liners” that had been culled from a variety of websites, along with an equal number of nonhumorous sentences taken from other public databases. Mihalcea’s goal was to teach the program to distinguish between the humorous sentences and the nonhumorous ones. The program came in two versions. One version looked for certain features previously established as common in jokes, such as alliteration, slang, and the proximity of antonyms. The second version was given no such hints at all; it was simply allowed to learn on its own from thousands of labeled examples. After training, both versions were shown new sentences and asked to identify which were jokes and which weren’t.
Mihalcea was surprised to see that the feature-based version of the program, the one told which features are most common in jokes, did relatively poorly. Its accuracy hovered only slightly above chance at recognizing humor, meaning that the hints weren’t very helpful. By contrast, the version that learned on its own—using algorithms such as Naive Bayes and the Support Vector Machine, which begin with no built-in assumptions about what makes a joke a joke—reached accuracy levels averaging 85 percent. This is a fairly impressive outcome, especially considering that many humans also have difficulty recognizing jokes, especially one-liners.
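A minimal sketch of the second, learn-on-its-own approach is a bag-of-words text classifier. The snippet below uses scikit-learn’s Naive Bayes implementation; the handful of training sentences are placeholders standing in for the sixteen thousand one-liners and their nonhumorous counterparts, so this illustrates the technique rather than reproducing Mihalcea’s experiment.

```python
# Bag-of-words Naive Bayes classifier: the model is given only labeled
# examples, not hand-picked humor features such as alliteration or antonymy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

one_liners = [
    "I used to be indecisive; now I'm not so sure.",
    "Change is inevitable, except from a vending machine.",
]
ordinary = [
    "The meeting has been moved to Thursday afternoon.",
    "Rainfall totals were slightly above average this month.",
]

texts = one_liners + ordinary
labels = [1] * len(one_liners) + [0] * len(ordinary)  # 1 = joke, 0 = not a joke

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["I'm reading a book on anti-gravity; it's impossible to put down."]))
```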
Mihalcea’s finding is important because it shows that imposing our own rules on computers’ thinking seldom works. Computers must be allowed to “think messy,” just like people, by wandering into new thoughts or discoveries. For humans this requires a brain, but for computers it requires an algorithm capable of identifying broad patterns. This is essential not just for creating and recognizing jokes but for all artistic endeavors. Watson needed to be creative, too. The programmers at IBM didn’t try to define what problem-solving strategies Watson used to win at Jeopardy! Rather, they allowed it to learn and to look for patterns on its own, so that it could be a flexible learner just like the human brain.
Some readers may argue that humans aren’t pattern detectors, at least not like computers. If you believe this, you’re not alone. You’re also wrong. Recognizing patterns is exactly how the human brain operates. Consider the following example: “He’s so modest he pulls down the shade to change his ___.” What’s the first word that comes to mind when you read this sentence? If you’re in a humorous mood, you might think of mind, which is the traditional punch line to the joke. If you’re not, you might say clothes. Or maybe pants.
I share this joke because it illustrates how the human brain, like a computer, is a pattern detector. Cloze probability is the term that linguists use to describe how well a word “fills in the blank,” based on common language use. To measure cloze probability, linguists study huge databases of text, determining the frequency at which specific words appear within certain contexts. For example, linguists know that the word change most often refers to replacement of a material object, such as clothes. In fact, there’s a cloze probability of 42 percent that the word clothes would appear in the context set up by our example—which is why it was probably the first word you thought of. Change referring to an immaterial object, such as a mind, is much less likely—closer to 6 percent.
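In code, the corpus-based approximation is just a relative frequency. The completion counts below are made up to match the figures quoted above (real cloze norms come from asking many people to fill in the blank, or from very large text databases), so treat this as a worked example of the arithmetic rather than real data.

```python
from collections import Counter

# Hypothetical completions of "He's so modest he pulls down the shade to
# change his ___", constructed so the proportions match the text above.
completions = (
    ["clothes"] * 42 + ["pants"] * 25 + ["shirt"] * 24 + ["mind"] * 6 + ["jacket"] * 3
)

counts = Counter(completions)
total = sum(counts.values())  # 100 completions in this toy sample

def cloze_probability(word):
    """Fraction of completions that used this word in this context."""
    return counts[word] / total

print(f"clothes: {cloze_probability('clothes'):.0%}")  # 42%
print(f"mind:    {cloze_probability('mind'):.0%}")     # 6%
print(f"jacket:  {cloze_probability('jacket'):.0%}")   # 3%
```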
These probabilities have a lot to do with humor because, as already discussed, humor requires surprise, which in this case is the difference between 42 percent and 6 percent. Our brains, much like computers, do rapid calculations every time we read a sentence, often jumping ahead and making inferences based on cloze probability. Thus, when we arrive at a punch line like mind, a sudden change in scripts is required. The new script is much less expected than clothes, and so the resolution makes us laugh. Computer humor recognition works the same way, looking for patterns while also identifying the potential for those patterns to be violated.
Why, then, aren’t computers better at cracking jokes than humans? Because they don’t have the world knowledge to know which low-probability answer is funniest. In our current example, mind is clearly the funniest possible ending. But jacket has a low cloze probability too. In fact, the probability that people will refer to changing their jacket is about 3 percent—half the probability they’ll talk about changing their mind. Why is the second phrase funny whereas the first one isn’t? Because, with our vast world knowledge, people understand that changing our mind isn’t something that can be seen through a window.
We know this because we’ve all stood in front of windows. No computer has ever stood in front of a window.
To understand why computers struggle when recognizing good jokes, think back to the EEG findings from Chapter 2. As we learned, our brains produce two kinds of reactions to jokes—the P300 and the N400. The P300 reflects an orienting reflex, a shift in attention telling us that we’ve just seen something new or unexpected. The N400 is more semantic in nature. It measures how satisfying the new punch line is, and how well it activates a new perspective or script.
In that earlier chapter we also discovered that whereas all jokes elicit a P300, only funny ones elicit an N400, because these bring about a satisfying resolution. A related finding is that a word’s cloze probability is inversely proportional to the size of the N400 it produces—the higher the cloze probability (i.e., the more we expect to see that word), the smaller the N400. This size difference reflects how easily new words are integrated into already constructed meanings, with easier integration meaning smaller N400s. At first you might think that cloze probability should influence the “surprise” response of the P300, but this isn’t the case. Low-probability words aren’t shocking, only incongruent. It’s a matter of context—larger N400 responses mean that contexts are being shifted, while P300 responses mean that we’re simply shocked, context having nothing to do with it.
It’s a subtle difference, one that computers struggle with. To computers, there’s no such thing as context, only a constant stream of probabilities. That’s where we humans distinguish ourselves, bringing us back to the constructing, reckoning, and resolving stages from Chapter 2. The human brain doesn’t just recognize cloze probabilities, it builds hypotheses and revises those hypotheses based on new evidence. It’s always looking for patterns and constructing contexts, and by relying on both probabilities and expectations, it becomes an active manipulator of its environment rather than a passive receiver.
To see how this relates to humor, let’s review a study conducted by the cognitive scientist Seana Coulson of the University of California at San Diego. Coulson’s aim was to understand the human brain’s sensitivity to both context and cloze probability. First, she showed subjects sixty sentences, some of which ended in a funny punch line and some of which didn’t (e.g., “She read so much about the bad effects of smoking she decided she’d have to give up the habit/reading”). Only the joke endings were expected to bring about shifts in perspective. Next, she varied the cloze probability of the sentence endings, dividing them into two categories. Sentences for which the joke setup activated a salient, high cloze-probability ending—as in the above example—were labeled “high constraint.” Those with a lower cloze-probability ending were called “low constraint.” For example, “Statistics indicate that Americans spend eighty million a year on games of chance, mostly dice/weddings” is a low-constraint sentence because there are many possible endings—dice being only one of several low cloze-probability alternatives.
Not surprisingly, the N400s were bigger for sentences with funny punch lines than for those with unfunny ones. But this difference appeared only among the high-constraint sentences. That’s because these were instances in which the subjects’ world knowledge had set up some expectation and context, and the punch line brought a new way of thinking. Cloze probability is important to humor, but so is violation of our expectations. We’re pattern detectors, but we’re constructors, reckoners, and resolvers too. Computers’ inability to incorporate all three processes is what causes them to struggle.
Before moving on to the next section, let’s take one more look at how our thinking differs from a computer’s. A little later we’ll be addressing creativity, and how humor is just one example of this unique skill, a skill we still hold over our computer overlords. But for now, I want to drive home the point that the human brain is much more than just a parallel processor, or dozens of parallel processors linked together, as with IBM’s Deep Blue or Watson. Indeed, it’s like a child who can’t sit still, always looking around the corner for what’s coming next.
One benefit of computers is that they always follow directions: at any given time, we can tell a computer to stop working and tell us what it knows. It won’t ignore our command, and it won’t keep working and hope we don’t notice. Humans are a different story. Our brains work so fast, and in such hidden ways, that it’s nearly impossible to see what calculations they’re really making. Analyzing jokes is especially difficult, because comprehension occurs in seconds. There’s no way to stop people halfway through a joke and identify what they’re thinking. Or is there?
“Semantic priming” studies are among the oldest in the field of psychology. The process is relatively simple: subjects are given a task—say, reading a joke—and then interrupted with an entirely different task that indirectly measures their hidden thoughts. For example, after reading the setup to a joke, they may be shown a string of letters and asked if those letters constitute a real word or not (called a “lexical decision” task). Imagine that you’re a volunteer participant in a study and are instructed to read the following: “A woman walks into a bar with a duck on a leash. . . .” Then, the letters S-O-W appear on the screen and you’re asked whether they form a real word or not. How long would it take you to recognize that S-O-W refers to a female pig?
Now, imagine that you’re given the same task after reading the full joke: A woman walks into a bar with a duck on a leash. The bartender says, “Where did you get the pig?” The woman says, “That’s not a pig. It’s a duck!” The bartender replies, “I was talking to the duck.”
Would you immediately recognize the meaning of S-O-W this time? Of course you would, because the word pig would have been activated in your mind. Without priming, it usually takes subjects between a third of a second and three times that long to recognize a given word. With priming (e.g., reading the above joke), that reaction time is decreased by a quarter of a second. This may not seem like much, but in the world of psychology it’s a huge effect.
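As a back-of-the-envelope illustration, the priming effect is just the difference between average reaction times with and without the joke context. The millisecond values below are invented to match the rough figures above, not data from an actual experiment.

```python
# Hypothetical lexical-decision reaction times (milliseconds) for S-O-W,
# chosen so the primed condition is about a quarter second faster.
unprimed_rts_ms = [660, 700, 640, 720, 680]  # no joke context
primed_rts_ms = [410, 450, 390, 470, 430]    # right after the duck joke

def mean(values):
    return sum(values) / len(values)

priming_effect_ms = mean(unprimed_rts_ms) - mean(primed_rts_ms)
print(f"Priming effect: {priming_effect_ms:.0f} ms")  # 250 ms, about a quarter of a second
```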
I mention semantic priming because Jyotsna Vaid, a psychologist at Texas A&M University, used this very task to find out the precise point at which subjects revised their interpretations and “got” a joke. For our example joke there are at least two possible interpretations. One of these is that the woman owns a pet duck and that the bartender doesn’t know his birds from his boars. A good way to check for this interpretation is to use P-E-T in the lexical decision task, because if it’s what subjects are thinking, then the word pet should be at the top of their minds. The second possible interpretation is that ducks can understand questions from surly bartenders, and that the woman is as ugly as a pig. For that one, S-O-W should be highly activated.
Earlier I noted that jokes become funny when scripts suddenly change due to an incongruous punch line—for example, a doctor’s wife inviting a raspy-voiced man inside for an afternoon tryst rather than a chest exam. Now we’re seeing the exact point at which these shifts occur. Not surprisingly, Vaid saw that the initial, literal interpretations of the jokes were dominant when subjects started reading. In other words, they had no choice but to assume the woman owned a pet duck. However, as soon as the punch line came and an incongruity was detected, the second interpretation became active too. The first one didn’t disappear, though. Instead, it stayed active until the end of the joke, after the subjects had been given a chance to laugh. Only then did they make up their minds and move on—and the word pet stopped receiving facilitation in the lexical decision task. From these results we see that our brains build hypotheses, sometimes more than one at a time, and only as more evidence becomes available are old ones jettisoned like rotten fruit.
In a sense, then, we’re built to be pattern detectors, always taking in new information and building stories. Much of the time those interpretations are correct. Sometimes they aren’t.
And when they aren’t, occasionally we laugh.
TRANSFORMATIONAL CREATIVITY
“Computers are creative all the time,” says Margaret Boden. But will they ever generate ideas—or jokes—that convince us they’re truly creative without seeming artificial or mechanical? “Many respectable ideas have been generated by computers which have amazed us and that we value. But what we haven’t seen is a computer that creates something amazing and then says, ‘Don’t you think this is interesting? This is valuable.’ There are many systems which come up with amazingly novel ideas, but if there’s any value in it, humans still need to persuade us why.”
Boden is referring to a major problem with creativity—and a big challenge for humor researchers too. Creativity is subjective. Knowing when a punch line works or not, as with a painting or a sonata, requires being able to assess its value and novelty. But this capability is something many people lack, so imagine how difficult it must be for computers. How do we justify any work of art? How do we know that the punch line An airplane hangar isn’t funny but a telegram-sending dog proclaiming “But that would make no sense at all” is?