
This Is the Voice


by John Colapinto


  The neurological explanation for why the Method works (and thus its efficacy in emotion studies) can be found in mid-nineteenth-century research by the French neurologist Guillaume Duchenne. In a study of the facial muscles, Duchenne showed that a genuine-looking smile involves two distinct muscle groups: one that lifts the corners of the mouth and a separate, crescent-shaped muscle surrounding the eye socket, which makes the outer corners of the eyelids crinkle in a highly specific way. Both muscle groups must be activated for the smile to look real. (Darwin, in his research for the Expression of the Emotions, tested this by showing his house guests two photographs taken by Duchenne, one of a man smiling only with his mouth, the other of the same man smiling with both mouth and eyes; twenty-one out of twenty-four of Darwin’s house guests correctly labeled the non-crinkly-eyed smile as fake.) Duchenne explained why the two smiles communicate such different messages. The muscles that pull up the corners of the mouth are under our conscious control—as is clear when the wedding photographer, for the fiftieth time that day, shouts, “Say cheese!” But the crescent-shaped muscles around our eyes aren’t under our voluntary control (as those dead-eyed wedding photo smiles attest). These muscles are “only brought into play by a true feeling, by an agreeable emotion,”23 Duchenne wrote.

  That the muscles which give rise to emotional expression (including those of the voice) are beyond our voluntary control is a central tenet of the Method, which uses personal memories to circumvent the conscious mind, the cortex, to activate those parts of the emotional brain—the limbic system—which produce the involuntary movements that give rise to “real” emotional displays in face, gesture, posture, and voice.24 That’s how Brando stunned audiences with a performance both tender and violent in the theatrical and movie versions of A Streetcar Named Desire (directed by Stanislavsky acolyte Elia Kazan), how De Niro galvanized movie audiences with his weirdly polite, yet ominous challenge (“You talkin’ to me?”) in Taxi Driver, and how Streep brought tears to the eyes as the abandoning mother in Kramer vs. Kramer, when her voice trembles so realistically during her testimony in her divorce trial.

  And that’s why Scherer began using Method actors in experiments aimed at distilling, from speech, the parameters of pitch, volume, pace, and rhythm that convey specific emotions. In one of his most ambitious experiments, from the mid-1990s,25 he recruited twelve actors (six men, six women) to portray fourteen finely discriminated affective states, including “hot” anger, “cold” anger, panic fear, anxiety, elation, boredom, shame, and contempt. To eliminate the contaminating influence of language (whose associations might color a listener’s judgment of the emotion being voiced), Scherer and his co-researcher, Rainer Banse, constructed a pair of sentences that combined features of six European languages randomly arranged into seven-syllable nonsense utterances:

  Hat sundig pron you venzy

  Fee got laish jonkill gosterr

  Scherer was careful not to instruct the actors according to explicitly labeled states (for example, “Read this as if you’re very angry”) since the label “angry” might be interpreted differently by each performer. Instead, he gave them scenarios (like “death of a loved one”). With each actor performing fourteen emotions, using the two nonsense utterances, in two takes each, Scherer and Banse generated a corpus of 672 voice samples, which they winnowed, according to various criteria (including sound quality), to 280; these were played to student listener-judges who made a final selection according to which portrayals best matched the intended emotion. Two emotions were immediately identified as almost impossible to hear from acoustic clues and eliminated from the study: shame and disgust. Scherer speculated that disgust is often expressed acoustically in a single vocal burst (for example, “yuck!”) and thus sounds unfamiliar when spread out over seven syllables. Shame might have evolved few vocal cues because when people feel that emotion, they tend to clam up.
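  (The arithmetic of that corpus can be laid out in a few lines. The sketch below, in Python, is purely illustrative and uses only the figures given above.)

# Illustrative arithmetic only, using the figures described in the study design.
actors = 12        # six men, six women
emotions = 14      # "hot" anger, "cold" anger, panic fear, anxiety, elation, etc.
utterances = 2     # the two seven-syllable nonsense sentences
takes = 2          # each portrayal was recorded twice

raw_corpus = actors * emotions * utterances * takes
print(raw_corpus)  # 672 voice samples, later winnowed to 280, then to 224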

  To analyze the remaining 224 actor portrayals, Scherer and Banse developed a custom software program that could parse, at a scale of milliseconds, minute shifts in pitch, volume, pace, and spectral acoustics. The pair detailed how high-arousal emotions (like “hot anger,” “panic fear,” and elation) showed a marked increase in both pitch and loudness. Sadness and boredom had a lower pitch and volume (since people in those depressed states can’t summon much energy, either in breathing or vocal cord tightening, to produce anything but a low murmur). Contempt had a low pitch but also a low volume, a correlation that Scherer and Banse explained, in psychological terms, with a reference to Darwin’s reflections on animal voices, in which “superiority displays” are expressed by a deepened pitch to suggest greater body size (and thus dominance). The “dampened volume,” however, “may serve as a signal to the recipient that the sender… does not consider the other worthy of the expenditure of energy.” The same lowered pitch and volume is used in sarcasm, like the teen “praising” his father’s new khakis—a withering effect heightened by the contrast of the languidly blasé vocal tone with the surface praise in the words. (“They look great.”)
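  (To give a concrete sense of what such acoustic measurement involves, here is a minimal, purely illustrative Python sketch. It assumes the open-source librosa audio library and a hypothetical recording file named "portrayal.wav"; it is not Scherer and Banse's custom program, only a rough approximation of the kind of pitch and loudness contours their software measured far more finely.)

# Illustrative sketch: rough pitch and loudness contours from one voice sample.
# The file name is a placeholder; librosa is an assumption, not the study's tool.
import numpy as np
import librosa

y, sr = librosa.load("portrayal.wav", sr=None)   # load audio at its native rate

# Fundamental frequency (pitch) estimated frame by frame; NaN where unvoiced
f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)

# Loudness proxy: root-mean-square energy per frame, converted to decibels
rms = librosa.feature.rms(y=y)[0]
loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

print("mean pitch (Hz):", np.nanmean(f0[voiced]))
print("pitch range (Hz):", np.nanmax(f0[voiced]) - np.nanmin(f0[voiced]))
print("mean loudness (dB rel. peak):", float(loudness_db.mean()))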

  The study offered an unprecedentedly exhaustive analysis; one chart, filling an entire page, showed the fourteen emotions broken down into fifty-eight separate acoustic variables, for a total of 812 measurements, each calculated to multiple decimal places. Exactly who might benefit from such fanatically microscopic analysis was of no concern to Scherer, who, as someone performing pure “basic science,” has always believed that such knowledge is worthwhile for its own sake. It also perhaps explains why, for much of his forty-year career, Scherer has been an outlier, a lone traveler in an underpopulated, and largely unsung, branch of science.

  But that changed in mid-2017, when the quest to understand vocal emotion became, overnight, a Holy Grail of science, thanks to a collective realization on the part of Silicon Valley that the Siri and Alexa speech recognition functions on our mobile devices will be infinitely enhanced when they can decipher, not only the words we speak, but the emotion in our voice, and produce appropriate emotional, and thus human-sounding, speech in reply—like the disembodied computer voice that Joaquin Phoenix’s character falls in love with in the movie Her (voiced by Scarlett Johansson), a voice that speaks with such fidelity to the prosody, paralinguistics, and timbre of the human vocal signal that it is impossible to tell apart from the real thing. Such astonishingly realistic computer-generated speech is still a fantasy, but not for long, if the top tech companies have anything to do with it. And they do. Which is why Scherer, after decades toiling almost alone, now finds his subject at the center of research programs driven by the richest companies in the world: Google, Apple, Amazon, Microsoft—all of whom are in a dead heat to develop the disruptive, game-changing voice-emotion software that will transform our relationship with our computers and which, some believe, will mark the next step in our evolution (or perhaps devolution) as a species.

  * * *

  The branch of science that seeks to imbue computers with emotions was first conceived of and named in 1995 when Rosalind Picard, a thirty-three-year-old assistant professor at MIT, published her landmark paper, “Affective Computing.”26 At a time when iPhones were not even a gleam in Steve Jobs’s eye and the World Wide Web was accessed by dial-up modem, Picard laid out a vision for how computers could (and should) be equipped with the ability to detect and communicate emotions. “Most people were pretty uncomfortable with the idea,” Picard (now head of MIT’s bustling Affective Computing lab) recently recalled. “Emotion was still believed to be something that made us irrational… something that was undesirable in day-to-day functioning.”27 Picard made the case for affective computing by citing Damasio’s research on the role that emotions play in shaping reason. “The neurological evidence indicates emotions are not a luxury,” she wrote, “they are essential for rational human performance.”28 Indeed, Picard believed that full Artificial Intelligence, AI, would not be achievable until computers possessed not just the ability to crunch massive amounts of data (an act we humans perform with our cortex), but also the ability to perceive and express emotion (which we do in our limbic brain).

  The idea of “emotional computers” was still arousing skepticism, and even ridicule, several years into this century, according to Björn Schuller, a professor of Artificial Intelligence at Imperial College, London, and the cofounder of a computer voice-emotion start-up company that is, today, among the fastest growing in the world.29 Schuller told me that he was a “born computer nerd” who first became interested in computerized vocal emotion at age nine, when he saw the American television series Knight Rider and its Artificially Intelligent talking car, KITT. “In the first episode,” Schuller recalls, “KITT says to his human owner [played by David Hasselhoff], ‘Since you’re in a slightly irritated mood caused by fatigue…’ I was, like, ‘Wow, the car can hear that he’s irritated and he’s tired!’ This totally got me.”

  When Schuller started his PhD in computer science at the Technical University of Munich in 2000, his focus was on speech recognition—the effort to transcribe spoken words and sentences into written text using computers. It’s a harder task than it might sound, given the way we smear acoustic information across syllables according to how we move our mouths as we talk. The two c’s in the word “concave,” for instance, are completely different sounds, because of how you round your lips in anticipation of the upcoming o when you say the first c, and how you retract your lips in anticipation of the a vowel for that second c. (Say the word slowly while looking in a mirror and you’ll see what I mean.) We think they’re the same sound, but computers know they’re not. Which led to all sorts of absurd transcription errors in the early days of computerized voice-to-text transcription. Steven Pinker has noted some of these (including “A cruelly good M.C.” for “I truly couldn’t see,” and “Back to work” for “Book tour”). To avoid such mistakes, programmers had to input every possible variation for how a c or a d or an n can sound in the various contexts in which it crops up in speech. The n in “noodle” but also “needle”—to say nothing of the n in “pan,” or “pin.” They had to do this for every consonant and vowel combination in the language—a Herculean, not to say Sisyphean, task that left some doubt as to whether any computer, anywhere, could ever accurately decipher the human voice and render it in text, to say nothing of the still harder task of convincingly reproducing it in simulated speech.
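  (The brute-force approach described above can be caricatured in a few lines of Python. The snippet below is a toy illustration, not any real speech recognizer, and its sound labels are invented for the example.)

# Toy illustration of hand-coded, context-dependent pronunciation variants.
# Real systems enumerated vastly more combinations; the labels here are made up.
context_variants = {
    # (consonant, following vowel sound) -> hand-labeled acoustic variant
    ("c", "o as in concave"): "c_lips_rounded",
    ("c", "a as in concave"): "c_lips_retracted",
    ("n", "oo as in noodle"): "n_before_rounded_vowel",
    ("n", "ee as in needle"): "n_before_spread_vowel",
}

def variant(consonant: str, context: str) -> str:
    """Look up the hand-coded acoustic variant for a consonant in context."""
    return context_variants.get((consonant, context), consonant)

print(variant("c", "o as in concave"))   # -> c_lips_rounded
print(variant("n", "ee as in needle"))   # -> n_before_spread_vowel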

  That all started to change around 2005 with the advent of machine learning, a new way to write software. Instead of laboriously inputting a bazillion speech sounds, software engineers started to write algorithms that teach a computer how to teach itself by listening to huge amounts of human speech, analyzing it, and storing in memory the particular way the individual sounds are pronounced when placed in particular contexts (or “coarticulated,” as linguists put it). When Björn Schuller started his PhD in speech recognition, the dream of reliable text transcription by computer seemed decades in the future; but within a couple of years the coarticulation problem was more or less solved.30 Schuller began to look for new horizons and, recalling his childhood fascination with KITT the talking car, he began to wonder if there was a way to make computer voices actually sound human—that is, by imbuing their speech with emotional prosody. Most people considered this a pipe dream—the fantasy outlined in Picard’s theoretical “Affective Computing” paper of six years earlier. Nevertheless, when Schuller heard a fellow first-year PhD student complain that she had quixotically accepted the challenge from a local tech company to create a video system that could read the emotions in facial expressions, he was intrigued. “At the time,” Schuller recalls, “computer vision systems were nowhere near good enough to detect emotions in the face. But I was studying speech, and I knew there was audio on my friend’s video footage. So, remembering Knight Rider, I said to her, ‘Give me your data. Let’s see where we get.’ ”

  Schuller wrote a program for detecting some basic changes in pitch, volume, and pace and used it to analyze the emotional content in the audio portion of his friend’s video files. “It worked to some degree,” he says, “and I got totally fascinated.” Indeed, he instantly switched his PhD from speech recognition to speech emotion recognition. “All my colleagues made fun of me,” he says. “They wouldn’t take me seriously. Until maybe 2007, or 2008.”

  By then, computers had—thanks to increased processor speed and computing power—essentially mastered the coarticulation problem, eliminating all but a few of the errors Pinker documented. Big Tech—Google, Apple, Microsoft—were hungry for the new new thing and began to turn their attention to the missing ingredient in computer speech: emotion. By 2012, the landscape had changed completely. “The focus of research,” Schuller says, “shifted totally toward what I do.”

  What Schuller, and a growing number of others, “do” is use machine learning to make computers teach themselves emotional prosody. Schuller plays accurately labeled samples of emotional vocalizations into the computer’s learning software (“angry,” “sad,” “happy”), and the machine does the rest.31 And the algorithms are learning fast. At present, computers can recognize specific vocal emotions 65 to 70 percent of the time, about the same as humans—astonishing progress given that the field is less than a decade old.
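  (In outline, that “teach-itself” step looks something like the sketch below. It is an assumption-laden illustration, not Schuller’s actual system: it uses the scikit-learn library and hypothetical files of pre-extracted acoustic features and emotion labels, and it glosses over the feature extraction that real systems perform on raw audio.)

# Minimal sketch of supervised speech-emotion learning, not a production system.
# Assumes each recording is already reduced to a fixed-length feature vector.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data files: one feature vector and one label per voice sample
X = np.load("acoustic_features.npy")   # shape: (n_samples, n_features)
y = np.load("emotion_labels.npy")      # e.g., "angry", "sad", "happy"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)            # the machine "does the rest"

print("held-out accuracy:", model.score(X_test, y_test))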

  For pioneering voice-emotion researcher Klaus Scherer, who spent a half century teasing out the measurements in the various elements of the vocal acoustic signal, these developments are bittersweet. He is gratified by the sudden widespread interest in a subject to which he devoted his life, but disheartened that today’s computer engineers take not the slightest interest in the voice signal’s quantitative measurements—all those minutely calibrated tables of numbers and values for pitch, volume, rhythm, pace, and overtone frequencies that Scherer painstakingly accumulated from his experiments with Method actors. Computer engineers like Björn Schuller don’t have to know anything about the minuscule acoustic adjustments that distinguish anger from fear, or joy from irritation: they need only play recordings of properly labeled voice emotions into the computer’s learning software. The results are amazing. “But,” Scherer adds, with some bitterness, “it doesn’t really mean anything in terms of our understanding of how it all works.”

  Scherer is also frustrated that speech emotion engineers study only a handful of “basic” emotions—fear, anger, joy, sadness, boredom, surprise. Complex, nuanced, blended emotions are considered too difficult at present to label and simulate. Scherer doubts that any technology will ever be able to decode the most complex vocal emotions—like the panic underneath my “cheerful” welcome to my son during the global fiscal collapse, or the hint of marital betrayal in the husband’s request for the remote, or the threat-sound we pick up in De Niro’s voice when he asks, “You talkin’ to me?” “That kind of voice detection comes about through a channel discrepancy,” Scherer told me. “There is something falling out between the voice quality on one hand, and the prosody on the other. The two don’t jibe.” From the resulting discordance, a listener draws psychological inferences.

  “When you say, cheerfully, ‘Hi!’ and your son responds ‘What’s wrong?’ he hasn’t detected what’s wrong,” Scherer went on, “but he has detected that something is wrong. Same with the suspicious wife. She hasn’t heard a vocal cue that specifically encodes guilt or sexual betrayal, but she’s heard a discrepancy. What she comes out with—Are you having an affair?—is an expression of the underlying prejudices and fears that she has. That we all have. With a discrepancy, we project our innermost fears onto that thing.” Scherer believes that computers will never be able to grok the incredibly complex interplay between acoustic analysis, psychological deduction, and emotional projection inherent in such acts of hearing. “I think the human ability to make conjectures on the basis of vocal cues is unique,” he told me, “and I think we’ll be able to keep that advantage over computers.”

  Björn Schuller disagrees. He says that the “conjectures” we draw from such channel discrepancies are possible because the listener is so familiar with the speaker’s particular voice: a father and child; a husband and wife; a boss and employee (or what we know of Travis Bickle’s antisocial personality from the movie’s opening scenes). Such familiarity is already being established between us and our iPhones and Androids, Siris and Alexas. The preinstalled speech-learning software in our devices is constantly educating itself about the idiosyncratic way we pronounce our vowels and consonants (to better “understand” what we’re saying). But when Google, Apple, and Microsoft start installing emotion-based machine learning software on our devices, as Schuller expects them to do in the next three to five years, the learning curve about other particulars of our uniquely expressive voices will spike dramatically and in dimensions far beyond our specific way of coarticulating c before short o. Our Alexas and Siris will analyze voice dimensions like arousal, which, in emotion studies, is defined by how calming versus how exciting the signal is; valence, which refers to how positive versus how negative the expressed feeling is; and dominance, which measures the degree of control versus the degree of submission communicated by the signal. In this manner, your computer will know the emotional makeup of your specific voice as well as your mother does. You will be as incapable of disguising your true feelings from your iPhone or Android as you are from her.
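  (The three dimensions are easy to represent in code. The sketch below is illustrative only; the class name, the -1.0 to +1.0 scale, and the sample scores are invented for the example, and a real system would estimate such values from the acoustic signal rather than hard-code them.)

# Illustrative representation of the arousal / valence / dominance dimensions.
from dataclasses import dataclass

@dataclass
class AffectEstimate:
    arousal: float    # calming (-1.0) ... exciting (+1.0)
    valence: float    # negative (-1.0) ... positive (+1.0)
    dominance: float  # submissive (-1.0) ... controlling (+1.0)

# Made-up readings for two of the emotion categories mentioned earlier
hot_anger = AffectEstimate(arousal=0.9, valence=-0.8, dominance=0.7)
boredom = AffectEstimate(arousal=-0.7, valence=-0.2, dominance=-0.3)

for name, est in [("hot anger", hot_anger), ("boredom", boredom)]:
    print(f"{name}: arousal={est.arousal:+.1f}, "
          f"valence={est.valence:+.1f}, dominance={est.dominance:+.1f}")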

  Or so Schuller hopes. In 2018, he began working with MIT’s Rosalind Picard, the original oracle of affective computing, on a technique they call “personalized machine learning,” designed to teach your computer the nuances of your voice so well that the false cheerfulness in the question “Hi, how was school?” will instantly be detected. Humans do this by synthesizing and analyzing, at unimaginable speeds, the vast amounts of acoustic data in the voices we know. He and Picard believe that machine learning will give computers the same power. “Then we will be at this point where you say to your laptop, ‘Yeah, everything’s perfect, computer!’ And the computer says”—Schuller adopts a tone of sarcastic skepticism—“ ‘Yeah, sure…’ ” He cites as possible applications for such software diagnosing autism early, detecting depression or suicidality in teens, and catching mental illnesses in workers before they act out in mass shootings or other antisocial behavior.

 
