by Michio Kaku
* * *
There are two approaches to making a machine intelligent. Experts can teach the machine what they know, by imparting knowledge about a particular field and giving it rules to perform a set of functions; this method is sometimes termed knowledge-based. Or engineers can design a machine that has the capacity to learn for itself, so that when it is trained with the right data it can figure out its own rules for how to accomplish a task. That process is at work in machine learning. Humans integrate both types of intelligence so seamlessly that we hardly distinguish between them. You don’t need to think about how to ride a bicycle, for example, once you’ve mastered balancing and steering; however, you do need to think about how to avoid a pedestrian in the bike lane. But a machine that can learn through both methods would require nearly opposite kinds of systems: one that can operate deductively, by following hard-coded procedures; and one that can work inductively, by recognizing patterns in the data and computing the statistical probabilities of when they occur. Today’s AI systems are good at one or the other, but it’s hard for them to put the two kinds of learning together the way brains do.
The history of artificial intelligence, going back at least to the fifties, has been a kind of tortoise-versus-hare contest between these two approaches to making machines that can think. The hare is the knowledge-based method, which drove AI during its starry-eyed adolescence, in the sixties, when AIs showed that they could solve mathematical and scientific problems, play chess, and respond to questions from people with a pre-programmed set of methods for answering. Forward progress petered out by the seventies, in the so-called “AI winter.”
Machine learning, on the other hand, was for many years more a theoretical possibility than a practical approach to AI. The basic idea—to design an artificial neural network that, in a crude, mechanistic way, resembled the one in our skulls—had been around for several decades, but until the early 2010s there were neither large enough data sets available with which to do the training nor the research money to pay for it.
The benefits and the drawbacks of both approaches to intelligence show clearly in “natural language processing”: the system by which machines understand and respond to human language. Over the decades, NLP and its sister science, speech generation, have produced a steady flow of knowledge-based commercial applications of AI in language comprehension; Amazon’s Alexa and Apple’s Siri synthesize many of these advances. Language translation, a related field, also progressed along incremental improvements through many years of research, much of it conducted at IBM’s Thomas J. Watson Research Center.
Until the recent advances in machine learning, nearly all progress in NLP occurred by manually coding the rules that govern spelling, syntax, and grammar. “If the number of the subject and the number of the subject’s verb are not the same, flag as an error” is one such rule. “If the following noun begins with a vowel, the article ‘a’ takes an ‘n’” is another. Computational linguists translate these rules into the programming code that a computer can use to process language. It’s like turning words into math.
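To make the idea concrete, here is a minimal sketch of how the second rule quoted above might be turned into code. It is an illustration only, not Grammarly's or anyone else's actual implementation; the function name `check_articles` and the simple vowel test are assumptions made for this example.

```python
import re

VOWELS = "aeiou"

def check_articles(text):
    """Flag 'a' before a vowel-initial word and 'an' before any other word.

    Returns a list of (article, following_word, suggested_article) tuples.
    This encodes the rule exactly as worded: it looks at spelling, not sound.
    """
    words = re.findall(r"[A-Za-z']+", text)
    issues = []
    for article, following in zip(words, words[1:]):
        starts_with_vowel = following[0].lower() in VOWELS
        if article.lower() == "a" and starts_with_vowel:
            issues.append((article, following, "an"))
        elif article.lower() == "an" and not starts_with_vowel:
            issues.append((article, following, "a"))
    return issues

print(check_articles("She ate a apple and an banana."))
```

Because the rule as worded looks only at the first letter, this sketch misses "a hour" and wrongly flags "a university." Handling those cases would require more rules keyed to pronunciation.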
Joel Tetreault is a computational linguist who until recently was the director of research at Grammarly, a leading brand of educational writing software. (He’s now at Dataminr, an information-discovery company.) In an email, he described the Sisyphean nature of rule-based language processing. Rules can “cover a lot of low-hanging fruit and common patterns,” he wrote. But “it doesn’t take long to find edge and corner cases,” where rules don’t work very well. For example, the choice of a preposition can be influenced by the subsuming verb, or by the noun it follows, or by the noun that follows the preposition—a complex set of factors that our language-loving brains process intuitively, without obvious recourse to rules at all. “Given that the number of verbs and nouns in the English language is in the hundreds of thousands,” Tetreault added, “enumerating rules for all the combinations just for influencing nouns and verbs alone would probably take years and years.”
Tetreault grew up in Rutland, Vermont, where he learned to code in high school. He pursued computer science at Harvard and earned a PhD from the University of Rochester, in 2005; his dissertation was titled “Empirical Evaluations of Pronoun Resolution,” a classic rule-based approach to teaching a computer how to interpret “his,” “her,” “it,” and “they” correctly—a problem that today he would solve by using deep learning.
Tetreault began his career in 2007, at Educational Testing Service, which was using a machine called e-rater (in addition to human graders) to score GRE essays. The e-rater, which is still used, is a partly rule-based language-comprehension AI that turned out to be absurdly easy to manipulate. To prove this, the MIT professor Les Perelman and his students built an essay-writing bot called BABEL, which churned out nonsensical essays designed to get excellent scores. (In 2018, ETS researchers reported that they had developed a system to identify BABEL-generated writing.)
After ETS, Tetreault worked at Nuance Communications, a Massachusetts-based technology company that in the course of twenty-five years built a wide range of speech-recognition products, which were at the forefront of AI research in the nineties. Grammarly, which Tetreault joined in 2016, was founded in 2009, in Kiev, by three Ukrainian programmers: Max Lytvyn, Alex Shevchenko, and Dmytro Lider. Lytvyn and Shevchenko had created a plagiarism-detection product called MyDropBox. Since most student papers are composed on computers and emailed to teachers, the writing is already in a digital form. An AI can easily analyze it for word patterns that might match patterns that already exist on the Web, and flag any suspicious passages. Because Grammarly’s founders spoke English as a second language, they were particularly aware of the difficulties involved in writing grammatically. That fact, they believed, was the reason many students plagiarized: it’s much easier to cut and paste a finished paragraph than to compose one. Why not use the same pattern-recognition technology to make tools that would help people to write more effectively? Brad Hoover, a Silicon Valley venture capitalist who wanted to improve his writing, liked Grammarly so much that he became the CEO of the company and moved its headquarters to the Bay Area, in 2012.
Like Spotify, with which it shares a brand color (green), Grammarly operates on the “freemium” model. The company set me up with a Premium account ($30 a month, or $140 annually) and I used it as I wrote this article. Grammarly’s claret-red error stripe, underlining my spelling mistakes, is not as schoolmasterly as Google Docs’ stop-sign-red squiggle; I felt less in error somehow. Grammarly is also excellent at catching what linguists call “unknown tokens”—the glitches that sometimes occur in the writer’s neural net between the thought and the expression of it, whereby the writer will mangle a word that, on rereading, his brain corrects, even though the unknown token renders the passage incomprehensible to everyone else.
In addition, Grammarly offers users weekly editorial pep talks from a virtual editor that praises (“Check out the big vocabulary on you! You used more unique words than 97% of Grammarly users”) and rewards the writer with increasingly prestigious medallions for his or her volume of writing. “Herculean” is my most recent milestone.
However, when it comes to grammar, which contains far more nuance than spelling, Grammarly’s suggestions are less helpful to experienced writers. Writing is a negotiation between the rules of grammar and what the writer wants to say. Beginning writers need rules to make themselves understood, but a practiced writer gives color, personality, and emotion to writing by bending the rules. One develops an ear for the edge cases in grammar and syntax that Grammarly tends to flag but which make sentences snap. (Grammarly cited the copy-edited version of this article for a hundred and nine grammatical “correctness” issues, and gave it a score of 77—a solid C-plus.)
Grammarly also uses deep learning to go “beyond grammar,” in Tetreault’s phrase, to make the company’s software more flexible and adaptable to individual writers. At the company’s headquarters, in San Francisco’s Embarcadero Center, I saw prototypes of new writing tools that would soon be incorporated into its Premium product. The most elaborate concern tone—specifically, the difference between the informal style that is the lingua franca of the Web and the formal writing style preferred in professional settings, such as in job applications. “Sup” doesn’t necessarily cut it when sending in a résumé.
Many people who use Grammarly are, like the founders, ESL speakers. It’s a similar situation with Google’s Smart Compose. As Paul Lambert explained, Smart Compose could create a mathematical representation of each user’s unique writing style, based on all the emails she has written, and have the AI incline toward that style in making suggestions. “So people don’t see it, but it starts to sound more like them,” Lambert said. However, he continued, “our most passionate group are the ESL users. And there are more people who use English as a second language than as a first language.” These users don’t want to go beyond grammar yet—they’re still learning it. “They don’t want us to personalize,” he said. Still, more Smart Compose users hit Tab to accept the machine’s suggestions when predictive text makes guesses that sound more like them and not like everyone else.
* * *
As a student, I craved the rules of grammar and sentence construction. Perhaps because of my alarming inability to spell—in misspelling “potato,” Dan Quayle c’est moi—I loved rules, and I prided myself on being a “correct” writer because I followed them. I still see those branching sentence diagrams in my head when I am constructing subordinate clauses. When I revise, I become my own writing instructor: make this passage more concise; avoid the passive voice; and God forbid a modifier should dangle. (Reader, I married a copy editor.) And while it has become acceptable, even at The New Yorker, to end a sentence with a preposition, I still half expect to get my knuckles whacked when I use one to end with. Ouch.
But rules get you only so far. It’s like learning to drive. In driver’s ed, you learn the rules of the road and how to operate the vehicle. But you don’t really learn to drive until you get behind the wheel, step on the gas, and begin to steer around your first turn. You know the rule: keep the car between the white line marking the shoulder and the double yellow center line. But the rule doesn’t keep the car on the road. For that, you rely on an entirely different kind of learning, one that happens on the fly. Like Smart Compose, your brain constantly computes and updates the “state” of where you are in the turn. You make a series of small course corrections as you steer, your eyes sending the visual information to your brain, which decodes it and sends it to your hands and feet—a little left, now a little right, slow down, go faster—in a kind of neural-net feedback loop, until you are out of the turn.
Something similar occurs in writing. Grammar and syntax provide you with the rules of the road, but writing requires a continuous dialogue between the words on the page and the prelinguistic notion in the mind that prompted them. Through a series of course corrections, otherwise known as revisions, you try to make language hew to your intention. You are learning from yourself.
Unlike good drivers, however, even accomplished writers spend a lot of time in a ditch beside the road. In spite of my herculean status, I got stuck repeatedly in composing this article. When I needed help, my virtual editor at Grammarly seemed to be on an extended lunch break.
* * *
“We’re not interested in writing for you,” Grammarly’s CEO, Brad Hoover, explained; Grammarly’s mission is to help people become better writers. Google’s Smart Compose might also help non-English speakers become better writers, although it is more like a stenographer than like a writing coach. Grammarly incorporates both machine learning and rule-based algorithms into its products. No computational linguists, however, labored over imparting our rules of language to OpenAI’s GPT-2. GPT-2 is a powerful language model: a “learning algorithm” enabled its literary education.
Conventional algorithms execute coded instructions according to procedures created by human engineers. But intelligence is more than enacting a set of procedures for dealing with known problems; it solves problems it’s never encountered before, by learning how to adapt to new situations. David Ferrucci was the lead researcher behind Watson, IBM’s Jeopardy!-playing AI, which beat the champion Ken Jennings in 2011. To build Watson, “it would be too difficult to model all the world’s knowledge and then devise a procedure for answering any given Jeopardy! question,” Ferrucci said recently. A knowledge-based, or deductive, approach wouldn’t work—it was impractical to try to encode the system with all the necessary knowledge so that it could devise a procedure for answering anything it might be asked in the game. Instead, he made Watson supersmart by using machine learning: Ferrucci fed Watson “massive amounts of data,” he said, and built all kinds of linguistic and semantic features. These were then input to machine-learning algorithms. Watson came up with its own method for using the data to reach the most statistically probable answer.
Learning algorithms like GPT-2’s can adapt, because they figure out their own rules, based on the data they compute and the tasks that humans set for them. The algorithm automatically adjusts the artificial neurons’ settings, or “weights,” so that each time the machine tries the task it has been designed to do the probability that it will do the task correctly increases. The machine is modeling the kind of learning that a driver engages when executing a turn, and that my writer brain performs in finding the right words: correcting course through a feedback loop. “Cybernetics,” which was the term for the process of machine learning coined by a pioneer in the field, Norbert Wiener, in the 1940s, is derived from the Greek word for “helmsmanship.” By attempting a task billions of times, the system makes predictions that can become so accurate it does as well as humans at the same task, and sometimes outperforms them, even though the machine is still only guessing.
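The feedback loop described above can be sketched with a single artificial neuron. This is a toy under stated assumptions, not GPT-2's training code: the names `train_neuron` and `predict`, the four data points, and the learning rate are all invented for illustration. The neuron guesses, measures how wrong the guess was, and nudges its weight and bias to shrink the error the next time.

```python
import math

def predict(w, b, x):
    """The neuron's guess: a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def train_neuron(examples, epochs=5000, lr=0.1):
    """Adjust one weight and one bias by repeated small course corrections."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = predict(w, b, x) - target  # how far off was the guess?
            w -= lr * error * x                # steer the weight against the error
            b -= lr * error                    # steer the bias against the error
    return w, b

# Hypothetical toy task: inputs below about 1.5 are labeled 0, above it 1.
examples = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
w, b = train_neuron(examples)
for x, target in examples:
    print(x, target, round(predict(w, b, x), 3))
```

No one tells the neuron where the dividing line falls. It arrives there through thousands of small corrections, and its outputs remain probabilities, which is to say guesses.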
To understand how GPT-2 writes, imagine that you’ve never learned any spelling or grammar rules, and that no one taught you what words mean. All you know is what you’ve read in eight million articles that you discovered via Reddit, on an almost infinite variety of topics (although subjects such as Miley Cyrus and the Mueller report are more familiar to you than, say, the Treaty of Versailles). You have Rain Man–like skills for remembering each and every combination of words you’ve read. Because of your predictive-text neural net, if you are given a sentence and asked to write another like it, you can do the task flawlessly without understanding anything about the rules of language. The only skill you need is being able to accurately predict the next word.
GPT-2 was trained to write from a forty-gigabyte data set of articles that people had posted links to on Reddit and which other Reddit users had upvoted. Without human supervision, the neural net learned about the dynamics of language, both the rule-driven stuff and the edge cases, by analyzing and computing the statistical probabilities of all the possible word combinations in this training data. GPT-2 was designed so that, with a relatively brief input prompt from a human writer—a couple of sentences to establish a theme and a tone for the article—the AI could use its language skills to take over the writing and produce whole paragraphs of text, roughly on topic.
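A drastically simplified sketch may help show what "computing the statistical probabilities" of word combinations means. GPT-2 is a large neural network, not a lookup table; the example below swaps in plain counts of adjacent word pairs (a "bigram" model), and the three-sentence corpus and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn the raw follower counts for one word into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: n / total for w, n in followers.items()} if total else {}

def predict_next(counts, word):
    """Pick the most probable next word, or None for an unseen word."""
    probs = next_word_probabilities(counts, word)
    return max(probs, key=probs.get) if probs else None

# Hypothetical stand-in for the training data.
toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
counts = train_bigrams(toy_corpus)
print(predict_next(counts, "the"))
print(next_word_probabilities(counts, "sat"))
```

The model contains no rule of grammar anywhere; it knows only which words tended to follow which in what it has read. A real language model conditions on far more than the single preceding word, but the task is the same: predict what comes next.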
What made the full version of GPT-2 particularly dangerous was the way it could be “fine-tuned.” Fine-tuning involves a second round of training on top of the general language skills the machine has already learned from the Reddit data set. Feed the machine Amazon or Yelp comments, for example, and GPT-2 could spit out phony customer reviews that would skew the market much more effectively than the relatively primitive bots that generate fake reviews now, and do so much more cheaply than human scamsters. Russian troll farms could use an automated writer like GPT-2 to post, for example, divisive disinformation about Brexit, on an industrial scale, rather than relying on college students in a St. Petersburg office block who can’t write English nearly as well as the machine. Pump-and-dump stock schemers could create an AI stock-picker that writes false analyst reports, thus triggering automated quants to sell and causing flash crashes in the market. A “deepfake” version of the American jihadi Anwar al-Awlaki could go on producing new inflammatory tracts from beyond the grave. Fake news would drown out real news.
Yes, but could GPT-2 write a New Yorker article? That was my solipsistic response on hearing of the artificial author’s doomsday potential. What if OpenAI fine-tuned GPT-2 on The New Yorker’s digital archive (please, don’t call it a “data set”)—millions of polished and fact-checked words, many written by masters of the literary art. Could the machine learn to write well enough for The New Yorker? Could it write this article for me? The fate of civilization may not hang on the answer to that question, but mine might.
I raised the idea with OpenAI. Greg Brockman, the CTO, offered to fine-tune the full-strength version of GPT-2 with the magazine’s archive. He promised to use the archive only for the purposes of this experiment. The corpus employed for the fine-tuning included all nonfiction work published since 2007 (but no fiction, poetry, or cartoons), along with some digitized classics going back to the 1960s. A human would need almost two weeks of 24/7 reading to get through it all; Jeff Wu, who oversaw the project, told me that the AI computed the archive in under an hour—a mere after-dinner macaron compared with its All-U-Can-Eat buffet of Reddit training data, the computing of which had required almost an entire “petaflop-per-second day”—a thousand trillion operations per second, for twenty-four hours.
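For a sense of scale, the arithmetic behind a "petaflop-per-second day" can be worked out directly from the definition given above. The variable names are chosen for this example.

```python
OPS_PER_SECOND = 10**15          # a thousand trillion operations per second
SECONDS_PER_DAY = 24 * 60 * 60   # twenty-four hours

ops_per_petaflop_day = OPS_PER_SECOND * SECONDS_PER_DAY
print(f"{ops_per_petaflop_day:.2e}")  # prints 8.64e+19
```

That is roughly eighty-six quintillion operations.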