Word by Word

by Kory Stamper

It’s a system that seems laughably easy.

—

Noah Webster, in an 1816 letter, wrote that “the business of a lexicographer is to collect, define, and arrange, as far as possible, all the words that belong to a language.” All words—technical language, jargon, cant, interesting words, boring words—were, in Noah’s view, ripe for harvesting.*6 But modern lexicographers shift the emphasis of Noah’s statement a bit to the left, onto “as far as possible”: the business of a lexicographer is to collect, define, and arrange, as far as possible, all the words that belong to a language. The fruit isn’t as low hanging as Noah’s sound bite would lead you to believe. No dictionary in the world records all the words in any given language.

A lexicographer can’t rely solely on their native knowledge of the language when evaluating a word for entry—how could I, a half-assed medievalist, know whether “EBITDA,” a word used in accounting, is widespread when I have never seen it before and hope never to have to encounter it in my everyday life? Even the editors who are reckoned to be specialists—the science editors—aren’t necessarily experts. “We don’t have expertise in everything,” says Christopher Connor, one of Merriam-Webster’s life sciences editors. “We’re just tasked at doing it.” That’s why many dictionary companies have some sort of reading program that gives lexicographers the raw materials they use to write a dictionary definition.

A tool is only as good as the materials used to make it (as I have heard my father holler from the depths of the garage, usually after a sudden, clarion “clang” followed by a daisy chain of expletives), and a dictionary is no exception. The aim of a general dictionary isn’t to just skim the language, siphoning off the lightweight flotsam that everyone sees; nor should it dredge the bottom of the language, pulling up rare and archaic words from the depths where they have slid into the rusting muck. To have a truly representative sample of the language to define from, you need both depth and breadth.

English is a language that invites invention (whether you like it or not), and the glories of the Internet make it possible to spread that invention abroad (whether you like it or not). That means that we tend to see new coinages everywhere we go—words like “mansplain,” a lovely little portmanteau*7 of “man” and “explain” and used broadly to refer to a man pedantically explaining something to a listener under the false assumption that the listener knows less than the speaker does on the subject. “Mansplain” stuttered into existence around 2009, and by the beginning of 2013 it was everywhere: from The New York Times to The Huffington Post, from The Globe and Mail of Canada to the Sunday Tribune in South Africa to the Sunday Guardian in India. And of course it was everywhere: it’s a great word. It birthed a whole brood of “-splains”: an informal list I keep at my desk includes “grammarsplain,” “wonksplain,” “poorsplain,” “catsplain,” “whitesplain,” “blacksplain,” “lawsplain,” and the inevitable “sexsplain,” all found in various articles online.

A casual reader might assume that “mansplain” and the “-splain” affix were the It Words for the second decade of the twenty-first century—not just everywhere, but important because they were everywhere. And sure enough, everyone who knew what I did for a living asked, from 2013 onward (and with equal parts eagerness and horror), “Have you entered ‘mansplain’ yet?” I would assure—or disappoint—them, “No, it’s not in yet,” and in return they would inevitably get irked. One friend responded, “I thought you were supposed to be on the bleeding edge of language change!” I assured her that we were but that the bleeding edge of language change wasn’t always the most prominent one. I had been spending, I said, a lot of time looking at uses of “bored of.”

She blinked in what I can only hope was ravening interest but was more likely blank incredulity. “Bored of,” she repeated.

Yes, “bored of,” as in “I’m bored of being asked about ‘mansplaining’ every fifteen minutes.” For hundreds of years, “bored” was always paired with “by” (“I’m bored by your grammarsplaining”) or “with” (“I’m bored with your grammarsplaining”), but in recent years lexicographers began to notice that “bored” was beginning to be paired more with “of.” This trend has been much more common in the U.K. than in the United States—in fact, the folks at Oxford Dictionaries say that they have more evidence nowadays of “bored of” than they do of “bored by,” and the evidence shows uses of “bored with” and “bored of” are neck and neck over there—but it’s creeping in on this side of the pond as well. It’s a small change, just a little slip in the linguistic tectonic plates of “bored,” but lexicographers can feel the shock waves ripple through the language. Because it may be that this particular use of “of” here in “bored of” is the beginning of a new meaning of “of,” and that, my friends, is the sort of thing that gets lexicographers all hot and bothered. “Bored of” is a new use that few people notice, but it’s far more prevalent than all the uses of “mansplain” chucked together. Skimming the surface of the language means that you skip over this small but beautiful specimen.

Skimming’s out, but that doesn’t mean that you should attempt to sound the bottom of the language, either. In one of my early defining batches for the Unabridged Dictionary, I had the word “abecedarian.” It’s relatively rare, one of those ten-dollar words that people whip out when trying to prove that they competed in the National Spelling Bee. Rare or not, I needed to read through the citations we had for it to determine if a change to the definition was needed. I was familiar with the first definition given, “one that is learning the rudiments of something (such as the alphabet),” but it had another definition I had never seen before: “one of a 16th century Anabaptist sect that despised human learning on the ground that the illiterate needed no more than the guidance of the Holy Spirit to interpret Scripture.” I mimed an “ooh”—in the office you’d never actually articulate an “ooh,” because that counts as talking and talking is frowned upon—and went diving into the citations. The history of religion in Europe is something I know a very small bit about, and this promised to be a gas. Reformation Anabaptists! Illiteracy! The Holy Spirit! Truly, my defining cup runneth over.

But “abecedarian” is a rare word, and this sense of “abecedarian” was the rarest ever. There was almost nothing in our extensive citation files for it; no one at Merriam-Webster had evidently encountered this word in print. So how did it end up in the dictionary? A pink from the early twentieth century let slip that the evidence was in a single book in the editorial library. I slid the file drawer closed and plodded down to the basement, where we kept the old editorial library books next to the rolls of packing tape and the ghost of George Merriam, doomed to moan for eternity about the price of ink and the crazy demands of the Webster family. And thus began a wild-goose chase for evidence of this odd Anabaptist sect’s name—a chase that lasted almost a week, involved me tracking down earlier and earlier screeds against the Anabaptists in a variety of languages, and ended with an e-mail from a professor-friend that began, “So I read further in that Historiae anabaptisticae text and consulted some of Melanchthon’s letters written during the period in which Storck et al., were in Wittenberg and the aftermath of that.” It was all terribly exciting in that look-at-me-using-my-Latin way, but it was, lexicographically, a waste of my time. If this particular use of “abecedarian” is so rare that I can’t find much evidence of it, then it is probably one of those uses that’s sucked deep into the sludge of discarded words that makes up the bottom of the river English and probably shouldn’t get a full week of my time. Yes, I learned a lot about the Zwickau prophets and the early beginnings of Anabaptism, but in the end, when I sat down and did a dispassionate inventory of what I had found about the word, I discovered that it didn’t change the definition at all. A week of editorial time stuffed down a rabbit hole, and all I came out of it with was the knowledge that I am the world’s biggest epistemophilic dork. What I learned wasn’t even good cocktail-party fodder. Depth, then, isn’t all it’s cracked up to be.

So we aim for the even middle: a variety of resources with some depth. But here is a truth lexicographers—people who sing praises to objectivity and worship raw, unparsed data—would rather you ignore: deciding on a balanced source list is really a subjective art. How much academic writing should go in? Academic writers want you to read everything they write (because someone has to), and many fields are rife with specialty journals of one stripe or another. If we read the Journal of Modern Literature, should we also read Contemporary Literature? What about American Literature? If we read that, do we need to add Early American Literature to the mix, or can we assume that American Literature will also cover some early American literature? And not to be jingoistic, but do we also have to read the Canadian Review of Comparative Literature or the Scottish Literary Review? I exaggerate, of course, but just barely. Edge out into the sciences, and your list blossoms into thousands of possibilities, including nine separate journals all called Journal of Physics. So much whiffling deliberation over which Journal of Physics you should read, and yet it might not really matter, because most published writing isn’t academic. Your time would be better spent agonizing over whether to read both People and OK! magazine.

The early lexicographers made very deliberate choices to omit sources they felt were not up to snuff. Samuel Johnson got very sniffy about including American sources; Noah Webster thought that some of the giants of British literature were too inflated to include in a sensible language of good English.*8 Modern lexicographers are (or try to be) less snooty in their selections, but they also have to make choices. Before you is the latest Margaret Atwood novel and the entirety of the Twilight book series. Twilight is wildly popular, but you suspect it’s not going to yield a lot vis-à-vis new words. Margaret Atwood’s book is not quite as popular as the teen vampire/werewolf/paranormal romance, but it will probably yield more new words. Do you eschew what is generally held to be popular dreck for something more literary? Doesn’t dreck have a place in the world, too? One of the complaints against Webster’s Third New International Dictionary was that it included a number of quotations—forty-five in all—from Polly Adler’s book A House Is Not a Home. Polly Adler’s name was well-known to discerning readers of the early twentieth century: she was the most celebrated brothel owner and madam in New York. Her memoir supplied some fantastic quotations, including the delightful “trying to chisel in on the beer racket.” Which brings us to an important consideration: Who are you, you overeducated and myopic boob, to judge whether a book is (a) dreck and (b) probably not a gold mine of neologisms? Plenty of lexicographers thought that when the Harry Potter series first debuted, it wasn’t going to give us much in the way of lasting coinages, because it was a fantasy series. And now we see the word “muggle,” used unironically to refer to a (usually provincial) person outside a particular culture or group, all over the place, including in an article discussing the language of a Supreme Court dissent.

—

Modern lexicographers have something else to contend with: the vast ball of wax that is the Internet. Reference publishers have traditionally been a little squidgy about mining the Internet for citations because most writing on the Internet isn’t edited. Then again, books, news, and periodicals are reorienting themselves to be online properties. What you see in the printed version of The New Yorker, for instance, isn’t the same content you may see on its website. As print sources have shrunk, so too have their editorial staffs, which means that some formerly reliable sources are now spottily edited (if they’re edited at all).

The Internet has also posed another problem for the lexicographer: sources can be changed, edited, or disappear at will. Much was made in 2015 of Bryan Henderson, the Wikipedia editor whose personal mission was to delete and revise all appearances of “is comprised of” on the open-source encyclopedia. He has made—by hand—over forty-seven thousand edits to the site, most of them replacing “is comprised of” with “is composed of” or “consists of.” There was a lot of hooting and hollering both in support and in detraction of Mr. Henderson, but lexicographers frowned slightly and rubbed the crease of editorial worry between our brows. We’ve all seen an interesting use online—perhaps a “bored of” or an “is comprised of”—that we add to the files, only to go back later and find it was edited out. Damn mutability, we mutter. Yet the record has always been mutable: John Dryden edited later editions of his works to avoid words like “wench” because he found, as he grew older, that he preferred “mistress” instead.

There is one final point to consider. In an age of dictionary contractions, where reference companies are cutting editors left and right and you’re lucky to have half a dozen editors on staff, how are you going to read all this goddamned stuff?

Some dictionary companies have attempted to fix this problem by using what’s called a corpus. A corpus is a curated collection of full-text sources usually dumped into a searchable database of some sort. These corpora*9 are online, available both publicly and by subscription. Some focus on newspapers; some on a mix of academic and nonacademic writing; some include transcripts of news broadcasts; one famously only includes the scripts of American soap operas. Most contain hundreds of millions of words, and they’ve been a boon to lexicographers. The best of them label and subdivide their sources so lexicographers can sort through transcribed speech or academic writing, or they tag the words in their corpus with parts of speech—a godsend when you are defining a word like “as,” which has five parts of speech. And they collect things that lexicographers never had easy access to before. Sci-fi and fantasy novels; the proceedings from Britain’s early modern criminal court; Usenet group posts; hand-printed zines and pamphlets from the punk 1970s and 1980s; comic books: before the advent of the Internet, lexicographers chasing down a hunch had to know that those materials existed and hope they were housed somewhere within easy commuting distance and overseen by a charitable librarian.

Corpora are also excellent for collecting dialect terms or regionalisms. These are words that are specific to a dialect or region and that don’t have much national or international use. The word “finna” is an excellent example: it’s common in Southern English as an alteration of “fixing to,” which is itself a Southernism for “going to,” and it rarely appears in edited print outside the American South. That means that the Yankee lexicographers up north would rarely encounter it in national print, but you can find it in corpora that include smaller regional newspapers and publications. Corpora literally open up new lexical worlds that lexicographers might have only glimpsed before. Before corpora, lexicographers could only look at what they had personally collected. Now, with just a few keystrokes, you can get a sense of the geographical distribution of a word or get a sense as to how widespread a given word is compared with another word.

But as great as corpora are—and they really are—they can’t compete with a real-life person trawling through a magazine or a web article, waiting for something to snag on their sprachgefühl. Linguists love corpora; where two or three linguists are gathered, there shall you find heavy-breathing fetishism about the size, scope, all those possibilities, all that data. Yet all the data in the world is useless unless you can find someone to parse and interpret it.

An online dictionary start-up called Wordnik made waves in 2015 when it announced a fund-raising campaign to document the “million missing words” of English—words that were so new or rare they weren’t entered into any dictionary. The way it was going to do this was to search the Internet for any glossed word—that is, a word that is explained in running text right after its first use, like this. Wordnik was using a data analytics firm to help it find its million words by looking for trigger texts, like “also called” or “known as.” One co-founder of the data firm told The New York Times that Wordnik’s research was going to help track how quickly new words are adopted:

“We can actually measure when words get adopted in mainstream lingo,” he said, by looking at when writers stop explaining neologisms like “infotainment” and start using them as if their meanings were commonly understood. “It will be interesting to see which words will very quickly get adopted and which words remain outsiders.”

There are a number of metrics that trained lexicographers look for when judging whether a word has fully settled in to the language. The disappearance of that gloss I mentioned above is one, and it works fairly well—provided that you are savvy about where you’re looking. Because these days, the most productive new words—and by “productive,” I don’t mean “useful” but “used a lot and all over the place”—come from a few fields that are just bursting with jargon and specialized vocabulary, like computer programming, medicine, and business. Within those fields, that jargon and specialized vocabulary are well understood and so show up without the trappings we give to new words to mark them as such: quotation marks, italics, glosses. But in general-interest publications, those specialty terms will still show up in quotation marks, italics, and with glosses, because they haven’t really settled in to the language of the Regular Guy. When it comes time to write a dictionary, I might enter the specialized term in an unabridged dictionary, where people expect harder terms and more specialized vocabulary, but not in an abridged dictionary, which people expect to have the words you need every day. Or I might decide that it’s an important enough word that even though it’s still being glossed regularly, it deserves entry right away: words like “AIDS” and “SARS” will probably get entered into a dictionary fairly quickly after they first show up on the scene, because you can reason that the syndromes they name are significant enough health events that they are not going anywhere very soon. Those sorts of decisions are made on a human level; people with experience in the trenches of language change can make those decisions far better than natural-language processing programs currently can. Computers are, however, far quicker.
