by David Wolman
The publishers at Houghton decided to join forces with pioneers of a nascent discipline that straddled the fields of computer science and linguistics. These experts were applying computing technology and statistics to the study of human language. Modern technologies like in-car navigation systems, translation software, and talking robots are all branches on the computational linguistics tree. So too is spell-check.
Czech-born Henry Kučera was one of the science’s trailblazers. After Czechoslovakia’s communist coup in 1948, Kučera escaped to the American-occupied zone in Germany. He had nearly completed his PhD in linguistics before fleeing, and was accepted into Harvard as a graduate student in 1949.4 From there, he became a professor of linguistics at Brown University. In the 1950s, many of Kučera’s academic colleagues viewed computers as an oddity, hardly worthy of scholarly pursuit. At that time, Kučera worked on a computer that could store a total of one kilobyte of data. (By comparison, a two-sentence email today takes up about four kilobytes.) But even in those cumbersome early machines, Kučera saw the great potential that computing power could bring to bear on the study of language. Within a decade, he would help propel the worlds of linguistics and lexicography into the Information Age.
Houghton’s top brass brought in Kučera to help build a digital version of the language. At Brown, Kučera and his colleagues had already compiled an electronic body of text that lexicographers could use to survey words and their frequency. Compared with the inherently unscientific method of getting a bunch of people to haphazardly mine published material in search of words and meanings, this new approach was revolutionary. For the first time, it was possible to take a statistically rigorous measure of the language.
Houghton, with Kučera’s guidance, built a one million–word digital lexicon with short definitions tagged to each entry.5 Only a piece of the operation was computerized; each word and its spelling was still approved by a one hundred–person panel of language and literature experts, hired to level final judgment about the entries. Nevertheless, the 1969 American Heritage was the first dictionary ever to be compiled electronically, and although most consumers probably didn’t know or care about the distinction, the technology-based approach gave Houghton an edge over both the OED and Merriam-Webster.
By the late ’70s and early ’80s, when Houghton was already getting ready to publish a second-edition American Heritage, a nationwide word-processing bonanza was under way. Companies were racing to design programs for everyday consumers typing on first-generation desktops or stand-alone typewriter-computer hybrids. (When I was in high school, my family owned a Brother-brand word processor with a narrow green screen. My father would watch over my shoulder as I wrote and rewrote sentences, marveling at how easy things had become since he was a kid. Then we would waste ungodly amounts of paper trying to get the machine to print straight across the page.)
Among the emerging breed of word-processing specialists, news spread quickly that Houghton had a digital lexicon. One of the most obvious and, as one Houghton veteran told me, “irresistible,” ideas for what to do with the lexicon was to make a spelling verifier. With a list of canonical word forms in the code, any typed string of letters could be compared against the word list, much like Les Earnest’s original program, but on a broader scale. If a match didn’t show up, the “word” would be flagged as an error or misspelling. Eager to provide this immeasurably useful function to customers, technology firms came to Houghton with checkbooks at the ready. Houghton established a small software division, and in a matter of years had licensed word lists to the likes of Hewlett-Packard, Lang, Digital, Commodore, Sharp, Wang, Panasonic, Sony, Lanier, Brother, and a small Washington-based company called Microsoft.
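The verifier idea described above is simple enough to sketch in a few lines of Python. This is only an illustration of the general technique, not the code Houghton or its licensees actually shipped; the tiny WORD_LIST and the function name flag_unknown_words are invented for the example.

```python
import re

# Hypothetical stand-in for a licensed lexicon; a real word list would
# hold many thousands of canonical forms.
WORD_LIST = {"the", "cat", "sat", "on", "a", "separate", "mat"}


def flag_unknown_words(text, word_list=WORD_LIST):
    """Return every token in `text` that has no match in `word_list`."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Anything without a match is flagged as a possible misspelling.
    return [token for token in tokens if token not in word_list]


print(flag_unknown_words("The cat sat on a seperate mat"))  # ['seperate']
```

Note that a verifier of this kind can only say that a string is missing from the list. It has no idea what the typist meant to write.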
A huge benefit for the tech companies was that Houghton’s word list came with the weight of authority. This wasn’t just a list of words thrown together by a bunch of computer geeks. This was a lexicon compiled and confirmed by a group of linguists and literature experts who were using it to produce one of the most authoritative English dictionaries in the world. But having the list wasn’t enough. The programmers didn’t have the linguistics know-how to write an effective spell-check program. So they came back to Houghton and Kučera asking for help. Within a few years, Houghton’s software division introduced spelling correction. Instead of only flagging entries that weren’t on the word list, the program could now take those (usually) erroneous entries and run algorithms to calculate words the user had most likely intended to type.
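The text does not say which algorithms Houghton’s software used, so the following is a sketch of one common textbook approach rather than a reconstruction: generate every string one keystroke away from the unknown entry, keep the ones that appear on the word list, and rank them by how often each word is used. The word counts and the names one_edit_away and suggest are assumptions made up for illustration.

```python
import string

# Hypothetical word list with made-up usage counts.
WORD_COUNTS = {"separate": 120, "calendar": 90, "tendency": 30}


def one_edit_away(word):
    """All strings reachable from `word` by one delete, swap, replace, or insert."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, word_counts=WORD_COUNTS):
    """Return likely intended words for `word`, most frequent first."""
    if word in word_counts:
        return []  # already on the list, nothing to correct
    candidates = [w for w in one_edit_away(word) if w in word_counts]
    # Rank by descending count; break ties alphabetically for stable output.
    return sorted(candidates, key=lambda w: (-word_counts[w], w))


print(suggest("seperate"))  # ['separate']
print(suggest("calender"))  # ['calendar']
```

Ranking by frequency is one simple way to guess what the user "most likely intended." Other rankings, such as weighting the typos that people actually make most often, are equally plausible.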
The word lists and checking software were enormously profitable. All of a sudden Houghton, the 150-year-old company that had published the likes of Hawthorne, Thoreau, and Emerson, and was known to most people as an old-school New England publisher of educational texts, had become a tech-sector powerhouse. The company made hundreds of millions of dollars in royalties. Eventually, Houghton executives decided to spin off the software division into a separate company focused on the new horizon of linguistics and computing.
But by the mid-’90s the bottom fell out of the word-processing business, thanks to the ascent of Microsoft. Competing programs faded away, which meant a shrunken market for selling spelling correction software. Then Microsoft decided to buy the technology outright, instead of continuing to pay royalties on every sale of a Microsoft product that contained Houghton’s spell-checking goodies. By the time Microsoft had absorbed the word lists and technology that had started with Kučera and the first digital lexicon, spell-check was everywhere.
UBIQUITOUS MAYBE, BUT hardly perfect. Just ask attorney Arthur Dudley. In a brief submitted to San Francisco’s First District Court of Appeal in 2006, Dudley accidentally caused some initial confusion, followed by laughter, with the application of the rare legal term sea sponge. A similarly shaped term, sua sponte, is well-known legalese. It means to act on one’s own accord, without prompting from an outside person or entity. But Dudley’s word-processing program wasn’t up on legal terminology, and the spelling corrector turned sua sponte into sea sponge.6
Still, it’s an impressive thing that software can correct, on the fly, so many typos and misspellings. (Adios seperate, ocassionally, persaverance, liason, tendancy, and relavent.) When it comes to the bulk of the work most of us do on computers, spell-check is remarkably helpful. Yet there are a number of reasons why it doesn’t always correct correctly. One is that English is just so diverse and continually growing that it’s hard to keep updating the word lists to include all the new words, company names and slang in the lexicon.
Another reason why an A+ spelling checker remains elusive is that the natural ways in which humans use language and the variable contexts of speech are extremely complex, so much so that programming a computer to truly understand us has been, is, and will for a long time remain one of the greatest challenges in all of science. One popular spoof on spell-check is an anonymous poem posted on a number of Web sites. It begins:
Eye have a spelling chequer,
It came with my Pea Sea.
It plane lee marks four my revue
Miss Steaks I can knot sea.
Eye strike the quays and type a word
And weight four it two say
Weather eye am write oar wrong
It tells me straight a weigh.
Nevertheless, in-house language experts at places like Microsoft do their best to keep watch over the language and, bit by bit, keep tweaking the word lists and algorithms for suggested correction. Nowadays, countless typos and misspellings get corrected as you type, and new features are moving the programs beyond just canonical forms, flagging, for instance, homophones (a pear of sox) and malapropisms (righteous indigestion).
Microsoft’s language experts also track word requests, as well as frequently corrected “words,” to assess whether those words should be added to the Speller dictionary (Speller is the trademark name of Microsoft’s spell-checker). One recent request was pleather, meaning a plastic faux leather, which was added because of a lobbying effort by the group People for the Ethical Treatment of Animals. If you’ve got the latest goods from Microsoft, pleather shouldn’t get a red squiggly.
In other cases, real words are intentionally kept out of the program’s dictionary. A calender is a machine used for a specialized manufacturing process. But most people see calender as a misspelling of calendar. More often than not, it is. The wordsmiths at Microsoft have decided to keep calender out of the program’s dictionary, figuring that at the end of the day it’s more useful to fix so many misspelled calendars than it is to cater to the sensibilities of a small subset of the population who happen to know of, and want to write about, calenders. Similar homophones (computer people call them “common confusables”) include words like rime, kame, quire and leman.
Other judgment calls are also embedded in Microsoft’s Speller. Thru doesn’t get the red squiggle, at least on my version of Word, nor does phone or Xmas. But X-mas and cellphone get flagged. Wanna gets the red squiggle, although most dictionaries say it’s an alternative or informal spelling for want to. And then there are “words” that fall under the vague category of “restrictions.” For a PG–13 example, consider jackass. At this moment in spell-check software history, jackass isn’t going to be called out as a misspelling. It’s a word. But type jakass, jackas, or jaqueass, and you’ll see the red squiggly three times over, yet without the suggested correction of jackass. (You will get, however, jokes, jukes, kakas, jackals, and two dozen other suggestions.) Jackass is in the Speller’s dictionary (i.e., master word list), but presumably as a matter of taste, it has been deemed a restricted word—recognized, but not recommended.
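One way to picture these judgment calls is as two separate lists: the words the checker recognizes, and a smaller set it recognizes but will never offer as a suggestion. The sketch below is an assumption about how such a policy could be expressed, not a description of Microsoft’s Speller, whose internals are not public. The sets RECOGNIZED and RESTRICTED and both function names are hypothetical.

```python
# Hypothetical lists for illustration only.
# "calender" is deliberately left out, so the real word gets flagged.
RECOGNIZED = {"calendar", "jackal", "jackals", "jackass", "jokes"}
# Restricted words are accepted when typed but never suggested.
RESTRICTED = {"jackass"}


def is_flagged(word):
    """A word gets the red squiggle when it is not on the recognized list."""
    return word.lower() not in RECOGNIZED


def suggestable(candidates):
    """Filter correction candidates down to words the checker may recommend."""
    return [w for w in candidates
            if w in RECOGNIZED and w not in RESTRICTED]


print(is_flagged("jackass"))    # False
print(is_flagged("jackas"))     # True
print(is_flagged("calender"))   # True
print(suggestable(["jackass", "jackals", "jokes"]))  # ['jackals', 'jokes']
```

Under this arrangement, leaving calender off the list and holding jackass back from suggestions are both editorial choices made by people, which is the point of the passage.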
Does this kind of thing have a widespread impact on the language, orthographically or otherwise? Probably not. Even the most enterprising jackass can’t come up with that many insults. But this and other wrinkles within the world of spell-check software remind us that behind the curtain of our most authoritative language resources, namely dictionaries and now spell-checkers, are human beings.
ONE THING THAT never ceases to amaze me about spell-check is that it receives so much flak (or flack) from the English-speaking public. People welcome technology into countless areas of their lives. Yet for some reason, they choose to take issue with this particular tool, convinced it’s at best harmful, at worst turning our species into slovenly philistines. Bring up the subject of spell-check—or, for that matter, text messaging—in a bar, school, airplane, café, or office park, and you’re bound to meet a number of people ready and willing to provide unsolicited sermons asserting that spell-check “is the worst thing ever,” “has totally wrecked education,” and that “no one can spell anymore because of it.” “There’s just something very emotional about spelling,” a linguist once told me, and it’s true. Politics don’t seem to matter. Geography doesn’t seem to matter, gender doesn’t seem to matter, and age seems to matter less than one might think. I meet plenty of twenty- and thirty-somethings who say they can’t stand spell-check.
When I visited with Les Earnest at his home in Los Altos Hills, I brought up this common complaint about spell-check. “It’s bullshit,” he said. “The spelling checker helps teach you to spell. I don’t make as many errors now that I use it. And it keeps me straight while also catching typos when I’m typing fast,” which he does, working with eight different computers in his home office. But even Earnest isn’t a spelling anarchist. “Some people, when emailing, use misspelled words deliberately, to assert informality. Me? I sort of like things to be correct. We should try to spell correctly, just as we should try to do math equations correctly. And I accept spelling checkers as part of the solution.”
One day last winter, I dialed Henry Kučera’s number in Rhode Island. Now eighty-three years old, he had moved into an assisted-living facility. The telephone rang twice before a man answered with a quavering voice. Introducing myself, I was already sketching in my mind a plan to fly east to meet up with him. We would sit and drink tea together, or maybe a couple of pilsners, and Kučera would tell me about his escape from Czechoslovakia, his love for language, the early days of the wild new science of computational linguistics, and the runaway success of spell-check.
But it wasn’t meant to be. “I’m not well,” he said. “I can’t help you. Not now.” From the frailty of his voice, I knew that “not now” meant not ever. The truth is, though, Kučera has already helped me, and millions of other people, every time a spell-check program has whipped a misspelling back into shape before someone who cared could notice. The software is just performing as programmed, but in a small way, we have Kučera to thank every time it happens.
ELEVEN
THE RUBARB ON THE INTERNET
The waves are beating against the rocky promontory of fixed spelling all the time.1
Lexicographer Robert Burchfield
THE MOST FAMOUS MISSPELLING in business history has got to be Google. In 1995, two Stanford University graduate students developed a technique for searching and cataloging Web sites. The students, Sergey Brin and Larry Page, first called the search engine BackRub, because the program analyzes the “back links” of Web sites.2 Realizing the name wasn’t such a hot choice, they sat down with a few friends to brainstorm a better one. Someone got to thinking of the dazzling scope of the search operation on an ever-expanding Internet, which led to the idea of names for very large numbers.
Googol, a shorter cousin of the even larger googolplex, is a 1 followed by one hundred zeros and is not to be confused with the Russian writer Nikolai Gogol. The two techies liked the sound of googol. A quick scan for available domain names came up clean: Google.com was up for grabs. By the time anyone noticed the misspelling, Page had already registered Google.com, and he and Brin didn’t seem to care about the error. They had moved on to other things, stacking computers in Page’s dorm room and, soon, raising their first twenty-five million dollars in venture capital.
When I arrived on a midsummer afternoon at Google’s Mountain View, California, headquarters (also known as the Googleplex), no one was playing beach volleyball in the sand court near the plastic pink flamingos and dinosaur skeleton, but flip-flop-wearing employees were riding commuter bikes between campus buildings; two women were digging in the community organic garden; and clusters of people sat at outdoor tables under sunshades colored yellow, red, blue, or green. Inside one of the campus cafés, a bunch of young Google-ites sat on colored sofas and chairs, sipping lattes and freshly squeezed juices while tapping away on laptops or reading through printouts of code.
I had come to Mountain View to learn about the workings of Google’s suggested spelling function, which you may recognize as: “Did you mean: function.” Peter Norvig, Google’s director of research, hurried into the conference room. A white-haired man with wide-open eyes, Norvig wore an ocean-blue Hawaiian shirt, beige shorts, and white sneakers. Before coming to Google in 2002, he was division chief for computational sciences at NASA. He was also a professional speller—once. During a recent Broadway showing of the musical “25th Annual Putnam County Spelling Bee,” Norvig volunteered to be an audience participant in the first couple of rounds of the bee. He overstayed his welcome, though, correctly spelling the absurdly difficult words used to send walk-ons back to their seats. The cast eventually had to boot him off stage.
Norvig used to be in charge of Google’s search quality program, but his research now focuses on matters of language. A few weeks prior, I’d read an article quoting Norvig as saying there are one hundred trillion words on the Internet. I asked him if that’s really true. Norvig shrugged. “You’ve got to come up with some number,” he said, because people like to hear something concrete. “But the real number is infinity. Just hit the ‘next’ page of your [online] calendar and you’ve added more words.” The more interesting questions, Norvig said, have to do with the millions of unique English words in cyberspace and the novel ways in which people are using them and coining new ones.3
You don’t need to dig deep into programming code or obscure corners of the Internet to witness this linguistic revolution in action. It starts with a simple search query. According to the company’s Web site, Google’s spell-check software checks your entry to ensure that you’re “using the most common version of a word’s spelling. If it calculates that you’re likely to generate more relevant search results with an alternative spelling, it will ask ‘Did you mean: [more common spelling]?’” For Google-ites responsible for maintaining and improving the search engine, the goal isn’t to provide a spell-checking service, although people often use it that way. From a strictly purpose-of-code perspective, Google’s spell-check is designed to help you travel as seamlessly as possible through the galaxy of digital information to the site that best matches what your brain is looking for, not what your fingertips might say your brain is looking for. Spelling correction is only a means to that end.
The whole thing works much like a conventional spell-checker, at first. Queries are compared against a giant list of known words, and that list is constantly updated, as new words, pharmaceutical names, celebrities, song lyrics, technical terms, comic-book characters, and advertisements continue piling onto the Web. Google software uses this list to determine whether you might benefit from changing the spelling within your query. Type seperate, and even though the search (for me, today) brings up 34.4 million results in 0.14 seconds, Google’s results page asks if I meant separate, which delivers 280 million search results.
But it’s not just separate’s presence on a word list or a larger tally of results that tells Google’s computers to tell me to swap e for a. The algorithms go deeper than that, and this is where Google search departs, in operation and philosophy, from traditional spell-checkers. It’s not about canonical forms. When conducting a search, Google’s algorithms don’t care about spelling; they care about accurately reflecting what’s out there on the Web. Commonly accepted spellings will usually lead a searcher to the desired information. But so too can alternatively spelled or misspelled words.
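To make the contrast with a dictionary-based checker concrete, here is a rough sketch of a suggestion rule driven purely by how often each spelling turns up, with no notion of a "correct" form. It is emphatically not Google’s algorithm, which the text says goes much deeper. The two counts are the result tallies quoted earlier for seperate and separate; the function name did_you_mean, the similarity cutoff of 0.8, and the factor-of-two threshold are all assumptions chosen for the example.

```python
import difflib

# Result tallies quoted in the text for the two spellings.
RESULT_COUNTS = {"seperate": 34_400_000, "separate": 280_000_000}


def did_you_mean(query, counts=RESULT_COUNTS, ratio=2.0):
    """Suggest a similar spelling only if it is far more common than the query."""
    own = counts.get(query, 0)
    # Find spellings that look like the query (assumed similarity cutoff).
    similar = difflib.get_close_matches(query, list(counts), n=5, cutoff=0.8)
    # Keep only alternatives that are at least `ratio` times as common.
    better = [w for w in similar
              if w != query and counts[w] >= ratio * max(own, 1)]
    return max(better, key=counts.get) if better else None


print(did_you_mean("seperate"))  # separate
print(did_you_mean("separate"))  # None
```

Nothing in this rule knows which spelling a dictionary would bless. If the Web ever used seperate far more often than separate, the same code would start recommending it.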