by Steven Levy
While working on its big revisions like Universal Search, Google kept trying to improve its search in general. Dozens of engineers plugged away at failed queries, trying to determine if, as with the case of Audrey Fino, they pointed to deeper algorithmic shortcomings.
The wrong way to fix things was to patch the algorithm to address a specific failed query. That was an approach that didn’t scale; it clashed with the idea that Google’s giant search algorithm could find the most relevant material by its own logic alone. A legendary story at Google illustrated this principle. Around 2002, a team was testing a subset of search limited to products, called Froogle. But one problem was so glaring that the team wasn’t comfortable releasing Froogle: when the query “running shoes” was typed in, the top result was a garden gnome sculpture that happened to be wearing sneakers. Every day engineers would try to tweak the algorithm so that it would be able to distinguish between lawn art and footwear, but the gnome kept its top position. One day, seemingly miraculously, the gnome disappeared from the results. At a meeting, no one on the team claimed credit. Then an engineer arrived late, holding an elf with running shoes. He had bought the one-of-a-kind product from the vendor, and since it was no longer for sale, it was no longer in the index. “The algorithm was now returning the right results,” says a Google engineer. “We didn’t cheat, we didn’t change anything, and we launched.”
Over the years, Google evolved a set process for search engine tweaks. After an engineer identified a flaw, he or she would be assigned a “search analyst” to manage the next several weeks, during which the improvement would be implemented. The engineer would determine the problem and recode the relevant part of the search algorithm. Maybe it would require adjusting the importance of a signal. Or perhaps altering the interpretation of multiword “bigrams.” Or even integrating a new signal. Then the analyst would submit it to testing.
Part of that testing involves hundreds of people around the world who sit at their home computers and judge results for various queries, marking whether the new tweaks return better or worse results than the previous versions. “We cover over a hundred locales,” says engineering director Scott Huffman, who is in charge of the testing process. “We have Swiss-French evaluators and Swiss-German evaluators and so on.” But Google also employs a much bigger army of testers—its millions of users, virtually all of whom are unwitting lab rats for Google’s constant quality experiments.
The mainstay of this system was the “A/B test,” where a fraction of users—typically 1 percent—would be exposed to the suggested change. The results and the subsequent behavior of those users would be compared with those of the general population. Google gauged every alteration to its products that way, from the hue of its interface colors to the number of search results delivered on a page. There were so many changes to measure that Google discarded the traditional scientific nostrum that only one experiment should be conducted at a time, with all variables except the one tested being exactly the same in the control group and the experimental group. “We want to run so many experiments, we can’t afford to put you in any one group, or we’d run out of people,” says a search quality manager. “On most Google queries, you’re actually in multiple control or experimental groups simultaneously. Essentially all the queries are involved in some test.”
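One way to picture how a single user can sit in several experiments at once is to assign groups independently per experiment, by hashing the user together with the experiment name. The sketch below is only an illustration of that general technique, not a description of Google's actual system; the function names, the experiment names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def in_experiment(user_id: str, experiment: str, fraction: float = 0.01) -> bool:
    """Deterministically decide whether a user sees an experiment's change.

    Hashing the experiment name together with the user id makes each
    experiment's assignment independent of every other experiment's.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Use 48 bits so the division below is exact and the result lies in [0, 1).
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < fraction


def active_experiments(user_id: str, experiments: dict) -> list:
    """Return the names of all experiments whose change this user is exposed to."""
    return [
        name
        for name, fraction in experiments.items()
        if in_experiment(user_id, name, fraction)
    ]


if __name__ == "__main__":
    # Hypothetical experiments, each exposing 1 percent of users.
    running = {"link-color-hue": 0.01, "results-per-page": 0.01}
    exposed = [u for u in (f"user-{i}" for i in range(10000))
               if active_experiments(u, running)]
    print(len(exposed), "of 10000 users are in at least one experimental group")
```

Because the assignment depends only on the user and the experiment name, the same user lands in the same group on every query, and membership in one experiment says nothing about membership in another.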
In search tweaks, the culmination of the process would come in the weekly Search Quality Launch Meeting. In a typical session in 2009, fifty engineers, mostly in their twenties and early thirties, participated. One test query was “Terry Smith KS,” a search that appeared on a screen and had been launched from Springfield, Missouri. The baseline, or unaltered result, assumed that the user wants a link to a town called Smith, in Kansas. A tweaked version of the search included a link to a Terry Smith who lives in Kansas. That was considered a win by the engineers. On the other hand, when a tester in Sykesville, Maryland, tried the query “weather.com Philadelphia,” the new version gave a high ranking to a map showing the location of the long-defunct main office of Bell Telephone of Pennsylvania. That was strange and a big loss. This result spurred a vigorous discussion. Someone figured it out: probably, in some earlier period of technology when Bell Telephone was a sort of search engine, that office was the source of the dial-up phone service that told you the weather. Buried on the web somewhere was that factoid, and the alteration to the algorithm had somehow routed it out of its obscurity. In 2009, Google search engineers made more than six hundred changes to improve search quality.
It was no coincidence that the man who eventually headed Google’s research division was the coauthor of Artificial Intelligence: A Modern Approach, the standard textbook in the field. Peter Norvig had been in charge of the Computational Science Division at NASA’s facility in Ames, not far from Google. At the end of 2000, it was clear to Norvig that turmoil in the agency had put his programs in jeopardy, so he figured it was a good time to move. He had seen Larry Page speak some months before and sensed that Google’s obsession with data might present an opportunity for him. He sent an email to Page and got a quick reply—Norvig’s AI book had been assigned reading for one of Page’s courses. After arriving at Google, Norvig hired about a half-dozen people fairly quickly and put them to work on projects. He felt it would be ludicrous to have a separate division at Google that specialized in things like machine learning—instead, artificial intelligence should be spread everywhere in the company.
One of the things high on Google’s to-do list was translation, rendering the billions of words appearing online into the native language of any user in the world. By 2001, Google.com was already available in twenty-six languages. Page and Brin believed that artificial barriers such as language should not stand in the way of people’s access to information. Their thoughts were along the lines of the pioneer of machine translation, Warren Weaver, who said, “When I look at an article in Russian, I say, ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’” Google, in their minds, would decode every language on the planet.
There had been previous attempts at online translation, notably a service dubbed Babel Fish that first appeared in 1995. Google’s own project, begun in 2001, had at its core a translation system licensed from another company—basically the same system that Yahoo and other competitors used. But the system was often so inaccurate that it seemed as though the translated words had been selected by throwing darts at a dictionary. Sergey Brin highlighted the problems at a 2004 meeting when he provided Google’s translation of a South Korean email from an enthusiastic fan of the company’s search technology. It read, “The sliced raw fish shoes it wishes. Google green onion thing!”
By the time Brin expressed his frustration with the email, Google had already identified a hiring target who would lead the company’s translation efforts, in a manner that solidified the artificial intelligence focus that Norvig saw early on at Google. Franz Och had focused on machine translation while earning his doctorate in computer science from the RWTH Aachen University in his native Germany and was continuing his work at the University of Southern California. After he gave a talk at Google in 2003, the company made him an offer. Och’s biggest worry was that Google was primarily a search company and its interest in machine translation was merely a flirtation. A conversation with Larry Page dissolved those worries. Google, Page told him, was committed to organizing all the information in the world, and translation was a necessary component. Och wasn’t sure how far you could push the system—could you really build for twenty language pairs? (In other words, if your system had twenty languages, could it translate any of those to any other?) That would be unprecedented. Page assured him that Google intended to invest heavily. “I said okay,” says Och, who joined Google in April 2004. “Now we have 506 language pairs, so it turned out it was worthwhile.”
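The arithmetic behind "any of those to any other" can be checked quickly. The sketch below assumes that a language pair is directional (French to English counts separately from English to French), which is an interpretation rather than something the text states; under that assumption, 506 pairs corresponds to 23 languages.

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


if __name__ == "__main__":
    print(directed_pairs(20))  # every-to-every translation among twenty languages
    print(directed_pairs(23))  # the language count that yields 506 directed pairs
```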
Earlier efforts at machine translation usually began with human experts who knew both languages that would be involved in the transformation. They would incorporate the rules and structure of each language so they could break down the original input and know how to recast it in the second tongue. “That’s very time-consuming and very hard, because natural language is so complex and diverse and there are so many nuances to it,” says Och. But in the late 1980s some IBM computer scientists devised a new approach, called statistical machine translation, which Och embraced. “The basic idea is to learn from data,” he explains. “Provide the computer with large amounts of monolingual text, and the computer should figure out himself what those structures are.” The idea is to feed the computer massive amounts of data and let him (to adopt Och’s anthropomorphic pronoun) do the thinking. Essentially Google’s system created a “language model” for each tongue Och’s team examined. The next step was to work with texts in different languages that had already been translated and let the machines figure out the implicit algorithms that dictate how one language converts to another. “There are specific algorithms that learn how words and sentences correspond, that detect nuances in text and produce translation. The key thing is that the more data you have, the better the quality of the system,” says Och.
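To make "learn how words correspond" concrete, here is a toy sketch in the spirit of the simplest of the IBM word-alignment models (usually called IBM Model 1), trained with expectation-maximization on three tiny sentence pairs. It is an illustration only, not Google's system: the corpus, the function names, and the omission of refinements such as a NULL source word are all assumptions made for brevity.

```python
def train_word_translation(pairs, iterations=20):
    """Estimate t[(target_word, source_word)] from sentence pairs via EM.

    pairs: list of (source_tokens, target_tokens) that are translations
    of each other. No dictionary or grammar rules are supplied.
    """
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start with no knowledge: every co-occurring pairing is equally likely.
    t = {(tw, sw): uniform for src, tgt in pairs for tw in tgt for sw in src}

    for _ in range(iterations):
        count = {}  # expected number of times sw translates to tw
        total = {}  # expected number of times sw translates to anything
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] = count.get((tw, sw), 0.0) + share
                    total[sw] = total.get(sw, 0.0) + share
        # Re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Most probable target word for a source word under the learned table."""
    options = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(options, key=options.get)


if __name__ == "__main__":
    corpus = [
        ("das Haus".split(), "the house".split()),
        ("das Buch".split(), "the book".split()),
        ("ein Buch".split(), "a book".split()),
    ]
    table = train_word_translation(corpus)
    for word in ["das", "Haus", "Buch", "ein"]:
        print(word, "->", best_translation(table, word))
```

The only signal is which words keep turning up together across translated sentences, which is why more parallel text tends to sharpen the estimates.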
The most important data were pairs of documents that were skillfully translated from one language to another. Before the Internet, the main source material for these translations had been corpuses such as UN documents that had been translated into multiple languages. But the web had produced an unbelievable treasure trove—and Google’s indexes made it easy for its engineers to mine billions of documents, unearthing even the most obscure efforts at translating one document or blog post from one language to another. Even an amateurish translation could provide some degree of knowledge, but Google’s algorithms could figure out which translations were the best by using the same principles that Google used to identify important websites. “At Google,” says Och, with dry understatement, “we have large amounts of data and the corresponding computation of resources we need to build very, very, very good systems.”
Och began with a small team that used the latter part of 2004 and early 2005 to build its systems and craft the algorithms. For the next few years, in fact, Google launched a minicrusade to sweep up the best minds in machine learning, essentially bolstering what was becoming an AI stronghold in the company. Och’s official role was as a scientist in Google’s research group, but it is indicative of Google’s view of research that no step was required to move beyond study into actual product implementation.
Because Och and his colleagues knew they would have access to an unprecedented amount of data, they worked from the ground up to create a new translation system. “One of the things we did was to build very, very, very large language models, much larger than anyone has ever built in the history of mankind.” Then they began to train the system. To measure progress, they used a statistical model that, given a series of words, would predict the word that came next. Each time they doubled the amount of training data, they got a .5 percent boost in the metrics that measured success in the results. “So we just doubled it a bunch of times.” In order to get a reasonable translation, Och would say, you might feed something like a billion words to the model. But Google didn’t stop at a billion.
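A model that "given a series of words, would predict the word that came next" can be illustrated in its simplest form by counting which word most often follows another. The sketch below is a bare bigram counter that looks back only one word; it is vastly smaller than anything Och describes, and the function names and sample sentences are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, how often each other word follows it."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            follows[previous][following] += 1
    return follows


def predict_next(model, word):
    """Return the most frequently observed next word, or None if unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


if __name__ == "__main__":
    # A made-up corpus; a real model would be trained on billions of words.
    corpus = [
        "the cat sat on the mat",
        "the cat ate the fish",
        "a dog sat on the rug",
    ]
    model = train_bigram_model(corpus)
    print(predict_next(model, "the"))
    print(predict_next(model, "sat"))
```

With so little text most words have been seen only once or twice, which hints at why feeding such a model more data keeps improving its predictions.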
By mid-2005, Google’s team was ready to participate in the annual machine translation contest sponsored by the National Institute of Standards and Technology. At the beginning of the event, each competing team was given a series of texts and then had a couple of days for its computers to do the translation while government computers ran evaluations and scored the results. For some reason, NIST didn’t characterize the contest as one in which a participant is crowned champion, so Och was careful not to declare Google the winner. Instead, he says, “Our scores were better than the scores of everyone else.” One of the language pairs it was tested on involved Arabic. “We didn’t have an Arabic speaker on the team but did the very best machine translation.”
By not requiring native speakers, Google was free to provide translations to the most obscure language pairs. “You can always translate French to English or English to Spanish, but where else can you translate Hindi to Danish or Finnish or Norwegian?”
A long-term problem in computer science had been speech recognition—the ability of computers to hear and understand natural language. Google applied Och’s techniques to teaching its vast clusters of computers how to make sense of the things humans said. It set up a telephone number, 1-800-GOOG-411, and offered a free version of what the phone companies used to call directory assistance. You would say the name and city of the business you wanted to call, and Google would give the result and ask if you wanted to be connected. But it was not a one-way exchange. In return for giving you the number, Google learned how people spoke, and since it could tell if its guess was successful, it had feedback that told it where it went wrong. Just as with its search engine, Google was letting its users teach it about the world.
“What convinced me to join Google was its ability to process large-scale information, particularly the feedback we get from users,” says Alfred Spector, who joined in 2008 to head Google’s research division. “That kind of machine learning has just not happened like it’s happened at Google.”
Over the years Google has evolved what it calls “a practical large scale machine learning system” that it has dubbed “Seti.” The name comes from the Search for Extraterrestrial Intelligence, which scans the universe for evidence of life outside Earth; Google’s system also works on the scale of the universe as it searches for signals in its mirror world. Google’s indexes almost absurdly dwarf the biggest data sets formerly used in machine learning experiments. The most ambitious machine learning effort in the UCI KDD Archive of Large Data Sets for Data Mining Research and Experimentation is a set of 4 million instances used to detect fraud and intrusion. Google’s Seti learning system uses data sets with a mean training set size of 100 billion instances.
Google’s researchers would acknowledge that working with a learning system of this size put them into uncharted territory. The steady improvement of its learning system flirted with the consequences postulated by scientist and philosopher Raymond Kurzweil, who speculated about an impending “singularity” that would come when a massive computer system evolves its way to intelligence. Larry Page was an enthusiastic follower of Kurzweil and a key supporter of Kurzweil-inspired Singularity University, an educational enterprise that anticipates a day when humans will pass the consciousness baton to our inorganic progeny.
What does it mean to say that Google “knows” something? Does Google’s Seti system tell us that in the search for nonhuman intelligence we should not look to the skies but to the million-plus servers in Google’s data centers?
“That’s a very deep question,” says Spector. “Humans, really, are big bags of mostly water walking around with a lot of tubes and some neurons and all. But we’re knowledgeable. So now look at the Google cluster computing system. It’s a set of many heuristics, so it knows ‘vehicle’ is a synonym for ‘automobile,’ and it knows that in French it’s voiture, and it knows it in German and every language. It knows these things. And it knows many more things that it’s learned from what people type.” He cited other things that Google knows: for example, Google had just introduced a new heuristic where it determined from your searches whether you might be contemplating suicide, in which case it would provide you with information on sources of aid. In this case, Google’s engine gleans predictive clues from its observations of human behavior. They are formulated in Google’s virtual brain just as neurons are formed in our own wetware. Spector promised that Google would learn much, much more in coming years.
“Do these things rise to the level of knowledge?” he asks rhetorically. “My ten-year-olds believe it. They think Google knows a lot. If you asked anyone in their grade school class, I think the kids would say yes.”
What did Spector, a scientist, think?
“I’m afraid that it’s not a question that is amenable to a scientific answer,” he says. “I do think, however, loosely speaking, Google is knowledgeable. The question is, will we build a general-purpose intelligence which just sits there, looks around, then develops all those skills unto itself, no matter what they are, whether it’s medical diagnosis or …” Spector pauses. “That’s a long way off,” he says. “That will probably not be done within my career at Google.” (Spector was fifty-five at the time of the conversation in early 2010.)
“I think Larry would very much like to see that happen,” he adds.
In fact, Page had been thinking about such things for some time. Back in 2004, I asked Page and Brin what they saw as the future of Google search. “It will be included in people’s brains,” said Page. “When you think about something and don’t really know much about it, you will automatically get information.”
“That’s true,” said Brin. “Ultimately I view Google as a way to augment your brain with the knowledge of the world. Right now you go into your computer and type a phrase, but you can imagine that it could be easier in the future, that you can have just devices you talk into, or you can have computers that pay attention to what’s going on around them and suggest useful information.”
“Somebody introduces themselves to you, and your watch goes to your web page,” said Page. “Or if you met this person two years ago, this is what they said to you.” Later in the conversation Page said, “Eventually you’ll have the implant, where if you think about a fact, it will just tell you the answer.”