While Google even then made use of machine learning mechanisms, most of Google’s operations, and certainly the areas I worked in, avoided “deep learning” in favor of raw number-crunching and dumb analysis. By “dumb” I mean that rather than trying to infer anything about the “meaning” or “structure” of the data, the general approach at Google was to see how much could be gotten out of what little explicit structure was there. In the early days, circa 1998 to 2005, Google relied on word frequency and on PageRank, which scored pages according to which other pages linked to them and how. Soon, however, search engine optimizers (SEOs) tried to game Google search in order to promote their pages as highly as they could. Google, in turn, modified its algorithms to try to reinstate fairness, in what was termed the “Google dance”—because the order of results for a query would reshuffle. But initially, Google’s approach was austere: maximize the quality of the search engine by sticking to the raw data—the link graph, word occurrence, co-occurrence, and order, and some amount of HTML page structure.*2 Google could not be certain of a page’s subject matter, the meaning of a piece of text, or the quality of that text; these are matters that even humans could disagree on. These were relevant questions, but they were to be addressed only after Google had squeezed as much as it could out of the “dumb” data, because they were much harder problems. As ever in software engineering, Google picked the low-hanging fruit first. As Google research director and AI maven Peter Norvig said, “Simple models and a lot of data trump more elaborate models based on less data.”
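To give a flavor of what that kind of raw link analysis looks like, here is a toy version of the power iteration behind PageRank, sketched in Python; the four-page graph, the damping factor of 0.85, and the fixed iteration count are textbook illustrations of mine, not anything taken from Google’s production system.

# A toy PageRank computation over a four-page link graph.
# The graph, damping factor, and iteration count are illustrative defaults,
# not Google's actual values.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}
damping = 0.85
pages = list(links)
rank = {page: 1.0 / len(pages) for page in pages}
for _ in range(50):  # repeat until the scores settle
    new_rank = {page: (1 - damping) / len(pages) for page in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank
# c.html comes out on top: every other page links to it.
print(sorted(rank.items(), key=lambda item: -item[1]))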
The goal of a search engine, and of information retrieval more generally, is to take a large amount of content and allow users to issue search queries (like “best camera” or “who is the president?”) that return the most relevant results from within that content. There are three main components to any search engine: crawling, indexing, and ranking. The crawler, or webspider, is responsible for obtaining the content in the first place, by systematically downloading every web page it can find—billions of them. The indexer then takes these pages and breaks down their content, creating keyword lists through which pages can be located for search terms, as well as any other information that might give an indication of a page’s value and relevance. It creates what is effectively a gigantic database of the web. Finally, the ranker uses the indexer’s database in order to determine the relevancy of a page relative to a particular search query, and is responsible for the order of the search results that a user sees.
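As a rough illustration of the indexing and ranking half of that division of labor, here is a minimal sketch in Python: it builds an inverted index mapping each word to the pages that contain it, then answers a query by counting matching words. The page contents and the crude counting rule are invented for the example; a real ranker weighs far more than word matches.

# A toy inverted index and ranker. The page contents and the scoring rule
# are invented for illustration.
from collections import defaultdict

pages = {
    "juice.html": "fresh orange juice recipes and orange facts",
    "camera.html": "best camera reviews for travel photography",
    "news.html": "the president spoke about trade and travel",
}

# Indexing: map each word to the set of pages that contain it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    # Ranking, crudely: score each page by how many query words it contains.
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("best camera"))  # ['camera.html'] in this toy corpus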
Google’s search engine comprised a pipeline of servers that worked rather like an assembly line to crawl and index pages. I worked on the crawl. Our mission was to collect all the pages of the web, extract the links from those pages, find all the new pages in those links that we hadn’t yet crawled, and then start over again as soon as possible with an updated set of pages. Since web pages frequently change, we recrawled sites as often as deemed necessary. These recrawls became increasingly frequent, and some webmasters wrote in to complain that Google was using up too much of their bandwidth. This was especially a problem for websites that had a lot of useless pages. Deciding which pages were useless—that is, which were of no interest to Google—became one of my main challenges.
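The crawl itself is, at heart, a breadth-first walk of the link graph: fetch a page, pull out its links, queue whatever you have not seen, and repeat. Here is a drastically simplified single-machine sketch in Python; the real crawler was a distributed system with robots.txt handling, politeness limits, and recrawl scheduling, none of which appear here.

# A toy crawl loop: fetch, extract links, enqueue the unseen.
# This ignores robots.txt, rate limits, error handling, and recrawling.
import re
import urllib.request
from collections import deque

def fetch_page(url):
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def crawl(seed_urls, max_pages=100):
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    crawled = {}
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled[url] = fetch_page(url)
        # Naive link extraction; a real parser handles relative URLs and more.
        for link in re.findall(r'href="(https?://[^"]+)"', crawled[url]):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled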
* * *
—
Deciding if something was “useful” or “useless” was about as far as Google got when it came to understanding the significance of the content it indexed. A computer does not grasp what “useful” means as a concept, but Google had several wonderfully unambiguous metrics to gauge the “usefulness” of a piece of content, based around two core concepts: popularity and uniqueness.
Who linked to a piece of content? The initial version of Google in 1997 analyzed which web pages linked to other pages, and how they did so. If a million pages linked to a page on orange juice with the anchor text “orange juice,” it was a reasonably good bet that that orange juice page was fairly useful to people who wanted to know something about orange juice. Whether that page was authoritative was another matter entirely, but Google could at least be confident that it was popular.
Who clicked on the content? Google tracked which results people clicked. Popular search results got boosted so that they became even more popular, while high-ranking search results that were never clicked on got demoted. This was a feedback mechanism: the system “fixed” itself in response to the reactions of its users. Users provided this feedback without even intending to; by choosing the link that looked the most useful, they helped Google learn how good its search results were. This assumed, of course, that users themselves knew what the best links were. Google functioned simultaneously as a mirror of people’s tastes and as an arbiter of them.
How similar was the content to the content on other web pages? The rise of electronic media made copying trivial: a piece of content could be duplicated perfectly and then shared however you wished. With the internet, sharing with others around the globe became as easy as copying. The music and film industries discovered this when Napster and BitTorrent arrived on the scene. More innocuously, however, the internet filled up with multiple copies of reference works and boilerplate. Entire database manuals, technical documentation, public domain works, and more were duplicated across multiple sites. The more a piece of content was replicated across the web, the less likely it was that any particular copy was useful. At best, one copy was useful, and it could be identified by seeing which copy was most popular among links. Often, however, such duplicated content was of little intrinsic value. It was cruft, boilerplate, and other flotsam and jetsam left over from the workings of various software packages. High-quality content tended to be reproduced less, not more.
What was most distinctive about the content? The least common words on a page generally were those that were most important. The word “person” would appear on billions of pages, but “mesothelioma” and “asbestos” appeared on far fewer, and on those pages “mesothelioma” and “asbestos” were far more relevant to the content than “person.” By identifying the rarer words, Google could pin down more of the context in which a page was useful.
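That last intuition, that rare words carry the most signal, has a standard name in information retrieval: inverse document frequency. Here is a minimal sketch in Python, over a three-page corpus I have made up for the purpose.

# Inverse document frequency: the fewer pages a word appears on, the more
# weight it carries. The three "pages" here are invented for illustration.
import math

docs = [
    "the person filed a claim",
    "mesothelioma is linked to asbestos exposure",
    "another person asked about the claim",
]

def idf(word):
    containing = sum(1 for doc in docs if word in doc.split())
    return math.log(len(docs) / containing) if containing else 0.0

for word in ("person", "asbestos", "mesothelioma"):
    print(word, round(idf(word), 2))
# "person" scores lowest; the rare words score highest, so they tell us
# the most about what a page is actually about.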
It was through these questions that Google discovered not only how to make the best search engine, but also how to make money from its search engine: through advertising. In 2000, Google found its golden goose when it released its AdWords service, in an internet monetizing coup that has not since been replicated, not even by Facebook. Unlike banner ads, which had low click-through rates, Google presented ads to users that were far more likely to be clicked. This owed to the unique position Google users were in: they were searching for a particular term—often because they were interested in buying something related to that term. If third-party advertisers could bid on showing ads for specific keywords, Google could show those ads at exactly the right time: namely, when users were searching for those keywords. And advertisers were willing to bid up some keywords quite high. One of the most reliably expensive keywords over the last fifteen years has been “mesothelioma,” bid on primarily by law firms looking for asbestos victims. Many law and insurance keywords go for over a hundred dollars per single click. In 2016, “best mesothelioma lawyer” cost you over $900 per click.
* * *
—
SEOs engaged in a never-ending effort to rig Google’s results by manipulating these factors in favor of their pages. Snagging a top placement in Google’s search results was valuable, and each of the four measures of usefulness above could be gamed:
Boosting links: SEOs set up their own sets of “link farms” to artificially increase the incoming links to a page they wanted to elevate in Google’s rankings.
Boosting clicks: SEOs set up bots to search for terms on which they wanted their pages to be found, then repeatedly clicked on their pages in the results.
Artificial differences: SEOs generated multiple copies of pages that were different enough not to be identifiable as related.
Keyword manipulation: SEOs stuffed pages with rare keywords so that their pages would show up in more search results than their actual content warranted.
As SEO grew, Google devoted as many resources to fighting back against the attempted manipulation of its results as it did to improving the quality of the results—since the two tasks were effectively the same. SEO efforts made up a tiny fraction of the total content of the web, but because of Google’s privileged place in the 2000s as the overwhelmingly dominant search engine, SEOs fixated on identifying Google’s weaknesses and abusing them, in order to elevate their content over unoptimized content. By 2010, the original ranking recipe that had allowed Google to be so successful had become untrustworthy. Google frequently altered its recipe to utilize a different mix of criteria.
People like to speak of Google’s ranking algorithm as its “secret sauce,” but this ever-shifting recipe is one that not even Google’s own engineers could write down. “Usefulness” had metastasized from a humanly comprehensible algorithm into an arcane calculation involving hundreds of individual factors per page. The relationship of Google’s measure of “usefulness” to our measure of “usefulness” remained more or less steady, but this continuity masked how much Google’s algorithm diverged from intuitive human thought processes. In the cloud, unlike in the world of the PC, there was no way to undo the complexity, as Microsoft had been able to with Windows 2000 (which finally fixed up the messy legacy left by MS-DOS and Windows 95).
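As a caricature only, that kind of many-factored calculation looks something like the sketch below; the signal names and weights are ones I have invented to show the shape of the thing, not the factors Google actually used.

# A caricature of a multi-signal ranking function: each page's score is a
# weighted sum of many individual factors. Signals and weights are invented;
# the real calculation combined hundreds of factors.
WEIGHTS = {
    "link_popularity": 0.40,
    "click_feedback": 0.25,
    "term_rarity_match": 0.20,
    "duplicate_penalty": -0.30,
    "spam_likelihood": -0.50,
}

def score(signals):
    return sum(WEIGHTS[name] * value for name, value in signals.items())

page_signals = {
    "link_popularity": 0.8,
    "click_feedback": 0.6,
    "term_rarity_match": 0.7,
    "duplicate_penalty": 0.1,
    "spam_likelihood": 0.05,
}
print(round(score(page_signals), 3))  # 0.555 with these made-up numbers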
Hangman
When it breaks, you build it again….Gotta fix it faster.
—Brian Edison in George Armitage’s Hot Rod
Here is an example of how useless content caused trouble for Google. Sometime around the turn of the century, someone wrote a notorious web implementation of the children’s game Hangman, in which each guess of a letter took the player to a new page with a unique URL that recorded all the letter guesses made so far, like this:
Player Guesses A: http://www.mysite.com/fun/hangman/a
Player Guesses E: http://www.mysite.com/fun/hangman/a/e
Player Guesses I: http://www.mysite.com/fun/hangman/a/e/i
Player Guesses O: http://www.mysite.com/fun/hangman/a/e/i/o
By the end of a game, if the word was COMPUTER, the player might reach this “page”:
http://www.mysite.com/fun/hangman/a/e/i/o/u/c/m/p/t/r
The entire alphabet could follow at the end of the URL, in any order. The game ended when the player ran out of guesses or completed the word. If there was a ten-guess maximum, there would be 26!/16! distinct URLs containing unique permutations of ten letters: more than nineteen trillion. By default, Google would try to crawl them all and fall into the quicksand of Hangman.
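The arithmetic behind that figure, as a quick check in Python:

# Ordered sequences of ten distinct letters drawn from twenty-six:
# 26!/16! = 26 * 25 * ... * 17
import math

print(f"{math.perm(26, 10):,}")  # 19,275,223,968,000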
Hangman is an example of just how imperfect and ultimately human the web is. It may sound like a mechanical exercise to grab all the pages on the web over and over, but the web was varied and irregular from the very beginning. Web pages were not manufactured in some central factory by cookie cutters. Even today there linger all sorts of shims and spandrels that a handful of web developers thought were a good (or expedient) idea at one time or another. Some Google could just ignore; others, like Hangman, we had to deal with. Data is very rarely as neat as we imagine.
Google initially handled Hangman in a very inelegant way. We hard-coded a check for “hangman” and a few variants thereof that would stop the crawler from crawling those trillions of generated pages. This sort of special-casing is frowned upon, since all it took was calling the page “hagman” instead of “hangman” in order to defeat the check. One of my later tasks was to automate detection of this sort of crawler black hole, but Hangman in particular was so unusually awful that Google’s engineers had made a special-case exception for it years before.
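One way such detection can be automated, sketched loosely here with a collapsing rule and a threshold that are my own hypothetical choices rather than the method Google used: reduce each URL to the shape of its path, and stop following a site’s links once any one shape has produced an implausible number of distinct URLs.

# A loose sketch of crawler-trap detection. Each URL path is collapsed to a
# "shape" (short or numeric segments become wildcards); if one shape yields
# implausibly many distinct URLs, stop crawling it. The collapsing rule and
# the threshold are hypothetical, not Google's actual method.
import re
from collections import defaultdict
from urllib.parse import urlparse

MAX_URLS_PER_SHAPE = 10_000
seen_per_shape = defaultdict(set)

def url_shape(url):
    parsed = urlparse(url)
    segments = [
        "*" if re.fullmatch(r"[a-z0-9]{1,2}|\d+", seg) else seg
        for seg in parsed.path.split("/") if seg
    ]
    return parsed.netloc + "/" + "/".join(segments)

def should_crawl(url):
    shape = url_shape(url)
    seen_per_shape[shape].add(url)
    return len(seen_per_shape[shape]) <= MAX_URLS_PER_SHAPE

# Every Hangman URL of a given depth, like /fun/hangman/a/e/i/o, collapses to
# the same shape (www.mysite.com/fun/hangman/*/*/*/*), so the trap quickly
# exhausts its budget and the crawler moves on.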
Even when you think you need to know nothing about the data you’re dealing with, the implications of its content can sneak in. To Google’s code, Hangman was more “interesting” than the average website or blog, simply by virtue of it being a special case. Yet these pathological cases were utterly marginal in a practical sense. Hangman signaled another division besides useful or useless. In a world where there is an overwhelming surfeit of data, more than we can process piece by piece, all data is either representative or pathological. Ordinary or bizarre. Standardized or broken. The data was either part of the machine, or a virulent entity that the machine couldn’t handle.
I enjoyed these sorts of problems. I was looking for generalized solutions to a multitude of particular pathological causes. The web was not perfect, because humans were not perfect, but Google’s code could be improved and tuned to cope with the imperfections of the web in better and better ways. This process can be frustrating; imagine the engineers who work on ranking, whose task is to keep results as relevant and high quality as possible. With the web constantly evolving and growing, maintaining the current standard of search performance becomes an exercise in running to stand still. Stop working on Google for a month or two, and the search engine would be far worse when you came back to it. The implicit challenge became one of trying to outrun the changes to the web—the search for algorithms that were sufficiently general and smart that they could handle whatever surprises the web might next throw at them. Google might need a new handler in order to understand content presented on web pages in the new HTML5 specification, but with luck and planning, the underlying indexing format would be robust enough to process the new data obtained from HTML5 without too much adjustment.*3 Or perhaps the ranking algorithms would be robust enough that some new SEO trick couldn’t foil them.
The dream, then, is to create algorithms of maximum generality while sacrificing as little specificity as possible. In reality, software engineers always compromise, blocking off a certain problem area and finding an expedient path that’s neither too ambitious nor too kludgy.*4 And that sort of heuristic optimization is not so far off from what we do every day in life and language. We lump things into human-comprehensible categories while trying to respect special cases, but we never reach perfect abstraction, nor perfect comprehensiveness. It was working at Google that helped me understand this.
The Library of Babylon
I have made a heap of all that I could find.
—NENNIUS, Historia Brittonum, as quoted by David Jones, The Anathemata
I wrote a blog while I was working at Google—a “litblog.” I wrote about deeply unpopular books and chronicled my reading of Proust. On my website stats, I saw Google inhale my blog pages every few days, and spit them out in search results. But from work, I could see my site from Google’s perspective. More accurately, I didn’t see it. There was nothing special about my site, nothing that caused it to stand out. My pages were a couple hundred out of billions. They were typical, standardly formatted pages about esoteric subjects, to be indexed and retrieved in response to keywords like “Proust” and “Krasznahorkai” and “modernism.” I was not tempted to rig Google to favor my pages. Even if I could have (which I couldn’t), there was very little to rig. My pages showed up when people searched for my subjects, and didn’t otherwise. The hot topics were locksmiths (lots of scammers) and mesothelioma (lots of ambulance chasers). Even if my blog had been artificially inflated in its rankings, that wouldn’t have generated a great deal more interest in it. Google Search could skew people’s attention, but not create it.
The god’s-eye view offered by Google was ultimately indifferent to what went on outside its server farms. Every individual thing was too small to matter. Large sites like CNN or the New York Times or Wikipedia needed special care and attention due to their size and popularity, but those sites were not any more or less interesting to Google, merely more time-consuming. Content that was not pathological like Hangman was mostly meaningful as a representative of some more general type of content: news, image, video, or otherwise undistinguishable text. To Google, I was one indistinguishable segment of a very, very long tail. Google assimilated everything, and none of it meant anything.
In the Total Perspective Vortex, from Douglas Adams’s The Hitchhiker’s Guide to the Galaxy radio series and novels, an unlucky victim sees the vast expanse of the universe as well as a tiny marker that says “You are here.” The effect drives people insane, save for the terminally narcissistic.*5 That was what it was like seeing my blog against the backdrop of the Google universe. I was data, sucked up, processed, categorized, and packaged up as needed. Google knew nothing of books or of me; it only knew the words I used and the combinations I used them in, and when to show them in search results. I was a drop in the ocean.
That constant reminder of my own insignificance stayed with me. I also had the sense of my soul being split; on the one hand, I was helping architect the machine that processed the world’s data. On the other, I was one inconsequential piece of that data.