by Steven Levy
As always, Page was disappointed at the phenomenon of intelligent people rejecting ambitious schemes on the flimsy grounds of impossibility. He understood that skeptics were motivated by fear and inertia, but he still found such behavior unforgivable. He knew that digital technologies had changed the physics of the possible. Given that current technology would soon be cheaper, more and more powerful, and able to handle vast amounts of data, it was a matter of logic to see that a project to digitize and search through the world’s books was doable. It might be expensive, but it was silly to call it impossible. And it might not be expensive at all.
Page tried to calculate whether such an enterprise could be addressed with a trillion dollars, a billion dollars, or merely millions of dollars. When he finished his calculation—how many books, how much it would cost to scan all of them, how much storage the digital files would require—he became convinced that the costs were reasonable. But even his virtual spreadsheets didn’t dissolve the skepticism of those with whom he shared his scheme. “I’d run through the numbers with people and they wouldn’t believe them, and they’d say, ‘That really won’t work,’” he later said. “So eventually I just did it. I did the work. You can’t argue with facts. You’re not entitled to your own facts.”
It would have been great, he later thought, to begin the project in 1999. But Google’s early funds were committed to building infrastructure and hiring engineers; the opportunity costs were too high to digitize the world’s books. Still, Page didn’t let go of the idea. In 2002, after AdWords had helped resolve Google’s profit problems, he thought it was time to act.
At the time Google was working on a doomed project called Google Catalogs, where Google scanned actual dead-tree product catalogs to help users find products. There were scanners around the office. Talking to Marissa Mayer one night, Page wondered whether it would make sense to use similar scanners for books. Maybe Google should buy a copy of every book in the world, remove the pages, scan them, and then maybe rebind them and sell them to recover the costs. He had Mayer look into the idea, and she quickly found that rebinding would be too costly. The better idea was “nondestructive scanning.” It would require more care when handling the books, but it seemed more economical. For one thing, the books could be sold afterward. Or they could simply be borrowed in the first place. “We came up with all these numbers,” says Mayer. “We were emailing them around, the right cost per hour, the right number of pages per hour—debate, debate, debate. After one thread hinged on how many pages an hour we could do, we decided we should just scan one.”
They set up a makeshift book scanning device. They tried several sizes of books, the first one, appropriately enough, being The Google Book, an illustrated children’s story by V. C. Vickers. (The “Google” in the title was an odd creature with aspects of mammal, reptile, and fish.) They then tested a photo book, Ancient Forests by David Middleton; a dense text, Algorithms in C by Robert Sedgewick; and a general-interest book, Startup, by Jerry Kaplan. Marissa would turn the page, and Larry would click the shutter of a digital camera.
Neither was aware of it, but the final couplet of the first book Google ever scanned, written as a lark by Bank of England governor Vincent Cartwright Vickers (1879–1939) almost a century earlier, would turn out to be painfully ironic.
The sun is setting—
Can’t you hear
A something in the distance Howl!!?
I wonder if it’s—
Yes!! It is
That horrid Google
On the prowl!!!
The first few times around were kind of sloppy, because Marissa’s thumb kept getting in the way. Larry would say, “Don’t go too fast … don’t go too slow.” It had to be a rate that someone could maintain for a long time—this was going to scale, remember, to every book ever written. They finally used a metronome to synchronize their actions. After some practice, they found that they could capture a 300-page book such as Startup in about forty-two minutes, faster than they expected. Then they ran optical character recognition (OCR) software on the images and began searching inside the book. Page would open the book to a random page and say, “This word—can you find it?” Mayer would do a search to see if she could. It worked. Presumably, a dedicated machine could work faster, and that would make it possible to capture millions of books. How many books were ever printed? Around 30 million? Even if the cost was $10 a book, the price tag would only be $300 million. That didn’t sound like too much money for the world’s most valuable font of knowledge.
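The back-of-the-envelope arithmetic in that experiment can be sketched in a few lines of Python. The page count, the forty-two minutes, the 30 million books, and the $10 per book come from the passage; the 2,000-hour work year and the idea that every book resembles the 300-page test book are assumptions added purely for illustration.

```python
# Figures quoted in the passage; everything else is an assumption for illustration.
TEST_BOOK_PAGES = 300          # the 300-page test book
TEST_BOOK_MINUTES = 42         # time the manual scan took
ESTIMATED_BOOKS = 30_000_000   # rough guess at books ever printed
COST_PER_BOOK_USD = 10         # the "even if" cost per book

# Assumption (not from the text): one full-time operator works 2,000 hours a year.
HOURS_PER_WORK_YEAR = 2_000


def pages_per_hour(pages: int, minutes: float) -> float:
    """Scanning rate implied by one timed book."""
    return pages / minutes * 60


def total_cost_usd(books: int, cost_per_book: float) -> float:
    """Price tag for scanning every book at a flat per-book cost."""
    return books * cost_per_book


def operator_years(books: int, minutes_per_book: float) -> float:
    """Work-years needed at the manual rate, treating every book like the test book."""
    return books * minutes_per_book / 60 / HOURS_PER_WORK_YEAR


rate = pages_per_hour(TEST_BOOK_PAGES, TEST_BOOK_MINUTES)
cost = total_cost_usd(ESTIMATED_BOOKS, COST_PER_BOOK_USD)
years = operator_years(ESTIMATED_BOOKS, TEST_BOOK_MINUTES)

print(f"{rate:.0f} pages per hour")
print(f"${cost:,.0f} total")
print(f"{years:,.0f} operator-years at the manual rate")
```

Under those assumptions the manual rate comes to roughly 429 pages an hour, the price tag to $300 million, and the hand-scanning effort to about 10,500 operator-years, which suggests why a faster, dedicated machine mattered.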
Besides, this wasn’t a project to pursue simply because of return on investment. Just as Google had changed the world by making the most obscure items on the web spring up instantly for those who needed them, it could do the same with books. A user could instantly access a unique fact, a one-of-a-kind insight, or a breathtaking passage otherwise buried in the stacks of some dusty book in a distant library. Research tasks that had formerly taken months could be completed between breakfast and lunch. Scanning the world’s books would create a new era in the history of information. Who could object to such a noble mission?
Page determined that Google would do it, get every book ever written in its search engine. Brin was all for it. Eric Schmidt needed to hear more. “Eric wasn’t skeptical but listening, trying to make sense,” says Megan Smith, the biz-dev person who became involved in the project. “If something passed his directional sniff test, if there was a business reason behind an idea, he was open to things.” In this case, Schmidt became convinced that capturing books in Google’s search index would allow Google to deliver important information that was currently lacking—and that eventually the investment would be recovered by increased traffic and more clicking on ads. He was also blown away when Page told him that he’d figured out the whole thing when he was at Stanford. “What does that tell you?” Schmidt would say to a reporter in 2005. “Genius? I think so.”
The project was dubbed Ocean, to reflect the vast informational sea they would be exploring. Marissa Mayer called it “our moon shot.”
Instead of buying current scanners, Google determined that for its monster task it needed one that was superior to current designs. So it commissioned some of its best wizards to build a machine that, presumably, would work much more accurately and at a somewhat brisker rate than Marissa Mayer turning pages one by one. Though Google wasn’t known for actually building machines, its data center needs had generated a lot of engineering expertise in that area: remember, it was the world’s biggest manufacturer of computer servers.
One of the difficulties in book scanning rested in producing high-quality images from the printed page, so that OCR software could accurately translate the shapes of the letters on the page to computer-readable text. The problem was that, on their own, books did not sit flat on the platform: they presented a 3-D problem requiring a 2-D solution. The usual workarounds—flattening the book by pressing it on the glass or removing the binding—would not work since they were time-consuming and damaged the books. If its patents are any indication, Google’s engineers invented a system that could process the 3-D images. Its system involved two special cameras with multiple stereographic lenses, each capturing the image of a page on its opposite end, and a third, infrared camera hovering above the page. By the combination of these cameras, Google’s scanners could capture a three-dimensional picture of an open book. Using sophisticated algorithms that detected their own versions of signals in Google’s search-ranking algorithms, the software would determine the “groove” in the book that delineated its spine, and thus could separate the images on the facing pages and render them as if they were flat.
Google found that the state of robotics did not allow for a speedy process by which a machine could turn the pages itself without shredding them. So despite the fact that hiring a wave of human laborers did not conform to Google’s scaling philosophy, humans it was. Every so often, one literally would see the fingerprints of the Google worker in charge of the task on the scans.
To test the machines, Google needed lots of books of all kinds, different sizes and shapes, so it sent a biz-dev person to a used-book conference in Arizona with a budget to buy as many books as she could. She’d talk to people selling in bulk, negotiate a discount, and buy their whole collection, having them deliver the goods to a semitruck she’d rented. When the truck was filled, the driver drove it to Mountain View and discharged his cargo into the top secret scanning facility.
Another team worked on the user interface of the books product. Google’s search quality experts figured out which data could be used to determine relevance in book search, including metadata, information not included in the content of the book itself, such as facts about the book. Google used reference works and databases to determine facts. Had the book been a best seller? How recently was it published? How often was it cited by other works? Other signals could come from the web. Were people on the web talking about it? Was the author famous? Was the book mentioned on prominent websites about its subject matter? You could tell a lot about the book’s importance by seeing how often a book was referred to by other sources and then determining the importance of those sources.
Eventually Google decided to treat every sheet of every book as a separate document, adding signals such as font size, page density, and relevance to the linked table of contents and index. “It’s just like web ranking,” says Frances Haugen, who worked on a later version of the Book Search interface. “But we haven’t found the silver bullet; we haven’t found a PageRank for books.”
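To make the idea of per-page ranking concrete, here is a minimal sketch in Python. It is not Google's method: the `PageSignals` fields, the weights, and the scoring formula are all hypothetical, chosen only to show how page-level signals (term matches, a hit in the index) and book-level signals (citations) might be folded into one score per page.

```python
import math
from dataclasses import dataclass


@dataclass
class PageSignals:
    page_id: str
    term_matches: int        # times the query term appears on this page
    listed_in_index: bool    # term also appears in the table of contents or index
    book_citations: int      # how often other works cite the book


# Hypothetical weights, chosen only to make the example concrete.
W_MATCH, W_INDEX, W_CITES = 1.0, 2.0, 0.5


def score(p: PageSignals) -> float:
    """Combine page-level and book-level signals into one relevance score."""
    return (W_MATCH * p.term_matches
            + W_INDEX * (1.0 if p.listed_in_index else 0.0)
            + W_CITES * math.log1p(p.book_citations))


def rank(pages: list[PageSignals]) -> list[str]:
    """Return page ids, most relevant first."""
    return [p.page_id for p in sorted(pages, key=score, reverse=True)]


pages = [
    PageSignals("A", term_matches=3, listed_in_index=False, book_citations=0),
    PageSignals("B", term_matches=2, listed_in_index=True, book_citations=100),
    PageSignals("C", term_matches=0, listed_in_index=False, book_citations=5),
]
ranking = rank(pages)
print(ranking)
```

In this toy setup, page B outranks page A despite fewer raw matches, because the index hit and the heavily cited book lift its score. Damping citations with a logarithm is one plausible design choice among many, so that a famous book does not swamp what is actually on the page.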
While Google was tackling the mechanical and digital part of the process, its leaders were plotting a means of procuring the actual books. Of the estimated 33 million books that had been published, Google wanted all of them. (Later, using a more relaxed definition of what a book was, the company estimated that there were 129,864,880 different books in the world in all languages, as of August 2010.) Page, Brin, Schmidt, and David Drummond were talking about Book Search one day in the Googleplex, and they determined that the richest source would be the Library of Congress. They promptly asked their adviser Al Gore to contact the director of the library, James Billington.
Within days, Brin, Page, and Drummond were on a red-eye to Washington, D.C., to make a morning meeting with Billington. Drummond had been saying how important it was to appear presentable and had somewhat of a comeuppance when United Airlines misplaced his luggage. He had to wait until Nordstrom in Pentagon City opened to buy a suit. “They got me in and out in twenty minutes,” he says. Brin, whose sport jacket had survived the flight, bought a tie in the hotel gift shop. Page went without a jacket. Along with Gore, the trio met with Billington and his associates and proposed to scan the entire Library of Congress or whatever the library would let them scan, for free. Billington mentioned the usual procedures for procurement, but Page noted that the government wouldn’t be procuring anything, since Google would be giving its services away, even moving its own scanners in to do the job. Billington said okay.
But he spoke too soon. Part of the Library of Congress’s operation was the Copyright Office, and its head, Marybeth Peters, saw red flags. “She wasn’t quite as sure on the copyright issues,” says Drummond, “so they wound up not moving forward aggressively.” (Google eventually scanned only a small portion of the library’s holdings.)
Google turned instead to university and public libraries. The first one it approached was the University of Michigan, Larry Page’s alma mater. During a fall visit, Page sat next to the university president, Mary Sue Coleman, at a football game. He told her that Google would like to digitize all 7 million volumes in the university’s libraries.
Michigan had already begun digitizing some of its work. “It was a project that our librarians predicted would take one thousand years,” Coleman later said in a speech. “Larry said that Google would do it in six.” It was an attractive proposition to Michigan; Google would assume the entire cost, and Michigan would get a copy of the digital archive. From Michigan’s point of view, it was a step that had to be taken, because the future of books was online. “Twenty years from now, interaction with a physical book will be rare,” says the university’s associate librarian, John Wilkin. “Most of that interaction will be in the study of books as artifacts.”
The team began working with Michigan’s library staff—and Michigan’s lawyers. Now that the project was proceeding, Google had to grapple with the fact that the majority of books were protected under copyright from unauthorized scanning and distribution. Page was envisioning a use that no one in the Gutenberg age, or the founding fathers, who specified a copyright regime in the Constitution, had anticipated. What Google was doing felt as though it was respectful to the rights of authors and publishers—it allowed users the ability to search just as they could in a library. The only difference was that Google was granting users unprecedented powers to do so.
The lead lawyer at Google on this issue was Alex Macgillivray, known to Googlers as AMac. His background included trade-secret defense work for Wilson Sonsini Goodrich & Rosati, representing law firm clients like Napster. “Google’s leadership doesn’t care terribly much about precedent or law,” he says. “They’re trying to get a product launched, in this case trying to make books easier to find.” When charting Google’s copyright standing for Ocean, Macgillivray did a quasi-mathematical plotting of the various interests. He drew up a graph of user benefits and legal risks. “There are places along the edge of the graph which as a lawyer I would prefer not to be, but I’m fine anywhere in the middle,” he later said. “I just didn’t want to be suboptimal.”
In this case, Google was at the edge of the graph. It felt strongly that the very act of scanning and copying the books lent to it by the libraries was protected under the fair use provisions of the law. But a strict reading of the law didn’t bear out that interpretation. “The basic question was whether you can scan and index stuff without a rights holder’s permission,” Macgillivray says. “The entire operation was based on our argument of fair use.” The other question was whether Google had the rights to show short excerpts from the work (as it does with web pages in search) called snippets, but “the snippets are gravy,” AMac would say. In Google’s view, there was no reason for Book Search to be treated differently from web search.
Macgillivray held a couple of important precedents in his back pocket. The most important was a suit filed by the Bill Graham Archives—the holder of intellectual property of the company owned by the late rock promoter—in an attempt to stop a book about the Grateful Dead called What a Long Strange Trip It’s Been. The book featured a timeline of the famed rock band, illustrated at various milestones by thumbnail images of concert tickets and posters. The images weren’t being used for their original purpose, so it wasn’t like a poster hung on a dorm room wall or a concert ticket sold as an entry pass or even a souvenir. The legal term for this was a transformative use—you were using material as a basis to create something new. To Macgillivray, the suit involved the exact question that Google might be sued on: could an unauthorized reproduction of copyrighted material be made for a transformative use? The publisher won in district court and prevailed on appeal. Macgillivray kept a copy of the judge’s decision in his office.
The University of Michigan agreed with Google’s views on copyright. But the other partners Google began talking to weren’t so comfortable. In order to get a book into its index, Google made a digital copy of it, and most legal minds interpreted that action as infringement. “Harvard didn’t want to do in copyright, they only wanted to do the public domain,” says Drummond. (Public domain books are those published before 1923, whose copyright has expired.) “The New York Public Library was the same thing.” Oxford University presented its own problem. Drummond had a great time when he went there to negotiate the deal: the head librarian gave him a grand tour of the Bodleian Library and treated Drummond and the Googlers accompanying him to a rare trip to the roof, where all of Oxford lay in front of them. But the deal they struck was limited to books out of copyright, that is, in the public domain.
Google began its scanning in near-total stealth. There was a cloak-and-dagger element to the procedure, soured by a clandestine taint, like ducking out of a 1950s nightclub to smoke weed. Google would rent space in a town near a library. Several times a week, university library employees would gather and pack the hundreds of books to be scanned in the next few days. Google employees would load them into trucks, whisk them away, and return them unharmed a few days later. There were hundreds of such employees, a shadow workforce spending its days moving books onto and off the scanning platens.
Maybe the care that Google took to hide its activity was an early indicator of trouble to come. If the world would so eagerly welcome the fruits of Ocean, what was the need for such stealth? The secrecy was yet another expression of the paradox of a company that sometimes embraced transparency and other times seemed to model itself on the NSA. In other areas, Google had put its investments into the public domain, like the open-source Android and Chrome operating systems. And as far as user information was concerned, Google made it easy for people not to become locked into using its products. It even had an initiative called the Data Liberation Front to make sure that users could easily move information they created with Google documents off Google’s servers.