Big Data: A Revolution That Will Transform How We Live, Work, and Think
As for the merchant family, it was the Medici, the famous Florentine traders and patrons of the arts. In the fifteenth century they became the most influential bankers in Europe, in no small part because they used a superior method of data recording, the double-entry system. Together, Pacioli’s textbook and the Medici’s success in applying it sealed the victory of double-entry bookkeeping—and by extension established the use of Arabic numerals in the West.
Parallel to advances in the recording of data, ways of measuring the world—denoting time, distance, area, volume, and weight—continued to gain ever increasing precision. The zeal to understand nature through quantification defined science in the nineteenth century, as scholars invented new tools and units to measure and record electric currents, air pressure, temperature, sound frequency, and the like. It was an era when absolutely everything had to be defined, demarcated, and denoted. The fascination went so far as measuring people’s skulls as a proxy for their mental ability. Fortunately the pseudo-science of phrenology has mostly withered away, but the desire to quantify has only intensified.
Measuring reality and recording data thrived because of a combination of the tools and a receptive mindset. That combination is the rich soil from which modern datafication has grown. The ingredients for datafying were in place, though in an analog world it was still costly and time-consuming. In many instances it required seemingly infinite patience, or at least a life-long dedication, like Tycho Brahe’s fastidious nightly observations of stars and planets in the 1500s. In the limited cases where datafication succeeded in the analog era, such as Commodore Maury’s navigational charts, it often did so because of a fortunate confluence of coincidences: Maury, for example, was confined to a desk job but with access to a treasure trove of logbooks. Yet whenever datafication did succeed, enormous value was created from the underlying information and tremendous insights were uncovered.
The arrival of computers brought digital measuring and storage devices that made datafying vastly more efficient. It also greatly enabled mathematical analysis of data to uncover its hidden value. In short, digitization turbocharges datafication. But it is not a substitute. The act of digitization—turning analog information into computer-readable format—by itself does not datafy.
When words become data
The difference between digitization and datafication becomes obvious when we look at a domain where both have happened and compare their consequences. Consider books. In 2004 Google announced an incredibly bold plan. It would take every page of every book it could get hold of and (to the extent possible under copyright laws) permit everyone around the world to search and access the books through the Internet for free. To achieve this feat the company teamed up with some of the world’s biggest and most prestigious academic libraries and developed scanning machines that could automatically turn pages, so that scanning millions of books was both feasible and financially viable.
First Google digitized text: every page was scanned and captured in a high-resolution digital image file that was stored on Google servers. The page had been transformed into a digital copy that could be easily retrieved by people everywhere through the Web. Retrieving it, however, would have required either knowing which book had the information one wanted, or doing a lot of reading to find the right bit. One could not have searched the text for particular words, or analyzed it, because the text hadn’t been datafied. All that Google had were images that only humans could transform into useful information—by reading.
While this would still have been a great tool—a modern, digital Library of Alexandria, more comprehensive than any library in history—Google wanted more. The company understood that information has stored value that can only be released once it is datafied. And so Google used optical character-recognition software that could take a digital image and recognize the letters, words, sentences, and paragraphs on it. The result was datafied text rather than a digitized picture of a page.
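To make the distinction concrete, here is a minimal sketch of that digitize-then-datafy step using the open-source Tesseract OCR engine (via the pytesseract wrapper) rather than Google's own software; the scan file name is hypothetical.

```python
# A minimal sketch of the digitize-then-datafy step described above,
# using the open-source Tesseract OCR engine via pytesseract.
# This is an illustration, not Google's internal pipeline.
from PIL import Image      # pip install pillow
import pytesseract         # pip install pytesseract (requires the Tesseract binary)

# "Digitized": the page exists only as pixels.
page_image = Image.open("scanned_page.png")   # hypothetical scan file

# "Datafied": OCR turns the pixels into machine-readable text
# that can be indexed, searched, and analyzed.
page_text = pytesseract.image_to_string(page_image, lang="eng")

# Once datafied, even a trivial query becomes possible.
print("mentions of 'correlation':", page_text.lower().count("correlation"))
```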
Now the information on the page was usable not just for human readers, but also for computers to process and algorithms to analyze. Datafication made text indexable and thus searchable. And it permitted an endless stream of textual analysis. We now can discover when certain words or phrases were used for the first time, or became popular, knowledge that sheds new light on the spread of ideas and the evolution of human thought across centuries and in many languages.
You can try it yourself. Google’s Ngram Viewer (http://books.google.com/ngrams) will generate a graph of the use of words or phrases over time, using the entire Google Books index as a data source. Within seconds we discover that until 1900 the term “causality” was more frequently used than “correlation,” but then the ratio reversed. We can compare writing styles and gain insights into authorship disputes. Datafication also makes plagiarism in academic works much easier to discover; as a result, a number of European politicians, including a German defense minister, have been forced to resign.
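For readers who want a feel for what such counting involves, the toy sketch below tallies the yearly frequency of "causality" and "correlation" over a tiny invented corpus; the real Ngram Viewer of course works over the full Google Books index.

```python
# Toy version of an n-gram frequency count over time, in the spirit of the
# Ngram Viewer. The corpus here is a small invented stand-in for the
# millions of datafied books Google actually uses.
from collections import Counter, defaultdict

corpus = [  # (publication year, datafied text) -- invented examples
    (1890, "the causality of events was debated at length"),
    (1895, "causality and causality again, with a hint of correlation"),
    (1950, "the correlation between the series is strong"),
    (1955, "correlation does not imply causality"),
]

counts_by_year = defaultdict(Counter)
for year, text in corpus:
    counts_by_year[year].update(text.lower().split())

for term in ("causality", "correlation"):
    series = {year: counts_by_year[year][term] for year in sorted(counts_by_year)}
    print(term, series)
```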
An estimated 130 million unique books have been published since the invention of the printing press in the mid-fifteenth century. By 2012, seven years after Google began its book project, it had scanned over 20 million titles, more than 15 percent of the world’s written heritage—a substantial chunk. This has sparked a new academic discipline called “Culturomics”: computational lexicology that tries to understand human behavior and cultural trends through the quantitative analysis of texts.
In one study, researchers at Harvard pored over millions of books (which equated to more than 500 billion words) to reveal that fewer than half of the English words that appear in books are included in dictionaries. Rather, they wrote, the cornucopia of words “consists of lexical ‘dark matter’ undocumented in standard references.” Moreover, by algorithmically analyzing references to the artist Marc Chagall, whose works were banned in Nazi Germany because he was Jewish, the researchers showed that the suppression or censorship of an idea or person leaves “quantifiable fingerprints.” Words are like fossils encased within pages instead of sedimentary rock. The practitioners of culturomics can mine them like archeologists. Of course the dataset entails a zillion implicit biases—are library books a true reflection of the real world or simply one that authors and librarians hold dear? Nevertheless culturomics has given us an entirely new lens with which to understand ourselves.
Transforming words into data unleashes numerous uses. Yes, the data can be used by humans for reading and by machines for analysis. But as the paragon of a big-data company, Google knows that information has multiple potential purposes that can justify its collection and datafication. So Google cleverly used the datafied text from its book-scanning project to improve its machine-translation service. As explained in Chapter Three, the system would take books that are translations and analyze what words and phrases the translators used as alternatives from one language to another. Knowing this, it could then treat translation as a giant math problem, with the computer figuring out probabilities to determine what word best substitutes for another between languages.
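The sketch below illustrates the counting idea in miniature: from a handful of invented sentence-aligned pairs it estimates crude word-to-word translation probabilities. Google's actual systems are far more sophisticated, but the spirit, probabilities estimated from parallel text, is the same.

```python
# A toy sketch of treating translation as a counting problem: from
# sentence-aligned pairs, estimate how often each English word co-occurs
# with each French word and turn the counts into rough probabilities.
# The sentence pairs are invented; real systems (e.g., IBM alignment models
# or today's neural translators) are far more sophisticated.
from collections import defaultdict

aligned_pairs = [          # (English sentence, French sentence)
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

cooccur = defaultdict(lambda: defaultdict(int))
for en, fr in aligned_pairs:
    for e in en.split():
        for f in fr.split():
            cooccur[e][f] += 1

# Normalize the raw co-occurrence counts into crude translation probabilities.
# A real model would refine these iteratively (e.g., with the EM algorithm)
# to separate frequent function words like "la" from content words.
for e, targets in sorted(cooccur.items()):
    total = sum(targets.values())
    ranked = sorted(targets.items(), key=lambda kv: kv[1], reverse=True)
    print(e, [(f, round(count / total, 2)) for f, count in ranked])
```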
Of course Google was not the only organization that dreamed of bringing the richness of the world’s written heritage into the computer age, and it was hardly the first to try. Project Gutenberg, a volunteer initiative to place public domain works online as early as 1971, was all about making the texts available for people to read, but it didn’t consider ancillary uses of treating the words as data. It was about reading, not reusing. Likewise, publishers for years have experimented with electronic versions of books. They too saw the core value of books as content, not as data—their business model is based on this. Thus they never used or permitted others to use the data inherent in a book’s text. They never saw the need, or appreciated the potential.
Many companies are now vying to crack the e-book market. Amazon, with its Kindle e-book readers, seems to have a big early lead. But this is an area where Amazon’s and Google’s strategies differ greatly.
Amazon, too, has datafied books—but unlike Google, it has failed to exploit possible new uses of the text as data. Jeff Bezos, the company’s founder and chief executive, convinced hundreds of publishers to release their books in the Kindle format. Kindle books are not made up of page images. If they were, one wouldn’t be able to change the font size or display a page on color as well as black-and-white screens. The text is datafied, not just digital. Indeed, Amazon has done for millions of new books what Google is painstakingly trying to achieve for many older ones.
However, other than Amazon’s brilliant service of “statistically improbable phrases”—which uses algorithms to find links among the topics of books that might not otherwise be apparent—the online retailer has not used its wealth of words for big-data analysis. It sees its book business as based on the content that humans read, rather than on analysis of datafied text. And in fairness, it probably faces restrictions from conservative publishers over how it may use the information contained in their books. Google, as the big-data bad boy willing to push the limits, feels no such constraints: its bread is buttered by users’ clicks, not by access to publishers’ titles. Perhaps it is not unjust to say that, at least for now, Amazon understands the value of digitizing content, while Google understands the value of datafying it.
When location becomes data
One of the most basic pieces of information in the world is, well, the world. But for most of history spatial area was never quantified or used in data form. The geo-location of nature, objects, and people of course constitutes information. The mountain is there; the person is here. But to be most useful, that information needs to be turned into data. To datafy location requires a few prerequisites. We need a method to measure every square inch of area on Earth. We need a standardized way to note the measurements. We need an instrument to monitor and record the data. Quantification, standardization, collection. Only then can we store and analyze location not as place per se, but as data.
In the West, quantification of location began with the Greeks. Around 200 B.C. Eratosthenes invented a system of grid lines to demarcate location, akin to latitude and longitude. But like so many good ideas from antiquity, the practice faded away over time. A millennium and a half later, around 1400 A.D., a copy of Ptolemy’s Geographia arrived in Florence from Constantinople just as the Renaissance and the shipping trade were igniting interest in science and in know-how from the ancients. Ptolemy’s treatise was a sensation, and his old lessons were applied to solve modern navigation challenges. From then on, maps appeared with longitude, latitude, and scale. The system was later improved upon by the Flemish cartographer Gerardus Mercator in 1570, enabling sailors to plot a straight course in a spherical world.
Although by this time there was a means to record location, there was no generally accepted format for sharing that information. A common identification system was needed, just as the Internet benefited from domain names to make things like email work universally. The standardization of longitude and latitude took a long time. It was finally enshrined in 1884 at the International Meridian Conference in Washington, D.C., where 25 nations chose Greenwich, England, as the prime meridian and zero-point of longitude (with the French, who considered themselves the leaders in international standards, abstaining). In the 1940s the Universal Transverse Mercator (UTM) coordinate system was created, which broke the world into 60 zones to increase accuracy.
Geospatial location could now be identified, recorded, tallied, analyzed, and communicated in a standardized, numerical format. Position could be datafied. But because of the high cost of measuring and recording the information in an analog setting, it rarely was. For datafication to happen, tools to measure location cheaply had to be invented. Until the 1970s the only way to determine physical location was by using landmarks, astronomical constellations, dead reckoning, or limited radio-position technology.
A great change occurred in 1978, when the first of the 24 satellites that make up the Global Positioning System (GPS) was launched. Receivers on the ground can triangulate their position by noting the differences in time it takes to receive a signal from the satellites 12,600 miles overhead. Developed by the U.S. Department of Defense, the system was first opened to non-military uses in the 1980s and became fully operational by the 1990s. Its precision was enhanced for commercial applications a decade later. Accurate to one meter, GPS marked the moment when a method to measure location, the dream of navigators, mapmakers, and mathematicians since antiquity, was finally fused with the technical means to achieve it quickly, (relatively) cheaply, and without requiring any specialized knowledge.
Yet the information must actually be generated. There was nothing to prevent Eratosthenes and Mercator from estimating their whereabouts every minute of the day, had they cared to. While feasible, that was impractical. Likewise, early GPS receivers were complex and costly, suitable for a submarine but not for everyone at all times. But this would change, thanks to the ubiquity of inexpensive chips embedded in digital gadgets. The cost of a GPS module tumbled from hundreds of dollars in the 1990s to about a dollar today at high volume. It usually takes only a few seconds for GPS to fix a location, and the coordinates are standardized. So 37° 14' 06" N, 115° 48' 40" W can only mean that one is at the super-secretive U.S. military base in a remote part of Nevada known as “Area 51,” where space aliens are (perhaps!) being kept.
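As an illustration of what that standardization buys, the short sketch below converts the degrees-minutes-seconds coordinates quoted above into the decimal degrees most datasets use, and computes a great-circle distance to a nearby reference point (the Las Vegas coordinates used here are approximate).

```python
# Converting the standardized coordinate quoted above from
# degrees/minutes/seconds into the decimal degrees most datasets use,
# plus a great-circle distance to show why a common format matters.
import math

def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees-minutes-seconds to signed decimal degrees."""
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (degrees + minutes / 60 + seconds / 3600)

# 37° 14' 06" N, 115° 48' 40" W  (the "Area 51" coordinates in the text)
lat = dms_to_decimal(37, 14, 6, "N")
lon = dms_to_decimal(115, 48, 40, "W")
print(f"{lat:.5f}, {lon:.5f}")   # -> 37.23500, -115.81111

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Distance from the coordinates above to Las Vegas (approximate city center).
print(f"{haversine_km(lat, lon, 36.17, -115.14):.0f} km")
```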
Nowadays GPS is just one system among many to capture location. Rival satellite systems are under way in China and Europe. And even better accuracy can be established by triangulating among cell towers or wifi routers to determine position based on signal strength, since GPS doesn’t work indoors or amid tall buildings. That helps explain why firms like Google, Apple, and Microsoft have established their own geo-location systems to complement GPS. Google’s Street View cars collected wifi router information as they snapped photos, and the iPhone was a “spyPhone” gathering location and wifi data and sending it back to Apple, without users realizing it. (Google’s Android phones and Microsoft’s mobile operating system also collected this sort of data.)
It is not just people but objects that can be tracked now. With wireless modules placed inside vehicles, the datafication of location will transform the idea of insurance. The data offers a granular look at the times, locations, and distances of actual driving to better price risk. In the U.S. and Britain, drivers can buy car insurance priced according to where and when they actually drive, not just pay an annual rate based on their age, sex, and past record. This approach to insurance pricing creates incentives for good behavior. It shifts the very nature of insurance from one based on pooled risk to something based on individual action. Tracking individuals by vehicles also changes the nature of fixed costs, like roads and other infrastructure, by tying the use of those resources to drivers and others who “consume” them. This was impossible to do prior to rendering geo-location in data form on a continual basis for everyone and everything—but it is the world we are headed into.
UPS, for example, uses “geo-loco” data in multiple ways. Its vehicles are fitted with sensors, wireless modules, and GPS so that headquarters can predict engine trouble, as we saw in the last chapter. Moreover, it lets the company know the vans’ whereabouts in case of delays, to monitor employees, and to scrutinize their itineraries to optimize routes. The most efficient path is determined in part from data on previous deliveries, much as Maury’s charts were based on earlier sea voyages.
The analytics program has had extraordinary effects. In 2011 UPS shaved a massive 30 million miles off its drivers’ routes, saving three million gallons of fuel and 30,000 metric tons of carbon-dioxide emissions, according to Jack Levis, UPS’s director of process management. It also improved safety and efficiency: the algorithm compiles routes with fewer turns that must cross traffic at intersections, which tend to lead to accidents, waste time, and consume more fuel since vans often must idle before turning.
“Prediction gave us knowledge,” says Levis at UPS. “But after knowledge is something more: wisdom and clairvoyance. At some point in time, the system will be so smart that it will predict problems and correct them before the user realizes that there was something wrong.”
Datafied location across time is most notably being applied to people. For years wireless operators have collected and analyzed information to improve the service level of their networks. But the data is increasingly being used for other purposes and collected by third parties for new services. Some smartphone applications, for example, gather location information regardless of whether the app itself has a location-based feature. In other cases, the whole point of an app is to build a business around knowing the users’ locations. An example is Foursquare, which lets people “check in” at their favorite locations. It earns income from loyalty programs, restaurant recommendations, and other location-related services.
The ability to collect users’ geo-loco data is becoming extremely valuable. On an individual level, it allows targeted advertising based on where the person is situated or is predicted to go. Moreover, the information can be aggregated to reveal trends. For instance, amassing location data lets firms detect traffic jams without needing to see the cars: the number and speed of phones traveling on a highway reveal this information. The company AirSage crunches 15 billion geo-loco records daily from the travels of millions of cellphone subscribers to create real-time traffic reports in over 100 cities across America. Two other geo-loco companies, Sense Networks and Skyhook, can use location data to tell which areas of a city have the most bustling nightlife, or to estimate how many protesters turned up at a demonstration.
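A toy sketch of that aggregation logic appears below: it averages the speeds of anonymous phone pings per road segment and flags slow ones. The records are invented, and real providers such as AirSage operate at vastly larger scale with far more careful statistics.

```python
# A toy sketch of how aggregated location data can reveal traffic conditions:
# average the speeds of anonymous phone "pings" per road segment and flag
# segments that fall below a threshold. The records below are invented.
from collections import defaultdict

pings = [  # (road segment, speed in km/h) derived from successive location fixes
    ("I-15 north, mile 38", 12.0),
    ("I-15 north, mile 38", 9.5),
    ("I-15 north, mile 38", 15.0),
    ("US-95 west, mile 77", 96.0),
    ("US-95 west, mile 77", 101.0),
]

speeds = defaultdict(list)
for segment, speed in pings:
    speeds[segment].append(speed)

for segment, values in speeds.items():
    avg = sum(values) / len(values)
    status = "congested" if avg < 30 else "free-flowing"
    print(f"{segment}: {avg:.0f} km/h ({status})")
```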