Big Data: A Revolution That Will Transform How We Live, Work, and Think
In the end, the group didn’t detect any increase in the risk of cancer associated with use of mobile phones. For that reason, its findings hardly made a splash in the media when they were published in October 2011 in the British medical journal BMJ. But if a link had been uncovered, the study would have been front-page news around the world, and the methodology of “recombinant data” would have been celebrated.
With big data, the sum is more valuable than its parts, and when we recombine the sums of multiple datasets together, that sum too is worth more than its individual ingredients. Today Internet users are familiar with basic “mashups,” which combine two or more data sources in a novel way. For instance, the property website Zillow superimposes real estate information and prices on a map of neighborhoods in the United States. It also crunches reams of data, such as recent transactions in the neighborhood and the specifications of properties, to predict the value of specific homes in an area. The visual presentation makes the data more accessible. But with big data we can go far beyond this. The Danish cancer study gives us a hint of what is possible.
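As a toy illustration of what a mashup involves, here is a minimal Python sketch that joins two hypothetical data sources, a list of property listings and a table of neighborhood coordinates, so that prices could be placed on a map. The data and field names are invented for illustration; they are not Zillow’s.

```python
# A minimal mashup sketch: combine two hypothetical data sources so that
# property prices can be placed on a map. All values are invented.

listings = [
    {"address": "12 Elm St", "neighborhood": "Riverside", "price": 420_000},
    {"address": "48 Oak Ave", "neighborhood": "Hilltop", "price": 365_000},
]

neighborhood_coords = {
    "Riverside": (40.71, -74.00),
    "Hilltop": (40.73, -73.99),
}

# Join the sources: each listing gains the coordinates needed to plot it.
mashup = [
    {**home, "lat_lon": neighborhood_coords[home["neighborhood"]]}
    for home in listings
]

for home in mashup:
    print(home["address"], home["price"], home["lat_lon"])
```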
EXTENSIBLE DATA
One way to enable the reuse of data is to design extensibility into it from the outset so that it is suitable for multiple uses. Though this is not always possible—since one may only realize possible uses long after the data has been collected—there are ways to encourage multiple uses of the same dataset. For instance, some retailers are positioning store surveillance cameras so that they not only spot shoplifters but can also track the flow of customers through the store and where they stop to look. Retailers can use the latter information to design the best layout for the store as well as to judge the effectiveness of marketing campaigns. Prior to this, such video cameras were only for security. Now they are seen as an investment that may increase revenue.
One of the best at collecting data with extensibility in mind is, unsurprisingly, Google. Its controversial Street View cars cruised around snapping pictures of houses and roads, but also gobbling up GPS data, checking mapping information, and even sucking in wifi network names (and, perhaps illegally, the content that flowed over open wireless networks). A single Google Street View drive amassed a myriad of discrete data streams at every moment. The extensibility comes in because Google applied the data not just for a primary use but for lots of secondary uses. For example, the GPS data it garnered improved the company’s mapping service and was indispensable for the functioning of its self-driving car.
The extra cost of collecting multiple streams or many more data points in each stream is often low. So it makes sense to gather as much data as possible, as well as to make it extensible by considering potential secondary uses at the outset. That increases the data’s option value. The point is to look for “twofers”—where a single dataset can be used in multiple instances if it can be collected in a certain way. Thus the data can do double duty.
DEPRECIATING VALUE OF DATA
As the cost of storing digital data has plummeted, businesses have strong economic motivation to keep data to reuse for the same or similar purposes. But there is a limit to its usefulness.
For instance, as firms like Netflix and Amazon parlay customer purchases, browsing, and reviews into recommendations of new products, they can be tempted to use the records many times over for many years. With that in mind, one might argue that as long as a firm isn’t constrained by legal and regulatory limits like privacy laws, it ought to use digital records forever, or at least as long as economically possible. However, the reality is not so simple.
Most data loses some of its utility over time. In such circumstances, continuing to rely on old data doesn’t just fail to add value; it actually destroys the value of fresher data. Take a book you bought from Amazon ten years ago; it may no longer reflect your interests. If Amazon uses the decade-old purchase record to recommend other books, you’re less likely to buy those titles—or even care about subsequent recommendations the site might offer. Because Amazon’s recommendations are based on both outdated information and more recent, still valuable data, the presence of the old data diminishes the value of the newer data.
So the company has a huge incentive to use data only so long as it remains productive. It needs to continuously groom its troves and cull the information that has lost value. The challenge is knowing what data is no longer useful. Just basing that decision on time is rarely adequate. Hence, Amazon and others have built sophisticated models to help them separate useful from irrelevant data. For instance, if a customer looks at or buys a book that was recommended based on a previous purchase, e-commerce companies can infer that the older purchase still represents the customer’s current preferences. That way they are able to score the usefulness of older data, and thus model more accurate “depreciation rates” for the information.
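As a rough illustration of what such a “depreciation rate” might look like, here is a minimal sketch in which an old purchase record decays over time but regains weight whenever a recommendation derived from it is clicked or bought. The half-life and boost values are assumptions for illustration, not Amazon’s actual model.

```python
# Sketch of a "depreciation" score for old purchase records: value decays with
# age but is restored when recommendations derived from the record get used.
# HALF_LIFE_DAYS and ENGAGEMENT_BOOST are illustrative assumptions.

HALF_LIFE_DAYS = 365.0      # assumed: a record's value halves each year
ENGAGEMENT_BOOST = 0.5      # assumed: bump per clicked/bought recommendation


def relevance(record_age_days: float, engagements: int) -> float:
    """Score how much an old purchase should still influence recommendations."""
    decay = 0.5 ** (record_age_days / HALF_LIFE_DAYS)
    return min(1.0, decay + ENGAGEMENT_BOOST * engagements)


# A ten-year-old purchase that never led to a used recommendation is nearly
# worthless; the same record with two recent engagements stays fully in play.
print(relevance(record_age_days=3650, engagements=0))  # ~0.001
print(relevance(record_age_days=3650, engagements=2))  # 1.0 (capped)
```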
Not all data depreciates in value at the same pace or in the same way. This explains why some firms believe they need to keep data as long as possible, even if regulators or the public want it erased or anonymized after a period. For instance, Google has long resisted calls to delete users’ full Internet Protocol addresses from old search queries. (Instead it erases only the final digits after nine months to quasi-anonymize the query. Thus, the company can still compare year-on-year data, such as for holiday shopping searches—but only on a regional basis, not down to the individual.) Also, knowing the location of searchers can help improve the relevance of results. For instance, if lots of people in New York search for Turkey—and click on sites related to the country, not the bird—the algorithm will rank those pages higher for others in New York. Even if the value of the data diminishes for some of its purposes, its option value may remain strong.
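To make the quasi-anonymization step concrete, here is a minimal sketch that zeroes the final octet of an IPv4 address, so queries can still be grouped by rough region but no longer point to an individual machine. Zeroing the last octet is one common approach used here purely for illustration; the passage does not specify Google’s exact rule.

```python
def quasi_anonymize(ip: str) -> str:
    """Replace the final octet of an IPv4 address with 0 (illustrative rule)."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    octets[-1] = "0"
    return ".".join(octets)


print(quasi_anonymize("203.0.113.47"))  # -> 203.0.113.0
```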
THE VALUE OF DATA EXHAUST
The reuse of data can sometimes take a clever, hidden form. Web firms can capture data on all the things that users do, and then treat each discrete interaction as a signal to use as feedback for personalizing the site, improving a service, or creating an entirely new digital product. We see a piquant illustration of this in a tale of two spell checks.
Over the course of twenty years, Microsoft developed a robust spell checker for its Word software. It worked by comparing a frequently updated dictionary of correctly spelled terms against the stream of characters the users typed. The dictionary established what were known words; the system would treat close variants that weren’t in the dictionary as misspellings that it would then correct. Because of the effort needed to compile and update the dictionary, Microsoft Word’s spell check was available only for the most common languages. It cost the company millions of dollars to create and maintain.
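The general approach, stripped to its essentials, looks something like the following sketch: words found in the dictionary pass through, and close variants that are not in the dictionary get replaced by the nearest known word. This is a generic illustration of dictionary-based spell checking, not Microsoft’s implementation, and the tiny word list is a stand-in.

```python
# Dictionary-based spell check in miniature: unknown words are corrected to the
# closest known word. The DICTIONARY here is a tiny illustrative stand-in.
import difflib

DICTIONARY = {"epidemiology", "spell", "checker", "dictionary", "variant"}


def correct(word: str) -> str:
    """Return the word itself if known, otherwise the closest dictionary entry."""
    if word.lower() in DICTIONARY:
        return word
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=0.6)
    return matches[0] if matches else word


print(correct("epidemology"))  # -> epidemiology
print(correct("dictionry"))    # -> dictionary
```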
Now consider Google. It arguably has the world’s most complete spell checker, in basically every living language. The system is constantly improving and adding new words—the incidental outcome of people using the search engine every day. Mistype “iPad”? It’s in there. “Obamacare”? Got it.
Moreover, Google seemingly obtained its spell checker for free, reusing the misspellings that are typed into the company’s search engine among the three billion queries it handles every day. A clever feedback loop instructs the system what word users actually meant to type. Users sometimes explicitly “tell” Google the answer when it poses the question at the top of the results page—“Did you mean epidemiology?”—by clicking on that to start a new search with the correct term. Or the web page where users go implicitly signals the correct spelling, since it’s probably more highly correlated with the correctly spelled word than with the incorrect one. (This is more important than it may seem: As Google’s spell check continually improved, people stopped bothering to type their searches correctly, since Google could process them well regardless.)
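The feedback loop can be sketched very simply, assuming a hypothetical query log in which each entry records what the user typed and which spelling they ultimately accepted (by clicking a “Did you mean” suggestion or by the page they landed on). Counting these pairs yields a correction table; this is an illustration of the idea, not Google’s actual pipeline.

```python
# Build a correction table from a hypothetical (typed, accepted) query log.
from collections import Counter, defaultdict

query_log = [
    ("epidemology", "epidemiology"),
    ("epidemology", "epidemiology"),
    ("epidemology", "epidemology"),   # this user kept the original spelling
    ("obamcare", "obamacare"),
]

votes = defaultdict(Counter)
for typed, accepted in query_log:
    votes[typed][accepted] += 1

# For each typed string, the correction is the most frequently accepted form.
corrections = {typed: counts.most_common(1)[0][0] for typed, counts in votes.items()}

print(corrections["epidemology"])  # -> epidemiology
print(corrections["obamcare"])     # -> obamacare
```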
Google’s spell-checking system shows that “bad,” “incorrect,” or “defective” data can still be very useful. Interestingly, Google wasn’t the first to have this idea. Around 2000 Yahoo saw the possibility of creating a spell checker from users’ mistyped queries. But the idea never went anywhere. Old search-query data was treated largely as rubbish. Likewise, Infoseek and AltaVista, earlier popular search engines, each had the world’s most comprehensive database of misspelled words in its day, but they didn’t appreciate the value. Their systems, in a process that was invisible to users, treated typos as “related terms” and performed a search. But they were based on dictionaries that explicitly told the system what was correct, not on the living, breathing sum of user interactions.
Only Google recognized that the detritus of user interactions was actually gold dust that could be gathered up and forged into a shiny ingot. One of Google’s top engineers estimated that its spell checker performs better than Microsoft’s by at least an order of magnitude (though when pressed, he conceded he had not reliably measured this). And he scoffed at the idea that it was “free” to develop. The raw material—misspellings—might have come without a direct cost, but Google had probably spent a lot more than Microsoft to develop the system, he confessed with a broad smile.
The two companies’ different approaches are extremely telling. Microsoft only saw the value of spell check for one purpose, word processing. Google, on the other hand, understood its deeper utility. The firm not only used the typos to develop the world’s best and most up-to-date spell checker to improve search, but it applied the system to many other services, such as the “autocomplete” feature in search, Gmail, Google Docs, and even its translation system.
A term of art has emerged to describe the digital trail that people leave in their wake: “data exhaust.” It refers to data that is shed as a byproduct of people’s actions and movements in the world. For the Internet, it describes users’ online interactions: where they click, how long they look at a page, where the mouse-cursor hovers, what they type, and more. Many companies design their systems so that they can harvest data exhaust and recycle it, to improve an existing service or to develop new ones. Google is the undisputed leader. It applies the principle of recursively “learning from the data” to many of its services. Every action a user performs is considered a signal to be analyzed and fed back into the system.
For example, Google is acutely aware of how many times people searched for a term as well as related ones, and of how often they clicked on a link but then returned to the search page unimpressed with what they found, only to search again. It knows whether they clicked on the eighth link on the first page or the first link on the eighth page—or if they abandoned the search altogether. The company may not have been the first to have this insight, but it implemented it with extraordinary effectiveness.
This information is highly valuable. If many users tend to click on a search result at the bottom of the results page, this suggests it is more relevant than those above it, and Google’s ranking algorithm knows to automatically place it higher up on the page in subsequent searches. (And it does this for advertisements too.) “We like learning from large, ‘noisy’ datasets,” chirps one Googler.
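A much-simplified sketch of that feedback loop: blend each result’s original relevance score with its observed click-through rate and re-sort. The weighting and the numbers are invented for illustration; Google’s actual ranking signals are far richer.

```python
# Click-feedback re-ranking in miniature: a heavily clicked result near the
# bottom of the page floats upward. Scores and weights are illustrative.
results = [
    {"url": "a.example", "score": 0.90, "impressions": 1000, "clicks": 20},
    {"url": "b.example", "score": 0.85, "impressions": 1000, "clicks": 15},
    {"url": "c.example", "score": 0.60, "impressions": 1000, "clicks": 400},
]

CLICK_WEIGHT = 0.5  # assumed blend between original score and click feedback

for r in results:
    ctr = r["clicks"] / r["impressions"]
    r["adjusted"] = (1 - CLICK_WEIGHT) * r["score"] + CLICK_WEIGHT * ctr

reranked = sorted(results, key=lambda r: r["adjusted"], reverse=True)
print([r["url"] for r in reranked])  # -> ['c.example', 'a.example', 'b.example']
```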
Data exhaust is the mechanism behind many services like voice recognition, spam filters, language translation, and much more. When users indicate to a voice-recognition program that it has misunderstood what they said, they in effect “train” the system to get better.
Many businesses are starting to engineer their systems to collect and use information in this way. In Facebook’s early days, its first “data scientist,” Jeff Hammerbacher (one of the people credited with coining the term), examined its rich trove of data exhaust. He and his team found that a big predictor of whether people would take an action (post content, click an icon, and so on) was whether they had seen their friends do the same thing. So Facebook redesigned its system to put greater emphasis on making friends’ activities more visible, which sparked a virtuous circle of new contributions to the site.
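The shape of that finding can be shown with a minimal sketch: compare the action rate among users whose friends had already acted with the rate among those whose friends had not. The log below is entirely made up; only the comparison itself reflects the idea described.

```python
# Compare action rates with and without prior friend activity (made-up data).
observations = [
    # (friend_acted, user_acted)
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, False), (False, True), (False, False),
]


def action_rate(friend_acted: bool) -> float:
    outcomes = [user for friend, user in observations if friend == friend_acted]
    return sum(outcomes) / len(outcomes)


print(f"action rate when friends acted:   {action_rate(True):.2f}")   # 0.75
print(f"action rate when friends did not: {action_rate(False):.2f}")  # 0.25
```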
The idea is spreading far beyond the Internet sector to any company that has a way to gather up user feedback. E-book readers, for example, capture massive amounts of data on the literary preferences and habits of the people who use them: how long they take to read a page or section, where they read, if they turn the page with barely a skim or close the book forever. The devices record each time users underline a passage or take notes in the margins. The ability to gather this kind of information transforms reading, long a solitary act, into a sort of communal experience.
Once aggregated, the data exhaust can tell publishers and authors things they could never know before in a quantifiable way: the likes, dislikes, and reading patterns of people. This information is commercially valuable. One can imagine e-book firms selling it to publishers to improve the content and structure of books. For instance, Barnes & Noble’s analysis of data from its Nook e-book reader revealed that people tended to quit long nonfiction books midway through. That discovery inspired the company to create a series called “Nook Snaps”: short works on topical themes such as health and current affairs.
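A minimal sketch of the kind of aggregate analysis involved, using invented reading records rather than Barnes & Noble’s data: for each title, compute the median fraction of the book that readers actually got through. A median near the halfway mark for a long nonfiction title is the sort of signal described above.

```python
# Median completion fraction per title, from an invented reading log.
from collections import defaultdict
from statistics import median

# (title, pages_in_book, furthest_page_reached) -- illustrative numbers only
reading_log = [
    ("Long Nonfiction Tome", 600, 310),
    ("Long Nonfiction Tome", 600, 280),
    ("Long Nonfiction Tome", 600, 330),
    ("Short Thriller", 250, 250),
    ("Short Thriller", 250, 240),
]

progress = defaultdict(list)
for title, pages, reached in reading_log:
    progress[title].append(reached / pages)

for title, fractions in progress.items():
    print(f"{title}: median completion {median(fractions):.0%}")
# Long Nonfiction Tome: median completion 52%
# Short Thriller: median completion 98%
```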
Or consider online education programs like Udacity, Coursera, and edX. They track the web interactions of students to see what works best pedagogically. Class sizes have reached the tens of thousands of students, producing extraordinary amounts of data. Professors can now see when a large percentage of students have rewatched a segment of a lecture, which might suggest they weren’t clear on a certain point. In teaching a Coursera class on machine learning, the Stanford professor Andrew Ng noticed that around 2,000 students had gotten a particular homework question wrong—and had all produced the exact same incorrect answer. Clearly, they were making the same error. But what was it?
With a little investigation, he figured out that they were inverting two algebraic equations in an algorithm. So now, when other students make the same error, the system doesn’t simply say they’re wrong; it gives them a hint to check their math. The system applies big data in another way, too: by analyzing every forum post that students have read, together with whether they then completed their homework correctly, it can estimate the probability that a student who reads a given post will answer correctly, and thus identify which forum posts are most useful for students to read. These are things that were utterly impossible to know before, and they could change teaching and learning forever.
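In its simplest form, the post-usefulness idea reduces to a conditional probability: for each forum post, estimate how often students who read it went on to answer the related homework correctly. The log format and numbers below are hypothetical, and the real model is surely richer than this frequency count.

```python
# Estimate P(correct answer | read post) from a hypothetical read log.
from collections import defaultdict

# (post_id, student_answered_correctly) -- one row per student who read the post
read_log = [
    ("post_41", True), ("post_41", True), ("post_41", False),
    ("post_77", False), ("post_77", False), ("post_77", True),
]

tally = defaultdict(lambda: [0, 0])   # post_id -> [correct, total]
for post, answered_correctly in read_log:
    tally[post][0] += int(answered_correctly)
    tally[post][1] += 1

usefulness = {post: correct / total for post, (correct, total) in tally.items()}
for post, p in sorted(usefulness.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{post}: P(correct | read) = {p:.2f}")
# post_41: P(correct | read) = 0.67
# post_77: P(correct | read) = 0.33
```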
Data exhaust can be a huge competitive advantage for companies. It may also become a powerful barrier to entry against rivals. Consider: if a newly launched company devised an e-commerce site, social network, or search engine that was much better than today’s leaders like Amazon, Google, or Facebook, it would have trouble competing not simply because of economies of scale and network effects or brand, but because so much of those leading firms’ performance is due to the data exhaust they collect from customer interactions and incorporate back into the service. Could a new online education site have the know-how to compete with one that already has a gargantuan amount of data with which it can learn what works best?
THE VALUE OF OPEN DATA
Today we’re likely to think of sites like Google and Amazon as the pioneers of big data, but of course governments were the original gatherers of information on a mass scale, and they still rival any private enterprise for the sheer volume of data they control. One difference from data holders in the private sector is that governments can often compel people to provide them with information, rather than having to persuade them to do so or offer them something in return. As a consequence, governments will continue to amass vast troves of data.
The lessons of big data apply as much to the public sector as to commercial entities: government data’s value is latent and requires innovative analysis to unleash. But despite their special position in capturing information, governments have often been ineffective at using it. Recently the idea has gained prominence that the best way to extract the value of government data is to give the private sector and society in general access to it and let them try. There is a principle behind this as well. When the state gathers data, it does so on behalf of its citizens, and thus it ought to provide access to society (except in a limited number of cases, such as when doing so might harm national security or the privacy rights of others).
This idea has led to countless “open government data” initiatives around the globe. Arguing that governments are only custodians of the information they collect, and that the private sector and society will be more innovative, advocates of open data call on official bodies to publicly release data for purposes both civic and commercial. To work, of course, the data must be in a standardized, machine-readable form so it can be easily processed. Otherwise, the information might be considered public only in name.
The idea of open government data got a big boost when President Barack Obama, on his first full day in office on January 21, 2009, issued a presidential memorandum ordering the heads of federal agencies to release as much data as possible. “In the face of doubt, openness prevails,” he instructed. It was a remarkable declaration, particularly when compared with the attitude of his predecessor, who had instructed agencies to do precisely the opposite. Obama’s order prompted the creation of a website called data.gov, a repository of openly accessible information from the federal government. The site mushroomed from 47 datasets in 2009 to nearly 450,000 across 172 agencies by its third anniversary in July 2012.
Even in reticent Britain, where much government information has been locked up under Crown Copyright and has been difficult and costly to license for use (postal-code data needed by e-commerce companies, for example), there has been substantial progress. The UK government has issued rules to encourage open information and has supported the creation of an Open Data Institute, co-directed by Tim Berners-Lee, the inventor of the World Wide Web, to promote novel uses of open data and ways to free it from the state’s grip.
The European Union has also announced open-data initiatives that could soon become continent-wide. Countries elsewhere, such as Australia, Brazil, Chile, and Kenya, have issued and implemented open-data strategies. Below the national level, a growing number of cities and municipalities around the world, too, have embraced open data, as have international organizations such as the World Bank, which has made available hundreds of previously restricted datasets of economic and social indicators.