In short, although data has long been valuable, it was either seen as ancillary to the core operations of running a business, or limited to relatively narrow categories such as intellectual property or personal information. In contrast, in the age of big data, all data will be regarded as valuable, in and of itself.
When we say “all data,” we mean even the rawest, most seemingly mundane bits of information. Think of readings from a heat sensor on a factory machine. Or the real-time stream of GPS coordinates, accelerometer readings, and fuel levels from a delivery vehicle—or a fleet of 60,000 of them. Or think of billions of old search queries, or the price of nearly every seat on every commercial airline flight in the United States going back years.
Until recently there were no easy ways to collect, store, and analyze such data, which severely limited the opportunities to extract its potential value. In Adam Smith’s celebrated example of the pin maker, which he used in the eighteenth century to illustrate the division of labor, gathering comparable data would have required observers watching all the workers not just for one particular study but at all times, every day, taking detailed measurements and counting the output on thick paper with feathery quill pens. When classical economists considered the factors of production (land, labor, and capital), the idea of harnessing data was largely absent. Though the cost of gathering and using data has declined over the past two centuries, until fairly recently it remained relatively expensive.
What makes our era different is that many of the inherent limitations on the collection of data no longer exist. Technology has reached a point where vast amounts of information often can be captured and recorded cheaply. Data can frequently be collected passively, without much effort or even awareness on the part of those being recorded. And because the cost of storage has fallen so much, it is easier to justify keeping data than discarding it. All this makes much more data available at lower cost than ever before. Over the past half-century, the cost of digital storage has been roughly cut in half every two years, while storage density has increased 50 million-fold. In light of informational firms like Farecast or Google—where raw facts go in at one end of a digital assembly line and processed information comes out at the other—data is starting to look like a new resource or factor of production.
The immediate value of most data is evident to those who collect it. In fact, they probably gather it with a specific purpose in mind. Stores collect sales data for proper financial accounting. Factories monitor their output to ensure it conforms to quality standards. Websites log every click users make—sometimes even where the mouse-cursor moves—for analyzing and optimizing the content the sites present to visitors. These primary uses of the data justify its collection and processing. When Amazon records not only the books that customers buy but the web pages they merely look at, it knows it will use the data to offer personalized recommendations. Similarly, Facebook tracks users’ “status updates” and “likes” to determine the most suitable ads to display on its website to earn revenue.
Unlike material things—the food we eat, a candle that burns—data’s value does not diminish when it is used; it can be processed again and again. Information is what economists call a “non-rivalrous” good: one person’s use of it does not impede another’s. And information doesn’t wear out with use the way material goods do. Hence Amazon can use data from past transactions when making recommendations to its customers—and use it repeatedly, not only for the customer who generated the data but for many others as well.
Data can be used many times for the same purpose; more important, it can be harnessed for multiple purposes as well. This point matters as we try to understand how much information will be worth to us in the era of big data. We’ve seen some of this potential realized already, as when Walmart searched its database of old sales receipts and spotted the lucrative correlation between hurricanes and Pop-Tarts sales.
All this suggests that data’s full value is much greater than the value extracted from its first use. It also means that companies can exploit data effectively even if the first or each subsequent use only brings a tiny amount of value, so long as they utilize the data many times over.
THE “OPTION VALUE” OF DATA
To get a sense of what the reuse of data means for its ultimate value, consider electric cars. Whether they succeed as a mode of transportation depends on a dizzying array of logistics, all of which have something to do with battery life. Drivers need to be able to recharge their car batteries quickly and conveniently, and power companies need to ensure that the energy drawn by these vehicles doesn’t destabilize the grid. Today’s distribution of gas stations is largely effective, but we don’t yet know what the recharging needs of electric vehicles will be, or where their stations should be placed.
Strikingly, this is not so much an infrastructural problem as an informational one. And big data is an important part of the solution. In a trial in 2012, IBM worked with Pacific Gas and Electric Company in California and the carmaker Honda to collect vast amounts of information to answer fundamental questions about when and where electric cars will draw power and what this will mean for the power supply. IBM developed an elaborate predictive model based on numerous inputs: the car’s battery level, the location of the car, the time of day, and the available slots at nearby charging stations. It coupled that data with the current consumption of power from the grid as well as historical power usage patterns. Analyzing the huge streams of real-time and historical data from multiple sources let IBM determine the optimal times and places for drivers to top up their car batteries. It also revealed where to best build recharging stations. Eventually, the system will need to take into account price differences at nearby recharging stations. Even weather forecasts will have to be factored in: what if it is sunny and a nearby solar-powered station is teeming with electricity, but the forecast calls for a week of rain in which the solar panels will be idle?
The system takes information generated for one purpose and reuses it for another—in other words, the data moves from primary to secondary uses. This makes it much more valuable over time. The car’s battery-level indicator tells drivers when to fill ’er up. The power grid’s usage data is collected by the utility so it can manage the stability of the grid. Those are the primary uses. Both sets of data find secondary uses—and new value—when they’re applied to a completely different purpose: determining when and where to recharge, and where to build electric-vehicle service stations. On top of this, ancillary information is incorporated, such as the location of the car and historical grid consumption. And IBM processes the data not once but over and over, as it continuously updates its profile of e-car energy intake and its stress on the power grid.
Data’s true value is like an iceberg floating in the ocean. Only a tiny part of it is visible at first sight, while much of it is hidden beneath the surface. Innovative companies that understand this can extract that hidden value and reap potentially huge benefits. In short, data’s value needs to be considered in terms of all the possible ways it can be employed in the future, not simply how it is used in the present. We have seen this in many of the examples we have highlighted already. Farecast harnessed data from previously sold plane tickets to predict the future price of airfares. Google reused search terms to uncover the prevalence of the flu. Maury repurposed old captains’ logs to reveal ocean currents.
Still, the importance of data’s reuse is not fully appreciated in business and society. Few executives at Con Edison in New York could have imagined that century-old cable information and maintenance records might be used to prevent future accidents. It took a new generation of statisticians, and a new wave of methods and tools, to unlock the data’s value. Even many Internet and technology companies have been unaware until recently how valuable data’s reuse can be.
It may be helpful to envision data the way physicists see energy. They refer to “stored” or “potential” energy that exists within an object but lies dormant. Think of a compressed spring or a ball resting at the top of a hill. The energy in these objects remains latent—potential—until it’s unleashed, say, when the spring is released or the ball is nudged so that it rolls downhill. Now these objects’ energy has become “kinetic” because they’re moving and exerting force on other objects in the world. After its primary use, data’s value still exists, but lies dormant, storing its potential like the spring or the ball, until the data is applied to a secondary use and its power is released anew. In a big-data age, we finally have the mindset, ingenuity, and tools to tap data’s hidden value.
Ultimately, the value of data is what one can gain from all the possible ways it can be employed. These seemingly infinite potential uses are like options—not in the sense of financial instruments, but in the practical sense of choices. The data’s worth is the sum of these choices: the “option value” of data, so to speak. In the past, once data’s main use was achieved we often thought the data had fulfilled its purpose, and we were ready to erase it, to let it slip away. After all, it seemed the key worth had been extracted. In the big-data age, data is like a magical diamond mine that keeps on giving long after its principal value has been tapped. There are three potent ways to unleash data’s option value: basic reuse; merging datasets; and finding “twofers.”
THE REUSE OF DATA
A classic example of data’s innovative reuse is search terms. At first glance, the information seems worthless after its primary purpose has been fulfilled. The momentary interaction between consumer and search engine produced a list of websites and ads that served a particular function unique to that moment. But old queries can be extraordinarily valuable. Hitwise, a web-traffic-measurement company owned by the data broker Experian, lets clients mine search traffic to learn about consumer preferences. Marketers can use Hitwise to get a sense of whether pink will be in this spring or black is back. Google makes a version of its search-term analytics openly available for people to examine. It has launched a business-forecasting service with Spain’s second-largest bank, BBVA, to look at the tourism sector as well as sell real-time economic indicators based on search data. The Bank of England uses search queries related to property to get a better sense of whether housing prices are rising or falling.
Companies that have failed to appreciate the importance of data’s reuse have learned their lesson the hard way. For example, in Amazon’s early days it signed a deal with AOL to run the technology behind AOL’s e-commerce site. To most people, it looked like an ordinary outsourcing deal. But what really interested Amazon, explains Andreas Weigend, Amazon’s former chief scientist, was getting hold of data on what AOL users were looking at and buying, which would improve the performance of its recommendation engine. Poor AOL never realized this. It only saw the data’s value in terms of its primary purpose—sales. Clever Amazon knew it could reap benefits by putting the data to a secondary use.
Or take the case of Google’s entry into speech recognition with GOOG-411 for local search listings, which ran from 2007 to 2010. The search giant didn’t have its own speech-recognition technology, so it needed to license one. It reached an agreement with Nuance, the leader in the field, which was thrilled to have landed such a prized client. But Nuance was then a big-data dunderhead: the contract didn’t specify who got to retain the voice-translation records, and Google kept them for itself. Analyzing the data lets one score the probability that a given digitized snippet of voice corresponds to a specific word. This is essential for improving speech-recognition technology or creating a new service altogether. At the time Nuance perceived itself as in the business of software licensing, not data crunching. As soon as it recognized its error, it began striking deals with mobile operators and handset manufacturers to use its speech-recognition service—so that it could gather up the data.
The value in data’s reuse is good news for organizations that collect or control large datasets but currently make little use of them, such as conventional businesses that mostly operate offline. They may sit on untapped informational geysers. Some companies may have collected data, used it once (if at all), and just kept it around because of low storage cost—in “data tombs,” as data scientists call the places where such old info resides.
Internet and technology companies are on the front lines of harnessing the data deluge, since they collect so much information just by being online and are ahead of the rest of industry in analyzing it. But all firms stand to gain. The consultants at McKinsey & Company point to a logistics company, whose name they keep anonymous, which noticed that in the process of delivering goods, it was amassing reams of information on product shipments around the globe. Smelling opportunity, it established a special division to sell the aggregated data in the form of business and economic forecasts. In other words, it created an offline version of Google’s past-search-query business. Or consider SWIFT, the global interbank system for wire transfers. It found that payments correlate with global economic activity. So SWIFT offers GDP forecasts based on fund transfer data passing over its network.
Some firms, thanks to their position in the information value chain, may be able to collect huge amounts of data, even though they have little immediate need for it or aren’t adept at reusing it. For instance, mobile phone operators collect information on their subscribers’ locations so they can route calls. For these companies, this data has only narrow technical uses. But it becomes more valuable when it is reused by companies that distribute personalized, location-based advertising and promotions. Sometimes the value comes not from individual data points but from what they reveal in the aggregate. Hence the geo-loco businesses like AirSage and Sense Networks that we saw in the last chapter can sell information on where people are gathering on a Friday night or how slowly cars are crawling in traffic. This massed information can be used to determine real estate values or billboard advertising prices.
Even the most banal information may have special value, if applied in the right way. Look again at mobile phone operators: they have records of where and when the phones connect to base stations, including at what signal strength. Operators have long used that data to fine-tune the performance of their networks, deciding where to add or upgrade infrastructure. But the data has many other potential uses. Handset manufacturers could use it to learn what influences signal strength, for example, to improve the reception quality of their gadgets. Mobile operators have long been loath to monetize that information for fear of running afoul of privacy regulations. But they are starting to soften their stance as their financial fortunes flounder and they regard their data as a potential source of income. In 2012 the large Spanish and international operator Telefonica went so far as to create a separate company, called Telefonica Digital Insights, to sell anonymous and aggregated subscriber-location data to retailers and others.
RECOMBINANT DATA
Sometimes the dormant value can only be unleashed by combining one dataset with another, perhaps a very different one. We can do innovative things by commingling data in new ways. An example of how this can work is a clever study published in 2011 on whether cellphones increase the likelihood of cancer. With around six billion cellphones in the world, almost one for every human on Earth, the question is crucial. Many studies have looked for a link, but they have been hobbled by shortcomings. Their sample sizes were too small, or the time periods they covered were too short, or they were based on self-reported data that was fraught with error. However, a team of researchers at the Danish Cancer Society devised an interesting approach based on previously collected data.
Data on all cellphone subscribers since mobiles were introduced in Denmark was obtained from mobile operators. The study looked at those who had cellphones from 1987 to 1995, with the exception of corporate subscribers and others whose socioeconomic data was not available. It came to 358,403 people. The country also maintained a nationwide registry of all cancer patients, which contained 10,729 people who had tumors of the central nervous system during 1990 to 2007, the follow-up period. Finally, the study used a nationwide registry with information on highest attained education and disposable income for each Danish inhabitant. After combining the three datasets, the researchers looked into whether mobile users showed higher rates of cancer than non-subscribers. And among subscribers, were those who had owned a cellphone for a longer period more likely to get cancer?
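The linkage the researchers describe can be pictured as a join of three registries on a shared person identifier. The sketch below is purely illustrative: the dataset names, fields, and sample values are hypothetical, not drawn from the actual Danish study, which linked far larger national records.

```python
# Hypothetical sketch: linking three registries on a shared person ID,
# then comparing tumor rates between subscribers and non-subscribers.
subscribers = {"A", "B", "C"}          # person IDs with a cellphone subscription
cancer_registry = {"B", "D"}           # person IDs diagnosed with a CNS tumor
socioeconomic = {                      # person ID -> (education level, income band)
    "A": ("secondary", "mid"),
    "B": ("tertiary", "high"),
    "C": ("primary", "low"),
    "D": ("secondary", "mid"),
}

# Combine: one record per person with socioeconomic data, annotated with
# subscription status and diagnosis status from the other two registries.
combined = [
    {
        "id": pid,
        "subscriber": pid in subscribers,
        "diagnosed": pid in cancer_registry,
        "income": income,
    }
    for pid, (education, income) in socioeconomic.items()
]

def tumor_rate(rows, subscriber):
    """Fraction of the given group (subscribers or not) with a diagnosis."""
    group = [r for r in rows if r["subscriber"] == subscriber]
    return sum(r["diagnosed"] for r in group) / len(group)

print(tumor_rate(combined, True), tumor_rate(combined, False))
```

Because the combined records retain fields like income, the same structure supports the kind of subpopulation controls the study performed, simply by filtering `combined` before computing rates.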
Despite the study’s scale, the data wasn’t messy or imprecise at all: the datasets had been held to fastidious quality standards for medical, commercial, or demographic purposes. The information wasn’t collected in ways that could introduce biases related to the theme of the study. In fact, the data had been generated years earlier, and for reasons that had nothing to do with this research. Most important, the study was not based on a sample but on something close to N=all: almost every incident of cancer, and nearly every mobile user, which amounted to 3.8 million person-years of cellphone ownership. The fact that it contained almost all cases meant that the researchers could control for subpopulations, such as those with high levels of income.
Big Data: A Revolution That Will Transform How We Live, Work, and Think Page 12