Big Data: A Revolution That Will Transform How We Live, Work, and Think


by Viktor Mayer-Schönberger


  Using all the data makes it possible to spot connections and details that are otherwise cloaked in the vastness of the information. For instance, the detection of credit card fraud works by looking for anomalies, and the best way to find them is to crunch all the data rather than a sample. The outliers are the most interesting information, and you can only identify them in comparison to the mass of normal transactions. It is a big-data problem. And because credit card transactions happen instantaneously, the analysis usually has to happen in real time too.

  Xoom is a firm that specializes in international money transfers and is backed by big names in big data. It analyzes all the data associated with the transactions it handles. The system raised alarm bells in 2011 when it noticed a slightly higher than average number of Discover Card transactions originating from New Jersey. “It saw a pattern when there shouldn’t have been a pattern,” explained John Kunze, Xoom’s chief executive. On its own, each transaction looked legitimate. But it turned out that they came from a criminal group. The only way to spot the anomaly was to examine all the data—sampling might have missed it.
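  In code, the core idea is simple enough to sketch. The toy example below, with invented field names and an arbitrary threshold (neither drawn from Xoom's actual system), flags any group of transactions whose daily count jumps far above its historical norm:

```python
# A minimal sketch of a whole-dataset anomaly check of the kind described above.
# Field names, the history format, and the threshold are illustrative assumptions.
from collections import Counter
from statistics import mean, stdev

def flag_anomalies(todays_txns, history, z_threshold=3.0):
    """Flag (card_network, state) groups whose transaction count today
    deviates sharply from that group's past daily counts."""
    todays_counts = Counter((t["card_network"], t["state"]) for t in todays_txns)
    anomalies = []
    for group, past_counts in history.items():  # history: group -> list of past daily counts
        if len(past_counts) < 2:
            continue
        mu, sigma = mean(past_counts), stdev(past_counts)
        today = todays_counts.get(group, 0)
        if sigma > 0 and (today - mu) / sigma > z_threshold:
            anomalies.append((group, today, mu))
    return anomalies

# Example: a spike of Discover transactions from New Jersey stands out
history = {("discover", "NJ"): [4, 5, 3, 6, 5, 4, 5]}
todays_txns = [{"card_network": "discover", "state": "NJ"}] * 19
print(flag_anomalies(todays_txns, history))
```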

  Using all the data need not be an enormous task. Big data is not necessarily big in absolute terms, although often it is. Google Flu Trends tunes its predictions on hundreds of millions of mathematical modeling exercises using billions of data points. The full sequence of a human genome amounts to three billion base pairs. But the absolute number of data points alone, the size of the dataset, is not what makes these examples of big data. What classifies them as big data is that instead of using the shortcut of a random sample, both Flu Trends and Steve Jobs’s doctors used as much of the entire dataset as feasible.

  The discovery of match fixing in Japan’s national sport, sumo wrestling, is a good illustration of why using N=all need not mean big. Allegations of thrown matches have long bedeviled the sport of emperors, and have always been rigorously denied. Steven Levitt, an economist at the University of Chicago, looked for corruption in the records of more than a decade of past matches—all of them. In a delightful research paper published in the American Economic Review and reprised in the book Freakonomics, he and a colleague described the usefulness of examining so much data.

  They analyzed 11 years’ worth of sumo bouts, more than 64,000 wrestler-matches, to hunt for anomalies. And they struck gold. Match fixing did indeed take place, but not where most people suspected. Rather than for championship bouts, which may or may not be rigged, the data showed that something funny was happening during the unnoticed end-of-tournament matches, where seemingly little is at stake, since the wrestlers have no chance of winning a title.

  But one peculiarity of sumo is that wrestlers need a majority of wins at the 15-match tournaments in order to retain their rank and income. This sometimes leads to asymmetries of interests, when a wrestler with a 7–7 record faces an opponent with 8–6 or better. The outcome means a great deal to the first wrestler and next to nothing to the second. In such cases, the number-crunching uncovered, the wrestler who needs the victory is very likely to win.

  Might the fellows who need the win be fighting more resolutely? Perhaps. But the data suggested that something else is happening as well. The wrestlers with more at stake win about 25 percent more often than normal. It’s hard to attribute that large a discrepancy to adrenaline alone. When the data was parsed further, it showed that the very next time the same two wrestlers met, the loser of the previous bout was much more likely to win than when they sparred in later matches. So the first victory appears to be a “gift” from one competitor to the other, since what goes around comes around in the tight-knit world of sumo.
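  The underlying calculation is straightforward. A toy sketch of it might look like this, with a handful of made-up bout records standing in for the more than 64,000 real wrestler-matches Levitt examined:

```python
# A minimal sketch of the win-rate comparison described above.
# The bout records below are invented purely for illustration.
def bubble_match_win_rate(bouts):
    """Share of 'bubble' matches (a 7-7 wrestler facing an 8-6-or-better
    opponent) won by the wrestler who needs the victory."""
    relevant = [b for b in bouts if b["wins_a"] == 7 and b["wins_b"] >= 8]
    if not relevant:
        return None
    return sum(b["winner"] == "a" for b in relevant) / len(relevant)

bouts = [
    {"wins_a": 7, "wins_b": 8, "winner": "a"},
    {"wins_a": 7, "wins_b": 9, "winner": "a"},
    {"wins_a": 7, "wins_b": 8, "winner": "b"},
    {"wins_a": 7, "wins_b": 8, "winner": "a"},
]
print(bubble_match_win_rate(bouts))  # far above what an even contest would suggest
```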

  This information was always there, hiding in plain sight. But random sampling of the bouts might have failed to reveal it. Even though the analysis relied on basic statistics, without knowing what to look for one would have had no idea which sample to draw. In contrast, Levitt and his colleague uncovered it by using a far larger set of data—striving to examine the entire universe of matches. An investigation using big data is almost like a fishing expedition: it is unclear at the outset not only whether one will catch anything but what one may catch.

  The dataset need not span terabytes. In the sumo case, the entire dataset contained fewer bits than a typical digital photo these days. But as big-data analysis, it looked at more than a typical random sample. When we talk about big data, we mean “big” less in absolute than in relative terms: relative to the comprehensive set of data.

  For a long time, random sampling was a good shortcut. It made analysis of large data problems possible in the pre-digital era. But much as when converting a digital image or song into a smaller file, information is lost when sampling. Having the full (or close to the full) dataset provides a lot more freedom to explore, to look at the data from different angles or to look closer at certain aspects of it.

  A fitting analogy may be the Lytro camera, which captures not just a single plane of light, as with conventional cameras, but rays from the entire light field, some 11 million of them. The photographer can decide later which element of an image to focus on in the digital file. There is no need to focus at the outset, since collecting all the information makes it possible to do that afterwards. Because rays from the entire light field are included, it is closer to all the data. As a result, the information is more “reusable” than ordinary pictures, where the photographer has to decide what to focus on before she presses the shutter.

  Similarly, because big data relies on all the information, or at least as much as possible, it allows us to look at details or explore new analyses without the risk of blurriness. We can test new hypotheses at many levels of granularity. This quality is what lets us see match fixing in sumo wrestling, track the spread of the flu virus by region, and fight cancer by targeting a precise portion of the patient’s DNA. It allows us to work at an amazing level of clarity.

  To be sure, using all the data instead of a sample isn’t always necessary. We still live in a resource-constrained world. But in an increasing number of cases using all the data at hand does make sense, and doing so is feasible now where before it was not.

  One of the areas that is being most dramatically shaken up by N=all is the social sciences. They have lost their monopoly on making sense of empirical social data, as big-data analysis replaces the highly skilled survey specialists of the past. The social science disciplines largely relied on sampling studies and questionnaires. But when the data is collected passively while people do what they normally do anyway, the old biases associated with sampling and questionnaires disappear. We can now collect information that we couldn’t before, be it relationships revealed via mobile phone calls or sentiments unveiled through tweets. More important, the need to sample disappears.

  Albert-László Barabási, one of the world’s foremost authorities on the science of network theory, wanted to study interactions among people at the scale of the entire population. So he and his colleagues examined anonymous logs of mobile phone calls from a wireless operator that served about one-fifth of an unidentified European country’s population—all the logs for a four-month period. It was the first network analysis on a societal level, using a dataset that was in the spirit of N=all. Working on such a large scale, looking at all the calls among millions of people over time, produced novel insights that probably couldn’t have been revealed in any other way.

  Intriguingly, in contrast to smaller studies, the team discovered that if one removes people from the network who have many links within their community, the remaining social network degrades but doesn’t fail. When, on the other hand, people with links outside their immediate community are taken off the network, the social net suddenly disintegrates, as if its structure had buckled. It was an important, if somewhat unexpected, result. Who would have thought that the people with lots of close friends are far less important to the stability of the network structure than the ones who have ties to more distant people? It suggests that there is a premium on diversity within a group and in society at large.
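  The kind of comparison the team made can be sketched on a synthetic graph. The toy example below assumes nothing about the real call records: it uses the networkx library’s connected_caveman_graph as a stand-in for a community-structured society, removes either the within-community hubs or the bridging nodes, and checks how large the biggest surviving piece of the network is:

```python
# A toy sketch of the robustness comparison described above, on a synthetic
# community-structured graph rather than real call records. The graph model
# and parameters are illustrative assumptions.
import networkx as nx

L_CLIQUES, K = 20, 10                     # 20 tightly knit communities of 10 people
G = nx.connected_caveman_graph(L_CLIQUES, K)
community = {n: n // K for n in G}        # each clique is one community

def largest_component_after_removal(G, remove_bridges, n_remove=40):
    """Remove the n_remove nodes with the most links either outside their own
    community ('bridges') or inside it, and measure the biggest piece left."""
    def outside_links(n):
        return sum(community[nbr] != community[n] for nbr in G[n])
    def inside_links(n):
        return sum(community[nbr] == community[n] for nbr in G[n])
    key = outside_links if remove_bridges else inside_links
    H = G.copy()
    H.remove_nodes_from(sorted(G, key=key, reverse=True)[:n_remove])
    return max(len(c) for c in nx.connected_components(H))

print("removing within-community hubs:",
      largest_component_after_removal(G, remove_bridges=False))
print("removing bridging nodes:      ",
      largest_component_after_removal(G, remove_bridges=True))
```

  On this toy graph the network survives the loss of its local hubs but shatters into isolated communities once the bridges go, mirroring the pattern Barabási’s team observed.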

  We tend to think of statistical sampling as some sort of immutable bedrock, like the principles of geometry or the laws of gravity. But the concept is less than a century old, and it was developed to solve a particular problem at a particular moment in time under specific technological constraints. Those constraints no longer exist to the same extent. Reaching for a random sample in the age of big data is like clutching at a horse whip in the era of the motor car. We can still use sampling in certain contexts, but it need not—and will not—be the predominant way we analyze large datasets. Increasingly, we will aim to go for it all.

  3

  MESSY

  USING ALL AVAILABLE DATA is feasible in an increasing number of contexts. But it comes at a cost. Increasing the volume opens the door to inexactitude. To be sure, erroneous figures and corrupted bits have always crept into datasets. Yet the point has always been to treat them as problems and try to get rid of them, in part because we could. What we never wanted to do was consider them unavoidable and learn to live with them. This is one of the fundamental shifts of going to big data from small.

  In a world of small data, reducing errors and ensuring high quality of data was a natural and essential impulse. Since we only collected a little information, we made sure that the figures we bothered to record were as accurate as possible. Generations of scientists optimized their instruments to make their measurements more and more precise, whether for determining the position of celestial bodies or the size of objects under a microscope. In a world of sampling, the obsession with exactitude was even more critical. Analyzing only a limited number of data points means errors may get amplified, potentially reducing the accuracy of the overall results.
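  To see why a limited sample amplifies error, consider a single corrupted reading. The toy calculation below, with invented numbers, shows how much more it distorts the average of ten data points than the average of ten thousand:

```python
# A quick illustration of why one bad reading hurts a small sample far more
# than a large one. All numbers here are made up for the example.
import random

random.seed(0)
true_value = 20.0

def mean_with_one_bad_reading(n):
    readings = [random.gauss(true_value, 1.0) for _ in range(n - 1)]
    readings.append(200.0)                    # one wildly corrupted data point
    return sum(readings) / len(readings)

print(mean_with_one_bad_reading(10))      # pulled far away from 20
print(mean_with_one_bad_reading(10_000))  # barely budges
```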

  For much of history, humankind’s highest achievements arose from conquering the world by measuring it. The quest for exactitude began in Europe in the middle of the thirteenth century, when astronomers and scholars took on the ever more precise quantification of time and space—“the measure of reality,” in the words of the historian Alfred Crosby.

  If one could measure a phenomenon, the implicit belief was, one could understand it. Later, measurement was tied to the scientific method of observation and explanation: the ability to quantify, record, and present reproducible results. “To measure is to know,” pronounced Lord Kelvin. It became a basis of authority. “Knowledge is power,” instructed Francis Bacon. In parallel, mathematicians, and what later became actuaries and accountants, developed methods that made possible the accurate collection, recording, and management of data.

  By the nineteenth century France—then the world’s leading scientific nation—had developed a system of precisely defined units of measurement to capture space, time, and more, and had begun to get other nations to adopt the same standards. This went as far as laying down internationally accepted prototype units to measure against in international treaties. It was the apex of the age of measurement. Just half a century later, in the 1920s, the discoveries of quantum mechanics shattered forever the dream of comprehensive and perfect measurement. And yet, outside a relatively small circle of physicists, the mindset of humankind’s drive to flawlessly measure continued among engineers and scientists. In the world of business it even expanded, as the rational sciences of mathematics and statistics began to influence all areas of commerce.

  However, in many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming. It is a tradeoff. In return for relaxing the standards of allowable errors, one can get ahold of much more data. It isn’t just that “more trumps some,” but that, in fact, sometimes “more trumps better.”

  There are several kinds of messiness to contend with. The term can refer to the simple fact that the likelihood of errors increases as you add more data points. Hence, increasing the stress readings from a bridge by a factor of a thousand boosts the chance that some may be wrong. But you can also increase messiness by combining different types of information from different sources, which don’t always align perfectly. For example, using voice-recognition software to characterize complaints to a call center, and comparing that data with the time it takes operators to handle the calls, may yield an imperfect but useful snapshot of the situation. Messiness can also refer to the inconsistency of formatting, for which the data needs to be “cleaned” before being processed. There are a myriad of ways to refer to IBM, notes the big-data expert DJ Patil, from I.B.M. to T. J. Watson Labs, to International Business Machines. And messiness can arise when we extract or process the data, since in doing so we are transforming it, turning it into something else, such as when we perform sentiment analysis on Twitter messages to predict Hollywood box office receipts. Messiness itself is messy.
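  Cleaning for inconsistent formatting, at its simplest, just means collapsing variants onto one canonical name before counting. A minimal sketch, with a hand-made alias table rather than any real system’s:

```python
# A minimal sketch of the cleaning step mentioned above: collapsing the many
# spellings of one entity onto a canonical name. The alias table is invented.
CANONICAL = {
    "i.b.m.": "IBM",
    "ibm": "IBM",
    "international business machines": "IBM",
    "t. j. watson labs": "IBM",
}

def normalize(name):
    key = name.strip().lower()
    return CANONICAL.get(key, name.strip())

records = ["IBM", "I.B.M.", "International Business Machines", "T. J. Watson Labs"]
print({normalize(r) for r in records})   # -> {'IBM'}
```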

  Suppose we need to measure the temperature in a vineyard. If we have only one temperature sensor for the whole plot of land, we must make sure it’s accurate and working at all times: no messiness allowed. In contrast, if we have a sensor for every one of the hundreds of vines, we can use cheaper, less sophisticated sensors (as long as they do not introduce a systematic bias). Chances are that at some points a few sensors may report incorrect data, creating a less exact, or “messier,” dataset than the one from a single precise sensor. Any particular reading may be incorrect, but the aggregate of many readings will provide a more comprehensive picture. Because this dataset consists of more data points, it offers far greater value that likely offsets its messiness.
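  A rough simulation makes the tradeoff concrete. In the sketch below all the numbers are invented: one precise sensor is compared with five hundred cheap ones, a few of which fail outright, and the average of the messy crowd still lands close to the true temperature:

```python
# A back-of-the-envelope sketch of the vineyard example: many cheap, noisy
# sensors versus one precise one. All figures are invented for illustration.
import random

random.seed(42)
true_temp = 18.0

one_precise = random.gauss(true_temp, 0.1)                       # +/- 0.1 degree sensor
many_cheap = [random.gauss(true_temp, 2.0) for _ in range(500)]  # +/- 2 degree sensors

# a few of the cheap sensors simply fail and report garbage
for i in random.sample(range(500), 5):
    many_cheap[i] = random.uniform(-30, 60)

print("single precise sensor:       ", round(one_precise, 2))
print("average of 500 messy sensors:", round(sum(many_cheap) / len(many_cheap), 2))
```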

  Now suppose we increase the frequency of the sensor readings. If we take one measurement per minute, we can be fairly sure that the sequence with which the data arrives will be perfectly chronological. But if we change that to ten or a hundred readings per second, the accuracy of the sequence may become less certain. As the information travels across a network, a record may get delayed and arrive out of sequence, or may simply get lost in the flood. The information will be a bit less accurate, but its great volume makes it worthwhile to forgo strict exactitude.
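  One common way to live with out-of-order arrivals is to bucket each reading by the timestamp it carries, so the order in which records happen to arrive stops mattering. A minimal sketch, with an invented record format:

```python
# A small sketch of tolerating out-of-order arrivals by bucketing readings
# on the timestamp they carry. The (timestamp, value) format is an assumption.
from collections import defaultdict

def per_second_averages(stream):
    """Average readings per whole second, regardless of arrival order."""
    buckets = defaultdict(list)
    for ts, value in stream:
        buckets[int(ts)].append(value)
    return {sec: sum(v) / len(v) for sec, v in sorted(buckets.items())}

# readings arrive jumbled, out of their true time order
stream = [(0.01, 18.2), (0.45, 18.3), (1.20, 18.4), (0.90, 18.1), (1.75, 18.6)]
print(per_second_averages(stream))   # -> roughly {0: 18.2, 1: 18.5}
```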

  In the first example, we sacrificed the accuracy of each data point for breadth, and in return we received detail that we otherwise could not have seen. In the second case, we gave up exactitude for frequency, and in return we saw change that we otherwise would have missed. Although we may be able to overcome the errors if we throw enough resources at them—after all, 30,000 trades per second take place on the New York Stock Exchange, where the correct sequence matters a lot—in many cases it is more fruitful to tolerate error than it would be to work at preventing it.

  For instance, we can accept some messiness in return for scale. As Forrester, a technology consultancy, puts it, “Sometimes two plus two can equal 3.9, and that is good enough.” Of course the data can’t be completely incorrect, but we’re willing to sacrifice a bit of accuracy in return for knowing the general trend. Big data transforms figures into something more probabilistic than precise. This change will take a lot of getting used to, and it comes with problems of its own, which we’ll consider later in the book. But for now it is worth simply noting that we often will need to embrace messiness when we increase scale.

  One sees a similar shift in terms of the importance of more data relative to other improvements in computing. Everyone knows how much processing power has increased over the years as predicted by Moore’s Law, which states that the number of transistors on a chip doubles roughly every two years. This continual improvement has made computers faster and memory more plentiful. Fewer of us know that the performance of the algorithms that drive many of our systems has also increased—in many areas more than the improvement of processors under Moore’s Law. Many of the gains to society from big data, however, happen not so much because of faster chips or better algorithms but because there is more data.

  For example, chess algorithms have changed only slightly in the past few decades, since the rules of chess are fully known and tightly constrained. The reason computer chess programs play far better today than in the past is in part that they are playing their endgame better. And they’re doing that simply because the systems have been fed more data. In fact, endgames when six or fewer pieces are left on the chessboard have been completely analyzed and all possible moves (N=all) have been represented in a massive table that when uncompressed fills more than a terabyte of data. This enables chess computers to play the endgame flawlessly. No human will ever be able to outplay the system.

  The degree to which more data trumps better algorithms has been powerfully demonstrated in the area of natural language processing: the way computers learn how to parse words as we use them in everyday speech. Around 2000, Microsoft researchers Michele Banko and Eric Brill were looking for a method to improve the grammar checker that is part of the company’s Word program. They weren’t sure whether it would be more useful to put their effort into improving existing algorithms, finding new techniques, or adding more sophisticated features. Before going down any of these paths, they decided to see what happened when they fed a lot more data into the existing methods. Most machine-learning algorithms relied on corpuses of text that totaled a million words or less. Banko and Brill took four common algorithms and fed in up to three orders of magnitude more data: 10 million words, then 100 million, and finally a billion words.

  The results were astounding. As more data went in, the performance of all four types of algorithms improved dramatically. In fact, a simple algorithm that was the worst performer with half a million words performed better than the others when it crunched a billion words. Its accuracy rate went from 75 percent to above 95 percent. Inversely, the algorithm that worked best with a little data performed the least well with larger amounts, though like the others it improved a lot, going from around 86 percent to about 94 percent accuracy. “These results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development,” Banko and Brill wrote in one of their research papers on the topic.
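  The shape of their experiment, a learning curve across ever larger training sets, is easy to sketch. The toy version below uses synthetic data and off-the-shelf scikit-learn models in place of Banko and Brill’s real corpus and algorithms, so the exact numbers will differ from theirs, but it shows how one would measure the effect of data volume on accuracy:

```python
# A rough sketch of a learning-curve comparison in the spirit of the study
# described above, with synthetic data standing in for a real text corpus.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for a labeled corpus
X, y = make_classification(n_samples=200_000, n_features=40, n_informative=10,
                           random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=10_000,
                                                  random_state=0)

# train two different algorithms on ever larger slices of the data
for n in (1_000, 10_000, 100_000):
    for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
        model.fit(X_pool[:n], y_pool[:n])
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{n:>7} examples  {type(model).__name__:<18} {acc:.3f}")
```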

 
