Big Data: A Revolution That Will Transform How We Live, Work, and Think
The ideal of identifying causal mechanisms is a self-congratulatory illusion; big data overturns this. Yet again we are at a historical impasse where “god is dead.” That is to say, the certainties that we believed in are once again changing. But this time they are being replaced, ironically, by better evidence. What role is left for intuition, faith, uncertainty, acting in contradiction of the evidence, and learning by experience? As the world shifts from causation to correlation, how can we pragmatically move forward without undermining the very foundations of society, humanity, and progress based on reason? This book intends to explain where we are, trace how we got here, and offer an urgently needed guide to the benefits and dangers that lie ahead.
2
MORE
BIG DATA IS ALL ABOUT seeing and understanding the relations within and among pieces of information that, until very recently, we struggled to fully grasp. IBM’s big-data expert Jeff Jonas says you need to let the data “speak to you.” At one level this may sound trivial. Humans have looked to data to learn about the world for a long time, whether in the informal sense of the myriad observations we make every day or, mainly over the last couple of centuries, in the formal sense of quantified units that can be manipulated by powerful algorithms.
The digital age may have made it easier and faster to process data, to calculate millions of numbers in a heartbeat. But when we talk about data that speaks, we mean something more—and different. As noted in Chapter One, big data is about three major shifts of mindset that are interlinked and hence reinforce one another. The first is the ability to analyze vast amounts of data about a topic rather than be forced to settle for smaller sets. The second is a willingness to embrace data’s real-world messiness rather than privilege exactitude. The third is a growing respect for correlations rather than a continuing quest for elusive causality. This chapter looks at the first of these shifts: using all the data at hand instead of just a small portion of it.
The challenge of processing large piles of data accurately has been with us for a while. For most of history we worked with only a little data because our tools to collect, organize, store, and analyze it were poor. We winnowed the information we relied on to the barest minimum so we could examine it more easily. This was a form of unconscious self-censorship: we treated the difficulty of interacting with data as an unfortunate reality, rather than seeing it for what it was, an artificial constraint imposed by the technology at the time. Today the technical environment has changed 179 degrees. There still is, and always will be, a constraint on how much data we can manage, but it is far less limiting than it used to be and will become even less so as time goes on.
In some ways, we haven’t yet fully appreciated our new freedom to collect and use larger pools of data. Most of our experience and the design of our institutions have presumed that the availability of information is limited. We reckoned we could only collect a little information, and so that’s usually what we did. It became self-fulfilling. We even developed elaborate techniques to use as little data as possible. One aim of statistics, after all, is to confirm the richest finding using the smallest amount of data. In effect, we codified our practice of stunting the quantity of information we used in our norms, processes, and incentive structures. To get a sense of what the shift to big data means, the story starts with a look back in time.
Not until recently have private firms, and nowadays even individuals, been able to collect and sort information on a massive scale. In the past, that task fell to more powerful institutions like the church and the state, which in many societies amounted to the same thing. The oldest records of counting date from around 5000 B.C., when Sumerian merchants used small clay beads to denote goods for trade. Counting on a larger scale, however, was the purview of the state. Over millennia, governments have tried to keep track of their people by collecting information.
Consider the census. The ancient Egyptians are said to have conducted censuses, as did the Chinese. They’re mentioned in the Old Testament, and the New Testament tells us that a census imposed by Caesar Augustus—“that all the world should be taxed” (Luke 2:1)—took Joseph and Mary to Bethlehem, where Jesus was born. The Domesday Book of 1086, one of Britain’s most venerated treasures, was at its time an unprecedented, comprehensive tally of the English people, their land and property. Royal commissioners spread across the countryside compiling information to put in the book—which later got the name “Domesday,” or “Doomsday,” because the process was like the biblical Final Judgment, when everyone’s life is laid bare.
Conducting censuses is both costly and time-consuming; King William I, who commissioned the Domesday Book, didn’t live to see its completion. But the only alternative to bearing this burden was to forgo collecting the information. And even after all the time and expense, the information was only approximate, since the census takers couldn’t possibly count everyone perfectly. The very word “census” comes from the Latin term “censere,” which means “to estimate.”
More than three hundred years ago, a British haberdasher named John Graunt had a novel idea. Graunt wanted to know the population of London at the time of the plague. Instead of counting every person, he devised an approach—which today we would call “statistics”—that allowed him to infer the population size. His approach was crude, but it established the idea that one could extrapolate from a small sample useful knowledge about the general population. But how one does that is important. Graunt just scaled up from his sample.
His system was celebrated, even though we later learned that his numbers were reasonable only by luck. For generations, sampling remained grossly flawed. Thus for censuses and similar “big data-ish” undertakings, the brute-force approach of trying to count every number ruled the day.
Because censuses were so complex, costly, and time-consuming, they were conducted only rarely. The ancient Romans, who long boasted a population in the hundreds of thousands, ran a census every five years. The U.S. Constitution mandated one every decade, as the growing country measured itself in millions. But by the late nineteenth century even that was proving problematic. The data outstripped the Census Bureau’s ability to keep up.
The 1880 census took a staggering eight years to complete. The information was obsolete even before it became available. Worse still, officials estimated that the 1890 census would have required a full 13 years to tabulate—a ridiculous state of affairs, not to mention a violation of the Constitution. Yet because the apportionment of taxes and congressional representation was based on population, getting not only a correct count but a timely one was essential.
The problem the U.S. Census Bureau faced is similar to the struggle of scientists and businessmen at the start of the new millennium, when it became clear that they were drowning in data: the amount of information being collected had utterly swamped the tools used for processing it, and new techniques were needed. In the 1880s the situation was so dire that the Census Bureau contracted with Herman Hollerith, an American inventor, to use his idea of punch cards and tabulation machines for the 1890 census.
With great effort, he succeeded in shrinking the tabulation time from eight years to less than one. It was an amazing feat, which marked the beginning of automated data processing (and provided the foundation for what later became IBM). But as a method of acquiring and analyzing big data it was still very expensive. After all, every person in the United States had to fill in a form and the information had to be transferred to a punch card, which was used for tabulation. With such costly methods, it was hard to imagine running a census in any time span shorter than a decade, even though the lag was unhelpful for a nation growing by leaps and bounds.
Therein lay the tension: Use all the data, or just a little? Getting all the data about whatever is being measured is surely the most sensible course. It just isn’t always practical when the scale is vast. But how to choose a sample? Some argued that purposefully constructing a sample that was representative of the whole would be the most suitable way forward. But in 1934 Jerzy Neyman, a Polish statistician, forcefully showed that such an approach leads to huge errors. The key to avoiding them is to aim for randomness in choosing whom to sample.
Statisticians have shown that sampling precision improves most dramatically with randomness, not with increased sample size. In fact, though it may sound surprising, a randomly chosen sample of 1,100 individual observations on a binary question (yes or no, with roughly equal odds) is remarkably representative of the whole population. In 19 out of 20 cases it is within a 3 percent margin of error, regardless of whether the total population size is a hundred thousand or a hundred million. Why this should be the case is complicated mathematically, but the short answer is that after a certain point early on, as the numbers get bigger and bigger, the marginal amount of new information we learn from each observation is less and less.
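For readers who want to check the arithmetic, here is a minimal sketch in Python using the standard formula for a proportion’s 95 percent confidence interval; the function name and the rounding are illustrative, not part of any survey software.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for estimating a proportion p
    from a simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# A yes/no question with roughly even odds (p ~ 0.5), sample of 1,100 people:
print(round(margin_of_error(1100), 3))  # 0.03, i.e. about plus or minus 3 percent

# Population size barely matters: the finite-population correction factor
# sqrt((N - n) / (N - 1)) is about 0.995 when N is 100,000 and about 1.0
# when N is 100,000,000, so the margin is essentially unchanged.
```

The square root in the formula is also why each additional observation teaches us less and less: quadrupling the sample size only halves the margin of error.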
The fact that randomness trumped sample size was a startling insight. It paved the way for a new approach to gathering information. Data using random samples could be collected at low cost and yet extrapolated with high accuracy to the whole. As a result, governments could run small versions of the census using random samples every year, rather than just one every decade. And they did. The U.S. Census Bureau, for instance, conducts more than two hundred economic and demographic surveys every year based on sampling, in addition to the decennial census that tries to count everyone. Sampling was a solution to the problem of information overload in an earlier age, when the collection and analysis of data was very hard to do.
The applications of this new method quickly went beyond the public sector and censuses. In essence, random sampling reduces big-data problems to more manageable data problems. In business, it was used to ensure manufacturing quality—making improvements much easier and less costly. Comprehensive quality control originally required looking at every single product coming off the conveyor belt; now a random sample of tests for a batch of products would suffice. Likewise, the new method ushered in consumer surveys in retailing and snap polls in politics. It transformed a big part of what we used to call the humanities into the social sciences.
Random sampling has been a huge success and is the backbone of modern measurement at scale. But it is only a shortcut, a second-best alternative to collecting and analyzing the full dataset. It comes with a number of inherent weaknesses. Its accuracy depends on ensuring randomness when collecting the sample data, but achieving such randomness is tricky. Systematic biases in the way the data is collected can lead to the extrapolated results being very wrong.
There are echoes of such problems in election polling using landline phones. The sample is biased against people who only use cellphones (who are younger and more liberal), as the statistician Nate Silver has pointed out. This has resulted in incorrect election predictions. In the 2008 presidential election between Barack Obama and John McCain, the major polling organizations of Gallup, Pew, and ABC/Washington Post found differences of between one and three percentage points when they polled with and without adjusting for cellphone users—a hefty margin considering the tightness of the race.
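A toy simulation makes the mechanism concrete. The population shares and candidate preferences below are invented purely for illustration; the point is that a biased sampling frame shifts the estimate no matter how many people are polled.

```python
import random

random.seed(0)

# Invented electorate: 30% of voters are cell-only and favor candidate A
# 60/40; the other 70%, reachable by landline, favor A only 48/52.
def voter():
    cell_only = random.random() < 0.30
    supports_a = random.random() < (0.60 if cell_only else 0.48)
    return cell_only, supports_a

population = [voter() for _ in range(1_000_000)]
true_support = sum(a for _, a in population) / len(population)

# A landline-only poll never reaches the cell-only voters.
landline_voters = [a for cell, a in population if not cell]
poll = random.sample(landline_voters, 1100)

print(f"true support for A:  {true_support:.3f}")          # roughly 0.516
print(f"landline-only poll:  {sum(poll) / len(poll):.3f}")  # roughly 0.48, biased low

# Polling 11,000 or 110,000 landline households would not close this gap:
# the error comes from the sampling frame, not the sample size.
```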
Most troublingly, random sampling doesn’t scale easily to include subcategories, as breaking the results down into smaller and smaller subgroups increases the possibility of erroneous predictions. It’s easy to understand why. Suppose you poll a random sample of a thousand people about their voting intentions in the next election. If your sample is sufficiently random, chances are that the entire population’s sentiment will be within a 3 percent range of the views in the sample. But what if plus or minus 3 percent is not precise enough? Or what if you then want to break down the group into smaller subgroups, by gender, geography, or income?
And what if you want to combine these subgroups to target a niche of the population? In an overall sample of a thousand people, a subgroup such as “affluent female voters in the Northeast” will be much smaller than a hundred. Using only a few dozen observations to predict the voting intentions of all affluent female voters in the Northeast will be imprecise even with close to perfect randomness. And tiny biases in the overall sample will make the errors more pronounced at the level of subgroups.
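The same margin-of-error arithmetic, reapplied to the slice rather than the whole sample, shows why. The subgroup size of roughly forty respondents below is an illustrative assumption.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion estimated
    from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# The full sample of 1,000 respondents: about plus or minus 3 points.
print(round(margin_of_error(1000), 3))  # 0.031

# A niche slice -- say roughly 40 affluent female voters in the Northeast
# out of those 1,000 -- is far less precise:
print(round(margin_of_error(40), 3))    # 0.155, roughly plus or minus 15 points
```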
Hence, sampling quickly stops being useful when you want to drill deeper, to take a closer look at some intriguing subcategory in the data. What works at the macro level falls apart in the micro. Sampling is like an analog photographic print. It looks good from a distance, but as you stare closer, zooming in on a particular detail, it gets blurry.
Sampling also requires careful planning and execution. One usually cannot “ask” sampled data fresh questions if they have not been considered at the outset. So though as a shortcut it is useful, the tradeoff is that it’s, well, a shortcut. Being a sample rather than everything, the dataset lacks a certain extensibility or malleability—the quality whereby the same data can be reanalyzed in ways entirely different from the purpose for which it was originally collected.
Consider the case of DNA analysis. The cost to sequence an individual’s genome approached a thousand dollars in 2012, moving it closer to a mass-market technique that can be performed at scale. As a result, a new industry of individual gene sequencing is cropping up. Since 2007 the Silicon Valley startup 23andMe has been analyzing people’s DNA for only a couple of hundred dollars. Its technique can reveal traits in people’s genetic codes that may make them more susceptible to certain diseases like breast cancer or heart problems. And by aggregating its customers’ DNA and health information, 23andMe hopes to learn new things that couldn’t be spotted otherwise.
But there’s a hitch. The company sequences just a small portion of a person’s genetic code: places that are known to be markers indicating particular genetic weaknesses. Meanwhile, billions of base pairs of DNA remain unsequenced. Thus 23andMe can only answer questions about the markers it considers. Whenever a new marker is discovered, a person’s DNA (or more precisely, the relevant part of it) has to be sequenced again. Working with a subset, rather than the whole, entails a tradeoff: the company can find what it is looking for faster and more cheaply, but it can’t answer questions that it didn’t consider in advance.
Apple’s legendary chief executive Steve Jobs took a totally different approach in his fight against cancer. He became one of the first people in the world to have his entire DNA sequenced as well as that of his tumor. To do this, he paid a six-figure sum—many hundreds of times more than the price 23andMe charges. In return, he received not a sample, a mere set of markers, but a data file containing the entire genetic codes.
In choosing a medication for an average cancer patient, doctors have to hope that the patient’s DNA is sufficiently similar to that of the patients who participated in the drug’s trials for it to work. However, Steve Jobs’s team of doctors could select therapies by how well they would work given his specific genetic makeup. Whenever one treatment lost its effectiveness because the cancer mutated and worked around it, the doctors could switch to another drug—“jumping from one lily pad to another,” Jobs called it. “I’m either going to be one of the first to be able to outrun a cancer like this or I’m going to be one of the last to die from it,” he quipped. Though his prediction went sadly unfulfilled, the method—having all the data, not just a bit—gave him years of extra life.
From some to all
Sampling is an outgrowth of an era of information-processing constraints, when people were measuring the world but lacked the tools to analyze what they collected. As a result, it is a vestige of that era too. The shortcomings in counting and tabulating no longer exist to the same extent. Sensors, cellphone GPS, web clicks, and Twitter collect data passively; computers can crunch the numbers with increasing ease.
The concept of sampling no longer makes as much sense when we can harness large amounts of data. The technical tools for handling data have already changed dramatically, but our methods and mindsets have been slower to adapt.
Yet sampling comes with a cost that has long been acknowledged but shunted aside. It loses detail. In some cases there is no other way but to sample. In many areas, however, a shift is taking place from collecting some data to gathering as much as possible, and if feasible, getting everything: N=all.
As we’ve seen, using N=all means we can drill down deep into data; samples can’t do that nearly as well. Recall, too, that in our example of sampling above, we had only a 3 percent margin of error when extrapolating to the whole population. For some situations, that error margin is fine. But you lose the details, the granularity, the ability to look closer at certain subgroups. A normal distribution is, alas, normal. Often, the really interesting things in life are found in places that samples fail to fully catch.
Hence Google Flu Trends doesn’t rely on a small random sample but instead uses billions of Internet search queries in the United States. Using all this data rather than a small sample improves the analysis down to the level of predicting the spread of flu in a particular city rather than a state or the entire nation. Oren Etzioni of Farecast initially used 12,000 data points, a sample, and it performed well. But as Etzioni added more data, the quality of the predictions improved. Eventually, Farecast used the domestic flight records for most routes for an entire year. “This is temporal data—you just keep gathering it over time, and as you do, you get more and more insight into the patterns,” Etzioni says.
So we’ll frequently be able to toss aside the shortcut of random sampling and aim for more comprehensive data instead. Doing so requires ample processing and storage power and cutting-edge tools to analyze it all. It also requires easy and affordable ways to collect the data. In the past, each one of these was an expensive conundrum. But now the cost and complexity of all these pieces of the puzzle have declined dramatically. What was previously the purview of just the biggest companies is now possible for most.