Take the following three sentences: “Fred’s parents arrived late. The caterers were expected soon. Fred was angry.” When reading them we instantly intuit why Fred was angry—not because the caterers were to arrive soon, but because of his parents’ tardiness. Actually, we have no way of knowing this from the information supplied. Still, our minds cannot help creating what we assume are coherent, causal stories out of facts we are given.
Daniel Kahneman, a professor of psychology at Princeton and the recipient of the 2002 Nobel Prize in economics, uses this example to suggest that we have two modes of thinking. One is fast and takes little effort, letting us jump to conclusions in seconds. The other is slow and hard, requiring us to think through a particular issue.
The fast way of thinking is biased heavily toward “seeing” causal links even when there are none. It is also prejudiced toward confirming our existing knowledge and beliefs. In our evolutionary past, this fast way of thinking helped us survive a dangerous environment, one in which we often needed to decide quickly on limited information. But it frequently falls short of establishing the true cause of an effect.
Unfortunately, Kahneman argues, very often our brain is too lazy to think slowly and methodically. Instead, we let the fast way of thinking take over. As a consequence, we often “see” imaginary causalities, and thus fundamentally misunderstand the world.
Parents often tell their children that they got the flu because they did not wear hats or gloves in cold weather. Yet there is no direct causal link between bundling up and catching the flu. If we visit a restaurant and later fall sick, we intuitively blame the food we ate there (and perhaps avoid the restaurant in the future), even though the food may have nothing to do with our illness. We could have caught a stomach bug in any number of ways, such as shaking hands with an infected person. The fast-thinking side of our brain is hard-wired to jump quickly to whatever causal conclusions it can come up with. It thus often leads us to wrong decisions.
Contrary to conventional wisdom, such human intuiting of causality does not deepen our understanding of the world. In many instances, it’s little more than a cognitive shortcut that gives us the illusion of insight but in reality leaves us in the dark about the world around us. Just as sampling was a shortcut we used because we could not process all the data, the perception of causality is a shortcut our brain uses to avoid thinking hard and slow.
In a small-data world, showing how wrong causal intuitions were took a long time. This is going to change. In the future, big-data correlations will routinely be used to disprove our causal intuitions, showing that often there is little if any statistical connection between the effect and its supposed cause. Our “fast thinking” mode is in for an extensive and lasting reality check.
Perhaps that lesson will make us think harder (and slower) as we aim to understand the world. But even our slow thinking—the second way we suss out causalities—will see its role transformed by big-data correlations.
In our daily lives, we think so often in causal terms that we may believe causality can easily be shown. The truth is much less comfortable. Unlike with correlations, where the math is relatively straightforward, there is no obvious mathematical way to “prove” causality. We can’t even express causal relationships easily in standard equations. Hence even if we think slow and hard, conclusively finding causal relationships is difficult. Because our minds are used to an information-poor world, we are tempted to reason with limited data, even though too often, too many factors are at play to simply reduce an effect to a particular cause.
Take the case of the vaccine against rabies. On July 6, 1885, the French chemist Louis Pasteur was introduced to nine-year-old Joseph Meister, who had been mauled by a rabid dog. Pasteur had pioneered the development of vaccines and had been working on an experimental vaccine against rabies. Meister’s parents begged Pasteur to use the vaccine to treat their son. He did, and Joseph Meister survived. In the press, Pasteur was celebrated as having saved the young boy from a certain, painful death.
But had he? As it turns out, on average only one in seven people bitten by rabid dogs ever contracts the disease. Even assuming Pasteur’s experimental vaccine was effective, there was about an 85 percent likelihood that the boy would have survived anyway.
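The arithmetic behind that figure is simply the complement of the one-in-seven infection rate:

```latex
P(\text{rabies}\mid\text{bite}) \approx \tfrac{1}{7} \approx 0.14,
\qquad
P(\text{no rabies}\mid\text{bite}) \approx 1 - \tfrac{1}{7} = \tfrac{6}{7} \approx 0.86
```

Even without any treatment, the odds already favored the boy’s survival by roughly six to one.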
In this example, administering the vaccine was credited with saving Joseph Meister. But two causal connections are in question: one between the vaccine and the rabies virus, and the other between being bitten by a rabid dog and developing the disease. Even if the first link holds, the second holds only in a minority of cases, which is why the boy would most likely have survived without the shot.
Scientists have overcome this challenge of demonstrating causality through experiments, in which the supposed cause can be carefully applied or suppressed. If the effects correspond to whether the cause was applied or not, it suggests a causal connection. The more carefully controlled the circumstances, the higher the likelihood that the causal link you identify is correct.
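In code, the logic of such an experiment can be sketched as a simple randomized comparison. Everything below is invented for illustration (the subject counts, the outcome rates, and the assumption that the treatment really works); the point is only to show how random assignment lets us ask whether outcomes track the cause more than chance would allow.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)

# Hypothetical trial: 200 subjects, half randomly assigned to receive
# the supposed cause ("treated"), half to go without it ("control").
n = 200
treated = rng.permutation(np.repeat([True, False], n // 2))

# Simulated outcomes: for illustration we assume the cause really does
# lower the rate of a bad outcome, from 30% in controls to 15% if treated.
bad_outcome = np.where(treated,
                       rng.random(n) < 0.15,
                       rng.random(n) < 0.30)

# 2x2 table: rows are treated/control, columns are bad outcome / no bad outcome.
table = [[int(np.sum(treated & bad_outcome)),  int(np.sum(treated & ~bad_outcome))],
         [int(np.sum(~treated & bad_outcome)), int(np.sum(~treated & ~bad_outcome))]]

# Fisher's exact test asks: how likely is a split this lopsided if the
# cause had no effect at all? A tiny p-value suggests a causal link.
odds_ratio, p_value = fisher_exact(table)
print(table, round(p_value, 4))
```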
Hence, much like correlations, causality can rarely if ever be proven, only shown with a high degree of probability. But unlike correlations, experiments to infer causal connections are often not practical or raise challenging ethical questions. How could we run a causal experiment to identify the reason why certain search terms best predict the flu? And for a rabies shot, would we subject dozens, perhaps hundreds of patients to a painful death—as part of the “control group” that didn’t get the shot—although we had a vaccine for them? Even where experiments are practical, they remain costly and time-consuming.
In comparison, non-causal analyses, such as correlations, are often fast and cheap. Unlike with causal links, we have the mathematical and statistical methods to analyze such relationships, and the digital tools to demonstrate their strength with confidence.
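Here, for instance, is a minimal sketch of what measuring the strength of a relationship looks like with off-the-shelf tools. The two series are made up (imagine weekly counts of a flu-related search term and weekly flu cases); only the method matters:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Two invented weekly series that move together, plus noise.
search_volume = rng.normal(1000, 100, size=52)
flu_cases = 0.5 * search_volume + rng.normal(0, 40, size=52)

# Pearson's r quantifies how strongly the two series move together;
# the p-value says how surprising that strength would be by chance alone.
r, p = pearsonr(search_volume, flu_cases)
print(f"correlation r = {r:.2f}, p-value = {p:.2g}")
```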
Moreover, correlations are not only valuable in their own right, they also point the way for causal investigations. By telling us which two things are potentially connected, they allow us to investigate further whether a causal relationship is present, and if so, why. This inexpensive and speedy filtering mechanism lowers the cost of causal analysis through specially controlled experiments. Through correlations we can catch a glimpse of the important variables that we then use in experiments to investigate causality.
But we must be careful. Correlations are powerful not only because they offer insights, but also because the insights they offer are relatively clear. These insights often get obscured when we bring causality back into the picture. For instance, Kaggle, a firm that organizes data-mining competitions for companies and opens them to anyone to enter, ran a contest in 2012 on the quality of used cars. A used-car dealer supplied data to participating statisticians to build an algorithm to predict which of the vehicles available for purchase at an auction were likely to have problems. A correlation analysis showed that cars painted orange were far less prone to have defects, at about half the average rate of the other cars.
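A correlation analysis of that kind is little more than a group comparison. The toy records below are invented, not Kaggle’s actual data, and the column names are hypothetical:

```python
import pandas as pd

# Invented auction records: each row is a car, with its paint color and
# whether it later turned out to be defective.
cars = pd.DataFrame({
    "color":     ["orange", "white", "silver", "orange", "white",
                  "black", "orange", "silver", "black", "white"],
    "defective": [False, True, False, False, True,
                  False, False, True, True, False],
})

# Defect rate by paint color: a simple group-by is enough to see
# whether one group stands out from the rest.
print(cars.groupby("color")["defective"].mean().sort_values())
```

Producing such a number is the easy part; explaining it is another matter.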
Even as we read this, we already think about why it might be so: Are people who own orange cars likely to be car enthusiasts and take better care of their vehicles? Is it because a custom color might mean the car has been made in a more careful, customized way in other respects, too? Or, perhaps orange cars are more noticeable on the road and therefore less likely to be in accidents, so they’re in better condition when resold?
Quickly we are caught in a web of competing causal hypotheses. But our attempts to illuminate things this way only make them cloudier. Correlations exist; we can show them mathematically. We can’t easily do the same for causal links. So we would do well to hold off from trying to explain the reason behind the correlations: the why instead of the what. Otherwise, we might advise car owners to paint their clunkers orange in order to make the engines less defective—a ridiculous thought.
Taking these facts into account, it is quite understandable that correlation analysis and similar non-causal methods based on hard data are superior to most intuited causal connections, the result of “fast thinking.” But in a growing number of contexts, such analysis is also more useful and more efficient than slow causal thinking that is epitomized by carefully controlled (and thus costly and time-consuming) experiments.
In recent years, scientists have tried to lower the costs of experiments to investigate causes, for instance by cleverly combining appropriate surveys to create “quasi-experiments.” That may make some causal investigations easier, but the efficiency advantage of non-causal methods is hard to beat. Moreover, big data itself aids causal inquiries as it guides experts toward likely causes to investigate. In many cases, the deeper search for causality will take place after big data has done its work, when we specifically want to investigate the why, not just appreciate the what.
Causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning. Big data turbocharges non-causal analyses, often replacing causal investigations. The conundrum of exploding manholes in Manhattan is a case in point.
Man versus manhole
Every year a few hundred manholes in New York City start to smolder as their innards catch fire. Sometimes the cast-iron manhole covers, which weigh as much as 300 pounds, explode into the air several stories high before crashing down to the ground. This is not a good thing.
Con Edison, the public utility that provides the city’s electricity, does regular inspections and maintenance of the manholes every year. In the past, it basically relied on chance, hoping that a manhole scheduled for a visit might be one that was poised to blow. It was little better than a random walk down Wall Street. In 2007 Con Edison turned to statisticians uptown at Columbia University in hopes that they could use its historical data about the grid, such as previous problems and what infrastructure is connected to what, to predict which manholes were likely to have trouble, so the company would know where to concentrate its resources.
It’s a complex big-data problem. There are 94,000 miles of underground cables in New York City, enough to wrap around the Earth three and a half times. Manhattan alone boasts around 51,000 manholes and service boxes. Some of this infrastructure dates back to the days of Thomas Edison, the company’s namesake. One in 20 cables were laid before 1930. Though records had been kept since the 1880s, they were in a hodgepodge of forms—and never meant for data analysis. They came from the accounting department or emergency dispatchers who made hand-written notes of “trouble tickets.” To say the data was messy is a gross understatement. As just one example, the statisticians reported, the term “service box,” a common piece of infrastructure, had at least 38 variants, including SB, S, S/B, S.B, S?B, S.B., SBX, S/BX, SB/X, S/XB, /SBX, S.BX, S &BX, S?BX, S BX, S/B/X, S BOX, SVBX, SERV BX, SERV-BOX, SERV/BOX, and SERVICE BOX. A computer algorithm had to figure it all out.
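To give a flavor of what “figuring it all out” means in practice, here is a minimal sketch of one standard approach: strip punctuation and spacing from each raw label and map the normalized spelling to a canonical term. This is a generic illustration, not the algorithm the Columbia team actually used, and the lookup table covers only a handful of the reported variants:

```python
import re

# A few of the raw spellings reported for "service box".
raw_labels = ["SB", "S/B", "S.B.", "SBX", "S BX", "SERV BX",
              "SERV-BOX", "SERV/BOX", "SVBX", "SERVICE BOX"]

def normalize(label: str) -> str:
    """Upper-case the label and strip everything but letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", label.upper())

# Map normalized spellings to one canonical term. A real cleaning pass
# would need far more entries (and fuzzier matching) than this.
CANONICAL = {
    "SB": "SERVICE BOX", "SBX": "SERVICE BOX", "SBOX": "SERVICE BOX",
    "SVBX": "SERVICE BOX", "SERVBX": "SERVICE BOX",
    "SERVBOX": "SERVICE BOX", "SERVICEBOX": "SERVICE BOX",
}

for label in raw_labels:
    print(f"{label!r:>14} -> {CANONICAL.get(normalize(label), 'UNKNOWN')}")
```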
“The data was just so incredibly raw,” recalls Cynthia Rudin, the statistician and data-miner, now at MIT, who led the project. “I’ve got a printout of all the different cable tables. If you roll the thing out, you couldn’t even hold it up without it dragging on the floor. And you have to make sense out of all of it—to mine it for gold, whatever it takes to get a really good predictive model.”
To work, Rudin and her team had to use all the data available, not just a sample, since any of the tens of thousands of manholes could be a ticking time bomb. So it begged for N=all. And though coming up with causal reasons would have been nice, it might have taken a century and still been wrong or incomplete. The better way to accomplish the task is to find the correlations. Rudin cared less about why than about which—though she knew that when the team sat across from Con Edison executives, the stats geeks would have to justify the basis for their rankings. The predictions might have been made by a machine, but the consumers were human, and people tend to want reasons, to understand.
The data mining unearthed the golden nuggets Rudin hoped to find. After formatting the messy data so that a machine could process it, the team started with 106 predictors of a major manhole disaster. They then condensed that list to a handful of the strongest signals. In a test of the Bronx’s power grid, they analyzed all the data they had, up to mid-2008. Then they used that data to predict problem spots for 2009. It worked brilliantly. The top 10 percent of manholes on their list contained a whopping 44 percent of the manholes that ended up having severe incidents.
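That workflow (train a model on historical records, score every manhole, then check how many of the later incidents land near the top of the ranking) can be sketched with standard tools. The data, features, and model below are synthetic stand-ins, not the Columbia team’s actual 106-predictor model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 10,000 "manholes" with five predictors
# (think cable age, count of past trouble tickets, and so on).
n = 10_000
X = rng.normal(size=(n, 5))

# Invented ground truth: the first two predictors drive the risk.
risk = 1 / (1 + np.exp(-(1.5 * X[:, 0] + 1.0 * X[:, 1] - 3.0)))
y = (rng.random(n) < risk).astype(int)      # 1 = severe incident

# Mimic the train-on-history, predict-the-future split: fit on one half,
# then rank the unseen other half by predicted risk.
half = n // 2
model = LogisticRegression().fit(X[:half], y[:half])
scores = model.predict_proba(X[half:])[:, 1]
actual = y[half:]

# What share of the actual incidents sits in the top 10% of the ranking?
top_decile = np.argsort(scores)[::-1][: len(scores) // 10]
capture = actual[top_decile].sum() / actual.sum()
print(f"top 10% of the ranking captures {capture:.0%} of incidents")
```

Nothing in such a ranking says why a given predictor raises the risk; it only says which manholes to look at first.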
In the end, the biggest factors were the age of the cables and whether the manholes had experienced previous troubles. This was useful, as it turns out, since it meant that Con Edison’s brass could easily grasp the basis for a ranking. But wait. Age and prior problems? Doesn’t that sound fairly obvious? Well, yes and no. On one hand, as the network theorist Duncan Watts likes to say, “Everything is obvious once you know the answer” (the title of one of his books). On the other hand, it is important to remember that there were 106 predictors in the model at the outset. It was not so evident how to weigh them and then prioritize tens of thousands of manholes, each with myriad variables that added up to millions of data points—and the data itself was not even in a form to be analyzed.
The case of exploding manholes highlights the point that data is being put to new uses to solve difficult real-world problems. To achieve this, however, we needed to change the way we operated. We had to use all the data, as much as we could possibly collect, not just a small portion. We needed to accept messiness rather than treat exactitude as a central priority. And we had to put our trust in correlations without fully knowing the causal basis for the predictions.
The end of theory?
Big data transforms how we understand and explore the world. In the age of small data, we were driven by hypotheses about how the world worked, which we then attempted to validate by collecting and analyzing data. In the future, our understanding will be driven more by the abundance of data than by hypotheses.
These hypotheses have often been derived from theories of the natural or the social sciences, which in turn help explain and/or predict the world around us. As we transition from a hypothesis-driven world to a data-driven world, we may be tempted to think that we also no longer need theories.
In 2008 Wired magazine’s editor-in-chief Chris Anderson trumpeted that “the data deluge makes the scientific method obsolete.” In a cover story called “The Petabyte Age,” he proclaimed that it amounted to nothing short of “the end of theory.” The traditional process of scientific discovery—of a hypothesis that is tested against reality using a model of underlying causalities—is on its way out, Anderson argued, replaced by statistical analysis of pure correlations that is devoid of theory.
To support his argument, Anderson described how quantum physics has become an almost purely theoretical field, because experiments are too expensive, too complex, and too large to be feasible. There is theory, he suggested, that has nothing to do anymore with reality. As examples of the new method, he referred to Google’s search engine and to gene sequencing. “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear,” he wrote. “With enough data, the numbers speak for themselves. Petabytes allow us to say: ‘Correlation is enough.’”
The article unleashed a furious and important debate, even though Anderson quickly backpedaled away from his bolder claims. But his argument is worth examining. In essence, Anderson contends that until recently, as we aimed to analyze and understand the world around us, we required theories to test. In contrast, in a big-data age, the argument goes, we do not need theories: we can just look at the data. If true, this would suggest that all generalizable rules about how the world works, how humans behave, what consumers buy, when parts break, and so on may become irrelevant as analysis of big data takes over.
The “end of theory” seems to imply that while theories have existed in substantive fields like physics or chemistry, big-data analysis has no need of any conceptual models. That is preposterous.
Big data itself is founded on theory. For instance, it employs statistical theories and mathematical ones, and at times uses computer science theory, too. Yes, these are not theories about the causal dynamics of a particular phenomenon like gravity, but they are theories nonetheless. And, as we have shown, models based on them hold very useful predictive power. In fact, big data may offer a fresh look and new insights precisely because it is unencumbered by the conventional thinking and inherent biases implicit in the theories of a specific field.
Moreover, because big-data analysis is based on theories, we can’t escape them. They shape both our methods and our results. It begins with how we select the data. Our decisions may be driven by convenience: Is the data readily available? Or by economics: Can the data be captured cheaply? Our choices are influenced by theories. What we choose influences what we find, as the digital-technology researchers danah boyd and Kate Crawford have argued. After all, Google used search terms as a proxy for the flu, not the length of people’s hair. Similarly, when we analyze the data, we choose tools that rest on theories. And as we interpret the results we again apply theories. The age of big data clearly is not without theories—they are present throughout, with all that this entails.
Anderson deserves credit for raising the right questions—and doing so, characteristically, before others. Big data may not spell the “end of theory,” but it does fundamentally transform the way we make sense of the world. This change will take a lot of getting used to. It challenges many institutions. Yet the tremendous value that it unleashes will make it not only a worthwhile tradeoff, but an inevitable one.
Before we get there, however, it bears noting how we got here. Many people in the tech industry like to credit the transformation to the new digital tools, from fast chips to efficient software, because they are the toolmakers. The technical wizardry does matter, but not as much as one might think. The deeper reason for these trends is that we have far more data. And the reason we have more data is that we are rendering more aspects of reality in a data format, the topic of the next chapter.