Big Data: A Revolution That Will Transform How We Live, Work, and Think
In the big-data age, it is no longer efficient to make decisions about what variables to examine by relying on hypotheses alone. The datasets are far too big and the area under consideration is probably far too complex. Fortunately, many of the limitations that forced us into a hypothesis-driven approach no longer exist to the same extent. We now have so much data available and so much computing power that we don’t have to laboriously pick one proxy or a small handful of them and examine them one by one. Sophisticated computational analysis can now identify the optimal proxy—as it did for Google Flu Trends, after plowing through almost half a billion mathematical models.
No longer do we necessarily require a valid substantive hypothesis about a phenomenon to begin to understand our world. Thus, we don’t have to develop a notion about what terms people search for when and where the flu spreads. We don’t need to have an inkling of how airlines price their tickets. We don’t need to care about the culinary tastes of Walmart shoppers. Instead we can subject big data to correlation analysis and let it tell us what search queries are the best proxies for the flu, whether an airfare is likely to soar, or what anxious families want to nibble on during a storm. In place of the hypothesis-driven approach, we can use a data-driven one. Our results may be less biased and more accurate, and we will almost certainly get them much faster.
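To make the idea concrete, here is a minimal sketch (in Python) of data-driven proxy selection: screen a large pool of candidate variables against an outcome and keep whichever correlates best. The variable names and numbers are invented for illustration; this is not Google's actual procedure, which tested far more elaborate models.

    import numpy as np

    rng = np.random.default_rng(0)
    weeks = 200
    flu_cases = rng.poisson(100, size=weeks).astype(float)   # outcome we want a proxy for
    candidates = {f"search_term_{i}": rng.normal(size=weeks) for i in range(1000)}
    # Plant one candidate that actually tracks the outcome (plus noise).
    candidates["search_term_42"] = flu_cases + rng.normal(scale=10.0, size=weeks)

    def pearson(x, y):
        return float(np.corrcoef(x, y)[0, 1])

    # Rank every candidate by the strength of its correlation with flu cases.
    ranked = sorted(candidates, key=lambda k: abs(pearson(candidates[k], flu_cases)), reverse=True)
    print(ranked[:3])   # the planted proxy should surface at or near the top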
Predictions based on correlations lie at the heart of big data. Correlation analyses are now used so frequently that we sometimes fail to appreciate the inroads they have made. And the uses will only increase.
For instance, financial credit scores are being used to predict personal behavior. The Fair Isaac Corporation, now known as FICO, invented credit scores in the late 1950s. In 2011 FICO established the “Medication Adherence Score.” To determine how likely people are to take their medication, FICO analyzes a wealth of variables—including ones that may seem irrelevant, such as how long people have lived at the same address, whether they are married, how long they’ve been in the same job, and whether they own a car. The score is intended to help health providers save money by telling them which patients to target with reminders. There is no causal link between car ownership and taking antibiotics as directed; the connection between them is pure correlation. But findings such as these were enough to inspire FICO’s chief executive to boast in 2011, “We know what you’re going to do tomorrow.”
Other data brokers are getting into the correlation game, too, as documented by the Wall Street Journal’s pioneering “What They Know” series. Experian has a product called Income Insight that estimates people’s income level partly on the basis of their credit history. It developed the score by analyzing its huge database of credit histories against anonymous tax data from the U.S. Internal Revenue Service. It would cost a business around $10 apiece to confirm someone’s income through tax forms, while Experian sells its estimate for less than $1. So in instances like this, using the proxy is more cost effective than going through the rigmarole to get the real thing. Similarly, yet another credit bureau, Equifax, sells an “Ability to Pay Index” and a “Discretionary Spending Index” that promise to predict the plumpness of a person’s purse.
The uses of correlations are being extended even further. Aviva, a large insurance firm, has studied the idea of using credit reports and consumer-marketing data as proxies for the analysis of blood and urine samples for certain applicants. The intent is to identify those who may be at higher risk of illnesses like high blood pressure, diabetes, or depression. The method uses lifestyle data that includes hundreds of variables such as hobbies, the websites people visit, and the amount of television they watch, as well as estimates of their income.
Aviva’s predictive model, developed by Deloitte Consulting, was considered successful at identifying health risks. Other insurance firms such as Prudential and AIG have examined similar initiatives. The benefit is that it may let people applying for insurance avoid having to give blood and urine samples, which no one enjoys, and which the insurance companies have to pay for. The lab tests cost around $125 per person, while the purely data-driven approach is about $5.
To some, the method may sound creepy, because it draws upon seemingly unrelated behaviors. It is as if companies can avail themselves of a cyber-snitch that spies on every mouse click. People might think twice before visiting extreme-sports websites or watching sitcoms that glorify couch potatoes if they felt this might result in higher insurance premiums. Admittedly, chilling people’s freedom to interact with information would be terrible. On the other hand, the benefit is that making insurance easier and less expensive to obtain may result in more insured people, which is a good thing for society, not to mention for insurance firms.
Yet the poster child, or perhaps the whipping boy, of big-data correlations is the American discount retailer Target, which has relied on predictions based on big-data correlations for years. In an extraordinary bit of reporting, Charles Duhigg, a business correspondent at the New York Times, recounted how Target knows when a woman is pregnant without the mother-to-be explicitly telling it so. Basically, its method is to harness data and let the correlations do their work.
Knowing if a customer may be pregnant is important for retailers, since pregnancy is a watershed moment for couples, when their shopping behaviors are open to change. They may start going to new stores and developing new brand loyalties. Target’s marketers turned to its analytics division to see if there was a way to discover customers’ pregnancies through their purchasing patterns.
The analytics team reviewed the shopping histories of women who signed up for its baby gift registry. They noticed that these women bought lots of unscented lotion at around the third month of pregnancy, and that a few weeks later they tended to purchase supplements like magnesium, calcium, and zinc. The team ultimately uncovered around two dozen products that, used as proxies, enabled the company to calculate a “pregnancy prediction” score for every customer who paid with a credit card, used a loyalty card, or redeemed mailed coupons. The correlations even let the retailer estimate the due date within a narrow range, so it could send relevant coupons for each stage of the pregnancy. “Target,” indeed.
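As a rough illustration of how a proxy-based score of this kind might work (the products, weights, and scoring rule below are invented, not Target's actual model), each proxy product can carry a weight, and a customer's score is simply the sum of the weights present in her recent purchases:

    # Hypothetical proxy weights; in practice these would be estimated from
    # historical purchase data tied to known outcomes (e.g., registry sign-ups).
    proxy_weights = {
        "unscented lotion": 0.30,
        "calcium supplement": 0.25,
        "magnesium supplement": 0.20,
        "zinc supplement": 0.20,
        "oversized tote bag": 0.05,
    }

    def pregnancy_prediction_score(basket):
        """Crude score in [0, 1]: sum the weights of any proxy items in the basket."""
        return min(1.0, sum(proxy_weights.get(item, 0.0) for item in basket))

    print(pregnancy_prediction_score(["unscented lotion", "zinc supplement"]))  # 0.5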
In his book The Power of Habit, Duhigg recounts what happened next. One day, an angry man stormed into a Target store in Minnesota to see a manager. “My daughter got this in the mail!” he shouted. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” When the manager called the man a few days later to apologize, however, the voice on the other end of the line was conciliatory. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
Finding proxies in social contexts is only one way that big-data techniques are being employed. Equally powerful are correlations with new types of data to solve everyday needs.
One of these is a method called predictive analytics, which is starting to be widely used in business to foresee events before they happen. The term may refer to an algorithm that can spot a hit song, which is commonly used in the music industry to give recording labels a better idea of where to place their bets. The technique is also being used to prevent big mechanical or structural failures: placing sensors on machinery, motors, or infrastructure like bridges makes it possible to monitor the data patterns they give off, such as heat, vibration, stress, and sound, and to detect changes that may indicate problems ahead.
The underlying concept is that when things break down, they generally don’t do so all at once, but gradually over time. Armed with sensor data, correlational analysis and similar methods can identify the specific patterns, the telltale signs, that typically crop up before something breaks—the whirring of a motor, excessive heat from an engine, and the like. From then on, one need only look for that pattern to know when something is amiss. Spotting the abnormality early on enables the system to send out a warning so that a new part can be installed or the problem fixed before the breakdown actually occurs. The aim is to identify and then watch a good proxy, and thereby predict future events.
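One simple way such a telltale pattern might be watched for is to compare each new sensor reading against a rolling baseline and flag readings that stray too far from it. The window size, threshold, and readings below are illustrative assumptions, not any particular vendor's method.

    from collections import deque

    def watch(readings, window=20, threshold=3.0):
        """Yield the indices of readings that stray threshold*sigma from the recent mean."""
        recent = deque(maxlen=window)
        for i, value in enumerate(readings):
            if len(recent) == window:
                mean = sum(recent) / window
                sigma = (sum((r - mean) ** 2 for r in recent) / window) ** 0.5 or 1e-9
                if abs(value - mean) > threshold * sigma:
                    yield i  # possible trouble ahead: schedule maintenance now, not after failure
            recent.append(value)

    vibration = [1.0 + 0.01 * i for i in range(100)] + [2.5, 2.6, 2.7]  # slow drift, then a jump
    print(list(watch(vibration)))  # flags the jump at the end, not the ordinary drift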
The shipping company UPS has used predictive analytics since the late 2000s to monitor its fleet of 60,000 vehicles in the United States and know when to perform preventive maintenance. A breakdown on the road can cause havoc, delaying deliveries and pick-ups. So to be cautious, UPS used to replace certain parts after two or three years. But that was inefficient, as some of the parts were fine. Since switching to predictive analytics, the company has saved millions of dollars by measuring and monitoring individual parts and replacing them only when necessary. In one case, the data even revealed that an entire group of new vehicles had a defective part that would have spelled trouble had it not been spotted before they were deployed.
Similarly, sensors are affixed to bridges and buildings to watch for signs of wear and tear. They are also used in large chemical plants and refineries, where a piece of broken equipment could bring production to a standstill. The cost of collecting and analyzing the data that indicates when to take early action is lower than the cost of an outage. Note that predictive analytics may not explain the cause of a problem; it only indicates that a problem exists. It will alert you that an engine is overheating, but it may not tell you whether the overheating is due to a frayed fan belt or a poorly screwed cap. The correlations show what, not why, but as we have seen, knowing what is often good enough.
The same sort of methodology is being applied in healthcare, to prevent breakdowns of the human machine. When a hospital attaches a ganglion of tubes, wires, and instruments to a patient, a vast stream of data is generated. The electrocardiogram alone records 1,000 readings per second. And yet, remarkably, only a fraction of the data is currently used or kept. Most is just tossed away, even though it may hold important clues about the patient’s condition and response to treatments. And if kept and aggregated with other patients’ data, it could reveal extraordinary insights into which treatments tend to work and which do not.
Discarding data may have been appropriate when the cost and complexity of collecting, storing, and analyzing it were high, but this is no longer the case. Dr. Carolyn McGregor and a team of researchers at the University of Ontario Institute of Technology and IBM are working with a number of hospitals on software to help doctors make better diagnostic decisions when caring for premature babies (known as “preemies”). The software captures and processes patient data in real time, tracking 16 different data streams, such as heart rate, respiration rate, temperature, blood pressure, and blood oxygen level, which together amount to around 1,260 data points per second.
The system can detect subtle changes in the preemies’ condition that may signal the onset of infection 24 hours before overt symptoms appear. “You can’t see it with the naked eye, but a computer can,” explains Dr. McGregor. The system does not rely on causality but on correlations. It tells what, not why. But that serves its purpose. The advance warning lets doctors treat the infection earlier with lighter medical interventions, or alerts them sooner if a treatment seems ineffective. This improves patient outcomes. It is hard to imagine that this technique won’t be applied to vastly more patients and conditions in the future. The algorithm itself may not be making the decisions, but the machines are doing what machines do best so that human caregivers can do what they do best.
Strikingly, Dr. McGregor’s big-data analysis was able to identify correlations that in some ways fly in the face of physicians’ conventional wisdom. She found, for instance, that unusually constant vital signs are often detected prior to a serious infection. This is odd, since we would expect deteriorating vitals to precede a full-blown infection. One can imagine generations of doctors ending their workday by glancing at a clipboard beside the crib, seeing the infant’s vital signs stabilize, and figuring it was safe to go home—only to get a frantic call from the nursing station at midnight informing them that something had gone tragically wrong and their instincts had been misplaced.
McGregor’s data suggests that the preemies’ stability, rather than a sign of improvement, is more like the calm before the storm—as if the baby’s body is telling its tiny organs to batten down the hatches for a rough ride ahead. We can’t know for sure: what the data indicates is a correlation, not causality. But we do know that it required statistical methods applied to a huge quantity of data to reveal this hidden association. Lest there be any doubt: big data saves lives.
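One way the finding might translate into a monitoring rule (a sketch built on invented numbers and thresholds, not Dr. McGregor's actual system) is to flag a vital sign whose recent variability drops suspiciously low:

    import statistics

    def low_variability_alert(heart_rates, window=60, min_std=1.5):
        """Return True if the most recent window of readings is suspiciously flat."""
        if len(heart_rates) < window:
            return False
        return statistics.pstdev(heart_rates[-window:]) < min_std

    normal = [120 + (i % 7) - 3 for i in range(120)]   # ordinary beat-to-beat fluctuation
    too_calm = [122.0] * 120                           # the "calm before the storm"
    print(low_variability_alert(normal), low_variability_alert(too_calm))  # False True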
Illusions and illuminations
In a small-data world, because so little data tended to be available, both causal investigations and correlation analysis began with a hypothesis, which was then tested to be either falsified or verified. But because both methods required a hypothesis to start with, both were equally susceptible to prejudice and erroneous intuition. And the necessary data often was not available. Today, with so much data around and more to come, such hypotheses are no longer crucial for correlational analysis.
There is another difference, which is just starting to gain importance. Before big data, partly because of inadequate computing power, most correlational analysis using large data sets was limited to looking for linear relationships. In reality, of course, many relationships are far more complex. With more sophisticated analyses, we can identify non-linear relationships among data.
As one example, for many years economists and political scientists believed that happiness and income were directly correlated: increase the income and a person on average will get happier. Looking at the data on a chart, however, reveals that a more complex dynamic is at play. For income levels below a certain threshold, every rise in income translates into a substantial rise in happiness, but above that level increases in income barely improve a person’s happiness. If we were to plot this on a graph, the line would appear as a curve rather than the straight line assumed by linear analysis.
The finding was important for policymakers. If it were a linear relationship, it would make sense to raise everyone’s income to improve overall happiness. But once the non-linear association was identified, the advice changed to focus on income increases for the poor, since the data showed that this would yield more bang for the buck.
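An illustrative way to see the difference (with made-up numbers, not the studies' data) is to fit the same data with a straight line and with a diminishing-returns curve, and compare how much of the variation each explains:

    import numpy as np

    income = np.linspace(5_000, 200_000, 50)
    happiness = np.log(income)                    # a saturating, curve-shaped relationship

    linear_fit = np.polyfit(income, happiness, 1)        # happiness ~ a*income + b
    log_fit = np.polyfit(np.log(income), happiness, 1)   # happiness ~ a*log(income) + b

    def r_squared(y, y_hat):
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

    print(r_squared(happiness, np.polyval(linear_fit, income)))       # clearly below 1
    print(r_squared(happiness, np.polyval(log_fit, np.log(income))))  # essentially 1.0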
And it gets even more complex, such as when the correlational relationship is more multi-faceted. For instance, researchers at Harvard and MIT examined the disparity of measles immunizations among the population: some groups get vaccinated while others don’t. At first this disparity seemed to be correlated with the amount people spend on healthcare. Yet a closer look revealed that the correlation is not a neat line; it is an oddly shaped curve. As people spend more money on healthcare, the immunization disparity goes down (as may be expected), but as they spend even more, it surprisingly goes up again—some of the very affluent seem to shy away from measles shots. For public health officials this is crucial to know, but simple linear correlation analysis would not have caught this.
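The point can be seen with a small synthetic example (not the Harvard/MIT data): a relationship that falls and then rises again can show almost no linear correlation at all, even though the pattern is obvious once a curve is fitted.

    import numpy as np

    spending = np.linspace(0, 10, 101)
    disparity = (spending - 5) ** 2          # goes down, then back up

    linear_r = np.corrcoef(spending, disparity)[0, 1]
    curve = np.polyfit(spending, disparity, 2)   # a quadratic fit catches the shape

    print(round(float(linear_r), 3))   # ~0.0: a linear test sees "no relationship"
    print(np.round(curve, 3))          # recovers the curve: roughly [1, -10, 25]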
Experts are just now developing the necessary tools to identify and compare non-linear correlations. At the same time, the techniques of correlational analysis are being aided and enhanced by a fast-growing set of novel approaches and software that can tease out non-causal relationships in data from many different angles—rather like the way cubist painters tried to capture the image of a woman’s face from multiple viewpoints at once. One of the most vibrant new methods can be found in the burgeoning field of network analysis. This makes it possible to map, measure, and calculate the nodes and links for everything from one’s friends on Facebook, to which court decisions cite which precedents, to who calls whom on their cellphones. Together these tools help answer non-causal, empirical questions.
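A minimal sketch of network analysis, using the open-source networkx library as an assumed tool (the book names none) and invented names: nodes are people, edges are who calls whom, and the analysis measures the network's structure without asking why anyone calls anyone.

    import networkx as nx

    calls = [("Ann", "Bo"), ("Ann", "Cy"), ("Bo", "Cy"), ("Cy", "Dee"), ("Dee", "Ed")]
    G = nx.Graph()
    G.add_edges_from(calls)

    # Map and measure the nodes and links: who is most connected, and how
    # any two people are linked (what, not why).
    print(nx.degree_centrality(G))            # relative connectedness of each person
    print(nx.shortest_path(G, "Ann", "Ed"))   # the chain linking two people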
Ultimately, in the age of big data, these new types of analyses will lead to a wave of novel insights and helpful predictions. We will see links we never saw before. We will grasp complex technical and social dynamics that have long escaped our comprehension despite our best efforts. But most important, these non-causal analyses will aid our understanding of the world by primarily asking what rather than why.
At first, this may sound counterintuitive. After all, as humans, we desire to make sense of the world through causal links; we want to believe that every effect has a cause, if we only look closely enough. Shouldn’t that be our highest aspiration, to know the reasons that underlie the world?
To be sure, there is a philosophical debate going back centuries over whether causality even exists. If everything were caused by something else, then logic dictates that we would not be free to decide anything. Human volition would not exist, as every decision we made and every thought we had would be caused by something else that, in turn, was the effect of another cause, and so forth. The trajectory of all life would simply be determined by causes leading to effects. Hence philosophers have bickered over the role of causality in our world, and at times pitted it against free will. That abstract debate, however, is not what we’re after here.
Rather, when we say that humans see the world through causalities, we’re referring to two fundamental ways humans explain and understand the world: through quick, illusory causality; and via slow, methodical causal experiments. Big data will transform the roles of both.
First is our intuitive desire to see causal connections. We are biased to assume causes even where none exist. This isn’t due to culture or upbringing or level of education. Rather, research suggests, it is a matter of how human cognition works. When we see two events happen one after the other, our minds have a great urge to see them in causal terms.