Pat Helland, one of the world’s foremost authorities on database design, describes this fundamental shift in a paper entitled “If You Have Too Much Data, Then ‘Good Enough’ Is Good Enough.” After identifying some of the core principles of traditional design that have been eroded by messy data of varying provenance and accuracy, he lays out the consequences: “We can no longer pretend to live in a clean world.” Processing big data entails an inevitable loss of information—Helland calls it “lossy.” But it makes up for that by yielding a quick result. “It’s OK if we have lossy answers—that’s frequently what business needs,” concludes Helland.
Traditional database design promises to deliver consistent results across time. If you ask for your bank account balance, for example, you expect to receive the exact amount. And if you query it a few seconds later, you want the system to provide the same result, assuming nothing has changed. Yet as the quantity of data collected grows and the number of users who access the system increases, this consistency becomes harder to maintain.
Large datasets do not exist in any one place; they tend to be split up across multiple hard drives and computers. To ensure reliability and speed, a record may be stored in two or three separate locations. If you update the record at one location, the data in the other locations is no longer correct until you update it too. While traditional systems would have a delay until all updates are made, that is less practical when data is broadly distributed and the server is pounded with tens of thousands of queries per second. Instead, accepting messiness is a kind of solution.
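To make the tradeoff concrete, here is a minimal Python sketch of asynchronous replication, with invented names and numbers and no resemblance to any particular database’s implementation: a write reaches one copy immediately and the other copies only later, so a read served by a lagging copy may return a stale value.

```python
import random

# A minimal sketch of asynchronous replication (invented names, not any
# particular database): a write lands on one copy immediately and reaches
# the other copies only when propagate() runs, so a read served by a
# lagging copy can return a stale value in the meantime.
replicas = [{"balance": 100}, {"balance": 100}, {"balance": 100}]

def write(key, value):
    replicas[0][key] = value                 # the primary copy is updated at once

def propagate(key):
    for replica in replicas[1:]:             # eventually all copies converge
        replica[key] = replicas[0][key]

def read(key):
    return random.choice(replicas)[key]      # any copy may serve the read

write("balance", 250)
print(read("balance"))    # may print 100 (stale) or 250, until propagate() runs
propagate("balance")
print(read("balance"))    # now always 250
```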
The shift is typified by the popularity of Hadoop, an open-source counterpart to Google’s MapReduce system that is very good at processing large quantities of data. It does this by breaking the data down into smaller chunks and parceling them out to other machines. It expects that hardware will fail, so it builds redundancy in. It presumes that the data is not clean and orderly—in fact, it assumes that the data is too huge to be cleaned before processing. Where typical data analysis requires an operation called “extract, transform, and load,” or ETL, to move the data to where it will be analyzed, Hadoop dispenses with such niceties. Instead, it takes for granted that the quantity of data is so breathtakingly enormous that it can’t be moved and must be analyzed where it is.
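The split-and-merge idea at the heart of MapReduce and Hadoop can be illustrated with a toy word count in Python. This is only a sketch of the concept, not Hadoop’s actual API; the corpus and worker count are made up.

```python
from collections import Counter
from multiprocessing import Pool

# A toy sketch of the split-apply-combine idea behind MapReduce and Hadoop
# (not Hadoop's actual API): split the data into chunks, count words in each
# chunk independently (map), then merge the partial counts (reduce). Real
# Hadoop spreads the chunks across many machines and replicates them so the
# job survives hardware failures.
def map_chunk(lines):
    return Counter(word for line in lines for word in line.split())

def word_count(lines, workers=3):
    chunk_size = max(1, len(lines) // workers)
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(workers) as pool:
        partial_counts = pool.map(map_chunk, chunks)   # map step
    return sum(partial_counts, Counter())              # reduce step

if __name__ == "__main__":
    corpus = ["big data is messy", "messy data is good enough", "good enough is fast"]
    print(word_count(corpus).most_common(3))
```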
Hadoop’s output isn’t as precise as that of relational databases: it can’t be trusted to launch a spaceship or to certify bank-account details. But for many less critical tasks, where an ultra-precise answer isn’t needed, it does the trick far faster than the alternatives. Think of tasks like segmenting a list of customers to send some of them a special marketing campaign. Using Hadoop, the credit-card company Visa was able to reduce the processing time for two years’ worth of test records, some 73 billion transactions, from one month to a mere 13 minutes. That sort of acceleration of processing is transformative to businesses.
The experience of ZestFinance, a company founded by the former chief information officer of Google, Douglas Merrill, underscores the point. Its technology helps lenders decide whether or not to offer relatively small, short-term loans to people who seem to have poor credit. Yet where traditional credit scoring is based on just a handful of strong signals like previous late payments, ZestFinance analyzes a huge number of “weaker” variables. In 2012 it boasted a loan default rate that was a third less than the industry average. But the only way to make the system work is to embrace messiness.
“One of the interesting things,” says Merrill, “is that there are no people for whom all fields are filled in—there’s always a large amount of missing data.” The matrix from the information ZestFinance gathers is incredibly sparse, a database file teeming with missing cells. So the company “imputes” the missing data. For instance, about 10 percent of ZestFinance’s customers are listed as dead—but as it turns out, that doesn’t affect repayment. “So, obviously, when preparing for the zombie apocalypse, most people assume no debt will get repaid. But from our data, it looks like zombies pay back their loans,” adds Merrill with a wink.
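For illustration only, and not ZestFinance’s actual method, the sketch below shows one simple way to impute missing cells in a sparse table: fill each absent numeric field with the mean of the values that are present. The applicant fields and figures are invented.

```python
# A generic illustration of imputing missing cells in a sparse table of
# applicant data (invented fields and figures, not ZestFinance's actual
# method): fill each missing numeric field with the mean of the values
# that are present for that field.
applicants = [
    {"income": 32000, "years_at_job": 2,    "late_payments": 1},
    {"income": None,  "years_at_job": 5,    "late_payments": 0},
    {"income": 41000, "years_at_job": None, "late_payments": None},
]

def impute_mean(rows, field):
    present = [row[field] for row in rows if row[field] is not None]
    mean = sum(present) / len(present)
    for row in rows:
        if row[field] is None:
            row[field] = mean                 # the "imputed" value

for field in ("income", "years_at_job", "late_payments"):
    impute_mean(applicants, field)

print(applicants)
```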
In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. According to some estimates only 5 percent of all digital data is “structured”—that is, in a form that fits neatly into a traditional database. Without accepting messiness, the remaining 95 percent of unstructured data, such as web pages and videos, remain dark. By allowing for imprecision, we open a window into an untapped universe of insights.
Society has made two implicit tradeoffs that have become so ingrained in the way we act that we don’t even see them as tradeoffs anymore, but as the natural state of things. First, we presume that we can’t use far more data, so we don’t. But the constraint is increasingly less relevant, and there is much to be gained by using something approaching N=all.
The second tradeoff is over the quality of information. It was rational to privilege exactitude in an era of small data, when because we only collected a little information its accuracy had to be as high as possible. In many cases, that may still matter. But for many other things, rigorous accuracy is less important than getting a quick grasp of their broad outlines or progress over time.
The way we think about using the totality of information compared with smaller slivers of it, and the way we may come to appreciate slackness instead of exactness, will have profound effects on our interaction with the world. As big-data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of N=all of the mind. And we may tolerate blurriness and ambiguity in areas where we used to demand clarity and certainty, even if it had been a false clarity and an imperfect certainty. We may accept this provided that in return we get a more complete sense of reality—the equivalent of an impressionist painting, wherein each stroke is messy when examined up close, but by stepping back one can see a majestic picture.
Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy. Admittedly, the appeal of “some” and “certain” is understandable. Our comprehension of the world may have been incomplete and occasionally wrong when we were limited in what we could analyze, but there was a comfortable certainty about it, a reassuring stability. Besides, because we were stunted in the data that we could collect and examine, we didn’t face the same compulsion to get everything, to see everything from every possible angle. And in the narrow confines of small data, we could pride ourselves on our precision—even if by measuring the minutiae to the nth degree, we missed the bigger picture.
Ultimately, big data may require us to change, to become more comfortable with disorder and uncertainty. The structures of exactitude that seem to give us bearings in life—that the round peg goes into the round hole; that there is only one answer to a question—are more malleable than we may admit; and yet admitting, even embracing, this plasticity brings us closer to reality.
As radical a transformation as these shifts in mindset are, they lead to a third change that has the potential to upend an even more fundamental convention on which society is based: the idea of understanding the reasons behind all that happens. Instead, as the next chapter will explain, finding associations in data and acting on them may often be good enough.
4
CORRELATION
GREG LINDEN WAS 24 years old in 1997 when he took time off from his PhD research in artificial intelligence at the University of Washington to work at a local Internet startup selling books online. It had only been open for two years but was doing a brisk business. “I loved the idea of selling books and selling knowledge—and helping people find the next piece of knowledge they wanted to enjoy,” he reminisces. The store was Amazon.com, and it hired Linden as a software engineer to make sure the site ran smoothly.
Amazon didn’t just have techies on its staff. At the time, it also employed a dozen or so book critics and editors to write reviews and suggest new titles. While the story of Amazon is familiar to many people, fewer remember that its content was originally crafted by human hand. The editors and critics evaluated and chose the titles featured on Amazon’s web pages. They were responsible for what was called “the Amazon voice”—considered one of the company’s crown jewels and a source of its competitive advantage. An article in the Wall Street Journal around that time feted them as the nation’s most influential book critics, since they drove so many sales.
Then Jeff Bezos, Amazon’s founder and CEO, began to experiment with a potent idea: What if the company could recommend specific books to customers based on their individual shopping preferences? From its start, Amazon had captured reams of data on all its customers: what they purchased, what books they only looked at but didn’t buy, and how long they looked at them. What books they bought together.
The quantity of data was so huge that at first Amazon processed it the conventional way, by taking a sample and analyzing it to find similarities among customers. The resulting recommendations were crude. Buy a book on Poland and you’d be bombarded with Eastern European fare. Purchase one about babies and you’d be inundated with more of the same. “They tended to offer you tiny variations on your previous purchase, ad infinitum,” recalled James Marcus, an Amazon book reviewer from 1996 to 2001, in his memoir, Amazonia. “It felt as if you had gone shopping with the village idiot.”
Greg Linden saw a solution. He realized that the recommendation system didn’t actually need to compare people with other people, a task that was technically cumbersome. All it needed to do was find associations among products themselves. In 1998 Linden and his colleagues applied for a patent on “item-to-item” collaborative filtering, as the technique is known. The shift in approach made a big difference.
Because the calculations could be done ahead of time, the recommendations were lightning fast. The method was also versatile, able to work across product categories. So when Amazon branched out to sell items other than books, it could suggest movies or toasters too. And the recommendations were much better than before because the system used all the data. “The joke in the group was if it were working perfectly, Amazon should just show you one book—which is the next book you’re going to buy,” Linden recalls.
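A simplified sketch of the item-to-item idea, with invented customers and titles and making no claim to match Amazon’s patented implementation, might look like this: score pairs of products by how often the same customers bought both, normalize by each product’s popularity, and precompute the rankings so recommendations can be served instantly.

```python
from collections import defaultdict
from math import sqrt

# A simplified sketch of item-to-item collaborative filtering (invented
# customers and titles, not Amazon's patented implementation): score each
# pair of products by how often the same customers bought both, normalized
# by each product's popularity, and precompute the rankings offline so
# recommendations can be served instantly.
purchases = {
    "ann":   {"Hemingway", "Fitzgerald"},
    "bob":   {"Hemingway", "Fitzgerald", "Cookbook"},
    "carol": {"Cookbook", "Gardening"},
}

def item_similarities(purchases):
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    items = list(buyers)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(buyers[a] & buyers[b])
            if shared:
                score = shared / sqrt(len(buyers[a]) * len(buyers[b]))
                sims[a][b] = sims[b][a] = score
    return sims

def recommend(item, sims, top_n=2):
    ranked = sorted(sims[item].items(), key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in ranked[:top_n]]

sims = item_similarities(purchases)     # computed ahead of time
print(recommend("Hemingway", sims))     # e.g. ['Fitzgerald', 'Cookbook']
```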
Now the company had to decide what should appear on the site. Machine-generated content like personal recommendations and bestseller lists, or reviews written by Amazon’s in-house editorial staff? What the clicks said, or what the critics said? It was a battle of mice and men.
When Amazon ran a test comparing sales produced by human editors with sales produced by computer-generated content, the results were not even close. The data-derived material generated vastly more sales. The computer may not have known why a customer who read Ernest Hemingway might also like to buy F. Scott Fitzgerald. But that didn’t seem to matter. The cash register was ringing. Eventually the editors were presented with the precise percentage of sales Amazon had to forgo when it featured their reviews online and the group was disbanded. “I was very sad about the editorial team getting beaten,” recalls Linden. “But the data doesn’t lie, and the cost was very high.”
Today a third of all of Amazon’s sales are said to result from its recommendation and personalization systems. With these systems, Amazon has driven many competitors out of business: not only large bookstores and music stores, but also local booksellers who thought their personal touch would insulate them from the winds of change. In fact, Linden’s work revolutionized e-commerce, as the method has been adopted by almost everyone. For Netflix, an online film rental company, three-fourths of new orders come from recommendations. Following Amazon’s lead, thousands of websites are able to recommend products, content, friends, and groups without knowing why people are likely to be interested in them.
Knowing why might be pleasant, but it’s unimportant for stimulating sales. Knowing what, however, drives clicks. This insight has the power to reshape many industries, not just e-commerce. Salespeople in all sectors have long been told that they need to understand what makes customers tick, to grasp the reasons behind their decisions. Professional skills and years of experience have been highly valued. Big data shows that there is another, in some ways more pragmatic approach. Amazon’s innovative recommendation systems teased out valuable correlations without knowing the underlying causes. Knowing what, not why, is good enough.
Predictions and predilections
Correlations are useful in a small-data world, but in the context of big data they really shine. Through them we can glean insights more easily, faster, and more clearly than before.
At its core, a correlation quantifies the statistical relationship between two data values. A strong correlation means that when one of the data values changes, the other is highly likely to change as well. We have seen such strong correlations with Google Flu Trends: the more people in a particular geographic place search for particular terms through Google, the more people in that location have the flu. Conversely, a weak correlation means that when one data value changes little happens to the other. For instance, we could run correlations on individuals’ hair length and happiness and find that hair length is not especially useful in telling us much about happiness.
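A worked toy example, with invented numbers, shows what strong and weak look like in practice using Pearson’s r, the most common measure of correlation:

```python
from math import sqrt

# A worked toy example (invented numbers) of strong versus weak correlation,
# using Pearson's r: near 1 when two series rise together, near 0 when there
# is no clear pattern between them.
def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

flu_searches = [10, 20, 30, 40, 50]
flu_cases    = [12, 21, 33, 39, 52]     # rises with searches: strong correlation
hair_length  = [5, 40, 12, 30, 22]
happiness    = [6, 7, 8, 6, 7]          # no clear pattern: weak correlation

print(round(pearson(flu_searches, flu_cases), 2))   # close to 1
print(round(pearson(hair_length, happiness), 2))    # close to 0
```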
Correlations let us analyze a phenomenon not by shedding light on its inner workings but by identifying a useful proxy for it. Of course, even strong correlations are never perfect. It is quite possible that two things may behave similarly just by coincidence. We may simply be “fooled by randomness,” to borrow a phrase from the empiricist Nassim Nicholas Taleb. With correlations, there is no certainty, only probability. But if a correlation is strong, the likelihood of a link is high. Many Amazon customers can attest to this by pointing to a bookshelf laden with the company’s recommendations.
By letting us identify a really good proxy for a phenomenon, correlations help us capture the present and predict the future: if A often takes place together with B, we need to watch out for B to predict that A will happen. Using B as a proxy helps us capture what is probably taking place with A, even if we can’t measure or observe A directly. Importantly, it also helps us predict what may happen to A in the future. Of course, correlations cannot foretell the future, they can only predict it with a certain likelihood. But that ability is extremely valuable.
Consider the case of Walmart. It is the largest retailer in the world, with more than two million employees and annual sales of around $450 billion—a sum greater than the GDP of four-fifths of the world’s countries. Before the Web brought forth so much data, the company held perhaps the biggest set of data in corporate America. In the 1990s it revolutionized retailing by recording every product as data through a system called Retail Link. This let its merchandise suppliers monitor the rate and volume of sales and inventory. Creating this transparency enabled the company to force suppliers to handle stocking themselves. In many cases Walmart does not take “ownership” of a product until the point of sale, thereby shedding its inventory risk and reducing its costs. Walmart used data to become, in effect, the world’s largest consignment shop.
What could all that historical data reveal if analyzed in the right way? The retailer worked with expert number-crunchers from Teradata, formerly the venerable National Cash Register Company, to uncover interesting correlations. In 2004 Walmart peered into its mammoth databases of past transactions: what item each customer bought and the total cost, what else was in the shopping basket, the time of day, even the weather. By doing so, the company noticed that prior to a hurricane, not only did sales of flashlights increase, but so did sales of Pop-Tarts, a sugary American breakfast snack. So as storms approached, Walmart stocked boxes of Pop-Tarts at the front of stores next to the hurricane supplies, to make life easier for customers dashing in and out—and boosted its sales.
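The kind of comparison that surfaces such a finding can be sketched in a few lines. The baskets below are invented, and the code illustrates only the idea, not Walmart’s or Teradata’s actual analysis:

```python
# A sketch (with invented baskets) of the kind of comparison that surfaces
# such a correlation: how often an item shows up in baskets bought just
# before a hurricane versus in baskets bought on ordinary days. This is an
# illustration of the idea only, not Walmart's or Teradata's actual analysis.
baskets = [
    {"items": {"flashlight", "pop-tarts", "water"}, "pre_hurricane": True},
    {"items": {"pop-tarts", "batteries"},           "pre_hurricane": True},
    {"items": {"milk", "bread"},                    "pre_hurricane": False},
    {"items": {"pop-tarts", "milk"},                "pre_hurricane": False},
    {"items": {"bread", "eggs"},                    "pre_hurricane": False},
    {"items": {"water", "bread"},                   "pre_hurricane": False},
]

def purchase_rate(item, pre_hurricane):
    relevant = [b for b in baskets if b["pre_hurricane"] == pre_hurricane]
    hits = sum(item in b["items"] for b in relevant)
    return hits / len(relevant)

storm_rate = purchase_rate("pop-tarts", True)      # 1.00 in this toy data
normal_rate = purchase_rate("pop-tarts", False)    # 0.25 in this toy data
print(f"Pop-Tarts lift before hurricanes: {storm_rate / normal_rate:.1f}x")
```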
In the past, someone at headquarters would have needed the hunch beforehand in order to gather the data and test the idea. Now, by having so much data and better tools, the correlations surface more quickly and inexpensively. (That said, one must be cautious: when the number of data points increases by orders of magnitude, we also see more spurious correlations—phenomena that appear to be connected even though they aren’t. This requires us to take extra care, as we are just beginning to appreciate.)
Long before big data, correlation analysis proved valuable. The concept was set forth in 1888 by Sir Francis Galton, a cousin of Charles Darwin, after he had noticed a relationship between men’s height and the length of their forearms. The mathematics behind it is relatively straightforward and robust—which turns out to be one of its essential features, and which has helped make it one of the most widely used statistical measures. Yet before big data, its usefulness was limited. Because data was scarce and collecting it expensive, statisticians often chose a proxy, then collected the relevant data and ran the correlation analysis to find out how good that proxy was. But how to select the right proxy?
To guide them, experts used hypotheses driven by theories—abstract ideas about how something works. Based on such hypotheses, they collected data and used correlation analysis to verify whether the proxies were suitable. If they weren’t, then the researchers often tried again, stubbornly, in case the data had been collected wrongly, before finally conceding that the hypothesis they had started with, or even the theory it was based on, was flawed and required amendment. Knowledge progressed through this hypothesis-driven trial and error. And it did so slowly, as our individual and collective biases clouded what hypotheses we developed, how we applied them, and thus what proxies we picked. It was a cumbersome process, but workable in a small-data world.