Big Data: A Revolution That Will Transform How We Live, Work, and Think


by Viktor Mayer-Schönberger


  So more trumps less. And sometimes more trumps smarter. What then of messy? A few years after Banko and Brill shoveled in all that data, researchers at rival Google were thinking along similar lines—but at an even larger scale. Instead of testing algorithms with a billion words, they used a trillion. Google did this not to develop a grammar checker but to crack an even more complex nut: language translation.

  So-called machine translation has been a vision of computer pioneers since the dawn of computing in the 1940s, when the devices were made of vacuum tubes and filled an entire room. The idea took on a special urgency during the Cold War, when the United States captured vast amounts of written and spoken material in Russian but lacked the manpower to translate it quickly.

  At first, computer scientists opted for a combination of grammatical rules and a bilingual dictionary. An IBM computer translated sixty Russian phrases into English in 1954, using 250 word pairs in the computer’s vocabulary and six rules of grammar. The results were very promising. “Mi pyeryedayem mislyi posryedstvom ryechyi,” was entered into the IBM 701 machine via punch cards, and out came “We transmit thoughts by means of speech.” The sixty sentences were “smoothly translated,” according to an IBM press release celebrating the occasion. The director of the research program, Leon Dostert of Georgetown University, predicted that machine translation would be “an accomplished fact” within “five, perhaps three years hence.”

  But the initial success turned out to be deeply misleading. By 1966 a committee of machine-translation grandees had to admit failure. The problem was harder than they had realized it would be. Teaching computers to translate is about teaching them not just the rules, but the exceptions too. Translation is not just about memorization and recall; it is about choosing the right words from many alternatives. Is “bonjour” really “good morning”? Or is it “good day,” or “hello,” or “hi”? The answer is, it depends. . . .

  In the late 1980s, researchers at IBM had a novel idea. Instead of trying to feed explicit linguistic rules into a computer, together with a dictionary, they decided to let the computer use statistical probability to calculate which word or phrase in one language is the most appropriate one in another. In the 1990s IBM’s Candide project used ten years’ worth of Canadian parliamentary transcripts published in French and English—about three million sentence pairs. Because they were official documents, the translations had been done to an extremely high quality. And by the standards of the day, the amount of data was huge. Statistical machine translation, as the technique became known, cleverly turned the challenge of translation into one big mathematics problem. And it seemed to work. Suddenly, computer translation got a lot better. After the success of that conceptual leap, however, IBM only eked out small improvements despite throwing in lots of money. Eventually IBM pulled the plug.
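
  The gist of the statistical approach can be sketched in a few lines of Python. This is only a toy illustration of the co-occurrence counting behind such systems, not IBM's actual Candide models; the sentence pairs and word splits below are invented for illustration.

```python
from collections import defaultdict

# Toy sentence-aligned corpus, a stand-in for the kind of French-English
# parliamentary transcripts Candide trained on. The pairs are invented.
pairs = [
    ("bonjour madame", "good morning madam"),
    ("bonjour monsieur", "hello sir"),
    ("merci beaucoup madame", "thank you very much madam"),
]

# Count how often each French word co-occurs with each English word in
# aligned sentences, then normalize the counts into rough probabilities.
cooccurrence = defaultdict(lambda: defaultdict(int))
for fr, en in pairs:
    for f in fr.split():
        for e in en.split():
            cooccurrence[f][e] += 1

def translation_probabilities(french_word):
    counts = cooccurrence[french_word]
    total = sum(counts.values())
    return {e: round(c / total, 2) for e, c in counts.items()}

# "bonjour" is split across "good", "morning", "hello", "sir", "madam":
# the data, not a hand-written rule, decides which rendering is likely.
print(translation_probabilities("bonjour"))
```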

  But less than a decade later, in 2006, Google got into translation, as part of its mission to “organize the world’s information and make it universally accessible and useful.” Instead of nicely translated pages of text in two languages, Google availed itself of a larger but also much messier dataset: the entire global Internet and more. Its system sucked in every translation it could find, in order to train the computer. In went corporate websites in multiple languages, identical translations of official documents, and reports from intergovernmental bodies like the United Nations and the European Union. Even translations of books from Google’s book-scanning project were included. Where Candide had used three million carefully translated sentences, Google’s system harnessed billions of pages of translations of widely varying quality, according to the head of Google Translate, Franz Josef Och, one of the foremost authorities in the field. Its trillion-word corpus amounted to 95 billion English sentences, albeit of dubious quality.

  Despite the messiness of the input, Google’s service works the best. Its translations are more accurate than those of other systems (though still highly imperfect). And it is far, far richer. By mid-2012 its dataset covered more than 60 languages. It could even accept voice input in 14 languages for fluid translations. And because it treats language simply as messy data with which to judge probabilities, it can even translate between languages, such as Hindi and Catalan, in which there are very few direct translations to develop the system. In those cases it uses English as a bridge. And it is far more flexible than other approaches, since it can add and subtract words as they come in and out of usage.

  The reason Google’s translation system works well is not that it has a smarter algorithm. It works well because its creators, like Banko and Brill at Microsoft, fed in more data—and not just of high quality. Google was able to use a dataset tens of thousands of times larger than IBM’s Candide because it accepted messiness. The trillion-word corpus Google released in 2006 was compiled from the flotsam and jetsam of Internet content—“data in the wild,” so to speak. This was the “training set” by which the system could calculate the probability that, for example, one word in English follows another. It was a far cry from the grandfather in the field, the famous Brown Corpus of the 1960s, which totaled one million English words. Using the larger dataset enabled great strides in natural-language processing, upon which systems for tasks like voice recognition and computer translation are based. “Simple models and a lot of data trump more elaborate models based on less data,” wrote Google’s artificial-intelligence guru Peter Norvig and colleagues in a paper entitled “The Unreasonable Effectiveness of Data.”
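
  The word-sequence probabilities described here come from nothing more exotic than counting. Below is a minimal sketch, with an invented scrap of “data in the wild” standing in for the trillion-word corpus.

```python
from collections import Counter, defaultdict

# A scrap of unfiltered text, typos and fragments included, standing in
# for the web-scale "data in the wild" described above.
raw_text = """the cat sat on the mat
teh cat sat on a mat incomplete sentence
the cat slept"""

words = raw_text.split()

# Count bigrams: how often each word is followed by each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigram_counts[prev][nxt] += 1

def probability_follows(prev, nxt):
    """Estimated probability that `nxt` follows `prev` in this corpus."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(probability_follows("the", "cat"))  # roughly 0.67 on this toy corpus
```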

  As Norvig and his co-authors explained, messiness was the key: “In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks.”

  More trumps better

  Messiness is difficult to accept for the conventional sampling analysts, who for all their lives have focused on preventing and eradicating messiness. They work hard to reduce error rates when collecting samples, and to test the samples for potential biases before announcing their results. They use multiple error-reducing strategies, including ensuring that samples are collected according to an exact protocol and by specially trained experts. Such strategies are costly to implement even for limited numbers of data points, and they are hardly feasible for big data. Not only would they be far too expensive, but exacting standards of collection are unlikely to be achieved consistently at such scale. Even excluding human interaction would not solve the problem.

  Moving into a world of big data will require us to change our thinking about the merits of exactitude. To apply the conventional mindset of measurement to the digital, connected world of the twenty-first century is to miss a crucial point. As mentioned earlier, the obsession with exactness is an artifact of the information-deprived analog era. When data was sparse, every data point was critical, and thus great care was taken to avoid letting any point bias the analysis.

  Today we don’t live in such an information-starved situation. In dealing with ever more comprehensive datasets, which capture not just a small sliver of the phenomenon at hand but much more or all of it, we no longer need to worry so much about individual data points biasing the overall analysis. Rather than aiming to stamp out every bit of inexactitude at increasingly high cost, we are calculating with messiness in mind.

  Take the way sensors are making their way into factories. At BP’s Cherry Point Refinery in Blaine, Washington, wireless sensors are installed throughout the plant, forming an invisible mesh that produces vast amounts of data in real time. The environment of intense heat and electrical machinery might distort the readings, resulting in messy data. But the huge quantity of information generated from both wired and wireless sensors makes up for those hiccups. Just increasing the frequency and number of locations of sensor readings can offer a big payoff. By measuring the stress on pipes at all times rather than at certain intervals, BP learned that some types of crude oil are more corrosive than others—a quality it couldn’t spot, and thus couldn’t counteract, when its dataset was smaller.
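
  The statistical point here, that a flood of noisy readings can reveal a trend that sparse measurements miss, can be illustrated with a hypothetical simulation. The stress values, drift rate, and noise level below are invented, not BP’s data.

```python
import random
import statistics

random.seed(0)

# Hypothetical pipe-stress sensor: a slow upward drift (a corrosion signal)
# buried under heavy measurement noise from heat and electrical machinery.
def reading(hour, drift_per_hour=0.001, noise_sd=5.0):
    true_stress = 100.0 + drift_per_hour * hour
    return true_stress + random.gauss(0, noise_sd)

hours = range(24 * 365)  # one year of hourly readings

# Sparse sampling, one reading a month: the drift is lost in the noise.
monthly = [round(reading(h), 1) for h in hours if h % (24 * 30) == 0]

# Dense sampling, every hour: averaging the two halves of the year exposes
# the drift even though each individual reading is messy.
dense = [reading(h) for h in hours]
half = len(dense) // 2
print("monthly samples:", monthly[:6], "...")
print("first/second half-year means:",
      round(statistics.mean(dense[:half]), 2),
      round(statistics.mean(dense[half:]), 2))
```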

  When the quantity of data is vastly larger and is of a new type, exactitude in some cases is no longer the goal so long as we can divine the general trend. Moving to a large scale changes not only the expectations of precision but the practical ability to achieve exactitude. Though it may seem counterintuitive at first, treating data as something imperfect and imprecise lets us make superior forecasts, and thus understand our world better.

  It bears noting that messiness is not inherent to big data. Instead it is a function of the imperfection of the tools we use to measure, record, and analyze information. If the technology were to somehow become perfect, the problem of inexactitude would disappear. But as long as it is imperfect, messiness is a practical reality we must deal with. And it is likely to be with us for a long time. Painstaking efforts to increase accuracy often won’t make economic sense, since the value of having far greater amounts of data is more compelling. Just as statisticians in an earlier era put aside their interest in larger sample sizes in favor of more randomness, we can live with a bit of imprecision in return for more data.

  The Billion Prices Project offers an intriguing case in point. Every month the U.S. Bureau of Labor Statistics publishes the consumer price index, or CPI, which is used to calculate the inflation rate. The figure is crucial for investors and businesses. The Federal Reserve considers it when deciding whether to raise or lower interest rates. Companies base salary increases on inflation. The federal government uses it to index payments like Social Security benefits and the interest it pays on certain bonds.

  To get the figure, the Bureau of Labor Statistics employs hundreds of staff to call, fax, and visit stores and offices in 90 cities across the nation and report back about 80,000 prices on everything from tomatoes to taxi fares. Producing it costs around $250 million a year. For that sum, the data is neat, clean, and orderly. But by the time the numbers come out, they’re already a few weeks old. As the 2008 financial crisis showed, a few weeks can be a terribly long lag. Decision-makers need quicker access to inflation numbers in order to react to them better, but they can’t get it with conventional methods focused on sampling and prizing precision.

  In response, two economists at the Massachusetts Institute of Technology, Alberto Cavallo and Roberto Rigobon, came up with a big-data alternative by steering a much messier course. Using software to crawl the Web, they collected half a million prices of products sold in the U.S. every single day. The information is messy, and not all the data points collected are easily comparable. But by combining the big-data collection with clever analysis, the project was able to detect a deflationary swing in prices immediately after Lehman Brothers filed for bankruptcy in September 2008, while those who relied on the official CPI data had to wait until November to see it.
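
  The core calculation is simple even if the collection is messy. Here is a minimal sketch of a daily, web-scraped price change; the products, prices, and dates are invented, and this is an illustration of the general idea rather than Cavallo and Rigobon’s actual methodology.

```python
import statistics

# Hypothetical scraped prices: product -> {date: price}. Messy in practice,
# since products appear and disappear, so only matched pairs are compared.
prices = {
    "milk_1l": {"2008-09-14": 1.00,  "2008-09-15": 0.99},
    "tv_40in": {"2008-09-14": 499.0, "2008-09-15": 479.0},
    "toaster": {"2008-09-14": 25.0},                  # missing the next day
}

def daily_price_change(prices, day, next_day):
    """Average day-over-day price relative across products seen on both days."""
    relatives = [series[next_day] / series[day]
                 for series in prices.values()
                 if day in series and next_day in series]
    return statistics.mean(relatives) - 1.0           # -0.02 means -2 percent

print(daily_price_change(prices, "2008-09-14", "2008-09-15"))
```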

  The MIT project has spun off a commercial venture called PriceStats that banks and others use to make economic decisions. It compiles millions of products sold by hundreds of retailers in more than 70 countries every day. Of course, the figures require careful interpretation, but they are better than the official statistics at indicating trends in inflation. Because there are more prices and the figures are available in real time, they give decision-makers a significant advantage. (The method also serves as a credible outside check on national statistical bodies. For example, The Economist distrusts Argentina’s method of calculating inflation, so it relies on the PriceStats figures instead.)

  Messiness in action

  In many areas of technology and society, we are leaning in favor of more and messy over fewer and exact. Consider the case of categorizing content. For centuries humans have developed taxonomies and indexes in order to store and retrieve material. These hierarchical systems have always been imperfect, as everyone familiar with a library card catalogue can painfully recall, but in a small-data universe, they worked well enough. Increase the scale many orders of magnitude, though, and these systems, which presume the perfect placement of everything within them, fall apart. For example, in 2011 the photo-sharing site Flickr held more than six billion photos from more than 75 million users. Trying to label each photo according to preset categories would have been useless. Would there really have been one entitled “Cats that look like Hitler”?

  Instead, clean taxonomies are being replaced by mechanisms that are messier but also eminently more flexible and adaptable to a world that evolves and changes. When we upload photos to Flickr, we “tag” them. That is, we assign any number of text labels and use them to organize and search the material. Tags are created and affixed by people in an ad hoc way: there are no standardized, predefined categories, no existing taxonomy to which we must conform. Rather, anyone can add new tags just by typing. Tagging has emerged as the de facto standard for content classification on the Internet, used in social media sites like Twitter, blogs, and so on. It makes the vastness of the Web’s content more navigable—especially for things like images, videos, and music that aren’t text based so word searches don’t work.
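
  Mechanically, a tag system is little more than an inverted index from each free-form label to the items that carry it. A minimal sketch follows; the tags and photo IDs are invented.

```python
from collections import defaultdict

tag_index = defaultdict(set)   # tag -> set of photo IDs carrying that tag

def tag_photo(photo_id, tags):
    # Anyone can attach any label; no predefined taxonomy is consulted.
    for tag in tags:
        tag_index[tag.lower()].add(photo_id)

def search(*tags):
    """Photos carrying every requested tag: an intersection of tag sets."""
    sets = [tag_index[t.lower()] for t in tags]
    return set.intersection(*sets) if sets else set()

tag_photo("img_001", ["cat", "funny", "hitler"])
tag_photo("img_002", ["cat", "sleeping"])
tag_photo("img_003", ["dog", "funny"])

print(search("cat", "funny"))   # {'img_001'}
```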

  Of course, some tags may be misspelled, and such mistakes introduce inaccuracy—not to the data itself but to how it’s organized. That pains the traditional mind trained in exactitude. But in return for messiness in the way we organize our photo collections, we gain a much richer universe of labels, and by extension, a deeper, broader access to our pictures. We can combine search tags to filter photos in ways that weren’t possible before. The imprecision inherent in tagging is about accepting the natural messiness of the world. It is an antidote to more precise systems that try to impose a false sterility upon the hurly-burly of reality, pretending that everything under the sun fits into neat rows and columns. There are more things in heaven and earth than are dreamt of in that philosophy.

  Many of the Web’s most popular sites flaunt their admiration for imprecision over the pretense of exactitude. When one sees a Twitter icon or a Facebook “like” button on a web page, it shows the number of other people who clicked on it. When the numbers are small, each click is shown, like “63.” But as the figures get larger, the number displayed is an approximation, like “4K.” It’s not that the system doesn’t know the actual total; it’s that as the scale increases, showing the exact figure is less important. Besides, the amounts may be changing so quickly that a specific figure would be out of date the moment it appeared. Similarly, Google’s Gmail presents the time of recent messages with exactness, such as “11 minutes ago,” but treats longer durations with a nonchalant “2 hours ago,” as do Facebook and some others.
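
  The display convention is easy to sketch. This is only an illustration of the rounding behavior described above, not any site’s actual code.

```python
def approximate_count(n):
    """Show exact counts while small, rounded approximations at scale."""
    if n < 1000:
        return str(n)                       # "63"
    if n < 1_000_000:
        return f"{round(n / 1_000)}K"       # "4K"
    return f"{round(n / 1_000_000)}M"

def relative_time(minutes_ago):
    """Precise for recent messages, deliberately coarse for older ones."""
    if minutes_ago < 60:
        return f"{minutes_ago} minutes ago"
    return f"{minutes_ago // 60} hours ago"

print(approximate_count(63), approximate_count(4321))   # 63 4K
print(relative_time(11), relative_time(130))            # 11 minutes ago 2 hours ago
```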

  The industry of business intelligence and analytics software was long built on promising clients “a single version of the truth”—a popular buzz phrase among technology vendors in these fields in the 2000s. Executives used the phrase without irony. Some still do. By this, they mean that everyone accessing a company’s information-technology systems can tap into the same data; that the marketing team and the sales team don’t have to fight over who has the correct customer or sales numbers before the meeting even begins. Their interests might be more aligned if the facts were consistent, the thinking goes.

  But the idea of “a single version of the truth” is doing an about-face. We are beginning to realize not only that it may be impossible for a single version of the truth to exist, but also that its pursuit is a distraction. To reap the benefits of harnessing data at scale, we have to accept messiness as par for the course, not as something we should try to eliminate.

  We are even seeing the ethos of inexactitude invade one of the areas most intolerant of imprecision: database design. Traditional database engines required data to be highly structured and precise. Data wasn’t simply stored; it was broken up into “records” that contained fields. Each field held information of a particular type and length. For example, if a numeric field was seven digits long, an amount of 10 million or more could not be recorded. If one wanted to enter “not available” into a field for phone numbers, it couldn’t be done. The structure of the database would have had to be altered to accommodate these entries. We still battle with such restrictions on our computers and smartphones, when the software won’t accept the data we want to enter.
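
  A minimal sketch of the kind of rigid, fixed-width field described here, not any particular database engine’s behavior:

```python
def store_amount(value):
    """A seven-digit numeric field in a traditional fixed-format record."""
    if not isinstance(value, int):
        # "not available" or any other text simply cannot go in this field.
        raise ValueError("field accepts digits only")
    if not 0 <= value < 10_000_000:
        raise ValueError("amount does not fit a seven-digit field")
    return f"{value:07d}"

print(store_amount(9_999_999))   # "9999999"
store_amount(10_000_000)         # raises ValueError: does not fit the field
```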

  Traditional indexes, too, were predefined, and that limited what one could search for. Add a new index, and it had to be created from scratch, taking time. Conventional, so-called relational, databases are designed for a world in which data is sparse, and thus can be and will be curated carefully. It is a world in which the questions one wants to answer using the data have to be clear at the outset, so that the database is designed to answer them—and only them—efficiently.

  Yet this view of storage and analysis is increasingly at odds with reality. We now have large amounts of data of varying types and quality. Rarely does it fit into neatly defined categories that are known at the outset. And the questions we want to ask often emerge only when we collect and work with the data we have.

  These realities have led to novel database designs that break with the principles of old—principles of records and preset fields that reflect neatly defined hierarchies of information. The most common language for accessing databases has long been SQL, or “structured query language.” The very name evokes its rigidity. But the big shift in recent years has been toward something called noSQL, which doesn’t require a preset record structure to work. It accepts data of varying type and size and allows it to be searched successfully. In return for permitting structural messiness, these database designs require more processing and storage resources. Yet it is a tradeoff we can afford given the plummeting storage and processing costs.
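
  The contrast can be sketched with a toy document-style store: records of different shapes sit side by side and can still be queried, without a schema fixed in advance. This illustrates the general noSQL idea rather than any specific system’s API; the records are invented.

```python
# Records of varying shape, accepted as they come rather than forced into
# predefined fields of fixed type and length.
records = [
    {"name": "Alice", "phone": "555-0100"},
    {"name": "Bob", "phone": "not available", "email": "bob@example.com"},
    {"name": "Carol", "tags": ["vip", "newsletter"]},   # no phone field at all
]

def find(predicate):
    """Scan every record, whatever fields it happens to have."""
    return [r for r in records if predicate(r)]

# Records with any phone entry, including the free-text "not available".
print(find(lambda r: "phone" in r))

# Records tagged "vip", even though most records have no "tags" field.
print(find(lambda r: "vip" in r.get("tags", [])))
```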

 
