
Big Data: A Revolution That Will Transform How We Live, Work, and Think

by Viktor Mayer-Schönberger and Kenneth Cukier


  Yet the non-commercial uses of geo-location may turn out to be the most important of all. Sandy Pentland, the director of MIT’s Human Dynamics Laboratory, and Nathan Eagle together pioneered what they call “reality mining.” This refers to processing huge amounts of data from mobile phones to make inferences and predictions about human behavior. In one study, analyzing movements and call patterns allowed them to successfully identify people who had contracted the flu before they themselves knew they were ill. In the case of a deadly flu outbreak, this ability could save millions of lives by letting public health officials know the most afflicted areas at any moment. But if placed in irresponsible hands, the power of reality mining could have terrible consequences, as we will see later.

  Eagle, the founder of the wireless-data startup Jana, has used aggregated cellphone data from more than 200 mobile operators in more than 100 countries—some 3.5 billion people in Latin America, Africa, and Europe—to answer questions dear to marketing execs’ hearts, like how many times per week a household does laundry. But he’s also used big data to examine questions such as how cities prosper. He and a colleague combined location data on prepaid cellphone subscribers in Africa with the amount of money they spent when they topped off their accounts. The value correlates strongly with income: richer people buy more minutes at a time. But one of Eagle’s counterintuitive findings is that slums, rather than being only centers of poverty, also act as economic springboards. The point is that these indirect uses of location data have nothing to do with the routing of mobile communications, the purpose for which the information was initially generated. Rather, once location is datafied, new uses crop up and new value can be created.

  When interactions become data

  The next frontiers of datafication are more personal: our relationships, experiences, and moods. The idea of datafication is the backbone of many of the Web’s social media companies. Social networking platforms don’t simply offer us a way to find and stay in touch with friends and colleagues; they take intangible elements of our everyday life and transform them into data that can be used to do new things. Facebook datafied relationships; they always existed and constituted information, but they were never formally defined as data until Facebook’s “social graph.” Twitter enabled the datafication of sentiment by creating an easy way for people to record and share their stray thoughts, which had previously been lost to the winds of time. LinkedIn datafied our long-past professional experiences, just as Maury transformed old logbooks, turning that information into predictions about our present and future: whom we may know, or a job we may want.

  Such uses of the data are still embryonic. In the case of Facebook, it has been shrewdly patient, knowing that unveiling too many new purposes for its users’ data too soon could freak them out. Besides, the company is still adjusting its business model (and privacy policy) for the amount and type of data collection it wants to do. Hence much of the criticism it has faced centers more on what information it is capable of collecting than on what it has actually done with that data. Facebook had around one billion users in 2012, who were interconnected through over 100 billion friendships. The resulting social graph represents more than 10 percent of the total world population, datafied and available to a single company.

  The potential uses are extraordinary. A number of startups have looked into adapting the social graph to use as signals for establishing credit scores. The idea is that birds of a feather flock together: prudent people befriend like-minded types, while the profligate hang out among themselves. If it pans out, Facebook could be the next FICO, the credit-scoring agency. The rich datasets from social media firms may well form the basis of new businesses that go far beyond the superficial sharing of photos, status updates, and “likes.”
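  To make the intuition concrete, here is a minimal sketch, assuming a toy friendship graph and known repayment histories; the names, figures, and the simple averaging rule are all hypothetical, not any lender’s actual model:

```python
# Toy illustration of the "birds of a feather" credit signal.
# All names and repayment rates are hypothetical.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob"],
}

# Fraction of past loans each known borrower repaid on time.
repayment_rate = {"bob": 0.95, "carol": 0.90, "dave": 0.40}

def social_signal(person):
    """Average repayment rate among a person's friends."""
    rates = [repayment_rate[f] for f in friends[person] if f in repayment_rate]
    return sum(rates) / len(rates) if rates else None

print(social_signal("alice"))  # (0.95 + 0.90) / 2 = 0.925
```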

  Twitter, too, has seen its data used in interesting ways. To some, the 400 million terse tweets sent every day in 2012 by over 140 million monthly users seem like little more than random blather. And, in fact, they’re often just that. Yet the company enables the datafication of people’s thoughts, moods, and interactions, which could never be captured previously. Twitter has struck deals with two firms, DataSift and Gnip, to sell access to the data. (Although all tweets are public, access to the “firehose” comes at a cost.) Many businesses parse tweets, sometimes using a technique called sentiment analysis, to garner aggregate customer feedback or judge the impact of marketing campaigns.
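  A bare-bones version of sentiment analysis simply counts positive and negative words and averages the scores across tweets; real systems are far more sophisticated, and the word lists below are invented for illustration:

```python
# Minimal lexicon-based sentiment scoring; the word lists are illustrative only.
POSITIVE = {"love", "great", "happy", "awesome", "good"}
NEGATIVE = {"hate", "awful", "sad", "terrible", "bad"}

def sentiment(tweet):
    """Score a tweet in [-1, 1]: +1 if every sentiment word is positive."""
    words = [w.strip(".,!?") for w in tweet.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

tweets = ["I love this phone, great camera!", "Awful battery. Hate it."]
print(sum(sentiment(t) for t in tweets) / len(tweets))  # crude aggregate: 0.0
```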

  Two hedge funds, Derwent Capital in London and MarketPsych in California, started analyzing the datafied text of tweets as signals for investments in the stock market. (Their actual trading strategies were kept secret: rather than investing in firms that were ballyhooed, they may have bet against them.) Both firms now sell the information to traders. In MarketPsych’s case, it teamed up with Thomson Reuters to offer no fewer than 18,864 separate indices across 119 countries, updated each minute, on emotional states like optimism, gloom, joy, fear, anger, and even themes like innovation, litigation, and conflict. The data is used not so much by humans as by computers: Wall Street math whizzes, known as “quants,” plug the data into their algorithmic models in order to look for unseen correlations that can be parlayed into profits. The very frequency of tweets on a topic can predict various things, such as Hollywood box-office revenue, according to one of the fathers of social networking analysis, Bernardo Huberman. He and a colleague at HP developed a model that looked at the rate at which new tweets were posted. With this, they were able to forecast a film’s success better than other commonly used predictors.
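  Huberman’s published model is more elaborate, but its core idea, treating the rate of new tweets as a predictor of revenue, can be sketched as a simple linear fit; the figures below are invented:

```python
# Sketch of forecasting box-office revenue from tweet rate; the numbers are
# invented, and Huberman's published model is more elaborate than this fit.
import numpy as np

tweets_per_hour = np.array([120.0, 450.0, 300.0, 900.0, 60.0])  # pre-release chatter
opening_gross = np.array([8.0, 25.0, 18.0, 52.0, 4.0])          # $ millions

slope, intercept = np.polyfit(tweets_per_hour, opening_gross, 1)

new_film_rate = 600.0
forecast = slope * new_film_rate + intercept
print(f"forecast opening: ${forecast:.1f} million")
```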

  But even more is possible. Twitter messages are limited to a sparse 140 characters, but the metadata—that is, the “information about information”—associated with each tweet is rich. It includes 33 discrete items. Some do not seem very useful, like the “wallpaper” on a user’s Twitter page or the software the user employs to access the service. But other metadata is extremely interesting, such as users’ language, their geo-location, and the number and names of people they follow and those who follow them. In one study, reported in Science in 2011, an analysis of 509 million tweets over two years from 2.4 million people in 84 countries showed that people’s moods followed similar daily and weekly patterns across cultures around the world—something that had not been possible to spot before. Moods have been datafied.
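  For a sense of what such metadata looks like in practice, here is a sketch of pulling a few of the richer fields out of a tweet’s JSON payload, roughly the shape the classic Twitter API delivered; the sample record is fabricated:

```python
# Extracting a few metadata fields from a tweet payload (classic Twitter API
# v1.1 shape); this sample record is fabricated for illustration.
import json

raw = """{
  "text": "Morning run done!",
  "lang": "en",
  "created_at": "Mon Sep 24 03:35:21 +0000 2012",
  "coordinates": {"type": "Point", "coordinates": [-122.42, 37.77]},
  "user": {"screen_name": "example_user",
           "followers_count": 512, "friends_count": 310}
}"""

tweet = json.loads(raw)
print(tweet["lang"])                        # the user's language
print(tweet["coordinates"]["coordinates"])  # geo-location, when shared
print(tweet["user"]["followers_count"])     # size of the user's audience
```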

  Datafication is not just about rendering attitudes and sentiments into an analyzable form, but about rendering human behavior that way as well. Such behavior is otherwise hard to track, especially in the context of the broader community and subgroups within it. The biologist Marcel Salathé of Penn State University and the software engineer Shashank Khandelwal analyzed tweets to find that people’s attitudes about vaccinations matched their likelihood of actually getting flu shots. Importantly, their study used the metadata of who was connected to whom among Twitter followers to go a step further still: they noticed that subgroups of unvaccinated people may exist. What makes this research particularly special is that where other studies, such as Google Flu Trends, used aggregated data to consider the state of individuals’ health, the sentiment analysis performed by Salathé actually predicted health behaviors.
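  A toy version of that follower-graph step might flag pockets of vaccine-skeptical users by finding connected clusters among them; the follower edges and sentiment labels below are invented:

```python
# Finding clusters of vaccine-skeptical users who follow one another.
# The follower edges and sentiment labels are invented for illustration.
from collections import defaultdict

follows = [("u1", "u2"), ("u2", "u3"), ("u3", "u1"), ("u4", "u5")]
skeptical = {"u1", "u2", "u3", "u5"}  # users whose tweets scored anti-vaccine

# Undirected graph restricted to skeptical users.
graph = defaultdict(set)
for a, b in follows:
    if a in skeptical and b in skeptical:
        graph[a].add(b)
        graph[b].add(a)

def cluster_from(start, seen):
    """Depth-first search collecting one connected cluster."""
    stack, cluster = [start], set()
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            cluster.add(node)
            stack.extend(graph[node])
    return cluster

seen = set()
clusters = [cluster_from(u, seen) for u in sorted(skeptical) if u not in seen]
print(clusters)  # e.g. [{'u1', 'u2', 'u3'}, {'u5'}]
```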

  These early findings indicate where datafication will surely go next. Like Google, a gaggle of social media networks such as Facebook, Twitter, LinkedIn, Foursquare, and others sit on an enormous treasure chest of datafied information that, once analyzed, will shed light on social dynamics at all levels, from the individual to society at large.

  The datafication of everything

  With a little imagination, a cornucopia of things can be rendered into data form—and surprise us along the way. In the same spirit as Professor Koshimizu’s work on backsides in Tokyo, IBM was granted a U.S. patent in 2012 on “Securing premises using surface-based computing technology.” That’s intellectual-property-lawyer-speak for a touch-sensitive floor covering, somewhat like a giant smartphone screen. The potential uses are plentiful. It would be able to identify the objects on it. In basic form, it could know to turn on lights in a room or open doors when a person enters. More important, however, it might identify individuals by their weight or the way they stand and walk. It could tell if someone fell and did not get back up, an important feature for the elderly. Retailers could learn the flow of traffic through their stores. When the floor is datafied, there is no ceiling to its possible uses.

  Datafying as much as possible is not as far out as it sounds. Consider the “quantified self” movement. It refers to a disparate group of fitness aficionados, medical maniacs, and tech junkies who measure every element of their bodies and lives in order to live better—or at least, to learn new things they couldn’t have known in an enumerated way before. The number of “self-trackers” is small for the moment but growing.

  Because of smartphones and inexpensive computing technology, datafication of the most essential acts of living has never been easier. A slew of startups let people track their sleep patterns by measuring brainwaves throughout the night. One firm, Zeo, has already created the world’s largest database of sleep activity and uncovered differences in the amounts of REM sleep experienced by men and women. Asthmapolis has attached a sensor to an asthma inhaler that tracks location via GPS; aggregating the information lets the company discern environmental triggers for asthma attacks, such as proximity to certain crops.

  The firms Fitbit and Jawbone let people measure their physical activity and sleep. Another company, Basis, lets wearers of its wristband monitor their vital signs, including heart rate and skin conductance, which are measures of stress. Getting the data is becoming easier and less intrusive than ever. In 2009 Apple was granted a patent for collecting data on blood oxygenation, heart rate, and body temperature through its audio earbuds.

  There is a lot to learn from datafying how one’s body works. Researchers at Gjøvik University College in Norway and Derawi Biometrics have developed an app for smartphones that analyzes an individual’s gait while walking and uses the information as a security system to unlock the phone. Meanwhile two professors at Georgia Tech Research Institute, Robert Delano and Brian Parise, are developing a smartphone application called iTrem that uses the phone’s built-in accelerometer to monitor a person’s body tremors for Parkinson’s and other neurological disorders. The app is a boon for both doctors and patients. It allows patients to bypass costly tests done at a physician’s office; it also lets medical professionals remotely monitor people’s disability and their responses to treatments. According to researchers in Kyoto, a smartphone is only a tiny bit less effective at measuring the tremors than the tri-axial accelerometer used in specialized medical equipment, so it can be reliably used. Once again, a bit of messiness trumps exactitude.
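  iTrem’s internals are not public, but the basic signal-processing step, estimating a dominant tremor frequency from accelerometer samples, might look something like this sketch, where a synthetic signal stands in for real phone data:

```python
# Estimating a dominant tremor frequency from accelerometer samples.
# The 4-6 Hz band is the classic range for Parkinsonian resting tremor;
# the signal here is synthetic, standing in for real phone data.
import numpy as np

fs = 50.0                                    # sampling rate, Hz
t = np.arange(0, 10, 1 / fs)                 # ten seconds of samples
signal = 0.3 * np.sin(2 * np.pi * 5.0 * t)   # a 5 Hz tremor component
signal += 0.05 * np.random.randn(t.size)     # sensor noise

spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
dominant = freqs[spectrum.argmax()]

print(f"dominant frequency: {dominant:.1f} Hz")
print("within Parkinsonian tremor band:", 4.0 <= dominant <= 6.0)
```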

  In most of these cases, we’re capturing information and putting it into data form that allows it to be reused. This can happen almost everywhere and to nearly everything. GreenGoose, a startup in San Francisco, sells tiny sensors that detect motion, which can be placed on objects to track how much they are used. Putting it on a pack of dental floss, a watering can, or a box of cat litter makes it possible to datafy dental hygiene and the care of plants and pets. The enthusiasm over the “internet of things”—embedding chips, sensors, and communications modules into everyday objects—is partly about networking but just as much about datafying all that surrounds us.

  Once the world has been datafied, the potential uses of the information are basically limited only by one’s ingenuity. Maury datafied seafarers’ previous journeys through painstaking manual tabulation, and thereby unlocked extraordinary insights and value. Today we have the tools (statistics and algorithms) and the necessary equipment (digital processors and storage) to perform similar tasks much faster, at scale, and in many different contexts. In the age of big data, even backsides have upsides.

  We are in the midst of a great infrastructure project that in some ways rivals those of the past, from Roman aqueducts to the Enlightenment’s Encyclopédie. We fail to appreciate this because today’s project is so new, because we are in the middle of it, and because unlike the water that flows through aqueducts, the product of our labors is intangible. The project is datafication. Like those other infrastructural advances, it will bring about fundamental changes to society.

  Aqueducts made possible the growth of cities; the printing press facilitated the Enlightenment, and newspapers enabled the rise of the nation state. But these infrastructures were focused on flows—of water, of knowledge. So were the telephone and the Internet. In contrast, datafication represents an essential enrichment in human comprehension. With the help of big data, we will no longer regard our world as a string of happenings that we explain as natural or social phenomena, but as a universe comprised essentially of information.

  For well over a century, physicists have suggested that this is the case—that not atoms but information is the basis of all that is. This, admittedly, may sound esoteric. Through datafication, however, in many instances we can now capture and calculate at a much more comprehensive scale the physical and intangible aspects of existence and act on them.

  Seeing the world as information, as oceans of data that can be explored at ever greater breadth and depth, offers us a perspective on reality that we did not have before. It is a mental outlook that may penetrate all areas of life. Today, we are a numerate society because we presume that the world is understandable with numbers and math. And we take for granted that knowledge can be transmitted across time and space because the idea of the written word is so ingrained. Tomorrow, subsequent generations may have a “big-data consciousness”—the presumption that there is a quantitative component to all that we do, and that data is indispensable for society to learn from. The notion of transforming the myriad dimensions of reality into data probably seems novel to most people at present. But in the future, we will surely treat it as a given (which, pleasingly, harks back to the very origin of the term “data”).

  In time, the impact of datafication may dwarf that of aqueducts and newspapers, rivaling perhaps the printing press and the Internet by giving us the means to map the world in a quantifiable, analyzable way. For the moment, however, the most advanced users of datafication are in business, where big data is being used to create new forms of value—the subject of the next chapter.

  6

  VALUE

  IN THE LATE 1990S the Web was quickly turning into an unruly, unwelcoming, unfriendly place. “Spambots” were inundating email inboxes and swamping online forums. In 2000 Luis von Ahn, a 22-year-old who had just graduated from college, had an idea for solving the problem: force registrants to prove they are human. So he looked for something that is easy for people to do but hard for machines.

  He came up with the idea of presenting squiggly, hard-to-read letters during the sign-up process. People would be able to decipher them and type in the correct text in a few seconds, but computers would be stumped. Yahoo implemented his method and reduced its scourge of spambots overnight. Von Ahn called his creation Captcha (for Completely Automated Public Turing Test to Tell Computers and Humans Apart). Five years later, millions of Captchas were being typed each day.

  Captcha brought von Ahn considerable fame and a job teaching computer science at Carnegie Mellon University after he earned his PhD. It was also a factor in his receiving, at 27, one of the MacArthur Foundation’s prestigious “genius” awards of half a million dollars. But when he realized that he was responsible for millions of people wasting lots of time each day typing in annoying, squiggly letters—vast amounts of information that was simply discarded afterwards—he didn’t feel so smart.

  Looking for ways to put all that human computational power to more productive use, he came up with a successor, fittingly named ReCaptcha. Instead of typing in random letters, people type two words from text-scanning projects that a computer’s optical character-recognition program couldn’t understand. One word is meant to confirm what other users have typed and thus is a signal that the person is a human; the other is a new word in need of disambiguation. To ensure accuracy, the system presents the same fuzzy word to an average of five different people to type in correctly before it trusts it’s right. The data had a primary use—to prove the user was human—but it also had a secondary purpose: to decipher unclear words in digitized texts.
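  The consensus step can be sketched simply: keep collecting transcriptions of an unknown word until enough independent readings agree. The threshold of five below matches the average cited above, though the real system’s rules are more nuanced:

```python
# Sketch of ReCaptcha-style consensus on an unknown word: accept a reading
# once enough independent users agree. Threshold and data are illustrative.
from collections import Counter

AGREEMENT_NEEDED = 5  # roughly the "average of five people" cited above

def consensus(transcriptions):
    """Return the agreed-upon word once any reading reaches the threshold."""
    counts = Counter(w.strip().lower() for w in transcriptions)
    word, votes = counts.most_common(1)[0]
    return word if votes >= AGREEMENT_NEEDED else None

readings = ["cabbage", "cabbage", "cabboge", "cabbage", "cabbage", "cabbage"]
print(consensus(readings))  # "cabbage", once five users agree
```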

  The value this unleashes is immense, when one considers what it would cost to hire people instead. At roughly 10 seconds per use, 200 million ReCaptchas a day—the current rate—add up to half a million hours a day. The minimum wage in the United States was $7.25 an hour in 2012. If one were to turn to the market for disambiguating words that a computer couldn’t make sense of, it would cost around $4 million a day, or more than $1 billion a year. Instead, von Ahn designed a system to do it, in effect, for free. This was so valuable that Google acquired the technology from von Ahn in 2009. Google makes it freely available for any website to use; today it’s incorporated into some 200,000 sites, including Facebook, Twitter, and Craigslist.
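  The back-of-the-envelope arithmetic behind those figures is easy to check:

```python
# Back-of-the-envelope check of the figures above.
uses_per_day = 200_000_000
seconds_per_use = 10
hours_per_day = uses_per_day * seconds_per_use / 3600  # ~555,556 hours
wage = 7.25                                            # 2012 US minimum wage
cost_per_day = hours_per_day * wage                    # ~$4.0 million
print(f"{hours_per_day:,.0f} hours/day, ${cost_per_day:,.0f}/day, "
      f"${cost_per_day * 365 / 1e9:.2f} billion/year")
```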

  The story of ReCaptcha underscores the importance of the reuse of data. With big data, the value of data is changing. In the digital age, data shed its role of supporting transactions and often became the good itself that was traded. In a big-data world, things change again. Data’s value shifts from its primary use to its potential future uses. This has profound consequences. It affects how businesses value the data they hold and who they let access it. It enables, and may force, companies to change their business models. It alters how organizations think about data and how they use it.

  Information has always been essential for market transactions. Data enables price discovery, for instance, which is a signal for how much to produce. This dimension of data is well understood. Certain types of information have long been traded on markets. Content found in books, articles, music, and movies is an example, as is financial information like stock prices. These have been joined in the past few decades by personal data. Specialized data brokers in the United States such as Acxiom, Experian, and Equifax charge handsomely for comprehensive dossiers of personal information on hundreds of millions of consumers. With Facebook, Twitter, LinkedIn, and other social media platforms, our personal connections, opinions, preferences, and patterns of everyday living have joined the pool of personal information already available about us.
