Big Data: A Revolution That Will Transform How We Live, Work, and Think
Strikingly, in a big-data age, most innovative secondary uses haven’t been imagined when the data is first collected. How can companies provide notice for a purpose that has yet to exist? How can individuals give informed consent to an unknown? Yet in the absence of consent, any big-data analysis containing personal information might require going back to every person and asking permission for each reuse. Can you imagine Google trying to contact hundreds of millions of users for approval to use their old search queries to predict the flu? No company would shoulder the cost, even if the task were technically feasible.
The alternative, asking users to agree to any possible future use of their data at the time of collection, isn’t helpful either. Such a wholesale permission emasculates the very notion of informed consent. In the context of big data, the tried and trusted concept of notice and consent is often either too restrictive to unearth data’s latent value or too empty to protect individuals’ privacy.
Other ways of protecting privacy fail as well. If everyone’s information is in a dataset, even choosing to “opt out” may leave a trace. Take Google’s Street View. Its cars collected images of roads and houses in many countries. In Germany, Google faced widespread public and media protests. People feared that pictures of their homes and gardens could aid gangs of burglars in selecting lucrative targets. Under regulatory pressure, Google agreed to let homeowners opt out by having their houses blurred in the imagery. But the opt-out is itself visible on Street View—you notice the obfuscated houses—and burglars may interpret the blurring as a signal that those houses are especially good targets.
A technical approach to protecting privacy—anonymization—also doesn’t work effectively in many cases. Anonymization refers to stripping out from datasets any personal identifiers, such as name, address, credit card number, date of birth, or Social Security number. The resulting data can then be analyzed and shared without compromising anyone’s privacy. That works in a world of small data. But big data, with its increase in the quantity and variety of information, facilitates re-identification. Consider the cases of seemingly unidentifiable web searches and movie ratings.
In August 2006 AOL publicly released a mountain of old search queries, under the well-meaning view that researchers could analyze it for interesting insights. The dataset, of 20 million search queries from 657,000 users between March 1 and May 31 of that year, had been carefully anonymized. Personal information such as user name and IP address was erased and replaced by unique numeric identifiers. The idea was that researchers could link together search queries from the same person while having no information identifying who that person was.
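A minimal sketch in Python of the kind of pseudonymization described above—replacing names and IP addresses with a stable numeric identifier so that queries from the same person stay linked—makes the trade-off concrete. This is only an illustration under assumed field names and invented records, not AOL’s actual process.

# A minimal sketch of the pseudonymization described above, not AOL's actual
# pipeline: identifying fields are replaced by a stable numeric ID so that
# queries from the same (unnamed) person remain linkable. Field names and the
# sample records are invented for illustration.
from itertools import count

_next_id = count(1)
_pseudonyms = {}   # maps (user_name, ip_address) -> numeric identifier

def pseudonymize(record):
    """Strip the identifying fields of a search-log record, keeping a numeric ID."""
    key = (record["user_name"], record["ip_address"])
    if key not in _pseudonyms:
        _pseudonyms[key] = next(_next_id)
    return {"user_id": _pseudonyms[key],
            "query": record["query"],
            "timestamp": record["timestamp"]}

log = [
    {"user_name": "alice", "ip_address": "10.0.0.7",
     "query": "landscapers in lilburn, ga", "timestamp": "2006-03-02 09:14"},
    {"user_name": "alice", "ip_address": "10.0.0.7",
     "query": "60 single men", "timestamp": "2006-04-18 21:40"},
]

released = [pseudonymize(r) for r in log]
print(released)
# Both records now carry user_id 1: no name or IP appears, yet every query the
# person made is still tied together—which is exactly what made
# re-identification from the queries' content possible.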
Still, within days, the New York Times cobbled together searches like “60 single men” and “tea for good health” and “landscapers in Lilburn, Ga” to successfully identify user number 4417749 as Thelma Arnold, a 62-year-old widow from Lilburn, Georgia. “My goodness, it’s my whole personal life,” she told the Times reporter when he came knocking. “I had no idea somebody was looking over my shoulder.” The ensuing public outcry led to the ouster of AOL’s chief technology officer and two other employees.
Yet a mere two months later, in October 2006, the movie rental service Netflix did something similar in launching its “Netflix Prize.” The company released 100 million rental records from nearly half a million users—and offered a bounty of a million dollars to any team that could improve its film recommendation system by at least 10 percent. Again, personal identifiers had been carefully removed from the data. And yet again, a user was re-identified: a mother and a closeted lesbian in America’s conservative Midwest, who later sued Netflix over the disclosure under the pseudonym “Jane Doe.”
Researchers at the University of Texas at Austin compared the Netflix data against other public information. They quickly found that ratings by one anonymized user matched those of a named contributor to the Internet Movie Database (IMDb) website. More generally, the research demonstrated that knowing a customer’s ratings of just six obscure movies (titles outside the 500 most popular) could identify that customer 84 percent of the time. And if one also knew the dates on which the person rated movies, he or she could be uniquely identified among the nearly half a million customers in the dataset with 99 percent accuracy.
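The linkage idea at work here can be sketched in a few lines of Python: take an anonymized subscriber’s sparse ratings and look for the public profile whose ratings line up with them. The data below is invented and the matching is deliberately naive—the Texas researchers’ actual algorithm weights rare titles and tolerates noisy dates—but it shows why a handful of obscure, dated ratings can suffice.

# Toy illustration of the linkage attack described above: compare an
# anonymized subscriber's sparse ratings against ratings published under real
# names elsewhere, and flag the public profile with the most overlap.
# All names, titles, and dates below are invented.
from datetime import date

anonymized_user = {            # from the "anonymous" rental dataset
    ("Obscure Film A", 4): date(2005, 7, 2),
    ("Obscure Film B", 5): date(2005, 7, 9),
    ("Obscure Film C", 3): date(2005, 8, 1),
}

public_profiles = {            # ratings posted under real names, e.g. on IMDb
    "film_buff_1912": {("Obscure Film A", 4): date(2005, 7, 2),
                       ("Obscure Film B", 5): date(2005, 7, 10),
                       ("Obscure Film C", 3): date(2005, 8, 1)},
    "casual_viewer":  {("Popular Film", 5): date(2005, 6, 1)},
}

def match_score(anon, public, date_slack_days=3):
    """Count ratings that agree on title, stars, and (roughly) date."""
    score = 0
    for key, when in anon.items():
        if key in public and abs((public[key] - when).days) <= date_slack_days:
            score += 1
    return score

best = max(public_profiles,
           key=lambda name: match_score(anonymized_user, public_profiles[name]))
print(best)  # -> "film_buff_1912": a few obscure, dated ratings single them out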
In the AOL case, users’ identities were exposed by the content of their searches. In the Netflix case, the identity was revealed by a comparison of the data with other sources. In both instances, the companies failed to appreciate how big data aids de-anonymization. There are two reasons: we capture more data and we combine more data.
Paul Ohm, a law professor at the University of Colorado in Boulder and an expert on the harm done by de-anonymization, explains that no easy fix is available. Given enough data, perfect anonymization is impossible no matter how hard one tries. Worse, researchers have recently shown that not only conventional data but also the social graph—people’s connections with one another—is vulnerable to de-anonymization.
In the era of big data, the three core strategies long used to ensure privacy—individual notice and consent, opting out, and anonymization—have lost much of their effectiveness. Already today many users feel their privacy is being violated. Just wait until big-data practices become more commonplace.
Compared with East Germany a quarter-century ago, surveillance has only gotten easier, cheaper, and more powerful. The ability to capture personal data is often built deep into the tools we use every day, from websites to smartphone apps. The event data recorders built into most cars, which capture everything a vehicle does in the few seconds before an airbag deploys, have been known to “testify” against car owners in court disputes over the events of accidents.
Of course, when businesses are collecting data to improve their bottom line, we need not fear that their surveillance will have the same consequences as being bugged by the Stasi. We won’t go to prison if Amazon discovers we like to read Chairman Mao’s “Little Red Book.” Google will not exile us because we searched for “Bing.” Companies may be powerful, but they don’t have the state’s powers to coerce.
So while they are not dragging us away in the middle of the night, firms of all stripes amass mountains of personal information concerning all aspects of our lives, share it with others without our knowledge, and use it in ways we could hardly imagine.
The private sector is not alone in flexing its muscles with big data. Governments are doing this too. For instance, the U.S. National Security Agency (NSA) intercepts and stores 1.7 billion emails, phone calls, and other communications every day, according to a 2010 Washington Post investigation. William Binney, a former NSA official, estimates that the government has compiled “20 trillion transactions” among U.S. citizens and others—who calls whom, emails whom, wires money to whom, and so on.
To make sense of all the data, the United States is building giant data centers, such as a $1.2 billion NSA facility at Camp Williams, Utah. And all parts of government are demanding more information than before, not just secretive agencies involved in counterterrorism. When the collection expands to information like financial transactions, health records, and Facebook status updates, the quantity being gleaned is unthinkably large. The government can’t process so much data. So why collect it?
The answer points to the way surveillance has changed in the era of big data. In the past, investigators attached alligator clips to telephone wires to learn as much as they could about a suspect. What mattered was to drill down and get to know that individual. The modern approach is different. In the spirit of Google or Facebook, the new thinking is that people are the sum of their social relationships, online interactions, and connections with content. To fully investigate an individual, analysts need to look at the widest possible penumbra of data that surrounds the person—not just whom they know, but whom those people know too, and so on. This was technically very hard to do in the past. Today it’s easier than ever. And because the government never knows whom it will want to scrutinize, it collects, stores, or ensures access to information not necessarily to monitor everyone at all times, but so that when someone falls under suspicion, the authorities can immediately investigate rather than having to start gathering the information from scratch.
The United States is not the only government amassing mountains of data on people, nor is it perhaps the most egregious in its practices. However, as troubling as the ability of business and government to know our personal information may be, a newer problem emerges with big data: the use of predictions to judge us.
Probability and punishment
John Anderton is the chief of a special police unit in Washington, D.C. This particular morning, he bursts into a suburban house moments before Howard Marks, in a state of frenzied rage, is about to plunge a pair of scissors into the torso of his wife, whom he found in bed with another man. For Anderton, it is just another day preventing capital crimes. “By mandate of the District of Columbia Precrime Division,” he recites, “I’m placing you under arrest for the future murder of Sarah Marks, that was to take place today. . . .”
Other cops start restraining Marks, who screams, “I did not do anything!”
The opening scene of the film Minority Report depicts a society in which predictions seem so accurate that the police arrest individuals for crimes before they are committed. People are imprisoned not for what they did, but for what they are foreseen to do, even though they never actually commit the crime. The movie attributes this prescient and preemptive law enforcement to the visions of three clairvoyants, not to data analysis. But the unsettling future Minority Report portrays is one that unchecked big-data analysis threatens to bring about, in which judgments of culpability are based on individualized predictions of future behavior.
Already we see the seedlings of this. Parole boards in more than half of all U.S. states use predictions founded on data analysis as a factor in deciding whether to release somebody from prison or to keep him incarcerated. A growing number of places in the United States—from precincts in Los Angeles to cities like Richmond, Virginia—employ “predictive policing”: using big-data analysis to select what streets, groups, and individuals to subject to extra scrutiny, simply because an algorithm pointed to them as more likely to commit crime.
In the city of Memphis, Tennessee, a program called Blue CRUSH (for Crime Reduction Utilizing Statistical History) provides police officers with relatively precise areas of interest in terms of locality (a few blocks) and time (a few hours during a particular day of the week). The system ostensibly helps law enforcement better target its scarce resources. Since the program’s inception in 2006, major property crimes and violent offenses have fallen by a quarter, according to one measure (though of course this says nothing about causality; there’s nothing to indicate that the decrease is due to Blue CRUSH).
In Richmond, Virginia, police correlate crime data with additional datasets, such as information on when large companies in the city pay their employees or the dates of concerts or sports events. Doing so has confirmed and sometimes refined the cops’ suspicions about crime trends. For example, Richmond police long sensed that there was a jump in violent crime following gun shows; the big-data analysis proved them right but with a wrinkle: the spike happened two weeks afterwards, not immediately following the event.
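A rough sketch of that kind of lag analysis, with invented weekly figures (Richmond’s actual tooling is not public): shift the event indicator by zero, one, two weeks and see which offset lines up best with the crime counts.

# Sketch of correlating crime counts with event dates at different lags.
# The weekly numbers are invented; this only illustrates how a two-week
# delay between an event and a crime spike can show up in the data.
from statistics import correlation  # available in Python 3.10+

weekly_violent_crimes = [20, 22, 21, 35, 23, 20, 24, 36, 22, 21]
gun_show_held =         [0,  1,  0,  0,  0,  1,  0,  0,  0,  0]

def lagged_corr(events, crimes, lag):
    """Correlate 'event happened' with crime counts 'lag' weeks later."""
    shifted = events[:len(events) - lag] if lag else events
    later_crimes = crimes[lag:]
    return correlation(shifted, later_crimes)

for lag in range(4):
    r = lagged_corr(gun_show_held, weekly_violent_crimes, lag)
    print(f"lag {lag} weeks: r = {r:.2f}")
# In this toy data the strongest correlation appears at a two-week lag,
# mirroring the pattern the Richmond analysts reported.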
These systems seek to prevent crimes by predicting, eventually down to the level of individuals, who might commit them. This points toward using big data for a novel purpose: to prevent crime from happening.
A research project under the U.S. Department of Homeland Security called FAST (Future Attribute Screening Technology) tries to identify potential terrorists by monitoring individuals’ vital signs, body language, and other physiological patterns. The idea is that surveilling people’s behavior may detect their intent to do harm. In tests, the system was 70 percent accurate, according to the DHS. (What this means is unclear; were research subjects instructed to pretend to be terrorists to see if their “malintent” was spotted?) Though these systems seem embryonic, the point is that law enforcement takes them very seriously.
Stopping a crime from happening sounds like an enticing prospect. Isn’t preventing infractions before they take place far better than penalizing the perpetrators afterwards? Wouldn’t forestalling crimes benefit not just those who might have been victimized by them, but society as a whole?
But it’s a perilous path to take. If through big data we predict who may commit a future crime, we may not be content with simply preventing the crime from happening; we are likely to want to punish the probable perpetrator as well. That is only logical. If we just step in and intervene to stop the illicit act from taking place, the putative perpetrator may try again with impunity. In contrast, by using big data to hold him responsible for his (future) acts, we may deter him and others.
Such prediction-based punishment may seem like a natural extension of practices we have already come to accept. Preventing unhealthy, dangerous, or risky behavior is a cornerstone of modern society. We have made smoking harder in order to prevent lung cancer; we require seatbelts to avert fatalities in car accidents; we don’t let people board airplanes with guns to avoid hijackings. Such preventive measures constrain our freedom, but many see them as a small price to pay in return for avoiding much graver harm.
In many contexts, data analysis is already employed in the name of prevention. It is used to lump us into cohorts of people like us, and we are often characterized accordingly. Actuarial tables note that men over 50 are prone to prostate cancer, so members of that group may pay more for health insurance even if they never get prostate cancer. High-school students with good grades, as a group, are less likely to get into car accidents—so some of their less-learned peers have to pay higher insurance premiums. Individuals with certain characteristics are subjected to extra screening when they pass through airport security.
That’s the idea behind “profiling” in today’s small-data world. Find a common association in the data, define a group of people to whom it applies, and then place those people under additional scrutiny. It is a generalizable rule that applies to everyone in the group. “Profiling,” of course, is a loaded word, and the method has serious downsides. If misused, it can lead not only to discrimination against certain groups but also to “guilt by association.”
Big-data predictions about people are different. Where today’s forecasts of likely behavior—found in things like insurance premiums or credit scores—usually rely on a handful of factors grounded in a mental model of the issue at hand (previous health problems, say, or loan repayment history), big data’s non-causal analysis often simply identifies the most suitable predictors from the sea of information.
Most important, using big data we hope to identify specific individuals rather than groups; this liberates us from profiling’s shortcoming of making every predicted suspect a case of guilt by association. In a big-data world, somebody with an Arabic name, who has paid in cash for a one-way ticket in first class, may no longer be subjected to secondary screening at an airport if other data specific to him make it very unlikely that he’s a terrorist. With big data we can escape the straitjacket of group identities, and replace them with much more granular predictions for each individual.
The promise of big data is that we do what we’ve been doing all along—profiling—but make it better, less discriminatory, and more individualized. That sounds acceptable if the aim is simply to prevent unwanted actions. But it becomes very dangerous if we use big-data predictions to decide whether somebody is culpable and ought to be punished for behavior that has not yet happened.
The very idea of penalizing based on propensities is nauseating. To accuse a person of some possible future behavior is to negate the very foundation of justice: that one must have done something before we can hold him accountable for it. After all, thinking bad things is not illegal; doing them is. It is a fundamental tenet of our society that individual responsibility is tied to individual choice of action. If one is forced at gunpoint to open the company’s safe, one has no choice and thus isn’t held responsible.
If big-data predictions were perfect, if algorithms could foresee our future with flawless clarity, we would no longer have a choice to act in the future. We would behave exactly as predicted. Were perfect predictions possible, they would deny human volition, our ability to live our lives freely. Also, ironically, by depriving us of choice they would exculpate us from any responsibility.
Of course perfect prediction is impossible. Rather, big-data analysis will predict that for a specific individual, a particular future behavior has a certain probability. Consider, for example, research conducted by Richard Berk, a professor of statistics and criminology at the University of Pennsylvania. He claims his method can predict whether a person released on parole will be involved in a homicide (either kill or be killed). As inputs he uses numerous case-specific variables, including reason for incarceration and date of first offense, but also demographic data like age and gender. Berk suggests that he can forecast a future murder among those on parole with at least a 75 percent probability. That’s not bad. However, it also means that should parole boards rely on Berk’s analysis, they would be wrong as often as one time in four.
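To restate that arithmetic with a hypothetical caseload (this is only the cited figure turned into numbers, not Berk’s model):

# Back-of-the-envelope reading of the figures above, using an invented
# caseload: if a forecast is right 75 percent of the time and a parole board
# follows it for every decision, roughly one decision in four rests on a
# wrong call. This is the book's arithmetic restated, not Berk's method.
def expected_wrong_calls(num_decisions: int, accuracy: float) -> float:
    """Expected number of decisions based on an incorrect prediction."""
    return num_decisions * (1.0 - accuracy)

caseload = 1000   # hypothetical parole hearings in a year
accuracy = 0.75   # the "at least 75 percent" figure cited above

print(expected_wrong_calls(caseload, accuracy))  # -> 250.0 hearings decided on a wrong forecast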
But the core problem with relying on such predictions is not that they expose society to risk. The fundamental trouble is that with such a system we essentially punish people before they do something bad. And by intervening before they act (for instance by denying them parole if predictions show there is a high probability that they will murder), we never know whether or not they would have actually committed the predicted crime. We do not let fate play out, and yet we hold individuals responsible for what our prediction tells us they would have done. Such predictions can never be disproven.