Big Data: A Revolution That Will Transform How We Live, Work, and Think
The dramatic change also fertilized the ground for new rules to govern the information explosion sparked by moveable type. As the secular state consolidated its power, it established censorship and licensing to contain and control the printed word. Copyright was established to give authors legal and economic incentives to create. Later, intellectuals pushed for rules to protect words from government suppression; by the nineteenth century, in a growing number of countries, freedom of speech was turned into a constitutional guarantee. But these rights came with responsibilities. As vitriolic newspapers trampled on privacy or slandered reputations, rules cropped up to shield people’s private sphere and allow them to sue for libel.
Yet these changes in governance also reflect a deeper, more fundamental transformation of the underlying values. In Gutenberg’s shadow, we first began to realize the power of the written word—and, eventually, the importance of information that spreads widely throughout society. As centuries passed, we opted for more information flow rather than less, and to guard against its excesses not primarily through censorship but through rules that limited the misuse of information.
As the world moves toward big data, society will undergo a similar tectonic shift. Big data is already transforming many aspects of our lives and ways of thinking, forcing us to reconsider basic principles on how to encourage its growth and mitigate its potential for harm. However, unlike our forebears during and after the printing revolution, we don’t have centuries to adjust; perhaps just a few years.
Simple changes to existing rules will not be sufficient to govern in the big-data age and to temper big data’s dark side. Rather than a parametric change, the situation calls for a paradigmatic one. Protecting privacy requires that big-data users become more accountable for their actions. At the same time, society will have to redefine the very notion of justice to guarantee human freedom to act (and thus to be held responsible for those actions). Lastly, new institutions and professionals will need to emerge to interpret the complex algorithms that underlie big-data findings, and to advocate for people who might be harmed by big data.
From privacy to accountability
For decades an essential principle of privacy laws around the world has been to put individuals in control by letting them decide whether, how, and by whom their personal information may be processed. In the Internet age, this laudable ideal has often morphed into a formulaic system of “notice and consent.” In the era of big data, however, when much of data’s value is in secondary uses that may have been unimagined when the data was collected, such a mechanism to ensure privacy is no longer suitable.
We envision a very different privacy framework for the big-data age, one focused less on individual consent at the time of collection and more on holding data users accountable for what they do. In such a world, firms will formally assess a particular reuse of data based on the impact it has on the individuals whose personal information is being processed. This does not have to be onerously detailed in all cases: future privacy laws will define broad categories of uses, including ones that are permissible without safeguards or with only limited, standardized ones. For riskier initiatives, regulators will establish ground rules for how data users should assess the dangers of a planned use and determine what best avoids or mitigates potential harm. This spurs creative reuses of data while ensuring that sufficient measures are taken so that individuals are not hurt.
Running a formal big-data use assessment correctly and implementing its findings accurately offers tangible benefits to data users: they will be free to pursue secondary uses of personal data in many instances without having to go back to individuals to get their explicit consent. On the other hand, sloppy assessments or poor implementation of safeguards will expose data users to legal liability, and regulatory actions such as mandates, fines, and perhaps even criminal prosecution. Data-user accountability only works when it has teeth.
To see how this could happen in practice, take the example of the datafication of posteriors from Chapter Five. Imagine that a company sold a car antitheft service which used a driver’s sitting posture as a unique identifier. Then, it later reanalyzed the information to predict drivers’ “attention states,” such as whether they were drowsy or tipsy or angry, in order to send alerts to other drivers nearby to prevent accidents. Under today’s privacy rules, the firm might believe it needed a new round of notice and consent because it hadn’t previously received permission to use the information in this way. But under a system of data-user accountability, the company would assess the dangers of the intended use, and if it found them minimal it could just go ahead with its plan—and improve road safety in the process.
Shifting the burden of responsibility from the public to the users of data makes sense for a number of reasons. They know much more than anybody else, and certainly more than consumers or regulators, about how they intend to use the data. By conducting the assessment themselves (or hiring experts to do it) they will avoid the problem of revealing confidential business strategies to outsiders. Perhaps most important, the data users reap most of the benefits of secondary use, so it’s only fair to hold them accountable for their actions and place the burden for this review on them.
With such an alternative privacy framework, data users will no longer be legally required to delete personal information once it has served its primary purpose, as most privacy laws currently demand. This is an important change, since, as we’ve seen, only by tapping the latent value of data can latter-day Maurys flourish by wringing the most value out of it for their own—and society’s—benefit. Instead, data users will be allowed to keep personal information longer, though not forever. Society needs to carefully weigh the rewards from reuse against the risks from too much disclosure.
To strike the appropriate balance, regulators may choose different time frames for reuse, depending on the data’s inherent risk, as well as on different societies’ values. Some nations may be more cautious than others, just as some sorts of data may be considered more sensitive than others. This approach also banishes the specter of “permanent memory”—the risk that one can never escape one’s past because the digital records can always be dredged up. Otherwise our personal data hovers over us like the Sword of Damocles, threatening to impale us years hence with some private detail or regrettable purchase. Time limits also create an incentive for data holders to use it before they lose it. This strikes what we believe is a better balance for the big-data era: firms get the right to use personal data longer, but in return they have to take on responsibility for its uses as well as the obligation to erase personal data after a certain period of time.
In addition to a regulatory shift from “privacy by consent” to “privacy through accountability,” we envision technical innovation to help protect privacy in certain instances. One nascent approach is the concept of “differential privacy”: deliberately blurring the data so that a query of a large dataset doesn’t reveal exact results but only approximate ones. This makes it difficult and costly to associate particular data points with particular people.
Fuzzing the information sounds as if it might destroy valuable insights. But it need not—or at least, the tradeoff can be favorable. For instance, experts in technology policy note that Facebook relies on a form of differential privacy when it reports information about its users to potential advertisers: the numbers it reports are approximate, so they can’t help reveal individual identities. Looking up Asian women in Atlanta who are interested in Ashtanga yoga will produce a result such as “about 400” rather than an exact number, making it impossible to use the information to statistically single out a specific person.
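To make the mechanism concrete, here is a minimal sketch of the Laplace-noise approach that underlies differential privacy. It is our own illustration, not Facebook's actual implementation: the dataset, the function name private_count, and the privacy parameter epsilon are all invented for the example. The true count is perturbed with random noise, so a query returns only an approximate figure.

```python
import numpy as np

def private_count(records, predicate, epsilon=0.5):
    """Differentially private count: the true tally plus Laplace noise
    scaled to the query's sensitivity (1, since adding or removing one
    person changes a count by at most one)."""
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return max(0, int(round(true_count + noise)))

# Invented data: 400 matching profiles among 1,000.
users = [{"city": "Atlanta", "interest": "ashtanga yoga"}] * 400
users += [{"city": "Boston", "interest": "running"}] * 600

def query(u):
    return u["city"] == "Atlanta" and u["interest"] == "ashtanga yoga"

print("about", private_count(users, query))   # e.g. "about 397" or "about 404"
```

The smaller epsilon is set, the noisier the answer, and the harder it becomes to tie any reported figure back to a single person.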
The shift in controls from individual consent to data-user accountability is a fundamental and essential change necessary for effective big-data governance. But it is not the only one.
People versus predictions
Courts of law hold people responsible for their actions. When judges render their impartial decisions after a fair trial, justice is done. Yet, in the era of big data, our notion of justice needs to be redefined to preserve the idea of human agency: the free will by which people choose their actions. It is the simple idea that individuals can and should be held responsible for their behavior, not their propensities.
Before big data, this fundamental freedom was obvious. So much so, in fact, that it hardly needed to be articulated. After all, this is the way our legal system works: we hold people responsible for their acts by assessing what they have done. In contrast, with big data we can predict human actions increasingly accurately. This tempts us to judge people not on what they did, but on what we predicted they would do.
In the big-data era we will have to expand our understanding of justice, and require that it include safeguards for human agency as much as we currently protect procedural fairness. Without such safeguards the very idea of justice may be utterly undermined.
By guaranteeing human agency, we ensure that government judgments of our behavior are based on real actions, not simply on big-data analysis. Thus government must only hold us responsible for our past actions, not for statistical predictions of future ones. And when the state judges previous actions, it should be prevented from relying solely on big data. For example, consider the case of nine companies suspected of price fixing. It is entirely acceptable to use big-data analyses to identify possible collusion so that regulators can investigate and build a case using traditional means. But these companies cannot be found guilty only because big data suggests that they probably committed a crime.
A similar principle should apply outside government, when businesses make highly significant decisions about us—to hire or fire, offer a mortgage, or deny a credit card. When they base these decisions mostly on big-data predictions, we recommend that certain safeguards be in place. First is openness: making available the data and algorithm underlying the prediction that affects an individual. Second is certification: having the algorithm certified for certain sensitive uses by an expert third party as sound and valid. Third is disprovability: specifying concrete ways that people can disprove a prediction about themselves. (This is analogous to the tradition in science of disclosing any factors that might undermine the findings of a study.)
Most important, a guarantee on human agency guards against the threat of a dictatorship of data, in which we endow the data with more meaning and importance than it deserves.
It is equally crucial that we protect individual responsibility. Society will face a great temptation to stop holding individuals accountable and to shift instead to managing risk: basing decisions about people on assessments of possibilities and likelihoods of potential outcomes. With so much seemingly objective data available, it may seem appealing to de-emotionalize and de-individualize decision-making, to rely on algorithms rather than on subjective assessments by judges and evaluators, and to frame decisions not in the language of personal responsibility but in terms of more “objective” risks and their avoidance.
For example, big data presents a strong invitation to predict which people are likely to commit crimes and subject them to special treatment, scrutinizing them over and over in the name of risk reduction. People categorized in this way may feel, quite rightly, that they’re being punished without ever being confronted and held responsible for actual behavior. Imagine that an algorithm identifies a particular teenager as highly likely to commit a felony in the next three years. As a result, the authorities assign a social worker to visit him once a month, to keep an eye on him and try to help him stay out of trouble.
If the teenager and his relatives, friends, teachers, or employers view the visits as a stigma, as they well may, then the intervention has the effect of a punishment, a penalty for an action that has not happened. And the situation isn’t much better if the visits are seen not as a punishment but simply as an attempt to reduce the likelihood of future problems—as a way to minimize risk (in this case, the risk of a crime that would undermine public safety). The more we switch from holding people accountable for their acts to relying on data-driven interventions to reduce risk in society, the more we devalue the ideal of individual responsibility. The predictive state is the nanny state, and then some. Denying people’s responsibility for their actions destroys their fundamental freedom to choose their behavior.
If the state bases many decisions on predictions and a desire to mitigate risk, our individual choices—and thus our individual freedom to act—no longer matter. Without guilt, there can be no innocence. Giving in to such an approach would not improve our society but impoverish it.
A fundamental pillar of big-data governance must be a guarantee that we will continue to judge people by considering their personal responsibility and their actual behavior, not by “objectively” crunching data to determine whether they’re likely wrongdoers. Only that way will we treat them as human beings: as people who have the freedom to choose their actions and the right to be judged by them.
Breaking the black box
Computer systems currently base their decisions on rules they have been explicitly programmed to follow. Thus when a decision goes awry, as is inevitable from time to time, we can go back and figure out why the computer made it. For example, we can investigate questions like “Why did the autopilot system pitch the plane five degrees higher when an external sensor detected a sudden surge in humidity?” Today’s computer code can be opened and inspected, and those who know how to interpret it can trace and comprehend the basis for its decisions, no matter how complex.
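A hypothetical fragment of such a rule (a sketch of ours, not real avionics code) makes the point: the condition and the response are spelled out explicitly, so an investigator can read the decision straight off the source.

```python
def adjust_pitch(humidity_surged: bool, current_pitch_deg: float) -> float:
    """Explicitly programmed rule: if an external sensor reports a sudden
    surge in humidity, raise the aircraft's pitch by five degrees."""
    if humidity_surged:
        return current_pitch_deg + 5.0
    return current_pitch_deg
```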
With big-data analysis, however, this traceability will become much harder. The basis of an algorithm’s predictions may often be far too intricate for most people to understand.
When computers were explicitly programmed to follow sets of instructions, as with IBM’s early program for translating Russian into English in 1954, a human could readily grasp why the software substituted one word for another. But Google Translate incorporates billions of pages of translations into its judgments about things like whether the English word “light” should be “lumière” or “léger” in French (that is, whether the word refers to brightness or to weight). It’s impossible for a human to trace the precise reasons for the program’s word choices because they are based on massive amounts of data and vast statistical computations.
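A toy sketch conveys the flavor of this data-driven word choice; it is our own illustration with invented co-occurrence counts, not Google Translate's actual method. Each candidate rendering is scored by how often it has appeared alongside the surrounding English words in past translations, and the highest-scoring one wins.

```python
from collections import Counter

# Invented co-occurrence counts from a hypothetical parallel corpus:
# how often each French rendering of "light" appeared near these English words.
corpus_counts = {
    "lumière": Counter({"bright": 42, "sun": 37, "lamp": 29, "bag": 1}),
    "léger":   Counter({"bag": 51, "weight": 44, "lunch": 18, "sun": 2}),
}

def choose_translation(context_words):
    """Pick the candidate that co-occurred most often with the context."""
    scores = {
        candidate: sum(counts[word] for word in context_words)
        for candidate, counts in corpus_counts.items()
    }
    return max(scores, key=scores.get)

print(choose_translation(["the", "bag", "is", "very"]))         # -> "léger"
print(choose_translation(["the", "bright", "morning", "sun"]))  # -> "lumière"
```

Even in this miniature version, the “reason” for a choice is a pile of counts rather than a rule anyone wrote down; at Google's scale the counts run into the billions.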
Big data operates at a scale that transcends our ordinary understanding. For example, the correlation Google identified between a handful of search terms and the flu was the result of testing 450 million mathematical models. In contrast, Cynthia Rudin initially designed 106 predictors for whether a manhole might catch fire, and she could explain to Con Edison’s managers why her program prioritized inspection sites as it did. “Explainability,” as it is called in artificial intelligence circles, is important for us mortals, who tend to want to know why, not just what. But what if instead of 106 predictors, the system automatically generated a whopping 601 predictors, the vast majority of which had very low weightings but which, when taken together, improved the model’s accuracy? The basis for any prediction might be staggeringly complex. What could she tell the managers then to convince them to reallocate their limited budget?
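The difference is easy to feel in a sketch. The model below is entirely hypothetical, not Rudin's system or Con Edison's data: a linear risk score over 601 invented predictors, a few of them strong and the rest barely above zero.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# 601 invented predictors: a handful of strong weights, the rest near zero.
n_features = 601
weights = rng.normal(loc=0.0, scale=0.01, size=n_features)
weights[:3] = [0.9, 0.6, 0.4]

def risk_score(features: np.ndarray) -> float:
    """Weighted sum of every predictor for one manhole."""
    return float(features @ weights)

manhole_a = rng.random(n_features)
manhole_b = rng.random(n_features)
print(risk_score(manhole_a), risk_score(manhole_b))

# With three dominant weights, "A outranks B because of X, Y and Z" is a fair
# summary; with 601 small contributions, no short explanation covers the ranking.
```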
In these scenarios, we can see the risk that big-data predictions, and the algorithms and datasets behind them, will become black boxes that offer us no accountability, traceability, or confidence. To prevent this, big data will require monitoring and transparency, which in turn will require new types of expertise and institutions. These new players will provide support in areas where society needs to scrutinize big-data predictions and enable people who feel wronged by them to seek redress.
As a society, we’ve often seen such new entities emerge when a dramatic increase in the complexity and specialization of a particular field produced an urgent need for experts to manage the new techniques. Professions like law, medicine, accounting, and engineering underwent this very transformation more than a century ago. More recently, specialists in computer security and privacy have cropped up to certify that companies are complying with the best practices determined by bodies like the International Organization for Standardization (which was itself formed to address a new need for guidelines in this field).
Big data will require a new group of people to take on this role. Perhaps they will be called “algorithmists.” They could take two forms—independent entities to monitor firms from outside, and employees or departments to monitor them from within—just as companies have in-house accountants as well as outside auditors who review their finances.
The rise of the algorithmist
These new professionals would be experts in the areas of computer science, mathematics, and statistics; they would act as reviewers of big-data analyses and predictions. Algorithmists would take a vow of impartiality and confidentiality, much as accountants and certain other professionals do now. They would evaluate the selection of data sources, the choice of analytical and predictive tools, including algorithms and models, and the interpretation of results. In the event of a dispute, they would have access to the algorithms, statistical approaches, and datasets that produced a given decision.
Had there been an algorithmist on staff at the Department of Homeland Security in 2004, he might have prevented the agency from generating a no-fly list so flawed that it included Senator Kennedy. More recent instances where algorithmists could have played a role have happened in Japan, France, Germany, and Italy, where people have complained that Google’s “autocomplete” feature, which produces a list of common search terms associated with a typed-in name, has defamed them. The list is largely based on the frequency of previous searches: terms are ranked by their mathematical probability. Still, which of us wouldn’t be angry if the word “convict” or “prostitute” appeared next to our name when potential business or romantic partners turned to the Web to check us out?
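A stripped-down sketch of frequency-based ranking shows why such a list can wound without anyone intending it. The search log below is invented and the real system is far more elaborate; the point is simply that the suggestions are whatever people searched for most, with no judgment of fairness built in.

```python
from collections import Counter

# Invented search log for a hypothetical name.
past_queries = [
    "jane doe architect", "jane doe architect", "jane doe atlanta",
    "jane doe convict", "jane doe architect", "jane doe convict",
]

def autocomplete(prefix, queries, k=3):
    """Rank completions of `prefix` by how often they were searched before."""
    completions = Counter(
        q[len(prefix):].strip() for q in queries if q.startswith(prefix)
    )
    return [term for term, _ in completions.most_common(k)]

print(autocomplete("jane doe", past_queries))
# -> ['architect', 'convict', 'atlanta']: purely a matter of past frequency.
```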