We envision algorithmists as providing a market-oriented approach to problems like these, one that may head off more intrusive forms of regulation. They’d fill a need similar to the one accountants and auditors filled when they emerged in the early twentieth century to handle the new deluge of financial information. The numeric onslaught was hard for people to understand; it required specialists organized in an agile, self-regulatory way. The market responded by giving rise to a new sector of competitive firms specializing in financial surveillance. By offering this service, the new breed of professionals bolstered society’s confidence in the economy. Big data can and should benefit from a similar confidence boost, and algorithmists would provide it.
EXTERNAL ALGORITHMISTS
We envision external algorithmists acting as impartial auditors who review the accuracy or validity of big-data predictions whenever the government requires it, such as under court order or regulation. They could also take on big-data companies as clients, performing audits for firms that want expert support, and certify the soundness of big-data applications such as anti-fraud techniques or stock-trading systems. Finally, external algorithmists could consult with government agencies on how best to use big data in the public sector.
As in medicine, law, and other occupations, we envision this new profession regulating itself with a code of conduct. The algorithmists’ impartiality, confidentiality, competence, and professionalism would be enforced by tough liability rules: if they failed to adhere to these standards, they would be open to lawsuits. They could also be called on to serve as expert witnesses in trials, or to act as “court masters,” experts appointed by judges to assist them in technical matters on particularly complex cases.
Moreover, people who believe they’ve been harmed by big-data predictions—a patient rejected for surgery, an inmate denied parole, a loan applicant denied a mortgage—can look to algorithmists much as they already look to lawyers for help in understanding and appealing those decisions.
INTERNAL ALGORITHMISTS
Internal algorithmists work inside an organization to monitor its big-data activities. They look out not just for the company’s interests but also for the interests of people affected by its big-data analyses. They oversee big-data operations, serve as the first point of contact for anybody who feels harmed by the organization’s big-data predictions, and vet big-data analyses for integrity and accuracy before letting them go live. To perform these roles, algorithmists must have a measure of freedom and impartiality within the organization they work for.
The notion of a person who works for a company remaining impartial about its operations may seem counterintuitive, but such situations are actually fairly common. The surveillance divisions at major financial institutions are one example; so are the boards of directors at many firms, whose responsibility is to shareholders, not management. And many media companies, including the New York Times and the Washington Post, employ ombudsmen whose primary responsibility is to defend the public trust. These employees handle readers’ complaints and often chastise their employer publicly when they determine that it has done wrong.
And there’s an even closer analogue to the internal algorithmist—a professional charged with ensuring that personal information isn’t misused in the corporate setting. For instance, Germany requires companies above a certain size (generally ten or more people employed in processing personal information) to designate a data-protection representative. Since the 1970s, these in-house representatives have developed a professional ethic and an esprit de corps. They meet regularly to share best practices and training and have their own specialized media and conferences. Moreover, they’ve succeeded in maintaining dual allegiances to their employers and to their duties as impartial reviewers, managing to act as data-protection ombudsmen while also embedding information-privacy values throughout their companies’ operations. We believe in-house algorithmists can do the same.
Governing the data barons
Data is to the information society what fuel was to the industrial economy: the critical resource powering the innovations that people rely on. Without a rich, vibrant supply of data and a robust market for data services, the creativity and productivity these make possible may be stifled.
In this chapter we have laid out three fundamental new strategies for big-data governance, regarding privacy, propensity, and algorithm auditing. We’re confident that with these in place the dark side of big data will be contained. Yet as the nascent big-data industry develops, an additional critical challenge will be to safeguard competitive big-data markets. We must prevent the rise of twenty-first-century data barons, the modern equivalent of the nineteenth-century robber barons who dominated America’s railroads, steel manufacturing, and telegraph networks.
To control those earlier industrialists, the United States established antitrust rules that proved extremely adaptable. Originally designed for the railroads in the 1800s, they were later applied to firms that were gatekeepers to the flow of information businesses depend on: National Cash Register in the 1910s, IBM from the 1960s onward, Xerox in the 1970s, AT&T in the 1980s, Microsoft in the 1990s, and Google today. The technologies these firms pioneered became core components of the economy’s “information infrastructure,” and it took the force of law to prevent unhealthy dominance.
To ensure the conditions for a bustling market for big data, we will need measures comparable to the ones that established competition and oversight in those earlier areas of technology. We should enable data transactions, such as through licensing and interoperability. This raises the issue of whether society might benefit from a carefully crafted and well-balanced “exclusion right” for data (similar to an intellectual property right, as provocative as this may sound!). Admittedly, achieving this would be a tall order for policymakers—and one fraught with risk for the rest of us.
It is obviously impossible to foretell how a technology will develop; even big data can’t predict how big data will evolve. Regulators will need to strike a balance between acting cautiously and boldly—and the history of antitrust law points to one way this can be accomplished.
Antitrust regulation curbed abusive power. Strikingly, its principles translated well from one sector to another and across different types of network industries. It is just the sort of muscular yet technology-neutral regulation that is useful here, since it protects competition without presuming to do much more than that. Hence antitrust may help big data steam ahead, just as it helped the railroads. In addition, governments, as some of the world’s biggest data holders, ought to release their own data publicly. Encouragingly, some are already doing both of these things, at least to an extent.
The lesson of antitrust regulation is that once overarching principles are identified, regulators can implement them to ensure the right degree of safeguards and support. Similarly, the three strategies we have put forward—shifting privacy protections from individual consent to data-user accountability; enshrining human agency amid predictions; and inventing a new caste of big-data auditors we call algorithmists—may serve as a foundation for effective and fair governance of information in the big-data era.
In many fields, from nuclear technology to bioengineering, we first build tools that we discover can harm us and only later set out to devise the safety mechanisms to protect us from those new tools. In this regard, big data takes its place alongside other areas of society that present challenges with no absolute solutions, just ongoing questions about how we order our world. Every generation must address these issues anew. Our task is to appreciate the hazards of this powerful technology, support its development—and seize its rewards.
Just as the printing press led to changes in the way society governs itself, so too does big data. It forces us to confront new challenges with new solutions. To ensure that people are protected at the same time as the technology is promoted, we must not let big data develop beyond the reach of human ability to shape the technology.
10
NEXT
MIKE FLOWERS WAS A LAWYER in the Manhattan district attorney’s office in the early 2000s, prosecuting everything from homicides to Wall Street crimes, before making the shift to a plush corporate law firm. After a boring year behind a desk, he decided to leave that job too. Looking for something more meaningful, he thought of helping to rebuild Iraq. A friendly partner at the firm made a few calls to people in high places. The next thing Flowers knew, he was heading into the Green Zone, the secure area for American troops in the center of Baghdad, as part of the legal team for the trial of Saddam Hussein.
Most of his work turned out to be logistical, not legal. He needed to identify areas of suspected mass graves to know where to send investigators digging. He needed to ferry witnesses into the Green Zone without getting them blown up by the many IED (improvised explosive device) attacks that were a grim daily reality. He noticed that the military treated these tasks as information problems. And data came to the rescue. Intelligence analysts would combine field reports with details about the location, time, and casualties of past IED attacks to predict the safest route for that day.
On his return to New York City a few years later, Flowers realized that those methods marked a more powerful way to combat crime than he’d ever had at his disposal as a prosecutor. And he found a veritable soul mate in the city’s mayor, Michael Bloomberg, who had made his fortune in data by supplying financial information to banks. Flowers was named to a special task force assigned to crunch the numbers that might unmask the villains of the subprime mortgage scandal in 2009. The unit was so successful that a year later Mayor Bloomberg asked it to expand its scope. Flowers became the city’s first “director of analytics.” His mission: to build a team of the best data scientists he could find and harness the city’s untapped troves of information to reap efficiencies covering everything and anything.
Flowers cast his net wide to find the right people. “I had no interest in very experienced statisticians,” he says. “I was a little concerned that they would be reluctant to take this novel approach to problem solving.” Earlier, when he had interviewed traditional stats guys for the financial fraud project, they had tended to raise arcane concerns about mathematical methods. “I wasn’t even thinking about what model I was going to use. I wanted actionable insight, and that was all I cared about,” he says. In the end he picked a team of five people he calls “the kids.” All but one were economics majors just a year or two out of school and without much experience living in a big city, and they all had something a bit creative about them.
Among the first challenges the team tackled was “illegal conversions”—the practice of cutting up a dwelling into many smaller units so that it can house as many as ten times the number of people it was designed for. They are major fire hazards, as well as cauldrons of crime, drugs, disease, and pest infestation. A tangle of extension cords may snake across the walls; hot plates sit perilously on top of bedspreads. People packed this tight regularly die in blazes. In 2005 two firefighters perished trying to rescue residents. New York City gets roughly 25,000 illegal-conversion complaints a year, but it has only 200 inspectors to handle them. There seemed to be no good way to distinguish cases that were simply nuisances from ones that were poised to burst into flames. To Flowers and his kids, though, this looked like a problem that could be solved with lots of data.
They started with a list of every property lot in the city—all 900,000 of them. Next they poured in datasets from 19 different agencies indicating, for example, if the building owner was delinquent in paying property taxes, if there had been foreclosure proceedings, and if anomalies in utilities usage or missed payments had led to any service cuts. They also fed in information about the type of building and when it was built, plus ambulance visits, crime rates, rodent complaints, and more. Then they compared all this information against five years of fire data ranked by severity and looked for correlations in order to generate a system that could predict which complaints should be investigated most urgently.
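The pipeline the team describes can be sketched roughly as follows: pool per-lot flags, weight each flag by how strongly it co-occurs with past severe fires, then rank open complaints by those weights. This is an illustrative reconstruction, not the city's actual system; the flag names and records below are invented.

```python
# Toy historical data: for each lot, a set of flags and whether it had
# a severe fire. Flag names and outcomes are invented for illustration.
history = [
    ({"tax_delinquent", "foreclosure"}, True),
    ({"tax_delinquent"}, True),
    ({"utility_cut"}, False),
    (set(), False),
]

def feature_weights(records):
    """Weight each flag by how much more often it co-occurs with severe fires."""
    flags = set().union(*(f for f, _ in records))
    rate = lambda xs: sum(xs) / len(xs) if xs else 0.0
    weights = {}
    for flag in flags:
        with_flag = [fire for f, fire in records if flag in f]
        without = [fire for f, fire in records if flag not in f]
        weights[flag] = rate(with_flag) - rate(without)
    return weights

def rank_complaints(complaints, weights):
    """Order open complaints by summed flag weights, riskiest first."""
    score = lambda flags: sum(weights.get(f, 0.0) for f in flags)
    return sorted(complaints, key=lambda c: score(c[1]), reverse=True)

weights = feature_weights(history)
queue = rank_complaints(
    [("complaint-A", {"utility_cut"}),
     ("complaint-B", {"tax_delinquent", "foreclosure"})],
    weights,
)
print(queue[0][0])  # prints "complaint-B"
```

In the real system the "flags" came from nineteen agencies and the outcome variable was five years of fire data ranked by severity, but the shape of the computation is the same: correlate, weight, rank, then send the top of the queue to the inspectors.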
Initially, much of the data wasn’t in usable form. For instance, the city’s record keepers did not use a single, standard way to describe location; every agency and department seemed to have its own approach. The buildings department assigns every structure a unique building number. The housing preservation department has a different numbering system. The tax department gives each property an identifier based on borough, block, and lot. The police use Cartesian coordinates. The fire department relies on a system of proximity to “call boxes” related to the location of firehouses, even though call boxes are defunct. Flowers’s kids embraced this messiness by devising a system that identifies buildings by using a small area in the front of the property based on Cartesian coordinates and then draws in geolocation data from the other agencies’ databases. Their method was inherently inexact, but the vast amount of data they were able to use more than compensated for the imperfections.
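Their workaround can be sketched in miniature: give every lot one representative point near the front of the property, then attribute any agency record that carries coordinates to the nearest lot. The lot IDs and coordinates below are made up for illustration; this is a hypothetical sketch, not the city's code.

```python
import math

# Hypothetical master list: each city lot keyed by an ID, with a
# representative (x, y) point near the front of the property.
LOTS = {
    "lot-001": (100.0, 200.0),
    "lot-002": (105.0, 260.0),
    "lot-003": (300.0, 120.0),
}

def resolve_lot(x, y):
    """Map an agency record's coordinates to the nearest lot.

    Inexact by design: any record landing closer to one lot's
    reference point than to any other is attributed to that lot.
    """
    return min(LOTS, key=lambda lot: math.dist((x, y), LOTS[lot]))

# A fire-department record and a tax record may carry different native
# keys, but once both are reduced to coordinates they resolve to the
# same lot, and their attributes can be joined.
print(resolve_lot(101.0, 198.5))  # prints "lot-001"
```

Nearest-point matching like this misattributes records near lot boundaries, which is exactly the "inherently inexact" trade-off the team accepted: with enough data, the occasional misjoin washes out.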
The team members weren’t content just to crunch numbers, though. They went into the field with inspectors to watch them work. They took copious notes and quizzed the pros on everything. When one grizzled chief grunted that the building they were about to examine wouldn’t be a problem, the geeks wanted to know why he felt so sure. He couldn’t quite say, but the kids gradually determined that his intuition was based on the new brickwork on the building’s exterior, which suggested to him that the owner cared about the place.
The kids went back to their cubicles and wondered how they could possibly feed “recent brickwork” into their model as a signal. After all, bricks aren’t datafied—yet. But sure enough, a city permit is required for doing any external brickwork. Adding the permit information improved their system’s predictive performance by indicating that some suspected properties were probably not major risks.
The analytics occasionally showed that some time-honored ways of doing things were not the best, just as the scouts in Moneyball had to accept the shortcomings of their intuition. For example, the number of calls to the city’s “311” complaint hotline was considered to indicate which buildings were most in need of attention. More calls equaled more serious problems. But this turned out to be a misleading measure. A rat spotted on the posh Upper East Side might generate thirty calls within an hour, but it might take a battalion of rodents before residents in the Bronx felt moved to dial 311. Likewise, the majority of complaints about an illegal conversion might be about noise, not about hazardous conditions.
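One crude way to correct that 311 bias is to score each building's call volume against its neighborhood's baseline rather than by raw counts. A hypothetical sketch, with invented numbers:

```python
# Invented call counts: raw 311 volume per building, and the typical
# per-building volume in each building's neighborhood.
calls = {"UES-building": 30, "Bronx-building": 4}
neighborhood_baseline = {"UES-building": 25.0, "Bronx-building": 1.5}

def adjusted(building):
    """A building's call volume as a multiple of its local norm."""
    return calls[building] / neighborhood_baseline[building]

# Raw counts point at the Upper East Side; baseline-adjusted counts
# flag the Bronx building as the real outlier.
print(max(calls, key=calls.get))  # prints "UES-building"
print(max(calls, key=adjusted))   # prints "Bronx-building"
```

The general lesson is the one the team drew: an easily available proxy (call volume) is not the quantity you care about (hazard), and the gap between them varies systematically across neighborhoods.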
In June 2011 Flowers and his kids flipped the switch on their system. Every complaint that fell into the category of an illegal conversion was processed on a weekly basis. They gathered the ones that ranked in the top 5 percent for fire risk and passed them on to the inspectors for immediate follow-up. When the results came back, everyone was stunned.
Prior to the big-data analysis, inspectors followed up the complaints they deemed most dire, but only in 13 percent of cases did they find conditions severe enough to warrant a vacate order. Now they were issuing vacate orders on more than 70 percent of the buildings they inspected. By indicating which buildings most needed their attention, big data improved their efficiency fivefold. And their work became more satisfying: they were concentrating on the biggest problems. The inspectors’ newfound effectiveness had spillover benefits, too. Fires in illegal conversions are 15 times more likely than other fires to result in injury or death for firefighters, so the fire department loved it. Flowers and his kids looked like wizards with a crystal ball that let them see into the future and predict which places were most risky. They took massive quantities of data that had been lying around for years, largely unused after it was collected, and harnessed it in a novel way to extract real value. Using a big corpus of information allowed them to spot connections that weren’t detectable in smaller amounts—the essence of big data.
The experience of New York City’s analytical alchemists highlights many of the themes of this book. They used a gargantuan quantity of data, not just some; their list of buildings in the city represented nothing less than N=all. The data was messy, such as location information or ambulance records, but that didn’t deter them. In fact, the benefits of using more data outweighed the drawbacks of less pristine information. They were able to achieve their accomplishments because so many features of the city had been datafied (however inconsistently), allowing them to process the information.
The inklings of experts had to take a backseat to the data-driven approach. At the same time, Flowers and his kids continually tested their system with veteran inspectors, drawing on their experience to make the system perform better. Yet the most important reason for the program’s success was that it dispensed with a reliance on causation in favor of correlation.
“I am not interested in causation except as it speaks to action,” explains Flowers. “Causation is for other people, and frankly it is very dicey when you start talking about causation. I don’t think there is any cause whatsoever between the day that someone files a foreclosure proceeding against a property and whether or not that place has a historic risk for a structural fire. I think it would be obtuse to think so. And nobody would actually come out and say that. They’d think, no, it’s the underlying factors. But I don’t want to even get into that. I need a specific data point that I have access to, and tell me its significance. If it’s significant, then we’ll act on it. If not, then we won’t. You know, we have real problems to solve. I can’t dick around, frankly, thinking about other things like causation right now.”
Big Data: A Revolution That Will Transform How We Live, Work, and Think Page 21