Data and Goliath
When you’re watched by a computer, none of that dog analogy applies. The computer is processing what it sees, and basing actions on it. You might be told that the computer isn’t saving the data, but you have no assurance that that’s true. You might be told that the computer won’t alert a person if it perceives something of interest, but you can’t know whether that’s true. You have no way of confirming that no person will perceive whatever decision the computer makes, and that you won’t be judged or discriminated against on the basis of what the computer sees.
Moreover, when a computer stores your data, there’s always a risk of exposure. Privacy policies could change tomorrow, permitting new use of old data without your express consent. Some hacker or criminal could break in and steal your data. The organization that has your data could use it in some new and public way, or sell it to another organization. The FBI could serve a National Security Letter on the data owner. On the other hand, there isn’t a court in the world that can get a description of you naked from your dog.
The primary difference between a computer and a dog is that the computer communicates with other people and the dog does not—at least, not well enough to matter. Computer algorithms are written by people, and their output is used by people. And when we think of computer algorithms surveilling us or analyzing our personal data, we need to think about the people behind those algorithms. Whether or not anyone actually looks at our data, the very facts that (1) they could, and (2) they guide the algorithms that do, make it surveillance.
You know this is true. If you believed what Clapper said, then you wouldn’t object to a camera in your bedroom—as long as there were rules governing when the police could look at the footage. You wouldn’t object to being compelled to wear a government-issued listening device 24/7, as long as your bureaucratic monitors followed those same rules. If you do object, it’s because you realize that the privacy harm comes from the automatic collection and algorithmic analysis, regardless of whether or not a person is directly involved in the process.
IDENTIFICATION AND ANONYMITY
We all have experience with identifying ourselves on the Internet. Some websites tie your online identity to your real identity: banks, websites for some government services, and so on. Some tie your online identity to a payment system—generally credit cards—and others to your bank account or cell phone. Some websites don’t care about your real identity, and allow you to maintain a unique username just for that site. Many more sites could work that way. Apple’s iTunes, for example, could be so designed that it doesn’t know who you really are, just that you’re authorized to access a particular set of audio and video files.
The means to perform identification and authentication include passwords, biometrics, and tokens. Many people, myself included, have written extensively about the various systems and their relative strengths and weaknesses. I’ll spare you the details; the takeaway is that none of these systems is perfect, but all are generally good enough for their applications. Authentication basically works.
It works because the people involved want to be identified. You want to convince Hotmail that it’s your account; you want to convince your bank that it’s your money. And while you might not want AT&T to be able to tie all the Internet browsing you do on your smartphone to your identity, you do want the phone network to transmit your calls to you. All of these systems are trying to answer the following question: “Is this the person she claims to be?” That is why it’s so easy to gather data about us online; most of it comes from sources where we’ve intentionally identified ourselves.
Attribution of anonymous activity to a particular person is a much harder problem. In this case, the person doesn’t necessarily want to be identified. He is making an anonymous comment on a website. Or he’s launching a cyberattack against your network. In such a case, the systems have to answer the harder question: “Who is this?”
At a very basic level, we are unable to identify individual pieces of hardware and software when a malicious adversary is trying to evade detection. We can’t attach identifying information to data packets zipping around the Internet. We can’t verify the identity of a person sitting in front of a random keyboard somewhere on the planet. Solving this problem isn’t a matter of overcoming some engineering challenges; this inability is inherent in how the Internet works.
This means that we can’t conclusively figure out who left an anonymous comment on a blog. (It could have been posted using a public computer, or a shared IP address.) We can’t conclusively identify the sender of an e-mail. (Those headers can be spoofed; spammers do it all the time.) We can’t conclusively determine who was behind a series of failed log-ins to your bank account, or a cyberattack against our nation’s infrastructure.
We can’t even be sure whether a particular attack was criminal or military in origin, or which government was behind it. The 2007 cyberattack against Estonia, often talked about as the first cyberwar, was either conducted by a group associated with the Russian government or by a disaffected 22-year-old.
When we do manage to attribute an attack—be it to a mischievous high schooler, a bank robber, or a team of state-sanctioned cyberwarriors—we usually do so after extensive forensic analysis or because the attacker gave himself away in some other manner. It took analysts months to identify China as the definitive source of the New York Times attacks in 2012, and we didn’t know for sure who was behind Stuxnet until the US admitted it. This is a very difficult problem, and one we’re not likely to solve anytime soon.
Over the years, there have been many proposals to eliminate anonymity on the Internet. The idea is that if everything anyone did was attributable—if all actions could be traced to their source—then it would be easy to identify criminals, spammers, stalkers, and Internet trolls. Basically, everyone would get the Internet equivalent of a driver’s license.
This is an impossible goal. First of all, we don’t have the real-world infrastructure to provide Internet user credentials based on other identification systems—passports, national identity cards, driver’s licenses, whatever—which is what would be needed. We certainly don’t have the infrastructure to do that globally.
Even if we did, it would be impossible to make it secure. Every one of our existing identity systems is already subverted by teenagers trying to buy alcohol—and that’s a face-to-face transaction. A new one isn’t going to be any better. And even if it were, it still wouldn’t work. It is always possible to set up an anonymity service on top of an identity system. This fact already annoys countries like China that want to identify everyone using the Internet on their territory.
This might seem to contradict what I wrote in Chapter 3—that it is easy to identify people on the Internet who are trying to stay anonymous. It can be done, but only if you have captured enough data streams to correlate and are willing to put in the investigative time; in other words, the only way to effectively reduce anonymity on the Internet is through mass surveillance. The examples from Chapter 3 all relied on piecing together different clues, and all took time. Tracing a single Internet connection back to its source—a single e-mail, a single web connection, a single attack—is much harder.
The open question is whether the process of identification through correlation and analysis can be automated. Can we build computer systems smart enough to analyze surveillance information to identify individual people, as in the examples we saw in Chapter 3, on a large-scale basis? Not yet, but maybe soon.
It’s being worked on. Countries like China and Russia want automatic systems to ferret out dissident voices on the Internet. The entertainment industry wants similar systems to identify movie and music pirates. And the US government wants the same systems to identify people and organizations it feels are threats, ranging from lone individuals to foreign governments.
In 2012, US Secretary of Defense Leon Panetta said publicly that the US has “made significant advances in . . . identifying the origins” of cyberattacks. My guess is that we have not developed some new science or engineering that fundamentally alters the balance between Internet identifiability and anonymity. Instead, it’s more likely that we have penetrated our adversaries’ networks so deeply that we can spy on and understand their planning processes.
Of course, anonymity cuts both ways, since it can also protect hate speech and criminal activity. But while identification can be important, anonymity is valuable for all the reasons I’ve discussed in this chapter. It protects privacy, it empowers individuals, and it’s fundamental to liberty.
11
Security
Our security is important. Crime, terrorism, and foreign aggression are threats both in and out of cyberspace. They’re not the only threats in town, though, and I just spent the last four chapters delineating others.
We need to defend against a panoply of threats, and this is where we start having problems. Ignoring the risk of overaggressive police or government tyranny in an effort to protect ourselves from terrorism makes as little sense as ignoring the risk of terrorism in an effort to protect ourselves from police overreach.
Unfortunately, as a society we tend to focus on only one threat at a time and minimize the others. Even worse, we tend to focus on rare and spectacular threats and ignore the more frequent and pedestrian ones. So we fear flying more than driving, even though the former is much safer. Or we fear terrorists more than the police, even though in the US you’re nine times more likely to be killed by a police officer than by a terrorist.
We let our fears get in the way of smart security. Defending against some threats at the expense of others is a failing strategy, and we need to find ways of balancing them all.
SECURITY FROM TERRORISTS AND CRIMINALS
The NSA repeatedly uses a connect-the-dots metaphor to justify its surveillance activities. Again and again—after 9/11, after the Underwear Bomber, after the Boston Marathon bombings—the government is criticized for not connecting the dots.
However, this is a terribly misleading metaphor. Connecting the dots in a coloring book is easy, because they’re all numbered and visible. In real life, the dots can only be recognized after the fact.
That doesn’t stop us from demanding to know why the authorities couldn’t connect the dots. The warning signs left by the Fort Hood shooter, the Boston Marathon bombers, and the Isla Vista shooter look obvious in hindsight. Nassim Taleb, an expert on risk engineering, calls this tendency the “narrative fallacy.” Humans are natural storytellers, and the world of stories is much more tidy, predictable, and coherent than reality. Millions of people behave strangely enough to attract the FBI’s notice, and almost all of them are harmless. The TSA’s no-fly list has over 20,000 people on it. The Terrorist Identities Datamart Environment, also known as the watch list, has 680,000, 40% of whom have “no recognized terrorist group affiliation.”
Data mining is offered as the technique that will enable us to connect those dots. But while corporations are successfully mining our personal data in order to target advertising, detect financial fraud, and perform other tasks, three critical issues make data mining an inappropriate tool for finding terrorists.
The first, and most important, issue is error rates. For advertising, data mining can be successful even with a large error rate, but finding terrorists requires a much higher degree of accuracy than data-mining systems can possibly provide.
Data mining works best when you’re searching for a well-defined profile, when there are a reasonable number of events per year, and when the cost of false alarms is low. Detecting credit card fraud is one of data mining’s security success stories: all credit card companies mine their transaction databases for spending patterns that indicate a stolen card. There are over a billion active credit cards in circulation in the United States, and nearly 8% of those are fraudulently used each year. Many credit card thefts share a pattern—purchases in locations not normally frequented by the cardholder, and purchases of travel, luxury goods, and easily fenced items—and in many cases data-mining systems can minimize the losses by preventing fraudulent transactions. The only cost of a false alarm is a phone call to the cardholder asking her to verify a couple of her purchases.
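To make that kind of pattern matching concrete, here is a minimal, purely illustrative sketch of a rule-based fraud flag like the one described above. The field names, thresholds, and risk categories are hypothetical assumptions for illustration; real credit card systems use far more sophisticated statistical models.

```python
# Toy illustration only: flag transactions that don't fit the cardholder's usual pattern.
# All field names and thresholds are hypothetical.

HIGH_RISK_CATEGORIES = {"travel", "luxury goods", "gift cards"}  # easily fenced purchases

def looks_fraudulent(transaction, cardholder_profile):
    """Return True if a transaction deviates enough from the cardholder's habits to warrant a check."""
    unusual_location = transaction["country"] not in cardholder_profile["usual_countries"]
    risky_purchase = transaction["category"] in HIGH_RISK_CATEGORIES
    large_amount = transaction["amount"] > 5 * cardholder_profile["typical_amount"]
    # A false positive here is cheap: the only cost is a verification call to the cardholder.
    return unusual_location and (risky_purchase or large_amount)

profile = {"usual_countries": {"US"}, "typical_amount": 60.00}
print(looks_fraudulent({"country": "RO", "category": "travel", "amount": 900.00}, profile))    # True
print(looks_fraudulent({"country": "US", "category": "groceries", "amount": 45.00}, profile))  # False
```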
Similarly, the IRS uses data mining to identify tax evaders, the police use it to predict crime hot spots, and banks use it to predict loan defaults. These applications have had mixed success, based on the data and the application, but they’re all within the scope of what data mining can accomplish.
Terrorist plots are different, mostly because whereas fraud is common, terrorist attacks are very rare. This means that even highly accurate terrorism prediction systems will be so flooded with false alarms that they will be useless.
The reason lies in the mathematics of detection. All detection systems have errors, and system designers can tune them to minimize either false positives or false negatives. In a terrorist-detection system, a false positive occurs when the system mistakenly identifies something harmless as a threat. A false negative occurs when the system misses an actual attack. Depending on how you “tune” your detection system, you can increase the number of false positives to assure you are less likely to miss an attack, or you can reduce the number of false positives at the expense of missing attacks.
Because terrorist attacks are so rare, false positives completely overwhelm the system, no matter how well you tune. And I mean completely: millions of people will be falsely accused for every real terrorist plot the system finds, if it ever finds any.
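To see why, here is a minimal sketch of the underlying base-rate arithmetic. Every number—the population size, the number of plotters, the detection rates—is an assumption chosen only for illustration; even granting the system implausibly good accuracy, the innocents flagged vastly outnumber the real plotters.

```python
# Base-rate arithmetic behind "false positives overwhelm the system".
# Every number here is a hypothetical assumption, chosen only to illustrate the math.

population = 300_000_000        # people under surveillance
real_plotters = 10              # actual plotters among them (terrorist plots are very rare)
true_positive_rate = 0.99       # assume the system flags 99% of real plotters
false_positive_rate = 0.001     # and wrongly flags just 0.1% of innocent people

true_positives = real_plotters * true_positive_rate
false_positives = (population - real_plotters) * false_positive_rate

print(f"Real plotters flagged: {true_positives:,.0f}")                          # ~10
print(f"Innocents flagged:     {false_positives:,.0f}")                         # ~300,000
print(f"Flags that are real:   1 in {false_positives / true_positives:,.0f}")   # ~1 in 30,000
```

Even before counting the cost of investigating each flag, the arithmetic alone swamps any realistic investigative capacity.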
We might be able to deal with all of the innocents being flagged by the system if the cost of false positives were minor. Think about the full-body scanners at airports. Those alert all the time when scanning people. But a TSA officer can easily check for a false alarm with a simple pat-down. This doesn’t work for a more general data-based terrorism-detection system. Each alert requires a lengthy investigation to determine whether it’s real or not. That takes time and money, and prevents intelligence officers from doing other productive work. Or, more pithily, when you’re watching everything, you’re not seeing anything.
The US intelligence community also likens finding a terrorist plot to looking for a needle in a haystack. And, as former NSA director General Keith Alexander said, “you need the haystack to find the needle.” That statement perfectly illustrates the problem with mass surveillance and bulk collection. When you’re looking for the needle, the last thing you want to do is pile lots more hay on it. More specifically, there is no scientific rationale for believing that adding irrelevant data about innocent people makes it easier to find a terrorist attack, and lots of evidence that it does not. You might be adding slightly more signal, but you’re also adding much more noise. And despite the NSA’s “collect it all” mentality, its own documents bear this out. The military intelligence community even talks about the problem of “drinking from a fire hose”: having so much irrelevant data that it’s impossible to find the important bits.
We saw this problem with the NSA’s eavesdropping program: the false positives overwhelmed the system. In the years after 9/11, the NSA passed to the FBI thousands of tips per month; every one of them turned out to be a false alarm. The cost was enormous, and ended up frustrating the FBI agents who were obligated to investigate all the tips. We also saw this with the Suspicious Activity Reports—or SAR—database: tens of thousands of reports, and no actual results. And all the telephone metadata the NSA collected led to just one success: the conviction of a taxi driver who sent $8,500 to a Somali group that posed no direct threat to the US—and that was probably trumped up so the NSA would have better talking points in front of Congress.
The second problem with using data-mining techniques to try to uncover terrorist plots is that each attack is unique. Who would have guessed that two pressure-cooker bombs would be delivered to the Boston Marathon finish line in backpacks by a Boston college kid and his older brother? Each rare individual who carries out a terrorist attack will have a disproportionate impact on the criteria used to decide who’s a likely terrorist, leading to ineffective detection strategies.
The third problem is that the people the NSA is trying to find are wily, and they’re trying to avoid detection. In the world of personalized marketing, the typical surveillance subject isn’t trying to hide his activities. That is not true in a police or national security context. An adversarial relationship makes the problem much harder, and means that most commercial big data analysis tools just don’t work. A commercial tool can simply ignore people trying to hide and assume benign behavior on the part of everyone else. Government data-mining techniques can’t do that, because those are the very people they’re looking for.
Adversaries vary in the sophistication of their ability to avoid surveillance. Most criminals and terrorists—and political dissidents, sad to say—are pretty unsavvy and make lots of mistakes. But that’s no justification for data mining; targeted surveillance could potentially identify them just as well. The question is whether mass surveillance performs sufficiently better than targeted surveillance to justify its extremely high costs. Several analyses of all the NSA’s efforts indicate that it does not.
The three problems listed above cannot be fixed. Data mining is simply the wrong tool for this job, which means that all the mass surveillance required to feed it cannot be justified. When he was NSA director, General Keith Alexander argued that ubiquitous surveillance would have enabled the NSA to prevent 9/11. That seems unlikely. He wasn’t able to prevent the Boston Marathon bombings in 2013, even though one of the bombers was on the terrorist watch list and both had sloppy social media trails—and this was after a dozen post-9/11 years of honing techniques. The NSA collected data on the Tsarnaevs before the bombing, but hadn’t realized that it was more important than the data they collected on millions of other people.