Data and Goliath

by Bruce Schneier


  We don’t know the details, but the NSA chains together hops based on any connection, not just phone connections. This could include being in the same location as a target, having the same calling pattern, and so on. These types of searches are made possible by having access to everyone’s data.

  You can use mass surveillance to find individuals. If you know that a particular person was at a specific restaurant one evening, a train station three days later in the afternoon, and a hydroelectric plant the next morning, you can query a database of everyone’s cell phone locations, and anyone who fits those characteristics will pop up.
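  To make that kind of query concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `PINGS` log, the device names, the dates, and the `devices_matching` helper are invented, and a real location database would be vastly larger and messier.

```python
from datetime import datetime

# Hypothetical location log: (device_id, place, timestamp). All values are invented.
PINGS = [
    ("phone-A", "restaurant", datetime(2014, 3, 1, 19, 30)),
    ("phone-A", "train station", datetime(2014, 3, 4, 15, 10)),
    ("phone-A", "hydro plant", datetime(2014, 3, 5, 9, 5)),
    ("phone-B", "restaurant", datetime(2014, 3, 1, 20, 0)),
    ("phone-B", "train station", datetime(2014, 3, 4, 8, 0)),
    ("phone-C", "hydro plant", datetime(2014, 3, 5, 9, 45)),
]

# What the searcher knows: a place and a time window for each sighting.
SIGHTINGS = [
    ("restaurant", datetime(2014, 3, 1, 18, 0), datetime(2014, 3, 1, 23, 0)),
    ("train station", datetime(2014, 3, 4, 12, 0), datetime(2014, 3, 4, 18, 0)),
    ("hydro plant", datetime(2014, 3, 5, 6, 0), datetime(2014, 3, 5, 12, 0)),
]


def devices_matching(pings, sightings):
    """Return the device IDs that were seen at every (place, start, end) sighting."""
    candidates = None
    for place, start, end in sightings:
        seen = {d for d, p, t in pings if p == place and start <= t <= end}
        # Intersect with the devices that matched all earlier sightings.
        candidates = seen if candidates is None else candidates & seen
    return candidates or set()


print(devices_matching(PINGS, SIGHTINGS))
```

  Each additional sighting shrinks the candidate set, which is why a handful of known places and times is often enough to leave a single phone.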

  You can also search for anomalous behavior. Here are four examples of how the NSA uses cell phone data.

  1. The NSA uses cell phone location information to track people whose movements intersect. For example, assume that the NSA is interested in Alice. If Bob is at the same restaurant as Alice one evening, and then at the same coffee shop as Alice a week later, and at the same airport as Alice a month later, the system will flag Bob as a potential associate of Alice’s, even if the two have never communicated electronically.

  2. The NSA tracks the locations of phones that are carried around by US spies overseas. Then it determines whether there are any other cell phones that follow the agents’ phones around. Basically, the NSA checks whether anyone is tailing those agents.

  3. The NSA has a program where it trawls through cell phone metadata to spot phones that are turned on, used for a while, and then turned off and never used again. And it uses the phones’ usage patterns to chain them together. This technique is employed to find “burner” phones used by people who wish to avoid detection.

  4. The NSA collects data on people who turn their phones off, and for how long. It then collects the locations of those people when they turned their phones off, and looks for others nearby who also turned their phones off for a similar period of time. In other words, it looks for secret meetings.
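  The first of these examples, the Alice-and-Bob case, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `flag_associates` function, the one-hour window, and the three-meeting threshold are all invented here, and nothing is publicly known about how the NSA's actual system scores co-location.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical pings: (device, place, timestamp). All values are invented.
PINGS = [
    ("alice", "restaurant", datetime(2014, 1, 10, 19, 0)),
    ("alice", "coffee shop", datetime(2014, 1, 17, 9, 0)),
    ("alice", "airport", datetime(2014, 2, 10, 6, 30)),
    ("bob", "restaurant", datetime(2014, 1, 10, 19, 20)),
    ("bob", "coffee shop", datetime(2014, 1, 17, 9, 10)),
    ("bob", "airport", datetime(2014, 2, 10, 6, 45)),
    ("carol", "restaurant", datetime(2014, 1, 10, 19, 5)),
]


def flag_associates(pings, target, window=timedelta(hours=1), min_meetings=3):
    """Flag devices that shared a place with `target` on several occasions."""
    target_visits = [(p, t) for d, p, t in pings if d == target]
    meetings = defaultdict(set)
    for device, place, when in pings:
        if device == target:
            continue
        for t_place, t_when in target_visits:
            if place == t_place and abs(when - t_when) <= window:
                # Key on the target's visit so repeated pings during one
                # visit count as a single meeting.
                meetings[device].add((t_place, t_when))
    return {d for d, m in meetings.items() if len(m) >= min_meetings}


print(flag_associates(PINGS, "alice"))
```

  Carol shares the restaurant with Alice once and is not flagged; Bob crosses paths with her three times and is, even though the two never exchanged a message.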

  I’ve already discussed the government of Ukraine using cell phone location data to find everybody who attended an antigovernment demonstration, and the Michigan police using it to find everyone who was near a planned labor union protest site. The FBI has used this data to find phones that were used by a particular target but not otherwise associated with him.

  Corporations do some of this as well. There’s a technique called geofencing that marketers use to identify people who are near a particular business so as to deliver an ad to them. A single geofencing company, Placecast, delivers location-based ads to ten million phones in the US and UK for chains like Starbucks, Kmart, and Subway. Microsoft does the same thing to people passing within ten miles of some of its stores; it works with the company NinthDecimal. Sense Networks uses location data to create individual profiles.

  CORRELATING DIFFERENT DATA SETS

  Vigilant Solutions is one of the companies that collect license plate data from cameras. It has plans to augment this system with other algorithms for automobile identification, systems of facial recognition, and information from other databases. The result would be a much more powerful surveillance platform than a simple database of license plate scans, no matter how extensive, could ever be.

  News stories about mass surveillance are generally framed in terms of data collection, but miss the story about data correlation: the linking of identities across different data sets to draw inferences from the combined data. It’s not just that inexpensive drones with powerful cameras will become increasingly common. It’s the drones plus facial recognition software that allows the system to identify people automatically, plus the large databases of tagged photos—from driver’s licenses, Facebook, newspapers, high school yearbooks—that will provide reference images for that software. It’s also the ability to correlate that identification with numerous other databases, and the ability to store all that data indefinitely. Ubiquitous surveillance is the result of multiple streams of mass surveillance tied together.

  I have an Oyster card that I use to pay for public transport while in London. I’ve taken pains to keep it cash-only and anonymous. Even so, if you were to correlate the usage of that card with a list of people who visit London and the dates—whether that list is provided by the airlines, credit card companies, cell phone companies, or ISPs—I’ll bet that I’m the only person for whom those dates correlate perfectly. So my “anonymous” movement through the London Underground becomes nothing of the sort.

  Snowden disclosed an interesting research project from the CSEC—that’s the Communications Security Establishment Canada, the country’s NSA equivalent—that demonstrates the value of correlating different streams of surveillance information to find people who are deliberately trying to evade detection.

  A CSEC researcher, with the cool-sounding job title of “tradecraft developer,” started with two weeks’ worth of Internet identification data: basically, a list of user IDs that logged on to various websites. He also had a database of geographic locations for different wireless networks’ IP addresses. By putting the two databases together, he could tie user IDs logging in from different wireless networks to the physical location of those networks. The idea was to use this data to find people. If you know the user ID of some surveillance target, you can set an alarm when that target uses an airport or hotel wireless network and learn when he is traveling. You can also identify a particular person who you know visited a particular geographical area on a series of dates and times. For example, assume you’re looking for someone who called you anonymously from three different pay phones. You know the dates and times of the calls, and the locations of those pay phones. If that person has a smartphone in his pocket that automatically logs into wireless networks, then you can correlate that log-in database with dates and times you’re interested in and the locations of those networks. The odds are that there will only be one match.
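  The pay phone example amounts to joining two tables and then intersecting the results. A minimal Python sketch, assuming invented data throughout: the `NETWORK_LOCATION` and `LOGINS` tables, the place names, the 30-minute window, and the `users_present_at_all` helper are hypothetical and are not taken from the CSEC project.

```python
from datetime import datetime, timedelta

# Hypothetical data. The addresses come from a range reserved for documentation.
NETWORK_LOCATION = {
    "192.0.2.10": "Main St cafe",
    "192.0.2.20": "Union Station",
    "192.0.2.30": "Harbor Hotel",
}
LOGINS = [
    ("user-17", "192.0.2.10", datetime(2014, 5, 1, 10, 2)),
    ("user-17", "192.0.2.20", datetime(2014, 5, 3, 16, 40)),
    ("user-17", "192.0.2.30", datetime(2014, 5, 6, 8, 15)),
    ("user-42", "192.0.2.10", datetime(2014, 5, 1, 10, 5)),
    ("user-42", "192.0.2.30", datetime(2014, 5, 2, 21, 0)),
    ("user-99", "192.0.2.20", datetime(2014, 5, 3, 16, 45)),
]
# Where and when the three anonymous calls were placed.
CALLS = [
    ("Main St cafe", datetime(2014, 5, 1, 10, 0)),
    ("Union Station", datetime(2014, 5, 3, 16, 30)),
    ("Harbor Hotel", datetime(2014, 5, 6, 8, 20)),
]


def users_present_at_all(logins, network_location, events,
                         window=timedelta(minutes=30)):
    """Return user IDs that logged in near every (place, time) event."""
    # Step 1: join log-ins to physical places through the network address.
    located = [(user, network_location[ip], when)
               for user, ip, when in logins if ip in network_location]
    # Step 2: keep only the users seen near every event.
    matches = None
    for place, when in events:
        nearby = {u for u, p, t in located
                  if p == place and abs(t - when) <= window}
        matches = nearby if matches is None else matches & nearby
    return matches or set()


print(users_present_at_all(LOGINS, NETWORK_LOCATION, CALLS))
```

  Plenty of users match one event, but in this toy data only one user ID matches all three, which is the point of the paragraph above.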

  Researchers at Carnegie Mellon University did something similar. They put a camera in a public place, captured images of people walking past, identified them with facial recognition software and Facebook’s public tagged photo database, and correlated the names with other databases. The result was that they were able to display personal information about a person in real time as he or she was walking by. This technology could easily be available to anyone, using smartphone cameras or Google Glass.

  Sometimes linking identities across data sets is easy; your cell phone is connected to your name, and so is your credit card. Sometimes it’s harder; your e-mail address might not be connected to your name, except for the times people refer to you by name in e-mail. Companies like Initiate Systems sell software that correlates data across multiple data sets; they sell to both governments and corporations. Companies are also correlating your online behavior with your offline actions. Facebook, for example, is partnering with the data brokers Acxiom and Epsilon to match your online profile with in-store purchases.

  Once you can correlate different data sets, there is a lot you can do with them. Imagine building up a picture of someone’s health without ever looking at his patient records. Credit card records and supermarket affinity cards reveal what food and alcohol he buys, which restaurants he eats at, whether he has a gym membership, and what nonprescription items he buys at a pharmacy. His phone reveals how often he goes to that gym, and his activity tracker reveals his activity level when he’s there. Data from websites reveal what medical terms he’s searched on. This is how a company like ExactData can sell lists of people who date online, people who gamble, and people who suffer from anxiety, incontinence, or erectile dysfunction.

  PIERCING OUR ANONYMITY

  When a powerful organization is eavesdropping on significant portions of our electronic infrastructure and can correlate the various surveillance streams, it can often identify people who are trying to hide. Here are four stories to illustrate that.

  1. Chinese military hackers who were implicated in a broad set of attacks against the US government and corporations were identified because they accessed Facebook from the same network infrastructure they used to carry out their attacks.

  2. Hector Monsegur, one of the leaders of the LulzSec hacker movement under investigation for breaking into numerous commercial networks, was identified and arrested in 2011 by the FBI. Although he usually practiced good computer security and used an anonymous relay service to protect his identity, he slipped up once. An inadvertent disclosure during a chat allowed an investigator to track down a video on YouTube of his car, then to find his Facebook page.

  3. Paula Broadwell, who had an affair with CIA director David Petraeus, similarly took extensive precautions to hide her identity. She never logged in to her anonymous e-mail service from her home network. Instead, she used hotel and other public networks when she e-mailed him. The FBI correlated registration data from several different hotels, and hers was the common name.

  4. A member of the hacker group Anonymous called “w0rmer,” wanted for hacking US law enforcement websites, used an anonymous Twitter account, but linked to a photo of a woman’s breasts taken with an iPhone. The photo’s embedded GPS coordinates pointed to a house in Australia. Another website that referenced w0rmer also mentioned the name Higinio Ochoa. The police got hold of Ochoa’s Facebook page, which included the information that he had an Australian girlfriend. Photos of the girlfriend matched the original photo that started all this, and police arrested w0rmer aka Ochoa.

  Maintaining Internet anonymity against a ubiquitous surveillor is nearly impossible. If you forget even once to enable your protections, or click on the wrong link, or type the wrong thing, you’ve permanently attached your name to whatever anonymous provider you’re using. The level of operational security required to maintain privacy and anonymity in the face of a focused and determined investigation is beyond the resources of even trained government agents. Even a team of highly trained Israeli assassins was quickly identified in Dubai, based on surveillance camera footage around the city.

  The same is true for large sets of anonymous data. We might naïvely think that there are so many of us that it’s easy to hide in the sea of data. Or that most of our data is anonymous. That’s not true. Most techniques for anonymizing data don’t work, and the data can be de-anonymized with surprisingly little information.

  In 2006, AOL released three months of search data for 657,000 users: 20 million searches in all. The idea was that it would be useful for researchers; to protect people’s identity, they replaced names with numbers. So, for example, Bruce Schneier might be 608429. They were surprised when researchers were able to attach names to numbers by correlating different items in individuals’ search history.

  In 2008, Netflix published 10 million movie rankings by 500,000 anonymized customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using at that time. Researchers were able to de-anonymize people by comparing rankings and time stamps with public rankings and time stamps in the Internet Movie Database.
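  A rough Python sketch of the idea follows. It is deliberately simplified and is not the researchers' actual method; the `best_match` function, the three-day tolerance, the names, and all of the ratings are assumptions invented for illustration.

```python
from datetime import date

# One "anonymized" customer record: movie -> (stars, date rated). Invented data.
ANON_RECORD = {
    "Movie A": (5, date(2005, 3, 1)),
    "Movie B": (2, date(2005, 3, 9)),
    "Movie C": (4, date(2005, 4, 2)),
}

# Public reviews posted under real names, in the same shape. Invented data.
PUBLIC_PROFILES = {
    "Pat Example": {
        "Movie A": (5, date(2005, 3, 2)),
        "Movie B": (2, date(2005, 3, 9)),
        "Movie C": (4, date(2005, 4, 4)),
    },
    "Sam Sample": {
        "Movie A": (5, date(2005, 8, 20)),
        "Movie C": (1, date(2005, 4, 2)),
    },
}


def best_match(anon_ratings, public_profiles, max_days=3):
    """Score each public profile by how many ratings line up with the record."""
    scores = {}
    for name, ratings in public_profiles.items():
        score = 0
        for movie, (stars, day) in ratings.items():
            if movie in anon_ratings:
                a_stars, a_day = anon_ratings[movie]
                # Count a hit when the stars agree and the dates are close.
                if a_stars == stars and abs((a_day - day).days) <= max_days:
                    score += 1
        scores[name] = score
    return max(scores, key=scores.get), scores


print(best_match(ANON_RECORD, PUBLIC_PROFILES))
```

  A single shared rating proves nothing, but a few agreeing ratings with agreeing dates quickly separate one profile from all the others.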

  These might seem like special cases, but correlation opportunities pop up more frequently than you might think. Someone with access to an anonymous data set of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchant’s telephone order database. Or Amazon’s online book reviews could be the key to partially de-anonymizing a database of credit card purchase details.

  Using public anonymous data from the 1990 census, computer scientist Latanya Sweeney found that 87% of the population in the United States, 216 million of 248 million people, could likely be uniquely identified by their five-digit ZIP code combined with their gender and date of birth. For about half, just a city, town, or municipality name was sufficient. Other researchers reported similar results using 2000 census data.
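  The measurement behind a figure like that is simple to express: count how many records have a combination of attributes that no other record shares. The Python sketch below uses five invented records and a hypothetical `unique_fraction` helper; it illustrates the calculation and is not Sweeney's code or data.

```python
from collections import Counter

# Five invented records with no names attached.
RECORDS = [
    {"zip": "02138", "gender": "F", "dob": "1965-07-22"},
    {"zip": "02138", "gender": "F", "dob": "1965-07-22"},
    {"zip": "02138", "gender": "M", "dob": "1965-07-22"},
    {"zip": "02139", "gender": "F", "dob": "1980-01-15"},
    {"zip": "02139", "gender": "M", "dob": "1980-01-15"},
]


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records
                 if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


print(unique_fraction(RECORDS, ["zip"]))                   # ZIP code alone
print(unique_fraction(RECORDS, ["zip", "gender", "dob"]))  # all three combined

# The figure quoted above: 216 million of 248 million people.
print(round(216 / 248, 2))
```

  In this toy data no one is unique by ZIP code alone, but three of the five records become unique once gender and date of birth are added. Each extra attribute splits the population into smaller groups until most groups contain one person.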

  Google, with its database of users’ Internet searches, could de-anonymize a public database of Internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine’s search data. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.

  Researchers have been able to identify people from their anonymous DNA by comparing the data with information from genealogy sites and other sources. Even something like Alfred Kinsey’s sex research data from the 1930s and 1940s isn’t safe. Kinsey took great pains to preserve the anonymity of his subjects, but in 2013, researcher Raquel Hill was able to identify 97% of them.

  It’s counterintuitive, but it takes less data to uniquely identify us than we think. Even though we’re all pretty typical, we’re nonetheless distinctive. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This is also true for our book-reading habits, our Internet-shopping habits, our telephone habits, and our web-searching habits. We can be uniquely identified by our relationships. It’s quite obvious that you can be uniquely identified by your location data. With 24/7 location data from your cell phone, your name can be uncovered without too much trouble. You don’t even need all that data; 95% of Americans can be identified by name from just four time/date/location points.

  The obvious countermeasures for this are, sadly, inadequate. Companies have anonymized data sets by removing some of the data, changing the time stamps, or inserting deliberate errors into the unique ID numbers they replaced names with. It turns out, though, that these sorts of tweaks only make de-anonymization slightly harder.

  This is why regulation based on the concept of “personally identifying information” doesn’t work. PII is usually defined as a name, unique account number, and so on, and special rules apply to it. But PII is also about the amount of data; the more information someone has about you, even anonymous information, the easier it is for her to identify you.

  For the most part, our protections come from the privacy policies of the companies we use, not from any technology or mathematics. And being identified by a unique number often doesn’t provide much protection. The data can still be collected and correlated and used, and eventually we do something to attach our name to that “anonymous” data record.

  In the age of ubiquitous surveillance, where everyone collects data on us all the time, anonymity is fragile. We either need to develop more robust techniques for preserving anonymity, or give up on the idea entirely.

  4

  The Business of Surveillance

  One of the most surprising things about today’s cell phones is how many other things they also do. People don’t wear watches, because their phones have a clock. People don’t carry cameras, because they come standard in most smartphones.

  That camera flash can also be used as a flashlight. One of the flashlight apps available for Android phones is Brightest Flashlight Free, by a company called GoldenShores Technologies, LLC. It works great and has a bunch of cool features. Reviewers recommended it to kids going trick-or-treating. One feature that wasn’t mentioned by reviewers is that the app collected location information from its users and allegedly sold it to advertisers.

  It’s actually more complicated than that. The company’s privacy policy, never mind that no one read it, actively misled consumers. It said that the company would use any information collected, but left out that the information would be sold to third parties. And although users had to click “accept” on the license agreement they also didn’t read, the app started collecting and sending location information even before people clicked.

  This surprised pretty much all of the app’s 50 million users when researchers discovered it in 2012. The US Federal Trade Commission got involved, forcing the company to clean up its deceptive practices and delete the data it had collected. It didn’t fine the company, though, because the app was free.

  Imagine that the US government passed a law requiring all citizens to carry a tracking device. Such a law would immediately be found unconstitutional. Yet we carry our cell phones everywhere. If the local police department required us to notify it whenever we made a new friend, the nation would rebel. Yet we notify Facebook. If the country’s spies demanded copies of all our conversations and correspondence, people would refuse. Yet we provide copies to our e-mail service providers, our cell phone companies, our social networking platforms, and our Internet service providers.

  The overwhelming bulk of surveillance is corporate, and it occurs because we ostensibly agree to it. I don’t mean that we make an informed decision agreeing to it; instead, we accept it either because we get value from the service or because we are offered a package deal that includes surveillance and don’t have any real choice in the matter. This is the bargain I talked about in the Introduction.

  This chapter is primarily about Internet surveillance, but remember that everything is—or soon will be—connected to the Internet. Internet surveillance is really shorthand for surveillance in an Internet-connected world.

  INTERNET SURVEILLANCE

  The primary goal of all this corporate Internet surveillance is advertising. There’s a little market research and customer service in there, but those activities are secondary to the goal of more effectively selling you things.

  Internet surveillance is traditionally based on something called a cookie. The name sounds benign, but the technical description “persistent identifier” is far more accurate. Cookies weren’t intended to be surveillance devices; rather, they were designed to make surfing the web easier. Websites don’t inherently remember you from visit to visit or even from click to click. Cookies provide the solution to this problem. Each cookie contains a unique number that allows the site to identify you. So now when you click around on an Internet merchant’s site, you keep telling it, “I’m customer #608431.” This allows the site to find your account, keep your shopping cart attached to you, remember you the next time you visit, and so on.
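  The mechanism can be sketched with Python's standard `http.cookies` module. This is a toy model under stated assumptions: the `handle_request` function, the `customer` cookie name, and the in-memory `CARTS` table are invented for illustration, and a real site would add expiry dates, security attributes, and durable storage.

```python
import uuid
from http.cookies import SimpleCookie

CARTS = {}  # server-side state, keyed by the identifier stored in the cookie


def handle_request(cookie_header, add_item=None):
    """Handle one request; return (Set-Cookie value or None, cart contents)."""
    cookie = SimpleCookie(cookie_header or "")
    set_cookie = None
    if "customer" in cookie:
        # Returning visitor: the browser has told us who it is.
        customer_id = cookie["customer"].value
    else:
        # First visit: mint a unique number and ask the browser to keep it.
        customer_id = uuid.uuid4().hex
        out = SimpleCookie()
        out["customer"] = customer_id
        set_cookie = out["customer"].OutputString()
    cart = CARTS.setdefault(customer_id, [])
    if add_item:
        cart.append(add_item)
    return set_cookie, list(cart)


# The first request carries no cookie, so the site issues one.
header, cart = handle_request(None, add_item="book")
# The browser sends the same value back on every later request.
_, cart = handle_request(header, add_item="lamp")
print(cart)
```

  The site never needs a name. The number alone ties every click to the same record, which is exactly what makes it useful for a shopping cart and for tracking.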

 
