Dataclysm: Who We Are (When We Think No One's Looking)

by Christian Rudder


  I point this out because, to many people, government workers have an indifferent reputation—bureaucrats, functionaries, whatever. And certainly the average person working in data analytics in the private sector is as likely to be competent as not. But the people spying on us are extremely, extremely smart. We can hope that they, like Feynman and Einstein before them, are able to temper their work with a farsighted humanity, but we can know, for sure, that, like Feynman and Einstein before them, what they’re working on is inhumanly powerful.

  Insofar as algorithms are fed by data, Mr. Snowden has revealed that the NSA’s are fatted on superfood. Or rather … all the food. They gather phone calls, e-mail, text messages, pictures, basically everything that travels by electric current. It’s clear that it’s not a passive operation—according to one leaked document, the stated, top-level purpose is to “master the Internet.” The project’s brazenness is one of the most phenomenal things about it. Among the first documents published (jointly by the Guardian and the Washington Post) was a PowerPoint presentation about a program called PRISM. The slides don’t beat around the bush:

  It should’ve been called Operation Yoink! On the one hand, life on Earth only gets worse when anyone wearing a sidearm starts thinking about our Facebook accounts. On the other, it’s hard to be afraid of people using the Draw tool in a Microsoft product.

  No one sees the PRISM data for an individual without a court order, at least in theory, because the program is so invasive. Other snooping is mostly focused on metadata—the incidentals of communication. Here’s the government’s own Privacy and Civil Liberties Oversight Board describing one part of another project:

  For each of the millions of telephone numbers covered by the NSA’s Section 215 program, the agency obtains a record of all incoming and outgoing calls, the duration of those calls, and the precise time of day when they occurred. When the agency targets a telephone number for analysis, the same information [is obtained] for every telephone number with which the original number has had contact, and every telephone number in contact with any of those numbers.

  It must be said that none of this entails the actual content of anyone’s communication. In that respect, it’s not much different from the data we’ve looked at in this book. We let patterns stand in for any single person’s life, just like these guys do. At the NSA, again according to them, if your web of calls fits the profile of a “threat,” only then do they start paying real attention. But metadata isn’t necessarily less invasive for being indirect.
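  The contact chaining the Oversight Board describes amounts to a short breadth-first walk over call records: start at the targeted number, collect everyone it has been in contact with, then collect everyone in contact with them. Here is a minimal sketch of that idea in Python. The record layout, the function names, and the sample numbers are all hypothetical, chosen only for illustration; nothing here reflects how the agency's systems are actually built.

```python
from collections import defaultdict


def build_contacts(call_records):
    """Map each number to the set of numbers it has called or been called by."""
    contacts = defaultdict(set)
    for caller, callee, _duration_s, _timestamp in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return contacts


def numbers_within_hops(contacts, target, max_hops):
    """Breadth-first walk: return {number: hops from target}, up to max_hops."""
    hops = {target: 0}
    frontier = [target]
    for depth in range(1, max_hops + 1):
        next_frontier = []
        for number in frontier:
            for neighbor in contacts.get(number, ()):
                if neighbor not in hops:
                    hops[neighbor] = depth
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return hops


# Hypothetical records: (caller, callee, duration in seconds, time of call).
# Note that no call content appears anywhere, only the incidentals.
records = [
    ("A", "B", 60, "2013-06-01T09:00"),
    ("B", "C", 30, "2013-06-01T10:00"),
    ("C", "D", 45, "2013-06-02T11:00"),
    ("E", "A", 10, "2013-06-03T12:00"),
]
reach = numbers_within_hops(build_contacts(records), "A", max_hops=2)
```

  With these four made-up calls, targeting "A" sweeps in "B" and "E" at one hop and "C" at two, while "D" sits a third hop out and is left alone. On real call graphs the two-hop set grows very quickly, which is part of why indirect data can still be invasive.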

  People leave some amazing breadcrumbs for anyone interested in following them. You’ve seen plenty already—200 pages’ worth. Even so, there are just as many trails we haven’t followed. For example, a little block of metadata called the Exif is embedded in nearly all images taken with a digital camera, from high-end SLRs to your iPhone. It encodes not only when the picture was taken but miscellany like the f-stop and shutter speed for the photo and, often, the latitude and longitude of where it was taken. Exif is how programs like iPhoto can effortlessly sort your pictures into “moments” and place little pins all over the map to show you where you’ve been. There are other things the Exif can tell you, though. Take the profile photos on OkCupid. The better-looking a photo is, the better chance it has of being outdated. That is, people find that one “great picture” and just lock it in forever. We know this because of the Exif, which tells us when the picture was taken. This kind of data tagalong is common. GPS coordinates ride shotgun over the network whenever you open your favorite app. Almost every web page you’ve ever loaded has dozens of one-pixel images (just a single transparent dot) buried in the margins that, by being loaded alongside the “real” page, register your visit; the pixels can’t tell what you’re doing, just when and where you’ve gone. This simple stuff, just whens and wheres, can give a company a good guess at your whole demographic profile.

  What about the people who don’t want to share like this? The people who would rather shop and preen alone? I myself know the value of privacy. That’s part of the reason I’m not a big social-media user, frankly. I have never posted a picture of my daughter on the Internet. I started using Instagram in earlyish 2011 when the service wasn’t big yet, and I used it as just a photo gallery app because I liked the filters. I thought it was like Hipstamatic, not really social—I know this makes me sound like a grandfather. When my wife realized what her fuddy-duddy husband was doing, she pointed out that I could connect my account to other people’s accounts, which I did, because hey, look: a button to click. But once it wasn’t just me on my own with my pictures, it lost all appeal.

  This kind of reticence is unusual. For all the hand-wringing, it’s hard to argue that most users are anything but blasé about privacy. Whenever Facebook updates its Terms of Service to extend their reach deeper into our data, we rage in circles for a day, then are on the site the next, like so many provoked bees who, finding no one to sting, have nowhere to go but back to the hive. Because tech loves to push boundaries and the boundaries keep giving, software has gotten almost aggressively invasive. There are weight-loss apps. Heart-rate apps. Rate-my-outfit apps—submit your ensemble to the crowd for fashion advice. Women are using apps to predict and manage their menstrual cycle: “The market is flooded with them,” as Jenna Wortham writes, before adding, “nearly every woman I know uses one.” You let the app know when your period starts, and it’ll alert you when you’re at peak fertility, to avoid or embrace as you wish. Of course, self-reported data not being quite invasive enough, there’s a startup that says it can infer when a woman is having her period from her link history. Any of these menstruation apps—at least if they have a competent data scientist behind them—will of course also know when a user is pregnant, overexercising, getting older, or having unprotected sex, since when you’re late, you’ll check the thing unusually often.

  But despite some, even many, people’s cavalier attitude toward privacy, I didn’t want to put anyone’s identity at risk in making this book. As I’ve said, all the analysis was done anonymously and in aggregate, and I handled the raw source material with care. There was no personally identifiable information (PII) in any of my data. In the discussion of users’ words—their profile text, tweets, status updates, and the like—those words were public. Where I had user-by-user records, the userids were encrypted. And in any analysis the scope of the data was limited to only the essential variables, so nothing could be tied back to any individual.
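  For readers curious what this kind of handling can look like in practice, here is a rough sketch in Python. It is an assumption-laden illustration, not the pipeline used for this book: it swaps in a keyed hash (HMAC-SHA256) where the text says "encrypted," and the key, the field names, and the sample record are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret, stored separately from the data it protects.
SECRET_KEY = b"hypothetical-key-kept-away-from-the-data"

# Hypothetical list of the only variables a given analysis needs.
ESSENTIAL_FIELDS = ("age", "rating")


def pseudonymize(userid, key=SECRET_KEY):
    """Replace a userid with a keyed hash: stable per user, opaque without the key."""
    return hmac.new(key, str(userid).encode("utf-8"), hashlib.sha256).hexdigest()


def scrub(record):
    """Keep only the essential variables and swap the userid for its pseudonym."""
    cleaned = {field: record[field] for field in ESSENTIAL_FIELDS}
    cleaned["user"] = pseudonymize(record["userid"])
    return cleaned


# A made-up raw record; the e-mail address never reaches the analysis.
raw = {"userid": 12345, "email": "someone@example.com", "age": 29, "rating": 4}
safe = scrub(raw)
```

  The same userid always maps to the same pseudonym, so user-by-user records still line up, but the original id and any field outside the essential list are dropped. A step like this reduces exposure; on its own it does not prove that the remaining variables could never single someone out.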

  I never wanted to connect the data back to individuals, of course. My goal was to connect it back to everyone. That’s the value I see in the data and therefore in the privacy lost in its existence: what we can learn. Jaron Lanier, author of Who Owns the Future? and a computer scientist currently working at Microsoft Research, wrote in Scientific American that “a stupendous amount of information about our private lives is being stored, analyzed and acted on in advance of a demonstrated valid use for it.” He’s unquestionably right about the “stupendous amount,” but I take issue with his final clause. How does anything ever become useful if it can’t be “acted on in advance of a demonstrated valid use”? The whole idea of research science is predicated on exploration. Iron ore was once just another rock until someone started to experiment with it. Mold on bread spent millennia just making people sick until Alexander Fleming discovered it also made penicillin.

  Already data science is generating deep findings that don’t just describe, but change, how people live. I’ve already mentioned Google Flu; launched in 2008, it now tracks nascent epidemics in more than twenty-five countries. It’s not a perfect tool, but it’s a start. Combined data is even being used to prevent disease, not just minimize it. As the New York Times reported last year: “Using data drawn from queries entered into Google, Microsoft and Yahoo search engines, scientists at Microsoft, Stanford and Columbia University have for the first time been able to detect evidence of unreported prescription drug side effects before they were found by the Food and Drug Administration’s warning system.” The researchers determined that paroxetine and pravastatin were causing hyperglycemia in patients. Here, the payoff for living a little less privately is to live a little more healthily.

  Every day, it seems, brings word of some new advance. Today, I found out that a site called geni.com is well on the way to creating a crowdsourced family tree for all mankind. If it works, the company will have made, essentially, a social network for our genetic material. The week before, two political scientists debunked the received wisdom that Republicans owe their House majority to district gerrymandering. The authors had modeled every possible election over every possible configuration of the United States and concluded, with the computer playing Candide, that our divided world is the best we can hope for. The political geography of the country, not the actual maps, creates the gridlock.

  This is just the beginning. Data has a long head start—Facebook was collecting 500 terabytes of information every day way back in 2012—but the analysis is starting to catch up. Data journalism was brought to the mainstream by Nate Silver, but it’s become a staple of reporting: we quantify to understand. The Times, the Washington Post, the Guardian have all built impressive analytic and visualization teams and continue to devote resources to publishing the data of our lives, even in the constrained financial climate for reporters and their work.

  On the flush corporate side, Google, mentioned many times in these pages, leads the way in turning data to the public good. There’s Flu and the work of Stephens-Davidowitz, but also a raft of even more ambitious, if less publicized, projects, such as Constitute—a data-based approach to constitution design. The citizens of most countries are usually only concerned with one constitution—their own—but Google has assembled all nine hundred such documents drafted since 1787. Combined and quantified, they give emerging nations—five new constitutions are written every year—a better chance at a durable government because they can see what’s worked and what hasn’t in the past. Here, data unlocks a better future because, as Constitute’s website points out: in a constitution, “even a single comma can make a huge difference.”

  As we’ve seen, Facebook’s data team has begun to publish research of broad value from their immense store of human action and reaction. Seizing on that Newtonian interplay, Alex Pentland at MIT calls the emerging science “social physics.” He and his team have begun moving social data to the physical world. Working with local government, communications providers, and citizens, they’ve datafied an entire city. The residents of Trento, Italy, can now tackle, with hard numbers, what for the rest of us are workaday unanswerables: “How do other families spend their money? How much do they get out and socialize? Which preschools or doctors do people stay with for the longest time?”

  Perhaps this is the future we have to look forward to. I’ve tried to explain what we’ve already learned by combining the best of the work that’s out there with my own original research. In so doing, more than stretching out my arms to say This is the pinnacle, I mean to communicate the power of what’s to come. Watson and Crick unlocked the secret of DNA in 1953, and six decades later scientists are still decoding the human genome. The science of our shared humanity—the search for the full expression of the genes we’ll soon have fully mapped—is years from anything so lofty.

  As far as balancing the potential good with the bad, I wish I could propose a way forward. But to be honest I don’t see a simple solution. It might be that I’m too close. I share Lanier’s belief that regulation won’t work. Not that someone won’t try that route. The new laws will be drafted with all the right spirit, I’m sure, but their letter will be outdated before the ink is dry. And being on the data collectors’ side myself, I’ve seen firsthand that you can give people all the privacy controls in the world, but most people won’t use them. OkCupid asks women: Have you ever had an abortion?—it’s the 3,686th match question; I told you they truly cover everything. Right beneath the question, there’s a checkbox to keep your answer private. Of the people who answer in the affirmative, fewer than half check the box.

  So most people won’t use the tools you give them, but maybe “most people” is the wrong goal here. For one thing, providing ways to delete, or even repossess, data is the right thing to do, no matter how few users take you up on it. For another, it’s possible that privacy has changed, and left the people writing about it behind. Lanier and I are old men by Internet standards, and it’s not just in armies that “generals always fight the last war.” My expectations of what is correct and permissible might be wrong. Cultures and generations define privacy differently.

  People aren’t even that upset about the NSA, as gross as their overreach is. There have been many “Million” marches on Washington. Million Man, Million Mom, and so on. Recently, the hacker collective Anonymous called for a Million Mask March to protest, among other things, the PRISM program and government mass surveillance. The Washington Post captures the shortfall of public interest in just the first word of their coverage: “Hundreds of protesters …”

  In his Scientific American piece, Lanier proposes that we be compensated for our personal data and let market forces rebalance the privacy/value equation. He proposes that data collectors issue micropayments to users whenever their data is sold. But that expense, like a tax, either will be passed directly back to the consumer or will bring on a race to the bottom, where websites have to find margin wherever they can get it, the way commercial airlines do now. Either way, there’s no net value in it for us. And that’s not to mention the impracticality of making it happen.

  Pentland’s approach is much more feasible: he calls it his “New Deal on Data.” Ironically enough, it harkens back to Old English Common Law for its principles. He believes that, as with any other thing you own, you should have the fundamental rights of possession, use, and disposal for your data. What that means is you should be able to remove your data from a website (or other repository) whenever you feel like it’s being misused. You should also be allowed to “take it with you,” in theory for resale, should a market for that develop. That simple mechanism—the Delete button, with the option to copy/paste—is not only more feasible but also more fair than any enforced compensation.

  In fact, on the corporate side, I would argue that people are already compensated for their data: they get to use services like Facebook and Google—connect with old friends, find what they’re looking for—for free. As I’ve said, I give these services little of myself; but I get less out of them too. People have to decide their own trade-off there. Soon, though, there might be only one decision to make: am I going to use these services at all? The analytics are becoming so powerful that it may not matter what you try to hold back. From only the barest information, algorithms are already able to extrapolate or infer much about a person; that’s after only a few years of data to work on. Soon the half measures provided by menu options as you “manage your privacy settings” will give no protection at all, because the rest of your world won’t be so withholding. Companies and the government will find you through the graph. This whole debate could soon be an anachronism.

  In any event, when I talked about the data as a flood way, way back, I perhaps didn’t emphasize it enough: the waters are still churning. Only when they start to calm can people really know the level and make good the surfeit. I am eager to do so. In the meantime, the people who store, analyze, and act on data have a responsibility to continue to prove the value of their work—and to reveal exactly what it is they’re doing. Or else, for all my quibbling, Lanier is right: we shouldn’t be doing it.

  Technology is our new mythos. There’s magic in some of it, undeniably. But even grander than the substance is the image. Tech gods. Titans. Colossi astride the whole Earth, because, you know, Rhodes just isn’t cool anymore. This is how the industry is often cast to the public, and sadly it’s how it often thinks of itself. But though there are surely monsters, there are no gods. We would all do well to remember this. All are flawed, human, and mortal, and we all walk under the same dark sky. We brought on the flood—will it drown us or lift us up? My hope for myself, and for the others like me, is to make something good and real and human out of the data. And while we do, whenever the technology and the devices and the algorithms seem just too epic, we must all recall Tennyson’s aging Ulysses and resolve to search for our truth in a slightly different way. To strive, to seek, to find, but then, always, to yield.

  1 From Nature’s discussion of the console: “It is fitted with a camera that can monitor the heart rate of people sitting in the same room. The sensor is primarily designed for exercise games, allowing players to monitor heart changes during physical activity, but, in principle, the same type of system could monitor and pass on details of physiological responses to TV advertisements, horror movies or even … political broadcasts.”

  2 From Acxiom’s website: “[We give] our clients the power to successfully manage audiences, personalize customer experiences and create profitable customer relationships.” An interesting paradox: whenever you see the word “personalize,” you know things have gotten very impersonal.

  3 After Boston, Reddit and 4chan tried vigorously (meaning there was lots of typing) to track down the bombers and eventually “pinned” it on an innocent man. For all the lip service the cloud and crowd get, hardware solved the crime.

  Coda

  Designing the charts and tables in this book, I relied on the work of the statistician and artist Edward R. Tufte. More than relied on, I tried to copy it. His books occupy that smallest of intersections: coffee-table beautiful and textbook clear, and inside he lays out principles of information design drawn from the all-time famous examples of data as storytelling. Charles Minard’s plot of Napoleon’s Russian undoing. An unnamed abolitionist’s Description of a Slave Ship, showing the human cargo packed in inhuman closeness, an image that is still the iconic shorthand for the horrors of the Middle Passage. Dr. John Snow’s plot of a cholera outbreak in 1854 pinpointed the source of the disease for the first time. Tufte pulls lessons from these and makes them useful in a modern context, asking the data designer to maximize the data-to-ink ratio. Give every chart a clear story to tell. Use color to call out data’s red heart. Use white as dimension, not dead space. I’ve tried my best.

 
