What’s more, reduction and repetition are fundamental to the long history of science, not just data science and not just computer science, but capital-S Science, the ageless human enterprise. Experiments are built upon reducing a process to a single, manageable facet. The scientific method needs a control, and you can’t get it without cutting complexity to the bald core and saying this, this, is what matters. Only once you’ve simplified the question can you test it over and over again. Whether at a lab bench or a laptop, most of the knowledge we possess was acquired like this, by reduction.
So here, we’ve boiled humanity down to numbers rather than, say, anecdotes. In my mind—and this takes nothing away from Malcolm Gladwell—I see this book as the opposite of Outliers. Instead of the strays from the far reaches of the data, the one-offs, the exceptions, the singletons, the Einsteins for whom you need the whole story to get it right, I’m pulling from the undifferentiated whole. We focus on the dense clusters, the centers of mass, the data duplicated over and over by the repetition and commonality of our human experience. It’s science as pointillism. Those dots may be one fractional part of you, but the whole is us.
Aggregation and reduction also allow us to deal in broad trends, the smooth flow of which might not have the peaks and troughs of the usual hero narratives but which are all the more applicable for it. The fact that Paul McCartney and John Lennon practiced rock music for 10,000 hours and then became the Beatles does say something about the value of rehearsal and persistence, but that number itself means nothing. I myself have put in that kind of time playing guitar, as have many others whose music you’ll never hear. Whatever it was that allowed Lennon and McCartney to turn practice into genius, it’s unique to them. On the other hand, every number in this book has many hundreds, often many thousands, of people behind it, none of them famous. Here’s the kernel of it: the phrase “one in a million” is at the core of so many wonderful works of art. It means a person so special, so talented, so something that they’re practically unique, and that very rareness makes them significant. But in mathematics, and so with data, and so here in this book, the phrase means just the opposite: 1/1,000,000 is a rounding error.
But if simplifying is what it takes to understand large data sets, I do worry about a different kind of reductionism: people becoming not a number exactly, but a dehumanized userid fed into the grind of a marketing algorithm; grist for someone else’s brand. Data takes too much of the guesswork out of the sell. It’s a rare urban legend that turns out to be true, but Target, by analyzing a customer’s purchases, really did know she was pregnant before she’d told anyone. The hitch was that she was a teenager, and they’d started sending maternity ads to her father’s house.
In some ways, that kind of corporate intrusion is better than brands actually trying to “relate.” Last summer, a Jell-O marketing campaign co-opted (tagjacked?) the hashtag #fml, which is Internet shorthand for “fuck my life.” Their social media people began responding to tweets that contained the tag with an unsolicited offer to “fun” the person’s life instead, with coupons. Thus people in extremis received jaunty offers from a gelatin, as in this exchange:
Pyrrhus Nelson @suhrryp
Seeing my bank account disappear at the dr office #fml
JELL-O @JELLO
@suhrryp Fun My Life? Of course we will. In fact, we’d be happy to.
prmtns.co/dkTq Exp. 48hrs
This kind of unwanted intercession is all too easy on social media because everything is so quantified. The hashtags jump right to the brand manager’s screen; he dives in with the discounts. At least the same technology that allows them into our lives allows us to fight back. A few years ago, McDonald’s sent out a couple tweets, feel-good stories about their suppliers, with the tag #McDStories, and they got #fml’d in reverse. This is just one of many responses:
MUZZAFUZZA @Muzzafuzza
I haven’t been to McDonalds in years, because I’d rather eat my own diarrhea. #McDStories
McDonald’s had paid to promote the hashtag and pulled the campaign after only a couple hours when it quickly spiraled out of their control. A week later, the repurposed #McDStories was still going strong. Their social media strategists should’ve known what to expect: a few months before, Wendy’s had tried to push #HeresTheBeef, and their catchphrase was ripped completely free of the intended context. People used it to complain about anything they didn’t like (had a beef with), ignoring the brand:
Remi Mitchison @RemiBee
#HeresTheBeef when a chick see another chick doin better and has more than she does … so she wanna stunt and #GetThatAssBeatUp
Jeremy Baumhower @jeremytheproduc
#HeresTheBeef The drugs companies have already cured HIV and cancer, however it is far more profitable to keep people barely alive on drugs
More recently, Mountain Dew ran a “Dub the Dew” contest, trying to ride the “crowdsourcing” wave to a cool new soda name and thinking maybe, if everything went just right and the metrics showed enough traction to get buy-in from the right influencers, they’d earn some brand ambassadors in the blogosphere. Reddit and 4chan got ahold of it, and “Hitler did nothing wrong” led the voting for a while, until at the last minute “Diabeetus” swooped in and the people’s voice was heard: Dub yourself, motherfucker.
The Internet can be a deranged place, but it’s that potential for the unexpected, even the insane, that so often redeems it. I can’t imagine anything worse for You! The Brand! than upvoting Hitler. Plus, what a waste of time, because obviously Mountain Dew isn’t going to print a single unflattering word in the style of its precious and distinctive marks. I find comfort in the silliness, in the frivolity, even in the stupidity. Trolling a soda is something no formula would ever recommend. It’s no industry best practice. And it’s evidence that as much as corporatism might invade our newsfeeds, our photostreams, our walls, and even, as some would hope, our very souls, a small part of us is still beyond reach. That’s what I always want to remember: it’s not numbers that will deny us our humanity; it’s the calculated decision to stop being human.
14.
Breadcrumbs
Facebook released the Like button in 2009 and it changed the way people shared content. The idea wasn’t new—once-popular, now marginal, sites like digg.com and del.icio.us had been letting people “like” articles for years before that. But at these companies, the content was the star. Facebook laid curation over an already robust social network and, for the content creators, made it simple for anyone to attach that iconic little thumbs-up to their work. They created a new universal microcurrency—I might not pay you for your writing, music, or whatever, but I’ll give you a fillip of approval and share what you’ve done with my friends. As of May 2013, Facebook was recording 4.5 billion likes a day and in September of that year reported that 1.13 trillion had been submitted all-time.
Those students from MIT developed their gaydar the same year likes launched. Their algorithm was pretty good at guessing a man’s sexuality, but it also worked in a fairly obvious way: it’s surely no big secret that gay men are more likely to have gay male friends. The gaydar innovation was to use macro-level data to do something people had been doing in small ways all along. Since then, the power of predictive software has advanced rapidly; these types of programs only get smarter and faster as more data becomes available. By 2012, a group from the UK had discovered that from a person’s likes alone they could figure out the following, with these degrees of accuracy:
whether someone is …
gay or straight: 88%
lesbian or straight: 75%
Caucasian or African American: 95%
a man or a woman: 93%
Democrat or Republican: 85%
a drug user: 65%
the child of parents who got divorced before he or she turned 21: 60%
Again, this is not from looking at status updates or comments or shares or anything that the users typed. Just their likes. You know the science is headed to undiscovered country when someone can hear your parents fighting in the click-click-click of a mouse. A person’s “like” pattern even makes a decent proxy for intelligence—this model could reliably predict someone’s score on a standard (separately administered) IQ test, without the person answering a single direct question.
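To make the mechanics a little more concrete, here is a minimal sketch, in Python, of the general recipe such studies follow: represent each person as a row of 0/1 like indicators and fit a plain classifier against a known trait. Everything in it is a placeholder (the toy random data, the model choice, the library), not the UK group's actual pipeline, which I haven't reproduced here.

```python
# A minimal, hypothetical sketch of trait prediction from likes.
# The data is random and the setup is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_users, n_pages = 1_000, 500                         # toy sizes
likes = rng.integers(0, 2, size=(n_users, n_pages))   # 1 = user liked that page
trait = rng.integers(0, 2, size=n_users)              # 1 = has the trait in question

model = LogisticRegression(max_iter=1000)
# Score with ROC AUC, a standard figure of merit for binary traits.
auc = cross_val_score(model, likes, trait, cv=5, scoring="roc_auc")
print(f"mean AUC: {auc.mean():.2f}")  # near 0.50 here, since the toy data is random
```

On random data the score sits near chance, which is the point: the predictive power in the real research comes from the likes themselves, not from anything exotic in the math.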
This stuff was computed from three years of data collected from people who joined Facebook after decades of being on Earth without it. What will be possible when someone’s been using these services since she was a child? That’s the darker side of the longitudinal data that I’m otherwise so excited about. Tests like Myers-Briggs and Stanford-Binet have long been used by employers, schools, the military. You sit down, do your best, and they sort you. For the most part, you’ve opted in. But it’s increasingly the case that you’re taking these tests just by living your life. And the results are there for anyone to read and judge. It’s one thing to see that someone’s Klout score is 51 or whatever in advance of a job interview. It’s another to know his IQ.
If employers begin to use algorithms to infer how intelligent you are or whether you use drugs, then your only choice will be to game the system—or, to borrow the wording from the previous chapter, “manage your brand.” To beat the machine, you must act like a machine, which means you’ve lost to the machine. And that’s all assuming you can guess at what you’re supposed to do in the first place. Apparently, one of the strongest correlates to intelligence in the research was liking “curly fries.” Who could reverse-engineer that?
But while Facebook does know a lot about you, it’s more like a “work friend”—for all the time you spend together, there are clear limits to your relationship. Facebook only knows what you do on Facebook. There are many places with much deeper reach. If you have an iPhone, Apple could have your address book, your calendar, your photos, your texts, all the music you listen to, all the places you go—and even how many steps it took to get there, since phones have a little accelerometer in them. Don’t have an iPhone? Then replace “Apple” with Google or Samsung or Verizon. Wear a FuelBand? Nike knows how well you sleep. An Xbox One? Microsoft knows your heart rate.1 A credit card? Buy something at a retailer, and your PII (personally identifiable information) attaches the UPC to your Guest ID in the CRM (customer relationship management) software, which then starts working on what you’ll want next.
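For the curious, that retail linkage is simple enough to sketch in a few lines. This is a toy illustration with invented names, not any real retailer's or CRM vendor's schema: the payment card is the piece of PII that resolves to a Guest ID, and each scanned UPC gets appended to that guest's running history.

```python
from collections import defaultdict

# Toy linkage: a payment card number (the PII) resolves to a Guest ID,
# and every scanned UPC is appended to that guest's purchase history.
# All identifiers here are invented for illustration.
guest_id_by_card = {"4111-1111-1111-1111": "GUEST-0042"}
purchase_history = defaultdict(list)

def record_purchase(card_number: str, upc: str) -> None:
    """Attach a scanned UPC to the guest identified by the payment card."""
    guest_id = guest_id_by_card.get(card_number)
    if guest_id is not None:
        purchase_history[guest_id].append(upc)

record_purchase("4111-1111-1111-1111", "036000291452")   # example UPC
print(purchase_history["GUEST-0042"])  # the profile a recommender would work from
```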
This is just a sliver of the corporate data state, the full description of which could take pages. For the government picture, a sliver is all I have, because that’s all we’ve been able to see of it. We do know that the UK has 5.9 million security cameras, one for every eleven citizens. In Manhattan, just below Fourteenth Street, there are 4,176. Satellites and drones complete the picture beyond the asphalt. Though there’s no telling what each one sees, it’s safe to say: if the government is interested in your whereabouts, one sees you. And besides, as Edward Snowden revealed, much of what they can’t put a lens on they can monitor at leisure from the screen of an NSANet terminal, location undisclosed.
Because so much happens with so little public notice, the lay understanding of data is inevitably many steps behind the reality. I have to say, just pausing to write this book, I’m sure I’ve lost ground. Analytics has in many ways surpassed the information itself as the real lever to pry. Cookies in your web browser and guys hacking for credit card numbers get most of the press and are certainly the most acutely annoying of the data collectors. But they’ve also taken hold of only a small fraction of your life, and for that small piece they had to put in all kinds of work. No matter how crafty the JavaScript, they’re villains in the silent-film vein, all mustachios and top hats. Or, a more contemporary reference: they’re like so many pasty Dr. Evils—underworld relics holding the world hostage for one … million … dollars … while the billions fly by to the real masterminds, like Acxiom. These corporate data marketers, with reach into bank and credit card records, retail histories, and government filings like tax records, know stuff about human behavior that no academic researcher, fishing for patterns on some website, ever could.2 Meanwhile, the resources and expertise the national security apparatus brings to bear make enterprise-level data-mining software look like Minesweeper.
This data, despite the “mining” metaphor, isn’t a naturally occurring resource; it comes from somewhere—and that somewhere is you. The companies and the government are collecting disparate pieces of your private life and trying to fashion them back into an image they can master. The more privacy you lose, the more effective they are. The fundamental question in any discussion of privacy is the trade-off—what you get for losing it. We make calculated trades all the time. Public figures sell their personal lives to advance their careers. Anyone who’s booked a hostel in Europe or bought a train ticket in India has had to decide if the private room is worth the extra money. And not to confuse the issue here, but many people, men and women, trade on privacy when they walk out the door in the evening, giving it away, via a hemline or a snug fit, for attention. So the exchange isn’t new. But our trading partners, and their terms, are. On the corporate side, the upshot of our data (the benefit to us) isn’t all that interesting unless you’re an economist. In theory, your data means ads are better targeted, which means less marketing spend is wasted, which means lower prices. At the very least, the data they sell means you get to use genuinely useful services like Facebook and Google without paying money for them. What we get in return for the government’s intrusion is less straightforward.
Does surveillance make us safer? Is the security apparatus a blanket? Well, there haven’t been any terror attacks on American civilians since 2001—at least, not ones by the syndicates. That’s not meaningless, certainly not to a New Yorker. But an argument from absence isn’t very strong, and at least until we’re allowed to know the threats that were thwarted as opposed to those never planned, it’s hard to trust what we’re told. Like so much Texas dust, its memory has almost drifted away, but the color-coded “Threat Level” that was such a part of the discussion in the years after 9/11 always felt to me like an elaborate advertisement for Halliburton. It’s hard to believe in information coming to you on a “need to know basis” from an entity that doesn’t think you need to know anything. The concern becomes less about what they’re saying than why. In any event, I have no idea how many, if any, crimes the big glean at the NSA has prevented. We’re told it works, just not when, where, or how.
Ironically, for those crimes total surveillance didn’t prevent, it has certainly proved useful in solving them. All those security cameras cracked the case after the Boston Marathon bombing, as they did after the London subway bombings in 2005.3 Especially for asynchronous crimes, you need total data to return to, because the criminals commit their acts long before any victims fall. In these investigations, the power of the intelligence becomes part of the media story—this is the surveillance state’s time to shine. The data has a defined purpose, and no one debates the privacy/protection balance while there is blood on the ground. But in between the times of “United We Stand,” a lot of what we learn about what the government knows comes from whistle-blowers like Snowden.
The NSA is the government’s signals intelligence arm, and here the signal they’re looking for is in our data. I have some personal familiarity with the organization. As I’ve said, I studied math. I did so at Harvard. My bachelor’s degree looks just like my classmates’, but there were unofficially two tracks in the department. One, mine, was for the kids who liked math and were pretty good at it. The other was for the transcendent savants. There was a difficult first-year course called Math 25, which I wasn’t good enough for, and from which the ultra-elite were drawn into a superclass called Math 55 by special invitation from the department. The hardest courses I ever took were often entirely skipped by these real mathematicians. The teaching assistants in my high-level courses, the people who handled a lot of the actual instruction and all of the grading, were not only often younger than me (one was sixteen) but were already deep into the graduate-level curriculum, as teenagers. I remember being very excited about (and challenged by) Real Analysis, which was a class that many of my peers—as if that’s the right word—would’ve found boring as ninth-graders. Whenever I hear the letters “NSA,” I think back to those days, because they recruited from that second track.