Everybody Lies


by Seth Stephens-Davidowitz


  So was Freud totally off-target in all his theories? Not quite. When I first got access to PornHub data, I found a revelation there that struck me as at least somewhat Freudian. In fact, this is among the most surprising things I have found yet during my data investigations: a shocking number of people visiting mainstream porn sites are looking for portrayals of incest.

  Of the top hundred searches by men on PornHub, one of the most popular porn sites, sixteen are looking for incest-themed videos. Fair warning—this is going to get a little graphic: they include “brother and sister,” “step mom fucks son,” “mom and son,” “mom fucks son,” and “real brother and sister.” The plurality of male incestuous searches are for scenes featuring mothers and sons. And women? Nine of the top hundred searches by women on PornHub are for incest-themed videos, and they feature similar imagery—though with the gender of any parent and child who is mentioned usually reversed. Thus the plurality of incestuous searches made by women are for scenes featuring fathers and daughters.

  It’s not hard to locate in this data at least a faint echo of Freud’s Oedipal complex. He hypothesized a near-universal desire in childhood, which is later repressed, for sexual involvement with opposite-sex parents. If only the Viennese psychologist had lived long enough to turn his analytic skills to PornHub data, where interest in opposite-sex parents seems to be borne out by adults—with great explicitness—and little is repressed.

  Of course, PornHub data can’t tell us for certain who people are fantasizing about when watching such videos. Are they actually imagining having sex with their own parents? Google searches can give some more clues that there are plenty of people with such desires.

  Consider all searches of the form “I want to have sex with my . . .” The number one way to complete this search is “mom.” Overall, more than three-fourths of searches of this form are incestuous. And this is not due to the particular phrasing. Searches of the form “I am attracted to . . . ,” for example, are even more dominated by admissions of incestuous desires. Now I concede—at the risk of disappointing Herr Freud—that these are not particularly common searches: a few thousand people every year in the United States admitting an attraction to their mother. Someone would also have to break the news to Freud that Google searches, as will be discussed later in this book, sometimes skew toward the forbidden.

  But still. There are plenty of inappropriate attractions that people have that I would have expected to have been mentioned more frequently in searches. Boss? Employee? Student? Therapist? Patient? Wife’s best friend? Daughter’s best friend? Wife’s sister? Best friend’s wife? None of these confessed desires can compete with mom. Maybe, combined with the PornHub data, that really does mean something.

  And Freud’s general assertion that sexuality can be shaped by childhood experiences is supported elsewhere in Google and PornHub data, which reveals that men, at least, retain an inordinate number of fantasies related to childhood. According to searches from wives about their husbands, some of the top fetishes of adult men are the desire to wear diapers and wanting to be breastfed, particularly, as discussed earlier, in India. Moreover, cartoon porn—animated explicit sex scenes featuring characters from shows popular among adolescent boys—has achieved a high degree of popularity. Or consider the occupations of women most frequently searched for in porn by men. Men who are 18–24 years old search most frequently for women who are babysitters. As do 25–64-year-old men. And men 65 years and older. And for men in every age group, teacher and cheerleader are both in the top four. Clearly, the early years of life seem to play an outsize role in men’s adult fantasies.

  I have not yet been able to use all this unprecedented data on adult sexuality to figure out precisely how sexual preferences form. Over the next few decades, other social scientists and I will be able to create new, falsifiable theories on adult sexuality and test them with actual data.

  Already I can predict some basic themes that will undoubtedly be part of a data-based theory of adult sexuality. It is clearly not going to be the identical story to the one Freud told, with his particular, well-defined, universal stages of childhood and repression. But, based on my first look at PornHub data, I am absolutely certain the final verdict on adult sexuality will feature some key themes that Freud emphasized. Childhood will play a major role. So will mothers.

  It likely would have been impossible to analyze Freud in this way ten years ago. It certainly would have been impossible eighty years ago, when Freud was still alive. So let’s think through why these data sources helped. This exercise can help us understand why Big Data is so powerful.

  Remember, we have said that just having mounds and mounds of data by itself doesn’t automatically generate insights. Data size, by itself, is overrated. Why, then, is Big Data so powerful? Why will it create a revolution in how we see ourselves? There are, I claim, four unique powers of Big Data. This analysis of Freud provides a good illustration of them.

  You may have noticed, to begin with, that we’re taking pornography seriously in this discussion of Freud. And we are going to utilize data from pornography frequently in this book. Somewhat surprisingly, porn data is rarely utilized by sociologists, most of whom are comfortable relying on the traditional survey datasets they have built their careers on. But a moment’s reflection shows that the widespread use of porn—and the search and views data that comes with it—is the most important development in our ability to understand human sexuality in, well . . . Actually, it’s probably the most important ever. It is data that Schopenhauer, Nietzsche, Freud, and Foucault would have drooled over. This data did not exist when they were alive. It did not exist a couple decades ago. It exists now. There are many unique data sources, on a range of topics, that give us windows into areas about which we could previously just guess. Offering up new types of data is the first power of Big Data.

  The porn data and the Google search data are not just new; they are honest. In the pre-digital age, people hid their embarrassing thoughts from other people. In the digital age, they still hide them from other people, but not from the internet and in particular sites such as Google and PornHub, which protect their anonymity. These sites function as a sort of digital truth serum—hence our ability to uncover a widespread fascination with incest. Big Data allows us to finally see what people really want and really do, not what they say they want and say they do. Providing honest data is the second power of Big Data.

  Because there is now so much data, there is meaningful information on even tiny slices of a population. We can compare, say, the number of people who dream of cucumbers versus those who dream of tomatoes. Allowing us to zoom in on small subsets of people is the third power of Big Data.

  Big Data has one more impressive power—one that was not utilized in my quick study of Freud but could be in a future one: it allows us to undertake rapid, controlled experiments. This allows us to test for causality, not merely correlations. These kinds of tests are mostly used by businesses now, but they will prove a powerful tool for social scientists. Allowing us to do many causal experiments is the fourth power of Big Data.

  Now it is time to unpack each of these powers and explore exactly why Big Data matters.

  3

  DATA REIMAGINED

  At 6 A.M. on a particular Friday of every month, the streets of most of Manhattan will be largely desolate. The stores lining these streets will be closed, their façades covered by steel security gates, the apartments above dark and silent.

  The floors of Goldman Sachs, the global investment banking institution in lower Manhattan, on the other hand, will be brightly lit, its elevators taking thousands of workers to their desks. By 7 A.M. most of these desks will be occupied.

  It would not be unfair on any other day to describe this hour in this part of town as sleepy. On this Friday morning, however, there will be a buzz of energy and excitement. On this day, information that will massively impact the stock market is set to arrive.

  Minutes after its release, this information will be reported by news sites. Seconds after its release, this information will be discussed, debated, and dissected, loudly, at Goldman and hundreds of other financial firms. But much of the real action in finance these days happens in milliseconds. Goldman and other financial firms paid tens of millions of dollars to get access to fiber-optic cables that reduced the time information travels from Chicago to New Jersey by just four milliseconds (from 17 to 13). Financial firms have algorithms in place to read the information and trade based on it—all in a matter of milliseconds. After this crucial information is released, the market will move in less time than it takes you to blink your eye.

  So what is this crucial data that is so valuable to Goldman and numerous other financial institutions?

  The monthly unemployment rate.

  The rate, however—which has such a profound impact on the stock market that financial institutions have done whatever it takes to maximize the speed with which they receive, analyze, and act upon it—is based on a phone survey conducted by the Bureau of Labor Statistics, and the information is some three weeks—or roughly 2 billion milliseconds—old by the time it is released.

  When firms are spending millions of dollars to chip a millisecond off the flow of information, it might strike you as more than a bit strange that the government takes so long to calculate the unemployment rate.

  Indeed, getting these critical numbers out sooner was one of Alan Krueger’s primary agendas when he took over as President Obama’s chairman of the Council of Economic Advisers in 2011. He was unsuccessful. “Either the BLS doesn’t have the resources,” he concluded, “or they are stuck in twentieth-century thinking.”

  With the government clearly not picking up the pace anytime soon, is there a way to get at least a rough measure of the unemployment statistics at a faster rate? In this high-tech era—when nearly every click any human makes on the internet is recorded somewhere—do we really have to wait weeks to find out how many people are out of work?

  One potential solution was inspired by the work of a former Google engineer, Jeremy Ginsberg. Ginsberg noticed that health data, like unemployment data, was released with a delay by the government. The Centers for Disease Control and Prevention takes one week to release influenza data, even though doctors and hospitals would benefit from having the data much sooner.

  Ginsberg suspected that people sick with the flu are likely to make flu-related searches. In essence, they would report their symptoms to Google. These searches, he thought, could give a reasonably accurate measure of the current influenza rate. Indeed, searches such as “flu symptoms” and “muscle aches” have proven important indicators of how fast the flu is spreading.*

  Meanwhile, Google engineers created a service, Google Correlate, that gives outside researchers the means to experiment with the same type of analyses across a wide range of fields, not just health. Researchers can take any data series that they are tracking over time and see what Google searches correlate most with that dataset.

  For example, using Google Correlate, Hal Varian, chief economist at Google, and I were able to show which searches most closely track housing prices. When housing prices are rising, Americans tend to search for such phrases as “80/20 mortgage,” “new home builder,” and “appreciation rate.” When housing prices are falling, Americans tend to search for such phrases as “short sale process,” “underwater mortgage,” and “mortgage forgiveness debt relief.”
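The screen that Google Correlate runs can be sketched in a few lines: take the target time series, then rank candidate search-frequency series by how strongly they correlate with it. Everything below is invented for illustration—the phrases, the tiny quarterly housing index, and the search volumes are hypothetical, and the real service scanned millions of query series—but the mechanics are the same.

```python
# Sketch of a Google Correlate-style screen: given a target time series,
# rank candidate search-frequency series by Pearson correlation.
# All numbers and phrases below are made up for illustration.

import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical quarterly housing-price index: a boom, then a bust.
housing_index = [100, 104, 109, 115, 112, 105, 98, 94]

# Hypothetical normalized search volumes for a few candidate phrases.
search_volume = {
    "new home builder":   [52, 55, 60, 66, 63, 54, 48, 45],
    "short sale process":  [10, 9, 8, 8, 12, 18, 25, 28],
    "banana bread":        [30, 28, 33, 29, 31, 30, 27, 32],
}

# Sort candidates by how tightly they track the target series.
ranked = sorted(search_volume,
                key=lambda term: pearson(housing_index, search_volume[term]),
                reverse=True)
print(ranked[0])  # the phrase that rises and falls with prices comes out on top
```

Note that the screen surfaces both kinds of signal the chapter describes: phrases like “new home builder” correlate positively with prices, while distress phrases like “short sale process” correlate negatively.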

  So can Google searches be used as a litmus test for unemployment in the same way they can for housing prices or influenza? Can we tell, simply by what people are Googling, how many people are unemployed, and can we do so well before the government collates its survey results?

  One day, I put the United States unemployment rate from 2004 through 2011 into Google Correlate.

  Of the trillions of Google searches during that time, what do you think turned out to be most tightly connected to unemployment? You might imagine “unemployment office”—or something similar. That was high but not at the very top. “New jobs”? Also high but also not at the very top.

  The highest during the period I searched—and these terms do shift—was “Slutload.” That’s right, the search that most closely tracked unemployment was for a pornographic site. This may seem strange at first blush, but unemployed people presumably have a lot of time on their hands. Many are stuck at home, alone and bored. Another of the highly correlated searches—this one in the PG realm—is “Spider Solitaire.” Again, not surprising for a group of people who presumably have a lot of time on their hands.

  Now, I am not arguing, based on this one analysis, that tracking “Slutload” or “Spider Solitaire” is the best way to predict the unemployment rate. The specific diversions that unemployed people use can change over time (at one point, “Rawtube,” a different porn site, was among the strongest correlations) and none of these particular terms by itself attracts anything approaching a plurality of the unemployed. But I have generally found that a mix of diversion-related searches can track the unemployment rate—and would be a part of the best model predicting it.

  This example illustrates the first power of Big Data, the reimagining of what qualifies as data. Frequently, the value of Big Data is not its size; it’s that it can offer you new kinds of information to study—information that had never previously been collected.

  Before Google there was information available on certain leisure activities—movie ticket sales, for example—that could yield some clues as to how much time people have on their hands. But the opportunity to know how much solitaire is being played or porn is being watched is new—and powerful. In this instance this data might help us more quickly measure how the economy is doing—at least until the government learns to conduct and collate a survey more quickly.

  Life on Google’s campus in Mountain View, California, is very different from that in Goldman Sachs’s Manhattan headquarters. At 9 A.M. Google’s offices are nearly empty. If any workers are around, it is probably to eat breakfast for free—banana-blueberry pancakes, scrambled egg whites, filtered cucumber water. Some employees might be out of town: at an off-site meeting in Boulder or Las Vegas or perhaps on a free ski trip to Lake Tahoe. Around lunchtime, the sand volleyball courts and grass soccer fields will be filled. The best burrito I’ve ever eaten was at Google’s Mexican restaurant.

  How can one of the biggest and most competitive tech companies in the world seemingly be so relaxed and generous? Google harnessed Big Data in a way that no other company ever has to build an automated money stream. The company plays a crucial role in this book since Google searches are by far the dominant source of Big Data. But it is important to remember that Google’s success is itself built on the collection of a new kind of data.

  If you are old enough to have used the internet in the twentieth century, you might remember the various search engines that existed back then—MetaCrawler, Lycos, AltaVista, to name a few. And you might remember that these search engines were, at best, mildly reliable. Sometimes, if you were lucky, they managed to find what you wanted. Often, they would not. If you typed “Bill Clinton” into the most popular search engines in the late 1990s, the top results included a random site that just proclaimed “Bill Clinton Sucks” or a site that featured a bad Clinton joke. Hardly the most relevant information about the then president of the United States.

  In 1998, Google showed up. And its search results were undeniably better than those of every one of its competitors. If you typed “Bill Clinton” into Google in 1998, you were given his website, the White House email address, and the best biographies of the man that existed on the internet. Google seemed to be magic.

  What had Google’s founders, Sergey Brin and Larry Page, done differently?

  Other search engines located for their users the websites that most frequently included the phrase for which they searched. If you were looking for information on “Bill Clinton,” those search engines would find, across the entire internet, the websites that had the most references to Bill Clinton. There were many reasons this ranking system was imperfect and one of them was that it was easy to game the system. A joke site with the text “Bill Clinton Bill Clinton Bill Clinton Bill Clinton Bill Clinton” hidden somewhere on its page would score higher than the White House’s official website.*
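A minimal sketch makes the flaw concrete. Rank pages purely by how many times the query phrase appears, and a keyword-stuffed joke page beats the official one. The page texts and addresses below are invented for illustration.

```python
# Sketch of the pre-Google ranking flaw: score pages by raw phrase count,
# the signal that keyword stuffing trivially games. Pages are invented.

pages = {
    "whitehouse.example": "The official site of President Bill Clinton "
                          "with speeches, policy positions, and a biography.",
    "joke-site.example":  "Bill Clinton " * 50 + "and a bad joke about the president",
}

def naive_score(text, query):
    """Count occurrences of the query phrase -- the easily gamed signal."""
    return text.lower().count(query.lower())

query = "Bill Clinton"
best = max(pages, key=lambda url: naive_score(pages[url], query))
print(best)  # the keyword-stuffed joke site outranks the official one
```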

  What Brin and Page did was find a way to record a new type of information that was far more valuable than a simple count of words. Websites often would, when discussing a subject, link to the sites they thought were most helpful in understanding that subject. For example, the New York Times, if it mentioned Bill Clinton, might allow readers who clicked on his name to be sent to the White House’s official website.

  Every website creating one of these links was, in a sense, giving its opinion of the best information on Bill Clinton. Brin and Page could aggregate all these opinions on every topic. They could crowdsource the opinions of the New York Times, millions of Listservs, hundreds of bloggers, and everyone else on the internet. If a whole slew of people thought that the most important link for “Bill Clinton” was his official website, this was probably the website that most people searching for “Bill Clinton” would want to see.

  These kinds of links were data that other search engines didn’t even consider, and they were incredibly predictive of the most useful information on a given topic. The point here is that Google didn’t dominate search merely by collecting more data than everyone else. They did it by finding a better type of data. Fewer than two years after its launch, Google, powered by its link analysis, grew to be the internet’s most popular search engine. Today, Brin and Page are together worth more than $60 billion.
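The link analysis described here grew into what became known as PageRank, in which a page’s score is fed by the scores of the pages that link to it, so a link from a well-regarded page counts for more than one from an obscure page. The book gives no formula, so the following is only a toy sketch of that one idea, with an invented three-page web and a conventional damping value of 0.85.

```python
# Toy sketch of link analysis in the spirit of PageRank: repeatedly let each
# page pass a share of its score along its outgoing links until the scores
# settle. The three-page "web" below is invented for illustration.

links = {                       # page -> pages it links to
    "nytimes.example":  ["whitehouse.example"],
    "blog.example":     ["whitehouse.example", "nytimes.example"],
    "whitehouse.example": ["blog.example"],
}

damping = 0.85                  # conventional damping factor
n_pages = len(links)
rank = {page: 1.0 / n_pages for page in links}

for _ in range(50):             # power iteration until scores settle
    new_rank = {page: (1 - damping) / n_pages for page in links}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

top = max(rank, key=rank.get)
print(top)  # the page that everyone links to accumulates the highest score
```

The page that receives the most (and best-sourced) links ends up on top, which is exactly the crowdsourced-opinion logic the chapter describes.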

  As with Google, so with everyone else trying to use data to understand the world. The Big Data revolution is less about collecting more and more data. It is about collecting the right data.

  But the internet isn’t the only place where you can collect new data and where getting the right data can have profoundly disruptive results. This book is largely about how the data on the web can help us better understand people. The next section, however, doesn’t have anything to do with web data. In fact, it doesn’t have anything to do with people. But it does help illustrate the main point of this chapter: the outsize value of new, unconventional data. And the principles it teaches us are helpful in understanding the digital-based data revolution.

 
