“They’re the largest economy in Africa,” Reisinger said, his voice slowly rising. “We don’t even know the most basic thing we would want to know about that country.”
He wanted to find a way to get a sharper look at economic performance. His solution is quite an example of how to reimagine what constitutes data and the value of doing so.
Reisinger founded a company, Premise, which employs a group of workers in developing countries, armed with smartphones. The employees’ job? To take pictures of interesting goings-on that might have economic import.
The employees might get snapshots outside gas stations or of fruit bins in supermarkets. They take pictures of the same locations over and over again. The pictures are sent back to Premise, whose second group of employees—computer scientists—turn the photos into data. The company’s analysts can code everything from the length of lines in gas stations to how many apples are available in a supermarket to the ripeness of these apples to the price listed on the apples’ bin. Based on photographs of all sorts of activity, Premise can begin to put together estimates of economic output and inflation. In developing countries, long lines in gas stations are a leading indicator of economic trouble. So are unavailable or unripe apples. Premise’s on-the-ground pictures of China helped them discover food inflation there in 2011 and food deflation in 2012, long before the official data came in.
Premise sells this information to banks or hedge funds and also collaborates with the World Bank.
Like many good ideas, Premise’s is a gift that keeps on giving. The World Bank was recently interested in the size of the underground cigarette economy in the Philippines. In particular, they wanted to know the effects of the government’s recent efforts, which included random raids, to crack down on manufacturers that produced cigarettes without paying a tax. Premise’s clever idea? Take photos of cigarette boxes seen on the street. See how many of them have tax stamps, which all legitimate cigarettes do. They have found that this part of the underground economy, while large in 2015, got significantly smaller in 2016. The government’s efforts worked, although seeing something usually so hidden—illegal cigarettes—required new data.
As we’ve seen, what constitutes data has been wildly reimagined in the digital age and a lot of insights have been found in this new information. Learning what drives media bias, what makes a good first date, and how developing economies are really doing is just the beginning.
Not incidentally, a lot of money has also been made from such new data, starting with Messrs. Brin’s and Page’s tens of billions. Joseph Reisinger hasn’t done badly himself. Observers estimate that Premise is now making tens of millions of dollars in annual revenue. Investors recently poured $50 million into the company. This means some investors consider Premise among the most valuable enterprises in the world primarily in the business of taking and selling photos, in the same league as Playboy.
There is, in other words, outsize value, for scholars and entrepreneurs alike, in utilizing all the new types of data now available, in thinking broadly about what counts as data. These days, a data scientist must not limit herself to a narrow or traditional view of data. These days, photographs of supermarket lines are valuable data. The fullness of supermarket bins is data. The ripeness of apples is data. Photos from outer space are data. The curvature of lips is data. Everything is data!
And with all this new data, we can finally see through people’s lies.
4
DIGITAL TRUTH SERUM
Everybody lies.
People lie about how many drinks they had on the way home. They lie about how often they go to the gym, how much those new shoes cost, whether they read that book. They call in sick when they’re not. They say they’ll be in touch when they won’t. They say it’s not about you when it is. They say they love you when they don’t. They say they’re happy while in the dumps. They say they like women when they really like men.
People lie to friends. They lie to bosses. They lie to kids. They lie to parents. They lie to doctors. They lie to husbands. They lie to wives. They lie to themselves.
And they damn sure lie to surveys.
Here’s my brief survey for you:
Have you ever cheated on an exam? __________
Have you ever fantasized about killing someone? _________
Were you tempted to lie? Many people underreport embarrassing behaviors and thoughts on surveys. They want to look good, even though most surveys are anonymous. This is called social desirability bias.
An important paper in 1950 provided powerful evidence of how surveys can fall victim to such bias. Researchers collected data, from official sources, on the residents of Denver: what percentage of them voted, gave to charity, and owned a library card. They then surveyed the residents to see if the percentages would match. The results were, at the time, shocking. What the residents reported to the surveys was very different from the data the researchers had gathered. Even though nobody gave their names, people, in large numbers, exaggerated their voter registration status, voting behavior, and charitable giving.
Has anything changed in sixty-five years? In the age of the internet, not owning a library card is no longer embarrassing. But, while what’s embarrassing or desirable may have changed, people’s tendency to deceive pollsters remains strong.
A recent survey asked University of Maryland graduates various questions about their college experience. The answers were compared to official records. People consistently gave wrong information, in ways that made them look good. Fewer than 2 percent reported that they graduated with lower than a 2.5 GPA. (In reality, about 11 percent did.) And 44 percent said they had donated to the university in the past year. (In reality, about 28 percent did.)
And it is certainly possible that lying played a role in the failure of the polls to predict Donald Trump’s 2016 victory. Polls, on average, underestimated his support by about 2 percentage points. Some people may have been embarrassed to say they were planning to support him. Some may have claimed they were undecided when they were really going Trump’s way all along.
Why do people misinform anonymous surveys? I asked Roger Tourangeau, a research professor emeritus at the University of Michigan and perhaps the world’s foremost expert on social desirability bias. Our weakness for “white lies” is an important part of the problem, he explained. “About one-third of the time, people lie in real life,” he suggests. “The habits carry over to surveys.”
Then there’s that odd habit we sometimes have of lying to ourselves. “There is an unwillingness to admit to yourself that, say, you were a screw-up as a student,” says Tourangeau.
Lying to oneself may explain why so many people say they are above average. How big is this problem? More than 40 percent of one company’s engineers said they are in the top 5 percent. More than 90 percent of college professors say they do above-average work. One-quarter of high school seniors think they are in the top 1 percent in their ability to get along with other people. If you are deluding yourself, you can’t be honest in a survey.
Another factor that plays into our lying to surveys is our strong desire to make a good impression on the stranger conducting the interview, if there is someone conducting the interview, that is. As Tourangeau puts it, “A person who looks like your favorite aunt walks in. . . . Do you want to tell your favorite aunt you used marijuana last month?”* Do you want to admit that you didn’t give money to your good old alma mater?
For this reason, the more impersonal the conditions, the more honest people will be. For eliciting truthful answers, internet surveys are better than phone surveys, which are better than in-person surveys. People will admit more if they are alone than if others are in the room with them.
However, on sensitive topics, every survey method will elicit substantial misreporting. Tourangeau here used a word that is often thrown around by economists: “incentive.” People have no incentive to tell surveys the truth.
How, therefore, can we learn what our fellow humans are really thinking and doing?
In some instances, there are official data sources we can reference to get the truth. Even if people lie about their charitable donations, for example, we can get real numbers about giving in an area from the charities themselves. But when we are trying to learn about behaviors that are not tabulated in official records or we are trying to learn what people are thinking—their true beliefs, feelings, and desires—there is no other source of information except what people may deign to tell surveys. Until now, that is.
This is the second power of Big Data: certain online sources get people to admit things they would not admit anywhere else. They serve as a digital truth serum. Think of Google searches. Remember the conditions that make people more honest. Online? Check. Alone? Check. No person administering a survey? Check.
And there’s another huge advantage that Google searches have in getting people to tell the truth: incentives. If you enjoy racist jokes, you have zero incentive to share that un-PC fact with a survey. You do, however, have an incentive to search for the best new racist jokes online. If you think you may be suffering from depression, you don’t have an incentive to admit this to a survey. You do have an incentive to ask Google for symptoms and potential treatments.
Even if you are lying to yourself, Google may nevertheless know the truth. A couple of days before the election, you and some of your neighbors may legitimately think you will drive to a polling place and cast ballots. But, if you and they haven’t searched for any information on how to vote or where to vote, data scientists like me can figure out that turnout in your area will actually be low. Similarly, maybe you haven’t admitted to yourself that you may suffer from depression, even as you’re Googling about crying jags and difficulty getting out of bed. You would show up, however, in an area’s depression-related searches that I analyzed earlier in this book.
Think of your own experience using Google. I am guessing you have upon occasion typed things into that search box that reveal a behavior or thought that you would hesitate to admit in polite company. In fact, the evidence is overwhelming that a large majority of Americans are telling Google some very personal things. Americans, for instance, search for “porn” more than they search for “weather.” This is difficult, by the way, to reconcile with the survey data since only about 25 percent of men and 8 percent of women admit they watch pornography.
You may have also noticed a certain honesty in Google searches when looking at the way this search engine automatically tries to complete your queries. Its suggestions are based on the most common searches that other people have made. So auto-complete clues us in to what people are Googling. In fact, auto-complete can be a bit misleading. Google won’t suggest certain words it deems inappropriate, such as “cock,” “fuck,” and “porn.” This means auto-complete tells us that people’s Google thoughts are less racy than they actually are. Even so, some sensitive stuff often still comes up.
If you type “Why is . . .” the first two Google auto-completes currently are “Why is the sky blue?” and “Why is there a leap day?” suggesting these are the two most common ways to complete this search. The third: “Why is my poop green?” And Google auto-complete can get disturbing. Today, if you type in “Is it normal to want to . . . ,” the first suggestion is “kill.” If you type in “Is it normal to want to kill . . . ,” the first suggestion is “my family.”
Need more evidence that Google searches can give a different picture of the world than the one we usually see? Consider searches related to regrets around the decision to have or not to have children. Before deciding, some people fear they might make the wrong choice. And, almost always, the question is whether they will regret not having kids. People are seven times more likely to ask Google whether they will regret not having children than whether they will regret having children.
After making their decision—either to reproduce (or adopt) or not—people sometimes confess to Google that they rue their choice. This may come as something of a shock but post-decision, the numbers are reversed. Adults with children are 3.6 times more likely to tell Google they regret their decision than are adults without children.
One caveat that should be kept in mind throughout this chapter: Google can display a bias toward unseemly thoughts, thoughts people feel they can’t discuss with anyone else. Nonetheless, if we are trying to uncover hidden thoughts, Google’s ability to ferret them out can be useful. And the large disparity between regrets on having versus not having kids seems to be telling us that the unseemly thought in this case is a significant one.
Let’s pause for a moment to consider what it even means to make a search such as “I regret having children.” Google presents itself as a source from which we can seek information directly, on topics like the weather, who won last night’s game, or when the Statue of Liberty was erected. But sometimes we type our uncensored thoughts into Google, without much hope that it will be able to help us. In this case, the search window serves as a kind of confessional.
There are thousands of searches every year, for example, for “I hate cold weather,” “People are annoying,” and “I am sad.” Of course, those thousands of Google searches for “I am sad” represent only a tiny fraction of the hundreds of millions of people who feel sad in a given year. Searches expressing thoughts, rather than looking for information, my research has found, are only made by a small sample of everyone for whom that thought comes to mind. Similarly, my research suggests that the seven thousand searches by Americans every year for “I regret having children” represent a small sample of those who have had that thought.
Kids are obviously a huge joy for many, probably most, people. And, despite my mom’s fear that “you and your stupid data analysis” are going to limit her number of grandchildren, this research has not changed my desire to have kids. But that unseemly regret is interesting—and another aspect of humanity that we tend not to see in the traditional datasets. Our culture is constantly flooding us with images of wonderful, happy families. Most people would never consider having children as something they might regret. But some do. They may admit this to no one—except Google.
THE TRUTH ABOUT SEX
How many American men are gay? This is a legendary question in sexuality research. Yet it has been among the toughest questions for social scientists to answer. Psychologists no longer believe Alfred Kinsey’s famous estimate—based on surveys that oversampled prisoners and prostitutes—that 10 percent of American men are gay. Representative surveys now tell us about 2 to 3 percent are. But sexual preference has long been among the subjects upon which people have tended to lie. I think I can use Big Data to give a better answer to this question than we have ever had.
First, more on that survey data. Surveys tell us there are far more gay men in tolerant states than intolerant states. For example, according to a Gallup survey, the proportion of the population that is gay is almost twice as high in Rhode Island, the state with the highest support for gay marriage, as in Mississippi, the state with the lowest support for gay marriage.
There are two likely explanations for this. First, gay men born in intolerant states may move to tolerant states. Second, gay men in intolerant states may not divulge that they are gay; they are even more likely to lie.
Some insight into explanation number one—gay mobility—can be gleaned from another Big Data source: Facebook, which allows users to list what gender they are interested in. About 2.5 percent of male Facebook users who list a gender of interest say they are interested in men; that corresponds roughly with what the surveys indicate. And Facebook too shows big differences in the gay population in states with high versus low tolerance: Facebook has the gay population more than twice as high in Rhode Island as in Mississippi.
Facebook also can provide information on how people move around. I was able to code the hometown of a sample of openly gay Facebook users. This allowed me to directly estimate how many gay men move out of intolerant states into more tolerant parts of the country. The answer? There is clearly some mobility—from Oklahoma City to San Francisco, for example. But I estimate that men packing up their Judy Garland CDs and heading to someplace more open-minded can explain less than half of the difference in the openly gay population in tolerant versus intolerant states.*
In addition, Facebook allows us to focus in on high school students. This is a special group, because high school boys rarely get to choose where they live. If mobility explained the state-by-state differences in the openly gay population, these differences should not appear among high school users. So what does the high school data say? There are far fewer openly gay high school boys in intolerant states. Only two in one thousand male high school students in Mississippi are openly gay. So it ain’t just mobility.
If a similar number of gay men are born in every state and mobility cannot fully explain why some states have so many more openly gay men, the closet must be playing a big role. Which brings us back to Google, with which so many people have proved willing to share so much.
Might there be a way to use porn searches to test how many gay men there really are in different states? Indeed, there is. Countrywide, I estimate—using data from Google searches and Google AdWords—that about 5 percent of male porn searches are for gay-male porn. (These would include searches for such terms as “Rocket Tube,” a popular gay pornographic site, as well as “gay porn.”)
And how does this vary in different parts of the country? Overall, there are more gay porn searches in tolerant states compared to intolerant states. This makes sense, given that some gay men move out of intolerant places into tolerant places. But the differences are not nearly as large as the differences suggested by either surveys or Facebook. In Mississippi, I estimate that 4.8 percent of male porn searches are for gay porn, far higher than the numbers suggested by either surveys or Facebook and reasonably close to the 5.2 percent of pornography searches that are for gay porn in Rhode Island.