In fact, the smartest Big Data companies are often cutting down their data. At Google, major decisions are based on only a tiny sampling of all their data. You don’t always need a ton of data to find important insights. You need the right data. A major reason that Google searches are so valuable is not that there are so many of them; it is that people are so honest in them. People lie to friends, lovers, doctors, surveys, and themselves. But on Google they might share embarrassing information, about, among other things, their sexless marriages, their mental health issues, their insecurities, and their animosity toward black people.
Most important, to squeeze insights out of Big Data, you have to ask the right questions. Just as you can’t point a telescope randomly at the night sky and have it discover Pluto for you, you can’t download a whole bunch of data and have it discover the secrets of human nature for you. You must look in promising places—Google searches that begin “my husband wants . . .” in India, for example.
This book is going to show how Big Data is best used and explain in detail why it can be so powerful. And along the way, you’ll also learn about what I and others have already discovered with it, including:
› How many men are gay?
› Does advertising work?
› Why was American Pharoah a great racehorse?
› Is the media biased?
› Are Freudian slips real?
› Who cheats on their taxes?
› Does it matter where you go to college?
› Can you beat the stock market?
› What’s the best place to raise kids?
› What makes a story go viral?
› What should you talk about on a first date if you want a second?
. . . and much, much more.
But before we get to all that, we need to discuss a more basic question: why do we need data at all? And for that, I am going to introduce my grandmother.
PART I
DATA, BIG AND SMALL
1
YOUR FAULTY GUT
If you’re thirty-three years old and have attended a few Thanksgivings in a row without a date, the topic of mate choice is likely to arise. And just about everybody will have an opinion.
“Seth needs a crazy girl, like him,” my sister says.
“You’re crazy! He needs a normal girl, to balance him out,” my brother says.
“Seth’s not crazy,” my mother says.
“You’re crazy! Of course, Seth is crazy,” my father says.
All of a sudden, my shy, soft-spoken grandmother, quiet through the dinner, speaks. The loud, aggressive New York voices go silent, and all eyes focus on the small old lady with short yellow hair and still a trace of an Eastern European accent. “Seth, you need a nice girl. Not too pretty. Very smart. Good with people. Social, so you will do things. Sense of humor, because you have a good sense of humor.”
Why does this old woman’s advice command such attention and respect in my family? Well, my eighty-eight-year-old grandmother has seen more than everybody else at the table. She’s observed more marriages, many that worked and many that didn’t. And over the decades, she has cataloged the qualities that make for successful relationships. At that Thanksgiving table, for that question, my grandmother has access to the largest number of data points. My grandmother is Big Data.
In this book, I want to demystify data science. Like it or not, data is playing an increasingly important role in all of our lives—and its role is going to get larger. Newspapers now have full sections devoted to data. Companies have teams with the exclusive task of analyzing their data. Investors give start-ups tens of millions of dollars if they can store more data. Even if you never learn how to run a regression or calculate a confidence interval, you are going to encounter a lot of data—in the pages you read, the business meetings you attend, the gossip you hear next to the watercoolers you drink from.
Many people are anxious over this development. They are intimidated by data, easily lost and confused in a world of numbers. They think that a quantitative understanding of the world is for a select few left-brained prodigies, not for them. As soon as they encounter numbers, they are ready to turn the page, end the meeting, or change the conversation.
But I have spent ten years in the data analysis business and have been fortunate to work with many of the top people in the field. And one of the most important lessons I have learned is this: Good data science is less complicated than people think. The best data science, in fact, is surprisingly intuitive.
What makes data science intuitive? At its core, data science is about spotting patterns and predicting how one variable will affect another. People do this all the time.
Just think how my grandmother gave me relationship advice. She utilized the large database of relationships that her brain has uploaded over a near century of life—in the stories she has heard from her family, her friends, her acquaintances. She limited her analysis to a sample of relationships in which the man had many qualities that I have—a sensitive temperament, a tendency to isolate himself, a sense of humor. She zeroed in on key qualities of the woman—how kind she was, how smart she was, how pretty she was. She correlated these key qualities of the woman with a key quality of the relationship—whether it was a good one. Finally, she reported her results. In other words, she spotted patterns and predicted how one variable will affect another. Grandma is a data scientist.
You are a data scientist, too. When you were a kid, you noticed that when you cried, your mom gave you attention. That is data science. When you reached adulthood, you noticed that if you complain too much, people want to hang out with you less. That is data science, too. When people hang out with you less, you noticed, you are less happy. When you are less happy, you are less friendly. When you are less friendly, people want to hang out with you even less. Data science. Data science. Data science.
Because data science is so natural, the best Big Data studies, I have found, can be understood by just about any smart person. If you can’t understand a study, the problem is probably with the study, not with you.
Want proof that great data science tends to be intuitive? I recently came across a study that may be one of the most important conducted in the past few years. It is also one of the most intuitive studies I’ve ever seen. I want you to think not just about the importance of the study—but how natural and grandma-like it is.
The study was by a team of researchers from Columbia University and Microsoft. The team wanted to find what symptoms predict pancreatic cancer. This disease has a low five-year survival rate—only about 3 percent—but early detection can double a patient’s chances.
The researchers’ method? They utilized data from tens of thousands of anonymous users of Bing, Microsoft’s search engine. They coded a user as having recently been given a diagnosis of pancreatic cancer based on unmistakable searches, such as “just diagnosed with pancreatic cancer” or “I was told I have pancreatic cancer, what to expect.”
Next, the researchers looked at searches for health symptoms. They compared that small number of users who later reported a pancreatic cancer diagnosis with those who didn’t. What symptoms, in other words, predicted that, in a few weeks or months, a user would report a diagnosis?
The results were striking. Searching for back pain and then yellowing skin turned out to be a sign of pancreatic cancer; searching for back pain alone made it unlikely someone had pancreatic cancer. Similarly, searching for indigestion and then abdominal pain was evidence of pancreatic cancer, while searching for just indigestion without abdominal pain meant a person was unlikely to have it. The researchers could identify 5 to 15 percent of cases with almost no false positives. Now, this may not sound like a great rate; but if you have pancreatic cancer, even a 10 percent chance of possibly doubling your chances of survival would feel like a windfall.
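The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the researchers' actual method: the record layout, the function names, and every row of toy data are invented here, and a real analysis would involve far more users and proper statistical testing.

```python
from typing import List, Sequence, Tuple

# Each record: (symptom searches in time order, later reported a diagnosis?).
# All rows below are invented toy data, not figures from the study.
User = Tuple[List[str], bool]


def searched_in_order(searches: List[str], first: str, then: str) -> bool:
    """True if `then` was searched at some point after the first `first`."""
    if first not in searches:
        return False
    return then in searches[searches.index(first) + 1:]


def diagnosis_rate(group: Sequence[User]) -> float:
    """Share of users in the group who later reported a diagnosis."""
    if not group:
        return 0.0
    return sum(1 for _, diagnosed in group if diagnosed) / len(group)


def compare_pattern(users: Sequence[User], first: str, then: str) -> Tuple[float, float]:
    """Diagnosis rate for users who searched `first` and later `then`,
    versus users who searched `first` without a later `then`."""
    followed = [u for u in users if searched_in_order(u[0], first, then)]
    not_followed = [
        u for u in users
        if first in u[0] and not searched_in_order(u[0], first, then)
    ]
    return diagnosis_rate(followed), diagnosis_rate(not_followed)


TOY_USERS: List[User] = [
    (["back pain", "yellowing skin"], True),
    (["back pain", "yellowing skin"], True),
    (["back pain", "yellowing skin"], False),
    (["back pain"], False),
    (["back pain"], False),
    (["back pain", "headache"], False),
    (["yellowing skin", "back pain"], False),  # wrong order, so not a match
]

rate_followed, rate_not_followed = compare_pattern(
    TOY_USERS, "back pain", "yellowing skin"
)
print(f"back pain then yellowing skin: {rate_followed:.2f}")
print(f"back pain without it:          {rate_not_followed:.2f}")
```

The point of the sketch is that the order of the searches matters: the last toy user searched for both symptoms, but in the reverse order, and so falls in the second group.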
The paper detailing this study would be difficult for non-experts to fully make sense of. It includes a lot of technical jargon, such as the Kolmogorov-Smirnov test, the meaning of which, I have to admit, I had forgotten. (It’s a way to determine whether a model correctly fits data.)
However, note how natural and intuitive this remarkable study is at its most fundamental level. The researchers looked at a wide array of medical cases and tried to connect symptoms to a particular illness. You know who else uses this methodology in trying to figure out whether someone has a disease? Husbands and wives, mothers and fathers, and nurses and doctors. Based on experience and knowledge, they try to connect fevers, headaches, runny noses, and stomach pains to various diseases. In other words, the Columbia and Microsoft researchers wrote a groundbreaking study by utilizing the natural, obvious methodology that everybody uses to make health diagnoses.
But wait. Let’s slow down here. If the methodology of the best data science is frequently natural and intuitive, as I claim, this raises a fundamental question about the value of Big Data. If humans are naturally data scientists, if data science is intuitive, why do we need computers and statistical software? Why do we need the Kolmogorov-Smirnov test? Can’t we just use our gut? Can’t we do it like Grandma does, like nurses and doctors do?
This gets to an argument intensified after the release of Malcolm Gladwell’s bestselling book Blink, which extols the magic of people’s gut instincts. Gladwell tells the stories of people who, relying solely on their guts, can tell whether a statue is fake; whether a tennis player will fault before he hits the ball; how much a customer is willing to pay. The heroes in Blink do not run regressions; they do not calculate confidence intervals; they do not run Kolmogorov-Smirnov tests. But they generally make remarkable predictions. Many people have intuitively supported Gladwell’s defense of intuition: they trust their guts and feelings. Fans of Blink might celebrate the wisdom of my grandmother giving relationship advice without the aid of computers. Fans of Blink may be less apt to celebrate my studies or the other studies profiled in this book, which use computers. If Big Data—of the computer type, rather than the grandma type—is a revolution, it has to prove that it’s more powerful than our unaided intuition, which, as Gladwell has pointed out, can often be remarkable.
The Columbia and Microsoft study offers a clear example of rigorous data science and computers teaching us things our gut alone could never find. This is also one case where the size of the dataset matters. Sometimes there is insufficient experience for our unaided gut to draw upon. It is unlikely that you—or your close friends or family members—have seen enough cases of pancreatic cancer to tease out the difference between indigestion followed by abdominal pain and indigestion alone. Indeed, it is inevitable, as the Bing dataset gets bigger, that the researchers will pick up many more subtle patterns in the timing of symptoms—for this and other illnesses—that even doctors might miss.
Moreover, while our gut may usually give us a good general sense of how the world works, it is frequently not precise. We need data to sharpen the picture. Consider, for example, the effects of weather on mood. You would probably guess that people are more likely to feel gloomy on a 10-degree day than on a 70-degree day. Indeed, this is correct. But you might not guess how big an impact this temperature difference can make. I looked for correlations between an area’s Google searches for depression and a wide range of factors, including economic conditions, education levels, and church attendance. Winter climate swamped all the rest. In winter months, warm climates, such as that of Honolulu, Hawaii, have 40 percent fewer depression searches than cold climates, such as that of Chicago, Illinois. Just how significant is this effect? An optimistic read of the effectiveness of antidepressants would find that the most effective drugs decrease the incidence of depression by only about 20 percent. To judge from the Google numbers, a Chicago-to-Honolulu move would be at least twice as effective as medication for your winter blues.*
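The "at least twice as effective" figure is a simple ratio of the two percentages quoted above. A minimal sketch of the arithmetic follows; the function and constant names are made up for illustration, and the comparison is rough, since one number measures a drop in searches and the other a drop in the incidence of depression.

```python
def relative_effect(reduction_a: float, reduction_b: float) -> float:
    """How many times larger reduction_a is than reduction_b."""
    if reduction_b <= 0:
        raise ValueError("reduction_b must be positive")
    return reduction_a / reduction_b


# Both figures come from the text; the names are illustrative only.
WARM_VS_COLD_SEARCH_REDUCTION = 0.40  # fewer depression searches in warm climates
BEST_DRUG_INCIDENCE_REDUCTION = 0.20  # optimistic estimate for antidepressants

ratio = relative_effect(WARM_VS_COLD_SEARCH_REDUCTION, BEST_DRUG_INCIDENCE_REDUCTION)
print(f"Climate effect is about {ratio:.1f} times the drug effect")
```

Because 20 percent is described as an optimistic read of the drugs' effectiveness, any less generous estimate would push the ratio above two, which is why the text says "at least."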
Sometimes our gut, when not guided by careful computer analysis, can be dead wrong. We can get blinded by our own experiences and prejudices. Indeed, even though my grandmother is able to utilize her decades of experience to give better relationship advice than the rest of my family, she still has some dubious views on what makes a relationship last. For example, she has frequently emphasized to me the importance of having common friends. She believes that this was a key factor in her marriage’s success: she spent most warm evenings with her husband, my grandfather, in their small backyard in Queens, New York, sitting on lawn chairs and gossiping with their tight group of neighbors.
However, at the risk of throwing my own grandmother under the bus, data science suggests that Grandma’s theory is wrong. A team of computer scientists recently analyzed the biggest dataset ever assembled on human relationships—Facebook. They looked at a large number of couples who were, at some point, “in a relationship.” Some of these couples stayed “in a relationship.” Others switched their status to “single.” Having a common core group of friends, the researchers found, is a strong predictor that a relationship will not last. Perhaps hanging out every night with your partner and the same small group of people is not such a good thing; separate social circles may help make relationships stronger.
As you can see, our intuition alone, when we stay away from the computers and go with our gut, can sometimes amaze. But it can make big mistakes. Grandma may have fallen into one cognitive trap: we tend to exaggerate the relevance of our own experience. In the parlance of data scientists, we weight our data, and we give far too much weight to one particular data point: ourselves.
Grandma was so focused on her evening schmoozes with Grandpa and their friends that she did not think enough about other couples. She forgot to fully consider her brother-in-law and his wife, who chitchatted most nights with a small, consistent group of friends but fought frequently and divorced. She forgot to fully consider my parents, her daughter and son-in-law. My parents go their separate ways many nights—my dad to a jazz club or ball game with his friends, my mom to a restaurant or the theater with her friends; yet they remain happily married.
When relying on our gut, we can also be thrown off by the basic human fascination with the dramatic. We tend to overestimate the prevalence of anything that makes for a memorable story. For example, when asked in a survey, people consistently rank tornadoes as a more common cause of death than asthma. In fact, asthma causes about seventy times more deaths. Deaths by asthma don’t stand out—and don’t make the news. Deaths by tornadoes do.
We are often wrong, in other words, about how the world works when we rely just on what we hear or personally experience. While the methodology of good data science is often intuitive, the results are frequently counterintuitive. Data science takes a natural and intuitive human process—spotting patterns and making sense of them—and injects it with steroids, potentially showing us that the world works in a completely different way from how we thought it did. That’s what happened when I studied the predictors of basketball success.
When I was a little boy, I had one dream and one dream only: I wanted to grow up to be an economist and data scientist. No. I’m just kidding. I wanted desperately to be a professional basketball player, to follow in the footsteps of my hero, Patrick Ewing, all-star center for the New York Knicks.
I sometimes suspect that inside every data scientist is a kid trying to figure out why his childhood dreams didn’t come true. So it is not surprising that I recently investigated what it takes to make the NBA. The results of the investigation were surprising. In fact, they demonstrate once again how good data science can change your view of the world, and how counterintuitive the numbers can be.
The particular question I looked at is this: are you more likely to make it in the NBA if you grow up poor or middle-class?
Most people would guess the former. Conventional wisdom says that growing up in difficult circumstances, perhaps in the projects with a single, teenage mom, helps foster the drive necessary to reach the top levels of this intensely competitive sport.
This view was expressed by William Ellerbee, a high school basketball coach in Philadelphia, in an interview with Sports Illustrated. “Suburban kids tend to play for the fun of it,” Ellerbee said. “Inner-city kids look at basketball as a matter of life or death.” I, alas, was raised by married parents in the New Jersey suburbs. LeBron James, the best player of my generation, was born poor to a sixteen-year-old single mother in Akron, Ohio.
Indeed, an internet survey I conducted suggested that the majority of Americans think the same thing Coach Ellerbee and I thought: that most NBA players grow up in poverty.
Is this conventional wisdom correct?
Let’s look at the data. There is no comprehensive data source on the socioeconomics of NBA players. But by being data detectives, by utilizing data from a whole bunch of sources—basketball-reference.com, ancestry.com, the U.S. Census, and others—we can figure out what family background is actually most conducive to making the NBA. This study, you will note, uses a variety of data sources, some of them bigger, some of them smaller, some of them online, and some of them offline. As exciting as some of the new digital sources are, a good data scientist is not above consulting old-fashioned sources if they can help. The best way to get the right answer to a question is to combine all available data.
The first relevant data is the birthplace of every player. For every county in the United States, I recorded how many black and white men were born in the 1980s. I then recorded how many of them reached the NBA. I compared this to a county’s average household income. I also controlled for the racial demographics of a county, since—and this is a subject for a whole other book—black men are about forty times more likely than white men to reach the NBA.
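One way to picture the bookkeeping described above is the sketch below. It is a simplified illustration under stated assumptions, not the actual analysis: the row layout, the income cutoff, and all of the numbers are invented, and grouping by race stands in for the statistical controls a real study would use. The toy rates are deliberately identical across income tiers so that the output implies no finding.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CountyRow:
    county: str
    race: str
    births_1980s: int          # men of this race born in the county in the 1980s
    nba_players: int           # how many of them reached the NBA
    avg_household_income: float


def rate_per_100k(rows: List[CountyRow]) -> float:
    """NBA players per 100,000 births, pooled over the given rows."""
    births = sum(r.births_1980s for r in rows)
    players = sum(r.nba_players for r in rows)
    return 100_000 * players / births if births else 0.0


def rates_by_race_and_income(
    rows: List[CountyRow], income_cutoff: float
) -> Dict[Tuple[str, str], float]:
    """Compare income tiers within each race, so race is held fixed."""
    groups: Dict[Tuple[str, str], List[CountyRow]] = {}
    for r in rows:
        tier = "higher" if r.avg_household_income >= income_cutoff else "lower"
        groups.setdefault((r.race, tier), []).append(r)
    return {key: rate_per_100k(group) for key, group in groups.items()}


# Invented placeholder rows; counties "A" and "B" are hypothetical.
TOY_ROWS = [
    CountyRow("A", "black", 50_000, 2, 40_000.0),
    CountyRow("A", "white", 100_000, 1, 40_000.0),
    CountyRow("B", "black", 25_000, 1, 80_000.0),
    CountyRow("B", "white", 200_000, 2, 80_000.0),
]

rates = rates_by_race_and_income(TOY_ROWS, income_cutoff=60_000.0)
for (race, tier), rate in sorted(rates.items()):
    print(f"{race:5s} {tier:6s} income: {rate:.1f} per 100,000 births")
```

Comparing tiers only within the same race keeps the large racial difference in NBA rates from being mistaken for an income effect.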