Pandora does the same in picking the songs you might want to listen to. And this is how Netflix figures out the movies you might like. The impact has been so profound that when Amazon engineer Greg Linden originally introduced doppelganger searches to predict readers’ book preferences, the recommendations improved so dramatically that Amazon founder Jeff Bezos got down on his knees and shouted, “I’m not worthy!” to Linden.
But what is really interesting about doppelganger searches, considering their power, is not how they’re commonly being used now. It is how frequently they are not used. There are major areas of life that could be vastly improved by the kind of personalization these searches allow. Take our health, for instance.
Isaac Kohane, a computer scientist and medical researcher at Harvard, is trying to bring this principle to medicine. He wants to organize and collect all of our health information so that instead of using a one-size-fits-all approach, doctors can find patients just like you. Then they can employ more personalized, more focused diagnoses and treatments.
Kohane considers this a natural extension for the medical field and not even a particularly radical one. “What is a diagnosis?” Kohane asks. “A diagnosis really is a statement that you share properties with previously studied populations. When I diagnose you with a heart attack, God forbid, I say you have a pathophysiology that I learned from other people means you have had a heart attack.”
A diagnosis is, in essence, a primitive kind of doppelganger search. The problem is that the datasets doctors use to make their diagnoses are small. These days a diagnosis is based on a doctor’s experience with the population of patients he or she has treated, perhaps supplemented by academic papers on the small populations other researchers have encountered. As we’ve seen, though, for a doppelganger search to really get good, it would have to include many more cases.
Here is a field where some Big Data could really help. So what’s taking so long? Why isn’t it already widely used? The problem lies with data collection. Most medical records still exist on paper, buried in files, and those that are computerized are often locked up in incompatible formats. We often have better data, Kohane notes, on baseball than on health. But simple measures would go a long way. Kohane talks repeatedly of “low-hanging fruit.” He believes, for instance, that merely creating a complete dataset of children’s height and weight charts, along with any diseases they might have, would be revolutionary for pediatrics. Each child’s growth path could then be compared to every other child’s growth path. A computer could find children who were on a similar trajectory and automatically flag any troubling patterns. It might detect a child’s height leveling off prematurely, which in certain scenarios would likely point to one of two possible causes: hypothyroidism or a brain tumor. Early diagnosis in both cases would be a huge boon. “These are rare birds,” according to Kohane, “one-in-ten-thousand kind of events. Children, by and large, are healthy. I think we could diagnose them earlier, at least a year earlier. One hundred percent, we could.”
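To make the idea concrete, here is a minimal sketch, in Python, of the kind of growth-chart comparison described above. Everything in it, the matching rule, the threshold, the data layout, is an invented illustration, not Kohane’s actual method.

```python
# A sketch of a "health doppelganger" check: compare one child's height
# curve to other children's curves measured at the same ages, find the
# closest matches, and flag the child if recent growth has stalled
# relative to those matches. Data and threshold are invented.

def curve_distance(a, b):
    """Mean absolute gap, in centimeters, between two height curves."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def flag_stalled_growth(child, population, k=5, min_gain_ratio=0.5):
    # Match children on everything up to the latest measurement...
    neighbors = sorted(
        population, key=lambda c: curve_distance(c[:-1], child[:-1])
    )[:k]
    # ...then compare growth over the most recent interval.
    child_gain = child[-1] - child[-2]
    typical_gain = sum(c[-1] - c[-2] for c in neighbors) / k
    # Flag the child if they grew far less than their doppelgangers did.
    return child_gain < min_gain_ratio * typical_gain

# Heights (cm) at ages 8, 9, 10, 11 for one child and a tiny population.
child = [128, 133, 138, 138]          # growth has leveled off
population = [
    [127, 132, 138, 144],
    [129, 134, 139, 145],
    [126, 131, 136, 142],
    [130, 135, 141, 147],
    [128, 133, 139, 144],
]
print(flag_stalled_growth(child, population))   # True: worth a closer look
```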
James Heywood is an entrepreneur who has taken a different approach to the difficulty of linking medical data. He created a website, PatientsLikeMe.com, where individuals can report their own information—their conditions, treatments, and side effects. He has already had a lot of success charting the varying courses diseases can take and how they compare to our common understanding of them.
His goal is to recruit enough people, covering enough conditions, so that people can find their health doppelganger. Heywood hopes that you can find people of your age and gender, with your history, reporting symptoms similar to yours—and see what has worked for them. That would be a very different kind of medicine, indeed.
DATA STORIES
In many ways the act of zooming in is more valuable to me than the particular findings of a particular study, because it offers a new way of seeing and talking about life.
When people learn that I am a data scientist and a writer, they sometimes will share some fact or survey with me. I often find this data boring—static and lifeless. It has no story to tell.
Likewise, friends have tried to get me to join them in reading novels and biographies. But these hold little interest for me as well. I always find myself asking, “Would that happen in other situations? What’s the more general principle?” Their stories feel small and unrepresentative.
What I have tried to present in this book is something that, for me, is like nothing else. It is based on data and numbers; it is illustrative and far-reaching. And yet the data is so rich that you can visualize the people underneath it. When we zoom in on every minute of Edmonton’s water consumption, I see the people getting up from their couch at the end of the period. When we zoom in on people moving from Philadelphia to Miami and starting to cheat on their taxes, I see these people talking to their neighbors in their apartment complex and learning about the tax trick. When we zoom in on baseball fans of every age, I see my own childhood and my brother’s childhood and millions of adult men still crying over a team that won them over when they were eight years old.
At the risk of once again sounding grandiose, I think the economists and data scientists featured in this book are creating not only a new tool but a new genre. What I have tried to present in this chapter, and much of this book, is data so big and so rich, allowing us to zoom in so close that, without limiting ourselves to any particular, unrepresentative human being, we can still tell complex and evocative stories.
6
ALL THE WORLD’S A LAB
February 27, 2000, started as an ordinary day on Google’s Mountain View campus. The sun was shining, the bikers were pedaling, the masseuses were massaging, the employees were hydrating with cucumber water. And then, on this ordinary day, a few Google engineers had an idea that unlocked the secret that today drives much of the internet. The engineers found the best way to get you clicking, coming back, and staying on their sites.
Before describing what they did, we need to talk about correlation versus causality, a huge issue in data analysis—and one that we have not yet adequately addressed.
The media bombard us with correlation-based studies seemingly every day. For example, we have been told that those of us who drink a moderate amount of alcohol tend to be in better health. That is a correlation.
Does this mean drinking a moderate amount will improve one’s health—a causation? Perhaps not. It could be that good health causes people to drink a moderate amount. Social scientists call this reverse causation. Or it could be that there is an independent factor that causes both moderate drinking and good health. Perhaps spending a lot of time with friends leads to both moderate alcohol consumption and good health. Social scientists call this omitted-variable bias.
How, then, can we more accurately establish causality? The gold standard is a randomized, controlled experiment. Here’s how it works. You randomly divide people into two groups. One, the treatment group, is asked to do or take something. The other, the control group, is not. You then see how each group responds. The difference in the outcomes between the two groups is your causal effect.
For example, to test whether moderate drinking causes good health, you might randomly pick some people to drink one glass of wine per day for a year, randomly choose others to drink no alcohol for a year, and then compare the reported health of both groups. Since people were randomly assigned to the two groups, there is no reason to expect one group would have better initial health or have socialized more. You can trust that the effects of the wine are causal. Randomized, controlled experiments are the most trusted evidence in any field. If a pill can pass a randomized, controlled experiment, it can be dispensed to the general populace. If it cannot pass this test, it won’t make it onto pharmacy shelves.
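In code, the logic of such an experiment is short. Here is a minimal sketch, with an invented outcome measure standing in for a year of actual wine drinking and health reports:

```python
import random
import statistics

# A sketch of a randomized, controlled experiment. Random assignment is
# what licenses the causal reading: on average, the two groups start out
# alike in health, sociability, and everything else.

def run_experiment(participants, measure_outcome):
    treatment, control = [], []
    for person in participants:
        (treatment if random.random() < 0.5 else control).append(person)
    treated = [measure_outcome(p, treated=True) for p in treatment]
    untreated = [measure_outcome(p, treated=False) for p in control]
    # The difference in mean outcomes estimates the causal effect.
    return statistics.mean(treated) - statistics.mean(untreated)

# Invented stand-in: a noisy health score with a small true benefit (+2)
# for the treated group, just to show the machinery working.
def measure_outcome(person, treated):
    return random.gauss(70, 10) + (2 if treated else 0)

print(run_experiment(range(10_000), measure_outcome))   # close to 2
```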
Randomized experiments have increasingly been used in the social sciences as well. Esther Duflo, a French economist at MIT, has led the campaign for greater use of experiments in development economics, a field that tries to figure out the best ways to help the poorest people in the world. Consider Duflo’s study, with colleagues, of how to improve education in rural India, where more than half of middle school students cannot read a simple sentence. One potential reason students struggle so much is that teachers don’t show up consistently. On a given day in some schools in rural India, more than 40 percent of teachers are absent.
Duflo’s test? She and her colleagues randomly divided schools into two groups. In one (the treatment group), teachers were paid a small amount on top of their base pay—50 rupees, or about $1.15—for every day they showed up to work. In the other, no extra payment for attendance was given. The results were remarkable. When teachers were paid, teacher absenteeism fell by half. Student test performance also improved substantially, with the biggest effects on young girls. By the end of the experiment, girls in schools where teachers were paid to come to class were 7 percentage points more likely to be able to write.
According to a New Yorker article, when Bill Gates learned of Duflo’s work, he was so impressed he told her, “We need to fund you.”
THE ABCS OF A/B TESTING
So randomized experiments are the gold standard for proving causality, and their use has spread through the social sciences. Which brings us back to Google’s offices on February 27, 2000. What did Google do on that day that revolutionized the internet?
On that day, a few engineers decided to perform an experiment on Google’s site. They randomly divided users into two groups. The treatment group was shown twenty links on the search results pages. The control group was shown the usual ten. The engineers then compared the satisfaction of the two groups based on how frequently they returned to Google.
This is a revolution? It doesn’t seem so revolutionary. I already noted that randomized experiments have been used by pharmaceutical companies and social scientists. How can copying them be such a big deal?
The key point—and this was quickly realized by the Google engineers—is that experiments in the digital world have a huge advantage relative to experiments in the offline world. As convincing as offline randomized experiments can be, they are also resource-intensive. For Duflo’s study, schools had to be contacted, funding had to be arranged, some teachers had to be paid, and all students had to be tested. Offline experiments can cost thousands or hundreds of thousands of dollars and take months or years to conduct.
In the digital world, randomized experiments can be cheap and fast. You don’t need to recruit and pay participants. Instead, you can write a line of code to randomly assign them to a group. You don’t need users to fill out surveys. Instead, you can measure mouse movements and clicks. You don’t need to hand-code and analyze the responses. You can build a program to automatically do that for you. You don’t have to contact anybody. You don’t even have to tell users they are part of an experiment.
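That “line of code” can be almost literally that. A common pattern, sketched below with an invented experiment name, is to hash a stable user ID so that each user always lands in the same group:

```python
import hashlib

# A sketch of deterministic assignment for an online experiment: hashing
# a stable user ID means no recruitment, no surveys, and the same user
# sees the same variant on every visit. The experiment name is invented.

def assign_group(user_id: str, experiment: str = "results_page_links") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # deterministic bucket, 0-99
    return "treatment" if bucket < 50 else "control"

# e.g. show twenty search-result links to "treatment", ten to "control"
print(assign_group("user-12345"))
```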
This is the fourth power of Big Data: it makes randomized experiments, which can find truly causal effects, much, much easier to conduct—anytime, more or less anywhere, as long as you’re online. In the era of Big Data all the world’s a lab.
This insight quickly spread through Google and then the rest of Silicon Valley, where randomized controlled experiments have been renamed “A/B testing.” In 2011, Google engineers ran seven thousand A/B tests. And this number is only rising.
If Google wants to know how to get more people to click on the ads on its sites, it may try two shades of blue—one shade for Group A, another for Group B—and then compare click rates. Of course, the ease of such testing can lead to overuse. Some employees felt that because testing was so effortless, Google was overexperimenting. In 2009, one frustrated designer quit after Google ran A/B tests on forty-one marginally different shades of blue. But this designer’s stand in favor of art over obsessive market research has done little to stop the spread of the methodology.
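Comparing the two groups afterward is equally mechanical. Here is a minimal sketch of one standard approach, a two-proportion z-test on click-through rates, with invented counts:

```python
from math import sqrt, erfc

# A sketch of scoring an A/B test on ad clicks: compare the two groups'
# click-through rates and ask whether the gap could plausibly be chance.
# The counts below are invented for illustration.

def compare_click_rates(clicks_a, views_a, clicks_b, views_b):
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))        # two-sided p-value
    return p_a, p_b, z, p_value

# Shade A: 1,000 clicks in 100,000 views; shade B: 1,100 in 100,000.
# A small p-value suggests the winning shade's edge is not just noise.
print(compare_click_rates(1000, 100_000, 1100, 100_000))
```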
Facebook now runs a thousand A/B tests per day, which means that a small number of engineers at Facebook start more randomized, controlled experiments in a given day than the entire pharmaceutical industry starts in a year.
A/B testing has spread beyond the biggest tech firms. A former Google employee, Dan Siroker, brought this methodology to Barack Obama’s first presidential campaign, which A/B-tested home page designs, email pitches, and donation forms. Then Siroker started a new company, Optimizely, which allows organizations to perform rapid A/B testing. In 2012, Optimizely was used by Obama as well as his opponent, Mitt Romney, to maximize sign-ups, volunteers, and donations. It’s also used by companies as diverse as Netflix, TaskRabbit, and New York magazine.
To see how valuable testing can be, consider how Obama used it to get more people engaged with his campaign. Obama’s home page initially included a picture of the candidate and a button below the picture that invited people to “Sign Up.”
Was this the best way to greet people? With the help of Siroker, Obama’s team could test whether a different picture and button might get more people to actually sign up. Would more people click if the home page instead featured a picture of Obama with a more solemn face? Would more people click if the button instead said “Join Now”? Obama’s team showed users different combinations of pictures and buttons and measured how many of them clicked the button. See if you can predict the winning picture and winning button.
[Images in the original: the pictures tested and the buttons tested]
The winner was the picture of Obama’s family and the button “Learn More.” And the victory was huge. By using that combination, Obama’s campaign team estimated it got 40 percent more people to sign up, netting the campaign roughly $60 million in additional funding.
[Image in the original: the winning combination]
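Mechanically, picking the winner of such a combination test is simple: show each visitor one randomly chosen picture-and-button pair, tally sign-ups, and take the pair with the highest rate. A sketch, with invented variant names and counts:

```python
from itertools import product

# A sketch of scoring a combination test: every (picture, button) pair
# is a variant, and the winner is the pair with the highest sign-up
# rate. Variant names and counts are invented for illustration.

pictures = ["candidate", "solemn", "family"]
buttons = ["Sign Up", "Join Now", "Learn More"]

# (sign-ups, visitors) observed for each combination
results = {pair: (0, 1) for pair in product(pictures, buttons)}
results[("candidate", "Sign Up")] = (800, 10_000)
results[("family", "Learn More")] = (1120, 10_000)

def signup_rate(pair):
    signups, visitors = results[pair]
    return signups / visitors

winner = max(results, key=signup_rate)
print(winner, signup_rate(winner))   # ('family', 'Learn More') wins here
```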
There is another great benefit to the fact that all this gold-standard testing can be done so cheaply and easily: it further frees us from reliance on our intuition, which, as noted in Chapter 1, has its limitations. A fundamental reason for A/B testing’s importance is that people are unpredictable. Our intuition often fails to predict how they will respond.
Was your intuition correct on Obama’s optimal website?
Here are some more tests for your intuition. The Boston Globe A/B-tests headlines to figure out which ones get the most people to click on a story. Try to guess the winners from these pairs:
[Table in the original: pairs of competing Globe headlines]
Finished your guesses? The answers appear in bold below.
[Table in the original: the same headline pairs, with each winning headline in bold]
I predict you got more than half right, perhaps by considering what you would click on. But you probably did not guess all of these correctly.
Why? What did you miss? What insights into human behavior did you lack? What lessons can you learn from your mistakes?
We usually ask questions such as these after making bad predictions.
But look how difficult it is to draw general conclusions from the Globe headlines. In the first headline test, changing a single word, “this” to “SnotBot,” led to a big win. This might suggest that specific details win. But in the second headline, “deflated balls,” the more detailed term, loses. In the fourth headline, “makes bank” beats the figure $179,000. This might suggest that slang terms win. But the slang term “hookup contest” loses in the third headline.
The lesson of A/B testing, to a large degree, is to be wary of general lessons. Clark Benson is the CEO of ranker.com, a news and entertainment site that relies heavily on A/B testing to choose headlines and site design. “At the end of the day, you can’t assume anything,” Benson says. “Test literally everything.”
Testing fills in gaps in our understanding of human nature. These gaps will always exist. If we knew, based on our life experience, what the answer would be, testing would not be of value. But we don’t, so it is.
Another reason A/B testing is so important is that seemingly small changes can have big effects. As Benson puts it, “I’m constantly amazed with minor, minor factors having outsized value in testing.”
In December 2012, Google changed its advertisements, adding a rightward-pointing arrow surrounded by a square.
Notice how bizarre this arrow is. It points rightward to absolutely nothing. In fact, when these arrows first appeared, many Google customers were critical. Why, they wondered, was Google adding meaningless arrows to its ads?
Well, Google is protective of its business secrets, so it won’t say exactly how valuable the arrows were. But it did say that these arrows had won in A/B testing; Google added them because they got a lot more people to click. And this minor, seemingly meaningless change made Google and its ad partners oodles of money.
So how can you find these small tweaks that produce outsize profits? You have to test lots of things, even many that seem trivial. In fact, Google’s users have noticed numerous times that ads have changed a tiny bit only to return to their previous form. They have unwittingly become members of treatment groups in A/B tests, at no cost beyond seeing these slight variations.
[Images in the original: three losing ad variations (a centering experiment, a green-star experiment, and a new-font experiment)]
These variations never made it to the masses. They lost. But they were part of the process of picking winners. The road to a clickable arrow is paved with ugly stars, faulty positioning, and gimmicky fonts.
It may be fun to guess what makes people click. And if you are a Democrat, it might be nice to know that testing got Obama more money. But there is a dark side to A/B testing.
In his excellent book Irresistible, Adam Alter writes about the rise of behavioral addictions in contemporary society. Many people are finding aspects of the internet increasingly difficult to turn off.