Randomistas
‘If I can predict what you are going to think of pretty much any problem,’ argues Esther Duflo, ‘it is likely that you will be wrong on stuff.’59 Since childhood, Duflo wanted to do something to reduce global suffering. Growing up in Paris, she recalls watching television coverage of the Ethiopian famine in the 1980s. Duflo’s mother, Violaine, a paediatrician, would travel each year to Africa – returning with images of the child victims of war that she had treated.
Initially, Duflo studied history at the elite École Normale Supérieure in Paris. But her turning point came when she spent a year working with economist Jeffrey Sachs in Moscow, helping advise the Russian government. ‘I immediately saw that, as an economist, I can have the best of both worlds. You can be there but keep your principles intact and say exactly what you think, and go back to your studies if you are ejected. What you are doing is meaningful and pertinent and maybe will change something.’60
It didn’t take Duflo long to make her mark on the economics profession. Tenured at twenty-nine, she has been awarded the Clark Medal (often a precursor to the Nobel Prize) and a MacArthur Foundation genius grant. With Abhijit Banerjee, Duflo founded the Abdul Latif Jameel Poverty Action Lab, or J-PAL, based at the Massachusetts Institute of Technology. Through her work at J-PAL, Duflo has been responsible for hundreds of randomised trials, including several in this chapter: business training, microcredit, subsidised anti-malaria bed nets, and vaccination incentives.
Duflo’s favourite hobby is rock climbing – a sport that rewards bravery, tenacity and flexibility. So it perhaps isn’t surprising that she tends to reject the idea that we can simply guess the best way up the mountain. Instead, Duflo prefers to start with a range of strategies, and put them to the test. ‘When someone of good will comes and wants to do something to affect education or the role of women or local governments,’ she says, ‘I want them to have a menu of things they can experiment with.’61 Sure, she admits, policies sometimes fail. But generally it is because people are complex, not because there is a grand conspiracy against the poor.
Like her mother, Duflo spends a considerable amount of time each year in developing countries – travelling to places like India, Kenya, Rwanda and Indonesia to work with her research collaborators and see J-PAL’s experiments firsthand. She knows that some studies will confirm the effectiveness of an anti-poverty program, while other evaluations will highlight defects. ‘One of my great assets,’ she says, ‘is I don’t have many opinions to start with. I have one opinion – one should evaluate things – which is strongly held. I’m never unhappy with the results. I haven’t yet seen a result I didn’t like.’62
In business, governance, health and education, randomistas like Esther Duflo are providing answers that help to reduce poverty in slums and villages across Africa, Latin America and the Asia-Pacific. Invariably, these results are messier than the grand theories that preceded them. That’s the reality of the world in which we live.
But it’s not all chaos. Just as biologists and physicists build up from the results of individual experiments to construct a model of how larger systems operate, randomistas have sought to combine the results of multiple experiments to inform policymakers. J-PAL doesn’t seek just to run randomised trials, but also to synthesise the evidence. In the case of schooling, J-PAL runs the ruler across dozens of programs designed to raise test scores in developing countries.63 It gives high marks to programs that increase the authority of the local school committee, that encourage teachers to show up (in one case by having them take a daily class photo) and that track students by achievement levels. But J-PAL gives a failing grade to free laptops, smaller class sizes and flipcharts.
At Yale University, Innovations for Poverty Action plays a similar role to MIT’s J-PAL centre, conducting randomised trials and summarising their results for decision-makers. A recent publication explores what works to boost financial inclusion.64 Monthly text messages reminding clients of their goals boost savings. ATM cards have no impact on women’s savings behaviour. Rainfall insurance makes farmers more productive. Microcredit does not increase the number of small businesses.
This kind of scorecard approach is not without its critics. One challenge is that combining studies across the developing world may miss important local differences. When I summarised J-PAL’s findings on schooling and Innovations for Poverty Action’s findings on financial inclusion just now, I neglected to tell you that these experiments were conducted in different countries, including Ghana, Peru and Indonesia. These places vary greatly in their poverty levels, racial mix, literacy levels and so on. Indeed, when researchers run the same experiment in different places, there generally turns out to be a good deal of variation.65
Another challenge is that apparently similar interventions might turn out to be different in design or implementation. Text messages reminding people to save more money may work only if worded a particular way. Giving more power to school committees could range from letting them look at the school budget to allowing them to fire teachers.
To illustrate the challenges of generalisability across developing country randomised trials, Karthik Muralidharan uses the example of textbooks.66 For many charitable donors, seeing a classroom of students sharing a single textbook led to an obvious conclusion: giving textbooks would raise student performance.
In four different experiments, students in schools that were randomly selected to receive textbooks did no better on exams than students in schools without textbooks. But as Muralidharan points out, closer examination of the four studies reveals four different reasons why textbook distribution failed to make a difference. In Sierra Leone, textbooks reached the schools but were put in storage rather than being distributed to the children.67 In India, parents cut back their spending on education in response to the free textbook program.68 In Tanzania, teachers had little incentive to use textbooks in their lessons.69 In Kenya, textbooks only helped the top fifth of students; the rest of the students were unable to read.70
Knowing all that, how would you respond to someone who asked: ‘Do free textbooks help students?’ You might just say, ‘no.’ Or you could say, ‘Maybe – but only if they aren’t put in storage, if parents don’t spend less on schooling, if teachers use them, and if students are literate.’ Together, the studies illustrate four different ways that a promising program can fail. They are a reminder of the need for randomistas to produce deeper knowledge. As Princeton’s Angus Deaton puts it, the best experiments don’t just test programs, they help us understand ‘theories that are generalizable to other situations’.71
The challenge of taking research findings from one specific part of the world and applying them in another is not unique to randomised trials. In fact, it’s not even unique to statistical research. Any time we generalise about humans in the world, we have to remember that people and programs differ. A drug that works on people of African ancestry may be less effective on people of European ancestry. Fire-fighting strategies that are effective in Madagascar may not work in Mali.
In Chapter 11, I’ll return to the topic of replication. But it is also worth pointing out that the alternative to using one kind of evidence is not necessarily another, better kind of evidence. As comedian Stephen Colbert sardonically noted of President George W. Bush, ‘We’re not members of the factinista. We go straight from the gut.’ As the saying goes, you can’t reason someone out of a position if they didn’t reason their way into it in the first place.
Randomised trials may not be perfect, but the alternative is making policy based on what one pair of experts describe as ‘opinions, prejudices, anecdotes and weak data’.72 As the poet W.H. Auden once put it, ‘We may not know very much, but we do know something, and while we must always be prepared to change our minds, we must act as best we can in the light of what we do know.’73
8
FARMS, FIRMS AND FACEBOOK
English scientist John Bennet Lawes had a problem. In 1842 he had been granted a patent for a product called ‘superphosphate’, made by treating bones or mineral phosphates with sulphuric acid. He established a factory, and was ready to produce this new artificial fertiliser. But when he tried to sell ‘J.B. Lawes’s Patent Manures’, farmers told Lawes that they didn’t see why they should buy fertiliser when their crops already grew perfectly well with animal manure. To persuade his customers, Lawes ran an experiment. On his family estate in Hertfordshire, just north of London, he divided a field into twenty plots and randomly assigned them to receive no fertiliser, chicken and cow manure, or ammonium sulphate.1 Then he planted winter wheat on each of the plots. Each year the wheat was harvested, and the amount produced by each plot was recorded.
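For readers who like to see the mechanics, here is a minimal sketch of how twenty plots might be randomly assigned to three treatments. It is an illustration only: the text does not describe how Lawes actually drew his assignment, and the function name `assign_plots`, the treatment labels, the roughly equal group sizes and the seed are all assumptions made for this example.

```python
import random

def assign_plots(n_plots=20,
                 treatments=("no fertiliser", "animal manure", "ammonium sulphate"),
                 seed=1842):
    """Randomly assign each plot to one treatment (hypothetical helper).

    Group sizes are kept as equal as possible; that balance is an
    assumption of this sketch, not something stated in the text.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    # One label per plot, cycling through the treatments, then shuffled.
    labels = [treatments[i % len(treatments)] for i in range(n_plots)]
    rng.shuffle(labels)
    # Plots are numbered from 1 to n_plots.
    return {plot: label for plot, label in enumerate(labels, start=1)}

assignment = assign_plots()
for plot, treatment in assignment.items():
    print(f"Plot {plot:2d}: {treatment}")
```

Because chance alone decides which plot gets which treatment, differences in soil or drainage should be spread across the groups rather than lining up with any one fertiliser.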
It didn’t take long for the field to show dramatic differences. Within a couple of years, Lawes observed, ‘the experimental ground looked almost as much as if it were devoted to trials with different seeds as with different manures’.2 Lawes’ fertiliser sales took off, and by the time of his death in 1900 his estate was worth £565,000 (£55 million in today’s money). Since then, global production of phosphate-based fertilisers has grown from around 100,000 tonnes per year to over 40 million tonnes per year.3 Hertfordshire became a major research centre, hosting some of the world’s leading experimental scientists, including Ronald Fisher, whose story we heard earlier. The experiments begun by Lawes continue today, making them the world’s oldest continuously running ecological experiment.
Globally, randomised experiments have been critical to farming success. In 1890 the rust fungus obliterated much of the Australian wheat crop, and the colonies had to import wheat. In response, a mathematician-turned-farmer by the name of William Farrer used experiments to find a rust-resistant variety of wheat. Critics mocked his ‘pocket handkerchief wheat plots’.4 But after trying hundreds of different breeding combinations, Farrer created a new ‘Federation Wheat’ based not on reputation or appearance, but on pure performance.
Agricultural trials of this kind are often called ‘field experiments’, a term which some people also use to describe randomised trials in social science. Modern agricultural field experiments use spatial statistical models to divide up the plots.5 As in medicine and aid, the most significant agricultural randomised trials are now conducted across multiple countries. They are at the heart of much of our understanding of genetically modified crops, the impact of climate change on agriculture, and drought resistance.
*
Gary Loveman was in his late thirties when he decided to make the switch from Harvard to Las Vegas. A junior professor at the time, he took up an offer to become the chief operating officer at Harrah’s casino. The CEO of Harrah’s was preparing to step down and wanted Loveman to succeed him in the top job. Former professors aren’t most people’s idea of a casino manager, and Loveman had no intention of becoming a typical casino manager. One of the things that attracted him to the betting business was the ready availability of data. And yet casinos were often run by gut feeling: ‘What I found in our industry was that the institutionalization of instinct was a source of many of its problems.’6
For as many problems as possible, Loveman set about running randomised trials.7 How to get interstate customers to come more often? Randomly choose a group to receive a discounted hotel offer and compare their response with those who aren’t randomly selected. How to keep high-rollers happy? Randomly experiment with incentives like free meals, free hotel rooms, exclusive access to venues, and free chips. How to get waiters to sell more drinks without becoming obnoxious? Randomly adjust the salary bonuses paid to casino staff. How to stop unlucky first-timers walking away? Randomly trial consolation prizes. (Loveman claims that the casino does not target gambling addicts.8)
Asked whether it was difficult to set up a constant stream of experiments, Loveman replied, ‘Honestly, my only surprise is that it is easier than I would have thought. I remember back in school how difficult it was to find rich data sets to work on. In our world, where we measure virtually everything we do, what has struck me is how easy it is to do this. I’m a little surprised more people don’t do this.’9 He said that Harrah’s set out three cardinal sins: ‘It’s like you don’t harass women, you don’t steal and you’ve got to have a control group. This is one of the things that you can lose your job for at Harrah’s – not running a control group.’10
Loveman is part of a growing band of business randomistas.11 In 1994 Nigel Morris and Rich Fairbank started the credit card company Capital One. Their philosophy was explicitly experimental. Should credit card offers be sent in white or blue envelopes? Mail out 50,000 of each, and see which gets the higher response rate.12 Could the website be tweaked? Create two versions, randomly direct visitors to each and see which is best.
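The envelope test can be sketched in a few lines. The response counts below are invented purely for illustration (the text gives only the mailing size of 50,000 per colour), and the two-proportion z-test is a standard textbook way of asking whether a gap in response rates is larger than chance would produce; nothing here is claimed to be Capital One's actual method.

```python
import math

def compare_response_rates(resp_a, n_a, resp_b, n_b):
    """Two-proportion z-test for the difference between two response rates.

    Returns (rate_a, rate_b, z, two_sided_p_value).
    """
    rate_a = resp_a / n_a
    rate_b = resp_b / n_b
    # Pooled rate under the assumption that envelope colour makes no difference.
    pooled = (resp_a + resp_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value

# Hypothetical counts: 1,000 replies to white envelopes, 1,100 to blue.
white, blue, z, p = compare_response_rates(1000, 50_000, 1100, 50_000)
print(f"white: {white:.2%}  blue: {blue:.2%}  z = {z:.2f}  p = {p:.3f}")
```

With mailings this large, even a gap of a fifth of a percentage point can be told apart from noise, which is part of why big firms find such experiments cheap to run.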
Capital One’s biggest innovation was to be the first major firm to offer customers free balance transfers from other credit cards. It ended up being highly successful, but Morris and Fairbank didn’t need to bet the company on it. Instead, they randomly selected a small group of customers, offered them free balance transfers and compared their behaviour with other Capital One customers.13 Capital One is now the eighth-largest bank holding company in the United States. Fairbank describes his firm as ‘a scientific laboratory where every decision about product design, marketing, channels of communication, credit lines, customer selection, collection policies, and cross-selling decisions could be subjected to systematic testing using thousands of experiments’.14
At Kohl’s, one of the largest department store chains in the United States, an experiment helped resolve a split on the board regarding opening hours.15 In 2013 the company was aiming to cut costs, and considering opening an hour later. Some managers liked the idea, while others expected the drop in sales would outweigh the cost savings. So the firm ran an experiment with 100 of its stores. The verdict: big cost savings with a small loss of sales. Bolstered by rigorous evidence, the firm pushed back opening hours across more than a thousand stores.
On 15 January 2014 the dating website OkCupid ran an unexpected experiment with its customers: it declared ‘Love is Blind Day’, and removed all photos from the website. Three-quarters of users immediately left the site. But then the company noticed something else. Among those who remained, people were more likely to respond to messages, conversations went deeper and contact details were exchanged more often. Racial and attractiveness biases disappeared. As company co-founder Christian Rudder put it, ‘In short, OkCupid worked better.’16
Then, after seven hours, OkCupid restored access to photos. Conversations immediately melted away. As Rudder puts it, ‘The goodness was gone, in fact worse than gone. It was like we’d turned on the bright lights at midnight.’17
Having access to OkCupid, Rudder points out, makes it possible to analyse ‘a data set of person-to-person interaction that’s deeper and more varied than anything held by any other private individual’.18 It also makes it possible to run experiments. Because Love is Blind Day wasn’t a randomised trial, its impact had to be assessed by comparing it with user patterns at the same time of the week.
But other OkCupid experiments are random. In a blog post titled ‘We experiment on human beings!’, Rudder explains how the website tested the impact of its match quality scores.19 Take a pair of people who are a bad match (with a 30 per cent match rating). When told the truth, there was a 1.4 per cent chance they would strike up a conversation. But what if OkCupid told them they were a terrific match (displaying a 90 per cent match rating)? In that case, the odds of the pair striking up a conversation were 2.9 per cent: more than twice as high. The results of the randomised experiment showed that ‘the mere myth of compatibility works just as well as the truth’.20
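A quick check of that arithmetic, using only the two conversation rates quoted above (the variable names are mine, chosen for illustration):

```python
# Conversation rates quoted in the text for pairs with a true 30 per cent match.
shown_true_score = 0.014   # shown the true 30 per cent rating
shown_90_score = 0.029     # shown a 90 per cent rating instead

ratio = shown_90_score / shown_true_score
gap_in_points = (shown_90_score - shown_true_score) * 100

print(f"Relative change: {ratio:.2f} times as likely")
print(f"Absolute change: {gap_in_points:.1f} percentage points")
```

The relative effect is large (a little over double), while the absolute change is about one and a half percentage points, since most pairs never start a conversation either way.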
Some firms remain reluctant to run experiments, with managers stymied by bureaucratic inertia, or fearing that an experiment will demonstrate that they don’t have all the answers (the truth: they don’t).21 But in other industries, experiments are everywhere. Running a randomised experiment in business is often called ‘A/B testing’, and has become integral to the operation of firms such as eBay, Intuit, Humana, Chrysler, United Airlines, Lyft and Uber.
Money transfer firm Western Union uses randomised experiments to decide what combination of fixed fees and foreign exchange mark-ups to charge consumers. Quora, a question-and-answer website, devotes a tenth of its staff to running randomised trials, and is conducting about thirty experiments at any given time.22 As one writer puts it, ‘We talk about the Google homepage or the Amazon checkout screen, but it’s now more accurate to say that you visited a Google homepage, an Amazon checkout screen.’23 Another commentator observes that ‘every pixel on the [Amazon] home page has had to justify its existence through repeated testing of alternative layouts’.24 Among the largest restaurants, retailers and financial institutions in the United States, at least a third are running randomised experiments.25
But these experiments don’t always go according to plan.
*
In 2000, tech-savvy users of online bookseller Amazon discovered something curious: movie prices were changing. A customer who looked at the DVD of Men in Black might see it priced at $23.97. But if the same person then deleted their cookies (small files on your hard disk that allow websites to track your browsing patterns), and went back to Amazon’s website, then the same DVD might be offered for sale at $27.97. The four-dollar difference was random – part of an Amazon experiment to see how price-responsive their customers were.
The experiment probably didn’t last long enough to teach Amazon much about price-responsiveness, but when the news broke, they did quickly learn how their customers felt about random price changes. Amazon users called it a ‘bad idea’ and a ‘strange business model’. One person posted on a discussion forum: ‘I will never buy another thing from those guys.’26 Los Angeles actor John Dziak said he felt betrayed. ‘You trust a company, then you hear they’re doing this kind of stuff, your trust wavers, and you go somewhere else.’27