A week later, Amazon admitted that it had embarked on a five-day randomised price test, involving sixty-eight DVD titles and discounts of up to 40 per cent. The company announced that it was ceasing the experiment, and would refund an average of $3.10 to 6896 customers as a result of the random price test.28 Amazon would keep running randomised experiments in areas such as website design, but the company promised to stop running pricing experiments on its customers.
But pricing experiments remain ubiquitous.29 Randomly varying prices in mail-order catalogues for women’s clothing, researchers found that when retailers end prices with the number nine (such as $9.99 or $39), demand jumps by up to one-third.30 Although the rational thing to do is to round up, our brains have a tendency to round down, making things priced with nines seem cheaper than they really are. In fact, customers seem to be so bad at processing prices that the study found higher demand at a price of $39 than at $34. Perhaps it shouldn’t come as a surprise, then, that a survey of consumer prices estimates that about half of all published prices end in nine.
Customers also seem to make predictable mistakes when shopping online. One randomised experiment on eBay tested what happened when the seller dropped the minimum bid by $1 but raised the shipping cost by $1.31 It turned out that for the products being sold – CDs and Xbox games – buyers tended to ignore the shipping costs. If you’re buying a product on eBay and you suddenly notice that it has an exorbitant shipping cost, you might be dealing with a seller who keeps up to date with the latest randomised trials.
Another trick that companies use to increase their profits is to offer products with a special status. In an experiment in Indonesia, a major credit card company randomly offered customers an upgrade to one of two products. Some were offered a ‘platinum’ credit card, featuring airport lounge access, discounts on international fashion brands and a higher credit limit. Other customers were offered a card with all the same features, but without the ‘platinum’ branding. Customers were significantly more likely to accept the offer of a high-status platinum card, and more likely to use it in visible settings, such as restaurants.32
Other experiments take place in stores. Working with a single US retail store for sixteen weeks, a team of researchers tested whether shoppers were more likely to buy a bottle of hand lotion when the shelf sign said ‘33 per cent off’ or when it said ‘50 per cent more free’.33 The two descriptions are mathematically identical, but they found that the product was nearly twice as popular in the weeks when it was sold using the larger ‘50 per cent more free’ tagline.
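To see why the two taglines are mathematically identical, it helps to write out the price per unit. The derivation below is simply a worked check of that claim; the $3-for-300 ml example is illustrative, not drawn from the study.

```latex
% Price per unit for base price P and base quantity Q
% '50 per cent more free': pay P, receive 1.5Q
% '33 per cent off':       pay (2/3)P, receive Q
\[
\underbrace{\frac{P}{1.5\,Q}}_{\text{50\% more free}}
\;=\; \frac{2}{3}\cdot\frac{P}{Q}
\;=\; \underbrace{\frac{\tfrac{2}{3}P}{Q}}_{\text{33\% off}}
\]
% Example: a $3, 300 ml bottle of lotion. '50 per cent more free' gives
% 450 ml for $3; '33 per cent off' gives 300 ml for $2 -- both $0.67 per 100 ml.
```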
With only one store, there’s a limit to how much we can generalise the results. Moreover, a single-store experiment has to be run over time, by randomly turning the promotion on and off from week to week. With multiple stores, it can be done in a single week. Working with a chain of eighty-six supermarkets, a group of researchers compared the impact of single-unit discounts (e.g. 50 cents) with multiple-unit promotions (e.g. two for $1).34 In half the stores, customers saw single-unit discounts; in the other half, they saw multiple-unit promotions. Measuring sales across a variety of products – from canned soup to toilet paper – the researchers found that multiple-unit promotions have a psychological ‘anchoring’ effect, leading customers to spend one-third more.
But just as there are lousy products, there are dud promotions too. In 2003 a team of marketing professionals worked with CVS, the largest pharmacy chain in the United States, to identify product promotions that they thought weren’t working.35 The company then agreed to run a major experiment. For fifteen product categories, they would stop promotions in 400 randomly selected stores.
Three months later, the evidence was in. By axing the promotions, CVS sold fewer products but at higher prices. Across the 400 stores that ran the experiment, profits were up. Not surprisingly, CVS soon put the changes in place across all its 9000 stores. A simple randomised trial likely increased the organisation’s annual profits by more than $50 million. If your favourite CVS discount suddenly disappeared in 2003, it’s likely because the company figured out that the promotional price wasn’t attracting enough new buyers.
Randomised trials have even looked into what happens when customers are invited to ‘pay what you want’. A German study found that most of the time, customers pay less than the price that the company would normally charge.36 But because the quirky pricing scheme attracts new customers, it can end up paying off in the form of higher profits.
*
I’m sitting in a conference room in the Coles headquarters in Hawthorn East, several kilometres outside the centre of Melbourne. The building is the epicentre of the $30 billion Coles supermarket empire. There are people in Coles uniforms, displays of Coles products, and a Coles-branded cafe in the middle of an atrium emblazoned with Coles advertisements. It’s like Disneyland for retailers.
The reason for my visit is to learn about how one of Australia’s biggest supermarkets uses randomised trials. Coles owns FlyBuys, the nation’s largest loyalty card program. Each year, one in three Australians swipes a FlyBuys card. As well as earning points at Coles, Target and Kmart, all FlyBuys users get promotions on their checkout dockets, emails offering special discounts, and coupons mailed to their home.
Well, not quite all of them. To be precise, 99 per cent.
The FlyBuys loyalty card has an inbuilt randomised trial, they explain to me. One in every 100 customers is in the control group. Their cards work just the same as everyone else’s, but FlyBuys doesn’t send them any promotional material.
‘Can you tell whether my FlyBuys card is in the control group?’ I ask, pulling it out of my wallet.
‘Sure, I just look at the last two digits,’ replies one of the data analysts from Coles.
‘What’s the number for the control group?’
‘Sorry, it’s a secret.’
I can’t blame them for not wanting to give it away – but it’s too tantalising not to ask.
For Coles, the FlyBuys control group provides a clear and unambiguous performance benchmark for their promotions. If the promotional program is working well, then sales per customer should be higher among the 99 per cent who get promotions than among the 1 per cent who do not. The better the promotions, the bigger the gap.
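To make that benchmark concrete, here is a minimal sketch in Python of how such a comparison might be computed. The card-number rule, the spending figures and the 10,000-customer panel are all illustrative assumptions, not Coles’ actual system or data.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical assignment rule: cards whose number ends in a secret two-digit
# suffix (here '00', purely for illustration) form the 1-in-100 control group
# that receives no promotional material.
SECRET_SUFFIX = "00"

def in_control_group(card_number: str) -> bool:
    return card_number.endswith(SECRET_SUFFIX)

# Made-up loyalty cards and annual spend: promotions nudge spend up slightly.
customers = []
for _ in range(10_000):
    card = f"{random.randrange(10**12):012d}"
    control = in_control_group(card)
    spend = random.gauss(4050 if control else 4200, 900)
    customers.append((control, spend))

control_spend = [s for c, s in customers if c]
treated_spend = [s for c, s in customers if not c]

# The benchmark: the gap in average spend per customer estimates the effect
# of the whole promotional program. The better the promotions, the bigger the gap.
uplift = mean(treated_spend) - mean(control_spend)
print(f"Control group size: {len(control_spend)} of {len(customers)}")
print(f"Estimated uplift per customer: ${uplift:,.0f}")
```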
I ask whether the management team and the board ever worry that FlyBuys is leaving money on the table by taking one in 100 customers out of the promotions. Not at all, they reply. Without a randomised control group, ‘how do you know you’re having an impact?’
Coles also uses randomised trials to test all kinds of aspects of their marketing. Should discount coupons be tear-off or loose? Are people more likely to open an email with a quirky header like ‘Bonzer Offer’? Should special offers be cash discounts or FlyBuys points? How does a 1000-point reward compare with a 2000-point reward? Do countdown clocks make people more likely to go to a store?
But it’s their response to my question about the FlyBuys control group that sticks in my head as I leave the Coles mothership. If you don’t have a randomised control group, ‘how do you know you’re having an impact?’
*
With over 3 billion passengers flying every year, planes account for 2 to 3 per cent of worldwide greenhouse gas emissions. For the airlines themselves, fuel is a massive expense: if you’re running an airline, about one-third of your costs go on jet fuel. And yet it turns out that pilots’ decisions can make a significant difference to how much fuel a plane uses. By not carrying too much fuel, updating the route based on new weather information, or turning off an engine when taxiing off the runway, pilots can cut their fuel use considerably.
But how to get pilots to save fuel? Virgin Atlantic Airways teamed up with economists to see whether providing better feedback to pilots on their fuel use would make them more efficient.37 Working with the pilots’ union, the researchers reassured pilots that they would not be ranked against each other. ‘This is not, in any way, shape or form, an attempt to set up a “fuel league table”,’ the letter told them. Despite knowing this, pilots who received monthly reports on their fuel efficiency ended up guzzling less gas than pilots who did not receive such reports. The feedback was purely private, yet it led pilots to tweak their behaviour. With an experiment that cost less than US$1000 in postage, Virgin cut its annual fuel usage by about 1 million litres.
Other personnel experiments have focused on low-wage jobs. In an experiment in the Canadian province of British Columbia, tree planters were either paid a fixed daily wage or an amount per tree. When paid a fixed wage, they planted about 1000 trees a day. When paid for each tree, they planted about 1200 trees each day.38 A randomised experiment in the US state of Washington found similar results for workers in fruit orchards.39 The Washington state study had an interesting twist, however. Flat-rate workers were resentful when they realised their colleagues were earning piece rates. So perhaps piece-rate workers did better because they were being compared against disappointed colleagues.
If you’ve ever picked fruit, you’ll know that your productivity depends on the quality of the crop as much as on how hard you’re working. In one set of experiments, a team of British researchers forged a relationship with a strawberry farmer who was trying to set the right pay rates.40 In one setting, workers were simply paid a fixed sum per kilogram. In another setting, workers’ pay rates depended on the total amount everyone picked that day. If everyone’s output was high, the managers reasoned, the fields must be bountiful. So as total pickings rose, the farmer lowered the rate per kilogram (though all workers were paid above the minimum wage). But the workers didn’t take long to cotton on – or strawberry on. Under the second pay scheme, working hard had a collective cost. So the strawberry pickers slowed down, particularly when they had a lot of friends in the picking team. This meant they picked fewer kilograms, but kept the pay rate per kilogram higher. By contrast, the simple piece-rate pay scheme didn’t create perverse incentives, and turned out to significantly increase total output.41
Pay isn’t all that matters. Studies of supermarket checkout workers, strawberry pickers and vehicle emissions inspectors find that more productive co-workers raise everyone’s performance.42 Social incentives matter too. In a randomised experiment of condom sellers in urban Zambia, a team of researchers compared the effect of bonus pay with social recognition of star employees.43 In that setting, it turns out that the promise of being publicly recognised has twice as large an impact as financial rewards. Who knew that being ‘condom seller of the month’ could be such an incentive?
But other kinds of ‘recognition’ can have the opposite effect. In a randomised experiment run on Amazon’s online Mechanical Turk platform, some workers were given feedback about how their productivity ranked against that of their co-workers.44 Telling workers their place in the pecking order turned out to reduce productivity. If you’re a boss, these experiments are a reminder that the most productive people in the workplace bring in a double dividend: their own output, plus the lift they give to everyone around them. They also suggest that you might want to promote ‘most valued employee’ awards, but sack worker league tables.
Randomisation is particularly critical when studying employees. That’s because people behave differently when they know they’re being watched. If I tell you that I’m going to measure your productivity for a day, you’ll probably spend less time surfing Facebook. This is known as the ‘Hawthorne effect’, after a famous study of workers in the Hawthorne Works factory in the 1920s. As it turns out, there are now questions over precisely what was going on in that study.45 But there is no doubt that Hawthorne effects are real, just like placebo effects in medicine. For example, Virgin Atlantic’s experiment showed significant differences between pilots who received personalised fuel usage reports and those who did not. But there was another impact too. All the pilots were told at the outset that the airline was conducting a fuel use study. Just knowing that they were being watched made pilots in the control group 50 per cent more likely to fly efficiently.
*
While many randomised trials in business have studied tweaks, some have looked at more dramatic changes. One of the big debates in business has been whether management consulting firms are change agents or charlatans. Reviewing the history of the industry, Matthew Stewart’s The Management Myth concludes that management consulting is more like a party trick than a science.46 Another critic calls it ‘precisely nine-tenths shtick and one-tenth Excel’.47
If we want to know the impact of management consultants on firm performance, it isn’t enough to compare firms that use consultants with those that don’t. That kind of naive comparison could be biased upwards or downwards, depending on why firms engage consultants. If hiring a consultant is a sign that the management team is thinking big, then we might expect those kinds of firms to outperform their competitors, even in the absence of outside help. Alternatively, if hiring a management consultant is the corporate equivalent of calling in the doctor, then we might expect a naive comparison to show that consultants were associated with underperformance.
To tackle the question, Stanford’s Nicholas Bloom and his collaborators worked with twenty textile plants in India.48 Fourteen were randomly selected to receive five months’ worth of management consulting advice from Accenture, an international management consulting firm.49 Afterwards, productivity was one-tenth higher in the firms that had received management advice. Although the advice was free, the productivity boost was large enough that companies would have made a profit even if they had paid for it at the going rate.
So if you run a medium-sized firm, should you stop reading now and phone the management consultants? Not necessarily. International surveys of management quality tend to rate India very low. Bloom and his team include photographs in their report of what the factories looked like before the intervention. They show tools scattered on the floor, garbage left in piles, machines not maintained and yarn left in damp sacks. It isn’t hard to see how outside consultants were able to help managers clean up the plants, shift to computerised systems and ramp up their production. But with a better-run firm, the gains from calling in Accenture, McKinsey or Boston Consulting might be a good deal smaller.50
Bloom’s management consulting experiment also illustrates a curious fact about randomised trials: under the right conditions, the treatment and control groups can be remarkably small.51 With similar factories, a major intervention and plenty of output data, Bloom and his team could be confident that they had measured statistically significant effects – meaning that the results were unlikely to be a mere fluke in the data.
Sometimes the statistics allow randomistas to find significant effects in small samples. But occasionally the problem is the reverse. This turns out to be a difficulty plaguing experiments that test the impact of advertising. Since the early 1980s, US cable television companies have had a product called ‘split cable’, making it possible to deliver different advertisements to households watching the same program. Market researchers such as AC Nielsen then work with a panel of households to log everything they buy (these days, participating households have a handheld scanner and just scan the barcodes of every new product).
Online, running randomised experiments is even simpler. Web retailers select a large group of people, randomly send half of them an advertisement for their product, and then use cookies to follow how people behave. In some cases, companies are even able to link up online ads with purchases made in person.
But while setting up a randomised advertising trial is easy, measuring the effect is hard. Our shopping patterns are volatile: we switch brands, buy on impulse and only occasionally buy big products (think about the last time you chose a new credit card, or bought a vacuum cleaner). Advertising is ubiquitous, so most advertisements have no impact on most people. Estimating the true impact of advertising on customers is a needle-in-a-haystack problem.
Combining 389 ‘split cable’ experiments, one study concluded: ‘There is no simple correspondence between increased television advertising weight and increased sales.’52 In general, the estimates suggest that advertising works, but struggle to tell a lousy campaign from a successful one. To illustrate the problem, the researchers turn to the most-watched advertisements in America: television ads shown to Super Bowl viewers. Airing a thirty-second Super Bowl ad costs $5 million, nearly 5 cents per viewer. Even if an advertiser could use ‘split cable’ techniques to individually randomise across all 110 million Super Bowl viewers, the firm would have to buy dozens of ads to generate a measurable impact. No product in America, the researchers conclude, has a big enough advertising budget to find out whether Super Bowl ads really work. They dub this finding ‘the Super Bowl impossibility theorem’.53
If anything, the problem is even worse when it comes to measuring the impact of online advertising. Pooling twenty-five online advertising experiments, each of which randomised advertisements to about a million users, a recent study concludes that it is ‘exceedingly difficult’ to discern what works. The researchers estimate that a typical campaign would need to cover 9 million people – rather than the usual 1 million – in order to reliably distinguish between a wildly profitable campaign and one that merely broke even.
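To get a feel for why even a million users may be too few, here is a rough sample-size sketch for a simple two-proportion test. The baseline purchase rate, the size of the lift and the power target are illustrative assumptions, not figures from the study.

```python
from statistics import NormalDist

def sample_size_per_arm(p_control: float, p_treated: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm for a two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    diff = p_treated - p_control
    return int((z_alpha + z_beta) ** 2 * variance / diff ** 2)

# Assumed figures: 0.5% of un-advertised users buy, and the ad lifts that by
# 5 per cent in relative terms, to 0.525% -- a lift many campaigns would love.
n = sample_size_per_arm(0.005, 0.00525)
print(f"Needed per arm: {n:,}  (total: {2 * n:,})")
# Roughly 1.3 million per arm, or over 2.5 million users in total -- already
# more than a typical 1-million-person campaign provides.
```

Because the required sample grows with the inverse square of the lift, halving the detectable difference roughly quadruples the numbers, which is how estimates climb into the many millions.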
Difficult as it is to measure the impact of online ads, randomised trials are still the best option. The alternative is to compare people who were targeted with an ad against similar people who weren’t targeted.54 This can go badly wrong when the advertisement is served up based on internet searches. Suppose we run a non-randomised study of the effect of Nike ads, looking only at young and healthy people. Now imagine that people in the study who searched for ‘running shoes’ were shown Nike ads, and their purchasing patterns compared against those who did not search for running shoes. Would we really credit the advertisements for any difference in buying habits? Or would it be more sensible to think that people who searched for running shoes were already more likely to buy Nikes (or Asics or Brooks, for that matter)? If you’ve ever had the experience of searching for a toaster and then seeing toaster advertisements on your screen for the next week, you’ll know how ubiquitous microtargeting has become. In that environment, randomisation gives firms the best chance of finding out the true effect of their online marketing campaigns.