

  When it comes to thinking about the world around you, you have a range of tools available. Intuitions are valuable for all kinds of things, especially in the social domain: deciding if your girlfriend is cheating on you, perhaps, or whether a business partner is trustworthy. But for mathematical issues, or assessing causal relationships, intuitions are often completely wrong, because they rely on shortcuts which have arisen as handy ways to solve complex cognitive problems rapidly, but at a cost of inaccuracies, misfires and oversensitivity.

  It’s not safe to let our intuitions and prejudices run unchecked and unexamined: it’s in our interest to challenge these flaws in intuitive reasoning wherever we can, and the methods of science and statistics grew up specifically in opposition to these flaws. Their thoughtful application is our best weapon against these pitfalls, and the challenge, perhaps, is to work out which tools to use where. Because trying to be ‘scientific’ about your relationship with your partner is as stupid as following your intuitions about causality.

  Now let’s see how journalists deal with stats.

  14 Bad Stats

  Now that you appreciate the value of statistics—the benefits and risks of intuition—we can look at how these numbers and calculations are repeatedly misused and misunderstood. Our first examples will come from the world of journalism, but the true horror is that journalists are not the only ones to make basic errors of reasoning.

  Numbers, as we will see, can ruin lives.

  The biggest statistic

  Newspapers like big numbers and eye-catching headlines. They need miracle cures and hidden scares, and small percentage shifts in risk will never be enough for them to sell readers to advertisers (because that is the business model). To this end they pick the single most melodramatic and misleading way of describing any statistical increase in risk, which is called the ‘relative risk increase’.

  Let’s say the risk of having a heart attack in your fifties is 50 per cent higher if you have high cholesterol. That sounds pretty bad. Let’s say the extra risk of having a heart attack if you have high cholesterol is only 2 per cent. That sounds OK to me. But they’re the same (hypothetical) figures. Let’s try this. Out of a hundred men in their fifties with normal cholesterol, four will be expected to have a heart attack; whereas out of a hundred men with high cholesterol, six will be expected to have a heart attack. That’s two extra heart attacks per hundred. Those are called ‘natural frequencies’.

  Natural frequencies are readily understandable, because instead of using probabilities, or percentages, or anything even slightly technical or difficult, they use concrete numbers, just like the ones you use every day to check if you’ve lost a kid on a coach trip, or got the right change in a shop. Lots of people have argued that we evolved to reason and do maths with concrete numbers like these, and not with probabilities, so we find them more intuitive. Simple numbers are simple.

  The other methods of describing the increase have names too. From our example above, with high cholesterol, you could have a 50 per cent increase in risk (the ‘relative risk increase’); or a 2 per cent increase in risk (the ‘absolute risk increase’); or, let me ram it home, the easy one, the informative one, an extra two heart attacks for every hundred men, the natural frequency.
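  To make the three descriptions concrete, here is a minimal sketch in Python using the hypothetical figures above (four heart attacks per hundred men with normal cholesterol, six per hundred with high cholesterol):

```python
# Hypothetical figures from the text: out of a hundred men in their fifties,
# four with normal cholesterol and six with high cholesterol have a heart attack.
baseline_risk = 4 / 100
high_chol_risk = 6 / 100

relative_risk_increase = (high_chol_risk - baseline_risk) / baseline_risk  # 0.50
absolute_risk_increase = high_chol_risk - baseline_risk                    # 0.02
extra_per_hundred = (high_chol_risk - baseline_risk) * 100                 # 2 extra heart attacks

print(f"Relative risk increase: {relative_risk_increase:.0%}")  # 50%
print(f"Absolute risk increase: {absolute_risk_increase:.0%}")  # 2%
print(f"Natural frequency: {extra_per_hundred:.0f} extra heart attacks per 100 men")
```

  The same underlying numbers give you the headline-friendly ‘50 per cent’ and the rather calmer ‘two extra per hundred’.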

  As well as being the most comprehensible option, natural frequencies also contain more information than the journalists’ ‘relative risk increase’. Recently, for example, we were told that red meat causes bowel cancer, and ibuprofen increases the risk of heart attacks: but if you followed the news reports, you would be no wiser. Try this, on bowel cancer, from the Today programme on Radio 4: ‘A bigger risk meaning what, Professor Bingham?’ ‘A third higher risk’. ‘That sounds an awful lot, a third higher risk; what are we talking about in terms of numbers here?’ ‘A difference…of around about twenty people per year’. ‘So it’s still a small number?’ ‘Umm…per 10,000…’

  These things are hard to communicate if you step outside of the simplest format. Professor Sheila Bingham is Director of the MRC Centre for Nutrition in Cancer Epidemiology Prevention and Survival at the University of Cambridge, and deals with these numbers for a living, but in this (entirely forgivable) fumbling on a live radio show she is not alone: there are studies of doctors, and commissioning committees for local health authorities, and members of the legal profession, which show that people who interpret and manage risk for a living often have huge difficulties expressing what they mean on the spot. They are also much more likely to make the right decision when information about risk is presented as natural frequencies, rather than as probabilities or percentages.

  For painkillers and heart attacks, another front-page story, the desperate urge to choose the biggest possible number led to the figures being completely inaccurate, in many newspapers. The reports were based on a study that had observed participants over four years, and the results suggested, using natural frequencies, that you would expect one extra heart attack for every 1,005 people taking ibuprofen. Or as the Daily Mail, in an article titled ‘How Pills for Your Headache Could Kill’, reported: ‘British research revealed that patients taking ibuprofen to treat arthritis face a 24 per cent increased risk of suffering a heart attack.’ Feel the fear.

  Almost everyone reported the relative risk increases: diclofenac increases the risk of heart attack by 55 per cent, ibuprofen by 24 per cent. Only the Daily Telegraph and the Evening Standard reported the natural frequencies: one extra heart attack in 1,005 people on ibuprofen. The Mirror, meanwhile, tried and failed, reporting that one in 1,005 people on ibuprofen ‘will suffer heart failure over the following year’. No. It’s heart attack, not heart failure, and it’s one extra person in 1,005, on top of the heart attacks you’d get anyway. Several other papers repeated the same mistake.

  Often it’s the fault of the press releases, and academics can themselves be as guilty as the rest when it comes to overdramatising their research (there are excellent best-practice guidelines from the Royal Society on communicating research, if you are interested). But if anyone in a position of power is reading this, here is the information I would like from a newspaper, to help me make decisions about my health, when reporting on a risk: I want to know who you’re talking about (e.g. men in their fifties); I want to know what the baseline risk is (e.g. four men out of a hundred will have a heart attack over ten years); and I want to know what the increase in risk is, as a natural frequency (two extra men out of that hundred will have a heart attack over ten years). I also want to know exactly what’s causing that increase in risk—an occasional headache pill or a daily tub full of pain-relieving medication for arthritis. Then I will consider reading your newspapers again, instead of blogs which are written by people who understand research, and which link reliably back to the original academic paper, so that I can double-check their precis when I wish.

  Over a hundred years ago, H.G. Wells said that statistical thinking would one day be as important as the ability to read and write in a modern technological society. I disagree; probabilistic reasoning is difficult for everyone, but everyone understands normal numbers. This is why ‘natural frequencies’ are the only sensible way to communicate risk.

  Choosing your figures

  Sometimes the misrepresentation of figures goes so far beyond reality that you can only assume mendacity. Often these situations seem to involve morality: drugs, abortion and the rest. With very careful selection of numbers, in what some might consider to be a cynical and immoral manipulation of the facts for personal gain, you can sometimes make figures say anything you want.

  The Independent was in favour of legalising cannabis for many years, but in March 2007 it decided to change its stance. One option would have been simply to explain this as a change of heart, or a reconsideration of the moral issues. Instead it was decorated with science—as cowardly zealots have done from eugenics through to prohibition—and justified with a fictitious change in the facts. ‘Cannabis—An Apology’ was the headline for their front-page splash.
  In 1997, this newspaper launched a campaign to decriminalise the drug. If only we had known then what we can reveal today…Record numbers of teenagers are requiring drug treatment as a result of smoking skunk, the highly potent cannabis strain that is 25 times stronger than resin sold a decade ago.

  Twice in this story we are told that cannabis is twenty-five times stronger than it was a decade ago. For the paper’s former editor Rosie Boycott, in her melodramatic recantation, skunk was ‘thirty times stronger’. In one inside feature the strength issue was briefly downgraded to a ‘can be’. The paper even referenced its figures: ‘The Forensic Science Service says that in the early Nineties cannabis would contain around 1 per cent tetrahydro-cannabidinol (THC), the mind-altering compound, but can now have up to 25 per cent.’

  This is all sheer fantasy.

  I’ve got the Forensic Science Service data right here in front of me, and the earlier data from the Laboratory of the Government Chemist, the United Nations Drug Control Program, and the European Union’s Monitoring Centre for Drugs and Drug Addiction. I’m going to share it with you, because I happen to think that people are very well able to make their own minds up about important social and moral issues when given the facts.

  The data from the Laboratory of the Government Chemist goes from 1975 to 1989. Cannabis resin pootles around between 6 per cent and 10 per cent THC, herbal between 4 per cent and 6 per cent. There is no clear trend.

  [Figure: Mean potency (% THC) of cannabis products examined in the UK (Laboratory of the Government Chemist, 1975-89)]

  The Forensic Science Service data then takes over to produce the more modern figures, showing not much change in resin, and domestically produced indoor herbal cannabis doubling in potency from 6 per cent to around 12 or 14 per cent. (2003-05 data in table under references).

  The rising trend of cannabis potency is gradual, fairly unspectacular, and driven largely by the increased availability of domestic, intensively grown indoor herbal cannabis.

  [Figure: Mean potency (% THC) of cannabis products examined in the UK (Forensic Science Service, 1995-2002)]

  [Figure: Mean THC content of cannabis products seized in the UK (Forensic Science Service, 1995-2002)]

  ‘Twenty-five times stronger’, remember. Repeatedly, and on the front page.

  If you were in the mood to quibble with the Independent’s moral and political reasoning, as well as its evident and shameless venality, you could argue that intensive indoor cultivation of a plant which grows perfectly well outdoors is the cannabis industry’s reaction to the product’s illegality itself. It is dangerous to import cannabis in large amounts. It is dangerous to be caught growing a field of it. So it makes more sense to grow it intensively indoors, using expensive real estate, but producing a more concentrated drug. More concentrated drug products are, after all, a natural consequence of illegality. You can’t buy coca leaves in Peckham, although you can buy crack.

  There is, of course, exceptionally strong cannabis to be found in some parts of the British market today, but then there always has been. To get its scare figure, the Independent can only have compared the worst cannabis from the past with the best cannabis of today. It’s an absurd thing to do, and moreover you could have cooked the books in exactly the same way thirty years ago if you’d wanted: the figures for individual samples are available, and in 1975 the weakest herbal cannabis analysed was 0.2 per cent THC, while in 1978 the strongest herbal cannabis was 12 per cent. By these figures, in just three years herbal cannabis became ‘sixty times stronger’.

  And this scare isn’t even new. In the mid-1980s, during Ronald Reagan’s ‘war on drugs’ and Zammo’s ‘Just say no’ campaign on Grange Hill, American campaigners were claiming that cannabis was fourteen times stronger than in 1970. Which sets you thinking. If it was fourteen times stronger in 1986 than in 1970, and it’s twenty-five times stronger today than at the beginning of the 1990s, does that mean it’s now 350 times stronger than in 1970?

  That’s not even a crystal in a plant pot. It’s impossible. It would require more THC to be present in the plant than the total volume of space taken up by the plant itself. It would require matter to be condensed into super-dense quark-gluon-plasma cannabis. For God’s sake don’t tell the Independent such a thing is possible.

  Cocaine floods the playground

  We are now ready to move on to some more interesting statistical issues, with another story from an emotive area, an article in The Times in March 2006 headed: ‘Cocaine Floods the Playground’. ‘Use of the addictive drug by children doubles in a year,’ said the subheading. Was this true?

  If you read the press release for the government survey on which the story is based, it reports ‘almost no change in patterns of drug use, drinking or smoking since 2000’. But this was a government press release, and journalists are paid to investigate: perhaps the press release was hiding something, to cover up for government failures. The Telegraph also ran the ‘cocaine use doubles’ story, and so did the Mirror. Did the journalists find the news themselves, buried in the report?

  You can download the full document online. It’s a survey of 9,000 children, aged eleven to fifteen, in 305 schools. The three-page summary said, again, that there was no change in prevalence of drug use. If you look at the full report you will find the raw data tables: when asked whether they had used cocaine in the past year, 1 per cent said yes in 2004, and 2 per cent said yes in 2005.

  So the newspapers were right: it doubled? No. Almost all the figures given were 1 per cent or 2 per cent. They’d all been rounded off. Civil servants are very helpful when you ring them up. The actual figures were 1.4 per cent for 2004, and 1.9 per cent for 2005, not 1 per cent and 2 per cent. So cocaine use hadn’t doubled at all. But people were still eager to defend this story: cocaine use, after all, had increased, yes?

  No. What we now have is a relative risk increase of 35.7 per cent, or an absolute risk increase of 0.5 per cent. Using the real numbers, out of 9,000 kids we have about forty-five more saying ‘Yes’ to the question ‘Did you take cocaine in the past year?’
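  Those numbers follow directly from the unrounded figures; a quick sketch of the arithmetic in Python:

```python
# Unrounded survey figures quoted above: 1.4 per cent in 2004, 1.9 per cent
# in 2005, from a survey of roughly 9,000 children.
p_2004 = 0.014
p_2005 = 0.019
n_children = 9000

relative_increase = (p_2005 - p_2004) / p_2004   # ~0.357, the "35.7 per cent"
absolute_increase = p_2005 - p_2004              # 0.005, the "0.5 per cent"
extra_children = absolute_increase * n_children  # ~45 more saying "yes"

print(f"Relative risk increase: {relative_increase:.1%}")
print(f"Absolute risk increase: {absolute_increase:.1%}")
print(f"Extra children answering yes: {extra_children:.0f}")
```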

  Presented with a small increase like this, you have to think: is it statistically significant? I did the maths, and the answer is yes, it is, in that you get a p-value of less than 0.05. What does ‘statistically significant’ mean? It’s just a way of expressing the likelihood that the result you got was attributable merely to chance. Sometimes you might throw ‘heads’ five times in a row, with a completely normal coin, especially if you kept tossing it for long enough. Imagine a jar of 980 blue marbles, and twenty red ones, all mixed up: every now and then—albeit rarely—picking blindfolded, you might pull out three red ones in a row, just by chance. The standard cut-off point for statistical significance is a p-value of 0.05, which is just another way of saying, ‘If I did this experiment a hundred times, I’d expect a spurious positive result on five occasions, just by chance.’

  To go back to our concrete example of the kids in the playground, let’s imagine that there was definitely no difference in cocaine use, but you conducted the same survey a hundred times: you might get a difference like the one we have seen here, just by chance, just because you randomly happened to pick up more of the kids who had taken cocaine this time around. But you would expect this to happen less than five times out of your hundred surveys.
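  For a feel of the kind of calculation involved, here is a rough sketch, assuming a simple two-proportion z-test on the unrounded figures (the test actually used isn’t specified above, so treat this as an illustration rather than the original calculation):

```python
import math

# 1.4% of ~9,000 children in 2004 vs 1.9% of ~9,000 children in 2005.
n1 = n2 = 9000
p1, p2 = 0.014, 0.019

pooled = (p1 * n1 + p2 * n2) / (n1 + n2)                    # pooled proportion
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))   # standard error of the difference
z = (p2 - p1) / se
p_value = math.erfc(abs(z) / math.sqrt(2))                  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")   # p comes out well below 0.05
```

  On these assumptions the p-value comes out at roughly 0.009, below the conventional 0.05 cut-off; but as the next few paragraphs show, that is not the end of the story.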

  So we have a risk increase of 35.7 per cent, which seems at face value to be statistically significant; but it is an isolated figure. To ‘data mine’, taking it out of its real-world context, and saying it is significant, is misleading. The statistical test for significance assumes that every data point is independent, but here the data is ‘clustered’, as statisticians say. They are not data points, they are real children, in 305 schools. They hang out together, they copy each other, they buy drugs from each other, there are crazes, epidemics, group interactions.

  The increase of forty-five kids taking cocaine could have been a massive epidemic of cocaine use in one school, or a few groups of a dozen kids in a few different schools, or mini-epidemics in a handful of schools. Or forty-five kids independently sourcing and consuming cocaine alone without their friends, which seems pretty unlikely to me.

  This immediately makes our increase less statistically significant. The small increase of 0.5 per cent was only significant because it came from a large sample of 9,000 data points—like 9,000 tosses of a coin—and the one thing almost everyone knows about studies like this is that a bigger sample size means the results are probably more significant. But if they’re not independent data points, then you have to treat it, in some respects, like a smaller sample, so the results become less significant. As statisticians would say, you must ‘correct for clustering’. This is done with clever maths which makes everyone’s head hurt. All you need to know is that the reasons why you must ‘correct for clustering’ are transparent, obvious and easy, as we have just seen (in fact, as with many implements, knowing when to use a statistical tool is a different and equally important skill to understanding how it is built). When you correct for clustering, you greatly reduce the significance of the results. Will our increase in cocaine use, already down from ‘doubled’ to ‘35.7 per cent’, even survive?
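  To give a feel for what ‘correcting for clustering’ does, one standard approach uses a ‘design effect’ to shrink the sample down to its effective size; the intraclass correlation below is purely illustrative, not a figure from the survey:

```python
# One standard clustering correction: the "design effect".
# Children in the same school resemble each other, so 9,000 clustered
# responses carry less information than 9,000 independent ones.
n_children = 9000
n_schools = 305
avg_cluster_size = n_children / n_schools   # roughly 30 children per school
icc = 0.02                                  # intraclass correlation: assumed, for illustration only

design_effect = 1 + (avg_cluster_size - 1) * icc
effective_sample_size = n_children / design_effect

print(f"Design effect: {design_effect:.2f}")                                 # ~1.57
print(f"Effective sample size: about {effective_sample_size:.0f} children")  # ~5,700
```

  Even a modest amount of within-school similarity noticeably shrinks the effective sample, and with it the significance of a 0.5 per cent difference.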

  No. Because there is a final problem with this data: there is so much of it to choose from. There are dozens of data points in the report: on solvents, cigarettes, ketamine, cannabis, and so on. It is standard practice in research that we only accept a finding as significant if it has a p-value of 0.05 or less. But as we said, a p-value of 0.05 means that for every hundred comparisons you do, five will be positive by chance alone. From this report you could have done dozens of comparisons, and some of them would indeed have shown increases in usage—but by chance alone, and the cocaine figure could be one of those. If you roll a pair of dice often enough, you will get a double six three times in a row on many occasions. This is why statisticians do a ‘correction for multiple comparisons’, a correction for ‘rolling the dice’ lots of times. This, like correcting for clustering, is particularly brutal on the data, and often reduces the significance of findings dramatically.
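  To put a rough number on the ‘rolling the dice’ problem, here is a sketch of how the chance of at least one spurious ‘significant’ finding grows with the number of comparisons, together with the Bonferroni correction, one standard way of adjusting for it (the count of forty comparisons is illustrative, not taken from the report):

```python
# Many comparisons make spurious "significant" results likely.
alpha = 0.05
comparisons = 40   # illustrative only; the report offers dozens of possible comparisons

# Chance of at least one false positive if every comparison is truly null
chance_of_false_positive = 1 - (1 - alpha) ** comparisons   # ~0.87

# Bonferroni correction: demand a much smaller p-value for each comparison
bonferroni_threshold = alpha / comparisons                  # 0.00125

print(f"Chance of at least one spurious result: {chance_of_false_positive:.0%}")
print(f"Bonferroni-corrected threshold: {bonferroni_threshold:.5f}")
```

  On these illustrative numbers, even the rough p-value of around 0.009 from the earlier sketch would fall well short of the corrected threshold.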

 
