by Ben Goldacre
Data dredging is a dangerous profession. You could—at face value, knowing nothing about how stats works—have said that this government report showed a significant increase of 35.7 per cent in cocaine use. But the stats nerds who compiled it knew about clustering, and Bonferroni’s correction for multiple comparisons. They are not stupid; they do stats for a living.
That, presumably, is why they said quite clearly in their summary, in their press release and in the full report that there was no change from 2004 to 2005. But the journalists did not want to believe this: they tried to re-interpret the data for themselves, they looked under the bonnet, and they thought they’d found the news. The increase went from 0.5 per cent—a figure that might be a gradual trend, but it could equally well be an entirely chance finding—to a front-page story in The Times about cocaine use doubling. You might not trust the press release, but if you don’t know about numbers, then you take a big chance when you delve under the bonnet of a study to find a story.
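The arithmetic behind the stats nerds' caution is easy to simulate. The sketch below is purely illustrative, not the report's actual data: it draws twenty null 'subgroup tests' where nothing is really going on, and shows how often at least one comes up 'significant' at the usual 5 per cent threshold, with and without Bonferroni's correction.

```python
import random

random.seed(0)
alpha, n_tests, n_trials = 0.05, 20, 10_000

def any_hit(threshold):
    # Under the null hypothesis every p-value is uniform on [0, 1],
    # so a "significant" result is simply random() < threshold.
    return any(random.random() < threshold for _ in range(n_tests))

naive = sum(any_hit(alpha) for _ in range(n_trials)) / n_trials
corrected = sum(any_hit(alpha / n_tests) for _ in range(n_trials)) / n_trials
print(f"chance of at least one spurious 'finding' in 20 looks: {naive:.2f}")
print(f"with the Bonferroni-corrected threshold: {corrected:.2f}")
```

With twenty looks at pure noise, you find something 'significant' roughly two times in three; divide the threshold by the number of tests, as Bonferroni suggests, and the false-alarm rate drops back to about 5 per cent.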
OK, back to an easy one
There are also some perfectly simple ways to generate ridiculous statistics, and two common favourites are to select an unusual sample group, and to ask them a stupid question. Let’s say 70 per cent of all women want Prince Charles to be told to stop interfering in public life. Oh, hang on—70 per cent of all women who visit my website want Prince Charles to be told to stop interfering in public life. You can see where we’re going. And of course, in surveys, if they are voluntary, there is something called selection bias: only the people who can be bothered to fill out the survey form will actually have a vote registered.
There was an excellent example of this in the Telegraph in the last days of 2007. ‘Doctors Say No to Abortions in their Surgeries’ was the headline. ‘Family doctors are threatening a revolt against government plans to allow them to perform abortions in their surgeries, the Daily Telegraph can disclose.’ A revolt? ‘Four out of five GPs do not want to carry out terminations even though the idea is being tested in NHS pilot schemes, a survey has revealed.’
Where did these figures come from? A systematic survey of all GPs, with lots of chasing to catch the non-responders? Telephoning them at work? A postal survey, at least? No. It was an online vote on a doctors’ chat site that produced this major news story. Here is the question, and the options given:
‘GPs should carry out abortions in their surgeries’
Strongly agree, agree, don’t know, disagree, strongly disagree.
We should be clear: I myself do not fully understand this question. Is that ‘should’ as in ‘should’? As in ‘ought to’? And in what circumstances? With extra training, time and money? With extra systems in place for adverse outcomes? And remember, this is a website where doctors—bless them—go to moan. Are they just saying no because they’re grumbling about more work and low morale?
More than that, what exactly does ‘abortion’ mean here? Looking at the comments in the chat forum, I can tell you that plenty of the doctors seemed to think it was about surgical abortions, not the relatively safe oral pill for termination of pregnancy. Doctors aren’t that bright, you see. Here are some quotes:
This is a preposterous idea. How can GPs ever carry out abortions in their own surgeries. What if there was a major complication like uterine and bowel perforation?
GP surgeries are the places par excellence where infective disorders present. The idea of undertaking there any sort of sterile procedure involving an abdominal organ is anathema.
The only way it would or rather should happen is if GP practices have a surgical day care facility as part of their premises which is staffed by appropriately trained staff, i.e. theatre staff, anaesthetist and gynaecologist…any surgical operation is not without its risks, and presumably [we] will undergo gynaecological surgical training in order to perform.
What are we all going on about? Let’s all carry out abortions in our surgeries, living rooms, kitchens, garages, corner shops, you know, just like in the old days.
And here’s my favourite:
I think that the question is poorly worded and I hope that [the doctors’ website] do not release the results of this poll to the Daily Telegraph.
Beating you up
It would be wrong to assume that the kinds of oversights we’ve covered so far are limited to the lower echelons of society, like doctors and journalists. Some of the most sobering examples come from the very top.
In 2006, after a major government report, the media reported that one murder a week is committed by someone with psychiatric problems. Psychiatrists should do better, the newspapers told us, and prevent more of these murders. All of us would agree, I’m sure, with any sensible measure to improve risk management and reduce violence, and it’s always timely to have a public debate about the ethics of detaining psychiatric patients (although in the name of fairness I’d like to see preventive detention discussed for all other potentially risky groups too—like alcoholics, the repeatedly violent, people who have abused staff in the job centre, and so on).

But to engage in this discussion, you need to understand the maths of predicting very rare events. Let’s take a very concrete example, and look at the HIV test. What features of any diagnostic procedure do we measure in order to judge how useful it might be? Statisticians would say the blood test for HIV has a very high ‘sensitivity’, at 0.999. That means that if you do have the virus, there is a 99.9 per cent chance that the blood test will be positive. They would also say the test has a high ‘specificity’ of 0.9999—so, if you are not infected, there is a 99.99 per cent chance that the test will be negative. What a smashing blood test.*
* The figures here are ballpark, from Gerd Gigerenzer’s excellent book Reckoning with Risk.
But if you look at it from the perspective of the person being tested, the maths gets slightly counterintuitive. Because weirdly, the meaning, the predictive value, of an individual’s positive or negative test is changed in different situations, depending on the background rarity of the event that the test is trying to detect. The rarer the event in your population, the worse your test becomes, even though it is the same test.
This is easier to understand with concrete figures. Let’s say the HIV infection rate among high-risk men in a particular area is 1.5 per cent. We use our excellent blood test on 10,000 of these men, and we can expect 151 positive blood results overall: 150 will be our truly HIV-positive men, who will get true positive blood tests; and one will be the single false positive we could expect from the 9,850 HIV-negative men each being given a test that is wrong one time in 10,000. So, if you get a positive HIV blood test result, in these circumstances your chances of being truly HIV positive are 150 out of 151. It’s a highly predictive test.
Let’s now use the same test where the background HIV infection rate in the population is about one in 10,000. If we test 10,000 people, we can expect two positive blood results overall. One from the person who really is HIV positive; and the one false positive that we could expect, again, from having 10,000 HIV-negative men being tested with a test that is wrong one time in 10,000.
Suddenly, when the background rate of an event is rare, even our previously brilliant blood test becomes a bit rubbish. For the two men with a positive HIV blood test result, in this population where only one in 10,000 has HIV, it’s only 50:50 odds on whether they really are HIV positive.
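The two scenarios boil down to one line of arithmetic, the positive predictive value: of everyone who tests positive, what fraction is truly positive? A minimal sketch in Python, using the ballpark sensitivity and specificity figures from above:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    # Of all positive test results, the fraction that are true positives.
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# High-risk population: 1.5 per cent infected
high_risk = positive_predictive_value(0.999, 0.9999, 0.015)
# General population: one in 10,000 infected
general = positive_predictive_value(0.999, 0.9999, 0.0001)
print(f"high-risk group:  {high_risk:.1%}")  # roughly 150 out of 151
print(f"general population: {general:.1%}")  # roughly 50:50
```

Same test, same sensitivity, same specificity; only the prevalence changes, and the meaning of a positive result changes with it.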
Let’s think about violence. The best predictive tool for psychiatric violence has a ‘sensitivity’ of 0.75, and a ‘specificity’ of 0.75. It’s tougher to be accurate when predicting an event in humans, with human minds and changing human lives. Let’s say 5 per cent of patients seen by a community mental health team will be involved in a violent event in a year. Using the same maths as we did for the HIV tests, your ‘0.75’ predictive tool would be wrong eighty-six times out of a hundred. For serious violence, occurring at 1 per cent a year, with our best ‘0.75’ tool, you inaccurately finger your potential perpetrator ninety-seven times out of a hundred. Will you preventively detain ninety-seven people to prevent three violent events? And will you apply that rule to alcoholics and assorted nasty antisocial types as well?
For murder, the extremely rare crime in question in this report, for which more action was demanded, occurring at one in 10,000 a year among patients with psychosis, the false positive rate is so high that the best predictive test is entirely useless.
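The same positive-predictive-value arithmetic we used for the HIV test reproduces all three figures, using the ‘0.75’ tool and the event rates above:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    # Of all positive predictions, the fraction that are true positives.
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

results = {}
for label, rate in [("any violence", 0.05),
                    ("serious violence", 0.01),
                    ("murder", 0.0001)]:
    ppv = positive_predictive_value(0.75, 0.75, rate)
    results[label] = ppv
    print(f"{label}: wrongly fingered {1 - ppv:.0%} of the time")
```

At 5 per cent prevalence you are wrong about 86 times in a hundred; at 1 per cent, about 97 times in a hundred; and for murder at one in 10,000 a year, almost every single person flagged is a false positive.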
This is not a counsel of despair. There are things that can be done, and you can always try to reduce the number of actual stark cock-ups, although it’s difficult to know what proportion of the ‘one murder a week’ represents a clear failure of a system, since when you look back in history, through the retrospecto-scope, anything that happens will look as if it was inexorably leading up to your one bad event. I’m just giving you the maths on rare events. What you do with it is a matter for you.
Locking you up
In 1999 solicitor Sally Clark was put on trial for murdering her two babies. Most people are aware that there was a statistical error in the prosecution case, but few know the true story, or the phenomenal extent of the statistical ignorance that went on in the case.
At her trial, Professor Sir Roy Meadow, an expert in parents who harm their children, was called to give expert evidence. Meadow famously quoted ‘one in seventy-three million’ as the chance of two children in the same family dying of Sudden Infant Death Syndrome (SIDS).
This was a very problematic piece of evidence for two very distinct reasons: one is easy to understand, the other is an absolute mindbender. Because you have the concentration span to follow the next two pages, you will come out smarter than Professor Sir Roy, the judge in the Sally Clark case, her defence teams, the appeal court judges, and almost all the journalists and legal commentators reporting on the case. We’ll do the easy reason first.
The ecological fallacy
The figure of ‘one in seventy-three million’ itself is iffy, as everyone now accepts. It was calculated by squaring the single-SIDS figure of one in 8,543—that is, as 8,543 x 8,543—as if the chances of two SIDS episodes in this one family were independent of each other. This feels wrong from the outset, and anyone can see why: there might be environmental or genetic factors at play, both of which would be shared by the two babies. But forget how pleased you are with yourself for understanding that fact. Even if we accept that two SIDS in one family is much more likely than one in seventy-three million—say, one in 10,000—any such figure is still of dubious relevance, as we will now see.
The prosecutor’s fallacy
The real question in this case is: what do we do with this spurious number? Many press reports at the time stated that one in seventy-three million was the likelihood that the deaths of Sally Clark’s two children were accidental: that is, the likelihood that she was innocent. Many in the court process seemed to share this view, and the factoid certainly sticks in the mind. But this is an example of a well-known and well-documented piece of flawed reasoning known as ‘the prosecutor’s fallacy’.
Two babies in one family have died. This in itself is very rare. Once this rare event has occurred, the jury needs to weigh up two competing explanations for the babies’ deaths: double SIDS or double murder. Under normal circumstances—before any babies have died—double SIDS is very unlikely, and so is double murder. But now that the rare event of two babies dying in one family has occurred, the two explanations—double murder or double SIDS—are suddenly both very likely. If we really wanted to play statistics, we would need to know which is relatively more rare, double SIDS or double murder. People have tried to calculate the relative risks of these two events, and one paper says it comes out at around 2:1 in favour of double SIDS.
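To see how the posterior reasoning works, here is a sketch with purely hypothetical rates, invented only to illustrate the shape of the calculation; they are not the actual figures from the case or from the paper mentioned above.

```python
# Hypothetical prior probabilities, chosen for illustration only.
p_double_sids = 1 / 500_000      # chance a given family suffers double SIDS
p_double_murder = 1 / 1_000_000  # chance a given family suffers double murder

# Two babies HAVE died, so the relevant question is not how rare either
# explanation is in absolute terms, but their odds relative to each other.
odds_sids_vs_murder = p_double_sids / p_double_murder
print(f"odds in favour of double SIDS: {odds_sids_vs_murder:.0f}:1")
```

Both explanations are astronomically unlikely before any babies have died; once the deaths have happened, only the ratio between them matters, which is why quoting the absolute rarity of double SIDS alone is so misleading.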
Not only was this crucial nuance of the prosecutor’s fallacy missed at the time—by everyone in the court—it was also clearly missed in the appeal, at which the judges suggested that instead of ‘one in seventy-three million’, Meadow should have said ‘very rare’. They recognised the flaws in its calculation, the ecological fallacy, the easy problem above, but they still accepted his number as establishing ‘a very broad point, namely the rarity of double SIDS’.
That, as you now understand, was entirely wrongheaded: the rarity of double SIDS is irrelevant, because double murder is rare too. An entire court process failed to spot the nuance of how the figure should be used. Twice.
Meadow was foolish, and has been vilified (some might say this process was exacerbated by the witch-hunt against paediatricians who work on child abuse), but if it is true that he should have spotted and anticipated the problems in the interpretation of his number, then so should the rest of the people involved in the case: a paediatrician has no more unique responsibility to be numerate than a lawyer, a judge, journalist, jury member or clerk. The prosecutor’s fallacy is also highly relevant in DNA evidence, for example, where interpretation frequently turns on complex mathematical and contextual issues. Anyone who is going to trade in numbers, and use them, and think with them, and persuade with them, let alone lock people up with them, also has a responsibility to understand them. All you’ve done is read a popular science book on them, and already you can see it’s hardly rocket science.
Losing the lottery
You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing…
Richard Feynman
It is possible to be very unlucky indeed. A nurse called Lucia de Berk has been in prison for six years in Holland, convicted of seven counts of murder and three of attempted murder. An unusually large number of people died when she was on shift, and that, essentially, along with some very weak circumstantial evidence, is the substance of the case against her. She has never confessed, she has continued to protest her innocence, and her trial has generated a small collection of theoretical papers in the statistics literature.
The judgement was largely based on a figure of ‘one in 342 million against’. Even if we found errors in this figure—and believe me, we will—as in our previous story, the figure itself would still be largely irrelevant. Because, as we have already seen repeatedly, the interesting thing about statistics is not the tricky maths, but what the numbers mean.
There is also an important lesson here from which we could all benefit: unlikely things do happen. Somebody wins the lottery every week; children are struck by lightning. It’s only weird and startling when something very, very specific and unlikely happens if you have specifically predicted it beforehand.*
* The magician and pseudoscience debunker James Randi used to wake up every morning and write on a card in his pocket: ‘I, James Randi, will die today’, followed by the date and his signature. Just in case, he has recently explained, he really did, by some completely unpredictable accident.
Here is an analogy.
Imagine I am standing near a large wooden barn with an enormous machine gun. I place a blindfold over my eyes and—laughing maniacally—I fire off many thousands and thousands of bullets into the side of the barn. I then drop the gun, walk over to the wall, examine it closely for some time, all over, pacing up and down. I find one spot where there are three bullet holes close to each other, then draw a target around them, announcing proudly that I am an excellent marksman.
You would, I think, disagree with both my methods and my conclusions for that deduction. But this is exactly what has happened in Lucia’s case: the prosecutors found seven deaths, on one nurse’s shifts, in one hospital, in one city, in one country, in the world, and then drew a target around them.
This breaks a cardinal rule of any research involving statistics: you cannot find your hypothesis in your results. Before you go to your data with your statistical tool, you have to have a specific hypothesis to test. If your hypothesis comes from analysing the data, then there is no sense in analysing the same data again to confirm it.
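You can watch the sharpshooter effect happen in a toy simulation. The numbers below are invented for illustration: scatter a fixed number of deaths at random across many innocent nurses, with no killer among them, then go hunting for the 'unluckiest' nurse.

```python
import random

random.seed(1)
n_nurses, deaths = 30, 100  # hypothetical ward over several years

# Assign each death to a random nurse: nobody here is a murderer.
counts = [0] * n_nurses
for _ in range(deaths):
    counts[random.randrange(n_nurses)] += 1

expected = deaths / n_nurses
print(f"expected deaths per nurse: {expected:.1f}")
print(f"the 'unluckiest' nurse saw {max(counts)} deaths, by chance alone")
```

Someone always comes out worst, and if you only start counting after you have singled that person out, their tally will always look damning.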
This is a rather complex, philosophical, mathematical form of circularity: but there were also very concrete forms of circular reasoning in the case. To collect more data, the investigators went back to the wards to see if they could find more suspicious deaths. But all the people who were asked to remember ‘suspicious incidents’ knew that they were being asked because Lucia might be a serial killer. There was a high risk that ‘an incident was suspicious’ became synonymous with ‘Lucia was present’. Some sudden deaths when Lucia was not present would not be listed in the calculations, by definition: they are in no way suspicious, because Lucia was not present.
It gets worse. ‘We were asked to make a list of incidents that happened during or shortly after Lucia’s shifts,’ said one hospital employee. In this manner more patterns were unearthed, and so it became even more likely that investigators would find more suspicious deaths on Lucia’s shifts. Meanwhile, Lucia waited in prison for her trial.
This is the stuff of nightmares.
At the same time, a huge amount of corollary statistical information was almost completely ignored. In the three years before Lucia worked on the ward in question, there were seven deaths. In the three years that she did work on the ward, there were six deaths. Here’s a thought: it seems odd that the death rate should go down on a ward at the precise moment that a serial killer—on a killing spree—arrives. If Lucia killed them all, then there must have been no natural deaths on that ward at all in the whole of the three years that she worked there.