by Ben Goldacre
For rare diseases, you do a ‘retrospective case-control study’: gather lots of cases; get a control group of people who don’t have the rare disease but are otherwise similar; then, finally, see if your cases are more or less likely to report being exposed to mobile phones.
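The analysis at the end of that process usually boils down to an odds ratio from a two-by-two table: the odds of reporting the exposure among the cases, divided by the same odds among the controls. Here is a minimal sketch, with entirely invented counts that come from no real study:

```python
def odds_ratio(exposed_cases: int, unexposed_cases: int,
               exposed_controls: int, unexposed_controls: int) -> float:
    """Odds of exposure among cases, divided by odds of exposure among controls."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Invented counts: 400 of 1,000 cases report heavy phone use,
# against 380 of 1,000 otherwise-similar controls.
print(round(odds_ratio(400, 600, 380, 620), 2))  # about 1.09: odds barely higher in the cases
```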
This sounds fine, but such studies are vulnerable to the frailties of memory. If someone has a tumour on the left of their head, say, and you ask, ‘Which side did you mostly use your phone on ten years ago?’, they might think, God, yes, that’s a good point, and unconsciously be more likely to inaccurately remember ‘the left’. In one study on the relationship between mobile-phone use and brain tumours, ten people with brain cancer (but no controls) reported phone usage figures that worked out overall as more than twelve hours a day. This might reflect people misremembering the distant past.
Then there are other problems, such as time course: it’s possible that mobile phones might cause brain cancer, but through exposure over thirty years, while we’ve only got data for ten or twenty years, because these devices haven’t been in widespread use for long. If this is the case, then the future risk may be unknowable right now (although, to be fair, other exposures that are now known to cause a peak in problems after several decades, such as asbestos, do still have measurable effects even ten years after exposure). And then, of course, phones change over time: twenty years ago the devices had more powerful transmitters, for example. So we might get a false alarm, or false reassurance, by measuring the impact of irrelevant technology.
But lastly, as so often, there’s the issue of a large increase in a small baseline risk. The absolute worst-case scenario, from the Interphone study, is this: it found that phone use overall was associated with fewer tumours, which is odd; but very, very high phone use was associated with a 40 per cent increase in tumours. If everyone used their phones that much – an extreme assumption – and the apparent relationship is a genuine one, then this would still only take you from ten brain tumour cases in 100,000 people to fourteen cases in 100,000 people.
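To spell out that arithmetic (the figures are the ones quoted above; the little function is purely illustrative):

```python
def absolute_cases(baseline_per_100k: float, relative_increase: float) -> float:
    """Turn a relative increase in risk into an absolute case count per 100,000."""
    return baseline_per_100k * (1 + relative_increase)

baseline = 10.0                              # brain tumour cases per 100,000 people
worst_case = absolute_cases(baseline, 0.40)  # the worst-case 40 per cent increase

print(f"{worst_case:.0f} cases per 100,000")                   # 14 per 100,000
print(f"{worst_case - baseline:.0f} extra cases per 100,000")  # 4 per 100,000
```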
That’s what ‘possible’ looks like: the risk itself is much less interesting than the science behind it.
Anecdotes Are Great. If They Really Illustrate the Data
Guardian, 29 July 2011
On Channel 4 News, scientists have found a new treatment for Duchenne muscular dystrophy (DMD). ‘A study in the Lancet today shows a drug injected weekly for three months appears to have reduced the symptoms,’ they say. ‘While it’s not a cure, it does appear to reduce the symptoms.’
Unfortunately, the study shows no such thing. The gene for making a muscle protein called dystrophin is damaged in patients with DMD. The Lancet paper shows that a new treatment led to some restoration of dystrophin production in some children in a small, unblinded study.
That’s not the same as symptoms improving. But Channel 4 reiterates its case, with the mother of two participants in the study. ‘I think for Jack … it maintained his mobility … with Tom, there’s definitely significant changes … more energy, he’s less fatigued.’
Where did these positive anecdotes come from? Disappointingly, they come from the Great Ormond Street Hospital press release (which was tracked down online by evidence-based policy wonk Evan Harris). It summarises the dystrophin results accurately, but then, once more, it presents an anecdotal case study going way further: ‘Our whole family noticed a marked difference in their quality of life and mobility over that period. We feel it helped prolong Jack’s mobility and Tom has been considerably less fatigued.’
There are two issues here. Firstly, anecdotes are a great communication tool, but only when they accurately illustrate the data. The anecdotes here plainly go beyond that. Great Ormond Street denies that this is problematic (though it has changed its press release online). I strongly disagree (and this is not, of course, the first time an academic press release has been suboptimal).
But this story is also a reminder that we should always be cautious with ‘surrogate’ outcomes. The biological change measured was important, and good grounds for optimism, because it shows the treatment is doing what it should do in the body. But things that work in theory do not always work in practice, and while a measurable biological indicator is a hint that something is working, such outcomes can often be misleading.
Examples are easy to find, and from some of the biggest diseases in medicine. The ALLHAT trial was a vast scientific project, comparing various blood-pressure and lipid-lowering drugs against each other. One part compared 9,000 patients on doxazosin against 15,000 on chlorthalidone. Both drugs were known to lower blood pressure, to pretty much the same extent, and so people assumed they would also be fairly similar in their impact on real-world outcomes that matter, like strokes and heart attacks.
But patients on doxazosin turned out to have a higher risk of stroke, and cardiovascular problems, than patients on chlorthalidone – even though both lowered blood pressure – to such an extent that the trial had to be stopped early. Blood pressure, in this case, was not a reliable surrogate outcome for assessing the drug’s benefits on real-world outcomes.
This is not an isolated example. A blood test called HbA1c is often used to monitor progress in diabetes, because it gives an indicator of blood-glucose levels over the preceding few weeks. Many drugs, such as rosiglitazone, have been licensed on the grounds that they reduce your HbA1c level. But this, again, is just a surrogate outcome: what we really care about in diabetes are real-world outcomes like heart attacks and death. And when these were finally measured, it turned out that rosiglitazone – while lowering HbA1c levels very well – also, unfortunately, massively increased your risk of heart attack. (The drug has now been suspended from the market.)
We might all wish otherwise, but blood tests are a mixed bag. Positive improvements on surrogate biological outcomes that can be measured in a laboratory might give us strong hints on whether something works, but the proof, ultimately, is whether we can show an impact on patients’ pain, suffering, disability and death. I hope this new DMD treatment does turn out to be effective: but that’s not an excuse for overclaiming, and even for the most well-established surrogate measures and drugs, laboratory endpoints have often turned out to be very misleading. People writing press releases, and shepherding misleading patient anecdotes into our living rooms, might want to bear that in mind.
Six weeks after the piece above was published, the Great Ormond Street in-house magazine RoundAbout published a response from Professor Andrew Copp, Director of the Institute of Child Health. It seems to me that he has missed the point entirely – both on surrogate outcomes, and on the issue of widely reported anecdotal claims that went way beyond the actual data – but his words are reproduced in full below, so that you can decide for yourself.
The end of July saw the publication in The Lancet of an important paper by Sebahattin Cirak, Francesco Muntoni and colleagues in the Dubowitz Neuromuscular Centre at the ICH/Great Ormond Street Hospital (GOSH). Together with collaborators, the team provides evidence that a new technique called ‘exon skipping’ may be used in future to treat Duchenne muscular dystrophy (DMD).
DMD is a progressive, severely disabling neuromuscular disease that affects one in every 3,500 boys and leads to premature death. The cause is an alteration in the gene for dystrophin, a vital link protein in muscle. Without dystrophin, muscles become inflamed and degenerate, leading to the handicap seen in DMD. Patients with Becker muscular dystrophy also have dystrophin mutations, but their disease is much milder because the dystrophin protein, although shorter than normal, still functions quite well. By introducing short stretches of artificial DNA, it is possible to ‘skip’ over the damaged DNA in DMD cells and cause a shorter but otherwise functional protein to be made.
Previous studies showed this approach to work when the artificial DNA was injected directly into patients’ muscles. The Lancet study asked whether the therapy would also work by intravenous injection, a crucial step as it would not be feasible to inject every muscle individually in clinical practice. Of the 19 boys who took part, seven showed an increase in dystrophin protein in their muscle biopsies. Importantly, there were few adverse effects, suggesting the treatment might be tolerated long term as would be necessary in DMD.
The study has been welcomed with great optimism in many parts of the media. However, I was disappointed to see an article about the research appearing in the ‘Bad Science’ column of the Guardian, written by Ben Goldacre. While he does not criticise the research in the Lancet paper, Goldacre accuses Channel 4 and GOSH of misleading the public about the extent of the advance for patients. The GOSH press release included comments by a parent who felt her two boys had shown improvement in mobility and less fatigue after treatment. Although this was only two sentences within a much longer and scientifically accurate press release, the sentences were seized upon as an opportunity for bad publicity for GOSH.
Goldacre spoils his argument by claiming the improvement in dystrophin protein level to be ‘theoretical’. He says: ‘things that work in theory do not always work in practice’. As a scientifically trained journalist, it is sad to see him confusing theory with an advance that is manifestly practical – the missing protein actually returned in the muscles! Nevertheless, it is a reminder that there are those in the media who like to cast doubt on the work of GOSH, and we must continue to be careful when conveying what we do, in order not to be misunderstood.
The Strange Case of the Magnetic Wine
Guardian, 4 December 2003
What is it about magnets that amazes the pseudoscientists so much? The good magnetic energy of my Magneto-Tex blanket will cure my back pain; but I need a Q-Link pendant to protect me from the bad magnetism created by household devices. Reader Bill Bingham (oddly enough, the guy who used to read the Shipping Forecast) sends in news of the exciting new Wine Magnet: ‘Let your wine “age” several years in only 45 minutes! Place the bottle in the Wine Magnet! The Wine Magnet then creates a strong magnetic field that goes to the heart of your wine and naturally softens the bitter taste of tannins in “young” wines.’
I was previously unaware of the magnetic properties of wine, but this explains why I tend to become aligned with the earth’s magnetic field after drinking more than two bottles. The general theory on wine maturation – and it warms the cockles of my heart to know there are people out there studying this – is that it’s all about the polymerisation of tannins, which could conceivably be accelerated if they were all concentrated in local pockets: although surely not in forty-five minutes.
But this exciting new technology seems to be so potent – or perhaps unpatentable – that it is being flogged by at least half a dozen different companies. Cellarnot, marketing the almost identical ‘Perfect Sommelier’, even has personal testimonies from ‘Susan’ who works for the Pentagon, ‘Maggie, Editor, Vogue’, and a science professor, who did not want to be named but who, after giving a few glasses to some friends, exclaimed, ‘The experiment definitely showed that the TPS is everything that it claims to be.’ He’s no philosopher of science. But perhaps all of these magnetic products will turn out to be interchangeable. Maybe I can even save myself some cash, and wear my MagneForce magnetic insoles (‘increases circulation; reduces foot, leg and back fatigue’) to improve the wine after I’ve drunk it.
And most strangely of all, none of these companies seems to be boasting about having done the one simple study necessary to test their wine magnets. As always, if any of them want advice on how to do the stats on a simple double-blind randomised trial (which could, after all, be done pretty robustly in one evening with fifty people) – and if they can’t find a seventeen-year-old science student to hold their hand – I am at their disposal.
What Is Science? First, Magnetise Your Wine …
Guardian, 3 December 2005
People often ask me (pulls pensively on pipe), ‘What is science?’ And I reply thusly: Science is exactly what we do in this column. We take a claim, and we pull it apart to extract a clear scientific hypothesis, like ‘Homeopathy makes people better faster than placebo,’ or ‘The Chemsol lab correctly identifies MRSA’; then we examine the experimental evidence for that hypothesis; and lastly, if there is no evidence, we devise new experiments. Science.
Back in December 2003, as part of our Bad Science Christmas Gift series, we discovered The Perfect Sommelier, an expensive wine-conditioning device available in all good department stores. In fact there are lots of devices like this for sale, including the ubiquitous Wine Magnet: ‘Let your wine “age” several years in only 45 minutes! Place the bottle in the Wine Magnet! The Wine Magnet then creates a strong magnetic field that goes to the heart of your wine and naturally softens the bitter taste of tannins in “young” wines.’
At the time, I mentioned how easy it would be to devise an experiment to test whether people could tell the difference between magnetised and untreated wine. I also noted how strange it was that none of these devices’ manufacturers seemed to have bothered, since it could be done in an evening with fifty people.
Now Dr James Rubin et al., of the Mobile Phones Research Unit at King’s College London, have published that very study, in the esteemed Journal of Wine Research. They note the dearth of experimental research (quoting, chuffingly, the Bad Science column), and go on: ‘One retailer states, “We challenge you to try it yourself – you won’t believe the difference it can make.”’
Unwise words.
‘A review of Medline, PsychInfo, Cinahl, Embase, Amed and the Web of Science using the search term “wine and magnet” suggested that, as yet, no scientists have taken up this challenge.’
Now, this study was an extremely professional operation. Before starting, they did a power calculation: this is to decide how big your sample size needs to be, to be reasonably sure you don’t miss a true positive finding by not having enough subjects to detect a small difference. Since the manufacturers’ claims are dramatic, this came out at only fifty subjects.
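As a rough sketch of what that kind of calculation looks like, here is a minimal version for a simple two-sided test of whether tasters prefer the treated wine more often than the 50 per cent you would expect by chance. The assumed effect size (70 per cent of tasters preferring it), the normal-approximation formula and the use of Python are my illustrative assumptions, not details taken from the paper:

```python
import math
from scipy.stats import norm

def sample_size_for_proportion(p_alt: float, p_null: float = 0.5,
                               alpha: float = 0.05, power: float = 0.80) -> int:
    """Smallest n for a two-sided, one-sample test of a preference proportion,
    using the standard normal-approximation formula."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # value needed to reach the desired power
    numerator = (z_alpha * math.sqrt(p_null * (1 - p_null))
                 + z_beta * math.sqrt(p_alt * (1 - p_alt)))
    return math.ceil((numerator / (p_alt - p_null)) ** 2)

# A dramatic claim implies a big effect: suppose 70 per cent of tasters
# would prefer the treated wine if the manufacturers were right.
print(sample_size_for_proportion(0.70))  # 47 with these inputs: in the region of fifty subjects
```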
Then they recruited their subjects, using wine. This wine had been magnetised, or not, by a third party, and the experimenters were blind to which wine was which. The subjects were also unaware of whether the wine they were tasting, which cost £2.99 a bottle, was magnetised or not. They received wine A or wine B, and it was a ‘crossover design’ – some people got wine A first, and some people got wine B first, in case the order you got them in affected your palate and your preferences.
There was no statistically significant difference in whether people expressed a preference for the magnetised wine or the non-magnetised wine. To translate back to the language of commercial claims: people couldn’t tell the difference between magnetised and non-magnetised wine. I realise that might not come as a huge surprise to you. But the real action is in the conclusions: ‘Practitioners of unconventional interventions often cite cost as a reason for not carrying out rigorous assessments of the effectiveness of their products. This double-blind randomised cross-over trial cost under £70 to conduct and took one week to design, run and analyse. Its simplicity is shown by the fact that it was run by two sixteen-year-old work experience students (EA and RI).’
‘Unfortunately,’ they continue, ‘our research leaves us no nearer to an understanding of how to improve the quality of cheap wine and more research into this area is now called for as a matter of urgency.’
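For what it is worth, the headline comparison is simple enough to sketch. With entirely hypothetical counts (the paper’s actual figures are not given above), a two-sided binomial test asks whether the number of tasters preferring the ‘magnetised’ wine departs from the 50:50 split you would expect by chance:

```python
from scipy.stats import binomtest

# Hypothetical counts, for illustration only: of fifty tasters expressing a
# preference, twenty-six preferred the 'magnetised' wine, twenty-four the other.
result = binomtest(k=26, n=50, p=0.5, alternative='two-sided')
print(f"p = {result.pvalue:.2f}")  # well above 0.05: no evidence of a preference either way
```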
BAD ACADEMIA
What If Academics Were as Dumb as Quacks with Statistics?
Guardian, 10 September 2011
We all like to laugh at quacks when they misuse basic statistics. But what if academics, en masse, make mistakes that are equally foolish?
This week Sander Nieuwenhuis and colleagues publish a mighty torpedo in the journal Nature Neuroscience. They’ve identified one direct, stark statistical error that is so widespread it appears in about half of all the published papers surveyed from the academic neuroscience research literature.
To understand the scale of this problem, first we have to understand the statistical error they’ve identified. This will take four hundred words. At the end, you will understand an important aspect of statistics better than half the professional university academics currently publishing in the field of neuroscience.
Let’s say you’re working on some nerve cells, measuring the frequency with which they fire. When you drop a particular chemical on them, they seem to fire more slowly. You’ve got some normal mice, and some mutant mice. You want to see if their cells are differently affected by the chemical. So you measure the firing rate before and after applying the chemical, both in the mutant mice, and in the normal mice.
When you drop the chemical on the mutant-mice nerve cells, their firing rate drops by, let’s say, 30 per cent. With the number of mice you have (in your imaginary experiment), this difference is statistically significant, which means it is unlikely to be due to chance. That’s a useful finding which you can maybe publish. When you drop the chemical on the normal-mice nerve cells, there is a bit of a drop in firing rate, but not as much – let’s say the drop is 15 per cent – and this smaller drop doesn’t reach statistical significance.
But here is the catch. You can say that there is a statistically significant effect for your chemical reducing the firing rate in the mutant cells. And you can say there is no such statistically significant effect in the normal cells. But you cannot say that mutant cells and normal cells respond to the chemical differently. To say that, you would have to do a third statistical test, specifically comparing the ‘difference in differences’, the difference between the chemical-induced change in firing rate for the normal cells against the chemical-induced change in the mutant cells.
Now, looking at the figures I’ve given you here (entirely made up, for our made-up experiment), it’s very likely that this ‘difference in differences’ would not be statistically significant, because the responses to the chemical only differ from each other by 15 per cent, and we saw earlier that a drop of 15 per cent on its own wasn’t enough to achieve statistical significance.
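To make the distinction concrete, here is a minimal sketch with invented firing-rate data. The numbers, the group sizes and the use of simple t-tests are all assumptions for illustration, not the analysis from any particular paper:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_ind

rng = np.random.default_rng(0)

# Invented firing rates (spikes per second), before and after the chemical,
# for eight cells from mutant mice and eight from normal mice.
mutant_before = rng.normal(100, 10, size=8)
mutant_after = mutant_before * rng.normal(0.70, 0.10, size=8)   # roughly a 30 per cent drop
normal_before = rng.normal(100, 10, size=8)
normal_after = normal_before * rng.normal(0.85, 0.10, size=8)   # roughly a 15 per cent drop

# The two within-group tests that typically get reported: each asks only
# 'did the firing rate change in this group?'
print("mutant, before vs after: p =", ttest_rel(mutant_before, mutant_after).pvalue)
print("normal, before vs after: p =", ttest_rel(normal_before, normal_after).pvalue)

# The comparison you actually need before claiming the two kinds of cell
# respond differently: the 'difference in differences'.
mutant_change = mutant_after - mutant_before
normal_change = normal_after - normal_before
print("mutant change vs normal change: p =", ttest_ind(mutant_change, normal_change).pvalue)
```

Only that last p-value bears on the claim that mutant and normal cells respond differently to the chemical.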