by Ben Goldacre
But in exactly this situation, academics in neuroscience papers are routinely claiming that they have found a difference in response, in every field imaginable, with all kinds of stimuli and interventions: comparing responses in younger versus older participants; in patients against normal volunteers; in one task against another; between different brain areas; and so on.
How often? Nieuwenhuis and colleagues looked at 513 papers published in five prestigious neuroscience journals over two years. In half the 157 studies where this error could have been made, it was made. They broadened their search to 120 cellular and molecular articles in Nature Neuroscience during 2009 and 2010: they found twenty-five studies committing this statistical fallacy, and not one single paper analysed differences in effect sizes correctly.
These errors are appearing throughout the most prestigious journals in the field of neuroscience. How can we explain that? Analysing data correctly, to identify a ‘difference in differences’, is a little tricksy, so thinking very generously, we might suggest that researchers worry it’s too long-winded for a paper, or too difficult for readers. Alternatively, perhaps less generously, we might decide it’s too tricky for the researchers themselves.
But the darkest thought of all is this: analysing a ‘difference in differences’ properly is much less likely to give you a statistically significant result, and so it’s much less likely to produce the kind of positive finding you need to get your study published, to get a point on your CV, to get claps at conferences, and to get a good feeling in your belly. In all seriousness: I hope this error is only being driven by incompetence.
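To see why the proper analysis is harder to pass, here is a minimal sketch in Python. Every number in it is invented purely for illustration, the names `effect_a`, `effect_b` and `two_sided_p` are hypothetical, and it assumes two independent estimates that are roughly normally distributed. It is not the analysis from any of the papers surveyed.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value for a roughly normally distributed estimate."""
    z = abs(estimate) / std_error
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical results: the change in response seen in two groups,
# each with its own standard error. These figures are made up.
effect_a, se_a = 25.0, 10.0  # group A
effect_b, se_b = 10.0, 10.0  # group B

# The tempting (wrong) approach: test each group separately and compare
# the verdicts.
p_a = two_sided_p(effect_a, se_a)
p_b = two_sided_p(effect_b, se_b)

# The correct approach: test the difference between the two effects.
# For independent estimates, the standard errors combine in quadrature.
diff = effect_a - effect_b
se_diff = sqrt(se_a ** 2 + se_b ** 2)
p_diff = two_sided_p(diff, se_diff)

print(f"group A alone: p = {p_a:.3f}")
print(f"group B alone: p = {p_b:.3f}")
print(f"difference:    p = {p_diff:.3f}")
```

With these made-up figures, group A comes out at roughly p = 0.01 and group B at roughly p = 0.32, so it is tempting to announce that the groups respond differently. But the test that actually addresses that claim gives roughly p = 0.29: no evidence of a difference at all.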
Brain-Imaging Studies Report More Positive Findings Than Their Numbers Can Support. This Is Fishy
Guardian, 13 August 2011
While the authorities are distracted by mass disorder, we can do some statistics. You’ll have seen plenty of news stories telling you that one part of the brain is bigger, or smaller, in people with a particular mental health problem, or even a specific job. These are generally based on real, published scientific research. But how reliable are the studies?
One way of critiquing a piece of research is to read the academic paper itself, in detail, looking for flaws. But that may not be enough if some sources of bias exist outside the paper, in the wider system of science.
By now you’ll be familiar with publication bias: the phenomenon whereby studies with boring negative results are less likely to get written up, and less likely to get published. Normally you can estimate this using a tool such as, say, a funnel plot. The principle behind these is simple: big, expensive landmark studies are harder to brush under the carpet, but small studies can disappear more easily. So essentially you split your studies into ‘big ones’ and ‘small ones’: if the small studies, averaged out together, give a more positive result than the big studies, then maybe some small negative studies have gone missing in action.
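The big-versus-small comparison can be sketched in a few lines of Python. This is a crude illustration only: the study sizes and effect estimates are invented, the function name `small_vs_big` is hypothetical, and a real funnel plot would chart every study individually against its precision rather than just comparing two averages.

```python
from statistics import mean, median

# Hypothetical (sample_size, effect_estimate) pairs, invented for illustration.
studies = [
    (20, 0.90), (25, 0.75), (30, 0.80), (40, 0.60),
    (400, 0.32), (650, 0.28), (900, 0.30), (1200, 0.25),
]


def small_vs_big(studies):
    """Split studies at the median sample size and average each half's effect."""
    cutoff = median(n for n, _ in studies)
    small = [effect for n, effect in studies if n < cutoff]
    big = [effect for n, effect in studies if n >= cutoff]
    return mean(small), mean(big)


small_mean, big_mean = small_vs_big(studies)
gap = small_mean - big_mean

print(f"small studies: {small_mean:.2f}")
print(f"big studies:   {big_mean:.2f}")
print(f"gap:           {gap:.2f}")
```

If the small studies average out noticeably more positive than the big ones, as they do in this invented data, that gap is the warning sign. Note that the split only works when study sizes genuinely vary: if every study were the same size, one half would be empty and there would be nothing to compare.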
Sadly, this doesn’t work for brain-scan studies, because there’s not enough variation in size. So Professor John Ioannidis, a godlike figure in the field of ‘research about research’, took a different approach. He collected a large representative sample of these anatomical studies, counted up how many positive results they got, and how positive those results were, and then compared this to how many similarly positive results you could plausibly have expected to detect, simply from the sizes of the studies.
This can be derived from something called the ‘power calculation’. Everyone knows that bigger is better when collecting data for a piece of research: the more you have, the greater your ability to detect a modest effect. What people often miss is that the size of the sample needed also changes with the size of the effect you’re trying to detect: detecting a true 0.2 per cent difference in the size of the hippocampus between two groups, say, would need more subjects than a study aiming to detect a huge 25 per cent difference.
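Here is a rough sketch of that relationship in Python, using the standard normal-approximation formula for comparing the means of two equal-sized groups. The assumptions are mine, not from any study: that hippocampus size varies between people with a standard deviation of 10 per cent of the average (the hypothetical `SD_PERCENT`), a 5 per cent significance threshold, and 80 per cent power.

```python
from math import ceil
from statistics import NormalDist

# Assumed spread of hippocampus size between people, as a percentage of the
# mean. This figure is hypothetical, chosen only to make the sums work.
SD_PERCENT = 10.0


def subjects_per_group(difference_percent, alpha=0.05, power=0.80):
    """Approximate number of subjects needed in each of two groups."""
    effect_size = difference_percent / SD_PERCENT  # standardised difference
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_tiny_effect = subjects_per_group(0.2)   # a true 0.2 per cent difference
n_huge_effect = subjects_per_group(25.0)  # a huge 25 per cent difference

print(f"0.2 per cent difference: {n_tiny_effect} per group")
print(f"25 per cent difference:  {n_huge_effect} per group")
```

Under these assumptions the huge difference needs only a handful of people per group, while the tiny one needs tens of thousands. The required sample grows with the inverse square of the effect size, which is why small effects are so expensive to detect.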
By working backwards and sideways from these kinds of calculations, Ioannidis was able to determine, from the sizes of effects measured, and from the numbers of people scanned, how many positive findings could plausibly have been expected, and compare that to how many were actually reported. The answer was stark: even being generous, there were twice as many positive findings as you could realistically have expected from the amount of data reported on.
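The spirit of that comparison can be sketched in Python. This is not a reconstruction of Ioannidis's actual analysis: the study sizes, the results and the `plausible_effect` value are all invented for illustration, and the power formula is a simple normal approximation for a two-group comparison.

```python
from math import sqrt
from statistics import NormalDist


def power(n_per_group, effect_size, alpha=0.05):
    """Approximate chance that a two-group study detects a true effect."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


# Hypothetical studies: (subjects per group, reported a positive finding?).
studies = [(15, True), (20, True), (25, False), (30, True), (40, True), (60, True)]

# Assumed true standardised effect size. This is a made-up value.
plausible_effect = 0.5

# Each study's power is its chance of a positive result, so the powers
# add up to the number of positive findings you would expect overall.
expected_positive = sum(power(n, plausible_effect) for n, _ in studies)
observed_positive = sum(1 for _, positive in studies if positive)

print(f"expected positive findings: {expected_positive:.1f}")
print(f"observed positive findings: {observed_positive}")
```

In this invented example, studies of this size should turn up about three positive findings between them, but five were reported. A surplus like that, repeated across a whole literature, is the pattern described above.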
What could explain this? Inadequate blinding is an issue: a fair amount of judgement goes into measuring the size of a brain area on a scan, so wishful nudges can creep in. And boring old publication bias is another: maybe whole negative papers aren’t getting published.
But a final, more interesting explanation is also possible. In these kinds of studies, it’s possible that many brain areas are measured to see if they’re bigger or smaller, and maybe then only the positive findings get reported within each study.
There is one final line of evidence to support this. In studies of depression, for example, thirty-one studies report data on the hippocampus, six on the putamen, and seven on the prefrontal cortex. Maybe, perhaps, more investigators really did focus solely on the hippocampus. But given how easy it is to measure the size of another area – once you’ve recruited and scanned your participants – it’s also possible that people are measuring these other areas, finding no change, and not bothering to report that negative result in their paper alongside the positive ones they’ve found.
There’s only one way to prevent this: researchers would have to publicly pre-register what areas they plan to measure, before they begin, and report all findings. In the absence of that process, the entire field might be distorted by a form of exaggeration that is – we trust – honest and unconscious, but more interestingly, collective and disseminated.
‘None of Your Damn Business’
Guardian, 15 January 2011
Sometimes something will go wrong with an academic paper, and it will need to be retracted: that’s entirely expected. What matters is how academic journals deal with problems when they arise.
In 2004 the Annals of Thoracic Surgery published a study comparing two heart drugs. This week it was retracted. Ivan Oransky and Adam Marcus are two geeks who set up a website called RetractionWatch because it was clear that retractions are often handled badly: they contacted the editor of ATS, Dr L. Henry Edmunds Jr, MD to find out why the paper was retracted. ‘It’s none of your damn business,’ replied Dr Edmunds, before railing against ‘journalists and bloggists’. The retraction notice, he said, was merely there ‘to inform our readers that the article is retracted’. ‘If you get divorced from your wife, the public doesn’t need to know the details.’
ATS’s retraction notice on this paper is equally uninformative and opaque. The paper was retracted ‘following an investigation by the University of Florida, which uncovered instances of repetitious, tabulated data from previously published studies’. Does that mean duplicate publication, two bites of the cherry? Or maybe plagiarism? And if so, of what, by whom? And can we still trust the authors’ numerous other papers?
What’s odd is that this is not uncommon. Academic journals have high expectations of academic authors, with explicit descriptions of every step in an experiment, clear references, peer review, declarations for financial conflicts of interest, and so on, for a good reason: academic journals are there to inform academics about the results of experiments, and to discuss their interpretation. Retractions form an important part of that record.
Here’s one example of why. In October 2010 the Journal of the American Chemical Society retracted a 2009 paper about a new technique for measuring DNA, explaining it was because of ‘inaccurate DNA hybridization detection results caused by application of an incorrect data processing method’. This tells you nothing. When RetractionWatch got in touch with the author, he explained that his team forgot to correct for something in their analysis, which made the technique they were testing appear to be more powerful than it really was; they actually found it’s no better than the process it was proposed to replace.
That’s useful information, much more informative than the paper simply disappearing one morning, and it clearly belongs in the academic journal the original paper appeared in, not in an email to two people from the internet running an ad hoc blog tracking down the stories behind retractions.
This all becomes especially important when you think through how academic papers are used: that JACS paper has now been cited fourteen times, by people who believed it to be true. And we know that news of even the simple fact of a retraction fails to permeate through to consumers of information.
Researcher Stephen Breuning faked huge amounts of trial data on the drug ritalin, and was found guilty of scientific misconduct in 1988 by a US federal judge – which is unusual and extreme in itself – so most of his papers were retracted. A study last year chased up all the references to Breuning’s work from 1989 to 2007, and found over a dozen academic papers still citing his work. Some discussed it as a case of fraud, but around half – in more prominent journals – still cited it as if it was valid, twenty-four years after its retraction.
The role of journals in policing academic misconduct is still unclear, but obviously, explaining the disappearance of a paper you published is a bare minimum. Like publication bias, whereby negative findings are less likely to be published, this is a systemic failure, across all fields, so it has far greater ramifications than any one single, eye-catching academic cock-up or fraud. Unfortunately it’s also a boring corner in the technical world of academia, so nobody has been shamed into fixing it. Eyeballs are an excellent disinfectant: you should read RetractionWatch.
Twelve Monkeys. No … Eight. Wait, Sorry, I Meant Fourteen
Guardian, 23 January 2010
Like many people, you’re possibly afraid to share your views on animal experiments, because you don’t want anyone digging up your grandmother’s grave, or setting fire to your house, or stuff like that. Animal experiments are necessary, they need to be properly regulated, and we have some of the tightest regulation in the world.
But it’s easy to assess whether animals are treated well, or whether an experiment was necessary. In the nerd corner there is another issue: is the research well conducted, and are the results properly communicated? If it’s not, then animals have suffered – whatever you believe that might mean for an animal – partly in vain.
The National Centre for the Replacement, Refinement and Reduction of Animals in Research was set up by the government in 2004. It has published, in the academic journal PLoS One, a systematic survey of the quality of reporting, experimental design and statistical analysis of recently published biomedical research using laboratory animals. These results are not good news.
The study is pretty solid. It describes the strategy they used to search for papers, which is important, because you don’t want to be like a homeopath, and only quote the papers that support your conclusions: you want to have a representative sample of all the literature. And the papers they found covered a huge range of publicly funded research: behavioural and diet studies, drug and chemical testing, immunological experiments, and more.
Some of the flaws they discovered were bizarre. Four per cent of papers didn’t mention how many animals were used in the experiment, anywhere. The researchers looked in detail at forty-eight studies that did say how many were used: not one explained why that particular number of animals had been chosen. Thirty-five per cent of the papers gave one figure for the number of animals used in the methods, and then a different number of animals appeared in the results. That’s pretty disorganised.
They looked at how many studies used basic strategies to reduce bias in their results, like randomisation and blinding. If you’re comparing one intervention against another, for example, and you don’t randomly assign animals to each group, then it’s possible you might unconsciously put the stronger animals in the group getting a potentially beneficial experimental intervention, or vice versa, thus distorting your results.
If you don’t ‘blind’, then you know, as the experimenter, which animals had which intervention. So you might allow that knowledge, even unconsciously, to affect close calls on measurements you take. Or maybe you’ll accept a high blood-pressure reading when you expected it to be high, knowing what you do about your own experiment, but then double-check a high blood-pressure measurement in an animal where you expected it to be low.
Only 12 per cent of the animal studies used randomisation. Only 14 per cent used blinding. And the reporting was often poor. Only 8 per cent gave the raw data, allowing you to go back and do your own analysis. About half the studies left the numbers of animals in each group out of their tables.
I grew up friends with the daughters of Colin Blakemore, a neuroscientist in Oxford who has taken courageous risks over many decades to speak out and defend necessary animal research. My first kiss – not one of those sisters, I should say – was outside a teenage party in a church hall, in front of two Special Branch officers sitting in a car with their lights off.
People who threaten the lives of fifteen-year-old girls, to shut their father up, are beneath contempt. People who fail to damn these threats are similarly contemptible. That’s why it sticks in the throat to say that the reporting and conduct of animal research is often poor; but we have to be better.
Medical Hypotheses Fails the Aids Test
Guardian, 12 September 2009
This week the peer-review system has been in the newspapers, after a survey of scientists suggested it had some problems. This is barely news. Peer review – where articles submitted to an academic journal are reviewed by other scientists from the same field for an opinion on their quality – has always been recognised as problematic. It is time-consuming, it can be open to corruption, and it cannot always prevent fraud, plagiarism or duplicate publication, although in a more obvious case it might. The main problem with peer review is: it’s hard to find anything better.
Here is one example of a failing alternative. This month, after a concerted campaign by academics aggregating around websites such as Aidstruth.org, academic publishers Elsevier have withdrawn two papers from a journal called Medical Hypotheses. This academic journal is a rarity: it does not have peer review; instead, submissions are approved for publication by its one editor.
Articles from Medical Hypotheses have appeared in this column quite a lot. It carried one almost surreally crass paper¹ in which two Italian doctors argued that ‘mongoloid’ really was an appropriate term for people with Down’s syndrome after all, because they share many characteristics with Oriental populations (including: sitting cross-legged, eating small amounts of lots of different types of food with MSG in it, and an enjoyment of handicrafts). You might also remember two pieces discussing the benefits and side effects of masturbation as a treatment for nasal congestion.²
The papers withdrawn this month step into a new domain of foolishness. Both were from the community whose members characterise themselves as ‘Aids dissidents’, and one was co-authored by its figureheads, Peter Duesberg and David Rasnick.
To say that a peer reviewer might have spotted the flaws in their paper – which had already been rejected by the Journal of Aids – is an understatement. My favourite part is the whole page they devote to arguing that there cannot be lots of people dying of Aids in South Africa, because the population of that country has grown over the past few years.
We might expect anyone to spot such poor reasoning – and only two days passed between this paper’s submission and its acceptance – but they also misrepresent landmark papers from the literature on Aids research. Rasnick and Duesberg discuss antiretroviral medications, which have side effects, but which have stopped Aids being a death sentence, and attack the notion that their benefits outweigh the toxicity: ‘Contrary to these claims,’ they say, ‘hundreds of American and British researchers jointly published a collaborative analysis in the Lancet in 2006, concluding that treatment of Aids patients with anti-viral drugs has “not translated into a decrease in mortality”.’
This is a simple, flat, unambiguous misrepresentation of the Lancet paper to which they refer. Antiretroviral medications have repeatedly been shown to save lives in systematic reviews of large numbers of well-conducted randomised controlled trials. The Lancet paper they reference simply surveys the first decade of patients who received highly active antiretroviral therapy (HAART) – modern combinations of multiple antiretroviral medications – to see if things had improved, and they had not. Patients receiving HAART in 2003 did no better than patients receiving HAART in 1995. This doesn’t mean that HAART is no better than placebo. It means outcomes for people on HAART didn’t improve over an eight-year period of their use. This would be obvious to anyone familiar with the papers, but also to anyone who thought to spend the time checking the evidence for an obviously improbable assertion.
What does all this tell us about peer review? The editor of Medical Hypotheses, Bruce Charlton, has repeatedly argued – very reasonably – that the academic world benefits from having journals with different editorial models, that peer review can censor provocative ideas, and that scientists should be free to pontificate in their internal professional literature. But there are blogs where Aids dissidents, or anyone, can pontificate wildly and to their colleagues: from journals we expect a little more.
Twenty academics and others have now written to Medline, requesting that Medical Hypotheses should be removed from its index. Aids denialism in South Africa has been responsible for the unnecessary deaths of an estimated 330,000 people. You can do peer review well, or badly. You can follow the single-editor model well, or foolishly. This article was plainly foolish.