Bad Science


by Ben Goldacre


  The CLASS trial compared a new painkiller called celecoxib against two older pills over a six-month period. The new drug showed fewer gastrointestinal complications, so lots more doctors prescribed it. A year later it emerged that the original intention of the trial had been to follow up for over a year. The trial had shown no benefit for celecoxib over that longer period, but when only the results over six months were included, the drug shone. That became the published paper.

  At this stage we should pause a moment, to recognise that it can sometimes be legitimate to stop a trial early: for example, if there is a massive, jaw-dropping difference in benefit between the two treatment groups; and specifically, a difference so great, so unambiguous and so informative that even when you factor in the risk of side effects, no physician of sound mind would continue to prescribe the losing treatment, and none will, ever again.

  But you have to be very cautious here, and some terribly wrong results have been let through by people generously accepting this notion. For example, a trial of the drug bisoprolol during blood-vessel surgery stopped early, when two patients on the drug had a major cardiac event, while eighteen on placebo did. It seemed that the drug was a massive life-saver, and the treatment recommendations were changed. But when it began to seem that this trial might have overstated the benefits, two larger ones were conducted, which found that bisoprolol actually conferred no benefit.12 The original finding had been incorrect, caused by researchers stopping the trial early after a fluke clump of deaths.

  Peeking at your data during a trial can raise a genuinely troubling ethical question. If you seem to have found evidence of harm for one or other treatment before the end of the study period, should you continue to expose the patients in your trial to what might be a genuine risk, in the interests of getting to the bottom of whether it’s simply a chance finding? Or should you shut up shop and close the trial, potentially allowing that chance finding to pollute the medical literature, misinforming treatment decisions for larger numbers of patients in the future? This is particularly worrying when you consider that after a truncated trial, a larger one often has to be done anyway, exposing more people to risk, just to discover if your finding was an anomaly.

  One way to restrict the harm that can come from early stopping is to set up ‘stopping rules’, specified before the trial begins, and carefully calculated to be extreme enough that they are unlikely to be triggered by the chance variation you’d expect to see, over time, in any trial. Such rules are useful because they restrict the intrusion of human judgement, which can introduce systematic bias.
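  As a rough sketch of why the boundary has to be strict, here is a small simulation with entirely made-up numbers: two identical treatments are compared, a researcher peeks at the accumulating data after every hundred patients per arm, and we count how often a fluke 'benefit' triggers an early stop, first with an everyday p < 0.05 threshold, and then with a strict pre-specified boundary in the spirit of a Haybittle–Peto rule.

```python
# A rough simulation, with entirely made-up numbers: both arms have an identical
# 10% event rate, so any 'benefit' seen at an interim look is a fluke. Peeking
# after every batch of patients with an ordinary p < 0.05 rule stops far too many
# of these null trials early; a strict pre-specified boundary (p < 0.001 at
# interim looks) almost never does.
import random
from statistics import NormalDist

random.seed(42)

def two_sided_p(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(events_a / n_a - events_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

def stops_early(threshold, looks=10, per_look=100, event_rate=0.10):
    """Run one null trial, peeking after every batch; True if it stops early."""
    a = b = n = 0
    for _ in range(looks):
        a += sum(random.random() < event_rate for _ in range(per_look))
        b += sum(random.random() < event_rate for _ in range(per_look))
        n += per_look
        if two_sided_p(a, n, b, n) < threshold:
            return True
    return False

trials = 2000
naive = sum(stops_early(0.05) for _ in range(trials)) / trials
strict = sum(stops_early(0.001) for _ in range(trials)) / trials
print(f"Null trials stopped early, peeking at p < 0.05:   {naive:.1%}")
print(f"Null trials stopped early, strict p < 0.001 rule: {strict:.1%}")
```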

  But whatever we do about early stopping in medicine, it will probably pollute the data. A review from 2010 took around a hundred truncated trials, and four hundred matched trials that ran their natural course to the end: the truncated trials reported much bigger benefits, overstating the usefulness of the treatments they were testing by about a quarter.13 Another recent review found that the number of trials stopped early has doubled since 1990,14 which is probably not good news. At the very least, results from trials that stop early should be regarded with a large dose of scepticism. Particularly since these same systematic reviews show that trials which stop early often don’t properly report their reasons for doing so.

  And all of this, finally, becomes even more concerning when you look at which trials are being truncated early, who they’re run by, and what they’re being used for.

  In 2008, four Italian academics pulled together all the randomised trials on cancer treatments that had been published in the preceding eleven years, and that were stopped early for benefit.15 More than half had been published in the previous three years, suggesting once again that this issue has become more prevalent. Cancer is a fast-moving, high-visibility field in medicine, where time is money and new drugs can make big profits quickly. Eighty-six per cent of the trials that stopped early were being used to support an application to bring a new drug onto the market.

  Trials that stop late

  It would be a mistake to think that any of these issues illustrate transgressions of simple rules that should be followed thoughtlessly: a trial can be stopped too early, in ways that are foolish, but it can also be stopped early for sensible reasons. Similarly, the opposite can happen: sometimes a trial can be prolonged for entirely valid reasons, but sometimes, prolonging a trial – or including the results from a follow-up period after it – can dilute important findings, and make them harder to see.

  Salmeterol is an inhaler drug used to treat asthma and emphysema. What follows16 is – if you can follow the technical details to the end – pretty frightening, so, as always, remember that this is not a self-help book, and it contains no advice whatsoever about whether any one drug is good, or bad, overall. We are looking at flawed methods, and they crop up in trials of all kinds of drugs.

  Salmeterol is a ‘bronchodilator’ drug, which means it works by opening up the airways in your lungs, making it easier for you to breathe. In 1996, occasional reports began to emerge of ‘paradoxical bronchospasm’ with salmeterol, where the opposite would happen, causing patients to become very unwell indeed. Amateur critics often like to dismiss anecdotes as ‘unscientific’, but this is wrong: anecdotes are weaker evidence than trials, but they are not without value, and are often the first sign of a problem.

  Salmeterol’s manufacturer, GSK, wisely decided to investigate these early reports by setting up a randomised trial. This compared patients on salmeterol inhalers against patients with dummy placebo inhalers, which contained no active medicine. The main outcome to be measured was carefully pre-specified as ‘respiratory deaths and life-threatening experiences’, combined together. The secondary outcomes were things like asthma-related deaths (which are a subset of all respiratory deaths), all-cause deaths, and ‘asthma-related deaths or life-threatening experiences’, again bundled up.

  The trial was supposed to recruit 60,000 people, and follow them up intensively for twenty-eight weeks, with researchers seeing them every four weeks to find out about progress and problems. For the six months after this twenty-eight-week period, investigators were asked to report any serious adverse events they knew of – but they weren’t actively seeking them out.

  What happened next is a dismal tale, told in detail in a Lancet paper some years later by Peter Lurie and Sidney Wolfe, working from the FDA documents. In September 2002 the trial’s own monitoring board met, and looked at the 26,000 patients who had been through the trial so far. Judging by the main outcome – ‘respiratory deaths and life-threatening experiences’ – salmeterol was worse than placebo, although the difference wasn’t quite statistically significant. The same was true for ‘asthma-related deaths’. The trial board said to GSK: you can either run another 10,000 patients through to confirm this worrying hint, or terminate the trial, ‘with dissemination of findings as quickly as possible’. GSK went for the latter, and presented this interim analysis at a conference (saying it was ‘inconclusive’). The FDA got worried, and changed the drug’s label to mention ‘a small but significant increase in asthma-related deaths’.

  Here is where it gets interesting. GSK sent its statistics dossier on the trial to the FDA, but the figures it sent weren’t calculated using the method specified in the protocol laid down before the study began, which stipulated that the outcome figures for these adverse events should come from the twenty-eight-week period of the trial, as you’d imagine, when such events were being carefully monitored. Instead, GSK sent the figures for the full twelve-month period: the twenty-eight weeks when the adverse events were closely monitored, and also the six months after the trial finished, when adverse events weren’t being actively sought out, so were less likely to be reported. This means that the high rate of adverse events from the first twenty-eight weeks of the trial was diluted by the later period, and the problem became much less prominent.
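  To see the arithmetic of that dilution, here are some purely illustrative round numbers (not the trial's actual figures): a harm that is clear during the closely monitored period shrinks noticeably once a follow-on period with patchy reporting is tacked on.

```python
# Purely illustrative round numbers, not the trial's actual figures.
n_per_arm = 13_000
monitored = {"drug": 30, "placebo": 20}   # events actively sought during 28 weeks
passive = {"drug": 8, "placebo": 8}       # events merely volunteered in the next 6 months

rr_28_weeks = (monitored["drug"] / n_per_arm) / (monitored["placebo"] / n_per_arm)
rr_12_months = ((monitored["drug"] + passive["drug"]) / n_per_arm) / (
    (monitored["placebo"] + passive["placebo"]) / n_per_arm)

print(f"Relative risk, 28 weeks only:  {rr_28_weeks:.2f}")   # 1.50
print(f"Relative risk, full 12 months: {rr_12_months:.2f}")  # 1.36: the signal is diluted
```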

  If you look at the following table, from the Lancet paper, you can see what a difference that made. Don’t worry if you don’t understand everything,
but here is one easy bit of background, and one hard bit. ‘Relative risk’ describes how much more likely you were to have an event (like death) if you were in the salmeterol group, compared with the placebo group: so a relative risk of 1.31 means you were 31 per cent more likely to have that event (let’s say, ‘death’).

  The numbers in brackets after that, the ‘95 per cent CI’, are the ‘95 per cent confidence interval’. While the single figure of the relative risk is our ‘point estimate’ for the difference in risk between the two groups (salmeterol and placebo), the 95 per cent CI tells us how certain we can be about this finding. Statisticians will be queuing up to torpedo me if I oversimplify the issue, but essentially, if you ran this same experiment, in patients from the same population, a hundred times, then you’d get slightly different results every time, simply through the play of chance. But ninety-five times out of a hundred the true relative risk would lie somewhere between the two extremes of the 95 per cent confidence interval. If you have a better way of explaining that in fifty-four words, my email address is at the back of this book.
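  If you prefer to see those two ideas as a calculation, here is a sketch using invented counts (not the trial's actual figures), with the standard large-sample formula for a relative risk and its 95 per cent confidence interval.

```python
import math

# Invented counts, not the trial's actual figures: events over the monitored
# period in two arms of 13,000 patients each.
events_drug, n_drug = 55, 13_000
events_placebo, n_placebo = 42, 13_000

relative_risk = (events_drug / n_drug) / (events_placebo / n_placebo)

# Standard large-sample ('Katz log') interval: take logs, where the standard
# error of ln(RR) has a simple form, then exponentiate back.
se_log_rr = math.sqrt(1 / events_drug - 1 / n_drug + 1 / events_placebo - 1 / n_placebo)
lower = math.exp(math.log(relative_risk) - 1.96 * se_log_rr)
upper = math.exp(math.log(relative_risk) + 1.96 * se_log_rr)

print(f"Relative risk {relative_risk:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
# Relative risk 1.31, 95% CI 0.88 to 1.96: a 31 per cent increase as the point
# estimate, but an interval wide enough to include 'no difference at all'.
```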

  GSK didn’t tell the FDA which set of results it had handed over. In fact, it was only in 2004, when the FDA specifically asked, that it was told it was the twelve-month data. The FDA wasn’t impressed, though this is expressed in a bland sentence: ‘The Division presumed the data represented [only] the twenty-eight-week period as the twenty-eight-week period is clinically the period of interest.’ It demanded the twenty-eight-week data, and said it was going to base all its labelling information on that. This data, as you can see, painted a much more worrying picture about the drug.

  It took a couple of years from the end of the trial for these results to be published in an academic paper, read by doctors. Similarly, it took a long time for the label on the drug to explain the findings from this study.

  There are two interesting lessons to be learnt from this episode, as Lurie and Wolfe point out. Firstly, it was possible for a company to slow down the news of an adverse finding reaching clinicians and patients, even though the treatment was in widespread use, for a considerable period of time. This is something we have seen before. And secondly, we would never have known about any of this if the activities of the FDA Advisory Committees hadn’t been at least partially open to public scrutiny, because ‘many eyes’ are often necessary to spot hidden flaws in data. Again, this is something we have seen before.

  GSK responded in the Lancet that the twelve-month data was the only data analysed by the trial’s board, which was independent of the company (the trial was run by a CRO).17 It said that it communicated the risks urgently, sent a letter to doctors who’d prescribed salmeterol in January 2003, when the trial was formally stopped, and that a similar notice appeared on the GSK and FDA websites, stating that there was a problem.

  Trials that are too small

  A small trial is fine, if your drug is consistently life-saving in a condition that is consistently fatal. But you need a large trial to detect a small difference between two treatments; and you need a very large trial to be confident that two drugs are equally effective.

  If there’s one thing everybody thinks they know about research, it’s that a bigger number of participants means a better study. That is true, but it’s not the only factor. The benefit of more participants is that it evens out the random variation among them. If you’ve run a tiny trial of an amazing concentration-enhancing drug, with ten people in each group, then if only one person in one group had a big party the night before your concentration test, their performance alone could mess up your findings. If you have lots of participants, this sort of irritating noise evens itself out.

  It’s worth remembering, though, that sometimes a small study can be adequate, as the sample size required for a trial depends on a number of factors. For example, if you have a disease where everyone who gets it dies within a day, and you have a drug that you claim will cure this disease immediately, you won’t need very many participants at all to show that your drug works. If the difference you’re trying to detect between the two treatment groups is very subtle, though, you’ll need many more participants to be able to detect this tiny difference against the natural background of everyday unpredictable variation in health for all the individuals in your study.
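  Here is a crude simulation of that noise, using made-up concentration scores: both groups are drawn from exactly the same population, so any difference between them is pure chance, and with only ten people per group those chance differences are often embarrassingly large.

```python
# A crude simulation with made-up concentration scores: both groups are drawn
# from exactly the same population, so any difference between their averages
# is pure noise, and with ten people per group that noise is often large.
import random
from statistics import mean

random.seed(1)

def one_null_trial(n_per_group):
    """Both groups score around 50 on average (SD 10); return the chance difference."""
    a = [random.gauss(50, 10) for _ in range(n_per_group)]
    b = [random.gauss(50, 10) for _ in range(n_per_group)]
    return mean(a) - mean(b)

for n in (10, 250):
    diffs = [one_null_trial(n) for _ in range(2000)]
    big = sum(abs(d) > 5 for d in diffs) / len(diffs)
    print(f"{n:>3} per group: {big:.0%} of null trials show a 'difference' of 5+ points")
```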

  Sometimes you see a suspiciously large number of small trials being published on a drug, and when this happens it’s reasonable to suspect that they might be marketing devices – a barrage of publications – rather than genuine acts of scientific enquiry. We’ll also see an even more heinous example of marketing techniques in the section on ‘marketing trials’ shortly.

  But there’s a methodologically interesting problem hiding in here too. When you are planning a trial to detect a difference between two groups of patients, on two different treatments, you do something called a ‘power calculation’. This tells you how many patients you will need if you’re to have – say – an 80 per cent chance of detecting a true 20 per cent difference in deaths, given the expected frequency of deaths in your participants. If you complete your trial and find no difference in deaths between the two treatments, that means you cannot find evidence that one is better than the other.

  This is not the same as showing that they are equivalent. If you want to be able to say that two treatments are equivalent, then for dismally complicated technical reasons (I had to draw a line somewhere) you need a much larger number of participants.
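  As a sketch of the kind of power calculation described above, here is the standard normal-approximation formula for comparing two proportions, with invented event rates: detecting a modest difference takes thousands of patients, and pinning down a much smaller difference (the logic behind claims of equivalence) takes tens of thousands.

```python
# Invented event rates, not from any real trial: the standard normal-approximation
# sample-size formula for comparing two proportions.
from statistics import NormalDist

def n_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Patients needed per arm to detect the difference between two event rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)            # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2

# Detecting a 20 per cent relative reduction in deaths (10% down to 8%)...
print(round(n_per_group(0.10, 0.08)))    # roughly 3,200 patients per arm
# ...but ruling out a much smaller difference (10% versus 9.5%), which is the
# logic behind claims of equivalence, needs vastly more people.
print(round(n_per_group(0.10, 0.095)))   # roughly 55,000 patients per arm
```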

  People often forget that. For example, the INSIGHT trial was set up to see if nifedipine was better than co-amilozide for treating high blood pressure. It found no evidence that it was. At the time, the paper said the two drugs had been found to be equivalent. They hadn’t.18 Many academics and doctors enjoyed pointing that out in the letters that followed.

  Trials that measure uninformative outcomes

  Blood tests are easy to measure, and often respond very neatly to a dose of a drug; but patients care more about whether they are suffering, or dead, than they do about the numbers printed on a lab report.

  This is something we have already covered in the previous chapter, but it bears repeating, because it’s impossible to overstate how many gaps have been left in our clinical knowledge through unjustified, blind faith in surrogate outcomes. Trials have been done comparing a statin against placebo, and these have shown that they save lives rather well. Trials have also compared one statin with another: but these all use cholesterol as a surrogate outcome. Nobody has ever compared the statins against each other to measure which is best at preventing death. This is a truly staggering oversight, when you consider that tens of millions of people around the world have taken these drugs, and for many, many years. If just one of them is only 2 per cent better at preventing heart attacks than the others, we are permitting a vast number of avoidable deaths, every day of the week. These tens of millions of patients are being exposed to unnecessary risk, because the drugs they are taking haven’t been appropriately compared with each other; but each one of those patients is capable of producing data that could be used to compile new knowledge about which drug is best, in aggregate, if only it was systematically randomised, and the outcomes followed up. You will hear much more on this when we discuss the need for bigger, simpler trials in the next chapter, because this problem is not academic: lives are lost through our uncritical acceptance of trials that fail to measure real-world outcomes.

  Trials that bundle their outcomes together in odd ways

  Sometimes, the way you package up your outcome data can give misleading results. For example, by setting your thresholds just right, you can turn a modest benefit into an apparently dramatic one. And by bundling up lots of different outcomes, to make one big ‘composite outcome’, you can dilute harms; or allow freak results on uninteresting outcomes to make it look as if a whole group of outcomes are improved.

  Even if you collect entirely legitimate outcome data, the way you pool these outcomes together over the course of a trial can be misleading. There are some simple examples
of this, and then some slightly more complicated ones.

  As a very crude example, many papers (mercifully, mostly, in the past) have used the ‘worst-ever side-effects score’ method.19 This can be very misleading, as it takes the worst side effects a patient has ever scored during a trial, rather than a sum of all their side-effects scores throughout its whole duration. In the graphs below, you can see why this poses such a problem, because the drug on the left is made to look as good as the drug on the right, by using this ‘worst-ever side-effects score’ method, even though the drug on the right is clearly better for side effects.
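  As a rough sketch of the same problem, with invented weekly scores for two hypothetical drugs: one is persistently bad, one has a single bad week, yet the 'worst-ever' summary rates them as identical.

```python
# Invented weekly side-effects scores for two hypothetical drugs: one is
# persistently bad, one has a single bad week, yet the 'worst-ever' summary
# rates them as identical.
drug_a_weekly_scores = [6, 6, 7, 6, 6, 7, 6, 6]   # troublesome nearly every week
drug_b_weekly_scores = [0, 0, 7, 0, 0, 0, 0, 0]   # one bad week, otherwise fine

for name, scores in (("Drug A", drug_a_weekly_scores), ("Drug B", drug_b_weekly_scores)):
    print(f"{name}: worst-ever score {max(scores)}, total burden {sum(scores)}")
# Both report a worst-ever score of 7, but the total burdens are 50 versus 7.
```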

  Another misleading summary can be created by choosing a cut-off for success, and pretending that this indicates a meaningful treatment benefit, where in reality there has been no such thing. For example, a 10 per cent reduction in symptom severity may be defined as success in a trial, even though it still leaves patients profoundly disabled.20 This is particularly misleading if one treatment achieves a dramatic benefit if it works at all, and another a modest benefit if it works at all, but both get over the arbitrary and modest 10 per cent benefit threshold in the same number of patients: suddenly, a very inferior drug has been made to look just as good as the best in its class.
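  Here is that trick with invented numbers: two hypothetical drugs clear an arbitrary 10 per cent 'success' threshold in exactly the same number of patients, even though one delivers far bigger improvements when it works at all.

```python
# Invented improvement figures for two hypothetical drugs: both clear an
# arbitrary 10 per cent 'success' threshold in the same number of patients,
# even though one delivers far bigger improvements when it works at all.
threshold = 10  # per cent reduction in symptom severity counted as 'success'
improvements_drug_a = [60, 55, 70, 12, 11, 0, 5, 0]   # big responses when it works
improvements_drug_b = [12, 11, 13, 12, 11, 0, 5, 0]   # scrapes over the bar

for name, improvements in (("Drug A", improvements_drug_a), ("Drug B", improvements_drug_b)):
    responders = sum(x >= threshold for x in improvements)
    average = sum(improvements) / len(improvements)
    print(f"{name}: {responders}/{len(improvements)} 'successes', mean improvement {average:.0f}%")
# Both score 5/8 'responders', but the mean improvements are 27% versus 8%.
```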

  You can also mix lots of different outcomes together to produce one ‘composite outcome’.21 Often this is legitimate, but sometimes it can overstate benefits. For example, heart attacks are a fairly rare event in life generally, and also in most trials of cardiovascular drugs, which is why such trials often have to be very large, in order to have a chance of detecting a difference in the rate of heart attacks between the two groups. Because of this, it’s fairly common to see ‘important cardiovascular endpoints’ all bundled up together. This ‘composite outcome’ will include death, heart attack and angina (angina, in case you don’t know, is chest pain caused by heart problems: it’s a worry, but not as much of a worry as heart attack and death). A massive improvement in that omnibus score can feel like a huge breakthrough for heart attack and death, until you look closely at the raw data, and see that there were hardly any heart attacks or deaths in the duration of the study at all, and all you’re really seeing is some improvement in angina.
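  And here is the composite-outcome problem with invented counts: the headline composite falls impressively, but almost all of the fall comes from angina.

```python
# Invented counts per 10,000 patients: the composite outcome falls impressively,
# but almost all of the fall comes from angina, not from deaths or heart attacks.
outcomes = {                      # (events on drug, events on placebo)
    "death": (20, 21),
    "heart attack": (30, 31),
    "angina": (150, 250),
}

drug_total = sum(drug for drug, _ in outcomes.values())
placebo_total = sum(placebo for _, placebo in outcomes.values())
print(f"Composite events: drug {drug_total} versus placebo {placebo_total}")
for name, (drug, placebo) in outcomes.items():
    print(f"  {name:>12}: drug {drug} versus placebo {placebo}")
# The composite drops from 302 to 200, a headline one-third reduction, yet
# deaths and heart attacks barely move.
```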

 
