There are other detection methods. The human brain is a fairly bad random-number generator, for example, and simple frauds have often been uncovered by forensic statisticians looking at last-digit frequency: if you’re pencilling numbers into a column at random, you might have a slight unconscious preference for the number seven. To avoid this you might use a random-number generator, but here you would run into the odd problem of telltale uniformity in your randomness. The German physicist Jan Hendrik Schön co-authored roughly one paper every week in 2001, but his results were too accurate. Eventually someone noticed that two studies had the same amount of ‘noise’ superimposed on a perfect prototype result; it turned out that many of his figures had been generated by computer, using the very equations they were supposed to be checking, with supposedly realistic random variation built into the model.
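To see roughly how a last-digit check works, here is a minimal sketch in Python, using entirely made-up numbers and an off-the-shelf chi-square test from scipy: it is an illustration of the general technique, not the analysis from any particular fraud case.

```python
# Illustrative only: fabricated data columns tend to over-use certain digits,
# while genuinely random measurements should have last digits spread
# roughly evenly across 0-9.
from collections import Counter
from scipy.stats import chisquare

def last_digit_test(values):
    """Chi-square test of whether the last digits of `values` look uniform."""
    digits = [abs(int(v)) % 10 for v in values]
    counts = Counter(digits)
    observed = [counts.get(d, 0) for d in range(10)]
    expected = [len(values) / 10] * 10
    return chisquare(observed, expected)

# Hypothetical 'measurements' from someone with a weakness for the number seven:
suspicious = [137, 247, 307, 417, 523, 647, 787, 812, 947, 1077] * 20
print(last_digit_test(suspicious))  # tiny p-value: these last digits are not uniform
```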
There are all sorts of things we should be doing to catch outright fraud: better investigations, better routine monitoring, better communication from journal editors on suspicions about papers they reject, better protection of whistleblowers, random spot checks of primary data by journals, and so on. People talk about them, but they seldom do them, because responsibility for the problem is diffuse and unclear.
So, fraud: it happens, it’s not clever, it’s just criminal, and is perpetrated by bad people. But its total contribution to error in the medical literature is marginal when compared to the routine, sophisticated and – more than anything – plausibly deniable everyday methodological distortions which fill this book. Despite that fact, outright fraud is almost the only source of distortion that receives regular media coverage, simply because it’s easy to understand. That’s reason enough for me to leave it alone, and move on to the meat.
Test your treatment in freakishly perfect ‘ideal’ patients
As we have seen, patients in trials are often nothing like real patients seen by doctors in everyday clinical practice. Because these ‘ideal’ patients are more likely to get better, they exaggerate the benefits of drugs, and help expensive new medicines appear to be more cost effective than they really are.
In the real world, patients are often complicated: they might have many different medical problems, or take lots of different medicines, which all interfere with each other in unpredictable ways; they might drink more alcohol in a week than is ideal; or have some mild kidney impairment. That’s what real patients are like. But most of the trials we rely on to make real-world decisions study drugs in unrepresentative, freakishly ideal patients, who are often young, with perfect single diagnoses, fewer other health problems, and so on.3
Are the results of trials in these atypical participants really applicable to everyday patients? We know, after all, that different groups of patients respond to drugs in different ways. Trials in an ideal population might exaggerate the benefits of a treatment, for example, or find benefits where there are none. Sometimes, if we’re very unlucky, the balance between risk and benefit can even switch over completely, when we move between different populations. Anti-arrhythmic drugs, for example, were shown to be effective at prolonging life in patients with severe abnormal heart rhythms, but were also widely prescribed for patients after they’d had heart attacks, when they had only mild abnormal heart rhythms. When these drugs were finally trialled in this second population, we found – to everyone’s horror – that they actively increased their risk of dying.4
Doctors and academics often ignore this problem, but when you start to stack up the differences between trial patients and real patients side by side, the scale of the problem is staggering.
One study from 2007 took 179 representative asthma patients from the general population, and looked at how many would have been eligible to participate in a group of asthma treatment trials.5 The answer was 6 per cent on average, and these weren’t any old trials they were being rejected from: they were the trials that form the basis of the international consensus guidelines for treating asthma in GP clinics and hospitals. These guidelines are used around the world, and yet, as this study shows, they are based on trials that would have excluded almost every single real-world patient they’re applied to.
Another study took six hundred patients being treated for depression in an outpatient clinic, and found that on average only a third of them would have been eligible to participate in thirty-nine recently published trials of treatments for depression.6 People often talk about the difficulties in recruiting patients for research: but one study described how 186 patients with depression enquired about participating in two different trials of antidepressants, and more than seven out of every eight had to be turned away as they weren’t eligible.7
To see what this looks like in reality, we can follow one group of patients with a particular medical problem. In 2011 some researchers in Finland took every patient who’d ever had a hip fracture, and worked out if they would have been eligible for the trials that have been done on bisphosphonate drugs, which are in widespread use for preventing fractures.8 They started with 7,411 patients, but 2,134 were excluded straight off, because they were men, and all the trials have been done in women. Are there differences in how men and women respond to drugs? Sometimes, yes. Of the 5,277 patients remaining, 3,596 were excluded because they were the wrong age: patients in the trials had to be between sixty-five and seventy-nine. Then, finally, 609 patients were excluded because they didn’t have osteoporosis. That only leaves 1,072 patients. So the data from the trials on these fracture-preventing drugs are only strictly applicable to about one of every seven patients with a fracture. They might still work in the people who’ve been excluded, but that’s a judgement call you have to make; and even if they do work, the size of the benefit might be different in different people.
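Just to make that arithmetic explicit, here is a tiny sketch in Python reproducing the exclusion cascade as quoted above; the figures are the ones in the text, and the code is purely illustrative, not the Finnish researchers' own analysis.

```python
# The exclusion cascade from the 2011 Finnish study, as quoted in the text.
total = 7411
after_sex = total - 2134      # men excluded: the trials enrolled only women
after_age = after_sex - 3596  # outside the sixty-five to seventy-nine age range
eligible = after_age - 609    # excluded for not having osteoporosis
print(eligible, f"{eligible / total:.0%}")  # 1072, about 14% - roughly one in seven
```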
This problem goes way beyond simply measuring the effectiveness of drugs: it also distorts our estimates of their cost effectiveness (and in an era of escalating costs in health care, we need to worry about value). Here’s one example from the new ‘coxib’ painkiller drugs. These are sold on the basis that they cause fewer gastrointestinal or ‘GI’ bleeds when compared with older, cheaper painkillers, like high-street ibuprofen.
Coxibs really do seem to reduce the risk of GI bleeds, which is good, because such bleeds can be extremely serious. In fact they lessened the risk by about a half in trials, which were conducted – of course – in ideal patients, who were at much higher risk of having a GI bleed. For the people running the trials this made perfect sense: if you want to show that a drug reduces the risk of having a bleed, it will be much easier and cheaper to show that in a population which is having lots of bleeds in the first place (because otherwise, if your outcome is really rare, you’re going to need a huge number of patients in your trial).
But an interesting problem appears if you use these figures, on a change in the rate of GI bleeds in freakishly ideal trial patients, to calculate the cost of preventing a bleed in the real world. NICE estimated this cost at $20,000 per avoided bleed, but the real answer is more like $100,000.9 We can see very easily how NICE got this wrong, by doing the maths on some simple rough round figures, though these are – pleasingly – almost exactly the same as the real ones (we must work in dollars here, by the way, because the analysis exposing this problem was published in a US academic journal).
The trial patients had a high risk of a bleed: over a year, fifty out of 1,000 had one. This was reduced to twenty-five out of 1,000 if they were on a coxib, because a coxib halves your risk of a bleed. A coxib drug costs an extra $500 a year for each patient. So, $500,000 spent on 1,000 patients buys you twenty-five fewer bleeds, and $500,000÷25 means the avoided bleeds cost you $20,000 each.
But if you look at the real patients getting coxibs in the GP records database, you can see that they have a much lower risk of bleeds: over a year, ten out of 1,000 had one. That goes down to five out of 1,000 if they were on a coxib, because a coxib halves your risk of a bleed. So, you still pay $500,000 for 1,000 patients to have a coxib for a year, but it only buys you five fewer bleeds, and $500,000÷5 means that these avoided bleeds now cost you $100,000 each. That is a lot more than $20,000.
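If you want to check that arithmetic, here is a small sketch in Python using the same round figures as above (the $500 extra cost, the halved risk and the two bleed rates are the illustrative numbers from the text, not fresh data):

```python
# Cost per avoided GI bleed, using the rough round figures from the text.
def cost_per_avoided_bleed(bleeds_per_1000, extra_cost_per_patient=500,
                           relative_risk=0.5, patients=1000):
    bleeds_without = bleeds_per_1000 * patients / 1000
    bleeds_with = bleeds_without * relative_risk   # a coxib roughly halves the risk
    avoided = bleeds_without - bleeds_with
    return extra_cost_per_patient * patients / avoided

print(cost_per_avoided_bleed(50))  # ideal trial patients: 20000.0 dollars per avoided bleed
print(cost_per_avoided_bleed(10))  # real-world patients: 100000.0 dollars per avoided bleed
```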
This problem of trial patients being unrepresentative is called ‘external validity’, or ‘generalisability’ (in case you want to read more about it elsewhere). It can make a trial completely irrelevant to real-world populations, yet it is absolutely routine in research, which is conducted on tight budgets, to tight schedules, for fast results, by people who don’t mind if their results are irrelevant to real-world clinical questions. This is a quiet, dismal scandal. There’s no dramatic newspaper headline, and no single killer drug: just a slow and unnecessary pollution of almost the entire evidence base in medicine.
Test your drug against something rubbish
Drugs are often compared with something that’s not very good. We’ve already seen this in companies preferring to test their drugs against a dummy placebo sugar pill that contains no medicine, as this sets the bar very low. But it is also common to see trials where a new drug is compared with a competitor that is known to be pretty useless; or with a good competitor, but at a stupidly low dose, or a stupidly high dose.
One thing that’s likely to make your new treatment look good is testing it against something that doesn’t work very well: this might sound absurd, or even cruel, so we’re lucky that a researcher called Daniel Safer has pulled together a large collection of trials using odd doses specifically to illustrate this problem.10 One study, for example, compares paroxetine against amitriptyline. Paroxetine is one of the newer antidepressants, and it is largely free from side effects like drowsiness. Amitriptyline is a very old drug, known to make people sleepy, so in real clinical practice it’s often best to advise patients to take it only at night, because drowsiness doesn’t matter so much when you’re asleep. But in this trial amitriptyline was given twice a day, morning and night. The patients reported lots of daytime sleepiness on amitriptyline, making paroxetine look much better.
Alternatively, some trials compare the expensive new drug against an older one given at an unusually high dose, which means it has worse side effects by comparison. The world of antipsychotic medication gives an interesting illustration of this, and one that spans across several eras of research.
Schizophrenia is, like cancer, a disease for which treatments are not perfect, and the benefits of intervention must often be weighed against disadvantages. Each person with schizophrenia will have different goals. Some prefer to tolerate a higher risk of relapse because of their very strong desire to avoid side effects at any cost, and might choose a lower dose of medication; others may find that serious relapses damage their lives, costing them their home, job or friendships, and so they might choose to tolerate some side effects, in exchange for the benefits that go alongside them.
This is often a difficult decision, because side effects are common with schizophrenia medication: especially movement disorders (which are a little like the symptoms of Parkinson’s disease) and weight gain. So the goal of drug innovation in this field has been to find tablets which treat the symptoms, but without causing side effects. A couple of decades ago there was a breakthrough: a new group of drugs were brought to market, the ‘atypicals’, which promised just that. A series of trials was set up to compare these new drugs with the old ones.
Safer found six trials comparing new-generation antipsychotic drugs with boring old-fashioned haloperidol – a drug well known to have serious side effects – at 20mg a day. This is not an insanely high dose of haloperidol: it wouldn’t get you immediately struck off, and it doesn’t exceed the maximum dose permitted in the British National Formulary (BNF), the standard reference manual for drug prescription. But it is an unusually high dose to use routinely, and it’s inevitable that patients receiving so much would report lots of side effects.
Interestingly, a decade later, history repeated itself: risperidone was one of the first of this new generation of antipsychotic drugs, so it came off patent first, immediately becoming very cheap, like the older generation of drugs. As a consequence, many drug companies wanted to show that their own expensive new-generation antipsychotic was better than risperidone, which was now suddenly cheap and old-fashioned: and so trials appeared comparing new drugs against risperidone at a dose of 8mg. Again, 8mg isn’t an unimaginably high dose: but it’s still pretty high, and patients on this dose of risperidone will be much more likely to report side effects, making the newer drug look better by comparison.
This – again – is a quiet and diffuse scandal. It doesn’t mean that any of these specific drugs are outright, headline-grabbing killers: just that the evidence, overall, is distorted.
Trials that are too short
Trials are often brief, as we have seen, because companies need to get results as quickly as possible, in order to make their drug look good while it is still in patent, and owned by them. This raises several problems, including ones that we have already reviewed: specifically, people using ‘surrogate outcomes’, like changes in blood tests, instead of ‘real-world outcomes’, like changes in heart attack rates, which take longer to emerge. But brief trials can also distort the benefits of a drug simply by virtue of their brevity, if the short-term effects are different to the long-term ones.
An operation to remove a cancer, for example, has immediate short-term risks – you might die on the table in the operating theatre, or from an infection in the following week – but you hope that this short-term risk is offset by long-term benefits. If you do a trial to compare patients who have the operation with patients who don’t, but only measure outcomes for one week, you might find that those having the operation die sooner than those who don’t. This is because it takes months or years for people to die of the cancer you’re cutting out, so the benefits of that operation take months and years to emerge, whereas the risks, the small number of people who die on the operating table, appear immediately.
The same problem presents itself with drug trials. There might be a sudden, immediate, short-term benefit from a weight-loss drug, for example, which deteriorates over time to nothing. Or there might be short-term benefits and long-term side effects, which only become apparent in longer trials. The weight-loss treatment fen-phen, for example, caused weight loss in the positive short-term trials, but when patients receiving it were observed over longer periods, it turned out that they also developed heart valve defects.11 Benzodiazepine drugs like Valium are very good for alleviating anxiety in the short term, and a trial lasting six weeks would show huge benefits; but over the months and years that follow, their benefits decrease, and patients become addicted. These adverse long-term outcomes would only be captured in a longer trial.
Longer trials are not, however, automatically better: it all depends on the clinical question you are trying to answer, or perhaps trying to avoid. With an expensive cancer drug like Herceptin, for example, you might be interested in whether giving it for short periods is just as effective as giving it for long periods, in order to avoid paying for larger quantities of the drug unnecessarily (and exposing patients to a longer duration of side effects). For this you’d need trials of short courses of treatment, or at the very least trials that reported outcomes over a long period after only a short period of treatment. Roche applied for twelve-month treatment licences with Herceptin, presenting data from twelve-month-long trials. In Finland a trial was done with only a nine-week course of treatment, finding significant benefit, and the New Zealand government decided to approve nine-week treatment. Roche rubbished this brief study, and commissioned new trials for a two-year period of treatment. As you can imagine, if we want to find out whether nine weeks of Herceptin are as good as twelve months of Herceptin, we need to run trials comparing those two treatment regimes directly: funding trials like these is often a challenge.
Trials that stop early
If you stop a trial early, or late, because you were peeking at the results as it went along, you increase the chances of getting a favourable result. This is because you are exploiting the random variation that exists in the data. It is a sophisticated version of the way someone can increase their chances of winning in a coin toss by using this strategy: ‘Damn! OK, best of three…Damn! Best of five?…Damn! OK, best of seven…’
Time and again in this book we have come back to the same principle: if you give yourself multiple chances of finding a positive result, but use statistical tests that assume you only had one go, you hugely increase your chances of getting a misleading false positive. This is the problem with people hiding negative results. But it also creeps into the way people analyse studies which haven’t been hidden.
For example, if you flip a coin for long enough, then fairly soon you’ll get four heads in a row. That’s not the same as saying ‘I’m going to throw four heads in a row right now,’ and then doing so. We know that the time frame you put around some data can allow you to pick out a clump of findings which please you; and we know that this can be a source of mischief.
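To get a feel for how much this peeking inflates the false-positive rate, here is a toy simulation in Python: a made-up trial of a drug that genuinely does nothing, analysed after every fifty patients per group, and stopped the moment the p-value dips below 0.05. The design and the numbers are purely illustrative, and it assumes numpy and scipy are available.

```python
# Toy simulation: repeated interim 'looks' at a trial of an ineffective drug.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def one_trial(max_n=500, look_every=50):
    drug = rng.normal(0, 1, max_n)      # no true difference between the groups
    placebo = rng.normal(0, 1, max_n)
    for n in range(look_every, max_n + 1, look_every):
        _, p = ttest_ind(drug[:n], placebo[:n])
        if p < 0.05:
            return True                 # 'significant' result - stop the trial early
    return False

false_positives = sum(one_trial() for _ in range(2000))
print(false_positives / 2000)  # well above the 5% a single, pre-planned analysis would give
```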