But placebos can please too much: patients get drunk on placebo alcohol, they become more alert (although not more irritable) on placebo coffee. Placebo morphine is a more effective painkiller than placebo Darvon. The same placebo constricts the airways of asthmatic people when described as a constrictor and dilates them when described as a dilator. Red sugar pills stimulate; blue ones depress—brand-name placebos work better than generic. And higher dosages are usually more effective.
This placebo effect is well documented, but that doesn’t make it easy to deal with. If a control-group patient improves, it needn’t be because of the placebo; some simply get better, whether through the natural course of the illness or through reversion to the mean. You need, essentially, a control for your control—another group you are not even trying to please. How do you formulate that? “Would you like to participate in a study where we do nothing for your condition?” Might this cheery message affect the patient’s well-being? We are already weaving a tangled web.
Quantifying the placebo effect can be essential, particularly in subjective areas like pain or depression. A meta-analysis of all the effectiveness trials submitted to the FDA for the six most widely prescribed antidepressants approved for use between 1987 and 1999 found that, on the standard 50-point scale for depression, the drugs, on average, moved the patients’ mood by only two points more than did the placebo. In other words, the placebo showed about 80 percent of the effectiveness of the medication. So, although antidepressants showed a statistically significant effect, it was clinically negligible—if you assume that the effect of medication should be additional to the placebo effect.
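The 80 percent figure follows from the two-point gap only if the two effects are assumed to add. A minimal sketch of that arithmetic, in Python: the function name is hypothetical, and the implied totals (about 10 points for the drug arm, about 8 for the placebo arm) are derived here for illustration rather than quoted from the meta-analysis.

```python
def implied_improvements(gap_points, placebo_share):
    """Back out the total improvements implied by a drug-placebo gap.

    Assumes the additive model:
        drug_total    = placebo_total + gap_points
        placebo_total = placebo_share * drug_total
    """
    drug_total = gap_points / (1.0 - placebo_share)
    placebo_total = placebo_share * drug_total
    return drug_total, placebo_total


# Figures from the text: a 2-point gap, placebo at about 80 percent.
drug, placebo = implied_improvements(gap_points=2.0, placebo_share=0.80)
print(f"drug arm: about {drug:.0f} points; placebo arm: about {placebo:.0f} points")
```

If the effects do not simply add, this back-calculation no longer holds, which is the difficulty the four-way design is meant to address.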
If, however, the effect of medication and placebo are not additive, you would have to design an experiment that isolates them from each other. You cannot use a double-blind experiment, because you will need a four-way distribution of patients: those who are given treatment and are told it is treatment, those who are given a placebo and are told it is treatment, those who are given treatment and are told it is a placebo, those who are given a placebo and are told it is a placebo. This should isolate the effect of the drug from that of the placebo—but, instead of one comparison, you would need to make six, with all the usual problems of significance, sample size, and chance variation. It would be a large and complex trial—but without it there would always be the suspicion that most antidepressants are little more than a very expensive but pharmacologically dubious pink pill.
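The count of six comes from pairing off the four arms. The sketch below lists the pairs and, under assumptions that are not in the text (each comparison tested independently at a 5 percent significance level), estimates how often at least one comparison would look significant by chance alone.

```python
from itertools import combinations, product

# The four arms: what the patient is given, crossed with what the patient is told.
given = ["treatment", "placebo"]
told = ["treatment", "placebo"]
arms = [f"given {g}, told {t}" for g, t in product(given, told)]

# Every pair of arms is one possible comparison.
comparisons = list(combinations(arms, 2))
for first, second in comparisons:
    print(f"{first}  vs  {second}")

# Illustrative assumption: independent tests, each at the 5 percent level.
alpha = 0.05
familywise = 1 - (1 - alpha) ** len(comparisons)
print(f"{len(comparisons)} comparisons; "
      f"chance of at least one false positive: {familywise:.1%}")
```

On those assumptions, roughly one such trial in four would throw up a spurious difference somewhere, which is one reason the sample has to grow with the number of comparisons.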
In the United States today the whole question has become academic, because it is almost impossible to get the required informed consent for a classic randomized placebo trial: nobody who would sign up is willing to take the chance of not getting the newest treatment. Instead you do crossover trials, in which you give one of your randomized groups the treatment, and if it seems to be effective, then the other group gets it, too—which solves the ethical problem of denying treatment, but makes it much harder to show a clear difference in results. “First, do no harm”—but what if that means you can do no good?
In 1766, Jean Astruc, onetime physician to the Regent of France, published a long and learned treatise on the art of midwifery that began with the observation that he had not himself been present at a birth (except, one assumes, his own).
For a long time, such a combination of prudery, presumption, and tradition could conceal the fact that women and men are confounded variables in any calculation of human health. We do overlap in many respects; indeed, as our roles in life converge we can see points where our health histories are also aligning, both for good (fewer men dying of industrial diseases) and for ill (more women dying of lung cancer). Nevertheless, there are significant variations between the sexes, ranging from the obvious reproductive and hormonal differences to types of depression, overall life expectancy, and unequal response to painkillers.
What should experimenters do with confounded variables? Isolate them, if possible. When obvious differences appear in the clinical record, we can design studies to quantify those differences with the same precision as any other comparison. The residual problem, though, is: what about differences that are not obvious? As we have seen, the significance or insignificance of clinical trials can hang on a single percentage point. What if the responses of men and women placed in the same group actually differed from one another? Would we be justified in taking the mean of our results? You can hardly say that a bucket of boiling water in a tub full of ice is the equivalent of a lukewarm bath.
It’s not only an issue of l’homme moyen differing from la femme moyenne: in an age of mass movements of populations, countries with large numbers of people from different parts of the world are particularly aware of the effects of genetic variations—from lactose intolerance to sickle-cell anemia. The National Institutes of Health Revitalization Act of 1993 required that clinical trials be “designed and carried out in a manner sufficient to provide for a valid analysis of whether the variables being studied in the trial affect women or members of minority subgroups, as the case may be, differently than [sic] other subjects in the trial.”
This seems fair enough, but the Act was less specific on how it was to be achieved—and with reason. A trial is designed to have a certain “power”: a probability that a genuine effect will be recognized and not mistaken for the workings of chance. Like a signal-to-noise ratio, it depends crucially on the number of subjects. Increasing the number increases the probability that random variation will cancel out, but the relation is not linear: if you want to reduce the random effect by half, you need four times as many observations; by three-quarters, you need 16 times as many; by seven-eighths, 64 times. The minimum number of patients for a given trial is determined by the minimum size of effect that the researchers hope to observe, the error of observation, and the power of the experiment. These factors are mutually connected and, as the number of patients in the study decreases, they all work together to reduce the validity of the results.
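The four, 16, and 64 follow from a square-root law. A short sketch, assuming the usual result that the random error of an average shrinks as one over the square root of the number of observations; the function name is hypothetical.

```python
def sample_multiplier(noise_reduction):
    """Factor by which the number of observations must grow to cut the
    random effect by the given fraction (0.5 means "by half").

    Assumes the standard error of a mean shrinks as 1 / sqrt(n).
    """
    remaining = 1.0 - noise_reduction  # fraction of the noise left over
    return 1.0 / remaining ** 2


for reduction in (0.5, 0.75, 0.875):
    factor = sample_multiplier(reduction)
    print(f"reduce noise by {reduction:.1%}: {factor:.0f} times as many observations")
```

Each further halving of the noise costs another factor of four in patients.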
If you want to look at the difference between male and female responses to a given treatment, you need to look at the female response in isolation, then the male response, and then compare the two. So if you had needed 1,000 patients to assure the required power for a sex-blind experiment, you would need 2,000 to achieve the same power for an investigation of the male and female response in isolation. If you want to compare those responses, you are now working with the results of this initial experiment, already affected by one layer of error; so, if your comparison is to have the same power as the experiment you first proposed, you need 4,000 patients. Add two further conditions, say, age and income—you need an initial sample of 64,000. If your group reflects the relative proportion of black Americans in the population (10 percent), you need to multiply your initial sample by 10 to be able to say anything of significance about them in isolation. Determining the numbers you need is easy—achieving them may be impossible.
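The escalation can be traced step by step. This sketch simply follows the reasoning in the paragraph above (doubling for each isolated subgroup, doubling again for the comparison, so a factor of four per condition); it is a rule of thumb under those assumptions, not a formal power calculation, and the function name is hypothetical.

```python
def escalate(base_patients, extra_conditions=0):
    """Follow the text's reasoning for a two-way split such as sex.

    Analysing each half in isolation doubles the sample; comparing the
    halves doubles it again. Each further two-way condition is assumed
    to multiply the requirement by four in the same way.
    """
    in_isolation = base_patients * 2    # male and female analysed separately
    compared = in_isolation * 2         # male response versus female response
    total = compared * 4 ** extra_conditions
    return in_isolation, compared, total


in_isolation, compared, total = escalate(1_000, extra_conditions=2)
print(in_isolation, compared, total)

# A subgroup forming 10 percent of the sample reaches a given size only
# when the whole sample is 1 / 0.10 times larger.
subgroup_share = 0.10
subgroup_multiplier = round(1 / subgroup_share)
print(subgroup_multiplier)
```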
In medicine, we all want certainty—but we’d settle for rigor. Rigor, though, demands a high price in the complexity and size of experiment; and the numbers required for confidence in the results may be beyond any institution’s capacity to administer. Ultimately, we reach a point where society has to trust the researchers to isolate the right variables in the right studies. We will never be entirely free of medical tact.
Fisher never took the Hippocratic oath; the beings whose development he encouraged or stunted were mere plants. His standards were mathematical, and his highest duty was to precision under uncertainty. Importing Fisher’s methods into medicine, however, brings clinical researchers constantly up against ethical questions. The point of randomization is to purge inference of human preconception and allow simple error full play. The point of ethics is to save the situation from error in the name of a human preconception. The two do not sit easily together.
Do you withhold untested but promising AIDS treatments from dying patients in the name of controlled experiment? Do you, to preserve the validity of your results, continue hormone replacement therapy trials in the face of statistically significant increases in the incidence of breast cancer? Every question puts you back in the Lanarkshire classroom, milk jug in hand, choosing between braw Sandy and poor wee Robert.
In practice, somebody else will usually make the choice. Almost all clinical trials now have to be approved beforehand by institutional ethical committees, often combining laypeople with experts. These committees are increasingly overworked and sometimes ill equipped to assess the studies they must consider. Of course, ethical decisions must trump scientific ones—but this raises the question whether it is ethical to involve patients in a poorly designed study with insufficient statistical power to provide definitive results. And although the Declaration of Helsinki is supposed to govern all research committees, different countries have different standards—so, just as some shipowners register their leaky tankers in a country with low safety standards, others take their dubious experiments to less demanding jurisdictions.
You can see where this is leading: the ethical component has become a variable in itself. When, as now so often happens, a study hopes to include samples from many institutions in different countries, setting up the protocol to harmonize the review process can be just as important as the experimental design. This requires so much time and money that there is now an international review of multicenter studies of review bodies to determine the international protocols that can govern the review of multicenter studies—proving there is such a thing as a conceptual palindrome.
Medical research is self-sacrifice: years of study, long hours in ill-smelling buildings, complex statistical analysis—and, brooding behind it all, the Null Hypothesis: a form of self-mortification far more difficult to bear than eschewing fish without scales or fasting during daylight for a month.
Fisher said that experiment gives facts a chance to disprove the null hypothesis. As in hide-and-seek, the positive result is “it”: it will find you—assuming that it exists. If it doesn’t, the null hypothesis prevails: alone in your office, you mark down a negative result, document your methodology, tidy up your notes, and send it all off to a journal.
This is not failure. Verifying the null hypothesis should be a valuable result—assuming the trial stands up to scrutiny, it consigns the treatment tested to history’s trash heap: we no longer bleed fever patients because of a negative result for Dr. Broussais’ leeches. But a negative result, like a positive one, requires statistical power: there is a big difference between “We found nothing” and “We didn’t find anything.” An effect might be strong enough to appear even in an underpowered study where, if it hadn’t shown up, the null hypothesis could not be assumed. Short are the steps from uncertainty to ambiguity to confusion.
The Harvard statistician William Cochran said that experimenters always started a consultation by saying “I want to do an experiment to show that . . .” We are a hopeful species with an urge toward the positive, even in science: beneath the lab coat beats a human heart. Does this urge tip the results of clinical trials? Statistically, it does. In many medical fields, published work shows a slight overrepresentation of positive results—what’s called “publication bias”—since, after all, no news is no news.
All the things that affect the quality of a study—sample size, randomization, double-blinding, placebo selection, statistical power—err statistically in the same direction: the less perfect the study, the more likely a positive result. Not every trial can afford the 58,050 patients of ISIS-4; not every scientific committee insists on perfect methodology; and every such departure from absolute rigor increases the chance of seeing something that might not be there. Bias need not be intentional, or even human error—the researcher’s desire to have something positive to say—it’s inherent in the experimental process. We may think of a positive result as something hewn with great effort out of the surrounding randomness, but genuine randomness is actually harder to demonstrate. When it comes to seeking out the Null Hypothesis, the researcher is “it.”
Doctors’ mailboxes are full of glossy advertisements for new drugs, and doctors are eager to get their hands on new cures. A harassed GP hasn’t time to trawl through peer-reviewed journals on the off chance of finding what he needs. His starting place has to be the flyer with the sample attached; FDA or MCA approval gives a guarantee that he’s not dealing with crooks or wishful thinkers—but it is his responsibility to read the very fine print and decide if the product is safe, effective, and appropriate for his patients.
Richard Franklin has a Ph.D. in mathematics as well as an M.D.; he is also president of a medical information search company, so he is unusually aware of the situation our doctor is in: “He’ll read that, in a double-blind placebo-controlled clinical trial, this drug was shown to be effective in the treatment of a particular problem. If it’s a fatal disease, his presumption is that patients who took the drug didn’t die, or at least that fewer people died. But if we have the leisure to look at the design of a clinical trial, we discover that there is a definition of the word ‘effective’—and that might be, for this particular trial, that there was a 30 percent reduction in the size of a lesion. You have your own understanding of ‘effective’—but in fact, ‘effective’ has its specific definition in each individual context—a definition that aims to produce the minimum commercially acceptable difference.”
In many cases what a layman would consider to be the real clinical trial of a new drug takes place only after it has been approved, is on the market, and is being prescribed: postmarketing use. Over time enough data will be accumulated that even if something fails for its intended indication it succeeds for another. The most famous example is Viagra, which was initially tested as a hypertension reducer.
Numbers can be just as slippery as words. Suppose that you are a doctor and have been presented with a choice of four cancer-screening programs to recommend for your hospital: here are the results as laid out in your questionnaire. All you need to do is mark your grade for each on a line stretching from 0 (“would not support”) to 10 (“definitely would support”).
• Program A reduced the death rate by 34 percent
• Program B produced an absolute reduction in deaths of 0.06 percent
• Program C increased the patients’ survival rate from 99.82 percent to 99.88 percent
• Program D meant that 1,592 patients needed to be screened to prevent 1 death.
Program A looks pretty good, doesn’t it? Doctors and administrators who were given this questionnaire agreed: they gave it a score of 7.9 out of 10, well above its rivals. In fact, these numbers all describe exactly the same program.
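To see how one result yields all four descriptions, start from the two survival rates given for Program C. The sketch below is illustrative and its function name is hypothetical. Because it works from the rounded rates, it gives about 33 percent and about 1,667 rather than the 34 percent and 1,592 on the questionnaire; the small gap presumably reflects unrounded data behind the published figures, which is an assumption here.

```python
def screening_summary(survival_without, survival_with):
    """Express one screening result three ways.

    Inputs are survival rates in percent, for example 99.82 and 99.88.
    """
    deaths_without = 100.0 - survival_without   # percent dying unscreened
    deaths_with = 100.0 - survival_with         # percent dying when screened
    absolute_reduction = deaths_without - deaths_with   # percentage points
    relative_reduction = absolute_reduction / deaths_without
    number_needed_to_screen = 100.0 / absolute_reduction
    return relative_reduction, absolute_reduction, number_needed_to_screen


rrr, arr, nns = screening_summary(99.82, 99.88)
print(f"relative reduction in deaths: {rrr:.0%}")
print(f"absolute reduction in deaths: {arr:.2f} percentage points")
print(f"number needed to screen to prevent one death: {nns:.0f}")
```

The relative figure looks large only because the baseline death rate it is measured against is so small.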
The same misunderstanding appeared in studies of decisions by health purchasers in the UK, teaching-hospital doctors in Canada, physicians in the United States and Europe, and American pharmacists. All plumped for relative risk reduction, the percentage drop in the rate of deaths. We live in a world of percentages—but because they are a measure of proportion, not of absolute size, they contain the seeds of confusion. “Compared to what?” is never a pointless question.
Gerd Gigerenzer, in Calculated Risks, posed a simple question to doctors in a German teaching hospital: You’re in charge of a mammogram screening program covering women between 40 and 50 who show no symptoms. You know that the overall probability that a woman of this age has breast cancer is 0.8 percent. If a woman has breast cancer, the probability that she will show a positive mammogram result is 90 percent. If she does not have breast cancer, the probability that she will have a falsely positive mammogram result is 7 percent. Your patient, Ursula K., has a positive mammogram result. What is the probability that she has breast cancer?
The doctors were baffled: a third of them decided the probability was 90 percent; a sixth thought it was 1 percent. It would have made a big difference to Ursula K. which doctor was on duty the day she came in for her results.
As Gigerenzer explains it, the problem is not with the doctors but with the percentages. Phrase the question again, using plain numbers: out of 1,000 women of this age, 8 will have breast cancer. When you screen those 8, 7 will have a positive mammogram result. When you screen the remaining 992 women without breast cancer, 69 or 70 of them will have a false-positive mammogram. Ursula K. is one of the 7 + 70 women with a positive result; how likely is she to have breast cancer?
It is a lot easier to compare 7 and 77 than to figure out (0.008 × 0.90) / (0.008 × 0.90 + 0.992 × 0.07).
When the problem was explained this way, half the doctors got the answer right: Ursula’s chance of having cancer is less than 1 in 10. That said, half of them still got it wrong, and two even said her chance of having cancer was 80 percent. Maybe part of the problem is the doctors.
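Both routes to the answer can be checked in a few lines. This is a minimal sketch using only the figures quoted in the question; the function names are hypothetical.

```python
def posterior_from_percentages(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: probability of cancer given a positive mammogram."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


def posterior_from_counts(sick_and_positive, healthy_and_positive):
    """The same question asked with whole numbers of women."""
    return sick_and_positive / (sick_and_positive + healthy_and_positive)


# 0.8 percent prevalence, 90 percent detection, 7 percent false positives.
exact = posterior_from_percentages(0.008, 0.90, 0.07)
# 7 women with cancer and a positive result, 70 without cancer but positive.
rounded = posterior_from_counts(7, 70)

print(f"from percentages: {exact:.1%}")
print(f"from counts of women: {rounded:.1%}")
```

The two figures differ slightly because the counts are rounded to whole women, but both fall below 1 in 10.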
So far, we have been talking about patients and their diseases as the raw data, but in a modern health-care system doctors, too, are objects of collective scrutiny. Spending on health care has reached an annual level of $1.5 trillion in the United States, more than 14 percent of the gross domestic product; staving off mortality costs every American $5,267 a year. Meanwhile, the UK has had fifty years of a state-funded universal health-care system (the world’s third-largest employer, after the Chinese army and the Indian railroads), which an anxious electorate alternately praises and abuses.
How do you gauge the success of such an enterprise? All lives eventually end: medicine wins many battles but must lose the war. Can you define “unnecessary deaths prevented”? Like existence, health care lacks an ultimate goal. Its paymasters, however, have to describe and regulate the movements of this vast collective organism—and their method is necessarily statistical.