by Steven Hatch
I originally envisioned Snowball in a Blizzard as a book that would focus on methodological aspects of human-subjects research, mainly the difficulties of study design and the subtleties of statistical interpretation. When, for instance, does a relative risk value diverge from an odds ratio, and why are the two often confused? What is a Type I versus Type II error? How do we “power” studies? A few years ago, as I was struggling with these kinds of issues in my professional work, I thought that they would be ideal subjects to illuminate to a general audience. I can see now that these fairly technical matters were unlikely to help nonspecialists have a more thorough understanding of clinical research, and it is probably why I received fairly tepid responses from literary agents.
Over time, I realized that more was to be gained by telling stories about the consequences of these issues, and that I could occasionally sprinkle the text with brief explanations of the more essential methodological points. For instance, I thought it absolutely critical to explain the concept of positive predictive value in order to show why the USPSTF does not universally recommend mammograms for women under age fifty. One can’t easily grasp the justification for the task force’s reasoning without being acquainted with the notion of positive predictive value; once one understands the concept and sees the truly lousy predictive value of a positive screening mammogram in this age group, it’s hard to understand why there was (and is) so much fuss in the first place. However, I shelved the idea of devoting entire chapters, say, to the difference between nested case-control and case-cohort studies or the beauty inherent in the Mann-Whitney U test. Such subjects, fascinating though they can be to epidemiologists, would probably be valuable to nonspecialists only as a soporific.
Thus, I elected to prioritize narration over technical explanation to describe these points, and whether I have succeeded or failed at that task, I leave for you, the reader, to judge. However, I do believe that there is one statistical concept worth exploring in a little more detail than the structure of this book allowed for because so much of what I have discussed in the previous pages relies on it: significance. I can’t speak for the basic research scientists, but for clinicians statistical significance is in many ways the yardstick by which we measure relevance in medical knowledge.
I have hinted at how we use statistical significance throughout, but what follows is a very cursory overview of the concept and an explanation of the two main types of data used in calculating significance: categorical and continuous variables. There is much more that could be described, and at a much deeper level of detail. My only goal here is to give readers a sense of what it means to hear that a study “found” something, and when skepticism may be in order. In doing so, I also want to provide readers with a way to understand news about medical breakthroughs not mentioned in this book because a moderate portion of the medical “knowledge” discussed here will be out of date by the time it hits bookstore shelves, assuming that any bookstore shelves are left by its publication date.
For those eager readers who have a deeper interest in the details of clinical research or statistics, among the many fine books devoted to the subject, I recommend Epidemiology, by Leon Gordis, as well as Naked Statistics, by Charles Wheelan. The former is actually a textbook, but it is so elegantly written and has such clear examples that it can be easily understood by lay readers, while the latter is intended for a nonspecialist audience and is rollicking good fun, a rare thing to say about a book on statistics.
Statistical significance relies on the idea that some events are random and some are not. We separate the two through various mathematical calculations, the details of which are unimportant for this discussion, but these calculations allow us to say something to the effect of “Mathematically, what we have observed in this experiment is so unlikely that it can’t just be a matter of chance.” Statisticians have a numerical measurement (known as a “p value”) for statistical significance. With those measurements, they have selected an arbitrary dividing line between randomness and pattern, a bit like an umpire’s strike zone, to allow for the process of calling the balls and the strikes of research. Different kinds of research questions call for different strike zones (i.e., there are different thresholds for statistical significance), but in general, in clinical research, the p value for significance is 0.05. What that means is that, in order for something to be considered statistically significant, for whatever observed differences are being assessed, there has to be less than one-in-twenty odds (or 5 percent, thus 0.05) that it could have happened by chance. To explore that further, let’s flip some coins.
Flipping a coin is an act with a random outcome. Barring trick flips and trick coins, when a coin is flipped it has an equal chance (or one-in-two odds) of landing heads or tails. Thus, when we consider the mathematical likelihood of flipping two consecutive heads, the odds are one in four, or 25 percent. This can be seen by looking at all the actual possibilities of flipping a coin twice: (#1) heads then heads, (#2) heads then tails, (#3) tails then heads, and (#4) tails then tails.
In a relative sense, an event that has one-in-four odds of happening is fairly likely: although the more likely event is a combination (either #2 or #3), if you were to repeat the “two flip” experiment many times, you would expect to see two heads with some frequency. Indeed, assuming the flip of the coin is genuinely random, if you repeated this two-flip experiment thousands of times, you would expect to see it almost precisely 25 percent of the time! Think also of parents who are planning on having two kids: odds are about 50 percent that they’ll have one boy and one girl and 50 percent that they’ll have two children of the same sex, but only 25 percent that they’ll have two girls, and likewise only 25 percent that they’ll have two boys. And that should match most people’s experiences with their friends’ siblings. The point is that a one-in-four chance of something happening is pretty common: 25 percent likely things happen all the time randomly, and we don’t think much of them.
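For readers who would rather watch the coin than take my word for it, here is a minimal sketch in Python that repeats the two-flip experiment many times and reports how often both flips land heads. The function name, the fixed seed, and the number of repetitions are illustrative choices of mine, not anything prescribed by statisticians.

```python
import random

def fraction_two_heads(n_experiments, seed=0):
    """Repeat the two-flip experiment and return the share that came up heads-heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable (an arbitrary choice)
    hits = 0
    for _ in range(n_experiments):
        first_is_heads = rng.random() < 0.5
        second_is_heads = rng.random() < 0.5
        if first_is_heads and second_is_heads:
            hits += 1
    return hits / n_experiments

# The more often the experiment is repeated, the closer the share should sit to 0.25.
for n in (100, 10_000, 100_000):
    print(n, fraction_two_heads(n))
```

With only a hundred repetitions the share can wander noticeably from 25 percent; with a hundred thousand it should settle very close to it.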
Now flip the coin three times, and there are eight different outcomes.
So the odds of flipping three consecutive heads (or having, say, three girls) are one in eight, or 12.5 percent. It’s less likely, but still not that uncommon. Most people know of some lovely lady bringing up three very lovely girls, or of a man with three boys of his own, or perhaps even an old-fashioned nuclear family doing the same. We’re starting to get close to the boundary of statistical significance, but we’re not quite there yet.
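The counting behind these odds can be sketched in a few lines of Python by listing every possible sequence of flips. The helper name `outcomes` is my own, and the letters H and T are simply shorthand for heads and tails.

```python
from itertools import product

def outcomes(n_flips):
    """List every possible sequence of heads (H) and tails (T) for n_flips flips."""
    return ["".join(sequence) for sequence in product("HT", repeat=n_flips)]

two_flips = outcomes(2)
three_flips = outcomes(3)

print(two_flips)    # the four two-flip possibilities
print(three_flips)  # the eight three-flip possibilities

# Exactly one sequence in each list is all heads, which is where
# "one in four" and "one in eight" come from.
print(two_flips.count("HH"), "of", len(two_flips))
print(three_flips.count("HHH"), "of", len(three_flips))
```

Each extra flip doubles the length of the list while leaving only one all-heads sequence, which is why the odds halve every time.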
Flip the coin four times, and the odds of landing four consecutive heads are one in sixteen, or just over 6 percent. Five consecutive heads has about a 3 percent chance (one-in-thirty-two odds) of occurring. To put it another way, if you flipped your coin five times in a row one hundred times over, odds are that about three of those hundred five-flip series would be all heads. In other words, that’s very unlikely: if you had someone randomly flip a coin five times, and in those five flips they got five heads, your first instinct would be that they were incredibly lucky or that it wasn’t random: either it’s a trick flip or a trick coin.
In most of the scenarios we have discussed in this book, the mathematical threshold that statisticians want to see in order to regard something so unlikely to occur by chance as to be “real” is in this kind of range. The p value for statistical significance requires an event that is a little more unlikely than flipping heads four times in a row (i.e., a 6 percent chance of happening randomly) and a little less unlikely than flipping heads five times in a row (a 3 percent chance).
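To see exactly where the 0.05 line falls among the runs of heads, here is a small sketch that uses exact fractions rather than rounded percentages. The function name is mine, and the threshold of one in twenty is the conventional one described above.

```python
from fractions import Fraction

THRESHOLD = Fraction(1, 20)  # the conventional 0.05 cutoff

def prob_all_heads(n_flips):
    """Probability that a fair coin lands heads on every one of n_flips flips."""
    return Fraction(1, 2) ** n_flips

for n in range(1, 7):
    p = prob_all_heads(n)
    verdict = "rarer than 1 in 20" if p < THRESHOLD else "not rarer than 1 in 20"
    print(f"{n} heads in a row: 1 in {p.denominator} -> {verdict}")
```

Four heads in a row (1 in 16) lands on the "not rare enough" side of the line, and five in a row (1 in 32) lands on the other side.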
So thus far you can see that a statistically significant finding means that researchers observed a difference between two groups of things, and that the observed difference is slightly more unlikely to occur randomly than a person picking up a quarter and flipping heads four times in a row. But people aren’t coins, and, besides, not every event that researchers study (like heart attacks or occurrences of cancer) has the same odds as flipping a coin. So how do statisticians ultimately arrive at a p value?
The answer is that although people and the clinical events they experience aren’t actually coins, their behavior in a statistical sense is very much like a flipped coin. And although it’s true that not every event has the same odds of happening as a flipped coin landing heads, statisticians make adjustments for this. To get at how this can happen without resorting to equations, let’s step away from flipped coins for a moment and think about randomly plucking marbles out of a jar.
Recall the story of James Lind’s scurvy experiment. Let’s re-envision his experiment as observing the differences between two jars filled with marbles. The marbles come in two colors, either blue or red. Blue marbles signify “alive and healthy,” and red marbles signify “dead or very close to it.” Now, we modern readers happen to know with hindsight that vitamin C is essential to life, so even though Lind investigated six different interventions, only one arm of the study contained any appreciable amount of vitamin C. Therefore, for this example we’ll pretend he only looked at the one variable of citrus fruit. The sailors who got the oranges and lemons were really getting vitamin C even if he didn’t know that, and everyone else was getting nothing at all.
So if Lind had started this trial before his sailors went to sea and then followed them over the course of their voyage, we would say that he was going to withdraw two marbles out of one jar (the men who got oranges and lemons) and ten marbles out of the other “just watch them” (or control) jar. Picture these jars filled with hundreds or even thousands of marbles: this would be the theoretical population in the treatment group or the control group. In performing the experiment, the researcher is basically saying, “If I gave everyone in the world this particular treatment, and the treatment succeeds, there would be more blue marbles in the treatment jar (i.e., more healthy people because of the beneficial effects of my treatment) than there would be in a control jar that contained an equal number of marbles.” The actual trial, then, is just like plucking out a sample from this much larger theoretical pool of people in order to compare results.
But what is the statistical likelihood that, for whichever jar, you will pluck out a blue versus red marble? If you don’t know that, you can’t actually know the odds of finding two blue marbles in one jar and ten red ones in the other. Let’s suppose that dying from scurvy on a long sea voyage really did carry the same risk as flipping a coin, that is, you had a one-in-two chance of dying from scurvy when you set sail. For Lind’s experiment, that would mean one would have expected equal numbers of red and blue marbles in each jar if the oranges and lemons had no effect on scurvy at all.* (In fact, although I chose the 50 percent mortality of scurvy just to make the example of picking red versus blue marbles to be a 50/50 proposition, at about the time Lind was performing his experiment, the mortality rate from scurvy on long sea voyages probably was 50 percent, and perhaps even higher. A famous episode in British naval history known as Anson’s voyage took place in the 1740s; a small squadron of ships sailed around the world with the goal of harassing the Spanish navy as part of a geopolitical chess match between the two superpowers. At the end of the four-year journey, as many as three-quarters of the original crew had died, most from disease or starvation, a significant portion of which included deaths from scurvy. Almost none of the crew died from actual warfare.)
The idea that we wouldn’t expect to observe any difference if we assume the intervention (the oranges and lemons in this case) does not work is called the “null hypothesis.” When a difference is observed, and the observed difference is calculated to be less likely than a one-in-twenty chance, the null hypothesis is rejected.
Therefore, if you had only two people in your trial, that is, if you randomly drew only one marble from each jar, and retrieved one blue marble from the oranges and lemons (treatment) jar and one red marble from the control jar, you wouldn’t know what to conclude because it’s basically a coin flip as to whether you draw red or blue. The chances of finding one person surviving scurvy in the treatment arm and one person dying in the control arm are 25 percent—hardly convincing data. That should make it obvious enough why enrolling two patients for a drug trial isn’t a recipe for success from the standpoint of clinical study design. But how many is enough?
What if the trial had eight participants, equally split? Suppose that you randomly drew four marbles from each jar and found three of four healthy-blue marbles in the treatment jar and three of four dead-red marbles in the control jar. Without doing statistics, at first glance most readers wouldn’t find this wildly improbable, only slightly improbable. You’d say that it might hint at something but you’d need to see more marbles before you could be confident. That’s the intuitive take, and the math backs that up: the likelihood that you would see this pattern if you drew four marbles from each jar is about 15 percent, or just under one-in-seven odds. So if Lind had run a four-person-per-arm experiment today under modern standards, we would say that his results did not support a firm conclusion that citrus fruits prevent scurvy, although he might be set up for the next grant application after this promising pilot study.†
Joke aside, that’s actually how most modern “pilot” therapies are studied, in very small trials not geared for statistical significance, but just to see whether the researchers or clinicians are barking up a suitable tree.
As we have said, Lind’s actual experiment involved six groups of two, but, because of what we know today, we realize that really it was one treatment group of two and a placebo group of ten. If we perform this experiment with our red and blue marbles, the math would bear out his nonmathematical conclusions about the value of oranges and lemons: the likelihood of plucking out, by pure chance, two blue marbles from the oranges and lemons jar and ten red marbles from the control jar, is about 1 in 2,000. That is statistically significant, and we would say that he proved that citrus fruits prevent or cure scurvy. And shame on him for not being a citrus fruit convert right away.
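For the curious, here is a hedged sketch of one calculation that lands near the figures quoted for the eight-marble pilot and for Lind's two-versus-ten split. I am assuming a Pearson chi-square test on a two-by-two table, with one degree of freedom and no continuity correction; that choice, the function name, and the argument order are assumptions for illustration. With counts this small, statisticians would typically prefer an exact test, which gives larger (less impressive) p values.

```python
import math

def chi_square_p(blue_treated, red_treated, blue_control, red_control):
    """Two-sided p value for a 2x2 table of marble counts.

    Assumes Pearson's chi-square, 1 degree of freedom, no continuity
    correction. For 1 degree of freedom the upper tail of the chi-square
    distribution equals erfc(sqrt(chi2 / 2)).
    """
    a, b, c, d = blue_treated, red_treated, blue_control, red_control
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2))

# Eight-marble pilot: three of four blue in the treatment jar, one of four blue in control.
pilot = chi_square_p(3, 1, 1, 3)

# Lind's experiment as re-imagined here: two blue treated, ten red controls.
lind = chi_square_p(2, 0, 0, 10)

print(f"pilot: p = {pilot:.3f}")
print(f"Lind:  p = {lind:.5f}, or roughly 1 in {round(1 / lind)}")
```

Under these assumptions the pilot comes out a little above 15 percent and Lind's result comes out in the neighborhood of 1 in 2,000, in line with the numbers given in the text.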
Now let’s transport ourselves two centuries forward (while staying in Great Britain) to the site of the first truly modern drug trial, where researchers evaluated streptomycin for the treatment of tuberculosis. Unlike the situation with scurvy, the risk of dying from TB in the mid-twentieth century had different odds of happening than a coin landing on heads. Or, to think about this in another way, in a TB treatment experiment, the number of blue and red marbles in the jars is not equal: the mortality rate from TB among these study subjects in 1948 was just under 30 percent. That means that if you had two 10,000-marble jars, you would expect about 3,000 red marbles in each jar, and you would adjust your mathematics of the odds of plucking a red marble from each jar accordingly.
For the streptomycin study, fifty-five patients were allocated to the treatment group, and fifty-two to the control group. Of these, four of the fifty-five “marbles” in the streptomycin jar were red (a 7 percent mortality), compared to fourteen red marbles in the control jar (a 27 percent mortality). You can see at once that this is a difference, but it is not quite as dramatic a difference as in Lind’s scurvy data. Keep in mind that 30 percent of the marbles in each jar should be red if streptomycin was not effective—that’s the null hypothesis. Through some calculations with which we won’t currently concern ourselves, the likelihood of this pattern happening by chance (i.e., of randomly plucking this particular pattern of red/blue marbles from each jar) is still very improbable: it is about 1 in 167. That’s still statistically significant, and based on that we would reject the null hypothesis and conclude, as the researchers did, that streptomycin saves lives. Streptomycin is still effective against TB today, though we use it only on rare occasion because of highly toxic side effects.
But you can see that the streptomycin trial results are less dramatic than what Lind had seen in treating scurvy. If we had recruited half the number of patients for the trial (that is, about twenty-five each) and observed the same proportions of red and blue marbles, we would not have been able to conclude confidently that streptomycin saves lives, noting that it could purely have been a matter of chance that some of the streptomycin-treated patients had improved, perhaps because some patients in the treatment arm were in a better state of health and thus more likely to recover and, therefore, less likely to die. Because the mortality rate is somewhat lower, it doesn’t take much of a difference to make a real effect disappear when sample sizes are small.
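The effect of halving the trial can be sketched with a two-proportion z-test, which is one standard way to compare death rates in two groups (it is equivalent to the uncorrected chi-square test, and I am assuming it here rather than claiming it is what the original researchers used). The full-trial counts come from the text; the half-size counts of 2 deaths in 27 and 7 deaths in 26 are my own rounding of "the same proportions" and are purely hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p(deaths_a, n_a, deaths_b, n_b):
    """Two-sided p value for a difference in death rates (pooled two-proportion z-test)."""
    rate_a = deaths_a / n_a
    rate_b = deaths_b / n_b
    pooled = (deaths_a + deaths_b) / (n_a + n_b)  # death rate if the drug did nothing
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Full trial, as reported: 4 deaths of 55 on streptomycin, 14 deaths of 52 controls.
full_trial = two_proportion_p(4, 55, 14, 52)

# Hypothetical half-size trial with roughly the same proportions.
half_trial = two_proportion_p(2, 27, 7, 26)

print(f"full trial: p = {full_trial:.4f}")
print(f"half trial: p = {half_trial:.4f}")
```

Under these assumptions the full trial falls well below 0.05, in the same neighborhood as the figure quoted above, while the half-size trial drifts just above the line even though the proportions of red and blue marbles have barely changed.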
The implication for clinical research is that, to observe big effects, researchers don’t require many “marbles” (i.e., people to recruit), but, to observe small effects, large trials are required. Most medical innovations—whether medications, new surgical techniques, screening tests, or various other developments—worth studying today have weak effects. Therefore, the kind of clinical research that’s required to sort out these small effects can be a massive undertaking and can be very expensive and time-consuming. The Canadian mammography trial mentioned at the end of the chapter on mammograms took nearly three decades to complete and enrolled nearly one hundred thousand women. That’s a remarkable allocation of resources for one clinical question, and at the end of it they found no difference. The sheer size of previous mammography studies, which recruited similar numbers of women and followed them for equally long stretches, should now indicate to you the relative magnitude of mammography’s benefits even if we assume that the most optimistic estimates of their value are accurate.
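A rough sense of how many marbles are needed can be had from a textbook approximation for comparing two proportions. This is a sketch only: the 80 percent "power" (the chance of detecting an effect that is really there) is a common convention that I am assuming, the function name is mine, and the second pair of death rates is invented to stand in for a weak effect.

```python
from math import ceil
from statistics import NormalDist

def patients_per_arm(rate_control, rate_treated, alpha=0.05, power=0.80):
    """Approximate patients needed in each arm to detect the difference in rates.

    Uses the standard normal-approximation formula for two proportions;
    alpha is the significance threshold and power is an assumed convention.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = rate_control * (1 - rate_control) + rate_treated * (1 - rate_treated)
    difference = rate_control - rate_treated
    return ceil((z_alpha + z_power) ** 2 * variance / difference ** 2)

# A big effect, like the streptomycin trial: 27 percent versus 7 percent mortality.
print(patients_per_arm(0.27, 0.07))

# A weak effect (hypothetical numbers): 5 deaths per 1,000 versus 4 per 1,000.
print(patients_per_arm(0.005, 0.004))
```

The first answer is a few dozen patients per arm; the second runs to tens of thousands, which is the arithmetic behind the enormous screening trials.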
Small effects can be small for more than one reason. Some interventions may be lifesaving—a saved life is a big benefit for a patient—but like mammography may require hundreds or even thousands of people to be treated before one life is saved. The statin drugs discussed midway through the book have a pretty big bang for the buck in patients with heart disease or very high cholesterol, but the patients now considered eligible for treatment by the new American Heart Association guidelines are almost certainly going to benefit less as a group.
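The idea of treating hundreds or thousands of people to save one life can be put into numbers by dividing one by the drop in risk, a quantity usually called the number needed to treat. In the sketch below the function name is mine, the first example reuses the streptomycin counts from earlier in this appendix, and the second uses made-up risks of 5 versus 4 deaths per 1,000 purely for illustration.

```python
from fractions import Fraction
from math import ceil

def number_needed_to_treat(risk_untreated, risk_treated):
    """People who must be treated, on average, for one of them to be spared the bad outcome."""
    reduction = risk_untreated - risk_treated
    if reduction <= 0:
        raise ValueError("the treatment does not lower the risk")
    return ceil(1 / reduction)

# Streptomycin trial: 14 of 52 controls died versus 4 of 55 treated patients.
print(number_needed_to_treat(Fraction(14, 52), Fraction(4, 55)))

# Hypothetical screening program: 5 deaths per 1,000 without it, 4 per 1,000 with it.
print(number_needed_to_treat(Fraction(5, 1000), Fraction(4, 1000)))
```

Exact fractions are used so that rounding in the arithmetic cannot nudge the answer up or down by one person.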
Similarly, medications that lower blood pressure are lifesaving, and statistically significant results can be obtained with fairly small numbers of patients when the patients have very high blood pressures. However, many more patients are required to demonstrate that even a few lives are saved as the pressures get closer to “normal.” This is the rub in the new JNC8 guidelines discussed toward the beginning of the book. The entire argument about target blood pressure revolves around the problem that, as we get lower and lower pressures, people are healthier, so it is harder to observe statistically significant effects. In other words, the jars are filled with so many blue marbles that we must sample from thousands and thousands of marbles to see whether there really are fewer red marbles in the treatment jar than in the control jar. Again, we come to the boundary of what we can distinguish, no matter how hard we squint in a statistical sense.