Snowball in a Blizzard


by Steven Hatch


  As I’ve indicated, one critical reason there was such consternation in the public reaction had to do with a misunderstanding of the degree to which screening mammography is reliable. Many had assumed that mammograms deliver a black-and-white solution to a vexing national health problem. Yet a mammogram is, in a very literal sense, a gray technology. Teasing out its value requires sifting through numbers that do not speak unambiguously. Appreciating what’s at stake in the Great Mammogram Flap of 2009, and why uncertainty plays a starring role in the saga, requires an understanding of tests and their accuracy. For mammograms, despite their popular portrayals as tools that in general uncover hidden truths—that is, cancers—are, rather, slightly foggy lenses that look out upon the world. The degree to which those lenses are foggy, and the consequences of imperfect vision, form the core of the debate.

  Before discussing exactly what the USPSTF actually said, it is important one last time to emphasize that we are talking about mammography as a screening tool. Mammograms, even if they use the exact same technology, can be understood in two different contexts. The first context is the one most people think of when they go to a doctor because something is awry. They are sick, or they just don’t feel right, or they discover an unusual rash, and so on. When they come to the office, the doctor offers testing—sometimes blood tests, sometimes radiology tests, and sometimes invasive tests—to help identify the problem. In the case of a mammogram, a woman would come to her doctor having felt a lump. When used in this way, a mammogram is not a screen; it’s a diagnostic test performed in the context of some real physical symptom or abnormality, and it is being used as part of a process to evaluate that abnormality.

  A screening mammogram, by contrast, is the very same mammogram, but it is performed on healthy women who have no evidence of disease. The notion that mammograms are lifesaving lies in the idea that they detect cancers before a lump is palpable. The added time between the detection of a cancer by a mammogram, and when the patient or doctor notices a lump, is what the mammogram provides. Since cancer is a disease always on the move, and one that becomes more lethal the longer it is left to fester, the rationale is that this added time leads to saved lives.

  In principle, this benefit can be studied and quantified in a relatively straightforward way. You take a large cohort of women and randomly assign them to two groups. The women in the first group have annual mammograms, while the women in the second group have no mammograms and either go to their doctor on an as-needed basis or have an annual physical. Women in both groups, over the years, will be diagnosed with breast cancer. At the end of some defined period—usually between ten and twenty years—researchers tally the number of deaths from breast cancer in both groups, and any differences that result can arguably be attributed to the use of the mammogram. If a significantly smaller percentage of women in the mammogram arm die of breast cancer, then that extra time between mammogram and a palpable lump can be shown to have a clear mortality benefit.

  The research is simple in theory. In reality, researchers have been investigating mammography for just about half a century, performing trials costing tens of millions of dollars and involving hundreds of thousands of women from countries around the world. And yet, despite this massive use of resources, critical questions about mammography as a screening tool remain unsettled. When the researchers of the US Preventive Services Task Force sat down to review the mountains of data, they used a variety of mathematical models to help them define, in as precise a manner as possible, the value of a mammogram. Their conclusions forever changed the mammography debate.

  2009: Turnaround

  The 2009 task force issued several recommendations that differed from previous guidelines, and each of these recommendations was based on slightly different issues related to statistics and uncertainty. By far, the most contentious recommendation was that women under the age of fifty not have annual mammograms, which represented an almost absolute about-face from what had come before. What had happened in the intervening years to cause such a change?

  The answer can be found in the heaps of data that had been compiled over the previous twenty years, but it boils down to two simple numbers: the first is that about 50,000 women between the ages of forty and fifty develop invasive breast cancer each year in the United States; the second is that there are about 22 million women in the United States in this age range. Why is this important? Because it shows that breast cancer in this age group is still relatively uncommon, especially at the younger end. The annual incidence increases with each passing year (thus, of these 50,000 women, more of them are forty-eight or forty-nine than forty or forty-one). By the time a woman is sixty-five, by comparison, that incidence roughly doubles.
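
  Put side by side, those two numbers translate into a small annual risk. The following is only a back-of-envelope restatement of the rounded figures quoted above, not precise registry data:

```python
# Rounded figures from the text, not exact epidemiological data.
annual_cases = 50_000            # invasive breast cancers per year, women aged 40-49
women_in_age_range = 22_000_000  # US women aged 40-49 (approximate)

annual_incidence = annual_cases / women_in_age_range
print(f"Annual incidence: {annual_incidence:.2%}")                               # about 0.23%
print(f"Roughly {annual_incidence * 1000:.1f} cases per 1,000 women per year")   # about 2.3
```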

  These baseline numbers have a dramatic effect on the accuracy of a mammogram. This is a wildly counterintuitive concept, but it is critical to understanding the task force recommendations. I foreshadowed this with the vignette about Leonard Mlodinow’s HIV screening test. In the following paragraphs, I’m going to run the numbers so that you can actually see how even accurate tests have lousy predictive value when applied in the wrong population, like women under the age of fifty. Even a small error rate for a screening test like a mammogram can have a huge impact on how confident we can be that a “positive” test is really positive.

  Let’s assume for the moment that mammograms are 99 percent accurate—that is, they detect ninety-nine cases out of one hundred women with breast cancer, and they don’t show cancer in ninety-nine of one hundred women who do not have breast cancer. By current medical standards, this would be an extremely accurate screening tool. So if a forty-three-year-old woman goes for a mammogram and, at the follow-up visit, is told by her physician that the mammogram was suggestive of breast cancer, that test is 99 percent likely to be correct, right?

  In fact, it’s not right at all. Indeed, almost the complete opposite is true. When you multiply the error rate (1 percent) by the actual size of the two populations (those with disease and those without), something strange happens. If you screened all 22 million women who don’t have cancer (technically, it’s 22 million minus 50,000, but that’s essentially 22 million, to keep the number simple), a 1 percent error rate would produce approximately 220,000 false-positive tests—women who don’t have cancer but are read by the radiologist as possibly having cancer. Of the 50,000 women who do have cancer, 49,500 are read as positive. Because the false-negative rate is negligible for the purposes of the example, I’ll round the true-positive number back up to 50,000 to keep the math simple.

  So now you have 270,000 positive mammograms—that is, women who receive a diagnosis of potential breast cancer—but the majority of these mammograms are actually inaccurate because, as we’ve just seen, 220,000 of the 270,000 are false positives! The bizarre truth, then, is that even if a mammogram is 99 percent accurate, the likelihood is over 80 percent that this forty-three-year-old woman’s “positive” mammogram is actually negative.
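
  For readers who want to check this arithmetic themselves, here is a minimal sketch of the calculation, assuming the rounded figures used above and treating “99 percent accurate” as 99 percent sensitivity and 99 percent specificity. The variable names are mine, chosen only for illustration:

```python
# Rounded figures from the text.
population = 22_000_000          # US women aged forty to fifty
with_cancer = 50_000             # of whom roughly this many have invasive breast cancer
without_cancer = population - with_cancer

sensitivity = 0.99               # fraction of cancers the test catches
specificity = 0.99               # fraction of healthy women the test correctly clears

true_positives = sensitivity * with_cancer              # ~49,500
false_positives = (1 - specificity) * without_cancer    # ~219,500 (about 220,000 in the text)

all_positives = true_positives + false_positives
ppv = true_positives / all_positives                    # positive predictive value

print(f"Positive mammograms: {all_positives:,.0f}")
print(f"Chance a positive result is truly cancer: {ppv:.1%}")   # roughly 18 percent
print(f"Chance it is a false alarm: {1 - ppv:.1%}")             # roughly 82 percent
```

  Note that the handful of missed cancers (the false negatives) barely changes the result, which is why the text rounds the true positives back up to 50,000.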

  A visual, not-to-scale representation might help drive the point home, as seen in Figure 3.1 on the next page. The false-negative rate of mammograms is a tiny speck on the already small square of women who actually have breast cancer. Now you can compare true positives to false positives for mammograms, and see the result. Look at the boxes representing positive tests: the false-positive box is much larger than the one representing true positives. And there’s no simple way other than doing a breast biopsy to know which positive mammogram belongs in which box—and more on biopsies, and the uncertainty surrounding them, shortly.

  Figure 3.1 could just as easily describe Mlodinow’s HIV test. The reason his doctor had been so confident in his diagnosis was that he had considered only the black box—those tests that picked up people who are actually infected. In this respect the test is virtually error-free. But Mlodinow was much more likely to be in the gray box just by random chance, because there are so many more people in the general population who don’t have disease—that big, white box. As a consequence, an extremely accurate test in a low-incidence population produces a relatively large number of false positives when compared to the number of people actually infected. This is the central problem in screening tests—all screening tests. We saw it with the PSA blood test to screen for prostate cancer, we saw it in Mlodinow’s HIV screen, and we’ll see it again in the Appendix when I briefly discuss lung CT scans for smokers.* False positives are the bane of a screen’s existence, for exactly the reason outlined in Figure 3.1.

  * I repeat: HIV tests no longer have this problem. Please, get screened!

  FIGURE 3.1. Positive predictive value of a mammogram among women under fifty. If the total number of women who have breast cancer is small, even a minor error rate among the women who don’t have breast cancer leads to there being more false-positive mammograms than true-positive mammograms—that is, one can easily see here that of the total number of positive tests, many more are of the gray, false-positive variety than the black, true-positive kind. And there is no way for a radiologist, or anyone else, to distinguish true positive from false positive except in thought experiments like this.

  This is the reason the USPSTF believed that there was no evidence for a benefit to mammography in women under fifty: there were too many false positives leading to too many unnecessary procedures. Moreover, women who fall into the gray, false-positive category receive none of screening mammography’s benefits but also might incur substantial harm, from the unnecessary fear produced by being handed a potentially life-threatening diagnosis to undergoing unnecessary surgery.

  To emphasize, even though this discussion of black, white, and gray boxes might imply otherwise, in a real-life situation there is no way to discern whether a woman’s mammogram is a false positive or a true positive. All that she will know is that the radiologic evidence indicates there is something that looks very suspicious for cancer, and she will therefore be referred for a biopsy. Out of every ten women under the age of fifty who are sent for these biopsies, about eight will have no cancer at all—a staggering number that speaks to a large group of women who will wait around days or weeks for the biopsy to be performed, worried sick that they might die, and maybe die horribly. It is part of the toll that did not appear in the calculus of those lobbing insults at the task force, but it was very much on the minds of the members of the task force itself. For them, such data didn’t seem so bland; it indicated a huge amount of anxiety and, as I’ll soon explain, physical risk.

  The Blacks, Whites, and Grays of the Mammogram

  Thus far we have looked at the problem of mammography screening for women under fifty with the mathematical assumption that it is 99 percent accurate. But it is not, and not by a long shot. Numbers from various studies show a wide range, but it is safe to say that mammograms have, at absolute best, a 90 percent accuracy rate, and likely worse, making it an even more unwieldy tool than what is described above.*

  For the purposes of this discussion, I am using “accuracy” in a deliberately vague manner, since doing so doesn’t dramatically affect the gist of the issue. In reality, accuracy (and therefore error) is described in medical research by either sensitivity or specificity. Sensitivity describes the ability of a test to “catch” disease in a population: that is, a very sensitive test will be positive in most patients who have some disease. Specificity, by contrast, describes the ability of a test to correctly rule out disease: a very specific test will be negative in nearly everyone who does not have the disease, so a positive result is less likely to be a false alarm. An ideal screening test is one that is highly sensitive (it finds disease in most people who otherwise would have no idea they had a disease) and at least reasonably specific. In the above example, I’ve lumped these two concepts together and called it “accuracy,” but what I really did was assign the exact same value of sensitivity and specificity (i.e., 99 percent) to keep matters simple. It is generally uncommon for a test’s sensitivity and specificity to be exactly the same number.
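
  To make the distinction concrete, here is a small sketch that keeps sensitivity and specificity separate and then reruns the earlier screening arithmetic with the more realistic 90 percent figure mentioned above; as in the text, both values are set equal purely for simplicity, and the population counts are the same rounded ones used earlier:

```python
def screening_outcomes(cases, population, sensitivity, specificity):
    """Illustrative only: returns (true positives, false positives, positive
    predictive value) when an entire population is screened once and the
    test's sensitivity and specificity are taken as exact."""
    healthy = population - cases
    true_pos = sensitivity * cases            # women with cancer correctly flagged
    false_pos = (1 - specificity) * healthy   # healthy women incorrectly flagged
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv

# Same rounded figures as before: 50,000 cancers among 22 million women aged 40-49.
for accuracy in (0.99, 0.90):
    tp, fp, ppv = screening_outcomes(50_000, 22_000_000, accuracy, accuracy)
    print(f"accuracy {accuracy:.0%}: {fp:,.0f} false positives "
          f"vs {tp:,.0f} true positives, PPV {ppv:.1%}")
```

  Dropping the assumed accuracy from 99 to 90 percent makes the imbalance far worse: the false positives swamp the true positives by an even larger margin, which is exactly the point of the paragraphs above.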

  To demonstrate this, it’s helpful to explain the basic concept behind the X-ray, one of the most powerful tools in medicine. X-rays, as most of us learned in high school science class, are high-energy forms of radiation that can be detected by photographic film. X-rays can move along totally unimpeded by air, but they are stopped in their tracks by thick metals. So placing a piece of metal in front of X-ray film should leave a white outline; where there is nothing but air in the X-ray field, the color will be black.

  The human body is composed of tissues of different densities, however, and the images produced by an X-ray of a person are seen as various shades of gray. Bone is, of course, much more dense than skin and muscle, and so appears as a lighter image because fewer X-rays pass through to the film; they are blocked by the bone itself. This is a point worth keeping in mind, because those high-energy waves of radiation that don’t pass through, the very ones that create the varying shades that make an X-ray what it is, are absorbed by the body (meaning that mammograms may—and I emphasize that this is highly speculative because no well-designed trials have ever been performed to address this question—ever so slightly increase one’s risk of developing cancer in addition to detecting it). Thus, when metal, bone, and soft tissues are superimposed, you see an image like that in Figure 3.2.

  FIGURE 3.2. Bone, metal, and soft tissues form varying shades of gray in an X-ray of the arm and hand. Roughly, there are four separate shades easily seen in this film: the white of the “external fixator” here that is used to stabilize this patient’s fracture, which represents metal; the somewhat more translucent white of the bones; the dark gray of the soft tissues; and finally air, which is black.

  SOURCE: User Ashish j29, at https://en.wikipedia.org/wiki/External_fixation#/media/File:External_fixator_xray.jpg.

  If the goal of the mammogram were as simple as the exercise of finding the metal in the picture, not only would mammography be of enormous benefit, but many radiologists would be looking for other work. But, alas, locating a tumor—which is, in fact, nothing more than human tissue whose cells act in a peculiar, ultimately life-threatening manner—is a significantly greater challenge and requires years of training. Even then, radiologists are constrained by the limits of the technology. Figure 3.3, for instance, is a “blizzard” of breast tissue in one mammogram. And Figure 3.4 is a different mammogram.

  FIGURE 3.3. Mammogram 1.

  FIGURE 3.4. Mammogram 2.

  Where is the “snowball” in these “blizzards”? Which one shows cancer and which does not? Or do they both show cancerous tissue? Or is neither positive? These photos, taken from publicly available electronic archives of the National Cancer Institute, are pretty good illustrations of how tricky reading a mammogram can be. As it turns out, the first image is considered to be negative (although abnormal, for which a near-term follow-up is recommended), while the second turned out to be positive for breast cancer.

  As can be imagined by looking at these images, separating cancer from noncancer is not quite as easy a task as is commonly understood. And the interpretation can be highly dependent upon whose eyes are doing the reading: a study in the Journal of the National Cancer Institute published in 2002 looked at a radiology practice composed of two dozen doctors and found that some members of the group had up to six times more false-positive diagnoses than did their colleagues. This marked variation was true even among experienced radiologists in the group: the radiologist with the highest number of “reads” had a false-positive rate of just over 4 percent, which is laudable. Yet the radiologist who had read the second-highest number of mammograms had a false-positive rate of nearly three times that number. Other factors that might influence radiologists’ false-positive rates are where they trained, how long they have been in practice, whether they have been sued, and so on. Since a mammogram isn’t a number but an image that must be interpreted, whoever is sitting in front of the computer can have a major impact on whether a given woman will be referred for additional imaging or a biopsy. These are the ways in which uncertainty multiplies.

  Unfortunately, many advocacy organizations, as well as the medical establishment, do not as a rule acknowledge these caveats when discussing the value of mammography. Further, the media, almost genetically programmed to avoid discussions involving uncertainty, dodges the subject as well, which contributes to the misperceptions about the predictive value of a screening mammogram. When the task force reviewed the data in 2009 and concluded that the situation was more complicated, it was inevitable that many would react with outrage.

  Harm

  You might read all of this and think that, although perhaps steep, the error rate associated with mammography is a reasonable price to pay for saving a woman’s life. This is essentially the boiled-down argument that our commentator made when he took to cyberspace with his dismissal of the task force’s “bland” data: sure, mammograms may result in many overcalls, but compared to doing nothing at all, screening is unambiguously beneficial. Without doubt, that logic is compelling. Unfortunately, it ignores one highly important variable: the amount of harm that comes from a false-positive diagnosis.

  The act of measuring such harm is not easy. I’ve already alluded to the distress that a false-positive mammogram, and possibly a follow-up mammogram, can generate. In a series of eloquent essays for the New York Times in late 2010 and early 2011, Dr. Ellen Feld of Drexel University noted the feelings that swirled about as she awaited the reading of the pathologist. Her words are a case study of the kind of paralyzing fear that women must deal with when faced with the possibility that they have a tumor in their breast: “For the next 48 hours, as I wait for her call, I feel suspended, hanging from a strap cinched too tightly around my chest. Waiting to hear just how bad the news is. Waiting to hear when I will move forward and to what unfamiliar places I will go.”

 
