by Nate Silver
The Simple Mathematics of Bayes’s Theorem
If the philosophical underpinnings of Bayes’s theorem are surprisingly rich, its mathematics are stunningly simple. In its most basic form, it is just an algebraic expression with three known variables and one unknown one. But this simple formula can lead to vast predictive insights.
Bayes’s theorem is concerned with conditional probability. That is, it tells us the probability that a theory or hypothesis is true if some event has happened.
Suppose you are living with a partner and come home from a business trip to discover a strange pair of underwear in your dresser drawer. You will probably ask yourself: what is the probability that your partner is cheating on you? The condition is that you have found the underwear; the hypothesis you are interested in evaluating is the probability that you are being cheated on. Bayes’s theorem, believe it or not, can give you an answer to this sort of question—provided that you know (or are willing to estimate) three quantities:
First, you need to estimate the probability of the underwear’s appearing conditional on the hypothesis being true—that is, that he is cheating on you. Let’s assume for the sake of this problem that you are a woman and your partner is a man, and the underwear in question is a pair of panties. If he’s cheating on you, it’s certainly easy enough to imagine how the panties got there. Then again, even (and perhaps especially) if he is cheating on you, you might expect him to be more careful. Let’s say that the probability of the panties’ appearing, conditional on his cheating on you, is 50 percent.
Second, you need to estimate the probability of the underwear’s appearing conditional on the hypothesis being false. If he isn’t cheating, are there some innocent explanations for how they got there? Sure, although not all of them are pleasant (they could be his panties). It could be that his luggage got mixed up. It could be that a platonic female friend of his, whom you trust, stayed over one night. The panties could be a gift to you that he forgot to wrap up. None of these theories is inherently untenable, although some verge on dog-ate-my-homework excuses. Collectively you put their probability at 5 percent.
Third and most important, you need what Bayesians call a prior probability (or simply a prior). What is the probability you would have assigned to him cheating on you before you found the underwear? Of course, it might be hard to be entirely objective about this now that the panties have made themselves known. (Ideally, you establish your priors before you start to examine the evidence.) But sometimes, it is possible to estimate a number like this empirically. Studies have found, for instance, that about 4 percent of married partners cheat on their spouses in any given year,33 so we’ll set that as our prior.
If we’ve estimated these values, Bayes’s theorem can then be applied to establish a posterior probability. This is the number that we’re interested in: how likely is it that we’re being cheated on, given that we’ve found the underwear? The calculation (and the simple algebraic expression that yields it) is in figure 8-3.
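Although figure 8-3 is not reproduced here, the calculation is easy to restate. Writing $x$ for the prior probability of cheating, $y$ for the probability of the underwear’s appearing if he is cheating, and $z$ for the probability of its appearing if he is not (these letter labels are mine, not necessarily the figure’s), Bayes’s theorem gives the posterior probability as

\[
\frac{xy}{xy + z(1 - x)} = \frac{0.04 \times 0.50}{0.04 \times 0.50 + 0.05 \times 0.96} = \frac{0.02}{0.068} \approx 0.29.
\]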
As it turns out, this probability is still fairly low: 29 percent. This may still seem counterintuitive—aren’t those panties pretty incriminating? But it stems mostly from the fact that you had assigned a low prior probability to him cheating. Although an innocent man has fewer plausible explanations for the appearance of the panties than a guilty one, you had started out thinking he was an innocent man, so that weighs heavily in the equation.
When our priors are strong, they can be surprisingly resilient in the face of new evidence. One classic example of this is the presence of breast cancer among women in their forties. The chance that a woman will develop breast cancer in her forties is fortunately quite low—about 1.4 percent.34 But what is the probability if she has a positive mammogram?
Studies show that if a woman does not have cancer, a mammogram will incorrectly claim that she does only about 10 percent of the time.35 If she does have cancer, on the other hand, the mammogram will detect it about 75 percent of the time.36 When you see those statistics, a positive mammogram seems like very bad news indeed. But if you apply Bayes’s theorem to these numbers, you’ll come to a different conclusion: the chance that a woman in her forties has breast cancer given that she’s had a positive mammogram is still only about 10 percent. These false positives dominate the equation because very few young women have breast cancer to begin with. For this reason, many doctors recommend that women do not begin getting regular mammograms until they are in their fifties and the prior probability of having breast cancer is higher.37
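As a check on that 10 percent figure, the same formula can be applied with the numbers in this paragraph: a prior of 1.4 percent, a 75 percent detection rate, and a 10 percent false positive rate:

\[
\frac{0.014 \times 0.75}{0.014 \times 0.75 + 0.986 \times 0.10} = \frac{0.0105}{0.1091} \approx 0.10.
\]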
Problems like these are no doubt challenging. A recent study that polled the statistical literacy of Americans presented this breast cancer example to them—and found that just 3 percent of them came up with the right probability estimate.38 Sometimes, slowing down to look at the problem visually (as in figure 8-4) can provide a reality check against our inaccurate approximations. The visualization makes it easier to see the bigger picture—because breast cancer is so rare in young women, the fact of a positive mammogram is not all that telling.
FIGURE 8-4: BAYES’S THEOREM—MAMMOGRAM EXAMPLE
Usually, however, we focus on the newest or most immediately available information, and the bigger picture gets lost. Smart gamblers like Bob Voulgaris have learned to take advantage of this flaw in our thinking. He made a profitable bet on the Lakers in part because the bookmakers placed too much emphasis on the Lakers’ first several games, lengthening their odds of winning the title from 4-to-1 to 6½-to-1, even though their performance was about what you might expect from a good team that had one of its star players injured. Bayes’s theorem requires us to think through these problems more carefully and can be very useful for detecting when our gut-level approximations are much too crude.
This is not to suggest that our priors always dominate the new evidence, however, or that Bayes’s theorem inherently produces counterintuitive results. Sometimes, the new evidence is so powerful that it overwhelms everything else, and we can go from assigning a near-zero probability of something to a near-certainty of it almost instantly.
Consider a somber example: the September 11 attacks. Most of us would have assigned almost no probability to terrorists crashing planes into buildings in Manhattan when we woke up that morning. But we recognized that a terror attack was an obvious possibility once the first plane hit the World Trade Center. And we had no doubt we were being attacked once the second tower was hit. Bayes’s theorem can replicate this result.
For instance, say that before the first plane hit, our estimate of the probability of a terror attack on tall buildings in Manhattan was just 1 chance in 20,000, or 0.005 percent. However, we would also have assigned a very low probability to a plane hitting the World Trade Center by accident. This figure can actually be estimated empirically: in the 25,000 days of aviation over Manhattan39 prior to September 11, there had been two such accidents: one involving the Empire State Building in 1945 and another at 40 Wall Street in 1946. That would make the probability of such an accident about 1 chance in 12,500 on any given day. If you use Bayes’s theorem to run these numbers (figure 8-5a), the probability we’d assign to a terror attack increased from 0.005 percent to 38 percent the moment that the first plane hit.
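To reconstruct the 38 percent result (figure 8-5a itself is not shown here), one more input is needed: the probability that a plane would hit a Manhattan skyscraper if terrorists really were attacking. Assuming that value is essentially 100 percent, which is the assumption that reproduces the figure cited above, the update runs

\[
\frac{0.00005 \times 1}{0.00005 \times 1 + 0.99995 \times 0.00008} \approx \frac{0.00005}{0.00013} \approx 0.38,
\]

where 0.00005 is the prior (1 chance in 20,000) and 0.00008 is the chance of an accidental strike on any given day (1 chance in 12,500).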
The idea behind Bayes’s theorem, however, is not that we update our probability estimates just once. Instead, we do so continuously as new evidence presents itself to us. Thus, our posterior probability of a terror attack after the first plane hit, 38 percent, becomes our prior probability before the second plane hit. And if you go through the calculation again, to reflect the second plane hitting the World Trade Center, the probability that we were under attack becomes a near-certainty—99.99 percent. One accident on a bright sunny day in New York was unlikely enough, but a second one was almost a literal impossibility, as we all horribly deduced.
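Running the same calculation with the updated prior of 38 percent, and keeping the same 1-in-12,500 chance of an accidental strike (and the same assumption about a deliberate one), gives the near-certainty described above:

\[
\frac{0.38 \times 1}{0.38 \times 1 + 0.62 \times 0.00008} = \frac{0.38}{0.38005} \approx 0.9999.
\]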
I have deliberately picked some challenging examples—terror attacks, cancer, being cheated on—because I want to demonstrate the breadth of problems to which Bayesian reasoning can be applied. Bayes’s theorem is not any kind of magic formula—in the simple form that we have used here, it consists of nothing more than addition, subtraction, multiplication, and division. We have to provide it with information, particularly our estimates of the prior probabilities, for it to yield useful results.
However, Bayes’s theorem does require us to think probabilistically about the world, even when it comes to issues that we don’t like to think of as being matters of chance. This does not require us to have taken the position that the world is intrinsically, metaphysically uncertain—Laplace thought everything from the orbits of the planets to the behavior of the smallest molecules was governed by orderly Newtonian rules, and yet he was instrumental in the development of Bayes’s theorem. Rather, Bayes’s theorem deals with epistemological uncertainty—the limits of our knowledge.
The Problem of False Positives
When we fail to think like Bayesians, false positives are a problem not just for mammograms but for all of science. In the introduction to this book, I noted the work of the medical researcher John P. A. Ioannidis. In 2005, Ioannidis published an influential paper, “Why Most Published Research Findings Are False,”40 in which he cited a variety of statistical and theoretical arguments to claim that (as his title implies) the majority of hypotheses deemed to be true in journals in medicine and most other academic and scientific professions are, in fact, false.
Ioannidis’s hypothesis, as we mentioned, looks to be one of the true ones; Bayer Laboratories found that they could not replicate about two-thirds of the positive findings claimed in medical journals when they attempted the experiments themselves.41 Another way to check the veracity of a research finding is to see whether it makes accurate predictions in the real world—and as we have seen throughout this book, it very often does not. The failure rate for predictions made in entire fields ranging from seismology to political science appears to be extremely high.
“In the last twenty years, with the exponential growth in the availability of information, genomics, and other technologies, we can measure millions and millions of potentially interesting variables,” Ioannidis told me. “The expectation is that we can use that information to make predictions work for us. I’m not saying that we haven’t made any progress. Taking into account that there are a couple of million papers, it would be a shame if there wasn’t. But there are obviously not a couple of million discoveries. Most are not really contributing much to generating knowledge.”
This is why our predictions may be more prone to failure in the era of Big Data. As there is an exponential increase in the amount of available information, there is likewise an exponential increase in the number of hypotheses to investigate. For instance, the U.S. government now publishes data on about 45,000 economic statistics. If you want to test for relationships between every pair of these statistics—is there a causal relationship between the bank prime loan rate and the unemployment rate in Alabama?—that gives you literally one billion hypotheses to test.*
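The billion-hypothesis figure is simply the number of distinct pairs that can be formed from 45,000 statistics:

\[
\binom{45{,}000}{2} = \frac{45{,}000 \times 44{,}999}{2} \approx 1.01 \times 10^{9}.
\]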
But the number of meaningful relationships in the data—those that speak to causality rather than correlation and testify to how the world really works—is orders of magnitude smaller. Nor is it likely to be increasing at nearly so fast a rate as the information itself; there isn’t any more truth in the world than there was before the Internet or the printing press. Most of the data is just noise, just as most of the universe is filled with empty space.
Meanwhile, as we know from Bayes’s theorem, when the underlying incidence of something in a population is low (breast cancer in young women; truth in the sea of data), false positives can dominate the results if we are not careful. Figure 8-6 represents this graphically. In the figure, 80 percent of true scientific hypotheses are correctly deemed to be true, and about 90 percent of false hypotheses are correctly rejected. And yet, because true findings are so rare, about two-thirds of the findings deemed to be true are actually false!
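The exact base rate used in figure 8-6 is not stated in the text, but a worked version of the same arithmetic shows how the two-thirds figure can arise. Suppose, for illustration, that only about 1 in 20 hypotheses being tested is actually true. With an 80 percent chance of confirming a true hypothesis and a 10 percent chance of wrongly confirming a false one, the share of “confirmed” findings that are actually false is

\[
\frac{0.10 \times 0.95}{0.80 \times 0.05 + 0.10 \times 0.95} = \frac{0.095}{0.135} \approx 0.70,
\]

or roughly two-thirds.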
Unfortunately, as Ioannidis figured out, the state of published research in most fields that conduct statistical testing is probably very much like what you see in figure 8-6.* Why is the error rate so high? To some extent, this entire book represents an answer to that question. There are many reasons for it—some having to do with our psychological biases, some having to do with common methodological errors, and some having to do with misaligned incentives. Close to the root of the problem, however, is a flawed type of statistical thinking that these researchers are applying.
FIGURE 8-6: A GRAPHICAL REPRESENTATION OF FALSE POSITIVES
When Statistics Backtracked from Bayes
Perhaps the chief intellectual rival to Thomas Bayes—although he was born in 1890, nearly 130 years after Bayes’s death—was an English statistician and biologist named Ronald Aylmer (R. A.) Fisher. Fisher was a much more colorful character than Bayes, almost in the English intellectual tradition of Christopher Hitchens. He was handsome but a slovenly dresser,42 always smoking his pipe or his cigarettes, constantly picking fights with his real and imagined rivals. He was a mediocre lecturer but an incisive writer with a flair for drama, and an engaging and much-sought-after dinner companion. Fisher’s interests were wide-ranging: he was one of the best biologists of his day and one of its better geneticists, but was an unabashed elitist who bemoaned the fact that the poorer classes were having more offspring than the intellectuals.43 (Fisher dutifully had eight children of his own.)
Fisher is probably more responsible than any other individual for the statistical methods that remain in wide use today. He developed the terminology of the statistical significance test and much of the methodology behind it. He was also no fan of Bayes and Laplace—Fisher was the first person to use the term “Bayesian” in a published article, and he used it in a derogatory way,44 at another point asserting that the theory “must be wholly rejected.”45
Fisher and his contemporaries had no problem with the formula called Bayes’s theorem per se, which is just a simple mathematical identity. Instead, they were worried about how it might be applied. In particular, they took issue with the notion of the Bayesian prior.46 It all seemed too subjective: we have to stipulate, in advance, how likely we think something is before embarking on an experiment about it? Doesn’t that cut against the notion of objective science?
So Fisher and his contemporaries instead sought to develop a set of statistical methods that they hoped would free us from any possible contamination from bias. This brand of statistics is usually called “frequentism” today, although the term “Fisherian” (as opposed to Bayesian) is sometimes applied to it.47
The idea behind frequentism is that uncertainty in a statistical problem results exclusively from collecting data among just a sample of the population rather than the whole population. This makes the most sense in the context of something like a political poll. A survey in California might sample eight hundred people rather than the eight million that will turn out to vote in an upcoming election there, producing what’s known as sampling error. The margin of error that you see reported alongside political polls is a measure of this: exactly how much error is introduced because you survey eight hundred people in a population of eight million? The frequentist methods are designed to quantify this.
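For the California example, the standard frequentist margin of error at the 95 percent confidence level, computed in the worst case of a 50-50 split, would be roughly

\[
1.96 \times \sqrt{\frac{0.5 \times 0.5}{800}} \approx 0.035,
\]

or about plus-or-minus 3.5 percentage points. (The exact number depends on the pollster’s conventions; this is only the textbook formula.)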
Even in the context of political polling, however, sampling error does not always tell the whole story. In the brief interval between the Iowa Democratic caucus and New Hampshire Democratic Primary in 2008, about 15,000 people were surveyed48 in New Hampshire—an enormous number in a small state, enough that the margin of error on the polls was theoretically just plus-or-minus 0.8 percent. The actual error in the polls was about ten times that, however: Hillary Clinton won the state by three points when the polls had her losing to Barack Obama by eight. Sampling error—the only type of error that frequentist statistics directly account for—was the least of the problem in the case of the New Hampshire polls.
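The plus-or-minus 0.8 percent figure follows from the same textbook formula applied to a sample of 15,000:

\[
1.96 \times \sqrt{\frac{0.5 \times 0.5}{15{,}000}} \approx 0.008,
\]

which is roughly a tenth of the error that actually occurred.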
Likewise, some polling firms consistently show a bias toward one or another party:49 they could survey all 200 million American adults and they still wouldn’t get the numbers right. Bayes had these problems figured out 250 years ago. If you’re using a biased instrument, it doesn’t matter how many measurements you take—you’re aiming at the wrong target.
Essentially, the frequentist approach toward statistics seeks to wash its hands of the reason that predictions most often go wrong: human error. It views uncertainty as something intrinsic to the experiment rather than something intrinsic to our ability to understand the real world. The frequentist method also implies that, as you collect more data, your error will eventually approach zero: this will be both necessary and sufficient to solve any problems. Many of the more problematic areas of prediction in this book come from fields in which useful data is sparse, and it is indeed usually valuable to collect more of it. However, it is hardly a golden road to statistical perfection if you are not using it in a sensible way. As Ioannidis noted, the era of Big Data only seems to be worsening the problems of false positive findings in the research literature.
Nor is the frequentist method particularly objective, either in theory or in practice. Instead, it relies on a whole host of assumptions. It usually presumes that the underlying uncertainty in a measurement follows a bell-curve or normal distribution. This is often a good assumption, but not in the case of something like the variation in the stock market. The frequentist approach requires defining a sample population, something that is straightforward in the case of a political poll but which is largely arbitrary in many other practical applications. What “sample population” was the September 11 attack drawn from?