The answer to this question—how to distill all the various possible hypotheses into a single specific expectation—would be discovered only a few years later, by the French mathematician Pierre-Simon Laplace.
Laplace’s Law
Laplace was born in Normandy in 1749, and his father sent him to a Catholic school with the intent that he join the clergy. Laplace went on to study theology at the University of Caen, but unlike Bayes—who balanced spiritual and scientific devotions his whole life—he ultimately abandoned the cloth entirely for mathematics.
In 1774, completely unaware of the previous work by Bayes, Laplace published an ambitious paper called “Treatise on the Probability of the Causes of Events.” In it, Laplace finally solved the problem of how to make inferences backward from observed effects to their probable causes.
Bayes, as we saw, had found a way to compare the relative probability of one hypothesis to another. But in the case of a raffle, there is literally an infinite number of hypotheses: one for every conceivable proportion of winning tickets. Using calculus, the once-controversial mathematics of which Bayes had been an important defender, Laplace was able to prove that this vast spectrum of possibilities could be distilled down to a single estimate, and a stunningly concise one at that. If we really know nothing about our raffle ahead of time, he showed, then after drawing a winning ticket on our first try we should expect that the proportion of winning tickets in the whole pool is exactly 2/3. If we buy three tickets and all of them are winners, the expected proportion of winning tickets is exactly 4/5. In fact, for any possible drawing of w winning tickets in n attempts, the expectation is simply the number of wins plus one, divided by the number of attempts plus two: (w+1)⁄(n+2).
This incredibly simple scheme for estimating probabilities is known as Laplace’s Law, and it is easy to apply in any situation where you need to assess the chances of an event based on its history. If you make ten attempts at something and five of them succeed, Laplace’s Law estimates your overall chances to be 6/12 or 50%, consistent with our intuitions. If you try only once and it works out, Laplace’s estimate of 2/3 is both more reasonable than assuming you’ll win every time, and more actionable than Price’s guidance (which would tell us that there is a 75% metaprobability of a 50% or greater chance of success).
Laplace went on to apply his statistical approach to a wide range of problems of his time, including assessing whether babies are truly equally likely to be born male or female. (He established, to a virtual certainty, that male infants are in fact slightly more likely than female ones.) He also wrote the Philosophical Essay on Probabilities, arguably the first book about probability for a general audience and still one of the best, laying out his theory and considering its applications to law, the sciences, and everyday life.
Laplace’s Law offers us the first simple rule of thumb for confronting small data in the real world. Even when we’ve made only a few observations—or only one—it offers us practical guidance. Want to calculate the chance your bus is late? The chance your softball team will win? Count the number of times it has happened in the past plus one, then divide by the number of opportunities plus two. And the beauty of Laplace’s Law is that it works equally well whether we have a single data point or millions of them. Little Annie’s faith that the sun will rise tomorrow is justified, it tells us: with an Earth that’s seen the sun rise for about 1.6 trillion days in a row, the chance of another sunrise on the next “attempt” is all but indistinguishable from 100%.
Bayes’s Rule and Prior Beliefs
All these suppositions are consistent and conceivable. Why should we give the preference to one, which is no more consistent or conceivable than the rest?
—DAVID HUME
Laplace also considered another modification to Bayes’s argument that would prove crucial: how to handle hypotheses that are simply more probable than others. For instance, while it’s possible that a lottery might give away prizes to 99% of the people who buy tickets, it’s more likely—we’d assume—that they would give away prizes to only 1%. That assumption should be reflected in our estimates.
To make things concrete, let’s say a friend shows you two different coins. One is a normal, “fair” coin with a 50–50 chance of heads and tails; the other is a two-headed coin. He drops them into a bag and then pulls one out at random. He flips it once: heads. Which coin do you think your friend flipped?
Bayes’s scheme of working backward makes short work of this question. A flip coming up heads happens 50% of the time with a fair coin and 100% of the time with a two-headed coin. Thus we can assert confidently that it’s 100%⁄50%, or exactly twice as probable, that the friend had pulled out the two-headed coin.
Now consider the following twist. This time, the friend shows you nine fair coins and one two-headed coin, puts all ten into a bag, draws one at random, and flips it: heads. Now what do you suppose? Is it a fair coin or the two-headed one?
Laplace’s work anticipated this wrinkle, and here again the answer is impressively simple. As before, a fair coin is exactly half as likely to come up heads as a two-headed coin. But now, a fair coin is also nine times as likely to have been drawn in the first place. It turns out that we can just take these two different considerations and multiply them together: it is exactly four and a half times more likely that your friend is holding a fair coin than the two-headed one.
The mathematical formula that describes this relationship, tying together our previously held ideas and the evidence before our eyes, has come to be known—ironically, as the real heavy lifting was done by Laplace—as Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
Notably, having some preexisting beliefs is crucial for this formula to work. If your friend simply approached you and said, “I flipped one coin from this bag and it came up heads. How likely do you think it is that this is a fair coin?,” you would be totally unable to answer that question unless you had at least some sense of what coins were in the bag to begin with. (You can’t multiply the two probabilities together when you don’t have one of them.) This sense of what was “in the bag” before the coin flip—the chances for each hypothesis to have been true before you saw any data—is known as the prior probabilities, or “prior” for short. And Bayes’s Rule always needs some prior from you, even if it’s only a guess. How many two-headed coins exist? How easy are they to get? How much of a trickster is your friend, anyway?
The fact that Bayes’s Rule is dependent on the use of priors has at certain points in history been considered controversial, biased, even unscientific. But in reality, it is quite rare to go into a situation so totally unfamiliar that our mind is effectively a blank slate—a point we’ll return to momentarily.
When you do have some estimate of prior probabilities, meanwhile, Bayes’s Rule applies to a wide range of prediction problems, be they of the big-data variety or the more common small-data sort. Computing the probability of winning a raffle or tossing heads is only the beginning. The methods developed by Bayes and Laplace can offer help any time you have uncertainty and a bit of data to work with. And that’s exactly the situation we face when we try to predict the future.
The Copernican Principle
It’s difficult to make predictions, especially about the future.
—DANISH PROVERB
When J. Richard Gott arrived at the Berlin Wall, he asked himself a very simple question: Where am I? That is to say, where in the total life span of this artifact have I happened to arrive? In a way, he was asking the temporal version of the spatial question that had obsessed the astronomer Nicolaus Copernicus four hundred years earlier: Where are we? Where in the universe is the Earth? Copernicus would make the radical paradigm shift of imagining that the Earth was not the bull’s-eye center of the universe—that it was, in fact, nowhere special in particular. Gott decided to take the same step with regard to time.r />
He made the assumption that the moment when he encountered the Berlin Wall wasn’t special—that it was equally likely to be any moment in the wall’s total lifetime. And if any moment was equally likely, then on average his arrival should have come precisely at the halfway point (since it was 50% likely to fall before halfway and 50% likely to fall after). More generally, unless we know better we can expect to have shown up precisely halfway into the duration of any given phenomenon.* And if we assume that we’re arriving precisely halfway into something’s duration, the best guess we can make for how long it will last into the future becomes obvious: exactly as long as it’s lasted already. Gott saw the Berlin Wall eight years after it was built, so his best guess was that it would stand for eight years more. (It ended up being twenty.)
This straightforward reasoning, which Gott named the Copernican Principle, results in a simple algorithm that can be used to make predictions about all sorts of topics. Without any preconceived expectations, we might use it to obtain predictions for the end of not only the Berlin Wall but any number of other short- and long-lived phenomena. The Copernican Principle predicts that the United States of America will last as a nation until approximately the year 2255, that Google will last until roughly 2032, and that the relationship your friend began a month ago will probably last about another month (maybe tell him not to RSVP to that wedding invitation just yet). Likewise, it tells us to be skeptical when, for instance, a recent New Yorker cover depicts a man holding a six-inch smartphone with a familiar grid of square app icons, and the caption reads “2525.” Doubtful. The smartphone as we know it is barely a decade old, and the Copernican Principle tells us that it isn’t likely to be around in 2025, let alone five centuries later. By 2525 it’d be mildly surprising if there were even a New York City.
More practically, if we’re considering employment at a construction site whose signage indicates that it’s been “7 days since the last industrial accident,” we might want to stay away, unless it’s a particularly short job we plan to do. And if a municipal transit system cannot afford the incredibly useful but expensive real-time signs that tell riders when the next bus is going to arrive, the Copernican Principle suggests that there might be a dramatically simpler and cheaper alternative. Simply displaying how long it’s been since the previous bus arrived at that stop offers a substantial hint about when the next one will.
But is the Copernican Principle right? After Gott published his conjecture in Nature, the journal received a flurry of critical correspondence. And it’s easy to see why when we try to apply the rule to some more familiar examples. If you meet a 90-year-old man, the Copernican Principle predicts he will live to 180. Every 6-year-old boy, meanwhile, is predicted to face an early death at the tender age of 12.
To understand why the Copernican Principle works, and why it sometimes doesn’t, we need to return to Bayes. Because despite its apparent simplicity, the Copernican Principle is really an instance of Bayes’s Rule.
Bayes Meets Copernicus
When predicting the future, such as the longevity of the Berlin Wall, the hypotheses we need to evaluate are all the possible durations of the phenomenon at hand: will it last a week, a month, a year, a decade? To apply Bayes’s Rule, as we have seen, we first need to assign a prior probability to each of these durations. And it turns out that the Copernican Principle is exactly what results from applying Bayes’s Rule using what is known as an uninformative prior.
At first this may seem like a contradiction in terms. If Bayes’s Rule always requires us to specify our prior expectations and beliefs, how could we tell it that we don’t have any? In the case of a raffle, one way to plead ignorance would be to assume what’s called the “uniform prior,” which considers every proportion of winning tickets to be equally likely.* In the case of the Berlin Wall, an uninformative prior means saying that we don’t know anything about the time span we’re trying to predict: the wall could equally well come down in the next five minutes or last for five millennia.
Aside from that uninformative prior, the only piece of data we supply to Bayes’s Rule, as we’ve seen, is the fact that we’ve encountered the Berlin Wall when it is eight years old. Any hypothesis that would have predicted a less than eight-year life span for the wall is thereby ruled out immediately, since those hypotheses can’t account for our situation at all. (Similarly, a two-headed coin is ruled out by the first appearance of tails.) Anything longer than eight years is within the realm of possibility—but if the wall were going to be around for a million years, it would be a big coincidence that we happened to bump into it so very close to the start of its existence. Therefore, even though enormously long life spans cannot be ruled out, neither are they very likely.
When Bayes’s Rule combines all these probabilities—the more-probable short time spans pushing down the average forecast, the less-probable yet still possible long ones pushing it up—the Copernican Principle emerges: if we want to predict how long something will last, and have no other knowledge about it whatsoever, the best guess we can make is that it will continue just as long as it’s gone on so far.
In fact, Gott wasn’t even the first to propose something like the Copernican Principle. In the mid-twentieth century, the Bayesian statistician Harold Jeffreys had looked into determining the number of tramcars in a city given the serial number on just one tramcar, and came up with the same answer: double the serial number. And a similar problem had arisen even earlier, during World War II, when the Allies sought to estimate the number of tanks being produced by Germany. Purely mathematical estimates based on captured tanks’ serial numbers predicted that the Germans were producing 246 tanks every month, while estimates obtained by extensive (and highly risky) aerial reconnaissance suggested the figure was more like 1,400. After the war, German records revealed the true figure: 245.
Recognizing that the Copernican Principle is just Bayes’s Rule with an uninformative prior answers a lot of questions about its validity. The Copernican Principle seems reasonable exactly in those situations where we know nothing at all—such as looking at the Berlin Wall in 1969, when we’re not even sure what timescale is appropriate. And it feels completely wrong in those cases where we do know something about the subject matter. Predicting that a 90-year-old man will live to 180 years seems unreasonable precisely because we go into the problem already knowing a lot about human life spans—and so we can do better. The richer the prior information we bring to Bayes’s Rule, the more useful the predictions we can get out of it.
Real-World Priors …
In the broadest sense, there are two types of things in the world: things that tend toward (or cluster around) some kind of “natural” value, and things that don’t.
Human life spans are clearly in the former category. They roughly follow what’s termed a “normal” distribution—also known as the “Gaussian” distribution, after the German mathematician Carl Friedrich Gauss, and informally called the “bell curve” for its characteristic shape. This shape does a good job of characterizing human life spans; the average life span for men in the United States, for instance, is centered at about 76 years, and the probabilities fall off fairly sharply to either side. Normal distributions tend to have a single appropriate scale: a one-digit life span is considered tragic, a three-digit one extraordinary. Many other things in the natural world are normally distributed as well, from human height, weight, and blood pressure to the noontime temperature in a city and the diameter of fruits in an orchard.
There are a number of things in the world that don’t look normally distributed, however—not by a long shot. The average population of a town in the United States, for instance, is 8,226. But if you were to make a graph of the number of towns by population, you wouldn’t see anything remotely like a bell curve. There would be way more towns smaller than 8,226 than larger. At the same time, the larger ones would be way bigger than the average. This kind of pattern typifies what are called “power-law distributions.” These are also known
as “scale-free distributions” because they characterize quantities that can plausibly range over many scales: a town can have tens, hundreds, thousands, tens of thousands, hundreds of thousands, or millions of residents, so we can’t pin down a single value for how big a “normal” town should be.
The power-law distribution characterizes a host of phenomena in everyday life that have the same basic quality as town populations: most things below the mean, and a few enormous ones above it. Movie box-office grosses, which can range from four to ten figures, are another example. Most movies don’t make much money at all, but the occasional Titanic makes … well, titanic amounts.
In fact, money in general is a domain full of power laws. Power-law distributions characterize both people’s wealth and people’s incomes. The mean income in America, for instance, is $55,688—but because income is roughly power-law distributed, we know, again, that many more people will be below this mean than above it, while those who are above might be practically off the charts. So it is: two-thirds of the US population make less than the mean income, but the top 1% make almost ten times the mean. And the top 1% of the 1% make ten times more than that.
It’s often lamented that “the rich get richer,” and indeed the process of “preferential attachment” is one of the surest ways to produce a power-law distribution. The most popular websites are the most likely to get incoming links; the most followed online celebrities are the ones most likely to gain new fans; the most prestigious firms are the ones most likely to attract new clients; the biggest cities are the ones most likely to draw new residents. In every case, a power-law distribution will result.
Bayes’s Rule tells us that when it comes to making predictions based on limited evidence, few things are as important as having good priors—that is, a sense of the distribution from which we expect that evidence to have come. Good predictions thus begin with having good instincts about when we’re dealing with a normal distribution and when with a power-law distribution. As it turns out, Bayes’s Rule offers us a simple but dramatically different predictive rule of thumb for each.
Algorithms to Live By Page 17