Algorithms to Live By


by Brian Christian


  … and Their Prediction Rules

  Did you mean “this could go on forever” in a good way?

  —BEN LERNER

  Examining the Copernican Principle, we saw that when Bayes’s Rule is given an uninformative prior, it always predicts that the total life span of an object will be exactly double its current age. In fact, the uninformative prior, with its wildly varying possible scales—the wall that might last for months or for millennia—is a power-law distribution. And for any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction; in other power-law cases, the multiplier will depend on the exact distribution you’re working with. For the grosses of movies, for instance, it happens to be about 1.4. So if you hear a movie has made $6 million so far, you can guess it will make about $8.4 million overall; if it’s made $90 million, guess it will top out at $126 million.
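
  In code, the Multiplicative Rule is a one-line scaling. Here is a minimal sketch in Python; the function name and the choice to pass the factor as a parameter are illustrative, not anything specified in the text.

```python
# A minimal sketch of the Multiplicative Rule (illustrative, not the book's code):
# given a single observation and a constant factor derived from the power-law prior,
# the Bayesian prediction is simply observation * factor.

def multiplicative_predict(observed_so_far: float, factor: float) -> float:
    """Predict the total quantity by scaling what has been observed so far."""
    return observed_so_far * factor

# The uninformative (Copernican) prior gives a factor of 2;
# movie grosses, per the text, come out to roughly 1.4.
print(multiplicative_predict(8, 2))        # a wall standing 8 years -> predict ~16 total
print(multiplicative_predict(6e6, 1.4))    # $6 million so far -> about $8.4 million overall
print(multiplicative_predict(90e6, 1.4))   # $90 million so far -> about $126 million overall
```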

  This multiplicative rule is a direct consequence of the fact that power-law distributions do not specify a natural scale for the phenomenon they’re describing. The only thing that gives us a sense of scale for our prediction, therefore, is the single data point we have—such as the fact that the Berlin Wall has stood for eight years. The larger the value of that single data point, the larger the scale we’re probably dealing with, and vice versa. It’s possible that a movie that’s grossed $6 million is actually a blockbuster in its first hour of release, but it’s far more likely to be just a single-digit-millions kind of movie.

  When we apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide. For instance, if somebody is younger than the average life span, then simply predict the average; as their age gets close to and then exceeds the average, predict that they’ll live a few years more. Following this rule gives reasonable predictions for the 90-year-old and the 6-year-old: 94 and 77, respectively. (The 6-year-old gets a tiny edge over the population average of 76 by virtue of having made it through infancy: we know he’s not in the distribution’s left tail.)
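
  To see how the Average Rule falls out of a normal prior, one can ask: given that someone has already reached their current age, what total life span should we expect? A rough sketch in Python, using only the standard library; the parameters (a mean of 76 and a standard deviation of 9) are illustrative stand-ins rather than real actuarial figures.

```python
# A rough sketch of the Average Rule under a normal prior (parameters are illustrative).
# The expected total life span given survival to the current age is the mean of a
# normal distribution truncated on the left at that age:
#     E[T | T > age] = mean + sd * pdf(z) / sf(z),  where z = (age - mean) / sd
import math

def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_sf(z: float) -> float:
    """Survival function (1 - CDF) of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def average_rule_predict(current_age: float, mean: float = 76.0, sd: float = 9.0) -> float:
    z = (current_age - mean) / sd
    return mean + sd * normal_pdf(z) / normal_sf(z)

print(round(average_rule_predict(90)))  # roughly 94: the 90-year-old gets a few more years
print(round(average_rule_predict(6)))   # roughly 76: the 6-year-old is pulled back to the average
```

  With these toy parameters the sketch lands near the text's number for the 90-year-old; the extra year the text gives the 6-year-old reflects the real distribution's left tail (infant mortality), which a pure bell curve doesn't capture.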

  Movie running times, like human lifetimes, also follow a normal distribution: most films cluster right around a hundred minutes or so, with diminishing numbers of exceptions tailing off to either side. But not all human activities are so well behaved. The poet Dean Young once remarked that whenever he’s listening to a poem in numbered sections, his heart sinks if the reader announces the start of section four: if there are more than three parts, all bets are off, and Young needs to hunker down for an earful. It turns out that Young’s dismay is, in fact, perfectly Bayesian. An analysis of poems shows that, unlike movie running times, poems follow something closer to a power-law than a normal distribution: most poems are short, but some are epics. So when it comes to poetry, make sure you’ve got a comfortable seat. Something normally distributed that’s gone on seemingly too long is bound to end shortly; but the longer something in a power-law distribution has gone on, the longer you can expect it to keep going.

  Between those two extremes, there’s actually a third category of things in life: those that are neither more nor less likely to end just because they’ve gone on for a while. Sometimes things are simply … invariant. The Danish mathematician Agner Krarup Erlang, who studied such phenomena, formalized the spread of intervals between independent events into the function that now carries his name: the Erlang distribution. The shape of this curve differs from both the normal and the power-law: it has a winglike contour, rising to a gentle hump, with a tail that falls off faster than a power-law but more slowly than a normal distribution. Erlang himself, working for the Copenhagen Telephone Company in the early twentieth century, used it to model how much time could be expected to pass between successive calls on a phone network. Since then, the Erlang distribution has also been used by urban planners and architects to model car and pedestrian traffic, and by networking engineers designing infrastructure for the Internet. There are a number of domains in the natural world, too, where events are completely independent from one another and the intervals between them thus fall on an Erlang curve. Radioactive decay is one example, which means that the Erlang distribution perfectly models when to expect the next ticks of a Geiger counter. It also turns out to do a pretty good job of describing certain human endeavors—such as the amount of time politicians stay in the House of Representatives.
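
  To make that shape concrete: an Erlang-distributed wait can be read as the time until the k-th of a stream of independent events, that is, a sum of independent exponential intervals. A minimal sampling sketch, with the shape and rate chosen purely for illustration:

```python
# A minimal sketch of an Erlang-shaped wait (shape and rate are illustrative choices,
# not figures from the text): the time until the k-th independent event arrives,
# built as a sum of k exponential inter-event times.
import random

def erlang_sample(k: int, rate: float) -> float:
    """One draw from an Erlang(k, rate): the sum of k independent exponential intervals."""
    return sum(random.expovariate(rate) for _ in range(k))

# Average wait for the 3rd call when calls arrive at 2 per minute: about k / rate = 1.5 minutes.
samples = [erlang_sample(3, 2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))
```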

  The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer. The familiar refrain of “Just five more minutes!… [five minutes later] Five more minutes!” that so often characterizes human claims regarding, say, one’s readiness to leave the house or office, or the time until the completion of some task, may seem indicative of some chronic failure to make realistic estimates. Well, in the cases where one’s up against an Erlang distribution, anyway, that refrain happens to be correct.

  If a casino card-playing enthusiast tells his impatient spouse, for example, that he’ll quit for the day after hitting one more blackjack (the odds of which are about 20 to 1), he might cheerily predict, “I’ll be done in about twenty more hands!” If, an unlucky twenty hands later, she returns, asking how long he’s going to make her wait now, his answer will be unchanged: “I’ll be done in about twenty more hands!” It sounds like our indefatigable card shark has suffered a short-term memory loss—but, in fact, his prediction is entirely correct. Indeed, distributions that yield the same prediction, no matter their history or current state, are known to statisticians as “memoryless.”
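
  The card player's situation is easy to check by simulation. A quick sketch (my own illustration; the 1-in-21 probability is a stand-in for the text's "about 20 to 1" odds) shows that the expected number of additional hands stays the same no matter how many have already gone by:

```python
# Simulating the memoryless wait for a blackjack (illustrative; 1/21 approximates
# the text's "about 20 to 1" odds of being dealt a blackjack on any given hand).
import random

P_BLACKJACK = 1 / 21

def hands_until_blackjack() -> int:
    """Count the hands played until the first blackjack arrives."""
    hands = 1
    while random.random() > P_BLACKJACK:
        hands += 1
    return hands

waits = [hands_until_blackjack() for _ in range(200_000)]

# Expected *additional* hands, given that some hands have already gone by without a blackjack.
for already_waited in (0, 20, 40):
    remaining = [w - already_waited for w in waits if w > already_waited]
    print(already_waited, round(sum(remaining) / len(remaining), 1))
# All three averages hover around 21: the prediction never changes, whatever the history.
```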

  Different prior distributions and their prediction rules.

  These three very different patterns of optimal prediction—the Multiplicative, Average, and Additive Rules—all result directly from applying Bayes’s Rule to the power-law, normal, and Erlang distributions, respectively. And given the way those predictions come out, the three distributions offer us different guidance, too, on how surprised we should be by certain events.

  In a power-law distribution, the longer something has gone on, the longer we expect it to continue going on. So a power-law event is more surprising the longer we’ve been waiting for it—and maximally surprising right before it happens. A nation, corporation, or institution only grows more venerable with each passing year, so it’s always stunning when it collapses.

  In a normal distribution, events are surprising when they’re early—since we expected them to reach the average—but not when they’re late. Indeed, by that point they seem overdue to happen, so the longer we wait, the more we expect them.

  And in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur. Any state of affairs is always equally likely to end regardless of how long it’s lasted. No wonder politicians are always thinking about their next election.

  Gambling is characterized by a similar kind of steady-state expectancy. If your wait for, say, a win at the roulette wheel were characterized by a normal distribution, then the Average Rule would apply: after a run of bad luck, it’d tell you that your number should be coming any second, probably followed by more losing spins. (In that case, it’d make sense to press on to the next win and then quit.) If, instead, the wait for a win obeyed a power-law distribution, then the Multiplicative Rule would tell you that winning spins follow quickly after one another, but the longer a drought had gone on the longer it would probably continue. (In that scenario, you’d be right to keep playing for a while after any win, but give up after a losing streak.) Up against a memoryless distribution, however, you’re stuck. The Additive Rule tells you the chance of a win now is the same as it was an hour ago, and the same as it will be an hour from now. Nothing ever changes. You’re not rewarded for sticking it out and ending on a high note; neither is there a tipping point when you should just cut your losses. In “The Gambler,” Kenny Rogers famously advised that you’ve got to “Know when to walk away / Know when to run”—but for a memoryless distribution, there is no right time to quit. This may in part explain these games’ addictiveness.

  Knowing what distribution you’re up against can make all the difference. When the Harvard biologist and prolific popularizer of science Stephen Jay Gould discovered that he had cancer, his immediate impulse was to read the relevant medical literature. Then he found out why his doctors had discouraged him from doing so: half of all patients with his form of cancer died within eight months of discovery.

  But that one statistic—eight months—didn’t tell him anything about the distribution of survivors. If it were a normal distribution, then the Average Rule would give a pretty clear forecast of how long he could expect to live: about eight months. But if it were a power-law, with a tail that stretches far out to the right, then the situation would be quite different: the Multiplicative Rule would tell him that the longer he lived, the more evidence it would provide that he would live longer. Reading further, Gould discovered that “the distribution was indeed, strongly right skewed, with a long tail (however small) that extended for several years above the eight month median. I saw no reason why I shouldn’t be in that small tail, and I breathed a very long sigh of relief.” Gould would go on to live for twenty more years after his diagnosis.

  Small Data and the Mind

  The three prediction rules—Multiplicative, Average, and Additive—are applicable in a wide range of everyday situations. And in those situations, people in general turn out to be remarkably good at using the right prediction rule. When he was in graduate school, Tom, along with MIT’s Josh Tenenbaum, ran an experiment asking people to make predictions for a variety of everyday quantities—such as human life spans, the grosses of movies, and the time that US representatives would spend in office—based on just one piece of information in each case: current age, money earned so far, and years served to date. Then they compared the predictions people made to the predictions given by applying Bayes’s Rule to the actual real-world data across each of those domains.

  As it turned out, the predictions that people had made were extremely close to those produced by Bayes’s Rule. Intuitively, people made different types of predictions for quantities that followed different distributions—power-law, normal, and Erlang—in the real world. In other words, while you might not know or consciously remember which situation calls for the Multiplicative, Average, or Additive Rule, the predictions you make every day tend to implicitly reflect the different cases where these distributions appear in everyday life, and the different ways they behave.

  In light of what we know about Bayes’s Rule, this remarkably good human performance suggests something critical about how people make predictions. Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich. Whether we know it or not, we appear to carry around in our heads surprisingly accurate priors about movie grosses and running times, poem lengths, and political terms of office, not to mention human life spans. We don’t need to gather them explicitly; we absorb them from the world.

  The fact that, on the whole, people’s hunches seem to closely match the predictions of Bayes’s Rule also makes it possible to reverse-engineer all kinds of prior distributions, even ones about which it’s harder to get authoritative real-world data. For instance, being kept on hold by customer service is a lamentably common facet of human experience, but there aren’t publicly available data sets on hold times the way there are for Hollywood box-office grosses. But if people’s predictions are informed by their experiences, we can use Bayes’s Rule to conduct indirect reconnaissance about the world by mining people’s expectations. When Tom and Josh asked people to predict hold times from a single data point, the results suggested that their subjects were using the Multiplicative Rule: the total wait people expect is one and a third times as long as they’ve waited so far. This is consistent with having a power-law distribution as a prior, where a wide range of scales is possible. Just hope you don’t end up on the Titanic of hold times. Over the past decade, approaches like these have enabled cognitive scientists to identify people’s prior distributions across a broad swath of domains, from vision to language.
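
  In code, that reverse-engineering might look something like the sketch below. One common modeling assumption, not spelled out in the text, is that people report something like the posterior median under a power-law prior p(t) proportional to t^(-gamma); under that assumption the multiplier m and the exponent are linked by m = 2^(1/gamma), which recovers the Copernican factor of 2 when gamma = 1. Both the fitting procedure and the synthetic answers below are illustrative, not the study's data.

```python
# A sketch of reverse-engineering a prior from people's predictions (my construction,
# not the study's actual procedure). Step 1: fit the multiplier people implicitly use
# from (elapsed, predicted-total) pairs. Step 2: assuming each prediction is the posterior
# median under a power-law prior p(t) ~ t**(-gamma), recover gamma from m = 2**(1/gamma).
import math

def implied_multiplier(pairs: list[tuple[float, float]]) -> float:
    """Least-squares slope through the origin of predicted total vs. elapsed time."""
    num = sum(elapsed * predicted for elapsed, predicted in pairs)
    den = sum(elapsed * elapsed for elapsed, _ in pairs)
    return num / den

def implied_exponent(multiplier: float) -> float:
    return math.log(2) / math.log(multiplier)

# Synthetic usage: answers consistent with "the total wait is 4/3 of the wait so far".
answers = [(3, 4.0), (6, 8.0), (12, 16.0)]   # minutes on hold vs. predicted total wait
m = implied_multiplier(answers)
print(round(m, 2), round(implied_exponent(m), 2))   # about 1.33, implying an exponent near 2.4
```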

  There’s a crucial caveat here, however. In cases where we don’t have good priors, our predictions aren’t good. In Tom and Josh’s study, for instance, there was one domain where people’s predictions systematically diverged from Bayes’s Rule: predicting the lengths of the reigns of Egyptian pharaohs. (As it happens, pharaohs’ reigns follow an Erlang distribution.) People simply didn’t have enough everyday exposure to develop an intuitive feel for the range of those values, so their predictions faltered. Good predictions require good priors.

  This has a number of important implications. Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot—about the world we live in, and about our own past.

  What Our Predictions Tell Us About Ourselves

  When Walter Mischel ran his famous “marshmallow test” in the early 1970s, he was trying to understand how the ability to delay gratification develops with age. At a nursery school on the Stanford campus, a series of three-, four-, and five-year-olds had their willpower tested. Each child would be shown a delicious treat, such as a marshmallow, and told that the adult running the experiment was about to leave the room for a while. If they wanted to, they could eat the treat right away. But if they waited until the experimenter came back, they would get two treats.

  Unable to resist, some of the children ate the treat immediately. And some of them stuck it out for the full fifteen minutes or so until the experimenter returned, and got two treats as promised. But perhaps the most interesting group comprised the ones in between—the ones who managed to wait a little while, but then surrendered and ate the treat.

  These cases, where children struggled mightily and suffered valiantly, only to give in and lose the extra marshmallow anyway, have been interpreted as suggesting a kind of irrationality. If you’re going to cave, why not just cave immediately, and skip the torture? But it all depends on what kind of situation the children think they are in. As the University of Pennsylvania’s Joe McGuire and Joe Kable have pointed out, if the amount of time it takes for adults to come back is governed by a power-law distribution—with long absences suggesting even longer waits lie ahead—then cutting one’s losses at some point can make perfect sense.

  In other words, the ability to resist temptation may be, at least in part, a matter of expectations rather than willpower. If you predict that adults tend to come back after short delays—something like a normal distribution—you should be able to hold out. The Average Rule suggests that after a painful wait, the thing to do is hang in there: the experimenter should be returning any minute now. But if you have no idea of the timescale of the disappearance—consistent with a power-law distribution—then it’s an uphill battle. The Multiplicative Rule then suggests that a protracted wait is just a small fraction of what’s to come.

  Decades after the original marshmallow experiments, Walter Mischel and his colleagues went back and looked at how the participants were faring in life. Astonishingly, they found that children who had waited for two treats grew into young adults who were more successful than the others, even measured by quantitative metrics like their SAT scores. If the marshmallow test is about willpower, this is a powerful testament to the impact that learning self-control can have on one’s life. But if the test is less about will than about expectations, then this tells a different, perhaps more poignant story.

  A team of researchers at the University of Rochester recently explored how prior experiences might affect behavior in the marshmallow test. Before marshmallows were even mentioned, the kids in the experiment embarked on an art project. The experimenter gave them some mediocre supplies, and promised to be back with better options soon. But, unbeknownst to them, the children were divided into two groups. In one group, the experimenter was reliable, and came back with the better art supplies as promised. In the other, she was unreliable, coming back with nothing but apologies.

  The art project completed, the children went on to the standard marshmallow test. And here, the children who had learned that the experimenter was unreliable were more likely to eat the marshmallow before she came back, losing the opportunity to earn a second treat.

  Failing the marshmallow test—and being less successful in later life—may not be about lacking willpower. It could be a result of believing that adults are not dependable: that they can’t be trusted to keep their word, that they disappear for intervals of arbitrary length. Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.

  Priors in the Age of Mechanical Reproduction

  As if someone were to buy several copies of the morning paper to assure himself that what it said was true.

  —LUDWIG WITTGENSTEIN

  He is careful of what he reads, for that is what he will write. He is careful of what he learns, for that is what he will know.

  —ANNIE DILLARD

  The best way to make good predictions, as Bayes’s Rule shows us, is to be accurately informed about the things you’re predicting. That’s why we can do a good job of projecting human life spans, but perform poorly when asked to estimate the reigns of pharaohs.

 
