
Algorithms to Live By


by Brian Christian and Tom Griffiths


  Being a good Bayesian means representing the world in the correct proportions—having good priors, appropriately calibrated. By and large, for humans and other animals this happens naturally; as a rule, when something surprises us, it ought to surprise us, and when it doesn’t, it ought not to. Even when we accumulate biases that aren’t objectively correct, they still usually do a reasonable job of reflecting the specific part of the world we live in. For instance, someone living in a desert climate might overestimate the amount of sand in the world, and someone living at the poles might overestimate the amount of snow. Both are well tuned to their own ecological niche.

  Everything starts to break down, however, when a species gains language. What we talk about isn’t what we experience—we speak chiefly of interesting things, and those tend to be things that are uncommon. More or less by definition, events are always experienced at their proper frequencies, but this isn’t at all true of language. Anyone who has experienced a snake bite or a lightning strike will tend to retell those singular stories for the rest of their lives. And those stories will be so salient that they will be picked up and retold by others.

  There’s a curious tension, then, between communicating with others and maintaining accurate priors about the world. When people talk about what interests them—and offer stories they think their listeners will find interesting—it skews the statistics of our experience. That makes it hard to maintain appropriate prior distributions. And the challenge has only increased with the development of the printing press, the nightly news, and social media—innovations that allow our species to spread language mechanically.

  Consider how many times you’ve seen either a crashed plane or a crashed car. It’s entirely possible you’ve seen roughly as many of each—yet many of those cars were on the road next to you, whereas the planes were probably on another continent, transmitted to you via the Internet or television. In the United States, for instance, the total number of people who have lost their lives in commercial plane crashes since the year 2000 would not be enough to fill Carnegie Hall even half full. In contrast, the number of people in the United States killed in car accidents over that same time is greater than the entire population of Wyoming.

  Simply put, the representation of events in the media does not track their frequency in the world. As sociologist Barry Glassner notes, the murder rate in the United States declined by 20% over the course of the 1990s, yet during that time period the presence of gun violence on American news increased by 600%.

  If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.

  *There’s a certain irony here: when it comes to time, assuming that there’s nothing special about our arrival does result in us imagining ourselves at the very center after all.

  *This is precisely what Laplace’s Law does in its simplest form: it assumes that having 1% or 10% of the tickets be winners is just as likely as 50% or 100%. The (w+1)/(n+2) formula might seem naive in its suggestion that after buying a single losing Powerball ticket you have a 1/3 chance of winning on your next one—but that result faithfully reflects the odds in a raffle where you come in knowing nothing at all.
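  In code, Laplace’s Law is a one-liner. Here is a minimal Python sketch; the function name and the Powerball example worked through it are ours, purely for illustration:

    def laplace_estimate(wins, trials):
        # Laplace's Law: estimated chance that the next attempt succeeds,
        # given `wins` successes out of `trials` attempts so far.
        return (wins + 1) / (trials + 2)

    # One losing Powerball ticket: w = 0, n = 1.
    print(laplace_estimate(0, 1))  # 0.333... -- the 1/3 chance noted above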

  7 Overfitting

  When to Think Less

  When Charles Darwin was trying to decide whether he should propose to his cousin Emma Wedgwood, he got out a pencil and paper and weighed every possible consequence. In favor of marriage he listed children, companionship, and the “charms of music & female chit-chat.” Against marriage he listed the “terrible loss of time,” lack of freedom to go where he wished, the burden of visiting relatives, the expense and anxiety provoked by children, the concern that “perhaps my wife won’t like London,” and having less money to spend on books. Weighing one column against the other produced a narrow margin of victory, and at the bottom Darwin scrawled, “Marry—Marry—Marry Q.E.D.” Quod erat demonstrandum, the mathematical sign-off that Darwin himself then restated in English: “It being proved necessary to Marry.”

  The pro-and-con list was already a time-honored algorithm by Darwin’s time, being endorsed by Benjamin Franklin a century before. To get over “the Uncertainty that perplexes us,” Franklin wrote,

  my Way is, divide half a Sheet of Paper by a Line into two Columns, writing over the one Pro, and over the other Con. Then during three or four Days Consideration I put down under the different Heads short Hints of the different Motives that at different Times occur to me for or against the Measure. When I have thus got them all together in one View, I endeavour to estimate their respective Weights; and where I find two, one on each side, that seem equal, I strike them both out: If I find a Reason pro equal to some two Reasons con, I strike out the three. If I judge some two Reasons con equal to some three Reasons pro, I strike out the five; and thus proceeding I find at length where the Ballance lies; and if after a Day or two of farther Consideration nothing new that is of Importance occurs on either side, I come to a Determination accordingly.

  Franklin even thought about this as something like a computation, saying, “I have found great Advantage from this kind of Equation, in what may be called Moral or Prudential Algebra.”
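  Franklin’s procedure translates readily into code. Below is a rough Python sketch of his striking-out method, under the simplifying assumption that each reason has already been given a numeric weight; since crossing out equal weights on opposite sides never changes which side is heavier, the procedure amounts to comparing the two totals. The weights here are invented for illustration, not Franklin’s or Darwin’s:

    def moral_algebra(pros, cons):
        # Franklin's "Moral or Prudential Algebra": weigh the reasons pro
        # against the reasons con. Striking out equal weights on opposite
        # sides leaves the same balance as comparing the two sums.
        balance = sum(pros.values()) - sum(cons.values())
        if balance > 0:
            return "pro, by a margin of %g" % balance
        if balance < 0:
            return "con, by a margin of %g" % -balance
        return "evenly balanced: wait a few days for new reasons"

    # Hypothetical weights for Darwin's question:
    pros = {"children": 3, "companionship": 3, "music & chit-chat": 1}
    cons = {"terrible loss of time": 4, "less money for books": 2}
    print(moral_algebra(pros, cons))  # pro, by a margin of 1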

  Darwin’s journal, July 1838. Reprinted with permission of Cambridge University Library.

  When we think about thinking, it’s easy to assume that more is better: that you will make a better decision the more pros and cons you list, make a better prediction about the price of a stock the more relevant factors you identify, and write a better report the more time you spend working on it. This is certainly the premise behind Franklin’s system. In this sense, Darwin’s “algebraic” approach to matrimony, despite its obvious eccentricity, seems remarkably and maybe even laudably rational.

  However, if Franklin or Darwin had lived into the era of machine-learning research—the science of teaching computers to make good judgments from experience—they’d have seen Moral Algebra shaken to its foundations. The question of how hard to think, and how many factors to consider, is at the heart of a knotty problem that statisticians and machine-learning researchers call “overfitting.” And dealing with that problem reveals that there’s a wisdom to deliberately thinking less. Being aware of overfitting changes how we should approach the market, the dining table, the gym … and the altar.

  The Case Against Complexity

  Anything you can do I can do better; I can do anything better than you.

  —ANNIE GET YOUR GUN

  Every decision is a kind of prediction: about how much you’ll like something you haven’t tried yet, about where a certain trend is heading, about how the road less traveled (or more so) is likely to pan out. And every prediction, crucially, involves thinking about two distinct things: what you know and what you don’t. That is, it’s an attempt to formulate a theory that will account for the experiences you’ve had to date and say something about the future ones you’re guessing at. A good theory, of course, will do both. But the fact that every prediction must in effect pull double duty creates a certain unavoidable tension.

  Life satisfaction as a function of time since marriage.

  As one illustration of this tension, let’s look at a data set that might have been relevant to Darwin: people’s life satisfaction over their first ten years of marriage, from a recent study conducted in Germany. Each point on that chart is taken from the study itself; our job is to figure out the formula for a line that would fit those points and extend into the future, allowing us to make predictions past the ten-year mark.

  One possible formula would use just a single factor to predict life satisfaction: the time since marriage. This would create a straight line on the chart. Another possibility is to use two factors, time and time squared; the resulting line would have a parabolic U-shape, letting it capture a potentially more complex relationship between time and happiness. And if we expand the formula to include yet more factors (time cubed and so on), the line will acquire ever more inflection points, getting more and more “bendy” and flexible. By the time we get to a nine-factor formula, we can capture very complex relationships indeed.
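  For the curious, all three fits can be computed in a few lines of Python with NumPy’s polynomial routines. The satisfaction scores below are invented stand-ins for the German study’s ten data points, which we don’t reproduce here:

    import numpy as np

    # Years 1-10 of marriage, with made-up satisfaction scores standing
    # in for the study's ten data points.
    years = np.arange(1, 11)
    satisfaction = np.array([7.6, 7.2, 6.9, 6.8, 6.6, 6.7, 6.5, 6.6, 6.4, 6.5])

    # One factor (a line), two factors (a parabola), nine factors. With
    # ten points, the nine-factor fit passes through every one exactly.
    models = {deg: np.polyfit(years, satisfaction, deg) for deg in (1, 2, 9)}

    # Extrapolate each model past the ten-year mark.
    future = np.array([11, 12, 15])
    for deg, coeffs in models.items():
        print(f"{deg}-factor model:", np.polyval(coeffs, future).round(2))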

  Mathematically speaking, our two-factor model incorporates all the information that goes into the one-factor model, and has another term it could use as well. Likewise, the nine-factor model leverages all of the information at the disposal of the two-factor model, plus potentially lots more. By this logic, it seems like the nine-factor model ought to always give us the best predictions.

  As it turns out, things are not quite so simple.

  Predictions of life satisfaction using models with different numbers of factors.

  The result of applying these models to the data is shown above. The one-factor model, unsurprisingly, misses a lot of the exact data points, though it captures the basic trend—a comedown after the honeymoon bliss. However, its straight-line prediction forecasts that this decrease will continue forever, ultimately resulting in infinite misery. Something about that trajectory doesn’t sound quite right. The two-factor model comes closer to fitting the survey data, and its curved shape makes a different long-term prediction, suggesting that after the initial decline life satisfaction more or less levels out over time. Finally, the nine-factor model passes through each and every point on the chart; it is essentially a perfect fit for all the data from the study.

  In that sense it seems like the nine-factor formula is indeed our best model. But if you look at the predictions it makes for the years not included in the study, you might wonder about just how useful it really is: it predicts misery at the altar, a giddily abrupt rise in satisfaction after several months of marriage, a bumpy roller-coaster ride thereafter, and a sheer drop after year ten. By contrast, the leveling off predicted by the two-factor model is the forecast most consistent with what psychologists and economists say about marriage and happiness. (They believe, incidentally, that it simply reflects a return to normalcy—to people’s baseline level of satisfaction with their lives—rather than any displeasure with marriage itself.)

  The lesson is this: it is indeed true that including more factors in a model will always, by definition, make it a better fit for the data we have already. But a better fit for the available data does not necessarily mean a better prediction.
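  One way to see the distinction concretely is to hold a little data back: fit each model on part of the sample, then ask how well it predicts the rest. In the Python sketch below, the “true” curve and the noise level are assumptions of ours; the most complex model fits its training years almost perfectly yet typically does far worse on the held-out years:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(1, 11)
    truth = 6.3 + 1.5 / x                    # an assumed "true" satisfaction curve
    y = truth + rng.normal(0, 0.1, x.size)   # noisy observations of it

    train_x, test_x = x[:8], x[8:]           # fit on years 1-8, test on 9-10
    for deg in (1, 2, 7):
        coeffs = np.polyfit(train_x, y[:8], deg)
        fit_err = np.abs(np.polyval(coeffs, train_x) - y[:8]).mean()
        pred_err = np.abs(np.polyval(coeffs, test_x) - y[8:]).mean()
        print(f"degree {deg}: fit error {fit_err:.3f}, prediction error {pred_err:.3f}")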

  Adding small amounts of random “noise” to the data (simulating the effects of repeating the survey with a different group of participants) produces wild undulations in the nine-factor model, while the one- and two-factor models in comparison are much more stable and consistent in their predictions.

  Granted, a model that’s too simple—for instance, the straight line of the one-factor formula—can fail to capture the essential pattern in the data. If the truth looks like a curve, no straight line can ever get it right. On the other hand, a model that’s too complicated, such as our nine-factor model here, becomes oversensitive to the particular data points that we happened to observe. As a consequence, precisely because it is tuned so finely to that specific data set, the solutions it produces are highly variable. If the study were repeated with different people, producing slight variations on the same essential pattern, the one- and two-factor models would remain more or less steady—but the nine-factor model would gyrate wildly from one instance of the study to the next. This is what statisticians call overfitting.
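  The perturbation just described is easy to simulate. Using the same invented data as before, rerunning the “study” a hundred times with small random noise shows the nine-factor model’s long-range forecasts swinging wildly while the simpler models barely move (NumPy may warn that the degree-9 fit is poorly conditioned, which is rather the point):

    import numpy as np

    rng = np.random.default_rng(42)
    years = np.arange(1, 11)
    base = np.array([7.6, 7.2, 6.9, 6.8, 6.6, 6.7, 6.5, 6.6, 6.4, 6.5])

    # Rerun the "study" 100 times with small noise; record each model's
    # prediction for year 12 and see how much it varies across reruns.
    for deg in (1, 2, 9):
        preds = [np.polyval(np.polyfit(years, base + rng.normal(0, 0.05, 10), deg), 12)
                 for _ in range(100)]
        print(f"{deg}-factor model: year-12 predictions have std {np.std(preds):.2f}")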

  So one of the deepest truths of machine learning is that, in fact, it’s not always better to use a more complex model, one that takes a greater number of factors into account. And the issue is not just that the extra factors might offer diminishing returns—performing better than a simpler model, but not enough to justify the added complexity. Rather, they might make our predictions dramatically worse.

  The Idolatry of Data

  If we had copious data, drawn from a perfectly representative sample, completely mistake-free, and representing exactly what we’re trying to evaluate, then using the most complex model available would indeed be the best approach. But if we try to perfectly fit our model to the data when any of these factors fails to hold, we risk overfitting.

  In other words, overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are. There can be errors in how the data were collected, or in how they were reported. Sometimes the phenomena being investigated, such as human happiness, are hard to even define, let alone measure. Thanks to their flexibility, the most complex models available to us can fit any patterns that appear in the data, but this means that they will also do so even when those patterns are mere phantoms and mirages in the noise.

  Throughout history, religious texts have warned their followers against idolatry: the worshipping of statues, paintings, relics, and other tangible artifacts in lieu of the intangible deities those artifacts represent. The First Commandment, for instance, warns against bowing down to “any graven image, or any likeness of any thing that is in heaven.” And in the Book of Kings, a bronze snake made at God’s orders becomes an object of worship and incense-burning, instead of God himself. (God is not amused.) Fundamentally, overfitting is a kind of idolatry of data, a consequence of focusing on what we’ve been able to measure rather than what matters.

  This gap between the data we have and the predictions we want is virtually everywhere. When making a big decision, we can only guess at what will please us later by thinking about the factors important to us right now. (As Harvard’s Daniel Gilbert puts it, our future selves often “pay good money to remove the tattoos that we paid good money to get.”) When making a financial forecast, we can only look at what correlated with the price of a stock in the past, not what might in the future. Even in our small daily acts this pattern holds: writing an email, we use our own read-through of the text to predict that of the recipient. No less than in public surveys, the data in our own lives are thus also always noisy, at best a proxy metric for the things we really care about.

  As a consequence, considering more and more factors and expending more effort to model them can lead us into the error of optimizing for the wrong thing—offering prayers to the bronze snake of data rather than the larger force behind it.

  Overfitting Everywhere

  Once you know about overfitting, you see it everywhere.

  Overfitting, for instance, explains the irony of our palates. How can it be that the foods that taste best to us are broadly considered to be bad for our health, when the entire function of taste buds, evolutionarily speaking, is to prevent us from eating things that are bad?

  The answer is that taste is our body’s proxy metric for health. Fat, sugar, and salt are important nutrients, and for a couple hundred thousand years, being drawn to foods containing them was a reasonable measure for a sustaining diet.

  But being able to modify the foods available to us broke that relationship. We can now add fat and sugar to foods beyond amounts that are good for us, and then eat those foods exclusively rather than the mix of plants, grains, and meats that historically made up the human diet. In other words, we can overfit taste. And the more skillfully we can manipulate food (and the more our lifestyles diverge from those of our ancestors), the more imperfect a metric taste becomes. Our human agency thus turns into a curse, making us dangerously able to have exactly what we want even when we don’t quite want exactly the right thing.

  Beware: when you go to the gym to work off the extra weight from all that sugar, you can also risk overfitting fitness. Certain visible signs of fitness—low body fat and high muscle mass, for example—are easy to measure, and they are related to, say, minimizing the risk of heart disease and other ailments. But they, too, are an imperfect proxy measure. Overfitting the signals—adopting an extreme diet to lower body fat and taking steroids to build muscle, perhaps—can make you the picture of good health, but only the picture.

  Overfitting also shows up in sports. For instance, Tom has been a fencer, on and off, since he was a teenager. The original goal of fencing was to teach people how to defend themselves in a duel, hence the name: “defencing.” And the weapons used in modern fencing are similar to those that were used to train for such encounters. (This is particularly true of the épée, which was still used in formal duels less than fifty years ago.) But the introduction of electronic scoring equipment—a button on the tip of the blade that registers a hit—has changed the nature of the sport, and techniques that would serve you poorly in a serious duel have become critical skills in competition. Modern fencers use flexible blades that allow them to “flick” the button at their opponent, grazing just hard enough to register and score. As a result, they can look more like they’re cracking thin metal whips at each other than cutting or thrusting. It’s as exciting a sport as ever, but as athletes overfit their tactics to the quirks of scorekeeping, it becomes less useful in instilling the skills of real-world swordsmanship.

  Perhaps nowhere, however, is overfitting as powerful and troublesome as in the world of business. “Incentive structures work,” as Steve Jobs put it. “So you have to be very careful of what you incent people to do, because various incentive structures create all sorts of consequences that you can’t anticipate.” Sam Altman, president of the startup incubator Y Combinator, echoes Jobs’s words of caution: “It really is true that the company will build whatever the CEO decides to measure.”

  In fact, it’s incredibly difficult to come up with incentives or measurements that do not have some kind of perverse effect. In the 1950s, Cornell management professor V. F. Ridgway cataloged a host of such “Dysfunctional Consequences of Performance Measurements.” At a job-placement firm, staffers were evaluated on the number of interviews they conducted, which motivated them to run through the meetings as quickly as possible, without spending much time actually helping their clients find jobs. At a federal law enforcement agency, investigators given monthly performance quotas were found to pick easy cases at the end of the month rather than the most urgent ones. And at a factory, focusing on production metrics led supervisors to neglect maintenance and repairs, setting up future catastrophe. Such problems can’t simply be dismissed as a failure to achieve management goals. Rather, they are the opposite: the ruthless and clever optimization of the wrong thing.

 
