Algorithms to Live By

by Brian Christian

The twenty-first-century shift into real-time analytics has only made the danger of metrics more intense. Avinash Kaushik, digital marketing evangelist at Google, warns that trying to get website users to see as many ads as possible naturally devolves into trying to cram sites with ads: “When you are paid on a [cost per thousand impressions] basis the incentive is to figure out how to show the most possible ads on every page [and] ensure the visitor sees the most possible pages on the site.… That incentive removes a focus from the important entity, your customer, and places it on the secondary entity, your advertiser.” The website might gain a little more money in the short term, but ad-crammed articles, slow-loading multi-page slide shows, and sensationalist clickbait headlines will drive away readers in the long run. Kaushik’s conclusion: “Friends don’t let friends measure Page Views. Ever.”

In some cases, the difference between a model and the real world is literally a matter of life and death. In the military and in law enforcement, for example, repetitive, rote training is considered a key means for instilling line-of-fire skills. The goal is to drill certain motions and tactics to the point that they become totally automatic. But when overfitting creeps in, it can prove disastrous. There are stories of police officers who find themselves, for instance, taking time out during a gunfight to put their spent casings in their pockets—good etiquette on a firing range. As former Army Ranger and West Point psychology professor Dave Grossman writes, “After the smoke had settled in many real gunfights, officers were shocked to discover empty brass in their pockets with no memory of how it got there. On several occasions, dead cops were found with brass in their hands, dying in the middle of an administrative procedure that had been drilled into them.” Similarly, the FBI was forced to change its training after agents were found reflexively firing two shots and then holstering their weapon—a standard cadence in training—regardless of whether their shots had hit the target and whether there was still a threat. Mistakes like these are known in law enforcement and the military as “training scars,” and they reflect the fact that it’s possible to overfit one’s own preparation. In one particularly dramatic case, an officer instinctively grabbed the gun out of the hands of an assailant and then instinctively handed it right back—just as he had done time and time again with his trainers in practice.

Detecting Overfitting: Cross-Validation

Because overfitting presents itself initially as a theory that perfectly fits the available data, it may seem insidiously hard to detect. How can we expect to tell the difference between a genuinely good model and one that’s overfitting? In an educational setting, how can we distinguish between a class of students excelling at the subject matter and a class merely being “taught to the test”? In the business world, how can we tell a genuine star performer from an employee who has just cannily overfit their work to the company’s key performance indicators—or to the boss’s perception?

Teasing apart those scenarios is indeed challenging, but it is not impossible. Research in machine learning has yielded several concrete strategies for detecting overfitting, and one of the most important is what’s known as Cross-Validation.

Simply put, Cross-Validation means assessing not only how well a model fits the data it’s given, but how well it generalizes to data it hasn’t seen. Paradoxically, this may involve using less data. In the marriage example, we might “hold back,” say, two points at random, and fit our models only to the other eight. We’d then take those two test points and use them to gauge how well our various functions generalize beyond the eight “training” points they’ve been given. The two held-back points function as canaries in the coal mine: if a complex model nails the eight training points but wildly misses the two test points, it’s a good bet that overfitting is at work.
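
To make the hold-back idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the analysis behind the marriage example: the ten data points are invented, the two held-back points are fixed rather than drawn at random so that the result is reproducible, and the "complex" model is simply the polynomial that passes through every training point. All function names are hypothetical.

```python
# Hypothetical data: satisfaction scores for years 1..10 (invented for this sketch).
YEARS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
SATISFACTION = [9.8, 8.6, 9.0, 8.2, 7.0, 7.4, 6.2, 5.8, 5.9, 4.8]


def fit_line(xs, ys):
    """Simple model: ordinary least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Complex model: the polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            basis = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    basis *= (x - xm) / (xj - xm)
            total += yj * basis
        return total
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def holdout_errors(xs, ys, test_indices):
    """Fit both models on the training points; report (train MSE, test MSE) for each."""
    train_x = [x for i, x in enumerate(xs) if i not in test_indices]
    train_y = [y for i, y in enumerate(ys) if i not in test_indices]
    test_x = [x for i, x in enumerate(xs) if i in test_indices]
    test_y = [y for i, y in enumerate(ys) if i in test_indices]
    results = {}
    for name, fit in (("line", fit_line), ("polynomial", fit_interpolating_polynomial)):
        model = fit(train_x, train_y)
        results[name] = (
            mean_squared_error(model, train_x, train_y),
            mean_squared_error(model, test_x, test_y),
        )
    return results


if __name__ == "__main__":
    # Hold back the points at index 3 and 7 (years 4 and 8); fit on the other eight.
    for name, (train_error, test_error) in holdout_errors(YEARS, SATISFACTION, {3, 7}).items():
        print(f"{name}: train MSE {train_error:.3f}, test MSE {test_error:.3f}")
```

On this invented data the polynomial reproduces all eight training points exactly, yet it misses the two held-back points by far more than the straight line does. That gap between training error and test error is the canary.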

Aside from withholding some of the available data points, it is also useful to consider testing the model with data derived from some other form of evaluation entirely. As we have seen, the use of proxy metrics—taste as a proxy for nutrition, number of cases solved as a proxy for investigator diligence—can also lead to overfitting. In these cases, we’ll need to cross-validate the primary performance measure we’re using against other possible measures.

In schools, for example, standardized tests offer a number of benefits, including a distinct economy of scale: they can be graded cheaply and rapidly by the thousands. Alongside such tests, however, schools could randomly assess some small fraction of the students—one per class, say, or one in a hundred—using a different evaluation method, perhaps something like an essay or an oral exam. (Since only a few students would be tested this way, having this secondary method scale well is not a big concern.) The standardized tests would provide immediate feedback—you could have students take a short computerized exam every week and chart the class’s progress almost in real time, for instance—while the secondary data points would serve to cross-validate: to make sure that the students were actually acquiring the knowledge that the standardized test is meant to measure, and not simply getting better at test-taking. If a school’s standardized scores rose while its “nonstandardized” performance moved in the opposite direction, administrators would have a clear warning sign that “teaching to the test” had set in, and the pupils’ skills were beginning to overfit the mechanics of the test itself.

Cross-Validation also offers a suggestion for law enforcement and military personnel looking to instill good reflexes without hammering in habits from the training process itself. Just as essays and oral exams can cross-validate standardized tests, so occasional unfamiliar “cross-training” assessments might be used to measure whether reaction time and shooting accuracy are generalizing to unfamiliar tasks. If they aren’t, then that’s a strong signal to change the training regimen. While nothing may truly prepare one for actual combat, exercises like this may at least warn in advance where “training scars” are likely to have formed.

How to Combat Overfitting: Penalizing Complexity

If you can’t explain it simply, you don’t understand it well enough.

—ANONYMOUS

We’ve seen some of the ways that overfitting can rear its head, and we’ve looked at some of the methods to detect and measure it. But what can we actually do to alleviate it?

From a statistics viewpoint, overfitting is a symptom of being too sensitive to the actual data we’ve seen. The solution, then, is straightforward: we must balance our desire to find a good fit against the complexity of the models we use to do so.

One way to choose among several competing models is the Occam’s razor principle, which suggests that, all things being equal, the simplest possible hypothesis is probably the correct one. Of course, things are rarely completely equal, so it’s not immediately obvious how to apply something like Occam’s razor in a mathematical context. Grappling with this challenge in the 1960s, Russian mathematician Andrey Tikhonov proposed one answer: introduce an additional term to your calculations that penalizes more complex solutions. If we introduce a complexity penalty, then more complex models need to do not merely a better job but a significantly better job of explaining the data to justify their greater complexity. Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.

So what do these complexity penalties look like? One algorithm, discovered in 1996 by biostatistician Robert Tibshirani, is called the Lasso and uses as its penalty the total weight of the different factors in the model.* By putting this downward pressure on the weights of the factors, the Lasso drives as many of them as possible completely to zero. Only the factors that have a big impact on the results remain in the equation—thus potentially transforming, say, an overfitted nine-factor model into a simpler, more robust formula with just a couple of the most critical factors.
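
A small sketch can show this downward pressure at work. The code below assumes one common formulation of the Lasso: squared error averaged over the data points (and halved), plus a penalty equal to `penalty` times the sum of the absolute values of the weights. It minimizes that objective by coordinate descent. The function names and the four-row, three-factor dataset are hypothetical, and the sketch assumes no factor column is all zeros.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within the penalty snap to zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=100):
    """Minimize (1/2n) * sum of squared errors + penalty * sum of |weights|."""
    n = len(rows)
    k = len(rows[0])
    weights = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            rho = 0.0    # how strongly factor j tracks what the other factors leave unexplained
            scale = 0.0  # sum of squares of factor j (assumed nonzero)
            for row, target in zip(rows, targets):
                prediction_without_j = sum(
                    w * x for i, (w, x) in enumerate(zip(weights, row)) if i != j
                )
                rho += row[j] * (target - prediction_without_j)
                scale += row[j] ** 2
            weights[j] = soft_threshold(rho / n, penalty) / (scale / n)
    return weights


# Hypothetical data: three factors, of which only the first matters much.
ROWS = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
TARGETS = [3.1, 2.9, -2.9, -3.1]

if __name__ == "__main__":
    print("no penalty:  ", lasso_coordinate_descent(ROWS, TARGETS, penalty=0.0))
    print("with penalty:", lasso_coordinate_descent(ROWS, TARGETS, penalty=0.5))
```

With no penalty, the second factor keeps a small weight of about 0.1. With a penalty of 0.5, that weight is driven to exactly zero and only the first factor survives, at a somewhat reduced weight of about 2.5.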

Techniques like the Lasso are now ubiquitous in machine learning, but the same kind of principle, a penalty for complexity, also appears in nature. Living organisms get a certain push toward simplicity almost automatically, thanks to the constraints of time, memory, energy, and attention. The burden of metabolism, for instance, acts as a brake on the complexity of organisms, introducing a caloric penalty for overly elaborate machinery. The fact that the human brain burns about a fifth of humans’ total daily caloric intake is a testament to the evolutionary advantages that our intellectual abilities provide us with: the brain’s contributions must somehow more than pay for that sizable fuel bill. On the other hand, we can also infer that a substantially more complex brain probably didn’t provide sufficient dividends, evolutionarily speaking. We’re as brainy as we have needed to be, but not extravagantly more so.

The same kind of process is also believed to play a role at the neural level. In computer science, software models based on the brain, known as “artificial neural networks,” can learn arbitrarily complex functions—they’re even more flexible than our nine-factor model above—but precisely because of this very flexibility they are notoriously vulnerable to overfitting. Actual, biological neural networks sidestep some of this problem because they need to trade off their performance against the costs of maintaining it. Neuroscientists have suggested, for instance, that brains try to minimize the number of neurons that are firing at any given moment—implementing the same kind of downward pressure on complexity as the Lasso.

Language forms yet another natural Lasso: complexity is punished by the labor of speaking at greater length and the taxing of our listener’s attention span. Business plans get compressed to an elevator pitch; life advice becomes proverbial wisdom only if it is sufficiently concise and catchy. And anything that needs to be remembered has to pass through the inherent Lasso of memory.

The Upside of Heuristics

The economist Harry Markowitz won the 1990 Nobel Prize in Economics for developing modern portfolio theory: his groundbreaking “mean-variance portfolio optimization” showed how an investor could make an optimal allocation among various funds and assets to maximize returns at a given level of risk. So when it came time to invest his own retirement savings, it seems like Markowitz should have been the one person perfectly equipped for the job. What did he decide to do?

I should have computed the historical covariances of the asset classes and drawn an efficient frontier. Instead, I visualized my grief if the stock market went way up and I wasn’t in it—or if it went way down and I was completely in it. My intention was to minimize my future regret. So I split my contributions fifty-fifty between bonds and equities.

Why in the world would he do that? The story of the Nobel Prize winner and his investment strategy could be presented as an example of human irrationality: faced with the complexity of real life, he abandoned the rational model and followed a simple heuristic. But it’s precisely because of the complexity of real life that a simple heuristic might in fact be the rational solution.

When it comes to portfolio management, it turns out that unless you’re highly confident in the information you have about the markets, you may actually be better off ignoring that information altogether. Applying Markowitz’s optimal portfolio allocation scheme requires having good estimates of the statistical properties of different investments. An error in those estimates can result in very different asset allocations, potentially increasing risk. In contrast, splitting your money evenly across stocks and bonds is not affected at all by what data you’ve observed. This strategy doesn’t even try to fit itself to the historical performance of those investment types—so there’s no way it can overfit.

Of course, just using a fifty-fifty split is not necessarily the complexity sweet spot, but there’s something to be said for it. If you happen to know the expected mean and expected variance of a set of investments, then use mean-variance portfolio optimization—the optimal algorithm is optimal for a reason. But when the odds of estimating them all correctly are low, and the weight that the model puts on those untrustworthy quantities is high, then an alarm should be going off in the decision-making process: it’s time to regularize.
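
The sensitivity described above can be seen in a two-asset sketch. It assumes one common textbook form of mean-variance optimization, in which the investor maximizes expected return minus half the risk aversion times the portfolio variance, with the two weights summing to one. Every number below (returns, variances, covariance, risk aversion) is invented for illustration, the function names are hypothetical, and the resulting weight is not clipped to the range from 0 to 1.

```python
def mean_variance_stock_weight(stock_return, bond_return, stock_var, bond_var,
                               covariance, risk_aversion):
    """Share of the portfolio in stocks that maximizes
    expected return - (risk_aversion / 2) * variance, with the rest in bonds."""
    numerator = (stock_return - bond_return) / risk_aversion + bond_var - covariance
    denominator = stock_var + bond_var - 2 * covariance
    return numerator / denominator


def even_split_stock_weight(*_estimates):
    """The fifty-fifty heuristic: ignores every estimate it is handed."""
    return 0.5


# Hypothetical inputs, held fixed while only the stock-return estimate varies.
FIXED = dict(bond_return=0.03, stock_var=0.0324, bond_var=0.0036,
             covariance=0.001, risk_aversion=3.0)

if __name__ == "__main__":
    for estimate in (0.05, 0.07, 0.09):
        optimized = mean_variance_stock_weight(stock_return=estimate, **FIXED)
        print(f"estimated stock return {estimate:.0%}: "
              f"optimized stock share {optimized:.0%}, "
              f"even split {even_split_stock_weight(estimate):.0%}")
```

Under these assumptions, moving the estimated stock return by two percentage points in either direction swings the optimized stock share from roughly a quarter of the portfolio to roughly two thirds, while the even split never moves.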

Inspired by examples like Markowitz’s retirement savings, psychologists Gerd Gigerenzer and Henry Brighton have argued that the decision-making shortcuts people use in the real world are in many cases exactly the kind of thinking that makes for good decisions. “In contrast to the widely held view that less processing reduces accuracy,” they write, “the study of heuristics shows that less information, computation, and time can in fact improve accuracy.” A heuristic that favors simpler answers—with fewer factors, or less computation—offers precisely these “less is more” effects.

Imposing penalties on the ultimate complexity of a model is not the only way to alleviate overfitting, however. You can also nudge a model toward simplicity by controlling the speed with which you allow it to adapt to incoming data. This makes the study of overfitting an illuminating guide to our history—both as a society and as a species.

The Weight of History

Every food a living rat has eaten has, necessarily, not killed it.

—SAMUEL REVUSKY AND ERWIN BEDARF, “ASSOCIATION OF ILLNESS WITH PRIOR INGESTION OF NOVEL FOODS”

The soy milk market in the United States more than quadrupled from the mid-1990s to 2013. But by the end of 2013, according to news headlines, it already seemed to be a thing of the past, a distant second place to almond milk. As food and beverage researcher Larry Finkel told Bloomberg Businessweek: “Nuts are trendy now. Soy sounds more like old-fashioned health food.” The Silk company, famous for popularizing soy milk (as the name implies), reported in late 2013 that its almond milk products had grown by more than 50% in the previous quarter alone. Meanwhile, in other beverage news, the leading coconut water brand, Vita Coco, reported in 2014 that its sales had doubled since 2011—and had increased an astounding three-hundred-fold since 2004. As the New York Times put it, “coconut water seems to have jumped from invisible to unavoidable without a pause in the realm of the vaguely familiar.” Meanwhile, the kale market grew by 40% in 2013 alone. The biggest purchaser of kale the year before had been Pizza Hut, which put it in their salad bars—as decoration.

Some of the most fundamental domains of human life, such as the question of what we should put in our bodies, seem curiously to be the ones most dominated by short-lived fads. Part of what enables these fads to take the world by storm is how quickly our culture can change. Information now flows through society faster than ever before, while global supply chains enable consumers to rapidly change their buying habits en masse (and marketing encourages them to do so). If some particular study happens to suggest a health benefit from, say, star anise, it can be all over the blogosphere within the week, on television the week after that, and in seemingly every supermarket in six months, with dedicated star anise cookbooks soon rolling off the presses. This breathtaking speed is both a blessing and a curse.

In contrast, if we look at the way organisms—including humans—evolve, we notice something intriguing: change happens slowly. This means that the properties of modern-day organisms are shaped not only by their present environments, but also by their history. For example, the oddly cross-wired arrangement of our nervous system (the left side of our body controlled by the right side of our brain and vice versa) reflects the evolutionary history of vertebrates. This phenomenon, called “decussation,” is theorized to have arisen at a point in evolution when early vertebrates’ bodies twisted 180 degrees with respect to their heads; whereas the nerve cords of invertebrates such as lobsters and earthworms run on the “belly” side of the animal, vertebrates have their nerve cords along the spine instead.

The human ear offers another example. Viewed from a functional perspective, it is a system for translating sound waves into electrical signals by way of amplification via three bones: the malleus, incus, and stapes. This amplification system is impressive, but the specifics of how it works have a lot to do with historical constraints. Reptiles, it turns out, have just a single bone in their ear, but additional bones in the jaw that mammals lack. Those jawbones were apparently repurposed in the mammalian ear. So the exact form and configuration of our ear anatomy reflects our evolutionary history at least as much as it does the auditory problem being solved.

The concept of overfitting gives us a way of seeing the virtue in such evolutionary baggage. Though crossed-over nerve fibers and repurposed jawbones may seem like suboptimal arrangements, we don’t necessarily want evolution to fully optimize an organism to every shift in its environmental niche—or, at least, we should recognize that doing so would make it extremely sensitive to further environmental changes. Having to make use of existing materials, on the other hand, imposes a kind of useful restraint. It makes it harder to induce drastic changes in the structure of organisms, harder to overfit. As a species, being constrained by the past makes us less perfectly adjusted to the present we know but helps keep us robust for the future we don’t.

A similar insight might help us resist the quick-moving fads of human society. When it comes to culture, tradition plays the role of the evolutionary constraints. A bit of conservatism, a certain bias in favor of history, can buffer us against the boom-and-bust cycle of fads. That doesn’t mean we ought to ignore the latest data either, of course. Jump toward the bandwagon, by all means—but not necessarily on it.

In machine learning, the advantages of moving slowly emerge most concretely in a regularization technique known as Early Stopping. When we looked at the German marriage survey data at the beginning of the chapter, we went straight to examining the best-fitted one-, two-, and nine-factor models. In many situations, however, tuning the parameters to find the best possible fit for given data is a process in and of itself. What happens if we stop that process early and simply don’t allow a model the time to become too complex? Again, what might seem at first blush like being halfhearted or unthorough emerges, instead, as an important strategy in its own right.
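
One common way to decide when to stop is to watch the error on held-back data and quit once it stops improving. The sketch below assumes that approach. The function name, the `patience` parameter, and the toy stand-ins for a training step and a validation measurement are all hypothetical, and the model state is assumed to be immutable (or copied by `update`) so that the best version can be kept.

```python
def train_with_early_stopping(state, update, validation_loss, patience=3, max_epochs=100):
    """Keep training while held-back error improves; stop after `patience` stale epochs.

    state            -- the model's current parameters (assumed immutable)
    update           -- function taking a state, returning the state after one more epoch
    validation_loss  -- function scoring a state on data it was not trained on
    """
    best_state = state
    best_loss = validation_loss(state)
    stale_epochs = 0
    epochs_run = 0
    for _ in range(max_epochs):
        state = update(state)
        epochs_run += 1
        loss = validation_loss(state)
        if loss < best_loss:
            best_state, best_loss = state, loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # further fitting is no longer helping on unseen data
    return best_state, best_loss, epochs_run


if __name__ == "__main__":
    # Toy stand-ins: the "model" is just an epoch counter, and the held-back
    # error improves until epoch 5 and then worsens as fitting continues.
    best, loss, epochs = train_with_early_stopping(
        state=0,
        update=lambda k: k + 1,
        validation_loss=lambda k: (k - 5) ** 2,
    )
    print(best, loss, epochs)
```

In this toy run the loop trains for eight epochs, notices three epochs in a row without improvement, and hands back the model as it stood at epoch 5 rather than the more thoroughly fitted one it ended with.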
