Many prediction algorithms, for instance, start out by searching for the single most important factor rather than jumping to a multi-factor model. Only after finding that first factor do they look for the next most important factor to add to the model, then the next, and so on. Their models can therefore be kept from becoming overly complex simply by stopping the process short, before overfitting has had a chance to creep in. A related approach to calculating predictions considers one data point at a time, with the model tweaked to account for each new point before more points are added; there, too, the complexity of the model increases gradually, so stopping the process short can help keep it from overfitting.
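To make the one-factor-at-a-time idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not any particular library's method: the function name `fit_one_factor_at_a_time` is hypothetical, and the stopping rule (quit as soon as a new factor fails to reduce error on held-out data) is one common choice that the text above does not specify.

```python
def dot(a, b):
    """Inner product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(a, b))


def fit_one_factor_at_a_time(train_cols, train_y, val_cols, val_y):
    """Add factors greedily; stop as soon as held-out error stops improving.

    train_cols / val_cols: one list of values per candidate factor.
    Returns {factor_index: coefficient} for the factors that were kept.
    """
    coefs = {}
    residual = list(train_y)      # what the model has not yet explained
    val_residual = list(val_y)
    val_error = dot(val_residual, val_residual)

    while len(coefs) < len(train_cols):
        # Find the unused factor that explains the most remaining error.
        best = None  # (gain, index, coefficient)
        for j, col in enumerate(train_cols):
            norm = dot(col, col)
            if j in coefs or norm == 0:
                continue
            b = dot(col, residual) / norm
            gain = b * b * norm
            if best is None or gain > best[0]:
                best = (gain, j, b)
        if best is None:
            break
        _, j, b = best

        # Early stopping (an assumed rule): keep the factor only if it
        # also helps on data the model was not fitted to.
        new_val_residual = [r - b * x for r, x in zip(val_residual, val_cols[j])]
        new_val_error = dot(new_val_residual, new_val_residual)
        if new_val_error >= val_error:
            break

        coefs[j] = b
        residual = [r - b * x for r, x in zip(residual, train_cols[j])]
        val_residual, val_error = new_val_residual, new_val_error
    return coefs


# Toy data: factor 0 is real signal; factor 1 only matches noise in the
# training set, and that noise points the other way in the held-out set.
x0, x1 = [1, -1, 1, -1], [1, 1, -1, -1]
train_y = [3.1, -2.9, 2.9, -3.1]
val_y = [2.9, -3.1, 3.1, -2.9]
kept = fit_one_factor_at_a_time([x0, x1], train_y, [x0, x1], val_y)
```

On this toy data the procedure keeps the first factor and stops before adding the second, which would have fitted the training noise.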
This kind of setup—where more time means more complexity—characterizes a lot of human endeavors. Giving yourself more time to decide about something does not necessarily mean that you’ll make a better decision. But it does guarantee that you’ll end up considering more factors, more hypotheticals, more pros and cons, and thus risk overfitting.
Tom had exactly this experience when he became a professor. His first semester, teaching his first class ever, he spent a huge amount of time perfecting his lectures—more than ten hours of preparation for every hour of class. His second semester, teaching a different class, he wasn’t able to put in as much time, and worried that it would be a disaster. But a strange thing happened: the students liked the second class. In fact, they liked it more than the first one. Those extra hours, it turned out, had been spent nailing down nitty-gritty details that only confused the students, and wound up getting cut from the lectures the next time Tom taught the class. The underlying issue, Tom eventually realized, was that he’d been using his own taste and judgment as a kind of proxy metric for his students’. This proxy metric worked reasonably well as an approximation, but it wasn’t worth overfitting—which explained why spending extra hours painstakingly “perfecting” all the slides had been counterproductive.
The effectiveness of regularization in all kinds of machine-learning tasks suggests that we can make better decisions by deliberately thinking and doing less. If the factors we come up with first are likely to be the most important ones, then beyond a certain point thinking more about a problem is not only going to be a waste of time and effort—it will lead us to worse solutions. Early Stopping provides the foundation for a reasoned argument against reasoning, the thinking person’s case against thought. But turning this into practical advice requires answering one more question: when should we stop thinking?
When to Think Less
As with all issues involving overfitting, how early to stop depends on the gap between what you can measure and what really matters. If you have all the facts, they’re free of all error and uncertainty, and you can directly assess whatever is important to you, then don’t stop early. Think long and hard: the complexity and effort are appropriate.
But that’s almost never the case. If you have high uncertainty and limited data, then do stop early by all means. If you don’t have a clear read on how your work will be evaluated, and by whom, then it’s not worth the extra time to make it perfect with respect to your own (or anyone else’s) idiosyncratic guess at what perfection might be. The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting—that is, the more you should prefer simplicity, and the earlier you should stop.
When you’re truly in the dark, the best-laid plans will be the simplest. When our expectations are uncertain and the data are noisy, the best bet is to paint with a broad brush, to think in broad strokes. Sometimes literally. As entrepreneurs Jason Fried and David Heinemeier Hansson explain, the further ahead they need to brainstorm, the thicker the pen they use—a clever form of simplification by stroke size:
When we start designing something, we sketch out ideas with a big, thick Sharpie marker, instead of a ball-point pen. Why? Pen points are too fine. They’re too high-resolution. They encourage you to worry about things that you shouldn’t worry about yet, like perfecting the shading or whether to use a dotted or dashed line. You end up focusing on things that should still be out of focus.
A Sharpie makes it impossible to drill down that deep. You can only draw shapes, lines, and boxes. That’s good. The big picture is all you should be worrying about in the beginning.
As McGill’s Henry Mintzberg puts it, “What would happen if we started from the premise that we can’t measure what matters and go from there? Then instead of measurement we’d have to use something very scary: it’s called judgment.”
The upshot of Early Stopping is that sometimes it’s not a matter of choosing between being rational and going with our first instinct. Going with our first instinct can be the rational solution. The more complex, unstable, and uncertain the decision, the more rational an approach that is.
To return to Darwin, his problem of deciding whether to propose could probably have been resolved based on just the first few pros and cons he identified, with the subsequent ones adding to the time and anxiety expended on the decision without necessarily aiding its resolution (and in all likelihood impeding it). What seemed to make up his mind was the thought that “it is intolerable to think of spending one’s whole life like a neuter bee, working, working, & nothing after all.” Children and companionship—the very first points he mentioned—were precisely those that ultimately swayed him in favor of marriage. His book budget was a distraction.
Before we get too critical of Darwin, however, painting him as an inveterate overthinker, it’s worth taking a second look at this page from his diary. Seeing it in facsimile shows something fascinating. Darwin was no Franklin, adding assorted considerations for days. Despite the seriousness with which he approached this life-changing choice, Darwin made up his mind exactly when his notes reached the bottom of the diary sheet. He was regularizing to the page. This is reminiscent of both Early Stopping and the Lasso: anything that doesn’t make the page doesn’t make the decision.
His mind made up to marry, Darwin immediately went on to overthink the timing. “When? Soon or Late,” he wrote above another list of pros and cons, considering everything from happiness to expenses to “awkwardness” to his long-standing desire to travel in a hot air balloon and/or to Wales. But by the end of the page he resolved to “Never mind, trust to chance”—and the result, within several months’ time, was a proposal to Emma Wedgwood, the start of a fulfilling partnership and a happy family life.
*For the mathematically inclined, that’s the sum of the absolute values of the variables’ coefficients.
8 Relaxation
Let It Slide
In 2010 Meghan Bellows was working on her PhD in chemical engineering at Princeton by day and planning her wedding by night. Her research revolved around finding the right places to put amino acids in a protein chain to yield a molecule with particular characteristics. (“If you maximize the binding energy of two proteins then you can successfully design a peptidic inhibitor of some biological function so you can actually inhibit a disease’s progress.”) On the nuptial front, she was stuck on the problem of seating.
There was a group of nine college friends, and Bellows agonized over who else to throw into the midst of such a mini-reunion to make a table of ten. Even worse, she’d counted up eleven close relatives. Who would get the boot from the honored parents’ table, and how could she explain it to them? And what about folks like her childhood neighbors and babysitter, or her parents’ work colleagues, who didn’t really know anyone at the wedding at all?
The problem seemed every bit as hard as the protein problem she was working on at the lab. Then it hit her. It was the problem she was working on at the lab. One evening, as she stared at her seating charts, “I realized that there was literally a one-to-one correlation between the amino acids and proteins in my PhD thesis and people sitting at tables at my wedding.” Bellows called out to her fiancé for a piece of paper and began scribbling equations. Amino acids became guests, binding energies became relationships, and the molecules’ so-called nearest-neighbor interactions became—well—nearest-neighbor interactions. She could use the algorithms from her research to solve her own wedding.
Bellows worked out a way to numerically define the strength of the relationships among all the guests. If a particular pair of people didn’t know one another they got a 0, if they did they got a 1, and if they were a couple they got a 50. (The sister of the bride got to give a score of 10 to all the people she wanted to sit with, as a special prerogative.) Bellows then specified a few constraints: the maximum table capacity, and a minimum score necessary for each table, so that no one table became the awkward “miscellaneous” group full of strangers. She also codified the program’s goal: to maximize the relationship scores between the guests and their tablemates.
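The scoring scheme described above can be sketched in a few lines of Python. This is an illustration only, not Bellows's actual program: the function names and the way the inputs are represented (sets of guest pairs) are assumptions, and the sister's special score of 10 is left out for brevity.

```python
from itertools import combinations


def pair_score(a, b, acquainted, couples):
    """50 for a couple, 1 for people who know each other, 0 for strangers."""
    pair = frozenset((a, b))
    if pair in couples:
        return 50
    if pair in acquainted:
        return 1
    return 0


def table_score(table, acquainted, couples):
    """Sum of the relationship scores over every pair seated at one table."""
    return sum(pair_score(a, b, acquainted, couples)
               for a, b in combinations(table, 2))


def plan_score(tables, acquainted, couples, capacity, min_table_score):
    """Total score of a seating plan, or None if a constraint is violated."""
    total = 0
    for table in tables:
        score = table_score(table, acquainted, couples)
        if len(table) > capacity or score < min_table_score:
            return None  # too crowded, or a table of near-strangers
        total += score
    return total


# Hypothetical four-guest example.
couples = {frozenset(("Ann", "Bob"))}
acquainted = {frozenset(("Cat", "Dan")), frozenset(("Ann", "Cat"))}
good = plan_score([["Ann", "Bob"], ["Cat", "Dan"]], acquainted, couples, 2, 1)
bad = plan_score([["Ann", "Cat"], ["Bob", "Dan"]], acquainted, couples, 2, 1)
```

Scoring a single plan is cheap. The difficulty lies in the number of plans there are to score.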
There were 107 people at the wedding and 11 tables, which could accommodate ten people each. This means there were about 11^107 possible seating plans: that’s a 112-digit number, more than 200 billion googols, a figure that dwarfs the (merely 80-digit) number of atoms in the observable universe. Bellows submitted the job to her lab computer on Saturday evening and let it churn. When she came in on Monday morning, it was still running; she had it spit out the best assignment it had found so far and put it back onto protein design.
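The size of that number is easy to check with Python's arbitrary-precision integers. The count used here, 11 raised to the 107th power, is the number of ways to give each of 107 guests one of 11 table labels; it ignores the ten-seat limit, which is one reason the figure is only approximate. The variable names are illustrative.

```python
n_guests, n_tables = 107, 11

# Each guest independently gets one of 11 tables (capacity ignored).
seating_plans = n_tables ** n_guests

digits = len(str(seating_plans))       # number of decimal digits
googol = 10 ** 100
in_googols = seating_plans // googol   # how many googols that is

print(digits, in_googols > 200 * 10 ** 9)
```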
Even with a high-powered lab computer cluster and a full thirty-six hours of processing time, there was no way for the program to evaluate more than a tiny fraction of the potential seating arrangements. The odds are that the truly optimal solution, the one that would have earned the very highest score, never came up in its permutations. Still, Bellows was pleased with the computer’s results. “It identified relationships that we were forgetting about,” she says, offering delightful, unconventional possibilities that the human planners hadn’t even considered. For instance, it proposed removing her parents from the family table, putting them instead with old friends they hadn’t seen for years. Its final recommendation was an arrangement that all parties agreed was a hit—although the mother of the bride couldn’t resist making just a few manual tweaks.
The fact that all the computing power of a lab at Princeton couldn’t find the perfect seating plan might seem surprising. In most of the domains we’ve discussed so far, straightforward algorithms could guarantee optimal solutions. But as computer scientists have discovered over the past few decades, there are entire classes of problems where a perfect solution is essentially unreachable, no matter how fast we make our computers or how cleverly we program them. In fact, no one understands as well as a computer scientist that in the face of a seemingly unmanageable challenge, you should neither toil forever nor give up, but—as we’ll see—try a third thing entirely.
The Difficulty of Optimization
Before leading the country through the American Civil War, before drafting the Emancipation Proclamation or delivering the Gettysburg Address, Abraham Lincoln worked as a “prairie lawyer” in Springfield, Illinois, traveling the Eighth Judicial Circuit twice a year for sixteen years. Being a circuit lawyer meant literally making a circuit—moving through towns in fourteen different counties to try cases, riding hundreds of miles over many weeks. Planning these circuits raised a natural challenge: how to visit all the necessary towns while covering as few miles as possible and without going to any town twice.
This is an instance of what’s known to mathematicians and computer scientists as a “constrained optimization” problem: how to find the single best arrangement of a set of variables, given particular rules and a scorekeeping measure. In fact, it’s the most famous optimization problem of them all. If it had been studied in the nineteenth century it might have become forever known as “the prairie lawyer problem,” and if it had first come up in the twenty-first century it might have been nicknamed “the delivery drone problem.” But like the secretary problem, it emerged in the mid-twentieth century, a period unmistakably evoked by its canonical name: “the traveling salesman problem.”
The problem of route planning didn’t get the attention of the mathematics community until the 1930s, but then it did so with a vengeance. Mathematician Karl Menger spoke of “the postal messenger problem” in 1930, noting that no easier solution was known than simply trying out every possibility in turn. Hassler Whitney posed the problem in a 1934 talk at Princeton, where it lodged firmly in the brain of fellow mathematician Merrill Flood (who, you might recall from chapter 1, is also credited with circulating the first solution to the secretary problem). When Flood moved to California in the 1940s he spread it in turn to his colleagues at the RAND Corporation, and the problem’s iconic name first appeared in print in a 1949 paper by mathematician Julia Robinson. As the problem swept through mathematical circles, it grew in notoriety. Many of the greatest minds of the time obsessed over it, and no one seemed able to make real headway.
In the traveling salesman problem, the question isn’t whether a computer (or a mathematician) could find the shortest route: theoretically, one can simply crank out a list of all the possibilities and measure each one. Rather, the issue is that as the number of towns grows, the list of possible routes connecting them explodes. A route is just an ordering of the towns, so trying them all by brute force is the dreaded O(n!) “factorial time”—the computational equivalent of sorting a deck of cards by throwing them in the air until they happen to land in order.
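A brute-force solver of the kind described above takes only a few lines of Python. This is a sketch under simplifying assumptions: towns are points on a flat map, distances are straight lines, the route returns to its starting town, and the name `shortest_tour` is hypothetical. It is practical only for a handful of towns, which is precisely the point.

```python
from itertools import permutations
from math import dist


def shortest_tour(towns):
    """Try every ordering of the towns and return (length, route).

    towns: dict mapping town name -> (x, y) coordinates.
    The first town is fixed as the start and end of the circuit.
    """
    names = list(towns)
    start, rest = names[0], names[1:]
    best = None
    for order in permutations(rest):       # (n - 1)! orderings to check
        route = (start, *order, start)
        length = sum(dist(towns[a], towns[b])
                     for a, b in zip(route, route[1:]))
        if best is None or length < best[0]:
            best = (length, route)
    return best


# Four hypothetical towns at the corners of a unit square.
square = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (1, 0)}
length, route = shortest_tour(square)
```

With four towns the loop checks six orderings. Each additional town multiplies the work by the number of towns already on the list.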
The question is: is there any hope of doing better?
Decades of work did little to tame the traveling salesman problem. Flood, for instance, wrote in 1956, more than twenty years after first encountering it: “It seems very likely that quite a different approach from any yet used may be required for successful treatment of the problem. In fact, there may well be no general method for treating the problem and impossibility results would also be valuable.” Another decade later, the mood was only more grim. “I conjecture,” wrote Jack Edmonds, “that there is no good algorithm for the traveling salesman problem.”
These words would prove prophetic.
Defining Difficulty
In the mid-1960s, Edmonds, at the National Institute of Standards and Technology, along with Alan Cobham of IBM, developed a working definition for what makes a problem feasible to solve or not. They asserted what’s now known as the Cobham–Edmonds thesis: an algorithm should be considered “efficient” if it runs in what’s called “polynomial time”—that is, O(n^2), O(n^3), or in fact n to the power of any number at all. A problem, in turn, is considered “tractable” if we know how to solve it using an efficient algorithm. A problem we don’t know how to solve in polynomial time, on the other hand, is considered “intractable.” And at anything but the smallest scales, intractable problems are beyond the reach of solution by computers, no matter how powerful.*
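The gulf between polynomial and factorial growth is easiest to appreciate with numbers in hand. The short Python sketch below tabulates rough step counts for a few problem sizes; the function name `step_counts` and the sizes chosen are illustrative only.

```python
from math import factorial


def step_counts(n):
    """Rough step counts for two polynomial running times and factorial time."""
    return {"n^2": n ** 2, "n^3": n ** 3, "n!": factorial(n)}


for n in (5, 10, 20, 30):
    counts = step_counts(n)
    print(n, counts["n^2"], counts["n^3"], counts["n!"])
```

At twenty items the cubic algorithm needs eight thousand steps, while the factorial one needs more than two quintillion.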
This amounts to what is arguably the central insight of computer science. It’s possible to quantify the difficulty of a problem. And some problems are just … hard.
Where does this leave the traveling salesman problem? Curiously enough, we are still not quite sure. In 1972, Berkeley’s Richard Karp demonstrated that the traveling salesman problem is linked to a controversially borderline class of problems that have not yet been definitively proven to be either efficiently solvable or not. But so far there have been no efficient solutions found for any of those problems—making them effectively intractable—and most computer scientists believe that there aren’t any to be found. So the “impossibility result” for the traveling salesman problem that Flood imagined in the 1950s is likely to be its ultimate fate. What’s more, many other optimization problems—with implications for everything from political strategy to public health to fire safety—are similarly intractable.
But for the computer scientists who wrestle with such problems, this verdict isn’t the end of the story. Instead, it’s more like a call to arms. Having determined a problem to be intractable, you can’t just throw up your hands. As scheduling expert Jan Karel Lenstra told us, “When the problem is hard, it doesn’t mean that you can forget about it, it means that it’s just in a different status. It’s a serious enemy, but you still have to fight it.” And this is where the field figured out something invaluable, something we can all learn from: how to best approach problems whose optimal answers are out of reach. How to relax.
Just Relax
The perfect is the enemy of the good.
—VOLTAIRE
When somebody tells you to relax, it’s probably because you’re uptight—making a bigger deal of things than you should. When computer scientists are up against a formidable challenge, their minds also turn to relaxation, as they pass around books like An Introduction to Relaxation Methods or Discrete Relaxation Techniques. But they don’t relax themselves; they relax the problem.
One of the simplest forms of relaxation in computer science is known as Constraint Relaxation. In this technique, researchers remove some of the problem’s constraints and set about solving the problem they wish they had. Then, after they’ve made a certain amount of headway, they try to add the constraints back in. That is, they make the problem temporarily easier to handle before bringing it back to reality.
For instance, you can relax the traveling salesman problem by letting the salesman visit the same town more than once, and letting him retrace his steps for free. Finding the shortest route under these looser rules produces what’s called the “minimum spanning tree.” (If you prefer, you can also think of the minimum spanning tree as the fewest miles of road needed to connect every town to at least one other town. The shortest traveling salesman route and the minimum spanning tree for Lincoln’s judicial circuit are shown below.) As it turns out, solving this looser problem takes a computer essentially no time at all. And while the minimum spanning tree doesn’t necessarily lead straight to the solution of the real problem, it is quite useful all the same. For one thing, the spanning tree, with its free backtracking, will never be any longer than the real solution, which has to follow all the rules. Therefore, we can use the relaxed problem—the fantasy—as a lower bound on the reality. If we calculate that the spanning tree distance for a particular set of towns is 100 miles, we can be sure the traveling salesman distance will be no less than that. And if we find, say, a 110-mile route, we can be certain it is at most 10% longer than the best solution. Thus we can get a grasp of how close we are to the real answer even without knowing what it is.
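The lower-bound argument can be sketched in Python. The code below uses Prim's algorithm, one standard way to build a minimum spanning tree, and assumes towns are points on a flat map joined by straight-line distances. The function names `mst_length` and `max_excess` are hypothetical.

```python
from math import dist


def mst_length(points):
    """Total length of a minimum spanning tree, via Prim's algorithm.

    points: list of (x, y) coordinates. Runs in O(n^2) time.
    """
    n = len(points)
    if n == 0:
        return 0.0
    in_tree = [False] * n
    cheapest = [float("inf")] * n   # cheapest known link into the tree
    cheapest[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Attach the town that is currently cheapest to connect.
        u = min((i for i in range(n) if not in_tree[i]),
                key=cheapest.__getitem__)
        in_tree[u] = True
        total += cheapest[u]
        for v in range(n):
            if not in_tree[v]:
                cheapest[v] = min(cheapest[v], dist(points[u], points[v]))
    return total


def max_excess(route_length, lower_bound):
    """Largest fraction by which a route can exceed the optimum,
    given a lower bound on the optimal length."""
    return route_length / lower_bound - 1


corners = [(0, 0), (0, 1), (1, 1), (1, 0)]   # hypothetical towns
bound = mst_length(corners)                  # no tour can be shorter
gap = max_excess(110, 100)                   # the example from the text
```

For the four corners of a unit square the tree has length 3 and the best circuit has length 4, so the bound holds with room to spare. For the 100-mile tree and 110-mile route from the text, the gap works out to one tenth.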