Super Crunchers
I had not then completed a full-scale analysis but I [told them I] had some 85 couch hours with a Vienna-trained analyst, and my own therapeutic mode was strongly psychodynamic…. The glowing warmth of the gathering cooled noticeably. A well-known experimental psychologist became suddenly hostile. He glared at me and said, “Now, come on, Meehl, how could anybody like you, with your scientific training at Minnesota, running rats and knowing math, and giving a bang-up talk like you just gave, how could you think there is anything to that Freudian dream shit?”
Meehl continued to inhabit this schizophrenic space for fifty more years. His initial study—which he playfully referred to as “my disturbing little book”—was only an opening salvo in what would become not only a personal lifelong passion, but also a veritable industry of man-versus-machine studies by others.
Researchers have now completed dozens upon dozens of studies comparing the success rates of statistical and expert approaches to decision making. The studies have analyzed the relative ability of Super Crunchers and experts in predicting everything from marital satisfaction and academic success to business failures and lie detection. As long as you have a large enough dataset, almost any decision can be crunched. Studies have suggested that number crunchers using statistical databases can even outperform humans in guessing someone’s sexual orientation or in constructing a satisfying crossword puzzle.
Just recently Chris Snijders, a professor at Eindhoven University of Technology, in the Netherlands, decided to see whether he could out-purchase professional corporate buyers. Snijders (rhymes with White Castle “sliders”) had collected a database on more than 5,200 computer equipment and software purchases by more than 700 Dutch businesses. For each purchase, Snijders had information on more than 300 aspects of the transaction—including several aspects of purchasing satisfaction such as whether the delivery was late or non-conforming, or whether the product had insufficient documentation. “I had the feeling,” he told me, “there is knowledge in there that could be useful somehow to the world. So I went out with my papers trying to explain what is in there to firms, and then they laughed at me. ‘What do you know? You never worked in any kind of company. Forget your data.’ So this was the reason I did the test. If you think that these data say nothing, well, we will see.”
Snijders used part of his data to estimate a regression equation relating the satisfaction of corporate purchasers to fourteen aspects of the transaction—things like the size and reputation of the supplier as well as whether lawyers were involved in negotiating the contract. He then used a different set of transactions to run a test pitting his Super Crunched predictions against the predictions of professional purchasing managers. Just as in the Supreme Court test, each of the purchasing experts was presented with a handful of different purchasing cases to analyze.
Just as in the earlier studies, Snijders’s purchasing managers couldn’t outperform a simple statistical formula to predict timeliness of delivery, adherence to the budget, or purchasing satisfaction. He and his coauthors found that the judgments of professional managers were “meager at best.” The Super Crunching formula outperformed even above-average managers. More experienced managers did no better than novices. And managers reviewing transactions in their own industry did not fare any better than those reviewing transactions in different industries. In the end, the study suggests purchasing professionals cannot outpredict even a simple regression equation. Snijders is confident that this is true generally. “As long as you have some history and some quantifiable data from past experiences,” he claims, the regression will win. “It’s not just pie in the sky,” he said. “I have the data to support this.”
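Snijders's method can be sketched in miniature. The code below fits a one-variable ordinary-least-squares regression on made-up "past transaction" data (a supplier reputation score predicting purchaser satisfaction) and then scores a held-out purchase with the fitted line. The data, variable names, and the single-predictor simplification are all invented for illustration; Snijders's actual equation used fourteen transaction characteristics.

```python
# Minimal sketch of a Snijders-style test: fit an ordinary-least-squares
# regression on "training" transactions, then predict satisfaction for a
# held-out transaction. All numbers are invented for illustration.

def fit_ols(xs, ys):
    """Return (intercept, slope) minimizing squared prediction error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical past purchases: supplier reputation (0-10) vs.
# purchaser satisfaction (0-10).
reputation = [2, 4, 5, 7, 9]
satisfaction = [3, 4, 6, 7, 9]

b0, b1 = fit_ols(reputation, satisfaction)

# Score a held-out purchase from a reputation-6 supplier.
predicted = b0 + b1 * 6
```

The point of the man-versus-machine test is then simply to compare such out-of-sample predictions against the purchasing managers' guesses for the same held-out transactions.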
Snijders’s results are just a recent example in a long line of man-versus-machine comparisons. Near the end of his life, Meehl, together with Minnesota protégé William Grove, completed a meta-analysis of 136 of these man-versus-machine studies. In only 8 of 136 studies was expert prediction found to be appreciably more accurate than statistical prediction. The rest of the studies were equally divided between those where statistical prediction “decisively outperformed” expert prediction, and those where the accuracy was not appreciably different. Overall, when asked to make binary predictions, the average expert in these wildly diverse fields got it right about two-thirds of the time (66.5 percent). The Super Crunchers, however, had a success rate of almost three-quarters (73.2 percent). The eight studies favoring the expert weren’t concentrated in a particular issue area and didn’t have any obvious characteristics in common. Meehl and Grove concluded, “The most plausible explanation of these deviant studies is that they arose by a combination of random sampling errors (8 deviant out of 136) and the clinicians’ informational advantage in being provided with more data than the actuarial formula.”
Why Are Humans Bad at Making Predictions?
These results are the stuff of science fiction nightmares. Our best and brightest experts in field after field are losing out to Super Crunching predictions. How can this be happening?
The human mind tends to suffer from a number of well-documented cognitive failings and biases that distort our ability to predict accurately. We tend to give too much weight to unusual events that seem salient. For example, people systematically overestimate the probability of “newsworthy” deaths (such as murder) and underestimate the probability of more common causes of death. Most people think that having a gun in your house puts your children at risk. However, Steve Levitt, looking at statistics, pointed out that “on average, if you own a gun and have a swimming pool in the yard, the swimming pool is almost 100 times more likely to kill a child than the gun is.”
Once we form a mistaken belief about something, we tend to cling to it. As new evidence arrives, we’re likely to discount disconfirming evidence and focus instead on evidence that supports our preexisting beliefs.
In fact, it’s possible to test your own ability to make unbiased estimates. For each of the following ten questions, give the range of answers that you are 90 percent confident contains the correct answer. For example, for the first question, you are supposed to fill in the blanks: “I am 90 percent confident that Martin Luther King’s age at the time of his death was somewhere between—years and—years.” Don’t worry about not knowing the exact answer—and no using Google. The goal here is to see whether you can construct confidence intervals that include the correct answer 90 percent of the time. Here are the ten questions:
1. What was Martin Luther King, Jr.’s age at death?   LOW _____   HIGH _____
2. What is the length of the Nile River, in miles?   LOW _____   HIGH _____
3. How many countries belong to OPEC?   LOW _____   HIGH _____
4. How many books are there in the Old Testament?   LOW _____   HIGH _____
5. What is the diameter of the moon, in miles?   LOW _____   HIGH _____
6. What is the weight of an empty Boeing 747, in pounds?   LOW _____   HIGH _____
7. In what year was Mozart born?   LOW _____   HIGH _____
8. What is the gestation period of an Asian elephant, in days?   LOW _____   HIGH _____
9. What is the air distance from London to Tokyo, in miles?   LOW _____   HIGH _____
10. What is the deepest known point in the ocean, in feet?   LOW _____   HIGH _____
Answering “I have no idea” is not allowed. It’s also a lie. Of course you have some idea. You know that the deepest point in the ocean is more than 2 inches and less than 100,000 miles. I’ve included the correct answers below so you can actually check to see how many you got right. You can’t win if you don’t play.*2
If all ten of your intervals include the correct answer, you’re under-confident. Any of us could have made sure that this occurred—just by making our answers arbitrarily wide. I’m 100 percent sure Mozart was born sometime between 33 BC and, say, 1980. But almost everyone who answers these questions has the opposite problem of overconfidence—they can’t stop themselves from reporting ranges that are too small. People think they know more than they actually know. In fact, when Ed Russo and Paul Schoemaker tested more than 1,000 people, they found that most people missed between four and seven of the questions. Fewer than 1 percent gave ranges that included the right answer nine or ten times. Ninety-nine percent of people were overconfident.
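The scoring rule behind the quiz is easy to mechanize: count how many of your intervals actually contain the true value. A well-calibrated 90 percent interval should score a hit about nine times out of ten. The sketch below uses a hypothetical respondent's ranges for the first five questions; the intervals are invented, and the answer values are ordinary reference figures used only for illustration, not the book's answer key.

```python
# Score a calibration quiz: count how many (low, high) intervals
# contain the true value. Intervals are a hypothetical respondent's
# guesses; the "truths" are illustrative reference figures.

def calibration_score(intervals, truths):
    """Number of intervals that contain the corresponding true value."""
    return sum(low <= t <= high for (low, high), t in zip(intervals, truths))

guesses = [(30, 50), (1000, 3000), (8, 15), (20, 45), (1500, 2500)]
truths = [39, 4132, 12, 39, 2159]

hits = calibration_score(guesses, truths)
# This respondent's narrow Nile guess misses: 4 hits out of 5, i.e. 80
# percent coverage from intervals that were supposed to cover 90 percent.
```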
So humans not only are prone to make biased predictions, we’re also damnably overconfident about our predictions and slow to change them in the face of new evidence.
In fact, these problems of bias and overconfidence become more severe the more complicated the prediction. Humans aren’t so bad at predicting simple things—like whether a Coke can that has been shaken will spurt. Yet when the number of factors to be considered grows and it is not clear how much weight should be placed on the individual factors, we’re headed for trouble. Coke-can prediction is pretty much one factor: when a can has recently been shaken, we can predict the result almost with certainty. In noisy environments, it’s often not clear what factors should be taken into account. This is where we tend mistakenly to bow down to experts who have years of experience—the baseball scouts and physicians—who are confident that they know more than the average Joe.
The problem of overconfidence isn’t just a problem of academic experiments either. It distorts real-world decisions. Vice President Cheney as late as June 2005 predicted on Larry King Live that the major U.S. involvement in Iraq would end before the Bush administration left office. “The level of activity that we see today from a military standpoint, I think will clearly decline,” he said with confidence. “I think they’re in the last throes, if you will, of the insurgency.” The administration’s overconfidence with regard to the cost of the war was, if anything, even more extreme. In 2002, Glenn Hubbard, chairman of the president’s Council of Economic Advisers, predicted that the “costs of any such intervention would be very small.” In April 2003, Defense Secretary Donald Rumsfeld dismissed the idea that reconstruction would be costly: “I don’t know,” he said, “that there is much reconstruction to do.” Several of the key planners were convinced that Iraq’s own oil revenue could pay for the war. “There’s a lot of money to pay for this that doesn’t have to be U.S. taxpayer money,” Deputy Defense Secretary Paul Wolfowitz predicted. “And it starts with the assets of the Iraqi people…. We’re dealing with a country that can really finance its own reconstruction.” While it’s easy to dismiss these forecasts as self-interested spin, I think it’s more likely that these were genuine beliefs of decision makers who, like the rest of us, have trouble updating their beliefs in the face of disconfirming information.
In contrast to these human failings, think about how well Super Crunching predictions are structured. First and foremost, Super Crunchers are better at making predictions because they do a better job at figuring out what weights should be put on individual factors in making a prediction. Indeed, regression equations are so much better than humans at figuring out appropriate weights that even very crude regressions with just a few variables have been found to outpredict humans. Cognitive psychologists Richard Nisbett and Lee Ross put it this way: “Human judges are not merely worse than optimal regression equations; they are worse than almost any regression equation.”
Unlike self-involved experts, statistical regressions don’t have egos or feelings. Being “unemotional is very important in the financial world,” said Greg Forsythe, a senior vice president at Schwab Equity Ratings, which uses computer models to evaluate stocks. “When money is involved, people get emotional.” Super Crunching models, in contrast, have no emotional commitments to their prior predictions. As new data are collected, the statistical formulae are recalculated and new weights are assigned to the individual elements.
Statistical predictions are also not overconfident. Remember Farecast.com doesn’t just predict whether an airline price is likely to increase; it also tells you what proportion of the time the prediction is going to be true. Same goes with randomized trials—they not only give you evidence of causation, but they also tell you how good the causal evidence is. Offermatica tells you that a new web design causes 12 percent more sales and that it’s 95 percent sure that the true causal effect is between 10.5 percent and 13.5 percent. Each statistical prediction comes with its own confidence interval.
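The kind of interval an Offermatica-style test reports can be reproduced with a standard textbook calculation: the difference between two conversion rates, plus or minus 1.96 standard errors for a 95 percent interval, using the normal approximation for a difference of proportions. The visitor and conversion counts below are invented for illustration.

```python
import math

# 95% confidence interval for the lift of web design B over design A,
# via the normal approximation for a difference of two proportions.
# Visitor and conversion counts are invented for illustration.

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Return (low, point_estimate, high) for the lift of B over A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff, diff + z * se

low, diff, high = lift_ci(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
# The point estimate says B converts better; the width of (low, high)
# says how sure we should be. Here the interval still straddles zero,
# so this much traffic can't yet rule out "no real difference."
```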
In sharp contrast to traditional experts, statistical procedures not only predict, they also tell you the quality of the prediction. Experts either don’t tell you the quality of their predictions or are systematically overconfident in describing their accuracy. Indeed, this difference is at the heart of the shift to evidence-based medical guidelines. The traditional expert-generated guidelines just gave undifferentiated pronouncements of what physicians should and should not do. Evidence-based guidelines, for the first time, explicitly tell physicians the quality of the evidence underlying each suggested practice. Signaling the quality of the evidence lets physicians (and patients) know when a guideline is written in stone and when it’s just their best guess given limited information.
Of course, data analysis can itself be riddled with errors. Later on, I’ll highlight examples of data-based decision making that have failed spectacularly. Still, the trend is clear. Decisions that are backed by quantitative prediction are at least as good as and often substantially better than decisions based on mere lived experience. The mounting evidence of statistical superiority has led many to suggest that we should strip experts of at least some of their decision-making authority. As Dr. Don Berwick said, physicians might perform better if, like flight attendants, they were forced to follow more scripted procedures.
Why Not Both?
Instead of simply throwing away the know-how of traditional experts, wouldn’t it be better to combine Super Crunching and experiential knowledge? Can’t the two types of knowledge peacefully coexist?
There is some evidence to support the possibility of peaceful coexistence. Traditional experts make better decisions when they are provided with the results of statistical prediction. Those who cling to the authority of traditional experts tend to embrace the idea of combining the two forms of knowledge by giving the experts “statistical support.” The purveyors of diagnostic software are careful to underscore that its purpose is only to provide support and suggestions. They want the ultimate decision and discretion to lie with the doctor. Humans usually do make better predictions when they are provided with the results of statistical prediction. The problem, according to Chris Snijders, is that, even with Super Crunching assistance, humans don’t predict as well as the Super Crunching prediction by itself. “What you usually see is the judgment of the aided experts is somewhere in between the model and the unaided expert,” he said. “So the experts get better if you give them the model. But still the model by itself performs better.” Humans too often wave off the machine predictions and cling to their misguided personal convictions.
When I asked Ted Ruger whether he thought he could outpredict the computer algorithm if he had access to the computer’s prediction, he caught himself sliding once again into the trap of overconfidence. “I should be able to beat it,” he started, but then corrected himself. “But maybe not. I wouldn’t really know what my thought process would be. I’d look at the model and then I would sort of say, well, what could I do better here? I would probably muck it up in a lot of cases.”
Evidence is mounting in favor of a different and much more demeaning, dehumanizing mechanism for combining expert and Super Crunching expertise. In several studies, the most accurate way to exploit traditional expertise is simply to add the expert evaluation as an additional factor in the statistical algorithm. Ted’s Supreme Court study, for example, suggests that a computer with access to human predictions would rely on the experts to determine the votes of the more liberal justices (Breyer, Ginsburg, Souter, and Stevens)—because the unaided experts outperformed the Super Crunching algorithm in predicting the votes of these justices.
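The "expert as a factor" idea can be sketched concretely: treat the human's prediction as just another input column and let least squares decide how much weight it deserves. In the toy data below (all numbers invented), the outcomes happen to sit exactly halfway between the model's score and the expert's score, so the fitted weights come out near one-half each; with real data, the algorithm would assign the expert whatever weight the evidence supports—possibly very little.

```python
# Sketch of "expert as a feature": a two-predictor least-squares fit
# (no intercept, to keep the normal equations a 2x2 solve). The model
# scores, expert scores, and outcomes are all invented for illustration.

def fit_two_feature_ols(x1, x2, y):
    """Solve the 2x2 normal equations for weights on predictors x1, x2."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2

model_score = [0.2, 0.5, 0.6, 0.8]    # statistical model's prediction
expert_score = [0.4, 0.3, 0.7, 0.9]   # human expert's prediction
outcome = [0.3, 0.4, 0.65, 0.85]      # what actually happened

w_model, w_expert = fit_two_feature_ols(model_score, expert_score, outcome)
blended = [w_model * m + w_expert * e
           for m, e in zip(model_score, expert_score)]
```

The crucial reversal is in who holds the pen: the expert's judgment enters the machine as data, rather than the machine's output entering the expert's deliberation as advice.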
Instead of having the statistics as a servant to expert choice, the expert becomes a servant of the statistical machine. Mark E. Nissen, a professor at the Naval Postgraduate School in Monterey, California, who has tested computer-versus-human procurement, sees a fundamental shift toward systems where the traditional expert is stripped of his or her power to make the final decision. “The newest space, and the one that’s most exciting, is where machines are actually in charge,” he said, “but they have enough awareness to seek out people to help them when they get stuck.” It’s best to have the man and machine in dialogue with each other, but, when the two disagree, it’s usually better to give the ultimate decision to the statistical prediction.
The decline of expert discretion is particularly pronounced in the case of parole. In the last twenty-five years, eighteen states have replaced their parole systems with sentencing guidelines. And those states that retain parole have shifted their systems to rely increasingly on Super Crunching risk assessments of recidivism. Just as your credit score powerfully predicts the likelihood that you will repay a loan, parole boards now have externally validated predictions framed as numerical scores in formulas like the VRAG (Violence Risk Appraisal Guide), which estimates the probability that a released inmate will commit a violent crime. Still, even reduced discretion can give rise to serious risk when humans deviate from the statistically prescribed course of action.