
Expert Political Judgment


by Philip E. Tetlock


  Radical skeptics should mostly welcome the initial results. Humanity barely bests the chimp, losing on one key variable and winning on the other. We lose on calibration: the average gaps between probability judgments and reality are larger for humans than for the hypothetical chimp. But we win on discrimination: we do better than the chimp at assigning higher probabilities to occurrences than to nonoccurrences. And the win on discrimination is big enough to offset the loss on calibration and give humanity a superior overall probability score (reflected in the clustering of the human data points on the constant-probability-score diagonal just above that for the chimp).
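  For readers who want the mechanics behind these comparisons: calibration and discrimination here are the two components of the standard decomposition of the probability (Brier) score, PS = VI + CI - DI, where VI is the variability of outcomes, CI the calibration index, and DI the discrimination index. The Python sketch below uses invented forecasts and a simplified exact-value binning rule purely to show how a forecaster can lose on calibration, win on discrimination, and still earn the better (lower, in the Brier convention) overall score; it is not the book's scoring code.

  # A minimal sketch (invented numbers): decompose the probability score into
  # calibration (CI), discrimination (DI), and outcome variability (VI), with
  # PS = VI + CI - DI. Forecasts are binned by their exact value for simplicity.
  from collections import defaultdict

  def decompose(forecasts, outcomes):
      n = len(forecasts)
      base_rate = sum(outcomes) / n
      bins = defaultdict(list)
      for f, o in zip(forecasts, outcomes):
          bins[f].append(o)
      # CI: squared gap between each probability used and the outcome frequency at that probability
      ci = sum(len(os) * (f - sum(os) / len(os)) ** 2 for f, os in bins.items()) / n
      # DI: how far the bin frequencies stray from the overall base rate
      di = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()) / n
      vi = base_rate * (1 - base_rate)
      return ci, di, vi + ci - di

  outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
  chimp = decompose([0.33] * 10, outcomes)           # one flat probability: DI = 0
  human = decompose([0.6, 0.2, 0.2, 0.6, 0.2, 0.2,
                     0.2, 0.6, 0.2, 0.2], outcomes)  # varied probabilities: CI worse, DI > 0
  print("chimp  CI, DI, PS:", chimp)
  print("human  CI, DI, PS:", human)

  With these toy numbers the flat forecaster is almost perfectly calibrated but cannot discriminate at all, while the varied forecaster pays a calibration penalty yet still ends up with the better probability score, the same pattern described above.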

  Defenders of our collective cognitive dignity also get faint solace when we compare human forecasting accuracy to that of algorithms that mechanically assign to events probabilities corresponding to estimates of the base-rate frequencies of those events. Humanity does manage to edge out the restrictive and expansive base-rate algorithms, which assumed that the near- to medium-term future would—in aggregate—look exactly like the near- to medium-term past. But humanity can only eke out a tie against the contemporary-base-rate strategy, which, by exploiting aggregated outcome knowledge up to the present, achieves virtually perfect calibration. Of course, one could argue that little stigma attaches to this loss. The shortcut to good calibration is to assign probabilities that correspond to the best available data on the base-rate frequencies of the usual trichotomy of possibilities—perpetuation of the status quo (50.5 percent), change in the direction of more of something (28.5 percent), and change in the direction of less of something (21 percent). The contemporary-base-rate algorithm accomplishes no great feat by predicting that events with a base-rate frequency of 51 percent in the current forecasting period have a 51 percent likelihood of occurring in that same period. The algorithm "cheats": it peeks at outcome data to which no other competitor has advance access.
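  To see in miniature why the contemporary base rate is so hard to beat on calibration, let p be the probability the algorithm always issues for an outcome class, so that the observed frequency in its single probability bin is, by construction, that class's base rate (a simplified illustration using the status quo figure above, not the appendix's exact bookkeeping):

\[
\text{calibration gap} = (p - \text{frequency})^2 = (0.505 - 0.505)^2 = 0,
\qquad
\text{discrimination gain} = (\text{frequency} - \text{base rate})^2 = (0.505 - 0.505)^2 = 0.
\]

  Perfect calibration comes free, but so does zero discrimination: a forecast that never varies cannot separate occurrences from nonoccurrences.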

  Overall, we can put negative or positive spins on these results. The negative spin tells a tale of hubris: people bet on case-specific hunches that low-frequency events would occur, and they paid the price for this bias—base-rate neglect—in the form of inferior calibration scores. The critical driver of human probability judgments was not "how commonly do events of this sort occur?" but rather "how easily do compelling causal scenarios come to mind?"

  The positive spin is a tale of courageous observers venturing out on limbs to tell us things we did not already know about a volatile world. Humans overpredicted lower-frequency events: departures from the status quo—either in the direction of less of something (which, in turn, could have been either bad [e.g., lower GDP growth] or good [e.g., declining corruption ratings]) or in the direction of more of something (which, in turn, could have been either bad [e.g., greater central government debt] or good [e.g., greater political freedom]).

  This positive spin analysis implies that we should "forgive" human forecasters for losing on calibration but winning on discrimination. The chimp and base-rate algorithms did not make even token efforts to discriminate. As a matter of policy, they always assigned the same probabilities across the board and received the lowest possible discrimination score, zero. By contrast, humans tried to discriminate and were somewhat successful (a value of .03 on the y axis translates into "explaining" roughly 18 percent of the total variation in forecasting outcomes). The probability score curves in figure 2.5 (curves that plot logically possible trade-offs between calibration and discrimination, holding overall accuracy constant) suggest that people may indeed have paid a reasonable calibration price to achieve this level of discrimination: the human probability score function is higher than the chimp's and roughly equal to that of the contemporary-base-rate algorithm. One could also argue that these probability score functions underestimate the net human advantage because they treat calibration and discrimination equally, whereas discrimination should be valued over calibration.46 In many real-world settings, it is more vital to assign sharply higher probabilities to events that occur—even at the cost of embarrassing false alarms or misses—than it is to attach probabilities that covary tightly with the objective likelihoods of occurrence across the full zero-to-one scale.
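  The "roughly 18 percent" translation comes from normalizing discrimination by the variability of outcomes, the DI/VI "omniscience" index invoked later in the chapter. The variability index value below is an assumed round figure used only to reproduce the order of magnitude, not a number reported in the text:

\[
\frac{\mathrm{DI}}{\mathrm{VI}} \approx \frac{0.03}{0.17} \approx 0.18
\]

  The ratio is read, in the loose sense used here, as the share of total outcome variance a forecaster's probability distinctions manage to "explain."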

  But we lose these “excuses” when we turn to the case-specific extrapolation algorithms, which assign different probabilities to outcomes as a function of the distinctive outcome histories of each case. Humanity now loses on both calibration and discrimination.

  This latter result demolishes two of humanity’s principal defenses. It neutralizes the argument that forecasters’ modest showing on calibration was a price worth paying for the bold, roughly accurate predictions that only human beings could deliver (and that were thus responsible for their earlier victories on discrimination). And it pours cold water on the comforting notion that human forecasters failed to outperform minimalist benchmarks because they had been assigned an impossible mission—in effect, predicting the unpredictable. Translating the predictions of the crude case-specific extrapolation algorithms, as well as the sophisticated time series forecasting equations, into subjective probability equivalents, we discover that, whereas the best human forecasters were hard-pressed to predict more than 20 percent of the total variability in outcomes (using the DI/VI “omniscience” index in the Technical Appendix), the crude case-specific algorithms could predict 25 percent to 30 percent of the variance and the generalized autoregressive distributed lag models explained on average 47 percent of the variance.47

  These results plunk human forecasters into an unflattering spot along the performance continuum, distressingly closer to the chimp than to the formal statistical models. Moreover, the results cannot be dismissed as aberrations: figure 2.4 shows that human calibration and discrimination scores do not vary much across a broad swath of short- and long-term forecasts, policy domains, and states in both the zone of stability (North America, Western Europe, and Japan) and the zone of turbulence (Eastern Europe, the Middle East, Africa, South Asia, and Latin America). Surveying these scores across regions, time periods, and outcome variables, we find support for one of the strongest debunking predictions: it is impossible to find any domain in which humans clearly outperformed crude extrapolation algorithms, still less sophisticated statistical ones.48

  The Diminishing Marginal Predictive Returns Hypothesis

  Figures 2.5 and 2.6 bolster another counterintuitive prediction of radical skepticism. Figure 2.5 shows that, collapsing across all judgments, experts on their home turf made neither better calibrated nor more discriminating forecasts than did dilettante trespassers. And figure 2.6 shows that, at each level along the subjective probability scale from zero to one, expert and dilettante calibration curves were strikingly similar. People who devoted years of arduous study to a topic were as hard-pressed as colleagues casually dropping in from other fields to affix realistic probabilities to possible futures.

  The case for performance parity between experts and dilettantes is strong. But experts have several possible lines of defense. One is to argue that we have failed to identify the “real experts” among the experts—and that if we define expertise more selectively, we will find that it confers a real forecasting advantage. This argument does not hold up well when we make the standard distinctions among degrees and types of expertise. More refined statistical comparisons failed to yield any effects on either calibration or discrimination that could be traced to amount of experience (seniority) or types of expertise (academic, government or private sector background, access to classified information, doctoral degree or not, or status of university affiliation). There was also little sign that expertise by itself, or indicators of degrees of expertise, improved performance when we broke forecasting questions into subtypes: short-term versus long-term, zone of stability versus turbulence, and domestic political, economic policy/performance, and national security issues.

  Figure 2.6. The first panel compares the calibration functions of several types of human forecasters (collapsing over thousands of predictions across fifty-eight countries across fourteen years). The second panel compares the calibration functions of several types of statistical algorithms on the same outcome variables.

  A second, more promising, line of defense shifts from identifying superexperts to raising the status of the so-called dilettantes. After all, the dilettantes are themselves experts—sophisticated professionals who are well versed in whatever general-purpose political or economic theories might confer predictive leverage. Their only disadvantage is that they know a lot less about certain regions of the world. And even that disadvantage is mitigated by the fact that most dilettantes tracked a wide range of current events in elite news outlets.

  To shed further light on where the point of diminishing marginal predictive returns for expertise lies (how far down the ladder of cognitive sophistication we can go without undercutting forecasting skill), we compared experts against a humbler, but still human, benchmark: briefly briefed Berkeley undergraduates. In 1992, we gave psychology majors "facts on file" summaries, each three paragraphs long, that presented basic information on the polities and economies of Russia, India, Canada, South Africa, and Nigeria. We then asked students to make their best guesses on a standard array of outcome variables. The results spared experts further embarrassment. Figure 2.5 shows that the undergraduates were both less well calibrated and less discriminating than professionals working either inside or outside their specialties. And figure 2.6 shows that the calibration curve for undergraduates strays much further from the diagonal of perfect calibration than do the expert and dilettante curves (hence far worse calibration scores).

  These results suggest that, although subject matter expertise does not give a big boost to performance, it is not irrelevant. If one insists on thinking like a human being rather than a statistical algorithm, on trying to figure out on a case-by-case basis the idiosyncratic balance of forces favoring one or another outcome, it is especially dangerous to do so equipped only with the thin knowledge base of the undergraduates. The professionals—experts and dilettantes—possessed an extra measure of sophistication that allowed them to beat the undergraduates soundly and to avoid losing by ignominiously large margins to the chimp and crude extrapolation algorithms. That extra sophistication would appear to be pegged in the vicinity of savvy readers of high-quality news sources such as the Economist, the Wall Street Journal, and the New York Times, the publications that dilettantes most frequently reported as useful sources of information on topics outside their specialties.

  At this juncture, defenders of expertise have exhausted their defenses within a probability scoring framework that treats all errors—overpredicting or underpredicting change for the better or worse—as equal. To continue the scientific fight, they must show that, although experts made as many and as large "mistakes" as dilettantes—where bigger mistakes mean bigger gaps between subjective probabilities and objective frequencies—experts made mostly the "right mistakes," mistakes with good policy rationales such as "better safe than sorry" and "don't cry wolf too often," whereas dilettantes underpredicted and overpredicted willy-nilly.

  Unfortunately for this defense, experts and dilettantes have similar mistake profiles. Both exaggerate the likelihood of change for the worse.49 Such outcomes occur 23 percent of the time, but experts assign average probabilities of .35, dilettantes, .29. Both also exaggerate the likelihood of change for the better. Such outcomes occur 28 percent of the time and experts assign average subjective probabilities of .34, dilettantes, .31. It follows that experts and dilettantes must underestimate the likelihood of the status quo by complementary margins.

  But the mistake profiles of experts versus dilettantes, and of humans versus chimps, were not identical. Experts overpredict change significantly more than both dilettantes and the chimp strategy. So there is potential for catch-up by value-adjusting probability scores in ways that give experts varying benefits of the doubt. The Technical Appendix shows how to value-adjust probability scores by (a) identifying the mistakes that forecasters are, on balance, most prone to make; and (b) solving for values of k that narrow the gaps between subjective probabilities and objective reality in proportion to the average size of the dominant mistake (generously assuming that forecasters' dominant mistake was, on balance, the right mistake given their error-avoidance priorities).
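  A minimal sketch of one plausible reading of that two-step recipe appears below. The function name, the decision to treat k as the average size of the dominant error, and the rule of shaving up to k off each error of that type are assumptions made for illustration; the Technical Appendix's actual procedure may differ in its details.

  # Sketch of the value-adjustment idea: (a) identify the forecaster's dominant
  # mistake (net overprediction vs. net underprediction); (b) set k to the average
  # size of that mistake and forgive up to k of each error of the dominant type
  # before squaring. Names and the forgiveness rule are illustrative assumptions.

  def value_adjusted_score(forecasts, outcomes):
      gaps = [f - o for f, o in zip(forecasts, outcomes)]
      over = [g for g in gaps if g > 0]    # probability assigned to nonoccurrences
      under = [-g for g in gaps if g < 0]  # probability withheld from occurrences
      dominant_is_over = sum(over) >= sum(under)
      dominant = over if dominant_is_over else under
      k = sum(dominant) / len(dominant) if dominant else 0.0
      adjusted = []
      for g in gaps:
          if dominant_is_over and g > 0:
              g = max(g - k, 0.0)          # benefit of the doubt on the dominant mistake
          elif not dominant_is_over and g < 0:
              g = min(g + k, 0.0)
          adjusted.append(g * g)
      return sum(adjusted) / len(adjusted)

  # Example usage with made-up forecasts: overprediction dominates here, so k is the
  # average overprediction and those gaps are shrunk before squaring.
  print(value_adjusted_score([0.70, 0.60, 0.80, 0.20], [0, 0, 1, 0]))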

  Figure 2.7 shows the impact of value-adjusting probability scores. The unadjusted probability scores are at the base of each arrow. These scores are the sum of overpredictions (assigning high probabilities to things that do not happen) and underpredictions (assigning low probabilities to things that do happen). The better scores thus migrate upward and rightward. Once again, the unadjusted score for the case-specific algorithm (A) falls on a superior probability score function to that for human beings (either experts or dilettantes), which, in turn, falls on a superior probability score function to that for the chimp.

  Figure 2.7. The impact of the k-value adjustment (procedure) on performance of experts (E), dilettantes (D), chimps (C), and an aggressive case-specific extrapolation algorithm (A) in three different forecasting tasks: distinguishing status quo from change for either better or worse, distinguishing change for better from either status quo or change for worse, or distinguishing change for worse from either status quo or change for better.

  Longer arrows mean bigger value-adjustment effects, and the tip of each arrow is the value-adjusted probability score. Panel 1 shows that, when we focus on predicting continuation of the status quo (as opposed to change for either the better or worse), the optimal value adjustment decreases the “penalties” for underprediction, at some price in overprediction. Panel 2 shows that, when we focus on predicting change for the better, the optimal adjustment decreases the penalties for overprediction, at some price in underprediction. Panel 3 shows that, when we focus on predicting change for the worse, the optimal value adjustment decreases the penalties for overprediction, at some price in underprediction.

  Catch-up is so elusive, and the chimp is the biggest beneficiary of adjustments, because the chimp strategy errs most consistently over the long run (always underestimating the probabilities of events that happen more than 33 percent of the time and always overestimating the probabilities of events that happen less than 33 percent of the time). Adjustments thus bring the chimp to approximate parity with humans. Experts, dilettantes, and case-specific extrapolation algorithms benefit less because they make more complicated mixtures of errors, alternately under- or overpredicting change for the better or worse.
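  The arithmetic behind the chimp's windfall, using the base rates quoted earlier and the chimp's flat one-in-three assignments (a rough illustration, not the appendix computation):

\[
\begin{aligned}
\text{status quo:} &\quad 0.505 - 0.33 = 0.175 \quad \text{(always underpredicted)}\\
\text{more of something:} &\quad 0.33 - 0.285 = 0.045 \quad \text{(always overpredicted)}\\
\text{less of something:} &\quad 0.33 - 0.21 = 0.12 \quad \text{(always overpredicted)}
\end{aligned}
\]

  Because each gap keeps a fixed sign, a single benefit-of-the-doubt correction per category absorbs nearly all of it.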

  The net result is that value adjustments do not give much of a boost to either experts in particular or humanity in general. Of course, there are still grounds for appeal. The current round of value adjustments corrects only for the dominant error a forecaster makes and does not allow that it might be prudent to alternate between under- and overpredicting change for the better or worse across contexts. One could thus argue for more radical (arguably desperate) value adjustments that give forecasters even more benefit of the doubt that their errors represent the right mistakes. This encounter with value adjustments will not, therefore, be our last. But, for reasons laid out in the Technical Appendix and chapter 6, we should beware that the more generous we make value adjustments, the greater the risk of our becoming apologists for poor forecasting. The reductio ad absurdum is value adjustments tailored so exquisitely that, whatever mistakes we make, we insist on correcting them.

  Qualifications noted, the consistent performance parity between experts and dilettantes—even after value adjustments—suggests that radical skeptics are right that we reach the point of diminishing marginal predictive returns for knowledge disconcertingly quickly.

  The Fifteen Minutes of Fame Hypothesis

  Radical skeptics see experts' failure to outperform dilettantes as further evidence that there is little consistency in "who gets what right" across regions, topics, or time—at least no more consistency than we should expect from the intercorrelations among outcomes in the forecasting environment. According to the Andy Warhol hypothesis, everybody, no matter how silly, is ultimately entitled to his or her fifteen minutes of fame, plus or minus a random disturbance value. Forecasting skill should be roughly as illusory as hot hands in basketball.

  Radical skeptics need not be too perturbed by the strong consistency in who got what right across time. Forecasters who received superior calibration and discrimination scores in their short-term predictions of political and economic trends also received superior scores in their longer-term predictions (correlations of .53 and .44). These correlations are of approximately the magnitude one would expect from the predictive power of time series and regression models that captured the autoregressive continuity of, and the intercorrelations among, outcome variables. Radical skeptics should, however, be perturbed by the consistency coefficients across domains: knowledge of who did well within their specialties allowed us to identify, with far better than chance accuracy, those with good calibration and discrimination scores outside their specialties (correlations of .39 and .31). These latter strands of continuity in good judgment cut across topics with virtually zero statistical and conceptual overlap—from predicting GDP growth in Argentina to interethnic wars in the Balkans to nuclear proliferation in South Asia to election outcomes in Western Europe and North America. To be consistent, radical skeptics must dismiss such results, significant at the .01 level, as statistical flukes.

  From a meliorist perspective, the more reasonable response is to start searching for attributes that distinguish forecasters with better and worse track records. Chapter 3 picks up this challenge in earnest. Here it must suffice to say that it is possible to build on these initial findings of consistency in forecasting performance—to show that the individual differences are not a stand-alone phenomenon but rather correlate with a psychologically meaningful network of other variables, including cognitive style. For current purposes, though, these effects have the status of anomalies—irritating blotches on the skeptics' otherwise impressive empirical record (which skeptics will try to trivialize by attributing them to confounds or artifacts), but a source of potential comfort to meliorists who insist that, even within the restricted range of plausible worldviews and cognitive styles represented in this elite sample of forecasters, some ways of thinking about politics translate into superior performance.

 
