
Expert Political Judgment


by Philip E. Tetlock


  7. This defense grants that hedgehogs made many mistakes but tries to deflect negative conclusions about their cognitive styles by (a) positing that the experts studied here did not have the right credentials or were not properly motivated or (b) chalking up failures to the turbulent character of late twentieth-century history.

  8. The final defense attributes performance differentials to a misunderstanding. I judged participants by the standards of my world—an academic subculture that values empirical accuracy and logical coherence—whereas many participants judged themselves by the standards of their worlds—partisan subcultures in which the game of “gotcha” requires denying one’s own mistakes and pinning as many mistakes as possible on the other side.

  Each defense gets a fair hearing, but not a free pass. When appropriate, the prosecution will attach skeptical rejoinders. In the end, we shall see that the case against the hedgehogs is neither as solid as their detractors declare nor as flimsy as their defenders hope. Qualified forms of the earlier indictments remain standing.

  It would, however, be a mistake to read this chapter solely as an effort to give hedgehogs a fair shake. Each defense raises issues that arise whenever we confront the claim “Members of group x have ‘better judgment’ than members of group y.” Indeed, taken together, the eight defenses reveal the extraordinary complexity of the assumptions underlying all judgments of good judgment—a point reinforced by the Technical Appendix, which details the computational implementation of these defenses.

  Really Not Such Bad Forecasters

  In chapter 3, foxes bested hedgehogs on basic indicators of forecasting accuracy. But defenders of hedgehogs argue that the victory was a false one. Foxes did “better” not because they have better judgment but because (a) hedgehogs have more skewed error-avoidance priorities and either tolerate lots of false alarms to avoid misses or tolerate many misses to avoid false alarms; (b) hedgehogs used the probability scale more aggressively and swung harder for forecasting “home runs” by assigning zeroes to nonevents and 1.0’s to events; (c) hedgehogs were dealt tougher forecasting assignments; (d) our coding of forecasts as right or wrong was biased in favor of foxes; (e) our reality checks were predicated on the “naïve” assumption that forecasts could be coded as either right or wrong and, when we adopt a more nuanced scoring system, performance differentials vanish. In the end, we make a curious discovery rooted in an old levels-of-analysis paradox. Even though it is a great statistical struggle to raise the average accuracy of individual hedgehogs to parity with the average accuracy of individual foxes, it is easy to show that the accuracy of the average forecast of all hedgehogs is almost identical to that of all foxes.

  THE NEED FOR VALUE ADJUSTMENTS

  Some defenders of hedgehogs dismiss the foxes’ advantage as illusory—a by-product of our rigidly value-neutral probability-scoring rules that treat all errors equally. Hedgehogs get worse scores because they are less concerned with maximizing overall accuracy and more with minimizing those mistakes they deem really serious, even if at the expense of making many less consequential mistakes. Perhaps some hedgehogs subscribe to a “better safe than sorry” philosophy that puts a premium on being able to say, “I was never blindsided by change for the worse.” And perhaps other hedgehogs subscribe to a “don’t cry wolf” philosophy that puts a premium on avoiding false alarms of change for the worse that could undercut their long-term credibility.

  The Technical Appendix describes the basic idea behind value-adjusting probability scores: to give experts some benefit of the doubt that the mistakes they make are the right mistakes in light of their own value priorities. Thus, if hedgehogs overpredict change for the worse, one value adjustment solves for a value of k that brings their forecasts into line with the observed base rate for change for the worse. Insofar as hedgehogs are more prone than foxes to this error, the value adjustment helps them “to catch up” on this task.
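Tetlock's exact k formula lives in his Technical Appendix; the following Python sketch is only an illustrative reconstruction of the idea (the function names and data are hypothetical): shift every forecast by a constant k until the mean adjusted forecast matches the observed base rate, then re-score.

```python
# Illustrative reconstruction of a k-style value adjustment (the exact
# formula is in the Technical Appendix; names and data are hypothetical).
# Idea: shift every forecast by a constant k until the mean adjusted
# forecast matches the observed base rate, then re-score.

def brier(forecasts, outcomes):
    """Mean squared gap between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

def k_adjust(forecasts, outcomes):
    """Shift forecasts by k = (mean forecast - base rate), clipped to [0, 1]."""
    base_rate = sum(outcomes) / len(outcomes)
    k = sum(forecasts) / len(forecasts) - base_rate  # overprediction of the event
    return [min(1.0, max(0.0, f - k)) for f in forecasts]

# A judge who systematically overpredicts change for the worse:
# mean forecast .6 against an observed base rate of .2.
forecasts = [0.9, 0.7, 0.6, 0.5, 0.3]
outcomes = [1, 0, 0, 0, 0]
```

In this toy case the adjustment cuts the judge's probability score from roughly .24 to roughly .08 — a large "catch-up," but one granted equally to anyone with the same directional bias.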

  But rescuing hedgehogs via k-style value adjustments proves futile for a simple reason: foxes make fewer errors of both under- and overprediction. Figure 6.1 shows there are no constant probability score “indifference” curves consistent with the hypothesis that foxes and hedgehogs are equally good forecasters with different tastes for under- and over-prediction. Foxes are consistently to the “northeast” of hedgehogs. Adding insult to injury, figure 6.1 shows how easy it is to postulate constant-probability-score indifference curves consistent with the hypothesis that hedgehogs and dart-throwing chimps had equivalent forecasting skill and just “opted” for different blends of mistakes. Lastly, figure 6.1 shows that, even after introducing the k adjustment, hedgehogs “lose” to foxes, regardless of whether the forecasting focus was on identifying change for the better, change for the worse, or change in either direction.7

  Hedgehogs can never catch up via across-the-board k-style adjustments. To produce performance parity, we have to tailor value adjustments to specific types of mistakes using the a0/a1 method (see Technical Appendix). But when do we decide we have gone too far in contorting scoring rules in pursuit of parity? Figure 6.2 shows that crossover is possible only if we define the forecasting objective in a very particular way—distinguishing change (regardless of direction) from the status quo—and when we treat underpredicting change as at least seven times more serious than overpredicting change. Crossover occurs for two reasons: (a) aggregate hedgehog performance is dragged down by two subgroups—extreme optimists who exaggerate the likelihood of change for the better and extreme pessimists who exaggerate the likelihood of change for the worse; (b) the adjustments give lots of credit for the aggressive predictions of change by each subgroup that prove correct but trivialize the aggressive predictions by each subgroup that prove off the mark.
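The a0/a1 machinery itself is in the Technical Appendix; a toy stand-in for such asymmetric value adjustments (names and numbers here are illustrative, not Tetlock's) is a Brier-style score that penalizes misses some multiple of false alarms. At the seven-to-one ratio the text mentions, a bold change-predictor can overtake a cautious one even though the cautious judge wins under symmetric scoring.

```python
# Illustrative stand-in for an asymmetric value adjustment (hypothetical
# names; the actual a0/a1 method is in the Technical Appendix): penalize
# misses (low probabilities on changes that happen) some multiple of
# false alarms (high probabilities on changes that do not).

def weighted_brier(forecasts, outcomes, miss_weight=1.0, fa_weight=1.0):
    total = 0.0
    for f, o in zip(forecasts, outcomes):
        w = miss_weight if o == 1 else fa_weight  # weight by error type
        total += w * (f - o) ** 2
    return total / len(outcomes)

# Change happens once in four cases; the bold judge overpredicts it,
# the cautious judge underpredicts it.
outcomes = [1, 0, 0, 0]
bold = [0.9, 0.6, 0.6, 0.6]
cautious = [0.4, 0.2, 0.2, 0.2]
```

Under symmetric scoring the cautious judge wins (.12 versus roughly .27), but once underpredicting change is treated as seven times as serious, the ordering flips — the kind of crossover figure 6.2 documents.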

  Figure 6.1. The impact of k-value adjustment on performance of hedgehog and fox experts and dilettantes (HE, HD, FE, FD) and chimps in three different forecasting tasks: distinguishing status quo from change for either better or worse (panel 1 in which everyone benefits from reducing underprediction of the status quo), distinguishing change for better from either status quo or change for worse (panel 2 in which everyone benefits from reducing overpredicting change for the worse), and distinguishing change for worse from either status quo or change for better (panel 3 in which everyone benefits from reducing overpredicting change for the better). K-value adjustments improve the overall performance of all groups but fail to produce hedgehog-fox performance parity.

  There is no rule that tells us how far to take value adjustments: we could, in principle, tailor them to each forecaster’s probability estimate of each state of the world (say, correcting for overpredicting unemployment but underpredicting inflation). But such special-purpose value adjustments almost certainly give forecasters too much benefit of the doubt. Such adjustments make the null hypothesis of cognitive parity nonfalsifiable (they can make even the dart-throwing chimp perfectly calibrated). My own inclination is therefore not to go far beyond adjustments of the generic k sort. Hedgehogs lose too consistently to foxes—across outcome variables, time frames, and regional domains—to sustain the facile hypothesis that all performance differentials can be attributed to different value priorities.

  THE NEED FOR PROBABILITY-WEIGHTING ADJUSTMENTS

  Hedgehogs may also have been unfairly penalized because they tried harder to hit the forecasting equivalent of home runs: assigning the extreme values of zero (impossible) and 1.0 (sure thing) more often than foxes who were content with the forecasting equivalent of base hits (assigning low, but not zero, probabilities to things that did not happen and high, but not 1.0, probabilities to things that did happen). In this view, hedgehogs should get credit for their courage. People take notice when forecasters say something will or will not happen, with no diluting caveats. But impact falls off steeply as forecasters move from these endpoints of the probability scale into the murkier domain of likely or unlikely, and falls off further as we move to “just guessing.”

  Defenders of hedgehogs argue for scoring adjustments that capture this reality. And they point to recent empirical revelations of how people actually use subjective probabilities in decision making. Expected utility theory traditionally treated a shift in probability from .10 to .11 as exactly as consequential a determinant of the final choice that people made as a shift from .99 to 1.0; by contrast, cumulative prospect theory posits that people use subjective probabilities in strikingly nonlinear ways.8 People are willing to pay much more to increase the probability of winning a lottery ticket from .99 to 1.0 than they are from .10 to .11. And they are willing to pay much more to reduce the likelihood of disaster from .0001 to zero than they are from .0011 to .001. In the spirit of these observations, we introduced probability-weighting adjustments of probability scores that (a) assign special positive weights to home run predictions that gave values of 1.0 to things that happen and values of zero to things that do not; (b) assign special negative weights to big strike-out predictions that gave values of 1.0 to things that do not happen and values of zero to things that do happen; (c) count movement in the wrong direction at the extremes as a more serious mistake (when x happens, moving from 1.0 to .8; when x does not happen, moving from zero to .2) than movement in the wrong direction in the middle of the probability scale (say, from .6 to .4 or from .4 to .6).
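One standard nonlinear form is the Tversky–Kahneman (1992) weighting function from cumulative prospect theory. The sketch below is illustrative — not necessarily the exact adjustment used in this study — but it shows why an extreme gamma near .2 flattens the middle of the probability scale: a .9 forecast and a .1 forecast of an event that fails to occur end up with nearly identical weighted penalties.

```python
# Tversky-Kahneman (1992) probability-weighting function:
#   w(p) = p^g / (p^g + (1-p)^g)^(1/g)
# An illustrative choice; the study's actual weighting scheme may differ.

def tk_weight(p, gamma):
    if p == 0.0 or p == 1.0:
        return p  # endpoints untouched: "sure things" keep full weight
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)

# Judge A said .9, Judge B said .1, and x failed to occur. Raw squared
# errors differ by a factor of 81; with gamma = .2 the weighted
# probabilities are nearly equal, so the weighted errors nearly are too.
gamma = 0.2
err_A_raw = (0.9 - 0.0) ** 2            # 0.81
err_B_raw = (0.1 - 0.0) ** 2            # 0.01
err_A_weighted = tk_weight(0.9, gamma) ** 2
err_B_weighted = tk_weight(0.1, gamma) ** 2
```

At gamma = 1 the function reduces to the identity, recovering the unadjusted score; as gamma falls toward .2, the "maybe zone" collapses and big errors become only slightly more severe than small ones — exactly the intuition-violating regime the text describes.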

  Figure 6.2. The gap between hedgehogs and foxes narrows, and even disappears, when we apply value adjustments, a0, that are increasingly tough on false alarms (over-predicting status quo [top panel], change for better [middle panel], and change for worse [bottom panel]). Crossover occurs with sufficiently extreme adjustments when we define the prediction task as distinguishing the status quo from change in either direction (up or down).

  Hedgehogs benefit from these scoring adjustments. As noted in chapter 3, they benefit, in part, because they swing for home runs more often than foxes: they call outcomes inevitable (1.0) 1,479 times in comparison to the foxes’ 798 times, and they call outcomes impossible (zero) 6,929 times in comparison to the foxes’ 4,022 times. And hedgehogs benefit, in part, from the fact that, on most of the occasions when they assigned extreme values, they were right: roughly 85 percent of the outcomes they labeled “impossible” never materialized and roughly 74 percent of the outcomes they labeled “sure things” did materialize. Hedgehogs must, by logical necessity, be “very right” more often than foxes: foxes could not catch up on this dimension even if they never made mistakes.

  But here the good news ends for hedgehogs. Major hedgehog misses (declaring something impossible that subsequently happens) vastly outnumber major fox misses (14 percent versus 4 percent), and major hedgehog false alarms vastly outnumber fox false alarms (26 percent versus 14 percent). These big mistakes slow the rate at which hedgehogs can catch up to foxes via probability-weighting adjustments. As figure 6.3 shows, the catch-up point occurs only when the weighting parameter, gamma, takes on an extreme value (roughly .2), so counterintuitively extreme, in fact, that it treats big errors (Judge A says x is highly likely [.9] and x does not occur) as only slightly more severe than small errors (Judge B says x is highly unlikely [.1] and x does not occur). Adjustments of this magnitude violate the intuition most of us have that the two cases are far from equivalent: we feel that Judge A was almost wrong and Judge B almost right. Adjustments of this magnitude also imply that, if our goal is to produce catch-up effects, we need a much more sharply inflected S-shaped weighting function than the more psychologically realistic one in prospect theory. Finally, even when we implement adjustments this extreme, catch-up only occurs when we define the forecasting goal as distinguishing the status quo from either change for the better or the worse (not when we look for directional accuracy—the ability to predict whether change will be for the better or the worse).

  Figure 6.3. The gap between foxes and hedgehogs narrows, but never closes in the first and second panels and even eventually reverses itself in the third panel, when we apply increasingly extreme values of gamma to the weighted probabilities entered into the probability-scoring function. Extreme values of gamma treat all mistakes in the “maybe zone” (.1 to .9) as increasingly equivalent to each other.

  THE NEED FOR DIFFICULTY ADJUSTMENTS

  Hedgehogs may also look worse because they specialize in more volatile regions of the world and, thus, when they made predictions in their roles as experts, they more often wound up trying to predict the unpredictable. Table 6.1 shows there is a degree of truth to this objection. Although the similarities between hedgehog and fox forecasting environments were more pronounced than the differences—in both the short- and long-term forecasting exercises for foxes and hedgehogs, the status quo was the right answer more often than either change for the better (always coming in second) or change for the worse—there were still differences. Hedgehogs were dealt marginally tougher forecasting tasks (where tougher means closer to the 33/33/33 breakdown one would expect if all possible outcomes—the status quo, change for either the better or the worse—were equiprobable).

  The Technical Appendix makes the case for difficulty-adjusted probability scores that level the playing field by taking into account variation in environmental variability. Figure 6.4 shows the results: difficulty-adjusted scores replicate the hedgehog-fox performance gaps observed with unadjusted probability scores. The results reinforce the notion that hedgehogs pay a steep price for their confident, deductive style of reasoning. Difficulty-adjusted probability scores below zero signify lower forecasting accuracy than could have been achieved by just predicting the base rate. And the steepest decline into negative territory occurs among hedgehogs making long-term forecasts outside their specialties.
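The difficulty adjustment itself is defined in the Technical Appendix, but a common stand-in with the same zero point is the Brier skill score, which benchmarks each judge against a forecaster who always predicts the base rate. The sketch below is illustrative (hypothetical names and data, not the study's exact formula):

```python
# Illustrative Brier skill score (hypothetical names; not the study's
# exact difficulty adjustment): benchmark each judge against a "dumb"
# forecaster who always predicts the observed base rate.

def brier(forecasts, outcomes):
    """Mean squared gap between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

def skill_score(forecasts, outcomes):
    """1 = perfect; 0 = no better than predicting the base rate; < 0 = worse."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier(forecasts, outcomes) / reference

# Change for the worse occurs once in five cases (base rate .2).
outcomes = [1, 0, 0, 0, 0]
calibrated_judge = [0.6, 0.1, 0.1, 0.1, 0.1]     # scores above zero
overconfident_judge = [0.1, 0.8, 0.7, 0.6, 0.2]  # drops below zero
```

Raising or lowering the assumed base rate shifts the reference score and hence everyone's skill score — which is why the choice of base rate in the next paragraphs is a genuine judgment call.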

  But, just as there is legitimate disagreement about how far to take value adjustments, so there is about how far to take difficulty adjustments.9 The “right” base rate for computing difficulty adjustments hinges on judgment calls. For instance, the base rate of nuclear proliferation falls off quite rapidly to the degree we expand the set of usual suspects to encompass not just immediate-risk candidates (such as Pakistan, North Korea, and Iran) but longer-term risks (Brazil, Argentina, Libya, Taiwan, Japan, South Korea, etc.). Similarly, regime change is rare in the zone of stability but occurs with moderate frequency in longer time spans in the zone of turbulence and with high frequency if we confine comparisons to the former Soviet bloc in the late 1980s and early 1990s. Much the same can be said for cross-border warfare, genocidal violence within borders, debt default, and so on.

  TABLE 6.1

  How often “Things” Happened (Continuation of Status Quo, Change in the Direction of More of Something, and Change in the Direction of Less of Something)

  This table summarizes the percentage frequency of occurrence of possible futures (status quo and change for better or for worse) when hedgehogs and foxes made short- and long-term predictions inside and outside their domains of expertise.

  Unfortunately for defenders of hedgehogs, figure 6.4 shows that hedgehogs lose to foxes across a range of plausible assumptions about base rates, with values ranging from 50 percent lower to 50 percent higher than the base rate for the entire dataset. The confidence bands reveal that increasing the base rate generally improved the forecasting skill scores of both hedgehogs and foxes and that decreasing the base rate generally impaired these scores. We can also see that, although hedgehogs benefit more than foxes from increasing the base rates, hedgehogs still receive worse difficulty-adjusted scores. Hedgehogs catch up only when we give them the most favorable possible assumptions about the base-rates of target events and foxes the least favorable—hardly a leveling of the playing field.10

  Figure 6.4. The difficulty-adjusted forecasting skill of hedgehogs and foxes making short or long-range forecasts inside or outside their specialties. Higher scores indicate better performance and confidence bands show how forecasting skill shifts with estimated base rates (lower bands corresponding to lower estimates, higher bands to higher estimates). Hedgehogs and foxes gain or lose in similar ways from base rate adjustments and never converge in performance.

  THE NEED FOR CONTROVERSY ADJUSTMENTS

  Defenders of hedgehogs can argue that some mistakes attributed to hedgehogs should have been attributed to us. Although we tried to pose only questions that passed the clairvoyance test, disputes still arose over which possible futures materialized. Did North Korea have the bomb in 1998? Did the Italian government cook the books to meet the Maastricht criteria? How much confidence should we have in Chinese economic growth statistics? Who was in charge in China as Deng Xiaoping slowly died?

  Controversy adjustments show how the probability scores of forecasters shift when we heed “I really was right” protests and make alternative assumptions about what happened. Hedgehogs, however, get little traction here. Neither hedgehogs nor foxes registered a lot of complaints about reality checks. They challenged roughly 15 percent of the checks and were bothered by similar issues. Hedgehogs and foxes thus benefited roughly equally from adjustments.

  THE NEED FOR FUZZY-SET ADJUSTMENTS

  Defenders of hedgehogs can argue that, although hedgehogs were not as proficient at predicting what actually happened, they catch up when we give them credit for things that nearly happened. As we saw in chapter 4, inaccurate forecasters often insisted that their forecasts should be classified as almost right rather than clearly wrong—almost right because, although the expected future did not materialize, it either almost did (Quebec almost seceded) or soon will (South Africa has not yet had its tribal bloodbath, but it will). Fuzzy-set adjustments take such protests seriously by shrinking gaps between ex ante probability judgments and ex post classifications of reality whenever experts mobilized one of three belief-system defenses (close-call counterfactual, off-on-timing, or exogenous shock).
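The shrinkage formula itself is in the Technical Appendix; a minimal sketch of the idea, with a hypothetical credit parameter and hypothetical names, is to move the 0/1 outcome part of the way toward the forecast wherever a defense was invoked, and only then compute the probability score.

```python
# Hypothetical sketch of a fuzzy-set adjustment (the credit parameter
# and names are illustrative, not the study's exact formula): where a
# judge invoked a belief-system defense, move the 0/1 outcome a fraction
# of the way toward the forecast before scoring, shrinking the gap.

def fuzzy_brier(forecasts, outcomes, defended, credit=0.25):
    total = 0.0
    for f, o, d in zip(forecasts, outcomes, defended):
        # defended forecasts are scored against a fuzzy outcome
        target = o + credit * (f - o) if d else float(o)
        total += (f - target) ** 2
    return total / len(outcomes)

# A .9 forecast that Quebec would secede (it did not, but "almost" did)
# is scored against a fuzzy outcome of .225 rather than a hard zero.
forecasts = [0.9, 0.2]
outcomes = [0, 0]
defended = [True, False]  # only the first forecast invoked a defense
```

With credit = .25, each defended forecast's error shrinks by a factor of (1 − .25)² — generous, but applied symmetrically to any judge who mounts a defense.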

 
