
Expert Political Judgment


by Philip E. Tetlock


  Although the content factors proved anemic predictors of overall performance, it was easy to identify times and places at which one or another faction could crow over its successes. We shall soon see that, consistent with the fifteen minutes of fame hypothesis, who is up versus down can shift rapidly from case to case, and even from moment to moment within cases.

  Cognitive Style Correlates

  The search for correlates of good judgment across time and topics became more successful when the spotlight shifted from what experts thought to how they thought. Table 3.3 presents the thirteen items used to measure cognitive style, as well as the results of a maximum likelihood factor analysis. The low and high variable loadings on the first factor bear a striking resemblance to Isaiah Berlin’s famous distinction between hedgehogs and foxes in the history of ideas.5 Low scorers look like hedgehogs: thinkers who “know one big thing,” aggressively extend the explanatory reach of that one big thing into new domains, display bristly impatience with those who “do not get it,” and express considerable confidence that they are already pretty proficient forecasters, at least in the long term. High scorers look like foxes: thinkers who know many small things (tricks of their trade), are skeptical of grand schemes, see explanation and prediction not as deductive exercises but rather as exercises in flexible “ad hocery” that require stitching together diverse sources of information, and are rather diffident about their own forecasting prowess, and—like the skeptics in chapter 2—rather dubious that the cloudlike subject of politics can be the object of a clocklike science.6

  Figure 3.1. Calibration and discrimination scores as a function of forecasters’ attitudes on the left-right, realism-idealism, and boomster-doomster “content” scales derived from factor analysis. The data are collapsed across all forecasting domains, including the zone of turbulence and the zone of stability, and forecasting topics, including political, economic, and national security. Higher (reflected) calibration scores indicate greater ability to assign subjective probabilities that correspond to objective relative frequency of outcomes. Higher discrimination scores indicate greater ability to assign higher probabilities to events that occur than to those that do not.

  TABLE 3.3

  Variable Loadings in Rotated Factor Matrix from the Maximum Likelihood Analysis of the Style-of-Reasoning Items (higher loadings indicate more foxlike cognitive style [first value] or more decisive style [second value])

  Item (Hedgehog-Fox Factor loading / Decisiveness Factor loading)

  1. Self-identification as fox or hedgehog (Berlin’s definition): +0.42 / –0.04
  2. More common error in judging situations is to exaggerate complexity of world: –0.20 / +0.14
  3. Closer than many think to achieving parsimonious explanations of political processes: –0.29 / +0.05
  4. Politics is more cloudlike than clocklike: +0.26 / –0.02
  5. More common error in decision making is to abandon good ideas too quickly: –0.31 / +0.22
  6. Having clear rules and order at work is essential for success: –0.09 / +0.31
  7. Even after making up my mind, I am always eager to consider a different opinion: +0.28 / –0.07
  8. I dislike questions that can be answered in many ways: –0.35 / +0.05
  9. I usually make important decisions quickly and confidently: –0.23 / +0.26
  10. When considering most conflicts, I can usually see how both sides could be right: +0.31 / +0.01
  11. It is annoying to listen to people who cannot make up their minds: –0.18 / +0.14
  12. I prefer interacting with people whose opinions are very different from my own: +0.23 / –0.10
  13. When trying to solve a problem, I often see so many options it is confusing: +0.08 / –0.27

  The measure of cognitive style correlates only feebly with the three content factors (all r’s < .10). Hedgehogs and foxes can be found in all factions. But they are not randomly distributed across the three dimensions of political thought. Foxes were more likely to be centrists. When we assess sample-specific extremism—by computing the squared deviation of each expert’s score on each dimension from the mean on that dimension—the resulting measures correlate at .31 with preference for a hedgehog style of reasoning. These relationships will help later to pinpoint where hedgehogs—of various persuasions—made their biggest mistakes.7
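The sample-specific extremism measure described above amounts to a squared deviation from the sample mean on each dimension. As a rough illustration (the expert scores below are invented, not the study's data), it can be sketched in Python:

```python
# Sketch of the sample-specific extremism measure: the squared deviation
# of each expert's score from the sample mean on one content dimension.
# The positions below are hypothetical, for illustration only.

def extremism_scores(scores):
    """Squared deviation of each score from the sample mean."""
    mean = sum(scores) / len(scores)
    return [(s - mean) ** 2 for s in scores]

# Hypothetical expert positions on one dimension (e.g., left vs. right):
positions = [-1.8, -0.3, 0.1, 0.4, 2.0]
ext = extremism_scores(positions)
# Experts at either pole get high extremism scores; centrists get low ones,
# regardless of which direction they lean.
```

Because the deviation is squared, the measure is direction-blind: a far-left and a far-right expert both register as extreme, which is what lets it correlate with cognitive style across factions.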

  Most notable, though, is the power of the hedgehog-fox dimension of cognitive style as a broad bandwidth predictor of forecasting skill. The hedgehog-fox dimension did what none of the “content” measures of political orientation, and none of the measures of professional background, could do: distinguish observers of the contemporary scene with superior forecasting records, across regions, topics, and time.

  Figure 3.2 plots the calibration and discrimination scores of thirty-two subgroups of forecasters (resulting from a 4 × 2 × 2 × 2 division and identified by the legend in the figure caption).

  The results affirm the modest meliorist contention that the crude human-versus-chimp, and expert-versus-dilettante comparisons in chapter 2 mask systematic, not just random, variation in forecasting skill. Some cognitive-stylistic subspecies of humans consistently outperformed others. On the two most basic measures of accuracy—calibration and discrimination—foxes dominate hedgehogs. The calibration scores of foxes and fox-hogs (first and second quartile scorers on the cognitive-style scale) hover in the vicinity of .015, which means they assign subjective probabilities (1.0, .9, .8, …) that deviate from objective frequency (.88, .79, .67, …) by, on average, 12 percent; by contrast, the hedgehogs and hedge-foxes (third and fourth quartile scorers on the cognitive-style scale) have calibration scores hovering around .035, which means a subjective probability–objective reality gap, on average, of 18 percent. The discrimination scores of foxes and fox-hogs average .03 (which means they capture about 18 percent of the variance in their predictions), whereas those for hedgehogs and hedge-foxes average .023 (which means they capture about 14 percent of the variance).
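The book's exact scoring formulas are given in its Technical Appendix; as a minimal sketch, the standard (Murphy-style) decomposition behind calibration and discrimination indices can be written as follows. Note that the "average gap" figures quoted above are roughly the square root of the calibration score (e.g., √.015 ≈ .12).

```python
from collections import defaultdict

def calibration_and_discrimination(forecasts, outcomes):
    """Standard Murphy-style calibration and discrimination indices.

    forecasts: subjective probabilities assigned to events
    outcomes:  1/0 flags for whether each event occurred
    Lower calibration is better; higher discrimination is better.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    buckets = defaultdict(list)  # group outcomes by stated probability
    for f, o in zip(forecasts, outcomes):
        buckets[f].append(o)
    # Calibration: squared gap between each stated probability and the
    # observed frequency in that bucket, weighted by bucket size.
    cal = sum(len(os) * (f - sum(os) / len(os)) ** 2
              for f, os in buckets.items()) / n
    # Discrimination: how far bucket frequencies stray from the base rate.
    disc = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
               for os in buckets.values()) / n
    return cal, disc

# Toy data: 0.9 forecasts that verify 80% of the time, 0.1 forecasts 20%.
forecasts = [0.9] * 10 + [0.1] * 10
outcomes = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
cal, disc = calibration_and_discrimination(forecasts, outcomes)
```

In this toy example the calibration score is .01 (a typical probability–frequency gap of about .10) and the discrimination score is .09, since the forecaster's probability buckets separate occurring from non-occurring events well above the .5 base rate.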

  But the results do not support the bolder meliorist contention that certain styles of thinking reliably yield forecasting accuracy comparable or superior to that of formal statistical models. Only the best-performing foxes come close to the forecasting accuracy of crude case-specific extrapolation algorithms (numbers 35 and 36 in upper right of fig. 3.2) and none even approach the autoregressive distributed lag models (number 37 in far upper right).

  Figure 3.2 thus brackets human performance. It reveals how short of omniscience the best forecasters fall: they are lucky to approach 20 percent of that epistemic ideal across all exercises, whereas extrapolation algorithms approach 30 percent and formal models 50 percent.8 And it reveals how far human performance can fall—to the point where highly educated specialists are explaining less than 7 percent of the variance and fall on lower aggregate skill curves than the chimpanzee’s equal-guessing strategy.9

  Figure 3.2. How thoroughly foxes and fox-hog hybrids (first and second quartiles on cognitive-style scale) making short-term or long-term predictions dominated hedgehogs and hedge-fox hybrids (fourth and third quartiles) making short- and long-term predictions on two indicators of forecasting accuracy: calibration and discrimination. Key to translating numbers into identifiable subgroups and tasks: pure foxes (1–8), pure hedgehogs (25–32), fox-hog hybrids (17–24), and hedge-fox hybrids (9–16); moderates (1–4, 9–12, 17–20, 25–28) and extremists (5–8, 13–16, 21–24, 29–32); experts (1–2, 5–6, 9–10, 13–14, 17–18, 21–22, 25–26, 29–30) and dilettantes (3–4, 7–8, 11–12, 15–16, 19–20, 23–24, 27–28, 31–32); short-term (2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32) and long-term (1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, and 31); mindless algorithms (chimp—33), base-rate extrapolation (34), and moderate and extreme case-specific extrapolation (35, 36); the average performance of formal statistical models (generalized autoregressive distributed lag—37), and Berkeley undergraduates (38).

  Figure 3.3. The calibration functions of four groups of forecasters compared to the ideal of perfect calibration (diagonal). The further functions stray from the diagonal, the larger and worse the resulting calibration scores.

  Figures 3.3 and 3.4 supplement figure 3.2. Figure 3.3 plots a series of calibration functions that show how well the fox-hedgehog difference holds up across the entire subjective probability scale, how close foxes come to the diagonal of perfect calibration when they make short-term predictions within their domain of expertise, and how far hedgehogs stray from this ideal when they make long-term predictions within their domains of expertise. Figure 3.4 brings into sharp focus how pronounced that fox-hedgehog difference becomes when the hedgehogs are ideological extremists making long-term predictions within their domains of expertise.

  Examining all three figures, we can take home four principal sets of conclusions:

  1. The fox advantage on calibration is remarkably generalizable. It holds up across all dimensions of the data displayed—across experts versus dilettantes, moderates versus extremists, and short-term versus long-range forecasts—and it holds up across two additional dimensions not displayed—across the zones of stability versus turbulence, and across domestic political versus economic versus national security outcomes.10 The fox advantage fades only when fox forecasters become less foxy (among “fox-hogs”) and when the hedgehog forecasters become foxier (among “hedge-foxes”). Here we find the largest cluster of “statistical ties” on calibration (in figure 3.2, data points 9, 10, 11, 12, 13, 15, 19, 20, 22, and 23).

  Figure 3.4. Calibration scores of hedgehog and fox moderates and extremists making short- and long-term predictions as either experts or dilettantes.

  2. But the fox advantage on calibration is more pronounced for certain subgroups. The worst performers were hedgehog extremists making long-term predictions in their domains of expertise. From that valley (see especially figure 3.4), hedgehog performance improves as we move from experts to dilettantes, from long-term to short-term predictions, and from extremists to moderates. By contrast, the best performers were foxes making short-term predictions in their domains of expertise. From that peak (see again figure 3.4), fox performance deteriorates as we move from experts to dilettantes, and from short- to long-term predictions.

  3. There is no support for the argument, advanced by some defenders of hedgehogs, that “foxes were just chickens” and their victory on calibration was as intellectually empty as the victory of the chimp equal-guessing strategy on the same variable in chapter 2. If foxes had been indiscriminately cautious, they—like the chimp—would have been trounced on the discrimination index. But the opposite happened. Foxes enjoyed a statistically significant advantage on discrimination—an advantage that fades only when, as occurred with calibration, we compare the least foxy foxes and the most foxy hedgehogs. Also, as occurred with calibration, we find that the worst-performing hedgehogs are still extremists making long-term predictions in their roles as experts. And the best-performing foxes are still moderates making predictions in their roles as experts. These results smash a critical line of defense for hedgehogs. Figure 3.2 underscores this point by showing that it was impossible to identify any plausible (monotonic) set of constant probability score curves consistent with the hypothesis that hedgehogs lost on calibration because they were making a prudent trade-off in which they opted to give up calibration for the sake of discrimination. Adding insult to injury, figure 3.2 shows it is easy to generate plausible indifference curves consistent with the hypothesis that hedgehogs and the dart-throwing chimp had equivalent forecasting skill and were simply striking different trade-offs between calibration and discrimination, with the chimp equal-guessing strategy “opting” for more calibration in return for zero discrimination and hedgehogs “opting” for less calibration in return for some discrimination.11

  4. The patterning of fox-hedgehog differences has implications for the interpretation of the effects of other variables, especially expertise, forecasting horizon, and ideological extremism. For instance, although expertise in chapter 2 had no across-the-board effect on forecasting accuracy, the null result is misleading. Foxes derive modest benefit from expertise whereas hedgehogs are—strange to say—harmed. And, although long-range forecasts were on average less accurate than short-term forecasts, the main effect was misleading. It was driven entirely by the greater inaccuracy of hedgehogs’ long-term forecasts. Finally, although extremists were on average less accurate than moderates, this main effect too was misleading. It was driven almost entirely by the rather sharp drop in accuracy among hedgehog, but not fox, extremists.

  To sum up, the performance gap between foxes and hedgehogs on calibration and discrimination is statistically reliable, but the size of the gap is moderated by at least three other variables: extremism, expertise, and forecasting horizon. These “third-order” interactions pass stringent tests of significance (probability of arising by chance [conditional on null hypothesis being true] less than one in one hundred), so it is hard for skeptics to dismiss them as aberrations.12 And these interactions pose a profound challenge. We normally expect knowledge to promote accuracy (a working assumption of our educational systems). So, if it was surprising to discover how quickly we reached the point of diminishing returns for knowledge in chapter 2, it should be downright disturbing to discover that knowledge handicaps so large a fraction of forecasters in chapter 3.

  The results do, however, fit comfortably into a cognitive-process account that draws on psychological research on cognitive styles and motivated reasoning. This account begins by positing that hedgehogs bear a strong family resemblance to high scorers on personality scales designed to measure needs for closure and structure—the types of people who have been shown in experimental research to be more likely to trivialize evidence that undercuts their preconceptions and to embrace evidence that reinforces their preconceptions.13 This account then posits that, the more relevant knowledge hedgehogs possess, the more conceptual ammunition they have to perform these belief defense and bolstering tasks. By contrast, foxes—who resemble low scorers on the same personality scales—should be predisposed to allocate their cognitive resources in a more balanced fashion—in the service of self-criticism as well as self-defense. When fox experts draw on their stores of knowledge for judging alternative futures, they should pay roughly equal attention to arguments, pro and con, for each possibility. We should thus expect a cognitive-style-by-expertise interaction: there is greater potential for one’s preferred style of thinking to influence judgment when one has a large stock of thoughts to bring to bear on the judgment task.

  The next challenge for the cognitive-process account is to explain why the performance gap between fox and hedgehog experts should widen for longer-range forecasts. Most forecasters became less confident the deeper into the future we asked them to see. Understandably, they felt that, whereas shorter-term futures were more tightly constrained by known facts on the ground, longer-term futures were more “up for grabs.” Linking these observations to what we know about hedgehogs’ aversion to ambiguity, it is reasonable to conjecture that (a) hedgehogs felt more motivated to escape the vagaries of long-range forecasting by embracing cause-effect arguments that impose conceptual order; (b) hedgehogs with relevant subject matter knowledge were especially well equipped cognitively to generate compelling cause-effect arguments that impose the sought-after order. We should now expect a second-order cognitive style × expertise × time frame interaction: when we join the ability to achieve closure with the motivation to achieve it, we get the prediction that hedgehog experts will be most likely to possess and to embrace causal models of reality that give them too much confidence in their long-range projections.

  The final challenge for the cognitive-process account is to lock in the fourth piece of the puzzle: to explain why the performance gap further widens among extremists. Laboratory research has shown that observers with strong needs for closure (hedgehogs) are most likely to rely on their preconceptions in interpreting new situations when those observers hold strong relevant attitudes (priors).14 These results give us grounds for expecting a third-order interaction: the combination of a hedgehog style and extreme convictions should be a particularly potent driver of confidence, with the greatest potential to impair calibration and discrimination when forecasters possess sufficient expertise to generate sophisticated justifications (fueling confidence) and when forecasters make longer-range predictions (pushing potentially embarrassing reality checks on over-confidence into the distant future).

  The cognitive-process account is now well positioned to explain the observed effects on forecasting accuracy. But can it explain the specific types of mistakes that forecasters make? We have yet to break down aggregate accuracy. We do not know whether hedgehogs’ more numerous mistakes were scattered helter-skelter across the board or whether they took certain distinctive forms: under- or overpredicting change for the worse or better.

  To answer these questions, the Technical Appendix shows us we need indicators that, unlike the squared deviation formulas for calibration and discrimination, preserve direction-of-error information. These directional indicators reveal that, although both hedgehogs and foxes overpredict change (the lower base-rate outcome) and thus—by necessity—underpredict the status quo, hedgehogs make this pattern of mistakes to a greater degree than foxes. Relative to foxes, hedgehogs assign too high probabilities to both change for the worse (average subjective probabilities, .37 versus .29; average objective frequency = .23) and to change for the better (average subjective probabilities = .34 versus .30; average objective frequency = .28); and too low average probabilities, .29 versus .41, to perpetuation of the status quo (average objective frequency = .49). We can show, moreover, that the overprediction effect is not just a statistical artifact of regression toward the mean. A series of t-tests show that the gaps between average subjective probabilities and objective frequencies are statistically significant for both hedgehogs (at the .001 level) and foxes (at the .05 level). And the gaps for hedgehogs are consistently significantly larger than those for foxes (at the .01 level).
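The direction-preserving logic here is simple: instead of squaring errors (which destroys their sign), take the raw gap between mean subjective probability and objective frequency, so that positive values mean overprediction and negative values mean underprediction. A sketch using the averages reported above (the book's own indicators are defined in the Technical Appendix):

```python
# Direction-preserving error indicator: unlike squared-deviation scores,
# the signed gap records whether a group over- or underpredicts an outcome.
def signed_gap(mean_subjective_prob, objective_freq):
    return mean_subjective_prob - objective_freq

# Average probabilities and frequencies reported in the text:
hh_worse = signed_gap(0.37, 0.23)    # hedgehogs, change for the worse
fox_worse = signed_gap(0.29, 0.23)   # foxes, change for the worse
hh_status = signed_gap(0.29, 0.49)   # hedgehogs, status quo
fox_status = signed_gap(0.41, 0.49)  # foxes, status quo
# Both groups overpredict change (positive gaps) and underpredict the
# status quo (negative gaps), but hedgehogs err more in each direction.
```

The squared-deviation calibration index would score a +.14 and a −.14 gap identically; the signed gap is what reveals that hedgehogs' surplus errors cluster on overpredicted change.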

 
