Expert Political Judgment

by Philip E. Tetlock


  There is, of course, nothing exceptional about experts being blindsided by events. Nor is there anything unusual about partisans duking it out over ambiguous data. In politics, there is always someone eager to claim credit and deny blame and someone else ready to undercut the claims and denials. When we all insist on keeping our own scorecards, we should not be astonished by self-righteous eruptions of disagreements over “who won.” Absent strong reminders of what we once thought, we all too easily slip into believing our own self-promotional puffery.

  This account sets the stage for unveiling the impetus behind this book. Inspiration was born from my exasperation at self-serving scorekeeping and the difficulty of inducing advocates of rival perspectives to answer the question “What would make you change your mind?” I set out on a mission that perhaps only a psychologist (and I am one) would be naïve enough to undertake: to “objectify” good political judgment by identifying standards for judging judgment that would command assent across the spectrum of reasonable opinion. This book, for better or for worse, is the result.

  1 P. B. Medawar, The Art of the Soluble (London: Methuen, 1967).

  2 My closing chapter in the first volume sponsored by the committee makes this point—about the limits of our knowledge—more delicately than I make it here. See P. E. Tetlock, R. Jervis, J. Husbands, P. Stern, and C. Tilly, eds., Behavior, Society, and Nuclear War, vol. 1 (New York: Oxford University Press, 1989). See also the discussion of nuclear theology in J. Nye, “Nuclear Learning and U.S.-Soviet Security Regimes,” International Organization 41 (1987): 371–402.

  3 J. Schell, The Fate of the Earth (New York: Avon, 1982).

  4 G. Allison, A. Carnesale, and J. Nye, Hawks, Doves, and Owls: An Agenda for Avoiding Nuclear War (New York: W.W. Norton, 1985); P. E. Tetlock, “Policy-makers’ Images of International Conflict,” Journal of Social Issues 39 (1983): 67–86.

  5 For examples of how worried some were, see M. Deutsch, “The Prevention of World War III: A Psychological Perspective,” Political Psychology 4 (1983): 3–31; R. White, Fearful Warriors: A Psychological Profile of U.S.-Soviet Relations (New York: Free Press, 1984).

  6 On the sharp transition between cold war and post–cold war thinking, see T. Friedman, The Lexus and the Olive Tree (New York: Farrar, Straus & Giroux, 1999).

  7 Indeed, doctrinaire deterrence theorists stressed the dangers of appearing weak in many arenas. On the domestic front, they didn’t like mollycoddling criminals and, after the demise of the Soviet Union, they warned against appeasing international scofflaws such as North Korea and Iraq. For the view that foreign policy priorities are extensions of primitive interpersonal priorities, see L. S. Etheredge, A World of Men (Cambridge: MIT Press, 1980). For qualifications, see R. Herrmann, P. E. Tetlock, and P. Visser, “Mass Public Decisions on Going to War: A Cognitive-Interactionist Framework,” American Political Science Review 93 (1999): 553–74. Liberals are not always more doveish than conservatives. When the attitude object excites enough antipathy—apartheid in South Africa or ethnic cleansing in Yugoslavia—many are eager to “get tough.”

  8 Tetlock, “Policy-makers’ Images of International Conflict,” 67–86.

  9 For documentation of how pervasive this belief-system defense is, see P. E. Tetlock, “Close-call Counterfactuals and Belief System Defenses: I Was Not Almost Wrong but I Was Almost Right,” Journal of Personality and Social Psychology 75 (1998): 230–42.

  10 I cannot refute every suspicion that the hypersensitive might form of my motives. Various views wind up looking silly at various points. But this book is not dedicated to testing rival political theories; its mission is to shed light on the workings of the minds of political observers. When advocates of a point of view are far off the mark, readers have an array of options, including concluding that (a) forecasters misinterpreted the theory; (b) forecasters had the right theory but no real-world savvy, so they fed the wrong antecedent conditions into the deductive machinery of their theory which, in the tradition of garbage in, garbage out, duly spat out idiotic predictions; (c) the theory is flawed in minor ways that tinkering can fix; (d) the theory is flawed in fundamental ways that require revising core assumptions. True believers in a theory will reach option (d) only after they have been dragged kicking and screaming through options (a), (b), and (c), whereas debunkers will leap straight to option (d) at the first hint of a glitch.

  11 R. Pipes, “Gorbachev’s Party Congress: How Significant for the United States?” (Miami, FL: Soviet and East European Studies Program Working Paper Series, 1986).

  12 Conservatives also risk having the tables turned on them if they mock the “Reagan was just lucky” defense. Across the many forecasting exercises in this book, conservatives are as likely as liberals to resort to the close-call defense in trying to rescue floundering forecasts.

  Preface to the 2017 Edition

  Since its publication in 2005, Expert Political Judgment (EPJ) has become ever more tightly coupled to a crude claim it never made: “Experts are clueless: their forecasts, no better than those of dart-tossing chimpanzees”—a sound-bite that has now spread well beyond the confines of academe.1 Viewed in this light, I suppose it is a tad ironic that neither author nor publisher came close to predicting the book’s impact. We expected it to be reviewed in the American Political Science Review, not the New Yorker and the Financial Times. We expected low-budget, follow-up forecasting competitions, not the massive tournaments sponsored by the U.S. intelligence community.2 And we never dreamt that, eleven years later, a pro-Brexit cabinet minister, Michael Gove, and the mastermind of Monty Python’s “Flying Circus,” John Cleese, would invoke EPJ to dismiss economists’ warnings against leaving the European Union and declare that “Britain has had enough of experts.”3 If we had soberly grounded our expectations in the base rates for survival of university-press books—an outside-view approach to forecasting widely taught as a best practice—we would have predicted a book long out of print by now.

  From the vantage point of 20/20 hindsight, however, it is hard to see how we failed to foresee that EPJ would strike a chord. I had practically invited the media attention by choosing the chimps as my taunting metric to beat (a choice some colleagues have yet to forgive), and I offered reviewers a narrative easily prone to caricature, in which an academic study demonstrated that academic credentials don’t pay off in the real world. Moreover, the attentive public was in a foul mood. Shaken by two spectacular forecasting flops—the failure to prevent the 9/11 terrorist attacks and the failure to find Iraqi weapons of mass destruction—many felt betrayed by the “experts” whom they had trusted to keep them safe. Distrust of experts was not a new phenomenon in American politics. Populist skepticism of elites and their cabalistic machinations stretches back to the Jacksonian era. But it enjoyed a resurgence during the George W. Bush years, and the book seemed to lend intellectual backing to that anti-intellectualism. In short, the media misinterpreted EPJ to claim that experts know nothing, and know-nothings seized on that claim as proof that knowledge itself is somehow useless. This is not, to put it mildly, what EPJ said, and so the first purpose of this preface to the new edition is damage control.

  Experts serve vital functions. Their knowledge enables them to perform highly skilled operations—from computer programming to brain surgery to aircraft design—that the rest of us would do well to avoid. Only experts can even begin to craft coherent legislation on complex topics like tax law, health care, or arms control. Experts are also, by and large, the generators of new knowledge, in fields from archaeology to quantum physics. The era of the amateur scientist is long past.

  But when it comes to judgment under uncertainty—a key component of real-world decision-making—expertise confers a less-clear advantage, even in fields like medicine, where one might suppose it essential. In 1954, psychologist Paul Meehl showed that statistical methods were often superior to “clinical” methods of prediction—that is, prediction by doctor—in patient prognosis and treatment,4 a humbling finding that has held up well in a host of other professional domains.5 Starting in the 1970s and stretching to the present, the Daniel Kahneman and Amos Tversky research program and its prolific offshoots further demystified expertise by highlighting experts’ susceptibility to judgmental biases: too quick to jump to conclusions, too slow to change their minds and too swayed by the trivia of the moment.6 To be sure, there are exceptions in which experts acquit themselves well. Meteorologists and professional bridge and poker players make impressively well-calibrated probability judgments.7 Veteran fire-fighters and neonatal nurses can size up situations faster than can their rookie colleagues.8 The pivotal distinction is between experts who work in learning-friendly environments in which they make short-term predictions about easy-to-quantify outcomes and get the prompt, clear feedback essential for improvement—and experts toiling in less friendly environments in which they make longer-range forecasts about harder-to-quantify outcomes and get slower, vaguer feedback.9 Political judgment usually falls into the latter category, which is where EPJ comes in.

  The studies that provided the foundation for Expert Political Judgment were forecasting tournaments held in the 1980s and 1990s during which experts assessed the probabilities of a wide range of global events—from interstate violence to economic growth to leadership changes. By the end, we had nearly 30,000 predictions that we scored for accuracy using a rigorous system invented by and named for a statistically savvy meteorologist, Glenn Brier. One gets better Brier scores by assigning probabilities “closer” to reality over the long term, where reality takes on the value of 1.0 when the predicted event occurs and zero when it does not. Lower Brier scores are therefore good. Indeed, a perfect score of zero indicates uncanny clairvoyance, an infallible knack for assigning probabilities of 1.0 to things that happen and of zero to things that do not. The worst possible score, 2.0, indicates equally uncanny inverse clairvoyance, an infallible knack for getting everything wrong. And if we computed the long-term average of all the chimps’ dart-tossing forecasts, we would converge on 0.5, the same maximum-uncertainty judgment rational observers would make in guessing purely stochastic binary outcomes such as coin flips—which would earn chimps and humans alike the chance-accuracy baseline Brier score of 0.5.
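
  To make the scoring rule concrete, here is a minimal Python sketch of the two-outcome Brier score described above; the function name and the sample forecasts are illustrative, not taken from the book.

```python
def brier_score(prob_event: float, event_occurred: bool) -> float:
    """Two-outcome Brier score: squared error summed over the event and
    its complement. 0 is perfect foresight, 2 is getting everything wrong."""
    outcome = 1.0 if event_occurred else 0.0
    return (prob_event - outcome) ** 2 + ((1.0 - prob_event) - (1.0 - outcome)) ** 2

# Clairvoyance scores 0; inverse clairvoyance scores 2.
print(brier_score(1.0, True))   # 0.0
print(brier_score(1.0, False))  # 2.0

# Always answering "50 percent" earns the chance baseline, whatever happens.
print(brier_score(0.5, True))   # 0.5
print(brier_score(0.5, False))  # 0.5
```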

  The headline result of the tournaments was the chimp sound-bite, but EPJ’s central findings were more nuanced. It is hard to condense them into fewer than five propositions, each a mouthful in itself:

  • Overall, EPJ found over-confidence: experts thought they knew more about the future than they did. The subjective probabilities they attached to possible futures they deemed to be most likely exceeded, by statistically and substantively significant margins, the objective frequency with which those futures materialized. When experts judged events to be 100 percent slam-dunks, those events occurred, roughly, 80 percent of the time, and events assigned 80 percent probabilities materialized, on average, roughly 65 percent of the time.

  • In aggregate, experts edged out the dart-tossing chimp but their margins of victory were narrow. And they failed to beat: (a) sophisticated dilettantes (experts making predictions outside their specialty, whom I labeled “attentive readers of the New York Times”—a label almost as unpopular as the dart-tossing chimp); (b) extrapolation algorithms which mechanically predicted that the future would be a continuation of the present. Experts’ most decisive victory was over Berkeley undergraduates, who pulled off the improbable feat of doing worse than chance.

  • But we should not let terms like “overall” and “in aggregate” obscure key variations in performance. The experts surest of their big-picture grasp of the deep drivers of history, the Isaiah Berlin–style “hedgehogs,” performed worse than their more diffident colleagues, or “foxes,” who stuck closer to the data at hand and saw merit in clashing schools of thought.10 That differential was particularly pronounced for long-range forecasts inside experts’ domains of expertise. The more remote the day of reckoning with reality, the freer the well-informed hedgehogs felt to embellish their theory-driven portraits of the future, and the more embellishments there were, the steeper the price they eventually paid in accuracy. Foxes seemed more attuned to how rapidly uncertainty compounds over time—and more resigned to the eventual appearance of inherently unpredictable events, Black Swans, that will humble even the most formidable forecasters.11

  • A tentative composite portrait of good judgment emerged in which a blend of curiosity, open-mindedness, and unusual tolerance for dissonance were linked both to forecasting accuracy and to an awareness of the fragility of forecasting achievements.12 For instance, better forecasters were more aware of how much our analyses of the present depend on educated guesswork about alternative histories, about what would have happened if we had gone down one policy path rather than another (chapter 5). This awareness translated into openness to ideologically discomfiting counterfactuals. So, better forecasters among liberals were more open to the possibility that the policies of a second Carter administration could have prolonged the Cold War, whereas better forecasters among conservatives were more open to the possibility that the Cold War could have ended just as swiftly under Carter as it did under Reagan. Greater open-mindedness also protected foxier forecasters from the more virulent strains of cognitive bias that handicapped hedgehogs in recalling their inaccurate forecasts (hindsight bias) and in updating their beliefs in response to failed predictions (cognitive conservatism).

  • Most important, beware of sweeping generalizations. Hedgehogs were not always the worst forecasters. Tempting though it is to mock their belief-system defenses for their often too-bold forecasts—like “off-on-timing” (the outcome I predicted hasn’t happened yet, but it will) or the close-call counterfactual (the outcome I predicted would have happened but for a fluky exogenous shock)—some of these defenses proved quite defensible. And, though less opinionated, foxes were not always the best forecasters. Some were so open to alternative scenarios (chapter 7) that their probability estimates of exclusive and exhaustive sets of possible futures summed to well over 1.0. Good judgment requires balancing opposing biases. Over-confidence and belief perseverance may be the more common errors in human judgment, but we set the stage for over-correction if we focus solely on these errors—and ignore the mirror-image mistakes of under-confidence and excessive volatility.13

  These caveats should set the record straight: EPJ was never the anti-egghead rant that some hoped. It showed that a subset of experts could rather consistently predict the near-term future better than chance—and it helped if they approached the task with an open mind and let go of preconceptions as new information became available. Of course, EPJ also found that foresight is a precarious achievement. In a noisy world, even the best forecasters feel the downward tug of regression toward the mean. But it matters (a lot) that the cognitively agile among us manage to resist the laws of statistical gravity as well and as long as they do.

  The core thesis of EPJ was, then, never as democratic as some supposed. Neither self-awareness nor forecasting skills are evenly distributed. The book did, however, hold out a tantalizing prospect for democracy: the prospect of keeping score and using those scores to settle arguments over policy. By using forecasting tournaments to test subjective expectations of the future against reality, it should be possible to improve political discourse and reduce polarization. I realize that this idea sounded naïve in 2005 and perhaps it sounds downright delusional in the hyper-polarized atmosphere of 2017. But, at the risk of sounding like one of those waiting-for-Godot hedgehog forecasters, it is a good idea whose time will come. I suspect that over the next few centuries, the idea of systematizing score-keeping will gradually become so commonplace that our descendants in 2317 will look back on our policy debates in 2017 with the same “crushing condescension of posterity” that we reserve for judging jurisprudence in the Salem witch trials. They will shake their heads: How could smart people have been so dumb?

  In the meantime, there is a lot of work to do. So the second task of this new preface is to make the case that forecasting tournaments (and kindred systems) do indeed have the potential to advance the cause of deliberative democracy.14

  Policy debates pivot on competing claims about the future, such as whether a given tax plan will spur or slow the economy. The problem is that, in the world of 2017, the debaters—be they op-ed columnists or elected officials—rely heavily on vague-verbiage predictions riddled with weasel words like “could” or “might” that render them untestable. Such words suggest wildly different probabilities to different listeners, from as low as 0.1 to as high as 0.8, and they give forecasters the wiggle room they need to position themselves on the “right side of maybe,” no matter what happens.15 This makes it impossible to hold vague-verbiage forecasters accountable for the accuracy of their forecasts. And this lack of accountability allows erroneous beliefs to go unchallenged and public policy to remain mired in unproductive debates. Worse, as EPJ showed, the hedgehogs, with their sweeping, top-down deductive theories of history, get the bulk of media attention, even though they tend to be less accurate in their forecasts. The net result is that our collective conversations become more polarized as each side grows more convinced that the other side is so oblivious to facts that it must be “on the take,” in the game solely for perks and power, not the truth.

  EPJ didn’t address this downward spiral into stereotyping directly, but it held out the hope that we could alter the destructive dynamics of policy debates if we heeded its prescriptions. We would have to get into the habit of translating our vague hunches into precise probability judgments that would be judged for accuracy by impartial referees using proper scoring rules that incentivize us to focus on accuracy and only accuracy. For instance, Brier scoring both penalizes forecasters for over-confidence and rewards them for justified decisiveness. So, assigning a 90 percent chance to something that never happens leads to a painfully bad Brier score (1.62). This was the fate of the prominent poll-aggregators who put 9:1 odds on Hillary Clinton winning the 2016 presidential election.16 By contrast, assigning a 90 percent chance to something that does happen gets you a superb score (.02). And this combination of incentives means that it is a mistake to mindlessly retreat toward 50 percent to avoid the over-confidence penalty. We should report probabilities that reflect our true confidence in our knowledge—and try to distinguish as many degrees of uncertainty between zero and 1.0 as problems permit. Nuance matters. Our credibility (Brier) scores in tournaments wax or wane as our bank accounts would if we were to win or lose bets based on the corresponding odds. Losing a 99:1 bet is a lot more painful than losing a 9:1 one, which, in turn, hurts more than losing a 3:1 one.
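
  As a quick check on those figures, here is the same arithmetic under the two-outcome convention sketched earlier; the numbers are the book’s, the snippet is only an illustration.

```python
# 90 percent on an event that never happens: (0.9 - 0)^2 + (0.1 - 1)^2
print((0.9 - 0.0) ** 2 + (0.1 - 1.0) ** 2)  # ~1.62

# 90 percent on an event that does happen: (0.9 - 1)^2 + (0.1 - 0)^2
print((0.9 - 1.0) ** 2 + (0.1 - 0.0) ** 2)  # ~0.02
```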

 
