Expert Political Judgment


by Philip E. Tetlock


  The foreign-policy “Establishment” worries too much about false precision and too little about unnecessary ambiguity. Vague-verbiage forecasting—of the “there-is-a-distinct-possibility-Putin-will-move-next-against-Estonia” genre—is standard fare in talks at the Council on Foreign Relations and in publications such as Foreign Affairs and Foreign Policy. This vagueness is costly in two under-appreciated ways. First, although such claims sound informative, they say next to nothing because it is so easy, after the fact, to stretch the meanings of key terms to cover so wide a range of probabilities in the “maybe zone,” from as low as 15 percent (I only said “possible”) to as high as 85 percent (I said “distinctly” possible). Second, vague-verbiage claims make it virtually impossible to learn to do what the best poker players—and geopolitical forecasters—do: become as-granular-as-possible assessors of uncertainty. Wittgenstein’s aphorism “The limits of my language are the limits of my world” applies. You’ll never learn to distinguish 60/40 from 40/60 bets, still less 55/45 from 45/55, if you confine yourself to chattering about what may or might or could happen.

  Should we care? Yes, when the stakes are high, small increments in probabilistic accuracy matter. Recall the infamous slam-dunk certainty of the 2003 intelligence-community estimates of Iraqi “weapons of mass destruction.” Imagine that analysts had tamped it down to a more reasonable 75 percent, conceding that Saddam was behaving suspiciously but there was no smoking gun. No one knows how history would have unfolded, but we do know that better-calibrated probability assessments would have lent some legitimacy to claims of reasonable doubt.35

  PREDICTION AND EXPLANATION

  Some experts also objected that their primary professional interest lay in explanation, not prediction. I had judged them too much on forecasting accuracy and not enough on what they really do for a living: explaining the past and framing the present. They understood the fragility of the Soviet Union in the late Gorbachev period. And they knew China was at a choice point during the Tiananmen demonstrations. But they also knew that understanding and prediction in politics are rather loosely coupled. I didn’t warn readers sternly enough that it is possible to be right about underlying causal drivers but unlucky on fluky details. Be careful about inferring a defective explanatory model from a weak forecasting record. Conversely, it is possible to be wrong on fundamentals but make offsetting errors that yield accurate predictions. Be careful about inferring a sound model from a strong forecasting record.

  One’s forecast on Russia might go awry for many reasons. It might go awry because one worked from faulty assumptions about facts on the ground, such as misreading Putin’s plans or the depth of his popular support or the resilience of the Russian economy. Or it might go awry due to incorrect assumptions about the psychological-political-economic laws of causation that apply—for example, did the observer fail to see the world through a sufficiently realist lens that highlights the ruthlessly competitive nature of world politics or through a sufficiently institutionalist lens that highlights the growing power of transnational norms? Or was the forecasting failure due to a blend of errors about facts on the ground and the causal laws that operate on those facts? There are so many ways to get it wrong. Some undercut our core commitments and some we shrug off as quibbling.

  Of course, falling repeatedly on the wrong side of “maybe” is not good news for one’s pet theory. But how bad is it? This recurring judgment call can be modeled as a Bayesian updating process. How much confidence should we lose in an extensively tested, well-calibrated forecast-generation system, like Nate Silver’s for U.S. elections, when it puts a 70 percent probability on an outcome, such as the election of President Hillary Clinton, that does not pan out? The answer is not zero but it is surprisingly close, an illustration of why the best forecasters often need to be patient incremental updaters who make relatively small-percentage adjustments to their initial estimates in response to the ebb and flow of low-but-not-zero-relevance events.36
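  To make that arithmetic concrete, here is a minimal sketch of the updating, in Python. The rival hypothesis (a chance-level, coin-flip forecaster) and the 90 percent prior are assumptions invented for the example, not figures from the text.

```python
# A minimal sketch of the Bayesian update; the prior and the rival hypothesis
# are assumptions chosen for illustration, not figures from the text.
# H1: the forecast-generation system is well calibrated, so an event it puts
#     at 70 percent fails to occur 30 percent of the time.
# H2: the system is no better than a coin flip, so the event fails 50 percent
#     of the time.

prior_h1 = 0.90                # assumed prior confidence in the system
prior_h2 = 1 - prior_h1

p_miss_given_h1 = 0.30         # P(miss | well calibrated, forecast = 70%)
p_miss_given_h2 = 0.50         # P(miss | chance-level system)

# Bayes' rule after observing a single miss on a 70 percent forecast
posterior_h1 = (prior_h1 * p_miss_given_h1) / (
    prior_h1 * p_miss_given_h1 + prior_h2 * p_miss_given_h2
)
print(round(posterior_h1, 3))  # 0.844: confidence slips from 0.90 to ~0.84
```

  On those assumptions, one miss on a 70 percent call moves confidence in the system from 90 percent to roughly 84 percent, real movement but far from wholesale abandonment of the model.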

  Sometimes, however, the room for reasonable disagreement suddenly expands because experts can plausibly argue that, even though a well-specified target event did not occur within the designated time frame, they should not be scored as wrong. They were merely off-on-timing. In the view of some critics, I should be more open to re-opening certain judgment calls in EPJ about who was right or wrong about what. And perhaps I do owe certain forecasters (carefully hedged) apologies.

  Consider the story from chapter 3 about an expert on the Soviet Union whom I characterized as an “ethno-nationalist hedgehog,” a shorthand for someone who relies on clash-of-civilizations fault lines to diagnose contemporary situations.37 His 1992 forecast assigned a high probability to a Russian-Ukrainian border war in the next 5 to 10 years, a forecast that received a bad Brier score in 1997. But we now know that Russia did belatedly get around to invading Ukraine in 2014. So should we revisit that scoring decision? If so, how much weight should we give the off-on-timing defense, to the possibility that the errant forecaster had a basically sound explanatory framework? And should we also factor in the relative seriousness of the different types of forecasting errors—and “value-adjust” our scoring rules to take into account that missing an event with the potential to trigger a Russia-NATO nuclear war is far worse than crying wolf?
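  For readers who want to see what a “bad Brier score” means in this case, the sketch below works through the arithmetic. The 0.80 probability is a hypothetical stand-in, since the forecast is described only as assigning a high probability.

```python
# A hypothetical illustration of a "bad" Brier score; the 0.80 probability is
# an assumption, since the text says only that the 1992 forecast assigned a
# high probability to a Russian-Ukrainian border war within 5 to 10 years.
# Original Brier score: sum over outcomes of (forecast - outcome) squared.

forecast = [0.80, 0.20]   # P(border war by 1997), P(no border war by 1997)
outcome  = [0.0, 1.0]     # what had actually happened by 1997: no border war

brier = sum((f - o) ** 2 for f, o in zip(forecast, outcome))
print(round(brier, 2))    # 1.28 on a scale from 0 (perfect) to 2 (maximally wrong)
```

  A perfectly confident correct forecast scores 0 and a perfectly confident wrong one scores 2, so 1.28 sits well toward the wrong end of the scale.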

  These questions are hard—and it is tempting for defenders of the fragile science of foresight to sweep them under the rug. For it turns out that few hedgehog forecasts scored as wrong in the 1990s have come to pass 20 years later—not nearly enough to raise hedgehogs to parity with foxes. The ultimate-vindication waiting list still winds around the block. It includes the end-of-history optimists about democracy in China, Iran and Russia, about the unification of the Koreas and peace in the Middle East as well as ultra-bullish visions of an end to business cycles. And it includes the doomster pessimists about coming pandemics, famines, and nuclear wars as well as worst-case scenarios about the impact of GMOs and runaway climate change on life itself. These extreme-tail-risk forecasters rarely throw in the towel on the mental models underpinning their projections. And obstinacy is a shrewd career move. In the hard-elbowed jockeying for status in the real world, the reputational benefits of standing by outlier predictions that eventually prove right far exceed the penalties that attach to being defiantly wrong in relative obscurity.

  All that said, readers should not forget the lumpy carpet. The deeper question remains: how long should we wait for extreme, tail-risk forecasts to come to pass? The candid answer is that no one knows. History never tips its hand to reveal the true probability distributions of possible outcomes. For instance, World Wars I and II were unprecedentedly destructive, claiming roughly 20 and 50 million lives. Those may be the worst wars humanity will suffer. Or lurking in the tails may be World Wars III and IV with death tolls in the hundreds of millions. “May” is the weasel word. It could mean anything from a .0000000001 chance over the next 100 years to a 50 to 60 percent chance. Offering explanatory support to the high-end estimates, we find hegemon-transition theorists who tabulate 2,500 years of historical base rates of wars between dominant powers and upstart challengers and see a U.S.-China war as likelier than not in the long run. Offering support to the low-end estimates are the nuclear deterrence theorists who argue that the old base rates don’t apply under mutual assured destruction.38 Either way, distinguishing sound from specious off-on-timing defenses of forecasts is no longer a matter of academic curiosity; it is existential.

  At this juncture, Black-Swan skeptics chortle that, for all my perky Enlightenment talk about tournaments, we still face irreducible ignorance. We still have to make leaps of faith and pick among policy paths, even though we have no idea which ones lead to ruin or prosperity.

  The skeptics are right that expert accuracy will fall to dart-tossing-chimp levels if we extrapolate too far beyond the forecasting horizon. Properly shuffle a perfectly ordered deck of cards enough times and all predictability vanishes. But I see no reason to concede more. Claims of “irreducible ignorance” and “no idea” are assertions, not evidence. If we stop thinking every time we confront a dogmatic skeptic, we guarantee that, if there are creative ways of reducing “irreducible ignorance,” we will fail to discover them.

  The key here is not throwing out the baby—the genuine wisdom experts possess—with the forecasting bathwater. In my experience, experts’ musings about the distant future are far less valuable for their predictive power than for their heuristic power, their capacity to stimulate creative questions that highlight alternative scenarios we would otherwise have ignored. If I had been wiser thirty years ago, I would have treated the quality of experts’ questions as just as worthy of systematic study as the accuracy of their answers. For questions flow like gushers from the elaborate explanatory frameworks that experts on political-economic systems are continually constructing, tweaking, and debunking.39 But testable forecasts trickle in slowly, and often the well dries up entirely. For instance, experts on the E.U. can talk for hours about the subtle interplay of economic, institutional, and domestic-political processes, but the conversation stalls as soon as you ask how many countries will belong five years hence.

  RESOLVABLE QUESTIONS ARE UNIMPORTANT

  The crudest version of this critique boils down to the old maxim “If it can be measured, it doesn’t matter, and if it matters, it can’t be measured.” The more constructive version of the critique highlights the latent tension between the psychological and political functions of tournaments. Psychologists are naturally drawn to the psychology of forecasting, specifically (a) measuring individual differences in human performance by posing questions that are rigorously resolvable in a reasonable time frame; and (b) facilitating learning by giving people the regular, precise feedback they need to become more granular, better-calibrated forecasters. From a psychological perspective, a good forecasting question is one that helps us distinguish better from worse forecasters in much the same way that a good item on a medical-school aptitude test distinguishes better from worse doctors.

  From a political perspective, however, we run the risk of our tournaments degenerating into trivial pursuits when we focus solely on the psychometric goals of spotting and developing talent. The political function of tournaments—shedding light on who is closer to being right in high-stakes debates—merits equal attention, and that requires redefining what we mean by a “good” question. Purely psychometric tournaments need to be stocked with questions about measurable outcomes in limited time frames, such as how many people will die in naval confrontations in the East China Sea by the end of this calendar year. But callous though it may sound, policy-makers are not all that interested in such micro-confrontations per se. They care much more about Chinese capabilities and intentions, about the ways in which neighboring countries are reacting, and about how China might react to alternative U.S. policies. Tragedies involving drunken fishing-boat captains and coast guard officers in the East China Sea have some remote potential to play Sarajevo-style, cataclysm-triggering roles, but putting a high probability on radical-escalation scenarios is a good way to wind up with a terrible accuracy score.40

  The next generation of tournaments needs to blend psychometric rigor and political relevance.41 It won’t be easy. It will require subject-matter experts who can shuttle between the concrete and the abstract, the short- and the long-term, to spot early micro indicators of distant macro trends. It will require skill at crafting questions that are simultaneously rigorously resolvable and capable of tipping the scales of plausibility in high-stakes debates.

  Consider two high-stakes debates that, at a glance, look far beyond the capacity of forecasting tournaments to address. The first revolves round the scary scenario, proposed by hegemon-transition theorists, that a major U.S.-China war is likelier than not by, say, the middle of this century. The second revolves round an unsettling scenario, proposed by a mix of technologists and economic historians, that society is on the cusp of a Fourth Industrial Revolution driven by strong forms of Artificial Intelligence that will cause massive dislocations in labor markets by, let’s also say mid-century. Each scenario is vastly more open-ended than the tightly time/event-bounded questions of past tournaments, which prioritized psychometric objectives. The challenge is: Can we redesign tournaments to bridge the gap between speculative, longer-range scenario-based thinking and shorter-term forecasting?

  The answer is a cautious “yes.” To simplify, let’s treat each scenario independently.

  The first step requires doing what I have called “Fermi-izing,”42 or breaking a big intractable question into smaller semi-tractable components. One approach to decomposing the two scenarios is to ask: What sorts of very specific events would specialists expect to observe in the next few years if we currently are/aren’t on a longer-range trajectory toward either a Sino-American war or a Fourth Industrial Revolution? For the Sino-American-war scenario, a quick survey of subject-matter experts generates a host of reasonably diagnostic early warning indicators, including surging defense budgets, weakening commercial ties, brinkmanship in the East and South China Seas, North Korea’s development of an ICBM capable of delivering an atomic weapon to North America, and an anti-satellite-weapons race in space. For the Fourth-Industrial-Revolution scenario, the survey generates its own host of early indicators, including Google’s AlphaGo AI system defeating the best human player in Go (which did happen in 2016); or driverless cars picking up passengers for fares in major U.S. cities by the end of 2018; or IBM’s Watson advancing from vanquishing human competition in “Jeopardy!” to besting world-class oncologists in diagnosis tournaments.

  When the list grows unwieldy, the second step is to call again on subject-matter experts but now for pruning possibilities. The goal is optimal redundancy: each specific forecasting question should be as diagnostic as possible vis-à-vis the large concept but as independent of the other specific questions as possible so that it makes an incremental predictive contribution. This tricky balancing act yields collections of questions (“Bayesian question clusters”) that observers find useful in updating their views on the ultimate destination of the train of history on which we are all passengers. When we look out on the passing newscape, do we see signposts of a coming cataclysm or a manageably tense relationship? Or signs of a radically new work world or minor variants of the status quo?
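  As a rough illustration of how such a cluster could feed the updating, consider the sketch below. Every probability in it is an assumption invented for the example, and the conditional-independence simplification is precisely the property the pruning step tries to approximate.

```python
# A minimal sketch (all numbers are assumptions for illustration) of how a
# "Bayesian question cluster" could update belief in a big scenario S,
# e.g., a longer-range trajectory toward a Sino-American war, assuming the
# indicator questions are conditionally independent given S.

def update_on_indicators(prior_s, indicators):
    """indicators: list of (p_yes_given_s, p_yes_given_not_s, resolved_yes)."""
    odds = prior_s / (1 - prior_s)
    for p_yes_s, p_yes_not_s, resolved_yes in indicators:
        if resolved_yes:
            odds *= p_yes_s / p_yes_not_s              # question broke toward S
        else:
            odds *= (1 - p_yes_s) / (1 - p_yes_not_s)  # question broke against S
    return odds / (1 + odds)

# Three hypothetical early-warning questions, e.g., surging defense budgets,
# weakening commercial ties, an anti-satellite-weapons race in space.
indicators = [
    (0.8, 0.4, True),   # highly diagnostic, resolved "yes"
    (0.6, 0.3, False),  # moderately diagnostic, resolved "no"
    (0.7, 0.2, True),   # highly diagnostic, resolved "yes"
]
print(round(update_on_indicators(0.20, indicators), 2))  # prior 0.20 -> 0.5
```

  On these invented numbers, two indicators breaking toward the scenario and one breaking against it lift a 20 percent prior to about 50 percent; the particular figures matter less than the fact that each question’s contribution is explicit and auditable.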

  The third step is to put the clusters to work in sorting out the relative accuracy of the forecasts of clashing schools of thought—a process that becomes especially informative when the protagonists have been able to agree in advance on how much each side should change its mind if questions break in one direction or another. For instance, how many warning indicators of escalation need to flash “red” before advocates of a hawkish policy toward China should start to worry that their efforts to project strength are triggering a conflict spiral? Conversely, how much evidence do advocates of a dovish policy need to see before they start to worry that their efforts to reassure/appease the Chinese are seen in Beijing as signs of weakness?43

  The better the question clusters we can assemble, the more tournaments can accelerate collective learning by making rigidity transparent—and embarrassing. And that advances the cause of deliberative democracy, inch by (belief-updating) inch.

  SERIOUS PEOPLE WILL NEVER AGREE TO PLAY BY TOURNAMENT RULES

  I have tried unsuccessfully for 30-plus years to persuade political elites to participate in forecasting tournaments. In Superforecasting, I resorted to personifying the problem in a tale of two forecasters, one of whom was a famous pundit with an unknown track record (Tom Friedman) and the other of whom was an unknown research volunteer with a remarkably well-documented record (Bill Flack). This is yet another problem that we must not sweep under the rug. For if we can’t solve it, we will have to resign ourselves to an oppressively low upper bound on the potential of forecasting tournaments to improve policy debates.

  I see four root causes of elites’ reluctance to engage.

  The first reason will strike many non-elites as prima facie preposterous, but it should be heard out. Opinion leaders may show little interest in improving the marketplace of ideas because they believe the market is already operating close to its efficiency frontier—and see little room for improvement. Vague-verbiage forecasts issued and defended by clashing communities of co-believers may be as good as democratic deliberation gets. The big policy questions in our quirky path-dependent world just don’t lend themselves to refined probabilistic parsing. So the lack of interest in tournaments is sensible, something we should expect even among pundits whose motives are purely truth-seeking, with no whiff of power-grubbing cynicism.

  This position is logically possible but increasingly empirically untenable. Voluminous data from the IARPA competitions now show that tournaments are efficient mechanisms for spotting individual talent at making fine-grained assessments of the odds of possible futures—and for testing methods of cultivating talent and raising performance beyond control groups in randomized controlled experiments. The good-as-it-gets opposition to tournaments reminds me of the old story about two University of Chicago economists walking to lunch: when the junior professor spies a $20 bill on the sidewalk and bends to pick it up, his senior colleague interrupts, “Don’t bother. If it were real, someone would have already grabbed it.” The scientific evidence is pointing to something on the ground—and it is up to us to decide whether the expected gains in foresight are worth the costs of bending down to pick it up.
