Holding people accountable for their Brier scores in tournaments might not seem a sharp break from what we routinely do in everyday life. Don’t we already hold each other accountable for the accuracy of our opinions? We do, sometimes. But we tend to be sloppy and conflate goals.
Accurately appraising reality is only one of a number of difficult-to-disentangle functions our opinions serve in everyday life. We also use our opinions to signal our membership in communities of co-believers.17 Members in good standing of any self-respecting ideological tribe are supposed to line up their beliefs about which policies on taxation or school vouchers or the Middle East will deliver good or bad results to which deserving groups. Adopting the views of our co-believers and distancing ourselves from non-believers are the obvious, low-effort, low-risk strategies for coping with such a world. Why bother thinking through issues on your own if your first reward for doing so is to be marginalized by the people whose esteem you most value? Imagine you are a scientist who sets aside time to sift through reports of the Intergovernmental Panel on Climate Change and you emerge 70 percent confident that global temperatures will be in line with their projections. In the eyes of many, you have just signaled: “30 percent of me is flirting with the denialists.” It is easier and safer to declare: “I stand with my tribe.”
All this implies that it won’t be easy to persuade people to sign up and stick with tournaments. Making it even harder, most people are, deep down, “naïve realists” who don’t appreciate the messy mix of functions their opinions serve. They feel they already are “objective” and resent implications to the contrary.18 They see no big gain, and considerable inconvenience, from joining tournaments that require stepping outside their familiar world in which co-believers help each other keep score—and entering an alien world in which accuracy trumps affinity.
Like financial markets, tournaments are impersonal mechanisms that reward cognitively unnatural acts, like second-guessing ourselves and looking for ways we could be wrong before others spot our blind spots for us.19 If we could institutionalize this sort of constructive competition in policy debates, it should, in theory, jumpstart a virtuous circle of “cognitive-debiasing” benefits. It should check confirmation bias, over-confidence, and belief perseverance by encouraging us to take third-party perspectives on our own thought processes, to explore why others see things differently, and to weigh arguments dispassionately. It should help to pry open closed minds by checking our tendencies to retreat into echo chambers that restrict access to dissonant but predictively useful news. And it should tamp down the volume on the sorts of ideological over-claiming that play well among true believers but, taken seriously, degrade predictive accuracy. Instead of hunkering down into a defensive crouch when we get unexpected news about our favorite candidate, we would learn, gradually and less grudgingly, to say to ourselves: “Well, I might be wrong.” In conversations with others, we would start sounding less like hedgehogs and more like foxes. EPJ never promised an easy metamorphosis, but it implied that tournaments would help debates converge faster on trans-ideological truths.20
Yet the question lingers: how can we jumpstart a process that calls on people to give up old cognitive habits and take on new social risks? Given the common view of EPJ as an elite-debunking exercise, it may surprise some readers to learn that tournaments to date owe their successes more to top-down interventions of elites than to bottom-up effusions of enthusiasm from the masses.
The big upside to the media attention EPJ received was that it put forecasting tournaments onto the radar screens of influential people outside academia, some of whom commanded the resources that would enable us to test tournaments’ usefulness on a grander scale. In late 2009, Barbara Mellers and I met officials from the U.S. intelligence community who told us their plan to assess the feasibility of keeping score among professional analysts, of tracking the accuracy of their probability judgments across time and topics. They intended to launch a much more ambitious series of tournaments than anything in EPJ. Carefully culled and well-funded research teams would compete to invent ingenious strategies for generating accurate probability estimates for events of “national-security relevance.” And that plan soon became a reality. In the 2011–2015 tournaments, sponsored by the Intelligence Advanced Research Projects Activity (IARPA), tens of thousands of forecasters made well over 1 million forecasts on more than 500 questions representative of those that professional intelligence analysts face.21
In another book, Superforecasting, I told the tale of how the Mellers-Tetlock team, the Good Judgment Project, won that tournament. I won’t retell it here, except to say that winning a real-world forecasting tournament requires nimbleness among researchers as well as among forecasters, and a pragmatic readiness to draw opportunistically on odd mixes of ideas from disparate traditions of thought.22 Tournaments punish rigidity, be it rooted in academic parochialism or ideological intransigence. Consider the price one would have paid in the IARPA tournament for indulging one’s distaste for cognitive elitism. The fluid intelligence of forecasters mattered—and singling out a small set of consistently high performers as “superforecasters” and tracking them into superteams proved a winning move.23 One also would have taken a hit for dismissing teams on the grounds that teams inevitably reduce independent perspectives and are inherently prone to groupthink. For it proved possible, especially among superforecasters, to engineer team norms that promoted open-minded exchanges of ideas and produced teams that out-performed people working alone.24 One would also have lost points in the tournament for dismissing the over-worked word “diversity” as politically correct posturing. For it turned out that our aggregation algorithms worked better to the degree the crowds consisted of forecasters working from different knowledge bases and points of view.25 There is no value in algorithms that try to extract wisdom from crowds of clones.
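The clone problem can be made concrete with a toy simulation. The sketch below is purely illustrative (it is not the project’s actual aggregation algorithm, and every parameter in it is invented), but it shows why averaging helps when forecasters’ errors are independent and stops helping when everyone shares the same misreading of the evidence.

```python
# Illustrative only: why averaging helps a diverse crowd more than a crowd of clones.
import random

random.seed(1)
N_QUESTIONS, N_FORECASTERS, NOISE = 1000, 100, 0.15

def brier(p, outcome):
    return (p - outcome) ** 2

def clamp(p):
    return min(max(p, 0.01), 0.99)

diverse_scores, clone_scores = [], []
for _ in range(N_QUESTIONS):
    truth_prob = random.random()                       # underlying chance the event occurs
    outcome = 1 if random.random() < truth_prob else 0
    # Diverse crowd: each forecaster's error is independent of the others'.
    diverse = [clamp(truth_prob + random.gauss(0, NOISE)) for _ in range(N_FORECASTERS)]
    # Crowd of clones: everyone shares one and the same misreading of the evidence.
    shared_error = random.gauss(0, NOISE)
    clones = [clamp(truth_prob + shared_error)] * N_FORECASTERS
    diverse_scores.append(brier(sum(diverse) / N_FORECASTERS, outcome))
    clone_scores.append(brier(sum(clones) / N_FORECASTERS, outcome))

print("mean Brier, diverse crowd:  ", round(sum(diverse_scores) / N_QUESTIONS, 3))
print("mean Brier, crowd of clones:", round(sum(clone_scores) / N_QUESTIONS, 3))
```

Independent errors largely cancel when averaged; a shared error survives the averaging untouched, which is why clones add nothing for an aggregation algorithm to work with.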
Tournaments shine a spotlight on the all-too-numerous pockets of mental rigidity that make our politics (and sometimes our science) so dysfunctional. If we could convince the news media of tournaments’ potential to depolarize unnecessarily polarized debates—by up-weighting the views of pragmatic belief-updaters and down-weighting dogmatists—tournaments could begin evolving from academic curiosities to integral parts of mainstream commentary. Imagine kicking off the traditional kabuki dance of clashing pundits with a quick summation of the latest probabilities from the tournament pipelines: “The current weighted average of the best forecasters puts 10:1 odds against an Iranian nuclear test before 2020 conditional on sticking with the current international agreement—and 6:1 conditional on U.S. withdrawal from that agreement. Let’s see what fresh insights our guests today have to offer.” Odds estimates of this sort are already common in the sophisticated election coverage of Websites such as 538 and The Upshot.26 Why not take the next step and infuse probabilistic forecasting into the policy debates on which elections and policy decisions are, in democratic theory, supposed to pivot?27 Let demand for guests rise or fall, in part, on their skill at telling us things we did not already know—and even better, things that happen to be true.
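For readers who want the arithmetic behind such quotes spelled out, converting odds into probabilities takes one line; the figures below are the hypothetical ones from the imagined broadcast, not output from any real tournament.

```python
# Converting "N:1 odds against" into an implied probability: p = 1 / (N + 1).
def prob_from_odds_against(n):
    return 1.0 / (n + 1.0)

print(prob_from_odds_against(10))  # 10:1 against a test -> ~0.09 with the agreement intact
print(prob_from_odds_against(6))   #  6:1 against a test -> ~0.14 after a U.S. withdrawal
```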
If you squint and look at it from just the right angle—and ignore the transitory theatrical distractions in the public arena—the arc of history has already begun bending toward tournaments. The core idea of EPJ, keeping score, dovetails with the technocratic-meritocratic Zeitgeist of evidence-based decision making. Moneyball-style metrics for judging judgment have popped up not only in baseball and basketball but in philanthropy, criminology, and medicine.28 And Artificial Intelligence continues to encroach on ever more sophisticated spheres of human activity, from chess to Go to poker to equity analysis. It will look odd when the last hold-outs against evidence-based metrics are those opining on our highest-stakes policy problems. And it will look even odder when we realize that we know a lot more about the predictive track records of sports pundits than we do about those of political pundits.
Yet oddities can persist long enough to humble many a bold, arc-of-history forecaster. The obstacles to widespread adoption in the here-and-now remain daunting, so daunting that I devote the remainder of this foreword to exploring them.
PROBLEMATIC PROBABILITIES
Critics of EPJ doubted from the outset whether prediction is even a sensible aspiration in their line of work. How could anyone ever assign meaningful probabilities to the swirling eddies of non-repeatable events that make up the flow of world politics? There was only one Reagan, one Gorbachev, one Thatcher, one Mandela. . . . How can we ever know whether experts who put only a 10 percent probability on the disintegration of the USSR between 1988 and 1993 were wrong? For all we know, they may have scored a bull’s eye. In 90 percent of the once-possible five-year futures in 1988, perhaps the USSR did not disintegrate and we just happen to live in one of the unlikely worlds in which it did.
In this view, I had committed a colossal category mistake, the sort of mistake one would expect from “a neo-positivist psychologist,” too eager to jump from problematic concepts, like good judgment, to facile operational definitions, like forecasting accuracy. To make the point even more pointedly, I deserved to be hoisted on my own heuristics-and-biases petard. I had used Kahneman’s attribute-substitution heuristic to pull off a bait and switch. I had started off with a really hard question and, consciously or unconsciously, segued into an easier one, from “How good is political judgment?” to “How good are experts as forecasters?”29 Finding that experts were unimpressive forecasters, I then encouraged trusting readers to draw the specious conclusion that there is zero value to expertise.
I had two opposing reactions to the critiques, one defiant and one more self-critical. Each captures a fraction of the truth and it is unclear to this day which fraction is larger.
My inner hardliner suspected that subject-matter experts would not have been so dismissive if the data had broken the other way and they had proven impressive forecasters. The critiques are largely, if not entirely, whiny high-brow rationalizations for failure.
My other reaction was to wonder whether these talented professionals had spotted serious design flaws in first-generation EPJ tournaments. Perhaps the Brier-scoring rules, which dichotomize reality into non-events (scored as zero) and events (scored as 1.0), were Procrustean. Perhaps the criteria for resolving what really happened, and thus who got what right, were arbitrary. Perhaps I should have given more credence to errant forecasters’ after-the-fact belief-system defenses. Each objection raises questions about the limits of objectivity. I took them seriously in the first edition (especially in chapters 6 and 8 and the Technical Appendix), and I take them just as seriously today. The truth behind opposition to tournaments need not be either reason or rationalization. It is, quite commonly, a messy mix.
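For readers meeting the scoring rule for the first time, a minimal binary version of it looks like this; the Soviet-collapse forecast discussed above shows why errant forecasters felt the dichotomy was Procrustean.

```python
# A minimal binary Brier score: reality is coded 0 (the event did not occur)
# or 1 (it did), and a forecast is penalized by its squared distance from that code.
# Lower scores are better; 0 is perfect, 1 is maximally wrong.
def brier_score(forecast_prob, event_occurred):
    outcome = 1.0 if event_occurred else 0.0
    return (forecast_prob - outcome) ** 2

# A 10 percent forecast of Soviet disintegration, scored after the USSR did dissolve:
print(brier_score(0.10, True))    # ~0.81 -- a heavy penalty
# The same forecast, had the USSR held together:
print(brier_score(0.10, False))   # ~0.01 -- nearly perfect
```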
THE “UNIQUENESS” CHALLENGE
I understand why many subject-matter experts find the very notion of tournaments annoying. It would be galling to be handed a meaningless task, to politely give it a go, to fall short, and then be depicted as clueless. There are reasonable grounds for doubting how meaningful it is to assign probabilities to events like Grexit or the Syrian civil war. Has EPJ’s signature accomplishment been to impugn the competence of thoughtful professionals doing a hard job about as well as it is possible to do it?
Let’s start with the most strident version of the “meaningless” argument, which can be broken into two claims: (a) history never exactly repeats itself, so historical events are, by definition, unique; (b) probability judgments require relative-frequency or base-rate data, so probability judgments of unique events are, by definition, meaningless.30
Claim (a) follows from the dictionary definition of “unique”: events either are or are not unique. Events can be no more or less unique than women can be more or less pregnant. Claim (b) follows from a strict, frequentist philosophy of statistics, which requires that all inferences about true population values be grounded in well-defined samples or comparison classes of observations.
It is pointless to argue with tautologies, but claim (b) breaks down if we do something most of us consider reasonable: relax our definition of “unique” and treat events as varying along a uniqueness continuum. Restrictive definitions make it unnecessarily hard to see how the opposite of great truths can also be true. On the one hand, there is nothing new under the sun if we crank up the magnification on our social-science microscope. Grexit may have looked sui generis, because no country had exited the Eurozone as of 2015, but it could also be viewed as just another instance of a broad comparison class, such as negotiation failures, or of a narrower class, such as nation-states withdrawing from international agreements or, narrower still, of forced currency conversions.31 On the other hand, everything is unique under sufficiently close historical inspection. It is impossible to step into the same river twice. History never literally repeats itself. At most, to paraphrase Mark Twain, it occasionally rhymes.
When we view uniqueness as a continuum, we can say it falls close to zero whenever experts find it easy to define rigorous comparison classes and no one disputes “event replicability” (the notion that things of this sort have happened many times before). Statistics 101 instructors satisfy these requirements in the classroom by flipping coins or sampling colored balls from urns. Uniqueness rises as we venture into the real world and encounter complex, interdependent bundles of idiosyncratic events. As uncertainty grows about the right comparison classes, and the necessary and sufficient conditions for inclusion in those classes, we find ourselves relying on fuzzier-set, family-resemblance criteria.32 Nate Silver’s 538 forecasting Website specializes in outcomes in the middle zone of the continuum—quasi-replicable events like baseball games and political elections. It is still feasible in this zone to construct statistical models that out-perform unaided human judgment. Finally, uniqueness peaks when we confront game-changer events—Black Swans like the collapse of the USSR or the Arab Spring—that abruptly transform seemingly super-safe, comparison-class bets on continuation of the status quo into big Brier-score losers. This is the zone in which frequentists tell us to surrender because probabilities have lost all meaning.
But those invoking the “meaninglessness” claim paint with too broad a brush. The concept of a uniqueness continuum transforms an intractable all-or-none debate over first principles into a matter-of-degree, empirical debate. If the strong form of the meaninglessness critique were right, researchers should not have found an accuracy payoff from treating uniqueness as a matter of degree. And yet they have—repeatedly. Forecasters who frame events as special cases of comparison classes do better than those who insist on adopting a sui generis perspective. This is not to claim the accuracy payoff from base-rate reasoning in geopolitics is as large as it is in, say, poker. But the payoff is larger than zero—in the vicinity of 10 percent.33 We know this, in part, because training in using base rates improves forecaster accuracy and, in part, because rounding off the forecasts of the best forecasters degrades their accuracy.34 None of this should happen if the events being forecast were, strictly speaking, one of a kind.
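The rounding result is easy to illustrate with simulated data. The sketch below is not the actual study design; it simply coarsens a well-calibrated forecaster’s probabilities to a three-point scale and shows the average Brier score getting worse, the same qualitative pattern reported for real forecasters.

```python
# Simulated illustration of the rounding test: coarsening well-calibrated
# probabilities to a three-point scale (0, 0.5, 1) worsens the average Brier score.
import random

random.seed(2)

def brier(p, outcome):
    return (p - outcome) ** 2

def round_to_three_points(p):
    return min((0.0, 0.5, 1.0), key=lambda grid_point: abs(grid_point - p))

# A perfectly calibrated forecaster: the stated probability is the true chance.
forecasts = [random.random() for _ in range(10_000)]
outcomes = [1 if random.random() < p else 0 for p in forecasts]

fine = sum(brier(p, o) for p, o in zip(forecasts, outcomes)) / len(forecasts)
coarse = sum(brier(round_to_three_points(p), o) for p, o in zip(forecasts, outcomes)) / len(forecasts)
print(f"average Brier, fine-grained forecasts: {fine:.3f}")
print(f"average Brier, rounded forecasts:      {coarse:.3f}")
```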
Some philosophers see another category mistake about to surface: a psychologist trying to rebut a conceptual objection with an empirical observation. But I take my stand here—and incur the risk of being re-labeled a “primitive positivist.” I cannot see how large numbers of probability judgments collected in geopolitical tournaments can be simultaneously philosophically meaningless and psychologically meaningful. I stake the claim to “psychological meaning” on the mounting evidence that forecasters who use comparison classes to inform their probability estimates generate estimates that correspond more closely with frequentist regularities in the real world. If the best forecasters walk and talk like probabilistic thinkers, shouldn’t we entertain the possibility that they are indeed just that?
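What “corresponding with frequentist regularities” means operationally is calibration, and calibration can be checked mechanically: group forecasts by their stated probability and compare each group’s stated probability with the observed frequency of the events. A minimal sketch, with invented data, follows.

```python
# A minimal calibration check: well-calibrated forecasts of roughly 20 percent
# should come true roughly 20 percent of the time.
from collections import defaultdict

def calibration_table(forecasts, outcomes, n_bins=10):
    bins = defaultdict(list)
    for p, outcome in zip(forecasts, outcomes):
        index = min(int(p * n_bins), n_bins - 1)   # which tenth the forecast falls in
        bins[index].append(outcome)
    return {index / n_bins: sum(os) / len(os) for index, os in sorted(bins.items())}

# Example: five forecasts of 0.2, one of which came true, are perfectly calibrated.
print(calibration_table([0.2, 0.2, 0.2, 0.2, 0.2], [0, 1, 0, 0, 0]))   # {0.2: 0.2}
```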
There is, moreover, a purely conceptual response to the “meaninglessness” argument. It grants that the frequentists are right that we cannot rerun history. No one will ever know how many times Greece would have left the Eurozone if we could have rerun history, say, 100 times, from the same starting conditions in July 2015. The conditional probability of “Grexit” given these objective antecedents is, in this sense, unknowable.
But if we shift from a restrictive frequentist conception to an expansive Bayesian view of probabilities as reasonable gradations of belief subject to constant revision, we can “re-conditionalize” our forecasts on subjective rather than objective antecedents. A Bayesian approach acknowledges that people, especially the experts among us, bring a wealth of information to any given forecast and that we inevitably use that information to form hunches, of varying crudeness, about what will happen next. These subjective probabilities are rarely grounded in random draws from precisely parameterized populations. And good Bayesians know that they need to update their initial odds estimates when they get new information and to make those updates proportional to the diagnostic value of the new evidence, an often painfully gradual process. In this view, there is no logical prohibition against asking: how likely was Grexit in July 2015, given that the top geopolitical forecasters assigned that outcome a probability of 0.2 and given that these top forecasters are well-calibrated and that events they say are 20 percent likely actually happen about 20 percent of the time? It struck me as wrong in 2005—as it still does—to insist we know no more about the “true probability” than we did before learning about the most recent predictions of the best forecasters. Anyone happy to ignore where the smart money places its bets is someone against whom I am happy to bet more than academic-reputational capital.
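The updating logic in that Bayesian picture is simple enough to write down. The sketch below is only an illustration of the odds form of Bayes’ rule; the likelihood ratio is invented, and the 0.2 starting point is the hypothetical Grexit figure from the text.

```python
# The odds form of Bayes' rule: posterior odds = prior odds x likelihood ratio,
# where the likelihood ratio measures how diagnostic the new evidence is.
def bayesian_update(prior_prob, likelihood_ratio):
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)   # convert back to a probability

# Start from the top forecasters' 0.2 for Grexit; suppose news arrives that is judged
# three times likelier in worlds where Grexit happens than in worlds where it does not:
print(bayesian_update(0.20, 3.0))   # ~0.43
```

Nothing in this sketch requires rerunning history; it requires only a willingness to treat probabilities as revisable degrees of belief.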