Expert Political Judgment


by Philip E. Tetlock


  If past work on clinical versus statistical prediction is a guide, we should expect human forecasters to achieve levels of performance far closer to the chimp and simple extrapolation algorithms than to the formal statistical ones. There is an additional twist, though. Psychological skeptics add that many humans will fall below even extrapolation algorithms. When good case-based stories are circulating for expecting unusual outcomes, people will base their confidence largely on these stories and ignore the cautionary base rates. Coups, economic collapse, and so on are rare events in most places. Predicting them vastly inflates the likelihood of getting it wrong.43

  2. The diminishing marginal returns from expertise hypothesis. Radical skeptics have a fallback if experts manage to outpredict the chimp and extrapolation algorithms. They can argue that, whatever modest advantage expertise may confer, we quickly reach the point of diminishing returns. The attentive reader of the New York Times is likely to be as adept at picking up predictive cues as renowned area study specialists. It follows that, on average, specialists on Canada—who casually tracked events in the USSR in the elite press—should assign as realistic probabilities to possible Soviet futures as did certified Sovietologists. Dilettante trespassers will “win” as often as “lose” against experts on their home turf.

  3. The fifteen minutes of fame hypothesis. We have laid the basis for another counterintuitive corollary of radical skepticism: expect little consistency in who surpasses or falls short of these minimalist performance benchmarks. Occasionally, to be sure, some will do better and others worse by virtue of the interconnections among outcomes being forecast. Those “lucky enough” to anticipate the demise of the Soviet Union should also enjoy the benefits of a string of successful predictions of political and economic liberalization in Eastern Europe. But there should be no presumption that (a) experts who are good forecasters inside their domains of expertise will be good outside those domains; (b) experts with certain points of view or styles of thinking who do well at one historical juncture will do well at other junctures.

  4. The loquacious overconfidence (or hot air) hypothesis. Although knowledge beyond a bare minimum should not enhance forecasting accuracy, it should bestow on experts the cognitive resources to generate more elaborate and convincing rationales for their forecasts. Thus, as expertise rises, confidence in forecasts should rise faster than the accuracy of forecasts, producing substantial overconfidence by the time we reach the highest rungs of the expertise ladder. The most distinctive cognitive marker of expertise should be relatively extreme and elaborately justified probability judgments that fare poorly against the evidence.

  5. The seduced by fame, fortune, and power hypothesis. The more frequently experts are asked to offer their opinions on current events to the media, business, or government, the greater the temptation to offer quotable quotes and good sound bites. Ego-enhancing contact with outsiders counts as a factor that, like expertise itself, should increase confidence without increasing accuracy, thus further fueling overconfidence. Of course, causality can also work in the other direction. The media can be seduced by the charismatically overconfident.

  6. The indefinitely sustainable illusion hypothesis. Radical skeptics expect minimal consistency in who gets what right across time and topics. But psychologists are not surprised that people persist in believing there is a great deal of consistency, the forecasting equivalent of hot hands in basketball.44 The illusion of consistency is rooted in both cognitive biases (pervasive misunderstandings about the workings of chance) and motivational biases (the need to believe that we do not make life-and-death decisions whimsically). We should thus expect (a) the frequent emergence of shortfalls between what the public hopes experts can deliver and what experts can deliver; (b) the frequent widening of these shortfalls whenever historical forces increase how desperate people are for guidance but not how skilled experts are at providing it.

  METHODOLOGICAL BACKGROUND

  Readers more interested in results than in methods can skip ahead to the section entitled “The Evidence.” It is, however, a mistake for serious scholars to do so. The fact is that there is no single best approach to testing the six hypotheses. No study will be without “fatal flaws” for those inclined to find them. Critics can always second-guess the qualifications of the forecasters or the ground rules or content of the forecasting exercises. This chapter therefore makes no claim to being the final word.

  The Methodological Appendix does, however, give us four good reasons for supposing the current dataset is unusually well suited for testing the core tenets of radical skepticism. Those reasons are as follows:

  1. The sophistication of the research participants who agreed—admittedly with varying enthusiasm—to play the role of forecasters. This makes it hard to argue that if we had recruited “real heavyweights,” we would now be telling a more flattering tale about expertise. And, although defenders of expertise can always argue that an intellectually heftier or a more politically connected sample would have done better, we can say—without breaking our confidentiality promises spelled out in the appendix—that our sample of 284 participants was impressive on several dimensions. Participants were highly educated (the majority had doctorates and almost all had postgraduate training in fields such as political science (in particular, international relations and various branches of area studies), economics, international law and diplomacy, business administration, public policy, and journalism); they had, on average, twelve years of relevant work experience; they came from many walks of professional life, including academia, think tanks, government service, and international institutions; and they showed themselves in conversation to be remarkably thoughtful and articulate observers of the world scene.

  2. The broad, historically rolling cross section of political, economic, and national security outcomes that we asked forecasters to try to anticipate between 1988 and 2003. This makes it hard to argue that the portrait of good judgment that emerges here holds only for a few isolated episodes in recent history: objections of the canonical form “Sure, they got the collapse of the USSR or the 1992 U.S. election right, but did they get … ” A typical forecasting exercise elicited opinions on such diverse topics as GDP growth in Argentina, the risk of nuclear war in the Indian subcontinent, and the pace of “democratization” and “privatization” in former Communist bloc countries. Many participants made more than one hundred predictions, roughly half of which pertained to topics that fell within their self-reported domains of expertise and the other half of which fell outside their domains of expertise.

  3. The delicate balancing acts that had to be performed in designing the forecasting exercises. On the one hand, we wanted to avoid ridiculously easy questions that everyone could get right: “Yes, I am 100 percent confident that stable democracies will continue to hold competitive elections.” On the other hand, we wanted to avoid ridiculously hard questions that everyone knew they could not get right (or even do better than chance on): “No, I can only give you just-guessing levels of confidence on which party will win the American presidential election of 2012.” In groping for the right balance, therefore, we wanted to give experts the flexibility to express their degrees of uncertainty about the future. To this end, we used standardized-format subjective probability scales. For instance, the scale for judging each possibility in the three-possible-futures exercises was a 0-to-1.0 subjective probability scale, with the added option that forecasters could assign .33 to all possibilities (by checking the “maximum uncertainty box”) when they felt they had no basis for rating one possibility likelier than any other.

  In groping for the right balance, we also formulated response options so that experts did not feel they were being asked to make ridiculously precise point predictions. To this end, we carved the universe of possible futures into exclusive and exhaustive categories that captured broad ranges of past variation in outcome variables. The Methodological Appendix inventories the major categories of questions by regions, topics, and time frames. These questions tapped perceptions of who was likely to be in charge of the legislative or executive branches of government after the next election (e.g., How likely is it that after the next election, the party that currently has the most representatives in the legislative branch(es) of government will retain this status (within plus or minus a plausible delta), will lose this status, or will strengthen its position?); perceptions of how far the government would fall into debt (e.g., How likely is it—within either a three- or six-year time frame—that the annual central government operating deficit as a percentage of GDP will fall below, within, or above a specified comparison range of values (based on the variance in values over the last six years)?); and perceptions of national security threats (e.g., How likely is it—within either a three- or six-year time frame—that defense spending as a percentage of central government expenditure will fall below, within, or above a specified comparison range of values (again, set by the immediately preceding patterns)?).
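
  As a rough illustration of how such comparison-range questions can be scored once outcomes are known, here is a minimal sketch in Python. The code and the deficit numbers are illustrative only; the text says the ranges were based on the variance in values over the preceding six years, and the mean-plus-or-minus-one-standard-deviation rule below is a stand-in for whatever exact rule the study used.

      def comparison_range(past_values, width=1.0):
          """Illustrative comparison range: the mean of recent values plus or
          minus `width` standard deviations (a stand-in rule, not the study's)."""
          n = len(past_values)
          mean = sum(past_values) / n
          variance = sum((v - mean) ** 2 for v in past_values) / n
          sd = variance ** 0.5
          return mean - width * sd, mean + width * sd

      def classify(observed_value, low, high):
          """Map an observed outcome into exclusive, exhaustive categories."""
          if observed_value < low:
              return "below the comparison range"
          if observed_value > high:
              return "above the comparison range"
          return "within the comparison range"

      # Hypothetical deficit-to-GDP percentages for the preceding six years.
      low, high = comparison_range([2.1, 2.4, 3.0, 2.8, 2.2, 2.5])
      print(classify(4.1, low, high))  # "above the comparison range"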

  4. The transparency and rigor of the rules for assessing the accuracy of the forecasts. Transparency makes it hard to argue that the game was rigged against particular schools of thought. Rigor makes it hard for losing forecasters to insist that, although they might appear to have been wrong, they were—in some more profound sense—right. We rely exclusively in this chapter—and largely in later chapters—on aggregate indices of accuracy logically derived from thousands of predictions.

  Some readers may, however, still wonder exactly how we can assess the accuracy of probability judgments of unique events. The answer is via the miracle of aggregation. Granted, the true probabilities of each outcome that did occur remain shrouded in mystery (we know only the value is not zero), and so too are the true probabilities of each outcome that failed to occur (we know only the value is not 1.0). However, if we collect enough predictions, we can still gauge the relative frequency with which outcomes assigned various probabilities do and do not occur. To take an extreme example, few would dispute that someone whose probability assignments to possible outcomes closely tracked the relative frequency of those outcomes (events assigned x percent likelihood occurred about x percent of the time) should be considered a better forecaster than someone whose probability assignments bore no relationship to the relative frequency of outcomes.
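
  A minimal sketch of that aggregation logic, in Python (illustrative code and numbers, not the study's analysis pipeline): group predictions by the probability the forecaster assigned and check how often the predicted outcomes actually occurred in each group.

      from collections import defaultdict

      def observed_frequency_by_stated_probability(forecasts, outcomes):
          """For each stated probability, return the fraction of the outcomes
          assigned that probability which actually occurred (1) or did not (0)."""
          buckets = defaultdict(list)
          for p, occurred in zip(forecasts, outcomes):
              buckets[round(p, 1)].append(occurred)  # bin to the nearest tenth
          return {p: sum(v) / len(v) for p, v in sorted(buckets.items())}

      # Outcomes assigned an 80 percent chance should occur roughly 80 percent
      # of the time; those assigned 20 percent, roughly 20 percent of the time.
      print(observed_frequency_by_stated_probability(
          [0.8, 0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2, 0.2],
          [1,   1,   1,   1,   0,   0,   0,   1,   0,   0]))
      # {0.2: 0.2, 0.8: 0.8}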

  The Technical Appendix details the procedures for computing the key measure of forecasting accuracy, the probability score, which is defined as the average deviation between the ex ante probabilities that experts assign possible futures and the ex post certainty values that the researchers assign those futures once we learn what did (1.0) or did not (0.0) happen.45 To get the best possible score, zero, one must be clairvoyant: assigning a probability of 1.0 to all things that subsequently happen and a probability of zero to all things that do not. To get the worst possible score, 1.0, one must be the opposite of clairvoyant, and infallibly declare impossible everything that later happens and declare inevitable everything that does not.
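
  To make the scoring rule concrete, here is a minimal sketch in Python of a probability score of this kind, treating the “average deviation” as the average squared gap, as in the Brier-style scores on which such indices are conventionally built. The function name and the numbers are illustrative, not the study’s own materials.

      def probability_score(forecasts, outcomes):
          """Average squared gap between ex ante probabilities (0.0-1.0) and
          ex post certainty values (1.0 if the outcome happened, 0.0 if not).
          0.0 is clairvoyance; 1.0 is infallible wrongness."""
          return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

      # A hedger who assigns .5 to everything scores 0.25 regardless of outcomes;
      # a well-informed, bolder forecaster scores much closer to zero.
      print(probability_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.25
      print(probability_score([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))  # 0.025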

  Probability scores, however, provide only crude indicators of how large the gaps are between subjective probabilities and objective reality. Answering more fine-grained questions requires decomposing probability scores into more precise indicators. Readers should be alerted to three curve ball complications:

  a. Are some forecasters achieving better (smaller) probability scores by playing it safe and assigning close-to-guessing probabilities? To explore this possibility, we need to break probability scores into two component indicators—calibration and discrimination—that are often posited to be in a trade-off relationship. The calibration index taps the degree to which subjective probabilities are aligned with objective probabilities. Observers are perfectly calibrated when there is precise correspondence between subjective and objective probabilities (and thus the squared deviations sum to zero). Outcomes assigned 80 percent likelihoods happen about 80 percent of the time, those assigned a 70 percent likelihood happen about 70 percent of the time, and so on. The discrimination index taps forecasters’ ability to do better than a simple predict-the-base-rate strategy. Observers get perfect discrimination scores when they infallibly assign probabilities of 1.0 to things that happen and probabilities of zero to things that do not.

  To maximize calibration, it often pays to be cautious and assign probabilities close to the base rates; to maximize discrimination, it often pays to be bold and assign extreme probabilities. The first panel of figure 2.3 shows how a playing-it-safe strategy—assigning probabilities that never stray from the midpoint values of .4, .5, and .6—can produce excellent (small) calibration scores but poor (small) discrimination scores. The second and third panels show how it is possible to achieve both excellent calibration and discrimination scores. Doing so, though, does require skill: mapping probability values onto variation in real-world outcomes.

  Figure 2.3. It is possible to be perfectly calibrated but achieve a wide range of discrimination scores: poor (the fence-sitting strategy), good (using a broad range of values correctly), and perfect (using only the most extreme values correctly).
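
  To see the pattern in figure 2.3 in miniature, here is a sketch in Python of one conventional way of computing calibration and discrimination indices, binning forecasts by the stated probability. The numbers are invented for illustration: both forecasters below are perfectly calibrated, but only the bold one discriminates.

      from collections import defaultdict

      def calibration_and_discrimination(forecasts, outcomes):
          """CI: average squared gap between each stated probability and the
          observed frequency of outcomes given that probability (smaller is better).
          DI: variance of those observed frequencies around the overall base rate
          (larger is better). Assumes forecasts come from a discrete scale."""
          n = len(forecasts)
          bins = defaultdict(list)
          for f, o in zip(forecasts, outcomes):
              bins[f].append(o)
          base_rate = sum(outcomes) / n
          ci = sum(len(v) * (f - sum(v) / len(v)) ** 2 for f, v in bins.items()) / n
          di = sum(len(v) * (sum(v) / len(v) - base_rate) ** 2 for v in bins.values()) / n
          return ci, di

      outcomes = [1, 0] * 5          # ten outcomes, 50 percent base rate
      fence_sitter = [0.5] * 10      # never strays from the midpoint
      bold = [1.0, 0.0] * 5          # extreme values, used correctly

      print(calibration_and_discrimination(fence_sitter, outcomes))  # (0.0, 0.0)
      print(calibration_and_discrimination(bold, outcomes))          # (0.0, 0.25)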

  b. Did some forecasters do better merely because they were dealt easier tasks? Probability scores can be inflated either because experts were error prone or because the task was hard. Distinguishing these alternatives requires statistical procedures for estimating (1) task difficulty (tasks are easy to the degree that either there is little variance in outcomes—say, predicting rain in Phoenix—or there is variance that can be captured by simple statistical models—say, predicting seasonal variation in temperature in Toronto); and (2) the degree to which the observed variation in performance can be attributed to variation in skill rather than task difficulty.
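
  One common way to make that adjustment, sketched in Python below, is to benchmark each forecaster’s probability score against the score a reference strategy (say, predicting the base rate) would have earned on exactly the same questions. This is the general idea of a skill score, not necessarily the exact procedure in the Technical Appendix, and the numbers are illustrative.

      def skill_score(forecaster_ps, reference_ps):
          """Proportional improvement over a reference strategy scored on the
          same questions: 1.0 is perfect, 0.0 is no better than the reference,
          and negative values are worse than the reference."""
          return 1.0 - forecaster_ps / reference_ps

      # The same raw probability score of 0.15 means different things on hard
      # questions (reference scores 0.25) and easy ones (reference scores 0.16).
      print(skill_score(0.15, 0.25))  # 0.40   -> mostly skill
      print(skill_score(0.15, 0.16))  # 0.0625 -> mostly an easy task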

  c. Are some forecasters getting worse probability scores because they are willing to make many errors of one type to avoid even a few of another? Forecasters can obtain bad scores because they either overpredict (assign high probabilities to things that never happen) or underpredict (assign low probabilities to things that do happen). What looks initially like a mistake might, however, be a reflection of policy priorities. Experts sometimes insisted, for example, that it is prudent to exaggerate the likelihood of change for the worse, even at a steep cost in false alarms. Assessing such claims requires value-adjusting probability scores in ways that give experts varying benefits of the doubt when they under- or overestimate particular outcomes.
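
  A minimal sketch of such value adjustment in Python (the weights, names, and numbers are illustrative, not the study’s actual adjustment scheme): weight misses and false alarms asymmetrically, so that an expert who deliberately over-warns about change for the worse is not scored as though every false alarm were as costly as a miss.

      def value_adjusted_score(forecasts, outcomes, miss_weight=1.0, false_alarm_weight=1.0):
          """Probability score with asymmetric penalties: underpredicting events
          that happen (misses) and overpredicting events that never happen
          (false alarms) can be weighted differently."""
          total = 0.0
          for f, o in zip(forecasts, outcomes):
              weight = miss_weight if o == 1 else false_alarm_weight
              total += weight * (f - o) ** 2
          return total / len(forecasts)

      crises = [1, 1, 0, 0]
      over_warner = [0.8, 0.8, 0.8, 0.8]    # false alarms, but no missed crises
      under_warner = [0.2, 0.2, 0.2, 0.2]   # no false alarms, but misses both crises

      # With symmetric weights the two earn identical scores...
      print(value_adjusted_score(over_warner, crises))             # 0.34
      print(value_adjusted_score(under_warner, crises))            # 0.34
      # ...but if a miss is judged twice as costly as a false alarm, over-warning pays.
      print(value_adjusted_score(over_warner, crises, 2.0, 1.0))   # 0.36
      print(value_adjusted_score(under_warner, crises, 2.0, 1.0))  # 0.66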

  THE EVIDENCE

  The methodological stage has been set for testing the core tenets of radical skepticism. Figure 2.4 presents the average calibration and discrimination scores for 27,451 forecasts, broken down into experts’ versus dilettantes’ predictions of the shorter- or longer-term futures of the domestic-political, economic, and national-security policies of countries in either the zone of turbulence or the zone of stability. A calibration score of .01 indicates that forecasters’ subjective probabilities diverged from objective frequencies, on average, by about 10 percent; a score of .04, an average gap of about 20 percent. A discrimination score of .01 indicates that forecasters, on average, predicted about 6 percent of the total variation in outcomes; a score of .04, that they captured about 24 percent.
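
  For readers who want the arithmetic behind those interpretations spelled out: if the calibration index is a mean squared gap, the typical gap is its square root, and if the proportion of variation predicted is the discrimination index taken relative to the variability of the outcomes themselves, the reported percentages follow. (The variability value of roughly .167 below is implied by the percentages in the text rather than stated there.)

      \sqrt{\mathrm{CI}}:\quad \sqrt{.01} = .10 \;(\approx 10\ \text{percent average gap}),
      \qquad \sqrt{.04} = .20 \;(\approx 20\ \text{percent average gap})

      \frac{\mathrm{DI}}{\mathrm{VI}}:\quad \frac{.01}{.167} \approx .06 \;(\approx 6\ \text{percent of the variation}),
      \qquad \frac{.04}{.167} \approx .24 \;(\approx 24\ \text{percent})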

  The Debunking Hypotheses: Humanity versus Algorithms of Varying Sophistication

  Figure 2.5 plots the average calibration and discrimination of human forecasters, of four forms of mindless competition, and of the sophisticated statistical competition that drew on autoregressive distributed lag models. The mindless competition comprised (a) the chimp strategy of assigning equal probabilities; (b) the expansive and restrictive base-rate strategies of assigning probabilities corresponding to the frequency of outcomes in the five-year periods preceding the forecasting periods in which we assess human performance (with values derived either from the entire dataset or from restricted subsets of cases such as the former Soviet bloc); (c) the contemporary-base-rate strategy of assigning probabilities corresponding to the frequency of outcomes in the actual forecasting periods; and (d) the cautious and aggressive case-specific strategies of assigning either high or very high probabilities to the hypothesis that the most recent trend for a specific country will persist.
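
  A sketch in Python of how such mindless benchmarks generate probabilities (the function names and the .8 confidence level are illustrative; the study’s exact parameter settings are not reproduced here):

      def chimp(possible_futures):
          """Assign equal probability to every possible future."""
          return {f: 1.0 / len(possible_futures) for f in possible_futures}

      def base_rate(historical_outcomes, possible_futures):
          """Assign each future its frequency in a chosen historical window
          (expansive: the whole dataset; restrictive: a subset such as the
          former Soviet bloc; contemporary: the forecasting period itself)."""
          n = len(historical_outcomes)
          return {f: historical_outcomes.count(f) / n for f in possible_futures}

      def case_specific(most_recent_trend, possible_futures, confidence=0.8):
          """Bet that the most recent trend persists; cautious and aggressive
          variants differ only in how high `confidence` is set."""
          rest = (1.0 - confidence) / (len(possible_futures) - 1)
          return {f: confidence if f == most_recent_trend else rest
                  for f in possible_futures}

      futures = ["status quo", "change for the better", "change for the worse"]
      print(chimp(futures))                        # roughly 0.33 apiece
      print(case_specific("status quo", futures))  # persistence bet: roughly .8/.1/.1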

  Figure 2.4. The calibration (CI) and discrimination (DI) scores of all subjective probability judgments entered into later data analyses. The first-tier breakdown is into judgments made by experts versus dilettantes; the second-tier breakdown is into shorter- versus longer-term predictions; the third-tier breakdown is into predictions for states in the zone of stability versus the zone of turbulence; and the final breakdown (into three boxes nested under each third-tier label) is into predictions for domestic politics (e.g., who is in charge?), government policy and performance (e.g., what are spending priorities and how well is the economy doing?), and national security (e.g., has there been violent conflict inside or outside state borders?).

  Figure 2.5. The calibration and discrimination scores achieved by human forecasters (experts and dilettantes), by their mindless competition (chimp random guessing, restrictive and expansive base-rate extrapolation, and cautious and aggressive case-specific extrapolation algorithms), and by the sophisticated statistical competition. Each curve represents a set of equal-weighting calibration-discrimination trade-offs that hold overall forecasting skill (probability score) constant. Higher curves represent improving overall performance.

 
