
The Economics of Artificial Intelligence


by Ajay Agrawal


  Both QRE and CH/level-k theories extend equilibrium theory by adding parsimonious, precise specifications of departures from either optimization (QRE) or rationality of beliefs (CH/level-k) using a small number of behavioral parameters. The question that is asked is: Can we add predictive power in a simple, psychologically plausible9 way using these parameters?
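
  For concreteness, each specification can be summarized by a single behavioral parameter (this is a standard textbook rendering, not a formula quoted from this chapter). In logit QRE, player i's choice probabilities are

  \[ \sigma_i(a) \;=\; \frac{\exp\{\lambda\, u_i(a,\sigma_{-i})\}}{\sum_{a'}\exp\{\lambda\, u_i(a',\sigma_{-i})\}}, \]

  so the precision parameter \(\lambda\) indexes the departure from exact optimization (as \(\lambda \to \infty\), play approaches Nash equilibrium). In the Poisson cognitive hierarchy model, the population share of level-k thinkers is

  \[ f(k) \;=\; \frac{e^{-\tau}\,\tau^{k}}{k!}, \]

  and each level-k player best responds to the normalized distribution of levels below k, so the single parameter \(\tau\) indexes the departure from equilibrium beliefs.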

  A more general question is: Are there structural features of payoffs and strategies that can predict even more accurately than QRE or CH/level-k? If the answer is "Yes," then the new theories, even if they are improvements, have a long way to go.

  8. For example, see Goldfarb and Xiao 2011, Östling et al. 2011, and Hortacsu et al. 2017.

  9. In the case of CH/level-k theories, direct measures of visual attention from Mouselab and eyetracking have been used to test the theories using a combination of choices and visual attention data. See Costa-Gomes, Crawford, and Broseta 2001; Wang, Spezio, and Camerer 2010; and Brocas et al. 2014. Eyetracking and mouse-based methods provide huge data sets. These previous studies heavily filter (or dimension-reduce) those data based on theory that requires consistency between choices and attention to information necessary to execute the value computation underlying the choice (Costa-Gomes, Crawford, and Broseta 2001; Costa-Gomes and Crawford 2006). Another approach that has never been tried is to use ML to select features from the huge feature set, combining choices and visual attention, to see which features predict best.

  Two recent research streams have made important steps in this direction. Using methods familiar in computer science, Wright and Leyton-Brown (2014) create a "meta-model" that combines payoff features to predict what the nonstrategic "level 0" players seem to do, in six sets of two-player 3 × 3 normal form games. This is a substantial improvement on previous specifications, which typically assume random behavior or some simple action based on salient information.10
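
  As a rough illustration of what a payoff-feature "meta-model" of level-0 behavior can look like, the sketch below scores each action by a few salient payoff features and normalizes the scores into a choice distribution. The particular features and weights are assumptions for illustration, not the ones Wright and Leyton-Brown estimate.

```python
import numpy as np

def level0_distribution(U_row, weights):
    """Feature-weighted level-0 sketch: score each of the row player's three
    actions by a few salient payoff features, then normalize the weighted
    scores into a predicted choice distribution."""
    features = np.stack([
        (U_row.max(axis=1) == U_row.max()).astype(float),                # has the game's single highest payoff
        (U_row.min(axis=1) == U_row.min(axis=1).max()).astype(float),    # "maxmin": best worst-case payoff
        (U_row.mean(axis=1) == U_row.mean(axis=1).max()).astype(float),  # best payoff on average
    ])
    scores = weights @ features + 1e-9   # tiny constant keeps the distribution well defined
    return scores / scores.sum()

U = np.array([[90, 10, 10], [40, 45, 50], [30, 80, 35]], dtype=float)
print(level0_distribution(U, weights=np.array([0.5, 0.3, 0.2])))  # here: [0.5, 0.3, 0.2]
```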

  Hartford, Wright, and Leyton-Brown (2016) go further, using deep learning neural networks (NNs) to predict human choices on the same six data sets. The NNs are able to outpredict CH models in the hold-out test sample in many cases. Importantly, even models in which there is no hierarchical iteration of strategic thinking ("layers of action response" in their approach) can fit well. This result, while preliminary, indicates that prediction purely from hidden layers of structural features can be successful.
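
  The sketch below is only a toy version of the idea (a single hidden layer over raw payoffs, mapped to a softmax over the three actions); the architecture in Hartford, Wright, and Leyton-Brown (2016) is more structured, and the layer sizes, initialization, and input encoding here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(params, U_row, U_col):
    """Forward pass: flatten both 3x3 payoff matrices, pass them through one
    hidden 'feature' layer, and return a distribution over the row player's actions."""
    W1, b1, W2, b2 = params
    x = np.concatenate([U_row.ravel(), U_col.ravel()])   # 18 raw payoff inputs
    h = np.tanh(W1 @ x + b1)                             # hidden layer of learned features
    return softmax(W2 @ h + b2)                          # predicted choice probabilities

# Random toy game and randomly initialized parameters, just to show the shapes involved.
U_row = rng.integers(0, 100, (3, 3)).astype(float)
U_col = rng.integers(0, 100, (3, 3)).astype(float)
params = (rng.normal(0, 0.1, (16, 18)), np.zeros(16),
          rng.normal(0, 0.1, (3, 16)), np.zeros(3))
print(predict(params, U_row, U_col))   # an arbitrary (untrained) prediction over the three actions
```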

  Coming from behavioral game theory, Fudenberg and Liang (2017) explore how well ML over structural properties of strategies can predict experimental choices. They use the six data sets from Wright and Leyton-Brown (2014) and also collect data on how MTurk subjects played 200 new 3 × 3 games with randomly drawn payoffs. Their ML approach uses eighty-eight features that are categorical structural properties of strategies (e.g., Is it part of a Nash equilibrium? Is the payoff never the worst for each choice by the other player?).
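
  As a sketch, two binary features of the kind just described can be computed straight from the payoff matrices; the exact coding below (including how ties are treated) is an illustrative assumption rather than Fudenberg and Liang's definition.

```python
import numpy as np

def in_pure_nash(U_row, U_col, a):
    """Is row action a part of some pure-strategy Nash equilibrium?"""
    for b in range(U_col.shape[1]):
        row_br = U_row[a, b] == U_row[:, b].max()   # a is a best response to column b
        col_br = U_col[a, b] == U_col[a, :].max()   # b is a best response to row a
        if row_br and col_br:
            return True
    return False

def never_worst(U_row, a):
    """Is action a's payoff never the worst among the row player's payoffs, for
    every column choice? (Ties ignored; with randomly drawn payoffs they are rare.)"""
    return all(U_row[a, b] > U_row[:, b].min() for b in range(U_row.shape[1]))

U_row = np.array([[90, 10, 10], [40, 45, 50], [30, 80, 35]], dtype=float)
U_col = np.array([[20, 55, 35], [60, 25, 70], [45, 65, 30]], dtype=float)
print([in_pure_nash(U_row, U_col, a) for a in range(3)],
      [never_worst(U_row, a) for a in range(3)])
```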

  The main analysis creates decision trees with k branching nodes (for k from 1 to 10) that predict whether a strategy will be played or not. The analysis uses tenfold cross-validation to guard against overfitting. As is common, the best-fitting trees are simple; there is a substantial improvement in fit going from k = 1 to k = 2, and then only small improvements for bushier trees. In the lab game data, the best k = 2 tree is simply what is called level 1 play in CH/level-k; it predicts the strategy that is a best response to uniform play by an opponent. That simple tree has a misclassification rate of 38.4 percent. The best k = 3 tree is only a little better (36.6 percent) and k = 5 is very slightly better (36.5 percent).

  10. Examples of nonrandom behavior by nonstrategic players include bidding one's private value in an auction (Crawford and Iriberri 2007) and reporting a private state honestly in a sender-receiver game (Wang, Spezio, and Camerer 2010; Crawford 2003).

  Table 24.1  Frequency of prediction errors of various theoretical and ML models for new data from random-payoff games (from Fudenberg and Liang 2017)

                                            Error     Completeness
  Naïve benchmark                           0.6667    0%
  Uniform Nash                              0.4722    51.21%
                                                      (0.0075)
  Poisson cognitive hierarchy model         0.3159    92.36%
                                                      (0.0217)
  Prediction rule based on game features    0.2984    96.97%
                                                      (0.0095)
  "Best possible"                           0.2869    100%

  Note: Standard errors in parentheses.

  The model classifies rather well, but the ML feature-based models do a little better. Table 24.1 summarizes results for their new random games. The classification by Poisson cognitive hierarchy (PCH) is 92 percent of the way from random to "best possible" (using the overall distribution of actual play) in this analysis. The ML feature model is almost perfect (97 percent).
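
  The completeness figures in table 24.1 can be reproduced from the error column, assuming the Fudenberg and Liang definition of completeness as the fraction of the achievable reduction in error (relative to the naive benchmark) that a model delivers:

  \[ \text{completeness} \;=\; \frac{e_{\text{naive}} - e_{\text{model}}}{e_{\text{naive}} - e_{\text{best}}}, \qquad \text{e.g.}\quad \frac{0.6667 - 0.3159}{0.6667 - 0.2869} \;=\; \frac{0.3508}{0.3798} \;\approx\; 0.9236, \]

  matching the 92.36 percent reported for the Poisson cognitive hierarchy model (and, analogously, 51.21 percent for uniform Nash and 96.97 percent for the feature-based rule).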

  Other analyses show less impressive performance for PCH, although it can be improved substantially by adding risk aversion, and also by trying to predict different data set-specific τ values.

  Note that the FL "best possible" measure is the same as the "clairvoyant" model upper bound used by Camerer, Ho, and Chong (2004). Given a data set of actual human behavior, and assuming that subjects are playing people chosen at random from that set, the best they can do is to have somehow accurately guessed what those data would be and chosen accordingly.11 (The term "clairvoyant" is used to note that this upper bound is unlikely to be reached except by sheer lucky guessing, but if a person repeatedly chooses near the bound it implies they have an intuitive mental model of how others choose, which is quite accurate.)
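
  A compact sketch of the two closely related bounds as described here (function names and inputs are illustrative):

```python
import numpy as np

def best_possible_error(action_freqs):
    """FL-style 'best possible' prediction error: always predict each game's modal
    action, so the error in a game is one minus the modal action's frequency.
    action_freqs: one length-3 array of observed choice frequencies per game."""
    return float(np.mean([1.0 - np.max(f) for f in action_freqs]))

def clairvoyant_value(U_row, opponent_freq):
    """CHC-style clairvoyant reward bound: the expected payoff from best responding
    to the actual distribution of the other players' choices."""
    return float((U_row @ opponent_freq).max())

freqs = [np.array([0.55, 0.30, 0.15]), np.array([0.20, 0.70, 0.10])]
U = np.array([[90, 10, 10], [40, 45, 50], [30, 80, 35]], dtype=float)
print(best_possible_error(freqs), clairvoyant_value(U, np.array([0.2, 0.5, 0.3])))
```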

  Camerer, Ho, and Chong (2004) went a step further by also computing the expected reward value from clairvoyant prediction and comparing it with how much subjects actually earn and how much they could have earned if they obeyed different theories. Using reward value as a metric is sensible because a theory could predict frequencies rather accurately, but might not generate a much higher reward value than highly inaccurate predictions (because of the "flat maximum" property).12 In five data sets they studied, Nash equilibrium added very little marginal value and the PCH approach added some value in three games and more than half the maximum achievable value in two games.

  11. In psychophysics and experimental psychology, the term "ideal observer" model is used to refer to a performance benchmark closely related to what we called the clairvoyant upper bound.

  12. This property was referred to as the "flat maximum" by von Winterfeldt and Edwards (1973). It came to prominence much later in experimental economics when it was noted that theories could badly predict, say, a distribution of choices in a zero-sum game, but such an inaccurate theory might not yield much lower earnings than an ideal theory.

  24.3 Human Prediction as Imperfect Machine Learning

  24.3.1 Some Pre-History of Judgment Research and Behavioral Economics

  Behavioral economics as we know it and describe it nowadays began to thrive when challenges to simple rationality principles (then called "anomalies") came to have rugged empirical status and to point to natural improvements in theory. It was common in those early days to distinguish anomalies about "preferences," such as mental accounting violations of fungibility and reference-dependence, and anomalies about "judgment" of likelihoods and quantities.

  Somewhat hidden from economists, at that time and even now, was the fact that there was active research in many areas of judgment and decision-making (JDM). The JDM research proceeded in parallel with the emergence of behavioral economics. It was conducted almost entirely in psychology departments and some business schools, and rarely published in economics journals. The annual meeting of the S/JDM society was, for logistical efficiency, held as a satellite meeting of the Psychonomic Society (which weighted attendance toward mathematical experimental psychology).

  The JDM research was about general approaches to understanding judgment processes, including "anomalies" relative to logically normative benchmarks. This research flourished because there was a healthy respect for simple mathematical models and careful testing, which enabled regularities to cumulate and gave reasons to dismiss weak results. The research community also had one foot in practical domains (such as judgments of natural risks, medical decision-making, law, etc.) so that generalizability of lab results was always implicitly addressed.

  The central ongoing debate in JDM from the 1970s on was about the cognitive processes involved in actual decisions, and the quality of those predictions. There were plenty of careful lab experiments about such phenomena, but also an earlier literature on what was then called "clinical versus statistical prediction." There lies the earliest comparison between primitive forms of ML and the important JDM piece of behavioral economics (see Lewis 2016). Many of the important contributions from this fertile period were included in the Kahneman, Slovic, and Tversky (1982) edited volume (which in the old days was called the "blue-green bible").

  Paul Meehl's (1954) compact book started it all. Meehl was a remarkable character. He was a rare example, at the time, of a working clinical psychiatrist who was also interested in statistics and evidence (as were others at Minnesota). Meehl had a picture of Freud in his office, and practiced clinically for fifty years in the Veteran's Administration.


  Meehl's mother had died when he was sixteen, under circumstances which apparently made him suspicious of how much doctors actually knew about how to make sick people well.

  His book could be read as pursuit of such a suspicion scientifically: he collected all the studies he could find (there were twenty-two) that compared a set of clinical judgments with actual outcomes, and with simple linear models using observable predictors (some objective and some subjectively estimated).

  Meehl's idea was that these statistical models could be used as a benchmark to evaluate clinicians. As Dawes and Corrigan (1974, 97) wrote, "the statistical analysis was thought to provide a floor to which the judgment of the experienced clinician could be compared. The floor turned out to be a ceiling."

  In every case the statistical model outpredicted or tied the judgment accuracy of the average clinician. A later meta-analysis of 117 studies (Grove et al. 2000) found only six in which clinicians, on average, were more accurate than models (and see Dawes, Faust, and Meehl 1989).

  It is possible that in any one domain, the distribution of clinicians contains some stars who could predict much more accurately. However, later studies at the individual level showed that only a minority of clinicians were more accurate than statistical models (e.g., Goldberg 1968, 1970). Kleinberg et al.'s (2017) study of machine-learned and judicial detention decisions is a modern example of the same theme.

  In the decades after Meehl's book was published, evidence began to mount about why clinical judgment could be so imperfect. A common theme was that clinicians were good at measuring particular variables, or suggesting which objective variables to include, but were not so good at combining them consistently (e.g., Sawyer 1966). In a recollection, Meehl (1986, 373) gave a succinct description of this theme:

    Why should people have been so surprised by the empirical results in my summary chapter? Surely we all know that the human brain is poor at weighting and computing. When you check out at a supermarket, you don't eyeball the heap of purchases and say to the clerk, "Well it looks to me as if it's about $17.00 worth; what do you think?" The clerk adds it up. There are no strong arguments, from the armchair or from empirical studies of cognitive psychology, for believing that human beings can assign optimal weights in equations subjectively or that they apply their own weights consistently, the query from which Lew Goldberg derived such fascinating and fundamental results.

  Some other important findings emerged. One drawback of the statistical prediction approach, for practice, was that it requires large samples of high-quality outcome data (in more modern AI language, prediction required labeled data). There were rarely many such data available at the time.

  Dawes (1979) proposed to give up on estimating variable weights through a criterion-optimizing "proper" procedure like ordinary least squares (OLS),13 using "improper" weights instead. An example is equal-weighting of standardized variables, which is often a very good approximation to OLS weighting (Einhorn and Hogarth 1975).
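
  A small simulation makes the point concrete; the data-generating process below is an assumption chosen only to compare "proper" OLS weights with "improper" unit weights on standardized predictors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated cues and outcome (an assumed data-generating process, for illustration only).
n, k = 200, 4
X = rng.normal(size=(n, k))
y = X @ np.array([0.5, 0.4, 0.3, 0.2]) + rng.normal(scale=1.0, size=n)

Xz = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize the predictors

# "Proper" weights: ordinary least squares.
beta = np.linalg.lstsq(np.c_[np.ones(n), Xz], y, rcond=None)[0]
ols_pred = np.c_[np.ones(n), Xz] @ beta

# "Improper" weights: equal (unit) weights on the standardized predictors.
equal_pred = Xz.sum(axis=1)

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print(round(corr(ols_pred, y), 3), round(corr(equal_pred, y), 3))  # typically very close
```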

  An interesting example of improper weights is what Dawes called "bootstrapping" (a completely distinct usage from the concept in statistics of bootstrap resampling). Dawes's idea was to regress clinical judgments on predictors, and use those estimated weights to make predictions. This is equivalent, of course, to using the predicted part of the clinical-judgment regression and discarding (or regularizing to zero, if you will) the residual. If the residual is mostly noise then correlation accuracies can be improved by this procedure, and they typically are (e.g., Camerer 1981a).
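
  A sketch of bootstrapping in Dawes's sense, under the reading above (regress the judge on the cues and keep only the fitted part):

```python
import numpy as np

def bootstrap_the_judge(X, judgments):
    """Regress a judge's ratings on the observable cues and return the fitted values:
    a linear 'model of the judge' that discards the (mostly noisy) residual."""
    Xc = np.c_[np.ones(len(X)), np.asarray(X, dtype=float)]
    w = np.linalg.lstsq(Xc, np.asarray(judgments, dtype=float), rcond=None)[0]
    return Xc @ w

# Toy usage: two cues for five cases and the judge's holistic rating of each case.
X = np.array([[1, 3], [2, 1], [3, 4], [4, 2], [5, 5]], dtype=float)
judgments = np.array([2.0, 2.5, 4.5, 4.0, 6.5])
print(bootstrap_the_judge(X, judgments))
```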

  Later studies indicated a slightly more optimistic picture for the clinicians. If bootstrap-regression residuals are pure noise, they will also lower the test-retest reliability of clinical judgment (i.e., the correlation between two judgments on the same cases made by the same person). However, analysis of the few studies that report both test-retest reliability and bootstrapping regressions indicates that only about 40 percent of the residual variance is unreliable noise (Camerer 1981b). Thus, residuals do contain reliable subjective information (though it may be uncorrelated with outcomes). Blattberg and Hoch (1990) later found that for actual managerial forecasts of product sales and coupon redemption rates, residuals are correlated about .30 with outcomes. As a result, averaging statistical model forecasts and managerial judgments improved prediction substantially over statistical models alone.
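
  A minimal sketch of that kind of 50/50 combination (the equal weighting follows the spirit of Blattberg and Hoch's approach; the numbers below are placeholders):

```python
import numpy as np

def combine(model_forecast, judge_forecast, w=0.5):
    """Weighted average of a statistical forecast and a judgmental forecast,
    both expressed on the same scale (e.g., predicted unit sales)."""
    model_forecast = np.asarray(model_forecast, dtype=float)
    judge_forecast = np.asarray(judge_forecast, dtype=float)
    return w * model_forecast + (1.0 - w) * judge_forecast

print(combine([120.0, 95.0, 60.0], [100.0, 110.0, 70.0]))   # -> [110. 102.5 65.]
```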

  24.3.2 Sparsity Is Good for You but Tastes Bad

  Besides the then-startling finding that human judgment did reliably worse than statistical models, a key feature of the early results was how well small numbers of variables could fit. Some of this conclusion was constrained by the fact that there were not huge feature sets with truly large numbers of variables in any case (so you couldn't possibly know, at that time, if "large numbers of variables fit surprisingly better" than small numbers).

  A striking example in Dawes (1979) is a two-variable model predicting marital happiness: the rate of lovemaking minus the rate of fighting. He reports correlations of .40 and .81 in two studies (Edwards and Edwards 1977; Thornton 1977).14

  In another, more famous example, Dawes (1971) did a study about admitting students to the University of Oregon PhD program in psychology from 1964 to 1967. He compared and measured each applicant's GRE, undergraduate GPA, and the quality of the applicant's undergraduate school. The variables were standardized, then weighted equally. The outcome variable was faculty ratings in 1969 of how well the students they had admitted succeeded. (Obviously, the selection effect here makes the entire analysis much less than ideal, but tracking down rejected applicants and measuring their success by 1969 was basically impossible at the time.)

  13. Presciently, Dawes also mentions using ridge regression as a proper procedure to maximize out-of-sample fit.

  14. More recent analyses using transcribed verbal interactions generate correlations for divorce and marital satisfaction around .6–.7. The core variables are called the "four horsemen" of criticism, defensiveness, contempt, and "stonewalling" (listener withdrawal).

 
