Both QRE and CH/level-k theories extend equilibrium theory by adding parsimonious, precise specifications of departures from either optimization (QRE) or rationality of beliefs (CH/level-k), using a small number of behavioral parameters. The question asked is: Can we add predictive power in a simple, psychologically plausible9 way using these parameters?
8. For example, see Goldfarb and Xiao 2011; Östling et al. 2011; and Hortaçsu et al. 2017.
9. In the case of CH/level-k theories, direct measures of visual attention from Mouselab and eyetracking have been used to test the theories using a combination of choices and visual-attention data. See Costa-Gomes, Crawford, and Broseta 2001; Wang, Spezio, and Camerer 2010; and Brocas et al. 2014. Eyetracking and mouse-based methods provide huge data sets. These previous studies heavily filter (or dimension-reduce) those data based on theory that requires consistency between choices and attention to the information necessary to execute the value computation underlying the choice (Costa-Gomes, Crawford, and Broseta 2001; Costa-Gomes and Crawford 2006). Another approach that has never been tried is to use ML to select features from the huge feature set, combining choices and visual attention, to see which features predict best.

A more general question is: Are there structural features of payoffs and strategies that can predict even more accurately than QRE or CH/level-k? If the answer is "Yes," then the new theories, even if they are improvements, have a long way to go.
Two recent research streams have made important steps in this direction. Using methods familiar in computer science, Wright and Leyton-Brown (2014) create a "meta-model" that combines payoff features to predict what the nonstrategic "level 0" players seem to do, in six sets of two-player 3 × 3 normal-form games. This is a substantial improvement on previous specifications, which typically assume random behavior or some simple action based on salient information.10

10. Examples of nonrandom behavior by nonstrategic players include bidding one's private value in an auction (Crawford and Iriberri 2007) and reporting a private state honestly in a sender-receiver game (Wang, Spezio, and Camerer 2010; Crawford 2003).
Hartford, Wright, and Leyton-Brown (2016) go further, using deep-learning neural networks (NNs) to predict human choices on the same six data sets. The NNs are able to outpredict CH models in the hold-out test sample in many cases. Importantly, even models in which there is no hierarchical iteration of strategic thinking ("layers of action response" in their approach) can fit well. This result, while preliminary, indicates that prediction purely from hidden layers of structural features can be successful.
Coming from behavioral game theory, Fudenberg and Liang (2017) explore how well ML over structural properties of strategies can predict experimental choices. They use the six data sets from Wright and Leyton-Brown (2014) and also collect data on how MTurk subjects played 200 new 3 × 3 games with randomly drawn payoffs. Their ML approach uses eighty-eight features that are categorical structural properties of strategies (e.g., Is it part of a Nash equilibrium? Is the payoff never the worst for each choice by the other player?).
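To make this kind of feature concrete, here is a minimal sketch of how two such binary properties could be computed for the row player of a 3 × 3 game. The payoff matrices and the particular feature definitions are illustrative assumptions, not Fudenberg and Liang's actual feature set or code.

    import numpy as np

    # Payoffs for a 3 x 3 game; entry [i, j] is the payoff when the row player
    # picks i and the column player picks j. (Illustrative numbers only.)
    U_row = np.array([[3, 0, 5],
                      [2, 2, 2],
                      [0, 4, 1]])
    U_col = np.array([[1, 4, 0],
                      [3, 1, 2],
                      [2, 0, 5]])

    def in_pure_nash(i):
        """Is row strategy i part of some pure-strategy Nash equilibrium?"""
        for j in range(3):
            row_best = U_row[:, j].max() == U_row[i, j]   # i is a best response to j
            col_best = U_col[i, :].max() == U_col[i, j]   # j is a best response to i
            if row_best and col_best:
                return True
        return False

    def never_worst(i):
        """Is row strategy i's payoff never the worst, whatever the column plays?
        (Ties with the column-wise minimum count as 'worst' here.)"""
        return all(U_row[i, j] > U_row[:, j].min() for j in range(3))

    features = {i: (in_pure_nash(i), never_worst(i)) for i in range(3)}
    print(features)

Each strategy in each game then becomes one observation described by a vector of such binary features, with the label being whether subjects actually played it.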
The main analysis creates decision trees with k branching nodes (for k
from 1 to 10) that predict whether a strategy will be played or not. Analysis
uses tenfold test validation to guard against overfi tting. As is common, the
best- fi tting trees are simple; there is a substantial improvement in fi t going
from k = 1 to k = 2, and then only small improvements for bushier trees. In the lab game data, the best k = 2 tree is simply what is called level 1 play in CH/ level- k; it predicts the strategy that is a best response to uniform play
by an opponent. That simple tree has a misclassifi cation rate of 38.4 per-
cent. The best k = 3 tree is only a little better (36.6 percent) and k = 5 is very slightly better (36.5 percent).
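A rough sketch of this procedure, assuming a placeholder feature matrix X of binary structural properties for each game-strategy pair and labels y marking whether the strategy was played (random placeholders below, not the Fudenberg-Liang data): in scikit-learn, a tree with k branching (internal) nodes has k + 1 leaves, so tree size can be capped with max_leaf_nodes and each size scored by tenfold cross-validation.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data: one row per (game, strategy) pair.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(600, 10))   # binary structural features
    y = rng.integers(0, 2, size=600)         # 1 if the strategy was played

    # A binary tree with k internal (branching) nodes has k + 1 leaves.
    for k in (1, 2, 3, 5, 10):
        tree = DecisionTreeClassifier(max_leaf_nodes=k + 1, random_state=0)
        acc = cross_val_score(tree, X, y, cv=10, scoring="accuracy").mean()
        print(f"k = {k:2d} branching nodes: misclassification = {1 - acc:.3f}")

With random placeholder labels the scores simply hover near chance; the point is only the mechanics of limiting tree size and cross-validating, not the reported results.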
The simple tree model classifies rather well, but the ML feature-based models do a little better. Table 24.1 summarizes results for their new random games.
Table 24.1    Frequency of prediction errors of various theoretical and ML models for new data from random-payoff games (from Fudenberg and Liang 2017)

                                              Error      Completeness
Naïve benchmark                               0.6667          0%
Uniform Nash                                  0.4722         51.21%
                                                             (0.0075)
Poisson cognitive hierarchy model             0.3159         92.36%
                                                             (0.0217)
Prediction rule based on game features        0.2984         96.97%
                                                             (0.0095)
"Best possible"                               0.2869        100%
The classification by Poisson cognitive hierarchy (PCH) is 92 percent of the way from random to "best possible" (using the overall distribution of actual play) in this analysis. The ML feature model is almost perfect (97 percent).
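The completeness column can be reproduced from the error rates alone: it measures how far a model's error moves from the naive benchmark toward the best-possible error. A quick check in Python, using the values in Table 24.1:

    # Completeness = (naive error - model error) / (naive error - best-possible error)
    naive, best = 0.6667, 0.2869
    errors = {
        "Uniform Nash": 0.4722,
        "Poisson cognitive hierarchy": 0.3159,
        "Game-feature prediction rule": 0.2984,
    }
    for name, e in errors.items():
        completeness = (naive - e) / (naive - best)
        print(f"{name}: {completeness:.2%}")
    # Prints roughly 51.21%, 92.36%, and 96.97%, matching the table.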
Other analyses show less impressive performance for PCH, although it can be improved substantially by adding risk aversion, and also by trying to predict different data set-specific τ values.
Note that the FL "best possible" measure is the same as the "clairvoyant" model upper bound used by Camerer, Ho, and Chong (2004). Given a data set of actual human behavior, and assuming that subjects are playing people chosen at random from that set, the best they can do is to have somehow accurately guessed what those data would be and chosen accordingly.11 (The term "clairvoyant" is used to note that this upper bound is unlikely to be reached except by sheer lucky guessing, but if a person repeatedly chooses near the bound it implies they have an intuitive mental model of how others choose that is quite accurate.)

11. In psychophysics and experimental psychology, the term "ideal observer" model is used to refer to a performance benchmark closely related to what we called the clairvoyant upper bound.
Camerer, Ho, and Chong (2004) went a step further by also computing the expected reward value from clairvoyant prediction and comparing it with how much subjects actually earn and how much they could have earned if they had obeyed different theories. Using reward value as a metric is sensible because a theory could predict frequencies rather accurately but might not generate a much higher reward value than highly inaccurate predictions (because of the "flat maximum" property).12 In the five data sets they studied, Nash equilibrium added very little marginal value, and the PCH approach added some value in three games and more than half the maximum achievable value in two games.

12. This property was referred to as the "flat maximum" by von Winterfeldt and Edwards (1973). It came to prominence much later in experimental economics when it was noted that theories could badly predict, say, a distribution of choices in a zero-sum game, but such an inaccurate theory might not yield much lower earnings than an ideal theory.
24.3 Human Prediction as Imperfect Machine Learning
24.3.1 Some Pre-History of Judgment Research and Behavioral Economics
Behavioral economics as we know and describe it nowadays began to thrive when challenges to simple rationality principles (then called "anomalies") came to have rugged empirical status and to point to natural improvements in theory. It was common in those early days to distinguish anomalies about "preferences," such as mental-accounting violations of fungibility and reference-dependence, from anomalies about "judgment" of likelihoods and quantities.
Somewhat hidden from economists, at that time and even now, was the fact that there was active research in many areas of judgment and decision-making (JDM). The JDM research proceeded in parallel with the emergence of behavioral economics. It was conducted almost entirely in psychology departments and some business schools, and rarely published in economics journals. The annual meeting of the S/JDM society was, for logistical efficiency, held as a satellite meeting of the Psychonomic Society (which weighted attendance toward mathematical experimental psychology).
The JDM research was about general approaches to understanding judgment processes, including "anomalies" relative to logically normative benchmarks. This research flourished because there was a healthy respect for simple mathematical models and careful testing, which enabled regularities to cumulate and gave reasons to dismiss weak results. The research community also had one foot in practical domains (such as judgments of natural risks, medical decision-making, law, etc.), so that generalizability of lab results was always implicitly addressed.
The central ongoing debate in JDM from the 1970s on was about the cognitive processes involved in actual decisions, and the quality of those predictions. There were plenty of careful lab experiments about such phenomena, but also an earlier literature on what was then called "clinical versus statistical prediction." There lies the earliest comparison between primitive forms of ML and the important JDM piece of behavioral economics (see Lewis 2016). Many of the important contributions from this fertile period were included in the Kahneman, Slovic, and Tversky (1982) edited volume (which in the old days was called the "blue-green bible").
Paul Meehl’s (1954) compact book started it all. Meehl was a remarkable
character. He was a rare example, at the time, of a working clinical psychia-
trist who was also interested in statistics and evidence (as were others at Min-
nesota). Meehl had a picture of Freud in his offi
ce, and practiced clinically
for fi fty years in the Veteran’s Administration.
Meehl’s mother had died when he was sixteen, under circumstances which
apparently made him suspicious of how much doctors actually knew about
how to make sick people well.
His book could be read as the pursuit of that suspicion scientifically: he collected all the studies he could find (there were twenty-two) that compared a set of clinical judgments with actual outcomes and with simple linear models using observable predictors (some objective and some subjectively estimated).
Meehl’s idea was that these statistical models could be used as a bench-
mark to evaluate clinicians. As Dawes and Corrigan (1974, 97) wrote, “the
statistical analysis was thought to provide a fl oor to which the judgment of
the experienced clinician could be compared. The fl oor turned out to be a
ceiling.”
In every case the statistical model outpredicted or tied the judgment accuracy of the average clinician. A later meta-analysis of 117 studies (Grove et al. 2000) found only six in which clinicians, on average, were more accurate than models (and see Dawes, Faust, and Meehl 1989).
It is possible that in any one domain, the distribution of clinicians contains some stars who could predict much more accurately. However, later studies at the individual level showed that only a minority of clinicians were more accurate than statistical models (e.g., Goldberg 1968, 1970). Kleinberg et al.'s (2017) study of machine-learned and judicial detention decisions is a modern example of the same theme.
In the decades after Meehl's book was published, evidence began to mount about why clinical judgment could be so imperfect. A common theme was that clinicians were good at measuring particular variables, or suggesting which objective variables to include, but were not so good at combining them consistently (e.g., Sawyer 1966). In a recollection Meehl (1986, 373) gave a succinct description of this theme:

Why should people have been so surprised by the empirical results in my summary chapter? Surely we all know that the human brain is poor at weighting and computing. When you check out at a supermarket, you don't eyeball the heap of purchases and say to the clerk, "Well it looks to me as if it's about $17.00 worth; what do you think?" The clerk adds it up. There are no strong arguments, from the armchair or from empirical studies of cognitive psychology, for believing that human beings can assign optimal weights in equations subjectively or that they apply their own weights consistently, the query from which Lew Goldberg derived such fascinating and fundamental results.
Some other important findings emerged. One drawback of the statistical prediction approach, for practice, was that it requires large samples of high-quality outcome data (in more modern AI language, prediction required labeled data). There were rarely many such data available at the time.
Dawes (1979) proposed to give up on estimating variable weights through a criterion-optimizing "proper" procedure like ordinary least squares (OLS),13 using "improper" weights instead. An example is equal weighting of standardized variables, which is often a very good approximation to OLS weighting (Einhorn and Hogarth 1975).

13. Presciently, Dawes also mentions using ridge regression as a proper procedure to maximize out-of-sample fit.
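As a minimal illustration of the "improper" recipe on synthetic data (the predictor weights and noise level below are arbitrary assumptions, not any data set discussed here): standardize each predictor, add them with equal weights, and compare the result to an OLS fit.

    import numpy as np

    # Synthetic criterion y driven by three noisy predictors, all signed so that
    # they correlate positively with y (the usual convention for equal weighting).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=1.0, size=200)

    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each predictor
    improper = Z.sum(axis=1)                        # Dawes-style equal ("improper") weights

    A = np.column_stack([np.ones(200), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS ("proper") weights
    proper = A @ beta

    print("corr(equal weights, y):", np.corrcoef(improper, y)[0, 1].round(3))
    print("corr(OLS fit, y):      ", np.corrcoef(proper, y)[0, 1].round(3))

In runs like this the equal-weight composite typically comes close to the OLS fit, which is the Einhorn and Hogarth point.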
An interesting example of improper weights is what Dawes called "bootstrapping" (a completely distinct usage from the concept in statistics of bootstrap resampling). Dawes's idea was to regress clinical judgments on predictors and use those estimated weights to make predictions. This is equivalent, of course, to using the predicted part of the clinical-judgment regression and discarding (or regularizing to zero, if you will) the residual. If the residual is mostly noise, then correlation accuracies can be improved by this procedure, and they typically are (e.g., Camerer 1981a).
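A sketch of the bootstrapping procedure on synthetic data (the cue weights and noise levels are arbitrary assumptions): regress the judge's ratings on the cues and use the fitted values, rather than the raw judgments, as the prediction.

    import numpy as np

    # Dawes-style "bootstrapping" (no relation to bootstrap resampling):
    # fit a model OF the judge, then predict with its fitted values,
    # discarding the judge's residual.
    rng = np.random.default_rng(2)
    n = 300
    cues = rng.normal(size=(n, 4))                                       # observable predictors
    outcome = cues @ np.array([0.6, 0.4, 0.2, 0.0]) + rng.normal(size=n)
    judgment = cues @ np.array([0.5, 0.5, 0.1, 0.3]) + rng.normal(size=n)  # noisy, misweighted judge

    A = np.column_stack([np.ones(n), cues])
    w, *_ = np.linalg.lstsq(A, judgment, rcond=None)   # regression of judgments on cues
    model_of_judge = A @ w                             # keep only the systematic part

    print("judge vs outcome:         ", np.corrcoef(judgment, outcome)[0, 1].round(3))
    print("model of judge vs outcome:", np.corrcoef(model_of_judge, outcome)[0, 1].round(3))

Because the simulated judge's residual here is pure noise, the linearized "model of the judge" will typically correlate more highly with the outcome than the raw judgments do, which is the pattern the early studies found.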
Later studies indicated a slightly more optimistic picture for the clinicians. If bootstrap-regression residuals are pure noise, they will also lower the test-retest reliability of clinical judgment (i.e., the correlation between two judgments on the same cases made by the same person). However, analysis of the few studies that report both test-retest reliability and bootstrapping regressions indicates that only about 40 percent of the residual variance is unreliable noise (Camerer 1981b). Thus, residuals do contain reliable subjective information (though it may be uncorrelated with outcomes). Blattberg and Hoch (1990) later found that for actual managerial forecasts of product sales and coupon-redemption rates, residuals are correlated about .30 with outcomes. As a result, averaging statistical model forecasts and managerial judgments improved prediction substantially over statistical models alone.
24.3.2 Sparsity Is Good for You but Tastes Bad
Besides the then-startling finding that human judgment did reliably worse than statistical models, a key feature of the early results was how well small numbers of variables could fit. Some of this conclusion was constrained by the fact that there were no huge feature sets with truly large numbers of variables in any case (so you couldn't possibly know, at that time, whether "large numbers of variables fit surprisingly better" than small numbers).
A striking example in Dawes (1979) is a two-variable model predicting marital happiness: the rate of lovemaking minus the rate of fighting. He reports correlations of .40 and .81 in two studies (Edwards and Edwards 1977; Thornton 1977).14

14. More recent analyses using transcribed verbal interactions generate correlations for divorce and marital satisfaction around .6–.7. The core variables are called the "four horsemen" of criticism, defensiveness, contempt, and "stonewalling" (listener withdrawal).
In another, more famous example, Dawes (1971) did a study about admitting students to the University of Oregon PhD program in psychology from 1964 to 1967. He measured each applicant's GRE, undergraduate GPA, and the quality of the applicant's undergraduate school. The variables were standardized, then weighted equally. The outcome variable was faculty ratings in 1969 of how well the students they had admitted succeeded. (Obviously, the selection effect here makes the entire analysis much less than ideal, but tracking down rejected applicants and measuring their success by 1969 was basically impossible at the time.)