Both QRE and CH/level-k theories extend equilibrium theory by adding parsimonious, precise specifications of departures from either optimization (QRE) or rationality of beliefs (CH/level-k), using a small number of behavioral parameters. The question asked is: Can we add predictive power in a simple, psychologically plausible9 way using these parameters?
8. For example, see Goldfarb and Xiao 2011; Östling et al. 2011; and Hortaçsu et al. 2017.
9. In the case of CH/level-k theories, direct measures of visual attention from Mouselab and eyetracking have been used to test the theories using a combination of choices and visual-attention data. See Costa-Gomes, Crawford, and Broseta 2001; Wang, Spezio, and Camerer 2010; and Brocas et al. 2014. Eyetracking and mouse-based methods provide huge data sets. These previous studies heavily filter (or dimension-reduce) those data based on theory that requires consistency between choices and attention to the information necessary to execute the value computation underlying the choice (Costa-Gomes, Crawford, and Broseta 2001; Costa-Gomes and Crawford 2006). Another approach that has never been tried is to use ML to select features from the huge feature set, combining choices and visual attention, to see which features predict best.

A more general question is: Are there structural features of payoffs and strategies that can predict even more accurately than QRE or CH/level-k? If the answer is "Yes," then the new theories, even if they are improvements, have a long way to go.
Two recent research streams have made important steps in this direction. Using methods familiar in computer science, Wright and Leyton-Brown (2014) create a "meta-model" that combines payoff features to predict what the nonstrategic "level 0" players seem to do, in six sets of two-player 3 × 3 normal-form games. This is a substantial improvement on previous specifications, which typically assume random behavior or some simple action based on salient information.10

10. Examples of nonrandom behavior by nonstrategic players include bidding one's private value in an auction (Crawford and Iriberri 2007) and reporting a private state honestly in a sender-receiver game (Wang, Spezio, and Camerer 2010; Crawford 2003).
Hartford, Wright, and Leyton-Brown (2016) go further, using deep-learning neural networks (NNs) to predict human choices on the same six data sets. The NNs are able to outpredict CH models in the hold-out test sample in many cases. Importantly, even models in which there is no hierarchical iteration of strategic thinking ("layers of action response" in their approach) can fit well. This result, while preliminary, indicates that prediction purely from hidden layers of structural features can be successful.
Coming from behavioral game theory, Fudenberg and Liang (2017) explore how well ML over structural properties of strategies can predict experimental choices. They use the six data sets from Wright and Leyton-Brown (2014) and also collect data on how MTurk subjects played 200 new 3 × 3 games with randomly drawn payoffs. Their ML approach uses eighty-eight features that are categorical structural properties of strategies (e.g., Is it part of a Nash equilibrium? Is the payoff never the worst for each choice by the other player?).
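To make this kind of feature concrete, here is a minimal sketch of how two such binary properties could be computed for the row player of a 3 × 3 game. The payoff matrices and the particular feature definitions are illustrative assumptions, not Fudenberg and Liang's actual feature set or code.

    import numpy as np

    # Payoffs for a 3 x 3 game; entry [i, j] is the payoff when the row player
    # picks i and the column player picks j. (Illustrative numbers only.)
    U_row = np.array([[3, 0, 5],
                      [2, 2, 2],
                      [0, 4, 1]])
    U_col = np.array([[1, 4, 0],
                      [3, 1, 2],
                      [2, 0, 5]])

    def in_pure_nash(i):
        """Is row strategy i part of some pure-strategy Nash equilibrium?"""
        for j in range(3):
            row_best = U_row[:, j].max() == U_row[i, j]   # i is a best response to j
            col_best = U_col[i, :].max() == U_col[i, j]   # j is a best response to i
            if row_best and col_best:
                return True
        return False

    def never_worst(i):
        """Is row strategy i's payoff never the worst, whatever the column plays?
        (Ties with the column-wise minimum count as 'worst' here.)"""
        return all(U_row[i, j] > U_row[:, j].min() for j in range(3))

    features = {i: (in_pure_nash(i), never_worst(i)) for i in range(3)}
    print(features)

Each strategy in each game then becomes one observation described by a vector of such binary features, with the label being whether subjects actually played it.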
The main analysis creates decision trees with k branching nodes (for k
from 1 to 10) that predict whether a strategy will be played or not. Analysis
uses tenfold test validation to guard against overfi tting. As is common, the
best- fi tting trees are simple; there is a substantial improvement in fi t going
from k = 1 to k = 2, and then only small improvements for bushier trees. In the lab game data, the best k = 2 tree is simply what is called level 1 play in CH/ level- k; it predicts the strategy that is a best response to uniform play
by an opponent. That simple tree has a misclassifi cation rate of 38.4 per-
cent. The best k = 3 tree is only a little better (36.6 percent) and k = 5 is very slightly better (36.5 percent).
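A rough sketch of this procedure, assuming a placeholder feature matrix X of binary structural properties for each game-strategy pair and labels y marking whether the strategy was played (random placeholders below, not the Fudenberg-Liang data): in scikit-learn, a tree with k branching (internal) nodes has k + 1 leaves, so tree size can be capped with max_leaf_nodes and each size scored by tenfold cross-validation.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data: one row per (game, strategy) pair.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(600, 10))   # binary structural features
    y = rng.integers(0, 2, size=600)         # 1 if the strategy was played

    # A binary tree with k internal (branching) nodes has k + 1 leaves.
    for k in (1, 2, 3, 5, 10):
        tree = DecisionTreeClassifier(max_leaf_nodes=k + 1, random_state=0)
        acc = cross_val_score(tree, X, y, cv=10, scoring="accuracy").mean()
        print(f"k = {k:2d} branching nodes: misclassification = {1 - acc:.3f}")

With random placeholder labels the scores simply hover near chance; the point is only the mechanics of limiting tree size and cross-validating, not the reported results.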
The simple tree model classifies rather well, but the ML feature-based models do a little better. Table 24.1 summarizes results for their new random games.
Table 24.1    Frequency of prediction errors of various theoretical and ML models for new data from random-payoff games (from Fudenberg and Liang 2017)

                                              Error      Completeness
Naïve benchmark                               0.6667          0%
Uniform Nash                                  0.4722         51.21%
                                                             (0.0075)
Poisson cognitive hierarchy model             0.3159         92.36%
                                                             (0.0217)
Prediction rule based on game features        0.2984         96.97%
                                                             (0.0095)
"Best possible"                               0.2869        100%
The classification by Poisson cognitive hierarchy (PCH) is 92 percent of the way from random to "best possible" (using the overall distribution of actual play) in this analysis. The ML feature model is almost perfect (97 percent).
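The completeness column can be reproduced from the error rates alone: it measures how far a model's error moves from the naive benchmark toward the best-possible error. A quick check in Python, using the values in Table 24.1:

    # Completeness = (naive error - model error) / (naive error - best-possible error)
    naive, best = 0.6667, 0.2869
    errors = {
        "Uniform Nash": 0.4722,
        "Poisson cognitive hierarchy": 0.3159,
        "Game-feature prediction rule": 0.2984,
    }
    for name, e in errors.items():
        completeness = (naive - e) / (naive - best)
        print(f"{name}: {completeness:.2%}")
    # Prints roughly 51.21%, 92.36%, and 96.97%, matching the table.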
Other analyses show less impressive performance for PCH, although it can be improved substantially by adding risk aversion, and also by trying to predict different data set-specific τ values.
Note that the FL "best possible" measure is the same as the "clairvoyant" model upper bound used by Camerer, Ho, and Chong (2004). Given a data set of actual human behavior, and assuming that subjects are playing people chosen at random from that set, the best they can do is to have somehow accurately guessed what those data would be and chosen accordingly.11 (The term "clairvoyant" is used to note that this upper bound is unlikely to be reached except by sheer lucky guessing, but if a person repeatedly chooses near the bound it implies they have an intuitive mental model of how others choose that is quite accurate.)

11. In psychophysics and experimental psychology, the term "ideal observer" model is used to refer to a performance benchmark closely related to what we called the clairvoyant upper bound.
Camerer, Ho, and Chong (2004) went a step further by also computing the expected reward value from clairvoyant prediction and comparing it with how much subjects actually earn and how much they could have earned if they had obeyed different theories. Using reward value as a metric is sensible because a theory could predict frequencies rather accurately but might not generate a much higher reward value than highly inaccurate predictions (because of the "flat maximum" property).12 In the five data sets they studied, Nash equilibrium added very little marginal value, and the PCH approach added some value in three games and more than half the maximum achievable value in two games.

12. This property was referred to as the "flat maximum" by von Winterfeldt and Edwards (1973). It came to prominence much later in experimental economics when it was noted that theories could badly predict, say, a distribution of choices in a zero-sum game, but such an inaccurate theory might not yield much lower earnings than an ideal theory.
24.3 Human Prediction as Imperfect Machine Learning
24.3.1 Some Pre-History of Judgment Research and Behavioral Economics
Behavioral economics as we know and describe it nowadays began to thrive when challenges to simple rationality principles (then called "anomalies") came to have rugged empirical status and to point to natural improvements in theory. It was common in those early days to distinguish anomalies about "preferences," such as mental-accounting violations of fungibility and reference-dependence, from anomalies about "judgment" of likelihoods and quantities.
Somewhat hidden from economists, at that time and even now, was the fact that there was active research in many areas of judgment and decision-making (JDM). The JDM research proceeded in parallel with the emergence of behavioral economics. It was conducted almost entirely in psychology departments and some business schools, and rarely published in economics journals. The annual meeting of the S/JDM society was, for logistical efficiency, held as a satellite meeting of the Psychonomic Society (which weighted attendance toward mathematical experimental psychology).
The JDM research was about general approaches to understanding judgment processes, including "anomalies" relative to logically normative benchmarks. This research flourished because there was a healthy respect for simple mathematical models and careful testing, which enabled regularities to cumulate and gave reasons to dismiss weak results. The research community also had one foot in practical domains (such as judgments of natural risks, medical decision-making, law, etc.), so that generalizability of lab results was always implicitly addressed.
The central ongoing debate in JDM from the 1970s on was about the cognitive processes involved in actual decisions, and the quality of those predictions. There were plenty of careful lab experiments about such phenomena, but also an earlier literature on what was then called "clinical versus statistical prediction." There lies the earliest comparison between primitive forms of ML and the important JDM piece of behavioral economics (see Lewis 2016). Many of the important contributions from this fertile period were included in the Kahneman, Slovic, and Tversky (1982) edited volume (which in the old days was called the "blue-green bible").
Paul Meehl’s (1954) compact book started it all. Meehl was a remarkable
character. He was a rare example, at the time, of a working clinical psychia-
trist who was also interested in statistics and evidence (as were others at Min-
nesota). Meehl had a picture of Freud in his offi
ce, and practiced clinically
for fi fty years in the Veteran’s Administration.
Meehl’s mother had died when he was sixteen, under circumstances which
apparently made him suspicious of how much doctors actually knew about
how to make sick people well.
His book could be read as the pursuit of that suspicion scientifically: he collected all the studies he could find (there were twenty-two) that compared a set of clinical judgments with actual outcomes and with simple linear models using observable predictors (some objective and some subjectively estimated).
Meehl’s idea was that these statistical models could be used as a bench-
mark to evaluate clinicians. As Dawes and Corrigan (1974, 97) wrote, “the
statistical analysis was thought to provide a fl oor to which the judgment of
the experienced clinician could be compared. The fl oor turned out to be a
ceiling.”
In every case the statistical model outpredicted or tied the judgment accuracy of the average clinician. A later meta-analysis of 117 studies (Grove et al. 2000) found only six in which clinicians, on average, were more accurate than models (and see Dawes, Faust, and Meehl 1989).
It is possible that in any one domain, the distribution of clinicians contains some stars who could predict much more accurately. However, later studies at the individual level showed that only a minority of clinicians were more accurate than statistical models (e.g., Goldberg 1968, 1970). Kleinberg et al.'s (2017) study of machine-learned and judicial detention decisions is a modern example of the same theme.
In the decades after Meehl's book was published, evidence began to mount about why clinical judgment could be so imperfect. A common theme was that clinicians were good at measuring particular variables, or suggesting which objective variables to include, but were not so good at combining them consistently (e.g., Sawyer 1966). In a recollection Meehl (1986, 373) gave a succinct description of this theme:

Why should people have been so surprised by the empirical results in my summary chapter? Surely we all know that the human brain is poor at weighting and computing. When you check out at a supermarket, you don't eyeball the heap of purchases and say to the clerk, "Well it looks to me as if it's about $17.00 worth; what do you think?" The clerk adds it up. There are no strong arguments, from the armchair or from empirical studies of cognitive psychology, for believing that human beings can assign optimal weights in equations subjectively or that they apply their own weights consistently, the query from which Lew Goldberg derived such fascinating and fundamental results.
Some other important findings emerged. One drawback of the statistical prediction approach, for practice, was that it requires large samples of high-quality outcome data (in more modern AI language, prediction required labeled data). There were rarely many such data available at the time.
Dawes (1979) proposed to give up on estimating variable weights through a criterion-optimizing "proper" procedure like ordinary least squares (OLS),13 using "improper" weights instead. An example is equal weighting of standardized variables, which is often a very good approximation to OLS weighting (Einhorn and Hogarth 1975).

13. Presciently, Dawes also mentions using ridge regression as a proper procedure to maximize out-of-sample fit.
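As a minimal illustration of the "improper" recipe on synthetic data (the predictor weights and noise level below are arbitrary assumptions, not any data set discussed here): standardize each predictor, add them with equal weights, and compare the result to an OLS fit.

    import numpy as np

    # Synthetic criterion y driven by three noisy predictors, all signed so that
    # they correlate positively with y (the usual convention for equal weighting).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=1.0, size=200)

    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each predictor
    improper = Z.sum(axis=1)                        # Dawes-style equal ("improper") weights

    A = np.column_stack([np.ones(200), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS ("proper") weights
    proper = A @ beta

    print("corr(equal weights, y):", np.corrcoef(improper, y)[0, 1].round(3))
    print("corr(OLS fit, y):      ", np.corrcoef(proper, y)[0, 1].round(3))

In runs like this the equal-weight composite typically comes close to the OLS fit, which is the Einhorn and Hogarth point.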
An interesting example of improper weights is what Dawes called "bootstrapping" (a completely distinct usage from the concept in statistics of bootstrap resampling). Dawes's idea was to regress clinical judgments on predictors and use those estimated weights to make predictions. This is equivalent, of course, to using the predicted part of the clinical-judgment regression and discarding (or regularizing to zero, if you will) the residual. If the residual is mostly noise, then correlation accuracies can be improved by this procedure, and they typically are (e.g., Camerer 1981a).
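A sketch of the bootstrapping procedure on synthetic data (the cue weights and noise levels are arbitrary assumptions): regress the judge's ratings on the cues and use the fitted values, rather than the raw judgments, as the prediction.

    import numpy as np

    # Dawes-style "bootstrapping" (no relation to bootstrap resampling):
    # fit a model OF the judge, then predict with its fitted values,
    # discarding the judge's residual.
    rng = np.random.default_rng(2)
    n = 300
    cues = rng.normal(size=(n, 4))                                       # observable predictors
    outcome = cues @ np.array([0.6, 0.4, 0.2, 0.0]) + rng.normal(size=n)
    judgment = cues @ np.array([0.5, 0.5, 0.1, 0.3]) + rng.normal(size=n)  # noisy, misweighted judge

    A = np.column_stack([np.ones(n), cues])
    w, *_ = np.linalg.lstsq(A, judgment, rcond=None)   # regression of judgments on cues
    model_of_judge = A @ w                             # keep only the systematic part

    print("judge vs outcome:         ", np.corrcoef(judgment, outcome)[0, 1].round(3))
    print("model of judge vs outcome:", np.corrcoef(model_of_judge, outcome)[0, 1].round(3))

Because the simulated judge's residual here is pure noise, the linearized "model of the judge" will typically correlate more highly with the outcome than the raw judgments do, which is the pattern the early studies found.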
Later studies indicated a slightly more optimistic picture for the clinicians. If bootstrap-regression residuals are pure noise, they will also lower the test-retest reliability of clinical judgment (i.e., the correlation between two judgments on the same cases made by the same person). However, analysis of the few studies that report both test-retest reliability and bootstrapping regressions indicates that only about 40 percent of the residual variance is unreliable noise (Camerer 1981b). Thus, residuals do contain reliable subjective information (though it may be uncorrelated with outcomes). Blattberg and Hoch (1990) later found that for actual managerial forecasts of product sales and coupon-redemption rates, residuals are correlated about .30 with outcomes. As a result, averaging statistical model forecasts and managerial judgments improved prediction substantially over statistical models alone.
24.3.2 Sparsity Is Good for You but Tastes Bad
Besides the then-startling finding that human judgment did reliably worse than statistical models, a key feature of the early results was how well small numbers of variables could fit. Some of this conclusion was constrained by the fact that there were no huge feature sets with truly large numbers of variables in any case (so you couldn't possibly know, at that time, whether "large numbers of variables fit surprisingly better" than small numbers).
A striking example in Dawes (1979) is a two-variable model predicting marital happiness: the rate of lovemaking minus the rate of fighting. He reports correlations of .40 and .81 in two studies (Edwards and Edwards 1977; Thornton 1977).14

14. More recent analyses using transcribed verbal interactions generate correlations for divorce and marital satisfaction around .6–.7. The core variables are called the "four horsemen" of criticism, defensiveness, contempt, and "stonewalling" (listener withdrawal).
In another, more famous example, Dawes (1971) did a study about admitting students to the University of Oregon PhD program in psychology from 1964 to 1967. He measured each applicant's GRE, undergraduate GPA, and the quality of the applicant's undergraduate school. The variables were standardized, then weighted equally. The outcome variable was faculty ratings in 1969 of how well the students they had admitted succeeded. (Obviously, the selection effect here makes the entire analysis much less than ideal, but tracking down rejected applicants and measuring their success by 1969 was basically impossible at the time.)