The simple three-variable statistical model correlated with later success in the program more highly (.48, cross-validated) than the admissions committee's quantitative recommendation (.19).15 The bootstrapping model of the admissions committee correlated .25.
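For readers unfamiliar with the statistic, a cross-validated correlation of this kind is computed by fitting the small linear model on training folds, predicting the held-out cases, and correlating those out-of-fold predictions with the outcome. The sketch below uses synthetic data and hypothetical predictors; it does not reproduce Dawes's variables or numbers.

```python
# Minimal sketch of a cross-validated correlation for a three-variable
# linear model. Data are synthetic; predictors are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 111                                   # same order as Dawes's sample; values here are made up
X = rng.normal(size=(n, 3))               # three hypothetical predictors (e.g., test score, GPA, school quality)
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)   # toy "later success" outcome

# Out-of-fold predictions, then the correlation of those predictions with the outcome
pred = cross_val_predict(LinearRegression(), X, y, cv=10)
print("cross-validated correlation:", round(np.corrcoef(pred, y)[0, 1], 2))
```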
Despite Dawes's evidence, I have never been able to convince any graduate admissions committee at two institutions (Penn and Caltech) to actually compute statistical ratings, even as a way to filter out applications that are likely to be certain rejections.
Why not?
I think the answer is that the human mind rebels against regularization and the resulting sparsity. We are born to overfit. Every AI researcher knows that including fewer variables (e.g., by giving many of them zero weights in LASSO, or by limiting tree depth in random forests) is a useful all-purpose prophylactic against overfitting a training set. But the same process seems to be unappealing in our everyday judgment.
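To make the two regularizers just mentioned concrete, here is a brief sketch with synthetic data and illustrative parameter choices (it is not tied to any particular application in the text):

```python
# LASSO drives many coefficients exactly to zero; limiting tree depth caps how
# finely a random forest can carve the training set. Only 3 of 20 features matter here.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(size=n)   # 3 informative features

lasso = LassoCV(cv=5).fit(X, y)                # penalty strength chosen by cross-validation
print("features kept by LASSO:", int(np.sum(lasso.coef_ != 0)), "of", p)

# Depth-limited forest: the tree analogue of sparsity, protecting test-set accuracy
shallow_forest = RandomForestRegressor(max_depth=3, n_estimators=200, random_state=0).fit(X, y)
print("forest trained with max_depth =", shallow_forest.max_depth)
```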
The distaste for sparsity is ironic because, in fact, the brain is built to do a massive amount of filtering of sensory information (and does so remarkably efficiently in areas where optimal efficiency can be quantified, such as vision; see Doi et al. [2012]). But people do not like to explicitly throw away information (Einhorn 1986). This is particularly true if the information is already in front of us—in the form of a PhD admissions application, or a person talking about their research in an AEA interview hotel room. It takes some combination of willpower, arrogance, or what have you, to simply ignore letters of recommendation, for example. Another force is "illusory correlation," in which strong prior beliefs about an association bias encoding or memory so that the prior is maintained, incorrectly (Chapman and Chapman 1969; Klayman and Ha 1985).
The poster child for misguided rebellion against sparsity is the short, personal face-to-face interview used in hiring. There is a mountain of evidence that, when interviewers are untrained and do not use a structured interview format, such interviews do not predict anything about later work performance that is not better predicted by numbers (e.g., Dana, Dawes, and Peterson 2013).
A likely example is interviewing faculty candidates with new PhDs in hotel suites at the ASSA meetings. Suppose the goal of such interviews is to predict which new PhDs will do enough terrific research, good teaching, and other kinds of service and public value to get tenure several years later at the interviewers' home institution.

15. Readers might guess that the quality of econometrics for inference in some of these earlier papers is limited. For example, Dawes (1971) only used the 111 students who had been admitted to the program and stayed enrolled, so there is likely scale compression and so forth. Some of the faculty members rating those students were probably also initial raters, which could generate consistency biases.
That predictive goal is admirable, but the brain of an untrained interviewer has more basic things on its mind. Is this person well dressed? Can they protect me if there is danger? Are they friend or foe? Do their accent and word choice sound like mine? Why are they stifling a yawn?—they'll never get papers accepted at Econometrica if they yawn after a long, tense day of slipping on ice in Philadelphia while rushing to avoid being late to a hotel suite!
People who do these interviews (including me) say that we are trying to probe the candidate's depth of understanding about their topic, how promising their new planned research is, and so forth. But what we really are evaluating is probably more like "Do they belong in my tribe?"
While I do think such interviews are a waste of time,16 it is conceivable that they generate valid information. The problem is that interviewers may weight the wrong information (as well as overweighting features that should be regularized to zero). If there is valid information about long-run tenure prospects and collegiality, the best method to capture such information is to videotape the interview, combine it with other tasks that more closely resemble work performance (e.g., have them review a difficult paper), and machine learn the heck out of that larger corpus of information.
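A minimal sketch of what such an exercise might look like is below. All features, outcomes, and parameter choices are hypothetical; the point is only that a penalized learner, not the interviewer, decides which features keep nonzero weight.

```python
# Combine features coded from a recorded interview with work-sample scores
# and fit a penalized classifier to a long-run outcome. Entirely synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
n = 500
interview_features = rng.normal(size=(n, 10))   # e.g., ratings coded from the videotaped interview
work_sample_scores = rng.normal(size=(n, 2))    # e.g., referee-report quality, seminar-talk rating
X = np.hstack([interview_features, work_sample_scores])
tenured = (X[:, 10] + 0.5 * X[:, 11] + rng.normal(size=n) > 1).astype(int)   # toy outcome

# L1 penalty regularizes uninformative interview features toward zero weight
model = LogisticRegressionCV(penalty="l1", solver="saga", cv=5, max_iter=5000).fit(X, tenured)
print("features with nonzero weight:", int(np.sum(model.coef_ != 0)), "of", X.shape[1])
```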
Another simple example of where ignoring information is counterintuitive is captured by the two modes of forecasting that Kahneman and Lovallo (1993) wrote about. They called the two modes the "inside" and "outside" view. The two views were in the context of forecasting the outcome of a project (such as writing a book, or a business investment). The inside view "focused only on a particular case, by considering the plan and its obstacles to completion, by constructing scenarios of future progress" (25). The outside view "focuses on the statistics of a class of cases chosen to be similar in relevant respects to the current one" (25).
The outside view deliberately throws away most of the information about a specific case at hand (but keeps some information): it reduces the relevant dimensions to only those that are present in the outside view reference class. (This is, again, a regularization that zeros out all the features that are not "similar in relevant respects.")
In ML terms, the outside and inside views are like different kinds of cluster analyses. The outside view parses all previous cases into K clusters; a current case belongs to one cluster or another (though there is, of course, a degree of cluster membership depending on the distance from cluster centroids). The inside view—in its extreme form—treats each case, like fingerprints and snowflakes, as unique.
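The analogy can be sketched directly. The example below uses synthetic data and an arbitrary choice of K; the outside view is "assign the new case to its nearest cluster and forecast with that cluster's average outcome," whereas the extreme inside view would treat the new case as a singleton.

```python
# Outside view as cluster assignment: forecast a new case with the average
# outcome of its nearest cluster of past cases. Synthetic data, illustrative K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
past_cases = rng.normal(size=(300, 4))                     # features of previous projects
past_outcomes = past_cases[:, 0] + rng.normal(size=300)    # e.g., months to completion

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(past_cases)

new_case = rng.normal(size=(1, 4))
cluster = kmeans.predict(new_case)[0]                      # which reference class is this case "similar" to?
outside_view_forecast = past_outcomes[kmeans.labels_ == cluster].mean()
print("outside-view forecast (cluster average):", round(outside_view_forecast, 2))
```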
16. There are many caveats, of course, to this strong claim. For example, often the school is pitching to attract a highly desirable candidate, not the other way around.
24.3.3 Hypothesis: Human Judgment Is Like Overfitted Machine Learning
The core idea I want to explore is that some aspects of everyday human judgment can be understood as the type of errors that would result from badly done machine learning.17 I will focus on two aspects: overconfidence and how it increases, and limited error correction.
In both cases, I have in mind a research program that takes data on human predictions and compares them with machine-learned predictions. One would then deliberately redo the machine learning badly (e.g., failing to correct for overfitting) and see whether the impaired ML predictions have some of the properties of human ones.
Overconfidence. In a classic study from the early days of JDM, Oskamp (1965) had eight experienced clinical psychologists and twenty-four graduate and undergraduate students read material about an actual person, in four stages. The first stage was just three sentences giving basic demographics, education, and occupation. The next three stages were one and a half to two pages each about childhood, schooling, and the subject's time in the army and beyond. There were a total of five pages of material.

The subjects had to answer twenty-five personality questions about the subject, each with five multiple-choice answers18 after each of the four stages of reading. All these questions had correct answers, based on other evidence about the case. Chance guessing would be 20 percent accurate.
Oskamp learned two things: First, there was no difference in accuracy between the experienced clinicians and the students.

Second, all the subjects were barely above chance, and accuracy did not improve as they read more material in the three stages. After just the first paragraph, their accuracy was 26 percent; after reading all five additional pages across the three stages, accuracy was 28 percent (an insignificant difference from 26 percent). However, the subjects' subjective confidence in their accuracy rose almost linearly as they read more, from 33 percent to 53 percent.19
This increase in confidence, combined with no increase in accuracy, is reminiscent of the difference between training set and test set accuracy in AI. As more and more variables are included in a training set, the (unpenalized) accuracy will always increase. As a result of overfitting, however, test-set accuracy will decline when too many variables are included. The resulting gap between training- and test-set accuracy will grow, much as the overconfidence in Oskamp's subjects grew with the equivalent of more "variables" (i.e., more material on the single person they were judging).

17. My intuition about this was aided by Jesse Shapiro, who asked a well-crafted question pointing straight in this direction.
18. One of the multiple-choice questions was "Kid's present attitude toward his mother is one of: (a) love and respect for her ideals, (b) affectionate tolerance for her foibles," and so forth.
19. Some other results comparing more and less experienced clinicians, however, have also confirmed the first finding (experience does not improve accuracy much), but found that experience tends to reduce overconfidence (Goldberg 1959).
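A short sketch with synthetic data shows the pattern the analogy relies on: as mostly useless predictors are added, training-set fit keeps improving while test-set fit does not, so the gap between the two grows.

```python
# Training fit rises mechanically as predictors are added; test fit does not,
# so the train-test gap grows. Synthetic data; only 2 predictors matter.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n, p = 120, 60
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (2, 10, 30, 60):                       # include more and more predictors
    model = LinearRegression().fit(X_tr[:, :k], y_tr)
    gap = model.score(X_tr[:, :k], y_tr) - model.score(X_te[:, :k], y_te)
    print(f"{k:2d} predictors: train-test R^2 gap = {gap:.2f}")
```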
Overconfidence comes in different flavors. In the predictive context, we will define it as having too narrow a confidence interval around a prediction. (In regression, for example, this means underestimating the standard error of a conditional prediction P(Y | X) based on observables X.) My hypothesis is that human overconfidence results from a failure to winnow the set of predictors (as in LASSO penalties for feature weights). Overconfidence of this type is a consequence of not anticipating overfitting. High training-set accuracy corresponds to confidence about predictions. Overconfidence is a failure to anticipate the drop in accuracy from training to test.
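In ML terms, this kind of overconfidence can be simulated by building a prediction interval from in-sample residuals rather than from held-out errors. The sketch below uses synthetic data with too many predictors kept.

```python
# The in-sample residual spread (the basis of an overconfident interval)
# understates the spread of errors on new cases when predictors are not winnowed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(5)
n, p = 100, 40
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)                 # only one predictor actually matters

model = LinearRegression().fit(X, y)
in_sample_se = np.std(y - model.predict(X))      # training-set "confidence"

out_of_fold = cross_val_predict(LinearRegression(), X, y, cv=10)
honest_se = np.std(y - out_of_fold)              # spread of errors on held-out cases

print(f"in-sample SE {in_sample_se:.2f} vs. cross-validated SE {honest_se:.2f}")
```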
Limited Error Correction. In some ML procedures, training takes place over trials. For example, the earliest neural networks were trained by making output predictions based on a set of node weights, then back-propagating prediction errors to adjust the weights. Early contributions intended for this process to correspond to human learning—for example, how children learn to recognize categories of natural objects or to learn properties of language (e.g., Rumelhart and McClelland 1986).
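A toy version of that training loop, written from scratch on synthetic data (it is not any particular early model), shows the error-correction step the text refers to: predict, compare to the label, and push the error back to adjust the weights.

```python
# Minimal one-hidden-layer network trained by back-propagating prediction errors.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)   # toy binary labels

W1 = rng.normal(scale=0.5, size=(3, 8))                    # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))                    # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    hidden = sigmoid(X @ W1)                               # forward pass
    output = sigmoid(hidden @ W2)
    error = output - y                                     # recognized prediction error
    # backward pass: propagate the error to adjust both layers' weights
    delta_out = error * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
    W2 -= hidden.T @ delta_out / len(X)
    W1 -= X.T @ delta_hidden / len(X)

print("training accuracy:", float(((output > 0.5) == y).mean()))
```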
One can then ask whether some aspects of adult human judgment correspond to poor implementation of error correction. An invisible assumption that is, of course, part of neural network training is that output errors are recognized (if learning is supervised by labeled data). But what if humans do not recognize error or respond to it inappropriately?
One maladaptive response to prediction error is to add features, particularly interaction effects. For example, suppose a college admissions director has a predictive model and thinks students who play musical instruments have good study habits and will succeed in the college. Now a student comes along who plays drums in the Dead Milkmen punk band. The student gets admitted (because playing music is a good feature), but struggles in college and drops out.
The admissions director could back-propagate the predictive error to adjust the weights on the "plays music" feature. Or she could create a new feature by splitting "plays music" into "plays drums" and "plays nondrums" and ignore the error. This procedure will generate too many features and will not use error-correction effectively.20
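The two responses can be caricatured in a few lines; all of the numbers, feature names, and the learning rate below are hypothetical.

```python
# Response to the drummer's outcome: adjust the existing weight (error correction),
# or keep the old belief and split the feature so the error is explained away.
plays_music_weight = 0.8          # prior belief: playing music predicts success
predicted, actual = 0.8, 0.0      # the drummer was predicted to succeed but dropped out
error = predicted - actual

# Response 1: back-propagate the error into the existing feature's weight
learning_rate = 0.25
plays_music_weight -= learning_rate * error
print("corrected weight on 'plays music':", plays_music_weight)

# Response 2: split the feature and leave the original belief intact;
# the model grows a new feature for every surprise and never really learns.
feature_weights = {"plays nondrums": 0.8, "plays drums": 0.0}
print("proliferating features:", feature_weights)
```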
Furthermore, note that a different admissions director might create two different subfeatures, "plays music in a punk band" and "plays nonpunk music." In the stylized version of this description, both will become convinced that they have improved their mental models and will retain high confidence about future predictions. But their inter-rater reliability will have gone down, because they "improved" their models in different ways. Inter-rater reliability puts a hard upper bound on how good average predictive accuracy can be. Finally, note that even if human experts are mediocre at feature selection or create too many interaction effects (which ML regularizes away), they are often more rapid than novices (for a remarkable study of actual admissions decisions, see Johnson 1980, 1988). The process they use is rapid, but the predictive performance is not so impressive. But AI algorithms are even faster.

20. Another way to model this is as the refinement of a prediction tree, where branches are added for new features when predictions are incorrect. This will generate a bushy tree, which generally harms test-set accuracy.
24.4 AI Technology as a Bionic Patch, or Malware, for Human Limits
We spend a lot of time in behavioral economics thinking about how political and economic systems either exploit bad choices or help people make good choices. What behavioral economics has to offer to this general discussion is to specify a more psychologically accurate model of human choice and human nature than the caricature of constrained utility-maximization (as useful as it has been).
Artificial intelligence enters by creating better tools for making inferences about what a person wants and what a person will do. Sometimes these tools will hurt and sometimes they will help.
Artificial Intelligence Helps. A clear example is recommender systems. Recommender systems use previous data on a target person's choices and ex post quality ratings, as well as data on many other people, possible choices, and ratings, to predict how well the target person will like a choice they have not made before (and may not even know exists, such as movies or books they haven't heard of). Recommender systems are a behavioral prosthetic to remedy human limits on attention and memory and the resulting incompleteness of preferences.
Consider Netflix movie recommendations. Netflix uses a person's viewing and ratings history, as well as opinions of others and movie properties, as inputs to a variety of algorithms to suggest what content to watch. As their data scientists explained (Gomez-Uribe and Hunt 2016):

a typical Netflix member loses interest after perhaps 60 to 90 seconds of choosing, having reviewed 10 to 20 titles (perhaps 3 in detail) on one or two screens. . . . The recommender problem is to make sure that on those two screens each member in our diverse pool will find something compelling to view, and will understand why it might be of interest.
For example, their "Because You Watched" recommender line uses a video-video similarity algorithm to suggest unwatched videos similar to ones the user watched and liked.
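A stylized sketch of item-item similarity of this general kind is below. The ratings matrix and titles are invented, and Netflix's production algorithms are of course far richer than a single cosine-similarity score.

```python
# Score unwatched titles by their cosine similarity (over user ratings) to a
# title the user watched and liked, then recommend the most similar one.
import numpy as np

titles = ["Title A", "Title B", "Title C", "Title D"]
# rows = users, columns = titles; 0 means unwatched/unrated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

watched = 0                                      # the user watched and liked "Title A"
similarity = [cosine(ratings[:, watched], ratings[:, j]) for j in range(len(titles))]

candidates = [j for j in range(len(titles)) if j != watched]
best = max(candidates, key=lambda j: similarity[j])
print("Because you watched", titles[watched], "->", titles[best])
```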
There are so many interesting implications of these kinds of recommender systems for economics in general, and for behavioral economics in particular. For example, Netflix wants its members to "understand why it (a recommended video) might be of interest." This is, at bottom, a question about interpretability of AI output, how a member learns from recommender successes and errors, and whether a member then "trusts" Netflix in general. All these are psychological processes that may also depend heavily on design and experience features (UD, UX).
Artificial Intelligence "Hurts."21 Another feature of AI-driven personalization is price discrimination. If people do know a lot about what they want, and have precise willingness-to-pay (WTP), then companies will quickly develop the capacity to personalize prices too. This seems to be a concept that is emerging rapidly and desperately needs to be studied by industrial economists who can figure out the welfare implications.
Behavioral economics can play a role by using evidence about how people make judgments about fairness of prices (e.g., Kahneman, Knetsch, and Thaler 1986), whether fairness norms adapt to "personalized pricing," and how fairness judgments influence behavior.
My intuition (echoing Kahneman, Knetsch, and Thaler 1986) is that people can come to accept a high degree of variation in prices for what is essentially the same product as long as either (a) there is very minor product differentiation,22 or (b) firms can articulate why different prices are fair. For example, price discrimination might be framed as cross-subsidy to help those who can't afford high prices.
It is also likely that personalized pricing will harm consumers who are