The simple three-variable statistical model correlated with later success in the program more highly (.48, cross-validated) than the admissions committee's quantitative recommendation (.19).15 The bootstrapping model of the admissions committee correlated .25.
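For readers unfamiliar with the statistic, a cross-validated correlation of this kind is computed by fitting the small linear model on training folds, predicting the held-out cases, and correlating those out-of-fold predictions with the outcome. The sketch below uses synthetic data and hypothetical predictors; it does not reproduce Dawes's variables or numbers.

```python
# Minimal sketch of a cross-validated correlation for a three-variable
# linear model. Data are synthetic; predictors are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 111                                   # same order as Dawes's sample; values here are made up
X = rng.normal(size=(n, 3))               # three hypothetical predictors (e.g., test score, GPA, school quality)
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(size=n)   # toy "later success" outcome

# Out-of-fold predictions, then the correlation of those predictions with the outcome
pred = cross_val_predict(LinearRegression(), X, y, cv=10)
print("cross-validated correlation:", round(np.corrcoef(pred, y)[0, 1], 2))
```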
Despite Dawes's evidence, I have never been able to convince any graduate admissions committee at two institutions (Penn and Caltech) to actually compute statistical ratings, even as a way to filter out applications that are likely to be certain rejections.
Why not?
I think the answer is that the human mind rebels against regularization and the resulting sparsity. We are born to overfit. Every AI researcher knows that including fewer variables (e.g., by giving many of them zero weights in LASSO, or by limiting tree depth in random forests) is a useful all-purpose prophylactic against overfitting a training set. But the same process seems to be unappealing in our everyday judgment.
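To make the two regularizers just mentioned concrete, here is a brief sketch with synthetic data and illustrative parameter choices (it is not tied to any particular application in the text):

```python
# LASSO drives many coefficients exactly to zero; limiting tree depth caps how
# finely a random forest can carve the training set. Only 3 of 20 features matter here.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(size=n)   # 3 informative features

lasso = LassoCV(cv=5).fit(X, y)                # penalty strength chosen by cross-validation
print("features kept by LASSO:", int(np.sum(lasso.coef_ != 0)), "of", p)

# Depth-limited forest: the tree analogue of sparsity, protecting test-set accuracy
shallow_forest = RandomForestRegressor(max_depth=3, n_estimators=200, random_state=0).fit(X, y)
print("forest trained with max_depth =", shallow_forest.max_depth)
```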
The distaste for sparsity is ironic because, in fact, the brain is built to do a massive amount of filtering of sensory information (and does so remarkably efficiently in areas where optimal efficiency can be quantified, such as vision; see Doi et al. [2012]). But people do not like to explicitly throw away information (Einhorn 1986). This is particularly true if the information is already in front of us—in the form of a PhD admissions application, or a person talking about their research in an AEA interview hotel room. It takes some combination of willpower, arrogance, or what have you, to simply ignore letters of recommendation, for example. Another force is "illusory correlation," in which strong prior beliefs about an association bias encoding or memory so that the prior is maintained, incorrectly (Chapman and Chapman 1969; Klayman and Ha 1985).
The poster child for misguided rebellion against sparsity is the short, personal face-to-face interview used in hiring. There is a mountain of evidence that, when interviewers are untrained and do not use a structured interview format, such interviews do not predict anything about later work performance that is not better predicted by numbers (e.g., Dana, Dawes, and Peterson 2013).
A likely example is interviewing faculty candidates with new PhDs in hotel suites at the ASSA meetings. Suppose the goal of such interviews is to predict which new PhDs will do enough terrific research, good teaching, and other kinds of service and public value to get tenure several years later at the interviewers' home institution.

15. Readers might guess that the quality of econometrics for inference in some of these earlier papers is limited. For example, Dawes (1971) only used the 111 students who had been admitted to the program and stayed enrolled, so there is likely scale compression and so forth. Some of the faculty members rating those students were probably also initial raters, which could generate consistency biases.
That predictive goal is admirable, but the brain of an untrained interviewer has more basic things on its mind. Is this person well dressed? Can they protect me if there is danger? Are they friend or foe? Do their accent and word choice sound like mine? Why are they stifling a yawn?—they'll never get papers accepted at Econometrica if they yawn after a long, tense day of slipping on ice in Philadelphia while rushing to avoid being late to a hotel suite!
People who do these interviews (including me) say that we are trying to probe the candidate's depth of understanding about their topic, how promising their new planned research is, and so forth. But what we really are evaluating is probably more like "Do they belong in my tribe?"
While I do think such interviews are a waste of time,16 it is conceivable that they generate valid information. The problem is that interviewers may weight the wrong information (as well as overweighting features that should be regularized to zero). If there is valid information about long-run tenure prospects and collegiality, the best method to capture such information is to videotape the interview, combine it with other tasks that more closely resemble work performance (e.g., have them review a difficult paper), and machine learn the heck out of that larger corpus of information.
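A minimal sketch of what such an exercise might look like is below. All features, outcomes, and parameter choices are hypothetical; the point is only that a penalized learner, not the interviewer, decides which features keep nonzero weight.

```python
# Combine features coded from a recorded interview with work-sample scores
# and fit a penalized classifier to a long-run outcome. Entirely synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
n = 500
interview_features = rng.normal(size=(n, 10))   # e.g., ratings coded from the videotaped interview
work_sample_scores = rng.normal(size=(n, 2))    # e.g., referee-report quality, seminar-talk rating
X = np.hstack([interview_features, work_sample_scores])
tenured = (X[:, 10] + 0.5 * X[:, 11] + rng.normal(size=n) > 1).astype(int)   # toy outcome

# L1 penalty regularizes uninformative interview features toward zero weight
model = LogisticRegressionCV(penalty="l1", solver="saga", cv=5, max_iter=5000).fit(X, tenured)
print("features with nonzero weight:", int(np.sum(model.coef_ != 0)), "of", X.shape[1])
```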
Another simple example of where ignoring information is counterintuitive is captured by the two modes of forecasting that Kahneman and Lovallo (1993) wrote about. They called the two modes the "inside" and "outside" view. The two views were in the context of forecasting the outcome of a project (such as writing a book, or a business investment). The inside view "focused only on a particular case, by considering the plan and its obstacles to completion, by constructing scenarios of future progress" (25). The outside view "focuses on the statistics of a class of cases chosen to be similar in relevant respects to the current one" (25).
The outside view deliberately throws away most of the information about a specific case at hand (but keeps some information): it reduces the relevant dimensions to only those that are present in the outside view reference class. (This is, again, a regularization that zeros out all the features that are not "similar in relevant respects.")
In ML terms, the outside and inside views are like different kinds of cluster analyses. The outside view parses all previous cases into K clusters; a current case belongs to one cluster or another (though there is, of course, a degree of cluster membership depending on the distance from cluster centroids). The inside view—in its extreme form—treats each case, like fingerprints and snowflakes, as unique.
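The analogy can be sketched directly. The example below uses synthetic data and an arbitrary choice of K; the outside view is "assign the new case to its nearest cluster and forecast with that cluster's average outcome," whereas the extreme inside view would treat the new case as a singleton.

```python
# Outside view as cluster assignment: forecast a new case with the average
# outcome of its nearest cluster of past cases. Synthetic data, illustrative K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
past_cases = rng.normal(size=(300, 4))                     # features of previous projects
past_outcomes = past_cases[:, 0] + rng.normal(size=300)    # e.g., months to completion

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(past_cases)

new_case = rng.normal(size=(1, 4))
cluster = kmeans.predict(new_case)[0]                      # which reference class is this case "similar" to?
outside_view_forecast = past_outcomes[kmeans.labels_ == cluster].mean()
print("outside-view forecast (cluster average):", round(outside_view_forecast, 2))
```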
16. There are many caveats, of course, to this strong claim. For example, often the school is pitching to attract a highly desirable candidate, not the other way around.
24.3.3 Hypothesis: Human Judgment Is Like Overfitted Machine Learning
The core idea I want to explore is that some aspects of everyday human judgment can be understood as the type of errors that would result from badly done machine learning.17 I will focus on two aspects: overconfidence and how it increases, and limited error correction.
In both cases, I have in mind a research program that takes data on human predictions and compares them with machine-learned predictions. One would then deliberately redo the machine learning badly (e.g., failing to correct for overfitting) and see whether the impaired ML predictions have some of the properties of human ones.
Overconfidence. In a classic study from the early days of JDM, Oskamp (1965) had eight experienced clinical psychologists and twenty-four graduate and undergraduate students read material about an actual person, in four stages. The first stage was just three sentences giving basic demographics, education, and occupation. The next three stages were one and a half to two pages each about childhood, schooling, and the subject's time in the army and beyond. There were a total of five pages of material.

The subjects had to answer twenty-five personality questions about the subject, each with five multiple-choice answers18 after each of the four stages of reading. All these questions had correct answers, based on other evidence about the case. Chance guessing would be 20 percent accurate.
Oskamp learned two things: First, there was no difference in accuracy between the experienced clinicians and the students.

Second, all the subjects were barely above chance, and accuracy did not improve as they read more material in the three stages. After just the first paragraph, their accuracy was 26 percent; after reading all five additional pages across the three stages, accuracy was 28 percent (an insignificant difference from 26 percent). However, the subjects' subjective confidence in their accuracy rose almost linearly as they read more, from 33 percent to 53 percent.19
This increase in confidence, combined with no increase in accuracy, is reminiscent of the difference between training set and test set accuracy in AI. As more and more variables are included in a training set, the (unpenalized) accuracy will always increase. As a result of overfitting, however, test-set accuracy will decline when too many variables are included. The resulting gap between training- and test-set accuracy will grow, much as the overconfidence in Oskamp's subjects grew with the equivalent of more "variables" (i.e., more material on the single person they were judging).

17. My intuition about this was aided by Jesse Shapiro, who asked a well-crafted question pointing straight in this direction.
18. One of the multiple-choice questions was "Kid's present attitude toward his mother is one of: (a) love and respect for her ideals, (b) affectionate tolerance for her foibles," and so forth.
19. Some other results comparing more and less experienced clinicians, however, have also confirmed the first finding (experience does not improve accuracy much), but found that experience tends to reduce overconfidence (Goldberg 1959).
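A short sketch with synthetic data shows the pattern the analogy relies on: as mostly useless predictors are added, training-set fit keeps improving while test-set fit does not, so the gap between the two grows.

```python
# Training fit rises mechanically as predictors are added; test fit does not,
# so the train-test gap grows. Synthetic data; only 2 predictors matter.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n, p = 120, 60
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (2, 10, 30, 60):                       # include more and more predictors
    model = LinearRegression().fit(X_tr[:, :k], y_tr)
    gap = model.score(X_tr[:, :k], y_tr) - model.score(X_te[:, :k], y_te)
    print(f"{k:2d} predictors: train-test R^2 gap = {gap:.2f}")
```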
Overconfidence comes in different flavors. In the predictive context, we will define it as having too narrow a confidence interval around a prediction. (In regression, for example, this means underestimating the standard error of a conditional prediction P(Y | X) based on observables X.) My hypothesis is that human overconfidence results from a failure to winnow the set of predictors (as in LASSO penalties for feature weights). Overconfidence of this type is a consequence of not anticipating overfitting. High training-set accuracy corresponds to confidence about predictions. Overconfidence is a failure to anticipate the drop in accuracy from training to test.
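In ML terms, this kind of overconfidence can be simulated by building a prediction interval from in-sample residuals rather than from held-out errors. The sketch below uses synthetic data with too many predictors kept.

```python
# The in-sample residual spread (the basis of an overconfident interval)
# understates the spread of errors on new cases when predictors are not winnowed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(5)
n, p = 100, 40
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)                 # only one predictor actually matters

model = LinearRegression().fit(X, y)
in_sample_se = np.std(y - model.predict(X))      # training-set "confidence"

out_of_fold = cross_val_predict(LinearRegression(), X, y, cv=10)
honest_se = np.std(y - out_of_fold)              # spread of errors on held-out cases

print(f"in-sample SE {in_sample_se:.2f} vs. cross-validated SE {honest_se:.2f}")
```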
Limited Error Correction. In some ML procedures, training takes place over trials. For example, the earliest neural networks were trained by making output predictions based on a set of node weights, then back-propagating prediction errors to adjust the weights. Early contributions intended for this process to correspond to human learning—for example, how children learn to recognize categories of natural objects or to learn properties of language (e.g., Rumelhart and McClelland 1986).
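A toy version of that training loop, written from scratch on synthetic data (it is not any particular early model), shows the error-correction step the text refers to: predict, compare to the label, and push the error back to adjust the weights.

```python
# Minimal one-hidden-layer network trained by back-propagating prediction errors.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)   # toy binary labels

W1 = rng.normal(scale=0.5, size=(3, 8))                    # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))                    # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    hidden = sigmoid(X @ W1)                               # forward pass
    output = sigmoid(hidden @ W2)
    error = output - y                                     # recognized prediction error
    # backward pass: propagate the error to adjust both layers' weights
    delta_out = error * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
    W2 -= hidden.T @ delta_out / len(X)
    W1 -= X.T @ delta_hidden / len(X)

print("training accuracy:", float(((output > 0.5) == y).mean()))
```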
One can then ask whether some aspects of adult human judgment correspond to poor implementation of error correction. An invisible assumption that is, of course, part of neural network training is that output errors are recognized (if learning is supervised by labeled data). But what if humans do not recognize error or respond to it inappropriately?
One maladaptive response to prediction error is to add features, particularly interaction effects. For example, suppose a college admissions director has a predictive model and thinks students who play musical instruments have good study habits and will succeed in the college. Now a student comes along who plays drums in the Dead Milkmen punk band. The student gets admitted (because playing music is a good feature), but struggles in college and drops out.
The admissions director could back-propagate the predictive error to adjust the weights on the "plays music" feature. Or she could create a new feature by splitting "plays music" into "plays drums" and "plays nondrums" and ignore the error. This procedure will generate too many features and will not use error-correction effectively.20
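The two responses can be caricatured in a few lines; all of the numbers, feature names, and the learning rate below are hypothetical.

```python
# Response to the drummer's outcome: adjust the existing weight (error correction),
# or keep the old belief and split the feature so the error is explained away.
plays_music_weight = 0.8          # prior belief: playing music predicts success
predicted, actual = 0.8, 0.0      # the drummer was predicted to succeed but dropped out
error = predicted - actual

# Response 1: back-propagate the error into the existing feature's weight
learning_rate = 0.25
plays_music_weight -= learning_rate * error
print("corrected weight on 'plays music':", plays_music_weight)

# Response 2: split the feature and leave the original belief intact;
# the model grows a new feature for every surprise and never really learns.
feature_weights = {"plays nondrums": 0.8, "plays drums": 0.0}
print("proliferating features:", feature_weights)
```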
Furthermore, note that a different admissions director might create two different subfeatures, "plays music in a punk band" and "plays nonpunk music." In the stylized version of this description, both will become convinced that they have improved their mental models and will retain high confidence about future predictions. But their inter-rater reliability will have gone down, because they "improved" their models in different ways. Inter-rater reliability puts a hard upper bound on how good average predictive accuracy can be. Finally, note that even if human experts are mediocre at feature selection or create too many interaction effects (which ML regularizes away), they are often more rapid than novices (for a remarkable study of actual admissions decisions, see Johnson 1980, 1988). The process they use is rapid, but the predictive performance is not so impressive. But AI algorithms are even faster.

20. Another way to model this is as the refinement of a prediction tree, where branches are added for new features when predictions are incorrect. This will generate a bushy tree, which generally harms test-set accuracy.
24.4 AI Technology as a Bionic Patch, or Malware, for Human Limits
We spend a lot of time in behavioral economics thinking about how political and economic systems either exploit bad choices or help people make good choices. What behavioral economics has to offer to this general discussion is to specify a more psychologically accurate model of human choice and human nature than the caricature of constrained utility-maximization (as useful as it has been).
Artificial intelligence enters by creating better tools for making inferences about what a person wants and what a person will do. Sometimes these tools will hurt and sometimes they will help.
Artificial Intelligence Helps. A clear example is recommender systems. Recommender systems use previous data on a target person's choices and ex post quality ratings, as well as data on many other people, possible choices, and ratings, to predict how well the target person will like a choice they have not made before (and may not even know exists, such as movies or books they haven't heard of). Recommender systems are a behavioral prosthetic to remedy human limits on attention and memory and the resulting incompleteness of preferences.
Consider Netflix movie recommendations. Netflix uses a person's viewing and ratings history, as well as opinions of others and movie properties, as inputs to a variety of algorithms to suggest what content to watch. As their data scientists explained (Gomez-Uribe and Hunt 2016):

a typical Netflix member loses interest after perhaps 60 to 90 seconds of choosing, having reviewed 10 to 20 titles (perhaps 3 in detail) on one or two screens. . . . The recommender problem is to make sure that on those two screens each member in our diverse pool will find something compelling to view, and will understand why it might be of interest.
For example, their "Because You Watched" recommender line uses a video-video similarity algorithm to suggest unwatched videos similar to ones the user watched and liked.
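A stylized sketch of item-item similarity of this general kind is below. The ratings matrix and titles are invented, and Netflix's production algorithms are of course far richer than a single cosine-similarity score.

```python
# Score unwatched titles by their cosine similarity (over user ratings) to a
# title the user watched and liked, then recommend the most similar one.
import numpy as np

titles = ["Title A", "Title B", "Title C", "Title D"]
# rows = users, columns = titles; 0 means unwatched/unrated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

watched = 0                                      # the user watched and liked "Title A"
similarity = [cosine(ratings[:, watched], ratings[:, j]) for j in range(len(titles))]

candidates = [j for j in range(len(titles)) if j != watched]
best = max(candidates, key=lambda j: similarity[j])
print("Because you watched", titles[watched], "->", titles[best])
```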
There are so many interesting implications of these kinds of recommender systems for economics in general, and for behavioral economics in particular. For example, Netflix wants its members to "understand why it (a recommended video) might be of interest." This is, at bottom, a question about interpretability of AI output, how a member learns from recommender successes and errors, and whether a member then "trusts" Netflix in general. All these are psychological processes that may also depend heavily on design and experience features (UD, UX).
Artificial Intelligence "Hurts."21 Another feature of AI-driven personalization is price discrimination. If people do know a lot about what they want, and have precise willingness-to-pay (WTP), then companies will quickly develop the capacity to personalize prices too. This seems to be a concept that is emerging rapidly and desperately needs to be studied by industrial economists who can figure out the welfare implications.
Behavioral economics can play a role by using evidence about how people make judgments about fairness of prices (e.g., Kahneman, Knetsch, and Thaler 1986), whether fairness norms adapt to "personalized pricing," and how fairness judgments influence behavior.
My intuition (echoing Kahneman, Knetsch, and Thaler 1986) is that people can come to accept a high degree of variation in prices for what is essentially the same product as long as either (a) there is very minor product differentiation,22 or (b) firms can articulate why different prices are fair. For example, price discrimination might be framed as cross-subsidy to help those who can't afford high prices.
It is also likely that personalized pricing will harm consumers who are