The Economics of Artificial Intelligence


by Ajay Agrawal


  The simple three-variable statistical model correlated with later success in the program more highly (.48, cross-validated) than the admissions committee’s quantitative recommendation (.19).15 The bootstrapping model of the admissions committee correlated .25.
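  To make the comparison concrete, here is a minimal sketch, in Python on synthetic data (Dawes’s admissions data are not reproduced in this chapter), of the kind of calculation involved: fit a simple three-variable linear model and report its cross-validated correlation with later success. The predictors and weights here are illustrative assumptions, not the original model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic applicants: three predictors loosely analogous to a test score,
# a GPA, and a rating of letter quality (purely illustrative).
n = 200
X = rng.normal(size=(n, 3))
# "Later success in the program": a noisy linear function of the predictors.
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(scale=1.0, size=n)

# Cross-validated predictions from the simple three-variable linear model.
pred = cross_val_predict(LinearRegression(), X, y, cv=10)
print("cross-validated correlation:", np.corrcoef(pred, y)[0, 1].round(2))

# Such a rating could also be used purely as a coarse triage filter, flagging
# the lowest-predicted applications for a quick rejection review.
cutoff = np.quantile(pred, 0.25)
print("applications flagged for rejection review:", int((pred < cutoff).sum()))
```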

  Despite Dawes’s evidence, I have never been able to convince any graduate admissions committee at either of two institutions (Penn and Caltech) to actually compute statistical ratings, even as a way to filter out applications that are likely to be certain rejections.

  Why not?

  I think the answer is that the human mind rebels against regularization and the resulting sparsity. We are born to overfit. Every AI researcher knows that including fewer variables (e.g., by giving many of them zero weights in LASSO, or by limiting tree depth in random forests) is a useful all-purpose prophylactic against overfitting a training set. But the same process seems to be unappealing in our everyday judgment.
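  The sketch below illustrates that prophylactic on synthetic data (the setup is mine, not the chapter’s): an L1-penalized LASSO gives most of a large set of candidate features exactly zero weight, and a random forest is restrained by capping tree depth.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# 50 candidate features, but only 3 actually matter.
n, p = 300, 50
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n)

# LASSO: the L1 penalty pushes most irrelevant coefficients to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzero LASSO coefficients:", int(np.sum(lasso.coef_ != 0)), "of", p)

# Random forest with limited tree depth: another way to restrain model
# complexity so the training set is not fit too closely.
rf = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0).fit(X, y)
print("forest depth cap:", rf.max_depth)
```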

  The distaste for sparsity is ironic because, in fact, the brain is built to do a massive amount of filtering of sensory information (and does so remarkably efficiently in areas where optimal efficiency can be quantified, such as vision; see Doi et al. [2012]). But people do not like to explicitly throw away information (Einhorn 1986). This is particularly true if the information is already in front of us: in the form of a PhD admissions application, or a person talking about their research in an AEA interview hotel room. It takes some combination of willpower, arrogance, or what have you, to simply ignore letters of recommendation, for example. Another force is “illusory correlation,” in which strong prior beliefs about an association bias encoding or memory so that the prior is maintained, incorrectly (Chapman and Chapman 1969; Klayman and Ha 1985).

  The poster child for misguided sparsity rebellion is the short, personal face-to-face interview used in hiring. There is a mountain of evidence that, when interviewers are untrained and do not use a structured interview format, such interviews do not predict anything about later work performance that is not better predicted by numbers (e.g., Dana, Dawes, and Peterson 2013).

  A likely example is interviewing faculty candidates with new PhDs in hotel suites at the ASSA meetings. Suppose the goal of such interviews is to predict which new PhDs will do enough terrific research, good teaching, and other kinds of service and public value to get tenure several years later at the interviewers’ home institution.

  15. Readers might guess that the quality of the econometrics for inference in some of these earlier papers is limited. For example, Dawes (1971) only used the 111 students who had been admitted to the program and stayed enrolled, so there is likely scale compression and so forth. Some of the faculty members rating those students were probably also initial raters, which could generate consistency biases.

  That predictive goal is admirable, but the brain of an untrained interviewer has more basic things on its mind. Is this person well dressed? Can they protect me if there is danger? Are they friend or foe? Do their accent and word choice sound like mine? Why are they stifling a yawn? (They’ll never get papers accepted at Econometrica if they yawn after a long, tense day of slipping on ice in Philadelphia while rushing to avoid being late to a hotel suite!)

  People who do these interviews (including me) say that we are trying to probe the candidate’s depth of understanding of their topic, how promising their planned new research is, and so forth. But what we are really evaluating is probably more like “Do they belong in my tribe?”

  While I do think such interviews are a waste of time,16 it is conceivable that they generate valid information. The problem is that interviewers may weight the wrong information (as well as overweight features that should be regularized to zero). If there is valid information about long-run tenure prospects and collegiality, the best method to capture such information is to videotape the interview, combine it with other tasks that more closely resemble work performance (e.g., have them review a difficult paper), and machine learn the heck out of that larger corpus of information.
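  As a hedged sketch of what “machine learning the heck out of” such a corpus could look like, the toy code below combines hypothetical interview-derived features with work-sample scores and fits an L1-penalized logistic regression, letting cross-validation decide how much weight the interview material deserves. Every feature, outcome, and dimension here is simulated; this is not an actual hiring model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Hypothetical candidate data: features coded from a recorded interview plus
# scores on work-like tasks (e.g., refereeing a difficult paper). Simulated.
n = 400
interview_feats = rng.normal(size=(n, 10))
work_sample_feats = rng.normal(size=(n, 5))
X = np.hstack([interview_feats, work_sample_feats])

# Simulated long-run outcome (tenure or not), driven mostly by the
# work-sample scores in this toy setup.
logit = 1.2 * work_sample_feats[:, 0] + 0.8 * work_sample_feats[:, 1] - 0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# L1-penalized logistic regression lets the data decide which interview
# features, if any, deserve nonzero weight.
model = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5)
print("cross-validated accuracy:",
      cross_val_score(model, X, y, cv=5).mean().round(2))
```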

  Another simple example of where ignoring information is counterintuitive is captured by the two modes of forecasting that Kahneman and Lovallo (1993) wrote about. They called the two modes the “inside” and “outside” view. The two views were in the context of forecasting the outcome of a project (such as writing a book, or a business investment). The inside view “focused only on a particular case, by considering the plan and its obstacles to completion, by constructing scenarios of future progress” (25). The outside view “focuses on the statistics of a class of cases chosen to be similar in relevant respects to the current one” (25).

  The outside view deliberately throws away most of the information about a specific case at hand (but keeps some information): it reduces the relevant dimensions to only those that are present in the outside-view reference class. (This is, again, a regularization that zeros out all the features that are not “similar in relevant respects.”)

  In ML terms, the outside and inside views are like different kinds of cluster analyses. The outside view parses all previous cases into K clusters; a current case belongs to one cluster or another (though there is, of course, a degree of cluster membership depending on the distance from cluster centroids). The inside view (in its extreme form) treats each case, like fingerprints and snowflakes, as unique.
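  A toy rendering of that contrast (Python, simulated project data; the features and cluster count are my own assumptions): the outside view assigns a new case to one of K clusters of past cases and forecasts with that cluster’s average outcome, while the extreme inside view is caricatured here as copying the single most similar past case.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Past projects described by a few coarse features (e.g., size, novelty),
# each with an observed outcome such as months to completion.
n, k = 300, 4
features = rng.normal(size=(n, 3))
outcomes = 12 + 3 * features[:, 0] + rng.normal(scale=2, size=n)

# Outside view: parse previous cases into K clusters, then forecast a new
# case with the mean outcome of the cluster it falls into.
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
new_case = rng.normal(size=(1, 3))
cluster = km.predict(new_case)[0]
outside_forecast = outcomes[km.labels_ == cluster].mean()

# Extreme inside view: treat the case as unique and reason only from its own
# particulars; here, caricatured as copying the single nearest past case.
nearest = np.argmin(np.linalg.norm(features - new_case, axis=1))
inside_forecast = outcomes[nearest]

print("outside-view forecast:", round(outside_forecast, 1))
print("inside-view forecast:", round(inside_forecast, 1))
```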

  16. There are many caveats, of course, to this strong claim. For example, often the school is pitching to attract a highly desirable candidate, not the other way around.


  24.3.3 Hypothesis: Human Judgment Is Like Overfitted Machine Learning

  The core idea I want to explore is that some aspects of everyday human judgment can be understood as the type of errors that would result from badly done machine learning.17 I will focus on two aspects: overconfidence and how it increases, and limited error correction.

  In both cases, I have in mind a research program that takes data on human predictions and compares them with machine-learned predictions. The next step is to deliberately re-do the machine learning badly (e.g., by failing to correct for overfitting) and see whether the impaired ML predictions have some of the properties of human ones.
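  A minimal sketch of the ML side of that research program, on simulated data standing in for the judgment task: fit one model with its complexity restrained and one deliberately overfit (no correction at all), then compare training and test accuracy for each.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Simulated prediction problem standing in for the judgments being studied.
n, p = 500, 20
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# "Well-done" ML: complexity is restrained (here via tree depth).
good = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)

# "Badly done" ML: no correction for overfitting at all.
bad = DecisionTreeClassifier(max_depth=None).fit(X_tr, y_tr)

for name, m in [("regularized", good), ("deliberately overfit", bad)]:
    print(f"{name}: train acc {m.score(X_tr, y_tr):.2f}, "
          f"test acc {m.score(X_te, y_te):.2f}")
# The overfit model's large train/test gap is the property one would then
# compare against patterns in the human predictions.
```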

  Overconfidence. In a classic study from the early days of JDM (judgment and decision-making) research, Oskamp (1965) had eight experienced clinical psychologists and twenty-four graduate and undergraduate students read material about an actual person, in four stages. The first stage was just three sentences giving basic demographics, education, and occupation. The next three stages were one and a half to two pages each about childhood, schooling, and the subject’s time in the army and beyond. There were a total of five pages of material.

  The subjects had to answer twenty-five personality questions about the subject, each with five multiple-choice answers,18 after each of the four stages of reading. All these questions had correct answers, based on other evidence about the case. Chance guessing would be 20 percent accurate.

  Oskamp learned two things: First, there was no difference in accuracy between the experienced clinicians and the students.

  Second, all the subjects were barely above chance, and accuracy did not improve as they read more material in the three stages. After just the first paragraph, their accuracy was 26 percent; after reading all five additional pages across the three stages, accuracy was 28 percent (an insignificant difference from 26 percent). However, the subjects’ subjective confidence in their accuracy rose almost linearly as they read more, from 33 percent to 53 percent.19

  17. My intuition about this was aided by Jesse Shapiro, who asked a well-crafted question pointing straight in this direction.

  18. One of the multiple-choice questions was “Kidd’s present attitude toward his mother is one of: (a) love and respect for her ideals, (b) affectionate tolerance for her foibles,” and so forth.

  19. Some other results comparing more and less experienced clinicians have also confirmed the first finding (experience does not improve accuracy much), but found that experience tends to reduce overconfidence (Goldberg 1959).

  This increase in confidence, combined with no increase in accuracy, is reminiscent of the difference between training-set and test-set accuracy in AI. As more and more variables are included in a training set, the (unpenalized) accuracy will always increase. As a result of overfitting, however, test-set accuracy will decline when too many variables are included. The resulting gap between training- and test-set accuracy will grow, much as the overconfidence in Oskamp’s subjects grew with the equivalent of more “variables” (i.e., more material on the single person they were judging).
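  That analogy can be made concrete with a short simulation (my own toy setup, not Oskamp’s data): as more and more variables are handed to an unpenalized regression, training fit keeps rising while test fit stalls or falls, so the gap grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Only the first 3 of 60 variables carry signal; the rest are noise,
# analogous to reading ever more pages about the same person.
n, p = 120, 60
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (3, 10, 30, 60):
    m = LinearRegression().fit(X_tr[:, :k], y_tr)
    print(f"{k:2d} variables: train R^2 {m.score(X_tr[:, :k], y_tr):.2f}, "
          f"test R^2 {m.score(X_te[:, :k], y_te):.2f}")
# Training fit only improves with more variables, while test fit stalls or
# declines, so the gap (the analogue of overconfidence) widens.
```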

  Overconfidence comes in different flavors. In the predictive context, we will define it as having too narrow a confidence interval around a prediction. (In regression, for example, this means underestimating the standard error of a conditional prediction P(Y|X) based on observables X.) My hypothesis is that human overconfidence results from a failure to winnow the set of predictors (as in LASSO penalties for feature weights). Overconfidence of this type is a consequence of not anticipating overfitting. High training-set accuracy corresponds to confidence about predictions. Overconfidence is a failure to anticipate the drop in accuracy from training to test.
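  A small sketch of that failure, again on synthetic data of my own: if the interval width is judged from how well the unwinnowed model fits the training sample, the nominal 95 percent intervals cover far fewer than 95 percent of new cases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)

n, p = 100, 40  # many predictors, none winnowed
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

m = LinearRegression().fit(X_tr, y_tr)
# Naive interval width taken from training residuals: the analogue of judging
# one's confidence from how well the story fits the evidence already in hand.
sd_train = np.std(y_tr - m.predict(X_tr))
pred_te = m.predict(X_te)
covered = np.mean(np.abs(y_te - pred_te) < 2 * sd_train)
print(f"nominal ~95% intervals cover only {covered:.0%} of test outcomes")
```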

  Limited Error Correction. In some ML procedures, training takes place over trials. For example, the earliest neural networks were trained by making output predictions based on a set of node weights, then back-propagating prediction errors to adjust the weights. Early contributions intended this process to correspond to human learning: for example, how children learn to recognize categories of natural objects or learn properties of language (e.g., Rumelhart and McClelland 1986).
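  For readers who have not seen the procedure spelled out, here is a bare-bones version in plain numpy (a toy two-layer network on the XOR task, my own illustration rather than any historical model): predictions are made from the current node weights, and the prediction errors are propagated backward to adjust those weights.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy task (XOR) and a tiny two-layer network.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: predictions from the current node weights.
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # Backward pass: propagate the prediction error to adjust the weights.
    err = y - out
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 += 0.5 * h.T @ d_out
    W1 += 0.5 * X.T @ d_h

print(np.round(out.ravel(), 2))  # typically close to the XOR targets 0, 1, 1, 0
```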

  One can then ask whether some aspects of adult human judgment correspond to poor implementation of error correction. An invisible assumption that is, of course, part of neural network training is that output errors are recognized (if learning is supervised by labeled data). But what if humans do not recognize error, or respond to it inappropriately?

  One maladaptive response to prediction error is to add features, particularly interaction effects. For example, suppose a college admissions director has a predictive model and thinks students who play musical instruments have good study habits and will succeed in the college. Now a student comes along who plays drums in the Dead Milkmen punk band. The student gets admitted (because playing music is a good feature), but struggles in college and drops out.

  The admissions director could back-propagate the predictive error to adjust the weight on the “plays music” feature. Or she could create a new feature by splitting “plays music” into “plays drums” and “plays nondrums” and ignore the error. This procedure will generate too many features and will not use error correction effectively.20
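  A stylized version of that maladaptive response, on synthetic admissions-like data of my own: each time the model misclassifies a training case, a new indicator “subfeature” is carved out for that specific case instead of adjusting the existing weights, so the feature count balloons while out-of-sample accuracy does not improve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)

# Synthetic applicants: a few base features (one might be "plays music").
n = 300
X = rng.normal(size=(n, 4))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Error-correcting response: fit, letting the existing weights carry the load.
base = LogisticRegression().fit(X_tr, y_tr)

# Feature-splitting response: for each training error, add an indicator
# column that singles out that one case (a new ad hoc subfeature).
errors = np.where(base.predict(X_tr) != y_tr)[0]
X_tr_split = np.hstack([X_tr] + [(np.arange(len(y_tr)) == i).astype(float)[:, None]
                                 for i in errors])
# Test cases never activate these case-specific indicators.
X_te_split = np.hstack([X_te, np.zeros((len(y_te), len(errors)))])
split = LogisticRegression(max_iter=1000).fit(X_tr_split, y_tr)

print("features after splitting:", X_tr_split.shape[1], "vs", X_tr.shape[1])
print("test accuracy, weight adjustment: %.2f" % base.score(X_te, y_te))
print("test accuracy, feature splitting: %.2f" % split.score(X_te_split, y_te))
```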

  20. Another way to model this is as the refinement of a prediction tree, where branches are added for new features when predictions are incorrect. This will generate a bushy tree, which generally harms test-set accuracy.

  Furthermore, note that a different admissions director might create two different subfeatures, “plays music in a punk band” and “plays nonpunk music.” In the stylized version of this description, both will become convinced that they have improved their mental models and will retain high confidence about future predictions. But their inter-rater reliability will have gone down, because they “improved” their models in different ways. Inter-rater reliability puts a hard upper bound on how good average predictive accuracy can be. Finally, note that even if human experts are mediocre at feature selection or create too many interaction effects (which ML regularizes away), they are often more rapid than novices (for a remarkable study of actual admissions decisions, see Johnson 1980, 1988). The process they use is rapid, but the predictive performance is not so impressive. But AI algorithms are even faster.
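  The claim above that inter-rater reliability bounds predictive accuracy can be stated with the classical attenuation inequality from psychometrics (standard background, not a result from this chapter):

```latex
% Attenuation bound: a rating X with inter-rater reliability r_{XX} can
% correlate with an outcome Y (measured with reliability r_{YY}) at most up to
\[
  r_{XY} \;\le\; \sqrt{r_{XX}\, r_{YY}} ,
\]
% so weak agreement between the two admissions directors (a small r_{XX})
% caps the predictive accuracy of their average rating.
```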

  24.4 AI Technology as a Bionic Patch, or Malware, for Human Limits

  We spend a lot of time in behavioral economics thinking about how political and economic systems either exploit bad choices or help people make good choices. What behavioral economics has to offer to this general discussion is to specify a more psychologically accurate model of human choice and human nature than the caricature of constrained utility-maximization (as useful as it has been).

  Artificial intelligence enters by creating better tools for making inferences about what a person wants and what a person will do. Sometimes these tools will hurt and sometimes they will help.

  Artificial Intelligence Helps. A clear example is recommender systems. Recommender systems use previous data on a target person’s choices and ex post quality ratings, as well as data on many other people, possible choices, and ratings, to predict how well the target person will like a choice they have not made before (and may not even know exists, such as movies or books they haven’t heard of). Recommender systems are a behavioral prosthetic to remedy human limits on attention and memory and the resulting incompleteness of preferences.

  Consider Netflix movie recommendations. Netflix uses a person’s viewing and ratings history, as well as opinions of others and movie properties, as inputs to a variety of algorithms to suggest what content to watch. As their data scientists explained (Gomez-Uribe and Hunt 2016):

    a typical Netflix member loses interest after perhaps 60 to 90 seconds of choosing, having reviewed 10 to 20 titles (perhaps 3 in detail) on one or two screens. . . . The recommender problem is to make sure that on those two screens each member in our diverse pool will find something compelling to view, and will understand why it might be of interest.

  For example, their “Because You Watched” recommender line uses a video-video similarity algorithm to suggest unwatched videos similar to ones the user watched and liked.
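  The chapter does not describe Netflix’s actual implementation, but a “Because You Watched” style suggestion can be sketched with a generic video-video cosine similarity computed from a user-by-video ratings matrix (toy numbers below; production systems are far richer).

```python
import numpy as np

# Toy user-by-video ratings matrix (0 = unwatched/unrated).
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 0, 0, 1],
    [0, 0, 5, 4, 0],
    [1, 0, 4, 5, 0],
    [0, 1, 0, 0, 5],
], dtype=float)

# Video-video cosine similarity computed from the columns.
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)

# "Because you watched" video 0: rank the other videos by similarity,
# skipping ones this user has already rated.
user, watched_video = 0, 0
candidates = [(v, sim[watched_video, v]) for v in range(ratings.shape[1])
              if v != watched_video and ratings[user, v] == 0]
candidates.sort(key=lambda t: -t[1])
print("suggested next videos (id, similarity):", candidates)
```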

  There are so many interesting implications of these kinds of recommender systems for economics in general, and for behavioral economics in particular. For example, Netflix wants its members to “understand why it (a recommended video) might be of interest.” This is, at bottom, a question about interpretability of AI output, how a member learns from recommender successes and errors, and whether a member then “trusts” Netflix in general. All these are psychological processes that may also depend heavily on design and experience features (UD, UX).

  Artificial Intelligence “Hurts.”21 Another feature of AI-driven personalization is price discrimination. If people do know a lot about what they want, and have precise willingness-to-pay (WTP), then companies will quickly develop the capacity to personalize prices too. This practice seems to be emerging rapidly and desperately needs to be studied by industrial economists who can figure out the welfare implications.
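  To fix ideas about what is at stake, here is a deliberately simple numerical sketch of my own (not from the chapter) comparing one posted price with prices personalized to a noisy AI estimate of each consumer’s willingness to pay; in this toy market, personalization raises profit and sharply cuts consumer surplus.

```python
import numpy as np

rng = np.random.default_rng(9)

# Heterogeneous willingness to pay, unit cost, and a noisy AI estimate of WTP.
n, cost = 10_000, 2.0
wtp = rng.uniform(2.0, 12.0, size=n)
wtp_hat = wtp + rng.normal(scale=0.5, size=n)   # firm's prediction of WTP

# Uniform pricing: one posted price chosen to maximize profit on a grid.
grid = np.linspace(2.0, 12.0, 101)
profits = [(p - cost) * np.mean(wtp >= p) for p in grid]
p_star = grid[int(np.argmax(profits))]
uniform_profit = (p_star - cost) * np.mean(wtp >= p_star)
uniform_cs = np.mean(np.where(wtp >= p_star, wtp - p_star, 0.0))

# Personalized pricing: charge each consumer just below the estimated WTP.
p_i = np.maximum(wtp_hat - 0.10, cost)
buys = wtp >= p_i
perso_profit = np.mean(np.where(buys, p_i - cost, 0.0))
perso_cs = np.mean(np.where(buys, wtp - p_i, 0.0))

print(f"uniform:      profit {uniform_profit:.2f}, consumer surplus {uniform_cs:.2f}")
print(f"personalized: profit {perso_profit:.2f}, consumer surplus {perso_cs:.2f}")
```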

  Behavioral economics can play a role by using evidence about how people make judgments about fairness of prices (e.g., Kahneman, Knetsch, and Thaler 1986), whether fairness norms adapt to “personalized pricing,” and how fairness judgments influence behavior.

  My intuition (echoing Kahneman, Knetsch, and Thaler 1986) is that people can come to accept a high degree of variation in prices for what is essentially the same product as long as either (a) there is very minor product differentiation,22 or (b) firms can articulate why different prices are fair. For example, price discrimination might be framed as a cross-subsidy to help those who can’t afford high prices.

  It is also likely that personalized pricing will harm consumers who are

 
