The Economics of Artificial Intelligence


by Ajay Agrawal


time, and is unrelated to factors that shift consumers' demand for the product (such demand shifters can be referred to as "confounders" because they affect both the optimal price set by the firm and the sales of the product). The instrumental variables method essentially projects the observed prices onto the input costs, thus only making use of the variation in price that is explained by changes in input costs when estimating the impact of price on sales. It is very common to see that a predictive model (e.g., least squares regression) might have very high explanatory power (e.g., high R²), while the causal model (e.g., instrumental variables regression) might have very low explanatory power (in terms of predicting outcomes). In other words, economists typically abandon the goal of accurate prediction of outcomes in pursuit of an unbiased estimate of a causal parameter of interest.
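To make the contrast concrete, here is a minimal simulation sketch (not from the chapter; all variable names and magnitudes are illustrative). Ordinary least squares fits sales well but gives a biased price coefficient because of the unobserved demand shifter, while two-stage least squares that instruments price with an input cost recovers the causal effect at the cost of explanatory power:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Unobserved demand shifter: a confounder that raises both price and sales.
demand = rng.normal(size=n)
cost = rng.normal(size=n)  # input cost: the instrument, independent of demand
price = cost + demand + rng.normal(size=n)
sales = -2.0 * price + 4.0 * demand + rng.normal(size=n)  # true causal effect: -2

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

X = np.column_stack([np.ones(n), price])
b_ols = ols(X, sales)  # biased: the coefficient absorbs the demand confounder

# Two-stage least squares: project price onto the instrument, then regress
# sales on the fitted prices (using only cost-driven price variation).
Z = np.column_stack([np.ones(n), cost])
price_hat = Z @ ols(Z, price)
b_iv = ols(np.column_stack([np.ones(n), price_hat]), sales)

print(f"OLS price coef {b_ols[1]:+.2f}, R2 {r2(sales, X @ b_ols):.2f}")
print(f"IV  price coef {b_iv[1]:+.2f}, R2 {r2(sales, X @ b_iv):.2f}")
```

With these numbers, OLS reports a price coefficient near -0.7 with the higher R², while the IV estimate is close to the true -2 with noticeably lower R².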

Another difference derives from the key concerns in different approaches, and how those concerns are addressed. In predictive models, the key concern is the trade-off between expressiveness and overfitting, and this trade-off can be evaluated by looking at goodness of fit in an independent test set. In contrast, there are several distinct concerns for causal models. The first is whether the parameter estimates from a particular sample are spurious, that is, whether estimates arise due to sampling variation so that if a new random sample of the same size were drawn from the population, the parameter estimate would be substantially different. The typical approach to this problem in econometrics and statistics is to prove theorems about the consistency and asymptotic normality of the parameter estimates, propose approaches to estimating the variance of parameter estimates, and finally to use those results to estimate standard errors that reflect the sampling uncertainty (under the conditions of the theory). A more data-driven approach is to use bootstrapping and estimate the empirical distribution of parameter estimates across bootstrap samples. The typical ML approach of evaluating performance in a test set does not directly handle the issue of the uncertainty over parameter estimates, since the parameter of interest is not actually observed in any test set. The researcher would need to estimate the parameter again in the test set.
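The bootstrap alternative is short enough to sketch. This is a minimal version (illustrative, not the chapter's code), assuming the caller supplies a generic `estimator(y, x)` function, a hypothetical interface:

```python
import numpy as np

def bootstrap_se(y, x, estimator, n_boot=1000, seed=0):
    """Empirical standard error of an estimate across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample observations with replacement
        estimates[b] = estimator(y[idx], x[idx])
    return estimates.std(ddof=1)

# Example estimator: the slope from a univariate least squares regression.
def slope(y, x):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]
```

The spread of `estimates` across resamples approximates the sampling distribution of the parameter estimate, which a test-set fit statistic does not capture.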


A second concern is whether the assumptions required to "identify" a causal effect are satisfied, where in econometrics we say that a parameter is identified if we can learn it eventually with infinite data (where even in the limit, the data has the same structure as in the sample considered). It is well known that the causal effect of a treatment is not identified without making assumptions, assumptions that are generally not testable (that is, they cannot be rejected by looking at the data). Examples of identifying assumptions include the assumption that the treatment is randomly assigned, or that treatment assignment is "unconfounded." In some settings, these assumptions require the analyst to observe all potential "confounders" and control for them adequately; in other settings, the assumptions require that an instrumental variable is uncorrelated with the unobserved component of outcomes. In many cases it can be proven that even with a data set of infinite size, the assumptions are not testable: they cannot be rejected by looking at the data, and instead must be evaluated on substantive grounds. Justifying assumptions is one of the primary components of an observational study in applied economics. If the "identifying" assumptions are violated, estimates may be biased (in the same way) in both training data and test data. Testing assumptions usually requires additional information, like multiple experiments (designed or natural) in the data. Thus, the ML approach of evaluating performance in a test set does not address this concern at all. Instead, ML is likely to help make estimation methods more credible, while maintaining the identifying assumptions: in practice, coming up with estimation methods that give unbiased estimates of treatment effects requires flexibly modeling a variety of empirical relationships, such as the relationship between the treatment assignment and covariates. Since ML excels at data-driven model selection, it can be useful in systematizing the search for the best functional forms when implementing an estimation technique.
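As one illustration of the kind of estimation step ML can systematize, consider modeling the treatment-covariate relationship with a classifier and weighting by the estimated propensity score. This is a minimal inverse-propensity-weighting sketch under unconfoundedness, not a method the chapter specifies; the logistic model is a placeholder that could be swapped for any flexible ML classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(y, t, X):
    """Average treatment effect by inverse propensity weighting, assuming
    treatment t is unconfounded given covariates X. Any classifier with a
    predict_proba method could replace the logistic regression here."""
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    e = e.clip(0.01, 0.99)  # trim extreme propensity scores for stability
    return np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
```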

Economists also build more complex models that incorporate both behavioral and statistical assumptions in order to estimate the impact of counterfactual policies that have never been used before. A classic example is McFadden's methodological work in the early 1970s (e.g., McFadden 1973) analyzing transportation choices. By imposing the behavioral assumption that consumers maximize utility when making choices, it is possible to estimate parameters of the consumer's utility function and estimate the welfare effects and market share changes that would occur when a choice is added or removed (e.g., extending the BART transportation system), or when the characteristics of the good (e.g., price) are changed. Another example with more complicated behavioral assumptions is the case of auctions. For a data set with bids from procurement auctions, the "structural" approach involves estimating a probability distribution over bidder values, and then evaluating the counterfactual effect of changing auction design (e.g., Laffont, Ossard, and Vuong 1995; Athey, Levin, and Seira 2011; Athey, Coey,


and Levin 2013; or the review by Athey and Haile 2007). For further discussions of the contrast between prediction and parameter estimation, see the recent review by Mullainathan and Spiess (2017). There is a small literature in ML referred to as "inverse reinforcement learning" (Ng and Russell 2000) that has a similar approach to the structural estimation literature in economics; this ML literature has mostly operated independently without much reference to the earlier econometric literature. The literature attempts to learn "reward functions" (utility functions) from observed behavior in dynamic settings.
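To make the McFadden-style approach concrete, here is a minimal conditional logit sketch (illustrative, not the chapter's code; the array shapes and names are assumptions). Under i.i.d. extreme-value taste shocks, choice probabilities are a softmax of systematic utilities, and maximizing the likelihood recovers the utility parameters used for counterfactuals:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, choice):
    """Conditional logit: alternative j gives utility X[i, j] @ beta + eps_ij,
    with eps i.i.d. type-I extreme value, so P(choose j) is a softmax."""
    v = X @ beta                          # (n, J) systematic utilities
    v = v - v.max(axis=1, keepdims=True)  # subtract max for numerical stability
    logp = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(choice)), choice].sum()

# X: (n, J, k) array of alternative characteristics (e.g., price, travel time);
# choice: length-n array of chosen alternative indices.
# beta_hat = minimize(neg_loglik, np.zeros(k), args=(X, choice)).x
```

With the estimated taste parameters, counterfactual market shares (e.g., after adding an alternative or changing a price) follow by recomputing the softmax probabilities.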

There are also other categories of ML models; for example, anomaly detection focuses on looking for outliers or unusual behavior and is used, for example, to detect network intrusion, fraud, or system failures. Other categories that I will return to are reinforcement learning (roughly, approximate dynamic programming) and multiarmed bandit experimentation (dynamic experimentation where the probability of selecting an arm is chosen to balance exploration and exploitation). These literatures often take a more explicitly causal perspective and thus are somewhat easier to relate to economic models, and so my general statements about the lack of focus on causal inference in ML must be qualified when discussing the literature on bandits.
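As a concrete instance of balancing exploration and exploitation, here is a minimal Thompson sampling sketch for Bernoulli-reward arms (illustrative; the chapter does not commit to a specific bandit algorithm). Each arm is played with probability equal to the posterior probability that it is the best arm:

```python
import numpy as np

def thompson_bernoulli(true_rates, horizon=10_000, seed=0):
    """Thompson sampling with Beta(1, 1) priors over each arm's success rate.
    Sampling from the posterior and playing the best draw automatically
    balances exploration (uncertain arms) and exploitation (good arms)."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    successes = np.ones(k)  # Beta prior pseudo-counts
    failures = np.ones(k)
    for _ in range(horizon):
        arm = np.argmax(rng.beta(successes, failures))  # posterior draw per arm
        reward = rng.random() < true_rates[arm]          # simulated outcome
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes - 1, failures - 1  # observed successes/failures per arm
```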

Before proceeding, it is useful to highlight one other contribution of the ML literature. The contribution is computational rather than conceptual, but it has had such a large impact that it merits a short discussion. The technique is called stochastic gradient descent (SGD), and it is used in many different types of models, including the estimation of neural networks as well as large-scale Bayesian models (e.g., Ruiz, Athey, and Blei [2017], discussed in more detail below). In short, stochastic gradient descent is a method for optimizing an objective function, such as a likelihood function or a generalized method of moments objective function, with respect to parameters. When the objective function is expensive to compute (e.g., because it requires numerical integration), stochastic gradient descent can be used. The main idea is that if the objective is the sum of terms, each term corresponding to a single observation, the gradient can be approximated by picking a single data point and using the gradient evaluated at that observation as an approximation to the average (over observations) of the gradient. This estimate of the gradient will be very noisy, but unbiased. The idea is that it is more effective to "climb a hill" taking lots of steps in a direction that is noisy but unbiased, than it is to take a small number of steps, each in the right direction, which is what happens if computational resources are focused on getting very precise estimates of the gradient of the objective at each step. Stochastic gradient descent can lead to dramatic performance improvements, and thus enable the estimation of very complex models that would be intractable using traditional approaches.
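The core loop is only a few lines. A minimal sketch, assuming the caller supplies `grad_i(theta, obs)`, the gradient of a single observation's term (a hypothetical interface for illustration):

```python
import numpy as np

def sgd(grad_i, theta0, data, lr=0.01, epochs=5, seed=0):
    """Minimize an objective that is a sum of per-observation terms by stepping
    along the gradient at one randomly drawn observation at a time: a noisy
    but unbiased estimate of the average gradient."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            theta -= lr * grad_i(theta, data[i])  # one cheap, noisy step
    return theta

# Example: least squares, where each term is (y_i - x_i @ theta)^2 / 2 and
# each observation obs is a pair (x_i, y_i).
# grad_i = lambda theta, obs: -(obs[1] - obs[0] @ theta) * obs[0]
```

The payoff is exactly the trade described above: each step is far cheaper than a full-gradient step, so many noisy steps can be taken in the time one precise step would require.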


21.3 Using Prediction Methods in Policy Analysis

21.3.1 Applications of Prediction Methods to Policy Problems in Economics

There have already been a number of successful applications of prediction methodology to policy problems. Kleinberg et al. (2015) have argued that there is a set of problems where off-the-shelf ML methods for prediction are the key part of important policy and decision problems. They use examples like deciding whether to do a hip replacement operation for an elderly patient; if you can predict based on their individual characteristics that they will die within a year, then you should not do the operation. Many Americans are incarcerated while awaiting trial; if you can predict who will show up for court, you can let more out on bail. Machine-learning algorithms are currently in use for this decision in a number of jurisdictions. Another natural example is credit scoring; an economics paper by Bjorkegren and Grissen (2017) uses ML methods to predict loan repayment using mobile phone data.

In other applications, Goel, Rao, and Shroff (2016) use ML methods to examine stop-and-frisk laws, using observables of a police incident to predict the probability that a suspect has a weapon, and they show that blacks are much less likely than whites to have a weapon conditional on observables and being frisked. Glaeser, Hillis, et al. (2016) helped cities design a contest to build a predictive model that predicted health code violations in restaurants in order to better allocate inspector resources. There is a rapidly growing literature using machine learning together with images from satellites and street maps to predict poverty, safety, and home values (see, e.g., Naik et al. 2017). As Glaeser, Kominers, et al. (2015) argue, there are a variety of applications of this type of prediction methodology. It can be used to compare outcomes over time at a very granular level, thus making it possible to assess the impact of a variety of policies and changes, such as neighborhood revitalization. More broadly, the new opportunities created by large-scale imagery and sensors may lead to new types of analyses of productivity and well-being.

Although prediction is often a large part of a resource allocation problem (there is likely to be agreement that people who will almost certainly die soon should not receive hip replacement surgery, and rich people should not receive poverty aid), Athey (2017) discusses the gap between identifying units that are at risk and those for whom intervention is most beneficial. Determining which units should receive a treatment is a causal inference question, and answering it requires different types of data than prediction. Either randomized experiments or natural experiments may be needed to estimate heterogeneous treatment effects and optimal assignment policies. In business applications, it has been common to ignore this distinction and


focus on risk identification; for example, as of 2017, the Facebook advertising optimization tool provided to advertisers optimizes for consumer clicks, but not for the causal effect of the advertisement. The distinction is often not emphasized in marketing materials and discussions in the business world, perhaps because many practitioners and engineers are not well versed in the distinction between prediction and causal inference.

  21.3.2 Additional Topics in Prediction for Policy Settings

Athey (2017) summarizes a variety of research questions that arise when prediction methods are taken into policy applications. A number of these have attracted initial attention in both ML and the social sciences, and interdisciplinary conferences and workshops have begun to explore these issues.

One set of questions concerns the interpretability of models. There are discussions of what interpretability means, and whether simpler models have advantages. Of course, economists have long understood that simple models can also be misleading. In social science data, it is typical that many attributes of individuals or locations are positively correlated: parents' education, parents' income, child's education, and so on. If we are interested in a conditional mean function and estimate $\hat{\mu}(x) = E[Y_i \mid X_i = x]$, using a simpler model that omits a subset of covariates may be misleading. In the simpler model, the relationship between the omitted covariates and outcomes is loaded onto the covariates that are included. Omitting a covariate from a model is not the same thing as controlling for it in an analysis, and it can sometimes be easier to interpret a partial effect of a covariate controlling for other factors than it is to keep in mind all of the other (omitted) factors and how they covary with those included in a model. So, simpler models can sometimes be misleading; they may seem easy to understand, but the understanding gained from them may be incomplete or wrong.
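A tiny simulation makes the loading effect visible (a sketch with illustrative numbers, not from the chapter): with two positively correlated covariates, dropping one inflates the coefficient on the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two positively correlated covariates, as in the parents' education/income example.
educ = rng.normal(size=n)
income = 0.8 * educ + 0.6 * rng.normal(size=n)
y = educ + income + rng.normal(size=n)  # both have a true coefficient of 1.0

def coefs(y, *covs):
    X = np.column_stack([np.ones(len(y)), *covs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

print(coefs(y, educ, income))  # full model: approximately [1.0, 1.0]
print(coefs(y, educ))          # income omitted: its effect loads onto educ, ~1.8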

One type of model that typically is easy to interpret and explain is a causal model. As reviewed in Imbens and Rubin (2015), the causal inference framework typically makes the estimand very precise: for example, the average effect if a treatment were applied to a particular population, the conditional average treatment effect (conditional on some observable characteristics of individuals), or the average effect of a treatment on a subpopulation such as "compliers" (those whose treatment adoption is affected by an instrumental variable). Such parameters by definition give the answer to a well-defined question, and so the magnitudes are straightforward to interpret. Key parameters of "structural" models are also straightforward to interpret: they represent parameters of consumer utility functions, elasticities of demand curves, bidder valuations in auctions, marginal costs of firms, and so on. An area for further research concerns whether there are other ways to mathematically formalize what it means for a model to be interpretable, or to analyze empirically the implications of interpretability. Yeomans, Shah, and


Kleinberg (2016) study empirically a related issue of how much people trust ML-based recommender systems, and why.

Another area that has attracted a lot of attention is the question of fairness and nondiscrimination, for example, whether algorithms will promote discrimination by gender or race when used in settings like hiring, judicial decisions, or lending. There are a number of interesting questions that can be considered. One is, how can fairness constraints be defined? What type of fairness is desired? For example, if a predictive model is used to allocate job interviews based on resumes, there are two types of errors, Type I and Type II. It is straightforward to show that it is in general impossible to equalize both Type I and Type II errors across two different categories of people (e.g., men and women), so the analyst must choose which to equalize (or both). See Kleinberg, Mullainathan, and Raghavan (2016) for further analysis and development of the inherent trade-offs in fairness in predictive algorithms.
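A minimal sketch of the bookkeeping involved (illustrative, with hypothetical inputs): compute Type I and Type II error rates by group for a score-threshold rule. When base rates differ across groups, moving the threshold trades one error rate against the other and generally cannot equalize both at once.

```python
import numpy as np

def error_rates_by_group(score, label, group, threshold):
    """Type I (false positive) and Type II (false negative) rates, per group,
    for a rule that flags units whose score is at or above the threshold."""
    out = {}
    for g in np.unique(group):
        m = group == g
        flagged = score[m] >= threshold
        type1 = np.mean(flagged[label[m] == 0])   # flagged, but truly negative
        type2 = np.mean(~flagged[label[m] == 1])  # missed, but truly positive
        out[g] = (type1, type2)
    return out
```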

Overall, the literature on this topic has grown rapidly in the last two years, and we expect that as ML algorithms are deployed in more and more contexts, the topic will continue to develop. My view is that it is more likely that ML models will help make resource allocation more rather than less fair; algorithms can absorb and effectively use a lot more information than humans, and thus are less likely than humans to rely on stereotypes. To the extent that unconstrained algorithms do have undesirable distribu-
 
