The Economics of Artificial Intelligence


by Ajay Agrawal


than overlap (propensity score bounded strictly between 0 and 1, which is required for identification of average treatment effects). No other approach has been proposed that is efficient without assumptions on the assignment model. In settings where the assignment model is complex, simulations show that the method works better than alternatives, without sacrificing much in terms of performance on simpler models. Complex assignment rules with many weak confounders arise commonly in technology firms, where complex models are used to map from a user's observed history to assignments of recommendations, advertisements, and so on.
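As a concrete illustration (not from the chapter), the overlap condition can be checked in practice by estimating propensity scores and inspecting their range; a minimal Python sketch with simulated data and an arbitrary trimming rule:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.normal(size=(n, p))
# Assignment rule driven by many weak confounders, as in the text.
true_score = 1 / (1 + np.exp(-X @ (0.2 * np.ones(p))))
W = rng.binomial(1, true_score)

# Estimate the propensity score e(x) = P(W = 1 | X = x).
e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]

# Overlap requires e(x) strictly inside (0, 1); a common (illustrative)
# diagnostic is to trim units with extreme estimated scores.
keep = (e_hat > 0.05) & (e_hat < 0.95)
print(f"propensity range: [{e_hat.min():.3f}, {e_hat.max():.3f}], kept {keep.mean():.0%}")
```

The 0.05/0.95 trimming thresholds are illustrative, not prescribed by the methods discussed here.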

524 Susan Athey

More recently, Chernozhukov et al. (2017) propose "double machine learning," a method analogous to Robinson (1988), using a semiparametric residual-on-residual regression as a method for estimating average treatment effects under unconfoundedness. The idea is to run a nonparametric regression of outcomes on covariates, and a second nonparametric regression of the treatment indicator on covariates; then, the residuals from the first regression are regressed on the residuals from the second regression. In Robinson (1988), the nonparametric estimator was a kernel regression; the more recent work establishes that any ML method can be used for the nonparametric regression, so long as it is consistent and converges at the rate n^{1/4}.
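The residual-on-residual recipe can be sketched in a few lines; this is an illustrative simulation, using random forests as the first-stage learners and scikit-learn's out-of-fold predictions as a stand-in for the cross-fitting in Chernozhukov et al. (2017):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n, p, tau = 5000, 5, 2.0                            # tau: the true treatment effect
X = rng.normal(size=(n, p))
W = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # treatment depends on X (confounding)
Y = tau * W + X[:, 0] + rng.normal(size=n)

# First stage: nonparametric regressions of Y on X and of W on X,
# with out-of-fold (cross-fitted) predictions.
rf = lambda: RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0)
m_hat = cross_val_predict(rf(), X, Y, cv=2)         # estimate of E[Y | X]
e_hat = cross_val_predict(rf(), X, W, cv=2)         # estimate of E[W | X]

# Second stage: regress outcome residuals on treatment residuals.
y_res, w_res = Y - m_hat, W - e_hat
tau_hat = (w_res @ y_res) / (w_res @ w_res)
print(f"estimated ATE: {tau_hat:.2f}")
```

The data-generating process and tuning choices are assumptions of the sketch; the point is only the two-stage structure.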

A few themes are common to the latter two approaches. One is the importance of building on the traditional literature on statistical efficiency, which provides strong guidance on what types of estimators are likely to be successful, as well as the particular advantages of doubly robust methods for average treatment effect estimation. A second theme is that orthogonalization can work very well in practice—using machine learning to estimate flexibly the relationship between outcomes and treatment indicators and covariates—and then estimating average treatment effects using residualized outcomes and/or residualized treatment indicators. The intuition is that in high dimensions, mistakes in estimating nuisance parameters are likely, but working with residualized variables makes the estimation of the average treatment effect orthogonal to errors in estimating nuisance parameters. I expect that this insight will continue to be utilized in the future literature.

21.4.2 Heterogeneous Treatment Effects and Optimal Policies

Another area of active research concerns the estimation of heterogeneity in treatment effects, where here we refer to heterogeneity with respect to observed covariates. For example, if the treatment is a drug, we can be interested in how the drug's efficacy varies with individual characteristics. Athey and Imbens (2017) provides a more detailed review of a variety of questions that can be considered relating to heterogeneity; we will focus on a few here.

Treatment effect heterogeneity can be of interest either for basic scientific understanding (that can be used to design new policies or understand mechanisms), or as a means to the end of estimating treatment assignment policies that map from a user's characteristics to a treatment.

The Impact of Machine Learning on Economics 525

Starting with basic scientific understanding of treatment effects, another question concerns whether we wish to discover simple patterns of heterogeneity, or whether a fully nonparametric estimator for how treatment effects vary with covariates is desired. One approach to discovering simpler patterns is provided by Athey and Imbens (2016). This paper proposes to create a partition of the covariate space, and then estimate treatment effects in each element of the partition. The splitting rule optimizes for finding splits that reveal treatment effect heterogeneity. The paper also proposes sample splitting as a way to avoid the bias inherent in using the same data to discover the form of heterogeneity and to estimate the magnitude of the heterogeneity. One sample is used to construct the partition, while a second sample is used to estimate treatment effects. In this way, the confidence intervals built around the estimates on the second sample have nominal coverage no matter how many covariates there are. The intuition is that since the partition is created on an independent sample, the partition used is completely unrelated to the realizations of outcomes in the second sample. In addition, the procedure used to create the partition penalizes splits that increase the variance of the estimated treatment effects too much. This, together with cross-validation to select tree complexity, ensures that the leaves don't get too small, and thus the confidence intervals have nominal coverage.
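The honest sample-splitting logic can be illustrated with a toy simulation. This sketch is not the Athey-Imbens causal-tree algorithm: an off-the-shelf regression tree on a transformed outcome stands in for their splitting rule, and the treatment is assumed randomized 50/50.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 3))
W = rng.binomial(1, 0.5, size=n)                      # randomized treatment
Y = np.where(X[:, 0] > 0, 2.0, 0.0) * W + rng.normal(size=n)

XA, XB, WA, WB, YA, YB = train_test_split(X, W, Y, test_size=0.5, random_state=0)

# Sample A: discover a partition. Under 50/50 randomization the transformed
# outcome 2 * Y * (2W - 1) has conditional mean equal to the treatment
# effect, so a shallow tree on it finds effect-relevant splits.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200, random_state=0)
tree.fit(XA, 2 * YA * (2 * WA - 1))

# Sample B (held out): estimate a difference in means within each leaf.
ests = []
for leaf in np.unique(tree.apply(XB)):
    m = tree.apply(XB) == leaf
    est = YB[m & (WB == 1)].mean() - YB[m & (WB == 0)].mean()
    ests.append(est)
    print(f"leaf {leaf}: estimated effect {est:.2f} (n = {m.sum()})")
```

Because the partition is built on sample A, the leaf-level estimates on sample B are ordinary difference-in-means estimates, which is what makes the confidence intervals valid.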

There have already been a wide range of applications of "causal trees," ranging from medicine to economic field experiments. The methods allow the researcher to discover forms of heterogeneity that were not specified in a preanalysis plan without invalidating confidence intervals. The method is also easily "interpretable," in that for each element of the partition the estimator is a traditional estimate of a treatment effect. However, it is important for researchers to recognize that even when only, say, three covariates are used to describe an element of a partition (e.g., male individuals with income between $100,000 and $120,000 and fifteen to twenty years of schooling), the averages of all covariates will vary across partition elements. So, it is important not to draw conclusions about which covariates are not associated with treatment effect heterogeneity. The causal tree approach builds on earlier work on "model-based recursive partitioning" (Zeileis, Hothorn, and Hornik 2008), which looked at recursive partitioning for more complex models (general models estimated by maximum likelihood), but did not provide statistical properties (nor suggest the sample splitting that is a focus of Athey and Imbens 2016). Asher et al. (2016) provide another related example of building classification trees for heterogeneity in GMM models.

In some contexts, a simple partition of the covariate space is most useful. In other contexts, it is desirable to have a fully nonparametric estimate of how treatment effects vary with covariates. In the traditional econometrics literature, this could be accomplished through kernel estimation or matching techniques; these methods have well-understood statistical properties. However, even though they work well in theory, in practice matching methods and kernel methods break down when there are more than a handful of covariates.

In Wager and Athey (forthcoming), we introduce the idea of a "causal forest." Essentially, a causal forest is the average of many causal trees, where trees differ from one another due to subsampling. Conceptually, a causal forest can be thought of as a version of a nearest neighbor matching method, but one where there is a data-driven approach to determine which dimensions of the covariate space are important to match on. The main technical results in this paper establish the first asymptotic normality results for random forests used for prediction; this result is then extended to causal inference. We also propose an estimator for the variance and prove its consistency, so that confidence intervals can be constructed.

A key requirement for our results about random forests is that each individual tree is "honest"; that is, we use different data to construct a partition of the covariate space from the data used to estimate treatment effects within the leaves. That is, we use sample splitting, similar to Athey and Imbens (2016). In the context of a random forest, all of the data is used for both "model selection" and estimation, as an observation that is in the partition-building subsample for one tree may be in the treatment effect estimation sample in another tree.
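A schematic toy version of this idea can be written directly: average, over many subsamples, honest trees that split on one half of the subsample and estimate on the other. This reuses an off-the-shelf tree with a transformed outcome under 50/50 randomization, not the actual causal-forest splitting rule or variance estimator.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 4000
X = rng.normal(size=(n, 3))
W = rng.binomial(1, 0.5, size=n)
Y = np.where(X[:, 0] > 0, 2.0, 0.0) * W + rng.normal(size=n)

def toy_causal_forest(x_test, B=50):
    """Average over B subsampled honest trees of the leaf-level
    difference in means at the test point x_test."""
    ests = []
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)
        split, est = idx[: len(idx) // 2], idx[len(idx) // 2 :]
        tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50)
        tree.fit(X[split], 2 * Y[split] * (2 * W[split] - 1))
        # Honest step: estimate in the leaf using the held-out half.
        m = tree.apply(X[est]) == tree.apply(x_test.reshape(1, -1))[0]
        treated, control = m & (W[est] == 1), m & (W[est] == 0)
        if treated.any() and control.any():
            ests.append(Y[est][treated].mean() - Y[est][control].mean())
    return float(np.mean(ests))

effect_pos = toy_causal_forest(np.array([1.0, 0.0, 0.0]))   # region with effect 2
effect_neg = toy_causal_forest(np.array([-1.0, 0.0, 0.0]))  # region with effect 0
print(effect_pos, effect_neg)
```

Each observation serves as splitting data in some trees and estimation data in others, which is the point made in the text.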

Athey, Tibshirani, and Wager (2017) extend the framework to analyze nonparametric parameter heterogeneity in any model where the parameter of interest can be estimated via GMM. The idea is that the random forest is used to construct a series of trees. Rather than estimating a model in the leaves of every tree, the algorithm instead extracts the weights implied by the forest. In particular, when estimating treatment effects for a particular value of X, we estimate a "local GMM" model, where observations close to X are weighted more heavily. How heavily? The weights are determined by the fraction of the time an observation ended up in the same leaf as X during the forest creation stage. A subtlety in this project is that it is difficult to design general-purpose, computationally lightweight "splitting rules" for constructing partitions according to the covariates that predict parameter heterogeneity. We provide a solution to that problem and also provide a proof of asymptotic normality of estimates, as well as an estimator for confidence intervals. The paper highlights the case of instrumental variables, and how the method can be used to find heterogeneity in treatment effect parameters estimated with instrumental variables. An alternative approach to estimating parameter heterogeneity in instrumental variables models was proposed by Hartford, Lewis, and Taddy (2016), who use an approach based on neural nets. General nonparametric theory is more challenging for neural nets.
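The forest-as-weights idea can be demonstrated with a standard regression forest: the weight on training observation i at a target point x is the (leaf-size-normalized) frequency with which i shares a leaf with x across trees. This sketch uses scikit-learn purely to generate the weights and is not the generalized random forest algorithm itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 2))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                               random_state=0).fit(X, Y)

def forest_weights(x):
    """Co-leaf frequencies between training points and target x,
    normalized per tree by leaf size, then averaged across trees."""
    train_leaves = forest.apply(X)                  # shape (n, n_trees)
    same = train_leaves == forest.apply(x.reshape(1, -1))[0]
    return (same / same.sum(axis=0)).mean(axis=1)   # shape (n,)

x0 = np.array([0.5, 0.0])
w = forest_weights(x0)
# The weighted mean of Y approximates the forest's local estimate at x0.
print(f"weighted estimate: {w @ Y:.2f}, sin(0.5) = {np.sin(0.5):.2f}")
```

In the generalized random forest, these weights would feed a local GMM problem at x0 rather than a simple weighted mean.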

The method of Athey, Tibshirani, and Wager (2017), "generalized random forests," can be used as an alternative to "traditional" methods such as local generalized method of moments or local maximum likelihood (Tibshirani and Hastie 1987). Local methods such as local linear regression typically target a particular value of covariates, and use a kernel-weighting function to weight nearby observations more heavily when running a regression. The insight in Athey, Tibshirani, and Wager (2017) is that the random forest can be reinterpreted as a method to generate a weighting function, and the forest-based weighting function can substitute for the kernel-weighting function in a local linear estimation procedure. The advantages of the forest-weighting function are that it is data adaptive as well as model adaptive. It is data adaptive in that covariates that are important for heterogeneity in parameters of interest are given more importance in determining which observations are "nearby." It is model adaptive in that it focuses on heterogeneity in parameter estimates in a given model, rather than heterogeneity in predicting the conditional mean of outcomes as in a traditional regression forest.
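For contrast, the kernel-weighting baseline the text describes is easy to write down: local linear regression at a target point, with Gaussian kernel weights on nearby observations. The data, bandwidth, and kernel are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(-2, 2, size=n)
y = x ** 2 + 0.1 * rng.normal(size=n)

def local_linear(x0, h=0.2):
    """Weighted least squares of y on (1, x - x0); the intercept is the
    local linear estimate of E[y | x = x0]."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # Gaussian kernel weights
    A = np.column_stack([np.ones(n), x - x0])
    beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
    return beta[0]

print(f"estimate at x0 = 1.0: {local_linear(1.0):.2f} (truth: 1.00)")
```

The generalized random forest replaces the kernel weights w with forest-derived weights, which adapt to the covariates that actually matter.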

The insight of Athey, Tibshirani, and Wager (2017) is more general, and I expect it to reappear in other papers in this literature: anywhere in traditional econometrics that a kernel function might have been used, ML methods that perform better than kernels in practice may be substituted. However, the statistical and econometric theory for the new methods needs to be established in order to ensure that the ML-based procedure has desired properties such as asymptotic normality of parameter estimates. Athey, Tibshirani, and Wager (2017) do this for their generalized random forests for estimating heterogeneity in parameter estimates; Hartford, Lewis, and Taddy (2016) use neural nets instead of kernels for semiparametric instrumental variables; and Chernozhukov et al. (2017) do this for their generalization of Robinson (1988) semiparametric regression models.

There are also other possible approaches to estimating conditional average treatment effects when the structure of the heterogeneity is assumed to take a simple form, or when the analyst is willing to understand treatment effects conditioning only on a subset of covariates rather than attempting to condition on all relevant covariates. Targeted maximum likelihood (van der Laan and Rubin 2006) is one approach to this; more recently, Imai and Ratkovic (2013) proposed using LASSO to uncover heterogeneous treatment effects, while Künzel et al. (2017) proposes an ML approach using "metalearners." It is important to note, however, that if there is insufficient data to estimate the impact of all relevant covariates, a model such as LASSO will tend to drop covariates (and their interactions) that are correlated with other included covariates, so that the included covariates "pick up" the impact of omitted covariates.

Finally, a motivating goal for understanding treatment effects is estimating optimal policy functions; that is, functions that map from the observable covariates of individuals to policy assignments. This problem has been recently studied in economics by, for example, Kitagawa and Tetenov (2015), who focus on estimating the optimal policy from a class of potential policies of limited complexity. The goal is to select a policy function to minimize the loss from failing to use the (infeasible) ideal policy, referred to as the "regret" of the policy. Despite the general lack of research about causal inference in the ML literature, the topic of optimal policy estimation has received some attention. However, most of the ML literature focuses on algorithmic innovations, and does not exploit insights from the causal inference literature. An exception is that a line of research has incorporated the idea of propensity score weighting or doubly robust methods, although often without much reference to the statistics and econometrics literature. Examples of papers from the ML literature focused on policy learning include Strehl et al. (2010), Dudik, Langford, and Li (2011), Li et al. (2012), Dudik et al. (2014), Li et al. (2014), Swaminathan and Joachims (2015), Jiang and Li (2016), Thomas and Brunskill (2016), and Kallus (2017). One type of result in that literature establishes bounds on the regret of the algorithm. In Athey and Wager (2017), we show how bringing in insights from semiparametric efficiency theory allows us to establish a tighter "regret bound" than the existing literature, thus narrowing down substantially the set of algorithms that might achieve the regret bound. This highlights the fact that the econometric theory literature has added value that has not been fully exploited in ML. Another observation is that, perhaps surprisingly, the econometrics of the problem of estimating optimal policy functions within a class of potential policies of limited complexity is quite different from the problem of estimating conditional average treatment effects, although of course the problems are related.
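A basic building block behind this literature is policy evaluation by inverse-propensity weighting: the value of a candidate policy is estimated from experimental data by reweighting the units whose observed treatment matches the policy's assignment. A minimal sketch with simulated, randomized data and a known assignment probability (doubly robust versions add an outcome model on top of this):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10000
X = rng.normal(size=(n, 2))
p = 0.5                                          # known randomization probability
W = rng.binomial(1, p, size=n)
Y = np.where(X[:, 0] > 0, 1.0, -1.0) * W + rng.normal(size=n)

def policy_value(pi):
    """IPW estimate of E[Y under policy pi] given a known propensity p."""
    match = (pi == W)
    return np.mean(match * Y / np.where(W == 1, p, 1 - p))

treat_all = np.ones(n, dtype=int)
treat_if_x0_positive = (X[:, 0] > 0).astype(int)
v_all = policy_value(treat_all)
v_cov = policy_value(treat_if_x0_positive)
print(v_all, v_cov)
# The covariate-based policy scores higher here: it avoids treating
# units whose effect is negative.
```

Comparing such value estimates across a limited class of candidate policies, and bounding the regret of the chosen policy, is the formal problem the papers above study.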

21.4.3 Contextual Bandits: Estimating Optimal Policies Using Adaptive Experimentation

Previously, I reviewed methods for estimating optimal policies mapping from individual covariates to treatment assignments. A growing literature based primarily in ML studies the problem of "bandits," which are algorithms that actively learn about which treatment is best. Online experimentation yields large benefits when the setting is such that it is possible to quickly measure outcomes, and when there are many possible treatments.
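A minimal Thompson-sampling bandit illustrates the mechanism (this is one standard bandit algorithm, not one prescribed by the chapter): each arm is played with probability proportional to the posterior chance it is best, so badly performing arms are quickly starved of traffic. Arm success rates here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
true_rates = [0.1, 0.3, 0.5]               # three treatments, unknown to the algorithm
successes = np.ones(3)                     # Beta(1, 1) priors on each arm
failures = np.ones(3)
pulls = np.zeros(3, dtype=int)

for t in range(2000):
    draws = rng.beta(successes, failures)  # sample one plausible rate per arm
    arm = int(np.argmax(draws))            # play the arm that currently looks best
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)   # most pulls concentrate on the best arm (index 2)
```

This exploration/exploitation trade-off is exactly the balance the text describes: learning about arms while avoiding assigning too many individuals to suboptimal treatments.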

In the basic bandit problem, where all units have identical covariates, the problem of "online experimentation," or "multiarmed bandits," asks how experiments should be designed to assign individuals to treatments as they arrive, using data from earlier individuals to determine the probabilities of assigning new individuals to each treatment, balancing the need for exploration against the desire for exploitation. That is, bandits balance the need to learn against the desire to avoid giving individuals suboptimal treatments. This type of online experimentation has been shown to yield reliable answers orders of magnitude faster than traditional randomized controlled trials in cases where there are many possible treatments (see, e.g., Scott 2010); the gain comes from the fact that treatments that are doing badly are effectively discarded, so that newly arriving units are instead assigned to the best candidates. When the goal is to estimate an optimal policy, it is not necessary to continue to allocate units to treatments that are fairly certain not to be
 
