than overlap (propensity score bounded strictly between 0 and 1, which is required for identification of average treatment effects). No other approach has been proposed that is efficient without assumptions on the assignment model. In settings where the assignment model is complex, simulations show that the method works better than alternatives, without sacrificing much in terms of performance on simpler models. Complex assignment rules with many weak confounders arise commonly in technology firms, where complex models are used to map from a user's observed history to assignments of recommendations, advertisements, and so on.
More recently, Chernozhukov et al. (2017) propose “double machine
learning,” a method analogous to Robinson (1988), using a semiparametric
residual-on-residual regression as a method for estimating average treatment effects under unconfoundedness. The idea is to run a nonparametric regression of outcomes on covariates, and a second nonparametric regression of the treatment indicator on covariates; then, the residuals from the first regression are regressed on the residuals from the second regression. In Robinson (1988), the nonparametric estimator was a kernel regression; the more recent work establishes that any ML method can be used for the nonparametric regression, so long as it is consistent and converges at the rate n^{1/4}.
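To make the residual-on-residual idea concrete, the following is a minimal sketch in the spirit of Robinson (1988) and the double machine learning approach. The arrays X (covariates), W (binary treatment), and Y (outcome) are hypothetical, and the random forests stand in for any ML regressor satisfying the convergence-rate requirement; this is an illustration, not a faithful implementation of any one paper's estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def double_ml_ate(X, W, Y, n_folds=5):
    # Cross-fitted nuisance estimates of E[Y|X] and E[W|X]
    y_hat = cross_val_predict(RandomForestRegressor(), X, Y, cv=n_folds)
    w_hat = cross_val_predict(RandomForestRegressor(), X, W, cv=n_folds)
    # Residualize the outcome and the treatment indicator
    y_res, w_res = Y - y_hat, W - w_hat
    # Residual-on-residual regression (no intercept) gives the treatment effect
    tau_hat = np.sum(w_res * y_res) / np.sum(w_res ** 2)
    # Approximate standard error based on the estimator's influence function
    psi = w_res * (y_res - tau_hat * w_res) / np.mean(w_res ** 2)
    se = np.std(psi) / np.sqrt(len(Y))
    return tau_hat, se
```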
A few themes are common to the latter two approaches. One is the importance of building on the traditional literature on statistical efficiency, which provides strong guidance on what types of estimators are likely to be successful, as well as the particular advantages of doubly robust methods for average treatment effect estimation. A second theme is that orthogonalization can work very well in practice—using machine learning to flexibly estimate the relationships of outcomes, and of treatment indicators, with covariates—and then estimating average treatment effects using residualized outcomes and/or residualized treatment indicators. The intuition is that in high dimensions, mistakes in estimating nuisance parameters are likely, but working with residualized variables makes the estimation of the average treatment effect orthogonal to errors in estimating nuisance parameters. I expect that this insight will continue to be utilized in the future literature.
21.4.2 Heterogeneous Treatment Effects and Optimal Policies
Another area of active research concerns the estimation of heterogeneity in treatment effects, where here we refer to heterogeneity with respect to observed covariates. For example, if the treatment is a drug, we may be interested in how the drug's efficacy varies with individual characteristics. Athey and Imbens (2017) provides a more detailed review of a variety of questions that can be considered relating to heterogeneity; we will focus on a few here.
Treatment effect heterogeneity can be of interest either for basic scientific understanding (that can be used to design new policies or understand mechanisms), or as a means to the end of estimating treatment assignment policies that map from a user's characteristics to a treatment.
Starting with basic scientific understanding of treatment effects, another question concerns whether we wish to discover simple patterns of heterogeneity, or whether a fully nonparametric estimator for how treatment effects vary with covariates is desired. One approach to discovering simpler patterns is provided by Athey and Imbens (2016). This paper proposes to create a partition of the covariate space, and then estimate treatment effects in each element of the partition. The splitting rule optimizes for finding splits that reveal treatment effect heterogeneity. The paper also proposes sample splitting as a way to avoid the bias inherent in using the same data to discover the form of heterogeneity and to estimate the magnitude of the heterogeneity. One sample is used to construct the partition, while a second sample is used to estimate treatment effects. In this way, the confidence intervals built around the estimates on the second sample have nominal coverage no matter how many covariates there are. The intuition is that since the partition is created on an independent sample, the partition used is completely unrelated to the realizations of outcomes in the second sample. In addition, the procedure used to create the partition penalizes splits that increase the variance of the estimated treatment effects too much. This, together with cross-validation to select tree complexity, ensures that the leaves don't get too small, and thus the confidence intervals have nominal coverage.
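The following is a minimal sketch of the "honest" sample-splitting idea: one sample builds the partition, and an independent sample estimates treatment effects and confidence intervals within each leaf. For simplicity the splitting step here uses an off-the-shelf tree fit to an inverse-propensity-transformed outcome (assuming a randomized experiment with known assignment probability p), rather than the paper's own splitting criterion; X, W, Y are hypothetical arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

p = 0.5  # assumed known treatment probability in a randomized experiment
X_A, X_B, W_A, W_B, Y_A, Y_B = train_test_split(X, W, Y, test_size=0.5)

# Step 1: build the partition on sample A. The transformed outcome below has
# conditional mean equal to the conditional average treatment effect.
Y_star = W_A * Y_A / p - (1 - W_A) * Y_A / (1 - p)
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=100).fit(X_A, Y_star)

# Step 2: estimate treatment effects leaf by leaf on the held-out sample B,
# so the usual confidence intervals are valid despite the data-driven splits.
leaves = tree.apply(X_B)
for leaf in np.unique(leaves):
    idx = leaves == leaf
    treated, control = Y_B[idx & (W_B == 1)], Y_B[idx & (W_B == 0)]
    if len(treated) < 2 or len(control) < 2:
        continue
    tau = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
    print(f"leaf {leaf}: tau = {tau:.3f}, 95% CI = [{tau - 1.96*se:.3f}, {tau + 1.96*se:.3f}]")
```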
"Causal trees" have already been applied in a wide range of settings, from medicine to economic field experiments. The methods allow the researcher to discover forms of heterogeneity that were not specified in a preanalysis plan without invalidating confidence intervals. The method is also easily "interpretable," in that for each element of the partition the estimator is a traditional estimate of a treatment effect. However, it is important for researchers to recognize that even if only, say, three covariates are used to describe an element of a partition (e.g., male individuals with income between $100,000 and $120,000 and fifteen to twenty years of schooling), the averages of all other covariates will also vary across partition elements. So it is important not to draw conclusions about which covariates are not associated with treatment effect heterogeneity. This paper builds on earlier work on "model-based recursive partitioning" (Zeileis, Hothorn, and Hornik 2008), which looked at recursive partitioning for more complex models (general models estimated by maximum likelihood), but did not provide statistical properties (nor suggest the sample splitting, which is a focus of Athey and Imbens 2016). Asher et al. (2016) provide another related example of building classification trees for heterogeneity in GMM models.
In some contexts, a simple partition of the covariate space is most useful. In other contexts, it is desirable to have a fully nonparametric estimate of how treatment effects vary with covariates. In the traditional econometrics literature, this could be accomplished through kernel estimation or matching techniques; these methods have well-understood statistical properties. However, even though they work well in theory, in practice matching methods and kernel methods break down when there are more than a handful of covariates.
In Wager and Athey (forthcoming), we introduce the idea of a "causal forest." Essentially, a causal forest is the average of a large number of causal trees, where trees differ from one another due to subsampling. Conceptually, a causal forest can be thought of as a version of a nearest neighbor matching method, but one where there is a data-driven approach to determine which dimensions of the covariate space are important to match on. The main technical results in the paper establish the first asymptotic normality results for random forests used for prediction; this result is then extended to causal inference. We also propose an estimator for the variance and prove its consistency, so that confidence intervals can be constructed.
A key requirement for our results about random forests is that each individual tree is "honest"; that is, we use different data to construct a partition of the covariate space from the data used to estimate treatment effects within the leaves. That is, we use sample splitting, similar to Athey and Imbens (2016). In the context of a random forest, all of the data is used for both "model selection" and estimation, as an observation that is in the partition-building subsample for one tree may be in the treatment effect estimation sample in another tree.
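The following is a minimal sketch of the causal forest as an average of honest causal trees, each grown on a random subsample; it is a simplification of the procedure, not the paper's algorithm. The arrays X, W, Y and the target point x0 are hypothetical, and `honest_tree_predict` is a hypothetical helper standing in for fitting one honest causal tree on a subsample (as sketched above) and returning its effect estimate at x0.

```python
import numpy as np

def causal_forest_predict(X, W, Y, x0, n_trees=500, subsample_frac=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(Y)
    estimates = []
    for _ in range(n_trees):
        # Each tree sees only a random subsample of the data.
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
        # Within the subsample, half the data builds the partition and the
        # other half estimates effects in the leaves ("honesty").
        estimates.append(honest_tree_predict(X[idx], W[idx], Y[idx], x0))
    # The forest prediction is the average over trees.
    return np.mean(estimates)
```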
Athey, Tibshirani, and Wager (2017) extend the framework to analyze nonparametric parameter heterogeneity in any model where the parameter of interest can be estimated via GMM. The idea is that the random forest is used to construct a series of trees. Rather than estimating a model in the leaves of every tree, the algorithm instead extracts the weights implied by the forest. In particular, when estimating treatment effects for a particular value of X, we estimate a "local GMM" model, where observations close to X are weighted more heavily. How heavily? The weights are determined by the fraction of time an observation ended up in the same leaf as X during the forest creation stage. A subtlety in this project is that it is difficult to design general-purpose, computationally lightweight "splitting rules" for constructing partitions according to the covariates that predict parameter heterogeneity. We provide a solution to that problem and also provide a proof of asymptotic normality of estimates, as well as an estimator for confidence intervals. The paper highlights the case of instrumental variables, and how the method can be used to find heterogeneity in treatment effect parameters estimated with instrumental variables. An alternative approach to estimating parameter heterogeneity in instrumental variables models was proposed by Hartford, Lewis, and Taddy (2016), who use an approach based on neural nets. General nonparametric theory is more challenging for neural nets.
The method of Athey, Tibshirani, and Wager (2017), "generalized random forests," can be used as an alternative to "traditional" methods such as local generalized method of moments or local maximum likelihood (Tibshirani and Hastie 1987). Local methods such as local linear regression typically target a particular value of covariates, and use a kernel-weighting function to weight nearby observations more heavily when running a regression. The insight in Athey, Tibshirani, and Wager (2017) is that the random forest can be reinterpreted as a method to generate a weighting function, and the forest-based weighting function can substitute for the kernel-weighting function in a local linear estimation procedure. The advantages of the forest-weighting function are that it is data adaptive as well as model adaptive. It is data adaptive in that covariates that are important for heterogeneity in parameters of interest are given more importance in determining which observations are "nearby." It is model adaptive in that it focuses on heterogeneity in parameter estimates in a given model, rather than heterogeneity in predicting the conditional mean of outcomes as in a traditional regression forest.
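The following is a minimal sketch of the forest-as-weighting-function idea: the weight an observation receives at a target point x0 is the fraction of trees in which it falls in the same leaf as x0, and those weights replace a kernel in a local estimation step. For illustration it uses a standard regression forest as a stand-in for the generalized random forest (which instead splits on parameter heterogeneity), and a weighted residual-on-residual regression as the local estimating equation; X, W, Y, and x0 are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=20).fit(X, Y)

def forest_weights(forest, X, x0):
    # Leaf membership of every training point and of x0, for each tree
    train_leaves = forest.apply(X)                     # shape (n, n_trees)
    target_leaves = forest.apply(x0.reshape(1, -1))    # shape (1, n_trees)
    same_leaf = (train_leaves == target_leaves).astype(float)
    # Normalize within each tree, then average across trees
    same_leaf /= same_leaf.sum(axis=0, keepdims=True)
    return same_leaf.mean(axis=1)

# Use the forest weights in place of a kernel: a weighted residual-on-residual
# regression gives a treatment effect estimate that is local to x0.
alpha = forest_weights(forest, X, x0)
w_res = W - np.average(W, weights=alpha)
y_res = Y - np.average(Y, weights=alpha)
tau_x0 = np.sum(alpha * w_res * y_res) / np.sum(alpha * w_res ** 2)
```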
The insight of Athey, Tibshirani, and Wager (2017) is more general and I
expect it to reappear in other papers in this literature: anyplace in traditional
econometrics where a kernel function might have been used, ML methods
that perform better than kernels in practice may be substituted. However,
the statistical and econometric theory for the new methods needs to be estab-
lished in order to ensure that the ML-based procedure has desired properties
such as asymptotic normality of parameter estimates. Athey, Tibshirani,
and Wager (2017) does this for their generalized random forests for estimat-
ing heterogeneity in parameter estimates, and Hartford, Lewis, and Taddy
(2016) use neural nets instead of kernels for semiparametric instrumental
variables; Chernozhukov et al. (2017) does this for their generalization of
Robinson (1988) semiparametric regression models.
There are also other possible approaches to estimating conditional average treatment effects when the structure of the heterogeneity is assumed to take a simple form, or when the analyst is willing to understand treatment effects conditioning only on a subset of covariates rather than attempting to condition on all relevant covariates. Targeted maximum likelihood (van der Laan and Rubin 2006) is one approach to this; more recently, Imai and Ratkovic (2013) proposed using LASSO to uncover heterogeneous treatment effects, while Künzel et al. (2017) proposes an ML approach using "metalearners." It is important to note, however, that if there is insufficient data to estimate the impact of all relevant covariates, a model such as LASSO will tend to drop covariates (and their interactions) that are correlated with other included covariates, so that the included covariates "pick up" the impact of omitted covariates.
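The following is a minimal sketch of one simple LASSO-based search for heterogeneity via treatment-covariate interactions; it illustrates the general approach discussed above rather than any particular paper's estimator. X, W, Y are hypothetical, and, as noted above, interaction terms for covariates correlated with included ones may be dropped even when they matter.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Design matrix: main treatment effect, covariate main effects, and W*X interactions
features = np.column_stack([W, X, W[:, None] * X])
lasso = LassoCV(cv=5).fit(features, Y)

n_cov = X.shape[1]
interaction_coefs = lasso.coef_[1 + n_cov:]   # coefficients on the W*X terms
print("covariates with nonzero treatment interactions:",
      np.nonzero(interaction_coefs)[0])
```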
Finally, a motivating goal for understanding treatment effects is estimating optimal policy functions; that is, functions that map from the observable covariates of individuals to policy assignments. This problem has been recently studied in economics by, for example, Kitagawa and Tetenov (2015), who focus on estimating the optimal policy from a class of potential policies of limited complexity. The goal is to select a policy function to minimize the loss from failing to use the (infeasible) ideal policy, referred to as the "regret" of the policy. Despite the general lack of research about causal inference in the ML literature, the topic of optimal policy estimation has received some attention. However, most of the ML literature focuses on algorithmic innovations, and does not exploit insights from the causal inference literature. An exception is that a line of research has incorporated the idea of propensity score weighting or doubly robust methods, although often without much reference to the statistics and econometrics literature. Examples of papers from the ML literature focused on policy learning include Strehl et al. (2010), Dudik, Langford, and Li (2011), Li et al. (2012), Dudik et al. (2014), Li et al. (2014), Swaminathan and Joachims (2015), Jiang and Li (2016), Thomas and Brunskill (2016), and Kallus (2017). One type of result in that literature establishes bounds on the regret of the algorithm. In Athey and Wager (2017), we show how bringing in insights from semiparametric efficiency theory allows us to establish a tighter "regret bound" than the existing literature, thus narrowing down substantially the set of algorithms that might achieve the regret bound. This highlights the fact that the econometric theory literature has added value that has not been fully exploited in ML. A separate observation is that, perhaps surprisingly, the econometrics of the problem of estimating optimal policy functions within a class of potential policies of limited complexity is quite different from the problem of estimating conditional average treatment effects, although of course, the problems are related.
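The following is a minimal sketch of choosing a policy from a limited-complexity class using doubly robust (AIPW) scores, in the spirit of the policy-learning literature discussed above rather than a faithful implementation of any one paper. The inputs Y, W, X are hypothetical, and mu0_hat, mu1_hat, e_hat are assumed cross-fitted estimates of E[Y | X, W=0], E[Y | X, W=1], and the propensity score.

```python
import numpy as np

# Doubly robust score: an unbiased estimate of each unit's gain from treatment.
gamma = (mu1_hat - mu0_hat
         + W * (Y - mu1_hat) / e_hat
         - (1 - W) * (Y - mu0_hat) / (1 - e_hat))

# A limited-complexity policy class: treat everyone whose first covariate
# exceeds a threshold. Pick the threshold with the highest estimated value.
thresholds = np.quantile(X[:, 0], np.linspace(0.05, 0.95, 19))
values = [np.mean(gamma * (X[:, 0] > t)) for t in thresholds]
best_t = thresholds[int(np.argmax(values))]
print(f"estimated optimal rule: treat if X[0] > {best_t:.3f}")
```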
21.4.3 Contextual Bandits: Estimating Optimal Policies Using Adaptive Experimentation
Previously, I reviewed methods for estimating optimal policies mapping from individual covariates to treatment assignments. A growing literature based primarily in ML studies the problem of "bandits," which are algorithms that actively learn about which treatment is best. Online experimentation yields large benefits in settings where outcomes can be measured quickly and where there are many possible treatments.
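To fix ideas, the following is a minimal sketch of Thompson sampling for a multiarmed bandit with binary outcomes, one widely used form of the online experimentation surveyed in Scott (2010); the setup described in more detail in the next paragraph. The success probabilities in `true_probs` are hypothetical; in a real deployment the reward of the chosen arm would be observed from the experiment itself.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.05, 0.04, 0.08]          # unknown success rates of each treatment
successes = np.ones(len(true_probs))      # Beta(1, 1) priors for each arm
failures = np.ones(len(true_probs))

for t in range(10_000):
    # Sample a success rate for each arm from its posterior and play the best one:
    # this balances exploration (uncertain arms) against exploitation (good arms).
    sampled = rng.beta(successes, failures)
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_probs[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

# Arms that are doing badly are sampled less and less often, so most units
# end up assigned to the best candidates.
print("assignments per arm:", successes + failures - 2)
```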
In the basic bandit problem, where all units have identical covariates, the problem of "online experimentation," or "multiarmed bandits," asks how experiments should be designed to assign individuals to treatments as they arrive, using data from earlier individuals to determine the probabilities of assigning new individuals to each treatment, balancing the need for exploration against the desire for exploitation. That is, bandits balance the need to learn against the desire to avoid giving individuals suboptimal treatments. This type of online experimentation has been shown to yield reliable answers orders of magnitude faster than traditional randomized controlled trials in cases where there are many possible treatments (see, e.g., Scott 2010); the gain comes from the fact that treatments that are doing badly are effectively discarded, so that newly arriving units are instead assigned to the best candidates. When the goal is to estimate an optimal policy, it is not necessary to continue to allocate units to treatments that are fairly certain not to be