than overlap (propensity score bounded strictly between 0 and 1, which is required for identification of average treatment effects). No other approach has been proposed that is efficient without assumptions on the assignment model. In settings where the assignment model is complex, simulations show that the method works better than alternatives, without sacrificing much in terms of performance on simpler models. Complex assignment rules with many weak confounders arise commonly in technology firms, where complex models are used to map from a user's observed history to assignments of recommendations, advertisements, and so on.
More recently, Chernozhukov et al. (2017) propose “double machine
learning,” a method analogous to Robinson (1988), using a semiparametric
residual-on-residual regression as a method for estimating average treatment effects under unconfoundedness. The idea is to run a nonparametric regression of outcomes on covariates, and a second nonparametric regression of the treatment indicator on covariates; then, the residuals from the first regression are regressed on the residuals from the second regression. In Robinson (1988), the nonparametric estimator was a kernel regression; the more recent work establishes that any ML method can be used for the nonparametric regression, so long as it is consistent and converges at the rate n^{1/4}.
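To make the residual-on-residual idea concrete, the following is a minimal sketch in the spirit of Robinson (1988) and the double machine learning approach. The arrays X (covariates), W (binary treatment), and Y (outcome) are hypothetical, and the random forests stand in for any ML regressor satisfying the convergence-rate requirement; this is an illustration, not a faithful implementation of any one paper's estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def double_ml_ate(X, W, Y, n_folds=5):
    # Cross-fitted nuisance estimates of E[Y|X] and E[W|X]
    y_hat = cross_val_predict(RandomForestRegressor(), X, Y, cv=n_folds)
    w_hat = cross_val_predict(RandomForestRegressor(), X, W, cv=n_folds)
    # Residualize the outcome and the treatment indicator
    y_res, w_res = Y - y_hat, W - w_hat
    # Residual-on-residual regression (no intercept) gives the treatment effect
    tau_hat = np.sum(w_res * y_res) / np.sum(w_res ** 2)
    # Approximate standard error based on the estimator's influence function
    psi = w_res * (y_res - tau_hat * w_res) / np.mean(w_res ** 2)
    se = np.std(psi) / np.sqrt(len(Y))
    return tau_hat, se
```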
A few themes are common to the latter two approaches. One is the importance of building on the traditional literature on statistical efficiency, which provides strong guidance on what types of estimators are likely to be successful, as well as the particular advantages of doubly robust methods for average treatment effect estimation. A second theme is that orthogonalization can work very well in practice—using machine learning to flexibly estimate the relationships of outcomes, and of treatment indicators, with covariates—and then estimating average treatment effects using residualized outcomes and/or residualized treatment indicators. The intuition is that in high dimensions, mistakes in estimating nuisance parameters are likely, but working with residualized variables makes the estimation of the average treatment effect orthogonal to errors in estimating nuisance parameters. I expect that this insight will continue to be utilized in the future literature.
21.4.2 Heterogeneous Treatment Effects and Optimal Policies
Another area of active research concerns the estimation of heterogeneity in treatment effects, where here we refer to heterogeneity with respect to observed covariates. For example, if the treatment is a drug, we may be interested in how the drug's efficacy varies with individual characteristics. Athey and Imbens (2017) provides a more detailed review of a variety of questions that can be considered relating to heterogeneity; we will focus on a few here.
Treatment effect heterogeneity can be of interest either for basic scientific understanding (that can be used to design new policies or understand mechanisms), or as a means to the end of estimating treatment assignment policies that map from a user's characteristics to a treatment.
Starting with basic scientific understanding of treatment effects, another question concerns whether we wish to discover simple patterns of heterogeneity, or whether a fully nonparametric estimator for how treatment effects vary with covariates is desired. One approach to discovering simpler patterns is provided by Athey and Imbens (2016). This paper proposes to create a partition of the covariate space, and then estimate treatment effects in each element of the partition. The splitting rule optimizes for finding splits that reveal treatment effect heterogeneity. The paper also proposes sample splitting as a way to avoid the bias inherent in using the same data to discover the form of heterogeneity and to estimate the magnitude of the heterogeneity. One sample is used to construct the partition, while a second sample is used to estimate treatment effects. In this way, the confidence intervals built around the estimates on the second sample have nominal coverage no matter how many covariates there are. The intuition is that since the partition is created on an independent sample, the partition used is completely unrelated to the realizations of outcomes in the second sample. In addition, the procedure used to create the partition penalizes splits that increase the variance of the estimated treatment effects too much. This, together with cross-validation to select tree complexity, ensures that the leaves don't get too small, and thus the confidence intervals have nominal coverage.
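The following is a minimal sketch of the "honest" sample-splitting idea: one sample builds the partition, and an independent sample estimates treatment effects and confidence intervals within each leaf. For simplicity the splitting step here uses an off-the-shelf tree fit to an inverse-propensity-transformed outcome (assuming a randomized experiment with known assignment probability p), rather than the paper's own splitting criterion; X, W, Y are hypothetical arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

p = 0.5  # assumed known treatment probability in a randomized experiment
X_A, X_B, W_A, W_B, Y_A, Y_B = train_test_split(X, W, Y, test_size=0.5)

# Step 1: build the partition on sample A. The transformed outcome below has
# conditional mean equal to the conditional average treatment effect.
Y_star = W_A * Y_A / p - (1 - W_A) * Y_A / (1 - p)
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=100).fit(X_A, Y_star)

# Step 2: estimate treatment effects leaf by leaf on the held-out sample B,
# so the usual confidence intervals are valid despite the data-driven splits.
leaves = tree.apply(X_B)
for leaf in np.unique(leaves):
    idx = leaves == leaf
    treated, control = Y_B[idx & (W_B == 1)], Y_B[idx & (W_B == 0)]
    if len(treated) < 2 or len(control) < 2:
        continue
    tau = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
    print(f"leaf {leaf}: tau = {tau:.3f}, 95% CI = [{tau - 1.96*se:.3f}, {tau + 1.96*se:.3f}]")
```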
"Causal trees" have already been applied in a wide range of settings, from medicine to economic field experiments. The methods allow the researcher to discover forms of heterogeneity that were not specified in a preanalysis plan without invalidating confidence intervals. The method is also easily "interpretable," in that for each element of the partition the estimator is a traditional estimate of a treatment effect. However, it is important for researchers to recognize that even if only, say, three covariates are used to describe an element of a partition (e.g., male individuals with income between $100,000 and $120,000 and fifteen to twenty years of schooling), the averages of all other covariates will also vary across partition elements. So it is important not to draw conclusions about which covariates are not associated with treatment effect heterogeneity. This paper builds on earlier work on "model-based recursive partitioning" (Zeileis, Hothorn, and Hornik 2008), which looked at recursive partitioning for more complex models (general models estimated by maximum likelihood), but did not provide statistical properties (nor suggest the sample splitting, which is a focus of Athey and Imbens 2016). Asher et al. (2016) provide another related example of building classification trees for heterogeneity in GMM models.
In some contexts, a simple partition of the covariate space is most useful. In other contexts, it is desirable to have a fully nonparametric estimate of how treatment effects vary with covariates. In the traditional econometrics literature, this could be accomplished through kernel estimation or matching techniques; these methods have well-understood statistical properties. However, even though they work well in theory, in practice matching methods and kernel methods break down when there are more than a handful of covariates.
In Wager and Athey (forthcoming), we introduce the idea of a "causal forest." Essentially, a causal forest is the average of a large number of causal trees, where trees differ from one another due to subsampling. Conceptually, a causal forest can be thought of as a version of a nearest neighbor matching method, but one where there is a data-driven approach to determine which dimensions of the covariate space are important to match on. The main technical results in the paper establish the first asymptotic normality results for random forests used for prediction; this result is then extended to causal inference. We also propose an estimator for the variance and prove its consistency, so that confidence intervals can be constructed.
A key requirement for our results about random forests is that each individual tree is "honest"; that is, we use different data to construct a partition of the covariate space from the data used to estimate treatment effects within the leaves. That is, we use sample splitting, similar to Athey and Imbens (2016). In the context of a random forest, all of the data is used for both "model selection" and estimation, as an observation that is in the partition-building subsample for one tree may be in the treatment effect estimation sample in another tree.
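The following is a minimal sketch of the causal forest as an average of honest causal trees, each grown on a random subsample; it is a simplification of the procedure, not the paper's algorithm. The arrays X, W, Y and the target point x0 are hypothetical, and `honest_tree_predict` is a hypothetical helper standing in for fitting one honest causal tree on a subsample (as sketched above) and returning its effect estimate at x0.

```python
import numpy as np

def causal_forest_predict(X, W, Y, x0, n_trees=500, subsample_frac=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(Y)
    estimates = []
    for _ in range(n_trees):
        # Each tree sees only a random subsample of the data.
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
        # Within the subsample, half the data builds the partition and the
        # other half estimates effects in the leaves ("honesty").
        estimates.append(honest_tree_predict(X[idx], W[idx], Y[idx], x0))
    # The forest prediction is the average over trees.
    return np.mean(estimates)
```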
Athey, Tibshirani, and Wager (2017) extend the framework to analyze nonparametric parameter heterogeneity in any model where the parameter of interest can be estimated via GMM. The idea is that the random forest is used to construct a series of trees. Rather than estimating a model in the leaves of every tree, the algorithm instead extracts the weights implied by the forest. In particular, when estimating treatment effects for a particular value of X, we estimate a "local GMM" model, where observations close to X are weighted more heavily. How heavily? The weights are determined by the fraction of time an observation ended up in the same leaf as X during the forest creation stage. A subtlety in this project is that it is difficult to design general-purpose, computationally lightweight "splitting rules" for constructing partitions according to the covariates that predict parameter heterogeneity. We provide a solution to that problem and also provide a proof of asymptotic normality of estimates, as well as an estimator for confidence intervals. The paper highlights the case of instrumental variables, and how the method can be used to find heterogeneity in treatment effect parameters estimated with instrumental variables. An alternative approach to estimating parameter heterogeneity in instrumental variables models was proposed by Hartford, Lewis, and Taddy (2016), who use an approach based on neural nets. General nonparametric theory is more challenging for neural nets.
The method of Athey, Tibshirani, and Wager (2017), "generalized random forests," can be used as an alternative to "traditional" methods such as local generalized method of moments or local maximum likelihood (Tibshirani and Hastie 1987). Local methods such as local linear regression typically target a particular value of covariates, and use a kernel-weighting function to weight nearby observations more heavily when running a regression. The insight in Athey, Tibshirani, and Wager (2017) is that the random forest can be reinterpreted as a method to generate a weighting function, and the forest-based weighting function can substitute for the kernel-weighting function in a local linear estimation procedure. The advantages of the forest-weighting function are that it is data adaptive as well as model adaptive. It is data adaptive in that covariates that are important for heterogeneity in parameters of interest are given more importance in determining which observations are "nearby." It is model adaptive in that it focuses on heterogeneity in parameter estimates in a given model, rather than heterogeneity in predicting the conditional mean of outcomes as in a traditional regression forest.
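The following is a minimal sketch of the forest-as-weighting-function idea: the weight an observation receives at a target point x0 is the fraction of trees in which it falls in the same leaf as x0, and those weights replace a kernel in a local estimation step. For illustration it uses a standard regression forest as a stand-in for the generalized random forest (which instead splits on parameter heterogeneity), and a weighted residual-on-residual regression as the local estimating equation; X, W, Y, and x0 are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=20).fit(X, Y)

def forest_weights(forest, X, x0):
    # Leaf membership of every training point and of x0, for each tree
    train_leaves = forest.apply(X)                     # shape (n, n_trees)
    target_leaves = forest.apply(x0.reshape(1, -1))    # shape (1, n_trees)
    same_leaf = (train_leaves == target_leaves).astype(float)
    # Normalize within each tree, then average across trees
    same_leaf /= same_leaf.sum(axis=0, keepdims=True)
    return same_leaf.mean(axis=1)

# Use the forest weights in place of a kernel: a weighted residual-on-residual
# regression gives a treatment effect estimate that is local to x0.
alpha = forest_weights(forest, X, x0)
w_res = W - np.average(W, weights=alpha)
y_res = Y - np.average(Y, weights=alpha)
tau_x0 = np.sum(alpha * w_res * y_res) / np.sum(alpha * w_res ** 2)
```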
The insight of Athey, Tibshirani, and Wager (2017) is more general and I
expect it to reappear in other papers in this literature: anyplace in traditional
econometrics where a kernel function might have been used, ML methods
that perform better than kernels in practice may be substituted. However,
the statistical and econometric theory for the new methods needs to be estab-
lished in order to ensure that the ML-based procedure has desired properties
such as asymptotic normality of parameter estimates. Athey, Tibshirani,
and Wager (2017) does this for their generalized random forests for estimat-
ing heterogeneity in parameter estimates, and Hartford, Lewis, and Taddy
(2016) use neural nets instead of kernels for semiparametric instrumental
variables; Chernozhukov et al. (2017) does this for their generalization of
Robinson (1988) semiparametric regression models.
There are also other possible approaches to estimating conditional average treatment effects when the structure of the heterogeneity is assumed to take a simple form, or when the analyst is willing to understand treatment effects conditioning only on a subset of covariates rather than attempting to condition on all relevant covariates. Targeted maximum likelihood (van der Laan and Rubin 2006) is one approach to this; more recently, Imai and Ratkovic (2013) proposed using LASSO to uncover heterogeneous treatment effects, while Künzel et al. (2017) proposes an ML approach using "metalearners." It is important to note, however, that if there is insufficient data to estimate the impact of all relevant covariates, a model such as LASSO will tend to drop covariates (and their interactions) that are correlated with other included covariates, so that the included covariates "pick up" the impact of omitted covariates.
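The following is a minimal sketch of one simple LASSO-based search for heterogeneity via treatment-covariate interactions; it illustrates the general approach discussed above rather than any particular paper's estimator. X, W, Y are hypothetical, and, as noted above, interaction terms for covariates correlated with included ones may be dropped even when they matter.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Design matrix: main treatment effect, covariate main effects, and W*X interactions
features = np.column_stack([W, X, W[:, None] * X])
lasso = LassoCV(cv=5).fit(features, Y)

n_cov = X.shape[1]
interaction_coefs = lasso.coef_[1 + n_cov:]   # coefficients on the W*X terms
print("covariates with nonzero treatment interactions:",
      np.nonzero(interaction_coefs)[0])
```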
Finally, a motivating goal for understanding treatment effects is estimating optimal policy functions; that is, functions that map from the observable covariates of individuals to policy assignments. This problem has been recently studied in economics by, for example, Kitagawa and Tetenov (2015), who focus on estimating the optimal policy from a class of potential policies of limited complexity. The goal is to select a policy function to minimize the loss from failing to use the (infeasible) ideal policy, referred to as the "regret" of the policy. Despite the general lack of research about causal inference in the ML literature, the topic of optimal policy estimation has received some attention. However, most of the ML literature focuses on algorithmic innovations, and does not exploit insights from the causal inference literature. An exception is that a line of research has incorporated the idea of propensity score weighting or doubly robust methods, although often without much reference to the statistics and econometrics literature. Examples of papers from the ML literature focused on policy learning include Strehl et al. (2010), Dudik, Langford, and Li (2011), Li et al. (2012), Dudik et al. (2014), Li et al. (2014), Swaminathan and Joachims (2015), Jiang and Li (2016), Thomas and Brunskill (2016), and Kallus (2017). One type of result in that literature establishes bounds on the regret of the algorithm. In Athey and Wager (2017), we show how bringing in insights from semiparametric efficiency theory allows us to establish a tighter "regret bound" than the existing literature, thus narrowing down substantially the set of algorithms that might achieve the regret bound. This highlights the fact that the econometric theory literature has added value that has not been fully exploited in ML. A separate observation is that, perhaps surprisingly, the econometrics of the problem of estimating optimal policy functions within a class of potential policies of limited complexity is quite different from the problem of estimating conditional average treatment effects, although of course, the problems are related.
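The following is a minimal sketch of choosing a policy from a limited-complexity class using doubly robust (AIPW) scores, in the spirit of the policy-learning literature discussed above rather than a faithful implementation of any one paper. The inputs Y, W, X are hypothetical, and mu0_hat, mu1_hat, e_hat are assumed cross-fitted estimates of E[Y | X, W=0], E[Y | X, W=1], and the propensity score.

```python
import numpy as np

# Doubly robust score: an unbiased estimate of each unit's gain from treatment.
gamma = (mu1_hat - mu0_hat
         + W * (Y - mu1_hat) / e_hat
         - (1 - W) * (Y - mu0_hat) / (1 - e_hat))

# A limited-complexity policy class: treat everyone whose first covariate
# exceeds a threshold. Pick the threshold with the highest estimated value.
thresholds = np.quantile(X[:, 0], np.linspace(0.05, 0.95, 19))
values = [np.mean(gamma * (X[:, 0] > t)) for t in thresholds]
best_t = thresholds[int(np.argmax(values))]
print(f"estimated optimal rule: treat if X[0] > {best_t:.3f}")
```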
21.4.3 Contextual Bandits: Estimating Optimal Policies Using Adaptive Experimentation
Previously, I reviewed methods for estimating optimal policies mapping from individual covariates to treatment assignments. A growing literature based primarily in ML studies the problem of "bandits," which are algorithms that actively learn about which treatment is best. Online experimentation yields large benefits in settings where outcomes can be measured quickly and where there are many possible treatments.
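To fix ideas, the following is a minimal sketch of Thompson sampling for a multiarmed bandit with binary outcomes, one widely used form of the online experimentation surveyed in Scott (2010); the setup described in more detail in the next paragraph. The success probabilities in `true_probs` are hypothetical; in a real deployment the reward of the chosen arm would be observed from the experiment itself.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.05, 0.04, 0.08]          # unknown success rates of each treatment
successes = np.ones(len(true_probs))      # Beta(1, 1) priors for each arm
failures = np.ones(len(true_probs))

for t in range(10_000):
    # Sample a success rate for each arm from its posterior and play the best one:
    # this balances exploration (uncertain arms) against exploitation (good arms).
    sampled = rng.beta(successes, failures)
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_probs[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

# Arms that are doing badly are sampled less and less often, so most units
# end up assigned to the best candidates.
print("assignments per arm:", successes + failures - 2)
```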
In the basic bandit problem, where all units have identical covariates, the problem of "online experimentation," or "multiarmed bandits," asks how experiments should be designed to assign individuals to treatments as they arrive, using data from earlier individuals to determine the probabilities of assigning new individuals to each treatment, balancing the need for exploration against the desire for exploitation. That is, bandits balance the need to learn against the desire to avoid giving individuals suboptimal treatments. This type of online experimentation has been shown to yield reliable answers orders of magnitude faster than traditional randomized controlled trials in cases where there are many possible treatments (see, e.g., Scott 2010); the gain comes from the fact that treatments that are doing badly are effectively discarded, so that newly arriving units are instead assigned to the best candidates. When the goal is to estimate an optimal policy, it is not necessary to continue to allocate units to treatments that are fairly certain not to be