tional consequences, it is possible to constrain the algorithms. Generally,
algorithms can be trained to optimize objectives under constraints, and thus
it may be easier to impose societal objectives on algorithms than on subjec-
tive decisions by humans.
A third issue that arises is stability and robustness, for example, in
response to variations in samples or variations in the environment. There
are a variety of related ideas in machine learning, including domain adapta-
tion (how do you make a model trained in one environment perform well in
another environment), “transfer learning,” and others. The basic concern
is that ML algorithms do exhaustive searches across a very large number
of possible specifications looking for the best model that predicts Y based
on X. The models will find subtle relationships between X and Y, some of
which might not be stable across time or across environments. For example,
for the last few years there may be more videos of cats with pianos than
dogs with pianos. The presence of a piano in a video may thus predict cats.
However, pianos are not a fundamental feature of cats that holds across
environments, and so if a fad arises where dogs play pianos, performance
of an ML algorithm might suffer. This might not be a problem for a tech
firm that reestimates its models with fresh data daily, but predictive models
are often used over much longer time periods in industry. For example,
credit-scoring models may be held fixed, since changing them makes it hard
to assess the risk of the set of consumers who accept credit offers. Scoring
models used in medicine might be held fixed over many years. There are
many interesting methodological issues involved in finding models that
have stable performance and are robust to changing circumstances.
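To make the stability concern concrete, the following simulated sketch
(purely illustrative; the features, labels, and parameter values are all
invented) trains a classifier in an environment where a spurious feature (a
piano in the video) is strongly associated with the label (cat) and then
evaluates it after the association reverses. Accuracy degrades even though
the stable feature is unchanged.

# Hypothetical sketch: a spurious correlation that predicts well in the
# training environment but is unstable across environments.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_env(n, p_piano_given_cat):
    """Simulate videos with a stable feature (whiskers) and an unstable one (piano)."""
    is_cat = rng.integers(0, 2, n)                    # label: cat (1) vs. dog (0)
    whiskers = 0.8 * is_cat + rng.normal(0, 0.5, n)   # stable, "fundamental" signal
    piano = rng.binomial(1, np.where(is_cat == 1,
                                     p_piano_given_cat, 1 - p_piano_given_cat))
    return np.column_stack([whiskers, piano]), is_cat

# Training environment: pianos appear mostly in cat videos.
X_train, y_train = make_env(5000, p_piano_given_cat=0.9)
model = LogisticRegression().fit(X_train, y_train)

# Shifted environment: a fad arises where dogs play pianos.
X_same, y_same = make_env(5000, p_piano_given_cat=0.9)
X_shift, y_shift = make_env(5000, p_piano_given_cat=0.1)
print("accuracy, same environment:   ", model.score(X_same, y_same))
print("accuracy, shifted environment:", model.score(X_shift, y_shift))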
Another issue is that of manipulability. In the application of using mobile
data to do credit scoring, a concern is that consumers may be able to
manipulate the data observed by the loan provider (Bjorkegren and Grissen
2017).
For example, if certain behavioral patterns help a consumer get a loan, the
consumer can make it look like they have these behavioral patterns, for ex-
ample, by visiting certain areas of a city. If resources are allocated to
homes that look poor in satellite imagery, households or villages may
modify the aerial appearance of their homes to make them look poorer. An open
area for future research concerns how to constrain ML models to make them
less prone to manipulability; Athey (2017) discusses some other examples
of this.
There are also other considerations that can be brought into ML when
it is taken to the field, including computational time, the cost of collecting
and maintaining the "features" that are used in a model, and so on. For
example, technology firms sometimes make use of simplified models in order
to reduce the response time for real-time user requests for information.
Overall, my prediction is that social scientists, particularly economists,
together with computer scientists working at the intersection with social
science, will contribute heavily to defining these types of problems and
concerns formally, and to proposing solutions to them. This will not only
provide for better implementations of ML in policy, but will also provide
rich fodder for interesting research.
21.4 A New Literature on Machine Learning and Causal Inference
Despite the fascinating examples of "off-the-shelf" or slightly modified
prediction methods, in general ML prediction models are solving fundamentally
different problems from much empirical work in social science, which
instead focuses on causal inference. A prediction I have is that there will be
an active and important literature combining ML and causal inference to
create new methods, methods that harness the strengths of ML algorithms
to solve causal inference problems. In fact, it is easy to make this prediction
with confidence because the movement is already well underway. Here I will
highlight a few examples, focusing on those that illustrate a range of themes,
while emphasizing that this is not a comprehensive survey or a thorough
review.
To see the difference between prediction and causal inference, imagine
that you have a data set that contains data about prices and occupancy
rates of hotels. Prices are easy to obtain through price comparison sites,
but occupancy rates are typically not made public by hotels. Imagine first
that a hotel chain wishes to form an estimate of the occupancy rates of
competitors, based on publicly available prices. This is a prediction problem:
the goal is to get a good estimate of occupancy rates, where posted prices
and other factors (such as events in the local area, weather, and so on) are
used to predict occupancy. For such a model, you would expect to find that
higher posted prices are predictive of higher occupancy rates, since hotels
tend to raise their prices as they fill up (using yield management software).
In contrast, imagine that a hotel chain wishes to estimate how occupancy
would change if the hotel raised prices across the board (that is, if it repro-
grammed the yield management software to shift prices up by 5 percent in
every state of the world). This is a question of causal inference. Clearly, even
though prices and occupancy are positively correlated in a typical data set,
we would not conclude that raising prices would increase occupancy. It is
well known in the causal inference literature that the question about price
increases cannot be answered simply by examining historical data without
additional assumptions or structure. For example, if the hotel previously ran
randomized experiments on pricing, the data from these experiments can be
used to answer the question. More commonly, an analyst will exploit natural
experiments or instrumental variables, that is, variables that are unrelated
to factors that affect consumer demand but that shift firm costs
and thus their prices. Most of the classic supervised ML literature has little
to say about how to answer this question.
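The contrast can be made concrete with a small simulation (the
data-generating process below is invented for illustration). A demand shock
moves prices and occupancy up together, so a predictive regression assigns
price a positive coefficient even though the true causal effect of raising
prices, holding demand fixed, is negative.

# Hypothetical sketch of the hotel example: price predicts occupancy
# positively even though the causal effect of a price increase is negative.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

demand = rng.normal(0, 1, n)                     # unobserved demand shock
price = 100 + 10 * demand + rng.normal(0, 2, n)  # yield management raises
                                                 # prices when demand is high
occupancy = 50 + 8 * demand - 0.5 * price + rng.normal(0, 1, n)  # causal: -0.5

# Predictive (OLS) regression of occupancy on price:
X = np.column_stack([np.ones(n), price])
beta = np.linalg.lstsq(X, occupancy, rcond=None)[0]
print("predictive coefficient on price:", round(beta[1], 3))  # positive
print("true causal effect of price:    ", -0.5)               # negative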
To understand the gap between prediction and causal inference, recall that
the foundation of supervised ML methods is that model selection (through,
e.g., cross-validation) is carried out to optimize goodness of fit on a test
sample. A model is good if and only if it predicts outcomes well in a test
set. In contrast, a large body of econometric research builds models that
substantially reduce the goodness of fit of a model in order to estimate the
causal effect of, say, changing prices. If prices and quantities are positively
correlated in the data, any model that estimates the true causal effect
(quantity goes down if you change price) will not do as good a job fitting a test
data set that has the same joint distribution of prices and quantities as the
training data. The place where the econometric model with a causal estimate
would do better is at fitting what happens if the firm actually changes prices
at a given point in time, that is, at making counterfactual predictions when
the world changes. Techniques like instrumental variables seek to use only
some of the information that is in the data (the clean, exogenous,
experiment-like variation in price), sacrificing predictive accuracy in the
current environment
to learn about a more fundamental relationship that will help make decisions
about changing price.
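Continuing the simulated hotel example, the sketch below (again with
invented numbers) implements two-stage least squares by hand, using a cost
shifter as the instrument: a variable that moves prices but is unrelated to
demand. It recovers the negative causal effect, while the naive predictive
regression still reports a positive coefficient.

# Hypothetical sketch: instrumental variables (2SLS) recover the causal
# price effect in simulated hotel data, at a cost in predictive fit.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

demand = rng.normal(0, 1, n)   # unobserved demand shock
cost = rng.normal(0, 1, n)     # cost shifter: moves price, not demand
price = 100 + 10 * demand + 5 * cost + rng.normal(0, 2, n)
occupancy = 50 + 8 * demand - 0.5 * price + rng.normal(0, 1, n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# First stage: isolate the "clean" variation in price driven by costs.
a, b = ols(np.column_stack([ones, cost]), price)
price_hat = a + b * cost
# Second stage: regress occupancy on the instrumented price.
beta_iv = ols(np.column_stack([ones, price_hat]), occupancy)
beta_ols = ols(np.column_stack([ones, price]), occupancy)
print("2SLS estimate of price effect:  ", round(beta_iv[1], 3))   # about -0.5
print("naive OLS coefficient on price: ", round(beta_ols[1], 3))  # positive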
However, a new but rapidly growing literature is tackling the problem of
using ML methods for causal inference. This new literature takes many of
the strengths and innovations of ML methods, but applies them to causal
inference. Doing this requires changing the objective function, since the
ground truth of the causal parameter is not observed in any test set. Also,
because the truth is not observed in a test set, sta-
tistical theory plays a more important role in evaluating models, since it is
more difficult to directly assess how well a parameter estimates the truth,
even if the analyst has access to an independent test set. Indeed, this dis-
cussion highlights one of the key ways in which prediction is substantially
simpler than parameter estimation: for prediction problems, a prediction
for a given unit (given its covariates) can be summarized in a single number,
the predicted outcome, and the quality of the prediction can be evaluated
on a test set without further modeling assumptions. Although the average
squared prediction error of a model on a test set is a noisy estimate of the
expected value of the mean squared error on a random test set (due to small
sample size), the law of large numbers applies to this average and it converges
quickly to the truth as the test set size increases. Since the standard deviation
of the prediction error can also be easily estimated, it is straightforward to
evaluate predictive models without imposing additional assumptions.
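This evaluation procedure takes only a few lines of code. The sketch below
(with generic, made-up inputs) computes the test-set mean squared error and
a standard error for it directly from the squared prediction errors, with no
modeling assumptions.

# Hypothetical sketch: evaluating a predictive model on a test set requires
# nothing beyond the law of large numbers.
import numpy as np

def evaluate_predictions(y_test, y_pred):
    """Test-set MSE and a standard error computed from the squared errors."""
    sq_err = (y_test - y_pred) ** 2
    mse = sq_err.mean()
    se = sq_err.std(ddof=1) / np.sqrt(len(sq_err))
    return mse, se

# Example with made-up numbers:
rng = np.random.default_rng(3)
y_test = rng.normal(0, 1, 1000)
y_pred = y_test + rng.normal(0, 0.5, 1000)   # imperfect predictions
mse, se = evaluate_predictions(y_test, y_pred)
print(f"test MSE = {mse:.3f} +/- {1.96 * se:.3f}")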
There are a variety of different problems that can be tackled with ML
methods. An incomplete list of some that have gained early attention is
given as follows. First, we can consider the type of identification strategy
for identifying causal effects. Some that have received attention in the new
ML/causal inference literature include:
1. Treatment randomly assigned (experimental data).
2. Treatment assignment unconfounded (conditional on covariates).
3. Instrumental variables.
4. Panel data settings (including difference-in-difference designs).
5. Regression discontinuity designs.
6. Structural models of individual or firm behavior.
In each of those settings, there are different problems of interest:
1. Estimating average treatment effects (or a low-dimensional parameter
vector).
2. Estimating heterogeneous treatment effects in simple models or models
of limited complexity.
3. Estimating heterogeneous treatment effects nonparametrically.
4. Estimating optimal treatment assignment policies.
5. Identifying groups of individuals that are similar in terms of their
treatment effects.
Although the early literature is already too large to summarize all of the
contributions to each combination of identification strategy and problem
of interest, it is useful to observe that at this point there are entries in
almost all of the "boxes" associated with different identification strategies,
both for average treatment effects and heterogeneous treatment effects. Here,
I will provide a bit more detail on a few leading cases that have received a
lot of attention, in order to illustrate some key themes in the literature.
It is also useful to observe that even though the last four problems seem
closely related, they are distinct, and the methods used to solve them as well
as the issues that arise are distinct. These distinctions have not traditionally
been emphasized as much in the literature on causal inference, but they mat-
ter more in environments with data-driven model selection because each has
a different objective and the objective function can make a big difference in
determining the selected model in ML-based models. Issues of inference are
also distinct, as we will discuss further below.
21.4.1 Average Treatment Effects
A large and important branch of the literature on causal inference focuses
on estimation of average treatment effects under the unconfoundedness
assumption. This assumption requires that potential outcomes (the out-
comes a unit would experience in alternative treatment regimes) are inde-
pendent of treatment assignment, conditional on covariates. In other words,
treatment assignment is as good as random after controlling for covariates.
From the 1990s through the first decade of the twenty-first century, a
literature emerged about using semiparametric methods to estimate average
treatment effects (e.g., Bickel et al. [1993], focusing on an environment with
a fixed number of covariates that is small relative to the sample size). The
methods are semiparametric in the sense that the goal is to estimate a
low-dimensional parameter (in this case, the average treatment effect) without
making parametric assumptions about the way in which covariates affect
outcomes (e.g., Hahn 1998). (See Imbens and Wooldridge [2009] and Imbens
and Rubin [2015] for reviews.) In the middle of the first decade of the
twenty-first century, Mark van der Laan and coauthors introduced and developed
a set of methods called “targeted maximum likelihood” (van der Laan and
Rubin 2006). The idea is that maximum likelihood is used to estimate a low-
dimensional parameter vector in the presence of high-dimensional nuisance
parameters. The method allows the nuisance parameters to be estimated
with techniques that have less well-established properties or a slower con-
vergence rate. This approach can be applied to estimate an average treatment
effect parameter under a variety of identification assumptions, but impor-
tantly, it is an approach that can be used with many covariates.
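The sketch below illustrates the general idea shared by this family of
methods: machine learning estimates of high-dimensional nuisance functions
(the outcome regressions and the propensity score) are combined into an
estimate of the low-dimensional average treatment effect. For simplicity it
uses the augmented inverse-propensity-weighted (efficient score) form rather
than targeted maximum likelihood itself, which adds a targeting step, and it
omits the sample splitting used in careful implementations; the simulated
data and all names are purely illustrative.

# Hypothetical sketch: ATE under unconfoundedness by averaging the
# efficient score (AIPW), with ML estimates of the nuisance functions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def aipw_ate(X, w, y):
    """Average treatment effect via the augmented IPW (efficient score) form."""
    # Nuisance 1: conditional mean outcomes under treatment and control.
    mu1 = RandomForestRegressor(random_state=0).fit(X[w == 1], y[w == 1]).predict(X)
    mu0 = RandomForestRegressor(random_state=0).fit(X[w == 0], y[w == 0]).predict(X)
    # Nuisance 2: propensity score, clipped away from 0 and 1.
    e = RandomForestClassifier(random_state=0).fit(X, w).predict_proba(X)[:, 1]
    e = np.clip(e, 0.05, 0.95)
    # Efficient score for each unit; its average estimates the ATE.
    score = mu1 - mu0 + w * (y - mu1) / e - (1 - w) * (y - mu0) / (1 - e)
    return score.mean(), score.std(ddof=1) / np.sqrt(len(y))

# Simulated check with a known treatment effect of 2:
rng = np.random.default_rng(4)
n = 2000
X = rng.normal(0, 1, (n, 5))
e_true = 1 / (1 + np.exp(-X[:, 0]))      # assignment depends on a confounder
w = rng.binomial(1, e_true)
y = X[:, 0] + 2.0 * w + rng.normal(0, 1, n)
ate, se = aipw_ate(X, w, y)
print(f"AIPW estimate: {ate:.2f} +/- {1.96 * se:.2f}")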
An early example of the application of ML methods to causal inference in
economics (see Belloni, Chernozhukov, and Hansen 2014 and Chernozhu-
kov, Hansen, and Spindler 2015 for reviews) uses regularized regression as
an approach to deal with many potential covariates in an environment where
the outcome model is “sparse,” meaning that only a small number of covari-
ates actually affect mean outcome (but there are many observables, and
the analyst does not know which ones are important). In an environment
with unconfoundedness, some covariates are correlated with both the
treatment assignment and the outcome; if the analyst does not condition
on them, the omission of these confounders will lead to a biased estimate of
the treatment effect. Belloni, Chernozhukov, and Hansen propose a double-
selection method based on the LASSO. The LASSO is a regularized regression
procedure where a regression is estimated using an objective function
that balances in-sample goodness of fit with a penalty term that depends on
the sum of the magnitudes of the regression coefficients. This form of penalty
leads many covariates to be assigned a coefficient of zero, effectively
dropping them from the regression. The magnitude of the penalty parameter is
selected using cross-validation. The authors observe that if LASSO is used
in a regression of the outcome on both the treatment indicator and other
covariates, the coefficient on the treatment indicator will be a biased
estimate of the treatment effect, because confounders that have a weak
relationship with the outcome but a strong relationship with the treatment
assignment may be zeroed out by an algorithm whose sole objective is to
select variables that predict outcomes.
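A stylized version of double selection is sketched below (a simplification
of the actual procedure, which involves particular penalty choices and
supporting theory): covariates are kept if the LASSO selects them in either
the outcome regression or the treatment regression, and the treatment effect
is then estimated by OLS of the outcome on the treatment plus the union of
the selected covariates.

# Hypothetical sketch of double selection: keep variables that predict
# either the outcome or the treatment, then estimate the effect by OLS.
# Simplified relative to Belloni, Chernozhukov, and Hansen's procedure.
import numpy as np
from sklearn.linear_model import LassoCV

def double_selection(X, w, y):
    sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)  # predicts outcome
    sel_w = np.flatnonzero(LassoCV(cv=5).fit(X, w).coef_)  # predicts treatment
    keep = np.union1d(sel_y, sel_w)
    Z = np.column_stack([np.ones(len(y)), w, X[:, keep]])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return beta[1]  # coefficient on the treatment indicator

# Simulated check: the confounder x0 strongly predicts treatment but only
# weakly predicts the outcome, so outcome-only selection might drop it.
rng = np.random.default_rng(5)
n, p = 1000, 200
X = rng.normal(0, 1, (n, p))
w = (X[:, 0] + rng.normal(0, 1, n) > 0).astype(float)
y = 1.0 * w + 0.2 * X[:, 0] + rng.normal(0, 1, n)      # true effect = 1.0
print("double-selection estimate:", round(double_selection(X, w, y), 2))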
A variety of other methods have been proposed for combining machine
learning and traditional econometric methods for estimating average treat-
ment effects under the unconfoundedness assumption. Athey, Imbens, and
Wager (2016) propose using a method they refer to as “residual balanc-
ing,” building on work on balancing weights by Zubizarreta (2015). Their
approach is similar to a "doubly-robust" method for estimating average
treatment effects that proceeds by taking the average of the efficient score,
which
involves an estimate of the conditional mean of outcomes given covariates
as well as the inverse of the estimated propensity score; however, the residual
balancing replaces inverse propensity score weights with weights obtained
using quadratic programming, where the weights are designed to achieve
balance between the treatment and control group. The conditional mean of
outcomes is estimated using LASSO. The main result in the paper is that this
procedure is efficient and achieves the same rate of convergence as if the out-
come model was known, under a few key assumptions. The most important
assumption is that the outcome model is linear and sparse, although there
can be a large number of covariates and the analyst does not need to have
knowledge of which ones are important. The linearity assumption, while
strong, allows the key result to hold in the absence of any assumptions about
the structure of the process mapping covariates to the assignment, other