when the goal is semiparametric estimation or when there are a large number of covariates relative to the number of observations. Machine learning has great strengths in using data to select functional forms flexibly.
A second theme is that a key advantage of ML is that it views empirical analysis as "algorithms" that estimate and compare many alternative models. This approach contrasts with economics, where (in principle, though rarely in reality) the researcher picks a model based on principles and estimates it once. Instead, ML algorithms build in "tuning" as part of the algorithm. The tuning is essentially model selection, and in an ML algorithm that selection is data driven. This approach has a whole host of advantages, including improved performance as well as enabling researchers to be systematic and to fully describe the process by which their model was selected. Of course, cross-validation has also been used historically in economics, for example, for selecting the bandwidth for a kernel regression, but in ML it is viewed as a fundamental part of the algorithm.
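The bandwidth-selection example can be made concrete with a short sketch. Everything here is an illustrative assumption of mine rather than anything from the chapter: the simulated data, the Gaussian kernel, and the candidate bandwidth grid are all invented. The sketch picks the bandwidth for a Nadaraya-Watson kernel regression by leave-one-out cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a smooth signal plus noise (illustrative only).
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

def nw_predict(x0, x, y, h):
    """Nadaraya-Watson kernel regression estimate at x0 with bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)  # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

def loo_cv_error(x, y, h):
    """Leave-one-out cross-validation MSE for bandwidth h."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        pred = nw_predict(x[i], x[mask], y[mask], h)
        errs.append((y[i] - pred) ** 2)
    return np.mean(errs)

bandwidths = [0.01, 0.05, 0.1, 0.2, 0.5]
cv_errors = [loo_cv_error(x, y, h) for h in bandwidths]
best_h = bandwidths[int(np.argmin(cv_errors))]
```

Here the data themselves adjudicate between undersmoothing (tiny bandwidth, noisy fit) and oversmoothing (large bandwidth, biased fit), which is exactly the model-selection role that ML treats as a built-in part of the algorithm.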
A third, closely related theme is that "outsourcing" model selection to an algorithm works very well when the problem is "simple"—for example, prediction and classification tasks, where the performance of a model can be evaluated by looking at goodness of fit in a held-out test set. Those are typically not the problems of greatest interest for empirical researchers in economics, who instead are concerned with causal inference, where there is typically not an unbiased estimate of the ground truth available for comparison. Thus, more work is required to apply an algorithmic approach to economic problems. The recent literature at the intersection of ML and causal inference, reviewed in this chapter, has focused on providing the conceptual framework and specific proposals for algorithms that are tailored for causal inference.
A fourth theme is that the algorithms also have to be modified to provide valid confidence intervals for estimated effects when the data is used to select the model. Many recent papers make use of techniques such as sample splitting, leave-one-out estimation, and other similar techniques to provide confidence intervals that work both in theory and in practice. The upside is that using ML can provide the best of both worlds: the model selection is data driven, systematic, and a wide range of models are considered; yet the model-selection process is fully documented, and confidence intervals take into account the entire algorithm.
Finally, the combination of ML and newly available data sets will change
economics in fairly fundamental ways ranging from new questions, to new
The Impact of Machine Learning on Economics 509
approaches, to collaboration (larger teams and interdisciplinary interaction), to a change in how involved economists are in the engineering and implementation of policies.
21.2 What Is Machine Learning and What Are Early Use Cases?
It is harder than one might think to come up with an operational definition of ML. The term can be (and has been) used broadly or narrowly; it can refer to a collection of subfields of computer science, but also to a set of topics that are developed and used across computer science, engineering, statistics, and increasingly the social sciences. Indeed, one could devote an entire article to the definition of ML, or to the question of whether the thing called ML really needed a new name other than statistics, the distinction between ML and AI, and so on. However, I will leave this debate to others and focus on a narrow, practical definition that will make it easier to distinguish ML from the econometric approaches most commonly used in applied work until very recently.1 For readers coming from a machine-learning background, it is also important to note that applied statistics and econometrics have developed a body of insights on topics ranging from causal inference to efficiency that have not yet been incorporated in mainstream machine learning, while other parts of machine learning have overlap with methods that have been used in applied statistics and social sciences for many decades.
Starting from a relatively narrow definition, machine learning is a field that develops algorithms designed to be applied to data sets, with the main areas of focus being prediction (regression), classification, and clustering or grouping tasks. These tasks are divided into two main branches, supervised and unsupervised ML. Unsupervised ML involves finding clusters of observations that are similar in terms of their covariates, and thus can be interpreted as "dimensionality reduction"; it is commonly used for video, images, and text. There are a variety of techniques available for unsupervised learning, including k-means clustering, topic modeling, community detection methods for networks, and many more. For example, the Latent Dirichlet Allocation model (Blei, Ng, and Jordan 2003) has frequently been applied to find "topics" in textual data. The output of a typical unsupervised ML model is a partition of the set of observations, where observations within each element of the partition are similar according to some metric, or a vector of probabilities or weights that describe a mixture of topics or groups that an observation might belong to. If you read in the
1. I will also focus on the most popular parts of ML; like many fields, it is possible to find researchers who define themselves as members of the field of ML doing a variety of different things, including pushing the boundaries of ML with tools from other disciplines. In this chapter I will consider such work to be interdisciplinary rather than "pure" ML, and will discuss it as such.
510 Susan Athey
newspaper that a computer scientist "discovered cats on YouTube," that might mean that they used an unsupervised ML method to partition a set of videos into groups, and when a human watches the largest group, they observe that most of the videos in it contain cats. This is referred to as "unsupervised" because there were no "labels" on any of the images in the input data; only after examining the items in each group does an observer determine that the algorithm found cats or dogs. Not all dimensionality reduction methods involve creating clusters; older methods such as principal components analysis can be used to reduce dimensionality, while modern methods include matrix factorization (finding two low-dimensional matrices whose product well approximates a larger matrix), regularization on the norm of a matrix, hierarchical Poisson factorization (in a Bayesian framework) (Gopalan, Hofman, and Blei 2015), and neural networks.
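As a concrete illustration of the clustering idea, the following sketch implements plain k-means (Lloyd's algorithm) on synthetic two-dimensional data. The data-generating process, the choice of k = 2, and the fixed iteration count are all assumptions made for the example, not anything specific to the applications above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic, well-separated groups of observations (illustrative only).
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid updates for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points,
        # keeping the old centroid if a cluster happens to be empty.
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(X, k=2)
```

With well-separated groups like these, the recovered partition matches the true groups; on real video, image, or text data one would first construct a numerical feature representation, and the number of clusters would itself typically be chosen by some criterion.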
In my view, these tools are very useful as an intermediate step in empirical work in economics. They provide a data-driven way to find similar newspaper articles, restaurant reviews, and so forth, and thus to create variables that can be used in economic analyses. These variables might be part of the construction of either outcome variables or explanatory variables, depending on the context. For example, if an analyst wishes to estimate a model of consumer demand for different items, it is common to model consumer preferences over characteristics of the items. Many items are associated with text descriptions as well as online reviews. Unsupervised learning could be used to discover items with similar product descriptions in an initial phase of finding potentially related products, and it could also be used to find subgroups of similar products. Unsupervised learning could further be used to categorize the reviews into types. An indicator for the review group could be used in subsequent analysis without the analyst having to use human judgment about the review content; the data would reveal whether a certain type of review was associated with higher consumer perceived quality, or not. An advantage of using unsupervised learning to create covariates is that the outcome data is not used at all; thus, concerns about spurious correlation between constructed covariates and the observed outcome are less problematic. Despite this, Egami et al. (2016) have argued that researchers may be tempted to fine-tune their construction of covariates by testing how they perform in terms of predicting outcomes, thus leading to spurious relationships between covariates and outcomes. They recommend the approach of sample splitting, whereby the model tuning takes place on one sample of data, and then the selected model is applied on a fresh sample of data.
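A minimal sketch of that sample-splitting recommendation follows, under assumptions invented purely for illustration: a single raw covariate, candidate covariate constructions that are threshold indicators, and a simulated outcome whose true threshold is at 6.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: a raw covariate (say, a review score) and an outcome
# that truly jumps at raw > 6.
n = 1000
raw = rng.uniform(0, 10, n)
y = (raw > 6).astype(float) + rng.normal(0, 0.5, n)

# Split the sample: tune the covariate construction on one half only.
idx = rng.permutation(n)
tune, est = idx[:n // 2], idx[n // 2:]

# Candidate constructions: the indicator 1{raw > c} for several cutoffs c.
cutoffs = [2, 4, 6, 8]

def fit_mse(c, rows):
    """MSE of predicting y from the indicator 1{raw > c} on the given rows
    (group means play the role of OLS with a single dummy regressor)."""
    z = raw[rows] > c
    pred = np.where(z, y[rows][z].mean(), y[rows][~z].mean())
    return np.mean((y[rows] - pred) ** 2)

best_c = min(cutoffs, key=lambda c: fit_mse(c, tune))

# The final estimate uses the fresh half, with the construction held fixed.
z_est = raw[est] > best_c
effect = y[est][z_est].mean() - y[est][~z_est].mean()
```

Because the cutoff is chosen using only the tuning half, the quantity computed on the fresh half is not contaminated by the specification search.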
Unsupervised learning can also be used to create outcome variables. For example, Athey, Mobius, and Pál (2017) examine the impact of Google's shutdown of Google News in Spain on the types of news consumers read. In this case, the share of news in different categories is an outcome of interest. Unsupervised learning can be used to categorize news in this type of analysis; that paper uses community detection techniques from network theory. In the absence of dimensionality reduction, it would be difficult to meaningfully summarize the impact of the shutdown on all of the different news articles consumed in the relevant time frame.
Supervised machine learning typically entails using a set of features or covariates (X) to predict an outcome (Y). When using the term prediction, it is important to emphasize that the framework focuses not on forecasting, but rather on a setting where there are some labeled observations where both X and Y are observed (the training data), and the goal is to predict outcomes (Y) in an independent test set based on the realized values of X for each unit in the test set. In other words, the goal is to construct μ̂(x), an estimator of μ(x) = E[Y | X = x], in order to do a good job predicting the true values of Y in an independent data set. The observations are assumed to be independent, and the joint distribution of X and Y in the training set is assumed to be the same as that in the test set. These are the only substantive assumptions required for most machine-learning methods to work.
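To make the notation concrete, here is a minimal sketch of constructing μ̂(x) and judging it by test-set fit. The simulated data (true μ(x) = x² with normal noise) and the k-nearest-neighbor estimator are arbitrary choices of mine for illustration, not methods singled out by the chapter.

```python
import numpy as np

rng = np.random.default_rng(3)

# Training and test sets drawn from the same joint distribution of (X, Y).
def draw(n, rng):
    X = rng.uniform(-2, 2, n)
    Y = X ** 2 + rng.normal(0, 0.2, n)  # true mu(x) = x^2
    return X, Y

X_train, Y_train = draw(500, rng)
X_test, Y_test = draw(200, rng)

def mu_hat(x0, X, Y, k=20):
    """k-nearest-neighbor estimate of E[Y | X = x0]:
    average Y over the k training points closest to x0."""
    nearest = np.argsort(np.abs(X - x0))[:k]
    return Y[nearest].mean()

# Evaluate the estimator the way the ML framework does: by its
# mean-squared prediction error on the independent test set.
preds = np.array([mu_hat(x, X_train, Y_train) for x in X_test])
test_mse = np.mean((preds - Y_test) ** 2)
```

Nothing about the construction uses the test set; its only role is to measure how well μ̂ predicts fresh draws from the same distribution.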
In the case of classification, the goal is to accurately classify observations. For example, the outcome could be the animal depicted in an image, the "features" or covariates are the pixels in the image, and the goal is to correctly classify images according to the animal depicted. A related but distinct estimation problem is to estimate Pr(Y = k | X = x) for each of the k = 1, . . . , K possible realizations of Y.
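The distinction between classifying and estimating Pr(Y = k | X = x) can be illustrated with a small sketch; the simulated data and the nearest-neighbor frequency estimator here are assumptions of mine for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data: three classes whose probabilities depend on a covariate x.
# For x in [k, k+1), class k has probability 0.8 and each other class 0.1.
n = 3000
X = rng.uniform(0, 3, n)
true_probs = np.stack(
    [np.where(np.floor(X) == k, 0.8, 0.1) for k in range(3)], axis=1
)
Y = np.array([rng.choice(3, p=p) for p in true_probs])

def class_probs(x0, X, Y, k_neighbors=200, n_classes=3):
    """Estimate Pr(Y = k | X = x0) by the class frequencies among the
    k_neighbors observations closest to x0."""
    nearest = np.argsort(np.abs(X - x0))[:k_neighbors]
    return np.bincount(Y[nearest], minlength=n_classes) / k_neighbors

p_hat = class_probs(0.5, X, Y)

# A hard classification, by contrast, would report only the argmax.
predicted_class = p_hat.argmax()
```

The probability vector is the estimation object; the classifier discards everything in it except which entry is largest.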
It is important to emphasize that the ML literature does not frame itself as solving estimation problems—so estimating μ(x) or Pr(Y = k | X = x) is not the primary goal. Instead, the goal is to achieve goodness of fit in an independent test set by minimizing deviations between actual outcomes and predicted outcomes. In applied econometrics, we often wish to understand an object like μ(x) in order to perform exercises like evaluating the impact of changing one covariate while holding others constant. This is not an explicit aim of ML modeling.
There are a variety of ML methods for supervised learning, such as regularized regression (LASSO, ridge, and elastic net), random forests, regression trees, support vector machines, neural nets, matrix factorization, and many others, such as model averaging. See Varian (2014) for an overview of some of the most popular methods and Mullainathan and Spiess (2017) for more details. (Also note that White [1992] attempted to popularize neural nets in economics in the early 1990s, but at the time they did not lead to substantial performance improvements and did not become popular in economics.) What leads us to categorize these methods as ML methods rather than traditional econometric or statistical methods? First is simply an observation: until recently, these methods were neither used in published social science research, nor taught in social science courses, while they were widely studied in the self-described ML and/or "statistical learning" literatures. One exception is ridge regression, which received some attention in economics, and LASSO had also received some attention. But from a more functional perspective, one common feature of many ML methods is that they use data-driven model selection. That is, the analyst provides the list of covariates or features, but the functional form is at least in part determined as a function of the data, rather than through a single estimation specified in advance (as is done, at least in theory, in econometrics); the method is thus better described as an algorithm that might estimate many alternative models and then select among them to maximize a criterion.
There is typically a trade-off between the expressiveness of the model (e.g., more covariates included in a linear regression) and the risk of overfitting, which occurs when the model is too rich relative to the sample size. (See Mullainathan and Spiess [2017] for more discussion of this.) In the latter case, the goodness of fit of the model when measured on the sample where the model is estimated is expected to be much better than the goodness of fit of the model when evaluated on an independent test set. The ML literature uses a variety of techniques to balance expressiveness against overfitting. The most common approach is cross-validation, whereby the analyst repeatedly estimates a model on part of the data (a "training fold") and then evaluates it on the complement (the "test fold"). The complexity of the model is selected to minimize the average of the mean-squared error of the prediction (the squared difference between the model prediction and the actual outcome) on the test folds. Other approaches used to control overfitting include averaging many different models, sometimes estimating each model on a subsample of the data (one can interpret the random forest in this way).
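The cross-validation loop just described can be sketched directly. Here the model family is ridge regression, and the simulated design (200 observations, 50 covariates, only five of them with signal) and the candidate penalty grid are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data: many covariates, few with signal, so that the
# regularization strength genuinely matters and must be tuned.
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0  # only the first five covariates affect the outcome
y = X @ beta + rng.normal(0.0, 1.0, n)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam * I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Fix the fold assignment once so every candidate penalty sees the same folds.
perm = rng.permutation(n)
folds = np.array_split(perm, 5)

def cv_mse(lam):
    """Average mean-squared prediction error across the five test folds."""
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        b = ridge_fit(X[train_idx], y[train_idx], lam)
        errs.append(np.mean((y[test_idx] - X[test_idx] @ b) ** 2))
    return np.mean(errs)

lambdas = [0.01, 1.0, 10.0, 100.0, 1000.0]
cv_errors = [cv_mse(lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]
```

The penalty that minimizes average test-fold error would then be used to refit on the full sample; an extreme penalty such as 1,000 shrinks all coefficients toward zero and should show up as a clearly larger cross-validated error.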
In contrast, in much of cross-sectional econometrics and empirical work in economics, the tradition has been that the researcher specifies one model, estimates the model on the full data set, and relies on statistical theory to estimate confidence intervals for estimated parameters. The focus is on the estimated effects rather than the goodness of fit of the model. For much empirical work in economics, the primary interest is in the estimate of a causal effect, such as the effect of a training program, a minimum wage increase, or a price increase. The researcher might check the robustness of this parameter estimate by reporting two or three alternative specifications. Researchers often check dozens or even hundreds of alternative specifications behind the scenes, but rarely report this practice because it would invalidate the confidence intervals reported (due to concerns about multiple testing and searching for specifications with the desired results). There are many disadvantages to the traditional approach, including but not limited to the fact that researchers would find it difficult to be systematic or comprehensive in checking alternative specifications, and the fact that researchers were not honest about the practice, given that they did not have a way to correct for the specification-search process. I believe that regularization and systematic model selection have many advantages over traditional approaches, and for this reason will become a standard part of empirical practice in economics. This will particularly be true as we more frequently encounter data sets with many covariates, and also as we see the advantages of being systematic about model selection. As I discuss later, however, this practice must be modified from traditional ML and in general "handled with care" when the researcher's ultimate goal is to estimate a causal effect rather than maximize goodness of fit in a test set.
To build some intuition about the difference between causal effect estimation and prediction, it can be useful to consider the widely used method of instrumental variables. Instrumental variables are used by economists when they wish to learn a causal effect, for example, the effect of a price on a firm's sales, but they only have access to observational (nonexperimental) data. An instrument in this case might be an input cost for the firm that shifts over