time, and is unrelated to factors that shift consumer's demand for the product (such demand shifters can be referred to as "confounders" because they affect both the optimal price set by the firm and the sales of the product).
The instrumental variables method essentially projects the observed prices
onto the input costs, thus only making use of the variation in price that is
explained by changes in input costs when estimating the impact of price on
sales. It is very common to see that a predictive model (e.g., least squares
regression) might have very high explanatory power (e.g., high R²), while
the causal model (e.g., instrumental variables regression) might have very
low explanatory power (in terms of predicting outcomes). In other words,
economists typically abandon the goal of accurate prediction of outcomes
in pursuit of an unbiased estimate of a causal parameter of interest.
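To make this concrete, the following is a minimal simulation sketch of the logic just described; the data-generating process and the variable names (cost, demand_shock, price, sales) are hypothetical illustrations rather than anything drawn from a particular study. An ordinary least squares regression of sales on price is contaminated by the demand shifter, while an instrumental variables estimator that uses only the variation in price explained by input costs recovers the true price effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
demand_shock = rng.normal(size=n)   # unobserved confounder shifting demand
cost = rng.normal(size=n)           # instrument: input cost, independent of demand shifters
price = cost + demand_shock + 0.5 * rng.normal(size=n)          # the firm's price responds to both
sales = -2.0 * price + 3.0 * demand_shock + rng.normal(size=n)  # true causal effect of price is -2

# Predictive (OLS) regression of sales on price: good fit, but a biased coefficient
ols = np.cov(price, sales)[0, 1] / np.var(price)

# Instrumental variables: project price onto cost and use only that variation
b_first = np.cov(cost, price)[0, 1] / np.var(cost)       # first stage: price on cost
price_hat = b_first * cost                               # part of price explained by input costs
iv = np.cov(price_hat, sales)[0, 1] / np.var(price_hat)  # second stage: sales on fitted price

print(ols, iv)  # OLS is pulled away from -2 by the demand shock; IV is close to -2
```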
Another difference derives from the key concerns in different approaches, and how those concerns are addressed. In predictive models, the key concern is the trade-off between expressiveness and overfitting, and this trade-off can be evaluated by looking at goodness of fit in an independent test set. In contrast, there are several distinct concerns for causal models. The first is whether the parameter estimates from a particular sample are spurious, that is, whether estimates arise due to sampling variation so that if a new random sample of the same size was drawn from the population, the parameter estimate would be substantially different. The typical approach to this problem in econometrics and statistics is to prove theorems about the consistency and asymptotic normality of the parameter estimates, propose approaches to estimating the variance of parameter estimates, and finally to use those results to estimate standard errors that reflect the sampling uncertainty (under the conditions of the theory). A more data-driven
approach is to use bootstrapping and estimate the empirical distribution of
parameter estimates across bootstrap samples. The typical ML approach
of evaluating performance in a test set does not directly handle the issue of
the uncertainty over parameter estimates, since the parameter of interest is
not actually observed in any test set. The researcher would need to estimate
the parameter again in the test set.
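As a rough sketch of the bootstrap approach mentioned above (with a simple regression slope standing in for the causal parameter of interest, and all data simulated for illustration), one can re-estimate the parameter on many resampled data sets and read the sampling uncertainty off the spread of the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def slope(x, y):
    # the parameter estimate whose sampling uncertainty we want to quantify
    return np.cov(x, y)[0, 1] / np.var(x)

# Empirical distribution of the estimate across bootstrap samples
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)   # resample observations with replacement
    boot.append(slope(x[idx], y[idx]))

print(slope(x, y), np.std(boot))  # point estimate and its bootstrap standard error
```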
A second concern is whether the assumptions required to "identify" a causal effect are satisfied, where in econometrics we say that a parameter is identified if we can learn it eventually with infinite data (where even in the limit, the data has the same structure as in the sample considered). It is well known that the causal effect of a treatment is not identified without making
assumptions, assumptions that are generally not testable (that is, they cannot
be rejected by looking at the data). Examples of identifying assumptions
include the assumption that the treatment is randomly assigned, or that
treatment assignment is “unconfounded.” In some settings, these assump-
tions require the analyst to observe all potential “confounders” and con-
trol for them adequately; in other settings, the assumptions require that an
instrumental variable is uncorrelated with the unobserved component of
outcomes. In many cases it can be proven that even with a data set of infinite
size, the assumptions are not testable—they cannot be rejected by looking at
the data, and instead must be evaluated on substantive grounds. Justifying
assumptions is one of the primary components of an observational study in
applied economics. If the “identifying” assumptions are violated, estimates
may be biased (in the same way) in both training data and test data. Testing
assumptions usually requires additional information, like multiple experi-
ments (designed or natural) in the data. Thus, the ML approach of evaluat-
ing performance in a test set does not address this concern at all. Instead, ML
is likely to help make estimation methods more credible, while maintaining
the identifying assumptions: in practice, coming up with estimation methods
that give unbiased estimates of treatment effects requires flexibly modeling
a variety of empirical relationships, such as the relationship between the
treatment assignment and covariates. Since ML excels at data- driven model
selection, it can be useful in systematizing the search for the best functional
forms when implementing an estimation technique.
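For concreteness, the two identifying assumptions mentioned above can be written in standard potential-outcomes notation; the symbols below (Y_i(w) for the potential outcomes, W_i for the treatment, X_i for the covariates, Z_i for the instrument, and ε_i for the unobserved component of outcomes) are introduced here only for illustration and are not defined elsewhere in this section.

```latex
% Unconfoundedness: conditional on covariates, treatment is as good as randomly assigned
\bigl(Y_i(0),\,Y_i(1)\bigr)\ \perp\ W_i \mid X_i .

% Instrument validity: the instrument moves the treatment but is uncorrelated
% with the unobserved component of outcomes
\operatorname{Cov}(Z_i, W_i) \neq 0, \qquad \mathrm{E}[\,Z_i \varepsilon_i\,] = 0 .
```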
Economists also build more complex models that incorporate both be-
havioral and statistical assumptions in order to estimate the impact of coun-
terfactual policies that have never been used before. A classic example is
McFadden’s methodological work in the early 1970s (e.g., McFadden 1973)
analyzing transportation choices. By imposing the behavioral assumption
that consumers maximize utility when making choices, it is possible to esti-
mate parameters of the consumer’s utility function and estimate the welfare
effects and market share changes that would occur when a choice is added
or removed (e.g., extending the BART transportation system), or when
the characteristics of the good (e.g., price) are changed. Another example
with more complicated behavioral assumptions is the case of auctions. For
a data set with bids from procurement auctions, the “structural” approach
involves estimating a probability distribution over bidder values, and then
evaluating the counterfactual effect of changing auction design (e.g., Laf-
font, Ossard, and Vuong 1995; Athey, Levin, and Seira 2011; Athey, Coey,
and Levin 2013; or the review by Athey and Haile 2007). For further discus-
sions of the contrast between prediction and parameter estimation, see the
recent review by Mullainathan and Spiess (2017). There is a small litera-
ture in ML referred to as “inverse reinforcement learning” (Ng and Russell
2000) that has a similar approach to the structural estimation literature in economics; this ML literature has mostly operated independently without
much reference to the earlier econometric literature. The literature attempts
to learn “reward functions” (utility functions) from observed behavior in
dynamic settings.
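To illustrate the structural approach in the simplest possible way, the sketch below estimates a McFadden-style conditional logit on simulated choice data and then uses the fitted utility parameters to compute counterfactual market shares after a change in one product characteristic. The data, the number of alternatives, and the counterfactual are all hypothetical; real applications are far richer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, J, k = 5_000, 3, 2                        # consumers, alternatives, characteristics
X = rng.normal(size=(n, J, k))               # observed characteristics (e.g., price, travel time)
beta_true = np.array([-1.0, 0.5])
u = X @ beta_true + rng.gumbel(size=(n, J))  # utility = x'beta + Type I extreme value shock
y = u.argmax(axis=1)                         # utility-maximizing choices

def neg_loglik(beta):
    v = X @ beta                             # systematic utility
    v -= v.max(axis=1, keepdims=True)        # for numerical stability
    p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(n), y]).sum()

beta_hat = minimize(neg_loglik, np.zeros(k)).x   # maximum likelihood estimate of utility parameters

def shares(X_any, beta):
    v = X_any @ beta
    v -= v.max(axis=1, keepdims=True)
    p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
    return p.mean(axis=0)                    # predicted market shares

X_cf = X.copy()
X_cf[:, 0, 0] += 1.0                         # counterfactual: raise alternative 0's first characteristic
print(shares(X, beta_hat), shares(X_cf, beta_hat))
```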
There are also other categories of ML models; for example, anomaly
detection focuses on looking for outliers or unusual behavior and is used,
for example, to detect network intrusion, fraud, or system failures. Other
categories that I will return to are reinforcement learning (roughly, approxi-
mate dynamic programming) and multiarmed bandit experimentation
(dynamic experimentation where the probability of selecting an arm is cho-
sen to balance exploration and exploitation). These literatures often take
a more explicitly causal perspective and thus are somewhat easier to relate
to economic models, and so my general statements about the lack of focus
on causal inference in ML must be qualified when discussing the literature
on bandits.
Before proceeding, it is useful to highlight one other contribution of the
ML literature. The contribution is computational rather than conceptual,
but it has had such a large impact that it merits a short discussion. The tech-
nique is called stochastic gradient descent (SGD), and it is used in many dif-
ferent types of models, including the estimation of neural networks as well
as large scale Bayesian models (e.g., Ruiz, Athey, and Blei [2017], discussed
in more detail below). In short, stochastic gradient descent is a method for
optimizing an objective function, such as a likelihood function or a gener-
alized method of moments objective function, with respect to parameters.
When the objective function is expensive to compute (e.g., because it requires
numerical integration), stochastic gradient descent can be used. The main
idea is that if the objective is the sum of terms, each term corresponding to a
single observation, the gradient can be approximated by picking a single data
point and using the gradient evaluated at that observation as an approxima-
tion to the average (over observations) of the gradient. This estimate of the
gradient will be very noisy, but unbiased. The idea is that it is more effective
to “climb a hill” taking lots of steps in a direction that is noisy but unbiased,
than it is to take a small number of steps, each in the right direction, which is
what happens if computational resources are focused on getting very precise
estimates of the gradient of the objective at each step. Stochastic gradient
descent can lead to dramatic performance improvements, and thus enable
the estimation of very complex models that would be intractable using tra-
ditional approaches.
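As a minimal sketch of the idea (with a simple least-squares objective standing in for a likelihood or moment function, and all settings hypothetical), each step of stochastic gradient descent uses the gradient at a single randomly chosen observation, which is a noisy but unbiased estimate of the average gradient over observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 5
X = rng.normal(size=(n, k))
theta_true = rng.normal(size=k)
y = X @ theta_true + rng.normal(size=n)

theta = np.zeros(k)
step = 0.001
for t in range(500_000):
    i = rng.integers(n)                          # pick a single observation at random
    grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]  # noisy but unbiased estimate of the average gradient
    theta -= step * grad_i                       # take a small step in the noisy descent direction

print(np.abs(theta - theta_true).max())  # small; the remaining noise shrinks with the step size
```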
21.3 Using Prediction Methods in Policy Analysis
21.3.1 Applications of Prediction Methods to Policy Problems in Economics
There have already been a number of successful applications of predic-
tion methodology to policy problems. Kleinberg et al. (2015) have argued
that there is a set of problems where off-the-shelf ML methods for predic-
tion are the key part of important policy and decision problems. They use
examples like deciding whether to do a hip replacement operation for an
elderly patient; if you can predict based on their individual characteris-
tics that they will die within a year, then you should not do the operation.
Many Americans are incarcerated while awaiting trial; if you can predict
who will show up for court, you can let more out on bail. Machine-learning
algorithms are currently in use for this decision in a number of jurisdic-
tions. Another natural example is credit scoring; an economics paper by
Bjorkegren and Grissen (2017) uses ML methods to predict loan repayment
using mobile phone data.
In other applications, Goel, Rao, and Shroff (2016) use ML methods to
examine stop-and-frisk laws, using observables of a police incident to pre-
dict the probability that a suspect has a weapon, and they show that blacks
are much less likely than whites to have a weapon conditional on observ-
ables and being frisked. Glaeser, Hillis, et al. (2016) helped cities design a
contest to build a predictive model that predicted health code violations in
restaurants in order to better allocate inspector resources. There is a rap-
idly growing literature using machine learning together with images from
satellites and street maps to predict poverty, safety, and home values (see,
e.g., Naik et al. 2017). As Glaeser, Kominers, et al. (2015) argue, there are
a variety of applications of this type of prediction methodology. It can be
used to compare outcomes over time at a very granular level, thus making
it possible to assess the impact of a variety of policies and changes, such as
neighborhood revitalization. More broadly, the new opportunities created
by large-scale imagery and sensors may lead to new types of analyses of
productivity and well- being.
Although prediction is often a large part of a resource allocation prob-
lem—there is likely to be agreement that people who will almost certainly
die soon should not receive hip replacement surgery, and rich people should
not receive poverty aid—Athey (2017) discusses the gap between identify-
ing units that are at risk and those for whom intervention is most beneficial. Determining which units should receive a treatment is a causal inference question, and answering it requires different types of data than prediction.
Either randomized experiments or natural experiments may be needed to
estimate heterogeneous treatment eff ects and optimal assignment policies.
In business applications, it has been common to ignore this distinction and
focus on risk identification; for example, as of 2017, the Facebook advertising optimization tool provided to advertisers optimizes for consumer clicks, but not for the causal effect of the advertisement. The distinction is often not
emphasized in marketing materials and discussions in the business world,
perhaps because many practitioners and engineers are not well versed in
the distinction between prediction and causal inference.
21.3.2 Additional Topics in Prediction for Policy Settings
Athey (2017) summarizes a variety of research questions that arise when
prediction methods are taken into policy applications. A number of these
have attracted initial attention in both ML and the social sciences, and
interdisciplinary conferences and workshops have begun to explore these
issues.
One set of questions concerns interpretability of models. There are discus-
sions of what interpretability means, and whether simpler models have
advantages. Of course, economists have long understood that simple models
can also be misleading. In social sciences data, it is typical that many attri-
butes of individuals or locations are positively correlated—parents’ educa-
tion, parents’ income, child’s education, and so on. If we are interested in a
conditional mean function, and estimate μ̂(x) = E[Y_i | X_i = x], using a simpler model that omits a subset of covariates may be misleading. In the simpler
model, the relationship between the omitted covariates and outcomes is
loaded onto the covariates that are included. Omitting a covariate from a
model is not the same thing as controlling for it in an analysis, and it can
sometimes be easier to interpret a partial eff ect of a covariate controlling for
other factors than it is to keep in mind all of the other (omitted) factors and
how they covary with those included in a model. So, simpler models can
sometimes be misleading; they may seem easy to understand, but the under-
standing gained from them may be incomplete or wrong.
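A small simulated example (with hypothetical variable names and coefficients) illustrates the point: when a correlated covariate is omitted from the simpler model, its relationship with the outcome is loaded onto the covariate that remains, so the simpler coefficient does not have the partial-effect interpretation a reader might assume.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
parent_educ = rng.normal(size=n)
parent_income = 0.8 * parent_educ + 0.6 * rng.normal(size=n)      # positively correlated attributes
y = 1.0 * parent_educ + 1.0 * parent_income + rng.normal(size=n)  # child outcome

# "Simple" model omitting parent_income: its effect is absorbed by parent_educ
simple_coef = np.cov(parent_educ, y)[0, 1] / np.var(parent_educ)

# Fuller model controlling for both covariates
X = np.column_stack([np.ones(n), parent_educ, parent_income])
full_coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(simple_coef)     # roughly 1.8: includes the omitted covariate's contribution
print(full_coefs[1:])  # roughly [1.0, 1.0]: partial effects, each controlling for the other
```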
One type of model that typically is easy to interpret and explain is a causal
model. As reviewed in Imbens and Rubin (2015), the causal inference frame-
work typically makes the estimand very precise—for example, the average
effect if a treatment were applied to a particular population, the conditional average treatment effect (conditional on some observable characteristics of individuals), or the average effect of a treatment on a subpopulation such as "compliers" (those whose treatment adoption is affected by an instrumental variable). Such parameters by definition give the answer to a well-defined question, and so the magnitudes are straightforward to interpret. Key parameters of "structural" models are also straightforward to interpret—they represent parameters of consumer utility functions, elasticities of demand curves, bidder valuations in auctions, marginal costs of firms, and so on. An
area for further research concerns whether there are other ways to math-
ematically formalize what it means for a model to be interpretable, or to
analyze empirically the implications of interpretability. Yeomans, Shah, and
Kleinberg (2016) study empirically a related issue of how much people trust
ML-based recommender systems, and why.
Another area that has attracted a lot of attention is the question of fair-
ness and nondiscrimination, for example, whether algorithms will promote
discrimination by gender or race when used in settings like hiring, judicial
decisions, or lending. There are a number of interesting questions that can
be considered. One is, how can fairness constraints be defined? What type
of fairness is desired? For example, if a predictive model is used to allocate
job interviews based on resumes, there are two types of errors, Type I and
Type II. It is straightforward to show that it is in general impossible to
equalize both Type I and Type II errors across two different categories of people (e.g., men and women), so the analyst must choose which to equalize (or both). See Kleinberg, Mullainathan, and Raghavan (2016) for further analysis and development of the inherent trade-offs in fairness in predictive
algorithms. Overall, the literature on this topic has grown rapidly in the last
two years, and we expect that as ML algorithms are deployed in more and
more contexts, the topic will continue to develop. My view is that it is more
likely that ML models will help make resource allocation more rather than
less fair; algorithms can absorb and effectively use a lot more information
than humans, and thus are less likely than humans to rely on stereotypes.
To the extent that unconstrained algorithms do have undesirable distribu-