optimal. Further, it is also not important from the perspective of expected payoffs to statistically distinguish two very similar treatments. The literature has developed a number of heuristics for managing the explore-exploit trade-off; for example, “Thompson sampling” allocates units to treatment arms in proportion to the estimated probability that each treatment arm is the best.
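To fix ideas, the sketch below simulates the Thompson sampling heuristic for a simple multi-armed bandit with binary outcomes and Beta priors. The number of arms, the conversion rates, and the priors are illustrative choices, not values from any of the papers discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms = 3
true_rates = [0.10, 0.12, 0.15]          # unknown conversion rates (simulated)
successes = np.ones(n_arms)              # Beta(1, 1) prior on each arm
failures = np.ones(n_arms)

for t in range(5000):
    # Thompson sampling: draw one sample from each arm's posterior and
    # assign the next unit to the arm whose draw is largest.  Over many
    # units, arms are chosen in proportion to the posterior probability
    # that they are the best arm.
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))

    # Observe a binary outcome for the assigned unit and update the posterior.
    outcome = rng.random() < true_rates[arm]
    successes[arm] += outcome
    failures[arm] += 1 - outcome

posterior_means = successes / (successes + failures)
print("posterior mean payoff by arm:", np.round(posterior_means, 3))
print("assignments per arm:", (successes + failures - 2).astype(int))
```

As the simulation runs, assignments concentrate on the arm with the highest estimated payoff while the remaining arms continue to receive occasional traffic, which is exactly the explore-exploit balance described above.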
There is much less known about the setting where individuals have ob-
served attributes, in which case the goal is to construct and evaluate per-
sonalized treatment assignment policies. This problem has been termed the
“contextual bandit” problem, since treatment assignments are sensitive to
the “context” (in this case, user characteristics). At first, the problem seems
very challenging because the space of possible policies is large and complex
(each policy maps from user characteristics to the space of possible treat-
ments). However, if the returns to each of the actions can be estimated as a
function of individual attributes, a policy can be constructed by finding the
action whose return is estimated to be highest, balanced against the need
for exploration. Although there are a number of proposed methods for the
contextual bandit problem in the literature already, there is relatively little
known about how to select among methods and which ones are likely to
perform best in practice. For example, the literature on optimal policy esti-
mation suggests that particular approaches to policy estimation may work
better than others.
In particular, there are a variety of choices a researcher must make when
selecting a contextual bandit algorithm. These include the choice of the
model that maps user characteristics to expected outcomes (where the lit-
erature has considered alternatives such as Ridge regression, Li et al. (2010); ordinary least squares (OLS), Goldenshluger and Zeevi (2013); generalized linear models (GLM), Li, Lu, and Zhou (2017); LASSO, Bastani and Bayati (2015); and random forests, Dimakopoulou, Athey, and Imbens (2017) and Feraud et al. (2016)). Another choice concerns the heuristic used to balance exploration against exploitation, with Thompson sampling and Upper Confidence Bounds (UCB) as leading choices (Chapelle and Li 2011).
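As an illustration of these choices, the sketch below pairs one outcome model (a per-arm ridge regression) with one exploration heuristic (Thompson sampling over an approximate posterior for the regression coefficients). It follows the generic recipe for linear Thompson sampling rather than any specific algorithm from the papers above, and the simulated environment, prior scale, and posterior variance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_arms, v2 = 5, 3, 0.25        # context dimension, arms, posterior scale (assumed)

# Per-arm ridge statistics: A = lambda * I + X'X and b = X'y, with lambda = 1.
A = [np.eye(d) for _ in range(n_arms)]
b = [np.zeros(d) for _ in range(n_arms)]

def true_reward(arm, x):
    # Simulated environment: each arm has its own linear reward function.
    weights = np.array([[0.5, 0.0, 0.0, 0.0, 0.1],
                        [0.0, 0.6, 0.0, 0.0, -0.1],
                        [0.0, 0.0, 0.4, 0.3, 0.0]])
    return weights[arm] @ x + 0.1 * rng.standard_normal()

for t in range(2000):
    x = rng.standard_normal(d)            # user characteristics (the "context")
    sampled_values = []
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]          # ridge estimate for arm a
        # Thompson step: draw coefficients from an approximate posterior
        # and score this context under the draw.
        theta_tilde = rng.multivariate_normal(theta_hat, v2 * A_inv)
        sampled_values.append(x @ theta_tilde)
    arm = int(np.argmax(sampled_values))

    r = true_reward(arm, x)
    A[arm] += np.outer(x, x)              # update the chosen arm's statistics
    b[arm] += r * x
```

Swapping the outcome model (for example, LASSO or a random forest in place of ridge) or the exploration heuristic (UCB in place of Thompson sampling) changes only a few lines, which is precisely why the choice among these components matters in practice.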
Dimakopoulou, Athey, and Imbens (2017) highlights some issues that
arise uniquely in the contextual bandit and that relate directly to the estima-
tion issues that have been the focus of the literature on estimation of treat-
ment effects (Imbens and Rubin 2015). For example, the paper highlights
the comparison between noncontextual bandits, where there will be many
future individuals arriving with exactly the same context (since they all share
the same context), and contextual bandits, where each unit is unique. The
assignment of a particular individual thus contributes to learning for the future only indirectly, since the future individuals will have different contexts (characteristics). The fact that the exploration benefits the future through a model of how contexts relate to outcomes changes the problem.
This discussion highlights a further theme for the connection between
ML and causal inference: estimation considerations matter even more in the
“small sample” settings of contextual bandits, where the assumption is that
there is not enough data available to the policymaker to estimate perfectly
the optimal assignment. However, we know from the econometrics literature
that the small sample properties of different estimators can vary substan-
tially across settings (Imbens and Rubin 2015), making it clear that the best
contextual bandit approach is likely to also vary across settings.
21.4.4 Robustness and Supplementary Analysis
In a recent review paper, Athey and Imbens (2017) highlights the impor-
tance of “supplementary analyses” for establishing the credibility of causal
estimates in environments where crucial assumptions are not directly test-
able without additional information. Examples of supplementary analyses
include placebo tests, whereby the analyst assesses whether a given model is likely to find evidence of treatment effects even at times when no treatment effect should be found. One type of supplementary analysis is a robustness
measure. Athey and Imbens (2015) proposes to use ML-based methods to develop a range of different estimates of a target parameter (e.g., a treatment effect), where the range is created by introducing interaction effects between model parameters and covariates. The robustness measure is defined as the standard deviation of parameter estimates across model specifications. This paper provides one possible approach to ML-based robustness measures, but I predict that more approaches will develop over time as ML methods become more popular.
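A stylized sketch of this idea follows: re-estimate a treatment effect across many specifications that differ in which covariates are interacted with the treatment indicator, and report the standard deviation of the resulting estimates. This is meant only to convey the flavor of such a robustness measure; the simulated data, the set of specifications, and the estimator are illustrative and are not the procedure of Athey and Imbens (2015).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Simulated data: outcome y, binary treatment w, covariates X (names illustrative).
n, k = 2000, 4
X = rng.standard_normal((n, k))
w = rng.binomial(1, 0.5, size=n)
y = 1.0 * w + X @ np.array([0.5, -0.3, 0.2, 0.0]) + rng.standard_normal(n)

# Demean covariates so that, in specifications with treatment-covariate
# interactions, the main treatment coefficient still approximates an
# average effect rather than the effect at X = 0.
Xc = X - X.mean(axis=0)

# Build a range of specifications by varying which covariates are interacted
# with the treatment indicator, and record each estimated treatment coefficient.
estimates = []
for r in range(k + 1):
    for subset in combinations(range(k), r):
        blocks = [np.ones((n, 1)), w[:, None], Xc]
        if subset:
            blocks.append(w[:, None] * Xc[:, list(subset)])
        design = np.hstack(blocks)
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates.append(coef[1])             # coefficient on the treatment

estimates = np.array(estimates)
print("mean treatment-effect estimate:", round(estimates.mean(), 3))
print("robustness measure (std across specifications):", round(estimates.std(), 3))
```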
Another type of ML-based supplementary analysis, proposed by Athey, Imbens, et al. (2017), uses ML-based methods to construct a measure of
how challenging the confounding problem is in a particular setting. The
proposed measure constructs an estimated conditional mean function for
the outcome as well as an estimated propensity score, and then estimates the
correlation between the two.
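The sketch below shows one way such a measure could be computed, using off-the-shelf random forests and out-of-fold predictions; the simulated data and the specific estimators are illustrative assumptions rather than the construction in Athey, Imbens, et al. (2017).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)

# Simulated observational data with confounding: the covariates shift both
# the treatment probability and the outcome (all names illustrative).
n = 5000
X = rng.standard_normal((n, 5))
propensity = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
w = rng.binomial(1, propensity)
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * w + rng.standard_normal(n)

# ML estimates of the two nuisance functions, predicted out of fold.
mu_hat = cross_val_predict(
    RandomForestRegressor(n_estimators=200, min_samples_leaf=20), X, y, cv=5)
e_hat = cross_val_predict(
    RandomForestClassifier(n_estimators=200, min_samples_leaf=20),
    X, w, cv=5, method="predict_proba")[:, 1]

# The diagnostic described above: how correlated are the estimated conditional
# mean of the outcome and the estimated propensity score?  A high correlation
# signals that confounding is likely to be a first-order concern.
print("correlation(mu_hat, e_hat):", round(np.corrcoef(mu_hat, e_hat)[0, 1], 3))
```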
There is much more potential for supplementary analyses to be further
developed; the fact that ML has well-defined, systematic algorithms for comparing a wide range of model specifications makes ML well suited for
constructing additional robustness checks and supplementary analyses.
21.4.5 Panel Data and Difference-in-Difference Models
Another commonly used approach to identifying causal effects is to
exploit assumptions about how outcomes vary across units and over time in
panel data. In a typical panel- data setting, units are not necessarily assigned
to a treatment randomly, but all units are observed prior to some units being
treated; the identifying assumption is that one or more untreated units can
be used to provide an estimate of the counterfactual time trend that would
have occurred for the treated units in the absence of the treatment. The
simplest “difference-in-difference” case involves two groups and two time
periods; more broadly, panel data may include many groups and many peri-
ods. Traditional econometric models for the panel-data case exploit functional form assumptions, for example, assuming that a unit’s outcome in a particular time period is an additive function of a unit effect, a time effect, and an independent shock. The unit effect can then be inferred for treated units in the pretreatment period, while the time effect can be inferred from the untreated units in the periods where some units receive the treatment. Note
that this structure implies that the matrix of mean outcomes (with rows
associated with units and columns associated with time) has a very simple
structure: it has rank two.
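In symbols (with notation introduced here for concreteness rather than taken from the text), the additive model and its implication for the mean-outcome matrix can be written as

```latex
Y_{it} = \alpha_i + \beta_t + \varepsilon_{it},
\qquad
\mathbb{E}[Y] = \alpha \mathbf{1}_T^{\top} + \mathbf{1}_N \beta^{\top},
```

where the first term stacks the unit effects and the second the time effects, so the matrix of mean outcomes is the sum of two rank-one matrices and therefore has rank at most two.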
There have been a few recent approaches bringing ML tools to the panel
data setting. Doudchenko and Imbens (2016) develop an approach inspired
by synthetic controls (pioneered by Abadie, Diamond, and Hainmueller
2010), where a weighted average of control observations is used to con-
struct the counterfactual untreated outcomes for treated units in treated
periods. Doudchenko and Imbens (2016) propose using regularized regres-
sion to determine the weights, with the penalty parameter selected via cross-
validation.
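The sketch below conveys the spirit of this approach on simulated data: a regularized regression of the treated unit's pretreatment outcomes on the control units' pretreatment outcomes yields the weights, with the penalty chosen by cross-validation. The elastic net estimator, the panel dimensions, and the data-generating process are illustrative choices, not the exact specification in Doudchenko and Imbens (2016).

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(4)

# Simulated panel: rows are units, columns are time periods (names illustrative).
n_units, T, T0 = 40, 30, 20                  # unit 0 is treated from period T0 onward
common = np.cumsum(rng.standard_normal(T))   # shared time trend
Y = 0.5 * common + rng.standard_normal((n_units, T)) + rng.standard_normal((n_units, 1))
Y[0, T0:] += 2.0                             # treatment effect on the treated unit

# Regularized regression of the treated unit's pretreatment outcomes on the
# control units' pretreatment outcomes; the fitted coefficients play the role
# of synthetic-control weights, and the penalty is chosen by cross-validation.
X_pre = Y[1:, :T0].T                         # T0 x (n_units - 1)
y_pre = Y[0, :T0]
model = ElasticNetCV(cv=5).fit(X_pre, y_pre)

# Counterfactual untreated outcomes for the treated unit in treated periods.
X_post = Y[1:, T0:].T
y0_hat = model.predict(X_post)
effect_estimates = Y[0, T0:] - y0_hat
print("estimated per-period treatment effects:", np.round(effect_estimates, 2))
```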
Factor Models and Matrix Completion
Another way to think about causal inference in a panel- data setting is to
consider a matrix completion problem; Athey, Bayati, et al. (2017) propose
taking such a perspective. In the ML literature, a matrix completion prob-
lem is one where there is an observed matrix of data (in our case, units and
time periods), but some of the entries are missing. The goal is to provide
the best possible prediction of what those entries should be. For the panel-
data application, we can think of the units and time periods where the units
are treated as the missing entries, since we don’t observe the counterfactual
outcomes of those units in the absence of the treatment (this is the key bit
of missing information for estimating the treatment effect).
Athey, Bayati, et al. (2017) propose using a matrix version of regularized
regression to find a matrix that well approximates the matrix of untreated
outcomes (a matrix that has missing elements corresponding to treated units
and periods). Recall that LASSO regression minimizes the sum of squared errors in sample, plus a penalty term that is proportional to the sum of the magnitudes of the coefficients in the regression. We propose a matrix regression that minimizes the sum of squared errors over the observed elements of the matrix,
plus a penalty term proportional to the nuclear norm of the matrix. The
nuclear norm is the sum of absolute values of the singular values of the
matrix. A matrix that has a low nuclear norm is well approximated by a low-rank matrix.
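A minimal sketch of this kind of estimator appears below, using a soft-impute style algorithm that repeatedly fills in the missing cells and soft-thresholds the singular values. The penalty level, panel dimensions, and missingness pattern are illustrative, and the estimator of Athey, Bayati, et al. (2017) includes additional elements (such as fixed effects and data-driven penalty choice) that are not shown here.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated low-rank panel of untreated outcomes built from unit and time factors.
n_units, T, rank = 50, 40, 3
U = rng.standard_normal((n_units, rank))
V = rng.standard_normal((T, rank))
Y = U @ V.T + 0.1 * rng.standard_normal((n_units, T))

# Treated (unit, period) cells are "missing": we never observe their
# untreated outcomes, and the goal is to impute them.
observed = rng.random((n_units, T)) > 0.2

def soft_impute(Y, observed, lam, n_iter=200):
    """Minimize squared error on observed cells plus lam times the nuclear norm,
    by iteratively soft-thresholding the singular values (a soft-impute sketch)."""
    L = np.zeros_like(Y)
    for _ in range(n_iter):
        filled = np.where(observed, Y, L)          # plug in current guesses
        U_, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_thresh = np.maximum(s - lam, 0.0)        # shrink the singular values
        L = (U_ * s_thresh) @ Vt
    return L

L_hat = soft_impute(Y, observed, lam=2.0)
rmse = np.sqrt(np.mean((L_hat[~observed] - (U @ V.T)[~observed]) ** 2))
print("RMSE on held-out (treated) cells:", round(rmse, 3))
```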
How do we interpret the idea that a matrix can be well approximated by a
low-rank matrix? A low-rank matrix can be “factored” into the product of
two matrices. In the panel- data case, we can interpret such a factorization as
incorporating a vector of latent characteristics for each unit and a vector of
latent characteristics of each period. The outcome of a particular unit in a
particular period, if untreated, is approximately equal to the inner product
of the unit’s characteristics and the period characteristics. For example, if
the data concerned employment at the county level, we can think of the
counties as having outcomes that depend on the share of employment in dif-
ferent industries, and then each industry has common shocks in each period.
So a county’s latent characteristics would be its vector of industry shares, and the time characteristics would be the industry shocks in a given period.
Athey, Bayati, et al. (2017) show that the matrix completion approach
reduces to commonly employed techniques in the econometrics literature
when the assumptions needed for those approaches hold, but the matrix
completion approach is able to model more complex patterns in the data,
while allowing the data (rather than the analyst) to indicate whether time-
series patterns within units, or cross-sectional patterns within a period, or a
more complex combination, are more useful for predicting counterfactual
outcomes.
The matrix completion approach can be linked to a literature that has
grown in the last two decades in time-series econometrics on factor models (see, e.g., Bai and Ng 2008 for a review). The matrix-factorization approach is similar, but rather than assuming that the true model has a fixed but unknown number of factors, the matrix-completion approach simply looks for the best fit while penalizing the norm of the matrix. The matrix is well
approximated by one with a small number of factors, but does not need
to be exactly represented that way. Athey, Bayati, et al. (2017) describe a
number of advantages of the matrix completion approach, and also show
that it performs better than existing panel- data causal inference approaches
in a range of settings.
21.4.6 Factor Models and Structural Models
Another important area of connection between machine learning and
causal inference concerns more complex structural models. For decades,
scholars working at the intersection of marketing and economics have built
structural models of consumer choice, sometimes in dynamic environ-
ments, and used Bayesian estimation to estimate the model, often via Markov Chain Monte Carlo. Recently, the ML literature has developed a variety
of techniques that allow similar types of Bayesian models to be estimated
at larger scale. These have been applied to settings such as textual analysis
and consumer choices of, for example, movies at Netflix (see, for example, Blei, Ng, and Jordan 2003 and Blei 2012). I expect to see much closer
synergies between these two literatures in the future. For example, Athey,
Blei, et al. (2017) builds on models of hierarchical Poisson factorization
to create models of consumer demand, where a consumer’s preferences over thousands of products are considered simultaneously, but the con-
sumer’s choices in each product category are independent of one another.
The model reduces the dimensionality of this problem by using a lower-
dimensional factor representation of a consumer’s mean utility as well as
the consumer’s price sensitivity for each product. The paper establishes that
substantial efficiency gains are possible by considering many product cate-
gories in parallel; it is possible to learn about a consumer’s price sensitivity
in one product using behavior in other products. The paper departs from the
pure prediction literature in ML by evaluating and tuning the model based
on how it does at predicting consumer responses to price changes, rather
than simply on overall goodness of fit. In particular, the paper highlights that different models would be selected for the “goodness of fit” objective as opposed to the “counterfactual inference” objective. In order to achieve this goal, the paper analyzes goodness of fit in terms of predicting changes in demand for products before and after price changes, after providing evidence that the price changes can be treated as natural experiments after conditioning on week effects (price changes always occur mid-week). The
paper also demonstrates the benefits of personalized prediction relative to more standard demand estimation methods. Thus, the paper again highlights the theme that, for causal inference, the objective function differs from
standard prediction.
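To give a flavor of the factorization machinery underlying this line of work, the sketch below fits a plain Poisson matrix factorization to simulated purchase counts using classic multiplicative updates. It omits the hierarchical priors, price terms, and counterfactual evaluation that are central to Athey, Blei, et al. (2017); the dimensions and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated purchase counts: rows are consumers, columns are products.
n_users, n_items, k = 200, 300, 5
theta_true = rng.gamma(0.5, 1.0, size=(n_users, k))
beta_true = rng.gamma(0.5, 1.0, size=(n_items, k))
counts = rng.poisson(theta_true @ beta_true.T)

# Poisson matrix factorization: counts_ui ~ Poisson(theta_u . beta_i), fit by
# the classic multiplicative updates for the Poisson/KL objective.
theta = rng.gamma(1.0, 1.0, size=(n_users, k))
beta = rng.gamma(1.0, 1.0, size=(n_items, k))
eps = 1e-10
for _ in range(100):
    rates = theta @ beta.T + eps
    theta *= (counts / rates) @ beta / (beta.sum(axis=0) + eps)
    rates = theta @ beta.T + eps
    beta *= (counts / rates).T @ theta / (theta.sum(axis=0) + eps)

# Poisson log-likelihood up to a constant (the log-factorial terms are dropped).
rates = theta @ beta.T + eps
loglik = np.sum(counts * np.log(rates) - rates)
print("Poisson log-likelihood (up to a constant):", round(loglik, 1))
```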
With more scalable computational methods, it becomes possible to build
much richer models with much less prior information about products. Ruiz,
Athey, and Blei (2017) analyzes consumer preferences for bundles selected
from over 5,000 items in a grocery store, without incorporating informa-
tion about which items are in the same category. Thus, the model uncovers
whether items are substitutes or complements. Since there are 2^5,000 possible bundles when there are 5,000 products, in principle each individual consumer’s utility function has 2^5,000 parameters. Even if we restrict the utility function to have only pairwise interaction effects, there are still millions of parameters
of a consumer’s utility function over bundles. Ruiz, Athey, and Blei (2017)
uses a matrix-factorization approach to reduce the dimensionality of the problem, factorizing the mean utilities of the items, the interaction effects
among items, and the user’s price sensitivity for the items. Price and availa-
bility variation in the data allows the model to distinguish correlated prefer-
ences (some consumers like both coffee and diapers) from complementarity
(tacos and taco shells are more valuable together). In order to further sim-
plify the analysis, the model assumes that consumers are boundedly ratio-
nal when they make choices, and consider the interactions among products
as the consumer sequentially adds items to the cart. The alternative—that
the consumer considers all 2^5,000 bundles and optimizes among them—does
not seem plausible. Incorporating human computational constraints into
structural models thus appears to be another potentially fruitful avenue at
the intersection of ML and economics. In the computational algorithm
for Ruiz, Athey, and Blei (2017), we rely on a technique called variational
inference to approximate the posterior distribution, as well as stochastic gradient descent (described in detail above) to find the parameters
that provide the best approximation.
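The sketch below illustrates these two ingredients, variational inference and stochastic gradient descent, on a deliberately tiny model (a single regression coefficient with a normal prior) rather than the demand model of the paper; the variational family, learning rate, and simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy model: y ~ Normal(theta * x, 1), prior theta ~ Normal(0, 1).
# We approximate the posterior over theta with q(theta) = Normal(mu, sigma^2)
# and maximize the ELBO with stochastic gradients and the reparameterization
# trick (theta = mu + sigma * eps, with eps ~ Normal(0, 1)).
n = 500
x = rng.standard_normal(n)
theta_true = 1.5
y = theta_true * x + rng.standard_normal(n)

def dlogjoint(theta):
    # d/dtheta [ log p(y | theta) + log p(theta) ]
    return np.sum((y - theta * x) * x) - theta

mu, rho = 0.0, 0.0                 # variational parameters; sigma = exp(rho)
lr = 1e-3
for step in range(3000):
    sigma = np.exp(rho)
    eps = rng.standard_normal()
    theta = mu + sigma * eps       # reparameterized draw from q
    g = dlogjoint(theta)
    # Stochastic gradients of the ELBO (the entropy of q contributes +1 to rho).
    mu += lr * g
    rho += lr * (g * eps * sigma + 1.0)

# Exact posterior for this conjugate toy model, for comparison.
post_var = 1.0 / (1.0 + np.sum(x ** 2))
post_mean = post_var * np.sum(x * y)
print("variational q mean and sd:", round(mu, 3), round(np.exp(rho), 3))
print("exact posterior mean and sd:", round(post_mean, 3), round(np.sqrt(post_var), 3))
```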
In another application of similar methodology, Athey et al. (2018) ana-
lyzes consumer choices over lunchtime restaurants using data from a sample
of several thousand mobile phone users in the San Francisco Bay Area.
The data is used to identify users’ typical morning location, as well as their
choices of lunchtime restaurants. We build a model where restaurants have
latent characteristics (whose distribution may depend on restaurant observ-
ables, such as star ratings, food category, and price range), and each user
has preferences for these latent characteristics, and these preferences are