tional consequences, it is possible to constrain the algorithms. Generally,
algorithms can be trained to optimize objectives under constraints, and thus
it may be easier to impose societal objectives on algorithms than on subjec-
tive decisions by humans.
A third issue that arises is stability and robustness, for example, in
response to variations in samples or variations in the environment. There
are a variety of related ideas in machine learning, including domain adapta-
tion (how do you make a model trained in one environment perform well in
another environment), “transfer learning,” and others. The basic concern
is that ML algorithms do exhaustive searches across a very large number
of possible specifications looking for the best model that predicts Y based
on X. The models will find subtle relationships between X and Y, some of
which might not be stable across time or across environments. For example,
for the last few years there may be more videos of cats with pianos than
dogs with pianos. The presence of a piano in a video may thus predict cats.
However, pianos are not a fundamental feature of cats that holds across
environments, and so if a fad arises where dogs play pianos, performance
of an ML algorithm might suffer. This might not be a problem for a tech
firm that reestimates its models with fresh data daily, but predictive models
are often used over much longer time periods in industry. For example,
credit-scoring models may be held fixed, since changing them makes it hard
to assess the risk of the set of consumers who accept credit offers. Scoring
models used in medicine might be held fixed over many years. There are
many interesting methodological issues involved in finding models that
have stable performance and are robust to changing circumstances.
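To make the stability concern concrete, the following simulated sketch
(purely illustrative; the features, labels, and parameter values are all
invented) trains a classifier in an environment where a spurious feature (a
piano in the video) is strongly associated with the label (cat) and then
evaluates it after the association reverses. Accuracy degrades even though
the stable feature is unchanged.

# Hypothetical sketch: a spurious correlation that predicts well in the
# training environment but is unstable across environments.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_env(n, p_piano_given_cat):
    """Simulate videos with a stable feature (whiskers) and an unstable one (piano)."""
    is_cat = rng.integers(0, 2, n)                    # label: cat (1) vs. dog (0)
    whiskers = 0.8 * is_cat + rng.normal(0, 0.5, n)   # stable, "fundamental" signal
    piano = rng.binomial(1, np.where(is_cat == 1,
                                     p_piano_given_cat, 1 - p_piano_given_cat))
    return np.column_stack([whiskers, piano]), is_cat

# Training environment: pianos appear mostly in cat videos.
X_train, y_train = make_env(5000, p_piano_given_cat=0.9)
model = LogisticRegression().fit(X_train, y_train)

# Shifted environment: a fad arises where dogs play pianos.
X_same, y_same = make_env(5000, p_piano_given_cat=0.9)
X_shift, y_shift = make_env(5000, p_piano_given_cat=0.1)
print("accuracy, same environment:   ", model.score(X_same, y_same))
print("accuracy, shifted environment:", model.score(X_shift, y_shift))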
Another issue is that of manipulability. In the application of using mobile
data to do credit scoring, a concern is that consumers may be able to
manipulate the data observed by the loan provider (Bjorkegren and Grissen
2017).
For example, if certain behavioral patterns help a consumer get a loan, the
consumer can make it look like they have these behavioral patterns, for ex-
ample, by visiting certain areas of a city. If resources are allocated to
homes that look poor in satellite imagery, households or villages may
modify the aerial appearance of their homes to make them look poorer. An open
area for future research concerns how to constrain ML models to make them
less prone to manipulability; Athey (2017) discusses some other examples
of this.
There are also other considerations that can be brought into ML when
it is taken to the field, including computational time, the cost of collecting
and maintaining the "features" that are used in a model, and so on. For
example, technology firms sometimes make use of simplified models in order
to reduce the response time for real-time user requests for information.
Overall, my prediction is that social scientists, particularly economists,
together with computer scientists working at the intersection with social
science, will contribute heavily to defining these types of problems and
concerns formally, and to proposing solutions to them. This will not only
provide for better implementations of ML in policy, but will also provide
rich fodder for interesting research.
21.4 A New Literature on Machine Learning and Causal Inference
Despite the fascinating examples of "off-the-shelf" or slightly modified
prediction methods, in general ML prediction models are solving fundamentally
different problems from much empirical work in social science, which
instead focuses on causal inference. A prediction I have is that there will be
an active and important literature combining ML and causal inference to
create new methods, methods that harness the strengths of ML algorithms
to solve causal inference problems. In fact, it is easy to make this prediction
with confidence because the movement is already well underway. Here I will
highlight a few examples, focusing on those that illustrate a range of themes,
while emphasizing that this is not a comprehensive survey or a thorough
review.
To see the difference between prediction and causal inference, imagine
that you have a data set that contains data about prices and occupancy
rates of hotels. Prices are easy to obtain through price comparison sites,
but occupancy rates are typically not made public by hotels. Imagine first
that a hotel chain wishes to form an estimate of the occupancy rates of
competitors, based on publicly available prices. This is a prediction problem:
the goal is to get a good estimate of occupancy rates, where posted prices
and other factors (such as events in the local area, weather, and so on) are
used to predict occupancy. For such a model, you would expect to find that
higher posted prices are predictive of higher occupancy rates, since hotels
tend to raise their prices as they fill up (using yield management software).
In contrast, imagine that a hotel chain wishes to estimate how occupancy
would change if the hotel raised prices across the board (that is, if it repro-
grammed the yield management software to shift prices up by 5 percent in
every state of the world). This is a question of causal inference. Clearly, even
though prices and occupancy are positively correlated in a typical data set,
we would not conclude that raising prices would increase occupancy. It is
well known in the causal inference literature that the question about price
increases cannot be answered simply by examining historical data without
additional assumptions or structure. For example, if the hotel previously ran
randomized experiments on pricing, the data from these experiments can be
used to answer the question. More commonly, an analyst will exploit natural
experiments or instrumental variables, that is, variables that are unrelated
to factors that affect consumer demand but that shift firm costs
and thus their prices. Most of the classic supervised ML literature has little
to say about how to answer this question.
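The contrast can be made concrete with a small simulation (the
data-generating process below is invented for illustration). A demand shock
moves prices and occupancy up together, so a predictive regression assigns
price a positive coefficient even though the true causal effect of raising
prices, holding demand fixed, is negative.

# Hypothetical sketch of the hotel example: price predicts occupancy
# positively even though the causal effect of a price increase is negative.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

demand = rng.normal(0, 1, n)                     # unobserved demand shock
price = 100 + 10 * demand + rng.normal(0, 2, n)  # yield management raises
                                                 # prices when demand is high
occupancy = 50 + 8 * demand - 0.5 * price + rng.normal(0, 1, n)  # causal: -0.5

# Predictive (OLS) regression of occupancy on price:
X = np.column_stack([np.ones(n), price])
beta = np.linalg.lstsq(X, occupancy, rcond=None)[0]
print("predictive coefficient on price:", round(beta[1], 3))  # positive
print("true causal effect of price:    ", -0.5)               # negative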
To understand the gap between prediction and causal inference, recall that
the foundation of supervised ML methods is that model selection (through,
e.g., cross-validation) is carried out to optimize goodness of fit on a test
sample. A model is good if and only if it predicts outcomes well in a test
set. In contrast, a large body of econometric research builds models that
substantially reduce the goodness of fit of a model in order to estimate the
causal effect of, say, changing prices. If prices and quantities are positively
correlated in the data, any model that estimates the true causal effect
(quantity goes down if you change price) will not do as good a job fitting a test
data set that has the same joint distribution of prices and quantities as the
training data. The place where the econometric model with a causal estimate
would do better is at fitting what happens if the firm actually changes prices
at a given point in time, that is, at making counterfactual predictions when
the world changes. Techniques like instrumental variables seek to use only
some of the information that is in the data (the clean, exogenous,
experiment-like variation in price), sacrificing predictive accuracy in the
current environment
to learn about a more fundamental relationship that will help make decisions
about changing price.
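Continuing the simulated hotel example, the sketch below (again with
invented numbers) implements two-stage least squares by hand, using a cost
shifter as the instrument: a variable that moves prices but is unrelated to
demand. It recovers the negative causal effect, while the naive predictive
regression still reports a positive coefficient.

# Hypothetical sketch: instrumental variables (2SLS) recover the causal
# price effect in simulated hotel data, at a cost in predictive fit.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

demand = rng.normal(0, 1, n)   # unobserved demand shock
cost = rng.normal(0, 1, n)     # cost shifter: moves price, not demand
price = 100 + 10 * demand + 5 * cost + rng.normal(0, 2, n)
occupancy = 50 + 8 * demand - 0.5 * price + rng.normal(0, 1, n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# First stage: isolate the "clean" variation in price driven by costs.
a, b = ols(np.column_stack([ones, cost]), price)
price_hat = a + b * cost
# Second stage: regress occupancy on the instrumented price.
beta_iv = ols(np.column_stack([ones, price_hat]), occupancy)
beta_ols = ols(np.column_stack([ones, price]), occupancy)
print("2SLS estimate of price effect:  ", round(beta_iv[1], 3))   # about -0.5
print("naive OLS coefficient on price: ", round(beta_ols[1], 3))  # positive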
However, a new but rapidly growing literature is tackling the problem of
using ML methods for causal inference. This new literature takes many of
the strengths and innovations of ML methods, but applies them to causal
inference. Doing this requires changing the objective function, since the
ground truth of the causal parameter is not observed in any test set. Also,
because the truth is not observed in a test set, sta-
tistical theory plays a more important role in evaluating models, since it is
more difficult to directly assess how well a parameter estimates the truth,
even if the analyst has access to an independent test set. Indeed, this dis-
cussion highlights one of the key ways in which prediction is substantially
simpler than parameter estimation: for prediction problems, a prediction
for a given unit (given its covariates) can be summarized in a single number,
the predicted outcome, and the quality of the prediction can be evaluated
on a test set without further modeling assumptions. Although the average
squared prediction error of a model on a test set is a noisy estimate of the
expected value of the mean squared error on a random test set (due to small
sample size), the law of large numbers applies to this average and it converges
quickly to the truth as the test set size increases. Since the standard deviation
of the prediction error can also be easily estimated, it is straightforward to
evaluate predictive models without imposing additional assumptions.
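This evaluation procedure takes only a few lines of code. The sketch below
(with generic, made-up inputs) computes the test-set mean squared error and
a standard error for it directly from the squared prediction errors, with no
modeling assumptions.

# Hypothetical sketch: evaluating a predictive model on a test set requires
# nothing beyond the law of large numbers.
import numpy as np

def evaluate_predictions(y_test, y_pred):
    """Test-set MSE and a standard error computed from the squared errors."""
    sq_err = (y_test - y_pred) ** 2
    mse = sq_err.mean()
    se = sq_err.std(ddof=1) / np.sqrt(len(sq_err))
    return mse, se

# Example with made-up numbers:
rng = np.random.default_rng(3)
y_test = rng.normal(0, 1, 1000)
y_pred = y_test + rng.normal(0, 0.5, 1000)   # imperfect predictions
mse, se = evaluate_predictions(y_test, y_pred)
print(f"test MSE = {mse:.3f} +/- {1.96 * se:.3f}")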
There are a variety of different problems that can be tackled with ML
methods. An incomplete list of some that have gained early attention is
given as follows. First, we can consider the type of identification strategy
for identifying causal effects. Some that have received attention in the new
ML/causal inference literature include:
1. Treatment randomly assigned (experimental data).
2. Treatment assignment unconfounded (conditional on covariates).
3. Instrumental variables.
4. Panel data settings (including difference-in-difference designs).
5. Regression discontinuity designs.
6. Structural models of individual or firm behavior.
In each of those settings, there are different problems of interest:
1. Estimating average treatment effects (or a low-dimensional parameter
vector).
2. Estimating heterogeneous treatment effects in simple models or models
of limited complexity.
3. Estimating heterogeneous treatment effects nonparametrically.
4. Estimating optimal treatment assignment policies.
5. Identifying groups of individuals that are similar in terms of their
treatment effects.
Although the early literature is already too large to summarize all of the
contributions to each combination of identification strategy and problem
of interest, it is useful to observe that at this point there are entries in
almost all of the "boxes" associated with different identification strategies,
both for average treatment effects and heterogeneous treatment effects. Here,
I will provide a bit more detail on a few leading cases that have received a
lot of attention, in order to illustrate some key themes in the literature.
It is also useful to observe that even though the last four problems seem
closely related, they are distinct, and the methods used to solve them as well
as the issues that arise are distinct. These distinctions have not traditionally
been emphasized as much in the literature on causal inference, but they mat-
ter more in environments with data-driven model selection because each has
a different objective and the objective function can make a big difference in
determining the selected model in ML-based models. Issues of inference are
also distinct, as we will discuss further below.
21.4.1 Average Treatment Effects
A large and important branch of the literature on causal inference focuses
on estimation of average treatment effects under the unconfoundedness
assumption. This assumption requires that potential outcomes (the out-
comes a unit would experience in alternative treatment regimes) are inde-
pendent of treatment assignment, conditional on covariates. In other words,
treatment assignment is as good as random after controlling for covariates.
From the 1990s through the first decade of the twenty-first century, a
literature emerged about using semiparametric methods to estimate average
treatment effects (e.g., Bickel et al. [1993], focusing on an environment with
a fixed number of covariates that is small relative to the sample size). The
methods are semiparametric in the sense that the goal is to estimate a
low-dimensional parameter (in this case, the average treatment effect) without
making parametric assumptions about the way in which covariates affect
outcomes (e.g., Hahn 1998). (See Imbens and Wooldridge [2009] and Imbens
and Rubin [2015] for reviews.) In the middle of the first decade of the
twenty-first century, Mark van der Laan and coauthors introduced and developed
a set of methods called “targeted maximum likelihood” (van der Laan and
Rubin 2006). The idea is that maximum likelihood is used to estimate a low-
dimensional parameter vector in the presence of high-dimensional nuisance
parameters. The method allows the nuisance parameters to be estimated
with techniques that have less well-established properties or a slower con-
vergence rate. This approach can be applied to estimate an average treatment
effect parameter under a variety of identification assumptions, but impor-
tantly, it is an approach that can be used with many covariates.
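The sketch below illustrates the general idea shared by this family of
methods: machine learning estimates of high-dimensional nuisance functions
(the outcome regressions and the propensity score) are combined into an
estimate of the low-dimensional average treatment effect. For simplicity it
uses the augmented inverse-propensity-weighted (efficient score) form rather
than targeted maximum likelihood itself, which adds a targeting step, and it
omits the sample splitting used in careful implementations; the simulated
data and all names are purely illustrative.

# Hypothetical sketch: ATE under unconfoundedness by averaging the
# efficient score (AIPW), with ML estimates of the nuisance functions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def aipw_ate(X, w, y):
    """Average treatment effect via the augmented IPW (efficient score) form."""
    # Nuisance 1: conditional mean outcomes under treatment and control.
    mu1 = RandomForestRegressor(random_state=0).fit(X[w == 1], y[w == 1]).predict(X)
    mu0 = RandomForestRegressor(random_state=0).fit(X[w == 0], y[w == 0]).predict(X)
    # Nuisance 2: propensity score, clipped away from 0 and 1.
    e = RandomForestClassifier(random_state=0).fit(X, w).predict_proba(X)[:, 1]
    e = np.clip(e, 0.05, 0.95)
    # Efficient score for each unit; its average estimates the ATE.
    score = mu1 - mu0 + w * (y - mu1) / e - (1 - w) * (y - mu0) / (1 - e)
    return score.mean(), score.std(ddof=1) / np.sqrt(len(y))

# Simulated check with a known treatment effect of 2:
rng = np.random.default_rng(4)
n = 2000
X = rng.normal(0, 1, (n, 5))
e_true = 1 / (1 + np.exp(-X[:, 0]))      # assignment depends on a confounder
w = rng.binomial(1, e_true)
y = X[:, 0] + 2.0 * w + rng.normal(0, 1, n)
ate, se = aipw_ate(X, w, y)
print(f"AIPW estimate: {ate:.2f} +/- {1.96 * se:.2f}")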
An early example of the application of ML methods to causal inference in
economics (see Belloni, Chernozhukov, and Hansen 2014 and Chernozhu-
kov, Hansen, and Spindler 2015 for reviews) uses regularized regression as
an approach to deal with many potential covariates in an environment where
the outcome model is “sparse,” meaning that only a small number of covari-
ates actually affect mean outcome (but there are many observables, and
the analyst does not know which ones are important). In an environment
with unconfoundedness, some covariates are correlated with both the
treatment assignment and the outcome; if the analyst does not condition
on them, the omission of these confounders will lead to a biased estimate of
the treatment effect. Belloni, Chernozhukov, and Hansen propose a double-
selection method based on the LASSO. The LASSO is a regularized regression
procedure where a regression is estimated using an objective function
that balances in-sample goodness of fit with a penalty term that depends on
the sum of the magnitudes of the regression coefficients. This form of penalty
leads many covariates to be assigned a coefficient of zero, effectively
dropping them from the regression. The magnitude of the penalty parameter is
selected using cross-validation. The authors observe that if LASSO is used
in a regression of the outcome on both the treatment indicator and other
covariates, the coefficient on the treatment indicator will be a biased
estimate of the treatment effect, because confounders that have a weak
relationship with the outcome but a strong relationship with the treatment
assignment may be zeroed out by an algorithm whose sole objective is to
select variables that predict outcomes.
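A stylized version of double selection is sketched below (a simplification
of the actual procedure, which involves particular penalty choices and
supporting theory): covariates are kept if the LASSO selects them in either
the outcome regression or the treatment regression, and the treatment effect
is then estimated by OLS of the outcome on the treatment plus the union of
the selected covariates.

# Hypothetical sketch of double selection: keep variables that predict
# either the outcome or the treatment, then estimate the effect by OLS.
# Simplified relative to Belloni, Chernozhukov, and Hansen's procedure.
import numpy as np
from sklearn.linear_model import LassoCV

def double_selection(X, w, y):
    sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)  # predicts outcome
    sel_w = np.flatnonzero(LassoCV(cv=5).fit(X, w).coef_)  # predicts treatment
    keep = np.union1d(sel_y, sel_w)
    Z = np.column_stack([np.ones(len(y)), w, X[:, keep]])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return beta[1]  # coefficient on the treatment indicator

# Simulated check: the confounder x0 strongly predicts treatment but only
# weakly predicts the outcome, so outcome-only selection might drop it.
rng = np.random.default_rng(5)
n, p = 1000, 200
X = rng.normal(0, 1, (n, p))
w = (X[:, 0] + rng.normal(0, 1, n) > 0).astype(float)
y = 1.0 * w + 0.2 * X[:, 0] + rng.normal(0, 1, n)      # true effect = 1.0
print("double-selection estimate:", round(double_selection(X, w, y), 2))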
A variety of other methods have been proposed for combining machine
learning and traditional econometric methods for estimating average treat-
ment effects under the unconfoundedness assumption. Athey, Imbens, and
Wager (2016) propose using a method they refer to as “residual balanc-
ing,” building on work on balancing weights by Zubizarreta (2015). Their
approach is similar to a "doubly-robust" method for estimating average
treatment effects that proceeds by taking the average of the efficient score,
which
involves an estimate of the conditional mean of outcomes given covariates
as well as the inverse of the estimated propensity score; however, the residual
balancing replaces inverse propensity score weights with weights obtained
using quadratic programming, where the weights are designed to achieve
balance between the treatment and control group. The conditional mean of
outcomes is estimated using LASSO. The main result in the paper is that this
procedure is efficient and achieves the same rate of convergence as if the out-
come model was known, under a few key assumptions. The most important
assumption is that the outcome model is linear and sparse, although there
can be a large number of covariates and the analyst does not need to have
knowledge of which ones are important. The linearity assumption, while
strong, allows the key result to hold in the absence of any assumptions about
the structure of the process mapping covariates to the assignment, other