time, and is unrelated to factors that shift consumer's demand for the product (such demand shifters can be referred to as "confounders" because they affect both the optimal price set by the firm and the sales of the product).
The instrumental variables method essentially projects the observed prices
onto the input costs, thus only making use of the variation in price that is
explained by changes in input costs when estimating the impact of price on
sales. It is very common to see that a predictive model (e.g., least squares
regression) might have very high explanatory power (e.g., high R²), while
the causal model (e.g., instrumental variables regression) might have very
low explanatory power (in terms of predicting outcomes). In other words,
economists typically abandon the goal of accurate prediction of outcomes
in pursuit of an unbiased estimate of a causal parameter of interest.
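To make this concrete, the following is a minimal simulation sketch of the logic just described; the data-generating process and the variable names (cost, demand_shock, price, sales) are hypothetical illustrations rather than anything drawn from a particular study. An ordinary least squares regression of sales on price is contaminated by the demand shifter, while an instrumental variables estimator that uses only the variation in price explained by input costs recovers the true price effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
demand_shock = rng.normal(size=n)   # unobserved confounder shifting demand
cost = rng.normal(size=n)           # instrument: input cost, independent of demand shifters
price = cost + demand_shock + 0.5 * rng.normal(size=n)          # the firm's price responds to both
sales = -2.0 * price + 3.0 * demand_shock + rng.normal(size=n)  # true causal effect of price is -2

# Predictive (OLS) regression of sales on price: good fit, but a biased coefficient
ols = np.cov(price, sales)[0, 1] / np.var(price)

# Instrumental variables: project price onto cost and use only that variation
b_first = np.cov(cost, price)[0, 1] / np.var(cost)       # first stage: price on cost
price_hat = b_first * cost                               # part of price explained by input costs
iv = np.cov(price_hat, sales)[0, 1] / np.var(price_hat)  # second stage: sales on fitted price

print(ols, iv)  # OLS is pulled away from -2 by the demand shock; IV is close to -2
```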
Another difference derives from the key concerns in different approaches, and how those concerns are addressed. In predictive models, the key concern is the trade-off between expressiveness and overfitting, and this trade-off can be evaluated by looking at goodness of fit in an independent test set. In contrast, there are several distinct concerns for causal models. The first is whether the parameter estimates from a particular sample are spurious, that is, whether estimates arise due to sampling variation so that if a new random sample of the same size was drawn from the population, the parameter estimate would be substantially different. The typical approach to this problem in econometrics and statistics is to prove theorems about the consistency and asymptotic normality of the parameter estimates, propose approaches to estimating the variance of parameter estimates, and finally to use those results to estimate standard errors that reflect the sampling uncertainty (under the conditions of the theory). A more data-driven
approach is to use bootstrapping and estimate the empirical distribution of
parameter estimates across bootstrap samples. The typical ML approach
of evaluating performance in a test set does not directly handle the issue of
the uncertainty over parameter estimates, since the parameter of interest is
not actually observed in any test set. The researcher would need to estimate
the parameter again in the test set.
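As a rough sketch of the bootstrap approach mentioned above (with a simple regression slope standing in for the causal parameter of interest, and all data simulated for illustration), one can re-estimate the parameter on many resampled data sets and read the sampling uncertainty off the spread of the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def slope(x, y):
    # the parameter estimate whose sampling uncertainty we want to quantify
    return np.cov(x, y)[0, 1] / np.var(x)

# Empirical distribution of the estimate across bootstrap samples
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)   # resample observations with replacement
    boot.append(slope(x[idx], y[idx]))

print(slope(x, y), np.std(boot))  # point estimate and its bootstrap standard error
```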
A second concern is whether the assumptions required to "identify" a causal effect are satisfied, where in econometrics we say that a parameter is identified if we can learn it eventually with infinite data (where even in the limit, the data has the same structure as in the sample considered). It is well known that the causal effect of a treatment is not identified without making
assumptions, assumptions that are generally not testable (that is, they cannot
be rejected by looking at the data). Examples of identifying assumptions
include the assumption that the treatment is randomly assigned, or that
treatment assignment is “unconfounded.” In some settings, these assump-
tions require the analyst to observe all potential “confounders” and con-
trol for them adequately; in other settings, the assumptions require that an
instrumental variable is uncorrelated with the unobserved component of
outcomes. In many cases it can be proven that even with a data set of infinite
size, the assumptions are not testable—they cannot be rejected by looking at
the data, and instead must be evaluated on substantive grounds. Justifying
assumptions is one of the primary components of an observational study in
applied economics. If the “identifying” assumptions are violated, estimates
may be biased (in the same way) in both training data and test data. Testing
assumptions usually requires additional information, like multiple experi-
ments (designed or natural) in the data. Thus, the ML approach of evaluat-
ing performance in a test set does not address this concern at all. Instead, ML
is likely to help make estimation methods more credible, while maintaining
the identifying assumptions: in practice, coming up with estimation methods
that give unbiased estimates of treatment effects requires flexibly modeling
a variety of empirical relationships, such as the relationship between the
treatment assignment and covariates. Since ML excels at data- driven model
selection, it can be useful in systematizing the search for the best functional
forms when implementing an estimation technique.
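For concreteness, the two identifying assumptions mentioned above can be written in standard potential-outcomes notation; the symbols below (Y_i(w) for the potential outcomes, W_i for the treatment, X_i for the covariates, Z_i for the instrument, and ε_i for the unobserved component of outcomes) are introduced here only for illustration and are not defined elsewhere in this section.

```latex
% Unconfoundedness: conditional on covariates, treatment is as good as randomly assigned
\bigl(Y_i(0),\,Y_i(1)\bigr)\ \perp\ W_i \mid X_i .

% Instrument validity: the instrument moves the treatment but is uncorrelated
% with the unobserved component of outcomes
\operatorname{Cov}(Z_i, W_i) \neq 0, \qquad \mathrm{E}[\,Z_i \varepsilon_i\,] = 0 .
```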
Economists also build more complex models that incorporate both be-
havioral and statistical assumptions in order to estimate the impact of coun-
terfactual policies that have never been used before. A classic example is
McFadden’s methodological work in the early 1970s (e.g., McFadden 1973)
analyzing transportation choices. By imposing the behavioral assumption
that consumers maximize utility when making choices, it is possible to esti-
mate parameters of the consumer’s utility function and estimate the welfare
effects and market share changes that would occur when a choice is added
or removed (e.g., extending the BART transportation system), or when
the characteristics of the good (e.g., price) are changed. Another example
with more complicated behavioral assumptions is the case of auctions. For
a data set with bids from procurement auctions, the “structural” approach
involves estimating a probability distribution over bidder values, and then
evaluating the counterfactual effect of changing auction design (e.g., Laf-
font, Ossard, and Vuong 1995; Athey, Levin, and Seira 2011; Athey, Coey,
and Levin 2013; or the review by Athey and Haile 2007). For further discus-
sions of the contrast between prediction and parameter estimation, see the
recent review by Mullainathan and Spiess (2017). There is a small litera-
ture in ML referred to as “inverse reinforcement learning” (Ng and Russell
2000) that has a similar approach to the structural estimation literature in economics; this ML literature has mostly operated independently without
much reference to the earlier econometric literature. The literature attempts
to learn “reward functions” (utility functions) from observed behavior in
dynamic settings.
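To illustrate the structural approach in the simplest possible way, the sketch below estimates a McFadden-style conditional logit on simulated choice data and then uses the fitted utility parameters to compute counterfactual market shares after a change in one product characteristic. The data, the number of alternatives, and the counterfactual are all hypothetical; real applications are far richer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, J, k = 5_000, 3, 2                        # consumers, alternatives, characteristics
X = rng.normal(size=(n, J, k))               # observed characteristics (e.g., price, travel time)
beta_true = np.array([-1.0, 0.5])
u = X @ beta_true + rng.gumbel(size=(n, J))  # utility = x'beta + Type I extreme value shock
y = u.argmax(axis=1)                         # utility-maximizing choices

def neg_loglik(beta):
    v = X @ beta                             # systematic utility
    v -= v.max(axis=1, keepdims=True)        # for numerical stability
    p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(n), y]).sum()

beta_hat = minimize(neg_loglik, np.zeros(k)).x   # maximum likelihood estimate of utility parameters

def shares(X_any, beta):
    v = X_any @ beta
    v -= v.max(axis=1, keepdims=True)
    p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
    return p.mean(axis=0)                    # predicted market shares

X_cf = X.copy()
X_cf[:, 0, 0] += 1.0                         # counterfactual: raise alternative 0's first characteristic
print(shares(X, beta_hat), shares(X_cf, beta_hat))
```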
There are also other categories of ML models; for example, anomaly
detection focuses on looking for outliers or unusual behavior and is used,
for example, to detect network intrusion, fraud, or system failures. Other
categories that I will return to are reinforcement learning (roughly, approxi-
mate dynamic programming) and multiarmed bandit experimentation
(dynamic experimentation where the probability of selecting an arm is cho-
sen to balance exploration and exploitation). These literatures often take
a more explicitly causal perspective and thus are somewhat easier to relate
to economic models, and so my general statements about the lack of focus
on causal inference in ML must be qualified when discussing the literature
on bandits.
Before proceeding, it is useful to highlight one other contribution of the
ML literature. The contribution is computational rather than conceptual,
but it has had such a large impact that it merits a short discussion. The tech-
nique is called stochastic gradient descent (SGD), and it is used in many dif-
ferent types of models, including the estimation of neural networks as well
as large scale Bayesian models (e.g., Ruiz, Athey, and Blei [2017], discussed
in more detail below). In short, stochastic gradient descent is a method for
optimizing an objective function, such as a likelihood function or a gener-
alized method of moments objective function, with respect to parameters.
When the objective function is expensive to compute (e.g., because it requires
numerical integration), stochastic gradient descent can be used. The main
idea is that if the objective is the sum of terms, each term corresponding to a
single observation, the gradient can be approximated by picking a single data
point and using the gradient evaluated at that observation as an approxima-
tion to the average (over observations) of the gradient. This estimate of the
gradient will be very noisy, but unbiased. The idea is that it is more effective
to “climb a hill” taking lots of steps in a direction that is noisy but unbiased,
than it is to take a small number of steps, each in the right direction, which is
what happens if computational resources are focused on getting very precise
estimates of the gradient of the objective at each step. Stochastic gradient
descent can lead to dramatic performance improvements, and thus enable
the estimation of very complex models that would be intractable using tra-
ditional approaches.
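As a minimal sketch of the idea (with a simple least-squares objective standing in for a likelihood or moment function, and all settings hypothetical), each step of stochastic gradient descent uses the gradient at a single randomly chosen observation, which is a noisy but unbiased estimate of the average gradient over observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 5
X = rng.normal(size=(n, k))
theta_true = rng.normal(size=k)
y = X @ theta_true + rng.normal(size=n)

theta = np.zeros(k)
step = 0.001
for t in range(500_000):
    i = rng.integers(n)                          # pick a single observation at random
    grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]  # noisy but unbiased estimate of the average gradient
    theta -= step * grad_i                       # take a small step in the noisy descent direction

print(np.abs(theta - theta_true).max())  # small; the remaining noise shrinks with the step size
```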
21.3 Using Prediction Methods in Policy Analysis
21.3.1 Applications of Prediction Methods to Policy Problems in Economics
There have already been a number of successful applications of predic-
tion methodology to policy problems. Kleinberg et al. (2015) have argued
that there is a set of problems where off-the-shelf ML methods for predic-
tion are the key part of important policy and decision problems. They use
examples like deciding whether to do a hip replacement operation for an
elderly patient; if you can predict based on their individual characteris-
tics that they will die within a year, then you should not do the operation.
Many Americans are incarcerated while awaiting trial; if you can predict
who will show up for court, you can let more out on bail. Machine-learning
algorithms are currently in use for this decision in a number of jurisdic-
tions. Another natural example is credit scoring; an economics paper by
Bjorkegren and Grissen (2017) uses ML methods to predict loan repayment
using mobile phone data.
In other applications, Goel, Rao, and Shroff (2016) use ML methods to
examine stop-and-frisk laws, using observables of a police incident to pre-
dict the probability that a suspect has a weapon, and they show that blacks
are much less likely than whites to have a weapon conditional on observ-
ables and being frisked. Glaeser, Hillis, et al. (2016) helped cities design a
contest to build a predictive model that predicted health code violations in
restaurants in order to better allocate inspector resources. There is a rap-
idly growing literature using machine learning together with images from
satellites and street maps to predict poverty, safety, and home values (see,
e.g., Naik et al. 2017). As Glaeser, Kominers, et al. (2015) argue, there are
a variety of applications of this type of prediction methodology. It can be
used to compare outcomes over time at a very granular level, thus making
it possible to assess the impact of a variety of policies and changes, such as
neighborhood revitalization. More broadly, the new opportunities created
by large-scale imagery and sensors may lead to new types of analyses of
productivity and well- being.
Although prediction is often a large part of a resource allocation prob-
lem—there is likely to be agreement that people who will almost certainly
die soon should not receive hip replacement surgery, and rich people should
not receive poverty aid—Athey (2017) discusses the gap between identify-
ing units that are at risk and those for whom intervention is most beneficial. Determining which units should receive a treatment is a causal inference question, and answering it requires different types of data than prediction.
Either randomized experiments or natural experiments may be needed to
estimate heterogeneous treatment eff ects and optimal assignment policies.
In business applications, it has been common to ignore this distinction and
focus on risk identification; for example, as of 2017, the Facebook advertising optimization tool provided to advertisers optimizes for consumer clicks, but not for the causal effect of the advertisement. The distinction is often not
emphasized in marketing materials and discussions in the business world,
perhaps because many practitioners and engineers are not well versed in
the distinction between prediction and causal inference.
21.3.2 Additional Topics in Prediction for Policy Settings
Athey (2017) summarizes a variety of research questions that arise when
prediction methods are taken into policy applications. A number of these
have attracted initial attention in both ML and the social sciences, and
interdisciplinary conferences and workshops have begun to explore these
issues.
One set of questions concerns interpretability of models. There are discus-
sions of what interpretability means, and whether simpler models have
advantages. Of course, economists have long understood that simple models
can also be misleading. In social sciences data, it is typical that many attri-
butes of individuals or locations are positively correlated—parents’ educa-
tion, parents’ income, child’s education, and so on. If we are interested in a
conditional mean function, and estimate μ̂(x) = E[Y_i | X_i = x], using a simpler model that omits a subset of covariates may be misleading. In the simpler
model, the relationship between the omitted covariates and outcomes is
loaded onto the covariates that are included. Omitting a covariate from a
model is not the same thing as controlling for it in an analysis, and it can
sometimes be easier to interpret a partial eff ect of a covariate controlling for
other factors than it is to keep in mind all of the other (omitted) factors and
how they covary with those included in a model. So, simpler models can
sometimes be misleading; they may seem easy to understand, but the under-
standing gained from them may be incomplete or wrong.
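A small simulated example (with hypothetical variable names and coefficients) illustrates the point: when a correlated covariate is omitted from the simpler model, its relationship with the outcome is loaded onto the covariate that remains, so the simpler coefficient does not have the partial-effect interpretation a reader might assume.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
parent_educ = rng.normal(size=n)
parent_income = 0.8 * parent_educ + 0.6 * rng.normal(size=n)      # positively correlated attributes
y = 1.0 * parent_educ + 1.0 * parent_income + rng.normal(size=n)  # child outcome

# "Simple" model omitting parent_income: its effect is absorbed by parent_educ
simple_coef = np.cov(parent_educ, y)[0, 1] / np.var(parent_educ)

# Fuller model controlling for both covariates
X = np.column_stack([np.ones(n), parent_educ, parent_income])
full_coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(simple_coef)     # roughly 1.8: includes the omitted covariate's contribution
print(full_coefs[1:])  # roughly [1.0, 1.0]: partial effects, each controlling for the other
```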
One type of model that typically is easy to interpret and explain is a causal
model. As reviewed in Imbens and Rubin (2015), the causal inference frame-
work typically makes the estimand very precise—for example, the average
effect if a treatment were applied to a particular population, the conditional average treatment effect (conditional on some observable characteristics of individuals), or the average effect of a treatment on a subpopulation such as "compliers" (those whose treatment adoption is affected by an instrumental variable). Such parameters by definition give the answer to a well-defined question, and so the magnitudes are straightforward to interpret. Key parameters of "structural" models are also straightforward to interpret—they represent parameters of consumer utility functions, elasticities of demand curves, bidder valuations in auctions, marginal costs of firms, and so on. An
area for further research concerns whether there are other ways to math-
ematically formalize what it means for a model to be interpretable, or to
analyze empirically the implications of interpretability. Yeomans, Shah, and
Kleinberg (2016) study empirically a related issue of how much people trust
ML-based recommender systems, and why.
Another area that has attracted a lot of attention is the question of fair-
ness and nondiscrimination, for example, whether algorithms will promote
discrimination by gender or race when used in settings like hiring, judicial
decisions, or lending. There are a number of interesting questions that can
be considered. One is, how can fairness constraints be defined? What type
of fairness is desired? For example, if a predictive model is used to allocate
job interviews based on resumes, there are two types of errors, Type I and
Type II. It is straightforward to show that it is in general impossible to
equalize both Type I and Type II errors across two different categories of people (e.g., men and women), so the analyst must choose which to equalize (or both). See Kleinberg, Mullainathan, and Raghavan (2016) for further analysis and development of the inherent trade-offs in fairness in predictive
algorithms. Overall, the literature on this topic has grown rapidly in the last
two years, and we expect that as ML algorithms are deployed in more and
more contexts, the topic will continue to develop. My view is that it is more
likely that ML models will help make resource allocation more rather than
less fair; algorithms can absorb and effectively use a lot more information
than humans, and thus are less likely than humans to rely on stereotypes.
To the extent that unconstrained algorithms do have undesirable distribu-