The Economics of Artificial Intelligence


by Ajay Agrawal


optimal. Further, it is also not important from the perspective of expected payoffs to statistically distinguish two very similar treatments. The literature has developed a number of heuristics for managing the explore-exploit trade-off; for example, “Thompson sampling” allocates units to treatment arms in proportion to the estimated probability that each treatment arm is the best.
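
To make the heuristic concrete, here is a minimal sketch of Thompson sampling for a two-armed experiment with binary outcomes and Beta posteriors; the success rates, priors, and horizon are illustrative assumptions rather than anything from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative example: two treatment arms with (unknown) success rates.
true_rates = [0.10, 0.12]
successes = np.zeros(2)   # Beta(1, 1) priors on each arm's success rate
failures = np.zeros(2)

for t in range(10_000):
    # Draw one sample from each arm's posterior; assigning the unit to the
    # arm with the highest draw allocates units in proportion to the
    # posterior probability that each arm is the best.
    draws = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(draws))

    outcome = rng.random() < true_rates[arm]
    successes[arm] += outcome
    failures[arm] += 1 - outcome

print("units per arm:", successes + failures)
print("posterior means:", (successes + 1) / (successes + failures + 2))
```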

There is much less known about the setting where individuals have observed attributes, in which case the goal is to construct and evaluate personalized treatment assignment policies. This problem has been termed the “contextual bandit” problem, since treatment assignments are sensitive to the “context” (in this case, user characteristics). At first, the problem seems very challenging because the space of possible policies is large and complex (each policy maps from user characteristics to the space of possible treatments). However, if the returns to each of the actions can be estimated as a function of individual attributes, a policy can be constructed by finding the action whose return is estimated to be highest, balanced against the need for exploration. Although there are a number of proposed methods for the contextual bandit problem in the literature already, there is relatively little known about how to select among methods and which ones are likely to perform best in practice. For example, the literature on optimal policy estimation suggests that particular approaches to policy estimation may work better than others.

In particular, there are a variety of choices a researcher must make when selecting a contextual bandit algorithm. These include the choice of the model that maps user characteristics to expected outcomes (where the literature has considered alternatives such as ridge regression [Li et al. 2010]; ordinary least squares (OLS) [Goldenshluger and Zeevi 2013]; generalized linear models (GLM) [Li, Lu, and Zhou 2017]; LASSO [Bastani and Bayati 2015]; and random forests [Dimakopoulou, Athey, and Imbens 2017; Feraud et al. 2016]). Another choice concerns the heuristic used to balance exploration against exploitation, with Thompson sampling and upper confidence bounds (UCB) as the leading choices (Chapelle and Li 2011).
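
As an illustration of how these choices fit together, the sketch below pairs one of the models listed above (a ridge regression per treatment arm) with a UCB-style exploration bonus, in the spirit of the LinUCB algorithm of Li et al. (2010); the bonus form, constants, and simulated data are illustrative assumptions, not a recommendation from the chapter.

```python
import numpy as np

class LinUCB:
    """Ridge regression per arm with an upper-confidence exploration bonus
    (in the spirit of Li et al. 2010); alpha and the ridge penalty are
    illustrative tuning choices."""

    def __init__(self, n_arms, n_features, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        self.A = [ridge * np.eye(n_features) for _ in range(n_arms)]  # X'X + ridge*I
        self.b = [np.zeros(n_features) for _ in range(n_arms)]        # X'y

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # ridge estimate of the arm's reward model
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # exploration bonus for uncertain contexts
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Illustrative usage with simulated contexts and rewards.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(3, 5))   # 3 treatment arms, 5 user characteristics
bandit = LinUCB(n_arms=3, n_features=5)
for t in range(2_000):
    x = rng.normal(size=5)
    arm = bandit.choose(x)
    reward = true_theta[arm] @ x + rng.normal(scale=0.5)
    bandit.update(arm, x, reward)
```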

Dimakopoulou, Athey, and Imbens (2017) highlights some issues that arise uniquely in the contextual bandit and that relate directly to the estimation issues that have been the focus of the literature on estimation of treatment effects (Imbens and Rubin 2015). For example, the paper highlights the comparison between noncontextual bandits, where many future individuals will arrive with exactly the same context (since all units share the same context), and contextual bandits, where each unit is unique. The assignment of a particular individual thus contributes to learning for the future only indirectly, since future individuals will have different contexts (characteristics). The fact that the exploration benefits the future through a model of how contexts relate to outcomes changes the problem.

This discussion highlights a further theme for the connection between ML and causal inference: estimation considerations matter even more in the “small sample” settings of contextual bandits, where the assumption is that there is not enough data available to the policymaker to estimate perfectly the optimal assignment. However, we know from the econometrics literature that the small sample properties of different estimators can vary substantially across settings (Imbens and Rubin 2015), making it clear that the best contextual bandit approach is likely to also vary across settings.


  21.4.4 Robustness and Supplementary Analysis

In a recent review paper, Athey and Imbens (2017) highlights the importance of “supplementary analyses” for establishing the credibility of causal estimates in environments where crucial assumptions are not directly testable without additional information. Examples of supplementary analyses include placebo tests, whereby the analyst assesses whether a given model is likely to find evidence of treatment effects even at times where no treatment effect should be found. One type of supplementary analysis is a robustness measure. Athey and Imbens (2015) proposes to use ML-based methods to develop a range of different estimates of a target parameter (e.g., a treatment effect), where the range is created by introducing interaction effects between model parameters and covariates. The robustness measure is defined as the standard deviation of parameter estimates across model specifications. This paper provides one possible approach to ML-based robustness measures, but I predict that more approaches will develop over time as ML methods become more popular.
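
As a loose illustration of this idea (not the specific construction in Athey and Imbens 2015), the sketch below computes a treatment-effect estimate under several regression specifications that differ in which covariate interactions are included, and reports the standard deviation of the estimates across specifications; the simulated data and the list of specifications are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: treatment w, two covariates, true treatment effect of 1.0.
n = 2_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
w = rng.binomial(1, 1 / (1 + np.exp(-x1)))        # treatment probability depends on x1
y = 1.0 * w + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

def ols_effect(design):
    """OLS coefficient on the treatment (first column of the design)."""
    X = np.column_stack([np.ones(n)] + design)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1]

# Specifications differ in which covariates and interactions enter the model.
specs = {
    "w only": [w],
    "w + x1 + x2": [w, x1, x2],
    "w + x1 + x2 + w*x1": [w, x1, x2, w * x1],
    "w + x1 + x2 + w*x1 + w*x2": [w, x1, x2, w * x1, w * x2],
}

estimates = {name: ols_effect(design) for name, design in specs.items()}
print(estimates)
print("robustness measure (std of estimates across specifications):",
      np.std(list(estimates.values())))
```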

Another type of ML-based supplementary analysis, proposed by Athey, Imbens, et al. (2017), uses ML-based methods to construct a measure of how challenging the confounding problem is in a particular setting. The proposed measure constructs an estimated conditional mean function for the outcome as well as an estimated propensity score, and then estimates the correlation between the two.
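
The sketch below conveys the flavor of such a measure (the exact construction in Athey, Imbens, et al. 2017 may differ): it fits one random forest for the conditional mean of the outcome and another for the propensity score, then reports the correlation between the two sets of fitted values; the simulated data and the choice of random forests are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)

# Simulated data in which the same covariate drives both treatment and outcome,
# so the confounding problem is relatively severe.
n = 5_000
X = rng.normal(size=(n, 5))
w = rng.binomial(1, 1 / (1 + np.exp(-2 * X[:, 0])))
y = 2.0 * X[:, 0] + w + rng.normal(size=n)

# Estimated conditional mean of the outcome given covariates.
mu_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y).predict(X)

# Estimated propensity score: probability of treatment given covariates.
e_hat = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, w).predict_proba(X)[:, 1]

# A high correlation suggests that units likely to be treated also tend to have
# systematically different outcomes, i.e., confounding is a serious concern.
print("correlation between fitted values:", np.corrcoef(mu_hat, e_hat)[0, 1])
```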

There is much more potential for supplementary analyses to be further developed; the fact that ML has well-defined, systematic algorithms for comparing a wide range of model specifications makes ML well suited for constructing additional robustness checks and supplementary analyses.

21.4.5 Panel Data and Difference-in-Difference Models

Another commonly used approach to identifying causal effects is to exploit assumptions about how outcomes vary across units and over time in panel data. In a typical panel-data setting, units are not necessarily assigned to a treatment randomly, but all units are observed prior to some units being treated; the identifying assumption is that one or more untreated units can be used to provide an estimate of the counterfactual time trend that would have occurred for the treated units in the absence of the treatment. The simplest “difference-in-difference” case involves two groups and two time periods; more broadly, panel data may include many groups and many periods. Traditional econometric models for the panel-data case exploit functional form assumptions, for example, assuming that a unit’s outcome in a particular time period is an additive function of a unit effect, a time effect, and an independent shock. The unit effect can then be inferred for treated units in the pretreatment period, while the time effect can be inferred from the untreated units in the periods where some units receive the treatment. Note that this structure implies that the matrix of mean outcomes (with rows associated with units and columns associated with time) has a very simple structure: it has rank two.
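
In symbols (the notation here is ours, not the chapter’s), the additive structure and the resulting rank-two mean matrix can be written as:

```latex
% Additive two-way model for the untreated outcome of unit i in period t:
%   unit effect + time effect + idiosyncratic shock.
Y_{it}(0) = \alpha_i + \gamma_t + \varepsilon_{it}, \qquad \mathbb{E}[\varepsilon_{it}] = 0 .

% The N x T matrix of mean untreated outcomes therefore factors as
\mathbb{E}\bigl[Y(0)\bigr]
  = \alpha \mathbf{1}_T^{\top} + \mathbf{1}_N \gamma^{\top}
  = \begin{pmatrix} \alpha & \mathbf{1}_N \end{pmatrix}
    \begin{pmatrix} \mathbf{1}_T^{\top} \\ \gamma^{\top} \end{pmatrix},

% a product of an N x 2 matrix and a 2 x T matrix, so its rank is at most two.
```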

There have been a few recent approaches bringing ML tools to the panel-data setting. Doudchenko and Imbens (2016) develop an approach inspired by synthetic controls (pioneered by Abadie, Diamond, and Hainmueller 2010), where a weighted average of control observations is used to construct the counterfactual untreated outcomes for treated units in treated periods. Doudchenko and Imbens (2016) propose using regularized regression to determine the weights, with the penalty parameter selected via cross-validation.
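
A minimal sketch of this idea, not the exact estimator of Doudchenko and Imbens (2016): regress the treated unit’s pre-treatment outcomes on the control units’ pre-treatment outcomes with an elastic-net penalty chosen by cross-validation, then use the fitted weights to predict the treated unit’s counterfactual path after treatment. The simulated panel is an assumption made for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)

# Simulated panel: 30 control units and 1 treated unit over 40 periods,
# with treatment starting in period 30.
n_controls, T, T0 = 30, 40, 30
factors = rng.normal(size=(2, T))                     # common time factors
loadings = rng.normal(size=(n_controls + 1, 2))
Y = loadings @ factors + 0.3 * rng.normal(size=(n_controls + 1, T))
treated, controls = Y[0], Y[1:]
treated[T0:] += 1.0                                   # true treatment effect of 1.0

# Fit weights on control units using only pre-treatment periods,
# with the penalty parameter chosen by cross-validation.
model = ElasticNetCV(cv=5).fit(controls[:, :T0].T, treated[:T0])

# Counterfactual untreated path for the treated unit in the treated periods.
counterfactual = model.predict(controls[:, T0:].T)
print("estimated effects by period:", treated[T0:] - counterfactual)
```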

  Factor Models and Matrix Completion

Another way to think about causal inference in a panel-data setting is to consider a matrix completion problem; Athey, Bayati, et al. (2017) propose taking such a perspective. In the ML literature, a matrix completion problem is one where there is an observed matrix of data (in our case, units and time periods), but some of the entries are missing. The goal is to provide the best possible prediction of what those entries should be. For the panel-data application, we can think of the units and time periods where the units are treated as the missing entries, since we don’t observe the counterfactual outcomes of those units in the absence of the treatment (this is the key bit of missing information for estimating the treatment effect).

Athey, Bayati, et al. (2017) propose using a matrix version of regularized regression to find a matrix that well approximates the matrix of untreated outcomes (a matrix that has missing elements corresponding to treated units and periods). Recall that LASSO regression minimizes the sum of squared errors in sample, plus a penalty term that is proportional to the sum of the magnitudes of the coefficients in the regression. We propose a matrix regression that minimizes the sum of squared errors over the observed elements of the matrix, plus a penalty term proportional to the nuclear norm of the matrix. The nuclear norm is the sum of the absolute values of the singular values of the matrix. A matrix that has a low nuclear norm is well approximated by a low-rank matrix.
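
A standard way to approximately solve an objective of this form is to iterate between filling in the missing entries and soft-thresholding the singular values of the filled-in matrix (a “soft-impute” style procedure); the sketch below follows that recipe and is not meant to reproduce the estimator of Athey, Bayati, et al. (2017).

```python
import numpy as np

def soft_impute(Y, observed, lam, n_iters=200):
    """Approximately minimize  (squared error on observed entries)
    + lam * (nuclear norm)  by iteratively completing the matrix and
    soft-thresholding its singular values."""
    Z = np.where(observed, Y, 0.0)
    for _ in range(n_iters):
        # Low-nuclear-norm estimate from the current completed matrix.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)     # soft-threshold the singular values
        L = (U * s) @ Vt
        # Keep observed entries at their actual values; impute the rest.
        Z = np.where(observed, Y, L)
    return L

# Illustrative panel: a rank-2 matrix of untreated outcomes plus noise, with
# the treated (unit, period) cells marked as missing.
rng = np.random.default_rng(0)
N, T = 50, 40
truth = rng.normal(size=(N, 2)) @ rng.normal(size=(2, T))
Y = truth + 0.1 * rng.normal(size=(N, T))
observed = np.ones((N, T), dtype=bool)
observed[:10, 30:] = False               # first 10 units treated after period 30

L = soft_impute(Y, observed, lam=1.0)
print("RMSE on imputed (counterfactual) cells:",
      np.sqrt(np.mean((L[~observed] - truth[~observed]) ** 2)))
```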

How do we interpret the idea that a matrix can be well approximated by a low-rank matrix? A low-rank matrix can be “factored” into the product of two matrices. In the panel-data case, we can interpret such a factorization as incorporating a vector of latent characteristics for each unit and a vector of latent characteristics of each period. The outcome of a particular unit in a particular period, if untreated, is approximately equal to the inner product of the unit’s characteristics and the period characteristics. For example, if the data concerned employment at the county level, we can think of the counties as having outcomes that depend on the share of employment in different industries, and then each industry has common shocks in each period. So a county’s latent characteristic would be the vector of industry shares, and the time characteristics would be industry shocks in a given period.
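
In symbols (notation ours), the low-rank approximation and the county-employment example read:

```latex
% Low-rank approximation: unit i has latent characteristics u_i in R^K,
% period t has latent characteristics v_t in R^K, and
Y_{it}(0) \approx u_i^{\top} v_t = \sum_{k=1}^{K} u_{ik} \, v_{tk} .

% County-employment example: u_{ik} is county i's employment share in
% industry k, and v_{tk} is the common shock to industry k in period t, so
% the county's untreated outcome is a share-weighted sum of industry shocks.
```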

Athey, Bayati, et al. (2017) show that the matrix completion approach reduces to commonly employed techniques in the econometrics literature when the assumptions needed for those approaches hold, but the matrix completion approach is able to model more complex patterns in the data, while allowing the data (rather than the analyst) to indicate whether time-series patterns within units, or cross-sectional patterns within a period, or a more complex combination, are more useful for predicting counterfactual outcomes.

The matrix completion approach can be linked to a literature that has grown in the last two decades in time-series econometrics on factor models (see, e.g., Bai and Ng 2008 for a review). The matrix-factorization approach is similar, but rather than assuming that the true model has a fixed but unknown number of factors, the matrix-completion approach simply looks for the best fit while penalizing the norm of the matrix. The matrix is well approximated by one with a small number of factors, but does not need to be exactly represented that way. Athey, Bayati, et al. (2017) describe a number of advantages of the matrix completion approach, and also show that it performs better than existing panel-data causal inference approaches in a range of settings.

  21.4.6 Factor Models and Structural Models

Another important area of connection between machine learning and causal inference concerns more complex structural models. For decades, scholars working at the intersection of marketing and economics have built structural models of consumer choice, sometimes in dynamic environments, and used Bayesian estimation, often Markov Chain Monte Carlo, to estimate these models. Recently, the ML literature has developed a variety of techniques that allow similar types of Bayesian models to be estimated at larger scale. These have been applied to settings such as textual analysis and consumer choices of, for example, movies at Netflix (see, for example, Blei, Ng, and Jordan [2003] and Blei [2012]). I expect to see much closer synergies between these two literatures in the future. For example, Athey, Blei, et al. (2017) builds on models of hierarchical Poisson factorization to create models of consumer demand, where a consumer’s preferences over thousands of products are considered simultaneously, but the consumer’s choices in each product category are independent of one another. The model reduces the dimensionality of this problem by using a lower-dimensional factor representation of a consumer’s mean utility as well as the consumer’s price sensitivity for each product. The paper establishes that substantial efficiency gains are possible by considering many product categories in parallel; it is possible to learn about a consumer’s price sensitivity in one product using behavior in other products. The paper departs from the pure prediction literature in ML by evaluating and tuning the model based on how it does at predicting consumer responses to price changes, rather than simply on overall goodness of fit. In particular, the paper highlights that different models would be selected for the “goodness of fit” objective as opposed to the “counterfactual inference” objective. In order to achieve this goal, the paper analyzes goodness of fit in terms of predicting changes in demand for products before and after price changes, after providing evidence that the price changes can be treated as natural experiments after conditioning on week effects (price changes always occur mid-week). The paper also demonstrates the benefits of personalized prediction relative to more standard demand estimation methods. Thus the paper again highlights the theme that for causal inference, the objective function differs from standard prediction.
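
Schematically (this is a stylized rendering, not the exact specification of Athey, Blei, et al. 2017), factorizing both the mean utility and the price sensitivity means that each consumer and each product is summarized by low-dimensional latent vectors:

```latex
% Consumer i's utility from product j at price p_j, schematically:
U_{ij} \approx \underbrace{\theta_i^{\top} \beta_j}_{\text{factorized mean utility}}
  \;-\; \underbrace{\bigl(\gamma_i^{\top} \omega_j\bigr)}_{\text{factorized price sensitivity}} \, p_j
  \;+\; \varepsilon_{ij},

% where theta_i, gamma_i are latent consumer vectors and beta_j, omega_j are
% latent product vectors. Because gamma_i is shared across products, behavior in
% one product category is informative about price sensitivity in others.
```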

With more scalable computational methods, it becomes possible to build much richer models with much less prior information about products. Ruiz, Athey, and Blei (2017) analyzes consumer preferences for bundles selected from over 5,000 items in a grocery store, without incorporating information about which items are in the same category. Thus, the model uncovers whether items are substitutes or complements. Since there are 2^5,000 bundles when there are 5,000 products, in principle each individual consumer’s utility function has 2^5,000 parameters. Even if we restrict the utility function to have only pairwise interaction effects, there are still millions of parameters of a consumer’s utility function over bundles. Ruiz, Athey, and Blei (2017) uses a matrix-factorization approach to reduce the dimensionality of the problem, factorizing the mean utilities of the items, the interaction effects among items, and the user’s price sensitivity for the items. Price and availability variation in the data allows the model to distinguish correlated preferences (some consumers like both coffee and diapers) from complementarity (tacos and taco shells are more valuable together). In order to further simplify the analysis, the model assumes that consumers are boundedly rational when they make choices, considering the interactions among products as they sequentially add items to the cart. The alternative, in which the consumer considers all 2^5,000 bundles and optimizes among them, does not seem plausible. Incorporating human computational constraints into structural models thus appears to be another potentially fruitful avenue at the intersection of ML and economics. In the computational algorithm for Ruiz, Athey, and Blei (2017), we rely on a technique called variational inference to approximate the posterior distribution, as well as stochastic gradient descent (described in detail above) to find the parameters that provide the best approximation.
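
To see where the parameter counts above come from: with J = 5,000 items there are 2^J possible bundles, and even a utility function restricted to item-level terms plus pairwise interactions has on the order of J^2/2 parameters per consumer.

```latex
% Number of possible bundles from J = 5,000 items:
2^{J} = 2^{5{,}000} \approx 10^{1{,}505}.

% Parameters of a utility function with only item-level terms and pairwise
% interaction effects, before any factorization:
J + \binom{J}{2} = 5{,}000 + \frac{5{,}000 \times 4{,}999}{2} \approx 1.25 \times 10^{7},

% i.e., millions of parameters per consumer, which is why the paper factorizes
% the mean utilities, the pairwise interactions, and the price sensitivities.
```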

In another application of similar methodology, Athey et al. (2018) analyzes consumer choices over lunchtime restaurants using data from a sample of several thousand mobile phone users in the San Francisco Bay Area. The data is used to identify users’ typical morning location, as well as their choices of lunchtime restaurants. We build a model where restaurants have latent characteristics (whose distribution may depend on restaurant observables, such as star ratings, food category, and price range), and each user has preferences for these latent characteristics, and these preferences are

 
