The Economics of Artificial Intelligence


by Ajay Agrawal


practitioners make use of algorithms where $C_t = \mathrm{diag}(c_{1t}, \ldots, c_{pt})$, with $p$ the dimension of $\Omega$, and each $c_{jt}$ chosen to approximate $\partial^2 L / \partial \omega_j^2$, the corresponding diagonal element of the Hessian matrix of loss-function second derivatives (i.e., what would be used in a Newton's algorithm). The ADAGRAD paper (Duchi, Hazan, and Singer 2011) provides a theoretical foundation for this approach and suggests an algorithm for specifying $c_{jt}$. Most deep learning systems make use of ADAGRAD-inspired algorithms, such as ADAM (Kingma and Ba 2015), that combine the original algorithm with heuristics that have been shown empirically to improve performance.
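
To make the adaptive step-size idea concrete, here is a minimal NumPy sketch of a single ADAM-style update, assuming the minibatch gradient has already been computed. The function name, arguments, and default constants are illustrative choices rather than anything specified in the chapter.

    import numpy as np

    def adam_step(omega, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient and its elementwise square.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        # Bias-corrected estimates (t counts updates, starting at 1).
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        # The per-parameter scaling lr / (sqrt(v_hat) + eps) plays the role of the
        # diagonal C_t: an inexpensive stand-in for second-derivative information.
        return omega - lr * m_hat / (np.sqrt(v_hat) + eps), m, v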

Finally, there is another key trick to DNN training: dropout. This procedure, proposed by researchers (Srivastava et al. 2014) in Hinton's lab at the University of Toronto, involves introduction of random noise into each gradient calculation. For example, “Bernoulli dropout” replaces current estimates $\omega_{tj}$ with $w_{tj} = \omega_{tj}\,\xi_{tj}$, where $\xi_{tj}$ is a Bernoulli random variable with $p(\xi_{tj} = 1) = c$. Each SGD update from equation (8) then uses these parameter values when evaluating the gradient, such that

(10)  $\Omega_{t+1} = \Omega_t - C_t \nabla L\big(\Omega; \{d_{i_b}\}_{b=1}^{B}\big)\Big|_{\Omega = W_t},$

where $W_t$ is the noised-up version of $\Omega_t$, with elements $w_{tj}$.
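
As a rough illustration of equation (10), the sketch below applies Bernoulli dropout to the parameters before evaluating the gradient. Here grad_loss and minibatch are placeholders for whatever gradient routine and data a real system would supply, and a constant step size lr stands in for the matrix $C_t$.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_sgd_step(omega, grad_loss, minibatch, c=0.8, lr=0.01):
        # Bernoulli noise: each parameter is kept with probability c.
        xi = rng.binomial(1, c, size=omega.shape)
        w = omega * xi                    # W_t, the noised-up parameter vector
        g = grad_loss(w, minibatch)       # gradient evaluated at W_t, not at Omega_t
        return omega - lr * g             # the un-noised parameters are what get updated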

Dropout is used because it has been observed to yield model fits that have lower out-of-sample error rates (so long as you tune c appropriately). Why does this happen? Informally, dropout acts as a type of implicit regularization. An example of explicit regularization is parameter penalization: to avoid over-fit, the minimization objective for DNNs almost always has a $\|\Omega\|_2^2$ ridge penalty term added to the data-likelihood loss function. Dropout plays a similar role. By forcing SGD updates to ignore a random sample of the parameters, it prevents over-fit on any individual parameter.10 More rigorously, it has recently been established by a number of authors (Kendall and Gal 2017) that SGD with dropout corresponds to a type of “variational Bayesian inference.” That means that dropout SGD is solving to find the posterior distribution over $\Omega$ rather than a point estimate.11 As interest grows around uncertainty quantification for DNNs, this interpretation of dropout is one option for bringing Bayesian inference into deep learning.
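
One practical consequence of this Bayesian reading, sketched below under the same Bernoulli-noise assumptions, is that dropout can be left on at prediction time and repeated noisy forward passes treated as approximate posterior draws. The predict function and the keep-rate c are placeholders, not anything defined in the chapter.

    import numpy as np

    rng = np.random.default_rng(1)

    def mc_dropout_predictions(predict, omega, x, c=0.8, n_draws=100):
        # Each draw masks the parameters with fresh Bernoulli noise, giving a
        # crude sample from the implied posterior predictive distribution.
        draws = np.array([predict(x, omega * rng.binomial(1, c, size=omega.shape))
                          for _ in range(n_draws)])
        return draws.mean(axis=0), draws.std(axis=0)   # point estimate and uncertainty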

2.6 Reinforcement Learning

As our final section on the elements of deep learning, we will consider how these AI systems generate their own training data through a mix of experimentation and optimization. Reinforcement learning (RL) is the common term for this aspect of AI. Reinforcement learning is sometimes used to denote specific algorithms, but we are using it to refer to the full area of active data collection.

The general problem can be formulated as a reward-maximization task. You have some policy or “action” function, $d(x_t; \Omega)$, that dictates how the system responds to “event” $t$ with characteristics $x_t$. The event could be a customer arriving on your website at a specific time, or a scenario in a video game, and so forth. After the event, you observe “response” $y_t$ and the reward is calculated as $r(d(x_t; \Omega), y_t)$. During this process you are accumulating data and learning the parameters $\Omega$, so we can write $\Omega_t$ as the parameters used at event $t$. The goal is that this learning converges to some optimal reward-maximizing parametrization, say $\Omega^\star$, and that this happens after some $T$ events where $T$ is not too big; that is, so that you minimize regret,

(11)  $\sum_{t=1}^{T} \big[\, r\big(d(x_t; \Omega^\star), y_t\big) - r\big(d(x_t; \Omega_t), y_t\big) \big].$
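
A tiny numerical illustration of equation (11): regret is the accumulated shortfall of the rewards you actually earned relative to what the optimal parameters would have earned. The per-event reward numbers below are made up.

    import numpy as np

    def cumulative_regret(rewards_optimal, rewards_earned):
        # Equation (11): sum over events of r(d(x_t; Omega*), y_t) - r(d(x_t; Omega_t), y_t).
        return float(np.sum(np.asarray(rewards_optimal) - np.asarray(rewards_earned)))

    print(cumulative_regret([1.0, 1.0, 1.0], [0.5, 0.75, 1.0]))   # 0.75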

10. This seems to contradict our earlier discussion about minimizing the variance of gradient estimates. The distinction is that we want to minimize variance due to noise in the data, but here we are introducing noise in the parameters independent of the data.

11. It is a strange variational distribution, but basically the posterior distribution over $\Omega$ becomes that implied by $W$, with elements $\omega_j$ multiplied by random Bernoulli noise.

This is a very general formulation. We can map it to some familiar scenarios. For example, suppose that the event $t$ is a user landing on your website. You would like to show a banner advertisement on the landing page, and you want to show the ad that has the highest probability of getting clicked by the user. Suppose that there are $J$ different possible ads you can show, such that your action $d_t = d(x_t; \Omega_t) \in \{1, \ldots, J\}$ is the one chosen for display. The final reward is $y_t = 1$ if the user clicks the ad and $y_t = 0$ otherwise.12
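
To ground the sketches that follow, here is a toy simulator for this ad-click setting; the click probabilities are made-up numbers, and the later bandit examples in this section reuse the same kind of setup.

    import numpy as np

    rng = np.random.default_rng(2)

    TRUE_RATES = np.array([0.02, 0.05, 0.03])   # hypothetical click rates for J = 3 ads

    def show_ad(j):
        # Simulate one event: y_t = 1 with ad j's true click probability, else 0.
        return int(rng.binomial(1, TRUE_RATES[j]))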

This specific scenario is a multi-armed bandit (MAB) set-up, so named by analogy to a casino with many slot machines of different payout probabilities (the casino is the bandit). In the classic MAB (or simply “bandit”) problem, there are no covariates associated with each ad and each user, such that you are attempting to optimize toward a single ad that has highest click probability across all users. That is, $\omega_j$ is $p(y_t = 1 \mid d_t = j)$, the generic click probability for ad $j$, and you want to set $d_t$ to the ad with highest $\omega_j$. There are many different algorithms for bandit optimization. They use different heuristics to balance exploitation with exploration. A fully exploitive algorithm is greedy: it always takes the currently estimated best option without any consideration of uncertainty. In our simple advertising example, this implies always converging to the first ad that ever gets clicked on. A fully exploratory algorithm always randomizes the ads and it will never converge to a single optimum. The trick to bandit learning is finding a way to balance between these two extremes.
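
The two extremes are easy to state in code. A minimal sketch on made-up click counts: the greedy rule always exploits the best empirical click rate seen so far, while the fully exploratory rule ignores the data entirely.

    import numpy as np

    rng = np.random.default_rng(3)

    def greedy_choice(clicks, shows):
        # Fully exploitive: pick the highest observed click rate, no uncertainty.
        return int(np.argmax(clicks / np.maximum(shows, 1)))

    def random_choice(n_ads):
        # Fully exploratory: uniform randomization, never converges.
        return int(rng.integers(n_ads))

    # Hypothetical history in which ad 0 got lucky early; greedy never lets go of it.
    clicks, shows = np.array([1, 0, 0]), np.array([3, 3, 3])
    print(greedy_choice(clicks, shows), random_choice(3))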

A classic bandit algorithm, and one which gives solid intuition into RL in general, is Thompson sampling (Thompson 1933). Like many tools in RL, Thompson sampling uses Bayesian inference to model the accumulation of knowledge over time. The basic idea is simple: at any point in the optimization process you have a probability distribution over the vector of click rates, $\omega = [\omega_1 \cdots \omega_J]$, and you want to show each ad $j$ in proportion to the probability that $\omega_j$ is the largest click rate. That is, with $y^t = \{y_s\}_{s=1}^{t}$ denoting observed responses at time $t$, you want to have

(12)  $p(d_{t+1} = j) = p\big(\omega_j = \max\{\omega_k\}_{k=1}^{J} \,\big|\, y^t\big),$

such that an ad's selection probability is equal to the posterior probability that it is the best choice. Since the probability in equation (12) is tough to calculate in practice (the probability of a maximum is not an easy object to analyze), Thompson sampling uses Monte Carlo estimation. In particular, you draw a sample of ad-click probabilities from the posterior distribution at time $t$,

(13)  $\omega_{t+1} \sim p(\omega \mid y^t),$
12. This application, on the news website MSN.com with headlines rather than ads, motivates much of the RL work in Agarwal et al. (2014).

and set $d_{t+1} = \arg\max_j \omega_{t+1,j}$. For example, suppose that you have a Beta(1,1) prior on each ad's click rate (i.e., a uniform distribution between zero and one). At time $t$, the posterior distribution for the $j$th ad's click rate is

(14)  $p(\omega_j \mid d^t, y^t) = \mathrm{Beta}\Big(1 + \sum_{s=1}^{t} \mathbb{1}_{[d_s = j]}\, y_s,\; 1 + \sum_{s=1}^{t} \mathbb{1}_{[d_s = j]}\, (1 - y_s)\Big).$

A Thompson sampling algorithm draws $\omega_{t+1,j}$ from equation (14) for each $j$ and then shows the ad with highest sampled click rate.
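
A minimal simulation of this Beta-Bernoulli Thompson sampler, using the same kind of made-up click rates as the toy setup above (the rates and the horizon of 5,000 events are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)

    true_rates = np.array([0.02, 0.05, 0.03])    # hypothetical click rates for J = 3 ads
    J = len(true_rates)
    a, b = np.ones(J), np.ones(J)                # Beta(1, 1) priors on each omega_j

    for t in range(5000):
        draws = rng.beta(a, b)                   # one posterior draw per ad, as in (13)-(14)
        j = int(np.argmax(draws))                # show the ad with the highest sampled rate
        y = rng.binomial(1, true_rates[j])       # observe click / no click
        a[j] += y                                # conjugate posterior update, equation (14)
        b[j] += 1 - y

    print("posterior mean click rates:", a / (a + b))
    print("times each ad was shown:   ", (a + b - 2).astype(int))

Over time the sampler should concentrate impressions on the highest-rate ad while still occasionally trying the others, which is exactly the balance discussed next.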

Why does this work? Think about scenarios where an ad $j$ would be shown at time $t$, that is, when the sampled $\omega_{tj}$ is largest. This can occur if there is a lot of uncertainty about $\omega_j$, in which case high probabilities have nontrivial posterior weight, or if the expected value of $\omega_j$ is high. Thus Thompson sampling will naturally balance between exploration and exploitation. There are many other algorithms for obtaining this balance. For example, Agarwal et al. (2014) survey methods that work well in the contextual bandit setting where you have covariates attached to events (such that action-payoff probabilities are event specific). The options considered include ε-greedy search, which finds a predicted optimal choice and explores within a neighborhood of that optimum, and a bootstrap-based algorithm that is effectively a nonparametric version of Thompson sampling.
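
For comparison, the classic (noncontextual) form of ε-greedy is simple to sketch; the neighborhood-based contextual variants surveyed in Agarwal et al. (2014) are more involved. The click-count bookkeeping and the choice eps = 0.1 are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(5)

    def epsilon_greedy(clicks, shows, eps=0.1):
        # With probability eps explore uniformly at random;
        # otherwise exploit the current best empirical click rate.
        if rng.random() < eps:
            return int(rng.integers(len(clicks)))
        return int(np.argmax(clicks / np.maximum(shows, 1)))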

Another large literature looks at so-called Bayesian optimization (Taddy et al. 2009). In these algorithms, you have an unknown function $r(x)$ that you would like to maximize. This function is modeled using some type of flexible Bayesian regression model, for example, a Gaussian process. As you accumulate data, you have a posterior over the “response surface” $r$ at all potential input locations. Suppose that, after $t$ function realizations, you have observed a maximal value $r_{\max}$. This is your current best option, but you want to continue exploring to see if you can find a higher maximum. The Bayesian optimization update is based on the expected improvement statistic,

(15)  $\mathbb{E}\big[\max\big(0,\; r(x) - r_{\max}\big)\big],$

the posterior expectation of improvement at new location $x$, thresholded below at zero. The algorithm evaluates equation (15) over a grid of potential $x$ locations, and you choose to evaluate $r(x_{t+1})$ at the location $x_{t+1}$ with highest expected improvement. Again, this balances exploitation with exploration: the statistic in equation (15) can be high if $r(x)$ has high variance or a high mean (or both).
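
When the posterior for r(x) at a candidate location is roughly Gaussian with mean mu and standard deviation sigma, as it is under a Gaussian process, the expectation in equation (15) has a closed form. The sketch below evaluates it on a made-up grid of three candidate locations.

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, r_max):
        # E[max(0, r(x) - r_max)] for a Gaussian posterior at x:
        # sigma * (z * Phi(z) + phi(z)), where z = (mu - r_max) / sigma.
        z = (mu - r_max) / sigma
        return sigma * (z * norm.cdf(z) + norm.pdf(z))

    # Hypothetical posterior summaries at three candidate x locations.
    mu = np.array([1.00, 0.80, 0.95])
    sigma = np.array([0.05, 0.40, 0.10])
    print(expected_improvement(mu, sigma, r_max=1.0))

Here the second candidate has the largest expected improvement despite its lower posterior mean, because its larger uncertainty leaves more room above r_max; that is the exploration side of the trade-off.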

These RL algorithms are all described in the language of optimization, but it is possible to map many learning tasks to optimization problems. For example, the term active learning is usually used to refer to algorithms that choose data to minimize some estimation variance (e.g., the average prediction error for a regression function over a fixed input distribution). Say $f(x; \Omega)$ is your regression function, attempting to predict response $y$. Then your action function is simply prediction, $d(x; \Omega) = f(x; \Omega)$, and your optimization goal could be to minimize the squared error; that is, to maximize $r(d(x; \Omega), y) = -(y - f(x; \Omega))^2$. In this way, active learning problems are special cases of the RL framework.
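
Spelled out in code, the mapping is just a choice of reward function; f below is a placeholder for any fitted regression function.

    def squared_error_reward(f, omega, x, y):
        # Active learning as reward maximization: the action is the prediction
        # d(x; Omega) = f(x; Omega), and the reward is minus the squared error.
        return -(y - f(x, omega)) ** 2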

From a business and economic perspective, RL is interesting (beyond its obvious usefulness) for assigning a value to new data points. In many settings the rewards can be mapped to actual monetary value: for instance, in our advertising example where the website receives revenue-per-click. Reinforcement learning algorithms assign a dollar value to data observations. There is a growing literature on markets for data, for example, including the “data-is-labor” proposal in Lanier (2014). It seems useful for future study in this area to take account of how currently deployed AI systems assign relative data value. As a high-level point, the valuation of data in RL depends upon the action options and potential rewards associated with these actions. The value of data is only defined in a specific context.

The bandit algorithms described above are vastly simplified in comparison to the type of RL that is deployed as part of a deep learning system. In practice, when using RL with complex flexible functions like DNNs you need to be very careful to avoid over-exploitation and early convergence (Mnih et al. 2015). It is also impossible to do a comprehensive search through the super high-dimensional space of optional values for the $\Omega$ that parametrizes a DNN. However, approaches such as those in van Seijen et al. (2017) and Silver et al. (2017) show that if you impose structure on the full learning problem then it can be broken into a number of simple composite tasks, each of which is solvable with RL. As we discussed earlier, there is an undeniable advantage to having large fixed data assets that you can use to hot-start your AI (e.g., data from a search engine or social media platform). But the exploration and active data collection of RL is essential when tuning an AI system to be successful in specific contexts. These systems are taking actions and setting policy in an uncertain and dynamic world. As statisticians, scientists, and economists are well aware, without constant experimentation it is not possible to learn and improve.

2.7 AI in Context

This chapter has provided a primer on the key ingredients of AI. We have also been pushing some general points. First, the current wave of ML-driven AI should be viewed as a new class of products growing up around a new general purpose technology: large-scale, fast, and robust machine learning. Artificial intelligence is not machine learning, but general purpose ML, specifically deep learning, is the electric motor of AI. These ML tools are going to continue to get better, faster, and cheaper. Hardware and big data resources are adapting to the demands of DNNs, and self-service ML solutions are available on all of the major cloud computing platforms. Trained DNNs might become a commodity in the near-term future, and the market for deep learning could get wrapped up in the larger battle over market share in cloud computing services.

Second, we are still waiting for true end-to-end business AI solutions that drive a real increase in productivity. AI's current “wins” are mostly limited to settings with high amounts of explicit structure, like board and video games.13 This is changing, as companies like Microsoft and Amazon produce semi-autonomous systems that can engage with real business problems. But there is still much work to be done, and the advances will be made by those who can impose structure on these complex business problems. That is, for business AI to succeed we need to combine the GPML and big data with people who know the rules of the “game” in their business domain.

Finally, all of this will have significant implications for the role of economics in industry. In many cases, the economists are those who can provide structure and rules around messy business scenarios. For example, a good structural econometrician (McFadden 1980; Heckman 1977; Deaton and Muellbauer 1980) uses economic theory to break a substantive question into a set of measurable (i.e., identified) equations with parameters that can be estimated from data. In many settings, this is exactly the type of workflow required for AI. The difference is that, instead of being limited to basic linear regression, these measurable pieces of the system will be DNNs that can actively experiment and generate their own training data. The next generation of economists needs to be comfortable in knowing how to apply economic theory to obtain such structure, and how to translate this structure into recipes that can be automated with ML and RL. Just as big data led to data science, a new discipline combining statistics and computer science, AI will require interdisciplinary pioneers who can combine economics, statistics, and machine learning.

References

Agarwal, Alekh, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. “Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits.” In Proceedings of the 31st International Conference on Machine Learning 32:1638–46. http://proceedings.mlr.press/v32/agarwalb14.pdf.

Athey, Susan. 2017. “Beyond Prediction: Using Big Data for Policy Problems.” Science 355:483–85.

 
