tioners make use of algorithms where $C_t = \mathrm{diag}(c_{1t}, \ldots, c_{pt})$, with $p$ the dimension of $\Omega$, and each $c_{jt}$ is chosen to approximate $\partial^2 L / \partial \omega_{jt}^2$, the corresponding diagonal element of the Hessian matrix of loss-function second derivatives (i.e., what would be used in a Newton's algorithm). The ADAGRAD paper (Duchi, Hazan, and Singer 2011) provides a theoretical foundation for this approach and suggests an algorithm for specifying $c_{jt}$. Most deep learning systems make use of ADAGRAD-inspired algorithms, such as ADAM (Kingma and Ba 2015), that combine the original algorithm with heuristics that have been shown empirically to improve performance.
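To make the diagonal-preconditioner idea concrete, here is a minimal NumPy sketch of an ADAGRAD-style update on a toy quadratic loss. The function and variable names are illustrative assumptions, not code from the cited papers.

```python
import numpy as np

def adagrad_step(omega, grad, hist, rate=0.1, eps=1e-8):
    """One ADAGRAD-style update: each coordinate j gets its own step size
    c_jt built from the accumulated squared gradients, a cheap proxy for
    the corresponding diagonal element of the Hessian."""
    hist = hist + grad ** 2                 # running sum of squared gradients
    c = rate / (np.sqrt(hist) + eps)        # per-parameter step sizes c_jt
    return omega - c * grad, hist

# Illustrative use on a toy quadratic loss L(omega) = 0.5 * ||A @ omega - b||^2.
A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, -2.0])
omega, hist = np.zeros(2), np.zeros(2)
for t in range(500):
    grad = A.T @ (A @ omega - b)            # gradient of the toy loss
    omega, hist = adagrad_step(omega, grad, hist)
```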
Finally, there is another key trick to DNN training: dropout. This pro-
cedure, proposed by researchers (Srivastava et al. 2014) in Hinton’s lab at
the University of Toronto, involves introduction of random noise into each
gradient calculation. For example, “Bernoulli dropout” replaces the current estimates $\omega_{tj}$ with $w_{tj} = \xi_{tj}\,\omega_{tj}$, where $\xi_{tj}$ is a Bernoulli random variable with $p(\xi_{tj} = 1) = c$. Each SGD update from equation (8) then uses these parameter values when evaluating the gradient, such that
(10)  $\Omega_{t+1} = \Omega_t - C_t \nabla L\big(\Omega; \{d_{i_b}\}_{b=1}^{B_t}\big)\Big|_{W_t},$

where $W_t$ is the noised-up version of $\Omega_t$, with elements $w_{tj}$.
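As a minimal sketch of the update in equation (10), the step below zeroes out each parameter with probability $1 - c$ before evaluating the gradient. The gradient function grad_L, the batch argument, and the constant step size (used here in place of $C_t$) are illustrative assumptions.

```python
import numpy as np

def dropout_sgd_step(omega, grad_L, batch, c=0.8, step=0.01,
                     rng=np.random.default_rng(0)):
    """One SGD update with Bernoulli dropout: the gradient is evaluated at
    W_t, a copy of Omega_t whose elements are kept with probability c and
    zeroed otherwise, then applied to the un-noised Omega_t."""
    xi = rng.binomial(1, c, size=omega.shape)   # xi_tj ~ Bernoulli(c)
    W = xi * omega                              # noised-up parameters w_tj = xi_tj * omega_tj
    return omega - step * grad_L(W, batch)      # gradient evaluated at W_t
```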
Dropout is used because it has been observed to yield model fits that have lower out-of-sample error rates (so long as you tune $c$ appropriately). Why
does this happen? Informally, dropout acts as a type of implicit regulariza-
tion. An example of explicit regularization is parameter penalization: to avoid over-fit, the minimization objective for DNNs almost always has an $\ell_2^2$ ridge penalty term added to the data-likelihood loss function. Dropout plays a similar role. By forcing SGD updates to ignore a random sample of the parameters, it prevents over-fit on any individual parameter.10 More
rigorously, it has recently been established by a number of authors (Kendall
and Gal 2017) that SGD with dropout corresponds to a type of “variational
Bayesian Inference.” That means that dropout SGD is solving to find the posterior distribution over Ω rather than a point estimate.11 As interest grows around uncertainty quantification for DNNs, this interpretation of dropout
is one option for bringing Bayesian inference into deep learning.
2.6 Reinforcement Learning
As our final section on the elements of deep learning, we will consider
how these AI systems generate their own training data through a mix of
experimentation and optimization. Reinforcement learning (RL) is the com-
mon term for this aspect of AI. Reinforcement learning is sometimes used
to denote specific algorithms, but we are using it to refer to the full area of
active data collection.
The general problem can be formulated as a reward- maximization task.
You have some policy or “action” function, $d(x_t; \Omega)$, that dictates how the system responds to “event” $t$ with characteristics $x_t$. The event could be a customer arriving on your website at a specific time, or a scenario in a video game, and so forth. After the event, you observe “response” $y_t$ and the reward is calculated as $r(d(x_t; \Omega), y_t)$. During this process you are accumulating data and learning the parameters $\Omega$, so we can write $\Omega_t$ as the parameters used at event $t$. The goal is that this learning converges to some optimal reward-maximizing parametrization, say $\Omega^\star$, and that this happens after some $T$ events where $T$ is not too big—that is, so that you minimize regret,

(11)  $\sum_{t=1}^{T} \Big[ r\big(d(x_t; \Omega^\star), y_t\big) - r\big(d(x_t; \Omega_t), y_t\big) \Big].$
10. This seems to contradict our earlier discussion about minimizing the variance of gradient estimates. The distinction is that we want to minimize variance due to noise in the data, but here we are introducing noise in the parameters independent of the data.
11. It is a strange variational distribution, but basically the posterior distribution over Ω becomes that implied by W, with elements $\omega_j$ multiplied by random Bernoulli noise.
This is a very general formulation. We can map it to some familiar scenarios.
For example, suppose that the event t is a user landing on your website. You
would like to show a banner advertisement on the landing page, and you
want to show the ad that has the highest probability of getting clicked by
the user. Suppose that there are $J$ different possible ads you can show, such that your action $d_t = d(x_t; \Omega_t) \in \{1, \ldots, J\}$ is the one chosen for display. The final reward is $y_t = 1$ if the user clicks the ad and $y_t = 0$ otherwise.12
This specific scenario is a multi-armed bandit (MAB) set-up, so named by analogy to a casino with many slot machines of different payout probabilities (the casino is the bandit). In the classic MAB (or simply “bandit”) problem, there are no covariates associated with each ad and each user, such that you are attempting to optimize toward a single ad that has the highest click probability across all users. That is, $\theta_j$ is $p(y_t = 1 \mid d_t = j)$, the generic click probability for ad $j$, and you want to set $d_t$ to the ad with the highest $\theta_j$. There
are many different algorithms for bandit optimization. They use different heuristics to balance exploitation with exploration. A fully exploitive algorithm is greedy: it always takes the currently estimated best option without any consideration of uncertainty. In our simple advertising example, this implies always converging to the first ad that ever gets clicked on. A fully exploratory algorithm always randomizes the ads and it will never converge to a single optimum. The trick to bandit learning is finding a way to balance between these two extremes.
A classic bandit algorithm, and one which gives solid intuition into RL
in general, is Thompson sampling (Thompson 1933). Like many tools in
RL, Thompson sampling uses Bayesian inference to model the accumula-
tion of knowledge over time. The basic idea is simple: at any point in the
optimization process you have a probability distribution over the vector of click rates, $\theta = [\theta_1 \cdots \theta_J]$, and you want to show each ad $j$ in proportion to the probability that $\theta_j$ is the largest click rate. That is, with $y^t = \{y_s\}_{s=1}^{t}$ denoting observed responses at time $t$, you want to have
(12)  $p(d_{t+1} = j) = p\big(\theta_j = \max\{\theta_k\}_{k=1}^{J} \,\big|\, y^t\big),$
such that an ad’s selection probability is equal to the posterior probability
that it is the best choice. Since the probability in equation (12) is tough to
calculate in practice (the probability of a maximum is not an easy object to
analyze), Thompson sampling uses Monte Carlo estimation. In particular,
you draw a sample of ad-click probabilities from the posterior distribution
at time t,
(13)  $\theta_{t+1} \sim p(\theta \mid y^t),$
12. This application, on the news website MSN.com with headlines rather than ads, motivates much of the RL work in Agarwal et al. (2014).
and set $d_{t+1} = \arg\max_j \theta_{t+1,j}$. For example, suppose that you have a Beta(1,1) prior on each ad's click rate (i.e., a uniform distribution between zero and one). At time $t$, the posterior distribution for the $j$th ad's click rate is
(14)  $p(\theta_j \mid d^t, y^t) = \mathrm{Beta}\Big(1 + \sum_{s=1}^{t} \mathbb{1}_{[d_s = j]}\, y_s,\; 1 + \sum_{s=1}^{t} \mathbb{1}_{[d_s = j]}\,(1 - y_s)\Big).$
A Thompson sampling algorithm draws $\theta_{t+1,j}$ from equation (14) for each $j$ and then shows the ad with the highest sampled click rate.
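A minimal simulation of this Beta-Bernoulli Thompson sampler is sketched below; the "true" click rates and the simulated click responses are made-up stand-ins for real traffic.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = np.array([0.02, 0.05, 0.03])   # hypothetical click rates, unknown to the algorithm
J = len(true_rates)
alpha = np.ones(J)                          # Beta(1, 1) priors: alpha = 1 + clicks
beta = np.ones(J)                           #                    beta  = 1 + non-clicks

for t in range(10_000):
    theta = rng.beta(alpha, beta)           # draw theta_{t+1} from the posterior, as in (13)-(14)
    d = int(np.argmax(theta))               # show the ad with the highest sampled click rate
    y = rng.binomial(1, true_rates[d])      # observe click (1) or no click (0)
    alpha[d] += y                           # conjugate posterior update, equation (14)
    beta[d] += 1 - y

print(alpha / (alpha + beta))               # posterior mean click rate for each ad
```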
Why does this work? Think about scenarios where an ad $j$ would be shown at time $t$—that is, when the sampled $\theta_{tj}$ is largest. This can occur if there is a lot of uncertainty about $\theta_j$, in which case high click probabilities have nontrivial posterior weight, or if the expected value of $\theta_j$ is high. Thus Thompson sampling will naturally balance between exploration and exploitation. There
are many other algorithms for obtaining this balance. For example, Agarwal
et al. (2014) survey methods that work well in the contextual bandit setting where you have covariates attached to events (such that action-payoff probabilities are event specific). The options considered include ε-greedy search, which finds a predicted optimal choice and explores within a neighborhood of that optimum, and a bootstrap-based algorithm that is effectively a nonparametric version of Thompson sampling.
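As a rough illustration of the ε-greedy idea (a simplified uniform-exploration variant, not the exact procedure in Agarwal et al. 2014), suppose a hypothetical function predict(x) returns estimated click probabilities for each of the $J$ ads given event covariates $x$:

```python
import numpy as np

def epsilon_greedy_choice(predict, x, J, epsilon=0.1,
                          rng=np.random.default_rng(2)):
    """With probability epsilon explore a random ad; otherwise exploit the
    ad with the highest predicted click probability for this event."""
    if rng.random() < epsilon:
        return int(rng.integers(J))          # explore
    return int(np.argmax(predict(x)))        # exploit
```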
Another large literature looks at so-called Bayesian optimization (Taddy
et al. 2009). In these algorithms, you have an unknown function $r(x)$ that you would like to maximize. This function is modeled using some type of flexible Bayesian regression model, for example, a Gaussian process. As you
accumulate data, you have a posterior over the “response surface” r at all
potential input locations. Suppose that, after $t$ function realizations, you have observed a maximal value $r_{\max}$. This is your current best option, but you want to continue exploring to see if you can find a higher maximum. The
Bayesian optimization update is based on the expected improvement statistic,
(15)  $\mathbb{E}\big[\max\big(0,\, r(x) - r_{\max}\big)\big],$

the posterior expectation of improvement at new location $x$, thresholded below at zero. The algorithm evaluates equation (15) over a grid of potential $x$ locations, and you choose to evaluate $r(x_{t+1})$ at the location $x_{t+1}$ with the highest expected improvement. Again, this balances exploitation with exploration: the statistic in equation (15) can be high if $r(x)$ has high variance or a high mean (or both).
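When the posterior for $r(x)$ at a candidate location is approximately Gaussian with mean mu and standard deviation sigma (as it is under a Gaussian process), equation (15) has a well-known closed form; the sketch below assumes mu and sigma are supplied by whatever surrogate model you have fit.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, r_max):
    """Closed-form E[max(0, r(x) - r_max)] when r(x) ~ N(mu, sigma^2),
    the statistic in equation (15)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 1e-12)  # guard zero variance
    z = (mu - r_max) / sigma
    return (mu - r_max) * norm.cdf(z) + sigma * norm.pdf(z)

# Evaluate this over a grid of candidate locations (mu and sigma from the
# surrogate posterior) and choose x_{t+1} at the argmax.
```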
These RL algorithms are all described in the language of optimization,
but it is possible to map many learning tasks to optimization problems. For
example, the term active learning is usually used to refer to algorithms that
choose data to minimize some estimation variance (e.g., the average pre-
diction error for a regression function over a fixed input distribution). Say $f(x; \Omega)$ is your regression function, attempting to predict response $y$. Then
your action function is simply prediction, $d(x; \Omega) = f(x; \Omega)$, and your optimization goal could be to minimize the squared error—that is, to maximize $r(d(x; \Omega), y) = -(y - f(x; \Omega))^2$. In this way, active learning problems are special cases of the RL framework.
From a business and economic perspective, RL is interesting (beyond its
obvious usefulness) for assigning a value to new data points. In many set-
tings the rewards can be mapped to actual monetary value: for instance, in
our advertising example where the website receives revenue-per-click. Rein-
forcement learning algorithms assign a dollar value to data observations.
There is a growing literature on markets for data, for example, including the
“data-is-labor” proposal in Lanier (2014). It seems useful for future study in
this area to take account of how currently deployed AI systems assign rela-
tive data value. As a high-level point, the valuation of data in RL depends
upon the action options and potential rewards associated with these actions.
The value of data is only defined in a specific context.
The bandit algorithms described above are vastly simplified in com-
parison to the type of RL that is deployed as part of a deep learning sys-
tem. In practice, when using RL with complex fl exible functions like DNNs
you need to be very careful to avoid over exploitation and early conver-
gence (Mnih et al. 2015). It is also impossible to do a comprehensive search
through the super high- dimensional space of optional values for the Ω that
parametrizes a DNN. However, approaches such as that in van Seijen et al.
(2017) and Silver et al. (2017) show that if you impose structure on the full
learning problem then it can be broken into a number of simple composite
tasks, each of which is solvable with RL. As we discussed earlier, there is an
undeniable advantage to having large fixed data assets that you can use to hot-start your AI (e.g., data from a search engine or social media platform).
But the exploration and active data collection of RL is essential when tuning
an AI system to be successful in specific contexts. These systems are taking
actions and setting policy in an uncertain and dynamic world. As statisti-
cians, scientists, and economists are well aware, without constant experimen-
tation it is not possible to learn and improve.
2.7 AI in Context
This chapter has provided a primer on the key ingredients of AI. We have
also been pushing some general points. First, the current wave of ML-driven
AI should be viewed as a new class of products growing up around a new
general purpose technology: large-scale, fast, and robust machine learn-
ing. Artifi cial intelligence is not machine learning, but general purpose ML,
specifi cally deep learning, is the electric motor of AI. These ML tools are
going to continue to get better, faster, and cheaper. Hardware and big data
resources are adapting to the demands of DNNs, and self-service ML solutions are available on all of the major cloud computing platforms. Trained
DNNs might become a commodity in the near-term future, and the market
for deep learning could get wrapped up in the larger battle over market share
in cloud computing services.
Second, we are still waiting for true end-to-end business AI solutions that
drive a real increase in productivity. AI’s current “wins” are mostly limited
to settings with high amounts of explicit structure, like board and video
games.13 This is changing, as companies like Microsoft and Amazon produce
semi-autonomous systems that can engage with real business problems. But
there is still much work to be done, and the advances will be made by those
who can impose structure on these complex business problems. That is, for
business AI to succeed we need to combine the GPML and big data with
people who know the rules of the “game” in their business domain.
Finally, all of this will have significant implications for the role of eco-
nomics in industry. In many cases, the economists are those who can provide
structure and rules around messy business scenarios. For example, a good
structural econometrician (McFadden 1980; Heckman 1977; Deaton and
Muellbauer 1980) uses economic theory to break a substantiative question
into a set of measurable (i.e., identifi ed) equations with parameters that
can be estimated from data. In many settings, this is exactly the type of
workfl ow required for AI. The diff erence is that, instead of being limited to
basic linear regression, these measurable pieces of the system will be DNNs
that can actively experiment and generate their own training data. The next
generation of economists needs to be comfortable in knowing how to apply
economic theory to obtain such structure, and how to translate this structure
into recipes that can be automated with ML and RL. Just as big data led to
data science, a new discipline combining statistics and computer science, AI
will require interdisciplinary pioneers who can combine economics, statis-
tics, and machine learning.
References
Agarwal, Alekh, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert
Schapire. 2014. “Taming the Monster: A Fast and Simple Algorithm for Con-
textual Bandits.” In Proceedings of the 31st International Conference on Machine
Learning 32:1638–46. http://proceedings.mlr.press/v32/agarwalb14.pdf.
Athey, Susan. 2017. “Beyond Prediction: Using Big Data for Policy Problems.”
Science 355:483–85.