tioners make use of algorithms where $C_t = \mathrm{diag}(c_{1t}, \ldots, c_{pt})$, with $p$ the dimension of $\Omega$, and each $c_{jt}$ is chosen to approximate $\partial^2 L / \partial \omega_{jt}^2$, the corresponding diagonal element of the Hessian matrix of loss-function second derivatives (i.e., what would be used in a Newton's algorithm). The ADAGRAD paper (Duchi, Hazan, and Singer 2011) provides a theoretical foundation for this approach and suggests an algorithm for specifying $c_{jt}$. Most deep learning systems make use of ADAGRAD-inspired algorithms, such as ADAM (Kingma and Ba 2015), that combine the original algorithm with heuristics that have been shown empirically to improve performance.
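To make the diagonal-preconditioner idea concrete, here is a minimal NumPy sketch of an ADAGRAD-style update on a toy quadratic loss. The function and variable names are illustrative assumptions, not code from the cited papers.

```python
import numpy as np

def adagrad_step(omega, grad, hist, rate=0.1, eps=1e-8):
    """One ADAGRAD-style update: each coordinate j gets its own step size
    c_jt built from the accumulated squared gradients, a cheap proxy for
    the corresponding diagonal element of the Hessian."""
    hist = hist + grad ** 2                 # running sum of squared gradients
    c = rate / (np.sqrt(hist) + eps)        # per-parameter step sizes c_jt
    return omega - c * grad, hist

# Illustrative use on a toy quadratic loss L(omega) = 0.5 * ||A @ omega - b||^2.
A = np.array([[3.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, -2.0])
omega, hist = np.zeros(2), np.zeros(2)
for t in range(500):
    grad = A.T @ (A @ omega - b)            # gradient of the toy loss
    omega, hist = adagrad_step(omega, grad, hist)
```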
Finally, there is another key trick to DNN training: dropout. This pro-
cedure, proposed by researchers (Srivastava et al. 2014) in Hinton’s lab at
the University of Toronto, involves introduction of random noise into each
gradient calculation. For example, “Bernoulli dropout” replaces the current estimates $\omega_{tj}$ with $w_{tj} = \xi_{tj}\,\omega_{tj}$, where $\xi_{tj}$ is a Bernoulli random variable with $p(\xi_{tj} = 1) = c$. Each SGD update from equation (8) then uses these parameter values when evaluating the gradient, such that
(10)  $\Omega_{t+1} = \Omega_t - C_t \nabla L\big(\Omega; \{d_{i_b}\}_{b=1}^{B_t}\big)\Big|_{W_t},$

where $W_t$ is the noised-up version of $\Omega_t$, with elements $w_{tj}$.
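As a minimal sketch of the update in equation (10), the step below zeroes out each parameter with probability $1 - c$ before evaluating the gradient. The gradient function grad_L, the batch argument, and the constant step size (used here in place of $C_t$) are illustrative assumptions.

```python
import numpy as np

def dropout_sgd_step(omega, grad_L, batch, c=0.8, step=0.01,
                     rng=np.random.default_rng(0)):
    """One SGD update with Bernoulli dropout: the gradient is evaluated at
    W_t, a copy of Omega_t whose elements are kept with probability c and
    zeroed otherwise, then applied to the un-noised Omega_t."""
    xi = rng.binomial(1, c, size=omega.shape)   # xi_tj ~ Bernoulli(c)
    W = xi * omega                              # noised-up parameters w_tj = xi_tj * omega_tj
    return omega - step * grad_L(W, batch)      # gradient evaluated at W_t
```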
Dropout is used because it has been observed to yield model fits that have lower out-of-sample error rates (so long as you tune $c$ appropriately). Why
does this happen? Informally, dropout acts as a type of implicit regulariza-
tion. An example of explicit regularization is parameter penalization: to avoid over-fit, the minimization objective for DNNs almost always has an $\ell_2^2$ ridge penalty term added to the data-likelihood loss function. Dropout plays a similar role. By forcing SGD updates to ignore a random sample of the parameters, it prevents over-fit on any individual parameter.10 More
rigorously, it has recently been established by a number of authors (Kendall
and Gal 2017) that SGD with dropout corresponds to a type of “variational
Bayesian Inference.” That means that dropout SGD is solving to find the posterior distribution over Ω rather than a point estimate.11 As interest grows around uncertainty quantification for DNNs, this interpretation of dropout
is one option for bringing Bayesian inference into deep learning.
2.6 Reinforcement Learning
As our final section on the elements of deep learning, we will consider
how these AI systems generate their own training data through a mix of
experimentation and optimization. Reinforcement learning (RL) is the com-
mon term for this aspect of AI. Reinforcement learning is sometimes used
to denote specific algorithms, but we are using it to refer to the full area of
active data collection.
The general problem can be formulated as a reward- maximization task.
You have some policy or “action” function, $d(x_t; \Omega)$, that dictates how the system responds to “event” $t$ with characteristics $x_t$. The event could be a customer arriving on your website at a specific time, or a scenario in a video game, and so forth. After the event, you observe “response” $y_t$ and the reward is calculated as $r(d(x_t; \Omega), y_t)$. During this process you are accumulating data and learning the parameters $\Omega$, so we can write $\Omega_t$ as the parameters used at event $t$. The goal is that this learning converges to some optimal reward-maximizing parametrization, say $\Omega^\star$, and that this happens after some $T$ events where $T$ is not too big—that is, so that you minimize regret,

(11)  $\sum_{t=1}^{T} \Big[ r\big(d(x_t; \Omega^\star), y_t\big) - r\big(d(x_t; \Omega_t), y_t\big) \Big].$
10. This seems to contradict our earlier discussion about minimizing the variance of gradient estimates. The distinction is that we want to minimize variance due to noise in the data, but here we are introducing noise in the parameters independent of the data.
11. It is a strange variational distribution, but basically the posterior distribution over Ω becomes that implied by W, with elements $\omega_j$ multiplied by random Bernoulli noise.
This is a very general formulation. We can map it to some familiar scenarios.
For example, suppose that the event t is a user landing on your website. You
would like to show a banner advertisement on the landing page, and you
want to show the ad that has the highest probability of getting clicked by
the user. Suppose that there are $J$ different possible ads you can show, such that your action $d_t = d(x_t; \Omega_t) \in \{1, \ldots, J\}$ is the one chosen for display. The final reward is $y_t = 1$ if the user clicks the ad and $y_t = 0$ otherwise.12
This specific scenario is a multi-armed bandit (MAB) set-up, so named by analogy to a casino with many slot machines of different payout probabilities (the casino is the bandit). In the classic MAB (or simply “bandit”) problem, there are no covariates associated with each ad and each user, such that you are attempting to optimize toward a single ad that has the highest click probability across all users. That is, $\theta_j$ is $p(y_t = 1 \mid d_t = j)$, the generic click probability for ad $j$, and you want to set $d_t$ to the ad with the highest $\theta_j$. There
are many different algorithms for bandit optimization. They use different heuristics to balance exploitation with exploration. A fully exploitive algorithm is greedy: it always takes the currently estimated best option without any consideration of uncertainty. In our simple advertising example, this implies always converging to the first ad that ever gets clicked on. A fully exploratory algorithm always randomizes the ads and it will never converge to a single optimum. The trick to bandit learning is finding a way to balance between these two extremes.
A classic bandit algorithm, and one which gives solid intuition into RL
in general, is Thompson sampling (Thompson 1933). Like many tools in
RL, Thompson sampling uses Bayesian inference to model the accumula-
tion of knowledge over time. The basic idea is simple: at any point in the
optimization process you have a probability distribution over the vector of click rates, $\theta = [\theta_1 \cdots \theta_J]$, and you want to show each ad $j$ in proportion to the probability that $\theta_j$ is the largest click rate. That is, with $y^t = \{y_s\}_{s=1}^{t}$ denoting observed responses at time $t$, you want to have
(12)  $p(d_{t+1} = j) = p\big(\theta_j = \max\{\theta_k\}_{k=1}^{J} \,\big|\, y^t\big),$
such that an ad’s selection probability is equal to the posterior probability
that it is the best choice. Since the probability in equation (12) is tough to
calculate in practice (the probability of a maximum is not an easy object to
analyze), Thompson sampling uses Monte Carlo estimation. In particular,
you draw a sample of ad-click probabilities from the posterior distribution
at time t,
(13)  $\theta_{t+1} \sim p(\theta \mid y^t),$
12. This application, on the news website MSN.com with headlines rather than ads, motivates much of the RL work in Agarwal et al. (2014).
and set $d_{t+1} = \arg\max_j \theta_{t+1,j}$. For example, suppose that you have a Beta(1,1) prior on each ad's click rate (i.e., a uniform distribution between zero and one). At time $t$, the posterior distribution for the $j$th ad's click rate is
(14)  $p(\theta_j \mid d^t, y^t) = \mathrm{Beta}\Big(1 + \sum_{s=1}^{t} \mathbb{1}_{[d_s = j]}\, y_s,\; 1 + \sum_{s=1}^{t} \mathbb{1}_{[d_s = j]}\,(1 - y_s)\Big).$
A Thompson sampling algorithm draws $\theta_{t+1,j}$ from equation (14) for each $j$ and then shows the ad with the highest sampled click rate.
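A minimal simulation of this Beta-Bernoulli Thompson sampler is sketched below; the "true" click rates and the simulated click responses are made-up stand-ins for real traffic.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = np.array([0.02, 0.05, 0.03])   # hypothetical click rates, unknown to the algorithm
J = len(true_rates)
alpha = np.ones(J)                          # Beta(1, 1) priors: alpha = 1 + clicks
beta = np.ones(J)                           #                    beta  = 1 + non-clicks

for t in range(10_000):
    theta = rng.beta(alpha, beta)           # draw theta_{t+1} from the posterior, as in (13)-(14)
    d = int(np.argmax(theta))               # show the ad with the highest sampled click rate
    y = rng.binomial(1, true_rates[d])      # observe click (1) or no click (0)
    alpha[d] += y                           # conjugate posterior update, equation (14)
    beta[d] += 1 - y

print(alpha / (alpha + beta))               # posterior mean click rate for each ad
```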
Why does this work? Think about scenarios where an ad $j$ would be shown at time $t$—that is, when the sampled $\theta_{tj}$ is largest. This can occur if there is a lot of uncertainty about $\theta_j$, in which case high click probabilities have nontrivial posterior weight, or if the expected value of $\theta_j$ is high. Thus Thompson sampling will naturally balance between exploration and exploitation. There
are many other algorithms for obtaining this balance. For example, Agarwal
et al. (2014) survey methods that work well in the contextual bandit setting where you have covariates attached to events (such that action-payoff probabilities are event specific). The options considered include ε-greedy search, which finds a predicted optimal choice and explores within a neighborhood of that optimum, and a bootstrap-based algorithm that is effectively a nonparametric version of Thompson sampling.
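As a rough illustration of the ε-greedy idea (a simplified uniform-exploration variant, not the exact procedure in Agarwal et al. 2014), suppose a hypothetical function predict(x) returns estimated click probabilities for each of the $J$ ads given event covariates $x$:

```python
import numpy as np

def epsilon_greedy_choice(predict, x, J, epsilon=0.1,
                          rng=np.random.default_rng(2)):
    """With probability epsilon explore a random ad; otherwise exploit the
    ad with the highest predicted click probability for this event."""
    if rng.random() < epsilon:
        return int(rng.integers(J))          # explore
    return int(np.argmax(predict(x)))        # exploit
```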
Another large literature looks at so-called Bayesian optimization (Taddy
et al. 2009). In these algorithms, you have an unknown function $r(x)$ that you would like to maximize. This function is modeled using some type of flexible Bayesian regression model, for example, a Gaussian process. As you
accumulate data, you have a posterior over the “response surface” r at all
potential input locations. Suppose that, after $t$ function realizations, you have observed a maximal value $r_{\max}$. This is your current best option, but you want to continue exploring to see if you can find a higher maximum. The
Bayesian optimization update is based on the expected improvement statistic,
(15)  $\mathbb{E}\big[\max\big(0,\, r(x) - r_{\max}\big)\big],$

the posterior expectation of improvement at new location $x$, thresholded below at zero. The algorithm evaluates equation (15) over a grid of potential $x$ locations, and you choose to evaluate $r(x_{t+1})$ at the location $x_{t+1}$ with the highest expected improvement. Again, this balances exploitation with exploration: the statistic in equation (15) can be high if $r(x)$ has high variance or a high mean (or both).
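When the posterior for $r(x)$ at a candidate location is approximately Gaussian with mean mu and standard deviation sigma (as it is under a Gaussian process), equation (15) has a well-known closed form; the sketch below assumes mu and sigma are supplied by whatever surrogate model you have fit.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, r_max):
    """Closed-form E[max(0, r(x) - r_max)] when r(x) ~ N(mu, sigma^2),
    the statistic in equation (15)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 1e-12)  # guard zero variance
    z = (mu - r_max) / sigma
    return (mu - r_max) * norm.cdf(z) + sigma * norm.pdf(z)

# Evaluate this over a grid of candidate locations (mu and sigma from the
# surrogate posterior) and choose x_{t+1} at the argmax.
```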
These RL algorithms are all described in the language of optimization,
but it is possible to map many learning tasks to optimization problems. For
example, the term active learning is usually used to refer to algorithms that
choose data to minimize some estimation variance (e.g., the average pre-
diction error for a regression function over a fixed input distribution). Say $f(x; \Omega)$ is your regression function, attempting to predict response $y$. Then
your action function is simply prediction, $d(x; \Omega) = f(x; \Omega)$, and your optimization goal could be to minimize the squared error—that is, to maximize $r(d(x; \Omega), y) = -(y - f(x; \Omega))^2$. In this way, active learning problems are special cases of the RL framework.
From a business and economic perspective, RL is interesting (beyond its
obvious usefulness) for assigning a value to new data points. In many set-
tings the rewards can be mapped to actual monetary value: for instance, in
our advertising example where the website receives revenue-per-click. Rein-
forcement learning algorithms assign a dollar value to data observations.
There is a growing literature on markets for data, for example, including the
“data-is-labor” proposal in Lanier (2014). It seems useful for future study in
this area to take account of how currently deployed AI systems assign rela-
tive data value. As a high-level point, the valuation of data in RL depends
upon the action options and potential rewards associated with these actions.
The value of data is only defined in a specific context.
The bandit algorithms described above are vastly simplified in com-
parison to the type of RL that is deployed as part of a deep learning sys-
tem. In practice, when using RL with complex fl exible functions like DNNs
you need to be very careful to avoid over exploitation and early conver-
gence (Mnih et al. 2015). It is also impossible to do a comprehensive search
through the super high- dimensional space of optional values for the Ω that
parametrizes a DNN. However, approaches such as that in van Seijen et al.
(2017) and Silver et al. (2017) show that if you impose structure on the full
learning problem then it can be broken into a number of simple composite
tasks, each of which is solvable with RL. As we discussed earlier, there is an
undeniable advantage to having large fixed data assets that you can use to hot-start your AI (e.g., data from a search engine or social media platform).
But the exploration and active data collection of RL is essential when tuning
an AI system to be successful in specific contexts. These systems are taking
actions and setting policy in an uncertain and dynamic world. As statisti-
cians, scientists, and economists are well aware, without constant experimen-
tation it is not possible to learn and improve.
2.7 AI in Context
This chapter has provided a primer on the key ingredients of AI. We have
also been pushing some general points. First, the current wave of ML-driven
AI should be viewed as a new class of products growing up around a new
general purpose technology: large-scale, fast, and robust machine learn-
ing. Artifi cial intelligence is not machine learning, but general purpose ML,
specifi cally deep learning, is the electric motor of AI. These ML tools are
going to continue to get better, faster, and cheaper. Hardware and big data
resources are adapting to the demands of DNNs, and self-service ML solutions are available on all of the major cloud computing platforms. Trained
DNNs might become a commodity in the near-term future, and the market
for deep learning could get wrapped up in the larger battle over market share
in cloud computing services.
Second, we are still waiting for true end-to-end business AI solutions that
drive a real increase in productivity. AI’s current “wins” are mostly limited
to settings with high amounts of explicit structure, like board and video
games.13 This is changing, as companies like Microsoft and Amazon produce
semi-autonomous systems that can engage with real business problems. But
there is still much work to be done, and the advances will be made by those
who can impose structure on these complex business problems. That is, for
business AI to succeed we need to combine the GPML and big data with
people who know the rules of the “game” in their business domain.
Finally, all of this will have significant implications for the role of eco-
nomics in industry. In many cases, the economists are those who can provide
structure and rules around messy business scenarios. For example, a good
structural econometrician (McFadden 1980; Heckman 1977; Deaton and
Muellbauer 1980) uses economic theory to break a substantiative question
into a set of measurable (i.e., identifi ed) equations with parameters that
can be estimated from data. In many settings, this is exactly the type of
workfl ow required for AI. The diff erence is that, instead of being limited to
basic linear regression, these measurable pieces of the system will be DNNs
that can actively experiment and generate their own training data. The next
generation of economists needs to be comfortable in knowing how to apply
economic theory to obtain such structure, and how to translate this structure
into recipes that can be automated with ML and RL. Just as big data led to
data science, a new discipline combining statistics and computer science, AI
will require interdisciplinary pioneers who can combine economics, statis-
tics, and machine learning.
References
Agarwal, Alekh, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert
Schapire. 2014. “Taming the Monster: A Fast and Simple Algorithm for Con-
textual Bandits.” In Proceedings of the 31st International Conference on Machine
Learning 32:1638–46. http://proceedings.mlr.press/v32/agarwalb14.pdf.
Athey, Susan. 2017. “Beyond Prediction: Using Big Data for Policy Problems.”
Science 355:483–85.