computer vision contest. Since then, the race has been on. For example, image classification performance has surpassed human abilities (He et al. 2016) and DNNs are now able to both recognize images and generate appropriate captions (Karpathy and Fei-Fei 2015).

Fig. 2.5 A basic convolution operation
Notes: The pixels A, B, and so forth, are multiplied and summed across kernel weights k. The kernel here is applied to every 2 × 2 submatrix of our “image.”

Fig. 2.6 The network architecture used in Hartford et al. (2017)
Notes: Variables x, z contain structured business information (e.g., product IDs and prices) that is mixed with images of handwritten digits in our network.
The models behind these computer vision advances all make use of a specific type of convolution transformation. The raw image data (pixels) goes through multiple convolution layers before the output of those convolutions is fed into the more classical neural network architecture of equation (1) and figure 2.3. A basic image convolution operation is shown in figure 2.5: you use a kernel of weights to combine image pixels in a local area into a single output pixel in a (usually) lower-dimensional output image. So-called convolutional neural networks (CNNs; LeCun and Bengio 1995) illustrate the strategy that makes deep learning so successful: it is convenient to stack layers of different specializations such that image-specific functions (convolutions) can feed into layers that are good at representing generic functional forms. In a contemporary CNN, typically, you will have multiple layers of convolutions feeding into ReLU activations and, eventually, into a max pooling layer constructed of nodes that output the maximum of each input matrix.6 For example, figure 2.6 shows the very simple architecture that we used in Hartford et al. (2017) for a task that mixed digit recognition with (simulated) business data.
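To make the kernel operation in figure 2.5 concrete, here is a minimal NumPy sketch of a single convolution layer's core computation: a small kernel of weights is slid over the image, and each patch is multiplied elementwise by the kernel and summed into one output pixel. The function name, the toy image, and the kernel values are mine, chosen only for illustration.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; each patch is weighted by the kernel
    and summed into a single pixel of the (lower-dimensional) output."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + kh, j*stride:j*stride + kw]
            out[i, j] = np.sum(patch * kernel)  # weighted local sum -> one output pixel
    return out

image = np.arange(16.0).reshape(4, 4)         # toy 4 x 4 "image" of pixel values
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # illustrative 2 x 2 kernel weights
print(convolve2d(image, kernel))              # applied to every 2 x 2 submatrix
```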
This is a theme of deep learning: the models use early layer transformations that are specific to the input data format. For images, you use CNNs.
6. Convolutional neural networks are a huge and very interesting area. The textbook by
Goodfellow, Bengio, and Courville (2016) is a good place to start if you want to learn more.
Fig. 2.7 A cartoon of a DNN, taking as input images, structured data x_1 . . . x_big, and raw document text
For text data, you need to embed words into a vector space. This can happen through a simple word2vec transformation (Mikolov et al. 2013), a linear decomposition on the matrix of co-occurrence counts for words (for example, counts of words appearing within three words of each other), or through an LSTM (long short-term memory) architecture (Hochreiter and Schmidhuber 1997): models for sequences of words or letters that essentially mix a hidden Markov model (long) with an autoregressive process (short). And there are many other variants, with new architectures being developed every day.7

7. For example, the new Capsule networks of Sabour, Frosst, and Hinton (2017) replace the max-pooling of CNNs with more structured summarization functions.
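As a rough illustration of the co-occurrence decomposition mentioned above (not the actual word2vec algorithm, which fits an implicit factorization through a neural objective), the sketch below counts co-occurrences within a window of three words on a made-up two-sentence corpus and takes a truncated SVD to obtain low-dimensional word vectors. The corpus, the log damping, and the embedding dimension are all illustrative choices.

```python
import numpy as np

corpus = ["deep nets learn useful representations of text",
          "word vectors place similar words near each other"]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of three words.
window = 3
counts = np.zeros((len(vocab), len(vocab)))
for doc in docs:
    for i, w in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                counts[index[w], index[doc[j]]] += 1

# A truncated SVD of the (log-damped) count matrix gives dense word vectors.
U, s, _ = np.linalg.svd(np.log1p(counts))
k = 2                            # embedding dimension (illustrative)
embeddings = U[:, :k] * s[:k]    # each row is a k-dimensional word vector
print(dict(zip(vocab, np.round(embeddings, 2))))
```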
One thing should be clear: there is a lot of structure in DNNs. These models are not similar to the sorts of nonparametric regression models used by statisticians, econometricians, and in earlier ML. They are semi-parametric. Consider the cartoon DNN in figure 2.7. The early stages in the network provide dramatic, and often linear, dimension reduction. These early stages are highly parametric: it makes no sense to take a convolution model for image data and apply it to, say, consumer transaction data. The output of these early layers is then processed through a series of classical neural network nodes, as in equation (1). These later network layers work like a traditional nonparametric regression: they expand the output of early layers to approximate arbitrary functional forms in the response of interest. Thus, the DNNs combine restrictive dimension reduction with flexible function approximation. The key is that both components are learned jointly.
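This division of labor can be written down directly. Below is a minimal PyTorch sketch, loosely in the spirit of figures 2.6 and 2.7 but not the actual Hartford et al. (2017) architecture: an image-specific convolutional front end reduces pixels to a short feature vector, that vector is concatenated with structured inputs x, and generic dense layers approximate the response. The class name, layer sizes, and input shapes are arbitrary choices; the point is that a single optimizer over all parameters is what makes the two components learned jointly.

```python
import torch
import torch.nn as nn

class SemiParametricNet(nn.Module):
    def __init__(self, n_structured):
        super().__init__()
        # Input-specific, highly parametric front end: convolutions for images.
        self.front_end = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(8 * 13 * 13, 16), nn.ReLU(),
        )
        # Generic dense layers: flexible function approximation, as in equation (1).
        self.dense = nn.Sequential(
            nn.Linear(16 + n_structured, 32), nn.ReLU(), nn.Linear(32, 1),
        )

    def forward(self, image, x):
        features = self.front_end(image)              # dimension reduction
        return self.dense(torch.cat([features, x], dim=1))

model = SemiParametricNet(n_structured=5)
# One optimizer over every parameter: both stages are learned jointly.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
yhat = model(torch.randn(4, 1, 28, 28), torch.randn(4, 5))  # toy batch
```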
As warned at the outset, we have covered only a tiny part of the area of deep learning. There is a ton of exciting new material coming out of both industry and academia. (For a glimpse of what is happening in the field, browse the latest proceedings of NIPS [Neural Information Processing Systems, the premier ML conference] at https://papers.nips.cc/). You will quickly see the massive breadth of current research. One currently hot topic is uncertainty quantification for deep neural networks; another is understanding how imbalance in training data leads to potentially biased predictions. Topics of this type are gaining prominence as DNNs are moving away from academic competitions and into real-world applications. As the field grows, and DNN model construction moves from a scientific to an engineering discipline, we will see more need for this type of research that tells us when and how much we can trust the DNNs.
2.5 Stochastic Gradient Descent
To give a complete view of deep learning, we need to describe the one algorithm that is relied upon for training all of the models: SGD. Stochastic gradient descent optimization is a twist on gradient descent (GD), the previously dominant method for minimizing any function that you can differentiate. Given a minimization objective L(Ω), where Ω is the full set of model parameters, each iteration of a gradient descent routine updates from current parameters Ω_t as

(2)    Ω_{t+1} = Ω_t − C_t ∇L|_{Ω_t},

where ∇L|_{Ω_t} is the gradient of L evaluated at the current parameters and C_t is a projection matrix that determines the size of the steps taken in the direction implied by ∇L.8 We have the subscript t on C because this projection can be allowed to update during the optimization. For example, Newton's algorithm uses C_t equal to the inverse of the matrix of objective second derivatives, [∇²L|_{Ω_t}]^{−1}.
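As a minimal sketch of the update in equation (2), the loop below uses a fixed scalar step size (C_t = η I) and a stand-in quadratic objective so that the code runs; the objective, names, and step size are invented for illustration.

```python
import numpy as np

def grad_L(omega):
    """Gradient of a stand-in objective L(omega) = 0.5 * ||omega - 1||^2."""
    return omega - 1.0

omega = np.zeros(3)   # current parameters Omega_t
eta = 0.1             # C_t = eta * I: a fixed scalar step size
for t in range(100):
    omega = omega - eta * grad_L(omega)  # equation (2): Omega_{t+1} = Omega_t - C_t * grad L
print(omega)          # converges toward the minimizer (a vector of ones)
```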
It is often stated that neural networks are trained through “back-propagation,” which is not quite correct. Rather, they are trained through variants of gradient descent. Back-propagation (Rumelhart et al. 1988), or back-prop for short, is a method for calculating gradients on the parameters of a network. In particular, back-prop is just an algorithmic implementation of your chain rule from calculus. In the context of our simple neuron from equation (1), the gradient calculation for a single weight ω_{hj} is

(3)    ∂L/∂ω_{hj} = Σ_{i=1}^{n} (∂L/∂z^h_{ij}) (∂z^h_{ij}/∂ω_{hj}) = Σ_{i=1}^{n} (∂L/∂z^h_{ij}) z^{h−1}_{ij} 1_{[0 < z^h_{ij}]}.

Another application of the chain rule can be used to expand ∂L/∂z^h_{ij} as (∂L/∂z^{h+1}_{ij}) (∂z^{h+1}_{ij}/∂z^h_{ij}), and so on until you have written the full gradient as a product of layer-specific operations. The directed structure of the network lets you efficiently calculate all of the gradients by working backward layer by layer, from the response down to the inputs. This recursive application of the chain rule, and the associated computation recipes, make up the general back-prop algorithm.

8. If Ω = [ω_1 . . . ω_p], then ∇L(Ω) = [(∂L/∂ω_1) . . . (∂L/∂ω_p)]. The Hessian matrix, ∇²L, has elements [∇²L]_{jk} = ∂²L/(∂ω_j ∂ω_k).
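A small numpy sketch of that backward recursion for a made-up two-layer ReLU network with a squared-error loss; it shows only the layer-by-layer accumulation of chain-rule terms, not a general-purpose implementation, and all shapes and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 4)), rng.normal(size=(50, 1))   # toy data
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))   # layer weights

# Forward pass: store each layer's output z^h for reuse in the backward pass.
z1 = np.maximum(0, X @ W1)          # ReLU layer, z^1
z2 = z1 @ W2                        # output layer, z^2
L = 0.5 * np.sum((z2 - y) ** 2)     # squared-error loss

# Backward pass (back-prop): chain rule applied layer by layer, output to input.
dL_dz2 = z2 - y                     # dL/dz^2
dL_dW2 = z1.T @ dL_dz2              # (dL/dz^2)(dz^2/dW2), summed over observations
dL_dz1 = dL_dz2 @ W2.T              # push the gradient back one layer
dL_dW1 = X.T @ (dL_dz1 * (z1 > 0))  # ReLU derivative is the indicator 1[0 < z^1]
```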
In statistical estimation and ML model training, L typically involves a loss function that sums across data observations. For example, assuming an ℓ_2 (ridge) regularization penalty on the parameters, the minimization objective corresponding to regularized likelihood maximization over n independent observations d_i (e.g., d_i = [x_i, y_i] for regression) can be written as

(4)    L(Ω) ≡ L(Ω; {d_i}_{i=1}^n) = Σ_{i=1}^{n} −log p(z_i | Ω) + λ‖Ω‖_2^2,

where ‖Ω‖_2^2 is the sum of all squared parameters in Ω. More generally, L(Ω; {d_i}_{i=1}^n) can consist of any loss function that involves summation over observations. For example, to model predictive uncertainty we often work with quantile loss. Define τ_q(x; Ω) as the quantile function, parametrized by Ω, that maps from covariates x to the qth quantile of the response y:

(5)    P(y < τ_q(x; Ω) | x) = q.

We fit τ_q to minimize the regularized quantile loss function (again assuming a ridge penalty):

(6)    L(Ω; {d_i}_{i=1}^n) = Σ_{i=1}^{n} (y_i − τ_q(x_i; Ω)) (q − 1_{[y_i < τ_q(x_i; Ω)]}) + λ‖Ω‖_2^2.

The very common “sum of squared errors” criterion, possibly regularized, is another loss function that fits this pattern of summation over observations.
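A short numpy sketch of the quantile ("pinball") loss in equation (6), assuming for concreteness a linear quantile function τ_q(x; Ω) = x′Ω; the function name, the simulated data, and the choices q = 0.9 and λ = 0.1 are illustrative.

```python
import numpy as np

def quantile_loss(omega, X, y, q, lam):
    """Regularized quantile loss of equation (6) for a linear tau_q(x) = x @ omega."""
    resid = y - X @ omega                      # y_i - tau_q(x_i; Omega)
    loss = np.sum(resid * (q - (resid < 0)))   # u * (q - 1[u < 0]): the pinball loss
    return loss + lam * np.sum(omega ** 2)     # + lambda * ||Omega||_2^2

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
print(quantile_loss(np.zeros(3), X, y, q=0.9, lam=0.1))
```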
In all of these cases, the gradient calculations required for the updates in equation (2) involve sums over all n observations. That is, each calculation of ∇L requires an order of n calculations. For example, in a ridge penalized linear regression where Ω = β, the vector of regression coefficients, the jth gradient component is

(7)    ∂L/∂β_j = Σ_{i=1}^{n} −(y_i − x_i′β) x_{ij} + λβ_j.

The problem for massive data sets is that when n is really big these calculations become prohibitively expensive. The issue is aggravated when, as it is for DNNs, Ω is high dimensional and there are complex calculations required in each gradient summand. Gradient descent is the best optimization tool that we've got, but it becomes computationally infeasible for massive data sets.
The solution is to replace the actual gradients in equation (2) with estimates of those gradients based upon a subset of the data. This is the SGD algorithm. It has a long history, dating back to the Robbins-Monro (Robbins and Monro 1951) algorithm proposed by a couple of statisticians in 1951. In the most common versions of SGD, the full-sample gradient is simply replaced by the gradient on a smaller subsample. Instead of calculating gradients on the full-sample loss, L(Ω; {d_i}_{i=1}^n), we descend according to subsample calculations:

(8)    Ω_{t+1} = Ω_t − C_t ∇L(Ω; {d_{i_b}}_{b=1}^B)|_{Ω_t},

where {d_{i_b}}_{b=1}^B is a mini-batch of observations with B << n. The key mathematical result behind SGD is that, so long as the sequence of C_t matrices satisfies some basic requirements, the SGD algorithm will converge to a local optimum whenever ∇L(Ω; {d_{i_b}}_{b=1}^B) is an unbiased estimate of the full-sample gradient.9 That is, SGD convergence relies upon

(9)    (1/B) E ∇L(Ω; {d_{i_b}}_{b=1}^B) = (1/n) E ∇L(Ω; {d_i}_{i=1}^n) = E ∇L(Ω; d),

where the last term here refers to the population expected gradient, that is, the average gradient for observation d drawn from the true data generating process.

9. You can actually get away with biased gradients. In Hartford et al. (2017) we find that trading bias for variance can actually improve performance. But this is tricky business and in any case the bias must be kept very small.
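To tie equations (7) through (9) together, here is a compact numpy sketch of SGD for the ridge regression objective above: each update uses the gradient of equation (7) computed on a mini-batch and rescaled by n/B so that it is an unbiased estimate of the full-sample gradient, as required by equation (9). The batch size, step size, and simulated data are illustrative only, and mini-batches are drawn with replacement for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 10_000, 5, 0.1
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(size=n)

def minibatch_grad(beta, Xb, yb):
    """Mini-batch estimate of the full-sample ridge gradient in equation (7).
    Scaling the subsample sum by n/B keeps it unbiased, as in equation (9)."""
    return (n / len(yb)) * -(Xb.T @ (yb - Xb @ beta)) + lam * beta

beta, B, eta = np.zeros(p), 100, 5e-5      # C_t = eta * I, held fixed here
for t in range(3_000):
    idx = rng.integers(0, n, size=B)       # draw a mini-batch {d_ib}, b = 1..B
    beta -= eta * minibatch_grad(beta, X[idx], y[idx])   # equation (8) update
print(np.round(beta, 2))                   # roughly recovers beta_true
```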
To understand why SGD is so preferable to GD for machine learning, it
helps to discuss how computer scientists think about the constraints on esti-
mation. Statisticians and economists tend to view sample size (i.e., lack of
data) as the binding constraint on their estimators. In contrast, in many ML
applications the data is practically unlimited and continues to grow during
system deployment. Despite this abundance, there is a fixed computational
budget (or the need to update in near-real-time for streaming data), such
that we can only execute a limited number of operations when crunching
through the data. Thus, in ML, the binding constraint is the amount of
computation rather than the amount of data.
Stochastic gradient descent trades faster updates for a slower per-update convergence rate. As nicely explained in Bottou and Bousquet (2008), this trade is worthwhile when the faster updates allow you to expose your model to more data than would otherwise be possible. To see this, note that the mini-batch gradient B^{−1} ∇L(Ω; {d_{i_b}}_{b=1}^B) has a much higher variance than the full-sample gradient, n^{−1} ∇L(Ω; {d_i}_{i=1}^n). This variance introduces noise into the optimization updates. As a result, for a fixed data sample n, the GD algorithm will tend to take far fewer iterations than SGD to get to a minimum of the in-sample loss, L(Ω; {d_i}_{i=1}^n). However, in DNN training we don't really care about the in-sample loss. We really want to minimize future prediction loss; that is, we want to minimize the population loss function E L(Ω; d). And the best way to understand the population loss is to see as much data as possible. Thus, if the variance of the SGD updates is not too large, it is more valuable to spend computational effort streaming through more data than to spend it on minimizing the variance of each individual optimization update.
This is related to an important high-level point about SGD: the nature of the algorithm is such that engineering steps taken to improve optimization performance will tend to also improve estimation performance. The same tweaks and tricks that lower the variance of each SGD update will lead to fitted models that generalize better when predicting new unseen data. The “train faster, generalize better” paper by Hardt, Recht, and Singer (2016) explains this phenomenon within the framework of algorithm stability. For SGD to converge in fewer iterations means that the gradients on new observations (new mini-batches) are approaching zero more quickly. That is, faster SGD convergence means by definition that your model fits are generalizing better to unseen data. Contrast this with full-sample GD, for example, for likelihood maximization: faster convergence implies only quicker fitting on your current sample, potentially overfitting for future data. A reliance on SGD has made it relatively easy for deep learning to progress from a scientific to an engineering discipline. Faster is better, so the engineers tuning SGD algorithms for DNNs can just focus on convergence speed.
On the topic of tuning SGD: real-world performance is very sensitive to the choice of C_t, the projection matrix in equation (8). For computational reasons, this matrix is usually diagonal (i.e., it has zeros off of the diagonal) such that entries of C_t dictate your step-size in the direction of each parameter gradient. Stochastic gradient descent algorithms have often been studied theoretically under a single step-size, such that C_t = γ_t I, where γ_t is a scalar and I is the identity matrix. Unfortunately, this simple specification will underperform and even fail to converge if γ_t is not going toward zero at a precise rate (Toulis, Airoldi, and Rennie 2014). Instead, practi-