computer vision contest. Since then, the race has been on. For example, image classification performance has surpassed human abilities (He et al. 2016) and DNNs are now able to both recognize images and generate appropriate captions (Karpathy and Fei-Fei 2015).

Fig. 2.5 A basic convolution operation
Notes: The pixels A, B, and so forth, are multiplied and summed across kernel weights k. The kernel here is applied to every 2 × 2 submatrix of our “image.”

Fig. 2.6 The network architecture used in Hartford et al. (2017)
Notes: Variables x, z contain structured business information (e.g., product IDs and prices) that is mixed with images of handwritten digits in our network.
The models behind these computer vision advances all make use of a specific type of convolution transformation. The raw image data (pixels) goes through multiple convolution layers before the output of those convolutions is fed into the more classical neural network architecture of equation (1) and figure 2.3. A basic image convolution operation is shown in figure 2.5: you use a kernel of weights to combine image pixels in a local area into a single output pixel in a (usually) lower-dimensional output image. So-called convolutional neural networks (CNNs; LeCun and Bengio 1995) illustrate the strategy that makes deep learning so successful: it is convenient to stack layers of different specializations such that image-specific functions (convolutions) can feed into layers that are good at representing generic functional forms. In a contemporary CNN, typically, you will have multiple layers of convolutions feeding into ReLU activations and, eventually, into a max pooling layer constructed of nodes that output the maximum of each input matrix.6 For example, figure 2.6 shows the very simple architecture that we used in Hartford et al. (2017) for a task that mixed digit recognition with (simulated) business data.
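To make the kernel operation in figure 2.5 concrete, here is a minimal NumPy sketch of a single convolution layer's core computation: a small kernel of weights is slid over the image, and each patch is multiplied elementwise by the kernel and summed into one output pixel. The function name, the toy image, and the kernel values are mine, chosen only for illustration.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; each patch is weighted by the kernel
    and summed into a single pixel of the (lower-dimensional) output."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + kh, j*stride:j*stride + kw]
            out[i, j] = np.sum(patch * kernel)  # weighted local sum -> one output pixel
    return out

image = np.arange(16.0).reshape(4, 4)         # toy 4 x 4 "image" of pixel values
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # illustrative 2 x 2 kernel weights
print(convolve2d(image, kernel))              # applied to every 2 x 2 submatrix
```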
This is a theme of deep learning: the models use early layer transformations that are specific to the input data format. For images, you use CNNs.
6. Convolutional neural networks are a huge and very interesting area. The textbook by
Goodfellow, Bengio, and Courville (2016) is a good place to start if you want to learn more.
Fig. 2.7 A cartoon of a DNN, taking as input images, structured data x_1 . . . x_big, and raw document text
For text data, you need to embed words into a vector space. This can happen through a simple word2vec transformation (Mikolov et al. 2013), a linear decomposition on the matrix of co-occurrence counts for words (for example, counts of words appearing within three words of each other), or through an LSTM (long short-term memory) architecture (Hochreiter and Schmidhuber 1997): models for sequences of words or letters that essentially mix a hidden Markov model (long) with an autoregressive process (short). And there are many other variants, with new architectures being developed every day.7

7. For example, the new Capsule networks of Sabour, Frosst, and Hinton (2017) replace the max-pooling of CNNs with more structured summarization functions.
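As a rough illustration of the co-occurrence decomposition mentioned above (not the actual word2vec algorithm, which fits an implicit factorization through a neural objective), the sketch below counts co-occurrences within a window of three words on a made-up two-sentence corpus and takes a truncated SVD to obtain low-dimensional word vectors. The corpus, the log damping, and the embedding dimension are all illustrative choices.

```python
import numpy as np

corpus = ["deep nets learn useful representations of text",
          "word vectors place similar words near each other"]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of three words.
window = 3
counts = np.zeros((len(vocab), len(vocab)))
for doc in docs:
    for i, w in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                counts[index[w], index[doc[j]]] += 1

# A truncated SVD of the (log-damped) count matrix gives dense word vectors.
U, s, _ = np.linalg.svd(np.log1p(counts))
k = 2                            # embedding dimension (illustrative)
embeddings = U[:, :k] * s[:k]    # each row is a k-dimensional word vector
print(dict(zip(vocab, np.round(embeddings, 2))))
```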
One thing should be clear: there is a lot of structure in DNNs. These models are not similar to the sorts of nonparametric regression models used by statisticians, econometricians, and in earlier ML. They are semi-parametric. Consider the cartoon DNN in figure 2.7. The early stages in the network provide dramatic, and often linear, dimension reduction. These early stages are highly parametric: it makes no sense to take a convolution model for image data and apply it to, say, consumer transaction data. The output of these early layers is then processed through a series of classical neural network nodes, as in equation (1). These later network layers work like a traditional nonparametric regression: they expand the output of early layers to approximate arbitrary functional forms in the response of interest. Thus, the DNNs combine restrictive dimension reduction with flexible function approximation. The key is that both components are learned jointly.
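This division of labor can be written down directly. Below is a minimal PyTorch sketch, loosely in the spirit of figures 2.6 and 2.7 but not the actual Hartford et al. (2017) architecture: an image-specific convolutional front end reduces pixels to a short feature vector, that vector is concatenated with structured inputs x, and generic dense layers approximate the response. The class name, layer sizes, and input shapes are arbitrary choices; the point is that a single optimizer over all parameters is what makes the two components learned jointly.

```python
import torch
import torch.nn as nn

class SemiParametricNet(nn.Module):
    def __init__(self, n_structured):
        super().__init__()
        # Input-specific, highly parametric front end: convolutions for images.
        self.front_end = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(8 * 13 * 13, 16), nn.ReLU(),
        )
        # Generic dense layers: flexible function approximation, as in equation (1).
        self.dense = nn.Sequential(
            nn.Linear(16 + n_structured, 32), nn.ReLU(), nn.Linear(32, 1),
        )

    def forward(self, image, x):
        features = self.front_end(image)              # dimension reduction
        return self.dense(torch.cat([features, x], dim=1))

model = SemiParametricNet(n_structured=5)
# One optimizer over every parameter: both stages are learned jointly.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
yhat = model(torch.randn(4, 1, 28, 28), torch.randn(4, 5))  # toy batch
```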
As warned at the outset, we have covered only a tiny part of the area of deep learning. There is a ton of exciting new material coming out of both industry and academia. (For a glimpse of what is happening in the field, browse the latest proceedings of NIPS [Neural Information Processing Systems, the premier ML conference] at https://papers.nips.cc/). You will quickly see the massive breadth of current research. One currently hot topic is uncertainty quantification for deep neural networks; another is understanding how imbalance in training data leads to potentially biased predictions. Topics of this type are gaining prominence as DNNs are moving away from academic competitions and into real-world applications. As the field grows, and DNN model construction moves from a scientific to an engineering discipline, we will see more need for this type of research that tells us when and how much we can trust the DNNs.
2.5 Stochastic Gradient Descent
To give a complete view of deep learning, we need to describe the one algorithm that is relied upon for training all of the models: SGD. Stochastic gradient descent optimization is a twist on gradient descent (GD), the previously dominant method for minimizing any function that you can differentiate. Given a minimization objective L(Ω), where Ω is the full set of model parameters, each iteration of a gradient descent routine updates from current parameters Ω_t as

(2)    Ω_{t+1} = Ω_t − C_t ∇L|_{Ω_t},

where ∇L|_{Ω_t} is the gradient of L evaluated at the current parameters and C_t is a projection matrix that determines the size of the steps taken in the direction implied by ∇L.8 We have the subscript t on C because this projection can be allowed to update during the optimization. For example, Newton's algorithm uses C_t equal to the inverse of the matrix of objective second derivatives, [∇²L|_{Ω_t}]^{−1}.
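As a minimal sketch of the update in equation (2), the loop below uses a fixed scalar step size (C_t = η I) and a stand-in quadratic objective so that the code runs; the objective, names, and step size are invented for illustration.

```python
import numpy as np

def grad_L(omega):
    """Gradient of a stand-in objective L(omega) = 0.5 * ||omega - 1||^2."""
    return omega - 1.0

omega = np.zeros(3)   # current parameters Omega_t
eta = 0.1             # C_t = eta * I: a fixed scalar step size
for t in range(100):
    omega = omega - eta * grad_L(omega)  # equation (2): Omega_{t+1} = Omega_t - C_t * grad L
print(omega)          # converges toward the minimizer (a vector of ones)
```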
It is often stated that neural networks are trained through “back-propagation,” which is not quite correct. Rather, they are trained through variants of gradient descent. Back-propagation (Rumelhart et al. 1988), or back-prop for short, is a method for calculating gradients on the parameters of a network. In particular, back-prop is just an algorithmic implementation of your chain rule from calculus. In the context of our simple neuron from equation (1), the gradient calculation for a single weight ω_{hj} is

(3)    ∂L/∂ω_{hj} = Σ_{i=1}^{n} (∂L/∂z^h_{ij}) (∂z^h_{ij}/∂ω_{hj}) = Σ_{i=1}^{n} (∂L/∂z^h_{ij}) z^{h−1}_{ij} 1_{[0 < z^h_{ij}]}.

Another application of the chain rule can be used to expand ∂L/∂z^h_{ij} as (∂L/∂z^{h+1}_{ij}) (∂z^{h+1}_{ij}/∂z^h_{ij}), and so on until you have written the full gradient as a product of layer-specific operations. The directed structure of the network lets you efficiently calculate all of the gradients by working backward layer by layer, from the response down to the inputs. This recursive application of the chain rule, and the associated computation recipes, make up the general back-prop algorithm.

8. If Ω = [ω_1 . . . ω_p], then ∇L(Ω) = [(∂L/∂ω_1) . . . (∂L/∂ω_p)]. The Hessian matrix, ∇²L, has elements [∇²L]_{jk} = ∂²L/(∂ω_j ∂ω_k).
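A small numpy sketch of that backward recursion for a made-up two-layer ReLU network with a squared-error loss; it shows only the layer-by-layer accumulation of chain-rule terms, not a general-purpose implementation, and all shapes and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 4)), rng.normal(size=(50, 1))   # toy data
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))   # layer weights

# Forward pass: store each layer's output z^h for reuse in the backward pass.
z1 = np.maximum(0, X @ W1)          # ReLU layer, z^1
z2 = z1 @ W2                        # output layer, z^2
L = 0.5 * np.sum((z2 - y) ** 2)     # squared-error loss

# Backward pass (back-prop): chain rule applied layer by layer, output to input.
dL_dz2 = z2 - y                     # dL/dz^2
dL_dW2 = z1.T @ dL_dz2              # (dL/dz^2)(dz^2/dW2), summed over observations
dL_dz1 = dL_dz2 @ W2.T              # push the gradient back one layer
dL_dW1 = X.T @ (dL_dz1 * (z1 > 0))  # ReLU derivative is the indicator 1[0 < z^1]
```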
In statistical estimation and ML model training, L typically involves a loss function that sums across data observations. For example, assuming an ℓ_2 (ridge) regularization penalty on the parameters, the minimization objective corresponding to regularized likelihood maximization over n independent observations d_i (e.g., d_i = [x_i, y_i] for regression) can be written as

(4)    L(Ω) ≡ L(Ω; {d_i}_{i=1}^n) = Σ_{i=1}^{n} −log p(z_i | Ω) + λ‖Ω‖_2^2,

where ‖Ω‖_2^2 is the sum of all squared parameters in Ω. More generally, L(Ω; {d_i}_{i=1}^n) can consist of any loss function that involves summation over observations. For example, to model predictive uncertainty we often work with quantile loss. Define τ_q(x; Ω) as the quantile function, parametrized by Ω, that maps from covariates x to the qth quantile of the response y:

(5)    P(y < τ_q(x; Ω) | x) = q.

We fit τ_q to minimize the regularized quantile loss function (again assuming a ridge penalty):

(6)    L(Ω; {d_i}_{i=1}^n) = Σ_{i=1}^{n} (y_i − τ_q(x_i; Ω)) (q − 1_{[y_i < τ_q(x_i; Ω)]}) + λ‖Ω‖_2^2.

The very common “sum of squared errors” criterion, possibly regularized, is another loss function that fits this pattern of summation over observations.
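A short numpy sketch of the quantile ("pinball") loss in equation (6), assuming for concreteness a linear quantile function τ_q(x; Ω) = x′Ω; the function name, the simulated data, and the choices q = 0.9 and λ = 0.1 are illustrative.

```python
import numpy as np

def quantile_loss(omega, X, y, q, lam):
    """Regularized quantile loss of equation (6) for a linear tau_q(x) = x @ omega."""
    resid = y - X @ omega                      # y_i - tau_q(x_i; Omega)
    loss = np.sum(resid * (q - (resid < 0)))   # u * (q - 1[u < 0]): the pinball loss
    return loss + lam * np.sum(omega ** 2)     # + lambda * ||Omega||_2^2

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
print(quantile_loss(np.zeros(3), X, y, q=0.9, lam=0.1))
```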
In all of these cases, the gradient calculations required for the updates in equation (2) involve sums over all n observations. That is, each calculation of ∇L requires an order of n calculations. For example, in a ridge penalized linear regression where Ω = β, the vector of regression coefficients, the jth gradient component is

(7)    ∂L/∂β_j = Σ_{i=1}^{n} −(y_i − x_i′β) x_{ij} + λβ_j.

The problem for massive data sets is that when n is really big these calculations become prohibitively expensive. The issue is aggravated when, as it is for DNNs, Ω is high dimensional and there are complex calculations required in each gradient summand. Gradient descent is the best optimization tool that we've got, but it becomes computationally infeasible for massive data sets.
The solution is to replace the actual gradients in equation (2) with estimates of those gradients based upon a subset of the data. This is the SGD algorithm. It has a long history, dating back to the Robbins-Monro (Robbins and Monro 1951) algorithm proposed by a couple of statisticians in 1951. In the most common versions of SGD, the full-sample gradient is simply replaced by the gradient on a smaller subsample. Instead of calculating gradients on the full-sample loss, L(Ω; {d_i}_{i=1}^n), we descend according to subsample calculations:

(8)    Ω_{t+1} = Ω_t − C_t ∇L(Ω; {d_{i_b}}_{b=1}^B)|_{Ω_t},

where {d_{i_b}}_{b=1}^B is a mini-batch of observations with B << n. The key mathematical result behind SGD is that, so long as the sequence of C_t matrices satisfies some basic requirements, the SGD algorithm will converge to a local optimum whenever ∇L(Ω; {d_{i_b}}_{b=1}^B) is an unbiased estimate of the full-sample gradient.9 That is, SGD convergence relies upon

(9)    (1/B) E ∇L(Ω; {d_{i_b}}_{b=1}^B) = (1/n) E ∇L(Ω; {d_i}_{i=1}^n) = E ∇L(Ω; d),

where the last term here refers to the population expected gradient, that is, the average gradient for observation d drawn from the true data generating process.

9. You can actually get away with biased gradients. In Hartford et al. (2017) we find that trading bias for variance can actually improve performance. But this is tricky business and in any case the bias must be kept very small.
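To tie equations (7) through (9) together, here is a compact numpy sketch of SGD for the ridge regression objective above: each update uses the gradient of equation (7) computed on a mini-batch and rescaled by n/B so that it is an unbiased estimate of the full-sample gradient, as required by equation (9). The batch size, step size, and simulated data are illustrative only, and mini-batches are drawn with replacement for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 10_000, 5, 0.1
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(size=n)

def minibatch_grad(beta, Xb, yb):
    """Mini-batch estimate of the full-sample ridge gradient in equation (7).
    Scaling the subsample sum by n/B keeps it unbiased, as in equation (9)."""
    return (n / len(yb)) * -(Xb.T @ (yb - Xb @ beta)) + lam * beta

beta, B, eta = np.zeros(p), 100, 5e-5      # C_t = eta * I, held fixed here
for t in range(3_000):
    idx = rng.integers(0, n, size=B)       # draw a mini-batch {d_ib}, b = 1..B
    beta -= eta * minibatch_grad(beta, X[idx], y[idx])   # equation (8) update
print(np.round(beta, 2))                   # roughly recovers beta_true
```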
To understand why SGD is so preferable to GD for machine learning, it
helps to discuss how computer scientists think about the constraints on esti-
mation. Statisticians and economists tend to view sample size (i.e., lack of
data) as the binding constraint on their estimators. In contrast, in many ML
applications the data is practically unlimited and continues to grow during
system deployment. Despite this abundance, there is a fixed computational
budget (or the need to update in near-real-time for streaming data), such
that we can only execute a limited number of operations when crunching
through the data. Thus, in ML, the binding constraint is the amount of
computation rather than the amount of data.
Stochastic gradient descent trades faster updates for a slower per-update convergence rate. As nicely explained in Bottou and Bousquet (2008), this trade is worthwhile when the faster updates allow you to expose your model to more data than would otherwise be possible. To see this, note that the mini-batch gradient B^{−1} ∇L(Ω; {d_{i_b}}_{b=1}^B) has a much higher variance than the full-sample gradient, n^{−1} ∇L(Ω; {d_i}_{i=1}^n). This variance introduces noise into the optimization updates. As a result, for a fixed data sample n, the GD algorithm will tend to take far fewer iterations than SGD to get to a minimum of the in-sample loss, L(Ω; {d_i}_{i=1}^n). However, in DNN training we don't really care about the in-sample loss. We really want to minimize future prediction loss; that is, we want to minimize the population loss function E L(Ω; d). And the best way to understand the population loss is to see as much data as possible. Thus, if the variance of the SGD updates is not too large, it is more valuable to spend computational effort streaming through more data than to spend it on minimizing the variance of each individual optimization update.
This is related to an important high-level point about SGD: the nature of the algorithm is such that engineering steps taken to improve optimization performance will tend to also improve estimation performance. The same tweaks and tricks that lower the variance of each SGD update will lead to fitted models that generalize better when predicting new unseen data. The “train faster, generalize better” paper by Hardt, Recht, and Singer (2016) explains this phenomenon within the framework of algorithm stability. For SGD to converge in fewer iterations means that the gradients on new observations (new mini-batches) are approaching zero more quickly. That is, faster SGD convergence means by definition that your model fits are generalizing better to unseen data. Contrast this with full-sample GD, for example, for likelihood maximization: faster convergence implies only quicker fitting on your current sample, potentially overfitting for future data. A reliance on SGD has made it relatively easy for deep learning to progress from a scientific to an engineering discipline. Faster is better, so the engineers tuning SGD algorithms for DNNs can just focus on convergence speed.
On the topic of tuning SGD: real-world performance is very sensitive to the choice of C_t, the projection matrix in equation (8). For computational reasons, this matrix is usually diagonal (i.e., it has zeros off of the diagonal) such that entries of C_t dictate your step-size in the direction of each parameter gradient. Stochastic gradient descent algorithms have often been studied theoretically under a single step-size, such that C_t = γ_t I, where γ_t is a scalar and I is the identity matrix. Unfortunately, this simple specification will underperform and even fail to converge if γ_t is not going toward zero at a precise rate (Toulis, Airoldi, and Rennie 2014). Instead, practi-