The Economics of Artificial Intelligence


by Ajay Agrawal


computer vision contest. Since then, the race has been on. For example, image classification performance has surpassed human abilities (He et al. 2016) and DNNs are now able to both recognize images and generate appropriate captions (Karpathy and Fei-Fei 2015).

Fig. 2.5 A basic convolution operation
Notes: The pixels A, B, and so forth, are multiplied and summed across kernel weights ω_k. The kernel here is applied to every 2 × 2 submatrix of our "image."

Fig. 2.6 The network architecture used in Hartford et al. (2017)
Notes: Variables x, z contain structured business information (e.g., product IDs and prices) that is mixed with images of handwritten digits in our network.

The models behind these computer vision advances all make use of a specific type of convolution transformation. The raw image data (pixels) goes through multiple convolution layers before the output of those convolutions is fed into the more classical neural network architecture of equation (1) and figure 2.3. A basic image convolution operation is shown in figure 2.5: you use a kernel of weights to combine image pixels in a local area into a single output pixel in a (usually) lower-dimensional output image. So-called convolutional neural networks (CNNs; LeCun and Bengio 1995) illustrate the strategy that makes deep learning so successful: it is convenient to stack layers of different specializations such that image-specific functions (convolutions) can feed into layers that are good at representing generic functional forms. In a contemporary CNN, typically, you will have multiple layers of convolutions feeding into ReLU activations and, eventually, into a max pooling layer constructed of nodes that output the maximum of each input matrix.6 For example, figure 2.6 shows the very simple architecture that we used in Hartford et al. (2017) for a task that mixed digit recognition with (simulated) business data.
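To make the operation in figure 2.5 concrete, here is a minimal sketch of that kind of convolution in Python (NumPy); the image and kernel values are made up for illustration and are not taken from the chapter.

```python
# A 2 x 2 kernel of weights is slid over every 2 x 2 submatrix of a small
# "image"; the weighted pixel values are summed into one output pixel.
# Kernel values here are illustrative only.
import numpy as np

image = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0]])
kernel = np.array([[0.25, -0.5],
                   [0.75,  1.0]])

out_rows = image.shape[0] - kernel.shape[0] + 1
out_cols = image.shape[1] - kernel.shape[1] + 1
output = np.zeros((out_rows, out_cols))

for r in range(out_rows):
    for c in range(out_cols):
        patch = image[r:r + 2, c:c + 2]        # local 2 x 2 submatrix
        output[r, c] = np.sum(patch * kernel)  # multiply and sum

print(output)  # a (usually) lower-dimensional output image
```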

This is a theme of deep learning: the models use early layer transformations that are specific to the input data format. For images, you use CNNs.

6. Convolutional neural networks are a huge and very interesting area. The textbook by Goodfellow, Bengio, and Courville (2016) is a good place to start if you want to learn more.

Fig. 2.7 A cartoon of a DNN, taking as input images, structured data x_1 . . . x_big, and raw document text

For text data, you need to embed words into a vector space. This can happen through a simple word2vec transformation (Mikolov et al. 2013), a linear decomposition of the matrix of co-occurrence counts for words (for example, words within three words of each other), or through an LSTM (long short-term memory) architecture (Hochreiter and Schmidhuber 1997)—models for sequences of words or letters that essentially mix a hidden Markov model (long) with an autoregressive process (short). And there are many other variants, with new architectures being developed every day.7

7. For example, the new Capsule networks of Sabour, Frosst, and Hinton (2017) replace the max-pooling of CNNs with more structured summarization functions.
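As a rough illustration of the word-embedding idea, the sketch below builds a small co-occurrence count matrix (words within three tokens of each other) and takes a low-rank linear decomposition of it via SVD. This mimics the flavor of the decomposition described above rather than the exact word2vec training procedure; the toy sentences are my own.

```python
# Count co-occurrences within a small window, then keep a low-rank SVD
# factorization so each word gets a short dense vector.
import numpy as np
from itertools import combinations

docs = ["the price of the product rose",
        "the product price fell sharply"]
window = 3  # treat words as co-occurring if within three tokens

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

for d in docs:
    tokens = d.split()
    for i, j in combinations(range(len(tokens)), 2):
        if abs(i - j) < window:
            a, b = index[tokens[i]], index[tokens[j]]
            counts[a, b] += 1
            counts[b, a] += 1

# Low-rank linear decomposition: keep two singular directions per word.
U, S, _ = np.linalg.svd(counts)
embeddings = U[:, :2] * S[:2]
for w in vocab:
    print(w, embeddings[index[w]].round(2))
```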

One thing should be clear: there is a lot of structure in DNNs. These models are not similar to the sorts of nonparametric regression models used by statisticians, econometricians, and in earlier ML. They are semi-parametric. Consider the cartoon DNN in figure 2.7. The early stages in the network provide dramatic, and often linear, dimension reduction. These early stages are highly parametric: it makes no sense to take a convolution model for image data and apply it to, say, consumer transaction data. The output of these early layers is then processed through a series of classical neural network nodes, as in equation (1). These later network layers work like a traditional nonparametric regression: they expand the output of early layers to approximate arbitrary functional forms in the response of interest. Thus, the DNNs combine restrictive dimension reduction with flexible function approximation. The key is that both components are learned jointly.
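A small sketch of this semi-parametric structure is given below in PyTorch. It is my own illustration (not the Hartford et al. architecture): an image-specific convolutional branch performs the restrictive dimension reduction, its output is concatenated with structured inputs x, and generic dense layers then provide the flexible function approximation; all parameters are trained jointly by the same gradient updates.

```python
import torch
import torch.nn as nn

class MixedInputNet(nn.Module):
    def __init__(self, n_structured: int):
        super().__init__()
        # Early, highly parametric stage: convolutions for 28 x 28 images.
        self.image_branch = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        # Later, nonparametric-style stage: generic dense layers on the
        # concatenation of image features and structured variables.
        self.head = nn.Sequential(
            nn.Linear(16 * 5 * 5 + n_structured, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, image, x):
        features = self.image_branch(image)
        return self.head(torch.cat([features, x], dim=1))

model = MixedInputNet(n_structured=4)
out = model(torch.randn(2, 1, 28, 28), torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 1])
```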

As warned at the outset, we have covered only a tiny part of the area of deep learning. There is a ton of exciting new material coming out of both industry and academia. (For a glimpse of what is happening in the field, browse the latest proceedings of NIPS [Neural Information Processing Systems, the premier ML conference] at https://papers.nips.cc/.) You will see quickly the massive breadth of current research. One currently hot topic is uncertainty quantification for deep neural networks; another is understanding how imbalance in training data leads to potentially biased predictions. Topics of this type are gaining prominence as DNNs are moving away from academic competitions and into real-world applications. As the field grows, and DNN model construction moves from a scientific to an engineering discipline, we will see more need for this type of research that tells us when and how much we can trust the DNNs.

  2.5 Stochastic Gradient Descent

To give a complete view of deep learning, we need to describe the one algorithm that is relied upon for training all of the models: SGD. Stochastic gradient descent optimization is a twist on gradient descent (GD), the previously dominant method for minimizing any function that you can differentiate. Given a minimization objective L(Ω), where Ω is the full set of model parameters, each iteration of a gradient descent routine updates from current parameters Ω_t as

(2)  Ω_{t+1} = Ω_t − C_t ∇L|_{Ω_t},

where ∇L|_{Ω_t} is the gradient of L evaluated at the current parameters and C_t is a projection matrix that determines the size of the steps taken in the direction implied by ∇L.8 We have the subscript t on C_t because this projection can be allowed to update during the optimization. For example, Newton's algorithm uses C_t equal to the inverse of the matrix of objective second derivatives, ∇²L|_{Ω_t}.

8. If Ω = [ω_1 . . . ω_p], then ∇L(Ω) = [(∂L/∂ω_1) . . . (∂L/∂ω_p)]. The Hessian matrix, ∇²L, has elements [∇²L]_{jk} = ∂²L/(∂ω_j ∂ω_k).
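As a toy illustration of the update in equation (2), the sketch below minimizes a simple quadratic loss with two choices of C_t: a fixed scalar step (plain gradient descent) and the inverse Hessian (Newton's algorithm). The quadratic and the step size are my own choices for illustration.

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # positive definite, so L is convex
b = np.array([1.0, -2.0])

def grad(omega):
    # Gradient of L(Ω) = 0.5 Ω'AΩ - b'Ω, so the Hessian is A.
    return A @ omega - b

omega_gd = np.zeros(2)
for t in range(50):
    omega_gd = omega_gd - 0.1 * grad(omega_gd)   # C_t = 0.1 * I

omega_newton = np.zeros(2) - np.linalg.inv(A) @ grad(np.zeros(2))  # C_t = (Hessian)^{-1}

print(omega_gd, omega_newton, np.linalg.solve(A, b))  # all near the true minimizer
```

On this quadratic the Newton choice of C_t jumps to the minimizer in a single step, while the fixed-step version needs many iterations.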

It is often stated that neural networks are trained through "back-propagation," which is not quite correct. Rather, they are trained through variants of gradient descent. Back-propagation (Rumelhart et al. 1988), or back-prop for short, is a method for calculating gradients on the parameters of a network. In particular, back-prop is just an algorithmic implementation of your chain rule from calculus. In the context of our simple neuron from equation (1), the gradient calculation for a single weight ω_{hj} is

(3)  ∂L/∂ω_{hj} = Σ_{i=1}^n (∂L/∂z^h_{ij}) (∂z^h_{ij}/∂ω_{hj}) = Σ_{i=1}^n (∂L/∂z^h_{ij}) z^{h−1}_{ij} 1_{[0 < Σ_j ω_{hj} z^{h−1}_{ij}]}.

Another application of the chain rule can be used to expand ∂L/∂z^h_{ij} as (∂L/∂z^{h+1}_{ij}) · (∂z^{h+1}_{ij}/∂z^h_{ij}), and so on until you have written the full gradient as a product of layer-specific operations. The directed structure of the network lets you efficiently calculate all of the gradients by working backward layer by layer, from the response down to the inputs. This recursive application of the chain rule, and the associated computation recipes, make up the general back-prop algorithm.
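The following sketch works through the chain rule of equation (3) numerically for a single ReLU node with a squared-error loss, and checks the hand-computed gradient against a finite-difference approximation. It is an illustration of the back-prop idea, not code from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # n = 5 observations, 3 inputs
y = rng.normal(size=5)
w = rng.normal(size=3)            # weights of a single ReLU node

def forward(w):
    z = np.maximum(0.0, X @ w)    # node output z_i = max(0, x_i'w)
    return z, 0.5 * np.sum((y - z) ** 2)

z, loss = forward(w)

# Chain rule: dL/dw_j = sum_i (dL/dz_i) * (dz_i/dw_j)
dL_dz = -(y - z)                        # derivative of the squared-error loss
dz_dw = (X @ w > 0)[:, None] * X        # ReLU gate times the inputs x_ij
grad = (dL_dz[:, None] * dz_dw).sum(axis=0)

# Finite-difference check on the first weight.
eps = 1e-6
w_plus = w.copy()
w_plus[0] += eps
print(grad[0], (forward(w_plus)[1] - loss) / eps)   # should nearly agree
```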

In statistical estimation and ML model training, L typically involves a loss function that sums across data observations. For example, assuming an ℓ_2 (ridge) regularization penalty on the parameters, the minimization objective corresponding to regularized likelihood maximization over n independent observations d_i (e.g., d_i = [x_i, y_i] for regression) can be written as

(4)  L(Ω) ≡ L(Ω; {d_i}_{i=1}^n) = −Σ_{i=1}^n log p(z_i | Ω) + λ‖Ω‖_2^2,

where ‖Ω‖_2^2 is the sum of all squared parameters in Ω. More generally, L(Ω; {d_i}_{i=1}^n) can consist of any loss function that involves summation over observations. For example, to model predictive uncertainty we often work with quantile loss. Define τ_q(x; Ω) as the quantile function, parametrized by Ω, that maps from covariates x to the qth quantile of the response y:

(5)  P(y < τ_q(x; Ω) | x) = q.

We fit τ_q to minimize the regularized quantile loss function (again assuming a ridge penalty):

(6)  L(Ω; {d_i}_{i=1}^n) = Σ_{i=1}^n (y_i − τ_q(x_i; Ω)) (q − 1_{[y_i < τ_q(x_i; Ω)]}) + λ‖Ω‖_2^2.

The very common "sum of squared errors" criterion, possibly regularized, is another loss function that fits this pattern of summation over observations.
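For concreteness, here is a minimal sketch of the quantile ("check") loss inside equation (6), written as a plain Python function without the ridge term; the example values are my own.

```python
import numpy as np

def quantile_loss(y, tau_hat, q):
    """Sum of (y_i - tau_hat_i) * (q - 1[y_i < tau_hat_i]) over observations."""
    residual = y - tau_hat
    return np.sum(residual * (q - (residual < 0)))

y = np.array([1.0, 2.0, 3.0, 4.0])
tau_hat = np.array([1.5, 1.8, 2.5, 3.5])   # candidate 0.9-quantile predictions
print(quantile_loss(y, tau_hat, q=0.9))
```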

In all of these cases, the gradient calculations required for the updates in equation (2) involve sums over all n observations. That is, each calculation of ∇L requires an order of n calculations. For example, in a ridge penalized linear regression where Ω = β, the vector of regression coefficients, the jth gradient component is

(7)  ∂L/∂β_j = −Σ_{i=1}^n (y_i − x_i′β) x_{ij} + λβ_j.

The problem for massive data sets is that when n is really big these calculations become prohibitively expensive. The issue is aggravated when, as it is for DNNs, Ω is high dimensional and there are complex calculations required in each gradient summand. Gradient descent is the best optimization tool that we've got, but it becomes computationally infeasible for massive data sets.

The solution is to replace the actual gradients in equation (2) with estimates of those gradients based upon a subset of the data. This is the SGD algorithm. It has a long history, dating back to the Robbins-Monro algorithm (Robbins and Monro 1951), proposed by a couple of statisticians in 1951. In the most common versions of SGD, the full-sample gradient is simply replaced by the gradient on a smaller subsample. Instead of calculating gradients on the full-sample loss, L(Ω; {d_i}_{i=1}^n), we descend according to subsample calculations:

(8)  Ω_{t+1} = Ω_t − C_t ∇L(Ω; {d_{i_b}}_{b=1}^B)|_{Ω_t},

where {d_{i_b}}_{b=1}^B is a mini-batch of observations with B << n. The key mathematical result behind SGD is that, so long as the sequence of C_t matrices satisfies some basic requirements, the SGD algorithm will converge to a local optimum whenever ∇L(Ω; {d_{i_b}}_{b=1}^B) is an unbiased estimate of the full-sample gradient.9 That is, SGD convergence relies upon

(9)  E[(1/B) ∇L(Ω; {d_{i_b}}_{b=1}^B)] = E[(1/n) ∇L(Ω; {d_i}_{i=1}^n)] = E[∇L(Ω; d)],

where the last term here refers to the population expected gradient—that is, the average gradient for observation d drawn from the true data generating process.

9. You can actually get away with biased gradients. In Hartford et al. (2017) we find that trading bias for variance can actually improve performance. But this is tricky business and in any case the bias must be kept very small.
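Putting equations (7) and (8) together, here is a minimal sketch of mini-batch SGD for a ridge-penalized linear regression on simulated data; the batch size, constant step size, and penalty value are illustrative choices of mine rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10_000, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 0.1        # ridge penalty
B = 50           # mini-batch size, with B << n
step = 0.01      # a simple constant C_t = step * I
beta = np.zeros(p)

for t in range(2000):
    idx = rng.choice(n, size=B, replace=False)     # draw a mini-batch
    resid = y[idx] - X[idx] @ beta
    grad = -X[idx].T @ resid + lam * beta          # equation (7) on the batch
    beta = beta - step * grad                      # equation (8) update

print(np.round(beta, 2))   # should land close to beta_true
```

Each update touches only B = 50 observations, so the per-iteration cost does not depend on n; that is the whole point of the subsampling.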

To understand why SGD is so preferable to GD for machine learning, it helps to discuss how computer scientists think about the constraints on estimation. Statisticians and economists tend to view sample size (i.e., lack of data) as the binding constraint on their estimators. In contrast, in many ML applications the data is practically unlimited and continues to grow during system deployment. Despite this abundance, there is a fixed computational budget (or the need to update in near-real-time for streaming data), such that we can only execute a limited number of operations when crunching through the data. Thus, in ML, the binding constraint is the amount of computation rather than the amount of data.

Stochastic gradient descent trades faster updates for a slower per-update convergence rate. As nicely explained by Bottou and Bousquet (2008), this trade is worthwhile when the faster updates allow you to expose your model to more data than would otherwise be possible. To see this, note that the mini-batch gradient B^{−1} ∇L(Ω; {d_{i_b}}_{b=1}^B) has a much higher variance than the full-sample gradient, n^{−1} ∇L(Ω; {d_i}_{i=1}^n). This variance introduces noise into the optimization updates. As a result, for a fixed data sample n, the GD algorithm will tend to take far fewer iterations than SGD to get to a minimum of the in-sample loss, L(Ω; {d_i}_{i=1}^n). However, in DNN training we don't really care about the in-sample loss. We really want to minimize future prediction loss—that is, we want to minimize the population loss function E L(Ω; d). And the best way to understand the population loss is to see as much data as possible. Thus, if the variance of the SGD updates is not too large, it is more valuable to spend computational effort streaming through more data than to spend it on minimizing the variance of each individual optimization update.

This is related to an important high-level point about SGD: the nature of the algorithm is such that engineering steps taken to improve optimization performance will tend to also improve estimation performance. The same tweaks and tricks that lower the variance of each SGD update will lead to fitted models that generalize better when predicting new unseen data. The "train faster, generalize better" paper by Hardt, Recht, and Singer (2016) explains this phenomenon within the framework of algorithm stability. For SGD to converge in fewer iterations means that the gradients on new observations (new mini-batches) are approaching zero more quickly. That is, faster SGD convergence means by definition that your model fits are generalizing better to unseen data. Contrast this with full-sample GD, for example, for likelihood maximization: faster convergence implies only quicker fitting on your current sample, potentially overfitting for future data. A reliance on SGD has made it relatively easy for deep learning to progress from a scientific to an engineering discipline. Faster is better, so the engineers tuning SGD algorithms for DNNs can just focus on convergence speed.

On the topic of tuning SGD: real-world performance is very sensitive to the choice of C_t, the projection matrix in equation (8). For computational reasons, this matrix is usually diagonal (i.e., it has zeros off of the diagonal), such that the entries of C_t dictate your step size in the direction of each parameter gradient. Stochastic gradient descent algorithms have often been studied theoretically under a single step size, such that C_t = γ_t I, where γ_t is a scalar and I is the identity matrix. Unfortunately, this simple specification will underperform and even fail to converge if γ_t is not going toward zero at a precise rate (Toulis, Airoldi, and Rennie 2014). Instead, practi-

 
