
The Economics of Artificial Intelligence




The Technological Elements of Artificial Intelligence
Matt Taddy

purpose ML (GPML) that we reference in the rightmost pillar of figure 2.1. The first component of GPML is deep neural networks: models made up of layers of nonlinear transformation node functions, where the output of each layer becomes input to the next layer in the network. We will describe DNNs in more detail in our Deep Learning section, but for now it suffices to say that they make it faster and easier than ever before to find patterns in unstructured data. They are also highly modular. You can take a layer that is optimized for one type of data (e.g., images) and combine it with other layers for other types of data (e.g., text). You can also use layers that have been pretrained on one data set (e.g., generic images) as components in a more specialized model (e.g., a specific recognition task).

Specialized DNN architectures are responsible for the key GPML capability of working on human-level data: video, audio, and text. This is essential for AI because it allows these systems to be installed on top of the same sources of knowledge that humans are able to digest. You don't need to create a new database system (or have an existing standard form) to feed the AI; rather, the AI can live on top of the chaos of information generated through business functions. This capability helps to illustrate why the new AI, based on GPML, is so much more promising than previous attempts at AI. Classical AI relied on hand-specified logic rules to mimic how a rational human might approach a given problem (Haugeland 1985). This approach is sometimes nostalgically referred to as GOFAI, or "good old-fashioned AI." The problem with GOFAI is obvious: solving human problems with logic rules requires an impossibly complex cataloging of all possible scenarios and actions. Even for systems able to learn from structured data, the need to have an explicit and detailed data schema means that the system designer must know in advance how to translate complex human tasks into deterministic algorithms.

The new AI doesn't have this limitation. For example, consider the problem of creating a virtual agent that can answer customer questions (e.g., "why won't my computer start?"). A GOFAI system would be based on hand-coded dialog trees: if a user says X, answer Y, and so forth. To install the system, you would need to have human engineers understand and explicitly code for all of the main customer issues. In contrast, the new ML-driven AI can simply ingest all of your existing customer-support logs and learn to replicate how human agents have answered customer questions in the past. The ML allows your system to infer support patterns from the human conversations. The installation engineer just needs to start the DNN-fitting routine.
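To make the contrast concrete, the following is a minimal sketch (in Python, using scikit-learn; not the author's system, and a toy retrieval rule rather than a DNN) of how an agent could be bootstrapped from past support logs: represent logged questions as text features and reply with the answer a human agent gave to the most similar past question.

    # Minimal sketch: answer a new question by retrieving the human agent's reply
    # to the most similar past question in the support logs. The log data is made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    logs = [
        ("why won't my computer start?",
         "Check the power cable, then hold the power button for ten seconds."),
        ("how do I reset my password?",
         "Use the 'Forgot password' link on the login page."),
        ("my screen is flickering",
         "Update the display driver and check the monitor cable."),
    ]
    questions = [q for q, _ in logs]
    answers = [a for _, a in logs]

    vectorizer = TfidfVectorizer()                 # bag-of-words text features
    X = vectorizer.fit_transform(questions)
    index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)

    def reply(new_question):
        """Return the logged human answer to the closest past question."""
        _, idx = index.kneighbors(vectorizer.transform([new_question]))
        return answers[idx[0][0]]

    print(reply("computer will not turn on"))

A production agent would replace the bag-of-words retrieval with the DNN text architectures discussed below, but the workflow is the same: ingest the logs, fit, and deploy.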

This gets to the last bit of GPML that we highlight in figure 2.1, the tools that facilitate model fitting on massive data sets: out-of-sample (OOS) validation for model tuning, stochastic gradient descent (SGD) for parameter optimization, and graphical processing units (GPUs) and other computer hardware for massively parallel optimization. Each of these pieces is essential for the success of large-scale GPML. Although they are commonly associated with deep learning and DNNs (especially SGD and GPUs), these tools have developed in the context of many different ML algorithms. The rise of DNNs over alternative ML modeling schemes is partly due to the fact that, through trial and error, ML researchers have discovered that neural network models are especially well suited to engineering within the context of these available tools (LeCun et al. 1998).

Out-of-sample validation is a basic idea: you choose the best model specification by comparing predictions from models estimated on data that was not used during the model "training" (fitting). This can be formalized as a cross-validation routine: you split the data into K "folds," and then K times fit the model on all data but the kth fold and evaluate its predictive performance (e.g., mean squared error or misclassification rate) on the left-out fold. The model with optimal average OOS performance (e.g., minimum error rate) is then deployed in practice.
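As a concrete illustration of this routine, here is a minimal sketch in Python with NumPy, using ordinary least squares as a stand-in for the model being validated; the data and the choice of K are made up.

    # Minimal K-fold cross-validation sketch: fit on K-1 folds, score on the
    # held-out fold, and average the out-of-sample (OOS) error across folds.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, K = 200, 3, 5
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

    folds = np.array_split(rng.permutation(n), K)      # random fold assignment
    oos_errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # model "training"
        oos_errors.append(np.mean((y[test] - X[test] @ beta) ** 2))  # OOS error

    print("average OOS mean squared error:", np.mean(oos_errors))

In practice you would run this loop once per candidate specification and deploy the specification with the best average OOS error.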

Machine learning's wholesale adoption of OOS validation as the arbitrator of model quality has freed the ML engineer from the need to theorize about model quality. Of course, this can create frustration and delays when you have nothing other than "guess-and-test" as a method for model selection. But, increasingly, the requisite model search is not being executed by humans: it is done by additional ML routines. This either happens explicitly, in AutoML frameworks (Feurer et al. 2015) that use simple auxiliary ML to predict the OOS performance of the more complex target model, or implicitly, by adding flexibility to the target model (e.g., making the tuning parameters part of the optimization objective). The fact that OOS validation provides a clear target to optimize against (a target which, unlike the in-sample likelihood, does not incentivize over-fit) facilitates automated model tuning. It removes humans from the process of adapting models to specific data sets.
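A minimal sketch of that idea, with a single ridge penalty playing the role of the tuning parameter and a simple holdout split standing in for full cross-validation (Python/NumPy; the data and penalty grid are made up):

    # Minimal automated-tuning sketch: pick the ridge penalty whose held-out
    # (OOS) error is smallest, rather than reasoning about it by hand.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)
    train, test = np.arange(150), np.arange(150, 200)   # holdout split

    def ridge_fit(X, y, lam):
        """Closed-form ridge regression coefficients."""
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    grid = [0.01, 0.1, 1.0, 10.0, 100.0]
    oos = {lam: np.mean((y[test] - X[test] @ ridge_fit(X[train], y[train], lam)) ** 2)
           for lam in grid}
    print("chosen penalty:", min(oos, key=oos.get))

Note that the in-sample error would always favor the smallest penalty; it is the OOS error that makes the search well posed.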

Stochastic gradient descent optimization will be less familiar to most readers, but it is a crucial part of GPML. This class of algorithms allows models to be fit to data that is only observed in small chunks: you can train the model on a stream of data and avoid having to do batch computations on the entire data set. This lets you estimate complex models on massive data sets. For subtle reasons, the engineering of SGD algorithms also tends to encourage robust and generalizable model fits (i.e., use of SGD discourages over-fit). We cover these algorithms in detail in a dedicated section.
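The following minimal sketch (Python/NumPy; toy data, hand-picked step size and batch size) shows the defining feature: the model only ever touches a small minibatch of observations per update, never the full data set.

    # Minimal stochastic gradient descent sketch: linear regression with squared
    # error, updated one random minibatch at a time.
    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 10_000, 5
    X = rng.normal(size=(n, p))
    true_beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
    y = X @ true_beta + rng.normal(size=n)

    beta = np.zeros(p)                 # parameters to be learned
    step, batch_size = 0.01, 32
    for t in range(2_000):
        idx = rng.integers(0, n, size=batch_size)         # sample a minibatch
        resid = y[idx] - X[idx] @ beta
        grad = -2 * X[idx].T @ resid / batch_size         # minibatch gradient
        beta -= step * grad                               # one SGD update

    print(np.round(beta, 2))           # close to true_beta, with no full-batch pass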

Finally, the GPUs: specialized computer processors have made massive-scale ML a reality, and continued hardware innovation will help push AI to new domains. Deep neural network training with stochastic gradient descent involves massively parallel computations: many basic operations executed simultaneously across parameters of the network. Graphical processing units were devised for calculations of this type, in the context of video and computer graphics display, where all pixels of an image need to be rendered simultaneously, in parallel. Although DNN training was originally a side use case for GPUs (i.e., an aside from their main computer graphics mandate), AI applications are now of primary importance for GPU manufacturers. Nvidia, for example, is a GPU company whose rise in market value has been driven by the rise of AI.

The technology here is not standing still. The GPUs are getting faster and cheaper every day. We are also seeing the deployment of new chips that have been designed from scratch for ML optimization. For example, field-programmable gate arrays (FPGAs) are being used by Microsoft and Amazon in their data centers. These chips allow precision requirements to be set dynamically, thus efficiently allocating resources to high-precision operations and saving compute effort where you only need a few decimal points (e.g., in early optimization updates to the DNN parameters). As another example, Google's Tensor Processing Units (TPUs) are specifically designed for algebra with "tensors," a mathematical object that occurs commonly in ML.4

4. A tensor is a multidimensional extension of a matrix; that is, a matrix is another name for a two-dimensional tensor.
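For readers who have not met the term, a minimal illustration of the footnote's point, using NumPy arrays as a stand-in (TPUs themselves are programmed through higher-level frameworks):

    # A tensor is just a multidimensional array: a vector is a 1-D tensor,
    # a matrix is a 2-D tensor, and image or text data stacks naturally into more.
    import numpy as np

    vector = np.zeros(4)               # 1-D tensor, shape (4,)
    matrix = np.zeros((4, 3))          # 2-D tensor, shape (4, 3)
    images = np.zeros((32, 28, 28))    # 3-D tensor: 32 grayscale 28-by-28 images
    print(vector.ndim, matrix.ndim, images.ndim)   # 1 2 3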

One of the hallmarks of a general purpose technology is that it leads to broad industrial changes, both above and below where that technology lives in the supply chain. This is what we are observing with the new general purpose ML. Below, we see that chip makers are changing the type of hardware they create to suit these DNN-based AI systems. Above, GPML has led to a new class of ML-driven AI products. As we seek more real-world AI capabilities (self-driving cars, conversational business agents, intelligent economic marketplaces), domain experts in these areas will need to find ways to resolve their complex questions into structures of ML tasks. This is a role that economists and business professionals should embrace, where the increasingly user-friendly GPML routines become basic tools of their trade.

2.4 Deep Learning

We have stated that deep neural networks are a key tool in GPML, but what exactly are they? And what makes them deep? In this section we will give a high-level overview of these models. This is not a user guide. For that, we recommend the excellent recent textbook by Goodfellow, Bengio, and Courville (2016). This is a rapidly evolving area of research, and new types of neural network models and estimation algorithms are being developed at a steady clip. The excitement in this area, and considerable media and business hype, makes it difficult to keep track. Moreover, the tendency of ML companies and academics to proclaim every incremental change as "completely brand new" has led to a messy literature that is tough for newcomers to navigate. But there is a general structure to deep learning, and a hype-free understanding of this structure should give you insight into the reasons for its success.

Fig. 2.3 A five-layer network
Source: Adapted from Nielsen (2015).

Neural networks are simple models. Indeed, their simplicity is a strength: basic patterns facilitate fast training and computation. The model has linear combinations of inputs that are passed through nonlinear activation functions called nodes (or, in reference to the human brain, neurons). A set of nodes taking different weighted sums of the same inputs is called a "layer," and the output of one layer's nodes becomes input to the next layer. This structure is illustrated in figure 2.3. Each circle here is a node. Those in the input (farthest left) layer typically have a special structure; they are either raw data or data that has been processed through an additional set of layers (e.g., convolutions, as we will describe). The output layer gives your predictions. In a simple regression setting, this output could just be ŷ, the predicted value for some random variable y, but DNNs can be used to predict all sorts of high-dimensional objects. As it is for nodes in input layers, output nodes also tend to take application-specific forms.

Nodes in the interior of the network have a "classical" neural network structure. Say that τ_hk(·) is the kth node in interior layer h. This node takes as input a weighted combination of the output of the nodes in the previous layer of the network, layer h − 1, and applies a nonlinear transformation to yield the output. For example, the ReLU (for "rectified linear unit") node is by far the most common functional form used today; it simply outputs the maximum of its input and zero, as shown in figure 2.4.5 Say z^{h−1}_{ij} is the output of node j in layer h − 1 for observation i. Then the corresponding output for the kth node in the hth layer can be written

(1)   z^h_{ik} = τ_hk(ω_h′ z^{h−1}_i) = max(0, Σ_j ω_hj z^{h−1}_{ij}),

where ω_hj are the network weights. For a given network architecture (the structure of nodes and layers), these weights are the parameters that are updated during network training.

5. In the 1990s, people spent much effort choosing among different node transformation functions. More recently, the consensus is that you can just use a simple and computationally convenient transformation (like ReLU). If you have enough nodes and layers, the specific transformation doesn't really matter, so long as it is nonlinear.
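To make equation (1) concrete, here is a minimal sketch of a forward pass through a few ReLU layers (Python/NumPy). The layer sizes are arbitrary and the weights are random rather than trained, and a real implementation would also include intercept ("bias") terms.

    # Minimal sketch of equation (1): each layer applies max(0, .) to weighted
    # combinations of the previous layer's outputs.
    import numpy as np

    rng = np.random.default_rng(3)

    def relu_layer(z_prev, W):
        """z^h = max(0, z^{h-1} W), applied to every observation (row) of z_prev."""
        return np.maximum(0.0, z_prev @ W)

    n_obs, layer_sizes = 5, [4, 8, 8, 1]               # input, two interior layers, output
    weights = [rng.normal(size=(layer_sizes[h], layer_sizes[h + 1]))
               for h in range(len(layer_sizes) - 1)]

    z = rng.normal(size=(n_obs, layer_sizes[0]))       # input layer: raw data
    for W in weights[:-1]:
        z = relu_layer(z, W)                           # interior ReLU layers
    y_hat = z @ weights[-1]                            # linear output layer: predictions
    print(y_hat.ravel())

Training, covered in the stochastic gradient descent discussion, consists of updating the entries of these weight matrices so that the predictions match observed outcomes.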

Neural networks have a long history. Work on these types of models dates back to the mid-twentieth century, including, for example, Rosenblatt's Perceptron (Rosenblatt 1958). This early work was focused on networks as models that could mimic the actual structure of the human brain. In the late 1980s, advances in algorithms for training neural networks (Rumelhart et al. 1988) opened the potential for these models to act as general pattern-recognition tools rather than as a toy model of the brain. This led to a boom in neural network research, and methods developed during the 1990s are at the foundation of much of deep learning today (Hochreiter and Schmidhuber 1997; LeCun et al. 1998). However, this boom ended in bust. Due to the gap between promised and realized results (and enduring difficulties in training networks on massive data sets), from the late 1990s neural networks became just one ML method among many. In applications they were supplanted by more robust tools such as Random Forests, high-dimensional regularized regression, and a variety of Bayesian stochastic process models.

Fig. 2.4 The ReLU function

In the 1990s, one tended to add network complexity by adding width. A couple of layers (e.g., a single hidden layer was common) with a large number of nodes in each layer were used to approximate complex functions. Researchers had established that such "wide" learning could approximate arbitrary functions (Hornik, Stinchcombe, and White 1989) if you were able to train on enough data. The problem, however, was that this turns out to be an inefficient way to learn from data. The wide networks are very flexible, but they need a ton of data to tame this flexibility. In this way, the wide nets resemble traditional nonparametric statistical models like series and kernel estimators. Indeed, near the end of the 1990s, Radford Neal showed that certain neural networks converge toward Gaussian Processes, a classical statistical regression model, as the number of nodes in a single layer grows toward infinity (Neal 2012). It seemed reasonable to conclude that neural networks were just clunky versions of more transparent statistical models.

What changed? A bunch of things. Two nonmethodological events are of primary importance: we got much more data (big data), and computing hardware became much more efficient (GPUs). But there was also a crucial methodological development: networks went deep. This breakthrough is often credited to 2006 work by Geoff Hinton and coauthors (Hinton, Osindero, and Teh 2006) on a network architecture that stacked many pretrained layers together for a handwriting recognition task. In this pretraining, interior layers of the network are fit using an unsupervised learning task (i.e., dimension reduction of the inputs) before being used as part of the supervised learning machinery. The idea is analogous to that of principal components regression: you first fit a low-dimensional representation of x, then use that low-D representation to predict some associated y. Hinton and colleagues' scheme allowed researchers to train deeper networks than was previously possible.
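For readers who know principal components regression only by name, here is a minimal sketch of that two-step logic (Python/NumPy; the data and the choice of two components are made up): an unsupervised step compresses x, and a supervised step regresses y on the compressed representation.

    # Minimal principal components regression sketch: PCA (via the SVD) builds a
    # low-dimensional representation of x; OLS then predicts y from it.
    import numpy as np

    rng = np.random.default_rng(4)
    n, p, n_components = 300, 20, 2
    X = rng.normal(size=(n, p))
    y = X[:, :2] @ np.array([1.5, -1.0]) + rng.normal(size=n)

    Xc = X - X.mean(axis=0)                             # unsupervised step: center X and
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # take its top principal components
    scores = Xc @ Vt[:n_components].T                   # low-D representation of x

    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)   # supervised step
    print("coefficients on the components:", np.round(gamma, 2))

In the pretraining scheme described above, the dimension reduction is itself performed by network layers rather than by the SVD, but the compress-then-predict logic is the same.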

This specific type of unsupervised pretraining is no longer viewed as central to deep learning. However, Hinton, Osindero, and Teh's (2006) paper opened many people's eyes to the potential for deep neural networks: models with many layers, each of which may have different structure and play a very different role in the overall machinery. That is, a demonstration that one could train deep networks soon turned into a realization that one should add depth to models. In the following years, research groups began to show empirically and theoretically that depth was important for learning efficiently from data (Bengio et al. 2007). The modularity of a deep network is key: each layer of functional structure plays a specific role, and you can swap out layers like Lego blocks when moving across data applications. This allows for fast application-specific model development, and also for transfer learning across models: an internal layer from a network that has been trained for one type of image recognition problem can be used to hot-start a new network for a different computer vision task.
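As a minimal illustration of this hot-start idea, continuing the NumPy sketches above rather than using a real deep learning framework: keep ("freeze") the interior-layer weights learned on one task, and estimate only a new output layer for the next task. The pretrained weights below are random placeholders.

    # Minimal transfer-learning sketch: reuse frozen interior layers trained on
    # task A as a feature map, and fit only the output layer for task B.
    import numpy as np

    rng = np.random.default_rng(5)

    def relu(x):
        return np.maximum(0.0, x)

    # Pretend these interior weights were learned on a large task-A data set.
    W1_pretrained = rng.normal(size=(10, 16))
    W2_pretrained = rng.normal(size=(16, 16))

    def features(X):
        """Frozen interior layers: map raw inputs to the learned representation."""
        return relu(relu(X @ W1_pretrained) @ W2_pretrained)

    # Task B: a small data set and a new target; only the output layer is estimated.
    X_b = rng.normal(size=(50, 10))
    y_b = rng.normal(size=50)
    Z = features(X_b)                                   # reuse the pretrained layers
    w_out, *_ = np.linalg.lstsq(Z, y_b, rcond=None)     # fit just the new output layer
    print("task-B predictions:", np.round(Z[:3] @ w_out, 2))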

Deep learning came into the ML mainstream with a 2012 paper by Krizhevsky, Sutskever, and Hinton (2012) that showed their DNN was able to smash current performance benchmarks in the well-known ImageNet

 
