
The Economics of Artificial Intelligence




The Technological Elements of Artificial Intelligence
Matt Taddy

purpose ML (GPML) that we reference in the rightmost pillar of figure 2.1. The first component of GPML is deep neural networks: models made up of layers of nonlinear transformation node functions, where the output of each layer becomes input to the next layer in the network. We will describe DNNs in more detail in our Deep Learning section, but for now it suffices to say that they make it faster and easier than ever before to find patterns in unstructured data. They are also highly modular. You can take a layer that is optimized for one type of data (e.g., images) and combine it with other layers for other types of data (e.g., text). You can also use layers that have been pretrained on one data set (e.g., generic images) as components in a more specialized model (e.g., a specific recognition task).

Specialized DNN architectures are responsible for the key GPML capability of working on human-level data: video, audio, and text. This is essential for AI because it allows these systems to be installed on top of the same sources of knowledge that humans are able to digest. You don't need to create a new database system (or have an existing standard form) to feed the AI; rather, the AI can live on top of the chaos of information generated through business functions. This capability helps to illustrate why the new AI, based on GPML, is so much more promising than previous attempts at AI. Classical AI relied on hand-specified logic rules to mimic how a rational human might approach a given problem (Haugeland 1985). This approach is sometimes nostalgically referred to as GOFAI, or "good old-fashioned AI." The problem with GOFAI is obvious: solving human problems with logic rules requires an impossibly complex cataloging of all possible scenarios and actions. Even for systems able to learn from structured data, the need to have an explicit and detailed data schema means that the system designer must know in advance how to translate complex human tasks into deterministic algorithms.

The new AI doesn't have this limitation. For example, consider the problem of creating a virtual agent that can answer customer questions (e.g., "why won't my computer start?"). A GOFAI system would be based on hand-coded dialog trees: if a user says X, answer Y, and so forth. To install the system, you would need to have human engineers understand and explicitly code for all of the main customer issues. In contrast, the new ML-driven AI can simply ingest all of your existing customer-support logs and learn to replicate how human agents have answered customer questions in the past. The ML allows your system to infer support patterns from the human conversations. The installation engineer just needs to start the DNN-fitting routine.
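To make the contrast concrete, the following is a minimal sketch (in Python, using scikit-learn; not the author's system, and a toy retrieval rule rather than a DNN) of how an agent could be bootstrapped from past support logs: represent logged questions as text features and reply with the answer a human agent gave to the most similar past question.

    # Minimal sketch: answer a new question by retrieving the human agent's reply
    # to the most similar past question in the support logs. The log data is made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    logs = [
        ("why won't my computer start?",
         "Check the power cable, then hold the power button for ten seconds."),
        ("how do I reset my password?",
         "Use the 'Forgot password' link on the login page."),
        ("my screen is flickering",
         "Update the display driver and check the monitor cable."),
    ]
    questions = [q for q, _ in logs]
    answers = [a for _, a in logs]

    vectorizer = TfidfVectorizer()                 # bag-of-words text features
    X = vectorizer.fit_transform(questions)
    index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)

    def reply(new_question):
        """Return the logged human answer to the closest past question."""
        _, idx = index.kneighbors(vectorizer.transform([new_question]))
        return answers[idx[0][0]]

    print(reply("computer will not turn on"))

A production agent would replace the bag-of-words retrieval with the DNN text architectures discussed below, but the workflow is the same: ingest the logs, fit, and deploy.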

This gets to the last bit of GPML that we highlight in figure 2.1, the tools that facilitate model fitting on massive data sets: out-of-sample (OOS) validation for model tuning, stochastic gradient descent (SGD) for parameter optimization, and graphical processing units (GPUs) and other computer hardware for massively parallel optimization. Each of these pieces is essential for the success of large-scale GPML. Although they are commonly associated with deep learning and DNNs (especially SGD and GPUs), these tools have developed in the context of many different ML algorithms. The rise of DNNs over alternative ML modeling schemes is partly due to the fact that, through trial and error, ML researchers have discovered that neural network models are especially well suited to engineering within the context of these available tools (LeCun et al. 1998).

Out-of-sample validation is a basic idea: you choose the best model specification by comparing predictions from models estimated on data that was not used during the model "training" (fitting). This can be formalized as a cross-validation routine: you split the data into K "folds," and then K times fit the model on all data but the kth fold and evaluate its predictive performance (e.g., mean squared error or misclassification rate) on the left-out fold. The model with optimal average OOS performance (e.g., minimum error rate) is then deployed in practice.
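As a concrete illustration of this routine, here is a minimal sketch in Python with NumPy, using ordinary least squares as a stand-in for the model being validated; the data and the choice of K are made up.

    # Minimal K-fold cross-validation sketch: fit on K-1 folds, score on the
    # held-out fold, and average the out-of-sample (OOS) error across folds.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, K = 200, 3, 5
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

    folds = np.array_split(rng.permutation(n), K)      # random fold assignment
    oos_errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # model "training"
        oos_errors.append(np.mean((y[test] - X[test] @ beta) ** 2))  # OOS error

    print("average OOS mean squared error:", np.mean(oos_errors))

In practice you would run this loop once per candidate specification and deploy the specification with the best average OOS error.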

Machine learning's wholesale adoption of OOS validation as the arbitrator of model quality has freed the ML engineer from the need to theorize about model quality. Of course, this can create frustration and delays when you have nothing other than "guess-and-test" as a method for model selection. But, increasingly, the requisite model search is not being executed by humans: it is done by additional ML routines. This either happens explicitly, in AutoML frameworks (Feurer et al. 2015) that use simple auxiliary ML to predict the OOS performance of the more complex target model, or implicitly, by adding flexibility to the target model (e.g., making the tuning parameters part of the optimization objective). The fact that OOS validation provides a clear target to optimize against (a target which, unlike the in-sample likelihood, does not incentivize over-fit) facilitates automated model tuning. It removes humans from the process of adapting models to specific data sets.
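A minimal sketch of that idea, with a single ridge penalty playing the role of the tuning parameter and a simple holdout split standing in for full cross-validation (Python/NumPy; the data and penalty grid are made up):

    # Minimal automated-tuning sketch: pick the ridge penalty whose held-out
    # (OOS) error is smallest, rather than reasoning about it by hand.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)
    train, test = np.arange(150), np.arange(150, 200)   # holdout split

    def ridge_fit(X, y, lam):
        """Closed-form ridge regression coefficients."""
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    grid = [0.01, 0.1, 1.0, 10.0, 100.0]
    oos = {lam: np.mean((y[test] - X[test] @ ridge_fit(X[train], y[train], lam)) ** 2)
           for lam in grid}
    print("chosen penalty:", min(oos, key=oos.get))

Note that the in-sample error would always favor the smallest penalty; it is the OOS error that makes the search well posed.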

Stochastic gradient descent optimization will be less familiar to most readers, but it is a crucial part of GPML. This class of algorithms allows models to be fit to data that is only observed in small chunks: you can train the model on a stream of data and avoid having to do batch computations on the entire data set. This lets you estimate complex models on massive data sets. For subtle reasons, the engineering of SGD algorithms also tends to encourage robust and generalizable model fits (i.e., use of SGD discourages over-fit). We cover these algorithms in detail in a dedicated section.
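The following minimal sketch (Python/NumPy; toy data, hand-picked step size and batch size) shows the defining feature: the model only ever touches a small minibatch of observations per update, never the full data set.

    # Minimal stochastic gradient descent sketch: linear regression with squared
    # error, updated one random minibatch at a time.
    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 10_000, 5
    X = rng.normal(size=(n, p))
    true_beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
    y = X @ true_beta + rng.normal(size=n)

    beta = np.zeros(p)                 # parameters to be learned
    step, batch_size = 0.01, 32
    for t in range(2_000):
        idx = rng.integers(0, n, size=batch_size)         # sample a minibatch
        resid = y[idx] - X[idx] @ beta
        grad = -2 * X[idx].T @ resid / batch_size         # minibatch gradient
        beta -= step * grad                               # one SGD update

    print(np.round(beta, 2))           # close to true_beta, with no full-batch pass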

Finally, the GPUs: specialized computer processors have made massive-scale ML a reality, and continued hardware innovation will help push AI to new domains. Deep neural network training with stochastic gradient descent involves massively parallel computations: many basic operations executed simultaneously across parameters of the network. Graphical processing units were devised for calculations of this type, in the context of video and computer graphics display, where all pixels of an image need to be rendered simultaneously, in parallel. Although DNN training was originally a side use case for GPUs (i.e., an aside from their main computer graphics mandate), AI applications are now of primary importance for GPU manufacturers. Nvidia, for example, is a GPU company whose rise in market value has been driven by the rise of AI.

The technology here is not standing still. The GPUs are getting faster and cheaper every day. We are also seeing the deployment of new chips that have been designed from scratch for ML optimization. For example, field-programmable gate arrays (FPGAs) are being used by Microsoft and Amazon in their data centers. These chips allow precision requirements to be set dynamically, thus efficiently allocating resources to high-precision operations and saving compute effort where you only need a few decimal points (e.g., in early optimization updates to the DNN parameters). As another example, Google's Tensor Processing Units (TPUs) are specifically designed for algebra with "tensors," a mathematical object that occurs commonly in ML.4

4. A tensor is a multidimensional extension of a matrix; that is, a matrix is another name for a two-dimensional tensor.
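For readers who have not met the term, a minimal illustration of the footnote's point, using NumPy arrays as a stand-in (TPUs themselves are programmed through higher-level frameworks):

    # A tensor is just a multidimensional array: a vector is a 1-D tensor,
    # a matrix is a 2-D tensor, and image or text data stacks naturally into more.
    import numpy as np

    vector = np.zeros(4)               # 1-D tensor, shape (4,)
    matrix = np.zeros((4, 3))          # 2-D tensor, shape (4, 3)
    images = np.zeros((32, 28, 28))    # 3-D tensor: 32 grayscale 28-by-28 images
    print(vector.ndim, matrix.ndim, images.ndim)   # 1 2 3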

One of the hallmarks of a general purpose technology is that it leads to broad industrial changes, both above and below where that technology lives in the supply chain. This is what we are observing with the new general purpose ML. Below, we see that chip makers are changing the type of hardware they create to suit these DNN-based AI systems. Above, GPML has led to a new class of ML-driven AI products. As we seek more real-world AI capabilities (self-driving cars, conversational business agents, intelligent economic marketplaces), domain experts in these areas will need to find ways to resolve their complex questions into structures of ML tasks. This is a role that economists and business professionals should embrace, where the increasingly user-friendly GPML routines become basic tools of their trade.

2.4 Deep Learning

We have stated that deep neural networks are a key tool in GPML, but what exactly are they? And what makes them deep? In this section we will give a high-level overview of these models. This is not a user guide. For that, we recommend the excellent recent textbook by Goodfellow, Bengio, and Courville (2016). This is a rapidly evolving area of research, and new types of neural network models and estimation algorithms are being developed at a steady clip. The excitement in this area, and considerable media and business hype, makes it difficult to keep track. Moreover, the tendency of ML companies and academics to proclaim every incremental change as "completely brand new" has led to a messy literature that is tough for newcomers to navigate. But there is a general structure to deep learning, and a hype-free understanding of this structure should give you insight into the reasons for its success.

Fig. 2.3 A five-layer network
Source: Adapted from Nielsen (2015).

Neural networks are simple models. Indeed, their simplicity is a strength: basic patterns facilitate fast training and computation. The model has linear combinations of inputs that are passed through nonlinear activation functions called nodes (or, in reference to the human brain, neurons). A set of nodes taking different weighted sums of the same inputs is called a "layer," and the output of one layer's nodes becomes input to the next layer. This structure is illustrated in figure 2.3. Each circle here is a node. Those in the input (farthest left) layer typically have a special structure; they are either raw data or data that has been processed through an additional set of layers (e.g., convolutions, as we will describe). The output layer gives your predictions. In a simple regression setting, this output could just be ŷ, the predicted value for some random variable y, but DNNs can be used to predict all sorts of high-dimensional objects. As it is for nodes in input layers, output nodes also tend to take application-specific forms.

Nodes in the interior of the network have a "classical" neural network structure. Say that τ_hk(·) is the kth node in interior layer h. This node takes as input a weighted combination of the output of the nodes in the previous layer of the network, layer h − 1, and applies a nonlinear transformation to yield the output. For example, the ReLU (for "rectified linear unit") node is by far the most common functional form used today; it simply outputs the maximum of its input and zero, as shown in figure 2.4.5 Say z^{h−1}_{ij} is the output of node j in layer h − 1 for observation i. Then the corresponding output for the kth node in the hth layer can be written

(1)   z^h_{ik} = τ_hk(ω_h′ z^{h−1}_i) = max(0, Σ_j ω_hj z^{h−1}_{ij}),

where ω_hj are the network weights. For a given network architecture (the structure of nodes and layers), these weights are the parameters that are updated during network training.

5. In the 1990s, people spent much effort choosing among different node transformation functions. More recently, the consensus is that you can just use a simple and computationally convenient transformation (like ReLU). If you have enough nodes and layers, the specific transformation doesn't really matter, so long as it is nonlinear.
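To make equation (1) concrete, here is a minimal sketch of a forward pass through a few ReLU layers (Python/NumPy). The layer sizes are arbitrary and the weights are random rather than trained, and a real implementation would also include intercept ("bias") terms.

    # Minimal sketch of equation (1): each layer applies max(0, .) to weighted
    # combinations of the previous layer's outputs.
    import numpy as np

    rng = np.random.default_rng(3)

    def relu_layer(z_prev, W):
        """z^h = max(0, z^{h-1} W), applied to every observation (row) of z_prev."""
        return np.maximum(0.0, z_prev @ W)

    n_obs, layer_sizes = 5, [4, 8, 8, 1]               # input, two interior layers, output
    weights = [rng.normal(size=(layer_sizes[h], layer_sizes[h + 1]))
               for h in range(len(layer_sizes) - 1)]

    z = rng.normal(size=(n_obs, layer_sizes[0]))       # input layer: raw data
    for W in weights[:-1]:
        z = relu_layer(z, W)                           # interior ReLU layers
    y_hat = z @ weights[-1]                            # linear output layer: predictions
    print(y_hat.ravel())

Training, covered in the stochastic gradient descent discussion, consists of updating the entries of these weight matrices so that the predictions match observed outcomes.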

Neural networks have a long history. Work on these types of models dates back to the mid-twentieth century, including, for example, Rosenblatt's Perceptron (Rosenblatt 1958). This early work was focused on networks as models that could mimic the actual structure of the human brain. In the late 1980s, advances in algorithms for training neural networks (Rumelhart et al. 1988) opened the potential for these models to act as general pattern-recognition tools rather than as a toy model of the brain. This led to a boom in neural network research, and methods developed during the 1990s are at the foundation of much of deep learning today (Hochreiter and Schmidhuber 1997; LeCun et al. 1998). However, this boom ended in bust. Due to the gap between promised and realized results (and enduring difficulties in training networks on massive data sets), from the late 1990s neural networks became just one ML method among many. In applications they were supplanted by more robust tools such as Random Forests, high-dimensional regularized regression, and a variety of Bayesian stochastic process models.

Fig. 2.4 The ReLU function

In the 1990s, one tended to add network complexity by adding width. A couple of layers (e.g., a single hidden layer was common) with a large number of nodes in each layer were used to approximate complex functions. Researchers had established that such "wide" learning could approximate arbitrary functions (Hornik, Stinchcombe, and White 1989) if you were able to train on enough data. The problem, however, was that this turns out to be an inefficient way to learn from data. The wide networks are very flexible, but they need a ton of data to tame this flexibility. In this way, the wide nets resemble traditional nonparametric statistical models like series and kernel estimators. Indeed, near the end of the 1990s, Radford Neal showed that certain neural networks converge toward Gaussian Processes, a classical statistical regression model, as the number of nodes in a single layer grows toward infinity (Neal 2012). It seemed reasonable to conclude that neural networks were just clunky versions of more transparent statistical models.

What changed? A bunch of things. Two nonmethodological events are of primary importance: we got much more data (big data), and computing hardware became much more efficient (GPUs). But there was also a crucial methodological development: networks went deep. This breakthrough is often credited to 2006 work by Geoff Hinton and coauthors (Hinton, Osindero, and Teh 2006) on a network architecture that stacked many pretrained layers together for a handwriting recognition task. In this pretraining, interior layers of the network are fit using an unsupervised learning task (i.e., dimension reduction of the inputs) before being used as part of the supervised learning machinery. The idea is analogous to that of principal components regression: you first fit a low-dimensional representation of x, then use that low-D representation to predict some associated y. Hinton and colleagues' scheme allowed researchers to train deeper networks than was previously possible.
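For readers who know principal components regression only by name, here is a minimal sketch of that two-step logic (Python/NumPy; the data and the choice of two components are made up): an unsupervised step compresses x, and a supervised step regresses y on the compressed representation.

    # Minimal principal components regression sketch: PCA (via the SVD) builds a
    # low-dimensional representation of x; OLS then predicts y from it.
    import numpy as np

    rng = np.random.default_rng(4)
    n, p, n_components = 300, 20, 2
    X = rng.normal(size=(n, p))
    y = X[:, :2] @ np.array([1.5, -1.0]) + rng.normal(size=n)

    Xc = X - X.mean(axis=0)                             # unsupervised step: center X and
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # take its top principal components
    scores = Xc @ Vt[:n_components].T                   # low-D representation of x

    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)   # supervised step
    print("coefficients on the components:", np.round(gamma, 2))

In the pretraining scheme described above, the dimension reduction is itself performed by network layers rather than by the SVD, but the compress-then-predict logic is the same.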

This specific type of unsupervised pretraining is no longer viewed as central to deep learning. However, Hinton, Osindero, and Teh's (2006) paper opened many people's eyes to the potential for deep neural networks: models with many layers, each of which may have different structure and play a very different role in the overall machinery. That is, a demonstration that one could train deep networks soon turned into a realization that one should add depth to models. In the following years, research groups began to show empirically and theoretically that depth was important for learning efficiently from data (Bengio et al. 2007). The modularity of a deep network is key: each layer of functional structure plays a specific role, and you can swap out layers like Lego blocks when moving across data applications. This allows for fast application-specific model development, and also for transfer learning across models: an internal layer from a network that has been trained for one type of image recognition problem can be used to hot-start a new network for a different computer vision task.
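As a minimal illustration of this hot-start idea, continuing the NumPy sketches above rather than using a real deep learning framework: keep ("freeze") the interior-layer weights learned on one task, and estimate only a new output layer for the next task. The pretrained weights below are random placeholders.

    # Minimal transfer-learning sketch: reuse frozen interior layers trained on
    # task A as a feature map, and fit only the output layer for task B.
    import numpy as np

    rng = np.random.default_rng(5)

    def relu(x):
        return np.maximum(0.0, x)

    # Pretend these interior weights were learned on a large task-A data set.
    W1_pretrained = rng.normal(size=(10, 16))
    W2_pretrained = rng.normal(size=(16, 16))

    def features(X):
        """Frozen interior layers: map raw inputs to the learned representation."""
        return relu(relu(X @ W1_pretrained) @ W2_pretrained)

    # Task B: a small data set and a new target; only the output layer is estimated.
    X_b = rng.normal(size=(50, 10))
    y_b = rng.normal(size=50)
    Z = features(X_b)                                   # reuse the pretrained layers
    w_out, *_ = np.linalg.lstsq(Z, y_b, rcond=None)     # fit just the new output layer
    print("task-B predictions:", np.round(Z[:3] @ w_out, 2))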

Deep learning came into the ML mainstream with a 2012 paper by Krizhevsky, Sutskever, and Hinton (2012) that showed their DNN was able to smash current performance benchmarks in the well-known ImageNet

 
