purpose ML (GPML) that we reference in the rightmost pillar of figure 2.1.
The first component of GPML is deep neural networks: models made up
of layers of nonlinear transformation node functions, where the output of each layer becomes input to the next layer in the network. We will describe
DNNs in more detail in our Deep Learning section, but for now it suffices
to say that they make it faster and easier than ever before to find patterns in
unstructured data. They are also highly modular. You can take a layer that
is optimized for one type of data (e.g., images) and combine it with other
layers for other types of data (e.g., text). You can also use layers that have
been pretrained on one data set (e.g., generic images) as components in a
more specialized model (e.g., a specific recognition task).
Specialized DNN architectures are responsible for the key GPML capability
of working on human-level data: video, audio, and text. This is essential
for AI because it allows these systems to be installed on top of the same
sources of knowledge that humans are able to digest. You don’t need to
create a new database system (or have an existing standard form) to feed
the AI; rather, the AI can live on top of the chaos of information generated
through business functions. This capability helps to illustrate why the new
AI, based on GPML, is so much more promising than previous attempts at
AI. Classical AI relied on hand-specified logic rules to mimic how a rational
human might approach a given problem (Haugeland 1985). This approach
is sometimes nostalgically referred to as GOFAI, or “good old-fashioned
AI.” The problem with GOFAI is obvious: solving human problems with
logic rules requires an impossibly complex cataloging of all possible scenarios
and actions. Even for systems able to learn from structured data, the
need to have an explicit and detailed data schema means that the system
designer must know in advance how to translate complex human tasks
into deterministic algorithms.
The new AI doesn’t have this limitation. For example, consider the
problem of creating a virtual agent that can answer customer questions
(e.g., “why won’t my computer start?”). A GOFAI system would be based
on hand-coded dialog trees: if a user says X, answer Y, and so forth. To install the system, you would need to have human engineers understand
and explicitly code for all of the main customer issues. In contrast, the new
ML-driven AI can simply ingest all of your existing customer-support logs
and learn to replicate how human agents have answered customer questions
in the past. The ML allows your system to infer support patterns from
the human conversations. The installation engineer just needs to start the
DNN-fitting routine.
This gets to the last bit of GPML that we highlight in figure 2.1, the tools
that facilitate model fitting on massive data sets: out-of-sample (OOS) validation
for model tuning, stochastic gradient descent (SGD) for parameter
optimization, and graphics processing units (GPUs) and other computer
hardware for massively parallel optimization. Each of these pieces is essential
for the success of large-scale GPML. Although they are commonly
associated with deep learning and DNNs (especially SGD and GPUs), these
tools have developed in the context of many different ML algorithms. The
rise of DNNs over alternative ML modeling schemes is partly due to the
fact that, through trial and error, ML researchers have discovered that neural
network models are especially well suited to engineering within the context
of these available tools (LeCun et al. 1998).
Out-of-sample validation is a basic idea: you choose the best model specification
by comparing how well candidate models predict data that was
not used during the model “training” (fitting). This can be formalized as a
cross-validation routine: you split the data into K “folds,” and then, for each k = 1, ..., K, fit the model on all data but the kth fold and evaluate its predictive performance (e.g., mean squared error or misclassification rate) on the left-out
fold. The model with optimal average OOS performance (e.g., minimum
error rate) is then deployed in practice.
Machine learning’s wholesale adoption of OOS validation as the arbiter
of model quality has freed the ML engineer from the need to theorize
about model quality. Of course, this can create frustration and delays when
you have nothing other than “guess-and-test” as a method for model selection.
But, increasingly, the requisite model search is not being executed
by humans: it is done by additional ML routines. This either happens explicitly,
in AutoML (Feurer et al. 2015) frameworks that use simple auxiliary
ML to predict OOS performance of the more complex target model, or
implicitly by adding flexibility to the target model (e.g., making the tuning
parameters part of the optimization objective). The fact that OOS validation
provides a clear target to optimize against (a target which, unlike
the in-sample likelihood, does not incentivize over-fit) facilitates automated
model tuning. It removes humans from the process of adapting models to
specific data sets.
Stochastic gradient descent optimization will be less familiar to most
readers, but it is a crucial part of GPML. This class of algorithms allows
models to be fit to data that is only observed in small chunks: you can train
the model on a stream of data and avoid having to do batch computations on the entire data set. This lets you estimate complex models on massive data
sets. For subtle reasons, the engineering of SGD algorithms also tends to
encourage robust and generalizable model fits (i.e., use of SGD discourages
over-fit). We cover these algorithms in detail in a dedicated section.
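To make the idea of fitting on a stream of small chunks concrete, here is a minimal sketch of minibatch SGD for a linear least-squares model. The simulated stream, the fixed learning rate, and the function names are assumptions for illustration; practical SGD routines add refinements (step-size schedules, momentum, and so on) that are not shown.

```python
import numpy as np

def sgd_linear(batches, dim, lr=0.05):
    """Fit least-squares weights by SGD over a stream of (X, y) minibatches."""
    w = np.zeros(dim)
    for X, y in batches:
        # Gradient of the mean squared error on this minibatch only.
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def stream(n_batches, batch_size, w_true, rng):
    """Hypothetical data stream: yields one small chunk at a time."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, len(w_true)))
        y = X @ w_true + rng.normal(scale=0.1, size=batch_size)
        yield X, y

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
w_hat = sgd_linear(stream(2000, 32, w_true, rng), dim=3)
```

Note that the full data set (64,000 observations here) is never held in memory at once; each update touches only the current 32-row chunk.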
Finally, the GPUs: specialized computer processors have made massive-scale
ML a reality, and continued hardware innovation will help push AI to
new domains. Deep neural network training with stochastic gradient descent
involves massively parallel computations: many basic operations executed
simultaneously across parameters of the network. Graphics processing
units were devised for calculations of this type, in the context of video and
computer graphics display where all pixels of an image need to be rendered
simultaneously, in parallel. Although DNN training was originally a side use
case for GPUs (i.e., as an aside from their main computer graphics mandate),
AI applications are now of primary importance for GPU manufacturers.
Nvidia, for example, is a GPU company whose rise in market value has been
driven by the rise of AI.
The technology here is not standing still. The GPUs are getting faster
and cheaper every day. We are also seeing the deployment of new chips
that have been designed from scratch for ML optimization. For example,
field-programmable gate arrays (FPGAs) are being used by Microsoft and
Amazon in their data centers. These chips allow precision requirements
to be set dynamically, thus efficiently allocating resources to high-precision
operations and saving compute effort where you only need a few decimal
places (e.g., in early optimization updates to the DNN parameters). As another
example, Google’s Tensor Processing Units (TPUs) are specifically
designed for algebra with “tensors,” a mathematical object that occurs commonly
in ML.⁴
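For readers unfamiliar with the term, a small sketch may help: in array libraries a tensor is simply an array with some number of dimensions, and a matrix is the two-dimensional case. The shapes below (a batch of 32 color images of 64 by 64 pixels) and the channel-mixing operation are made-up examples, not anything specific to TPUs.

```python
import numpy as np

matrix = np.zeros((3, 4))                 # a two-dimensional tensor
image_batch = np.ones((32, 64, 64, 3))    # images x height x width x color channels

# Tensor algebra: apply the same 3-to-2 channel mixing to every pixel of every image.
mix = np.full((3, 2), 0.5)
mixed = np.einsum("bhwc,cd->bhwd", image_batch, mix)
```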
One of the hallmarks of a general purpose technology is that it leads
to broad industrial changes, both above and below where that technology
lives in the supply chain. This is what we are observing with the new general
purpose ML. Below, we see that chip makers are changing the type of hardware
they create to suit these DNN-based AI systems. Above, GPML has
led to a new class of ML-driven AI products. As we seek more real-world
AI capabilities (self-driving cars, conversational business agents, intelligent
economic marketplaces), domain experts in these areas will need to find
ways to resolve their complex questions into structures of ML tasks. This is
a role that economists and business professionals should embrace, where the
increasingly user-friendly GPML routines become basic tools of their trade.
2.4 Deep Learning
We have stated that deep neural networks are a key tool in GPML, but
what exactly are they? And what makes them deep? In this section we will
give a high-level overview of these models. This is not a user guide. For that,
we recommend the excellent recent textbook by Goodfellow, Bengio, and
Courville (2016). This is a rapidly evolving area of research, and new types
of neural network models and estimation algorithms are being developed
at a steady clip. The excitement in this area, and considerable media and
business hype, makes it difficult to keep track. Moreover, the tendency of
ML companies and academics to proclaim every incremental change as
“completely brand new” has led to a messy literature that is tough for newcomers
to navigate. But there is a general structure to deep learning, and a
hype-free understanding of this structure should give you insight into the
reasons for its success.

4. A tensor is a multidimensional extension of a matrix; that is, a matrix is another name for a two-dimensional tensor.

Fig. 2.3 A five-layer network
Neural networks are simple models. Indeed, their simplicity is a strength:
basic patterns facilitate fast training and computation. The model has linear
combinations of inputs that are passed through nonlinear activation functions
called nodes (or, in reference to the human brain, neurons). A set of
nodes taking different weighted sums of the same inputs is called a “layer,”
and the output of one layer’s nodes becomes input to the next layer. This
structure is illustrated in figure 2.3. Each circle here is a node. Those in the
input (farthest left) layer typically have a special structure; they are either
raw data or data that has been processed through an additional set of layers
(e.g., convolutions as we will describe). The output layer gives your predictions.
In a simple regression setting, this output could just be ŷ, the predicted value for some random variable y, but DNNs can be used to predict all sorts
of high-dimensional objects. As with nodes in input layers, output nodes
also tend to take application-specific forms.
Nodes in the interior of the network have a “classical” neural network
structure. Say that $\eta_{hk}(\cdot)$ is the kth node in interior layer h. This node takes
as input a weighted combination of the output of the nodes in the previous
layer of the network, layer h – 1, and applies a nonlinear transformation to yield the output. For example, the ReLU (for “rectified linear unit”) node is
by far the most common functional form used today; it simply outputs the
maximum of its input and zero, as shown in figure 2.4.⁵ Say $z_{ij}^{h-1}$ is the output of
node j in layer h – 1 for observation i. Then the corresponding output for the kth node in the hth layer can be written

\[
z_{ik}^{h} = \eta_{hk}\left(\omega_h' z_i^{h-1}\right) = \max\left(0, \sum_j \omega_{hj} z_{ij}^{h-1}\right),
\]

where $\omega_{hj}$ are the network weights. For a given network architecture (the
structure of nodes and layers), these weights are the parameters that are
updated during network training.

5. In the 1990s, people spent much effort choosing among different node transformation functions. More recently, the consensus is that you can just use a simple and computationally convenient transformation (like ReLU). If you have enough nodes and layers the specific transformation doesn’t really matter, so long as it is nonlinear.
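The interior-node computation described above (a weighted sum of the previous layer's outputs, passed through a ReLU) can be sketched for a whole layer at once. The matrix layout, in which column k of `W` holds the weights feeding node k, and the tiny input values are assumptions chosen for this example.

```python
import numpy as np

def relu(v):
    """Rectified linear unit: the maximum of the input and zero, elementwise."""
    return np.maximum(0.0, v)

def layer_forward(Z_prev, W):
    """Outputs of one interior layer.

    Z_prev: (n_obs, n_prev) outputs from layer h-1, one row per observation i.
    W:      (n_prev, n_nodes) weights; column k feeds node k of layer h.
    """
    return relu(Z_prev @ W)

Z_prev = np.array([[1.0, -2.0],
                   [0.5,  3.0]])
W = np.array([[1.0, -1.0],
              [1.0,  0.5]])
Z = layer_forward(Z_prev, W)
# Row 1: weighted sums are -1 and -2, so both nodes output 0.
# Row 2: weighted sums are 3.5 and 1.0, which pass through unchanged.
```

Stacking several such calls, with the output of one used as `Z_prev` for the next, gives the layered structure of a deep network.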
Neural networks have a long history. Work on these types of models dates
back to the mid-twentieth century, including, for example, Rosenblatt’s Perceptron
(Rosenblatt 1958). This early work was focused on networks as
models that could mimic the actual structure of the human brain. In the
late 1980s, advances in algorithms for training neural networks (Rumelhart
et al. 1988) opened the potential for these models to act as general pattern-recognition
tools rather than as a toy model of the brain. This led to a boom
in neural network research, and methods developed during the 1990s are at
the foundation of much of deep learning today (Hochreiter and Schmidhuber
1997; LeCun et al. 1998). However, this boom ended in bust. Due to
the gap between promised and realized results (and enduring difficulties in
training networks on massive data sets), from the late 1990s neural networks
became just one ML method among many. In applications they were supplanted
by more robust tools such as Random Forests, high-dimensional
regularized regression, and a variety of Bayesian stochastic process models.
In the 1990s, one tended to add network complexity by adding width.
A couple of layers (e.g., a single hidden layer was common) with a large
number of nodes in each layer were used to approximate complex functions.
Fig. 2.4 The ReLU function
Researchers had established that such “wide” learning could approximate
arbitrary functions (Hornik, Stinchcombe, and White 1989) if you were able
to train on enough data. The problem, however, was that this turns out to
be an inefficient way to learn from data. The wide networks are very flexible,
but they need a ton of data to tame this flexibility. In this way, the wide nets
resemble traditional nonparametric statistical models like series and kernel
estimators. Indeed, near the end of the 1990s, Radford Neal showed that
certain neural networks converge toward Gaussian Processes, a classical
statistical regression model, as the number of nodes in a single layer grows
toward infinity (Neal 2012). It seemed reasonable to conclude that neural
networks were just clunky versions of more transparent statistical models.
What changed? A bunch of things. Two nonmethodological events are
of primary importance: we got much more data (big data) and computing
hardware became much more efficient (GPUs). But there was also a crucial
methodological development: networks went deep. This breakthrough
is often credited to 2006 work by Geoff Hinton and coauthors (Hinton,
Osindero, and Teh 2006) on a network architecture that stacked many pretrained
layers together for a handwriting recognition task. In this pretraining,
interior layers of the network are fit using an unsupervised learning task (i.e., dimension reduction of the inputs) before being used as part of the
supervised learning machinery. The idea is analogous to that of principal
components regression: you first fit a low-dimensional representation of
x, then use that low-dimensional representation to predict some associated y. Hinton and colleagues’ scheme allowed researchers to train deeper networks than
was previously possible.
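The principal components regression analogy can be sketched in a few lines: an unsupervised step compresses x, and a supervised step then predicts y from the compressed representation. This illustrates the analogy only, not Hinton, Osindero, and Teh's pretraining procedure; the simulated factor structure and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 500, 20, 2

# Simulated inputs driven by d latent factors; y depends only on those factors.
F = rng.normal(size=(n, d))
X = F @ rng.normal(size=(d, p)) + 0.1 * rng.normal(size=(n, p))
y = F @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

# Step 1 (unsupervised): low-dimensional representation of x via principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:d].T  # (n, d) principal component scores

# Step 2 (supervised): regress y on the low-dimensional scores.
A = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = 1.0 - np.sum((y - A @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
```

Step 1 never looks at y, which is what makes it analogous to unsupervised pretraining of interior layers.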
This specific type of unsupervised pretraining is no longer viewed as central
to deep learning. However, Hinton, Osindero, and Teh’s (2006) paper
opened many people’s eyes to the potential for deep neural networks: models
with many layers, each of which may have different structure and play
a very different role in the overall machinery. That is, a demonstration that
one could train deep networks soon turned into a realization that one should add depth to models. In the following years, research groups began to show
empirically and theoretically that depth was important for learning efficiently
from data (Bengio et al. 2007). The modularity of a deep network
is key: each layer of functional structure plays a specific role, and you can
swap out layers like Lego blocks when moving across data applications. This
allows for fast application-specific model development, and also for transfer
learning across models: an internal layer from a network that has been
trained for one type of image recognition problem can be used to hot-start
a new network for a different computer vision task.
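A toy sketch of this kind of layer reuse follows: a hidden layer's weights are taken as given (standing in for a layer trained on an earlier task), held fixed, and only a new output layer is fit on the new task's data. The random "pretrained" weights, the simulated new task, and the least-squares output layer are all assumptions made to keep the example self-contained; real transfer learning would typically reuse layers from an actual trained network and might also fine-tune them.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

rng = np.random.default_rng(0)

# Stand-in for a hidden layer trained on an earlier task (hypothetical weights).
W_pretrained = rng.normal(size=(10, 16))

# New task (simulated): the target depends on the same hidden features.
X_new = rng.normal(size=(200, 10))
H = relu(X_new @ W_pretrained)  # reuse the layer as a frozen block
y_new = H @ rng.normal(size=16) + 0.1 * rng.normal(size=200)

# Fit only a new output layer on top of the reused features.
w_out, *_ = np.linalg.lstsq(H, y_new, rcond=None)
r2 = 1.0 - np.sum((y_new - H @ w_out) ** 2) / np.sum((y_new - y_new.mean()) ** 2)
```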
Deep learning came into the ML mainstream with a 2012 paper by
Krizhevsky, Sutskever, and Hinton (2012) that showed their DNN was able
to smash current performance benchmarks in the well-known ImageNet