But if the sandwich contains chicken and cheese, it should do this instead:
Let’s look at how things are hooked up inside the black box.
First, let’s make it simple. We hook up all the inputs (all the ingredients) to our single output. To get our deliciousness rating, we add each ingredient’s contribution. Clearly each ingredient should not contribute equally—the presence of cheese would make the sandwich more delicious, while the presence of mud would make the sandwich less delicious. So each ingredient gets a different weight. The good ones get a weight of 1, while the ones we want to avoid get a weight of 0. Our neural network looks like this:
Let’s test it with some sample sandwiches. Suppose the sandwich contains mud and eggshells. Mud and eggshells both contribute a 0, so the deliciousness rating is 0 + 0 = 0.
But a peanut-butter-and-marshmallow sandwich will get a rating of 1 + 1 = 2. (Congratulations! You have been blessed with that New England delicacy, the fluffernutter.)
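If you like to see the wiring as code, here’s a minimal sketch of this one-layer network in Python (the ingredient list is abbreviated, and the dictionary-of-weights representation is just one convenient way to write it down):

```python
# A one-layer sandwich rater: the deliciousness score is just a
# weighted sum over whichever ingredients are present.

# Good ingredients get a weight of 1; ones we want to avoid get 0.
# (Abbreviated list, for illustration.)
weights = {
    "cheese": 1, "chicken": 1, "peanut butter": 1, "marshmallow": 1,
    "mud": 0, "eggshells": 0,
}

def rate_sandwich(ingredients):
    """Add up each present ingredient's weight."""
    return sum(weights.get(i, 0) for i in ingredients)

print(rate_sandwich(["mud", "eggshells"]))              # 0 + 0 = 0
print(rate_sandwich(["peanut butter", "marshmallow"]))  # 1 + 1 = 2, a fluffernutter
```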
With this neural network configuration, we successfully avoid all the sandwiches that contain only eggshells, mud, and other inedible things. But this simple one-layer neural network is not sophisticated enough to recognize that some ingredients, while delicious on their own, are not delicious in combination with certain others. It’s going to rate a chicken-and-marshmallow sandwich as delicious, the equal of the fluffernutter. It’s also susceptible to something we’ll call the big sandwich bug: a sandwich that contains mulch might still be rated as tasty if it contains enough good ingredients to cancel out the mulch.
To get a better neural network, we’re going to need another layer of cells.
Here’s our neural network now. Each ingredient is connected to our new layer of cells, and each cell is connected to the output. This new layer is called a hidden layer, because the user only sees the inputs and the outputs. Just as before, each connection has its own weight, so it affects our final deliciousness output in different ways. This isn’t deep learning yet (that would require even more layers), but we’re getting there.
DEEP LEARNING
Adding hidden layers to our neural network gets us a more sophisticated algorithm, one that’s able to judge sandwiches as more than the sum of their ingredients. In this chapter, we’ve only added one hidden layer, but real-world neural networks often have several. Each new layer means a new way to combine the insights from the previous layer—at higher and higher levels of complexity, we hope. This approach—lots of hidden layers for lots of complexity—is known as deep learning.
With this neural network, we can finally avoid bad ingredients by connecting them to a cell that we’ll call the punisher. We’ll give that cell a huge negative weight (let’s say –100) and connect everything bad to it with a weight of 10. Let’s make the first cell the punisher and connect the mud and eggshells to it. Here’s what that looks like:
Now, no matter what happens in the other cells, a sandwich is likely to fail if it contains eggshells or mud. Using the punisher cell, we can beat the big sandwich bug.
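In code, the punisher might look something like this (the weights of 10 and –100 come straight from the text; the division of ingredients into good and bad sets is a simplification for illustration):

```python
# A two-layer rater with a "punisher" hidden cell. Bad ingredients
# feed the punisher with weight 10, and the punisher feeds the final
# output with weight -100.

GOOD = {"cheese", "chicken", "peanut butter", "marshmallow"}  # illustrative sets
BAD = {"mud", "eggshells"}

def rate_sandwich(ingredients):
    # Hidden layer: the punisher totals up the bad ingredients.
    punisher = sum(10 for i in ingredients if i in BAD)
    # Output layer: each good ingredient contributes 1, and the
    # punisher's activity is multiplied by its huge negative weight.
    goodness = sum(1 for i in ingredients if i in GOOD)
    return goodness + (-100 * punisher)

# No pile of good ingredients can outrun a single scoop of mud:
print(rate_sandwich(["cheese", "chicken", "marshmallow", "mud"]))  # 3 - 1000 = -997
```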
We can do other things with the rest of the cells—like finally make a neural network that knows which ingredient combos work. Let’s use the second cell to recognize chicken-and-cheese-type sandwiches. We’ll refer to it as the deli sandwich cell. We connect chicken and cheese to it with weights of 1 (we’ll also do this with ham and turkey and mayo) and connect everything else to it with weights of 0. And this cell gets connected to the output with a modest weight of 1. The deli sandwich cell is a good thing, but if we get too excited about it and assign it a very high weight, we’ll be in danger of making the punisher cell less powerful. Let’s look at what this cell does.
A chicken-and-cheese sandwich will cause this cell to contribute a cheerful 1 + 1 = 2 to the final output. But adding marshmallow to the chicken-and-cheese sandwich doesn’t hurt it at all, even though it makes a pretty objectively less delicious sandwich. To fix that, we’ll need other cells that specifically look for and punish incompatibilities.
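Here’s the deli sandwich cell by itself, as a sketch (the weights are the ones described above; the exact deli ingredient list is whatever we choose to wire in):

```python
# The "deli sandwich" hidden cell: deli-ish ingredients connect to it
# with weight 1, everything else effectively with weight 0, and the
# cell connects to the output with a modest weight of 1.

DELI = {"chicken", "cheese", "ham", "turkey", "mayo"}

def deli_cell(ingredients):
    return sum(1 for i in ingredients if i in DELI)

print(deli_cell(["chicken", "cheese"]))                 # a cheerful 1 + 1 = 2
print(deli_cell(["chicken", "cheese", "marshmallow"]))  # still 2: marshmallow slips by
```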
Cell 3, for example, might look for the chicken-marshmallow combination (let’s call it the cluckerfluffer) and severely punish any sandwich that contains it. It would be hooked up like this:
Cell 3 returns a devastating (10 + 10) × –100 = –2000 to any sandwich that dares to combine chicken and marshmallow. It’s acting like a very specialized punisher cell, designed specifically to punish chicken and marshmallow. Notice that I’ve shown an extra part of the cluckerfluffer cell here, called the activation function, because without it, the cell would punish any sandwich that contains chicken or marshmallow. With a threshold of 15, the activation function stops the cell from turning on when just chicken (10 points) or marshmallow (10 points) is present—it will return a neutral 0. But if both are present (10 + 10 = 20 points), the threshold of 15 is exceeded, and the cell turns on. Boom! The activated cell punishes any combination of ingredients that exceeds its threshold.
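In code, the cluckerfluffer with its activation function might look like this (the weights of 10 and –100 and the threshold of 15 are the ones from the text; the step-style activation is a simplification, since real networks usually use smoother functions):

```python
# The "cluckerfluffer" cell: chicken and marshmallow each feed it with
# weight 10, and it feeds the output with weight -100. Its activation
# function keeps it switched off unless its total input exceeds 15.

def cluckerfluffer(ingredients):
    # Weighted input: 10 points each for chicken and for marshmallow.
    total = sum(10 for i in ingredients if i in {"chicken", "marshmallow"})
    if total <= 15:
        return 0          # below threshold: the cell stays neutral
    return total * -100   # above threshold: the cell turns on and punishes

print(cluckerfluffer(["chicken"]))                 # 10 points: stays at 0
print(cluckerfluffer(["marshmallow"]))             # 10 points: stays at 0
print(cluckerfluffer(["chicken", "marshmallow"]))  # (10 + 10) * -100 = -2000
```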
With all the cells connected in similarly sophisticated configurations, we have a neural net that can sort out the best sandwiches the magic hole has to offer.
THE TRAINING PROCESS
So now we know what a well-configured sandwich-picking neural network might look like. But the point of using machine learning is that we don’t have to set up the neural network by hand. Instead, it should be able to configure itself into something that does a great sandwich-picking job. How does this training process work?
Let’s go back to our simple one-layer neural network. At the beginning of the training process, it’s starting completely from scratch, with random weights for each ingredient. Chances are it’s very, very bad at rating sandwiches.
We’ll need to train it with some real-world data—some examples of the correct way to rate a sandwich, as demonstrated by real humans. As the neural net rates each sandwich, it needs to compare its ratings against those of a panel of cooperative sandwich judges. Note: never volunteer to test the early stages of a machine learning algorithm.
For this example, we’ll go back to the very simple neural network. Remember, since we’re trying to train it from scratch, we’re ignoring all our prior knowledge about what the weights should be, and starting from random ones. Here they are:
It hates cheese. It loves marshmallow. It’s rather fond of mud. And it can take or leave eggshells.
The neural net looks at the first sandwich that pops out of the magic sandwich hole and, using its (terrible) judgment, gives it a score. It’s a marshmallow, eggshell, and mud sandwich, so it gets a score of 10 + 0 + 2 = 12. Wow! That’s a really, really great score!
It presents the sandwich to the panel of human judges. Harsh reality: it’s not a popular sandwich.
Now comes the part where the neural net has a chance to improve: it looks at what would have happened if its weights were slightly different. From this one sandwich, it doesn’t know what the problem is. Was it too excited about the marshmallow? Are eggshells not neutral but maybe even a teensy bit bad? It can’t tell. But if it looks at a batch of ten sandwiches, the scores it gave them, and the scores the human judges gave them, it can discover that if it had in general given mud a lower weight, lowering the score of any sandwich that contains mud, its scores would match those of the human judges a bit better.
With its newly adjusted weights, it’s time for another iteration. The neural net rates another bunch of sandwiches, compares its scores against those of the human judges, and adjusts its weights again. After thousands more iterations and tens of thousands of sandwiches, the human judges are very, very sick of this, but the neural network is doing a lot better.
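Here’s a toy sketch of that whole loop, assuming made-up judge preferences and a crude nudge-and-check update (real training uses calculus to work out all the nudges at once, but the spirit is the same):

```python
import random

INGREDIENTS = ["cheese", "marshmallow", "mud", "eggshells"]

# The judges' tastes (invented for this sketch). The network never
# sees these directly; it only sees the scores they hand back.
JUDGE_WEIGHTS = {"cheese": 1, "marshmallow": 1, "mud": -1, "eggshells": -1}

def judge_score(sandwich):
    return sum(JUDGE_WEIGHTS[i] for i in sandwich)

def net_score(weights, sandwich):
    return sum(weights[i] for i in sandwich)

def batch_error(weights, batch):
    """How far the net's ratings are from the judges', over a batch."""
    return sum((net_score(weights, s) - judge_score(s)) ** 2 for s in batch)

# Starting completely from scratch: random weights, terrible judgment.
weights = {i: random.uniform(-2, 10) for i in INGREDIENTS}

for iteration in range(1000):
    # A batch of ten sandwiches from the magic hole.
    batch = [random.sample(INGREDIENTS, k=random.randint(1, 3))
             for _ in range(10)]
    # For each weight, ask: would a slightly different value have
    # matched the judges better? If so, keep the nudge.
    for i in INGREDIENTS:
        for nudge in (0.1, -0.1):
            trial = dict(weights)
            trial[i] += nudge
            if batch_error(trial, batch) < batch_error(weights, batch):
                weights = trial

print(weights)  # by now, close to the judges' weights
```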
There are plenty of pitfalls in the way of progress, though. As I mentioned above, this simple neural network only knows if particular ingredients are generally good or generally bad, and it isn’t able to come up with a nuanced idea of which combinations work. For that, it needs a more sophisticated structure, one with hidden layers of cells. It needs to evolve punishers and deli sandwich cells.
Another pitfall that we’ll have to be careful of is the issue of class imbalance. Remember that only a handful of every thousand sandwiches from the sandwich hole are delicious. Rather than go through all the trouble of figuring out how to weight each ingredient, or how to use them in combination, the neural net may realize it can achieve 99.9 percent accuracy by rating each sandwich as terrible, no matter what.
To combat class imbalance, we’ll need to prefilter our training sandwiches so that there are approximately equal proportions of sandwiches that are delicious and awful. Even then, the neural net might not learn about ingredients that are usually to be avoided but delicious in very specific circumstances. Marshmallow might be an example of such an ingredient—awful with most of the usual sandwich ingredients but delicious in a fluffernutter (and maybe with chocolate and bananas). If the neural net doesn’t see fluffernutters in training, or sees them very rarely, it may decide that it can achieve pretty good accuracy by rejecting anything that contains marshmallow.
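One way to do that prefiltering, sketched in code: oversample the rare delicious examples until the two classes show up in roughly equal numbers (the function name and the tiny data set are invented for illustration):

```python
import random

def balance_training_data(sandwiches, labels):
    """Oversample the rarer class so delicious and awful sandwiches
    appear in roughly equal numbers in the training set."""
    delicious = [s for s, ok in zip(sandwiches, labels) if ok]
    awful = [s for s, ok in zip(sandwiches, labels) if not ok]
    # Resample the smaller class (with replacement) up to the size of
    # the larger one.
    if len(delicious) < len(awful):
        delicious = random.choices(delicious, k=len(awful))
    else:
        awful = random.choices(awful, k=len(delicious))
    balanced = [(s, True) for s in delicious] + [(s, False) for s in awful]
    random.shuffle(balanced)
    return balanced

# One delicious sandwich in five (the real hole is far more lopsided):
data = [(["peanut butter", "marshmallow"], True), (["mud"], False),
        (["eggshells"], False), (["mud", "eggshells"], False),
        (["socks"], False)]
print(balance_training_data([s for s, ok in data], [ok for s, ok in data]))
```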
Class imbalance–related problems show up all the time in practical applications, usually when we ask AI to detect a rare event. When people try to predict when customers will leave a company, they have a lot more examples of customers who stay than customers who leave, so there’s a danger the AI will take the shortcut of deciding that all customers will stay forever. Detecting fraudulent logins and hacking attacks has a similar problem, since actual attacks are rare. People also report class imbalance problems in medical imaging, where they may be looking for just one abnormal cell among hundreds—the temptation is for the AI to shortcut its way to high accuracy just by predicting that all cells are healthy. Astronomers also run into class imbalance problems when they use AI, since many interesting celestial events are rare—there was a solar-flare-detecting program that discovered it could achieve near 100 percent accuracy by predicting zero solar flares, since these were very rare in the training data.2
WHEN CELLS WORK TOGETHER
In the sandwich-sorting example above, we saw how a layer of cells can increase the complexity of the tasks a neural network can perform. We built a deli sandwich cell that responded to combinations of deli meats and cheeses, and we built a cluckerfluffer cell that punished any sandwich that tried to use chicken and marshmallow in combination. But in a neural network that trains itself, using trial and error to adjust the connections between cells, it’s usually a lot harder to identify each cell’s job. Tasks tend to be spread among several cells—and in the case of some cells, it’s difficult or impossible to tell what tasks they accomplish.
To explore this phenomenon, let’s look at some of the cells of a fully trained neural net. Built and trained by researchers at OpenAI,3 this particular neural net looked at more than eighty-two million Amazon product reviews letter by letter and tried to predict which letter would come next. This is another recurrent neural network, the same general sort as the one that generated the knock-knock jokes, ice cream flavors, and recipes listed in chapters 1 and 2. This one’s larger—it has approximately as many neurons as a jellyfish. Here are a few examples of reviews it generated:
This is a great book that I would recommend to anyone who loves the great story of the characters and the series of books.
I love this song. I listen to it over and over again and never get tired of it. It is so addicting. I love it!!
This is the best product I have ever used to clean my shower stall. It is not greasy and does not strip the water of the water and stain the white carpet. I have been using it for a few years and it works well for me.
These workout DVDs are very useful. You can cover your whole butt with them.
I bought this thinking it would be good for the garage. Who has a lot of lake water? I was totally wrong. It was simple and fast. The night grizzly has not harmed it and we have had this for over 3 months. The guests are inspired and they really enjoy it. My dad loves it!
This particular neural net has an input for each letter or punctuation mark it could encounter (similar to the sandwich sorter, which had one input for each sandwich ingredient) and can look back at the past few letters and punctuation marks. (It is as if the sandwich rater’s scoring depended a bit on the last few sandwiches it had seen—maybe it can keep track of whether we might be sick of cheese sandwiches and adjust the next cheese sandwich’s rating accordingly.) And rather than having a single output, as the sandwich sorter does, the review-writing neural net has a lot of them, one output for each letter or punctuation mark that it could choose as most likely to come next in the review. If it sees the sequence “I own twenty eggbeaters and this is my very favorit,” then the letter e will be the most likely next choice.
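To make that input/output shape concrete, here’s a toy next-character predictor. It replaces the network’s layers of learned cells with simple counting over sample text, which is emphatically not how the real network works, but the task is the same: look at the last few characters, then score every character that could come next and pick the likeliest.

```python
from collections import Counter, defaultdict

training_text = "I own twenty eggbeaters and this is my very favorite"

CONTEXT = 3  # how many recent characters to look back at
counts = defaultdict(Counter)
for i in range(CONTEXT, len(training_text)):
    # Record which character followed each short run of characters.
    counts[training_text[i - CONTEXT:i]][training_text[i]] += 1

def predict_next(text):
    """Score every possible next character and return the likeliest."""
    scores = counts[text[-CONTEXT:]]
    return scores.most_common(1)[0][0] if scores else " "

print(predict_next("I own twenty eggbeaters and this is my very favorit"))  # 'e'
```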
Based on the outputs, we can take a look at each cell and see when it’s “active,” letting us make an educated guess about what its function is. In our sandwich-sorter example above, the deli sandwich cell would be active when it sees lots of meat and cheese and inactive when it sees socks or marbles or peanut butter. However, most of the neurons in the Amazon product-review neural net are going to be nowhere near as interpretable as deli cells and punisher neurons. Instead, most of the rules the neural net comes up with are going to be unintelligible to us. Sometimes we can guess what a cell’s function will be, but far more frequently, we have no idea what it’s doing.
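If you want a feel for how such probing works, here’s a sketch using a tiny recurrent layer with random (untrained) weights, so its cells’ jobs will be every bit as unintelligible as most real ones. None of this is the actual OpenAI network; it only shows the mechanics of recording one cell’s activation character by character:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16                                          # a tiny hidden layer
W_in = rng.normal(size=(HIDDEN, 256))                # one input column per byte
W_rec = rng.normal(size=(HIDDEN, HIDDEN)) / HIDDEN   # recurrent connections

def cell_activity(text, cell=7):
    """Run text through the recurrent layer, recording one chosen
    cell's activation at each character."""
    h = np.zeros(HIDDEN)
    trace = []
    for ch in text:
        x = np.zeros(256)
        x[ord(ch) % 256] = 1.0        # one input per character
        h = np.tanh(W_in @ x + W_rec @ h)
        trace.append(h[cell])
    return trace

# Mark each character active ("#") or inactive ("."), as in the figures.
text = "For me, this is one of the few albums"
print(text)
print("".join("#" if a > 0 else "." for a in cell_activity(text)))
```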
Here’s the activity of one of the product-review algorithm’s cells (the 2,387th) as it generates a review (white = active, dark = inactive):
For me, this is one of the few albums of theirs I own that actually made me an instant classic pop fan. I also had a major problem with the audio with 10 new songs; the execution of the vocals and editing was awful. The next day, I was in a recording studio and I can’t tell you how many times I had to hit the play button to see where the song was going.
This cell is contributing to the neural net’s prediction of which letters come next, but its function is mysterious. It’s reacting to certain letters, or certain combinations of letters, but not in a way that makes sense to us. Why was it really excited about the letters um in album but not the letters al? What is it actually doing? It’s just one small piece of the puzzle working with a lot of other cells. Almost all the cells in a neural net are as mysterious as this one.
However, every once in a while, there will be a cell whose job is recognizable—a cell that activates whenever we’re between a pair of parentheses or that activates increasingly strongly the longer a sentence gets.4 The people who trained the product-review neural net noticed that it had one cell that was doing something they could recognize: it was responding to whether the review was positive or negative. As part of its task of predicting the next letter in a review, the neural net seems to have decided it was important to determine whether to praise the product or trash it. Here’s the activation of the “sentiment neuron” on that same review. Note that a light color indicates high activation, which means it thinks the review is positive:
For me, this is one of the few albums of theirs I own that actually made me an instant classic pop fan. I also had a major problem with the audio with 10 new songs; the execution of the vocals and editing was awful. The next day, I was in a recording studio and I can’t tell you how many times I had to hit the “play” button to see where the song was going.
The review starts out very positive, and the sentiment neuron is highly activated. Midway through, however, it switches tone, and the cell’s activation level goes way down.
Here’s another example of the sentiment neuron at work. It has low activity when the review is neutral or critical but quickly swings into high gear whenever it detects a change in sentiment:
The Harry Potter File, from which the previous one was based (which means it has a standard size liner) weighs a ton and this one is huge! I will definitely put it on every toaster I have in the kitchen since, it is that good. This is one of the best comedy movies ever made. It is definitely my favorite movie of all time. I would recommend this to ANYONE!
But it’s less good at detecting sentiment in other kinds of text. Most people would not classify this passage from Edgar Allan Poe’s “The Fall of the House of Usher” as positive in sentiment, but this particular neural net thinks it’s mostly positive:
Overpowered by an intense sentiment of horror, unaccountable yet unendurable, I threw on my clothes with haste (for I felt that I should sleep no more during the night,) and endeavoured to arouse myself from the pitiable condition into which I had fallen, by pacing rapidly to and fro through the apartment.
I guess a movie could overpower you by an intense sentiment of horror and be a good movie if that’s what it was supposed to do.
Again, it’s unusual to find a cell in a text-generating or text-analyzing algorithm that behaves as transparently as the sentiment neuron. The same goes for other types of neural networks—and that’s too bad, since we’d love to be able to tell when they’re making unfortunate mistakes and to learn from their strategies.
In image-recognizing algorithms, though, it’s a bit easier to find cells whose jobs you can identify. There the inputs are the individual pixels of a particular image, and the outputs are the various possible ways to classify the image (dog, cat, giraffe, cockroach, and so on). Most image recognition algorithms have lots and lots of layers of cells in between—the hidden layers. And in most image recognition algorithms, there are cells or groups of cells whose functions we can identify if we analyze the neural net in the right way. We can look at the collections of cells that activate when they see particular things, or we can tweak the input image and see which changes make the cells activate most strongly.
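As a last sketch, here’s that pixel-tweaking idea in miniature, with a single stand-in “cell” whose weights are random rather than learned (so what it “wants to see” is meaningless here, but the probing procedure itself is genuine):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8))  # a stand-in cell looking at an 8x8 image

def cell_activation(image):
    return float(np.tanh((w * image).sum()))

# Tweak one pixel at a time and measure how much the cell's
# activation changes.
image = rng.uniform(size=(8, 8))
base = cell_activation(image)
effects = np.zeros((8, 8))
for i in range(8):
    for j in range(8):
        tweaked = image.copy()
        tweaked[i, j] += 0.1
        effects[i, j] = cell_activation(tweaked) - base

# The pixels whose brightening most excites the cell are what it
# "wants to see" more of.
i, j = np.unravel_index(effects.argmax(), effects.shape)
print(f"brightening pixel ({i}, {j}) excites this cell the most")
```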