You Look Like a Thing and I Love You

by Janelle Shane


  DEEP DREAMING

  Tweaking an image to make the neurons more excited about it is the technique used to make the famous Google DeepDream images where an image-identifying neural network turned ordinary images into landscapes full of trippy dog faces and fantastic conglomerations of arches and windows.

  To make a DeepDream image, you start with a neural network that has been trained to recognize something—dogs, for example. Then you choose one of its cells and gradually change the image to make that cell increasingly more excited about it. If the cell is trained to recognize dog faces, then it will get more excited the more it sees areas in the image that look like dog faces. By the time you’ve changed the image to the cell’s liking, it will be highly distorted and covered in dogs.
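The core trick can be sketched in a few lines. Here is a toy illustration (mine, not the book's, and nothing like the real DeepDream pipeline in scale): a single linear "cell" whose excitement is a dot product between the image and a pattern it responds to, and a gradient-ascent loop that nudges the image toward whatever excites the cell most.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dog-face cell": its excitement is just the dot product of the
# image with a pattern it has learned. (A real network is far deeper;
# this single linear cell is only a stand-in.)
pattern = rng.normal(size=100)   # the pattern the cell responds to
image = rng.normal(size=100)     # a random starting "image", flattened

def excitement(img):
    return float(pattern @ img)

before = excitement(image)

# Gradient ascent: repeatedly nudge the image in the direction that
# most increases the cell's activation. For a linear cell, that
# direction is simply the pattern itself.
for _ in range(50):
    image = image + 0.1 * pattern

after = excitement(image)
print(round(before, 2), "->", round(after, 2))  # the excitement goes up
```

In the real technique the gradient is computed through many layers of a trained network, but the loop is the same: measure the cell's excitement, change the image a little in the direction that increases it, repeat.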

  The smallest groups of cells seem to look for edges, colors, and very simple textures. They might report vertical lines, curves, or green grassy textures. In subsequent layers, larger groups of cells look for collections of edges, colors, and textures or for simple features. Some researchers at Google, for example, analyzed their GoogLeNet image recognition algorithm and found that it had several collections of cells that were looking specifically for floppy versus pointy ears on animals, which helped it distinguish dogs from cats.5 Other cells got excited about fur or eyeballs.

  Image-generating neural networks also have some cells that do identifiable jobs. We can do “brain surgery” on image-generating neural networks, removing certain cells to see how the generated image changes.6 A group at MIT found that it could deactivate cells to remove elements from generated images. Interestingly, elements that the neural net deemed “essential” were more difficult to remove than others—for example, it was easier to remove curtains from an image of a conference room than to remove the tables and chairs.

  Now let’s look at another kind of algorithm, one you’ve probably interacted with directly if you’ve used the predictive-text feature of a smartphone.

  MARKOV CHAINS

  A Markov chain is an algorithm that can tackle many of the same problems as the recurrent neural network (RNN) that generated the recipes, ice cream flavors, Amazon reviews, and metal bands in this book. Like the RNN, it looks at what happened in the past (words previously used in a sentence or last week’s weather, for example) and predicts what’s most likely to happen next.

  Markov chains are more lightweight than most neural networks and quicker to train. That’s why the predictive-text function of smartphones is usually a Markov chain rather than an RNN.

  However, a Markov chain gets exponentially more unwieldy as its memory increases. Most predictive-text Markov chains, for example, have memories that are only three to five words long. RNNs, by contrast, can have memories that are hundreds of words long—or even longer with the use of LSTM (long short-term memory) and convolution tricks. In chapter 2 we saw how important memory length is when short memory made an RNN lose track of important information. The same is true for Markov chains.
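The "exponentially more unwieldy" part is easy to quantify: with a vocabulary of V words and an n-word memory, there are V^n possible contexts the chain might need statistics for. A quick back-of-the-envelope sketch (the 10,000-word vocabulary is an assumed round number; in practice a chain only stores the contexts it actually sees in training, but the table still balloons as the memory grows):

```python
vocabulary_size = 10_000   # an assumed round number for illustration

for memory_length in (3, 5, 10):
    # Every distinct n-word context could, in principle, need its own
    # next-word statistics.
    contexts = vocabulary_size ** memory_length
    print(f"{memory_length}-word memory: up to {contexts:.1e} possible contexts")
```

Going from a three-word to a five-word memory multiplies the possible contexts by a factor of 100 million, which is why predictive-text chains keep their memories so short.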

  I trained a Markov chain with a dataset of Disney songs using a trainable predictive-text keyboard.7 Training took only a few seconds as opposed to a few minutes for an RNN. But this Markov chain has a three-word memory. That is, the words it suggests are the ones it thinks are the most likely based on the previous three words in the song. When I had it generate a song, choosing only its top suggestion at every step, here is what it produced:

  The sea)

  under the sea)

  under the sea)

  under the sea)

  under the sea)

  under the sea)

  under the sea)

  It doesn’t know how many times to sing “under the sea” because it doesn’t know how many times it has already sung it.
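The setup is simple enough to sketch end to end. This is my own minimal version (the lyrics are an invented stand-in for the real Disney dataset): count which word follows each three-word context, then greedily pick the most common next word — and watch it loop.

```python
from collections import Counter, defaultdict

def train_markov(text, memory=3):
    """Count which word follows each `memory`-word context."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - memory):
        table[tuple(words[i:i + memory])][words[i + memory]] += 1
    return table

def generate(table, seed, steps=12):
    """Always take the single most probable next word (greedy)."""
    memory = len(seed)
    words = list(seed)
    for _ in range(steps):
        context = tuple(words[-memory:])
        if context not in table:
            break
        words.append(table[context].most_common(1)[0][0])
    return " ".join(words)

# An invented chorus standing in for the real Disney lyrics dataset
lyrics = ("under the sea under the sea darling it's better "
          "down where it's wetter under the sea")
song = generate(train_markov(lyrics), ("under", "the", "sea"))
print(song)   # loops: under the sea under the sea under the sea ...
```

Because "under the sea" is most often followed by "under," the chain falls into exactly the cycle shown above: its three-word window can never see how many repetitions have already happened.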

  If I start it out with the beginning of the song “Beauty and the Beast” (“Tale as Old as Time”), it quickly gets stuck again.

  Tale as old as time

  song as old as time

  song as old as time

  song as old as time

  In several verses of “Beauty and the Beast,” the words “tale as old as time” are immediately followed by the words “song as old as rhyme.” But when this Markov chain is looking at the phrase “as old as,” it doesn’t know which of those two verses it’s in the middle of writing.

  I can get it out of its trap by choosing the second most probable word at every step. Then it writes this:

  A whole world

  bright young master

  you’re with all

  ya think you’re by wonder

  by the powers

  and i got downhearted

  alone hellfire dark side

  And choosing the third most probable word each time:

  You think i can open up

  where we’ll see how you feel

  it all my dreams will be mine

  is something there before

  she will be better time

  These are a lot more interesting, but they don’t make much sense. And songs—and poetry—are pretty forgiving when it comes to grammar, structure, and coherence. If I give the Markov chain a different dataset to learn, then its shortcomings become even more obvious.

  Here’s a Markov chain trained on a list of April Fool’s Day pranks as it chooses the most probable next word at each step. (It never suggested punctuation, so the line breaks are my additions.)

  The door knob off a door and put it back on backwards softly

  Do nothing all day to a co of someone’s ad in the paper for a garage sale at someone of an impending prank

  Then do nothing all day to a co of someone’s ad in the paper for a garage sale at…

  A predictive-text Markov chain isn’t likely to hold a conversation with a customer or write a story that can be used as a new video-game quest (both of which are things that people are trying to get RNNs to do one day). But one thing it can do is suggest likely words that might come next in a particular training set.

  The people at Botnik, for example, use Markov chains trained on various datasets (Harry Potter books, Star Trek episodes, Yelp reviews, and more) to suggest words to human writers. The unexpected Markov chain suggestions often help the writers take their texts in weirdly surreal directions.

  Rather than allowing the Markov chain and its short memory to try to choose the next word, I can let it come up with a bunch of options and present them to me—just as predictive text does when I’m composing a text message to someone.
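A predictive keyboard is the same frequency table queried for its top few candidates instead of just the single best one. A minimal sketch, using an invented stand-in corpus:

```python
from collections import Counter, defaultdict

def suggestions(text, context, memory=2, k=3):
    """Top-k candidates for the next word, predictive-keyboard style."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - memory):
        table[tuple(words[i:i + memory])][words[i + memory]] += 1
    return [word for word, _ in table[tuple(context)].most_common(k)]

# An invented stand-in corpus
corpus = ("i am going home . i am going out . i am going home . "
          "i am so tired . i am going away .")
print(suggestions(corpus, ("am", "going")))   # → ['home', 'out', 'away']
```

The human writer picks from the ranked candidates, which is what keeps the Botnik collaborations coherent: the chain supplies statistically plausible words, and a person supplies the memory and judgment the chain lacks.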

  Here’s an example of what it looks like to interact with one of Botnik’s trained Markov chains, this one trained on Harry Potter books:

  And here are some new April Fool’s Day pranks I wrote with the help of the predictive text of a trained Markov chain:

  Put plastic wrap pellets on your lips.

  Arrange the kitchen sink into a chicken head.

  Put a glow stick in your hand and pretend to sneeze on the roof.

  Make a toilet seat into pants and then ask your car to pee.

  For the sake of comparison, I also used a more complex, data-intensive RNN to generate April Fool’s Day pranks. In this case, the RNN generated the entire prank, punctuation and all. However, there was still an element of human creativity involved—I had to sort through all the RNN-generated pranks looking for the funniest ones.

  Make a food in the office computer of someone.

  Hide all of the entrance to your office building if it only has one entrance.

  Putting googly eyes on someone’s computer mouse so that it won’t work.

  Set out a bowl filled with a mix of M&M’s, Skittles, and Reese’s Pieces.

  Place a pair of pants and shoes in your ice dispenser.

  You can conduct similar experiments with the predictive text included in most phone messaging apps. If you start with “I was born…” or “Once upon a time…” and keep clicking the phone’s suggested words, you’ll get a strange piece of writing straight from the innards of a machine learning algorithm. And because training a new Markov chain is relatively quick and easy, the text you get is specific to you. Your phone’s predictive text and autocorrect Markov chains update themselves as you type, training themselves on what you write. That’s why if you make a typo, it may haunt you for quite some time.

  Google Docs may have fallen victim to a similar effect when users reported its autocorrect would change “a lot” to “alot” and suggested “gonna” instead of “going.” Google was using a context-aware autocorrect that scanned the internet to decide which suggestions to make.8 On the plus side, a context-aware autocorrect is able to spot typos that form real words (like “gong” typed instead of “going”), and add new words as soon as they become common. However, as any user of the internet knows, common usage rarely dovetails with the grammatically “correct” formal usage you’d want in a word processor’s autocorrect feature. Although Google hasn’t talked specifically about these autocorrect bugs, the bugs do tend to disappear after users report them.

  RANDOM FORESTS

  A random forest algorithm is a type of machine learning algorithm frequently used for prediction and classification—predicting customer behavior, for example, or making book recommendations or judging the quality of a wine—based on a bunch of input data.

  To understand the forest, let’s start with the trees. A random forest algorithm is made of individual units called decision trees. A decision tree is basically a flowchart that leads to an outcome based on the information we have. And, pleasingly, decision trees do kind of look like upside-down trees.

  On the next page is a sample decision tree for, hypothetically, whether to evacuate a giant cockroach farm.

  The decision tree keeps track of how we use information (ominous noises, the presence of cockroaches) to make decisions about how to handle the situation. Just as our sandwich decisions become more sophisticated as the number of cells in our neural network increases, we can handle the cockroach situation with more nuance if we have a larger decision tree.
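In code, a decision tree is literally nested if/else questions. Here is a hand-built hypothetical version of the cockroach flowchart (the specific questions and answers are my inventions, sketched in the spirit of the book's example):

```python
def cockroach_decision(ominous_noises, roaches_escaped, roaches_visible):
    """A tiny hand-built decision tree: each branch asks one question."""
    if ominous_noises:
        if roaches_escaped:
            return "evacuate"
        return "investigate cautiously"
    if not roaches_visible:
        return "worry: why is it so quiet?"
    return "business as usual"

print(cockroach_decision(True, True, True))     # → evacuate
print(cockroach_decision(False, False, False))  # → worry: why is it so quiet?
```

Each question splits the situation into branches, and following the branches from the top leads to exactly one recommendation — which is why the flowchart drawing looks like an upside-down tree.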

  If the cockroach farm is strangely quiet, yet the roaches have not escaped, then there may be other explanations (perhaps even more unsettling) besides “they’re all dead.” With a larger tree we could ask whether there are dead cockroaches around, how smart the cockroaches are known to be, and whether the cockroach-crushing machines have been mysteriously sabotaged.

  With lots and lots of inputs and choices, the decision tree can become hugely complex (or, to use the programming parlance of deep learning, very deep). It could become so deep that it encompasses every possible input, decision, and outcome in the training set, but then the chart would only work for the specific situations from the training set. That is, it would overfit the training data. A human expert could cleverly construct a huge decision tree that avoids overfitting and can handle most decisions without fixating on specific, probably irrelevant data. For example, if it was cloudy and cool the last time the cockroaches got out, a human is smart enough to know that having the same weather doesn’t necessarily have anything to do with whether the cockroaches will escape again.

  But an alternative approach to having a human carefully build a huge decision tree is to use the random forest method of machine learning. In much the same way as a neural network uses trial and error to configure the connections between its cells, a random forest algorithm uses trial and error to configure itself. A random forest is made of a bunch of tiny (that is, shallow) trees that each consider a tiny bit of information to make a couple of small decisions. During the training process, each shallow tree learns which information to pay attention to and what the outcome should be. Each tiny tree’s decision probably won’t be very good, because it’s based on very limited information. But if all the tiny trees in the forest pool their decisions and vote on the final outcome, they will be much more accurate than any individual tree. (The same phenomenon holds true for human voters: if people try to guess how many marbles are in a jar, individually their guesses may be way off, but on average their guesses will likely be very close to the real answer.) The trees in a random forest can pool their decisions on all sorts of topics, coming up with an accurate picture of staggeringly complex scenarios. One recent application, for example, was sorting through hundreds of thousands of genomic patterns to determine which species of livestock was responsible for a dangerous E. coli outbreak.9
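The voting effect is easy to demonstrate with a deliberately constructed toy (this is not a real trained forest — real random forests learn their splits from bootstrapped samples of the data). Here, each "stump" sees only one clue and is right just 75 percent of the time, yet their pooled majority vote is right every time:

```python
# Eight situations described by three yes/no clues; in this constructed
# example, the "right answer" is whatever the majority of clues says.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [1 if a + b + c >= 2 else 0 for a, b, c in rows]

def stump(clue_index, row):
    return row[clue_index]                     # predict from one clue only

def forest(row):
    votes = [stump(i, row) for i in range(3)]  # pool the stumps' votes
    return 1 if sum(votes) >= 2 else 0

for i in range(3):
    acc = sum(stump(i, r) == y for r, y in zip(rows, labels)) / len(rows)
    print(f"stump {i} alone: {acc:.0%} correct")

forest_acc = sum(forest(r) == y for r, y in zip(rows, labels)) / len(rows)
print(f"forest vote:   {forest_acc:.0%} correct")
```

Each stump's errors happen on different rows, so when the three vote, their mistakes cancel out — the marble-jar effect in miniature.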

  If we used a random forest to handle the cockroach situation, here’s what a few of its trees might look like:

  Now, each individual tree is only seeing a very small bit of the situation. There may be a perfectly reasonable explanation for why Barney isn’t around—perhaps Barney has merely called in sick. And if the cockroaches have not actually eaten the super serum, that doesn’t necessarily mean we’re safe. Maybe the cockroaches have taken samples of the super serum and are even now brewing up a huge batch, enough for the 1.7 billion cockroaches in the facility.

  But the trees are combining their individual hunches, and with Barney mysteriously missing, the serum gone, and your password mysteriously changed, the decision to evacuate may be a prudent one.

  EVOLUTIONARY ALGORITHMS

  AI refines its understanding by making a guess about a good solution, then testing it. All three machine learning algorithms above use trial and error to refine their own structures, producing the configuration of neurons, chains, and trees that lets them best solve the problem. The simplest methods of trial and error are those in which you always travel in the direction of improvement—often called hill climbing if you’re trying to maximize a number (say, the number of points collected during a game of Super Mario Bros.) or gradient descent if you’re trying to minimize a number (like the number of escaped cockroaches). But this simple process of getting closer to your goal doesn’t always yield the best results. To visualize the pitfalls of simple hill climbing, imagine you’re somewhere on a mountain (in deep fog) and trying to find its highest point.

  If you use a simple hill-climbing algorithm, you’ll head uphill no matter what. But depending on where you start, you might end up stopping at a lower peak—a local maximum—rather than the highest peak, the global maximum.
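Here is a minimal sketch of that failure mode — a made-up two-peak "mountain" and a climber that only ever steps uphill:

```python
def height(x):
    # A foggy mountain with two peaks: a low one near x=2 (height 5)
    # and the true summit near x=8 (height 10).
    return max(5 - (x - 2) ** 2, 10 - (x - 8) ** 2, 0)

def hill_climb(x, step=0.5):
    """Keep stepping toward higher ground; stop when no neighbor is higher."""
    while True:
        best = max(x - step, x, x + step, key=height)
        if best == x:
            return x          # no uphill neighbor: we're stuck at a peak
        x = best

print(hill_climb(0.0))   # → 2.0  (stuck at the local maximum)
print(hill_climb(6.0))   # → 8.0  (reaches the global maximum)
```

Starting near the low peak, the climber dutifully reaches it and stops, never knowing a taller summit exists a little way off — exactly the trap the fog metaphor describes.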

  So there are more complex methods of trial and error designed to force you to try out more parts of the mountain, maybe doing a few test hikes in a few different directions before deciding where the most promising areas are. With those strategies, you might end up exploring the mountain more efficiently.

  In machine learning terms, the mountain is called your search space—somewhere in that space is your goal (that is, somewhere on the mountain is the peak), and you’re trying to find it. Some search spaces are convex, meaning that a basic hill-climbing algorithm will find you the peak each time. Other search spaces are much more annoying. The worst are the so-called needle-in-the-haystack problems, in which you might have very little clue how close you are to the best solution until the moment you stumble upon it. Searching for prime numbers is an example of a needle-in-the-haystack problem.

  The search space of a machine learning algorithm could be anything. For example, the search space could be the shapes of parts that make up a walking robot. Or it could be the set of possible weights of a neural network, and the “peak” is the weights that help you identify fingerprints or faces. Or the search space could be the set of possible configurations of a random forest algorithm, and your goal is to find a configuration that’s good at predicting a customer’s favorite books—or whether the cockroach factory should be evacuated.

  As we learned above, a basic search algorithm like hill climbing or gradient descent might not get you very far if the search space of possible neural net configurations is not very convex. So machine learning researchers sometimes turn to other, more complex trial-and-error methods.

  One of these strategies takes its inspiration from the process of evolution. It makes a lot of sense to imitate evolution—after all, what is evolution if not a generational process of “guess and check”? If a creature is different from its neighbors in some way that makes it more likely to survive and therefore reproduce, then it will be able to pass its useful traits on to the next generation. A fish that can swim a tiny bit faster than other individuals of its species may be more likely to escape predators, and after a few generations of this, its fast-swimming offspring may be a bit more common than the descendants of slower-swimming fish. And evolution is a powerful, powerful process—one that has solved countless locomotion and information-processing problems, figured out how to extract food from sunlight and from hydrothermal vents, and figured out how to glow, fly, and hide from predators by looking like bird dung.

  In evolutionary algorithms, each potential solution is like an organism. In each generation, the most successful solutions survive to reproduce, mutating or mating with other solutions to produce different—and, one hopes, better—children.
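A minimal sketch of such an algorithm on a toy problem — evolving a bit string toward all 1s (my own stand-in problem, not an example from the book). Each genome is an "organism," the fittest half survives to reproduce, and offspring are mutated copies of survivors:

```python
import random

random.seed(0)

GENOME_LEN = 20

def fitness(genome):
    return sum(genome)            # number of 1s: higher is fitter

def mutate(genome, rate=0.1):
    # Each bit in the offspring has a small chance of flipping
    return [1 - bit if random.random() < rate else bit for bit in genome]

# A starting population of random "organisms"
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(30)]
start_best = max(fitness(g) for g in population)

for generation in range(40):
    # Selection: the fittest half survives to reproduce
    population.sort(key=fitness, reverse=True)
    survivors = population[:15]
    # Elitism: the single best organism is carried over unchanged,
    # so the population's best fitness can never decrease
    population = survivors[:1] + [mutate(random.choice(survivors))
                                  for _ in range(29)]

end_best = max(fitness(g) for g in population)
print(start_best, "->", end_best)
```

This version mutates only; many evolutionary algorithms also "mate" pairs of survivors, splicing together parts of two parent solutions to produce children that combine their traits.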

  If you’ve ever struggled to solve a complex problem, it might be mind-boggling to think of each potential solution as a living being—eating, mating, whatever. But let’s think about it in concrete terms. Let’s say we’re trying to solve a crowd-control problem: we have a hallway that splits into a fork, and we want to design a robot that can direct people to take one hallway or the other.

 
