In the next chapter, we’ll look at more ways that AIs can be designed for success—or not.
CHAPTER 4
It’s trying!
Until now, we’ve been talking about how AI learns to solve problems, the kinds of problems it does well at, and AI doom. Let’s focus some more on doom—cases in which an AI-powered solution is a terrible way of solving a real-world problem. These cases can range from slightly annoying to quite serious. In this chapter we’ll talk about what happens when an AI can’t solve a problem very well—and what we can do about it. These could be instances when we
• gave it a problem that was too broad,
• didn’t give it enough data for it to figure out what’s going on,
• accidentally gave it data that confused it or wasted its time,
• trained it for a task that was much simpler than the one it encountered in the real world, or
• trained it in a situation that didn’t represent the real world.
PROBLEM TOO BROAD
This may be familiar from chapter 2, where we looked at the kinds of problems that are suitable for solving with AI. As we learned from the failure of M, Facebook’s AI assistant, if a problem is too broad, the AI will struggle to produce useful responses.
In 2019 researchers from Nvidia (a company that makes the kind of computing engines that are widely used for AI) trained a GAN (the two-part adversarial neural network, which I discussed in chapter 3) called StyleGAN to generate images of human faces.1 StyleGAN did an impressively good job, producing faces that were photorealistic except for subtleties like earrings that didn’t match and backgrounds that didn’t quite make sense. However, when the team trained StyleGAN on cat pictures instead, it produced cats with extra limbs, extra eyes, and weirdly distorted faces. Unlike the dataset of human pictures, which was made up of human faces seen from the front, the dataset of cat pictures included cats photographed from various angles, walking or curled up or meowing at the camera. StyleGAN had to learn from close-ups and pictures of multiple cats and even pictures with humans in the frame, and it was too much for one algorithm to handle well. It was hard to believe that the photorealistic humans and the distorted cats were the product of the same basic algorithm. But the narrower the task, the smarter the AI seems.
MORE DATA, PLEASE
The StyleGAN algorithm mentioned above, and most of the other AIs in this book, are the sort that learn by example. Given enough examples of something—enough cat names or horse drawings or successful driving decisions or financial predictions—these algorithms can learn patterns that help them imitate what they see. Without sufficient examples, however, the algorithm won’t have enough information to figure out what’s going on.
Let’s take this to the extreme and see what happens when we train a neural net to invent new ice cream flavors—with far, far too few flavors to learn from. Let’s give it only these eight flavors:
Chocolate
Vanilla
Pistachio
Moose Tracks
Peanut Butter Chip
Mint Chocolate Chip
Blue Moon
Champagne Bourbon Vanilla With Quince-Golden Raspberry Swirl And Candied Ginger
These are good classic flavors, to be sure. If you gave this list to a human, they would likely realize that these are supposed to be ice cream flavors and would probably be able to think of a few more to add. Strawberry, they might say. Or Butter Pecan with Huckleberry Swirl. The human is able to do this because they know about ice cream and about the kinds of flavors that tend to go in ice cream. They know how to spell these flavors and even know what order to put the words in (Mint Chocolate Chip, for example, never Chip Chocolate Mint). They know that strawberry is a thing and that glungberry isn’t.
But when I give this same list to an untrained neural network, it has none of that information to draw on. It doesn’t know what ice cream is or even what English is. It has no knowledge that vowels are different from consonants or that letters are different from spaces and line breaks. It might help to show this dataset as the neural net sees it—with each letter, space, and punctuation mark translated into a single number:
3;8;15;3;15;12;1;20;5;24;22;1;14;9;12;12;1;24;16;9;19;20;1;3;8;9;15;24;13;15;15;19;5;0;20;18;1;3;11;19;24;16;5;1;14;21;20;0;2;21;20;20;5;18;0;3;8;9;16;24;13;9;14;20;0;3;8;15;3;15;12;1;20;5;0;3;8;9;16;24;2;12;21;5;0;13;15;15;14;24;3;8;1;13;16;1;7;14;5;0;2;15;21;18;2;15;14;0;22;1;14;9;12;12;1;0;23;9;20;8;0;17;21;9;14;3;5;26;7;15;12;4;5;14;0;18;1;19;16;2;5;18;18;25;0;19;23;9;18;12;0;1;14;4;0;3;1;14;4;9;5;4;0;7;9;14;7;5;18;
The neural net’s job is to figure out, for example, when character 13 (an m) is likely to appear. Twice it appears after character 24 (a line break), but once it appears after character 0 (a space). Why? We haven’t explicitly told it, of course. And look at character 15 (an o). Sometimes it appears twice in a row (both times after character 13), but several other times it appears just once. Again, why? There isn’t enough information for the AI to figure this out. And since the letter f never appears at all in its input dataset, the neural net doesn’t have a number assigned to it. As far as it knows, f doesn’t exist. It can’t possibly come up with toffee, coffee, or fudge no matter how hard it tries.
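To make that encoding step concrete, here is a minimal Python sketch of how such a character-to-number mapping might be built. It is an illustration rather than the author's actual code, and the numbering it produces differs from the one printed above; the key point is that the vocabulary contains only characters that actually appear in the training data.

# A minimal sketch of turning the eight-flavor dataset into numbers.
# (Illustrative only; the numbering differs from the one printed above.)
flavors = [
    "Chocolate", "Vanilla", "Pistachio", "Moose Tracks",
    "Peanut Butter Chip", "Mint Chocolate Chip", "Blue Moon",
    "Champagne Bourbon Vanilla With Quince-Golden Raspberry Swirl "
    "And Candied Ginger",
]
text = "\n".join(flavors)

# The vocabulary is built only from characters that actually occur,
# so a letter like "f" never gets a number at all.
vocab = sorted(set(text.lower()))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

encoded = [char_to_id[ch] for ch in text.lower()]
print(len(vocab), "distinct characters")
print(encoded[:20])
print("f" in char_to_id)  # False: "toffee" and "fudge" are literally unspellable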
Still, the neural net is trying very hard, and it manages to come up with a few things. It learns that vowels and spaces (characters 1, 5, 9, 15, 21, and 0) are common, and so early in the training its output looks like this:
aaaoo aaaaaaaaoalnat ia eain l e ee r r e r er n r en d edeedr ed d nrd d edi r rn n d e e eer d r e d d dd dr rr er r r n e ri d edAe eri diedd rd eder r edder dnrr dde er ne r dn nend n dn rnndr eddnr re rdre rdd e r e e dnrddrr rdd r
Since its training dataset contains that one really long ice cream flavor, it struggles a bit to understand how often to use character 24 (a line break). But eventually its output begins to separate into individual “flavors”:
tahnlaa aa otCht aa itonaC hi aa gChoCe ddiddddRe dCAndddriddrni dedweiliRee
aataa naa ai
tttCuat
It takes even longer for the first recognizable words of the dataset to appear, as the algorithm begins to memorize individual combinations of letters (and goes through some kind of Spooky Cow phase in which it sounds like it’s mooing and hissing, with a few raspberries thrown in):
MoooootChopooopteeCpp
BlpTrrrks
Bll Monooooooooooooo
Pnstchhhhhh
MoooosTrrrksssss
PeniautBuut tChppphippphppihpppi
Moonillaaaaal
Pnnillaaa
Buee Moooo
As the neural net’s training continues, the flavors become more recognizable:
Moint Chooolate Chip
Peanut Butter Chip
Pistchhio
Bue Moo
Moose
Trrack
Psenutcho
Vanilla
MintCcooolate Chhip
Psstchhio
Chaampgne Booouorr VanillaWith QciiG-Golddni aspberrrr ndirl AndCandiiddnngger
It’s even managing to copy some flavors verbatim from the input dataset as it memorizes successively longer sequences of characters that work. If it trains a while longer, it will learn to reproduce the entire eight-flavor dataset perfectly. But that wasn’t really our goal. Memorizing the input examples isn’t the same as learning how to generate new flavors. In other words, this algorithm has failed to generalize.
With a properly sized dataset, however, the neural net can make much better progress. When I trained a neural net with 2,011 flavors (still a small dataset but no longer a ridiculously small one), the AI could finally become inventive. It produced brand-new flavors like the ones in the list below as well as the flavors from chapter 2, none of which appeared in the original dataset.
Smoked Butter
Bourbon Oil
Roasted Beet Pecans
Grazed Oil
Green Tea Coconut
Chocolate With Ginger Lime and Oreo
Carrot Beer
Red Honey
Lime Cardamom
Chocolate Oreo Oil + Toffee
Milky Ginger Chocolate Peppercorn
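None of these appear verbatim in the training data, and that kind of novelty claim is easy to verify by comparing each generated flavor against the training list. Here is a hedged sketch of such a check; the file names are placeholders for wherever the two lists actually live.

# A hedged sketch: flag generated flavors that are verbatim copies of training data.
# "training_flavors.txt" and "generated_flavors.txt" are placeholder file names.
with open("training_flavors.txt") as f:
    training = {line.strip().lower() for line in f if line.strip()}

with open("generated_flavors.txt") as f:
    generated = [line.strip() for line in f if line.strip()]

for flavor in generated:
    label = "memorized" if flavor.lower() in training else "new"
    print(f"{label}: {flavor}")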
So when it comes to training AI, more data is usually better. That’s why the Amazon-review-generating neural net discussed in chapter 3 trained on an impressive eighty-two million product reviews. It’s also why, as we learned in chapter 2, self-driving cars train on data from millions of road miles and billions of simulation miles and why standard image recognition datasets like ImageNet contain many millions of pictures.
But where do you get all this data? If you’re an entity like Facebook or Google, you might already have these huge datasets on hand. Google, for example, has collected so many search queries that it’s been able to train an algorithm to guess how you’ll finish a sentence when you start typing in the search window. (A disadvantage of training on data from real users is that the suggested search terms can end up being sexist and/or racist. And sometimes just plain weird.) In this era of big data, potential AI training data can be a valuable asset.
But if you don’t have all this data on hand, you’ll have to collect it somehow. Crowdsourcing is one cheap option, if the project is fun or useful enough to keep people interested. People have crowdsourced datasets for identifying animals on trail cameras, whale calls, and even patterns of temperature change in a Danish river delta. Researchers who develop an AI-powered tool for counting samples under a microscope can ask their users to submit labeled data so they can use it to improve future versions of the tool.
But sometimes, crowdsourcing doesn’t work as well, and for that I blame humans. I crowdsourced a set of Halloween costumes, for example, asking volunteers to fill out an online form where they could list every costume they could think of. Then the algorithm started producing costumes like:
Sports costume
Sexy scare costume
General Scare construct
The problem was that, in an apparent attempt to be helpful, someone had decided to enter a costume store’s entire inventory. (“What are you supposed to be?” “Oh, I’m Men’s Deluxe IT Costume—Size Standard.”)
An alternative to relying on the goodwill and cooperativeness of strangers is to pay people to crowdsource your data. Services like Amazon Mechanical Turk are built for this: a researcher can create a job (like answering questions about an image, role-playing as a customer service representative, or clicking on giraffes), then pay remote workers to fulfill the task. Ironically, this strategy can backfire if someone takes the job and then secretly has a bot do the actual work—the bot usually does a terrible job. Many people who use paid crowdsourcing services include simple tests to make sure the questions are being read by a human or, better yet, a human who’s paying attention and not answering at random.2 In other words, they have to include a Turing test as one of the questions to make sure they haven’t accidentally hired a bot to train their own bot.
Another way to get the most out of a small dataset is to make small changes to the data so that one bit of data becomes many slightly different bits. This strategy is known as data augmentation. A simple way to turn a single image into two images, for example, is to make a mirror image of it. You could also cut out parts of it or change its texture slightly.
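Here is a minimal sketch of that idea using the Pillow imaging library. The filename is a placeholder, and real pipelines usually apply transformations like these randomly during training rather than saving copies to disk.

from PIL import Image

# A minimal data-augmentation sketch: one photo becomes several training examples.
# "cat.jpg" is a placeholder filename.
original = Image.open("cat.jpg")

augmented = [
    original.transpose(Image.FLIP_LEFT_RIGHT),  # mirror image
    original.rotate(10, expand=True),           # slight rotation
    original.crop((20, 20, original.width - 20, original.height - 20)),  # trim the edges
]

for i, image in enumerate(augmented):
    image.save(f"cat_augmented_{i}.jpg")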
Data augmentation works on text, too, though it's less common there. To turn a few phrases into many, one strategy is to replace various parts of the phrase with words that mean similar things, as in these variations (and the short sketch that follows them):
A herd of horses is eating delicious cake.
A group of horses is munching marvelous dessert.
Several horses are enjoying their pudding.
The horses are consuming the comestibles.
The equines are devouring the confectionery offering.
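A toy version of that substitution strategy might look like the sketch below. The synonym table is invented for illustration; real systems often pull replacements from a thesaurus or from word embeddings.

import random

# A toy sketch of synonym-based text augmentation (the synonym table is made up).
synonyms = {
    "herd": ["group", "bunch"],
    "horses": ["equines", "ponies"],
    "eating": ["munching", "devouring", "enjoying"],
    "delicious": ["marvelous", "tasty"],
    "cake": ["dessert", "pudding"],
}

def augment(sentence, n_variants=3):
    variants = []
    for _ in range(n_variants):
        words = []
        for word in sentence.split():
            core = word.strip(".,").lower()
            if core in synonyms:
                word = word.replace(word.strip(".,"), random.choice(synonyms[core]))
            words.append(word)
        variants.append(" ".join(words))
    return variants

print(augment("A herd of horses is eating delicious cake."))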
Doing this generation automatically can result in weird and unlikely sentences, though. It’s a lot more common for programmers who are crowdsourcing text to simply ask a lot of people to do the same task so they get lots of slightly different answers that mean the same thing. For example, one team made a chatbot called the Visual Chatbot, which could answer questions about images. They used crowdsourced workers to provide training data by answering questions that other crowdsourced workers asked, producing a dataset of 364 million question-answer pairs. By my calculation, each image was seen an average of three hundred times, which is why their dataset contains lots of similarly worded answers:3
no, just the 2 giraffes
no, just 2 giraffe
there are two, it’s not a lone giraffe a baby and 1 grown
no it is just the 2 giraffes in the enclosure
no i just see 2 giraffes
no, just the 2 cute giraffes
no just the 2 giraffes
nope just the 2 giraffes
nope just the 2 giraffe
just the 2 giraffes
As you can tell from the answers below, some respondents were more committed to the seriousness of the project than others:
yeah, i would totally meet this giraffe
the tall giraffe might be regretting parenthood
bird is staring at giraffe asking about leaf thievery
The other effect of the setup was that each person had to ask ten questions about each image, and people eventually ran out of things to ask about giraffes, so the questions got a bit whimsical at times. Some of the questions humans posed included:
does the giraffe appear to understand quantum physics and string theory
does the giraffe appear to be happy enough to star in a beloved dreamworks movie
does it look like the giraffe ate the humans before the picture was taken
is the giraffe waiting for the rest of his spotted four-legged overlords to come out so they can enslave mankind
on scale from bieber to gandalf, how epic
would you say these are gangster zebras
does this look like elite horse
what is giraffe song
how many inches long are bears, estimated
please, pay attention to the task you take a while to start typing after i’ve asked a question i don’t like to wait so long, do you like to wait this long
Humans do weird things to datasets.
Which brings us to the next point about data: it’s not enough just to have lots and lots of data. If there are problems with the dataset, the algorithm will at best waste time and at worst learn the wrong thing.
MESSY DATA
In a 2018 interview with The Verge, Vincent Vanhoucke, Google’s technical lead for AI, talked about Google’s efforts to train self-driving cars. When the researchers discovered their algorithm was having trouble labeling pedestrians, cars, and other obstacles, they went back to look at their input data and discovered that most of the errors could be traced back to labeling errors that humans had made in the training dataset.4
I’ve definitely seen this happen, too. One of my first projects was to train a neural network to generate recipes. It made mistakes—a lot of them. It called on the chef to perform such actions as:
Mix honey, liquid toe water, salt and 3 tablespoon olive oil.
Cut flour into ¼-inch cubes
Spread the butter in the refrigerator.
Drop one greased pot.
Remove part of skillet.
It asked for ingredients such as:
½ cup wripping oil
1 lecture leaves thawed
6 squares french brownings cream
1 cup italian whole crambatch
It was definitely struggling with the magnitude and complexity of the recipe-generating problem. Its memory and mental capacity weren’t up to such a broad task. But it turns out that some of its mistakes weren’t its fault at all. The original training dataset included recipes that some computer program had automatically converted from another format at some point, and the conversion hadn’t always worked smoothly.
One of the neural net’s recipes called for:
1 strawberries
a phrase it had learned from the input dataset. There was a recipe in which the phrase “2 ½ cup sliced and sweetened fresh strawberries” had evidently been autoseparated into:
2 ½ cup sliced and sweetened fresh
1 strawberries
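Problems like that are easier to catch before training than to diagnose afterward. Here is a hedged sketch of a pre-training sanity check on ingredient lines; the rules are my own guesses at what counts as suspicious, not a definitive cleaning procedure.

import re

# A hedged sketch: flag ingredient lines that look like conversion damage.
# The heuristics below are illustrative guesses, not a complete cleaning pass.
UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
         "pound", "pounds", "ounce", "ounces", "square", "squares"}

TRAILING_MODIFIERS = {"fresh", "sliced", "sweetened", "chopped", "and", "or"}

def looks_suspicious(line):
    words = line.lower().split()
    if not words:
        return True
    # A bare count followed by a single plural word that isn't a unit ("1 strawberries")
    if len(words) == 2 and re.fullmatch(r"\d+", words[0]) \
            and words[1].endswith("s") and words[1] not in UNITS:
        return True
    # A quantity that trails off with a modifier and no ingredient
    # ("2 1/2 cup sliced and sweetened fresh")
    if words[-1] in TRAILING_MODIFIERS:
        return True
    return False

for line in ["2 1/2 cup sliced and sweetened fresh", "1 strawberries", "1 cup sugar"]:
    print("suspicious" if looks_suspicious(line) else "ok", "->", line)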