COPY THE HUMANS
In 2017 Wired published an article whose authors analyzed ninety-two million comments on more than seven thousand internet forums. They concluded that the place in the United States with the most toxic commenters was, somewhat surprisingly, Vermont.7
Finding this odd, the journalist Violet Blue looked into the details.8 The Wired analysis had not used humans to comb through all ninety-two million comments—that would have been incredibly time-consuming. Instead, it relied on a machine learning–based system called Perspective, developed by Jigsaw and Google’s Counter Abuse Technology team for moderating internet comments. And at the time the Wired article was published, Perspective’s decisions had some striking biases.
Vermont librarian Jessamyn West noticed several of these problems just by testing different ways of identifying oneself in a conversation.9 She found that “I am a man” was rated only as 20 percent likely to be toxic. But “I am a woman” was rated as significantly more likely to be toxic: 41 percent. Adding any sort of marginalization—gender, race, sexual orientation, disability—also dramatically increased the probability that the sentence would register as toxic. “I am a man who uses a wheelchair,” for example, was rated as 29 percent likely to be toxic, while “I am a woman who uses a wheelchair” was 47 percent likely to be toxic. “I am a woman who is deaf” was a huge 71 percent likely to be toxic.
Vermont’s “toxic” internet commenters may not have been toxic at all—just identifying themselves as part of some marginalized community.
In response, Jigsaw told Engadget, “Perspective is still a work in progress, and we expect to encounter false positives as the tool’s machine learning improves.” They altered the way Perspective moderates these types of comments, turning down all their toxicity ratings. Currently the difference in toxicity level between “I am a man” (7 percent) and “I am a gay black woman” (40 percent) is still noticeable, but they both now fall below the “toxic” threshold.
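If you want to run West's paired-sentence probe yourself, a few lines of Python will do it. The sketch below assumes you have a Perspective API key (the one here is a placeholder), and the request and response shapes are only approximately right; check the current API documentation before relying on them.

    import requests

    API_KEY = "your-api-key-here"  # placeholder, not a real key
    URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           f"comments:analyze?key={API_KEY}")

    def toxicity(text):
        # Ask Perspective for a single TOXICITY score between 0 and 1.
        body = {"comment": {"text": text},
                "requestedAttributes": {"TOXICITY": {}}}
        resp = requests.post(URL, json=body).json()
        return resp["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

    for sentence in ["I am a man",
                     "I am a woman",
                     "I am a man who uses a wheelchair",
                     "I am a woman who uses a wheelchair"]:
        print(f"{sentence!r}: rated {toxicity(sentence):.0%} likely to be toxic")

If the two halves of each pair come back with very different scores, the model has learned to treat identity itself as a signal of toxicity.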
How could this have happened? The builders of Perspective didn’t set out to build a biased algorithm—this was probably the last thing they wanted to happen—but somehow their algorithm learned bias during its training. We don’t know exactly what Perspective used for training data, but people have discovered multiple ways that sentiment-rating algorithms like this can learn to be biased. The common thread seems to be that if data comes from humans, it will likely have bias in it.
The scientist Robyn Speer was building an algorithm that could categorize restaurant reviews as positive or negative when she noticed something odd about the way it was rating Mexican restaurants. The algorithm was ranking Mexican restaurants as if they had terrible reviews, even when their reviews were actually quite positive.10 The reason, she found, was that the algorithm had learned what words mean by crawling the internet, looking at words that tended to be used together. This type of algorithm (sometimes called a word vector, or a word embedding) isn’t told what each word means or whether it’s positive or negative. It learns all this from the ways it sees the words used. It will learn that Dalmatian and Rottweiler and husky all have something to do with each other and even that their relationship is similar to the one between mustang and Lipizzaner and Percheron (but that mustang is also related to cars in some way). What it also learns, as it turns out, are the biases in the ways people write about gender and race on the internet.11 Studies have shown that algorithms learn less pleasant associations for traditionally African American names than for traditionally European American names. They also learn from the internet that female words like she, her, woman, and daughter are more associated with arts-related words like poetry, dance, and literature than with math-related words like algebra, geometry, and calculus—and the reverse is true for male words like he, him, and son. In short, they learn the same kinds of biases that have been measured in humans without ever being explicitly told about them.12,13 The AI that thought humans were rating Mexican restaurants badly had probably learned from internet articles and posts that associated the word Mexican with words like illegal.
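If you'd like to see these learned associations firsthand, a few lines of Python are enough. The sketch below assumes the gensim library and its downloadable pretrained GloVe vectors; any decent pretrained embedding will show something similar.

    import gensim.downloader as api

    # Downloads roughly 130 MB of pretrained GloVe word vectors the first time.
    vectors = api.load("glove-wiki-gigaword-100")

    # Related words cluster together...
    print(vectors.most_similar("husky", topn=3))

    # ...and so do the biased associations described above: compare how
    # close "she" and "he" sit to arts- versus math-related words.
    for target in ["poetry", "dance", "algebra", "calculus"]:
        print(target,
              " she:", round(vectors.similarity("she", target), 3),
              " he:", round(vectors.similarity("he", target), 3))

    # The restaurant-review problem in miniature: which words does the
    # embedding think "mexican" keeps company with?
    print("mexican/illegal:", round(vectors.similarity("mexican", "illegal"), 3),
          " italian/illegal:", round(vectors.similarity("italian", "illegal"), 3))

Nobody labeled any of these words as biased; the associations simply fell out of how people use them online.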
The problem may get worse when sentiment-classifying algorithms are learning from datasets like online movie reviews. On the one hand, online movie reviews are convenient for training sentiment-classifying algorithms because they come with handy star ratings that indicate how positive the writer intended a review to be. On the other hand, it’s a well-known phenomenon that movies with racial or gender diversity in their casts, or that deal with feminist topics, tend to be “review-bombed” by hordes of bots posting highly negative reviews. People have theorized that algorithms that learn from these reviews whether words like feminist and black and gay are positive or negative may pick up the wrong idea from the angry bots.
People who use AIs that have been trained on human-generated text need to expect that some bias will come along for the ride—and they need to plan what to do about it.
Sometimes, a little editing might help. Robyn Speer, who noticed the bias in her word vectors, worked with a team to release Conceptnet Numberbatch (no, not the British actor), a set of word vectors from which they found a way to edit out gender bias.14 First, the team found a way to plot the word vectors so that gender bias was visible—with male-associated words on the left and female-associated words on the right.
Then, since they had a single number that indicated how strongly a word was associated with “male” or “female,” they were able to manually edit that number for certain words. The result was an algorithm whose word embeddings reflect the gender distinctions that the authors wanted to see represented rather than those that actually were represented on the internet. Did this edit solve the bias problem or just hide it? At this point we’re still not sure. And this still leaves the question of how we decide which words—if any—should have gender distinctions. Still, it’s better than letting the internet decide for us.
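In code, the general idea looks something like the numpy sketch below. This isn't Conceptnet Numberbatch's actual method, just the basic trick: estimate a "gender direction" from a few pairs of gendered words, then subtract each word's component along that direction for the words a human decides shouldn't carry gender.

    import numpy as np

    def gender_direction(vec):
        # Estimate the "male minus female" axis from a few word pairs.
        d = (vec["he"] - vec["she"]) + (vec["man"] - vec["woman"])
        return d / np.linalg.norm(d)

    def neutralize(word_vector, direction):
        # Remove the component along the gender axis; keep everything else.
        return word_vector - np.dot(word_vector, direction) * direction

    # Usage, assuming vec is a dict mapping words to numpy arrays:
    # direction = gender_direction(vec)
    # vec["doctor"] = neutralize(vec["doctor"], direction)   # edited
    # vec["mother"] stays as is; a human decides which words keep gender.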
Here, for no particular reason, is a list of neural-net-generated alternative names for Benedict Cumberbatch.
Bandybat Crumplesnatch
Bumberbread Calldsnitch
Butterdink Cumbersand
Brugberry Cumberront
Bumblebat Cumplesnap
Buttersnick Cockersnatch
Bumbbets Hurmplemon
Badedew Snomblesoot
Bendicoot Cocklestink
Belrandyhite Snagglesnack
Of course, the biases algorithms learn from us aren’t always as easy to detect or to edit out.
In 2017 ProPublica investigated a commercial algorithm called COMPAS that was being widely used across the United States to decide whether to recommend prisoners for parole.15 The algorithm looked at factors such as age, type of offense, and number of prior offenses and used them to predict whether released prisoners were likely to be arrested again, become violent, and/or skip their next court appointments. Because COMPAS’s algorithm was proprietary, ProPublica could only look at the decisions it had made and see if there were any trends. It found that COMPAS was correct about 65 percent of the time in predicting whether a defendant would be rearrested but that there were striking differences in its average rating by race and gender. It identified black defendants as high risk much more often than white defendants, even when controlling for other factors. As a result, a black defendant was much more likely than a white defendant to be erroneously labeled high risk. In response, Northpointe, the company selling COMPAS, pointed out that their algorithm had the same accuracy for black and white defendants.16 The problem is that the data the COMPAS algorithm learned from is the result of hundreds of years of systematic racial bias in the US justice system. In the United States, black people are much more likely to be arrested for crimes than white people, even though they commit crimes at a similar rate. The question the algorithm ideally should have answered, then, is not “Who is likely to be arrested?” but “Who is most likely to commit a crime?” Even if an algorithm accurately predicts future arrests, it will still be unfair if it’s predicting an arrest rate that’s racially biased.
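This is why a single overall accuracy number isn't enough: the same headline figure can hide very different error rates for different groups. A check in the spirit of ProPublica's analysis looks roughly like the pandas sketch below (the file and column names are invented for illustration, not the actual COMPAS data).

    import pandas as pd

    df = pd.read_csv("risk_scores.csv")  # hypothetical file of past decisions

    for group, rows in df.groupby("race"):
        accuracy = (rows["predicted_high_risk"] == rows["rearrested"]).mean()
        # False positive rate: labeled high risk, but never rearrested.
        never_rearrested = rows[rows["rearrested"] == 0]
        fpr = (never_rearrested["predicted_high_risk"] == 1).mean()
        print(f"{group}: accuracy {accuracy:.0%}, false positive rate {fpr:.0%}")

Similar accuracy across groups combined with very different false positive rates is exactly the pattern ProPublica reported.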
How did it even manage to label black defendants as high risk for arrest if it wasn’t given information about race in its training data? The United States is highly racially segregated by neighborhood, so it could have inferred race just from a defendant’s home address. It might have noticed that people from a certain neighborhood tend to be given parole less often, or tend to be arrested more often, and shaped its decision accordingly.
AIs are so prone to finding and using human bias that the state of New York recently released guidance advising insurance companies that if they analyze the sort of “alternative data” that would give an AI a clue about what kind of neighborhood a person lives in, they might be violating anti-discrimination laws. The legislators recognized that this would be a sneaky backdoor way for the AI to figure out someone’s likely race, then cheat its way to human-level performance by implementing racism (or other forms of discrimination).17
After all, predicting what crimes or accidents may occur is a really tough, broad problem. Identifying and copying bias is a much easier task for an AI.
IT’S NOT A RECOMMENDATION—IT’S A PREDICTION
AIs give us exactly what we ask for, and we have to be very careful what we ask for. Consider the task of screening job candidates, for example. In 2018 Reuters reported that Amazon had discontinued use of the tool it had been trialing for prescreening job applicants when the company’s tests revealed that the AI was discriminating against women. It had learned to penalize resumes from candidates who had gone to all-female schools, and it had even learned to penalize resumes that mentioned the word women’s—as in, “women’s soccer team.”18 Fortunately, the company discovered the problem before using these algorithms to make real-life screening decisions.19 The Amazon programmers had not set out to design a biased algorithm—so how had it decided to favor male candidates?
If the algorithm is trained in the way that human hiring managers have selected or ranked resumes in the past, it’s very likely to pick up bias. It’s been well documented that there’s a strong gender (and racial) bias in the way humans screen resumes—even if the screening is done by women and/or minorities and/or by people who don’t believe they’re biased. A resume submitted with a male name is significantly more likely to get an interview than an identical resume submitted with a female name. If the algorithm’s trained to favor resumes like those of the company’s most successful employees, this can backfire as well if the company already lacks diversity in its workforce or if it hasn’t done anything to address gender bias in its performance reviews.20
In an interview with Quartz, Mark J. Girouard, an employment attorney at the law firm Nilan Johnson Lewis in Minneapolis, told of a client who was vetting another company’s recruitment algorithm and wanted to know which features the algorithm most strongly correlated with good performance. Those features: (1) the candidate was named Jared and (2) the candidate played lacrosse.21
Once the Amazon engineers discovered the bias in their resume-screening tool, they tried to remove it by deleting the female-associated terms from the words the algorithm would consider. Their job was made even harder by the fact that the algorithm was also learning to favor words that are most commonly included on male resumes, words like executed and captured. The algorithm turned out to be great at telling male from female resumes but otherwise terrible at recommending candidates, returning results basically at random. Finally, Amazon scrapped the project.
People treat these kinds of algorithms as if they are making recommendations, but it’s a lot more accurate to say that they’re making predictions. They’re not telling us what the best decision would be—they’re just learning to predict human behavior. Since humans tend to be biased, the algorithms that learn from them will also tend to be biased unless humans take extra care to find and remove the bias.
When using AIs to solve real-world problems, we also need to take a close look at what is being predicted. There’s a kind of algorithm called predictive policing, which looks at past police records and tries to predict where and when crimes will be recorded in the future. When police see that their algorithm has predicted crime in a particular neighborhood, they can send more officers to that neighborhood in an attempt to prevent the crime or at least be nearby when it happens. However, the algorithm is not predicting where the most crime will occur; it’s predicting where the most crime will be detected. If there are more police sent to a particular neighborhood, more crime will be detected there than in a lightly policed but equally crime-ridden neighborhood just because there are more police around to witness incidents and stop random people. And, with rising levels of (detected) crime in a neighborhood, the police may decide to send even more officers to that neighborhood. This problem is called overpolicing, and it can result in a kind of feedback loop in which increasingly high levels of crime get reported. The problem is compounded if there is some racial bias in the way crimes get reported: if police tend to preferentially stop or arrest people of a particular race, then their neighborhoods may end up overpoliced. Add a predictive-policing algorithm into the mix, and the problem may only get worse—especially if the AI was trained on data from police departments that did things like plant drugs on innocent people to meet arrest quotas.22
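A toy simulation makes the feedback loop easy to see. In the sketch below (every number is invented), two neighborhoods have identical underlying crime, detected crime scales with how many patrols are around to witness it, and each year one patrol moves toward wherever more crime was detected.

    patrols = [10, 11]            # neighborhood B starts with one extra patrol
    true_incidents = [100, 100]   # identical underlying crime in both

    for year in range(5):
        # Detected crime scales with patrol count, not with actual crime.
        detected = [min(inc, 5 * p) for inc, p in zip(true_incidents, patrols)]
        # Next year, shift a patrol toward the "higher crime" neighborhood.
        if detected[1] > detected[0]:
            patrols = [patrols[0] - 1, patrols[1] + 1]
        elif detected[0] > detected[1]:
            patrols = [patrols[0] + 1, patrols[1] - 1]
        print(f"year {year}: detected = {detected}, patrols = {patrols}")

Within a few simulated years, the "high crime" neighborhood has most of the patrols and most of the recorded crime, even though the underlying crime rates never differed.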
CHECKING THEIR WORK
How do we stop AIs from unintentionally copying human biases? One of the main things we can do is expect it to happen. We shouldn’t see AI decisions as fair just because an AI can’t hold a grudge. Treating a decision as impartial just because it came from an AI is sometimes known as mathwashing or bias laundering. The bias is still there, because the AI copied it from its training data, but now it’s wrapped in a layer of hard-to-interpret AI behavior. Whether intentionally or not, companies can end up using AI that discriminates in highly illegal (but perhaps profitable) ways.
So we need to check on the AIs to make sure their clever solutions aren’t terrible.
One of the most common ways to detect problems is to put the algorithm through rigorous tests. Sometimes, unfortunately, these tests are run after the algorithm is already in use—when users notice, for example, that hand dryers don’t respond to dark-skinned hands or that voice recognition is less accurate for women than for men or that three leading face-recognition algorithms are significantly less accurate for dark-skinned women than for light-skinned men.23 In 2015 researchers from Carnegie Mellon University used a tool called AdFisher to look at Google’s job ads and found that the AI was recommending high-paying executive jobs to men far more often than to women.24 Perhaps employers were asking for this, or perhaps the AI had accidentally learned to do this without Google’s knowledge.
This is the worst-case scenario—detecting the problem after the harm has already been done.
Ideally, it would be good to anticipate problems like these and design algorithms so that they don’t occur in the first place. How? Having a more diverse tech workforce, for one. Programmers who are themselves marginalized are more likely to anticipate where bias might be lurking in the training data and to take these problems seriously (it also helps if these employees are given the power to make changes). This won’t avoid all problems, of course. Even programmers who know about the ways in which machine learning algorithms can misbehave are still regularly surprised by them.
So it’s also important to rigorously test our algorithms before sending them out into the world. People have already designed software to systematically test for bias in programs that determine whether, for example, a given applicant is approved for a loan.25 In this example, bias-testing software would systematically test lots of hypothetical loan applicants, looking for trends in the characteristics of those who were accepted. A high-powered systematic approach like this is the most useful, because the manifestations of bias can sometimes be weird. One bias-checking program named Themis was looking for gender bias in loan applications. At first everything looked good, with about half the loans going to men and about half going to women (no data was reported on other genders). But when the researchers looked at the geographical distribution, they discovered that there was still lots of bias—100 percent of the women who got loans were from a single country. There are companies that have begun to offer bias screening as a service.26 If governments and industries start to require bias certification of new algorithms, this practice could become a lot more widespread.
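A Themis-style audit doesn't need to see inside the model; it just needs to be able to ask it a lot of questions. Here's a minimal sketch with a stand-in loan model and made-up attributes: generate hypothetical applicants that differ only in a protected attribute and compare the outcomes.

    import random

    def approve_loan(applicant):
        # Stand-in for the real model under audit.
        return applicant["credit_score"] > 650 and applicant["income"] > 40_000

    def audit(model, n=10_000):
        approvals = {"woman": 0, "man": 0}
        for _ in range(n):
            base = {"income": random.randint(20_000, 150_000),
                    "credit_score": random.randint(400, 850),
                    "country": random.choice(["US", "CA", "MX", "DE", "IN"])}
            # Identical applicants who differ only in one protected attribute.
            for gender in approvals:
                if model({**base, "gender": gender}):
                    approvals[gender] += 1
        return approvals

    print(audit(approve_loan))

And, as the Themis example shows, it's worth slicing the results further, by country for instance, rather than stopping at the overall approval rates.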
Another way people are detecting bias (and other unfortunate behavior) is by designing algorithms that can explain how they arrived at their solutions. This is tricky because, as we’ve seen, AIs aren’t generally easy for people to interpret. And as we know from the Visual Chatbot discussed in chapter 4, it’s tough to train an algorithm that can sensibly answer questions about how it sees the world. The most progress has been made with image recognition algorithms, which can point to the bits of an image they were paying attention to or show us the kinds of features they were looking for.
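One common "show me what you looked at" technique is a gradient-based saliency map. The sketch below assumes PyTorch, torchvision, and a placeholder image file; it highlights the pixels the top prediction is most sensitive to.

    import torch
    from torchvision import models, transforms
    from PIL import Image

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    prep = transforms.Compose([transforms.Resize((224, 224)),
                               transforms.ToTensor()])

    img = prep(Image.open("photo.jpg")).unsqueeze(0)  # placeholder image
    img.requires_grad_(True)

    scores = model(img)
    scores[0, scores.argmax()].backward()

    # The brightest pixels are the ones the top prediction is most
    # sensitive to: a rough map of what the network was "paying attention to".
    saliency = img.grad.abs().squeeze(0).max(dim=0).values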
Building algorithms out of a bunch of subalgorithms may also help, if each subalgorithm reports a human-readable decision.
Once we detect bias, what can we do about it? One way of removing bias from an algorithm is to edit the training data until the training data no longer shows the bias we’re concerned about.27 We might be changing some loan applications from the “rejected” to the “accepted” category, for example, or we might selectively leave some applications out of our training data altogether. This is known as preprocessing.
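In code, the crudest version of preprocessing might look like the pandas sketch below (file and column names invented, and real preprocessing methods are considerably more careful): measure each group's approval rate, then relabel enough rejections in the disadvantaged groups to close the gap before training.

    import pandas as pd

    train = pd.read_csv("loan_applications.csv")   # hypothetical training data

    rates = train.groupby("gender")["approved"].mean()
    target = rates.max()

    for group, rate in rates.items():
        if rate < target:
            group_size = (train["gender"] == group).sum()
            n_flip = int((target - rate) * group_size)
            rejected = train[(train["gender"] == group) & (train["approved"] == 0)]
            flip_rows = rejected.sample(n_flip, random_state=0).index
            train.loc[flip_rows, "approved"] = 1   # relabel before training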
The key to all this may be human oversight. Because AIs are so prone to unknowingly solving the wrong problem, breaking things, or taking unfortunate shortcuts, we need people to make sure their “brilliant solution” isn’t a head-slapper. And those people will need to be familiar with the ways AIs tend to succeed or go wrong. It’s a bit like checking the work of a colleague—a very, very strange colleague. To get a glimpse of precisely how strange, in the next chapter we’ll look at some ways in which an AI is like a human brain and some ways in which it’s very different.