Technically Wrong


by Sara Wachter-Boettcher


  Cathy O’Neil claims that this reliance on historical data is a fundamental problem with many algorithmic systems: “Big data processes codify the past,” she writes. “They do not invent the future.” 4 So if the past was biased (and it certainly was), then these systems will keep that bias alive—even as the public is led to believe that high-tech models remove human error from the equation. The only way to stop perpetuating the bias is to build a model that takes these historical facts into account, and adjusts to rectify them in the future. COMPAS doesn’t.

  Then there’s the way those 137 questions are asked. According to Northpointe, the software is designed for “test administration flexibility”: the agency using the software can choose to have defendants fill in their own reports, lead them through an interview and ask the questions verbatim, or hold what it calls a “guided discussion,” in which the interviewer uses a “motivational interviewing style” to “enhance rapport and buy-in for the intervention process.” 5 That flexibility sounds great for buyers of the software, but it means that agencies are gathering data in varied, nonstandard ways—and that their own biases about offenders, or their inattention to potential biases, might lead them to infer and record an answer in the “guided discussion” that a defendant would never have given about themselves.

  So that’s the data going into the algorithm—the facets that Northpointe says indicate future criminality. But what about the steps that the algorithm itself takes to arrive at a score? It turns out that those have their problems as well. After ProPublica released its report, several groups of researchers, each working independently at different institutions, decided to take a closer look at ProPublica’s findings. They didn’t find a clear origin for the bias—a specific piece of the algorithm gone wrong. Instead, they found that ProPublica and Northpointe were simply looking at the concept of “fairness” in very different ways.

  At Northpointe, fairness was defined as parity in accuracy: the company tuned its model to ensure that people of different races who were assigned the same score also had the same recidivism rates. For example, “among defendants who scored a seven on the COMPAS scale, 60 percent of white defendants reoffended, which is nearly identical to the 61 percent of black defendants who reoffended.” 6 At first glance, that makes intuitive sense. But parity in accuracy is only one measure of fairness. As ProPublica found, more than twice as many black people as white were labeled high-risk but did not reoffend. What this means is that there may be parity in accuracy, but there isn’t parity when it comes to the harm of incorrect predictions—such as not being allowed bail, or being given a harsher sentence. That harm is shouldered primarily by black defendants.
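  To make the two definitions concrete, here is a small sketch with made-up numbers (nothing from Northpointe’s actual model): two hypothetical groups whose scores are equally “accurate,” but whose non-reoffending members are mislabeled at very different rates.

```python
def rates(true_pos, false_pos, false_neg, true_neg):
    """Two ways to judge 'fairness' for one group:
    ppv: how often a high-risk label turns out to be correct (parity in accuracy)
    fpr: how often someone who does NOT reoffend is labeled high-risk anyway"""
    ppv = true_pos / (true_pos + false_pos)
    fpr = false_pos / (false_pos + true_neg)
    return ppv, fpr

# Hypothetical group A: 500 of 1,000 people reoffend.
ppv_a, fpr_a = rates(true_pos=300, false_pos=200, false_neg=200, true_neg=300)

# Hypothetical group B: 300 of 1,000 people reoffend.
ppv_b, fpr_b = rates(true_pos=180, false_pos=120, false_neg=120, true_neg=580)

print(f"Group A: {ppv_a:.0%} of high-risk labels correct; {fpr_a:.0%} of non-reoffenders mislabeled")
print(f"Group B: {ppv_b:.0%} of high-risk labels correct; {fpr_b:.0%} of non-reoffenders mislabeled")
# Both groups score 60 percent on the first measure ("parity in accuracy"), yet
# group A's non-reoffenders are wrongly labeled high-risk 40 percent of the time,
# versus about 17 percent for group B. The harm of wrong predictions is not shared equally.
```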

  In fact, researchers at Stanford who looked at ProPublica’s data found that, mathematically, when one group has an overall higher arrest rate than another (like black people do, as compared to white people), “it’s actually impossible for a risk score to satisfy both fairness criteria at the same time.” You can’t have parity in accuracy rates and parity in the harm of incorrect predictions, because black people are more likely to be arrested, so they are more likely to carry higher scores. As Nathan Srebro, a computer science professor at the University of Chicago and the Toyota Technological Institute at Chicago, puts it, they’re “paying the price for the uncertainty” of Northpointe’s algorithm.7
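  The arithmetic behind that impossibility is worth spelling out. Using the standard definitions of these rates (a general identity that follows from the definitions themselves, restated here in my own notation rather than the researchers’ or Northpointe’s), a group’s false-positive rate is tied to the score’s accuracy and to that group’s underlying rate of reoffense:

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p} \cdot \frac{1-\mathrm{PPV}}{\mathrm{PPV}} \cdot \mathrm{TPR}
```

  Here p is the group’s underlying rate of reoffense, PPV is the share of high-risk labels that turn out to be correct, and TPR is the share of reoffenders who were flagged. Hold PPV and TPR equal across two groups, and a higher p forces a higher false-positive rate: both notions of fairness cannot be satisfied at once unless the underlying rates happen to match.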

  Despite what Northpointe says, that sure doesn’t seem like a fair algorithm to me. So I asked Sorelle Friedler, an assistant professor of computer science at Haverford College and the cochair of a group called Fairness, Accountability, and Transparency in Machine Learning (FAT/ML), how we should look at these sorts of competing “fairness” criteria. According to her, the problem with COMPAS starts not with the algorithm, but with the values that underlie the way that algorithm was designed. “In order to mathematically define fairness, we have to decide what we think fairness should be,” she told me. “This is obviously a long-standing, many-thousands-of-years-old question in society: ‘What does it mean to be fair?’ And, somehow, we have to boil that down to some formula.” That’s a “necessarily flawed” process, she says. “It will capture some definition of fairness, under some situations, with lots of caveats.” 8

  But the underlying problem isn’t that algorithms like COMPAS can’t perfectly model “fairness.” It’s that they’re not really trying. Instead, they’re relying on a definition that sounds nice, without thinking about who is harmed by it and how it might perpetuate the inequities of the past. And they’re doing it in private, despite its impact on the public.

  That’s why COMPAS is such a cautionary tale for the tech industry—and all of us who use tech products. Because, as powerful as algorithms are, they’re not inherently “correct.” They’re just a series of steps and rules, applied to a set of data, designed to reach an outcome. The questions we need to ask are, Who decided what that desired outcome was? Where did the data come from? How did they define “good” or “fair” results? And how might that definition leave people behind? Otherwise, it’s far too easy for teams to carry the biases of the past with them into their software, creating algorithms that, at best, make a product less effective for some users—and, at worst, wreak havoc on their lives.

  THE NONGORILLA IN THE ROOM

  One Sunday in June 2015, Jacky Alciné was sitting in his room watching the BET Awards show, when a photo from a friend popped up in Google Photos. He started playing with the app and saw that all his images had been assigned to new categories: a picture of a jet wing snapped from his seat was tagged with “Airplanes.” A photo of his brother in cap and gown was tagged with “Graduation.” And a selfie of him and a friend mugging for the camera? It was labeled “Gorillas.”

  Yep, “Gorillas,” a term so loaded with racist history, so problematic, that Alciné felt frustrated the moment he saw it. “Having computers class you as something that black people have been called for centuries” was upsetting, he told Manoush Zomorodi on the WNYC podcast Note to Self. “You look like an ape, you’ve been classified as a creature. . . . That was the underlying thing that triggered me.” 9

  Google Photos’ automatic image tagging puts photos into categories for you. It works pretty well—except when it goes horribly wrong. (Jacky Alciné)

  And it wasn’t just one photo of Alciné and his friend that the app had tagged as gorillas. It was actually every single photo they had taken that day—a whole series of the two of them goofing around at an outdoor concert. “Over fifty of my photos in that set were labeled under the animal tag,” Alciné told me, looking back on the incident. “Think about that. It made the mistake and kept making it.” 10

  No human had made this decision, of course. The categories were courtesy of Google Photos’ autotagging feature, which had recently launched, promising to allow users to “search by what you remember about a photo, no description needed.” 11 The technology is based on deep neural networks: layered systems, trained on enormous amounts of data, that enable machines to “see” in a way loosely modeled on how the human brain recognizes images.

  All the photos that Alciné and his friend took that day were labeled as “Gorillas.” (Jacky Alciné)

  Neural networks are built using a learning algorithm: rather than being programmed to follow predetermined steps, the algorithm takes historical information, parses it, identifies its patterns, and then uses it to make inferences about new information it encounters. In image processing, it works something like this: millions of images are added to a system, each of them tagged by a human—like “Boston terrier puppy” or “maple tree.” Those images make up the system’s training data: the collection of examples it will learn from. Algorithms then go through the training data and identify patterns.
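  A deliberately tiny sketch of that loop might look like the following. It is nothing like Google’s actual system; each “image” here is just two made-up numbers, and the learned “pattern” is a simple average. But the shape of the process is the same: humans supply labeled examples, the algorithm distills patterns from them, and new items are classified against those patterns.

```python
import numpy as np

# Each training item is a feature vector plus a human-supplied label.
training_data = {
    "maple tree": np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]),
    "puppy":      np.array([[0.2, 0.9], [0.1, 0.8], [0.15, 0.85]]),
}

# "Training": extract one pattern (here, the average) per label.
patterns = {label: examples.mean(axis=0) for label, examples in training_data.items()}

def classify(new_item):
    """Infer a label for an item the system has never seen before,
    by finding the learned pattern it sits closest to."""
    return min(patterns, key=lambda label: np.linalg.norm(new_item - patterns[label]))

print(classify(np.array([0.88, 0.12])))  # prints "maple tree"
```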

  For example, to build a neural network that could process handwriting, you would need training data with a huge number of different versions of the same letterforms. The idea is that the system would notice that each item tagged as a lowercase “p” had similarities—that they always had a descender immediately connected to a bowl, for example. The wider the range of handwriting used in the training data, the more accurate the system’s pattern recognition will be when you let it loose on new data. And once it’s accurately identifying new forms of the lowercase “p,” those patterns can be added into the network too. This is how a neural network gets smarter over time.12
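  You can watch that effect in miniature with scikit-learn’s bundled handwritten-digit images, which I am using here as a stand-in for letterforms: the same kind of model, given a narrow slice of the training data versus the full range, recognizes new handwriting with noticeably different accuracy.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Handwritten digits stand in for letterforms; the point is only that a model
# trained on a small, narrow sample generalizes worse than one trained on a
# wider range of examples.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_examples in (50, 200, len(X_train)):
    model = MLPClassifier(max_iter=1000, random_state=0)
    model.fit(X_train[:n_examples], y_train[:n_examples])
    accuracy = model.score(X_test, y_test)
    print(f"{n_examples} training examples -> accuracy on new handwriting: {accuracy:.2f}")
```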

  Like our brains, though, computers struggle to understand complex objects, such as photographs, all at once; there are simply too many details and variations to make sense of. Instead, they have to learn patterns, much as we did as children. This kind of pattern recognition happens in layers: small details, like the little point at the top of a cat’s ear, get connected to larger concepts, like the ear itself, which then gets connected to the larger concept of a cat’s head, and so on—until the system builds up enough layers, sometimes twenty or more, to make sense of the full image.13 The connections between each of those layers are what turn the collection of data into a neural network. Some of these networks are so good that they can even identify an image not just as a car, but also as a particular make, model, and year.14
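  In code, that layered structure is just a stack of pattern detectors feeding into one another. The sketch below uses PyTorch and is far shallower than any production network, but it shows the shape: each convolutional layer picks out small details, pooling steps merge them into larger concepts, and a final layer turns the result into category scores.

```python
import torch
from torch import nn

# A small image classifier: early layers detect tiny details (edges, points),
# later layers combine them into larger shapes, and the last layer maps those
# shapes onto scores for 10 hypothetical categories.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),
)

# One fake 64-by-64 RGB image; a real system would be trained on millions of
# labeled photos before its scores meant anything.
scores = model(torch.randn(1, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 10])
```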

  This kind of machine learning is no joke to build. It takes a tremendous amount of training data, plus lots of jiggering, to make the system categorize things correctly. For example, the system might recognize the pointy ears, reddish coloring, and spindly legs of my friend’s Portuguese Podengo, compare those details to patterns it has seen before, and mistakenly categorize him as a deer rather than a dog. But the more varied training data the system has, the less often these kinds of mistakes should happen.

  In Alciné’s case, the system looked at his selfies and found patterns it recognized too. What made the incident stand out is that those patterns added up to a gorilla. If the error had been different—if the system had thought that Alciné and his friend were bears, for example—he might have chalked it up to a small failure, retagged the images, and moved on. But “gorilla” carries cultural weight—so much, in fact, that Google immediately apologized and told the BBC it was “appalled” that this had happened.15

  It would be easy to end this anecdote here: all machine-learning systems fail sometimes, and this one just happened to fail in a racially insensitive way. No one meant for this to happen. But the bias is actually hidden a lot deeper in the system. Remember, neural networks rely on having a variety of training data to learn how to identify images correctly. That’s the only way they get good at their jobs. Now consider what Google’s Yonatan Zunger, chief social architect at the time, told Alciné after this incident: “We’re also working on longer-term fixes around . . . image recognition itself (e.g., better recognition of dark-skinned faces),” 16 he wrote on Twitter.

  Wait a second. Why wasn’t Google’s image recognition feature as good at identifying dark-skinned faces as it was at identifying light-skinned faces when it launched? And why didn’t anyone notice the problem before it launched? Well, because failing to design for black people isn’t new. It’s been happening in photo technology for decades.

  Starting back in the 1950s, Kodak, which made most of the film used in the United States, was forced to break up its monopoly on photo processing—so rather than taking your film to Kodak to be developed, you could take it to an independent photo lab. Kodak responded by developing a printer that would work for these small labs, and by sending out kits to aid photo technicians in developing film properly. One item in the kit was the “Shirley Card”—a card depicting a woman in a high-contrast outfit, labeled with the word “normal,” surrounded by versions of the same photo with various color problems. Named for the first person to sit for these cards, Shirley Page, a Kodak employee at the time, the cards were used to calibrate skin tones, shadows, and light.17 Decade after decade, new women would sit for these cards. They were always referred to as “Shirley.” And they were always white.

  Not only were photo lab technicians calibrating their work to match Shirley’s version of normal, but the film itself wasn’t designed to work for black customers. Film emulsions—the coatings on one side of film that contain tiny, light-sensitive crystals—“could have been designed initially with more sensitivity to the continuum of yellow, brown, and reddish skin tones,” writes Lorna Roth, a professor of communications at Concordia University—but only if Kodak had been motivated to recognize “a more dynamic range” of people. Black customers weren’t recognized as an important enough demographic for Kodak to market to, so no one bothered.18

  One of Kodak’s “Shirley Cards,” showing “normal”—and always white—skin tones. (Courtesy of Kodak and Hermann Zschiegner)

  Roth notes that this only started to change in the 1970s—but not necessarily because Kodak was trying to improve its product for diverse audiences. Earl Kage, who managed Kodak Research Studios at the time, told her, “It was never Black flesh that was addressed as a serious problem that I knew of at the time.” 19 Instead, Kodak decided it needed its film to better handle color variations because furniture retailers and chocolate makers had started complaining: they said differences between wood grains, and between milk-chocolate and dark-chocolate varieties, weren’t rendering correctly. Improving the product for black audiences was just a by-product.

  In the dustup over the “gorillas” incident, Zunger, the Google engineer, even told Alciné on Twitter that “different contrast processing [is] needed for different skin tones and lighting.” 20 That’s true, but it’s not news: it’s the same problem that existed for Kodak six decades ago, and that black people have known about for years.

  It’s not just Google Photos that has failed to make facial-recognition products work as well for people of color as they do for white people. In 2015, Flickr’s automatic image tagging labeled a black man as “ape” (not to mention, the gates of Dachau as a “jungle gym”).21 Back in 2009, Nikon released a camera with a special feature: it would warn you if someone in a photo had blinked. The problem was that it hadn’t been tested well enough on Asian eyes, and as a result, it routinely flagged them as blinking.22 That same year, there was the HP computer with a camera that used facial-recognition software to move with the user—panning and zooming to keep their face front and center. Unless the user was black—in which case the software often couldn’t recognize them at all.23

  And yet, several years later—and after huge leaps forward in machine-learning capabilities—Google Photos was making the same mistakes. Why? The answer is right back in those tech-company offices we encountered in Chapter 2. If you recall, Google reported that just 1 percent of its technical staff was black in 2016. If the team that made this product looked like Google as a whole, it would have been made up almost entirely of white and Asian men. Would they have noticed if, as with COMPAS, their product’s failure rates disproportionately affected black people?

  Just adding more people of color to the team won’t fix the problem, though. Sorelle Friedler, the computer science professor who studies fairness, says the problem is actually more insidious than that.

  Part of what machine learning is designed to do is find patterns in data. So, often, what you’re really seeing are societal patterns being reflected back. In order to remove that, or even notice that it’s a problem, it requires a certain perspective that I do think women and minorities are more likely to bring to the table. But in order to do something about it, it requires new algorithmic techniques.24

  In other words, regardless of the makeup of the team behind an algorithmically powered product, people must be trained to think more carefully about the data they’re working with, and the historical context of that data. Only then will they ask the right questions—like, “Is our training data representative of a range of skin tones?” and “Does our product fail more often for certain kinds of images?”—and, critically, figure out how to adjust the system as a result.
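  Answering that second question does not require anything exotic; it mostly requires deciding to look. A sketch of that kind of disaggregated check, using a hypothetical “skin_tone” field attached to each test image, might be as simple as this:

```python
from collections import defaultdict

def failure_rates_by_group(test_items, predict):
    """Report error rates separately for each group instead of one overall number."""
    failures, totals = defaultdict(int), defaultdict(int)
    for item in test_items:
        group = item["skin_tone"]  # hypothetical metadata on the evaluation set
        totals[group] += 1
        if predict(item["image"]) != item["true_label"]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}

# Tiny demo with made-up data and a deliberately flawed "model."
demo = [
    {"image": "img1", "true_label": "person", "skin_tone": "dark"},
    {"image": "img2", "true_label": "person", "skin_tone": "dark"},
    {"image": "img3", "true_label": "person", "skin_tone": "light"},
    {"image": "img4", "true_label": "person", "skin_tone": "light"},
]

def flawed_model(image):
    return "animal" if image == "img1" else "person"

print(failure_rates_by_group(demo, flawed_model))
# {'dark': 0.5, 'light': 0.0}: a single 25 percent overall error rate would hide who bears it.
```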

  Without those questions, it’s no surprise that the Google Photos algorithm didn’t learn to identify dark-skinned faces very well: because at Google, just like at Kodak in the 1950s, “normal” still defaults to white.

  BIASED INPUT, EVEN MORE BIASED OUTPUT

  When Alciné started tweeting about his experience with Google Photos, some people were as angry as he was. Others replied with racist comments; it’s the internet, after all. But a lot of people rolled their eyes and said that he shouldn’t be offended—that “computers can’t be racist,” and “you can’t blame Google for it; machines are stupid.”

  That sentiment is echoed from within the tech companies that specialize in image recognition too: in 2016, Moshe Greenshpan, the CEO of the facial-recognition software company Face-Six, told Motherboard writer Rose Eveleth that he can’t worry about “little issues” like Alciné’s. “I don’t think my engineers or other companies’ engineers have any hidden agenda to give more attention to one ethnicity,” he told her. “It’s just a matter of practical use cases.” 25

  Sound familiar? That’s the classic edge-case thinking we looked at back in Chapter 3—thinking that allows companies to shrug their shoulders at all the people whose identities aren’t whatever their narrow definition of “normal” is. But Alciné and his friend aren’t edge cases. They’re people—customers who deserve to have a product that works just as well for them as for anyone else.

  There’s also a special danger in writing off “edge cases” in algorithmic systems as impractical and not worth designing for—because that write-off may not end there. It’s not just that Alciné and his friend get coded as gorillas. It’s that a system, left uncorrected, will keep making those mistakes, over and over, and think it’s getting things right. Without feedback—without people like Alciné taking it upon themselves to retag their photos manually—the system won’t get better over time.

 
