by Cathy O'Neil
So instead of measuring teachers on an absolute scale, they tried to adjust for social inequalities in the model. Instead of comparing Tim Clifford’s students to others in different neighborhoods, they would compare them with forecast models of themselves. The students each had a predicted score. If they surpassed this prediction, the teacher got the credit. If they came up short, the teacher got the blame. If that sounds primitive to you, believe me, it is.
Statistically speaking, in these attempts to free the tests from class and color, the administrators moved from a primary to a secondary model. Instead of basing scores on direct measurement of the students, they based them on the so-called error term—the gap between results and expectations. Mathematically, this is a much sketchier proposition. Since the expectations themselves are derived from statistics, these amount to guesses on top of guesses. The result is a model with loads of random results, what statisticians call “noise.”
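To make that concrete, here is a minimal sketch in Python of the residual idea, with invented numbers and a deliberately crude prediction rule; it is an illustration of scoring on the error term, not the formula any district actually used. Notice that noise in the prediction and noise in the test result both land in the teacher's score.

```python
# A minimal sketch (invented numbers, not any district's real formula) of
# scoring a teacher on the error term: the gap between each student's actual
# result and a predicted result that is itself only an estimate.

import random

random.seed(0)

def predicted_score(prior_score):
    # Hypothetical prediction: next year's score roughly tracks this year's,
    # plus the prediction's own estimation error of a few points.
    return prior_score + random.gauss(0, 5)

def value_added(actual_score, prior_score):
    # The teacher's "contribution" is just the residual: actual minus predicted.
    # Noise in the test result and noise in the prediction both end up here.
    return actual_score - predicted_score(prior_score)

# One student: suppose the true gain is 3 points, but test-day noise is larger.
prior = 70
actual = prior + 3 + random.gauss(0, 5)
print(round(value_added(actual, prior), 1))  # can come out strongly positive or negative
```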
Now, you might think that large numbers would bring the scores into focus. After all, New York City, with its 1.1 million public school students, should provide a big enough data set to create meaningful predictions. If eighty thousand eighth graders take the test, wouldn’t it be feasible to establish reliable averages for struggling, middling, and thriving schools?
Yes. And if Tim Clifford were teaching a large sampling of students, say ten thousand, then it might be reasonable to measure that cohort against the previous year’s average and draw some conclusions from it. Large numbers balance out the exceptions and outliers. Trends, theoretically, would come into focus. But it’s almost impossible for a class of twenty-five or thirty students to match up with the larger population. So a class with a certain mix of students will tend to rise faster than the average, while one with a different mix will rise more slowly. Clifford was given virtually no information about the opaque WMD that gave him such wildly divergent scores, but he assumed this variation in his classes had something to do with it. The year he scored poorly, Clifford said, “I taught many special education students as well as many top performers. And I think serving either the neediest or the top students—or both—creates problems. Needy students’ scores are hard to move because they have learning problems, and top students’ scores are hard to move because they have already scored high so there’s little room for improvement.”
The following year, he had a different mix of students, with more of them falling between the extremes. And the results made it look as though Clifford had progressed from being a failing teacher to being a spectacular one. Such results were all too common. An analysis by a blogger and educator named Gary Rubinstein found that of teachers who taught the same subject in consecutive years, one in four registered a 40-point difference. That suggests that the evaluation data is practically random. It wasn’t the teachers’ performance that was bouncing all over the place. It was the scoring generated by a bogus WMD.
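You can check how far pure chance goes toward explaining such swings with a back-of-the-envelope simulation. The sketch below is my own illustration, not Rubinstein's analysis: a thousand imaginary teachers who, by construction, add nothing at all are each judged on a class of twenty-five students whose gaps between actual and predicted scores are pure noise. Ranking them into percentiles two years running shows how often a score can lurch by 40 points or more when nothing about the teaching has changed.

```python
# A back-of-the-envelope simulation (an invented illustration, not real data):
# 1,000 identical "teachers" who add nothing, each judged on a class of 25
# students whose actual-minus-predicted gaps are pure noise. We rank the
# teachers into percentiles in two separate years and count the big swings.

import random

random.seed(1)

TEACHERS = 1000
CLASS_SIZE = 25

def class_noise():
    # Average residual for one class of 25: individual noise shrinks only by a
    # factor of five when averaged, so plenty of randomness survives.
    return sum(random.gauss(0, 15) for _ in range(CLASS_SIZE)) / CLASS_SIZE

def percentiles(values):
    # Convert raw class averages into 0-100 percentile ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = round(100 * rank / (len(values) - 1))
    return ranks

year1 = percentiles([class_noise() for _ in range(TEACHERS)])
year2 = percentiles([class_noise() for _ in range(TEACHERS)])

big_swings = sum(1 for a, b in zip(year1, year2) if abs(a - b) >= 40)
print(f"{100 * big_swings / TEACHERS:.0f}% of identical teachers swing 40+ percentile points")
```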
While its scores are meaningless, the impact of value-added modeling is pervasive and nefarious. “I’ve seen some great teachers convince themselves that they were mediocre at best based on those scores,” Clifford said. “It moved them away from the great lessons they used to teach, toward increasing test prep. To a young teacher, a poor value-added score is punishing, and a good one may lead to a false sense of accomplishment that has not been earned.”
As in the case of so many WMDs, the existence of value-added modeling stems from good intentions. The Obama administration realized early on that school districts punished under the 2001 No Child Left Behind reforms, which mandated high-stakes standardized testing, tended to be poor and disadvantaged. So it offered waivers to districts that could demonstrate the effectiveness of their teachers, ensuring that these schools would not be punished even if their students were lagging.*
The use of value-added models stems in large part from this regulatory change. But in late 2015 the teacher testing craze took what may be an even more dramatic turn. First, Congress and the White House agreed to revoke No Child Left Behind and replace it with a law that gives states more latitude to develop their own approaches for turning around underperforming school districts. It also gives them a broader range of criteria to consider, including student and teacher engagement, access to advanced coursework, school climate, and safety. In other words, education officials can attempt to study what’s happening at each individual school—and pay less attention to WMDs like value-added models. Or better yet, jettison them entirely.
At around the same time, New York governor Andrew Cuomo’s education task force called for a four-year moratorium on the use of exams to evaluate teachers. This change, while welcome, does not signal a clear rejection of the teacher evaluation WMDs, much less a recognition that they’re unfair. The push, in fact, came from the parents, who complained that the testing regime was wearing out their kids and taking too much time in the school year. A boycott movement had kept 20 percent of third through eighth graders out of the tests in the spring of 2015, and it was growing. In bowing to the parents, the Cuomo administration delivered a blow to value-added modeling. After all, without a full complement of student tests, the state would lack the data to populate it.
Tim Clifford was cheered by this news but still wary. “The opt-out movement forced Cuomo’s hand,” he wrote in an e-mail. “He feared losing the support of wealthier voters in top school districts, who were the very people who most staunchly supported him. To get ahead of the issue, he’s placed this moratorium on using test scores.” Clifford fears that the tests will be back.
Maybe so. And, given that value-added modeling has become a proven tool against teachers’ unions, I don’t expect it to disappear anytime soon. It’s well entrenched, with forty states and the District of Columbia using or developing one form of it or another. That’s all the more reason to spread the word about these and other WMDs. Once people recognize them and understand their statistical flaws, they’ll demand evaluations that are fairer for both students and teachers. However, if the goal of the testing is to find someone to blame, and to intimidate workers, then, as we’ve seen, a WMD that spews out meaningless scores gets an A-plus.
* * *
* No Child Left Behind sanctions include offering students in failing schools the option of attending another, more successful school. In dire cases, the law calls for a failing school to be closed and replaced by a charter school.
Local bankers used to stand tall in a town. They controlled the money. If you wanted a new car or a mortgage, you’d put on your Sunday best and pay a visit. And as a member of your community, this banker would probably know the following details about your life. He’d know about your churchgoing habits, or lack of them. He’d know all the stories about your older brother’s run-ins with the law. He’d know what your boss (and his golfing buddy) said about you as a worker. Naturally, he’d know your race and ethnic group, and he’d also glance at the numbers on your application form.
The first four factors often worked their way, consciously or not, into the banker’s judgment. And there’s a good chance he was more likely to trust people from his own circles. This was only human. But it meant that for millions of Americans the predigital status quo was just as awful as some of the WMDs I’ve been describing. Outsiders, including minorities and women, were routinely locked out. They had to put together an impressive financial portfolio—and then hunt for open-minded bankers.
It just wasn’t fair. And then along came an algorithm, and things improved. A mathematician named Earl Isaac and his engineer friend, Bill Fair, devised a model they called FICO to evaluate the risk that an individual would default on a loan. This FICO score was fed by a formula that looked only at a borrower’s finances—mostly her debt load and bill-paying record. The score was color blind. And it turned out to be great for the banking industry, because it predicted risk far more accurately while opening the door to millions of new customers. FICO scores, of course, are still around. They’re used by the credit agencies, including Experian, TransUnion, and Equifax, which each contribute different sources of information to the FICO model to come up with their own scores.
These scores have lots of commendable and non-WMD attributes. First, they have a clear feedback loop. Credit companies can see which borrowers default on their loans, and they can match those numbers against their scores. If borrowers with high scores seem to be defaulting on loans more frequently than the model would predict, FICO and the credit agencies can tweak those models to make them more accurate. This is a sound use of statistics.
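That feedback loop is, at bottom, a calibration check. Here is a minimal sketch of the idea, with invented score bands, predicted rates, and outcomes; the agencies' real procedures are far more elaborate, but the comparison of expected and observed default rates is the heart of it.

```python
# A minimal sketch of the feedback loop: compare the default rate the model
# predicted for each score band with the default rate actually observed.
# The bands, predictions, and loan outcomes below are invented for illustration.

predicted_default_rate = {   # what the model expects per score band
    "300-579": 0.30,
    "580-669": 0.15,
    "670-739": 0.06,
    "740-850": 0.02,
}

observed = {                 # (defaults, total loans) actually seen per band
    "300-579": (290, 1000),
    "580-669": (180, 1000),  # worse than expected: this band needs recalibrating
    "670-739": (55, 1000),
    "740-850": (21, 1000),
}

for band, expected in predicted_default_rate.items():
    defaults, total = observed[band]
    actual = defaults / total
    flag = "recalibrate" if abs(actual - expected) > 0.02 else "ok"
    print(f"{band}: expected {expected:.0%}, observed {actual:.1%} -> {flag}")
```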
The credit scores are also relatively transparent. FICO’s website, for example, offers simple instructions on how to improve your score. (Reduce debt, pay bills on time, and stop ordering new credit cards.) Equally important, the credit-scoring industry is regulated. If you have questions about your score, you have the legal right to ask for your credit report, which includes all the information that goes into the score, including your record of mortgage and utility payments, your total debt, and the percentage of available credit you’re using. Though the process can be slow to the point of being torturous, if you find mistakes, you can have them fixed.
Since Fair and Isaac’s pioneering days, the use of scoring has of course proliferated wildly. Today we’re added up in every conceivable way as statisticians and mathematicians patch together a mishmash of data, from our zip codes and Internet surfing patterns to our recent purchases. Many of their pseudoscientific models attempt to predict our creditworthiness, giving each of us so-called e-scores. These numbers, which we rarely see, open doors for some of us, while slamming them in the face of others. Unlike the FICO scores they resemble, e-scores are arbitrary, unaccountable, unregulated, and often unfair—in short, they’re WMDs.
A Virginia company called Neustar offers a prime example. Neustar provides customer targeting services for companies, including one that helps manage call center traffic. In a flash, this technology races through available data on callers and places them in a hierarchy. Those at the top are deemed to be more profitable prospects and are quickly funneled to a human operator. Those at the bottom either wait much longer or are dispatched into an outsourced overflow center, where they are handled largely by machines.
Credit card companies such as Capital One carry out similar rapid-fire calculations as soon as someone shows up on their website. They can often access data on web browsing and purchasing patterns, which provide loads of insights about the potential customer. Chances are, the person clicking for new Jaguars is richer than the one checking out a 2003 Taurus on Carfax.com. Most scoring systems also pick up the location of the visitor’s computer. When this is matched with real estate data, they can draw inferences about wealth. A person using a computer on San Francisco’s Balboa Terrace is a far better prospect than the one across the bay in East Oakland.
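A toy version of that instant sorting might look like the sketch below; the weights, signals, and thresholds are entirely invented, but they show how a few proxies for wealth can decide, in milliseconds, who reaches a human operator and who waits.

```python
# A toy sketch (entirely invented weights and data) of the kind of instant
# triage an e-scoring system performs on a website visitor or caller:
# browsing signals plus a location-based wealth guess decide the routing.

def triage_score(visitor):
    score = 0
    score += 30 if visitor["browsed_luxury_items"] else 0
    score += 20 if visitor["repeat_customer"] else 0
    # Location as a stand-in for wealth: the model never sees income,
    # only where the computer appears to be.
    score += {"high": 40, "middle": 20, "low": 0}[visitor["neighborhood_tier"]]
    return score

def route(visitor):
    return "human operator" if triage_score(visitor) >= 60 else "overflow queue"

print(route({"browsed_luxury_items": True, "repeat_customer": False,
             "neighborhood_tier": "high"}))   # -> human operator
print(route({"browsed_luxury_items": False, "repeat_customer": True,
             "neighborhood_tier": "low"}))    # -> overflow queue
```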
The existence of these e-scores shouldn’t be surprising. We’ve seen models feeding on similar data when targeting us for predatory loans or weighing the odds that we might steal a car. For better or worse, they’ve guided us to school (or jail) and toward a job, and then they’ve optimized us inside the workplace. Now that it might be time to buy a house or car, it’s only natural that financial models would mine the same trove of data to size us up.
But consider the nasty feedback loop that e-scores create. There’s a very high chance that the e-scoring system will give the borrower from the rough section of East Oakland a low score. A lot of people default there. So the credit card offer popping up on her screen will be targeted to a riskier demographic. That means less available credit and higher interest rates for those who are already struggling.
Much of the predatory advertising we’ve been discussing, including the ads for payday loans and for-profit colleges, is generated through such e-scores. They’re stand-ins for credit scores. But since companies are legally prohibited from using credit scores for marketing purposes, they make do with this sloppy substitute.
There’s a certain logic to that prohibition. After all, our credit history includes highly personal data, and it makes sense that we should have control over who sees it. But the consequence is that companies end up diving into largely unregulated pools of data, such as clickstreams and geo-tags, in order to create a parallel data marketplace. In the process, they can largely avoid government oversight. They then measure success by gains in efficiency, cash flow, and profits. With few exceptions, concepts like justice and transparency don’t fit into their algorithms.
Let’s compare that for a moment to the 1950s-era banker. Consciously or not, that banker was weighing various data points that had little or nothing to do with his would-be borrower’s ability to shoulder a mortgage. He looked across his desk and saw his customer’s race, and drew conclusions from that. Her father’s criminal record may have counted against her, while her regular church attendance may have been seen favorably.
All of these data points were proxies. In his search for financial responsibility, the banker could have dispassionately studied the numbers (as some exemplary bankers no doubt did). But instead he drew correlations to race, religion, and family connections. In doing so, he avoided scrutinizing the borrower as an individual and instead placed her in a group of people—what statisticians today would call a “bucket.” “People like you,” he decided, could or could not be trusted.
Fair and Isaac’s great advance was to ditch the proxies in favor of the relevant financial data, like past behavior with respect to paying bills. They focused their analysis on the individual in question—and not on other people with similar attributes. E-scores, by contrast, march us back in time. They analyze the individual through a veritable blizzard of proxies. In a few milliseconds, they carry out thousands of “people like you” calculations. And if enough of these “similar” people turn out to be deadbeats or, worse, criminals, that individual will be treated accordingly.
From time to time, people ask me how to teach ethics to a class of data scientists. I usually begin with a discussion of how to build an e-score model and ask them whether it makes sense to use “race” as an input in the model. They inevitably respond that such a question would be unfair and probably illegal. The next question is whether to use “zip code.” This seems fair enough, at first. But it doesn’t take long for the students to see that they are codifying past injustices into their model. When they include an attribute such as “zip code,” they are expressing the opinion that the history of human behavior in that patch of real estate should determine, at least in part, what kind of loan a person who lives there should get.
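A stripped-down version of that classroom exercise, with made-up numbers, makes the point quickly: two applicants with identical payment histories get different scores the moment the model is allowed to look at their zip codes.

```python
# A stripped-down version of the classroom exercise, with made-up numbers.
# Two applicants with identical personal payment histories get different
# scores as soon as the model is allowed to consider where they live.

historical_default_rate_by_zip = {  # the "history of that patch of real estate" (invented)
    "zip_a": 0.04,   # neighborhood with few past defaults
    "zip_b": 0.22,   # neighborhood with many past defaults
}

def credit_like_score(applicant, use_zip=True):
    score = 700
    score += 50 if applicant["always_pays_on_time"] else -100
    if use_zip:
        # The applicant is now judged partly on the neighbors' past behavior.
        score -= int(500 * historical_default_rate_by_zip[applicant["zip"]])
    return score

a = {"always_pays_on_time": True, "zip": "zip_a"}
b = {"always_pays_on_time": True, "zip": "zip_b"}

print(credit_like_score(a), credit_like_score(b))   # 730 vs 640: same behavior, different zip
print(credit_like_score(a, use_zip=False),
      credit_like_score(b, use_zip=False))          # 750 vs 750 once the proxy is removed
```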
In other words, the modelers for e-scores have to make do with trying to answer the question “How have people like you behaved in the past?” when ideally they would ask, “How have you behaved in the past?”
The difference between these two questions is vast. Imagine a highly motivated and responsible person from a modest immigrant background who is trying to start a business and needs to rely on such a system for early investment. Who would take a chance on such a person? Probably not a model trained on such demographic and behavioral data.
I should note that in the statistical universe proxies inhabit, they often work. More often than not, birds of a feather do fly together. Rich people buy cruises and BMWs. All too often, poor people need a payday loan. And since these statistical models appear to work much of the time, efficiency rises and profits surge. Investors double down on scientific systems that can place thousands of people into what appear to be the correct buckets. It’s the triumph of Big Data.
And what about the person who is misunderstood and placed in the wrong bucket? That happens. And there’s no feedback to set the system straight. A statistics-crunching engine has no way to learn that it dispatched a valuable potential customer to call center hell. Worse, losers in the unregulated e-score universe have little recourse to complain, much less correct the system’s error. In the realm of WMDs, they’re collateral damage. And since the whole murky system grinds away in distant server farms, they rarely find out about it. Most of them probably conclude, with reason, that life is simply unfair.
In the world I’ve described so far, e-scores nourished by millions of proxies exist in the shadows, while our credit reports, packed with pertinent and relevant data, operate under rule of law. But sadly, it’s not quite that simple. All too often, credit reports serve as proxies, too.
It should come as little surprise that many institutions in our society, from big companies to the government, are on the hunt for people who are trustworthy and reliable. In the chapter on getting a job, we saw them sorting through résumés and red-lighting candidates whose psychological tests pointed to undesirable personal attributes. Another all-too-common approach is to consider the applicant’s credit score. If people pay their bills on time and avoid debt, employers ask, wouldn’t that signal trustworthiness and dependability? It’s not exactly the same thing, they know. But wouldn’t there be a significant overlap?
That’s how the credit reports have expanded far beyond their original turf. Creditworthiness has become an all-too-easy stand-in for other virtues. Conversely, bad credit has grown to signal a host of sins and shortcomings that have nothing to do with paying bills. As we’ll see, all sorts of companies turn credit reports into their own versions of credit scores and use them as proxies. This practice is both toxic and ubiquitous.
For certain applications, such a proxy might appear harmless. Some online dating services, for example, match people on the basis of credit scores. One of them, CreditScoreDating, proclaims that “good credit scores are sexy.” We can debate the wisdom of linking financial behavior to love. But at least the customers of CreditScoreDating know what they’re getting into and why. It’s up to them.