The more I learn about SAT items, the more understandable all this seems. Verbal SAT items rely on common knowledge. To a degree, so do math items. But what is common knowledge to one group may not be commonly known by another. Therefore, the tests are biased when used to compare students from different backgrounds.
I recalled an item biased against me from a test I took to get into graduate school, the Miller Analogies Test. According to the test-maker, “The MAT is a high-level mental ability test…. Performance on the MAT is designed to reflect candidates’ analytical thinking, an ability that is critical for success in both graduate school and professional life.”19 I recall the item imperfectly but have captured its essence in the following reconstruction:
______ is to sake as opera is to ______.
a. Pete’s … many
b. hit … run
c. swordplay … Chianti
d. attend … forego
e. her … star
“For Pete’s sake,” I groaned—“‘sake’ is like a pronoun. A pronoun is a placeholder, a chameleon word that takes on the attributes of the word it stands in for, and you cannot easily make an analogy with a placeholder.” I have no idea what I put for an answer, but I’m sure I got it wrong. Upon leaving the exam, I asked a fellow student, who told me that I had mispronounced and misunderstood the word. It was not “SAYK” as in “Pete’s sake,” but two syllables: “SAH-kee.” I still looked blank, until he explained that “sake” is a type of wine. I had never heard the word. For that matter, I neither knew what “Chianti” was nor how to pronounce it. In my defense, the year was 1964, I had lived only in the Midwest, and sake was not on the list of alcoholic beverages wholesaled by the state of Minnesota, where I attended school.
Who was likely to know “sake”? Upscale residents of the West and perhaps East coasts. They would also know “Chianti.” For them, the question was easy: a vaguely Japanese activity is to Japanese wine as a vaguely Italian activity is to Italian wine. A test of “high-level mental ability”? Hardly. To the rest of us, the level of our mental ability mattered not. We did not know the vocabulary. We were guessing in the dark.
STATISTICAL PROCESSES CAUSE CULTURAL BIAS IN “STANDARDIZED” TESTS
The problem of bias affects all standardized tests, from the Stanford-Binet IQ test to the Metropolitans to the Miller Analogies Test. It also affects achievement tests, like the multiple-choice exams many states now prescribe for high school graduation. The SAT is national, while most multiple-choice achievement tests differ from state to state, so I shall focus on bias in the SAT. But the same problems beset all “standardized” tests. For two obvious reasons, students from groups that are sociologically distant from white upper-middle-class students in Princeton, New Jersey, won’t do well. First, such students haven’t had an equal opportunity to learn the culture and vocabulary of upper-middle-class students in Princeton. Second, the material such students do know will never get tested on a real SAT.
For several years, I invented SAT-type items that favored various groups—the working class, females, males, blacks, whites, and so forth. I learned how to construct an entire test on which blacks outscored whites, students from working-class backgrounds outscored elite students, or females outscored males in math. Adding a few of my questions while removing those ETS questions that showed the largest disparities in the traditional directions resulted in tests on which group means were about equal.
If I could do this, why couldn’t ETS? The answer has to do with how the tests are built. Inadvertently the test-construction process excludes items on which minority students outperform white and some Asian students, math items on which girls outperform boys, and items that students from the working class get right more often than more elite students. Before an item becomes part of the SAT, ETS tests it by putting it on the “experimental section” of the test. Students taking the SAT act as guinea pigs for new questions, the results of which do not count toward their test scores. ETS seeks items that “behave” statistically. An item behaves when those who get it right have more ability than those who get it wrong.
At first glance, this makes sense. Indeed, shouldn’t people with ability get any item right more often than those with less ability? Items on which people with less ability do better would seem perverse indeed. Unfortunately, however, ETS uses no independent measure of ability. Even though the SAT is intended to predict first-semester college grades, researchers do not actually gather grades from past test-takers during their first semester in college.20 If they did, they could assess which items on the experimental test correlated with better grades in college. Nor do researchers bother to correlate item results with students’ high school grade point averages (GPAs). Instead, they take a shortcut: they use overall test score as their measure of ability.
Like most other “standardized” tests given widely in the U.S., the SAT was originally validated on affluent white students. Affluent white students have always done better on it than have African Americans, Hispanics, Native Americans, Filipino Americans, or working-class whites. Consider a hypothetical item on the experimental test that favors minority students. For example, it was widely bandied about in the 1970s that ETS tried an experimental item using “dashiki”—perhaps an analogy, “dashiki is to shirt as …”—on the SAT. Such an item resembles the “sake” item discussed earlier, except the vocabulary knowledge is reversed. That is, white suburban students were less likely to know “dashiki,” a handmade African shirt popular at times in black culture. African American students, along with Latinos and inner-city whites familiar with black culture, were more likely to get the item right.21 Thus the item “misbehaved” and never appeared on an actual SAT exam. In statistical terms, the item had a negative point-biserial correlation coefficient, also called an item-to-scale correlation coefficient. Statistically speaking, the “wrong people” got it right. No one at ETS wanted nonwhites to score lower. No intentionality was involved. The process is purely statistical.
This problem goes far deeper than the question of whose specialized vocabulary and knowledge will be tapped, the elite’s (“sake”) or the ghetto’s (“dashiki”). Some items that favor African Americans contain no specialized vocabulary. Consider “environment.” When white students hear the word, most think first of the natural environment—ecology, pollution, the balance of nature, and the like. Nothing wrong with that—everyone knows that meaning of the word “environment.” According to research at ETS, when minority students hear the word, most think first of the social environment—“what kind of environment does that child come from?” Again, nothing wrong with that—everyone knows that meaning of “environment.” In the pressure-cooker conditions under which students take “standardized” tests, the first meaning that flashes into their minds when they encounter a word influences whether they pick a “distracter” (wrong answer) or get the item right. It follows as surely as night follows day that a potential item based upon the second meaning of “environment” could never make it to the SAT. Statistically speaking, the “wrong people” would get it right on the experimental test.22
Just as some items favor African Americans and those who know black culture, and hence do not get selected, other items favor Hispanics and Filipinos, and hence do not get selected. ETS researchers have shown how English words with Spanish cognates, like “pallid” (pálido in Spanish), favor Hispanics and Filipinos. Words without such cognates, like “ashen,” do not. Thus, Hispanics did much better than white students on an antonym item asking for the opposite of “pallid”; Anglos did better on an antonym asking for the opposite of “ashen.” Both items had the same five alternatives; the desired opposite for both was “vividly colored.” “Pallid” and “ashen” occur with roughly equal frequency in English.23 Therefore, both items equally test “verbal reasoning,” or whatever the SAT is alleged to test. As the proportion of test-takers from Hispanic backgrounds increases, however, items like “pallid” will become rarer on real SATs because, again, the wrong people—Hispanics with lower overall test scores—get them right.
The graph to the left shows the behavior that earns an item a place on a “standardized” test, in this case an SAT. Each test-taker either gets it wrong (0 on the horizontal axis) or right (1). The vertical axis shows each person’s score on the rest of the test. Those who got the item wrong have mostly lower scores than those who got it right. The “regression line” summarizing the relationship slopes upward and to the right, showing a positive correlation, or r. For this item, r = .28. On the whole, the “more able” students got it right.
The graph to the right shows a test item that fails this statistical test. Items like the “dashiki” analogy, or an item drawing on the social meaning of “environment,” wind up like this. On the whole, the “more able” students—defined by their overall score on the other items—got it wrong. It has a negative point-biserial correlation coefficient: r = –.17.
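To make the screening statistic concrete, here is a minimal sketch, not ETS’s actual procedure and with invented scores, of how an item-to-scale (point-biserial) correlation like the r values above can be computed and used to keep or reject an experimental item. The function pointbiserialr comes from the SciPy library; the ten hypothetical test-takers are assumptions for illustration only.

```python
# Minimal sketch of the item-screening statistic described above.
# Not ETS's actual procedure; the scores below are invented.
from scipy.stats import pointbiserialr

# Hypothetical results for ten test-takers on one experimental item.
item_correct = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]            # 0 = wrong, 1 = right
rest_of_test = [38, 42, 45, 47, 50, 52, 55, 58, 60, 63]  # score on the other items

# Point-biserial correlation between the item and the rest of the test.
r, p_value = pointbiserialr(item_correct, rest_of_test)
print(f"item-to-scale correlation r = {r:.2f}")

# Screening logic as described in the text: a positive r means the item
# "behaves" (higher scorers tended to get it right); a negative r means
# the "wrong people" got it right, and the item is discarded.
if r > 0:
    print("item behaves: keep it in the pool")
else:
    print("item misbehaves: discard it")
```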
For the same reason, no item on which girls outscore boys is likely to make it to the mathematics part of the SAT. The wrong people—girls, who have lower overall test scores on the math test—would get it right. I found that simply setting a math item in a girls’ camp caused girls to outperform boys on it by a narrow margin. ETS set the same item in a boys’ camp; as a result, boys outscored girls on it by a wide margin. This was OK; the right people got the item right.24
SOCIAL CLASS AFFECTS “STANDARDIZED” TEST SCORES
Most “standardized” tests, including the SAT, also correlate strongly with social class. Again, test bias is partly to blame, and again, looking at a prospective item illustrates the problem. Consider this analogy:
Spline is to miter as ______ is to ______25
In a heuristic exam I developed, the “Loewen Low-Aptitude Test,” I provided these five alternatives:
a. love … marriage
b. straw … mud
c. key … lock
d. bond … bail
e. bond … paper
Over the years, I have given this item to thousands of people. Among college undergraduates, no more than one person in a hundred gets it right and can explain why. It is a difficult item. First, it relies on terms that fair numbers of working-class Americans know but few members of the middle and upper classes recognize. “Spline” and “miter” are carpentry terms. A miter joint is a way to cut and glue two pieces of wood to form a joint often used in picture frames. Although attractive, because it conceals the end grain, this joint is weak. Weight placed on the top piece, for example, stresses the joint and causes the glue to let go. Hence such a joint is often used for a picture frame but rarely for a desk.
A spline is a piece of wood inserted into a miter joint to make it strong. A carpenter uses a table saw to cut matching notches—called “kerfs”—in the angled faces to be joined. Then s/he cuts or carves a small piece of wood—perhaps in a contrasting color like walnut—to fit the newly created space. When the three pieces are glued together, a downward force on the top piece no longer stresses the glue; the spline locks the joint in place.
Note that as with the dashiki item, knowing the vocabulary does not make this question trivially easy. Straw goes into mud to make bricks, to be sure, but doesn’t love also go into marriage to make it stronger? The computerized grader replies, “Well, no. There is no evidence that love strengthens marriages. In most countries, including the United States, arranged marriages last longer, with fewer divorces, than love marriages.” Besides, even if data showed that love did strengthen marriage, the analogy is not as close, not as physical. Bricks made with straw can literally support a weight placed on top of them, as can a miter joint made with a spline.
What about the alternatives? A key goes into a lock to lock it, “strengthening” in a sense, but it also goes in to unlock the lock, making it “weak.” The other two choices make no sense whatsoever—although they draw votes from people who don’t have a clue as to the vocabulary.
Thus, unlike the sake item, choosing among three of the alternatives on the spline analogy tests higher-level reasoning. Among working-class students who know the vocabulary, it might therefore correlate with and hence help predict college GPA.26 Why then doesn’t this question appear on an SAT? Why has no item using working-class vocabulary ever appeared on an SAT?
The obvious answer, after having viewed the graphs presented in this chapter, is that the wrong people—members of the working class—would get a working-class item correct. Working-class students are “wrong” in that on average they will score lower overall—partly because other items on the test rely on upper-middle-class culture. This obvious answer seems right but is too elegant. In fact, test-makers rarely even propose a working-class item, because they don’t know working-class culture. Test-makers come from the same social class background as the students taking my test—the upper middle class. Like them, they have never heard of “kerf” or “spline” and just barely know “miter.” They could never make up a question based on carpentry terms.
These problems of test bias afflict any test that uses point-biserial correlation coefficients, which includes most multiple-choice tests in U.S. history.27
INTERNALIZING EXPECTATIONS
Most students—for that matter, most adults—do not understand that “standardized” tests are biased and inadequate ways of measuring aptitude. We noted that the SAT is supposed to predict first-semester college grades. It does this badly: correlations between SAT scores and college GPA are often as low as .33. At most colleges, high school GPA is a better predictor. Indeed, for more than fifteen years, “SAT” has not stood for “Scholastic Aptitude Test.” In June 1989, the U.S. Civil Rights Commission held a “consultation” in Washington, D.C., entitled The Validity of Testing in Education and Employment. I was the lead presenter, followed by Nancy S. Cole, then executive vice president of ETS (soon to be president), and several other testing experts. In the ensuing discussion, I stated,
I don’t think that the “A” in the SAT is merited. I don’t think it should be called the “Scholastic Aptitude Test.” I don’t think SAT scores measure who will be an apt student next year in college precisely enough to label a student as “inept” or “apt.”
Cole then said,
The word “aptitude” in the Scholastic Aptitude Test has existed for a very long time…. The word “aptitude” has become associated with intelligence or an inherent characteristic in an individual. This is not the intended meaning of the word, but because those associations are made and wrong interpretations follow, I would also prefer that the Scholastic Aptitude Test didn’t have the term “aptitude” in its name.
At that point Lloyd Bond, professor of education at the University of North Carolina, chimed in: “I agree.”28 Two years later, bowing to such arguments, ETS renamed the SAT the “Scholastic Assessment Test.” After three more years, during which time it became painfully apparent that “Scholastic Assessment Test” was repetitive, said the same thing twice, and repeated itself, ETS renamed the SAT once more—to “SAT”! “It does not stand for anything,” as the College Board, ETS’s governing body, put it in 1994.
Few people know this history. More than fifteen years after “aptitude” was dropped from the name, published test preparation books still call it the “Scholastic Aptitude Test,” counselors call it the “Scholastic Aptitude Test,” and students still think it is called the “Scholastic Aptitude Test.” ETS knows this. Indeed, ETS makes use of it: when “Scholastic Aptitude Test” is typed into a search engine, the home page of the SAT comes up.29 “Get ready to take the SAT!” it begins cheerfully. Yet “Scholastic Aptitude Test” does not appear on that page—the webmaster has made the text invisible, but it resides there nevertheless, rewarding the seeker who mistakenly enters it. Nowhere on the site does ETS let on that “Scholastic Aptitude Test” is a misnomer, indeed has been wrong since 1991. Nowhere on it does ETS provide any warning that the SAT does not measure aptitude.
As one result, some students who score poorly on “standardized” tests infer that they have low aptitude. One of the most brilliant undergraduates I have ever taught came to this conclusion. She was a psychology major at Tougaloo, but because psychology did not then offer a course on statistics and research methods, she took my course in sociology. It was tough, requiring thirteen different assignments during the semester. When I brought the course to the University of Vermont, I had to make it easier, because its demands were too far out of line with existing courses there.30 My Tougaloo student not only mastered everything the course threw her way; she also learned the statistical technique called chi-square on her own. Yet she scored only about 370 on the Graduate Record Exam. Her school of choice, Vanderbilt, rejected her on that account, despite my glowing letter of reference. She internalized the rejection, concluding she probably could not have succeeded there anyway. I had graduated from a program at Harvard that put me in classes with many psychology graduate students and I knew she was as capable as most of them, but my protestations did not convince her. She believed what her test scores told her.
If the GRE could have such an impact on a brilliant college undergraduate, the SAT can have far worse effects on capable—but still developing—high school students. To be sure, some eighteen-year-olds have the audacity to say that the tests are wrong and the self-confidence to go on to do well enough to disprove them. Other low scorers preserve some self-esteem by saying, “I just don’t test well,” while still believing they are capable. Often, however, their self-assurance is part bravado; they are no longer sure they are good college material.