Super Crunchers
Kenneth Hammond, the former director of Colorado’s Center for Research on Judgment and Policy, reflects with some amusement on the resistance of clinical psychologists to Meehl’s overwhelming evidence:
One might ask why clinical psychologists are offended by the discovery that their intuitive judgments and predictions are (almost) as good as, but (almost) never better than, a rule. We do not feel offended at learning that our excellent visual perception can often be improved in certain circumstances by the use of a tool (e.g., rangefinders, telescopes, microscopes). The answer seems to be that tools are used by clerks (i.e., someone without professional training); if psychologists are no different, then that demeans the status of the psychologist.
This transformation of clinicians to clerks is indicative of a larger trend. Something has happened out there that has triggered a shift of discretion from traditional experts to a new breed of Super Crunchers, the people who control the statistical equations.
CHAPTER 6
Why Now?
My nephew Marty has a T-shirt that says on the front: “There are 10 types of people in the world…” If you read these words and are trying to think of what the ten types are, you’ve already typed yourself.
The back of the shirt reads: “Those that understand binary, and those that don’t.” You see, in a digitalized world, all numbers are represented by 0s and 1s, so what we know as the number 2 is represented to a computer as 10. It’s the shift to binary bytes that is at the heart of the Super Crunching revolution.
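The joke on the shirt can be checked in a few lines. This is an illustrative Python sketch, not something from the original text, and it relies only on the built-in `format` and `int` functions.

```python
# Decimal 2 written in base 2 is the digit string "10".
two_in_binary = format(2, "b")
print(two_in_binary)

# Reading the shirt's "10" as a binary numeral gives two types of people.
types_of_people = int("10", 2)
print(types_of_people)
```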
More and more information is digitalized in binary bytes. Mail is now email. From health care files to real estate and legal filings, electronic records are everywhere. Instead of starting with paper records and then manually inputting information, data are increasingly captured electronically in the very first instance—such as when we swipe our credit card or the grocery clerks scan our deodorant purchase at the checkout line. An electronic record of most consumer purchases now exists.
And even when the information begins on paper, inexpensive scanning technologies are unlocking the wisdom of the distant and not-so-distant past. My brother-in-law used to tell me, “You can’t Google dead trees.” He meant that it was impossible to search the text of books. Yet now you can. For a small monthly fee, Questia.com will give you full text access to over 67,000 books. Amazon.com’s “Search Inside the Book” feature allows Internet users to read snippets from full text searches of over 100,000 volumes. And Google is attempting something that parallels the Human Genome Project in both its scope and scale. The Human Genome Project had the audacity to think that within the space of thirteen years it could sequence three billion base pairs. Google’s “Book Search” is ambitiously attempting to scan the full text of more than thirty million books in the next ten years. Google intends to scan every book ever published.
From 90 to 3,000,000
The increase in accessibility to digitalized data has been a part of my own life. Way back in 1989 when I had just started teaching, I sent six testers out to Chicagoland new car dealerships to see if dealers discriminated against women or minorities. I trained the testers to follow a uniform script that told them how to bargain for a car. The testers even had uniform answers to any questions the salesman might ask (including “I’m sorry but I’m not comfortable answering that”). The testers walked alike, talked alike. They were similar on every dimension I could think of except for their race and sex. Half the testers were white men and half were either women or African-Americans. Just like a classic fair housing test, I wanted to see if women or minorities were treated differently than white men.
They were. White women had to pay 40 percent higher markups than white men; black men had to pay more than twice the markup, and black women had to pay more than three times the markup of white male testers. My testers were systematically steered to salespeople of their own race and gender (who then gave them worse deals).
The study got a lot of press when it was published in the Harvard Law Review. Primetime Live filmed three different episodes testing whether women and minorities were treated equally not just at car dealerships but at a variety of retail establishments. A lot of people were disturbed by film clips of shoe clerks who forced black customers to wait and wait for service even though no one else was in the store. More importantly, the study played a small role in pushing the retail industry toward no-haggle purchasing.
A few years after my study, Saturn decided to air a television commercial that was centrally about Saturn’s unwillingness to discriminate. The commercial was composed entirely of a series of black-and-white photographs. In a voiceover narrative, an African-American man recalls his father returning home after purchasing a car and feeling that he had been mistreated by the salesman. The narrator then says maybe that’s why he feels good about having become a salesperson for Saturn. The commercial is a remarkable piece of rhetoric. The stark photographic images are devoid of the smiles that normally populate car advertisements. Instead there is a heartrending shot of a child realizing that his father has been mistreated because of their shared race and the somber but firmly proud shot of the grown, grim-faced man now having taken on the role of a salesman who does not discriminate. The commercial does not explicitly mention race or Saturn’s no-haggle policy—but few viewers would fail to understand that race was a central cause of the father’s mistreatment.
The really important point is that all this began with six testers bargaining at just ninety dealerships. While I ultimately did a series of follow-up studies analyzing the results of hundreds of additional bargains, the initial uproar came from a very small study. Why so small? It’s hard to remember, but this was back in the day before the Internet. Laptop computers barely existed and were expensive and bulky. As a result, all of my data were first collected on paper and then had to be hand-entered (and re-entered) into computer files for analysis. Technologically, back then, it was harder to create digital data.
Fast-forward to the new millennium, and you’ll still find me crunching numbers on race and cars. Now, however, the datasets are much, much bigger. In the last five years, I’ve helped to crunch numbers in massive class-action litigation against virtually all of the major automotive lenders. With the yeoman help of Vanderbilt economist Mark Cohen (who really bore the laboring oar), I have crunched data on more than three million car sales.
While most consumers now know that the sales price of a car can be negotiated, many do not know that auto lenders, such as Ford Motor Credit or GMAC, often give dealers the option of marking up a borrower’s interest rate. When a car buyer works with the dealer to arrange financing, the dealer normally sends the customer’s credit information to a potential lender. The lender then responds with a private message to the dealer that offers a “buy rate,” the interest rate at which the lender is willing to lend. Lenders often will pay a dealer, sometimes thousands of dollars, if the dealer can get the consumer to sign a loan with an inflated interest rate. For example, Ford Motor Credit might tell a dealer that it is willing to lend Susan money at a 6 percent interest rate, but that it will pay the dealership $2,800 if the dealership can get Susan to sign an 11 percent loan. The borrower would never be told that the dealership was marking up the loan. The dealer and the lender would then split the expected profits from the markup, with the dealership taking the lion’s share.
In a series of cases that I worked on, African-American borrowers challenged the lenders’ markup policies because they disproportionately harmed minorities. Cohen and I found that on average white borrowers paid what amounted to about a $300 markup on their loans, while black borrowers paid almost $700 in markup profits. Moreover, the distribution of markups was highly skewed. Over half of white borrowers paid no markup at all, because they qualified for loans where markups were not allowed. Yet 10 percent of GMAC borrowers paid more than $1,000 in markups and 10 percent of the Nissan customers paid more than a $1,600 markup. These high markup borrowers were disproportionately black. African-Americans were only 8.5 percent of GMAC borrowers, but paid 19.9 percent of the markup profits. The markup difference wasn’t about credit scores or default risk; minority borrowers with good credit routinely had to pay higher markups than white borrowers with similar credit scores.
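The disproportion in the GMAC figures can be made concrete with a little arithmetic. The sketch below uses only the two percentages quoted above. The variable names are illustrative, and the second ratio rests on the simplifying assumption that every borrower who is not African-American falls into a single comparison group.

```python
# Figures quoted in the text for GMAC borrowers.
black_share_of_borrowers = 0.085   # 8.5 percent of borrowers
black_share_of_markup = 0.199      # 19.9 percent of markup profits

# Markup paid per black borrower, relative to the average over all borrowers.
vs_average = black_share_of_markup / black_share_of_borrowers

# Markup paid per black borrower, relative to markup per other borrower
# (assumption: all remaining borrowers are treated as one group).
per_other = (1 - black_share_of_markup) / (1 - black_share_of_borrowers)
vs_others = vs_average / per_other

print(round(vs_average, 2))
print(round(vs_others, 2))
```

Both ratios come out above two, which is in line with the roughly $300 versus $700 average markups reported for white and black borrowers.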
These studies were only possible because lenders now keep detailed electronic records of every transaction. The one variable they don’t keep track of is the borrower’s race. Once again, though, technology came to the rescue. Fourteen states (including California) will, for a fee, make public the information from their driver’s license database—information that includes the name, race, and Social Security number of the driver. Since the lenders’ datasets also included the Social Security numbers of their borrowers (so that they could run credit checks), it was child’s play to combine the two different datasets. In fact, because so many people move around from state to state, Cohen and I were able to identify the race of borrowers for thousands upon thousands of loans that took place in all fifty states. We’d know the race of a lot of people who bought cars in Kansas, because sometime earlier or later in their lives they took out a driver’s license in California. A study that would have been virtually impossible to do ten years earlier had now become, if not easy, at least relatively straightforward. And in fact, Cohen and I did the study over and over as the cases against all the major automotive lenders moved forward. The cases have been a resounding success: lender after lender has agreed to cap the amount that dealerships can mark up loans. All borrowers regardless of their race are now protected by caps that they don’t even know about. Unlike my initial test of a few hundred negotiations, these statistical studies of millions of transactions were only possible because the information now is stored in readily accessible digital records.
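Combining the two datasets amounts to a join on a shared key. The following is a minimal sketch of that idea in Python, not the procedure actually used in the litigation. All records, field names, and Social Security numbers are invented for illustration (numbers beginning with 000 are never issued).

```python
# Hypothetical lender records: no race field, but an SSN for the credit check.
loans = [
    {"ssn": "000-00-0001", "state": "KS", "markup": 1200},
    {"ssn": "000-00-0002", "state": "KS", "markup": 0},
    {"ssn": "000-00-0003", "state": "TN", "markup": 650},
]

# Hypothetical driver's license records purchased from one state.
licenses = [
    {"ssn": "000-00-0001", "state": "CA", "race": "black"},
    {"ssn": "000-00-0002", "state": "CA", "race": "white"},
]

# Index the license file by SSN so each loan can be looked up directly.
race_by_ssn = {rec["ssn"]: rec["race"] for rec in licenses}

# Keep only the loans whose borrower appears in the license file.
matched = [
    {**loan, "race": race_by_ssn[loan["ssn"]]}
    for loan in loans
    if loan["ssn"] in race_by_ssn
]

for row in matched:
    print(row["state"], row["race"], row["markup"])
```

Note that the two matched loans were made in Kansas even though the license records come from California, which mirrors the point about borrowers who move between states.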
Trading in Data
The willingness of states to sell information on the race of their own citizens is just a small part of the commercialization of data. Digitalized data has become a commodity. And both public and private vendors have found the value of aggregating information. For-profit database aggregators like Acxiom and ChoicePoint have flourished. Since its founding in 1997, ChoicePoint has acquired more than seventy smaller database companies. It will sell clients one file that contains not only your credit report but also your motor-vehicle, police, and property records together with birth and death certificates and marriage and divorce decrees. While much of this information was already publicly available, ChoicePoint’s billion dollars in annual revenue suggests that there’s real value in providing one-stop data-shopping.
And Acxiom is even larger. It maintains consumer information on nearly every household in the United States. Acxiom, which has been called “one of the biggest companies you’ve never heard of,” manages twenty billion customer records (more than 850 terabytes of raw data—enough to fill a 2,000-mile tower of one billion diskettes).
Like ChoicePoint, a lot of Acxiom’s information is culled from public records. Yet Acxiom combines public census data and tax records with information supplied by corporations and credit card companies that are Acxiom clients. It is the world’s leader in CDI, consumer data integration. In the end, Acxiom probably knows the catalogs you get, what shoes you wear, maybe even whether you like dogs or cats. Acxiom assigns every person a thirteen-digit code and places them in one of seventy “lifestyle” segments ranging from “Rolling Stones” to “Timeless Elders.” To Acxiom, a “Shooting Star” is someone who is thirty-six to forty-five, married, no kids yet, wakes up early and goes for runs, watches Seinfeld reruns, and travels abroad. These segments are so specific to the time of life and triggering events (such as getting married) that nearly one-third of Americans change their segment each year. By mining its humongous database, Acxiom not only knows what segment you are in today but it can predict what segment you are likely to be in next year.
The rise of Acxiom shows how commercialization has increased the fluidity of information across organizations. Some large retailers like Amazon.com and Wal-Mart simply sell aggregate customer transaction information. Want to know how well Crest toothpaste sells if it’s placed higher on the shelf? Wal-Mart will sell you the answer. But Acxiom also allows vendors to trade information. By providing Acxiom with transaction information about its individual customers, a retailer can gain access to a data warehouse of staggering proportions.
Do the Mash
The popular Internet mantra “information wants to be free” is centrally about the ease of liberating digital data so that it can be exploited by multiple users. The rise of database decision making is driven by the increasing access to what was OPI—other people’s information. Until recently, many datasets—even inside the same corporation—couldn’t easily be linked together. Even a firm that maintained two different datasets often had trouble linking them if the datasets had incompatible formats or were developed by different software companies. A lot of data were kept in isolated “data silos.”
These technological compatibility constraints are now in retreat. Data files in one format are easily imported and exported to other formats. Tagging systems allow single variables to have multiple names. So a retailer’s extra-large clothes can simultaneously be referred to as “XL” and “TG” (for the French term très grande). Almost gone are the days where it was impossible to link data stored in non-compatible proprietary formats.
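One simple way to picture a tagging system is as an alias table that maps every label to a single canonical value. The sketch below is a hypothetical illustration in Python. The table, the function name `canonical_size`, and the extra "L" and "G" entries are assumptions, not part of any real retailer's system.

```python
# Hypothetical alias table: several labels map to one canonical size code.
SIZE_ALIASES = {
    "XL": "extra_large",
    "TG": "extra_large",
    "L": "large",
    "G": "large",
}

def canonical_size(label: str) -> str:
    """Return the canonical size for a label, ignoring case and padding."""
    return SIZE_ALIASES[label.strip().upper()]

# Two rows for the same item, one from each (hypothetical) national catalog.
us_row = {"sku": "A1", "size": "XL"}
fr_row = {"sku": "A1", "size": "tg "}

same_size = canonical_size(us_row["size"]) == canonical_size(fr_row["size"])
print(same_size)
```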
What’s more, there is a wealth of non-proprietary information on the web just waiting to be harvested and merged into pre-existing datasets. “Data scraping” is the now-common practice of programming a computer to surf to a set of sites and then systematically copy information into a database. Some data scraping is pernicious—such as when spammers scrape email addresses off websites to create their spam lists. But many sites are happy to have their data taken and used by others. Investors looking for fraud or accounting hijinks can scrape data from quarterly SEC filings of all traded corporations. I’ve used computer programs to scrape data for eBay auctions to create a dataset for a study I’m doing about how people bid on baseball cards.
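The core of a scraper is a parser that walks a page and copies selected fields into records. The sketch below uses Python's standard `html.parser` module on a small embedded page rather than a live site. The page markup, class names, and listings are invented for illustration and do not reflect eBay's actual pages.

```python
from html.parser import HTMLParser

# Stand-in for a downloaded page; a real scraper would fetch this over HTTP.
PAGE = """
<table>
  <tr class="listing"><td class="title">1952 Mantle</td><td class="bid">310.00</td></tr>
  <tr class="listing"><td class="title">1989 Griffey</td><td class="bid">42.50</td></tr>
</table>
"""

class ListingParser(HTMLParser):
    """Collect (title, bid) pairs from table cells tagged by class."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (title, bid) records
        self._field = None    # class of the <td> currently open, if any
        self._current = {}    # fields gathered for the row in progress

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        # Text outside a tagged cell (such as whitespace) is ignored.
        if self._field in ("title", "bid"):
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._current:
            self.rows.append((self._current["title"], float(self._current["bid"])))
            self._current = {}

parser = ListingParser()
parser.feed(PAGE)
print(parser.rows)
```

Pointing the same parser at many pages in a loop and appending the rows to a file or database gives the systematic copying described above.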
A variety of programmers have combined the free geographical information from Google maps with virtually any other dataset that contains address information. These data “mashups” provide striking visual maps that can represent the crime hot spots or campaign contributions or racial composition or just about anything else. Zillow.com mashes up public tax information about house size and other neighborhood characteristics together with information about recent neighborhood sales to produce beautiful maps with predicted housing values.
A “data commons” movement has created websites for people to post and link their data with others. In the last ten years, the norm of sharing datasets has become increasingly grounded in academia. The premier economics journal in the United States, the American Economic Review, requires that researchers post to a centralized website all the data backing up their empirical articles. So many researchers are posting their datasets to their personal web pages that it is now more likely than not that you can download the data for just about any empirical article by just typing a few words into Google. (You can find tons of my datasets at www.law.yale.edu/ayres/.)
Data aggregators like Acxiom and ChoicePoint have made an art of finding publicly available information and merging it into their pre-existing databases. The FBI has information on car theft in each city for each year; Allstate has information about how many anti-theft devices were used in particular cities in particular years. Nowadays, regardless of the initial digital format, it has become a fairly trivial task to link these two types of information together. Today, it’s even possible to merge datasets when there doesn’t exist a single unique identifier, such as the Social Security number, to match up observations. Indirect matches can be made by looking for similar patterns. For example, if you want to match house purchases from two different records, you might look for purchases that happened on the same day in the same city.
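The house-purchase example can be sketched as a match on a composite key of city and date. The Python below is a minimal illustration with invented records and field names. Accepting only keys that point to exactly one candidate is a design choice assumed here, not something stated in the text.

```python
from collections import defaultdict

# Hypothetical records of the same purchases from two sources with no shared ID.
county_deeds = [
    {"deed_id": "D1", "city": "New Haven", "date": "2006-03-14", "price": 310000},
    {"deed_id": "D2", "city": "Hartford", "date": "2006-03-14", "price": 205000},
    {"deed_id": "D3", "city": "New Haven", "date": "2006-05-02", "price": 450000},
]
lender_files = [
    {"loan_id": "M7", "city": "new haven", "date": "2006-03-14"},
    {"loan_id": "M8", "city": "New Haven", "date": "2006-05-02"},
    {"loan_id": "M9", "city": "Stamford", "date": "2006-05-02"},
]

def match_key(rec):
    """Composite key used in place of a unique identifier."""
    return (rec["city"].strip().lower(), rec["date"])

# Group deeds by key so ambiguous keys (several sales that day) are visible.
deeds_by_key = defaultdict(list)
for deed in county_deeds:
    deeds_by_key[match_key(deed)].append(deed)

matches = []
for loan in lender_files:
    candidates = deeds_by_key.get(match_key(loan), [])
    if len(candidates) == 1:          # accept only unambiguous matches
        matches.append((loan["loan_id"], candidates[0]["deed_id"]))

print(matches)
```

In a busy city, many purchases share a day, so a real matcher would likely add further fields (price, street, buyer name) to the key before trusting a match.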
Yet the art of indirect matching can also be prone to error. Database Technologies (DBT), a company that was ultimately purchased by ChoicePoint, got in a lot of trouble for indirectly identifying felons before the 2000 Florida elections. The state of Florida hired DBT to create a list of potential people to remove from the list of registered voters. DBT matched the database of registered voters to lists of convicted felons not just from Florida but from every state in the union. The most direct and conservative means to match would have been to use the voter’s name and date of birth as necessary identifiers. But DBT, possibly under direction from Florida’s Division of Elections, cast a much broader net in trying to identify potential convicts. Its matching algorithm required only a 90 percent match between the name of the registered voter and the name of the convict. In practice this meant that there were lots of false positives, registered voters who were wrongly identified as possibly being convicts. For example, the Rev. Willie D. Whiting, Jr., a registered voter from Tallahassee, was initially told that he could not vote because someone named Willie J. Whiting, born two days later, had a felony conviction. The Division of Elections also required DBT to perform “nickname matches” for first names and to match on first and last names regardless of their order, so that the name Deborah Ann would also match the name Ann Deborah, for example.
The combination of these low matching requirements together with the broad universe of all state felonies produced a staggeringly large list of 57,746 registered Floridians who were identified as convicted felons. The concern was not just with the likely large number of false positives, but also with the likelihood that a disproportionate number of the so-called purged registrations would be for African-American voters. This is especially true because the algorithm was not relaxed when it came to race. Only registered voters who exactly matched the race of the convict were subject to exclusion from the voting rolls. So while Rev. Whiting, notwithstanding a different middle initial and birth date, could match convict Willie J. Whiting, a white voter with the same name and birth date would not qualify because the convict Whiting was black.
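The difference between a loose and a conservative matching rule can be sketched in a few lines of Python. This is a simplified illustration, not DBT's actual algorithm. The similarity score from the standard library's `difflib.SequenceMatcher` stands in for whatever name-similarity measure DBT used, nickname matching is omitted, and the birth dates are invented placeholders that preserve only the two-day gap mentioned in the text.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; word order is ignored by sorting the name parts."""
    norm_a = " ".join(sorted(a.lower().replace(",", "").split()))
    norm_b = " ".join(sorted(b.lower().replace(",", "").split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()

def loose_match(voter, felon, threshold=0.90):
    """Loose rule: approximate name, exact race, birth date not checked."""
    return (voter["race"] == felon["race"]
            and name_similarity(voter["name"], felon["name"]) >= threshold)

def strict_match(voter, felon):
    """Conservative rule: exact name and exact date of birth."""
    return voter["name"] == felon["name"] and voter["dob"] == felon["dob"]

# Birth dates are invented placeholders; only the two-day gap comes from the text.
voter = {"name": "Willie D. Whiting", "dob": "1950-01-01", "race": "black"}
felon = {"name": "Willie J. Whiting", "dob": "1950-01-03", "race": "black"}

print(round(name_similarity(voter["name"], felon["name"]), 2))
print(loose_match(voter, felon))
print(strict_match(voter, felon))
```

Under the loose rule the two Whitings match, because the names differ by a single letter and the race fields agree. Under the conservative rule they do not. Changing the voter's race field to any other value makes the loose rule reject the pair as well, which is the asymmetry described above.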