Dataclysm: Who We Are (When We Think No One's Looking), by Christian Rudder

OkCupid’s user-submitted profile essays are as close to personal self-summaries as you’ll find. The prompts are open-ended:

“My self-summary …”

“I’m really good at …”

“The first things people usually notice about me are …”

“I spend a lot of time thinking about …”

And insofar as people try to put their best foot forward, they’re not at all unlike college essays. I imagine many people approach them with the same sort of dread. There are no length restrictions, no guidelines but for the prompts. Altogether, people have given the site 3.2 billion words of self-description. Moreover, unlike other big hunks of text—say, what Google Books has collected—there are demographics behind every word: the age of the author, where she lives, her race, and so on. But deriving a group identity for, say, Asian women from the text isn’t quite as easy as counting up who types what the most, which for the most part is how we’ve looked at text so far in this book. Counting words just gets us this:

1. the

2. of

3. and

4. …

and so on down the line—basically that top 100 from the Oxford English Corpus we saw before. Asian women, white men, and all English speakers use the same pronouns and articles and prepositions to talk about themselves. To find out what’s actually special to a particular group, and to them alone, we have to sort the text a little differently.

I’ll use white men as my walk-through example, because I understand them the best. The first step is to separate those white guys’ essays from everyone else’s. Then, in the two sets of self-descriptions—white-guy and not—we order all the words and phrases in the texts by how frequently they appear. We put them into two lists, from most popular to least, and that gives us something like the chart below. I’ve pulled out three examples and put them in their correct places in the line; the full lists have about 360,000 phrases each:

Already we’re getting somewhere, but before we move on, there’s something a little misleading about these plots that I want to address while the list is still simple. No, it’s got nothing to do with Phish, though lord knows they’ve misled many. It’s that “pizza” and “the” appear to be mentioned almost the same number of times. Granted, pizza is the king of foods, but “the” is the absolute most popular word in the English language. And in our data, while “the” is in its rightful place at the top, “pizza” is seemingly right there with it, at the 98th percentile. This makes it feel like something is wrong either with my data or with my method, but the rankings of the words are correct. It’s just that humans use language in an odd way: we are always repeating ourselves. So a very few top-ranked words take up most of our writing. And, conversely, the frequency of a word falls off very quickly as you go even a small distance from “most popular.”
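To make the gap between rank and raw count concrete, here is a minimal Python sketch. The three toy essays and the function name `percentile_ranks` are invented for illustration; they are not OkCupid data or the scripts used for the book, and the percentile formula is just one simple convention.

```python
from collections import Counter

def percentile_ranks(texts):
    """Return {word: (count, percentile)}; the most frequent word sits at 100."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent first; ties are broken alphabetically so the output is stable.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    n = len(ordered)
    return {w: (counts[w], 100.0 * (n - i) / n) for i, w in enumerate(ordered)}

# Invented toy essays, only to exercise the function.
essays = ["the cat and the dog", "the pizza and the park", "the the lake"]
ranks = percentile_ranks(essays)
for word in ("the", "and", "cat"):
    count, pct = ranks[word]
    print(f"{word}: count={count}, percentile={pct:.0f}")
```

Even in this tiny example, "the" is used three times as often as "and", yet the two sit next to each other in the ranking. A percentile records only a word's position in the list, not how much more often it is used than its neighbors.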

This counterintuitive relationship between the popularity of a word (its rank in a given vocabulary) and the number of times it appears is described by something called Zipf’s law, an observed statistical property of language that, like so much of the best math, lies somewhere between miracle and coincidence.1 It states that in any large body of text, a word’s popularity (its place in the lexicon, with 1 being the highest ranking) multiplied by the number of times it shows up is roughly the same for every word in the text. Or, very elegantly:

rank × number = constant

This law holds for the Bible, the collected lyrics of ’60s pop songs, the canonical corpus of English literature (the Oxford English Corpus), and it certainly holds for profile text. To see how well it works in practice even on a highly idiosyncratic body of writing, here’s the law applied to James Joyce’s Ulysses:2

word | rank | number of times it appears | rank × number
’s | 10 | 2,826 | 28,260
is | 20 | 1,435 | 28,700
what | 30 | 975 | 29,250
has | 100 | 289 | 28,900
wife | 200 | 140 | 28,000
Ireland | 300 | 90 | 27,000
college | 1,000 | 26 | 26,000
morn | 5,000 | 5 | 25,000
builder | 10,000 | 2 | 20,000
Zurich | 29,055 | 1 | 29,055
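The arithmetic in the table can be rechecked in a few lines of Python. The ranks and counts below are copied from the Ulysses rows above; the variable names are only illustrative.

```python
# (word, rank, count) rows copied from the Ulysses table above.
rows = [
    ("'s", 10, 2826),
    ("is", 20, 1435),
    ("what", 30, 975),
    ("has", 100, 289),
    ("wife", 200, 140),
    ("Ireland", 300, 90),
    ("college", 1000, 26),
    ("morn", 5000, 5),
    ("builder", 10000, 2),
    ("Zurich", 29055, 1),
]

# Zipf's law says rank * count should stay roughly constant.
products = {word: rank * count for word, rank, count in rows}
lowest, highest = min(products.values()), max(products.values())
count_spread = max(c for _, _, c in rows) / min(c for _, _, c in rows)

for word, value in products.items():
    print(f"{word:>8}: {value:,}")
print(f"products vary by a factor of {highest / lowest:.2f}")
print(f"counts vary by a factor of {count_spread:,.0f}")
```

The raw counts differ by a factor of thousands, while the products stay within a factor of about 1.5 of one another. That is the sense in which the "constant" holds: approximately, not exactly.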

The steady relationship between rank and number seems to be a property of the mind as much as of language—as you can see above, it accommodates arbitrary proper names, like “Ireland” and “Zurich,” and even words transcribed from dialect, like “ ’s.”

And as further evidence of its deep connection with the human experience, Zipf’s law also describes a wide variety of our social constructs: the sizes of cities, for example, and income distribution across a population. What it means for our purpose here is that because most of language is just a small body of repeated patterns, the use of a word drops off rapidly. “The” appears on nearly every profile. “Pizza” appears on about 1 in 14. “Phish,” even for white guys, for whom it ranks way up at the 80th percentile, appears in less than 1 in 200 profiles. Now that we understand how rankings and usage frequency compare, the next step is to use those rankings to our advantage.

Below, I’ve put the two lists at right angles, forming a square, and I have plotted the words inside it using their popularity rankings on the two lists as coordinates. I added some arrows around “Phish” to make it clear what I mean:

A word’s position here has dual meaning. The closer to the top it appears, the more popular it is with white guys. The farther toward the right, the more popular it is with everyone else. Adding a few more words to the chart will give you a sense of how the geometry translates before I zoom out to the full corpus:

I’ve added a diagonal, yet again, to show parity in the data. The words near the line are important to everyone equally. And the farther up and to the right the words go, the more universally important they are. But remember, we’re not looking for universals. We’re looking for particulars. We want to know what is special to the people we’re considering: here, white guys. For that we need to look to the upper left: the farther in that direction a word appears, the more often white men use it, and the less often everyone else does. In fact, the closer a word is to that remotest reach of white maleness, the top-left vertex of the square, the more it typifies them and only them. Imagine a dot all the way in the corner: to be there, the word would have to appear on every single white male profile and at the same time never appear anywhere else. At least as far as words in a self-summary go, that’s the platonic ideal of identity. This system, and that metric—distance from the upper-left corner—gives the data a way to speak to us, to help us understand how people are talking about themselves.
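The geometry described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the book's charts: the counts are invented toy numbers, the function names are hypothetical, ranks are rescaled to run from 0 to 1, and the distance is assumed to be plain Euclidean distance to the corner.

```python
import math

def rank_fractions(counts, vocab):
    """Place each word on a 0..1 scale: 1.0 is the most used, near 0 the least."""
    # Least to most frequent; ties broken alphabetically (an arbitrary choice).
    ordered = sorted(vocab, key=lambda w: (counts.get(w, 0), w))
    n = len(ordered)
    return {w: (i + 1) / n for i, w in enumerate(ordered)}

def most_typical(group_counts, other_counts, k=3):
    """Return the k words closest to the upper-left corner of the square."""
    vocab = set(group_counts) | set(other_counts)
    y = rank_fractions(group_counts, vocab)  # vertical: popularity within the group
    x = rank_fractions(other_counts, vocab)  # horizontal: popularity with everyone else
    # Upper-left corner is (x=0, y=1): top of the group's list, bottom of the other.
    dist = {w: math.hypot(x[w], 1.0 - y[w]) for w in vocab}
    return sorted(vocab, key=lambda w: (dist[w], w))[:k]

# Invented toy counts, not real profile data.
group = {"the": 100, "pizza": 20, "phish": 8, "kpop": 1}
everyone_else = {"the": 300, "pizza": 50, "kpop": 9, "phish": 2}
ordering = most_typical(group, everyone_else, k=4)
print(ordering)
```

With these toy numbers, "phish" lands nearest the corner and "the" lands farthest from it, since "the" tops both lists and therefore sits in the upper right.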

Because every data set has its quirks, researchers must often build tools from scratch, as we have here. Whenever you do this, it’s good to check your method against some familiar outcomes. Imagine a shipwright with a new boat: who knows what’ll happen once it’s out on the open ocean—so best to check for holes close to shore. Here, if we’d found “Kpop” (Korean pop) or “dreads” in the upper left, in my supposed corner of white-manhood, it would be a strong sign that either my data or my method was garbage. But as you can see, it’s working perfectly.

So, finally, here’s what the whole corpus of words and phrases looks like:

I’ve circled the dot closest to that upper-left corner: that’s the white-male-est thing a person can write about himself: my blue eyes. And getting a longer list of the things that uniquely define white men is just a matter of walking out from that vertex—for example, the thirty closest dots are the thirty things that are most typical. The geometry finds the clichés for us.

I’ve made plots like this for everyone in my data set, not just white guys, and using this same math I’ve gotten lists of their unique words and phrases, too. But before I move to listing all this, I want to make one important point. Walking through each combination of sex × ethnicity × orientation gives you 2 × 4 × 3 = 24 charts like the one above, and in all of them the mass of dots has this same tapered shape from bottom left to top right. That is, the farther a phrase goes into that upper-right corner, the closer to the diagonal it gets. What that means is that we tend to agree on the things that are most important. As for the things we don’t agree on, I’ve listed them in detail below. I’ll start with the men:3

most typical words for …

white men | black men | Latinos | Asian men
my blue eyes | dreads | colombian | tall for an asian
blonde hair | jill scott | salsa merengue | asians
ween | haitian | cumbia | taiwanese
brown hair | soca | una | taiwan
hunting and fishing | neo soul | merengue bachata | cantonese
allman brothers | jamie foxx | mana | infernal affairs
woodworking | zane | banda | seoul
campfire | paid in full | puertorican | infernal
redneck | nigga | colombia | shanghai
dropkick murphys | luther vandross | gusta | boba
they might be giants | coldest winter | puerto rican | kbbq
brewing beer | tyler perry | tejano | kpop
robert heinlein | swagg | corridos | badminton
tom robbins | jerome | bachata merengue | kimchi
townes | dreadlocks | hector | chungking express
old crow medicine show | spike lee | espa | chou
mystery science theater | holla at me | por | viet
skis | menace to society | salsa bachata | jiro
sailboat | brotha | aventura | dash berlin
around a fire | shottas | english and spanish | ucsd
caddyshack | boomerang | musica | beijing
blond hair | nigerian | espa ol | hk
bill bryson | heartbeats | como | norwegian wood
wheelers | anthony hamilton | fiu | jiro dreams of sushi
pogues | gud | pero | lin
barenaked ladies | wayans | soledad | philippines
mst3k | dickey | espanol | noodle soup
truckers | isley | amor | malaysian
jethro tull | interracial | muy | for my next meal
canoe | nigeria | reggaeton | gangnam style

Phish might’ve already given it away, but inside the white man rages a music festival for lumberjacks.

As for the other three lists, I had never heard of Zane or Anthony Hamilton or The Coldest Winter Ever or Chungking Express or Dash Berlin or a lot of the above before my scripts coughed them up, and I’m not going to pretend that a few minutes with Wikipedia can stand in for an understanding of a culture. These are users speaking in their own voice, and I’m going to let them do just that, but I will point out a few broad trends: white people differentiate themselves mostly by their hair and eyes, Asians by their country of origin, Latinos by their music. But because of the way the math is set up, the three non-white lists are evidence of cultures that I, as a white man, am not supposed to know. Of course, we’re all familiar with Spike Lee and Beijing and Shanghai, but these lists give us the “insiders’ ” view of a culture. It’s stuff an outsider can’t get from autocomplete, or in any other top-down way, because you can’t wonder at what you don’t realize is out there. “Why do Asian people like Norwegian Wood?” isn’t a stereotype because not enough non-Asians are familiar with the book (by Haruki Murakami) and movie. I thought it was just a Beatles song, and if before this chapter someone had asked me if I’d seen Norwegian Wood, I’d have said, “I don’t think they made videos back then.” The lists above are our shibboleths. As such, they are something no one could generate a priori, by typing things into Google Trends or by searching millions of hashtags. Sometimes, it takes a blind algorithm to really see the data.

Here are the lists for women. As you can see, they’re very similar in spirit to the men’s. Maybe a few more ballads.

most typical words for …

white women | black women | Asian women | Latinas
my blue eyes | soca | taiwan | latina
red hair and | eric jerome dickey | tall for an asian | colombian
blonde hair and | haitian | philippines | una
love to be outside | imitation of life | taiwanese | cumbia
mudding | zane | beijing | banda
campfire | coldest winter ever | coz | tejano
four wheeling | nigerian | boba | merengue bachata
phish | interracial | filipina | gusta
hunting fishing | rb and gospel | cantonese | puertorican
campfires | five heartbeats | asians | colombia
green eyes and | anita baker | wong kar wai | mana
redneck | crooklyn | shanghai | vida
auburn | neosoul | seoul | bachata merengue
ride horses | octavia butler | macarons | amor
old crow medicine show | housewives of atlanta | viet | musica
grateful dead | luther vandross | kimchi | english and spanish
mountain goats | zora | for my next meal | espanol
love country music but | waiting to exhale | singapore | salsa merengue
gillian welch | anthony hamilton | malaysian | todo
country girl | chrisette | hk | por
christmas vacation | locs | malaysia | mariachi
bill bryson | outside my race | noodle soup | marc anthony
riding horses | kem | cambodian | espa ol
eric church | octavia | norwegian wood | novelas
barn | real housewives of atlanta | hong kong | como
allman | calypso | chungking express | pero
willie nelson | know why the caged | rachmaninoff | venezuela
harley | did i get married | southeast asia | soledad
brunette | spike lee | vienna | mas
flogging molly | braxton | mandarin | tacuba

I discovered in the course of working with it that the algorithm we used to make these lists is flexible. You can just as easily run the math in reverse. This gives you the antitheses of a group—the stuff they especially don’t talk about—which can be as illuminating as what they especially do. Here are the lists for the men; they are printed on a darker background to visually emphasize that these lists are the opposite of the previous ones. They are the words least used by these groups yet most used by everyone else, the negative space in our verbal Rorschach. The lists are worth reading all the way through:

most antithetical words for …

white men | black men | Asian men | Latinos
slow jams | borges | sence | southern accent
trey songz | social distortion | layed | from the midwest
robin thicke | tallest man on earth | layed back | ann arbor
smh | gaslight anthem | sence of humor | midwestern
musiq | snorkeling | truck driver | gumbo
merengue | belle and sebastian | 6′4 | freakanomics
laker | xkcd | realy | equity
ig | diet coke | anything else you wanna | discworld
kevin hart | surfboard | like what u see | shanghai
raised in nyc | totoro | and my son | scallops
hip hop rap rb | magnetic fields | u like what u | slopes
kpop | gogol bordello | care of my kids | university of michigan
george lopez | dropkick murphys | makeing | assessment
neo soul | rebelution | welder | parentheses
rb and hip hop | peru | hunting fishing | snowboarder
neyo | horrible’s sing along blog | care of my son | nyt
knw | wakeboarding | wanna know anything else | dominion
gud | herzog | else you wanna know | msu
follow me | my blue eyes | raising my son | ellipses
jordans | guitar and sing | ask and ill | maple
handball | dr horrible’s sing along | comedys | nigerian
soulchild | coachella | dnt | kenya
ne yo | dr horrible’s sing | woman who wants | john irving
bachata | yo la tengo | i’m a single father | over a decade
basketball | airborne toxic event | somthing | cheesesteaks
paid in full | yosemite | careing | wall street journal
mos def talib | feynman | writting | alternatively
mangas | coppola | and my daughter | mistborn
abt | wind up bird | haveing | weber
utada | kar | brown hair | gravitate toward
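The text does not spell out what running the math "in reverse" means computationally. One plausible reading, sketched below in Python, is to measure distance from the opposite, lower-right corner: the spot for words at the bottom of the group's list and the top of everyone else's. The counts are invented toy numbers, and the function names and 0-to-1 rank scaling are assumptions made for illustration.

```python
import math

def rank_fractions(counts, vocab):
    """Place each word on a 0..1 scale: 1.0 is the most used, near 0 the least."""
    ordered = sorted(vocab, key=lambda w: (counts.get(w, 0), w))
    return {w: (i + 1) / len(ordered) for i, w in enumerate(ordered)}

def closest_to_corner(group_counts, other_counts, corner, k=1):
    """Return the k words nearest a chosen corner (x, y) of the rank square."""
    vocab = set(group_counts) | set(other_counts)
    x = rank_fractions(other_counts, vocab)  # popularity with everyone else
    y = rank_fractions(group_counts, vocab)  # popularity within the group
    cx, cy = corner
    dist = {w: math.hypot(x[w] - cx, y[w] - cy) for w in vocab}
    return sorted(vocab, key=lambda w: (dist[w], w))[:k]

UPPER_LEFT = (0.0, 1.0)   # typical: high for the group, low for everyone else
LOWER_RIGHT = (1.0, 0.0)  # antithetical: low for the group, high for everyone else

# Invented toy counts, not real profile data.
group = {"the": 100, "pizza": 20, "phish": 8, "kpop": 1}
everyone_else = {"the": 300, "pizza": 50, "kpop": 9, "phish": 2}

typical = closest_to_corner(group, everyone_else, UPPER_LEFT)
antithetical = closest_to_corner(group, everyone_else, LOWER_RIGHT)
print(typical, antithetical)
```

Only the target corner changes between the two calls; the ranking and distance steps are the same, which is consistent with the remark that the method is flexible.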

The opposite-of-Latino list I found most surprising. Hispanic and white identities are often conflated by demographers; for example, the US Census has struggled for years to separate one from the other. But they can only use checkboxes on paper. Latinos’ “most typical” list above and their “opposite” one here define the extremes. The first gives you the furthest reaches of Latin culture (music and language), and the second gives the “corn-fed” Midwestern white stereotype, which is one of the few white subcultures with no Latin influence. Also, please notice that the “least Asian” things are all misspellings, working-class occupations, and other underachievements, like single fatherhood. And of course there’s “6′4.”

The women’s lists are equally rich, and I again suggest you take in every word. There’s the awesome my name is Ashley in the Asian antitheses. And I have to say, as a point of professional pride—when you ask an algorithm “What aren’t black women talking about” and it tells you “tanning,” you know you did something right.

most antithetical words for …

white women | black women | Asian women | Latinas
filipino | belle and sebastian | bbw | midwestern
neo soul | tanning | god my children | cincinnati
musiq | bruins | single mother of two | classically
slow jams | tahoe | grandson | kenya
rich dad poor dad | simon and garfunkel | god my daughter | neal
corinne bailey rae | magnetic fields | mother of three | shanghai
bailey rae | sf giants | human services | financial services
salsa bachata | flogging molly | degree in criminal justice | classically trained
aaliyah | head and the heart | single mom of two | southern belle
jpop | dodgers | notice my eyes and | cutting for stone
smh | wavy | wanna know just ask | in new england
salsa merengue | naked and famous | mexican and chinese | antarctica
nujabes | social distortion | they are my world | kavalier
48 laws of power | mountain biking | being the best mom | full disclosure
