Dataclysm: Who We Are (When We Think No One's Looking)


by Christian Rudder


  OkCupid’s user-submitted profile essays are as close to personal self-summaries as you’ll find. The prompts are open-ended:

  “My self-summary …”

  “I’m really good at …”

  “The first things people usually notice about me are …”

  “I spend a lot of time thinking about …”

  And insofar as people try to put their best foot forward, they’re not at all unlike college essays. I imagine many people approach them with the same sort of dread. There are no length restrictions, no guidelines but for the prompts. Altogether, people have given the site 3.2 billion words of self-description. Moreover, unlike other big hunks of text—say, what Google Books has collected—there are demographics behind every word: the age of the author, where she lives, her race, and so on. But deriving a group identity for, say, Asian women from the text isn’t quite as easy as counting up who types what the most, which for the most part is how we’ve looked at text so far in this book. Counting words just gets us this:

  1. the

  2. of

  3. and

  4. …

  and so on down the line—basically that top 100 from the Oxford English Corpus we saw before. Asian women, white men, and all English speakers use the same pronouns and articles and prepositions to talk about themselves. To find out what’s actually special to a particular group, and to them alone, we have to sort the text a little differently.

  I’ll use white men as my walk-through example, because I understand them the best. The first step is to separate those white guys’ essays from everyone else’s. Then, in the two sets of self-descriptions—white-guy and not—we order all the words and phrases in the texts by how frequently they appear. We put them into two lists, from most popular to least, and that gives us something like the chart below. I’ve pulled out three examples and put them in their correct places in the line; the full lists have about 360,000 phrases each:
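  The ranking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the book's charts: the function names (`phrases`, `ranked_percentiles`), the three-word phrase limit, the simple tokenizer, and the toy essays are all invented here, and "percentile" is taken to mean position in the popularity-ordered list, with 1.0 for the most frequent phrase.

```python
from collections import Counter
import re

def phrases(text, max_words=3):
    """Yield every run of 1..max_words consecutive words in an essay (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def ranked_percentiles(essays, max_words=3):
    """Map each phrase to a popularity percentile (1.0 = most frequent phrase)."""
    counts = Counter()
    for essay in essays:
        counts.update(phrases(essay, max_words))
    ordered = [p for p, _ in counts.most_common()]  # most popular first
    total = len(ordered)
    return {p: 1.0 - i / total for i, p in enumerate(ordered)}

# Toy essays, invented for illustration (not OkCupid data).
group = ["I love the band Phish and the pizza", "the pizza and the beer"]
others = ["the salsa and the pizza", "I love the city"]

group_rank = ranked_percentiles(group)    # the group's own list
other_rank = ranked_percentiles(others)   # everyone else's list
```

  Each dictionary is one of the two ordered lists: the same phrase can sit at a very different percentile in each, and that difference is what the rest of the method relies on.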

  Already we’re getting somewhere, but before we move on, there’s something a little misleading about these plots that I want to address while the list is still simple. No, it’s got nothing to do with Phish, though lord knows they’ve misled many. It’s that “pizza” and “the” appear to be mentioned almost the same number of times. Granted, pizza is the king of foods, but “the” is the absolute most popular word in the English language. And in our data, while “the” is in its rightful place at the top, “pizza” is seemingly right there with it, at the 98th percentile. This makes it feel like something is wrong either with my data or with my method, but the rankings of the words are correct. It’s just that humans use language in an odd way: we are always repeating ourselves. So a very few top-ranked words take up most of our writing. And, conversely, the frequency of a word falls off very quickly as you go even a small distance from “most popular.”

  This counterintuitive relationship between the popularity of a word (its rank in a given vocabulary) and the number of times it appears is described by something called Zipf’s law, an observed statistical property of language that, like so much of the best math, lies somewhere between miracle and coincidence.1 It states that in any large body of text, a word’s popularity (its place in the lexicon, with 1 being the highest ranking) multiplied by the number of times it shows up is roughly the same for every word in the text. Or, very elegantly:

  rank × number = constant

  This law holds for the Bible, the collected lyrics of ’60s pop songs, the canonical corpus of English literature (the Oxford English Corpus), and it certainly holds for profile text. To see how well it works in practice even on a highly idiosyncratic body of writing, here’s the law applied to James Joyce’s Ulysses:2

  word    | rank   | number of times it appears | rank × number
  ’s      | 10     | 2,826                      | 28,260
  is      | 20     | 1,435                      | 28,700
  what    | 30     | 975                        | 29,250
  has     | 100    | 289                        | 28,900
  wife    | 200    | 140                        | 28,000
  Ireland | 300    | 90                         | 27,000
  college | 1,000  | 26                         | 26,000
  morn    | 5,000  | 5                          | 25,000
  builder | 10,000 | 2                          | 20,000
  Zurich  | 29,055 | 1                          | 29,055

  The steady relationship between rank and number seems to be a property of the mind as much as of language—as you can see above, it accommodates arbitrary proper names, like “Ireland” and “Zurich,” and even words transcribed from dialect, like “ ’s.”
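  A table like the one for Ulysses can be rebuilt for any text with a short script. The sketch below is a minimal illustration, not the method used for the book: `zipf_table` is a hypothetical helper, the tokenizer is a crude assumption, and the sample text is synthetic, constructed so that the word at rank r appears exactly 60 / r times.

```python
from collections import Counter
import re

def zipf_table(text, ranks):
    """Return (word, rank, count, rank * count) for each requested rank."""
    words = re.findall(r"[a-z']+", text.lower())
    ordered = Counter(words).most_common()  # [(word, count), ...], most frequent first
    rows = []
    for r in ranks:
        if r <= len(ordered):               # skip ranks beyond the vocabulary size
            word, count = ordered[r - 1]
            rows.append((word, r, count, r * count))
    return rows

# Synthetic text: the word at rank r occurs 60 // r times (60, 30, 20, 15, 12, 10).
names = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
toy = " ".join(" ".join([name] * (60 // r)) for r, name in enumerate(names, start=1))

rows = zipf_table(toy, ranks=[1, 2, 3, 4, 5, 6])
for word, rank, count, product in rows:
    print(word, rank, count, product)
```

  On this artificial text the product is constant by construction. On real writing it only hovers around a constant, as the Ulysses figures show.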

  And as further evidence of its deep connection with the human experience, Zipf’s law also describes a wide variety of our social constructs: the sizes of cities, for example, and income distribution across a population. What it means for our purpose here is that because most of language is just a small body of repeated patterns, the use of a word drops off rapidly. “The” appears on nearly every profile. “Pizza” appears on about 1 in 14. “Phish,” even for white guys, for whom it ranks way up at the 80th percentile, appears in less than 1 in 200 profiles. Now that we understand how rankings and usage frequency compare, the next step is to use those rankings to our advantage.

  Below, I’ve put the two lists at right angles, forming a square, and I have plotted the words inside it using their popularity rankings on the two lists as coordinates. I added some arrows around “Phish” to make it clear what I mean:

  A word’s position here has dual meaning. The closer to the top it appears, the more popular it is with white guys. The farther toward the right, the more popular it is with everyone else. Adding a few more words to the chart will give you a sense of how the geometry translates before I zoom out to the full corpus:

  I’ve added a diagonal, yet again, to show parity in the data. The words near the line are important to everyone equally. And the farther up and to the right the words go, the more universally important they are. But remember, we’re not looking for universals. We’re looking for particulars. We want to know what is special to the people we’re considering: here, white guys. For that we need to look to the upper left: the farther in that direction a word appears, the more often white men use it, and the less often everyone else does. In fact, the closer a word is to that remotest reach of white maleness, the top-left vertex of the square, the more it typifies them and only them. Imagine a dot all the way in the corner: to be there, the word would have to appear on every single white male profile and at the same time never appear anywhere else. At least as far as words in a self-summary go, that’s the platonic ideal of identity. This system, and that metric—distance from the upper-left corner—gives the data a way to speak to us, to help us understand how people are talking about themselves.

  Because every data set has its quirks, researchers must often build tools from scratch, as we have here. Whenever you do this, it’s good to check your method against some familiar outcomes. Imagine a shipwright with a new boat: who knows what’ll happen once it’s out on the open ocean—so best to check for holes close to shore. Here, if we’d found “Kpop” (Korean pop) or “dreads” in the upper left, in my supposed corner of white-manhood, it would be a strong sign that either my data or my method was garbage. But as you can see, it’s working perfectly.

  So, finally, here’s what the whole corpus of words and phrases looks like:

  I’ve circled the dot closest to that upper-left corner. That’s the white-male-est thing a person can write about himself: “my blue eyes.” And getting a longer list of the things that uniquely define white men is just a matter of walking out from that vertex; for example, the thirty closest dots are the thirty things that are most typical. The geometry finds the clichés for us.
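  “Walking out from that vertex” amounts to sorting phrases by their distance from the upper-left corner of the square. Here is a hedged sketch of that step. It assumes each phrase already has a percentile between 0 and 1 on both lists, that the group's percentile is the vertical coordinate and everyone else's the horizontal one, and that plain Euclidean distance is the metric. The function names are hypothetical, and every percentile is invented except Phish's 80th percentile among white men.

```python
import math

def corner_distance(group_pct, other_pct):
    """Distance from the point (x=other_pct, y=group_pct) to the upper-left corner (0, 1)."""
    return math.hypot(other_pct, 1.0 - group_pct)

def most_typical(group_rank, other_rank, k=30):
    """Return the k phrases closest to the upper-left corner, nearest first."""
    scored = []
    for phrase, g in group_rank.items():
        # Assumption: a phrase nobody else uses sits at percentile 0 on the other list.
        o = other_rank.get(phrase, 0.0)
        scored.append((corner_distance(g, o), phrase))
    scored.sort()
    return [phrase for _, phrase in scored[:k]]

# Illustrative percentiles only.
group_rank = {"the": 1.00, "pizza": 0.98, "phish": 0.80, "my blue eyes": 0.90}
other_rank = {"the": 1.00, "pizza": 0.98, "phish": 0.30, "my blue eyes": 0.10}

top_two = most_typical(group_rank, other_rank, k=2)
```

  Universally popular words such as “the” and “pizza” land about a full unit away from the corner, so they drop out of the list even though they rank at the top for the group.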

  I’ve made plots like this for everyone in my data set, not just white guys, and using this same math I’ve gotten lists of their unique words and phrases, too. But before I move to listing all this, I want to make one important point. Walking through each combination of sex × ethnicity × orientation gives you 2 × 4 × 3 = 24 charts like the one above, and in all of them the mass of dots has this same tapered shape from bottom left to top right. That is, the farther a phrase goes into that upper-right corner, the closer to the diagonal it gets. What that means is that we tend to agree on the things that are most important. As for the things we don’t agree on, I’ve listed them in detail below. I’ll start with the men:3

  most typical words for …

  white men | black men | Latinos | Asian men
  my blue eyes | dreads | colombian | tall for an asian
  blonde hair | jill scott | salsa merengue | asians
  ween | haitian | cumbia | taiwanese
  brown hair | soca | una | taiwan
  hunting and fishing | neo soul | merengue bachata | cantonese
  allman brothers | jamie foxx | mana | infernal affairs
  woodworking | zane | banda | seoul
  campfire | paid in full | puertorican | infernal
  redneck | nigga | colombia | shanghai
  dropkick murphys | luther vandross | gusta | boba
  they might be giants | coldest winter | puerto rican | kbbq
  brewing beer | tyler perry | tejano | kpop
  robert heinlein | swagg | corridos | badminton
  tom robbins | jerome | bachata merengue | kimchi
  townes | dreadlocks | hector | chungking express
  old crow medicine show | spike lee | espa | chou
  mystery science theater | holla at me | por | viet
  skis | menace to society | salsa bachata | jiro
  sailboat | brotha | aventura | dash berlin
  around a fire | shottas | english and spanish | ucsd
  caddyshack | boomerang | musica | beijing
  blond hair | nigerian | espa ol | hk
  bill bryson | heartbeats | como | norwegian wood
  wheelers | anthony hamilton | fiu | jiro dreams of sushi
  pogues | gud | pero | lin
  barenaked ladies | wayans | soledad | philippines
  mst3k | dickey | espanol | noodle soup
  truckers | isley | amor | malaysian
  jethro tull | interracial | muy | for my next meal
  canoe | nigeria | reggaeton | gangnam style

  Phish might’ve already given it away, but inside the white man rages a music festival for lumberjacks.

  As for the other three lists, I had never heard of Zane or Anthony Hamilton or The Coldest Winter Ever or Chungking Express or Dash Berlin or a lot of the above before my scripts coughed them up, and I’m not going to pretend that a few minutes with Wikipedia can stand in for an understanding of a culture. These are users speaking in their own voice, and I’m going to let them do just that, but I will point out a few broad trends: white people differentiate themselves mostly by their hair and eyes, Asians by their country of origin, Latinos by their music. But because of the way the math is set up, the three non-white lists are evidence of cultures that I, as a white man, am not supposed to know. Of course, we’re all familiar with Spike Lee and Beijing and Shanghai, but these lists give us the “insiders’ ” view of a culture. It’s stuff an outsider can’t get from autocomplete, or in any other top-down way, because you can’t wonder at what you don’t realize is out there. “Why do Asian people like Norwegian Wood?” isn’t a stereotype because not enough non-Asians are familiar with the book (by Haruki Murakami) and movie. I thought it was just a Beatles song, and if before this chapter someone had asked me if I’d seen Norwegian Wood, I’d have said, “I don’t think they made videos back then.” The lists above are our shibboleths. As such, they are something no one could generate a priori, by typing things into Google Trends or by searching millions of hashtags. Sometimes, it takes a blind algorithm to really see the data.

  Here are the lists for women. As you can see, they’re very similar in spirit to the men’s. Maybe a few more ballads.

  most typical words for …

  white women | black women | Asian women | Latinas
  my blue eyes | soca | taiwan | latina
  red hair and | eric jerome dickey | tall for an asian | colombian
  blonde hair and | haitian | philippines | una
  love to be outside | imitation of life | taiwanese | cumbia
  mudding | zane | beijing | banda
  campfire | coldest winter ever | coz | tejano
  four wheeling | nigerian | boba | merengue bachata
  phish | interracial | filipina | gusta
  hunting fishing | rb and gospel | cantonese | puertorican
  campfires | five heartbeats | asians | colombia
  green eyes and | anita baker | wong kar wai | mana
  redneck | crooklyn | shanghai | vida
  auburn | neosoul | seoul | bachata merengue
  ride horses | octavia butler | macarons | amor
  old crow medicine show | housewives of atlanta | viet | musica
  grateful dead | luther vandross | kimchi | english and spanish
  mountain goats | zora | for my next meal | espanol
  love country music but | waiting to exhale | singapore | salsa merengue
  gillian welch | anthony hamilton | malaysian | todo
  country girl | chrisette | hk | por
  christmas vacation | locs | malaysia | mariachi
  bill bryson | outside my race | noodle soup | marc anthony
  riding horses | kem | cambodian | espa ol
  eric church | octavia | norwegian wood | novelas
  barn | real housewives of atlanta | hong kong | como
  allman | calypso | chungking express | pero
  willie nelson | know why the caged | rachmaninoff | venezuela
  harley | did i get married | southeast asia | soledad
  brunette | spike lee | vienna | mas
  flogging molly | braxton | mandarin | tacuba

  I discovered in the course of working with it that the algorithm we used to make these lists is flexible. You can just as easily run the math in reverse. This gives you the antitheses of a group—the stuff they especially don’t talk about—which can be as illuminating as what they especially do. Here are the lists for the men; they are printed on a darker background to visually emphasize that these lists are the opposite of the previous ones. They are the words least used by these groups yet most used by everyone else, the negative space in our verbal Rorschach. The lists are worth reading all the way through:

  most antithetical words for …

  white men | black men | Asian men | Latinos
  slow jams | borges | sence | southern accent
  trey songz | social distortion | layed | from the midwest
  robin thicke | tallest man on earth | layed back | ann arbor
  smh | gaslight anthem | sence of humor | midwestern
  musiq | snorkeling | truck driver | gumbo
  merengue | belle and sebastian | 6′4 | freakanomics
  laker | xkcd | realy | equity
  ig | diet coke | anything else you wanna | discworld
  kevin hart | surfboard | like what u see | shanghai
  raised in nyc | totoro | and my son | scallops
  hip hop rap rb | magnetic fields | u like what u | slopes
  kpop | gogol bordello | care of my kids | university of michigan
  george lopez | dropkick murphys | makeing | assessment
  neo soul | rebelution | welder | parentheses
  rb and hip hop | peru | hunting fishing | snowboarder
  neyo | horrible’s sing along blog | care of my son | nyt
  knw | wakeboarding | wanna know anything else | dominion
  gud | herzog | else you wanna know | msu
  follow me | my blue eyes | raising my son | ellipses
  jordans | guitar and sing | ask and ill | maple
  handball | dr horrible’s sing along | comedys | nigerian
  soulchild | coachella | dnt | kenya
  ne yo | dr horrible’s sing | woman who wants | john irving
  bachata | yo la tengo | i’m a single father | over a decade
  basketball | airborne toxic event | somthing | cheesesteaks
  paid in full | yosemite | careing | wall street journal
  mos def talib | feynman | writting | alternatively
  mangas | coppola | and my daughter | mistborn
  abt | wind up bird | haveing | weber
  utada | kar | brown hair | gravitate toward
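  Running the math in reverse, as described before these lists, can be read as measuring distance to the opposite corner of the square: the lower right, where a phrase is rare for the group and common for everyone else. The sketch below follows that reading, which is an assumption about the implementation rather than something the text spells out. `most_antithetical` is a hypothetical name, and the percentiles are invented for illustration.

```python
import math

def most_antithetical(group_rank, other_rank, k=30):
    """Return the k phrases closest to the lower-right corner (x=1, y=0), nearest first."""
    scored = []
    for phrase, o in other_rank.items():
        # Assumption: a phrase the group never uses sits at percentile 0 on its list.
        g = group_rank.get(phrase, 0.0)
        # x = everyone else's percentile, y = the group's percentile.
        scored.append((math.hypot(1.0 - o, g), phrase))
    scored.sort()
    return [phrase for _, phrase in scored[:k]]

# Illustrative percentiles only.
group_rank = {"the": 1.00, "pizza": 0.98, "slow jams": 0.05, "kpop": 0.20}
other_rank = {"the": 1.00, "pizza": 0.98, "slow jams": 0.85, "kpop": 0.70}

bottom_two = most_antithetical(group_rank, other_rank, k=2)
```

  The loop runs over everyone else's vocabulary rather than the group's, so phrases the group never writes at all can still appear in the result.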

  The opposite-of-Latino list I found most surprising. Hispanic and white identities are often conflated by demographers; for example, the US Census has struggled for years to separate one from the other. But they can only use checkboxes on paper. Latinos’ “most typical” list above and their “opposite” one here define the extremes. That first gives you the furthest reaches of Latin culture (music and language) and this second gives the “corn-fed” Midwestern white stereotype, which is one of the few white subcultures with no Latin influence. Also, please notice that the “least Asian” things are all misspellings, working-class occupations, and other underachievements, like single fatherhood. And of course there’s “6′4.”

  The women’s lists are equally rich, and I again suggest you take in every word. There’s the awesome “my name is Ashley” in the Asian antitheses. And I have to say, as a point of professional pride: when you ask an algorithm “What aren’t black women talking about?” and it tells you “tanning,” you know you did something right.

  most antithetical words for …

  white women | black women | Asian women | Latinas
  filipino | belle and sebastian | bbw | midwestern
  neo soul | tanning | god my children | cincinnati
  musiq | bruins | single mother of two | classically
  slow jams | tahoe | grandson | kenya
  rich dad poor dad | simon and garfunkel | god my daughter | neal
  corinne bailey rae | magnetic fields | mother of three | shanghai
  bailey rae | sf giants | human services | financial services
  salsa bachata | flogging molly | degree in criminal justice | classically trained
  aaliyah | head and the heart | single mom of two | southern belle
  jpop | dodgers | notice my eyes and | cutting for stone
  smh | wavy | wanna know just ask | in new england
  salsa merengue | naked and famous | mexican and chinese | antarctica
  nujabes | social distortion | they are my world | kavalier
  48 laws of power | mountain biking | being the best mom | full disclosure

 
