Dataclysm: Who We Are (When We Think No One's Looking)

by Christian Rudder


  best

  never

  home

  … make the top 100 cut. Twitter actually may be improving its users’ writing, as it forces them to wring meaning from fewer letters—it embodies William Strunk’s famous dictum, Omit needless words, at the keystroke level. A person tweeting has no option but concision, and in a backward way the character limit actually explains the slightly longer word length we see. Given finite room to work, longer words mean fewer spaces between them, which means less waste. Although the thoughts expressed on Twitter may be foreshortened, there’s no evidence here that they’re diminished.

  Mark Liberman, a professor of linguistics at the University of Pennsylvania, concluded much the same thing: in a direct response to Mr. Fiennes, he calculated the typical word length in Hamlet (3.99) and in a collection of Wodehouse’s stories (4.05) and found them both less than the length in his Twitter sample (4.80).[2] He’s just one of many comparative linguists who’ve begun mining Twitter’s data. A team at Arizona State was able to reach beyond word count and length, and into the sentiment and style of the writing, and they found several surprising things: first, Twitter does not change how a person writes. Among the many examples they tracked, if a writer uses “u” for the second person in e-mails or text messages, she will also use it on Twitter. But, likewise, if she generally spells out “you,” she does so everywhere—on Twitter, in texts, in e-mail, and so on. The decision to refer to the first-person singular as “I” or “i” follows the same pattern. That is, a person’s style doesn’t change from medium to medium; there is no “dumbing down.” You write how you write, wherever you write. The linguists also measured Twitter’s lexical density, its proportion of content-carrying words like verbs and nouns, and found it was not only higher than e-mail’s, but was comparable to the writing on Slate, the control used for magazine-level syntax. Everything points to the same conclusion: that Twitter hasn’t so much altered our writing as just gotten it to fit into a smaller place. Looking through the data, instead of a wasteland of cut stumps, we find a forest of bonsai.
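  The word-length comparison is easy to reproduce in outline. The sketch below is not Liberman's code. It is a minimal Python illustration that assumes a simple regex tokenizer (runs of letters and apostrophes), drops URLs, and removes the @ and # signs before averaging, as footnote 2 describes. The names mean_word_length, URL_RE, and WORD_RE are invented for this example.

```python
import re

# Assumption: a URL is anything starting with http:// or https:// up to whitespace.
URL_RE = re.compile(r"https?://\S+")
# Assumption: a "word" is a run of ASCII letters, possibly containing apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def mean_word_length(text: str) -> float:
    """Average letters per word after dropping URLs and the @ and # signs."""
    text = URL_RE.sub(" ", text)
    # Remove only the signs themselves; the handle or tag text stays as a word.
    text = text.replace("@", " ").replace("#", " ")
    words = [w.strip("'") for w in WORD_RE.findall(text)]
    words = [w for w in words if w]
    if not words:
        return 0.0
    # Count letters only, so the apostrophe in "don't" adds nothing.
    return sum(len(w.replace("'", "")) for w in words) / len(words)


if __name__ == "__main__":
    print(mean_word_length("see http://t.co/abc #cats @bob"))
```

  Whether a handle such as @bob should count as a word at all is a judgment call. This sketch keeps it and strips only the sign, which is one reading of the footnote.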

  This kind of in-depth analysis (lexical density, word frequency) hints at the real nature of the transformation under way. The change Twitter has wrought on language itself is nothing compared with the change it is bringing to the study of language. Twitter gives us a sense of words not only as the building blocks of thought but as a social connector, which indeed has been the purpose of language since humanity hunched its way across the Serengeti. And unlike older media, Twitter gives us a way to track those bonds on an individual level. You can see not only what a person says, but who she says it to, when, and how often. Comparative linguists have long traced group commonalities through language. Basic words often share common sounds (like tres, trois, drei, three, and thran, from Spanish, French, German, English, and India’s Gujarati) and those stems have given us a sense of the movements of genes and culture across the face of time. Researchers are already grouping people by the language they use on Twitter. Here I’ve excerpted an early attempt to find the tribes and emerging dialects—this is from a corpus of 189,000 tweeters sending 75 million tweets among them.

  subgroups on Twitter by messaging pattern

  example words | characteristic speech | percent of sample
  nigga, poppin, chillin | shortened endings (e.g., -er => -a or -ing => -in) | 14
  tweetup, metrics, innovation | tech buzzspeak | 12
  inspiring, webinar, affiliate, tips | marketing self-help | 11
  etsy, adorable, hubby | crafting lingo | 5
  pelosi, obamacare, beck, libs | partisan talking points | 4
  bieber, pleasee, youu, <33 | lengthened endings (repeated last letter) | 2
  anipals, pawesome, furever | animal-based puns | 1
  kstew, robsessed, twilighters | amalgamations/puns around the Twilight movies | 1

  It’s important to note that the study grouped users by their words alone (whom they messaged and what they wrote); these language clusters were not determined a priori. The top-listed group is in fact the largest the researchers detected, and it also happens to be the most voluble (sending the most tweets per capita) as well as the most insular. Some 90 percent of the tweets sent by the group are directed within it, and its users’ language is most strongly “characteristic”—half of their 100 most representative words fit the “shortened endings” pattern. Throughout the list you see groups typified by slang, pop culture references, jargon, goofy puns—people drawn together by special ways of speaking, and it’s exactly the kind of language (and information) that until now has been lost to history. Like knowing a man’s last words to his wife, knowing how people talk among friends gives you a much deeper sense of who they are. Technocrats, political wonks, marketing gurus, the robsessed; it will be interesting in the coming years to see how all these groups merge and recombine, and we’ll be able to track it all through their text.

  Once language and data come together, it’s that extra dimension, time, that’s so compelling. Going forward, services like Twitter will be indispensable. Looking back, Google Books is working to repair our historical blind spot: in collaboration with libraries around the world, they have digitized 30 million unique books, great and small, and, true to their expertise, they have made the whole searchable. This body of data has created a new field of quantitative cultural studies called culturomics; its primary method is to track changes in word use through time. The long reach of the data (it goes back to 1800) allows an unusual look at people and what’s important to them. Here’s a little chart I like to call Pizza Now, Pizza Forever:

  You can read bits of nonculinary history in the data, too. “Ice cream” took off in the 1910s—right when GE introduced the powered home icebox. See the nosedive “pasta” took in the late ’90s? The Atkins diet became popular. During world wars, we like red meat. These are light applications of a technique that can have deep reach into our collective psyche.[3] Word frequencies can even show how we perceive abstractions, like the passage of time—something very difficult to investigate directly. Asking a person what “ten years” means is like asking him or her to describe a color—you get impressionism where you’re looking for facts. But looking at writing over time gives us a sense.

  The data shows that with each passing year, we’re getting more wrapped up in the present. For example, written mentions of the year 1850 peaked (in 1851) at roughly 35 instances for every million words written. Mentions of the year 1900 peaked at 58 per million. Mentions of recent years peak at roughly three times that. Here are the trajectories of the fifty-year benchmarks in the data set:

  Work like this, based on the printed word, helps us understand our larger culture. Twitter lets us see groups coming together within it. But books and tweets both are one-to-many forms of communication, and, often, like Major Ballou’s, our most important words are expressed one-to-one. Users on OkCupid exchange about 4 million messages a day. Of course, they do so with a special purpose—dating—but the interface provides no specific prompt and enforces no limit on what or how much anyone types. Think of it as Gmail for strangers: the communication on the site is about two people getting to know each other; the romance comes much later, offline. Outside researchers rarely get to work with private messages like this—it’s the most sensitive content users generate and even anonymized and aggregated, message data is rarely allowed out of the holiest of holies in the database. But my unique position at OkCupid gives us special access.

  First, the site’s decade of history lets us see how technology has altered how people communicate. OkCupid has records from the pre-smartphone, pre-Twitter, pre-Instagram days—hell, it was online when Myspace was still a file storage service. Judging by messaging over all those years, the broad writing culture is indeed changing, and the change is driven by phones. Apple opened their app store in mid-2008, and OkCupid, like every major service, quickly launched an app. The effect on writing was immediate. Users began typing on keyboards smaller than their palm, and message length has dropped by over two-thirds since:

  The average message is now just over 100 characters—Twitter-sized, in fact. And in terms of effect, it seems readers have adapted. The best messages, the ones that get the highest response rate, are now only 40 to 60 characters long.

  By considering only messages of a certain length, and then asking how many seconds the message took to compose, we can get a sense of how much revision and effort translates into better results. Below are messages between 150 and 300 characters, plotted against how long they took to write. As you can see, taking your time helps, up to a point. But the downward bend of the trend lines is a wingman in numbers, saying don’t overthink it!

  Now, the first vertical on the left, the messages that took no more than ten seconds to write, represents an inordinate amount of the whole and should raise some eyebrows. It raised mine for sure, and at this point I’m so jaded my face is frozen—Botox has nothing on ten years working at a dating site. How are so many people typing messages that long that quickly? The short answer is, they’re not, and here’s how I know.

  Below is a scatter chart of 100,000 messages, with the number of characters typed plotted against characters actually sent.[4] Because there’s a wide range of counts, running from 1 all the way to almost 10,000, this plot is logarithmic:

  I’ve added another diagonal line, and as before, it marks the place where the two axes are equal—meaning that for the red dots along it, the text matched the keystrokes that went into it. Essentially, the sender typed what was on his mind and hit Send, no backspace, no edits. Therefore we know that message A, in the upper-right corner, was typed more or less in a headlong rush, with almost no revision. Going back to the logs, I found it took the sender 73 minutes and 41 seconds to hammer out those 5,979 characters of hello—his final message was about as long as four pages in this book. He did not get a reply. Neither did the gentleman sender of B, who wins the Raymond Carver award for labor-intensive brevity. He took 387 keystrokes to get to “Hey.”

  But these are the examples at the extremes. The broad gist of the scatter plot is: as you approach the diagonal, the messages show less revision. Move toward the bottom right, you get heavy editing, toward the upper left, you get … physical impossibility. Our chart’s geometry means that as soon as you cross over the diagonal into the upper half, you’re into people who must’ve typed fewer characters than their messages actually contained. Who are these arcane summoners, wringing words from thought alone? They are the cut and pasters, and they are legion.

  We can clarify the graph by making each dot 90 percent transparent. This lets you see the real density underneath. It’s like we’re taking an X-ray of the data, and in so doing, we see the bones:

  That dense band of dots running just below the diagonal is the writing-from-scratch guys. It’s surprisingly compact. There is, of course, the hard upper boundary of the line, which separates the from-scratch messages from the pasted ones, like a border between warring factions. But the band’s lower boundary is almost as crisp. There appears to be a natural limit to how much effort a person is willing to put into a message. If you do the arithmetic, it’s 3 characters typed for every 1 in the finished product.
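  The regions of the scatter plot reduce to a comparison between two counts per message. Here is a minimal Python sketch, assuming each log record carries the number of keystrokes and the number of characters sent. The Message class, its field names, and the three labels are hypothetical and are not OkCupid's schema. How deletions are counted is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Message:
    keystrokes: int  # characters typed while composing (hypothetical field)
    length: int      # characters in the message as sent (hypothetical field)


def classify(msg: Message) -> str:
    """Place a message relative to the chart's diagonal."""
    if msg.length <= 0:
        raise ValueError("a sent message must contain at least one character")
    if msg.keystrokes < msg.length:
        # More text arrived than was typed: at least part of it was pasted.
        return "pasted"
    if msg.keystrokes == msg.length:
        # On the diagonal: typed straight through, no edits.
        return "unrevised"
    # Below the diagonal: some typed characters never made it into the message.
    return "revised"


def effort_ratio(msg: Message) -> float:
    """Characters typed per character sent (1.0 on the diagonal)."""
    if msg.length <= 0:
        raise ValueError("a sent message must contain at least one character")
    return msg.keystrokes / msg.length


if __name__ == "__main__":
    draft = Message(keystrokes=300, length=100)
    print(classify(draft), effort_ratio(draft))
```

  Under this framing, the lower edge of the from-scratch band corresponds to an effort ratio of about 3.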

  Above the diagonal are the people who decided that kind of effort was too much. That diffusion of dots in the upper-left center is all the people who pasted a templated message and made a few edits to it. Here the logarithmic nature of the chart can fool you—even just a small amount over that central line means most of the content in the message is stock. Running up the left side, you see the dense vertical lines, the ruts. Those are the messages that were “typed” with just a few keystrokes. There are a lot of them—all told, 20 percent of the sample registered 5 or fewer keystrokes. These writers settled on something they like or that works, and they went with it. It’s not spam in the way we normally use that word—OkCupid is quick to get fake or bot accounts off the site. These are real people’s attempts at contact, essentially memorized digital pickup lines. Many are about as lazy and mundane as you’d expect: “Hey you’re cute” or “Wanna talk?”—just digital equivalents of “Come here often?” But some of the repeated messages are so idiosyncratic it’s hard to believe they would even apply to multiple people. Here’s one, presented exactly as typed:

  I’m a smoker too. I picked it up when backpacking in May. It used to be a drinking thing, but now I wake up and fuck, I want a cigarette. I sometimes wish that I worked in a Mad Men office. Have you seen the Le Corbusier exhibit at MoMA? It sounds pretty interesting. I just saw a Frank Gehry (sp?) display last week in Montreal, and how he used computer modelling to design a crazy house in Ohio.

  That’s the whole message—the sender was trying to pick up women who smoked and were into art. The unstudied “(sp?)” is my favorite flourish. Forty-two different women got this same message.

  Sitewide, the copy-and-paste strategy underperforms from-scratch messaging by about 25 percent, but in terms of effort-in to results-out it always wins: measuring by replies received per unit effort, it’s many times more efficient to just send everyone roughly the same thing than to compose a new message each time. I’ve told people about guys copying and pasting, and the response is usually some variation of “That’s so lame.” When I tell them that boilerplate is 75 percent as effective as something original, they’re skeptical—surely almost everyone sees through the formula. But this last message is an example of a replicated text that’s impossible to see through, and, in a fraction of the time it would’ve taken him otherwise, the sender got five replies from exactly the type of woman he was looking for. And let me tell you something. Nearly every single thing on my desk, on my person, probably in my entire home, was made in a factory alongside who knows how many copies. I just fought a crowd to pick up my lunch, which was a sandwich chosen from a wall of sandwiches. Templates work. Our social-smoking architecture-loving backpacker is just doing what people have always done: harnessing technology. In this case his innovation is using a few keyboard shortcuts to save himself some time.
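  The replies-per-effort comparison can be made concrete with a little arithmetic. The sketch below is illustrative only. The 0.75 effectiveness figure comes from the text, but the keystroke costs are assumptions chosen for the example: about 100 keystrokes for a typical message typed from scratch with no revision, and 5 for a pasted one. The function name relative_efficiency is invented here.

```python
def relative_efficiency(effectiveness: float,
                        scratch_keystrokes: float,
                        paste_keystrokes: float) -> float:
    """Replies per keystroke for pasting, relative to writing from scratch.

    If a from-scratch message earns r replies for scratch_keystrokes of effort,
    a pasted one earns effectiveness * r replies for paste_keystrokes of effort.
    The unknown r cancels out of the ratio.
    """
    if scratch_keystrokes <= 0 or paste_keystrokes <= 0:
        raise ValueError("keystroke counts must be positive")
    return effectiveness * scratch_keystrokes / paste_keystrokes


if __name__ == "__main__":
    # Assumed costs: 100 keystrokes from scratch, 5 keystrokes to paste.
    print(relative_efficiency(0.75, 100, 5))  # 15.0
```

  With these assumed costs, the template comes out roughly fifteen times more efficient per keystroke. Any revision on the from-scratch side only widens the gap.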

  As we’ve seen, phones and services like Twitter demand their own adaptations. The eternal here is that writing, like life itself, abides. It changes form, it replicates in odd ways, it finds unexpected niches … it even, like anything alive, occasionally stinks. But realize this: we are living through writing’s Cambrian explosion, not its mass extinction. Language is more varied than ever before, even if some of it is directly copied from the clipboard—variety is the preservation of an art, not a threat to it. From the high-flown language of literary fiction to the simple, even misspelled, status update, through all this writing runs a common purpose. Whether friend to friend, stranger to stranger, lover to lover, or author to reader, we use words to connect. And as long as there is a person bored, excited, enraged, transported, in love, curious, or missing his home and afraid for his future, he’ll be writing about it.

  1 Definition of true ignorance: getting your “what the kids are into” intel from the Securities and Exchange Commission.

  2 Liberman (and I) stripped URLs and the special signs @ and # from the analysis, so these numbers aren’t artificially boosted by “nonword” material.

  3 The data in Google Books accounts for the fact that more books are published now than were published in, say, the nineteenth century. It samples a set number of books from each year. So though both the charts here happen to show increased mentions of their subject terms over time, that truly is a function of increased interest. Not all terms follow that pattern—“God,” for example, has been in steady decline for decades and is now used only about a third as much in American writing as it was in the early 1800s. The researchers Jean-Baptiste Michel and Erez Lieberman Aiden coined the term “culturomics” in their paper “Quantitative Analysis of Culture Using Millions of Digitized Books.” My charts and findings here are adapted from their work.

  4 I captured the characters typed through a script introduced for this chapter.

  4.

  You Gotta Be the Glue
  A major drawback to data from dating sites is that it tells you next to nothing about people actually going on dates. Once people are together in person, they don’t need messages or ratings or any of that. It’s an irony both in the data set and in the job itself—you do it right and the customers leave. In pairs, no less!

  Where they go, of course, is into the real world, into a bar, into daylight, into the flesh. They depart the easily quantified world of bits and pixels and enter, in short, each other’s lives. Think about the progression of a young relationship. Two people meet for the first time in person. Talk, drink, get to know each other. Next, if there is a next, is the apartments. The unfamiliar number on the door, a brass handle where yours is steel. The strange but pleasant smell of another person’s sheets. Shampoos in the shower, used, but new to you. Loganberry: Okay, why not? Back at your place next time, she opens the fridge, and it’s just … mustards. Sorry. We’ve all been there in someone’s bedroom, in the den, amidst mementos of events and people we don’t remember, wondering first at the tchotchkes themselves and then soon enough at how surprisingly yours something like the Ponderosa Invitational Swim Meet (third-place cup, 1985) can become, in spite of the fact—or is it because?—you only know it through her.

  You meet the friends. The best friend. The other best friend. The other other best friend, like, for real, they’ve known each other forever. Enough drinks, the right kind of people, they become your friends, too. Acquaintances, coworkers filter into the picture, some in passing, some on purpose. Finally, maybe, if it’s really turning into something, come the parents. You relate some fancier version of your life story, parts of which the two of you can tell together, because you’re that familiar—step away from the table for a second, and the parents know more about you than when you left. Settling back into your chair: “M tells me that …” and it’s the perfect setup for one of your favorite stories. Two lives are merging. And then, often, and often suddenly, it’s back to the beginning with someone else.

 
