It’s not completely hopeless: linguists have devised several methods for getting at more natural-sounding speech. One is to ask open-ended questions (“Could you describe your family?” rather than “How do you pronounce ‘aunt’?”). Another is to ask about an exciting or emotional event, to get people thinking about the content rather than the words (a popular, though perhaps rather morbid, question is “Can you tell me about a time you thought you might die?”). A third is to work with a community as an insider: many a linguist has analyzed the speech of their own children, grandparents, or extended family, or else worked with a local collaborator to conduct interviews. The Word Wagon linguists would even carry small notebooks, in case they overheard any interesting language at the grocery store, so that they’d remember to follow up on it when they got the tape recorder out.
But one particularly effective way of getting at unselfconscious speech is on the internet. Not only can researchers look at countless examples of public, informal, unselfconscious language, from videos to blog posts, but in many cases, it’s also searchable. No more hours of transcribing audio files, hoping for a few examples. Twitter is particularly valuable: even the most casual of searchers can look for a word or phrase and form an impression of how people are using it. They might notice that a lot of people who used “smol” in 2018 also appeared to be fans of anime or cute animals, or that “bae” was used primarily by African Americans until around 2014, when it started appearing in tweets by white people, only to get co-opted by brands shortly thereafter.
The presence of researchers on social sites is a still-evolving ethical domain. Regardless of who technically has access to their information, people tend to have a mental model of who they expect to read their posts, and feel that their trust is violated when someone outside that model does so. When the Library of Congress announced in 2010 that they’d be archiving every single tweet, Twitter users had to update their mental models for a previously ephemeral website. Many reacted by posting tongue-in-cheek instructions or commentary to future historians. Several people took advantage of the opportunity to make the august institution expand its holdings of choice four-letter words, while others asked, “What’s up, posterity?” or noted, “Please index all my kitten pictures properly under ‘kitteh’ as well as ‘kitten’ now that you’re saving my tweets.” Not much came of it, in the end: the Library of Congress changed course in 2017, restricting their Twitter archive to tweets that met stricter criteria of newsworthiness. A less benign social media data controversy happened in 2018, when British political consulting firm Cambridge Analytica was discovered to have obtained personal data from millions of Facebook users in 2015 by convincing people to link a personality quiz with their Facebook account. The personal data derived from the quiz was then used to target voters and potentially sway elections. The Library of Congress and Cambridge Analytica represent two extremes, but less publicized researchers have continued mining for data on social media, restricted only by terms of service and their own senses of fair play.
In this book, I have for the most part restricted my citations to social media data in aggregate, not linked to individual users, or examples which are already cited and anonymized in research papers. But where I’ve needed to pull out individual examples, I’ve aimed for those in which the writers are already clearly having a metalinguistic discussion, like the tweets addressing the Library of Congress archivists. Quoting people’s innocent chatter about their lunch or deeply personal heart-to-hearts felt to me uncomfortably like spying, but quoting comments about internet language in a book about internet language is, I hope, a way of entering into a conversation. After all, if you’re going to address your tweets to posterity, perhaps you shouldn’t be surprised when posterity addresses you back.
Twitter research is especially fruitful because about 1 to 2 percent of people who post on Twitter tag their tweets with their exact geographic coordinates. A reasonably competent data miner can therefore code up a county-level map of where Americans tweet “pop” versus “soda,” where they switch from “y’all” to “you guys,” or which states prefer which swear words—all in less time than it took Edmond Edmont to bike from Paris to Marseille. As a simple proof of concept, let’s look at the work of the linguist Jacob Eisenstein, who found that geo-tagged tweets containing “hella” (as in “That movie was hella long”) are most likely to occur in Northern California, while those containing “yinz” (as in “I’ll see yinz later”) are clustered around Pittsburgh. Both of these findings are consistent with previous linguistic research done in the labor-intensive interview style. Other features he found on Twitter probably wouldn’t have shown up in an interview: a later study by Eisenstein and colleagues found that the abbreviation “ikr” (“I know, right?”) was especially popular in Detroit, the emoticon ^_^ (happy) was characteristic of Southern California, and the spelling “suttin” (“something”) was popular in New York City.
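To make the mapping idea concrete, here is a minimal sketch of the tallying step, assuming each tweet has already been matched to a region. The sample rows, the region labels, and the function names `count_by_region` and `top_region` are all invented for illustration; they are not Eisenstein's data or method, and a real study would also adjust for how much each region tweets overall.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample of (region, tweet text) pairs. Real input would be
# geo-tagged tweets mapped to counties; these rows are invented.
TWEETS = [
    ("Northern California", "that movie was hella long"),
    ("Northern California", "hella tired today"),
    ("Pittsburgh", "i'll see yinz later"),
    ("Pittsburgh", "yinz going to the game?"),
    ("Pittsburgh", "traffic is hella bad"),
    ("Detroit", "ikr that was wild"),
]


def count_by_region(tweets, words):
    """For each word, count how many tweets in each region contain it."""
    counts = defaultdict(Counter)
    for region, text in tweets:
        # Lowercase and split into simple word tokens, dropping punctuation.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        for word in words:
            if word in tokens:
                counts[word][region] += 1
    return counts


def top_region(counts, word):
    """Return the region with the most tweets containing `word`, or None."""
    by_region = counts.get(word)
    if not by_region:
        return None
    return by_region.most_common(1)[0][0]


counts = count_by_region(TWEETS, ["hella", "yinz"])
for word in ("hella", "yinz"):
    print(word, "->", top_region(counts, word))
```

Raw counts like these favor places with more tweets, so a fuller version would divide by each region's total tweet count before comparing.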
Some of the linguistics research happening on Twitter wouldn’t be possible at all without the internet. The linguist Jack Grieve researches constructions like “might could,” “may can,” and “might should” in the American South—things like “We might should close the window,” where speakers of other dialects would say, “Maybe we should close the window.” Grieve has pointed out that as recently as 1973, prominent linguists said that it would simply be impossible to research these constructions: they’re vanishingly rare in edited text, and occur maybe once an hour, if you’re lucky, in a spontaneous spoken interview. That’s a heck of a lot of audio to transcribe for a tiny amount of data. But on Twitter, Grieve and his collaborators combed through nearly a billion geo-coded tweets and unearthed thousands of examples. Beyond just reinforcing the informal intuition that these constructions (known as double modals) exist, they’ve been able to make detailed county-level maps showing that they can actually be divided into two groups: some, like “might could” and “may can,” map onto the Upper South, while others, like “might can” and “might would,” are more common in the Lower South.
We may even be able to discover things about various regions that we hadn’t realized before. For example, after “might could,” Grieve turned his attention to swear words, finding that, while people in every state swear, their preferred swear words varied. Keeping it to the somewhat milder terms, people in the American South were especially fond of “hell,” while people in the northern states preferred “asshole,” the Midwest used a lot of “gosh,” and the West Coast liked the Britishy “bollocks” and “bloody.” The Oxford English Dictionary has also begun using Twitter as a source of data, especially for regional words that are less often printed in books and newspapers. The dictionary’s quarterly update notes for September 2017 gave the example of the word “mafted,” a northeastern British term defined as “exhausted from heat, crowds, or exertion.” The example quotations for “mafted” are a study in old-school and new-school lexicography: the oldest citation is from “a glossary compiled around the year 1800,” and the newest is from someone on Twitter in 2010 saying, “Dear Lord—a fur coat on the Bakerloo line, she must have been mafted.”
We can even use creative respellings on Twitter to investigate how people pronounce things differently. It’s a little bit harder than just searching for words, but the linguist Rachael Tatman gave us an example using two well-studied sounds in varieties of English. The first is the pronunciation of words like “cot” and “caught” or “tock” and “talk.” Some Americans (primarily in the West, Midwest, and New England) pronounce each member of the pairs the same, while others (primarily Southerners and African Americans) pronounce them differently, a trend which has been long established by the kind of linguists who make audio recordings. Tatman hypothesized that speakers who do have two distinct vowels in “sod” and “sawed” would sometimes want to call attention to one particular vowel, by respelling it as “aw.” Sure enough, she found that in tweets where a common word, like “on,” “also,” and “because,” was respelled with “aw,” as in “awn,” “awlso,” and “becawse,” there also tended to be respellings for other well-documented features of Southern American English and African American English, such as deleting the “r” in words like “for” and “year” (writing “foah” and “yeah”) and writing “da” and “dat” for “the” and “that.”
But that could just be a coincidence. To test it, Tatman looked at a completely different sound in a completely different region: the pronunciation of words like “to” and “do” as “tae” and “dae.” This sound, and this particular spelling of it, is associated with Scottish English, and has been since Robbie Burns. Here again, Tatman found that people who tweet this respelling tend to show other linguistic markers of Scottishness: they also tweet respellings like “ye” for “you” and “oan” for “on.” To be sure, not all Scots, Southerners, or African Americans use these respellings, and those that do don’t use them all the time. But the point is, when we respell words in casual writing, we tend to do so with a purpose—we jump in with both feet and try to represent our whole manner of speaking. Even if it’s not always this clear which sounds are intended by a particular respelling, looking at which words and sounds people respell can help give linguists an idea of where to focus their audio recording energy.
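The co-occurrence check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the users and tweets are invented, the names `uses_any` and `marker_rates` are hypothetical, and Tatman's actual analysis is not reproduced here. The idea is simply to compare how often the other Scottish respellings turn up among people who write “tae” or “dae” versus people who don't.

```python
import re

# Hypothetical users and their tweets, invented for illustration.
USER_TWEETS = {
    "user_a": ["i need tae go", "are ye coming"],
    "user_b": ["whit dae ye want", "sitting oan the bus"],
    "user_c": ["nothing tae see here"],
    "user_d": ["i need to go", "see you later"],
    "user_e": ["what do ye want"],
    "user_f": ["on the bus again"],
}

TARGET = {"tae", "dae"}   # the respelling under study
MARKERS = {"ye", "oan"}   # other respellings associated with Scottish English


def uses_any(tweets, words):
    """True if any tweet contains at least one of the given words."""
    for text in tweets:
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if tokens & words:
            return True
    return False


def marker_rates(user_tweets, target, markers):
    """Share of users showing a marker, split by whether they use the target."""
    with_target = [u for u, t in user_tweets.items() if uses_any(t, target)]
    without_target = [u for u in user_tweets if u not in with_target]

    def rate(users):
        if not users:
            return 0.0
        hits = sum(uses_any(user_tweets[u], markers) for u in users)
        return hits / len(users)

    return rate(with_target), rate(without_target)


with_rate, without_rate = marker_rates(USER_TWEETS, TARGET, MARKERS)
print(f"markers among tae/dae users: {with_rate:.2f}")
print(f"markers among other users:   {without_rate:.2f}")
```

A clearly higher rate in the first group is what the respelling hypothesis predicts. With real data, the gap would also need a significance test, since a handful of users can produce a difference by chance.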
The internet lets linguists do the kinds of dialect mapping and analysis of spontaneous speech that we’ve been trying to do for centuries, but with more data, from the comfort of a laptop, and without distorting the data by observing it. Just as the telephone study showed that people were still talking like their neighbors rather than like TV and radio broadcasters, and the bicycle peregrinations showed that regional dialects persisted even after centuries of print standardization, the internet studies show us that we often keep our local ways of speaking when we use social media. Our deep wells of enthusiasm for internet dialect quizzes give us a clue about why: talking in particular ways reinforces our networks, our sense of belonging and community.
Networks
Does it ever feel like your family or friend group speaks its very own dialect? This was the premise of a book called Kitchen Table Lingo, which collected examples from what the linguist David Crystal called familects: “the private and personal word-creations that are found in every household and in every social group, but which never get into the dictionary” (or onto dialect maps). The book’s initial appeal for “familect” words attracted thousands of submissions from around the world, with stories of misheard song lyrics, onomatopoeia, children’s coinages, and no less than fifty-seven words for the TV remote control. Dialect maps are just the beginning of our linguistic differences: every time we talk with some people more than others, we have the chance to develop a shared vocabulary, whether that’s families, friends, schools, workplaces, hobbies, or other organizations. Family dialects are often inspired by a cute word that comes out of a kid’s mouth (Queen Elizabeth II was apparently nicknamed “Gary” by a young Prince William, who was unable to say “Granny” yet), but the peak importance of in-group language happens at a later life stage: teenagehood.
High school is a place where people really notice small social details, whether that’s the cool brand of jeans, who’s now going out with who, or vowels. The linguist Penelope Eckert embedded at a high school in the Detroit suburbs in the 1980s to study the correlation between language and high school cliques. She found two main groups: jocks, who participated in the power structure of the school through activities like varsity sports and student council, and burnouts, who rejected the school’s authority. In Detroit, along with many other American cities around the Great Lakes, there’s a vowel change going on, where some speakers say “the busses with the antennas on top” in a way that sounds to people outside the area like “the bosses with the antennas on tap.” For Eckert’s students, the “bosses” pronunciation had a connotation of “street smarts,” so the burnouts were more likely to use it than the jocks—despite the fact that they all lived in the same neighborhood and attended the same school, and irrespective of the social class of their parents. You could arrange the students into more subtle cliques, from “burned-out burnouts” to “jock-jocks,” and their vowels would follow suit. To put it in the terms of classic high school movie characters, if Eckert’s high school was Rydell High from Grease, we’d expect Sandy to say “bus,” Rizzo to say “boss,” and Frenchy to be somewhere in between.
Further studies at other high schools show other groups with other linguistic attitudes. A group of girls in California identified as nerds, and rejected the jock-burnout dichotomy altogether: linguistically, they avoided the slang and cool vowels developing among their peers (such as pronouncing the word “friend” as “frand”), because they didn’t want to be heard as caring about high school popularity. Instead, they adopted linguistic features linked to intellectualism, such as hypercareful articulation, long words, and puns. A study of Latinas at another California high school found a linguistic distinction between Norteñas, who identified as American or Chicana and generally spoke English, and Sureñas, who identified as Mexicana and generally spoke Spanish. We could keep going, but let’s pause and think about how we develop our senses of what’s cool in the first place.
Remember how you learned about swearing? It was probably from a kid around your age, maybe an older sibling, and not from an educator or authority figure. And you were probably in early adolescence: the stage when linguistic influence tends to shift from caregivers to peers. Linguistic innovation follows a similar pattern, and the linguist who first noticed it was Henrietta Cedergren. She was doing a study in Panama City, where younger people had begun pronouncing “ch” as “sh”—saying chica (girl) as shica. When she drew a graph of which ages were using the new “sh” pronunciation, Cedergren noticed that sixteen-year-olds were the most likely to use the new version—more likely than the twelve-year-olds were. So did that mean that “sh” wasn’t the trendy new linguistic innovation after all, since the youngest age group wasn’t really adopting it? Cedergren returned to Panama a decade later to find out. The formerly untrendy twelve-year-olds had grown up into hyperinnovative twenty-two-year-olds. They now had the new “sh” pronunciation at even higher levels than the original trendy cohort of sixteen-year-olds, now twenty-six-year-olds, who sounded the same as they had a decade earlier. What’s more, the new group of sixteen-year-olds were even further advanced, and the new twelve-year-olds still looked a bit behind. Cedergren figured out that twelve-year-olds still have some linguistic growth to do: they keep imitating and building on the linguistic habits of their slightly older, cooler peers as they go through their teens, and then plateau in their twenties.
In terms of swearing, that’s like saying some twelve-year-olds swear, but a lot more sixteen-year-olds do. But swearing is very socially salient (we have laws about it!) and not really changing that much. It’s been peaking in adolescence and declining through adulthood for decades. The other trendy linguistic features that we acquire in adolescence (new pronunciations like “bosses” and “shica,” and innovative uses of words like “so” and “like”) are a case of subtle social discernment rather than massive social taboo, and so we tend to keep them as adults.
This age curve is important when we think about when young people start using social media: age thirteen, if you believe the terms of service of most sites and apps, or slightly younger, if you assume that some users lie about their ages. This is right at the beginning of the age range when the language of teens is tremendously influenced by the slang of their peers. Sure, little kids play games and watch videos and even ask questions of voice assistants, but their social lives are still mediated by their families and their reading level. This coincidence of peer influence and social media access means that it’s easy to conflate how the youth are talking now with the tools that they’re using to do so. But every generation has talked slightly differently from its parents: otherwise, we’d all still be talking like Shakespeare. The question is, how much of that is influenced by technology, and how much is the linguistic evolution that would have happened regardless?
The answer seems to be that both happen simultaneously. Researchers from Georgia Tech, Columbia, and Microsoft looked at how many times a person had to see a word in order to start using it, using a group of words that were distinctively popular among Twitter users in a particular city in 2013–2014. As we’d expect, they noticed that people who follow each other on Twitter are likely to pick up words from each other. But there was an important difference in how people learned different kinds of words. People sometimes picked up words that are also found in speech—like “cookout,” “hella,” “jawn,” and “phony”—from their internet friends, but it didn’t really matter how many times they saw them. For rising words that are primarily written, not spoken—abbreviations like “tfti” (thanks for the information), “lls” (laughing like shit), and “ctfu” (cracking the fuck up) and phonetic spellings like “inna” (in a / in the) and “ard” (alright)—the number of times people saw them mattered a lot. Every additional exposure made someone twice as likely to start using them. The study pointed out that people encounter spoken slang both online and offline, so when we’re only measuring exposure via Twitter, we miss half or more of the exposures and the trend looks murky. But people mostly encounter the written slang online, so pretty much all of those exposures become measurable for a Twitter study. The researchers also found that you’re more likely to start using a new word from Friendy McNetwork, who shares a lot of mutual friends with you, and less likely to pick it up from Rando McRandomFace, who doesn’t share any of your friends, even if you and Rando follow each other just like you and Friendy do.