In 2010:
I didn’t quite know what to say thinking, “hmm, mud, what is it … when I found a mirror I didn’t see any other “brown stuff” i brought a watermelon and Costco multi-grain chips, Had a couple beers, I took Yuengling B & T - dinner was boiled/grilled chicken, okra, slaw, “dipping” brownies.
This person is the Alvin Ailey of punctuation. He jumps, swirls, swoops, and rolls with the full gamut of punctuational possibilities: [ ; - … & “/. Oddly, when I first read his blog, I didn’t even notice his use of punctuation marks—they just blended into his writing. However, when his blogs were computer analyzed, his use of punctuation stood out.
Punctuation marks can identify some people better than anything they write. In fact, when looking only at punctuation, computer programs identified 31 percent of authors correctly—essentially the same rate as relying on function words. When both function words and punctuation were used together, the computer correctly paired the original bloggers with writing samples several years later 39 percent of the time.
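To make the idea of a punctuation signature concrete, here is a minimal sketch of how a program might tally a writer's punctuation habits. It is not the analysis used in the blog study. The list of marks, the per-1,000-characters scaling, and the function name `punctuation_profile` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical set of marks to track; a real study would choose its own.
MARKS = ";:-…&/!?,.()[]\"'“”’"

def punctuation_profile(text):
    """Return each mark's rate per 1,000 characters, skipping unused marks."""
    counts = Counter(ch for ch in text if ch in MARKS)
    scale = 1000.0 / max(len(text), 1)  # guard against empty text
    return {mark: counts[mark] * scale for mark in MARKS if counts[mark]}
```

Two writing samples could then be compared by how closely their profiles agree, in the same way that function word rates are compared.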
Punctuation, function words, and content words that are used in everyday writing are all parts of our personal signature. To appreciate this, go to your own e-mail account and spend a few minutes looking at the e-mails you send to and receive from others. Start with the page layout. Some people tend to write very long e-mails, whereas others keep them to a sentence or two. People tend to differ in the length of their paragraphs and sentences. Their greetings and closings vary tremendously as well. Some use emoticons; some never do.
Some of these differences may be psychologically important but most probably aren’t. The person who ends most e-mails with “Sincerely” may do this just because they were told to do so when they were younger. Even though these variations may not say anything about your conflicts with your mother when you were an infant, they still mark you. That is, they are part of your general writing style that makes you stand out from everyone else. And that is the interesting story. All of the language features we can measure can help to identify you.
THE CASE OF THE FEDERALIST PAPERS
In 1787 and 1788, a series of eighty-five essays was published in pamphlets and newspapers across the American colonies in an attempt to sway people to support the proposed document that would become the U.S. Constitution. Published anonymously under the name Publius, the papers discussed a wide range of topics, including the role of the presidency, taxation, and state versus federal power. Even at the time, many knew that Publius was not a single person but, instead, James Madison (who would become the fourth president), Alexander Hamilton (the first secretary of the treasury), and John Jay (the first chief justice of the Supreme Court).
In the years that followed, the authorship of seventy-four of the essays gradually became known. Madison wrote fifteen, Hamilton fifty-one, Jay five, and Madison and Hamilton jointly wrote three of the articles. The authorship of the remaining eleven was never determined and has been a source of speculation ever since. The first serious attempt to identify the author of the eleven papers was undertaken by historian Douglass Adair as part of his dissertation in 1943. Adair’s historical analysis deduced that all eleven anonymous essays had been written by James Madison.
The debate resurfaced in 1964 when statisticians Frederick Mosteller and David Wallace introduced a new way to analyze words. By focusing on a small number of function words, they concluded that Adair was indeed correct because their elegant statistical models pointed to Madison as the likely author. Since then, identifying the anonymous authors of the Federalist Papers has become something of a sport whenever new language analysis methods are developed.
I am proud to announce the New Official Findings. Historians, prepare your quills.
Function Word Analyses
Using similar methods to those of Mosteller and Wallace, we find the same effects. The anonymous eleven cases all use pronouns, prepositions, and other stealth words in ways similar to James Madison. Case closed?
Not so fast. Other statisticians have discovered a small problem with the investigation of our founding fathers’ function words. Since Mosteller and Wallace, another technique called cross-validation has been devised. The idea is to examine each of the original essays individually as if it had been written anonymously. In other words, we pull one of the known essays out of the stack and then develop a computer model based on the remaining essays to try to determine who wrote the essay we pulled out. It’s a marvelous method because we are determining results about a question whose answer we already know. If our cross-validation analyses correctly guess the authors of all of the known essays, we can place a tremendous amount of trust in our research methods.
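The pull-one-out procedure can be sketched in a few lines of Python. This is an illustrative sketch only, not the model behind the results reported here: the short word list, the per-1,000-words rates, and the nearest-average classifier are all assumptions, and every function name is hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list chosen for illustration only.
FUNCTION_WORDS = ["the", "of", "to", "and", "upon", "by", "on", "there", "while", "whilst"]

def rates(text):
    """Rate of each function word per 1,000 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)
    return [1000.0 * counts[w] / total for w in FUNCTION_WORDS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def nearest_author(vector, centroids):
    """Author whose average profile is closest (Euclidean distance)."""
    return min(centroids, key=lambda author: math.dist(vector, centroids[author]))

def leave_one_out_accuracy(essays):
    """essays: list of (author, text). Returns the fraction guessed correctly."""
    vectors = [(author, rates(text)) for author, text in essays]
    correct = 0
    for i, (true_author, held_out) in enumerate(vectors):
        # Build the model from every essay except the one pulled out.
        by_author = {}
        for j, (author, vector) in enumerate(vectors):
            if j != i:
                by_author.setdefault(author, []).append(vector)
        centroids = {a: centroid(vs) for a, vs in by_author.items()}
        if nearest_author(held_out, centroids) == true_author:
            correct += 1
    return correct / len(vectors)
```

A result of 1.0 would mean every known essay was assigned to its true author. Anything lower is a warning about how far to trust the model on essays whose author is unknown.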
Heartbreak city. The cross-validation results suggest that Mosteller and Wallace might have been wrong. About 14 percent of the known essays are not classified correctly based on function words. This is a serious problem. If the computer can’t tell us what we already know with extremely high accuracy, we have to be careful in interpreting the results from essays about which we don’t know the author.
Punctuation Analyses
Recall that people’s use of punctuation can reveal authorship in many cases. Similar cross-validation analyses on punctuation produced disappointing results as well. Using a combination of function words and punctuation to predict authorship produced slightly better results than the function words alone. Interestingly, the function words plus punctuation results hinted that Hamilton wrote three of the eleven anonymous essays.
Going for the Tell: People’s Use of Obscure Words
Over the course of my career I have written more papers than I care to admit. Perhaps ten years ago, a colleague thanked me for a review I had written about her research. I was flattered, of course, but a bit puzzled since my review had been written anonymously. “How did you know I was the author of that review?” I blurted out. She just laughed and said one word: intriguing.
Intriguing, indeed. I went back to many of my reviews, then articles, and even books. I was shocked by how frequently I used the word intriguing. Even this book is littered with intrigue—I just can’t help myself. Over the years, I’ve noticed that most of my colleagues and friends have their own favorite but relatively obscure words that even they aren’t familiar with. The words aren’t used at high rates but they find their ways into the occasional e-mail, Facebook post, blog, tweet, or article.
Did Madison or Hamilton have tell words in their articles? With a little sleuthing it turns out the answer is yes. In almost half of his papers, Hamilton used the word readily; Madison never did. In nine of his fifteen articles, Madison used consequently, compared with Hamilton’s use of the word three times across his fifty-one papers. Hamilton also had a fondness for commonly, enough, intended, kind, and naturally. Madison tended to overuse absolutely, administer, betray, composing, compass, innovation, lies, proceedings, and wish.
If we just examine the use of these fourteen words, the statistics are promising—almost a perfect score for cross-validation. However, the story for the unknown authors comes out quite differently than what the earlier scholars claimed. The tell words suggest that Hamilton actually wrote eight of the anonymous essays and Madison wrote only three.
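One simple way to hunt for tell words is to compare the share of each author's papers in which a word appears at least once. The sketch below assumes that approach. The `min_gap` threshold and the function names are hypothetical, and this is not necessarily how the fourteen words above were found.

```python
import re

def tokens(text):
    """Set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def document_share(word, documents):
    """Fraction of documents that contain the word at least once."""
    return sum(word in tokens(doc) for doc in documents) / len(documents)

def tell_words(docs_a, docs_b, min_gap=0.5):
    """Words whose document share differs between authors by at least min_gap.

    Returns {word: (share_for_a, share_for_b)}. min_gap is an assumed cutoff.
    """
    vocabulary = set().union(*(tokens(doc) for doc in docs_a + docs_b))
    found = {}
    for word in sorted(vocabulary):
        share_a = document_share(word, docs_a)
        share_b = document_share(word, docs_b)
        if abs(share_a - share_b) >= min_gap:
            found[word] = (share_a, share_b)
    return found
```

A word like readily, present in almost half of one author's papers and none of the other's, would come back with shares of roughly (0.5, 0.0).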
What is truth in this case? Reading Douglass Adair’s delightful account of the controversy surrounding the eleven articles, it is clear that Hamilton and Madison had very different memories of who wrote what. Adair is ultimately more sympathetic to Madison’s claims, although the objective evidence to assign authorship is not compelling either way. Like Mosteller and Wallace, I have no in-depth knowledge of the actual case. Nevertheless, historians should know that from a statistical perspective, the case is still open.
WHAT SONG LYRICS SAY ABOUT THE BAND: THE BEATLES
The Beatles were together for about ten years before breaking up in 1970. During their time together, they recorded over two hundred songs and influenced music, politics, fashion, and culture for the next generation. The lead songwriters, John Lennon and Paul McCartney, together or separately wrote 155 songs, and George Harrison penned another 25. Even today, scholars—and the occasional barfly—debate about the relative creativity of the band members, who ultimately influenced whom, and how the band changed over time.
Most of this book is devoted to the words people generate in conversations or write in the form of essays, letters, or electronic media such as blogs, e-mails, etc. Music lyrics, however, tell their own stories about their authors. My good friend and occasional collaborator from New Zealand, Keith Petrie, suggested that a computerized linguistic analysis of the Beatles was long overdue. Once we realized how complicated the topic really was, we invited another music lover and psychologist from Norway, Borge Sivertsen, to join us. What could we learn about the Beatles by analyzing their lyrics? Quite a bit, it turns out.
In many ways, the lyrics of the band reflected the natural aging process one usually sees in all working groups. Recall from the last chapter that as working groups spend time together, their conversations evidence drops in I-words and increases in we-words, with increasing language complexity, including bigger words and more prepositions, articles, and conjunctions. As the group aged together, the Beatles expressed themselves through their lyrics in the same way any group would in their conversations with each other.
In their first four years together, their songs brimmed with optimism, anger, and sexuality. Their thinking was simple, self-absorbed, and very much in the here and now. In the last years of the band, the group’s lyrics became more complex, more psychologically distant, and far less positive. Particularly telling was the drop in the use of I-words from almost 14 percent during their first years together to only 7 percent in their last three years.
Lyrics also provide a window into the personalities of the various songwriters within a group. Although John Lennon and Paul McCartney had an agreement that all of their songs would include both men as authors, the order of authorship and extensive interviews have provided historians with a solid, albeit not perfect, record about who wrote what. Between the two, Lennon is credited as the primary writer for seventy-eight songs, McCartney for sixty-seven, and another fifteen songs are considered true collaborations where both were closely involved in the lyrics.
In the popular press, John Lennon was generally portrayed as the creative intellectual and McCartney as the melodic, upbeat tunesmith. The analyses of their lyrics paint a different picture. Lennon did use slightly more negative emotion words in his songs than McCartney, but the two were virtually identical in their use of positive emotions, linguistic complexity, and self-reflection. Interestingly, McCartney’s songs more often focused on couples—as can be seen in his higher use of we-words—than did Lennon’s.
Who was the more creative or varied in his lyric-writing abilities? We can actually test this by seeing how the lyrics from different songs are mathematically similar—both in terms of content as well as linguistic style. Whereas the popular press usually assumed that Lennon was the creative and stylistically variable writer, the numbers clearly support McCartney. Across his career as a Beatle, Paul McCartney proved to be far more flexible and varied both in terms of his writing style and also in the content of his lyrics.
And let’s not forget George Harrison, the quiet, spiritual Beatle who wrote about twenty-five songs, especially in the last years of the Beatles. Although somewhat more cognitively complex in his words than either McCartney or Lennon, he was the least flexible in his writing. In other words, both the content and style of his lyrics were more predictable from song to song. These same types of analyses also demonstrated that Harrison was more influenced in his songwriting style by Lennon than by McCartney.
DOES COLLABORATION RESULT IN AVERAGE OR SYNERGISTIC RESULTS?
Collaboration between writers is a funny business. When two people work together, in John Lennon’s words, “eyeball to eyeball,” do they produce something that is the average of their usual styles or is the result something completely different than either could have written alone? Language analyses can answer this question for both the Beatles and the Federalist Papers. Recall that Lennon and McCartney had very close collaborations on 15 of their 160 songs. Alexander Hamilton and James Madison jointly wrote three Federalist Papers.
Across the various dimensions of language and even punctuation, we can calculate what percentage of the time the collaboration produces an effect that is the average of the two collaborators working on their own. There are three clear hypotheses:
• Just-like-another-member-of-the-team hypothesis. Collaborative writing projects produce language that is similar to that produced by a single person writing alone. Sometimes the work will use words like one author and other times like the other author.
• The average-person hypothesis. More interesting is that collaborations produce language that is the average of the two writers. If Lennon uses a low rate of we-words and McCartney uses a high rate, it would follow that their collaboration would produce a moderate number of we-words.
• The synergy hypothesis. Even more interesting is the idea that when two people work closely together, they create a product unlike either of them would on their own. Their language style will be distinctive in a way such that most people would not recognize who the author was. Wouldn’t it be great if the results supported this hypothesis? Come on, statistics, please, please me.
And the winner by a mile is, in fact, the synergy hypothesis. When Lennon and McCartney and when Madison and Hamilton were working together, they produced works that were strikingly different than works produced by the individual writers themselves. When collaborating, the Lennon-McCartney team produced lyrics that were much more positive, while using more I-words, fewer we-words, and much shorter words than either artist normally used on his own. Similarly, when Hamilton and Madison worked together they used much bigger words, more past tense, and fewer auxiliary verbs than either did on their own. In fact, across about seventy-five dimensions of language and punctuation, more than 90 percent of the time collaborations resulted in language that was either higher or lower than the language of the two writers on their own.
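The comparison behind that percentage can be sketched as follows. For each language dimension, check whether the collaborative value falls between the two solo values or outside them. The function names and the sample numbers are invented for illustration. They are not the Beatles or Federalist data.

```python
def classify(alone_a, alone_b, together):
    """Where does the collaboration fall relative to the two solo values?"""
    low, high = sorted((alone_a, alone_b))
    if together > high:
        return "above both"
    if together < low:
        return "below both"
    return "between"

def share_outside(dimensions):
    """Fraction of dimensions where the collaboration is above or below both.

    dimensions: {name: (alone_a, alone_b, together)}
    """
    outside = sum(classify(*values) != "between" for values in dimensions.values())
    return outside / len(dimensions)

# Invented percentages, purely to show the calculation.
example = {
    "i_words": (5.0, 7.0, 9.0),        # above both writers
    "we_words": (1.0, 3.0, 0.5),       # below both writers
    "word_length": (4.2, 4.6, 4.4),    # between the two writers
    "positive_emotion": (2.0, 3.0, 4.0),  # above both writers
}
```

With these made-up numbers, `share_outside(example)` is 0.75. A value near 1.0 across many dimensions would point to the synergy hypothesis, while a value near 0 would fit the average-person hypothesis.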
Note that collaborations produce quite different language patterns than what the individuals would naturally do on their own. What’s not yet known is if collaborative work is generally better than individual products. This is a research question that is begging to be answered.
SUMMING UP: PACKING YOUR AUTHOR IDENTIFICATION TOOL KIT
Author identification is becoming a very hot topic in the computer world. The three methods that we have relied on involve tracking the rate of function word usage, analyzing punctuation and layout, and examining the use of obscure words. Each of these methods does far better than chance in identifying characteristics of an author as well as matching the author’s writing to other writing samples.
In terms of understanding the author’s personality, we currently know the most about function words. As discussed throughout the book, pronouns, articles, and other stealth words have reliably been linked to the authors’ age, sex, social class, personality, and social connections. Less is currently known about punctuation and personality, but I suspect future research will begin demonstrating convincing links. After all, it’s hard to imagine that there isn’t a difference between the writer who writes at the end of his or her note, “Thanks.” versus one who writes, “Thanks!!!!!!!!!”
The least is known about the use of relatively obscure words and their link to personality. If one author uses intriguing and another remarkable, does the choice of the word itself say anything about the person?
There are also a number of other exciting methods being developed by labs around the world that are relevant to author identification. One strategy is to look at something called N-grams. These can be pairs of words (or bigrams), three words in a row (or trigrams), etc. Looking at the beginning of this paragraph, the bigram approach would look at the occurrence of “there are,” “are also,” “also a,” and so forth. The idea is that some people naturally use groups of words together in a unique way that identifies who they are.
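Extracting N-grams takes only a few lines of code. Here is a minimal sketch, assuming words are split on anything other than letters and apostrophes; the function name `ngrams` is hypothetical.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """List the runs of n consecutive words in a text, in order."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_counts(text, n=2):
    """How often each n-gram occurs in the text."""
    return Counter(ngrams(text, n))
```

Calling `ngrams("There are also a number")` yields ("there", "are"), ("are", "also"), ("also", "a"), and ("a", "number"). Comparing the most frequent N-grams across writing samples is one way such counts might be put to work.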
More elaborate strategies attempt to mathematically predict word order within sentences based on the words the writer has already used. In the beginning of the last paragraph, the odds that it would start with the word there might be 1 in 1,000. The odds that the word are would be the second word, knowing that the first word is there, is perhaps 1 in 20. Knowing “There are,” the odds that the third word is also … you can get the idea. Researchers can determine how unique a person’s writing is and how much it deviates from chance on a sentence-by-sentence level. One argument is that every person’s way of stringing words together is unique to them. Yet another linguistic fingerprint idea.
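The simplest version of this kind of prediction estimates the odds of each word from the word just before it. The sketch below assumes that bigram approach with plain counts and no smoothing. Research systems are considerably more elaborate, and all of the names here are hypothetical.

```python
import re
from collections import Counter, defaultdict

START = "<s>"  # assumed marker for the beginning of a text

def words_of(text):
    return re.findall(r"[a-z']+", text.lower())

def train(texts):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for text in texts:
        words = [START] + words_of(text)
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def probability(following, prev, nxt):
    """Estimated odds that nxt comes right after prev (0.0 if prev is unseen)."""
    seen = following.get(prev, Counter())
    total = sum(seen.values())
    return seen[nxt] / total if total else 0.0

def sequence_probability(following, text):
    """Multiply the word-by-word odds along the whole text."""
    words = [START] + words_of(text)
    result = 1.0
    for prev, nxt in zip(words, words[1:]):
        result *= probability(following, prev, nxt)
    return result
```

The odds multiply along the sentence. With the illustrative figures above, the chance of a paragraph opening with “There are” would be 1/1,000 times 1/20, or 1 in 20,000. A writer whose sentences consistently score as improbable under a model built from other people's writing is, in this sense, distinctive.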
Other new methods examine parts of speech, syntax, cohesiveness of sentences and paragraphs—all using increasingly sophisticated mathematical solutions. The time is not too far away when the author of most any extended language sample will be identifiable.
WORDS AS CLUES TO POLITICAL AND HISTORICAL EVENTS
It comes as no news to historians and literary scholars that the primary key to understanding people or works of the past is the study of the written word. Most scholars, however, rely primarily on their own reading of historical works rather than computerized text analyses. This has been changing over the last few years. One area that has been particularly innovative is political science. Partly because of the availability of transcribed speeches, interviews, newspaper and online articles, newscasts, and even letters to the editor, researchers have been able to tap the appeal of political candidates and people’s responses to them.
The Secret Life of Pronouns: What Our Words Say About Us Page 27