The Half-Life of Facts

by Samuel Arbesman


  Scholars who study paleography—the field of research that examines ancient writing—are all too aware of the mistakes that scribes make when copying a text. These types of errors, which can be used to understand the provenance of a document, are actually nearly identical to the types of errors caused by polymerase enzymes, the proteins responsible for copying DNA strands.

  When it comes to copying DNA—those strands of information that code for proteins and so much more—there are a few advantages over simply hand copying a document. DNA’s language is made up of four letters, or bases, which come in complementary pairs: A always goes with T, and G always goes with C. When DNA is copied, its double helix is unzipped, and the letters of each helix—one side of the zipper—can be easily paired with their complementary letters. This results in two new double helices—closed zippers—both of which have properly paired letters, because the complementary letters act as a simple way to prevent errors.

  Nonetheless, when DNA is replicated, it’s sometimes done imperfectly. The group of chemical machines responsible for duplicating a strand of DNA occasionally makes mistakes. That’s what can make up a mutation: an incorrect copying, or even a piece of DNA getting hit by a cosmic ray. However it happens, some error is introduced into the sequence. For example, an A gets turned into a G, or something much bigger happens. The types of mutations fall into a few categories, such as duplicating a section of DNA or deleting a letter, due to regular ways that the DNA copying mechanisms operate. The majority of these errors cause no problem whatsoever, but in some cases a change in a single letter of DNA can cause some large-scale issues, such as in the case of sickle-cell anemia.

  There are systematic errors in copying a text as well. Whether it’s skipping a word or duplicating it, there is an order to the ways in which a scribe’s mind wanders during his transcription. Many of the errors can be grouped into categories, just like the different types of genetic mutations. And not only are there regularities to how both DNA and ancient manuscripts are copied incorrectly, but these types of errors are often very similar, despite the large differences between how scribes and enzymes work.

  There is a common scribal error known by the Greek term homeoteleuton. This refers to a type of deletion, in which there are two identical word phrases separated by some other text and the scribe accidentally skips to the second phrase without transcribing the intervening portion, including the first instance of the phrase. For example, there is a verse at the end of the creation story in Genesis that reads, “And on the seventh day God finished the work that he had done, and he rested on the seventh day from all the work that he had done.” Notice that the phrase “the work that he had done” is repeated. If a scribe incorrectly transcribed the verse simply as “And on the seventh day God finished the work that he had done,” and then proceeded to the next verse, that would be a homeoteleuton.

  In genetics, this same error is known as a slipped-strand mispairing mutation. AATTCGATATACGA gets copied as AATTCGA, skipping the middle section.
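
  The shared mechanics of the scribal and the genetic deletion can be sketched in a few lines of Python. This is only an illustration, not a model of how polymerases or scribes actually work: the function name `skip_between_repeats` is invented here, and the sketch assumes the copyist jumps from the first occurrence of a repeated phrase straight to the second.

```python
def skip_between_repeats(text: str, phrase: str) -> str:
    """Copy `text`, but jump from the first occurrence of `phrase`
    directly to the second, dropping everything in between.

    Hypothetical helper for illustration only.
    """
    first = text.find(phrase)
    if first == -1:
        return text  # phrase never appears: nothing to skip
    second = text.find(phrase, first + len(phrase))
    if second == -1:
        return text  # phrase appears only once: a faithful copy
    # Keep everything before the first occurrence, then resume
    # at the second occurrence.
    return text[:first] + text[second:]


# The DNA example from the text, with "CGA" as the repeated stretch.
print(skip_between_repeats("AATTCGATATACGA", "CGA"))  # AATTCGA

# The verse from Genesis, with its repeated phrase.
verse = ("And on the seventh day God finished the work that he had done, "
         "and he rested on the seventh day from all the work that he had done.")
print(skip_between_repeats(verse, "the work that he had done"))
```

  The same function produces both errors, which is the point: the mistake depends only on the repetition, not on whether the copyist is an enzyme or a scribe.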

  Insertions can occur during copying in both genetics and paleography as well. In genetics these are simply called insertions; in manuscripts, the accidental repetition of text is called dittography. There are also reversals: metathesis in paleography and chromosomal transpositions in genetics. And point mutations, in which the wrong genetic base is substituted when copying DNA, have their counterpart in handwritten manuscripts. In both cases the wrong letter is written, and the mix-up is more likely the more similar the two letters are. In DNA, C and T are quite similar chemically and can be confused easily. In ancient Greek, lambda and delta look similar and are more likely to be exchanged as well. And the list goes on.

  While such similarities are fun to chronicle, they can also be exploited in the same way. Each type of error occurs with different yet predictable frequencies, which we can use if we want to make judgments about the ages of documents or sequences. For example, if a rare mutation is found frequently in a genetic sequence, the sequence can be assumed to be quite old, since a long period of time is needed for these errors to accumulate. In addition, errors can be used to infer the relationship between differing versions of documents or sequences. If two documents have few differences between them, we can assume that they are more closely related than two documents that have many differences.
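
  One simple way to count the differences between two copies is the edit distance: the smallest number of insertions, deletions, and substitutions needed to turn one into the other. The sketch below is a toy under stated assumptions, not the evolutionary-genetics software that researchers actually use; the function name `edit_distance` and the sample copies are invented for illustration.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Smallest number of insertions, deletions, and substitutions
    that turns sequence `a` into sequence `b` (Levenshtein distance)."""
    # prev[j] is the distance between the items of `a` seen so far
    # and the first j items of `b`.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # drop x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


# Hypothetical copies of one phrase, compared word by word.
original = "and he rested on the seventh day".split()
copy_a = "and he rested on the the seventh day".split()  # a word doubled
copy_b = "and rested on seventh day".split()             # two words dropped

print(edit_distance(original, copy_a))  # 1
print(edit_distance(original, copy_b))  # 2
print(edit_distance(copy_a, copy_b))    # 3
```

  By this measure each copy sits closer to the original than the two copies sit to each other, which is the pattern one would expect if both were made independently from the same source.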

  More generally, mutational differences between DNA sequences can be used to understand the evolutionary history of a population, or even of a group of species. So too with variants of the same manuscript. A famous example is research that quantitatively studied the differences18 between the surviving versions of Geoffrey Chaucer’s The Canterbury Tales. By subjecting the variants to a battery of genetic analyses, the researchers were able to better understand the contents of the ancestral version, Chaucer’s own copy.

  They used one of the better-known, and saucier, sections of Chaucer’s work, “The Wife of Bath’s Prologue,” in order to trace how changes can be used to find the original version. Based on fifty-eight different surviving versions of this section, which is 850 lines long, the team of researchers—made up of biochemists, information scientists, and humanities scholars—used off-the-shelf computer programs from the field of evolutionary genetics to deduce what Chaucer’s original version likely looked like. They concluded that Chaucer’s original was in fact an unfinished version, complete with his notes about intended additions and deletions.

  Armed with a sense of how genetic tools can be applied to understand how texts spread and change, we can now use this to understand exactly what we have been trying to grasp: how information, especially misinformation, spreads.

  Mutation of texts is far from an ancient problem. It happens in modern times and we can see it especially clearly in the world of science, when facts themselves are referenced: Citations—references in scholarly papers to previous works—also mutate over time.

  Too often a scientist cites a popular paper in her own work without actually having read it. Sometimes scientists just look at the bibliographies of other papers and copy the citation to the paper instead. This somewhat lazy approach is unfortunately all too common, and if one scientist types it incorrectly, then suddenly there is a mutated version of the citation out in the wild. If other scientists come along and look only at that reference and not the original paper itself, that typo gets propagated from paper to paper, leading to a proliferation of errors. Just as we can learn about how ancient manuscripts spread errors, studying these mutations can allow us to learn about the history of the article that is being cited.

  Mikhail Simkin and Vwani Roychowdhury, professors of electrical engineering at the University of California–Los Angeles, actually measured19 how often these sorts of factual corruptions occur in the scientific literature. In a series of papers, they explored the possible mechanisms for how this occurs, with most of their mathematical models relying on hypothetical scientists grabbing a few papers they’ve recently read and copying citations from the back. An assumption of laziness, certainly, but it also seems that something close to this might actually be the truth.

  Simkin and Roychowdhury conclude, using some elegant math, that only about 20 percent of scientists who cite an article have actually read that paper. This means that four out of five scientists never take the time to track down a publication they intend to use to buttress their arguments. By examining these mutations we can trace these errors backward in time, and understand how knowledge truly spread from scientist to scientist, instead of how it appeared to spread.

  We can even see the spread of such misinformation in a somewhat more lighthearted context. If you had an e-mail address in the late 1990s, you were likely the recipient of a letter that looked something like this:

  This is for anyone who thinks NPR/PBS is a worthwhile expenditure of $1.12/year of their taxes. . . . A petition follows. If you sign, please forward it on to others. If not, please don’t kill it—send it to the e-mail address listed here: XXXX@XXXX.edu. PBS, NPR (National Public Radio), and the arts are facing major cutbacks in funding.

  In case it isn’t immediately clear, this is a chain letter. It is one of the less insidious types, as some are much more overt hoaxes that promise good luck if you spread them, bad luck (or death!) if you ignore them, and the like. While they are certainly not accurate pieces of knowledge, these types of letters circulate for a very long time.

  Once again, we can use letters to understand how errors spread by examining how they circulate before dying out. This was the question that David Liben-Nowell and Jon Kleinberg, both computer scientists, set out to answer.

  Liben-Nowell and Kleinberg compiled a massive collection of different versions of the chain letter shown above, as well as a petition purporting to organize opposition to the Iraq war (neither of these letters was entirely factual; both had their roots in hoaxes). Through a Web site, they asked for volunteers to search their e-mail archives for variations of them. In their database, they found all the hallmarks of biological-style textual mutation. From their paper:20

  Some recipients reordered the list of names on their copy of the letter in ways closely analogous to the kinds of chromosomal rearrangements one finds due to sequence mutation events in biological settings. We observed examples of point mutations (in some petition copies, names were replaced by the names of political figures), insertion/deletion events (there were a number of small blocks of 1–5 names that were present in the middle of the list in some petition copies and absent in other copies), duplication events (blocks of 2–20 names that were duplicated in some petition copies, sometimes immediately adjacent within the list and sometimes hundreds of names later), block rearrangements (in one petition, two pairs of blocks of 2–3 names were swapped relative to their position in all other copies that contained the same names), and one hybridization event (the names at the ends of two copies of the petition were intermingled after their common prefix in a third copy).

  But just as these mutations and errors can be used to understand how knowledge spreads, another element of these letters can be used to explore the branching and spreading of information. Using the signatories (the people who spread it), which are included in the data, we can see how false knowledge makes its way through a population. By looking at how the letters accumulated signatures, the researchers were able to trace their spread and see who sent the letters to whom.

  And they found something that contradicts our intuition about social networks. While we are embedded in highly clustered social circles, ones that also have the property of connecting everyone within a handful of hops (that six degrees of separation again), the spread of these chain letters does not have the feel of an epidemic. Rather than the letters spreading to hundreds of individuals, who in turn each spread it to hundreds of additional recipients themselves, and so forth, it was much more tame. They only spread successfully to one or two people at each step. So while they spread for a long time, slowly burning through a tiny sample of the population, they didn’t create any sort of rapid, massive conflagration.
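
  The gap between these two pictures is easy to put into numbers. The sketch below is a deliberately crude toy, not the researchers' model: it assumes that every recipient forwards the letter to exactly the same number of new people and that nobody receives it twice. The function name `people_reached` is invented for illustration.

```python
def people_reached(fanout: int, steps: int) -> int:
    """Total people holding the letter after `steps` rounds of forwarding,
    counting the original sender, if everyone forwards it to exactly
    `fanout` new people (a simplifying assumption)."""
    return sum(fanout ** k for k in range(steps + 1))


# Epidemic-style spread: a hundred new recipients per person.
print(people_reached(100, 3))  # 1010101

# Chain-letter-style spread: two new recipients per person.
print(people_reached(2, 3))    # 15
print(people_reached(2, 10))   # 2047
```

  Under these assumptions, three rounds of epidemic-style forwarding reach about a million people, while ten rounds at two recipients apiece reach only a couple of thousand: a long, slow burn rather than a conflagration.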

  This can be a good thing. While a fact or, more important, an incorrect fact—whether the iron content of spinach or the proper scientific name of a long-necked dinosaur—might be able to percolate through a population, and slowly weave its way through a group, it won’t necessarily spread widely. The downside, though, is that it might linger. It can hop from person to person, lasting far longer than we might expect, even if it only affects a tiny subset of the group.

  Happily, it is often the case that credible information or news spreads faster21 and wider than what is false. But no matter the speed with which an error takes hold, rooting it out can be a very difficult process, as it’s hard in everyday life to trace the error back to its source and disabuse each person at every step of their wrong information.

  • • •

  FACTS do not spread instantaneously, even with modern technology. They weave their way through social networks in mathematically predictable ways. Along the way they can also mutate and become filled with errors, again in a reliable manner. Errors can continue to spread, lasting much longer than we might realize. Soon enough, the knowledge in a single area is filled with facts but also with the ejecta from a single burst of errata, making it difficult to know what is true.

  Luckily, there is a simple remedy: Be critical before spreading information and examine it to see what is true. Too often the source of an error is not knowing where one’s facts came from or whether they are well-founded at all. We often just take things on faith.

  The modern origins of empirical scientific knowledge lie in the sixteenth and seventeenth centuries. This time period, known as the Scientific Revolution, saw advances such as Newton’s theory of gravitation, Boyle’s gas laws, Hooke’s first description of cells, and the beginnings of the Royal Society, a scientific group that exists to this day. The spirit that infused this time period brought forth a whole host of new knowledge, and the disproving of facts that had existed for centuries, if not millennia. The Scientific Revolution has made the swift changes in modern-day knowledge possible.

  But some of the most important components of this endeavor were to try to eliminate errors and create a means of spreading correct facts. Many of the papers presented in the early years at the Royal Society were devoted to trying to understand errors, to root out misunderstandings, or to test the veracity of tales told to them that often seemed too good to be true. For example, here is a characteristically wordy title of a paper published in 1753 in the Philosophical Transactions of the Royal Society: “Experimental Examination of a White Metallic Substance Said to Be Found in the Gold Mines of the Spanish West-Indies, and There Known by the Appellations of Platina, Platina di Pinto, Juan Blanca.” No doubt some man of science had heard of this mysterious white metallic substance from these gold mines and its properties (it appears to be platinum) and felt it important to examine it.

  They tried to test anything they heard and to eliminate the errors in it, however long those errors had persisted. Most important, they didn’t keep this new knowledge secret. They spread it far and wide, publishing it and disseminating it through the loose network of natural philosophers of Europe.

  What you know depends on its being knowable to you specifically, on its having been spread to you. As we’ve seen, this spread relies on social networks, and sometimes on the all-too-human tendency to corrupt information as it spreads. But as long as we remain true to the spirit of the Scientific Revolution, by not taking things on faith and by spreading true facts, we are far from being overwhelmed with error.

  But even with the massive advances in technology and our ability to disseminate knowledge (whose modern origins are found in Gutenberg’s Mainz), facts sometimes don’t spread as far as they should. Therein lies the curious situation of hidden knowledge.

  CHAPTER 6

  Hidden Knowledge

  MY father, Harvey Arbesman, is a dermatologist and an epidemiologist. He spends about half of his time seeing patients, diagnosing and treating skin cancer, and the other half doing research. As a researcher, he is fond of the unexpected hypothesis and the counterintuitive concept. This has led him to publish on such topics as whether malignant melanomas are associated with the increased use of antibacterial soaps1 and whether dairy consumption is related to acne.2 Essentially, he is drawn to the tough challenge. This research style led him to InnoCentive.

  Alpheus Bingham was the vice president of research and development strategy at the pharmaceutical company Eli Lilly when he began thinking about experts and how they solve problems. He realized that while an expert might solve a hard problem 20 percent of the time, simply giving it to five experts won’t always yield results. There’s a good chance that all the experts will fail.

  But what if this pool of people was made much wider? Perhaps, Bingham argued, there was a “long tail of expertise” (his term, not mine) of lots of people who are all interested in solving a technical problem but each of whom has a very small chance of success. Using this logic, as long as you get a really large group there’s a decent chance that the problem will be solved. The math sounds like it should work out, but would it really work in practice?
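
  The arithmetic behind this argument can be written out directly. The sketch below assumes that each solver succeeds or fails independently of the others, which is a simplification; the function names are invented here, and the long-tail figures (a 1 percent chance for each of 500 solvers) are hypothetical numbers chosen only for illustration.

```python
def chance_all_fail(p_success: float, solvers: int) -> float:
    """Probability that every solver fails, assuming each one
    succeeds independently with probability `p_success`."""
    return (1 - p_success) ** solvers


def chance_any_success(p_success: float, solvers: int) -> float:
    """Probability that at least one solver cracks the problem."""
    return 1 - chance_all_fail(p_success, solvers)


# Five experts, each with a 20 percent chance: all five fail
# roughly a third of the time.
print(round(chance_all_fail(0.20, 5), 3))  # 0.328

# A hypothetical long tail: 500 solvers with a 1 percent chance each.
print(chance_any_success(0.01, 500) > 0.99)  # True
```

  Under the independence assumption, a large crowd of long shots does better than a handful of experts: the chance that everyone fails shrinks geometrically as the crowd grows.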

  Bingham, with the support of Eli Lilly, created a separate company called InnoCentive, which is designed to test this hypothesis. InnoCentive acts as a clearinghouse between organizations or companies that have problems and solvers—those people from all areas of life who are interested in solving problems and can work better in the aggregate than the experts.

  Bingham’s intuition was right: InnoCentive works. It works because it draws on solutions and insights from different fields. Often the solver is involved in a technical discipline that is near the area of the problem but just different enough to be distinct. With this distinctiveness comes the potential for informational import and export. A fact or solution might be well-known in one area, but it is still an entirely open question in the other. This allows people who might not be experts to bring what they know in their field and apply it to other areas. A sort of fact recombination—where ideas are brought together in new ways—is often the way that problems are solved at InnoCentive.

  For example, when Roche brought a problem3 that it had been working on for fifteen years, the crowd recapitulated all the possible solutions that the company had already tried, and in only sixty days. But even better, there was an actual working solution among the proposals, something the company had failed to find. When NASA used InnoCentive,4 they quickly got the answer to a problem that had been bothering them for thirty years! Instead of working in the same area for a quarter of a century, you can open up the question to a larger group and get an answer from an unexpected source. And more important, an unexpected field.

 
