by Sam Kean
A colleague once described Zipf as someone “who would take roses apart to count their petals,” and Zipf treated literature no differently. As a young scholar Zipf tackled James Joyce’s Ulysses, and the main thing he got out of it was that it contained 29,899 different words, and 260,430 words total. From there Zipf dissected Beowulf, Homer, Chinese texts, and the oeuvre of the Roman playwright Plautus. By counting the words in each work, he discovered Zipf’s law. It says that the most common word in a language appears roughly twice as often as the second most common word, roughly three times as often as the third most common, a hundred times as often as the hundredth most common, and so on. In English, “the” accounts for 7 percent of words, “of” about half that, and “and” a third of that, all the way down to obscurities like grawlix or boustrophedon. These distributions hold just as true for Sanskrit, Etruscan, or hieroglyphics as for modern Hindi, Spanish, or Russian. (Zipf also found them in the prices in Sears Roebuck mail-order catalogs.) Even when people make up languages, something like Zipf’s law emerges.
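For readers who like to see the arithmetic, here is a minimal sketch of the rank-frequency rule just described. It assumes the idealized form of Zipf's law, in which the item at rank r appears 1/r times as often as the top item. The toy word list and the function names are hypothetical, built so the counts fit the rule exactly; real texts only approximate it.

```python
from collections import Counter

def rank_frequencies(words):
    """Return (word, count) pairs sorted from most to least common."""
    return Counter(words).most_common()

def zipf_prediction(top_count, rank):
    """Idealized Zipf's law: the item at `rank` appears top_count / rank times."""
    return top_count / rank

# Hypothetical toy corpus, constructed so the counts follow 1/rank exactly.
toy_words = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15

ranked = rank_frequencies(toy_words)
top_count = ranked[0][1]
for rank, (word, count) in enumerate(ranked, start=1):
    # Print the observed count next to the count the idealized law predicts.
    print(rank, word, count, zipf_prediction(top_count, rank))
```

Running the same two functions on a real text would show how closely, or loosely, its word counts track the predicted curve.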
After Zipf died in 1950, scholars found evidence of his law in an astonishing variety of other places—in music (more on this later), city population ranks, income distributions, mass extinctions, earthquake magnitudes, the ratios of different colors in paintings and cartoons, and more. Every time, the biggest or most common item in each class was twice as big or common as the second item, three times as big or common as the third, and so on. Probably inevitably, the theory’s sudden popularity led to a backlash, especially among linguists, who questioned what Zipf’s law even meant, if anything.* Still, many scientists defend Zipf’s law because it feels correct—the frequency of words doesn’t seem random—and, empirically, it does describe languages in uncannily accurate ways. Even the “language” of DNA.
Of course, it’s not apparent at first that DNA is Zipfian, especially to speakers of Western languages. Unlike most languages DNA doesn’t have obvious spaces to distinguish each word. It’s more like those ancient texts with no breaks or pauses or punctuation of any kind, just relentless strings of letters. You might think that the A-C-G-T triplets that code for amino acids could function as “words,” but their individual frequencies don’t look Zipfian. To find Zipf, scientists had to look at groups of triplets instead, and a few turned to an unlikely source for help: Chinese search engines. The Chinese language creates compound words by linking adjacent symbols. So if a Chinese text reads ABCD, search engines might examine a sliding “window” to find meaningful chunks, first AB, BC, and CD, then ABC and BCD. Using a sliding window proved a good strategy for finding meaningful chunks in DNA, too. It turns out that, by some measures, DNA looks most Zipfian, most like a language, in groups of around twelve bases. Overall, then, the most meaningful unit for DNA might not be a triplet, but four triplets working together—a dodecahedron motif.
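The sliding window described above can be sketched in a few lines. This is only an illustration of the general idea, not the method any particular search engine or research group used. The function names and the toy DNA string are assumptions made for the example.

```python
from collections import Counter

def sliding_windows(text, sizes):
    """Return every contiguous chunk of each size, in the order the sizes are given."""
    chunks = []
    for size in sizes:
        for start in range(len(text) - size + 1):
            chunks.append(text[start:start + size])
    return chunks

def window_counts(dna, size):
    """Count how often each chunk of `size` bases occurs as the window slides one base at a time."""
    return Counter(sliding_windows(dna, [size]))

# The ABCD example from the text: first the pairs, then the triples.
print(sliding_windows("ABCD", [2, 3]))

# Hypothetical toy sequence; a real analysis would use windows of about twelve bases.
print(window_counts("ACGACG", 3).most_common())
```

Ranking the counts from `window_counts` by frequency is the step at which one could check whether the chunks follow a Zipf-like curve.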
The expression of DNA, the translation into proteins, also obeys Zipf’s law. Like common words, a few genes in every cell get expressed time and time again, while most genes hardly ever come up in conversation. Over the ages cells have learned to rely on these common proteins more and more, and the most common one generally appears twice and thrice and quatrice as often as the next-most-common proteins. To be sure, many scientists harrumph that these Zipfian figures don’t mean anything; but others say it’s time to appreciate that DNA isn’t just analogous to but really functions like a language.
And not just a language: DNA has Zipfian musical properties, too. Given the key of a piece of music, like C major, certain notes appear more often than others. In fact Zipf once investigated the prevalence of notes in Mozart, Chopin, Irving Berlin, and Jerome Kern—and lo and behold, he found a Zipfian distribution. Later researchers confirmed this finding in other genres, from Rossini to the Ramones, and discovered Zipfian distributions in the timbre, volume, and duration of notes as well.
So if DNA shows Zipfian tendencies, too, is DNA arranged into a musical score of sorts? Musicians have in fact translated the A-C-G-T sequence of serotonin, a brain chemical, into little ditties by assigning the four DNA letters to the notes A, C, G, and, well, E. Other musicians have composed DNA melodies by assigning harmonious notes to the amino acids that popped up most often, and found that this produced more complex and euphonious sounds. This second method reinforces the idea that, much like music, DNA is only partly a strict sequence of “notes.” It’s also defined by motifs and themes, by how often certain sequences occur and how well they work together. One biologist has even argued that music is a natural medium for studying how genetic bits combine, since humans have a keen ear for how phrases “chunk together” in music.
Something even more interesting happened when two scientists, instead of turning DNA into music, inverted the process and translated the notes from a Chopin nocturne into DNA. They discovered a sequence “strikingly similar” to part of the gene for RNA polymerase. This polymerase, a protein universal throughout life, is what builds RNA from DNA. Which means, if you look closer, that the nocturne actually encodes an entire life cycle. Consider: Polymerase uses DNA to build RNA. RNA in turn builds complicated proteins. These proteins in turn build cells, which in turn build people, like Chopin. He in turn composed harmonious music—which completed the cycle by encoding the DNA to build polymerase. (Musicology recapitulates ontology.)
So was this discovery a fluke? Not entirely. Some scientists argue that when genes first appeared in DNA, they didn’t arise randomly, along any old stretch of chromosome. They began instead as repetitive phrases, a dozen or two dozen DNA bases duplicated over and over. These stretches function like a basic musical theme that a composer tweaks and tunes (i.e., mutates) to create pleasing variations on the original. In this sense, then, genes had melody built into them from the start.
Humans have long wanted to link music to deeper, grander themes in nature. Most notably astronomers from ancient Greece right through to Kepler believed that, as the planets ran their course through the heavens, they created an achingly beautiful musica universalis, a hymn in praise of Creation. It turns out that universal music does exist, only it’s closer than we ever imagined, in our DNA.
Genetics and linguistics have deeper ties beyond Zipf’s law. Mendel himself dabbled in linguistics in his older, fatter days, including an attempt to derive a precise mathematical law for how the suffixes of German surnames (like -mann and -bauer) hybridized with other names and reproduced themselves each generation. (Sounds familiar.) And heck, nowadays, geneticists couldn’t even talk about their work without all the terms they’ve lifted from the study of languages. DNA has synonyms, translations, punctuation, prefixes, and suffixes. Missense mutations (substituting amino acids) and nonsense mutations (interfering with stop codons) are basically typos, while frameshift mutations (screwing up how triplets get read) are old-fashioned typesetting mistakes. Genetics even has grammar and syntax—rules for combining amino acid “words” and clauses into protein “sentences” that cells can read.
More specifically, genetic grammar and syntax outline the rules for how a cell should fold a chain of amino acids into a working protein. (Proteins must be folded into compact shapes before they’ll work, and they generally don’t work if their shape is wrong.) Proper syntactical and grammatical folding is a crucial part of communicating in the DNA language. However, communication does require more than proper syntax and grammar; a protein sentence has to mean something to a cell, too. And, strangely, protein sentences can be syntactically and grammatically perfect, yet have no biological meaning. To understand what on earth that means, it helps to look at something linguist Noam Chomsky once said. He was trying to demonstrate the independence of syntax and meaning in human speech. His example was “Colorless green ideas sleep furiously.” Whatever you think of Chomsky, that sentence has to be one of the most remarkable things ever uttered. It makes no literal sense. Yet because it contains real words, and its syntax and grammar are fine, we can sort of follow along. It’s not quite devoid of meaning.
In the same way, DNA mutations can introduce random amino acid words or phrases, and cells will automatically fold the resulting chain together in perfectly syntactical ways based on physics and chemistry. But any wording changes can change the sentence’s whole shape and meaning, and whether the result still makes sense depends. Sometimes the new protein sentence contains a mere tweak, minor poetic license that the cell can, with work, parse. Sometimes a change (like a frameshift mutation) garbles a sentence until it reads like grawlix—the #$%^&@! swear words of comics characters. The cell suffers and dies. Every so often, though, the cell reads a protein sentence littered with missense or nonsense… and yet, upon reflection, it somehow does make sense. Something wonderful like Lewis Carroll’s “mimsy borogoves” or Edward Lear’s “runcible spoon” emerges, wholly unexpectedly. It’s a rare beneficial mutation, and at these lucky moments, evolution creeps forward.*
Because of the parallels between DNA and language, scientists can even analyze literary texts and genomic “texts” with the same tools. These tools seem especially promising for analyzing disputed texts, whose authorship or biological origin remains doubtful. With literary disputes, experts traditionally compared a piece to others of known provenance and judged whether its tone and style seemed similar. Scholars also sometimes cataloged and counted what words a text used. Neither approach is wholly satisfactory—the first too subjective, the second too sterile. With DNA, comparing disputed genomes often involves matching up a few dozen key genes and searching for small differences. But this technique fails with wildly different species because the differences are so extensive, and it’s not clear which differences are important. By focusing exclusively on genes, this technique also ignores the swaths of regulatory DNA that fall outside genes.
To circumvent these problems, scientists at the University of California at Berkeley invented software in 2009 that again slides “windows” along a string of letters in a text and searches for similarities and patterns. As a test, the scientists analyzed the genomes of mammals and the texts of dozens of books like Peter Pan, the Book of Mormon, and Plato’s Republic. They discovered that the same software could, in one trial run, classify DNA into different genera of mammals, and could also, in another trial run, classify books into different genres of literature with perfect accuracy. In turning to disputed texts, the scientists delved into the contentious world of Shakespeare scholarship, and their software concluded that the Bard did write The Two Noble Kinsmen—a play lingering on the margins of acceptance—but didn’t write Pericles, another doubtful work. The Berkeley team then studied the genomes of viruses and archaebacteria, the oldest and (to us) most alien life-forms. Their analysis revealed new links between these and other microbes and offered new suggestions for classifying them. Because of the sheer amount of data involved, the analysis of genomes can get intensive; the virus-archaebacteria scan monopolized 320 computers for a year. But genome analysis allows scientists to move beyond simple point-by-point comparisons of a few genes and read the full natural history of a species.
Reading a full genomic history, however, requires more dexterity than reading other texts. Reading DNA requires both left-to-right and right-to-left reading—boustrophedon reading. Otherwise scientists miss crucial palindromes and semordnilaps, phrases that read the same forward and backward (and vice versa).
One of the world’s oldest known palindromes is an amazing up-down-and-sideways square carved into walls at Pompeii and other places:
S-A-T-O-R
A-R-E-P-O
T-E-N-E-T
O-P-E-R-A
R-O-T-A-S
At just two millennia old, however, sator… rotas* falls orders of magnitude short of the age of the truly ancient palindromes in DNA. DNA has even invented two kinds of palindromes. There’s the traditional, sex-at-noon-taxes type—GATTACATTAG. But because of A-T and C-G base pairing, DNA sports another, subtler type that reads forward down one strand and backward across the other. Consider the string CTAGCTAG, then imagine what bases must appear on the other strand, GATCGATC. They’re perfect palindromes.
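The two kinds of palindrome can be told apart with a short check. The sketch below assumes only the A-T and C-G pairing rule stated above; the function names are made up for illustration.

```python
# Base-pairing rule: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_text_palindrome(seq):
    """Traditional palindrome: the same letters forward and backward on one strand."""
    return seq == seq[::-1]

def complement(seq):
    """The bases that must sit opposite `seq` on the other strand."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """The other strand, read backward."""
    return complement(seq)[::-1]

def is_dna_palindrome(seq):
    """Second type: reads forward on one strand as the other strand reads backward."""
    return seq == reverse_complement(seq)

print(is_text_palindrome("GATTACATTAG"))  # True
print(complement("CTAGCTAG"))             # GATCGATC
print(is_dna_palindrome("CTAGCTAG"))      # True
```

Note that each example passes only its own test: GATTACATTAG is not a palindrome of the second type, and CTAGCTAG is not one of the first.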
Harmless as it seems, this second type of palindrome would send frissons of fear through any microbe. Long ago, many microbes evolved special proteins (called “restriction enzymes”) that can snip clean through DNA, like wire cutters. And for whatever reason, these enzymes can cut DNA only along stretches that are highly symmetrical, like palindromes. Cutting DNA has some useful purposes, like clearing out bases damaged by radiation or relieving tension in knotted DNA. But naughty microbes mostly used these proteins to play Hatfields versus McCoys and shred each other’s genetic material. As a result microbes have learned the hard way to avoid even modest palindromes.
Not that we higher creatures tolerate many palindromes, either. Consider CTAGCTAG and GATCGATC again. Notice that the beginning half of either palindromic segment could base-pair with the second half of itself: the first letter with the last (C… G), the second with the penult (T… A), and so on. But for these internal bonds to form, the DNA strand on one side would have to disengage from the other and buckle upward, leaving a bump. This structure, called a “hairpin,” can form along any DNA palindrome of decent length because of its inherent symmetry. As you might expect, hairpins can destroy DNA as surely as knots, and for the same reason—they derail cellular machinery.
Palindromes can arise in DNA in two ways. The shortish DNA palindromes that cause hairpins arose randomly, when A’s, C’s, G’s, and T’s just happened to arrange themselves symmetrically. Longer palindromes litter our chromosomes as well, and many of those—especially those that wreak havoc on the runt Y chromosome—probably arose through a specific two-step process. For various reasons, chromosomes sometimes accidentally duplicate chunks of DNA, then paste the second copy somewhere down the line. Chromosomes can also (sometimes after double-strand breaks) flip a chunk of DNA by 180 degrees and reattach it ass-backwards. In tandem, a duplication and inversion create a palindrome.
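The two-step process can be mimicked on a toy string. This is a simplified sketch under stated assumptions. Flipping a double-stranded chunk by 180 degrees is modeled as taking its reverse complement, and the flipped copy is pasted directly after the original rather than "somewhere down the line." The sequence, coordinates, and function names are all hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Model of a 180-degree flip: the chunk as read backward off the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def duplicate_and_invert(chromosome, start, end):
    """Copy chromosome[start:end], flip the copy, and paste it right after the original chunk."""
    chunk = chromosome[start:end]
    flipped_copy = reverse_complement(chunk)
    return chromosome[:end] + flipped_copy + chromosome[end:]

# Hypothetical toy chromosome: the chunk GATTC sits at positions 2 through 6.
before = "CCGATTCAA"
after = duplicate_and_invert(before, 2, 7)
print(after)

# The original chunk plus its flipped copy now forms a DNA palindrome.
palindrome = after[2:12]
print(palindrome, palindrome == reverse_complement(palindrome))
```

In a real chromosome the two arms are usually separated by a spacer, so the palindrome is the pair of arms rather than one unbroken run.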
Most chromosomes, though, discourage long palindromes or at least discourage the inversions that create them. Inversions can break up or disable genes, leaving the chromosome ineffective. Inversions can also hurt a chromosome’s chances of crossing over—a huge loss. Crossing over (when twin chromosomes cross arms and exchange segments) allows chromosomes to swap genes and acquire better versions, or versions that work better together and make the chromosome more fit. Equally important, chromosomes take advantage of crossing over to perform quality-control checks: they can line up side by side, eyeball each other up and down, and overwrite mutated genes with nonmutated genes. But a chromosome will cross over only with a partner that looks similar. If the partner looks suspiciously different, the chromosome fears picking up malignant DNA and refuses to swap. Inversions look dang suspicious, and in these circumstances, chromosomes with palindromes get shunned.
Y once displayed this intolerance for palindromes. Way back when, before mammals split from reptiles, X and Y were twin chromosomes and crossed over frequently. Then, 300 million years ago, a gene on Y mutated and became a master switch that causes testes to develop. (Before this, sex was probably determined by the temperature at which Mom incubated her eggs, the same nongenetic system that determines pink or blue in turtles and crocodiles.) Because of this change, Y became the “male” chromosome and, through various processes, accumulated other manly genes, mostly for sperm production. As a consequence, X and Y began to look dissimilar and shied away from crossing over. Y didn’t want to risk its genes being overwritten by shrewish X, while X didn’t want to acquire Y’s meathead genes, which might harm XX females.
After crossing over slowed down, Y grew more tolerant about inversions, small and large. In fact Y has undergone four massive inversions in its history, truly huge flips of DNA. Each one created many cool palindromes—one spans three million letters—but each one made crossing over with X progressively harder. This wouldn’t be a huge deal except, again, crossing over allows chromosomes to overwrite malignant mutations. Xs could keep doing this in XX females, but when Y lost its partner, malignant mutations started to accumulate. And every time one appeared, cells had no choice but to chop Y down and excise the mutated DNA. The results were not pretty. Once a large chromosome, Y has lost all but two dozen of its original fourteen hundred genes. At that rate, biologists once assumed that Ys were goners. They seemed destined to keep picking up dysfunctional mutations and getting shorter and shorter, until evolution did away with Ys entirely—and perhaps did away with males to boot.
Palindromes, however, may have pardoned Y. Hairpins in a DNA strand are bad, but if Y folds itself into a giant hairpin, it can bring any two of its palindromes—which are the same genes, one running forward, one backward—into contact. This allows Y to check for mutations and overwrite them. It’s like writing down “A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal: Panama!” on a piece of paper, folding the paper over, and correcting any discrepancies letter by letter—something that happens six hundred times in every newborn male. Folding over also allows Ys to make up for their lack of a sex-chromosome partner and “recombine” with themselves, swapping genes at one point along their lengths for genes at another.
This palindromic fix is ingenious. Too clever, in fact, by half. The system Y uses to compare palindromes regrettably doesn’t “know” which palindrome has mutated and which hasn’t; it just knows there’s a difference. So not infrequently, Y overwrites a good gene with a bad one. The self-recombination also tends to—whoops—accidentally delete the DNA between the palindromes. These mistakes rarely kill a man, but can render his sperm impotent. Overall the Y chromosome would disappear if it couldn’t correct mutations like this; but the very thing that allows it to, its palindromes, can unman it.