Yet the applicability of Shannon information theory to molecular biology has, to some degree, obscured a key distinction concerning the type of information that DNA possesses. Although Shannon’s theory measures the amount of information in a sequence of symbols or characters (or chemicals functioning as such), it doesn’t distinguish a meaningful or functional sequence from useless gibberish. For example:
“we hold these truths to be self-evident”
“ntnyhiznslhtgeqkahgdsjnfplknejmsed”
These two sequences are equally long and equally improbable if we imagine them being drawn at random. Thus, they contain the same amount of Shannon information. Yet clearly there is an important qualitative distinction between them that the Shannon measurement does not capture. The first meaningful sequence performs a communication function, while the second does not.
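A minimal sketch can illustrate why the measure cannot tell such sequences apart. It assumes an idealized model in which every character is drawn independently and uniformly from a fixed alphabet; the function name, the 27-symbol alphabet (a to z plus the space), and the two equal-length example strings are illustrative assumptions, not part of Shannon's or Weaver's text.

```python
import math

def uniform_information_bits(sequence: str, alphabet_size: int) -> float:
    """Shannon information in bits, ASSUMING each character is drawn
    independently and uniformly from an alphabet of the given size."""
    return len(sequence) * math.log2(alphabet_size)

# Hypothetical examples: both strings are 12 characters long.
meaningful = "call me back"
scrambled = "qzkd vxjwpth"

bits_meaningful = uniform_information_bits(meaningful, 27)
bits_scrambled = uniform_information_bits(scrambled, 27)

# Under this model the measure depends only on length and alphabet size,
# so the two values are identical regardless of meaning.
print(bits_meaningful, bits_scrambled)
```

Under these assumptions the calculation never looks at what the characters say, only at how many there are and how many alternatives each position allows.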
Shannon emphasized that the kind of information his theory described needs to be carefully distinguished from our common notions of information. As Warren Weaver, one of Shannon’s close collaborators, made clear in 1949, “The word ‘information’ in this theory is used in a special mathematical sense that must not be confused with its ordinary usage.”23 By ordinary usage, Weaver, of course, was referring to the idea of meaningful or functional communication.
Webster’s dictionary defines information as “the communication or reception of knowledge or intelligence.” It also defines information as “the attribute inherent in, and communicated by, alternative sequences or arrangements of something that produce a specific effect.” A sequence of characters possessing a large amount of Shannon information may convey meaning (as in an English text) or perform a function that “produces a specific effect” (as do both English sentences and computer codes, for example) or it may not (as would be the case with a meaningless pile of letters or a screen of scrambled computer code). In any case, Shannon’s purely mathematical theory of information does not distinguish the presence of meaningful or functional sequences from merely improbable, though meaningless ones. It only provides a mathematical measure of the improbability—or information-carrying capacity—of a sequence of characters. In a sense, it provides a measure of a sequence’s capacity to carry functional or meaningful information. It does not, and cannot, determine whether the sequence in question does convey meaning or generate a functionally significant effect.
Strands of DNA contain information-carrying capacity—something Shannon’s theory can measure.24 But DNA, like natural languages and computer codes, also contains functional information.25
In languages such as English, specifically arranged characters convey functional information to conscious agents. In computer or machine code, specifically arranged characters (zeros and ones) produce functionally significant outcomes within a computational environment without a conscious agent receiving the meaning of the code inside the machine. In the same way, DNA stores and conveys functional information for building proteins or RNA molecules, even if it is not received by a conscious agent. As in computer code, the precise arrangement of characters (or chemicals functioning as characters) allows the sequence to “produce a specific effect.” For this reason, I also like to use the term specified information as a synonym for functional information, because the function of a sequence of characters depends upon the specific arrangement of those characters.
And DNA contains specified information, not just Shannon information or information-carrying capacity. As Crick himself put it in 1958, “By information I mean the specification of the amino acid sequence in protein… . Information means here the precise determination of sequence, either of bases in the nucleic acid or on amino acid residues of the protein.”26
The Message as the Mystery
So if the origin of the Cambrian animals required vast amounts of new functional or specified information, what produced this information explosion? Since the molecular biological revolution first highlighted the primacy of information to the maintenance and function of living systems, questions about the origin of information have moved decidedly to the forefront of discussions about evolutionary theory. What’s more, the realization that specificity of arrangement, rather than mere improbability, characterizes the genetic text has raised some challenging questions about the adequacy of the neo-Darwinian mechanism. Is it plausible to think that natural selection working on random mutations in DNA could produce the highly specific arrangements of bases necessary to generate the protein building blocks of new cell types and novel forms of life? Perhaps nowhere do such questions pose more of a challenge to neo-Darwinian theory than in discussions of the Cambrian explosion.
9
Combinatorial Inflation
Murray Eden (see Fig. 9.1), a professor of engineering and computer science at MIT, was accustomed to thinking about how to build things. But when he began to consider the importance of information to building living organisms, he realized something didn’t add up. His critics said that he knew just enough biology to be dangerous. In retrospect, they were probably right.
In the early 1960s, just as molecular biologists had confirmed Francis Crick’s famed sequence hypothesis, Eden began to think about the challenge of building a living organism. Of course, Eden wasn’t contemplating building a living organism himself. Rather, he was thinking about what it would take for the neo-Darwinian mechanism of natural selection acting on random mutations to do the job. He wondered whether mutation and selection could generate the needed functional information.
To his way of thinking, specificity was a big part of the problem. Obviously, if DNA contained an improbable sequence of nucleotide bases in which the arrangement of bases does not matter to the function of the molecule, then random mutational changes in the sequence of bases would not have a detrimental effect on the function of the molecule. But, of course, sequence does affect function. Eden knew that in all computer codes or written text in which the specificity of sequence determines function, random changes in sequence consistently degrade function or meaning. As he explained, “No currently existing formal language can tolerate random changes in the symbol sequences which express its sentences. Meaning is almost invariably destroyed.”1 Thus, he suspected that the need for specificity in the arrangement of DNA bases made it extremely improbable that random mutations would generate new functional genes or proteins as opposed to degrading existing ones.
But how improbable? How difficult would it be for random mutations to generate, or stumble upon, the genetically meaningful or functional sequences needed to supply natural selection with the raw material—the genetic information and variation—it needed to produce new proteins, organs, and forms of life? Eden wasn’t the only mathematician or scientist asking these questions. But the mathematically based challenge to evolutionary theory that he helped to initiate would indeed prove dangerous to neo-Darwinian orthodoxy.
The Wistar Institute Conference
During the early 1960s, Eden began discussing the plausibility of the neo-Darwinian theory of evolution with several MIT colleagues in math, physics, and computer science. As the discussion grew to include mathematicians and scientists from other institutions, the idea of a conference was born. In 1966, a distinguished group of mathematicians, engineers, and scientists convened a conference at the Wistar Institute in Philadelphia called “Mathematical Challenges to the Neo-Darwinian Interpretation of Evolution.” Prominent among the attendees were Marcel-Paul Schützenberger, a mathematician and physician at the University of Paris; Stanislaw Ulam, the codesigner of the hydrogen bomb; and Eden himself. The conference also included a number of prominent biologists, including Ernst Mayr, an architect of modern neo-Darwinism, and Richard Lewontin, at the time a professor of genetics and evolutionary biology at the University of Chicago.
FIGURE 9.1
Murray Eden. Courtesy MIT Museum.
Sir Peter Medawar, a Nobel laureate and the director of the North London Medical Research Council’s laboratories, chaired the meeting. In his opening remarks, he said, “The immediate cause of this conference is a pretty widespread sense of dissatisfaction about what has come to be thought of as the accepted evolutionary theory in the English-speaking world, the so-called neo-Darwinian theory.”2
For many, doubts about the creative power of the mutation and selection mechanism stemmed from the elucidation of the nature of genetic information by molecular biologists in the late 1950s and early 1960s.
The discovery that the genetic information in DNA is stored as a linear array of precisely sequenced nucleotide bases at first helped to clarify the nature of many mutational processes. Just as a sequence of letters in an English text might be altered either by changing individual letters one by one or by combining and recombining whole sections of text, so too might the genetic text be altered either one base at a time or by combining and recombining different sections of genes in various ways at random. Indeed, modern genetics has established various mechanisms of mutational change—not only “point mutations,” or changes in individual bases, but also duplications, insertions, inversions, recombinations, and deletions of whole sections of the genetic text.
Although fully aware of this range of mutational options at nature’s disposal, Eden argued at Wistar that such random changes to written texts or sections of digital code would inevitably degrade the function of information-bearing sequences, particularly when allowed to accumulate.3 For example, the simple phrase “One if by land and two if by sea” will be significantly degraded by just a handful of random letter substitutions: “Ine if bg lend and two ik bT Nea.” At the conference the French mathematician Marcel Schützenberger agreed with Eden’s concerns about the effect of random alterations. He noted that if someone makes even a few random changes in the arrangement of the digital characters in a computer program, “we find that we have no chance (i.e., less than 1/10¹⁰⁰⁰) even to see what the modified program would compute: it just jams.”4 Eden argued that much the same problem applied to DNA: insofar as specific arrangements of bases in DNA function like digital code, random changes to these arrangements would likely efface their function, while attempts to generate completely new sections of genetic text by random means were likely doomed to failure.5
The explanation for this inevitable diminution in function is found in a branch of mathematics called combinatorics. Combinatorics studies the number of ways a group of things can be combined or arranged. At one level, the subject is fairly intuitive. If a thief slips round the corner of a dormitory after hours looking for a bike to steal, he will scan the bike rack for an easy target. If he spots a basic bicycle-style lock with only three dials of ten numbers each, and on the rack beside it one with five dials of ten numbers each, the thief wouldn’t need a degree in mathematics to realize which one he should attempt to open. He knows that he would need to search fewer total possibilities with the three-dial lock.
A straightforward calculation supports his intuition. The simpler lock has only 10 × 10 × 10, or 1000, possible combinations of digits—or what mathematicians refer to as “combinatorial” possibilities. The five-dial lock has 10 × 10 × 10 × 10 × 10, or 100,000, combinatorial possibilities. With a lot of patience, the thief might elect to systematically work his way through the different combinations of digits on the simpler lock, knowing that at some point he will stumble across the correct combination. He shouldn’t even bother with the five-dial lock, since making his way through all of the possible combinations on it would take 100 times as long. The five-dial lock simply has too many possibilities for the thief to have a reasonable chance of opening it by trial and error in the time available to him.
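The thief's arithmetic can be written out as a short sketch. The function name and its default of ten symbols per dial are illustrative choices that simply mirror the two locks described above.

```python
def lock_combinations(dials: int, symbols_per_dial: int = 10) -> int:
    """Number of distinct settings for a lock with independent dials."""
    return symbols_per_dial ** dials

three_dial = lock_combinations(3)  # 10 x 10 x 10
five_dial = lock_combinations(5)   # 10 x 10 x 10 x 10 x 10

# How many times larger the five-dial search is than the three-dial one.
ratio = five_dial // three_dial

print(three_dial, five_dial, ratio)
```

Each added dial multiplies the search by ten, which is why two extra dials make an exhaustive search one hundred times longer.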
Several of the Wistar scientists noted that the mutation and selection mechanism faces a similar problem. Neo-Darwinism envisions new genetic information arising from random mutations in the DNA. If at any time from birth to reproduction the right mutation or combination of mutations accumulates in the DNA of cells involved in reproduction (whether sexual or asexual), then information for building a new protein or proteins will pass on to the next generation. When that new protein happens to confer a survival advantage on an organism, the genetic change responsible for the new protein will tend to be passed on to subsequent generations. As favorable mutations accumulate, the features of a population will gradually change over time.
Clearly, natural selection plays a crucial role in this process. Favorable mutations are passed on; unfavorable mutations are weeded out. Nevertheless, the process can only select variations in the genetic text that mutations have first produced. For this reason, evolutionary biologists typically recognize that mutation, not natural selection, provides the source of variation and innovation in the evolutionary process. As evolutionary biologists Jack King and Thomas Jukes put it in 1969, “Natural selection is the editor, rather than the composer, of the genetic message.”6
And that was the problem, as the Wistar skeptics saw it: random mutations must do the work of composing new genetic information, yet the sheer number of possible nucleotide base or amino-acid combinations (i.e., the size of the combinatorial “space”) associated with a single gene or protein of even modest length rendered the probability of random assembly prohibitively small. For every sequence of amino acids that generates a functional protein, there are a myriad of other combinations that don’t. As the length of the required protein grows, the number of possible amino-acid combinations mushrooms exponentially. As this happens, the probability of ever stumbling by random mutation onto a functional sequence rapidly diminishes.
Consider another illustration. The two letters X and Y can be combined in four different two-letter combinations (XX, XY, YX and YY). They can be combined in eight different ways for three-letter combinations (XXX, XXY, XYY, XYX, YXX, YYX, YXY, YYY), sixteen ways for four-letter combinations, and so on. The number of possible combinations grows exponentially (2², 2³, 2⁴, and so on) as the number of letters in the sequence grows. Mathematician David Berlinski calls this the problem of “combinatorial inflation,” because the number of possible combinations “inflates” dramatically as the number of characters in a sequence grows (see Fig. 9.2).
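The X and Y illustration can be checked by brute-force enumeration. The sketch below uses Python's standard `itertools.product`; the helper name `all_sequences` is an illustrative choice.

```python
from itertools import product

def all_sequences(alphabet: str, length: int) -> list[str]:
    """Every sequence of the given length over the alphabet, in order."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]

two_letter = all_sequences("XY", 2)

# Count the sequences of length 2, 3 and 4 by listing them all.
counts = {n: len(all_sequences("XY", n)) for n in range(2, 5)}

print(two_letter)
print(counts)
```

Listing the sequences outright works only for short lengths; the count doubles with every added letter, so enumeration quickly becomes impractical.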
The combinations of bases in DNA are subject to combinatorial inflation of just this sort. The information-bearing sequences in DNA consist of specific arrangements of the four nucleotide bases. Consequently, there are four possible bases that could occur at each site along the DNA backbone and 4 × 4, or 4², or 16 possible two-base sequences (AA AT AG AC TA TG TC TT CG CT CC CA GA GG GC GT). Similarly, there are 4 × 4 × 4, or 4³, or 64 possible three-base sequences. (I’ll refrain from listing them all.) That is, increasing the number of bases in a sequence from 1 to 2 to 3 increases the number of possibilities from 4 to 16 to 64. As the sequence length continues to grow, the number of combinatorial possibilities corresponding to sequences of increasing length inflates exponentially. For example, there are 4¹⁰⁰, or roughly 10⁶⁰, possible ways of arranging one hundred bases in a row.
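The same counting applies to the four-letter DNA alphabet. As a rough sketch, the code below lists the two-base sequences and then uses exact integer arithmetic to find the power of ten of the hundred-base count; the names `BASES` and `decimal_exponent` are illustrative assumptions.

```python
from itertools import product

BASES = "ATGC"  # the four nucleotide bases

def decimal_exponent(n: int) -> int:
    """Exponent e such that 10**e <= n < 10**(e + 1), for positive n."""
    return len(str(n)) - 1

# All two-base sequences, enumerated directly.
two_base = ["".join(pair) for pair in product(BASES, repeat=2)]

# Counts for one, two and three bases, then the hundred-base case.
short_counts = [len(BASES) ** n for n in (1, 2, 3)]
hundred_base_count = len(BASES) ** 100

print(len(two_base), short_counts, decimal_exponent(hundred_base_count))
```

Python integers have arbitrary precision, so the hundred-base count is computed exactly rather than estimated.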
FIGURE 9.2
The problem of combinatorial inflation as illustrated by bike locks of varying sizes. As the number of dials on the bike locks increases, the number of possible combinations rises exponentially.
The amino-acid chains are also subject to such inflation. A chain of two amino acids could display 20², or 20 × 20, or 400 possible combinations, since each of the twenty protein-forming amino acids could combine with any one of that same group of twenty in the second position of a short peptide chain. With a three-amino-acid sequence, we’re looking at 20³, or 8,000, possible sequences. With four amino acids, the number of combinations rises exponentially to 20⁴, or 160,000, total combinations, and so on.
Now, the number of combinatorial possibilities corresponding to a chain with four amino acids only marginally outstrips the combinatorial possibilities associated with the five-dial lock in my first illustration (160,000 vs. 100,000). It turns out, however, that many necessary, functional proteins in cells require far, far more than just four amino acids linked in sequence, and necessary genes require far, far more than just a few bases. Most genes (sections of DNA that code for a specific protein) consist of at least one thousand nucleotide bases. That corresponds to 4¹⁰⁰⁰ possible base sequences of that length, an unimaginably large number.
Moreover, it takes three bases in a group called a codon to designate one of the twenty protein-forming amino acids in a growing chain during protein synthesis. If an average gene has about 1000 bases, then an average protein would have over 300 amino acids, each of which is called a “residue” by protein chemists. And indeed proteins typically require hundreds of amino acids in order to perform their functions. This means that an average-length protein represents just one possible sequence among an astronomically large number (20³⁰⁰, or over 10³⁹⁰) of possible amino-acid sequences of that length. Putting these numbers in perspective, there are only 10⁶⁵ atoms in our Milky Way galaxy and 10⁸⁰ elementary particles in the known universe.
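The step from gene length to protein sequence space can be sketched in a few lines. This is a simplified illustration: it assumes every base belongs to a codon (ignoring stop codons and noncoding regions), and the constant and function names are hypothetical.

```python
BASES_PER_CODON = 3
AMINO_ACIDS = 20
NUCLEOTIDES = 4

def residues_encoded(gene_length_bases: int) -> int:
    """Whole codons in a gene, ASSUMING every base is part of a codon."""
    return gene_length_bases // BASES_PER_CODON

def decimal_exponent(n: int) -> int:
    """Exponent e such that 10**e <= n < 10**(e + 1), for positive n."""
    return len(str(n)) - 1

residues = residues_encoded(1000)      # amino acids coded by 1000 bases
protein_space = AMINO_ACIDS ** 300     # sequences of 300 amino acids
gene_space = NUCLEOTIDES ** 1000       # sequences of 1000 bases

print(residues)
print(decimal_exponent(protein_space))
print(decimal_exponent(gene_space))
print(protein_space > 10 ** 80)  # compared with particles in the universe
```

Because the counts are exact integers, the comparison with 10⁸⁰ is a direct check rather than an approximation.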
That is what bothered Eden and other mathematically inclined scientists at Wistar. They understood the immensity of the combinatorial spaces associated with even single genes or proteins of average length. They realized that if the mutations themselves were truly random—that is, if they were neither directed by an intelligence nor influenced by the functional needs of the organism (as neo-Darwinism stipulates)—then the probability of the mutation and selection mechanism ever producing a new gene or protein could well be vanishingly small. Why? The mutations would have to generate, or “search” by trial and error, an enormous number of possibilities—far more than were realistic in the time available to the evolutionary process.
Darwin's Doubt Page 19