What Just Happened: A Chronicle From the Information Frontier
To banish the fallacious thinking, he proposed a new terminology, beginning with gene: “nothing but a very applicable little word, easily combined with others.”♦ It hardly mattered that neither he nor anyone else knew what a gene actually was; “it may be useful as an expression for the ‘unit-factors,’ ‘elements,’ or ‘allelomorphs.’… As to the nature of the ‘genes’ it is as yet of no value to propose a hypothesis.” Gregor Mendel’s years of research with green and yellow peas showed that such a thing must exist. Colors and other traits vary depending on many factors, such as temperature and soil content, but something is preserved whole; it does not blend or diffuse; it must be quantized.♦ Mendel had discovered the gene, though he did not name it. For him it was more an algebraic convenience than a physical entity.
When Schrödinger contemplated the gene, he faced a problem. How could such a “tiny speck of material” contain the entire complex code-script that determines the elaborate development of the organism? To resolve the difficulty Schrödinger summoned an example not from wave mechanics or theoretical physics but from telegraphy: Morse code. He noted that two signs, dot and dash, could be combined in well-ordered groups to generate all human language. Genes, too, he suggested, must employ a code: “The miniature code should precisely correspond with a highly complicated and specified plan of development and should somehow contain the means to put it into action.”♦
Codes, instructions, signals—all this language, redolent of machinery and engineering, pressed in on biologists like Norman French invading medieval English. In the 1940s the jargon had a precious, artificial feeling, but that soon passed. The new molecular biology began to examine information storage and information transfer. Biologists could count in terms of “bits.” Some of the physicists now turning to biology saw information as exactly the concept needed to discuss and measure biological qualities for which tools had not been available: complexity and order, organization and specificity.♦ Henry Quastler, an early radiologist from Vienna, then at the University of Illinois, was applying information theory to both biology and psychology; he estimated that an amino acid has the information content of a written word and a protein molecule the information content of a paragraph. His colleague Sidney Dancoff suggested to him in 1950 that a chromosomal thread is “a linear coded tape of information”♦:
The entire thread constitutes a “message.” This message can be broken down into sub-units which may be called “paragraphs,” “words,” etc. The smallest message unit is perhaps some flip-flop which can make a yes-no decision.
In 1952 Quastler organized a symposium on information theory in biology, with no purpose but to deploy these new ideas—entropy, noise, messaging, differentiating—in areas from cell structure and enzyme catalysis to large-scale “biosystems.” One researcher constructed an estimate of the number of bits represented by a single bacterium: as much as 10¹³.♦ (But that was the number needed to describe its entire molecular structure in three dimensions—perhaps there was a more economical description.) The growth of the bacterium could be analyzed as a reduction in the entropy of its part of the universe. Quastler himself wanted to take the measure of higher organisms in terms of information content: not in terms of atoms (“this would be extremely wasteful”) but in terms of “hypothetical instructions to build an organism.”♦ This brought him, of course, to genes.
The whole set of instructions—situated “somewhere in the chromosomes”—is the genome. This is a “catalogue,” he said, containing, if not all, then at least “a substantial fraction of all information about an adult organism.” He emphasized, though, how little was known about genes. Were they discrete physical entities, or did they overlap? Were they “independent sources of information” or did they affect one another? How many were there? Multiplying all these unknowns, he arrived at a result:
that the essential complexity of a single cell and of a whole man are both not more than 10¹² nor less than 10⁵ bits; this is an extremely coarse estimate, but is better than no estimate at all.♦
These crude efforts led to nothing, directly. Shannon’s information theory could not be grafted onto biology whole. It hardly mattered. A seismic shift was already under way: from thinking about energy to thinking about information.
Across the Atlantic, an odd little letter arrived at the offices of the journal Nature in London in the spring of 1953, with a list of signatories from Paris, Zurich, Cambridge, and Geneva, most notably Boris Ephrussi, France’s first professor of genetics.♦ The scientists complained of “what seems to us a rather chaotic growth in technical vocabulary.” In particular, they had seen genetic recombination in bacteria described as “transformation,” “induction,” “transduction,” and even “infection.” They proposed to simplify matters:
As a solution to this confusing situation, we would like to suggest the use of the term “interbacterial information” to replace those above. It does not imply necessarily the transfer of material substances, and recognizes the possible future importance of cybernetics at the bacterial level.
This was the product of a wine-flushed lakeside lunch at Locarno, Switzerland—meant as a joke, but entirely plausible to the editors of Nature, who published it forthwith.♦ The youngest of the lunchers and signers was a twenty-five-year-old American named James Watson.
The very next issue of Nature carried another letter from Watson, along with his collaborator, Francis Crick. It made them famous. They had found the gene.
A consensus had emerged that whatever genes were, however they functioned, they would probably be proteins: giant organic molecules made of long chains of amino acids. Alternatively, a few geneticists in the 1940s focused instead on simple viruses—phages. Then again, experiments on heredity in bacteria had persuaded a few researchers, Watson and Crick among them, that genes might lie in a different substance, which, for no known reason, was found within the nucleus of every cell, plant and animal, phages included.♦ This substance was a nucleic acid, particularly deoxyribonucleic acid, or DNA. The people working with nucleic acids, mainly chemists, had not been able to learn much about it, except that the molecules were built up from smaller units, called nucleotides. Watson and Crick thought this must be the secret, and they raced to figure out its structure at the Cavendish Laboratory in Cambridge. They could not see these molecules; they could only seek clues in the shadows cast by X-ray diffraction. But they knew a great deal about the subunits. Each nucleotide contained a “base,” and there were just four different bases, designated as A, C, G, and T. They came in strictly predictable proportions. They must be the letters of the code. The rest was trial and error, fired by imagination.
What they discovered became an icon: the double helix, heralded on magazine covers, emulated in sculpture. DNA is formed of two long sequences of bases, like ciphers coded in a four-letter alphabet, each sequence complementary to the other, coiled together. Unzipped, each strand may serve as a template for replication. (Was it Schrödinger’s “aperiodic crystal”? In terms of physical structure, X-ray diffraction showed DNA to be entirely regular. The aperiodicity lies at the abstract level of language—the sequence of “letters.”) In the local pub, Crick, ebullient, announced to anyone who would listen that they had discovered “the secret of life”; in their one-page note in Nature they were more circumspect. They ended with a remark that has been called “one of the most coy statements in the literature of science”♦:
It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.♦
They dispensed with the timidity in another paper a few weeks later. In each chain the sequence of bases appeared to be irregular—any sequence was possible, they observed. “It follows that in a long molecule many different permutations are possible.”♦ Many permutations—many possible messages. Their next remark set alarms sounding on both sides of the Atlantic: “It therefore seems likely that the precise sequence of the bases is the code which carries the genetical information.” In using these terms, code and information, they were no longer speaking figuratively.
The macromolecules of organic life embody information in an intricate structure. A single hemoglobin molecule comprises four chains of polypeptides, two with 141 amino acids and two with 146, in strict linear sequence, bonded and folded together. Atoms of hydrogen, oxygen, carbon, and iron could mingle randomly for the lifetime of the universe and be no more likely to form hemoglobin than the proverbial chimpanzees to type the works of Shakespeare. Their genesis requires energy; they are built up from simpler, less patterned parts, and the law of entropy applies. For earthly life, the energy comes as photons from the sun. The information comes via evolution.
The DNA molecule was special: the information it bears is its only function. Having recognized this, microbiologists turned to the problem of deciphering the code. Crick, who had been inspired to leave physics for biology when he read Schrödinger’s What Is Life?, sent Schrödinger a copy of the paper but did not receive a reply.
On the other hand, George Gamow saw the Watson-Crick report when he was visiting the Radiation Laboratory at Berkeley. Gamow was a Ukrainian-born cosmologist—an originator of the Big Bang theory—and he knew a big idea when he saw one. He sent off a letter:
Dear Drs. Watson & Crick,
I am a physicist, not a biologist.… But I am very much excited by your article in May 30th Nature, and think that brings Biology over into the group of “exact” sciences.… If your point of view is correct each organism will be characterized by a long number written in quadrucal (?) system with figures 1, 2, 3, 4 standing for different bases.… This would open a very exciting possibility of theoretical research based on combinatorix and the theory of numbers!… I have a feeling this can be done. What do you think?♦
For the next decade, the struggle to understand the genetic code consumed a motley assortment of the world’s great minds, many of them, like Gamow, lacking any useful knowledge of biochemistry. For Watson and Crick, the initial problem had depended on a morass of specialized particulars: hydrogen bonds, salt linkages, phosphate-sugar chains with deoxyribofuranose residues. They had to learn how inorganic ions could be organized in three dimensions; they had to calculate exact angles of chemical bonds. They made models out of cardboard and tin plates. But now the problem was being transformed into an abstract game of symbol manipulation. Closely linked to DNA, its single-stranded cousin, RNA, appeared to play the role of messenger or translator. Gamow said explicitly that the underlying chemistry hardly mattered. He and others who followed him understood this as a puzzle in mathematics—a mapping between messages in different alphabets. If this was a coding problem, the tools they needed came from combinatorics and information theory. Along with physicists, they consulted cryptanalysts.
Gamow himself began impulsively by designing a combinatorial code. As he saw it, the problem was to get from the four bases in DNA to the twenty known amino acids in proteins—a code, therefore, with four letters and twenty words.♦ Pure combinatorics made him think of nucleotide triplets: three-letter words. He had a detailed solution—soon known as his “diamond code”—published in Nature within a few months. A few months after that, Crick showed this to be utterly wrong: experimental data on protein sequences ruled out the diamond code. But Gamow was not giving up. The triplet idea was seductive. An unexpected cast of scientists joined the hunt: Max Delbrück, an ex-physicist now at Caltech in biology; his friend Richard Feynman, the quantum theorist; Edward Teller, the famous bomb maker; another Los Alamos alumnus, the mathematician Nicholas Metropolis; and Sydney Brenner, who joined Crick at the Cavendish.
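The counting argument that points toward triplets can be sketched in a few lines of Python. This is an illustration only, not anything Gamow wrote; the function name `min_word_length` is hypothetical, and the sketch assumes fixed-length words in which every combination of letters is allowed.

```python
def min_word_length(alphabet_size: int, words_needed: int) -> int:
    """Smallest fixed word length n with alphabet_size**n >= words_needed.

    Assumes alphabet_size >= 2 and words_needed >= 1.
    """
    n = 1
    while alphabet_size ** n < words_needed:
        n += 1
    return n


# Four bases, twenty amino acids.
for n in (1, 2, 3):
    print(n, 4 ** n)          # 1 4, then 2 16, then 3 64

print(min_word_length(4, 20))  # 3
```

Pairs give only 16 combinations, too few for twenty amino acids; triplets give 64, more than enough, which already hints that any triplet code must contain some slack.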
They all had different coding ideas. Mathematically the problem seemed daunting even to Gamow. “As in the breaking of enemy messages during the war,” he wrote in 1954, “the success depends on the available length of the coded text. As every intelligence officer will tell you, the work is very hard, and the success depends mostly on luck.… I am afraid that the problem cannot be solved without the help of electronic computer.”♦ Gamow and Watson decided to make it a club: the RNA Tie Club, with exactly twenty members. Each member received a woolen tie in black and green, made to Gamow’s design by a haberdasher in Los Angeles. The game playing aside, Gamow wanted to create a communication channel to bypass journal publication. News in science had never moved so fast. “Many of the essential concepts were first proposed in informal discussions on both sides of the Atlantic and were then quickly broadcast to the cognoscenti,” said another member, Gunther Stent, “by private international bush telegraph.”♦ There were false starts, wild guesses, and dead ends, and the established biochemistry community did not always go along willingly.
“People didn’t necessarily believe in the code,” Crick said later. “The majority of biochemists simply weren’t thinking along those lines. It was a completely novel idea, and moreover they were inclined to think it was oversimplified.”♦ They thought the way to understand proteins would be to study enzyme systems and the coupling of peptide units. Which was reasonable enough.
They thought protein synthesis couldn’t be a simple matter of coding from one thing to another; that sounded too much like something a physicist had invented. It didn’t sound like biochemistry to them.… So there was a certain resistance to simple ideas like three nucleotides’ coding an amino acid; people thought it was rather like cheating.
Gamow, at the other extreme, was bypassing the biochemical details to put forward an idea of shocking simplicity: that any living organism is determined by “a long number written in a four-digital system.”♦ He called this “the number of the beast” (from Revelation). If two beasts have the same number, they are identical twins.
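Gamow's "long number" can be made concrete with a small sketch. It assumes his figures 1, 2, 3, 4 stand for the bases A, C, G, T in that order (the assignment is arbitrary and chosen here only for illustration), and it reads the sequence as a number whose digits run from 1 to 4 rather than 0 to 3, so that no sequence loses information to leading zeros. The names `sequence_to_number` and `number_to_sequence` are hypothetical.

```python
# Assumed digit assignment, for illustration only.
DIGIT = {"A": 1, "C": 2, "G": 3, "T": 4}
BASE = {digit: base for base, digit in DIGIT.items()}


def sequence_to_number(seq: str) -> int:
    """Read a DNA string as one integer, four-valued digits 1..4."""
    n = 0
    for base in seq:
        n = n * 4 + DIGIT[base]
    return n


def number_to_sequence(n: int) -> str:
    """Invert sequence_to_number (digits 1..4, so a remainder of 0 means 4)."""
    bases = []
    while n > 0:
        d = n % 4 or 4
        bases.append(BASE[d])
        n = (n - d) // 4
    return "".join(reversed(bases))


print(sequence_to_number("ACGT"))   # 112
print(number_to_sequence(112))      # ACGT
```

Under this scheme two sequences share a number exactly when they are the same sequence, which is the sense of the identical-twins remark.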
By now the word code was so deeply embedded in the conversation that people seldom paused to notice how extraordinary it was to find such a thing—abstract symbols representing arbitrarily different abstract symbols—at work in chemistry, at the level of molecules. The genetic code performed a function with uncanny similarities to the metamathematical code invented by Gödel for his philosophical purposes. Gödel’s code substitutes plain numbers for mathematical expressions and operations; the genetic code uses triplets of nucleotides to represent amino acids. Douglas Hofstadter was the first to make this connection explicitly, in the 1980s: “between the complex machinery in a living cell that enables a DNA molecule to replicate itself and the clever machinery in a mathematical system that enables a formula to say things about itself.”♦ In both cases he saw a twisty feedback loop. “Nobody had ever in the least suspected that one set of chemicals could code for another set,” Hofstadter wrote.
Indeed, the very idea is somewhat baffling: If there is a code, then who invented it? What kinds of messages are written in it? Who writes them? Who reads them?
The Tie Club recognized that the problem was not just information storage but information transfer. DNA serves two different functions. First, it preserves information. It does this by copying itself, from generation to generation, spanning eons—a Library of Alexandria that keeps its data safe by copying itself billions of times. Notwithstanding the beautiful double helix, this information store is essentially one-dimensional: a string of elements arrayed in a line. In human DNA, the nucleotide units number more than a billion, and this detailed gigabit message must be conserved perfectly, or almost perfectly. Second, however, DNA also sends that information outward for use in the making of the organism. The data stored in a one-dimensional strand has to flower forth in three dimensions. This information transfer occurs via messages passing from the nucleic acids to proteins. So DNA not only replicates itself; separately, it dictates the manufacture of something entirely different. These proteins, with their own enormous complexity, serve as the material of a body, the mortar and bricks, and also as the control system, the plumbing and wiring and the chemical signals that control growth.
The replication of DNA is a copying of information. The manufacture of proteins is a transfer of information: the sending of a message. Biologists could see this clearly now, because the message was now well defined and abstracted from any particular substrate. If messages could be borne upon sound waves or electrical pulses, why not by chemical processes?
Gamow framed the issue simply: “The nucleus of a living cell is a storehouse of information.”♦ Furthermore, he said, it is a transmitter of information. The continuity of all life stems from this “information system”; the proper study of genetics is “the language of the cells.”
When Gamow’s diamond code proved wrong, he tried a “triangle code,” and more variations followed—also wrong. Triplet codons remained central, and a solution seemed tantalizingly close but out of reach. A problem was how nature punctuated the seemingly unbroken DNA and RNA strands. No one could see a biological equivalent for the pauses that separate letters in Morse code, or the spaces that separate words. Perhaps every fourth base was a comma. Or maybe (Crick suggested) commas would be unnecessary if some triplets made “sense” and others made “nonsense.”♦ Then again, maybe a sort of tape reader just needed to start at a certain point and count off the nucleotides three by three. Among the mathematicians drawn to this problem were a group at the new Jet Propulsion Laboratory in Pasadena, California, meant to be working on aerospace research. To them it looked like a classic problem in Shannon coding theory: “the sequence of nucleotides as an infinite message, written without punctuation, from which any finite portion must be decodable into a sequence of amino acids by suitable insertion of commas.”♦ They constructed a dictionary of codes. They considered the problem of misprints.
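The idea of a code that needs no commas can be checked mechanically. The sketch below is an illustration under stated assumptions, not a reconstruction of the Pasadena group's method: it treats a dictionary of triplets as comma-free when no dictionary word appears in either of the two shifted reading frames of any two words written side by side. The function names and the particular candidate dictionary are chosen here for illustration.

```python
from itertools import product

BASES = "ACGT"


def is_comma_free(words: set[str]) -> bool:
    """True if no word shows up out of frame when any two words are joined."""
    for w1, w2 in product(words, repeat=2):
        joined = w1 + w2
        if joined[1:4] in words or joined[2:5] in words:
            return False
    return True


def rotation_classes() -> set[frozenset[str]]:
    """Group the non-uniform triplets by cyclic rotation (ACG, CGA, GAC, ...)."""
    classes = set()
    for letters in product(BASES, repeat=3):
        w = "".join(letters)
        if len(set(w)) == 1:
            continue  # AAA, CCC, GGG, TTT collide with themselves out of frame
        classes.add(frozenset({w, w[1:] + w[:1], w[2:] + w[:2]}))
    return classes


# A comma-free dictionary can hold at most one word from each rotation class.
print(len(rotation_classes()))  # 20

# One candidate dictionary: middle letter strictly after the first
# and not before the last, in alphabetical order.
candidate = {a + b + c for a, b, c in product(BASES, repeat=3) if a < b >= c}
print(len(candidate), is_comma_free(candidate))  # 20 True
```

The upper bound comes out at twenty words, and the candidate dictionary reaches it, which helps explain why the scheme looked so attractive when exactly twenty amino acids needed names.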
Biochemistry did matter. All the world’s cryptanalysts, lacking petri dishes and laboratory kitchens, would not have been able to guess from among the universe of possible answers. When the genetic code was solved, in the early 1960s, it turned out to be full of redundancy. Much of the mapping from nucleotides to amino acids seemed arbitrary—not as neatly patterned as any of Gamow’s proposals. Some amino acids correspond to just one codon, others to two, four, or six. Particles called ribosomes ratchet along the RNA strand and translate it, three bases at a time. Some codons are redundant; some actually serve as start signals and stop signals. The redundancy serves exactly the purpose that an information theorist would expect. It provides tolerance for errors. Noise affects biological messages like any other. Errors in DNA—misprints—are mutations.
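A toy translator shows the redundancy at work. The sketch uses only a small excerpt of the standard codon table (the full table has 64 entries), and it assumes reading begins at the first AUG and proceeds three bases at a time until a stop codon; the function name `translate` and the sample sequences are made up for illustration.

```python
# Excerpt of the standard genetic code, RNA alphabet. Not the full table.
CODON_TABLE = {
    "AUG": "Met",                                   # also the start signal
    "UGG": "Trp",                                   # one codon
    "UUU": "Phe", "UUC": "Phe",                     # two codons
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",   # four
    "UUA": "Leu", "UUG": "Leu", "CUU": "Leu",
    "CUC": "Leu", "CUA": "Leu", "CUG": "Leu",       # six
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def translate(rna: str) -> list[str]:
    """Read from the first AUG, three bases at a time, until a stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[i:i + 3]]  # KeyError outside this excerpt
        if residue == "STOP":
            break
        protein.append(residue)
    return protein


original = "AUGGGUUUUCUGUAA"   # AUG GGU UUU CUG UAA
silent   = "AUGGGCUUUCUGUAA"   # GGU -> GGC, still Gly
changed  = "AUGGGUUUACUGUAA"   # UUU -> UUA, Phe becomes Leu

print(translate(original))  # ['Met', 'Gly', 'Phe', 'Leu']
print(translate(silent))    # ['Met', 'Gly', 'Phe', 'Leu']
print(translate(changed))   # ['Met', 'Gly', 'Leu', 'Leu']
```

The first misprint leaves the protein untouched because two codons name the same amino acid; the second alters it. Redundancy absorbs some errors, not all.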