Life's Greatest Secret


by Matthew Cobb


  Just as we do not know when the RNA world appeared, so we also do not know when it finally disappeared. All we can do is trace the ancestry of modern, DNA-based organisms back to the Last Universal Common Ancestor (LUCA), a population of single-celled DNA organisms that lived perhaps 3.8 billion years ago. LUCA evolved out of the RNA world, eventually – perhaps rapidly – out-competing and replacing it.

  The replacement of RNA as the repository of genetic information by its more stable cousin, DNA, provided a more reliable way of transmitting information down the generations. This explains why DNA uses thymine (T) as one of its four informational bases, whereas RNA uses uracil (U) in its place. The problem is that cytosine (C), one of the other three bases, can easily turn into U, through a simple reaction called deamination. This takes place spontaneously dozens of times a day in each of your cells but is easily corrected by cellular machinery because, in DNA, U is meaningless. However, in RNA such a change would be significant – the cell would not be able to tell the difference between a U that was supposed to be there and needed to be acted upon, and a U that was a spontaneous mutation from C and needed to be corrected. This does not cause your cells any difficulty, because most RNA is so transient that it does not have time to mutate – in the case of messenger RNA it is copied from DNA immediately before being used. Thymine is much more stable and does not spontaneously change so easily. The adoption of DNA as the genetic material, with its built-in error-correction mechanism in the shape of the two complementary strands in the double helix, and the use of thymine in the sequence, provided a more reliable information store and slowed the rate of potentially damaging mutations.

  These kinds of mechanisms also provide one answer to Schrödinger’s concern about how gene-molecules are able to remain apparently constant down the generations, despite the existence of quantum effects that should alter their structure. Life is even stranger than Schrödinger imagined: it has evolved ways of checking the stability of genetic information and of reducing errors.

  The new DNA life-forms would have had a substantial advantage because they involved proteins in all their cellular activities. Although we do not know when or why protein synthesis developed, it seems unlikely that it occurred instantaneously – there was probably no protein revolution.16 Instead, amino acids present in the primitive cell would have interacted spontaneously with pieces of RNA that were acting as enzymes, with small nucleotide sequences binding with a few amino acids – this was the first glimmer of the genetic code. Initially the interaction of RNA and amino acids would have enabled RNA life-forms to gain some additional metabolic property, before eventually the appearance of strings of amino acids – proteins – created the world of protein-based life. At some point DNA supplanted RNA as the informational molecule, keeping the genetic sequence safe, using RNA to produce rapid translations of that sequence into the patterned production of proteins, as the RNA enzymes were co-opted and turned into bits of cellular machinery such as transfer RNAs and ribosomes.

  Proteins can carry out an almost infinite range of biological functions, both as structural components and as enzymes. In both respects, they far surpass RNA. The appearance of proteins therefore opened new niches to life, spreading DNA and protein across the planet, creating and continually altering the biosphere. These new DNA-based life-forms would have out-competed the RNA world organisms in terms of their flexibility and the range of niches they could occupy. They would also have been able to grow much more quickly: a modern DNA-based cell can replicate itself in about 20 minutes. Experiments on RNA enzymes involved in replication suggest that it would have taken days for an RNA-based life-form to reproduce.17 The RNA world was slow, limited and probably confined to the ocean depths.

  The evolutionary and ecological advantages gained through the use of proteins by DNA-based life show that the appearance of translation from a sequence of RNA bases into a sequence of amino acids was a decisive evolutionary step. The evolution of the genetic code was therefore essential for life as we know it. It truly is life’s greatest secret. This raises the obvious questions of how the code evolved and why it is the way it is. There is a simple but frustrating answer to both these questions: we do not know.

  *

  In December 1966, shortly before the final word in the genetic code was read, Francis Crick explored the origin of the code. Speaking at a meeting of the British Biophysical Society in London, he developed ideas that still dominate scientists’ thinking about this difficult question.18 Crick’s first suggestion, which was also being explored by Carl Woese and Leslie Orgel, was that there is a physical link between each codon and the amino acid it codes for, and that the code is therefore in some way inevitable.19 This assumption lay behind many of the theoretical attempts to break the code, and could trace its intellectual origins right back to Gamow’s letter to Watson and Crick in 1953. However, Crick was unable to fully explain the code in these terms, and there is still no physical explanation of the relation between all RNA codons and their amino acids.

  Part of the problem is that there is obviously something going on – the distribution of codons among amino acids is clearly not random. As Crick pointed out, in many cases the final base in an RNA codon is irrelevant: for any pair of first two bases XY, the codons XYU and XYC always code for the same amino acid, and XYA and XYG often do so. In half of the cases it does not matter at all which base follows XY – all four combinations code for the same amino acid. There also seems to be some link with the physical nature of the amino acid: for example, if the second base in the codon is a U, then the amino acid has a particular chemical characteristic called hydrophobicity (it repels water), and the more acidic and the more alkaline groups of amino acids each have similar codons.
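
  Crick’s observation can be checked directly against the standard codon table. The short Python sketch below is purely illustrative – the compact sixty-four-letter string is just a conventional way of writing out the standard code, with codons ordered U, C, A and G at each position, and an asterisk standing for a stop codon – and it simply counts how often the third base turns out to be redundant.

    from itertools import product

    BASES = "UCAG"
    # Standard genetic code, codons ordered U, C, A, G at each position;
    # '*' marks a stop codon.
    AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
                   "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
    GENETIC_CODE = {"".join(codon): aa
                    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

    same_uc = same_ag = fourfold = 0
    for x, y in product(BASES, repeat=2):
        family = [GENETIC_CODE[x + y + third] for third in BASES]  # XYU, XYC, XYA, XYG
        same_uc += family[0] == family[1]      # XYU and XYC agree
        same_ag += family[2] == family[3]      # XYA and XYG agree
        fourfold += len(set(family)) == 1      # third base completely irrelevant

    print(same_uc, "of 16 families: XYU = XYC")              # 16
    print(same_ag, "of 16 families: XYA = XYG")              # 14
    print(fourfold, "of 16 families: third base irrelevant")  # 8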

  These tantalising patterns have led several scientists, including Crick, to suggest that at first the code would have enabled the protocell to process a relatively small number of amino acids, on the basis of the physicochemical interactions between RNA molecules and amino acids, and indeed there is some evidence for this.20 These would initially have been coded by one or two bases alone, but very quickly the triplet code was established, with the number of amino acids subsequently being expanded to the current twenty. Unable to see any coherence in the code, Crick described its modern structure as a frozen accident.21 He argued that the Last Universal Common Ancestor of all existing life just happened to use the current system of translation from DNA to protein, and that it has stuck because any deviation from the universal code would be disadvantageous. Even though we know that minor variations of the code are possible, this intellectually unsatisfactory explanation remains a rather limp conclusion to most accounts of the origin of the modern genetic code.

  Attempts to explain why the code is the way it is are generally divided into three types: Crick’s physicochemical hypothesis, which seeks to explain the code in terms of the links between codons and amino acids; the co-evolution hypothesis, which suggests that the pattern of codon assignment reflects the order in which amino acids and their biosynthetic pathways appeared as the code expanded; and the adaptive hypothesis, which sees the code as the outcome of selection to minimise the effects of errors. This final suggestion was first set out by Stephen Freeland and Laurence Hurst in 1998. They compared the actual genetic code with a million randomly generated alternatives and found that, in terms of minimising the damage done when a codon is mutated or misread as a similar codon, only one alternative in a million outperformed the genetic code we currently use.22
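
  The logic of Freeland and Hurst’s test can be sketched in a few dozen lines of code. The version below is a simplification rather than their actual procedure: it scores each code by the average squared change in Kyte–Doolittle hydropathy (a standard hydrophobicity scale) caused by single-base changes, whereas they used a different amino-acid property, polar requirement, and weighted different kinds of misreading differently. Like them, it compares the standard code against alternatives that keep the code’s block structure but shuffle which amino acid each block encodes.

    import random
    from itertools import product

    BASES = "UCAG"
    AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
                   "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
    GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

    # Kyte-Doolittle hydropathy values for the twenty amino acids.
    HYDROPATHY = {
        "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
        "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
        "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
    }

    def error_cost(code):
        """Mean squared hydropathy change over all single-base changes
        that turn one sense codon into another sense codon."""
        costs = []
        for codon, aa in code.items():
            if aa == "*":
                continue
            for pos in range(3):
                for base in BASES:
                    if base == codon[pos]:
                        continue
                    mutant = code[codon[:pos] + base + codon[pos + 1:]]
                    if mutant != "*":
                        costs.append((HYDROPATHY[aa] - HYDROPATHY[mutant]) ** 2)
        return sum(costs) / len(costs)

    def shuffled_code(code):
        """Randomly reassign the twenty amino acids to the existing codon blocks."""
        aas = sorted(set(code.values()) - {"*"})
        mapping = dict(zip(aas, random.sample(aas, len(aas))))
        return {codon: mapping.get(aa, "*") for codon, aa in code.items()}

    standard = error_cost(GENETIC_CODE)
    better = sum(error_cost(shuffled_code(GENETIC_CODE)) < standard
                 for _ in range(10_000))
    print(f"standard code: {standard:.2f}; better random codes: {better} of 10,000")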

  In the 1950s, theoreticians assumed that there must be some physicochemical explanation of the link between a DNA codon and its amino acid. Once it was realised that proteins were assembled with the help of RNA, it was the RNA codon that became the focus. But this is not correct, either: in fact, the amino acid does not actually attach to a codon at all. It fits onto the open end of the tRNA molecule, which is a long loop in the shape of a clover leaf. On the opposite side of the tRNA molecule sits the anticodon, which is what the mRNA molecule recognises because it is composed of three complementary bases. There is no known link between the amino acid attachment site on the tRNA and the anticodon, which are on separate sides of the tRNA molecule. All those theoretical explanations were barking up the wrong tree.

  In 2005, RNA biochemist Michael Yarus and his colleagues reviewed various explanations for the origin of the code, together with the evidence for them, and attempted to integrate all three types of hypothesis.23 They suggested that the initial allocation of codons was based on physicochemical relations between a few RNA enzymes (either tRNA molecules or their predecessors) and the small number of amino acids they processed. These primitive molecules were then subjected to natural selection to optimise their ability to recognise different amino acids, and these were then selected to minimise error in their recognition by mRNA, leading to the current code. Although this is an attractive compromise, it has by no means settled the argument. Despite fifty years of research on the origin and evolution of the genetic code, there is still no consensus as to which of these hypotheses, or which combination of them, is correct. A recent review of the topic gloomily predicted that the answer – if there is one – might elude us for another half-century.24

  *

  Francis Crick’s starting point in his 1957 lecture on protein synthesis, which contained his description of the central dogma, was what he called the sequence hypothesis: the sequence of nucleotides on the DNA or RNA molecules enables the cell to produce a corresponding sequence of amino acids in a protein, or nucleotides in another molecule of nucleic acid. That is all the genetic code is. Crick could see no need for any other explanation of the way in which proteins fold, and, with the exception of the protective role of chaperones, this seems to be the case. Nonetheless, there have been repeated suggestions that the genetic code might contain more than sequence information – there may be a code within the code. This in turn may shed light on the origin of the particular version of the code that all life now uses.

  The simplest example of this kind of hidden genetic information is to be found in the pattern of codon usage. For those amino acids that are coded by more than one codon, the frequency with which the alternative codons are used is not equal. For example, leucine can be encoded by six DNA codons – TTA, TTG, CTT, CTC, CTA and CTG. In human genes, these six codons, each of which does exactly the same thing, are found at varying frequencies: CTA makes up about 0.7 per cent of the codons in your genes, whereas CTG makes up about 4.1 per cent. Broadly similar results are found for the mouse and for Drosophila; in yeast, however, CTA and CTG make up around 1.3 and 1.0 per cent of the codons respectively, and TTG is the most frequently found leucine codon, at 2.7 per cent.
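
  Codon usage is straightforward to measure: take the protein-coding sequences of a genome, read them three bases at a time and tabulate how often each codon appears. A minimal sketch, assuming the coding sequences are already available as in-frame strings of A, C, G and T (real tables are built from every annotated gene in a genome):

    from collections import Counter

    LEUCINE_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}

    def codon_usage(coding_sequences):
        """Return each codon's share of all codons, as a percentage."""
        counts = Counter()
        for seq in coding_sequences:
            counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
        total = sum(counts.values())
        return {codon: 100 * n / total for codon, n in counts.items()}

    # Toy input; a human codon-usage table would be built from tens of
    # millions of codons.
    usage = codon_usage(["ATGCTGCTGTTACTCTAA", "ATGTTGCTGGCATAA"])
    for codon in sorted(LEUCINE_CODONS):
        print(codon, f"{usage.get(codon, 0.0):.1f}%")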

  For the moment there is no agreement about why this effect – called codon bias – exists, nor about what it tells us about evolution. It seems to be related to selection, in that codon bias is more easily detected in genes that are highly expressed. Among the factors that may produce codon bias are the possibility of mutation leading to a switch between the various redundant codons, the number of genes coding for the various tRNAs in a given species, and selection pressure to use one form of codon rather than another to avoid potential errors. Genome-wide analyses of codon bias in twelve Drosophila species have shown that codon bias can even extend across codons: the most frequent pairs of codons in these flies are XXG-CXX (any codon ending in G, followed by any codon beginning with C), whereas the least frequent are XXT-TXX.25 This is telling us something, although it is not clear what. A similar effect has been observed in yeast, where there is a tendency for codons using the same tRNA to follow each other in the sequence, perhaps because tRNA molecules that have released their amino acid diffuse away from the ribosome more slowly than the translation machinery can recruit fresh copies of those tRNAs. To avoid a molecular traffic-jam, natural selection may have tended to favour the sequential involvement of the same tRNAs, leaving a signature in the genome.26
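
  The codon-pair effect can be looked for in much the same way, by walking along each coding sequence and tallying the last base of one codon against the first base of the next. A rough sketch, again assuming in-frame coding sequences as input:

    from collections import Counter

    def codon_boundaries(coding_sequences):
        """Count (last base of one codon, first base of the next) pairs."""
        boundary = Counter()
        for seq in coding_sequences:
            codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
            for first, second in zip(codons, codons[1:]):
                boundary[first[2] + second[0]] += 1
        return boundary

    # In the Drosophila analyses discussed above, the G|C boundary (XXG
    # followed by CXX) is the most common and T|T the rarest.
    print(codon_boundaries(["ATGCTGCATTTGCCCTAA"]).most_common())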

  There is even information in the frequency with which the four bases occur in our genomes. In humans, the GC pair of bases (guanine on one DNA strand, cytosine on the complementary DNA strand) is not found at the same frequency as the other pair of bases (AT – adenine and thymine). Again, the reasons for these effects are not known: longer genes tend to have a higher proportion of GC than AT pairs, and there are substantial differences between species – for example, in large stretches of mammalian genomes, the proportion of GC pairs varies between 35 per cent and 55 per cent (under random variation you would expect both types of pair to sit at around 50 per cent).27 Furthermore, GC pairs tend to be more frequent in some parts of chromosomes than others – these GC-rich zones, called isochores, have been known about for more than forty years, but there is still no agreement about their origins or significance.28 Although in mammals there are links between GC content and both body mass and genome size, many researchers argue that the effect is not due to selection and is instead produced by neutral changes that flow from gene duplication and mutations in non-selected parts of the genome.29 This apparent code within the code may contain nothing except noise.
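
  GC content is the simplest of these quantities to compute: count the Gs and Cs in a stretch of DNA and divide by its length. The sketch below does this in non-overlapping windows, which is roughly how large-scale variation such as isochores is mapped, although window sizes and methods vary between studies.

    def gc_content(seq):
        """Fraction of G and C bases in a DNA sequence."""
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / len(seq)

    def gc_windows(seq, window=100_000):
        """GC content in consecutive, non-overlapping windows along a chromosome."""
        return [gc_content(seq[i:i + window])
                for i in range(0, len(seq) - window + 1, window)]

    print(f"{gc_content('ATGCGCGCATATATGCGC'):.2f}")  # 0.56 for this toy stretch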

  The boldest suggestion that there is more than a sequence code in our DNA emerged at the end of 2013 from the University of Washington, after the publication of a paper in Science.30 The researchers claimed that they had identified a second layer of information in the human genetic code, overlying the sixty-four triplet codons. Around 14 per cent of the normal codons both specified amino acids and allowed transcription factors that control gene regulation to bind to the sequence.31 Despite the high-profile publication and the claims of the university’s communications agency, this effect had been described a few years earlier in a wide range of organisms, including mammals.32 The researchers suggested that mutations in the codons that also act as binding sites for transcription factors could lead to genetic diseases. However, they did not show any consequences (good or bad) of the existence of these binding sites, nor did they show that these sites actually affected the regulation of even one gene.

  The paper provoked irritation on social media, mainly because of the overblown claims about a phenomenon that had already been described.33 The research was part of the ENCODE project, which in 2012 claimed that most of our genomes are functional because detectable levels of biochemical activity are associated with virtually all of our DNA, even though the biological significance of that activity was not clear. These dramatic claims of a second genetic code may be a consequence of the criteria initially used by the ENCODE consortium to interpret its data. The findings may turn out to be correct, but it will require experimentation to prove that such a high proportion of our genome is the focus of gene regulation, and a great deal of work will be needed to convince the scientific community that this is indeed an example of a code within the code.

  We now know that our genomes contain information that enables the cell to process the DNA sequence in various ways, most of which are related to gene regulation.34 The simplest form of this additional information can be found in the untranslated regions upstream and downstream of the gene, which appear in mature mRNA and help direct gene expression. The end of the mature mRNA molecule carries a long sequence of adenine bases called a poly(A) tail, which can be up to 200 bases long and is involved in the stability of the mRNA molecule. The beginning of the mRNA molecule has a chemical ‘cap’ and a series of bases that control how the mRNA is processed by the cell.35 None of these forms of information is a systematic code like the genetic code. Called auxiliary or complementary genetic information by some researchers, this form of information, which is dispersed throughout our genomes, is more like a set of additional, particular and precise instructions, and it has still to reveal all its secrets.36 Rather than an alternative code of life, it is instead a sign of our deep evolutionary history, revealing ways in which our far-distant ancestors discovered new ways of manipulating genes and their outputs.37 It helps shape our DNA into something more than simply a set of codons that produce sequences of proteins or nucleic acids: it reveals our genome as a palimpsest, overlaid with other forms of information that do not obscure or invalidate the original sequence-encoding signal but enrich our understanding of our present and of our past.

  *

  Ever since 1953, when Watson and Crick wrote those apparently simple words ‘the precise sequence of the bases is the code which carries the genetical information’, biologists have regarded the idea that genes contain information as intuitively obvious. Philosophers have not been so easily convinced, and over the past two decades a debate about genetic information has taken place, away from the gaze of biologists. The main issue that has preoccupied the philosophers is the exact nature of the kind of information that is in genes, and, indeed, whether there is something there that can strictly be called information at all. The fact that most scientists are unaware of these arguments is due partly to the divisions between academic disciplines and partly, I suspect, to the fact that many of my colleagues take a dim view of philosophy. This is unfortunate, because one of the jobs of philosophers is to explore the complexity that lurks in apparently straightforward concepts such as information. Indeed, had philosophers paid more attention to the issue in the 1950s, they might have been able to persuade the theoreticians not to treat the code literally as a code or a language, and less time might have been wasted on fruitless speculation.

  Many scientists would probably agree with Michael Apter and Lewis Wolpert, who argued in 1965 that genetic information is simply a metaphor or an analogy, a way of describing what genes contain and how they exert their effects.38 Apter and Wolpert claimed that the most rigorous definition of information, the one set out in Shannon’s communication theory, does not apply to genetic information, because the whole point of genetic information is that it does something – it has a function, a meaning – whereas Shannon’s view of information has no place for meaning. The difficulty involved in expressing the content of DNA in Shannon’s terms can be shown by trying to calculate the information content of a genome. The problems begin at the beginning: it is unclear whether the fundamental unit should be a single base, with four alternative states (and therefore two bits of information), or a codon – three bases, with sixty-four alternative states (and therefore six bits of information) – or the output of the system, with twenty-one alternative states (twenty amino acids and ‘stop’), and therefore about 4.4 bits of information. Calculations based on each of these approaches would produce different answers, and in no case is it clear what the outcome would mean.
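
  The arithmetic behind these three choices of unit is simple enough – the Shannon information of a symbol drawn from N equally likely alternatives is log2(N) bits – but, as the different answers show, it does nothing to settle which unit is the right one.

    from math import log2

    for label, states in [("single base", 4), ("codon", 64),
                          ("amino acid or stop", 21)]:
        print(f"{label}: log2({states}) = {log2(states):.2f} bits")
    # single base: 2.00 bits; codon: 6.00 bits; amino acid or stop: 4.39 bits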

 
