
Life's Greatest Secret


by Matthew Cobb


  * For example, an ACU codon in mRNA would bind with a UGA anticodon on a tRNA molecule.
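
  The pairing in the footnote above can be checked mechanically. A minimal Python sketch (my illustration, not the book's) applies the Watson–Crick pairing rules base by base; because codon and anticodon run antiparallel, the result is the anticodon written in the 3′-to-5′ direction:

```python
# Watson-Crick pairing: A-U and G-C. Complementing an mRNA codon base by
# base gives the tRNA anticodon as it aligns against the codon, i.e. read
# in the 3'-to-5' direction.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon_3_to_5(codon: str) -> str:
    """Return the anticodon (written 3'->5') that pairs with an mRNA codon."""
    return "".join(PAIR[base] for base in codon)

print(anticodon_3_to_5("ACU"))  # UGA, as in the example above
```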

  UPDATE

  Nearly half a century has passed since the final word of the genetic code was read. In the intervening years, science has made substantial advances, some of which seem to challenge the fundamental discoveries that were made in the heady years of 1944–67. The closing chapters bring the history of the genetic code up to date, showing what has happened in the decades since.

  –TWELVE–

  SURPRISES AND SEQUENCES

  All of the researchers involved in cracking the genetic code agreed on two basic principles. First, they assumed that there was a one-to-one correspondence between the DNA sequence of a gene and the corresponding amino acid sequence – what Crick called colinearity. Second, they considered that the genetic code and the way in which genes functioned were universal, so ‘anything found to be true of E. coli must also be true of elephants’, as Jacques Monod put it at Cold Spring Harbor in 1961.1 Neither of these principles was required for genetics to work or for the genetic code to be cracked, but they made sense and they gave a universal significance to the models and interpretations that were being developed. They also ensured that the new science of molecular genetics fitted into the Darwinian framework according to which all life had a single origin and could therefore be assumed to share fundamental processes. Crick later said that those who were studying the genetic code had ‘a boundless optimism that the basic concepts involved were rather simple and probably much the same in all living things.’2

  Within ten years of the final word in the genetic code being read, it became obvious that such boundless optimism was unfounded, as the assumptions of colinearity and the universality of the genetic code were proved to be wrong.

  *

  In autumn 1977, several linked papers appeared in Proceedings of the National Academy of Sciences and in the new journal Cell, which had been set up three years earlier with the ambitious aim of being ‘a journal of exciting biology’.3 For once, the reality lived up to the hype, as the articles announced that, in viruses, genes were not necessarily continuous stretches of DNA but instead could be spread out along a sequence, split into several pieces.4

  What Watson called the bombshell discovery of split genes had first been announced at the Cold Spring Harbor meeting in the summer of 1977, and the scientific community was abuzz with the implications. It was soon found that mammalian genes shared this property, which contrasted sharply with the strictly continuous organisation of genes in bacteria. The surprise and excitement felt by researchers is shown by the unprecedented language used in the title of the first paper in that issue of Cell, by Louise Chow, Richard Roberts and their colleagues, which described ‘An amazing sequence arrangement’ of viral nucleic acids. Scientists rarely use words like ‘amazing’ in their professional publications.

  The reason for the excitement was simple: nearly twenty-five years of assumptions about gene structure had been overthrown by a completely unexpected discovery. Within a few months scientists were revelling in what was widely described as a revolution (Crick called it a mini-revolution).5* In eukaryotic cells (that is, in cells with a nucleus, so in all multicellular organisms and in single-celled organisms such as yeast) it turns out that genes often contain many bases that are not used to make a protein. As a result there is often no colinearity between the DNA sequence and the amino acid sequence of the protein. Between the start and the stop codons of a gene, there may be huge chunks of non-coding DNA that have no relation to the final protein. In an article in Nature, Wally Gilbert named these apparently irrelevant non-coding sequences introns (from ‘intragenic regions’); the DNA sequences that are expressed in protein were called exons.6 Most eukaryotic DNA is a patchwork of exons and introns. Introns are generally around forty bases in length, but they can be very large – for example, one of the introns in the human dystrophin gene is more than 300,000 bases long.7 In some rare cases, the intron of one gene can even contain a completely separate, protein-encoding gene.8

  The existence of introns means that the cell has to process the genetic message before it can be turned into protein. The initial RNA transcript of the DNA was named pre-mRNA – it contains all the irrelevant introns, but these are immediately snipped away and the two new ends of the molecule joined together (‘spliced’) to form a messenger RNA sequence that corresponds to the final amino acid product of the gene, along with untranslated regions at the beginning and end of the mRNA sequence, which tell the cell how the gene is to be expressed and processed. This splicing is done by tiny cellular structures made of RNA and protein, known clumsily as spliceosomes (some RNA molecules can splice themselves, without the aid of a spliceosome). The beginning and end of an intron are marked by specific sequences that are recognised by the spliceosome and which indicate which bits of the pre-mRNA molecule need to be snipped out.9 It is this spliced version of mRNA, called mature mRNA, that contains an RNA sequence colinear with the amino acid sequence of the protein and is used by the cell in protein synthesis.
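
  In computational terms, splicing amounts to keeping the exon segments of the pre-mRNA and discarding everything in between. A minimal sketch, using a made-up sequence and made-up exon coordinates rather than real splice-site recognition:

```python
# Hypothetical pre-mRNA: exon 1 (bases 0-7), an intron (8-15, beginning GU
# and ending AG, like real introns), then exon 2 (16-22).
pre_mrna = "AUGGCUUCGUAAGCAGGCAUUGA"
exons = [(0, 8), (16, 23)]  # (start, end) of each exon, 0-based, end-exclusive

def splice(pre: str, exon_coords: list[tuple[int, int]]) -> str:
    """Join the exon segments, discarding the intron(s) between them."""
    return "".join(pre[start:end] for start, end in exon_coords)

print(splice(pre_mrna, exons))  # AUGGCUUCGCAUUGA -- the mature mRNA
```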

  Despite the initial amazement of the scientific community, the existence of all this non-coding DNA in eukaryotic organisms was soon welcomed by researchers, as it seemed to provide an answer to the nagging suspicion, first voiced by Burnet in 1956, that not all of the DNA in a genome actually contributes to producing proteins.10 If important chunks of the genome were composed of what Wally Gilbert called ‘a matrix of silent DNA’, this would explain the situation, even if in 1959 Crick had described this possibility as unattractive.11 Since Gilbert’s first description, there has been a long-running debate about where introns come from – some scientists have argued that even the earliest genomes had introns, but most now think that introns appeared with the evolution of the eukaryotes, because there is no evidence that any prokaryotic organism – single-celled organisms with no nucleus – ever had introns, or possessed the complex cellular machinery required for splicing them out.12 Why introns evolved is still unclear.

  Splicing is not just a matter of snipping out a few irrelevant bases. It allows the production of different proteins from a single gene, because under different conditions different exons can be spliced together – this is called alternative splicing. A single DNA sequence can give rise to several mRNA sequences, depending on a variety of external factors, including the type of cell in which the gene is expressed. Currently, the largest known number of mRNAs that can be produced by a single gene is 38,016. These mRNAs are encoded by the Drosophila gene Dscam, which has four clusters of alternative exons containing twelve, forty-eight, thirty-three and two variants respectively, with one exon used from each cluster.13 Many of the 38,016 potential Dscam proteins differ only slightly, but this variability is of major functional significance because it means that the fly’s individual neurons differ from one another. The consequence is that Dscam proteins help determine the intricate way that those neurons interconnect, shaping the brain.14 The DNA sequence can contain an astonishing degree of complexity.
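
  The figure of 38,016 follows directly from that exon arrangement: choosing one exon independently from each of the four clusters gives the product of the cluster sizes. A minimal Python check of the arithmetic (mine, not the book's):

```python
# Dscam combinatorics: one exon is used from each of four clusters of
# mutually exclusive alternative exons, so the number of possible mRNAs
# is the product of the cluster sizes quoted in the text.
cluster_sizes = [12, 48, 33, 2]

total = 1
for size in cluster_sizes:
    total *= size

print(total)  # 38016
```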

  Until the discovery of introns, it had been assumed that gene mutation primarily involved point mutations – changes in a single base that would either lead to a different amino acid being inserted into the protein, or, if the base were deleted, would produce a frame-shift mutation in which the remaining bases of the genetic sequence would be read in a novel series of triplet codons, which would often be nonsense, as Crick had suggested in 1961. With the discovery of introns, it was realised that a mutation at the beginning or end of an intron could radically alter the structure of the translated protein by allowing new DNA sequences from the intron to be included in the coding region of the gene, thereby providing an additional source of genetic novelty. The two principal researchers involved in the discovery of what were initially called ‘split genes’ or ‘genes in pieces’ were Richard Roberts of Cold Spring Harbor and Phillip Sharp of MIT, and in 1993 they won the Nobel Prize in Physiology or Medicine for their work.
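
  Crick's frame-shift logic is easy to see in miniature. A sketch with a hypothetical message (mine, not from the book): deleting a single base shifts every downstream triplet, so the rest of the message is read as entirely different codons:

```python
def codons(seq: str) -> list[str]:
    """Split an mRNA sequence into consecutive triplet codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

original = "AUGGCUGAAUUUGGC"
mutant = original[:4] + original[5:]  # delete the fifth base

print(codons(original))  # ['AUG', 'GCU', 'GAA', 'UUU', 'GGC']
print(codons(mutant))    # ['AUG', 'GUG', 'AAU', 'UUG'] -- frame shifted
```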

  *

  Two years after the discovery of introns, the scientific world was shaken yet again. The last word of the genetic code to be deciphered had been the stop codon UGA (nicknamed opal), in 1967. In November 1979, a group at Cambridge discovered that in human mitochondria – small energy-producing structures found in all eukaryotic cells, which contain their own DNA and ribosomes – UGA does not encode stop but instead produces an amino acid, tryptophan.15 The genetic code is not strictly universal; even more surprisingly, the same organism – you – contains two different genetic codes, one in your genomic DNA, the other in your mitochondrial DNA.
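
  The difference between the two codes can be captured as two codon tables that disagree on a single entry. A deliberately tiny sketch (a three-codon toy table, not the full code):

```python
# Under the standard nuclear code UGA means stop; under the human
# mitochondrial code it encodes tryptophan (Trp).
STANDARD = {"AUG": "Met", "UGG": "Trp", "UGA": "STOP"}
MITOCHONDRIAL = {**STANDARD, "UGA": "Trp"}  # one reassigned codon

def translate(mrna: str, table: dict[str, str]) -> list[str]:
    """Read triplets left to right, stopping at a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = table[mrna[i:i + 3]]
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide

message = "AUGUGAUGG"
print(translate(message, STANDARD))       # ['Met'] -- UGA halts translation
print(translate(message, MITOCHONDRIAL))  # ['Met', 'Trp', 'Trp']
```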

  This fact tells us something fundamental about the history of life on our planet. In 1967, the US biologist Lynn Margulis began arguing that mitochondria were not merely micro-structures within eukaryotic cells but were remnants of a single-celled organism that had fused with the ancestor of all eukaryotic organisms, billions of years ago, probably as part of a symbiotic relationship. She was not the first to come up with this idea – in the early years of the twentieth century, both Paul Portier and Ivan Wallin suggested that mitochondria might be symbionts.16 Margulis argued that these symbiotic bacteria subsequently found themselves trapped in every one of our cells and lost all their independence, but not their own, separate genome – a tiny ring of DNA about 16,500 base pairs long (in comparison, the human nuclear genome contains about 3 billion base pairs). (Genes and genomes are measured in ‘base pairs’ because of the two strands of the DNA double helix – for each base there is a complementary base on the other strand, forming a base pair.)

  It appears that all mitochondria, in all the eukaryotes on the planet, have a common ancestor that was alive more than 1.5 billion years ago. The ancestors of plants subsequently incorporated another microbe in the same way, thus gaining their power-generating chloroplast organelles and the ability to harvest energy from sunlight.17 In the cases of both mitochondria and chloroplasts, there are arguments over exactly what kind of microbe fused with what, and above all over the speed with which the fusion took place, but most scientists now think that in each case there was a single event that enabled what was effectively a hybrid organism to grow larger and to acquire the energy required by more complex organisms.18 The tiny size of the mitochondrial genome, and its peculiar use of codons, can be explained in terms of the history of this symbiotic relationship. The mitochondrial genome codes for very few proteins – most of the other genes were lost before or shortly after fusion with our ancestors, or were incorporated into the genomic DNA of the host – so the appearance of a new function for a codon in mitochondrial DNA through mutation would not have had an important effect on the symbiont, most of whose needs were provided by the host cell.

  Mitochondria are not alone in having a non-standard genetic code. In 1985, it was discovered that single-celled ciliates – tiny organisms such as Paramecium – show variants of the nuclear genetic code that have appeared several times during evolution. In some species of ciliate, UAA and UAG code for glutamine rather than stop, with only UGA encoding stop; in others, UGA codes for tryptophan.19 Sometimes UGA and UAG have been recoded by natural selection to code for extra amino acids not generally found in life – selenocysteine and pyrrolysine, respectively.20 This can occur by altering the genetic code only in particular genes. For example, the human genome contains a handful of genes in which UGA has been recoded to encode selenocysteine.21 In these cases part of the mRNA for these genes instructs the cell to insert selenocysteine when it reads UGA; in all our other genes, UGA retains its normal stop function.22
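
  The selenocysteine case differs from the mitochondrial one in that the recoding is gene-specific rather than genome-wide. Reusing the translate sketch above, the gene-level version can be modelled as a per-gene flag (a simplification: in reality a structure in the mRNA, the SECIS element, signals the recoding):

```python
CODE = {"AUG": "Met", "UGA": "STOP", "GGC": "Gly"}  # toy three-codon table

def translate(mrna: str, selenoprotein: bool = False) -> list[str]:
    """Translate, reading UGA as Sec only in flagged (selenoprotein) genes."""
    table = {**CODE, "UGA": "Sec"} if selenoprotein else CODE
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = table[mrna[i:i + 3]]
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide

mrna = "AUGUGAGGC"
print(translate(mrna))                      # ['Met'] -- UGA is a normal stop
print(translate(mrna, selenoprotein=True))  # ['Met', 'Sec', 'Gly']
```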

  A recent study of 5.6 trillion base pairs of DNA from more than 1,700 samples of bacteria and bacteriophages isolated from natural environments, including on the human body, revealed that in a significant proportion of the sequences stop codons had been reassigned to code for amino acids, and an investigation of hitherto unstudied microbes revealed that in one group UAG had been reassigned from stop to code for glycine.23 There are even cases of novel codons being used to start translation: in 2012, for example, it was discovered that in some unusual circumstances in the mammalian immune system, the genetic message does not begin with the normal AUG codon but can instead be initiated from a CUG codon, which normally codes for leucine.24

  More than fifteen alternative or non-canonical genetic codes are known to exist, and it can be assumed that more remain to be discovered.25 The non-canonical codes generally involve the reassignment of stop codons; this may indicate that there is something about the machinery involved in stop codons that makes them particularly susceptible to change, or it may simply be that as long as the organism can still code stop using another codon, reassigning one stop codon to an amino acid does not cause those organisms any major physiological or evolutionary difficulties.26

  The exact process by which codon change takes place has been the focus of a great deal of theoretical and experimental research, and several hypotheses have been put forward to explain how variant codes might arise. The current front-runner, the codon capture model, was first put forward in 1987 by Jukes and Osawa. According to this model, random effects such as genetic drift can lead to the disappearance of a particular codon from a given genome; similar effects then lead to that codon being ‘captured’ by a tRNA that carries another amino acid.27 A recent experimental study of genetically engineered bacteria in which some codons had been artificially replaced supported this model, and even suggested that the reassignment of codons could be advantageous in some circumstances, providing the organism with expanded functions.28
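
  The model's two steps can be made concrete with a toy simulation (my own construction, not from the cited papers): first a codon vanishes from the genome through drift, then the vacant codon is reassigned without disturbing any existing protein:

```python
# Step 1: drift replaces every use of CGA (arginine) with its synonym CGG,
# so CGA disappears from this hypothetical genome.
genome = ["CGA", "GCU", "CGA", "AAA", "GCU"]
genome = ["CGG" if codon == "CGA" else codon for codon in genome]
assert "CGA" not in genome  # the codon is now unused

# Step 2: a mutant tRNA 'captures' the vacant codon for tryptophan. Because
# no gene still uses CGA, the reassignment changes no existing protein.
code = {"CGA": "Arg", "CGG": "Arg", "GCU": "Ala", "AAA": "Lys"}
code["CGA"] = "Trp"
print(code["CGA"])  # Trp -- the code has changed without harming the organism
```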

  The non-universality of the genetic code and the existence of introns were both completely unexpected, and went against the assumptions of all the researchers who had studied the genetic code. These discoveries showed that, strictly speaking, Monod was wrong – what is true for Escherichia coli is not necessarily true for an elephant in all respects. Nevertheless, the basic positions established during the cracking of the genetic code remain intact. The strict universality of the code and the linear organisation of genes were never laws, or even requirements. The only requirement is that any divergence from these assumptions can be explained within the framework of evolution, through testable hypotheses about the history of organisms. That requirement has been amply met for both the non-universality of the code and the existence of introns.

  Although the genetic code is not strictly universal, this has not altered our view of the fundamental processes of evolution at all. There is no dispute that life as we know it evolved only once, and that we all descend from a population of cells that lived more than 3.5 billion years ago, known as the Last Universal Common Ancestor, or LUCA.29 Because all organisms use amino acids with a left-handed orientation and RNA is universally used as a way of stringing amino acids together to make a protein, scientists are convinced that this hypothesis is true. In 2010, Douglas Theobald calculated that the hypothesis that all life is related ‘is 10^2,860 times more probable than the closest competing hypothesis.’30

  The variations in the code that have been discovered are in fact quite minor and can be explained either in terms of the deep evolutionary history of the eukaryotes – thereby revealing the thrilling fact that our evolution has hinged on the chance fusion of two cells to create the eukaryotes – or in terms of something recent and local in the life history of a particular group of organisms such as the ciliates. Similarly, although eukaryotic genes are profoundly different from those of prokaryotes, because they are ‘split’, they still work according to the same principles. All that has happened is that the cellular machinery for taking the information in genomic DNA and turning it into protein has been revealed to be very complicated in a group of organisms that we are particularly interested in, because it includes ourselves. Our basic understanding of how the information in a DNA sequence becomes an amino acid sequence has not been altered; although things are far more complex than the code pioneers could have imagined, the basic framework they developed still stands. The simple models developed in the 1950s and 1960s were not universally correct, but they were a necessary step for the development of our current understanding. And they remain true for the oldest and most numerous organisms on our planet, the prokaryotes.

  This final point highlights the power of the reductionist approach adopted by Crick, Delbrück, Monod and the others. They chose to use the simplest possible systems – bacteria and viruses – to understand fundamental processes. In so doing, they gambled that their findings would be applicable to all life. The models that they came up with were simple, elegant and susceptible to experimental testing. Had they been studying mammals and the tangled web of molecules and processes that lead from DNA to protein in these species, it is unlikely that much progress would have been made.

  *

  Over recent decades, the study of the genetic code has been transformed by one of the most significant technological changes that have taken place in biology – our ability to determine the sequence of DNA and RNA molecules. The breakthrough came with the work of Fred Sanger, who won the Nobel Prize in Chemistry twice, first in 1958 for determining the structure of insulin and other proteins, then in 1980 for sequencing nucleic acids (he shared the second prize with Wally Gilbert, who came up with a less widely used technique for sequencing DNA).

  Sanger was not the first to sequence a nucleic acid – a small transfer RNA was sequenced in 1965, using techniques similar to those that had previously been used to sequence proteins.31 But Sanger’s method made it possible to sequence up to 300 bases of a piece of DNA (in reality, 200 bases was more often the limit) by marking DNA chains of varying lengths with radioactive phosphorus-containing bases (A, C, G or T) and then visualising these fragments on an electrophoresis gel. Sanger obtained these DNA chains by carrying out four separate reactions to copy a DNA molecule. Each test tube contained the four normal nucleotide bases (A, C, G and T) and the enzymes used to copy the DNA molecule, together with a radioactively labelled variant of one of the four bases (hence the need for four reactions). As well as being radioactive, these special bases had been chemically modified so that they halted the copying reaction when they were randomly incorporated into a new DNA chain. Because a typical extract contains a vast number of identical copies of the DNA molecule, and the radioactive base was incorporated at a random point in each new chain, the result was a large number of radioactive DNA molecules of different lengths, which could therefore be detected on the gel. The four reactions (A, C, G and T) were then loaded side by side onto a gel and the electric current was turned on. Different lengths of DNA migrated at different speeds and so ended up at distinct points on the gel, enabling the sequence to be read by eye.
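
  The logic of reading a Sanger gel can be mimicked in a few lines. A simplified sketch (a made-up template, ignoring the chemistry): each of the four reactions yields fragments ending at every position where its base occurs, and sorting all the fragments by length (as the gel does) recovers the sequence:

```python
template = "ACGTTAGC"  # hypothetical sequence to be 'sequenced'

# One reaction per base: record the lengths of all chains terminating at
# an occurrence of that base.
reactions = {
    base: [i + 1 for i, b in enumerate(template) if b == base]
    for base in "ACGT"
}

# The gel separates fragments by length; reading shortest to longest
# recovers the original sequence.
fragments = sorted(
    (length, base)
    for base, lengths in reactions.items()
    for length in lengths
)
print("".join(base for _, base in fragments))  # ACGTTAGC
```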

 
