Life's Greatest Secret

by Matthew Cobb


  Sanger later described this technique, known variously as the chain termination method, dideoxy sequencing or, more simply, Sanger sequencing, as ‘the best idea I have ever had’.32 The rest of the scientific community seems to agree – his 1977 paper describing the method has been cited more than 65,000 times, a staggering number that makes it the fourth most cited article in the history of science.33

  Using this technique, in 1978 Sanger and his colleagues sequenced the first complete genome, that of a bacteriophage. It was 5,386 bases long and represented months and months of work.34 The technique soon became well established even though it was tedious and repetitive. It was also dangerous: as well as the omnipresence of radioactivity, the electrophoresis gel was made of toxic material and various steps in the procedure involved nasty chemicals that unravelled the DNA in the sample and, potentially, in the experimenter’s body. Despite these hazards, by 1984 researchers had sequenced the full genomes of three viruses – two bacteriophages, and the Epstein–Barr virus, which causes glandular fever in humans. The Epstein–Barr virus sequence, which was determined in Cambridge, was 172,282 bases long, or thirty-two times the length of the first genomic sequence. This was a major feat, representing years of work and involving what was then a large team of twelve researchers.

  Sanger’s method became widely used in the late 1980s with the development of the polymerase chain reaction (PCR), which allows tiny samples of DNA to be amplified in a test tube. This method was invented by Kary Mullis, who was working at the biotech company Cetus Corporation in California, in a flash of insight during a night-time drive with his girlfriend.35* PCR involves heating a sample to very high temperatures (up to 95°C); this separates the complementary DNA strands. The sample is then cooled slightly, allowing complementary strands to pair up again, and DNA polymerase enzymes then copy the DNA molecules. A single cycle doubles the amount of DNA in the sample. By repeating this cycle of heating and cooling dozens of times, even minute amounts of DNA can be amplified millions of times over in a couple of hours.

  Mullis had a problem, though – he needed a polymerase enzyme that could resist the relatively high temperatures his experiment required. As luck would have it, such an enzyme had recently been described in Thermus aquaticus (generally known as Taq), a bacterium that lives in hot springs.36 The final element of the procedure is the addition of short pieces of DNA, fifteen to twenty bases long, that mark the beginning and end of a DNA sequence of interest; these primers target the PCR so that only the section of DNA you are interested in is amplified.
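
  To get a feel for the arithmetic, here is a minimal sketch in Python of the exponential amplification and of how primers delimit the amplified region. The figures (ten starting molecules, thirty cycles) and the toy sequences are my own illustrative assumptions, not values from the text.

```python
# A minimal sketch of the arithmetic behind PCR amplification.
# The numbers (10 molecules, 30 cycles, 90% efficiency) are
# illustrative assumptions, not values from the text.

def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Each cycle multiplies the template by (1 + efficiency);
    a perfect cycle (efficiency = 1.0) doubles it."""
    return initial_copies * (1 + efficiency) ** cycles

def amplicon(template: str, forward_site: str, reverse_site: str) -> str:
    """Toy illustration of primer targeting: only the stretch between
    the two primer sites is amplified. Real primers bind to opposite
    strands; here we simply slice out the region they delimit."""
    start = template.index(forward_site)
    end = template.index(reverse_site) + len(reverse_site)
    return template[start:end]

# Ten starting molecules, 30 ideal cycles: 10 * 2**30, roughly ten billion copies,
# which is why 'minute amounts' of DNA suffice.
print(f"{pcr_copies(10, 30):.3g}")                            # 1.07e+10
print(f"{pcr_copies(10, 30, efficiency=0.9):.3g}")            # 2.28e+09 at 90% efficiency
print(amplicon("GGATTACAGGTTTCACGTCGAAGG", "TTACA", "CACG"))  # TTACAGGTTTCACG
```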

  PCR rapidly overtook the previous technique of inserting a DNA fragment into a phage genome, then infecting bacteria and allowing the bacteria to reproduce, thereby amplifying the DNA. PCR is much simpler, and even a complete novice can soon amplify minute quantities of DNA. In 1993, less than a decade after inventing the technique, Mullis was awarded the Nobel Prize in Chemistry. The initial application of the technique was diagnosis, and it is now routinely used in medicine as a tool for identifying diseases, both infectious and genetic. Coupled with sequencing, PCR has transformed the way in which biology and medicine work.

  The practical application of DNA technology really took off in 1984, when Alec Jeffreys of the University of Leicester discovered the existence of small stretches of DNA that can be easily identified and which represent a unique genetic ‘fingerprint’ of each individual. The significance of these bits of DNA, known as minisatellites, was instantly obvious to Jeffreys, and he immediately wrote down a series of potential applications that included forensics, conservation biology and paternity testing. In less than a year, the technique was used to determine the outcome of an immigration case by showing that a young Ghanaian boy was indeed the son of the woman who claimed to be his mother; as a result the child was allowed back into the UK.37

  Jeffreys’s technique soon proved simpler and more flexible than the previous method for identifying genetic variants, which involved snipping bits of DNA at defined locations using special proteins called restriction enzymes. If the population being studied contained variability in the length of DNA between the two sites where the restriction enzymes acted, then those variants could be detected on an electrophoresis gel. This was first demonstrated in 1980 by David Botstein and his colleagues, who were working on the human genetic disorder Huntington’s disease.38 Outside of medicine, the use of restriction enzymes proved invaluable for mapping genes and for the development of recombinant DNA biotechnology.

  All around the world, DNA fingerprinting is now routinely employed by the judicial system to convict criminals and to prove the innocence of the wrongly accused. The routine collection of DNA samples by the police, and the existence of databases permitting the identification of individuals, have led to a continuing ethical debate about the conflict between liberty and justice, with state forces arguing that only the guilty have something to hide, whereas more libertarian arguments underline the potential dangers.

  By the late 1980s, machines were able to read DNA sequences, using a system based on fluorescence rather than radioactivity, but still using Sanger’s sequencing method.39 Sequences were now detected in tiny capillary tubes rather than on huge heavy gels, opening the possibility of simultaneously carrying out many parallel reads. The sequence could also be read in real time, as the reaction took place, rather than waiting for the gel to run and then detecting the radioactive products using a photographic plate. At the beginning of the 1990s, these technical developments led to the creation of a series of projects for sequencing the genomes of multicellular organisms, with the ultimate objective being the sequencing of the human genome. The first animal genome to be completed, in 1998, was that of the nematode worm, Caenorhabditis elegans, closely followed by that of the tiny vinegar fly, Drosophila melanogaster, in 2000. These projects provided vital information about two widely used laboratory organisms and were testing grounds for different technical and commercial approaches to genome sequencing. The C. elegans genome project, led by John Sulston, was entirely funded by public money, whereas the Drosophila genome was a joint effort between publicly funded researchers and a company called Celera Genomics, led by Craig Venter, a molecular biologist turned entrepreneur.

  Despite the very different motivations of the public and private researchers, the Drosophila genome project was a success. In contrast, the Human Genome Project, which took place in parallel, was the focus of clashes of scientific and commercial outlook as well as of personality.40 The human genome contains around 3 billion base pairs, far more than that of C. elegans (100 million base pairs) or Drosophila (140 million base pairs). The size of the human genome and the large stretches of repetitive sequences it contains posed new difficulties that were exacerbated by the very different approaches taken by the public and private researchers.

  The publicly funded International Human Genome Sequencing Consortium, led first by Jim Watson and then by Francis Collins, had been working since 1990 to produce a full sequence of every base in the genome, and its members were resolutely hostile to the idea of patenting genes. In contrast, Craig Venter and Celera initially focused on sequencing only genes that were known to be expressed in certain tissues or under certain conditions, with the hope of finding patentable products. They did this by collecting mature mRNA that was present in the cell or tissue of interest, transcribing that back into what is known as complementary DNA (cDNA) and then sequencing this cDNA molecule.

  This approach had the great advantage of focusing on genes that were apparently important in a given tissue and meant that researchers did not waste time sequencing the millions of bases in the huge non-coding regions that can be found between genes, or even sequencing the introns of the gene of interest, which had been stripped out by the cellular machinery during the synthesis of RNA from the genomic DNA. Using this method, Venter showed that it was possible to identify genes involved in vital processes, with the tantalising possibility of gaining insight into novel medical treatments. While this was extremely productive and held the promise of financial gain, it was at odds with the aim of the publicly funded project, which was to sequence every base in the human genome.

  Venter’s group then used a different approach, which was initially opposed by the publicly funded researchers but ended up dominating the field and has since been used in sequencing the genomes of many organisms. Known as shotgun sequencing, the technique involves identifying the bases on many short pieces of DNA and then assembling them into much longer continuous sequences. Sequencing short stretches of DNA is easier, but it leads to a substantial difficulty: which of the resultant hundreds of thousands of short sequences follows on from which – how do you reassemble the puzzle? This was particularly problematic when it came to dealing with the immense stretches of apparently functionless DNA to be found between genes, which could consist of featureless repetitions of two bases, such as ACACACAC….

  To resolve this problem, Venter and his Celera colleagues enlisted computer scientists to develop algorithms for assembling the sequence, and they were able to prove the validity of their approach with the Drosophila genome. Despite hostility from many scientists around the world, Venter was probably right to argue that this method would make it possible to complete the project. Nevertheless, problems remained – even with the cleverest algorithms in the world, it is not possible to join up all of the bits of sequences. To get over this problem, recalcitrant parts of the genome were amplified in bacteria to try and bridge the gap. This does not always work – some sections of the human and the Drosophila genomes have still not been joined up, fifteen years after the sequences were published.
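
  The flavour of the assembly problem can be captured in a few lines of code. What follows is a toy greedy overlap assembler – a deliberately simplified sketch of the general idea, not Celera’s actual algorithm – with invented reads and an invented minimum-overlap threshold:

```python
# Toy greedy overlap assembly: repeatedly merge the pair of reads whose
# suffix-prefix overlap is longest. A simplified sketch of the idea only;
# real assemblers are far more sophisticated.

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def assemble(reads: list[str]) -> str:
    """Greedily merge reads until no overlaps remain."""
    reads = reads[:]
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, left read index, right read index)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no overlaps left – the reconstruction has a gap
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

# Three invented reads that tile a short stretch of sequence.
print(assemble(["AGGTCG", "TCGACC", "ACCTGA"]))  # AGGTCGACCTGA
```

  Repetitive stretches such as ACACAC… are exactly where a strategy like this struggles: many reads overlap equally well in several places, so the merges become ambiguous – one reason why some gaps have never been closed.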

  Despite the continuing clashes, the completion of the draft human genome was announced by President Bill Clinton in 2000, even though it was in fact nowhere near finished. Collins and Venter stood on either side of Clinton in the White House, while the ambassadors of the UK, Japan, Germany and France were in the audience. Meanwhile Tony Blair, along with Fred Sanger, Max Perutz and other British scientists, appeared at the end of a video feed from Downing Street. In a jarring counterpoint to the celebration of human ingenuity and the power of evolution that was on display, Clinton claimed that ‘Today we are learning the language in which God created life’.41 Collins and Blair, both devout Christians, presumably concurred.

  The draft sequence was published in 2001 in two versions: the Celera genome appeared in Science, and the publicly funded version was published in Nature.42 The International Human Genome Sequencing Consortium sequence is now taken as the definitive version, and was initially a mosaic of information from more than 100 individuals who contributed DNA to the publicly funded project (one of whom was Jim Watson) and the five individuals used in the Celera effort (one of whom was Craig Venter). It continues to be updated as genes are more effectively annotated, and functions or similarities can be more reliably located to particular stretches of DNA.

  However, there is no such thing as ‘the’ human genome. On average, each of our genomes differs by about one base pair in a thousand, so by about 3 million base pairs in total. Most of those differences are not in coding DNA, and those differences that are in coding sequences are generally silent – they do not alter our amino acid sequences. Nevertheless, the overall structure of the human genome, its mixture of coding and non-coding sequences, and the way in which the coding sequences are expressed in time and space, form part of what it is to be human. And as the publicly funded researchers intended, the human genome is a public good, open to all and freely accessible on the Internet, out of reach of the patent lawyers – in 2013, after years of argument, the US Supreme Court finally ruled that no human genes could be patented, striking down patents that had been awarded to Myriad Genetics for use in diagnostic testing for the BRCA1 gene (mutations in this gene can increase the risk of breast cancer).43 That situation may change: in 2014, the Australian courts supported Myriad Genetics’s claim that human gene sequences could be patented.44

  *

  Since the beginning of the twenty-first century and the triumph of the Human Genome Project, genome sequencing has been transformed from a highly complex international affair, immensely costly in terms of people and money, into something that can be undertaken by relatively small groups of researchers interested in the most obscure organisms. Behind this change has been the appearance of what are called next-generation sequencing techniques, based on robotics and powerful computers, which were developed after the human genome sequence was completed.45

  The best sequencers available at the turn of the century used Sanger sequencing to simultaneously sequence about 100 stretches of DNA, each stretch producing reads that were up to 800 bases long. Next-generation sequencing is very different; it uses a variety of techniques to enable the machine to detect each base as it is incorporated into a new chain of DNA during DNA replication. The technology is continually being upgraded; as of 2014, hundreds of thousands of short strings of DNA – each 75–125 bases long – can be simultaneously sequenced, meaning that millions of bases can be detected in a second (when I was hand-sequencing in the 1990s, I was happy if I did 400 bases in a day). These fragments are randomly selected from the genome, and by carrying out this process millions of times, the entire genome can be covered. Computer algorithms are then used to assemble the sequence, meaning that next-generation sequencing is as much about mathematics as it is about molecular biology.

  As the price of sequencing machines and computers has dropped, so too has the price of genomes. The human genome cost the public purse around $3bn – more or less a dollar a base pair – and used more than 1,000 sequencing machines. In 2010, the Chinese employed next-generation sequencing to analyse the 2.3 billion base pairs in the genome of the giant panda for a mere $900,000 – less than 0.04 cents per base, or 1/2,500 of the cost for the human genome. The whole project took less than a year, and used the equivalent of just thirty machines.46
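
  As a quick check of those figures – a back-of-the-envelope sketch in which the costs come from the text and the rounding is mine:

```python
# Back-of-the-envelope check of the sequencing costs quoted above.
human_cost, human_bases = 3e9, 3e9    # ~$3bn for ~3 billion base pairs
panda_cost, panda_bases = 9e5, 2.3e9  # $900,000 for 2.3 billion base pairs

human_per_base = human_cost / human_bases  # 1.0 dollar per base
panda_per_base = panda_cost / panda_bases  # ~0.00039 dollars per base

print(f"{panda_per_base * 100:.3f} cents per base")  # 0.039 – 'less than 0.04 cents'
print(f"1/{human_per_base / panda_per_base:,.0f} of the human cost")  # 1/2,556 – ~1/2,500
```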

  Most of the genomes from multicellular organisms thus far completed have been published in one of the leading scientific journals. That will inevitably change. According to the Genomes Online Database, at the end of 2014 there were more than 700 projects to sequence non-human vertebrate genomes alone. The genomes of the rattlesnake, the turkey vulture and Nancy Ma’s night monkey, along with hundreds of others, are all no doubt fascinating and will provide insight into evolution and medicine, but they – along with the hundreds of arthropods and the thousands of fungi that are being sequenced – are unlikely to get the same kind of attention as the platypus and the panda. The leading journals will focus on genomes with a high commercial or scientific impact, and which therefore promise a high rate of citation in the future. For the remainder of the natural world there will be electronic-only genome journals – already, most bacteria that are sequenced receive a brief one-page announcement with a link to the online database where the information is stored.47

  More sequencing developments are just around the corner: in 2014, Oxford Nanopore Technologies delivered early models of its nanopore sequencer to researchers around the world for beta testing. The device is the size of a mobile phone and plugs into a computer via the USB port. Unlike next-generation sequencing, which relies on computing power and parallel processing, this technology is claimed to create continuous DNA sequences of up to 10,000 base pairs on your desktop. If it lives up to the hype, DNA sequencing will become commonplace, and could even be done in the field on wild-caught samples, to identify particular genetic variants. Already, next-generation sequencing is being used on oceanic research expeditions.48

  Meanwhile, the market leader in next-generation sequencing machines, Illumina, has announced that its latest device will be able to sequence the equivalent of sixteen human genomes in three days, bringing the price for sequencing a whole human genome down to less than $1,000. There is a catch: the company insists that to get access to their technology, the user will have to buy ten machines at a total cost of at least $10m.49 Whatever the coming years hold, the price of sequence data will continue to fall, and the number of sequences will continue to grow.

  Eventually, personalised medicine based on our individual genetic make-up will become widely available. The president of Illumina, Francis de Souza, has predicted that in 2015, an astonishing 228,000 human genomes will be sequenced in the name of medicine – the British government is currently supporting a project to sequence 100,000 genomes with the aim of improving the diagnosis, prevention and treatment of disease.50 De Souza’s ambition is to move Illumina technology into hospitals and to carve out a chunk of a diagnostic market that he estimates at $20bn. Whatever the hype associated with such claims, our understanding of the significance of small genetic differences between individuals – what is known as intraspecific variation – is growing, as governments and research agencies around the world realise that it will bring health benefits, as well as insight into the history and demography of human populations.51 Already, the analysis of the genomic variations found in particular cancers has opened the road to new, more precise treatments. For example, the breast cancer drug Herceptin is targeted solely at women whose cancers have a genetic profile called HER2-positive, while patients with lung cancer whose tumours show mutations in the EGFR gene can be treated with drugs called Iressa and Tarceva.52

 
