Neanderthal Man
During the spring of 2007, the Friday meetings continued to show our cohesive group at its best. People threw out one crazy idea after another for increasing the proportion of Neanderthal DNA or finding microscopic pockets in the bones where preservation might be better. It was almost impossible to say who came up with which idea, because the ideas were generated in real time, during continuous discussions to which everyone contributed. We started talking about ways to separate the bacterial DNA in our extracts from the endogenous Neanderthal DNA: maybe the bacterial DNA differed from the Neanderthal DNA in some feature that we could exploit for this purpose, perhaps a difference in the size of the bacterial and the Neanderthal DNA fragments? Alas, no! The size of bacterial DNA fragments in the bones was largely indistinguishable from that of the Neanderthal DNA.
Figure 13.1. The Neanderthal genome group in Leipzig, 2010. From the left: Adrian Briggs, Hernan Burbano, Matthias Meyer, Anja Heinze, Jesse Dabney, Kay Prüfer, me, a reconstructed Neanderthal skeleton, Janet Kelso, Tomi Maricic, Qiaomei Fu, Udo Stenzel, Johannes Krause, Martin Kircher. Photo: MPI-EVA.
Again and again we asked what differences there might be between bacterial and mammalian DNA. And then it struck me: methylation! Methyl groups are little chemical modifications that are common in bacterial DNA, particularly on A nucleotides. In the DNA of mammals, however, C nucleotides are methylated. Perhaps we could use antibodies to methylated A’s to bind and remove bacterial DNA from the extracts. Antibodies are proteins that are produced by immune cells when they detect substances foreign to the body—for example, DNA from bacteria or viruses. The antibodies then circulate in the blood, bind with great strength to the foreign substances wherever they encounter them, and help eliminate them. Because of their ability to specifically bind to substances to which immune cells have been exposed, antibodies can be used as powerful tools in the laboratory. For example, if DNA containing methylated A nucleotides is injected into mice, their immune cells will recognize that the methylated A’s are foreign and make antibodies to them. These antibodies can then be purified from the blood of the mice and used in the laboratory, and I thought we should make such antibodies and then try to use them to bind and eliminate bacterial DNA in our DNA extracts.
A quick literature search revealed that researchers at New England Biolabs, a company outside Boston, had already produced antibodies to methylated A’s. I wrote to Tom Evans, an excellent scientist interested in DNA repair whom I knew there, and he graciously sent us a supply. Now I wanted someone in the group to use them to bind to the bacterial DNA and remove it from the extracts. I thought that doing so would leave us with extracts in which the percentage of Neanderthal DNA was much higher. I considered this an ingenious plan. But when I presented it in our weekly meeting, people seemed skeptical—again, it seemed to me, because of their unfamiliarity with the technique. This time, bolstered by the fact that I had been right about the radioactivity, I more or less insisted. Adrian Briggs took it on. He spent months trying to get the antibodies to bind to the bacterial DNA and separate it from nonbacterial DNA. He tried all kinds of modifications of the technique. It never worked, and we still don’t know why. For quite some time, I got to hear facetious comments about my wonderful antibody idea.
What else could we try in order to eliminate the bacterial DNA? One idea was to identify sequence motifs found frequently among our bacterial sequences. Perhaps we could then use synthetic DNA strands to specifically bind and remove the bacterial DNA in a way similar to what I had imagined for the antibodies. Kay Prüfer, a soft-spoken computer science student who, after coming to our lab, had taught himself more genome biology than most biology students know, looked for potentially useful sequence motifs. He found that some combinations of just two to six nucleotides—such as CGCG, CCGG, CCCGGG, and so on—were present much more often in the microbial DNA than in the Neanderthal DNA. When he presented this observation in a meeting, it was immediately clear to me what was going on. In fact, I should have thought of this earlier! Every molecular biology textbook will tell you that the nucleotide combination of C followed by G is relatively infrequent in the genomes of mammals. The reason is that methylation in mammals occurs at C nucleotides only when they are followed by G nucleotides. Such methylated C’s may be chemically modified and misread by DNA polymerases so that they mutate to T’s. As a result, over millions of years, mammalian genomes have slowly but steadily been depleted of CG motifs. In bacteria, this methylation of C’s does not occur, or is rare, so CG motifs are more common.
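To make the CG signal concrete, here is a minimal Python sketch, using invented example fragments rather than real data, that simply counts how often the dinucleotide CG occurs per position in a set of fragments. A markedly higher CG rate is the kind of signature that separates microbial from mammalian DNA.

```python
# Minimal sketch with made-up sequences: mammalian DNA is depleted of the
# CG dinucleotide because methylated C's followed by G tend to mutate to
# T's over evolutionary time, while bacterial DNA is not. Counting CG per
# position therefore hints at whether a fragment is microbial.

def cg_rate(fragments):
    """Fraction of dinucleotide positions that are 'CG' across all fragments."""
    cg = sum(f.count("CG") for f in fragments)
    positions = sum(len(f) - 1 for f in fragments)
    return cg / positions if positions else 0.0

mammal_like = ["ATTTGCATCAAGGTTTAGACA", "TTTACAGGATGAATCCATTTG"]
bacteria_like = ["ACGCGTTCCGGATCGCGAACG", "GCCGGATCGCGTACGCGGTCA"]

print(f"mammal-like fragments:   CG rate {cg_rate(mammal_like):.2f}")
print(f"bacteria-like fragments: CG rate {cg_rate(bacteria_like):.2f}")
```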
How could we use this information? The answer to that question, too, was immediately obvious to us. Bacteria make enzymes, so-called restriction enzymes, that cut within or near specific DNA sequence motifs (such as CGCG or CCCGGG). If we incubated the Neanderthal libraries with a collection of such enzymes, they would chop up many of the bacterial sequences so that they could not be sequenced but leave most of the Neanderthal sequences intact. We would thus tip the ratio of Neanderthal to bacterial DNA in our favor. Based on his analyses of the sequences, Kay suggested cocktails of up to eight restriction enzymes that would be particularly effective. We immediately treated one of our libraries with this mix of enzymes and sequenced it. Out of our sequencing machine came about 20 percent Neanderthal DNA instead of 4 percent! This meant that we needed only about seven hundred runs on the machines in Branford to reach our goal—a number within the realm of possibility. This small trick was what made the impossible possible. The only drawback was that the enzyme treatment would cause us to lose some Neanderthal sequences—the ones that carried particular runs of C’s and G’s—but we could pick up those sequences by using different mixtures of enzymes in different runs and by doing some runs without any enzymes. When we presented our restriction enzyme trick to Michael Egholm at 454, he called it brilliant. For the first time, we knew that we could in principle reach our goal!
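The enrichment step itself can be sketched in a few lines. The toy example below is not the actual enzyme cocktail; it uses only the three motifs named above and invented fragments, and it simply discards any molecule carrying a recognition site, just as the real enzymes rendered such molecules unsequenceable.

```python
# Toy sketch of restriction-enzyme enrichment: drop every library molecule
# that contains one of the CG-rich recognition motifs, then see how the
# endogenous (Neanderthal) fraction of the library shifts. Motifs and
# fragments are illustrative only.

MOTIFS = ["CGCG", "CCGG", "CCCGGG"]

def survives(fragment):
    """True if the fragment carries none of the recognition motifs."""
    return not any(motif in fragment for motif in MOTIFS)

def endogenous_fraction(neanderthal_fragments, bacterial_fragments):
    kept_n = [f for f in neanderthal_fragments if survives(f)]
    kept_b = [f for f in bacterial_fragments if survives(f)]
    total = len(kept_n) + len(kept_b)
    return len(kept_n) / total if total else 0.0

# CG-poor "Neanderthal" fragments mostly survive; CG-rich "bacterial"
# fragments mostly carry a recognition site and are removed.
neanderthal = ["ATTTGCATCAAGGTTTAGACA", "TTTACAGGATGAATCCATTTG"]
bacterial = ["ACGCGTTCCGGATCGCGAACG", "GCCGGATCGCGTACGCGGTCA", "ATTCCGGTTACGCGATTTACC"]
print(endogenous_fraction(neanderthal, bacterial))  # 1.0 in this toy case
```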
While all this was going on, a paper appeared by Jeffrey Wall, a young and talented population geneticist in San Francisco whom I had met on several occasions. It compared the 750,000 nucleotides that our group had determined by 454 sequencing from the Vi-33.16 bone and published in Nature with the 36,000 nucleotides that Eddy Rubin had determined by bacterial cloning from our extracts of the same bone and published in Science. Wall and his co-author, Sung Kim, pointed out several differences in the data sets, many of which we had already seen and discussed extensively when the two papers were in review. They suggested that there could be several possible problems with the 454 data set but favored the interpretation that there were huge amounts of present-day human contamination in our library. In particular, they suggested that between 70 and 80 percent of what we had thought was Neanderthal DNA was instead modern human DNA.{53}
This was troubling. We were aware that we might have some contamination in both the Nature and Science data sets, as we had sent the extracts to laboratories that did not work under clean-room conditions. We were also aware that if there was a difference in levels of contamination, they would be higher in the Nature data set produced at 454. We were sure, however, that any contamination levels could not be 70 to 80 percent, because Wall’s analysis relied on assumptions, such as similar GC content in short and long fragments, that we knew were not true.
In an attempt to clarify these issues, we immediately asked Nature to publish a short note, in which we pointed out that several features differed between the sequences determined by the 454 technique and by bacterial cloning, and that some of the features were likely to affect the analysis. We also wanted to mention that our additional sequencing of the library had indicated very little contamination based on mtDNA. But we further realized that some level of contamination had probably been introduced into the library at 454, perhaps from a library of Jim Watson’s DNA that it turned out 454 had sequenced at the same time as our Neanderthal library. So in the note we conceded that “contamination levels above that estimated by the mtDNA assay may be present.” But by how much was impossible to tell. We pointed the readers both to Wall’s paper and to the paper in which we described the use of tags in the library production that now made any contamination outside the clean room impossible.{54} We also posted a note in the publicly available DNA sequence database, so that any potential users would know of the concerns with these data. But, to my annoyance, after sending our note to reviewers, Nature decided not to publish it.
We discussed whether we had been too hasty in publishing the proof-of-principle data in Nature. Had we been driven to go ahead by the competition with Eddy? Should we have waited? Some in the group thought so, and others did not. Even in retrospect I felt that the only direct evidence for contamination we possessed, the analysis of the mtDNA, had shown that contamination was low. And that was still the case. The mtDNA analysis had its limitations, but in my opinion, direct evidence should always have precedence over indirect inferences. In the note that Nature never published, we therefore said that “no tests for contamination based on nuclear sequences are known, but in order to have reliable nuclear sequences from ancient DNA, these will have to be developed.” This remained an ongoing theme in our Friday meetings for the next several months.
Chapter 14
Mapping the Genome
Once we knew that we could make the DNA libraries we needed, and with the hope that 454 would soon have fast enough machines to sequence them all, we started turning our attention to the next challenge: mapping. This was the process of matching the short Neanderthal DNA fragments to the human genome reference sequence. This process might sound easy, but in fact it would prove a monumental task, much like doing a giant jigsaw puzzle with many missing pieces, many damaged pieces, and lots and lots of extra pieces that would fit nowhere in the puzzle.
At heart, mapping required that we balance our responses to two different problems. On the one hand, if we required near-exact matches between the Neanderthal DNA fragments and the human genome, we might miss fragments that carried more than one or two real differences (or errors). This would make the Neanderthal genome look more similar to present-day humans than it really was. But if our match criteria were too permissive, we might end up allowing bacterial DNA fragments that carried spurious similarity to some parts of the human genome to be misattributed as Neanderthal DNA. This would make the Neanderthal genome look more different from present-day humans than it actually was. Getting this balance right was the single most crucial step in the analysis because it would influence all later work that revolved around scoring differences from present-day genomes.
We also had to keep practical considerations in mind. The computer algorithms used for mapping could not take too many parameters into account, as it would then become impossible to efficiently compare the more than 1 billion DNA fragments, each composed of 30 to 70 nucleotides, that we planned to sequence from the Neanderthal bones to the 3 billion nucleotides in the human genome.
The people who took on the monumental task of designing an algorithm to map the DNA fragments were Ed Green, Janet Kelso, and Udo Stenzel. Janet had joined us in 2004 from the University of the Western Cape, in her native South Africa, to head a bioinformatics group. An unassuming but effective leader, she was able to form a cohesive team out of the quirky personalities that made up the bioinformatics group. One of these personalities was Udo, who had a misanthropic streak; convinced that most people, especially those higher up in academic hierarchies, were pompous fools, he had dropped out of university before finishing his degree in informatics. Nevertheless, he was probably more capable as a programmer and logical thinker than most of his teachers. I was happy that he found the Neanderthal project worthy of his attention, even if his conviction that he always knew everything best could drive me mad at times. In fact, Udo would probably not have gotten along with me at all if it were not for Janet’s mediating influence.
Ed, with his original project on RNA splicing having died a quiet, unmourned death, had become the de facto coordinator of the efforts to map the Neanderthal DNA fragments. He and Udo developed a mapping algorithm that took the patterns of errors in the Neanderthal DNA sequences into account. These patterns had in the meantime been worked out by Adrian together with Philip Johnson, a brilliant student in Monty Slatkin’s group at Berkeley. They had found that errors were located primarily toward the ends of the DNA strands. This was because when a DNA molecule is broken, the two strands are often different lengths, leaving the longer strand dangling loose and vulnerable to chemical attack. Adrian’s detailed analysis had also revealed that, contrary to our conclusions just a year earlier, the errors were all due to deamination of cytosine residues, not adenine residues. In fact, when a C occurred at the very end of a DNA strand, it had a 20 to 30 percent risk of appearing as a T in our sequences.
Ed’s mapping algorithm cleverly implemented Adrian’s and Philip’s model of how errors occurred as position-dependent error probabilities. For example, if a Neanderthal molecule had a T at the end position and the human genome a C, this was counted as almost a perfect match, as deamination-induced C-to-T errors at the end positions of Neanderthal fragments were so common. In contrast, a C in the Neanderthal molecule and a T in the human genome at the end position was counted as a full mismatch. We were confident that Ed’s algorithm would be a great advance in reducing false mapping of fragments and increasing correct ones.
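The idea behind the scoring can be illustrated with a small sketch. This is not Ed and Udo's actual algorithm; it is a toy penalty scheme under the assumptions just described, in which a C in the reference appearing as a T near the end of a fragment is treated as almost a match, while every other difference counts in full. The penalty values are invented.

```python
# Toy position-aware scoring for ancient DNA reads: deamination turns C's
# near fragment ends into apparent T's, so a reference-C / read-T mismatch
# close to either end is penalized only lightly. Lower score = better match.

def deamination_aware_score(read, ref, end_window=3):
    """read and ref are assumed to be aligned, equal-length, gap-free strings."""
    assert len(read) == len(ref)
    score = 0.0
    n = len(read)
    for i, (r, q) in enumerate(zip(ref, read)):
        if r == q:
            continue
        near_end = i < end_window or i >= n - end_window
        if r == "C" and q == "T" and near_end:
            score += 0.1   # likely deamination damage: almost a match
        else:
            score += 1.0   # ordinary mismatch: full penalty
    return score

# A read showing T where the reference has C at the end scores far better
# than a read with an internal mismatch; a read-C over a reference-T gets no break.
print(deamination_aware_score("TTGACCA", "CTGACCA"))  # 0.1 (C-to-T at the end)
print(deamination_aware_score("TTGGCCA", "CTGACCA"))  # 1.1 (plus an internal mismatch)
print(deamination_aware_score("CTGACCA", "TTGACCA"))  # 1.0 (T-to-C is not damage)
```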
Another problem was choosing a comparison genome to use for mapping the Neanderthal fragments. One of the goals of our research was to examine whether the Neanderthal genome sequence revealed a closer relationship with humans in Europe than in other parts of the world. For example, if we mapped the Neanderthal DNA fragments to a European genome (about half of the standard reference genome was from a person of European descent), then fragments that matched the European genome might be retained more often than fragments that were more like African genomes. This would make the Neanderthal incorrectly look more similar to Europeans than to Africans. We needed a neutral comparison, and we found one in the chimpanzee genome. The common ancestor that Neanderthals and modern humans shared with chimpanzees existed perhaps 4 million to 7 million years ago, meaning that the chimpanzee genome should be equally unlike both the Neanderthal and the modern human genome. We also mapped the Neanderthal DNA fragments to an imaginary genome that others had constructed by estimating what the genome of the common ancestor of humans and chimpanzees would have looked like. After being mapped to these more distant genomes, the Neanderthal fragments could then be compared to the corresponding DNA sequences in present-day genomes from different parts of the world and differences could be scored in a way that did not bias the results from the outset.
All of this required considerable computational power, and we were fortunate to have the unwavering support of the Max Planck Society as we attempted it. The society dedicated a cluster of 256 powerful computers at its computer facility in southern Germany exclusively to our project. Even with these computers at our disposal, mapping a single run from the sequencing machines took days. To map all our data would take months. The crucial task that Udo took on was how to more efficiently distribute the work to these computers. Since Udo was deeply convinced that no one could do it as well as he could, he wanted to do all of the work himself. I had to cultivate patience while awaiting his progress.
When Ed looked at the mapping of the first batches of new DNA sequences that came back to us from Branford, he discovered a worrying pattern that set off alarm bells in the group and made my heart sink: the shorter fragments showed more differences from the human genome than the longer fragments! It was reminiscent of one of the patterns that Graham Coop, Eddy Rubin, and Jeff Wall had seen in our Nature data. They had interpreted that pattern as contamination, assuming that the longer fragments showed fewer differences from present-day humans because many of them were in fact recent human DNA contaminating our libraries. We had hoped that preparing the libraries in our clean room and using our special TGAC tags would spare us the plague of contamination. Ed began frantic work to see if we did, after all, have modern human contamination in our sequencing libraries.
Happily, he found that we didn’t. Ed quickly saw that if he made the cut-off criteria for a match more stringent, the short and long fragments became equally different from the reference genome. He could show that whenever we (and Wall and the others) had used the cut-off values routinely applied by genome scientists, short bacterial DNA fragments were mistakenly matched to the human reference genome, and this made the short fragments look more different from the reference genome than the long ones. When he increased the quality cut-off, the problem went away, and I felt secretly justified in my distrust of contamination estimates based on comparisons between short and long fragments.
But soon after this, our alarm bells went off again. This time the issue was even more convoluted and took me quite a while to understand—so please bear with me. One consequence of normal human genetic variation is that a comparison of any two versions of the same human chromosome reveals roughly one sequence difference in every thousand nucleotides, those differences being the result of mutations in previous generations. So whenever two different nucleotides (or alleles, as geneticists say) occur at a certain position in a comparison of two chromosomes, we can ask which of the two is the older one (or the “ancestral allele”) and which is the more recent one (or the “derived allele”). Fortunately, it is possible to figure this out easily by checking which nucleotide appears in the genomes of chimpanzees and other apes. That allele is the one that is likely to have been present in the common ancestor we shared with the apes, and it is therefore the ancestral one.
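In code, this polarization step is only a few lines. The sketch below, with invented alleles, checks which of the two human variants the chimpanzee carries and labels that one ancestral.

```python
# Toy sketch: decide which of two alleles at a human polymorphic site is
# ancestral by asking which one the chimpanzee genome carries.

def polarize(alleles, chimp_allele):
    """alleles is a pair like ('A', 'G'); returns (ancestral, derived),
    or None if the chimp allele matches neither variant."""
    a, b = alleles
    if chimp_allele == a:
        return a, b
    if chimp_allele == b:
        return b, a
    return None  # e.g. a later mutation on the chimpanzee lineage

# If present-day humans carry either A or G at a site and the chimpanzee
# shows A, then A is inferred to be ancestral and G derived.
print(polarize(("A", "G"), "A"))  # ('A', 'G')
print(polarize(("A", "G"), "C"))  # None: the site cannot be polarized this way
```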