Human Diversity
A more sophisticated way of identifying candidate genes, called linkage analysis, enabled progress during the 1980s in understanding genetically based diseases such as Huntington’s disease, Alzheimer’s disease, and some forms of cancer.22 But by the mid-1990s, researchers working on diseases such as schizophrenia, bipolar disorder, and diabetes were stymied. Linkage analysis was useful for diseases that were caused by alleles with large effects, but it produced inconclusive and often contradictory results for complex diseases that were caused by many alleles with small effects.
Evolutionary geneticists also used the candidate gene approach for exploring the known adaptations that had occurred after the dispersal from Africa, such as lightened skin. But such obvious examples of recent evolution were rare. As geneticist Joshua Akey explained it, candidate gene studies have two major limitations.23 One is that they require a priori hypotheses about which genes have been subject to selection, but few traits lend themselves to a priori hypotheses. This is true of many physiological traits, but especially of cognitive ones. The relevant alleles surely involve brain function and might have something to do with hormones, but that’s not nearly specific enough to identify a candidate gene. Ingenious attempts to do so failed. Perhaps the most frustrating example was the search for the genetic basis of cognitive ability using the candidate gene approach. Years of intensive effort in the late 1990s and early 2000s failed to identify even a single replicable genetic locus affecting it.24
The second limitation of candidate gene studies was the difficulty of determining whether natural selection was really involved even when a candidate gene seemed to be panning out. Perhaps the alleles were under selection pressure. But perhaps genetic drift was at work, or some other mechanism that had nothing to do with natural selection. The available theoretical models weren’t good enough to yield precise predictions, and the statistical tests were often inadequate for robust conclusions.
Genome-wide association study (GWAS). GWAS, pronounced “g-wasp” without the “p,” is an acronym now as ubiquitous in the technical literature as SNP. The technique itself is usually abbreviated as GWA. The idea behind it was first expressed in 1996 in an article in Science by Stanford geneticist Neil Risch and Yale epidemiologist Kathleen Merikangas. “Has the genetic study of complex disorders reached its limits?” they asked. Their answer put geneticists on course to a new and productive technique:
We argue below that the method that has been used successfully (linkage analysis) to find major genes has limited power to detect genes of modest effect, but that a different approach (association studies) that utilizes candidate genes has far greater power, even if one needs to test every gene in the genome. Thus, the future of the genetics of complex diseases is likely to require large-scale testing by association analysis.25
Implementing “association analysis” is complicated, but the idea is simple, involving concepts you learn in the first month of an introductory statistics course or can learn from Appendix 1, but applied on a gigantic scale.26
Maybe a simplified illustration will give you a sense of how the process works. I’ll use my running example of height. Suppose you are given a SNP that has alleles A and T. You also have a large sample of people who have been genotyped and their height measured in inches. You find that men with TT are on average 70.0 inches tall, men with TA are on average 70.1 inches tall, and men with AA are 70.2 inches. You apply the appropriate statistical test and determine that the relationship is statistically highly significant. It looks as if allele A is associated with greater height.
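For readers who want to see the arithmetic, here is a minimal sketch of that single-SNP test in Python. All the numbers are invented for illustration: a hypothetical sample of 100,000 men, assumed allele frequencies, and an assumed 0.1-inch effect per copy of allele A.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # hypothetical sample of genotyped men

# Copies of allele A at this SNP: 0 (genotype TT), 1 (TA), or 2 (AA)
copies_of_a = rng.choice([0, 1, 2], size=n, p=[0.25, 0.50, 0.25])

# Heights in inches: baseline 70.0, plus an assumed 0.1 inch per copy of A,
# plus ordinary individual variation
height = 70.0 + 0.1 * copies_of_a + rng.normal(0.0, 3.0, size=n)

# Group means, as in the example in the text (TT ~70.0, TA ~70.1, AA ~70.2)
for g, label in [(0, "TT"), (1, "TA"), (2, "AA")]:
    print(label, round(height[copies_of_a == g].mean(), 2))

# The least-squares slope of height on allele count recovers the per-allele
# effect; a GWA applies a test of this general kind at every SNP
slope = np.polyfit(copies_of_a, height, 1)[0]
print("estimated effect per copy of A:", round(slope, 2))  # close to 0.1
```

With a sample this large, the tiny 0.1-inch effect is recovered reliably; with a sample of a few hundred, it would be lost in the noise, which is why GWA demands the giant samples discussed below.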
That’s essentially what a GWA does for several million SNPs. There are complications, of course. If you’re running a million tests for statistical significance, about 50,000 SNPs would show up as having a “statistically significant” association with height at the familiar p < .05 level even if none of them had any substantive association whatsoever. Therefore you must impose a far stiffer criterion for statistical significance. Risch and Merikangas anticipated that problem, calculating that to achieve a probability greater than .95 of no false positives among a million SNPs, the level of statistical significance required for any one SNP should be less than 5 × 10⁻⁸, or .00000005.27 This has become the most commonly used standard, though more sophisticated rationales for it have subsequently been developed.28
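The arithmetic behind that threshold can be sketched directly. The simulation below, a deliberately simplified stand-in for what real GWA pipelines do far more carefully, draws a million p-values for SNPs with no true association and shows both the flood of false positives at p < .05 and the per-SNP level needed to make even one false positive unlikely.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 1_000_000

# Per-SNP level alpha such that P(zero false positives among a million null
# tests) exceeds .95: solve (1 - alpha) ** n_snps >= 0.95, which gives
# alpha <= 1 - 0.95 ** (1 / n_snps), approximately 0.05 / n_snps
alpha = 1 - 0.95 ** (1 / n_snps)
print(f"required per-SNP significance level: {alpha:.1e}")  # about 5e-08

# Under the null hypothesis, p-values are uniform on [0, 1]
null_p = rng.uniform(size=n_snps)
print("null SNPs passing p < .05:  ", np.count_nonzero(null_p < 0.05))  # ~50,000
print("null SNPs passing p < 5e-8:", np.count_nonzero(null_p < 5e-8))  # almost always 0
```

The derived level, about 5.1 × 10⁻⁸, is the origin of the conventional 5 × 10⁻⁸ genome-wide significance threshold.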
SNPs that are candidates for causality. If you’re interested in establishing the causality of specific SNPs with a specific trait, you have the raw material for beginning the process. But even if your SNPs all meet the standard criterion for statistical significance in a GWA (p < 5.0 × 10⁻⁸), what you have at this point is no more than the raw material. You need to go through a number of steps to prune and otherwise clean up—“curate”—your set of SNPs. I will discuss those processes subsequently.
What might you eventually be able to do with this information? It depends on the trait. If you are a medical researcher studying the SNPs associated with a specific disease, your set of SNPs will eventually be part of a mosaic that helps understand the genetic origins of that disease and thereby perhaps enables new curative and prophylactic strategies.
If you’re a population geneticist, you can use your set of SNPs to ask questions about whether natural selection for the trait in question has been occurring, and if so, in what populations and when. To do that, you will need a second database that contains samples from the ancestral populations you want to study.
A host of procedural and statistical issues attend the production of a valid GWA, but the main data requirement is a large sample. The statistical methods used in the social sciences are routinely applied to samples in the hundreds, and a sample of 10,000 is considered large. In GWA, a sample of 100,000 is no more than okay.
A Shift in Focus from Mutation and Hard Sweeps to Standing Variation and Soft Sweeps
GWA provided a way to get around one of Akey’s limitations on the candidate gene approach, but it did not help with the other: how to tell whether changes over time were the product of natural selection. Progress on that front required another change in focus.
Until the genome was sequenced, most of the work of molecular biologists focused on evolution through mutation—it was the “ruling paradigm,” in the words of evolutionary biologists Joachim Hermisson and Pleuni Pennings.29 The ruling theoretical model was “the neutral theory of molecular evolution,” introduced in the late 1960s and given full expression by population geneticist Motoo Kimura in 1983.30 The neutral theory acknowledges that phenotypic evolution is driven by Darwinian natural selection. But the theory posits that the vast majority of differences at the molecular level are neutral, meaning that they do not influence the fitness of the organism. Insofar as selection does occur, it is “purifying” selection, which acts to eliminate harmful mutations. The genetic variation at the molecular level that we observe within and between species is explained, with rare exceptions, by genetic drift.
Given a focus on evolution through mutations of large effect and a theoretical explanation that assigns almost all molecular variation to genetic drift, scientists may have found it natural to believe that humans hadn’t had enough time to evolve much since the dispersal from Africa. The number of generations since humans left Africa is probably around 2,000 and almost certainly no more than 5,000. A favorable mutation with an unusually large s value can go to fixation in a few hundred generations, but under commonly observed values of s, a mutation is likely to take thousands of generations to reach fixation.31 From this perspective, the time since humans left Africa has indeed been “the blink of an eye” in evolutionary terms, just as Gould proclaimed.
This conclusion has been indirectly reinforced by analyses indicating that, as one such study put it, “strong, sustained selection that drives alleles from low frequency to near fixation has been relatively rare during the past ∼70 KY [thousands of years] of human evolution.”32
Long before the genome was sequenced, however, going all the way back to Fisher’s early work, quantitative geneticists were aware that mutation wasn’t the only way in which evolution worked. Completely new variants weren’t needed, just changes in the variation that already existed—“standing variation.”
Over the millions of years that led to anatomically modern humans, a great deal of genetic variation has arisen that confers no particular advantage or disadvantage. Perhaps the SNPs have phenotypic effects, but these effects are too small to have an appreciable impact on reproductive fitness. Perhaps a mutation spread to some percentage of the population because it was once advantageous but then lost its advantage as the organism adapted to the environment in other ways. Perhaps it was simply a matter of genetic drift.
Think of standing variation as kindling. For a long time, it has no effect on anything. The allele frequencies drift aimlessly from generation to generation. Then something changes in the environment—the equivalent of a match. Depending on what the change is, an allele that had been more or less neutral can become advantageous and start to spread. To use a completely made-up example, let’s say that the SNPs that produce variation in the trait called “thriftiness” existed as standing variation in hunter-gatherer populations. There was no reason for purifying selection to eliminate the variation completely, but neither was there any evolutionary reason for the thriftiness alleles to increase. In an environment where possessions are rudimentary and foodstuffs rot within a few days, thriftiness has no appreciable effect on fitness. When a hunter-gatherer group switched to agriculture, the situation changed radically. Those who were thrifty had many advantages in accumulating surpluses for surviving hard times and for bartering in good times. Aside from direct fitness advantages, the thrifty man or woman who became prosperous obtained advantages in sexual selection. Under these conditions the frequencies of the thrift-promoting alleles would start to exhibit a tendency to increase rather than to fluctuate haphazardly.
The example generalizes to many kinds of standing variation. The frequencies of the newly favored allele may not go to fixation after an environmental change—the spread of any single favorable allele for a polygenic trait slows as the organism’s phenotype becomes satisfactorily adapted to the changed environment—but its frequency within the population increases. These alterations in standing variation are known as soft sweeps, in contrast to a hard sweep, in which an adaptive mutation spreads and eventually goes to fixation.
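The kindling metaphor can be illustrated with a toy Wright-Fisher simulation. Every number here is made up for illustration: an allele sits at a modest frequency, drifts aimlessly while neutral, and then climbs once the environment supplies the match in the form of a small selection coefficient s.

```python
import numpy as np

rng = np.random.default_rng(1)

def wright_fisher(freq, pop_size, generations, s):
    """Track one allele's frequency; s is the selection coefficient (0 = neutral)."""
    for _ in range(generations):
        # Selection nudges the expected frequency, then drift resamples
        # 2 * pop_size gene copies for the next generation
        p = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
        freq = rng.binomial(2 * pop_size, p) / (2 * pop_size)
    return freq

p0 = 0.20  # standing variation: the allele is already present, going nowhere
p_neutral = wright_fisher(p0, 10_000, 500, s=0.00)   # drifting aimlessly
p_selected = wright_fisher(p_neutral, 10_000, 500, s=0.02)  # environment changes
print(f"after neutral drift: {p_neutral:.2f}; "
      f"after 500 generations of selection: {p_selected:.2f}")
```

Under neutrality the frequency wanders near its starting point; once even a 2 percent fitness edge appears, the allele sweeps toward high frequency within a few hundred generations, which is the logic behind soft sweeps on standing variation.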
The role of standing variation in evolution depends in part on the genetic complexity of the trait. It is least applicable to a trait such as resistance to a deadly disease caused by a specific pathogen. The simpler the genetics of the trait, the more likely that a single mutation can have a major effect, with such a large value of s that the mutation has a good chance of going to fixation. In contrast, adaptation through standing variation is most applicable to traits that are affected by hundreds or thousands of alleles contributing tiny effect sizes. A change in the environment may have only modest effects on the allele frequency at any one locus, but it has those modest effects on hundreds of the relevant sites and thereby produces a cumulatively large effect.
Such changes in standing variation can reliably produce dramatic effects in the phenotype through breeding. Humans have known this for millennia, even though they didn’t know anything about alleles. Darwin begins On the Origin of Species with a chapter titled “Variation Under Domestication,” knowing that his readers among England’s rural gentry who were proud of their livestock would understand what he’s talking about (P. G. Wodehouse fans will think of Lord Emsworth).
It’s not just the physiology of animals that can be changed rapidly through breeding. So can fundamental personality traits. A modern experimental example is the Siberian silver fox. In 1959, Soviet biologist Dmitry Belyaev decided to reproduce the evolution of wolves into domesticated dogs.33 Instead of using actual wolves, he obtained Siberian silver foxes from Soviet fur farms and began to breed them for tameness. The foxes were not trained in any way, nor were they selected for anything except specific indicators of tameness as puppies. In the fourth generation, Belyaev produced the first fox puppies that would wag their tails when a human approached. In the sixth generation, he had puppies who were eager to establish human contact, whimpering to attract attention, licking their handlers—in short, acting like dogs. By the tenth generation, 18 percent of puppies exhibited these characteristics from birth. By the twentieth generation, that proportion had grown to 35 percent.
Even though the rapid effects of breeding were well known, it had generally been assumed until the 1950s that natural selection in the wild must move more slowly. Then British geneticist Bernard Kettlewell realized that within his own lifetime the wings of many types of moths had changed from light to dark in industrial areas of England. He began experiments in which he released light- and dark-winged peppered moths in unpolluted and polluted forests (the bark on trees in polluted forests having been darkened by industrial smoke and soot). He found that the daily mortality rate of the light-winged moths was twice that of the dark-winged variety in the polluted forests and subsequently elaborated on that finding to prove that natural selection was the cause.34 (Let us pause for a moment: Try to imagine the patience and doggedness it takes to determine daily mortality rates of moths over several acres of land.) Since Kettlewell’s work, rapid response to environmental change has been demonstrated in many species—for example, Italian wall lizards, cane toads, house sparrows, and, most famously, in the beaks of finches living on the Galápagos Islands.35
All this had been known before the genome was sequenced. But that knowledge didn’t have many practical research implications because the technology for analyzing selection pressure on standing variation wasn’t up to the job. Using candidate genes was the only game in town.
Population genomics. The sequencing of the genome opened a possibility that had been closed until then: determining what regions of the genome are under selection (i.e., responding to evolutionary pressures) and how long that selection has been going on.36 The field is still maturing, with refinements on existing methods and new techniques being published every few months, but it has already made dramatic progress in determining the age of evolutionary adaptations and identifying which regions of the genome are currently under selection.
How is it possible to know that a part of the genome has been under selection when you are working with samples of genomes drawn from people who are alive now? Even more perplexing, how could it be possible to know how long the region has been under selection?
I could answer by telling you that such selective sweeps create a valley of genetic diversity around the site under selection, that they leave a deficit of extreme allele frequencies (low or high) at linked sites and an increase in linkage disequilibrium in flanking regions—but that doesn’t tell you much unless you’re a geneticist.37 Here is a highly simplified way of thinking about one of the major sources of information about the location of regions under selection:
Keep three things in mind: (1) Base pairs at nearby SNPs in the genome tend to be correlated. (2) When nature recombines the parents’ genes, choosing some from the mother and some from the father, it does not conduct its coin flips site by site; instead, the coin flips shift one parent’s genes to the offspring in blocks. (3) The placing of the cuts defining the blocks varies from generation to generation, although there are “hot spots” where cuts are more likely to happen than elsewhere. As time goes on, the size of an original block is whittled down.
The record of evolution left by this process has been likened to palimpsests, the medieval parchment manuscripts that were reused but left traces of the older writing. In genetics, the parchment is the chromosome and the DNA sequence the text.38 A less elegant way to think of it is to imagine that the block of SNPs is a playing card—the nine of clubs, let’s say. Every generation, a thirty-second of an inch is sliced off—sometimes from the top, sometimes from the bottom, sometimes from a side.39 You will still be able to tell it is the nine of clubs through many, many slices. Eventually, you won’t. Geneticists are in the position of observing the card already diminished but being able to estimate how many slices were required to reach its whittled-down configuration. If the frequency of the mutant allele at the center of this block is higher than expected given the block’s age as inferred from its whittled-down width, we have evidence that natural selection is responsible for the elevated frequency. The process cannot detect the origin of adaptations older than about 30,000 years—which by definition means that it identifies adaptations that occurred long after the dispersal from Africa.
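The card-slicing intuition can be turned into a toy calculation. Under a deliberately simplified model, with every number hypothetical, the shared segment surviving on each side of a variant that arose g generations ago is roughly exponentially distributed with mean 1/g map units (the nearest crossover in each of g meioses whittles it down), so the average block width itself dates the variant.

```python
import numpy as np

rng = np.random.default_rng(7)

true_age = 400    # generations since the variant arose (hypothetical)
n_carriers = 2_000  # present-day genomes carrying the variant (hypothetical)

# After true_age meioses, the surviving shared segment on each side of the
# variant is the minimum of that many crossover distances, which under this
# simplified model is exponential with mean 1 / true_age map units
left = rng.exponential(1 / true_age, size=n_carriers)
right = rng.exponential(1 / true_age, size=n_carriers)
block_width = left + right  # width of the still-recognizable "card"

# Mean width is 2 / age, so the observed widths date the variant
estimated_age = 2 / block_width.mean()
print(round(estimated_age))  # close to 400
```

An allele whose frequency is far higher than a few hundred generations of drift could plausibly produce, given a block this young, is the signature of selection the text describes.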
I have given you a colloquial description of only one of the many methods that are used to detect selection pressure. As early as 2006, a review of progress described five “signatures” in the DNA sequence that indicate selection.40 Those methods have subsequently been refined and augmented.41 They are now reaching the point where they can identify individual SNPs under selection as well as regions under selection.42
Paleogenomics. This progress in the analysis of contemporary genomes has been augmented by progress in the study of the DNA of archaic humans. Unlikely as it seems, DNA can survive in the bones of hominins who died tens of thousands of years ago. Recovering it is a daunting task. Scientists must piece together partial genomes and infer missing sections. Ancient genetic material has to be discriminated from modern contamination. But the development of next-generation sequencing technologies enabled solutions to those problems.43 In 2009, a team headed by Svante Pääbo at the Department of Genetics at the Max Planck Institute for Evolutionary Anthropology in Leipzig succeeded in completing the first draft of an entire Neanderthal genome.44 Since then, the genomes of many other Neanderthals and of early anatomically modern humans have been completed, and progress has been made in reconstructing the genomes of other hominins.45 The availability of archaic genomes lets geneticists do more than infer whether alleles in contemporary genomes have been under recent selection. They can directly compare their inferences with the evidence on those same alleles from DNA tens of thousands of years old.