Human Diversity
The cluster analysis used by Cavalli-Sforza and Bowcock confirmed (as have all subsequent analyses) that the great bulk of variation in humans is within populations, not between them, just as Lewontin said. But even with the limited measures of genetic variation available in the early 1990s, their clusters corresponded to the geographic origin of the subjects at the continental level.39 In 1998, a similar analysis conducted by geneticists at Yale produced similar results.40
The methods used for this pioneering work were primitive by later standards. In 2000, Stanford geneticists led by Jonathan Pritchard developed a more sophisticated method that they implemented through a software program called Structure.41 Among other things, the new method allowed for mixed ancestry—individuals could be assigned to more than one cluster, with the percentages in the various clusters summing to 100.
In 2002, a team of scholars with the Human Genome Diversity Project (first author was Noah Rosenberg) applied the Structure software to a sample of 1,056 individuals from 52 populations, using 377 autosomal microsatellites.42 The individuals were deliberately chosen from so many populations to ensure that the software’s clustering algorithms would not be constrained by artificially narrow groups (e.g., a European sample drawn exclusively from Dutch individuals). Their results replicated and expanded upon the 1994 and 1998 findings. The cleanest set of clusters was produced when K was set to 5. The five clusters once again clustered according to continents: Africa, Europe, East Asia, the Americas, and Oceania.
The First Cluster Analysis Using Hundreds of Thousands of SNPs
The sequencing of the genome changed everything. In 2008, a team of eleven scholars affiliated with the Human Genome Diversity Project (first author was Jun Z. Li) used the same sample as the Rosenberg study, reduced to 51 populations and 938 persons for technical reasons. The big difference was that the Li study was analyzing 642,690 variants instead of 377.43 Here’s what they found as K went from two to seven:
At K = 2, two sets of the 51 populations in the Li study’s database had virtually no overlap: populations in sub-Saharan Africa versus populations in East Asia plus a few in the Americas. All the other populations were mixtures of the two clusters.
At K = 3, the people who showed virtually no admixture across clusters consisted of individuals from sub-Saharan Africa, today’s Europe and Mideast, and the East Asian–Americas group. Those from Central and South Asia had varying mixtures of the European/Mideast and East Asian/Amerind clusters.
At K = 4, the Amerindians split off to form a separate cluster.
At K = 5, the Oceania populations split off.
At K = 6, the Central and South Asians split off.
At K = 7, the configuration that the authors assessed as the most informative, those in the Mideast split off from the Europeans.
The figure below graphically shows the results for the analysis at K = 7.
Source: Li, Absher, Tang et al. (2008). Adapted from Fig. 1.
It is a fascinating graphic. It combines 938 vertical lines, one for each person in the sample. Each line is partitioned into segments with lengths that correspond to that person’s “ancestry coefficients.” When only one pattern is represented, the individual belongs entirely to that cluster—as in large portions of the lines for Africa, Europe, East Asia, and the Americas. In contrast, look at the Mideast segment of the figure. Interpreted colloquially (remember that the labels for the clusters were added only after each individual’s ancestry had been estimated), most of the lines have a mix of Mideastern, European, and Central or South Asian ancestry.44
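To make the idea of ancestry coefficients concrete, here is a minimal sketch in Python. The three individuals and their coefficients are invented for illustration and are not taken from the Li study. The only property borrowed from the text is that each person's coefficients sum to 100 percent.

```python
# Cluster labels follow the K = 7 description in the text.
CLUSTERS = ["Africa", "Europe", "Mideast", "Central/South Asia",
            "East Asia", "Americas", "Oceania"]

# Hypothetical ancestry coefficients (fractions, one per cluster).
individuals = {
    "person_1": [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    "person_2": [0.00, 0.25, 0.60, 0.15, 0.00, 0.00, 0.00],
    "person_3": [0.00, 0.00, 0.00, 0.00, 0.95, 0.05, 0.00],
}

def describe(coeffs, tol=1e-9):
    """Return (cluster, percent) pairs for every nonzero coefficient."""
    if abs(sum(coeffs) - 1.0) > tol:
        raise ValueError("ancestry coefficients must sum to 1")
    return [(name, round(100 * c)) for name, c in zip(CLUSTERS, coeffs) if c > 0]

for person, coeffs in individuals.items():
    print(person, describe(coeffs))
```

In the figure's terms, person_1 would be a vertical line in a single pattern, while person_2 would be a line split into three segments, like most of the lines in the Mideast portion.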
What struck me most about the Li analysis is what happened when the number of clusters went from five to seven. As in the 2002 study, the first five clusters corresponded to the five continental ancestral populations. But the two new subsidiary clusters that emerged when K = 7 corresponded to commonsense observation. When K = 5, the Li study produced a cluster that corresponded to the classic definition of Caucasian—an odd agglomeration of peoples from Europe, North Africa, the Mideast, South Asia, and parts of Central Asia. There had been a reason why physical anthropologists had once combined these disparate populations—all of them have morphological features in common—but it had never made sense to people who weren’t physical anthropologists.45 With K = 7, one of the new clusters split off the peoples of the Mideast and North Africa and the other split off the peoples of Central and South Asia—precisely the groups that had always been visibly distinctive from Europeans and from each other in the Caucasian agglomeration.
The figure below shows another perspective on the separation of members of the 51 subpopulations. The note gives details of the analysis.[46] Highly simplified, you’re looking at the genetic distance of each of the 938 people in the study from the other 937.
Source: Li, Absher, Tang et al. (2008): Fig. S3B. The authors computed a 938-by-938 Identity-by-State (IBS) matrix using the software package PLINK, which was then factor analyzed for all samples and for seven regions separately.
The graph is necessarily a two-dimensional representation of a multidimensional dataset, but it does provide a useful sense of the varying genetic distances separating the seven clusters. Africa is distinct from all the rest. The traditional Caucasian clusters are grouped but nonetheless distinct. A mixture of Central/South Asian and East Asian populations lies between the Caucasian clusters and the two clusters of East Asians and Amerinds. The Oceania subjects are separate, two small blobs representing the Melanesians and Papuans. The signature feature of the graph is not the overlap, which is confined to just two of the seven groups, but the clarity of most of the separations among the rest.
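For readers who want a feel for what an Identity-by-State matrix contains, the following is a minimal sketch. It is not PLINK's implementation. The four individuals and their genotypes are invented, and each genotype is assumed to be coded as 0, 1, or 2 copies of one allele at each SNP.

```python
def ibs(a, b):
    """Mean identity-by-state similarity between two genotype vectors.

    With genotypes coded 0/1/2, two people share 2 - |a - b| of the
    2 alleles at a site, so the per-site similarity is 1 - |a - b| / 2.
    """
    if len(a) != len(b):
        raise ValueError("genotype vectors must have the same length")
    return sum(1 - abs(x - y) / 2 for x, y in zip(a, b)) / len(a)

# Hypothetical genotypes at six SNPs for two pairs of individuals.
genotypes = {
    "A1": [0, 0, 1, 2, 2, 0],
    "A2": [0, 1, 1, 2, 2, 0],
    "B1": [2, 2, 1, 0, 0, 2],
    "B2": [2, 2, 0, 0, 1, 2],
}

names = list(genotypes)
# Square matrix of pairwise similarities (the study's version is 938 by 938).
matrix = [[ibs(genotypes[p], genotypes[q]) for q in names] for p in names]

for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

A factor analysis or principal components analysis of such a matrix then looks for the few dimensions that best summarize who is similar to whom. That second step is omitted here.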
Since the Li Study
In 2010, a consortium of evolutionary geneticists extended the Li study, assembling data from 296 individuals in 13 populations that had not been covered by previous studies. The first author was Jinchuan Xing. The methods differed from those used in the Li study, and the addition of 13 new populations enabled a new level of detail. For example, adding the new populations led to an estimated global differentiation of 11.3 percent, compared to 15.9 percent with the HapMap Project’s more restricted samples (and the 15 percent cited by Lewontin).47 The Xing study also demonstrated, consistent with population genetics theory, that the more granular the analysis, the less discontinuity is seen between adjacent populations.
In both the Li and Xing studies, the first component in the principal component analysis differentiated Africans from all other populations, and the second component differentiated Eurasian populations. In terms of the genetic distances among regions and subpopulations, the Xing study amounts to a confirmation of the Li study. In a replication of the principal components plot shown above for the Li study, Africa was most distant from all the others; Europe, West Asia, and Central Asia were adjacent but distinct; and the Americas, East Asia, and Polynesia were adjacent but distinct. To indicate how little the main conclusions differed, I have included the “Conclusion” section of the Xing study in its entirety in the note.[48]
Since 2010, studies of genetic differentiation among populations have focused on fine structure. The use of large samples of SNPs enables investigators to more or less replicate not just the major populations traditionally defined as races, but subpopulations within the major populations. In their review of the state of the art as of 2016, John Novembre and Benjamin Peter showed what happens when several European subpopulations are plotted with different numbers of sites.49 When only 100 or even 1,000 sites are used, the subpopulations are indistinguishable. At 10,000 sites, some separation is visible. At 100,000 sites, Italians, Spanish, Germans, and Romanians are all reasonably distinct, with the British, Dutch, Swedish, and Irish fuzzily separated.
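Why more sites sharpen the picture can be sketched with a toy calculation. This is not the method Novembre and Peter used. It assumes two populations whose allele frequencies differ by the same small amount (`delta`, a made-up value) at every site, that sites are independent, and that each person is scored by counting the alleles that are more common in the first population.

```python
import math

def separation_z(num_sites, delta=0.01):
    """Gap between two populations' mean scores, in within-population SDs.

    Toy model: allele frequency is 0.5 + delta/2 in one population and
    0.5 - delta/2 in the other at every site; individuals are diploid.
    """
    p_a = 0.5 + delta / 2
    p_b = 0.5 - delta / 2
    # The expected score gap grows in proportion to the number of sites.
    mean_gap = num_sites * 2 * (p_a - p_b)
    # Binomial variance of one person's score (identical for both populations).
    variance = num_sites * 2 * p_a * (1 - p_a)
    return mean_gap / math.sqrt(variance)

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} sites: about {separation_z(n):.2f} standard deviations apart")
```

Under these assumptions the separation grows with the square root of the number of sites, which is why populations that blur together at a few hundred sites can come apart at a hundred thousand.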
The methods for analyzing population structure have multiplied and become more sophisticated. A review published in 2015 listed 25 different software packages for analyzing population structure and demographic history.50 The methods for ensuring that the genetic markers are drawn from noncoding SNPs have improved.51 The technical literature has grown accordingly, analyzing population structures at exceedingly fine levels.52 Geneticist Razib Khan pulled together a sampling of the most important varieties of population structure as of 2016, with illustrative graphics for each.53
By now, the ability to classify people not just according to continental ancestral population but also to specific subpopulations has become so routine that you may have already availed yourselves of it in the form of a commercial product that gave you an analysis of your ancestral heritage in return for a cheek swab and a modest payment. Originally, these companies used clusters of genetic markers known as “Ancestry Informative Markers,” or AIMs. With the technology now available, most of them no longer bother with AIMs, instead just using hundreds of thousands of markers.
The results can be extremely precise. One of the earliest uses of AIMs, applied to a sample of 3,636 people who self-identified as white, black, East Asian, or Latino, classified 99.86 percent of them (all but five) into the same population with which the subjects identified themselves.54 But such profiles can also be misleading. If you are from Pakistan, for example, and your profile indicates that you are 4 percent Melanesian, don’t expect to find anything if you ransack your family tree for a great-great-grandfather from Samoa. The explanation is probably that the reference populations didn’t include enough South Asian variation. The algorithm looked for the nearest population, which was Melanesian.
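The nearest-population problem can be illustrated with a deliberately crude sketch. Commercial services use far more elaborate models than this. The reference names and allele frequencies below are invented, and the sample is assumed to come from a population that is missing from the reference panel.

```python
def nearest_reference(sample, references):
    """Return the name of the reference profile closest to the sample."""
    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(references, key=lambda name: squared_distance(sample, references[name]))

# Hypothetical allele-frequency profiles at four markers.
references = {
    "ref_A": [0.10, 0.20, 0.90, 0.80],
    "ref_B": [0.80, 0.70, 0.20, 0.10],
    "ref_C": [0.50, 0.90, 0.50, 0.10],
}

# Profile from a population that has no entry in the reference panel.
unrepresented_sample = [0.45, 0.80, 0.60, 0.30]

print(nearest_reference(unrepresented_sample, references))
```

The function always returns some label, because it can only choose among the references it was given. A match therefore says which reference is least unlike the sample, not that the sample descends from it.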
How Have Advocates of “Race Is a Social Construct” Responded to the Cluster Analyses?
Advocates of “race is a social construct” have raised a host of methodological and philosophical issues with the cluster analyses. None of the critical articles has published a cluster analysis that does not show the kind of results I’ve shown.[55]
Many of the critical responses emphasize that genetic differentiation across populations is small compared to the variation within populations, that admixtures do exist in some populations, and that the finer the level of population structure, the smaller the distance between adjacent populations becomes.[56] All of these points are true, but no one conducting the cluster analyses has ever disputed them.
A more direct conflict involves the exploratory nature of cluster analyses, especially the different results that are produced by different values of K and by different numbers of iterations used to produce the results.57 But none of the critiques I have seen deal with an observation first made in the Rosenberg study: “Each increase in K split one of the clusters obtained with the previous value.”58 That is, different values of K do not produce a radically different pattern of results. Instead, they augment the results, giving a greater degree of definition to a previously identified pattern.
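The nesting described in that quotation can be stated as a simple check. The sketch below treats each region as belonging wholly to one cluster, which ignores admixture and is a simplification of my own. The K = 5, 6, and 7 groupings follow the description of the Li study given earlier, and the `reshuffled` grouping is a made-up counterexample.

```python
def is_refinement(finer, coarser):
    """True if every cluster in `finer` lies wholly inside one cluster of `coarser`."""
    return all(any(f <= c for c in coarser) for f in finer)

k5 = [{"Africa"}, {"Europe", "Mideast", "Central/South Asia"},
      {"East Asia"}, {"Americas"}, {"Oceania"}]
k6 = [{"Africa"}, {"Europe", "Mideast"}, {"Central/South Asia"},
      {"East Asia"}, {"Americas"}, {"Oceania"}]
k7 = [{"Africa"}, {"Europe"}, {"Mideast"}, {"Central/South Asia"},
      {"East Asia"}, {"Americas"}, {"Oceania"}]

# A hypothetical result that would contradict the nesting pattern.
reshuffled = [{"Africa", "Europe"}, {"Mideast", "Central/South Asia"},
              {"East Asia"}, {"Americas"}, {"Oceania"}, {"Other"}]

print(is_refinement(k6, k5), is_refinement(k7, k6), is_refinement(reshuffled, k5))
```

If raising K produced groupings like `reshuffled`, the choice of K would change the substance of the results. Nested splits change only their resolution.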
Much of the rest of the criticism of the cluster analyses comes down to semantics. The Rosenberg study in 2002 prompted David Serre and Svante Pääbo of the Max Planck Institute for Evolutionary Anthropology to explore the possibility that the appearance of clusters was illusory, an artifact of the sampling procedure. If a larger number of markers and populations had been used, they argued, it would be seen that variation in human populations occurred gradually—in clines, to use the technical term. They presented evidence that simple geographic distance better explained genetic distance than discrete geographic regions.59 The authors of the Rosenberg study responded by reanalyzing their clusters after raising the number of genetic markers from 377 to 993. They concluded that “examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side.”60 That conclusion was reinforced by the subsequent Li and Xing studies, with their larger numbers of genomes and hundreds of thousands of SNPs available for the analysis. The principal component analyses from those studies give visual evidence for the discontinuous genetic distances separating continent-wide populations. But the discontinuous jumps were most evident when significant geographic barriers separated populations. The smaller the geographic scale, the more often variation in allele frequencies occurred in clines.61 Subsequently, the defenders of race as a social construct have expended considerable effort pushing back against any existence of genuine, as opposed to statistical, clusters of populations.62
Seen in ideological terms, I can understand why the orthodoxy wants genetic differences to be clinal. Geographic discontinuities in genetic variation look a lot more like races, classically construed, than clinal variation does. But substantively, what difference does it make?
The genetic distance between Europeans and East Asians shown in the principal components plots looks “big.” Now suppose that we zero in on the genetic profiles of Bretons living in the far northwest of France and Hakka Chinese living in the far southeast of China. Suppose that the genetic differentiation between those two populations occurs entirely in clines—that if you sampled each and every population on the route between northwest France and southeast China, the gradations would perfectly correlate with geographic distance and there would be no discontinuities associated with the steppes of Russia or Central Asian deserts. Even in that case, the magnitude of the genetic distinctiveness of the French and Chinese would be unaffected. If the difference is great, it would still be just as great, even though the pairwise differences among the dozens of populations in between were quite small.
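The arithmetic behind that thought experiment is easy to sketch. The numbers are invented: a single allele whose frequency rises in equal steps across 41 hypothetical populations strung along the route, from 10 percent at one end to 90 percent at the other.

```python
def clinal_frequencies(start, end, num_populations):
    """Allele frequencies that change in equal steps from `start` to `end`."""
    step = (end - start) / (num_populations - 1)
    return [start + i * step for i in range(num_populations)]

freqs = clinal_frequencies(0.10, 0.90, 41)

# Difference between each population and its immediate neighbor.
adjacent_gaps = [b - a for a, b in zip(freqs, freqs[1:])]
# Difference between the two ends of the route.
endpoint_gap = freqs[-1] - freqs[0]

print(f"largest neighbor-to-neighbor gap: {max(adjacent_gaps):.3f}")
print(f"gap between the two endpoints:    {endpoint_gap:.3f}")
```

No pair of neighbors differs by more than about two percentage points, yet the endpoints differ by eighty. A perfectly smooth cline leaves the end-to-end difference intact.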
More broadly, my view of the orthodox reaction to the cluster analyses is that it constitutes a complicated set of “Yes, buts…” A core truth uncongenial to the orthodoxy goes untouched. A geneticist can say to the orthodox, “Give me a large random sample of SNPs in the human genome, and I will use a computer algorithm, blind to any other information about the subjects, that matches those subjects closely not just to their continental ancestral populations, but, if the random sample is large enough, to subpopulations within continents that correspond to ethnicities.” If race and ethnicity were nothing but social constructs, that would be impossible. It’s actually a sure bet.
Recapitulation
As I close this discussion, it is time for another reminder that the genetic distinctiveness of populations is minor compared to their commonalities and that all of the clusters and genetic distances are based on SNPs that for the most part are not known to affect any phenotype. The material here does not support the existence of the classically defined races, nor does it deny the many ways in which race is a social construct. Rather, it communicates a truth that geneticists expected theoretically more than half a century ago and that has been confirmed by repeated empirical tests: Genetic differentiation among populations is an inherent part of the process of peopling the Earth. It is what happens when populations successively split off from parent populations and are subsequently (mostly) separated geographically.
The inescapable next question is whether we’re looking at a phenomenon that has been confined to SNPs that have no effect on the phenotype, or whether the same thing has been happening to SNPs that do have such an effect. We will explore that topic by looking at what has been learned about evolution since humans left Africa.
8
Evolution Since Humans Left Africa
Proposition #6: Evolutionary selection pressure since humans left Africa has been extensive and mostly local.
A pillar of the orthodox position is that humans left Africa so recently that they haven’t had time to differentiate themselves genetically in ways that would affect cognitive repertoires. That position is now known to be wrong, as indicated by Proposition #6. Presenting the evidence for Proposition #6 involves a number of technical issues regarding evolution, and so it’s time for another interlude.
Fourth Interlude: Evolutionary Terms You Must Know to Read the Rest of the Book
Evolution refers to the process whereby the first primitive forms of life became the biosphere we know today, a process independently understood in its modern form by Charles Darwin and Alfred Russel Wallace in the 1830s and 1840s and famously described in 1859 by Darwin in On the Origin of Species.
Mutation. The evolution of completely new traits—hearing or eyesight, for example—requires mutations. Mutations have several causes. The chemicals that make up the base pairs can decay or be damaged. The process for correcting those errors (a capability for DNA repair is built into every cell) is pretty good, but it sometimes makes mistakes. Similarly, errors can occur during DNA replication. Mutations can be caused by different forms of radiation or exposure to certain chemicals. These and other causes can affect the chemical in one letter of a base pair or the number of repetitions in microsatellites. They can result in small insertions or deletions of bases (indels), and structural variations in regions of DNA.1
An extensive literature debates the incidence of mutations and whether the rate has increased or decreased over time. The rate is usually expressed in incomprehensibly small numbers (e.g., 1.29 × 10⁻⁸ per position per generation). But the same study that produced that estimate thankfully gives an intuitively useful way of thinking about it: It implies that a newborn with 30-year-old parents will carry 75 new SNV mutations and 6 new short indel mutations.2
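A rough back-of-the-envelope check shows how the tiny rate turns into a count of that size. The number of positions is an assumption of mine, not a figure from the study: roughly 2.9 billion positions per copy of the genome, with two copies per child.

```python
RATE_PER_POSITION = 1.29e-8   # per position per generation (from the text)
HAPLOID_POSITIONS = 2.9e9     # assumed positions per genome copy
GENOME_COPIES = 2             # one copy inherited from each parent

# Expected number of new single-nucleotide mutations in one newborn.
expected_new_snvs = RATE_PER_POSITION * HAPLOID_POSITIONS * GENOME_COPIES

print(f"expected new SNVs per newborn: about {expected_new_snvs:.0f}")
```

Under that assumption the product lands close to the SNV figure quoted above. The study's own calculation may define the counted positions differently.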
The process of mutation is a matter of chance. The great majority of mutations have no effect or a negative effect—random changes are unlikely to add something positive to highly evolved traits.3 “Negative effect” in evolutionary terms refers to a reduction in reproductive fitness (often just fitness), the technical term for describing reproductive success, measured in the simplest case by the number of offspring one produces.4 Even when a mutation has a positive effect on fitness, it initially happens to a single individual. For that mutation to spread from one individual throughout the population requires a great deal of luck. Exactly how much luck was calculated almost a century ago by British geneticist J. B. S. Haldane: If a given allele has a selective advantage of s, the chance that it will sweep through a large population and become fixed is 2s.5 For example, if the selective advantage conferred by an allele is 5 percent (a large advantage for a single allele), it has only a 10 percent chance of eventually becoming fixed. Even a highly favorable mutation has a precarious place in the genome until it has successfully propagated to many people.
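Haldane's rule of thumb is simple enough to write down directly. The sketch below encodes only the 2s approximation described above, which applies to a small positive advantage in a large population. The function name and the range check are my own choices.

```python
def haldane_fixation_probability(s):
    """Approximate chance that a new allele with advantage `s` becomes fixed."""
    if not 0 < s < 0.5:
        raise ValueError("the 2s approximation is meant for small positive s")
    return 2 * s

advantage = 0.05  # the 5 percent example from the text
p_fix = haldane_fixation_probability(advantage)

print(f"chance of fixation: {p_fix:.0%}")
print(f"chance of being lost: {1 - p_fix:.0%}")
```

Even with this large advantage, the new allele is lost about nine times out of ten.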