Overlooked in the blaze of White House publicity was the fact that the object of celebration was but a rough draft of the human genome. Much work remained to be done. In fact, the sequences of only two of the smallest chromosomes, 21 and 22, were reasonably complete and had been published. And even these could not boast unbroken chromosome-tip-to-chromosome-tip sequences. As to the other chromosomes, some of their sequences were riddled with gaps. Since that big announcement, sights were set on a new deadline of April 2003 for filling in the gaps and securing a full, accurate sequence. Some small regions, however, have proved literally unsequenceable, and in practice the goal has become to obtain an "essentially complete" sequence: at least 95 percent of the sequence finished with an error rate of less than 1 in 10,000 bases.
One of those responsible for coaxing the international herd of sequencing centers over the final hurdles was Rick Wilson, the bluff midwesterner who succeeded Bob Waterston as head of the Washington University center. Quality control is the name of the game, so each chromosome has been assigned a coordinator to oversee progress and ensure that his or her charge meets the project's overall specifications. Occasional glitches occur – for example, an errant piece of rice sequence crept mysteriously into one submission to the database – but screening procedures have proved effective in removing such contaminants. When I wrote this, the Human Genome Project was well on course to being "essentially complete" by the April 2003 deadline, which is also the fiftieth anniversary of the publication of the double helix.
The Human Genome Project is an extraordinary technological achievement. Had anyone suggested in 1953 that the entire human genome would be sequenced within fifty years, Crick and I would have laughed and bought them another drink. And such skepticism would have still seemed valid more than twenty years later when the first methods for sequencing DNA were finally devised. Those methods were, to be sure, a technical breakthrough but sequencing was a painfully slow business all the same – in those days it was a major undertaking to generate the sequence of even one small gene a few hundred base pairs in length. And now there we were, just another twenty-five years further on, celebrating the completion of some 3.1 billion base pairs of sequence. But we must also bear in mind that the genome is much more than a monument to our technological wizardry, astonishing though that may be: whatever its immediate political motivation, that White House celebration was perfectly justified in hailing the possibilities of a marvelous new weapon in our fight against disease and, even more, a whole new era in our understanding of how organisms are put together and how they operate, and of what it is that sets us apart biologically from other species – what, in other words, makes us human.
CHAPTER EIGHT
READING GENOMES:
EVOLUTION IN ACTION
I used to wish that when the human genome was finally completely sequenced it would turn out to contain 72,415 genes. My enthusiasm for this obscure number stemmed from the Human Genome Project's first big surprise. In December 1999, sandwiched between two major sequencing landmarks – billions one and two – came the first completed chromosome, number 22. Although a small one, constituting only 1.1 percent of the total genome, chromosome 22 was still 33.4 million base pairs long. This was our first glimpse of what the genome as a whole might look like; as one commentator for Nature wrote, it was like "seeing the surface or the landscape of a new planet for the first time." Most interesting was the density of genes along the chromosome. We had no reason to believe that chromosome 22 would not be representative of the entire genome, so we expected to find about 1.1 percent of all human genes in its sequence. That is to say, given the standard textbook estimate of about 100,000 human genes in total, we should have expected to see about 1,100 of them on chromosome 22. Almost exactly half that number were found: 545. Here was the first big hint that the human genome was not as gene-rich as we had supposed.
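The arithmetic behind that surprise is easy to retrace. The short Python sketch below uses only the figures quoted above; the function and variable names are hypothetical, and it rests on the same assumption the text makes, namely that genes are spread evenly across the genome.

```python
# Figures quoted in the text; all names here are illustrative only.
TEXTBOOK_GENE_ESTIMATE = 100_000   # standard textbook estimate of human genes
CHR22_GENOME_FRACTION = 0.011      # chromosome 22 is 1.1 percent of the genome
CHR22_GENES_FOUND = 545            # genes identified on chromosome 22


def expected_genes_on_chromosome(total_genes, genome_fraction):
    """Genes expected on a chromosome if gene density were uniform."""
    return total_genes * genome_fraction


def extrapolated_total(genes_found, genome_fraction):
    """Genome-wide gene count implied by one chromosome, assuming uniform density."""
    return genes_found / genome_fraction


expected = expected_genes_on_chromosome(TEXTBOOK_GENE_ESTIMATE, CHR22_GENOME_FRACTION)
implied = extrapolated_total(CHR22_GENES_FOUND, CHR22_GENOME_FRACTION)

print(f"Expected on chromosome 22: {expected:.0f}")
print(f"Implied genome-wide total: {implied:.0f}")
```

Run in reverse, the same proportion turns the 545 genes actually found into a genome-wide figure of a little under 50,000, roughly half the textbook estimate.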
Suddenly the human gene count was a hot topic. At the Cold Spring Harbor Laboratory conference on the genome in May 2000, Ewan Birney, who was spearheading the Sanger Centre's computer analysis of the sequence, organized a contest he called Genesweep. It was a lottery based on estimating the correct gene count, which would finally be determined with the completion of the sequence in 2003; the winner would be the one who had come closest to the right answer. (That Birney should have become the HGP's unofficial bookie wasn't entirely surprising: numbers are his thing. After Eton, he took a year to tackle quantitative problems in biology while living in my house on Long Island – a far cry from trekking in the Himalayas or tending bar in Rio, just two of the more likely ways a young Briton might spend the "gap year" before university. Birney's CSHL work yielded two important research papers before he even set foot in Oxford.)
Originally Birney charged $1 per entry, but the price of admission to the pool increased with every published estimate that brought us closer to a final count. I was able to get in on the ground floor, putting $1 down on 72,415. My bet was a calculated attempt to reconcile the textbook figure, 100,000, and the new best guess of around 50,000, based on the chromosome 22 result. Birney announced the result in May 2003: 21,000 genes – many fewer than anyone had guessed. Lee Rowen, a Seattle-based expert in biological computing, won Genesweep with her bet of 25,947, still some 4,000 off. Of course I was wildly over the mark, so I'm now a dollar down on the genome.
Perhaps the only question to generate as much idle speculation as the gene count was that of whose genes we were sequencing. The information was in principle confidential, so money was not going to change hands on this one, but many wondered all the same. In the case of the public project, the DNA sample we sequenced had come from a number of randomly selected individuals from around Buffalo, New York, the same area where the processing work – isolating the DNA and inserting it into bacterial artificial chromosomes for mapping and sequencing – was taking place. Initially Celera claimed that its material too had been derived from six anonymous donors, a multicultural group, but in 2002 Craig Venter could not resist letting the world know that the main genome sequenced was actually his own. Today that sequence is Venter's last remaining connection to the company. Concerned that sequencing genomes, though glamorous and newsworthy, was not proving viable from a business perspective, Celera reinvented itself as a drug company and bid farewell to its founder in 2002. As for Venter, he has established two new institutes, one to study the ethical issues raised by modern genetics, and the other to use the genomes of bacteria to find fresh sources of renewable energy.
With the whole rough draft in hand, it is now confirmed that there is nothing atypical about the gene density on chromosome 22. If anything, in fact, chromosome 22 with its 545 genes was for its size gene-rich rather than gene-poor. Only 236 genes have been definitively located on chromosome 21, which is about the same size. As we've seen, there are only an estimated 21,000 genes in total from the entire complement of 24 human chromosomes. And while we should stress that the final number can only rise as we make more discoveries, we are all but certain to end well below the 30,000 mark, never mind the 100,000 one.
As to how far, only time will tell. Finding genes is not actually such a straightforward task: protein-coding regions are but strings of As, Ts, Gs, and Cs embedded among all the other As, Ts, Gs, and Cs of the genome – they do not stand out in any obvious way. And remember, only about 2 percent of the human genome actually codes for proteins; the rest, unflatteringly referred to as "junk," is made up of apparently functionless stretches of varying length, many of which occur repeatedly. And junk can even be found strewn within the genes themselves; studded with noncoding segments (introns), genes can sometimes straddle enormous expanses of DNA, the coding parts like so many towns isolated between barren stretches of molecular highway. The longest human gene found so far, dystrophin (in which mutations cause muscular dystrophy), sprawls over some 2.4 million base pairs. Of these, a mere 11,055 (0.5 percent of the gene) encode the actual protein; the rest consists of the gene's seventy-nine introns (a typical human gene has eight). It is this awkward architecture of the genome that makes gene identification so difficult.
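The dystrophin figures can be checked with a few lines of Python. This is only a sketch built on the numbers quoted above; the variable names are hypothetical, and the average intron length it prints rests on the simplifying assumption that every noncoding base in the gene lies within one of the seventy-nine introns.

```python
# Figures quoted in the text for the dystrophin gene; names are illustrative.
GENE_LENGTH_BP = 2_400_000   # "some 2.4 million base pairs"
CODING_BP = 11_055           # base pairs that encode the protein
INTRON_COUNT = 79            # introns in the gene

# Share of the gene that actually codes for protein.
coding_fraction = CODING_BP / GENE_LENGTH_BP

# Assumption: everything that is not coding sequence is counted as intron.
noncoding_bp = GENE_LENGTH_BP - CODING_BP
mean_intron_bp = noncoding_bp / INTRON_COUNT

print(f"Coding share of the gene: {coding_fraction:.2%}")
print(f"Noncoding base pairs: {noncoding_bp:,}")
print(f"Average intron length under this assumption: {mean_intron_bp:,.0f} bp")
```

On these numbers the coding share rounds to the 0.5 percent quoted in the text, and the average intron, under the stated assumption, runs to roughly 30,000 base pairs.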
But human gene spotting has become less tricky now that the genome of the mouse is better known. The credit goes to evolution: in their functional parts the human and mouse genomes, like the genomes of all mammals, are remarkably similar, having diverged relatively little over the eons intervening since the two species' common ancestor. The junk DNA regions, by contrast, have been evolution's wild frontier; without natural selection to keep mutation in check, as it does in coding segments, mutations aplenty have accumulated so that there is substantial genetic divergence between the two species in these regions. Looking for similarity in sequence between the human and mouse data is therefore an effective way of identifying functional areas, like genes.
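The logic of this comparison can be illustrated with a toy Python sketch. It is not the method used by the genome centers: it assumes the two sequences have already been aligned to the same length, the sequences themselves are invented for the example, and the function names, window size, and threshold are all hypothetical choices. It slides a window along the pair and reports where the two agree closely enough to suggest a conserved, and therefore probably functional, stretch.

```python
def window_identity(seq_a, seq_b, window):
    """Fraction of matching positions in each window of two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    scores = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(a == b for a, b in pairs)
        scores.append(matches / window)
    return scores


def conserved_windows(seq_a, seq_b, window=8, threshold=0.9):
    """Start positions of windows whose identity meets or exceeds the threshold."""
    scores = window_identity(seq_a, seq_b, window)
    return [start for start, score in enumerate(scores) if score >= threshold]


# Invented toy sequences: a shared middle stretch flanked by diverged "junk".
toy_human = "AAAAAAAA" + "GATTACAG" + "CCCCCCCC"
toy_mouse = "TTTTTTTT" + "GATTACAG" + "GGGGGGGG"

print(conserved_windows(toy_human, toy_mouse, window=8, threshold=0.9))
```

Only the window covering the shared middle stretch clears the threshold; the diverged flanks, standing in for junk DNA, do not. Real comparisons must first align the two genomes, which is a far harder problem than this sketch suggests.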
Identifying human genes has also been facilitated by the completion of a rough draft of the puffer fish genome. Fugu, as it is better known to aficionados of Japanese cuisine, contains a potent neurotoxin; a competent chef removes the poison-containing organs, so dinner should produce only a little numbness in the mouth. But some eighty people die each year from poorly prepared fugu, and the Japanese imperial family is forbidden by law from enjoying this delicacy. More than a decade ago, Sydney Brenner developed a taste for the puffer, at least as an object of genetic inquiry. Its genome, just one-ninth the size of the human one, contains much less junk than ours: approximately one-third of it encodes proteins. Under Brenner's leadership, the fugu genome rough draft was completed for some $12 million, a genuine bargain by genome-sequencing standards. The gene count at present seems to fall between 32,000 and 40,000, in the same ballpark as humans. Interestingly, though, while fugu genes have roughly the same number of introns as human and mouse genes, the fugu introns are typically much shorter.
Even if we assume that plenty of genes remain to be discovered and generously increase our estimate of the human gene count from 21,000 to 25,000, we still get a somewhat exaggerated impression of our essential genetic complexity. Over evolution, certain genes have spun off sets of related ones, resulting in groups of similar genes of like, but subtly different, function. These so-called gene families originate by accident when, in the course of producing egg or sperm cells, a chunk within a chromosome is inadvertently duplicated, so that there are now two copies of a particular gene on that chromosome. As long as one copy continues to function, the other is unchecked by natural selection, free to diverge in whatever direction evolution may choose as mutations accumulate. Occasionally the mutations will result in the gene acquiring a new function, usually one closely related to that of the original gene. In fact, many of our human genes consist of slight variations upon a relatively few genetic themes. Consider, for example, that 575 of our genes (nearly 2 percent of our total complement) are responsible for encoding different forms of protein kinase enzymes, chemical messengers that pass signals around the cell. Then, there are the 900 human genes underlying your nose's capacity to smell: the proteins encoded are odor receptors, each one recognizing a different smell molecule or class of molecules. Roughly these same 900 genes are present in the mouse as well. But here is the difference: the mouse, having adapted to a mainly nocturnal existence, has greater need of its sense of smell – natural selection has favored the keener sniffer and kept most of the 900 odor-detecting genes in service. In the human case, however, some 60 percent of these genes have been allowed to deteriorate over evolution. 
Presumably, as we became more dependent on sight, we have needed fewer smell receptors, so natural selection did not intervene when mutations caused many of our smell genes to be incapable of producing functioning proteins, making us relatively inept smellers compared with other warm-blooded creatures.
How does our gene number compare with that of other organisms?
COMMON NAME      SPECIES NAME                 NUMBER OF GENES
Human            Homo sapiens                 25,000
Mustard plant    Arabidopsis thaliana         27,000
Nematode worm    Caenorhabditis elegans       20,000
Fruit fly        Drosophila melanogaster      14,000
Baker's yeast    Saccharomyces cerevisiae     6,000
Gut bacterium    Escherichia coli             4,000
In terms of gene complement, then, we are only fractionally more complex than a weedy little plant. Even more sobering is the comparison with the nematode, a creature composed of only 959 cells (against our own estimated 100 trillion), of which some 302 are nerve cells that form the worm's decidedly simple brain (ours consists of 100 billion nerve cells) – orders of magnitude difference in structural complexity and yet we have not even double the worm's gene complement. How can we account for this embarrassing discrepancy? It's no cause for embarrassment at all: humans, it would appear, are simply able to do more with their genetic hardware.
In fact, I would propose there is a correlation between intelligence and low gene count. My guess is that being smart – having a decent nerve center like ours or even the fruit fly's – permits complex functioning with relatively few genes (if indeed "few" has any real meaning in relation to the number 25,000). Our brain gives us sensory and neuromotor capabilities far beyond those of the eyeless, inching nematode, and thus a greater range of behavioral response options. And the plant, being rooted, has fewer options still: it requires a full onboard set of genetic resources for dealing with every environmental contingency. A brainy species by contrast can respond to, say, a cold snap by using its nerve cells to seek out more favorable conditions (a warm cave will do).
Vertebrate complexity may also be enhanced by sophisticated genetic switches that are typically located near genes. With the genome sequencing accomplished, we can now analyze in detail these regions flanking genes. It is here that regulation occurs, with regulatory proteins binding to the DNA to turn the adjacent gene on or off. Vertebrate genes seem to be governed by a much more elaborate set of switching mechanisms than those of simpler organisms. It is this nimble and complicated coordination of genes that permits the complexities of vertebrate life. Moreover, a given gene may in addition yield many different proteins, either because different exons are coupled together to create slightly different proteins (a process known as "alternative splicing") or because biochemical changes are made to the proteins after they have been produced.
The unexpectedly low human gene count provoked several op-ed page ruminations on its significance. These tended toward a common theme. Stephen Jay Gould (whose recent premature death tragically silenced an impassioned voice), writing in the New York Times, hailed the low count as the death knell of reductionism, the reigning doctrine of virtually all biological inquiry. This doctrine holds that complex systems are built from the bottom up. Put another way: To understand events at complex levels of organization, we must first understand them at simpler levels and piece together these simpler dynamics. And so it follows that by understanding the workings of the genome, we will ultimately understand how organisms are assembled. Gould and others took the surprisingly small human gene count as evidence that such a bottom-up approach is not only unworkable but also invalid. In light of its unexpected genetic simplicity, the human organism, argued the antireductionists, was living proof that we cannot begin to understand ourselves in relation to a sum of smaller processes. To them, our low gene number implied that nurture, not nature, must be the primary determinant of who each one of us is. It was, in short, a declaration of independence from the tyranny supposedly exercised by our genes.
Like Gould, I well appreciate that nurture plays an important part in shaping each of us. His evaluation of nature's role, however, is utterly wrong: our low gene count by no means invalidates a reductionist approach to biological systems; nor does it justify any logical inference that we are not determined by our genes. A fertilized egg containing a chimp genome still inevitably produces a chimp, while a fertilized egg containing a human genome produces a human. No amount of exposure to classical music or violence on TV could make it otherwise. Yes, we have a long way to go in developing our understanding of just how the information in those two remarkably similar genomes is applied to the task of producing two apparently very different organisms, but the fact remains that the greatest part of what each individual organism will be is programmed ineluctably into its every cell, in the genome. In fact, I see our discovery of a low human gene count as good news for standard reductionist approaches to biology: it's much easier to sort through the effects of 25,000 genes than 100,000.
While humans may not have an enormous number of genes, we do have, as the sprawling dystrophin gene illustrates, a large, messy genome. Returning again to the worm comparison: while we have not even twice as many genes, our genome is thirty-three times larger. Why the discrepancy? Gene mappers describe the human genome as a desert spotted with occasional genetic oases – genes. Fifty percent of the genome is constituted of repetitive junklike sequences of no apparent function; a full 10 percent of our DNA consists of a million scattered copies of a single sequence, called Alu: