The humble function of hemoglobin, binding and releasing oxygen as it shuttles between the lung and the body’s tissues, belies its near-universal importance. Hemoglobin is a member of a large family of oxygen-binding proteins, the globins, which are vital not only to us but also to many other mammals, birds, reptiles, and fish.36 Countless generations have elapsed, parents followed by their children, grandchildren, and innumerable generations of great-grandchildren, since all these organisms shared a common ancestor. During this succession of generations, the DNA that encodes hemoglobin and all other proteins has replicated countless times. The copying errors it suffers each generation are rare—about one for every forty million copied DNA letters in our cells37—but given enough time, all genes in a genome will suffer errors that alter the proteins they encode.
An altered amino acid text that prevents a globin from folding also prevents oxygen from traveling where it is needed. In other words, it spells death. But an altered protein does not always lose its function and meaning. Some alterations impair neither folding nor function, and get passed on to the next generation.38 Over thousands and millions of generations, copy error after tolerable copy error can thus accumulate and slowly change a protein’s amino acid sequence.
FIGURE 11. Proteins change in time
Figure 11 shows a snippet of ten amino acids from human hemoglobin and from three of our animal relatives.39 Each letter in the figure is taken from the twenty-letter alphabet that scientists use to abbreviate amino acids, V for valine, A for alanine, and so on. We and our closest relatives, the chimpanzees, shared a common ancestor some five million years, or roughly 200,000 human generations, ago—a huge amount of time compared to a human life span, but precious little in evolution.40 And because little time means few errors, the globin text of chimpanzees has not changed much since then. In the figure’s snippet, the only difference is that chimp globins harbor the amino acid glutamate (letter E, in black) in place of the human alanine (A).
The human lineage split from the mouse lineage some eighty million years ago. Mouse globins thus had more time to accumulate changes than chimps’, which shows in the two amino acid differences of figure 11 between mice and humans. The chicken lineage separated from us even longer ago—almost three hundred million years—and accrued six altered amino acids.41
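The comparison behind these counts is simple enough to sketch in a few lines of Python: line up two snippets and count the positions where the letters differ. The four ten-letter strings below are hypothetical, invented only so that the counts match the one, two, and six differences described above. They are not the sequences shown in figure 11.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned amino acid strings differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical snippets for illustration only (NOT the figure 11 sequences).
human = "VLSPADKTNV"
chimp = "VLSPEDKTNV"    # one substitution: A -> E
mouse = "VLSGADKSNV"    # two substitutions
chicken = "VLTAEDKKLI"  # six substitutions

for name, seq in [("chimp", chimp), ("mouse", mouse), ("chicken", chicken)]:
    print(name, count_differences(human, seq))
```

Counting mismatches position by position only works when the snippets are already aligned and of equal length, which is the assumption made here.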
Millions of other organisms harbor globins, not only warm-blooded vertebrates but reptiles, frogs, fish, sea stars, mollusks, flies, worms, and even plants. Some of these organisms grow on the same twig of life’s gigantic tree and have a recent common ancestor. Their globin texts shared most of their journey through time, split only recently, and are still similar. Others lie on different branches, share more distant ancestors, and harbor globins with different texts.42 But however different these texts are, each of them works just fine, since otherwise it would not have survived. Each surviving text encodes a different solution to the problem of binding oxygen.43 And every millennium that life continues, it travels further and further through the library of proteins, discovering ever-new globin texts in its blindly groping evolutionary journey.44
FIGURE 12. Two hemoglobins with similar folds
To see how far this journey has already led the globins, consider some of our most distant relatives: plants, some of which indeed have globins, even though they have no blood.45
Legumes like soybeans, peas, and alfalfa can extract vital nitrogen from its nearly unlimited supply in the air. (Most other plants need to extract nitrogen from the soil, where it is often scarce, unless a farmer has applied fertilizer.) For this purpose, legumes employ bacteria that live in clumps of tissues around their roots, and that harbor a special enzyme that converts airborne nitrogen gas into ammonium, the same ammonium that nitrogen fertilizers contain. This ingenious symbiosis has only one problem: Atmospheric oxygen destroys the enzyme. To protect those enzymes, plants manufacture globins, which keep oxygen safely away from the bacteria.
Plants and animals dwell on different major branches of life’s tree, because their common ancestor lived more than a billion years ago. Their globins are staggeringly different, which reflects their long and separate evolutionary journey. For instance, the globins from lupins and insects differ in almost 90 percent of their amino acids. Yet these globins not only bind oxygen, they also still fold into the very similar shapes of figure 12. The fold on the left is from a legume, the one on the right from a midge, a tiny two-winged fly. Both proteins have several spiral-staircase-like helices that are arranged similarly, such as the two helices that run in parallel from the upper left to the lower right. The image does not do full justice to just how similar these globins are. If you rotated these molecules to place one exactly above the other, their atoms would occupy almost exactly the same places. Despite more than a billion years of separation, these globins still fold the same way.
The amino acid differences between these globins are extreme, but not unusual. Even globins from some animals, for example those of clams and whales, can differ in more than 80 percent of their amino acids.46 Despite these differences, though, these and thousands more globins from other organisms are connected by a network of unbroken paths through the protein library, paths that began at their common ancestor, took one amino-acid-changing step at a time, but left the text’s meaning unchanged. You will recognize a theme that we already encountered in the metabolic library, where evolution could travel far and wide without losing the meaning of a metabolic phenotype. The steps evolution takes through the protein library are different—single amino acid changes instead of horizontal gene transfer—but the principle is the same. A genotype network connects globins and extends its tendrils far through the protein library. Evolution can explore the library along this network without falling into the deadly quicksand of molecular nonsense.
When it comes to forming vast and far-reaching genotype networks, globins are not an exception but the rule. Enzymes with the same fold, catalyzing the same reaction, and sharing the same ancestor typically share less than 20 percent of their amino acids. We know this because scientists have mapped the location of texts encoding thousands of known enzymes in the library. By cataloging these texts, we can map the paths of genotype networks in the library, which reveals that some networks can reach even further through the library than globins, and none more so than that of TIM barrel proteins. Their name is an acronym for triose phosphate isomerase, an enzyme that helps extract energy from glucose, and their fold is called a barrel because its sheets and helices are arranged like the staves of a barrel. The stunning fact is that some enzymes with this fold do not have a single amino acid in common. They occupy opposite corners of the protein library—texts that do not share a single letter—yet carry the same chemical message.47 Proteins like these are a bit like innumerable versions of Hamlet, all of them equally stageable, while sharing only a few hundred—or even none—of the play’s four thousand lines.
Thousands of proteins from nature’s laboratory tell a similar story: When a problem can be solved with new proteins, be they enzymes, regulators, or transporters like hemoglobin, the number of solutions is too large to count. And all of those proteins are connected by a vast network of amino acid texts that spread throughout the protein library. We know thousands of proteins from some of these networks, but they are grains of sand on a vast beach of the unknown: most of the many trillions of proteins that share the same phenotype. Some of these unknown proteins belong to long-extinct organisms. But most have not even formed yet. The four billion years of life were not nearly enough time; they would suffice to create only some 10^50 proteins, a vanishingly small fraction of all texts in the protein library.48 Life’s enormous tree and all its proteins, however vast and beautiful, is but a smeared reflection in a filthy mirror, a faint shadow of the vast Platonic realm that genotype networks inhabit.
In chapter 3, we saw that genotype networks help evolution’s billions of readers explore different and far-flung neighborhoods of the metabolic library. Through these networks, some of the library’s explorers find innovative texts with new phenotypes, even though others get knocked off a network and die. Genotype networks might do the same for proteins, but only if the neighborhoods of the protein library are diverse.49 Otherwise, an evolving population of proteins might as well stay wherever it is. No need to explore the library if its different stacks host the same books.
Do the shelves near each protein in the library contain texts with similar meaning, a bit like modern suburbs with their identical cookie-cutter homes? Or is each neighborhood more like a medieval village, with unique buildings and their individual charm, containing proteins with unique new functions? Until recently, we had no idea, even though decades of protein research allow us to answer this question with computers that can mine mountains of protein data.
Answering this question needs more than just computers. It also needs a librarian’s love of texts. A young Chilean researcher named Evandro Ferrada brought just this love to Zürich when he joined our group of researchers to get his Ph.D. He had already studied proteins and become skillful at mining huge protein databases for information about proteins, from their folds to their smallest atomic details. I had seen Evandro’s quiet, pensive personality before, in people whose minds constantly grapple with the deep mysteries of life. Perhaps this is why he agreed to work on this problem, because the structure of protein space is just such a mystery: one that not only is challenging and profound but also can be unraveled. What’s more, it also holds the secret to protein innovability.
Evandro focused on enzymes because they are an extremely diverse group of proteins—no surprise, since they catalyze more than five thousand different chemical reactions. They are also especially well studied: Thousands of them scattered throughout the library have been mapped. Their locations are precisely known, and we can use computers to analyze them. Evandro asked his computer to choose a pair of proteins with the same fold, but in different places on the same genotype network.50 He then explored a small neighborhood around the first protein, and listed all known proteins in it, together with their function. After that, he explored the neighborhood of the second protein, and listed all known proteins and their functions in its neighborhood. Finally, he compared these lists, asking simply whether they were different, whether proteins in the two neighborhoods had different functions. He then chose another protein pair, yet another pair, and so on, asking the same question for them, until he had explored hundreds of pairs and their neighborhoods.
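The logic of this comparison can be sketched in Python. The sketch assumes that each neighborhood has already been reduced to the set of functions found in it. The function names, the toy data, and the particular measure (the fraction of functions that occur in only one of the two neighborhoods) are illustrative assumptions, not the statistics used in the actual study.

```python
def fraction_unique_functions(neigh_a: set, neigh_b: set) -> float:
    """Fraction of all observed functions that occur in only one neighborhood."""
    union = neigh_a | neigh_b
    if not union:
        return 0.0
    # The symmetric difference holds functions present in exactly one set.
    return len(neigh_a ^ neigh_b) / len(union)


def mean_uniqueness(pairs) -> float:
    """Average the measure over many pairs of neighborhoods."""
    values = [fraction_unique_functions(a, b) for a, b in pairs]
    return sum(values) / len(values)


# Hypothetical neighborhoods around two proteins on the same genotype network.
around_first = {"kinase", "isomerase", "hydrolase"}
around_second = {"isomerase", "ligase", "oxidase", "transferase"}

print(fraction_unique_functions(around_first, around_second))
```

A value near 1 would mean that the two neighborhoods hold mostly different functions, and a value near 0 that they hold the same ones.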
The final answer was simple. The neighborhoods of two proteins contain mostly different functions, even if the two proteins are close together in the library. For instance, even proteins that differ in fewer than 20 percent of their amino acids have neighborhoods whose proteins differ in most of their functions. The protein library has neighborhoods that are highly diverse, just like the metabolic library. And just as with metabolism, this diversity makes vast genotype networks ideal for exploring the library, helping populations to discover texts with new meaning while preserving old and useful meaning.
Both metabolic and protein libraries are full of genotype networks composed of synonymous texts that reach far through a vast multidimensional hypercube, and both harbor unimaginably many diverse neighborhoods. They have much in common with each other, but little with human libraries. And that’s not surprising: They were here long before us.
At least three billion years before us. That’s when proteins took over most of life’s jobs from RNA. They did so for a good reason. Because they have many more building blocks—twenty different amino acids compared to the four nucleotides of RNA—nature could write more texts with proteins. In an alphabet of four letters, you can write about one million different ten-letter strings, whereas an alphabet of twenty letters allows more than ten trillion such strings—ten million times more. This vastly larger number of protein texts increases further with longer texts. More texts mean more shapes, more chemical reactions you can catalyze, more tasks you can perform.51
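The arithmetic is easy to check: the number of possible texts is the alphabet size raised to the power of the text length. A short Python sketch, with a helper name chosen here only for illustration:

```python
def number_of_texts(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of a given length over a given alphabet."""
    return alphabet_size ** length


rna_texts = number_of_texts(4, 10)       # 1,048,576: about one million
protein_texts = number_of_texts(20, 10)  # 10,240,000,000,000: about ten trillion
ratio = protein_texts // rna_texts       # 9,765,625 = 5**10: about ten million

print(f"{rna_texts:,} RNA texts, {protein_texts:,} protein texts, ratio {ratio:,}")
```

Because the ratio is 5 raised to the power of the text length, the gap between the two libraries widens with every added letter.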
But RNA did come before proteins, and for this reason alone it deserves an honorable place in the pantheon of biological innovation. Without innovations made by the first replicators, we would not be here. And our job would be incomplete without understanding their innovability.
Fortunately, there are many parallels between RNA and proteins that can help us understand RNA innovability. We can organize RNA texts into a hypercubic library—not quite as large as that of proteins, but still formidable—where similar texts are near each other and dissimilar texts are far apart. This library also exists in many dimensions, meaning that its neighborhoods are much larger than in three-dimensional space—near any one text are many others. The meaning of many RNA texts is also expressed in a language of shapes, because RNA chains are highly flexible, like proteins. They bend and twist in space, organizing themselves into elaborate folds, like proteins.
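To see how crowded such a neighborhood is, one can list every text that differs from a given text in exactly one letter. The Python sketch below does this for a short, hypothetical RNA string. The helper names are chosen here for illustration, and "neighbor" is taken to mean a single-letter change, which is an assumption of this sketch.

```python
def one_letter_neighbors(text: str, alphabet: str):
    """Yield every text that differs from `text` in exactly one position."""
    for pos, current in enumerate(text):
        for letter in alphabet:
            if letter != current:
                yield text[:pos] + letter + text[pos + 1:]


def neighbor_count(length: int, alphabet_size: int) -> int:
    """Each position can change to any of the other letters."""
    return length * (alphabet_size - 1)


RNA_LETTERS = "ACGU"
example = "GGGAAAACCC"  # hypothetical ten-letter RNA text
neighbors = list(one_letter_neighbors(example, RNA_LETTERS))

print(len(neighbors))           # 30 immediate neighbors
print(neighbor_count(100, 4))   # 300 for a 100-letter RNA
print(neighbor_count(100, 20))  # 1900 for a 100-amino-acid protein
```

The neighbor count grows in step with the length of the text, so longer texts have proportionally more immediate neighbors.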
Unfortunately, the parallels end with the recalcitrance of RNA molecules to reveal their shape. Experiments have traced this shape only for a few hundred RNAs, a paltry number compared to the many thousands of proteins whose form and function we know. Therefore, what we can do for proteins—compare many naturally occurring molecules to map the library—is not yet possible for RNA.52
Thanks to the Austrian scientist Peter Schuster and his associates, though, the RNA library is not a lost cause. One of the grandfathers of computational biology in Europe, Schuster is now a retired professor at the University of Vienna, where he taught from the 1970s onward. A first encounter with Schuster seems to confirm the stereotypes that many Europeans have of Austrians. A jovial man with a generous girth and a wry sense of humor, Schuster would not have been out of place in the traditional Viennese cafés of the last days of the Austro-Hungarian Empire, where formidably well-educated polymaths held forth on everything from psychoanalysis to quantum theory. He is a scientist in that tradition, a purebred intellectual conversant with a broad variety of subjects. Not taking himself too seriously, Schuster opines with a tongue-in-cheek attitude, peppering the gravest discourse with humorous asides. He epitomizes an oft-repeated saying about how Austrians view life and its many challenges: “The situation may be hopeless, but it is never serious.”
There’s a broad mind and an incisive intellect, however, beneath the surface of Schuster’s jovial demeanor. He was among the first to propose how an RNA world might have originated.53 And his research group developed computer programs that predict an important aspect of an RNA text’s molecular meaning, its secondary structure phenotype.54
RNA secondary structure is what emerges first when an RNA string folds. As the string twists and bends and curls, some of its nucleotides pair with one another and create short stretches of double helices in the molecule, much like DNA’s famous spiral staircase. The secondary structure is a pattern of multiple such helices connected by stretches of intervening single-stranded text, all formed by a single molecule. Like the sheets and helices of proteins, these helices are the flowers that self-organize into the final bouquet of a three-dimensional fold.55
Not only was Schuster able to compute RNA’s secondary structures from their nucleotide sequences, but his group’s computer programs were also blazingly fast. They could predict hundreds of these molecular shapes within seconds. (To this day, we cannot do this for the more complex three-dimensional RNA fold.) With programs as fast as this, one can begin to map the RNA library. And even though we are still miles from understanding RNA’s complete fold and function, the secondary structure is very important on its own: If a mutation in the letter sequence of an RNA molecule disrupts its secondary structure, the molecule can no longer fold properly in three dimensions. Secondary structure is essential for the molecular meaning of RNA molecules, just as there can be no bouquet without flowers. And that’s a very good reason to study it.
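To give a flavor of how a computer can turn a letter sequence into a secondary structure, here is a deliberately simplified Python sketch. It is not the method of the programs described above, whose inner workings this text does not detail. It uses a classic textbook simplification, often called Nussinov-style folding, that merely maximizes the number of nested base pairs. The pairing rules, the minimum loop size of three, and all names in the code are assumptions of this sketch.

```python
# Allowed base pairs (Watson-Crick plus the G-U wobble pair); an assumption.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired letters inside a hairpin


def max_pairs_fold(seq: str) -> str:
    """Return a dot-bracket structure with the maximum number of nested pairs."""
    n = len(seq)
    if n == 0:
        return ""
    # best[i][j] = maximum number of pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: letter j stays unpaired
            for k in range(i, j - MIN_LOOP):  # case 2: letter j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back through the table to recover one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue  # too short to contain any pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


print(max_pairs_fold("GGGAAAACCC"))  # (((....)))
```

In the output, matching parentheses mark paired letters and dots mark unpaired ones, so the example string folds into a single three-pair helix closed by a four-letter loop. Realistic prediction programs weigh the energy of each pair and loop rather than simply counting pairs.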
Schuster’s researchers found a bewildering number of potential molecular meanings in the RNA library, all of them expressed as shapes. For example, RNA strings that are merely one hundred letters long can already form 10^23 different shapes. Many natural RNA molecules are much longer, and such longer texts can form many more shapes.56 What is more, texts with the same shapes are organized much like in the protein library. They form connected networks that reach far through the library, allowing you to revise any one text in little steps, radically, while leaving its molecular meaning unchanged.57 And just as in the protein library, different neighborhoods are more like medieval villages than cookie-cutter suburbs. Each neighborhood contains many different shapes, and any two neighborhoods do not share many of them.58 All this hints that innovability in RNA follows the same rules as in proteins. And recent experiments show that this is indeed the case.
In an ingenious experiment performed in the year 2000, Erik Schultes and David Bartel from the Massachusetts Institute of Technology blazed a trail through the RNA library.59 The experiment started from two short RNA texts with fewer than a hundred letters each. The texts are far apart in the library and differ in many letters, but they are not just any two strings. Both molecules are enzymes—ribozymes, because they are composed of RNA rather than protein. Each of them wiggles into a different three-dimensional shape and catalyzes a different reaction. The first molecule can cleave an RNA string into two pieces, while the second does the exact opposite, joining two RNA strings by fusing their ends with atomic bonds. Let’s call these enzymes the “splitter” and the “fuser.”
If you already had a splitter, and you needed to find a fuser somewhere in the library, would that be easy or hard? And what about the opposite, creating a splitter from a fuser? In other words, can you create a specific molecular innovation from either one of these molecules by exploring the library as evolution would? If you were ignorant about genotype networks, you would think that should be impossible, because the two molecules are far apart. And even if it were possible, it might be exceedingly difficult, since a single misstep that creates a defective molecule spells death in evolution.
Arrival of the Fittest: Solving Evolution's Greatest Puzzle Page 13