The Future of Everything: The Science of Prediction

by David Orrell

DATA OR MODEL

As mentioned above, scientific forecasts can be based on either statistical patterns or detailed mathematical models. An inherent drawback of the data-driven approach is that it relies on the Pythagorean idea that the future will resemble the past. Galton showed that tall parents correlate with tall children, but his results applied to a population, rather than to individuals, and were not proof of a causal relationship. Change the population, and an “inherited trait” may weaken or disappear. The same applies to genetic risk factors.

The fact that a gene is statistically associated with a trait in the general population does not mean it will cause that trait in a particular individual. Data also tend to include a large amount of random variation or noise, so it is often impossible to tell whether a correlation exists at all. It is therefore not very meaningful to say that a trait is 30 percent inherited, or that a certain gene gives someone a 20 percent chance of developing a disease. To get around this problem, genetic studies often focus on subpopulations with a high degree of genetic homogeneity, such as Icelanders or Ashkenazi Jews, but the results are of most use for those groups.51
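
The point about noise can be made concrete with a toy simulation. The sketch below is only an illustration, not something taken from any genetic study: the effect size, noise level, and sample size are made-up assumptions, and the function names are hypothetical. It repeats the same small "study" many times and records the correlation each one would report.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def sample_correlation(effect, noise, n, rng):
    """One study: n individuals whose trait is a weak effect of a factor plus noise."""
    factor = [rng.gauss(0.0, 1.0) for _ in range(n)]
    trait = [effect * f + rng.gauss(0.0, noise) for f in factor]
    return pearson(factor, trait)

def repeated_studies(effect=0.1, noise=1.0, n=30, studies=200, seed=1):
    """Repeat the same study many times (all default values are assumptions)."""
    rng = random.Random(seed)
    return [sample_correlation(effect, noise, n, rng) for _ in range(studies)]

if __name__ == "__main__":
    results = repeated_studies()
    print("smallest reported correlation:", round(min(results), 2))
    print("largest reported correlation:", round(max(results), 2))
```

Although the underlying effect is the same in every run, the reported correlations scatter widely and some even come out with the wrong sign, which is why a single small study says little about whether a link exists at all.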

Statistically based claims that particular genes cause complex diseases usually prove to be oversimplistic. The biologist David S. Moore wrote that “a perfect and complete map of the human genome will not allow us to make accurate predictions about the traits—or diseases—that a given human being will develop.”52 The genes for Parkinson’s disease, suicide, and homosexuality have all been discovered and then promptly undiscovered, after further experiments showed that the story was more complicated.53 Inheritance is obviously important—animal breeders would go out of business otherwise—but individual genes, or limited sets of genes, do not fix our future in stone.54 Our fascination with finding a molecular basis for complex conditions like mental illness often appears to reflect a cultural desire to explain the world in terms of simple physical causes, which can be harmful if it distracts us from addressing more relevant issues.55

The ultimate aim of the Human Genome Project and predictive medicine is more general, and more ambitious, than this. It’s to translate the language of DNA and use it to determine the connection between the genotype (the sum of genetic information in the DNA) and the phenotype (the appearance and properties of the final organism). Many top molecular biologists believe that it should be possible to compute the phenotype from the genotype using a mathematical model. Efforts to build such models are under way at a number of institutions. In principle, they would account for all the processes and interactions that make statistical prediction so difficult. They would give an almost unlimited ability to predict future traits based on the laws of cause and effect. They would be the GCMs of the body. One Canadian scientist compared the process with that of weather forecasting: “What they’re looking at is a whole system, looking at clouds coming from Alberta and looking at where they will be in a few days. That’s basically what we are going to do with the cell . . . . [We’ll] be able to predict what’s going to happen.”56

Such models are reminiscent of Laplace, who said that if we knew the initial conditions of all particles and the laws of motion acting on them, we could predict the future state of the system. So is it possible to predict health using DNA? Can we build a model that will forecast traits based on genetic information? Is life computable, or will a baby’s future turn out to be more elusive than the clouds? To answer these questions, we must investigate the processes by which information in DNA is expressed in the organism.

THE GENERAL CELL MODEL

To build a predictive, mathematical model of a living system, we must take into account the genotype, the phenotype, and the external surroundings. For DNA to code for proteins, it needs to be part of a body, which in turn is nourished by external resources.57 A molecule of DNA, left to its own devices, will not spontaneously organize into a life form. In fact, DNA is a highly inert substance, which is why it has taken over from Galton’s fingerprints as the police ID method of choice. It can be recovered from tissue, bone, or bodily fluids at crime scenes, and can be used to put people in prison, or get them out, even years after a crime was committed.

To get a sense of the biochemistry involved, we’ll follow the lead of the genome project and begin with the simple and unicellular E. coli. This tiny bacterium, barely visible with an optical microscope, resides happily in the human gut, among other places, but will also grow in petri dishes supplied with the correct nutrients. If the instructions on how to make the cell are contained in the DNA, then we should in principle be able to use the DNA sequence, the cell contents, and the environment as the initial condition and the basic physical laws as the model. Running the model forward in time, we should see how the cell evolves—just as meteorologists do with the weather.

The cell, the basic unit of life, is a little world in itself. It is surrounded by a layer of lipid molecules. This wall separates the cell from the outside world, but it also contains pores and specialized proteins that can absorb resources, expel waste, and read chemical signals. In E. coli, about 15 percent of the cell’s mass is protein, 6 percent is RNA, and 1 percent is DNA. As in most cells, water takes up about 70 percent. The cellular material is localized in organized structures. It is no more meaningful to talk about the “concentration” of a protein in water, as is typically done in chemistry, than it is to talk about the concentration of Hawaiians in the Pacific.

One problem with modelling a cell is that it is too large: an E. coli cell contains about 2 to 3 million protein molecules. It is also too small. You can go out and measure the pressure of the atmosphere or watch the formation of clouds, but it is extremely hard to observe individual cells without disturbing them (though new nanotechnologies hold the promise of reducing the measuring equipment to a scale suitable for a cell).58

To capture the process by which the bacterium’s DNA instructions are read, the model must simulate the transcription of DNA to RNA, the subsequent translation of RNA to protein, and the folding of the protein into its final shape. While figure 5.3 (see page 187) looks extremely simple, the reality is more like a complicated industrial process, with many different inputs. Each step involves molecular machines that take up a great deal of the cell’s space. The DNA molecule is much longer than the cell, so it’s tightly folded and held in place by proteins. For a particular gene to be transcribed, the relevant portion of DNA must be made accessible to specialized polymerase molecules. The transcription process is also controlled by proteins that dock onto binding sites near the gene’s DNA. These proteins may either repress or promote transcription, and often act in concert, so that a gene is not transcribed unless several different proteins are correctly bound.

The RNA molecules produced by transcription are comparable in length to the cell and typically number in the thousands. In bacteria like E. coli, an RNA molecule begins to be translated by ribosomes even before it detaches from the DNA. The ribosomes assemble amino acids into proteins according to the sequence specified in the RNA. Once a protein has been produced, it must be folded into the correct shape to function properly. Since a protein containing a string of perhaps thousands of amino acids can fold into a number of different shapes, specialized chaperone molecules help it form, rather like clowns who twist long rubber balloons into amusing shapes.

The protein then heads off to interact with other proteins, form complexes, and take part in some other aspect of the cell’s metabolism (from maintaining the cell’s structure or transporting molecules to transmitting information or handling waste). This is when things get complicated, as shown in figure 5.4. Metabolic processes include positive feedback loops that amplify signals, negative feedback loops that control them, cascades of reactions that transmit signals, and so on.

While a bacterial cell is certainly a dynamical system (the word “metabolism” is derived from the Greek metabole, for change), its dynamics do not resemble those of the planets, where the normal rules of physics can be applied. They are more like those of a city, with a myriad of players engaged in every kind of activity: industry, energy, consumption, garbage disposal, management, maintenance, communication, security. The system becomes even more complicated in eukaryotes such as yeast. In these the cell is divided into a number of specialized compartments known as organelles. DNA is constrained inside a nucleus, where transcription takes place, and proteins are transported from compartment to compartment inside coated vesicles. They ride the bus to work instead of walking. Multicellular organisms like mammals have completely separate layers of complexity.

FIGURE 5.4. Interactions between different proteins in a yeast cell. The nodes represent different types of protein, and the lines between two nodes mean those proteins interact (by joining to form a complex, for example). Systems biologists try to make sense out of this.59

Clearly, modelling even a unicellular bacterium such as E. coli is an enormous challenge. The cellular machinery lacks what the Pythagoreans referred to as “the universal harmony and consonance of the spheres.” It is hard to imagine determining even the correct initial condition for the cell, since this would have to include its entire structure, the configuration of proteins, and so on. The only option is to pass over all these details, make a number of gross simplifications, and hope that the resulting low-resolution model will capture the essence of the underlying dynamics. The model could then be used, for example, to determine how changes to particular genes affect the phenotype—information that might prove useful in the study of human disease.

COMPLICATIONS

The simplest way to build such a model is to treat the cell like a test tube full of chemicals interacting according to the rules that govern large-scale chemical reactions. Because it is not possible to track every single reaction, the model must focus on an isolated subset of the complete system. The result is a large set of differential equations that can be solved on the computer. Such an approach, while straightforward, has a number of drawbacks. The use of differential equations assumes that the chemical concentrations vary in a continuous way. However, RNA molecules are produced from DNA and later degrade in a discrete fashion, so their numbers will vary in a somewhat random manner. Complex organisms use multiple control systems to reduce the effects of this stochastic noise, as it is known (the word “stochastic” comes from the Greek stokhos, which was a pointed stake used by archers as a target). Stochastic noise can be captured to an extent by modelling the reactions as discrete, random events between individual molecules, which is more realistic but computationally expensive.60
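
The contrast between the two approaches can be sketched for the simplest possible case, a single RNA species that is produced at a constant rate and degrades at a rate proportional to its abundance. The code below is a minimal illustration under stated assumptions: the rate constants are arbitrary, the function names are hypothetical, and the random-event version follows the general idea of Gillespie's stochastic simulation algorithm, not any specific model cited in the text.

```python
import random

def deterministic_mrna(k_make, k_decay, t_end, dt=0.01):
    """Continuous view: Euler integration of dm/dt = k_make - k_decay * m."""
    m, t = 0.0, 0.0
    while t < t_end:
        m += (k_make - k_decay * m) * dt
        t += dt
    return m

def stochastic_mrna(k_make, k_decay, t_end, rng):
    """Discrete view: each production or decay is a separate random event."""
    m, t = 0, 0.0
    while True:
        rate_make = k_make
        rate_decay = k_decay * m
        total = rate_make + rate_decay
        # Waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total)
        if t > t_end:
            return m
        # Choose which event happened, in proportion to its rate.
        if rng.random() * total < rate_make:
            m += 1
        else:
            m -= 1

if __name__ == "__main__":
    rng = random.Random(0)
    # Arbitrary illustrative rates: 2 molecules made per unit time, 10% decay.
    print("differential equation:", round(deterministic_mrna(2.0, 0.1, 50.0), 2))
    print("five stochastic runs:", [stochastic_mrna(2.0, 0.1, 50.0, rng) for _ in range(5)])
```

The differential equation returns one smooth number, while each stochastic run ends on a different whole number of molecules scattered around that value. The price is that the stochastic version must step through every single event, which is what makes it expensive for anything larger than a toy system.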

Another modelling difficulty is that measurements of reaction rates and chemical properties in a laboratory setting may be highly misleading, because cellular context matters. If solutions containing two chemicals are mixed together in a test tube, the reaction rate depends on the mixing process and will obey simple statistical laws. In the cell, however, protein traffic is carefully guided and controlled. Therefore, techniques that can be used to model chemical reactions for millions of molecules in a solution are simply not appropriate inside a small cell, where interactions are controlled by local effects. Finally, a model that focuses on only a subset of reactions will have to omit a large part of the system, and perhaps miss important interactions.

Modelling of biological systems is an area of intense research activity, and new algorithms and techniques are constantly being developed. However, as modellers add more detail—say, by simulating the fine structure of the cell—they run into problems trying to verify it properly against experimental data. In biological models, there always seem to be far more parameters than there are quantities that can be measured, and as more detail is added, the number of parameters explodes. It is therefore impossible to deduce their correct values from experiments. The models suffer from the same problem as the Greek Circle Model did: they are too flexible.61 As Will Keepin put it, modellers can pull the levers and make the model do whatever they want. This is common to all the systems discussed in this book. It is the signature of uncomputability.
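
A deliberately tiny example shows how a model can be too flexible. Assume, purely for illustration, a molecule that is made at rate k_make and degrades at rate k_decay, and suppose an experiment measures only its settled, long-run level. The parameter values and function names below are hypothetical.

```python
import math

def steady_state(k_make, k_decay):
    """Long-run level of a molecule made at k_make and degraded at k_decay."""
    return k_make / k_decay

def level_at(k_make, k_decay, t):
    """Level at time t, starting from zero, for dm/dt = k_make - k_decay * m."""
    return steady_state(k_make, k_decay) * (1.0 - math.exp(-k_decay * t))

if __name__ == "__main__":
    slow = (2.0, 0.1)   # assumed parameter set A
    fast = (20.0, 1.0)  # assumed parameter set B
    # Both sets reproduce the same measured steady-state level...
    print("steady states:", steady_state(*slow), steady_state(*fast))
    # ...but disagree sharply about a quantity that was not measured.
    print("levels at t = 1:", round(level_at(*slow, 1.0), 2), round(level_at(*fast, 1.0), 2))
```

Two parameters and one measurement already leave a whole family of equally good fits. A model with hundreds of rate constants and a handful of measurable quantities has the same problem on a much larger scale.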

LOCAL OR GLOBAL

The more closely we study cellular dynamics, the clearer it becomes that they depend on a multitude of local effects, acting on individual molecules, that cannot be neatly captured by equations. The initial condition is given by the DNA and the starting distribution of proteins and other molecules, and the rules are the laws of chemistry and physics. Yet even perfect knowledge of both these things would not allow us to predict the system’s state after a certain length of time. For that, we’d need to run the system itself. The cell’s characteristics, or traits, are analogous to emergent properties of complex systems, which elude computation (see figure 5.5 on page 207). It is easy to devise models that appear to fit past data, but this does not mean the models can predict the future.

This problem is compounded in multicellular organisms, like humans. We contain about 100 trillion cells, in a hierarchy where each level depends on the level below and above it. The body is made up of organs, which in turn are made up of separate components, which are made up of cells, which are made up of smaller organelles, and so on, down to the level of molecules. Since there is a constant flow of information between these levels, it is not possible to draw an arbitrary line at some level of complexity. Like clouds or turbulent flow, living beings exhibit structure over a wide range of scales.

It may seem surprising that DNA cannot be read like a book: after all, it is just a string of information, a long text. If we can arrange the cloning of a sheep from a string of DNA, surely we can build a computer model that will predict a phenotype from a DNA sequence. Even the miraculous development of a human being, through the fixed stages of conception, embryo, fetus, and birth, seems to follow a regular, repeatable path, as if carrying out a set of detailed instructions. But the development of multicellular organisms owes less to dictates from a “master molecule” than it does to many small, local decisions. Specific proteins, known as adhesion molecules, cause cells to cluster together or to slide along one another, resulting in the surfaces and folds that define tissues and organs. As the biologist Richard Lewontin put it, “At every stage it is the local interactions of cells and tissues that determine the further movement, division, and differentiation of cells in the locality, which lead to yet further local interactions, and so on to adulthood.”62 The fact that this process appears to be machine-like and, to a degree, reproducible does not mean that it is predictable. If the Game of Life is started twice from the same initial conditions, it will evolve in exactly the same way, but that is of no help at all in guessing its future.
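
The Game of Life makes the distinction between reproducible and predictable easy to see. The following is a minimal sketch of Conway's standard rules on an unbounded grid; the function names and the choice of the well-known "glider" as a starting pattern are illustrative assumptions.

```python
from collections import Counter

def step(live):
    """Advance one generation; `live` is a set of (x, y) coordinates of live cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next if it has 3 neighbours, or 2 and is already alive.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}

def run(live, generations):
    """Apply the local rule repeatedly; there is no shortcut past this loop."""
    for _ in range(generations):
        live = step(live)
    return live

GLIDER = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}

if __name__ == "__main__":
    first = run(GLIDER, 4)
    second = run(GLIDER, 4)
    print("same result both times:", first == second)
    print("pattern after 4 generations:", sorted(first))
```

Running the rule twice from the same start always gives the same answer, and for the glider the answer after four generations is simply the original shape shifted one cell diagonally. For an arbitrary starting pattern, though, the only general way to learn the outcome is to carry out every step, which is the sense in which the system has to be run, not solved.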

FIGURE 5.5. You can’t get there from here. Water molecules and DNA may obey simple, locally applied physical laws, but that does not mean we can compute the properties of clouds or organisms.

NEGATIVE AND POSITIVE

The lack of a direct, computable link between genotype and phenotype means we cannot translate DNA from the bottom up. To find out what a baby will look like, or how healthy it will be, we will still need to wait around and see; it won’t suffice to plug the baby’s DNA sequence into a computer. It is still possible to build mathematical models using a top-down engineering approach, and these models are useful in understanding the processes behind human health. However, their accuracy is limited by another factor, which has to do with the nature of living systems. Again, the example I give is unicellular, but the principles apply to more complex organisms, and indeed to all the systems studied in this book.

One genetic network that has come under much scrutiny from biologists in recent decades is the galactose metabolism in baker’s yeast. When a yeast cell’s food supply switches from its preferred carbon source of glucose to a slightly different sugar, known as galactose, the yeast turns on a particular set of genes, which then produce proteins that digest the galactose. As an experimental organism, yeast has a number of favourable qualities. This microscopic fungus was probably the first organism to be domesticated by man. Without it, we would have neither bread nor beer. It is easy to grow, reproduces quickly, and shares surprisingly many genes with humans. Indeed, about a quarter of the 5,000 or so genetic diseases that afflict humans have some kind of analog in yeast, including galactosemia. A child with this condition cannot digest the galactose in milk, so must avoid it or risk death from the buildup of toxic metabolic by-products. The condition can be treated by a change of diet.

A common feature of genetic networks is auto-regulation, in which proteins regulate their own transcription. There are two types, positive and negative, which correspond to positive and negative feedback. In electronic circuits, feedback loops are used to amplify or damp electrical signals, and they play a similar role in biological systems. Positive feedback, for example, is used to provide a rapid response to a signal, but it has to be carefully controlled. If you cut your finger, a cascade of reactions incorporates positive feedback to create a clot and stop the bleeding. If this system goes wrong, it can cause thrombosis.

The galactose network incorporates both types of feedback loops, but in opposition. When galactose is detected, the yeast cell uses positive feedback to accelerate the production of the proteins required to metabolize galactose. But at the same time, a protein responsible for repressing transcription of the same proteins is also upregulated. This acts as a partial brake on the system: negative feedback. Both the brake and the accelerator are applied at the same time, as if the organism is ramping up its internal tension. This type of antagonistic action is ubiquitous in biological systems; the level of glucose in blood, for example, is controlled by the hormones insulin and glucagon, which pull in opposing directions, like the two hands of an archer.63 If the hormones get out of balance, it can lead to diabetes. Our appetite is suppressed by one hormone, obestatin, and boosted by a second, ghrelin, which are both produced from the same gene (the RNA is edited differently).64
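
The idea of pressing the accelerator and the brake together can be sketched with two coupled equations. This is a cartoon, not the real galactose network: the equations, the rate constants, and the names `simulate` and `brake_strength` are all assumptions chosen only to show the qualitative behaviour.

```python
def simulate(galactose_on, brake_strength, t_end=50.0, dt=0.01):
    """Euler integration of a toy circuit with opposing feedback loops.

    x: level of the galactose-digesting proteins (arbitrary units)
    r: level of a repressor that is produced alongside them
    """
    x, r = 0.0, 0.0
    signal = 1.0 if galactose_on else 0.0
    for _ in range(int(t_end / dt)):
        # Accelerator: the more protein there is, the faster it is produced.
        accelerator = 0.2 + 2.0 * x**2 / (1.0 + x**2)
        # Brake: the repressor divides down the production rate.
        brake = 1.0 + brake_strength * r
        dx = signal * accelerator / brake - 0.5 * x
        dr = signal * 0.5 * x - 0.5 * r
        x += dx * dt
        r += dr * dt
    return x, r

if __name__ == "__main__":
    print("no galactose:", simulate(False, 1.0))
    print("accelerator only:", simulate(True, 0.0))
    print("accelerator and brake:", simulate(True, 1.0))
```

With no galactose nothing is produced. With the accelerator alone the protein level climbs to a high plateau, and with the brake also engaged it settles at a lower level, held there by two opposing pulls.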
