by Edwin Kirk
James Watson’s genome was sequenced before many people had had exome sequencing done, and well before large databases of exome and genome data became available. Watson’s ten recessives had all been found in people with genetic conditions, and reported in papers published over the preceding decade or so. For various reasons, every single one of those reports was, not to put too fine a point on it, wrong. For example, Watson was found to have a change in a gene linked to severe eye disease, called RPGRIP1. He had one copy of the gene with the usual amino acid found at position 547, alanine, and one copy with a different amino acid, serine.
In 2003, researchers from Pakistan reported a large family in which eight members with a degenerative eye disease all had two copies of the same change found in Watson; it also cropped up in two smaller families. At the time, the standard way to check if something you’ve found is just a normal variation was to check 100 people from the same population (‘population controls’) to see if they also had the change. Sequence 100 people and you get 200 copies of the gene, so you’d think that if something is common and harmless in that group of people, you’d have a good chance of picking it up. To save money, the researchers used a cheap screening test rather than reading the sequence of the gene — and in hindsight, that test must have failed, because they didn’t find the variant in any of the controls.
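To put rough numbers on the ‘200 copies’ argument, here is a minimal Python sketch. It assumes the controls are unrelated and that every copy of the gene is an independent draw from the population; the function name and the example frequencies are invented for illustration and are not taken from the 2003 study.

```python
def prob_seen_in_controls(allele_freq: float, n_people: int = 100) -> float:
    """Chance that at least one of the 2 * n_people gene copies carries the variant.

    ASSUMPTION: copies are independent draws (unrelated controls, working test).
    """
    n_copies = 2 * n_people          # two copies of the gene per person
    return 1.0 - (1.0 - allele_freq) ** n_copies


# Illustrative allele frequencies only; none of these come from the text.
for freq in (0.001, 0.01, 0.05, 0.25):
    chance = prob_seen_in_controls(freq)
    print(f"variant on {freq:.1%} of copies: seen in 100 controls with probability {chance:.3f}")
```

On these assumptions, a variant carried on 1 in 100 copies of the gene turns up in roughly 87 per cent of such control sets, while one carried on 1 in 1,000 copies turns up in only about 18 per cent of them. A hundred controls catch common variation well and rare variation poorly.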
Apart from getting the screening test wrong, you can’t blame the Pakistani group for thinking they had found the answer in their patients. Alanine and serine have some different chemical properties, although they are far from being the most dissimilar pair of amino acids. Finding the same genetic change in 12 different people (across the three families) with the same condition would usually be very strong evidence of a link to that condition.
By 2005, a Dutch group had already reported that the variant was way too common to be a cause of a rare eye condition, but this information must have been missed by the team who sequenced Watson’s genome. Thanks to Daniel MacArthur and his team, we now know that the variant is common in much of the world; nearly half of the people from a European background in the gnomAD database have one or two copies of the variant, and there are nearly 7,000 people in the database (out of 140,000 total from all backgrounds) who have two copies of the variant — i.e. both of their copies of the gene are this version of it. It’s simply impossible for this variant to be the cause of a rare condition; and it’s not even slightly surprising that you might find it if you sequence the genome of someone from a European background, like Watson.
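The ‘simply impossible’ argument is arithmetic at heart, and can be sketched in a few lines of Python. The counts below are the rounded figures quoted above; the disease prevalence of 1 in 10,000 is an assumption chosen for illustration, not a figure from the text, and the Hardy-Weinberg step assumes one well-mixed population, which a pooled worldwide database is not.

```python
import math


def homozygote_excess(homozygotes: int, total_people: int, disease_prevalence: float) -> float:
    """How many times more common two-copy carriers are than the condition itself."""
    return (homozygotes / total_people) / disease_prevalence


def allele_freq_from_homozygotes(homozygous_fraction: float) -> float:
    """ASSUMPTION: Hardy-Weinberg proportions, so homozygous fraction = q squared."""
    return math.sqrt(homozygous_fraction)


homozygous_fraction = 7_000 / 140_000      # rounded figures from the text
ASSUMED_PREVALENCE = 1 / 10_000            # hypothetical prevalence of a rare condition

print(f"people with two copies: {homozygous_fraction:.1%}")
print(f"rough allele frequency: {allele_freq_from_homozygotes(homozygous_fraction):.2f}")
print(f"excess over assumed prevalence: {homozygote_excess(7_000, 140_000, ASSUMED_PREVALENCE):.0f}x")
```

If about one person in 20 carries two copies, a recessive condition caused by the variant could hardly be rare, whatever exact prevalence you assume.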
Over the past decade, it has become uncomfortably obvious that it is very easy to get it wrong in genetics. It’s tempting to criticise the Pakistani group whose screening test went wrong (the variant is nearly as common in South Asians as it is in Europeans, so, if their test had worked, they would surely have found it in some of their 100 controls), but the problem has turned out to be widespread across the genetics literature. Population data alone aren’t the whole solution, unfortunately, because there is variation that is harmless but rare, as well as variation that is harmless and common.
The field of cardiac genetics has been particularly hit by the issue of genetic variants that are wrongly classified as disease-causing. In 2012 and 2013, a Danish group, led by Morten Olesen, trawled through the medical literature about inherited diseases of heart muscle and of heart rhythm, and compared variants that had been published as disease-causing against the very first public exome database, the Exome Variant Server. This had information from just 6,500 exomes but was a treasure trove of information when it was first released. Olesen’s team found that the cardiac genetics literature was littered with mistakes, with many of the ‘disease-causing’ changes being far too common in the population. They calculated that if all of these were really disease-causing, one in four people would have a heart-muscle condition, hypertrophic cardiomyopathy; one in six would have another, dilated cardiomyopathy; and one in 30 would have long QT syndrome, a problem with the heart’s electrical rhythm. The implication was clear — a lot of the variants that had been described as harmful were, in reality, harmless.
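The logic of that calculation can be sketched as follows. This is not the Danish group’s method or data: the variant names and carrier frequencies are hypothetical, the variants are assumed to be independent of one another, and the ‘observed’ prevalence is a placeholder parameter.

```python
from math import prod


def implied_prevalence(carrier_freqs) -> float:
    """Fraction of people carrying at least one listed variant.

    ASSUMPTION: variants occur independently, and each would cause disease
    in anyone who carries it (as the published claims imply).
    """
    return 1.0 - prod(1.0 - f for f in carrier_freqs)


# HYPOTHETICAL carrier frequencies for variants published as disease-causing.
reported_variants = {
    "variant_A": 0.08,
    "variant_B": 0.10,
    "variant_C": 0.05,
    "variant_D": 0.0005,
}
ASSUMED_OBSERVED_PREVALENCE = 1 / 500   # placeholder for how common the condition really is

implied = implied_prevalence(reported_variants.values())
too_common = [name for name, f in reported_variants.items() if f > ASSUMED_OBSERVED_PREVALENCE]

print(f"implied prevalence if all claims were true: 1 in {1 / implied:.1f}")
print("individually more common than the condition:", too_common)
```

Any variant carried by more people than have the condition cannot, on its own, be a straightforward cause of it, and a list of such variants implies a prevalence far higher than anything seen in real life.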
Perhaps even worse, it became apparent that not only are some variants wrongly reported as disease-causing, there are plenty of genes that have been wrongly associated with conditions. Sometimes, this comes in the form of a single report that is never replicated. This is (mostly) fairly harmless. Sometimes, however, genes find their way onto lists and into tests for conditions despite very limited evidence. Again, cardiac genetics is a problem area. For example, the genes CACNB2 and KCNQ1 are both commonly included in panels of genes for testing people with hypertrophic cardiomyopathy, despite there being only a tenuous link between these genes and this condition. This runs the considerable risk that people who are tested will be told that they have a variant in one of these genes, and that it is the cause of their heart condition. All sorts of bad information can flow from this — particularly, family members who do not have heart problems being tested and wrongly reassured, or wrongly told they are at risk.
The discovery that we had, as a field, been getting it wrong so often led to a worldwide swing towards caution in interpreting genetic data. While this is mostly appropriate, it carries its own risks. There are all sorts of potential consequences when we make mistakes, and it cuts both ways: there are harms from wrongly reporting that a variant is harmful, and harms from wrongly reporting that a variant is harmless. If we tell someone their child has a condition that they actually don’t have, it might lead to mistakes in treatment. It might mean we give them wrong information about the chance another child might be affected; they could have a prenatal diagnostic test that leads to termination of a normal pregnancy, or a failure to identify an affected fetus. If we wrongly fail to report a variant that is in fact disease-causing, on the other hand, that might also mean someone misses out on treatment that might help them. It might mean that someone who could have been reassured that they had a very low chance of having another child with a severe condition misses out on that reassurance. If you’re worried about this happening and it affects your thinking about having more children, we describe you as having ‘lost reproductive confidence’ — and it might mean you also lose the chance to have another, healthy child.
So we have to be Goldilocks geneticists: we can’t be too hot when we make a call on whether a variant is disease-associated or not, and we can’t be too cold. The genetic porridge must be just right.
Getting it right can sometimes be really hard. It’s easy if we have population evidence that says, ‘This thing is just too common to be harmful.’ It’s easy if we see something that has been seen dozens of times before in affected individuals and never in the general population. It’s most of the in-between that presents a challenge.
You might think that computers are the answer. If so, you wouldn’t be the first. There is a long tradition of someone looking at the problem and thinking, ‘I know! I’ll write a computer program that can tell nice from nasty.’ There have been a few different approaches to this, mainly directed at the scenario in which one amino acid is switched for another (generally, if the change is to ‘stop’, it’s not as hard to work out what it means). Some of the programs look at the chemical changes. Some look at conservation through the course of evolution. By now, there are a LOT of organisms that have had their genomes sequenced. If you’re interested in a change from — say — alanine to serine, you could get your shiny new program to have a look at that position in the equivalent protein in creatures that are less and less similar to humans, or at the same position in similar proteins (in humans and other animals).
If you’d done that for Watson’s RPGRIP1 variant, the news would have been equivocal: apes and monkeys all have an alanine at that spot, as do most rodents, and camels, cows, and killer whales; elephants and bats have an alanine, and so do aardvarks and armadillos. But squirrels and Cape golden moles have something different in that location; so do budgies and ducks.
The star-nosed mole even has a serine there! It’s one of the things it has in common with James Watson, along with being a warm-blooded, hairy, four-limbed animal. Not many moles have Nobel prizes, though. Overall, the evolutionary evidence is not especially strong evidence for this particular change being damaging to the protein (even if we didn’t have the population information).
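A toy version of that conservation check might look like the Python sketch below. The species and residues follow the description above; a `"?"` marks the animals that have ‘something different’, because the text does not say which amino acid they carry. The scoring is a deliberately crude assumption (every species counts equally) and is not how any named prediction program works.

```python
HUMAN_RESIDUE = "A"     # alanine, the usual amino acid at this position
VARIANT_RESIDUE = "S"   # serine, the version found in one of Watson's copies

# One entry per group mentioned in the text; "?" = differs from alanine, identity not stated.
alignment_column = {
    "apes and monkeys": "A", "most rodents": "A", "camel": "A", "cow": "A",
    "killer whale": "A", "elephant": "A", "bat": "A", "aardvark": "A",
    "armadillo": "A", "squirrel": "?", "Cape golden mole": "?",
    "budgie": "?", "duck": "?", "star-nosed mole": "S",
}


def conservation_summary(column: dict, reference: str, variant: str):
    """Return (fraction of species matching the reference, species that carry the variant residue)."""
    matching = sum(1 for residue in column.values() if residue == reference)
    carries_variant = sorted(sp for sp, residue in column.items() if residue == variant)
    return matching / len(column), carries_variant


fraction_conserved, species_with_variant = conservation_summary(
    alignment_column, HUMAN_RESIDUE, VARIANT_RESIDUE
)
print(f"conserved in {fraction_conserved:.0%} of the listed groups")
print("variant residue is the normal one in:", species_with_variant)
```

Real tools weight species by how distantly related they are and look at far more than 14 entries, but the two questions are the same: how often is the usual amino acid kept, and does any healthy animal already live with the ‘variant’ version?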
Sometimes, we see amazing conservation. For example, in a child with severe epilepsy, we found a change in a protein in which the particular amino acid that was altered (proline) was the same in every species that had been sequenced, back to oysters and amoebas. If nature thinks it needs a proline right here, and can’t abide substitution in all the hundreds of millions of years since our paths diverged from amoebas and oysters … there’s a good chance that amino acid really, really needs to be a proline.
Anyway, back to designing your computer program. You don’t have to restrict yourself to chemistry or conservation; you could combine both. Or you could gather up a bunch of scores from other people’s programs and combine them to help make your new score.6 Now, calibrate your shiny new program on a big set of variants that you already know are harmful or harmless, check it against another set, give it a clever name, and publish a paper, explaining —
[6 Or — why not? — you could compare the amino acid change with the hypothetical common ancestor between humans and chimps. I’m not making this up; this is part of the basis of one of the most successful programs, CADD.]
Explaining that, after all this effort, you’ve made something just a little better than what was there before. Of course, you won’t frame it that way, but that’s the best you can hope for, it seems. And just a little better really amounts to not much good at all. Pick any of the more than 20 such programs that are already out there, and you’ll find that everybody has hundreds, if not thousands, of variants that the program thinks are likely to cause problems. They aren’t too bad at picking the harmless variants, but considering that the starting point for any one change you find in a person’s genome is that it’s likely to be innocuous, that’s not a great achievement.
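Why a seemingly accurate program still flags hundreds of harmless variants comes down to base rates, which a short Python sketch can show. All four numbers are assumptions for illustration: a list of 40,000 variants with a single causal change, and a hypothetical program with 95 per cent sensitivity and 98 per cent specificity. None describes a real tool.

```python
def flagged_and_ppv(n_variants: int, n_causal: int, sensitivity: float, specificity: float):
    """Return (expected number of variants flagged, chance that a flagged variant is truly causal)."""
    n_harmless = n_variants - n_causal
    true_positives = n_causal * sensitivity              # causal variants correctly flagged
    false_positives = n_harmless * (1.0 - specificity)   # harmless variants wrongly flagged
    flagged = true_positives + false_positives
    ppv = true_positives / flagged if flagged else 0.0
    return flagged, ppv


# HYPOTHETICAL performance figures, chosen to be flattering to the program.
flagged, ppv = flagged_and_ppv(n_variants=40_000, n_causal=1, sensitivity=0.95, specificity=0.98)
print(f"variants flagged as likely harmful: about {flagged:.0f}")
print(f"chance that any one flagged variant is the real culprit: {ppv:.2%}")
```

Even with figures this generous, the one real answer is buried among hundreds of false alarms, because almost everything the program looks at is harmless to begin with.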
The reason why all these efforts to write a program that can do the job have failed comes down to the nature of the task. It looks like what you need to do is to put the things you’re classifying into one of two bins: an enormous industrial container full of benign and only mildly harmful variation, and a golden eggcup that contains the one or two variants that were the reason we did the test in the first place. You have 39,999 harmless apples and one orange. The problem is that your starting point in this apples-from-oranges task is not really a pile of 40,000 pieces of fruit. There are all sorts of different ways that changing an amino acid can cause problems.
Perhaps the chemical properties of the two different amino acids are so different that it causes the protein not to fold properly. Or perhaps a protein forms but is unstable and doesn’t last long enough to be useful. Or the protein may form, but extra modifications that are needed — like tacking on sugars — can’t happen. Or there is a perfect protein, but it can’t get to the place inside the cell where it is needed. Or the amino acid change isn’t the problem at all: the DNA change messes up the splicing process, causing a completely different type of issue. There are other possibilities, but you get the point. The computer programs are valiantly trying to sort fruit, but they are being presented with a mixture of fruit, roundish rocks, tennis balls, and sea urchins. It’s no wonder they don’t do a great job.
So, we have population data — powerful, but limited.7 We have prediction software — slightly better than nothing. We have information that’s published in the medical literature — that we know is riddled with errors. Not looking too promising so far, is it?
[7 One of the limitations is that many populations aren’t represented in the databases, so we don’t know much about their normal variation. Arabs, for instance, and Pacific Islanders, and Indigenous Australians.]
Fortunately, there are a number of other pieces of information that are useful some of the time. One of the most powerful of these is the information we get from the doctor who saw the patient. At the most basic level, if you are doing a test to find out why someone has severe epilepsy, and you find a variant in a gene that has only ever been linked to a skin condition, it’s not very likely that you have your answer. Other information includes things such as how a variant tracks in a family, whether it affects a part of the protein known to be critically important, and (sometimes) tests that can directly measure the function of the changed version of the protein.
Put all of the available information together, and you may well be able to come up with a reasonable answer.8 Classifying genetic variants is one of the most challenging and interesting parts of my job, although the pressure to stay in the Goldilocks zone can be stressful when it’s a close call. Like most labs around the world, we put the variants we assess into one of five categories. Benign (class 1) is a variant that we are very confident is harmless, often because it’s too common to be anything else (like Watson’s RPGRIP1 variant). Likely Benign (class 2) is a variant for which there’s good evidence it’s harmless, but not quite enough to put it in the Benign group. Pathogenic (class 5) means we’re close to certain that the variant can cause problems. Likely Pathogenic (class 4) means that there’s strong enough evidence that a doctor can use the information for making medical decisions, but there’s not quite enough evidence to make it class 5. For both Likely Benign and Likely Pathogenic variants, there’s an appreciable chance (which notionally can be as high as 10 per cent) that the truth lies in the other direction.
[8 There are various systems out there to help make these assessments. The most popular and widely used guidelines were published in 2015 by the American College of Medical Genetics and Genomics; they aren’t perfect, but they are pretty good, and they have the advantage that everybody in the field knows about them, even those who don’t personally use them.]
In the middle are the Variants of Uncertain Significance (class 3). This means what it says: we’re not certain if this is a problem or not. This classification has been described as genetic limbo, in the sense of the place you get stuck in rather than the bar you try to wiggle under. If there’s not quite enough evidence to call a variant Likely Pathogenic or Likely Benign, or if there are pieces of evidence that point in opposite directions, then what you have is a VUS. The most important — and often difficult — decisions we make relate to variants that skate on the border between VUS and Likely Pathogenic. Make the wrong call, in either direction, and people may suffer because of it. These are the unknowns that keep me up at night. Have I wrongly called a variant Likely Pathogenic, leading a patient and her doctor down the wrong path? Have I wrongly classified one as a Variant of Uncertain Significance, denying a patient and her doctor options that might have helped?
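The five categories can be pictured as bands along a single scale of how likely a variant is to be harmful. The Python sketch below is an assumption-laden simplification: real classification weighs separate pieces of evidence rather than one probability, the 10 per cent margin comes from the ‘notional’ figure mentioned above, and the 1 per cent margin for the two confident classes is a made-up placeholder.

```python
from enum import IntEnum


class VariantClass(IntEnum):
    BENIGN = 1
    LIKELY_BENIGN = 2
    UNCERTAIN_SIGNIFICANCE = 3
    LIKELY_PATHOGENIC = 4
    PATHOGENIC = 5


def classify(prob_pathogenic: float,
             likely_margin: float = 0.10,    # notional 10 per cent chance of being wrong
             certain_margin: float = 0.01    # HYPOTHETICAL cut-off for 'close to certain'
             ) -> VariantClass:
    """Map an estimated probability of being disease-causing onto the five classes."""
    if not 0.0 <= prob_pathogenic <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if prob_pathogenic >= 1.0 - certain_margin:
        return VariantClass.PATHOGENIC
    if prob_pathogenic >= 1.0 - likely_margin:
        return VariantClass.LIKELY_PATHOGENIC
    if prob_pathogenic <= certain_margin:
        return VariantClass.BENIGN
    if prob_pathogenic <= likely_margin:
        return VariantClass.LIKELY_BENIGN
    return VariantClass.UNCERTAIN_SIGNIFICANCE


for p in (0.001, 0.05, 0.5, 0.95, 0.995):
    print(f"{p:>6}: class {int(classify(p))} ({classify(p).name})")
```

Seen this way, the hard calls are the ones that sit just either side of the line between class 3 and class 4, where a small shift in the evidence changes what a doctor is allowed to act on.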
And if that’s not difficult enough, consider that everyone has not one but two different genomes.
6
Power!
The ships hung in the sky in much the same way that bricks don’t.
DOUGLAS ADAMS
If, by chance, you are one of those who live in fear of an alien invasion, you are way, way behind the times. The invasion happened a very long time ago, and the aliens are already here. They are living not just among us, but inside us.
The story of life on Earth — which is your story, and mine — is very, very old. We don’t know exactly when life began, but one estimate is 3.8 billion years ago. Imagine that the ultimate genealogist has volunteered to trace your family tree. Perhaps you already know a bit about your family: you know your parents, and their parents. You have some information about your grandparents and maybe your great-grandparents, an old photograph of stiffly-dressed, unsmiling people from a hundred years ago. If you’re a keen genealogist, you may have a family tree reaching back past that — for a few hundred years, or even further. But at some point, you’re going to hit a dead end. The earliest person whose name we know lived in Mesopotamia a little over 5,000 years ago (his name was Kushim). Even if you assume a generation every 20 years, that’s only 250 generations of ancestors … and you can’t ever trace your line back even as far as Kushim, because we don’t know anything about his family.
Our ultimate genealogist is going to take you back a lot further. Five thousand years of written history spans only the last 2 per cent or so of the time that modern humans have been around. Before them, there were protohumans, who in turn traced their origins through the primate line, and back to early mammals. Imagine if you could line up a series of photographs of your ancestors in one continuous line — all the women, perhaps — along a long, long wall. As you walked along that wall, starting with your mother and her mother, and hers, you’d see nothing but humans, or creatures that looked much like humans, for a very long way. If there were three pictures every metre, you’d see nothing but human beings for two kilometres. Somewhere around there, a gradual change might happen. The people — and they would still be people, just not quite Homo sapiens — would become shorter, and furrier; eventually, many more kilometres down the corridor, you’d reach the first of your ancestors to walk on two legs. Back, through millions of years; the earliest mammals emerged about 200 million years ago. Keep going, for thousands of kilometres, and the family photos change again, to creatures that crawled on the banks of some warm sea, 400 million years ago — closer to the fishes from which they descended. But you can still trace the line, from mother to mother to mother. The journey back from the first fish to leave the sea to the first fish worthy of the name is just another hundred million years or so. Back and back; 900 million years into the past, and we’re down to the first, primitive multi-celled creatures.
The alien invasion happened a billion years before that.
For quite a long time now, your walk along the wall of family portraits has been something of a dull affair. Single-celled organism after single-celled organism, all looking much the same, with only the slightest changes over periods of millions of years. At last, though, you start to notice something strange. The single-celled organisms aren’t really single cells. Inside the tiny creature, there are others. In fact, those others were always there but had seemed just another part of the whole, unremarkable little blobs within each cell. Now it’s clear that these objects are different from that single-celled creature’s other components. If it were a video portrait rather than a still picture, you’d see that they move around freely within the cell; it’s clear that, although comfortably at home, they are leading a life of their own. Perhaps they even look to you like parasites rather than part of the whole. They are very far from that, however.