Super Crunchers

by Ian Ayres

I did a very crude calculation, which I’m sure was wrong and certainly was unsubtle, twenty different ways. I looked…at the evidence on the sex ratios in the top 5 percent of twelfth graders [in science and math]. If you look at those—they’re all over the map [but] one woman for every two men would be a high-end estimate [for the relative prevalence of women]. From that, you can back out a difference in the implied standard deviations that works out to be about 20 percent.

Summers doesn’t say it, but his calculation assumes what researchers have in fact found: there is no pronounced difference in the average math or science scores for male and female twelfth graders. But in a variety of different studies, researchers have found a difference in the tails of the distribution. In particular, Summers focused in on the tendency for there to be two men for every one woman when you looked at the top 5 percent of math and science achievement among twelfth graders. Summers worked backwards to figure out what kind of a difference in standard deviations would give rise to this sex difference in the tails. His core claim, indeed his only claim, of innate difference was that the standard deviation of men’s intelligence might be 20 percent greater than that of women.

Summers in the speech was careful to point out that his calculation was “crude” and “unsubtle.” But Summers is no dummy. He is the youngest person ever to be voted tenure at Harvard. He won the prestigious John Bates Clark award for the best U.S. economist under forty. Two of the three greatest American economists of the twentieth century, Kenneth Arrow and Paul Samuelson, are his uncles, and like them Summers in his early forties was headed straight for a Nobel Prize. He definitely understands standard deviations. But after almost dying from Hodgkin’s disease, Summers chose a different path. Like Paul Gertler of Progresa fame, he became chief economist for the World Bank and eventually went on to be secretary of the treasury at the end of the Clinton administration. He is almost always the smartest person in the room (and his critics say he knows it).

Being smart, however, does not mean that everything you say is right. Summers’s back-of-the-envelope empiricism doesn’t definitively resolve the question of whether women have less variable intelligence. For example, lots of other factors could have influenced the math and science scores of twelfth graders besides innate ability. Yet there have been subsequent studies suggesting that the IQ scores of women are in fact less variable than those of men.

Summers, in suggesting a gendered difference in standard deviations, is suggesting that men are more likely to be really smart, but he’s also implying that men are innately more likely to be really dumb. It’s a tricky question to know whether it is desirable to be associated with a group that has a larger IQ standard deviation. Imagine that you are expecting your first child. You are told that you can choose the range of possible IQs that your child will have but this range must be centered on an IQ of 100. Any IQ within the range that you choose is equally likely to occur. What range would you choose—95 to 105, or would you roll the dice on a wider range of, say, 60 to 140? When I asked this question to a group of fourth and sixth graders, they invariably chose ranges that were incredibly small (nothing wider than 95–105). None of them wanted to roll the dice on the chance that their kid would be a genius if it meant that their kid might alternatively end up as developmentally disabled. So from the kids’ perspective, Summers was suggesting that men have a less desirable IQ distribution.

What really got Summers in trouble was taking his estimated 20 percent difference and using it to figure out other probabilities. Instead of looking at the ratio of males to females in the top 5 percent of the most intelligent people, he wanted to speculate about the ratio of men to women in the top 0.01 percent (roughly 1 in 10,000) of the most scientifically intelligent people. Summers claimed that research scientists at top universities come from this more rarefied stratum:

If…one is talking about physicists at a top twenty-five research university, one is not talking about people who are two standard deviations above the mean. And perhaps it’s not even talking about somebody who is three standard deviations above the mean. But it’s talking about people who are three and a half, four standard deviations above the mean in the 1 in 5,000, 1 in 10,000 class.

To infer what he called the “available pool” of women and men this far out in the distribution, Summers took his estimates of the implicit standard deviations and worked forward:

Even small differences in the standard deviation will translate into very large differences in the available pool substantially out [in the tail of the distribution]…. [Y]ou can work out the difference out several standard deviations. If you do that calculation—and I have no reason to think that it couldn’t be refined in a hundred ways—you get five to one, at the high end.

Summers was claiming that women may be underrepresented in science because for the kinds of smarts you need at a top research department, there might be five men for every one woman. Now you can start to understand why he got in so much trouble. I’ve recalculated Summers’s estimates, using his same methodology, and his bottom-line characterization of the results, if anything, was understated. At three point five or four standard deviations above the mean, a 20 percent difference in standard deviations can easily translate into there being ten or twenty times as many men as women. However, these results are far from definitive. I agree with him that his method might be flawed in “twenty different ways.”
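The backward-then-forward calculation can be sketched in a few lines of Python. This is a rough reconstruction under loudly stated assumptions, not Summers’s or the author’s actual worksheet: scores for each sex are taken to be normally distributed with the same mean, the two groups are the same size, the “top 5 percent” is measured on the combined population, and cutoffs are expressed in female standard deviations. The function names are invented for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def implied_sd_ratio(top_share=0.05, men_per_woman=2.0):
    """Work backwards: male SD / female SD implied by the sex ratio in the top tail.

    Assumes equal means, equal group sizes, and normal distributions.
    """
    # Split the combined top share into each group's own tail probability.
    female_tail = 2.0 * top_share / (1.0 + men_per_woman)
    male_tail = men_per_woman * female_tail
    # The same cutoff score sits at a different z-score for each group.
    female_z = STD_NORMAL.inv_cdf(1.0 - female_tail)
    male_z = STD_NORMAL.inv_cdf(1.0 - male_tail)
    return female_z / male_z


def men_per_woman_above(cutoff_in_female_sds, sd_ratio):
    """Work forwards: ratio of men to women above a cutoff far out in the tail."""
    # By symmetry, the upper tail beyond +z equals the CDF at -z.
    female_tail = STD_NORMAL.cdf(-cutoff_in_female_sds)
    male_tail = STD_NORMAL.cdf(-cutoff_in_female_sds / sd_ratio)
    return male_tail / female_tail


sd_ratio = implied_sd_ratio()
ratio_at_3_5 = men_per_woman_above(3.5, sd_ratio)
ratio_at_4 = men_per_woman_above(4.0, sd_ratio)
print(f"implied SD ratio: {sd_ratio:.2f}")
print(f"men per woman above 3.5 SD: {ratio_at_3_5:.1f}")
print(f"men per woman above 4.0 SD: {ratio_at_4:.1f}")
```

Under these assumptions the implied ratio of standard deviations comes out near 1.2, and the forward step gives roughly 9 men per woman at 3.5 standard deviations and roughly 17 at 4. Changing any of the assumptions (for example, measuring the cutoff in pooled rather than female standard deviations) moves these numbers considerably, which is part of why the method is so fragile.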

Still, I come to praise his process (if not the general conclusion he drew from it). Summers worked both backwards and forwards to derive a probability of interest. He started with an observed proportion, and with it he backed out implicit standard deviations for male and female scientific intelligence. He then worked forwards to estimate the relative number of women and men at another point in the distribution. This exercise didn’t work out so well for Summers. Nonetheless, like Summers, the intuitivist of the future will keep an eye out for proportions and observed probabilities and use them to derive standard deviations (and vice versa).

The news media almost completely ignored the point that Summers was just talking about a difference in variability. It’s not nearly as sexy as reporting “Harvard President Says Women Are Innately Deficient in Mathematics.” (They might as easily have reported that Summers was claiming that women are innately superior in mathematics, since they are less likely to be really bad in math.) Many reporters simply didn’t understand the point or couldn’t figure out a way to communicate it to a general audience. It’s a hard idea to get across to the uninitiated. At least in small part, Summers may have lost his job because people don’t understand standard deviations.

This inability to speak to one another about dispersion hinders our ability to make decisions. If we can’t communicate the probability of worst-case scenarios, it becomes a lot harder to take the right precautions. Our inability to communicate even impacts something as basic and important as how we plan for pregnancy.

Polak’s Pregnancy Problems

Everybody knows that a baby is due roughly nine months after conception. However, few people know that the standard deviation is fifteen days. If you’re pregnant and are planning to take off time from work or want to schedule a relative’s visit, you might want to know something about the variability of when you’ll actually give birth. Knowing the standard deviation is the best place to start. (The distribution is also skewed left—so that there are more pregnancies that are three weeks early than three weeks late.)
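To see what a fifteen-day standard deviation means for planning, here is a rough sketch. It assumes, purely for illustration, that delivery dates are normally distributed around the expected date; the real distribution is skewed left, as noted above, so the outputs are ballpark figures only. The function name is hypothetical.

```python
from statistics import NormalDist

SD_DAYS = 15.0  # standard deviation quoted in the text


def chance_within(days, sd=SD_DAYS):
    """Probability of delivering within +/- `days` of the expected date (normal approximation)."""
    offset = NormalDist(mu=0.0, sigma=sd)
    return offset.cdf(days) - offset.cdf(-days)


for window in (7, 15, 30):
    print(f"within {window} days: {chance_within(window):.0%}")
```

On this approximation, only a bit more than a third of births fall within a week of the expected date, which is a useful thing to know before booking a relative’s flight.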

Most doctors don’t even give the most accurate prediction of the due date. They still often calculate the due date based on the quasi-mystical formula of Franz Naegele, who believed in 1812 that “pregnancy lasted ten lunar months from the last menstrual period.” It wasn’t until the 1980s that Robert Mittendorf and his coauthors crunched numbers on thousands of births to let the numbers produce a formula for the twentieth century. Turns out that pregnancy for the average woman is eight days longer than the Naegele rule, but it’s possible to make even more refined predictions. First-time mothers deliver about five days later than mothers who have already given birth. Whites tend to deliver later than nonwhites. The age of the mother, her weight, and her nutrition all help predict her due date.

Physicians using the crude Naegele rule cruelly set up first-time mothers for disappointment. These mothers are told that their due date is more than a week before their baby is really expected. And there is almost never any discussion about the probable variability of the actual delivery.

So right off the bat, physicians are failing to communicate. As Ben Polak quickly found out, it gets worse. Ben, in his day job, is a theoretical economist. In the typically rumpled raiment of academics, he prowls his seminar rooms with piercing eyes. Polak was born in London and has a proper British accent (when I was reading Jane Eyre, he politely enlightened me as to the correct pronunciation of “St. John”).

On the side, Ben crunches numbers in the fine tradition of Bill James. Indeed, Ben and his coauthor Brian Lonergan have done James one better. Instead of predicting the contribution of baseball players to scoring runs, Polak made a splash when he estimated each player’s contributions to his team’s wins—the true competitive bottom line. There is a satisfying simplicity to Polak’s estimates because the players’ contributions to a team sum up to the number of games the team actually wins.

Ben wasn’t nearly as satisfied about the statistics he was given by health care professionals when his wife, Stefanie, was pregnant with their first child, Nelly. And the problem was a lot more important than predicting their due date. Ben and Stefanie wanted to know the chance that their child would have Down syndrome.

I remember when my partner, Jennifer, and I were expecting for the first time—back in 1994. Back then, women were first told the probability of Down syndrome based on their age. After sixteen weeks, the mother could have a blood test measuring her alphafetoprotein (AFP) level, and then they’d give you another probability. I remember asking the doctor if they had a way of combining the different probabilities. He told me flat out, “That’s impossible. You just can’t combine probabilities like that.”

I bit my tongue, but I knew he was dead wrong. It is possible to combine different pieces of evidence, and has been since 1763 when a short essay by the Reverend Thomas Bayes was posthumously published. Bayes’ theorem is really a single equation that tells us how to update an initial probability given a new piece of evidence.

Here’s how it works. A woman who is thirty-seven has about a 1 in 250 (0.4 percent) chance of giving birth to a child with Down syndrome. The Bayes formula tells you how you can update this probability to take into account a woman’s AFP level. To update, all you need to do is multiply the initial probability by a single number, a “likelihood ratio” which might either inflate or deflate the original probability estimate.*6
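The update described above can be sketched as follows. Strictly, Bayes’ rule multiplies the odds rather than the probability by the likelihood ratio; for small probabilities like 1 in 250 the two are nearly the same. The likelihood ratios of 3.0 and 0.5 below are made-up illustrative values, not clinical figures for any AFP result, and the function name is hypothetical.

```python
def update_probability(prior, likelihood_ratio):
    """Bayes update in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


prior = 1 / 250  # age-based probability for a thirty-seven-year-old, from the text

# Hypothetical likelihood ratios: one result that raises the risk, one that lowers it.
raised = update_probability(prior, 3.0)
lowered = update_probability(prior, 0.5)

print(f"prior:   1 in {1 / prior:.0f}")
print(f"raised:  1 in {1 / raised:.0f}")
print(f"lowered: 1 in {1 / lowered:.0f}")
```

With the assumed ratio of 3, the 1 in 250 prior moves to about 1 in 84, very close to what simple multiplication of the probability (3 in 250) would suggest. A ratio of exactly 1 leaves the prior unchanged, which is what an uninformative test should do.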

Knowing the best single estimate from these early tests is important because some parents will want to go further and have amniocentesis if the probability of Down syndrome is too high. “Amniocentesis is almost 100 percent accurate,” Polak said. “But it carries a risk. The risk is that the amniocentesis might cause a miscarriage of the child.” Once in about every 250 procedures, a pregnant woman will miscarry after amnio.

There is some good news to report here. In the last decade, doctors have uncovered multiple predictors for Down syndrome. Instead of the single AFP test, there is now the triple screen, which uses Bayesian updating to predict the probability of Down syndrome based on three different assays from a single blood test. Doctors have also noticed that fetuses with Down syndrome are more likely to display on sonograms a thick patch of skin at the base of the neck—the so-called nuchal fold.

Nonetheless, Ben (like Bono) still hasn’t found what he’s looking for. He told me that the health care professionals would tend to downplay the importance of numbers. “I had a comical interaction with a very nice physician,” he told me, “where she said, ‘One of these probabilities is 1 in 1,000, and one is 1 in 10,000, so what’s the difference?’ and to me there’s actually quite a big difference. There’s a ten times difference.” I had a similar interaction with a genetic counselor in California after Jennifer’s blood showed a risky level of AFP. When I asked for a probability of Down syndrome, the counselor unhelpfully offered: “That would only be a number. In reality, your child will either have it or not.”

In Ben’s case, the health care providers suggested that Ben and Stefanie follow a rule of thumb that to him was “completely arbitrary.” He said:

The rule that they guide you towards is to take amniocentesis when the probability of miscarriage is less than the probability of Down syndrome. That assumes that the parents put equal weight on the two bad outcomes. The two bad outcomes are miscarriage and Down syndrome. And there may be parents for whom those aren’t at all equal. In fact, it could go either way.

Some parents might find the loss from a miscarriage to be much greater (especially, for example, if the couple is unlikely to be able to get pregnant again). However, as Ben points out, “[There are a] lot of parents for whom having a severely handicapped child would be devastating. And they would choose to put a big weight on that as a bad outcome.”

One of the great things about Super Crunching is that it tends to tie decision rules to the actual consequences of the decision. More enlightened genetic counselors would do well to ask patients which adverse outcome (Down syndrome or miscarriage) they would experience as a greater loss, and how much greater a loss it would be. If a pregnant woman says that she would weigh Down syndrome as three times worse than a miscarriage, then she should probably have amniocentesis if the probability of Down syndrome is more than a third of the probability of miscarriage.
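A weighted version of the rule of thumb can be written down directly. The sketch below is a simplified illustration that treats the decision as a comparison of two weighted risks; the function and parameter names are invented. With equal weights it reduces to the rule Ben was given, and weighting Down syndrome three times as heavily moves the break-even point to one third of the miscarriage risk.

```python
def favors_amniocentesis(p_down, p_miscarriage, down_weight=1.0, miscarriage_weight=1.0):
    """True when the weighted Down syndrome risk exceeds the weighted miscarriage risk."""
    return down_weight * p_down > miscarriage_weight * p_miscarriage


P_MISCARRIAGE = 1 / 250  # procedure risk quoted in the text

# Equal weights: the unweighted rule of thumb.
print(favors_amniocentesis(1 / 500, P_MISCARRIAGE))                    # False
# Down syndrome weighed three times as heavily: break-even drops to 1 in 750.
print(favors_amniocentesis(1 / 500, P_MISCARRIAGE, down_weight=3.0))   # True
print(favors_amniocentesis(1 / 1000, P_MISCARRIAGE, down_weight=3.0))  # False
```

The weights are the part only the parents can supply, which is the point of asking them how much worse one outcome would be than the other.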

Ben was also frustrated by what doctors told him about the triple screen results. “When you take the actual test, all you really want to know is a simple thing: what is the probability that my child has Down syndrome, given the outcome of the test. You don’t want to know a bunch of extraneous information about what the probability of a false positive is here.” The real power of the Bayes equation is that it gives you the bottom line probability of Down syndrome while taking into account the possibility that a test will have false positives.

Overall, there are now five valid predictors of Down syndrome—the mother’s age, three blood tests, and the nuchal fold—that could help mothers and fathers decide whether to have amniocentesis. Yet even today, many doctors don’t provide a bottom-line prediction that is based on all five factors. Ben said, “I had trouble getting them to combine the data from these blood and nuchal fold tests. There must be vast amounts of data to do this. It cannot be a hard exercise to do. But they just didn’t have this analysis available. When I asked, the physician said something technical about one of the distributions being non-Gaussian. But that was utterly irrelevant to the issue at hand.”

The Bayesian cup is more than half full. The quad screen now combines the information from four different predictors into a single, bottom-line probability of Down syndrome. Medical professionals, however, are still making “you can’t get there from here” claims that full updating on the nuchal fold information just isn’t possible.

The Bayes equation is the science of learning. It is the second great tool of this chapter. If the Super Cruncher of the future is really going to dialectically toggle back and forth between her intuitions and her statistical predictions, she’s going to have to know how to update her predictions and intuitions over time as she gets new information. Bayes’ equation is crucial to this updating process.

Still, it’s not surprising that many health care professionals aren’t comfortable with updating. We learned before that physicians have been good at biochemistry, but they often are still out to lunch when it comes to basic statistics. For example, several studies have asked physicians over the years the following type of question:

One percent of women at age forty who participate in routine screening have breast cancer. Eighty percent of women with breast cancer will get positive mammographies. Ten percent of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

By the way, this is a question you can answer, too. What do you think is the probability that a woman who has a positive mammography has breast cancer? Think about it for a minute.

In study after study, most physicians tend to estimate that the probability of cancer is about 75 percent. Actually, this answer is about ten times too high. Most physicians don’t know how to apply Bayes’ equation.

We can actually work out the probability (and learn Bayes to boot) if we translate the probabilities into frequencies. First, imagine a sample of 1,000 women who are screened for breast cancer. From the 1 percent (prior) probability, we know that 10 out of every 1,000 women who get screened will actually have breast cancer. Of these 10 women with breast cancer, 8 will have a positive mammogram. We also know that of the 990 women without breast cancer who take the test, 99 will have a false positive result. Can you figure out the probability that a woman with a positive test will have breast cancer now?

It’s a pretty straightforward calculation. Eight out of the 107 positive tests (8 true positives plus 99 false positives) will actually have cancer. So what statisticians call the posterior or updated probability of cancer conditioned upon a positive mammogram becomes 7.5 percent (8 divided by 107). Bayes’ theorem tells us that the prior 1 percent probability of cancer doesn’t jump to 70 or 75 percent—it increases to 7.5 percent.
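The same arithmetic can be checked both ways, by counting women in an imagined sample and by applying Bayes’ equation to the probabilities. This is a minimal sketch using the numbers from the question above; the function names are hypothetical.

```python
def posterior_from_frequencies(population, prior, sensitivity, false_positive_rate):
    """Count true and false positives in an imagined population, then take the share."""
    with_cancer = population * prior
    without_cancer = population - with_cancer
    true_positives = with_cancer * sensitivity
    false_positives = without_cancer * false_positive_rate
    return true_positives / (true_positives + false_positives)


def posterior_from_bayes(prior, sensitivity, false_positive_rate):
    """Same answer from Bayes' equation: P(cancer | positive test)."""
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


by_counting = posterior_from_frequencies(1000, 0.01, 0.80, 0.10)
by_formula = posterior_from_bayes(0.01, 0.80, 0.10)
print(f"{by_counting:.1%}  {by_formula:.1%}")
```

Both routes give 8 out of 107, or about 7.5 percent. Trying a higher prior (say 0.10 instead of 0.01) shows how strongly the answer depends on the base rate rather than on the 80 percent figure.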

People who don’t understand Bayes tend to put too much emphasis on the 80 percent chance that a woman with cancer will test positive. Most physicians studied seem to think that if 80 percent of women with breast cancer have positive mammographies, then the probability of a woman with a positive mammography having breast cancer must be around 80 percent. But Bayes’ equation tells us why this intuition is wrong. We have to put a lot more weight on the original, unconditional fraction of women with breast cancer (the prior probability), as well as the possibility that women without breast cancer will receive false positives.
