
Rationality: From AI to Zombies


by Eliezer Yudkowsky


  To put it another way, before the mammography screening, the 10,000 women can be divided into two groups:

  Group 1: 100 women with breast cancer.

  Group 2: 9,900 women without breast cancer.

  Summing these two groups gives a total of 10,000 patients, confirming that none have been lost in the math. After the mammography, the women can be divided into four groups:

  Group A: 80 women with breast cancer and a positive mammography.

  Group B: 20 women with breast cancer and a negative mammography.

  Group C: 950 women without breast cancer and a positive mammography.

  Group D: 8,950 women without breast cancer and a negative mammography.

  The sum of groups A and B, the groups with breast cancer, corresponds to group 1; and the sum of groups C and D, the groups without breast cancer, corresponds to group 2. If you administer a mammography to 10,000 patients, then out of the 1,030 with positive mammographies, eighty of those positive-mammography patients will have cancer. This is the correct answer, the answer a doctor should give a positive-mammography patient if she asks about the chance she has breast cancer; if thirteen patients ask this question, roughly one out of those thirteen will have cancer.
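
  If you want to check this arithmetic mechanically, here is a minimal Python sketch using nothing but the four group counts above (the variable names are mine, chosen for illustration):

    # The four groups after mammography, as counted above.
    group_a = 80     # breast cancer, positive mammography
    group_b = 20     # breast cancer, negative mammography
    group_c = 950    # no breast cancer, positive mammography
    group_d = 8_950  # no breast cancer, negative mammography

    positives = group_a + group_c                  # 1,030 positive mammographies
    p_cancer_given_positive = group_a / positives  # 80 / 1,030
    print(f"{p_cancer_given_positive:.1%}")        # ~7.8%, roughly 1 in 13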

  * * *

  The most common mistake is to ignore the original fraction of women with breast cancer, and the fraction of women without breast cancer who receive false positives, and focus only on the fraction of women with breast cancer who get positive results. For example, the vast majority of doctors in these studies seem to have thought that if around 80% of women with breast cancer have positive mammographies, then the probability of a woman with a positive mammography having breast cancer must be around 80%.

  Figuring out the final answer always requires all three pieces of information—the percentage of women with breast cancer, the percentage of women without breast cancer who receive false positives, and the percentage of women with breast cancer who receive (correct) positives.

  The original proportion of patients with breast cancer is known as the prior probability. The chance that a patient with breast cancer gets a positive mammography, and the chance that a patient without breast cancer gets a positive mammography, are known as the two conditional probabilities. Collectively, this initial information is known as the priors. The final answer—the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammography—is known as the revised probability or the posterior probability. What we’ve just seen is that the posterior probability depends in part on the prior probability.

  To see that the final answer always depends on the original fraction of women with breast cancer, consider an alternate universe in which only one woman out of a million has breast cancer. Even if mammography in this world detects breast cancer in 8 out of 10 cases, while returning a false positive on a woman without breast cancer in only 1 out of 10 cases, there will still be a hundred thousand false positives for every real case of cancer detected. The original probability that a woman has cancer is so extremely low that, although a positive result on the mammography does increase the estimated probability, the probability isn’t increased to certainty or even “a noticeable chance”; the probability goes from 1:1,000,000 to 1:100,000.

  What this demonstrates is that the mammography result doesn’t replace your old information about the patient’s chance of having cancer; the mammography slides the estimated probability in the direction of the result. A positive result slides the original probability upward; a negative result slides the probability downward. For example, in the original problem where 1% of the women have cancer, 80% of women with cancer get positive mammographies, and 9.6% of women without cancer get positive mammographies, a positive result on the mammography slides the 1% chance upward to 7.8%.
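
  For readers who would like to see the sliding made explicit, here is a small Python sketch of the same computation from the three pieces of information; the function name and argument names are illustrative, not standard terminology:

    def posterior(prior, p_pos_if_present, p_pos_if_absent):
        """Estimated probability the condition is present, given a positive result."""
        true_positives = prior * p_pos_if_present          # P(positive, cancer)
        false_positives = (1 - prior) * p_pos_if_absent    # P(positive, ¬cancer)
        return true_positives / (true_positives + false_positives)

    # Original problem: 1% prior, 80% detection rate, 9.6% false-positive rate.
    print(posterior(0.01, 0.80, 0.096))   # ~0.078, the 7.8% above

    # Alternate universe: one-in-a-million prior, 10% false-positive rate.
    print(posterior(1e-6, 0.80, 0.10))    # ~8e-6, roughly 1 in 100,000 (about 1 in 125,000)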

  Most people encountering problems of this type for the first time carry out the mental operation of replacing the original 1% probability with the 80% probability that a woman with cancer gets a positive mammography. It may seem like a good idea, but it just doesn’t work. “The probability that a woman with a positive mammography has breast cancer” is not at all the same thing as “the probability that a woman with breast cancer has a positive mammography”; they are as unlike as apples and cheese.

  * * *

  Q. Why did the Bayesian reasoner cross the road?

  A. You need more information to answer this question.

  * * *

  Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the barrel contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl? For this example the arithmetic is simple enough that you may be able to do it in your head, and I would suggest trying to do so.

  A more compact way of specifying the problem:

  P(pearl) = 40%

  P(blue|pearl) = 30%

  P(blue|¬pearl) = 10%

  P(pearl|blue) = ?

  The symbol “¬” is shorthand for “not,” so ¬pearl reads “not pearl.”

  The notation P(blue|pearl) is shorthand for “the probability of blue given pearl” or “the probability that an egg is painted blue, given that the egg contains a pearl.” The item on the right side is what you already know or the premise, and the item on the left side is the implication or conclusion. If we have P(blue|pearl) = 30%, and we already know that some egg contains a pearl, then we can conclude there is a 30% chance that the egg is painted blue. Thus, the final fact we’re looking for—“the chance that a blue egg contains a pearl” or “the probability that an egg contains a pearl, if we know the egg is painted blue”—reads P(pearl|blue).

  40% of the eggs contain pearls, and 60% of the eggs contain nothing. 30% of the eggs containing pearls are painted blue, so 12% of the eggs altogether contain pearls and are painted blue. 10% of the eggs containing nothing are painted blue, so altogether 6% of the eggs contain nothing and are painted blue. A total of 18% of the eggs are painted blue, and a total of 12% of the eggs are painted blue and contain pearls, so the chance a blue egg contains a pearl is 12/18 or 2/3 or around 67%.

  As before, we can see the necessity of all three pieces of information by considering extreme cases. In a (large) barrel in which only one egg out of a thousand contains a pearl, knowing that an egg is painted blue slides the probability from 0.1% to 0.3% (instead of sliding the probability from 40% to 67%). Similarly, if 999 out of 1,000 eggs contain pearls, knowing that an egg is blue slides the probability from 99.9% to 99.966%; the probability that the egg does not contain a pearl goes from 1/1,000 to around 1/3,000.
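
  The same arithmetic can be checked mechanically; here is a short Python sketch (the function and variable names are illustrative):

    def p_pearl_given_blue(p_pearl, p_blue_if_pearl, p_blue_if_empty):
        blue_and_pearl = p_pearl * p_blue_if_pearl          # P(blue, pearl)
        blue_and_empty = (1 - p_pearl) * p_blue_if_empty    # P(blue, ¬pearl)
        return blue_and_pearl / (blue_and_pearl + blue_and_empty)

    print(p_pearl_given_blue(0.40, 0.30, 0.10))    # 0.12 / 0.18 = 2/3, about 67%
    print(p_pearl_given_blue(0.001, 0.30, 0.10))   # ~0.003: 0.1% slides up to 0.3%
    print(p_pearl_given_blue(0.999, 0.30, 0.10))   # ~0.99967: 99.9% slides up to ~99.966%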

  On the pearl-egg problem, most respondents unfamiliar with Bayesian reasoning would probably respond that the probability a blue egg contains a pearl is 30%, or perhaps 20% (the 30% chance of a true positive minus the 10% chance of a false positive). Even if this mental operation seems like a good idea at the time, it makes no sense in terms of the question asked. It’s like the experiment in which you ask a second-grader: “If eighteen people get on a bus, and then seven more people get on the bus, how old is the bus driver?” Many second-graders will respond: “Twenty-five.” They understand when they’re being prompted to carry out a particular mental procedure, but they haven’t quite connected the procedure to reality. Similarly, to find the probability that a woman with a positive mammography has breast cancer, it makes no sense whatsoever to replace the original probability that the woman has cancer with the probability that a woman with breast cancer gets a positive mammography. Neither can you subtract the probability of a false positive from the probability of the true positive. These operations are as wildly irrelevant as adding the number of people on the bus to find the age of the bus driver.

  * * *

  A study by Gigerenzer and Hoffrage in 1995 showed that some ways of phrasing story problems are much more evocative of correct Bayesian reasoning.4 The least evocative phrasing used probabilities. A slightly more evocative phrasing used frequencies instead of probabilities; the problem remained the same, but instead of saying that 1% of women had breast cancer, one would say that 1 out of 100 women had breast cancer, that 80 out of 100 women with breast cancer would get a positive mammography, and so on. Why did a higher proportion of subjects display Bayesian reasoning on this problem? Probably because saying “1 out of 100 women” encourages you to concretely visualize X women with cancer, leading you to visualize X women with cancer and a positive mammography, etc.

  The most effective presentation found so far is what’s known as natural frequencies—saying that 40 out of 100 eggs contain pearls, 12 out of 40 eggs containing pearls are painted blue, and 6 out of 60 eggs containing nothing are painted blue. A natural frequencies presentation is one in which the information about the prior probability is included in presenting the conditional probabilities. If you were just learning about the eggs’ conditional probabilities through natural experimentation, you would—in the course of cracking open a hundred eggs—crack open around 40 eggs containing pearls, of which 12 eggs would be painted blue, while cracking open 60 eggs containing nothing, of which about 6 would be painted blue. In the course of learning the conditional probabilities, you’d see examples of blue eggs containing pearls about twice as often as you saw examples of blue eggs containing nothing.

  Unfortunately, while natural frequencies are a step in the right direction, they probably aren’t enough. When problems are presented in natural frequencies, the proportion of people using Bayesian reasoning rises to around half. A big improvement, but not big enough when you’re talking about real doctors and real patients.

  * * *

  Q. How can I find the priors for a problem?

  A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.

  Q. Where do priors originally come from?

  A. Never ask that question.

  Q. Uh huh. Then where do scientists get their priors?

  A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has become fractious and controversial, with widespread acrimony, factional polarization, and several outright assassinations. This may be a front for infighting within the Bayes Council, or it may be that the disputants have too much spare time. No one is really sure.

  Q. I see. And where does everyone else get their priors?

  A. They download their priors from Kazaa.

  Q. What if the priors I want aren’t available on Kazaa?

  A. There’s a small, cluttered antique shop in a back alley of San Francisco’s Chinatown. Don’t ask about the bronze rat.

  Actually, priors are true or false just like the final answer—they reflect reality and can be judged by comparing them against reality. For example, if you think that 920 out of 10,000 women in a sample have breast cancer, and the actual number is 100 out of 10,000, then your priors are wrong. For our particular problem, the priors might have been established by three studies—a study on the case histories of women with breast cancer to see how many of them tested positive on a mammography, a study on women without breast cancer to see how many of them test positive on a mammography, and an epidemiological study on the prevalence of breast cancer in some specific demographic.

  * * *

  The probability P(A,B) is the same as P(B,A), but P(A|B) is not the same thing as P(B|A), and P(A,B) is completely different from P(A|B). It’s a common confusion to mix up some or all of these quantities.

  To get acquainted with all the relationships between them, we’ll play “follow the degrees of freedom.” For example, the two quantities P(cancer) and P(¬cancer) have one degree of freedom between them, because of the general law P(A) + P(¬A) = 1. If you know that P(¬cancer) = 0.99, you can obtain P(cancer) = 1 - P(¬cancer) = 0.01.

  The quantities P(positive|cancer) and P(¬positive|cancer) also have only one degree of freedom between them; either a woman with breast cancer gets a positive mammography or she doesn’t. On the other hand, P(positive|cancer) and P(positive|¬cancer) have two degrees of freedom. You can have a mammography test that returns positive for 80% of cancer patients and 9.6% of healthy patients, or that returns positive for 70% of cancer patients and 2% of healthy patients, or even a health test that returns “positive” for 30% of cancer patients and 92% of healthy patients. The two quantities, the output of the mammography test for cancer patients and the output of the mammography test for healthy patients, are in mathematical terms independent; one cannot be obtained from the other in any way, and so they have two degrees of freedom between them.

  What about P(positive, cancer), P(positive|cancer), and P(cancer)? Here we have three quantities; how many degrees of freedom are there? In this case the equation that must hold is

  P(positive, cancer) = P(positive|cancer) × P(cancer).

  This equality reduces the degrees of freedom by one. If we know the fraction of patients with cancer, and the chance that a cancer patient has a positive mammography, we can deduce the fraction of patients who have breast cancer and a positive mammography by multiplying.

  Similarly, if we know the number of patients with breast cancer and positive mammographies, and also the number of patients with breast cancer, we can estimate the chance that a woman with breast cancer gets a positive mammography by dividing: P(positive|cancer) = P(positive, cancer)∕P(cancer). In fact, this is exactly how such medical diagnostic tests are calibrated; you do a study on 8,520 women with breast cancer and see that there are 6,816 (or thereabouts) women with breast cancer and positive mammographies, then divide 6,816 by 8,520 to find that 80% of women with breast cancer had positive mammographies. (Incidentally, if you accidentally divide 8,520 by 6,816 instead of the other way around, your calculations will start doing strange things, such as insisting that 125% of women with breast cancer and positive mammographies have breast cancer. This is a common mistake in carrying out Bayesian arithmetic, in my experience.) And finally, if you know P(positive, cancer) and P(positive|cancer), you can deduce how many cancer patients there must have been originally. There are two degrees of freedom shared out among the three quantities; if we know any two, we can deduce the third.
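
  As a concrete sketch of that calibration step, using the hypothetical study numbers given above (Python, variable names mine):

    cancer_patients = 8_520        # women with breast cancer in the study
    cancer_and_positive = 6_816    # of those, women who also had positive mammographies

    # P(positive|cancer) = P(positive, cancer) / P(cancer), estimated from the counts:
    print(cancer_and_positive / cancer_patients)   # 0.8, i.e. 80%

    # Dividing the other way around produces the nonsensical 125% mentioned above:
    print(cancer_patients / cancer_and_positive)   # 1.25

    # And multiplying back out, per P(positive, cancer) = P(positive|cancer) × P(cancer):
    print(0.80 * 0.01)   # 0.008, i.e. 80 women out of 10,000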

  How about P(positive), P(positive, cancer), and P(positive,¬cancer)? Again there are only two degrees of freedom among these three variables. The equation occupying the extra degree of freedom is

  P(positive) = P(positive, cancer) + P(positive,¬cancer).

  This is how P(positive) is computed to begin with; we figure out the number of women with breast cancer who have positive mammographies, and the number of women without breast cancer who have positive mammographies, then add them together to get the total number of women with positive mammographies. It would be very strange to go out and conduct a study to determine the number of women with positive mammographies—just that one number and nothing else—but in theory you could do so. And if you then conducted another study and found the number of those women who had positive mammographies and breast cancer, you would also know the number of women with positive mammographies and no breast cancer—either a woman with a positive mammography has breast cancer or she doesn’t. In general, P(A,B) + P(A,¬B) = P(A). Symmetrically, P(A,B) + P(¬A,B) = P(B).
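
  In the numbers of the original problem, this bookkeeping looks like the following quick check (no new information, just the sum rule applied to the figures above):

    p_positive_and_cancer = 0.80 * 0.01      # 0.008   (the 80 women in 10,000)
    p_positive_and_healthy = 0.096 * 0.99    # 0.09504 (the ~950 women in 10,000)
    print(p_positive_and_cancer + p_positive_and_healthy)   # ~0.103, the 1,030 in 10,000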

  What about P(positive, cancer), P(positive,¬cancer), P(¬positive, cancer), and P(¬positive,¬cancer)? You might at first be tempted to think that there are only two degrees of freedom for these four quantities—that you can, for example, get P(positive,¬cancer) by multiplying P(positive) × P(¬cancer), and thus that all four quantities can be found given only the two quantities P(positive) and P(cancer). This is not the case! P(positive,¬cancer) = P(positive) × P(¬cancer) only if the two probabilities are statistically independent—if the chance that a woman has breast cancer has no bearing on whether she has a positive mammography. This amounts to requiring that the two conditional probabilities be equal to each other—a requirement which would eliminate one degree of freedom. If you remember that these four quantities are the groups A, B, C, and D, you can look over those four groups and realize that, in theory, you can put any number of people into the four groups. If you start with a group of 80 women with breast cancer and positive mammographies, there’s no reason why you can’t add another group of 500 women with breast cancer and negative mammographies, followed by a group of 3 women without breast cancer and negative mammographies, and so on. So now it seems like the four quantities have four degrees of freedom. And they would, except that in expressing them as probabilities, we need to normalize them to fractions of the complete group, which adds the constraint that P(positive, cancer) + P(positive,¬cancer) + P(¬positive, cancer) + P(¬positive,¬cancer) = 1. This equation takes up one degree of freedom, leaving three degrees of freedom among the four quantities. If you specify the fractions of women in groups A, B, and D, you can deduce the fraction of women in group C.

  Given the four groups A, B, C, and D, it is very straightforward to compute everything else:

  P(cancer) = (A + B) / (A + B + C + D)

  P(¬positive|cancer) = B / (A + B),

  and so on. Since {A,B,C,D} contains three degrees of freedom, it follows that the entire set of probabilities relating cancer rates to test results contains only three degrees of freedom. Remember that in our problems we always needed three pieces of information—the prior probability and the two conditional probabilities—which, indeed, have three degrees of freedom among them. Actually, for Bayesian problems, any three quantities with three degrees of freedom between them should logically specify the entire problem.
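
  A short Python sketch makes the point concrete: given the four group counts, every probability in the problem follows by simple division (variable names are illustrative):

    A, B, C, D = 80, 20, 950, 8_950          # the four groups from earlier
    total = A + B + C + D                    # 10,000

    p_cancer = (A + B) / total               # 0.01, the prior probability
    p_positive_given_cancer = A / (A + B)    # 0.8, one conditional probability
    p_positive_given_healthy = C / (C + D)   # ~0.096, the other conditional probability
    p_cancer_given_positive = A / (A + C)    # ~0.078, the posterior probability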

 
