Rationality: From AI to Zombies

by Eliezer Yudkowsky


  * * *

  The probability that a test gives a true positive divided by the probability that a test gives a false positive is known as the likelihood ratio of that test. The likelihood ratio for a positive result summarizes how much a positive result will slide the prior probability. Does the likelihood ratio of a medical test then sum up everything there is to know about the usefulness of the test?

  No, it does not! The likelihood ratio sums up everything there is to know about the meaning of a positive result on the medical test, but the meaning of a negative result on the test is not specified, nor is the frequency with which the test is useful. For example, a mammography with a hit rate of 80% for patients with breast cancer and a false positive rate of 9.6% for healthy patients has the same likelihood ratio as a test with an 8% hit rate and a false positive rate of 0.96%. Although these two tests have the same likelihood ratio, the first test is more useful in every way—it detects disease more often, and a negative result is stronger evidence of health.
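
  As a rough sketch of this comparison in Python: the helper names positive_lr and negative_lr are illustrative and do not come from the text, and the ratio for a negative result is assumed here to be P(negative|cancer) / P(negative|healthy).

```python
def positive_lr(hit_rate, false_positive_rate):
    """Likelihood ratio of a positive result: P(pos|cancer) / P(pos|healthy)."""
    return hit_rate / false_positive_rate


def negative_lr(hit_rate, false_positive_rate):
    """Likelihood ratio of a negative result: P(neg|cancer) / P(neg|healthy)."""
    return (1 - hit_rate) / (1 - false_positive_rate)


mammography = (0.80, 0.096)   # hit rate, false positive rate (from the text)
weak_test = (0.08, 0.0096)    # the 8% / 0.96% test (from the text)

# Both tests share the same ratio for a positive result...
print(round(positive_lr(*mammography), 2), round(positive_lr(*weak_test), 2))
# ...but a negative mammography pushes much harder toward "healthy".
print(round(negative_lr(*mammography), 3), round(negative_lr(*weak_test), 3))
```

  The closer the negative ratio is to 1, the less a negative result tells you, which is why the 8% test is nearly useless for ruling cancer out.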

  * * *

  Suppose that you apply two tests for breast cancer in succession—say, a standard mammography and also some other test which is independent of mammography. Since I don’t know of any such test that is independent of mammography, I’ll invent one for the purpose of this problem, and call it the Tams-Braylor Division Test, which checks to see if any cells are dividing more rapidly than other cells. We’ll suppose that the Tams-Braylor gives a true positive for 90% of patients with breast cancer, and gives a false positive for 5% of patients without cancer. Let’s say the prior prevalence of breast cancer is 1%. If a patient gets a positive result on her mammography and her Tams-Braylor, what is the revised probability she has breast cancer?

  One way to solve this problem would be to take the revised probability for a positive mammography, which we already calculated as 7.8%, and plug that into the Tams-Braylor test as the new prior probability. If we do this, we find that the result comes out to 60%.
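
  This chained calculation can be sketched in a few lines of Python. The function name update is an assumption made for illustration; the rates are the ones given in the problem.

```python
def update(prior, p_pos_given_cancer, p_pos_given_healthy):
    """Return P(cancer | positive) for one test, starting from the given prior."""
    true_positives = prior * p_pos_given_cancer
    false_positives = (1 - prior) * p_pos_given_healthy
    return true_positives / (true_positives + false_positives)


# First the mammography (80% / 9.6%), starting from the 1% prevalence.
after_mammography = update(0.01, 0.80, 0.096)

# Then the Tams-Braylor (90% / 5%), using the previous result as the new prior.
after_both = update(after_mammography, 0.90, 0.05)

print(f"{after_mammography:.1%} -> {after_both:.0%}")
```

  Feeding one posterior in as the next prior is only valid because the two tests are assumed to be independent.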

  Suppose that the prior prevalence of breast cancer in a demographic is 1%. Suppose that we, as doctors, have a repertoire of three independent tests for breast cancer. Our first test, test A, a mammography, has a likelihood ratio of 80%/9.6% = 8.33. The second test, test B, has a likelihood ratio of 18.0 (for example, from 90% versus 5%); and the third test, test C, has a likelihood ratio of 3.5 (which could be from 70% versus 20%, or from 35% versus 10%; it makes no difference). Suppose a patient gets a positive result on all three tests. What is the probability the patient has breast cancer?

  Here’s a fun trick for simplifying the bookkeeping. If the prior prevalence of breast cancer in a demographic is 1%, then 1 out of 100 women have breast cancer, and 99 out of 100 women do not have breast cancer. So if we rewrite the probability of 1% as an odds ratio, the odds are 1:99.

  And the likelihood ratios of the three tests A, B, and C are:

  8.33:1 = 25:3

  18.0:1 = 18:1

  3.5:1 = 7:2.

  The odds for women with breast cancer who score positive on all three tests, versus women without breast cancer who score positive on all three tests, will equal:

  1 × 25 × 18 × 7 : 99 × 3 × 1 × 2 = 3150 : 594.

  To recover the probability from the odds, we just write:

  3150 / (3150 + 594) = 84%.

  This always works regardless of how the odds ratios are written; i.e., 8.33:1 is just the same as 25:3 or 75:9. It doesn’t matter in what order the tests are administered, or in what order the results are computed. The proof is left as an exercise for the reader.
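
  It is not a proof, but the claim is easy to check numerically. The sketch below uses Python's exact fractions; the function name posterior_probability is an illustrative choice, not something from the text.

```python
from fractions import Fraction
from itertools import permutations

prior_odds = Fraction(1, 99)  # 1% prevalence written as odds of 1:99
likelihood_ratios = [Fraction(25, 3), Fraction(18, 1), Fraction(7, 2)]


def posterior_probability(odds, ratios):
    """Multiply the odds by each likelihood ratio, then convert odds to probability."""
    for ratio in ratios:
        odds *= ratio
    return odds / (1 + odds)


# Try every order in which the three tests could be applied.
results = {
    posterior_probability(prior_odds, order)
    for order in permutations(likelihood_ratios)
}

print(results)  # a single value: the order makes no difference
print(f"{float(next(iter(results))):.0%}")
```

  Because Fraction reduces automatically, writing a ratio as 25:3 or 75:9 gives the same object, which mirrors the point that the way the odds are written does not matter.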

  * * *

  E. T. Jaynes, in Probability Theory With Applications in Science and Engineering, suggests that credibility and evidence should be measured in decibels.[5]

  Decibels?

  Decibels are used for measuring exponential differences of intensity. For example, if the sound from an automobile horn carries 10,000 times as much energy (per square meter per second) as the sound from an alarm clock, the automobile horn would be 40 decibels louder. The sound of a bird singing might carry 1,000 times less energy than an alarm clock, and hence would be 30 decibels softer. To get the number of decibels, you take the logarithm base 10 and multiply by 10:

  decibels = 10 log_10(intensity)

  or

  intensity = 10^(decibels/10).

  Suppose we start with a prior probability of 1% that a woman has breast cancer, corresponding to an odds ratio of 1:99. And then we administer three tests of likelihood ratios 25:3, 18:1, and 7:2. You could multiply those numbers . . . or you could just add their logarithms:

  10 log_10(1/99) ≈ -20

  10 log_10(25/3) ≈ 9

  10 log_10(18/1) ≈ 13

  10 log_10(7/2) ≈ 5.

  It starts out as fairly unlikely that a woman has breast cancer—our credibility level is at -20 decibels. Then three test results come in, corresponding to 9, 13, and 5 decibels of evidence. This raises the credibility level by a total of 27 decibels, meaning that the prior credibility of -20 decibels goes to a posterior credibility of 7 decibels. So the odds go from 1:99 to 5:1, and the probability goes from 1% to around 83%.
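
  Here is the same bookkeeping as a short Python sketch, without rounding to whole decibels. The helper names to_decibels and from_decibels are assumptions made for illustration.

```python
import math


def to_decibels(odds):
    """Convert an odds or likelihood ratio to decibels."""
    return 10 * math.log10(odds)


def from_decibels(decibels):
    """Convert decibels back to an odds ratio."""
    return 10 ** (decibels / 10)


prior_db = to_decibels(1 / 99)
evidence_db = [to_decibels(25 / 3), to_decibels(18 / 1), to_decibels(7 / 2)]

# Adding decibels does the same job as multiplying the odds.
posterior_db = prior_db + sum(evidence_db)
posterior_odds = from_decibels(posterior_db)
posterior_probability = posterior_odds / (1 + posterior_odds)

print(round(prior_db), [round(db) for db in evidence_db])
print(round(posterior_db, 1), f"{posterior_probability:.0%}")
```

  With unrounded decibels the posterior comes out at about 7.2 decibels and 84%, matching the odds calculation; rounding each term to a whole decibel first is what gives 7 decibels and roughly 83%.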

  * * *

  You are a mechanic for gizmos. When a gizmo stops working, it is due to a blocked hose 30% of the time. If a gizmo’s hose is blocked, there is a 45% probability that prodding the gizmo will produce sparks. If a gizmo’s hose is unblocked, there is only a 5% chance that prodding the gizmo will produce sparks. A customer brings you a malfunctioning gizmo. You prod the gizmo and find that it produces sparks. What is the probability that a spark-producing gizmo has a blocked hose?

  What is the sequence of arithmetical operations that you performed to solve this problem?

  (45% × 30%) / (45% × 30% + 5% × 70%)
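
  The same arithmetic can be laid out as a head count. The batch of 1,000 broken gizmos below is an assumption chosen so that every group is a whole number; only the percentages come from the problem.

```python
broken_gizmos = 1000                       # illustrative batch size (assumption)

blocked = broken_gizmos * 30 // 100        # 30% have a blocked hose
unblocked = broken_gizmos - blocked        # the other 70%

blocked_sparking = blocked * 45 // 100     # 45% of blocked gizmos spark
unblocked_sparking = unblocked * 5 // 100  # 5% of unblocked gizmos spark

# Among all sparking gizmos, what fraction has a blocked hose?
p_blocked_given_sparks = blocked_sparking / (blocked_sparking + unblocked_sparking)

print(blocked_sparking, unblocked_sparking, f"{p_blocked_given_sparks:.1%}")
```

  Dividing both counts by 1,000 recovers the expression with percentages, so the answer does not depend on the batch size.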

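  Similarly, to find the chance that a woman with positive mammography has breast cancer, we computed:

  P(positive|cancer) × P(cancer) / (P(positive|cancer) × P(cancer) + P(positive|¬cancer) × P(¬cancer))

  which is

  P(positive,cancer) / (P(positive,cancer) + P(positive,¬cancer))

  which is

  P(positive,cancer) / P(positive)

  which is

  P(cancer|positive).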

  The fully general form of this calculation is known as Bayes’s Theorem or Bayes’s Rule:

  P(A|X) = P(X|A) × P(A) / (P(X|A) × P(A) + P(X|¬A) × P(¬A))

  When there is some phenomenon A that we want to investigate, and an observation X that is evidence about A—for example, in the previous example, A is breast cancer and X is a positive mammography—Bayes’s Theorem tells us how we should update our probability of A, given the new evidence X.

  By this point, Bayes’s Theorem may seem blatantly obvious or even tautological, rather than exciting and new. If so, this introduction has entirely succeeded in its purpose.

  * * *

  Bayes’s Theorem describes what makes something “evidence” and how much evidence it is. Statistical models are judged by comparison to the Bayesian method because, in statistics, the Bayesian method is as good as it gets—the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential. This is why you hear cognitive scientists talking about Bayesian reasoners. In cognitive science, Bayesian reasoner is the technically precise code word that we use to mean rational mind.

  There are also a number of general heuristics about human reasoning that you can learn from looking at Bayes’s Theorem.

  For example, in many discussions of Bayes’s Theorem, you may hear cognitive psychologists saying that people do not take prior frequencies sufficiently into account, meaning that when people approach a problem where there’s some evidence X indicating that condition A might hold true, they tend to judge A’s likelihood solely by how well the evidence X seems to match A, without taking into account the prior frequency of A. If you think, for example, that under the mammography example, the woman’s chance of having breast cancer is in the range of 70%–80%, then this kind of reasoning is insensitive to the prior frequency given in the problem; it doesn’t notice whether 1% of women or 10% of women start out having breast cancer. “Pay more attention to the prior frequency!” is one of the many things that humans need to bear in mind to partially compensate for our built-in inadequacies.

  A related error is to pay too much attention to P(X|A) and not enough to P(X|¬A) when determining how much evidence X is for A. The degree to which a result X is evidence for A depends not only on the strength of the statement we’d expect to see result X if A were true, but also on the strength of the statement we wouldn’t expect to see result X if A weren’t true. For example, if it is raining, this very strongly implies the grass is wet—P(wetgrass|rain) ≈ 1—but seeing that the grass is wet doesn’t necessarily mean that it has just rained; perhaps the sprinkler was turned on, or you’re looking at the early morning dew. Since P(wetgrass|¬rain) is substantially greater than zero, P(rain|wetgrass) is substantially less than one. On the other hand, if the grass was never wet when it wasn’t raining, then knowing that the grass was wet would always show that it was raining, P(rain|wetgrass) ≈ 1, even if P(wetgrass|rain) = 50%; that is, even if the grass only got wet 50% of the times it rained. Evidence is always the result of the differential between the two conditional probabilities. Strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X.
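
  To put rough numbers on the wet grass example: the text gives no prior for rain and no figure for sprinklers or dew, so the 20% prior and the 30% chance of wet grass without rain below are assumptions chosen purely for illustration.

```python
def p_rain_given_wet(p_rain, p_wet_given_rain, p_wet_given_no_rain):
    """Bayes's Theorem for P(rain | wet grass)."""
    wet_and_rain = p_rain * p_wet_given_rain
    wet_and_no_rain = (1 - p_rain) * p_wet_given_no_rain
    return wet_and_rain / (wet_and_rain + wet_and_no_rain)


p_rain = 0.20  # assumed prior, not from the text

# Rain almost always wets the grass, but so do sprinklers and dew.
common_wet_grass = p_rain_given_wet(p_rain, 0.99, 0.30)

# Rain wets the grass only half the time, but nothing else ever does.
exclusive_wet_grass = p_rain_given_wet(p_rain, 0.50, 0.0)

print(f"{common_wet_grass:.0%}", f"{exclusive_wet_grass:.0%}")
```

  Under these assumed numbers the first case lands below 50% despite P(wetgrass|rain) being close to 1, while the second case is certain despite P(wetgrass|rain) being only 50%.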

  The Bayesian revolution in the sciences is fueled, not only by more and more cognitive scientists suddenly noticing that mental phenomena have Bayesian structure in them; not only by scientists in every field learning to judge their statistical methods by comparison with the Bayesian method; but also by the idea that science itself is a special case of Bayes’s Theorem; experimental evidence is Bayesian evidence. The Bayesian revolutionaries hold that when you perform an experiment and get evidence that “confirms” or “disconfirms” your theory, this confirmation and disconfirmation is governed by the Bayesian rules. For example, you have to take into account not only whether your theory predicts the phenomenon, but whether other possible explanations also predict the phenomenon.

  Previously, the most popular philosophy of science was probably Karl Popper’s falsificationism—this is the old philosophy that the Bayesian revolution is currently dethroning. Karl Popper’s idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if P(X|A) ≈ 1—if the theory makes a definite prediction—then observing ¬X very strongly falsifies A. On the other hand, if P(X|A) ≈ 1, and we observe X, this doesn’t definitely confirm the theory; there might be some other condition B such that P(X|B) ≈ 1, in which case observing X doesn’t favor A over B. For observing X to definitely confirm A, we would have to know, not that P(X|A) ≈ 1, but that P(X|¬A) ≈ 0, which is something that we can’t know because we can’t range over all possible alternative explanations. For example, when Einstein’s theory of General Relativity toppled Newton’s incredibly well-confirmed theory of gravity, it turned out that all of Newton’s predictions were just a special case of Einstein’s predictions.

  You can even formalize Popper’s philosophy mathematically. The likelihood ratio for X, the quantity P(X|A)/P(X|¬A), determines how much observing X slides the probability for A; the likelihood ratio is what says how strong X is as evidence. Well, in your theory A, you can predict X with probability 1, if you like; but you can’t control the denominator of the likelihood ratio, P(X|¬A)—there will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not. That’s the hidden gotcha that toppled Newton’s theory of gravity. So there’s a limit on how much mileage you can get from successful predictions; there’s a limit on how high the likelihood ratio goes for confirmatory evidence.

  On the other hand, if you encounter some piece of evidence Y that is definitely not predicted by your theory, this is enormously strong evidence against your theory. If P(Y|A) is infinitesimal, then the likelihood ratio will also be infinitesimal. For example, if P(Y|A) is 0.0001%, and P(Y|¬A) is 1%, then the likelihood ratio P(Y|A)/P(Y|¬A) will be 1:10,000. That’s -40 decibels of evidence! Or, flipping the likelihood ratio, if P(Y|A) is very small, then P(Y|¬A)/P(Y|A) will be very large, meaning that observing Y greatly favors ¬A over A. Falsification is much stronger than confirmation. This is a consequence of the earlier point that very strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X. This is the precise Bayesian rule that underlies the heuristic value of Popper’s falsificationism.
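
  The asymmetry shows up directly when both cases are expressed in decibels. In this sketch the falsifying figures are the ones from the text, while the 50% chance that alternatives also predict X is an assumption picked only to have something to compare against.

```python
import math


def evidence_decibels(p_given_theory, p_given_alternatives):
    """Decibels of evidence for the theory: 10 * log10 of the likelihood ratio."""
    return 10 * math.log10(p_given_theory / p_given_alternatives)


# Falsification: theory A gives Y a probability of 0.0001%, the alternatives 1%.
falsifying_db = evidence_decibels(0.000001, 0.01)

# Confirmation: A predicts X with certainty, but the alternatives also allow X
# half the time (assumed figure).
confirming_db = evidence_decibels(1.0, 0.5)

print(round(falsifying_db), round(confirming_db))
```

  The failed prediction costs the theory about 40 decibels, while the perfectly successful one earns it only about 3, because the numerator of the likelihood ratio cannot go above 1.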

  Similarly, Popper’s dictum that an idea must be falsifiable can be interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the result ¬X would have disconfirmed the theory to some extent. If you try to interpret both X and ¬X as “confirming” the theory, the Bayesian rules say this is impossible! To increase the probability of a theory you must expose it to tests that can potentially decrease its probability; this is not just a rule for detecting would-be cheaters in the social process of science, but a consequence of Bayesian probability theory. On the other hand, Popper’s idea that there is only falsification and no such thing as confirmation turns out to be incorrect. Bayes’s Theorem shows that falsification is very strong evidence compared to confirmation, but falsification is still probabilistic in nature; it is not governed by fundamentally different rules from confirmation, as Popper argued.
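
  The conservation-of-probability rule can be checked numerically: averaged over the possible outcomes, weighted by how likely each outcome is, the posterior must equal the prior. The sketch below reuses the mammography numbers; the function and variable names are illustrative assumptions.

```python
def posterior(prior, p_obs_given_a, p_obs_given_not_a):
    """Return P(A | observation) by Bayes's Theorem."""
    joint_a = prior * p_obs_given_a
    joint_not_a = (1 - prior) * p_obs_given_not_a
    return joint_a / (joint_a + joint_not_a)


prior = 0.01
p_x_given_a, p_x_given_not_a = 0.80, 0.096

# Probability of seeing X at all, before the test is run.
p_x = prior * p_x_given_a + (1 - prior) * p_x_given_not_a

p_a_given_x = posterior(prior, p_x_given_a, p_x_given_not_a)
p_a_given_not_x = posterior(prior, 1 - p_x_given_a, 1 - p_x_given_not_a)

# Expected posterior, averaged over both possible results.
expected_posterior = p_x * p_a_given_x + (1 - p_x) * p_a_given_not_x

print(p_a_given_x > prior > p_a_given_not_x, round(expected_posterior, 6))
```

  Since the weighted average is pinned to the prior, a result X that raises the probability forces the opposite result ¬X to lower it.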

  So we find that many phenomena in the cognitive sciences, plus the statistical methods used by scientists, plus the scientific method itself, are all turning out to be special cases of Bayes’s Theorem. Hence the Bayesian revolution.

  * * *

  Having introduced Bayes’s Theorem explicitly, we can explicitly discuss its components.

  We’ll start with P(A|X). If you ever find yourself getting confused about what’s A and what’s X in Bayes’s Theorem, start with P(A|X) on the left side of the equation; that’s the simplest part to interpret. In P(A|X), A is the thing we want to know about. X is how we’re observing it; X is the evidence we’re using to make inferences about A. Remember that for every expression P(Q|P), we want to know about the probability for Q given P, the degree to which P implies Q—a more sensible notation, which it is now too late to adopt, would be P(Q ← P).

  P(Q|P) is closely related to P(Q,P), but they are not identical. Expressed as a probability or a fraction, P(Q,P) is the proportion of things that have property Q and property P among all things; e.g., the proportion of “women with breast cancer and a positive mammography” within the group of all women. If the total number of women is 10,000, and 80 women have breast cancer and a positive mammography, then P(Q,P) is 80/10,000 = 0.8%. You might say that the absolute quantity, 80, is being normalized to a probability relative to the group of all women. Or to make it clearer, suppose that there’s a group of 641 women with breast cancer and a positive mammography within a total sample group of 89,031 women. Six hundred and forty-one is the absolute quantity. If you pick out a random woman from the entire sample, then the probability you’ll pick a woman with breast cancer and a positive mammography is P(Q,P), or 0.72% (in this example).

  On the other hand, P(Q|P) is the proportion of things that have property Q and property P among all things that have P; e.g., the proportion of women with breast cancer and a positive mammography within the group of all women with positive mammographies. If there are 641 women with breast cancer and positive mammographies, 7,915 women with positive mammographies, and 89,031 women, then P(Q,P) is the probability of getting one of those 641 women if you’re picking at random from the entire group of 89,031, while P(Q|P) is the probability of getting one of those 641 women if you’re picking at random from the smaller group of 7,915.
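
  The two proportions differ only in which group you divide by, as this short sketch with the counts above shows. The variable names are illustrative.

```python
from fractions import Fraction

cancer_and_positive = 641   # women with breast cancer and a positive mammography
positive = 7915             # all women with a positive mammography
everyone = 89031            # the whole sample

p_joint = Fraction(cancer_and_positive, everyone)        # P(Q,P)
p_given = Fraction(positive, everyone)                   # P(P)
p_conditional = Fraction(cancer_and_positive, positive)  # P(Q|P)

# Same 641 women on top; only the reference group underneath changes.
print(f"{float(p_joint):.2%}", f"{float(p_conditional):.2%}")
print(p_conditional == p_joint / p_given)
```

  The joint probability comes out at about 0.72% and the conditional at about 8.1%, and the last line confirms that P(Q|P) equals P(Q,P) divided by P(P).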

  In a sense, P(Q|P) really means P(Q,P|P), but specifying the extra P all the time would be redundant. You already know it has property P, so the property you’re investigating is Q, even though you’re looking at the size of group (Q,P) within group P, not the size of group Q within group P (which would be nonsense). This is what it means to take the property on the right-hand side as given; it means you know you’re working only within the group of things that have property P. When you constrict your focus of attention to see only this smaller group, many other probabilities change. If you’re taking P as given, then P(Q,P) equals just P(Q), at least relative to the group P. The old P(Q), the frequency of “things that have property Q within the entire sample,” is revised to the new frequency of “things that have property Q within the subsample of things that have property P.” If P is given, if P is our entire world, then looking for (Q,P) is the same as looking for just Q.

  If you constrict your focus of attention to only the population of eggs that are painted blue, then suddenly “the probability that an egg contains a pearl” becomes a different number; this proportion is different for the population of blue eggs than for the population of all eggs. The given, the property that constricts our focus of attention, is always on the right side of P(Q|P); the P becomes our world, the entire thing we see, and on the other side of the “given,” P always has probability 1. That is what it means to take P as given. So P(Q|P) means “If P has probability 1, what is the probability of Q?” or “If we constrict our attention to only things or events where P is true, what is the probability of Q?” The statement Q, on the other side of the given, is not certain; its probability may be 10% or 90% or any other number. So when you use Bayes’s Theorem, and you write the part on the left side as P(A|X) (how to update the probability of A after seeing X, the new probability of A given that we know X, the degree to which X implies A), you can tell that X is always the observation or the evidence, and A is the property being investigated, the thing you want to know about.

 
