The division between left and right is the one Facebook makes; on the left-hand side are the hundred thousand people Facebook reckons as having an elevated chance of terrorist involvement. We’ll take Facebook at their word: their algorithm is so good that the people who bear its mark are fully twice as likely as the average user to be terrorists. Since the overall rate is one in twenty thousand, that means one in ten thousand of this group, or ten people, will turn out to be terrorists, while the other 99,990 will not.
If ten out of the 10,000 future terrorists are in the upper left, that leaves 9,990 for the upper right. By the same reasoning: there are 199,990,000 nonoffenders in Facebook’s user base, 99,990 of whom were flagged by the algorithm and sit in the lower left box; that leaves 199,890,010 people in the lower right. If you add up all four quadrants, you get 200,000,000—that is, everybody.
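If you like, the whole four-quadrant calculation fits in a few lines of Python. This is just a sketch of the arithmetic above, using the hypothetical figures; the variable names are mine, and nothing here reflects anything Facebook actually runs:

# The four quadrants of the box, using the made-up numbers in the text.
users = 200_000_000      # total Facebook users
terrorists = 10_000      # future terrorists among them
flagged = 100_000        # people on the algorithm's red list

base_rate = terrorists / users       # 1 in 20,000
flagged_rate = 2 * base_rate         # flagged users are twice as likely: 1 in 10,000

upper_left = round(flagged * flagged_rate)          # flagged terrorists: 10
upper_right = terrorists - upper_left               # unflagged terrorists: 9,990
lower_left = flagged - upper_left                   # flagged nonoffenders: 99,990
lower_right = (users - terrorists) - lower_left     # unflagged nonoffenders: 199,890,010

assert upper_left + upper_right + lower_left + lower_right == users   # everybody accounted for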
Somewhere in the four-part box is your neighbor down the block.
But where? What you know is that he’s in the left half of the box, because Facebook has identified him as a person of interest.
And the thing to notice is that almost nobody in the left half of the box is a terrorist. In fact, there’s a 99.99% chance that your neighbor is innocent.
In a way, this is the birth control scare revisited. Being on the Facebook list doubles a person’s chance of being a terrorist, which sounds terrible. But that chance starts out very small, so when you double it, it’s still small.
But there’s another way to look at it, which highlights even more clearly just how confusing and treacherous reasoning about uncertainty can be. Ask yourself this—if a person is in fact not a future terrorist, what’s the chance that they’ll show up, unjustly, on Facebook’s list?
In the box, that means: if you’re in the bottom row, what’s the chance that you’re on the left-hand side?
That’s easy enough to compute; there are 199,990,000 people in the bottom half of the box, and of those, a mere 99,990 are on the left-hand side. So the chance that an innocent person will be marked as a potential terrorist by Facebook’s algorithm is
99,990/199,990,000
or about 0.05%.
That’s right—an innocent person has only a 1 in 2,000 chance of being wrongly identified as a terrorist by Facebook!
Now how do you feel about your neighbor?
The reasoning that governs p-values gives us clear guidance. The null hypothesis is that your neighbor is not a terrorist. Under that hypothesis—that is, presuming his innocence—the chance of him showing up on the Facebook red list is a mere 0.05%, well below the 1-in-20 threshold of statistical significance. In other words, under the rules that govern the majority of contemporary science, you’d be justified in rejecting the null hypothesis and declaring your neighbor a terrorist.
Except there’s a 99.99% chance he’s not a terrorist.
On the one hand, there’s hardly any chance that an innocent person will be flagged by the algorithm. At the same time, the people the algorithm points to are almost all innocent. It seems like a paradox, but it’s not. It’s just how things are. And if you take a deep breath and keep your eye on the box, you can’t go wrong.
Here’s the crux. There are really two questions you can ask. They sound kind of the same, but they’re not.
Question 1: What’s the chance that a person gets put on Facebook’s list, given that they’re not a terrorist?
Question 2: What’s the chance that a person’s not a terrorist, given that they’re on Facebook’s list?
One way you can tell these two questions are different is that they have different answers. Really different answers. We’ve already seen that the answer to the first question is about 1 in 2,000, while the answer to the second is 99.99%. And it’s the answer to the second question that you really want.
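In code, the two questions differ by nothing more than a denominator. Here’s a short sketch computing both conditional probabilities straight from the counts in the box; the names are mine, purely for illustration:

# Counts from the four-part box above.
flagged_terrorists = 10
flagged_innocents = 99_990
unflagged_innocents = 199_890_010

innocents = flagged_innocents + unflagged_innocents   # 199,990,000
flagged = flagged_terrorists + flagged_innocents      # 100,000

# Question 1: chance of being flagged, given that you're innocent.
q1 = flagged_innocents / innocents      # about 0.0005, i.e. 1 in 2,000

# Question 2: chance of being innocent, given that you're flagged.
q2 = flagged_innocents / flagged        # 0.9999, i.e. 99.99%

print(f"{q1:.4%}", f"{q2:.2%}")         # 0.0500% 99.99%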
The quantities these questions contemplate are called conditional probabilities: “the probability that X is the case, given that Y is.” And what we’re wrestling with here is that the probability of X, given Y, is not the same as the probability of Y, given X.
If that sounds familiar, it should; it’s exactly the problem we faced with the reductio ad unlikely. The p-value is
“The chance that the observed experimental result would occur, given that the null hypothesis is correct.”
But what we want to know is the other conditional probability:
“The chance that the null hypothesis is correct, given that we observed a certain experimental result.”
The danger arises precisely when we confuse the second quantity for the first. And this confusion is everywhere, not just in scientific studies. When the district attorney leans into the jury box and announces, “There is only a one in five million, I repeat, a ONE IN FIVE MILLLLLLLION CHANCE that an INNOCENT MAN would match the DNA sample found at the scene,” he is answering question 1: How likely would an innocent person be to look guilty? But the jury’s job is to answer question 2: How likely is this guilty-looking defendant to be innocent? That’s a question the DA can’t help them with.*
—
The example of Facebook and the terrorists makes it clear why you should worry about bad algorithms as much as good ones. Maybe more. It’s creepy and bad when you’re pregnant and Target knows you’re pregnant. But it’s even creepier and worse if you’re not a terrorist and Facebook thinks you are.
You might well think that Facebook would never cook up a list of potential terrorists (or tax cheats, or pedophiles) or make the list public if they did. Why would they? Where’s the money in it? Maybe that’s right. But the NSA collects data on people in America, too, whether they’re on Facebook or not. Unless you think they’re recording the metadata of all our phone calls just so they can give cell phone companies good advice about where to build more signal towers, there’s something like the red list going on. Big Data isn’t magic, and it doesn’t tell the feds who’s a terrorist and who’s not. But it doesn’t have to be magic to generate long lists of people who are in some ways red-flagged, elevated-risk, “people of interest.” Most of the people on those lists will have nothing to do with terrorism. How confident are you that you’re not one of them?
RADIO PSYCHICS AND THE RULE OF BAYES
Where does the apparent paradox of the terrorist red list come from? Why does the mechanism of the p-value, which seems so reasonable, work so very badly in this setting? Here’s the key. The p-value takes into account how likely an innocent person is to get flagged (about 1 in 2,000), but it totally ignores the proportion of people who are terrorists in the first place. When you’re trying to decide whether your neighbor is a secret terrorist, you have critical prior information, which is that most people aren’t terrorists! You ignore that fact at your peril. Just as R. A. Fisher said, you have to evaluate each hypothesis in the “light of the evidence” of what you already know about it.
But how do you do that?
This brings us to the story of the radio psychics.
In 1937, telepathy was all the rage. Psychologist J. B. Rhine’s book New Frontiers of the Mind, which presented extraordinary claims about Rhine’s ESP experiments at Duke in a soothingly sober and quantitative tone, was a best seller and a Book-of-the-Month Club selection, and psychic powers were a hot topic of cocktail conversation across the country. Upton Sinclair, the best-selling author of The Jungle, had released a whole book, Mental Radio, in 1930 about his experiments in psychic communication with his wife, Mary; the subject was mainstream enough that Albert Einstein contributed a preface to the German edition, stopping short of endorsing telepathy, but writing that Sinclair’s book “deserves the most earnest consideration” from psychologists.
Naturally, the mass media wanted in on the craze. On September 5, 1937, the Zenith Radio Corporation, in collaboration with Rhine, launched an ambitious experiment of the kind only the new communication technology they commanded made possible. Five times, the host spun a roulette wheel, with a panel of self-styled telepaths looking on. With each spin, the ball landed either in the black or in the red, and the psychics concentrated with all their might on the appropriate color, transmitting that signal across the country over their own broadcast channel. The station’s listeners were implored to use their own psychic powers to pick up the mental transmission and to mail the radio station the sequence of five colors they’d received. More than forty thousand listeners responded to the first request, and even for later programs, after the novelty was gone, Zenith was getting thousands of responses a week. It was a test of psychic powers on a scale Rhine could never have carried out subject by subject in his office at Duke, a kind of proto-Big Data event.
The results of the experiment were not, in the end, favorable to telepathy. But the accumulated data of the responses turned out to be useful for psychologists in a totally different way. The listeners were trying to reproduce sequences of blacks and reds (hereafter Bs and Rs) generated by five spins of the roulette wheel. There are 32 possible sequences:
BBBBB   BBRBB   BRBBB   BRRBB
BBBBR   BBRBR   BRBBR   BRRBR
BBBRB   BBRRB   BRBRB   BRRRB
BBBRR   BBRRR   BRBRR   BRRRR
RBBBB   RBRBB   RRBBB   RRRBB
RBBBR   RBRBR   RRBBR   RRRBR
RBBRB   RBRRB   RRBRB   RRRRB
RBBRR   RBRRR   RRBRR   RRRRR
all of which are equally likely to come up, since each spin is equally likely to land red or black. And since the listeners weren’t actually receiving any psychic emanations, you might expect that their responses, too, would be drawn equally from the thirty-two choices.
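If you want to check the count, a few lines of Python will do it; this is just a sketch of the arithmetic, nothing to do with the Zenith broadcast itself:

from itertools import product

# Every possible record of five spins, each landing B or R.
sequences = [''.join(s) for s in product("BR", repeat=5)]

print(len(sequences))   # 32
print(0.5 ** 5)         # 0.03125 -- the chance of any one particular sequence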
But no. In fact, the cards the listeners mailed in were highly nonuniform. Sequences like BBRBR and BRRBR were offered much more frequently than chance would predict, while sequences like RBRBR were less frequent than they ought to be, and RRRRR almost never showed up.
This probably doesn’t surprise you. RRRRR somehow doesn’t feel like a random sequence the way BBRBR does, even though the two are equally likely to occur when we spin the wheel. What’s going on? What do we really mean when we say that one sequence of letters is “less random” than another?
Here’s another example. Quick, think of a number from 1 to 20.
Did you pick 17?
Okay, that trick doesn’t always work—but if you ask people to pick a number between 1 and 20, 17 is the most common choice. And if you ask people for a number between 0 and 9, they most frequently pick 7. Numbers ending in 0 and 5, by contrast, are chosen much more rarely than chance would lead you to expect—they just seem less random to people. This leads to an irony. Just as the radio psychic contestants tried to match random sequences of Rs and Bs and produced notably nonrandom results, so people who choose random numbers tend to make choices that visibly deviate from randomness.
In 2009, Iran held a presidential election, which incumbent Mahmoud Ahmadinejad won by a large margin. There were widespread accusations that the vote had been fixed. But how could you hope to test the legitimacy of the vote count in a country whose government allowed for almost no independent oversight?
Two graduate students at Columbia, Bernd Beber and Alexandra Scacco, had the clever idea to use the numbers themselves as evidence of fraud, effectively compelling the official vote count to testify against itself. They looked at the official totals amassed by the four main candidates in each of Iran’s twenty-nine provinces, 116 numbers in all. If these were true vote counts, there should be no reason for the last digits of those numbers to be anything but random. They should be distributed just about evenly among the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, each one appearing about 10% of the time.
That’s not how the Iranian vote counts looked. There were too many 7s, almost twice as many as their fair share; not like digits derived from a random process, but very much like digits written down by humans trying to make them look random. This, by itself, isn’t proof that the election was fixed, but it’s evidence in that direction.*
Human beings are always inferring, always using observations to refine our judgments about the various competing theories that jostle around inside our mental representation of the world. We are very confident, almost unshakably confident, about some of our theories (“The sun will rise tomorrow,” “When you drop things, they fall”) and less sure about others (“If I exercise today, I’ll sleep well tonight,” “There’s no such thing as telepathy”). We have theories about big things and little things, things we encounter every day and things we’ve run into only once. As we encounter evidence for and against those theories, our confidence in them bobs up and down.
Our standard theory about roulette wheels is that they’re fairly balanced, and that the ball is equally likely to land on red or black. But there are competing theories—say, that the wheel is biased in favor of one color or the other.* Let’s simplify matters and suppose there are just three theories available to you:
RED: The wheel is biased to make the ball land on red 60% of the time.
FAIR: The wheel is fair, so the ball lands on red half the time and on black half the time.
BLACK: The wheel is biased to make the ball land on black 60% of the time.
How much credence do you assign to these three theories? You probably tend to think roulette wheels are fair, unless you have reason to believe otherwise. Maybe you think there’s a 90% chance that FAIR is the right theory, and only a 5% chance for each of BLACK and RED. We can draw a box for this, just like we did for the Facebook list:

BLACK 0.05     FAIR 0.90     RED 0.05
The box records what we call in probability lingo the a priori probabilities that the different theories are correct; the prior, for short. Different people might have different priors; a hardcore cynic might assign a 1/3 probability to each theory, while someone with a really firm preexisting belief in the rectitude of roulette-wheel makers might assign only a 1% probability to each of RED and BLACK.
But those priors aren’t fixed in place. If we’re presented with evidence favoring one theory over another—say, the ball landing red five times in a row—our levels of belief in the different theories can change. How might that work in this case? The best way to figure it out is to compute more conditional probabilities and draw a bigger box.
How likely is it that we’ll spin the wheel five times and get RRRRR? The answer depends on which theory is true. Under the FAIR theory, each spin has a 1/2 chance of landing on the red, so the probability of seeing RRRRR is
(1/2) × (1/2) × (1/2) × (1/2) × (1/2) = 1/32 = 3.125%
In other words, RRRRR is exactly as likely as any of the other 31 possibilities.
But if BLACK is true, there’s only a 40%, or 0.4, chance of getting red on each spin, so the chance of RRRRR is
(0.4) × (0.4) × (0.4) × (0.4) × (0.4) = 1.024%
And if RED is true, so that each spin has a 60% chance of landing red, the chance of RRRRR is
(0.6) × (0.6) × (0.6) × (0.6) × (0.6) = 7.776%.
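Or, restating the same arithmetic as a few lines of Python:

# Chance of five reds in a row under each theory (independent spins).
p_rrrrr_fair = 0.5 ** 5    # 0.03125  -> 3.125%
p_rrrrr_black = 0.4 ** 5   # 0.01024  -> 1.024%
p_rrrrr_red = 0.6 ** 5     # 0.07776  -> 7.776%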
Now we’re going to expand the box from three parts to six.
The columns still correspond to the three theories, BLACK, FAIR, and RED. But now we’ve split each column into two boxes, one corresponding to the outcome of getting RRRRR and the other to the outcome of not getting RRRRR. We’ve already done all the math we need to figure out what numbers go in the boxes. For instance, the a priori probability that FAIR is the correct theory is 0.9. And 3.125% of this probability, 0.9 × 0.03125 or about 0.0281, goes in the box where FAIR is correct and the balls fall RRRRR. The other 0.8719 goes in the “FAIR correct, not RRRRR” box, so that the FAIR column still adds up to 0.9 in all.
The a priori probability of being in the RED column is 0.05. So the chance that RED is true and the balls fall RRRRR is 7.776% of 5%, or 0.0039. That leaves 0.0461 to sit in the “RED true, not RRRRR” box.
The BLACK theory also has an a priori probability of 0.05. But that theory doesn’t jibe nearly as well with seeing RRRRR. The chance that BLACK is true and the balls fall RRRRR is just 1.024% of 5%, or 0.0005.
Here’s the box, filled in:

            BLACK     FAIR      RED
not RRRRR   0.0495    0.8719    0.0461
RRRRR       0.0005    0.0281    0.0039
(Notice that the numbers in all six boxes together add up to 1; that’s as it must be, because the six boxes represent all possible situations.)
What happens to our theories if we spin the wheel and we do get RRRRR? That ought to be good news for RED and bad news for BLACK. And that’s just what we see. Getting five reds in a row means we’re in the bottom row of the six-part box, where there’s 0.0005 attached to BLACK, 0.0281 attached to FAIR, and 0.0039 attached to RED. In other words, given that we saw RRRRR, our new judgment is that FAIR is about seven times as likely as RED, and RED is about eight times as likely as BLACK.
If you want to translate those proportions into probabilities, you just need to remember that the total probability of all the possibilities has to be 1. The sum of the numbers in the bottom row is about 0.0325, so to make those numbers sum to one without changing their proportions to one another, we can just divide each number by 0.0325. This leaves you with about a 1.5% probability for BLACK, 86.5% for FAIR, and 12% for RED.
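Here’s the whole update carried out in a short Python sketch, prior to posterior; it’s nothing but the box arithmetic above, done without rounding along the way:

# Prior degree of belief in each theory.
prior = {"BLACK": 0.05, "FAIR": 0.90, "RED": 0.05}

# Chance each theory assigns to a single spin landing red.
p_red = {"BLACK": 0.4, "FAIR": 0.5, "RED": 0.6}

# Bottom row of the box: P(theory is true AND the balls fall RRRRR).
joint = {t: prior[t] * p_red[t] ** 5 for t in prior}

# Rescale the bottom row so it sums to 1: these are the new, posterior beliefs.
total = sum(joint.values())                 # about 0.0325
posterior = {t: joint[t] / total for t in joint}

for t, p in posterior.items():
    print(t, f"{p:.1%}")    # BLACK ~1.6%, FAIR ~86.5%, RED ~12.0%
                            # (small differences from the text are rounding)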