“If the present results are surprising,” GVT write, “it is because of the robustness with which the erroneous belief in the ‘hot hand’ is held by experienced and knowledgeable observers.” And indeed, while their result was quickly taken up as conventional wisdom by psychologists and economists, it has been slow to gain traction in the basketball world. This didn’t faze Tversky, who relished a good fight, whatever the outcome. “I’ve been in a thousand arguments over this topic,” he said. “I’ve won them all, and I’ve convinced no one.”
But GVT, like Skinner before them, have answered only half the question: namely, what if the null hypothesis is true, and there is no hot hand? Then, as they demonstrate, the results would look very much like the ones observed in the real data.
But what if the null hypothesis is wrong? The hot hand, if it exists, is brief, and the effect, in strictly numerical terms, is small. The worst shooter in the league hits 40% of his shots and the best hits 60%; that’s a big difference in basketball terms, but not so big statistically. What would the shot sequences look like if the hot hand were real?
Computer scientists Kevin Korb and Michael Stillwell worked out exactly that in a 2003 paper. They generated simulations with a hot hand built in: the simulated player’s shooting percentage leaped up all the way to 90% for two ten-shot “hot” intervals over the course of the trial. In more than three-quarters of those simulations, the significance test used by GVT reported that there was no reason to reject the null hypothesis—even though the null hypothesis was completely false. The GVT design was underpowered, destined to report the nonexistence of the hot hand even if the hot hand was real.
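A rough sketch of this kind of power check is given below. It is not Korb and Stillwell's code, and it does not reproduce GVT's exact test: the Wald-Wolfowitz runs test stands in as one standard check for streakiness, and the 100-shot trial length, the 50% base rate, and the helper names `simulate_shots`, `runs_test_rejects`, and `rejection_rate` are all assumptions made for illustration.

```python
import math
import random

def simulate_shots(rng, n_shots=100, base=0.5, hot=0.9, hot_len=10, n_hot=2):
    """Return 0/1 shot outcomes with n_hot non-overlapping hot intervals."""
    # Pick hot intervals on a grid of hot_len-sized blocks so they never overlap.
    blocks = rng.sample(range(n_shots // hot_len), n_hot)
    hot_idx = {b * hot_len + i for b in blocks for i in range(hot_len)}
    return [1 if rng.random() < (hot if i in hot_idx else base) else 0
            for i in range(n_shots)]

def runs_test_rejects(shots, z_crit=1.96):
    """Wald-Wolfowitz runs test; True if the null is rejected at about 5%."""
    n1 = sum(shots)
    n0 = len(shots) - n1
    if n1 == 0 or n0 == 0:
        return False  # the test is undefined with only hits or only misses
    n = n1 + n0
    runs = 1 + sum(a != b for a, b in zip(shots, shots[1:]))
    mean = 2 * n1 * n0 / n + 1
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n * n * (n - 1))
    if var <= 0:
        return False
    return abs((runs - mean) / math.sqrt(var)) > z_crit

def rejection_rate(n_trials=2000, seed=0, **shooter):
    """Fraction of simulated trials in which the runs test rejects the null."""
    rng = random.Random(seed)
    rejects = sum(runs_test_rejects(simulate_shots(rng, **shooter))
                  for _ in range(n_trials))
    return rejects / n_trials

# With a hot hand built in, the rejection rate estimates the test's power;
# with hot == base (no hot hand) it estimates the false-alarm rate.
print("hot hand built in:", rejection_rate())
print("no hot hand:      ", rejection_rate(hot=0.5))
```

Whatever the exact numbers turn out to be under these assumptions, the quantity to watch is the first one: the share of trials in which a real, built-in hot hand is actually flagged.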
If you don’t like simulations, consider reality. Not all teams are equal when it comes to preventing shots; in the 2012–13 season, the stingy Indiana Pacers allowed opponents to make only 42% of their shots, while 47.6% of shots fell in against the Cleveland Cavaliers. So players really do have “hot spells” of a rather predictable kind: namely, they’re more likely to hit a shot when they’re playing the Cavs. But this mild heat (maybe we should call it “the warm hand”) is something the tests used by Gilovich, Vallone, and Tversky aren’t sensitive enough to feel.
—
The right question isn’t “Do basketball players sometimes temporarily get better or worse at making shots?” (the kind of yes/no question a significance test addresses). The right question is “How much does their ability vary with time, and to what extent can observers detect in real time whether a player is hot?” Here, the answer is surely “not as much as people think, and hardly at all.” A recent study found that players who make the first of two free throws become slightly more likely to make the next one, but there’s no convincing evidence supporting the hot hand in real-time game play, unless you count the subjective impressions of the players and coaches. The short life of the hot hand, which makes it so hard to disprove, makes it just as hard to reliably detect.
Gilovich, Vallone, and Tversky are absolutely correct in their central contention that human beings are quick to perceive patterns where they don’t exist and to overestimate their strength where they do. Any regular hoops watcher will routinely see one player or another sink five shots in a row. Most of the time, surely, this is due to some combination of indifferent defense, wise shot selection, or, most likely of all, plain good luck, not a sudden burst of basketball transcendence. Which means there’s no reason to expect a guy who’s just hit five in a row to be particularly likely to make the next one.
Analyzing the performance of investment advisors presents the same problem. Whether there is such a thing as skill in investing or whether differences in performance between different funds are wholly due to luck has been a vexed, murky, unsettled question for years. But if there are investors with a temporary or permanent hot hand, they’re rare, so rare that they make little to no dent in the kind of statistics contemplated by GVT. A fund that’s beaten the market five years running is vastly more likely to have been lucky than good. Past performance is no guarantee of future returns.
If Michigan fans were counting on Spike Albrecht to carry the team all the way to a championship, they were badly disappointed; Albrecht missed every shot he took in the second half, and the Wolverines ended up losing by 6.
A 2009 study by John Huizinga and Sandy Weil suggests that it might be a good idea for players to disbelieve in the hot hand, even if it really exists! In a much larger data set than GVT’s, they found a similar effect; after making a basket, players were less likely to succeed on their next shot. But Huizinga and Weil had records of not only shot success but shot location. And that data showed a striking potential explanation; players who had just made a shot were more likely to take a more difficult shot on their next attempt. Yigal Attali, in 2013, found even more intriguing results along these lines. A player who made a layup was no more likely to shoot from distance than a player who just missed a layup. Layups are easy and shouldn’t give the player a strong sense of being hot. But a player is much more likely to try a long shot after a three-point basket than after a three-point miss. In other words, the hot hand might “cancel itself out”—players, believing themselves to be hot, get overconfident and take shots they shouldn’t.
The nature of the analogous phenomenon in stock investment is left as an exercise for the reader.
EIGHT
REDUCTIO AD UNLIKELY
The stickiest philosophical point in a significance test comes right at the beginning, before we run any of the sophisticated algorithms developed by Fisher and honed by his successors. It’s right there at the beginning of step 2:
“Suppose the null hypothesis is true.”
But what we’re trying to prove, in most cases, is that the null hypothesis isn’t true. The drug works, Shakespeare alliterates, the Torah knows the future. It seems very logically fishy to assume exactly what we’re aiming to disprove, as if we’re in danger of making a circular argument.
On this point, you can rest easy. Assuming the truth of something we quietly believe to be false is a time-honored method of argument that goes all the way back to Aristotle; it is the proof by contradiction, or reductio ad absurdum. The reductio is a kind of mathematical judo, in which we first affirm what we wish eventually to deny, with the plan of throwing it over our shoulder and defeating it by means of its own force. If a hypothesis implies a falsehood,* then the hypothesis itself must be false. So the plan goes like this:
Suppose the hypothesis H is true.
It follows from H that a certain fact F cannot be the case.
But F is the case.
Therefore, H is false.
Say someone exclaims to you that two hundred children were killed by gunfire in the District of Columbia in 2012. That’s a hypothesis. But it might be somewhat hard to check (by which I mean that I typed “number of children killed by guns in DC in 2012” into the Google search bar and did not immediately learn the answer). On the other hand, if we assume the hypothesis is correct, then there cannot have been any fewer than two hundred homicides in total in DC in 2012. But there were fewer; in fact, there were only eighty-eight. So the exclaimer’s hypothesis must have been wrong. There’s no circularity here; we’ve “assumed” the false hypothesis in a kind of tentative, exploratory way, setting up the counterfactual mental world in which H is so and then watching it collapse under pressure from reality.
Put this way, the reductio sounds almost trivial, and in a sense, it is; but maybe it’s more accurate to say it’s a mental tool we’ve grown so used to handling that we forget how powerful it is. In fact, it’s a simple reductio that drives the Pythagoreans’ proof of the irrationality of the square root of 2; the one so awesomely paradigm-busting they had to kill its author; a proof so simple, refined, and compact that I can write it out whole in a page.
Suppose
H: the square root of 2 is a rational number
that is, √2 is a fraction m/n where m and n are whole numbers. We might as well write this fraction in lowest terms, which means that if there is a common factor between the numerator and denominator, we divide it out of both, leaving the fraction unchanged: no reason to write 10/14 instead of the simpler 5/7. So let’s rephrase our hypothesis:
H: the square root of 2 is equal to m/n, where m and n are whole numbers with no factor in common.
In fact, this means we can be sure it’s not the case that m and n are both even; for to say both numbers are even is exactly to say both have 2 as a factor. In that case, as in the case of 10/14, we could divide both numerator and denominator by 2 without changing the fraction, which is to say it was not in lowest terms after all. So
F: both m and n are even
is false.
Now since √2 = m/n, then by squaring both sides we see that 2 = m²/n² or, equivalently, that 2n² = m². So m² is an even number, which means that m itself is even. A number is even just when it can be written as twice another whole number; so we can, and do, write m as 2k for some whole number k. Which means that 2n² = (2k)² = 4k². Dividing both sides by 2, we find that n² = 2k².
What’s the point of all this algebra? Simply to show that n² is twice k², and therefore an even number. But if n² is even, so must n be, just like m is. But that means that F is true! By assuming H we have arrived at a falsehood, even an absurdity; that F is false and true at once. So H must have been wrong. The square root of 2 is not a rational number. By assuming it was, we proved that it wasn’t. It’s a weird trick indeed, but it works.
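For readers who prefer to see the whole argument at a glance, the chain of implications can be set out compactly. The symbols are the ones used in the text: m and n are the numerator and denominator in lowest terms, and k is the whole number with m = 2k.

```latex
\begin{align*}
\sqrt{2} = \frac{m}{n}
  &\implies 2 = \frac{m^2}{n^2}
   \implies 2n^2 = m^2
   \implies m \text{ is even, say } m = 2k, \\
2n^2 = (2k)^2 = 4k^2
  &\implies n^2 = 2k^2
   \implies n \text{ is even.}
\end{align*}
```

Both m and n come out even, which contradicts the assumption that m/n was in lowest terms.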
You can think of the null hypothesis significance test as a sort of fuzzy version of the reductio:
Suppose the null hypothesis H is true.
It follows from H that a certain outcome O is very improbable (say, less than Fisher’s 0.05 threshold).
But O was actually observed.
Therefore, H is very improbable.
Not a reductio ad absurdum, in other words, but a reductio ad unlikely.
A classical example comes from the eighteenth-century astronomer and clergyman John Michell, among the first to take a statistical approach to the study of the heavenly bodies. The cluster of dim stars in one corner of the constellation Taurus has been observed by just about every civilization. The Navajo call them Dilyehe, “the sparkling figure”; the Maori call them Matariki, “the eyes of god.” To the ancient Romans they were a bunch of grapes and in Japanese they’re Subaru (in case you ever wondered where the car company’s six-star logo came from). We call them the Pleiades.
All these centuries of observation and mythmaking couldn’t answer the fundamental scientific question about the Pleiades: is the cluster actually a cluster? Or are the six stars separated by unfathomable distances, but arrayed by chance in almost the exact same direction from Earth? Points of light, placed at random in our frame of vision, look something like this:
You see some clumps, right? That’s to be expected: there will inevitably be some groups of stars that wind up almost on top of one another, simply by happenstance. How can we be sure that’s not what’s going on with the Pleiades? It’s the same phenomenon Gilovich, Vallone, and Tversky pointed out: a perfectly consistent point guard, who enjoys no hot streaks and suffers no slumps, will nonetheless sometimes nail five shots in a row.
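How often should a perfectly consistent shooter produce such a streak? One way to find out is to compute the chance exactly. The sketch below does so by tracking the length of the current run of hits; the function name `prob_streak` and the sample inputs (a 50% shooter, streaks of five) are illustrative assumptions, not figures from GVT.

```python
def prob_streak(n_shots, streak_len, p):
    """Chance of at least one run of `streak_len` straight hits in `n_shots`
    independent shots, each made with probability p."""
    # state[j] = probability that no full streak has happened yet and the
    # shooter is currently on a run of exactly j hits.
    state = [0.0] * streak_len
    state[0] = 1.0
    done = 0.0  # probability that a full streak has already occurred
    for _ in range(n_shots):
        nxt = [0.0] * streak_len
        for j, mass in enumerate(state):
            nxt[0] += mass * (1 - p)      # a miss resets the run
            if j + 1 == streak_len:
                done += mass * p          # a hit completes the streak
            else:
                nxt[j + 1] += mass * p    # a hit extends the run
        state = nxt
    return done

# A streak of five from a 50% shooter becomes more likely the more he shoots.
for shots in (5, 20, 100):
    print(shots, prob_streak(shots, 5, 0.5))
```

The point is only qualitative: give a streakless shooter enough attempts and a run of five stops being remarkable.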
In fact, if there were no big visible clusters of stars, as in this picture:
that itself would be evidence that some nonrandom process was at work. The second picture might look “more random” to the naked eye, but it is not; it testifies that the points have a built-in disinclination to crowd.
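The contrast between the two kinds of picture can be imitated in a few lines. In this sketch, truly random points are drawn uniformly in a unit square, while the "disinclined to crowd" points come from a jittered grid; the jittered grid is only one hypothetical way to build such a pattern, chosen here for simplicity, and the function names are invented for the example.

```python
import math
import random

def min_pair_distance(points):
    """Smallest distance between any two points in the list."""
    return min(math.dist(a, b)
               for i, a in enumerate(points)
               for b in points[i + 1:])

def uniform_points(rng, n):
    """n points placed independently and uniformly in the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def jittered_grid(rng, side, jitter=0.25):
    """side*side points, one per grid cell, each nudged by at most `jitter`
    cell-widths from the cell's center, so no two can get very close."""
    cell = 1.0 / side
    return [((i + 0.5 + rng.uniform(-jitter, jitter)) * cell,
             (j + 0.5 + rng.uniform(-jitter, jitter)) * cell)
            for i in range(side) for j in range(side)]

rng = random.Random(0)
random_pts = uniform_points(rng, 100)
spaced_pts = jittered_grid(rng, 10)

# The random picture has near-collisions; the evenly spread one cannot.
print("closest pair, random points:", min_pair_distance(random_pts))
print("closest pair, jittered grid:", min_pair_distance(spaced_pts))
```

With a 10-by-10 grid and a jitter of a quarter cell, no two grid points can come closer than 0.05, whereas a hundred truly random points will almost always contain a pair far closer than that.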
So the mere appearance of an apparent cluster shouldn’t convince us that the stars in question are actually clumped together in space. On the other hand, a group of stars in the sky might be so tightly packed as to demand that one doubt it could have happened by chance. Michell showed that, were visible stars randomly strewn around in space, the chance that six would array themselves so neatly as to present a Pleiades-like cluster to our eyes was small indeed; about 1 in 500,000, by his computation. But there they are above us, the tightly packed bunch of grapes. Only a fool, Michell concluded, could believe it had happened by chance.
Fisher wrote approvingly of Michell’s work, making explicit the analogy he saw there between Michell’s argument and the classical reductio:
“The force with which such a conclusion is supported is logically that of a simple disjunction: Either an exceptionally rare chance has occurred, or the theory of random distribution is not true.”
The argument is compelling, and its conclusion correct; the Pleiades are indeed no optical coincidence, but a real cluster—of several hundred adolescent stars, not just the six visible to the eye. The fact that we see many very tight clusters of stars like the Pleiades, much tighter than would be likely to exist by chance, is good evidence that the stars are not placed randomly, but rather are clumped by some real physical phenomenon out there in the void.
But here’s the bad news: the reductio ad unlikely, unlike its Aristotelian ancestor, is not logically sound in general. It leads us into its own absurdities. Joseph Berkson, the longtime head of the medical statistics division at the Mayo Clinic, who cultivated (and loudly broadcast) a vigorous skepticism about methodology he thought shaky, offered a famous example demonstrating the pitfalls of the method. Suppose you have a group of fifty experimental subjects, who you hypothesize (H) are human beings. You observe (O) that one of them is an albino. Now, albinism is extremely rare, affecting no more than one in twenty thousand people. So given that H is correct, the chance you’d find an albino among your fifty subjects is quite small, less than 1 in 400,* or 0.0025. So the p-value, the probability of observing O given H, is much lower than .05.
We are inexorably led to conclude, with a high degree of statistical confidence, that H is incorrect: the subjects in the sample are not human beings.
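The arithmetic behind Berkson's example is easy to check. The sketch below assumes the fifty subjects are independent draws from a population in which the albinism rate is exactly the one-in-twenty-thousand upper bound quoted above, and it computes the chance of seeing at least one albino.

```python
albino_rate = 1 / 20_000   # upper bound on prevalence quoted in the text
n_subjects = 50

# Chance that at least one of 50 independent human subjects is an albino.
p_at_least_one = 1 - (1 - albino_rate) ** n_subjects

print(f"p-value under H: {p_at_least_one:.6f}")
print("below 1 in 400?", p_at_least_one < 1 / 400)
print("below 0.05?    ", p_at_least_one < 0.05)
```

The p-value clears Fisher's threshold with room to spare, which is exactly what makes the conclusion so absurd.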
It’s tempting to think of “very improbable” as meaning “essentially impossible,” and, from there, to utter the word “essentially” more and more quietly in our mind’s voice until we stop paying attention to it.* But impossible and improbable are not the same—not even close. Impossible things never happen. But improbable things happen a lot. That means we’re on quivery logical footing when we try to make inferences from an improbable observation, as reductio ad unlikely asks us to. That time in North Carolina when the lottery combo 4, 21, 23, 34, 39 came up twice in a week raised a lot of questions; was something wrong with the game? But each combination of numbers is exactly as likely to come up as any other. For the numbers to show 4, 21, 23, 34, 39 on Tuesday and 16, 17, 18, 22, 39 on Thursday is precisely as improbable as what actually took place—there’s just one chance in 300 billion or so of getting those two draws on those two days. In fact, any particular outcome of the Tuesday and Thursday lottery draws is a one in 300 billion shot. If you’re committed to the view that a highly improbable outcome should lead you to question the fairness of the game, you’re going to be the person shooting off an angry e-mail to the lottery commissioner every Thursday of your life, no matter which numbered balls drop out of the cage.
Don’t be that person.
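The one-in-300-billion figure can be reconstructed with a short calculation. The text does not state the rules of the game, so the sketch assumes each draw picks 5 distinct numbers out of 39, which is consistent with the numbers shown and with the quoted odds.

```python
import math

# Assumption: each draw picks 5 distinct numbers from 1..39, order ignored.
combos_per_draw = math.comb(39, 5)

# Any specific Tuesday result paired with any specific Thursday result.
two_draw_outcomes = combos_per_draw ** 2

print(f"combinations per draw: {combos_per_draw:,}")
print(f"Tuesday/Thursday pairs: {two_draw_outcomes:,}")
```

Under that assumption there are 575,757 possible draws and about 331 billion equally likely Tuesday-and-Thursday pairs, the repeated combination being just one of them.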
PRIME CLUSTERS AND THE STRUCTURE OF STRUCTURELESSNESS
Michell’s critical insight, that clusters of stars might appear to our eye even if stars were randomly distributed around our field of vision, doesn’t apply only to the celestial sphere. This phenomenon was the hinge for the pilot episode of the math/cop drama Numb3rs.* A series of grisly attacks, marked by pins on the wall map at HQ, showed no clusters; ergo, a single cunning serial killer intentionally leaving space between victims, not an unconnected burst of psychos, was at work. It was somewhat contrived as a police story, but mathematically it was perfectly correct.
The appearance of clusters in random data offers insight even in situations where there is no real randomness at all, like the behavior of prime numbers. In 2013, Yitang “Tom” Zhang, a popular math lecturer at the University of New Hampshire, stunned the world of pure mathematics when he announced that he had proven the “bounded gaps” conjecture about the distribution of primes. Zhang had been a star student at Beijing University, but had never thrived after moving to the United States for his PhD in the 1980s. He hadn’t published a paper since 2001. At one point, he left academic math entirely to sell sandwiches at Subway, until a fellow former student from Beijing tracked him down and helped him get an untenured lectureship at UNH. To all outward appearances, he was washed up. So it came as a great surprise when he released a paper proving a theorem some of the biggest names in number theory had tried, and failed, to conquer.
But the fact that the conjecture is true came as no surprise at all. Mathematicians have a reputation of being no-B.S. hard cases who don’t believe a thing until it’s locked down and proved. That’s not quite true. All of us believed the bounded gaps conjecture before Zhang’s big reveal, and we all believe the closely related twin primes conjecture, even though it remains unproven. Why?
Let’s start with what the two conjectures say. The prime numbers are those numbers greater than 1 that aren’t multiples of any number smaller than themselves and greater than 1; so 7 is a prime, but 9 is not, because it’s divisible by 3. The first few primes are 2, 3, 5, 7, 11, and 13.
Every positive number can be expressed in just one way as a product of prime numbers. For instance, 60 is made up of two 2s, one 3, and one 5, because 60 = 2 × 2 × 3 × 5. (This is why we don’t take 1 to be a prime, though some mathematicians have done so in the past; it breaks the uniqueness, because if 1 counts as prime, 60 could be written as 2 × 2 × 3 × 5 and 1 × 2 × 2 × 3 × 5 and 1 × 1 × 2 × 2 × 3 × 5 . . .) What about prime numbers themselves? They’re fine; a prime number, like 13, is the product of a single prime, 13 itself. And what about 1? We’ve excluded it from our list of primes, so how can it be a product of primes, each one of which is larger than 1? Simple: 1 is the product of no primes.
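The definitions above translate directly into code. The sketch below uses plain trial division, which is fine for small numbers; the function names `is_prime` and `prime_factors` are invented for the example.

```python
import math

def is_prime(n):
    """True if n > 1 and n is not a multiple of any number between 2 and n - 1."""
    if n < 2:
        return False
    # It is enough to try divisors up to the square root of n.
    return all(n % d != 0 for d in range(2, math.isqrt(n) + 1))

def prime_factors(n):
    """The primes whose product is n, in increasing order, with repeats."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors

print([n for n in range(2, 14) if is_prime(n)])  # [2, 3, 5, 7, 11, 13]
print(prime_factors(60))                         # [2, 2, 3, 5]
print(prime_factors(13))                         # [13]
print(prime_factors(1), math.prod([]))           # [] 1
```

The last line mirrors the convention in the text: 1 has no prime factors at all, and the product of an empty list of numbers is taken to be 1.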