1 See Larry Laudan's (1996) ingenious analysis of how much the positivists and post-positivists share in "The Sins of the Fathers."
2 I will not distinguish personalists and subjectivists, even though I realize there is a history of distinct terms.
3 Aside: should an author stop to explain every joke, as some reviewers seem to think? I don't think so, but you can look up "omelet Savage 1961."
4 Silver recognizes that groupthink created an echo chamber during the 2016 election in the USA.
Tour II
Rejection Fallacies: Who's Exaggerating What?
Comedian Jackie Mason will be doing his shtick this evening in the ship's theater: a one-man show consisting of a repertoire of his "Greatest Hits" without a new or updated joke in the mix. A sample:
If you want to eat nothing, eat nouvelle cuisine. Do you know what it means? No food. The smaller the portion the more impressed people are, so long as the food's got a fancy French name, haute cuisine. An empty plate with sauce!
You'll get the humor only once you see and hear him (Mayo 2012b). As one critic (Logan 2012) wrote, Mason's jokes "offer a window to a different era," one whose caricatures and biases one can only hope we've moved beyond. It's one thing for Jackie Mason to reprise his greatest hits, another to reprise statistical foibles and howlers which could leave us with radical changes to science. Among the tribes we'll be engaging: Large n, Jeffreys–Lindley, and Spike and Smear.
How Could a Group of Psychologists Be so Wrong?
I'll carry a single tome in our tour: Morrison and Henkel's 1970 classic, The Significance Test Controversy. Some abuses of the proper interpretation of significance tests were deemed so surprising even back then that researchers in psychology conducted studies to try to understand how this could be. Notably, Rosenthal and Gaito (1963) discovered that statistical significance at a given level was often fallaciously taken as evidence of a greater discrepancy from the null hypothesis the larger the sample size n. In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size.
What is shocking is that these psychologists indicated substantially greater confidence or belief in results associated with the larger sample size for the same p values. According to the theory, especially as this has been amplified by Neyman and Pearson (1933), the probability of rejecting the null hypothesis for any given deviation from null and p values increases as a function of the number of observations. The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population … The question is, how could a group of psychologists be so wrong?
(Bakan 1970, p. 241)
(Our convention is for "discrepancy" to refer to the parametric, not the observed, difference. Their use of "deviation" from the null alludes to our "discrepancy.")
As statistician John Pratt notes, "the more powerful the test, the more a just significant result favors the null hypothesis" (1961, p. 166). Yet we still often hear: "The thesis implicit in the [N-P] approach, [is] that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases" (Howson and Urbach 1993, p. 209). In fact, the thesis implicit in the N-P approach, as Bakan remarks, is the opposite! The fallacy is akin to making mountains out of molehills according to severity (Section 3.2):
Mountains out of Molehills (MM) Fallacy (large n problem): The fallacy of taking a rejection of H₀, just at level P, with larger sample size (higher power) as indicative of a greater discrepancy from H₀ than with a smaller sample size.
Consider an analogy with two fire alarms: The first goes off with a sensor liable to pick up on burnt toast; the second is so insensitive it doesn't kick in until your house is fully ablaze. You're in another state, but you get a signal when the alarm goes off. Which fire alarm indicates the greater extent of fire? Answer: the second, less sensitive one. When the sample size increases it alters what counts as a single sample. It is like increasing the sensitivity of your fire alarm. It is true that a large enough sample size triggers the alarm with an observed mean that is quite "close" to the null hypothesis. But, if the test rings the alarm (i.e., rejects H₀) even for tiny discrepancies from the null value, then the alarm is poor grounds for inferring larger discrepancies. Now, this is an analogy; you may poke holes in it. For instance, a test must have a large enough sample to satisfy model assumptions. True, but our interpretive question can't even get started without taking the P-values as legitimate and not spurious.
4.3 Significant Results with Overly Sensitive Tests: Large n Problem
"[W]ith a large sample size virtually every null hypothesis is rejected, while with a small sample size, virtually no null hypothesis is rejected. And we generally have very accurate estimates of the sample size available without having to use significance testing at all!"
(Kadane 2011, p. 438)
P-values are sensitive to sample size, but to see this as a problem is to forget what significance tests are for. We want consistent tests, so that as n increases the probability of discerning any discrepancy from the null (i.e., the power) increases. The fact that the test would eventually uncover any discrepancy there may be, regardless of how small, doesn't mean there always is such a discrepancy, by the way. (Another little confusion repeated in the form of "all null hypotheses are false.") Let's focus on the example of Normal testing, T+ with H₀: μ ≤ 0 vs. H₁: μ > 0, letting σ = 1. It's precisely to bring out the effect of sample size that many prefer to write the statistic as

d(X) = √n(X̅ − 0)/σ

rather than

d(X) = (X̅ − 0)/σ_X̅,

where σ_X̅ abbreviates (σ/√n).

T+ rejects H₀ (at the 0.025 level) iff the sample mean X̅ ≥ 0 + 2σ_X̅ (rounding the 1.96 cut-off to 2). As n increases, a single (σ/√n) unit decreases. Thus the value of X̅ required to reach significance decreases as n increases.
The test's goal is to distinguish observed effects due to ordinary expected variability under H₀ from those that cannot be readily explained by mere noise. If the inter-ocular test will do, you don't need statistics. As the sample size increases, the ordinary expected variability decreases. The severe tester takes account of the sample size in interpreting the discrepancy indicated. The test is like a thermostat, a fire alarm, or the mesh size in a fishing net. You choose the sensitivity, and it does what you told it to do.
Keep in mind that the hypotheses entertained are not point values, but discrepancies. Informally, for a severe tester, each inference corresponds to an assertion of the form: there's evidence of a discrepancy at least this large, but there's poor evidence it's as large as thus and so. Let's compare statistically significant results at the same level but with different sample sizes.
Consider the 2-standard deviation cut-off for n = 25, 100, 400 in test T+, σ = 1 (Figure 4.1).
Figure 4.1 X̅ ~ N(μ, σ²/n) for n = 25, 100, 400.
Let x̄_0.025 abbreviate the sample mean that is just statistically significant at the 0.025 level in each test. With n = 25, x̄_0.025 = 2(0.2) = 0.4; with n = 100, x̄_0.025 = 2(0.1) = 0.2; with n = 400, x̄_0.025 = 2(0.05) = 0.1. So the cut-offs for rejection are 0.4, 0.2, and 0.1, respectively.
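For readers who like to check the arithmetic, here is a minimal sketch in Python (the language and the function name are my choices, not the book's) that reproduces these just-significant cut-offs for test T+ at the 0.025 level with σ = 1:

```python
from math import sqrt

def cutoff(n, sigma=1.0, c=2.0):
    """Sample mean just reaching significance: x-bar = 0 + c*(sigma/sqrt(n))."""
    return c * sigma / sqrt(n)

for n in (25, 100, 400):
    print(n, cutoff(n))   # -> 0.4, 0.2, 0.1
```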
Again, alterations of the sample size change what counts as one unit. If you treat identical values of (X̅ − 0) the same, ignoring √n, you will misinterpret your results. With large enough n, the cut-off for rejection can be so close to the null value as to lead some accounts to regard it as evidence for the null. This is the Jeffreys–Lindley paradox that we'll be visiting this afternoon (Section 4.4).
Exhibit (v): Responding to a Familiar Chestnut
Did you hear the one about the significance tester who rejected H₀ in favor of H₁ even though the result makes H₀ more likely than H₁?
I follow the treatment in Elliott Sober (2008, p. 56), who is echoing Howson and Urbach (1993, pp. 208–9), who are echoing Lindley (1957). The only difference is that I will allude to a variant on test T+: H₀: μ = 0 vs. H₁: μ > 0 with σ = 1. "Another odd property of significance tests," says Sober, "concerns the way in which they are sensitive to sample size." Suppose you are applying test T+ with null H₀: μ = 0. If your sample size is n = 25, and you choose α = 0.025, you will reject H₀ whenever x̄ ≥ 0.4. If you examine n = 100, and choose the same value for α, you will reject H₀ whenever x̄ ≥ 0.2. And if you examine n = 400, again with α = 0.025, you will reject H₀ whenever x̄ ≥ 0.1. "As sample size increases" the sample mean must be closer and closer to 0 for you not to reject H₀. "This may not seem strange until you add the following detail. Suppose the alternative to H₀ is the hypothesis" H₁: μ = 0.24. "The Law of Likelihood now entails that observing" any x̄ < 0.12 favors H₀ over H₁, so x̄ = 0.1 in particular favors H₀ over H₁ (Section 1.4).
Your reply: Hold it right at "add the following detail." You're observing that the significance test disagrees with a Law of Likelihood appraisal of a point vs. point test: H₀: μ = 0 vs. H₁: μ = 0.24. We require the null and alternative hypotheses to exhaust the space of parameters, and these don't. Nor are our inferences to points, but rather to inequalities about discrepancies. That said, we're prepared to consider your example, and make short work of it. We're testing H₀: μ ≤ 0 vs. H₁: μ > 0 (one could equally view it as testing H₀: μ = 0 vs. H₁: μ > 0). The outcome x̄ = 0.1, while indicating some positive discrepancy from 0, offers bad evidence and an insevere test for inferring μ as great as 0.24. Since x̄ = 0.1 rejects H₀, we say the result accords with H₁. The severity associated with the inference μ ≥ 0.24 asks: what's the probability of observing X̅ < 0.1, i.e., a result more discordant with H₁, assuming μ = 0.24?
SEV(μ ≥ 0.24) with x̄ = 0.1 and n = 400 is computed as Pr(X̅ < 0.1; μ = 0.24). Standardizing yields Z = √400(0.1 − 0.24)/1 = 20(−0.14) = −2.8. So SEV(μ ≥ 0.24) = 0.003! Were μ as large as 0.24, we'd have observed a larger sample mean than we did with 0.997 probability! It's terrible evidence for H₁: μ = 0.24.
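A small sketch (again my own Python, standard library only; the function name sev_ge is mine, not notation from the text) confirming that severity computation:

```python
from math import sqrt
from statistics import NormalDist

def sev_ge(mu1, xbar, n, sigma=1.0):
    """Severity for the claim mu >= mu1: Pr(X-bar < xbar; mu = mu1)."""
    z = sqrt(n) * (xbar - mu1) / sigma
    return NormalDist().cdf(z)

print(round(sev_ge(0.24, xbar=0.1, n=400), 3))   # 0.003 (Z = -2.8)
```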
This is redolent of the Binomial example in discussing Royall (Section 1.4). To underscore the difference between the Likelihoodist's comparative appraisal and the significance tester's, you might go further. Consider an alternative that the Likelihoodist takes as favored over H₀: μ = 0 with x̄ = 0.1, namely, the maximum likely alternative H₁: μ = 0.1. This is one of our key benchmarks for a discrepancy that's poorly indicated. To the Likelihoodist, inferring that H₁: μ = 0.1 "is favored" over H₀: μ = 0 makes sense, whereas to infer a discrepancy of 0.1 from H₀ is highly unwarranted for a significance tester.¹ Our aims are very different.
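To make the Likelihoodist's comparison concrete, here is an illustrative sketch (my own, not from the text; it assumes the Normal model of test T+ with n = 400, so the standard error of X̅ is 0.05) of the likelihoods of x̄ = 0.1 under μ = 0, μ = 0.24, and the maximum likely μ = 0.1:

```python
from math import sqrt
from statistics import NormalDist

n, sigma, xbar = 400, 1.0, 0.1
se = sigma / sqrt(n)                    # standard error of X-bar: 0.05

def lik(mu):
    """Likelihood of mu given xbar, with X-bar ~ N(mu, se^2)."""
    return NormalDist(mu, se).pdf(xbar)

print(lik(0.0) > lik(0.24))             # True: xbar = 0.1 favors mu = 0 over mu = 0.24
print(round(lik(0.1) / lik(0.0), 1))    # 7.4: the maximum likely mu = 0.1 vs. mu = 0
```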
We can grant this: starting with any value for x̄, however close to 0, there's an n such that x̄ is statistically significantly greater than 0 at a chosen level. If one understands the test's intended task, this is precisely what is wanted. How large would n need to be so that x̄ = 0.02 is statistically significant at the 0.025 level (still retaining σ = 1)?
Answer: Setting 0.02 = 2(1/√n) and solving for n yields n = 10,000.²
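A one-line check of that arithmetic (my own sketch):

```python
# Solve 0.02 = 2*(sigma/sqrt(n)) for n, with sigma = 1
sigma, diff, c = 1.0, 0.02, 2.0
n = (c * sigma / diff) ** 2
print(n)   # 10000.0
```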
Statistics won't tell you what magnitudes are of relevance to you. No matter, we can critique results and purported inferences.
Exhibit (vi): Reforming the Reformers on Confidence Intervals.
You will be right to wonder why some of the same tribes who raise a ruckus over P-values – to the extent, in some cases, of calling for a "test ban" – are cheerleading for confidence intervals (CIs), given there is a clear duality between the two (Section 3.7). What they're really objecting to is a dichotomous use of significance tests, where the report is "significant" or not at a predesignated significance level. I completely agree with this objection, and reject the dichotomous use of tests (which isn't to say there are no contexts where an "up/down" indication is apt). We should reject Unitarianism, where a single method with a single interpretation must be chosen. Ironically, some of the most outspoken CI leaders use them in the dichotomous fashion (rightly) deplored when it comes to testing.
Geoffrey Cumming, an acknowledged tribal leader on CIs, tells us that "One-sided CIs are analogous to one-tailed tests but, as usual, the estimation approach is better" (2012, p. 109). Well, it might be better, but like hypothesis testing, it calls for supplements and reinterpretations, as begun in Section 3.7.
Our one-sided test T+ (H₀: μ ≤ 0 vs. H₁: μ > 0, and σ = 1) at α = 0.025 has as its dual the one-sided (lower) 97.5% general confidence interval: μ > X̅ − 2(σ/√n), rounding to 2 from 1.96. So you won't have to flip back pages, here's a quick review of the notation we developed to avoid the common slipperiness with confidence intervals. We abbreviate the generic lower limit of a (1 − α) confidence interval as μ̂_{1−α}(X̅) and the particular limit as μ̂_{1−α}(x̄). The general estimating procedure is: Infer μ > μ̂_{1−α}(X̅). The particular estimate is μ > μ̂_{1−α}(x̄). Letting α = 0.025 we have: μ̂_{0.975}(X̅) = X̅ − 2(σ/√n). With α = 0.05, we have μ̂_{0.95}(X̅) = X̅ − 1.65(σ/√n).
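Here is a rough sketch of that estimating procedure (my own Python; the name lower_limit is not the book's notation, and it uses the exact 1.96 rather than the rounded 2):

```python
from math import sqrt
from statistics import NormalDist

def lower_limit(xbar, n, alpha, sigma=1.0):
    """Lower (1 - alpha) confidence bound: infer mu > xbar - c_alpha*(sigma/sqrt(n))."""
    c_alpha = NormalDist().inv_cdf(1 - alpha)   # about 1.96 for alpha = 0.025, 1.64 for 0.05
    return xbar - c_alpha * sigma / sqrt(n)

print(round(lower_limit(0.2, n=100, alpha=0.025), 3))   # 0.004, i.e., about 0 once 1.96 is rounded to 2
```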
Cumming's interpretation of CIs and confidence levels points to their performance-oriented construal: "In the long run 95% of one-sided CIs will include the population mean … We can say we're 95% confident our one-sided interval includes the true value … meaning that for 5% of replications the [lower limit] will exceed the true value" (Cumming 2012, p. 112). What does it mean to be 95% confident in the particular interval estimate for Cumming? "It means that the values in the interval are plausible as true values for μ, and that values outside the interval are relatively implausible – though not impossible" (ibid., p. 79). The performance properties of the method rub off in a plausibility assessment of some sort.
The test that's dual to the CI would "accept" those parameter values within the corresponding interval, and reject those outside, all at a single predesignated confidence level 1 − α. Our main objection to this is that it gives the misleading idea that there's evidence for each value in the interval, whereas, in fact, the interval simply consists of values that aren't rejectable, were one testing at the α level. Not being a rejectable value isn't the same as having evidence for that value. Some values are close to being rejectable, and we should convey this. Standard CIs do not.
To focus on how CIs deal with distinguishing sample sizes, consider again the three instances of test T+ with (i) n = 25, (ii) n = 100, and (iii) n = 400. Imagine the observed mean from each test just hits the significance level 0.025. That is, (i) x̄ = 0.4, (ii) x̄ = 0.2, and (iii) x̄ = 0.1. Form 0.975 confidence interval estimates for each:
(i) for n = 25, the inferred estimate is μ > x̄ − 2(σ/√25), that is, μ > 0.4 − 2(0.2);
(ii) for n = 100, the inferred estimate is μ > x̄ − 2(σ/√100), that is, μ > 0.2 − 2(0.1);
(iii) for n = 400, the inferred estimate is μ > x̄ − 2(σ/√400), that is, μ > 0.1 − 2(0.05).
Substituting in all cases, we get the same one-sided confidence interval:
µ > 0.
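For the record, a short sketch (mine, rounding 1.96 to 2 as the text does) confirming that each just-significant mean yields the same lower 0.975 bound:

```python
from math import sqrt

sigma, c = 1.0, 2.0   # rounding 1.96 to 2, as in the text
for n, xbar in [(25, 0.4), (100, 0.2), (400, 0.1)]:
    lower = xbar - c * sigma / sqrt(n)
    print(f"n = {n}: mu > {lower}")   # mu > 0.0 in every case
```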
Cumming writes them as [0, infinity). How are the CIs distinguishing them?
They are not. The construal is dichotomous: in or out, plausible or not. Would we really want to say "the values in the interval are plausible as true values for μ"? Clearly not, since that includes values to infinity. I don't want to step too hard on the CI champion's toes, since CIs are in the frequentist, error statistical tribe. Yet, to avoid fallacies, this standard use of CIs won't suffice. Severity directs you to avoid taking your result as indicating a discrepancy beyond what's warranted. For an example, we can show the same inference, μ > 0.1, is poorly indicated with n = 400, while fairly well indicated when n = 100. For a poorly indicated claim, take our benchmark for severity of 0.5; for fairly well, 0.84:

With n = 400 (x̄ = 0.1): SEV(μ > 0.1) = 0.5; with n = 100 (x̄ = 0.2): SEV(μ > 0.1) = 0.84.
The reasoning based on severity is counterfactual: were μ less than or equal to 0.1, it is fairly probable, 0.84, that a smaller X̅ than the observed 0.2 would have occurred. This is not part of the standard CI account, but it enables the distinction we want. Another move would be for a CI advocate to require that we always compute a two-sided interval. The upper 0.975 bound would reflect the greater sensitivity with increasing sample sizes:
(i) n = 25: (0, 0.8], (ii) n = 100: (0, 0.4], (iii) n = 400: (0, 0.2].
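Here is a sketch (my own labels and Python throughout) of both severity computations just cited and of the two-sided upper limits, showing how each registers the differing sample sizes:

```python
from math import sqrt
from statistics import NormalDist

def sev_gt(mu1, xbar, n, sigma=1.0):
    """Severity for the claim mu > mu1: Pr(X-bar < xbar; mu = mu1)."""
    return NormalDist().cdf(sqrt(n) * (xbar - mu1) / sigma)

# Same claim mu > 0.1, same 0.025 significance level, different n:
print(round(sev_gt(0.1, xbar=0.2, n=100), 2))    # 0.84: fairly well indicated
print(round(sev_gt(0.1, xbar=0.1, n=400), 2))    # 0.5: poorly indicated

# Two-sided 0.95 upper limits, xbar + 2*(sigma/sqrt(n)):
for n, xbar in [(25, 0.4), (100, 0.2), (400, 0.1)]:
    print(f"n = {n}: upper bound {xbar + 2 * 1.0 / sqrt(n)}")   # 0.8, 0.4, 0.2
```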
But we cannot just deny one-sided tests, nor does Cumming. In fact, he encourages their use: "it's unfortunate they are usually ignored" (2012, p. 113). (He also says he is happy for people to decide afterwards whether to report it as a one- or two-sided interval (ibid., p. 112), only doubling α, which I do not mind.) Still needed is a justification for bringing in the upper limit when applying a one-sided estimator, and severity supplies it. You should always be interested in at least two benchmarks: discrepancies well warranted and those terribly warranted. In test T+, our handy benchmark for the terrible is to set the lower limit to the observed mean x̄ itself. The severity for μ > x̄ is only 0.5. Two side notes:
First, I grant it would be wrong to charge Cumming with treating all parameter values within the confidence interval on a par, because he does suggest distinguishing them by their likelihoods (by how probable each renders the outcome). Take just the single 0.975 lower CI bound with n = 100 and x̄ = 0.2. A μ value closer to the observed 0.2 has higher likelihood (in the technical sense) than ones close to the 0.975 lower limit 0. For example, μ = 0.15 is more likely than μ = 0.05. However, this moves away from CI reasoning (toward likelihood comparisons). The claim μ >