Let’s look at this. The numerator is the likelihood of the data x maximized over the entire parameter space Θ; the value of θ that makes x most probable is the maximum likelihood estimator for θ. Write it as θ̂(x). The denominator is the likelihood of x maximized over just the members of the null Θ0. It may be called the restricted likelihood; write the restricted maximizer as θ̃0(x). In these terms the ratio is:

Λ(x) = L(θ̂(x); x)/L(θ̃0(x); x).
Suppose that looking through the entire parameter space Θ we cannot find a θ value that makes the data more probable than if we restrict ourselves to the parameter values in Θ0. Then the restricted likelihood in the denominator is large, making the ratio Λ(X) small. Thus, a small Λ(X) corresponds to H0 being in accordance with the data (Wilks 1962, p. 404). It’s a matter of convenience which way one writes the ratio. In the one we’ve chosen, following Aris Spanos (1986, 1999), the larger the Λ(X), the more discordant the data are from H0. This suggests the null would be rejected whenever

Λ(X) ≥ kα

for some value of kα.
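To make the ratio concrete, here is a minimal sketch, not from the text, computing Λ(x) for n IID Normal observations with σ known, taking Θ0 to be a single point {θ0} for simplicity. All names and numbers are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0, sigma, n = 0.0, 1.0, 25
x = rng.normal(loc=0.4, scale=sigma, size=n)   # data generated with a discrepancy from theta0

def log_lik(theta, x, sigma):
    """Log-likelihood of an IID N(theta, sigma^2) sample."""
    return stats.norm.logpdf(x, loc=theta, scale=sigma).sum()

theta_hat = x.mean()    # unrestricted MLE: maximizes the likelihood over all of Theta
theta_null = theta0     # restricted maximizer: the only member of Theta_0 here

# Lambda(x) = (likelihood maximized over Theta) / (likelihood maximized over Theta_0)
log_Lambda = log_lik(theta_hat, x, sigma) - log_lik(theta_null, x, sigma)
print(f"theta_hat = {theta_hat:.3f}, Lambda(x) = {np.exp(log_Lambda):.2f}")
```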
So far all of this was to form the distance measure Λ(X). It’s looking somewhat the same as the Likelihoodist account. Yet we know that the additional step 3 that error statistics demands is to compute the probability of Λ(X) under different hypotheses. Merely reporting likelihood ratios does not produce meaningful control of errors; nor do likelihood ratios mean the same thing in different contexts. So N-P consider the probability distribution of Λ(X), and they want to ensure the probability of the event {Λ(X) ≥ kα} is sufficiently small under H0. They set kα so that

Pr(Λ(X) ≥ kα; H0) = α

for small α. Equivalently, they want to ensure high probability of accordance with H0 just when it adequately describes the data generation process. Note the complement:

Pr(Λ(X) < kα; H0) = (1 − α).
The event statement to the left of the “;” does not reverse positions with H0 when you form the complement; H0 stays where it is.
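A hedged sketch of this calibration step, continuing the illustrative Normal point-null setup: simulate the sampling distribution of Λ(X) when H0 is true and take kα to be the (1 − α) quantile, so that Pr(Λ(X) ≥ kα; H0) ≈ α. (For this particular model, lnΛ(x) = n(x̄ − θ0)²/(2σ²), which the code uses as a shortcut.)

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
reps = 100_000

# Simulate the sample mean under H0, then form log Lambda = n*(xbar - theta0)^2 / (2*sigma^2)
xbar = rng.normal(loc=theta0, scale=sigma / np.sqrt(n), size=reps)
Lambda = np.exp(n * (xbar - theta0) ** 2 / (2 * sigma ** 2))

k_alpha = np.quantile(Lambda, 1 - alpha)   # cutoff with Pr(Lambda >= k_alpha; H0) ~ alpha
print(f"k_alpha ~ {k_alpha:.2f}; exceedance rate under H0 = {np.mean(Lambda >= k_alpha):.3f}")
```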
The set of data points leading to {Λ(X) ≥ kα} is what N-P call the critical region or rejection region of the test, {x: Λ(X) ≥ kα} – the set of outcomes that will be taken to reject H0 or, in our terms, to infer a discrepancy from H0 in the direction of H1. Specifying the test procedure, in other words, boils down to specifying the rejection (of H0) region.
Monotonicity.
Following Fisher’s goal of maximizing sensitivity, N-P seek to maximize the capability of detecting discrepancies from H0 when they exist. We need the sampling distribution of Λ(X), but in practice Λ(X) is rarely in a form from which that distribution can easily be derived. Λ(X) has to be transformed in clever ways to yield a test statistic d(X), a function of the sample that has a known distribution under H0. A general trick for finding a suitable test statistic d(X) is to find a function h(.) of Λ(X) that is monotonic with respect to a statistic d(X): the greater d(X) is, the greater the likelihood ratio; the smaller d(X) is, the smaller the likelihood ratio. Having transformed Λ(X) into the test statistic d(X), the rejection region becomes
Rejection Region, RR ≔ {x: d(x) ≥ cα},

the set of data points where d(x) ≥ cα. All other data points belong to the “non-rejection” or “acceptance” region, NR. At first Neyman and Pearson introduced an “undecided” region, but tests are most commonly given such that the RR and NR regions exhaust the entire sample space S. The term “acceptance,” Neyman tells us, was merely shorthand: “The phrase ‘do not reject H’ is longish and cumbersome … My own preferred substitute for ‘do not reject H’ is ‘no evidence against H is found’” (Neyman 1976, p. 749). That is the interpretation that should be used.
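As a small numerical illustration of the monotone relation, again a sketch of my own using the Normal point-null setup from the earlier snippets: there 2lnΛ(x) = d(x)² with d(x) = √n(x̄ − θ0)/σ, so Λ(x) increases monotonically with |d(x)| and the Λ-based region translates into a region stated in terms of d(x).

```python
import numpy as np

theta0, sigma, n = 0.0, 1.0, 25
xbar_grid = np.linspace(theta0, theta0 + 1.0, 11)        # candidate sample means

d = np.sqrt(n) * (xbar_grid - theta0) / sigma            # test statistic d(x)
log_Lambda = n * (xbar_grid - theta0) ** 2 / (2 * sigma ** 2)

# 2 ln Lambda equals d^2, so Lambda is a monotone function of |d(x)| on this grid
assert np.allclose(2 * log_Lambda, d ** 2)
print(np.all(np.diff(log_Lambda) > 0))   # True: larger d(x), larger Lambda(x)
```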
The use of the Λ(.) criterion began as E. Pearson’s intuition. Neyman was initially skeptical. Only later did he show it could be the basis for good and even optimal tests.
Having established the usefulness of the Λ-criterion, we realized that it was essential to explore more fully the sense in which it led to tests which were likely to be effective in detecting departures from the null hypothesis. So far we could only say that it seemed to appeal to intuitive requirements for a good test.
(E. Pearson 1970, p. 470; I replace λ with Λ)
Many other desiderata for good tests present themselves.
We want a higher and higher value for Pr(d(X) ≥ cα; θ1) as the discrepancy (θ1 − θ0) increases. That is, the larger the discrepancy, the easier (more probable) it should be to detect it. This came to be known as the power function. Likewise, the power should increase as the sample size increases, and as the variability decreases. The point is that Neyman and Pearson did not start out with a conception of optimality. They groped for criteria that intuitively made sense and that reflected Fisher’s tests and theory of estimation. There are some early papers in 1928, but the classic N-P result doesn’t arrive until their 1933 paper.
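A small sketch, with illustrative values of my own, of a power function for a one-sided Normal test with σ known: power rises with the discrepancy (θ1 − θ0) and with n, and equals α when θ1 = θ0.

```python
import numpy as np
from scipy import stats

def power(theta1, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Pr(d(X) >= c_alpha; theta1) for the one-sided Normal test with sigma known."""
    c_alpha = stats.norm.ppf(1 - alpha)                 # cutoff for level alpha
    delta1 = np.sqrt(n) * (theta1 - theta0) / sigma     # shift of d(X) when theta = theta1
    return 1 - stats.norm.cdf(c_alpha - delta1)

for theta1 in (0.0, 0.2, 0.4, 0.6):
    print(f"theta1={theta1:.1f}: power(n=25)={power(theta1):.3f}, "
          f"power(n=100)={power(theta1, n=100):.3f}")
# Power grows with the discrepancy theta1 - theta0 and with n; at theta1 = theta0 it equals alpha.
```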
Powerful Tests.
Pearson describes the days when he and Neyman are struggling to compare various different test statistics – Neyman is in Poland, he is in England. Pearson found himself simulating power for different test statistics and tabling the results. He calls them “empirical power functions.” Equivalently, he made tables of the complement of the empirical power function: “what was tabled was the percentage of samples for which a test at 5 percent level failed to establish significance, as the true mean shifted from μ0 by steps of σ/√n” (ibid., p. 471). He’s construing the test’s capabilities in terms of percentages of samples. The formal probability distributions serve as shortcuts to cranking out the percentages. “While the results were crude, they show that our thoughts were turning towards the justification of tests in terms of power” (ibid.).
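A sketch in the spirit of those hand tabulations, with illustrative numbers of my own (not Pearson’s actual calculations): simulate the percentage of samples for which a 5 percent one-sided Normal test fails to establish significance as the true mean shifts from μ0 by steps of σ/√n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu0, sigma, n, alpha, reps = 0.0, 1.0, 25, 0.05, 20_000
c_alpha = stats.norm.ppf(1 - alpha)

for k in range(5):                                   # true mean shifted by k steps of sigma/sqrt(n)
    mu_true = mu0 + k * sigma / np.sqrt(n)
    xbar = rng.normal(mu_true, sigma / np.sqrt(n), size=reps)
    d = np.sqrt(n) * (xbar - mu0) / sigma
    fail = np.mean(d < c_alpha)                      # fraction failing to reach 5% significance
    print(f"shift = {k} * sigma/sqrt(n): {100 * fail:.1f}% of samples fail to reject")
```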
While Pearson is busy experimenting with simulated power functions, Neyman writes to him in 1931 of difficulties he is having in more complicated cases, saying that he found a test in which, paradoxically, “the true hypothesis will be rejected more often than some of the false ones. I told Lola [his wife] that we had invented such a test. She said: ‘good boys!’” (ibid., p. 472). A test should have a higher probability of leading to a rejection of H0 when H1: θ ∈ Θ1 than when H0: θ ∈ Θ0. After Lola’s crack, pretty clearly, they would insist on unbiased tests: the probability of rejecting H0 when it’s true or adequate is always less than that of rejecting it when it’s false or inadequate. There are direct parallels with properties of good estimators of θ (although we won’t have time to venture into that).
Tests that violate unbiasedness are sometimes called “worse than useless” (Hacking 1965, p. 99), but when you read, for example, in Gigerenzer and Marewski (2015) that N-P found Fisherian tests “worse than useless” (p. 427), there is a danger of misinterpretation. N-P aren’t bad-mouthing Fisher. They know he wouldn’t condone this, but want to show that without making restrictions explicit, it’s possible to end up with such unpalatable tests. In the case of two-sided tests, the additional criterion of unbiasedness led to uniformly most powerful (UMP) unbiased tests.
Consistent Tests.
Unbiasedness by itself isn’t a sufficient property for a good test; it needs to be supplemented with the property of consistency. This requires that, as the sample size n increases without limit, the probability of detecting any discrepancy from the null hypothesis (the power) should approach 1. Let’s consider a test statistic that is unbiased yet inconsistent. Suppose we are testing the mean of a Normal distribution with σ known. The test statistic to which the Λ gives rise is

d(X) = √n(X̄ − θ0)/σ.

Say that, rather than using the sample mean X̄, we use the average of the first and last values. This is to estimate the mean θ as θ̂ = (X1 + Xn)/2. The test statistic is then d*(X) = √2(θ̂ − θ0)/σ. This θ̂ is an unbiased estimator of θ. The distribution of θ̂ is N(θ, σ²/2). Even though this test is unbiased and enables control of the Type I error, it is inconsistent. The result of looking at only two outcomes is that the power does not increase as n increases. The power of this test is much lower than that of a test using the sample mean for any n > 2. If you come across a criticism of tests, make sure consistency is not being violated.
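A sketch of the contrast, with illustrative values of my own: both tests control the Type I error at α, but only the one based on the sample mean has power that climbs toward 1 as n grows.

```python
import numpy as np
from scipy import stats

theta0, theta1, sigma, alpha = 0.0, 0.3, 1.0, 0.05
c_alpha = stats.norm.ppf(1 - alpha)

def power_mean(n):
    """Power of the test based on the sample mean: d(X) = sqrt(n)(Xbar - theta0)/sigma."""
    return 1 - stats.norm.cdf(c_alpha - np.sqrt(n) * (theta1 - theta0) / sigma)

def power_two_obs(n):
    """Power of the test based on (X1 + Xn)/2, which is N(theta, sigma^2/2) for every n."""
    return 1 - stats.norm.cdf(c_alpha - np.sqrt(2) * (theta1 - theta0) / sigma)

for n in (10, 100, 1000):
    print(f"n={n:>5}: power(mean)={power_mean(n):.3f}, power(two obs)={power_two_obs(n):.3f}")
# The two-observation test's power stays fixed no matter how large n gets.
```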
Historical Sidelight.
Except for short visits and holidays, their work proceeded by mail. When Pearson visited Neyman in 1929, he was shocked at the conditions in which Neyman and other academics lived and worked in Poland. Numerous letters from Neyman describe the precarious position of his statistics lab: “You may have heard that we have in Poland a terrific crisis in everything” [1931] (C. Reid 1998, p. 99). In 1932, “I simply cannot work; the crisis and the struggle for existence takes all my time and energy” (Lehmann 2011, p. 40). Yet he managed to produce quite a lot. While at the start the initiative for the joint work was from Pearson, it soon turned in the other direction, with Neyman leading the way.
By comparison, Egon Pearson’s greatest troubles at the time were personal: he had fallen in love “at first sight” with a woman engaged to his cousin George Sharpe, and she with him. She returned the ring the very next day, but Egon still gave his cousin two years to win her back (C. Reid 1998, p. 86). In 1929, buoyed by his work with Neyman, Egon finally declares his love and they are set to be married, but he let himself be intimidated by his father, Karl, deciding “that I could not go against my family’s opinion that I had stolen my cousin’s fiancée … at any rate my courage failed” (ibid., p. 94). Whenever Pearson says he was “suddenly smitten” with doubts about the justification of tests while gazing on the fruit station that his cousin directed, I can’t help thinking he’s also referring to this woman (ibid., p. 60). He was lovelorn for years, but refused to tell Neyman what was bothering him.
N-P Tests in Their Usual Formulation: Type I and II Error Probabilities and Power
Whether we accept or reject or remain in doubt, say N-P (1933, p. 146), it must be recognized that we can be wrong. By choosing a distance measure d(X) wherein the probability of different distances may be computed if the source of the data is H0, we can determine the probability of an erroneous rejection of H0 – a Type I error.
The test specification that dovetailed with the Fisherian tests in use began by ensuring the probability of a Type I error – an erroneous rejection of the null – is fixed at some small number, α, the significance level of the test:

Type I error probability = Pr(d(X) ≥ cα; H0) ≤ α.
Compare the Type I error probability and the P-value:

P-value: Pr(d(X) ≥ d(x0); H0) = p(x0).

So the N-P test could easily be given in terms of the P-value:

Reject H0 iff p(x0) ≤ α.

Equivalently, the rejection (of H0) region consists of those outcomes whose P-value is less than or equal to α. Reflecting the tests commonly used, N-P suggest the Type I error be viewed as the “more important” of the two. Let the relevant hypotheses be H0: θ = θ0 vs. H1: θ > θ0.
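Before turning to the Type II error, a brief sketch (illustrative one-sided Normal setup and numbers of my own) checking the equivalence just stated: the outcome falls in the rejection region, d(x0) ≥ cα, exactly when p(x0) ≤ α.

```python
import numpy as np
from scipy import stats

theta0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
c_alpha = stats.norm.ppf(1 - alpha)

for xbar in (0.1, 0.3, 0.5):                       # three hypothetical observed sample means
    d_obs = np.sqrt(n) * (xbar - theta0) / sigma
    p_value = 1 - stats.norm.cdf(d_obs)            # Pr(d(X) >= d(x0); H0)
    print(f"xbar={xbar}: d(x0)={d_obs:.2f}, p={p_value:.3f}, "
          f"reject by cutoff: {d_obs >= c_alpha}, reject by P-value: {p_value <= alpha}")
```

The two rejection columns always agree, which is the sense in which the fixed-α test “could easily be given in terms of the P-value.”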
The Type II error is failing to reject the null when it is false to some degree. The test leads you to declare “no evidence of discrepancy from H0” when H0 is false, and a discrepancy exists. The alternative hypothesis H1 contains more than a single value of the parameter: it is composite. So abbreviate by β(θ1) the Type II error probability assuming θ = θ1, for θ1 values in the alternative region H1:

Type II error probability (at θ1) = Pr(d(X) < cα; θ1) = β(θ1), for θ1 ∈ Θ1.
In Figure 3.2, this is the area to the left of cα, the vertical dotted line, under the H1 curve. The shaded area, the complement of the Type II error probability (at θ1), is the power of the test (at θ1):

Power of the test (POW) (at θ1) = Pr(d(X) ≥ cα; θ1).

This is the area to the right of the vertical dotted line, under the H1 curve, in Figure 3.2. Note that d(x0) and cα are always approximations expressed as decimals. For continuous cases, these probabilities are areas under the probability density curve.
Figure 3.2 Type II error and power.
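To mirror the two areas under the H1 curve numerically, a brief sketch (one-sided Normal test, illustrative values of my own) computing β(θ1) and POW(θ1) and checking that they sum to 1.

```python
import numpy as np
from scipy import stats

theta0, theta1, sigma, n, alpha = 0.0, 0.4, 1.0, 25, 0.05
c_alpha = stats.norm.ppf(1 - alpha)
delta1 = np.sqrt(n) * (theta1 - theta0) / sigma     # mean of d(X) when theta = theta1

beta = stats.norm.cdf(c_alpha - delta1)             # Pr(d(X) < c_alpha; theta1): area left of c_alpha
pow_ = 1 - beta                                     # Pr(d(X) >= c_alpha; theta1): area to the right
print(f"beta(theta1) = {beta:.3f}, POW(theta1) = {pow_:.3f}, sum = {beta + pow_:.1f}")
```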
A uniformly most powerful (UMP) N-P test of a hypothesis at level α is one that minimizes β(θ1), or, equivalently, maximizes the power for all θ > θ0. One reason alternatives are often not made explicit is the property of being a best test for any alternative. We’ll explore power, an often-misunderstood creature, further in Excursion 5.
Although the manipulations needed to derive a test statistic using a monotonic mapping of the likelihood ratio can be messy, it’s exhilarating to deduce them. Wilks (1938) derived a general asymptotic result, which does not require such manipulations. He showed that, under certain regularity conditions, as n goes to infinity one can define the asymptotic test, where “~” denotes “is distributed as”:
2lnΛ(X) ~ χ²(r), under H0, with rejection region RR ≔ {x: 2lnΛ(x) ≥ cα},

where χ²(r) denotes the chi-square distribution with r degrees of freedom, determined by the restrictions imposed by H0. The monotone likelihood ratio condition holds for familiar models including one-parameter variants of the Normal, Gamma, Beta, Binomial, Negative Binomial, Poisson (the Exponential family), the Uniform, Logistic, and others (Lehmann 1986). In a wide variety of tests, the Λ principle gives tests with all of the intuitively desirable test properties (see Spanos 1999 and 2019, Chapter 13).
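A hedged illustration of the asymptotic form, with numbers of my own, again using the Normal point null with σ known, where r = 1 restriction (and where, for this particular model, the χ²(1) result happens to hold exactly, since 2lnΛ(X) = d(X)²).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
theta0, sigma, n, alpha = 0.0, 1.0, 50, 0.05
x = rng.normal(loc=0.2, scale=sigma, size=n)                 # illustrative data

two_log_Lambda = n * (x.mean() - theta0) ** 2 / sigma ** 2   # 2 ln Lambda for this model
c_alpha = stats.chi2.ppf(1 - alpha, df=1)                    # r = 1 restriction under H0
print(f"2 ln Lambda = {two_log_Lambda:.2f}, chi2(1) cutoff = {c_alpha:.2f}, "
      f"reject H0: {two_log_Lambda >= c_alpha}")
```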
Performance versus Severity Construals of Tests
“The work [of N-P] quite literally transformed mathematical statistics” (C. Reid 1998, p. 104). The idea that appraising statistical methods revolves around optimality (of some sort) goes viral. Some compared it “to the effect of the theory of relativity upon physics” (ibid.). Even when the optimal tests were absent, the optimal properties served as benchmarks against which the performance of methods could be gauged. They had established a new pattern for appraising methods, paving the way for Abraham Wald’s decision theory, and the seminal texts by Lehmann and others. The rigorous program overshadowed the more informal Fisherian tests. This came to irk Fisher. Famous feuds between Fisher and Neyman erupted as to whose paradigm would reign supreme. Those who sided with Fisher erected examples to show that tests could satisfy predesignated criteria and long-run error control while leading to counterintuitive tests in specific cases. That was Barnard’s point on the eclipse experiments (Section 3.1): no one would consider the class of repetitions as referring to the hoped-for 12 photos, when in fact only some smaller number were usable. We’ll meet up with other classic chestnuts as we proceed.
N-P tests began to be couched as formal mapping rules taking data into “reject H0” or “do not reject H0” so as to ensure the probabilities of erroneous rejection and erroneous acceptance are controlled at small values, independent of the true hypothesis and regardless of prior probabilities of parameters. Lost in this behavioristic formulation was how the test criteria naturally grew out of the requirements of probative tests, rather than good long-run performance. Pearson underscores this in his paper (1947) in the epigraph of Section 3.2: Step 2 comes before Step 3. You must first have a sensible distance measure. Since tests that pass muster on performance grounds can simultaneously serve as probative tests, the severe tester breaks out of the behavioristic prison. Neither Neyman nor Pearson, in their applied work, was wedded to it. Where performance and probativeness conflict, probativeness takes precedence. Two decades after Fisher allegedly threw Neyman’s wood models to the floor (Section 5.8), Pearson (1955) tells Fisher: “From the start we shared Professor Fisher’s view that in scientific enquiry, a statistical test is ‘a means of learning’” (p. 206):
… it was not till after the main lines of this theory had taken shape with its necessary formalization in terms of critical regions, the class of admissible hypotheses, the two sources of error, the power function, etc., that the fact that there was a remarkable parallelism of ideas in the field of acceptance sampling became apparent. Abraham Wald’s contributions to decision theory of ten to fifteen years later were perhaps strongly influenced by acceptance sampling problems, but that is another story.
(ibid., pp. 204–5)
In fact, the tests as developed by Neyman–Pearson began as an attempt to obtain tests that Fisher deemed intuitively plausible, and this goal is easily interpreted as that of computing and controlling the severity with which claims are inferred.
Not only did Fisher reply encouragingly to Neyman’s letters during the development of their results, it was Fisher who first informed Neyman of the split of K. Pearson’s duties between himself and Egon, opening up the possibility of Neyman’s leaving his difficult life in Poland and gaining a position at University College in London. Guess what else? Fisher was a referee for the all-important N-P 1933 paper, and approved of it.
To Neyman it has always been a source of satisfaction and amusement that his and Egon’s fundamental paper was presented to the Royal Society by Karl Pearson, who was hostile and skeptical of its contents, and favorably refereed by the formidable Fisher, who was later to be highly critical of much of the Neyman–Pearson theory.
(C. Reid 1998 , p. 103)
Souvenir J: UMP Tests
Here are some familiar Uniformly Most Powerful (UMP) unbiased tests that fall out of the Λ criterion (letting µ be the mean):
(1) One-sided Normal test. Each Xi is NIID, N(µ, σ²), with σ known: H0: µ ≤ µ0 against H1: µ > µ0. The test statistic is d(X) = √n(X̄ − µ0)/σ.
Evaluating the Type I error probability requires the distribution of d(X) under H0: d(X) ~ N(0,1).
Evaluating the Type II error probability (and power) requires the distribution of d(X) under H1 [µ = µ1]: d(X) ~ N(δ1, 1), where δ1 = √n(µ1 − µ0)/σ.
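A short simulation sketch, with illustrative numbers of my own, checking that under µ = µ1 the statistic d(X) is indeed centered at δ1 = √n(µ1 − µ0)/σ with unit standard deviation, which is what the Type II error and power calculations rely on.

```python
import numpy as np

rng = np.random.default_rng(5)
mu0, mu1, sigma, n, reps = 0.0, 0.5, 2.0, 36, 50_000

xbar = rng.normal(mu1, sigma / np.sqrt(n), size=reps)   # sample means drawn under mu = mu1
d = np.sqrt(n) * (xbar - mu0) / sigma

delta1 = np.sqrt(n) * (mu1 - mu0) / sigma               # predicted shift: here 6 * 0.5 / 2 = 1.5
print(f"mean of d(X) ~ {d.mean():.2f} (delta1 = {delta1:.2f}), sd ~ {d.std():.2f} (should be ~1)")
```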