
Statistical Inference as Severe Testing


by Deborah G. Mayo


  4.9 For Model-Checking, They Come Back to Significance Tests

  Why can’t all criticism be done using Bayes posterior analysis …? The difficulty with this approach is that by supposing all possible sets of assumptions known a priori, it discredits the possibility of new discovery. But new discovery is, after all, the most important object of the scientific process.

  (George Box 1983, p. 73)

  Why the apology for ecumenism? Unlike most Bayesians, Box does not view induction as probabilism in the form of probabilistic updating (posterior probabilism), or any form of probabilism. Rather, it requires critically testing whether a model Mi is “consonant” with data, and this, he argues, demands frequentist significance testing. Our ability “to find patterns in discrepancies Mi − yd between the data and what might be expected if some tentative model were true is of great importance in the search for explanations of data and of discrepant events” (Box 1983, p. 57). But the dangers of apophenia raise their head.

  However, some check is needed on [the brain’s] pattern seeking ability, for common experience shows that some pattern or other can be seen in almost any set of data or facts. This is the object of diagnostic checks and tests of fit which, I will argue, require frequentist theory significance tests for their formal justification.

  (ibid.)

  Once you have inductively arrived at an appropriate model, the move, on his view, “is entirely deductive and will be called estimation” (ibid., p. 56). The deductive portion, he thinks, can be Bayesian, but the inductive portion requires frequentist significance tests, and statistical inference depends on an iteration between the two. Alluding to Box, Peter Huber avers: “Within orthodox Bayesian statistics, we cannot even address the question whether a model Mi, under consideration at stage i of the investigation, is consonant with the data y” (Huber 2011, p. 92). Box adds a non-Bayesian activity to his account.

  A note on Box’s slightly idiosyncratic use of deduction/induction: Frequentist significance testing is often called deductive, but for Box it’s the inductive component. There’s no confusion if we remember that Box is emphasizing that frequentist testing is the source of new ideas: it is the inductive achievement. It’s in sync with our own view that inductive inference to claim C consists of trying and failing to falsify C with a stringent test: C should be well corroborated. In fact, the approach to misspecification (M-S) testing that melds seamlessly with the error statistical account has its roots in the diagnostic checking of Box and Jenkins (1976).

  All You Need Is Bayes. Not

  Box and Jenkins highlight the link between ‘prove’ and ‘test’: “A model is only capable of being ‘proved’ in the biblical sense of being put to the test” (ibid., p. 286). Box considers the possibility that model checking occurs as follows: One might imagine A1, A2, …, Ak being alternative assumptions and then computing Pr(Ai | y). Box denies this is plausible. To assume we start out with all models precludes the “something else we haven’t thought of” so vital to science. Typically, Bayesians try to deal with this by computing a Bayesian catchall “everything else.” Savage recommends reserving a low prior for the catchall (1962a), but Box worries that this may allow you to assign model Mi a high posterior probability relative to the other models considered. “In practice this would seem of little comfort” (Box 1983, pp. 73–4). For suppose the posteriors of the three models under consideration are 0.001, 0.001, and 0.998, but, unknown to the investigator, a fourth model is a thousand times more probable than even the most probable one considered so far.
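  To make the worry concrete, here is the arithmetic (my own illustration, not Box’s): if the overlooked fourth model carries a thousand times the support of the apparent favorite, renormalizing over all four models collapses that favorite.

```latex
% Posteriors over the three models actually entertained
\Pr(M_1 \mid y) = 0.001, \quad \Pr(M_2 \mid y) = 0.001, \quad \Pr(M_3 \mid y) = 0.998.
% If an overlooked M_4 carries 1000 times the support of M_3, renormalizing gives
\Pr(M_3 \mid y) \approx \frac{0.998}{0.001 + 0.001 + 0.998 + 1000(0.998)} \approx 0.001,
\qquad \Pr(M_4 \mid y) \approx 0.999.
```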

  So he turns to frequentist tests for model checking. Is there any harm in snatching some cookies from the frequentist cookie jar? Not really. Does it violate the Likelihood Principle (LP)? Let’s listen to Box:

  The likelihood principle holds, of course, for the estimation aspect of inference in which the model is temporarily assumed true. However it is inapplicable to the criticism process in which the model is regarded as in doubt … In the criticism phase we are considering whether, given A, the sample yd is likely to have occurred at all. To do this we must consider it in relation to the other samples that could have occurred but did not.

  (Box 1983, pp. 74–5)

  Suppose you’re about to use a statistical model, say n IID Normal trials for a primary inference about mean μ. Checks of independence (I), identical distribution (ID), or the Normality assumption (N) are secondary inferences in relation to the primary one. In conducting secondary inferences, Box is saying, the LP must be violated, or simply doesn’t apply. You can run a simple Fisherian significance test – the null asserting the model assumption A holds – and reject it if the observed result is statistically significantly far from what A predicts. A P-value (or its informal equivalent) is computed – a tail area – which requires considering outcomes other than the one observed.
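  Here is a minimal sketch of such a secondary check, with a standard Normality test standing in for whatever diagnostic one prefers; the data, seed, and sample size are hypothetical, and it assumes NumPy and SciPy are available.

```python
# Hypothetical check of the Normality assumption (N) for a sample of n = 100
# measurements, via the Shapiro-Wilk test: the null is that assumption A
# (Normality) holds, and a small P-value counts against it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=150, scale=10, size=100)  # stand-in for the observed sample

w_stat, p_value = stats.shapiro(x)
print(f"Shapiro-Wilk W = {w_stat:.3f}, P-value = {p_value:.3f}")
# The P-value is a tail area: it asks how often samples from a Normal
# distribution would show departures at least as large as those observed --
# exactly the appeal to unobserved samples that the LP disallows.
```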

  Box gives the example of stopping rules. Stopping rules don’t alter the posterior distribution, as we learned from the extreme example in Excursion 1 (Section 1.5). For a simple example, he considers four Bernoulli trials: ⟨S, S, F, S⟩. The same string could have come about if n = 4 was fixed in advance, or if the plan was to sample until the third success is observed. The latter are called negative Binomial trials, the former Binomial. The string enters the likelihood the same way under either plan: the only difference is the coefficients, which cancel in any likelihood ratio. But the significance tester distinguishes them, because the sample space, and corresponding error probabilities, differ.1 When it comes to model testing, Box contends, this LP violation is altogether reasonable, since “we are considering whether, given A, the sample is likely to have occurred at all” (ibid., p. 75).
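  Spelled out, the two likelihoods for ⟨S, S, F, S⟩ (a standard calculation, not a quotation from Box) are:

```latex
% Binomial plan: n = 4 trials fixed in advance, 3 successes observed
L_{B}(\theta) \;=\; \binom{4}{3}\,\theta^{3}(1-\theta) \;=\; 4\,\theta^{3}(1-\theta)
% Negative Binomial plan: sample until the 3rd success, which arrives on trial 4
L_{NB}(\theta) \;=\; \binom{3}{2}\,\theta^{3}(1-\theta) \;=\; 3\,\theta^{3}(1-\theta)
% The coefficients 4 and 3 cancel in any likelihood ratio, so for all \theta_1, \theta_2:
\frac{L_{B}(\theta_1)}{L_{B}(\theta_2)}
\;=\; \frac{\theta_1^{3}(1-\theta_1)}{\theta_2^{3}(1-\theta_2)}
\;=\; \frac{L_{NB}(\theta_1)}{L_{NB}(\theta_2)}
```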

  This is interesting. Isn’t it also our question at his estimation stage where the LP denies stopping rules matter? We don’t know there’s any genuine effect, or if a null is true. If we ignore the stopping rules, we may make it too easy to find one, even if it’s absent. In the example of Section 1.5, we ensure erroneous rejection, violating “weak repeated sampling.” A Boxian Bayesian, who retains the LP for primary statistical inference, still seems to owe us an explanation why we shouldn’t echo Armitage (1962, p. 72) that “Thou shalt be misled” if your method hides optional stopping at the primary (Box’s estimation) stage.

  Another little puzzle arises in telling what’s true about the LP: Is the LP violated or simply inapplicable in secondary testing of model assumptions? Consider Casella and R. Berger’s text.

  Most data analysts perform some sort of ‘model checking’ when analyzing a set of data … For example, it is common practice to examine residuals from a model, statistics that measure variation in the data not accounted for by the model … (Of course such a practice directly violates the Likelihood Principle also.) Thus, before considering [the Likelihood Principle], we must be comfortable with the model.

  (Casella and R. Berger 2002, pp. 295–6)

  For them, it appears, the LP is full-out violated in model checking. I’m not sure how much turns on whether the LP is regarded as violated or merely inapplicable in testing assumptions; a question arises in either case. Say you have carried out Box’s iterative moves between criticism and estimation, arrived at a model deemed adequate, and infer H: model Mi is adequate for modeling data x0. My question is: How is this secondary inference qualified? Probabilists are supposed to qualify uncertain claims with probability (e.g., with posterior probabilities or comparisons of posteriors). What about this secondary inference to the adequacy/inadequacy of the model? For Boxians, it’s admitted to be a non-Bayesian frequentist animal. Still, a long-run performance justification wouldn’t seem plausible. If you’re going to accept the model as sufficiently adequate to build the primary inference, you’d want to say it had passed a severe test: that if it wasn’t adequate for the primary inference, then you probably would have discovered this through the secondary model checking. However, if secondary inference is also a statistical inference, it looks like Casella and R. Berger, and Box, are right to consider the LP violated – as regards that inference. There’s an appeal to outcomes other than the one observed.

  Andrew Gelman’s Bayesian approach can be considered an offshoot of Box’s, but, unlike Box, he will avoid assigning a posterior probability to the primary inference. Indeed, he calls himself a falsificationist Bayesian, and is disgruntled that Bayesians don’t test their models.

  I vividly remember going from poster to poster at the 1991 Valencia meeting on Bayesian statistics … not only were they not interested in checking the fit of the models, they considered such checks to be illegitimate. To them, any Bayesian model necessarily represented a subjective prior distribution and as such could never be tested. The idea of testing and p-values were held to be counter to the Bayesian philosophy.

  (Gelman 2011, pp. 68–9)

  What he’s describing is in sync with the classical subjective Bayesian: If “the Bayesian theory is about coherence, not about right or wrong” (Lindley 1976, p. 359), then what’s to test? Lindley does distinguish a pre-data model choice:

  Notice that the likelihood principle only applies to inference, i.e. to calculations once the data have been observed. Before then, e.g. in some aspects of model choice, in the design of experiments … , a consideration of several possible data values is essential.

  (Lindley 2000, p. 310)

  This he views as a decision based on maximizing an agent’s expected utility. But wouldn’t a correct assessment of utility depend on information on model adequacy?

  Interestingly, there are a number of Bayesians who entertain the idea of a Bayesian P-value to check accordance of a model when there’s no alternative in sight.2 They accept the idea that significance tests and P-values are a good way, if not the only way, to assess the consonance between data and model. Yet perhaps they are only grinning and bearing it. As soon as alternative models are available, most would sooner engage in a Bayesian analysis, e.g., Bayes factors (Bayarri and Berger 2004).

  But Gelman is a denizen of a tribe of Bayesians that rejects these traditional forms. “To me, Bayes factors correspond to a discrete view of the world, in which we must choose between models A, B, or C” (Gelman 2011, p. 74) or a weighted average of them as in Madigan and Raftery (1994). Nor will it be a posterior. “I do not trust Bayesian induction over the space of models because the posterior probability of a continuous-parameter model depends crucially on untestable aspects of its prior distribution” (Gelman 2011, p. 70). Instead, for Gelman, the priors/posteriors arise as an interim predictive device to draw out and test implications of a model. What is the status of the inference to the adequacy of the model? If neither probabilified nor Bayes ratioed, it can at least be well or poorly tested. In fact, he says, “This view corresponds closely to the error-statistics idea of Mayo (1996)” (ibid., p. 70). We’ll try to extricate his approach in Excursion 6.

  4.10 Bootstrap Resampling: My Sample Is a Mirror of the Universe

  “My difficulty” with the Likelihood Principle (LP), declares Brad Efron (in a comment on Lindley), is that it “rules out many of our most useful data analytic tools without providing workable substitutes” (2000, p. 330) – notably, the method for which he is well known: bootstrap resampling (Efron 1979). Let’s take a little detour to have a look around this hot topic. (I follow D. Freedman (2009) and A. Spanos (2019).)

  We have a single IID sample of size 100 of the water temperatures soon after the accident, x0 = ⟨x1, x2, …, x100⟩. Can we say anything about its accuracy even if we couldn’t take any more? Yes. We can lift ourselves up by the bootstraps with this single x0 by treating it as its own population. Get the computer to take a large number, say 10,000, of independent samples from x0 (with replacement), giving 10,000 resamples. Then reestimate the mean for each, giving 10,000 bootstrap means. The frequency with which the bootstrapped means take different values approximates the sampling distribution of the sample mean. It can be extended to medians, standard deviations, etc. “This is exactly the kind of calculation that is ruled out by the likelihood principle; it relies on hypothetical data sets different from the data that are actually observed and does so in a particularly flagrant way” (Efron 2000, p. 331). At its very core is the question: what would mean temperatures be like were we to have repeated the process many times? This lets us learn: How capable of producing our observed sample is a universe with mean temperature no higher than the temperature thought to endanger the ecosystem?

  Averaging the 10,000 bootstrap means, we get the overall bootstrap sample mean. If n is sufficiently large, the resampling distribution of the bootstrap means around the sample mean mirrors the sampling distribution of the sample mean around μ, where μ is the mean of the population. We can use the sample standard deviation of the bootstrap means to approximate the standard error of the sample mean.3
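  In code, the procedure looks roughly like this; a minimal sketch assuming NumPy, with simulated stand-in data in place of the 100 temperature readings.

```python
# Non-parametric bootstrap of the sample mean: resample the observed data
# with replacement B times and use the spread of the resampled means to
# approximate the standard error of the sample mean.
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.normal(loc=155, scale=12, size=100)   # hypothetical observed sample
B = 10_000                                     # number of bootstrap resamples

boot_means = np.array([
    rng.choice(x0, size=x0.size, replace=True).mean()
    for _ in range(B)
])

overall_boot_mean = boot_means.mean()          # overall bootstrap sample mean
boot_se = boot_means.std(ddof=1)               # approximates SE of the sample mean
print(f"sample mean: {x0.mean():.2f}")
print(f"bootstrap mean: {overall_boot_mean:.2f}, bootstrap SE: {boot_se:.3f}")
```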

  To illustrate with a tiny example, imagine that instead of 100 temperature measurements there are only 10: x0: 150, 165, 151, 138, 148, 167, 164, 160, 136, 173, with sample mean 155.2, and instead of 10,000 resamples, only 5. Since it’s with replacement, there can be duplicates.

  x0: 150, 165, 151, 138, 148, 167, 164, 160, 136, 173

  Bootstrap resamples and their bootstrap means:

  xb1: 160, 136, 138, 165, 173, 165, 167, 148, 151, 167 (mean 157)

  xb2: 164, 136, 165, 167, 148, 138, 151, 160, 150, 151 (mean 153)

  xb3: 173, 138, 173, 160, 167, 167, 148, 138, 148, 165 (mean 157.7)

  xb4: 148, 138, 164, 167, 160, 150, 164, 167, 148, 173 (mean 157.9)

  xb5: 173, 136, 167, 138, 150, 160, 148, 164, 164, 148 (mean 154.8)

  Here are the rest of the bootstrap statistics:

  Bootstrap overall mean: (157 + 153 + 157.7 + 157.9 + 154.8)/5 = 156.08;

  Bootstrap variance: [(157 − 156.08)² + (153 − 156.08)² + (157.7 − 156.08)² + (157.9 − 156.08)² + (154.8 − 156.08)²]/4 = 4.477;

  Bootstrap SE: √4.477 = 2.116.

  Note the difference between the mean of our observed sample and the overall bootstrap mean (the bias) is small: 156.08 − 155.2 = 0.88.

  From our toy example, we could form the bootstrap 0.95 confidence interval: 156.08 ± 1.96(2.116), approximately [152, 160]. You must now imagine we arrived at the interval via 10,000 resamples, not 5. The observed mean just after the accident (155.2) exceeds 150 by around 2.5 SE, indicating our sample came from a population with θ > 150. In fact, were θ ≤ 152, such large results would occur infrequently.
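  The toy arithmetic can be checked directly; a minimal sketch using only the Python standard library, with the five resample means read off the table above.

```python
# Verify the toy bootstrap statistics: overall mean, SE, bias, and 0.95 CI.
from statistics import mean, stdev

x0 = [150, 165, 151, 138, 148, 167, 164, 160, 136, 173]
boot_means = [157, 153, 157.7, 157.9, 154.8]      # means of xb1 ... xb5

overall = mean(boot_means)                         # 156.08
se = stdev(boot_means)                             # sqrt(4.477), about 2.116
bias = overall - mean(x0)                          # 156.08 - 155.2 = 0.88
ci = (overall - 1.96 * se, overall + 1.96 * se)    # roughly (151.9, 160.2)
print(overall, round(se, 3), round(bias, 2), ci)
```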

  The non-parametric bootstrap works without relying on a theoretical probability distribution, at least when the sample is random and large enough, and sufficiently many bootstrap resamples are taken. Statistical inference by non-parametrics still has assumptions, such as IID (although there are other variants). Many propose we do all statistics non-parametrically, and some newer texts advocate this. I’m all for it, because the underlying error statistical reasoning becomes especially clear. I concur, too, with the philosopher Jan Sprenger that philosophy of statistics should integrate “resampling methods into a unified scheme of data analysis and inductive inference” (Sprenger 2011, p. 74). That unified scheme is error statistical. (I’m not sure how it’s in sync with his subjective Bayesianism.)

  The philosophical significance of bootstrap resampling is twofold. (1) The relative frequency of different values of the bootstrapped means sustains our error statistical argument: the probability model is a good way to approximate the empirical distribution analytically. Through a hypothetical – ‘what it would be like’ were this process repeated many times – we understand what produced the single observed sample. (2) By identifying exemplary cases where we manage to take approximately random samples, we can achieve inductive lift-off. It’s through our deliberate data generation efforts, in other words, that we solve induction. I don’t know if taking water samples is one such exemplar, but I’m using it just as an illustration. We may imagine ample checks of water sampling on bodies with known temperature show we’re pretty good at taking random samples of water temperature. Thus we reason, it works when the mean temperature is unknown. Can supernal powers read my mind and interfere just in the cases of an unknown mean?

  Nor is it necessary to deny altogether the existence of mysterious influences adverse to the validity of the inductive … processes. So long as their influence were not too overwhelming, the wonderful self-correcting nature of the ampliative inference would enable us … to detect and make allowance for them.

  (Peirce 2.749)

  4.11 Misspecification (M-S) Testing in the Error Statistical Account

  Induction – understood as severe testing – “not only corrects its conclusions, it even corrects its premises” (Peirce 3.575). In the land of statistical inference it does so by checking and correcting the assumptions underlying the inference. It’s common to distinguish “model-based” and “design-based” statistical inference, but both involve assumptions. So let’s speak of the adequacy of the model in both cases. It’s to this auditing task that I now turn. Let’s call violated statistical assumptions statistical model misspecifications. The term “misspecification” is often used to refer to a problem with a primary model, whereas for us it will always refer to the secondary problem of checking assumptions for probing a primary question (following A. Spanos). “Primary” is relative to the main inferential task: Once an adequate statistical model is at hand, an inquiry can grow to include many layers of primary questions.

  Splitting things off piecemeal has payoffs. The key is for the relevant error probabilities to be sufficiently close to those calculated in probing the primary claim. Even if you have a theory, turning it into something statistically testable isn’t straightforward. You can’t simply add an error term at the end, such as y = theory + error, particularly in the social sciences – although people often do. The trouble is that you can tinker with the error term to “fix” anomalies without the theory having been tested in the least. Aris Spanos, an econometrician, roundly criticizes this tendency to “the preeminence of theory in econometric modeling” (2010c, p. 202). This would be okay if you were only estimating quantities in a theory known to be true or adequate, but in fact, Spanos says, “mainstream economic theories have been invariably unreliable predictors of economic phenomena” (ibid., p. 203).

 
