
Statistical Inference as Severe Testing


by Deborah G Mayo


  In order to fix a limit between ‘small’ and ‘large’ values of [the likelihood ratio] we must know how often such values appear when we deal with a true hypothesis.

  (Pearson and Neyman 1930, p. 106)

  That’s true, but putting it in terms of the desire “to control the error involved in rejecting a true hypothesis,” it is easy to dismiss it as an affliction of a frequentist concerned only with long-run performance. Bayesians and Likelihoodists are free of this affliction. Pearson and Neyman should have said: ignoring the information as to how readily true hypotheses are rejected, we cannot determine if there really is evidence of inconsistency with them.

  Our minimal requirement for evidence insists that data only provide genuine or reliable evidence for H if H survives a severe test – a test H would probably have failed if false. Here the hypothesis H of interest is the non-null of Armitage’s example: the existence of a genuine effect. A warranted inference to H depends on the test’s ability to find H false when it is, i.e., when the null hypothesis is true. The severity conception of tests provides the link between a test’s error probabilities and what’s required for a warranted inference.

  The error probability computations in significance levels, confidence levels, and power all depend on violating the LP! Aside from a concern with “intentions,” you will find two other terms used in describing the use of error probabilities: a concern with (i) outcomes other than the one observed, or (ii) the sample space. Recall Souvenir B, where Royall, who obeys the LP, speaks of “the irrelevance of the sample space” once the data are in hand. It’s not so obvious what’s meant. To explain, consider Jay Kadane: “Significance testing violates the Likelihood Principle, which states that, having observed the data, inference must rely only on what happened, and not on what might have happened but did not” (Kadane 2011, p. 439). According to Kadane, the probability statement Pr(|d(X)| > 1.96) = 0.05 “is a statement about d(X) before it is observed. After it is observed, the event {d(X) > 1.96} either happened or did not happen and hence has probability either one or zero” (ibid.).
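  To see where Kadane’s numbers come from, here is a minimal check, assuming (purely as the standard setup he appears to have in mind) that d(X) is a standard normal test statistic under the null hypothesis:

```python
from scipy.stats import norm

# Two-sided error probability of a standard normal test statistic d(X):
# the probability, computed before the data, that |d(X)| exceeds 1.96.
alpha = 2 * (1 - norm.cdf(1.96))
print(round(alpha, 3))  # 0.05
```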

  Knowing d(x) = 1.96, Kadane is saying there’s no more uncertainty about it. But would he really give it probability 1? That’s generally thought to invite the problem of “known (or old) evidence” made famous by Clark Glymour (1980). If the probability of the data x is 1, Glymour argues, then Pr(x|H) also is 1, but then Pr(H|x) = Pr(H)Pr(x|H)/Pr(x) = Pr(H), so there is no boost in probability given x. So does that mean known data don’t supply evidence? Surely not. Subjective Bayesians try different solutions: either they abstract to a context prior to knowing x, or view the known data as an instance of a general type, in relation to a sample space of outcomes. Put this to one side for now in order to continue the discussion. 5
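  Spelling out the step Glymour relies on (a sketch in the notation above; if Pr(x) = 1 then Pr(not-x) = 0, so Pr(x & H) = Pr(H)):

```latex
\Pr(x) = 1 \;\Rightarrow\; \Pr(x \mid H) = \frac{\Pr(x \wedge H)}{\Pr(H)} = \frac{\Pr(H)}{\Pr(H)} = 1,
\qquad\text{hence}\qquad
\Pr(H \mid x) = \frac{\Pr(H)\,\Pr(x \mid H)}{\Pr(x)} = \Pr(H).
```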

  Kadane is emphasizing that Bayesian inference is conditional on the particular outcome. So once x is known and fixed, other possible outcomes that could have occurred but didn’t are irrelevant. Recall finding that Pickrite’s procedure was to build k different portfolios and report just the one that did best. It’s as if Kadane is asking: “Why are you considering other portfolios that you might have been sent but were not, to reason from the one that you got?” Your answer is: “Because that’s how I figure out whether your boast about Pickrite is warranted.” With the “search through k portfolios” procedure, the possible outcomes are the success rates of the k different attempted portfolios, each with its own null hypothesis. The actual or “audited” P-value is rather high, so the severity for H: Pickrite has a reliable strategy, is low (1 − p). For the holder of the LP to say that, once x is known, we’re not allowed to consider the other chances they gave themselves to find an impressive portfolio, is to put the kibosh on a crucial way to scrutinize the testing process.
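  A toy simulation shows how the audit works; the particulars (k = 20 portfolios, “success” meaning beating the market in at least 10 of 12 months, no genuine stock-picking skill) are hypothetical illustrations, not figures from the text:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
k, n_months, trials = 20, 12, 100_000

# With no genuine skill, a portfolio beats the market in any month with
# probability 0.5. Pickrite builds k portfolios and reports only the best one.
best = rng.binomial(n_months, 0.5, size=(trials, k)).max(axis=1)

# Nominal P-value: chance a single prespecified portfolio beats the market
# in at least 10 of 12 months by luck alone.
p_nominal = binom.sf(9, n_months, 0.5)

# Audited P-value: chance that the best of the k portfolios does so.
p_audited = (best >= 10).mean()

print(f"nominal P-value: {p_nominal:.3f}")   # ~0.019
print(f"audited P-value: {p_audited:.3f}")   # ~0.32
```

  Under these assumed numbers, a result that looks impressive on its face is the sort of thing the “search through k portfolios” procedure delivers roughly a third of the time even when there is no reliable strategy at all.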

  Interestingly, nowadays, non-subjective or default Bayesians concede they “have to live with some violations of the likelihood and stopping rule principles” (Ghosh, Delampady, and Samanta 2010, p. 148) since their prior probability distributions are influenced by the sampling distribution. Is it because ignoring stopping rules can wreak havoc with the well-testedness of inferences? If that is their aim, too, then that is very welcome. Stay tuned.

  Souvenir C: A Severe Tester’s Translation Guide

  Just as in ordinary museum shops, our souvenir literature often probes treasures that you didn’t get to visit at all. Here’s an example of that, and you’ll need it going forward. There’s a confusion about what’s being done when the significance tester considers the set of all of the outcomes leading to a d(x) greater than or equal to 1.96, i.e., {x: d(x) ≥ 1.96}, or just d(x) ≥ 1.96. This is generally viewed as throwing away the particular x, and lumping all these outcomes together. What’s really happening, according to the severe tester, is quite different. What’s actually being signified is that we are interested in the method, not just the particular outcome. Those who embrace the LP make it very plain that data-dependent selections and stopping rules drop out. To get them to drop in, we signal an interest in what the test procedure would have yielded. This is a counterfactual and is altogether essential in expressing the properties of the method, in particular, the probability it would have yielded some nominally significant outcome or other.

  When you see Pr(d(X) ≥ d(x0); H0), or Pr(d(X) ≥ d(x0); H1), for any particular alternative of interest, insert:

  “the test procedure would have yielded”

  just before the d(X). In other words, this expression, with its inequality, is a signal of interest in, and an abbreviation for, the error probabilities associated with a test.

  Applying the Severity Translation.

  In Exhibit (i), Royall described a significance test with a Bernoulli(θ) model, testing H0: θ ≤ 0.2 vs. H1: θ > 0.2. We blocked an inference from observed difference d(x) = 3.3 to θ = 0.8 as follows. (Recall that d(x0) ≃ 3.3.)

  We computed Pr(d(X) > 3.3; θ = 0.8) ≃ 1.

  We translate it as Pr(The test would yield d(X) > 3.3; θ = 0.8) ≃ 1.

  We then reason as follows:

  Statistical inference: If θ = 0.8, then the method would virtually always give a difference larger than what we observed. Therefore, the data indicate θ < 0.8.

  (This follows for rejecting H0 in general.) When we ask: “How often would your test have found such a significant effect even if H0 is approximately true?” we are asking about the properties of the experiment that did happen. The counterfactual “would have” refers to how the procedure would behave in general, not just with these data, but with other possible data sets in the sample space.
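  For readers who want to see the number, here is a sketch of the computation. The sample size n = 100 is assumed only for illustration (the passage does not fix n here), with the usual standardized statistic d(X) = (X̄ − 0.2)/√(0.2 · 0.8/n):

```python
from math import sqrt
from scipy.stats import binom

# Hypothetical setup: n Bernoulli trials, null value theta0 = 0.2, and
# d(X) = (X_bar - theta0) / sqrt(theta0 * (1 - theta0) / n).
n, theta0, d_obs = 100, 0.2, 3.3          # n = 100 assumed for illustration

se0 = sqrt(theta0 * (1 - theta0) / n)
xbar_cut = theta0 + d_obs * se0           # X_bar corresponding to d = 3.3 (0.332)
count_cut = int(n * xbar_cut)             # 33; binom.sf(33, ...) gives Pr(X >= 34)

# Pr(d(X) > 3.3; theta = 0.8): were theta really 0.8, the test would almost
# always yield an even larger difference than the one observed.
print(binom.sf(count_cut, n, 0.8))        # ~1.0
```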

  Exhibit (iii).

  Analogous situations to the optional stopping example occur even without optional stopping, as with selecting a data-dependent, maximally likely, alternative. Here’s an example from Cox and Hinkley (1974, 2.4.1, pp. 51–2), attributed to Allan Birnbaum (1969).

  A single observation is made on X, which can take values 1, 2, …, 100. “There are 101 possible distributions conveniently indexed by a parameter θ taking values 0, 1, ..., 100” (ibid.). We are not told what θ is, but there are 101 possible point hypotheses about the value of θ: from 0 to 100. If X is observed to be r, written X = r (r ≠ 0), then the most likely hypothesis is θ = r: in fact, Pr(X = r; θ = r) = 1. By contrast, Pr(X = r; θ = 0) = 0.01. Whatever value r is observed, the hypothesis θ = r is 100 times as likely as θ = 0. Say you observe X = 50; then H: θ = 50 is 100 times as likely as θ = 0. So “even if in fact θ = 0, we are certain to find evidence apparently pointing strongly against θ = 0, if we allow comparisons of likelihoods chosen in the light of the data” (Cox and Hinkley 1974, p. 52). This does not happen if the test is restricted to two preselected values. In fact, if θ = 0 the probability of a ratio of 100 in favor of the false hypothesis is 0.01. 6
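  A quick simulation of this setup (a sketch; the only modeling assumptions are the ones just stated: under θ = 0, X is uniform on 1, …, 100, and under θ = r the outcome X = r is certain):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000

# Under theta = 0, X is uniform on 1, ..., 100, so Pr(X = r; theta = 0) = 0.01.
x = rng.integers(1, 101, size=trials)

# Data-dependent comparison: after seeing X = r, compare theta = r with theta = 0.
# Since Pr(X = r; theta = r) = 1, the likelihood ratio is always 1 / 0.01 = 100.
print("ratio against theta = 0 when the alternative is chosen post-data:", 1.0 / 0.01)

# Preselected comparison: fix theta = 50 before the data. A 100:1 ratio in its
# favor occurs only when X = 50, which has probability 0.01 under theta = 0.
print("frequency of a 100:1 ratio favoring the fixed theta = 50:", (x == 50).mean())
```

  The first ratio is guaranteed whatever value is observed, which is exactly why the data-dependent comparison is certain to point against θ = 0; the second occurs only about 1% of the time.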

  Allan Birnbaum gets the prize for inventing chestnuts that deeply challenge both those who do, and those who do not, hold the Likelihood Principle!

  Souvenir D: Why We Are So New

  What’s Old? You will hear critics say that the reason to overturn frequentist, sampling theory methods – all of which fall under our error statistical umbrella – is that, well, they’ve been around a long, long time. First, they are scarcely stuck in a time warp. They have developed with, and have often been the source of, the latest in modeling, resampling, simulation, Big Data, and machine learning techniques. Second, all the methods have roots in long-ago ideas. Do you know what is really up-to-the-minute in this time of massive, computer algorithmic methods and “trust me” science? A new vigilance about retaining hard-won error control techniques. Some thought that, with enough data, experimental design could be ignored; the result was a decade of wasted microarray experiments. To view outcomes other than what you observed as irrelevant to what x0 says is also at odds with cures for irreproducible results. When it comes to cutting-edge fraud-busting, the ancient techniques (e.g., of Fisher) are called in, refurbished with simulation.

  What’s really old and past its prime is the idea of a logic of inductive inference. Yet core discussions of statistical foundations today revolve around a small cluster of (very old) arguments based on that vision. Tour II took us to the crux of those arguments. Logics of induction focus on the relationships between given data and hypotheses – so outcomes other than the one observed drop out. This is captured in the Likelihood Principle (LP). According to the LP, trying and trying again makes no difference to the probabilist: it is what someone intended to do, locked up in their heads.

  It is interesting that frequentist analyses often need to be adjusted to account for these ‘looks at the data,’… That Bayesian analysis claims no need to adjust for this ‘look elsewhere’ effect – called the stopping rule principle – has long been a controversial and difficult issue…

  (J. Berger 2008, p. 15)

  The irrelevance of optional stopping is an asset for holders of the LP. For the task of criticizing and debunking, this puts us in a straitjacket. The warring sides talk past each other. We need a new perspective on the role of probability in statistical inference that will illuminate, and let us get beyond, this battle.

  New Role of Probability for Assessing What’s Learned.

  A passage to locate our approach within current thinking is from Reid and Cox (2015):

  Statistical theory continues to focus on the interplay between the roles of probability as representing physical haphazard variability … and as encapsulating in some way, directly or indirectly, aspects of the uncertainty of knowledge, often referred to as epistemic.

  (p. 294)

  We may avoid the need for a different version of probability by appeal to a notion of calibration, as measured by the behavior of a procedure under hypothetical repetition. That is, we study assessing uncertainty, as with other measuring devices, by assessing the performance of proposed methods under hypothetical repetition. Within this scheme of repetition, probability is defined as a hypothetical frequency.

  (p. 295)

  This is an ingenious idea. Our meta-level appraisal of methods proceeds this way too, but with one important difference. A key question for us is the proper epistemic role for probability. It is standardly taken as providing a probabilism, as an assignment of degree of actual or rational belief in a claim, absolute or comparative. We reject this. We proffer an alternative theory: a severity assessment. An account of what is warranted and unwarranted to infer – a normative epistemology – is not a matter of using probability to assign rational beliefs, but of using it to control and assess how well probed claims are.

  If we keep the presumption that the epistemic role of probability is a degree of belief of some sort, then we can “avoid the need for a different version of probability” by supposing that good/poor performance of a method warrants high/low belief in the method’s output. Clearly, poor performance is a problem, but I say a more nuanced construal is called for. The idea that partial or imperfect knowledge is all about degrees of belief is handed down by philosophers. Let’s be philosophical enough to challenge it.

  New Name?

  An error statistician assesses inference by means of the error probabilities of the method by which the inference is reached. As these stem from the sampling distribution, the conglomeration of such methods is often called “sampling theory.” However, sampling theory, like classical statistics, Fisherian, Neyman–Pearsonian, or frequentism, is too much associated with hardline or mish-mashed views. Our job is to clarify them, but in a new way. Where it’s apt for taking up discussions, we’ll use “frequentist” interchangeably with “error statistician.” However, frequentist error statisticians tend to embrace the long-run performance role of probability that I find too restrictive for science. In an attempt to remedy this, Birnbaum put forward the “confidence concept” (Conf), which he called the “one rock in a shifting scene” in statistical thinking and practice. This “one rock,” he says, takes from the Neyman–Pearson (N-P) approach “techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033). Extending his notion to a composite alternative:

  Conf: An adequate concept of statistical evidence should find strong evidence against H0 (for ~H0) with small probability α when H0 is true, and with much larger probability (1 − β) when H0 is false, increasing as discrepancies from H0 increase.

  This is an entirely right-headed pre-data performance requirement, but I agree with Birnbaum that it requires a reinterpretation for evidence post-data (Birnbaum 1977). Despite hints and examples, no such evidential interpretation has been given. The switch that I’m hinting at as to what’s required for an evidential or epistemological assessment is key. Whether one uses a frequentist or a propensity interpretation of error probabilities (as Birnbaum did) is not essential. What we want is an error statistical approach that controls and assesses a test’s stringency or severity. That’s not much of a label. For short, we call someone who embraces such an approach a severe tester. For now I will just venture that a severity scrutiny illuminates all statistical approaches currently on offer.

  1 Divide the numerator and the denominator by Pr(x|H0)Pr(H0). Then Pr(H0|x) = 1/[1 + (Pr(H1)/Pr(H0))(Pr(x|H1)/Pr(x|H0))].

  2 He notes that the comparative evidence for a trick versus a normal deck is not evidence against a normal deck alone (pp. 14–15).

  3 Pr(x) = Pr(x & H0) + Pr(x & H1), where H0 and H1 are exhaustive.

  4 A general result, stated in Kerridge (1963, p. 1109), is that with k simple hypotheses, where H0 is true and H1, …, Hk−1 are false, and equal priors, “the frequency with which, at the termination of sampling the posterior probability of the true hypothesis is p or less cannot exceed (k − 1)p/(1 − p).” Such bounds depend on having countably additive probability, while the uniform prior in Armitage’s example imposes finite additivity.

  5 Colin Howson, a long-time subjective Bayesian, has recently switched to being a non-subjective Bayesian at least in part because of the known evidence problem (Howson 2017, p. 670).

  6 From Cox and Hinkley 1974, p. 51. The likelihood function corresponds to the normal distribution of x̄ around μ with SE σ/√n. Writing the observed mean as x̄ = kσ/√n, the likelihood at μ = 0 is exp(−0.5k²) times that at μ = x̄. One can choose k to make the ratio small. “That is, even if in fact μ = 0, there always appears to be strong evidence against μ = 0, at least if we allow comparison of the likelihood at μ = 0 against any value of μ and hence in particular against the value of μ giving maximum likelihood.” However, if we confine ourselves to comparing the likelihood at μ = 0 with that at some fixed μ = μ′, this difficulty does not arise.

  Excursion 2

  Taboos of Induction and Falsification

  Itinerary

  Tour I Induction and Confirmation
  2.1 The Traditional Problem of Induction

  2.2 Is Probability a Good Measure of Confirmation?

  Tour II Falsification, Pseudoscience, Induction
  2.3 Popper, Severity, and Methodological Probability

  2.4 Novelty and Severity

  2.5 Fallacies of Rejection and an Animal Called NHST

  2.6 The Reproducibility Revolution (Crisis) in Psychology

  2.7 How to Solve the Problem of Induction Now

  Tour I

  Induction and Confirmation

  Cox: [I]n some fields foundations do not seem very important, but we both think that foundations of statistical inference are important; why do you think that is?

  Mayo: I think because they ask about fundamental questions of evidence, inference, and probability … we invariably cross into philosophical questions about empirical knowledge and inductive inference.

  (Cox and Mayo 2011, p. 103)

  Contemporary philosophy of science presents us with some taboos: Thou shalt not try to find solutions to problems of induction, falsification, and demarcating science from pseudoscience. It’s impossible to understand rival statistical accounts, let alone get beyond the statistics wars, without first exploring how these came to be “lost causes.” I am not talking of ancient history here: these problems were alive and well when I set out to do philosophy in the 1980s. I think we gave up on them too easily, and by the end of Excursion 2 you’ll see why. Excursion 2 takes us into the land of “Statistical Science and Philosophy of Science” (StatSci/PhilSci). Our Museum Guide gives a terse thumbnail sketch of Tour I. Here’s a useful excerpt:
