Souvenir F: Getting Free of Popperian Constraints on Language
Popper allows that anyone who wants to define induction as the procedure of corroborating by severe testing is free to do so; and I do. Free of the bogeyman that induction must take the form of a probabilism, let’s get rid of some linguistic peculiarities inherited by current-day Popperians (critical rationalists). They say things such as: it is warranted to infer (prefer or believe) H (because H has passed a severe test), but there is no justification for H (because “justifying” H would mean H was true or highly probable). In our language, if H passes a severe test, you can say it is warranted, corroborated, justified – along with whatever qualification is appropriate. I tend to use “warranted.” The Popperian “hypothesis H is corroborated by data x” is such a tidy abbreviation of “H has passed a severe test with x” that we may use the two interchangeably. I’ve already co-opted Popper’s description of science as problem solving. A hypothesis can be seen as a potential solution to a problem (Laudan 1978). For example, the theory of protein folding purports to solve the problem of how pathological prions are transmitted. The problem might be to explain, to predict, to unify, to suggest new problems, etc. When we severely probe, it’s not for falsity per se, but to investigate whether a problem has been adequately solved by a model, method, or theory.
In rejecting probabilism, there is nothing to stop us from speaking of believing in H. It’s not the direct output of a statistical inference. A post-statistical inference might be to believe a severely tested claim and disbelieve a falsified one. There are many different grounds for believing something. We may be tenacious in our beliefs in the face of the given evidence; beliefs may have other grounds, or be prudential. By the same token, talk of deciding to conclude, infer, prefer, or act can be fully epistemic in the sense of assessing evidence, warrant, and well-testedness. Popper, like Neyman and Pearson, employs such language because it allows talking about inference distinct from assigning probabilities to hypotheses. Failing to recognize this has created unnecessary combat.
Live Exhibit (vi): Revisiting Popper’s Demarcation of Science.
Here’s an experiment: try shifting what Popper says about theories to a related claim about inquiries to find something out. To see what I have in mind, let’s listen to an exchange between two fellow travelers over coffee at Statbucks.
Traveler 1: If mere logical falsifiability suffices for a theory to be scientific, then we can’t properly oust astrology from the scientific pantheon. Plenty of nutty theories have been falsified, so by definition they’re scientific. Moreover, scientists aren’t always looking to subject well-corroborated theories to “grave risk” of falsification.
Traveler 2: I’ve been thinking about this. On your first point, Popper confuses things by making it sound as if he’s asking: When is a theory unscientific? What he is actually asking, or should be asking, is: When is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that are BENT. If the test methods enable ad hoc maneuvering and sneaky face-saving devices, then the inquiry – the handling and use of data – is unscientific. Despite being logically falsifiable, theories can be rendered immune from falsification by means of cavalier methods for their testing. Adhering to a falsified theory no matter what is poor science. Some areas have so much noise and/or flexibility that they can’t or won’t distinguish warranted from unwarranted explanations of failed predictions. Rivals may find flaws in one another’s inquiry or model, but the criticism is not constrained by what’s actually responsible. This is another way inquiries can become unscientific.1
She continues:
On your second point, it’s true that Popper talked of wanting to subject theories to grave risk of falsification. I suggest that it’s really our inquiries into, or tests of, the theories that we want to subject to grave risk. The onus is on interpreters of data to show how they are countering the charge of a poorly run test. I admit this is a modification of Popper. One could reframe the entire demarcation problem as one about the character of an inquiry or test.
She makes a good point. In addition to blocking inferences that fail the minimal requirement for severity:
A scientific inquiry or test: must be able to embark on a reliable probe to pinpoint blame for anomalies (and use the results to replace falsified claims and build a repertoire of errors).
The parenthetical remark isn’t absolutely required, but is a feature that greatly strengthens scientific credentials. Without solving, not merely embarking on, some Duhemian problems there are no interesting falsifications. The ability or inability to pin down the source of failed replications – a familiar occupation these days – speaks to the scientific credentials of an inquiry. At any given time, even in good sciences there are anomalies whose sources haven’t been traced – unsolved Duhemian problems – generally at “higher” levels of the theory-data array. Embarking on solving these is the impetus for new conjectures. Checking test assumptions is part of working through the Duhemian maze. The reliability requirement is: infer claims just to the extent that they pass severe tests. There’s no sharp line for demarcation, but when these requirements are absent, an inquiry veers into the realm of questionable science or pseudoscience. Some physicists worry that highly theoretical realms can’t be expected to be constrained by empirical data. Theoretical constraints are also important. We’ll flesh out these ideas in future tours.
2.4 Novelty and Severity
When you have put a lot of ideas together to make an elaborate theory, you want to make sure, when explaining what it fits, that those things it fits are not just the things that gave you the idea for the theory; but that the finished theory makes something else come out right, in addition.
(Feynman 1974, p. 385)
This “something else that must come out right” is often called a “novel” predictive success. Whether or not novel predictive success is required is a very old battle that parallels debates between frequentists and inductive logicians, in both statistics and philosophy of science – for example, between Mill and Peirce, and between Popper and Keynes. Walking up the ramp from the ground floor to the gallery of Statistics, Science, and Pseudoscience, we use the novelty debate to intermix Popper and statistical testing.
When Popper denied we can capture severity formally, he was reflecting an astute insight: there is a tension between the drive for a logic of confirmation and our strictures against practices that lead to poor tests and ad hoc hypotheses. Adhering to the former downplays or blocks the ability to capture the latter, which demands we go beyond the data and hypotheses. Imre Lakatos would say we need to know something about the history of the hypothesis: how was it developed? Was it the result of deliberate and ad hoc attempts to spare one’s theory from refutation? Did the researcher continue to adjust her theory in the face of an anomaly or apparent discorroborating result? (He called these “exception incorporations.”) By contrast, the confirmation theorist asks: why should it matter how the hypothesis inferred was arrived at, or whether data-dependent selection effects were operative? When holders of the Likelihood Principle (LP) wonder why data can’t speak for themselves, they’re echoing the logical empiricist (Section 1.4). Here’s Popperian philosopher Alan Musgrave:
According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, … we must consider only the statements h and e, and the logical relations between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h.
(Musgrave 1974, p. 2)
John Maynard Keynes likewise held that the “… question as to whether a particular hypothesis happens to be propounded before or after examination of [its instances] is quite irrelevant” (Keynes 1921/1952, p. 305). Logics of confirmation ran into problems because they insisted on purely formal or syntactical criteria of confirmation that, like deductive logic, “should contain no reference to the specific subject-matter” (Hempel 1945, p. 9) in question. The Popper–Lakatos school attempts to avoid these shortcomings by means of novelty requirements:
Novelty Requirement: For data to warrant a hypothesis H requires not just that (i) H agree with the data, but also (ii) the data should be novel or surprising or the like.
For decades Popperians squabbled over how to define novel predictive success. There’s (1) temporal novelty – the data were not already available before the hypothesis was erected (Popper, early); (2) theoretical novelty – the data were not already predicted by an existing hypothesis (Popper, Lakatos); and (3) use-novelty – the data were not used to construct or select the hypothesis. Defining novel success is intimately linked to defining Popperian severity.
Temporal novelty is untenable: known data (e.g., the perihelion of Mercury, anomalous for Newton) are often strong evidence for theories (e.g., GTR). Popper ultimately favored theoretical novelty: H passes a severe test with x, when H entails x, and x is theoretically novel – according to a letter he sent me. That, of course, freed me to consider my own notion as distinct. (We replace “entails” with something like “accords with.”) However, as philosopher John Worrall (1978, pp. 330–1) shows, to require theoretical novelty prevents passing H with severity, so long as there’s already a hypothesis that predicts the data or phenomenon x (it’s not clear which). Why should the first hypothesis that explains x be better tested?
I take the most promising notion of novelty to be a version of use-novelty: H passes a test with data x severely, so long as x was not used to construct H (Worrall 1989). Data can be known, so long as they weren’t used in building H, presumably to ensure H accords with x. While the idea is in sync with the error statistical admonishment against “peeking at the data” and finding your hypothesis in the data, it’s far too vague as it stands. Watching this debate unfold in philosophy, I realized none of the notions of novelty were either sufficient or necessary for a good test (Mayo 1991).
You will notice that statistical researchers go out of their way to state a prediction at the start of a paper, presenting it as temporally novel. If H is temporally novel, it also satisfies use-novelty: if H came first, the data could not have been used to arrive at H. This stricture is desirable, but to suppose it suffices for a good test grows out of a discredited empiricist account where the data are given, rather than the product of much massaging and interpretation. There is as much opportunity for bias to arise in interpreting or selectively reporting results with a known hypothesis as there is in starting with data and artfully creating a hypothesis. Nor is violating use-novelty a matter of the implausibility of H. On the contrary, popular psychology thrives by seeking to explain results by means of hypotheses expected to meet with approval, at least in a given political tribe. Preregistration of the detailed protocol is supposed to cure this. We come back to this.
Should use-novelty be necessary for a good test? Is it ever okay to use data to arrive at a hypothesis H as well as to test H – even if the data use ensures agreement or disagreement with H? The answers, I say, are no and yes, respectively. Violations of use-novelty need not be pejorative. A trivial example: count all the people in the room and use the count to fix the parameter “number of people in the room.” Or, less trivially, think of confidence intervals: we use the data to form the interval estimate. The estimate is really a hypothesis about the value of the parameter. The same data warrant the hypothesis constructed! Likewise, using the same data to arrive at and test assumptions of statistical models can be entirely reliable. What matters is not novelty, in any of the senses, but severity in the error statistical sense. Even where our intuition is to prohibit use-novelty violations, the requirement is murky. We should instead consider specific ways that severity can be violated. Let’s define:
Biasing Selection Effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.2
Despite using this subdued label, the cluster of colorful terms for related gambits is too irresistible to banish entirely – double-counting, cherry picking, fishing, hunting, significance seeking, searching for the pony, trying and trying again, data dredging, monster barring, look elsewhere effect, and many others besides – unless we’re rushing. New terms such as P-hacking are popular, but don’t forget that these crooked practices are very old.3
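To see how such gambits clash with the minimal severity requirement, here is a small simulation sketch (a hypothetical illustration in Python, not an example from the text) of “trying and trying again”: measure many outcomes on which there is no real effect, and report only the smallest P-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

def smallest_p_of_many(n_outcomes=20, n_per_group=30):
    """Hunt across many independent null comparisons; report only the best-looking one."""
    p_values = []
    for _ in range(n_outcomes):
        treatment = rng.normal(0.0, 1.0, n_per_group)  # no real effect anywhere
        control = rng.normal(0.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(treatment, control)
        p_values.append(p)
    return min(p_values)

# How often does this search procedure deliver a "significant" finding under the null?
n_sim = 2000
hits = sum(smallest_p_of_many() < 0.05 for _ in range(n_sim))
print(f"Pr(report p < 0.05 even though every null is true) ~ {hits / n_sim:.2f}")
# Roughly 1 - 0.95**20, about 0.64, rather than the nominal 0.05: the procedure
# would very probably produce an impressive-looking result even with no effect,
# so the minimal severity requirement is violated (or cannot be assessed from
# the reported result alone).
```

Contrast this with the confidence-interval case above: there the data are also used to construct the claim inferred, yet the construction leaves the method’s capability to avert erroneous interpretations intact.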
To follow the Popper–Lakatos school (although entailment is too strong):
Severity Requirement: for data to warrant a hypothesis H requires not just that
(S-1) H agrees with the data (H passes the test), but also
(S-2) with high probability, H would not have passed the test so well, were H false.
This describes corroborating a claim; it is “strong” severity. Weak severity denies H is warranted if the test method would probably have passed H even if H were false. While severity got its start in this Popperian context, in future excursions we will need more specifics to describe both clauses (S-1) and (S-2).
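The quantitative version is deferred to later excursions, but a brief numerical sketch may help fix ideas. What follows is my own hedged illustration (hypothetical numbers; a one-sided appraisal of a Normal mean with known standard deviation, in Python): clause (S-2) is read as asking how probable it is that the test would have accorded less well with the claim than it actually did, were the claim false.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical setup: X ~ N(mu, sigma^2) with sigma known; n observations.
sigma, n = 1.0, 25
se = sigma / sqrt(n)

def severity_of_claim(xbar_obs, mu_claim):
    """Pr(the test would NOT have accorded so well with 'mu > mu_claim',
    were that claim false), evaluated at the boundary point mu = mu_claim.
    A worse accordance with 'mu > mu_claim' means a smaller sample mean."""
    return norm.cdf((xbar_obs - mu_claim) / se)

xbar_obs = 0.4  # hypothetical observed sample mean
for mu_claim in (0.0, 0.2, 0.4):
    print(f"claim mu > {mu_claim}: severity ~ {severity_of_claim(xbar_obs, mu_claim):.3f}")
# mu > 0.0 passes with high severity (~0.98); mu > 0.4 does not (~0.5):
# the same data warrant a weaker claim more severely than a stronger one.
```

This is only a sketch under the stated assumptions, anticipating the formal treatment of (S-1) and (S-2) in later excursions.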
2.5 Fallacies of Rejection and an Animal Called NHST
One of Popper’s prime examples of non-falsifiable sciences was Freudian and Adlerian psychology, which gave psychologist Paul Meehl conniptions because he was a Freudian as well as a Popperian. Meehl castigates Fisherian significance tests for providing a sciency aura to experimental psychology, when they seem to violate Popperian strictures: “[T]he almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas … is basically unsound, poor scientific strategy …” (Meehl 1978, p. 817). Reading Meehl, Lakatos wrote, “one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and … an increase in pseudo-intellectual garbage” (Lakatos 1978, pp. 88–9, note 4).
Now Meehl is a giant when it comes to criticizing statistical practice in psychology, and a good deal of what contemporary critics are on about was said long ago by him. He’s wrong, though, to pin the blame on “Sir Ronald” (Fisher). Corroborating substantive theories merely by means of refuting the null? Meehl may be describing what is taught and permitted in the “soft sciences,” but the practice of moving from statistical to substantive theory violates the testing methodologies of both Fisher and Neyman–Pearson. I am glad to see Gerd Gigerenzer setting the record straight on this point, given how hard he, too, often is on Fisher:
It should be recognized that, according to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter cannot be established on the basis of one single experiment, but requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.
(Gigerenzer et al. 1989, pp. 95–6)
According to Gigerenzer et al., “careless writing on Fisher’s part, combined with selective reading of his early writings has led to the identification of the two, and has encouraged the practice of demonstrating a phenomenon on the basis of a single statistically significant result” (ibid., p. 96). I don’t think Fisher can be accused of carelessness here; he made two crucial clarifications, and the museum display case bears me out. The first is that “[W]e need, not an isolated record, but a reliable method of procedure” (Fisher 1935a, p. 14), from Excursion 1. The second is Fisher’s requirement that even a genuine statistical effect H fails to warrant a substantive research hypothesis H*. Using “≠>” to abbreviate “does not imply”: H ≠> H*.
Here’s David Cox defining significance tests over 40 years ago:
… we mean by a significance test a procedure for measuring the consistency of data with a null hypothesis … there is a function d = d(y) of the observations, called a test statistic, and such that the larger is d(y) the stronger is the inconsistency of y with H0, in the respect under study … we need to be able to compute, at least approximately,
p_obs = Pr(d ≥ d(y_obs); H0),
called the observed level of significance [or P-value].
Application of the significance test consists of computing approximately the value of p_obs and using it as a summary measure of the degree of consistency with H0, in the respect under study.
(Cox 1977, p. 50; replacing t with d)
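Cox’s definition translates directly into a calculation. The sketch below is my own hypothetical illustration in Python (the test statistic, null, and data are not from the text): it takes d(y) to be the standardized sample mean for H0: mu = 0 with sigma known, so that d is standard Normal under H0, and computes p_obs = Pr(d ≥ d(y_obs); H0).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: y_i ~ N(mu, 1) with sigma = 1 known; null hypothesis H0: mu = 0.
rng = np.random.default_rng(7)
y_obs = rng.normal(0.3, 1.0, size=25)  # illustrative sample with a small true effect

def d(y, mu0=0.0, sigma=1.0):
    """Test statistic d(y): standardized sample mean.
    Larger d indicates stronger inconsistency with H0 in the direction mu > mu0."""
    return (np.mean(y) - mu0) / (sigma / np.sqrt(len(y)))

d_obs = d(y_obs)
p_obs = norm.sf(d_obs)  # Pr(d >= d_obs; H0), since d is N(0, 1) under H0
print(f"d_obs = {d_obs:.2f}, observed level of significance p_obs = {p_obs:.3f}")
```

A small p_obs indicates inconsistency with H0 only in the respect measured by d; what may and may not be inferred from it is taken up next.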
Statistical test requirements follow the pattern of non-statistical tests, Cox emphasizes, though at most H0 entails some results with high probability. Say 99% of the time the test would yield {d < d0}, if H0 adequately describes the data-generating mechanism, where d0 abbreviates d(x0). Observing {d ≥ d0} indicates inconsistency with H0 in the respect tested. (Implicit alternatives, Cox says, “lurk in the undergrowth,” given by the test statistic.) So significance tests reflect statistical modus tollens, and their reasoning follows that of severe testing – BUT an isolated low P-value won’t suffice to infer a genuine effect, let alone a research claim. Here’s a list of fallacies of rejection.