Statistical Inference as Severe Testing


by Deborah G Mayo

Error statistical methods, deemed indirect for probabilism, are direct for severe probing and falsification. Severe testers do not view scientists as seeking highly probable hypotheses, but as learning which interpretations of data are well and poorly tested. Of course we want well-warranted claims, but arriving at them does not presuppose a single probability pie with its requirements of exhaustiveness: science must be open ended. We want methods that efficiently find falsity, not ones that are based on updating values for parameters in an existing model. We want to infer local variants of theories piecemeal, falsify others, and be free to launch a probe of any hypothesis which we can subject to severe testing. If other ways to falsify satisfy error statistical requirements, then they are happily in sync with us.

9. Severe Testing Is Not All You Do in Inquiry. Agreed. I have used a neutral word “warranted” to mean justified, adding “with severity” when appropriate. There’s a distinctive twist that goes with severely warranting claims – some prefer to say “beliefs,” and you could substitute throughout if you wish. It is this twist that makes it possible to have your probabilist cake, and probativism too – each for distinct contexts. The severe testing assessment is not measuring how strong your belief in H is but how well you can show why H ought to be believed. It is relevant when the aim is to know why claims pass (or fail) the tests they do. View the error statistical notions as a picturesque representation of the real life, flesh and blood, capability or incapability to put to rest reasonable skeptical challenges. It’s in the spirit of Fisher’s requiring you know how to bring about results that would rarely fail to corroborate H. It’s not merely knowing, but showing you’re prepared, or would be, to tackle skeptical challenges. I’m using “you” but it could be a group or a machine. It’s just not all that you do in inquiry. I admitted at the outset that we do not always want to find things out. If your goal is belief probabilism, or you’re in a context where the aim is to assign direct probabilities to events (a deductive task), then you are better off recognizing the differences than trying to unify or reconcile. Let me be clear, severe testing isn’t reserved for cases of strong evidence; it is operative at every stage of inquiry, but even more so in early stages – where skepticism is greatest. The severity demand is what we naturally want as consumers of statistics, namely, grounds that reports would very probably have revealed flaws of relevance when they’re present. To pass tests with severity gives strong evidence, yes, but most of the time it’s to learn that much less than was thought or hoped has passed. Showing (with severity!) that a study was poorly run is important in its own right, even if done semi-formally. Better still is to pinpoint a flaw that’s been overlooked.
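To make the severity demand concrete, here is a minimal sketch that is not drawn from the text above. It assumes a one-sided Normal test of μ ≤ μ0 against μ > μ0 with known σ, and it takes the severity for the claim μ > μ1, given an observed sample mean, to be the probability of a sample mean no larger than the one observed, were μ equal to μ1. The function names and the numbers (σ = 10, n = 100, observed mean 152) are hypothetical, chosen only for illustration.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def severity_mu_greater(xbar: float, mu1: float, sigma: float, n: int) -> float:
    """Severity for the claim mu > mu1, given an observed sample mean xbar.

    Assumption (not from the text): SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1)
    for a Normal sample of size n with known standard deviation sigma.
    """
    standard_error = sigma / sqrt(n)
    return normal_cdf((xbar - mu1) / standard_error)


if __name__ == "__main__":
    # Hypothetical numbers: sigma = 10, n = 100, observed mean 152.
    for mu1 in (150.0, 151.0, 152.0, 153.0):
        sev = severity_mu_greater(152.0, mu1, sigma=10.0, n=100)
        print(f"SEV(mu > {mu1:.0f}) = {sev:.3f}")
```

Under these assumed numbers the claim μ > 150 passes with high severity while μ > 153 does not, which is one way to see that passing with severity often shows that much less than was hoped has passed.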

Our journey has taken you far beyond the hackneyed statistical battles that make up much of today’s statistics wars. I’ve chosen to focus on some of them in your final “keepsake” because, if you have to refight them, you can begin from the places we’ve reached. These criticisms can no longer be blithely put forward as having weight without wrestling with the underlying presuppositions and challenges about evidence and inference. You might say that the criticisms have force against garden-variety treatments of error statistical methods, that I’ve changed things by adding an explicit severe testing philosophy. I’ll happily concede this, but that is the whole reason for taking this journey. You needn’t accept this statistical philosophy to use it to peel back the layers of the statistics wars; you will then be beyond them. It’s time.

Live (Final) Exhibit. What Does the Severe Tester Say About Positions 1–9? What do you say?

1 We know performance is necessary but not sufficient for severity, nor for confidence distributions or fiducial inference, but here we imagine we have got the relevant error probability.

2 There’s no need for the philosopher’s appeal to things like closest possible worlds to use counterfactuals either.

3 They allow the possibility that the knowledge that optional stopping will be used alters their prior for 0. I take it they recognize this is at odds with the presumption that “optional stopping is no sin,” and they don’t press it. See Section 1.5 where we first took this up.

4 In observing that “informative stopping rules occur only rarely in practice” (p. 90), Berger and Wolpert make the insightful point that disagreement on this is “due to the misconception that an informative stopping rule is one for which N carries information about θ.”

5 One explanation is in Bernardo’s appeal to a decision theory that considers the sampling distribution in computing utilities.

6 The error statistical account would suggest first checking the likelihood portion of the model, after which they could turn to the prior.

7 Note, for example, that for a given parameter θ, one has presumably only selected a single θ, not the n samples of our usual M-S test. It’s not clear why we should expect it to produce typical outcomes. I owe this point to Christian Hennig.

8 A co-developer of posterior predictive checks, Xiao-Li Meng, is a leader of the “Bayes–Fiducial–Frequentist” movement.

9 We would need predesignation of hypotheses (and/or other restrictions) if there is to be error control.

10 I allude to a pin and tumbler lock.

11 Some will use a P-value as a degree of inconsistency with a null hypothesis.

Souvenirs

Souvenir A: Postcard to Send

Souvenir B: Likelihood versus Error Statistical

Souvenir C: A Severe Tester’s Translation Guide

Souvenir D: Why We Are So New

Souvenir E: An Array of Questions, Problems, Models

Souvenir F: Getting Free of Popperian Constraints on Language

Souvenir G: The Current State of Play in Psychology

Souvenir H: Solving Induction Is Showing Methods with Error Control

Souvenir I: So What Is a Statistical Test, Really?

Souvenir J: UMP Tests

Souvenir K: Probativism

Souvenir L: Beyond Incompatibilist Tunnels

Souvenir M: Quicksand Takeaway

Souvenir N: Rule of Thumb for SEV

Souvenir O: Interpreting Probable Flukes

Souvenir P: Transparency and Informativeness

Souvenir Q: Have We Drifted From Testing Country? (Notes From an Intermission)

Souvenir R: The Severity Interpretation of Rejection (SIR)

Souvenir S: Preregistration and Error Probabilities

Souvenir T: Even Big Data Calls for Theory and Falsification

Souvenir U: Severity in Terms of Problem-Solving

Souvenir V: Two More Points on M-S Tests and an Overview of Excursion 4

Souvenir W: The Severity Interpretation of Negative Results (SIN) for Test T+

Souvenir X: Power and Severity Analysis

Souvenir Y: Axioms Are To Be Tested by You (Not Vice Versa)

Souvenir Z: Understanding Tribal Warfare

References

Achinstein, P. (2000). ‘Why Philosophical Theories of Evidence Are (And Ought to Be) Ignored by Scientists’, in Howard, D. (ed.), Proceedings of the 1998 Biennial Meetings of the Philosophy of Science Association, Philosophy of Science 67, S180–92.

Achinstein, P. (2001). The Book of Evidence. Oxford: Oxford University Press.

Achinstein, P. (2010). ‘Mill’s Sins or Mayo’s Errors?’ In Mayo, D. and Spanos, A. (eds.), pp. 170–88.

Akaike, H. (1973). ‘Information Theory and an Extension of the Maximum Likelihood Principle’, in Petrov, B. and Csaki, F. (eds.), 2nd International Symposium on Information Theory. Akademia Kiado, Budapest, pp. 267–81.

American Statistical Association (2017). Recommendations to Funding Agencies for Supporting Reproducible Research. amstat.org/asa/News/ASA-Develops-Reproducible-Research-Recommendations.aspx.

Armitage, P. (1961). ‘Contribution to the Discussion in Smith, C.A.B., “Consistency in Statistical Inference and Decision”’, Journal of the Royal Statistical Society: Series B (Methodological) 23, 1–37.

Armitage, P. (1962). ‘Contribution to Discussion’, in Savage, L. J. (ed.), pp. 62–103.

Armitage, P. (1975). Sequential Medical Trials, 2nd edn. New York: Wiley.

Armitage, P. (2000). ‘Comments on the Paper by Lindley’, Journal of the Royal Statistical Society: Series D 49(3), 319–20.

ATLAS Collaboration (2012a). ‘Latest Results from ATLAS Higgs Search’, Press statement, ATLAS Updates, July 4, 2012. http://atlas.cern/updates/press-statement/latest-results-atlas-higgs-search.

ATLAS Collaboration (2012b). ‘Observations of a New Particle in the Search for the Standard Model Higgs Boson with the Atlas Detector at the LHC’, Physics Letters B 716(2012), 1–29.

ATLAS Collaboration (2012c). ‘Updated ATLAS Results on the Signal Strength of the Higgs-like Boson for Decays into WW and Heavy Fermion Final States’, ATLAS-CONF-2012-162. ATLAS Note, November 14, 2012. http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf.

Bacchus, F., Kyburg, H., and Thalos, M. (1990). ‘Against Conditionalization’, Synthese 85(3), 475–506.

Baggerly, K. and Coombes, K. (2009). ‘Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-throughput Biology’, Annals of Applied Statistics 3(4), 1309–34.

Bailar, J. (1991). ‘Scientific Inferences and Environmental Health Problems’, Chance 4(2), 27–38.

Bakan, D. (1970). ‘The Test of Significance in Psychological Research’, in Morrison, D. and Henkel, R. (eds.), pp. 231–51.

Baker, M. (2016). ‘1,500 Scientists Lift the Lid on Reproducibility’, Nature 533, 452–4.

Banerjee, A. and Duflo, E. (2011). Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty, 1st edn. New York: PublicAffairs.

Barnard, G. A. (1950). ‘On the Fisher-Behrens Test’, Biometrika 37(3/4), 203–7.

Barnard, G. (1962). ‘Contribution to Discussion’, in Savage, L. J. (ed.), pp. 62–103.

Barnard, G. (1971). ‘Scientific Inferences and Day to Day Decisions’, in Godambe, V. and Sprott, D. (eds.), pp. 289–300.

Barnard, G. (1972). ‘The Logic of Statistical Inference (Review of “The Logic of Statistical Inference” by Ian Hacking)’, British Journal for the Philosophy of Science 23(2), 123–32.

Barnard, G. (1985). A Coherent View of Statistical Inference, Statistics Technical Report Series. Department of Statistics & Actuarial Science, University of Waterloo, Canada.

Bartlett, M. (1936). ‘The Information Available in Small Samples’, Proceedings of the Cambridge Philosophical Society 32, 560–6.

Bartlett, M. (1939). ‘Complete Simultaneous Fiducial Distributions’, Annals of Mathematical Statistics 10, 129–38.

Bartlett, T. (2012a). ‘Daniel Kahneman Sees “Train-Wreck Looming” for Social Psychology’, Chronicle of Higher Education, online (10/4/2012).

Bartlett, T. (2012b). ‘The Researcher Behind the Ovulation Voting Study Responds’, Chronicle of Higher Education, online (10/28/2012).

Bayarri, M. and Berger, J. (2004). ‘The Interplay between Bayesian and Frequentist Analysis’, Statistical Science 19, 58–80.

Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. (2016). ‘Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses’, Journal of Mathematical Psychology 72, 90–103.

Bem, D. (2011). ‘Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect’, Journal of Personality and Social Psychology 100(3), 407–425.

Bem, D., Utts, J., and Johnson, W. (2011). ‘Must Psychologists Change the Way They Analyze Their Data?’ Journal of Personality and Social Psychology 101(4), 716–719.

Benjamin, D. and Berger, J. (2016). ‘Comment: A Simple Alternative to P-values on Wasserstein, R. and Lazar, N. 2016, “The ASA’s Statement on p-Values: Context, Process, and Purpose”’, The American Statistician 70(2) (supplemental materials).

Benjamin, D., Berger, J., Johannesson, M., et al. (2017). ‘Redefine Statistical Significance’, Nature Human Behaviour 2, 6–10.

Benjamini, Y. (2008). ‘Comment: Microarrays, Empirical Bayes and the Two-Groups Model’, Statistical Science 23(1), 23–8.

Benjamini, Y. and Hochberg, Y. (1995). ‘Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing’, Journal of the Royal Statistical Society, B 57, 289–300.

Berger, J. (2003). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and ‘Rejoinder’, Statistical Science 18(1), 1–12; 28–32.

Berger, J. (2006). ‘The Case for Objective Bayesian Analysis’ and ‘Rejoinder’, Bayesian Analysis 1(3), 385–402; 457–64.

Berger, J. (2008). ‘A Comparison of Testing Methodologies’, in Proceedings of the PHYSTAT-LHC Workshop on Statistical Issues for LHC Physics, June 2008, CERN 2008-001, pp. 8–19.

Berger, J. and Bernardo, J. (1992). ‘On the Development of Reference Priors’, in Bernardo, J., Berger, J., Dawid, A., and Smith, A. (eds.), Bayesian Statistics Volume 4, Oxford: Oxford University Press, pp. 35–60.

Berger, J. and Sellke, T. (1987). ‘Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion and Rejoinder)’, Journal of the American Statistical Association 82(397), 112–22; 135–9.

Berger, J. and Wolpert, R. (1988). The Likelihood Principle, 2nd edn., Vol. 6. Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics.

Berger, R. (2014). ‘Comment on S. Senn’s post: “Blood Simple?” The complicated and controversial world of bioequivalence’, Guest Blogpost on Errorstatistics.com (7/31/2014).

Berger, R. and Hsu, J. (1996). ‘Bioequivalence Trials, Intersection-union Tests and Equivalence Confidence Sets’, Statistical Science 11(4), 283–302.

Bernardo, J. (1997). ‘Non-informative Priors Do Not Exist: A Discussion’, Journal of Statistical Planning and Inference 65, 159–89.

Bernardo, J. (2008). ‘Comment on Article by Gelman’, Bayesian Analysis 3(3), 451–4.

Bernardo, J. (2010). ‘Integrated Objective Bayesian Estimation and Hypothesis Testing’ (with discussion), Bayesian Statistics 9, 1–68.

Berry, S. and Kadane, J. (1997). ‘Optimal Bayesian Randomization’, Journal of the Royal Statistical Society: Series B 59(4), 813–19.

Bertrand, J. ([1889]/1907). Calcul des Probabilités. Paris: Gauthier-Villars.

Birnbaum, A. (1962). ‘On the Foundations of Statistical Inference’, in Kotz, S. and Johnson, N. (eds.), Breakthroughs in Statistics, 1, Springer Series in Statistics, New York: Springer-Verlag, pp. 478–581. (First published with discussion in Journal of the American Statistical Association 57(298), 269–326.)

Birnbaum, A. (1969). ‘Concepts of Statistical Evidence’, in Morgenbesser, S., Suppes, P., and White, M. (eds.), Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, New York: St. Martin’s Press, pp. 112–43.

Birnbaum, A. (1970). ‘Statistical Methods in Scientific Inference’ (letter to the Editor), Nature 225(5237), 1033.

Birnbaum, A. (1977). ‘The Neyman-Pearson Theory as Decision Theory, and as Inference Theory; with a Criticism of the Lindley-Savage Argument for Bayesian Theory’, Synthese 36(1), 19–49.

Bogen, J. and Woodward, J. (1988). ‘Saving the Phenomena’, Philosophical Review 97(3), 303–52.

Borel, E. ([1914]/1948). Le Hasard. Paris: Alcan.

Bowley, A. (1934). ‘Discussion and Commentary’, pp. 131–3 in Neyman 1934.

Box, G. (1979). ‘Robustness in the Strategy of Scientific Model Building’, in Launer, R. and Wilkinson, G. (eds.), Robustness in Statistics, New York: Academic Press, 201–36.

Box, G. (1983). ‘An Apology for Ecumenism in Statistics’, in Box, G., Leonard, T., and Wu, D. (eds.), Scientific Inference, Data Analysis, and Robustness, New York: Academic Press, 51–84.

Box, G. and Jenkins, G. (1976). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.

Box, J. (1978). R. A. Fisher: The Life of a Scientist. New York: John Wiley.

Breiman, L. (1997). ‘No Bayesians in Foxholes’, part of ‘Banter on Bayes: Debating the Usefulness of Bayesian Approaches to Solving Practical Problems’, hosted by Hearst, M., IEEE Expert: Intelligent Systems and Their Applications 12(6), 21–4.

Brown, E. N. and Kass, R. E. (2009). ‘What is Statistics?’ (with discussion), The American Statistician 63, 105–23.

Buchen, L. (2009). ‘May 29, 1919: A Major Eclipse, Relatively Speaking’, Wired online (5/29/2009).

Buehler, R. J. and Feddersen, A. P. (1963). ‘Note on a Conditional Property of Student’s t’, The Annals of Mathematical Statistics 34(3), 1098–100.

Burgman, M. (2005). Risk and Decision for Conservation and Environmental Management. Cambridge: Cambridge University Press.

Burnham, K. and Anderson, D. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag.
