Statistical Inference as Severe Testing


by Deborah G. Mayo


  4. Therefore, statistical methods and inferences are invariably subjective, if only in part.

  The mistake is to suppose we are incapable of critically scrutinizing how discretionary choices influence conclusions. It is true, for example, that choosing a very insensitive test for detecting a risk δ′ will give the test low probability of detecting such discrepancies even if they exist. Yet I’m not precluded from objectively determining this. Setting up a test with low power against δ′ might be a product of your desire not to find an effect for economic reasons, of insufficient funds to collect a larger sample, or of the inadvertent choice of a bureaucrat. Or ethical concerns may have entered. But our critical evaluation of what the resulting data do and do not indicate need not itself be a matter of economics, ethics, or what have you.
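  To make the power point concrete, here is a minimal sketch in Python (not from the text; the one-sided Normal test, σ = 1, α = 0.05, the sample sizes, and the value δ′ = 0.2 are all illustrative assumptions):

```python
# Minimal sketch: power of a one-sided z-test of H0: mu <= 0 against
# a genuine discrepancy delta_prime, with known sigma. An insensitive
# (small-n) test has low probability of detecting delta_prime even
# when it is real.
from scipy.stats import norm

def power(delta_prime, n, sigma=1.0, alpha=0.05):
    """P(test rejects H0 | true mean = delta_prime)."""
    z_alpha = norm.ppf(1 - alpha)          # significance cut-off
    shift = delta_prime * n**0.5 / sigma   # standardized discrepancy
    return 1 - norm.cdf(z_alpha - shift)

print(round(power(0.2, n=10), 2))   # ~0.16: low power, easy to miss
print(round(power(0.2, n=200), 2))  # ~0.88: high power
```

  Whatever the motives behind choosing the small sample, the test’s low power against δ′ is itself objectively determinable.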

  Idols of Objectivity.

  I sympathize with disgruntlement about phony objectivity and false trappings of objectivity. They grow out of one or another philosophical conception about what objectivity requires – even though you will almost surely not see them described that way. It’s the curse of logical positivism, but also its offshoots in post-positivisms. If it’s thought objectivity is limited to direct observations (whatever they are) plus mathematics and logic, as the typical positivist held, then it’s no surprise to wind up worshiping what Gigerenzer and Marewski (2015) call “the idol of a universal method.” Such a method is to supply a formal, ideally mechanical, rule to process statements of observations and hypotheses – translated into a neutral observation language. Can we translate Newtonian forces and Einsteinian curved spacetime into a shared observation language? The post-positivists, rightly, said no. Yet in giving up on logics of induction and theory-neutral languages, they did not deny these were demanded by objectivity; they only decided they were unobtainable. Genuine objectivity goes by the board, replaced by various stripes of relativism and constructivism, as well as more extreme forms of anarchism and postmodernism.1

  From the perspective of one who has bought the view that objectivity is limited to math, logic, and fairly direct observations (the dial now points to 7), methods that go beyond these appear “just as” subjective as any other. They may augment their rather thin gruel with an objectivity arising from social or political negotiation, or a type of consensus, but that’s to give away the goat far too soon. The result is to relax the core stipulations of scientific objectivity. To be clear: there are authentic problems that threaten objectivity. Let’s not allow outdated philosophical accounts to induce us to give it up.

  What about the fact that different methods yield different inferences – for example, that Richard Royall won’t infer the composite µ > 0.2 while N-P testers will? I have no trouble understanding why, if you define inference as comparative likelihoods, the results disagree with error statistical tests. Running different analyses on the same data can be the best way to unearth flaws. However, objectivity is an important criterion in appraising such rival statistical accounts.
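  A minimal sketch of how the two analyses can part ways on the same data (hypothetical numbers; the Normal model, n = 25, and the observed mean 0.6 are assumptions for illustration, not from the text):

```python
# Same data, two analyses. A Royall-style likelihoodist compares
# likelihoods at point values of mu; the composite claim mu > 0.2 has
# no single likelihood. An N-P tester addresses it via H0: mu <= 0.2.
from scipy.stats import norm

n, sigma, xbar = 25, 1.0, 0.6
se = sigma / n**0.5

# Likelihood ratio for two point hypotheses:
lr = norm.pdf(xbar, 0.6, se) / norm.pdf(xbar, 0.2, se)

# Error-statistical test of the composite H0: mu <= 0.2:
z = (xbar - 0.2) / se
p = 1 - norm.cdf(z)

print(f"LR (mu=0.6 vs mu=0.2) = {lr:.1f}; p-value for H0: mu <= 0.2 = {p:.3f}")
```

  The two outputs answer different questions, which is why the accounts can disagree about what may be inferred.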

  Objectivity and Observation.

  In facing objectivity skeptics, you might remind them of parallels between learning from statistical experiments and learning from observations in general. The problem in objectively interpreting observations is that observations are always relative to the particular instrument or observation scheme employed. But we are often aware not only of the fact that observation schemes influence what we observe but also of how: How much noise are they likely to introduce? How might we subtract it out?

  The result of a statistical method need only (and should only) be partly determined by the specifications of a given method (e.g., the cut-off for statistical significance); it is also determined by the underlying scientific phenomenon, as modeled. What enables objective learning is the possibility of taking into account how test specifications color results as we intervene in phenomena of interest. Don’t buy the supposition that the word “arbitrary” always belongs in front of “convention.” That my weight shows up as k pounds is a convention in the USA. Still, given the convention, the readout of k pounds is a matter of how much I weigh. I cannot simply dismiss the additional weight as due to an arbitrary convention, even if I wanted to.

  How Well Have You Probed H versus How Strongly Do (or Should) You Believe It?

  When Albert Einstein was asked “What if Eddington had not found evidence of the deflection effect?”, Einstein famously replied, “Then I would feel sorry for the dear Lord; the theory is correct.” Some might represent this using subjective Bayesian resources: Einstein had a strong prior conviction in GTR – a negative result might have moved his belief down a bit, but it would still be plenty high. Such a reconstruction may be found useful. If we try to cash it out as a formal probability, it isn’t so easy. Did he assign a high prior to the deflection effect being 1.75″, or also to the underlying theoretical picture of curved spacetime (which is really the basis of his belief)? A formal probability assignment works better for individual events than for assessing full-blown theories, but let us assume it could be done. What matters is that Einstein would also have known the deflection hypothesis had not been well probed – that is, it had not yet passed a severe test in 1919. An objective account of statistics needs to distinguish how probable (believable, plausible, supported) a claim is from how well it has been probed. This remains true whether the focus is on a given set of data, several sets, or, given everything I know, what I called “big picture” inference.

  Having distinguished our aim – appraising how stringently and responsibly probed a claim H is by the results of a given inquiry – from that of determining H’s plausibility or belief-worthiness, it’s easy to allow that different methodologies and criteria are called for in pursuing these two goals. Recall that the root probare means to demonstrate or show.

  Some argue that “discretionary choices” in tests, which Neyman himself tended to call “subjective,” lead us to subjective probabilities. A weak version goes: since you can’t avoid discretionary choices in getting the data and model, how can you complain about subjective degrees of belief in the resulting inference? This is weaker than arguing you must use subjective probabilities; it argues merely that doing so is no worse than exercising discretion. It still misses the point.

  First, as we saw in exposing the “dirty hands” argument, even if discretionary judgments can introduce subjectivity, they need not. Second, not all discretionary judgments are in the same boat when it comes to being open to severe testing of their own. E. Pearson imagines he

  might quote at intervals widely different Bayesian probabilities for the same set of states, simply because I should be attempting what would be for me impossible and resorting to guesswork. It is difficult to see how the matter could be put to experimental test.

  (Pearson 1962, pp. 278–9)

  A stronger version of the argument proceeds down a slippery slope from the premise of discretion in data generation and modeling to the conclusion: statistical inference is a matter of subjective belief. How does that work? One variant involves a subtle slide from “our models are merely objects of belief” to “statistical inference is a matter of degrees of belief.” From there it’s a short step to “statistical inference is a matter of subjective probability.” It is one thing to allow talk of our models as objects of belief and quite another to maintain that our task is to model beliefs.

  This is one of those philosophical puzzles of language that might set some people’s eyes rolling. If I believe in the deflection effect, then that effect is the object of my belief, but only in the sense that my belief is about said effect. Yet if I’m inquiring into the deflection effect, I’m not inquiring into beliefs about the effect. The philosopher of science Clark Glymour (2010, p. 335) calls this a shift from phenomena (content) to epiphenomena (degrees of belief). Popper argues that the key confusion all along was sliding from the degree of the rationality (or warrantedness) of a belief to the degree of rational belief (1959, p. 407).

  Or take subjectivist Frank Lad. To him,

  … so-called ‘statistical models’ are not real entities that merit being estimated. To the extent that models mean anything, they are models of someone’s (some group’s) considered uncertain opinion about observable quantities.

  (Lad 2006, p. 443)

  Notice the slide from uncertainty or partial knowledge of quantities in models to models being models of opinions. I’m not saying Lad is making a linguistic error. He appears instead to embrace a positivist philosophy of someone like Bruno de Finetti. De Finetti denies we can put probabilities on general claims because we couldn’t settle bets on them. If it’s also maintained that scientific inference takes the form of a subjective degree of belief, then we cannot infer general hypotheses – such as statistical hypotheses. Are we to exclude them from science as so much meaningless metaphysics?

  When current-day probabilists echo such stances, it’s a good bet they would react with horror at the underlying logical positivist philosophy. So how do you cut to the chase without sinking into a philosophical swamp? You might ask: Are you saying statistical models are just models of beliefs and opinions? They are bound to say no. So press on and ask: Are you saying they are mere approximations, and we hold fallible beliefs and opinions about them? They’re likely to agree. But the error statistician holds this as well!

  What’s Being Measured versus My Ability to Test It.

  You will sometimes hear it claimed that anyone who calls their probability assignments to hypotheses subjective must also call the use of any model subjective, because it too is based on one’s choice of specifications. It’s important not to confuse two notions of subjective. The first concerns what’s being measured: for the Bayesian – at least the subjective Bayesian – probability represents a subject’s strength of belief. The second concerns whether the measurement is checkable or testable. Nor does latitude for disagreement entail untestability. An intriguing analysis of objectivity and subjectivity in statistics is Gelman and Hennig (2017).

  4.2 Embrace Your Subjectivity

  The classical position of the subjective Bayesian aims at inner coherence or consistency rather than truth or correctness. Take Dennis Lindley:

  I am often asked if the method gives the right answer: or, more particularly, how do you know if you have got the right prior. My reply is that I don’t know what is meant by ‘right’ in this context. The Bayesian theory is about coherence, not about right or wrong.

  (Lindley 1976, p. 359)

  There’s no reason to suppose there is a correct degree of belief to hold. For Lindley, Pr(H|x) “is your belief about [H] when you know [x]” (Lindley 2000, p. 302, substituting Pr for P, H for A, and x for B). My opinions are my opinions and your opinions are yours. How do I criticize your prior degrees of belief? As Savage said, “[T]he Bayesian outlook reinstates opinion in statistics – in the guise of the personal probabilities of events …” (Savage 1961, p. 577). Or again, “The concept of personal probability … seems to those of us who have worked with it an excellent model for the concept of opinion” (ibid., pp. 581–2).2 That might be so, but what if we are not trying to model opinions, but instead insist on meeting requirements for objective scrutiny? For these goals, inner coherence or consistency among your beliefs is not enough. One can be consistently wrong, as everyone knows (or should know).

  If you’re facing a radical skeptic of all knowledge – a radical relativist, postmodernist, social constructivist, or anarchist – there may be limited room to maneuver. The position may be the result of a desire to shock or be camp (as with Feyerabend or Foucault), or to give voice to political interests. The position may be mixed with, or at least dressed in the clothes of, philosophy: we are locked in a world of appearances, seeing mere shadows of an “observer-independent reality.” Our bold activist learner, who imaginatively creates abstract models that give him pushback, terrifies them. Calling it unholy metaphysics may actually reflect their inability to do the math.

  Progress of Philosophy

  To the error statistician, radical skepticism is a distraction from the pragmatic goal of understanding how we do manage to learn, and of finding out how we can do it better. Philosophy does make progress. Logical positivism was developed and embraced when Einstein’s theory rocked the Newtonian worldview. Down with metaphysics! All must be verifiable by observation. But there are no pure observations, no theory-neutral observational languages, no purely formal rules of confirmation holding between any statements of observation and hypotheses. Popper sees probabilism as a holdover from a watered-down verificationism, “… under the influence of the mistaken view that science, unable to attain certainty, must aim at a kind of ‘Ersatz’ – at the highest attainable probability” (Popper 1959, p. 398, Appendix IX). Even in the face of the “statistical crisis in science,” by and large, scientists aren’t running around terrified that our cherished theories of physics will prove wrong: they expect even the best ones are incomplete, and several rival metaphysics thrive simultaneously. In genetics, we have learned to cut, delete, and replace genes in human cells with the new CRISPR technique discovered by Jennifer Doudna and Emmanuelle Charpentier (2014). The picture of the knower limited by naked observations no longer has any purchase, if it ever did.

  Some view the Big Data revolution, with its focus on correlations rather than causes, as a kind of return to theory-free neopositivism. Theory freedom and black-box modeling might work well for predicting what color website button is most likely to get me to click, and AI has had great successes. But we’ve also been learning how theory-free prediction techniques come up short when it comes to scientific understanding.

  Loss and Cost Functions

  The fact that we have interests, and that costs and values may color our interpretation of data, does not mean they should be part of the scientific interpretation of data. Frequent critics of statistical significance tests, economists Stephen Ziliak and Deirdre McCloskey, declare, in relation to me, that “a notion of a severe test without a notion of a loss function is a diversion from the main job of science” (2008a, p. 147). It’s unclear if this is meant as a vote for an N-P type of balancing of error probabilities, or for a full-blown decision-theoretic account. If it is the latter, with subjective prior probabilities in hypotheses, we should ask: Whose losses? Whose priors? The drug company’s? The patient’s? Such losses and priors may lie hidden in impressive Big Data algorithms, as Cathy O’Neil (2016) argues.
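  To see why “whose losses?” matters, here is a toy sketch (all numbers invented for illustration): with the same assumed posterior probability that a drug works, a company-weighted loss table and a patient-weighted loss table recommend opposite decisions.

```python
# Toy decision-theoretic sketch: same posterior, two loss tables,
# opposite "optimal" acts. All numbers are hypothetical.
P_WORKS = 0.3  # assumed posterior probability the drug is effective

def expected_loss(act, losses):
    """Expected loss of act ('approve' or 'reject') under a loss table."""
    return (losses[(act, "works")] * P_WORKS
            + losses[(act, "fails")] * (1 - P_WORKS))

# Company: foregone profit from rejecting a working drug dominates.
company = {("approve", "works"): 0, ("approve", "fails"): 1,
           ("reject", "works"): 10, ("reject", "fails"): 0}
# Patient: harm from approving a non-working drug dominates.
patient = {("approve", "works"): 0, ("approve", "fails"): 10,
           ("reject", "works"): 1, ("reject", "fails"): 0}

for name, table in (("company", company), ("patient", patient)):
    best = min(("approve", "reject"), key=lambda a: expected_loss(a, table))
    print(f"{name}: optimal act = {best}")
# company: approve; patient: reject -- the data haven't changed.
```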

  Remember that a severity appraisal is always in relation to a question or problem, and that problem could be a decision, within a scientific inquiry or wholly personal. In the land of science, we’d worry that incorporating into an inference on genomic signatures, say, your expected windfall from patenting it would let it pass without a severe probe. So if that is what they mean, I disagree, and so should anyone interested in blocking flagrant biases. Science is already politicized enough. Besides, in order for my assessment of costs to be adequate, I’ve got to get the science approximately correct first – wishing and hoping don’t suffice (as Potti and Nevins discovered in Excursion 1).

  We can agree with Ziliak and McCloskey if all they mean is that in deciding whether a treatment, say hormone replacement therapy (HRT), is right for me, a report on how it improves skin elasticity that ignores, say, the increase in cardiac risk is likely irrelevant to my decision.

  Some might eschew all this as naïve: scientists cannot help but hold on to beliefs based on private commitments, costs, and other motivations. We may readily agree. Oliver Lodge, our clairvoyant, had a keen interest in retaining a Newtonian ether to justify conversations with his son, Raymond, on “the other side” (Section 3.1). Doubtless Lodge, while accepting the interpretation of the deflection experiments, could never really bring himself to disbelieve in the ether. It might not even have been healthy, psychologically, for him to renounce his belief. Yet the critical assessment of each of his purported ether explanations had nothing to do with this. Perhaps one could represent his personal assessment using a high prior in the ether, or a high cost to relinquishing belief in it. Yet everyone understands the difference between adjudicating disagreements on evidential grounds and producing a psychological conversion, or making it worthwhile financially, as when a politician’s position “evolves” when the constituency demands it.

  “Objective” (Default, Non-subjective) Bayesians

  The desire for a Bayesian omelet while only breaking “objective” eggs gives rise to default Bayesianism or, if that sounds too stilted, default/non-subjective Bayesianism.3 Jim Berger is one of its leaders:

  I feel that there are a host of practical and sociological reasons to use the label ‘objective’ for priors of model parameters that appropriately reflect a lack of subjective information … [None of the other names] carries the simplicity or will carry the same weight outside of statistics as ‘objective.’ … we should start systematically accepting the ‘objective Bayes’ name before it is co-opted by others.

  (Berger 2006, p. 387)

  The holy grail of truly “uninformative” priors has been abandoned – what is uninformative under one parameterization can be informative under another. (For example, “if θ is uniform, e^θ has an improper exponential distribution, which is far from flat”; Cox 2006a, p. 73.) Moreover, there are competing systems for ensuring the data are weighed more heavily than the priors. As we will see, so-called “equipoise” assignments may be highly biased. For the error statistician, as long as an account is restricted to priors and likelihoods, it still leaves out the essential ingredient for objectivity: the sampling distribution, the basis for error probabilities and severity assessments. Classical Bayesians, both subjective and default, reject this appeal to “frequentist objectivity” as solely rooted in claims about long-run performance. Failure to craft a justification in terms of probativeness means there’s uncharted territory waiting to be developed. Fortunately I happen to have my own maps – rudimentary perhaps, but enough for our excavations.
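  A quick numerical sketch of the Cox point (a proper-prior analogue with an invented range; Cox’s own example uses an improper uniform): a prior flat in θ is far from flat in φ = e^θ.

```python
# Sketch: a "flat" prior is not parameterization-invariant.
# theta ~ Uniform(0, 5) is flat in theta, but phi = exp(theta) piles
# its mass up near small values -- far from flat in phi.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 5, size=100_000)  # "uninformative" in theta
phi = np.exp(theta)                      # induced prior on phi

midpoint = (1 + np.exp(5)) / 2           # middle of phi's range
print(f"P(phi < midpoint) = {(phi < midpoint).mean():.3f}")  # ~0.86, not 0.5
```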

 
