Statistical Inference as Severe Testing


by Deborah G Mayo


  You see the pattern. The example is designed so that some outcomes yield much more information than others. As with Cox’s “two measuring instruments,” the data have two parts: first, an indication of whether the two outcomes are the same or different; second, the observed result. Let A be an indicator of the first part: A = 1 if both are the same (unlucky); A = 2 if the sample values differ by 2 (lucky!). The full data may be represented as (A, x). The distribution of A is fixed independently of the parameter of interest: Pr(A = 1) = Pr(A = 2) = 0.5. It is an example of an ancillary statistic. However, learning whether A = 1 or A = 2 is very informative as to the precision achieved by the inference. Thus the relevant properties associated with the particular inference would be conditional on the value of A.

  The tip-off that we’re dealing with a problem case is this: the sufficient statistic S has two parts (A, X); that is, it has dimension 2. But there’s only one parameter ψ. Without getting into the underlying theory, this alone indicates that S has a property known as being incomplete, opening the door to different P-values or confidence levels when calculated conditionally on the value of A. In particular, the marginal distribution of a P-value averaged over the two possibilities (0.5(0) + 0.5(0.5) = 0.25) would be misleading for any particular set of data. Instead we condition on the value of A obtained. David Cox calls this process “technical conditioning to induce relevance of the frequentist probability to the inference at hand” (Cox and Mayo 2010, pp. 297–8).
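  A minimal simulation sketch of the lucky/unlucky setup just described may help fix the numbers (the true value ψ = 10, the trial count, and the convention of guessing x1 − 1 in the unlucky case are my own illustrative choices): the marginal coverage averages out to about 0.75, but the relevant, conditional coverage is 1 when A = 2 and only 0.5 when A = 1.

```python
import random

# Each observation equals psi - 1 or psi + 1 with probability 0.5.
random.seed(1)
psi = 10.0            # true parameter (arbitrary for illustration)
trials = 100_000
correct = {1: 0, 2: 0}   # correct inferences, keyed by the ancillary A
counts = {1: 0, 2: 0}

for _ in range(trials):
    x1 = psi + random.choice([-1, 1])
    x2 = psi + random.choice([-1, 1])
    A = 2 if x1 != x2 else 1                        # A = 2: lucky; A = 1: unlucky
    estimate = (x1 + x2) / 2 if A == 2 else x1 - 1  # one conventional guess when unlucky
    counts[A] += 1
    correct[A] += (estimate == psi)

for a in (1, 2):
    print(f"A = {a}: coverage ~ {correct[a] / counts[a]:.2f}")    # ~0.50 and ~1.00
print("marginal coverage ~", (correct[1] + correct[2]) / trials)  # ~0.75
```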

  Such examples have other noteworthy features: the ancillary part A gives a sneaky way of assigning a probability to “being correct” in the subset of cases given by the value of A. It’s an example of what Fisher called “recognizable subsets.” By careful artifice, the event that “a random variable A takes a given value a” is equivalent to “the data were generated by a hypothesized parameter value.” So the probability of A = a gives the probability a hypothesis is true. Aris Spanos considers these examples “rigged” for this reason, and he discusses these and several other famous pathological examples (Spanos 2012).

  Even putting pathologies aside, is there any reason the frequentist wouldn’t do the sensible thing and report on how well probed the inference is once A is known? No. Certainly a severe testing theorist would.

  Live Exhibit (ix). What Should We Say When Severity Is Not Calculable?

  In developing a system like severity, at times a conventional decision must be made. However, the reader can choose a different path and still work within this system.

  What if the test or interval estimation procedure does not pass the audit? Consider for the moment that there has been optional stopping, or cherry picking, or multiple testing. Where these selection effects are well understood, we may adjust the error probabilities so that they do pass the audit. But what if the moves are so tortuous that we can’t reliably make the adjustment? Or perhaps we don’t feel secure enough in the assumptions? Should the severity for μ > μ0 be low or undefined?

  You are free to choose either. The severe tester says SEV(μ > μ0) is low. As she sees it, having evidence requires a minimum threshold for severity, even without setting a precise number. If it’s close to 0.5, it’s quite awful. But if it cannot be computed, it’s also awful, since the onus on the researcher is to satisfy the minimal requirement for evidence. I’ll follow her: if we cannot compute the severity even approximately (which is all we care about), I’ll say it’s low, along with an explanation as to why: it’s low because we don’t have a clue how to compute it!

  A probabilist, working with a single “probability pie” as it were, would take a low probability for H as giving a high probability to ~H. By contrast, we wish to clearly distinguish between having poor evidence for H and having good evidence for ~H. Our way of dealing with bad evidence, no test (BENT) allows us to do that. Both SEV(H) and SEV(~H) can be low enough to be considered lousy, even when both are computable.

  Souvenir N: Rule of Thumb for SEV

  Can we assume that if SEV(μ > μ0) is a high value, 1 − α, then SEV(μ ≤ μ0) is α?

  Because the claims μ > μ0 and μ ≤ μ0 form a partition of the parameter space, and because we are assuming our test has passed (or would pass) an audit – else these computations go out the window – the answer is yes.

  If SEV(μ > μ0) is high, then SEV(μ ≤ μ0) is low.

  The converse need not hold – given the convention we just saw in Exhibit (ix). At the very least, “low” would not exceed 0.5.
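  To see why the two assessments are complements, here is a minimal sketch for the familiar test T+ (Normal mean, σ known, observed mean x̄0), writing σ_x̄ for σ/√n; the formulas assume the standard severity computations for T+:

```latex
\mathrm{SEV}(\mu > \mu_0) = \Pr(\bar{X} \le \bar{x}_0;\ \mu = \mu_0)
  = \Phi\!\left(\frac{\bar{x}_0 - \mu_0}{\sigma_{\bar{X}}}\right),
\qquad
\mathrm{SEV}(\mu \le \mu_0) = \Pr(\bar{X} > \bar{x}_0;\ \mu = \mu_0)
  = 1 - \Phi\!\left(\frac{\bar{x}_0 - \mu_0}{\sigma_{\bar{X}}}\right).
```

  The two are complementary tail areas under μ = μ0, so if the first equals 1 − α, the second equals α.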

  A rule of thumb (for test T+ or its dual CI):

  If we are pondering a claim that an observed difference from the null seems large enough to indicate μ > μ′, we want to be sure the test was highly capable of producing less impressive results, were μ = μ′.

  If, by contrast, the test was highly capable of producing more impressive results than we observed, even in a world where μ = μ′, then we block an inference to μ > μ′ (following weak severity).
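  Here is a rough numerical sketch of this rule for test T+ (Normal mean, σ known); the sample size, σ, and observed mean below are invented purely for illustration.

```python
from scipy.stats import norm

def sev_mu_greater(xbar, mu_prime, sigma=1.0, n=25):
    """SEV(mu > mu_prime): the probability the test would have produced a
    less impressive result than the one observed, were mu = mu_prime."""
    se = sigma / n ** 0.5
    return norm.cdf((xbar - mu_prime) / se)

xbar = 0.4  # hypothetical observed mean, with mu0 = 0
print(sev_mu_greater(xbar, 0.0))  # ~0.98: smaller results very probable were mu = 0
print(sev_mu_greater(xbar, 0.2))  # ~0.84
print(sev_mu_greater(xbar, 0.5))  # ~0.31: more impressive results quite probable even
                                  # were mu = 0.5, so block the inference to mu > 0.5
```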

  This rule will be at odds with some common interpretations of tests. Bear with me. I maintain those interpretations are viewing tests through “probabilist-colored” glasses, while the correct error-statistical view is this one.

  3.8 The Probability Our Results Are Statistical Fluctuations: Higgs’ Discovery

  One of the biggest science events of 2012–13 was the announcement on July 4, 2012 of evidence for the discovery of a Higgs-like particle based on a “5-sigma observed effect.” With the March 2013 data analysis, the 5-sigma difference grew to 7 sigmas, and some of the apparent anomalies evaporated. In October 2013, the Nobel Prize in Physics was awarded jointly to François Englert and Peter W. Higgs for the “theoretical discovery of a mechanism” behind the particle experimentally discovered by the collaboration of thousands of scientists (on the ATLAS and CMS teams) at CERN’s Large Hadron Collider in Switzerland. Yet before the dust had settled, the very nature and rationale of the 5-sigma discovery criterion began to be challenged among scientists and in the popular press. Because the 5-sigma standard refers to a benchmark from frequentist significance testing, the discovery was immediately imbued with controversies that, at bottom, concern statistical philosophy.

  Why a 5-sigma standard? Do significance tests in high-energy particle (HEP) physics escape the misuses of P-values found in social and other sciences? Of course the main concern wasn’t all about philosophy: they were concerned that their people were being left out of an exciting, lucrative, many-years project. But unpacking these issues is philosophical, and that is the purpose of this last stop of Excursion 3. I’m an outsider to HEP physics, but that, apart from being fascinated by it, is precisely why I have chosen to discuss it. Anyone who has come on our journey should be able to decipher the more public controversies about using P-values.

  I’m also an outsider to the International Society of Bayesian Analysis (ISBA), but a letter was leaked to me a few days after the July 4, 2012 announcement, prompted by some grumblings raised by a leading subjective Bayesian, Dennis Lindley. The letter itself was sent around to the ISBA list by statistician Tony O’Hagan. “Dear Bayesians,” the letter began. “We’ve heard a lot about the Higgs boson.”

  Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson … has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme.

  (O’Hagan 2012)

  Neither of these seemed to be the case in his opinion: “[Is] the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?” (ibid.).

  Bad science? Isn’t that a little hasty? HEP physicists are sophisticated with their statistical methodology: they’d seen too many bumps disappear. They want to ensure, before announcing that a new particle has been discovered, that, at the very least, the possibility of the results being spurious is given a run for its money. Significance tests, followed by confidence intervals, are methods of choice here for good reason. You already know that I favor moving away from traditional interpretations of statistical tests and confidence limits. But some of the criticisms, and the corresponding “reforms,” reflect misunderstandings, and the knottiest of them all concerns the very meaning of the phrase (in the title of Section 3.8): “the probability our results are merely statistical fluctuations.” Failing to clarify it may well impinge on the nature of future big science inquiry based on statistical models. The problem is a bit delicate, and my solution is likely to be provocative. You may reject my construal, but you’ll see what it’s like to switch from wearing probabilist, to severe testing, glasses.

  The Higgs Results

  Here’s a quick sketch of the Higgs statistics. (I follow the exposition by physicist Robert Cousins (2017); see also Staley (2017).) There is a general model of the detector within which researchers define a “global signal strength” parameter μ “such that H0: μ = 0 corresponds to the background-only hypothesis and μ = 1 corresponds to the [Standard Model] SM Higgs boson signal in addition to the background” (ATLAS collaboration 2012c). The statistical test may be framed as a one-sided test:

  H0: μ = 0 vs. H1: μ > 0.

  The test statistic d(X) records how many excess events of a given type are “observed” (from trillions of collisions) in comparison to what would be expected from background alone, given in standard deviation or sigma units. Such excess events give a “signal-like” result in the form of bumps off a smooth curve representing the “background” alone.

  The improbability of the different d(X) values – its sampling distribution – is based on simulating what it would be like under H0, fortified with much cross-checking of results. These are converted to corresponding probabilities under a standard Normal distribution. The probability of observing results as extreme as or more extreme than 5 sigmas, under H0, is approximately 1 in 3,500,000! Alternatively, it is said that the probability that the results were just a statistical fluke (or fluctuation) is 1 in 3,500,000.
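  As a quick check of the quoted figure under the standard Normal approximation (only a sketch; the actual analyses involve far more than this one-line conversion):

```python
from scipy.stats import norm

p_5sigma = norm.sf(5)   # one-sided tail area beyond 5 sigma: Pr(Z >= 5)
print(p_5sigma)         # ~2.9e-7
print(1 / p_5sigma)     # ~3.5 million, i.e. "about 1 in 3,500,000"
```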

  Why such an extreme evidence requirement, Lindley asked. Given how often bumps disappear, the rule for interpretation, which physicists never intended to be rigid, is something like: if d(X) ≥ 5 sigma, infer discovery; if d(X) ≥ 2 sigma, get more data.

  Now “deciding to announce” the results to the world, or “get more data,” are actions all right, but each corresponds to an evidential standpoint or inference: infer there’s evidence of a genuine particle, and infer that spurious bumps had not been ruled out with high severity, respectively.

  What “the Results” Really Are

  You know from the Translation Guide (Souvenir C) that Pr(d(X) ≥ 5; H0) is to be read Pr(the test procedure would yield d(X) ≥ 5; H0). Where do we record Fisher’s warning that we can only use P-values to legitimately indicate a genuine effect by demonstrating an experimental phenomenon? In good sciences and strong uses of statistics, “the results” may include demonstrating the “know-how” to generate results that rarely fail to be significant. Also important is showing the test passes an audit (it isn’t guilty of selection biases, or violations of statistical model assumptions). “The results of test T” incorporates the entire display of know-how and soundness. That’s what the severe tester means by Pr(test T would produce d(X) ≥ d(x0); H0). So we get:

  Fisher’s Testing Principle: To the extent that you know how to bring about results that rarely fail to be statistically significant, there’s evidence of a genuine experimental effect.

  There are essentially two stages of analysis. The first stage is to test for a genuine Higgs-like particle; the second, to determine its properties (production mechanism, decay mechanisms, angular distributions, etc.). Even though the SM Higgs sets the signal parameter to 1, the test is going to be used to learn about the value of any discrepancy from 0. Once the null is rejected at the first stage, the second stage essentially shifts to learning the particle’s properties, and using them to seek discrepancies from a new null hypothesis: the SM Higgs.

  The P-Value Police

  The July 2012 announcement gave rise to a flood of buoyant, if simplified, reports heralding the good news. This gave ample grist for the mills of P-value critics. Statistician Larry Wasserman playfully calls them the “P-Value Police” (2012a); one example is Sir David Spiegelhalter (2012), a professor of the Public Understanding of Risk at the University of Cambridge. Their job was to examine if reports by journalists and scientists could be seen to be misinterpreting the sigma levels as posterior probability assignments to the various models and claims. Thumbs up or thumbs down! Thumbs up went to the ATLAS group report:

  A statistical combination of these channels and others puts the significance of the signal at 5 sigma, meaning that only one experiment in 3 million would see an apparent signal this strong in a universe without a Higgs.

  (2012a, emphasis added)

  Now HEP physicists have a term for an apparent signal that is actually produced due to chance variability alone: a statistical fluctuation or fluke. Only one experiment in 3 million would produce so strong a background fluctuation. ATLAS (2012b) calls it the “background fluctuation probability.” By contrast, Spiegelhalter gave a thumbs down to:

  There is less than a one in 3 million chance that their results are a statistical fluctuation.

  If they had written “would be” instead of “is” it would get thumbs up. Spiegelhalter’s ratings are generally echoed by other Bayesian statisticians. According to them, the thumbs down reports are guilty of misinterpreting the P-value as a posterior probability on H0.

  A careful look shows this is not so. H0 does not say the observed results are due to background alone; H0 does not say the result is a fluke. It is just H0: μ = 0, although if H0 were true, it follows that various results would occur with specified probabilities. In particular, it entails (along with the rest of the background) that large bumps are improbable.

  It may in fact be seen as an ordinary error probability:

  (1) Pr(test T would produce d(X) ≥ 5; H0) ≤ 0.0000003.

  The portion within the parentheses is how HEP physicists understand “a 5-sigma fluctuation.” Note that (1) is not a conditional probability, which involves a prior probability assignment to the null. It is not

  Pr(test T would produce d(X) ≥ 5 and H0)/Pr(H0).

  Only random variables or their values are conditioned upon. This may seem to be nit-picking, and one needn’t take a hard line on the use of “conditional.” I mention it because it may explain part of the confusion here. The relationship between the null hypothesis and the test results is intimate: the assignment of probabilities to test outcomes or values of d(X) “under the null” may be seen as a tautologous statement.

  Since it’s not just a single result, but also a dynamic test display, we might even want to emphasize a fortified version:

  (1)* Pr(test T would display d(X) ≥ 5; H0) ≤ 0.0000003.

  Critics may still object that (1), even fortified as (1)*, only entitles saying:

  There is less than a one in 3 million chance of a fluctuation (at least as strong as in their results).

  It does not entitle one to say:

  There is less than a one in 3 million chance that their results are a statistical fluctuation.

  Let’s compare three “ups” and three “downs” to get a sense of the distinction that leads to the brouhaha:

  Ups

  U-1. The probability of the background alone fluctuating up by this amount or more is about one in 3 million. (CMS 2012)

  U-2. Only one experiment in 3 million would see an apparent signal this strong in a universe described by H0.

  U-3. The probability that their signal would result by a chance fluctuation was less than one chance in 3 million.

  Downs

  D-1. The probability their results were due to the background fluctuating up by this amount or more is about one in 3 million.

  D-2. One in 3 million is the probability the signal is a false positive – a fluke produced by random statistical fluctuation.

  D-3. The probability that their signal was the result of a statistical fluctuation was less than one chance in 3 million.

  The difference is that the thumbs-down versions say “this” signal or “these” data are due to chance, or are a fluctuation. Critics might say the objection to “this” is that the P-value refers to a difference as great or greater – a tail area. But if the probability of {d(X) ≥ d(x)} is low under H0, then Pr(d(X) = d(x); H0) is even lower. We’ve dealt with this back with Jeffreys’ quip (Section 3.4). No statistical account recommends going from improbability of a point result on a continuum under H to rejecting H. The Bayesian looks to the prior probability in H and its alternatives. The error statistician looks to the general procedure. The notation {d(X) ≥ d(x)} is used to signal the latter.
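  A small numeric illustration of that last point (the 0.01-sigma band half-width is an arbitrary choice of mine): for a continuous statistic, the probability of landing in a narrow band right at the observed value is smaller still than the tail area.

```python
from scipy.stats import norm

tail = norm.sf(5)                       # Pr(d(X) >= 5; H0), about 2.9e-7
band = norm.cdf(5.01) - norm.cdf(4.99)  # Pr(4.99 <= d(X) <= 5.01; H0)
print(tail, band)                       # the band probability is about 3e-8, lower still
```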

  But if we’re talking about the procedure, the critic rightly points out, we are not assigning probability to these particular data or signal. True, but that’s the way frequentists always give probabilities to general events, whether they have occurred, or we are contemplating a hypothetical excess of 5 sigma that might occur. It’s always treated as a generic type of event. We are never considering the probability “the background fluctuates up this much on Wednesday July 4, 2012,” except as that is construed as a type of collision result at a type of detector, and so on. It’s illuminating to note, at this point:

 
