Statistical alchemy and the “test for excess significance”

[This post is based largely on my 2013 article for Journal of Mathematical Psychology; see the other articles in that special issue as well for more critiques.]

When I tell people that my primary area of research is statistical methods, one of the reactions I often encounter from people untrained in statistics is that “you can prove anything with statistics.” Of course, this rankles, first because it isn’t true (unless you use a very strange definition of prove) and second because I’ve spent years learning the limitations of statistics, and there are many limitations. These limitations exist, however, in the context of enormous successes. In the sciences, the field of statistics rightly has a place of honor.

This success is evidenced by the great number of scientific arguments that are supported by statistical methods. Not all statistical arguments are created equal, of course. But the respect with which statistics is viewed has the unfortunate downside that a statistical argument can apparently turn a leaden hunch into a golden “truth”. This post is about such statistical alchemy.

The gold: Justified substantive claims

One of the goals we all have as scientists is to make claims backed by solid evidence. This is harder than it seems. Ideally we would prefer that evidence be ironclad and assumptions unnecessary. In real-life cases, however, the strength of evidence does not provide certainty, and assumptions are needed. The key to good argument, then, is that all assumptions are made explicit, the chain of reasoning is clear and logical, and the resulting evidence is strong enough to garner agreement.

Such cases we might call the “gold standard” for scientific arguments. We expect this sort of argument when someone makes a strong claim. This is the stuff that the scientific literature should be made of, for the most part. Among other things, the gold standard requires careful experimental design and execution, deliberate statistical analysis and avoidance of post hoc reasoning, and a willingness to explore the effects of unstated assumptions in one’s reasoning.

Hunches are a necessary part of science. Science is driven by a creative force that cannot (at this point) be quantified, and a sneaking suspicion that something is true is often the grounds on which we design experiments. Hunches are some of the most useful things in science, just as lead is an exceptionally useful metal. Like lead, hunches are terribly common. We all have many hunches, and often we don’t know where they come from.

What makes something a hunch is that it doesn’t have solid grounds to back it up. Hunches often turn to dust upon closer examination: they may contradict other knowledge, they may be based on untenable assumptions, or the evidence for them may turn out to be much weaker than it first appeared. If a hunch survives a solid test, it is no longer a hunch; but so long as we do not test our hunches (or cannot test them), they remain hunches.

The alchemy of statistics

One of the most dangerous, but unfortunately common, ways in which statistics is used is to magically turn hunches into “truth”. The mother of all statistical alchemy is the Fisherian \(p\) value, by which hunches based on “low” \(p\) values are turned into statements about the implausibility of the null hypothesis. Although it seems reasonable, when the hunch on which \(p\) values rest is examined by either frequentists or Bayesians, it is found wanting.

However, my main focus here is not specifically \(p\) values. I’d like to focus on one particularly recent special case of statistical alchemy among methodologists called the “test for excess significance”. Here’s the hunch: in any series of typically-powered experiments, we expect some to be nonsignificant due to sampling error, even if a true effect exists. If we see a series of five experiments, and they are all significant, one thinks that either they are very high powered, the authors got lucky, or there are some nonsignificant studies missing. For many sets of studies, the first seems implausible because the effect sizes are small; the last is important, because if it is true then the picture we get of the results is misleading.

Just to be clear, this hunch makes sense to me, and I think to most people. However, without a formal argument it remains a hunch. Ioannidis and Trikalinos (2007) suggested formalising it:

We test in a body of $n$ published studies whether the observed number of studies $O$ with ‘positive’ results at a specified $\alpha$ level on a specific research question is different from the expected number of studies with ‘positive’ results $E$ in the absence of any bias. (Ioannidis and Trikalinos, 2007, p246)

Of “biases”, Ioannidis and Trikalinos say that “biases…result in a relative excess of published statistically significant results as compared with what their true proportion should be in a body of evidence.” If there are too many significant studies, there must be too few nonsignificant ones, hence the idea of “relative” excess.

Suppose there is a true effect size that is being pursued by study $i$ $(i = 1,\ldots,n)$ and its size is $\theta_i$…[T]he expected probability that a specific single study $i$ will find a ‘positive’ result equals $1 - \beta_i$, its power at the specified $\alpha$ level. (Ioannidis and Trikalinos, 2007, p246)

So far so good; this is all true. Ioannidis and Trikalinos continue:

Assuming no bias, $E$ equals the sum of the expected probabilities across all studies on the same question: \[ E = \sum_{i=1}^n (1 - \beta_i). \] (Ioannidis and Trikalinos, 2007, p246)

Here we have a bit of a mystery. That \(E\) equals the sum of the expected probabilities is merely asserted. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false. Suppose I run experiments until I obtain \(k\) nonsignificant studies (\(k>0\)). The expected number of significant studies in a set of \(n\) is exactly \(n-k\). Depending on the stopping rule for the studies, which is unknown (and unknowable or even meaningless, in most cases), \(E\) can be chosen to be 0 (stop after \(n\) nonsignificant studies), \(n\) (stop after \(n\) significant studies), or any number in between!
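The stopping-rule example can be sketched in a few lines. This is only an illustration in Python (not the R used elsewhere on this blog); the function name, the per-study power of 0.5 and the choice \(k=1\) are assumptions made for the example, not anything taken from Ioannidis and Trikalinos.

```python
import random

def run_until_k_nonsignificant(k, power, rng):
    """Run studies until k nonsignificant results have accumulated.

    Returns (n, significant): the total number of studies run and how
    many of them were significant.
    """
    n = significant = nonsignificant = 0
    while nonsignificant < k:
        n += 1
        if rng.random() < power:  # this study comes out significant
            significant += 1
        else:
            nonsignificant += 1
    return n, significant

rng = random.Random(1)
k, power = 1, 0.5  # hypothetical values, for illustration only
sets = [run_until_k_nonsignificant(k, power, rng) for _ in range(10_000)]

# Whatever n turns out to be, the number of significant studies is n - k,
# not the n * power that summing the powers would suggest.
always_n_minus_k = all(sig == n - k for n, sig in sets)
print(always_n_minus_k)
```

Under this rule a set of five studies always contains four significant results, whereas the sum of the powers would be 2.5.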

Ioannidis and Trikalinos go on to say that “[t]he expected number \(E\) is compared against the observed number \(O\) of ‘positive’ studies” and if there is an “excess” then bias is claimed, by standard significance test logic. Here, things go off the rails again. First, as we have seen, \(E\) could be anything. Second, a significance test is performed by computing the probability of observing an outcome as extreme or more extreme than the one observed, given no “bias”. What is more extreme? Suppose we observe 4 significant results in 5 studies. It seems clear that 5/5 is more extreme. Is 6/6 possible? No mention is made of the assumed sampling process, so how are we to know what the more extreme samples would be? And if a sampling assumption were made explicit, how could we know whether that was a reasonable assumption for the studies at hand? The \(p\) value is simply incalculable from the information available.
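To see how much hangs on the unstated sampling process, here is a small Python sketch that computes the \(p\) value for “4 significant results in 5 studies” under two different assumed sampling rules. Both rules, and the per-study power of 1/2, are assumptions chosen purely for illustration; neither comes from Ioannidis and Trikalinos.

```python
from fractions import Fraction
from math import comb

power = Fraction(1, 2)  # hypothetical per-study power, assumed known

# Assumption A: the number of studies was fixed at n = 5 in advance.
# "As or more extreme" than 4 significant means 4 or 5 out of 5.
n = 5
p_fixed_n = sum(comb(n, o) * power**o * (1 - power)**(n - o) for o in (4, 5))

# Assumption B: studies were run until the first nonsignificant result.
# "As or more extreme" now means at least 4 significant results in a row.
p_stop_at_first_failure = power**4

print(p_fixed_n, p_stop_at_first_failure)  # 3/16 1/16
```

The same observed set of studies gives a \(p\) value three times larger under one assumption than under the other, and nothing in the data tells us which assumption, if either, applies.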

Suppose I find a “significant” result; what do I infer? Ioannidis and Trikalinos claim that they “have introduced an exploratory test for examining whether there is an excess of significant findings in a body of evidence” (p 251). This is a very strange assertion. When we do a statistical test, we are not asking a question about the data itself; rather, we are inferring something about a population. The “body of evidence” is the sample; we infer from the sample to the population. But what is the population? Or, put in frequentist terms, what is the sampling process from which the studies in question arise? Given that this question is central to the statistical inference, one would think it would be addressed, but it is not. Dealing with this question would require a clear definition of a “set” of studies, and how this set is sampled.

Are these studies one sample of hypothetical sets of studies from all scientific fields? Or perhaps they are a sample of studies within a specific field; say, psychology? Or from a subfield, like social psychology? Or maybe from a specific lab? There’s no way to uniquely answer this question, and so it isn’t clear what can be inferred. Am I inferring bias in all of science, in the field, the subfield, or the lab? And if any of these are true, why do they discuss bias in the sample instead? They have confused the properties of the population and sample in a basic way.

But even though these critical details are missing — details that are necessary to the argument — the authors go on to apply this to several meta-analyses, inferring bias in several. Other authors have applied the method to claim “evidence” of bias in other sets of studies.

…and the alchemy is complete

We see that Ioannidis and Trikalinos have unstated assumptions of enormous import, they have failed to clearly define any sort of sampling model, and they have not made clear the link between the act of inference (“we found a ‘significant’ result”) and what is to be inferred (“Evidence for bias exists in these studies.”). And this is all before even addressing the problematic nature of \(p\) values themselves, which cannot be used as a measure of evidence. The test for “excess significance” is neither a valid frequentist procedure (due to the lack of a clearly defined sampling process) nor a valid Bayesian procedure.

But through the alchemy of statistics, Ioannidis and Trikalinos’ test for “excess significance” has given us the appearance of a justified conclusion. Bodies of studies are called into doubt, and the users of the approach continue to get papers published using the approach despite its utter lack of justification. We would not accept such shoddy modeling and reasoning for studying other aspects of human behavior. As Val Johnson put it in his comment on the procedure, “[We] simply cannot quite determine the level of absurdity that [we are] expected to ignore.” Why is this acceptable for deploying against groups of studies in the scientific literature?

The reason is simple: we all have the hunch. It seems right. Ioannidis and Trikalinos have given us a way to transmute our hunch that something is amiss into the gold of a publishable, evidence-backed conclusion. But it is an illusion; the argument simply falls apart under scrutiny.

This is bad science, and it should not be tolerated. Methodologists have the same responsibility as everyone else to justify their conclusions. The peer review system has failed to prevent the leaden hunch passing for gold, which is acutely ironic given how methodologists use the test to accuse others of bad science.

The frequentist case against the significance test, part 2

The significance test is perhaps the most used statistical procedure in the world, though it has never been without its detractors. This is the second of two posts exploring Neyman’s frequentist arguments against the significance test; if you have not read Part 1, you should do so before continuing (“The frequentist case against the significance test, part 1”).

Neyman offered two major arguments against the significance test:

1. The significance test fails as an epistemic procedure. There is no relationship between the \(p\) value and rational belief. More broadly, the goal of statistical inference is tests with good error properties, not beliefs.
2. The significance test fails as a test. The lack of an alternative means that a significance test can yield arbitrary results.

The first, philosophical, argument I outlined in Part 1. Part 1 was based largely on Neyman’s 1957 paper “‘Inductive Behavior’ as a Basic Concept of Philosophy of Science”. Part 2 will be based on Chapter 1, part 3 of Neyman’s 1952 book, “Lectures and conferences on mathematical statistics and probability”.

First, it must be said that Neyman did not think that significance tests were useless or misleading all the time. He said “The [significance test procedure] has been applied since the invention of the first systematically applied test, the Pearson chi-square of 1900, and has worked, on the whole, satisfactorily. However, now that we have become sophisticated we desire to have a theory of tests.” Obviously, he was not making a blanket statement that significance tests are, generally, good science; he was making an empirical statement about the applications of significance tests in the first half of the twentieth century. It is debatable whether he would say the same about the significance test since then.

Of course, we should not evaluate a procedure by its purported results; we can be misled by results, and even worse, this involves an inherent circularity (how do we determine whether the procedure actually performed satisfactorily? Another test?). However, this was merely an informal judgment of Neyman’s; we should not over-interpret it either way. After all, he will show that the foundation of the significance test is flawed, and he clearly thought this was important.

An example: Cushny and Peebles’ soporific drugs

Suppose that we are interested in the effect of sleep-inducing drugs. Cushny and Peebles (1905) reported the effects of two sleep-inducing drugs on 10 patients in a paired design. Conveniently, R has the data for these 10 patients built-in, as the sleep data set; the data comprise 10 participants’ improvements over baseline hours of sleep, for each drug. If we wished to compare the two drugs, we might compute a difference score for each participant and subject these difference scores to a one-sample \(t\) test.

The null hypothesis, in this case, is that the population mean of the difference scores is zero: \(\mu=0\). Making the typical assumptions of normality and independence, we know that, under the null hypothesis,
\[ t = \frac{\bar{x}}{\sqrt{s^2/N}} \sim t_{N-1} \] where \(\bar{x}\) and \(s^2\) are the difference-score sample mean and variance.

The figure below shows the distribution of the \(t\) statistic assuming that the null hypothesis is true, with the corresponding \(p\) values on the top axis. Increasingly red areas show increasing evidence against the null hypothesis, according to the Fisherian view.

If we decided to use the \(t\) statistic to make a decision in a significance test, we would decide on a criterion: say, \(|t|>2.26\), which would lead to \(\alpha=0.05\). Repeating the logic of the significance test, as Neyman put it:

When an “improbable sample” was obtained, the usual way of reasoning was this: “Were the hypothesis \(H\) true, then the probability of getting a value of [test statistic] \(T\) as or more improbable than that actually observed would be (e.g.) \(p = 0.00001\). It follows that if the hypothesis \(H\) be true, what we actually observed would be a miracle. We don’t believe in miracles nowadays and therefore we do not believe in \(H\) being true.” (Neyman, 1952)

In the case of our sample, we can perform the \(t\) test in R:

## 
##  Paired t-test
## 
## data:  sleep$extra[1:10] and sleep$extra[11:20]
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean of the differences 
##                   -1.58

In a typical significance test scenario, this would lead to a rejection of the null hypothesis, because \(|t|>2.26\).
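For readers without R at hand, the same \(t\) statistic can be reproduced by hand. The sketch below is in Python rather than R and uses only the standard library; the two lists are the `extra` values of R's `sleep` data set as I have transcribed them, so treat them as an assumption and check them against R if it matters.

```python
from math import sqrt
from statistics import mean, variance

# Extra hours of sleep for the 10 patients under each drug, transcribed
# from R's built-in `sleep` data set (sleep$extra[1:10] and [11:20]).
drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

diffs = [a - b for a, b in zip(drug1, drug2)]
n = len(diffs)

x_bar = mean(diffs)      # sample mean of the difference scores
s2 = variance(diffs)     # sample variance (divides by N - 1)
t = x_bar / sqrt(s2 / n)

print(round(x_bar, 2), round(t, 4))
```

The mean difference and \(t\) statistic should agree with the R output above.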

Neyman’s second argument: Significance testing can be arbitrary

Remember that at this point, we have not considered anything about what we would expect if the null hypothesis were not true. In fact, Fisherian significance testing does not need to consider any alternatives to the null. The pseudo-falsificationist logic of the significance test means that we only need consider the implications for the data under the null hypothesis.

Neyman asks: why use the \(t\) statistic for a significance test? Why use the typical \(\bar{x}\) and \(s^2\)? Neyman then does something very clever: he defines two new statistics, \(\bar{x}_1\) and \(s^2_1\), that have precisely the same distribution as \(\bar{x}\) and \(s^2\) when the null hypothesis is true, and shows that using these two statistics leads to a different test, and different results:
\[ \begin{eqnarray*} \bar{x}_1 &=& \frac{x_1 - x_2}{\sqrt{2N}}, \\ s^2_1 &=& \frac{\sum_{i=1}^N x_i^2 - N\bar{x}^2_1}{N-1}, \end{eqnarray*} \]
where \(x_i\) is the difference score of the $i$th participant (assuming the samples are in arbitrary order).

Neyman proves that these statistics have the same joint distribution as \(\bar{x}\) and \(s^2\), but we can verify Neyman’s proof using R (code available here). The top row of the plot below shows the histogram of 100,000 samples of \(\bar{x}\), \(s^2\), and the \(t\) statistic for \(N=10\) and \(\sigma^2=1\), assuming the null hypothesis is true; the bottom row shows the same 100,000, but computing \(\bar{x}_1\), \(s^2_1\), and \(t_1\), the \(t\) statistic computed from \(\bar{x}_1\) and \(s^2_1\). The red line shows the theoretical distributions. The distributions match precisely.
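A rough version of this check can also be run without the figures. The sketch below is in Python rather than R, uses fewer replications than the 100,000 in the plot, and compares only one summary (the rejection rate at \(|t|>2.262\)) instead of the full histograms; the function names and the seed are my own choices.

```python
import random
from math import sqrt

N = 10
CRIT = 2.262  # two-sided 5% critical value for t with 9 df (2.26 in the text)

def t_usual(x):
    """Ordinary one-sample t statistic."""
    n = len(x)
    x_bar = sum(x) / n
    s2 = (sum(v * v for v in x) - n * x_bar**2) / (n - 1)
    return x_bar / sqrt(s2 / n)

def t_neyman1(x):
    """t statistic built from Neyman's x-bar_1 and s^2_1."""
    n = len(x)
    x_bar1 = (x[0] - x[1]) / sqrt(2 * n)
    s2_1 = (sum(v * v for v in x) - n * x_bar1**2) / (n - 1)
    return x_bar1 / sqrt(s2_1 / n)

rng = random.Random(2024)
reps = 20_000
rej_t = rej_t1 = 0
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(N)]  # null is true: mu = 0
    rej_t += abs(t_usual(x)) > CRIT
    rej_t1 += abs(t_neyman1(x)) > CRIT

print(rej_t / reps, rej_t1 / reps)  # both should land near 0.05
```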

We now have two sets of statistics that have the same distributions, and will thus produce a significance test with precisely the same properties when the null hypothesis is true. Which should we choose? Fisher might object that \(\bar{x}_1\) and \(s^2_1\) are not sufficient, but this only pushes the problem onto sufficiency: why sufficiency?

The figure below shows that this matters for the example at hand. The figure shows 100,000 simulations of \(t\) and \(t_1\) plotted against one another; when \(t\) is large, \(t_1\) tends to be small; when \(t\) is small, \(t_1\) tends to be large. The red dashed lines show the \(\alpha=0.05\) critical values for each test, and the blue curves show the bounds within which \((t,t_1)\) must lie.

The red point shows \(t\) and \(t_1\) for the Cushny and Peebles data set; \(t\) would lead to a rejection of the null, while \(t_1\) would not.
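The two statistics can be computed directly for the data. This Python sketch assumes the difference scores are taken in the order the patients appear in R's `sleep` data set (drug 1 minus drug 2, values transcribed by me); a different ordering would give a different \(t_1\), which is part of the point.

```python
from math import sqrt

# Difference scores (drug 1 minus drug 2) from R's `sleep` data,
# in the order the patients appear in the data set.
x = [-1.2, -2.4, -1.3, -1.3, 0.0, -1.0, -1.8, -0.8, -4.6, -1.4]
n = len(x)
sum_sq = sum(v * v for v in x)

# Usual statistics.
x_bar = sum(x) / n
s2 = (sum_sq - n * x_bar**2) / (n - 1)
t = x_bar / sqrt(s2 / n)

# Neyman's alternative statistics.
x_bar1 = (x[0] - x[1]) / sqrt(2 * n)
s2_1 = (sum_sq - n * x_bar1**2) / (n - 1)
t1 = x_bar1 / sqrt(s2_1 / n)

print(round(t, 2), round(t1, 2))  # -4.06 0.41
```

With the 2.26 criterion, the first value rejects the null and the second does not.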

Examining the definitions of \(\bar{x}_1\) and \(s^2_1\), it isn’t difficult to see what is happening; when the null is true, these statistics will have identical distributions to \(\bar{x}\) and \(s^2\). However, when the null is false, they will not. The distribution of \(\bar{x}_1\) will continue to have a mean of 0 (instead of \(\mu\)), while the distribution of \(s^2_1\) will become more spread out than that of \(s^2\). The effect of this is that the power of the test based on \(t_1\) will decrease as the true effect size increases!
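The loss of power is easy to see in a small simulation. The sketch below is in Python rather than R; the effect sizes, the number of replications, \(\sigma=1\) and the helper names are all assumptions chosen for illustration.

```python
import random
from math import sqrt

N, CRIT = 10, 2.262  # CRIT: two-sided 5% critical value for t with 9 df

def t_from(center, x):
    """t statistic using `center` as the mean estimate, as in the text."""
    n = len(x)
    s2 = (sum(v * v for v in x) - n * center**2) / (n - 1)
    return center / sqrt(s2 / n)

def power(mu, reps, rng):
    """Estimated rejection rates of the t and t_1 tests at true mean mu."""
    rej_t = rej_t1 = 0
    for _ in range(reps):
        x = [rng.gauss(mu, 1.0) for _ in range(N)]
        rej_t += abs(t_from(sum(x) / N, x)) > CRIT
        rej_t1 += abs(t_from((x[0] - x[1]) / sqrt(2 * N), x)) > CRIT
    return rej_t / reps, rej_t1 / reps

rng = random.Random(7)
results = {mu: power(mu, 5_000, rng) for mu in (0.0, 0.5, 1.0, 2.0)}
for mu, (p_t, p_t1) in results.items():
    print(mu, p_t, p_t1)
```

The rejection rate of the usual test should climb toward 1 as \(\mu\) grows, while that of the \(t_1\) test should fall below its nominal 5%.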

A consideration of both Type I and Type II errors makes it obvious which test to choose; we should choose the test that yields the higher power (this is, incidentally, closely related to the Bayesian solution to the problem through the Neyman-Pearson lemma). The use of \(t_1\) would lead to a bad test, when both Type I error rates and Type II error rates are taken into account. A significance test, which does not consider Type II error rates, has no account of why \(t\) is better than \(t_1\).

More problems

The previous development is bad for a significance test; it shows that there can be two statistics that lead to different answers, yet have the same properties from the perspective of significance testing. Following this, Neyman proves something even better: we can always find a statistic that will have the same long-run distribution under the null as \(t\), yet will yield an arbitrarily high test statistic for our sample. This means that we cannot simply base our choice of test statistic on what would yield a more or less conservative test statistic for our sample.

Neyman defines some constants \(\alpha_i\) using the obtained samples \(x_i\): \[ \alpha_i = \frac{x_i}{\sqrt{\sum_{j = 1}^N x_j^2}} \] then for future samples \(y_i\), \(i=1,\ldots,N\), defines \[ \begin{eqnarray*} \bar{y}_2 &=& \frac{\sum_{i=1}^N \alpha_i y_i}{\sqrt{N}}, \\ s^2_2 &=& \frac{\sum_{i=1}^N y_i^2 - N\bar{y}_2^2}{N-1}, \end{eqnarray*} \] and of course we can compute a \(t\) statistic \(t_2\) based on these values. If we use our observed \(x_i\) values for \(y_i\), this will yield \(t_2=\infty\), because \(s^2_2 = 0\), exactly! However, if we check the long-run distribution of these statistics under the null hypothesis, we again find that they are exactly the same as those of \(\bar{x}\), \(s^2\), and \(t\).

If we considered the power of the test based on \(t_2\), we would find that it is worse than the power based on \(t\). The significance test offers no reason why \(t\) is better than \(t_2\), but a consideration of the frequentist properties of the test does. Neyman has thus shown that we must consider an alternative hypothesis in choosing a test statistic, otherwise we can select a test statistic to give us any result we like.
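Both halves of the \(t_2\) construction can be sketched numerically. As before, this is Python rather than R, the difference scores are my transcription of R's `sleep` data, and the helper name `stats2`, the seed and the replication count are assumptions made for the example. In floating point, \(s^2_2\) for the observed data comes out as a tiny number rather than exactly 0, so the sketch prints it instead of dividing by it.

```python
import random
from math import sqrt

# Observed difference scores from R's `sleep` data (drug 1 minus drug 2).
x_obs = [-1.2, -2.4, -1.3, -1.3, 0.0, -1.0, -1.8, -0.8, -4.6, -1.4]
N, CRIT = len(x_obs), 2.262

norm = sqrt(sum(v * v for v in x_obs))
alpha = [v / norm for v in x_obs]  # Neyman's constants, fixed by the data

def stats2(y):
    """Return (y-bar_2, s^2_2) for a sample y, using the fixed alpha."""
    y_bar2 = sum(a * v for a, v in zip(alpha, y)) / sqrt(N)
    s2_2 = (sum(v * v for v in y) - N * y_bar2**2) / (N - 1)
    return y_bar2, s2_2

# On the observed data, s^2_2 is zero up to rounding error, so t_2 blows up.
y_bar2_obs, s2_2_obs = stats2(x_obs)
print(y_bar2_obs, s2_2_obs)

# Under the null, the test based on t_2 still rejects about 5% of the time.
rng = random.Random(11)
reps, rejections = 20_000, 0
for _ in range(reps):
    y = [rng.gauss(0.0, 1.0) for _ in range(N)]
    y_bar2, s2_2 = stats2(y)
    rejections += abs(y_bar2 / sqrt(s2_2 / N)) > CRIT
print(rejections / reps)
```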

Conclusion: The importance of power

At the risk of belaboring a point that has been made over and over, power is not a mere theoretical concern for a frequentist. Neyman and Pearson offer an account of why some tests are better than others, and also, in some cases, an account of which test is optimal; however, just because a test is optimal does not mean it is good.

We might always manage to avoid Type I errors at the same rate (assuming the null hypothesis is true), but as Neyman points out, this is not enough; one needs to consider power, and how one wants to treat both Type I error and power. A good frequentist test may balance Type I and Type II error rates; a good frequentist test may control the Type I error rate while having a power that is above a certain probability. From a frequentist perspective these are decisions that must be made prior to an experiment; none of them can be addressed within the significance testing framework.

To recap both posts, Neyman makes clear why significance testing, as commonly deployed in the scientific literature, does not offer a good theory of inference: it fails epistemically by allowing arbitrary “rational” beliefs, and it fails on statistical grounds by allowing arbitrary results.

From a frequentist perspective, what might a significance test be useful for? Neyman allows that before a critical set of experiments is performed, exploratory research must be undertaken. Generating a test or a confidence procedure requires some assumptions. Neyman does not offer an account of the process of choosing these assumptions, and seems content to leave this up to substantive researchers. Once a formal inference is needed, however, it is clear that from a frequentist perspective the significance test is inadequate.

[the source to this blog post, including R code to reproduce the figures, is available here: https://gist.github.com/richarddmorey/5806aad7191377dcbf4f]

The frequentist case against the significance test, part 1

It is unfortunate that today, we tend to think about statistical theory in terms of Bayesianism vs frequentism. Modern practice is a blend of Fisher’s and Neyman’s ideas, with the characteristics of the blend driven by convenience rather than principle. Significance tests are lumped in as a “frequentist” technique by Bayesians in an unfortunate rhetorical shorthand.

In recent years, the significance test has been critiqued on several grounds, but often these critiques are offered from Bayesian or pragmatic grounds. In a two-part post, I will outline the frequentist case developed by Jerzy Neyman against the null hypothesis significance test.

I will outline two frequentist arguments Neyman deployed against significance tests: the first is philosophical, and the second is statistical:

1. The significance test fails as an epistemic procedure. There is no relationship between the \(p\) value and rational belief. More broadly, the goal of statistical inference is tests with good error properties, not beliefs.
2. The significance test fails as a test. The lack of an alternative means that a significance test can yield arbitrary results.

In this post I will describe the significance test and outline Neyman’s first, philosophical objection to the significance test. In part 2, I will develop Neyman’s statistical objection.

Significance testing

Suppose that a company is developing a drug for depression (for simplicity, we will consider two-sided tests, but the points will generalize to one-sided tests as well). We randomly assign participants to a placebo control and experimental group, and then measure the change across time via a depression inventory.

If the drug had no effect at all, then clearly we would expect the difference between the two groups to be 0. However, we expect variability in depression scores due to factors other than the drug, so we can’t simply take an observed difference between the two conditions as evidence that the drug has an effect. We always expect some difference.

We therefore need to somehow take into account the variability we expect, even if there is no effect, in assessing the effect of the drug. The most common way of doing this is with a significance test; in our example, typically a t test would be used. The logic goes like this:

1. Develop a null hypothesis that is to be (possibly) rejected (e.g., the drug has no effect).
2. Collect data and compute a test statistic \(T\) with a known probability distribution, assuming that the null hypothesis is true (e.g., a \(t\) statistic).
3. Compute the probability \(p\) of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
4. Interpret \(p\) as a measure of evidence against the null hypothesis; the lower \(p\) is, the less we should believe the null hypothesis.

Or, as Neyman (1952) described the logic:

When an “improbable sample” was obtained, the usual way of reasoning was this: “Were the hypothesis \(H\) true, then the probability of getting a value of [test statistic] \(T\) as or more improbable than that actually observed would be (e.g.) \(p = 0.00001\). It follows that if the hypothesis \(H\) be true, what we actually observed would be a miracle. We don’t believe in miracles nowadays and therefore we do not believe in \(H\) being true.”

Sometimes Step 4 will be accompanied by a decision to reject the null hypothesis, but what is important to us now is that the \(p\) value supposedly gives us reason to disbelieve the null hypothesis.
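Steps 2 and 3 of the recipe above can be sketched for the drug example. The Python below is only an illustration: the change scores are made-up numbers, and because the standard library has no \(t\) distribution, the \(p\) value is approximated by simulating the pooled \(t\) statistic under the null (normal data, equal variances) rather than read from the \(t\) distribution as an ordinary \(t\) test would do.

```python
import random
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical change scores on a depression inventory (made-up numbers).
placebo = [-1.0, 0.5, -2.0, 1.5, 0.0, -0.5, 1.0, -1.5]
drug = [-3.0, -1.5, -4.0, -0.5, -2.5, -3.5, -1.0, -2.0]

# Step 2: compute the test statistic for the observed data.
t_obs = pooled_t(drug, placebo)

# Step 3: approximate p by simulating the statistic when the null is true
# (both groups drawn from the same normal distribution).
rng = random.Random(3)
reps = 5_000
null_ts = [
    pooled_t([rng.gauss(0, 1) for _ in drug], [rng.gauss(0, 1) for _ in placebo])
    for _ in range(reps)
]
p = sum(abs(t) >= abs(t_obs) for t in null_ts) / reps
print(round(t_obs, 2), p)
```

Step 4, the interpretation of \(p\), is the part that no amount of code supplies.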

Preliminaries: deductive validity

One of the ways that a significance test is often taught is that the significance test is a probabilistic version of the following argument:

• If the theory is true, (P) will not occur.
• (P) did occur.
• Therefore, the theory is false.

This is a deductively valid argument; that is, if the premises are true, then the conclusion necessarily follows. The use of such a deductive argument forms the basis of the falsificationist model of the scientific process. Given the intuitive nature of falsificationist logic, it is not surprising that a similar logic is often used to describe the significance test:

• If the null hypothesis were true, we would probably not observe a “small” \(p\) value.
• We observed a “small” \(p\) value.
• Therefore, the null hypothesis is probably not true.

Although the above argument seems parallel, it does not share the deductive validity of the non-probabilistic version. We can easily see its deductive invalidity by adding a new premise that does not contradict the other premises, yet contradicts the conclusion:

• The null hypothesis is certainly true.

Obviously, this does not contradict the premise about what we would expect, if the null hypothesis were true; neither does it contradict the premise stating our observation. It does, however, contradict our conclusion that the null hypothesis is improbable. With a deductively valid argument, any time the premises are true the conclusion must also be true. The familiar significance testing argument cannot be deductively valid.
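One way to make the added premise concrete, if we are willing (as Neyman was not) to attach probabilities to hypotheses, is to write it as a prior probability \(P(H_0) = 1\) for the null hypothesis \(H_0\) and apply Bayes’ theorem. The notation here is mine and is only meant as an illustration:

```latex
\[
P(H_0 \mid \text{data})
  = \frac{P(\text{data} \mid H_0)\,P(H_0)}
         {P(\text{data} \mid H_0)\,P(H_0)
          + P(\text{data} \mid \text{not } H_0)\,\bigl(1 - P(H_0)\bigr)}
  = \frac{P(\text{data} \mid H_0)}{P(\text{data} \mid H_0)}
  = 1 .
\]
```

However improbable the observed data are under the null (so long as their probability is not zero), the null remains certain; both premises hold and the conclusion fails.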

That the logic of the significance test is not deductively valid is not news; certainly if the logic were deductively valid, Fisher and Neyman would have made use of that fact, since deductive logic plays a major role in both Fisher’s and Neyman’s theories. As it turns out, it isn’t valid; but this isn’t necessarily a problem if another justification for the logic can be found. In fact, Fisher justified the significance test as an example of inductive logic. This is the logic to which Neyman would object.

Neyman’s first objection: Epistemology

Very important to Fisher’s view of the significance test is the idea that the \(p\) value is a measure of inductive evidence against the null hypothesis:

“Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength that the evidence has reached, or to ignore the fact that with further trial it might come to be stronger, or weaker.” (Fisher, 1971, The Design of Experiments)

The evidence from the \(p\) value was supposed by Fisher to be a rational measure of the disbelief someone should have regarding the null hypothesis:

“…the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to and verifiable by other rational minds. The level of significance in such cases fulfills the conditions of a measure of the rational grounds for the disbelief [in the null hypothesis] it engenders.” (Fisher, 1959; Statistical Methods and Scientific Inference)

This will resonate with almost everyone who has taken an introductory statistics course, and certainly with scientists who use \(p\) values. Scientists use \(p\) values to call into doubt a particular hypothesis. The concept of statistics as an epistemic endeavour (that is, that statistics is, in some sense, about what it is reasonable to believe) was central to Fisher and the Bayesians. Neyman was philosophically aligned against this idea.

In a fantastic 1957 paper entitled “‘Inductive Behavior’ as a Basic Concept of Philosophy of Science”, Neyman outlined the philosophy underlying his frequentist theory of inference. I believe this paper should be required reading for all users of statistics; it is clear, non-technical, and raises important points that all users of statistics should think about. Historically, it is a very important paper because we read Neyman’s reaction to three important viewpoints developed in the previous decades: Fisher’s, Jeffreys’ objective Bayesianism, and de Finetti’s subjective Bayesianism.

Neyman, with characteristic clarity, identified the major problem with Fisher’s (and, incidentally, Jeffreys’ Bayesian) viewpoint: How much doubt should a particular \(p\) value yield? This is not a question that Fisher ever answered, nor could answer. As Neyman says,

“[I]f a scientist inquires why should he reject or accept hypotheses in accordance with the calculated values of \(p\) the unequivocal answer is: because these values of \(p\) are the ultimate measures of beliefs especially designed for the scientist to adjust his attitudes to. If one inquires why should one use the normative formulae of one school rather than those of some other, one becomes involved in a fruitless argument.”

Fisher never formalized the connection between the \(p\) value and rational “disbelief.” Every scientist is assumed to have a “feeling” of disbelief from the \(p\) value. Whose feeling is the rational one? And why use \(p\), and not, say, \(\sqrt{1-p^4}\)? Fisher’s entire argument is built on intuition. Neyman continues:

It must be obvious that [inductive inference’s] use as a basic principle underlying research is unsatisfactory. The beliefs of particular scientists are a very personal matter and it is useless to attempt to norm them by any dogmatic formula.

That is, rational belief is not the target of statistical analysis. Neyman took this idea to an extreme. He went so far as to deny that interpreting the results of a statistical test involves beliefs, reasoning, knowledge, or even any conclusions at all (see, for instance, Neyman, 1941)! What mattered to Neyman was setting up tests that had good long-run error properties, and acting according to a plan derived on the basis of these tests. “Rational belief” is not the target of a statistical procedure.

Fisher found Neyman’s viewpoint completely alien, and I suspect most scientists today would as well. I find that scientists agree more with Fisher, when he responded to Neyman’s philosophy:

In fact, scientific research…is an attempt to improve public knowledge undertaken as an act of faith to the effect that, as more becomes known, or surely known, the intelligent pursuit of a great variety of aims, by a great variety of men, or groups of men, will be facilitated. (Fisher, 1956)

Although one might agree or disagree with Neyman’s take on the philosophy of science, he was correct in his critique of Fisher: Fisher failed to provide any link between \(p\) values and rational (dis)belief, and no such link exists. The epistemic use of the significance test, championed by Fisher and deployed by scientists all over the world for decades, has no foundation.

Conclusion, and on to part 2

The significance test is not deductively valid, in spite of its being sometimes taught as having a falsificationist foundation. Fisher justified the use of the \(p\) value as an epistemic statistic and as an example of inductive inference. Neyman points out that there is simply no foundation for the “rational” feelings Fisher associates with the \(p\) value, and emphasizes the frequentist view that beliefs aren’t the target of inference; rather, tests with good long-run error properties are.

In part 2, we will explore Neyman’s second argument against the logic of the significance test: it fails to consider what makes a “good” frequentist test, and actually can lead to tests that produce arbitrary results.