# The frequentist case against the significance test, part 2

The significance test is perhaps the most used statistical procedure in the world, though has never been without its detractors. This is the second of two posts exploring Neyman’s frequentist arguments against the significance test; if you have not read Part 1, you should do so before continuing (“The frequentist case against the significance test, part 1”).

Neyman offered two major arguments against the significance test:

1. The significance test fails as an epistemic procedure. There is no relationship between the (p) value and rational belief. More broadly, the goal of statistical inference is tests with good error properties, not beliefs.
2. The significance test fails as a test. The lack of an alternative means that a significance test can yield arbitrary results.

The first, philosophical, argument I outlined in Part 1. Part 1 was based largely on Neyman’s 1957 paper “’Inductive Behavior’ as a Basic Concept of Philosophy of Science”. Part 2 will be based on Chapter 1, part 3 of Neyman’s 1952 book, “Lectures and conferences on mathematical statistics and probability”.

First, it must be said that Neyman did not think that significance tests were useless or misleading, all the time. He said “The [significance test procedure] has been applied since the invention of the first systematically applied test, the Pearson chi-square of 1900, and has worked, on the whole, satisfactorily. However, now that we have become sophisticated we desire to have a theory of tests.” Obviously, he is not making a blanket statement that significance tests are, generally, good science; he was making an empirical statement about the applications of significance tests in the first half of the twentieth. It is debatable whether he would say the same about the significance test since then.

Of course, we should not evaluate a procedure by its purported results; we can be misled by results, and even worse, this involves an inherent circularity (how do we determine whether the procedure actually performed satisfactorily? Another test?). However, this was merely an informal judgment of Neyman’s; we should not over-interpret it either way. After all: he will show that the foundation of the significance test is flawed, and he clearly thought this was important.

### An example: Cushney and Peebles’ soporific drugs

Suppose that we are interested in the effect of sleep-inducing drugs. Cushney and Peebles (1905) reported the effects of two sleep-inducing drugs on 10 patients in a paired design. Conveniently, R has the data for these 10 patients built-in, as the sleep data set; the data comprise 10 participants’ improvements over baseline hours of sleep, for each drug. If we wished to compare the two drugs, we might compute a difference score for each participant and subject these difference scores to a one-sample (t) test.

The null hypothesis, in this case, is that the population mean of the difference scores, (mu=0). Making the typical assumptions of normality and independence, we know that, under the null hypothesis,
[ t = frac{bar{x}}{sqrt{s^2/N}} sim t_{N-1} ] where (bar{x}) and (s^2) are the difference-score sample mean and variance.

The figure below shows the distribution of the (t) statistic assuming that the null hypothesis is true, with the corresponding (p) values on the top axis. Increasingly red areas show increasing evidence against the null hypothesis, according to the Fisherian view.

If we decided to use the (t) statistic to make a decision in a significance test, we would decide on a criterion: say, (|t|>2.26), which would lead to (alpha=0.05). Repeating the logic of the significance test, as Neyman put it:

When an “improbable sample” was obtained, the usual way of reasoning was this: “Were the hypothesis (H) true, then the probability of getting a value of [test statistic] (T) as or more improbable than that actually observed would be (e.g.) (p = 0.00001). It follows that if the hypothesis (H) be true, what we actually observed would be a miracle. We don’t believe in miracles nowadays and therefore we do not believe in (H) being true.” (Neyman ,1952)

In the case of our sample, we can perform the (t) test in R:

## ##  Paired t-test## ## data:  sleep$extra[1:10] and sleep$extra[11:20]## t = -4.0621, df = 9, p-value = 0.002833## alternative hypothesis: true difference in means is not equal to 0## 95 percent confidence interval:##  -2.4598858 -0.7001142## sample estimates:## mean of the differences ##                   -1.58

In a typical significance test scenario, this would lead to a rejection of the null hypothesis, because (|t|>2.26).

### Neyman’s second argument: Significance testing can be arbitrary

Remember that at this point, we have not considered anything about what we would expect if the null hypothesis were not true. In fact, Fisherian significance testing does not need to consider any alternatives to the null. The pseudo-falsificationist logic of the significance test means that we only need consider the implications for the data under the null hypothesis.

Neyman asks: why use the (t) statistic for a significance test? Why use the typical (bar{x}) and (s^2)? Neyman then does something very clever: he defines two new statistics, (bar{x}_1) and (s^2_1), that have precisely the same distribution as (bar{x}) and (s^2) when the null hypothesis is true, and shows that using these two statistics leads to a different test, and different results:
[ begin{eqnarray*} bar{x}_1 &=& frac{x_1 – x_2}{sqrt{2N}}, s^2_1 &=& frac{sum_{i=1}^N x_i^2 – Nbar{x}^2_1}{N-1}, end{eqnarray*} ]
where (x_i) is the difference score of the $i$th participant (assuming the samples are in arbitrary order).

Neyman proves that these statistics have the same joint distribution as (bar{x}) and (s^2), but we can verify Neyman’s proof using R (code available here). The top row of the plot below shows the histogram of 100,000 samples of (bar{x}), (s^2), and the (t) statistic for (N=10) and (sigma^2=1), assuming the null hypothesis is true; the bottom row shows the same 100,000, but computing (bar{x}_1), (s^2_1), and (t_1), the (t) statistic computed from (bar{x}_1) and (s^2_1). The red line shows the theoretical distributions. The distributions match precisely.

We now have two sets of statistics that have the same distributions, and will thus produce a significance test with precisely the same properties when the null hypothesis is true. Which should we choose? Fisher might object that (bar{x}_1) and (s^2_1) are not sufficient, but this only pushes the problem onto sufficiency: why sufficiency?

The figure below shows that this matters for the example at hand. The figure shows 100,000 simulations of (t) and (t_1) plotted against one another; when (t) is large, (t_1) tends to be small; when (t) is small, (t_1) tends to be large. The red dashed lines show the (alpha=0.05) critical values for each test, and the blue curves show the limits of bounds within which ((t,t_1)) has to be contained.

The red point shows (t) and (t_1) for the Cushny and Peebles’ data set; (t) would lead to a rejection of the null, while (t_1) would not.

Examining the definitions of (bar{x}_1) and (s^2_1), it isn’t difficult to see what is happening; when the null is true, these statistics will have identical distributions to (bar{x}) and (s^2). However, when the null is false, they will not. The distribution of (bar{x}_1) will continue to have a mean of 0 (instead of (mu)), while the distribution of (s^2_1) will become more spread than (s^2). The effect of this is that the power of the test based on (t_1) will decrease as the true effect size increases!

A consideration of both Type I and Type II errors makes it obvious which test to choose; we should choose the test that yields the higher power (this is, incidentally, closely related to the Bayesian solution to the problem through the Neyman-Pearson lemma). The use of (t_1) would lead to a bad test, when both Type I error rates and Type II error rates are taken into account. A significance test, which does not consider Type II error rates, has no account of why (t) is better than (t_1).

### More problems

The previous development is bad for a significance test; it shows that there can be two statistics that lead to different answers, yet have the same properties from the perspective of significance testing. Following this, Neyman proves something even better: we can always find a statistic that will have the same long-run distribution under the null as (t), yet will yield an arbitrarily high test statistic for our sample. This means that we cannot simply base our choice of test statistic on what would yield a more or less conservative test statistic for our sample.

Neyman defines some constants (alpha_i) using the obtained samples (x_i): [ alpha_i = frac{x_i}{sqrt{sum_{i = 1}^N x_i^2}} ] then for future samples (y_i), (i=1,ldots,N) defines [ begin{eqnarray*} bar{y}_2 &=& frac{sum_{i=1}^N alpha_iy_i}{sqrt{N}}, s^2_2 &=& frac{sum_{i=1}^N y_i^2 – Nbar{y}_2^2}{N-1}, end{eqnarray*} ] and of course we can compute a (t) statistic (t_2) based on these values. If we use our observed (x_i) values for (y_i), this will yield a (t_2=infty), because (s^2_2 = 0), exactly! However, if we check the long-run distribution of these statistics under the null hypothesis, we again find that they are exactly the same as (bar{x}), (s^2), and (t):

If we considered the power of the test based on (t_2), we would find that it is worse than the power based on (t). The significance test offers no reason why (t) is better than (t_2), but a consideration of the frequentist properties of the test does. Neyman has thus shown that we must consider an alternative hypothesis in choosing a test statistic, otherwise we can select a test statistic to give us any result we like.

### Conclusion: The importance of power

At the risk of belaboring a point that has been made over and over, power is not a mere theoretical concern for a frequentist. Neyman and Pearson offer an account of why some tests are better than others, and also, in some cases, an account of which test is the optimal; however, just because a test is optimal, does not mean it is good.

We might always manage to avoid Type I errors at the same rate (assuming the null hypothesis is true), but as Neyman points out, this is not enough; one needs to consider power, and how one wants to treat both Type I error and power. A good frequentist test may balance Type I and Type II error rates; a good frequentist test may control the Type I error rate while having a power that is above a certain probability. From a frequentist perspective these are decisions that must be made prior to an experiment; none of them can be addressed within the significance testing framework.

To recap both posts, Neyman makes clear why significance testing, as commonly deployed in the scientific literature, does not offer a good theory of inference: it is fails epistemically by allowing arbitrary “rational” beliefs, and it fails on statistical grounds by allowing arbitrary results.

From a frequentist perspective, what might a significance test be useful for? Neyman allows that before a critical set of experiments is performed, exploratory research must be undertaken. Generating a test or a confidence procedure requires some assumptions. Neyman does not offer an account of the process of choosing these assumptions, and seems content to leave this up to substantive researchers. Once a formal inference is needed, however, it is clear that from a frequentist perspective the significance test is inadequate.

[the source to this blog post, including R code to reproduce the figures, is available here: https://gist.github.com/richarddmorey/5806aad7191377dcbf4f]

# Bayes factor t tests, part 1

In my first post, I described the general logic of Bayes factors. I will continue discussing the general logic of Bayes factor, while introducing some of the basic functionality of the BayesFactor package.

### Recap: What is a Bayes factor?

The Bayes factor is the relative evidence provided by the data for comparing two statistical models. It has two equivalent definitions:

1. The Bayes factor is the relative predictive success between two hypotheses: it is the ratio of the probabilities of the observed data under each of the hypotheses. If the probability of the observed data is higher under one hypothesis than another, then that hypothesis is preferred.
2. The Bayes factor is evidence: it is the factor by which the relative beliefs must be multiplied after the data have been observed, providing new (posterior) relative beliefs. The stronger the evidence, the more beliefs must change.

### Student’s sleep data

Student (1908) describes a study by Cushny and Peebles (1905) on the effect of several medications on sleep. One of the drugs they studied was Laevo-hyoscine hydrobromide, which we’ll abbreviate as “LH”. LH was administered to 10 patients on several nights, and the average amount of sleep over the patients’ unmedicated sleep (measured on other nights) was recorded. For the purposes of this post, we will make the typical assumptions about the data that would be made if we were to apply the (t) procedure: specifically, that the observation are independent and normally distributed.
We return to Carole and Paul, who have another disagreement. Carole does not believe that LH increases sleep at all, while Paul claims that it does. This would traditionally be handled by a t test. Under Carole’s hypothesis, the (t) statistic has a Student’s (t) distribution with (N-1) degrees of freedom. That is, [ t = frac{bar{y}}{s}sqrt{N} sim mbox{Student’s }~t_{N-1} ]
Another way of stating her prediction is in terms of the observed effect size, (hat{delta}=bar{y}/s), which is shown in the figure below. It is simply a Student’s (t) distribution, scaled by (1/sqrt{N}).

Paul’s hypothesis — that LH causes an increase in sleep — is problematic. Typically, we require scientific hypotheses to constrain the data in some way. We require them to predict that some outcomes are more plausible than others. Paul’s hypothesis, however, predicts nothing. His hypothesis does not constrain the data.
This highlights a curious fact about traditional significance testing. The only hypothesis we can support using a classical significance test is the one that makes no predictions! If we applied the traditional t test to these data, we could only reject Carole’s hypothesis in favor of Paul’s — but never support Carole’s — even though Paul has advanced a decidedly unfalsifiable hypothesis.
To be testable, Paul’s hypothesis must make predictions, as Carole’s has. As a first step, let’s assume that Paul believes that LH will increase sleep by 1 standardized effect size unit. If (mu) is the mean increase in hours of sleep and (sigma) is the standard deviation of the increases, then the standardized increase is [ delta = frac{mu}{sigma}. ] and Paul believes that (delta=1). Using the noncentral t distribution, we can obtain Paul’s predictions for the observed effect size (hat{delta}), shown in the figure below:

As expected, the predictions are centered around (hat{delta}=1). We now have two hypothesis: Carole believes that (delta=0), and Paul believes that (delta=1). These hypotheses imply two different predictions for the observed effect size, as shown above. We can now compare the predictions with the results presented by Student to see whether Paul or Carole is favored by the evidence.
The sleep data set is built into R; the following R code will extract the data for LH (which is labelled group 2):

improvements = with(sleep, extra[group == 2])improvements
##  [1]  1.9  0.8  1.1  0.1 -0.1  4.4  5.5  1.6  4.6  3.4

The improvements in hours of sleep range from -0.1 hours (the patient slept less when medicated by LH) and an improvement of 5.5 hours. The mean improvement was 2.33 hours. The figure below shows the hours of improvement for all 10 patients.

We begin our analysis by computing (N) and the (t) statistic from the data:

N = length(improvements)t = mean(improvements)/sd(improvements) * sqrt(N)t
## [1] 3.68

Typically, at this point we would look up the (p) value for this (t) statistic; but for a Bayes factor analysis, we use it to compute the relative evidence for the two hypotheses. Of interest is the effect size:

deltaHat = t/sqrt(N)deltaHat
## [1] 1.164

The effect size (hat{delta}=1.1637) seems to favor Paul, since Paul’s hypothesis ((delta=1)) was actually quite close to the observed result . In order to know by how much the result favors Paul, we need to look at the predictions and see where the observation falls.
The figure below shows both Carole’s and Paul’s predictions for the observed effect size as overlapping gray distributions. The observed effect size is denoted by the vertical line segment. In order to compute the Bayes factor, we must compare the relative heights of the two distributions for the observed data. The probability density of the observed data is 67.9451 times higher under Paul’s hypothesis than under Carole’s. The Bayes factor favoring Paul is thus 67.9451.

We can easily compute this using R’s built in functions. Under Carole’s hypothesis, the (t) statistic has a central (t) distribution with 9 degrees of freedom; Under Paul’s hypothesis, the (t) statistic has a noncentral (t) distribution with 9 degrees of freedom, and a noncentrality parameter of (deltasqrt{N}=1timessqrt{10}). The ratio of the two densities, at the observed (t) statistic, yields the Bayes factor.

dt(t, df = 9, ncp = 1 * sqrt(10))/dt(t, df = 9)
## [1] 67.95

### Uncertain hypotheses, and a first look at BayesFactor

In the section above Carole’s and Paul’s original hypotheses, when stated in terms of effect sizes, were [ begin{align*} {cal H}_c : delta &= 0\ {cal H}_p : delta &> 0; end{align*} ] that is, Paul hypothesized that the drug increases sleep on average, and Carole hypothesized that it had no effect. The problem with this was that Paul’s hypothesis was untestable, because it does not lead to any proper constraint on the data.
The solution is for Paul to propose a hypothesis that does constrain data. We chose the hypothesis (delta=1) for Paul, and showed how that can be used to compute a Bayes factor. The resulting Bayes factor, however, does not seem to capture very well the spirit of Paul’s original hypothesis. Paul was not specific about what the exact effect size was. But (delta>0) won’t do either; it is so unspecific as to be completely unfalsifiable.
What we need is a middle ground between the two. In the previous post I showed how a Bayesian analysis can spread uncertainty out over a range of parameter values using a probability distribution.
The figure below shows two hypotheses: Carole’s null hypothesis ((delta=0)) whose predictions I’ve already presented, and a plausible hypothesis for Paul ((delta>0)). Under Paul’s hypothesis, plausible effect sizes are not spread evenly across the positive effect sizes, because that would not lead to any constraint on the data. Instead, small effect sizes are preferred to large ones. It is implausible for the effect size to be too large, so the plausibility of the effects diminish as the effect sizes get larger.

The shape of this prior distribution is the positive half of a Cauchy distribution, or a Student’s (t) distribution with 1 degree of freedom. The Cauchy prior was suggested by Rouder et al. (2009), and is the one implemented in the BayesFactor package.
Of course, this half-Cauchy distribution is not the only way we could implement Paul’s hypothesis. We could choose a different shape of distribution; we could make it wider or narrower around 0; we could even restrict the range further. We must choose some valid probability distribution in order to create a hypothesis that constrains the data, and we should do our best to make the test meaningful by choosing a defensible distribution.
Recall from the previous post that the way we obtain predictions from an uncertain hypothesis like Paul’s, we weight the predictions for each possible (delta) by their plausibility and average them. This is represented by the integral: [ P(hat{delta}mid{cal H}_p) = int P(hat{delta}mid delta)p(delta),ddelta, ] where (P(hat{delta}mid delta)) gives the predictions for a specific (delta), and (p(delta)) is the distribution in the figure above, giving the weightings for all possible values of (delta).
The figure below shows Carole’s predictions (which are the same as before) and Paul’s new predictions for the observed effect size. Notice that Paul’s predictions are now substantially more spread out than before. The uncertainty Paul has about the true value of (delta), added to the uncertainty that is inherent in the sampling given a particular true value, has caused his predictions to be less certain. This spreading out is the penalty for a having a less-specific hypothesis.

Now that we have predictions from both Carole’s and Paul’s hypothesis, we can compute a Bayes factor. The Bayes factor is the ratio of the heights at the observed (hat{delta}) value, shown in the figure below by the vertical line segment. The Bayes factor is 21.3275 in favor of Paul, because the probability density of the observed data is 21.3275 times greater under Paul’s hypothesis than under Carole’s. Note that this is substantially lower than the Bayes factor in favor of Paul’s hypothesis when it was more specific. More flexible hypotheses, like Paul’s, are automatically penalized by the Bayes factor.

As previously mentioned, the distribution used as Paul’s hypothesis happens to be the one implemented in the BayesFactor package. We can thus use the function ttestBF, part of the package’s suite of functions, to compute the Bayes factor for the Student sleep data.
If you have not installed the BayesFactor package yet, install the package first (help here). If you define the improvements vector as above, you can run the following two lines of code to load the package and perform the a Bayes factor (t) test. The nullInterval argument tells ttestBF that you want to consider that range as a hypothesis. Paul’s hypothesis ranged from 0 to (infty), so we specify that range using the nullInterval argument:

library(BayesFactor)ttestBF(improvements, nullInterval = c(0, Inf))
## Bayes factor analysis## --------------## [1] Alt., r=0.707 0<d<Inf    : 21.33  ±0.01%## [2] Alt., r=0.707 !(0<d<Inf) : 0.1036 ±1.4%## ## Against denominator:##   Null, mu = 0 ## ---## Bayes factor type: BFoneSample, JZS

There are several things to note in the output. There are two Bayes factors shown here shown in the lines starting with [1] and [2] The first line is the one we specified, where (0<d<infty). The second Bayes factor is the Bayes factor for the complement of the specified range (the ! indicates negation) — in this case, the negative effect sizes. Each of the two lines contains three pieces of information: the numerator hypothesis for the Bayes factor, the Bayes factor itself, and a measure of the error in the estimate of the Bayes factor. Here, the error is very small.
Below the two Bayes factors, the denominator hypothesis is specified. The denominator is the hypothesis to which all the numerator hypotheses have been compared. The first Bayes factor, then, is the comparison of the positive effect sizes to the null hypothesis; the second Bayes factor is the comparison of the negative effect sizes to the null. The positive hypothesis is preferred to the null hypothesis, and the null hypothesis is preferred to the negative hypothesis.
In order to gain some insight into the Bayes factor, one can compute the Bayes factor for a range of possible data. The figure below shows the Bayes factors for a range of observations. The effect size in the Student sleep data is denoted by the circle. Observing smaller effect sizes or negative effect sizes favors Carole’s hypothesis (the null), while observing larger effect sizes favors Paul’s alternative.

### Conclusions

What can we conclude from our Bayes factor analysis? Paul’s alternative hypothesis was favored to Carole’s null hypothesis by a factor of 21.3275, suggesting a substantial amount of evidence in favor of a positive effect size. This is, of course, conditioned on the “reasonableness” of Paul’s hypothesis. The Bayes factor is not, however, arbitrary: although the Bayes factor would change if we changed the prior, we would have to choose a strange prior to change the substantive conclusion.
In the next post, we will explore the BayesFactor package’s (t) tests in more detail.