Category Archives: R


On radical manuscript openness

One of my papers that has attracted a lot of attention lately is “The Fallacy of Placing Confidence in Confidence Intervals,” in which we describe some of the fallacies held by the proponents and users of confidence intervals. This paper has been discussed on Twitter, on Reddit, on blogs (e.g., here and here), and via email with people who found the paper in various places. A person unknown to me has used the article as the basis for edits to the Wikipedia article on confidence intervals. I have been told that several papers currently under review cite it. Perhaps this is a small sign that traditional publishers should be worried: this paper has not been “officially” published yet.


I am currently wrapping up the final revisions on the paper, which has been accepted pending minor revisions at Psychonomic Bulletin & Review. The paper has benefited from an extremely public revision process. When I had a new major version to submit, I published the text and all code on github, and shared it via social media. Some of the resulting discussions have been positive, others negative; some useful and enlightening, others not useful and frustrating. Most scientific publications almost exclusively reflect input from the coauthors and the editors and reviewers. This manuscript, in contrast, has been influenced by scores of people I’ve never met, and I think the paper is better for it.

This is all the result of my exploring ways to make my writing process more open, which led to the idea of releasing successive major versions of the text and R code on github with DOIs. But what about after it is published? How can manuscript openness continue after the magic moment of publication?

One of the downsides of the traditional scientific publishing model is that once the work is put into a “final” state, it becomes static. The PDF file format, in which articles find their final form and in which they are exchanged and read, enforces a certain rigidity, a rigor mortis. The document is dead and placed behind glass for the occasional passerby to view. It is of course good to have a citable version of record; we would not, after all, want a document to be a moving target, constantly changing on the whim of the authors. But it seems like we can do better than the current idea of a static, final document, and I’d like to try.

I have created a website for the paper that, on publication, will contain the text of the paper in its entirety, free to read for anyone. It also contains extra material, such as teaching ideas and interactive apps to assist in understanding the material in the paper. The version of the website corresponding to the “published” version of the paper will be versioned on github, along with the paper. But unlike the paper at the journal, a website is flexible, and I intend to take advantage of this in several ways.

First, I have enabled hypothes.is annotation across the entire text. If you open part of the text and look in the upper right-hand corner, you will see three icons that can be used to annotate the text:

The hypothes.is annotation tools.

Moreover, highlighting a bit of text will open up further annotation tools:

Highlighting the text brings up more annotation tools.

Anyone can annotate the document, and others can see the annotations you make. Am I worried that on the Internet, some people might not add the highest quality annotations? A bit. But my curiosity to see how this will be used, and the potential benefits, outweigh my trepidation.

Second, I will update the site with new information, resources, and corrections. These changes will be versioned on github, so that anyone can see what the changes were. Because the journal will have the version of record, there is no possibility of “hiding” changes to the website. So I get the best of both worlds: the trust that comes with a clear record of the process, and the ability to change the document as the need arises. And the entire process can be open, through the magic of github.
Third, I have enabled together.js across every page of the manuscript. together.js allows collaboration between people looking at the same website. Unlike hypothes.is, together.js is meant for small groups to privately discuss the content, not for public annotation. This is mostly to explore its possibilities for teaching and discussion, but I also imagine it holds promise for post-publication review and drafting critiques of the manuscript.

The together.js collaboration tools allow making your mouse movements and clicks visible to others, text chat, and voice chat.

Critics could use together.js to discuss the manuscript, chatting about its content. The communication in together.js is peer-to-peer, ensuring privacy; nothing is actually managed by the website itself, except for making the collaboration tools available.

The best part of this is that it requires no action or support from the publisher. This is essentially a sophisticated version of a pre-print, which I would release anyway. We don’t have to wait for the publishers to adopt policies and technologies friendly for post-publication peer review; we can do it ourselves. All of these tools are freely available, and anyone can use them. If you have any more ideas for tools that would be useful for me to add, let me know; the experiment hasn’t even started yet!

Check out “The Fallacy of Placing Confidence in Confidence Intervals,” play around with the tools, and let me know what you think.

BayesFactor updated to version 0.9.11-1

The BayesFactor package has been updated to version 0.9.11-1. The changes are:

  CHANGES IN BayesFactor VERSION 0.9.11-1

CHANGES
  * Fixed memory bug causing importance sampling to fail.

  CHANGES IN BayesFactor VERSION 0.9.11

CHANGES
  * Added support for prior/posterior odds and probabilities. See the new vignette for details.
  * Added approximation for t test in case of large t
  * Made some error messages clearer
  * Use callbacks at least once in all cases
  * Fix bug preventing continuous interactions from showing in regression Gibbs sampler
  * Removed unexported function oneWayAOV.Gibbs(), and related C functions, due to redundancy
  * gMap from model.matrix is now 0-indexed vector (for compatibility with C functions)
  * substantial changes to backend, to Rcpp and RcppEigen for speed
  * removed redundant struc argument from nWayAOV (use gMap instead)


The frequentist case against the significance test, part 2

The significance test is perhaps the most used statistical procedure in the world, though it has never been without its detractors. This is the second of two posts exploring Neyman’s frequentist arguments against the significance test; if you have not read Part 1, you should do so before continuing (“The frequentist case against the significance test, part 1”).

Neyman offered two major arguments against the significance test:

  1. The significance test fails as an epistemic procedure. There is no relationship between the \(p\) value and rational belief. More broadly, the goal of statistical inference is tests with good error properties, not beliefs.
  2. The significance test fails as a test. The lack of an alternative means that a significance test can yield arbitrary results.

The first, philosophical, argument I outlined in Part 1. Part 1 was based largely on Neyman’s 1957 paper “’Inductive Behavior’ as a Basic Concept of Philosophy of Science”. Part 2 will be based on Chapter 1, part 3 of Neyman’s 1952 book, “Lectures and conferences on mathematical statistics and probability”.

First, it must be said that Neyman did not think that significance tests were useless or misleading, all the time. He said “The [significance test procedure] has been applied since the invention of the first systematically applied test, the Pearson chi-square of 1900, and has worked, on the whole, satisfactorily. However, now that we have become sophisticated we desire to have a theory of tests.” Obviously, he was not making a blanket statement that significance tests are, in general, good science; he was making an empirical statement about the applications of significance tests in the first half of the twentieth century. It is debatable whether he would say the same about the significance test since then.

Of course, we should not evaluate a procedure by its purported results; we can be misled by results, and even worse, this involves an inherent circularity (how do we determine whether the procedure actually performed satisfactorily? Another test?). However, this was merely an informal judgment of Neyman’s; we should not over-interpret it either way. After all, he goes on to show that the foundation of the significance test is flawed, and he clearly thought this was important.

An example: Cushny and Peebles’ soporific drugs

Suppose that we are interested in the effect of sleep-inducing drugs. Cushny and Peebles (1905) reported the effects of two sleep-inducing drugs on 10 patients in a paired design. Conveniently, R has the data for these 10 patients built in, as the sleep data set; the data comprise 10 participants’ improvements over baseline hours of sleep, for each drug. If we wished to compare the two drugs, we might compute a difference score for each participant and subject these difference scores to a one-sample \(t\) test.
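If you want to follow along in R, here is a minimal sketch of extracting the difference scores from the built-in sleep data (the first ten rows are drug 1 and the last ten are drug 2, matched by patient ID; the sign convention is my own choice):

data(sleep)
# one difference score per patient: drug 1 improvement minus drug 2 improvement
d <- sleep$extra[1:10] - sleep$extra[11:20]
d
mean(d)   # -1.58 hours: on average, drug 2 improved sleep more than drug 1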

The null hypothesis, in this case, is that the population mean of the difference scores is \(\mu=0\). Making the typical assumptions of normality and independence, we know that, under the null hypothesis,
\[ t = \frac{\bar{x}}{\sqrt{s^2/N}} \sim t_{N-1} \]
where \(\bar{x}\) and \(s^2\) are the difference-score sample mean and variance.
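As a quick check, we can compute the \(t\) statistic directly from this formula, using the difference scores d from the sketch above; the result matches the t.test() output further below.

N <- length(d)
t_stat <- mean(d) / sqrt(var(d) / N)
t_stat                  # about -4.06
qt(0.975, df = N - 1)   # about 2.26: the two-sided 5% criterion used below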

The figure below shows the distribution of the \(t\) statistic assuming that the null hypothesis is true, with the corresponding \(p\) values on the top axis. Increasingly red areas show increasing evidence against the null hypothesis, according to the Fisherian view.
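A bare-bones sketch of that null distribution (without the colour coding or the \(p\)-value axis) would be something like:

# t distribution under the null hypothesis, N - 1 = 9 degrees of freedom
curve(dt(x, df = 9), from = -5, to = 5, xlab = "t statistic", ylab = "Density")
abline(v = qt(c(0.025, 0.975), df = 9), lty = 2)   # two-sided 5% criterion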

If we decided to use the \(t\) statistic to make a decision in a significance test, we would decide on a criterion: say, \(|t|>2.26\), which would lead to \(\alpha=0.05\). Repeating the logic of the significance test, as Neyman put it:

When an “improbable sample” was obtained, the usual way of reasoning was this: “Were the hypothesis \(H\) true, then the probability of getting a value of [test statistic] \(T\) as or more improbable than that actually observed would be (e.g.) \(p = 0.00001\). It follows that if the hypothesis \(H\) be true, what we actually observed would be a miracle. We don’t believe in miracles nowadays and therefore we do not believe in \(H\) being true.” (Neyman, 1952)

In the case of our sample, we can perform the \(t\) test in R:

t.test(sleep$extra[1:10], sleep$extra[11:20], paired = TRUE)
## 
## Paired t-test
##
## data: sleep$extra[1:10] and sleep$extra[11:20]
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58

In a typical significance test scenario, this would lead to a rejection of the null hypothesis, because \(|t|>2.26\).

Neyman’s second argument: Significance testing can be arbitrary

Remember that at this point, we have not considered anything about what we would expect if the null hypothesis were not true. In fact, Fisherian significance testing does not need to consider any alternatives to the null. The pseudo-falsificationist logic of the significance test means that we need only consider the implications for the data under the null hypothesis.

Neyman asks: why use the \(t\) statistic for a significance test? Why use the typical \(\bar{x}\) and \(s^2\)? Neyman then does something very clever: he defines two new statistics, \(\bar{x}_1\) and \(s^2_1\), that have precisely the same distribution as \(\bar{x}\) and \(s^2\) when the null hypothesis is true, and shows that using these two statistics leads to a different test, and different results:
\[
\begin{aligned}
\bar{x}_1 &= \frac{x_1 - x_2}{\sqrt{2N}},\\
s^2_1 &= \frac{\sum_{i=1}^N x_i^2 - N\bar{x}^2_1}{N-1},
\end{aligned}
\]
where \(x_i\) is the difference score of the \(i\)th participant (assuming the samples are in arbitrary order).
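As a rough illustration, we can compute \(\bar{x}_1\), \(s^2_1\), and the resulting \(t_1\) for the difference scores d defined earlier (keeping in mind that the ordering of the differences is arbitrary):

xbar1 <- (d[1] - d[2]) / sqrt(2 * N)
s2_1  <- (sum(d^2) - N * xbar1^2) / (N - 1)
t1    <- xbar1 / sqrt(s2_1 / N)
t1    # about 0.41: well inside the |t| < 2.26 non-rejection region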

Neyman proves that these statistics have the same joint distribution as \(\bar{x}\) and \(s^2\), but we can also verify Neyman’s proof by simulation in R (code available here). The top row of the plot below shows the histogram of 100,000 samples of \(\bar{x}\), \(s^2\), and the \(t\) statistic for \(N=10\) and \(\sigma^2=1\), assuming the null hypothesis is true; the bottom row shows the same 100,000 samples, but computing \(\bar{x}_1\), \(s^2_1\), and \(t_1\), the \(t\) statistic computed from \(\bar{x}_1\) and \(s^2_1\). The red line shows the theoretical distributions. The distributions match precisely.
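The full simulation code is linked at the end of the post; a stripped-down sketch of the check looks something like this:

set.seed(1)
M <- 100000; N <- 10
sims <- replicate(M, {
  x <- rnorm(N)   # null hypothesis true: mu = 0, sigma^2 = 1
  xbar1 <- (x[1] - x[2]) / sqrt(2 * N)
  s2_1  <- (sum(x^2) - N * xbar1^2) / (N - 1)
  c(t  = mean(x) / sqrt(var(x) / N),
    t1 = xbar1   / sqrt(s2_1   / N))
})
# Under the null, both statistics reject at the nominal 5% rate
mean(abs(sims["t", ])  > qt(0.975, N - 1))
mean(abs(sims["t1", ]) > qt(0.975, N - 1))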

We now have two sets of statistics that have the same distributions, and will thus produce a significance test with precisely the same properties when the null hypothesis is true. Which should we choose? Fisher might object that \(\bar{x}_1\) and \(s^2_1\) are not sufficient, but this only pushes the problem onto sufficiency: why sufficiency?

The figure below shows that this matters for the example at hand. The figure shows 100,000 simulations of \(t\) and \(t_1\) plotted against one another; when \(t\) is large, \(t_1\) tends to be small; when \(t\) is small, \(t_1\) tends to be large. The red dashed lines show the \(\alpha=0.05\) critical values for each test, and the blue curves show the bounds within which \((t, t_1)\) must fall.

The red point shows \(t\) and \(t_1\) for the Cushny and Peebles data set; \(t\) would lead to a rejection of the null, while \(t_1\) would not.

Examining the definitions of \(\bar{x}_1\) and \(s^2_1\), it isn’t difficult to see what is happening: when the null is true, these statistics have distributions identical to those of \(\bar{x}\) and \(s^2\). However, when the null is false, they do not. The distribution of \(\bar{x}_1\) will continue to have a mean of 0 (instead of \(\mu\)), while the distribution of \(s^2_1\) will become more spread out than that of \(s^2\). The effect of this is that the power of the test based on \(t_1\) will decrease as the true effect size increases!
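A quick simulation sketch (my own illustration, not Neyman’s) makes the reversal visible: as the true effect size grows, the rejection rate of the test based on \(t_1\) actually falls below \(\alpha\).

power_sim <- function(delta, M = 20000, N = 10, crit = qt(0.975, N - 1)) {
  rej <- replicate(M, {
    x <- rnorm(N, mean = delta)           # true effect size delta (sigma = 1)
    xbar1 <- (x[1] - x[2]) / sqrt(2 * N)
    s2_1  <- (sum(x^2) - N * xbar1^2) / (N - 1)
    c(t  = abs(mean(x) / sqrt(var(x) / N)) > crit,
      t1 = abs(xbar1   / sqrt(s2_1   / N)) > crit)
  })
  rowMeans(rej)
}
sapply(c(0, 0.5, 1, 2), power_sim)   # t's power rises toward 1; t1's falls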

A consideration of both Type I and Type II errors makes it obvious which test to choose; we should choose the test that yields the higher power (this is, incidentally, closely related to the Bayesian solution to the problem through the Neyman-Pearson lemma). The use of \(t_1\) would lead to a bad test, when both Type I error rates and Type II error rates are taken into account. A significance test, which does not consider Type II error rates, has no account of why \(t\) is better than \(t_1\).

More problems

The previous development is bad for a significance test; it shows that there can be two statistics that lead to different answers, yet have the same properties from the perspective of significance testing. Following this, Neyman proves something even more striking: we can always find a statistic that will have the same long-run distribution under the null as \(t\), yet will yield an arbitrarily high test statistic for our sample. This means that we cannot simply base our choice of test statistic on what would yield a more or less conservative test statistic for our sample.

Neyman defines some constants \(\alpha_i\) using the obtained samples \(x_i\):
\[ \alpha_i = \frac{x_i}{\sqrt{\sum_{i = 1}^N x_i^2}} \]
then for future samples \(y_i\), \(i=1,\ldots,N\), defines
\[
\begin{aligned}
\bar{y}_2 &= \frac{\sum_{i=1}^N \alpha_i y_i}{\sqrt{N}},\\
s^2_2 &= \frac{\sum_{i=1}^N y_i^2 - N\bar{y}_2^2}{N-1},
\end{aligned}
\]
and of course we can compute a \(t\) statistic \(t_2\) based on these values. If we use our observed \(x_i\) values for \(y_i\), this will yield \(t_2=\infty\), because \(s^2_2 = 0\), exactly! However, if we check the long-run distribution of these statistics under the null hypothesis, we again find that they are exactly the same as \(\bar{x}\), \(s^2\), and \(t\):
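Purely as an illustration (this is not Neyman’s code), applying the construction to the observed difference scores d shows the degenerate behaviour directly:

alpha <- d / sqrt(sum(d^2))          # constants fixed by the observed sample

t2_stat <- function(y) {
  ybar2 <- sum(alpha * y) / sqrt(N)
  s2_2  <- (sum(y^2) - N * ybar2^2) / (N - 1)
  ybar2 / sqrt(s2_2 / N)
}

t2_stat(d)          # y = x: s^2_2 is 0 (up to rounding), so t_2 blows up
t2_stat(rnorm(N))   # for a fresh null sample, t_2 behaves like an ordinary t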

If we considered the power of the test based on \(t_2\), we would find that it is worse than the power based on \(t\). The significance test offers no reason why \(t\) is better than \(t_2\), but a consideration of the frequentist properties of the test does. Neyman has thus shown that we must consider an alternative hypothesis in choosing a test statistic; otherwise we can select a test statistic to give us any result we like.

Conclusion: The importance of power

At the risk of belaboring a point that has been made over and over, power is not a mere theoretical concern for a frequentist. Neyman and Pearson offer an account of why some tests are better than others and, in some cases, an account of which test is optimal; however, just because a test is optimal does not mean it is good.

We might always manage to avoid Type I errors at the same rate (assuming the null hypothesis is true), but as Neyman points out, this is not enough; one needs to consider power, and how one wants to trade off Type I error and power. A good frequentist test may balance Type I and Type II error rates; a good frequentist test may control the Type I error rate while keeping power above a certain level. From a frequentist perspective these are decisions that must be made prior to an experiment; none of them can be addressed within the significance testing framework.

To recap both posts, Neyman makes clear why significance testing, as commonly deployed in the scientific literature, does not offer a good theory of inference: it fails epistemically by allowing arbitrary “rational” beliefs, and it fails on statistical grounds by allowing arbitrary results.

From a frequentist perspective, what might a significance test be useful for? Neyman allows that before a critical set of experiments is performed, exploratory research must be undertaken. Generating a test or a confidence procedure requires some assumptions. Neyman does not offer an account of the process of choosing these assumptions, and seems content to leave this up to substantive researchers. Once a formal inference is needed, however, it is clear that from a frequentist perspective the significance test is inadequate.

[the source to this blog post, including R code to reproduce the figures, is available here: https://gist.github.com/richarddmorey/5806aad7191377dcbf4f]