Category Archives: science


On radical manuscript openness

One of my papers that has attracted a lot of attention lately is “The Fallacy of Placing Confidence in Confidence Intervals,” in which we describe some of the fallacies held by the proponents and users of confidence intervals. This paper has been discussed on twitterreddit, on blogs (eg, here and here), and via email with people who found the paper in various places.  A person unknown to me has used the article as the basis for edits to the Wikipedia article on confidence intervals. I have been told that several papers currently under review cite it. Perhaps this is a small sign that traditional publishers should be worried: this paper has not been “officially” published yet.

I am currently wrapping up the final revisions on the paper, which has been accepted pending minor revisions at Psychonomic Bulletin & Review. The paper has benefited from an extremely public revision process. When I had a new major version to submit, I published the text and all code on github, and shared it via social media. Some of resulting discussions have been positive, others negative; some useful and enlightening, others not useful and frustrating. Most scientific publications almost exclusively reflect input from the coauthors and the editors and reviewers. This manuscript, in contrast, has been influenced by scores of people I’ve never met, and I think the paper is better for it.

This is all the result of my exploring ways to make my writing process more open, which led to the idea of releasing successive major versions of the text and R code on github with DOIs. But what about after it is published? How can manuscript openness continue after the magic moment of publication?

One of the downsides of the traditional scientific publishing model is that once the work is put into a “final” state, it becomes static. The PDF file format in which articles find their final form  and in which they are exchanged and read  enforces certain rigidity, a rigor mortis. The document is dead and placed behind glass for the occasional passerby to view. It is of course good to have a citable version of record; we would not, after all, want a document to be a moving target, constantly changing on the whim of the authors. But it seems like we can do better than the current idea of a static, final document, and I’d like to try.

I have created a website for the paper that, on publication, will contain the text of the paper in its entirety, free to read for anyone. It also contains extra material, such as teaching ideas and interactive apps to assist in understanding the material in the paper. The version of the website corresponding to the “published” version of the paper will be versioned on github, along with the paper. But unlike the paper at the journal, a website is flexible, and I intend to take advantage of this in several ways.

First, I have enabled annotation across the entire text. If you open part of the text and look in the upper right hand corner, you will see three icons that can be used to annotate the text:

The annotation tools.

Moreover, highlighting a bit of text will open up further annotation tools:

Highlighting the text brings up more annotation tools.

Anyone can annotate the document, and others can see the annotations you make. Am I worried that on the Internet, some people might not add the highest quality annotations? A bit. But my curiosity to see how this will be used, and the potential benefits, outweighs my trepidation.

Second, I will update the site with new information, resources, and corrections. These changes will be versioned on github, so that anyone can see what the changes were. Due to the fact that the journal will have the version of record, there is no possibility of “hiding” changes to the website. So I get the best of both worlds: the trust that comes with having a clear record of the process, with the ability to change the document as the need arises. And the entire process can be open, through the magic of github.
Third, I have enabled together.js across every page of the manuscript. together.js allows collaboration between people looking at the same website. Unlike, together.js is meant for small groups to privately discuss the content, not for public annotation. This is mostly to explore its possibilities for teaching and discussion, but I also imagine it holds promise for post-publication review and drafting critiques of the manuscript.
The together.js collaboration tools allow making your mouse movements and clicks visible to others, text chat, and voice chat.
Critics could discuss the manuscript using together.js, chatting about the content of the manuscript. The communication in together.js is peer-to-peer, ensuring privacy; nothing is actually being managed by the website itself, except for making the collaboration tools available.

The best part of this is that it requires no action or support from the publisher. This is essentially a sophisticated version of a pre-print, which I would release anyway. We don’t have to wait for the publishers to adopt policies and technologies friendly for post-publication peer review; we can do it ourselves. All of these tools are freely available, and anyone can use them. If you have any more ideas for tools that would be useful for me to add, let me know; the experiment hasn’t even started yet!

Check out “The Fallacy of Placing Confidence in Confidence Intervals,” play around with the tools, and let me know what you think.

Guidelines for reporting confidence intervals

I’m working on a manuscript on confidence intervals, and I thought I’d share a draft section on the reporting of confidence intervals. The paper has several demonstrations of how CIs may, or may not, offer quality inferences, and how they can differ markedly from credible intervals, even ones with so-called “non-informative” priors.

Guidelines for reporting confidence intervals

Report credible intervals instead. We believe any author who chooses to use confidence intervals should ensure that the intervals correspond numerically with credible intervals under some reasonable prior. Many confidence intervals cannot be so interpreted, but if the authors know they can be, they should be called “credible intervals”. This signals to readers that they can interpret the interval as they have been (incorrectly) told they can interpret confidence intervals. Of course, the corresponding prior must also be reported. This is not to say that one can’t also call them confidence intervals if indeed they are; however, readers are likely more interested in the post-data properties of the procedure — not the coverage — if they are interested arriving at substantive conclusions from the interval.

Do not use procedures whose Bayesian properties are not known. As Casella (1992) pointed out, the post-data properties of a procedure are necessary for understanding what can be inferred from an interval. Any procedure whose Bayesian properties have not been explored can have properties that make it unsuitable for post-data inference. Procedures whose properties have not been adequately studied are inappropriate for general use.

Warn readers if the confidence procedure does not correspond to a Bayesian procedure. If it is known that a confidence interval does not correspond to a Bayesian procedure, warn readers that the confidence interval cannot be interpreted as having a X% probability of containing the parameter, that it cannot be interpreted in terms of the precision of measurement, and that cannot be said to contain the values that should be taken seriously: the interval is merely an interval that, prior to sampling, had a X% probability of containing the true value. Authors using confidence intervals have a responsibility to keep their readers from invalid inferences if they choose to use them, and it is almost sure that readers will misinterpret them without a warning (Hoekstra et al, 2014).

Never report a confidence interval without noting the procedure and the corresponding statistics. As we have described, there are many different ways to construct confidence intervals, and they will have different properties. Some will have better frequentist properties than others; some will correspond to credible intervals, and others will not. It is unfortunately common for authors to report confidence intervals without noting how they were constructed. As can be seen from the examples we’ve presented, this is a terrible practice because without knowing which confidence intervals was used, it is unclear what can be inferred. A narrow interval could correspond to very precise information or very imprecise information depending on which procedure was used. Not knowing which procedure was used could lead to very poor inferences. In addition, enough information should be presented so that any reader can compute a different confidence interval or credible interval. In most cases, this is covered by standard reporting practices, but in other cases more information may need to be given.

Consider reporting likelihoods or posteriors instead. An interval provides fairly impoverished information. Just as proponents of confidence intervals argue that CIs provide more information than a significance test (although this is debatable for many CIs), a likelihood or a posterior provides much more information than an interval. Recently, Cumming (2014) [see also here] has proposed so-called “cat’s eye” intervals which are either fiducial distributions or Bayesian posteriors under a “non-informative” prior (the shape is the likelihood, but he interprets the area, so it must be a posterior or a fiducial distribution). With modern scientific graphics so easy to create, along with the fact that likelihoods are often approximately normal, we see no reason why likelihoods and posteriors cannot replace intervals in most circumstances. With a likelihood or a posterior, the arbitrariness of the confidence or credibility coefficient is avoided altogether.


Some thoughts on replication

In a recent blog post, Simine Vazire discusses the problem with the logic of requiring replicators to explain when they reach different conclusions to the original authors. She frames it, correctly, it as asking people to over-interpret random noise. Vazire identifies the issue as a problem with our thinking: that we under-estimate randomness. I’d like to explore other ways in which our biases interferes with clear thinking about replication, and perhaps suggest some ways we can clarify it.

I suggest two ways in which we fool ourselves in thinking about replication: the concept of “replication” is unnecessarily asymmetric and an example of overly-linear thinking, and lack of distinction in practice causing a lack of distinction in theory.

Fooled by language: the asymmetry of “replication”

Imagine that a celebrated scientist, Dr. Smith, dies, and within her notes is discovered a half-written paper. Building on her previous work, this paper clearly lays out an creative experiment to test a theory. To avoid any complications such as post hoc theorising, assume the link between the theory and experiment is clear and follows from her previous work. On the Dr. Smith’s computer, along with the paper, is found a data set. Dr. Smith’s colleagues decide to finish the paper and publish it in her honor.
Given the strange circumstances of this particular paper’s history, another scientist, Dr. Jones, decides to replicate the study. Dr. Jones does his best to match the methods described in the paper, but obtains a different result. Dr. Jones tries to publish, but editors and reviewers demand an explanation: why is the replication different? Dr. Jones’ result is doubted until he can explain the difference.
Now suppose — unbeknownst to everyone — that the first experiment was never done. Dr. Smith simulated the data set as a pedagogical exercise to learn a new analysis technique. She never told anyone because she did not anticipate dying, of course, but everyone assumed the data was real. The second experiment is no replication at all; it is the first experiment done.
Does this change the evidential value of the Dr. Jones’ experiment at all? Of course not. The fact that the Dr. Smith’s experiment was not done is irrelevant to the evidence in Dr. Jones’ experiment. The evidence contained in a first experiment is the same, regardless of whether a second experiment is done (assuming, of course, that the methods are all sound). “Replication” is a useless label.
Calling the Dr. Jones’ experiment a “replication” focuses our attention on wrong relationship. One replicates an actual experiment that was done. However, the evidence that an experiment provides for a theory depends not on the relationship between the experiment’s methods and an experiment that was done in the past. Rather, the evidence depends on the relationship between the experiment’s methods and a hypothetical experiment that is designed to test the theory. One cannot replicate a hypothetical experiment, of course, because hypothetical experiments cannot be performed. Instead, one realizes a hypothetical experiment, and there may be several realizations of the same hypothetical experiment.
Thinking in this manner eliminates the asymmetric relationship between the two experiments. If both experiments can be realizations of the same hypothetical experiment designed to test a theory, which one came first is immaterial.* The burden is no longer on the second experimenter to explain why the results are different; the burden is on the advocates of the theory to explain the extant data, which now includes two differing results. (Vazire’s caution about random noise still applies here, as we still don’t want to over-explain differences; it is assumed that any post hoc explanation will be tested.)
Three hypothetical experiments that are tests of the same theory, along with five actually-run experiments. Hypothetical experiments A and B may be so-called “conceptual replications” of A, or tests of other aspects of the theory.
The conceptual distinction between a hypothetical experiment — that is, the experiment that is planned — and the actual experiment is critical. That hypothetical experiment can be realized in many ways: different times, different labs, different participants, even different stimuli, if these are randomly generated or are selected from a large collection of interchangeable stimuli. Importantly, when the first realization of the hypothetical experiment is done, it does not get methodological priority. It is temporally first, but is simply one way in which the experiment could have been realized. 
Conceptualizing the scientific process in this way prevents researchers who did an experiment first from claiming that their experiment takes priority. If you are “replicating” their actual experiment, then it makes sense that your results will get compared to theirs, in the same way a “copy” might be compared to the “original”. But conceptually, the two are siblings, not parent and child.

Lack of distinction in practice vs. theory

The critical distinctions above is the distinction between a hypothetical experiment and an actual one. I think this is an instance where modern scientific practice causes problems. Although the idea of a hypothetical experiment arises in any experimental planning process, consider the typical scientific paper, which has an introduction, then a brief (maybe even just a few sentences!) segue describing the logic of the experiment, into the methods of an actually-performed experiment. 
This structure means that the hypothetical experiment and the actual experiment are impossible to disentangle. This is one of the reasons, I think, why we talk about “replication” so much, rather than performing another realization of the hypothetical experiment. We have no hypothetical experiment to work from, because it is almost completely conflated with the actual experiment.
One initiative that will help with this problem is public pre-registration. A hypothetical experiment is laid out in an pre-registration document. Note that from a pre-registration document, the structure in the figure becomes clear. If someone posts a public pre-registration document, why does it matter who does the experiment first (aside from the ethical issue of “scooping”, etc)? No one is “replicating” anyone else; they are each separately realizing the hypothetical experiment that was planned.
But in current practice, which does not typically distinguish a hypothetical experiment and an actual one, the only way to add to the scientific literature about hypothetical experiment A is to try to “redo” one of its realizations. Any subsequent experiment is then logically dependent on the first actually performed experiment, and the unhelpful asymmetry crops up again.
I think it would be useful to have a different word than “replication”, because the connotation of the word “replication”, as a fascimile or a copy of something already existing, focuses our attention in unhelpful ways.
* Although logically which came first is immaterial, there may be statistical considerations to keep in mind, like the “statistical significance filter” that is more likely to affect a first study than a second. Also, as Vazire points out in the comments, the second study has fewer researcher degrees of freedom.