Category Archives: reproducible analysis


On radical manuscript openness

One of my papers that has attracted a lot of attention lately is “The Fallacy of Placing Confidence in Confidence Intervals,” in which we describe some of the fallacies held by the proponents and users of confidence intervals. This paper has been discussed on Twitter and Reddit, on blogs (e.g., here and here), and via email with people who found the paper in various places. A person unknown to me has used the article as the basis for edits to the Wikipedia article on confidence intervals. I have been told that several papers currently under review cite it. Perhaps this is a small sign that traditional publishers should be worried: this paper has not been “officially” published yet.


I am currently wrapping up the final revisions on the paper, which has been accepted pending minor revisions at Psychonomic Bulletin &amp; Review. The paper has benefited from an extremely public revision process. When I had a new major version to submit, I published the text and all code on github, and shared it via social media. Some of the resulting discussions have been positive, others negative; some useful and enlightening, others not useful and frustrating. Most scientific publications almost exclusively reflect input from the coauthors, editors, and reviewers. This manuscript, in contrast, has been influenced by scores of people I’ve never met, and I think the paper is better for it.

This is all the result of my exploring ways to make my writing process more open, which led to the idea of releasing successive major versions of the text and R code on github with DOIs. But what about after it is published? How can manuscript openness continue after the magic moment of publication?

One of the downsides of the traditional scientific publishing model is that once the work is put into a “final” state, it becomes static. The PDF file format, in which articles find their final form and in which they are exchanged and read, enforces a certain rigidity, a rigor mortis. The document is dead and placed behind glass for the occasional passerby to view. It is of course good to have a citable version of record; we would not, after all, want a document to be a moving target, constantly changing on the whim of the authors. But it seems like we can do better than the current idea of a static, final document, and I’d like to try.

I have created a website for the paper that, on publication, will contain the text of the paper in its entirety, free to read for anyone. It also contains extra material, such as teaching ideas and interactive apps to assist in understanding the material in the paper. The version of the website corresponding to the “published” version of the paper will be versioned on github, along with the paper. But unlike the paper at the journal, a website is flexible, and I intend to take advantage of this in several ways.

First, I have enabled hypothes.is annotation across the entire text. If you open part of the text and look in the upper right-hand corner, you will see three icons that can be used to annotate the text:

The hypothes.is annotation tools.

Moreover, highlighting a bit of text will open up further annotation tools:

Highlighting the text brings up more annotation tools.

Anyone can annotate the document, and others can see the annotations you make. Am I worried that on the Internet, some people might not add the highest quality annotations? A bit. But my curiosity to see how this will be used, and the potential benefits, outweigh my trepidation.

Second, I will update the site with new information, resources, and corrections. These changes will be versioned on github, so that anyone can see what the changes were. Because the journal will have the version of record, there is no possibility of “hiding” changes to the website. So I get the best of both worlds: the trust that comes with having a clear record of the process, along with the ability to change the document as the need arises. And the entire process can be open, through the magic of github.
Third, I have enabled together.js across every page of the manuscript. together.js allows collaboration between people viewing the same website. Unlike hypothes.is, together.js is meant for small groups to privately discuss the content, not for public annotation. This is mostly to explore its possibilities for teaching and discussion, but I also imagine it holds promise for post-publication review and drafting critiques of the manuscript.

The together.js collaboration tools allow you to make your mouse movements and clicks visible to others, along with text chat and voice chat.

Critics could use together.js to chat about the content of the manuscript as they view it. The communication in together.js is peer-to-peer, ensuring privacy; nothing is actually managed by the website itself, except for making the collaboration tools available.

The best part of this is that it requires no action or support from the publisher. This is essentially a sophisticated version of a pre-print, which I would release anyway. We don’t have to wait for the publishers to adopt policies and technologies friendly for post-publication peer review; we can do it ourselves. All of these tools are freely available, and anyone can use them. If you have any more ideas for tools that would be useful for me to add, let me know; the experiment hasn’t even started yet!

Check out “The Fallacy of Placing Confidence in Confidence Intervals,” play around with the tools, and let me know what you think.

On making a Bayesian omelet

My colleagues Eric-Jan Wagenmakers and Jeff Rouder and I have a new manuscript in which we respond to Hoijtink, van Kooten, and Hulsker’s in-press manuscript, Why Bayesian Psychologists Should Change the Way They Use the Bayes Factor. They suggest a method for “calibrating” Bayes factors using error rates. We show that this method is fatally flawed, and along the way we describe how we think about the subjective properties of the priors we use in our Bayes factors:


“…a particular researcher’s subjective prior is of limited use in the context of a public scientific discussion. Statistical analysis is often used as part of an argument. Wielding a fully personal, subjective prior and concluding ‘If you were me, you would believe this’ might be useful in some contexts, but in others it is less useful. In the context of a scientific argument, it is much more useful to have priors that approximate what a reasonable, but somewhat-removed researcher would have in the situation. One could call this a ‘consensus prior’ approach. The need for broadly applicable arguments is not a unique property of statistics; it applies to all scientific arguments. We do not argue to convince ourselves; we should therefore make use of statistical arguments that are not pegged to our own beliefs…

It should now be obvious how we make our ‘Bayesian omelet’; we break the eggs and cook the omelet for others in the hopes that it is something like what they would choose for themselves. With the right choice of ingredients, we think our Bayesian omelet can satisfy most people; others are free to make their own, and we would be happy to help them if we can. “


Our completely open, reproducible manuscript — “Calibrated” Bayes factors should not be used: a reply to Hoijtink, van Kooten, and Hulsker — along with a supplement and R code, is available on github (with DOI!).

Embedding RData files in Rmarkdown files for more reproducible analyses

For those of us interested in reproducible analysis, Rmarkdown is a great way of communicating our code to other researchers. Rstudio, in particular, makes it very easy to create attractive HTML documents containing text, code, and figures, which can then be sent to colleagues or put on the internet for anyone to see. If you aren’t using Rmarkdown for your statistical analyses, I recommend you start; you’ll never go back to simple script files again (and your colleagues won’t want you to).

In this post, I describe how to improve your Rmarkdown by embedding data that can be downloaded by anyone viewing the document in a modern browser with javascript enabled. For a quick look, see the example Rmd file and resulting HTML file.

One of the drawbacks of Rmarkdown, from a reproducible analysis perspective, is that the data is not part of the document itself. Typically, an Rmarkdown file will use R code to load a file from your disk, and when you send the resulting HTML file to a colleague, or put it on the internet, that data file is separate. It must be sent in an email or placed on a server to be downloaded.

This raises the possibility that the data could become separated from the code, which I think is a terrible thing for reproducible analysis. In my mind, the document and data should travel together as a single file. What we would like is a method of encoding R data into the HTML file such that anyone who has access to the HTML file can download the data, without even needing access to the internet.

As it turns out, files can be encoded in an HTML document via the URI data scheme. All we need is an R function that encodes the data, and produces a link to enable downloading the data.
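To make the scheme concrete, here is a minimal sketch of base64 encoding and data-URI construction in base R. The `to_base64` helper is purely illustrative, written only to show what is happening under the hood; in practice the base64enc package’s `dataURI()` does this work for us, as in the function that follows.

```r
# Illustrative base64 encoder in base R (in practice, use base64enc::dataURI)
to_base64 <- function(raw_bytes) {
  alphabet <- strsplit(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/", ""
  )[[1]]
  n <- length(raw_bytes)
  pad <- (3 - n %% 3) %% 3                      # bytes of zero padding needed
  padded <- c(as.integer(raw_bytes), rep(0L, pad))
  out <- character(0)
  # Each group of 3 bytes (24 bits) becomes 4 characters of 6 bits each
  for (i in seq(1, length(padded), by = 3)) {
    b <- padded[i:(i + 2)]
    idx <- c(bitwShiftR(b[1], 2),
             bitwOr(bitwShiftL(bitwAnd(b[1], 3L), 4), bitwShiftR(b[2], 4)),
             bitwOr(bitwShiftL(bitwAnd(b[2], 15L), 2), bitwShiftR(b[3], 6)),
             bitwAnd(b[3], 63L))
    out <- c(out, alphabet[idx + 1])
  }
  # Replace the trailing characters that encode only padding with '='
  if (pad > 0) out[(length(out) - pad + 1):length(out)] <- "="
  paste(out, collapse = "")
}

# A data URI is just the MIME type plus the base64-encoded bytes
uri <- paste0("data:text/plain;base64,", to_base64(charToRaw("hello")))
```

Each 3 bytes of input become 4 characters of output, which is where base64’s roughly one-third size inflation comes from.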


setDownloadURI = function(list, filename = stop("'filename' must be specified"),
                          textHTML = "Click here to download the data.",
                          fileext = "RData", envir = parent.frame()) {
  require(base64enc, quietly = TRUE)

  # Random id for the anchor element, so several links can coexist in one document
  divname = paste(sample(LETTERS), collapse = "")

  # Save the named variables to a temporary RData file
  tf = tempfile(pattern = filename, fileext = fileext)
  save(list = list, file = tf, envir = envir)
  filenameWithExt = paste(filename, fileext, sep = ".")

  # Encode the RData file as a base64 data URI
  uri = dataURI(file = tf, mime = "application/octet-stream", encoding = "base64")

  # Inject HTML/javascript that turns the data URI into a download link
  cat("<a style='text-decoration: none' id='", divname, "'></a>
<script>
var a = document.createElement('a');
var div = document.getElementById('", divname, "');
div.appendChild(a);
a.setAttribute('href', '", uri, "');
a.innerHTML = '", textHTML, "' + ' (", filenameWithExt, ")';
if (typeof a.download != 'undefined') {
  a.setAttribute('download', '", filenameWithExt, "');
} else {
  a.setAttribute('onclick', 'confirm(\"Your browser does not support the download HTML5 attribute. You must rename the file to ", filenameWithExt, " after downloading it (or use Chrome/Firefox/Opera).\")');
}
</script>", sep = "")
}

The first argument of the function, list, is a character vector containing names of variables to save in the RData file.

Once this function is declared, all we need to do is call it in our Rmd file. If we use the argument results = 'asis' in our R code block, it will inject the appropriate HTML code into our compiled HTML document to allow a download of the embedded data as an RData file, and anyone with the HTML file can download it.
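For example, a chunk like the following in the Rmd source (the variable name `mydata` is just for illustration) will render a working download link in the compiled HTML:

````markdown
```{r, results='asis', echo=FALSE}
mydata = rnorm(100)  # some data to share (illustrative)
setDownloadURI("mydata", filename = "mydata")
```
````

Anyone who opens the resulting HTML file can then click the link to save `mydata.RData`, even offline.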

Unfortunately, blogger will not allow me to embed the data into a post; therefore, a complete, self-contained example Rmd file can be found here, and the resulting HTML file can be found here.

Keep in mind, however, that the data file is actually embedded in the HTML file, so the resulting HTML file can be very large if your data file is large. Also consider that the data are encoded in base64, which increases the size of the file by about a third over the equivalent RData binary file. For very large data sets, one might consider hosting them outside of the HTML file; but for many purposes, the technique I describe will improve the ease with which you can share reproducible analyses.