# Statistical alchemy and the “test for excess significance”

[This post is based largely on my 2013 article for Journal of Mathematical Psychology; see the other articles in that special issue as well for more critiques.]

When I tell people that my primary area of research is statistical methods, one of the reactions I often encounter from people untrained in statistics is that “you can prove anything with statistics.” Of course, this rankles, first because it isn’t true (unless you use a very strange definition of prove) and second because I’ve spent years learning the limitations of statistics, and there are many limitations. These limitations exist, however, in the context of enormous successes. In the sciences, the field of statistics rightly has a place of honor.

This success is evidenced by the great number of scientific arguments that are supported by statistical methods. Not all statistical arguments are created equal, of course. But the respect with which statistics is viewed has the unfortunate downside that a statistical argument can apparently turn a leaden hunch into a golden “truth”. This post is about such statistical alchemy.

## The gold: Justified substantive claims

One of the goals we all have as scientists is to make claims backed by solid evidence. This is harder than it seems. Ideally we would prefer that evidence be ironclad and assumptions unnecessary. In real-life cases, however, the strength of evidence does not provide certainty, and assumptions are needed. The key to good argument, then, is that all assumptions are made explicit, the chain of reasoning is clear and logical, and the resulting evidence is strong enough to garner agreement.

Such cases we might call the “gold standard” for scientific arguments. We expect this sort of argument when someone makes a strong claim. This is the stuff that the scientific literature should be made of, for the most part. Among other things, the gold standard requires careful experimental design and execution, deliberate statistical analysis and avoidance of post hoc reasoning, and a willingness to explore the effects of unstated assumptions in one’s reasoning.

Hunches are a necessary part of science. Science is driven by a creative force that cannot (at this point) be quantified, and a sneaking suspicion that something is true is often the grounds on which we design experiments. Hunches are some of the most useful things in science, just as lead is an exceptionally useful metal. Like lead, hunches are terribly common. We all have many hunches, and often we don’t know where they come from.

What makes a hunch is that it doesn’t have solid grounds to back it up. Hunches often turn to dust upon closer examination: they may contradict other knowledge, they may be based on untenable assumptions, or the evidence for them may turn out to be much weaker than it seemed when we examine it. If a hunch survives a solid test, it is no longer a hunch; but so long as we do not test them (or cannot test them) they remain hunches.

## The alchemy of statistics

One of the most dangerous, but unfortunately common, ways in which statistics is used is to magically turn hunches into “truth”. The mother of all statistical alchemy is the Fisherian $p$ value, by which hunches based on “low” $p$ values are turned into statements about the implausibility of the null hypothesis. Although it seems reasonable, when the hunch on which $p$ values rest is examined by either frequentists or Bayesians, it is found wanting.

However, my main focus here is not specifically $p$ values. I’d like to focus on one particularly recent special case of statistical alchemy among methodologists called the “test for excess significance”. Here’s the hunch: in any series of typically-powered experiments, we expect some to be non-significant due to sampling error, even if a true effect exists. If we see a series of five experiments, and they are all significant, one thinks that either they are very high powered, the authors got lucky, or there are some nonsignificant studies missing. For many sets of studies, the first seems implausible because the effect sizes are small; the last is important, because if it is true then the picture we get of the results is misleading.

Just to be clear, this hunch makes sense to me, and I think to most people. However, without a formal argument it remains a hunch. Ioannidis and Trikalinos (2007) suggested formalising it:

We test in a body of $n$ published studies whether the observed number of studies $O$ with ‘positive’ results at a specified $\alpha$ level on a specific research question is different from the expected number of studies with ‘positive’ results $E$ in the absence of any bias. (Ioannidis and Trikalinos, 2007, p246)

Of “biases”, Ioannidis and Trikalinos say that “biases…result in a relative excess of published statistically significant results as compared with what their true proportion should be in a body of evidence.” If there are too many significant studies, there must be too few nonsignificant ones, hence the idea of “relative” excess.

Suppose there is a true effect size that is being pursued by study $i$ $(i = 1,\ldots,n)$ and its size is $\theta_i$…[T]he expected probability that a specific single study $i$ will find a ‘positive’ result equals $1 - \beta_i$, its power at the specified $\alpha$ level. (Ioannidis and Trikalinos, 2007, p246)

So far so good; this is all true. Ioannidis and Trikalinos continue:

Assuming no bias, $E$ equals the sum of the expected probabilities across all studies on the same question: $$E = \sum_{i=1}^n (1 - \beta_i).$$ (Ioannidis and Trikalinos, 2007, p246)

Here we have a bit of a mystery. That $E$ equals the sum of the expected probabilities is merely asserted. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false. Suppose I run experiments until I obtain $k$ nonsignificant studies ($k>0$). The expected number of significant studies in a set of $n$ is exactly $n-k$. Depending on the stopping rule for the studies, which is unknown (and unknowable or even meaningless, in most cases), $E$ can be chosen to be 0 (stop after $n$ nonsignificant studies), $n$ (stop after $n$ significant studies), or any number in between!
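To make the stopping-rule point concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than part of Ioannidis and Trikalinos’ procedure: the helper name `run_until_k_nonsignificant`, a common power of 0.5 for every study, and the rule “stop after $k = 1$ nonsignificant study”. It compares the summed-power value of $E$ with what actually happens in sets of $n = 5$ studies generated under that rule.

```python
import random


def run_until_k_nonsignificant(power, k, rng):
    """Simulate one set of studies, stopping once k studies are nonsignificant.

    Returns (n, n_significant) for the resulting set.
    """
    n = 0
    nonsignificant = 0
    while nonsignificant < k:
        n += 1
        if rng.random() >= power:  # this study misses significance
            nonsignificant += 1
    return n, n - nonsignificant


rng = random.Random(1)
power, k = 0.5, 1  # hypothetical values, chosen only for illustration
sets = [run_until_k_nonsignificant(power, k, rng) for _ in range(20000)]

# Keep only the simulated sets that happened to contain n = 5 studies.
n = 5
sig_counts = [sig for size, sig in sets if size == n]

formula_E = n * power              # sum of (1 - beta_i) when every study has equal power
observed_values = set(sig_counts)  # distinct numbers of significant studies actually seen

print("E from summed power:", formula_E)
print("significant counts observed in sets of 5:", observed_values)
```

Under this stopping rule every set of five studies contains exactly $n - k = 4$ significant results, whatever the summed power says. A different stopping rule would give a different answer, which is the problem.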

Ioannidis and Trikalinos go on to say that “[t]he expected number $E$ is compared against the observed number $O$ of ‘positive’ studies” and if there is an “excess” then bias is claimed, by standard significance test logic. Here, things go off the rails again. First, as we have seen, $E$ could be anything. Second, a significance test is performed by computing the probability of observing an outcome as extreme or more extreme than the one observed, given no “bias”. What is more extreme? Suppose we observe 4 significant results in 5 studies. It seems clear that 5/5 is more extreme. Is 6/6 possible? No mention is made of the assumed sampling process, so how are we to know what the more extreme samples would be? And if a sampling assumption were made explicit, how could we know whether that was a reasonable assumption for the studies at hand? The $p$ value is simply incalculable from the information available.
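The sketch below shows how much has to be assumed before any number comes out. It is not Ioannidis and Trikalinos’ procedure verbatim. It assumes, purely for illustration, that the number of studies is fixed in advance at $n = 5$, that every study has the same power of 0.5, and that studies are independent, so the count of significant results is binomial. The function name `binomial_upper_tail` is likewise hypothetical.

```python
from math import comb


def binomial_upper_tail(observed, n, power):
    """P(X >= observed) for X ~ Binomial(n, power)."""
    return sum(
        comb(n, x) * power**x * (1 - power) ** (n - x)
        for x in range(observed, n + 1)
    )


n, power = 5, 0.5  # hypothetical values; a fixed n is itself an assumption

p_4_of_5 = binomial_upper_tail(4, n, power)  # "more extreme" means 4/5 or 5/5 only
p_5_of_5 = binomial_upper_tail(5, n, power)

print(p_4_of_5)  # 0.1875
print(p_5_of_5)  # 0.03125
```

The set of “more extreme” outcomes is just 5/5 only because $n$ was declared fixed. If the sampling process allowed a sixth study, the tail would contain different outcomes and the $p$ value would change.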

Suppose I find a “significant” result; what do I infer? Ioannidis and Trikalinos claim that they “have introduced an exploratory test for examining whether there is an excess of significant findings in a body of evidence” (p 251). This is a very strange assertion. When we do a statistical test, we are not asking a question about the data itself; rather, we are inferring something about a population. The “body of evidence” is the sample; we infer from the sample to the population. But what is the population? Or, put in frequentist terms, what is the sampling process from which the studies in question arise? Given that this question is central to the statistical inference, one would think it would be addressed, but it is not. Dealing with this question would require a clear definition of a “set” of studies, and how this set is sampled.

Are these studies one sample of hypothetical sets of studies from all scientific fields? Or perhaps they are a sample of studies within a specific field; say, psychology? Or from a subfield, like social psychology? Or maybe from a specific lab? There’s no way to uniquely answer this question, and so it isn’t clear what can be inferred. Am I inferring bias in all of science, in the field, the subfield, or the lab? And if any of these are true, why do they discuss bias in the sample instead? They have confused the properties of the population and sample in a basic way.

But even though these critical details are missing — details that are necessary to the argument — the authors go on to apply this to several meta-analyses, inferring bias in several. Other authors have applied the method to claim “evidence” of bias in other sets of studies.

## …and the alchemy is complete

We see that Ioannidis and Trikalinos have unstated assumptions of enormous import, they have failed to clearly define any sort of sampling model, and they have not made clear the link between the act of inference (“we found a ‘significant’ result”) and what is to be inferred (“Evidence for bias exists in these studies.”). And this is all before even addressing the problematic nature of $p$ values themselves, which cannot be used as a measure of evidence. The test for “excess significance” is neither a valid frequentist procedure (due to the lack of a clearly defined sampling process) nor a valid Bayesian procedure.

But through the alchemy of statistics, Ioannidis and Trikalinos’ test for “excess significance” has given us the appearance of a justified conclusion. Bodies of studies are called into doubt, and the users of the approach continue to get papers published using the approach despite its utter lack of justification. We would not accept such shoddy modeling and reasoning for studying other aspects of human behavior. As Val Johnson put it in his comment on the procedure, “[We] simply cannot quite determine the level of absurdity that [we are] expected to ignore.” Why is this acceptable for deploying against groups of studies in the scientific literature?

The reason is simple: we all have the hunch. It seems right. Ioannidis and Trikalinos have given us a way to transmute our hunch that something is amiss into the gold of a publishable, evidence-backed conclusion. But it is an illusion; the argument simply falls apart under scrutiny.

This is bad science, and it should not be tolerated. Methodologists have the same responsibility as everyone else to justify their conclusions. The peer review system has failed to prevent the leaden hunch passing for gold, which is acutely ironic given how methodologists use the test to accuse others of bad science.

# To Beware or To Embrace The Prior

In this guest post, Jeff Rouder reacts to two recent comments skeptical of Bayesian statistics, and describes the importance of the prior in Bayesian statistics. In short: the prior gives a Bayesian model the power to predict data, and prediction is what allows the evaluation of evidence. Far from being a liability, Bayesian priors are what make Bayesian statistics useful to science.

Bayes’ Theorem is about 250 years old. For just about as long, there has been this one never-ending criticism—beware the prior. That is: priors are too subjective or arbitrary. In the last week I have read two separate examples of this critique in the psychological literature. The first comes from Savalei and Dunn (2015) who write,

…using Bayes factors further increases ‘researcher degrees of freedom,’ creating another potential QRP, because researchers must select a prior – a subjective expectation about the most likely size of the effect for their analyses. (Savalei and Dunn, 2015)

The second example is from Trafimow and Marks (2015) who write,

The usual problem with Bayesian procedures is that they depend on some sort of Laplacian assumption to generate numbers where none exist. (Trafimow and Marks, 2015)

The focus should be on the last part—generating numbers where none exist—which I interpret as questioning the appropriateness of priors. Though the critiques are subtly different, they both question the wisdom of Bayesian analysis for its dependence on a prior. Because of this dependence, researchers holding different priors may reach different conclusions from the same data. The implication is that ideally analyses should be more objective than the subjectivity necessitated by Bayes.

The critique is dead wrong. The prior is the strength rather than the weakness of the Bayesian method. It gives it all of its power to predict data, to embed theoretically meaningful constraint, and to adjudicate evidence among competing theoretical positions. The message here is to embrace the prior. My colleague Chris Donkin has borrowed Kubrick’s subtitle to say it best: “How I learned to stop worrying about and love the prior.” Here goes:

### Classical and Bayesian Models

Let’s specify a simple model both in classical and Bayesian form. Consider, for example, data $Y_1,\ldots,Y_N$ that are normally distributed with a known variance of 1. The conventional frequentist model is $$Y_i(\mu) \sim \mbox{Normal}(\mu,1).$$ There is a single parameter, $\mu$, which is the center of the distribution. Parameter $\mu$ is a single fixed value which, unfortunately, is not known to us. I have made $Y_i$ a function of $\mu$ to make the relationship explicit. Clearly, the distribution of each $Y_i$ depends on $\mu$, so this notation is reasonable.

The Bayesian model consists of two statements. The first one is the data model: $$Y_i \mid \mu \sim \mbox{Normal}(\mu,1),$$ which is very similar to the conventional model above. The difference is that $\mu$ is no longer a constant but a random variable. Therefore, we write the data model as a conditional statement: conditional on some value of $\mu$, the observations follow a normal at that mean. The data model, though conceptually similar to the frequentist model, is not enough for a Bayesian. It is incomplete because it is specified as a conditional statement. Bayesians need a second statement, a model on the parameter $\mu$. A common specification is $$\mu \sim \mbox{Normal}(a,b),$$ where $a$ and $b$, the mean and variance, are set by the analyst before observing the data.

From a classical perspective, Bayesians make an extra model specification, the prior on parameters, that is unnecessary and unwarranted. Two researchers can have the same model, that the data are normal, but have very different priors if they choose very different values of $a$ and/or $b$. With these different choices, they may draw different conclusions. From a Bayesian perspective, the classical perspective is incomplete because it only models the phenomena up to a function of unknown parameters. Classical models are good models if you know $\mu$ (say, as God does) but not so good if you don’t; and, of course, mortals don’t. This disagreement, whether Bayesian models have an unnecessary and unwise specification or whether classical models are incomplete, is critical to understanding why the priors-are-too-subjective critique is off target.

### Bayesian Models Make Predictions, Classical Models Don’t

One criterion that I adopt, and that I hope you do too, is that models should make predictions about data. Prediction is at the heart of deductive science. Theories make predictions, and then we check if the data have indeed conformed to these predictions. This view is not too alien; in fact, it is the stuff of grade-school science. Prediction to me means the ability to make probability statements about where data will lie before the data are collected. For example, if we agree that $\mu=0$ in the above model, we can now make such a statement, say that the probability that $Y_1$ is between -1 and 1 is about 68%.
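The 68% figure can be checked in a few lines of Python using only the standard library. The helper name `normal_cdf` is mine, introduced just for this sketch; the model is the one above, with $\mu = 0$ and a standard deviation of 1.

```python
from math import erf, sqrt


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a Normal(mean, sd^2) variable."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


# Probability that Y_1 falls between -1 and 1 when mu = 0 and the variance is 1.
prob = normal_cdf(1.0) - normal_cdf(-1.0)
print(round(prob, 4))  # 0.6827
```

The statement can be made before any data are seen, which is what makes it a prediction. It is available only because a value for $\mu$ was agreed in advance.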

This definition of prediction, while common sense, is quite disruptive. Do classically-specified models predict data? I admit a snarky thrill in posing this question to my colleagues who advocate classical methods. Sometimes they say “yes,” and then I remind them that the parameters remain unknown except in the large-sample limit. Since we don’t have an infinite amount of data, we don’t know the parameters. Sometimes they say they can make predictions with the best estimate of $\mu$, and I remind them that they need to see the data first to estimate $\mu$, and as such, it is not a prediction (not to mention the unaccounted sample noise in the best estimate). The conversation always ends a bit uneasily, with awkward smiles, and with the unavoidable conclusion that classical models do not predict data, at least not in the usual definition of “predict.”

The reason classical models don’t predict data is that they are incomplete. They are missing the prior, a specification of how the parameters vary. With this specification, the predictions are a straightforward application of the Law of Total Probability (http://en.wikipedia.org/wiki/Law_of_total_probability): $$\Pr(Y_i) = \int \Pr(Y_i \mid \mu) \Pr(\mu)\, d\mu.$$ The respective probabilities (densities) $\Pr(Y_i \mid \mu)$ and $\Pr(\mu)$ are derived from the model specifications. Hence, $\Pr(Y_i)$ is computable. We can state the probability that an observation lies in any interval before we see the data. Bayesian specifications predict data; classical specifications don’t.
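As a rough illustration of that integral, the sketch below evaluates the marginal prediction for the normal model above in two ways: by brute-force numerical integration over $\mu$, and with the standard closed form, in which a Normal$(a,b)$ prior and unit data variance give a Normal$(a, 1+b)$ prediction. The prior values $a = 0$ and $b = 4$ and all function names are assumptions chosen only for the example.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mean, var):
    """Density of a Normal(mean, var) variable at x."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2.0 * pi * var)


def normal_cdf(x, mean, var):
    """Cumulative distribution function of a Normal(mean, var) variable."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * var)))


a, b = 0.0, 4.0  # hypothetical prior mean and variance for mu


def predictive_density(y, steps=4000, width=10.0):
    """Midpoint-rule approximation to the integral of Pr(y | mu) Pr(mu) over mu."""
    sd = sqrt(b)
    lo, hi = a - width * sd, a + width * sd
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        mu = lo + (j + 0.5) * h
        total += normal_pdf(y, mu, 1.0) * normal_pdf(mu, a, b)
    return total * h


numeric = predictive_density(1.0)
closed_form = normal_pdf(1.0, a, 1.0 + b)  # marginal is Normal(a, 1 + b)

# Predicted probability, before seeing data, that an observation lies in (-1, 1).
prob_interval = normal_cdf(1.0, a, 1.0 + b) - normal_cdf(-1.0, a, 1.0 + b)

print(numeric, closed_form, prob_interval)
```

With this fairly diffuse prior the predicted probability of landing in $(-1, 1)$ is well under the 68% obtained when $\mu$ is fixed at 0. Uncertainty about $\mu$ spreads the prediction out.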

### Priors Instantiate Meaningful Constraint

The prior is not some nuisance that one must begrudgingly specify. Instead, it is a tool for instantiating theoretically meaningful constraint. Let’s take a problem near and dear to my children: whether the candy Smarties makes children smarter. If so, my kids have a very convincing argument for why I should buy them Smarties. I have three children, and these three don’t agree on much. So let’s assume the eldest thinks Smarties makes you smarter, the middle thinks Smarties makes you dumber if only to spite his older brother, and the youngest thinks it’s wisest to steer a course between her brothers. She thinks Smarties have no effect at all. They decide to run an experiment on 40 schoolmates where each schoolmate first takes an IQ test, then eats a Smartie, and then takes the IQ test again. The critical measure is the change in IQ, and for the sake of this simple demonstration, we discount any learning or fatigue confounds.

All three kids decided to instantiate their position within a Bayesian model. All three start with the same data model: $$Y_i \mid \mu \sim \mbox{Normal}(\mu,\sigma^2),$$ where $Y_i$ is the difference score for the $i$th kid, $\mu$ is the true effect of Smarties, and $\sigma^2$ is the variance of this difference across kids. For simplicity, let’s treat $\sigma=5$ as known, say as the known standard deviation of test-retest IQ score differences. Now each of my children needs a model on $\mu$, the prior, to instantiate their position. The youngest had it easiest. With no effect, her model on $\mu$ is $$M_0: \mu=0.$$ Next, consider the model of the oldest. He believes there is a positive effect, and knowing what he does about Smarties and IQ scores, he decides to place equal probability on values of $\mu$ between a 0-point and a 5-point IQ effect, i.e., $$M_1: \mu \sim \mbox{Uniform}(0,5).$$ The middle one, being his brother’s perfect contrarian, comes up with the mirror-symmetric model: $$M_2: \mu \sim \mbox{Uniform}(-5,0).$$

### Predictions Are The Key To Evidence

Now a full-throated disagreement among my children will inevitably result in one of them yelling, “I’m right; you’re wrong.” This proclamation will be followed by, “You’re so stupid.” The whole thing will go on for a while with hurled insults and hurt feelings. And if you think this juvenile behavior is limited to my children or children in general, then you may not know many psychological scientists. What my kids need is a way of using data to inform theoretically-motivated positions.

In a previous post, Richard Morey demonstrated, in the context of Bayesian t tests, how predictions may be used to state evidence. I state the point here for the problem my children face. Because my children are Bayesian, they may compute their predictions about the sample mean of the difference scores. Here they are for a sample mean across 40 kids:

My daughter with Model $M_0$ most boldly predicts that the sample mean will be small in magnitude, and her predictive density is higher than that of her brothers for $-1.15<\bar{Y}<1.15$. If the sample mean is in this range, she is more right than they are. Likewise if the sample mean is above 1.15, the oldest child is more right (Model $M_1$), and if the sample mean is below -1.15, the middle child is more right (Model $M_2$).

With this Bayesian setup, we as scientists can hopefully rise above the temptation to think in terms of right and wrong. Instead, we can state fine-grained evidence as ratios. For example, suppose we observe a mean of -1.4, which is indicated with the vertical dashed line. The most probable prediction comes from Model $M_2$, and it is almost twice as probable as Model $M_0$. This 2-to-1 ratio serves as evidence for a negative effect of Smarties relative to a null effect. The prediction for the negative model is 25 times as probable as that for the positive model, and thus the evidence for a negative-effects model is 25-to-1 compared to the positive-effects model. These ratios of marginal predictions are Bayes factors, which are an intuitive measure of evidence. Naturally, the meaning of the Bayes factor is bound to the model specifications.
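For readers who want to reproduce these ratios, here is a minimal Python sketch of the three predictions for the sample mean. It uses the values stated above ($\sigma = 5$, 40 kids, an observed mean of -1.4) and the fact that the sample mean is normal around $\mu$ with standard error $\sigma/\sqrt{40}$. The function names are my own, and the uniform-prior prediction is the standard closed-form result of integrating that normal density over the prior’s range.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd^2) variable at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


sigma, n_kids = 5.0, 40
se = sigma / sqrt(n_kids)  # standard error of the sample mean


def predict_point_null(ybar):
    """Predictive density of the sample mean under M_0: mu = 0."""
    return normal_pdf(ybar, 0.0, se)


def predict_uniform(ybar, lower, upper):
    """Predictive density of the sample mean when mu ~ Uniform(lower, upper)."""
    mass = normal_cdf((upper - ybar) / se) - normal_cdf((lower - ybar) / se)
    return mass / (upper - lower)


ybar = -1.4
m0 = predict_point_null(ybar)         # youngest: no effect
m1 = predict_uniform(ybar, 0.0, 5.0)  # oldest: positive effect
m2 = predict_uniform(ybar, -5.0, 0.0) # middle: negative effect

bf_20 = m2 / m0  # negative-effect model versus null
bf_21 = m2 / m1  # negative-effect model versus positive-effect model
print(round(bf_20, 2), round(bf_21, 1))
```

The two ratios come out at roughly 1.8 and 25, matching the “almost twice” and 25-to-1 statements above. Changing the range of either uniform prior changes the ratios, which is the sense in which the Bayes factor is bound to the model specifications.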

### Take Home

The prior is not some fudge factor. Different theoretically motivated constraints on data may be specified gracefully through the prior. With this specification, not only do competing models predict data, but stating evidence for positions is as conceptually simple as comparing how well each model predicts the observed data. Embrace the prior.

# BayesFactorExtras: a sneak preview

Felix Schönbrodt and I have been working on an R package called BayesFactorExtras. This package is designed to work with the BayesFactor package, providing features beyond the core BayesFactor functionality. Currently in the package are:

1. Sequential Bayes factor plots for visualization of how the Bayes factor changes as data come in: seqBFplot()
2. Ability to embed R objects directly into HTML reports for reproducible, sharable science: createDownloadURI()
3. Interactive BayesFactor objects in HTML reports; just print the object in a knitr document.
4. Interactive MCMC objects in HTML reports; just print the object in a knitr document.

All of these are pretty neat, but I thought I’d give a sneak preview of #4. To see how it works, click here to play with the document on Rpubs!

I anticipate releasing this to CRAN soon.