[This post is based largely on my 2013 article for Journal of Mathematical Psychology; see the other articles in that special issue as well for more critiques.]

When I tell people that my primary area of research is statistical methods, one of the reactions I often encounter from people untrained in statistics is that “you can prove anything with statistics.” Of course, this rankles, first because it isn’t true (unless you use a very strange definition of prove) and second because I’ve spent years learning the limitations of statistics, and there are many limitations. These limitations exist, however, in the context of enormous successes. In the sciences, the field of statistics rightly has a place of honor.

This success is evidenced by the great number of scientific arguments that are supported by statistical methods. Not all statistical arguments are created equal, of course. But the respect with which statistics is viewed has the unfortunate downside that a statistical argument can apparently turn a leaden hunch into a golden “truth”. This post is about such statistical alchemy.

## The gold: Justified substantive claims

One of the goals we all have as scientists is to make claims backed by solid evidence. This is harder than it seems. Ideally we would prefer that evidence be ironclad and assumptions unnecessary. In real-life cases, however, the strength of evidence does *not* provide certainty, and assumptions are needed. The key to good argument, then, is that all assumptions are made explicit, the chain of reasoning is clear and logical, and the resulting evidence is strong enough to garner agreement.

Such cases we might call the “gold standard” for scientific arguments. We expect this sort of argument when someone makes a strong claim. This is the stuff that the scientific literature *should* be made of, for the most part. Among other things, the gold standard requires careful experimental design and execution, deliberate statistical analysis and avoidance of post hoc reasoning, and a willingness to explore the effects of unstated assumptions in one’s reasoning.

## The lead: Hunches

Hunches are a necessary part of science. Science is driven by a creative force that cannot (at this point) be quantified, and a sneaking suspicion that something is true is often the grounds on which we design experiments. Hunches are some of the most useful things in science, just as lead is an exceptionally useful metal. Like lead, hunches are terribly common. We all have many hunches, and often we don’t know where they come from.

What makes a hunch is that it doesn’t have solid grounds to back it up. Hunches often turn to dust upon closer examination: they may contradict other knowledge, they may be based on untenable assumptions, or the evidence for them may turn out to be much weaker when we examine it. If a hunch survives a solid test, it is no longer a hunch; but so long as we do not test them (or cannot test them) they remain hunches.

## The alchemy of statistics

One of the most dangerous, but unfortunately common, ways in which statistics is used is to magically turn hunches into “truth”. The mother of all statistical alchemy is the Fisherian (p) value, by which hunches based on “low” (p) values are turned into statements about the implausibility of the null hypothesis. Although it seems reasonable, when the hunch on which (p) values rest is examined by either frequentists or Bayesians, it is found wanting.

However, my main focus here is not specifically (p) values. I’d like to focus on one particularly recent special case of statistical alchemy among methodologists called the “test for excess significance”. Here’s the hunch: in any series of typically-powered experiments, we expect some to be non-significant due to sampling error, even if a true effect exists. If we see a series of five experiments, and they are all significant, one thinks that either they are very high powered, the authors got lucky, or there are some nonsignificant studies missing. For many sets of studies, the first seems implausible because the effect sizes are small; the last is important, because if it is true then the picture we get of the results is misleading.

Just to be clear, this hunch makes sense to me, and I think to most people. However, without a formal argument it remains a hunch. Ioannidis and Trikalinos (2007) suggested formalising it:

Of “biases”, Ioannidis and Trikalinos say that “biases…result in a relative excess of published statistically significant results as compared with what their true proportion should be in a body of evidence.” If there are *too many* significant studies, there must be *too few* nonsignificant ones, hence the idea of “relative” excess.

So far so good; this is all true. Ioannidis and Trikalinos continue:

Here we have a bit of a mystery. That (E) equals the sum of the expected probabilities is merely asserted. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false. Suppose I run experiments until I obtain (k) nonsignificant studies ((k>0)). The expected number of significant studies in a set of (n) is exactly (n-k). Depending on the stopping rule for the studies, which is unknown (and unknowable or even meaningless, in most cases), (E) can be chosen to be 0 (stop after (n) nonsignificant studies), (n) (stop after (n) significant studies), or any number in between!
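To make the stopping-rule example concrete, here is a worked sketch under assumptions that are mine rather than Ioannidis and Trikalinos’: studies are independent, every study has the same Type II error rate \(\beta\), \(N\) is the (random) number of studies run before stopping at the \(k\)th nonsignificant result, and \(S\) is the number of significant studies in the set.

$$
P(N = n) = \binom{n-1}{k-1}\,\beta^{k}\,(1-\beta)^{\,n-k}, \qquad n \ge k,
$$

$$
E[S \mid N = n] = n - k, \qquad \text{whereas} \qquad \sum_{i=1}^{n} (1-\beta_i) = n(1-\beta).
$$

Under this rule, every set of size \(n\) contains exactly \(k\) nonsignificant studies, so the conditional expectation is \(n-k\) regardless of power. The two expressions coincide only in the special case \(\beta = k/n\).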

Ioannidis and Trikalinos go on to say that “[t]he expected number (E) is compared against the observed number (O) of ‘positive’ studies” and if there is an “excess” then bias is claimed, by standard significance test logic. Here, things go off the rails again. First, as we have seen, (E) could be anything. Second, a significance test is performed by computing the probability of observing an outcome as extreme or more extreme than the one observed, given no “bias”. What is more extreme? Suppose we observe 4 significant results in 5 studies. It seems clear that 5/5 is more extreme. Is 6/6 possible? No mention is made of the assumed sampling process, so how are we to know what the more extreme samples would be? And if a sampling assumption *were* made explicit, how could we know whether that was a reasonable assumption for the studies at hand? The (p) value is simply incalculable from the information available.

Suppose I find a “significant” result; what do I infer? Ioannidis and Trikalinos claim that they “have introduced an exploratory test for examining whether there is an excess of significant findings in a body of evidence” (p 251). This is a very strange assertion. When we do a statistical test, we are not asking a question about the data itself; rather, we are inferring something about a *population*. The “body of evidence” is the sample; we infer from the sample to the population. But what is the population? Or, put in frequentist terms, what is the sampling process from which the studies in question arise? Given that this question is central to the statistical inference, one would think it would be addressed, but it is not. Dealing with this question would require a clear definition of a “set” of studies, and how this set is sampled.

Are these studies one sample of hypothetical sets of studies from all scientific fields? Or perhaps they are a sample of studies within a specific field; say, psychology? Or from a subfield, like social psychology? Or maybe from a specific lab? There’s no way to uniquely answer this question, and so it isn’t clear *what* can be inferred. Am I inferring bias in all of science, in the field, the subfield, or the lab? And if any of these are true, why do they discuss bias in the *sample* instead? They have confused the properties of the population and sample in a basic way.

But even though these critical details are missing — details that are necessary to the argument — the authors go on to apply this to several meta-analyses, inferring bias in several. Other authors have applied the method to claim “evidence” of bias in other sets of studies.

## …and the alchemy is complete

We see that Ioannidis and Trikalinos have unstated assumptions of enormous import, they have failed to clearly define any sort of sampling model, and they have not made clear the link between the *act* of inference (“we found a ‘significant’ result”) and what is to be inferred (“Evidence for bias exists in *these* studies.”). And this is all before even addressing the problematic nature of (p) values themselves, which cannot be used as a measure of evidence. The test for “excess significance” is neither a valid frequentist procedure (due to the lack of a clearly defined sampling process) nor a valid Bayesian procedure.

But through the alchemy of statistics, Ioannidis and Trikalinos’ test for “excess significance” has given us the appearance of a justified conclusion. Bodies of studies are called into doubt, and the users of the approach continue to get papers published using the approach despite its utter lack of justification. We would not accept such shoddy modeling and reasoning for studying other aspects of human behavior. As Val Johnson put it in his comment on the procedure, “[We] simply cannot quite determine the level of absurdity that [we are] expected to ignore.” Why is this acceptable for deploying against groups of studies in the scientific literature?

The reason is simple: we all have the hunch. It *seems* right. Ioannidis and Trikalinos have given us a way to transmute our hunch that something is amiss into the gold of a publishable, evidence-backed conclusion. But it is an illusion; the argument simply falls apart under scrutiny.

This is bad science, and it should not be tolerated. Methodologists have the same responsibility as everyone else to justify their conclusions. The peer review system has failed to prevent the leaden hunch passing for gold, which is acutely ironic given how methodologists use the test to accuse others of bad science.


Hi Richard,

I find it frustrating that you repeat your arguments against the test for excess significance (TES) rather than respond to the counterarguments I have provided. I do not think such repetition is productive, but I will repeat my counterarguments and hope that you will push the debate forward. Like you, I will try to present my counterarguments in a slightly different way in the hope of gaining better understanding, but the counterarguments are essentially the same as what I previously presented.

1) Although the logic is similar, the TES is not a standard hypothesis test, so your concerns about populations and inferences are misplaced. The p-value that is calculated in a TES (Ptes) is an estimated probability that an exact replication of the reported experiments would produce significant outcomes at a rate at least as high as what was reported. Inasmuch as scientists currently care about replication success (as measured by statistical significance), this seems like a probability value that matters to people. (Just to be clear, I think we agree that scientists should not actually care so much about statistical significance, but that is a different issue.)

2) You are concerned that the formula Ioannidis & Trikalinos provided for E (the expected number of significant outcomes from a set of n studies) was "demonstrably false". However, none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process. Your study generation process works if you let n vary, but then the Ioannidis & Trikalinos formula is shown to be correct:

i) You say, "Suppose I run experiments until I obtain k nonsignificant studies (k>0). The expected number of significant studies in a set of n is exactly n−k." This is like saying that your process for generating coin flips is to flip n=10 fair coins to produce exactly k=3 tails (a stand-in for nonsignificant studies). That outcome might happen by chance (prob=0.1172), but you cannot define it as your coin-flipping process. Most of the time when you flip your 10 coins you will get something different than k=3 tails. Likewise, most of the time when you run 10 experiments you will get something other than 3 nonsignificant studies (how often depends on the power).
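The coin-flip probability quoted above can be checked directly. The following is a minimal sketch using only the Python standard library; the function name is chosen here purely for illustration.

```python
from math import comb

def prob_exact_tails(n_flips, n_tails, p_tail=0.5):
    """Binomial probability of exactly n_tails tails in n_flips independent flips."""
    return comb(n_flips, n_tails) * p_tail**n_tails * (1 - p_tail) ** (n_flips - n_tails)

# 10 fair coins, exactly 3 tails: C(10, 3) / 2**10 = 120 / 1024
p = prob_exact_tails(10, 3)
print(round(p, 4))  # 0.1172
```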

You could relax the requirement that n be a fixed number and instead just keep running experiments until you get k nonsignificant studies. But then you do not know the number of significant studies that will be produced until you actually gather the experimental results. On average, you will need n= k/b, where b is the Type II error rate (beta in the formula) (I assume all studies have the same Type II error rate to make the calculations easy). So, the expected number of studies to reject the null is

n – k = n – bn = n(1-b) = E

which agrees with the Ioannidis & Trikalinos formula.
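The arithmetic above can be checked with a small Monte Carlo sketch. It assumes, as in the comment, that every study has the same Type II error rate b and that studies are independent; the function name, the seed, and the values k = 3 and b = 0.5 are illustrative choices, not taken from any of the papers discussed.

```python
import random

def run_until_k_nonsignificant(k, type2_rate, rng):
    """Run studies until k are nonsignificant; return (n_studies, n_significant)."""
    nonsignificant = 0
    significant = 0
    while nonsignificant < k:
        # Each study is nonsignificant with probability type2_rate (assumed equal).
        if rng.random() < type2_rate:
            nonsignificant += 1
        else:
            significant += 1
    return nonsignificant + significant, significant

rng = random.Random(2013)  # arbitrary seed for reproducibility
k, b = 3, 0.5
runs = [run_until_k_nonsignificant(k, b, rng) for _ in range(50_000)]

mean_n = sum(n for n, _ in runs) / len(runs)
mean_sig = sum(s for _, s in runs) / len(runs)

print("mean number of studies:", mean_n, "theory k/b:", k / b)
print("mean significant:", mean_sig, "theory k(1-b)/b:", k * (1 - b) / b)
```

Note that these are averages over all runs, in which the number of studies varies from run to run; they are not averages within runs of one particular size.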

ii) You say, "E can be chosen to be 0 (stop after n nonsignificant studies)". If n is fixed, how is a scientist supposed to follow this procedure? What happens if they get a significant study? They cannot make E=0 without hiding that significant study.

If you relax the fixed n requirement, then you can keep running experiments until you get k>0 nonsignificant studies, but then we are back to case i), and you still have no way to set E=0.

iii) You say, "E can be chosen to be…n (stop after n significant studies), or any number in between!" For the same reasons as above, this statement is nonsense. If you can advise scientists on a valid procedure that guarantees producing n significant studies (whether n is fixed or variable), you should let them know. I am sure everyone would love to use it.

In short, you present impossible sampling procedures and then complain that the formula proposed by Ioannidis & Trikalinos does not handle your impossible situations.

(Continued)

(Part two)

3) You worry about the definition of "more extreme" used to calculate Ptes. In every case I have seen, Ptes is calculated relative to a hypothetical direct replication of the reported experiments. Thus, if there is a set of n experiments with k non-significant outcomes, then Ptes is the estimated probability of a replication of n experiments (with the same sample sizes and so forth) producing k or fewer non-significant outcomes. This approach, I think, reflects the common view of what experimental replication means, and it does not depend on the process by which the original experiments were generated; it depends on the process by which the experimental claims will be evaluated by replication studies. In this case, the definition of more extreme is self-evident and easily handled.
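The replication-based calculation described above can be sketched as follows. This is a minimal illustration, not the procedure from any published TES analysis: it assumes the per-experiment power values are already given (in practice they would have to be estimated), that the replication experiments are independent, and the function name `p_tes` is hypothetical.

```python
def p_tes(powers, observed_nonsignificant):
    """Probability that a replication of len(powers) independent experiments
    yields at most observed_nonsignificant nonsignificant outcomes."""
    n = len(powers)
    if not 0 <= observed_nonsignificant <= n:
        raise ValueError("observed_nonsignificant must be between 0 and n")
    # dist[s] = probability of exactly s significant outcomes so far
    dist = [1.0]
    for power in powers:
        new = [0.0] * (len(dist) + 1)
        for s, prob in enumerate(dist):
            new[s] += prob * (1.0 - power)  # this experiment is nonsignificant
            new[s + 1] += prob * power      # this experiment is significant
        dist = new
    min_significant = n - observed_nonsignificant
    return sum(dist[min_significant:])

# Five experiments, each with assumed power 0.5, all reported significant.
print(p_tes([0.5] * 5, 0))  # 0.03125
```

With equal power this reduces to a binomial tail probability; the loop only matters when the experiments have different power.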

Maybe there are scientists who judge replication in a different way (such as to run repeated experiments until you get k non-significant outcomes). This strikes me as a very strange attitude, and I have not fully thought through whether the concept of replication even makes sense with such a replication procedure. You may be correct that the TES does not apply to such a situation, but I think you (or someone) have to first show that such an approach makes sense for a replication study at all.

4) You claim that it is unclear whether a finding of "excess significance" applies to a field, subfield, or lab. Certainly there is a possibility of misapplication here (e.g., suggesting that excess significance in Bem's precognition studies indicates a problem with all of social psychology), but I have not yet seen it happen in practice. Excess significance is always relative to the theoretical claims being made. For the ways Ioannidis has used it, the theoretical claim is about the value of the meta-analyzed effect size. Some of my studies have followed that approach but other TES investigations have explored excess significance (or more generally "excess success") for a more complex set of theoretical claims (e.g., there is a significant interaction with a significant difference between A1 and A2 but not between B1 and B2). You can see the details in the papers. Importantly, these theoretical claims are proposed by the original authors, not by the TES investigator. The TES estimates the probability of producing experimental outcomes as good as (or better than) those used by the original authors to support their theoretical claims.

I welcome criticism of the TES as the method is important and it deserves careful scrutiny. However, claims about the TES being "alchemy" and a discussion of impossible study generation processes are not helpful, and they possibly encourage scientists to ignore problems with current scientific practice.

-Greg Francis

Hi Greg,

In response to (2), "However, none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process." I'm not fixing n, I'm conditioning on it. As I&T said, "We test in a body of n published studies…" That's the problem. The rest of your post misses the point. You confuse the study *generation* process with the study *selection* process. Studies are generated by some process that you don't know. You then ask "What's the probability, in a selected set of n studies, of seeing k nonsignificant studies?" That depends on the process that *generated* them. We're not *fixing* the number n, we're *conditioning* on it.

4. "Certainly there is a possibility of misapplication here (e.g., suggesting that excess significance in Bem's precognition studies indicates a problem with all of social psychology), but I have not yet seen it happen in practice." This is not a practical point, it is a theoretical point that has to do with the idea of a reference class (http://en.wikipedia.org/wiki/Reference_class_problem).

"I welcome criticism of the TES as the method is important and it deserves careful scrutiny. However, claims about the TES being "alchemy" and a discussion of impossible study generation processes are not helpful, and they possibly encourage scientists to ignore problems with current scientific practice."

Claims based on bad methods are not helpful, and possibly encourage scientists to ignore methodologists altogether. There's nothing "impossible" about the study generation process I describe; they're just stopping rules based on significance, and the size of the set is then conditioned on by the analyst wielding the TES.

If you don't believe me, here's a challenge: you pick a power and a random seed. I will simulate a very large "literature" according to the "experimenter behaviour" of my choice, importantly with no publication bias or other selection of studies. I will guarantee that I will use a behaviour that will generate experiment set sizes of 5. I will save the code and the "literature" coded in terms of "sets" of studies and how many significant and nonsignificant studies there are. You get to guess what the average number of significant studies is in sets of 5 via I&T's model, along with a 95% CI (I'll tell you the total number of such studies). That is, we're just using Monte Carlo to estimate the expected number of significant studies in sets of experiments n=5; that is, precisely what I&T use as the basis of their model (for the special case of n=5).

This will answer the question of "what is the expected number of nonsignificant studies in a set of n?"

"I find it frustrating that you repeat your arguments against the test for excess significance (TES) rather than respond to the counterarguments I have provided. I do not think such repetition is productive, but I will repeat my counterarguments and hope that you will push the debate forward. Like you, I will try to present my counterarguments in a slightly different way in the hope of gaining better understanding, but the counterarguments are essentially the same as what I previously presented."

This may surprise you, but the first draft of this blog post opened with an almost identical paragraph, from my own perspective. I deleted it.

Clearly at least one of us is confused. Maybe we can sort it out by trying your challenge. Power=0.5, random seed= 19374013

I am not sure your challenge gets to the main point of replicability though. A scientist wanting to replicate a set of findings is going to run the reported n studies, regardless of how they were generated.

I am traveling, so I probably will not take a careful look at what you generate until next week.

I think we are both operating in good faith. We just have quite different views.

Before I do this, though, I want to make sure that we agree on what this will show. I want to show that the expected number of nonsignificant studies in a set of n (=5) studies is not what I&T say it is, and hence, the reasoning behind the test is flawed (because "excess significance" is defined as deviation from this expected number). I also want to be clear what the prediction is here: Since the power of the test is .5, according to I&T, the expected number of nonsignificant studies in a set of 5 is 2.5. Agreed?
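For readers who want to see the shape of such a simulation, here is a minimal sketch of one possible "experimenter behaviour". It is not the code from the follow-up post; the behaviour (run studies until the first nonsignificant result, then report the whole set), the function names, and the number of simulated sets are all assumptions made for illustration. The power of 0.5 and the seed are the values proposed in this thread.

```python
import random

def simulate_literature(power, n_sets, seed):
    """Each experimenter runs studies until the first nonsignificant result,
    then reports every study run (no publication bias, nothing hidden)."""
    rng = random.Random(seed)
    literature = []
    for _ in range(n_sets):
        significant = 0
        while rng.random() < power:  # study comes out significant
            significant += 1
        literature.append((significant, 1))  # (significant, nonsignificant)
    return literature

def mean_significant_in_sets_of(literature, set_size):
    """Average number of significant studies among sets of the given size."""
    counts = [sig for sig, nonsig in literature if sig + nonsig == set_size]
    if not counts:
        raise ValueError("no sets of the requested size were generated")
    return sum(counts) / len(counts), len(counts)

power = 0.5
literature = simulate_literature(power, n_sets=100_000, seed=19374013)
mean5, n5 = mean_significant_in_sets_of(literature, set_size=5)

print("sets of size 5:", n5)
print("mean significant in sets of 5:", mean5)
print("n * power:", 5 * power)
```

Under this assumed behaviour every set of 5 contains four significant studies and one nonsignificant study by construction, so the average among sets of 5 differs from 5 × 0.5 = 2.5 even though every study is reported.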

As long as your procedure for producing studies reports all the studies that are relevant for a theoretical claim and does not use some kind of questionable research practice, then I think we are in agreement.

Please see my new blog post here for the results and code.