# The TES Challenge to Greg Francis

This post is a follow-up to my previous post, “Statistical alchemy and the ‘test for excess significance’”. In the comments on that post, Greg Francis objected to my points about the Test for Excess Significance. I laid out a challenge in which I would use simulation to demonstrate these points. Greg Francis agreed to the details; this post is about the results of the simulations (with links to the code, etc.)

## A challenge

In my previous post, I said this:

Morey: “…we have bit of a mystery. That $E$ [the expected number of non-significant studies in a set of $n$ studies] equals the sum of the expected [Type II error] probabilities is merely asserted [by Ioannidis and Trikalinos]. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false.”

Greg Francis replied:

Francis:“…none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process. Your study generation process works if you let n vary, but then the Ioannidis & Trikalinos formula is shown to be correct…[i]n short, you present impossible sampling procedures and then complain that the formula proposed by Ioannidis & Trikalinos does not handle your impossible situations.”

To which I replied,

Morey:“If you don’t believe me, here’s a challenge: you pick a power and a random seed. I will simulate a very large ‘literature’ according to the ‘experimenter behaviour’ of my choice, importantly with no publication bias or other selection of studies. I will guarantee that I will use a behaviour that will generate experiment set sizes of 5. I will save the code and the ‘literature’ coded in terms of ‘sets’ of studies and how many significant and nonsignificant studies there are. You get to guess what the average number of significant studies are in sets of 5 via I&T’s model, along with a 95% CI (I’ll tell you the total number of such studies). That is, we’re just using Monte Carlo to estimate the expected number of significant studies in sets of experiments n=5; that is, precisely what I&T use as the basis of their model (for the special case of n=5).” “This will answer the question of ‘what is the expected number of nonsignificant studies in a set of n?’”

This challenge will very clearly show that my situations are not “impossible”. I can sample them in a very simple simulation. Greg Francis agreed to the simulation:

Francis: “Clearly at least one of us is confused. Maybe we can sort it out by trying your challenge. Power=0.5, random seed= 19374013”

I further clarified:

Morey: “Before I do this, though, I want to make sure that we agree on what this will show. I want to show that the expected number of nonsignificant studies in a set of n (=5) studies is not what I&T say it is, and hence, the reasoning behind the test is flawed (because ‘excess significance’ is defined as deviation from this expected number). I also want to be clear what the prediction is here: Since the power of the test is .5, according to I&T, the expected number of nonsignificant studies in a set of 5 is 2.5. Agreed?”

…to which Greg Francis agreed.

I have performed this simulation. Before reading on, you should read the web page containing the results:

The table below shows the results of the simulation of 1000000 “sets” of studies. All simulated “studies” are published in this simulation, no questionable research practices are involved. The first column shows (n), and the second column shows the average number of non-significant studies for sets of (n), which is a Monte Carlo estimate of I&T’s (E). As you can see, it is not 2.5.

Total studies (n)  Mean nonsig. studies  Expected by TES (E)  SD nonsig. studies  Count
1 1 0.5 0 499917
2 1 1.0 0 249690
3 1 1.5 0 125269
4 1 2.0 0 62570
5 1 2.5 0 31309
6 1 3.0 0 15640
7 1 3.5 0 7718
8 1 4.0 0 3958
9 1 4.5 0 1986
10 1 5.0 0 975

(I have truncated the table at (n=10); see the HTML file for the full table.)

I also showed that you can change the experimenter’s behaviour and make it 2.5. This indicates that the assumptions one makes about experimenter behavior matter to the expected number of non-significant studies in a particular set. Across all sets of studies, the expected proportion of significant studies is expected to be equal to the power. However, how this is distributed across studies of different lengths is a function of the decision rule.

The expression for the expected number of non-significant studies in a set of (n) is not correct (without further very strong, unwarranted assumptions).

# Two things to stop saying about null hypotheses

There is a currently fashionable way of describing Bayes factors that resonates with experimental psychologists. I hear it often, particularly as a way to describe a particular use of Bayes factors. For example, one might say, “I needed to prove the null, so I used a Bayes factor,” or “Bayes factors are great because with them, you can prove the null.” I understand the motivation behind this sort of language but please: stop saying one can “prove the null” with Bayes factors.

I also often hear other people say “but the null is never true.” I’d like to explain why we should avoid saying both of these things.

 Null hypotheses are tired of your jibber jabber

### Why you shouldn’t say “prove the null”

Statistics is complicated. People often come up with colloquial ways of describing what a particular method is doing: for instance, one might say a significance tests give us “evidence against the null”; one might say that a “confidence interval tells us the 95% most plausible values”; or one might say that a Bayes factor helps us “prove the null.” Bayesians often are quick to correct misconceptions that people use to justify their use of classical or frequentist methods. It is just as important to correct misconceptions about Bayesian methods.

In order to understand why we shouldn’t say “prove the null”, consider the following situation: You have a friend who claims that they can affect the moon with their mind. You, of course, think this is preposterous. Your friend looks up at the moon and says “See, I’m using my abilities right now!” You check the time.

You then decide to head to the local lunar seismologist, who has good records of subtle moon tremors. You ask her whether about what happened at the time your friend was looking at the moon, and she reports back to you that lunar activity at that time was stronger than it typically is 95% of the time (thus passes the bar for “statistical significance”).

Does this mean that there is evidence for your friend’s assertion? The answer is “no.” Your friend made no statement about what one would expect from the seismic data. In fact, your friend’s statement is completely unfalsifiable (as is the case with the typical “alternative” in a significance test, (muneq0)).

But consider the following alternative statements your friend could have made: “I will destroy the moon with my mind”; “I will make very large tremors (with magnitude (Y))”; “I will make small tremors (with magnitude (X)).” How do we now regard your friend’s claims in light of the what happened?

• “I will destroy the moon with my mind” is clearly inconsistent with the data. You (the null) are supported by an infinite amount, because you have completely falsified his statement that he would destroy the moon (the alternative).
• “I will make very large tremors (with magnitude (Y))” is also inconsistent with the data, but if we allow a range of uncertainty around his claim, may not be completely falsified. Thus you (the null) are supported, but not by as much in the first situation.
• “I will make small tremors (with magnitude (X))” may support you (the null) or your friend (the alternative), depending on how the magnitude predicted and observed.

Here we can see that the support for the null depends on the alternative at hand. This is, of course, as it must be. Scientific evidence is relative. We can never “prove the null”: we can only “find evidence for a specified null hypothesis against a reasonable, well-specified alternative”. That’s quite a mouthful, it’s true, but “prove the null” creates misunderstandings about Bayesian statistics, and makes it appear that it is doing something it cannot do.

In a Bayesian setup, the null and alternative are both models and the relative evidence between them will change based on how we specify them. If we specify them in a reasonable manner, such that the null and alternative correspond to relevant theoretical viewpoints or encode information about the question at hand, the relative statistical evidence will be informative for our research ends. If we don’t specify reasonable models, then the relative evidence between the models may be correct, but useless.

We never “prove the null” or “compute the probability of the null hypothesis”. We can only compare a null model to an alternative model, and determine the relative evidence.

### Why you shouldn’t say “the null is never true”

A common retort to tests including a point null (often called a ‘null’ hypothesis) is that “the null is never true.” This backed up by four sorts of “evidence”:

• A quote from an authority: “Tukey or Cohen said so!” (Tukey was smart, but this is not an argument.)
• Common knowledge / “experience”: “We all know the null is impossible.” (This was Tukey’s “argument”)
• Circular: “The area under a point in a density curve is 0.” (Of course if your model doesn’t have a point null, the point null will be impossible.)
• All models are “false” (even if this were true — I think it is actually a category error — it would equally apply to all alternatives as well)

The most attractive seems to be the second, but it should be noted that people almost never use techniques that allow finding evidence for null hypotheses. Under these conditions, how is one determining that the null is never true? If a null were ever true, we would not be able to accumulate evidence for it, so the second argument definitely has a hint of circularity as well.

When someone says “The null hypothesis is impossible/implausible/irrelevant”, what they are saying in reality is “I don’t believe the null hypothesis can possibly be true.” This is a totally fine statement, as long as we recognize it for what it is: an a priori commitment. We should not pretend that it is anything else; I cannot see any way that one can find universal evidence for the statement “the null is impossible”.

If you find the null hypothesis implausible, that’s OK. Others might not find it implausible. It is ultimately up to substantive experts to decide what hypotheses they want to consider in their data analysis, and not up to methodologists or statisticians to decide to tell experts what to think.

Any automatic behavior — either automatically rejecting all null hypothesis, or automatically testing null hypotheses — is bad. Hypothesis testing and estimation should be considered and deliberate. Luckily, Bayesian statistics allows both to be done in a principled, coherent manner, so informed choices can be made by the analyst and not by the restrictions of the method.

# Statistical alchemy and the “test for excess significance”

[This post is based largely on my 2013 article for Journal of Mathematical Psychology; see the other articles in that special issue as well for more critiques.]

When I tell people that my primary area of research is statistical methods, one of the reactions I often encounter from people untrained in statistics is that “you can prove anything with statistics.” Of course, this rankles, first because it isn’t true (unless you use a very strange definition of prove) and second because I’ve spent years learning the limitations of statistics, and there are many limitations. These limitations exist, however, in the context of enormous successes. In the sciences, the field of statistics rightly has a place of honor.

This success is evidenced by the great number of scientific arguments that are supported by statistical methods. Not all statistical arguments are created equal, of course. But the respect with which statistics is viewed has the unfortunate downside that a statistical argument can apparently turn a leaden hunch into a golden “truth”. This post is about such statistical alchemy.

## The gold: Justified substantive claims

One of the goals we all have as scientists is to make claims backed by solid evidence. This is harder than it seems. Ideally we would prefer that evidence be ironclad and assumptions unnecessary. In real-life cases, however, the strength of evidence does not provide certainty, and assumptions are needed. The key to good argument, then, is that all assumptions are made explicit, the chain of reasoning is clear and logical, and the resulting evidence is strong enough to garner agreement.

Such cases we might call the “gold standard” for scientific arguments. We expect this sort of argument when someone makes a strong claim. This is the stuff that the scientific literature should be made of, for the most part. Among other things, the gold standard requires careful experimental design and execution, deliberate statistical analysis and avoidance of post hoc reasoning, and a willingness to explore the effects of unstated assumptions in one’s reasoning.

Hunches are a necessary part of science. Science is driven by a creative force that cannot (at this point) be quantified, and a sneaking suspicion that something is true is often the grounds on which we design experiments. Hunches are some of the most useful things in science, just as lead is an exceptionally useful metal. Like lead, hunches are terribly common. We all have many hunches, and often we don’t know where they come from.

What makes a hunch is that it doesn’t have solid grounds to back it up. Hunches often turn to dust upon closer examination: they may contradict other knowledge, they may be based on untenable assumptions, or the evidence for them may turn out to be much weak when we examine it. If a hunch survives a solid test, it is no longer a hunch; but so long we do not test them — or cannot test them — they remain hunches.

## The alchemy of statistics

One of the most dangerous, but unfortunately common, ways in which statistics is used is to magically turn hunches into “truth”. The mother of all statistical alchemy is the Fisherian (p) value, by which hunches based on “low” (p) values are turned into statements about the implausibility of the null hypothesis. Although it seems reasonable, when the hunch on which (p) values rest is examined by either frequentists or Bayesians, it is found wanting.

However, my main focus here is not specifically (p) values. I’d like to focus on one particularly recent special case of statistical alchemy among methodologists called the “test for excess significance”. Here’s the hunch: in any series of typically-powered experiments, we expect some to fail to be non-significant due to sampling error, even if a true effect exists. If we see a series of five experiments, and they are all significant, one thinks that either they are either very high powered, the authors got lucky, or there are some nonsignificant studies missing. For many sets of studies, the first seems implausible because the effect sizes are small; the last is important, because if it is true then the picture we get of the results is misleading.

Just to be clear, this hunch makes sense to me, and I think to most people. However, without a formal argument it remains a hunch. Ioannidis and Trikalinos (2007) suggested formalising it:

We test in a body of $n$ published studies whether the observed number of studies $O$ with ‘positive’ results at a specified $alpha$ level on a specific research question is different from the expected number of studies with ‘positive’ results $E$ in the absence of any bias. (Ioannidis and Trikalinos, 2007, p246)

Of “biases”, Ioannidis and and Tikalinos say that “biases…result in a relative excess of published statistically significant results as compared with what their true proportion should be in a body of evidence.” If there are too many significant studies, there must be too few nonsignificant ones, hence the idea of “relative” excess.

Suppose there is a true effect size that is being pursued by study $i (i = 1,ldots,n)$ and its size is $theta_i$…[T]he expected probability that a specific single study $i$ will find a ‘positive’ result equals $1 – beta_i$, its power at the specified $alpha$ level. (Ioannidis and Trikalinos, 2007, p246)

So far so good; this is all true. Ioannidis and Trikalinos continue:

Assuming no bias, $E$ equals the sum of the expected probabilities across all studies on the same question: [ E = sum_{i=1}^n (1 – beta_i). ] (Ioannidis and Trikalinos, 2007, p246)

Here we have bit of a mystery. That (E) equals the sum of the expected probabilities is merely asserted. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false. Suppose I run experiments until I obtain (k) nonsignificant studies ((k>0)). The expected number of significant studies in a set of (n) is exactly (n-k). Depending on the stopping rule for the studies, which is unknown (and unknowable or even meaningless, in most cases), (E) can be chosen to be 0 (stop after (n) nonsignificant studies), (n) (stop after (n) significant studies), or any number in between!

Ioannidis and Trikalinos go on to say that “[t]he expected number (E) is compared against the observed number (O) of ‘positive’ studies” and if there are an “excess” then bias is claimed, by standard significance test logic. Here, things go off the rails again. First, as we have seen, (E) could be anything. Second, a significance test is performed by computing the probability of observing an outcome as extreme or more extreme than the one observed, given no “bias”. What is more extreme? Suppose we observe 4 significant results in 5 studies. It seems clear that 5/5 is more extreme. Is 6/6 possible? No mention is made of the assumed sampling process, so how are we to know what the more extreme samples would be? And if a sampling assumption were made explicit, how could we know whether that was a reasonable assumption for the studies at hand? The (p) value is simply incalculable from the information available.

Suppose I find a “significant” result; what do I infer? Ioannidis and Trikalinos claim that they “have introduced an exploratory test for examining whether there is an excess of significant findings in a body of evidence” (p 251). This is a very strange assertion. When we do a statistical test, we are not asking a question about the data itself; rather, we are inferring something about a population. The “body of evidence” is the sample; we infer from the sample to the population. But what is the population? Or, put in frequentist terms, what is the sampling process from which the studies in question arise? Given that this question is central to the statistical inference, one would think it would be addressed, but it is not. Dealing with this question would require a clear definition of a “set” of studies, and how this set is sampled.

Are these studies one sample of hypothetical sets of studies from all scientific fields? Or perhaps they are a sample of studies within a specific field; say, psychology? Or from a subfield, like social psychology? Or maybe from a specific lab? There’s no way to uniquely answer this question, and so it isn’t clear what can be inferred. Am I inferring bias in all of science, in the field, the subfield, or the lab? And if any of these are true, why do they discuss bias in the sample instead? They have confused the properties of the population and sample in a basic way.

But even though these critical details are missing — details that are necessary to the argument — the authors go on to apply this to several meta-analyses, inferring bias in several. Other authors have applied the method to claim “evidence” of bias in other sets of studies.

## …and the alchemy is complete

We see that Ioannidis and Trikalinos have unstated assumptions of enormous import, they have failed to clearly define any sort of sampling model, and they have not made clear the link between the act of inference (“we found a ‘significant’ result”) and what is to be inferred (“Evidence for bias exists in these studies.”). And this is all before even addressing the problematic nature of (p) values themselves, which cannot be used as a measure of evidence. The test for “excess significance” is neither a valid frequentist procedure (due to the lack of a clearly defined sampling process) nor a valid Bayesian procedure.

But through the alchemy of statistics, the Ioannidis and Trikalinos’ test for “excess significance” has given us the appearance of a justified conclusion. Bodies of studies are called into doubt, and the users of the approach continue to get papers published using the approach despite its utter lack of justification. We would not accept such shoddy modeling and reasoning for studying other aspects of human behavior. As Val Johnson put it in his comment on the procedure, “[We] simply cannot quite determine the level of absurdity that [we are] expected to ignore.” Why is this acceptable for deploying against groups of studies in the scientific literature?

The reason is simple: we all have the hunch. It seems right. Ioannidis and Trikalinos have given us a way to transmute our hunch that something is amiss into the gold of a publishable, evidence-backed conclusion. But it is an illusion; the argument simply falls apart under scrutiny.

This is bad science, and it should not be tolerated. Methodologists have the same responsibility as everyone else to justify their conclusions. The peer review system has failed to prevent the leaden hunch passing for gold, which is acutely ironic given how methodologists use the test to accuse others of bad science.