This post is a follow-up to my previous post, “Statistical alchemy and the ‘test for excess significance’”. In the comments on that post, Greg Francis objected to my points about the Test for Excess Significance (TES). I laid out a challenge in which I would use simulation to demonstrate these points. Greg Francis agreed to the details; this post reports the results of the simulation, with links to the code.

## A challenge

In my previous post, I said this:

*Morey:* “…we have bit of a mystery. That $E$ [the expected number of non-significant studies in a set of $n$ studies] equals the sum of the expected [Type II error] probabilities is merely asserted [by Ioannidis and Trikalinos]. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false.”

Greg Francis replied:

*Francis:* “…none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process. Your study generation process works if you let n vary, but then the Ioannidis & Trikalinos formula is shown to be correct…[i]n short, you present impossible sampling procedures and then complain that the formula proposed by Ioannidis & Trikalinos does not handle your impossible situations.”

To which I replied:

*Morey:* “If you don’t believe me, here’s a challenge: you pick a power and a random seed. I will simulate a very large ‘literature’ according to the ‘experimenter behaviour’ of my choice, importantly with no publication bias or other selection of studies. I will guarantee that I will use a behaviour that will generate experiment set sizes of 5. I will save the code and the ‘literature’ coded in terms of ‘sets’ of studies and how many significant and nonsignificant studies there are. You get to guess what the average number of significant studies are in sets of 5 via I&T’s model, along with a 95% CI (I’ll tell you the total number of such studies). That is, we’re just using Monte Carlo to estimate the expected number of significant studies in sets of experiments n=5; that is, precisely what I&T use as the basis of their model (for the special case of n=5).”

“**This will answer the question of ‘what is the expected number of nonsignificant studies in a set of n?’**”

This challenge will very clearly show that my situations are not “impossible”. I can sample them in a very simple simulation. Greg Francis agreed to the simulation:

*Francis:* “Clearly at least one of us is confused. Maybe we can sort it out by trying your challenge. Power=0.5, random seed= 19374013”

I further clarified:

*Morey:* “Before I do this, though, I want to make sure that we agree on what this will show. I want to show that the expected number of nonsignificant studies in a set of n (=5) studies is not what I&T say it is, and hence, the reasoning behind the test is flawed (because ‘excess significance’ is defined as deviation from this expected number). I also want to be clear what the prediction is here: Since the power of the test is .5, according to I&T, the expected number of nonsignificant studies in a set of 5 is 2.5. Agreed?”

…to which Greg Francis agreed.
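
For reference, the agreed prediction of 2.5 follows directly from the I&T expression quoted at the top of this post, if every study in the set is taken to have the same power. The subscripted $\text{power}_i$ is notation introduced here for illustration; it does not appear in the exchange itself.

$$
E = \sum_{i=1}^{n} \left(1 - \text{power}_i\right) = n\,(1 - 0.5) = 5 \times 0.5 = 2.5
$$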

I have performed this simulation. Before reading on, you should read the web page containing the results:

- Web page (with code) outlining the results: http://learnbayes.org/talks/TES/TESsimulation.html
- Source `.Rmd` file: http://learnbayes.org/talks/TES/TESsimulation.Rmd

The table below shows the results of the simulation of 1,000,000 “sets” of studies. All simulated “studies” are published in this simulation, and no questionable research practices are involved. The first column shows $n$, and the second column shows the average number of non-significant studies for sets of size $n$, which is a Monte Carlo estimate of I&T’s $E$. As you can see, for $n=5$ it is not 2.5.

| Total studies ($n$) | Mean nonsig. studies | Expected by TES ($E$) | SD nonsig. studies | Count |
|---|---|---|---|---|
| 1 | 1 | 0.5 | 0 | 499917 |
| 2 | 1 | 1.0 | 0 | 249690 |
| 3 | 1 | 1.5 | 0 | 125269 |
| 4 | 1 | 2.0 | 0 | 62570 |
| 5 | 1 | 2.5 | 0 | 31309 |
| 6 | 1 | 3.0 | 0 | 15640 |
| 7 | 1 | 3.5 | 0 | 7718 |
| 8 | 1 | 4.0 | 0 | 3958 |
| 9 | 1 | 4.5 | 0 | 1986 |
| 10 | 1 | 5.0 | 0 | 975 |

(I have truncated the table at $n=10$; see the HTML file for the full table.)
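
The actual simulation code is the R code in the linked `.Rmd` file. As a rough illustration only, here is a minimal Python sketch of one experimenter behaviour that would produce a table of this shape. The stopping rule (keep running studies until the first non-significant result) is an assumption inferred from the table, where the counts roughly halve with each $n$ and every set contains exactly one non-significant study; this post does not state the rule. The function names are hypothetical, and because Python's random number generator differs from R's, the counts will not match the table even with the same seed.

```python
import random
from collections import defaultdict


def simulate_sets(n_sets, power, seed):
    """Simulate study sets under an ASSUMED stopping rule:
    run studies until the first non-significant result, then stop.
    Requires power < 1, otherwise a set would never end."""
    rng = random.Random(seed)
    nonsig_by_size = defaultdict(list)
    for _ in range(n_sets):
        size = 0
        nonsig = 0
        while nonsig == 0:
            size += 1
            # A study is significant with probability `power`.
            if rng.random() >= power:
                nonsig += 1
        nonsig_by_size[size].append(nonsig)
    return nonsig_by_size


def summarize(nonsig_by_size, power):
    """Return rows of (n, mean nonsig., TES expectation n*(1-power), count)."""
    rows = []
    for size in sorted(nonsig_by_size):
        counts = nonsig_by_size[size]
        mean_nonsig = sum(counts) / len(counts)
        rows.append((size, mean_nonsig, size * (1 - power), len(counts)))
    return rows


# Power and seed as chosen in the challenge; fewer sets than the original run.
rows = summarize(simulate_sets(100_000, 0.5, 19374013), 0.5)
for size, mean_nonsig, tes_expected, count in rows[:10]:
    print(f"{size:>2} | {mean_nonsig:.1f} | {tes_expected:.1f} | {count}")
```

Under this assumed rule the mean number of non-significant studies is 1 for every set size by construction, while the TES expectation grows with $n$.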

I also showed that you can change the experimenter’s behaviour and make it 2.5. This indicates that the assumptions one makes about experimenter behaviour *matter* to the expected number of non-significant studies in a particular set. Across *all* sets of studies, the proportion of significant studies is expected to equal the power. However, how these are distributed across sets of different sizes is a function of the decision rule.
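
To make the contrast concrete, here is a small Python sketch comparing two decision rules. Both rules are assumptions chosen for illustration, not necessarily the behaviours used in the linked R code: one always runs exactly five studies, the other stops at the first non-significant result. All function names are hypothetical.

```python
import random

POWER = 0.5
SEED = 19374013  # seed from the challenge; Python's RNG differs from R's


def fixed_size_rule(rng, set_size=5):
    """ASSUMED rule: always run exactly `set_size` studies."""
    return [rng.random() < POWER for _ in range(set_size)]


def stop_at_first_nonsig_rule(rng):
    """ASSUMED rule: keep running studies until one is non-significant."""
    results = []
    while True:
        significant = rng.random() < POWER
        results.append(significant)
        if not significant:
            return results


def summarize(rule, n_sets):
    """Return (pooled proportion significant, mean nonsig. in sets of 5)."""
    rng = random.Random(SEED)
    sets = [rule(rng) for _ in range(n_sets)]
    total_studies = sum(len(s) for s in sets)
    total_sig = sum(sum(s) for s in sets)
    sets_of_5 = [s for s in sets if len(s) == 5]
    mean_nonsig_5 = sum(len(s) - sum(s) for s in sets_of_5) / len(sets_of_5)
    return total_sig / total_studies, mean_nonsig_5


fixed_prop, fixed_mean5 = summarize(fixed_size_rule, 200_000)
stop_prop, stop_mean5 = summarize(stop_at_first_nonsig_rule, 200_000)

print(f"fixed n=5:     pooled prop. sig = {fixed_prop:.3f}, mean nonsig in sets of 5 = {fixed_mean5:.3f}")
print(f"stop at nonsig: pooled prop. sig = {stop_prop:.3f}, mean nonsig in sets of 5 = {stop_mean5:.3f}")
```

In both cases the pooled proportion of significant studies should land close to the power, but the mean number of non-significant studies in sets of five differs: close to 2.5 under the fixed-size rule and exactly 1 under the stopping rule.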

I&T’s expression for the expected number of non-significant studies in a set of $n$ is therefore not correct without further very strong, unwarranted assumptions.