At some point, everyone who fiddles around with Bayes factors with point nulls notices something that, at first blush, seems strange: small effect sizes seem “biased” toward the null hypothesis. In null hypothesis significance testing, power simply increases when you increase the true effect size. With Bayes factors, there is a non-monotonicity: for a small effect size, increasing the sample size at first slightly increases the degree to which the data favor the null, and only later does the small effect size become evidence for the alternative. I recall puzzling over this with Jeff Rouder years ago when drafting our 2009 paper on Bayesian t tests.
Uri Simonsohn has a blog post critiquing default Bayes factors for their supposed “bias” toward the null hypothesis for small sample sizes. I have several brief responses:
- We do not suggest a “default” prior; we suggest a family of default priors, with an adjustable prior parameter (see also this paper describing our view, which is conditionally accepted at Multivariate Behavioral Research). If you’re looking for a small effect, adjust the prior.
- The whole point of a Bayes factor analysis is that you do not know what the true effect size is (see Jeff Rouder and Joe Hilgard’s response here). Noting that the Bayes factor will mislead when you know there’s a small effect but use a prior that says the effect size is probably moderate to large is not useful. Bayes factors just do what you ask them to do!
- More broadly, though, I think it is helpful to think about this supposed “bias”. Is it what we would expect for a reasonable method? Sometimes our intuitions fail us, and we end up thinking something undesirable, when actually we should be worried if that thing didn’t happen.
The third point is what this blog post is about. Here, I show that the “bias” toward the null for small effect sizes is exactly what must happen for any reasonable method that meets four simple desiderata.
We start with the idea of a measure of evidence comparing some composite alternative hypothesis to the null hypothesis. For our purposes here, it could be any measure of evidence; it does not have to be a Bayes factor. What we will do is set a number of reasonable desiderata on the properties of this evidence measure, and show that the so-called “bias” in favor of the null for small effect sizes must occur.
We assume that our data can be summarized in terms of an effective sample size and an (observed) effect size measure. This effect size should have a “nullest” member (for instance, d=0, or R²=0). For any given sample size, the evidence against the null will be an increasing function of this observed effect size. We also need the concept of “no”, or equivocal, evidence; that is, that the data do not favor either hypothesis. This defines a 0 point on the evidence scale, whatever it is.
The important concept for our demonstration is the idea of a bivariate space of sample size vs evidence. Sample size begins at 0 and increases along the x axis, and “no” evidence is marked on the y axis. We can think of sample size abstractly as indexing the amount of information in the data. We are going to imagine fixing an observed effect size and varying the sample size, which will trace a curve through this bivariate space:
[Figure: A bivariate sample size / evidence space.]
We can now give four desired properties that any reasonable evidence measure should have.
Desideratum 1: The evidence with no data is “equivocal”.
If we observe no data, the strength of the evidence does not favor either hypothesis. Whatever the “0 evidence” point in the evidence space, having no data must put you there.
[For a Bayes factor, this means that prior odds and the posterior odds are the same — with no data, they don’t change — and the log Bayes factor is 0.]
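As a reminder of why this holds, the relationship between the odds and the Bayes factor can be written out. The notation here ($B_{01}$ for the Bayes factor in favor of the null, $H_0$ and $H_1$ for the two hypotheses) is my own shorthand for this sketch:

```latex
\[
\underbrace{\frac{P(H_0 \mid \text{data})}{P(H_1 \mid \text{data})}}_{\text{posterior odds}}
= B_{01} \times
\underbrace{\frac{P(H_0)}{P(H_1)}}_{\text{prior odds}},
\qquad
\text{no data} \;\Rightarrow\; B_{01} = 1,\quad \log B_{01} = 0 .
\]
```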
Desideratum 2: The evidence for a “null” observed effect size is an increasing function of sample size, in favor of the null. However much evidence for the null a “null” observed effect provides, no other observed effect size at the same sample size can exceed it.
For instance, if we observe d=0 with N=1000 participants, this is more convincing evidence in favor of the null than if we had observed d=0 with N=10. Obviously, this null observed effect should offer the most evidence possible for the null, for a given sample size.
Desideratum 3: A fixed non-null observed effect size must yield arbitrarily large amounts of evidence against the null as sample size increases.
If we observe d=.3, with 10 participants, this isn’t terribly convincing; but if we observed d=.3 with more and more participants, we are increasingly sure that the null hypothesis is false. In the bivariate space, this means that all non-null effect size curves eventually must end up either at -∞ or at an asymptote at some large value in favor of the alternative.
Desideratum 4: The closer an observed effect size is to the null effect size, the more its curve “looks like” the null curve.
This is just a smoothness assumption. The conclusions we obtain from observing d=0 should be very close to the ones we obtain from d=.001 and even closer to those we obtain from d=.0000001. Of course, this smoothness should also hold for all other observed effect sizes, not just the null, but for our purposes here the observed null is what is important.
For small sample sizes, this means that the curves for small effect sizes must be near the null effect size lines in the bivariate space. As we increase the sample size, of course, those lines must diverge downward.
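To make the shape of these curves concrete, here is a minimal sketch using a toy evidence measure that happens to satisfy all four desiderata. It is not the default Bayes factor from our t test paper. It assumes instead that the observed effect size d is normally distributed around the true effect with variance 1/N, and that the alternative places a Normal(0, tau²) prior on the true effect; the function name `log_bf01` and the parameter `tau` are made up for this illustration. Positive values favor the null, matching the orientation of the bivariate space described above.

```python
import math


def log_bf01(d, n, tau=1.0):
    """Log Bayes factor for the null over the alternative in a toy model.

    Assumptions (for illustration only): d ~ Normal(delta, 1/n),
    H0: delta = 0, H1: delta ~ Normal(0, tau^2).
    Positive values favor the null; negative values favor the alternative.
    """
    n_tau2 = n * tau ** 2
    # Penalty the alternative pays for spreading its prior mass widely.
    complexity = 0.5 * math.log1p(n_tau2)
    # Misfit of the null to the observed effect size.
    misfit = 0.5 * n * d ** 2 * n_tau2 / (1.0 + n_tau2)
    return complexity - misfit


sample_sizes = [0, 10, 100, 1000, 10000]
effect_sizes = [0.0, 0.001, 0.1, 0.3]

# One row per observed effect size, one column per sample size.
print("d       " + "  ".join(f"N={n:<7}" for n in sample_sizes))
for d in effect_sizes:
    curve = [log_bf01(d, n) for n in sample_sizes]
    print(f"{d:<7} " + "  ".join(f"{v:9.2f}" for v in curve))
```

In this toy model every curve starts at 0 with no data, the d=0 curve rises steadily, and the d=0.001 curve is nearly indistinguishable from it over this range. The d=0.1 curve first rises toward the null and then turns around and heads downward, which is exactly the behaviour at issue.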
The effect of these four desiderata is to ensure that small effect sizes “look” null. This is not a consequence of the Bayes factor, or the prior, but rather of very reasonable conditions that any evidence measure would fulfil. For a Bayes factor, of course, how these lines move through the bivariate space (and how small an effect size will need to be in order to “look” null) will be sensitive to the prior on the alternative, as it must be. But the behaviour described by Simonsohn is a natural consequence of very reasonable assumptions.
Although it is counterintuitive, we would be worried if it didn’t happen for some measure of evidence.
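The sensitivity to the prior on the alternative can also be sketched in a few lines. As a stand-in for the adjustable prior parameter in our family of default priors, this example uses the standard deviation `tau` of a normal prior in a toy known-variance model; both the model and the names are assumptions made for illustration, not the priors from our paper.

```python
import math


def log_bf01(d, n, tau):
    """Toy log Bayes factor for the null over the alternative.

    Assumptions (for illustration only): d ~ Normal(delta, 1/n),
    H0: delta = 0, H1: delta ~ Normal(0, tau^2).
    Positive values favor the null.
    """
    n_tau2 = n * tau ** 2
    return 0.5 * math.log1p(n_tau2) - 0.5 * n * d ** 2 * n_tau2 / (1.0 + n_tau2)


d_observed = 0.1  # a small observed effect size
n = 400           # hypothetical sample size

# Wide prior: expects moderate-to-large effects.
# Narrow prior: expects small effects.
for tau in (1.0, 0.5, 0.1):
    print(f"tau={tau:<4} log BF01 = {log_bf01(d_observed, n, tau):6.2f}")
```

With these particular numbers, the wide prior (tau=1) still reads the small observed effect as evidence for the null, while the narrow prior (tau=0.1) reads the same data as evidence for the alternative. If you are looking for a small effect, telling the analysis so changes where the curve turns around.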