# All about that “bias, bias, bias” (it’s no trouble)

At some point, everyone who fiddles around with Bayes factors with point nulls notices something that, at first blush, seems strange: small effect sizes seem “biased” toward the null hypothesis. In null hypothesis significance testing, power simply increases when you change the true effect size. With Bayes factors, there is a non-monotonicity where increasing the sample size will slightly increase the degree to which a small effect size favors the null, then the small effect size becomes evidence for the alternative. I recall puzzling with this with Jeff Rouder years ago when drafting our 2009 paper on Bayesian t tests.

Uri Simonsohn has a blog post critiquing default Bayes factors for their supposed “bias” toward the null hypothesis for small sample sizes. I have several brief responses:

• We do not suggest a “default” prior; we suggest a family of default priors, which an adjustable prior parameter (see also this paper describing our view, which is conditionally accepted at Multivariate Behavioral Research). If you’re looking for a small effect, adjust the prior.
• The whole point of a Bayes factor analysis is that you do not know what the true effect size is (see Jeff Rouder and Joe Hilgard’s response here). Noting that the Bayes factor will mislead when you know there’s a small effect, but you use a prior that says the effect size is probably moderate to large is not useful. Bayes factors just do what you ask them to do!
• More broadly, though, I think it is helpful to think about this supposed “bias”. Is it what we would expect for a reasonable method? Sometimes our intuitions fail us, and we end up thinking something undesirable, when actually we should be worried if that thing didn’t happen.

The third point is what this blog post is about. Here, I show that the “bias” toward the null for small effect sizes is exactly what must happen for any reasonable method that meets four, simple desiderata.

We start with the idea of a measure of evidence comparing some composite alternative hypothesis to the null hypothesis. For our purposes here, it could be any measure of evidence; it does not have to be a Bayes factor. What we will do is set a number of reasonable desiderata on the properties of this evidence measure, and show that the so-called “bias” in favor of the null for small effect sizes must occur.

We assume that our data can be summarized in terms of an effective sample size and an (observed) effect size measure. This effect size should have a “nullest” member (for instance, d=0, or R2=0). For any given sample size, the evidence against the null will be an increasing function of this observed effect size. We also need the concept of “no”, or equivocal, evidence; that is, that the data do not favor either hypothesis. This defines a 0 point on the evidence scale, whatever it is.

The important concept for our demonstration is the idea of a bivariate space of sample size vs evidence. Sample size begins at 0 and increases along the x axis, and “no” evidence is marked on the y axis. We can think of sample size abstractly as indexing the amoung of information in the data. We are going to imagine fixing an observed effect size and varying the sample size, which will trace a curve through this bivariate space:

 A bivariate sample size / evidence space.

We can now give four desired properties that any evidence measure will have.

### Desiderata 1: The evidence with no data is “equivocal”.

If we observe no data, the strength of the evidence does not favor either hypothesis. Whatever the “0 evidence” point in the evidence space, having no data must put you there.

[For a Bayes factor, this means that prior odds and the posterior odds are the same — with no data, they don’t change — and the log Bayes factor is 0.]

### Desiderata 2: The evidence for a “null” observed effect size is an increasing function of sample size, in favor of the null. However much evidence a “null” observed effect provides, no other observed effect size can exceed it.

For instance, if we observe d=0 with N=1000 participants, this is more convincing evidence in favor of the null than of we had observed d=0 with N=10. Obviously, this null observed effect should offer the most evidence possible, for a given sample size.

### Desiderata 3: A fixed non-null observed effect size must yield arbitrarily large amounts of evidence as sample size increases.

If we observe d=.3, with 10 participants, this isn’t terribly convincing; but if we observed d=.3 with more and more participants, we are increasingly sure that the null hypothesis is false. In the bivariate space, this means that all non-null effect size curves eventually must end up either at -∞ or at an asymptote at some large value in favor of the alternative.

### Desiderata 4: The closer an observed effect size is to the null effect size, the more it’s curve “looks like” the null

This is just a smoothness assumption. The conclusions we obtain from observing d=0 should be very close to the ones we obtain from d=.001 and even closer to those we obtain from d=.0000001. Of course, this smoothness should also hold for all other observed effect sizes, not just the null, but for our purposes here the observed null is what is important.

For small sample sizes, this means that the curves for small effect sizes must be near the null effect size lines in the bivariate space. As we increase the sample size, of course, those lines must diverge downward.

The effect of these four desiderata is to ensure that small effect sizes “look” null. This is not a consequence of the Bayes factor, or the prior, but rather of very reasonable conditions that any evidence measure would fulfil. For a Bayes factor, of course, how these lines move through the bivariate space — and how small an effect size will need to be in order to “look” null — will be sensitive to the prior on the alternative, as it must be. But behaviour described by Simonsohn is natural consequence of very reasonable assumptions.

Although it is counter intuitive, we would be worried if it didn’t happen for some measure of evidence.

# Two things to stop saying about null hypotheses

There is a currently fashionable way of describing Bayes factors that resonates with experimental psychologists. I hear it often, particularly as a way to describe a particular use of Bayes factors. For example, one might say, “I needed to prove the null, so I used a Bayes factor,” or “Bayes factors are great because with them, you can prove the null.” I understand the motivation behind this sort of language but please: stop saying one can “prove the null” with Bayes factors.

I also often hear other people say “but the null is never true.” I’d like to explain why we should avoid saying both of these things.

 Null hypotheses are tired of your jibber jabber

### Why you shouldn’t say “prove the null”

Statistics is complicated. People often come up with colloquial ways of describing what a particular method is doing: for instance, one might say a significance tests give us “evidence against the null”; one might say that a “confidence interval tells us the 95% most plausible values”; or one might say that a Bayes factor helps us “prove the null.” Bayesians often are quick to correct misconceptions that people use to justify their use of classical or frequentist methods. It is just as important to correct misconceptions about Bayesian methods.

In order to understand why we shouldn’t say “prove the null”, consider the following situation: You have a friend who claims that they can affect the moon with their mind. You, of course, think this is preposterous. Your friend looks up at the moon and says “See, I’m using my abilities right now!” You check the time.

You then decide to head to the local lunar seismologist, who has good records of subtle moon tremors. You ask her whether about what happened at the time your friend was looking at the moon, and she reports back to you that lunar activity at that time was stronger than it typically is 95% of the time (thus passes the bar for “statistical significance”).

Does this mean that there is evidence for your friend’s assertion? The answer is “no.” Your friend made no statement about what one would expect from the seismic data. In fact, your friend’s statement is completely unfalsifiable (as is the case with the typical “alternative” in a significance test, (muneq0)).

But consider the following alternative statements your friend could have made: “I will destroy the moon with my mind”; “I will make very large tremors (with magnitude (Y))”; “I will make small tremors (with magnitude (X)).” How do we now regard your friend’s claims in light of the what happened?

• “I will destroy the moon with my mind” is clearly inconsistent with the data. You (the null) are supported by an infinite amount, because you have completely falsified his statement that he would destroy the moon (the alternative).
• “I will make very large tremors (with magnitude (Y))” is also inconsistent with the data, but if we allow a range of uncertainty around his claim, may not be completely falsified. Thus you (the null) are supported, but not by as much in the first situation.
• “I will make small tremors (with magnitude (X))” may support you (the null) or your friend (the alternative), depending on how the magnitude predicted and observed.

Here we can see that the support for the null depends on the alternative at hand. This is, of course, as it must be. Scientific evidence is relative. We can never “prove the null”: we can only “find evidence for a specified null hypothesis against a reasonable, well-specified alternative”. That’s quite a mouthful, it’s true, but “prove the null” creates misunderstandings about Bayesian statistics, and makes it appear that it is doing something it cannot do.

In a Bayesian setup, the null and alternative are both models and the relative evidence between them will change based on how we specify them. If we specify them in a reasonable manner, such that the null and alternative correspond to relevant theoretical viewpoints or encode information about the question at hand, the relative statistical evidence will be informative for our research ends. If we don’t specify reasonable models, then the relative evidence between the models may be correct, but useless.

We never “prove the null” or “compute the probability of the null hypothesis”. We can only compare a null model to an alternative model, and determine the relative evidence.

### Why you shouldn’t say “the null is never true”

A common retort to tests including a point null (often called a ‘null’ hypothesis) is that “the null is never true.” This backed up by four sorts of “evidence”:

• A quote from an authority: “Tukey or Cohen said so!” (Tukey was smart, but this is not an argument.)
• Common knowledge / “experience”: “We all know the null is impossible.” (This was Tukey’s “argument”)
• Circular: “The area under a point in a density curve is 0.” (Of course if your model doesn’t have a point null, the point null will be impossible.)
• All models are “false” (even if this were true — I think it is actually a category error — it would equally apply to all alternatives as well)

The most attractive seems to be the second, but it should be noted that people almost never use techniques that allow finding evidence for null hypotheses. Under these conditions, how is one determining that the null is never true? If a null were ever true, we would not be able to accumulate evidence for it, so the second argument definitely has a hint of circularity as well.

When someone says “The null hypothesis is impossible/implausible/irrelevant”, what they are saying in reality is “I don’t believe the null hypothesis can possibly be true.” This is a totally fine statement, as long as we recognize it for what it is: an a priori commitment. We should not pretend that it is anything else; I cannot see any way that one can find universal evidence for the statement “the null is impossible”.

If you find the null hypothesis implausible, that’s OK. Others might not find it implausible. It is ultimately up to substantive experts to decide what hypotheses they want to consider in their data analysis, and not up to methodologists or statisticians to decide to tell experts what to think.

Any automatic behavior — either automatically rejecting all null hypothesis, or automatically testing null hypotheses — is bad. Hypothesis testing and estimation should be considered and deliberate. Luckily, Bayesian statistics allows both to be done in a principled, coherent manner, so informed choices can be made by the analyst and not by the restrictions of the method.

# BayesFactor updated to version 0.9.11-1

The BayesFactor package has been updated to version 0.9.11-1. The changes are:

CHANGES IN BayesFactor VERSION 0.9.11-1

CHANGES
* Fixed memory bug causing importance sampling to fail.

CHANGES IN BayesFactor VERSION 0.9.11

CHANGES
* Added support for prior/posterior odds and probabilities. See the new vignette for details.
* Added approximation for t test in case of large t
* Made some error messages clearer
* Use callbacks at least once in all cases
* Fix bug preventing continuous interactions from showing in regression Gibbs sampler
* Removed unexported function oneWayAOV.Gibbs(), and related C functions, due to redundancy
* gMap from model.matrix is now 0-indexed vector (for compatibility with C functions)
* substantial changes to backend, to Rcpp and RcppEigen for speed
* removed redundant struc argument from nWayAOV (use gMap instead)