# All about that “bias, bias, bias” (it’s no trouble)

At some point, everyone who fiddles around with Bayes factors with point nulls notices something that, at first blush, seems strange: small effect sizes seem “biased” toward the null hypothesis. In null hypothesis significance testing, power simply increases as the sample size grows, for any nonzero true effect size. With Bayes factors, there is a non-monotonicity: for a small true effect size, increasing the sample size will at first slightly increase the degree to which the data favor the null, before the effect eventually becomes evidence for the alternative. I recall puzzling over this with Jeff Rouder years ago when drafting our 2009 paper on Bayesian t tests.

Uri Simonsohn has a blog post critiquing default Bayes factors for their supposed “bias” toward the null hypothesis for small sample sizes. I have several brief responses:

• We do not suggest a single “default” prior; we suggest a family of default priors with an adjustable prior parameter (see also this paper describing our view, which is conditionally accepted at Multivariate Behavioral Research). If you’re looking for a small effect, adjust the prior.
• The whole point of a Bayes factor analysis is that you do not know what the true effect size is (see Jeff Rouder and Joe Hilgard’s response here). Noting that the Bayes factor will mislead when you know there’s a small effect but use a prior that says the effect size is probably moderate to large is not useful. Bayes factors just do what you ask them to do!
• More broadly, though, I think it is helpful to think about this supposed “bias”. Is it what we would expect from a reasonable method? Sometimes our intuitions fail us, and we end up thinking something is undesirable when actually we should be worried if it didn’t happen.

The third point is what this blog post is about. Here, I show that the “bias” toward the null for small effect sizes is exactly what must happen for any reasonable method that meets four simple desiderata.

We start with the idea of a measure of evidence comparing some composite alternative hypothesis to the null hypothesis. For our purposes here, it could be any measure of evidence; it does not have to be a Bayes factor. What we will do is set a number of reasonable desiderata on the properties of this evidence measure, and show that the so-called “bias” in favor of the null for small effect sizes must occur.

We assume that our data can be summarized in terms of an effective sample size and an (observed) effect size measure. This effect size should have a “nullest” member (for instance, d=0, or R2=0). For any given sample size, the evidence against the null will be an increasing function of this observed effect size. We also need the concept of “no”, or equivocal, evidence; that is, that the data do not favor either hypothesis. This defines a 0 point on the evidence scale, whatever it is.

The important concept for our demonstration is the idea of a bivariate space of sample size vs evidence. Sample size begins at 0 and increases along the x axis, and “no” evidence is marked on the y axis. We can think of sample size abstractly as indexing the amount of information in the data. We are going to imagine fixing an observed effect size and varying the sample size, which will trace a curve through this bivariate space:

 A bivariate sample size / evidence space.

We can now give four desired properties that any reasonable evidence measure should have.

### Desideratum 1: The evidence with no data is “equivocal”.

If we observe no data, the strength of the evidence does not favor either hypothesis. Whatever the “0 evidence” point in the evidence space, having no data must put you there.

[For a Bayes factor, this means that prior odds and the posterior odds are the same — with no data, they don’t change — and the log Bayes factor is 0.]
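In symbols: the Bayes factor is the factor by which the data shift the prior odds to the posterior odds,

$$\underbrace{\frac{P(H_1 \mid y)}{P(H_0 \mid y)}}_{\text{posterior odds}} = \underbrace{\frac{p(y \mid H_1)}{p(y \mid H_0)}}_{\text{Bayes factor } B_{10}} \times \underbrace{\frac{P(H_1)}{P(H_0)}}_{\text{prior odds}},$$

and with no data $$B_{10} = 1$$, so $$\log B_{10} = 0$$: the evidence sits exactly at the 0 point of the evidence scale.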

### Desideratum 2: The evidence for a “null” observed effect size is an increasing function of sample size, in favor of the null. However much evidence a “null” observed effect provides, no other observed effect size can exceed it.

For instance, if we observe d=0 with N=1000 participants, this is more convincing evidence in favor of the null than if we had observed d=0 with N=10. Obviously, this null observed effect should offer the most evidence possible for the null, for a given sample size.

### Desideratum 3: A fixed non-null observed effect size must yield arbitrarily large amounts of evidence as sample size increases.

If we observe d=.3, with 10 participants, this isn’t terribly convincing; but if we observed d=.3 with more and more participants, we are increasingly sure that the null hypothesis is false. In the bivariate space, this means that all non-null effect size curves eventually must end up either at -∞ or at an asymptote at some large value in favor of the alternative.

### Desideratum 4: The closer an observed effect size is to the null effect size, the more its curve “looks like” the null’s.

This is just a smoothness assumption. The conclusions we obtain from observing d=0 should be very close to the ones we obtain from d=.001 and even closer to those we obtain from d=.0000001. Of course, this smoothness should also hold for all other observed effect sizes, not just the null, but for our purposes here the observed null is what is important.

For small sample sizes, this means that the curves for small effect sizes must be near the null effect size lines in the bivariate space. As we increase the sample size, of course, those lines must diverge downward.

The effect of these four desiderata is to ensure that small observed effect sizes “look” null. This is not a consequence of the Bayes factor, or of the prior, but rather of very reasonable conditions that any evidence measure should fulfil. For a Bayes factor, of course, how these lines move through the bivariate space (and how small an effect size needs to be in order to “look” null) will be sensitive to the prior on the alternative, as it must be. But the behaviour described by Simonsohn is a natural consequence of very reasonable assumptions.
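All four desiderata are easy to see in a concrete measure. Below is a minimal sketch, assuming a deliberately simple model that is not our default Bayes factor: the observed effect size is normally distributed with known unit variance, the null fixes the true effect at 0, and the alternative places a Normal(0, τ²) prior on it, so the Bayes factor has a closed form. The function name `bf01` and the default τ = 1 are my choices for illustration.

```python
import math

def bf01(d, n, tau=1.0):
    """Bayes factor favoring the null for observed effect size d at sample size n.

    Illustrative model (an assumption, not the default JZS setup):
    ybar ~ Normal(mu, 1/n); H0: mu = 0; H1: mu ~ Normal(0, tau^2).
    Marginally, ybar ~ Normal(0, 1/n) under H0 and Normal(0, tau^2 + 1/n) under H1.
    """
    v0 = 1.0 / n              # sampling variance of the observed effect under H0
    v1 = tau**2 + 1.0 / n     # marginal variance of the observed effect under H1
    return math.sqrt(v1 / v0) * math.exp(-0.5 * d**2 * (1.0 / v0 - 1.0 / v1))

# Desideratum 1: with essentially no data, the evidence is equivocal (BF near 1).
print(bf01(0.5, 1e-9))

# Desideratum 2: an observed d = 0 favors the null more and more as n grows.
print([round(bf01(0.0, n), 2) for n in (10, 100, 1000)])

# Desiderata 3 and 4: a small observed d = .1 first "looks null" (evidence for
# the null grows), then reverses and favors the alternative without bound.
print([round(bf01(0.1, n), 3) for n in (10, 100, 1000, 10000)])
```

For an observed effect of exactly 0 the evidence for the null grows without bound; for a small observed effect such as d = .1, the evidence first drifts toward the null and only with larger samples turns decisively toward the alternative, which is exactly the “bias” at issue.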

Although it is counterintuitive, we would be worried if this behaviour didn’t occur for some measure of evidence.

# Some thoughts on replication

In a recent blog post, Simine Vazire discusses the problem with the logic of requiring replicators to explain why they reach different conclusions from the original authors. She frames it, correctly, as asking people to over-interpret random noise. Vazire identifies the issue as a problem with our thinking: we under-estimate randomness. I’d like to explore other ways in which our biases interfere with clear thinking about replication, and perhaps suggest some ways we can clarify it.

I suggest two ways in which we fool ourselves when thinking about replication: the concept of “replication” is unnecessarily asymmetric, an example of overly-linear thinking; and a lack of distinction in practice causes a lack of distinction in theory.

### Fooled by language: the asymmetry of “replication”

Imagine that a celebrated scientist, Dr. Smith, dies, and within her notes is discovered a half-written paper. Building on her previous work, this paper clearly lays out a creative experiment to test a theory. To avoid any complications such as post hoc theorising, assume the link between the theory and the experiment is clear and follows from her previous work. On Dr. Smith’s computer, along with the paper, is found a data set. Dr. Smith’s colleagues decide to finish the paper and publish it in her honor.

Given the strange circumstances of this particular paper’s history, another scientist, Dr. Jones, decides to replicate the study. Dr. Jones does his best to match the methods described in the paper, but obtains a different result. Dr. Jones tries to publish, but editors and reviewers demand an explanation: why is the replication different? Dr. Jones’ result is doubted until he can explain the difference.

Now suppose — unbeknownst to everyone — that the first experiment was never done. Dr. Smith simulated the data set as a pedagogical exercise to learn a new analysis technique. She never told anyone because she did not anticipate dying, of course, and everyone assumed the data were real. The second experiment is no replication at all; it is the first experiment actually done.

Does this change the evidential value of Dr. Jones’ experiment at all? Of course not. The fact that Dr. Smith’s experiment was not done is irrelevant to the evidence in Dr. Jones’ experiment. The evidence contained in an experiment is the same regardless of whether another experiment was done before it (assuming, of course, that the methods are all sound). “Replication” is a useless label.

Calling Dr. Jones’ experiment a “replication” focuses our attention on the wrong relationship. One replicates an actual experiment that was done. However, the evidence that an experiment provides for a theory does not depend on the relationship between that experiment’s methods and those of an experiment done in the past. Rather, the evidence depends on the relationship between the experiment’s methods and a hypothetical experiment designed to test the theory. One cannot replicate a hypothetical experiment, of course, because hypothetical experiments cannot be performed. Instead, one realizes a hypothetical experiment, and there may be several realizations of the same hypothetical experiment.

Thinking in this manner eliminates the asymmetric relationship between the two experiments. If both experiments are realizations of the same hypothetical experiment designed to test a theory, which one came first is immaterial.* The burden is no longer on the second experimenter to explain why the results are different; the burden is on the advocates of the theory to explain the extant data, which now include two differing results. (Vazire’s caution about random noise still applies here, as we still don’t want to over-explain differences; it is assumed that any post hoc explanation will be tested.)

 Three hypothetical experiments that are tests of the same theory, along with five actually-run experiments. Hypothetical experiments B and C may be so-called “conceptual replications” of A, or tests of other aspects of the theory.

The conceptual distinction between a hypothetical experiment — that is, the experiment that is planned — and the actual experiment is critical. That hypothetical experiment can be realized in many ways: at different times, in different labs, with different participants, even with different stimuli, if the stimuli are randomly generated or selected from a large collection of interchangeable stimuli. Importantly, when the first realization of the hypothetical experiment is done, it does not get methodological priority. It is temporally first, but it is simply one way in which the experiment could have been realized.

Conceptualizing the scientific process in this way prevents researchers who did an experiment first from claiming that their experiment takes priority. If you are “replicating” their actual experiment, then it makes sense that your results will be compared to theirs, in the same way a “copy” might be compared to the “original”. But conceptually, the two are siblings, not parent and child.

### Lack of distinction in practice vs. theory

The critical distinction above is between a hypothetical experiment and an actual one. I think this is an instance where modern scientific practice causes problems. Although the idea of a hypothetical experiment arises in any experimental planning process, consider the typical scientific paper: an introduction, then a brief (maybe even just a few sentences!) segue describing the logic of the experiment, then the methods of an actually-performed experiment.

This structure means that the hypothetical experiment and the actual experiment are impossible to disentangle. This is one of the reasons, I think, why we talk about “replication” so much, rather than about performing another realization of a hypothetical experiment. We have no hypothetical experiment to work from, because it is almost completely conflated with the actual experiment.

One initiative that will help with this problem is public pre-registration. A hypothetical experiment is laid out in a pre-registration document, and from such a document the structure in the figure becomes clear. If someone posts a public pre-registration document, why does it matter who does the experiment first (aside from ethical issues such as “scooping”)? No one is “replicating” anyone else; they are each separately realizing the hypothetical experiment that was planned.

But in current practice, which does not typically distinguish a hypothetical experiment from an actual one, the only way to add to the scientific literature about hypothetical experiment A is to try to “redo” one of its realizations. Any subsequent experiment is then logically dependent on the first actually-performed experiment, and the unhelpful asymmetry crops up again.

I think it would be useful to have a different word than “replication”, because the word connotes a facsimile or copy of something already existing, which focuses our attention in unhelpful ways.

* Although which experiment came first is logically immaterial, there may be statistical considerations to keep in mind, like the “statistical significance filter” that is more likely to affect a first study than a second. Also, as Vazire points out in the comments, the second study has fewer researcher degrees of freedom.

# My favorite Neyman passage: on confidence intervals

I’ve been doing a lot of reading on confidence interval theory. Some of the reading is more interesting than others. There is one passage from Neyman’s (1952) book “Lectures and Conferences on Mathematical Statistics and Probability” (available here) that stands above the rest in terms of clarity, style, and humor. I had not read this before the last draft of our confidence interval paper, but for those of you who have read it, you’ll recognize that this is the style I was going for. Maybe you have to be Jerzy Neyman to get away with it.

Neyman gets bonus points for the footnote suggesting the “eminent”, “elderly” boss is so obtuse (a reference to Fisher?) and that the young frequentists should be “remind[ed] of the glory” of being burned at the stake. This is just absolutely fantastic writing. I hope you enjoy it as much as I did.

[begin excerpt, p. 211-215]

[Neyman is discussing using “sampling experiments” (Monte Carlo experiments with tables of random numbers) in order to gain insight into confidence intervals. $$\theta$$ is a true parameter of a probability distribution to be estimated.]

The sampling experiments are more easily performed than described in
detail. Therefore, let us make a start with $$\theta_1 = 1$$, $$\theta_2 = 2$$, $$\theta_3 = 3$$ and $$\theta_4 = 4$$. We imagine that, perhaps within a week, a practical statistician is faced four times with the problem of estimating $$\theta$$, each time from twelve observations, and that the true values of $$\theta$$ are as above [ie, $$\theta_1,\ldots,\theta_4$$] although the statistician does not know this. We imagine further that the statistician is an elderly gentleman, greatly attached to the arithmetic mean and that he wishes to use formulae (22). However, the statistician has a young assistant who may have read (and understood) modern literature and prefers formulae (21). Thus, for each of the four instances, we shall give two confidence intervals for $$\theta$$, one computed by the elderly Boss, the other by his young Assistant.

[Formula 21 and 22 are simply different 95% confidence procedures. Formula 21 is has better frequentist properties; Formula 22 is inferior, but the Boss likes it because it is intuitive to him.]

Using the first column on the first page of Tippett’s tables of random
numbers and performing the indicated multiplications, we obtain the following
four sets of figures.

The last two lines give the assertions regarding the true value of $$\theta$$ made by the Boss and by the Assistant, respectively. The purpose of the sampling experiment is to verify the theoretical result that the long run relative frequency of cases in which these assertions will be correct is, approximately, equal to $$\alpha = .95$$.

You will notice that in three out of the four cases considered, both assertions (the Boss’ and the Assistant’s) regarding the true value of $$\theta$$ are correct and that in the last case both assertions are wrong. In fact, in this last case the true $$\theta$$ is 4 while the Boss asserts that it is between 2.026 and 3.993 and the Assistant asserts that it is between 2.996 and 3.846. Although the probability of success in estimating $$\theta$$ has been fixed at $$\alpha = .95$$, the failure on the fourth trial need not discourage us. In reality, a set of four trials is plainly too short to serve for an estimate of a long run relative frequency. Furthermore, a simple calculation shows that the probability of at least one failure in the course of four independent trials is equal to .1855. Therefore, a group of four consecutive samples like the above, with at least one wrong estimate of $$\theta$$, may be expected one time in six or even somewhat oftener. The situation is, more or less, similar to betting on a particular side of a die and seeing it win. However, if you continue the sampling  experiment and count the cases in which the assertion regarding the true value of $$\theta$$, made by either method, is correct, you will find that the relative frequency of such cases converges gradually to its theoretical value, $$\alpha= .95$$.

Let us put this into more precise terms. Suppose you decide on a number $$N$$ of samples which you will take and use for estimating the true value of $$\theta$$. The true values of the parameter $$\theta$$ may be the same in all $$N$$ cases or they may vary from one case to another. This is absolutely immaterial as far as the relative frequency of successes in estimation is concerned. In each case the probability that your assertion will be correct is exactly equal to $$\alpha = .95$$. Since the samples are taken in a manner insuring independence (this, of course, depends on the goodness of the table of random numbers used), the total number $$Z(N)$$ of successes in estimating $$\theta$$ is the familiar binomial variable with expectation equal to $$N\alpha$$ and with variance equal to $$N\alpha(1 - \alpha)$$. Thus, if $$N = 100$$, $$\alpha = .95$$, it is rather improbable that the relative frequency $$Z(N)/N$$ of successes in estimating $$\theta$$ will differ from $$\alpha$$ by more than

$$2\sqrt{\frac{\alpha(1-\alpha)}{N}} = .042$$

This is the exact meaning of the colloquial description that the long run relative frequency of successes in estimating $$\theta$$ is equal to the preassigned $$\alpha$$. Your knowledge of the theory of confidence intervals will not be influenced by the sampling experiment described, nor will the experiment prove anything. However, if you perform it, you will get an intuitive feeling of the machinery behind the method which is an excellent complement to the understanding of the theory. This is like learning to drive an automobile: gaining experience by actually driving a car compared with learning the theory by reading a book about driving.

Among other things, the sampling experiment will attract attention to
the frequent difference in the precision of estimating $$\theta$$ by means of the two alternative confidence intervals (21) and (22). You will notice, in fact, that the confidence intervals based on $$X$$, the greatest observation in the sample, are frequently shorter than those based on the arithmetic mean $$\bar{X}$$. If we continue to discuss the sampling experiment in terms of cooperation between the eminent elderly statistician and his young assistant, we shall have occasion to visualize quite amusing scenes of indignation on the one hand and of despair before the impenetrable wall of stiffness of mind and routine of thought on the other.[See footnote] For example, one can imagine the conversation between the two men in connection with the first and third samples reproduced above. You will notice that in both cases the confidence interval of the Assistant is not only shorter than that of the Boss but is completely included in it. Thus, as a result of observing the first sample, the Assistant asserts that

$$.956 \leq \theta \leq 1.227.$$

On the other hand, the assertion of the Boss is far more conservative and admits the possibility that $$\theta$$ may be as small as .688 and as large as 1.355. And both assertions correspond to the same confidence coefficient, $$\alpha = .95$$! I can just see the face of my eminent colleague redden with indignation and hear the following colloquy.

Boss: “Now, how can this be true? I am to assert that $$\theta$$ is between .688 and 1.355 and you tell me that the probability of my being correct is .95. At the same time, you assert that $$\theta$$ is between .956 and 1.227 and claim the same probability of success in estimation. We both admit the possibility that $$\theta$$ may be some number between .688 and .956 or between 1.227 and 1.355. Thus, the probability of $$\theta$$ falling within these intervals is certainly greater than zero. In these circumstances, you have to be a nit-wit to believe that
$$\begin{eqnarray*} P\{.688 \leq \theta \leq 1.355\} &=& P\{.688 \leq \theta < .956\} + P\{.956 \leq \theta \leq 1.227\}\\ && + P\{1.227 \leq \theta \leq 1.355\}\\ &=& P\{.956 \leq \theta \leq 1.227\}.\mbox{”} \end{eqnarray*}$$

Assistant: “But, Sir, the theory of confidence intervals does not assert anything about the probability that the unknown parameter $$\theta$$ will fall within any specified limits. What it does assert is that the probability of success in estimation using either of the two formulae (21) or (22) is equal to $$\alpha$$.”

Boss: “Stuff and nonsense! I use one of the blessed pair of formulae and come up with the assertion that $$.688 \leq \theta \leq 1.355$$. This assertion is a success only if $$\theta$$ falls within the limits indicated. Hence, the probability of success is equal to the probability of $$\theta$$ falling within these limits —.”

Assistant: “No, Sir, it is not. The probability you describe is the a posteriori probability regarding $$\theta$$, while we are concerned with something else. Suppose that we continue with the sampling experiment until we have, say, $$N = 100$$ samples. You will see, Sir, that the relative frequency of successful estimations using formulae (21) will be about the same as that using formulae (22) and that both will be approximately equal to .95.”

I do hope that the Assistant will not get fired. However, if he does, I would remind him of the glory of Giordano Bruno who was burned at the stake by the Holy Inquisition for believing in the Copernican theory of the solar system. Furthermore, I would advise him to have a talk with a physicist or a biologist or, maybe, with an engineer. They might fail to understand the theory but, if he performs for them the sampling experiment described above, they are likely to be convinced and give him a new job. In due course, the eminent statistical Boss will die or retire and then —.

[footnote] Sad as it is, your mind does become less flexible and less receptive to novel ideas as the years go by. The more mature members of the audience should not take offense. I, myself, am not young and have young assistants. Besides, unreasonable and stubborn individuals are found not only among the elderly but also frequently among young people.

[end excerpt]
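Neyman’s sampling experiment is easy to re-run today. Since the excerpt does not reproduce formulae (21) and (22), the sketch below assumes the classic setting they suggest: twelve observations from a Uniform(0, θ) distribution, with the Assistant’s interval built from the sample maximum (exact coverage) and the Boss’s interval built from the arithmetic mean via a normal approximation. The model, the function names, and the constants are my assumptions for illustration, not Neyman’s actual formulae.

```python
import random

N_SIM, n = 4000, 12  # number of simulated "practical statisticians" and sample size

def assistant_interval(xs):
    # Interval from the sample maximum M (assumed model X ~ Uniform(0, theta)):
    # M/theta has CDF m**n on (0, 1), so [M / .975**(1/n), M / .025**(1/n)]
    # covers theta with probability exactly .95.
    m = max(xs)
    return m / 0.975 ** (1 / len(xs)), m / 0.025 ** (1 / len(xs))

def boss_interval(xs):
    # Interval from the arithmetic mean: theta_hat = 2 * xbar, with
    # sd(theta_hat) ~ theta_hat / sqrt(3n) and a plug-in normal approximation.
    est = 2 * sum(xs) / len(xs)
    half = 1.96 * est / (3 * len(xs)) ** 0.5
    return est - half, est + half

random.seed(1)
hits_a = hits_b = width_a = width_b = 0.0
for _ in range(N_SIM):
    theta = random.uniform(1, 4)  # the true value may vary from case to case
    xs = [random.uniform(0, theta) for _ in range(n)]
    lo, hi = assistant_interval(xs)
    hits_a += lo <= theta <= hi
    width_a += hi - lo
    lo, hi = boss_interval(xs)
    hits_b += lo <= theta <= hi
    width_b += hi - lo

print("Assistant coverage:", hits_a / N_SIM, "mean width:", width_a / N_SIM)
print("Boss coverage:     ", hits_b / N_SIM, "mean width:", width_b / N_SIM)
```

With these choices the Assistant’s max-based intervals are markedly shorter on average yet still achieve the advertised long-run coverage, echoing Neyman’s point: both procedures succeed roughly 95% of the time, even though one is clearly more precise.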