Monthly Archives: April 2015

The fallacy of placing confidence in confidence intervals (version 2)

My coauthors and I have submitted a new draft of our paper “The fallacy of placing confidence in confidence intervals”. This paper is substantially modified from its previous incarnation. Here is the main argument:

“[C]onfidence intervals may not be used as suggested by modern proponents because this usage is not justified by confidence interval theory. If used in the way CI proponents suggest, some CIs will provide severely misleading inferences for the given data; other CIs will not. Because such considerations are outside of CI theory, developers of CIs do not test them, and it is therefore often not known whether a given CI yields a reasonable inference or not. For this reason, we believe that appeal to CI theory is redundant in the best cases, when inferences can be justified outside CI theory, and unwise in the worst cases, when they cannot.”

The document, source code, and all supplementary material are available here on GitHub.

Guidelines for reporting confidence intervals

I’m working on a manuscript on confidence intervals, and I thought I’d share a draft section on the reporting of confidence intervals. The paper has several demonstrations of how CIs may, or may not, offer quality inferences, and how they can differ markedly from credible intervals, even ones with so-called “non-informative” priors.

Guidelines for reporting confidence intervals

Report credible intervals instead. We believe any author who chooses to use confidence intervals should ensure that the intervals correspond numerically with credible intervals under some reasonable prior. Many confidence intervals cannot be so interpreted, but if the authors know they can be, they should be called “credible intervals”. This signals to readers that they can interpret the interval as they have been (incorrectly) told they can interpret confidence intervals. Of course, the corresponding prior must also be reported. This is not to say that one can’t also call them confidence intervals if indeed they are; however, readers are likely more interested in the post-data properties of the procedure (not the coverage) if they are interested in arriving at substantive conclusions from the interval.
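As a small illustration of what “correspond numerically” can mean, here is a sketch for the simplest textbook case: a normal mean with known standard deviation. The data values, the function names, and the two priors are assumptions made up for this example; they do not come from the paper. With a very diffuse normal prior the central credible interval is numerically almost identical to the usual z interval, while a tighter prior gives a visibly different interval. Whether a diffuse prior counts as “reasonable” in a given application is a separate judgment.

```python
from statistics import NormalDist, mean


def z_confidence_interval(xs, sigma, level=0.95):
    """Classical z interval for a normal mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(xs) ** 0.5
    centre = mean(xs)
    return centre - half_width, centre + half_width


def normal_credible_interval(xs, sigma, prior_mean, prior_sd, level=0.95):
    """Central credible interval under a Normal(prior_mean, prior_sd^2) prior."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(xs) / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * mean(xs))
    posterior = NormalDist(post_mean, post_var ** 0.5)
    tail = (1 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)


data = [4.1, 5.3, 4.8, 5.9, 5.2]  # made-up observations, for illustration only
ci = z_confidence_interval(data, sigma=1.0)
vague = normal_credible_interval(data, sigma=1.0, prior_mean=0.0, prior_sd=1e3)
informed = normal_credible_interval(data, sigma=1.0, prior_mean=0.0, prior_sd=1.0)

print("z confidence interval:     ", ci)
print("credible, prior sd = 1000: ", vague)
print("credible, prior sd = 1:    ", informed)
```

In this case an author could report the interval as a credible interval, provided the diffuse prior is reported alongside it.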

Do not use procedures whose Bayesian properties are not known. As Casella (1992) pointed out, the post-data properties of a procedure are necessary for understanding what can be inferred from an interval. Any procedure whose Bayesian properties have not been explored can have properties that make it unsuitable for post-data inference. Procedures whose properties have not been adequately studied are inappropriate for general use.

Warn readers if the confidence procedure does not correspond to a Bayesian procedure. If it is known that a confidence interval does not correspond to a Bayesian procedure, warn readers that the confidence interval cannot be interpreted as having an X% probability of containing the parameter, that it cannot be interpreted in terms of the precision of measurement, and that it cannot be said to contain the values that should be taken seriously: the interval is merely an interval that, prior to sampling, had an X% probability of containing the true value. Authors who choose to use confidence intervals have a responsibility to keep their readers from invalid inferences, and it is almost certain that readers will misinterpret them without a warning (Hoekstra et al., 2014).

Never report a confidence interval without noting the procedure and the corresponding statistics. As we have described, there are many different ways to construct confidence intervals, and they will have different properties. Some will have better frequentist properties than others; some will correspond to credible intervals, and others will not. It is unfortunately common for authors to report confidence intervals without noting how they were constructed. As can be seen from the examples we’ve presented, this is a terrible practice because without knowing which confidence procedure was used, it is unclear what can be inferred. A narrow interval could correspond to very precise information or very imprecise information depending on which procedure was used. Not knowing which procedure was used could lead to very poor inferences. In addition, enough information should be presented so that any reader can compute a different confidence interval or credible interval. In most cases, this is covered by standard reporting practices, but in other cases more information may need to be given.
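To make the point about procedures concrete, here is a sketch using a toy model that is assumed purely for illustration: two observations drawn uniformly within 5 units either side of an unknown location theta. The function names and the two data pairs are hypothetical. Both procedures below are 50% confidence procedures, yet for the same data they give intervals of very different widths. The first interval is narrowest exactly when the data pin down theta least.

```python
import random

HALF_WIDTH = 5.0  # assumption: each observation is Uniform(theta - 5, theta + 5)


def minmax_interval(x1, x2):
    """50% procedure: the span between the two observations."""
    return min(x1, x2), max(x1, x2)


def flat_prior_interval(x1, x2):
    """50% procedure: central half of the flat-prior posterior, which is
    uniform over the theta values consistent with both observations."""
    support_width = 2 * HALF_WIDTH - abs(x1 - x2)
    midpoint = (x1 + x2) / 2
    return midpoint - support_width / 4, midpoint + support_width / 4


def coverage(procedure, theta, n_sims, rng):
    """Fraction of simulated samples whose interval contains theta."""
    hits = 0
    for _ in range(n_sims):
        x1 = rng.uniform(theta - HALF_WIDTH, theta + HALF_WIDTH)
        x2 = rng.uniform(theta - HALF_WIDTH, theta + HALF_WIDTH)
        lo, hi = procedure(x1, x2)
        hits += lo <= theta <= hi
    return hits / n_sims


def width(interval):
    return interval[1] - interval[0]


rng = random.Random(2015)  # fixed seed so the simulation is repeatable
cov_minmax = coverage(minmax_interval, 0.0, 20000, rng)
cov_flat = coverage(flat_prior_interval, 0.0, 20000, rng)
print("coverage (min/max):   ", cov_minmax)
print("coverage (flat prior):", cov_flat)

close_pair = (0.0, 0.2)   # observations close together: theta is poorly located
far_pair = (-4.5, 4.5)    # observations far apart: theta is tightly located
for pair in (close_pair, far_pair):
    print(pair, width(minmax_interval(*pair)), width(flat_prior_interval(*pair)))
```

A reader told only “the 50% CI has width 0.2” cannot tell whether theta is known precisely or hardly at all without also knowing which of these procedures produced it.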

Consider reporting likelihoods or posteriors instead. An interval provides fairly impoverished information. Just as proponents of confidence intervals argue that CIs provide more information than a significance test (although this is debatable for many CIs), a likelihood or a posterior provides much more information than an interval. Recently, Cumming (2014) [see also here] has proposed so-called “cat’s eye” intervals which are either fiducial distributions or Bayesian posteriors under a “non-informative” prior (the shape is the likelihood, but he interprets the area, so it must be a posterior or a fiducial distribution). With modern scientific graphics so easy to create, along with the fact that likelihoods are often approximately normal, we see no reason why likelihoods and posteriors cannot replace intervals in most circumstances. With a likelihood or a posterior, the arbitrariness of the confidence or credibility coefficient is avoided altogether.


All about that “bias, bias, bias” (it’s no trouble)

At some point, everyone who fiddles around with Bayes factors with point nulls notices something that, at first blush, seems strange: small effect sizes seem “biased” toward the null hypothesis. In null hypothesis significance testing, power simply increases as the true effect size moves away from the null. With Bayes factors, there is a non-monotonicity: increasing the sample size at first slightly increases the degree to which a small effect size favors the null, and only later does the small effect size become evidence for the alternative. I recall puzzling over this with Jeff Rouder years ago when drafting our 2009 paper on Bayesian t tests.

Uri Simonsohn has a blog post critiquing default Bayes factors for their supposed “bias” toward the null hypothesis for small sample sizes. I have several brief responses:

  • We do not suggest a “default” prior; we suggest a family of default priors, with an adjustable prior parameter (see also this paper describing our view, which is conditionally accepted at Multivariate Behavioral Research). If you’re looking for a small effect, adjust the prior.
  • The whole point of a Bayes factor analysis is that you do not know what the true effect size is (see Jeff Rouder and Joe Hilgard’s response here). Noting that the Bayes factor will mislead when you know there’s a small effect but use a prior that says the effect size is probably moderate to large is not useful. Bayes factors just do what you ask them to do!
  • More broadly, though, I think it is helpful to think about this supposed “bias”. Is it what we would expect for a reasonable method? Sometimes our intuitions fail us, and we end up thinking something undesirable, when actually we should be worried if that thing didn’t happen.

The third point is what this blog post is about. Here, I show that the “bias” toward the null for small effect sizes is exactly what must happen for any reasonable method that meets four simple desiderata.

We start with the idea of a measure of evidence comparing some composite alternative hypothesis to the null hypothesis. For our purposes here, it could be any measure of evidence; it does not have to be a Bayes factor. What we will do is set a number of reasonable desiderata on the properties of this evidence measure, and show that the so-called “bias” in favor of the null for small effect sizes must occur.

We assume that our data can be summarized in terms of an effective sample size and an (observed) effect size measure. This effect size should have a “nullest” member (for instance, d=0, or R²=0). For any given sample size, the evidence against the null will be an increasing function of this observed effect size. We also need the concept of “no”, or equivocal, evidence; that is, that the data do not favor either hypothesis. This defines a 0 point on the evidence scale, whatever it is.

The important concept for our demonstration is the idea of a bivariate space of sample size vs evidence. Sample size begins at 0 and increases along the x axis, and “no” evidence is marked on the y axis. We can think of sample size abstractly as indexing the amount of information in the data. We are going to imagine fixing an observed effect size and varying the sample size, which will trace a curve through this bivariate space:

[Figure: A bivariate sample size / evidence space.]

We can now give four desired properties that any evidence measure should have.

Desideratum 1: The evidence with no data is “equivocal”.

If we observe no data, the strength of the evidence does not favor either hypothesis. Whatever the “0 evidence” point in the evidence space, having no data must put you there.

[For a Bayes factor, this means that prior odds and the posterior odds are the same — with no data, they don’t change — and the log Bayes factor is 0.]

Desideratum 2: The evidence for a “null” observed effect size is an increasing function of sample size, in favor of the null. However much evidence a “null” observed effect provides, no other observed effect size can exceed it.

For instance, if we observe d=0 with N=1000 participants, this is more convincing evidence in favor of the null than if we had observed d=0 with N=10. Obviously, this null observed effect should offer the most evidence possible for the null, for a given sample size.

Desideratum 3: A fixed non-null observed effect size must yield arbitrarily large amounts of evidence as sample size increases.

If we observe d=.3, with 10 participants, this isn’t terribly convincing; but if we observed d=.3 with more and more participants, we are increasingly sure that the null hypothesis is false. In the bivariate space, this means that all non-null effect size curves eventually must end up either at -∞ or at an asymptote at some large value in favor of the alternative.

Desideratum 4: The closer an observed effect size is to the null effect size, the more its curve “looks like” the null.

This is just a smoothness assumption. The conclusions we obtain from observing d=0 should be very close to the ones we obtain from d=.001 and even closer to those we obtain from d=.0000001. Of course, this smoothness should also hold for all other observed effect sizes, not just the null, but for our purposes here the observed null is what is important.

For small sample sizes, this means that the curves for small effect sizes must be near the null effect size lines in the bivariate space. As we increase the sample size, of course, those lines must diverge downward.
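The four desiderata can be checked numerically for one concrete evidence measure. The sketch below uses a deliberately simple Bayes factor, assumed here only for illustration and not the default priors discussed in this post: the observed effect d is treated as Normal(delta, 1/N), the null fixes delta = 0, and the alternative puts a Normal(0, prior_sd²) prior on delta. The function name `log_bf01` is hypothetical. Positive values favor the null, so curves that favor the alternative head downward, matching the orientation of the bivariate space described above.

```python
import math


def log_bf01(d, n, prior_sd=1.0):
    """Log Bayes factor for the null over the alternative.

    Assumed model (illustrative): d ~ Normal(delta, 1/n),
    H0: delta = 0, H1: delta ~ Normal(0, prior_sd**2).
    """
    k = n * prior_sd ** 2
    return 0.5 * math.log1p(k) - 0.5 * d ** 2 * n * k / (1 + k)


sample_sizes = [0, 10, 100, 500, 1000, 5000]
for d in (0.0, 0.001, 0.1, 0.3):
    curve = [round(log_bf01(d, n), 2) for n in sample_sizes]
    print(f"d = {d}: {curve}")
```

Every curve starts at 0 when N = 0 (desideratum 1); the d = 0 curve rises with N and sits above all the others (desideratum 2); the d = 0.3 curve eventually falls without bound (desideratum 3); and the d = 0.001 curve is nearly indistinguishable from the d = 0 curve over these sample sizes (desideratum 4). The d = 0.1 curve first rises toward the null and only later turns downward, which is the non-monotonicity at issue.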

The effect of these four desiderata is to ensure that small effect sizes “look” null. This is not a consequence of the Bayes factor, or the prior, but rather of very reasonable conditions that any evidence measure would fulfil. For a Bayes factor, of course, how these lines move through the bivariate space (and how small an effect size will need to be in order to “look” null) will be sensitive to the prior on the alternative, as it must be. But the behaviour described by Simonsohn is a natural consequence of very reasonable assumptions.

Although this behaviour is counterintuitive, we would be worried if it didn’t happen for some measure of evidence.
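The sensitivity to the prior on the alternative can also be sketched numerically. As an assumption made only for this illustration, take a simple normal-prior Bayes factor (not the default priors from our papers): the observed effect d is Normal(delta, 1/N), with delta = 0 under the null and delta ~ Normal(0, prior_sd²) under the alternative. The helper `first_n_favoring_alternative` is a hypothetical name; it finds the smallest sample size at which a fixed observed d = 0.1 stops favoring the null.

```python
import math


def log_bf01(d, n, prior_sd):
    """Log Bayes factor for null over alternative under the assumed model:
    d ~ Normal(delta, 1/n), H0: delta = 0, H1: delta ~ Normal(0, prior_sd**2)."""
    k = n * prior_sd ** 2
    return 0.5 * math.log1p(k) - 0.5 * d ** 2 * n * k / (1 + k)


def first_n_favoring_alternative(d, prior_sd, max_n=1_000_000):
    """Smallest integer sample size at which the evidence favors the alternative."""
    for n in range(1, max_n + 1):
        if log_bf01(d, n, prior_sd) < 0:
            return n
    return None  # no crossing found within max_n


n_wide = first_n_favoring_alternative(0.1, prior_sd=1.0)
n_mid = first_n_favoring_alternative(0.1, prior_sd=0.5)
n_narrow = first_n_favoring_alternative(0.1, prior_sd=0.2)
print("prior sd 1.0:", n_wide)
print("prior sd 0.5:", n_mid)
print("prior sd 0.2:", n_narrow)
```

Under these assumptions, a prior concentrated on smaller effects lets the same observed d = 0.1 count as evidence for the alternative at a smaller sample size, which is the sense in which adjusting the prior changes how small an effect must be to “look” null.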