# The fallacy of placing confidence in confidence intervals (version 2)

My coauthors and I have submitted a new draft of our paper “The fallacy of placing confidence in confidence intervals”. The paper is substantially modified from its previous incarnation. Here is the main argument:

“[C]onfidence intervals may not be used as suggested by modern proponents because this usage is not justified by confidence interval theory. If used in the way CI proponents suggest, some CIs will provide severely misleading inferences for the given data; other CIs will not. Because such considerations are outside of CI theory, developers of CIs do not test them, and it is therefore often not known whether a given CI yields a reasonable inference or not. For this reason, we believe that appeal to CI theory is redundant in the best cases, when inferences can be justified outside CI theory, and unwise in the worst cases, when they cannot.”

The document, source code, and all supplementary material are available here on GitHub.

# Guidelines for reporting confidence intervals

I’m working on a manuscript on confidence intervals, and I thought I’d share a draft section on the reporting of confidence intervals. The paper has several demonstrations of how CIs may, or may not, offer quality inferences, and how they can differ markedly from credible intervals, even ones with so-called “non-informative” priors.

### Guidelines for reporting confidence intervals

Report credible intervals instead. We believe any author who chooses to use confidence intervals should ensure that the intervals correspond numerically with credible intervals under some reasonable prior. Many confidence intervals cannot be so interpreted, but if the authors know they can be, they should be called “credible intervals”. This signals to readers that they can interpret the interval as they have been (incorrectly) told they can interpret confidence intervals. Of course, the corresponding prior must also be reported. This is not to say that one can’t also call them confidence intervals if indeed they are; however, readers interested in arriving at substantive conclusions from the interval are likely more interested in the post-data properties of the procedure than in its coverage.
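For one especially simple case where this correspondence holds exactly, consider estimating a normal mean with known $$\sigma$$: the textbook 95% confidence interval coincides numerically with the 95% central credible interval under a flat prior. A minimal sketch in Python (the data and $$\sigma$$ are made up for illustration):

```python
import math
from statistics import NormalDist

# Hypothetical data: six observations from a normal with known sigma.
data = [4.8, 5.1, 5.6, 4.9, 5.3, 5.2]
sigma = 0.5
n = len(data)
xbar = sum(data) / n
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, ~1.96

# Standard 95% confidence interval for a normal mean with known sigma.
ci = (xbar - z * sigma / math.sqrt(n), xbar + z * sigma / math.sqrt(n))

# Under a flat ("non-informative") prior, the posterior for the mean is
# Normal(xbar, sigma^2 / n), so the central 95% credible interval is:
posterior = NormalDist(mu=xbar, sigma=sigma / math.sqrt(n))
cred = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

print(ci)
print(cred)  # matches ci to floating-point precision
```

The equality is special to this setting; the point of the paper is precisely that many confidence procedures admit no such Bayesian reading.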

Do not use procedures whose Bayesian properties are not known. As Casella (1992) pointed out, the post-data properties of a procedure are necessary for understanding what can be inferred from an interval. Any procedure whose Bayesian properties have not been explored can have properties that make it unsuitable for post-data inference. Procedures whose properties have not been adequately studied are inappropriate for general use.

Warn readers if the confidence procedure does not correspond to a Bayesian procedure. If it is known that a confidence interval does not correspond to a Bayesian procedure, warn readers that the interval cannot be interpreted as having an X% probability of containing the parameter, that it cannot be interpreted in terms of the precision of measurement, and that it cannot be said to contain the values that should be taken seriously: the interval is merely an interval that, prior to sampling, had an X% probability of containing the true value. Authors who choose to use confidence intervals have a responsibility to keep their readers from drawing invalid inferences, and it is almost certain that readers will misinterpret the intervals without a warning (Hoekstra et al., 2014).

Never report a confidence interval without noting the procedure and the corresponding statistics. As we have described, there are many different ways to construct confidence intervals, and they will have different properties. Some will have better frequentist properties than others; some will correspond to credible intervals, and others will not. It is unfortunately common for authors to report confidence intervals without noting how they were constructed. As can be seen from the examples we’ve presented, this is a terrible practice: without knowing which confidence procedure was used, it is unclear what can be inferred. A narrow interval could correspond to very precise information or very imprecise information, depending on which procedure was used, and not knowing the procedure could lead to very poor inferences. In addition, enough information should be presented so that any reader can compute a different confidence interval or credible interval. In most cases, this is covered by standard reporting practices, but in other cases more information may need to be given.

Consider reporting likelihoods or posteriors instead. An interval provides fairly impoverished information. Just as proponents of confidence intervals argue that CIs provide more information than a significance test (although this is debatable for many CIs), a likelihood or a posterior provides much more information than an interval. Recently, Cumming (2014) [see also here] has proposed so-called “cat’s eye” intervals which are either fiducial distributions or Bayesian posteriors under a “non-informative” prior (the shape is the likelihood, but he interprets the area, so it must be a posterior or a fiducial distribution). With modern scientific graphics so easy to create, along with the fact that likelihoods are often approximately normal, we see no reason why likelihoods and posteriors cannot replace intervals in most circumstances. With a likelihood or a posterior, the arbitrariness of the confidence or credibility coefficient is avoided altogether.
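As a minimal sketch of what this might look like, here is the likelihood for a normal mean with known $$\sigma$$ (the case where the likelihood is exactly normal in shape), tabulated on a grid for plotting; the data are hypothetical. Cutting the curve at the usual likelihood-ratio threshold recovers the familiar 95% interval, so nothing is lost by reporting the whole curve instead:

```python
import math

# Hypothetical data; the model is a normal mean with known sigma.
data = [4.8, 5.1, 5.6, 4.9, 5.3, 5.2]
sigma = 0.5
n = len(data)
xbar = sum(data) / n

def rel_likelihood(mu):
    """Likelihood of mu relative to its maximum (attained at xbar)."""
    return math.exp(-n * (xbar - mu) ** 2 / (2 * sigma ** 2))

# Tabulate the curve on a grid; this is the object one would plot,
# and it carries far more information than any single interval cut from it.
grid = [xbar + (i - 50) * 0.02 for i in range(101)]
curve = [(mu, rel_likelihood(mu)) for mu in grid]

# The mu values where -2*log(LR) <= 1.96**2 recover the usual 95% interval.
cutoff = math.exp(-1.96 ** 2 / 2)
inside = [mu for mu, rl in curve if rl > cutoff]
print(min(inside), max(inside))
```

For this normal-shaped likelihood, the endpoints printed above agree (up to the grid resolution) with the standard interval $$\bar{x} \pm 1.96\,\sigma/\sqrt{n}$$.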

# My favorite Neyman passage: on confidence intervals

I’ve been doing a lot of reading on confidence interval theory. Some of this reading is more interesting than the rest. One passage from Neyman’s (1952) book “Lectures and Conferences on Mathematical Statistics and Probability” (available here) stands above all the rest in terms of clarity, style, and humor. I had not read it before the last draft of our confidence interval paper, but those of you who have read our paper will recognize that this is the style I was going for. Maybe you have to be Jerzy Neyman to get away with it.

Neyman gets bonus points for the footnote suggesting that the “eminent”, “elderly” boss is so obtuse (a reference to Fisher?), and for the suggestion that the young frequentist should be “remind[ed] of the glory” of being burned at the stake. This is just absolutely fantastic writing. I hope you enjoy it as much as I did.

[begin excerpt, p. 211-215]

[Neyman is discussing using “sampling experiments” (Monte Carlo experiments with tables of random numbers) in order to gain insight into confidence intervals. $$\theta$$ is a true parameter of a probability distribution to be estimated.]

The sampling experiments are more easily performed than described in
detail. Therefore, let us make a start with $$\theta_1 = 1$$, $$\theta_2 = 2$$, $$\theta_3 = 3$$ and $$\theta_4 = 4$$. We imagine that, perhaps within a week, a practical statistician is faced four times with the problem of estimating $$\theta$$, each time from twelve observations, and that the true values of $$\theta$$ are as above [ie, $$\theta_1,\ldots,\theta_4$$] although the statistician does not know this. We imagine further that the statistician is an elderly gentleman, greatly attached to the arithmetic mean and that he wishes to use formulae (22). However, the statistician has a young assistant who may have read (and understood) modern literature and prefers formulae (21). Thus, for each of the four instances, we shall give two confidence intervals for $$\theta$$, one computed by the elderly Boss, the other by his young Assistant.

[Formulae 21 and 22 are simply different 95% confidence procedures. Formula 21 has better frequentist properties; Formula 22 is inferior, but the Boss likes it because it is intuitive to him.]

Using the first column on the first page of Tippett’s tables of random
numbers and performing the indicated multiplications, we obtain the following
four sets of figures.

The last two lines give the assertions regarding the true value of $$\theta$$ made by the Boss and by the Assistant, respectively. The purpose of the sampling experiment is to verify the theoretical result that the long run relative frequency of cases in which these assertions will be correct is, approximately, equal to $$\alpha = .95$$.

You will notice that in three out of the four cases considered, both assertions (the Boss’ and the Assistant’s) regarding the true value of $$\theta$$ are correct and that in the last case both assertions are wrong. In fact, in this last case the true $$\theta$$ is 4 while the Boss asserts that it is between 2.026 and 3.993 and the Assistant asserts that it is between 2.996 and 3.846. Although the probability of success in estimating $$\theta$$ has been fixed at $$\alpha = .95$$, the failure on the fourth trial need not discourage us. In reality, a set of four trials is plainly too short to serve for an estimate of a long run relative frequency. Furthermore, a simple calculation shows that the probability of at least one failure in the course of four independent trials is equal to .1855. Therefore, a group of four consecutive samples like the above, with at least one wrong estimate of $$\theta$$, may be expected one time in six or even somewhat oftener. The situation is, more or less, similar to betting on a particular side of a die and seeing it win. However, if you continue the sampling experiment and count the cases in which the assertion regarding the true value of $$\theta$$, made by either method, is correct, you will find that the relative frequency of such cases converges gradually to its theoretical value, $$\alpha = .95$$.

Let us put this into more precise terms. Suppose you decide on a number $$N$$ of samples which you will take and use for estimating the true value of $$\theta$$. The true values of the parameter $$\theta$$ may be the same in all $$N$$ cases or they may vary from one case to another. This is absolutely immaterial as far as the relative frequency of successes in estimation is concerned. In each case the probability that your assertion will be correct is exactly equal to $$\alpha = .95$$. Since the samples are taken in a manner insuring independence (this, of course, depends on the goodness of the table of random numbers used), the total number $$Z(N)$$ of successes in estimating $$\theta$$ is the familiar binomial variable with expectation equal to $$N\alpha$$ and with variance equal to $$N\alpha(1 - \alpha)$$. Thus, if $$N = 100$$, $$\alpha = .95$$, it is rather improbable that the relative frequency $$Z(N)/N$$ of successes in estimating $$\theta$$ will differ from $$\alpha$$ by more than

$$2\sqrt{\frac{\alpha(1-\alpha)}{N}} = .044$$

This is the exact meaning of the colloquial description that the long run relative frequency of successes in estimating $$\theta$$ is equal to the preassigned $$\alpha$$. Your knowledge of the theory of confidence intervals will not be influenced by the sampling experiment described, nor will the experiment prove anything. However, if you perform it, you will get an intuitive feeling of the machinery behind the method which is an excellent complement to the understanding of the theory. This is like learning to drive an automobile: gaining experience by actually driving a car compared with learning the theory by reading a book about driving.

Among other things, the sampling experiment will attract attention to
the frequent difference in the precision of estimating $$\theta$$ by means of the two alternative confidence intervals (21) and (22). You will notice, in fact, that the confidence intervals based on $$X$$, the greatest observation in the sample, are frequently shorter than those based on the arithmetic mean $$\bar{X}$$. If we continue to discuss the sampling experiment in terms of cooperation between the eminent elderly statistician and his young assistant, we shall have occasion to visualize quite amusing scenes of indignation on the one hand and of despair before the impenetrable wall of stiffness of mind and routine of thought on the other. [See footnote] For example, one can imagine the conversation between the two men in connection with the first and third samples reproduced above. You will notice that in both cases the confidence interval of the Assistant is not only shorter than that of the Boss but is completely included in it. Thus, as a result of observing the first sample, the Assistant asserts that

$$.956 \leq \theta \leq 1.227.$$

On the other hand, the assertion of the Boss is far more conservative and admits the possibility that $$\theta$$ may be as small as .688 and as large as 1.355. And both assertions correspond to the same confidence coefficient, $$\alpha = .95$$! I can just see the face of my eminent colleague redden with indignation and hear the following colloquy.

Boss: “Now, how can this be true? I am to assert that $$\theta$$ is between .688 and 1.355 and you tell me that the probability of my being correct is .95. At the same time, you assert that $$\theta$$ is between .956 and 1.227 and claim the same probability of success in estimation. We both admit the possibility that $$\theta$$ may be some number between .688 and .956 or between 1.227 and 1.355. Thus, the probability of $$\theta$$ falling within these intervals is certainly greater than zero. In these circumstances, you have to be a nit-wit to believe that
$$\begin{eqnarray*} P\{.688 \leq \theta \leq 1.355\} &=& P\{.688 \leq \theta < .956\} + P\{.956 \leq \theta \leq 1.227\}\\ && + P\{1.227 \leq \theta \leq 1.355\}\\ &=& P\{.956 \leq \theta \leq 1.227\}.\mbox{”} \end{eqnarray*}$$

Assistant: “But, Sir, the theory of confidence intervals does not assert anything about the probability that the unknown parameter $$\theta$$ will fall within any specified limits. What it does assert is that the probability of success in estimation using either of the two formulae (21) or (22) is equal to $$\alpha$$.”

Boss: “Stuff and nonsense! I use one of the blessed pair of formulae and come up with the assertion that $$.688 \leq \theta \leq 1.355$$. This assertion is a success only if $$\theta$$ falls within the limits indicated. Hence, the probability of success is equal to the probability of $$\theta$$ falling within these limits —.”

Assistant: “No, Sir, it is not. The probability you describe is the a posteriori probability regarding $$\theta$$, while we are concerned with something else. Suppose that we continue with the sampling experiment until we have, say, $$N = 100$$ samples. You will see, Sir, that the relative frequency of successful estimations using formulae (21) will be about the same as that using formulae (22) and that both will be approximately equal to .95.”

I do hope that the Assistant will not get fired. However, if he does, I would remind him of the glory of Giordano Bruno who was burned at the stake by the Holy Inquisition for believing in the Copernican theory of the solar system. Furthermore, I would advise him to have a talk with a physicist or a biologist or, maybe, with an engineer. They might fail to understand the theory but, if he performs for them the sampling experiment described above, they are likely to be convinced and give him a new job. In due course, the eminent statistical Boss will die or retire and then —.

[footnote] Sad as it is, your mind does become less flexible and less receptive to novel ideas as the years go by. The more mature members of the audience should not take offense. I, myself, am not young and have young assistants. Besides, unreasonable and stubborn individuals are found not only among the elderly but also frequently among young people.

[end excerpt]
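Neyman’s sampling experiment is easy to recreate on a computer today. Below is a minimal Python sketch. The Uniform(0, $$\theta$$) model and the two interval formulae are my reconstructions, consistent with the numbers in the excerpt (e.g., the Assistant’s interval $$[X, X/0.05^{1/12}]$$ reproduces $$.956 \leq \theta \leq 1.227$$), not Neyman’s formulae (21) and (22) verbatim:

```python
import random
from statistics import mean

def sampling_experiment(theta, n=12, reps=20_000, seed=1):
    """Monte Carlo version of Neyman's sampling experiment.

    Assumes observations are Uniform(0, theta); both interval formulae
    are reconstructions from the excerpt's numbers, not the originals.
    """
    rng = random.Random(seed)
    c = 0.05 ** (1 / n)  # P(sample max < c * theta) = c**n = 0.05
    hits_max = hits_mean = 0
    for _ in range(reps):
        x = [rng.uniform(0, theta) for _ in range(n)]
        # "Assistant" interval, based on the greatest observation:
        # [max, max / c] covers theta in exactly 95% of repetitions.
        hits_max += max(x) <= theta <= max(x) / c
        # "Boss" interval, based on the arithmetic mean: 2 * mean is
        # unbiased for theta, with a plug-in normal approximation.
        est = 2 * mean(x)
        half = 1.96 * est / (3 * n) ** 0.5
        hits_mean += est - half <= theta <= est + half
    return hits_max / reps, hits_mean / reps

for theta in (1, 2, 3, 4):
    cov_a, cov_b = sampling_experiment(theta)
    print(f"theta={theta}: Assistant coverage {cov_a:.3f}, "
          f"Boss coverage {cov_b:.3f}")
```

As Neyman says, the true $$\theta$$ is immaterial: the max-based coverage settles at .95 for every $$\theta$$, while the plug-in mean-based interval sketched here lands slightly below it (a reminder that approximate procedures need checking).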