# My favorite Neyman passage: on confidence intervals

I’ve been doing a lot of reading on confidence interval theory. Some of the reading is more interesting than others. There is one passage from Neyman’s (1952) book “Lectures and Conferences on Mathematical Statistics and Probability” (available here) that stands above the rest in terms of clarity, style, and humor. I had not read this before the last draft of our confidence interval paper, but for those of you who have read it, you’ll recognize that this is the style I was going for. Maybe you have to be Jerzy Neyman to get away with it.

Neyman gets bonus points for the footnote suggesting the “eminent”, “elderly” boss is so obtuse (a reference to Fisher?) and that the young frequentists should be “remind[ed] of the glory” of being burned at the stake. This is just absolutely fantastic writing. I hope you enjoy it as much as I did.

[begin excerpt, p. 211-215]

[Neyman is discussing using “sampling experiments” (Monte Carlo experiments with tables of random numbers) in order to gain insight into confidence intervals. $$\theta$$ is a true parameter of a probability distribution to be estimated.]

The sampling experiments are more easily performed than described in
detail. Therefore, let us make a start with $$\theta_1 = 1$$, $$\theta_2 = 2$$, $$\theta_3 = 3$$ and $$\theta_4 = 4$$. We imagine that, perhaps within a week, a practical statistician is faced four times with the problem of estimating $$\theta$$, each time from twelve observations, and that the true values of $$\theta$$ are as above [ie, $$\theta_1,\ldots,\theta_4$$] although the statistician does not know this. We imagine further that the statistician is an elderly gentleman, greatly attached to the arithmetic mean and that he wishes to use formulae (22). However, the statistician has a young assistant who may have read (and understood) modern literature and prefers formulae (21). Thus, for each of the four instances, we shall give two confidence intervals for $$\theta$$, one computed by the elderly Boss, the other by his young Assistant.

[Formulae 21 and 22 are simply different 95% confidence procedures. Formula 21 has better frequentist properties; Formula 22 is inferior, but the Boss likes it because it is intuitive to him.]

Using the first column on the first page of Tippett’s tables of random
numbers and performing the indicated multiplications, we obtain the following
four sets of figures.

The last two lines give the assertions regarding the true value of $$\theta$$ made by the Boss and by the Assistant, respectively. The purpose of the sampling experiment is to verify the theoretical result that the long run relative frequency of cases in which these assertions will be correct is, approximately, equal to $$\alpha = .95$$.

You will notice that in three out of the four cases considered, both assertions (the Boss’ and the Assistant’s) regarding the true value of $$\theta$$ are correct and that in the last case both assertions are wrong. In fact, in this last case the true $$\theta$$ is 4 while the Boss asserts that it is between 2.026 and 3.993 and the Assistant asserts that it is between 2.996 and 3.846. Although the probability of success in estimating $$\theta$$ has been fixed at $$\alpha = .95$$, the failure on the fourth trial need not discourage us. In reality, a set of four trials is plainly too short to serve for an estimate of a long run relative frequency. Furthermore, a simple calculation shows that the probability of at least one failure in the course of four independent trials is equal to .1855. Therefore, a group of four consecutive samples like the above, with at least one wrong estimate of $$\theta$$, may be expected one time in six or even somewhat oftener. The situation is, more or less, similar to betting on a particular side of a die and seeing it win. However, if you continue the sampling  experiment and count the cases in which the assertion regarding the true value of $$\theta$$, made by either method, is correct, you will find that the relative frequency of such cases converges gradually to its theoretical value, $$\alpha= .95$$.

Let us put this into more precise terms. Suppose you decide on a number $$N$$ of samples which you will take and use for estimating the true value of $$\theta$$. The true values of the parameter $$\theta$$ may be the same in all $$N$$ cases or they may vary from one case to another. This is absolutely immaterial as far as the relative frequency of successes in estimation is concerned. In each case the probability that your assertion will be correct is exactly equal to $$\alpha = .95$$. Since the samples are taken in a manner insuring independence (this, of course, depends on the goodness of the table of random numbers used), the total number $$Z(N)$$ of successes in estimating $$\theta$$ is the familiar binomial variable with expectation equal to $$N\alpha$$ and with variance equal to $$N\alpha(1 - \alpha)$$. Thus, if $$N = 100$$, $$\alpha = .95$$, it is rather improbable that the relative frequency $$Z(N)/N$$ of successes in estimating $$\theta$$ will differ from $$\alpha$$ by more than

$$2\sqrt{\frac{\alpha(1-\alpha)}{N}} = .042$$

This is the exact meaning of the colloquial description that the long run relative frequency of successes in estimating $$\theta$$ is equal to the preassigned $$\alpha$$. Your knowledge of the theory of confidence intervals will not be influenced by the sampling experiment described, nor will the experiment prove anything. However, if you perform it, you will get an intuitive feeling of the machinery behind the method which is an excellent complement to the understanding of the theory. This is like learning to drive an automobile: gaining experience by actually driving a car compared with learning the theory by reading a book about driving.

Among other things, the sampling experiment will attract attention to
the frequent difference in the precision of estimating $$\theta$$ by means of the two alternative confidence intervals (21) and (22). You will notice, in fact, that the confidence intervals based on $$X$$, the greatest observation in the sample, are frequently shorter than those based on the arithmetic mean $$\bar{X}$$. If we continue to discuss the sampling experiment in terms of cooperation between the eminent elderly statistician and his young assistant, we shall have occasion to visualize quite amusing scenes of indignation on the one hand and of despair before the impenetrable wall of stiffness of mind and routine of thought on the other.[See footnote] For example, one can imagine the conversation between the two men in connection with the first and third samples reproduced above. You will notice that in both cases the confidence interval of the Assistant is not only shorter than that of the Boss but is completely included in it. Thus, as a result of observing the first sample, the Assistant asserts that

$$.956 \leq \theta \leq 1.227.$$

On the other hand, the assertion of the Boss is far more conservative and admits the possibility that $$\theta$$ may be as small as .688 and as large as 1.355. And both assertions correspond to the same confidence coefficient, $$\alpha = .95$$! I can just see the face of my eminent colleague redden with indignation and hear the following colloquy.

Boss: “Now, how can this be true? I am to assert that $$\theta$$ is between .688 and 1.355 and you tell me that the probability of my being correct is .95. At the same time, you assert that $$\theta$$ is between .956 and 1.227 and claim the same probability of success in estimation. We both admit the possibility that $$\theta$$ may be some number between .688 and .956 or between 1.227 and 1.355. Thus, the probability of $$\theta$$ falling within these intervals is certainly greater than zero. In these circumstances, you have to be a nit-wit to believe that
$$\begin{eqnarray*} P\{.688 \leq \theta \leq 1.355\} &=& P\{.688 \leq \theta < .956\} + P\{.956 \leq \theta \leq 1.227\}\\ && + P\{1.227 \leq \theta \leq 1.355\}\\ &=& P\{.956 \leq \theta \leq 1.227\}.\mbox{”} \end{eqnarray*}$$

Assistant: “But, Sir, the theory of confidence intervals does not assert anything about the probability that the unknown parameter $$\theta$$ will fall within any specified limits. What it does assert is that the probability of success in estimation using either of the two formulae (21) or (22) is equal to $$\alpha$$.”

Boss: “Stuff and nonsense! I use one of the blessed pair of formulae and come up with the assertion that $$.688 \leq \theta \leq 1.355$$. This assertion is a success only if $$\theta$$ falls within the limits indicated. Hence, the probability of success is equal to the probability of $$\theta$$ falling within these limits —.”

Assistant: “No, Sir, it is not. The probability you describe is the a posteriori probability regarding $$\theta$$, while we are concerned with something else. Suppose that we continue with the sampling experiment until we have, say, $$N = 100$$ samples. You will see, Sir, that the relative frequency of successful estimations using formulae (21) will be about the same as that using formulae (22) and that both will be approximately equal to .95.”

I do hope that the Assistant will not get fired. However, if he does, I would remind him of the glory of Giordano Bruno who was burned at the stake by the Holy Inquisition for believing in the Copernican theory of the solar system. Furthermore, I would advise him to have a talk with a physicist or a biologist or, maybe, with an engineer. They might fail to understand the theory but, if he performs for them the sampling experiment described above, they are likely to be convinced and give him a new job. In due course, the eminent statistical Boss will die or retire and then —.

[footnote] Sad as it is, your mind does become less flexible and less receptive to novel ideas as the years go by. The more mature members of the audience should not take offense. I, myself, am not young and have young assistants. Besides, unreasonable and stubborn individuals are found not only among the elderly but also frequently among young people.

[end excerpt]
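
The sampling experiment Neyman describes is easy to try on a computer. The sketch below is only an illustration under stated assumptions: the excerpt does not reproduce formulae (21) and (22), so the code assumes the observations are uniform on $$(0, \theta)$$, that the Assistant's interval runs from the largest observation to that observation divided by $$.05^{1/12}$$, and that the Boss's interval is a normal approximation built on the arithmetic mean. These forms are consistent with the intervals quoted in the excerpt, but they are my reconstruction, and the function names are hypothetical.

```python
import random

def assistant_interval(sample, alpha=0.95):
    """Interval based on the largest observation (ASSUMED form of formulae 21)."""
    n = len(sample)
    x_max = max(sample)
    # For uniform(0, theta), P(max / theta <= t) = t**n, so this has coverage alpha.
    return x_max, x_max / (1 - alpha) ** (1 / n)

def boss_interval(sample, z=1.96):
    """Normal-approximation interval based on the mean (ASSUMED form of formulae 22)."""
    n = len(sample)
    mean = sum(sample) / n
    # For uniform(0, theta), 2 * mean / theta has mean 1 and sd 1 / sqrt(3 n).
    c = z / (3 * n) ** 0.5
    return 2 * mean / (1 + c), 2 * mean / (1 - c)

def run_experiment(thetas, n=12, seed=1):
    """Relative frequency of correct assertions for each statistician."""
    rng = random.Random(seed)
    hits = {"boss": 0, "assistant": 0}
    for theta in thetas:
        # "Performing the indicated multiplications": scale random numbers by theta.
        sample = [theta * rng.random() for _ in range(n)]
        lo, hi = boss_interval(sample)
        hits["boss"] += lo <= theta <= hi
        lo, hi = assistant_interval(sample)
        hits["assistant"] += lo <= theta <= hi
    return {name: count / len(thetas) for name, count in hits.items()}

# The true value may change from case to case; here it cycles through 1, 2, 3, 4.
thetas = [1 + (i % 4) for i in range(20000)]
coverage = run_experiment(thetas)
print(coverage)

# Probability of at least one failure in four independent trials.
p_fail = 1 - 0.95 ** 4
print(round(p_fail, 4))  # 0.1855
```

Both relative frequencies should settle near .95 even though the true $$\theta$$ varies from case to case, which is the Assistant's point in the colloquy.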

# A parable on confidence intervals: why “confidence” is misleading

Null hypothesis significance testing (NHST) is increasingly falling out of style with methodologically-minded behavioral and social scientists. Many diverse critiques have been leveled against significance testing; the debate is increasingly what should replace it. Building on work with my colleagues (see here and here), I discuss and critique one replacement option that has been persistently suggested over the years: confidence procedures. We begin with a parable.

### A parable

Susan and Mark were talking over lunch one day. Susan was telling Mark about her recent move into a smaller apartment. “My new place is much smaller than my old one. I have extra boxes sitting in my rental truck, and I need a place to store them.”

Mark thought of ways he could help Susan with her storage problem. “I have a vehicle.”

Susan wasn’t sure what Mark meant. “A vehicle?”

“Yes, a vehicle.” Mark took Susan’s question as evidence she disbelieved him. “Here’s my insurance card. And a gas receipt…oh, and from this invoice here you can see I had the engine serviced. It is definitely a vehicle.”

“But…can I store my boxes in it?”

Mark didn’t understand why Susan was still confused. “Vehicles have wheels and can move from place to place. Put more simply, vehicles are places you can store your boxes.”

Nonplussed, Susan decided not to press the issue and asked for the bill.

This scenario is obviously absurd. Mark seems to think that there is an obvious link between having a vehicle and having a place to store boxes. Indeed, some vehicles can store boxes, but not most. Mark’s proof that he has a vehicle missed Susan’s need for storage, and as such, was unhelpful.

### Confidence Procedures

Absurd as the scenario is, though, Mark’s reaction is analogous to the advocacy of confidence procedures. To see why, we need to understand what a confidence procedure is. The definition is straightforward:

An X% confidence procedure (CP) is any procedure that generates intervals that contain the true value in X% of repeated samples. A confidence interval is a particular interval generated from data using a confidence procedure.

One way to think about a confidence procedure is like a net that catches a fish on a certain percentage of casts. The CP is a net that “catches” the true value of a parameter (a population mean, variance, or rate, for example) X% of the time. Unlike with a net, however, we cannot simply retrieve the net and look for the “fish”. We must use statistical theory to infer something about the true value of the parameter.

Typically what is of interest after collecting data is to somehow quantify what is known, what should be believed, or what the evidence is for different possible values of a parameter. Suppose we use a confidence procedure to compute a 95% confidence interval for a population mean, and we obtain the interval $$(5, 10)$$. We know the property of the procedure: it will include the mean 95% of the time. But what can we say about the parameter from the confidence interval $$(5, 10)$$, on the basis that it is a sample from a confidence procedure? It isn’t clear, but various proponents of CPs have offered ideas:

“[t]he interpretation of the confidence interval constructed around that specific mean would be that there is a 95% probability that the interval is one of the 95% of all possible confidence intervals that includes the population mean. Put more simply, in the absence of any other information, there is a 95% probability that the obtained confidence interval includes the population mean.” (Masson and Loftus, 2003)

“[w]e can be 95% confident that our interval includes [the population mean] and can think of the lower and upper limits as likely lower and upper bounds for [the population mean].” (Cumming, 2014)

Neither of these is correct. In fact, in his classic article outlining the theory underlying confidence procedures, Neyman definitively says that these interpretations are wrong:

“Consider now the case when a sample … is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the CI bounds] is equal to [X%]? The answer is obviously in the negative.” (Neyman, 1937)

Neyman developed the theory of CP within the frequentist theoretical framework, in which probability is not a quantification of uncertain knowledge, but rather a long-run average rate of occurrence of events. A procedure thus has a probability associated with it, but the CI, being a single realization from the procedure, does not. But what if we loosen the requirements of Neyman’s frequentist theory? Is there a sense in which the confidence coefficient from a confidence procedure quantifies our uncertainty about a parameter given a confidence interval?

The answer here is again, no. To see why, we need to show a confidence procedure that is disconnected from knowledge about the parameter. Such examples have long been known in the theoretical statistical literature, and are not hard to find. My colleagues and I explore one such example in our paper, “The fallacy of placing confidence in confidence intervals” and its supplement. I will outline part of the example here.

### The missing submarine

Suppose we are looking for a 10 meter long submarine in a vast ocean. We know that it has rested on the bottom of the ocean somewhere, and we also know that when it hits the ocean bottom, it releases two bubbles. These bubbles occur independently and with uniform probability anywhere along the length of the submarine. Our goal is to find the submarine’s hatch, which is halfway along the submarine’s length.

We represent the hatch location as $$\mu$$ because it is the submarine’s mean (and median) location. We would like to use the bubbles to compute a confidence interval that is supposed to quantify our knowledge about $$\mu$$. One possible confidence interval is the interval between the bubbles. This is easily seen to be a 50% confidence interval, because the hatch is the median; given that the first bubble is on one side of the hatch, there is a 50% probability that the second bubble ends up on the other side of the hatch. The two bubbles are on opposite sides of the hatch 50% of the time.
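
The 50% claim can be checked with a short simulation. This is a minimal sketch, assuming the setup described above (two independent bubbles, each uniform along a 10 meter submarine); the hatch location `mu=0.0`, the seed, and the function names are arbitrary choices made for illustration.

```python
import random

def bubble_interval(mu, rng, length=10.0):
    """Draw two bubbles uniformly along the submarine and return the interval between them."""
    half = length / 2
    b1 = rng.uniform(mu - half, mu + half)
    b2 = rng.uniform(mu - half, mu + half)
    return min(b1, b2), max(b1, b2)

def coverage(mu=0.0, trials=100_000, seed=2):
    """Fraction of simulated intervals that contain the hatch."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        lo, hi = bubble_interval(mu, rng)
        hits += lo <= mu <= hi
    return hits / trials

between_bubbles_coverage = coverage()
print(between_bubbles_coverage)
```

The printed proportion should be close to 0.5 wherever the hatch is placed, since the procedure only depends on which side of the hatch each bubble lands.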

This confidence procedure is generally applicable as a nonparametric confidence procedure for the median, given two observations. Another well-known confidence procedure that uses the interval between two observations is the Student’s $$t_1$$ interval. We outline the properties of this procedure in the submarine case in our supplement; for now, all that is important is the knowledge that the interval between the bubbles forms a 50% CI. The question is now, does the confidence coefficient of 50% quantify our uncertainty about whether the hatch is in the interval?

The answer is definitely “no.” To see why, we need only look at how the bubbles might occur.

In panel A, the bubbles are almost 10 meters apart. In this case, the 50% confidence interval is very wide. But our knowledge of the parameter is very precise, because if the submarine is 10 meters long, then bubbles that are 10 meters apart must come from opposite ends of the craft. The mean $$\mu$$ must be very close to the middle of the bubbles (the “likelihood” row shows the possible locations of the hatch, given the bubbles). Further, because the bubbles are more than 5 meters apart, the 50% confidence interval must contain the hatch, with 100% certainty. The 50% confidence coefficient is unrelated to our actual certainty that the parameter is in the interval.

In panel B, the bubbles are very close together. The two bubbles are so close that they do not provide much more information than a single bubble would. The possible locations of the hatch, indicated by the “likelihood” row, are much wider than before. In spite of the narrow CI, knowledge about the location of the hatch is as diffuse as it can be. Only 5% of confidence intervals as narrow as this one contain the true value. Far from being 50% certain that the true value is in the interval, we should be fairly certain it is not.
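
The dependence on the distance between the bubbles can be made concrete. The sketch below assumes the same uniform-bubble setup and groups simulated intervals by width. The closed form in `theoretical_coverage` is my own derivation under that assumption, not a formula taken from the paper: for bubbles a distance $$w$$ apart, with $$w \leq 5$$, the chance that they straddle the hatch is $$w/(10 - w)$$, and it is 1 for wider intervals. The width bands and function names are illustrative.

```python
import random

def conditional_coverage(width_lo, width_hi, mu=0.0, trials=400_000, seed=3):
    """Coverage of the between-bubbles interval among samples whose width is in a band."""
    rng = random.Random(seed)
    hits = kept = 0
    for _ in range(trials):
        b1 = rng.uniform(mu - 5, mu + 5)
        b2 = rng.uniform(mu - 5, mu + 5)
        width = abs(b1 - b2)
        if width_lo <= width < width_hi:
            kept += 1
            hits += min(b1, b2) <= mu <= max(b1, b2)
    return hits / kept

def theoretical_coverage(width, length=10.0):
    """Derived under the uniform assumption: P(interval contains hatch | width)."""
    if width >= length / 2:
        return 1.0  # bubbles more than half a submarine apart must straddle the hatch
    return width / (length - width)

wide = conditional_coverage(5.0, 10.0)     # bubbles at least 5 meters apart
narrow = conditional_coverage(0.40, 0.55)  # bubbles roughly half a meter apart
print(wide, narrow)
print(theoretical_coverage(0.5 / 1.05))
```

Under these assumptions, a width of about 0.48 meters corresponds to the 5% figure, while every interval wider than 5 meters contains the hatch. The 50% confidence coefficient is only the average over all widths.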

An alternative approach to the submarine problem is to take the middle 50% of the possible values. This forms an objective Bayesian 50% credible interval, and also, incidentally, a 50% confidence interval. Unlike the first confidence procedure, the Bayesian procedure allows for reasonable inferences because it is derived from the likelihood. See the paper and the supplement for more details [1].
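
A minimal sketch of the "middle 50% of the possible values" interval follows, again assuming uniform bubbles along a 10 meter submarine. Given the bubbles, the hatch can lie anywhere from 5 meters behind the farther bubble to 5 meters ahead of the nearer one; the code takes the central half of that range. The function names are hypothetical, and the simulation is only an illustration of how the interval behaves when the bubbles are close together.

```python
import random

def credible_interval(b1, b2, length=10.0, mass=0.5):
    """Central `mass` share of the hatch locations compatible with the two bubbles."""
    lo_possible = max(b1, b2) - length / 2
    hi_possible = min(b1, b2) + length / 2
    center = (lo_possible + hi_possible) / 2
    half_width = mass * (hi_possible - lo_possible) / 2
    return center - half_width, center + half_width

def coverage_when_narrow(max_gap=0.5, mu=0.0, trials=200_000, seed=4):
    """Coverage of the credible interval among samples with closely spaced bubbles."""
    rng = random.Random(seed)
    hits = kept = 0
    for _ in range(trials):
        b1 = rng.uniform(mu - 5, mu + 5)
        b2 = rng.uniform(mu - 5, mu + 5)
        if abs(b1 - b2) < max_gap:
            kept += 1
            lo, hi = credible_interval(b1, b2)
            hits += lo <= mu <= hi
    return hits / kept

# Bubbles almost 10 meters apart pin the hatch down tightly.
print(credible_interval(-4.9, 4.9))
# Closely spaced bubbles still give roughly 50% coverage.
narrow_case = coverage_when_narrow()
print(narrow_case)
```

In this sketch the interval is narrow when the bubbles are far apart and wide when they are close together, which tracks what the bubbles actually tell us about the hatch.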

### “I have a vehicle”

We can now see why a focus on confidence procedures is misguided. The fact that a procedure is a confidence procedure is unrelated to its usefulness in making good inferences about a parameter of interest. Papers that present novel confidence procedures – there are plenty of them – focus on the wrong thing. A proof that a procedure has an X% long-run probability of including the mean is like Mark proving that he had a vehicle. Vehicle-ness was irrelevant to Susan. What Susan wanted was a place to store her boxes. What researchers want are intervals that allow good inferences.

But in order to prove that a procedure makes useful inferences, you need a different theoretical framework than confidence procedures. A good start would be the likelihood or Bayesian frameworks. Just as some vehicles may be useful to store boxes, some confidence procedures are also Bayesian credible procedures. Proponents of particular confidence procedures have a responsibility to researchers to prove that their procedures also allow reasonable inferences, which will require further work than simply proving that a procedure is a confidence procedure.

Confidence interval proponents believe, incorrectly, that the confidence framework can be relied upon, by itself, to generate useful inferences. Reasonable inferences were never the purpose of confidence procedures, so it is not surprising that they fail dramatically in many simple cases.

[1] It is important to note that the Bayesian-derived confidence procedure, in spite of its obvious reasonableness and its lack of the inferential pathologies of other confidence procedures, is not the preferred frequentist interval. There are a number of confidence procedures that are preferred under frequentist theory in the submarine example because they exclude incorrect values at a higher average rate; see the supplement and Welch, 1939.