# My favorite Neyman passage: on confidence intervals

I’ve been doing a lot of reading on confidence interval theory. Some of it is more interesting than the rest. There is one passage from Neyman’s (1952) book “Lectures and Conferences on Mathematical Statistics and Probability” (available here) that stands above the rest in terms of clarity, style, and humor. I had not read this before the last draft of our confidence interval paper, but those of you who have read our paper will recognize that this is the style I was going for. Maybe you have to be Jerzy Neyman to get away with it.

Neyman gets bonus points for the footnote suggesting the “eminent”, “elderly” boss is so obtuse (a reference to Fisher?) and that the young frequentists should be “remind[ed] of the glory” of being burned at the stake. This is just absolutely fantastic writing. I hope you enjoy it as much as I did.

[begin excerpt, p. 211-215]

[Neyman is discussing using “sampling experiments” (Monte Carlo experiments with tables of random numbers) in order to gain insight into confidence intervals. $$\theta$$ is a true parameter of a probability distribution to be estimated.]

The sampling experiments are more easily performed than described in detail. Therefore, let us make a start with $$\theta_1 = 1$$, $$\theta_2 = 2$$, $$\theta_3 = 3$$ and $$\theta_4 = 4$$. We imagine that, perhaps within a week, a practical statistician is faced four times with the problem of estimating $$\theta$$, each time from twelve observations, and that the true values of $$\theta$$ are as above [ie, $$\theta_1,\ldots,\theta_4$$] although the statistician does not know this. We imagine further that the statistician is an elderly gentleman, greatly attached to the arithmetic mean and that he wishes to use formulae (22). However, the statistician has a young assistant who may have read (and understood) modern literature and prefers formulae (21). Thus, for each of the four instances, we shall give two confidence intervals for $$\theta$$, one computed by the elderly Boss, the other by his young Assistant.

[Formulae 21 and 22 are simply different 95% confidence procedures. Formula 21 has better frequentist properties; Formula 22 is inferior, but the Boss likes it because it is intuitive to him.]

Using the first column on the first page of Tippett’s tables of random numbers and performing the indicated multiplications, we obtain the following four sets of figures.

The last two lines give the assertions regarding the true value of $$\theta$$ made by the Boss and by the Assistant, respectively. The purpose of the sampling experiment is to verify the theoretical result that the long run relative frequency of cases in which these assertions will be correct is, approximately, equal to $$\alpha = .95$$.

You will notice that in three out of the four cases considered, both assertions (the Boss’ and the Assistant’s) regarding the true value of $$\theta$$ are correct and that in the last case both assertions are wrong. In fact, in this last case the true $$\theta$$ is 4 while the Boss asserts that it is between 2.026 and 3.993 and the Assistant asserts that it is between 2.996 and 3.846. Although the probability of success in estimating $$\theta$$ has been fixed at $$\alpha = .95$$, the failure on the fourth trial need not discourage us. In reality, a set of four trials is plainly too short to serve for an estimate of a long run relative frequency. Furthermore, a simple calculation shows that the probability of at least one failure in the course of four independent trials is equal to .1855. Therefore, a group of four consecutive samples like the above, with at least one wrong estimate of $$\theta$$, may be expected one time in six or even somewhat oftener. The situation is, more or less, similar to betting on a particular side of a die and seeing it win. However, if you continue the sampling  experiment and count the cases in which the assertion regarding the true value of $$\theta$$, made by either method, is correct, you will find that the relative frequency of such cases converges gradually to its theoretical value, $$\alpha= .95$$.

Let us put this into more precise terms. Suppose you decide on a number $$N$$ of samples which you will take and use for estimating the true value of $$\theta$$. The true values of the parameter $$\theta$$ may be the same in all $$N$$ cases or they may vary from one case to another. This is absolutely immaterial as far as the relative frequency of successes in estimation is concerned. In each case the probability that your assertion will be correct is exactly equal to $$\alpha = .95$$. Since the samples are taken in a manner insuring independence (this, of course, depends on the goodness of the table of random numbers used), the total number $$Z(N)$$ of successes in estimating $$\theta$$ is the familiar binomial variable with expectation equal to $$N\alpha$$ and with variance equal to $$N\alpha(1 - \alpha)$$. Thus, if $$N = 100$$, $$\alpha = .95$$, it is rather improbable that the relative frequency $$Z(N)/N$$ of successes in estimating $$\theta$$ will differ from $$\alpha$$ by more than

$$2\sqrt{\frac{\alpha(1-\alpha)}{N}} = .042$$

This is the exact meaning of the colloquial description that the long run relative frequency of successes in estimating $$\theta$$ is equal to the preassigned $$\alpha$$. Your knowledge of the theory of confidence intervals will not be influenced by the sampling experiment described, nor will the experiment prove anything. However, if you perform it, you will get an intuitive feeling of the machinery behind the method which is an excellent complement to the understanding of the theory. This is like learning to drive an automobile: gaining experience by actually driving a car compared with learning the theory by reading a book about driving.

Among other things, the sampling experiment will attract attention to the frequent difference in the precision of estimating $$\theta$$ by means of the two alternative confidence intervals (21) and (22). You will notice, in fact, that the confidence intervals based on $$X$$, the greatest observation in the sample, are frequently shorter than those based on the arithmetic mean $$\bar{X}$$. If we continue to discuss the sampling experiment in terms of cooperation between the eminent elderly statistician and his young assistant, we shall have occasion to visualize quite amusing scenes of indignation on the one hand and of despair before the impenetrable wall of stiffness of mind and routine of thought on the other.[See footnote] For example, one can imagine the conversation between the two men in connection with the first and third samples reproduced above. You will notice that in both cases the confidence interval of the Assistant is not only shorter than that of the Boss but is completely included in it. Thus, as a result of observing the first sample, the Assistant asserts that

$$.956 \leq \theta \leq 1.227.$$

On the other hand, the assertion of the Boss is far more conservative and admits the possibility that $$\theta$$ may be as small as .688 and as large as 1.355. And both assertions correspond to the same confidence coefficient, $$\alpha = .95$$! I can just see the face of my eminent colleague redden with indignation and hear the following colloquy.

Boss: “Now, how can this be true? I am to assert that $$\theta$$ is between .688 and 1.355 and you tell me that the probability of my being correct is .95. At the same time, you assert that $$\theta$$ is between .956 and 1.227 and claim the same probability of success in estimation. We both admit the possibility that $$\theta$$ may be some number between .688 and .956 or between 1.227 and 1.355. Thus, the probability of $$\theta$$ falling within these intervals is certainly greater than zero. In these circumstances, you have to be a nit-wit to believe that
$$\begin{eqnarray*} P\{.688 \leq \theta \leq 1.355\} &=& P\{.688 \leq \theta < .956\} + P\{.956 \leq \theta \leq 1.227\}\\ && + P\{1.227 \leq \theta \leq 1.355\}\\ &=& P\{.956 \leq \theta \leq 1.227\}.\mbox{”} \end{eqnarray*}$$

Assistant: “But, Sir, the theory of confidence intervals does not assert anything about the probability that the unknown parameter $$\theta$$ will fall within any specified limits. What it does assert is that the probability of success in estimation using either of the two formulae (21) or (22) is equal to $$\alpha$$.”

Boss: “Stuff and nonsense! I use one of the blessed pair of formulae and come up with the assertion that $$.688 \leq \theta \leq 1.355$$. This assertion is a success only if $$\theta$$ falls within the limits indicated. Hence, the probability of success is equal to the probability of $$\theta$$ falling within these limits —.”

Assistant: “No, Sir, it is not. The probability you describe is the a posteriori probability regarding $$\theta$$, while we are concerned with something else. Suppose that we continue with the sampling experiment until we have, say, $$N = 100$$ samples. You will see, Sir, that the relative frequency of successful estimations using formulae (21) will be about the same as that using formulae (22) and that both will be approximately equal to .95.”

I do hope that the Assistant will not get fired. However, if he does, I would remind him of the glory of Giordano Bruno who was burned at the stake by the Holy Inquisition for believing in the Copernican theory of the solar system. Furthermore, I would advise him to have a talk with a physicist or a biologist or, maybe, with an engineer. They might fail to understand the theory but, if he performs for them the sampling experiment described above, they are likely to be convinced and give him a new job. In due course, the eminent statistical Boss will die or retire and then —.

[footnote] Sad as it is, your mind does become less flexible and less receptive to novel ideas as the years go by. The more mature members of the audience should not take offense. I, myself, am not young and have young assistants. Besides, unreasonable and stubborn individuals are found not only among the elderly but also frequently among young people.

[end excerpt]
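For readers who want to try the sampling experiment themselves, here is a minimal Python sketch. The excerpt does not reproduce formulae (21) and (22), so the two interval rules below are assumptions on my part: observations are taken to be uniform on $$(0, \theta)$$, the Assistant's interval runs from the largest observation to the largest observation divided by $$.05^{1/12}$$, and the Boss's interval comes from a normal approximation to twice the arithmetic mean. These forms are at least consistent with the quoted endpoints (1.227/.956 is about 1.28 and 1.355/.688 is about 1.97), but treat them as a reconstruction, not as Neyman's text. The function names and the seed are made up for illustration.

```python
import random


def assistant_interval(sample, conf=0.95):
    """Interval based on the largest observation (assumed form of formula 21)."""
    n = len(sample)
    largest = max(sample)
    # For Uniform(0, theta): P(max >= theta * (1 - conf)**(1/n)) = conf.
    return largest, largest / (1.0 - conf) ** (1.0 / n)


def boss_interval(sample, z=1.96):
    """Interval based on the arithmetic mean (assumed form of formula 22)."""
    n = len(sample)
    doubled_mean = 2.0 * sum(sample) / n
    # For Uniform(0, theta), 2 * mean has standard deviation theta / sqrt(3 n).
    rel_sd = 1.0 / (3.0 * n) ** 0.5
    return doubled_mean / (1.0 + z * rel_sd), doubled_mean / (1.0 - z * rel_sd)


def sampling_experiment(n_samples=20000, n_obs=12, seed=1):
    """Return (coverage, mean interval length relative to theta) per method."""
    rng = random.Random(seed)
    rules = {"assistant": assistant_interval, "boss": boss_interval}
    hits = {name: 0 for name in rules}
    rel_length = {name: 0.0 for name in rules}
    for i in range(n_samples):
        theta = 1 + i % 4  # the true value cycles through 1, 2, 3, 4
        sample = [rng.uniform(0.0, theta) for _ in range(n_obs)]
        for name, rule in rules.items():
            lower, upper = rule(sample)
            hits[name] += lower <= theta <= upper
            rel_length[name] += (upper - lower) / theta
    coverage = {name: hits[name] / n_samples for name in rules}
    mean_rel_length = {name: rel_length[name] / n_samples for name in rules}
    return coverage, mean_rel_length


if __name__ == "__main__":
    coverage, mean_rel_length = sampling_experiment()
    print("relative frequency of correct assertions:", coverage)
    print("mean interval length / theta:", mean_rel_length)
```

Under these assumptions both methods should be correct in roughly 95% of samples, whether or not the true $$\theta$$ changes from sample to sample, while the Assistant's intervals are much shorter on average. That is the Assistant's point: the .95 describes the procedure, not any one interval.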

# The TES Challenge to Greg Francis

This post is a follow-up to my previous post, “Statistical alchemy and the ‘test for excess significance’”. In the comments on that post, Greg Francis objected to my points about the Test for Excess Significance. I laid out a challenge in which I would use simulation to demonstrate these points. Greg Francis agreed to the details; this post is about the results of the simulations (with links to the code, etc.).

## A challenge

In my previous post, I said this:

Morey: “…we have a bit of a mystery. That $E$ [the expected number of non-significant studies in a set of $n$ studies] equals the sum of the expected [Type II error] probabilities is merely asserted [by Ioannidis and Trikalinos]. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false.”

Greg Francis replied:

Francis: “…none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process. Your study generation process works if you let n vary, but then the Ioannidis & Trikalinos formula is shown to be correct…[i]n short, you present impossible sampling procedures and then complain that the formula proposed by Ioannidis & Trikalinos does not handle your impossible situations.”

To which I replied,

Morey: “If you don’t believe me, here’s a challenge: you pick a power and a random seed. I will simulate a very large ‘literature’ according to the ‘experimenter behaviour’ of my choice, importantly with no publication bias or other selection of studies. I will guarantee that I will use a behaviour that will generate experiment set sizes of 5. I will save the code and the ‘literature’ coded in terms of ‘sets’ of studies and how many significant and nonsignificant studies there are. You get to guess what the average number of significant studies are in sets of 5 via I&T’s model, along with a 95% CI (I’ll tell you the total number of such studies). That is, we’re just using Monte Carlo to estimate the expected number of significant studies in sets of experiments n=5; that is, precisely what I&T use as the basis of their model (for the special case of n=5).” “This will answer the question of ‘what is the expected number of nonsignificant studies in a set of n?’”

This challenge will very clearly show that my situations are not “impossible”. I can sample them in a very simple simulation. Greg Francis agreed to the simulation:

Francis: “Clearly at least one of us is confused. Maybe we can sort it out by trying your challenge. Power=0.5, random seed= 19374013”

I further clarified:

Morey: “Before I do this, though, I want to make sure that we agree on what this will show. I want to show that the expected number of nonsignificant studies in a set of n (=5) studies is not what I&T say it is, and hence, the reasoning behind the test is flawed (because ‘excess significance’ is defined as deviation from this expected number). I also want to be clear what the prediction is here: Since the power of the test is .5, according to I&T, the expected number of nonsignificant studies in a set of 5 is 2.5. Agreed?”

…to which Greg Francis agreed.

I have performed this simulation. Before reading on, you should read the web page containing the results:

The table below shows the results of the simulation of 1,000,000 “sets” of studies. All simulated “studies” are published in this simulation; no questionable research practices are involved. The first column shows $n$, and the second column shows the average number of non-significant studies for sets of size $n$, which is a Monte Carlo estimate of I&T’s $E$. As you can see, it is not 2.5.

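| Total studies ($n$) | Mean nonsig. studies | Expected by TES ($E$) | SD nonsig. studies | Count |
|---|---|---|---|---|
| 1 | 1 | 0.5 | 0 | 499917 |
| 2 | 1 | 1.0 | 0 | 249690 |
| 3 | 1 | 1.5 | 0 | 125269 |
| 4 | 1 | 2.0 | 0 | 62570 |
| 5 | 1 | 2.5 | 0 | 31309 |
| 6 | 1 | 3.0 | 0 | 15640 |
| 7 | 1 | 3.5 | 0 | 7718 |
| 8 | 1 | 4.0 | 0 | 3958 |
| 9 | 1 | 4.5 | 0 | 1986 |
| 10 | 1 | 5.0 | 0 | 975 |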

(I have truncated the table at $n=10$; see the HTML file for the full table.)

I also showed that you can change the experimenter’s behaviour and make it 2.5. This indicates that the assumptions one makes about experimenter behaviour matter to the expected number of non-significant studies in a particular set. Across all sets of studies, the proportion of significant studies is expected to equal the power. However, how the significant studies are distributed across sets of different sizes is a function of the decision rule.

The expression for the expected number of non-significant studies in a set of $n$ is not correct (without further very strong, unwarranted assumptions).
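To make the role of the decision rule concrete, here is a minimal Python sketch. It is not the code behind the results above. The first rule is an assumption inferred from the table (every set contains exactly one non-significant study, and the counts roughly halve with each increase in set size): the experimenter keeps running studies until the first non-significant result and then stops. The second rule simply fixes the set size at 5 in advance. The function names are hypothetical, and although the sketch borrows the seed Greg Francis chose, a different random number generator means it will not reproduce the counts in the table.

```python
import random
from collections import defaultdict


def run_until_first_nonsig(rng, power):
    """Assumed rule: keep running studies until one is non-significant."""
    n_sig = 0
    while rng.random() < power:  # each study is significant with prob. = power
        n_sig += 1
    return n_sig + 1, 1  # (set size, number of non-significant studies)


def fixed_size_set(rng, power, n=5):
    """Alternative rule: always run exactly n studies."""
    nonsig = sum(rng.random() >= power for _ in range(n))
    return n, nonsig


def simulate_literature(rule, n_sets=200000, power=0.5, seed=19374013):
    """Return (mean nonsig. studies by set size, overall proportion significant)."""
    rng = random.Random(seed)
    by_size = defaultdict(lambda: [0, 0])  # size -> [nonsig total, set count]
    total_studies = 0
    total_nonsig = 0
    for _ in range(n_sets):
        size, nonsig = rule(rng, power)
        by_size[size][0] += nonsig
        by_size[size][1] += 1
        total_studies += size
        total_nonsig += nonsig
    means = {size: s / c for size, (s, c) in sorted(by_size.items())}
    return means, 1.0 - total_nonsig / total_studies


if __name__ == "__main__":
    for rule in (run_until_first_nonsig, fixed_size_set):
        means, overall_sig = simulate_literature(rule)
        print(rule.__name__, "mean nonsig. in sets of 5:", means[5],
              "overall proportion significant:", round(overall_sig, 3))
```

Every study is published under both rules and the overall proportion of significant studies sits near the power of .5 either way. What differs is the mean number of non-significant studies in sets of 5: exactly 1 under the first rule, about 2.5 under the second.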

# Two things to stop saying about null hypotheses

There is a currently fashionable way of describing Bayes factors that resonates with experimental psychologists. I hear it often, particularly as a way to describe a particular use of Bayes factors. For example, one might say, “I needed to prove the null, so I used a Bayes factor,” or “Bayes factors are great because with them, you can prove the null.” I understand the motivation behind this sort of language but please: stop saying one can “prove the null” with Bayes factors.

I also often hear other people say “but the null is never true.” I’d like to explain why we should avoid saying both of these things.

[Image caption: Null hypotheses are tired of your jibber jabber]

### Why you shouldn’t say “prove the null”

Statistics is complicated. People often come up with colloquial ways of describing what a particular method is doing: for instance, one might say a significance test gives us “evidence against the null”; one might say that a “confidence interval tells us the 95% most plausible values”; or one might say that a Bayes factor helps us “prove the null.” Bayesians often are quick to correct misconceptions that people use to justify their use of classical or frequentist methods. It is just as important to correct misconceptions about Bayesian methods.

In order to understand why we shouldn’t say “prove the null”, consider the following situation: You have a friend who claims that they can affect the moon with their mind. You, of course, think this is preposterous. Your friend looks up at the moon and says “See, I’m using my abilities right now!” You check the time.

You then decide to head to the local lunar seismologist, who has good records of subtle moon tremors. You ask her about what happened at the time your friend was looking at the moon, and she reports back to you that lunar activity at that time was stronger than it typically is 95% of the time (and thus passes the bar for “statistical significance”).

Does this mean that there is evidence for your friend’s assertion? The answer is “no.” Your friend made no statement about what one would expect from the seismic data. In fact, your friend’s statement is completely unfalsifiable (as is the case with the typical “alternative” in a significance test, $\mu\neq0$).

But consider the following alternative statements your friend could have made: “I will destroy the moon with my mind”; “I will make very large tremors (with magnitude $Y$)”; “I will make small tremors (with magnitude $X$).” How do we now regard your friend’s claims in light of what happened?

• “I will destroy the moon with my mind” is clearly inconsistent with the data. You (the null) are supported by an infinite amount, because you have completely falsified their statement that they would destroy the moon (the alternative).
• “I will make very large tremors (with magnitude $Y$)” is also inconsistent with the data, but if we allow a range of uncertainty around their claim, it may not be completely falsified. Thus you (the null) are supported, but not by as much as in the first situation.
• “I will make small tremors (with magnitude $X$)” may support you (the null) or your friend (the alternative), depending on how the predicted and observed magnitudes compare.

Here we can see that the support for the null depends on the alternative at hand. This is, of course, as it must be. Scientific evidence is relative. We can never “prove the null”: we can only “find evidence for a specified null hypothesis against a reasonable, well-specified alternative”. That’s quite a mouthful, it’s true, but “prove the null” creates misunderstandings about Bayesian statistics, and makes it appear that it is doing something it cannot do.

In a Bayesian setup, the null and alternative are both models and the relative evidence between them will change based on how we specify them. If we specify them in a reasonable manner, such that the null and alternative correspond to relevant theoretical viewpoints or encode information about the question at hand, the relative statistical evidence will be informative for our research ends. If we don’t specify reasonable models, then the relative evidence between the models may be correct, but useless.

We never “prove the null” or “compute the probability of the null hypothesis”. We can only compare a null model to an alternative model, and determine the relative evidence.
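A small numerical sketch may help show how the evidence for the same null changes with the alternative. Everything here is an assumption made for illustration: the seismic reading is treated as normally distributed with a known standard deviation of 1, the null predicts a mean of 0, each of the friend’s claims is reduced to a single predicted mean, and the observed value and the magnitudes are made up. With point hypotheses the Bayes factor is just a likelihood ratio, computed here on the log scale so that extreme alternatives do not underflow.

```python
def log_bf_null_vs_point(observed, alt_mean, sd=1.0):
    """Log Bayes factor for mu = 0 against the point alternative mu = alt_mean.

    Assumes a normal likelihood with known sd; the normalizing constants
    cancel, leaving a difference of squared standardized distances.
    Positive values favor the null, negative values favor the alternative.
    """
    z_null = observed / sd
    z_alt = (observed - alt_mean) / sd
    return 0.5 * z_alt ** 2 - 0.5 * z_null ** 2


if __name__ == "__main__":
    observed = 2.0  # hypothetical reading, in standard deviation units
    claims = {
        "small tremors": 2.5,        # hypothetical magnitude X
        "very large tremors": 8.0,   # hypothetical magnitude Y
        "destroy the moon": 50.0,    # stand-in for an enormous effect
    }
    for claim, alt_mean in claims.items():
        print(claim, log_bf_null_vs_point(observed, alt_mean))
```

With these made-up numbers the same observation favors the friend over the null when the claim is “small tremors”, favors the null when the claim is “very large tremors”, and favors the null overwhelmingly when the claim is “destroy the moon”. The data did not change; only the alternative did.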

### Why you shouldn’t say “the null is never true”

A common retort to tests including a point null (often called a ‘null’ hypothesis) is that “the null is never true.” This is backed up by four sorts of “evidence”:

• A quote from an authority: “Tukey or Cohen said so!” (Tukey was smart, but this is not an argument.)
• Common knowledge / “experience”: “We all know the null is impossible.” (This was Tukey’s “argument”)
• Circular: “The area under a point in a density curve is 0.” (Of course if your model doesn’t have a point null, the point null will be impossible.)
• All models are “false” (even if this were true — I think it is actually a category error — it would equally apply to all alternatives as well)

The most attractive seems to be the second, but it should be noted that people almost never use techniques that allow finding evidence for null hypotheses. Under these conditions, how is one determining that the null is never true? If a null were ever true, we would not be able to accumulate evidence for it, so the second argument definitely has a hint of circularity as well.

When someone says “The null hypothesis is impossible/implausible/irrelevant”, what they are saying in reality is “I don’t believe the null hypothesis can possibly be true.” This is a totally fine statement, as long as we recognize it for what it is: an a priori commitment. We should not pretend that it is anything else; I cannot see any way that one can find universal evidence for the statement “the null is impossible”.

If you find the null hypothesis implausible, that’s OK. Others might not find it implausible. It is ultimately up to substantive experts to decide what hypotheses they want to consider in their data analysis, and not up to methodologists or statisticians to tell experts what to think.

Any automatic behavior, whether automatically rejecting all null hypotheses or automatically testing null hypotheses, is bad. Hypothesis testing and estimation should be considered and deliberate. Luckily, Bayesian statistics allows both to be done in a principled, coherent manner, so informed choices can be made by the analyst and not by the restrictions of the method.