Category Archives: anova

The fallacy of placing confidence in confidence intervals (version 2)

I, with my coathors, have submitted a new draft of our paper “The fallacy of placing confidence in confidence intervals”. This paper is substantially modified from its previous incarnation. Here is the main argument:

“[C]onfidence intervals may not be used as suggested by modern proponents because this usage is not justified by confidence interval theory. If used in the way CI proponents suggest, some CIs will provide severely misleading inferences for the given data; other CIs will not. Because such considerations are outside of CI theory, developers of CIs do not test them, and it is therefore often not known whether a given CI yields a reasonable inference or not. For this reason, we believe that appeal to CI theory is redundant in the best cases, when inferences can be justified outside CI theory, and unwise in the worst cases, when they cannot.”

The document, source code, and all supplementary material is available here on github.

Multiple Comparisons with BayesFactor, Part 2 – order restrictions

In my previous post, I described how to do multiple comparisons using the BayesFactor package. Part 1 concentrated on testing equality constraints among effects: for instance, that the the effects of two factor levels are equal, while leaving the third free to be different. In this second part, I will describe how to test order restrictions on factor level effects. This post will be a little more involved than the previous one, because BayesFactor does not currently do order restrictions automatically.

Again, I will note that these methods are only meant to be used for pre-planned comparisons. They should not be used for post hoc comparisons.

Our Example

I will use the same example and data as I did in the previous post; if you have not read that post, I suggest you go back and read it before delving further here. As a reminder, our data consists of (hypothetical) “moral harshness” ratings of undocumented migrants from 150 participants in three conditions:

  • No odor during questionnaire
  • Pleasant, clean odor (lemon) during questionnaire
  • Disgusting odor (sulfur) during questionnaire

Under the idea that moral disgust and physical disgust are related physiologically (the so-called “embodied” viewpoint; Schnall et al, 2008; but see also Johnson et al., 2014 and Landy & Goodwin, in press) the prediction is that the odor will have an effect on the harshness ratings, as feelings of physical disgust are “transferred” to objects of moral judgment.

In the previous post, I showed the classical ANOVA results, which just failed to reach significance. I also showed how to do a basic test of the null hypothesis against the hypothesis that all three means are unequal using anovaBF. The Bayes factor was about 1/0.774 = 1.3, meaning that neither the null hypothesis nor the “full” model (that all three means are unequal) was favored:

library(BayesFactor)
bf1 = anovaBF(score ~ condition, data = disgust_data)
1 / bf1
## Bayes factor analysis
## --------------
## [1] Intercept only : 1.292 ±0.01%
##
## Against denominator:
## score ~ condition
## ---
## Bayes factor type: BFlinearModel, JZS

More data is needed, to test these hypotheses against one another; but as we’ll see, data that are uninformative for one comparison may be more informative for another.

Testing the “right” hypothesis

At this point it is important to note that neither the classical test (which supposedly tests the fitness of the null hypothesis) nor the Bayes factor test of the “full” model against the null hypothesis are the “right” tests for the hypothesis at hand. The null hypothesis may be false in ways that are not consistent with the research predictions. The right test in this case is to test the hypothesis that lemon < control < sulfur. The means fall in the predicted direction:

Out of necessity, however, a classical analysis normally ends with the rejection of the null hypothesis that all means are equal. There is no way in classical statistics to rigorously test order-restrictions; one can only point to the ordering of the means and note that they are in the predicted order. This, however, ignores the uncertainty inherent in the estimation of the means.

Occasionally, researchers perform post hoc tests on the individual means to “ensure” that they are really different, given that they are in the correct order, but this has the disadvantage of extremely low power, which means that this method is only deployed opportunistically. Failure of these post hoc tests might not be reported at all, and would certainly never be reported as a failure to achieve sufficient evidence in favor of the hypothesis (the excuse will always be low power) but rejection will always be trumpeted.

What is needed is a principled way of testing order restrictions. Luckily, this is possible — even straightforward — with Bayes factors.

A refresher on Bayes factor logic

One of the neat features of Bayes factors is their transitivity. If I know that Model A outperforms Model B by 3, and I know that Model B outperforms Model C by 4, then I know that Model A outperforms Model C by (3 times 4 = 12). The reason for this is that Bayes factors are simply ratios (see this previous post). The Bayes factor is the ratio of the likelihood of the data under two hypotheses. Since
[ frac{p(y mid {cal M}_A)}{p(y mid {cal M}_B)} times frac{p(y mid {cal M}_B)}{p(y mid {cal M}_C)} = frac{p(y mid {cal M}_A)}{p(y mid {cal M}_C)} ]
…we can use two Bayes factors to get a third. This suggests how we can compute a Bayes factor for an order restriction:

  1. Compare the unrestricted “full” model to the null (already done, with anovaBF)
  2. Compare the unrestricted “full” model to an order restriction
  3. Use the resulting two Bayes factors to compare the null to the order restriction.

This all hinges on Step 2, which we perform in the next section.

Order restrictions versus full models

For this section, we need to remember that the Bayes factor is the degree to which posterior odds change from prior odds. So, if we can compute the prior odds of a restriction against the full model, and compute the posterior odds of a restriction against the full model, then we can obtain the Bayes factor. For the models in the Bayes factor package, the prior odds are easy. All order restrictions have the same probability, so the odds of any single order restriction is against the full model are just
[ frac{1}{mbox{Number of possible orderings}} ]
With three factor levels, there are 6 orderings, so the prior odds are 1/6.

To compute the posterior odds, we need an additional trick: we sample from the posterior, and work out the posterior probability that our prediction holds in the factor level effects. In this way, we account for the estimation uncertainty in these effects.

We use the posterior() function to sample from the posterior distribution, and for demonstration I’ve shown the first samples from the posterior:

## Sample from the posterior distribution of the full model
## (that is, the numerator of bf1)
samples = posterior(bf1, iterations = 10000)
head(samples)
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1
## End = 7
## Thinning interval = 1
## mu condition-control condition-lemon condition-sulfur sig2
## [1,] 30.24 -0.13343 -1.551 1.6841 42.06
## [2,] 30.49 -0.43687 -1.125 1.5617 42.70
## [3,] 29.35 0.04984 -1.514 1.4644 42.59
## [4,] 29.55 0.38246 -1.233 0.8507 46.79
## [5,] 29.38 -0.21434 -1.071 1.2850 46.31
## [6,] 29.85 -0.85003 -1.043 1.8927 53.68
## [7,] 29.40 -0.79867 -1.003 1.8020 54.84
## g_condition
## [1,] 0.10766
## [2,] 0.18695
## [3,] 0.03047
## [4,] 0.03034
## [5,] 0.59331
## [6,] 0.10248
## [7,] 0.04045

Notice that columns 1 through 4 contain the estimates of the effects of our factor levels. We need to estimate the probability that these order in the specified way. A simple estimate can be had by working out the proportion of samples in which the order constraint holds:

## Check order constraint
consistent = (samples[, "condition-control"] > samples[, "condition-lemon"]) &
(samples[, "condition-sulfur"] > samples[, "condition-control"])
N_consistent = sum(consistent)

For each posterior sample, the variable consistent codes whether the sample was consistent with the order restriction; the variable N_consistent contains how many of these samples were consistent, which in this case is 7245. Our estimate of the posterior probability of the restriction is thus N_consistent/10000, because we drew 10000 samples from the posterior distribution. The posterior probability is about 0.7245.

As it turns out, the posterior odds of the restriction to the full model is just 0.7245/1, because every sample is consistent with the full model. We can now compute the Bayes factor of the restriction to the full model by just dividing the posterior odds by the prior odds:

bf_restriction_against_full = (N_consistent / 10000) / (1 / 6)
bf_restriction_against_full
## [1] 4.347

The Bayes factor is 4.347, which shows that the data change our opinion in favor of the restriction by a factor of about 4, against the full model.

Another way to think about the above calculation is that the prior odds index the “riskiness” of the order-restriction prediction (the lower the odds, the more risky the prediction is), and the posterior odds represent the probability that it worked out, given the data. Under this view, the Bayes factor of the restriction versus the full model is the “boost” we give to the evidence due to these two factors:
[ mbox{Evidential boost to order prediction} = mbox{Probability prediction is true} times mbox{Riskiness of prediction} ]
To get a big boost, we thus need a risky prediction and a prediction that works out. Simply noting that the prediction worked out in the data is not enough.

Putting it all together

We can now compute the Bayes factor of our restriction against the null hypothesis through simple multiplication:

## Convert bf1 to a number so that we can multiply it
bf_full_against_null = as.vector(bf1)
## Use transitivity to compute desired Bayes factor
bf_restriction_against_null = bf_restriction_against_full * bf_full_against_null
bf_restriction_against_null
## condition 
## 3.364

This Bayes factor is still moderate, but substantially more respectable than the previous Bayes factor of 0.774.

What have we gained?

Here we see several interesting features of Bayes factors on display.

  1. If a hypothesis makes a prediction, we can test it. Classical testing has limits which arbitrarily limit our ability to test certain hypotheses (eg, order restrictions). Bayes factors are not so limited.
  2. Bayes factors are transitive. If we can test two models against the same third model, we can compare those two models against one another. Although this seems like a reasonable property, (p) values have no such property because they are not comparative.
  3. Making a specific prediction pays off. The “full” model was the wrong model to test, because it did not make properly constrained predictions. When the correct order restriction was tested, the evidence increased because the data were consistent with the restriction.
  4. The limit to increase in evidence that a specific prediction gives you is the “riskyness” of the prediction. Note in the above calculation that the limit to the boost that a Bayes factor can get from the order restriction is equal to the odds of that restriction. If we make a prediction that has low a priori odds, then when it works out in the data, the Bayes factor will reward it by that amount, weighted by the posterior probability that the restriction is true.

For more about testing order restrictions with Bayes factors, see Morey and Wagenmakers (2013), “Simple relation between Bayesian order-restricted and point-null hypothesis tests.”

Multiple Comparisons with BayesFactor, Part 1

One of the most frequently-asked questions about the BayesFactor package is how to do multiple comparisons; that is, given that some effect exists across factor levels or means, how can we test whether two specific effects are unequal. In the next two posts, I’ll explain how this can be done in two cases: in Part 1, I’ll cover tests for equality, and in Part 2 I’ll cover tests for specific order-restrictions.

Before we start, I will note that these methods are only meant to be used for pre-planned comparisons. They should not be used for post hoc comparisons.

An Example

Suppose we are interested in the basis for feelings of moral disgust. One prominent theory, from the embodied cognition point of view, holds that feelings of moral disgust are extensions of more basic feelings of disgust: disgust for physical things, such as rotting meat, excrement, etc (Schnall et al, 2008; but see also Johnson et al., 2014 and Landy & Goodwin, in press). Under this theory, moral disgust is not only metaphorically related to physical disgust, but may share physiological responses with physical disgust.

Suppose we wish to experimentally test this theory, which predicts that feelings of physical disgust can be “transferred” to possible objects of moral disgust. We ask 150 participants to fill out a questionnaire that measures the harshness of their judgments of undocumented migrants. Participants are randomly assigned to one of three conditions, differing by the odor present in the room: a pleasant scent associated with cleanliness (lemon), a disgusting scent (sulfur), and a control condition in which no unusual odor is present. The dependent variable is the score on the questionnaire, which ranges from 0 to 50 with higher scores representing harsher moral judgment.

Hypothetical data, simulated for the sake of example, can be read into R using the url() function:

# Read in the data from the learnbayes.org
disgust_data = read.table(url('http://www.learnbayes.org/disgust_example.txt'),header=TRUE)

A boxplot and means/standard errors reveal that the effects appear to be in the predicted direction:

(note that the axes are different in the two plots, so that the standard errors can be seen)

And we can perform a classical ANOVA on these data:

# ANOVA
summary(aov(score ~ condition, data = disgust_data))
##              Df Sum Sq Mean Sq F value Pr(>F)  
## condition 2 263 131.4 2.91 0.058 .
## Residuals 147 6635 45.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The classical test of the null hypothesis that all means are equal just fails to reach significance at (alpha=0.05).

A Bayes factor analysis

We can easily perform a Bayes factor test of the null hypothesis using the BayesFactor package. This assumes that the prior settings are acceptable; because this post is about multiple comparisons, we will not explore prior settings here. See ?anovaBF for more information.

The anovaBF is a convenience function to perform Bayes factor ANOVA-like analyses. The code for the Bayes factor analysis is almost identical to the code for the classical test:

library(BayesFactor)
bf1 = anovaBF(score ~ condition, data = disgust_data)
bf1
## Bayes factor analysis
## --------------
## [1] condition : 0.7738 ±0.01%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

The Bayes factor in favor of a condition effect is about 0.774, or 1/0.774 = 1.3 in favor of the null (the “Intercept only” model). This is not strong evidence for either the null or the alternative, which given the moderate p value is perhaps not surprising. It should be noted here that even if the p value had just crept in under 0.05, the Bayes factor would not be appreciably different, which shows the inherent arbitrariness of significance testing.

Many possible hypotheses?

This analysis is not the end of the story, however. The hypothesis tested above — that all means are different, but with no further specificity — was not really the hypothesis of interest. The hypothesis of interest was more specific. We might consider an entire spectrum of hypotheses, listed in increasing order of constraint:

  • (most constrained) The null hypothesis (control = lemon = sulfur)
  • (somewhat constrained) Unexpected scents cause the same effect, regardless of type (lemon = sulfur ≠ control; this might occur, for instance, both “clean” and “disgusting” scents prime the same underlying concepts)
  • (somewhat constrained) Only disgusting scents have an effect (control = lemon ≠ sulfur)
  • (somewhat constrained) Only pleasant scents have an effect (control = sulfur ≠ lemon)
  • (unconstrained) All scents have unique effects (control ≠ sulfur ≠ lemon)

The above are all equality constraints. We can also specify order constraints, such as lemon < control < sulfur. The unconstrained model tested above (control ≠ sulfur ≠ lemon) does not give full credit to this ordering prediction. In the next section, I will show how to test equality constraints. In Part 2 of this post, I will show how to test order constraints.

Testing equality constraints

To test equality constraints, we must first consider what an equality constraint means. Claiming that an equality constraint holds is the same as saying that your predictions for data would not change if the two conditions are supposed to be the same had exactly the same label. If want to to impose the constraint that lemon = sulfur ≠ control, we merely have to give lemon and sulfur the same label.

In practice, this means making a new column in the data frame with the required change:

# Copy the condition column that we will change
# We use 'as.character' to avoid using the same factor levels
disgust_data$lemon.eq.sulfur = as.character(disgust_data$condition)
# Change all 'lemon' to 'lemon/sulfur'
disgust_data$lemon.eq.sulfur[ disgust_data$condition == "lemon" ] = 'lemon/sulfur'
# Change all 'sulfur' to 'lemon/sulfur'
disgust_data$lemon.eq.sulfur[ disgust_data$condition == "sulfur" ] = 'lemon/sulfur'
# finally, make the column a factor
disgust_data$lemon.eq.sulfur = factor(disgust_data$lemon.eq.sulfur)

We now have a data column, called lemon.eq.sulfur, that labels the data so that lemon and sulfur have the same labels. We can use this in Bayes factor test:

bf2 = anovaBF(score ~ lemon.eq.sulfur, data = disgust_data)
bf2
## Bayes factor analysis
## --------------
## [1] lemon.eq.sulfur : 0.1921 ±0%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

The null hypothesis is now preferred by a factor of 1/0.192 = 5.2, which is expected given that lemon and sulfur were the least similar pair of three means. The null hypothesis accounts for the data better than this constraint.

One of the conveniences of using Bayes factors is if we have two hypotheses that are both tested against the same third hypothesis, we can test the two hypotheses against one another. The BayesFactor package makes this easy; any two BayesFactor objects compared against the same denominator — in this case, the intercept-only null hypothesis — can be combined together:

bf_both_tests = c(bf1, bf2)
bf_both_tests
## Bayes factor analysis
## --------------
## [1] condition : 0.7738 ±0.01%
## [2] lemon.eq.sulfur : 0.1921 ±0%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

We could, for instance, put all equality-constraint tests into the same object, and then compare them like so:

bf_both_tests[1] / bf_both_tests[2]
## Bayes factor analysis
## --------------
## [1] condition : 4.029 ±0.01%
##
## Against denominator:
## score ~ lemon.eq.sulfur
## ---
## Bayes factor type: BFlinearModel, JZS

The fully unconstrained hypothesis, represented by condition, is preferred to the lemon = sulfur ≠ control hypothesis by a factor of about 4.

In the next post, we will use the posterior() function to draw from the posterior of the unconstrained model, which will allow us to test ordering constraints.