Monthly Archives: January 2015

On verbal categories for the interpretation of Bayes factors

As Bayesian analysis is becoming more popular, adopters of Bayesian statistics have had to consider new issues that they did not before. What makes a “good” prior? How do I interpret a posterior? What Bayes factor is “big enough”? Although the theoretical arguments for the use of Bayesian statistics are very strong, new and unfamiliar ideas can cause uncertainty in new adopters. Compared to the cozy certainty of (p<.05), Bayesian statistics requires more care and attention. In theory, this is no problem at all. But as Yogi Berra said, “In theory there is no difference between theory and practice. In practice there is.”

In this post, I discuss the use of verbal labels for magnitudes of Bayes factors. In short, I don’t like them, and think they are unnecessary.

Bayes factors have many good characteristics, and have been advocated by many to replace (p) values from null hypothesis significance tests. Both (p) values and Bayes factors are continuous statistics, and it seems reasonable to ask how one should interpret the magnitude of the number. I will first address the issue of how the magnitudes of (p) values are interpreted, then move on to Bayes factors for a comparison.

Classical and Frequentist statistics

With (p) values this matter is either very difficult or very easy, depending on whether you’re more Fisherian or more Neyman-Pearsonian. Under the Fisherian view, interpretation of the number is difficult. Fisher said, for instance, that:

“Though recognizable as a psychological condition of reluctance, or resistance to the acceptance of a proposition, the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to, and verifiable by, other rational minds. The level of significance in such cases fulfills the conditions of a measure of the rational grounds for the disbelief it engenders…”

“In general tests of significance are based on hypothetical probabilities calculated from their null hypotheses. They do not generally lead to any probability statements about the real world, but to a rational and well-defined measure of reluctance to the acceptance of the hypotheses they test.” (Fisher, in ‘Statistical Methods and Scientific Inference’)

According to Fisher, the worth of a (p) value is that it is an objective statement about a probability under the null hypothesis. The strength of the evidence against the null hypothesis, however, is not the (p) value itself; it is somehow translated from the “reluctance” engendered by a particular (p) value. The definition of a (p) value itself, as Fisher points out, does not naturally lead to statements about the world. The problem is immediately obvious. How much reluctance should a rational person feel, based on a (p) value? Who decides what is reasonable and what is not? To be clear, these questions are not meant as critiques of Fisher’s viewpoint, with which I sympathize; I only wish to highlight the burden that Fisher’s view of (p) values places on the researcher.

From the Neyman-Pearson (and the hybrid NHST) perspective, this particular problem goes away completely. As a side benefit of Neyman’s rejection of epistemology in favor of an action/decision-based view, statistics do not need to have meaning at all. In Neyman’s view, statistical tests are methods of deciding between behaviors, with defined (or in some sense optimal) error rates. A (p) value of less than (0.05) might, for instance, lead to an acceptance of a particular proposition, automatically. As Neyman says, rejecting both Fisher’s account of scientific inductive reasoning and Jeffreys’ Bayesian account:

[T]he examination of experimental or observational data is invariably followed by a set of mental processes that involve [] three categories…: (i) scanning of memory and a review of the various sets of relevant hypotheses, (ii) deductions of consequences of these hypotheses and the comparison of these consequences with empirical data, (iii) an act of will, a decision to take a particular action.

It must be obvious that…[the] use [of inductive reasoning] as a basic principle underlying research is unsatisfactory. The beliefs of particular scientists are a very personal matter and it is useless to attempt to norm them by any dogmatic formula. (Neyman, 1957)

Neyman is not suggesting that statistics is completely automatic – after all, one needs to choose one’s decision rules, according to what suits one’s goals – but the interpretation of the magnitude of (p) values in relation to rational belief is irrelevant to Neyman. The worth of a (p) value (or other statistic) is in the decision it determines. The meaning of the number itself is unimportant, or even nonexistent.

Today (p) values are used opportunistically in both ways. Most people do not know what a (p) value is, but they can tell you two things:

  1. “When (p) is less than 0.05, I reject the null hypothesis.” (Neyman-Pearson)
  2. “When (p) is very small, it provides a lot of evidence against the null hypothesis.” (Fisher)

These days, it is well known that (p) values do not serve Fisher’s goals particularly well. Even Fisher did not provide a formal link between (p) values and any sort of rational belief, and as it happens no such link exists. So if one is to use (p) values, one is left with only Neyman’s decision-based account. The (p) value is uninterpretable (except trivially, via its definition), but as a decision criterion this isn’t as much of a worry.

The use of such criteria is comforting. It provides one fewer thing to argue over; if (p<0.05), then one can no longer argue that the effect isn’t there. If research is a game, these sorts of rules provide people with a sense of fairness. Whatever else happens, we all collectively agree that we will not doubt one another’s research on the grounds that there is not enough evidence for the effect. The way (p) values are used, (p<0.05) means there is “enough” evidence, by definition.

Bayesian statistics

Researchers who have adopted Bayesian statistics encounter practical hurdles that they did not have previously. Priors require some care to develop and use, but there is no clear analog in classical statistics, except for perhaps the determination of an alternative for power calculation in the Neyman-Pearson paradigm. Likewise, Bayes factors, as changes in relative model odds, have no clear analog.
The closest thing to a Bayes factor in classical statistics is a (p) value, but in truth the only similarity is that they are both interpreted in terms of evidential strength. As I outlined in a previous post, a Bayes factor is two things:

  1. A Bayes factor is the probability (or density) of the observed data in one model compared to another.
  2. A Bayes factor is the relative evidence for one model compared to another.

Part of the elegance of Bayes factors is that these two things are the same; a model is preferred in direct proportion to the degree to which it predicted the observed data.
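To make the two readings concrete, here is a minimal sketch in R. The data (60 successes in 100 trials), the two models, and the equal prior odds are all hypothetical choices made for illustration; nothing here comes from a real analysis.

```r
# Hypothetical data: 60 successes in 100 trials (illustration only)
k <- 60
n <- 100

# Model 0: the success probability is fixed at 0.5
m0 <- dbinom(k, n, 0.5)

# Model 1: the success probability is unknown, with a uniform prior on [0, 1].
# Its marginal likelihood averages the binomial likelihood over that prior.
m1 <- integrate(function(theta) dbinom(k, n, theta), lower = 0, upper = 1)$value

# Reading 1: ratio of how well each model predicted the observed data
bf10 <- m1 / m0

# Reading 2: the factor by which the data change the model odds
prior_odds <- 1                       # assumed: models equally plausible beforehand
posterior_odds <- prior_odds * bf10
```

Whatever prior odds are plugged in, the ratio of posterior odds to prior odds is the same number as the ratio of the two marginal likelihoods, which is the sense in which the two readings coincide.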

When they encounter Bayes factors, researchers familiar with (p) values often ask “How big is big enough?” or “What is a ‘big’ Bayes factor?” Various proponents of Bayes factors have recommended scales that give interpretations of various sizes of Bayes factors. For instance, a Bayes factor of 4 is interpreted as “substantial” evidence by Jeffreys.

Although it is common practice, in my view there are numerous problems with the practice of assigning verbal labels to sizes of Bayes factors. They are not needed, and they actually distort the meaning of Bayes factors.

As noted previously, (p) values do not have a ready interpretation without a label like “statistically significant,” which means no more and no less than “I choose to reject the null hypothesis.” Bayes factors, on the other hand, do not require any such categorization. They are readily interpretable as either ratios of probabilities, or changes in model odds. They yield the amount of evidence contributed by the data, on top of what was already known (the priors). As Kass and Raftery state:

Probability itself provides a meaningful scale defined by betting, and so these categories are not a calibration of the Bayes factor but rather a rough descriptive statement about standards of evidence in scientific investigation. (Kass and Raftery, 1995, emphasis mine)

Although Kass and Raftery are often cited as recommending guidelines for interpreting the Bayes factor, they did not. They offered a description of scientific behavior, and how they thought this mapped onto Bayes factors. They were neither interpreting the Bayes factor nor were they offering normative guidelines for action on the basis of evidential strength. If Kass and Raftery were wrong about scientific behavior (after all, they did not offer any evidence for their description), if scientific behavior were to change, or if one were to consider another area besides scientific investigation, these numbers would not serve.

But even if Bayes factors do not need to be interpreted, perhaps it might be good to have the verbal categories anyway. I do not think so, for several reasons.

My first objection is that words mean different things to different people, and meanings change over time. Take, for instance, Jeffreys’ category “substantial” for Bayes factors of between 3 and 10. This is less evidence than Jeffreys’ category of “strong”, which runs from 10 to 30. This seems strange, because the definition of “substantial” in modern use is “of considerable value.” How are “substantial” and “strong” different? Couldn’t we reverse these labels and have just as good a scale?

I believe the answer to this puzzle is that a less common use of “substantial” is “has substance.” For instance, I may say that I thought an argument is “substantial”, but this does not necessarily mean that I think the argument is strong, but simply means it is not trivially wrong. Put another way, it means that I did not think the argument was insubstantial. This is, I believe, what Jeffreys meant. But why should my evaluation of the strength of evidence depend on my knowledge of uncommon uses of common words? Would someone who did not know the less common use of “substantial” take a different view of the evidence, simply because we read different books or used different dictionaries?

Consider also Wetzels and Wagenmakers’ (2012) replacement of Jeffreys’ “not worth more than a bare mention” with “anecdotal”. Anecdotal evidence has a specific meaning; it is not simply weak evidence. I could have a well-designed, perfectly controlled experiment that nonetheless produces weak evidence for differentiating between two hypotheses of interest. This does not mean that my evidence is anecdotal. Anecdotal evidence is second-hand evidence that has not been substantiated and does not derive from well-controlled experiments.

Here we see the major problem: the use of these verbal labels smuggles arbitrary meaning into the judgement where none is needed. These meanings differ across people, times, and fields. Using such labels adds unnecessary – and perhaps incorrect – baggage to the interpretation of the results of an experiment.

The second objection is that the evaluation of what is “strong” evidence depends on what is being studied, and how. Is 10 kilometers a long way? It is if you’re walking, it isn’t if you’ve just started a flight to Australia. In a sense, Bayes factors are the same way; if we’re claiming something mundane and plausible, a Bayes factor of 10 may be more than enough. If we’re claiming something novel and implausible, a Bayes factor of 10 may not even be a start. Extraordinary claims, as the saying goes, demand extraordinary evidence; we would not regard the same level of evidence as “strong” for all claims.
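The arithmetic behind this objection is short enough to sketch in R. The two sets of prior odds are hypothetical stand-ins for a mundane and an implausible claim; only the Bayes factor of 10 comes from the text.

```r
bf <- 10

# Hypothetical prior odds in favor of each claim (assumed for illustration)
prior_odds <- c(mundane = 1, implausible = 1 / 1000)

# The same Bayes factor multiplies both sets of prior odds
posterior_odds <- prior_odds * bf

# Convert odds to probabilities for easier reading
posterior_prob <- posterior_odds / (1 + posterior_odds)
round(posterior_prob, 3)
```

Under these assumed priors the same evidence leaves the mundane claim quite probable and the implausible claim still very improbable, which is why a single label for “a Bayes factor of 10” cannot serve both cases.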

The third objection is related to the second, and that is that providing evidential categories allows the researcher to shirk their responsibility for interpreting the strength of the evidence. We do not allow this in other settings. When reviewing a paper that contains evidence for some claim, for instance, it is our duty to evaluate the strength of the evidence in context. We do not, and cannot, demand from editors “standard” papers that we all agree are “strong” evidence; such standard papers do not exist. Providing normative guidelines such as “A Bayes factor of 15 is strong evidence,” though comforting, asks researchers to stop thinking in ways that we would not allow in other contexts. They impose an arbitrary, unjustified homogeneity on judgments of evidential strength.

Finally, a fourth objection is that verbal categories provide the illusion of understanding. Being able to say “A Bayes factor of 3 means that there is anecdotal evidence,” may give a researcher comfort, but does not ultimately show any understanding at all. This provides a dangerous fluency effect, because fluency has been consistently shown to cause people to misjudge their knowledge. Because categories are not actually necessary for the interpretation of Bayes factors, giving them illusory fluency using the labels is likely to hinder, not help, their understanding.

Do people “need” categories?

All of the previous arguments may be admitted, and yet one might argue that they are substantially weakened by a single fact: that Bayes factors cannot be understood by researchers without them. People cannot think without categories, and so if we do not provide them, people will not be able to interpret Bayes factors.

I think this is self-evidently wrong. It at least requires some sort of evidence to back it up. The use of Bayes factors by researchers is in its early years, and we do not yet know how well people can interpret them in practice.

As evidence that the claim that people need categories for Bayes factors is wrong, one may point to other related quantities for which we do not provide verbal categories. The most obvious is probability. When we teach probability in class, we do not give guidelines about what is a “small” or “large” probability. If a student were to ask, we would naturally say “it depends.” A probability of 1% is small if it represents the probability that we will win a game of chess against an opponent; it is large if it is the probability that we will be killed in an accident tomorrow.

For other similar quantities, too, we do not offer verbal categories. Odds ratios and relative risk are closely related to the Bayes factor, and yet they are used by researchers all the time without the need for contextless categories.

It is often the case that students (or researchers) are unsure about probability. Although verbal categories are never (that I know of) advocated for alleviating misunderstandings or lack of certainty about probability – and rightly so – there are other ways of helping students understand probability. Gerd Gigerenzer’s work, in particular, has shown that certain visualizations help students understand, and make use of, probabilities. A similar evidence-based tack can, and should, be taken with Bayes factors. We know a lot about how to teach people about probability, so we should apply that knowledge.

As argued previously, it is possible that through the illusion of fluency, categories may actually harm people’s understanding. It would be better to address the root of the problem rather than providing quick fixes for people’s discomfort with new methods. The quick fixes may actually backfire.

Bayes factors as decision statistics

It has been suggested that cut-offs on the Bayes factors are sometimes useful; in particular, when used to stop collecting data. This is a completely different issue from the one addressed above. A rule for behavior does not need an interpretation, and furthermore, the interpretation of a Bayes factor does not depend on the stopping rule. Such a rule is merely a practicality, and there is nothing wrong with using such rules if they are needed.

As an example, I may have a rule for stopping eating, but this is a completely separate question from whether I would judge how much I ate to be “a lot”. I do not need the rule to say I ate a lot, and following such a rule does not make what I ate any more or any less. I might choose such a rule based on what I thought “a lot” was, but the concept of “a lot” is prior to the rule.

In the case of Bayes factors, such a decision criterion is actually only useful in light of prior odds. We should choose such a criterion such that a Bayes factor that exceeds a particular threshold is likely to convince most people; that is, that it is large enough to overcome most people’s biases. Bayes factors in research are used in arguments made for other researchers’ benefit; if we end sampling before we have achieved a level of evidence that would overcome others’ prior odds, then we have not done enough sampling. Convincing ourselves is not the goal of research, after all. This should make it obvious why even a rule for stopping depends on context, because the context helps us know what a useful amount of evidence is.

It should also make clear that the Bayes factor is not really the useful decision statistic; rather, the posterior odds are. If an experiment is expensive but would not achieve the levels of evidence necessary to change people’s minds, achieving a “strong” Bayes factor is irrelevant.
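One way to see how a stopping threshold inherits its meaning from prior odds is to turn the relationship around and ask what Bayes factor a given audience would require. The sketch below is illustrative only: the helper `required_bf`, the target posterior odds of 10, and the three audiences are all assumptions of mine.

```r
# Hypothetical helper: the Bayes factor needed to move an audience from
# their prior odds to a target level of posterior odds
required_bf <- function(target_posterior_odds, prior_odds) {
  target_posterior_odds / prior_odds
}

# Assumed prior odds in favor of the claim for three kinds of reader
audience_prior_odds <- c(neutral = 1, skeptical = 1 / 10, very_skeptical = 1 / 100)

# Stopping threshold each audience implies, for target posterior odds of 10
required_bf(10, audience_prior_odds)
```

A stopping rule tuned to the neutral reader would halt data collection long before the very skeptical reader had reason to change their mind.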


This turned out to be a much lengthier post than I anticipated, but summarizing it is easy: although (p) values need categories or criteria to be interpreted, Bayes factors do not. They have a natural interpretation that directly connects evidence with changes in odds. Furthermore, the use of verbal category labels for Bayes factors is misleading and potentially harmful to learners of Bayesian methods. Teachers of Bayesian statistics should focus on ways of visualizing Bayes factors to help people understand, rather than using the “short-cut” of verbal categories.

Multiple Comparisons with BayesFactor, Part 2 – order restrictions

In my previous post, I described how to do multiple comparisons using the BayesFactor package. Part 1 concentrated on testing equality constraints among effects: for instance, that the effects of two factor levels are equal, while leaving the third free to be different. In this second part, I will describe how to test order restrictions on factor level effects. This post will be a little more involved than the previous one, because BayesFactor does not currently do order restrictions automatically.

Again, I will note that these methods are only meant to be used for pre-planned comparisons. They should not be used for post hoc comparisons.

Our Example

I will use the same example and data as I did in the previous post; if you have not read that post, I suggest you go back and read it before delving further here. As a reminder, our data consist of (hypothetical) “moral harshness” ratings of undocumented migrants from 150 participants in three conditions:

  • No odor during questionnaire
  • Pleasant, clean odor (lemon) during questionnaire
  • Disgusting odor (sulfur) during questionnaire

Under the idea that moral disgust and physical disgust are related physiologically (the so-called “embodied” viewpoint; Schnall et al., 2008; but see also Johnson et al., 2014 and Landy & Goodwin, in press) the prediction is that the odor will have an effect on the harshness ratings, as feelings of physical disgust are “transferred” to objects of moral judgment.

In the previous post, I showed the classical ANOVA results, which just failed to reach significance. I also showed how to do a basic test of the null hypothesis against the hypothesis that all three means are unequal using anovaBF. The Bayes factor was about 1/0.774 = 1.3, meaning that neither the null hypothesis nor the “full” model (that all three means are unequal) was favored:

bf1 = anovaBF(score ~ condition, data = disgust_data)
1 / bf1
## Bayes factor analysis
## --------------
## [1] Intercept only : 1.292 ±0.01%
## Against denominator:
## score ~ condition
## ---
## Bayes factor type: BFlinearModel, JZS

More data are needed to test these hypotheses against one another; but as we’ll see, data that are uninformative for one comparison may be more informative for another.

Testing the “right” hypothesis

At this point it is important to note that neither the classical test (which supposedly tests the fitness of the null hypothesis) nor the Bayes factor test of the “full” model against the null hypothesis is the “right” test for the hypothesis at hand. The null hypothesis may be false in ways that are not consistent with the research predictions. The right test in this case is to test the hypothesis that lemon < control < sulfur. The means fall in the predicted direction.

Out of necessity, however, a classical analysis normally ends with the rejection of the null hypothesis that all means are equal. There is no way in classical statistics to rigorously test order-restrictions; one can only point to the ordering of the means and note that they are in the predicted order. This, however, ignores the uncertainty inherent in the estimation of the means.

Occasionally, researchers perform post hoc tests on the individual means to “ensure” that they are really different, given that they are in the correct order, but this has the disadvantage of extremely low power, which means that this method is only deployed opportunistically. Failure of these post hoc tests might not be reported at all, and would certainly never be reported as a failure to achieve sufficient evidence in favor of the hypothesis (the excuse will always be low power) but rejection will always be trumpeted.

What is needed is a principled way of testing order restrictions. Luckily, this is possible — even straightforward — with Bayes factors.

A refresher on Bayes factor logic

One of the neat features of Bayes factors is their transitivity. If I know that Model A outperforms Model B by 3, and I know that Model B outperforms Model C by 4, then I know that Model A outperforms Model C by 3 × 4 = 12. The reason for this is that Bayes factors are simply ratios (see this previous post). The Bayes factor is the ratio of the likelihood of the data under two hypotheses. Since
\[ \frac{p(y \mid {\cal M}_A)}{p(y \mid {\cal M}_B)} \times \frac{p(y \mid {\cal M}_B)}{p(y \mid {\cal M}_C)} = \frac{p(y \mid {\cal M}_A)}{p(y \mid {\cal M}_C)} \]
…we can use two Bayes factors to get a third. This suggests how we can compute a Bayes factor for an order restriction:

  1. Compare the unrestricted “full” model to the null (already done, with anovaBF)
  2. Compare the unrestricted “full” model to an order restriction
  3. Use the resulting two Bayes factors to compare the null to the order restriction.

This all hinges on Step 2, which we perform in the next section.
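The chaining logic in the steps above can be checked numerically with a toy example. This is only a sketch: the data (7 successes in 10 trials) and the three point hypotheses are made up, and they stand in for the null, full, and order-restricted models rather than reproducing them.

```r
# Hypothetical data: 7 successes in 10 trials (illustration only)
k <- 7
n <- 10

# Likelihood of the data under a point hypothesis about the success probability
lik <- function(theta) dbinom(k, n, theta)

# Three toy models: A (theta = 0.7), B (theta = 0.5), C (theta = 0.3)
bf_AB <- lik(0.7) / lik(0.5)
bf_BC <- lik(0.5) / lik(0.3)
bf_AC <- lik(0.7) / lik(0.3)

# Model B's likelihood cancels, so the product of the first two Bayes
# factors reproduces the third
all.equal(bf_AB * bf_BC, bf_AC)
```

In the order-restriction problem, the full model plays the role of Model B: it is the common reference that cancels when the two Bayes factors are multiplied.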

Order restrictions versus full models

For this section, we need to remember that the Bayes factor is the degree to which posterior odds change from prior odds. So, if we can compute the prior odds of a restriction against the full model, and compute the posterior odds of a restriction against the full model, then we can obtain the Bayes factor. For the models in the BayesFactor package, the prior odds are easy. All order restrictions have the same probability, so the odds of any single order restriction against the full model are just
\[ \frac{1}{\mbox{Number of possible orderings}} \]
With three factor levels, there are 6 orderings, so the prior odds are 1/6.
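The 1/6 figure can be sanity-checked by simulation. The sketch below draws the three effects independently from a standard normal distribution, which is an assumption standing in for any prior that treats the factor levels symmetrically; it is not the actual prior used by BayesFactor.

```r
set.seed(1)
n_draws <- 100000

# Stand-in prior (assumed): three effects drawn independently from the same
# distribution, so that no ordering is favored a priori
effects <- matrix(rnorm(3 * n_draws), ncol = 3,
                  dimnames = list(NULL, c("control", "lemon", "sulfur")))

# Proportion of prior draws satisfying lemon < control < sulfur
prior_prob <- mean(effects[, "lemon"] < effects[, "control"] &
                   effects[, "control"] < effects[, "sulfur"])

# Compare the simulated proportion with the exact value 1/3!
c(simulated = prior_prob, exact = 1 / factorial(3))
```

Because the symmetry argument does not depend on the shape of the distribution, any exchangeable prior on the three effects should give roughly the same proportion.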

To compute the posterior odds, we need an additional trick: we sample from the posterior, and work out the posterior probability that our prediction holds in the factor level effects. In this way, we account for the estimation uncertainty in these effects.

We use the posterior() function to sample from the posterior distribution, and for demonstration I’ve shown the first samples from the posterior:

## Sample from the posterior distribution of the full model
## (that is, the numerator of bf1)
samples = posterior(bf1, iterations = 10000)
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1
## End = 7
## Thinning interval = 1
## mu condition-control condition-lemon condition-sulfur sig2
## [1,] 30.24 -0.13343 -1.551 1.6841 42.06
## [2,] 30.49 -0.43687 -1.125 1.5617 42.70
## [3,] 29.35 0.04984 -1.514 1.4644 42.59
## [4,] 29.55 0.38246 -1.233 0.8507 46.79
## [5,] 29.38 -0.21434 -1.071 1.2850 46.31
## [6,] 29.85 -0.85003 -1.043 1.8927 53.68
## [7,] 29.40 -0.79867 -1.003 1.8020 54.84
## g_condition
## [1,] 0.10766
## [2,] 0.18695
## [3,] 0.03047
## [4,] 0.03034
## [5,] 0.59331
## [6,] 0.10248
## [7,] 0.04045

Notice that columns 2 through 4 contain the estimates of the effects of our factor levels (column 1 is the grand mean, mu). We need to estimate the probability that these effects are ordered in the predicted way. A simple estimate can be had by working out the proportion of samples in which the order constraint holds:

## Check order constraint
consistent = (samples[, "condition-control"] > samples[, "condition-lemon"]) &
(samples[, "condition-sulfur"] > samples[, "condition-control"])
N_consistent = sum(consistent)

For each posterior sample, the variable consistent codes whether the sample was consistent with the order restriction; the variable N_consistent contains how many of these samples were consistent, which in this case is 7245. Our estimate of the posterior probability of the restriction is thus N_consistent/10000, because we drew 10000 samples from the posterior distribution. The posterior probability is about 0.7245.

As it turns out, the posterior odds of the restriction to the full model is just 0.7245/1, because every sample is consistent with the full model. We can now compute the Bayes factor of the restriction to the full model by just dividing the posterior odds by the prior odds:

bf_restriction_against_full = (N_consistent / 10000) / (1 / 6)
## [1] 4.347

The Bayes factor is 4.347, which shows that the data change our opinion in favor of the restriction by a factor of about 4, against the full model.
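Since this Bayes factor is built from a simulated proportion, it carries some Monte Carlo error. The sketch below gives a rough idea of its size using the counts reported above. It assumes the posterior samples are independent, which MCMC output generally is not, so the error it reports should be read as optimistic.

```r
# Counts reported in the text
n_samples <- 10000
n_consistent <- 7245

post_prob <- n_consistent / n_samples      # posterior probability of the ordering
prior_prob <- 1 / 6                        # prior probability of the ordering

bf_restriction_against_full <- post_prob / prior_prob

# Binomial standard error of the proportion (assumes independent samples)
se_prob <- sqrt(post_prob * (1 - post_prob) / n_samples)

# The Bayes factor is the proportion divided by a constant, so its standard
# error scales by the same constant
se_bf <- se_prob / prior_prob

c(bf = bf_restriction_against_full, se = se_bf)
```

If the samples are strongly autocorrelated, replacing n_samples with an effective sample size would give a more honest figure.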

Another way to think about the above calculation is that the prior odds index the “riskiness” of the order-restriction prediction (the lower the odds, the more risky the prediction is), and the posterior odds represent the probability that it worked out, given the data. Under this view, the Bayes factor of the restriction versus the full model is the “boost” we give to the evidence due to these two factors:
\[ \mbox{Evidential boost to order prediction} = \mbox{Probability prediction is true} \times \mbox{Riskiness of prediction} \]
To get a big boost, we thus need a risky prediction and a prediction that works out. Simply noting that the prediction worked out in the data is not enough.

Putting it all together

We can now compute the Bayes factor of our restriction against the null hypothesis through simple multiplication:

## Convert bf1 to a number so that we can multiply it
bf_full_against_null = as.vector(bf1)
## Use transitivity to compute desired Bayes factor
bf_restriction_against_null = bf_restriction_against_full * bf_full_against_null
## condition 
## 3.364

This Bayes factor is still moderate, but substantially more respectable than the previous Bayes factor of 0.774.

What have we gained?

Here we see several interesting features of Bayes factors on display.

  1. If a hypothesis makes a prediction, we can test it. Classical testing arbitrarily limits our ability to test certain hypotheses (e.g., order restrictions). Bayes factors are not so limited.
  2. Bayes factors are transitive. If we can test two models against the same third model, we can compare those two models against one another. Although this seems like a reasonable property, (p) values have no such property because they are not comparative.
  3. Making a specific prediction pays off. The “full” model was the wrong model to test, because it did not make properly constrained predictions. When the correct order restriction was tested, the evidence increased because the data were consistent with the restriction.
  4. The limit to the increase in evidence that a specific prediction gives you is the “riskiness” of the prediction. Note in the above calculation that the largest boost a Bayes factor can get from the order restriction is the reciprocal of the prior odds of that restriction (here, 6). If we make a prediction that has low a priori odds, then when it works out in the data, the Bayes factor will reward it by that amount, weighted by the posterior probability that the restriction is true.

For more about testing order restrictions with Bayes factors, see Morey and Wagenmakers (2013), “Simple relation between Bayesian order-restricted and point-null hypothesis tests.”

Multiple Comparisons with BayesFactor, Part 1

One of the most frequently-asked questions about the BayesFactor package is how to do multiple comparisons; that is, given that some effect exists across factor levels or means, how can we test whether two specific effects are unequal. In the next two posts, I’ll explain how this can be done in two cases: in Part 1, I’ll cover tests for equality, and in Part 2 I’ll cover tests for specific order-restrictions.

Before we start, I will note that these methods are only meant to be used for pre-planned comparisons. They should not be used for post hoc comparisons.

An Example

Suppose we are interested in the basis for feelings of moral disgust. One prominent theory, from the embodied cognition point of view, holds that feelings of moral disgust are extensions of more basic feelings of disgust: disgust for physical things, such as rotting meat, excrement, etc. (Schnall et al., 2008; but see also Johnson et al., 2014 and Landy & Goodwin, in press). Under this theory, moral disgust is not only metaphorically related to physical disgust, but may share physiological responses with physical disgust.

Suppose we wish to experimentally test this theory, which predicts that feelings of physical disgust can be “transferred” to possible objects of moral disgust. We ask 150 participants to fill out a questionnaire that measures the harshness of their judgments of undocumented migrants. Participants are randomly assigned to one of three conditions, differing by the odor present in the room: a pleasant scent associated with cleanliness (lemon), a disgusting scent (sulfur), and a control condition in which no unusual odor is present. The dependent variable is the score on the questionnaire, which ranges from 0 to 50 with higher scores representing harsher moral judgment.

Hypothetical data, simulated for the sake of example, can be read into R using the url() function:

# Read in the data from the
disgust_data = read.table(url(''),header=TRUE)

A boxplot and means/standard errors reveal that the effects appear to be in the predicted direction:

(note that the axes are different in the two plots, so that the standard errors can be seen)

And we can perform a classical ANOVA on these data:

summary(aov(score ~ condition, data = disgust_data))
##              Df Sum Sq Mean Sq F value Pr(>F)  
## condition     2    263   131.4    2.91  0.058 .
## Residuals   147   6635    45.1                 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The classical test of the null hypothesis that all means are equal just fails to reach significance at (alpha=0.05).
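For readers who want to see where the numbers in the table come from, the following sketch recomputes the F ratio and the (p) value from the rounded entries printed above. Because the inputs are rounded, the results agree with the table only to about two significant digits.

```r
# Rounded values copied from the ANOVA table above
ms_condition <- 131.4
ms_residual <- 45.1
df_condition <- 2
df_residual <- 147

# The F statistic is the ratio of the two mean squares
ms_condition / ms_residual

# Upper-tail probability of the reported F value under the null hypothesis
f_value <- 2.91
p_value <- pf(f_value, df_condition, df_residual, lower.tail = FALSE)
round(p_value, 3)
```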

A Bayes factor analysis

We can easily perform a Bayes factor test of the null hypothesis using the BayesFactor package. This assumes that the prior settings are acceptable; because this post is about multiple comparisons, we will not explore prior settings here. See ?anovaBF for more information.

The anovaBF function is a convenience function for performing Bayes factor ANOVA-like analyses. The code for the Bayes factor analysis is almost identical to the code for the classical test:

bf1 = anovaBF(score ~ condition, data = disgust_data)
## Bayes factor analysis
## --------------
## [1] condition : 0.7738 ±0.01%
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

The Bayes factor in favor of a condition effect is about 0.774, or 1/0.774 = 1.3 in favor of the null (the “Intercept only” model). This is not strong evidence for either the null or the alternative, which given the moderate p value is perhaps not surprising. It should be noted here that even if the p value had just crept in under 0.05, the Bayes factor would not be appreciably different, which shows the inherent arbitrariness of significance testing.
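The flip from "evidence for the effect" to "evidence for the null" is just a reciprocal. A minimal sketch, using the rounded value printed above rather than the BayesFactor object itself:

```r
# Bayes factor reported above: condition effect vs. intercept only
bf_condition = 0.7738

# Evidence for the null over the condition effect is the reciprocal
bf_null = 1 / bf_condition
round(bf_null, 2)  # 1.29
```

As far as I know, the BayesFactor package also supports inverting the object directly with 1 / bf1, which puts the intercept-only model in the numerator.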

Many possible hypotheses?

This analysis is not the end of the story, however. The hypothesis tested above (that all means are different, but with no further specificity) was not really the hypothesis of interest. The hypothesis of interest was more specific. We might consider an entire spectrum of hypotheses, listed from most to least constrained:

  • (most constrained) The null hypothesis (control = lemon = sulfur)
  • (somewhat constrained) Unexpected scents cause the same effect, regardless of type (lemon = sulfur ≠ control; this might occur, for instance, if both “clean” and “disgusting” scents prime the same underlying concepts)
  • (somewhat constrained) Only disgusting scents have an effect (control = lemon ≠ sulfur)
  • (somewhat constrained) Only pleasant scents have an effect (control = sulfur ≠ lemon)
  • (unconstrained) All scents have unique effects (control ≠ sulfur ≠ lemon)

The above are all equality constraints. We can also specify order constraints, such as lemon < control < sulfur. The unconstrained model tested above (control ≠ sulfur ≠ lemon) does not give full credit to this ordering prediction. In the next section, I will show how to test equality constraints. In Part 2 of this post, I will show how to test order constraints.

Testing equality constraints

To test equality constraints, we must first consider what an equality constraint means. Claiming that an equality constraint holds is the same as saying that your predictions for the data would not change if the two conditions that are supposed to be the same had exactly the same label. If we want to impose the constraint that lemon = sulfur ≠ control, we merely have to give lemon and sulfur the same label.

In practice, this means making a new column in the data frame with the required change:

# Copy the condition column that we will change
# We use 'as.character' to avoid using the same factor levels
disgust_data$lemon.eq.sulfur = as.character(disgust_data$condition)
# Change all 'lemon' to 'lemon/sulfur'
disgust_data$lemon.eq.sulfur[ disgust_data$condition == "lemon" ] = 'lemon/sulfur'
# Change all 'sulfur' to 'lemon/sulfur'
disgust_data$lemon.eq.sulfur[ disgust_data$condition == "sulfur" ] = 'lemon/sulfur'
# finally, make the column a factor
disgust_data$lemon.eq.sulfur = factor(disgust_data$lemon.eq.sulfur)

We now have a data column, called lemon.eq.sulfur, that labels the data so that lemon and sulfur have the same label. We can use this in a Bayes factor test:

bf2 = anovaBF(score ~ lemon.eq.sulfur, data = disgust_data)
# Print the result
bf2
## Bayes factor analysis
## --------------
## [1] lemon.eq.sulfur : 0.1921 ±0%
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

The null hypothesis is now preferred by a factor of 1/0.192 = 5.2, which is expected given that lemon and sulfur were the least similar pair of the three means. The null hypothesis accounts for the data better than this constraint.
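The other two single-equality constraints from the list above can be set up by the same relabeling. The sketch below wraps the relabeling in a small helper; merge_conditions is a hypothetical name, and the short condition vector is a stand-in for disgust_data$condition so that the example runs on its own.

```r
# Hypothetical helper: give every condition in 'to_merge' one shared label
merge_conditions = function(condition, to_merge) {
  labels = as.character(condition)
  labels[labels %in% to_merge] = paste(to_merge, collapse = "/")
  factor(labels)
}

# Stand-in for disgust_data$condition, one entry per condition
condition = factor(c("control", "lemon", "sulfur"))

lemon.eq.sulfur = merge_conditions(condition, c("lemon", "sulfur"))
control.eq.lemon = merge_conditions(condition, c("control", "lemon"))
control.eq.sulfur = merge_conditions(condition, c("control", "sulfur"))

levels(lemon.eq.sulfur)    # "control" "lemon/sulfur"
levels(control.eq.lemon)   # "control/lemon" "sulfur"
levels(control.eq.sulfur)  # "control/sulfur" "lemon"
```

Each new column could then be added to disgust_data and passed to anovaBF in the same way as lemon.eq.sulfur.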

One of the conveniences of using Bayes factors is that if we have two hypotheses that are both tested against the same third hypothesis, we can test the two hypotheses against one another. The BayesFactor package makes this easy. Any two BayesFactor objects compared against the same denominator (in this case, the intercept-only null hypothesis) can be combined together:

bf_both_tests = c(bf1, bf2)
# Print the combined object
bf_both_tests
## Bayes factor analysis
## --------------
## [1] condition : 0.7738 ±0.01%
## [2] lemon.eq.sulfur : 0.1921 ±0%
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

We could, for instance, put all equality-constraint tests into the same object, and then compare them like so:

bf_both_tests[1] / bf_both_tests[2]
## Bayes factor analysis
## --------------
## [1] condition : 4.029 ±0.01%
## Against denominator:
## score ~ lemon.eq.sulfur
## ---
## Bayes factor type: BFlinearModel, JZS

The fully unconstrained hypothesis, represented by condition, is preferred to the lemon = sulfur ≠ control hypothesis by a factor of about 4.
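The division works because both Bayes factors share the intercept-only model as denominator, so that model's marginal likelihood cancels in the ratio. A minimal sketch of the arithmetic, using the rounded values printed above:

```r
# Bayes factors against the intercept-only null, as reported above
bf_vs_null = c(condition = 0.7738, lemon.eq.sulfur = 0.1921)

# The common denominator cancels when one is divided by the other
bf_ratio = unname(bf_vs_null["condition"] / bf_vs_null["lemon.eq.sulfur"])
round(bf_ratio, 2)  # 4.03
```

The package reports 4.029 because it works from the unrounded Bayes factors; the small difference here comes only from rounding the inputs.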

In the next post, we will use the posterior() function to draw from the posterior of the unconstrained model, which will allow us to test ordering constraints.