Bayes factor t tests, part 2: Two-sample tests

In the previous post, I introduced the logic of Bayes factors for one-sample designs by means of a simple example. In this post, I will give more detail about the models and assumptions used by the BayesFactor package, and also how to do simple analyses of two- sample designs.
See the previous posts for background:

This article will cover two-sample t tests.

Bayesian t tests

Frisby and Clatworthy (1975) [DASL] presented random dot stereograms to 78 participants. Each participant was asked to report how long it took them to fuse the random dot stereogram. Of the 78 participants, 35 were given extra visual information about the target image, in the form of a 3D model. The remaining participants were given no such information.

The question of interest is whether the visual information affected the fusion times for the random dot stereograms. We begin by loading the BayesFactor package and reading the data into R.

# Load the BayesFactor package
library(BayesFactor)

# Read in the data from the Data and Story Library
header = FALSE, skip = 33, nrows = 78)
colnames(randDotStereo) = c("fuseTime", "condition")
randDotStereo$logFuseTime = log(randDotStereo$fuseTime)

Because the response times are right-skewed, we perform a conventional logarithmic transformation. All plots and tests will be on the transformed data. The figures below show summaries of the data, in the form of a box plot and a plot with means and standard errors.

The condition in which participants were given visual information about the target image in the random dot stereogram yields slightly faster fusion times, on average. Before exploring the results of a Bayes factor analysis, we first compute the classical t test for comparison.

classical.test = t.test(logFuseTime ~ condition, data = randDotStereo, var.eq = TRUE)
classical.test
##
## Two Sample t-test
##
## data: logFuseTime by condition
## t = 2.319, df = 76, p-value = 0.02308
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.06077 0.80034
## sample estimates:
## mean in group NV mean in group VV
## 1.820 1.389

The classical t test yields a statistically significant result at (alpha=0.05), with a (p) value of (p=0.0231). The standardized effect size is (d=0.5277) [CI95%: (0.072,0.98]). This would typically be interpreted (incorrectly) as indicating strong enough evidence against the null hypothesis to reject it. To truly evaluate the evidence against the null hypothesis, we need a Bayes factor; to compute a Bayes factor, we need to outline two hypotheses.

Finding two hypotheses to compare

A t test is typically formulated to test the plausibility of the null hypothesis; however, in order to determine the plausibility of the null hypothesis, we must compare it to something. Formulating the null hypothesis is easy; it states simply that there is no effect, implying that (delta=0) (where (delta) is the true standardized effect size).

We will compare the null hypothesis to an alternative, so we what standardized effect sizes would be plausible if the null hypothesis were false. The BayesFactor package has a few built-in “default” named settings. The figure below shows three alternative hypotheses as prior distributions over the true effect size (delta).

These priors all have the same shape; the only differ by their scale, denoted by (r). The three named defaults are given in the following table.

Prior name // (r) scale
*medium (sqrt{2}/2)
wide (1)
ultrawide (sqrt{2})

“Medium”, indicated by a star, is the default. The scale controls how large, on average, the expected true effect sizes are. For a particular scale 50% of the true effect sizes are within the interval ((-r,r)). For the default scale of “medium”, 50% of the prior effect sizes are within the range ((-0.7071,0.7071)). Increasing (r) increases the sizes of expected effects; decreasing (r) decreases the size of the expected effects.

Performing the test

We can now compute a Bayes factor using the BayesFactor package. To compute a two-sample t test, we use the ttestBF function. The formula argument indicates the dependent variable on the left of the ~, and the grouping variable on the right. The data argument is used to pass the data frame containing the dependent and independent variables as columns.

bf = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo)
bf
## Bayes factor analysis
## --------------
##  Alt., r=0.707 : 2.322 ±0.06%
##
## Against denominator:
## Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

The Bayes factor analysis with the default prior settings yields a Bayes factor of 2.3224 in favor of the alternative. What does this mean?

1. The data are 2.3224 times as probable under the alternative as under the null.
2. The relative odds in favor of the alternative, against the null, must change by a factor of 2.3224 in light of the data. For instance, if we held even odds before seeing the data, then we our posterior odds would be 2.3224:1 for the alternative. This is regarded as weak evidence, though it does slightly favor the alternative.

If we regarded the significant (p) value to be an indication that there was “enough” evidence to reject the null hypothesis, we might be surprised by this result. Do the classical result and the Bayesian result contradict one another? Yes, and no. If we interpret the classical (p) value as a measure of the weight of the evidence — a common, though incorrect, interpretation — then the results contradict. However, we must remember that classical statistics has no formal notion of the weight of evidence. The significance test is merely a test with a fixed error rate of (alpha). Interpreted correctly, then, the results do not contradict one another. But correctly interpreted, “significance” is an uninteresting concept; the Bayesian notion of the strength of evidence is more in line with the interests of researchers.

A one-sided test

In the random dot stereogram data, the two-sided hypothesis we chose might be inappropriate. It seems reasonable to ask the question “How strong is the evidence that giving visual information about the random dot stereogram yields faster responses?” if we thought that having visual information of the picture in the stereogram helped participants fuse the image. This question suggests testing the hypothesis that (delta>0) versus (delta=0).

For the one-sided test, we use the positive (or negative) portion of the prior distribution. Indicating that we want to restrict our alternative to a range is done using the nullInterval argument:

bf.signed = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo,
nullInterval = c(0, Inf))
bf.signed
## Bayes factor analysis
## --------------
##  Alt., r=0.707 0<d<Inf : 4.567 ±0.06%
##  Alt., r=0.707 !(0<d<Inf) : 0.07747 ±0.06%
##
## Against denominator:
## Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

(see also the previous blog post.) The BayesFactor package helpfully returns the Bayes factor for the selected interval against the null and the Bayes factor for the complement of the selected interval against the null. The Bayes factor for the one-sided alternative that the effect size is positive, against the null, is 4.5673. Because the effect size in the data is consistent with the true effect size being positive, the weight of evidence is greater relative to the two-sided test.

Comparing other hypotheses

Suppose that we were uninterested in testing the point-null hypothesis that (delta=0). We might, for instance, care more about whether the effect size is positive versus negative. In the case of the random dot stereograms, one hypothesis might state that the visual information actually slows fusion times, and the other that visual information speeds fusion times. We already have enough information to compute such a Bayes factor from the analyses above.
Recall that the Bayes factor is the ratio of the probability of the data under the two hypotheses; a hypothesis is favored to the extent that it specifies that the observed data is more probable. In bf.signed, we have two Bayes factors, both against the same denominator model: the null hypothesis. In order to compute a third Bayes factor comparing the positive effect sizes to the negative ones, all we have to do is divide:
[
begin{eqnarray*}
&&left.frac{mbox{Probability of the data if}~delta>0}{mbox{Probability of the data if}~delta=0}middle/frac{mbox{Probability of the data if}~delta<0}{mbox{Probability of the data if}~delta=0}right.\
&=& frac{mbox{Probability of the data if}~delta>0}{mbox{Probability of the data if}~delta<0}
end{eqnarray*}
]
The identical denominators cancel, leaving our new Bayes factor. This is easy using the BayesFactor package. This is easy using the BayesFactor package. Since bf.signed contains the two Bayes factors of interest, we just divide the first element by the second element:

bf.pos.vs.neg = bf.signed/bf.signed
bf.pos.vs.neg

## Bayes factor analysis
## --------------
##  Alt., r=0.707 0<d<Inf : 58.96 ±0.09%
##
## Against denominator:
## Alternative, r = 0.707106781186548, mu1-mu2 =/= 0 !(0<d<Inf)
## ---
## Bayes factor type: BFindepSample, JZS

Note that the numerator is now restricted to positive effect sizes, and the denominator is restricted to negative effect sizes. The Bayes factor is 58.9575, indicating that positive effect sizes are strongly favored to negative ones.

The same logic can be used to perform other tests, as well; consider testing a positive, but small ((0<deltaleq.2)), effect, against a positive and medium-or-larger sized effect ((delta>.2)):

## Each of these will contain two Bayes factors, the first of which is of
## interest
bf.negl = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo, nullInterval = c(0,
0.2))
bf.nonnegl = ttestBF(formula = logFuseTime ~ condition, data = randDotStereo,
nullInterval = c(0.2, Inf))

bf.nonnegl/bf.negl
## Bayes factor analysis
## --------------
##  Alt., r=0.707 0.2<d<Inf : 1.917 ±0.09%
##
## Against denominator:
## Alternative, r = 0.707106781186548, mu =/= 0 0<d<0.2
## ---
## Bayes factor type: BFindepSample, JZS

The Bayes factor of 1.9168 is extremely weak evidence against the small-effect-size hypothesis.

The effect of the prior

Whenever we do a Bayesian analysis, we must use a prior distribution. In the BayesFactor package, a flexible family of priors is provided for you. An analysis specifies the spread of possible true values of (delta) under the alternative hypothesis. Predictions for data under the alternative are obtained by combining the uncertainty in the true value with the uncertainty from sampling; thus, a reasonable prior distribution is critical because the prior distribution makes the alternative testable through its predictions.

Reasonable predictions for data are necessary for reasonable inference, and a reasonable prior is necessary for reasonable predictions. Given a reasonable prior, evaluations of the evidence must be based on Bayes’ theorem, and hence the Bayes factor (see “What is a Bayes factor?”) . The problem is that while we might have a sense that a particular prior is “reasonable”, others might be reasonable as well. We might, for instance, be uncertain whether the “medium” or “wide” prior scale is more reasonable for the t test outlined above. It is interesting, therefore, to understand how the Bayes factor changes with the prior scale.

The figure below shows the predictions for the observed effect size for four hypotheses: the null hypothesis, and the three named alternative scales used in the BayesFactor package. The vertical line shows the effect size observed in the random dot stereogram experiment.

The alternative hypotheses (“medium”, “wide”, and “ultrawide”) show increasingly spread out predictions for the observed effect sizes, which follow from the fact that the corresponding priors on the true effect size (delta) are increasingly wide. The predictions under the null hypothesis, on the other hand, are more concentrated around the effect size of 0. The The prior scales affect the Bayes factor because a hypothesis that has more spread out predictions will have lower density on any one prediction. This is a natural and desirable consequence of probability theory: a hypothesis that predicts everything predicts nothing.

The figure below shows how the Bayes factor for the random dot stereogram data is related to the prior scale. The scales vary by a factor of about 16, a substantial range. Although the Bayes factor is certainly sensitive to these changes, even changes of this large magnitude do not markedly change the Bayes factor. Any one of these chosen prior scales would lead to the conclusion that there is equivocal evidence for an effect.

There are two main trends to understand: how the Bayes factor changes as the prior scale becomes small, and how it changes as the prior scale becomes very large.

• As the prior scale approaches 0, the alternative looks more and more like the null. The Bayes factor approaches 1.
• As the prior scale approaches (infty), the alternative predicts larger and larger observed values. The Bayes factor eventually favors the null to arbitrary degree.

It is important to emphasize that the sensitivity of the Bayes factor to the prior is not a “bad” or “undesirable” feature of the Bayes factor. The sensitivity to the prior is simply a sensitivity to predictions of what data are plausible under the hypotheses. There is no question that inferences should be sensitive to predictions. The only question is how sensitive they should be, and Bayes theorem gives us a definitive answer: the Bayes factor is exactly as sensitive as Bayes’ theorem requires.

For more on the sensitivity of the BayesFactor t tests to the prior, see this shiny app, which demonstrates the effect of the prior on data predictions and the Bayes factor. See also Felix Schönbrodt’s recent post on the topic, and the associated shiny app.

In the next post — part 3 in the t test series — I will discuss using BayesFactor to sample from posterior distributions.

Bayes factor t tests, part 1

In my first post, I described the general logic of Bayes factors. I will continue discussing the general logic of Bayes factor, while introducing some of the basic functionality of the BayesFactor package.

Recap: What is a Bayes factor?

The Bayes factor is the relative evidence provided by the data for comparing two statistical models. It has two equivalent definitions:

1. The Bayes factor is the relative predictive success between two hypotheses: it is the ratio of the probabilities of the observed data under each of the hypotheses. If the probability of the observed data is higher under one hypothesis than another, then that hypothesis is preferred.
2. The Bayes factor is evidence: it is the factor by which the relative beliefs must be multiplied after the data have been observed, providing new (posterior) relative beliefs. The stronger the evidence, the more beliefs must change.

Student’s sleep data

Student (1908) describes a study by Cushny and Peebles (1905) on the effect of several medications on sleep. One of the drugs they studied was Laevo-hyoscine hydrobromide, which we’ll abbreviate as “LH”. LH was administered to 10 patients on several nights, and the average amount of sleep over the patients’ unmedicated sleep (measured on other nights) was recorded. For the purposes of this post, we will make the typical assumptions about the data that would be made if we were to apply the (t) procedure: specifically, that the observation are independent and normally distributed.
We return to Carole and Paul, who have another disagreement. Carole does not believe that LH increases sleep at all, while Paul claims that it does. This would traditionally be handled by a t test. Under Carole’s hypothesis, the (t) statistic has a Student’s (t) distribution with (N-1) degrees of freedom. That is, [ t = frac{bar{y}}{s}sqrt{N} sim mbox{Student’s }~t_{N-1} ]
Another way of stating her prediction is in terms of the observed effect size, (hat{delta}=bar{y}/s), which is shown in the figure below. It is simply a Student’s (t) distribution, scaled by (1/sqrt{N}).

Paul’s hypothesis — that LH causes an increase in sleep — is problematic. Typically, we require scientific hypotheses to constrain the data in some way. We require them to predict that some outcomes are more plausible than others. Paul’s hypothesis, however, predicts nothing. His hypothesis does not constrain the data.
This highlights a curious fact about traditional significance testing. The only hypothesis we can support using a classical significance test is the one that makes no predictions! If we applied the traditional t test to these data, we could only reject Carole’s hypothesis in favor of Paul’s — but never support Carole’s — even though Paul has advanced a decidedly unfalsifiable hypothesis.
To be testable, Paul’s hypothesis must make predictions, as Carole’s has. As a first step, let’s assume that Paul believes that LH will increase sleep by 1 standardized effect size unit. If (mu) is the mean increase in hours of sleep and (sigma) is the standard deviation of the increases, then the standardized increase is [ delta = frac{mu}{sigma}. ] and Paul believes that (delta=1). Using the noncentral t distribution, we can obtain Paul’s predictions for the observed effect size (hat{delta}), shown in the figure below:

As expected, the predictions are centered around (hat{delta}=1). We now have two hypothesis: Carole believes that (delta=0), and Paul believes that (delta=1). These hypotheses imply two different predictions for the observed effect size, as shown above. We can now compare the predictions with the results presented by Student to see whether Paul or Carole is favored by the evidence.
The sleep data set is built into R; the following R code will extract the data for LH (which is labelled group 2):

improvements = with(sleep, extra[group == 2])
improvements
##    1.9  0.8  1.1  0.1 -0.1  4.4  5.5  1.6  4.6  3.4

The improvements in hours of sleep range from -0.1 hours (the patient slept less when medicated by LH) and an improvement of 5.5 hours. The mean improvement was 2.33 hours. The figure below shows the hours of improvement for all 10 patients.

We begin our analysis by computing (N) and the (t) statistic from the data:

N = length(improvements)
t = mean(improvements)/sd(improvements) * sqrt(N)
t
##  3.68

Typically, at this point we would look up the (p) value for this (t) statistic; but for a Bayes factor analysis, we use it to compute the relative evidence for the two hypotheses. Of interest is the effect size:

deltaHat = t/sqrt(N)
deltaHat
##  1.164

The effect size (hat{delta}=1.1637) seems to favor Paul, since Paul’s hypothesis ((delta=1)) was actually quite close to the observed result . In order to know by how much the result favors Paul, we need to look at the predictions and see where the observation falls.
The figure below shows both Carole’s and Paul’s predictions for the observed effect size as overlapping gray distributions. The observed effect size is denoted by the vertical line segment. In order to compute the Bayes factor, we must compare the relative heights of the two distributions for the observed data. The probability density of the observed data is 67.9451 times higher under Paul’s hypothesis than under Carole’s. The Bayes factor favoring Paul is thus 67.9451.

We can easily compute this using R’s built in functions. Under Carole’s hypothesis, the (t) statistic has a central (t) distribution with 9 degrees of freedom; Under Paul’s hypothesis, the (t) statistic has a noncentral (t) distribution with 9 degrees of freedom, and a noncentrality parameter of (deltasqrt{N}=1timessqrt{10}). The ratio of the two densities, at the observed (t) statistic, yields the Bayes factor.

dt(t, df = 9, ncp = 1 * sqrt(10))/dt(t, df = 9)
##  67.95

Uncertain hypotheses, and a first look at BayesFactor

In the section above Carole’s and Paul’s original hypotheses, when stated in terms of effect sizes, were [ begin{align*} {cal H}_c : delta &= 0\ {cal H}_p : delta &> 0; end{align*} ] that is, Paul hypothesized that the drug increases sleep on average, and Carole hypothesized that it had no effect. The problem with this was that Paul’s hypothesis was untestable, because it does not lead to any proper constraint on the data.
The solution is for Paul to propose a hypothesis that does constrain data. We chose the hypothesis (delta=1) for Paul, and showed how that can be used to compute a Bayes factor. The resulting Bayes factor, however, does not seem to capture very well the spirit of Paul’s original hypothesis. Paul was not specific about what the exact effect size was. But (delta>0) won’t do either; it is so unspecific as to be completely unfalsifiable.
What we need is a middle ground between the two. In the previous post I showed how a Bayesian analysis can spread uncertainty out over a range of parameter values using a probability distribution.
The figure below shows two hypotheses: Carole’s null hypothesis ((delta=0)) whose predictions I’ve already presented, and a plausible hypothesis for Paul ((delta>0)). Under Paul’s hypothesis, plausible effect sizes are not spread evenly across the positive effect sizes, because that would not lead to any constraint on the data. Instead, small effect sizes are preferred to large ones. It is implausible for the effect size to be too large, so the plausibility of the effects diminish as the effect sizes get larger.

The shape of this prior distribution is the positive half of a Cauchy distribution, or a Student’s (t) distribution with 1 degree of freedom. The Cauchy prior was suggested by Rouder et al. (2009), and is the one implemented in the BayesFactor package.
Of course, this half-Cauchy distribution is not the only way we could implement Paul’s hypothesis. We could choose a different shape of distribution; we could make it wider or narrower around 0; we could even restrict the range further. We must choose some valid probability distribution in order to create a hypothesis that constrains the data, and we should do our best to make the test meaningful by choosing a defensible distribution.
Recall from the previous post that the way we obtain predictions from an uncertain hypothesis like Paul’s, we weight the predictions for each possible (delta) by their plausibility and average them. This is represented by the integral: [ P(hat{delta}mid{cal H}_p) = int P(hat{delta}mid delta)p(delta),ddelta, ] where (P(hat{delta}mid delta)) gives the predictions for a specific (delta), and (p(delta)) is the distribution in the figure above, giving the weightings for all possible values of (delta).
The figure below shows Carole’s predictions (which are the same as before) and Paul’s new predictions for the observed effect size. Notice that Paul’s predictions are now substantially more spread out than before. The uncertainty Paul has about the true value of (delta), added to the uncertainty that is inherent in the sampling given a particular true value, has caused his predictions to be less certain. This spreading out is the penalty for a having a less-specific hypothesis.

Now that we have predictions from both Carole’s and Paul’s hypothesis, we can compute a Bayes factor. The Bayes factor is the ratio of the heights at the observed (hat{delta}) value, shown in the figure below by the vertical line segment. The Bayes factor is 21.3275 in favor of Paul, because the probability density of the observed data is 21.3275 times greater under Paul’s hypothesis than under Carole’s. Note that this is substantially lower than the Bayes factor in favor of Paul’s hypothesis when it was more specific. More flexible hypotheses, like Paul’s, are automatically penalized by the Bayes factor.

As previously mentioned, the distribution used as Paul’s hypothesis happens to be the one implemented in the BayesFactor package. We can thus use the function ttestBF, part of the package’s suite of functions, to compute the Bayes factor for the Student sleep data.
If you have not installed the BayesFactor package yet, install the package first (help here). If you define the improvements vector as above, you can run the following two lines of code to load the package and perform the a Bayes factor (t) test. The nullInterval argument tells ttestBF that you want to consider that range as a hypothesis. Paul’s hypothesis ranged from 0 to (infty), so we specify that range using the nullInterval argument:

library(BayesFactor)
ttestBF(improvements, nullInterval = c(0, Inf))
## Bayes factor analysis
## --------------
##  Alt., r=0.707 0<d<Inf : 21.33 ±0.01%
##  Alt., r=0.707 !(0<d<Inf) : 0.1036 ±1.4%
##
## Against denominator:
## Null, mu = 0
## ---
## Bayes factor type: BFoneSample, JZS

There are several things to note in the output. There are two Bayes factors shown here shown in the lines starting with  and  The first line is the one we specified, where (0<d<infty). The second Bayes factor is the Bayes factor for the complement of the specified range (the ! indicates negation) — in this case, the negative effect sizes. Each of the two lines contains three pieces of information: the numerator hypothesis for the Bayes factor, the Bayes factor itself, and a measure of the error in the estimate of the Bayes factor. Here, the error is very small.
Below the two Bayes factors, the denominator hypothesis is specified. The denominator is the hypothesis to which all the numerator hypotheses have been compared. The first Bayes factor, then, is the comparison of the positive effect sizes to the null hypothesis; the second Bayes factor is the comparison of the negative effect sizes to the null. The positive hypothesis is preferred to the null hypothesis, and the null hypothesis is preferred to the negative hypothesis.
In order to gain some insight into the Bayes factor, one can compute the Bayes factor for a range of possible data. The figure below shows the Bayes factors for a range of observations. The effect size in the Student sleep data is denoted by the circle. Observing smaller effect sizes or negative effect sizes favors Carole’s hypothesis (the null), while observing larger effect sizes favors Paul’s alternative.

Conclusions

What can we conclude from our Bayes factor analysis? Paul’s alternative hypothesis was favored to Carole’s null hypothesis by a factor of 21.3275, suggesting a substantial amount of evidence in favor of a positive effect size. This is, of course, conditioned on the “reasonableness” of Paul’s hypothesis. The Bayes factor is not, however, arbitrary: although the Bayes factor would change if we changed the prior, we would have to choose a strange prior to change the substantive conclusion.
In the next post, we will explore the BayesFactor package’s (t) tests in more detail.