Category Archives: Bayes factor


To Beware or To Embrace The Prior

In this guest post, Jeff Rouder reacts to two recent comments skeptical of Bayesian statistics, and describes the importance of the prior in Bayesian statistics. In short: the prior gives a Bayesian model the power to predict data, and prediction is what allows the evaluation of evidence. Far from being a liability, Bayesian priors are what make Bayesian statistics useful to science.

Jeff Rouder writes:

Bayes’ Theorem is about 250 years old. For just about as long, there has been this one never-ending criticism—beware the prior. That is: priors are too subjective or arbitrary. In the last week I have read two separate examples of this critique in the psychological literature. The first comes from Savalei and Dunn (2015) who write,

…using Bayes factors further increases ‘researcher degrees of freedom,’ creating another potential QRP, because researchers must select a prior – a subjective expectation about the most likely size of the effect for their analyses. (Savalei and Dunn, 2015)

The second example is from Trafimow and Marks (2015) who write,

The usual problem with Bayesian procedures is that they depend on some sort of Laplacian assumption to generate numbers where none exist. (Trafimow and Marks, 2015)

The focus should be on the last part—generating numbers where none exist—which I interpret as questioning the appropriateness of priors. Though the critiques are subtly different, they both question the wisdom of Bayesian analysis for its dependence on a prior. Because of this dependence, researchers holding different priors may reach different conclusions from the same data. The implication is that ideally analyses should be more objective than the subjectivity necessitated by Bayes.

The critique is dead wrong. The prior is the strength rather than the weakness of the Bayesian method. It gives the method all of its power to predict data, to embed theoretically meaningful constraint, and to adjudicate evidence among competing theoretical positions. The message here is to embrace the prior. My colleague Chris Donkin has used Kubrick’s subtitle to say it best: “How I learned to stop worrying about and love the prior.” Here goes:

Classical and Bayesian Models

Let’s specify a simple model in both classical and Bayesian form. Consider, for example, the case where the data, denoted \(Y_1,\ldots,Y_N\), are distributed as normals with a known variance of 1. The conventional frequentist model is \[ Y_i(\mu) \sim \mbox{Normal}(\mu,1). \] There is a single parameter, \(\mu\), which is the center of the distribution. Parameter \(\mu\) is a single fixed value which, unfortunately, is not known to us. I have made \(Y_i\) a function of \(\mu\) to make the relationship explicit. Clearly, the distribution of each \(Y_i\) depends on \(\mu\), so this notation is reasonable.

The Bayesian model consists of two statements. The first one is the data model: \[ Y_i | \mu \sim \mbox{Normal}(\mu,1), \] which is very similar to the conventional model above. The difference is that \(\mu\) is no longer a constant but a random variable. Therefore, we write the data model as a conditional statement: conditional on some value of \(\mu\), the observations follow a normal at that mean. The data model, though conceptually similar to the frequentist model, is not enough for a Bayesian. It is incomplete because it is specified as a conditional statement. Bayesians need a second statement, a model on the parameter \(\mu\). A common specification is \[ \mu \sim \mbox{Normal}(a,b), \] where \(a\) and \(b\), the mean and variance, are set by the analyst before observing the data.

From a classical perspective, Bayesians make an extra model specification, the prior on parameters, that is unnecessary and unwarranted. Two researchers can have the same model, that the data are normal, but have very different priors if they choose very different values of \(a\) and/or \(b\). With these different choices, they may draw different conclusions. From a Bayesian perspective, the classical perspective is incomplete because it only models the phenomena up to a function of unknown parameters. Classical models are good models if you know \(\mu\) (say, as God does) but not so good if you don’t; and, of course, mortals don’t. This disagreement, whether Bayesian models have an unnecessary and unwise specification or whether classical models are incomplete, is critical to understanding why the priors-are-too-subjective critique is off target.

Bayesian Models Make Predictions, Classical Models Don’t

One criterion that I adopt, and that I hope you do too, is that models should make predictions about data. Prediction is at the heart of deductive science. Theories make predictions, and then we check whether the data have indeed conformed to these predictions. This view is not too alien; in fact, it is the stuff of grade-school science. Prediction to me means the ability to make probability statements about where data will lie before the data are collected. For example, if we agree that \(\mu=0\) in the above model, we can now make such a statement, say that the probability that \(Y_1\) is between -1 and 1 is about 68%.
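As a quick check of the 68% figure, here is a minimal sketch using Python's standard library. It assumes exactly the setup in the text (mean fixed at 0, variance 1); the variable names are my own.

```python
from statistics import NormalDist

# Data model with the parameter fixed by agreement: Y_1 ~ Normal(mean 0, variance 1)
y1 = NormalDist(mu=0.0, sigma=1.0)

# Probability that Y_1 lands between -1 and 1, stated before any data are collected
p_within_one = y1.cdf(1.0) - y1.cdf(-1.0)
print(round(p_within_one, 4))  # 0.6827
```

The point of the sketch is only that the statement can be computed once the parameter has a value; nothing here needs data.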

This definition of prediction, while common sense, is quite disruptive. Do classically-specified models predict data? I admit a snarky thrill in posing this question to my colleagues who advocate classical methods. Sometimes they say “yes,” and then I remind them that the parameters remain unknown except in the large-sample limit. Since we don’t have an infinite amount of data, we don’t know the parameters. Sometimes they say they can make predictions with the best estimate of \(\mu\), and I remind them that they need to see the data first to estimate \(\mu\), and as such, it is not a prediction (not to mention the unaccounted sample noise in the best estimate). It always ends a bit uneasy with awkward smiles, and with the unavoidable conclusion that classical models do not predict data, at least not in the usual definition of “predict.”

The reason classical models don’t predict data is that they are incomplete. They are missing the prior, a specification of how the parameters vary. With this specification, the predictions are a straightforward application of the Law of Total Probability: \[ \Pr(Y_i) = \int \Pr(Y_i|\mu) \Pr(\mu) \, d\mu. \] The respective probabilities (densities) \(\Pr(Y_i|\mu)\) and \(\Pr(\mu)\) are derived from the model specifications. Hence, \(\Pr(Y_i)\) is computable. We can state the probability that an observation lies in any interval before we see the data. Bayesian specifications predict data; classical specifications don’t.
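To make the integral concrete, here is a rough sketch for the normal data model with a normal prior on the mean. The prior settings (mean 0, variance 4) are purely illustrative assumptions, and the helper name `predictive_density` is hypothetical. The sketch integrates numerically and compares the result with the known closed form for this conjugate case, in which the prediction is a normal with mean a and variance 1 + b.

```python
from math import sqrt
from statistics import NormalDist

def predictive_density(y, a, b, steps=4000):
    """Approximate Pr(y) = integral of Pr(y | mu) Pr(mu) d mu by the trapezoid rule.

    Data model: Y | mu ~ Normal(mu, variance 1); prior: mu ~ Normal(a, variance b).
    """
    prior = NormalDist(mu=a, sigma=sqrt(b))
    lo, hi = a - 10 * sqrt(b), a + 10 * sqrt(b)  # wide enough to cover the prior mass
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        mu = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * NormalDist(mu=mu, sigma=1.0).pdf(y) * prior.pdf(mu)
    return total * h

a, b = 0.0, 4.0                      # illustrative prior mean and variance (assumed)
numeric = predictive_density(1.5, a, b)
closed_form = NormalDist(mu=a, sigma=sqrt(1.0 + b)).pdf(1.5)
print(numeric, closed_form)

# Probability that an observation lies in (-1, 1), available before seeing any data
marginal = NormalDist(mu=a, sigma=sqrt(1.0 + b))
print(marginal.cdf(1.0) - marginal.cdf(-1.0))
```

Note that the interval probability is smaller than the 68% obtained when the mean is fixed at 0, because uncertainty about the mean spreads the prediction out.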

Priors Instantiate Meaningful Constraint

The prior is not some nuisance that one must begrudgingly specify. Instead, it is a tool for instantiating theoretically meaningful constraint. Let’s take a problem near and dear to my children: whether the candy Smarties makes children smarter. For if so, my kids have a very convincing claim why I should buy them Smarties. I have three children, and these three don’t agree on much. So let’s assume the eldest thinks Smarties makes you smarter, the middle thinks Smarties makes you dumber if only to spite his older brother, and the youngest thinks it’s wisest to steer a course between her brothers. She thinks Smarties have no effect at all. They decide to run an experiment on 40 schoolmates where each schoolmate first takes an IQ test, then eats a Smartie, and then takes the IQ test again. The critical measure is the change in IQ, and for the sake of this simple demonstration, we discount any learning or fatigue confounds.

All three kids decided to instantiate their positions within a Bayesian model. All three start with the same data model: \[ Y_i | \mu \sim \mbox{Normal}(\mu,\sigma^2), \] where \(Y_i\) is the difference score for the \(i\)th kid, \(\mu\) is the true effect of Smarties, and \(\sigma^2\) is the variance of this difference across kids. For simplicity, let’s treat \(\sigma=5\) as known, say as the known standard deviation of test-retest IQ score differences. Now each of my children needs a model on \(\mu\), the prior, to instantiate their position. The youngest had it easiest. With no effect, her model on \(\mu\) is \[ M_0: \mu=0. \] Next, consider the model of the oldest. He believes there is a positive effect, and knowing what he does about Smarties and IQ scores, he decides to place equal probability on all values of \(\mu\) between a 0-point and a 5-point IQ effect, i.e., \[ M_1: \mu \sim \mbox{Uniform}(0,5). \] The middle one, being his brother’s perfect contrarian, comes up with the mirror-symmetric model: \[ M_2: \mu \sim \mbox{Uniform}(-5,0). \]

Predictions Are The Key To Evidence

Now a full-throated disagreement among my children will inevitably result in one of them yelling, “I’m right; you’re wrong.” This proclamation will be followed by, “You’re so stupid.” The whole thing will go on for a while with hurled insults and hurt feelings. And if you think this juvenile behavior is limited to my children or children in general, then you may not know many psychological scientists. What my kids need is a way of using data to inform theoretically-motivated positions.
In a previous post, Richard Morey demonstrated, in the context of Bayesian t tests, how predictions may be used to state evidence. I state the point here for the problem my children face. Because my children are Bayesian, they may compute their predictions about the sample mean of the difference scores. Here they are for a sample mean across 40 kids:

My daughter with Model \(M_0\) most boldly predicts that the sample mean will be small in magnitude, and her predictive density is higher than that of her brothers for \(-1.15<\bar{Y}<1.15\). If the sample mean is in this range, she is more right than they are. Likewise if the sample mean is above 1.15, the oldest child is more right (Model \(M_1\)), and if the sample mean is below -1.15, the middle child is more right (Model \(M_2\)).
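For readers who want the predictions in closed form, here is one way to write them down. It follows from the three models and the Law of Total Probability; the symbols \(s\), \(\phi\), and \(\Phi\) are my own notation, not the post's. Let \(s=\sigma/\sqrt{N}=5/\sqrt{40}\approx 0.79\) be the standard error of the sample mean, and let \(\phi\) and \(\Phi\) be the standard normal density and distribution function. Then \[ p(\bar{y}|M_0) = \frac{1}{s}\,\phi\!\left(\frac{\bar{y}}{s}\right), \] \[ p(\bar{y}|M_1) = \int_0^5 \frac{1}{5}\,\frac{1}{s}\,\phi\!\left(\frac{\bar{y}-\mu}{s}\right) d\mu = \frac{1}{5}\left[\Phi\!\left(\frac{\bar{y}}{s}\right)-\Phi\!\left(\frac{\bar{y}-5}{s}\right)\right], \] \[ p(\bar{y}|M_2) = \int_{-5}^0 \frac{1}{5}\,\frac{1}{s}\,\phi\!\left(\frac{\bar{y}-\mu}{s}\right) d\mu = \frac{1}{5}\left[\Phi\!\left(\frac{\bar{y}+5}{s}\right)-\Phi\!\left(\frac{\bar{y}}{s}\right)\right]. \] The null model concentrates its prediction near zero, while each uniform model spreads a flatter prediction over its own side of zero.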

With this Bayesian setup, we as scientists can hopefully rise above the temptation to think in terms of right and wrong. Instead, we can state fine-grained evidence as ratios. For example, suppose we observe a mean of -1.4, which is indicated with the vertical dashed line. The most probable prediction comes from Model \(M_2\), and it is almost twice as probable as Model \(M_0\). This 2-to-1 ratio serves as evidence for a negative effect of Smarties relative to a null effect. The prediction for the negative model is 25 times as probable as that for the positive model, and thus the evidence for a negative-effects model is 25-to-1 compared to the positive-effects model. These ratios of marginal predictions are Bayes factors, which are an intuitive measure of evidence. Naturally, the meaning of the Bayes factor is bound to the model specifications.
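The ratios quoted above can be reproduced with a short sketch. It assumes the setup stated in the post (known standard deviation of 5, 40 schoolmates, an observed mean of -1.4); the function names are hypothetical, and the uniform-prior predictions use the standard result that integrating a normal density over a uniform prior gives a difference of normal distribution functions.

```python
from math import sqrt
from statistics import NormalDist

SIGMA, N = 5.0, 40
SE = SIGMA / sqrt(N)       # standard error of the sample mean
STD = NormalDist()         # standard normal, used for its cdf

def pred_null(ybar):
    """Predictive density of the sample mean under M0: mu = 0."""
    return NormalDist(mu=0.0, sigma=SE).pdf(ybar)

def pred_uniform(ybar, lo, hi):
    """Predictive density of the sample mean when mu ~ Uniform(lo, hi)."""
    return (STD.cdf((ybar - lo) / SE) - STD.cdf((ybar - hi) / SE)) / (hi - lo)

ybar = -1.4
m0 = pred_null(ybar)               # youngest: no effect
m1 = pred_uniform(ybar, 0.0, 5.0)  # oldest: positive effect
m2 = pred_uniform(ybar, -5.0, 0.0) # middle: negative effect

bf_20 = m2 / m0   # negative-effect model against the null
bf_21 = m2 / m1   # negative-effect model against the positive-effect model
print(bf_20, bf_21)
```

Under these assumptions the two ratios should come out near 1.8 and 25, in line with “almost twice” and “25 times” in the text.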

Take Home

The prior is not some fudge factor. Different theoretically motivated constraints on data may be specified gracefully through the prior. With this specification, not only do competing models predict data, but stating evidence for positions is as conceptually simple as comparing how well each model predicts the observed data. Embrace the prior.

Jeff Rouder

On making a Bayesian omelet

My colleagues Eric-Jan Wagenmakers and Jeff Rouder and I have a new manuscript in which we respond to Hoijtink, van Kooten, and Hulsker’s in press manuscript Why Bayesian Psychologists Should Change the Way they Use the Bayes Factor. They suggest a method for “calibrating” Bayes factors using error rates. We show that this method is fatally flawed, and along the way we describe how we think about the subjective properties of the priors we use in our Bayes factors:

“…a particular researcher’s subjective prior is of limited use in the context of a public scientific discussion. Statistical analysis is often used as part of an argument. Wielding a fully personal, subjective prior and concluding ‘If you were me, you would believe this’ might be useful in some contexts, but in others it is less useful. In the context of a scientific argument, it is much more useful to have priors that approximate what a reasonable, but somewhat-removed researcher would have in the situation. One could call this a ‘consensus prior’ approach. The need for broadly applicable arguments is not a unique property of statistics; it applies to all scientific arguments. We do not argue to convince ourselves; we should therefore make use of statistical arguments that are not pegged to our own beliefs…

It should now be obvious how we make our ‘Bayesian omelet’; we break the eggs and cook the omelet for others in the hopes that it is something like what they would choose for themselves. With the right choice of ingredients, we think our Bayesian omelet can satisfy most people; others are free to make their own, and we would be happy to help them if we can.”

Our completely open, reproducible manuscript — “Calibrated” Bayes factors should not be used: a reply to Hoijtink, van Kooten, and Hulsker — along with a supplement and R code, is available on github (with DOI!).

On verbal categories for the interpretation of Bayes factors

As Bayesian analysis is becoming more popular, adopters of Bayesian statistics have had to consider new issues that they did not before. What makes a “good” prior? How do I interpret a posterior? What Bayes factor is “big enough”? Although the theoretical arguments for the use of Bayesian statistics are very strong, new and unfamiliar ideas can cause uncertainty in new adopters. Compared to the cozy certainty of \(p<.05\), Bayesian statistics requires more care and attention. In theory, this is no problem at all. But as Yogi Berra said, “In theory there is no difference between theory and practice. In practice there is.”

In this post, I discuss the use of verbal labels for magnitudes of Bayes factors. In short, I don’t like them, and think they are unnecessary.

Bayes factors have many good characteristics, and have been advocated by many to replace \(p\) values from null hypothesis significance tests. Both \(p\) values and Bayes factors are continuous statistics, and it seems reasonable to ask how one should interpret the magnitude of the number. I will first address the issue of how the magnitudes of \(p\) values are interpreted, then move on to Bayes factors for a comparison.

Classical and Frequentist statistics

With \(p\) values this matter is either very difficult or very easy, depending on whether you’re more Fisherian or more Neyman-Pearsonian. Under the Fisherian view, interpretation of the number is difficult. Fisher said, for instance, that:

“Though recognizable as a psychological condition of reluctance, or resistance to the acceptance of a proposition, the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to, and verifiable by, other rational minds. The level of significance in such cases fulfills the conditions of a measure of the rational grounds for the disbelief it engenders…”

“In general tests of significance are based on hypothetical probabilities calculated from their null hypotheses. They do not generally lead to any probability statements about the real world, but to a rational and well-defined measure of reluctance to the acceptance of the hypotheses they test.” (Fisher, in ‘Statistical Methods and Scientific Inference’)

According to Fisher, the worth of a \(p\) value is that it is an objective statement about a probability under the null hypothesis. The strength of the evidence against the null hypothesis, however, is not the \(p\) value itself; it is somehow translated from the “reluctance” engendered by a particular \(p\) value. The definition of a \(p\) value itself, as Fisher points out, does not naturally lead to statements about the world. The problem is immediately obvious. How much reluctance should a rational person feel, based on a \(p\) value? Who decides what is reasonable and what is not? To be clear, these questions are not meant as critiques of Fisher’s viewpoint, with which I sympathize; I only wish to highlight the burden that Fisher’s view of \(p\) values places on the researcher.

From the Neyman-Pearson (and the hybrid NHST) perspective, this particular problem goes away completely. As a side benefit of Neyman’s rejection of epistemology in favor of an action/decision-based view, statistics do not need to have meaning at all. In Neyman’s view, statistical tests are methods of deciding between behaviors, with defined (or in some sense optimal) error rates. A \(p\) value of less than \(0.05\) might, for instance, lead to an acceptance of a particular proposition, automatically. As Neyman says, rejecting both Fisher’s account of scientific inductive reasoning and Jeffreys’ Bayesian account:

[T]he examination of experimental or observational data is invariably followed by a set of mental processes that involve [] three categories…: (i) scanning of memory and a review of the various sets of relevant hypotheses, (ii) deductions of consequences of these hypotheses and the comparison of these consequences with empirical data, (iii) an act of will, a decision to take a particular action.

It must be obvious that…[the] use [of inductive reasoning] as a basic principle underlying research is unsatisfactory. The beliefs of particular scientists are a very personal matter and it is useless to attempt to norm them by any dogmatic formula. (Neyman, 1957)

Neyman is not suggesting that statistics is completely automatic – after all, one needs to choose one’s decision rules, according to what suits one’s goals – but the interpretation of the magnitude of \(p\) values in relation to rational belief is irrelevant to Neyman. The worth of a \(p\) value (or other statistic) is in the decision it determines. The meaning of the number itself is unimportant, and may not even exist.

Today \(p\) values are used opportunistically in both ways. Most people do not know what a \(p\) value is, but they can tell you two things:

  1. “When \(p\) is less than 0.05, I reject the null hypothesis.” (Neyman-Pearson)
  2. “When \(p\) is very small, it provides a lot of evidence against the null hypothesis.” (Fisher)

These days, it is well known that \(p\) values do not serve Fisher’s goals particularly well. Even Fisher did not provide a formal link between \(p\) values and any sort of rational belief, and as it happens no such link exists. So if one is to use \(p\) values, one is left with only Neyman’s decision-based account. The \(p\) value is uninterpretable (except trivially, via its definition), but as a decision criterion this isn’t as much of a worry.

The use of such criteria is comforting. It provides one fewer thing to argue over; if \(p<0.05\), then one can no longer argue that the effect isn’t there. If research is a game, these sorts of rules provide people with a sense of fairness. Whatever else happens, we all collectively agree that we will not doubt one another’s research on the grounds that there is not enough evidence for the effect. The way \(p\) values are used, \(p<0.05\) means there is “enough” evidence, by definition.

Bayesian statistics

Researchers who have adopted Bayesian statistics encounter practical hurdles that they did not have previously. Priors require some care to develop and use, but there is no clear analog in classical statistics, except for perhaps the determination of an alternative for power calculation in the Neyman-Pearson paradigm. Likewise, Bayes factors, as changes in relative model odds, have no clear analog.
The closest thing to a Bayes factor in classical statistics is a \(p\) value, but in truth the only similarity is that they are both interpreted in terms of evidential strength. As I outlined in a previous post, a Bayes factor is two things:

  1. A Bayes factor is the probability (or density) of the observed data in one model compared to another.
  2. A Bayes factor is the relative evidence for one model compared to another.

Part of the elegance of Bayes factors is that these two things are the same; a model is preferred in direct proportion to the degree to which it predicted the observed data.
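A toy sketch may help show why the two descriptions coincide. The coin-flip models, the data (7 heads in 10 flips), and the prior odds below are all hypothetical choices of mine, picked only to keep the arithmetic short. The sketch computes the ratio of the probabilities each model assigns to the data, then separately applies Bayes' theorem and checks that the odds changed by exactly that ratio.

```python
from math import comb

def binom_prob(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

k, n = 7, 10                       # hypothetical data: 7 heads in 10 flips
like_a = binom_prob(k, n, 0.5)     # Model A predicts a fair coin
like_b = binom_prob(k, n, 0.7)     # Model B predicts a biased coin

# Description 1: ratio of how well each model predicted the observed data
bayes_factor = like_b / like_a

# Description 2: the factor by which the data change the model odds
prior_odds = 0.25                  # B to A, an arbitrary starting point
prior_b = prior_odds / (1 + prior_odds)
prior_a = 1 - prior_b
post_b = prior_b * like_b / (prior_b * like_b + prior_a * like_a)
posterior_odds = post_b / (1 - post_b)

print(bayes_factor, posterior_odds / prior_odds)
```

The two printed numbers agree whatever prior odds are plugged in, which is the sense in which the Bayes factor is the evidence contributed by the data alone.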

When they encounter Bayes factors, researchers familiar with \(p\) values often ask “How big is big enough?” or “What is a ‘big’ Bayes factor?” Various proponents of Bayes factors have recommended scales that give interpretations of various sizes of Bayes factors. For instance, a Bayes factor of 4 is interpreted as “substantial” evidence by Jeffreys.

Although it is common practice, in my view there are numerous problems with the practice of assigning verbal labels to sizes of Bayes factors. They are not needed, and they actually distort the meaning of Bayes factors.

As noted previously, \(p\) values do not have a ready interpretation without a label like “statistically significant,” which means no more and no less than “I choose to reject the null hypothesis.” Bayes factors, on the other hand, do not require any such categorization. They are readily interpretable as either ratios of probabilities, or changes in model odds. They yield the amount of evidence contributed by the data, on top of what was already known (the priors). As Kass and Raftery state:

Probability itself provides a meaningful scale defined by betting, and so these categories are not a calibration of the Bayes factor but rather a rough descriptive statement about standards of evidence in scientific investigation. (Kass and Raftery, 1995, emphasis mine)

Although Kass and Raftery are often cited as recommending guidelines for interpreting the Bayes factor, they did not. They offered a description of scientific behavior, and how they thought this mapped onto Bayes factors. They were neither interpreting the Bayes factor nor were they offering normative guidelines for action on the basis of evidential strength. If Kass and Raftery were wrong about scientific behavior (after all, they did not offer any evidence for their description), if scientific behavior were to change, or if one were to consider another area besides scientific investigation, these numbers would not serve.

But even if Bayes factors do not need to be interpreted, perhaps it might be good to have the verbal categories anyway. I do not think so, for several reasons.

My first objection is that words mean different things to different people, and meanings change over time. Take, for instance, Jeffreys’ category “substantial” for Bayes factors of between 3 and 10. This is less evidence than Jeffreys’ category of “strong”, which runs from 10 to 30. This seems strange, because the definition of “substantial” in modern use is “of considerable value.” How are “substantial” and “strong” different? Couldn’t we reverse these labels and have just as good a scale?

I believe the answer to this puzzle is that a less common use of “substantial” is “has substance.” For instance, I may say that I thought an argument is “substantial”, but this does not necessarily mean that I think the argument is strong, but simply means it is not trivially wrong. Put another way, it means that I did not think the argument was insubstantial. This is, I believe, what Jeffreys meant. But why should my evaluation of the strength of evidence depend on my knowledge of uncommon uses of common words? Would someone who did not know the less common use of “substantial” take a different view of the evidence, simply because we read different books or used different dictionaries?

Consider also Wetzels and Wagenmakers’ (2012) replacement of Jeffreys’ “not worth more than a bare mention” with “anecdotal”. Anecdotal evidence has a specific meaning; it is not simply weak evidence. I could have a well-designed, perfectly controlled experiment that nonetheless produces weak evidence for differentiating between two hypotheses of interest. This does not mean that my evidence is anecdotal. Anecdotal evidence is second-hand evidence that has not been substantiated and does not derive from well-controlled experiments.

Here we see the major problem: the use of these verbal labels smuggles arbitrary meaning into the judgement where none is needed. These meanings differ across people, times, and fields. Using such labels adds unnecessary – and perhaps incorrect – baggage to the interpretation of the results of an experiment.

The second objection is that the evaluation of what is “strong” evidence depends on what is being studied, and how. Is 10 kilometers a long way? It is if you’re walking, it isn’t if you’ve just started a flight to Australia. In a sense, Bayes factors are the same way; if we’re claiming something mundane and plausible, a Bayes factor of 10 may be more than enough. If we’re claiming something novel and implausible, a Bayes factor of 10 may not even be a start. Extraordinary claims, as the saying goes, demand extraordinary evidence; we would not regard the same level of evidence as “strong” for all claims.
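The dependence on context can be made concrete with a small sketch. The prior odds below (even odds for a mundane claim, 1-to-1000 for an implausible one) are illustrative assumptions, not values from any study, and the helper names are hypothetical.

```python
def posterior_odds(bayes_factor, prior_odds):
    """Posterior odds are the prior odds multiplied by the Bayes factor."""
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    """Convert odds in favor of a claim to a probability."""
    return odds / (1.0 + odds)

bf = 10.0
mundane = posterior_odds(bf, prior_odds=1.0)           # plausible claim, even prior odds
implausible = posterior_odds(bf, prior_odds=1 / 1000)  # novel claim, long prior odds

print(odds_to_probability(mundane))      # about 0.91
print(odds_to_probability(implausible))  # about 0.01
```

The same Bayes factor of 10 leaves the mundane claim quite probable and the implausible claim still very improbable, which is why a single label such as “strong” cannot fit both cases.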

The third objection is related to the second, and that is that providing evidential categories allows the researcher to shirk their responsibility for interpreting the strength of the evidence. We do not allow this in other settings. When reviewing a paper that contains evidence for some claim, for instance, it is our duty to evaluate the strength of the evidence in context. We do not, and cannot, demand from editors “standard” papers that we all agree are “strong” evidence; such standard papers do not exist. Providing normative guidelines such as “A Bayes factor of 15 is strong evidence,” though comforting, asks researchers to stop thinking in ways that we would not allow in other contexts. They impose an arbitrary, unjustified homogeneity on judgments of evidential strength.

Finally, a fourth objection is that verbal categories provide the illusion of understanding. Being able to say “A Bayes factor of 3 means that there is anecdotal evidence,” may give a researcher comfort, but does not ultimately show any understanding at all. This provides a dangerous fluency effect, because fluency has been consistently shown to cause people to misjudge their knowledge. Because categories are not actually necessary for the interpretation of Bayes factors, giving them illusory fluency using the labels is likely to hinder, not help, their understanding.

Do people “need” categories?

All of the previous arguments may be admitted, and yet one might argue that they are substantially weakened by a single fact: that Bayes factors cannot be understood by researchers without them. People cannot think without categories, and so if we do not provide them, people will not be able to interpret Bayes factors.

I think this is self-evidently wrong. It at least requires some sort of evidence to back it up. The use of Bayes factors by researchers is in its early years, and we do not yet know how well people can interpret them in practice.

As evidence that the claim that people need categories for Bayes factors is wrong, one may point to other related quantities for which we do not provide verbal categories. The most obvious is probability. When we teach probability in class, we do not give guidelines about what is a “small” or “large” probability. If a student were to ask, we would naturally say “it depends.” A probability of 1% is small if it represents the probability that we will win a game of chess against an opponent; it is large if it is the probability that we will be killed in an accident tomorrow.

For other similar quantities, too, we do not offer verbal categories. Odds ratios and relative risk are closely related to the Bayes factor, and yet they are used by researchers all the time without the need for contextless categories.

It is often the case that students (or researchers) are unsure about probability. Although verbal categories are never (that I know of) advocated for alleviating misunderstandings or lack of certainty about probability – and rightly so – there are other ways of helping students understand probability. Gerd Gigerenzer’s work, in particular, has shown that certain visualizations help students understand, and make use of, probabilities. A similar evidence-based tack can, and should, be taken with Bayes factors. We know a lot about how to teach people about probability, so we should apply that knowledge.

As argued previously, it is possible that through the illusion of fluency, categories may actually harm people’s understanding. It would be better to address the root of the problem rather than providing quick fixes for people’s discomfort with new methods. The quick fixes may actually backfire.

Bayes factors as decision statistics

It has been suggested that cut-offs on the Bayes factors are sometimes useful; in particular, when used to stop collecting data. This is a completely different issue from the one addressed above. A rule for behavior does not need an interpretation, and furthermore, the interpretation of a Bayes factor does not depend on the stopping rule. Such a rule is merely a practicality, and there is nothing wrong with using such rules if they are needed.

As an example, I may have a rule for stopping eating, but this is a completely separate question from whether I would judge how much I ate to be “a lot”. I do not need the rule to say I ate a lot, and following such a rule does not make what I ate any more or any less. I might choose such a rule based on what I thought “a lot” was, but the concept of “a lot” is prior to the rule.

In the case of the Bayes factor, such a decision criterion is actually only useful in light of prior odds. We should choose such a criterion such that a Bayes factor that exceeds a particular threshold is likely to convince most people; that is, that it is large enough to overcome most people’s biases. Bayes factors in research are used in arguments made for other researchers’ benefit; if we end sampling before we have achieved a level of evidence that would overcome others’ prior odds, then we have not done enough sampling. Convincing ourselves is not the goal of research, after all. This should make it obvious why even a rule for stopping depends on context, because the context helps us know what a useful amount of evidence is.

It should also make clear that the Bayes factor is not really the useful decision statistic; rather, the posterior odds are. If an experiment is expensive but would not achieve the levels of evidence necessary to change people’s minds, achieving a “strong” Bayes factor is irrelevant.
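The stopping rule discussed in this section can be sketched as follows. Everything specific here is an assumption made for illustration: the data are normal with known variance 1, the alternative places a Normal(0, 1) prior on the mean, the skeptic's prior odds are 1-to-5, and the target posterior odds are 4-to-1. The function names are hypothetical, and this is not a recommended design, only a picture of how a threshold tied to prior odds would operate.

```python
import random
from math import sqrt
from statistics import NormalDist

def bf10(ybar, n, prior_var):
    """Bayes factor for M1: mu ~ Normal(0, prior_var) against M0: mu = 0,
    when Y_i | mu ~ Normal(mu, variance 1) and ybar is the mean of n observations."""
    m1 = NormalDist(0.0, sqrt(prior_var + 1.0 / n)).pdf(ybar)  # prediction under M1
    m0 = NormalDist(0.0, sqrt(1.0 / n)).pdf(ybar)              # prediction under M0
    return m1 / m0

def sample_until(threshold, true_mu, prior_var=1.0, max_n=500, seed=1):
    """Collect simulated observations until the Bayes factor passes the
    threshold in either direction, or until max_n observations are used."""
    rng = random.Random(seed)
    total = 0.0
    bf = 1.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_mu, 1.0)
        bf = bf10(total / n, n, prior_var)
        if bf >= threshold or bf <= 1.0 / threshold:
            return n, bf
    return max_n, bf

# Threshold chosen so that a skeptic with prior odds of 1-to-5 ends at 4-to-1 or better
skeptic_prior_odds = 1 / 5
target_posterior_odds = 4.0
threshold = target_posterior_odds / skeptic_prior_odds

n_used, final_bf = sample_until(threshold, true_mu=0.5)
print(n_used, final_bf, final_bf * skeptic_prior_odds)
```

The rule only says when to stop; the final Bayes factor is read the same way as any other, and the last printed number (the skeptic's posterior odds) is the quantity that actually bears on whether enough data were collected.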


This turned out to be a much lengthier post than I anticipated, but summarizing it is easy: although \(p\) values need categories or criteria to be interpreted, Bayes factors do not. They have a natural interpretation that directly connects evidence with changes in odds. Furthermore, the use of verbal category labels for Bayes factors is misleading and potentially harmful to learners of Bayesian methods. Teachers of Bayesian statistics should focus on ways of visualizing Bayes factors to help people understand, rather than using the “short-cut” of verbal categories.