Why would you want to use Bayesian statistics? Some basics about Bayesian stats vs traditional stats
rtxi89 left a comment on my last post asking for a full explanation of Bayesian statistics. It's true, many people have heard of Bayesian statistics, but few know much about them -- and they are definitely not common in the field of psychology at present. Now, I am nowhere near an expert in Bayesian stats, but I can definitely give some reasons as to what they are and how you could use them in your analyses.
What are Bayesian statistics?
I will first say that Kevin Boone's Bayesian Statistics for Dummies page is a fantastic primer on Bayes' theorem, the bare basics of what Bayesian statistics are, and their power. I highly recommend reading that page!
For those who don't want to read his page, here's a summary adapted from Wagenmakers' 2007 paper.
Bayesian statistics compute probabilities and use them to reduce uncertainty. Ideally, the more data or information you have, the less uncertain you will be about your prediction. Seeing something happen once leaves greater uncertainty than seeing the same thing happen 100 times. The more data observed, the more precise the probabilities for your alternative and null hypotheses become.
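To sketch that idea in code: the little Beta-Binomial model below is a standard textbook illustration (my choice, not something specific to Wagenmakers' paper). Starting from a flat prior, each new observation tightens the posterior distribution -- the variance after 100 observations is a tiny fraction of the variance after one.

```python
# Hypothetical illustration: repeatedly observing the same event succeed.
# With a uniform Beta(1, 1) prior on the event's probability, the data
# update the posterior to Beta(1 + successes, 1 + failures).

def beta_posterior(successes, failures, a=1.0, b=1.0):
    """Return (mean, variance) of the Beta posterior after the data."""
    a_post, b_post = a + successes, b + failures
    total = a_post + b_post
    mean = a_post / total
    var = (a_post * b_post) / (total ** 2 * (total + 1))
    return mean, var

# Seeing something happen once vs. seeing it happen 100 times:
mean_1, var_1 = beta_posterior(successes=1, failures=0)
mean_100, var_100 = beta_posterior(successes=100, failures=0)

print(var_1 > var_100)  # True: far less uncertainty after 100 observations
```

The point is only the direction of the change: more data, less spread in what you believe about the underlying probability.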
This differs from traditional statistics: you compare two (or possibly more) probabilities to decide whether or not to reject a null hypothesis, as opposed to trying to reject a hypothesis based on a single probability (the p-value, with its usual threshold of .05 -- roughly, the probability of seeing data this extreme if the null hypothesis were true). Rather than saying "there is less than a 5% chance of seeing these data under the null hypothesis," Bayesian statistics can lead to a statement like "the likelihood of our alternative hypothesis is X%." That's what draws some people to Bayesian statistics -- a more communicable result that doesn't require a college statistics course to understand.
Bayesian stats versus traditional stats
Traditional statistical methods, known as null hypothesis significance testing (NHST), rely on calculating the probability of observing your data (or something more extreme) if the null hypothesis were true -- this probability is called the p-value. The origin of the p-value and null hypothesis testing dates back to the 1920s, when Ronald Fisher famously suggested p < .05 as a good general cut-off point for testing against a null hypothesis. As it is, this particular cut-off point has not changed in almost 100 years -- just about all research I've ever read abides by the convention that p < .05 indicates statistical significance.
Along the way, people have argued that p < .05 is not a strong enough indicator of statistical significance. Even Fisher stated that p < .05 may be too lenient for some tests, and that a lower cut-off point would be better for testing against the null hypothesis. Others have factored in different metrics to determine "how significant" their findings are: confidence intervals, regression/correlation coefficients, and measures of effect size. These all come in to support the gold-standard p-value.
NHST has a list of reasons why it might not be reliable. Even in the 1920s, around the invention of the p-value, researchers Neyman and Pearson described two types of errors: rejecting the null hypothesis even though it is true (a type I error, or false positive) and failing to reject the null hypothesis even though it is false (a type II error, or false negative). Type II errors commonly happen when an experiment is under-powered; conversely, a significant result in a small sample that disappears when you scale the experiment up to a larger sample size was likely a type I error. The allure of a significant p-value tempts some researchers to publish before any actual meaningful evidence is found.
Maybe the biggest argument Bayesian statisticians have against NHST is that you only test against the null hypothesis, and then plainly assume any difference is because of the alternative hypothesis. But what if the difference is due to neither the null model nor the alternative model? For example, the alternative hypothesis "if my window is wet, it is due to the weather" has the null hypothesis "if my window is wet, the weather was not a factor." Finding statistical significance in favor of the alternative hypothesis says "since the null hypothesis is wrong, my window must be wet due to weather!" when the actual conclusion should be "since the null hypothesis is wrong... the null hypothesis is wrong." My window could be wet due to weather, or sprinklers, or water balloons, and so on.
In reality, we are only testing the null hypothesis and not considering the alternative hypothesis at all. Leaping from a p-value straight to a specific alternative is a logical fallacy: you can't say "my window isn't dry, therefore it must be weather," because it could be anything. Bayesian statistics compare two hypothesis models directly, so you can see whether an effect is actually due to the alternative hypothesis -- or perhaps due to neither of the models you've tested. Comparing both models at the same time helps prevent that fallacious leap. "My window isn't dry, and it is raining today, therefore the likelihood that my window is wet due to rain is 90%." This statement is much more accurate given the evidence, and there are no large jumps from statistical output to communication. "90% chance my window is wet because of rain" is fairly straightforward. "There is less than a 5% chance my window would be this wet if weather were not a factor" is less straightforward and doesn't describe the entire situation.
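That 90% statement can be reproduced with Bayes' theorem directly. The numbers below are entirely made up for illustration (rain half the time; rain wets the window 90% of the time; something else -- sprinklers, water balloons -- wets it 10% of the time otherwise):

```python
# Made-up priors and likelihoods for the window example.
p_rain = 0.5               # assumed prior: it rains half the time
p_wet_given_rain = 0.9     # rain wets the window 90% of the time
p_wet_given_no_rain = 0.1  # sprinklers, water balloons, etc.

# Bayes' theorem: P(rain | wet) = P(wet | rain) * P(rain) / P(wet)
p_wet = (p_wet_given_rain * p_rain
         + p_wet_given_no_rain * (1 - p_rain))
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet

print(round(p_rain_given_wet, 2))  # 0.9 -- "90% chance it's wet due to rain"
```

Change the assumed numbers and the posterior changes with them -- which is the whole point: the evidence and the prior beliefs both show up explicitly in the conclusion.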
The thing I've found most helpful about the Bayesian approach is the resulting statistic, the Bayes factor (BF): a ratio of how likely the observed data are under one model versus the other. A p-value only tells you how compatible your data are with the null hypothesis -- a BF tells you how well both the null and alternative hypotheses account for your data.
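To make the ratio concrete, here's a minimal sketch of a BF for the simplest case I can think of: coin-flip data, comparing a point null (p = 0.5) against an alternative that spreads a uniform prior over p. The marginal likelihood of 1/(n+1) under a uniform prior is a textbook Beta-Binomial result; this is an illustration of the idea, not the BF any particular software package reports.

```python
from math import comb

def bayes_factor_binomial(k, n):
    """BF10 for k successes in n trials.
    H0: a point null at p = 0.5.
    H1: a uniform prior over p (marginal likelihood = 1 / (n + 1))."""
    m1 = 1 / (n + 1)             # P(data | H1)
    m0 = comb(n, k) * 0.5 ** n   # P(data | H0)
    return m1 / m0

# 80 heads in 100 flips: the data overwhelmingly favor the alternative.
print(bayes_factor_binomial(80, 100))
# 50 heads in 100 flips: the data actually favor the point null (BF10 < 1).
print(bayes_factor_binomial(50, 100))
```

Notice that the BF is symmetric in spirit: it can come out in favor of the null just as easily as the alternative, which a p-value can never do.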
Bayesian stats: a supplement like reporting effect size or its own beast?
Bayesian statistics can either be seen as a replacement for NHST or can be seen as a supplement.
Let's talk supplement first. If you find statistical significance, you can also report your BF to say, specifically, how strongly the data favor the alternative hypothesis over the null. This sort of reporting is similar to what already happens with R-squared, eta-squared, or Cohen's d. There are loosely defined benchmarks for these effect-size measures that change depending on the field you research in (e.g. r = .4 could be considered a moderate effect in some fields, but weak in others). With a BF, you can say the data are X times more likely under the alternative hypothesis than under the null.
However, what about those NHST statistics that look like type II errors -- that come out non-significant but may have been erroneously dismissed? In NHST, those statistics are usually never reported. Bayesian statistics may be able to breathe new life into "borderline significance" stats. A result of p = .051, for instance, doesn't qualify as significant. But a Bayesian analysis could show that the same result favors the alternative hypothesis with a probability of 75%. What do you do then?
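For what it's worth, a 75%-style figure comes from converting a BF into a posterior model probability. Assuming equal prior odds on the two hypotheses (an assumption, and one you should state when reporting), the conversion is just BF / (1 + BF), so a BF10 of 3 works out to exactly 75% in favor of the alternative:

```python
def posterior_prob_h1(bf10, prior_odds=1.0):
    """Posterior probability of H1 given a Bayes factor BF10.
    prior_odds = P(H1) / P(H0); 1.0 means both models equally likely a priori."""
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1 + posterior_odds)

print(posterior_prob_h1(3.0))  # 0.75 -- BF10 = 3 means 75% in favor of H1
print(posterior_prob_h1(1.0))  # 0.5  -- BF10 = 1 means the data are neutral
```

The `prior_odds` argument is where a skeptical reader can push back: with prior odds stacked against the alternative, the same BF yields a much lower posterior probability.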
If you choose to report Bayesian statistics as a supplement, you still play by the p-value rules: report the significant findings, mention null results in passing (if at all). However, if you replace NHST with Bayesian statistics entirely, you open your data analysis up to promoting borderline-significant results back into significant ones. On the other hand, effects that are significant via NHST may come up as merely anecdotal evidence in Bayesian analyses.
A review on reporting Bayesian statistics favors the idea of reporting p-values and BFs alongside each other, including data that are BF-significant but not NHST-significant. I've tried to implement this particular reporting in my most recent results section -- it inflated from 2 pages to 5 pages and a few lines. I tried editing it down and giving the Bayesian analysis a section of its own, and it still came to 3 pages and some left over.
I have personally started to report NHST results with the BF and the Bayesian probability. To be specific, for my latest manuscript I'm reporting an F-value, p-value, partial eta-squared, BF, and the probability in favor of the alternative hypothesis. This adds a few extra words to every finding ("The Bayesian analysis favors the alternative hypothesis, with the likelihood of this model being X%"), but that's not a bad price to pay to describe your study results more accurately.
The problems with Bayesian stats?
Like NHST, Bayesian statistics are not without fault. Bowers and Davis published an article critical of Bayesian inference in 2012 -- two of their points being that Bayesian modeling allows for too much variability (flexibility in the choice of priors and models) and makes too few firm claims about the data. Because of these faults, Bayesian statistics are no less exploitable than NHST. This is certainly true when performing a Bayesian repeated-measures ANOVA: depending on how you factor in your number of observations, your BF can change dramatically. And currently, there is no consensus on which n to use (I wrote about that a little here).
However, a counter to many of the arguments against Bayesian statistics comes down to one keyword: practicality. All statistical analyses have cons; what matters in the end is how applicable or practical the results are, not the style in which we choose to illustrate them. One thing Bayesian stats does well is communicate a likelihood, and the confidence in that likelihood, more easily than a p-value plus an effect size can. Again, as Bowers and Davis say multiple times in their review, Bayesian statistics aren't doing anything new -- so we shouldn't see them as a p-value killer. At least, not yet.