Statistical Methods

What is Bayesian A/B testing?

Bayesian A/B testing is an approach to experiment analysis that starts with a prior belief about the treatment effect and updates that belief using observed data, producing a posterior distribution that represents the range of plausible effect sizes given the evidence. Instead of a p-value, the output is typically a probability statement: "there's a 94% probability that the treatment is better than control."

Bayesian methods have genuine strengths. They produce intuitive probability statements that match how most stakeholders naturally think about uncertainty. They allow you to incorporate prior knowledge formally. And they handle optional stopping more naturally than fixed-horizon frequentist tests, because the posterior is valid at any point during data collection. That said, the practical differences between Bayesian and frequentist approaches are smaller than the methodological debate suggests, especially for the large-sample product experiments that make up most A/B testing.

How does Bayesian A/B testing work?

The mechanics follow Bayes' theorem. You start with a prior distribution that encodes what you believed about the treatment effect before seeing any data. As experiment data arrives, the prior is updated into a posterior distribution that combines the prior belief with the observed evidence.

For a binary metric like conversion rate, a common setup uses a Beta prior. If the control group converts at 5.0% and the treatment group at 5.3%, the posterior distribution for the treatment effect incorporates both the observed difference and whatever the prior specified. From the posterior, you can compute quantities like "the probability that the treatment effect is positive" or "the expected loss from choosing the wrong variant."
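
To make the mechanics concrete, here is a minimal sketch of that computation. The counts, group sizes, and the uniform Beta(1, 1) prior are hypothetical, chosen only to match the 5.0% vs 5.3% example; with a Beta prior, the conjugate update reduces to adding observed successes and failures to the prior's parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 10,000 users per group, matching the 5.0% vs 5.3% example.
control_conv, control_n = 500, 10_000
treatment_conv, treatment_n = 530, 10_000

# Uniform Beta(1, 1) prior; the conjugate posterior is Beta(1 + successes, 1 + failures).
control_post = rng.beta(1 + control_conv, 1 + control_n - control_conv, size=1_000_000)
treatment_post = rng.beta(1 + treatment_conv, 1 + treatment_n - treatment_conv, size=1_000_000)

# Monte Carlo estimate of "the probability that the treatment effect is positive".
p_better = (treatment_post > control_post).mean()
print(f"P(treatment > control) = {p_better:.3f}")
```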

The choice of prior matters in theory. In practice, for experiments with thousands or millions of users, the data overwhelms any reasonable prior. A Beta(1, 1) uniform prior and a Beta(100, 1900) informative prior will produce nearly identical posteriors once you have 50,000 users per group.
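
A quick sketch shows how little the prior matters at that scale. The counts below are hypothetical, and both priors are updated with the same data:

```python
from scipy import stats

# Hypothetical data: 50,000 users, 2,650 conversions (5.3%).
conversions, n = 2_650, 50_000

priors = {"Beta(1, 1) uniform": (1, 1), "Beta(100, 1900) informative": (100, 1900)}
for name, (a, b) in priors.items():
    posterior = stats.beta(a + conversions, b + n - conversions)
    lo, hi = posterior.ppf([0.025, 0.975])
    print(f"{name}: mean {posterior.mean():.4f}, 95% interval ({lo:.4f}, {hi:.4f})")
```

Both posteriors land on essentially the same mean and interval, which is the sense in which the data overwhelms the prior.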

Why doesn't Confidence offer Bayesian analysis?

Confidence deliberately does not include a Bayesian analysis option. The reasoning is grounded in what the manifesto calls "simplicity at scale."

For the typical product experiment (large sample, weak or uninformative prior, continuous or binary metric), conjugate-prior Bayesian implementations are effectively equivalent to z-tests. The posterior credible interval and the frequentist confidence interval converge to the same numbers, and with a flat prior the "probability of being better" is a monotonic transformation of the one-sided p-value. The two frameworks give the same answer in different notation.
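
The following sketch illustrates the equivalence numerically. All counts are hypothetical; the comparison assumes flat Beta(1, 1) priors and a one-sided, unpooled two-proportion z-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 50,000 users per group.
x_c, n_c = 2_500, 50_000  # control, 5.0%
x_t, n_t = 2_650, 50_000  # treatment, 5.3%

# Frequentist side: one-sided two-proportion z-test with unpooled variance.
p_c, p_t = x_c / n_c, x_t / n_t
se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
p_one_sided = 1 - stats.norm.cdf((p_t - p_c) / se)

# Bayesian side: P(treatment > control) under flat Beta(1, 1) priors.
post_c = rng.beta(1 + x_c, 1 + n_c - x_c, size=1_000_000)
post_t = rng.beta(1 + x_t, 1 + n_t - x_t, size=1_000_000)
p_better = (post_t > post_c).mean()

print(f"1 - one-sided p-value:  {1 - p_one_sided:.4f}")
print(f"P(treatment > control): {p_better:.4f}")  # nearly identical
```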

Adding a Bayesian option alongside the frequentist default means every experimenter has to choose which framework to use, understand the difference well enough to interpret results correctly, and defend the choice to stakeholders. For most teams, this additional complexity doesn't improve the quality of evidence produced. It increases the surface area for confusion.

Spotify arrived at this position after fifteen years of running experiments at scale. The experiments where Bayesian and frequentist methods disagree are exactly the experiments where sample sizes are small enough that the prior drives the result. In those cases, the quality of the prior matters enormously, and most product teams don't have genuinely informative priors. They have vague defaults that look like priors but carry no real information.

When do Bayesian methods genuinely add value?

Bayesian methods earn their complexity in specific settings.

Small samples with genuine prior information. If you're running an experiment on a niche population (enterprise customers, a specific market) and you have strong prior data from similar past experiments, an informative prior meaningfully improves your estimates. The prior does real work here because the data alone isn't sufficient.
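
A small illustration of the prior doing real work: suppose past experiments on the same population suggest a conversion rate near 5%, encoded here as a hypothetical Beta(100, 1900) prior, and the new experiment has only 400 users:

```python
from scipy import stats

# Hypothetical small experiment: 400 users, 30 conversions (a noisy 7.5%).
conversions, n = 30, 400

priors = {"Beta(1, 1) uniform": (1, 1), "Beta(100, 1900) informative": (100, 1900)}
for name, (a, b) in priors.items():
    posterior = stats.beta(a + conversions, b + n - conversions)
    print(f"{name}: posterior mean {posterior.mean():.4f}")
```

The informative prior pulls the noisy observed rate back toward what past experiments suggest; with 50,000 users it would barely move it.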

Decision-theoretic frameworks. When you need to minimize expected loss rather than control error rates, Bayesian decision theory provides a natural framework. "The expected revenue loss from choosing the wrong variant is $200/day" is a more actionable statement than "p = 0.07" for some business contexts.
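
A sketch of that expected-loss computation from posterior samples; the posterior counts, daily traffic, and revenue-per-conversion figures are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posteriors from a finished experiment (Beta counts are illustrative).
post_c = rng.beta(1 + 500, 1 + 9_500, size=1_000_000)  # control
post_t = rng.beta(1 + 530, 1 + 9_470, size=1_000_000)  # treatment

# Expected loss of a decision: the conversion rate you give up, on average,
# in the posterior scenarios where that decision turns out to be wrong.
loss_ship_treatment = np.maximum(post_c - post_t, 0).mean()
loss_ship_control = np.maximum(post_t - post_c, 0).mean()

# Translate into dollars per day with hypothetical traffic and order value.
daily_users, value_per_conversion = 100_000, 20.0
to_dollars = daily_users * value_per_conversion
print(f"Expected loss if we ship treatment: ${loss_ship_treatment * to_dollars:,.0f}/day")
print(f"Expected loss if we ship control:   ${loss_ship_control * to_dollars:,.0f}/day")
```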

Multi-armed bandit problems. Thompson sampling and other Bayesian bandit algorithms use posterior distributions to balance exploration and exploitation. This is a genuinely different use case from A/B testing: you're optimizing a single metric in real time rather than making a multi-metric product decision.
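
A minimal Thompson sampling sketch for two Bernoulli arms (true rates, priors, and horizon are all hypothetical) shows the mechanism: each round samples a rate from every arm's posterior and plays the arm with the highest draw, so traffic gradually concentrates on the better arm as its posterior tightens:

```python
import numpy as np

rng = np.random.default_rng(2)

true_rates = np.array([0.050, 0.053])  # unknown to the algorithm
successes = np.ones(2)                 # Beta(1, 1) prior for each arm
failures = np.ones(2)

for _ in range(100_000):
    draws = rng.beta(successes, failures)    # one posterior sample per arm
    arm = int(np.argmax(draws))              # play the arm with the highest draw
    reward = rng.random() < true_rates[arm]  # simulate a conversion
    successes[arm] += reward
    failures[arm] += 1 - reward

pulls = successes + failures - 2
print(f"Share of traffic sent to the better arm: {pulls[1] / pulls.sum():.1%}")
```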

The honest assessment: for a team running standard product experiments on tens of thousands of users with no strong prior information, the choice between Bayesian and frequentist analysis is mostly a notational preference. The results will be equivalent. Confidence chooses the frequentist notation because it integrates more naturally with sequential testing, multiple testing corrections, and the rest of the statistical stack.