We chose not to · RFP Series

How to Write an Experimentation Platform RFP for Bayesian Inference


"Bayesian A/B testing" appears on nearly every experimentation platform's feature list. The label carries weight: Bayesian methods promise a natural interpretation of probability, coherent updating of beliefs, and freedom from the peeking problem that plagues naive frequentist testing. Those promises are real, in principle. In practice, most vendor implementations carry the label without delivering any of them.

The problem is not that Bayesian inference is wrong. It is that the label "Bayesian" does not tell you what stopping rule is used, what error rate guarantees you get, whether the prior is specified and calibrated, or whether the implementation connects to sample size planning. A flat prior produces results that are algebraically equal to a frequentist z-test. "P(A beats B) = 0.96" and "p = 0.04" are the same number with different wording. The Bayesian framing changes nothing about the statistical procedure: no shrinkage, no false discovery rate control (the expected fraction of declared positive results that are actually null effects), no new error rate guarantee. Yet the platform calls it Bayesian, the experimenter believes they are getting something their standard tools do not offer, and optional stopping inflates false positives through exactly the same mechanism as before.

Should you add Bayesian inference to your experimentation platform requirements?

For most organizations, no. The one scenario where Bayesian inference adds genuine value over frequentist methods is when a calibrated empirical Bayes prior is in play: shrinkage toward realistic effect sizes and false discovery rate calibration that frequentist methods cannot match. Without that prior, the difference between the two frameworks is close to zero. A flat prior produces results algebraically equal to a frequentist z-test. The label changes. The output does not.
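
To make the equivalence concrete, here is a minimal sketch with made-up numbers: under a flat prior and a normal approximation, the posterior probability that B beats A is always exactly one minus the one-sided frequentist p-value.

```python
# Minimal sketch of the flat-prior equivalence (hypothetical numbers):
# the flat-prior posterior P(B > A) equals 1 minus the one-sided p-value.
from scipy.stats import norm

diff = 0.02      # observed lift (hypothetical)
se = 0.0114      # standard error of the lift (hypothetical)

z = diff / se
p_one_sided = 1 - norm.cdf(z)    # frequentist one-sided p-value
p_b_beats_a = norm.cdf(z)        # flat-prior posterior P(B > A)

print(f"one-sided p  = {p_one_sided:.3f}")    # 0.040
print(f"P(B beats A) = {p_b_beats_a:.3f}")    # 0.960, i.e. exactly 1 - p
```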

Building an empirical Bayes prior requires large historical experiment data sets with high-quality metadata, organized by well-defined programs, and a corpus of at least 200 experiments per program. New programs do not have this corpus. Most organizations are constantly spinning up new ones. The result is a platform that must alternate inference modes as programs mature: new programs run without the prior, established ones run with it, and the guarantees change depending on which mode is active. Experimenting teams cannot build stable intuition about what their results mean when the interpretation depends on how old the program is.

At Spotify, we have among the best experiment data and metadata in the industry. We still do not see enough added value in Bayesian inference without the empirical Bayes part. The priors are too hard to support at scale, they do not work for new programs, and without them the incremental value over frequentist methods is negligible. We chose to invest in making our frequentist offering as connected and consistent as possible rather than adding a Bayesian mode that would deliver less.

If your organization has the historical corpus, the metadata infrastructure, and can find a vendor that helps you estimate and update priors, Bayesian inference is worth evaluating. The questions below will tell you whether a vendor's implementation delivers on the promise. No vendor we reviewed can answer every one.

What your RFP should ask instead of "Bayesian: yes or no?"

Eight questions decide whether a Bayesian implementation delivers on the promise of the label.

First: is the stopping rule specified, and does it control the false positive rate? The stopping rule is the most important question on the list. A posterior probability threshold ("stop when P(B > A) > 0.95") without a calibrated prior inflates the false positive rate by the same amount as naive peeking at a frequentist p-value. Our simulations show more than sixfold inflation under a flat prior: a false positive rate of 0.303 versus a controlled baseline of 0.047. A matched informative prior reduces this to 0.256, still more than fivefold the baseline and still uncontrolled.
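
A small Monte Carlo makes the mechanism visible. This is an illustrative sketch, not the simulation behind the figures above: under a true null effect, stop the first time the flat-prior posterior P(B > A) crosses 0.95 (equivalent to z > 1.645), checking after every batch of users.

```python
# Illustrative peeking simulation: optional stopping on a flat-prior
# posterior threshold under a true null effect inflates the false positive
# rate far past the nominal 5%.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_checks, batch = 2_000, 20, 500

false_positives = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, size=n_checks * batch)   # per-user differences, null
    for k in range(1, n_checks + 1):
        n = k * batch
        z = x[:n].mean() / (x[:n].std(ddof=1) / np.sqrt(n))
        if z > 1.645:          # flat-prior P(B > A) > 0.95
            false_positives += 1
            break

print(f"FPR with optional stopping: {false_positives / n_sims:.3f}")  # >> 0.05
```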

Valid Bayesian stopping rules exist. A Bayes factor threshold, which measures the relative evidence for one hypothesis over another, provides unconditional Type I error control when paired with any proper prior. A mixture sequential probability ratio test (mSPRT) provides similar guarantees. But these require deliberate design and a specified threshold. "You can stop whenever you want because it's Bayesian" is not a stopping rule. Ask what the stopping criterion is, what error rate it controls, and whether the guarantee is documented with a specific bound.
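
Here is a minimal mSPRT sketch for a normal mean, with all parameter values assumed for illustration. Rejecting only when the mixture likelihood ratio exceeds 1/alpha bounds the false positive rate at alpha no matter when or how often you look.

```python
# Minimal mSPRT sketch (H0: mean = 0, known variance) with a N(0, tau^2)
# mixing prior over the alternative. By Ville's inequality, rejecting only
# when the mixture likelihood ratio exceeds 1/alpha keeps the false
# positive rate below alpha at ANY stopping time. Illustrative settings.
import numpy as np

def msprt_lr(x, sigma2, tau2):
    """Mixture likelihood ratio after observing the array x."""
    n, s = len(x), x.sum()
    return np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        tau2 * s**2 / (2 * sigma2 * (sigma2 + n * tau2))
    )

alpha, sigma2, tau2 = 0.05, 1.0, 0.01
rng = np.random.default_rng(1)
x = rng.normal(0.1, 1.0, size=5_000)      # data with a small true effect

for n in range(100, 5_001, 100):          # peek every 100 observations
    if msprt_lr(x[:n], sigma2, tau2) > 1 / alpha:
        print(f"Stop at n = {n}: threshold crossed, FPR still bounded by {alpha}")
        break
```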

Second: are the credible intervals consistent with the stopping rule? A credible interval is the Bayesian analogue of a confidence interval: a range that, given the data and the prior, contains the true effect with a stated probability. A standard credible interval computed at an optional stopping time no longer has that property. If the platform stops when the posterior crosses a threshold and then reports a 95% credible interval, that interval was constructed under a stopping rule that selected for favorable results. Its coverage probability is not 95%. The intervals and the stopping rule must be designed together. Ask what the intervals in the UI mean under the actual stopping rule the platform uses, and whether the coverage guarantee has been validated.
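
A quick simulation makes the coverage failure visible (illustrative settings, true effect zero): stop the first time a standard 95% interval excludes zero, or at a maximum sample size, and check how often the interval reported at the stopping time actually covers the truth.

```python
# Illustrative coverage simulation under optional stopping. True effect is
# zero. Stop the first time the 95% interval excludes zero (|z| > 1.96),
# else at the maximum sample size; then check whether the reported
# interval covers zero.
import numpy as np

rng = np.random.default_rng(2)
n_sims, n_checks, batch = 2_000, 20, 500

covered = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, size=n_checks * batch)
    for k in range(1, n_checks + 1):
        n = k * batch
        m, se = x[:n].mean(), x[:n].std(ddof=1) / np.sqrt(n)
        if abs(m) > 1.96 * se or k == n_checks:
            covered += (m - 1.96 * se) <= 0.0 <= (m + 1.96 * se)
            break

print(f"Realized coverage: {covered / n_sims:.3f}")   # nominal 0.95, realized well below
```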

Third: is the prior specified and documented? The prior is not a technical detail. It is part of the statistical model and directly determines the operating characteristics of the test. A flat or uninformative prior produces results algebraically equal to a frequentist z-test. A proper informative prior does real work: it regularizes effect estimates and, when calibrated, enables false discovery rate control. But "informative" is not automatically good. A prior that is wrong for your domain is worse than a flat prior. Ask what distribution the prior uses, what its parameters are, what they imply about expected effect sizes, and whether the prior is a fixed platform default or calibrated from your organization's own data.
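
As a concrete illustration of what a prior commits you to, take a normal prior on relative lift with mean 0 and standard deviation 0.05 (an assumed example): it is a checkable claim about how large your true effects tend to be.

```python
# What a N(0, 0.05) prior on relative lift (assumed example) implies about
# effect sizes. Compare these implications against your historical data.
from scipy.stats import norm

prior = norm(loc=0.0, scale=0.05)
print(f"95% of true lifts within +/-{1.96 * 0.05:.1%}")   # +/-9.8%
print(f"P(|lift| > 5%) = {2 * prior.sf(0.05):.2f}")       # 0.32
```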

Fourth: is empirical Bayes prior estimation supported? Empirical Bayes is what separates a Bayesian implementation that delivers real value from one that relabels frequentist output. An empirical Bayes prior is estimated from a corpus of historical experiments within a well-defined program: a coherent set of structurally similar tests in one product area, not pooled across the company. The threshold is at least 200 experiments per program. A prior built from platform-wide data across different organizations, industries, and effect-size distributions is wrong for every individual program and does not recover false discovery rate control regardless of corpus size. Ask whether the platform supports empirical Bayes prior estimation, what corpus it requires, whether the prior is estimated per program or pooled across programs, and how often it is updated.
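
One common recipe, sketched below on synthetic data (a method-of-moments fit, not any specific vendor's procedure): estimate the prior variance by subtracting average sampling noise from the observed spread of historical estimates, then shrink new estimates toward the prior mean.

```python
# Method-of-moments empirical Bayes sketch (one common recipe, not any
# vendor's): fit a normal prior over true effects from a program's
# historical (estimate, standard error) pairs, then shrink new estimates.
import numpy as np

def fit_normal_prior(estimates, std_errors):
    """Fit N(mu, tau^2) over true effects by moment matching."""
    mu = estimates.mean()
    # Observed spread = true-effect spread + sampling noise; subtract the noise
    tau2 = max(0.0, estimates.var(ddof=1) - np.mean(std_errors**2))
    return mu, tau2

def shrink(estimate, se, mu, tau2):
    """Posterior mean of a new experiment's true effect under the prior."""
    w = tau2 / (tau2 + se**2)                  # weight on the new data
    return w * estimate + (1 - w) * mu

rng = np.random.default_rng(3)
true_effects = rng.normal(0.0, 0.02, size=250)     # one program, 250 experiments
ses = np.full(250, 0.015)
observed = true_effects + rng.normal(0.0, ses)     # synthetic corpus

mu, tau2 = fit_normal_prior(observed, ses)
print(f"Fitted prior: N({mu:.4f}, {np.sqrt(tau2):.4f}^2)")   # about N(0, 0.02^2)
print(f"A raw +3.0% lift shrinks to {shrink(0.03, 0.015, mu, tau2):+.4f}")
```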

Fifth: is sample size planning available for Bayesian experiments? The concepts of power and sample size apply in Bayesian inference just as in frequentist inference. You still need to know, before you start, how many users are required to have a reasonable chance of detecting the effect you care about. A Bayesian experiment without a planning step is an experiment with no principled basis for choosing its duration. You either run too short and get ambiguous posteriors, or run too long and waste traffic. The planning step also needs to account for the prior, the stopping rule, and any variance reduction in use. Ask whether the platform provides sample size or expected sample size calculations for Bayesian experiments, and whether those calculations connect to the prior, stopping rule, and analysis method that will actually be used. For more on how sample size calculators connect to the rest of the platform, see our post on sample size calculation.
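
One way to do this planning is by simulation, sometimes called an assurance calculation. The sketch below (all settings assumed for illustration) draws a true lift from the prior, simulates an experiment at each candidate size, and records how often the posterior is decisive in either direction.

```python
# Simulation sketch of Bayesian sample size planning ("assurance"). All
# settings are illustrative assumptions: a calibrated N(0, 0.05) prior on
# the lift, a unit-variance metric, and a 0.95 posterior decision rule.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
prior_mu, prior_sd = 0.0, 0.05   # assumed calibrated prior on the lift
sigma = 1.0                      # per-user metric standard deviation
threshold, target = 0.95, 0.80   # decision rule and desired assurance

for n in [5_000, 10_000, 20_000, 40_000, 80_000]:
    lifts = rng.normal(prior_mu, prior_sd, size=4_000)
    se = sigma * np.sqrt(2 / n)                      # SE of a two-sample lift
    estimates = lifts + rng.normal(0, se, size=4_000)
    # Conjugate normal update: posterior for the lift given the estimate
    post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
    post_mean = post_var * (prior_mu / prior_sd**2 + estimates / se**2)
    p_positive = norm.cdf(post_mean / np.sqrt(post_var))
    decisive = (p_positive > threshold) | (p_positive < 1 - threshold)
    flag = "  <- meets target" if decisive.mean() >= target else ""
    print(f"n={n:>6} per arm: assurance = {decisive.mean():.2f}{flag}")
```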

Sixth: what error bounding or guarantees can you select from? A coherent Bayesian implementation should be able to state, precisely, what guarantee the combination of prior, stopping rule, and decision threshold provides. That guarantee might be an unconditional bound on the false positive rate (Tier 2 in our framework). It might be false discovery rate control across a program (Tier 3). It might be neither, in which case the platform should say so explicitly. The answer "Bayesian inference does not make frequentist guarantees" is an answer, but it is not a substitute for stating what guarantee it does make. Ask what error rate property the implementation delivers, stated precisely, and what conditions must hold for that guarantee to apply.

Seventh: is multiple testing handled in the Bayesian mode? Ten metrics tested with default or diffuse priors and a ship-if-any rule will produce false positives at roughly the same rate as ten uncorrected frequentist tests: about 40% at the 5% level. The claim that "Bayesian inference handles multiple testing automatically" holds only with well-specified informative priors calibrated from the organization's own experiment corpus. With the flat or default priors most platforms use, the false positive risk is comparable to uncorrected frequentist testing. False discovery rate control across multiple metrics requires Bayes factor stopping in combination with a well-calibrated empirical Bayes prior. Without both, the multiple testing problem is unaddressed. Ask whether the platform applies any correction across metrics in Bayesian mode, and if so, what procedure is used and what it controls. For more on how multiple testing corrections interact with inference frameworks, see our post on multiple testing.
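
The roughly 40% figure is easy to verify: with flat priors each metric behaves like an independent one-sided 5% test under the null, so the closed form and a quick simulation (sketched below) land on the same number.

```python
# Quick check of the ship-if-any failure mode: with flat or diffuse
# priors, 10 metrics behave like 10 independent one-sided 5% tests under
# the null. Analytic family-wide FPR is 1 - 0.95**10. Illustrative only.
import numpy as np

print(f"Analytic:  {1 - 0.95**10:.3f}")   # 0.401

rng = np.random.default_rng(5)
z = rng.normal(size=(100_000, 10))        # per-metric z-scores under the null
ship_if_any = (z > 1.645).any(axis=1)     # any metric with P(B > A) > 0.95
print(f"Simulated: {ship_if_any.mean():.3f}")   # ~0.40
```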

Eighth: does variance reduction work in Bayesian mode? CUPED and related variance reduction techniques reduce metric noise by adjusting for pre-experiment behavior. For mean metrics in frequentist mode, this is standard on most platforms. In Bayesian mode, the question is whether the variance reduction is reflected in the posterior variance, in the prior calibration, and in the sample size calculation. A platform where you enable CUPED for a Bayesian experiment but the posterior is computed on raw (unadjusted) data, or where the sample size calculator does not reflect the variance reduction, gives you a fragmented implementation. Ask whether CUPED applies in Bayesian mode, and whether its effect is carried through to the posterior, the intervals, and the planning step.
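
Here is a sketch of what a connected pipeline looks like, under an assumed setup rather than any vendor's implementation: the CUPED-adjusted variance, not the raw variance, is what the posterior and the planning step should be built on.

```python
# Sketch of CUPED feeding a Bayesian analysis (an assumed pipeline, not
# any vendor's). The adjusted metric's smaller variance is what the
# likelihood, posterior, and sample size plan should all use; computing
# the posterior on raw data while "CUPED is on" discards the gain.
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
pre = rng.normal(10, 2, size=n)                 # pre-experiment covariate
post = 0.8 * pre + rng.normal(0, 1, size=n)     # in-experiment metric

c = np.cov(post, pre)
theta = c[0, 1] / c[1, 1]                       # CUPED coefficient
adjusted = post - theta * (pre - pre.mean())

se_raw = post.std(ddof=1) / np.sqrt(n)
se_cuped = adjusted.std(ddof=1) / np.sqrt(n)
print(f"SE raw:   {se_raw:.5f}")
print(f"SE CUPED: {se_cuped:.5f}   (the SE the posterior should be built on)")
```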

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation and the detailed vendor survey in The Bayesian Checkbox.

Cell values use the following conventions:

  • Yes -- explicit evidence the platform supports this.
  • No -- explicit evidence the platform does not support this.
  • --- -- the platform does not offer the broader feature at all (Confidence does not offer Bayesian inference).
  • Not documented -- a thorough search of public documentation found no evidence either way.

Last updated: May 2026

| Vendor | Prior type | Stopping rule specified? | False positive rate guarantee stated? | Error guarantee stated? | Sample size planning for Bayesian? | Multiple testing in Bayesian mode? | Variance reduction in Bayesian mode? | Intervals consistent with stopping rule? | Empirical Bayes prior? | Other gaps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Confidence | --- | --- | --- | --- | --- | --- | --- | --- | --- | Does not offer Bayesian inference |
| GrowthBook | Flat (default); optional normal prior | No | No | No | No | No | No | Not documented | No | Optional prior not connected to stopping or planning |
| Eppo | Proper informative normal prior on lift | No | No | No | No | No | Yes | Not documented | No | Prior, stopping, planning, and correction are disconnected |
| Statsig | Flat default; informative option | No | No | No | No | Not documented | Not documented | Not documented | Partial (user-supplied priors) | Stopping rule warning in blog only |
| LaunchDarkly | Platform-wide empirical Bayes shrinkage | No | No | No | No | No | No | Not documented | Partial (platform-wide) | Variance reduction and Bayesian mode cannot be combined |
| PostHog | Flat (uninformative) | No | No | No | No | No | Not documented | Not documented | No | Acknowledges 40% FPR with 10 uncorrected metrics |
| Amplitude | Flat (uninformative) | No | No | No | No | No | No | Not documented | No | Explicitly disclaims false positive control |
| Optimizely | Flat (uninformative) | No | No | No | No | Not documented | Not documented | Not documented | No | |
| VWO | Flat | Partial | Partial | Partial | No | Yes (in SmartStats) | Not documented | Not documented | No | Proof of FPR control not public |

Five patterns emerge from this comparison.

The first is that most vendors ship a flat prior by default, which means the Bayesian output is mathematically the same as a frequentist test. GrowthBook, PostHog, Amplitude, Optimizely, and VWO all use flat or uninformative priors as their default. Under these priors, "Chance to Win = 95%" is the same statement as "p = 0.05." The experimenter sees a Bayesian interface presenting what is mathematically a frequentist result. Nothing about the analysis has changed except the vocabulary.

The second pattern is the universal absence of a stopping rule with a stated guarantee. No vendor in this comparison documents a Bayesian stopping rule that provides an unconditional bound on the false positive rate. VWO comes closest: its SmartStats engine applies a sequential correction to Bayesian posterior probabilities, and its documentation describes false positive rate control as a configurable parameter. But the guarantee depends on the maximum sample size setting, and the mathematical proof is not publicly available. Every other vendor either provides no stopping rule at all, or explicitly disclaims false positive rate control. Eppo's documentation is the most direct: Bayesian analysis "avoids the issue simply by making no promises about the false positive rate." The disclosure is honest, but it means the experimenter has no stated error rate bound.

The third pattern is the complete absence of Bayesian sample size planning. Not a single vendor reviewed provides sample size calculation that is designed for their Bayesian mode. Every vendor offers Bayesian analysis, meaning teams can configure experiments to use Bayesian inference. But none provides a planning tool that tells you, before the experiment starts, how many users you need given the prior, the stopping rule, and the desired operating characteristics. The result is that every Bayesian experiment on every reviewed platform starts without a principled basis for choosing its duration.

The fourth pattern is the multiple testing blind spot. GrowthBook, Eppo, LaunchDarkly, PostHog, and Amplitude explicitly do not apply multiple testing correction in Bayesian mode. Eppo and PostHog cite arguments that Bayesian inference handles multiple testing inherently, but this property requires calibrated informative priors that none of these vendors offer by default. VWO is the exception, integrating its Bonferroni correction into the SmartStats Bayesian engine. The practical consequence: on most platforms, switching from frequentist to Bayesian mode quietly drops whatever multiple testing safeguard was in place.

Two vendors stand apart from the flat-prior baseline, though neither reaches a complete implementation.

Eppo uses a proper informative prior (normal with mean 0 and standard deviation 0.05 on the lift scale) that does genuine Bayesian work. Their UI output is the most complete of any vendor reviewed. Their documentation, including their post "Beware of the Bayesian Imposter," is the most self-aware about the distinction between genuine Bayesian inference and relabeled frequentist testing. But the prior is a fixed platform default, not calibrated from the organization's own experiment corpus. No stopping rule is documented. No sample size planning is available. No multiple testing correction is applied.

LaunchDarkly uses the most distinctive prior among the vendors reviewed: a platform-wide empirical shrinkage prior that pulls treatment estimates toward the control mean. The LaunchDarkly prior is the closest thing to an empirical Bayes approach found in the vendor market. But the prior is estimated from platform-wide data across different organizations, not from each customer's own experiment corpus. The stopping rule is undocumented. No Bayesian-specific sample size planning exists. Multiple testing is not addressed. Variance reduction is incompatible with the Bayesian mode. And their documentation tells users they "do not need to understand these concepts to use Experimentation," an approach that assumes the platform handles the underlying statistical complexity on the user's behalf.

A fifth pattern cuts across the others: feature combinations that break down in Bayesian mode. Several vendors support variance reduction and multiple testing correction in their frequentist mode, but these features are disabled or unavailable when Bayesian inference is selected. Eppo and Amplitude both disable Bonferroni correction for Bayesian experiments. LaunchDarkly's variance reduction is incompatible with its Bayesian mode entirely. The result is that switching to Bayesian inference does not just change the statistical framework. It removes safeguards that were already in place. A team that relied on multiple testing correction and variance reduction in frequentist mode loses both when switching to Bayesian, often without warning. These combination gaps are easy to miss in an RFP and highly consequential in practice.

The inference label does not change the cost of a false positive

The harm from acting on a false positive is the same whether the result came from a z-test or a posterior. A feature that does not work ships. A metric regression goes undetected. The organization's ability to learn from experiments degrades. Every vendor reviewed publishes educational content explaining why peeking at frequentist results inflates false positive rates and why uncorrected multiple testing produces unreliable conclusions. The same protections are absent from their Bayesian features.

The inconsistency reveals an absence of position. Peeking is only a problem if you accept that inflated false positive rates lead to bad product decisions. That judgment does not change when you switch inference frameworks. A standard that applies to frequentist analysis but not to Bayesian analysis is not a standard about good experimentation. It is a convention about which software library produced the number.

A complete Bayesian implementation means a specified prior with documented parameters, a stopping rule with a stated error rate guarantee, sample size planning connected to the prior and stopping rule, intervals that are valid under the actual stopping behavior, multiple testing correction that applies across all metrics in the shipping decision, and variance reduction that carries through to the posterior and the planning step. A vendor that clears all that has built something genuinely valuable. If any of those connections are missing, you have a label, not a method.