How to Write an Experimentation Platform RFP for Multiple Testing Corrections

Last updated: May 2026

Every experimentation platform will tell you it corrects for multiple comparisons. Most do, partially. The correction is applied to variants but not metrics, or to the reject/not-reject decision but not to the reported confidence intervals, or to the frequentist engine but not the Bayesian one. The result is a feature that looks complete on a checklist but leaves real gaps in practice.

The math makes the stakes clear. If you test ten metrics at the 5% significance level with a ship-if-any rule and no correction, the probability of at least one false positive is roughly 40%. This is true regardless of inference framework. Neither a p-value nor a posterior probability is immune to the multiple comparisons problem. Only the correction procedure changes the false positive rate, and only if it is applied consistently across every metric that enters the shipping decision.
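
For concreteness, here is that arithmetic as a short Python sketch, assuming the ten metrics are independent and none has a real effect:

```python
# Familywise error rate for m independent metrics, each tested at level alpha,
# under a ship-if-any rule when no metric has a real effect.
alpha, m = 0.05, 10
fwer_uncorrected = 1 - (1 - alpha) ** m        # ~0.40
fwer_bonferroni = 1 - (1 - alpha / m) ** m     # ~0.049, back below alpha
print(f"uncorrected: {fwer_uncorrected:.2f}, Bonferroni: {fwer_bonferroni:.3f}")
```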

An analysis of roughly 1,300 experiments on Confidence showed that the practical ship-rate difference between Bonferroni and more advanced methods that control the familywise error rate (FWER), like Holm and Hommel, is about 4.5 percentage points. Benjamini-Hochberg gains 4.9 percentage points. Dropping correction entirely gains 9.2 percentage points, but at the cost of uncontrolled false positives. The power gap between methods shrinks further when few metrics have real effects, which describes most product experiments. Whether Bonferroni's simplicity is worth the modest power cost depends on what you get in return: simultaneous confidence intervals, a closed-form connection to sample size calculation, and compatibility with sequential testing. These are exactly where most vendor implementations fall short.

Should you add multiple testing correction to your experimentation platform requirements?

Yes, and this is the topic where a checkbox answer is most likely to be actively misleading. Every vendor that applies a correction can truthfully say "yes" to the RFP question. But a correction that adjusts the decision threshold without widening the confidence intervals still produces winner's curse: the effect estimates for results that pass the significance threshold are systematically inflated, because the threshold selects for overestimates. A correction that applies to variants but not metrics leaves the metric-level false positive rate uncontrolled. A correction that appears in the analysis but is absent from the sample size calculator means you planned one experiment and ran another.
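
The winner's curse mechanism is easy to see in a minimal simulation. This sketch uses an assumed true lift, standard error, and uncorrected threshold purely for illustration:

```python
# Winner's curse sketch: conditioning on statistical significance selects for
# overestimates, so the average shipped estimate exceeds the true effect.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.5, 1.0                        # assumed true lift and standard error
estimates = rng.normal(true_effect, se, 100_000)  # sampling noise around the true effect
shipped = estimates[estimates / se > 1.96]        # keep only "significant" results
print(f"true effect: {true_effect}, mean shipped estimate: {shipped.mean():.2f}")
```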

The question is not whether the platform corrects for multiple comparisons. The question is whether the correction is carried through to every place it matters: the decision, the intervals, the sample size calculation, the sequential monitoring, and the distinction between metric roles. A correction that covers some of these and ignores the rest leaves you exposed in the places it skips. This is the partial pipeline pattern: the vendor has the feature, but the feature does not connect to the rest of the analysis pipeline. At Spotify, we have learned that a partial correction is worse than no correction at all, because it replaces visible uncertainty with invisible gaps. We have removed incomplete features from Confidence for exactly this reason: the false confidence they create causes more damage than the limitation they were meant to address.

What your RFP should ask instead of the yes/no question

Seven questions separate a connected multiple testing implementation from a disconnected one.

First: is the correction applied to both variants and metrics? Most platforms correct for multiple variants (testing treatment A, B, and C against control). Fewer correct for multiple metrics (testing ten metrics simultaneously and shipping if any is significant). The variant correction is straightforward: with three treatment groups, the number of comparisons triples. But the metric correction is where the real false positive risk accumulates. With ten metrics tested independently at 5%, the familywise error rate is approximately 40%. A platform that corrects for variants but not metrics addresses the smaller risk while ignoring the larger one. Ask whether the correction covers both dimensions, and whether the metric correction is applied across all metrics that enter the shipping decision.
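
As a rough sketch of how the two dimensions compound, using the three-variant, ten-metric numbers above and again assuming independent comparisons:

```python
# The correction family when both dimensions apply: 3 treatment variants times
# 10 metrics is 30 comparisons under a ship-if-any rule.
alpha, variants, metrics = 0.05, 3, 10
comparisons = variants * metrics
print(1 - (1 - alpha) ** comparisons)                  # ~0.79 uncorrected
print(1 - (1 - alpha / comparisons) ** comparisons)    # ~0.049 with Bonferroni
```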

Second: are the confidence intervals consistent with the correction method? A correction changes the significance threshold. For that threshold to produce trustworthy effect estimates, the confidence intervals must reflect it. Bonferroni at alpha divided by the number of success metrics produces simultaneous confidence intervals: intervals where the probability that all of them cover their true effects at the same time is at least the stated coverage level. As our research shows, this is one of Bonferroni's key operational advantages: every metric has a valid interval, and all of them hold simultaneously. Some correction methods do not have corresponding confidence intervals at all. Holm and Hommel produce adjusted p-values but not simultaneous intervals. A platform that displays Holm-adjusted decisions alongside unadjusted intervals creates an inconsistency: the decision says "not significant" while the interval suggests a real effect, or the decision says "significant" while the interval is too narrow to be trusted for effect size estimation. The practical consequence is winner's curse. When you select experiments to ship based on a corrected threshold but estimate effects using uncorrected intervals, the shipped effects are systematically larger than reality. Ask whether the intervals shown after correction are simultaneous confidence intervals that match the correction method, or whether they are standard intervals displayed alongside adjusted p-values.
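
A minimal sketch of what simultaneous Bonferroni intervals look like; the point estimates and standard errors are hypothetical, and a normal approximation is assumed:

```python
# Bonferroni simultaneous confidence intervals: each interval is built at level
# 1 - alpha/m, so all m intervals cover their true effects at the same time
# with probability at least 1 - alpha.
from scipy.stats import norm

def bonferroni_intervals(estimates, std_errors, alpha=0.05):
    m = len(estimates)
    z = norm.ppf(1 - alpha / (2 * m))   # wider critical value than the usual 1.96
    return [(est - z * se, est + z * se) for est, se in zip(estimates, std_errors)]

# Hypothetical estimates and standard errors for two success metrics
print(bonferroni_intervals([0.8, 0.3], [0.4, 0.2]))
```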

Third: is the correction reflected in the sample size calculation? This is the planning-analysis link that determines whether your experiment is properly powered. Bonferroni is the only common correction method whose power cost plugs directly into a standard sample size formula: replace alpha with alpha divided by the number of success metrics. Two success metrics with Bonferroni means testing each at alpha = 0.025 instead of 0.05, which increases the required sample size by roughly 20%, and more when combined with sequential testing. Holm and Hommel require simulation to compute sample size because their rejection decisions depend on the ordering of test statistics across metrics. Benjamini-Hochberg requires a pre-experiment estimate of how many metrics have no real effect (the null fraction: the share of metrics where the treatment has zero impact). If a vendor offers Holm or Benjamini-Hochberg in the analysis engine but uses Bonferroni (or nothing) in the sample size calculator, there is a planning-analysis disconnect. The calculator will produce a sample size that does not match the power profile of the actual test. Ask which correction method the sample size calculator accounts for, and whether it matches the method the analysis will use. For more on how sample size calculators connect to the rest of the analysis pipeline, see our post on sample size calculation.
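
A sketch of that planning adjustment, plugging alpha divided by the number of success metrics into a standard two-sample formula; the effect size, variance, and power are placeholder values:

```python
# Bonferroni in the sample size calculation: replace alpha with alpha / m.
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Two-sample z-test sample size per group for a minimum detectable effect delta."""
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

base = n_per_group(delta=0.1, sigma=1.0, alpha=0.05)
adjusted = n_per_group(delta=0.1, sigma=1.0, alpha=0.05 / 2)  # two success metrics
print(adjusted / base)   # ~1.2: roughly 20% more samples per group
```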

Fourth: does the platform distinguish metric roles in the correction? Not all metrics in an experiment serve the same purpose. Success metrics are the ones you ship on, like engagement, conversion, or revenue. Guardrail metrics are the ones that must not degrade, like latency, crash rate, or error rate. The set of metrics that enter the correction is called the correction family, and including guardrail metrics in the same family as success metrics inflates the denominator unnecessarily. If you have two success metrics and four guardrail metrics, Bonferroni across all six metrics tests each one at alpha divided by six. Restricting the correction to the two success metrics tests each at alpha divided by two. The difference is substantial: at Spotify, the median experiment includes two success metrics and four guardrail metrics, and restricting the correction family to success metrics alone cuts the power cost roughly in half. A platform that lumps all metrics into a single correction family is being conservative in a way that costs power without improving the false positive guarantee for shipping decisions. Ask whether the platform lets you define which metrics enter the correction, and whether guardrail metrics can be excluded from the family applied to success metrics. For more on how metric roles affect decision making, see our upcoming post on multi-metric decision rules.
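
Reusing the same formula as above, here is a sketch of what the family definition costs, with the two-success, four-guardrail split described in the text (placeholder effect size and variance):

```python
# Sample size cost of including guardrails in the Bonferroni correction family.
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

base = n_per_group(0.1, 1.0, alpha=0.05)                  # no correction
success_only = n_per_group(0.1, 1.0, alpha=0.05 / 2)      # family of 2 success metrics
with_guardrails = n_per_group(0.1, 1.0, alpha=0.05 / 6)   # guardrails included, family of 6
print(success_only / base, with_guardrails / base)        # ~1.2 vs ~1.5
```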

Fifth: is multiple testing handled for both Bayesian and frequentist modes? The claim that "Bayesian inference handles multiple testing automatically" is common in vendor documentation but misleading. A Bayesian analysis with properly specified informative priors incorporates prior information that can reduce false discovery rates, but this depends entirely on having good priors, which is rarely the case for new experimentation programs. With diffuse or default priors, a Bayesian analysis is just as susceptible to the multiple comparisons problem as a frequentist one. Ten metrics tested with weak priors and a ship-if-any rule will produce false positives at roughly the same rate as ten frequentist tests without correction. If the platform offers both inference frameworks, ask whether the correction applies in both modes, or whether switching to Bayesian quietly drops the multiple testing safeguard.
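
A hypothetical A/A simulation makes the point concrete: with a flat prior, a ship-if-any rule on posterior probabilities behaves like uncorrected one-sided tests. The metric count, sample size, and 95% threshold below are assumed for illustration:

```python
# A/A simulation: ten metrics, no real effects, flat priors, ship if any metric's
# posterior probability of improvement exceeds 95%.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_sims, n_metrics, n = 2_000, 10, 1_000
se = np.sqrt(2 / n)                      # standard error of the difference in means
ships = 0
for _ in range(n_sims):
    diff = (rng.normal(0, 1, (n_metrics, n)).mean(axis=1)
            - rng.normal(0, 1, (n_metrics, n)).mean(axis=1))
    post_prob = norm.cdf(diff / se)      # flat-prior posterior P(effect > 0 | data)
    ships += (post_prob > 0.95).any()
print(ships / n_sims)                    # roughly 0.4, not 0.05
```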

Sixth: what correction methods are supported, and are all of them carried through to sample size, intervals, and monitoring? A platform might support Bonferroni, Holm, and Benjamini-Hochberg. But supporting a method in the analysis is only the first step. Each method needs to connect to the rest of the analysis pipeline. Does the sample size calculator know which method will be used? Does each method produce intervals that match its rejection rule? Does the correction interact correctly with sequential testing boundaries? Sequential monitoring compounds the multiple testing problem: you are testing multiple metrics at multiple time points. The correction and the stopping rule must be designed together. Bonferroni works well here because each metric can run its own independent sequential test at the adjusted threshold (alpha divided by the number of success metrics). Other corrections, like Holm and Benjamini-Hochberg, require all metrics to accumulate information at similar rates, which is rarely true in practice. The power cost of combining continuous monitoring with these methods can exceed the modest gains they offer over Bonferroni. For more on how sequential testing interacts with other analysis features, see our upcoming post on sequential testing.

Seventh: is the correction applied consistently across all analysis views? Some platforms apply the correction in the main results summary but drop it in dimensional breakdowns, segment explorers, or automated alerting. Slicing results by country, platform, or user segment is itself a multiple comparisons problem. If the correction is present in the topline view but absent when you drill into dimensions, the false positive rate in exploratory analysis is uncontrolled. Ask whether the correction persists across all views where significance is displayed, including dimensional analysis. For more on the statistical treatment of dimensional breakdowns, see our upcoming post on exploratory analysis.

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

Cell value legend: "Yes" and "No" are based on explicit evidence in the vendor's public documentation. "—" means the platform does not offer the broader feature at all, so the column does not apply. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Correction methods (frequentist) | Applied to variants? | Applied to metrics? | Intervals corrected? | Reflected in sample size calculator? | Metric roles distinguished? | Correction in Bayesian mode? | Applied in dimensional analysis? | Pipeline consistency | Other gaps |
|---|---|---|---|---|---|---|---|---|---|---|
| Confidence | Bonferroni | Yes | Yes | Yes (simultaneous) | Yes | Yes (guardrails excluded) | Yes | Yes | Full | Only Bonferroni offered |
| GrowthBook | Holm-Bonferroni, Benjamini-Hochberg | Yes | Yes (goal metrics only) | Ad-hoc | No | Yes (guardrails excluded) | No (frequentist only) | Yes (per-dimension family) | Partial | Warns dimensional results are exploratory |
| Eppo | Preferential Bonferroni | Yes | Yes | Yes (widened) | Not documented | Partial | No (frequentist only) | Not documented | Partial | |
| Statsig | Bonferroni, Benjamini-Hochberg | Yes | Yes | Yes | Partial (per-variant only) | Partial | Not documented | No (topline only) | Partial | Dimensions capped at top 10 |
| Optimizely | Tiered Benjamini-Hochberg | Yes | Yes | Not documented | No | Yes | Not documented | Not documented | Partial | |
| Amplitude | Bonferroni | Yes | Yes | Yes | No | Partial | Not documented | Not documented | Partial | Bonferroni enabled by default |
| LaunchDarkly | Bonferroni, Benjamini-Hochberg | Configurable | Configurable | Yes | No | Not documented | No (frequentist only) | Not documented | Partial | |
| VWO | Built-in Bonferroni | Yes | Not documented | Not documented | No | Yes | Yes (in SmartStats) | Not documented | Partial | Guardrail breach auto-pauses |
| PostHog | None | No | No | No | No | N/A | No correction in either mode | — | — | |

Five patterns emerge from this landscape.

The first is the gap between variant correction and metric correction. Every platform that offers correction applies it to variants. Most also apply it to metrics. But PostHog, which offers both Bayesian and frequentist engines, applies no correction in either mode, recommending intentional metric selection as the mitigation. This leaves the false positive rate entirely in the experimenter's hands.

The second pattern is the disconnect between analysis methods and planning. This is the partial pipeline pattern at its most common: a vendor supports correction in the analysis engine but does not carry it through to the sample size calculator. Seven of the eight external platforms offer some form of correction in their analysis engine, but none reflect the correction in their sample size calculator in a way that matches the analysis method. Statsig reflects Bonferroni per-variant correction in its sample size calculator but does not account for per-metric correction or Benjamini-Hochberg. GrowthBook supports Holm and Benjamini-Hochberg in analysis but does not connect either to the calculator. The result is the same planning-analysis gap described in our sample size post: the calculator produces a number that assumes no metric-level correction, the analysis applies one, and the experiment is either underpowered or overpowered depending on the direction of the mismatch.

The third pattern is the Bayesian blind spot. Most platforms that offer both frequentist and Bayesian modes apply multiple testing correction only in the frequentist mode. GrowthBook, Eppo, LaunchDarkly, and PostHog explicitly do not include correction in their Bayesian engines: the first three offer correction only in the frequentist path, while PostHog applies no correction in either mode. The argument that Bayesian inference inherently avoids the multiple testing problem holds only with well-specified informative priors. With the default or diffuse priors that most platforms use, the false positive risk is comparable to uncorrected frequentist testing. VWO is the notable exception, integrating its correction into the SmartStats Bayesian engine. Confidence applies correction in both modes.

The fourth pattern is the confidence interval question. Bonferroni produces simultaneous confidence intervals by construction. Other methods do not. GrowthBook acknowledges this directly, noting that "adjusted p-values do not have directly analogous confidence intervals" and using an ad-hoc approach to back-calculate intervals from adjusted p-values. Statsig and Eppo widen intervals based on the adjusted alpha. Optimizely's documentation does not address whether intervals are corrected alongside its tiered Benjamini-Hochberg decisions. The practical consequence of uncorrected intervals is winner's curse: even when the decision is controlled, the effect estimates for significant results are systematically inflated.

The fifth pattern is dimensional analysis. Statsig explicitly excludes its Benjamini-Hochberg procedure from dimensional breakdowns, applying correction to topline results only. GrowthBook applies correction separately in dimensional views, treating each dimension as its own family of tests (while warning that dimensional results should be treated as exploratory). Most other platforms do not document whether correction extends to dimensional analysis. Since slicing by dimensions is one of the most common sources of false positives in practice, the absence of documented correction in dimensional views means exploratory analysis may be running at an uncontrolled false positive rate.

The overall picture is one of partial pipelines. Every platform addresses some aspect of the multiple comparisons problem. No external platform carries the correction through consistently from analysis to intervals to sample size planning to sequential monitoring to dimensional analysis. The table above shows where each vendor's correction starts and where it stops. The RFP that asks "do you correct for multiple comparisons?" will get a yes from almost every vendor. The RFP that asks where the correction starts and where it stops will get very different answers.