Multiple Testing

What is a Bonferroni Correction?

The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests. If you're evaluating 3 success metrics at alpha = 0.05, each individual test uses a threshold of 0.05/3 ≈ 0.017. This guarantees that the family-wise error rate (FWER), the probability of at least one false positive across the entire set, stays at or below 5%.
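
As a minimal sketch of the arithmetic (the metric names and p-values below are hypothetical), the adjustment amounts to a single division:

```python
# Minimal sketch of a Bonferroni adjustment for 3 hypothetical success metrics.
alpha = 0.05
p_values = {"conversion": 0.012, "revenue_per_user": 0.030, "retention_d7": 0.20}

adjusted_alpha = alpha / len(p_values)  # 0.05 / 3 ≈ 0.0167

for metric, p in p_values.items():
    print(f"{metric}: p={p:.3f}, significant at 5% FWER: {p < adjusted_alpha}")
```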

Bonferroni has a reputation for being too conservative. That reputation is mostly wrong, at least for A/B testing. The Confidence blog's analysis of multiple testing corrections found that the power gap between Bonferroni and more sophisticated FWER methods (Holm, Hommel) is only 4 to 5 percentage points for typical product experiments with 1 to 5 success metrics. That's a small price for a method that's trivial to implement, easy to explain to stakeholders, and unique in providing simultaneous confidence intervals for every metric.

Why does Confidence use Bonferroni as the default?

Three properties make Bonferroni well-suited for A/B testing in ways that more complex alternatives can't match.

Simultaneous confidence intervals. Bonferroni is the only standard FWER method that produces valid confidence intervals for every metric simultaneously. That means you can look at the estimated effect and its interval for each success metric, and the coverage guarantee holds for all of them at once. Step-down methods like the Holm correction and Hommel correction control FWER for the rejection decisions, but they don't produce simultaneous intervals with the same guarantee. When stakeholders ask "how big was the effect on metric X?", the Bonferroni-adjusted interval gives a trustworthy answer.
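
A sketch of what those simultaneous intervals look like in practice, assuming a normal approximation; the effect estimates and standard errors below are hypothetical:

```python
from scipy.stats import norm

# Sketch: Bonferroni-adjusted (simultaneous) confidence intervals for each
# success metric's estimated effect. Effects and standard errors are
# hypothetical, and a normal approximation is assumed.
alpha = 0.05
metrics = {
    "conversion":       {"effect": 0.021, "se": 0.008},
    "revenue_per_user": {"effect": 0.350, "se": 0.210},
    "retention_d7":     {"effect": 0.004, "se": 0.006},
}

# Two-sided critical value at the per-metric level alpha / m.
z = norm.ppf(1 - alpha / (2 * len(metrics)))

for name, m in metrics.items():
    lo, hi = m["effect"] - z * m["se"], m["effect"] + z * m["se"]
    print(f"{name}: effect {m['effect']:+.3f}, simultaneous 95% CI [{lo:+.3f}, {hi:+.3f}]")
```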

Straightforward sample size calculations. Because Bonferroni simply adjusts the per-test alpha, power analysis is straightforward: calculate the sample size for the adjusted alpha. Step-down methods like Holm have power that depends on the unknown configuration of true and false nulls, making exact sample size calculations impossible without assumptions about which effects are real. In practice, you'd size for Bonferroni anyway and treat any additional power from Holm or Hommel as a bonus.
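
A rough sketch of that power analysis, using the standard normal-approximation sample size formula for a two-sample test of a standardized effect size; the effect size, power target, and metric count are illustrative assumptions:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80, n_tests=1):
    """Approximate per-group sample size for a two-sided two-sample z-test
    on a standardized effect size, with a Bonferroni-adjusted alpha."""
    z_alpha = norm.ppf(1 - (alpha / n_tests) / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Hypothetical target: detect a standardized effect of 0.05 with 80% power.
print(n_per_group(0.05))             # one success metric
print(n_per_group(0.05, n_tests=3))  # three success metrics, alpha = 0.05 / 3
```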

Transparency. The logic fits in one sentence: divide alpha by the number of tests. Every stakeholder in the room can understand why the threshold changed and what it means. Methods that involve sorting p-values and applying rank-dependent thresholds are harder to explain, which matters when the goal is building trust in the experimentation process.

How conservative is Bonferroni really?

The conservatism concern comes from two sources, one legitimate and one that's usually a denominator mistake.

The legitimate source: Bonferroni assumes nothing about the correlation structure between tests. If your metrics are highly positively correlated (as many product metrics are), the effective number of independent tests is smaller than the literal count, and Bonferroni over-corrects slightly. Methods like the Hommel correction can exploit that structure for extra power.
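
A quick Monte Carlo sketch of that over-correction, assuming three test statistics under the null with an illustrative correlation of 0.7:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo sketch: realized FWER of Bonferroni with 3 positively correlated
# test statistics, all under the null. The correlation and simulation count
# are illustrative assumptions.
rng = np.random.default_rng(0)
m, rho, alpha, n_sims = 3, 0.7, 0.05, 200_000

cov = np.full((m, m), rho) + (1 - rho) * np.eye(m)
z = rng.multivariate_normal(np.zeros(m), cov, size=n_sims)
p = 2 * norm.sf(np.abs(z))  # two-sided p-values

fwer = (p < alpha / m).any(axis=1).mean()
print(f"realized FWER: {fwer:.3f} (nominal bound {alpha})")
```

With positively correlated statistics the realized FWER comes in below the nominal 5%, which is the mild over-correction described above.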

The denominator mistake: teams sometimes apply Bonferroni across every metric in the experiment, including guardrails, quality indicators, and exploratory measures. Correcting across 20 metrics when only 3 are success metrics is expensive regardless of which FWER method you use. The fix is a smaller correction family. Confidence separates success metrics from other types and corrects only across the success metrics. That keeps the denominator at 1 to 5 for most experiments, where Bonferroni's conservatism is minimal.
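
A sketch of the denominator fix, with hypothetical metric names and types; only the success metrics enter the correction family:

```python
# Keep the correction family small: correct only across success metrics,
# not guardrails or exploratory metrics. Metric names are hypothetical.
alpha = 0.05
metrics = [
    {"name": "conversion",       "type": "success"},
    {"name": "revenue_per_user", "type": "success"},
    {"name": "retention_d7",     "type": "success"},
    {"name": "crash_rate",       "type": "guardrail"},
    {"name": "page_load_time",   "type": "guardrail"},
    {"name": "scroll_depth",     "type": "exploratory"},
    # ... more guardrail and exploratory metrics that stay out of the family
]

family = [m for m in metrics if m["type"] == "success"]
print(f"per-test threshold: {alpha / len(family):.4f}")  # 0.05 / 3 ≈ 0.0167
# versus 0.05 / 20 = 0.0025 if all 20 metrics were lumped into one family
```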

A recent paper argues that in well-specified decision frameworks where the correction family is defined correctly, Bonferroni can actually outperform more complex alternatives because its simultaneous confidence intervals enable better post-hoc decision-making.

When should you use something other than Bonferroni?

If the correction family is large (more than 10 tests), the cumulative power loss from Bonferroni becomes meaningful, and step-down methods like Holm or Hommel provide a real advantage. If you're screening hundreds of metrics for hypothesis generation rather than making shipping decisions, the Benjamini-Hochberg correction with its FDR control is more appropriate.
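
For larger families, the comparison is easy to run with multipletests from statsmodels; the p-values below are hypothetical:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from a larger correction family (12 tests).
p_values = [0.002, 0.0045, 0.012, 0.021, 0.034, 0.041,
            0.048, 0.11, 0.24, 0.38, 0.52, 0.71]

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:>10}: {reject.sum()} of {len(p_values)} rejected")
```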

For the typical A/B test with a handful of success metrics, Bonferroni is the right default. The power you give up is small. The properties you get in return (simultaneous intervals, simple sample sizing, full transparency) are hard to replicate with alternatives.

Related terms

Multiple Testing
Multiple Testing Correction

A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.

Multiple Testing
Family-Wise Error Rate

Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.

Multiple Testing
Correction Family

A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment.

Multiple Testing
Holm Correction

The Holm correction (also called Holm-Bonferroni) is a step-down multiple testing procedure that controls the family-wise error rate (FWER) while being uniformly more powerful than the Bonferroni correction.

Multiple Testing
Hommel Correction

The Hommel correction is a multiple testing procedure that controls the family-wise error rate (FWER) while being more powerful than both the Bonferroni correction and the Holm correction.

Multiple Testing
Benjamini-Hochberg Correction

The Benjamini-Hochberg (BH) correction is a multiple testing procedure that controls the false discovery rate (FDR): the expected proportion of false positives among all results declared significant.

Multiple Testing
False Discovery Rate

False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant.

Statistical Methods
Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.