The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests. If you're evaluating 3 success metrics at alpha = 0.05, each individual test uses a threshold of 0.05/3 ≈ 0.017. This guarantees that the family-wise error rate (FWER), the probability of at least one false positive across the entire set, stays at or below 5%.
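The mechanics are a one-liner. A minimal sketch (the function name and the example p-values are illustrative, not from any particular library):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

# Three success metrics at a family-wise alpha of 0.05
threshold = bonferroni_threshold(0.05, 3)
print(round(threshold, 4))  # 0.0167

# Each metric is significant only if its p-value clears the adjusted threshold
p_values = [0.012, 0.030, 0.041]
significant = [p < threshold for p in p_values]
print(significant)  # [True, False, False]
```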
Bonferroni has a reputation for being too conservative. That reputation is mostly wrong, at least for A/B testing. The Confidence blog's analysis of multiple testing corrections found the power gap between Bonferroni and more sophisticated FWER methods (Holm, Hommel) is only 4 to 5 percentage points for typical product experiments with 1 to 5 success metrics. That's a small price for a method that's trivial to implement, easy to explain to stakeholders, and uniquely provides simultaneous confidence intervals for every metric.
Why does Confidence use Bonferroni as the default?
Three properties make Bonferroni well-suited for A/B testing in ways that more complex alternatives can't match.
Simultaneous confidence intervals. Bonferroni is the only standard FWER method that produces valid confidence intervals for every metric simultaneously. That means you can look at the estimated effect and its interval for each success metric, and the coverage guarantee holds for all of them at once. Step-down methods like the Holm correction and Hommel correction control FWER for the rejection decisions, but they don't produce simultaneous intervals with the same guarantee. When stakeholders ask "how big was the effect on metric X?", the Bonferroni-adjusted interval gives a trustworthy answer.
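Constructing those simultaneous intervals is mechanical: build each metric's two-sided interval at the per-test level alpha/m, and the whole set has joint coverage of at least 1 − alpha. A sketch using a normal approximation (the metric names, estimates, and standard errors below are hypothetical):

```python
from statistics import NormalDist

def bonferroni_ci(estimate, std_error, alpha, n_tests):
    """Two-sided CI at per-test level alpha/n_tests. The collection of
    such intervals across all n_tests metrics has simultaneous
    coverage of at least 1 - alpha."""
    z = NormalDist().inv_cdf(1 - (alpha / n_tests) / 2)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical effect estimates (estimate, standard error) for 3 success metrics
metrics = {
    "conversion": (0.012, 0.004),
    "revenue_per_user": (0.35, 0.20),
    "retention": (0.008, 0.005),
}
for name, (est, se) in metrics.items():
    lo, hi = bonferroni_ci(est, se, alpha=0.05, n_tests=3)
    print(f"{name}: [{lo:.4f}, {hi:.4f}]")
```

Note that the critical value rises from about 1.96 (unadjusted) to about 2.39 (three tests), which is exactly the widening you pay for the joint coverage guarantee.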
Straightforward sample size calculations. Because Bonferroni simply adjusts the per-test alpha, power analysis is straightforward: calculate the sample size for the adjusted alpha. Step-down methods like Holm have power that depends on the unknown configuration of true and false nulls, making exact sample size calculations impossible without assumptions about which effects are real. In practice, you'd size for Bonferroni anyway and treat any additional power from Holm or Hommel as a bonus.
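To make that concrete, here is a sketch of a per-group sample size calculation for a two-sided two-proportion z-test, using the standard normal-approximation formula with the Bonferroni-adjusted alpha plugged in. The baseline rate and minimum detectable effect are illustrative:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha, power, n_tests):
    """Per-group n for a two-sided two-proportion z-test, with alpha
    Bonferroni-adjusted across n_tests success metrics.
    Normal-approximation formula; treats mde as an absolute lift."""
    p_alt = p_base + mde
    z_a = NormalDist().inv_cdf(1 - (alpha / n_tests) / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_alt) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p_base * (1 - p_base) + p_alt * (1 - p_alt)) ** 0.5) ** 2
    return ceil(numerator / mde ** 2)

# 10% baseline conversion, 1 percentage point MDE, 80% power
n_uncorrected = sample_size_per_group(0.10, 0.01, alpha=0.05, power=0.80, n_tests=1)
n_bonferroni = sample_size_per_group(0.10, 0.01, alpha=0.05, power=0.80, n_tests=3)
print(n_uncorrected, n_bonferroni)
```

The correction's cost shows up directly as the gap between the two numbers; for step-down methods there is no equivalent closed-form calculation.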
Transparency. The logic fits in one sentence: divide alpha by the number of tests. Every stakeholder in the room can understand why the threshold changed and what it means. Methods that involve sorting p-values and applying rank-dependent thresholds are harder to explain, which matters when the goal is building trust in the experimentation process.
How conservative is Bonferroni really?
The conservatism concern comes from two sources, one legitimate and one that's usually a denominator mistake.
The legitimate source: Bonferroni assumes nothing about the correlation structure between tests. If your metrics are highly positively correlated (as many product metrics are), the effective number of independent tests is smaller than the literal count, and Bonferroni over-corrects slightly. Methods like the Hommel correction can exploit that structure for extra power.
The denominator mistake: teams sometimes apply Bonferroni across every metric in the experiment, including guardrails, quality indicators, and exploratory measures. Correcting across 20 metrics when only 3 are success metrics is expensive regardless of which FWER method you use. The fix is a smaller correction family. Confidence separates success metrics from other types and corrects only across the success metrics. That keeps the denominator at 1 to 5 for most experiments, where Bonferroni's conservatism is minimal.
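The fix is easy to encode: count only success metrics when computing the adjusted threshold. A sketch, with hypothetical metric names and a simplified role tagging scheme (not Confidence's actual API):

```python
def adjusted_alpha(metrics, alpha=0.05):
    """Correct only across success metrics; guardrails and exploratory
    metrics sit outside the correction family."""
    n_success = sum(1 for m in metrics if m["role"] == "success")
    return alpha / max(n_success, 1)

experiment = [
    {"name": "conversion", "role": "success"},
    {"name": "revenue_per_user", "role": "success"},
    {"name": "latency_p99", "role": "guardrail"},
    {"name": "clicks_per_session", "role": "exploratory"},
]
# 0.05 / 2 = 0.025, not 0.05 / 4 = 0.0125
print(adjusted_alpha(experiment))  # 0.025
```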
A recent paper argues that in well-specified decision frameworks where the correction family is defined correctly, Bonferroni can actually outperform more complex alternatives because its simultaneous confidence intervals enable better post-hoc decision-making.
When should you use something other than Bonferroni?
If the correction family is large (more than 10 tests), the cumulative power loss from Bonferroni becomes meaningful, and step-down methods like Holm or Hommel provide a real advantage. If you're screening hundreds of metrics for hypothesis generation rather than making shipping decisions, the Benjamini-Hochberg correction with its FDR control is more appropriate.
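For comparison, both alternatives fit in a few lines each. A sketch of the Holm step-down and Benjamini-Hochberg step-up procedures (example p-values are illustrative):

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value to
    alpha/(m - k + 1); stop rejecting at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: find the largest k with p_(k) <= (k/m) * alpha and
    reject the k smallest p-values. Controls FDR, not FWER."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject

# Bonferroni rejects only p-values below 0.05/3 = 0.0167;
# Holm and BH both also pick up the 0.04.
p = [0.001, 0.013, 0.04]
print(holm(p))                # [True, True, True]
print(benjamini_hochberg(p))  # [True, True, True]
```

The example shows the extra power step-down methods buy: with three tests, Bonferroni misses the 0.04 p-value while Holm catches it, because Holm's threshold relaxes to the full alpha once the smaller p-values have been rejected.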
For the typical A/B test with a handful of success metrics, Bonferroni is the right default. The power you give up is small. The properties you get in return (simultaneous intervals, simple sample sizing, full transparency) are hard to replicate with alternatives.