The Benjamini-Hochberg (BH) correction is a multiple testing procedure that controls the false discovery rate (FDR): the expected proportion of false positives among all results declared significant. Unlike family-wise error rate (FWER) methods such as the Bonferroni correction, which guarantee the probability of any false positive stays below alpha, BH accepts that some false positives will occur and instead keeps their fraction controlled. If you reject 10 hypotheses under BH at FDR = 5%, you should expect, on average, at most about 0.5 of those to be false.
This makes BH substantially more powerful than FWER methods when the number of tests is large and many have real effects. It's the standard correction for large-scale screening problems, and it's the most cited multiple testing procedure in statistics.
How does the Benjamini-Hochberg procedure work?
The procedure is simple; a short code sketch follows the steps.
- Sort all p-values from smallest to largest. Call them p(1), p(2), ..., p(m), where m is the total number of tests.
- For each rank i, compute the threshold: (i/m) * alpha.
- Find the largest rank k where p(k) is less than or equal to (k/m) * alpha.
- Reject all hypotheses with rank 1 through k.
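In code, this is a sort, a linear threshold, and a search for the largest passing rank. Here's a minimal NumPy sketch; the function name and interface are my own, not taken from any particular library:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask, True where the hypothesis is rejected under BH."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                            # ranks 1..m, smallest first
    passing = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passing.any():
        k = passing.nonzero()[0].max()               # largest passing rank (0-based)
        reject[order[: k + 1]] = True                # reject ranks 1 through k
    return reject
```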
The thresholds increase linearly with rank. The smallest p-value is compared to alpha/m (same as Bonferroni). The largest is compared to alpha itself. This means BH is always at least as powerful as Bonferroni and often much more powerful, because later-ranked p-values face progressively easier thresholds.
A worked example: 5 tests at FDR = 0.05 with p-values of 0.001, 0.008, 0.039, 0.041, and 0.22.
The thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05. The first p-value (0.001) passes its threshold (0.01), and the second (0.008) passes (0.02). The third (0.039) exceeds its threshold (0.03), the fourth (0.041) exceeds 0.04, and the fifth (0.22) exceeds 0.05. So the largest passing rank is k = 2, and the first two hypotheses are rejected. Note that a failure at one rank doesn't end the search: if the fourth p-value had instead been 0.040, then k would be 4 and all four hypotheses would be rejected, including the third despite missing its own threshold.
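You can reproduce this with statsmodels, which implements BH under the method name "fdr_bh":

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.22]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject)  # [ True  True False False False]
print(p_adj)   # BH-adjusted p-values; only the first two are <= 0.05
```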
When is BH appropriate for A/B testing?
For most A/B test shipping decisions, BH is not the right default. The reason comes down to what FDR controls and what it doesn't.
FDR controls the false discovery proportion only in expectation. If you're screening 100 metrics and FDR control tells you 15 are significant, you can expect about 0.75 of those to be false. That's a useful guarantee for prioritizing follow-up investigation. It's a weaker guarantee for shipping decisions, where each individual false positive has a concrete cost: engineering time spent on a feature that didn't work, or a user experience change that produced no real benefit.
FWER methods (Bonferroni, Holm, Hommel) provide the stronger guarantee: at most a 5% chance that any of the significant results is a false positive. For A/B tests with 1 to 5 success metrics, the power cost of FWER control is modest, and the protection is directly aligned with the decision being made. That's why Confidence uses FWER control (Bonferroni by default) for success metrics.
BH becomes valuable in the experimentation context for exploratory analysis. If a team scans 30 to 50 secondary metrics after an experiment to identify promising directions for future tests, FDR control lets them find signals without the severe power penalty of FWER correction across that many tests. The results inform what to investigate next. They don't determine what to ship now.
How does BH compare to FWER methods on power?
The power advantage of BH over FWER methods depends on two things: the number of tests and the fraction that have real effects (the non-null proportion).
With few tests and few real effects, BH and Bonferroni perform similarly. Both are conservative, and the FDR guarantee doesn't buy much extra power.
With many tests and a substantial non-null fraction, BH pulls far ahead. In genomics studies with 20,000 tests and perhaps 500 real effects, BH might reject 400 hypotheses while Bonferroni rejects 50. The FDR guarantee (you expect about 20 of those 400 to be false) is perfectly acceptable for generating a list of candidates for follow-up.
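A quick Monte Carlo sketch makes the dependence on both factors concrete. The setup here (normal test statistics, an effect size of 3, 200 replications) is illustrative, not drawn from the original analysis:

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

def power_comparison(m, n_effects, effect=3.0, alpha=0.05, reps=200):
    """Average power on the true effects for BH vs. Bonferroni."""
    bh_hits = bonf_hits = 0
    for _ in range(reps):
        z = rng.standard_normal(m)
        z[:n_effects] += effect               # plant the non-null effects
        p = 2 * norm.sf(np.abs(z))            # two-sided p-values
        bh_hits += multipletests(p, alpha, method="fdr_bh")[0][:n_effects].sum()
        bonf_hits += (p[:n_effects] <= alpha / m).sum()
    return bh_hits / (reps * n_effects), bonf_hits / (reps * n_effects)

print(power_comparison(m=20_000, n_effects=500))  # many tests: BH far ahead
print(power_comparison(m=5, n_effects=2))         # few tests: the gap shrinks
```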
For A/B testing with small correction families (1 to 5 success metrics), the Confidence blog's analysis found the power gap between Bonferroni and the most powerful FWER methods is only 4 to 5 percentage points. BH would add some power beyond that, but the trade-off is a weaker error guarantee for each individual decision.
Does BH require independence between tests?
The original BH result assumes independent test statistics. A later result by Benjamini and Yekutieli (2001) proved BH also controls FDR under positive dependence (technically, PRDS: positive regression dependence on each one from a subset). This covers most practical cases in A/B testing, where success metrics tend to be positively correlated.
Under arbitrary dependence, the standard BH procedure may not control FDR at the nominal level. The Benjamini-Yekutieli (BY) adjustment handles arbitrary dependence, but at a substantial power cost. For A/B testing scenarios with positively correlated metrics, the standard BH procedure is valid without modification.
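The BY adjustment simply shrinks every BH threshold by the harmonic-sum factor c(m) = 1 + 1/2 + ... + 1/m (statsmodels exposes it as method="fdr_by"). A short sketch of the thresholds shows how quickly that penalty grows:

```python
import numpy as np

def by_thresholds(m, alpha=0.05):
    """BH thresholds divided by the harmonic-sum factor c(m)."""
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    return alpha * np.arange(1, m + 1) / (m * c_m)

# For m = 50 secondary metrics, c(m) is about 4.5, so each threshold
# is roughly 4.5x harder to clear than under plain BH.
print(by_thresholds(50)[:3])
```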