What is a Benjamini-Hochberg Correction?

The Benjamini-Hochberg (BH) correction is a multiple testing procedure that controls the false discovery rate (FDR): the expected proportion of false positives among all results declared significant. Unlike family-wise error rate (FWER) methods such as the Bonferroni correction, which keep the probability of any false positive at or below alpha, BH accepts that some false positives will occur and instead bounds their expected share. If you reject 10 hypotheses under BH at FDR = 5%, you expect, on average, at most about 0.5 of them to be false.

This makes BH substantially more powerful than FWER methods when the number of tests is large and many have real effects. It's the standard correction for large-scale screening problems, and it's one of the most cited multiple testing procedures in statistics.

How does the Benjamini-Hochberg procedure work?

The procedure has four steps.

  1. Sort all p-values from smallest to largest. Call them p(1), p(2), ..., p(m), where m is the total number of tests.
  2. For each rank i, compute the threshold: (i/m) * alpha.
  3. Find the largest rank k where p(k) is less than or equal to (k/m) * alpha.
  4. Reject all hypotheses with rank 1 through k. If no rank satisfies the condition in step 3, reject nothing.
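The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular platform; the function name `benjamini_hochberg` and its return format (one reject flag per input p-value, in input order) are choices made for this example.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject flag per hypothesis, in the original input order."""
    m = len(p_values)
    # Step 1: rank the hypotheses by p-value, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Steps 2-3: scan every rank and remember the largest one whose
    # p-value is at or below its threshold (rank / m) * alpha.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    # Step 4: reject ranks 1..k (nothing is rejected when k == 0).
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


print(benjamini_hochberg([0.22, 0.001, 0.041, 0.008, 0.039]))
```

Note that the loop does not stop at the first failing rank. A later rank that passes pulls every earlier rank into the rejection set, which is what makes BH a step-up procedure.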

The thresholds increase linearly with rank. The smallest p-value is compared to alpha/m, the same threshold Bonferroni applies to every test, and the largest is compared to alpha itself. Because every BH threshold is at least alpha/m, BH rejects everything Bonferroni rejects, so it is always at least as powerful and often much more so: later-ranked p-values face progressively easier thresholds.

A worked example: 5 tests at FDR = 0.05 with p-values of 0.001, 0.008, 0.039, 0.041, and 0.22.

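The thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05. The first p-value (0.001) passes its threshold (0.01). The second (0.008) passes (0.02). The third (0.039) exceeds its threshold (0.03), the fourth (0.041) exceeds 0.04, and the fifth (0.22) exceeds 0.05. The largest rank that passes is therefore k = 2, and the first two hypotheses are rejected. All five ranks have to be checked because BH looks for the largest passing rank: had the fourth p-value been 0.040 or lower, k would have been 4 and the third hypothesis would have been rejected as well, despite missing its own threshold.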
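The same example can be traced rank by rank. The short script below is only a sketch for checking the arithmetic; the variable names are illustrative.

```python
alpha = 0.05
p_sorted = [0.001, 0.008, 0.039, 0.041, 0.22]  # already in ascending order
m = len(p_sorted)

k = 0      # largest passing rank seen so far
rows = []  # (rank, p-value, threshold, passes) for each rank
for rank, p in enumerate(p_sorted, start=1):
    threshold = rank / m * alpha
    passes = p <= threshold
    rows.append((rank, p, threshold, passes))
    if passes:
        k = rank
    print(f"rank {rank}: p = {p:.3f}, threshold = {threshold:.2f}, passes = {passes}")

print("largest passing rank k =", k)
```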

When is BH appropriate for A/B testing?

For most A/B test shipping decisions, BH is not the right default. The reason comes down to what FDR controls and what it doesn't.

FDR guarantees the false discovery proportion in expectation. If you're screening 100 metrics and BH at a 5% FDR flags 15 as significant, you can expect about 0.75 of those to be false. That's a useful guarantee for prioritizing follow-up investigation. It's a weaker guarantee for shipping decisions, where each individual false positive has a concrete cost: engineering time spent on a feature that didn't work, or a user experience change that produced no real benefit.
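The figures quoted for expected false positives come from a simple product of the FDR level and the number of rejections. A rough back-of-envelope sketch, assuming a 5% FDR level (the function name is hypothetical, and the product is a heuristic reading of the FDR guarantee rather than an exact expectation):

```python
def expected_false_discoveries(num_rejected: int, fdr_level: float = 0.05) -> float:
    """Back-of-envelope bound: FDR level times the number of rejections."""
    return fdr_level * num_rejected


for rejected in (10, 15, 400):
    bound = expected_false_discoveries(rejected)
    print(f"{rejected} rejections: about {bound:.2f} expected false positives")
```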

FWER methods (Bonferroni, Holm, Hommel) provide the stronger guarantee: at alpha = 0.05, at most a 5% chance that any of the significant results is a false positive. For A/B tests with 1 to 5 success metrics, the power cost of FWER control is modest, and the protection is directly aligned with the decision being made. That's why Confidence uses FWER control (Bonferroni by default) for success metrics.
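To see what the FWER guarantee buys for small families, the sketch below computes the chance of at least one false positive for 1 to 5 metrics, with and without a Bonferroni split of alpha. It assumes independent tests and that every null hypothesis is true, which is a simplification chosen for this illustration; the helper name `fwer` is hypothetical.

```python
def fwer(m: int, per_test_alpha: float) -> float:
    """Chance of at least one false positive across m independent true nulls."""
    return 1.0 - (1.0 - per_test_alpha) ** m


alpha = 0.05
for m in range(1, 6):
    uncorrected = fwer(m, alpha)     # every test run at alpha
    bonferroni = fwer(m, alpha / m)  # every test run at alpha / m
    print(f"m = {m}: uncorrected FWER = {uncorrected:.3f}, Bonferroni FWER = {bonferroni:.3f}")
```

Under these assumptions the uncorrected rate climbs past 20% at five metrics, while the Bonferroni rate stays at or just under alpha.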

BH becomes valuable in the experimentation context for exploratory analysis. If a team scans 30 to 50 secondary metrics after an experiment to identify promising directions for future tests, FDR control lets them find signals without the severe power penalty of FWER correction across that many tests. The results inform what to investigate next. They don't determine what to ship now.

How does BH compare to FWER methods on power?

The power advantage of BH over FWER methods depends on two things: the number of tests and the fraction that have real effects (the non-null proportion).

With few tests and few real effects, BH and Bonferroni perform similarly. Both are conservative, and the FDR guarantee doesn't buy much extra power.

With many tests and a substantial non-null fraction, BH pulls ahead substantially. In a genomics study with 20,000 tests and perhaps 500 real effects, BH might reject 400 hypotheses while Bonferroni rejects 50. The FDR guarantee (at a 5% FDR, you expect about 20 of those 400 to be false) is perfectly acceptable for generating a list of candidates for follow-up.
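A small simulation makes the gap concrete. The sketch below uses only the Python standard library and draws one-sided z-test p-values; the number of tests, the non-null fraction, the effect size, and the seed are all illustrative assumptions and are not taken from any study mentioned here.

```python
import random
from statistics import NormalDist


def count_rejections(p_values, alpha=0.05):
    """Return (Bonferroni rejections, BH rejections) for the same p-values."""
    m = len(p_values)
    ordered = sorted(p_values)
    bonferroni = sum(p <= alpha / m for p in ordered)
    bh = 0
    for rank, p in enumerate(ordered, start=1):
        if p <= rank / m * alpha:
            bh = rank  # largest passing rank so far
    return bonferroni, bh


def simulate_p_values(m=2000, non_null_fraction=0.1, effect_z=3.0, seed=0):
    """One-sided p-values from z-statistics; the first tests carry a real effect."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    n_real = int(m * non_null_fraction)
    p_values = []
    for i in range(m):
        shift = effect_z if i < n_real else 0.0
        z = rng.gauss(shift, 1.0)
        p_values.append(1.0 - std_normal.cdf(z))
    return p_values


bonf_count, bh_count = count_rejections(simulate_p_values())
print(f"Bonferroni rejects {bonf_count}, BH rejects {bh_count}")
```

Whatever the seed, the BH count can never fall below the Bonferroni count, since every BH threshold is at least alpha/m. How far ahead it gets depends on the assumed effect size and non-null fraction.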

For A/B testing with small correction families (1 to 5 success metrics), the Confidence blog's analysis found the power gap between Bonferroni and the most powerful FWER methods is only 4 to 5 percentage points. BH would add some power beyond that, but the trade-off is a weaker error guarantee for each individual decision.

Does BH require independence between tests?

The original BH result assumes independent test statistics. A later result by Benjamini and Yekutieli (2001) proved BH also controls FDR under positive dependence (technically, PRDS: positive regression dependence on each one from a subset). This covers many practical cases in A/B testing, where success metrics tend to be positively correlated, although positive correlation on its own is not a formal proof of PRDS.

Under arbitrary dependence, the standard BH procedure may not control FDR at the nominal level. The Benjamini-Yekutieli adjustment handles arbitrary dependence by running BH at alpha divided by the harmonic sum 1 + 1/2 + ... + 1/m, which comes at a significant power cost. For A/B testing scenarios with positively correlated metrics, the standard BH procedure is generally treated as valid without modification.
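The Benjamini-Yekutieli adjustment amounts to running the BH steps at a smaller level: alpha divided by the harmonic sum 1 + 1/2 + ... + 1/m. A minimal sketch follows; the function names are illustrative and not taken from any library.

```python
def benjamini_yekutieli_alpha(alpha: float, m: int) -> float:
    """Shrink alpha by the harmonic sum 1 + 1/2 + ... + 1/m."""
    harmonic_sum = sum(1.0 / i for i in range(1, m + 1))
    return alpha / harmonic_sum


def by_rejection_count(p_values, alpha=0.05):
    """Number of hypotheses rejected by BH run at the BY-adjusted level."""
    m = len(p_values)
    adjusted_alpha = benjamini_yekutieli_alpha(alpha, m)
    k = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * adjusted_alpha:
            k = rank  # largest passing rank so far
    return k


p_values = [0.001, 0.008, 0.039, 0.041, 0.22]
print("adjusted alpha:", round(benjamini_yekutieli_alpha(0.05, len(p_values)), 4))
print("rejections:", by_rejection_count(p_values))
```

With five tests the effective level drops from 0.05 to roughly 0.022, and the penalty grows slowly (logarithmically) as m increases.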

Related terms

Multiple Testing
False Discovery Rate

False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant.

Multiple Testing
Multiple Testing Correction

A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.

Multiple Testing
Family-Wise Error Rate

Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.

Multiple Testing
Bonferroni Correction

The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests.

Multiple Testing
Holm Correction

The Holm correction (also called Holm-Bonferroni) is a step-down multiple testing procedure that controls the family-wise error rate (FWER) while being uniformly more powerful than the Bonferroni c...

Multiple Testing
Hommel Correction

The Hommel correction is a multiple testing procedure that controls the family-wise error rate (FWER) while being more powerful than both the Bonferroni correction and the Holm correction.

Multiple Testing
Correction Family

A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment.

Statistical Methods
Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.


© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.