Multiple Testing

What is a Holm Correction?

The Holm correction (also called Holm-Bonferroni) is a step-down multiple testing procedure that controls the family-wise error rate (FWER) while being uniformly more powerful than the Bonferroni correction. It works by sorting p-values from smallest to largest and comparing each to a progressively less strict threshold. The smallest p-value is tested against alpha divided by the total number of tests (the same threshold Bonferroni uses). If that hypothesis is rejected, the next p-value is tested against alpha divided by one less than the total number of tests. The process continues until a p-value exceeds its threshold, at which point that hypothesis and all remaining hypotheses are retained.

The practical difference between Holm and Bonferroni is modest for small correction families. With 3 success metrics in a typical A/B test, the power gap is around 4 to 5 percentage points, as measured in the Confidence blog's analysis of multiple testing corrections. That means Holm detects a real effect slightly more often, but both methods perform similarly when the family size stays small.

How does the step-down procedure work?

The logic is sequential and intuitive.

Suppose you have 4 success metrics with p-values of 0.003, 0.012, 0.028, and 0.41, and your target alpha is 0.05.

  1. Sort the p-values: 0.003, 0.012, 0.028, 0.41.
  2. Compare the smallest (0.003) to 0.05/4 = 0.0125. It's smaller, so reject it.
  3. Compare the next (0.012) to 0.05/3 = 0.0167. It's smaller, so reject it.
  4. Compare the next (0.028) to 0.05/2 = 0.025. It's larger, so stop. Retain this and all remaining hypotheses.

Result: two metrics are significant, two are not. Bonferroni would have tested every p-value against 0.05/4 = 0.0125 and only rejected the first one. Holm picks up the second because, once the first hypothesis is rejected, the remaining family is effectively smaller.

This is why Holm is called a step-down method: it starts from the most significant result and steps down through the sorted list, relaxing the threshold at each step.

When does Holm outperform Bonferroni?

Holm's advantage grows with two factors: the size of the correction family and the number of genuinely non-null hypotheses.

When most hypotheses are truly null (nothing is different), Holm and Bonferroni behave almost identically. The step-down relaxation only helps when earlier hypotheses are rejected, which rarely happens when there's nothing to find.

When several metrics have real effects, Holm's sequential relaxation kicks in. Each rejected hypothesis loosens the threshold for the remaining ones, giving the procedure more power to detect additional effects. With a correction family of 10 metrics and 5 real effects, the difference can be substantial.
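A quick Monte Carlo simulation makes the effect visible. This is an illustrative sketch with made-up effect sizes, not the Confidence blog's analysis: 10 z-tests, 5 with a real mean shift, comparing how often each procedure detects the real effects.

```python
# Illustrative power comparison: Holm vs Bonferroni with 10 tests,
# 5 of which have a real effect (all parameters here are made up).
import math
import random

def p_value(z):
    # Two-sided p-value for a standard-normal test statistic.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def holm_reject(ps, alpha=0.05):
    m = len(ps)
    order = sorted(range(m), key=lambda i: ps[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if ps[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

random.seed(1)
trials, effect = 2000, 3.0  # effect = mean shift of the test statistic
holm_hits = bonf_hits = 0
for _ in range(trials):
    # First 5 tests are non-null (shifted mean), last 5 are true nulls.
    ps = [p_value(random.gauss(effect, 1)) for _ in range(5)] + \
         [p_value(random.gauss(0, 1)) for _ in range(5)]
    holm_hits += sum(holm_reject(ps)[:5])
    bonf_hits += sum(p <= 0.05 / 10 for p in ps[:5])

print(f"Holm power ≈ {holm_hits / (5 * trials):.3f}")
print(f"Bonferroni power ≈ {bonf_hits / (5 * trials):.3f}")
```

Because Holm's first threshold equals Bonferroni's and only relaxes from there, Holm rejects a superset of what Bonferroni rejects in every trial, so its estimated power is never lower.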

For A/B tests with 1 to 5 success metrics (the typical scenario in Confidence), the gap is small. Most experiments have one or two real effects at most, so the step-down doesn't have many opportunities to relax the threshold.

What does Holm give up compared to Bonferroni?

The main trade-off is that Holm doesn't produce simultaneous confidence intervals. It controls FWER for the rejection decisions (which hypotheses are significant), but the adjusted p-values it produces don't correspond to confidence intervals with joint coverage guarantees. If you want to report "the effect on metric X was between A and B, and the effect on metric Y was between C and D, and both intervals hold simultaneously," Bonferroni gives you that. Holm doesn't.

This matters when stakeholders want to interpret effect sizes alongside significance decisions. If the question is purely "which metrics are significant?", Holm is a strict improvement. If the question is "what's the plausible range of the effect on each significant metric?", the Bonferroni-adjusted intervals are more useful.
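For the Bonferroni side of this trade-off, the simultaneous intervals are straightforward to compute: each interval uses the adjusted confidence level, and joint coverage across the family is at least 1 minus alpha. The sketch below assumes normal-approximation intervals; the metric names, estimates, and standard errors are hypothetical:

```python
# Sketch of Bonferroni-adjusted simultaneous confidence intervals.
# The estimates and standard errors below are made-up illustrations.
from statistics import NormalDist

def bonferroni_ci(estimate, std_error, alpha=0.05, n_metrics=1):
    """Two-sided normal-approximation CI at the Bonferroni-adjusted level;
    across all n_metrics intervals, joint coverage is at least 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_metrics))
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical effect estimates and standard errors for 3 success metrics:
for name, est, se in [("conversion", 0.021, 0.008),
                      ("revenue", 1.4, 0.9),
                      ("retention", 0.006, 0.004)]:
    lo, hi = bonferroni_ci(est, se, n_metrics=3)
    print(f"{name}: [{lo:.4f}, {hi:.4f}]")
```

The cost is visible in the code: splitting alpha three ways widens each interval relative to an unadjusted one, which is the price of the joint coverage guarantee.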

Holm also makes sample size calculations harder. Bonferroni's per-test alpha is fixed before the experiment, so you can calculate the required sample size directly. Holm's effective threshold for each test depends on how many other tests are rejected, which you don't know in advance. In practice, teams size their experiments for the Bonferroni threshold and treat Holm's extra power as a bonus.
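Because Bonferroni's per-test alpha is fixed up front, the sample size calculation is a direct formula. The sketch below assumes a two-sided two-sample z-test with a standardized effect size; the effect size and metric counts are made-up illustrations:

```python
# Sketch: sizing each arm of a two-sample z-test at the Bonferroni-adjusted
# alpha. Effect size and metric counts below are illustrative assumptions.
import math
from statistics import NormalDist

def sample_size_per_group(delta, alpha=0.05, power=0.8, n_metrics=1):
    """Approximate n per group for a two-sided two-sample z-test,
    where delta is the standardized effect size (difference / sd)."""
    adjusted_alpha = alpha / n_metrics  # Bonferroni split across metrics
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# Correcting for 3 success metrics raises the per-group requirement:
print(sample_size_per_group(0.1, n_metrics=1))
print(sample_size_per_group(0.1, n_metrics=3))
```

Sizing for the Bonferroni threshold, as described above, guarantees the planned power even for the smallest p-value in the Holm procedure; any extra rejections Holm picks up come for free.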

How does Confidence handle the Holm correction?

Confidence uses the Bonferroni correction as the default for FWER control on success metrics. This choice reflects the practical advantages of simultaneous confidence intervals and straightforward power calculations, which matter for how teams interpret and act on results. The 4 to 5 percentage point power gap between Bonferroni and Holm is a known cost, accepted in exchange for those properties.