The Hommel correction is a multiple testing procedure that controls the family-wise error rate (FWER) while being more powerful than both the Bonferroni correction and the Holm correction. It achieves this by exploiting the joint distribution of p-values under the assumption that the test statistics are independent or positively dependent. Where Holm processes p-values sequentially from smallest to largest, Hommel uses a more complex algorithm that considers all p-values simultaneously, finding the largest number of hypotheses it can reject while maintaining FWER control.
In practice, Hommel represents the upper end of what FWER-controlling methods can achieve within the Bonferroni family of procedures. It's the most powerful of the three standard FWER methods (Bonferroni, Holm, Hommel), but the incremental gain over Holm is smaller than Holm's gain over Bonferroni.
How does the Hommel procedure work?
The algorithm is more involved than Holm's step-down approach. Instead of processing p-values one at a time, Hommel searches for the largest integer j such that each of the j largest p-values exceeds its rank-adjusted threshold.
Concretely, with n sorted p-values p(1) ≤ ... ≤ p(n) and significance level α, the procedure checks each candidate j from n down to 1, asking whether p(n−j+k) > kα/j for every k = 1, ..., j. The largest j that passes sets the rejection cutoff: every hypothesis with a p-value at or below α/j is rejected. If no j passes (which happens when even the largest p-value falls at or below α), all hypotheses are rejected.
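In code, the search can be sketched as follows. This is a direct reading of Hommel's (1988) procedure, not the exact implementation behind any particular statistics library:

```python
def hommel_reject(pvals, alpha=0.05):
    """Hommel (1988) procedure: returns a list of booleans, True = reject.

    Sketch implementation, assuming the validity conditions
    (independence or PRDS) hold for the test statistics.
    """
    n = len(pvals)
    p = sorted(pvals)  # ascending order
    # Search for the largest j such that each of the j largest p-values
    # exceeds its rank-adjusted threshold: p[n-j+k-1] > k*alpha/j for k=1..j.
    for j in range(n, 0, -1):
        if all(p[n - j + k - 1] > k * alpha / j for k in range(1, j + 1)):
            # Largest passing j found: reject hypotheses with p <= alpha/j.
            return [pv <= alpha / j for pv in pvals]
    # No j passes (even the largest p-value is <= alpha): reject everything.
    return [True] * n

# A case where Hommel rejects more than Holm: at alpha = 0.05, Holm rejects
# neither hypothesis here (0.026 > 0.05/2 stops it), but Hommel rejects both.
print(hommel_reject([0.026, 0.030]))  # [True, True]
```

The function returns rejection decisions rather than adjusted p-values; library implementations typically report the latter, but the decisions agree.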
The result: Hommel always rejects at least as many hypotheses as Holm, and sometimes more. The additional rejections come from cases where the joint pattern of p-values satisfies the Simes inequality even though no single p-value clears Holm's sequential thresholds.
The computational cost is higher than Holm (which requires only sorting and a single pass), but for the small correction families typical of A/B testing (1 to 5 success metrics), the difference is negligible.
How much more powerful is Hommel than the alternatives?
The Confidence blog's analysis of multiple testing corrections measured the power gap across the three standard FWER methods. The key finding: for typical A/B test scenarios with 1 to 5 success metrics, the gap between Bonferroni and the most powerful FWER methods (including Hommel) is only 4 to 5 percentage points.
That 4 to 5 point gap is the total distance from the simplest method (Bonferroni) to the most powerful (Hommel). Holm captures most of that gap. Hommel's additional gain over Holm is a fraction of the already-small total.
The power improvement becomes more meaningful as the correction family grows. With 10 or more tests and several real effects, Hommel's ability to consider all p-values simultaneously gives it a real edge. For small families, the differences are often below the level that would change an experimental decision.
What are the trade-offs?
Hommel shares the same limitations as the Holm correction relative to Bonferroni, and adds two of its own.
No simultaneous confidence intervals. Like Holm, Hommel controls FWER for the rejection decisions but doesn't produce confidence intervals with joint coverage guarantees. When stakeholders need to interpret effect sizes (not just significance), Bonferroni-adjusted intervals are more informative.
Sample size calculations are harder. The effective rejection threshold for any individual test depends on the full pattern of p-values across all tests, which isn't known at the design stage. Teams typically size experiments using the Bonferroni threshold and treat any extra power from Hommel as a bonus.
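That sizing practice can be sketched with a standard two-sample normal approximation. The function below is illustrative, not from the original analysis; it assumes two-sided tests, equal variances, and equal group sizes:

```python
import math
from statistics import NormalDist

def n_per_group(effect, sd, m_tests, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample z-test,
    sized at the Bonferroni-adjusted level alpha/m_tests.

    Illustrative sketch: any extra power from using Hommel at
    analysis time is then a bonus, not part of the design.
    """
    z_alpha = NormalDist().inv_cdf(1 - (alpha / m_tests) / 2)  # two-sided
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Sizing at the Bonferroni threshold grows with the number of metrics:
for m in (1, 3, 5):
    print(m, n_per_group(effect=0.2, sd=1.0, m_tests=m))
```

Because the Bonferroni threshold is the most conservative in the family, an experiment sized this way is never underpowered relative to what Hommel will actually require at analysis time.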
Assumption of independence or positive dependence. Hommel's validity requires the test statistics to be independent or to satisfy a positive dependence condition called PRDS (positive regression dependence on a subset). For most product metrics in A/B tests, this assumption holds because metrics tend to move in the same direction. Negatively correlated test statistics could violate it.
Harder to explain. Bonferroni's logic fits in one sentence. Holm's step-down procedure is intuitive after a worked example. Hommel's algorithm is genuinely complex. For teams that value transparency in their statistical methods (and want stakeholders to trust the results without taking the methodology on faith), this matters.
When should you use Hommel?
Hommel is the right choice when you have a moderately large correction family (5 to 15 tests), you don't need simultaneous confidence intervals, and you want to maximize the number of detections under FWER control. It's the most powerful option in the Bonferroni-Holm-Hommel family under these conditions.
For the typical A/B test with 1 to 3 success metrics, Bonferroni's simplicity and unique properties (simultaneous intervals, straightforward sample sizing) outweigh Hommel's marginal power gain. That's why Confidence uses Bonferroni as its default.
If the correction family grows beyond 15 to 20 tests, consider whether FWER control is still the right goal. At that scale, the Benjamini-Hochberg correction with its false discovery rate control may be more appropriate, depending on whether you're making shipping decisions or screening for hypotheses.
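As a point of contrast, the Benjamini-Hochberg step-up rule is simple to state: sort the p-values, find the largest k with p(k) ≤ kα/m, and reject the k smallest. A minimal sketch, not tied to any particular library:

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate
    (not FWER) at level alpha, assuming independent or PRDS statistics."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest k with p_(k) <= k * alpha / m ...
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k * alpha / m:
            k_max = k
    # ... and reject the k_max smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

Note the step-up direction: a large p-value that clears its threshold can rescue smaller ones above theirs, which is how FDR control buys its extra detections over FWER methods.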