A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment. Without it, the probability of at least one false positive grows with every additional test. Ten independent tests, each at alpha = 0.05, produce a roughly 40% chance of at least one spurious significant result, since 1 - 0.95^10 ≈ 0.40. Multiple testing corrections bring that probability back under control.
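To see where that figure comes from, here is a minimal sketch, assuming the tests are independent and all run at the same alpha:

```python
# Probability of at least one false positive across m independent tests,
# each evaluated at significance level alpha with no correction applied.
def family_wise_error_rate(m: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** m

for m in (1, 3, 10, 20):
    print(f"{m} tests: {family_wise_error_rate(m):.1%} chance of a false positive")
# With 10 tests this comes out to roughly 40.1%, the figure quoted above.
```

Positively correlated metrics typically push the rate somewhat below the independent-tests number, but it still climbs well above 5% as tests accumulate.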
The need for correction shows up in nearly every real A/B test. Teams rarely evaluate a single metric. A typical experiment tracks success metrics (did the change help?), guardrail metrics (did the change cause harm?), and often quality or exploratory metrics. When the shipping decision depends on whether any success metric is significant, the multiple testing problem is present, and ignoring it means accepting a false positive rate much higher than the nominal 5%.
What types of multiple testing corrections exist?
Corrections fall into two families, defined by what they control.
FWER-controlling methods keep the probability of even one false positive below alpha across the entire correction family. The Bonferroni correction is the simplest: divide alpha by the number of tests. The Holm correction is a step-down refinement that's uniformly more powerful than Bonferroni. The Hommel correction squeezes out additional power by exploiting the joint distribution of p-values. All three provide strong FWER control.
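As a sketch of how these adjustments behave, assuming the statsmodels library and made-up p-values for three success metrics:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values for three success metrics.
p_values = [0.012, 0.030, 0.041]

for method in ("bonferroni", "holm", "hommel"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, p_adjusted.round(3).tolist(), reject.tolist())
```

Each hypothesis is rejected when its adjusted p-value stays below the original alpha of 0.05. Holm and Hommel never reject fewer hypotheses than Bonferroni, which is what "uniformly more powerful" means in practice.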
FDR-controlling methods keep the expected proportion of false positives among rejected hypotheses below alpha. The Benjamini-Hochberg correction is the standard. It's more powerful than FWER methods when many tests have real effects, but the guarantee is weaker: it allows some false positives as long as the fraction stays small.
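The step-up procedure behind Benjamini-Hochberg is simple enough to sketch directly; the p-values below are illustrative, standing in for a ten-metric exploratory scan:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
    the largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= k
    return reject

# Illustrative p-values from a ten-metric exploratory scan.
p = [0.001, 0.008, 0.012, 0.041, 0.049, 0.09, 0.22, 0.35, 0.51, 0.74]
print(benjamini_hochberg(p))  # the three smallest p-values are rejected
```

On the same p-values, a Bonferroni threshold of 0.05 / 10 = 0.005 would reject only the smallest one, which is the power difference between the two families in miniature.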
For A/B testing decisions, FWER control is the standard choice. The correction family is typically small (1 to 5 success metrics), the consequences of a single false positive are concrete (shipping a change that didn't help), and the power cost of FWER control is modest at that scale. FDR control makes sense for large-scale screening, like scanning 50 exploratory metrics to generate hypotheses for future experiments.
How much power do multiple testing corrections cost?
Less than most teams assume, when the correction family is defined correctly.
The Confidence blog's analysis of multiple testing corrections found that Bonferroni's power gap versus more sophisticated FWER methods (Holm, Hommel) is only 4 to 5 percentage points for typical A/B test scenarios. That's with 1 to 5 success metrics and effect sizes common in product experiments. The gap is small enough that Bonferroni's simplicity and unique properties (simultaneous confidence intervals, straightforward sample size calculation) make it a strong default.
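The sample size point is concrete: with Bonferroni you simply plan the experiment at alpha divided by the number of success metrics. Here is a minimal sketch using a two-sample z-test approximation, with made-up values for the minimum detectable effect and metric standard deviation:

```python
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80, n_metrics=1):
    """Approximate per-group sample size for a two-sided two-sample z-test,
    with the significance level Bonferroni-adjusted for n_metrics."""
    z_alpha = norm.ppf(1 - (alpha / n_metrics) / 2)
    z_beta = norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

print(round(n_per_group(delta=0.02, sigma=0.5)))               # single success metric
print(round(n_per_group(delta=0.02, sigma=0.5, n_metrics=3)))  # three success metrics
```

With these illustrative numbers, correcting for three success metrics raises the per-group sample size by roughly a third relative to an uncorrected single-metric plan.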
The bigger power lever is the correction family, not the correction method. Correcting across 3 success metrics costs far less power than correcting across 20 metrics of all types. Confidence separates success metrics from guardrails and exploratory metrics, keeping the correction denominator small. That family definition typically matters more than whether you use Bonferroni, Holm, or Hommel.
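To see the family-size effect in isolation, here is a rough power comparison for the same effect and traffic under Bonferroni families of different sizes (all inputs are illustrative):

```python
from scipy.stats import norm

def approx_power(delta, sigma, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z-test at the given alpha."""
    z_alpha = norm.ppf(1 - alpha / 2)
    standardized_effect = delta / (sigma * (2 / n_per_group) ** 0.5)
    return norm.cdf(standardized_effect - z_alpha)

# Same effect, same traffic; only the Bonferroni family size changes.
for n_tests in (1, 3, 20):
    print(n_tests, round(approx_power(0.02, 0.5, 10_000, alpha=0.05 / n_tests), 2))
```

With these inputs, going from one test to three costs noticeably less power than going from one to twenty, which is why keeping the success-metric family small matters more than the choice between Bonferroni, Holm, and Hommel.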
How does Confidence handle multiple testing corrections?
Confidence applies FWER correction to success metrics by default, using the Bonferroni correction. The platform's approach is built on the decision framework from the peer-reviewed paper on risk-aware product decisions, which formalizes how different metric types require different error control.
Success metrics share a correction family because a false positive on any success metric leads to the same bad outcome: shipping based on noise. Guardrail metrics are handled separately because the relevant risk is a false negative (missing a real regression), not a false positive. This separation is a design choice that reflects how experimentation decisions actually work, not a statistical convenience.
The practical benefit is that teams don't need to choose a correction method or define families manually. Confidence encodes the methodology into the default analysis. A team designating their metrics as success, guardrail, or quality gets the right correction automatically.
When can you skip multiple testing correction?
In a strict sense, correction is unnecessary if the experiment has a single pre-registered success metric and the shipping decision is based solely on that metric. One test, one decision, no multiplicity problem.
In practice, few experiments are that clean. Even with one success metric, teams often peek at secondary metrics and let those influence the decision informally. That informal peek reintroduces the multiple testing problem without any formal control.
The safer default: apply correction whenever the shipping decision could be influenced by more than one statistical test. Confidence makes this easy by handling it automatically.