
What is a Correction Family?

A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment. When you apply a method like the Bonferroni correction or the Holm correction, the correction family determines how many tests the method accounts for. Every test inside the family shares the error budget. Every test outside it is irrelevant to the adjustment.

The correction family is the single most consequential choice in multiple testing correction, more so than the specific method you use. A Bonferroni correction across 3 success metrics costs a small amount of power. A Bonferroni correction across 25 metrics (success, guardrails, quality indicators, and exploratory measures all lumped together) costs a lot. At alpha = 0.05, the per-test threshold falls from about 0.017 with 3 metrics to 0.002 with 25. The method is the same. The denominator changed.

What belongs in the correction family?

The answer depends on the decision the tests inform. Tests that jointly drive the same decision belong in the same family. Tests that inform different decisions belong in separate families.

In a typical A/B test, this maps cleanly to metric types:

Success metrics go in the correction family. These are the metrics that determine whether you ship the change. If you have three success metrics and declare victory when any one is significant, that's the multiple testing problem you need to correct for. The family size is 3, not the total number of metrics in the experiment.
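To see why "any one of three is significant" needs a correction, here is a minimal sketch of the family-wise false positive probability. It assumes the tests are independent and that no metric truly moved. Both are simplifying assumptions made for illustration, not something the text above states, and the function name is hypothetical.

```python
def familywise_error_rate(alpha: float, family_size: int) -> float:
    """Chance of at least one false positive across `family_size` tests.

    Assumes independent tests, each run at level `alpha`, with every
    null hypothesis true (illustrative assumptions only).
    """
    return 1.0 - (1.0 - alpha) ** family_size


# Three success metrics, each tested at 0.05 with no correction.
uncorrected = familywise_error_rate(0.05, 3)

# The same three metrics, each tested at the Bonferroni threshold 0.05 / 3.
bonferroni = familywise_error_rate(0.05 / 3, 3)

print(f"uncorrected: {uncorrected:.3f}")
print(f"bonferroni:  {bonferroni:.3f}")
```

Under these assumptions the uncorrected rate is well above the nominal 0.05, while the Bonferroni-adjusted rate stays just under it.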

Guardrail metrics stay outside the family. Guardrails monitor what you don't want to break. The risk with guardrails is a false negative (failing to detect a real regression), not a false positive. Applying the same family-wise error rate (FWER) correction to guardrails would reduce sensitivity to regressions, which is the opposite of what you want. The Confidence decision framework, formalized in the peer-reviewed paper on risk-aware product decisions, makes this separation explicit.

Quality and exploratory metrics stay outside the family. These inform future hypotheses or track operational health. They don't drive the current shipping decision.
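The split by metric type described above can be sketched as a simple filter. This is an illustrative sketch only: the `Metric` class, the `kind` labels, and the metric names are hypothetical and do not represent the Confidence API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    kind: str  # hypothetical labels: "success", "guardrail", "quality", "exploratory"


def correction_family(metrics: list[Metric]) -> list[Metric]:
    """Keep only the metrics that drive the shipping decision."""
    return [m for m in metrics if m.kind == "success"]


def bonferroni_threshold(alpha: float, family_size: int) -> float:
    """Per-test threshold for a family of the given size."""
    if family_size < 1:
        raise ValueError("a correction family needs at least one test")
    return alpha / family_size


# Made-up experiment with six metrics, three of which are success metrics.
metrics = [
    Metric("conversion", "success"),
    Metric("retention", "success"),
    Metric("minutes_played", "success"),
    Metric("crash_rate", "guardrail"),
    Metric("latency_p95", "quality"),
    Metric("share_clicks", "exploratory"),
]

family = correction_family(metrics)
threshold = bonferroni_threshold(0.05, len(family))
print([m.name for m in family], round(threshold, 4))
```

The denominator is the number of success metrics (3), not the number of metrics in the experiment (6).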

Why does the correction family size matter so much?

Statistical power drops as the family grows. With Bonferroni correction at alpha = 0.05, each test in a family of 3 uses a threshold of 0.05/3 ≈ 0.017. In a family of 20, each test uses 0.05/20 = 0.0025. That tighter threshold means you need substantially larger sample sizes to detect the same effect.
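A rough sketch of how the threshold translates into sample size follows. It assumes a two-sided z-test, 80% power, and the standard normal approximation in which the required sample size is proportional to (z for 1 - alpha/2 plus z for power) squared. None of these assumptions come from the text above, and the function names are hypothetical.

```python
from statistics import NormalDist


def per_test_threshold(alpha: float, family_size: int) -> float:
    """Bonferroni per-test significance threshold."""
    return alpha / family_size


def relative_sample_size(alpha: float, family_size: int, power: float = 0.80) -> float:
    """Sample size needed relative to a single uncorrected test.

    Assumes a two-sided z-test and a fixed effect size, so that
    n is proportional to (z_{1 - a/2} + z_{power}) ** 2.
    """
    z = NormalDist().inv_cdf

    def required(a: float) -> float:
        return (z(1 - a / 2) + z(power)) ** 2

    return required(per_test_threshold(alpha, family_size)) / required(alpha)


for family_size in (1, 3, 20):
    threshold = per_test_threshold(0.05, family_size)
    ratio = relative_sample_size(0.05, family_size)
    print(f"family of {family_size:>2}: threshold {threshold:.4f}, sample size x{ratio:.2f}")
```

Under these assumptions, a family of 3 needs roughly a third more samples than a single test, while a family of 20 needs nearly twice as many.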

The Confidence blog's analysis of multiple testing corrections found that the power gap between Bonferroni and more sophisticated family-wise error rate methods like Holm or Hommel is only 4 to 5 percentage points when applied to typical A/B test success metric sets (1 to 5 metrics). The gap between correcting across the right family versus the wrong one is much larger.
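For readers unfamiliar with how Holm differs from Bonferroni, here is a minimal sketch of both decision rules. The p-values are made up to show a case where the two methods disagree. The example does not reproduce the blog's power analysis.

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down procedure, returning decisions in the input order.

    The smallest p-value is compared with alpha / m, the next with
    alpha / (m - 1), and so on. Testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


# Illustrative p-values for three success metrics (not real data).
p_values = [0.012, 0.020, 0.030]
print("bonferroni:", bonferroni_reject(p_values))
print("holm:      ", holm_reject(p_values))
```

With these p-values Bonferroni rejects only the first hypothesis, while Holm rejects all three. Holm never rejects fewer hypotheses than Bonferroni, but in a small family the two often agree, which is consistent with the modest power gap described above.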

At Spotify, experiments often track 20 or more metrics across different types. If all 20 were thrown into one correction family, even the most powerful FWER method would struggle. By separating success metrics (typically 1 to 3) from guardrails and exploratory metrics, Confidence keeps the correction denominator small and the power loss minor.

How does Confidence define correction families?

Confidence uses the metric type classification to form correction families automatically. When you set up an experiment, you designate each metric as success, guardrail, deterioration, or quality. The platform applies FWER correction (Bonferroni by default) only across the success metrics. Guardrail metrics get their own error control, tuned for false negative risk rather than false positive risk.

This design reflects a specific statistical insight: the error that matters depends on the decision the metric informs. For success metrics, a false positive means shipping something that didn't help. For guardrails, a false negative means missing something that caused harm. Grouping them into one family and applying a uniform correction would address the wrong risk for at least one of the two groups.

Can you define custom correction families?

Sometimes the default metric-type grouping doesn't fit. If an experiment has two independent objectives (say, improving search relevance and reducing load time), those two success metrics might drive genuinely separate decisions. In that case, they could be placed in separate correction families of one, with no power penalty at all.

The principle is the same: group tests by the decision they feed. If two metrics answer the same question ("should we ship this?"), they share a family. If they answer different questions, they don't.
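Grouping by decision can be sketched as follows. The function, the metric names, and the decision labels are hypothetical, and Bonferroni is assumed within each family for simplicity.

```python
from collections import defaultdict


def thresholds_by_decision(
    metric_to_decision: dict[str, str], alpha: float = 0.05
) -> dict[str, float]:
    """Per-metric Bonferroni threshold, with one family per decision."""
    families: dict[str, list[str]] = defaultdict(list)
    for metric, decision in metric_to_decision.items():
        families[decision].append(metric)
    return {
        metric: alpha / len(members)
        for members in families.values()
        for metric in members
    }


# Both metrics answer the same question, so they share a family of two.
shared = thresholds_by_decision(
    {"search_relevance": "ship_change", "load_time": "ship_change"}
)

# Each metric drives its own decision, so each is a family of one.
separate = thresholds_by_decision(
    {"search_relevance": "ship_ranking", "load_time": "ship_caching"}
)

print(shared)
print(separate)
```

In the shared case each metric is tested at 0.025. In the separate case each keeps the full 0.05, at the cost that the error rate is controlled per decision rather than across both.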

Related terms

Multiple Testing
Multiple Testing Correction

A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.

Multiple Testing
Family-Wise Error Rate

Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.

Multiple Testing
Bonferroni Correction

The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests.

Multiple Testing
Holm Correction

The Holm correction (also called Holm-Bonferroni) is a step-down multiple testing procedure that controls the family-wise error rate (FWER) while being uniformly more powerful than the Bonferroni c...

Multiple Testing
Hommel Correction

The Hommel correction is a multiple testing procedure that controls the family-wise error rate (FWER) while being more powerful than both the Bonferroni correction and the Holm correction.

Multiple Testing
False Discovery Rate

False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant.

Metrics
Guardrail Metric

A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.

Metrics
Success Metric

A success metric is the primary metric an experiment is designed to move.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.