A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment. When you apply a method like the Bonferroni correction or the Holm correction, the correction family determines how many tests the method accounts for. Every test inside the family shares the error budget. Every test outside it is irrelevant to the adjustment.
The correction family is the single most consequential choice in multiple testing correction, more so than the specific method you use. A Bonferroni correction across 3 success metrics costs a small amount of power. A Bonferroni correction across 25 metrics (success, guardrails, quality indicators, and exploratory measures all lumped together) costs a lot. The method is the same. The denominator changed.
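The denominator effect is plain arithmetic. A minimal sketch (the metric counts are the illustrative ones from above, not from any real experiment):

```python
# Per-test significance threshold under Bonferroni for two family sizes.
# The family sizes (3 success metrics vs 25 mixed metrics) are illustrative.
alpha = 0.05

for family_size in (3, 25):
    threshold = alpha / family_size
    print(f"family of {family_size:2d}: per-test threshold = {threshold:.4f}")
```

The same method, applied to the larger family, demands a p-value more than eight times smaller from every test.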
What belongs in the correction family?
The answer depends on the decision the tests inform. Tests that jointly drive the same decision belong in the same family. Tests that inform different decisions belong in separate families.
In a typical A/B test, this maps cleanly to metric types:
Success metrics go in the correction family. These are the metrics that determine whether you ship the change. If you have three success metrics and declare victory when any one is significant, that's the multiple testing problem you need to correct for. The family size is 3, not the total number of metrics in the experiment.
Guardrail metrics stay outside the family. Guardrails monitor what you don't want to break. The risk with guardrails is a false negative (failing to detect a real regression), not a false positive. Applying the same FWER correction to guardrails would reduce the sensitivity to regressions, which is the opposite of what you want. The Confidence decision framework, formalized in the peer-reviewed paper on risk-aware product decisions, makes this separation explicit.
Quality and exploratory metrics stay outside the family. These inform future hypotheses or track operational health. They don't drive the current shipping decision.
Why does the correction family size matter so much?
Statistical power drops as the family grows. With Bonferroni correction at alpha = 0.05, each test in a family of 3 uses a threshold of 0.05/3 ≈ 0.017. In a family of 20, each test uses 0.05/20 = 0.0025. That tighter threshold means you need substantially larger sample sizes to detect the same effect.
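How much larger? Under the usual normal approximation for a two-sided z-test, the required sample size is proportional to (z_{1-alpha'/2} + z_{1-beta})^2, where alpha' is the Bonferroni-corrected threshold. A rough sketch (the family sizes and the 80% power target are assumptions for illustration):

```python
from statistics import NormalDist

def relative_n(alpha: float, family_size: int, power: float = 0.8) -> float:
    """Sample size per test, up to a constant, under the normal
    approximation: n is proportional to (z_{1-alpha'/2} + z_{1-beta})^2."""
    z = NormalDist().inv_cdf
    corrected = alpha / family_size  # Bonferroni-adjusted threshold
    return (z(1 - corrected / 2) + z(power)) ** 2

base = relative_n(0.05, 1)
for k in (1, 3, 20):
    print(f"family of {k:2d}: sample size multiplier = {relative_n(0.05, k) / base:.2f}")
```

Under these assumptions, a family of 3 inflates the required sample size by roughly a third, while a family of 20 nearly doubles it.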
The Confidence blog's analysis of multiple testing corrections found that the power gap between Bonferroni and more sophisticated family-wise error rate methods like Holm or Hommel is only 4 to 5 percentage points when applied to typical A/B test success metric sets (1 to 5 metrics). The gap between correcting across the right family versus the wrong one is much larger.
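The Holm procedure referenced above is a step-down refinement of Bonferroni: sort the p-values ascending and compare the i-th smallest to alpha/(m-i), stopping at the first failure. A self-contained sketch with made-up p-values (not from any cited analysis):

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down test: controls FWER like Bonferroni but is
    uniformly more powerful."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject

# Illustrative p-values for a hypothetical three-metric success family.
ps = [0.012, 0.020, 0.30]
print(holm_reject(ps))                     # → [True, True, False]
print([p <= 0.05 / len(ps) for p in ps])   # Bonferroni → [True, False, False]
```

In this example Holm rejects one more hypothesis than Bonferroni, the kind of modest gain the cited analysis describes. The choice of family dominates either way.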
At Spotify, experiments often track 20 or more metrics across different types. If all 20 were thrown into one correction family, even the most powerful FWER method would struggle. By separating success metrics (typically 1 to 3) from guardrails and exploratory metrics, Confidence keeps the correction denominator small and the power loss minor.
How does Confidence define correction families?
Confidence uses the metric type classification to form correction families automatically. When you set up an experiment, you designate each metric as success, guardrail, deterioration, or quality. The platform applies FWER correction (Bonferroni by default) only across the success metrics. Guardrail metrics get their own error control, tuned for false negative risk rather than false positive risk.
This design reflects a specific statistical insight: the error that matters depends on the decision the metric informs. For success metrics, a false positive means shipping something that didn't help. For guardrails, a false negative means missing something that caused harm. Grouping them into one family and applying a uniform correction would address the wrong risk for at least one of the two groups.
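The separation by metric type can be sketched as follows. This is a minimal illustration of the idea, not the Confidence API; the `Metric` class, metric names, and p-values are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    type: str        # "success", "guardrail", "quality", ...
    p_value: float

def success_decisions(metrics: list[Metric], alpha: float = 0.05) -> dict[str, bool]:
    """Apply Bonferroni only across success metrics; other metric types
    sit outside the correction family and don't shrink the threshold."""
    success = [m for m in metrics if m.type == "success"]
    threshold = alpha / max(len(success), 1)  # family size = success metrics only
    return {m.name: m.p_value <= threshold for m in success}

metrics = [
    Metric("conversion", "success", 0.020),
    Metric("revenue", "success", 0.030),
    Metric("crash_rate", "guardrail", 0.40),
    Metric("time_on_page", "quality", 0.10),
]
print(success_decisions(metrics))  # threshold is 0.05/2, not 0.05/4
```

With four metrics tracked but only two in the family, each success metric is tested at 0.025 rather than 0.0125.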
Can you define custom correction families?
Sometimes the default metric-type grouping doesn't fit. If an experiment has two independent objectives (say, improving search relevance and reducing load time), those two success metrics might drive genuinely separate decisions. In that case, they could be placed in separate correction families of size one, with no power penalty at all.
The principle is the same: group tests by the decision they feed. If two metrics answer the same question ("should we ship this?"), they share a family. If they answer different questions, they don't.
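Grouping by decision rather than by experiment can be sketched by tagging each test with the decision it informs. The metric names and family labels here are hypothetical:

```python
from collections import defaultdict

# Each test carries the decision ("family") it informs; the Bonferroni
# divisor is the size of its own family, not the total test count.
tests = [
    ("search_relevance", "search_decision", 0.030),
    ("load_time_p95", "latency_decision", 0.040),
]

families: dict[str, list[tuple[str, float]]] = defaultdict(list)
for name, family, p in tests:
    families[family].append((name, p))

alpha = 0.05
for family, members in families.items():
    threshold = alpha / len(members)  # a family of size one keeps the full alpha
    for name, p in members:
        print(f"{name}: p={p}, threshold={threshold}, significant={p <= threshold}")
```

Because the two metrics answer different questions, each is tested at the full alpha of 0.05; lumping them into one family would have halved it for no benefit.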