Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests. When you evaluate ten metrics in an A/B test, each at a 5% significance level, the chance that at least one produces a spurious significant result is far higher than 5%. With ten independent tests and no true effects, the probability of at least one false positive climbs to roughly 40%. FWER quantifies that compounding risk.
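The compounding is easy to compute directly: for m independent tests with all null hypotheses true, FWER = 1 − (1 − α)^m. A minimal sketch in plain Python (no platform assumptions):

```python
def fwer_independent(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent
    tests, assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

print(fwer_independent(0.05, 1))   # 0.05  -- a single test
print(fwer_independent(0.05, 10))  # ~0.40 -- ten metrics, each at 5%
```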
Controlling FWER matters because experimentation decisions are rarely based on a single metric. Most A/B tests track several success metrics alongside guardrail metrics and quality indicators. If any one of those success metrics shows a false positive, the team may ship a change that produced no real improvement. FWER-controlling methods like the Bonferroni correction, the Holm correction, and the Hommel correction adjust significance thresholds so the probability of even one such mistake stays below the target alpha (typically 5%) across the entire set.
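One common way to apply these corrections in practice is `multipletests` from statsmodels, which adjusts a list of raw p-values for a chosen method. The p-values below are made up for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values for four success metrics in one experiment.
p_values = [0.012, 0.030, 0.041, 0.300]

for method in ("bonferroni", "holm", "hommel"):
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adj], list(reject))
```

All three methods control FWER at 5%; Holm and Hommel are uniformly at least as powerful as Bonferroni, so they may reject more hypotheses on the same data.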
How does FWER differ from false discovery rate?
FWER and false discovery rate (FDR) answer different questions about the same problem.
FWER asks: what's the probability that I make at least one false positive in this group of tests? It's a strict guarantee. If you control FWER at 5%, there's at most a 5% chance of any false rejection, regardless of how many tests you run.
FDR asks: among the tests I've declared significant, what proportion are expected to be false positives? FDR is more permissive. It allows some false positives as long as the fraction stays controlled. The Benjamini-Hochberg correction is the standard FDR-controlling method.
The practical trade-off is power. FWER control is conservative: to keep the probability of any false positive low, it raises the bar for every individual test. FDR control gives each test a lower bar, so it detects more real effects, but it accepts a known rate of false discoveries among the results.
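A small sketch of that trade-off, again with made-up p-values: the FDR adjustment (Benjamini-Hochberg) typically leaves adjusted p-values lower than the FWER adjustment (Holm), so more results cross the 5% threshold.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from screening several metrics with modest effects.
p_values = [0.001, 0.008, 0.012, 0.024, 0.035, 0.047, 0.200, 0.600]

_, p_holm, _, _ = multipletests(p_values, alpha=0.05, method="holm")    # FWER control
_, p_bh,   _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")  # FDR control

for raw, holm, bh in zip(p_values, p_holm, p_bh):
    print(f"raw={raw:.3f}  holm={holm:.3f}  bh={bh:.3f}")
```

On this illustrative input, Benjamini-Hochberg declares several results significant at 5% where Holm keeps only the strongest one, which is exactly the extra power bought by accepting a controlled fraction of false discoveries.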
For A/B testing with a small number of success metrics (typically 1 to 5), FWER control is the standard choice. The power cost is manageable, and the guarantee is strong: if you ship based on a significant result, the chance that even one of the results driving the decision is noise stays below the target alpha. When researchers screen hundreds of metrics to generate hypotheses rather than make shipping decisions, FDR control makes more sense.
When does FWER matter in practice?
The risk is sharpest when teams run experiments with multiple success metrics and ship based on whichever one reaches significance. Without a correction, an experiment with five success metrics tested at alpha = 0.05 has a roughly 23% chance of producing at least one false positive under the null. That's not a theoretical concern. It's the default behavior of any platform that tests each metric independently.
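A quick simulation of that decision rule makes the point concrete. Under the null, each metric's p-value is uniform on [0, 1]; the numbers here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n_experiments, n_metrics, alpha = 100_000, 5, 0.05

# Under the null, independent p-values are uniform on [0, 1].
p_values = rng.uniform(size=(n_experiments, n_metrics))

# Uncorrected decision rule: ship if any metric looks "significant".
ship_rate = (p_values < alpha).any(axis=1).mean()
print(ship_rate)  # close to 1 - 0.95**5, i.e. about 0.23
```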
Confidence applies FWER-controlling corrections to success metrics by default. The platform's decision framework, formalized in a peer-reviewed paper on risk-aware product decisions, distinguishes between metric types: success metrics get FWER correction because a false positive on a success metric leads to shipping a change that didn't actually help. Guardrail metrics don't need the same adjustment because the relevant risk there is a false negative (missing a real regression), not a false positive.
What determines the size of the correction?
The strength of the FWER correction depends on how many tests are in the correction family: the set of tests grouped together for adjustment. The larger the family, the more conservative the correction becomes and the more statistical power you lose.
This is why defining the correction family carefully matters more than choosing between FWER methods. If you include every metric in a single family, you're paying a power penalty for metrics that don't influence the same decision. Confidence groups only the success metrics that drive the ship/don't-ship decision into the correction family. Guardrail metrics and quality metrics sit outside it. That separation keeps the correction denominator small and the power loss contained.
At Spotify, where experiments often track 20 or more metrics total, the distinction between "all metrics" and "just the success metrics" is the difference between a severe power loss and a manageable one. Most experiments have 1 to 3 success metrics. Correcting across 3 tests costs far less power than correcting across 20.
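A back-of-the-envelope sketch of that power cost, using Bonferroni-style α/m per-metric thresholds and a normal approximation; the effect size below is made up:

```python
from scipy.stats import norm

def power_two_sided(delta: float, alpha: float) -> float:
    """Power of a two-sided z-test for a standardized effect of size
    delta (measured in standard errors) at significance level alpha."""
    z = norm.ppf(1 - alpha / 2)
    return norm.sf(z - delta) + norm.cdf(-z - delta)

delta = 2.8  # hypothetical true effect, expressed in standard errors
for m in (1, 3, 20):
    alpha_per_test = 0.05 / m  # Bonferroni-style per-metric threshold
    print(m, round(alpha_per_test, 4), round(power_two_sided(delta, alpha_per_test), 3))
```

With these made-up numbers, an effect that a single uncorrected test detects with roughly 80% power drops to roughly 65% power when corrected across 3 metrics, and to roughly 40% when corrected across 20.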