Multiple Testing

What is a False Discovery Rate?

False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant. If you reject 20 hypotheses and control FDR at 5%, you expect about 1 of those 20 rejections to be a false positive. FDR doesn't promise you won't make any mistakes. It promises the mistakes will be a small, controlled fraction of your discoveries.

FDR sits on the permissive end of the multiple testing correction spectrum. The stricter alternative, family-wise error rate (FWER), controls the probability of making even one false positive across all tests. FWER is the right default for A/B testing decisions where each false positive could lead to shipping a change that didn't help. FDR is designed for settings where you're screening many hypotheses at once and can tolerate some false positives as long as most discoveries are real.

When is FDR the right choice?

FDR control makes sense when three conditions hold: you're testing a large number of hypotheses, you expect many of them to have real effects, and the consequence of a single false positive is low relative to the cost of missing a real effect.

The canonical example is genomics. A gene expression study might test 20,000 genes simultaneously. FWER control at that scale would require each individual test to pass an absurdly strict threshold (alpha of 0.0000025 with Bonferroni), making it nearly impossible to detect anything. FDR control lets researchers find hundreds of genuinely differentially expressed genes while accepting that, say, 5% of the list will be false leads. Those false leads get filtered in follow-up experiments. The initial screen just needs to be mostly right.
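The threshold arithmetic behind that contrast is easy to check. A quick sketch in Python (the test count matches the example above; the rank k is an illustrative assumption, not from any particular study):

```python
alpha = 0.05
m = 20_000  # simultaneous tests, as in the gene expression example

# Bonferroni: every test must clear alpha / m to control FWER.
print(alpha / m)  # 2.5e-06

# Benjamini-Hochberg: the k-th smallest p-value only needs to clear
# (k / m) * alpha, so discoveries deeper in the list face a looser bar.
k = 500  # hypothetical rank of a candidate gene in the sorted list
print(k / m * alpha)  # 0.00125
```

The 500th-ranked candidate faces a threshold 500 times looser than Bonferroni's, which is why FDR control retains usable power at this scale.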

In product experimentation, FDR is less commonly the right choice. A typical A/B test has 1 to 5 success metrics, not thousands. The consequences of a false positive are concrete: you ship a feature that doesn't work, consuming engineering resources and potentially degrading the user experience. For that setting, FWER control with methods like the Bonferroni correction or the Holm correction is the standard approach. Confidence uses FWER control for success metrics by default.

Where FDR can be useful in an experimentation context is exploratory metric screening. If a team wants to scan 50 secondary metrics to generate hypotheses for future experiments, FDR control lets them identify promising signals without the severe power penalty of FWER correction across 50 tests. The key distinction: those FDR-controlled results inform what to test next, not what to ship now.

How does FDR control work?

The most widely used FDR-controlling method is the Benjamini-Hochberg correction. It sorts the m p-values from smallest to largest, then finds the largest rank k whose p-value falls at or below (k / m) × alpha. That test, and every test with a smaller p-value, is rejected.
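The procedure described above fits in a few lines. A minimal sketch in plain Python (the example p-values are illustrative):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean list, True where the corresponding hypothesis
    is rejected under FDR control at level alpha.
    """
    m = len(p_values)
    # Indices sorted by p-value, ascending.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Ten p-values: only the two smallest survive at alpha = 0.05,
# even though five are individually below 0.05.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
```

Note the step-up behavior: a p-value can be rejected even if it sits above its own threshold, as long as some larger-ranked p-value clears its threshold.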

The procedure is simple to implement and doesn't require the tests to be independent: it remains valid under positive dependence, which is the typical case for correlated metrics. It's more powerful than FWER methods when many tests have real effects, because it spends the error budget in proportion to the number of discoveries rather than budgeting for the worst case.

How does FDR relate to FWER?

When all null hypotheses are true (nothing is actually different), FDR and FWER are identical. Both equal the probability of at least one false rejection. The methods diverge when some effects are real.

With real effects present, FDR control is strictly less conservative. FWER still guarantees the probability of any false positive stays below alpha. FDR only guarantees the expected false positive fraction stays below alpha. If you reject 10 hypotheses under FDR control at 5%, you expect at most 0.5 false positives among them. That's a useful guarantee for large-scale screening but a weak one for decisions where each individual result matters.
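A small simulation makes the divergence concrete. This is a rough sketch: the split between real effects and true nulls, and the way effect p-values are generated, are illustrative assumptions, not a model of any particular experiment.

```python
import random

random.seed(7)
alpha = 0.05
m = 1_000
n_real = 200  # hypothetical: 200 real effects among 1,000 tests

# Crude p-value model: true nulls are Uniform(0, 1); real effects are
# pushed close to zero to mimic well-powered tests.
p = [random.random() for _ in range(m - n_real)]
p += [random.random() * 0.001 for _ in range(n_real)]

# Bonferroni (FWER): reject only p-values at or below alpha / m.
bonferroni = sum(1 for pv in p if pv <= alpha / m)

# Benjamini-Hochberg (FDR): the number of rejections is the largest
# rank k with p_(k) <= (k / m) * alpha.
p_sorted = sorted(p)
bh = max((k for k in range(1, m + 1) if p_sorted[k - 1] <= k / m * alpha),
         default=0)

print("Bonferroni rejections:", bonferroni)
print("Benjamini-Hochberg rejections:", bh)
```

In a run like this, Benjamini-Hochberg recovers essentially all of the real effects while Bonferroni rejects only the handful of extreme p-values, which is the power gap the paragraphs above describe.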

The practical implication: use FWER for shipping decisions (where each false positive has a direct cost) and FDR for discovery and hypothesis generation (where false positives get filtered downstream).