
What is a False Discovery Rate?

False discovery rate (FDR) is the expected proportion of false positives among all results declared statistically significant. If you reject 20 hypotheses while controlling FDR at 5%, you expect at most about 1 of those 20 rejections to be a false positive on average. FDR doesn't promise you won't make any mistakes. It promises that, on average, the mistakes will be a small, controlled fraction of your discoveries.

FDR sits on the permissive end of the multiple testing correction spectrum. The stricter alternative, family-wise error rate (FWER), controls the probability of making even one false positive across all tests. FWER is the right default for A/B testing decisions where each false positive could lead to shipping a change that didn't help. FDR is designed for settings where you're screening many hypotheses at once and can tolerate some false positives as long as most discoveries are real.

When is FDR the right choice?

FDR control makes sense when three conditions hold: you're testing a large number of hypotheses, you expect many of them to have real effects, and the consequence of a single false positive is low relative to the cost of missing a real effect.

The canonical example is genomics. A gene expression study might test 20,000 genes simultaneously. FWER control at that scale would require each individual test to pass an absurdly strict threshold (alpha of 0.0000025 with Bonferroni), making it nearly impossible to detect anything. FDR control lets researchers find hundreds of genuinely differentially expressed genes while accepting that, say, 5% of the list will be false leads. Those false leads get filtered in follow-up experiments. The initial screen just needs to be mostly right.
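The Bonferroni threshold quoted above is just the overall alpha divided by the number of tests. A minimal sketch of that arithmetic, where the function name `bonferroni_threshold` and the 5-metric comparison are illustrative choices rather than anything from a specific library:

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


# The gene expression example: 20,000 tests at an overall alpha of 0.05.
genomics_threshold = bonferroni_threshold(0.05, 20_000)

# For comparison (illustrative): an A/B test with 5 success metrics.
ab_test_threshold = bonferroni_threshold(0.05, 5)

print(f"{genomics_threshold:.7f}")  # 0.0000025
print(f"{ab_test_threshold:.3f}")   # 0.010
```

The per-test threshold shrinks in direct proportion to the number of tests, which is why the penalty is manageable for a handful of metrics and prohibitive for a 20,000-gene screen.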

In product experimentation, FDR is less commonly the right choice. A typical A/B test has 1 to 5 success metrics, not thousands. The consequences of a false positive are concrete: you ship a feature that doesn't work, consuming engineering resources and potentially degrading the user experience. For that setting, FWER control with methods like the Bonferroni correction or the Holm correction is the standard approach. Confidence uses FWER control for success metrics by default.

Where FDR can be useful in an experimentation context is exploratory metric screening. If a team wants to scan 50 secondary metrics to generate hypotheses for future experiments, FDR control lets them identify promising signals without the severe power penalty of FWER correction across 50 tests. The key distinction: those FDR-controlled results inform what to test next, not what to ship now.

How does FDR control work?

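The most widely used FDR-controlling method is the Benjamini-Hochberg correction. It sorts the m p-values from smallest to largest, then finds the largest p-value that is at or below a threshold that increases with its rank: (k / m) × alpha for the p-value at rank k. That test and every test with a smaller p-value are rejected, even if some of those smaller p-values miss the threshold for their own rank.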
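The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular platform; the function name `benjamini_hochberg` and the example p-values are made up for the demonstration.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject/keep flag per p-value, in the input order."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k whose p-value is <= (k / m) * alpha.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank

    # Reject the tests holding the largest_rank smallest p-values.
    reject = [False] * m
    for idx in order[:largest_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values, deliberately unsorted.
example_p_values = [0.035, 0.5, 0.004, 0.028, 0.025]
bh_decisions = benjamini_hochberg(example_p_values, alpha=0.05)
print(bh_decisions)  # [True, False, True, True, True]
```

In this example the p-value 0.025 sits at rank 2, where the threshold is 0.02, so it misses its own threshold. It is still rejected because the p-value at rank 4 (0.035) is below its threshold of 0.04, and the procedure rejects everything up to the largest rank that passes.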

The procedure is simple to implement. Its FDR guarantee was originally proven for independent tests, and it also holds under positive dependence, which is often a reasonable assumption for correlated metrics. At the same alpha, it rejects every hypothesis that Bonferroni or Holm would reject, and typically more when many tests have real effects, because it spends the error budget in proportion to how many discoveries you make rather than budgeting for the worst case.
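To make the power difference concrete, the sketch below counts how many hypotheses Bonferroni, Holm, and Benjamini-Hochberg each reject on the same p-values. The helper `count_rejections` and the ten p-values are hypothetical and chosen only to illustrate the comparison.

```python
from typing import Dict, List


def count_rejections(p_values: List[float], alpha: float = 0.05) -> Dict[str, int]:
    """Count rejections under three corrections applied to the same p-values."""
    m = len(p_values)
    ordered = sorted(p_values)

    # Bonferroni: every test is compared against the same threshold alpha / m.
    bonferroni = sum(p <= alpha / m for p in ordered)

    # Holm (step-down): walk up from the smallest p-value, stop at the first failure.
    holm = 0
    for k, p in enumerate(ordered, start=1):
        if p > alpha / (m - k + 1):
            break
        holm = k

    # Benjamini-Hochberg (step-up): largest rank k with p_(k) <= k * alpha / m.
    bh = 0
    for k, p in enumerate(ordered, start=1):
        if p <= k * alpha / m:
            bh = k

    return {"bonferroni": bonferroni, "holm": holm, "benjamini_hochberg": bh}


# Hypothetical screen of 10 metrics where several appear to have real effects.
screen_p_values = [0.001, 0.004, 0.006, 0.012, 0.019, 0.024, 0.030, 0.210, 0.450, 0.800]
rejection_counts = count_rejections(screen_p_values, alpha=0.05)
print(rejection_counts)
# {'bonferroni': 2, 'holm': 3, 'benjamini_hochberg': 7}
```

The extra rejections come with a weaker guarantee: the two FWER methods bound the chance of any false positive, while Benjamini-Hochberg only bounds the expected share of false positives among the rejections.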

How does FDR relate to FWER?

When all null hypotheses are true (nothing is actually different), FDR and FWER are identical. Both equal the probability of at least one false rejection. The methods diverge when some effects are real.
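The equality can be written out in a few lines. The notation here is introduced for illustration and does not come from the text above: V is the number of false rejections, R is the total number of rejections, and the false discovery proportion is taken to be 0 when nothing is rejected, which is the usual convention.

```latex
\begin{align*}
\mathrm{FDR}  &= \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right],
&
\mathrm{FWER} &= \Pr(V \ge 1).
\\[6pt]
\intertext{If every null hypothesis is true, every rejection is a false one, so $V = R$ and}
\frac{V}{\max(R,\,1)} &= \mathbf{1}\{V \ge 1\}
&
\Longrightarrow\quad
\mathrm{FDR} &= \mathbb{E}\big[\mathbf{1}\{V \ge 1\}\big] = \Pr(V \ge 1) = \mathrm{FWER}.
\\[6pt]
\intertext{In general $V \le R$, so the proportion is at most the indicator and}
\frac{V}{\max(R,\,1)} &\le \mathbf{1}\{V \ge 1\}
&
\Longrightarrow\quad
\mathrm{FDR} &\le \mathrm{FWER}.
\end{align*}
```

The inequality in the last line is why any procedure that controls FWER at alpha also controls FDR at alpha, but not the other way around.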

With real effects present, FDR control is the less conservative of the two. FWER control still guarantees that the probability of any false positive stays below alpha. FDR control only guarantees that the expected fraction of false positives among the rejections stays below alpha. If you reject 10 hypotheses under FDR control at 5%, you expect at most about 0.5 false positives on average. That's a useful guarantee for large-scale screening but a weak one for decisions where each individual result matters.

The practical implication: use FWER for shipping decisions (where each false positive has a direct cost) and FDR for discovery and hypothesis generation (where false positives get filtered downstream).

Related terms

Benjamini-Hochberg Correction (Multiple Testing)

The Benjamini-Hochberg (BH) correction is a multiple testing procedure that controls the false discovery rate (FDR): the expected proportion of false positives among all results declared significant.

Family-Wise Error Rate (Multiple Testing)

Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.

Multiple Testing Correction (Multiple Testing)

A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.

Correction Family (Multiple Testing)

A correction family is the set of hypothesis tests grouped together for a multiple testing adjustment.

Bonferroni Correction (Multiple Testing)

The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests.

Statistical Significance (Statistical Methods)

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.