What is a Minimum Detectable Effect (MDE)?

The minimum detectable effect (MDE) is the smallest treatment effect an experiment is designed to reliably detect at a given significance level and power. If you set your MDE at a 2% lift in conversion rate with 80% power, the experiment has an 80% chance of detecting a true 2% lift, and a higher chance of detecting anything larger. Effects smaller than the MDE may exist but won't be detected reliably by the test as designed.

Choosing the MDE is the most underrated decision in experiment design. It determines how long the test runs, how much traffic it consumes, and what kinds of product changes it can evaluate. Set the MDE too small and the experiment requires weeks of traffic you don't have. Set it too large and you miss real improvements that would have compounded over time. Confidence provides power calculators that show the sample size required for any MDE, making this trade-off visible before the test starts.

How do you choose the right MDE?

The MDE should reflect the smallest effect that would change your decision. If a 0.5% improvement in retention wouldn't be worth the ongoing maintenance cost of the feature, then your MDE should be larger than 0.5%. If a 1% improvement in conversion rate would justify the change, your MDE should be at most 1%.

This requires teams to think about the economics of the decision before they run the experiment. How much is a 1% lift in this metric worth in revenue or user value? What's the cost of maintaining the feature? The MDE makes those trade-offs quantitative.
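The break-even reasoning above can be sketched in a few lines. This is a minimal illustration, not a Confidence feature: the function name and the cost and value figures are hypothetical, and it assumes value scales linearly with the lift.

```python
def break_even_lift_pct(annual_maintenance_cost, annual_value_per_pct_lift):
    """Smallest lift (in percent) at which the feature pays for its upkeep.

    Both arguments are hypothetical inputs a team would estimate itself;
    the linear value-per-percent assumption is a simplification.
    """
    if annual_value_per_pct_lift <= 0:
        raise ValueError("value per percent lift must be positive")
    return annual_maintenance_cost / annual_value_per_pct_lift


# Hypothetical figures, chosen only to illustrate the arithmetic.
mde_floor = break_even_lift_pct(
    annual_maintenance_cost=50_000,
    annual_value_per_pct_lift=100_000,
)
print(f"Lifts below {mde_floor:.2f}% would not cover maintenance")
```

With these made-up numbers the break-even lift is 0.5%, so an MDE below that would spend traffic detecting effects the team would not ship anyway.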

At Spotify, where 300+ teams share experiment bandwidth across 10,000+ experiments per year, MDE choices have an organizational impact. A team that sets an unnecessarily small MDE (say, 0.1% when they'd ship at 1%) runs an experiment roughly 100 times larger than needed, because a 10x smaller MDE requires about 10² = 100x the sample. That traffic could have powered several other tests. Conversely, a team that sets the MDE too large (10% when realistic effects are 2-3%) will rarely reach significance and may wrongly conclude nothing happened.

How does MDE relate to sample size?

Required sample size scales with the inverse square of the MDE. Halving the MDE requires roughly four times the sample size, holding power and alpha constant. This means small MDEs get expensive fast.
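One common way to write this relationship is the textbook approximation below. It assumes a two-sided, two-sample z-test with equal group sizes and a shared metric variance; the symbols are illustrative and not taken from Confidence. Here n is the sample size per group, σ² the metric variance, δ the MDE in absolute units, α the significance level, and 1 − β the power.

```latex
n \approx \frac{2\,\sigma^{2}\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\delta^{2}}
\qquad\Longrightarrow\qquad
n\!\left(\tfrac{\delta}{2}\right) \approx 4\,n(\delta)
```

Because δ appears squared in the denominator, halving it multiplies n by four, and because σ² appears in the numerator, any reduction in variance reduces n proportionally.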

For a concrete example: detecting a 2% lift in a metric with a given variance might require 10,000 users per group. Detecting a 1% lift in the same metric requires ~40,000 users per group. Detecting a 0.5% lift requires ~160,000 users per group.
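The same scaling can be checked numerically. The sketch below uses the standard normal-approximation formula for a two-sided, two-sample test on a conversion rate; the 10% baseline rate, the function name, and the default alpha and power are assumptions for illustration, so the absolute numbers will differ from the round figures above while the 4x ratio holds.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per group for a two-sided z-test on a conversion rate.

    Assumes equal group sizes and uses the baseline Bernoulli variance for
    both groups, which is a simplification.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    delta = baseline_rate * relative_mde              # MDE in absolute terms
    variance = baseline_rate * (1 - baseline_rate)    # per-user metric variance
    return ceil(2 * variance * (z_alpha + z_power) ** 2 / delta ** 2)


# Hypothetical 10% baseline conversion rate.
for mde in (0.02, 0.01, 0.005):
    n = sample_size_per_group(0.10, mde)
    print(f"{mde:.1%} relative lift: {n:,} users per group")
```

Each halving of the MDE in the loop roughly quadruples the printed sample size.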

This is why variance reduction matters so much. Required sample size is proportional to metric variance, so when CUPED reduces variance by ~50%, as it typically does, the sample size needed for the same MDE is also cut roughly in half. In Confidence, the power calculator automatically incorporates variance reduction when computing required sample sizes, so teams see the actual runtime they'll need rather than an inflated estimate based on raw metric variance.
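A rough sketch of that proportionality follows. It relies on the standard CUPED result that, with a single pre-experiment covariate, the adjusted metric's variance is the raw variance times (1 − ρ²), where ρ is the correlation between the covariate and the metric. The function names are hypothetical and this is not how the Confidence power calculator is implemented.

```python
from math import ceil


def variance_reduction_from_correlation(correlation):
    """Fraction of variance removed by CUPED with one covariate (rho squared)."""
    if not -1 < correlation < 1:
        raise ValueError("correlation must be strictly between -1 and 1")
    return correlation ** 2


def sample_size_after_variance_reduction(raw_sample_size, variance_reduction):
    """Scale a raw sample size by the share of variance that remains."""
    if not 0 <= variance_reduction < 1:
        raise ValueError("variance_reduction must be in [0, 1)")
    return ceil(raw_sample_size * (1 - variance_reduction))


# A 50% variance reduction turns 40,000 users per group into 20,000.
print(sample_size_after_variance_reduction(40_000, 0.5))

# A covariate correlated at 0.7 with the metric removes about 49% of variance.
print(variance_reduction_from_correlation(0.7))
```

Under this model, a ~50% variance reduction corresponds to a pre-experiment covariate correlated at roughly 0.7 with the metric.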

What happens when the true effect is smaller than the MDE?

The experiment will likely produce a non-significant result. This doesn't mean the change has no effect. It means the test wasn't sensitive enough to detect it.

This is the most common misinterpretation of null results in A/B testing. A team runs a test powered to detect a 3% lift, sees no significant result, and concludes "the change didn't work." The change may well have produced a 1.5% lift that the test simply couldn't see. The correct interpretation is: "we found no evidence of an effect of 3% or larger."
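To see how little sensitivity such a test has, the sketch below estimates the power of a test at a true effect other than the one it was designed for. It assumes a two-sided z-test whose sample size was chosen to give exactly the designed power at the MDE; the function name and defaults are illustrative.

```python
from statistics import NormalDist


def power_at_true_effect(true_effect, designed_mde, alpha=0.05, designed_power=0.80):
    """Approximate power of a two-sided z-test when the true effect differs from the MDE.

    Assumes the sample size was set so that power equals `designed_power`
    when the true effect equals `designed_mde`.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Standardized effect (effect / standard error) when the true effect equals the MDE.
    standardized_at_mde = z_alpha + z.inv_cdf(designed_power)
    # The standard error is fixed by the design, so the standardized effect scales linearly.
    shift = standardized_at_mde * true_effect / designed_mde
    # Probability the test statistic lands beyond either critical value.
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


# Test designed for a 3% lift, true lift of 1.5%.
print(f"{power_at_true_effect(1.5, 3.0):.1%}")
```

Under these assumptions, a test designed for 80% power at a 3% lift has a bit under 30% power against a true 1.5% lift, so a non-significant result is the expected outcome even though the effect is real.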

Teams that consistently set MDEs larger than the effects they're likely to produce will consistently get null results. The Spotify Search team's experimentation maturity journey included systematically reducing MDEs over time by combining CUPED, trigger analysis, and metric refinement within Confidence. As their tests became more sensitive, they detected effects they previously would have missed.

Related terms

Statistical Methods
Sample Size

Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.

Statistical Methods
Statistical Power

Statistical power is the probability that an experiment will detect a real effect when one exists.

Statistical Methods
Effect Size

Effect size is the magnitude of the difference in a metric between treatment and control groups.

Statistical Methods
Variance

Variance is a measure of how much a metric's values spread out across users.

Statistical Methods
Variance Reduction

Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.

Statistical Methods
CUPED

CUPED (Controlled-experiment Using Pre-Existing Data) is a variance reduction method that uses data from before an experiment started to remove predictable noise from metric estimates, producing ti...

Experiment Analysis
Trigger Analysis

Trigger analysis is an experiment analysis technique that restricts the evaluation to users who actually encountered the changed feature, rather than analyzing every user assigned to the experiment.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.