Statistical Methods

What is a Minimum Detectable Effect (MDE)?

The minimum detectable effect (MDE) is the smallest treatment effect an experiment is designed to reliably detect at a given significance level and power. If you set your MDE at a 2% lift in conversion rate with 80% power, the experiment will detect a true 2% lift 80% of the time, and larger lifts more often. Effects smaller than the MDE may exist, but the test as designed won't detect them reliably.

Choosing the MDE is the most underrated decision in experiment design. It determines how long the test runs, how much traffic it consumes, and what kinds of product changes it can evaluate. Set the MDE too small and the experiment requires weeks of traffic you don't have. Set it too large and you miss real improvements that would have compounded over time. Confidence provides power calculators that show the sample size required for any MDE, making this trade-off visible before the test starts.
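
To make this trade-off concrete, here is a minimal sketch of the kind of calculation a power calculator performs, using a standard two-sided z-test on proportions. The 10% baseline rate, the treatment of the MDE as a relative lift, and the helper name required_n_per_group are illustrative assumptions, not Confidence's API.

```python
from scipy.stats import norm

def required_n_per_group(p_baseline, mde_rel, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided z-test on proportions.

    mde_rel is the relative lift to detect, e.g. 0.02 for a +2% lift.
    """
    p1 = p_baseline
    p2 = p_baseline * (1 + mde_rel)
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value of the test
    z_beta = norm.ppf(power)           # quantile matching desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# 80% power to detect a 2% relative lift on a 10% baseline conversion rate
print(f"{required_n_per_group(0.10, 0.02):,.0f}")  # ~356,000 users per group
```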

How do you choose the right MDE?

The MDE should reflect the smallest effect that would change your decision. If a 0.5% improvement in retention wouldn't be worth the ongoing maintenance cost of the feature, then your MDE should be larger than 0.5%. If a 1% improvement in conversion rate would justify the change, your MDE should be at most 1%.

This requires teams to think about the economics of the decision before they run the experiment. How much is a 1% lift in this metric worth in revenue or user value? What's the cost of maintaining the feature? The MDE makes those trade-offs quantitative.
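
One way to make the economics concrete is a back-of-envelope break-even calculation; every figure below is hypothetical.

```python
# Hypothetical numbers: annual revenue flowing through the funnel
# under test, and the yearly cost of maintaining the new feature.
annual_funnel_revenue = 5_000_000  # USD, assumed
annual_maintenance_cost = 30_000   # USD, assumed

# Smallest relative lift whose revenue impact covers the maintenance cost.
break_even_lift = annual_maintenance_cost / annual_funnel_revenue
print(f"Break-even lift: {break_even_lift:.2%}")  # 0.60%

# An MDE below the break-even lift powers the test to detect
# effects too small to be worth shipping.
```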

At Spotify, where 300+ teams share experiment bandwidth across 10,000+ experiments per year, MDE choices have an organizational impact. A team that sets an unnecessarily small MDE (say, 0.1% when they'd ship at 1%) runs an experiment 100 times larger than needed. That traffic could have powered several other tests. Conversely, a team that sets the MDE too large (10% when realistic effects are 2-3%) will rarely reach significance and will wrongly conclude that nothing happened.

How does MDE relate to sample size?

The relationship is inverse and quadratic: required sample size scales with the inverse square of the MDE, so halving the MDE requires roughly four times the sample size, holding power and alpha constant. Small MDEs get expensive fast.

For a concrete example: detecting a 2% lift in a metric with a given variance might require 10,000 users per group. Detecting a 1% lift in the same metric requires ~40,000 users per group. Detecting a 0.5% lift requires ~160,000 users per group.
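
Reusing required_n_per_group from the earlier sketch shows the same pattern. The absolute counts depend on the assumed 10% baseline rather than the round numbers above, but each halving of the MDE roughly quadruples the requirement.

```python
# Halving the MDE roughly quadruples the required sample size,
# since n scales as 1/MDE^2. Baseline rate is the assumed 10%.
for mde in (0.02, 0.01, 0.005):
    n = required_n_per_group(0.10, mde)
    print(f"MDE {mde:.1%}: ~{n:,.0f} users per group")
```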

This is why variance reduction matters so much. CUPED typically reduces metric variance by ~50%, which effectively halves the sample size requirement for the same MDE. In Confidence, the power calculator automatically incorporates variance reduction when computing required sample sizes, so teams see the actual runtime they'll need rather than an inflated estimate based on raw metric variance.
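
As a rough illustration of why this works (a toy simulation, not Confidence's implementation), the CUPED adjustment subtracts the part of each user's metric that is predicted by a pre-experiment covariate; with a correlation of about 0.7 between pre- and in-experiment values, variance drops by roughly half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated users: x is the pre-experiment metric value, y is the
# in-experiment value, correlated with x (rho is about 0.7).
n = 100_000
x = rng.normal(10, 2, n)
y = 0.7 * x + rng.normal(0, 1.4, n)

# CUPED adjustment: theta = cov(x, y) / var(x)
theta = ((x - x.mean()) * (y - y.mean())).mean() / x.var()
y_cuped = y - theta * (x - x.mean())

reduction = 1 - y_cuped.var() / y.var()
print(f"Variance reduction: {reduction:.0%}")  # about 50% here
```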

What happens when the true effect is smaller than the MDE?

The experiment will likely produce a non-significant result. This doesn't mean the change has no effect. It means the test wasn't sensitive enough to detect it.

This is the most common misinterpretation of null results in A/B testing. A team runs a test powered to detect a 3% lift, sees no significant result, and concludes "the change didn't work." The change may well have produced a 1.5% lift that the test simply couldn't see. The correct interpretation is: "we found no evidence of an effect of 3% or larger."
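
To put a number on how little the test can see below its MDE, this sketch (reusing required_n_per_group and the same assumptions as before) computes the power a test sized for a 3% lift has against a true 1.5% lift.

```python
from scipy.stats import norm

def power_at_effect(p_baseline, mde_planned, effect_actual,
                    alpha=0.05, target_power=0.80):
    """Power against a true relative lift when the test was sized
    for mde_planned, using the same z-test approximation as above."""
    n = required_n_per_group(p_baseline, mde_planned, alpha, target_power)
    p1, p2 = p_baseline, p_baseline * (1 + effect_actual)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n) ** 0.5
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(p2 - p1) / se - z_alpha)

# Sized for +3%, the test catches a true +1.5% lift far less often.
print(f"{power_at_effect(0.10, 0.03, 0.015):.0%}")  # roughly 29%
```

Under these assumptions the test detects the 1.5% lift less than 30% of the time, so a null result says little about effects of that size.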

Teams that consistently set MDEs larger than the effects they're likely to produce will consistently get null results. The Spotify Search team's experimentation maturity journey included systematically reducing MDEs over time by combining CUPED, trigger analysis, and metric refinement within Confidence. As their tests became more sensitive, they detected effects they previously would have missed.