Minimum Detectable Effects
For success metrics, the effect size is the minimum detectable effect (MDE), which represents the smallest effect you want to be able to detect. Use the MDE to design your experiment so that it has enough statistical power to detect meaningful effects. Picking the MDE is a trade-off between:
- The smallest effect that is still relevant for the business
- The smallest effect that is practically measurable
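This trade-off can be made concrete with the standard sample size formula for a two-sided two-sample z-test on a proportion metric (normal approximation). The function name and the numbers below are illustrative, not part of Confidence itself; it is a minimal sketch of how the required sample size grows as the MDE shrinks:

```python
from statistics import NormalDist
from math import ceil

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample
    z-test on a proportion metric (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 1 percentage-point lift on a 10% conversion rate
# needs roughly 4x the users of detecting a 2-point lift:
print(sample_size_per_group(0.10, 0.01))  # ~15k per group
print(sample_size_per_group(0.10, 0.02))  # ~4k per group
```

Halving the MDE roughly quadruples the required sample size, which is why the smallest business-relevant effect is often not practically measurable.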
Non-Inferiority Margins
Confidence uses a different statistical test for guardrail metrics than for success metrics. These tests, called non-inferiority tests, verify that the metric does not degrade beyond a specified non-inferiority margin (NIM). The NIM is essentially a tolerance threshold: you accept a small amount of degradation in the guardrail metric, and the test gathers evidence to rule out the possibility that the metric deteriorates by more than the NIM. This choice affects both the power analysis and the results calculations.
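The idea behind a non-inferiority test can be sketched as a one-sided z-test with the null hypothesis shifted by the NIM. This is a generic illustration of the technique, not Confidence's exact implementation; the function name and figures are hypothetical:

```python
from statistics import NormalDist
from math import sqrt

def non_inferiority_test(p_treat, p_ctrl, n_treat, n_ctrl, nim, alpha=0.05):
    """One-sided z-test of H0: treatment is worse than control by at
    least `nim` (absolute), vs H1: it is not. Rejecting H0 is evidence
    that any degradation is smaller than the NIM."""
    se = sqrt(p_treat * (1 - p_treat) / n_treat +
              p_ctrl * (1 - p_ctrl) / n_ctrl)
    z = (p_treat - p_ctrl + nim) / se   # shift the null by the NIM
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha

# Guardrail at 9.8% in treatment vs 10.0% in control, NIM of
# 1 percentage point: the small observed dip is within tolerance.
z, p, non_inferior = non_inferiority_test(0.098, 0.100, 50_000, 50_000, nim=0.01)
```

Note that a small observed degradation can still pass the test, as long as the data rule out a degradation larger than the NIM.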
How to Find the Smallest Practically Measurable Effect
Follow these steps to quickly understand what effect sizes are practically measurable in your experiments.

1. Assess how large your experiments can be.
   - Do you need to run multiple experiments in parallel? For example, if you need to run 4 experiments simultaneously on the same population, each experiment can only use 100/4 = 25% of users.
   - Do you want to limit exposure to a new variant because the changes are risky?
2. Decide if the detectable effect size is small enough for your business needs.
   - If the smallest detectable effect is small enough, select a value slightly larger than the smallest effect you can detect with enough power. Remember that sample size calculations are estimates with inherent uncertainty.
   - If the smallest detectable effect is too large, consider these options:
     - Change the metric. Variance can vary widely between metrics measuring similar aspects of user behavior. A lower-variance metric is more sensitive, giving you a higher chance of detecting effects.
     - Adjust alpha and power. You can detect smaller effect sizes with the same sample size if you can tolerate more risk. Increase alpha to accept increased false positive results (shipping changes with no real effect). Decrease power to accept increased false negative results (failing to ship changes with positive effects).
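The steps above can be sketched by inverting the sample size formula: given a fixed number of users per group, compute the smallest effect detectable with adequate power. The function name and all numbers are illustrative assumptions (proportion metric, normal approximation, conservative variance `2*p*(1-p)`):

```python
from statistics import NormalDist
from math import sqrt

def smallest_detectable_effect(baseline, n_per_group, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a proportion metric given a fixed
    sample size per group (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sqrt(2 * baseline * (1 - baseline) / n_per_group)

# Running 4 parallel experiments gives each one a quarter of the
# users, which doubles the smallest detectable effect:
full = smallest_detectable_effect(0.10, 40_000)
quarter = smallest_detectable_effect(0.10, 10_000)

# Relaxing alpha and power shrinks the detectable effect at the cost
# of more false positives and false negatives:
relaxed = smallest_detectable_effect(0.10, 10_000, alpha=0.10, power=0.70)
```

Because the MDE scales with `1/sqrt(n)`, quartering the population doubles the MDE, which is why step 1 (experiment size) directly constrains step 2 (whether the detectable effect is small enough).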

