When planning an experiment, power analyses help you find the sample size needed for your metrics to reach the desired power. Power depends on the effect size, so every power analysis calculates the required sample size for the effect sizes you want to detect. Smaller effects are harder to detect and require larger sample sizes.

Minimum Detectable Effects

For success metrics, the effect size is the minimum detectable effect (MDE), which represents the smallest effect you want to be able to detect. Use the MDE to design your experiment so that it has enough statistical power to detect meaningful effects. Picking the MDE is a trade-off between:
  • The smallest effect that is still relevant for the business
  • The smallest effect that is practically measurable
As an experimenter, use your domain expertise and discuss with stakeholders to decide the smallest effect you would consider meaningful. Then, calculate the required sample size. If the sample size needed to detect your chosen MDE is unrealistically large, you need to increase the MDE. Note: the MDE is a required input to the power analysis, but does not impact the calculation of results.
Set the minimum detectable effect size to the smallest effect that you and your stakeholders care about. This ensures your A/B test can detect effects that are meaningful to the business. In other words, if the true effect is smaller than the MDE and you fail to detect it, it doesn’t matter because the improvement would be too small to justify shipping anyway. Larger effects are easier to detect than smaller ones, so a smaller MDE requires a larger sample size.
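To make the relationship between MDE and sample size concrete, here is a minimal sketch using the textbook normal approximation for a two-sided, two-sample test with equal group sizes. This is an assumption for illustration, not necessarily the calculation Confidence performs; the function name `required_sample_size_per_group` and its parameters are hypothetical, and the MDE is expressed in the same absolute units as the metric's standard deviation.

```python
import math
from statistics import NormalDist


def required_sample_size_per_group(mde_abs, std_dev, alpha=0.05, power=0.8):
    """Approximate users needed per group (hypothetical helper).

    Assumes a two-sided z-test on the difference in means, equal group
    sizes, and the same standard deviation in both groups.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * (std_dev * (z_alpha + z_power) / mde_abs) ** 2)


# Halving the MDE roughly quadruples the required sample size.
n_large_mde = required_sample_size_per_group(mde_abs=0.10, std_dev=1.0)
n_small_mde = required_sample_size_per_group(mde_abs=0.05, std_dev=1.0)
print(n_large_mde, n_small_mde)
```

Because the MDE enters the formula squared in the denominator, small reductions in the MDE can make the required sample size unrealistic quickly, which is why the MDE sometimes has to be increased.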

Non-Inferiority Margins

Confidence uses a different statistical test for guardrail metrics than for success metrics. These tests, called non-inferiority tests, verify that the metric does not deteriorate by more than a specified non-inferiority margin (NIM). The non-inferiority margin is essentially a tolerance level: you accept a small amount of degradation in the guardrail metric, but it must not worsen beyond the NIM.
A non-inferiority test gathers evidence to rule out the possibility that the metric deteriorates by more than the NIM. The choice of NIM affects both the power analysis and the results calculations.
Unlike the MDE for success metrics, the NIM is used directly in the hypothesis tests for guardrail metrics. Because of this, the NIM serves a dual purpose: it is both an effect size for the power analysis and a tolerance threshold in the statistical test itself. Smaller NIMs require larger sample sizes because it becomes harder to gather enough evidence that the metric stays within a tighter tolerance range.
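The dual role of the NIM can be sketched as follows. This is a simplified illustration under stated assumptions, not Confidence's implementation: it assumes a metric where higher values are better, a one-sided z-test, equal group sizes, and a true effect of zero when computing power. The function names `is_non_inferior` and `nim_sample_size_per_group` are hypothetical.

```python
import math
from statistics import NormalDist


def is_non_inferior(diff, std_err, nim, alpha=0.05):
    """One-sided z-test of H0: true difference <= -nim (hypothetical helper).

    `diff` is treatment minus control; the NIM shifts the null hypothesis,
    so it directly changes the test result.
    """
    z_stat = (diff + nim) / std_err
    return z_stat > NormalDist().inv_cdf(1 - alpha)


def nim_sample_size_per_group(nim, std_dev, alpha=0.05, power=0.8):
    """Approximate users per group, assuming the true effect is zero."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha) + z.inv_cdf(power)
    return math.ceil(2 * (std_dev * z_total / nim) ** 2)


# The same observed difference can pass or fail depending on the NIM.
print(is_non_inferior(diff=-0.05, std_err=0.03, nim=0.10))
print(is_non_inferior(diff=-0.05, std_err=0.03, nim=0.06))

# A tighter NIM requires more users.
print(nim_sample_size_per_group(nim=0.10, std_dev=1.0))
print(nim_sample_size_per_group(nim=0.05, std_dev=1.0))
```

Note that the NIM appears in both functions: once inside the test statistic and once as the effect size in the sample size calculation.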

How to Find the Smallest Practically Measurable Effect

Follow these steps to quickly understand what effect sizes are practically measurable in your experiments.
  1. Assess how large your experiments can be.
    • Do you need to run multiple experiments in parallel? For example, if you need to run 4 experiments simultaneously on the same population, each experiment can only use 100/4=25% of users.
    • Do you want to limit exposure to a new variant because the changes are risky?
    For example, if your experiments typically include 10,000 users, calculate the effect size you can detect with sufficient power at this sample size.
  2. Decide if the detectable effect size is small enough for your business needs.
    • If the smallest detectable effect is small enough, select a value slightly larger than the smallest effect you can detect with enough power. Remember that sample size calculations are estimates with inherent uncertainty.
    • If the smallest detectable effect is too large, consider these options:
      • Change the metric. Variance can vary widely between metrics measuring similar aspects of user behavior. A lower-variance metric is more sensitive, giving you a higher chance of detecting effects.
      • Adjust alpha and power. You can detect smaller effect sizes with the same sample size if you can tolerate more risk. Increasing alpha means accepting more false positive results (shipping changes with no real effect). Decreasing power means accepting more false negative results (failing to ship changes with positive effects).
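The steps above can be sketched by inverting the sample size calculation: fix the sample size and solve for the smallest effect you can detect. The sketch below uses the textbook normal approximation for a two-sided, two-sample test, which is an assumption and not necessarily what Confidence computes. The helper `detectable_effect` is hypothetical, and the example assumes 10,000 users split evenly into two groups of 5,000 with a metric standard deviation of 1.0.

```python
import math
from statistics import NormalDist


def detectable_effect(n_per_group, std_dev, alpha=0.05, power=0.8):
    """Smallest absolute effect detectable with the given power (hypothetical helper).

    Assumes a two-sided z-test, equal group sizes, and equal variances.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * std_dev * math.sqrt(2 / n_per_group)


# Assumed setup: 10,000 users, 5,000 per group, standard deviation 1.0.
baseline = detectable_effect(5_000, std_dev=1.0)

# Options when the detectable effect is too large:
lower_variance = detectable_effect(5_000, std_dev=0.5)            # change the metric
relaxed_alpha = detectable_effect(5_000, std_dev=1.0, alpha=0.10)  # more false positives
relaxed_power = detectable_effect(5_000, std_dev=1.0, power=0.70)  # more false negatives

print(baseline, lower_variance, relaxed_alpha, relaxed_power)
```

In this approximation the detectable effect scales linearly with the standard deviation, so switching to a metric with half the standard deviation halves the detectable effect, while relaxing alpha or power gives a smaller improvement at the cost of more decision risk.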