Experimentation is about understanding and controlling risks. Two concepts are central to managing risk in experimentation: alpha and power.

Alpha

Alpha is the false positive rate: how often you conclude an effect exists when in reality it doesn’t. For example, suppose you run an experiment and the results show that conversion has increased. If in truth the treatment had no effect and conversion didn’t actually increase, you have observed a false positive result.

Because data is inherently noisy, the false positive rate can never be completely zero, so you must choose an acceptable level of risk. Alpha is commonly set to 5% in many sciences, which is also the default for Confidence. The alpha you choose determines the rate of false positives you are willing to accept across repeated experiments. Depending on the consequences of shipping a feature that truly has no effect, you may want to decrease (more conservative) or increase (less conservative) this value.

Ideally you would have no false positive results at all, but setting a low alpha makes it harder to detect effects that truly exist. Setting alpha is therefore a balancing act between finding effects where there are none (false positives) and missing effects that really do exist (false negatives). Common values for alpha are 1%, 5%, and 10%. Higher alphas are often used in early-stage experiments that seek to identify promising variants for more rigorous testing later.
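As a rough illustration, the sketch below simulates many A/A tests (no true effect) with a two-proportion z-test and counts how often the result is significant at alpha = 5%. The conversion rate, sample sizes, and the use of statsmodels are assumptions for illustration only, not how Confidence computes results.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
alpha = 0.05          # chosen false positive rate
true_rate = 0.10      # hypothetical conversion rate, identical in both groups
n_per_group = 10_000
n_experiments = 2_000

false_positives = 0
for _ in range(n_experiments):
    # A/A test: treatment and control share the same true rate,
    # so every significant result is a false positive by construction.
    conversions_control = rng.binomial(n_per_group, true_rate)
    conversions_treatment = rng.binomial(n_per_group, true_rate)
    _, p_value = proportions_ztest(
        [conversions_treatment, conversions_control], [n_per_group, n_per_group]
    )
    false_positives += p_value < alpha

# The observed rate should land close to alpha, i.e. around 0.05.
print(f"Observed false positive rate: {false_positives / n_experiments:.3f}")
```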

Power

Statistical power is the probability of detecting an effect when there truly is an effect of a particular size. It determines your ability to separate signal from noise: higher power means a better chance of finding effects when they exist. Power is also known as the true positive rate and equals 1 minus the false negative rate. Power is commonly set to 80% in many sciences, which is also the default for Confidence. Depending on the consequences of missing a true effect, you may want to adjust this value. The power level also relates to the risks of magnitude (type-M) and sign (type-S) errors: when an experiment has low power, the significant effects it does detect are more likely to be overestimated or even to have the wrong sign (positive vs. negative).
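To make this concrete, the sketch below simulates experiments in which a true lift exists and estimates how often a two-proportion z-test detects it at alpha = 5%. The baseline rate, the lift, and the sample size are hypothetical numbers chosen so that the estimated power lands near 80%.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
alpha = 0.05
base_rate, treated_rate = 0.10, 0.11   # hypothetical 1 percentage point lift
n_per_group = 15_000                   # chosen so power is roughly 80%
n_experiments = 2_000

detections = 0
for _ in range(n_experiments):
    conversions_control = rng.binomial(n_per_group, base_rate)
    conversions_treatment = rng.binomial(n_per_group, treated_rate)
    _, p_value = proportions_ztest(
        [conversions_treatment, conversions_control], [n_per_group, n_per_group]
    )
    detections += p_value < alpha

# Fraction of experiments that detect the true effect: an estimate of power (~0.8 here).
print(f"Estimated power: {detections / n_experiments:.2f}")
```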

Power Analysis

Power analysis is the process of determining the minimum number of users required to reach a desired level of statistical power. While it’s often called “sample size calculation,” the result is the minimum number of users needed to detect a desired effect size at the chosen alpha and power levels, not necessarily the total number of users exposed in an experiment. The analysis takes several inputs, outlined below, and outputs the minimum number of users required to achieve the desired level of power.
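For a binary conversion metric, a power analysis can be sketched with statsmodels as shown below. The 10% baseline rate and the target lift to 11% are hypothetical, and the exact method Confidence uses internally may differ.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs: 10% baseline conversion, and we want to detect a lift to 11%.
effect_size = proportion_effectsize(0.11, 0.10)   # Cohen's h for the two proportions

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # false positive rate
    power=0.80,              # desired probability of detecting the effect
    ratio=1.0,               # equal treatment/control split
    alternative="two-sided",
)
print(f"Minimum users per group: {n_per_group:,.0f}")   # roughly 14,700 for these inputs
```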

Alpha and Power

You set alpha and power according to your tolerance for false positive or false negative errors. By default, Confidence sets alpha (the false positive rate) to 5% and the power level to 80%, but you can adjust them based on your risk tolerance. Lowering alpha or increasing the power level increases your confidence in your measured results and your ability to detect significant effects, but it also increases the number of users required for your experiment.
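The sketch below reuses the hypothetical inputs from the previous example and shows how the required sample size reacts when alpha is lowered or the power level is raised.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.11, 0.10)   # same hypothetical MDE as above
solver = NormalIndPower()

# Lower alpha or higher power -> more users required per group.
for alpha, power in [(0.10, 0.80), (0.05, 0.80), (0.05, 0.90), (0.01, 0.90)]:
    n = solver.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1.0)
    print(f"alpha={alpha:.2f}, power={power:.2f} -> users per group: {n:,.0f}")
```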

Experiment Intake

Experiment intake is the number of days at the start of an experiment during which you include newly exposed users in metric calculations. For example, if you run your experiment for 14 days and want to measure “Consumption during Week 1,” your intake is 7 days. The intake period is typically determined by how long the experiment can feasibly run. To avoid seasonality effects, the intake period should ideally be a multiple of 7 days. A longer experiment duration delays decision-making, but enables the experiment to expose more users and helps achieve the desired statistical power.

Metrics

You should select your metrics according to the hypothesis of the experiment. The variance of your selected metrics significantly affects the required sample size: high-variance metrics require many more users to detect small effects. The number of metrics also matters, because multiple testing corrections change the adjusted levels of alpha and power in the experiment.

The minimum detectable effect (MDE) is the smallest effect size you want to be able to measure in order to make a decision. The sample size calculation uses the MDE to determine how many users you need to detect this effect with a probability equal to the power level. The number of users required is inversely proportional to the square of the MDE, so measuring small changes requires many more users (illustrated in the sketch after the list below).

The MDE should be both meaningful and realistic. If you set the MDE too high, you may miss effects that would impact your decision. If you set it too low, it may be impossible to achieve the desired power with a realistic number of users in a reasonable time. You should ideally set the MDE based on product requirements for decision-making and a meta-analysis of effect sizes observed in prior experiments. As a last resort, consider Cohen’s recommendations:
  • Small effect: 1% of the variance (“too small to detect other than statistically; lower limit of what is clinically relevant”)
  • Medium effect: 6% of the variance (“clear with careful observation”)
  • Large effect: 15% of the variance (“clear with a superficial glance; unlikely to be the focus of research because it’s too obvious”)
For success metrics, the effect size used in the calculation is the minimum detectable effect (MDE); for guardrail metrics, it is the non-inferiority margin (NIM).
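The inverse-square relationship between the MDE and the required sample size can be seen in the sketch below. The baseline rate and the relative lifts are hypothetical, and how Confidence performs the calculation internally may differ.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                      # hypothetical baseline conversion rate
solver = NormalIndPower()

# Halving the MDE roughly quadruples the required number of users per group.
for relative_mde in (0.10, 0.05, 0.025):
    effect_size = proportion_effectsize(baseline * (1 + relative_mde), baseline)
    n = solver.solve_power(effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0)
    print(f"relative MDE {relative_mde:.1%} -> users per group: {n:,.0f}")
```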

Number of Variants

Each variant you add increases the multiple testing correction and, as a result, the number of users required. The probability of observing a significant result by chance increases with the number of comparisons, which requires an adjustment for multiple comparisons. For example, with 2 variants you have 1 comparison, but with 3 variants you have 2 comparisons (each treatment compared to control). Carefully consider the number of variants before running your experiment, and only include variants you’re genuinely interested in testing.
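As an illustration of why more variants need more users, the sketch below applies a simple Bonferroni adjustment to alpha and recomputes the required sample size per group. The Bonferroni correction and the inputs are assumptions for illustration; Confidence may apply a different correction.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.11, 0.10)   # hypothetical MDE, as before
solver = NormalIndPower()

for n_variants in (2, 3, 4):                 # total variants, including control
    n_comparisons = n_variants - 1           # each treatment is compared to control
    adjusted_alpha = 0.05 / n_comparisons    # Bonferroni correction (illustrative choice)
    n = solver.solve_power(effect_size=effect_size, alpha=adjusted_alpha, power=0.80, ratio=1.0)
    print(f"{n_variants} variants -> alpha per comparison {adjusted_alpha:.4f}, "
          f"users per group: {n:,.0f}")
```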

Treatment Sizes

An equal split between treatment and control minimizes the total number of users required, but it carries higher risk because the experiment exposes more users to the new, unproven variant. A smaller treatment group reduces that exposure, but requires more users in total to reach the same level of power.
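A small sketch, again with hypothetical inputs, shows this trade-off: as the treatment group shrinks relative to control, the total number of users needed to keep 80% power grows.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.11, 0.10)   # hypothetical MDE, as before
solver = NormalIndPower()

# ratio = treatment size / control size; 1.0 is an equal split.
for ratio in (1.0, 0.5, 0.25):
    n_control = solver.solve_power(
        effect_size=effect_size, alpha=0.05, power=0.80, ratio=ratio
    )
    n_treatment = n_control * ratio
    print(f"treatment:control ratio {ratio:.2f} -> "
          f"total users: {n_control + n_treatment:,.0f}")
```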