Lesson 2: A Refresher on Alpha and Power
This lesson is a brief summary of lessons 5 and 6 from the Hypothesis Testing course, to make sure you have what you need to understand the sample size calculation course.
Possible outcomes in experiments
In an experiment, there either exists a treatment effect or there doesn't, and you either detect it or you don't. This gives us four possible outcomes depicted below.
Across many experiments, these four outcomes will occur with some rates. That is, if we run 100 experiments, some number of them will end up in each of the four quadrants.
In hypothesis testing, alpha is used as a parameter to control the rate of false positive results among the experiments that have no effect, and power is used to control the rate of true positive results among the experiments where there is an effect. We call alpha and power the intended error rates of the test: alpha is the intended false positive rate, and 1 - power is the intended false negative rate.
Our goal with experimentation is to control the rates of incorrect and correct results. We can trade off between the rates of false positives and false negatives by changing the alpha, power, and sample size of our test. Several other factors also affect how risk is handled in experiments, and we will cover them in future courses. But for now, let's not get ahead of ourselves.
By using a statistically valid test with a certain alpha, and a sample size large enough for a certain MDE to achieve a certain power, we can:
- Bound the proportion of experiments without an effect that falsely detect an effect to be lower than or equal to alpha
- Bound the proportion of experiments with an effect of MDE (or larger) that correctly detect that effect to be larger than or equal to power.
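The two bounds above can be checked with a small simulation. The sketch below is illustrative only and rests on assumptions that are not stated in the lesson: a two-sided, two-sample Z-test with equal group sizes, a known standard deviation `sigma`, and group means drawn directly from their normal sampling distributions. The function names `required_n_per_group` and `detection_rate` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def required_n_per_group(alpha, power, mde, sigma):
    """Per-group sample size for a two-sided, two-sample Z-test.

    Assumes equal group sizes and a known, common standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / mde ** 2)


def detection_rate(true_effect, n, alpha, sigma, n_experiments=20_000, seed=1):
    """Share of simulated experiments in which the Z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se_mean = sigma / math.sqrt(n)      # standard error of one group mean
    se_diff = math.sqrt(2) * se_mean    # standard error of the difference in means
    detected = 0
    for _ in range(n_experiments):
        # Draw each group mean from its (assumed normal) sampling distribution.
        control_mean = rng.gauss(0.0, se_mean)
        treatment_mean = rng.gauss(true_effect, se_mean)
        z = (treatment_mean - control_mean) / se_diff
        if abs(z) > z_crit:
            detected += 1
    return detected / n_experiments


alpha, power, mde, sigma = 0.05, 0.80, 0.2, 1.0
n = required_n_per_group(alpha, power, mde, sigma)
false_positive_rate = detection_rate(0.0, n, alpha, sigma)  # no true effect
true_positive_rate = detection_rate(mde, n, alpha, sigma)   # true effect equals the MDE
print(n, false_positive_rate, true_positive_rate)
```

Under these assumptions, the simulated false positive rate should land close to alpha and the simulated true positive rate close to power, up to simulation noise.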
Video recap
If you haven't already, watch this 4-minute and 31-second video to quickly review what we've learned so far:
Win rate across all experiments
Having powered tests does not bound the true positive rate across all experiments you run. It only bounds the true positive rate for the subset of experiments that have a true treatment effect of MDE or larger.
In practice, some experiments will have a non-zero effect smaller than the MDE for which we have designed the test. In those experiments, our chance to detect the treatment effect will be smaller than power.
The best we can do is to make sure that we select MDEs that map to the smallest effect size that is practically relevant for our business. By powering all experiments to detect that effect, we can ensure that our true positive rate is at least power for all experiments in which the true effect is of a relevant size.
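To illustrate how the chance of detection drops when the true effect is smaller than the MDE, here is a minimal sketch. It assumes a two-sided Z-test whose sample size was chosen so that power at the MDE equals the design power; the function name `power_at_effect` and its default values are hypothetical and not taken from the course.

```python
from statistics import NormalDist


def power_at_effect(true_effect, mde, alpha=0.05, design_power=0.80):
    """Approximate power of a two-sided Z-test at `true_effect`.

    Assumes the sample size was chosen so that the test has `design_power`
    when the true effect equals `mde`.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(design_power)
    # At the designed sample size, mde / standard_error = z_alpha + z_power,
    # so the standardized true effect scales with true_effect / mde.
    shift = (true_effect / mde) * (z_alpha + z_power)
    # Probability of rejecting in either tail of the test.
    return std_normal.cdf(shift - z_alpha) + std_normal.cdf(-shift - z_alpha)


for fraction_of_mde in (0.0, 0.25, 0.5, 0.75, 1.0, 1.5):
    print(fraction_of_mde, round(power_at_effect(fraction_of_mde, 1.0), 3))
```

With these defaults, a true effect of half the MDE is detected with a probability of only about 0.29, far below the design power of 0.80, while a true effect of zero is "detected" at a rate equal to alpha.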
The nonlinearity of alpha and power
It is important to understand how the alpha and power parameters affect the sample size. Because we use a Z-test for evaluating experiments in most cases, a normal distribution underlies the dependency between required sample size, alpha, and power. This means that the required sample size does not change linearly with alpha or power, which makes it much harder to reason about how the required sample size responds to a change in either parameter.
The Alpha z-value
In the sample size calculation, the alpha parameter comes into the equation via a z-value. Although this is the same type of z-score that we have discussed in previous lessons, let's focus here not on the rationale for the z-value being in the equation, but on the relation between alpha and the z-value.
Alpha enters the sample size formula via a z-value because we are using a Z-test that is based on the asymptotic normality of the difference-in-means sample estimator. It is good to know that the relation between alpha and z-alpha is nonlinear. This implies that changing alpha by a fixed amount will change the required sample size by different amounts depending on the alpha you had to begin with. For example, changing alpha from 0.02 to 0.01 will increase the required sample size more than changing it from 0.1 to 0.09.
Note that the asymptotic normality that the Z-test (and therefore the sample size calculations) is based on doesn't require the underlying data to be normally distributed. Instead, it's the difference-in-means estimator that needs to be approximately normally distributed under the null hypothesis, which it is for many underlying data distributions thanks to the central limit theorem. Learn more about the distribution of the difference-in-means estimator in the Hypothesis Testing course.
The plot below shows how the z-value changes with alpha.
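The relation between alpha and its z-value can also be tabulated in a few lines of code. This is a sketch under assumptions: the test is two-sided by default, and the required sample size is taken to be proportional to (z_alpha + z_power)^2, as in the standard Z-test sample size formula. The function names `z_alpha` and `relative_sample_size` are hypothetical.

```python
from statistics import NormalDist


def z_alpha(alpha, two_sided=True):
    """z-value for a given alpha: the standard normal quantile at 1 - tail area."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


def relative_sample_size(alpha_from, alpha_to, power=0.80):
    """Factor by which the required sample size changes when alpha changes.

    Assumes the sample size is proportional to (z_alpha + z_power) ** 2.
    """
    z_power = NormalDist().inv_cdf(power)
    return ((z_alpha(alpha_to) + z_power) / (z_alpha(alpha_from) + z_power)) ** 2


for alpha in (0.10, 0.09, 0.05, 0.02, 0.01):
    print(alpha, round(z_alpha(alpha), 3))

# The same absolute change in alpha (0.01) has a different effect
# depending on the starting value.
print(relative_sample_size(0.02, 0.01))
print(relative_sample_size(0.10, 0.09))
```

Under these assumptions, moving alpha from 0.02 to 0.01 raises the required sample size by a noticeably larger factor than moving it from 0.1 to 0.09.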
The power z-value
The same relation holds for how power comes into the sample size formula.
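As a rough illustration, the power z-value can be computed as the standard normal quantile at the chosen power. This is a sketch assuming the usual Z-test sample size formula; the name `z_power` is hypothetical.

```python
from statistics import NormalDist


def z_power(power):
    """z-value for a given power: the standard normal quantile at `power`."""
    return NormalDist().inv_cdf(power)


for power in (0.80, 0.85, 0.90, 0.95):
    print(power, round(z_power(power), 3))

# Equal steps in power give unequal steps in the z-value.
step_low = z_power(0.85) - z_power(0.80)
step_high = z_power(0.95) - z_power(0.90)
print(step_low, step_high)
```

The step from 0.90 to 0.95 moves the z-value more than the step from 0.80 to 0.85, so each additional increase in power costs more sample size the higher the power already is.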