Lesson 5: False positive rate and alpha
This page introduces false positive results and the rate at which they occur in experiments. You will learn:
- What a false positive result is
- What the false positive rate is
- What alpha is and how it relates to the false positive rate
A false positive result
A false positive result, often simply called a "false positive," occurs when we find a statistically significant effect of a treatment in an experiment even though the treatment actually has no effect. Another term for a false positive result is a "type I error".
In most experiments, we are testing the mean difference between the treatment groups. A false positive result in this case would be when we observe a large enough mean difference between treatment and control groups to be statistically significant even though the treatment had no effect on the outcome.
All users have some value on the outcome metric even if the treatment has no effect. Some have large values relative to the population mean, some have small values relative to the population mean. When we are randomly splitting the users into treatment and control, there is always a risk that most users in the sample with large values end up in the treatment group rather than in the control group.
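To make the role of chance concrete, here is a minimal sketch in Python (standard library only). It assumes a hypothetical outcome metric that is normally distributed with mean 100 and standard deviation 20; the function name, the sample size, and the seed are illustrative choices, not part of the lesson. No treatment is applied to anyone, yet every random split produces a slightly different mean difference.

```python
import random
import statistics

def random_split_difference(values, rng):
    """Randomly split values into two halves; return mean(treatment) - mean(control)."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    treatment, control = shuffled[:half], shuffled[half:]
    return statistics.fmean(treatment) - statistics.fmean(control)

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical outcome metric for 200 users (assumed distribution).
# No treatment effect exists: the values never change between splits.
users = [rng.gauss(100, 20) for _ in range(200)]

# Repeat the random assignment many times on the same users.
differences = [random_split_difference(users, rng) for _ in range(1000)]

print(f"smallest mean difference: {min(differences):.2f}")
print(f"largest mean difference:  {max(differences):.2f}")
```

The spread between the smallest and largest difference comes entirely from the random assignment. A statistical test has to decide how large a difference must be before it is no longer plausibly explained by this kind of chance imbalance.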
The decision from one experiment will be right or wrong
Although the random treatment assignment makes it possible to quantify the variation, it also means that we can never be certain about whether an observed result in an experiment is true or not.
There is no way around the fact that we will never be certain about the result
of a single experiment. However, we can ensure that the rate of false positives
across many experiments is bounded at a certain level.
Statistical tests are constructed to limit the rate of finding the wrong result across many experiments.
False positive rate
Valid statistical tests quantify the variability of the test statistic under the hypothesis of no treatment effect (the null hypothesis) and use that to bound the rate at which we get false positive results. Only the most extreme imbalances under the null hypothesis, those that together have a probability of at most alpha (for example, the most extreme 5% when alpha is 0.05), are considered significant.
A good property for a statistical test to have is that the rate of false positives is bounded to a certain level which can be controlled by the experimenter.
The false positive rate is the rate at which we find false positives in experiments where there is no effect.
For example, if we run 100 experiments where the treatment has no effect on the
outcome metric, and we find that 10 have a significant effect, the proportion of
experiments where we find a significant effect (10/100 = 10%) is the false
positive rate of this test.
Alpha (the intended false positive rate)
Alpha is a parameter that statistical tests have that corresponds to the intended upper bound on the false positive rate. We say that a statistical test is valid if the false positive rate over repeated experiments (with no effect) is lower than or equal to alpha. In other words, by using valid statistical tests, we can bound the proportion of experiments where there is no true effect but we falsely find one.
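The validity condition can also be written compactly. The notation below is an assumption made for illustration: H_0 denotes the null hypothesis of no treatment effect, Z the test statistic of a two-sided Z-test, and z_{1-\alpha/2} the corresponding quantile of the standard normal distribution.

```latex
\[
\Pr(\text{reject } H_0 \mid H_0 \text{ is true}) \le \alpha
\]
\[
\text{Two-sided Z-test: reject } H_0 \text{ when } |Z| > z_{1-\alpha/2},
\qquad \text{e.g. } z_{0.975} \approx 1.96 \text{ for } \alpha = 0.05 .
\]
```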
At this point, you might be wondering why we cannot simply set alpha to zero to avoid all false positives. The reason is that we also want to be able to find true effects when they are there, and unfortunately, there is a trade-off between the false positive rate and our ability to find true effects which we return to in the next lesson.
False positive rate simulator
Now we can put what we have learned together and simulate experiments to see how
often we find false positives with a given alpha.
In this simulator, we are using a Z-test and are drawing large random samples.
Since the sample size is large, the distribution of the test statistic under the
null hypothesis is approximately normal and therefore the false positive rate
should be close to alpha across many random experiments.
If we could run the simulation with infinitely many experiments, the observed false positive rate would converge to alpha.
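For readers who want to reproduce the idea offline, the following is a rough sketch of such a simulation in Python (standard library only). It is not the code behind the simulator on this page. The function names, the standard normal outcome distribution, the group size of 100, and the 2,000 repetitions are all assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def z_test_is_significant(treatment, control, alpha):
    """Two-sided two-sample Z-test for a difference in means."""
    standard_error = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    z = (fmean(treatment) - fmean(control)) / standard_error
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical_value

def simulate_false_positive_rate(n_experiments, n_per_group, alpha, seed=0):
    """Share of no-effect experiments in which the test is significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: no treatment effect.
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_is_significant(treatment, control, alpha):
            false_positives += 1
    return false_positives / n_experiments

rate = simulate_false_positive_rate(n_experiments=2000, n_per_group=100, alpha=0.05)
print(f"observed false positive rate: {rate:.3f}")
```

With a finite number of simulated experiments, the observed rate will fluctuate around alpha rather than match it exactly. Increasing n_experiments narrows that fluctuation.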
What is a false positive result in an experiment?
What does the false positive rate represent?
What is the role of alpha in hypothesis testing?
Notes for Nerds
Conservative tests
The test used in the simulation would reach exactly the intended false positive rate if we simulated a large enough number of experiments. However, for a test to be valid, it's enough that the false positive rate is lower than or equal to alpha.
A statistical test that has a false positive rate substantially lower than alpha is called a conservative test. Generally speaking, it is good to avoid conservative tests as they give the experimenter less control over the risk management of the experiments.
Intended vs actual false positive rate
Note that we say "intended" false positive rate. If we use a statistical test
incorrectly, the actual false positive rate might not in fact be bounded by
alpha.
A classic example of misuse that inflates the false positive rate is using a
fixed-sample hypothesis test to peek at the data multiple times. In this case,
the false positive rate is not bounded by the alpha of the test, because the
test only bounds the false positive rate by alpha if it is performed once, at
the end of the experiment, not if it is performed repeatedly.
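The effect of peeking can be illustrated with a small simulation. The sketch below (Python, standard library only) is built on assumptions made for illustration: outcomes are standard normal with known variance 1, there is no treatment effect, the peeks are equally spaced, and the experiment stops at the first significant peek. The function name and all parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_experiments, n_per_group, n_peeks, alpha, seed=0):
    """Share of no-effect experiments declared significant at any of n_peeks looks.

    Assumes n_per_group is divisible by n_peeks, so the last look uses all data.
    """
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    peek_every = n_per_group // n_peeks
    false_positives = 0
    for _ in range(n_experiments):
        sum_diff = 0.0
        for n in range(1, n_per_group + 1):
            # One new user per group; same distribution, so no treatment effect.
            sum_diff += rng.gauss(0, 1) - rng.gauss(0, 1)
            if n % peek_every == 0:
                # Z-statistic for the mean difference with known variance 1 per group.
                z = (sum_diff / n) / math.sqrt(2 / n)
                if abs(z) > critical_value:
                    false_positives += 1
                    break  # stop at the first significant peek
    return false_positives / n_experiments

single_look = peeking_false_positive_rate(2000, 200, n_peeks=1, alpha=0.05)
ten_looks = peeking_false_positive_rate(2000, 200, n_peeks=10, alpha=0.05)
print(f"one test at the end: {single_look:.3f}")
print(f"ten peeks:           {ten_looks:.3f}")
```

With a single test at the end, the observed rate should land near alpha. With ten peeks, each individual test still uses alpha = 0.05, but the chance that at least one of them is significant is noticeably higher.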
Read more about the problems with peeking when using standard statistical hypothesis tests in the blog post on sequential tests.