Lesson 5: False positive rate and alpha

A false positive result

A false positive result, often simply called a "false positive," occurs when we find a statistically significant effect of a treatment in an experiment even though the treatment actually has no effect. Another term for a false positive result is a "type I error."

In most experiments, we are testing the mean difference between the treatment groups. A false positive result in this case would be when we observe a large enough mean difference between treatment and control groups to be statistically significant even though the treatment had no effect on the outcome.

All users have some value on the outcome metric even if the treatment has no effect. Some have large values relative to the population mean, some have small values relative to the population mean. When we randomly split the users into treatment and control, there is always a risk that most users in the sample with large values end up in the treatment group rather than in the control group, which produces a mean difference even though the treatment did nothing.
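
The effect of the random split can be illustrated with a minimal sketch. The user values, group sizes, and the function name split_and_compare are all hypothetical, chosen only for illustration. No treatment is applied at any point, yet each random split gives a different, nonzero mean difference.

```python
import random
import statistics

def split_and_compare(values, rng):
    """Randomly split users into two equal groups and return the
    difference in means (treatment minus control)."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    treatment, control = shuffled[:half], shuffled[half:]
    return statistics.fmean(treatment) - statistics.fmean(control)

rng = random.Random(1)
# Hypothetical outcome metric for 1,000 users; no treatment is ever applied.
users = [rng.gauss(100, 20) for _ in range(1000)]

# Repeat the random split several times on the same users.
diffs = [split_and_compare(users, rng) for _ in range(5)]
for d in diffs:
    print(f"mean difference: {d:+.2f}")
```

Every printed difference comes from chance alone, which is the variation a statistical test has to account for.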


The decision from one experiment will be right or wrong

Although the random treatment assignment makes it possible to quantify the variation, it also means that we can never be certain about whether an observed result in an experiment is true or not.

Possible outcomes

There is no way around the fact that we will never be certain about the result of a single experiment. However, we can ensure that the rate of false positives across many experiments is bounded at a certain level.
Statistical tests are constructed to limit the rate of finding the wrong result across many experiments.


False positive rate

Valid statistical tests quantify the variability of the test statistic under the hypothesis of no treatment effect (the null hypothesis) and use that to bound the rate at which we get false positive results. Only the most extreme imbalances, those that would occur with probability at most alpha under the null hypothesis, are considered significant.
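
As a rough sketch of this rule, the snippet below implements a two-sided Z-test for a difference in means, assuming samples large enough for the normal approximation to hold. The function name z_test and the data values are hypothetical. The sample sizes are tiny only to keep the example short.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance

def z_test(treatment, control, alpha=0.05):
    """Two-sided Z-test for a difference in means.
    Returns (z, critical_value, significant)."""
    # Standard error of the mean difference, using sample variances.
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    z = (fmean(treatment) - fmean(control)) / se
    # Only |z| values beyond this threshold fall in the alpha most extreme
    # outcomes under the null hypothesis.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z, critical, abs(z) > critical

# Hypothetical outcome values for two small groups, for illustration only.
treatment = [10.2, 11.5, 9.8, 12.1, 10.9, 11.3]
control = [10.0, 10.4, 9.9, 10.8, 10.1, 10.6]
z, critical, significant = z_test(treatment, control, alpha=0.05)
print(f"z = {z:.2f}, critical value = {critical:.2f}, significant = {significant}")
```

The critical value depends only on alpha, not on the data, which is what lets the experimenter choose the false positive rate in advance.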

A good property for a statistical test to have is that the rate of false positives is bounded to a certain level which can be controlled by the experimenter.


Alpha (the intended false positive rate)

Alpha is a parameter that statistical tests have that corresponds to the intended upper bound on the false positive rate. We say that a statistical test is valid if the false positive rate over repeated experiments (with no effect) is lower than or equal to alpha. In other words, by using valid statistical tests, we can bound the proportion of experiments where there is no true effect but we falsely find one.
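
The validity condition can also be written compactly. The notation is introduced here for illustration and is not taken from the lesson: H_0 denotes the hypothesis of no treatment effect.

```latex
\Pr(\text{reject } H_0 \mid H_0 \text{ is true}) \le \alpha
```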

At this point, you might be wondering why we cannot simply set alpha to zero to avoid all false positives. The reason is that we also want to be able to find true effects when they are there, and unfortunately, there is a trade-off between the false positive rate and our ability to find true effects which we return to in the next lesson.
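
One way to see why alpha cannot simply be set to zero is to look at how the significance threshold of a two-sided Z-test grows as alpha shrinks. This is a minimal sketch, assuming a normally distributed test statistic. The function name critical_value is hypothetical.

```python
from statistics import NormalDist

def critical_value(alpha):
    """Two-sided critical value of a Z-test for a given alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"alpha={alpha:<6} |z| must exceed {critical_value(alpha):.2f}")
```

The threshold increases without bound as alpha approaches zero, so with alpha at zero no observed difference, however large, would ever be declared significant.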


False positive rate simulator

Now we can put what we have learned together and simulate experiments to see how often we find false positives with a given alpha.
In this simulator, we are using a Z-test and are drawing large random samples. Since the sample size is large, the distribution of the test statistic under the null hypothesis is approximately normal and therefore the false positive rate should be close to alpha across many random experiments.


If we ran the simulation with infinitely many experiments, the false positive rate would converge to alpha.
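
A simulation of this kind can be sketched in a few lines. This is not the code behind the simulator, only an illustration under stated assumptions: both groups are drawn from the same standard normal distribution (so there is no treatment effect), and the function name, sample sizes, and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance

def simulate_false_positive_rate(alpha=0.05, n_experiments=1000,
                                 n_per_group=200, seed=0):
    """Share of no-effect experiments that a two-sided Z-test
    declares significant."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: no treatment effect.
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        se = sqrt(variance(treatment) / n_per_group
                  + variance(control) / n_per_group)
        z = (fmean(treatment) - fmean(control)) / se
        if abs(z) > critical:
            false_positives += 1
    return false_positives / n_experiments

rate = simulate_false_positive_rate(alpha=0.05)
print(f"false positive rate: {rate:.3f}")
```

With a finite number of experiments the printed rate fluctuates around alpha. Increasing n_experiments narrows that fluctuation.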



Notes for Nerds

Conservative tests

The test used in the simulation would reach exactly the intended false positive rate if we simulated a large enough number of experiments. However, for a test to be valid, it's enough that the false positive rate is lower than or equal to alpha.

A statistical test that has a false positive rate substantially lower than alpha is called a conservative test. Generally speaking, it is good to avoid conservative tests as they give the experimenter less control over the risk management of the experiments.


Intended vs actual false positive rate

Note that we say "intended" false positive rate. If we use a statistical test incorrectly, the actual false positive rate might not in fact be bounded by alpha.
A classic example of misuse that inflates the false positive rate is using a fixed-sample hypothesis test to peek at the data multiple times. In this case, the false positive rate is not bounded by the alpha of the test, because the test bounds the false positive rate by alpha only if it is performed once at the end of the experiment, not if it is performed repeatedly.
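
The inflation from peeking can be illustrated with a small simulation. This is a sketch under simplifying assumptions: outcomes are standard normal with known unit variance, there is no treatment effect, looks are equally spaced, and the experiment is declared significant the first time any look crosses the fixed-sample threshold. The function name and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(alpha=0.05, n_experiments=1000,
                                n_per_group=200, n_looks=10, seed=0):
    """Share of no-effect experiments declared significant at ANY look."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    look_every = n_per_group // n_looks
    false_positives = 0
    for _ in range(n_experiments):
        diff_sum = 0.0  # running sum of (treatment - control) outcomes
        for n in range(1, n_per_group + 1):
            diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            if n % look_every == 0:
                # Known unit variance per group, so the standard error of
                # the mean difference with n users per group is sqrt(2 / n).
                z = (diff_sum / n) / sqrt(2 / n)
                if abs(z) > critical:
                    false_positives += 1
                    break
    return false_positives / n_experiments

once = peeking_false_positive_rate(n_looks=1)
many = peeking_false_positive_rate(n_looks=10)
print(f"one look at the end: {once:.3f}")
print(f"ten looks:           {many:.3f}")
```

With a single look at the end, the rate stays near alpha. With ten looks, each look is another chance to cross the threshold by luck, so the overall rate ends up well above alpha.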

Read more about the problem with peeking when using standard statistical hypothesis tests in the blog post on sequential tests.