Lesson 6: True positive rate, MDE, and power

A true positive result

A true positive result, often simply called a "true positive", occurs when we find a statistically significant effect of a treatment in an experiment and the treatment actually does have an effect. In other words, we correctly conclude that the treatment had a real impact.

A true positive result is when we observe a large enough mean difference between treatment and control groups to be statistically significant when the treatment truly has an effect.

However, the fact that the treatment has an effect is no guarantee that we will find a true positive result. By chance, people with high outcome values might end up in the control group, and people with low outcome values might end up in the treatment group. In such a situation, we might not find a true positive result, even though the treatment has an effect, because the treatment effect is canceled out by the difference between the groups caused by the random treatment assignment.

Thanks to the random assignment of users to treatment and control, probability theory lets us quantify how likely any given level of random imbalance between the groups is; this is precisely what statistical tests do.
Valid statistical tests quantify the variability of the test statistic under the hypothesis of no treatment effect, which can be used to bound the rate at which we get true positive results for certain hypothetical effects.

True positive rate

We can never be certain about whether an observed result in an experiment is true or not, due to randomness. For this reason, statistical tests are derived to limit the rate of finding the wrong result across many experiments.

A good property for a statistical test to have is a high true positive rate. The true positive rate is the rate at which the test correctly identifies a significant effect when there is a true effect from the treatment.

For example, if we run 100 experiments where the treatment truly affects the outcome metric, and we find that 80 of these experiments show a significant effect, the proportion of experiments where we correctly find a significant effect (80/100=80%) is the true positive rate of this test.
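
The repeated-experiments idea above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not a prescribed method: outcomes are assumed normally distributed with a known standard deviation, the test is a two-sided z-test on the difference in means, and the function name and all parameter values (effect 0.4, sigma 1, 100 users per group) are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_true_positive_rate(effect, sigma, n_per_group,
                                alpha=0.05, n_experiments=2000, seed=0):
    """Share of simulated experiments with a significant two-sided z-test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference in means (sigma assumed known).
    se = sigma * (2 / n_per_group) ** 0.5
    significant = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treatment = [rng.gauss(effect, sigma) for _ in range(n_per_group)]
        diff = sum(treatment) / n_per_group - sum(control) / n_per_group
        # Two-sided test: a significant result in either direction counts.
        if abs(diff / se) > z_crit:
            significant += 1
    return significant / n_experiments


rate = simulate_true_positive_rate(effect=0.4, sigma=1.0, n_per_group=100)
print(f"Estimated true positive rate: {rate:.2f}")
```

With these assumed values the estimate should land at roughly 0.8, mirroring the 80-out-of-100 example. It is an estimate from a finite number of simulated experiments, so it varies slightly with the seed.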

Having a high true positive rate means that the false negative rate is low (the two rates sum to 1), which is good because it means that we rarely miss true effects.

The minimum detectable effect (MDE)

How large the true positive rate is depends on several things, including the false positive rate. Importantly, the true positive rate depends on the size of the unknown but true treatment effect.

If the treatment effect is huge relative to the variability of the outcome metric, the true positive rate will be high. If the treatment effect is tiny (but not zero), the true positive rate will be low.
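
To make the dependence on effect size concrete, here is a hedged sketch that computes the true positive rate analytically rather than by simulation. It assumes a two-sided z-test on the difference in means with a known, common standard deviation and equal group sizes; the function name and the example numbers are illustrative only.

```python
from statistics import NormalDist


def power_two_sided_z(effect, sigma, n_per_group, alpha=0.05):
    """True positive rate of a two-sided z-test for a given true effect."""
    nd = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5   # standard error of the mean difference
    z_crit = nd.inv_cdf(1 - alpha / 2)      # critical value of the test
    shift = effect / se                     # true effect in standard-error units
    # Probability that the z-statistic falls in either rejection region.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


for effect in (0.01, 0.1, 0.2, 0.4, 0.8):
    tpr = power_two_sided_z(effect, sigma=1.0, n_per_group=100)
    print(f"effect={effect:<5} true positive rate={tpr:.3f}")
```

In this sketch only the ratio of the effect to the standard error matters, which is why "huge relative to the variability" is the relevant notion. As the effect shrinks toward zero, the rate approaches alpha rather than zero.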

To derive statistical tests with bounded true positive rates, we use the concept of the Minimum Detectable Effect (MDE). The MDE is the smallest effect size that we want to be able to detect in an experiment with a certain true positive rate.
In other words, if the true effect equals the MDE, we want the true positive rate to be at least a chosen level.

Power (the intended true positive rate)

Power is a parameter of a statistical test that corresponds to the lower bound on the true positive rate for a given MDE.
We say that a statistical test is powered for a certain effect (MDE) if the true positive rate over repeated experiments (with a true effect of MDE) is higher than or equal to a desired level of power.
In other words, by using powered statistical tests, we can ensure that, across experiments where the true effect is at least as large as the MDE, the proportion in which we fail to detect the effect is at most 1-power. The value 1-power is called the false negative rate and is often represented by beta.

At this point, you might be wondering why we cannot simply set power to 100% to find all true positive results. The reason is that this would require an infinitely large sample. The relation is as follows: the larger the sample size, the smaller the standard error of the mean difference, and therefore the greater our ability to detect small effects.
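
The following sketch shows how the required sample size grows as the desired power approaches 100%. It uses the standard normal-approximation formula for a two-sided test comparing two means with equal group sizes and a known standard deviation, which is an assumption on our part and not necessarily the calculation used in the course. The function name and the values (MDE 0.1, sigma 1) are hypothetical.

```python
import math
from statistics import NormalDist


def required_n_per_group(mde, sigma, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided z-test (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)  # raises StatisticsError for power >= 1
    return math.ceil(2 * (sigma * (z_alpha + z_power) / mde) ** 2)


for power in (0.8, 0.9, 0.99, 0.999, 0.99999):
    n = required_n_per_group(mde=0.1, sigma=1.0, power=power)
    print(f"power={power:<8} n per group={n}")
```

The quantile for the desired power grows without bound as power tends to 1, so no finite sample size delivers exactly 100% power under this model.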

Risk management in experimentation is about balancing the risk of false positives and the chance of true positives against the sample size required to achieve these risk bounds. Learn more about sample size and how to calculate the required sample size for your experiment in the Sample size calculation - level I course.

For a fixed alpha and MDE, the higher the power we want, the larger the sample size we need.
We can increase alpha to reduce the required sample size for a given level of power and MDE, but this increases the rate of false positives when the treatment in fact has no effect.

For a fixed alpha and power, we can increase the MDE to reduce the required sample size, but this means that we can only reliably find larger effects, and we might miss smaller effects even if they are practically important.
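
These trade-offs can also be read in the other direction: for a fixed sample size, which MDE can the test be powered for? The sketch below assumes a two-sided z-test under the normal approximation, with equal group sizes and a known standard deviation. The function name and the numbers are illustrative.

```python
from statistics import NormalDist


def mde_for_sample_size(n_per_group, sigma, alpha=0.05, power=0.8):
    """Smallest effect a two-sided z-test is powered for at a given sample size."""
    nd = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5  # standard error of the mean difference
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) * se


n = 1570  # hypothetical number of users per group
for alpha, power in ((0.05, 0.8), (0.10, 0.8), (0.05, 0.9)):
    mde = mde_for_sample_size(n, sigma=1.0, alpha=alpha, power=power)
    print(f"alpha={alpha:.2f} power={power:.1f} -> MDE={mde:.4f}")
```

Under these assumptions, raising alpha shrinks the MDE the experiment is powered for, while demanding more power enlarges it. Sample size, alpha, power, and MDE cannot all be improved at once.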

Notes for Nerds

Sometimes the advice is given to "not trust underpowered experiments". This is because significant treatment effects observed in underpowered experiments are by construction overestimated: when power is low, an estimate has to be much larger than the true effect to reach significance. See for example this paper for details.
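
A small simulation can illustrate the overestimation. This is a sketch under simplifying assumptions: the estimated effect is drawn directly from a normal distribution centered on the true effect, the standard error is known, and the test is two-sided. The function name and the effect and standard error values are hypothetical.

```python
import random
from statistics import NormalDist


def exaggeration_ratio(effect, se, alpha=0.05, n_experiments=100_000, seed=0):
    """Mean absolute significant estimate divided by the true effect."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_experiments):
        estimate = rng.gauss(effect, se)  # estimated effect in one experiment
        if abs(estimate) > z_crit * se:   # keep only significant results
            significant.append(abs(estimate))
    return (sum(significant) / len(significant)) / effect


# Same true effect, two different standard errors (i.e. sample sizes).
low_power = exaggeration_ratio(effect=0.1, se=0.1)    # power well below 50%
high_power = exaggeration_ratio(effect=0.1, se=0.02)  # power close to 100%
print(f"low power:  significant estimates are {low_power:.2f}x the true effect")
print(f"high power: significant estimates are {high_power:.2f}x the true effect")
```

In the low-power setting, every significant estimate must exceed about 1.96 standard errors, which here is almost twice the true effect. The average significant estimate is therefore inflated by at least that factor, while in the high-power setting the ratio stays close to 1.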

However, this advice shouldn't be given without reference to the MDE for which the experiment is powered, especially in relation to the true effect. What really matters is the true positive rate, which is a function of the true effect. If the experiment is underpowered for a very small MDE, but the true effect is very large, the true positive rate might be high even when the experiment is underpowered according to its MDE.

The bottom line is that power is always defined relative to a hypothetical effect (the MDE). The true effect can be anything, and thus the actual true positive rate can be much larger or smaller than the intended power, even if our sample size is sufficiently large to power the experiment for the MDE we have chosen.
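
As a closing illustration, the sketch below sizes an experiment for a chosen MDE and then evaluates the actual true positive rate for true effects that differ from it. It assumes a two-sided z-test under the normal approximation, with equal group sizes and a known standard deviation. All names and values are hypothetical.

```python
import math
from statistics import NormalDist

ND = NormalDist()


def n_per_group_for(mde, sigma, alpha, power):
    """Per-group sample size that powers the test for the given MDE."""
    z_sum = ND.inv_cdf(1 - alpha / 2) + ND.inv_cdf(power)
    return math.ceil(2 * (sigma * z_sum / mde) ** 2)


def true_positive_rate(true_effect, sigma, n_per_group, alpha):
    """Actual rate of significant results for a given true effect."""
    se = sigma * (2 / n_per_group) ** 0.5
    z_crit = ND.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    return (1 - ND.cdf(z_crit - shift)) + ND.cdf(-z_crit - shift)


mde, sigma, alpha, power = 0.1, 1.0, 0.05, 0.8
n = n_per_group_for(mde, sigma, alpha, power)

# True effects at half, exactly, and twice the MDE.
rates = {m: true_positive_rate(m * mde, sigma, n, alpha) for m in (0.5, 1.0, 2.0)}
for multiple, rate in rates.items():
    print(f"true effect = {multiple} x MDE -> true positive rate = {rate:.3f}")
```

The intended power of 80% is only realized when the true effect equals the MDE. In this sketch, a true effect of half the MDE is detected far less often, and one of twice the MDE almost always.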