Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 6: True positive rate, MDE, and power

Summary

This page teaches you about the concept of true positive results in experiments. You learn:

What a true positive result is.
What the minimum detectable effect is.
What the true positive rate is.
What power is, and how it relates to the true positive rate.

A true positive result

A true positive result, often simply called a "true positive", is when we find a statistically significant effect from a treatment in an experiment, and the treatment actually does have an effect. In other words, we correctly identify a significant effect from the treatment, confirming that the treatment had a real impact.

A true positive result is when we observe a large enough mean difference between treatment and control groups to be statistically significant when the treatment truly has an effect.

However, a real treatment effect does not guarantee a true positive result. By chance, people with large outcome values might end up in the control group, and people with small outcome values might end up in the treatment group. In that situation we might fail to find a significant effect even though the treatment works, because the treatment effect is canceled out by the difference between the groups that the random assignment itself created.

User assignment simulator

Randomly assign users to treatment or control groups with balanced allocation (4 users per group). In this simulation, the treatment effect is applied to users in the treatment group.

Users (outcome value in parentheses): Alice (10), Bob (15), Charlie (5), David (20), Eve (8), Frank (12), Grace (18), Henry (6).
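The simulator can be approximated offline with a short Python sketch. It takes the eight users listed above, treats the number next to each name as that user's outcome value (an assumption), and adds a hypothetical treatment effect of 3 to every treated user; the size of the effect is not given on this page and is chosen only for illustration. Repeating the random 4/4 split many times shows how much the observed mean difference moves around purely because of who lands in which group.

```python
import random

# Users from the simulator; the numbers are assumed to be outcome values
# before any treatment effect is applied.
users = {"Alice": 10, "Bob": 15, "Charlie": 5, "David": 20,
         "Eve": 8, "Frank": 12, "Grace": 18, "Henry": 6}

TREATMENT_EFFECT = 3  # hypothetical effect added to each treated user


def assign_and_compare(rng):
    """Randomly split the users 4/4 and return the observed mean difference."""
    names = list(users)
    rng.shuffle(names)
    treatment, control = names[:4], names[4:]
    treated_mean = sum(users[n] + TREATMENT_EFFECT for n in treatment) / 4
    control_mean = sum(users[n] for n in control) / 4
    return treated_mean - control_mean


rng = random.Random(0)  # seeded only so the sketch is reproducible
differences = [assign_and_compare(rng) for _ in range(10_000)]

average_difference = sum(differences) / len(differences)
share_negative = sum(d < 0 for d in differences) / len(differences)

print(f"average observed difference: {average_difference:.2f}")
print(f"share of assignments with a negative difference: {share_negative:.1%}")
```

On average the observed difference is close to the effect that was added, yet a noticeable share of individual assignments shows a negative difference even though every treated user benefited.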

Thanks to the random assignment of users to treatment and control, probability theory lets us quantify how likely any given level of random imbalance between the groups is; this is precisely what statistical tests do.
Valid statistical tests quantify the variability of the test statistic under the hypothesis of no treatment effect, which in turn lets us bound the rate at which we get true positive results for certain hypothetical effects.

Note

If we fail to detect an existing effect, we call this a false negative result, also known as a "Type II error". Bounding the true positive rate to be above a certain intended power is the same as bounding the false negative rate to be below a certain level.

True positive rate

We can never be certain about whether an observed result in an experiment is true or not, due to randomness. For this reason, statistical tests are derived to limit the rate of finding the wrong result across many experiments.

A good property for a statistical test to have is a high true positive rate. The true positive rate is the rate at which the test correctly identifies a significant effect when there is a true effect from the treatment.

For example, if we run 100 experiments where the treatment truly affects the outcome metric, and we find that 80 of these experiments show a significant effect, the proportion of experiments where we correctly find a significant effect (80/100=80%) is the true positive rate of this test.

Having a high true positive rate means that the false negative rate is low, which is good because it means that we are not missing true effects.
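The "many experiments" idea can be sketched as a simulation. The example below is a minimal illustration rather than how Confidence computes anything: it assumes normally distributed outcomes with a known standard deviation, a two-sided z-test, and hypothetical values for the true effect, group size, and alpha. It runs many experiments in which the treatment truly has an effect and counts the share that come out significant.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05           # intended false positive rate (two-sided test)
TRUE_EFFECT = 0.5      # hypothetical true effect, in outcome units
SIGMA = 1.0            # assumed known standard deviation of the outcome
N_PER_GROUP = 64       # hypothetical number of users per group
N_EXPERIMENTS = 2_000  # number of simulated experiments

z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)
std_error = SIGMA * (2 / N_PER_GROUP) ** 0.5  # SE of the mean difference


def is_significant(rng):
    """Simulate one experiment and report whether the z-test is significant."""
    control = [rng.gauss(0.0, SIGMA) for _ in range(N_PER_GROUP)]
    treatment = [rng.gauss(TRUE_EFFECT, SIGMA) for _ in range(N_PER_GROUP)]
    z = (fmean(treatment) - fmean(control)) / std_error
    return abs(z) > z_crit


rng = random.Random(1)  # seeded only so the sketch is reproducible
n_significant = sum(is_significant(rng) for _ in range(N_EXPERIMENTS))
true_positive_rate = n_significant / N_EXPERIMENTS

print(f"significant in {n_significant} of {N_EXPERIMENTS} experiments")
print(f"estimated true positive rate: {true_positive_rate:.1%}")
```

With these particular assumptions the estimate lands near 80%, which mirrors the 80-out-of-100 example above; different effect sizes or group sizes give different rates.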

The minimum detectable effect (MDE)

How large the true positive rate is depends on several things, including the false positive rate. Importantly, the true positive rate depends on the size of the unknown but true treatment effect.

If the treatment effect is huge relative to the variability of the outcome metric, the true positive rate will be high. If the treatment effect is tiny (but not zero), the true positive rate will be low.
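To make the dependence on effect size concrete, here is a hedged sketch that computes the true positive rate analytically instead of simulating it. It assumes a two-sided z-test for a difference in means with known standard deviation; the function name `true_positive_rate` and all parameter values are illustrative choices, not part of the course material.

```python
from statistics import NormalDist


def true_positive_rate(effect, sigma, n_per_group, alpha=0.05):
    """Rejection probability of a two-sided z-test for a given true effect."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    std_error = sigma * (2 / n_per_group) ** 0.5
    shift = effect / std_error  # true effect measured in standard errors
    # Probability that the z-statistic falls beyond either critical value.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


for effect in (0.05, 0.1, 0.25, 0.5, 1.0):
    rate = true_positive_rate(effect, sigma=1.0, n_per_group=64)
    print(f"true effect = {effect:<4}  true positive rate = {rate:.3f}")
```

The printed rates rise with the effect size: effects that are small relative to the standard error are rarely detected, while large ones are detected almost always. As the effect approaches zero, the rate approaches alpha.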

To derive statistical tests with bounded true positive rates, we use the concept of the Minimum Detectable Effect (MDE). The MDE is the smallest effect size that we want to be able to detect in an experiment with a certain true positive rate.
In other words, if the true effect equals the MDE, we want the true positive rate to be at least a chosen level.

Power (the intended true positive rate)

Power is a parameter of a statistical test that sets the lower bound on the true positive rate.
We say that a statistical test is powered for a certain effect (the MDE) if the true positive rate over repeated experiments with a true effect equal to the MDE is at least the desired level of power.
In other words, by using powered statistical tests, we can bound the proportion of experiments in which we fail to detect a true effect of that size at 1-power. The value 1-power is the bound on the false negative rate and is often represented by beta.
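Written as probabilities, the guarantees described in this lesson can be summarized as follows. The notation is ours and only restates the text: alpha is the intended false positive rate, beta is the bound on the false negative rate, and "significant" means the test finds a statistically significant effect.

```latex
\begin{align*}
\Pr(\text{significant} \mid \text{true effect} = 0) &\le \alpha \\
\Pr(\text{significant} \mid \text{true effect} = \text{MDE}) &\ge \text{power} = 1 - \beta \\
\Pr(\text{not significant} \mid \text{true effect} = \text{MDE}) &\le \beta
\end{align*}
```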

At this point, you might be wondering why we cannot simply set power to 100% to find all true positive results. The reason is that this would require an infinitely large sample. The relation here is: the larger the sample size the smaller the standard error of the mean difference, and therefore the higher our ability to detect small effects.
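A rough way to see why 100% power is out of reach is the textbook normal-approximation formula for the sample size of a two-sided test on a difference in means. The sketch below uses that formula under assumptions of equal group sizes and known standard deviation; it is not necessarily the calculation used in Confidence or in the sample size course, and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def required_n_per_group(mde, sigma, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect `mde` with `power`."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)  # grows without bound as power -> 1
    return math.ceil(2 * ((z_alpha + z_power) * sigma / mde) ** 2)


for power in (0.8, 0.9, 0.99, 0.999, 0.999999):
    n = required_n_per_group(mde=0.5, sigma=1.0, power=power)
    print(f"power = {power:<8}  users per group = {n}")
```

Each step closer to 100% power costs more users than the previous one, and `power=1.0` cannot be evaluated at all because the corresponding normal quantile is infinite (`inv_cdf` raises an error for it).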

Risk management in experimentation is about balancing the risk of false positives and the chance of true positives against the sample size required to achieve these risk bounds. Learn more about sample size and how to calculate the required sample size for your experiment in the Sample size calculation - level I course.

For a fixed alpha and MDE, the higher the power we want, the larger the sample size we need.
We can increase alpha to reduce the required sample size for a given level of power and MDE, but this increases the false positive rate when the treatment in fact has no effect.

For a fixed alpha and power, we can increase the MDE, but this means that we can only find larger effects, and we might miss smaller effects even if they are practically important.
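These trade-offs can be illustrated by turning the question around: for a fixed sample size, what is the smallest effect a test is powered to detect? The sketch below again relies on the normal approximation for a two-sided test with equal group sizes and known standard deviation; `detectable_effect` and the numbers passed to it are illustrative assumptions.

```python
from statistics import NormalDist


def detectable_effect(n_per_group, sigma, alpha=0.05, power=0.8):
    """Approximate MDE that `n_per_group` users can support."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    std_error = sigma * (2 / n_per_group) ** 0.5
    return (z_alpha + z_power) * std_error


baseline = detectable_effect(n_per_group=1_000, sigma=1.0)
looser_alpha = detectable_effect(n_per_group=1_000, sigma=1.0, alpha=0.10)
higher_power = detectable_effect(n_per_group=1_000, sigma=1.0, power=0.9)

print(f"alpha=0.05, power=0.8: MDE = {baseline:.3f}")
print(f"alpha=0.10, power=0.8: MDE = {looser_alpha:.3f}")
print(f"alpha=0.05, power=0.9: MDE = {higher_power:.3f}")
```

With the sample size held fixed, accepting a higher alpha lets the test target a smaller effect, while asking for more power forces the MDE up. Something always has to give.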

Notes for Nerds

A common piece of advice is to "not trust underpowered experiments". This is because significant treatment effects observed in underpowered experiments are by construction overestimated. See for example this paper for details.
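The overestimation can be seen in a small simulation. This is a simplified sketch, not a reproduction of the referenced paper: it assumes a known standard deviation, a two-sided z-test, and a hypothetical true effect that is small relative to the standard error, and it draws each experiment's estimated mean difference directly from its sampling distribution.

```python
import random
from statistics import NormalDist, fmean

TRUE_EFFECT = 0.1       # hypothetical true effect
SIGMA = 1.0             # assumed known standard deviation of the outcome
N_PER_GROUP = 50        # deliberately small: the test is underpowered
ALPHA = 0.05
N_EXPERIMENTS = 20_000

z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)
std_error = SIGMA * (2 / N_PER_GROUP) ** 0.5

rng = random.Random(2)  # seeded only so the sketch is reproducible
# Each estimate is one experiment's observed mean difference.
estimates = [rng.gauss(TRUE_EFFECT, std_error) for _ in range(N_EXPERIMENTS)]
# Keep only experiments that are significant in the positive direction.
significant = [e for e in estimates if e / std_error > z_crit]

mean_all = fmean(estimates)
mean_significant = fmean(significant)

print(f"true effect:                      {TRUE_EFFECT}")
print(f"mean estimate, all experiments:   {mean_all:.3f}")
print(f"mean estimate, significant only:  {mean_significant:.3f}")
print(f"share significant:                {len(significant) / N_EXPERIMENTS:.1%}")
```

Across all experiments the estimates average out to the true effect, but an estimate can only be significant if it exceeds `z_crit * std_error`, which here is several times larger than the true effect. Conditioning on significance therefore selects the overestimates.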

However, this advice shouldn't be given without reference to the MDE for which the experiment is powered, especially in relation to the true effect. What really matters is the true positive rate, which is a function of the true effect. If the experiment is underpowered for a very small MDE, but the true effect is very large, the true positive rate might be high even when the experiment is underpowered according to its MDE.

The bottom line is that power is always defined relative to a hypothetical effect (the MDE). The true effect can be anything, and thus the actual true positive rate can be much larger or smaller than the intended power, even if our sample size is large enough to power the experiment for the MDE we have chosen.
