Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 1: Guardrail metrics with non-inferiority margins

Summary

Guardrail metrics can be tested with inferiority tests or non-inferiority tests. Using non-inferiority tests improves the quality of the decision to launch a feature, but requires you to specify a non-inferiority margin.

In this lesson you learn how to test guardrail metrics and how inferiority and non-inferiority tests differ. You can test guardrail metrics in two different ways:

Use an inferiority test. This test evaluates whether there is evidence that the guardrail metric does worse in the treatment group compared to the control group.
Use a non-inferiority test. This test instead evaluates whether there is evidence that the guardrail metric does better than a pre-defined threshold in the treatment group compared to the control group.

In practice you choose between the two tests by specifying or not specifying a non-inferiority margin. If you specify a non-inferiority margin, you use a non-inferiority test. If you don't, you use an inferiority test.
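As a hedged formalization of the two tests, assume a guardrail metric where higher values are better. The symbols below are notation introduced here for illustration and are not taken from the Confidence documentation: \delta = \mu_T - \mu_C is the true difference between treatment and control, and \Delta < 0 is the non-inferiority margin (for example, -1%).

```latex
\begin{aligned}
\text{Inferiority test:} \quad & H_0: \delta \ge 0 \quad \text{vs.} \quad H_1: \delta < 0 \\
\text{Non-inferiority test:} \quad & H_0: \delta \le \Delta \quad \text{vs.} \quad H_1: \delta > \Delta
\end{aligned}
```

Rejecting the null hypothesis of the inferiority test is evidence of harm, while rejecting the null hypothesis of the non-inferiority test is evidence that any harm is smaller than the margin.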

In Confidence

Read more about the statistical details of inferiority and non-inferiority tests in the Confidence documentation.

Inferiority test

The inferiority test seeks evidence that the treatment group does worse than the control group. If the treatment causes the metric to deteriorate, a significant result signals the deterioration and indicates that the change is not a good launch candidate.

The downside of the inferiority test is that it can never establish evidence that the change is safe to launch. Instead, it treats a lack of evidence of a negative impact as if it meant that the change is safe to launch.

Example: Checkout flow

Your guardrail metric, number of purchases per visitor, changes by -0.2% with a 95% confidence interval of [-0.7%, 0.3%]. Because the interval covers zero, there is no evidence that your change has a negative impact on the number of purchases per visitor. There is also no evidence that your change doesn't have a negative impact on the metric.

If you can't prove that the metric deteriorated, then you conclude that the change is safe to launch.
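The decision rule in this example can be sketched in a few lines of Python. This is a minimal illustration that reads the test result off the confidence interval; the `Interval` class and the `shows_deterioration` function are hypothetical names, not part of Confidence, and the sketch assumes a metric where higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Confidence interval for the relative change, in percent (hypothetical helper)."""
    lower: float
    upper: float


def shows_deterioration(ci: Interval) -> bool:
    """Inferiority test read off an interval: there is evidence of harm
    only if the whole interval lies below zero."""
    return ci.upper < 0.0


# Checkout flow example from the text: -0.2% with interval [-0.7%, 0.3%].
checkout = Interval(lower=-0.7, upper=0.3)
deteriorated = shows_deterioration(checkout)
print(deteriorated)  # False: no evidence of deterioration
```

Note that `False` here only means that harm was not detected. The function has no way of returning "proven safe", which is the limitation described above.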

Non-inferiority test

The non-inferiority test seeks evidence that the difference between the treatment and control groups is not worse than a pre-specified margin, known as the non-inferiority margin. The test means that a significant result is evidence that the change does better than the margin, and is a good launch candidate.

Example: Checkout flow

Your guardrail metric, number of purchases per visitor, changes by -0.2% with a 95% confidence interval of [-0.7%, 0.3%]. Your non-inferiority margin is -1%, meaning that you are willing to accept a decrease of up to 1%. The lower bound of the interval, -0.7%, is above the margin, so you can conclude that there is evidence that the change does better than your margin.

One way to think about the non-inferiority margin is as a threshold for how much a metric can deteriorate before you consider it to be a negative impact. If you can prove that the metric did not deteriorate by more than the non-inferiority margin, then you conclude that the change is safe to launch.
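The same example can be expressed as a small Python check. This is a sketch under the assumption that the test result is read off the lower bound of the interval and that higher metric values are better; `is_non_inferior` is a hypothetical name used only for illustration.

```python
def is_non_inferior(ci_lower: float, margin: float) -> bool:
    """Non-inferiority test read off an interval: the test succeeds when
    the lower bound lies above the (negative) non-inferiority margin."""
    return ci_lower > margin


# Checkout flow example from the text, all values in percent.
ci_lower, ci_upper = -0.7, 0.3
margin = -1.0  # willing to accept a decrease of up to 1%

passed = is_non_inferior(ci_lower, margin)
print(passed)  # True: the lower bound -0.7 is above the margin -1.0
```

Unlike the inferiority check, a `True` result here is positive evidence: the data rule out a deterioration larger than the margin.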

Learn about what the non-inferiority margin is and how to set it in 3 minutes and 44 seconds.

How inferiority and non-inferiority tests compare

To see how the two tests differ, consider the following illustration.

The figure shows a confidence interval for a guardrail metric. The non-inferiority margin is below the interval's lower bound, which means that there is evidence that the difference between the treatment and control groups is not worse than the margin. In other words, the change in the metric is significantly non-inferior, and the non-inferiority test is successful. With the non-inferiority test, you conclude that there is evidence that the change doesn't harm the metric by more than the non-inferiority margin that you set.

For the same illustration, if you instead use an inferiority test and disregard the non-inferiority margin, there is no evidence of deterioration because the confidence interval covers zero. You therefore conclude that the change is safe to launch.

From non-inferiority to inferiority

If you use a non-inferiority test for a guardrail, Confidence still automatically checks if the metric has deteriorated as part of its monitoring checks.

Consider the following illustration. In this case the metric has not deteriorated beyond the non-inferiority margin, but it has deteriorated in relation to zero. The overall conclusion is that it is not safe to launch this feature.
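The interaction between the two checks can be sketched as follows. This is an illustrative decision rule only, not the actual recommendation logic in Confidence, which may use different intervals or corrections for the two checks. The names `GuardrailResult` and `evaluate_guardrail` are hypothetical, and the sketch assumes a metric where higher is better.

```python
from enum import Enum


class GuardrailResult(Enum):
    SAFE = "non-inferior and no evidence of deterioration"
    DETERIORATED = "significant deterioration relative to zero"
    INCONCLUSIVE = "non-inferiority not established"


def evaluate_guardrail(ci_lower: float, ci_upper: float, margin: float) -> GuardrailResult:
    """Combine a non-inferiority test with a deterioration check."""
    # Deterioration check first: the whole interval is below zero.
    if ci_upper < 0.0:
        return GuardrailResult.DETERIORATED
    # Non-inferiority: the lower bound is above the margin.
    if ci_lower > margin:
        return GuardrailResult.SAFE
    return GuardrailResult.INCONCLUSIVE


# Hypothetical intervals in percent, with a margin of -1%.
within_margin_but_worse = evaluate_guardrail(-0.8, -0.1, margin=-1.0)
checkout_example = evaluate_guardrail(-0.7, 0.3, margin=-1.0)
too_wide = evaluate_guardrail(-1.5, 0.4, margin=-1.0)
```

The first call corresponds to the situation described above: the interval stays above the margin, yet it lies entirely below zero, so the sketch reports a deterioration rather than a safe launch.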

In Confidence

Even when you use a non-inferiority test for a guardrail, Confidence automatically checks for deterioration as part of its monitoring checks. Read more about how the various checks feed into the recommendation to launch.

Practical considerations

Inferiority or non-inferiority test

Inferiority tests can be a good starting point for guardrail metrics as they don't require the experimenter to specify a non-inferiority margin. However, keep the following in mind:

Non-significant doesn't mean neutral. You shouldn't make a product decision on a non-significant result. Lack of significance means you don't have evidence that the metric has deteriorated. It doesn't mean you have evidence that the metric hasn't deteriorated. With a wide enough confidence interval, for example because of a small sample size, any effect can look neutral.
A non-inferiority test is a better, but more complicated, approach. With a non-inferiority test, you seek evidence that the change is safe to launch. This means you have a specified certainty that your change doesn't decrease your metric by more than a margin you've decided on.
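To make the first point concrete, the sketch below computes a normal-approximation confidence interval for the same observed decrease at two sample sizes. All numbers are made up for illustration, and `ci_for_difference` is a hypothetical helper that assumes equal group sizes and a common standard deviation `sigma` in both groups.

```python
from math import sqrt
from statistics import NormalDist


def ci_for_difference(observed_diff: float, sigma: float, n_per_group: int,
                      confidence: float = 0.95) -> tuple[float, float]:
    """Two-sided normal-approximation interval for a difference in means."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma * sqrt(2.0 / n_per_group)
    return observed_diff - half_width, observed_diff + half_width


# The same observed decrease of 2 units, with few and with many users.
small = ci_for_difference(-2.0, sigma=30.0, n_per_group=500)
large = ci_for_difference(-2.0, sigma=30.0, n_per_group=50_000)

print(small[0] < 0 < small[1])  # True: the small sample looks "neutral"
print(large[1] < 0)             # True: the large sample shows deterioration
```

With the small sample, the interval covers zero, so an inferiority test finds nothing, even though the observed decrease is identical in both cases.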

Set the non-inferiority margin

You need to consider both the business perspective and the statistical perspective when you set the non-inferiority margin. The business perspective is about what is an acceptable decrease for the business. The statistical perspective is about what is a reasonable margin given the variability of the metric and the sample size you can expect.
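The statistical perspective can be explored with a standard sample size approximation for a one-sided non-inferiority test on a difference in means. This is a rough sketch, not the calculation Confidence performs: `required_n_per_group` is a hypothetical helper, the margin is expressed in the metric's absolute units rather than as a relative change, and the true difference is assumed to be zero by default.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(sigma: float, margin: float, alpha: float = 0.025,
                         power: float = 0.8, true_diff: float = 0.0) -> int:
    """Approximate users per group needed to establish non-inferiority.

    Uses n = 2 * sigma^2 * (z_{1-alpha} + z_{power})^2 / (true_diff - margin)^2.
    """
    if true_diff <= margin:
        raise ValueError("true_diff must be above the non-inferiority margin")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * sigma**2 * (z_alpha + z_power) ** 2 / (true_diff - margin) ** 2
    return ceil(n)


# A tighter margin (closer to zero) requires far more users.
n_loose = required_n_per_group(sigma=1.0, margin=-0.10)
n_tight = required_n_per_group(sigma=1.0, margin=-0.05)
print(n_loose, n_tight)
```

Halving the margin roughly quadruples the required sample size, which is why a margin that is attractive from a business perspective may not be feasible given the traffic you can expect.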

To decide on the non-inferiority margin, you can consider the following:

What negative change in the metric is acceptable?
What change in a given metric would you consider to be so small as to deem it practically equal?

Reader exercise

What does the non-inferiority margin represent?

How much we want the guardrail metric to improve.

How much we want the guardrail metric to deteriorate.

We want to find evidence that the guardrail metric has not deteriorated more than the non-inferiority margin.

Reader exercise

What is true for a guardrail metric that uses non-inferiority tests?

If the metric deteriorates, the inferiority test detects the regression and alerts the experimenter.

If the metric deteriorates, we will never know since we use a non-inferiority test.

If the metric improves, this will lead to rejection of the feature. We want non-inferiority, not superiority.
