Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 3: Different types of metrics in experiments

Summary

Required metrics, success metrics, guardrail metrics, and exploratory metrics serve different purposes, and knowing when to use each is important for running efficient experiments.

The goal of the statistical analysis in an experiment is to provide a solid foundation for product decisions, covered in detail in a previous blog post. This means that the statistical analysis focuses on an overall decision, informed by all metrics in the experiment. A good experiment uses different types of metrics to balance the validity of the product decision with learning as much as possible, safeguarding against regressions, and inspiring new iterations. Learn how to use different types of metrics to maximize efficiency and to avoid inflating the sample sizes you need to collect.

Metrics in an experiment can be added in the following ways:

Required metrics on surfaces. Configure metrics that all experiments on specific surfaces must add. The platform verifies that these metrics don't deteriorate.
Success and guardrail metrics. Add the metrics meant to inform the decision when setting up the experiment.
Exploratory metrics. Create exploratory analyses to dig deeper and split by dimensions during or after launching your experiment.

Use required metrics to align decision scrutiny

Required metrics are metrics that every experiment running on a surface checks for regressions. These metrics show up on the experiment design page automatically when you select the surface. Importantly, required metrics are only checked for deterioration: Confidence recommends aborting the experiment if a required metric moves significantly in the unintended direction. These metrics have a negligible effect on the required sample size of the experiment. See this blog post for more details.

Required metrics increase the autonomy of teams. A single team focuses on optimizing the user experience they own, using metrics related directly to that experience. If an experiment has an unintended effect on business-critical metrics, Confidence alerts the team.

In Confidence

In Confidence, required metrics are configured on surfaces and show up automatically in the experiment design page.

Use success and guardrail metrics to construct the basis for your decision

Success metrics

For success metrics, the recommendation is to ship if at least one success metric has significantly improved. This means that for each success metric you add, you have one more chance for a false positive result. The platform corrects for the number of success metrics to control the overall false positive rate, leading to a higher required sample size.
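
To make the cost of extra success metrics concrete, here is a minimal sketch. The lesson does not say which correction the platform applies, so a Bonferroni correction is assumed. The sample size formula is the textbook normal approximation for a one-sided two-sample test, and the effect size and standard deviation are made-up illustration values.

```python
from statistics import NormalDist


def corrected_alpha(alpha: float, n_success_metrics: int) -> float:
    """Split the false positive rate evenly across success metrics.

    Assumption: Bonferroni correction; the platform's actual method may differ.
    """
    if n_success_metrics < 1:
        raise ValueError("need at least one success metric")
    return alpha / n_success_metrics


def sample_size_per_group(alpha: float, power: float, effect: float, sd: float) -> float:
    """Approximate per-group sample size for a one-sided two-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # critical value for the corrected alpha
    z_power = z.inv_cdf(power)      # quantile matching the desired power
    return 2 * ((z_alpha + z_power) * sd / effect) ** 2


# Hypothetical inputs: overall alpha 5%, power 80%, effect 0.1, sd 1.0.
for k in (1, 2, 5):
    alpha_k = corrected_alpha(0.05, k)
    n = sample_size_per_group(alpha_k, power=0.8, effect=0.1, sd=1.0)
    print(f"{k} success metric(s): alpha per metric = {alpha_k:.4f}, n per group ~ {n:.0f}")
```

Under these assumptions, each added success metric lowers the per-metric alpha, which raises the critical value and therefore the required sample size.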

Guardrail metrics

For guardrail metrics, you can choose to use them in one of two ways:

With non-inferiority margins
Without non-inferiority margins

Learn more about the two ways to analyze guardrail metrics in the lesson on the topic.

If you use non-inferiority margins, which is the most rigorous way of using guardrail metrics, all guardrail metrics must be simultaneously powered. You should only ship if the treatment is significantly non-inferior to control for all guardrail metrics. Each guardrail metric adds a chance of not finding a significant non-inferior result, which means that the power per metric is corrected upwards, leading to a higher required sample size.

If you don't use non-inferiority margins for guardrail metrics, they are only checked for significant regressions. Interpreting the lack of evidence for deterioration as a signal for shipping the variant is a less rigorous way to work with guardrail metrics. Guardrail metrics without non-inferiority margins don't affect the required sample size. This means that if the sample size requirements are too large when all guardrails have non-inferiority margins, you can trade off rigor for a smaller sample size by changing some guardrail metrics to not use non-inferiority margins.
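
The trade-off between rigor and sample size can be sketched in the same way. The example below assumes the type II error rate is split evenly across the guardrail metrics that have non-inferiority margins (a Bonferroni-style correction of power, which the lesson does not confirm), and uses the normal approximation for a one-sided non-inferiority test with a true difference of zero. The margin and standard deviation are illustrative values.

```python
from statistics import NormalDist


def corrected_power(power: float, n_margin_guardrails: int) -> float:
    """Per-metric power when all guardrails with margins must pass together.

    Assumption: the type II error rate (1 - power) is split evenly across
    the guardrails with non-inferiority margins. Guardrails without margins
    are not counted, so they leave the power target unchanged.
    """
    if n_margin_guardrails < 1:
        return power
    beta = 1 - power
    return 1 - beta / n_margin_guardrails


def ni_sample_size_per_group(alpha: float, power: float, margin: float, sd: float) -> float:
    """Approximate per-group sample size for a one-sided non-inferiority z-test,
    assuming the true difference between treatment and control is zero."""
    z = NormalDist()
    return 2 * ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sd / margin) ** 2


# Hypothetical inputs: alpha 5%, overall power 80%, margin 0.1, sd 1.0.
for g in (1, 2, 4):
    power_g = corrected_power(0.8, g)
    n = ni_sample_size_per_group(0.05, power_g, margin=0.1, sd=1.0)
    print(f"{g} guardrail(s) with margins: power per metric = {power_g:.3f}, n per group ~ {n:.0f}")
```

Moving a guardrail metric from "with margin" to "without margin" reduces the count passed to `corrected_power`, which is how the smaller sample size arises in this sketch.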

Use exploratory metrics to learn more

You can add any metrics to an exploratory analysis of the experiment. You can also slice and dice these results on dimensions to see the results in different subgroups. Use explorations to dig deeper into the results after you make a decision from the experiment. The exploration is separate from the analysis on the result page, and does not affect the required sample size of the experiment.

In Confidence

In Confidence, use the Explore tab to add metrics and slice results by dimensions.

Recommendations

Consider the following recommendations when you configure metrics for your experiments:

Add as required metrics the metrics that should never deteriorate, but that you rarely use as decision metrics in experiments.
- Add them to the global surface if they apply business-wide, or to a local surface otherwise.
- Required metrics don't inflate the required sample size, but adding more decreases the chances of finding a deterioration.
Add as success metrics the metrics that quantify the success of a treatment.
- These should generally be few.
- The more success metrics you add, the larger the sample size you need to collect.
Add as guardrail metrics with non-inferiority margins the metrics that your success metrics are likely to cannibalize.
- The margin lets you quantify the trade-off between a guardrail metric deteriorating and a success metric improving.
- The more guardrail metrics with non-inferiority margins you add, the larger the sample size you need to collect.
Add as guardrail metrics without non-inferiority margins the metrics that you don't want to see deteriorate, but that you have a less strict trade-off for.
- Guardrail metrics without non-inferiority margins don't inflate the sample size you need to collect, but detecting true deterioration becomes harder the more you add.
Add as exploratory metrics the metrics that you don't base your decision on, but that help you learn and understand the results of the experiment.
- Exploratory metrics don't influence the sample size you need to collect, but adding more makes it harder to find effects in your explorations.
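
The metric roles described in this lesson combine into a single shipping recommendation. The following sketch is one way to express that logic. The class and function names are hypothetical and not part of Confidence. Treating a significant regression in a guardrail without a margin as "do not ship" is an assumption, since the lesson only says these metrics are checked for regressions. Exploratory metrics are left out because they do not inform the decision.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    REQUIRED = "required"
    SUCCESS = "success"
    GUARDRAIL_WITH_MARGIN = "guardrail_with_margin"
    GUARDRAIL_WITHOUT_MARGIN = "guardrail_without_margin"


@dataclass
class MetricResult:
    name: str
    role: Role
    significant_improvement: bool = False
    significant_non_inferiority: bool = False
    significant_deterioration: bool = False


def recommend(results: list[MetricResult]) -> str:
    """Return 'abort', 'ship', or 'do not ship' for a set of metric results."""
    by_role = {role: [r for r in results if r.role is role] for role in Role}

    # Required metrics are only checked for deterioration.
    if any(r.significant_deterioration for r in by_role[Role.REQUIRED]):
        return "abort"
    # Assumption: a regression in a guardrail without a margin blocks shipping.
    if any(r.significant_deterioration for r in by_role[Role.GUARDRAIL_WITHOUT_MARGIN]):
        return "do not ship"
    # Every guardrail with a margin must be significantly non-inferior.
    if not all(r.significant_non_inferiority for r in by_role[Role.GUARDRAIL_WITH_MARGIN]):
        return "do not ship"
    # At least one success metric must have significantly improved.
    if any(r.significant_improvement for r in by_role[Role.SUCCESS]):
        return "ship"
    return "do not ship"


# Hypothetical experiment results.
example = [
    MetricResult("crash_rate", Role.REQUIRED),
    MetricResult("conversion", Role.SUCCESS, significant_improvement=True),
    MetricResult("latency", Role.GUARDRAIL_WITH_MARGIN, significant_non_inferiority=True),
    MetricResult("support_tickets", Role.GUARDRAIL_WITHOUT_MARGIN),
]
print(recommend(example))
```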

Reader exercise

What is a true statement about required metrics?

They improve the power of the experiment.

They don't affect the required sample size of the experiment.

They are selected by each experimenter.

Reader exercise

When should a guardrail metric be used without non-inferiority margins?

When the metric is highly sensitive and critical to our business.

When you want to check for regressions but you don't have any strict trade-offs to make between this metric and a success metric.

When you use sequential testing.
