Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 11: Evaluate your experiment and make a decision

Summary

In this lesson, you learn how to interpret the results of your experiment, how to make a decision based on them, and how Confidence helps you do both. Use exploration to learn more about your results and to get inspiration for new hypotheses.

A good experimentation platform calculates results for you and displays the performance of each variant, taking care of the statistical details so you can focus on learning from the experiment. With those insights, you make a decision on how to proceed with the change you tested.

At this stage, your experiment has run for a period of time with no visible errors. Congratulations! Now it's time for the fun part. You have at least one result to interpret, and often more. More precisely, your experiment has T × M results to interpret, where T is the number of treatment groups (excluding control) and M is the number of metrics.

For an experiment with 3 treatment groups and 4 metrics, you have 12 results to interpret.
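
The count is simply the number of (treatment, metric) pairs, each of which is one comparison against control. A minimal Python sketch, with treatment and metric names that are hypothetical and chosen only for illustration:

```python
from itertools import product

# Hypothetical names, for illustration only.
treatments = ["treatment_a", "treatment_b", "treatment_c"]      # T = 3 (control excluded)
metrics = ["conversion", "retention", "latency", "crash_rate"]  # M = 4

# Every (treatment, metric) pair is one result to interpret.
comparisons = list(product(treatments, metrics))

print(len(comparisons))  # 12
```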

Overall decision recommendations

A good experimentation platform provides overall decision recommendations that use the outcomes of all metrics to suggest whether a specific treatment is worth rolling out.

The shipping recommendation advises you to ship a change if at least one success metric has moved significantly in the desired direction. At the same time, all guardrail metrics must be significantly non-inferior, meaning that each one stays within the non-inferiority margin you set. The test must also be in a healthy state, with no significant deterioration in any metric and no sign of a problem with the quality of the test.
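
The rule above can be sketched as a small Python function. This is an illustration of the logic as described in this lesson, not Confidence's actual implementation. The `MetricResult` structure, its field names, and the `recommend_shipping` function are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Outcome of one metric for one treatment (hypothetical structure)."""
    name: str
    role: str                                # "success" or "guardrail"
    significant_improvement: bool = False    # moved in the desired direction
    significant_deterioration: bool = False  # moved in the undesired direction
    non_inferior: bool = False               # guardrails: within the non-inferiority margin


def recommend_shipping(results, quality_checks_passed):
    """Return True only if every condition described in the text holds."""
    success = [r for r in results if r.role == "success"]
    guardrails = [r for r in results if r.role == "guardrail"]

    any_success_improved = any(r.significant_improvement for r in success)
    all_guardrails_ok = all(r.non_inferior for r in guardrails)
    no_deterioration = not any(r.significant_deterioration for r in results)

    return (
        any_success_improved
        and all_guardrails_ok
        and no_deterioration
        and quality_checks_passed
    )


# Made-up example: one success metric improved, both guardrails are non-inferior.
results = [
    MetricResult("conversion", "success", significant_improvement=True),
    MetricResult("latency", "guardrail", non_inferior=True),
    MetricResult("crash_rate", "guardrail", non_inferior=True),
]
print(recommend_shipping(results, quality_checks_passed=True))  # True
```

If a single guardrail fails to show non-inferiority, or any metric deteriorates significantly, the function returns False even when a success metric has improved.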

In Confidence

Confidence provides overall decision recommendations on each treatment card on the results page.

Metric results

For each metric, you see a comparison between the control group and each treatment group. You can dig deeper into the results to see metric values, confidence intervals, variances, and more. If you ran your experiment with results delivered continuously, you can also view the results over time.
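
To give a feel for what such a comparison contains, here is a minimal Python sketch of a difference in means with a confidence interval. It uses a plain normal approximation and made-up data. The lesson does not describe how Confidence calculates its intervals, so treat the method and the `diff_in_means_ci` function as assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_in_means_ci(control, treatment, confidence=0.95):
    """Difference in means (treatment - control) with a normal-approximation CI."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, based on the two sample variances.
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, (diff - z * se, diff + z * se)


# Made-up metric values per user; real experiments need far larger samples.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treatment = [2.0, 3.0, 4.0, 5.0, 6.0]

diff, (lower, upper) = diff_in_means_ci(control, treatment)
print(f"difference: {diff:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```

In this toy example the interval includes zero, so the difference would not count as significant at the 5% level.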

Exploration

If at the end of the experiment you find things that you would like to dig deeper into, you can do exploratory analysis. Here you can add any metric and see how it performed for each of the treatment groups, and split the results by dimensions.

Note

This type of exploration, in which you look at many metrics, perhaps until you find an "interesting" result, severely increases the risk of false positives. In other words, you risk finding results that are significant only by chance.

For that reason, you shouldn't use exploratory analysis to make decisions about whether an experiment was successful or not. Use it to get inspiration for new hypotheses.
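
To see how quickly the risk grows, the following Python sketch computes the chance of at least one false positive when none of the metrics has a true effect. It assumes the tests are independent and each uses a 5% significance level. Both are simplifying assumptions, and the function name is hypothetical.

```python
def chance_of_false_positive(num_metrics, alpha=0.05):
    """Probability of at least one false positive across independent tests
    when no metric has a true effect."""
    return 1 - (1 - alpha) ** num_metrics


for k in (1, 5, 10, 20):
    print(f"{k:>2} metrics: {chance_of_false_positive(k):.1%}")

# Output:
#  1 metrics: 5.0%
#  5 metrics: 22.6%
# 10 metrics: 40.1%
# 20 metrics: 64.2%
```

Under these assumptions, looking at 10 metrics gives roughly a 40% chance that at least one appears significant purely by chance.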

In Confidence

Use the Explore tab to add any metric and split results by dimensions.

Reader exercise

After ending your experiment, the results page in Confidence tells you that 'Shipping might be recommended' for the treatment. What does this mean?

Some success metrics have moved significantly in the desired direction, but some guardrail metrics are not significantly non-inferior

Confidence doesn't know what your definition of success is, and therefore can't make a recommendation

Some metric has deteriorated significantly

Reader exercise

You ran an A/B test that showed no significant difference between the treatment and control. However, you then go to the Explore tab, and after looking at about 10 different metrics, you find a significant difference in one of them for the treatment group. What do you do?

Declare that the treatment is a winner and send out emails to stakeholders that the experiment was a success

Consider the high possibility that the significant result is a false positive, and use the result to form a new hypothesis and replicate this finding to see if it holds

Only one significant result is a bit thin, so you look for more significant results in additional metrics
