Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 2: Choose evaluation strategy: Sequential or non-sequential tests

Summary

The evaluation frequency you choose determines whether your experiment uses a sequential or a non-sequential test. Optimize your experimentation by choosing the test that fits your needs.

Sequential tests let you look at your results and make decisions during the experiment, but they have lower power than non-sequential tests.
Non-sequential tests let you look at the results and make decisions only once the experiment has ended, but they have higher power than sequential tests.

In this lesson you learn how to select evaluation frequency and the various trade-offs to consider when making this choice.

This video gives a 2-minute, 10-second overview of the evaluation frequency topic.

The evaluation frequency determines the statistical test

The evaluation frequency has two options:

Continuous: lets you see results during the experiment
Upon conclusion: lets you see results only when the experiment has ended

Continuous evaluation uses sequential tests

When you select continuous evaluation, Confidence automatically uses a sequential test behind the scenes. Sequential tests ensure that the repeated testing implied by calculating results continuously throughout the experiment does not inflate the risk of finding a false positive result.
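To see why repeated testing needs a sequential test, the following minimal sketch simulates A/A experiments (no true effect) and applies an unadjusted two-sided z-test at several interim looks. It is an illustration only, not how Confidence computes results; the function name, the number of looks, and the assumption of normally distributed data with known unit variance are all hypothetical choices made for the example.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rates(n_experiments=1000, n_per_group=200,
                                 n_looks=10, alpha=0.05, seed=1):
    """Return (rate of rejecting at ANY look, rate of rejecting at the FINAL look).

    Every simulated experiment is an A/A test, so each rejection is a false positive.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted critical value
    look_every = n_per_group // n_looks
    any_look_hits = 0
    final_look_hits = 0
    for _ in range(n_experiments):
        sum_a = sum_b = 0.0
        hit_at_any_look = False
        hit_at_final_look = False
        for i in range(1, n_per_group + 1):
            # Both groups come from the same distribution: no true effect.
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
            if i % look_every == 0:
                # z statistic for the difference in means, known unit variance.
                z = (sum_b / i - sum_a / i) / (2.0 / i) ** 0.5
                significant = abs(z) > z_crit
                hit_at_any_look = hit_at_any_look or significant
                if i == n_per_group:
                    hit_at_final_look = significant
        any_look_hits += hit_at_any_look
        final_look_hits += hit_at_final_look
    return any_look_hits / n_experiments, final_look_hits / n_experiments


if __name__ == "__main__":
    any_rate, final_rate = peeking_false_positive_rates()
    print(f"False positive rate when stopping at any significant look: {any_rate:.3f}")
    print(f"False positive rate when testing only at the end: {final_rate:.3f}")
```

Testing only at the end keeps the false positive rate close to alpha, while acting on any significant interim look pushes it well above alpha. Sequential tests adjust the decision boundaries so that the overall rate stays at the intended level.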

There are two types of sequential tests:

Group Sequential Tests: require the experimenter to provide a maximum sample size and must be stopped once this sample size is reached; they have higher power than Always Valid Tests for a given sample size
Always Valid Tests: do not require any pre-specified sample size and can keep letting new users into the sample indefinitely; they have lower power than Group Sequential Tests for a given sample size

In Confidence

Confidence automatically chooses the type of sequential test based on whether you provide a maximum sample size. If you provide one, Confidence uses a Group Sequential Test to maximize power; otherwise it uses an Always Valid Test.
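The selection rule described above can be sketched as a small function. This is a hypothetical illustration of the logic, not the Confidence API: the function name, the parameter name, and the returned labels are assumptions made for this example.

```python
from typing import Optional


def choose_sequential_test(max_sample_size: Optional[int]) -> str:
    """Pick a sequential test type from whether a maximum sample size is known.

    Hypothetical helper: the names and return labels are illustrative only.
    """
    if max_sample_size is None:
        # No pre-specified sample size: the test must stay valid indefinitely.
        return "always_valid"
    if max_sample_size <= 0:
        raise ValueError("max_sample_size must be a positive integer")
    # A known maximum sample size allows the higher-powered option.
    return "group_sequential"


if __name__ == "__main__":
    print(choose_sequential_test(50_000))
    print(choose_sequential_test(None))
```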

Upon Conclusion evaluation uses non-sequential tests

When you select Upon Conclusion, Confidence automatically uses a fixed horizon test, the classical kind of statistical test taught in introductory statistics courses.

Fixed horizon tests, such as the z-test, do not let you see the results during the experiment. In return, for a given sample size they have higher power than sequential tests once you stop the experiment and run the analysis. Although you should run sample size calculations before the experiment, you can view the powered effect at any point during the experiment, which helps you judge when it is time to end it.
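As a rough illustration of how a powered effect relates to sample size, the sketch below uses the textbook formula for a two-sided, two-sample z-test with equal group sizes and a known standard deviation. It assumes that "powered effect" means the smallest absolute difference in means detectable with the chosen power at the current sample size; how Confidence computes it may differ, and the function name and defaults are hypothetical.

```python
from statistics import NormalDist


def powered_effect(n_per_group: int, std_dev: float,
                   alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest absolute difference in means detectable with the given power.

    Textbook two-sided, two-sample z-test with equal group sizes and a
    known, common standard deviation (an assumption of this sketch).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = std_dev * (2 / n_per_group) ** 0.5
    return (z_alpha + z_power) * standard_error


if __name__ == "__main__":
    # The powered effect shrinks as more users enter the experiment.
    for n in (1_000, 4_000, 16_000):
        print(n, round(powered_effect(n, std_dev=1.0), 4))
```

Because the standard error scales with one over the square root of the sample size, quadrupling the sample size halves the powered effect. Watching this value fall toward the effect size you care about is one way to judge when the experiment has collected enough data.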

A note on quality tests during the experiment

Regardless of your choice of evaluation frequency, a well-designed experimentation platform uses continuous evaluation (sequential tests) on all quality and deterioration checks during the experiment. This means that you will not miss any experiments that deteriorate or have fatal errors, even if you choose Upon Conclusion as your evaluation frequency.

This means the only reason to select continuous evaluation is if you want to make a shipping decision as soon as possible.

In Confidence

When you create an A/B test or rollout in Confidence, you can select the evaluation frequency in the experiment settings.

A note on early stopping to ship a variant

Using sequential testing to detect harm and errors, and to abort failing experiments early, is standard in modern A/B testing. However, stopping early to ship is less straightforward. This is because it is often easier to prove that something is too bad to ship than to prove that it is sufficiently good to ship.

A few things to consider are:

Time effects. If there are strong time effects, for example between weekdays, it might be important to run the experiment over a full week to average these out.
Novelty effects. If there is a strong novelty effect, it might be important to observe users for a longer time to determine the long-term effect of the change.
Power. Underpowered experiments tend to overestimate the treatment effect. If the effect is larger than you expect early in the data collection, you might want to run the experiment longer to ensure it is well powered.
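The point about power can be checked with a small simulation. The sketch below draws the estimated effect of many underpowered experiments and averages only the estimates that come out significantly positive, which is the situation in which you would consider shipping. The true effect, the sample size, and the function name are hypothetical values chosen for illustration, and the estimate is drawn directly from its normal sampling distribution (known unit variance in both groups) rather than from simulated users.

```python
import random
from statistics import NormalDist, mean


def mean_significant_estimate(true_effect=0.1, n_per_group=100,
                              n_experiments=2000, alpha=0.05, seed=7):
    """Average estimated effect among experiments that are significantly positive."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference in means with unit variance per group.
    standard_error = (2 / n_per_group) ** 0.5
    significant_estimates = []
    for _ in range(n_experiments):
        # Draw the estimate from its sampling distribution around the true effect.
        estimate = rng.gauss(true_effect, standard_error)
        if estimate / standard_error > z_crit:
            significant_estimates.append(estimate)
    return mean(significant_estimates)


if __name__ == "__main__":
    print("True effect: 0.1")
    print("Mean estimate among significant results:",
          round(mean_significant_estimate(), 3))
```

With these settings the experiment has low power, so an estimate only reaches significance when it lands far above the true effect. The average of the significant estimates therefore exceeds the true effect, which is why a surprisingly large early effect is a reason to keep collecting data.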

How to choose

Selecting the evaluation frequency, and thereby the type of statistical test, is a trade-off between competing interests. Some general guidelines are:

The ability to abort failing experiments should not influence the choice of evaluation frequency. All experiments have sequential checks for deterioration and quality errors, regardless of the result evaluation frequency.
If speed of decision making is more important than the quality of the estimates, choose continuous evaluation.
If you are using continuous evaluation, provide a maximum sample size if you can, so that a Group Sequential Test, which maximizes power, can be used.
If you are going to run the experiment for a fixed number of days anyway before making the decision to ship, use Upon Conclusion evaluation.
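The guidelines above can be written down as a small decision helper. This is a hypothetical sketch, not part of Confidence: the function name, the parameter names, the order in which the rules are applied, and the default when neither condition holds are all assumptions made for this example.

```python
def recommend_evaluation_frequency(speed_over_estimate_quality: bool,
                                   fixed_runtime_planned: bool) -> str:
    """Suggest an evaluation frequency from the guidelines in this lesson.

    The ability to abort failing experiments is deliberately not an input,
    because deterioration and quality checks run sequentially either way.
    """
    if fixed_runtime_planned:
        # Results are not acted on before the end, so keep the higher power.
        return "upon_conclusion"
    if speed_over_estimate_quality:
        return "continuous"
    # Assumed default for this sketch: favor power when speed is not the priority.
    return "upon_conclusion"


if __name__ == "__main__":
    print(recommend_evaluation_frequency(speed_over_estimate_quality=True,
                                         fixed_runtime_planned=False))
```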

For a guide on how to read and act on results depending on which evaluation strategy your experiment uses, see Lesson 9: Sequential and non-sequential tests in the Interpreting experiment results course.

The recommendation to experimenters at Spotify is to use the upon-conclusion evaluation frequency, for three reasons. First, Confidence offers continuous deterioration and quality tests regardless of the evaluation frequency. Second, there are many reasons to run tests for at least a fixed time period, such as time effects and novelty effects. Third, upon-conclusion evaluation maximizes the chance of detecting a true effect (power).

If you are interested in reading more about the types of sequential tests and how they compare, check out the Spotify engineering blog post about choosing a sequential testing framework. There are also more details about the tests in the documentation.

Reader exercise

When should you choose continuous evaluation?

If you want to be able to abort as soon as there is evidence of errors.

If speed of reaching a decision is the priority for you.

If the most important thing is accuracy in the estimates.

Reader exercise

What is true for experiments using upon-conclusion evaluation?

They risk running for a long time with errors that you do not detect.

They have less power than continuously evaluated experiments for the same sample size.

They maximize the power for a given sample size.
