Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 9: Sequential and non-sequential tests

Summary

In Confidence, sometimes you can see results during the experiment and sometimes only after it ends. This lesson explains why, and what it means for how you act on what you see.

Which of the two applies depends on the evaluation strategy chosen for the experiment. Understanding the difference is important because it determines when you can act on what you see.

The peeking problem

Standard statistical tests are designed to be run once: you collect your data, run the test, and look at the result. If you look at the result repeatedly (checking every day, or every time data updates) and you are willing to act the moment you see significance, you are effectively running many tests. Each look is an opportunity to find a false positive by chance, and the more often you look, the more likely you are to get one.

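With a 5% false positive rate per test and no correction, checking results 20 times gives you roughly a 64% chance (1 - 0.95^20, treating each look as an independent test) of seeing at least one false positive result across all those looks, even if there is truly no effect. Looks at the same accumulating data are correlated, so the real inflation is smaller than that, but it is still far above the 5% you intended.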
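The arithmetic behind that figure, and the effect of peeking at accumulating data, can be sketched in a few lines. This is an illustrative simulation, not how Confidence computes anything: the function names, the A/A setup with unit-variance observations, and the parameters (20 looks, 20 observations per look, 2,000 simulated experiments) are all assumptions chosen for the example.

```python
import random
from math import sqrt


def independent_looks_fpr(alpha: float, looks: int) -> float:
    """Chance of at least one false positive across `looks` independent tests."""
    return 1 - (1 - alpha) ** looks


def simulated_peeking_fpr(looks=20, n_per_look=20, runs=2000, z_crit=1.96, seed=7):
    """Share of simulated A/A experiments (no true effect) that cross
    |z| > z_crit at ANY look when the same growing data set is re-tested."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        total, n = 0.0, 0
        for _ in range(looks):
            # The sum of n_per_look standard-normal observations is
            # normal with standard deviation sqrt(n_per_look).
            total += rng.gauss(0.0, sqrt(n_per_look))
            n += n_per_look
            if abs(total / sqrt(n)) > z_crit:  # uncorrected 5% two-sided test
                hits += 1
                break  # we "act" at the first significant look
    return hits / runs


print(round(independent_looks_fpr(0.05, 20), 3))  # 0.642
print(simulated_peeking_fpr())  # well above 0.05, though below 0.642
```

The first number is the independent-tests calculation. The second is what the simulation gives when every look re-tests the same accumulating data: lower, because consecutive looks share most of their data, but still several times the nominal 5%.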

Sequential tests: built for continuous evaluation

Sequential tests solve the peeking problem by adjusting the statistical test to account for the fact that you will look at the results multiple times. They allocate the acceptable false positive rate across all the looks you will take, so that the overall false positive rate stays controlled. Sequential testing is the right choice when speed matters: when you want to be able to ship as soon as you have strong enough evidence, without waiting for a pre-specified end date.
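To make the idea of allocating the false positive rate across looks concrete, here is a minimal sketch that uses the crudest possible allocation: an even (Bonferroni) split of alpha over a planned number of looks. This is an assumption made for illustration only. Real sequential methods use more efficient boundaries, and nothing here is meant to describe the boundaries Confidence uses. The function names and simulation parameters are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def per_look_threshold(alpha: float, looks: int) -> float:
    """Two-sided z threshold when alpha is split evenly across the looks."""
    return NormalDist().inv_cdf(1 - (alpha / looks) / 2)


def overall_fpr(z_crit, looks=20, n_per_look=20, runs=2000, seed=11):
    """Share of simulated A/A experiments that cross |z| > z_crit at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        total, n = 0.0, 0
        for _ in range(looks):
            total += rng.gauss(0.0, sqrt(n_per_look))  # new batch of data
            n += n_per_look
            if abs(total / sqrt(n)) > z_crit:
                hits += 1
                break
    return hits / runs


naive = NormalDist().inv_cdf(0.975)      # about 1.96, meant for a single look
adjusted = per_look_threshold(0.05, 20)  # about 3.02, stricter at every look

print(overall_fpr(naive))     # inflated: far above 0.05
print(overall_fpr(adjusted))  # controlled: at or below 0.05
```

The stricter per-look threshold keeps the overall false positive rate under 5% no matter at which of the 20 looks you stop. The price is that each individual look needs stronger evidence, which is where the extra data requirement of sequential tests comes from.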

There are two main types of sequential tests:

Group Sequential Tests: used when you provide an expected sample size. These are more powerful than always-valid tests and are the preferred option when you have a reasonable estimate of how many users your experiment will collect.
Always-Valid Tests: used when you do not provide an expected sample size. These can run indefinitely without inflating the false positive rate, but they have lower power than group sequential tests for a given sample size.

The trade-off for both types is that sequential tests require slightly more data than non-sequential tests to achieve the same power for a given effect size.

In Confidence

In Confidence, you use sequential tests by selecting Continuous as the evaluation frequency. Confidence automatically picks between Group Sequential Tests and Always-Valid Tests based on whether you provide an expected sample size.
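The selection rule described above is simple enough to write down. The sketch below only restates that rule; the function name and return values are hypothetical and are not part of any Confidence API.

```python
from typing import Optional


def pick_sequential_test(expected_sample_size: Optional[int]) -> str:
    """Restates the rule for Continuous evaluation: a group sequential test
    when an expected sample size is given, an always-valid test otherwise.
    (Hypothetical helper, for illustration only.)"""
    if expected_sample_size is not None:
        return "group_sequential"
    return "always_valid"


print(pick_sequential_test(50_000))  # group_sequential
print(pick_sequential_test(None))    # always_valid
```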

Non-sequential tests: the highest power option

A standard non-sequential test (also called a fixed horizon test) is designed to be run once, after data collection ends. With this approach, you should not act on the metric results you see while the experiment is still running. The mid-experiment numbers are informational, but they do not carry the statistical guarantees of the final result.

The benefit is that non-sequential tests achieve higher statistical power for the same sample size compared to sequential tests. If you know in advance you will run the experiment for a fixed period regardless of what you see, this is the most efficient choice.
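For reference, the single-look calculation that a fixed horizon test relies on can be sketched with the textbook formula for a two-sided z-test comparing two group means. This is a generic approximation under assumed equal group sizes and a known standard deviation, not the Confidence sample size calculator. The function names and the example effect size are assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def required_n_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Users per group needed to detect a mean difference `delta`
    with a single two-sided z-test (standard approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


def power_at(n_per_group, delta, sigma, alpha=0.05):
    """Power of the same single-look test for a given sample size."""
    se = sigma * sqrt(2 / n_per_group)  # standard error of the difference
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


n = required_n_per_group(delta=0.1, sigma=1.0)
print(n)  # 1570
print(power_at(n, delta=0.1, sigma=1.0))  # just above 0.80
```

Because the whole false positive budget is spent on one look, the threshold stays at about 1.96. A test that must also hold up at interim looks needs a stricter threshold, and so more data to reach the same power.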

In Confidence

In Confidence, non-sequential tests correspond to the Upon Conclusion evaluation frequency. You can see the current powered effect at any point during the experiment, which helps you judge when it makes sense to end it.

Note

The recommendation at Confidence is to use upon-conclusion evaluation unless speed of decision is a priority. This is because most teams benefit from running experiments for a full, pre-specified period to account for time effects and novelty effects, and upon-conclusion evaluation maximizes power for that use case.

Deterioration checks always run sequentially

Regardless of which evaluation strategy you choose, Confidence always monitors your metrics for deterioration using sequential tests. This means you will be alerted if a metric starts to move in the wrong direction, even if you chose upon-conclusion evaluation and have not yet reached the end of your experiment.

This is an important design choice: the ability to detect harm early is always on. You do not sacrifice protection against regressions by choosing a non-sequential test for your main results.

What this means in practice

When you look at experiment results, the key question to ask is: what evaluation strategy is this experiment using?

Sequential (continuous) evaluation: the results you see are statistically valid to act on at any point. The test has accounted for repeated looks.
Non-sequential (upon conclusion) evaluation: the final result after the experiment ends is valid to act on. Mid-experiment numbers should be used for awareness only, not decisions.
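The rule of thumb in the two cases above can be written as a small check. This is only a restatement of the lesson's guidance; the function name and the strategy labels are hypothetical, not Confidence identifiers.

```python
def valid_to_act_on(evaluation: str, experiment_ended: bool) -> bool:
    """True if the main metric results carry their statistical guarantees now.
    (Hypothetical helper restating the rule of thumb above.)"""
    if evaluation == "continuous":
        return True  # sequential test: repeated looks are accounted for
    if evaluation == "upon_conclusion":
        return experiment_ended  # fixed horizon: only the final result counts
    raise ValueError(f"unknown evaluation strategy: {evaluation!r}")


print(valid_to_act_on("continuous", experiment_ended=False))       # True
print(valid_to_act_on("upon_conclusion", experiment_ended=False))  # False
print(valid_to_act_on("upon_conclusion", experiment_ended=True))   # True
```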

For a deeper look at how to choose between these strategies when setting up an experiment, including trade-offs around time effects, novelty effects, and power, see Lesson 2: Choose evaluation strategy in the Advance your experimentation course.

Reader exercise

You are running an experiment with upon-conclusion evaluation. Why shouldn't you look at the results before the experiment has ended?

Because the experiment might not have reached the target audience size yet

Because success metrics are only calculated once the experiment ends

Because upon-conclusion testing only provides valid statistical guarantees at the predetermined end point — peeking and acting on interim results inflates the false positive rate

Because early results always show larger effects that shrink as more data comes in

Reader exercise

What is always true about deterioration monitoring in Confidence, regardless of evaluation strategy?

Deterioration is only checked when you select continuous evaluation

Deterioration checks always use sequential tests, so you are protected against regressions at all times

Deterioration checks are only run when the experiment reaches its required sample size

You must manually trigger deterioration checks for upon-conclusion experiments

Notes for nerds

Confidence supports two types of sequential tests. Group Sequential Tests (GST) are used when you provide an expected sample size at setup. They are more powerful than the alternative for a given sample size and are the preferred option when you have a reasonable estimate of how many users the experiment will collect. Always-Valid Tests are used when no expected sample size is provided. They can run indefinitely without inflating the false positive rate, at the cost of somewhat lower power than a GST for the same amount of data.

The sample size calculator in Confidence accounts for this choice. When you switch from non-sequential to sequential evaluation, you can see the impact on required sample size directly in the calculator. Sequential tests require more data to achieve the same power for a given effect size, and the calculator shows you exactly how much.
