Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 2: The Spotlight

Summary

In this lesson, you learn what each Spotlight recommendation means, what drives each one, and how the Spotlight synthesizes many metric results into a single actionable signal.

The Spotlight is the first thing you see on the results page. It gives a recommendation for each treatment variant: Ship, Continue, End, or Abort. Each recommendation has a precise meaning, and understanding what drives each one lets you immediately understand the state of your experiment when you open it.

The four recommendations

Ship

Ship means Confidence has found sufficient evidence that the treatment variant is worth rolling out. Three conditions must all be true for a Ship recommendation:

At least one success metric has improved significantly in the intended direction.
All guardrail metrics meet their tolerance levels: either significantly non-inferior (if you use non-inferiority margins), or showing no evidence of deterioration (if you do not).
No health check has failed. There is no sample ratio mismatch (SRM) and no evidence of metric deterioration.
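
The guardrail condition above has two modes, and a small sketch may make the difference concrete. The function below is illustrative only and is not Confidence's implementation: the name guardrail_within_tolerance, its parameters, and the convention that higher metric values are better are all assumptions made for this example. It takes a confidence interval for the relative difference between treatment and control and checks it against an optional non-inferiority margin.

```python
from typing import Optional


def guardrail_within_tolerance(
    ci_lower: float,
    ci_upper: float,
    non_inferiority_margin: Optional[float] = None,
) -> bool:
    """Check one guardrail metric, assuming higher values are better.

    ci_lower and ci_upper bound the relative difference (treatment minus
    control); for example, -0.01 means 1% worse than control.
    All names here are hypothetical and used for illustration only.
    """
    if non_inferiority_margin is not None:
        # Significantly non-inferior: the whole interval lies above -margin.
        return ci_lower > -non_inferiority_margin
    # No margin: acceptable unless the whole interval lies below zero,
    # which would be significant evidence of deterioration.
    return ci_upper >= 0.0
```

With a 1% margin, an interval of (-0.5%, +2%) passes because even the worst plausible outcome is inside the margin, while (-2%, +1%) does not. Without a margin, (-2%, +1%) is acceptable because the interval still includes zero.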

Ship is a positive signal, but it is a recommendation, not a mandate. You should still exercise judgment. Consider whether the effect size is practically meaningful, whether the experiment ran long enough to rule out novelty effects, and whether the results make sense given your understanding of the product.

Continue

Continue means there is not yet enough evidence to ship, and no reason to stop. This is the most common recommendation for an experiment that is on track. It means:

No success metric has significantly improved yet.
No health checks have failed.
No metrics have deteriorated.

The right response to Continue is to let the experiment run until it reaches the required sample size. Stopping an experiment early because results are not yet significant is a common mistake: it produces biased, unreliable results.

End

End appears when an experiment has collected enough data to be considered powered for all success metrics, but none of those metrics has shown a significant improvement. This is different from Continue.

Continue means "we do not know yet." End means "we have collected enough data, and there is no signal."

An End recommendation is a genuine null result. The treatment variant does not appear to improve the metrics you care about, and you have enough data to be fairly confident in that conclusion. The right response is to stop the experiment and treat this as a real finding: the treatment variant did not work as hypothesized.

Note

A null result is a valuable result. Knowing that a change did not improve metrics saves engineering and design resources that would otherwise go into shipping and maintaining a change that does not help users. Do not dismiss null results.

Abort

Abort means something has gone wrong that makes continuing the experiment harmful or pointless. This recommendation appears when:

A health check has failed (most commonly a sample ratio mismatch), which means the results cannot be trusted.
One or more metrics have deteriorated significantly, meaning there is statistical evidence the treatment variant is harming something you care about.

When you see Abort, stop the experiment. If the cause is an SRM, investigate the exposure logic before relaunching. If the cause is metric deterioration, the treatment variant may be harmful and should not be shipped.
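
To make the sample ratio mismatch check more tangible, here is a minimal sketch of one common way to test for it: a chi-square goodness-of-fit test on the observed group sizes. This is an assumption about how such a check can be done, not a description of how Confidence does it. The function names and the 0.001 threshold are hypothetical choices for the example, and the test covers only the two-group case.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test with 1 degree of freedom."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_srm(n_control: int, n_treatment: int,
            expected_treatment_share: float = 0.5,
            alpha: float = 0.001) -> bool:
    """Flag a mismatch when the split is very unlikely under the planned ratio.

    The strict alpha of 0.001 is an assumed threshold for this sketch.
    """
    return srm_p_value(n_control, n_treatment, expected_treatment_share) < alpha
```

For a planned 50/50 split, 5,000 versus 5,100 users is well within normal variation, while 5,000 versus 5,400 is flagged.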

After the experiment ends

After you end an experiment, the Continue and End recommendations merge into a single Don't ship label. The Abort recommendation may also appear if there was a health check failure. The Ship recommendation remains if the evidence for shipping was already established before the experiment ended.

The full picture

The Spotlight is a synthesis. It takes the outcomes of all success metrics, all guardrail metrics, and all health checks, and compresses them into one recommendation per treatment variant. Understanding the individual components (significance, confidence interval width, health checks, evaluation strategy) gives you the ability to look at a Spotlight recommendation and trace it back to its causes.
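
The synthesis can be sketched as a small decision function. This is a simplified reading of the rules in this lesson, not Confidence's actual logic: the VariantSummary type, its field names, and the order in which the conditions are checked (Abort first, then Ship, then End, then Continue) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VariantSummary:
    """Hypothetical per-variant inputs; field names are illustrative."""
    any_success_improved: bool      # at least one success metric significantly improved
    all_guardrails_ok: bool         # every guardrail metric within tolerance
    any_health_check_failed: bool   # for example, a sample ratio mismatch
    any_metric_deteriorated: bool   # significant deterioration in any metric
    powered_for_all_success: bool   # enough data collected for all success metrics


def spotlight_recommendation(s: VariantSummary) -> str:
    # Assumed priority: a failed health check or a deterioration
    # overrides every other signal.
    if s.any_health_check_failed or s.any_metric_deteriorated:
        return "Abort"
    # All three Ship conditions hold (health was checked above).
    if s.any_success_improved and s.all_guardrails_ok:
        return "Ship"
    # Enough data, but no success metric moved: a null result.
    if s.powered_for_all_success and not s.any_success_improved:
        return "End"
    # Healthy, but not enough evidence yet.
    return "Continue"
```

Reading the function from top to bottom is also a way to trace a recommendation back to its cause: the first condition that matches is the reason you see that label.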

Example

An experiment shows a Continue recommendation in the Spotlight, with one success metric showing "Not significant" (+1.4%) and all guardrail metrics showing "Has not deteriorated." The experiment is halfway through its planned duration.

The correct interpretation: the experiment is healthy and on track. The success metric has not yet moved enough to be statistically significant, but "not significant" at the halfway point is expected. Continue running until the required sample size is reached before drawing conclusions.
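
A rough calculation shows why "not significant" at the halfway point is expected. The sketch below assumes a fixed-horizon, two-sided z-test planned for 80% power at a 5% significance level, and a true effect equal to the one the experiment was planned for. These are assumptions for illustration: Confidence may use other methods, such as sequential tests, and the function name is hypothetical.

```python
from statistics import NormalDist


def power_at_fraction(fraction: float, planned_power: float = 0.8,
                      alpha: float = 0.05) -> float:
    """Approximate power after collecting `fraction` of the planned sample.

    Assumes a two-sided z-test and a true effect equal to the planned one.
    The negligible chance of a significant result in the wrong direction
    is ignored.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Signal-to-noise ratio that yields the planned power at full sample size.
    noncentrality_full = z_alpha + z.inv_cdf(planned_power)
    # The ratio scales with the square root of the sample size.
    noncentrality = noncentrality_full * fraction ** 0.5
    return z.cdf(noncentrality - z_alpha)


print(round(power_at_fraction(1.0), 2))  # 0.8
print(round(power_at_fraction(0.5), 2))  # 0.51
```

Under these assumptions, an experiment with a real effect of the planned size has only about a one-in-two chance of showing a significant result at the halfway point, compared with four in five at the full sample size.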

Reader exercise

The Spotlight shows 'End' for a running experiment. What is the most accurate interpretation?

The experiment has been aborted due to a health check failure

The experiment has collected enough data and no success metric has improved significantly. This is a genuine null result

The experiment ran out of traffic and cannot continue

The treatment variant improved all metrics significantly

Reader exercise

Which of the following must be true for Confidence to recommend 'Ship'?

All success metrics must be statistically significant

At least one success metric has improved significantly, all guardrail metrics are acceptable, and no health check has failed

The experiment has reached twice its required sample size

All metrics (success and guardrail) must show no deterioration

Notes for nerds

The Spotlight's synthesis of many metric results into a single recommendation is a non-trivial statistical problem. When you have many metrics, each tested at some significance level, the probability that at least one shows a spurious significant result grows quickly. Spotify has published research on how to make principled, risk-aware product decisions under exactly this kind of multi-metric setting. The underlying ideas are described in a paper by Schultzberg, Ankargren, and Frånberg (2024). You can read the engineering post at engineering.atspotify.com.
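
To see how quickly the chance of a spurious result grows, consider the textbook case of independent metrics, each tested at the same significance level. The sketch below computes the probability of at least one false positive and shows the Bonferroni correction, a standard and conservative remedy. It is included only to illustrate the problem and is not a description of the method in the paper or in Confidence. The function names are hypothetical.

```python
def familywise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric significance level under the Bonferroni correction."""
    return alpha / num_metrics


for m in (1, 5, 10, 20):
    uncorrected = familywise_error_rate(m)
    corrected = familywise_error_rate(m, bonferroni_alpha(m))
    print(f"{m:>2} metrics: uncorrected {uncorrected:.3f}, corrected {corrected:.3f}")
```

With 10 independent metrics at a 5% level, the chance of at least one false positive is about 40%. Testing each metric at 0.5% instead keeps the overall chance just under 5%, at the cost of needing more data to detect real effects.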
