Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Answers to Case study: Shuffle button in a shelf on Spotify home

Summary

Suggested answers to the shuffle button case study: success and guardrail metric choices and a testable hypothesis for the Spotify Home experiment.

Side-by-side comparison of Spotify Home screens for control and treatment groups, where the treatment adds a shuffle button to the Try something else shelf

Exercise: Write a hypothesis for an experiment

Which success metrics would you set for this experiment? Why would you choose these metrics?

The product brief states the theory that the shuffle button will "increase engagement and audio consumption". We therefore want a success metric that captures consumption, such as Minutes played on day 1 after exposure. However, there could be a strong novelty effect that increases consumption on the first day, when the button is new to the user, but wears off quickly. Minutes played on week 1 after exposure may therefore be a better choice. To check that the effect persists over time, we could add metrics for later time points, such as Minutes played on week 2 after exposure. However, each additional metric can make it more difficult to detect a change. Thinking carefully and then picking fewer metrics is often the better option.
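One reason extra metrics make a change harder to detect is multiple-testing correction. The sketch below assumes a Bonferroni correction, a two-sided z-test, and 80% power. None of these choices are stated in the case study, and the function name is hypothetical. It shows how much larger the sample must be, relative to a single success metric, to keep the same power per metric.

```python
from statistics import NormalDist


def relative_sample_size(num_metrics: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Sample size needed with a Bonferroni-corrected alpha, relative to one metric.

    Assumes a two-sided z-test; the required sample size is proportional to
    (z_{1 - alpha/2} + z_{power})^2, so all other terms cancel in the ratio.
    """
    z = NormalDist().inv_cdf

    def factor(a: float) -> float:
        return (z(1 - a / 2) + z(power)) ** 2

    # Bonferroni: split the overall alpha evenly across the success metrics.
    return factor(alpha / num_metrics) / factor(alpha)


if __name__ == "__main__":
    for m in (1, 2, 3):
        print(f"{m} success metric(s): {relative_sample_size(m):.2f}x the sample size")
```

Under these assumptions, every added success metric raises the sample size needed for the same sensitivity, which is the cost the answer above weighs against tracking more time points.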

What unintended side effects could you see when you add the shuffle button to this shelf? What guardrail metrics would you set to test for this?

There is a risk that adding the shuffle button will increase crashes. The metric Share of users with a crash on day 1 after exposure would be a good way to detect this. In addition, we might only be moving consumption from other shelves to the shelf with the shuffle button. However, we don't need to add a metric for this, since we are measuring overall consumption: if consumption is only moved around, we will see no effect in the experiment. If we instead used Consumption from the shelf with the shuffle button as the success metric, we would need to add a metric for Consumption from other shelves to make sure that we are not just moving consumption around.

Based on the information above, write a testable hypothesis for an experiment on a shuffle button on the Try something else shelf.

Based on user research, consumption data, and anecdotal evidence, we believe that giving users low-effort paths to discovering new content is important for user satisfaction and engagement. We think that adding a shuffle button to the 'Try something else' shelf for users in Spanish-speaking Latin America will increase overall audio consumption. We will know this is true when we see an increase in Minutes played on week 1 after exposure. This will be good for customers because they can discover new songs and artists without having to make decisions, and for creators because more users will discover more content.

Bonus questions

How much of an increase in these metrics would you want to see to call the experiment a success?

When setting up an experiment, we need to define the "minimum detectable effect" for each success metric. This lets us make the experiment sensitive enough to measure the effect that we care about. It is not always straightforward to define how large an effect needs to be to be relevant. In this case, there is historical data from similar experiments in other places, from which we can learn what size of lift to expect.
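To make the link between the minimum detectable effect and experiment sensitivity concrete, here is a minimal sketch of a standard sample size formula for comparing two means. It assumes a two-sided z-test with equal group sizes. The baseline mean and standard deviation of Minutes played are made-up placeholders, not Spotify data, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(
    baseline_mean: float,
    std_dev: float,
    relative_mde: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Users needed per group to detect a relative lift of `relative_mde`.

    Uses n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / delta^2,
    where delta is the absolute effect implied by the relative MDE.
    """
    z = NormalDist().inv_cdf
    delta = baseline_mean * relative_mde
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * std_dev**2 / delta**2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: 100 minutes played per week, standard deviation 150.
    for mde in (0.01, 0.02, 0.05):
        n = sample_size_per_group(baseline_mean=100, std_dev=150, relative_mde=mde)
        print(f"MDE {mde:.0%}: about {n:,} users per group")
```

Because the effect size enters the formula squared, halving the minimum detectable effect roughly quadruples the required sample. This is why historical lifts from similar experiments are useful for picking a realistic value.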

What is the largest decrease in the guardrail metrics that you would still consider acceptable?

For a metric like Share of users with a crash on day 1 after exposure, even a small increase may be unacceptable. If the change increases the number of crashes, we would probably want to fix the source of the crashes rather than roll out the change and cause crashes for some users. So we may want to set the minimum detectable effect very low here.

On the other hand, if we set the effect very low, then we will need a large sample size to detect an effect this small.

At the same time, the baseline of crash rates is low, so even a large relative increase in crash rates may not be a large absolute increase.

Setting a sensible effect size here is a trade-off between an effect that is small enough to protect the business against rolling out a change that can cause a negative experience, and large enough to be detectable within a reasonable sample size.
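The trade-off for the crash guardrail can be sketched numerically. The example below assumes a one-sided z-test for an increase in a proportion and a baseline crash rate of 1%. Both are illustrative assumptions rather than values from the case study, and the function name is hypothetical. It converts each relative increase into an absolute one and estimates the users needed per group to detect it.

```python
import math
from statistics import NormalDist


def crash_guardrail_sample_size(
    baseline_rate: float,
    relative_increase: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> tuple[float, int]:
    """Return (absolute increase, users per group) for a crash-rate guardrail.

    Uses the normal approximation for two proportions with a one-sided test:
    n = (z_{1 - alpha} + z_{power})^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2
    """
    z = NormalDist().inv_cdf
    treated_rate = baseline_rate * (1 + relative_increase)
    absolute_increase = treated_rate - baseline_rate
    variance = baseline_rate * (1 - baseline_rate) + treated_rate * (1 - treated_rate)
    n = (z(1 - alpha) + z(power)) ** 2 * variance / absolute_increase**2
    return absolute_increase, math.ceil(n)


if __name__ == "__main__":
    # Hypothetical baseline: 1% of users experience a crash on day 1.
    for rel in (0.50, 0.10, 0.05):
        abs_inc, n = crash_guardrail_sample_size(baseline_rate=0.01, relative_increase=rel)
        print(f"relative +{rel:.0%} = absolute +{abs_inc:.4%}: about {n:,} users per group")
```

With a low baseline, a large relative increase is still a small absolute one, and the sample size grows quickly as the tolerated increase shrinks. That is the tension between protecting users and keeping the experiment feasible.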