Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 1: Introduction to hypothesis tests

Summary

This lesson introduces hypothesis testing by looking at the relation between the experiment hypothesis and the hypotheses in a hypothesis test. You learn about the null and alternative hypotheses in a hypothesis test, and how they relate to the experiment hypothesis. You also learn why you often test the mean of metrics in experiments, and how the randomness in the experiment helps us reason about the true average treatment effect.

Experiment hypotheses

Most people are familiar with the concept of a hypothesis: it's a somewhat formal statement about what you believe will be the outcome of an experiment. Most introductions to experimentation (including our introductory course) stress how important it is to have a clear hypothesis. It's worth noting that the hypothesis in an experiment is not the same as the hypotheses in a hypothesis test. The hypotheses in a hypothesis test are mathematically precise statements about an aspect of a metric of interest. The test hypotheses should of course reflect the experiment hypothesis, but they are not the same.

An example of a good experiment hypothesis is:

Based on user research, we believe that having to create a username creates friction in the sign-up process. We think that removing the step to enter a username for users signing up in the app will lead to faster completion of the sign-up flow. We will know this is true when we see a decrease in the mean sign-up completion time.

The last part, about decreasing the mean sign-up completion time, is the part we test using hypothesis testing. To do that, we need to translate the experiment hypothesis into the two hypotheses in a hypothesis test.

The null and alternative hypotheses of a hypothesis test

A hypothesis test has a null hypothesis and an alternative hypothesis. Hypotheses are specific statements about aspects of a metric of interest, such as the mean sign-up completion time that you want to improve. For example, a hypothesis might say "we think the mean sign-up completion time will decrease by 10 seconds".

Note

An experiment hypothesis is often a less mathematically precise version of the alternative hypothesis in a hypothesis test. In our example, "faster completion of the sign-up flow" becomes a precise statement about decreasing mean sign-up completion time.

Simply put:

The alternative hypothesis describes what happens to the metric if the treatment has an effect (in our case, that removing the username field decreases mean sign-up completion time).
The null hypothesis describes what happens to the metric if the treatment has no effect (in our case, that removing the username field doesn't change mean sign-up completion time).

With these hypotheses in mind, we have a clear definition of what we expect to happen to the metric depending on whether or not the treatment has an effect.

Hypotheses for means

When talking about hypothesis tests in online A/B tests, we almost always talk about a test of the difference in means between treatment and control. You have an outcome metric of interest (like sign-up completion time), and this metric has some mean before starting any experiment. As a product team, you are trying to improve this metric by iterating on your product. When you are about to run an experiment, you want to know whether a treatment (like removing the username field) improves the mean of this metric.

For our sign-up completion time example, we want the mean time to decrease after the treatment, and the null and alternative hypotheses would be:

$H_0$ : Mean sign-up completion time in control group = Mean sign-up completion time in treatment group (null hypothesis)
$H_1$ : Mean sign-up completion time in control group > Mean sign-up completion time in treatment group (one-sided, negative direction)

This example shows how we can have directionality in our hypothesis test. We specifically want to see a decrease in the mean time. In other contexts, where we might want to see an increase in a metric (like mean revenue per user), we would adjust the alternative hypothesis accordingly:

$H_0$ : Mean revenue in control group = Mean revenue in treatment group (null hypothesis)
$H_1$ : Mean revenue in control group < Mean revenue in treatment group (one-sided, positive direction)
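To make the directionality concrete, here is a minimal sketch of how the direction of the alternative hypothesis decides which tail of the distribution the p-value comes from. It assumes a large-sample normal approximation for the difference in means, which this lesson does not prescribe. The function name `one_sided_p_value` and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def one_sided_p_value(control, treatment, direction):
    """Approximate one-sided p-value for a difference in means.

    direction="decrease": H1 is mean(control) > mean(treatment).
    direction="increase": H1 is mean(control) < mean(treatment).
    The normal approximation is an assumption of this sketch.
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal group variances.
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = diff / se
    if direction == "decrease":
        # Small when the treatment mean is far below the control mean.
        return NormalDist().cdf(z)
    if direction == "increase":
        # Small when the treatment mean is far above the control mean.
        return 1.0 - NormalDist().cdf(z)
    raise ValueError("direction must be 'decrease' or 'increase'")


# Made-up sign-up completion times in seconds, for illustration only.
control = [62, 58, 71, 65, 60, 68, 64, 59]
treatment = [51, 55, 49, 58, 53, 50, 56, 52]

p_decrease = one_sided_p_value(control, treatment, "decrease")
p_increase = one_sided_p_value(control, treatment, "increase")
print(f"p-value for H1 'decrease': {p_decrease:.4f}")
print(f"p-value for H1 'increase': {p_increase:.4f}")
```

The same data give a small p-value under the "decrease" alternative and a large one under the "increase" alternative, which is why the direction has to be chosen from the experiment hypothesis before looking at the results.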

Note

Because the hypotheses are used in the test formulas, it's common to present them in a mathematical form. For the purpose of this course, and to build intuition about hypothesis testing, it suffices to understand them as precise statements about some aspect of a metric.

There are always two hypotheses in a hypothesis test: the null hypothesis and the alternative hypothesis. You can be mathematically fancy and write $H_0$ for the null hypothesis and $H_1$ for the alternative.

Why look at the mean?

You can test parameters other than the mean of the metric, but in experimentation the mean is the most common. The mean is a good summary of a metric to base decisions on, and it has attractive statistical properties that make it easy to manage risk with statistics.

The steps for running an experiment and testing the mean difference between treatment and control are illustrated in the figure below. We:

take a random sample of users
split the sample randomly into treatment and control groups
give the treatment to the users in the treatment group
observe the mean of the metric in the two groups
compare the means using a hypothesis test.
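The steps above can be sketched as a small simulation. This is an illustration under stated assumptions, not the Confidence platform's implementation: the population is simulated, and the 5-second treatment effect (`ASSUMED_EFFECT`) is a made-up value chosen so there is something to detect.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: sign-up completion times in seconds.
population = [random.gauss(60, 15) for _ in range(100_000)]

# 1. Take a random sample of users.
sample = random.sample(population, 2_000)

# 2. Split the sample randomly into treatment and control groups.
random.shuffle(sample)
control, treatment = sample[:1_000], sample[1_000:]

# 3. Give the treatment to the users in the treatment group.
#    Assumption: the treatment shortens every user's time by 5 seconds.
ASSUMED_EFFECT = -5.0
treatment = [t + ASSUMED_EFFECT for t in treatment]

# 4. Observe the mean of the metric in the two groups.
mean_control = mean(control)
mean_treatment = mean(treatment)

# 5. Compare the means. A hypothesis test asks whether this difference
#    is larger than random sampling and assignment alone would explain.
diff_in_means = mean_treatment - mean_control
print(f"control mean:   {mean_control:.2f} s")
print(f"treatment mean: {mean_treatment:.2f} s")
print(f"difference:     {diff_in_means:.2f} s")
```

In a real experiment step 3 is the product change itself and the effect is unknown; only the two observed means are available for the comparison in step 5.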

A Spotify example

For a more concrete Spotify example, suppose the metric of interest is audiobook minutes played. The current mean is 48 minutes per user, and we want to know whether a new version of a recommendation algorithm increases the mean minutes played from 48 minutes to a higher number. In other words, does the new version of the recommendation algorithm cause an increase in the average minutes played? In an experiment, we test whether the treatment affects the mean of the metric by taking a random sample, splitting it randomly into treatment and control, giving the treatment to the users in the treatment group, and then observing the mean of the metric in the two groups and comparing the means.

How the randomness helps us manage risks

When the goal is to affect the mean, the hypothesis refers to the difference in the means between the treatment and control groups. If the treatment has no effect, the treatment and the control group should have the same mean and the mean difference should be zero. If the treatment increases the mean, the treatment group should have a higher mean than the control group, which would lead to a mean difference larger than zero. The null hypothesis is that the difference in means is zero. The alternative hypothesis is that the difference in means is greater than zero.

Since the sample and the treatment assignment are random, and no person is exactly like any other person, the difference in means between treatment and control will not be exactly zero even if the treatment truly has no effect. The difference-in-means estimator gives an estimate of the true average treatment effect in the population, and it will vary across random samples and treatment assignments. In the next lesson we will dig deeper into the variation of the difference-in-means estimator across random samples and treatment assignments.
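A short simulation can make this variation visible. The sketch below keeps one simulated sample fixed and repeats only the random treatment assignment, with a treatment that does nothing, so every non-zero difference is due to chance alone. The data, the group sizes, and the helper name `diff_in_means_under_no_effect` are assumptions made for illustration.

```python
import random
from statistics import mean, stdev

random.seed(11)  # fixed seed so the sketch is reproducible

# Hypothetical sample of 1,000 users; the metric is minutes played.
users = [random.gauss(48, 20) for _ in range(1_000)]


def diff_in_means_under_no_effect(values):
    """Randomly assign half the users to a treatment that does nothing."""
    shuffled = values[:]
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    control, treatment = shuffled[:half], shuffled[half:]
    return mean(treatment) - mean(control)


# Repeat the random assignment many times and record each estimate.
estimates = [diff_in_means_under_no_effect(users) for _ in range(2_000)]

print(f"average estimate:   {mean(estimates):.3f}")
print(f"spread (std. dev.): {stdev(estimates):.3f}")
print(f"smallest / largest: {min(estimates):.2f} / {max(estimates):.2f}")
```

The estimates centre on zero, the true effect in this setup, but individual assignments land above or below it. A hypothesis test has to account for that spread before attributing an observed difference to the treatment.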

Reader exercise

What is the primary purpose of hypothesis testing in experimentation?

To prove the experiment is correct

To handle uncertainty and manage risk

To increase the sample size

To make the data look better

Reader exercise

In a typical experiment, what do we usually test?

The difference in medians between treatment and control

The difference in means between treatment and control

The difference in variances between treatment and control

The difference in modes between treatment and control

Notes for nerds

It's common to denote the population means in the control and treatment groups by $\mu_0$ and $\mu_1$, respectively, to keep notation succinct. The null hypothesis is then $H_0: \mu_0 = \mu_1$ and the alternative hypothesis is, for example, $H_1: \mu_0 < \mu_1$.

The hypotheses in a hypothesis test refer to so-called population parameters. The population parameters are the true values of the metric in the population. The true average treatment effect, for example, is the effect we would see if all users in the population got the treatment. In an experiment, we can only observe the metric in a sample of the population, and we use the sample mean as an estimate of the population mean. The hypothesis test is then a test of the null hypothesis that the population means are equal, based on the sample means.
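As a compact summary, the one-sided test for an increase can be written in terms of a single parameter. The block below assumes index 0 refers to the control group and index 1 to the treatment group. The symbols $\delta$, $\hat{\delta}$, $\bar{Y}_0$ and $\bar{Y}_1$ are notation introduced here for illustration and do not appear in the lesson.

```latex
\begin{aligned}
\delta &= \mu_1 - \mu_0
  && \text{(true average treatment effect, a population parameter)} \\
H_0 &: \delta = 0
  && \text{(the treatment has no effect on the mean)} \\
H_1 &: \delta > 0
  && \text{(the treatment increases the mean)} \\
\hat{\delta} &= \bar{Y}_1 - \bar{Y}_0
  && \text{(difference-in-means estimator, computed from the sample)}
\end{aligned}
```

The hypotheses are statements about $\delta$, which is never observed directly. The test draws its conclusion about $\delta$ from $\hat{\delta}$, which changes from one random sample and treatment assignment to the next.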

The following lessons are devoted to unpacking the relationship between the sample and the population.
