Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 6: Why do we need statistics?

Summary

In this lesson, you learn about the role that statistical analysis plays in experimentation.

Statistics helps to:

Quantify the uncertainty in the metric results.
Manage and bound the risk of making the wrong product decisions.

Statistics is the mathematical language for quantifying uncertainty. In experiments, there is always random variation between the treatment groups even before you make a change to any group of users. All users are unique and therefore treatment groups never have exactly the same average.

Gym example

Let's go back to the weight-loss program trial.

You carried out your experiment as planned, randomizing participants into a control group and a treatment group. You measured the weight of the participants in both groups at the start of the trial. The treatment group started the 6-week weight-loss program right away, while the control group waited 6 weeks. At the end of the 6 weeks, you measured the weight of everyone again.

You ended up with 10 participants in each group, and the table below shows each participant's change in weight over the 6 weeks. People in the control group lost 1.3 kg on average, and people who participated in the weight-loss program lost 3.1 kg on average.

Weight change (kg)	Group average	1	2	3	4	5	6	7	8	9	10
Control	-1.3	+2.1	-3.3	-2.7	-1.8	+2.1	-1.8	+0.9	-1.2	-4.5	-3.0
Treatment	-3.1	-3.9	-2.7	-3.3	-7.5	-0.9	-4.2	-1.8	-2.1	-3.9	-0.6

So, on average, the people in your program lost 1.8 kg more weight than the people in the control group. Does that mean that your program works? Or could it just be a coincidence?
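As a quick check of the arithmetic, the group averages and their difference can be recomputed from the table. This is a minimal Python sketch; the variable names are illustrative and not part of the lesson.

```python
from statistics import mean

# Weight change in kg per participant, copied from the table above.
control = [+2.1, -3.3, -2.7, -1.8, +2.1, -1.8, +0.9, -1.2, -4.5, -3.0]
treatment = [-3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]

control_avg = mean(control)
treatment_avg = mean(treatment)

# Negative values mean weight loss, so a negative difference means the
# treatment group lost more weight than the control group.
difference = treatment_avg - control_avg

print(f"Control average:   {control_avg:+.1f} kg")
print(f"Treatment average: {treatment_avg:+.1f} kg")
print(f"Difference:        {difference:+.1f} kg")
```

Rounded to one decimal, the three values match the -1.3 kg, -3.1 kg, and 1.8 kg quoted in the text.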

Quantify the noise to detect the signal

People's weights fluctuate somewhat over time. If you just divided 20 people randomly into two groups and measured their change in weight over time, it's quite unlikely that the averages of the two groups would be exactly the same. This random variation causes noise in your measurement.
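To get a feel for this noise, you can simulate it. The sketch below repeatedly draws two groups of 10 simulated people who get no program at all and records the difference between the group averages. The assumption that each person's weight change over 6 weeks follows a normal distribution with mean 0 kg and standard deviation 2 kg is made up for illustration, as are the function and variable names.

```python
import random
from statistics import mean

def simulate_null_difference(rng, group_size=10, sd_kg=2.0):
    """Difference in average weight change between two groups when nobody
    gets any treatment. sd_kg is an assumed, illustrative value."""
    group_a = [rng.gauss(0.0, sd_kg) for _ in range(group_size)]
    group_b = [rng.gauss(0.0, sd_kg) for _ in range(group_size)]
    return mean(group_a) - mean(group_b)

rng = random.Random(1)  # fixed seed so the run is reproducible
differences = [simulate_null_difference(rng) for _ in range(1000)]

# Average size of the difference, ignoring its sign.
typical_size = mean(abs(d) for d in differences)

print("First five differences (kg):", [round(d, 2) for d in differences[:5]])
print(f"Typical size of the difference: {typical_size:.2f} kg")
```

Even though neither simulated group receives a treatment, the differences are not zero. That spread is the noise the rest of this lesson refers to.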

How can you detect the signal among the noise?

To answer this question, you can think about how much "noise" in the weight measurements you would expect to see, even if your program had no impact at all. You can then compare the difference that you found to the amount of noise that you would expect purely from random variation. The amount of noise in the average measurements depends on the number of people in each group and on the variance, that is, how much the individuals differ from each other. Based on this, you can use statistical theory to calculate how much noise you can expect in the measurement. In other words: how likely would it be to find a difference this large, just due to random variation?

We'll skip the math here. For the weight-loss example, it turns out that there is an 8% chance of finding a difference of 1.8 kg or more between the two groups, based purely on random variation. This calculation assumes that your weight-loss program had absolutely no effect on people's weight.
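For readers who want to see where a number like this can come from, one possible approach is a permutation test. The lesson does not say which method produced the 8% figure, so treat this as a sketch of one reasonable technique rather than the calculation used here. It enumerates every way of dividing the 20 measured weight changes into two groups of 10 and counts how often the difference in averages is at least as large, in either direction, as the one observed. The variable and function names are illustrative.

```python
from itertools import combinations

# Weight change in kg per participant, copied from the table in this lesson.
control = [+2.1, -3.3, -2.7, -1.8, +2.1, -1.8, +0.9, -1.2, -4.5, -3.0]
treatment = [-3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]

pooled = control + treatment
group_size = len(treatment)
total = sum(pooled)

def difference_in_means(group_sum):
    """Average of one group minus the average of everyone else."""
    rest_sum = total - group_sum
    return group_sum / group_size - rest_sum / (len(pooled) - group_size)

observed = abs(difference_in_means(sum(treatment)))

# Under "the program has no effect", every split of the 20 people into
# two groups of 10 is equally likely, so enumerate all 184,756 of them.
tolerance = 1e-9  # guards against floating-point rounding in the comparison
n_splits = 0
n_as_extreme = 0
for group in combinations(pooled, group_size):
    n_splits += 1
    if abs(difference_in_means(sum(group))) >= observed - tolerance:
        n_as_extreme += 1

p_value = n_as_extreme / n_splits
print(f"Observed difference: {observed:.2f} kg")
print(f"Share of splits with a difference at least this large: {p_value:.1%}")
```

The printed share can be compared with the 8% quoted above. A different method, such as a t-test, can give a slightly different value.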

Use statistics to make a decision

You can use this calculation to make a decision about your program. If you find a fairly small effect, it may not be unlikely to see an effect of that size even if your program did nothing. In that case, you conclude that there's not enough evidence that your program is working. If you find a larger effect, it is less likely that you would see it purely due to random variation. You can define a threshold beforehand to decide how "unlikely to be seen purely based on random variation" your result needs to be for you to consider it strong enough evidence.

For example, if you had decided beforehand that you would consider your results strong enough if they are less than 5% likely to be seen based purely on random variation, then your trial would not have produced strong enough evidence, because 8% is above 5%. If, on the other hand, you had decided beforehand to set that threshold at 10%, then your result would have passed the test. We call a result that has passed such a test "statistically significant".
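This decision rule can be written down in a few lines. The sketch below assumes the probability from the calculation (8% in the example, commonly called a p-value) is already available as a number. The function and parameter names are hypothetical.

```python
def is_statistically_significant(p_value, threshold):
    """Return True if the result is less likely than the pre-set threshold
    to be seen based purely on random variation."""
    if not 0.0 <= p_value <= 1.0 or not 0.0 < threshold < 1.0:
        raise ValueError("p_value must be in [0, 1] and threshold in (0, 1)")
    return p_value < threshold

p_value = 0.08  # from the weight-loss example

print(is_statistically_significant(p_value, threshold=0.05))  # False
print(is_statistically_significant(p_value, threshold=0.10))  # True
```

The threshold is a separate argument on purpose: it is meant to be fixed before the experiment runs, while the p-value only becomes known afterwards.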

Don't change the target after you see the results

It is tempting to change the threshold after seeing the results: "OK, we said 5% beforehand, but 8% is still not that bad. If we decide that a threshold of 10% is good enough, then the experiment passes the test!" But to avoid confirmation bias, it is important to define the threshold before seeing the results. Changing the threshold after seeing the results is like shooting an arrow at a wall and then drawing a target around the place where it landed. Drawing a target around the results of an experiment is cheating, and cheating in experimentation leads to worse product decisions. The same applies to changing the metrics after the end of the experiment.

A note on statistical uncertainty for the curious

Statistics doesn't magically know how different any two groups are. However, you have one trick up your sleeve: randomization. By randomizing the treatment assignment, you know how the difference in means between two treatment groups varies across different random treatment assignments. In other words, randomly assigning the treatment to users serves two purposes: to make the groups similar in all aspects other than the treatment (as discussed in the scientific method lesson), and to "structure" the noise in the difference-in-means estimator to allow statistical inference.

Reader exercise

What is the purpose of statistical testing?

To ensure that the results are successful.

To quantify uncertainty and limit the risk of reaching the wrong conclusion.

To use p-values as much as possible to gain efficiency.