Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 10: Exploratory analysis

Summary

In this lesson, you learn how to interpret segmented results in Confidence. By the end, you will know what a dimension result means, how to read it, and why segment findings call for follow-up experiments rather than direct decisions.

Splitting experiment results by user subgroups is powerful, but it requires a specific way of reading the results. This lesson focuses on one question: after you have split results by a subgroup, how do you interpret what you see?

In Confidence

In Confidence, you split results by user subgroups by adding dimensions to an exploratory analysis. For the broader context on exploratory analysis (what explorations are, how to create them, and why the false positive risk matters), see the intro to metrics course.

What a dimension result shows

When you add a dimension, Confidence shows you the metric result for each segment separately. For example, if you split by platform, you see one result for iOS users and one for Android users. Each segment result has its own point estimate and confidence interval.
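To make the idea of "one result per segment" concrete, here is a minimal sketch that computes a relative effect and a 95% confidence interval separately for each segment. It is not how Confidence computes its results: the function name, the row layout, and the delta-method approximation for the standard error are all assumptions chosen for illustration, and the data is made up.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def relative_effect_by_segment(rows, z=1.96):
    """Compute a relative effect and CI per segment.

    rows: iterable of (segment, group, value) tuples, where group is
    "control" or "treatment". Hypothetical layout, for illustration only.
    Returns {segment: (effect, ci_low, ci_high)} as fractions (0.05 = +5%).
    """
    values = defaultdict(lambda: {"control": [], "treatment": []})
    for segment, group, value in rows:
        values[segment][group].append(value)

    results = {}
    for segment, groups in values.items():
        c, t = groups["control"], groups["treatment"]
        mean_c, mean_t = mean(c), mean(t)
        effect = mean_t / mean_c - 1
        # Delta-method standard error for a ratio of two independent means
        # (an assumption of this sketch, not a documented Confidence method).
        se = math.sqrt(
            variance(t) / len(t) / mean_c**2
            + mean_t**2 * variance(c) / len(c) / mean_c**4
        )
        results[segment] = (effect, effect - z * se, effect + z * se)
    return results


# Made-up observations: (segment, group, metric value).
rows = [
    ("ios", "control", 10), ("ios", "control", 12),
    ("ios", "treatment", 11), ("ios", "treatment", 13),
    ("android", "control", 8), ("android", "control", 10),
    ("android", "treatment", 9), ("android", "treatment", 9),
]
results = relative_effect_by_segment(rows)
for segment, (effect, low, high) in results.items():
    print(f"{segment}: {effect:+.1%} (95% CI: [{low:+.1%}, {high:+.1%}])")
```

Each segment is analyzed only with its own users, which is why every segment ends up with its own point estimate and its own interval.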

Note

Confidence always uses the dimension value from right before a user was exposed to the treatment variant. This means the segment a user is in cannot have been affected by the treatment variant itself. Static attributes like country or device type are inherently safe. Dynamic attributes like subscription tier or engagement level are also safe, because Confidence uses the pre-exposure snapshot.
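The pre-exposure rule can be sketched as a lookup over an attribute's history: take the latest value recorded before the exposure time and ignore everything after it. The function name, the (timestamp, value) layout, and the choice of "strictly before" are assumptions made for this illustration, not a description of how Confidence stores or queries the data.

```python
from bisect import bisect_left


def pre_exposure_value(history, exposure_time):
    """Return the latest attribute value recorded strictly before exposure.

    history: list of (timestamp, value) tuples sorted by timestamp.
    exposure_time: timestamp of the user's first exposure.
    Returns None if no value was recorded before exposure.
    """
    timestamps = [t for t, _ in history]
    # Index of the first record at or after the exposure time.
    idx = bisect_left(timestamps, exposure_time)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical subscription-tier history for one user (timestamps are day numbers).
tier_history = [(1, "free"), (10, "premium"), (25, "free")]

# The user was exposed on day 12, so the day-25 downgrade is ignored.
segment = pre_exposure_value(tier_history, exposure_time=12)
print(segment)
```

Because the change on day 25 happened after exposure, it cannot move the user into a different segment, even if the change was itself caused by the experience the user received.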

How to read a segment result

A segment result is a metric result like any other. The point estimate is the relative effect within that segment, and the confidence interval tells you how precisely that effect has been measured.

Read each segment result the same way you would read any CI:

Is the effect in the expected direction?
Does the CI cross zero? If it does, the effect for this segment is not statistically significant.
How wide is the CI? Segments typically have fewer users than the full experiment, so the CI will be wider. A wide CI in a segment means you have less precision, not that the effect is different.
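The three questions above can be written down as a small helper. This is only an illustrative sketch: the function name, the `expected_sign` parameter, and the input numbers are hypothetical, and the helper works on any point estimate and interval expressed in percent.

```python
def read_segment_result(estimate, ci_low, ci_high, expected_sign=1):
    """Answer the three reading questions for one segment result.

    estimate, ci_low, ci_high: relative effect and CI bounds, in percent.
    expected_sign: +1 if you expected an increase, -1 if a decrease.
    """
    return {
        # Is the effect in the expected direction?
        "expected_direction": estimate * expected_sign > 0,
        # Does the CI exclude zero? If not, the result is not significant.
        "significant": ci_low > 0 or ci_high < 0,
        # How wide is the CI? Wider means less precision.
        "ci_width": ci_high - ci_low,
    }


# Illustrative numbers: +4.2% with a 95% CI of [-0.5%, +8.9%].
summary = read_segment_result(4.2, -0.5, 8.9)
print(summary)
```

For these numbers the effect points in the expected direction, but the interval includes zero, so the result is not statistically significant.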

What it means when segments look different

If one segment shows a positive effect and another shows no effect (or a negative effect), this is called a heterogeneous treatment effect: the treatment variant appears to work differently for different types of users.

Before drawing conclusions from a pattern like this, consider two things.

First, you are running one test per segment. With many segments, some will look significant by chance. A result that stands out across two or three segments is more credible than one that barely clears the threshold in a single segment.
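To see how quickly chance findings accumulate, here is a short sketch of the probability that at least one segment looks significant when the treatment variant has no effect in any of them. It assumes the segment tests are independent and each uses the same significance level; the function name and the default of 0.05 are illustrative choices.

```python
def chance_of_false_positive(num_segments, alpha=0.05):
    """Probability of at least one false positive across independent tests.

    Assumes no true effect in any segment and the same alpha for every test.
    """
    return 1 - (1 - alpha) ** num_segments


for k in (1, 5, 10, 20):
    print(f"{k} segments: {chance_of_false_positive(k):.1%}")
```

Under these assumptions, splitting into 10 segments already gives roughly a 40% chance that at least one of them clears the threshold by chance alone.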

Second, sample sizes per segment are smaller than for the full experiment, so each segment estimate is noisier. The difference you see between segments might be noise rather than a genuine interaction. Heavily overlapping confidence intervals between segments are a sign that the apparent difference may not be reliable.
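A rough rule of thumb helps quantify the loss of precision. Because the standard error shrinks with the square root of the sample size, a segment that holds a given share of the users has a CI that is wider by about one over the square root of that share. The sketch below assumes the metric's variance is the same in the segment as in the full population, which will not hold exactly in practice; the function name is hypothetical.

```python
import math


def ci_width_inflation(segment_share):
    """Approximate factor by which a segment's CI is wider than the overall CI.

    segment_share: fraction of all users in the segment (0 < share <= 1).
    Assumes equal metric variance in the segment and the full population.
    """
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return 1 / math.sqrt(segment_share)


for share in (0.5, 0.25, 0.1):
    print(f"{share:.0%} of users: CI about {ci_width_inflation(share):.1f}x wider")
```

A segment with a quarter of the users has an interval about twice as wide, so an effect that is clearly visible overall can easily be lost in the noise within the segment.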

The right response to an interesting segment pattern is not to ship only to that segment based on the exploratory result. It is to run a new experiment with that segment as the target population and the segmented metric as the pre-registered success metric.

Example

An experiment shows no significant improvement overall. When split by platform, iOS shows +4.2% (95% CI: [−0.5%, +8.9%]) and Android shows −1.1% (95% CI: [−6.0%, +3.8%]). Both CIs cross zero. Both results are non-significant. The overlap between the two CIs is large.

The pattern is suggestive, but neither result is significant and the CIs heavily overlap. This is not strong evidence of a genuine platform difference. It is worth noting as a hypothesis to test in a follow-up experiment, but not a reason to ship selectively to iOS users.
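You can check the platform difference more directly by estimating a confidence interval for the difference between the two segment effects. The sketch below recovers each standard error from the reported 95% CI and combines them. It assumes symmetric, normal-approximation intervals and independent segments (no user appears in both), and it is an illustration rather than the method Confidence uses; the function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def se_from_ci(ci_low, ci_high, z=Z_95):
    """Recover the standard error from a symmetric normal-approximation CI."""
    return (ci_high - ci_low) / (2 * z)


def difference_between_segments(est_a, ci_a, est_b, ci_b, z=Z_95):
    """Estimate the difference between two independent segment effects.

    Returns (difference, (ci_low, ci_high)) in percentage points.
    """
    # Standard errors of independent estimates add in quadrature.
    se_diff = math.hypot(se_from_ci(*ci_a, z), se_from_ci(*ci_b, z))
    diff = est_a - est_b
    return diff, (diff - z * se_diff, diff + z * se_diff)


# iOS: +4.2% [-0.5%, +8.9%]; Android: -1.1% [-6.0%, +3.8%]
diff, (low, high) = difference_between_segments(
    4.2, (-0.5, 8.9), -1.1, (-6.0, 3.8)
)
print(f"difference: {diff:.1f} pp, 95% CI: [{low:.1f}, {high:.1f}]")
```

Under these assumptions the iOS effect is about 5.3 percentage points higher than the Android effect, but the interval for that difference runs from roughly -1.5 to +12.1 and includes zero, which is consistent with treating the platform pattern as a hypothesis only.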

Signs of a credible segment result

A segment result is more credible when:

It was pre-specified before the experiment ran (you predicted this segment would respond differently).
The effect size is large and the CI does not cross zero.
It replicates the direction of the overall result, just amplified (rather than going in the opposite direction).
It makes intuitive sense given what you know about the product and the treatment variant.

A segment finding that is surprising, post-hoc, and narrowly significant should be treated as a hypothesis, not a conclusion.

Reader exercise

You split experiment results by country and notice that one country shows a significant positive effect while all others do not. What is the most appropriate response?

Ship the treatment variant to that country only, since you have a significant result

Treat it as a hypothesis and run a new experiment targeting that country with this metric pre-registered

Declare the overall experiment a success because at least one segment was positive

Ignore the result because segment results are never valid

Reader exercise

A segment result shows a wide confidence interval that crosses zero. What does this tell you?

The treatment variant had no effect on this segment

The segment result is not statistically significant and the effect is measured with low precision, likely because the segment has fewer users

The segment should be excluded from the analysis

The experiment has an SRM for this segment
