Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 10: Segment-level analysis

Summary

When you discover that a metric behaves differently across user groups, it's tempting to solve this by creating filtered metrics for each cohort. Don't. Filtered metrics are statistically inefficient and obscure the aggregate story. Segment analysis gives you the diagnostic power you actually need—without the cost.

The temptation: a metric for every cohort

Imagine you run an experiment and find that new users respond very differently from power users. The aggregate result looks muddled. The obvious instinct is to clean this up at the metric level: define "streams for new users" and "streams for power users" as separate primary metrics so each can be evaluated cleanly.

This feels rigorous. It is actually a mistake.

When you filter a metric to a subpopulation, you reduce your sample size. A smaller sample means higher variance in your estimate. Higher variance means lower statistical power—you need a larger true effect to reliably detect it, and your confidence intervals get wider. You are not gaining precision; you are trading away sensitivity.
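To make the loss of sensitivity concrete, here is a minimal sketch that compares the approximate power of the same test on a full population and on a filtered subset. The effect size, standard deviation, and sample sizes are made-up illustration values, and the calculation assumes a two-sided z-test on a difference in means with equal variance in both arms.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means."""
    z = NormalDist()
    se = sigma * sqrt(2.0 / n_per_arm)   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)    # critical value for the chosen alpha
    shift = effect / se                  # true effect in standard-error units
    # Probability that the test statistic lands beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Illustrative numbers only: same true effect, same per-user spread.
full = power_two_sample(effect=0.1, sigma=2.0, n_per_arm=10_000)
filtered = power_two_sample(effect=0.1, sigma=2.0, n_per_arm=1_500)
print(f"full population: {full:.2f}, filtered subset: {filtered:.2f}")
```

The only thing that changes between the two calls is the sample size, which is exactly what filtering a metric does.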

Worse, you lose the aggregate story entirely. A change that helps new users and hurts power users will look like a success if you are only watching the new user metric. The trade-off is invisible.

Example

A team notices that a new onboarding flow lifts 7-day streams for new users. They're tempted to make "new user streams" the primary metric for future onboarding experiments. But new users are 15% of the experiment population. Filtering to that 15% means their experiments are now underpowered for the same effect sizes: with similar per-user variance, an effect has to be roughly 2.6× as large to be detected as reliably. They also can no longer see whether changes that help new users are hurting everyone else.

What to do instead: aggregate metric, exploratory segments

The right structure separates two distinct jobs:

The primary metric captures the aggregate effect across all users. It has the full sample size behind it, which maximizes statistical power and captures trade-offs that would cancel or hide in a filtered view.

Segment analysis is how you understand that effect. After you have the aggregate result, you break it down by pre-specified user groups to diagnose what is driving it, where the effect is concentrated, and whether any group is being meaningfully harmed.

This is not a demotion for segment-level thinking—it is a clarification of its role. Segments are diagnostic tools, not primary outcomes. They answer why and for whom, after the primary metric answers whether.

Pre-specify your segments

The risk in exploratory segment analysis is obvious: if you scroll through 20 cuts after seeing the data and report the interesting ones, you will always find something. At a 5% significance threshold across 20 segments, you should expect roughly one spurious significant result even when the treatment has no real effect on anyone.
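The "roughly one spurious result" figure can be checked with a small simulation. This is a sketch under simplifying assumptions: the treatment has no effect anywhere, the 20 segments are treated as independent, and each null p-value is drawn as a uniform random number. The function name and seed are illustrative.

```python
import random


def spurious_hits(num_segments=20, alpha=0.05, rng=random):
    """Count segments that look significant when no real effect exists.

    Under the null hypothesis a p-value is uniform on [0, 1], so each
    segment is a false positive with probability alpha.
    """
    return sum(rng.random() < alpha for _ in range(num_segments))


rng = random.Random(7)  # fixed seed so the run is repeatable
runs = [spurious_hits(rng=rng) for _ in range(10_000)]
mean_hits = sum(runs) / len(runs)
print(f"average spurious segments per experiment: {mean_hits:.2f}")
```

Real segments overlap and are correlated, so the exact count will differ, but the expected number of false positives per experiment stays at the number of cuts times the threshold.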

The fix is to decide which segments you will examine before you look at results. Pre-specified comparisons carry full statistical weight. Post-hoc ones are leads to follow up, not conclusions to act on.

The most practical way to do this is a standard segment set—a fixed list your team runs after every experiment. Good candidates are segments where, if you found a strong effect, it would change what you do:

New versus returning users—onboarding dynamics often differ sharply
Free versus premium—what works for one tier may not work for the other
Mobile versus desktop—platform differences can reflect implementation quality as much as concept quality
Engagement tier—power users and casual users can respond very differently

Example

A streaming platform runs every experiment with a standard breakdown: new users (under 30 days), returning users, free tier, premium tier. A new recommendation algorithm shows a neutral aggregate result. The standard breakdown reveals +8% streams for new users and −3% for premium users—effects that cancel in aggregate. Because these segments were pre-specified, the finding carries full statistical weight and the team knows they have a real trade-off to resolve before shipping.
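A standard breakdown like this one could be sketched as a fixed mapping from segment names to membership rules, applied the same way after every experiment. Everything here is hypothetical: the field names (`tenure_days`, `tier`, `streams`, `group`), the toy records, and the helper functions. It reports point estimates only, so a real report would add confidence intervals.

```python
from statistics import mean

# Hypothetical standard segment set, fixed before any results are seen.
STANDARD_SEGMENTS = {
    "new users": lambda u: u["tenure_days"] < 30,
    "returning users": lambda u: u["tenure_days"] >= 30,
    "free tier": lambda u: u["tier"] == "free",
    "premium tier": lambda u: u["tier"] == "premium",
}


def relative_lift(users):
    """Relative difference in mean streams, treatment versus control."""
    control = [u["streams"] for u in users if u["group"] == "control"]
    treatment = [u["streams"] for u in users if u["group"] == "treatment"]
    if not control or not treatment or mean(control) == 0:
        return None  # lift is undefined for this slice
    return mean(treatment) / mean(control) - 1.0


def segment_report(users):
    """Aggregate lift first, then the same metric for each fixed segment."""
    report = {"all users": relative_lift(users)}
    for name, belongs in STANDARD_SEGMENTS.items():
        report[name] = relative_lift([u for u in users if belongs(u)])
    return report


# Toy records in which a gain for new users offsets a loss for premium users.
users = [
    {"group": "control", "tenure_days": 5, "tier": "free", "streams": 10},
    {"group": "treatment", "tenure_days": 5, "tier": "free", "streams": 12},
    {"group": "control", "tenure_days": 400, "tier": "premium", "streams": 40},
    {"group": "treatment", "tenure_days": 400, "tier": "premium", "streams": 38},
]
report = segment_report(users)
for name, lift in report.items():
    print(f"{name}: {lift:+.1%}")
```

Keeping the segment list in one shared constant is the design point: nobody chooses cuts after seeing the data, because the cuts are the same for every experiment.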

The exception: guardrail metrics

There is one legitimate use of cohort-specific metrics: guardrails. A guardrail metric is not a success criterion—it is a constraint. "Do not hurt new users" is a reasonable guardrail even if new user streams are not your primary metric.

The distinction matters:

Primary metric (filtered): bad. Underpowered, loses aggregate signal.
Guardrail metric (filtered): fine. You are not trying to optimize it, just checking that you haven't broken something for a specific group.
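One way to express "do not hurt new users" as a constraint rather than a goal is a one-sided check against a tolerance. The sketch below is an assumption about how such a check might look, not a description of any particular platform: it uses a normal approximation, and `margin` is a hypothetical tolerance in the metric's own units that the team would have to choose.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def guardrail_holds(control, treatment, margin, alpha=0.05):
    """Return True if treatment is not worse than control by more than margin.

    One-sided check: the lower confidence bound of (treatment - control)
    must sit above -margin. Normal approximation, so it needs large samples.
    """
    diff = mean(treatment) - mean(control)
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    lower = diff - NormalDist().inv_cdf(1 - alpha) * se
    return lower > -margin


# Illustrative data: streams per new user in each arm.
control = [10, 12, 11, 9] * 100
unchanged = list(control)
degraded = [x - 1 for x in control]
print(guardrail_holds(control, unchanged, margin=0.5))
print(guardrail_holds(control, degraded, margin=0.5))
```

The check asks only whether harm beyond the tolerance can be ruled out, which is why the filtered sample is acceptable here: nothing is being optimized.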

Note

Post-hoc segment findings—ones that emerged from exploring the data rather than pre-specified analysis—should be treated as hypotheses to confirm in a follow-up experiment. The most useful thing a surprising post-hoc finding can do is point you toward the right question for your next experiment, not justify a ship decision on its own.

Reader exercise

Why is filtering your primary metric to a subpopulation (e.g., 'streams for new users only') a problem?

It makes the metric harder to explain to stakeholders

It reduces sample size, increasing variance and lowering statistical power

It violates the assumption of random assignment

Reader exercise

You run an experiment and notice post-hoc that users in one city show a surprisingly large positive effect. What should you do?

Treat the finding as a hypothesis and run a follow-up experiment to confirm it before acting on it

Exclude that city from future analyses to avoid bias

Ship the change and attribute the success to that city

Reader exercise

What is the appropriate role for cohort-specific metrics (e.g., 'new user streams') in an experiment?

They should replace the aggregate metric when effects are heterogeneous

They should only be used when the aggregate metric is not statistically significant

They can serve as guardrail metrics but should not be primary success metrics

Notes for nerds

The statistical cost of filtering is a direct consequence of the variance of a sample mean. If your full experiment population has N users, the variance of your effect estimate scales as 1/N. If you filter to a subpopulation of size n < N with similar per-user variance, the variance scales as 1/n, which is larger by a factor of N/n. The standard error is the square root of the variance, so it grows by a factor of √(N/n). To recover the same power you had with the full population, you need a true effect size that is √(N/n) times as large.

For a segment that is 20% of your population, N/n = 5, so you need an effect roughly 2.2× as large to detect it with the same reliability as your aggregate metric. Most real product effects are not that much stronger in subpopulations than in aggregate, so you are often just running an underpowered test while thinking you've done something more targeted.
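The same arithmetic for a few segment sizes, as a small sketch. It assumes the segment has the same per-user variance as the full population, and the function name is illustrative.

```python
from math import sqrt


def effect_inflation(segment_share):
    """Factor by which the detectable effect grows when filtering to n/N."""
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return sqrt(1.0 / segment_share)


for share in (1.0, 0.5, 0.2, 0.15):
    print(f"{share:.0%} of users -> {effect_inflation(share):.2f}x")
```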

The multiple comparisons problem compounds this: the familywise error rate for k independent tests at significance level α is 1 − (1 − α)^k. At α = 0.05 and k = 20 segments, that is roughly 0.64—a 64% chance of at least one false positive even when nothing is real.
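The formula is easy to evaluate for different numbers of segments. This sketch assumes independent tests, as the formula does. Real segments usually overlap, so treat the output as a rough guide.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 4, 10, 20):
    print(f"{k:>2} segments -> {familywise_error_rate(k):.2f}")
```

Even a short list of four cuts pushes the rate well above the nominal 5%, which is a reason to keep the standard segment set small.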
