Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 7: Feasibility and sensitivity

Summary

Learn how to evaluate metrics for experiments by assessing feasibility, understanding that effective variance (after variance reduction) matters more than raw variance, recognizing that binary metrics aren't inherently more sensitive than continuous metrics, and ensuring your product change can actually influence the metric you're measuring.

Feasibility: Can we measure it?

Feasibility means you have the data, infrastructure, and resources to compute the metric reliably. Before committing to a metric, verify: data availability (events are logged at the right granularity with sufficient history), technical capability (you can compute it efficiently within required time frames), and sample size requirements (you can collect enough data at acceptable cost).

For example, "streams per user from playlist recommendations" is feasible with logged events and user-level data. In contrast, "user satisfaction with recommendation quality" requires expensive survey data that may not exist.

Sensitivity: Can we detect changes?

A metric is sensitive if it reliably detects meaningful changes. Sensitivity depends on two independent factors: variance (how much noise exists) and influenceability (whether your change can actually move the metric). Both must be favorable.

Variance: Raw versus effective

Higher variance requires more users or longer runtime to detect a given effect size. However, what matters is not raw variance but effective variance after applying variance reduction techniques. The next lesson covers variance reduction in detail—for now, understand that the right covariate can dramatically change which metric is actually the most sensitive choice.
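The link between variance and sample size can be made concrete with the standard two-sample approximation. This is a textbook rule of thumb, not a formula taken from the lesson, and the symbols are introduced here for illustration: \sigma^2 is the per-user variance of the metric, \delta is the absolute effect you want to detect, \alpha is the significance level, and 1 - \beta is the power.

```latex
n_{\text{per group}} \approx \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}}
```

Under this approximation the required sample size is proportional to the variance, so removing a given share of the variance removes the same share of the users needed.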

This distinction is critical. Variance reduction techniques such as CUPED and CUPAC use regression adjustment to control for pre-experiment user behavior. They rely on temporal correlation: how closely a user's metric value during the experiment tracks the same user's value before it. A metric with high raw variance but strong temporal correlation can end up with lower effective variance than a metric with moderate raw variance but weak correlation.
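A minimal sketch of the idea on synthetic data, assuming a single pre-experiment covariate and a CUPED-style linear adjustment. The data-generating numbers and the helper names (cuped_adjust and friends) are made up for illustration; this is not Confidence's implementation. It shows that the share of variance removed equals the squared correlation between the pre-experiment and in-experiment values.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def cuped_adjust(y, x):
    """Subtract the part of y that is linearly explained by the covariate x."""
    theta = covariance(x, y) / variance(x)
    x_bar = mean(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

random.seed(7)
n = 20_000
# Hypothetical users: a stable habit level drives both periods.
habit = [random.gauss(50, 20) for _ in range(n)]
pre = [h + random.gauss(0, 7) for h in habit]   # value before the experiment
post = [h + random.gauss(0, 7) for h in habit]  # value during the experiment

adjusted = cuped_adjust(post, pre)
rho = covariance(pre, post) / (variance(pre) * variance(post)) ** 0.5
reduction = 1 - variance(adjusted) / variance(post)

print(f"correlation rho    = {rho:.3f}")
print(f"rho squared        = {rho ** 2:.3f}")
print(f"variance reduction = {reduction:.3f}")
```

The adjustment leaves the mean of the metric unchanged and only removes noise, which is why a metric whose users behave consistently over time benefits so much.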

Example

Consider two metrics for measuring engagement:

Metric A - Total streams per user:

Raw variance: Very high (some users stream 1 song, others stream 1,000+)
Temporal correlation: Very high (heavy streamers stay heavy streamers)
Variance reduction: 80% of raw variance removed
Effective variance: Low

Metric B - Did user create a playlist?:

Raw variance: Moderate (binary metric)
Temporal correlation: Low (playlist creation is sporadic)
Variance reduction: 20% of raw variance removed
Effective variance: Moderate to high

After applying variance reduction, Metric A might actually have lower effective variance and detect changes faster, despite its higher raw variance.
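To see how the ranking can flip, here is a rough sketch that turns the two metrics into required sample sizes using the standard two-sample approximation. Every number is hypothetical: the means, the standard deviation, the 10% playlist rate, and the assumption that the change moves both metrics by the same 2% relative amount are chosen only to illustrate the mechanism. The 80% and 20% reductions come from the example above.

```python
from statistics import NormalDist

def users_per_group(variance, abs_effect, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided, two-sample z-test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * z ** 2 * variance / abs_effect ** 2

# Hypothetical inputs, not measurements.
metrics = {
    "A: total streams per user": {"mean": 50.0, "raw_var": 250.0 ** 2, "reduction": 0.80},
    "B: did user create a playlist": {"mean": 0.10, "raw_var": 0.10 * 0.90, "reduction": 0.20},
}
relative_effect = 0.02  # assumed identical for both metrics

results = {}
for name, m in metrics.items():
    abs_effect = relative_effect * m["mean"]
    raw_n = users_per_group(m["raw_var"], abs_effect)
    effective_n = users_per_group(m["raw_var"] * (1 - m["reduction"]), abs_effect)
    results[name] = (raw_n, effective_n)
    print(f"{name}: raw {raw_n:,.0f} users per group, after reduction {effective_n:,.0f}")
```

With these inputs Metric B needs fewer users when judged on raw variance, while Metric A needs fewer once variance reduction is applied. Different inputs can give a different ranking, which is the reason to compare effective rather than raw variance.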

Binary versus continuous metrics

Converting continuous metrics to binary (changing "total streams" to "did user stream more than 10 times?") has critical trade-offs:

Binary metrics can have lower variance when users cluster far from the threshold, but they lose information about magnitude. Say you convert "streams per user" into a binary metric: "did the user stream more than 10 times?" A user who goes from 15 to 50 streams produces the same binary outcome (1→1) as a user who does not change at all, because both were already above the threshold. You only detect crossings of that threshold, which makes binary metrics less responsive to the changes tested in the experiment, and substantial behavior improvements may not register. Additionally, the variance of a binary metric is highest when the proportion is near 50%, so a poorly chosen threshold can increase variance.
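Both effects are easy to check by hand. The sketch below uses the threshold of 10 from the text; the three users and their stream counts are invented for illustration.

```python
def to_binary(streams, threshold=10):
    """1 if the user streamed more than `threshold` times, else 0."""
    return 1 if streams > threshold else 0

# Hypothetical users as (streams before, streams after) the change.
users = {
    "already above threshold": (15, 50),
    "crosses the threshold": (8, 12),
    "grows but stays below": (2, 9),
}
for label, (before, after) in users.items():
    continuous_change = after - before
    binary_change = to_binary(after) - to_binary(before)
    print(f"{label}: continuous change {continuous_change:+d}, binary change {binary_change:+d}")

def bernoulli_variance(p):
    """Variance of a binary metric whose share of 1s is p."""
    return p * (1 - p)

for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"share above threshold {p:.2f}: variance {bernoulli_variance(p):.4f}")
```

Only the user who crosses the threshold registers in the binary metric, even though the other two users changed by more streams. The variance loop shows why a threshold that splits users close to half and half is the noisiest choice.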

Note

Binary metrics aren't inherently more sensitive. They can reduce variance, but they also reduce influenceability by ignoring every change that does not cross the threshold. Consider whether the information loss justifies the potential variance reduction, and whether regression adjustment techniques would make the continuous metric more sensitive anyway.

The techniques used to reduce effective variance—regression adjustment (CUPED, CUPAC), capping, and related methods—are covered in depth in Lesson 8: Variance reduction.

Influenceability: Can your change move the metric?

Even with perfect variance properties, a metric is useless if your product change cannot move it. Influenceability measures whether your experiment can actually affect the metric—this is completely independent of variance.

Influenceability depends on three factors: Exposure (what proportion of users encounter the change?), Mechanism (does the metric measure behavior the change can affect?), and Effect size (is the change substantial enough to impact the metric?).

Example

Testing a playlist creation flow redesign:

Low influenceability: Using "monthly active users" as your primary metric. Most users never create playlists, so the change can't influence them. Even a perfect redesign might show no effect.

High influenceability: Using "playlist creation rate among users who start the creation flow." All measured users are exposed, the metric directly captures what you're improving, and UI changes can meaningfully affect completion.

Scoping your experiment population and choosing metrics aligned with your intervention are crucial—perfect variance properties mean nothing if your change can't move the metric.
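The cost of low exposure can be sketched with simple arithmetic. If only a share of the measured users can be affected, the average effect shrinks by that share, and under the standard two-sample approximation the required sample size grows with the inverse square of the effect. The 5% exposure, the effect among exposed users, and the fixed variance below are all hypothetical simplifications; in practice the variance also changes with the population you measure.

```python
from statistics import NormalDist

def users_per_group(variance, abs_effect, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided, two-sample z-test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * z ** 2 * variance / abs_effect ** 2

def diluted_effect(effect_among_exposed, exposure):
    """Average effect over everyone measured when only a share is exposed."""
    return effect_among_exposed * exposure

variance = 0.25              # held fixed purely for illustration
effect_among_exposed = 0.02  # hypothetical lift among users who see the change

scenarios = {
    "only users who start the creation flow": 1.00,
    "all monthly active users": 0.05,
}
required = {}
for label, exposure in scenarios.items():
    effect = diluted_effect(effect_among_exposed, exposure)
    required[label] = users_per_group(variance, effect)
    print(f"{label}: average effect {effect:.4f}, users per group {required[label]:,.0f}")
```

In this sketch, measuring everyone when 5% are exposed multiplies the required users per group by 400, which is 1 divided by 0.05 squared.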

Trade-offs in practice

You've now seen that good metrics require feasibility, low effective variance, high influenceability, interpretability, and business alignment. No single metric optimizes all dimensions—you must make deliberate trade-offs:

Sensitivity versus business alignment: Metrics directly tied to business value (revenue, long-term retention) often have high variance or long measurement windows. Use sensitive proxies (engagement, short-term behavior) as primary metrics while monitoring business metrics secondarily. Ask: do improvements in my sensitive metric translate to business outcomes?

Granularity versus variance: Granular metrics (total revenue per user) capture effect magnitude but have high variance. Before simplifying through capping or binary conversion, check whether regression adjustment can give you both granularity and sensitivity.

In Confidence

Confidence automatically applies variance reduction using regression adjustment. You can see how much variance was reduced for each metric on the detailed results page.

Scope versus influenceability: Broad metrics (platform retention) align with business goals but resist movement from single experiments. Narrow metrics (feature engagement) are movable by experiments but miss broader impacts. Solution: use narrow, movable primary metrics for decisions while monitoring broad metrics for unintended effects.

Example

Testing a new recommendation algorithm:

Success metric: "Streams from recommendations per user"—directly influenced by your change, moderate variance with variance reduction, measures what you're improving.

Guardrail metrics: Total streams (broader impact), premium conversion (business outcome), recommendations shown (diagnostic).

This lets you decide confidently while understanding broader implications.

Reader exercise

Why might a metric with high raw variance be better for experiments than a metric with low raw variance?

If the high-variance metric has strong temporal correlation, regression adjustment techniques can reduce its effective variance below the low-variance metric

High variance metrics are always more sensitive

Raw variance does not matter for experiment design

Reader exercise

What is a key disadvantage of converting a continuous metric to binary?

Binary metrics are harder to compute

Binary metrics always have higher variance

Binary metrics lose information about magnitude and only detect changes when users cross the threshold

Reader exercise

What makes a metric highly influenceable for a specific experiment?

Low variance and strong temporal correlation

High proportion of users exposed to the change, direct causal mechanism, and sufficient effect size

Simple calculation and easy interpretation

Notes for nerds

Ratio metrics and variance reduction. Ratio metrics require special care not just for variance estimation, but also for variance reduction. When applying regression adjustment to a ratio metric, you can't treat the ratio as a simple scalar: the numerator and denominator each carry their own variance, and the covariance between them matters. This is covered in depth in Lesson 8: Variance reduction.
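For the variance estimation part, one common approach is the delta method, sketched below for a ratio of two per-user means. This is a general statistical technique and not necessarily what Confidence does; the function names and the click and view counts are invented for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)

def sample_cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def ratio_variance(numerators, denominators, include_covariance=True):
    """Delta-method variance of mean(numerators) / mean(denominators)."""
    n = len(numerators)
    mu_d = mean(denominators)
    ratio = mean(numerators) / mu_d
    var_n = sample_cov(numerators, numerators)
    var_d = sample_cov(denominators, denominators)
    cov_nd = sample_cov(numerators, denominators) if include_covariance else 0.0
    return (var_n - 2 * ratio * cov_nd + ratio ** 2 * var_d) / (n * mu_d ** 2)

# Hypothetical per-user data: clicks (numerator) and views (denominator).
clicks = [1, 3, 2, 8, 5]
views = [10, 20, 15, 60, 30]

with_cov = ratio_variance(clicks, views)
without_cov = ratio_variance(clicks, views, include_covariance=False)
print(f"ratio                       = {mean(clicks) / mean(views):.4f}")
print(f"variance with covariance    = {with_cov:.6f}")
print(f"variance ignoring covariance = {without_cov:.6f}")
```

Because users with more views also tend to have more clicks, the covariance term is positive here and ignoring it overstates the variance of the ratio.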

Heterogeneous treatment effects. Influenceability describes the average effect of your change across the measured population, but that average can mask enormous variation. A metric might be highly influenceable among power users who interact frequently with the affected surface, while being essentially unmovable among casual users who rarely encounter it. Conversely, a change can produce a neutral average because it helps some segments while harming others, with the two effects cancelling out. Whether this matters depends on your product strategy. If your goal is aggregate improvement, the average effect is what counts. If you want to understand who benefits and why, average influenceability is not enough: segment-level analysis is the right tool, and it is covered in Lesson 10: Segment-level analysis.
