Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 2: Metric roles

Summary

In this lesson, you learn about the different roles metrics play in experiments: success metrics that define what winning looks like, guardrail metrics that protect against unintended harm, diagnostic metrics that validate your experiment ran correctly, and exploratory metrics that help you understand why results occurred. You learn how to construct a balanced metric suite that supports confident decision-making.

Why metric roles matter

Not all metrics serve the same purpose in an experiment. Some tell you whether to ship your change. Others help you understand what happened and why. Still others act as safety checks, ensuring you haven't broken something critical along the way.

The key to running effective experiments is understanding these different roles and selecting the right mix of metrics before you start. A metric that's perfect for understanding user behavior might be terrible for making a shipping decision. A metric that's essential as a guardrail might distract you if treated as a success measure.

When you define clear roles for each metric upfront, you create a framework for making decisions based on evidence rather than intuition or confirmation bias.

Success metrics

Success metrics define what winning looks like. These are the metrics you use to decide whether to ship your change. Everything else in your experiment supports this core question: did we succeed?

A good success metric directly measures the behavior your hypothesis predicts will improve. If you're testing a faster checkout flow because you believe it will increase purchase completion, your success metric should measure purchase completion. If you believe a new feature will increase user engagement, your success metric should capture that specific type of engagement.

Success metrics need to be sensitive enough to detect changes that matter but stable enough to trust. They should be clearly interpretable—when the metric moves, you should know whether that's good or bad without needing complex analysis. And they should be few in number, typically one to three. More than that and you're not being clear about what success actually means.

Example

An e-commerce company tests a simplified checkout flow:

Hypothesis: Reducing checkout from five steps to three will increase purchase completion by making the process less frustrating and faster to complete.

Success metric: Purchase completion rate per session in the first 7 days

This metric directly measures the intended outcome. When it moves, you know whether your simplification worked. It's specific enough to be meaningful but broad enough to capture the full effect on purchasing behavior.

Recommendation

Select all your metrics during experiment planning, before you collect any data. Pre-defining your success criteria and guardrails helps you avoid confirmation bias and ensures you're measuring what you set out to learn, not just what happened to move.

Guardrail metrics

Guardrail metrics protect against unintended harm. These are metrics you don't expect to improve—you just need to ensure they don't get significantly worse.

Think of guardrails as the price of shipping. You might improve checkout completion, but if you also increase refund rates or hurt customer satisfaction, the change isn't worth shipping. You might boost engagement, but if you also increase server costs beyond acceptable limits or degrade app performance, you've created new problems while solving old ones.

Good guardrails measure outcomes that matter strategically but aren't the focus of your current change. Revenue, retention, and key quality metrics are common guardrails because they're critical to the business even when they're not what you're trying to move. Choose guardrails carefully—too many inflate sample size requirements and delay results, but too few leave you vulnerable to shipping changes that help one metric while hurting another.

Because you only care about whether a guardrail degrades, not whether it improves, guardrail metrics are tested differently from success metrics. The Advancing Experimentation course covers inferiority and non-inferiority tests for guardrail metrics in detail.
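As a rough illustration of what a non-inferiority test for a guardrail can look like, the sketch below computes a one-sided p-value under a normal approximation. It is not necessarily how Confidence implements these tests. The function name, the margin, and all input numbers are hypothetical, and the sketch assumes that higher metric values are better.

```python
from statistics import NormalDist


def non_inferiority_p_value(diff: float, std_error: float, margin: float) -> float:
    """One-sided p-value for H0: treatment is worse than control by at least `margin`.

    Assumes higher values are better and diff = treatment mean - control mean.
    """
    # Shift the observed difference by the margin before standardizing.
    z = (diff + margin) / std_error
    return 1 - NormalDist().cdf(z)


# Hypothetical numbers for revenue per completed order (not from the lesson).
p_small_drop = non_inferiority_p_value(diff=-0.10, std_error=0.20, margin=0.50)
p_large_drop = non_inferiority_p_value(diff=-0.40, std_error=0.20, margin=0.50)

print(f"small drop: p = {p_small_drop:.4f}")
print(f"large drop: p = {p_large_drop:.4f}")
```

A small p-value rejects the claim that the guardrail got worse by at least the margin, so the guardrail passes. With these made-up inputs, the first case falls below a 0.05 threshold and the second does not.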

Example

For the checkout simplification experiment, guardrails might include:

Revenue per completed order: You're trying to increase completion, not reduce order values. Make sure removing steps didn't make high-value purchases harder.

Refund rate: Ensure faster checkout didn't lead to more impulse purchases that customers later regret.

Customer support contacts: Check that removing options didn't create confusion requiring support help.

Site performance metrics: Verify that code changes didn't slow down the page or introduce errors.

You don't expect checkout changes to improve these metrics, but you need confidence they didn't get worse before shipping.

Diagnostic metrics

Diagnostic metrics answer a simpler question: did the experiment run correctly? Before you interpret any results, you need to know whether the treatment was actually applied, whether users were randomized properly, and whether the implementation worked as intended.

Most platforms track many diagnostic metrics automatically. Sample ratio checks ensure you got the expected number of users in each variant. Exposure metrics confirm users actually saw your treatment. Implementation-specific metrics validate technical details like whether code deployed correctly or whether UI elements rendered.

You should always check diagnostics before diving into results. An imbalanced sample ratio can indicate a randomization bug or biased exposure logging. The most common cause of exposure bias is logging events conditionally on something affected by the treatment, which skews who appears in each variant. Low exposure means fewer users saw your change, reducing your sample size and statistical power—though the effect estimate for exposed users remains accurate. Technical errors might have affected some users but not others, creating misleading patterns in your data.
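A sample ratio check can be sketched as a chi-square goodness-of-fit test on the number of users per variant. The version below is a minimal illustration for two variants, not the check any particular platform runs. The function name, the user counts, and the 0.001 alert threshold mentioned afterwards are assumptions.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-variant split."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical user counts for an intended 50/50 split.
balanced = srm_p_value(50_000, 50_200)
skewed = srm_p_value(50_000, 51_500)

print(f"balanced split: p = {balanced:.3f}")
print(f"skewed split:   p = {skewed:.2e}")
```

A very small p-value (many teams alert somewhere around 0.001) suggests the observed split is unlikely under the intended allocation and is worth investigating before you read any results.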

Recommendation

Review diagnostic checks before interpreting your success metrics to ensure your experiment ran as intended. If diagnostics show issues, investigate and fix them before drawing conclusions from your results.

In Confidence

In Confidence, diagnostic checks run automatically for every experiment. The randomization is always sound, so a sample ratio imbalance points to biased exposure logging rather than a randomization issue.

Exploratory metrics

Exploratory metrics help you understand the story behind your results. Of course you will break results down by segments and look for patterns—and of course that will inform how you make product decisions. But it's critical to understand the pitfalls: post-hoc exploration inflates false positive rates, and great product organizations use exploratory findings primarily to inform future experiments and product iterations, not as a direct basis for shipping decisions. This is covered in detail in Lesson 10: Segment-level analysis.

Unlike success metrics, exploratory metrics don't need to be defined before the experiment starts. You can add them during analysis as questions arise. Saw an unexpected drop in your success metric? Add exploratory metrics to investigate possible causes. Saw a surprising win? Add metrics to understand which user segments benefited most or what behaviors drove the improvement.

This flexibility is powerful because you can't anticipate every question you'll want to ask. The data itself often reveals patterns you didn't predict. Exploratory metrics let you dig deeper into those patterns without the rigidity of pre-defined hypotheses.

Note

Exploratory metrics do not affect sample size requirements or the ship/no-ship decision, but they are not statistically free. Running many post-hoc analyses inflates the false discovery rate. Treat exploratory findings as hypotheses to confirm in a follow-up experiment, not as the basis for a ship decision.
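To get a feel for how quickly post-hoc comparisons add up, the sketch below computes the chance of at least one false positive across several tests. It makes simplifying assumptions: the tests are independent, none of them has a true effect, and each is run at the same significance level. Strictly speaking this is the family-wise error rate, a close relative of the false discovery rate mentioned above. The function name is hypothetical.

```python
def chance_of_false_positive(num_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent tests with no true effects."""
    return 1 - (1 - alpha) ** num_comparisons


for k in (1, 5, 10, 20):
    print(f"{k:>2} comparisons: {chance_of_false_positive(k):.1%}")
```

Under these assumptions, ten exploratory comparisons at a 5% level already carry roughly a 40% chance of at least one spurious "finding", which is why exploratory results are better treated as hypotheses for a follow-up experiment.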

Example

Continuing the checkout flow example, you might add exploratory metrics during analysis to understand the results. If purchase completion increased, you might explore:

Average time spent in checkout (did speed matter?)
Cart abandonment at each remaining step (where do users still drop off?)
Share of purchases using saved payment methods (did removing steps make saved methods more prominent?)

If results were neutral, you might explore:

Mobile versus desktop completion rates (did the change only help one platform?)
New versus returning customer completion (did experience level matter?)
Order values (did you lose high-value shoppers who wanted more options?)

These metrics help you understand what drove the outcome and what to test next.

In Confidence

In Confidence, exploratory metrics can be added at any point during analysis. You can also automate exploration using Actions, which can trigger metric calculations, generate reports, or run analyses based on experiment results.

Complete metric suite

A complete experiment includes all four metric roles, balanced to give you both confidence in your decision and insight into your results.

At Spotify, a typical experiment uses one to three success metrics that define winning, three to five guardrail metrics for critical business outcomes, three to five diagnostic metrics to validate implementation, and five to ten exploratory metrics to understand mechanisms and context. These ranges reflect our experience and serve as a useful ballpark—they are not prescriptive rules. The right numbers depend on the scope of your change, your available sample size, how many things could plausibly go wrong, and how much you need to understand about the mechanism behind your results. The Sample Size Calculation II course covers how the number of success metrics affects sample size requirements in detail.
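One way to see why the number of success metrics matters for sample size is the sketch below. It assumes a two-sided z-test on a difference in means and a Bonferroni correction that splits the significance level evenly across success metrics. That correction is an assumption chosen for illustration and is not necessarily the method Confidence uses. The function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def users_per_group(num_success_metrics: int, mde: float, std_dev: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per variant to detect an absolute effect `mde` on one metric."""
    # Bonferroni: each success metric is tested at alpha / number of success metrics.
    adjusted_alpha = alpha / num_success_metrics
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * std_dev / mde) ** 2)


# Hypothetical metric: standard deviation 1.0, minimum detectable effect 0.1.
for m in (1, 2, 3):
    print(f"{m} success metric(s): {users_per_group(m, mde=0.1, std_dev=1.0)} users per group")
```

Each extra success metric tightens the per-metric significance level, so the required sample grows even though the effect you want to detect has not changed.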

Example

Complete metric suite for an e-commerce checkout simplification:

Success metric:

Purchase completion rate per session in the first 7 days

Guardrail metrics:

Revenue per completed order
30-day refund rate
Customer support contact rate
Payment processing errors

Diagnostic metrics:

Treatment/control sample ratio
Share of treatment users who saw simplified flow
Checkout page load times
JavaScript error rates

Exploratory metrics:

Average checkout time
Cart abandonment by step
Mobile versus desktop completion rates
Share using saved payment methods
New versus returning customer completion
Average items per completed order
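The suite above can also be written down as data, which makes it easy to check that every role is covered before the experiment starts. The sketch below is only an illustration of that idea. The `Role`, `Metric`, `by_role`, and `missing_roles` names are hypothetical and do not correspond to any Confidence API.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SUCCESS = "success"
    GUARDRAIL = "guardrail"
    DIAGNOSTIC = "diagnostic"
    EXPLORATORY = "exploratory"


@dataclass(frozen=True)
class Metric:
    name: str
    role: Role


# The checkout simplification suite from the example above.
SUITE = [
    Metric("Purchase completion rate per session in the first 7 days", Role.SUCCESS),
    Metric("Revenue per completed order", Role.GUARDRAIL),
    Metric("30-day refund rate", Role.GUARDRAIL),
    Metric("Customer support contact rate", Role.GUARDRAIL),
    Metric("Payment processing errors", Role.GUARDRAIL),
    Metric("Treatment/control sample ratio", Role.DIAGNOSTIC),
    Metric("Share of treatment users who saw simplified flow", Role.DIAGNOSTIC),
    Metric("Checkout page load times", Role.DIAGNOSTIC),
    Metric("JavaScript error rates", Role.DIAGNOSTIC),
    Metric("Average checkout time", Role.EXPLORATORY),
    Metric("Cart abandonment by step", Role.EXPLORATORY),
    Metric("Mobile versus desktop completion rates", Role.EXPLORATORY),
    Metric("Share using saved payment methods", Role.EXPLORATORY),
    Metric("New versus returning customer completion", Role.EXPLORATORY),
    Metric("Average items per completed order", Role.EXPLORATORY),
]


def by_role(suite: list[Metric], role: Role) -> list[str]:
    """Names of the metrics assigned to one role."""
    return [m.name for m in suite if m.role is role]


def missing_roles(suite: list[Metric]) -> set[Role]:
    """Roles that have no metric assigned yet."""
    return set(Role) - {m.role for m in suite}


for role in Role:
    print(f"{role.value}: {len(by_role(SUITE, role))} metric(s)")
print("missing roles:", missing_roles(SUITE) or "none")
```

Keeping the role next to each metric name makes the pre-experiment review explicit: an empty result from `missing_roles` means all four roles are represented.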

In Confidence

In Confidence, you define metric roles during experiment setup. Success and guardrail metrics form your decision framework and drive sample size requirements. Exploratory metrics do not affect sample size calculations. Learn more in the Intro to Experimentation course.

Notes for nerds

Risk mitigation and metric types. One practical way to think about metric roles is through a risk mitigation lens: success metrics confirm you achieved your goal, guardrails protect against harm, and diagnostics confirm your experiment ran correctly. Spotify has written about this framing in the context of running experiments with smaller samples—where the choice of metric roles directly affects how much risk you're taking on at each stage. The Confidence blog post on experimenting with smaller samples digs into this, including how the risk mitigation ladder connects to metric type selection.

What distinguishes success metrics from exploratory metrics?

Why can exploratory metrics be added during analysis in Confidence?

What is the main risk of having too many guardrail metrics?

You are testing a new product recommendation algorithm. Average order value is a critical business metric that you do not expect to improve. Which role is most appropriate?
