Who is the Confidence Bootcamp for?

The bootcamp is designed for anyone who wants to improve their experimentation skills. Courses are tailored for data scientists, analysts, engineers, product managers, and leaders — whether you are running your first A/B test or scaling an experimentation program across your organization.

Is the bootcamp free?

Yes, the Confidence Bootcamp is completely free. All 11 courses, 90+ lessons, and resources are available at no cost. You can start learning immediately without creating an account, though signing in lets you track your progress across devices.

What does the bootcamp cover?

The bootcamp covers the full experimentation lifecycle: A/B testing fundamentals, hypothesis formulation, interpreting experiment results, metrics design, sample size calculation, feature flags, and building an experimentation culture. It includes 11 courses with over 90 lessons built by the Confidence team at Spotify.

How long does the bootcamp take to complete?

The full bootcamp takes approximately 20 hours to complete across all 11 courses. Individual courses range from 30 minutes to 3 hours. You can learn at your own pace and pick the courses most relevant to your role.

Do I need prior experience with A/B testing or statistics?

No prior experience is required. The bootcamp starts with foundational courses like Intro to Experimentation and progressively covers more advanced topics like sequential testing and variance reduction. Each course clearly indicates which roles it is designed for.

Who created the Confidence Bootcamp?

The Confidence Bootcamp was created by the Confidence team at Spotify, the same team that builds the experimentation and feature flagging platform used across Spotify. The content reflects real-world experimentation practices used at one of the world's largest digital products.

Lesson 9: Select metrics

Summary

Bring together everything you've learned into a practical framework for selecting metrics. You'll walk through a complete process from hypothesis to metric suite, understand when to make trade-offs, and leave with a systematic approach you can apply to any experiment.

You're ready to design metrics

You now understand what makes metrics effective. Metrics measure aggregated user behavior (lesson 1). They play distinct roles in experiments: success, guardrail, exploratory, diagnostic (lesson 2). Time windows shape when you measure and when you see results (lesson 3). Good metrics capture the right behavior without gaming or unintended consequences (lesson 4). The metric hierarchy connects tactical work to strategic outcomes (lesson 5). Clear names and documentation ensure interpretability (lesson 6). Effective variance and influenceability determine whether you can detect changes (lesson 7). Variance reduction techniques reduce effective variance (lesson 8).

The question isn't whether you understand these concepts—you do. The question is how to apply them systematically when designing an experiment.

The framework: From hypothesis to metric suite

Start with your hypothesis

Every metric choice flows from what you expect to change and why. Write your hypothesis explicitly: "[Change] will [impact behavior] because [mechanism]."

For example: "Adding one-click checkout will increase purchase completion because it reduces friction for returning customers" or "Personalized homepage recommendations will increase discovery streams because they surface relevant content users wouldn't otherwise find."

Your hypothesis identifies what behavior should change (purchase completion, discovery streams) and why (reduced friction, better relevance). This points directly to the metrics you need.
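One way to keep the three parts of the template explicit is to store them separately and render the sentence from them. The sketch below is only an illustration: the `Hypothesis` class and its field names are hypothetical, not part of any platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for "[Change] will [impact behavior] because [mechanism]"."""
    change: str
    impact: str
    mechanism: str

    def __post_init__(self) -> None:
        # A hypothesis with a missing part is not specific enough to pick metrics from.
        for part in (self.change, self.impact, self.mechanism):
            if not part.strip():
                raise ValueError("change, impact, and mechanism must all be filled in")

    def statement(self) -> str:
        # Render the three parts in the order the template prescribes.
        return f"{self.change} will {self.impact} because {self.mechanism}."


checkout = Hypothesis(
    change="Adding one-click checkout",
    impact="increase purchase completion",
    mechanism="it reduces friction for returning customers",
)
print(checkout.statement())
```

Keeping the impact and the mechanism as separate fields makes it harder to skip the "because" part, which is the part that points to your guardrails.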

Identify what to measure

Translate your hypothesis into specific, measurable behaviors. Ask: "If my hypothesis is correct, what would users do differently?"

If reduced friction increases completion, measure purchase completion rate. If better relevance increases discovery, measure streams from recommendations. These are your candidate success metrics—they directly capture the behavior your hypothesis predicts will change.

Then identify what could go wrong. Faster checkout might increase regretted purchases—measure return rate. Better recommendations might cannibalize search—measure search engagement. These become guardrail candidates—metrics you don't expect to improve but must protect.

Note

If you can't clearly articulate what behavior should change and what you need to protect, your hypothesis isn't specific enough. Refine it before selecting metrics.

Assign metric roles

Build a balanced suite by assigning clear roles to each candidate metric.

Your success metrics (1-3) define what winning looks like and guide your ship/no-ship decision. Choose metrics where changes are directly attributable to your experiment—typically middle-layer metrics (local or proxy) from the metric hierarchy. They should be movable by your change and sensitive enough to detect realistic effects.

Your guardrail metrics (3-5) protect critical outcomes. Include top-layer KPIs like revenue and retention, plus any specific risks your change introduces. These ensure you don't improve one dimension while breaking another.

Exploratory metrics (5-10) enable understanding. These help you interpret why your success metrics moved and discover unexpected patterns. Unlike success metrics, you can add these during analysis as questions arise.

Diagnostic metrics (automated by most platforms) validate implementation—sample balance, exposure rates, technical health.
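The suggested counts per role can be checked mechanically before launch. The following is a minimal sketch, assuming a suite is represented as a plain mapping from metric name to role. The function name, the role labels, and the metric names in the usage example are all hypothetical.

```python
from collections import Counter

# Suggested counts per role, taken from the ranges given in the text.
# Diagnostic metrics are left out because most platforms automate them.
ROLE_RANGES = {
    "success": (1, 3),
    "guardrail": (3, 5),
    "exploratory": (5, 10),
}


def check_suite(metric_roles: dict[str, str]) -> list[str]:
    """Return warnings for roles whose metric count falls outside the suggested range."""
    counts = Counter(metric_roles.values())
    warnings = []
    for role, (low, high) in ROLE_RANGES.items():
        n = counts.get(role, 0)
        if not low <= n <= high:
            warnings.append(f"{role}: {n} metrics, suggested {low}-{high}")
    # Flag role labels that are neither in the ranges above nor "diagnostic".
    for role in sorted(set(counts) - set(ROLE_RANGES) - {"diagnostic"}):
        warnings.append(f"unknown role: {role}")
    return warnings


# Hypothetical suite for the one-click checkout hypothesis.
suite = {
    "purchase_completion_rate": "success",
    "return_rate": "guardrail",
    "revenue_per_user": "guardrail",
    "retention_rate": "guardrail",
    "checkout_duration": "exploratory",
    "saved_payment_usage": "exploratory",
}
print(check_suite(suite))
```

The check returns warnings rather than raising errors because the ranges are guidance, and exploratory metrics in particular can still be added during analysis.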

Evaluate metric characteristics

For each success and guardrail metric, validate that it meets your needs across four dimensions.

First, check feasibility—can you compute this reliably with available data? Do you have sufficient history to estimate variance reduction potential? If not, you may need to invest in logging infrastructure before proceeding.

Second, assess effective variance and influenceability, which together determine whether you'll detect changes. You need both low effective variance (considering temporal correlation and regression adjustment, not just raw variance) and high influenceability (your change must actually move the metric). Remember that binary metrics aren't automatically more sensitive—they reduce both variance and influenceability. A continuous metric with strong temporal correlation often outperforms a binary metric after applying variance reduction.

Third, ensure interpretability—does the name clearly communicate what's being measured (what behavior, for whom, over what time period)? Will stakeholders understand what it means when the metric moves?

Finally, verify alignment—does this metric connect to what you actually care about? If you're using a proxy metric, have you validated that it correlates with the ultimate outcome?

Recommendation

If a metric fails feasibility, invest in the logging infrastructure you need—don't compromise by measuring the wrong thing. If it fails variance or influenceability tests, refine the metric or reconsider your experimental approach.
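To make the interplay between effective variance and influenceability concrete, here is a rough sketch. It relies on two standard textbook results rather than on any specific platform's method: regression adjustment on a pre-exposure covariate with correlation rho leaves a share 1 - rho^2 of the variance, and a two-sided z-test on a difference in means has a minimum detectable effect (MDE) of (z_{1-alpha/2} + z_{power}) * sqrt(2 * variance / n). The function names and all numbers are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def effective_variance(raw_variance: float, covariate_correlation: float) -> float:
    """Variance remaining after regression adjustment on a pre-exposure covariate."""
    return raw_variance * (1.0 - covariate_correlation ** 2)


def minimum_detectable_effect(variance: float, users_per_group: int,
                              alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest absolute difference in means a two-sided z-test would detect."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2.0 * variance / users_per_group)


# Hypothetical continuous metric: raw variance 400, correlation 0.8 with its
# pre-exposure value, 10,000 users per group, expected effect of 0.5 units.
expected_effect = 0.5
raw_mde = minimum_detectable_effect(400.0, 10_000)
adjusted_mde = minimum_detectable_effect(effective_variance(400.0, 0.8), 10_000)

print(f"raw MDE: {raw_mde:.3f}, detectable: {expected_effect >= raw_mde}")
print(f"adjusted MDE: {adjusted_mde:.3f}, detectable: {expected_effect >= adjusted_mde}")
```

With these made-up numbers, the expected effect sits below the raw MDE but above the adjusted one. The same change goes from undetectable to detectable once the pre-exposure correlation is used, which is why raw variance alone is a poor basis for choosing between candidate metrics.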

Configure time windows

Time configuration determines when you measure and when you see results. For each metric, you need to make three choices.

When does measurement start? The exposure offset controls this. Use 0 days for immediate effects, or 7+ days to measure behavior after novelty effects fade. Consider using multiple metrics with different offsets to understand both short-term response and sustained impact.

How long do you measure each user? The aggregation window determines this. Match the window to the natural cycle of your user behavior and how long effects take to manifest.

When do users appear in results? You have two options. Closed windows include a user only once their full window has elapsed. Because every user is measured over exactly the same period, interpretation is cleaner, but results arrive more slowly. Use these for success metrics guiding final decisions. Cumulative windows give faster results by including users as soon as they enter the window, even if their window is incomplete. Use these when you need earlier monitoring, such as for guardrails or for watching success metrics before they mature.
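The three choices can be sketched in a few lines of date arithmetic. This is an illustration under stated assumptions, not a description of how any platform implements windows: the window is treated as half-open, `data_through` is assumed to be the first day for which data is not yet complete, and the function names and mode labels are hypothetical.

```python
from datetime import date, timedelta


def window_bounds(exposed_on: date, offset_days: int, window_days: int) -> tuple[date, date]:
    """Return the half-open [start, end) measurement window for one user."""
    start = exposed_on + timedelta(days=offset_days)
    return start, start + timedelta(days=window_days)


def appears_in_results(exposed_on: date, offset_days: int, window_days: int,
                       data_through: date, mode: str) -> bool:
    """Decide whether a user is counted, given data complete up to (not including) data_through."""
    start, end = window_bounds(exposed_on, offset_days, window_days)
    if mode == "closed":
        # Counted only once the whole window has elapsed.
        return data_through >= end
    if mode == "cumulative":
        # Counted as soon as at least one day of the window has been observed.
        return data_through > start
    raise ValueError(f"unknown mode: {mode!r}")


# A user exposed on 1 March, with a 7-day exposure offset and a 14-day window.
exposed = date(2024, 3, 1)
for day in (date(2024, 3, 5), date(2024, 3, 15), date(2024, 3, 22)):
    closed = appears_in_results(exposed, 7, 14, day, "closed")
    cumulative = appears_in_results(exposed, 7, 14, day, "cumulative")
    print(day, "closed:", closed, "cumulative:", cumulative)
```

In this example the user's window runs from 8 March to 22 March. Under the cumulative rule the user shows up in results partway through that window, while under the closed rule they appear only once the full 14 days are available.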

Document everything

Write down for each metric: the specific behavior being measured, who it applies to (all users, activated users, exposed subset), the time configuration, expected direction of change, and why you chose this metric. Include calculation logic, filters, and any caveats.

This documentation serves three purposes: it ensures everyone interprets results consistently, it helps you catch issues before launching, and it creates a learning artifact for future teams.
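If you keep metric documentation as structured records, an incomplete entry is easy to catch before launch. The sketch below assumes one record per metric with a field for each item listed above. The `MetricDoc` class, its field names, and the decision to treat only caveats as optional are illustrative assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class MetricDoc:
    """Hypothetical documentation record for one metric."""
    name: str
    behavior: str = ""            # the specific behavior being measured
    population: str = ""          # all users, activated users, exposed subset
    time_configuration: str = ""  # exposure offset, aggregation window, window type
    expected_direction: str = ""  # increase, decrease, or no change
    rationale: str = ""           # why this metric was chosen
    calculation: str = ""         # calculation logic
    filters: str = ""             # filters applied, or "none"
    caveats: str = ""             # assumed optional: there may be none


def missing_fields(doc: MetricDoc) -> list[str]:
    """List the required fields that are still empty, in declaration order."""
    optional = {"caveats"}
    return [f.name for f in fields(doc)
            if f.name not in optional and not getattr(doc, f.name).strip()]


draft = MetricDoc(
    name="purchase_completion_rate",
    behavior="Share of checkout sessions that end in a completed purchase",
    population="Returning customers exposed to the checkout page",
)
print(missing_fields(draft))
```

Running a check like this as part of a pre-launch review turns "document everything" into something a colleague can verify quickly.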

Trade-offs

Real constraints force trade-offs. You can't always have perfect alignment, ideal sensitivity, and fast results simultaneously. Here's how to navigate common tensions.

When sensitivity conflicts with business alignment, remember that business-critical metrics like revenue and retention often have high variance or require slow measurement. Use sensitive proxies as primary success metrics while monitoring business metrics as guardrails. Just validate that improvements in your sensitive metric actually translate to business outcomes.

When feasibility limits ideal measurement, proxy metrics become necessary. Use them when direct metrics aren't available, but validate the correlation with historical data. Document the relationship clearly so others understand what the proxy represents.

When time constraints conflict with clean interpretation, semi-open windows give faster monitoring at the cost of interpretation complexity. Use them for early signals, but plan to validate with closed windows before making irreversible decisions.

The key is making these trade-offs deliberately, documenting them, and understanding their implications.
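Validating a proxy against historical data can start with something as simple as a correlation check. The following is a minimal sketch, assuming you have paired historical values of the proxy and the business outcome (for example, one pair per user cohort). The 0.5 threshold, the function names, and the data are hypothetical. A high historical correlation is a plausibility check only: it does not prove that moving the proxy in an experiment will move the outcome.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def proxy_is_plausible(proxy: list[float], outcome: list[float],
                       min_correlation: float = 0.5) -> bool:
    """Check the historical correlation against an (assumed) minimum threshold."""
    return pearson(proxy, outcome) >= min_correlation


# Made-up historical values, one pair per cohort.
proxy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
tracking_outcome = [2.1, 3.9, 6.2, 8.1, 9.8]
unrelated_outcome = [5.0, 1.0, 4.0, 2.0, 3.0]

print(proxy_is_plausible(proxy_values, tracking_outcome))
print(proxy_is_plausible(proxy_values, unrelated_outcome))
```

Record the measured correlation and the period it was computed over in the metric's documentation, so others can judge how much weight the proxy deserves.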

A complete example

Example

Context: An e-commerce platform tests a visual search feature that lets users upload photos to find similar products.

Hypothesis: "Visual search will increase purchase conversion for fashion categories because it helps users find specific styles when they can't describe them in text."

Metric design:

Success metrics:

- Purchase conversion rate for visual search users in fashion categories (cumulative)
  - Why: Directly measures the hypothesized behavior change
  - Characteristics: Highly influenceable (scoped to visual search users), moderate variance, clear interpretation
- Visual search usage rate among fashion shoppers (cumulative)
  - Why: Validates feature adoption; if users don't use it, conversion can't improve
  - Characteristics: Very sensitive, fast signal, enables understanding of null results

Guardrail metrics:

- Overall revenue per user (cumulative)
  - Why: Ensure visual search doesn't cannibalize other purchase paths
- Text search usage rate (cumulative)
  - Why: Detect if visual search replaces rather than complements text search
- Mobile app performance scores (cumulative)
  - Why: Visual search uses image processing, so ensure it doesn't degrade the experience

Exploratory metrics (can be added during analysis):

- Category distribution of visual search queries
- Success rate of visual searches (found similar products)
- Average product price from visual vs text search
- Cart add rate from visual search results
- Share of searches that are visual vs text

Trade-offs made:

- Using "purchase conversion for visual search users" rather than "overall site conversion" (more influenceable but narrower scope; monitoring overall conversion as guardrail)
- Using cumulative metrics without windows since e-commerce users (especially anonymous/cookie-based) may not return, making fixed windows less appropriate

You've got this

You now have a systematic approach for selecting metrics. You understand how to start with your hypothesis, translate it into measurable behaviors, assign appropriate roles, validate characteristics, configure time windows, and make deliberate trade-offs when necessary.

The framework isn't rigid—adapt it to your context. But the core principles remain: be explicit about what you expect to change and why, measure the right behavior even if it's harder, protect critical outcomes, and document your choices so others can learn from them.

Metric design is iterative. Your first attempt won't be perfect. You'll learn from each experiment what worked, what didn't, and how to refine your measurement approach. That's expected and valuable—each iteration builds your intuition and your organization's measurement capabilities.

Recommendation

Start applying this framework on your next experiment. Walk through each step, document your decisions, and review the metric choices with a colleague before launching. After the experiment concludes, reflect on what you learned about metric design and how you'd approach it differently next time.

What should drive your metric choices in an experiment?

Why might you choose a proxy metric with moderate variance over a business metric with high variance?

When should you use closed windows versus semi-open windows?

What is the most important step when using a proxy metric instead of the direct business outcome?
