
Answers to Case study: Shuffle button in a shelf on Spotify home

Exercise: Write a hypothesis for an experiment

  1. Which success metrics would you set for this experiment? Why would you choose these metrics?

The product brief states the theory that the shuffle button will "increase engagement and audio consumption", so we want a success metric that captures consumption, such as Minutes played on day 1 after exposure. However, there could be a strong novelty effect: consumption increases on the first day, when the button is new to the user, but the increase wears off quickly. Minutes played on week 1 after exposure may therefore be a better choice. To check that the effect persists over time, we could add metrics for later time points, such as Minutes played on week 2 after exposure. However, each additional metric can make it more difficult to detect a change, because testing several metrics typically requires a multiple-testing correction that reduces the sensitivity of each test. Thinking carefully and then picking fewer metrics is often the better option.
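To make the cost of extra metrics concrete, the sketch below estimates how much the required sample size grows as success metrics are added. It assumes a Bonferroni correction of alpha across success metrics, a two-sided z-test, and 80% power; these are illustrative assumptions, not a description of how any particular platform corrects for multiple metrics, and `relative_sample_size` is a hypothetical helper name.

```python
from statistics import NormalDist


def relative_sample_size(num_success_metrics, alpha=0.05, power=0.8):
    """Required sample size relative to testing a single success metric.

    Assumption: alpha is split evenly across metrics (Bonferroni) and each
    metric is tested with a two-sided z-test at the same power.
    """
    z = NormalDist().inv_cdf
    z_power = z(power)
    # Sample size is proportional to (z_alpha + z_power)^2 for a fixed effect
    # size and variance, so the ratio of these terms is the inflation factor.
    single_metric = (z(1 - alpha / 2) + z_power) ** 2
    corrected = (z(1 - alpha / (2 * num_success_metrics)) + z_power) ** 2
    return corrected / single_metric


for num_metrics in (1, 2, 3):
    factor = relative_sample_size(num_metrics)
    print(f"{num_metrics} success metric(s): {factor:.2f}x the sample size")
```

Under these assumptions, every added success metric raises the number of users needed to detect the same effect, which is why a single well-chosen metric is often preferable.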

  2. What unintended side effects could you see when you add the shuffle button to this shelf? What guardrail metrics would you set to test for this?

There is a risk that adding the shuffle button will increase crashes. The metric Share of users with a crash on day 1 after exposure would be a good guardrail to detect this. In addition, the button might only move consumption from other shelves to the shelf with the shuffle button. However, we don't need a separate metric for this, because the success metric measures overall consumption: if consumption is only moved around, we will see no effect in the experiment. If we instead used Consumption from the shelf with the shuffle button as the success metric, we would need to add a guardrail metric for Consumption from other shelves to make sure that we are not just moving consumption around.
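The difference between a shelf-level metric and an overall metric can be illustrated with a toy calculation. All numbers and names below are hypothetical and chosen only to show the mechanism: in one scenario minutes merely move between shelves, in the other the button adds new minutes.

```python
# Hypothetical average minutes played per user, split by where playback started.
control = {"shuffle_shelf": 10.0, "other_shelves": 50.0}
cannibalized = {"shuffle_shelf": 14.0, "other_shelves": 46.0}  # minutes only moved
genuine_lift = {"shuffle_shelf": 14.0, "other_shelves": 50.0}  # new minutes added


def lift(treatment, baseline, sources):
    """Absolute difference in minutes played, summed over the given sources."""
    return sum(treatment[s] for s in sources) - sum(baseline[s] for s in sources)


all_sources = list(control)
for name, treatment in [("cannibalized", cannibalized), ("genuine lift", genuine_lift)]:
    shelf_only = lift(treatment, control, ["shuffle_shelf"])
    overall = lift(treatment, control, all_sources)
    print(f"{name}: shelf-only lift = {shelf_only:+.1f}, overall lift = {overall:+.1f}")
```

The shelf-only metric shows the same lift in both scenarios, while the overall metric separates them. This is the reason an overall consumption metric does not need an extra guardrail for consumption from other shelves.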

  3. Based on the information above, write a testable hypothesis for an experiment on a shuffle button on the 'Try something else' shelf.

Based on user research, consumption data, and anecdotal evidence, we believe that giving users low-effort paths to discovering new content is important for user satisfaction and engagement. We think that adding a shuffle button to the 'Try something else' shelf for users in Spanish-speaking Latin America will achieve increased overall audio consumption. We will know this is true when we see an increase in Minutes played on week 1 after exposure. This will be good for customers, because they can discover new songs and artists without having to make decisions, and for creators, because more users will discover more content.

Bonus questions

  1. How much of an increase in these metrics would you want to see to call the experiment a success?

When setting up an experiment, we need to define the "minimum detectable effect" for each success metric. This lets us set up the experiment to be sensitive enough to detect the effect that we care about. It is not always straightforward to define how large an effect needs to be to be relevant. In this case, there is historical data from similar experiments in other places, from which we can learn what size of lift to expect.

  2. What is the largest decrease in the guardrail metrics that you would still consider acceptable?

For a metric like Share of users with a crash on day 1 after exposure, even a small increase may be unacceptable. If the change increases the number of crashes, we would probably want to fix the source of the crashes rather than roll the change out even though it causes crashes for some users. So we may want to set the minimum detectable effect very low here.

On the other hand, if we set the effect very low, then we will need a large sample size to detect an effect this small.

At the same time, the baseline of crash rates is low, so even a large relative increase in crash rates may not be a large absolute increase.

Setting a sensible effect size here is a trade-off between an effect that is

  • small enough to protect the business against rolling out a change that can cause a negative experience and
  • large enough to be able to detect it within a reasonable sample size.
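This trade-off can be explored with a rough sample size calculation for a binary metric. The sketch below uses the standard normal-approximation formula for comparing two proportions with a one-sided test. The 1% baseline crash rate, the alpha and power values, and the helper name `users_per_group` are assumptions made for illustration; they do not come from the case study.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(baseline_rate, relative_increase, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect a relative increase in a rate.

    Assumptions: equal group sizes, a one-sided z-test for two proportions,
    and the normal approximation to the binomial.
    """
    z = NormalDist().inv_cdf
    treated_rate = baseline_rate * (1 + relative_increase)
    absolute_increase = treated_rate - baseline_rate
    # Sum of the per-user variances of the binary metric in the two groups.
    variance_sum = (baseline_rate * (1 - baseline_rate)
                    + treated_rate * (1 - treated_rate))
    n = (z(1 - alpha) + z(power)) ** 2 * variance_sum / absolute_increase ** 2
    return ceil(n)


baseline = 0.01  # hypothetical: 1% of users see a crash on day 1
for relative_increase in (0.05, 0.10, 0.50):
    absolute = baseline * relative_increase
    n = users_per_group(baseline, relative_increase)
    print(f"relative +{relative_increase:.0%} (absolute +{absolute:.3%}): "
          f"{n:,} users per group")
```

Because the required sample size scales roughly with the inverse square of the absolute increase, halving the effect we want to detect roughly quadruples the number of users needed. With a low baseline, even a large relative increase corresponds to a small absolute one, which is what makes small effects on crash rates expensive to detect.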
