Answers to Case study: Shuffle button in a shelf on Spotify home

Exercise: Write a hypothesis for an experiment

  1. Which success metrics would you set for this experiment? Why would you choose these metrics?

The product brief states the theory that the shuffle button will "increase engagement and audio consumption". We therefore want a success metric that captures consumption, such as Minutes played on day 1 after exposure. However, there could be a strong novelty effect: consumption rises on the first day, when the button is new to the user, and then wears off quickly. Minutes played on week 1 after exposure may therefore be a better choice. To check that the effect persists over time, we could add metrics for later time points, such as Minutes played on week 2 after exposure. However, each additional metric makes it harder to detect a change, because correcting for multiple comparisons makes every individual test stricter. Thinking carefully and then picking fewer metrics is often the better option.
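
The cost of adding metrics can be made concrete with a rough sample size calculation. The sketch below is an illustration only: it assumes a two-sided two-sample z-test and a Bonferroni correction (the significance level is split evenly across the success metrics), which may not be the correction your experimentation platform applies. The function name and all parameter values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_users_per_group(effect_in_sd, n_metrics=1, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-sample z-test.

    effect_in_sd: smallest effect to detect, in units of the metric's
                  standard deviation (assumed equal in both groups).
    n_metrics:    number of success metrics; alpha is split evenly across
                  them (Bonferroni correction, an assumption of this sketch).
    """
    z = NormalDist().inv_cdf
    alpha_per_metric = alpha / n_metrics
    z_alpha = z(1 - alpha_per_metric / 2)
    z_power = z(power)
    return ceil(2 * ((z_alpha + z_power) / effect_in_sd) ** 2)


# Hypothetical effect size of 0.05 standard deviations.
for n_metrics in (1, 2, 3):
    print(n_metrics, "metric(s):", required_users_per_group(0.05, n_metrics))
```

Under these assumptions, the required number of users grows with every metric added, which is why a single well-chosen metric such as Minutes played on week 1 after exposure is often preferable.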

  1. What unintended side effects could you see when you add the shuffle button to this shelf? What guardrail metrics would you set to test for this?

There is a risk that adding the shuffle button will increase crashes. The metric Share of users with a crash on day 1 after exposure would be a good guardrail to detect this. In addition, we might only be moving consumption from other shelves to the shelf with the shuffle button. However, we don't need a separate metric for this, since we are measuring overall consumption: if consumption is only moved around, we will see no effect in the experiment. If we instead used Consumption from the shelf with the shuffle button as the success metric, we would need to add a metric for Consumption from other shelves to make sure that we are not just moving consumption around.
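
A small numeric illustration of this point, using made-up per-user minutes (the shelf names and all numbers are hypothetical, not data from the case study): the shelf-level metric shows a large lift while overall consumption does not move at all.

```python
def relative_lift(control, treatment):
    """Relative change of treatment over control."""
    return (treatment - control) / control


# Hypothetical average minutes played per user, split by source.
control = {"shuffle_shelf": 10.0, "other_shelves": 90.0}
treatment = {"shuffle_shelf": 14.0, "other_shelves": 86.0}

shelf_lift = relative_lift(control["shuffle_shelf"], treatment["shuffle_shelf"])
overall_lift = relative_lift(sum(control.values()), sum(treatment.values()))

print(f"Lift on the shuffle shelf: {shelf_lift:.0%}")
print(f"Lift in overall consumption: {overall_lift:.0%}")
```

In this invented scenario, a shelf-only metric would suggest a clear win, while the overall metric correctly shows that consumption was only moved around.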

  1. Based on the information above, write a testable hypothesis for an experiment on a shuffle button on the Try something else shelf.

Based on user research, consumption data and anecdotal evidence we believe that giving users low-effort paths to discovering new content is important for user satisfaction and engagement. We think that adding a shuffle button to the 'Try something else' shelf for users in Spanish-speaking Latin America will achieve increased overall audio consumption. We will know this is true when we see an increase in Minutes played on week 1 after exposure. This will be good for customers because they can discover new songs and artists without having to make decisions and for creators because more users will discover more content.

Bonus questions

  1. How much of an increase in these metrics would you want to see to call the experiment a success?

When setting up an experiment, we need to define the "minimum detectable effect" for each success metric. This lets us set up the experiment to be sensitive enough to measure the effect that we care about. It is not always straightforward to define how large an effect needs to be to be relevant. In this case, there is historical data from similar experiments in other places, from which we can learn what size of lift to expect.
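
One way to sanity-check a minimum detectable effect is to compute the smallest effect a planned experiment could detect and compare it with the lift seen in earlier experiments. The following is a minimal sketch assuming a two-sided two-sample z-test with equal group sizes and equal variances; the function name and every number in the usage lines are hypothetical placeholders, not values from the case study.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, std_dev, alpha=0.05, power=0.8):
    """Smallest absolute difference in means that a two-sided two-sample
    z-test detects with the given power (equal group sizes assumed)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * std_dev * sqrt(2 / n_per_group)


# Hypothetical inputs: replace with values from your own historical data.
baseline_minutes = 200.0   # assumed mean Minutes played on week 1
std_dev_minutes = 120.0    # assumed standard deviation of that metric
users_per_group = 50_000   # assumed number of users available per group

mde = minimum_detectable_effect(users_per_group, std_dev_minutes)
print(f"Detectable absolute change: {mde:.2f} minutes")
print(f"Detectable relative change: {mde / baseline_minutes:.2%}")
```

If the lift expected from similar historical experiments is smaller than the detectable change, the experiment as planned is unlikely to find it, and the sample size or the metric choice should be revisited.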

  1. What is the largest decrease in the guardrail metrics that you would still consider acceptable?

For a metric like Share of users with a crash on day 1 after exposure, even a small increase may be unacceptable. If the change increases the number of crashes, we would probably want to fix the source of the crashes rather than roll the change out and cause crashes for some users. So we may want to set the minimum detectable effect very low here.

On the other hand, if we set the effect very low, then we will need a large sample size to detect an effect this small.

At the same time, the baseline of crash rates is low, so even a large relative increase in crash rates may not be a large absolute increase.
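
As a worked example of the difference between relative and absolute changes, the sketch below uses an invented baseline crash rate; both numbers are assumptions chosen for illustration only.

```python
def absolute_increase(baseline_rate, relative_increase):
    """Absolute change in a rate implied by a relative increase."""
    return baseline_rate * relative_increase


# Hypothetical values: 0.5% of users crash at baseline, and the
# treatment raises that rate by 20% in relative terms.
baseline_crash_rate = 0.005
relative_change = 0.20

delta = absolute_increase(baseline_crash_rate, relative_change)
print(f"Absolute increase: {delta * 100:.2f} percentage points")
print(f"Extra users with a crash per 100,000 exposed: {delta * 100_000:.0f}")
```

With these assumed numbers, a 20% relative increase corresponds to only 0.1 percentage points in absolute terms.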

Setting a sensible effect size here is a trade-off between an effect that is

  • small enough to protect the business against rolling out a change that can cause a negative experience and
  • large enough to be able to detect it within a reasonable sample size.
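
This trade-off can be explored with a rough sample size calculation for a share metric. The sketch below assumes a one-sided two-proportion z-test (we only care about an increase in crashes), equal group sizes, and a hypothetical baseline crash rate of 0.5%; a real experimentation platform may use a different test, so treat the output as an order-of-magnitude guide only.

```python
from statistics import NormalDist


def users_per_group(baseline_rate, relative_increase, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect a relative increase
    in a proportion with a one-sided two-proportion z-test."""
    z = NormalDist().inv_cdf
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_increase)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z(1 - alpha) + z(power)) ** 2 * variance / (p2 - p1) ** 2


# Hypothetical baseline: 0.5% of users see a crash on day 1.
for rel in (0.05, 0.10, 0.20, 0.50):
    n = users_per_group(0.005, rel)
    print(f"{rel:.0%} relative increase: {n:,.0f} users per group")
```

Halving the effect size roughly quadruples the required sample, so the chosen effect size has to balance how much risk is tolerable against how many users are available.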