If you want to experiment like Spotify, check out our experimentation platform Confidence. It's currently available in selected markets, and we're gradually adding more.
At Spotify, we know that enhancing product decision-making through experimentation is a journey. For example, introducing overly complex methods too soon can hinder team progress, stifling both innovation and iteration speed. Our approach is grounded in the belief that the greatest benefit from experimentation occurs when teams transition from not experimenting to experimenting. Teams can only refine their testing process, and what they learn from each test, after they start to experiment. As the need arises, that can mean adjusting the methods they use and adopting more advanced practices. We've guided over 300 teams through this journey at Spotify and have developed Confidence to support everyone in this process. Start from where you are now and let the tool encourage you to work out your experimentation muscles and elevate your practices.
In this post, we explore how to take a step toward more sophisticated experimentation and decision-making by complementing your success metrics with guardrail metrics. Most experimenters acknowledge the importance of tracking multiple metrics in experiments. What most fail to acknowledge is that multiple metrics can complicate the experiment design if you want to maintain high statistical rigor. Guardrail metrics often require additional input from the experimenter, which can raise the barrier for those in the early stages of their experimentation journey. Luckily, it's possible to simplify the usage of guardrail metrics by lowering the requirements on rigor. This way, more experimenters can start benefiting from guardrail metrics earlier in their experimentation journey. Later on, they can level up their practices.
Let's begin by examining the two types of metrics commonly used in experiments.
What are success and guardrail metrics?
In a nutshell, success metrics are metrics you want to improve with the product change you are testing. Guardrail metrics are metrics you don’t want to harm with your product change. The easiest way to understand the need for these types of metrics is by considering some simple examples.
Example 1: You're aiming to boost user engagement in your product by introducing a new version of the "Recommended For You" section. However, to make sure that overall engagement increases, you must guarantee that the rise in engagement in "Recommended For You" doesn't lower engagement in a competing section, "Best Sellers". That is, you want to avoid simply shifting engagement from "Best Sellers" to "Recommended For You". Rather, you want to maintain the same level of engagement in "Best Sellers" while simultaneously increasing it in "Recommended For You". In this scenario, use engagement in "Recommended For You" as the success metric and engagement in "Best Sellers" as the guardrail metric. If the new version boosts engagement in "Recommended For You" without reducing engagement in "Best Sellers", it's a success and should be implemented.
Example 2: Your aim is to increase the number of successful purchases on your online grocery store by introducing an 'Express Checkout' option. This feature allows users to quickly purchase from a predefined list of personalized items, reducing the time they spend on the site. Your hypothesis is that this will increase the number of transactions. However, since the express checkout only has a limited number of pre-defined items, it could decrease the overall purchase amount. In this scenario, use the proportion of successful checkouts as the success metric and the average purchase amount as the guardrail metric. If the number of successful purchases increases without decreasing the average purchase amount, it's a success and should be implemented.
Clearly, there are situations in which you should use both guardrail metrics and success metrics in an experiment. So, how do you use guardrail metrics?
Two ways to test guardrail metrics
You can test guardrail metrics in two different ways:
- Use an inferiority test to try to prove that the treatment group does worse than the control group. The desired outcome for a positive signal for the feature is that there is no statistical evidence that the treatment group does worse than the control group.
- Use a non-inferiority test to try to prove that the treatment group is not worse than the control group by more than a certain pre-defined margin. The desired outcome for a positive signal for the feature is that there is evidence that the treatment group stays within this relative margin.
Inferiority and non-inferiority tests are philosophical opposites. With the inferiority test, we ship the feature if there is no evidence of harm. With the non-inferiority test, we ship the feature if there is evidence that there’s no harm larger than the pre-specified tolerance level. In the inferiority test, the goal is to not find evidence of harm. For the non-inferiority test, the goal is to find evidence of no harm.
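To make the two decision rules concrete, here's a minimal sketch in Python. This is our own illustration, not Confidence's implementation: it assumes a metric where a decrease is bad and a two-sided confidence interval (lower, upper) for the treatment-versus-control difference, produced by whatever analysis you run.

```python
def ship_signal_inferiority(lower: float, upper: float) -> bool:
    """Inferiority test: positive signal unless there is evidence of harm,
    i.e. unless the whole interval sits below zero."""
    return upper >= 0  # no significant deterioration found


def ship_signal_non_inferiority(lower: float, upper: float, nim: float) -> bool:
    """Non-inferiority test: positive signal only if there is evidence that
    any harm is smaller than the margin, i.e. the whole interval sits above -NIM."""
    return lower > -nim  # evidence of non-inferiority found
```

Note how an inconclusive interval, say (-2, 1), gives a positive signal under the inferiority rule but a negative one under the non-inferiority rule for any NIM below 2: the burden of proof is reversed.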
From a statistical point of view, non-inferiority testing is a better way to guard against deterioration. However, non-inferiority tests require more planning and a deeper statistical understanding than inferiority tests. At Spotify, it is common that teams use guardrail metrics with inferiority tests when they get started with experimentation. After a few months of experimentation, most teams are ready for more advanced settings, and within a year almost all teams use non-inferiority tests for their guardrails. We think it is a mistake to let perfect be the enemy of good when it comes to getting started with experimentation. In the examples above, it's obviously better for product decisions to use guardrail metrics with inferiority tests than to not include guardrail metrics at all.
Use guardrail metrics with inferiority testing
To use guardrail metrics with inferiority testing in Confidence, all you have to do is select a guardrail metric. Confidence checks the metric for deterioration and won't recommend shipping if there is a significant deterioration in the metric. Guardrails evaluated with inferiority tests don't affect the sample size you need to power your experiment. The reason is that they seek evidence in the negative direction, while you typically power your experiments for changes in the positive direction. Read more about how this works in our decision-making blog post, or dig right into the math in our recent paper, Schultzberg, Ankargren, and Frånberg (2024).
No evidence for inferiority. The box is the confidence interval around the treatment effect estimate. Since we don't want sales amount to decrease, and the upper limit of the confidence interval is above zero, there is no significant deterioration in the sales amount.
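As an illustration of the check in the figure above, here's a sketch that builds a confidence interval for the difference in mean sales amount from simulated data. The data, the 95% confidence level, and the Welch-style standard error are our own assumptions for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(100, 20, size=5_000)     # simulated sales amount per user
treatment = rng.normal(99.5, 20, size=5_000)  # hypothetical treatment data

# Two-sided 95% confidence interval for the difference in means.
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
z = stats.norm.ppf(0.975)
lower, upper = diff - z * se, diff + z * se

# Inferiority check: a significant deterioration means the whole
# interval sits below zero.
if upper < 0:
    print(f"Significant deterioration: CI = ({lower:.2f}, {upper:.2f})")
else:
    print(f"No significant deterioration: CI = ({lower:.2f}, {upper:.2f})")
```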
Use guardrail metrics with non-inferiority testing
To use non-inferiority tests, you need to select a guardrail metric and decide on a non-inferiority margin (NIM). A non-inferiority margin is the amount by which a metric can deteriorate before you consider it a failure. We say that there is evidence that treatment is non-inferior to control if we can show that the metric hasn't deteriorated by more than the NIM. To make this more concrete, consider the example of sales amount as the guardrail metric. If you set the NIM to 1%, you only want to ship the change if there is evidence that the total sales amount doesn't decrease by more than 1%.
No evidence for non-inferiority. The box is the confidence interval around the treatment effect estimate. Since we don't want sales amount to decrease by more than the NIM, and the lower limit of the confidence interval is below -NIM, we can't conclude non-inferiority for the sales amount.
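Continuing the sketch from above, a non-inferiority check with a 1% NIM could look like the following. For simplicity, we translate the relative margin into absolute terms using the control mean; a full analysis would build the confidence interval for the relative difference directly (for example, via the delta method):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(100, 20, size=50_000)     # simulated sales amount per user
treatment = rng.normal(99.8, 20, size=50_000)  # hypothetical treatment data

nim = 0.01  # accept at most a 1% relative decrease in sales amount

# Lower limit of the two-sided 95% CI for the difference in means.
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
lower = diff - stats.norm.ppf(0.975) * se

# Non-inferiority check: evidence of non-inferiority means the whole
# interval sits above -NIM (here, -1% of the control mean).
margin = -nim * control.mean()
if lower > margin:
    print("Evidence of non-inferiority: positive ship signal.")
else:
    print("No evidence of non-inferiority within the 1% margin.")
```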
Setting the NIM is often challenging the first couple of times you do it for a metric. This extra step of deep thinking and reasoning is what ultimately improves your decision-making. In other words, it isn't the non-inferiority test itself that improves the quality of your decisions. It's that it forces you to specify at what cost you are willing to let one metric improve at the expense of another. Being explicit about trade-offs, and having conversations about them, is what will enable you to build a truly great product. On the one hand, there can be strategic reasons for moving engagement from one place in an app to another. On the other hand, moving user activity around can also be a form of sophisticated procrastination that leads to no overall improvement of your product.
Illustration of how inferiority tests and non-inferiority tests differ and how sample size and NIM affect what evidence we find. The box is the confidence interval around the treatment effect estimate.
The confidence interval is narrower with a larger sample size. This means that the larger the sample size, the smaller the deteriorations we can detect, and the smaller the non-inferiority margins we can find evidence for.
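To see why, note that the interval's half-width shrinks with the square root of the sample size. A quick sketch, assuming a metric standard deviation of 20:

```python
import numpy as np

sigma = 20.0  # assumed standard deviation of the metric
z = 1.96      # two-sided 95% confidence level

# Half-width of the CI for a difference in means with n users per group.
# Quadrupling n halves the interval, and thus the detectable deterioration.
for n in [1_000, 4_000, 16_000, 64_000]:
    half_width = z * sigma * np.sqrt(2 / n)
    print(f"n per group = {n:>6}: CI half-width = {half_width:.2f}")
```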
Focus on progress, not your state
At Spotify, we've seen time and time again that steady progress is the best kind of progress when it comes to advancing experimentation. Forcing a team that's new to experimentation to make decisions they struggle to understand, like selecting a non-inferiority margin, does more harm than good. Almost certainly, it leads to slow adoption and won't be the fastest way to improve product decision-making. Teams should aspire to meet best practices, but in our experience, the speed of progress is far more important than the current state. If you can get most of your organization on a track of improvement, you'll see that time moves fast.
At Spotify, we've helped over 300 teams make this journey, and we're excited to help you too with Confidence!
Confidence is currently available in private beta. If you haven't signed up already, sign up today.