Experiment like Spotify: Analysis of Experiments

Sebastian Ankargren, Senior Data Scientist
Mattias Frånberg, Senior Data Scientist
Mårten Schultzberg, Staff Data Scientist

If you want to experiment like Spotify, check out our experimentation platform Confidence. It's currently available in a set of selected markets, and we're gradually adding more markets as we go.

This post is part of a series that showcases how you can use Confidence. Make sure to check out our earlier posts Experiment like Spotify: With Confidence, Experiment like Spotify: A/B Tests and Rollouts and Experiment like Spotify: Feature Flags.

At Spotify, we've gone from manually analyzing experiments in notebooks to experiment analysis that's fully automated and available in a centralized platform. All our many years of work, including statistical research and best practices, are fully available through Confidence. Instantly level up your experiment analytics and get top-of-the-line statistical methods, reliable practices, and validations that let you know your experiments are being run and set up in healthy ways. All crucial requirements for effectively experimenting at scale.

Analysis is a common bottleneck

As you increase your experimentation velocity, a common bottleneck is analytics. With more experiments running, the backlog of your analysts and data scientists continues to grow as they need to analyze more and more experiments. Unless you've automated the analysis of your experiments, the bandwidth of your analysts limits your velocity and your experimentation won't scale at the rate you hoped. At Spotify, we experienced the analytics bottleneck firsthand as we started to outgrow our experimentation platform ABBA in 2018. A large part of the analysis happened in a decentralized way in notebooks. This practice slowed down the experimentation learning cycle because we couldn't get any results before an analyst had crunched the numbers. It also made us more susceptible to errors because of the lack of standardization and validation.

With our next two iterations of platforms, the Experimentation Platform and Confidence, we decided to put increased analysis capacity front and center. Through reliable statistical analyses and automated validations of experiments, we've managed to scale up our experimentation more than tenfold. Because our platform does all this in a fully automated way, teams don't need an analyst to run successful experiments. Through the standardization our platform provides, we've even been able to elevate team autonomy — something we value highly at Spotify.

Success and guardrail metrics

To know what happened in your experiment, you need metrics that measure the behavior of the experience you test. With Confidence, you use success metrics and guardrail metrics to evaluate the results of your experiments. Success metrics are the metrics you want to improve; in the Spotify case, this could, for example, be podcast minutes played. Guardrail metrics are the metrics you want to make sure your new experience doesn't negatively impact. For example, to make sure that the metric podcast minutes played doesn't increase at the expense of music consumption, an excellent guardrail candidate for a Spotify experiment is music minutes played.
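
To make this concrete, here's a minimal sketch of how the success and guardrail metrics from the example above could be declared for an experiment. The Python representation is a hypothetical illustration, not Confidence's actual configuration format.

```python
# Illustrative only -- not Confidence's actual configuration format.
# A minimal sketch of declaring success and guardrail metrics for an
# experiment, using the podcast/music example from the post.
from dataclasses import dataclass


@dataclass
class Metric:
    name: str                 # metric identifier
    preferred_direction: str  # "increase" or "decrease"
    role: str                 # "success" or "guardrail"


experiment_metrics = [
    # Success metric: the metric we actively want to improve.
    Metric(name="podcast_minutes_played", preferred_direction="increase", role="success"),
    # Guardrail metric: the metric we want to make sure doesn't deteriorate.
    Metric(name="music_minutes_played", preferred_direction="increase", role="guardrail"),
]

for m in experiment_metrics:
    print(f"{m.role:>9}: {m.name} (prefer {m.preferred_direction})")
```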

Checks help you run trustworthy experiments at scale

It's notoriously hard to come up with good ideas. A surprisingly large share of experiments, both at Spotify and elsewhere, don't have the expected impact. Some even make the user experience worse. While this insight can be demotivating, it's only natural — and especially so for more mature products. If your product is already great, finding ways to improve it without fiddling too much with the experience your users already love is a big challenge. For this reason, it's not enough to just look at decision metrics — you also need to check your metrics for deterioration. At Spotify, we always check our metrics for deterioration to rule out that they move in the opposite direction from what we expect. In addition, we check for deterioration in a set of metrics that are critical to the business, and perform a series of quality checks for the experiment. These checks make sure that the fundamentals of the experiment's setup are all correct. This includes verifying that traffic is coming in, that metrics data is functional, that the proportion in each treatment group matches the expected proportion, and more. It's crucial to do these quality and deterioration checks to detect problematic experiences at an early stage. The checks offer peace of mind and allow you to experiment confidently, without second-guessing whether your experiments are set up correctly.
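
One of the quality checks described above, verifying that the proportion in each treatment group matches the expected proportion, is commonly known as a sample ratio mismatch (SRM) check. The sketch below shows one standard way to perform it with a chi-square goodness-of-fit test; the function, threshold, and numbers are illustrative assumptions, not Confidence internals.

```python
# A minimal sketch of a sample ratio mismatch (SRM) check using a
# chi-square goodness-of-fit test. Illustrative only -- not Confidence's
# actual implementation; the alpha threshold is just a common convention.
from scipy.stats import chisquare


def srm_check(observed_counts, expected_proportions, alpha=0.001):
    """Flag a sample ratio mismatch if observed group sizes deviate
    significantly from the expected traffic split."""
    total = sum(observed_counts)
    expected_counts = [p * total for p in expected_proportions]
    stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
    return {"p_value": p_value, "srm_detected": p_value < alpha}


# Example: a 50/50 split that came out 50,000 vs 51,500 users.
print(srm_check([50_000, 51_500], [0.5, 0.5]))
```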

At Spotify, we find that automatically checking business-critical metrics for all experiments enables even more freedom for our hundreds of autonomous experimenting teams. The platform enables these teams to experiment with low friction without compromising quality, effectively acting as a safety net that helps identify regressions. Confidence informs and alerts when something goes awry to enable safe experimentation at scale, promoting speed and autonomy while maintaining safety mechanisms to minimize risks.

Shipping recommendations

When your experiment passes the quality and deterioration checks, you know you can trust and act on the results of the experiment. Each metric shows you a result, and you need to make a product decision based on these results — but the product doesn't care about individual metric results. That's why we've developed shipping recommendations at Spotify. Shipping recommendations summarize the outcome of the experiment into one recommendation for the product: should you ship this change or not? They shift the focus from individual metrics to the product, and what the experiment says about the overall impact. The shipping recommendation takes into account the results of the success and guardrail metrics, as well as the deterioration and quality checks. It helps you standardize decision-making by providing one common lens through which to view the success of your experiments. It's also a great way for people in all roles to make sense of the results of an experiment on their own, which frees up scarce analytics and data science resources.
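
The post doesn't spell out the exact rules behind the recommendation, but as a rough sketch, assuming a simplified rule set of our own, collapsing metric results and checks into a single call could look something like this.

```python
# Illustrative only -- a simplified sketch of how metric results and checks
# could be collapsed into a single shipping recommendation. This rule set
# is an assumption for the example, not Confidence's actual logic.
def shipping_recommendation(success_improved, guardrails_deteriorated, checks_passed):
    if not checks_passed:
        return "Don't trust the results yet: fix the experiment setup first"
    if guardrails_deteriorated:
        return "Don't ship: a guardrail metric deteriorated"
    if success_improved:
        return "Ship: success metrics improved with healthy guardrails"
    return "No detected impact: consider iterating before shipping"


print(shipping_recommendation(success_improved=True,
                              guardrails_deteriorated=False,
                              checks_passed=True))
```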

Fully equipped with advanced statistical tools

Confidence lets you decide how you want your results delivered, either continuously or when you end the experiment. The analysis applies appropriate sequential testing as needed to avoid so-called peeking problems. You can use group sequential tests, just like we do at Spotify, to maximize power. Or you can use always-valid approaches that work better if you want to look at your experiment's results after each new data point you receive. Confidence also comes jam-packed with the statistical tools used by hundreds of teams at Spotify: variance reduction, different types of metrics, multiple testing corrections, and power and sample size calculations. You combine the metrics and methods that you need, and Confidence figures out the math for you.
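
As one example of the tools mentioned above, a basic power and sample size calculation for a two-sided two-sample z-test follows a standard textbook formula. The sketch below uses simple assumptions (equal variances, equal group sizes) and is not the exact calculation Confidence performs.

```python
# A minimal sketch of a standard per-group sample size calculation for a
# two-sided two-sample z-test with equal variances and a 50/50 split.
# Illustrative only -- Confidence handles these calculations for you.
from scipy.stats import norm


def sample_size_per_group(mde, std_dev, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute difference
    in means of `mde` at the given significance level and power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(round(2 * ((z_alpha + z_beta) * std_dev / mde) ** 2))


# Example: detect a 0.5-minute lift when the metric's standard deviation is 10.
print(sample_size_per_group(mde=0.5, std_dev=10))
```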

Exploratory analysis to slice and dice

With Confidence, what you can learn from your experiment won't come to a halt just because it's ended. In many situations, your experiment's results will warrant further investigation. For example, you might have gotten a surprisingly good result and want to understand whether all types of users shared this great improvement. Or you might have found almost no effect, and you want to understand if there are specific groups that benefited from the variant you tested that might shed light on how to iterate. In either case, Confidence's exploratory analysis page has got you covered. On the exploratory page, you can slice and dice the results by the dimensions you create. These dimensions can exist directly on the events that power your metrics, such as whether an order placed by a user happened on your website or in your app. Dimensions can also describe the entity your experiment is on, such as a country dimension describing the country of registration for your users.
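
As a rough illustration of what slicing by a dimension amounts to, the sketch below computes per-segment lifts with pandas. The data, the "platform" dimension, and the metric name are made up for the example; Confidence's exploratory page does this kind of breakdown for you.

```python
# Illustrative only -- a rough sketch of slicing experiment results by a
# dimension (here a made-up "platform" column) using pandas.
import pandas as pd

data = pd.DataFrame({
    "group":    ["control", "treatment"] * 4,
    "platform": ["app", "app", "app", "app", "web", "web", "web", "web"],
    "minutes_played": [10.0, 12.0, 11.0, 13.0, 9.0, 9.5, 8.5, 9.0],
})

# Per-dimension difference in means between treatment and control.
means = data.groupby(["platform", "group"])["minutes_played"].mean().unstack()
means["lift"] = means["treatment"] - means["control"]
print(means)
```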

Use Confidence in your existing platform

Confidence is a true platform, which means that you can use any of its services independently of each other. For example, you might already have an experimentation platform that does feature flagging, logging, calculations of metrics, and some basic stats calculations. If you're looking to level up your statistical toolbox, you don't need to abandon your existing tools — just use the Confidence stats API to offload your analysis to Confidence and keep everything else intact.
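
To illustrate the integration pattern only, the sketch below posts pre-computed metric aggregates from your own platform to a stats endpoint for analysis. The URL, payload shape, and field names are placeholders and not the actual Confidence stats API schema; consult the API documentation for the real contract.

```python
# Purely hypothetical sketch of the integration pattern: your own platform
# computes metric aggregates, then posts them to a stats endpoint for analysis.
# The URL, payload shape, and field names below are placeholders, NOT the
# actual Confidence stats API schema.
import requests

payload = {
    "metric": "podcast_minutes_played",
    "groups": [
        {"name": "control",   "sample_size": 50_000, "mean": 11.2, "variance": 96.0},
        {"name": "treatment", "sample_size": 50_000, "mean": 11.6, "variance": 98.5},
    ],
}

response = requests.post("https://example.com/your-stats-endpoint", json=payload)
print(response.status_code, response.json())
```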

What's next

This post is part of a series that showcases how you can use Confidence. Make sure to check out our earlier posts Experiment like Spotify: With Confidence, Experiment like Spotify: A/B Tests and Rollouts and Experiment like Spotify: Feature Flags. Coming up in the series are posts on metrics, workflows, and more.

Confidence is currently available in private beta. If you haven't signed up already, sign up today.