Lesson 1: Why you should experiment

We experiment because we know that we have biases

As humans, we tend to look for evidence that supports what we already believe, a phenomenon known as confirmation bias. To make matters worse, we also have a tendency to overvalue the products that we built ourselves (also known as the IKEA effect). This means that if we want to know the true value of product changes for our users, we have to be very careful to measure the impact in an unbiased and objective way, to avoid having our own beliefs fool us.

We experiment to avoid accidental breakage

Every time we change something about our product, we run the risk of accidentally causing negative side effects. This could be an increase in latency or crash rates caused by a new feature. For a mature product such as Spotify, it is much easier to unintentionally break the user experience than to improve it. Without experimentation, small undetected decreases in performance can add up and have a detrimental combined impact on the overall user experience.

[Figure: Product evaluation]

We run experiments to innovate fast and abandon bad ideas early

The most important thing for most companies, Spotify included, is not to ship a lot of changes, but to ship the right changes. Holding back harmful product changes is as important as releasing beneficial ones. Without testing our assumptions systematically on real users in a real-life setting, we risk investing a lot of development resources into product changes that appeared promising at first but didn't actually improve the user experience.

Experiments allow us to draw causal conclusions

Let's say that we are looking for ways to reduce churn for Spotify Premium users. We could do an analysis that compares users who churned with users who didn't. One result of such an analysis could be that users who didn't churn experienced more app crashes than users who churned. Does this mean that increasing the number of app crashes would reduce churn? Of course not. People who use the app a lot are more likely to experience a crash, and are also less likely to churn. Heavy usage drives both outcomes, so crashes and retention are correlated without one causing the other.
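The crash example can be reproduced with a small simulation. The sketch below is illustrative only: the function names and all probabilities are made-up assumptions, not Spotify data. Heavy usage raises the chance of a crash and lowers the chance of churn, while crashes themselves have no effect on churn at all.

```python
import random

def simulate_users(n=20_000, seed=0):
    """Simulate (heavy_user, crashed, churned) tuples.

    All probabilities are hypothetical. Crashes and churn both depend on
    usage, but crashes never influence churn directly.
    """
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        heavy = rng.random() < 0.5
        crashed = rng.random() < (0.5 if heavy else 0.1)  # heavy users crash more
        churned = rng.random() < (0.1 if heavy else 0.5)  # heavy users churn less
        users.append((heavy, crashed, churned))
    return users

def crash_rate(users):
    """Share of users in the list who experienced a crash."""
    return sum(crashed for _, crashed, _ in users) / len(users)

users = simulate_users()
retained = [u for u in users if not u[2]]
churned = [u for u in users if u[2]]
print(f"crash rate, retained users: {crash_rate(retained):.3f}")
print(f"crash rate, churned users:  {crash_rate(churned):.3f}")

# Comparing only heavy users removes the confounder, and the gap disappears.
heavy_retained = [u for u in users if u[0] and not u[2]]
heavy_churned = [u for u in users if u[0] and u[2]]
print(f"heavy users only, retained: {crash_rate(heavy_retained):.3f}")
print(f"heavy users only, churned:  {crash_rate(heavy_churned):.3f}")
```

Retained users show a clearly higher crash rate overall, even though the simulation gives crashes no influence on churn. Once usage is held fixed, the difference largely vanishes.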

Now let's imagine that we built a new feature, and we hope that it reduces churn for Spotify Premium users. In theory, we could just roll out the feature to everyone, check how many people are using it, and then see if people who use the feature are less likely to churn. But would this tell us if the feature actually reduces churn? No. Just as with app crashes, a correlation between more feature usage and less churn would not imply a causal link. To objectively measure the value of our new feature, we need to find a way to isolate the impact of the feature from everything else that can impact our metric of choice. The gold standard method for doing this is called a "randomized controlled trial."

Randomized controlled trials

Experiments split users into two (or more) groups by random assignment. Random assignment makes sure that the groups are, on average, similar in all aspects except for the change we want to test. If we randomly split all Spotify users into two groups, the two groups should be very similar in terms of dimensions like demographics, connection speed, and music taste. One group, the "treatment" group, receives the new feature. The other group, the "control" group, receives the default experience. We can then observe the users over time while they receive the two different experiences, and measure some outcome of interest, for example churn, daily activity, or the number of minutes played. At the end of the experiment, we run a statistical test to assess whether the difference between the groups is larger than what we would expect to see from chance alone if the change had no effect.
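As a minimal sketch of these two steps, the Python below randomly assigns users to groups and then compares churn rates with a two-proportion z-test. The text does not say which statistical test is used, so the z-test is an assumption chosen for simplicity, and the function names and churn counts are hypothetical.

```python
import math
import random

def assign_groups(user_ids, seed=42):
    """Randomly assign each user to 'treatment' or 'control' (50/50 split)."""
    rng = random.Random(seed)
    return {uid: ("treatment" if rng.random() < 0.5 else "control")
            for uid in user_ids}

def two_proportion_z_test(events_control, n_control, events_treatment, n_treatment):
    """Return (z, two-sided p-value) for treatment rate minus control rate."""
    rate_control = events_control / n_control
    rate_treatment = events_treatment / n_treatment
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (events_control + events_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        raise ValueError("cannot test: pooled rate is 0 or 1")
    z = (rate_treatment - rate_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

groups = assign_groups(range(2000))

# Hypothetical outcome: 120 of 1000 control users churned, 90 of 1000 treated.
z, p = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value means that a difference this large would be unlikely if the feature had no effect. In practice, the choice of test and the significance threshold depend on the metric and the experiment design.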

[Figure: Randomized controlled trial]

The cost of experiments

Experiments aren't free. The main costs involved are that:

  • It takes time to set up an experiment, wait for users to be exposed, and analyze the results.
  • If the change that you test is as beneficial as you hope, then the users in the control group miss out on the improved experience until the end of the experiment.
  • If a change makes the user experience worse, then some users receive a worse experience for as long as the experiment runs.

The cost of not running experiments

  • You don't know if users respond to the product change in the way that you expect.
  • You might negatively impact your users in unexpected ways. If you roll out many changes without A/B testing them on real users, there might be negative impacts on system performance, crash rates, and more that you fail to detect. Taken together, these can add up and seriously harm the user experience.
  • Without testing your assumptions on real users, you risk investing resources into product changes that appear promising but don't actually improve the experience in a real-life setting.

Learn more

Watch this video to see examples of different types of experiments for various common use cases.