Lesson 12: A/B tests and rollouts

A/B tests and rollouts are two tools for product evaluation. Although similar in some ways, they are usually used for different stages of evaluation.

A/B tests

The main characteristics of A/B tests are that they

  • Can have more than two variants
  • Can have both success metrics and guardrail metrics
  • Have a fixed allocation of the total population
  • Allow for different evaluation frequencies for calculating results

Use A/B tests to

  • Decide a winner among two or more variants of a product
  • Explore and learn how different settings affect user behavior

A/B tests are flexible and rich product evaluation tools. Because you can evaluate a complete set of success and guardrail metrics, they help you confirm that the winning variant really is better for the business than the losing variants.

Rollouts

The main characteristics of a rollout are that it

  • Can only have two variants: the current default and the variant that you want to roll out
  • Can only have guardrail metrics
  • Has an allocation that you can gradually increase
  • Always displays results continuously
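
One common way to get an allocation that can grow gradually is to hash each user into a stable bucket and compare that bucket with the current allocation. The sketch below assumes this approach for illustration only; it does not describe how any particular experimentation platform assigns users, and the names bucket, gets_new_variant, and "new-player" are hypothetical.

```python
import hashlib


def bucket(user_id: str, rollout_name: str) -> float:
    """Map a user to a stable number in [0, 1) for a given rollout."""
    digest = hashlib.sha256(f"{rollout_name}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a uniform-looking fraction.
    return int(digest[:8], 16) / 2**32


def gets_new_variant(user_id: str, rollout_name: str, allocation: float) -> bool:
    """Return True if the user sees the new variant at this allocation."""
    return bucket(user_id, rollout_name) < allocation


users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if gets_new_variant(u, "new-player", 0.05)}
at_25 = {u for u in users if gets_new_variant(u, "new-player", 0.25)}

# Every user exposed at 5% is still exposed at 25%.
print(at_5 <= at_25)  # True
```

Because a user's bucket never changes, raising the allocation only adds users to the new variant. Nobody who already has the new experience is moved back to the default while the rollout ramps up.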

Use rollouts to

  • Gradually ship a variant while monitoring important guardrail metrics
  • Gradually ship technical changes to the system, for example major refactors and migrations

A great benefit of using rollouts to ship changes is that the change you roll out sits behind a feature flag, which makes it easy to roll back. If you start rolling something out and get an alert that the rollout harms the end-user experience, reverting to the earlier experience is only a button click away. This saves engineers at Spotify a lot of time and agony, and has made rollouts the default way for engineers to release changes.
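
The ramp-and-revert behavior described above can be sketched as a small state machine. This is a minimal illustration under assumed names: the Rollout class, its step sizes, and the guardrails_ok signal are hypothetical and do not come from any real tool.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    name: str
    allocation: float = 0.0  # share of users on the new variant
    steps: tuple[float, ...] = (0.01, 0.05, 0.25, 0.5, 1.0)  # assumed ramp schedule

    def advance(self, guardrails_ok: bool) -> float:
        """Move to the next step, or roll back fully if a guardrail fails."""
        if not guardrails_ok:
            # Rolling back means sending everyone to the default again.
            self.allocation = 0.0
            return self.allocation
        for step in self.steps:
            if step > self.allocation:
                self.allocation = step
                break
        return self.allocation


rollout = Rollout("new-search-backend")
print(rollout.advance(guardrails_ok=True))   # 0.01
print(rollout.advance(guardrails_ok=True))   # 0.05
print(rollout.advance(guardrails_ok=False))  # 0.0
```

The rollback is a single state change to the flag rather than a new deployment, which is what keeps reverting cheap.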

Combine A/B tests and rollouts for important changes

Significant results from experiments with small sample sizes tend to overestimate the treatment effect, because with little data only unusually large observed effects reach significance. Some online experimenters propose that you should replicate the results from such experiments by rerunning the experiment on other users to confirm the result. In practice, it is hard to know which experiments are underpowered. One way to think about it is that you should scrutinize unexpected results harder, and replicate them, before you believe in them. A practical way to get confirmation of results from an A/B test is to ship the winner with a rollout. The metric results in the rollout work as a replication of the A/B test results, and you can be even more certain about making the right decision.
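
The tendency of significant results from small experiments to overstate the effect can be illustrated with a short simulation. The sketch below makes simplifying assumptions that are not from the lesson: the metric has unit variance, the true effect is 0.1, and each experiment's estimated difference is drawn directly from its normal sampling distribution. The function name and parameters are hypothetical.

```python
import random
import statistics


def mean_significant_estimate(true_effect: float, n_per_group: int,
                              n_experiments: int = 20_000, seed: int = 1) -> float:
    """Average estimated effect among experiments that found a significant win."""
    rng = random.Random(seed)
    # Standard error of a difference in means with unit variance per user.
    se = (2 / n_per_group) ** 0.5
    significant = []
    for _ in range(n_experiments):
        estimate = rng.gauss(true_effect, se)
        # Keep positive results that pass a two-sided 5% z-test.
        if estimate / se > 1.96:
            significant.append(estimate)
    # Raises StatisticsError if no experiment was significant.
    return statistics.mean(significant)


small = mean_significant_estimate(true_effect=0.1, n_per_group=50)
large = mean_significant_estimate(true_effect=0.1, n_per_group=5_000)
print(f"small experiments: {small:.3f}, large experiments: {large:.3f}")
```

With 50 users per group, an estimate must exceed 1.96 × 0.2 ≈ 0.39 to be significant, so every significant result is almost four times the true effect of 0.1 or more. With 5,000 users per group, nearly every experiment is significant and the average stays close to 0.1. A rollout that repeats the measurement on new users is not filtered in this way, which is why it works as a check.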

At Spotify, most A/B tests that identify a winning variant use a rollout to ship that variant, which means that most results are replicated while the change is being released to everyone.