Experiments with Smaller Samples

Mårten Schultzberg, Staff Data Scientist

If you want to experiment like Spotify, check out our experimentation platform Confidence and get a personalized demo.

What is a small sample?

What counts as a small sample is relative. At Spotify, most experiments have tens of millions of users, but some only have thousands. We see the experiments on thousands of users as small-sample experiments. For researchers in medicine, who often deal with sample sizes well below a hundred, a sample size of a thousand would be considered huge.

A more useful definition of "small sample" is a sample that doesn't let you estimate the impact of a change with the precision that you care about. For example, you might test a product change that you think will increase an important metric by 3%, but the sample size you can reach only gives you a fair chance (power) to detect effects of 10% or larger. In this post, we reason about what you can do in situations like this, and why at Spotify, we still always run experiments to de-risk product decisions — even when we can't reach the precision we desire.

Note that in this post we still assume that samples are large enough for standard statistical methods to work (without small sample size corrections). In other words, even a small sample should still contain more than 1,000 users.

A ladder of risk mitigation

Experimenters sometimes feel discouraged from running experiments when they cannot reach the intended precision, that is, when they don't have a fair chance (power) to detect the effect they are anticipating from the change. In these situations, remember that not running an experiment means staying completely oblivious to the impact. Running an experiment gives you clear bounds on the risk of making terrible product decisions, and any bound on the risks is better than no bound at all.

Risk mitigation ladder

We think about this as a ladder of risk mitigation, and our philosophy is that taking one step on this ladder is always better than not doing risk mitigation at all.

  0. No experiment — no risk mitigation
  1. Lack of evidence for negative impact — avoiding terrible decisions — no sample size requirements
  2. Evidence for limited negative impact — bounding how bad the decisions can be — mild sample size requirements
  3. Evidence for improvement — evidence that this is a good decision — strong sample size requirements

Many think of experimentation as primarily the top of the ladder, step 3: proving that the change we are making improves some important metric. Although we certainly want this to happen, it is rarely pragmatic to expect it for every change we make. Even in situations where we expect improvements (as opposed to no change at all), the expected improvements might be smaller than what we can reasonably expect to find evidence for, given the sample size we have in our experiments.

There are also many reasons why we might ship a change that doesn't systematically improve anything, or where we expect no metrics to change. Examples include:

  • Software updates — for example updating a version of some software library
  • Strategic changes that enable future improvements — for example making a surface in the app personalized, but serving the current static content in it as a first iteration
  • System refactoring — for example decreasing network usage by making fewer network calls

In these situations, we might expect no change or improvement. So why run an experiment then? Because mistakes and unintended consequences happen even for changes that are not meant to, or expected to, change the user experience.

How to climb the ladder

When product teams start to use experimentation to learn from their users and de-risk decisions, experiments must be set up in a way that makes it easy to celebrate the risk mitigation you got instead of mourning the risk mitigation you wish you had. Below are a few steps that we have used at Spotify to onboard teams to experimentation in a way that focuses less on proving improvements and more on limiting the risk that actually can be limited.

At Spotify, we know that it's a good investment for any team to take the first step on this ladder. We have detected and avoided many unintended effects over the years, saving the company a vast amount of money — avoiding bad product decisions is at least as important as making the good ones.

Step 1 — Start with rollouts using guardrail metrics without NIMs

Using rollouts with guardrail metrics without non-inferiority margins (NIMs) is a great way to get teams started with experimentation. When you don't specify a NIM for a guardrail in Confidence, the statistical test behind the scenes changes from a non-inferiority test to a deterioration test. You can read more about these two ways to test guardrail metrics in the blog post Better Product Decisions with Guardrail Metrics.

There are two benefits of using guardrails without NIMs as a starting point:

  • The experimenter doesn't have to select the NIM, which can feel challenging the first time you do it
  • The rollout will recommend shipping the features as long as there is no evidence for deterioration (the guardrail metric moving significantly in the unintended direction)

Although step 1 gives only a weak form of risk mitigation, in the sense that lack of evidence for deterioration is not strong evidence that the metric is unaffected, it still provides real protection: if your change causes a large unintended negative impact, you will detect it.
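
To make the idea concrete, here is a minimal sketch of what such a deterioration check could look like, assuming independent per-user metric values and a plain two-sample z-test. This is an illustration only, not Confidence's actual guardrail test (which, among other things, runs sequentially and uses variance reduction).

```python
# Minimal sketch of a deterioration check on a guardrail metric without a NIM.
# Assumes independent per-user metric values and a plain two-sample z-test.
import numpy as np
from scipy import stats

def deterioration_check(control, treatment, alpha=0.05, higher_is_better=True):
    """Flag a guardrail metric that moved significantly in the unintended direction."""
    control, treatment = np.asarray(control, float), np.asarray(treatment, float)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
    z = diff / se
    # One-sided p-value for movement in the "bad" direction.
    p_bad = stats.norm.cdf(z) if higher_is_better else stats.norm.sf(z)
    return {"diff": diff, "p_value": p_bad, "deteriorated": p_bad < alpha}

# A clear (simulated) regression of the metric mean gets flagged.
rng = np.random.default_rng(7)
print(deterioration_check(rng.normal(10, 3, 20_000), rng.normal(9.8, 3, 20_000)))
```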

Look at the confidence intervals

Confidence always presents confidence intervals for all success and guardrail metrics. You can look at the interval for a guardrail metric to get a feeling for the certainty you are dealing with in a given rollout. As long as the metric has not moved significantly in the wrong direction, there is no evidence of harm. However, the worst end of the confidence interval gives you some guidance on how bad the metric movement could plausibly be. If the confidence interval for a metric that improves when it increases spans -10% to +12%, you can conclude that the change is unlikely to be worse than a 10% regression. This reasoning is very much in line with non-inferiority testing, which is what step 2 on the ladder is all about.
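
As a rough illustration of reading the worst end of an interval, the sketch below computes a confidence interval for the relative difference by dividing the absolute interval by the control mean and ignoring the uncertainty in that mean. That simplification, and the simulated data, are ours; it is not how Confidence computes its intervals.

```python
# Rough sketch: read the "worst end" of a confidence interval for a guardrail metric.
# The relative interval is the absolute interval divided by the control mean,
# ignoring the uncertainty in that mean (a simplification).
import numpy as np
from scipy import stats

def relative_ci(control, treatment, alpha=0.05):
    control, treatment = np.asarray(control, float), np.asarray(treatment, float)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
    z = stats.norm.ppf(1 - alpha / 2)
    return (diff - z * se) / control.mean(), (diff + z * se) / control.mean()

rng = np.random.default_rng(3)
lo, hi = relative_ci(rng.normal(10, 3, 5_000), rng.normal(10, 3, 5_000))
# For a metric that improves when it increases, `lo` bounds how bad the change plausibly is.
print(f"relative confidence interval: [{lo:+.1%}, {hi:+.1%}]")
```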

Step 2 — Start with large NIMs

Once experimenters are used to rollouts, it's a good time to start introducing guardrail metrics with non-inferiority margins (NIMs). A non-inferiority margin is a fancy term for saying that we accept a small deterioration in this metric, but not more than the NIM. In other words, the NIM is a threshold for how much a metric is allowed to move in the non-desired direction before we consider it a failure. This video gives a quick introduction to NIMs.

With a non-inferiority test, Confidence recommends increasing the rollout reach (or shipping the variant in an A/B test) only if the metric is significantly above the threshold. Non-inferiority testing gives a stronger form of risk mitigation, because we must collect evidence of limited negative impact rather than merely lack evidence of negative impact as in step 1.
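
For intuition, here is a minimal sketch of a one-sided non-inferiority z-test for a metric where higher is better. Converting the relative NIM to an absolute margin via the control mean, and the simulated data, are simplifications of ours; this is not a description of the exact test Confidence runs.

```python
# Minimal sketch of a one-sided non-inferiority z-test for a "higher is better" metric.
# The relative NIM (e.g. 0.01 = accept at most a 1% drop) is converted to an absolute
# margin via the control mean (a simplification).
import numpy as np
from scipy import stats

def non_inferiority_test(control, treatment, nim_rel, alpha=0.05):
    control, treatment = np.asarray(control, float), np.asarray(treatment, float)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
    margin = nim_rel * control.mean()
    z = (diff + margin) / se          # H0: diff <= -margin vs H1: diff > -margin
    p = stats.norm.sf(z)
    return {"diff": diff, "p_value": p, "non_inferior": p < alpha}

# A neutral (simulated) change with a 1% NIM on a toy metric with mean 10 and sd 3.
rng = np.random.default_rng(11)
print(non_inferiority_test(rng.normal(10, 3, 200_000), rng.normal(10, 3, 200_000), nim_rel=0.01))
```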

It is natural to want as small a NIM as possible, to accept no deterioration, but the smaller the NIM, the larger the sample size you need to find evidence of non-inferiority. In the same sense that a guardrail metric without a NIM is better than no experiment, a large NIM is better than no NIM. Of course, if your NIM is unrealistically large, like accepting a 60% regression, it does not offer much risk mitigation in practice. But even in that case, finding evidence of non-inferiority is an explicit way of saying "the impact of this product change is at least not worse than a 60% regression".

By running sample size calculations, you can build a sense for what sample size a certain NIM requires for a given metric. As a starting point, selecting a fairly large NIM is fine for easing into experimentation: Confidence still checks whether any metric moves significantly in the non-desired direction, so you always get step 1 of the ladder for free even when you select a NIM. If the NIM is unreasonably large, you are essentially back at step 1. In other words, at worst we are at step 1 on the ladder, and at best we have an explicit threshold for what kind of hit we are willing to take, together with evidence of whether we are within that margin or not.
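
As a rough illustration of how the required sample size grows as the NIM shrinks, the sketch below uses the standard normal-approximation formula for a one-sided non-inferiority test, assuming equal group sizes, a known metric standard deviation, and a true effect of zero. The metric values are toy numbers, not Spotify data, and the formula is a back-of-the-envelope version of what a sample size calculator does.

```python
# Back-of-the-envelope sample size per group for a one-sided non-inferiority test,
# assuming equal group sizes, a known standard deviation, and a true effect of zero.
import numpy as np
from scipy import stats

def n_per_group_nim(sd, margin_abs, alpha=0.05, power=0.8):
    z_a = stats.norm.ppf(1 - alpha)   # one-sided test
    z_b = stats.norm.ppf(power)
    return int(np.ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / margin_abs ** 2))

# Toy metric with mean 10 and standard deviation 30; a 1% NIM is an absolute margin of 0.1.
for nim in (0.05, 0.02, 0.01):
    print(f"NIM of {nim:.0%}: ~{n_per_group_nim(sd=30, margin_abs=nim * 10):,} users per group")
```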

A note on how NIMs should be selected in an ideal world

Ideally, NIMs should be based on business reasoning: how much are we willing to "pay" for shipping this feature in terms of deterioration in a certain metric? It should be rare to ship things for strategic reasons if they are not improving some part of your experience. At the same time, organizations that are new to experimentation are not new to shipping features, and they will continue to ship features. The question is whether they will start mitigating risk with rollouts or not. We say: using rollouts with whatever NIM is better than continuing to ship in the dark.

That said, we spent a lot of time at Spotify thinking about reasonable NIMs for different guardrail metrics. Of course, it matters how you mitigate risk — but we really mustn't let perfect get in the way of good. Running any kind of rollout with a guardrail metric is always better than just shipping features blindly.

Step 3 — Learn what kind of optimization you can perform

When it comes to finding evidence of improvements, fewer aspects are in your hands. Either there is a positive effect of the change you are making or there's not. And even when there is an effect, the size of that improvement is not something you can control.

We use the Minimum Detectable Effect (MDE) as a tool to plan the experiment. This short video gives an introduction to MDEs and how to select them.

The MDE should be set to the effect you expect from the change and want a fair chance (power) to detect if it exists. If the effect you expect is quite small, you will need a quite large sample size to have a fair chance of detecting it. Increasing the MDE on the design page won't help here: it only tells you how small a sample size you would need if the true effect were that large, and if you increase the MDE beyond the impact you actually expect, that is useless information.
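
The sketch below illustrates this point with the standard normal approximation for a two-sided test of a difference in means: for a fixed sample size, your power is determined by the true effect, no matter what MDE you typed into the design page. The metric values are made up for illustration.

```python
# Sketch: for a fixed sample size, power is driven by the *true* effect,
# not by the MDE you type into the design page. Assumes a two-sided z-test
# on means with known standard deviation and equal group sizes.
import numpy as np
from scipy import stats

def power(effect_abs, sd, n_per_group, alpha=0.05):
    se = sd * np.sqrt(2 / n_per_group)
    z_a = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(abs(effect_abs) / se - z_a)   # ignores the (tiny) opposite tail

# Toy metric with mean 10 and sd 30, 50,000 users per group.
# Designing with a 5% MDE looks well powered, but if the true effect is 1%, power is low.
for effect in (0.05, 0.03, 0.01):
    print(f"true effect of {effect:.0%}: power = {power(effect * 10, sd=30, n_per_group=50_000):.0%}")
```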

Not all experiments within a company have the same sample size. If you, for example, experiment on a page that few users enter, you will have smaller sample sizes, and with them less ability to detect small improvements. Fortunately, there is some symmetry to this: the most important pages to get right are often the pages with the most traffic, which means you have the best chance of detecting small changes.

A push for bolder changes

Experiments measure the impact of changes. If you change very little in the user experience, the impact on user behavior is often proportionally small. A common mistake for early experimenters is to expect large impact from small changes. This doesn't mean it is a bad idea to start with small changes, but it means we need to set reasonable expectations on what we might learn.

It is a good idea to start with smaller changes and the first step of the risk mitigation ladder. As you climb, you also need to start making bolder changes. Bigger changes can give stronger signals, which help us learn faster.

Statistical settings for smaller samples

There are several additional settings that affect what sample size you need to have a fair chance to detect an improvement (step 3) or at least to find evidence of limited deterioration (step 2). In this section, we briefly go through the settings and choices you can use in Confidence and how to think about them.

Metric selection

Not all metrics have the same sample size requirements. Some metrics are noisy by nature and therefore require larger signals for us to be able to detect them. In general, it's a good idea to select specific metrics that measure an aspect closely related to the change you make. The more closely related the metric is to the change, the more likely it is to be affected by the change. This means that the metric will have a higher signal-to-noise ratio, which in turn means that you need a smaller sample size to detect the effect. For example, if at Spotify we hope to improve podcast recommendations, podcast consumption carries a stronger signal than overall audio consumption.

It is a good idea to compare metrics in terms of variance directly in Confidence, for example by running an A/A test. Add all metrics you are considering and compare their variance. The reason for doing this in Confidence is that you also get variance reduction (often referred to as CUPED) taken into account.

The actual variance you need to care about is the variance after variance reduction. Therefore, a metric with slightly higher variance, but where variance reduction is more effective, might be better than a metric with slightly lower variance but less efficient variance reduction. The variance reduction depends on the correlation of the metric's values over time, which might or might not depend on the variance in the metric.
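
For intuition, here is a minimal CUPED-style sketch on simulated data: the in-experiment metric is adjusted with pre-experiment values of the same metric, and the variance shrinks roughly by a factor of one minus the squared correlation. The data and coefficients are toy values of ours, not Confidence's implementation.

```python
# Minimal CUPED-style sketch on simulated data: adjust the in-experiment metric
# with pre-experiment values of the same metric. The variance of the adjusted
# metric shrinks by roughly (1 - correlation^2).
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(10, 3, 100_000)                 # pre-experiment metric value per user
post = 0.8 * pre + rng.normal(2, 2, 100_000)     # in-experiment metric, correlated with pre

theta = np.cov(pre, post)[0, 1] / pre.var(ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(f"variance before CUPED: {post.var(ddof=1):.2f}")
print(f"variance after CUPED:  {post_cuped.var(ddof=1):.2f}")
```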

Capped metrics

Reduce variance in a metric by using metric capping. Capping simply censors metric values to lie within a certain interval, which limits the impact of outliers. Capping with reasonable values based on business and logical reasoning can dramatically reduce the variance in a metric, which in turn reduces the sample size requirements.
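
As a toy illustration, the sketch below caps a heavy-tailed, simulated "minutes played" style metric at a business-motivated maximum and compares the variance before and after. The distribution and the cap are made-up values for illustration only.

```python
# Toy illustration of metric capping: clip per-user values to a business-motivated
# maximum to limit the influence of outliers and reduce variance.
import numpy as np

rng = np.random.default_rng(0)
minutes_played = rng.lognormal(mean=3, sigma=1.2, size=100_000)   # heavy-tailed toy metric

capped = np.clip(minutes_played, 0, 300)   # e.g. cap at 5 hours per day, chosen from business logic

print(f"variance uncapped: {minutes_played.var():.0f}")
print(f"variance capped:   {capped.var():.0f}")
```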

Exposure filters

The default exposure definition in Confidence is that as soon as an end user of your product has applied the variant assigned by your experiment on their device, they are counted as exposed. If you are experimenting on a specific page, not all users counted as exposed under the default definition will ever visit that page and actually experience the change. Even for those who eventually do, the visit might happen days after the default exposure was recorded.

Since the default definition might include many users that in fact didn't experience the change, the observed treatment effect gets diluted. Adding more users to your experiment doesn't help if most of them can't actually be affected by the treatment. In Confidence, you can solve this issue with exposure filters. Read more about exposure filtering, also called "trigger analysis", in the blog post Reduce Dilution and Improve Sensitivity with Trigger Analysis.
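
The toy simulation below illustrates the dilution: only a fraction of assigned users ever reach the changed page, so the difference measured over all assigned users is much smaller than the effect among the users who were actually exposed. The page-visit rate and effect size are made up for illustration.

```python
# Toy simulation of dilution: only 20% of assigned users ever reach the changed page,
# so the effect measured over all assigned users is roughly a fifth of the effect
# among the users who were actually exposed.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
treated = rng.random(n) < 0.5                     # True = treatment group
visits_page = rng.random(n) < 0.20                # only these users can see the change
metric = rng.normal(10, 3, n) + np.where(treated & visits_page, 0.5, 0.0)

diff_all = metric[treated].mean() - metric[~treated].mean()
diff_triggered = metric[treated & visits_page].mean() - metric[~treated & visits_page].mean()
print(f"difference, all assigned users: {diff_all:.2f}")       # diluted towards 0.1
print(f"difference, triggered users:    {diff_triggered:.2f}")  # close to the true 0.5 lift
```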

Alpha and power

Experimentation lets us bound the risks of making the wrong decisions. The risks we are bounding are the false positive risk and the false negative risk. These risks are in conflict with each other: if we want a very low risk of finding false positive results, we must make it harder to detect a true positive effect (increasing the false negative risk).

Given that the metrics (including their NIMs/MDEs) and all other settings are fixed, the following holds:

  • If you increase the false positive rate (alpha), the required sample size to achieve power decreases. You will find more false positive results, but you will need a smaller sample size to find true effects.
  • If you decrease power (increase the false negative risk), the required sample size for a fixed alpha decreases. You need a smaller sample size, but if there is a true effect of the size you envision, you will have a lower chance of detecting it.
  • If you increase the sample size for fixed alpha, the power for a given NIM/MDE increases. Your chance of finding false positive results is still the same, but your chance of finding true effects increases.
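
As a rough illustration of these trade-offs, the sketch below uses the standard normal-approximation sample size formula for a two-sided test of a difference in means and varies alpha and power for a fixed MDE. The metric values are toy numbers, not output from Confidence.

```python
# Sketch of the alpha/power/sample size trade-off for a fixed MDE, using the
# standard normal-approximation sample size formula for a difference in means.
import numpy as np
from scipy import stats

def n_per_group(sd, mde_abs, alpha, power):
    z_a = stats.norm.ppf(1 - alpha / 2)   # two-sided test
    z_b = stats.norm.ppf(power)
    return int(np.ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / mde_abs ** 2))

# Toy metric with mean 10 and sd 30; the MDE is 1%, i.e. an absolute difference of 0.1.
for alpha in (0.01, 0.05, 0.10):
    for power in (0.8, 0.9):
        n = n_per_group(sd=30, mde_abs=0.1, alpha=alpha, power=power)
        print(f"alpha={alpha:.2f}, power={power:.0%}: {n:,} users per group")
```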

Avoid sequential tests

For A/B tests in Confidence, you can choose to see the results continuously during the experiment or only when the test ends. Although it is tempting to look at the results all the time, especially when you are starting out with experimentation, you need a smaller sample size if you only look at the results after the experiment has stopped. When you look at the results continuously, Confidence uses sequential tests to ensure that the false positive rate is not inflated. The cost of this repeated testing is lower efficiency compared to looking at the results only once. If you really need to see the results continuously, use the group sequential tests in Confidence; they are the most efficient sequential tests available. Confidence supports both always-valid and group sequential tests. See this Spotify R&D blog post for more details about sequential testing.

Even if you choose to only see the results at the end of the test, Confidence automatically checks all metrics for regressions sequentially behind the scenes. This means you don't risk unknowingly harming the end-user experience by not looking at the results continuously.

Summary

When dealing with smaller sample sizes in experimentation, remember these key principles:

  1. A "small sample" is relative. It's better defined as a sample that doesn't let you estimate impact with your desired precision. Even with smaller samples, experimentation is valuable for risk mitigation.

  2. Think of risk mitigation as a ladder:

    • Level 0: No experiment (no risk mitigation)
    • Level 1: Lack of evidence for negative impact (avoiding terrible decisions)
    • Level 2: Evidence for limited negative impact (bounding negative impact)
    • Level 3: Evidence for improvement (proving positive impact)
  3. Starting with experimentation:

    • Begin with rollouts using guardrail metrics without NIMs
    • Graduate to using larger NIMs as teams gain confidence
    • Focus on what risk mitigation you achieved rather than what you couldn't achieve
    • Make bolder changes to get stronger signals as you gain experience
  4. Optimize your experimental design:

    • Choose metrics carefully, considering their variance and variance reduction
    • Use metric capping to reduce variance
    • Apply appropriate exposure filters to avoid diluting treatment effects
    • Consider the trade-offs between alpha, power, and sample size
    • Avoid sequential testing when possible to gain efficiency

Remember: Any form of risk mitigation through experimentation is better than shipping changes blindly. Don't let perfect be the enemy of good: start with the sample size you have and focus on avoiding negative impacts before trying to prove positive ones.