Lesson 2: Treatment group proportions
This lesson explains how the relative sizes of treatment groups affect the required sample size in experiments. If the total sample size is fixed, equal group sizes are optimal in most cases. Whatever the split, a larger total sample size always improves power.
Group size and power
When discussing required sample size, it is common to refer to a single number: "the sample size." However, this total sample size is actually a combination of the sample size in the control group and the sample size in the treatment group.
Interestingly, the total required sample size changes when we change the relative sizes of the treatment groups. This is intuitive: imagine you have a sample of 100 users. When do you learn the most about the treatment effect: when the groups are split 50/50, or 99/1? If only one user is in the treatment group, that group will not provide much information about the treatment effect.
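The intuition can be checked numerically. The following is a minimal sketch, not a definitive power calculation: it assumes a two-sided z-test on a difference in means, the same standard deviation in both groups, and a hypothetical effect of 0.5 standard deviations. The function name `approx_power` and all parameter values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_control, n_treatment, effect, sd=1.0, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means.

    Assumes both groups share the standard deviation `sd` and ignores the
    negligible probability of rejecting in the wrong direction.
    """
    z = NormalDist()
    se = sd * sqrt(1 / n_control + 1 / n_treatment)
    return z.cdf(abs(effect) / se - z.inv_cdf(1 - alpha / 2))


# 100 users in total, hypothetical effect of 0.5 standard deviations.
for n_control, n_treatment in [(50, 50), (75, 25), (90, 10), (99, 1)]:
    power = approx_power(n_control, n_treatment, effect=0.5)
    print(f"{n_control}/{n_treatment}: power = {power:.2f}")
```

Under these assumptions, the 50/50 split reaches a power of roughly 0.7, while the 99/1 split stays below 0.1 with the same 100 users.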
Fixed sample size: Equal group sizes maximize power
For continuous metrics, it is optimal from a power perspective to have equal group sizes. For binary metrics, the optimal group sizes depend on the Minimum Detectable Effect (MDE) and the baseline proportion. For the optimal group sizes to deviate from equal, the baseline proportion and the proportion under the hypothetical treatment effect must differ a lot. In other words, unless the MDE is very large, it is a good general rule to aim for similar group sizes. If you are interested in the mathematical details, see the derivation of optimal treatment group sizes in the Note for nerds section below.
If the total possible sample size is fixed, it is a good general rule to aim for similar group sizes.
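To get a feel for how costly an uneven split is, the sketch below computes the required total sample size relative to a 50/50 split. It assumes a continuous metric with equal variance in both groups, in which case the required total is proportional to 1/share + 1/(1 - share). The function name `sample_size_inflation` is a hypothetical helper for illustration.

```python
def sample_size_inflation(treatment_share):
    """Required total sample size relative to a 50/50 split.

    Assumes a continuous metric with equal variance in both groups, where
    the required total is proportional to 1/share + 1/(1 - share).
    """
    if not 0 < treatment_share < 1:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    return 1 / (4 * treatment_share * (1 - treatment_share))


for share in (0.5, 0.4, 0.25, 0.1, 0.01):
    print(f"treatment share {share:.2f}: {sample_size_inflation(share):.2f}x")
```

Under this assumption, a 60/40 split costs only about 4% more users than 50/50, while a 90/10 split needs close to 2.8 times as many. This is why "similar" group sizes are usually good enough.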
Larger total sample size is always better
It's important to realize that:
- For a fixed total sample size, it is optimal to have similar group sizes to maximize power.
- It is always better to have a larger total sample size.
This also means that if you have a fixed number of users that can be exposed to the treatment (for example, due to legal or budget constraints), the larger the control group, the better. In other words, if the size of one group is fixed for some reason, increasing the size of the other group will always improve power.
This is because the precision of the estimated treatment effect depends on the uncertainty of the mean in both the treatment and control groups, and adding users to either group reduces that uncertainty.
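A small numeric sketch of this point, assuming a difference-in-means estimate with the same standard deviation in both groups (the helper `standard_error` and the group sizes are illustrative assumptions):

```python
from math import sqrt


def standard_error(n_control, n_treatment, sd=1.0):
    """Standard error of the difference in means, assuming a shared `sd`."""
    return sd * sqrt(1 / n_control + 1 / n_treatment)


n_treatment = 30  # fixed, for example by a budget constraint
for n_control in (30, 100, 300, 970, 10_000):
    se = standard_error(n_control, n_treatment)
    print(f"control = {n_control:>6}: standard error = {se:.3f}")

# The standard error can never drop below sd / sqrt(n_treatment).
print(f"lower bound: {1.0 / sqrt(n_treatment):.3f}")
```

Every increase in the control group lowers the standard error, but with diminishing returns: under these assumptions it cannot fall below the contribution of the fixed treatment group.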
Risky treatments
If a treatment is risky, you might want to limit the number of users exposed to this treatment for risk mitigation purposes. In such cases, the power can be improved by increasing the size of the control group. Sometimes for risky treatments at Spotify, the treatment group is fixed to a small size, and the control group is made as large as possible to maximize power. This requires some fiddling in practice, since both the share of the population allocated to the experiment and the treatment proportions are relative quantities.
For example, suppose the population is 1000 users and you want 30 to be exposed to the treatment. You could then have any number up to 970 in the control group. You could run a 50/50 split on 6% of the population to have 30 in each group, or a 97/3 split on 100% of the population to have 970 in the control group and 30 in the treatment group.
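The arithmetic behind this example can be sketched as follows. The helper `allocation` is hypothetical and only converts absolute group sizes into the two relative settings (share of the population exposed, and split within the experiment). It does not reflect any particular experimentation tool's API.

```python
from fractions import Fraction


def allocation(population, n_treatment, n_control):
    """Return (share of population exposed, treatment share within the experiment)."""
    n_total = n_treatment + n_control
    if n_total > population:
        raise ValueError("groups do not fit in the population")
    return Fraction(n_total, population), Fraction(n_treatment, n_total)


for n_control in (30, 970):
    exposed, treatment_share = allocation(1000, 30, n_control)
    print(
        f"control = {n_control}: run on {float(exposed):.0%} of the population "
        f"with a {float(1 - treatment_share):.0%}/{float(treatment_share):.0%} split"
    )
```

Both settings expose exactly 30 users to the treatment; only the size of the control group differs.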
What is the optimal group size allocation for continuous metrics when the total sample size is fixed?
Why is it better to have a larger total sample size in experiments?
What happens to the required sample size if the relative sizes of the treatment groups are uneven?
Note for nerds
It's in fact quite straightforward to derive the optimal group sizes for binary and continuous metrics. If calculus is not your thing, feel free to skip this section.
Binary metrics
Let's derive the optimal proportions for binary metrics step by step:
Initial setup
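Let $n_1$ and $n_2$ be the sample sizes of two treatment groups, and $p_1$ and $p_2$ be the baseline proportion and the proportion under the hypothetical treatment effect. Define $n_1 = \kappa N$ and $n_2 = (1 - \kappa) N$, where $N = n_1 + n_2$ is the total sample size and $0 < \kappa < 1$. For simplicity, let $v_i = p_i (1 - p_i)$ for $i = 1, 2$.
Step 1: Express total sample size
The minimum required sample size for given type-I and type-II risks $\alpha$ and $\beta$ is found by solving the following equation, where $z_q$ denotes the $q$-quantile of the standard normal distribution:
$$\frac{|p_2 - p_1|}{\sqrt{\dfrac{v_1}{\kappa N} + \dfrac{v_2}{(1 - \kappa) N}}} = z_{1 - \alpha/2} + z_{1 - \beta}$$
This expands to:
$$N = \frac{(z_{1 - \alpha/2} + z_{1 - \beta})^2}{(p_2 - p_1)^2} \left( \frac{v_1}{\kappa} + \frac{v_2}{1 - \kappa} \right)$$
Step 2: Take derivative
Taking the derivative with respect to $\kappa$:
$$\frac{dN}{d\kappa} = \frac{(z_{1 - \alpha/2} + z_{1 - \beta})^2}{(p_2 - p_1)^2} \left( -\frac{v_1}{\kappa^2} + \frac{v_2}{(1 - \kappa)^2} \right)$$
Step 3: Set to zero and solve
Setting to zero:
$$\frac{v_1}{\kappa^2} = \frac{v_2}{(1 - \kappa)^2} \quad \Longrightarrow \quad \frac{\kappa}{1 - \kappa} = \sqrt{\frac{v_1}{v_2}}$$
This implies that for a baseline proportion $p_1$ and a hypothetical treatment group proportion $p_2$, it is optimal to have:
$$\frac{n_1}{n_2} = \frac{\sqrt{p_1 (1 - p_1)}}{\sqrt{p_2 (1 - p_2)}}$$
Clearly, if $p_1 \approx p_2$, then $n_1 / n_2$ is close to $1$, which makes the rule of keeping the groups similar a good general guideline. For the nerds who paid attention in the previous lesson, this also of course implies that for binary guardrail metrics, the optimal group sizes are equal.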
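To see how close to equal the optimal split usually is, here is a minimal numeric sketch. It assumes an unpooled two-proportion z-test, so that the required total sample size is proportional to v1/share + v2/(1 - share) with v = p(1 - p). The function name `optimal_baseline_share` and the example proportions are illustrative assumptions.

```python
from math import sqrt


def optimal_baseline_share(p_baseline, p_treatment):
    """Share of the total sample to place in the baseline group.

    Assumes an unpooled two-proportion z-test, where the required total
    sample size is proportional to v1/share + v2/(1 - share) with
    v = p * (1 - p). Minimizing over `share` gives the expression below.
    """
    s1 = sqrt(p_baseline * (1 - p_baseline))
    s2 = sqrt(p_treatment * (1 - p_treatment))
    return s1 / (s1 + s2)


for p1, p2 in [(0.10, 0.11), (0.10, 0.20), (0.01, 0.50)]:
    share = optimal_baseline_share(p1, p2)
    print(f"p1 = {p1}, p2 = {p2}: baseline share = {share:.3f}")
```

For a realistic effect such as 10% to 11%, the optimal baseline share is about 0.49, which is practically an equal split. Only when the two proportions differ drastically does the optimum move far from 0.5.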
Continuous metrics
For continuous metrics, let's derive the optimal group sizes step by step:
Initial setup
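Let $\mu_1$ and $\mu_2$ be the means of two groups on some continuous metric with common variance $\sigma^2$, and let $n_1 = \kappa N$ and $n_2 = (1 - \kappa) N$ be the group sizes, where $N$ is the total sample size and $0 < \kappa < 1$. Since the variance doesn't depend on the treatment effect, our optimization simplifies.
Step 1: Express total sample size
We want to minimize the following expression, where $z_q$ denotes the $q$-quantile of the standard normal distribution and $\alpha$ and $\beta$ are the type-I and type-II risks:
$$N = \frac{(z_{1 - \alpha/2} + z_{1 - \beta})^2}{(\mu_2 - \mu_1)^2} \left( \frac{\sigma^2}{\kappa} + \frac{\sigma^2}{1 - \kappa} \right)$$
This expands to:
$$N = \frac{(z_{1 - \alpha/2} + z_{1 - \beta})^2 \sigma^2}{(\mu_2 - \mu_1)^2} \left( \frac{1}{\kappa} + \frac{1}{1 - \kappa} \right)$$
Step 2: Take derivative
Taking the derivative with respect to $\kappa$:
$$\frac{dN}{d\kappa} = \frac{(z_{1 - \alpha/2} + z_{1 - \beta})^2 \sigma^2}{(\mu_2 - \mu_1)^2} \left( -\frac{1}{\kappa^2} + \frac{1}{(1 - \kappa)^2} \right)$$
Step 3: Set to zero and solve
Setting to zero:
$$\frac{1}{\kappa^2} = \frac{1}{(1 - \kappa)^2} \quad \Longrightarrow \quad \kappa = \frac{1}{2}$$
This implies that it is optimal to have equal group sizes ($n_1 = n_2$).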
Summary
- For binary metrics, the treatment effect impacts the variance, so the optimal group sizes depend on the baseline proportion and the MDE.
- For continuous metrics, equal group sizes are always optimal.
- In all cases, a larger total sample size will improve power.