Lesson 2: Treatment group proportions


Group size and power

When discussing required sample size, it is common to refer to a single number: "the sample size." However, this total sample size is actually a combination of the sample size in the control group and the sample size in the treatment group.

Interestingly, the total required sample size is not fixed: it changes with the relative sizes of the treatment groups. This is intuitive. Imagine you have a sample of 100 users. When do you learn the most about the treatment effect: when the groups are split 50/50, or when they are split 99/1? If only one user is in the treatment group, that group will not provide much information about the treatment effect.
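To make this intuition concrete, here is a minimal sketch. It assumes the treatment effect is estimated as a difference in means and that both groups share the same per-user variance; both are illustrative assumptions, and `se_factor` is a hypothetical helper name. It prints the standard error of the estimate, in units of the per-user standard deviation, for a few ways of splitting 100 users.

```python
import math

def se_factor(n_control, n_treatment):
    """Standard error of a difference in means, in units of the
    per-user standard deviation (assumed equal in both groups)."""
    return math.sqrt(1 / n_control + 1 / n_treatment)

# Different ways of splitting the same 100 users.
for n_control, n_treatment in [(50, 50), (75, 25), (90, 10), (99, 1)]:
    factor = se_factor(n_control, n_treatment)
    print(f"{n_control}/{n_treatment}: SE factor = {factor:.3f}")
```

Under these assumptions, the 99/1 split gives a standard error roughly five times larger than the 50/50 split, even though the total number of users is the same.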


Fixed sample size: Equal group sizes maximize power

For continuous metrics, it is optimal from a power perspective to have equal group sizes. For binary metrics, the optimal group sizes depend on the Minimum Detectable Effect (MDE) and the baseline proportion. For the optimal group sizes to deviate from equal, the baseline proportion and the proportion under the hypothetical treatment effect must differ a lot. In other words, unless the MDE is very large, it is a good general rule to aim for similar group sizes. If you are interested in the mathematical details, see the derivation of optimal treatment group sizes in the Note for nerds section below.


Larger total sample size is always better

It's important to realize that:

  1. For a fixed total sample size, it is optimal to have similar group sizes to maximize power.
  2. It is always better to have a larger total sample size.

This also means that if you have a fixed number of users that can be exposed to the treatment (for example, due to legal or budget constraints), the larger the control group, the better. In other words, if the size of one group is fixed for some reason, increasing the size of the other group will always improve power.

This is because we want to minimize the uncertainty of the mean for both the treatment and control groups to accurately estimate the treatment effect.
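As a rough illustration of the fixed-treatment-group case, the sketch below keeps the treatment group at 30 users and grows the control group. It uses a normal approximation for a two-sided test on a difference in means; the effect size, standard deviation, significance level, and the function name `approx_power` are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def approx_power(n_control, n_treatment, effect, sigma, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means.

    Assumes the same known standard deviation `sigma` in both groups and
    ignores the tiny chance of rejecting in the wrong direction.
    """
    se = sigma * math.sqrt(1 / n_control + 1 / n_treatment)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(effect) / se - z_crit)

N_TREATMENT = 30  # fixed, e.g. for legal or budget reasons
for n_control in (30, 100, 300, 970):
    power = approx_power(n_control, N_TREATMENT, effect=0.5, sigma=1.0)
    print(f"control={n_control:4d}, treatment={N_TREATMENT}: power ~ {power:.2f}")
```

Power rises with every increase of the control group, but with diminishing returns, because the uncertainty of the fixed treatment group's mean remains.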

Risky treatments

If a treatment is risky, you might want to limit the number of users exposed to it for risk mitigation purposes. In such cases, the power can be improved by increasing the size of the control group. Sometimes for risky treatments at Spotify, the treatment group is fixed at a small size, and the control group is made as large as possible to maximize power. This requires some fiddling in practice, since the share of the population allocated to the experiment and the treatment proportions within it are both relative quantities.

For example, suppose the population is 1000 users and you want 30 of them to be exposed to the treatment. You could then have any number up to 970 in the control group. You could run a 50/50 split on 6% of the population to have 30 in each group, or a 97/3 split on 100% of the population to have 970 in the control group and 30 in the treatment group.
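The arithmetic in this example can be sketched as a small helper. The function name `allocation_for` and its return shape are hypothetical and only serve the illustration: given the population and the desired absolute group sizes, it returns the share of the population to allocate and the split within that allocation.

```python
from fractions import Fraction

def allocation_for(population, n_treatment, n_control):
    """Return (share of population allocated, control share, treatment share)
    needed to obtain the requested absolute group sizes."""
    total = n_treatment + n_control
    if total > population:
        raise ValueError("requested groups exceed the population")
    return (
        Fraction(total, population),   # share of the population in the experiment
        Fraction(n_control, total),    # control share within the experiment
        Fraction(n_treatment, total),  # treatment share within the experiment
    )

# 30 treated users out of a population of 1000, with two control sizes.
for n_control in (30, 970):
    allocated, control, treatment = allocation_for(1000, 30, n_control)
    print(
        f"allocate {float(allocated):.0%} of the population, "
        f"split {float(control):.0%}/{float(treatment):.0%} -> "
        f"{n_control} control, 30 treatment"
    )
```

Both settings expose exactly 30 users to the treatment; only the size of the control group differs.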


Optimal Group Sizes for Power

Note for nerds

It's in fact quite straightforward to derive the optimal group sizes for binary and continuous metrics. If calculus is not your thing, feel free to skip this section.

Binary metrics

Let's derive the optimal proportions for binary metrics step by step:

Initial setup

Let $N_a$ and $N_b$ be the sample sizes of the two treatment groups, and let $p_a$ and $p_b$ be the baseline proportion and the proportion under the hypothetical treatment effect, respectively. Define $\kappa = N_a / N_b$, where $\kappa > 0$. For simplicity, let $v_j = p_j(1 - p_j)$ for $j \in \{a, b\}$.

Step 1: Express total sample size

The minimum required sample size for given type-I and type-II risks is found by solving the problem below, where the first term is $N_b$ and the second term is $N_a = \kappa N_b$:

$$\argmin_{\kappa} N = \argmin_{\kappa} \left(\left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 \times (v_a / \kappa + v_b) + \left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 \times (v_a + v_b \kappa)\right)$$

This expands to:

$$\argmin_{\kappa} N = \argmin_{\kappa} \left(\left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_a / \kappa + \left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_b + \left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_a + \left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_b \kappa\right)$$

Step 2: Take derivative

Taking the derivative with respect to $\kappa$:

$$\frac{\partial N}{\partial \kappa} = -\left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_a / \kappa^2 + \left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_b$$

Step 3: Set to zero and solve

Setting to zero:

$$-\left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_a / \kappa^2 + \left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_b = 0$$

$$\left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_a / \kappa^2 = \left(\frac{Z_{\alpha}+Z_{\beta}}{p_a-p_b}\right)^2 v_b$$

$$v_a / \kappa^2 = v_b$$

$$v_a / v_b = \kappa^2$$

$$\kappa = \sqrt{v_a / v_b} = \sqrt{\frac{p_a(1-p_a)}{p_b(1-p_b)}}$$

This implies that for a baseline proportion $p_a$ and a hypothetical treatment group proportion $p_b$, it is optimal to have:

$$N_a = N_b \sqrt{\frac{p_a(1-p_a)}{p_b(1-p_b)}}$$

That is, the group whose proportion is closer to 0.5, and which therefore has the larger variance, should be the larger group.

Clearly, if $p_a \approx p_b$, then $N_a$ is close to $N_b$, which makes the rule of keeping the groups similar a good general guideline. For the nerds who paid attention in the previous lesson, this also of course implies that for binary guardrail metrics, the optimal group sizes are equal.
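The result can also be checked numerically. The sketch below is only an illustration under stated assumptions: a two-sided test with a significance level of 0.05 and power of 0.8 (neither affects where the minimum lies), a variance of $v_a / N_a + v_b / N_b$ for the estimated difference, and hypothetical function names. It searches a grid of ratios $N_a / N_b$ for the one that minimizes the total sample size and compares it with $\sqrt{v_a / v_b}$.

```python
import math
from statistics import NormalDist

def total_sample_size(ratio, p_a, p_b, alpha=0.05, power=0.8):
    """Total N = N_a + N_b for a given ratio = N_a / N_b."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    c = (z / (p_a - p_b)) ** 2
    v_a, v_b = p_a * (1 - p_a), p_b * (1 - p_b)
    n_b = c * (v_a / ratio + v_b)
    n_a = c * (v_a + v_b * ratio)  # equals ratio * n_b
    return n_a + n_b

def best_ratio(p_a, p_b):
    """Grid search for the ratio N_a / N_b that minimizes total N."""
    grid = [i / 1000 for i in range(200, 5001)]
    return min(grid, key=lambda ratio: total_sample_size(ratio, p_a, p_b))

for p_a, p_b in [(0.10, 0.11), (0.50, 0.90)]:
    closed_form = math.sqrt(p_a * (1 - p_a) / (p_b * (1 - p_b)))
    print(f"p_a={p_a}, p_b={p_b}: grid optimum N_a/N_b = {best_ratio(p_a, p_b):.3f}, "
          f"sqrt(v_a/v_b) = {closed_form:.3f}")
```

With a small effect (0.10 versus 0.11) the optimal ratio stays close to 1, while the very large effect (0.50 versus 0.90) pushes it clearly away from equal group sizes.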

Continuous metrics

For continuous metrics, let's derive the optimal group sizes step by step:

Initial setup

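Let $m_a$ and $m_b$ be the means of the two groups on some continuous metric, let $\sigma^2$ be the variance of the metric, assumed to be the same in both groups, and let $\kappa$ be the ratio of the group sizes as defined above. Since the variance doesn't depend on the treatment effect, our optimization simplifies.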

Step 1: Express total sample size

We want to minimize:

$$\argmin_{\kappa} N = \argmin_{\kappa} \left(\left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \times \sigma^2 (1 + 1 / \kappa) + \left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \times \sigma^2 (\kappa + 1)\right)$$

This expands to:

$$\argmin_{\kappa} N = \argmin_{\kappa} \left(\left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2 + \left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2 / \kappa + \left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2 \kappa + \left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2\right)$$

Step 2: Take derivative

Taking the derivative with respect to $\kappa$:

$$\frac{\partial N}{\partial \kappa} = -\left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2 / \kappa^2 + \left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2$$

Step 3: Set to zero and solve

Setting to zero:

$$\left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2 / \kappa^2 = \left(\frac{Z_{\alpha}+Z_{\beta}}{m_a-m_b}\right)^2 \sigma^2$$

$$\sigma^2 / \kappa^2 = \sigma^2$$

$$\kappa^2 = \frac{\sigma^2}{\sigma^2}$$

$$\kappa = 1$$

This implies that it is optimal to have equal group sizes ($N_a = N_b$).
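The same objective also shows how costly an unequal split is. The total sample size is proportional to $2 + \kappa + 1/\kappa$, which equals 4 at $\kappa = 1$, so dividing by 4 gives the required total relative to an equal split. The sketch below, with the hypothetical helper name `inflation`, evaluates this factor for a few splits under the equal-variance assumption used in the derivation.

```python
def inflation(kappa):
    """Required total sample size relative to an equal split,
    for a group-size ratio kappa and equal variances in both groups."""
    return (2 + kappa + 1 / kappa) / 4

for label, kappa in [("50/50", 1), ("70/30", 70 / 30), ("90/10", 9), ("99/1", 99)]:
    print(f"{label}: {inflation(kappa):.2f}x the total sample size of a 50/50 split")
```

The factor is symmetric in $\kappa$ and $1/\kappa$, so it does not matter which of the two groups is the larger one.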

Summary

  • For binary metrics, the treatment effect impacts the variance, so the optimal group sizes depend on the baseline proportion and the MDE.
  • For continuous metrics, equal group sizes are always optimal.
  • In all cases, a larger total sample size will improve power.