The sampling distribution is the probability distribution of a statistic (like a sample mean or a difference in means) computed across all possible random samples of a given size from a population. It describes how much the statistic would vary if you repeated the experiment many times, each time with a fresh random sample of users.
Sampling distributions are the foundation of hypothesis testing. When you run an A/B test and compute the difference in conversion rates between treatment and control, that observed difference is one draw from a sampling distribution. The width of that distribution tells you how much the result would bounce around under repeated experimentation, which is exactly what a confidence interval quantifies. Every p-value, every confidence interval, and every power calculation in Confidence rests on the properties of sampling distributions.
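The repeated-experiment picture is easy to make concrete. A minimal simulation (with made-up population parameters and sample sizes, not data from any real experiment) draws many fresh samples and traces out the sampling distribution of a difference in means:

```python
import random
import statistics

random.seed(42)

def one_experiment(n=500, true_effect=0.3):
    # Each "experiment" draws fresh control and treatment samples
    # from hypothetical populations of user-level metric values.
    control = [random.gauss(10.0, 4.0) for _ in range(n)]
    treatment = [random.gauss(10.0 + true_effect, 4.0) for _ in range(n)]
    return statistics.mean(treatment) - statistics.mean(control)

# One run is a single draw from the sampling distribution;
# thousands of runs trace out the whole distribution.
effects = [one_experiment() for _ in range(2000)]

center = statistics.mean(effects)   # close to the true effect (0.3)
spread = statistics.stdev(effects)  # the standard error, ~sqrt(2 * 4**2 / 500)
```

The spread of `effects` is exactly what a single experiment's standard error estimates: how far any one observed difference can land from the true effect.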
Why does the sampling distribution matter for A/B testing?
A single experiment gives you one observed treatment effect. The sampling distribution tells you the range of treatment effects you'd see if you ran the same experiment thousands of times with different random samples of users.
This matters for two reasons.
First, it tells you how much to trust the result. A narrow sampling distribution (low standard error) means different random samples would give similar answers. Your single observation is representative. A wide sampling distribution means different samples could give very different answers, and your result might be far from the true effect.
Second, it's what makes hypothesis testing possible. The null hypothesis (treatment has no effect) implies a specific sampling distribution: the difference in means should be centered at zero with a spread determined by the metric variance and sample size. Your observed difference either falls within the range expected under this null distribution (not significant) or far enough into the tails that chance is an implausible explanation (significant).
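A sketch of that null logic for a conversion metric, using the standard two-proportion z-test (the counts here are hypothetical):

```python
import math

def two_proportion_z(conv_c, n_c, conv_t, n_t):
    # Conversion rates in control and treatment.
    p_c, p_t = conv_c / n_c, conv_t / n_t
    # Under the null, both groups share one rate; the pooled estimate
    # sets the spread of the null sampling distribution.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(conv_c=1000, n_c=20000, conv_t=1120, n_t=20000)
# |z| > 1.96 puts the observed difference in the tails of the null
# distribution, so it is significant at the 5% level.
```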
How does the central limit theorem connect to sampling distributions?
The central limit theorem (CLT) is the reason sampling distributions are useful in practice. It states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the shape of the underlying data.
This is a strong result. User-level metrics can be wildly non-normal: revenue per user is right-skewed with a long tail, session counts have a spike at zero, and binary metrics follow a Bernoulli distribution. None of these look anything like a bell curve at the individual user level. But the average across thousands of users follows an approximately normal distribution, and the average across millions is normal for all practical purposes.
At Spotify, experiments routinely involve millions of users. The CLT's normal approximation is excellent at this scale, which is why the z-test (based on the standard normal distribution) is the default statistical test in Confidence. For the rare cases with smaller samples, the t-distribution provides a slightly wider sampling distribution that accounts for the additional uncertainty in estimating variance from limited data.
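A quick illustration of the CLT at work, using an exponential distribution as a stand-in for a skewed revenue-per-user metric (all parameters illustrative):

```python
import random
import statistics

random.seed(7)

# Heavily right-skewed "revenue per user" population.
population = [random.expovariate(1 / 20.0) for _ in range(200_000)]

def skewness(xs):
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# Sampling distribution of the mean: many samples of 2,000 users each.
means = [statistics.mean(random.sample(population, 2000)) for _ in range(1000)]

raw_skew = skewness(population)  # roughly 2 for an exponential
mean_skew = skewness(means)      # near 0: the means are ~normal
```

The user-level data is strongly skewed, but the distribution of sample means is nearly symmetric, which is what justifies the normal approximation.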
How do sample size and variance affect the sampling distribution?
Two factors determine the width of the sampling distribution: the variance of the underlying metric and the sample size.
Higher metric variance means a wider sampling distribution. If individual users' behavior varies enormously (some stream 0 songs, others stream 500), the mean computed from any single sample will be noisy. Variance reduction techniques like CUPED, metric capping, and trigger analysis directly narrow the sampling distribution by reducing the metric variance. CUPED typically shrinks the variance by 20-50%, which narrows the sampling distribution by 10-30%.
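A minimal CUPED sketch on synthetic data (this is the textbook adjustment, not Confidence's internal implementation): the in-experiment metric y is adjusted with a pre-experiment covariate x, and theta = cov(x, y) / var(x) minimizes the adjusted variance:

```python
import random

random.seed(1)

n = 10_000
# x: each user's pre-experiment metric; y: correlated in-experiment metric.
x = [random.gauss(50, 15) for _ in range(n)]
y = [0.8 * xi + random.gauss(10, 8) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
var_x = sum((a - x_bar) ** 2 for a in x) / (n - 1)

theta = cov_xy / var_x
# The adjustment preserves the mean but strips out variation
# explained by pre-experiment behavior.
y_cuped = [b - theta * (a - x_bar) for a, b in zip(x, y)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

reduction = 1 - variance(y_cuped) / variance(y)
# Strong pre/post correlation here gives a large variance reduction,
# which narrows the sampling distribution of the mean accordingly.
```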
Larger sample sizes mean a narrower sampling distribution. The standard error scales with 1/sqrt(n): doubling the sample size reduces the standard error by about 30%. This is why underpowered experiments produce ambiguous results. With too few users, the sampling distribution is so wide that both "the treatment helped" and "the treatment did nothing" are consistent with the data.
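The scaling arithmetic is worth checking directly. With SE = sqrt(variance / n) (the numbers below are illustrative):

```python
import math

def standard_error(variance, n):
    return math.sqrt(variance / n)

base = standard_error(variance=100.0, n=10_000)        # 0.1
doubled_n = standard_error(variance=100.0, n=20_000)
halved_var = standard_error(variance=50.0, n=10_000)

# Doubling n shrinks the SE by a factor of 1/sqrt(2), about 30%...
shrink = 1 - doubled_n / base
# ...and halving the variance produces exactly the same width,
# which is the sample-size equivalence behind variance reduction.
```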
This relationship is also why variance reduction is so valuable. Because the standard error depends on the ratio of variance to sample size, cutting the metric variance in half has the same effect on the sampling distribution's width as doubling the sample size. For teams with limited traffic, variance reduction is often the only practical path to adequate power.
How do confidence intervals relate to the sampling distribution?
A 95% confidence interval is constructed so that, if you drew many samples and built an interval from each one, 95% of those intervals would contain the true treatment effect. The width of the interval is directly proportional to the standard error of the sampling distribution.
Narrow intervals come from narrow sampling distributions: low variance, large sample sizes, or both. Wide intervals come from wide sampling distributions: high variance, small samples, or both.
When Confidence reports a treatment effect of +2.1% with a 95% CI of [0.8%, 3.4%], it's saying: an interval centered at the observed effect, with half-width set by the standard error of the sampling distribution, would contain the true effect in 95% of repeated experiments. The interval is a direct reflection of the sampling distribution's shape.
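The interval arithmetic is just the normal sampling distribution at work. Assuming a standard error of about 0.66 percentage points (a made-up value chosen to reproduce the example interval):

```python
z_95 = 1.959963984540054   # 97.5th percentile of the standard normal

effect = 0.021             # observed treatment effect, +2.1%
se = 0.0066                # standard error of the effect (assumed)

lower = effect - z_95 * se
upper = effect + z_95 * se
# lower ≈ 0.008 (0.8%), upper ≈ 0.034 (3.4%), matching the example.
```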