Statistical Methods

What is a z-test?

A z-test is a hypothesis test that uses the standard normal distribution to determine whether the observed difference between two groups is statistically significant. In A/B testing, the z-test compares the mean metric value in the treatment group to the mean in the control group, accounting for the variability of the metric and the sample size. It's the workhorse statistical test behind most large-scale experimentation platforms, including Confidence.

The z-test works because of the central limit theorem: regardless of how individual user metrics are distributed, the difference in sample means between two large groups follows an approximately normal distribution. For the sample sizes typical in product experimentation (thousands to millions of users), this approximation is excellent. At Spotify, where experiments routinely involve millions of users across 186 markets, the z-test's large-sample assumptions are comfortably satisfied.

How does a z-test work in an A/B test?

The mechanics are straightforward. You have two groups of users: control (n_c users, mean metric Y_c) and treatment (n_t users, mean metric Y_t). The z-statistic is:

z = (Y_t - Y_c) / SE

where SE is the standard error of the difference in means. For large samples it is estimated from the data as SE = sqrt(s_t^2 / n_t + s_c^2 / n_c), where s_t^2 and s_c^2 are the sample variances of the metric in the treatment and control groups.

If the absolute value of z exceeds the critical value (1.96 for a two-sided test at alpha = 0.05), you reject the null hypothesis: the observed difference is unlikely to have occurred by chance alone.

The confidence interval follows directly. A 95% confidence interval for the treatment effect is (Y_t - Y_c) ± 1.96 × SE. This interval tells you the range of plausible treatment effect sizes given the data.
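As a concrete illustration, here is a minimal sketch of this computation in Python. The function name and the scipy-based implementation are illustrative assumptions, not Confidence's internal code:

```python
import numpy as np
from scipy import stats

def z_test(y_treatment, y_control, alpha=0.05):
    """Two-sided two-sample z-test on the difference in means."""
    y_t = np.asarray(y_treatment, dtype=float)
    y_c = np.asarray(y_control, dtype=float)
    n_t, n_c = len(y_t), len(y_c)

    diff = y_t.mean() - y_c.mean()
    # Standard error of the difference, estimated from the sample variances.
    se = np.sqrt(y_t.var(ddof=1) / n_t + y_c.var(ddof=1) / n_c)

    z = diff / se
    p_value = 2 * stats.norm.sf(abs(z))      # two-sided p-value
    z_crit = stats.norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z, p_value, ci
```

The decision rule and the confidence interval come from the same three numbers: the difference in means, the standard error, and the critical value.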

When is a z-test appropriate?

The z-test requires two conditions.

Large sample sizes. The central limit theorem needs enough observations for the normal approximation to hold. For most product metrics, "enough" means hundreds of users per group at a minimum, and the approximation improves rapidly with more data. With the sample sizes typical in online experimentation (tens of thousands to millions), the approximation is near-exact.

Known or well-estimated variance. The z-test treats the variance as known. In practice, you estimate it from the data, which technically makes it a t-test. For large samples, the t-distribution and the standard normal distribution are indistinguishable (the critical values differ by roughly 0.1% at n = 1,000), so the distinction is academic for most online experiments.
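A quick check makes the size of that gap concrete (an illustrative comparison using scipy, not a Confidence computation):

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)       # standard normal: ~1.9600
t_crit = stats.t.ppf(1 - alpha / 2, df=999)  # Student's t at n = 1,000: ~1.9623

print(f"relative difference: {(t_crit - z_crit) / z_crit:.2%}")  # ~0.1%
```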

The z-test doesn't require the underlying metric to be normally distributed. Revenue per user, session counts, and other heavily skewed metrics are all fine. The central limit theorem ensures the test statistic is approximately normal even when the individual observations are not. What matters is the sample size, not the shape of the raw data.
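This claim is easy to verify by simulation. The sketch below (illustrative; the lognormal "revenue per user" metric and all parameters are assumptions) draws a heavily skewed metric under the null hypothesis and checks that the z-test's false positive rate still lands near the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_null_z(n_per_group=5_000, n_sims=2_000):
    """Simulate z-statistics under the null for a heavily skewed metric."""
    z_stats = []
    for _ in range(n_sims):
        # Lognormal draws: highly right-skewed, far from normal.
        y_c = rng.lognormal(mean=0.0, sigma=1.5, size=n_per_group)
        y_t = rng.lognormal(mean=0.0, sigma=1.5, size=n_per_group)
        se = np.sqrt(y_t.var(ddof=1) / n_per_group +
                     y_c.var(ddof=1) / n_per_group)
        z_stats.append((y_t.mean() - y_c.mean()) / se)
    return np.asarray(z_stats)

z = simulate_null_z()
# Under the null, roughly 5% of |z| values should exceed 1.96.
print(f"false positive rate: {(np.abs(z) > 1.96).mean():.3f}")
```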

How does the z-test relate to other methods in Confidence?

The z-test is the atomic unit of Confidence's statistical methodology. More complex methods build on it.

CUPED variance reduction produces an adjusted metric with lower variance, then applies a z-test to the adjusted values. The test itself doesn't change; the input metric is denoised first.
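In code, the adjustment step looks roughly like this (a minimal CUPED sketch under the standard formulation; the pre-experiment covariate x and the function name are assumptions for illustration):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: remove the variance in metric y explained by a
    pre-experiment covariate x (e.g. the same metric measured
    before the experiment started). theta is the OLS coefficient
    of y on x, typically estimated pooled across both groups."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# The adjusted metric has the same expected difference between groups
# but lower variance, so the ordinary z-test is simply run on the
# output of cuped_adjust instead of the raw metric.
```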

Sequential testing (Group Sequential Tests and always-valid inference) extends the z-test to handle repeated analysis. Instead of a single critical value of 1.96, GSTs use adjusted boundaries at each interim look to control the overall false positive rate. The underlying test statistic at each look is still a z-statistic.
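The structure of the boundary adjustment can be illustrated with the most conservative possible scheme, a Bonferroni split of alpha across looks. Real GSTs use an alpha-spending function (e.g. Lan-DeMets with an O'Brien-Fleming-style boundary), which is less conservative; this sketch only shows the shape of the idea, not Confidence's actual boundaries:

```python
from scipy import stats

def conservative_boundary(alpha=0.05, n_looks=5):
    """Illustrative interim boundary: split alpha evenly across looks.
    Spending alpha / n_looks at each look controls the overall false
    positive rate, at the cost of being more conservative than a
    properly computed group sequential boundary."""
    per_look_alpha = alpha / n_looks
    return stats.norm.ppf(1 - per_look_alpha / 2)

# Each of the 5 interim z-statistics is compared against ~2.576
# instead of 1.96; crossing the boundary at any look stops the test.
print(conservative_boundary())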

Multiple testing corrections adjust the significance threshold when testing multiple metrics simultaneously. Bonferroni, for example, divides alpha by the number of success metrics. The test for each metric is still a z-test; the correction changes the threshold for declaring significance.
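A sketch of the arithmetic, with hypothetical metric names and p-values (not output from Confidence):

```python
alpha = 0.05
p_values = {"retention": 0.004, "minutes_played": 0.030, "revenue": 0.200}

# Bonferroni: divide alpha by the number of success metrics.
adjusted_alpha = alpha / len(p_values)  # 0.05 / 3 ~ 0.0167
for metric, p in p_values.items():
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{metric}: p = {p} -> {verdict}")
```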

This composability is deliberate. Because every extension builds on the same z-test foundation, the interactions between methods are well understood and mathematically tractable. Confidence's approach of supporting one framework deeply, rather than offering multiple partially supported frameworks, follows from this design principle.

What's the relationship between z-tests and Bayesian methods?

For large samples with uninformative priors, Bayesian credible intervals and frequentist z-test confidence intervals converge to the same numerical values. The Bayesian "probability that the treatment is better" is a monotonic transformation of the z-test's p-value. This mathematical equivalence is one reason Confidence uses the frequentist z-test as its foundation rather than offering both frameworks: for the experiments most teams run, the choice is largely notational, and a single framework simplifies interpretation without sacrificing accuracy.
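Concretely, with a flat prior the posterior probability that the treatment is better is Phi(z), the standard normal CDF evaluated at the z-statistic, so it carries exactly the same information as the one-sided p-value. A small illustration of that correspondence (the z value is made up; this is not Confidence's output):

```python
from scipy import stats

z = 1.8  # hypothetical z-statistic from an experiment

p_two_sided = 2 * stats.norm.sf(abs(z))  # frequentist p-value: ~0.072
prob_better = stats.norm.cdf(z)          # P(treatment better), flat prior: ~0.964

# Monotone transformations of each other: prob_better = 1 - p_one_sided,
# and for z > 0, p_two_sided = 2 * (1 - prob_better).
print(p_two_sided, prob_better)
```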