Statistical Methods

What is frequentist A/B testing?

Frequentist A/B testing is the classical approach to experiment analysis that evaluates results using p-values and confidence intervals, asking: how likely is the observed data (or something more extreme) if the treatment has no effect? When that probability falls below a pre-specified threshold (typically 0.05), the result is declared statistically significant, and you have evidence that the treatment caused a real change.

This is the statistical framework that most experimentation platforms use, including Confidence. It's the foundation behind z-tests, t-tests, sequential testing methods like Group Sequential Tests, and multiple testing corrections like Bonferroni. At Spotify, every one of the 10,000+ experiments run per year is analyzed within this framework, with extensions for sequential monitoring, variance reduction, and guardrail metrics built on top of it.

How does frequentist analysis work in practice?

A frequentist A/B test follows a defined sequence: fix the design up front, collect data to the planned sample size, then analyze.

Before the experiment starts, you specify the significance level (alpha, typically 0.05), the desired statistical power (typically 80%), and the minimum detectable effect (MDE) you care about. These three inputs determine the required sample size.
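
These inputs fix the sample size through a standard formula. As a minimal sketch of that calculation for a two-sample z-test (the metric standard deviation and the numbers below are illustrative assumptions, not Confidence's implementation):

```python
import math
from scipy.stats import norm

def required_sample_size(mde, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample z-test.

    mde:   minimum detectable effect (absolute difference in means)
    sigma: assumed standard deviation of the metric
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = norm.ppf(power)            # quantile matching the desired power
    n = 2 * ((z_alpha + z_beta) * sigma / mde) ** 2
    return math.ceil(n)                 # round up to stay conservative

# e.g. detecting a 0.5-unit shift in a metric with standard deviation 10
print(required_sample_size(mde=0.5, sigma=10))  # ~6280 users per group
```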

During the experiment, users are randomly assigned to control and treatment. After the pre-calculated sample size is reached, you compute a test statistic (usually a z-statistic for large samples) that measures the observed difference in means relative to the standard error. If the test statistic exceeds the critical value, you reject the null hypothesis: the data provides sufficient evidence that the treatment effect is real.
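
For concreteness, here is a sketch of that computation for a difference in means, assuming samples large enough for the normal approximation to hold (the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def z_test(control, treatment, alpha=0.05):
    """Two-sided, two-sample z-test for a difference in means."""
    diff = np.mean(treatment) - np.mean(control)
    se = np.sqrt(np.var(treatment, ddof=1) / len(treatment)
                 + np.var(control, ddof=1) / len(control))
    z = diff / se                       # observed difference in SE units
    p_value = 2 * norm.sf(abs(z))       # two-sided p-value
    return z, p_value, p_value < alpha  # reject H0 when p < alpha
```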

The confidence interval complements the p-value. A 95% confidence interval is a range constructed so that, if you repeated the experiment many times, 95% of the intervals would contain the true treatment effect. It tells you not just whether the effect exists, but how large it plausibly is.
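
Continuing the sketch above, the interval is the point estimate plus or minus the critical value times the standard error (again a simplified illustration, not the platform's exact code):

```python
import numpy as np
from scipy.stats import norm

def confidence_interval(control, treatment, alpha=0.05):
    """Normal-approximation CI for the difference in means."""
    diff = np.mean(treatment) - np.mean(control)
    se = np.sqrt(np.var(treatment, ddof=1) / len(treatment)
                 + np.var(control, ddof=1) / len(control))
    z_crit = norm.ppf(1 - alpha / 2)    # 1.96 for a 95% interval
    return diff - z_crit * se, diff + z_crit * se
```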

What are p-values and why are they misunderstood?

A p-value is the probability of observing data as extreme as (or more extreme than) what you got, assuming the treatment has no effect. A p-value of 0.03 means: if the treatment truly did nothing, you'd see a difference this large or larger only 3% of the time by chance.
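
A simulation makes the definition concrete: under the null, the p-value approximates how often chance alone produces a difference at least as extreme as the observed one (all numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma = 5_000, 10.0                   # users per group, metric sd
observed_diff = 0.6                      # hypothetical observed lift

# Standard error of the difference in means between two groups of size n
se = sigma * np.sqrt(2 / n)

# Simulate many A/A experiments where the treatment truly does nothing
null_diffs = rng.normal(0.0, se, 200_000)

# Fraction of null experiments at least as extreme as what we observed
print(np.mean(np.abs(null_diffs) >= observed_diff))  # ~0.003, matching the
                                                     # analytic two-sided p
```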

The most common misinterpretation: a p-value of 0.03 does not mean "there's a 97% probability the treatment works." It doesn't tell you the probability of your hypothesis being true. It tells you how surprising the data would be if the hypothesis were false.

This distinction matters for decision-making. A significant p-value in an underpowered experiment might reflect a true effect, but the effect size estimate is likely inflated (a phenomenon called the "winner's curse"). A non-significant p-value doesn't mean the treatment has no effect; it means the experiment didn't collect enough evidence to distinguish the effect from noise. The Confidence blog's framing of "two questions every experiment should answer" addresses this directly: was the implementation bold enough, and was the test adequately powered?
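
The winner's curse is easy to demonstrate by simulation: among underpowered tests that happen to reach significance, the effect estimates systematically overstate the truth (the effect size and standard error below are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_effect, se = 0.1, 0.08              # underpowered: effect is ~1.25 SEs

estimates = rng.normal(true_effect, se, 100_000)   # one estimate per test
significant = np.abs(estimates / se) > norm.ppf(0.975)

print(np.mean(significant))              # power: only ~24% of tests "win"
print(np.mean(estimates[significant]))   # ~0.20: winners inflate 0.1 by ~2x
```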

How does sequential testing extend the frequentist framework?

Classical frequentist tests assume you look at the data once, at a pre-determined sample size. In practice, teams want to monitor experiments as they run. Looking at results before the planned end point and stopping when you see significance inflates false positive rates, sometimes dramatically.
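
A quick simulation shows how severe the inflation can be: naively peeking at an A/A experiment after every batch with a fixed 1.96 threshold (the batch sizes and counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_experiments, n_looks, batch = 4_000, 20, 250

false_positives = 0
for _ in range(n_experiments):
    # A/A experiment: the treatment truly has zero effect
    diffs = rng.normal(0.0, 1.0, n_looks * batch)
    for look in range(1, n_looks + 1):
        m = look * batch
        z = diffs[:m].mean() * np.sqrt(m)   # z-statistic at this look
        if abs(z) > 1.96:                   # naive fixed threshold
            false_positives += 1
            break

print(false_positives / n_experiments)      # ~0.25, not the nominal 0.05
```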

Sequential testing methods solve this within the frequentist framework. Group Sequential Tests (GSTs) pre-specify a set of analysis times and adjust the significance threshold at each look to maintain the overall false positive rate. Always-valid inference (AVI) provides confidence sequences that are valid at any stopping time, without pre-specifying when you'll look.
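
To see how a group sequential boundary preserves the overall error rate, here is a sketch that calibrates a Pocock-style constant boundary by Monte Carlo; real GST implementations typically use alpha-spending functions such as O'Brien-Fleming rather than this brute-force search:

```python
import numpy as np

rng = np.random.default_rng(2)
n_looks, n_sims, alpha = 5, 50_000, 0.05

# Under the null, the z-statistic at look k pools all data seen so far:
# z_k = (sum of k iid standard-normal increments) / sqrt(k)
increments = rng.normal(0.0, 1.0, (n_sims, n_looks))
z_paths = increments.cumsum(axis=1) / np.sqrt(np.arange(1, n_looks + 1))

def overall_alpha(c):
    """False positive rate when rejecting whenever |z| > c at any look."""
    return np.mean((np.abs(z_paths) > c).any(axis=1))

# Find the smallest constant boundary that keeps the overall rate at 5%
candidates = np.linspace(1.96, 3.0, 200)
boundary = next(c for c in candidates if overall_alpha(c) <= alpha)
print(boundary)   # ~2.41 for 5 looks, vs 1.96 for a single fixed look
```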

Confidence supports both approaches. GSTs are the default for experiments where you can estimate the maximum sample size in advance. They're more powerful (require fewer users to detect the same effect) because they use the planned analysis schedule to allocate the error budget efficiently. Spotify's research has shown that GSTs provide superior power when the maximum sample size is known, while AVI is better suited to experiments with uncertain timelines.

What's the relationship between frequentist and Bayesian approaches?

For the large-sample experiments typical in product development (tens of thousands of users, weak or no prior information), frequentist and Bayesian analyses converge to equivalent results. The frequentist confidence interval and the Bayesian credible interval cover essentially the same range, and the p-value maps to the posterior probability through a monotonic transformation.
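
A quick sketch of this convergence, comparing the frequentist interval with a flat-prior Bayesian credible interval on simulated data (all numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
control = rng.normal(10.0, 3.0, 50_000)
treatment = rng.normal(10.1, 3.0, 50_000)

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))

# Frequentist 95% confidence interval
ci = diff - 1.96 * se, diff + 1.96 * se

# With a flat prior the posterior for the effect is N(diff, se^2),
# so the 95% credible interval is numerically the same range
cred = norm.interval(0.95, loc=diff, scale=se)

print(ci)
print(cred)   # identical to within rounding
```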

The methods diverge meaningfully only when sample sizes are small and the Bayesian prior carries real information, or when the decision framework explicitly requires minimizing expected loss rather than controlling error rates. For standard product experimentation, the choice between frameworks is largely notational. Confidence uses the frequentist framework because it integrates naturally with sequential testing, multiple testing corrections, and the full statistical methodology stack.