Statistical Methods

What is an Effect Size?

Effect size is the magnitude of the difference in a metric between treatment and control groups.

It answers the question: how much did the change move the needle? While statistical significance tells you whether an effect is real, effect size tells you whether it's worth acting on.

Effect size can be expressed in raw units (a 1.2 percentage point increase in conversion rate) or standardized form (a Cohen's d of 0.15, meaning the difference is 0.15 standard deviations). Raw effect sizes are more interpretable for product decisions. Standardized effect sizes are useful when comparing results across experiments with different metrics. Both matter, and Confidence reports confidence intervals around the raw effect size for every metric in every experiment, giving teams a direct view of the plausible range of impact.
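To make the raw-versus-standardized distinction concrete, here is a minimal sketch of a raw effect size with a 95% confidence interval for a conversion-rate metric. The counts are made up for illustration, and the Wald-style interval is a generic textbook construction, not necessarily the method Confidence uses.

```python
import math

# Hypothetical counts: conversions and users per arm (not real data).
def raw_effect_with_ci(conv_t, n_t, conv_c, n_c, z=1.96):
    p_t, p_c = conv_t / n_t, conv_c / n_c
    diff = p_t - p_c  # raw effect: difference in conversion rate (pp / 100)
    # Standard error of a difference in two independent proportions.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = raw_effect_with_ci(1120, 20_000, 1000, 20_000)
# diff is 0.006, i.e. a 0.6 percentage point lift, with an interval
# that excludes zero for these particular counts.
```

The interval, not the point estimate alone, is what conveys the plausible range of impact.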

Why does effect size matter more than significance?

Statistical significance is a binary signal: the effect either is or isn't distinguishable from zero. Effect size is a continuous measure of how large the effect actually is. Two experiments can both be significant yet have wildly different implications.

Consider a product team testing two changes to a checkout flow. Change A produces a statistically significant 0.05% increase in conversion. Change B produces a statistically significant 3.2% increase. Both crossed the significance threshold. Only one is worth shipping.

With enough sample size, even tiny effects become significant. At Spotify's scale of 750 million users, an experiment can detect vanishingly small differences. Significance alone would say "ship it" for a 0.01% improvement. The effect size forces the real question: is 0.01% worth the complexity of maintaining this feature forever?
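A quick sketch of this phenomenon, using a two-proportion z-statistic on invented numbers: the same 0.01 percentage point lift is statistically invisible at one sample size and clearly "significant" at a much larger one.

```python
import math

# Assumed scenario, not real Spotify data: base conversion rate 5%,
# true lift of 0.0001 (0.01 percentage points), n users per arm.
def two_prop_z(p_c, lift, n):
    p_t = p_c + lift
    se = math.sqrt(p_t * (1 - p_t) / n + p_c * (1 - p_c) / n)
    return (p_t - p_c) / se

z_small_n = two_prop_z(0.05, 0.0001, 1_000_000)    # below 1.96: not significant
z_large_n = two_prop_z(0.05, 0.0001, 100_000_000)  # above 1.96: "significant"
```

Nothing about the effect changed between the two calls; only the sample did. That is why the ship/no-ship decision has to rest on the effect size, not the p-value.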

What is a standardized effect size?

A standardized effect size divides the raw difference by a measure of variability, removing the units and making effects comparable across different metrics.

Cohen's d is the most common standardization: the difference in means divided by the pooled standard deviation. A d of 0.2 is conventionally called "small," 0.5 "medium," and 0.8 "large." These labels come from behavioral science and don't map directly to product experimentation, where a "small" effect of d = 0.02 on a high-value metric like revenue per user might justify a major investment.
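The definition above translates directly into code. This is a minimal sketch of Cohen's d on two small invented samples, using the standard pooled-standard-deviation formula:

```python
import math
import statistics

def cohens_d(treatment, control):
    n_t, n_c = len(treatment), len(control)
    m_t, m_c = statistics.mean(treatment), statistics.mean(control)
    v_t, v_c = statistics.variance(treatment), statistics.variance(control)
    # Pooled standard deviation: weighted average of the two sample variances.
    pooled_sd = math.sqrt(((n_t - 1) * v_t + (n_c - 1) * v_c) / (n_t + n_c - 2))
    return (m_t - m_c) / pooled_sd

d = cohens_d([2, 4, 6, 8], [1, 3, 5, 7])  # mean difference of 1, in SD units
```

Because the result is unitless, the same function works whether the metric is minutes listened, revenue per user, or clicks.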

In A/B testing, the standardized effect size is closely related to the minimum detectable effect (MDE). The MDE is the smallest standardized effect the experiment is powered to detect. When teams set the MDE in Confidence's power calculator, they're specifying the minimum effect size they care about.

How does effect size relate to experiment design?

Effect size, sample size, and statistical power are bound together. For a fixed power and significance level:

  • Larger expected effect sizes require smaller samples to detect.
  • Smaller expected effect sizes require larger samples.
  • The relationship is inverse-square: halving the expected effect quadruples the required sample.
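The three points above follow from the standard sample-size formula for a two-sided z-test, sketched here with conventional defaults (5% significance, 80% power); this is the generic textbook formula, not necessarily the exact calculation Confidence's power calculator performs.

```python
import math

# n per arm = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2, where d is the
# minimum detectable standardized effect. 1.96 and 0.84 are the normal
# quantiles for 5% alpha (two-sided) and 80% power.
def n_per_arm(d, z_alpha=1.96, z_beta=0.84):
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

n_bold = n_per_arm(0.2)   # larger expected effect: smaller sample
n_timid = n_per_arm(0.1)  # half the effect: four times the sample
```

Halving d from 0.2 to 0.1 multiplies the required sample per arm by exactly four, which is the quadratic penalty for testing timid changes.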

This is why "bold implementations" matter. At Spotify, teams are encouraged to test the loudest version of an idea that still functions as a user experience. A bold change produces a larger effect size, which means it can be detected faster and with less traffic. If the bold version works, you can refine later. If it doesn't, you've learned something definitive rather than producing an ambiguous null result from a timid test.

Variance reduction also enters here. CUPED doesn't change the true effect size, but it reduces the metric's standard deviation, making a given raw effect larger in standardized terms. This is equivalent to increasing the signal-to-noise ratio, which increases power without requiring more users.