What is an Effect Size?

Effect size is the magnitude of the difference in a metric between treatment and control groups. It answers the question: how much did the change move the needle? Statistical significance tells you whether an effect can be distinguished from random noise; effect size tells you whether it's large enough to act on.

Effect size can be expressed in raw units (a 1.2 percentage point increase in conversion rate) or standardized form (a Cohen's d of 0.15, meaning the difference is 0.15 standard deviations). Raw effect sizes are more interpretable for product decisions. Standardized effect sizes are useful when comparing results across experiments with different metrics. Both matter, and Confidence reports confidence intervals around the raw effect size for every metric in every experiment, giving teams a direct view of the plausible range of impact.
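As a rough illustration of a raw effect size with a confidence interval, the sketch below computes the difference between two conversion rates and a normal-approximation interval around it. The function name `raw_effect_with_ci` and the counts are hypothetical, and this is a textbook approximation, not a description of how Confidence computes its intervals.

```python
from statistics import NormalDist

def raw_effect_with_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Difference in conversion rates (treatment minus control) with a normal-approximation CI."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error of the difference between two proportions.
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z * se, diff + z * se)

# Hypothetical counts: 10.0% control conversion vs 11.2% treatment conversion.
diff, (low, high) = raw_effect_with_ci(conv_c=1000, n_c=10_000, conv_t=1120, n_t=10_000)
print(f"raw effect: {diff * 100:.2f} pp, 95% CI: [{low * 100:.2f}, {high * 100:.2f}] pp")
```

The interval is in the metric's own units (percentage points here), which is what makes it directly usable for a product decision.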

Why does effect size matter more than significance?

Statistical significance is a binary signal: the effect is or isn't distinguishable from zero. Effect size is a continuous measure of how large the effect actually is. Two experiments can both be significant and have wildly different implications.

Consider a product team testing two changes to a checkout flow. Change A produces a statistically significant 0.05% increase in conversion. Change B produces a statistically significant 3.2% increase. Both crossed the significance threshold, but only Change B is clearly worth shipping.

With a large enough sample, even tiny effects become significant. At Spotify's scale of 750 million users, an experiment can detect vanishingly small differences. Significance alone would say "ship it" for a 0.01% improvement. The effect size forces the real question: is 0.01% worth the complexity of maintaining this feature forever?
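The sketch below shows how sample size alone can push a fixed, tiny effect across the significance threshold. It holds the effect constant and only varies the number of users per group in a two-proportion z-test. The 10% baseline, the 0.1% relative lift, and the function name `two_sided_p_value` are illustrative assumptions, and the calculation treats the observed rates as equal to the assumed true rates.

```python
from statistics import NormalDist

def two_sided_p_value(p_control, relative_lift, n_per_group):
    """Two-sided p-value of a two-proportion z-test for a fixed observed lift."""
    p_treat = p_control * (1 + relative_lift)
    # Pooled rate under the null hypothesis of no difference (equal group sizes).
    pooled = (p_control + p_treat) / 2
    se = (2 * pooled * (1 - pooled) / n_per_group) ** 0.5
    z = (p_treat - p_control) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same effect every time; only the sample size changes.
p_values = {}
for n in (10_000, 1_000_000, 100_000_000):
    p_values[n] = two_sided_p_value(p_control=0.10, relative_lift=0.001, n_per_group=n)
    print(f"n per group = {n:>11,}: p = {p_values[n]:.4f}")
```

The effect size is identical in every row. Only the p-value moves, which is why significance cannot stand in for magnitude.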

What is a standardized effect size?

A standardized effect size divides the raw difference by a measure of variability, removing the units and making effects comparable across different metrics.

Cohen's d is the most common standardization: the difference in means divided by the pooled standard deviation. A d of 0.2 is conventionally called "small," 0.5 "medium," and 0.8 "large." These labels come from behavioral science and don't map directly to product experimentation, where an effect of d = 0.02, a tenth of Cohen's "small" threshold, on a high-value metric like revenue per user might justify a major investment.
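A minimal sketch of the Cohen's d calculation described above, using only the Python standard library. The function name `cohens_d` and the sample values are made up for illustration.

```python
from statistics import mean, variance

def cohens_d(control, treatment):
    """Difference in means divided by the pooled standard deviation."""
    n_c, n_t = len(control), len(treatment)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5

# Hypothetical per-user metric values; treatment is control shifted up by one unit.
control = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0]
treatment = [11.0, 13.0, 10.0, 12.0, 14.0, 11.0, 13.0, 12.0]

d = cohens_d(control, treatment)
print(f"raw difference: {mean(treatment) - mean(control):.2f}, Cohen's d: {d:.2f}")
```

The raw difference is in the metric's units, while d is unitless, so the same d can be compared across metrics measured on very different scales.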

In A/B testing, the standardized effect size is closely related to the minimum detectable effect (MDE). The MDE is the smallest effect the experiment is powered to detect; dividing it by the metric's standard deviation expresses it in standardized form. When teams set the MDE in Confidence's power calculator, they're specifying the minimum effect size they care about.

How does effect size relate to experiment design?

Effect size, sample size, and statistical power are bound together. For a fixed power and significance level:

  • Larger expected effect sizes require smaller samples to detect.
  • Smaller expected effect sizes require larger samples.
  • The relationship is inverse-square: halving the expected effect quadruples the required sample.
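The relationship in the list above can be sketched with the standard normal-approximation formula for a two-sided, two-sample test, n per group ≈ 2 (z_{1-α/2} + z_{power})² / d², where d is the standardized effect. The function name `sample_size_per_group` is hypothetical, and a production power calculator may use a different or more refined formula.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.80):
    """Approximate users per group needed to detect standardized effect d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

# Each step halves the expected effect, so the required sample roughly quadruples.
for d in (0.2, 0.1, 0.05):
    print(f"d = {d:<4}: {sample_size_per_group(d):>5,} users per group")
```

Because d appears squared in the denominator, a timid change with a quarter of the expected effect needs about sixteen times the traffic.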

This is why "bold implementations" matter. At Spotify, teams are encouraged to test the loudest version of an idea that still functions as a user experience. A bold change produces a larger effect size, which means it can be detected faster and with less traffic. If the bold version works, you can refine later. If it doesn't, you've learned something definitive rather than producing an ambiguous null result from a timid test.

Variance reduction also enters here. CUPED doesn't change the true raw effect, but it reduces the metric's standard deviation, so the same raw effect becomes larger in standardized terms. This raises the signal-to-noise ratio, which increases power without requiring more users.
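The simulation below is a rough sketch of that idea, using the usual CUPED adjustment y_adj = y - θ(x - mean(x)) with θ = cov(x, y) / var(x), where x is a pre-experiment covariate. The data, the variable names, and the strength of the correlation are all invented for illustration; real metrics will see a different amount of variance reduction.

```python
import random
from statistics import mean, pstdev

random.seed(7)

# Simulated users: pre-experiment metric x, in-experiment metric y correlated with x.
n = 20_000
raw_effect = 1.0
treated = [i % 2 == 1 for i in range(n)]
x = [random.gauss(100, 20) for _ in range(n)]
y = [xi + random.gauss(0, 10) + (raw_effect if t else 0.0) for xi, t in zip(x, treated)]

# CUPED: subtract the part of y that the pre-experiment covariate explains.
x_bar, y_bar = mean(x), mean(y)
theta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
y_adj = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

def diff_in_means(values):
    """Treatment mean minus control mean for the given per-user values."""
    t = [v for v, flag in zip(values, treated) if flag]
    c = [v for v, flag in zip(values, treated) if not flag]
    return mean(t) - mean(c)

for label, values in (("unadjusted", y), ("CUPED", y_adj)):
    sd = pstdev(values)
    print(f"{label:>10}: estimate = {diff_in_means(values):.2f}, "
          f"sd = {sd:.1f}, standardized true effect = {raw_effect / sd:.3f}")
```

Both estimates target the same raw effect, but the adjusted metric has a much smaller standard deviation, so the same raw effect corresponds to a larger standardized effect.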

Related terms

Minimum Detectable Effect (MDE)

The minimum detectable effect (MDE) is the smallest treatment effect an experiment is designed to reliably detect at a given significance level and power.

Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

Confidence Interval

A confidence interval is a range of values that, at a given confidence level, is expected to contain the true treatment effect.

Statistical Power

Statistical power is the probability that an experiment will detect a real effect when one exists.

Variance

Variance is a measure of how much a metric's values spread out across users.

Sample Size

Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.