Core Experimentation

What is an Average Treatment Effect (ATE)?

The average treatment effect (ATE) is the mean difference in outcomes between the treatment group and the control group, averaged across the entire experimental population. It estimates what would happen, on average, if every user in the population received the treatment versus if every user received the control. When someone says "the experiment showed a +3% lift in conversion," they're usually reporting an estimate of the ATE.

The ATE is the default estimand in most product A/B tests, and for good reason. It gives you a single number that summarizes the causal impact of a change across your user base. Confidence reports ATE estimates for every metric in every experiment, along with confidence intervals and sequential testing boundaries that let teams make decisions at any point during the test without inflating false positive rates.

What does the ATE actually estimate?

The ATE answers a specific counterfactual question: if you could somehow show every user both versions of the product (the treatment and the control), and then compare each user's outcome under treatment to their outcome under control, what would the average of those individual differences be?
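The same question can be written compactly in potential-outcomes notation. The symbols here are assumptions introduced for illustration and do not appear in the original text: Y_i(1) is user i's outcome under treatment, Y_i(0) is the same user's outcome under control, and N is the number of users in the experimental population.

```latex
\mathrm{ATE} \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr)
```

Each term in the sum is one user's individual effect, and only one of its two parts is ever observed for a given user.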

You can't actually observe both outcomes for the same user. That's the fundamental problem of causal inference. Randomization solves it at the population level: by randomly assigning users to treatment or control, you ensure the two groups are comparable in expectation before the experiment starts, so any pre-existing difference between them is due to chance alone. The difference in average outcomes between the groups is then an unbiased estimate of the ATE.
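As a minimal sketch of that estimator, the Python below computes a difference in means with a fixed-horizon, normal-approximation 95% interval. The function name `estimate_ate`, the simulated conversion rates, and the sample size are all hypothetical, and the interval shown is a textbook one, not the sequential boundaries that Confidence reports.

```python
import math
import random
from statistics import mean, variance


def estimate_ate(treatment, control, z=1.96):
    """Return the difference in means and a normal-approximation interval."""
    diff = mean(treatment) - mean(control)
    # Standard error of a difference between two independent sample means.
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    return diff, (diff - z * se, diff + z * se)


# Simulated experiment with made-up conversion rates (10.5% vs 10.0%).
rng = random.Random(7)
treatment, control = [], []
for _ in range(20_000):
    if rng.random() < 0.5:  # coin-flip assignment to treatment or control
        treatment.append(1 if rng.random() < 0.105 else 0)
    else:
        control.append(1 if rng.random() < 0.100 else 0)

ate_hat, (ci_low, ci_high) = estimate_ate(treatment, control)
print(f"ATE estimate: {ate_hat:+.4f}  95% CI: [{ci_low:+.4f}, {ci_high:+.4f}]")
```

Because assignment is random, the simple subtraction is enough: no adjustment for user characteristics is needed for the estimate to be unbiased, although adjustment can reduce its variance.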

At Spotify, where experiments routinely run across millions of users, large sample sizes make ATE estimates precise. The Spotify Search team, for example, uses ATE estimates across a set of engagement and quality metrics to evaluate ranking algorithm changes. A treatment effect that's positive on click-through rate but negative on a satisfaction guardrail produces a clear signal: the change makes search feel more responsive but degrades result quality.

How is ATE different from other estimands?

The ATE averages across everyone in the experiment. That's useful for population-level decisions (should we ship this to all users?), but it can mask important variation.

The average treatment effect on the treated (ATT) estimates the effect only among users who actually received the treatment. In a well-randomized experiment where every assigned user gets their assigned experience, ATE and ATT are the same in expectation. They diverge when you use trigger analysis to restrict to exposed users: the effect among triggered users is a conditional ATE that may not generalize to the full population.

The conditional average treatment effect (CATE) estimates the ATE within a subgroup defined by user characteristics: country, platform, tenure, or behavioral segment. If the ATE across all users is +1%, but the CATE for new users is +5% and for tenured users is -0.2%, the overall average hides a meaningful difference. Segment analysis in Confidence surfaces these heterogeneous effects.
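A per-segment estimate is the same difference in means, computed within each subgroup. The sketch below assumes a hypothetical row layout of `(segment, group, outcome)` and a hypothetical helper named `cate_by_segment`; it is an illustration of the idea, not how Confidence implements segment analysis.

```python
from collections import defaultdict
from statistics import mean


def cate_by_segment(rows):
    """Difference in means per segment.

    rows: iterable of (segment, group, outcome), where group is
    "treatment" or "control". Every segment needs at least one user
    in each group, otherwise mean() raises StatisticsError.
    """
    outcomes = defaultdict(lambda: {"treatment": [], "control": []})
    for segment, group, outcome in rows:
        outcomes[segment][group].append(outcome)
    return {
        segment: mean(groups["treatment"]) - mean(groups["control"])
        for segment, groups in outcomes.items()
    }


# Tiny made-up dataset: new users improve, tenured users get slightly worse.
rows = [
    ("new", "treatment", 1.10), ("new", "treatment", 1.00),
    ("new", "control", 1.00), ("new", "control", 1.00),
    ("tenured", "treatment", 2.00), ("tenured", "treatment", 1.90),
    ("tenured", "control", 2.00), ("tenured", "control", 2.00),
]
effects = cate_by_segment(rows)
for segment, effect in effects.items():
    print(f"{segment}: {effect:+.3f}")
```

Each segment has fewer users than the full experiment, so segment-level intervals are wider, and checking many segments raises the chance of a spurious finding.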

Understanding which estimand your analysis reports matters. A team that sees a large effect in trigger analysis and assumes it applies to the full user base will overestimate the impact of a global rollout.
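The size of that overestimate can be sketched with simple arithmetic. The helper below is hypothetical and makes two assumptions: the effect is an absolute (additive) change in a per-user metric, and users who never trigger the feature are unaffected. Under those assumptions, the population-level effect is the triggered effect scaled by the share of users who trigger.

```python
def diluted_effect(triggered_effect, trigger_rate):
    """Population-level effect implied by an effect among triggered users.

    Assumes untriggered users have zero effect and that the effect is an
    absolute change in a per-user metric.
    """
    if not 0.0 <= trigger_rate <= 1.0:
        raise ValueError("trigger_rate must be between 0 and 1")
    return triggered_effect * trigger_rate


# Made-up numbers: +0.05 among triggered users, 20% of users trigger.
global_effect = diluted_effect(triggered_effect=0.05, trigger_rate=0.20)
print(f"Implied population-level effect: {global_effect:+.3f}")
```

For relative lifts the scaling also depends on each group's baseline, so this sketch should not be applied to percentage changes directly.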

When is the ATE not the right quantity?

The ATE works well when the effect of a change is relatively uniform across users. When effects are highly heterogeneous, the average can be uninformative or even misleading.

Consider a change that benefits power users (+10% engagement) and hurts casual users (-8% engagement). If power users are 30% of the population and both segments start from a similar baseline, the ATE is roughly 0.3 × 10% + 0.7 × (-8%) = -2.6%, suggesting the change is harmful. But a team might reasonably ship the change to power users and exclude casual users. In that case, the CATE for each segment is more useful than the population ATE.
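The arithmetic behind this example can be sketched as a share-weighted average of segment effects. The function name `population_ate` is hypothetical, and the calculation assumes the segment effects are expressed on a common scale (for instance, both segments share a similar baseline); with very different baselines, relative lifts would need to be weighted by baseline as well.

```python
def population_ate(segments):
    """Share-weighted average of segment-level effects.

    segments: list of (population_share, segment_effect) pairs whose
    shares sum to 1.
    """
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(share * effect for share, effect in segments)


# Power users: 30% of users, +10%. Casual users: 70% of users, -8%.
ate = population_ate([(0.30, 0.10), (0.70, -0.08)])
print(f"Population ATE: {ate:+.3f}")
```

The negative average comes entirely from the larger casual segment, which is why the per-segment effects tell the team more than the single population number.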

Confidence's segment analysis helps teams detect these patterns. When you define success and guardrail metrics, Confidence breaks down the treatment effect by pre-configured segments so heterogeneous effects surface before the ship decision, not after.

Related terms

Core Experimentation
Treatment Effect

A treatment effect is the measured difference in a metric between the treatment group (users who see a change) and the control group (users who see the current experience).

Core Experimentation
Control Group

The control group is the set of users in an experiment who see the unchanged, current experience.

Statistical Methods
Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

Experiment Analysis
Trigger Analysis

Trigger analysis is an experiment analysis technique that restricts the evaluation to users who actually encountered the changed feature, rather than analyzing every user assigned to the experiment.

Experiment Analysis
Segment Analysis

Segment analysis breaks down experiment results by user subgroups to detect heterogeneous treatment effects: cases where the change helps some users, hurts others, or has no effect on a particular ...

Statistical Methods
Variance Reduction

Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.

Experiment Analysis
Estimand

An estimand is the precise quantity an experiment is designed to estimate.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.