The average treatment effect (ATE) is the mean difference in outcomes between the treatment group and the control group, averaged across the entire experimental population. It estimates what would happen, on average, if every user in the population received the treatment versus if every user received the control. When someone says "the experiment showed a +3% lift in conversion," they're usually reporting an estimate of the ATE.
The ATE is the default estimand in most product A/B tests, and for good reason. It gives you a single number that summarizes the causal impact of a change across your user base. Confidence reports ATE estimates for every metric in every experiment, along with confidence intervals and sequential testing boundaries that let teams make decisions at any point during the test without inflating false positive rates.
What does the ATE actually estimate?
The ATE answers a specific counterfactual question: if you could somehow show every user both versions of the product (the treatment and the control), and then compare each user's outcome under treatment to their outcome under control, what would the average of those individual differences be?
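In potential-outcomes notation (standard causal-inference notation, not notation introduced in this article), with Y_i(1) and Y_i(0) denoting user i's outcome under treatment and under control:

```latex
\mathrm{ATE} = \mathbb{E}\big[\, Y_i(1) - Y_i(0) \,\big]
```

The quantity inside the expectation is exactly the per-user difference described above; the ATE is its average over the population.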
You can't actually observe both outcomes for the same user. That's the fundamental problem of causal inference. Randomization solves it at the population level: by randomly assigning users to treatment or control, you make the two groups comparable in expectation on every characteristic, observed and unobserved, before the experiment starts. The difference in average outcomes between the groups is then an unbiased estimate of the ATE.
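As a minimal sketch of that estimator on simulated data (the outcome, rates, and sample sizes are hypothetical, and this is an illustration, not Confidence's implementation):

```python
# Difference-in-means ATE estimate with a normal-approximation CI.
import numpy as np

rng = np.random.default_rng(42)

# Simulated binary conversion outcomes for two randomized groups.
control = rng.binomial(1, 0.100, size=50_000)
treatment = rng.binomial(1, 0.103, size=50_000)

# Difference in group means: an unbiased estimate of the ATE.
ate_hat = treatment.mean() - control.mean()

# Standard error of the difference (unpooled variances).
se = np.sqrt(treatment.var(ddof=1) / treatment.size
             + control.var(ddof=1) / control.size)
ci = (ate_hat - 1.96 * se, ate_hat + 1.96 * se)

print(f"ATE estimate: {ate_hat:+.4f}, 95% CI: [{ci[0]:+.4f}, {ci[1]:+.4f}]")
```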
At Spotify, where experiments routinely run across millions of users, the law of large numbers makes ATE estimates precise. The Spotify Search team, for example, uses ATE estimates across a set of engagement and quality metrics to evaluate ranking algorithm changes. A treatment effect that's positive on click-through rate but negative on a satisfaction guardrail produces a clear signal: the change makes search feel more responsive but degrades result quality.
How is ATE different from other estimands?
The ATE averages across everyone in the experiment. That's useful for population-level decisions (should we ship this to all users?), but it can mask important variation.
The average treatment effect on the treated (ATT) estimates the effect only among users who were actually assigned to treatment. In a well-randomized experiment, ATE and ATT are the same in expectation. They diverge when you use trigger analysis to restrict to exposed users: the effect among triggered users is a conditional ATE that may not generalize to the full population.
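A hedged sketch of how trigger analysis changes the estimand (all column names, rates, and the 20% trigger share are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 200_000
df = pd.DataFrame({
    "group": rng.choice(["treatment", "control"], size=n),
    # Whether the user ever reached the changed surface, independent of
    # assignment (as counterfactual triggering requires).
    "triggered": rng.random(n) < 0.20,
})
# Only triggered treatment users can be affected: +2pp conversion for them.
lift = np.where((df["group"] == "treatment") & df["triggered"], 0.02, 0.0)
df["converted"] = rng.binomial(1, 0.10 + lift)

def diff_in_means(frame: pd.DataFrame) -> float:
    means = frame.groupby("group")["converted"].mean()
    return means["treatment"] - means["control"]

print("ATE, all users:            ", round(diff_in_means(df), 4))
print("Conditional ATE, triggered:", round(diff_in_means(df[df["triggered"]]), 4))
```

In this simulation the triggered-user effect is about five times the population ATE, because 80% of users never reach the feature; projecting it onto everyone would overstate a global rollout.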
The conditional average treatment effect (CATE) estimates the ATE within a subgroup defined by user characteristics: country, platform, tenure, or behavioral segment. If the ATE across all users is +1%, but the CATE for new users is +5% and for tenured users is -0.2%, the overall average hides a meaningful difference. Segment analysis in Confidence surfaces these heterogeneous effects.
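A sketch of a per-segment breakdown, mirroring the +5% / -0.2% example above (segment names, shares, and effect sizes are hypothetical; this shows the estimand a segment view reports, not Confidence's actual computation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 200_000
df = pd.DataFrame({
    "group": rng.choice(["treatment", "control"], size=n),
    "segment": rng.choice(["new", "tenured"], size=n, p=[0.3, 0.7]),
})
# Heterogeneous effects: strongly positive for new users, slightly
# negative for tenured ones.
effect = np.where(df["segment"] == "new", 0.05, -0.002)
p = 0.10 + np.where(df["group"] == "treatment", effect, 0.0)
df["converted"] = rng.binomial(1, p)

# One difference-in-means estimate per segment: the CATEs.
means = df.pivot_table(index="segment", columns="group", values="converted")
print(means["treatment"] - means["control"])
```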
Understanding which estimand your analysis reports matters. A team that sees a large effect in trigger analysis and assumes it applies to the full user base will overestimate the impact of a global rollout.
When is the ATE not the right quantity?
The ATE works well when the effect of a change is relatively uniform across users. When effects are highly heterogeneous, the average can be uninformative or even misleading.
Consider a change that benefits power users (+10% engagement) and hurts casual users (-8% engagement). If power users are 30% of the population, the ATE comes out to about 0.3 × 10% + 0.7 × (-8%) = -2.6%, suggesting the change is harmful. But a team might reasonably ship the change to power users and exclude casual users. In that case, the CATE for each segment is more useful than the population ATE.
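The arithmetic behind this example is an instance of a general identity: the population ATE is the segment-share-weighted average of the CATEs. A tiny sketch, with the shares and effects copied from the paragraph above (using population share as the weight assumes similar baseline engagement per user):

```python
# Population ATE = sum over segments of (segment share) * (segment CATE).
shares = {"power": 0.30, "casual": 0.70}
cates  = {"power": +0.10, "casual": -0.08}

ate = sum(shares[s] * cates[s] for s in shares)
print(f"Population ATE: {ate:+.3f}")  # -0.026, i.e. -2.6%
```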
Confidence's segment analysis helps teams detect these patterns. When you define success and guardrail metrics, Confidence breaks down the treatment effect by pre-configured segments so heterogeneous effects surface before the ship decision, not after.