A treatment effect is the measured difference in a metric between the treatment group (users who see a change) and the control group (users who see the current experience). It answers the most fundamental question in experimentation: did this change move the number? Because A/B tests use random assignment, the observed difference can be attributed to the change itself, not to pre-existing differences between users.
Treatment effects matter because they're the unit of evidence in product experimentation. Every ship-or-don't-ship decision, every guardrail check, every learning a team extracts from an experiment comes down to whether the treatment effect on a given metric is large enough, small enough, or directionally informative. At Spotify, where teams run over 10,000 experiments per year, each experiment produces treatment effects across dozens of metrics. The 42% rollback rate exists because Confidence surfaces treatment effects on guardrail metrics that would otherwise go unnoticed.
How is a treatment effect calculated?
The simplest treatment effect estimate is the difference-in-means estimator: subtract the average metric value in the control group from the average in the treatment group. If treatment users stream an average of 48 minutes per day and control users stream 46 minutes, the treatment effect estimate is +2 minutes.
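In code, the estimator is a one-line subtraction. The sketch below uses made-up per-user streaming minutes (the arrays are illustrative, not real data) and recovers the +2 minute example:

```python
import numpy as np

# Hypothetical per-user daily streaming minutes for each group.
treatment_minutes = np.array([52, 47, 45, 50, 46])
control_minutes = np.array([48, 44, 46, 47, 45])

# Difference-in-means estimator: mean(treatment) - mean(control).
treatment_effect = treatment_minutes.mean() - control_minutes.mean()
print(f"Estimated treatment effect: {treatment_effect:+.1f} minutes/day")  # +2.0
```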
That point estimate alone isn't enough. You also need a confidence interval that tells you the range of plausible true values, and a p-value that tells you how likely you'd be to observe a difference at least this large if the change had no real effect. Confidence computes all three automatically for every metric in an experiment, using the statistical methods (CUPED variance reduction, sequential testing, multiple testing corrections) that make the estimate as precise and trustworthy as the data allows.
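For intuition, here is a minimal sketch of that inference using a plain normal-approximation interval and a two-sided p-value. It deliberately omits Confidence's actual machinery (CUPED, sequential testing, multiplicity corrections); the function name and inputs are assumptions for illustration:

```python
import numpy as np
from scipy import stats

def diff_in_means_inference(treatment, control, alpha=0.05):
    """Point estimate, normal-approximation CI, and two-sided p-value
    for the difference in means (simplified sketch, not Confidence's
    production methodology)."""
    effect = treatment.mean() - control.mean()
    # Standard error of the difference in means (unequal variances).
    se = np.sqrt(treatment.var(ddof=1) / len(treatment)
                 + control.var(ddof=1) / len(control))
    z = stats.norm.ppf(1 - alpha / 2)          # e.g. 1.96 for a 95% CI
    ci = (effect - z * se, effect + z * se)
    p_value = 2 * stats.norm.sf(abs(effect) / se)
    return effect, ci, p_value
```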
Why can treatment effects be misleading?
Three things commonly distort treatment effect estimates.
Dilution from non-exposed users. If you change the checkout flow but include all users in the analysis, users who never reached checkout dilute the effect. The true treatment effect on exposed users gets averaged down toward zero. Trigger analysis solves this by restricting the analysis to users who actually encountered the change. But the effect size estimated on triggered users doesn't generalize directly to the full population: it answers "what was the effect on people who saw it?" not "what would happen if everyone saw it?"
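A small pandas sketch makes the dilution visible. The `triggered` flag marking users who reached the checkout flow and the toy conversion data are hypothetical:

```python
import pandas as pd

# One row per user: assignment, whether they reached checkout, and conversion.
df = pd.DataFrame({
    "group":     ["treatment", "treatment", "control", "control", "treatment", "control"],
    "triggered": [True, False, True, False, True, True],
    "converted": [1, 0, 0, 0, 1, 1],
})

# Naive (diluted) effect: includes users who never saw the change.
diluted = df.groupby("group")["converted"].mean()

# Trigger analysis: restrict to users who actually encountered the change.
triggered = df[df["triggered"]].groupby("group")["converted"].mean()

print("Diluted effect:  ", diluted["treatment"] - diluted["control"])
print("Triggered effect:", triggered["treatment"] - triggered["control"])
```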
Underpowered experiments. When an experiment has too few users or runs for too short a time, the confidence interval around the treatment effect is wide. A wide interval means you can't distinguish a real positive effect from noise, or a real negative effect from zero. The experiment consumes bandwidth without producing a clear answer. Confidence's power analysis helps teams size experiments before they start, so the treatment effect estimate will be precise enough to act on.
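A standard two-sample power calculation gives a feel for the sizing step; this sketch uses statsmodels, and the minimum detectable effect and standard deviation are illustrative assumptions, not Spotify numbers:

```python
from statsmodels.stats.power import tt_ind_solve_power

# Suppose we want to detect a +1 minute change in daily streaming,
# and the metric has a standard deviation of roughly 20 minutes.
mde_minutes = 1.0
std_minutes = 20.0
standardized_mde = mde_minutes / std_minutes   # Cohen's d = 0.05

n_per_group = tt_ind_solve_power(
    effect_size=standardized_mde,
    alpha=0.05,        # false-positive rate
    power=0.8,         # probability of detecting a true effect this size
    alternative="two-sided",
)
print(f"Users needed per group: {n_per_group:,.0f}")
```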
Metric choice. The same change can show a positive treatment effect on one metric and a negative effect on another. A feature that increases engagement might also increase load time. The treatment effect is always relative to the metric you're measuring, which is why Confidence's decision framework distinguishes success metrics (what you're trying to improve) from guardrail metrics (what you're trying not to break).
What's the difference between a treatment effect and an effect size?
These terms overlap but aren't identical. A treatment effect is the raw difference between treatment and control on a specific metric in a specific experiment: "+2 minutes of daily streaming" or "-0.3 percentage points on crash rate." An effect size is often standardized (divided by the pooled standard deviation) to make comparisons across experiments and metrics possible. Cohen's d is the most common standardized effect size.
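A minimal sketch of Cohen's d, computed as the raw difference divided by the pooled standard deviation (the helper name is illustrative, not a Confidence API):

```python
import numpy as np

def cohens_d(treatment, control):
    """Standardized effect size: raw difference in means divided by the
    pooled standard deviation of the two groups."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = ((n_t - 1) * treatment.var(ddof=1)
                  + (n_c - 1) * control.var(ddof=1)) / (n_t + n_c - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

# A +2 minute raw treatment effect on a metric with a ~20 minute standard
# deviation corresponds to a small standardized effect, d ≈ 0.1.
```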
In practice, most product teams work with raw treatment effects because the business cares about the actual units: minutes, conversion percentage points, revenue per user. Standardized effect sizes become useful when planning experiments (deciding on a minimum detectable effect) or when comparing the relative magnitude of changes across metrics with different scales.