Experiment Analysis

What is an Estimand?

An estimand is the precise quantity an experiment is designed to estimate. It answers the question: what exactly are you measuring? Before worrying about sample sizes, statistical tests, or significance thresholds, you need to define the thing you're trying to learn. The estimand is that thing, stated with enough precision that two analysts working independently would compute the same number from the same data.

Getting the estimand wrong is more consequential than getting the statistical method wrong. A perfectly executed analysis of the wrong quantity produces a precise answer to a question nobody asked.

Why does the estimand matter in A/B testing?

Most experimenters think in terms of "the treatment effect." But there are many treatment effects, and which one your analysis estimates depends on choices that are often implicit.

Average treatment effect (ATE) measures the average impact across all assigned users. This is the standard estimand for an intent-to-treat analysis. It answers: what happens to the full population metric if we ship this change?

Average treatment effect on the treated (ATT) measures the impact on users who actually received the treatment. In trigger analysis, this is closer to what you're estimating: the effect on users who encountered the changed feature. The ATT and ATE can differ substantially when only a fraction of users are exposed.

Conditional average treatment effect (CATE) measures the impact for a specific subgroup. Segment analysis estimates CATEs. The treatment might help new users and hurt experienced users, producing a positive ATE that masks a negative CATE for one group.

Each estimand answers a different question. The ATE tells you about business impact. The ATT tells you whether the change works for the people it reaches. The CATE tells you whether effects vary across segments. Choosing the wrong one leads to the wrong decision.
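The three estimands can be made concrete on simulated data where the true effects are known. The sketch below is illustrative, not a prescription: it assumes a hypothetical experiment in which 30% of users trigger the changed feature, the effect differs by segment, and triggering is determined before assignment (so a simple difference in means among triggered users is a valid ATT estimate; real trigger analysis has to establish counterfactual triggering in the control group).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: 30% of users would encounter the changed
# feature ("triggered"); only they can be affected by the treatment.
triggered = rng.random(n) < 0.30
new_user = rng.random(n) < 0.40          # segment used for the CATE

# Potential outcomes: +2.0 for triggered new users, -0.5 for triggered
# experienced users, 0 for users who never see the feature.
y0 = rng.normal(10, 3, n)
effect = np.where(triggered, np.where(new_user, 2.0, -0.5), 0.0)
y1 = y0 + effect

# Randomized assignment; the observed outcome follows assignment.
treat = rng.random(n) < 0.5
y = np.where(treat, y1, y0)

def diff_in_means(mask):
    """Difference-in-means estimate within a subpopulation."""
    return y[mask & treat].mean() - y[mask & ~treat].mean()

everyone = np.ones(n, dtype=bool)
ate = diff_in_means(everyone)          # intent-to-treat: all assigned users
att = diff_in_means(triggered)         # trigger analysis: exposed users only
cate_new = diff_in_means(new_user)     # segment estimates
cate_old = diff_in_means(~new_user)
```

With these numbers the true ATE is 0.15 but the true ATT is 0.50: diluting the effect over the 70% of users who never saw the feature shrinks the headline number by more than a factor of three, which is exactly the ATE/ATT gap the text describes.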

How do metric definitions affect the estimand?

The estimand isn't just about which users to include. It's also about how the metric is defined over time. A Spotify research team's paper on what A/B test metrics actually estimate distinguishes two common approaches:

Cumulative metrics sum all post-exposure data for each user. A user who's been in the experiment for 14 days contributes 14 days of behavior. A user who joined 3 days ago contributes 3 days. The estimand is the total cumulative effect over the observation period.

Windowed metrics fix a time window per user (say, 7 days post-exposure) and only count behavior within that window. Every user contributes the same duration, which makes the estimand cleaner but discards data from users who haven't yet completed the window.

These two approaches estimate different quantities, even from the same experiment. The cumulative estimand reflects what the business would see in aggregate. The windowed estimand isolates the per-user effect over a standardized period. Neither is universally better. The right choice depends on what decision you're making.
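The mechanics of the two metric definitions can be sketched on a toy event log. The pandas code below is a hedged illustration, not the paper's method: the column names (`user`, `day`, `value`, `days_enrolled`), the 7-day window, and the Poisson activity rates are all invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical event log: one row per user per active day, with the day
# offset from that user's first exposure. Staggered entry means users
# have been observed for anywhere from 1 to 14 days.
rng = np.random.default_rng(1)
rows = []
for user in range(1000):
    days_enrolled = int(rng.integers(1, 15))
    for day in range(days_enrolled):
        rows.append({"user": user, "day": day,
                     "value": rng.poisson(3),
                     "days_enrolled": days_enrolled})
events = pd.DataFrame(rows)

# Cumulative metric: sum everything observed post-exposure, per user.
# Users who have been enrolled longer mechanically contribute more.
cumulative = events.groupby("user")["value"].sum()

# Windowed metric (7 days): keep only days 0-6, and only for users who
# have completed the full window; everyone contributes the same duration.
complete = events["days_enrolled"] >= 7
windowed = (events[complete & (events["day"] < 7)]
            .groupby("user")["value"].sum())
```

Note the trade-off the text describes: `cumulative` covers every user but mixes observation lengths, while `windowed` is standardized to 7 days per user at the cost of dropping users who have not yet completed the window.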