An unbiased estimator is a statistical estimator whose expected value equals the true parameter it's estimating. If you could repeat the experiment infinitely many times, the average of all the estimates would equal the actual treatment effect. Any single estimate may be above or below the truth, but there's no systematic tendency in one direction.
Unbiasedness is a property of the method, not of any individual result. A single experiment produces one estimate. That estimate can be far from the truth and the estimator can still be unbiased, because bias refers to the long-run average, not to any particular realization.
Why does unbiasedness matter for A/B tests?
The difference-in-means estimator (treatment group average minus control group average) is the workhorse of A/B test analysis. Under random assignment, it's an unbiased estimator of the average treatment effect. This is the fundamental reason randomized experiments produce trustworthy causal estimates: the expected value of the comparison equals the true causal effect, with no systematic error.
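To see unbiasedness as a long-run property in action, here's a minimal simulation sketch (plain NumPy, with a made-up true effect, sample size, and noise level): it repeats a randomized experiment many times and checks that the difference-in-means estimates average out to the true effect, even though any single estimate misses it.

```python
import numpy as np

rng = np.random.default_rng(42)
true_effect = 0.5       # hypothetical true average treatment effect
n_users = 200           # users per experiment
n_experiments = 10_000  # simulated repeats of the same experiment

estimates = []
for _ in range(n_experiments):
    # Random assignment: each user flips a fair coin.
    treated = rng.random(n_users) < 0.5
    # Outcome = noise plus the effect for treated users.
    outcome = rng.normal(0.0, 2.0, n_users) + true_effect * treated
    # Difference-in-means estimator.
    estimates.append(outcome[treated].mean() - outcome[~treated].mean())

estimates = np.array(estimates)
print(f"true effect:          {true_effect:.3f}")
print(f"average of estimates: {estimates.mean():.3f}")  # ~0.5: no systematic error
print(f"std of single runs:   {estimates.std():.3f}")   # individual estimates scatter
```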
Biased estimators, by contrast, produce results that are systematically too high or too low. If your analysis includes a step that introduces selection bias (like conditioning on a post-treatment outcome), the estimator becomes biased. The expected value of your estimate no longer equals the true effect, and collecting more data doesn't help. More data makes the biased estimate more precise, not more accurate.
This distinction matters practically. A team running an A/B test with 100,000 users and a biased analysis methodology can produce a very tight confidence interval around the wrong answer. The interval is narrow because the sample is large, but it's centered on a biased estimate. Precision without accuracy is worse than an imprecise but unbiased result, because the narrow interval creates false confidence.
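Here's a sketch of that failure mode, assuming a hypothetical post-treatment filter (an activity threshold that the treatment helps marginal users clear); all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
true_effect = 0.5  # hypothetical true effect on the outcome

treated = rng.random(n) < 0.5
quality = rng.normal(0.0, 1.0, n)  # latent user quality
outcome = quality + true_effect * treated + rng.normal(0.0, 1.0, n)

# Post-treatment filter: the treatment helps marginal users clear an
# activity threshold, so the filter keeps different kinds of users
# in each arm and breaks the randomization.
active = quality + 0.8 * treated + rng.normal(0.0, 1.0, n) > 0.5
t, c = outcome[treated & active], outcome[~treated & active]

diff = t.mean() - c.mean()
se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
print(f"biased estimate: {diff:.3f}  (truth: {true_effect})")
print(f"95% CI: ({diff - 1.96 * se:.3f}, {diff + 1.96 * se:.3f})")
```

The resulting interval is only a few hundredths wide, yet it sits well below the true effect of 0.5. Adding more users would shrink it further around the same wrong center.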
How do common analysis techniques affect unbiasedness?
CUPED variance reduction (using pre-experiment covariates to reduce noise) preserves unbiasedness. The Negi-Wooldridge full regression estimator that Confidence uses adjusts for covariates observed before the experiment, which can't be affected by the treatment. The expected value of the adjusted estimator still equals the true treatment effect; the adjustment only tightens the confidence interval.
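The core adjustment fits in a few lines. This is a minimal sketch of the CUPED idea, not Confidence's actual implementation (the Negi-Wooldridge estimator is a full regression adjustment); the pre-experiment metric and effect size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_effect = 10_000, 0.3

treated = rng.random(n) < 0.5
pre = rng.normal(10.0, 3.0, n)  # metric measured before the experiment
post = pre + true_effect * treated + rng.normal(0.0, 1.0, n)

# CUPED: remove the part of the outcome predicted by the pre-experiment
# covariate. theta is the OLS slope of post on pre.
theta = np.cov(post, pre)[0, 1] / pre.var(ddof=1)
adjusted = post - theta * (pre - pre.mean())

for name, y in [("raw", post), ("CUPED", adjusted)]:
    diff = y[treated].mean() - y[~treated].mean()
    se = np.sqrt(y[treated].var(ddof=1) / treated.sum()
                 + y[~treated].var(ddof=1) / (~treated).sum())
    print(f"{name:>5}: estimate {diff:.3f}, standard error {se:.4f}")
```

Because the covariate was measured before assignment, the correction term has the same expected value in both groups, so subtracting it can't shift the estimated effect; it can only shrink the noise.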
Trigger analysis introduces a subtlety. Restricting to users who triggered exposure changes the estimand from the average treatment effect (ATE) to something closer to the average treatment effect on the treated (ATT). Within that estimand, the trigger analysis estimator can remain unbiased under the right conditions. But the ATT is a different quantity from the ATE, so the estimate isn't biased in the usual sense: it's estimating a different thing.
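A small simulation makes the estimand shift visible. It assumes the simple case where triggering is a pre-existing user trait (so it can't be affected by the treatment) and only triggered users respond; the trigger rate and effect size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
effect_on_triggered = 1.0  # hypothetical effect for users who see the feature

treated = rng.random(n) < 0.5
# Triggering is a pre-existing trait here (e.g. whether a user ever
# visits the page), so it is not itself affected by the treatment.
would_trigger = rng.random(n) < 0.3
outcome = rng.normal(0.0, 1.0, n) + effect_on_triggered * (treated & would_trigger)

ate_est = outcome[treated].mean() - outcome[~treated].mean()
trig_t, trig_c = treated & would_trigger, ~treated & would_trigger
att_est = outcome[trig_t].mean() - outcome[trig_c].mean()

print(f"all users (ATE):      {ate_est:.3f}")  # ~0.3: diluted by never-triggered users
print(f"triggered only (ATT): {att_est:.3f}")  # ~1.0: a different estimand
```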
Post-treatment filtering (excluding users based on behavior that occurred during the experiment) generally introduces bias. If the treatment affects who gets filtered, the remaining treatment and control populations are no longer comparable. The difference-in-means on the filtered sample is a biased estimator of the treatment effect.
Outlier capping (winsorizing extreme values) introduces a small, known bias in exchange for a large reduction in variance. For most product experiments, the trade-off is worth it: the bias from capping the top 1% of a revenue metric is tiny compared to the variance reduction.
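A quick sketch of the trade-off on a hypothetical heavy-tailed revenue metric (the distribution and cap level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Heavy-tailed stand-in for a revenue metric (illustrative distribution).
revenue = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)

cap = np.quantile(revenue, 0.99)  # cap values at the 99th percentile
winsorized = np.minimum(revenue, cap)

print(f"mean:     {revenue.mean():.2f} -> {winsorized.mean():.2f}  (small bias)")
print(f"variance: {revenue.var():.0f} -> {winsorized.var():.0f}  (large reduction)")
```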
Is unbiasedness always the most important property?
Not always. Estimators have other properties that matter: variance (how much the estimate bounces around across experiments), mean squared error (bias squared plus variance), and consistency (convergence to the truth as sample size grows).
Sometimes a slightly biased estimator with much lower variance produces better decisions on average. CUPED is popular precisely because it reduces variance dramatically (often by 50% or more) while remaining unbiased. Metric capping introduces slight bias but reduces variance enough that the mean squared error drops. The right balance depends on your sample size and how much variance you're dealing with.
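The arithmetic is easy to check with made-up numbers: an unbiased estimator with variance 1.0 loses on mean squared error to an estimator with bias 0.1 and variance 0.25.

```python
# MSE = bias^2 + variance, with illustrative numbers.
mse_unbiased = 0.0**2 + 1.00  # unbiased but noisy          -> MSE 1.00
mse_capped   = 0.1**2 + 0.25  # slightly biased, low noise  -> MSE 0.26
# On average, the slightly biased estimator lands closer to the truth.
```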
For A/B tests, the practical priority is: use an estimator that's unbiased (or nearly so), and then reduce variance through methods that don't introduce meaningful bias. This is the approach Confidence takes: unbiased estimation with CUPED for variance reduction, automatic outlier handling, and analysis methods that preserve the validity of the causal claim.