Metric capping (also called winsorization) is a variance reduction technique that clips extreme metric values at a chosen threshold, reducing the outsized influence of outliers on experiment results. By replacing values above the cap with the cap value itself, you shrink the tails of the metric distribution and produce tighter confidence intervals.
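As a rough sketch in Python (the `cap_metric` helper here is ours for illustration, not a Confidence API), the operation is a one-sided clip:

```python
import numpy as np

def cap_metric(values: np.ndarray, cap: float) -> np.ndarray:
    # Replace anything above the cap with the cap value itself.
    return np.minimum(values, cap)

page_views = np.array([3, 1, 7, 2, 500])  # one extreme user
print(cap_metric(page_views, cap=50))     # [ 3  1  7  2 50]
```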
Outliers are a practical reality in product experimentation. A single user who generates 500 page views in a day, or a bot account that triggers thousands of events, can dominate the variance of an entire treatment group. Metric capping prevents these extreme values from drowning out the signal you're trying to detect. In Confidence, metric capping works alongside CUPED and trigger analysis as part of the default variance reduction stack.
Why do outliers matter so much in A/B tests?
Variance drives the width of your confidence intervals, and outliers drive variance disproportionately. A metric like "revenue per user" might have 95% of users between $0 and $50, but a handful of users at $5,000+. Those few extreme values inflate the standard deviation of the metric, which inflates the standard error of the treatment effect estimate, which widens the confidence interval, which reduces statistical power.
The math is stark. If you have a million users and ten of them each generate $5,000, those ten users contribute more to the variance than the bottom 500,000 users combined. Capping those values at, say, the 99th percentile removes most of that disproportionate influence while preserving the directional information: a high-value user still counts as high-value, just not as an extreme outlier.
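A quick simulation makes this concrete (the distribution and numbers here are illustrative, not Spotify data):

```python
import numpy as np

rng = np.random.default_rng(0)

# One million simulated users: mostly small revenue, ten extreme outliers.
revenue = rng.exponential(scale=5.0, size=1_000_000)
revenue[:10] += 5_000

# Each user's contribution to the variance is their squared deviation.
deviations = np.sort((revenue - revenue.mean()) ** 2)
print(deviations[-10:].sum() > deviations[:500_000].sum())  # True

# Capping at the 99th percentile removes most of that influence.
capped = np.minimum(revenue, np.quantile(revenue, 0.99))
print(f"std before: {revenue.std():.1f}, after: {capped.std():.1f}")
```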
At Spotify, where experiments run across hundreds of millions of users, outlier-driven variance is one of the primary obstacles to detecting small but meaningful treatment effects. Capping is applied systematically as part of the experiment analysis pipeline in Confidence.
How do you choose a capping threshold?
The most common approach is percentile-based capping: set the threshold at the 95th, 99th, or 99.5th percentile of the metric distribution. Values above the threshold are replaced with the threshold value.
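In code, this is a two-step operation: compute the quantile, then clip. A minimal sketch, assuming the threshold is taken from the metric's own observed distribution:

```python
import numpy as np

def cap_at_percentile(values: np.ndarray, pct: float) -> np.ndarray:
    # Threshold = the pct-quantile of the observed distribution;
    # values above it are replaced with the threshold.
    threshold = np.quantile(values, pct)
    return np.minimum(values, threshold)

rng = np.random.default_rng(1)
revenue = rng.lognormal(mean=2.0, sigma=1.5, size=100_000)
capped = cap_at_percentile(revenue, 0.99)
```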
The choice of percentile involves a tradeoff. A lower cap (95th percentile) removes more variance but also removes more real information. A higher cap (99.5th percentile) preserves more information but removes less variance. The right threshold depends on the metric.
For heavy-tailed metrics like revenue or session duration, aggressive capping (95th-99th percentile) often makes sense because the tail contains mostly noise. For metrics that are naturally bounded (like click-through rates), capping is less necessary because the distribution doesn't have extreme tails.
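A small simulation illustrates both the tradeoff and the bounded case: capping a heavy-tailed lognormal metric cuts variance substantially, while a 0/1 click metric is untouched because its upper percentiles already equal its maximum. (The distributions and parameters below are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
revenue = rng.lognormal(mean=2.0, sigma=1.5, size=n)  # heavy-tailed
clicks = rng.binomial(1, 0.1, size=n).astype(float)   # bounded 0/1

for name, metric in [("revenue", revenue), ("clicks", clicks)]:
    for pct in (0.95, 0.99, 0.995):
        capped = np.minimum(metric, np.quantile(metric, pct))
        cut = 1 - capped.var() / metric.var()
        print(f"{name} capped at p{pct * 100:g}: variance reduced {cut:.0%}")
```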
One important rule: the cap should be fixed before looking at results, ideally as part of the metric definition rather than chosen after the experiment runs. Choosing the cap post hoc introduces a researcher degree of freedom that can bias results.
How does metric capping differ from other variance reduction methods?
Metric capping, CUPED, and trigger analysis each attack a different source of noise.
CUPED removes variance explained by pre-experiment behavior. It works best when users' behavior is predictable from their history. Metric capping removes variance caused by extreme values in the tails. It works best when the metric distribution is heavy-tailed. Trigger analysis removes variance from users who never experienced the change. It works best when a large fraction of users in the experiment aren't exposed to the feature being tested.
These methods compose well. You can apply all three to the same metric: first restrict to triggered users, then cap extreme values, then apply CUPED adjustment. Each step removes a different component of noise, and the combined variance reduction exceeds what any single method achieves alone.
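A compact sketch of how the three steps might compose in an analysis pipeline. The function names and signatures below are illustrative, not Confidence's actual API, and a production pipeline would also pre-register the cap and might cap the pre-experiment covariate consistently:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Standard CUPED: subtract the component of y predicted by the
    # pre-experiment covariate x.
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

def variance_reduced_metric(metric, pre_metric, triggered, cap_pct=0.99):
    # 1. Trigger analysis: keep only users who experienced the change.
    y, x = metric[triggered], pre_metric[triggered]
    # 2. Metric capping: clip extreme values at the chosen percentile.
    y = np.minimum(y, np.quantile(y, cap_pct))
    # 3. CUPED: remove variance explained by pre-experiment behavior.
    return cuped_adjust(y, x)
```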