
What is Metric Capping?

Metric capping (also called winsorization) is a variance reduction technique that clips extreme metric values at a chosen threshold, reducing the outsized influence of outliers on experiment results. By replacing values above the cap with the cap value itself, you shrink the tails of the metric distribution and produce tighter confidence intervals.
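As a minimal sketch of the clipping step, assuming a hand-picked cap of 50 page views (the function name `cap_values` and the sample data are illustrative, not part of Confidence):

```python
def cap_values(values, cap):
    """Replace every value above `cap` with `cap`; leave the rest untouched."""
    return [min(v, cap) for v in values]


# Hypothetical daily page views for five users; the last one is an outlier.
page_views = [3, 7, 12, 5, 500]
capped = cap_values(page_views, cap=50)
print(capped)  # [3, 7, 12, 5, 50]
```

The outlier still registers as the highest value in the group, but it no longer sits an order of magnitude away from everyone else.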

Outliers are a practical reality in product experimentation. A single user who generates 500 page views in a day, or a bot account that triggers thousands of events, can dominate the variance of an entire treatment group. Metric capping prevents these extreme values from drowning out the signal you're trying to detect. In Confidence, metric capping works alongside CUPED and trigger analysis as part of the default variance reduction stack.

Why do outliers matter so much in A/B tests?

Variance drives the width of your confidence intervals, and outliers drive variance disproportionately. A metric like "revenue per user" might have 95% of users between $0 and $50, but a handful of users at $5,000+. Those few extreme values inflate the standard deviation of the metric, which inflates the standard error of the treatment effect estimate, which widens the confidence interval, which reduces statistical power.
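The chain from outlier to wider interval can be illustrated with a small sketch. It assumes a plain difference-in-means comparison with a normal-approximation 95% interval; the data and the function name `ci_half_width` are made up for illustration and do not describe how Confidence computes its intervals.

```python
import math
import statistics


def ci_half_width(treatment, control, z=1.96):
    """Half-width of an approximate 95% CI for the difference in means."""
    standard_error = math.sqrt(
        statistics.variance(treatment) / len(treatment)
        + statistics.variance(control) / len(control)
    )
    return z * standard_error


# Hypothetical revenue per user: 100 users per group, all between $0 and $50.
control = [0, 5, 10, 20, 50] * 20
treatment_clean = list(control)
# Same group, but one user is replaced by a $5,000 spender.
treatment_outlier = treatment_clean[:-1] + [5000]

print(ci_half_width(treatment_clean, control))
print(ci_half_width(treatment_outlier, control))
```

A single extreme user is enough to make the second interval much wider than the first, even though the other 99 users in the group are unchanged.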

The math is stark. If you have a million users and ten of them each generate $10,000 in revenue while the median is $5, those ten users contribute more to the variance than the bottom 500,000 users combined. Capping those values at, say, the 99th percentile removes most of that disproportionate influence while preserving the directional information: a high-value user still counts as high-value, just not as an extreme outlier.
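To make that comparison concrete, here is a sketch on one hypothetical distribution that fits the description (a million users, a median of $5, ten users at $10,000). The remaining values and counts are assumptions chosen for illustration. It compares the summed squared deviations from the mean, which is what the variance is built from.

```python
# Hypothetical revenue distribution: value in dollars -> number of users.
population = {0: 400_000, 5: 300_000, 20: 200_000, 50: 99_990, 10_000: 10}

n_users = sum(population.values())  # 1,000,000
mean = sum(value * count for value, count in population.items()) / n_users


def squared_deviation(value, count):
    """Total squared deviation from the mean contributed by `count` users at `value`."""
    return count * (value - mean) ** 2


# The ten extreme users.
outlier_ss = squared_deviation(10_000, 10)
# The bottom 500,000 users: all 400,000 at $0 plus 100,000 of the users at $5.
bottom_half_ss = squared_deviation(0, 400_000) + squared_deviation(5, 100_000)

print(outlier_ss > bottom_half_ss)  # True
print(outlier_ss / bottom_half_ss)
```

In this made-up population the ten outliers outweigh the bottom half of users by a wide margin; the exact ratio depends entirely on the assumed distribution.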

At Spotify, where experiments run across hundreds of millions of users, outlier-driven variance is one of the primary obstacles to detecting small but meaningful treatment effects. Capping is applied systematically as part of the experiment analysis pipeline in Confidence.

How do you choose a capping threshold?

The most common approach is percentile-based capping: set the threshold at the 95th, 99th, or 99.5th percentile of the metric distribution. Values above the threshold are replaced with the threshold value.
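A percentile-based cap can be sketched as follows. The nearest-rank percentile used here is one of several common definitions and is an assumption of this example; library routines such as `numpy.percentile` interpolate by default and may return a slightly different threshold.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cap_at_percentile(values, pct):
    """Return the capped values together with the threshold that was applied."""
    threshold = percentile_nearest_rank(values, pct)
    return [min(v, threshold) for v in values], threshold


# Hypothetical metric: 99 ordinary users (values 1..99) and one extreme value.
revenue = list(range(1, 100)) + [5000]
capped, threshold = cap_at_percentile(revenue, 99)
print(threshold)    # 99
print(max(capped))  # 99
```

Only the single value above the 99th percentile changes; every other user keeps their original value.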

The choice of percentile involves a tradeoff. A lower cap (95th percentile) removes more variance but also removes more real information. A higher cap (99.5th percentile) preserves more information but removes less variance. The right threshold depends on the metric.
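The tradeoff can be explored on synthetic data. The sketch below assumes a Pareto-distributed sample as a stand-in for a heavy-tailed metric (the shape parameter 1.5 and the sample size are arbitrary choices) and reports the mean and variance after capping at each percentile.

```python
import math
import random
import statistics


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile of `values`."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic heavy-tailed sample; not real experiment data.
revenue = [rng.paretovariate(1.5) for _ in range(10_000)]

results = {}
for pct in (95, 99, 99.5):
    threshold = percentile_nearest_rank(revenue, pct)
    capped = [min(v, threshold) for v in revenue]
    results[pct] = (statistics.fmean(capped), statistics.variance(capped))
    print(f"p{pct}: cap={threshold:.2f} "
          f"mean={results[pct][0]:.3f} variance={results[pct][1]:.3f}")

raw_mean, raw_variance = statistics.fmean(revenue), statistics.variance(revenue)
print(f"raw: mean={raw_mean:.3f} variance={raw_variance:.3f}")
```

Lower percentiles give smaller variance, but they also pull the capped mean further below the raw mean, which is the information being given up.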

For heavy-tailed metrics like revenue or session duration, aggressive capping (95th-99th percentile) often makes sense because the tail contains mostly noise. For metrics that are naturally bounded (like click-through rates), capping is less necessary because the distribution doesn't have extreme tails.

One important property: the cap should be determined before looking at results, ideally as part of the metric definition rather than chosen after the experiment runs. Choosing the cap post-hoc introduces a researcher degree of freedom that can bias results.
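One way to picture a cap that lives in the metric definition is sketched below. The `MetricDefinition` class, its fields, and the $200 cap are hypothetical and are not a Confidence API; the point is only that the threshold is fixed up front and applied identically to every group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the cap cannot be changed after the metric is defined
class MetricDefinition:
    """Hypothetical metric definition that carries its own cap."""
    name: str
    cap: float  # chosen before the experiment starts

    def apply(self, values):
        """Cap the values with the pre-specified threshold."""
        return [min(v, self.cap) for v in values]


revenue_metric = MetricDefinition(name="revenue_per_user", cap=200.0)

# Illustrative per-user revenue for each group.
control = [0.0, 12.5, 40.0, 5000.0]
treatment = [3.0, 18.0, 950.0, 60.0]

capped_control = revenue_metric.apply(control)
capped_treatment = revenue_metric.apply(treatment)
print(capped_control)    # [0.0, 12.5, 40.0, 200.0]
print(capped_treatment)  # [3.0, 18.0, 200.0, 60.0]
```

Because the same threshold is applied to both groups and cannot be adjusted once results are visible, the analyst has no opportunity to tune the cap toward a preferred outcome.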

How does metric capping differ from other variance reduction methods?

Metric capping, CUPED, and trigger analysis each attack a different source of noise.

CUPED removes variance explained by pre-experiment behavior. It works best when users' behavior is predictable from their history. Metric capping removes variance caused by extreme values in the tails. It works best when the metric distribution is heavy-tailed. Trigger analysis removes variance from users who never experienced the change. It works best when a large fraction of users in the experiment aren't exposed to the feature being tested.

These methods compose well. You can apply all three to the same metric: first restrict to triggered users, then cap extreme values, then apply CUPED adjustment. Each step removes a different component of noise, and the combined variance reduction exceeds what any single method achieves alone.
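The three steps can be sketched in the order described above. This is a simplified illustration for a single group, assuming the standard CUPED adjustment with theta = cov(x, y) / var(x); the field names, the cap of 100, and the data are hypothetical, and production pipelines typically differ in details such as how theta is estimated across groups.

```python
import statistics


def cuped_adjust(y, x):
    """Adjust metric y using pre-experiment covariate x (theta = cov(x, y) / var(x))."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [b - theta * (a - mean_x) for a, b in zip(x, y)]


def analyze(users, cap):
    """Trigger filter, then cap, then CUPED adjustment."""
    triggered = [u for u in users if u["triggered"]]         # 1. keep exposed users only
    capped = [min(u["metric"], cap) for u in triggered]      # 2. cap extreme values
    pre_period = [u["pre_metric"] for u in triggered]
    return cuped_adjust(capped, pre_period)                  # 3. remove predictable variance


# Hypothetical users: in-experiment metric, pre-experiment metric, and exposure flag.
users = [
    {"triggered": True, "metric": 12.0, "pre_metric": 10.0},
    {"triggered": True, "metric": 30.0, "pre_metric": 25.0},
    {"triggered": False, "metric": 0.0, "pre_metric": 4.0},
    {"triggered": True, "metric": 900.0, "pre_metric": 60.0},
    {"triggered": True, "metric": 8.0, "pre_metric": 9.0},
]

adjusted = analyze(users, cap=100.0)
print(statistics.variance(adjusted))
```

The adjusted values keep the same mean as the capped metric for the triggered users, while their variance is no larger than the variance after capping alone.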

Related terms

Variance Reduction

Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.

CUPED

CUPED (Controlled-experiment Using Pre-Existing Data) is a variance reduction method that uses data from before an experiment started to remove predictable noise from metric estimates, producing ti...

Signal-to-Noise Ratio

The signal-to-noise ratio (SNR) in A/B testing is the ratio of the treatment effect (the signal) to the variability of the metric being measured (the noise).

Confidence Interval

A confidence interval is a range of values that, at a given confidence level, is expected to contain the true treatment effect.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.