Last updated: May 2026
Every experimentation platform that offers variance reduction will tell you it cuts experiment runtime by 20-50%. The number is usually real. What is usually missing is how far that reduction actually reaches. Variance reduction works by using pre-experiment data to remove predictable noise from the metric, so the experiment can detect the same effect with fewer observations. The technique is called CUPED (Controlled-experiment Using Pre-Experiment Data), and extensions like CUPAC and CUPED++ expand the set of covariates beyond just the pre-experiment version of the outcome metric. The statistical principle is simple: if you can predict part of the metric from data that existed before the experiment started, subtracting that prediction reduces the variance of the treatment effect estimate.
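To make that principle concrete, here is a minimal CUPED sketch in Python. It is an illustration, not any vendor's implementation; the simulated metric and the single pre-experiment covariate are our assumptions.

```python
# Minimal CUPED sketch (illustrative, not any vendor's implementation).
# y: the metric measured during the experiment; x: the same metric for the
# same users, measured before the experiment started.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.gamma(shape=2.0, scale=5.0, size=n)   # pre-experiment metric
y = 0.8 * x + rng.normal(0, 3.0, size=n)      # post-exposure metric

# theta is the OLS slope of y on x; subtracting theta * (x - mean(x))
# removes the part of y that x predicts while leaving the mean unchanged.
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

print(f"raw variance:   {y.var():.2f}")
print(f"cuped variance: {y_cuped.var():.2f}")
print(f"reduction:      {1 - y_cuped.var() / y.var():.0%}")  # ~ corr(x, y)**2
```

The achievable reduction is roughly the squared correlation between the pre- and post-period metric, which is why metrics with stable user-level behavior benefit most.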
The problem is not whether the platform has CUPED. Most do. The problem is where the reduction stops. A platform that reduces variance in the analysis but ignores the reduction in the sample size calculator plans every experiment as if CUPED did not exist. A platform that applies CUPED to means but not to ratio or percentile metrics leaves your most complex metrics running at full noise. A platform that offers CUPED in the frequentist engine but not the Bayesian one forces a choice between your preferred inference framework and a shorter experiment. Each of these is a partial implementation. The feature exists, but the full benefit does not.
At Spotify, hundreds of experiments compete for the same traffic at any given time. Across that portfolio, variance reduction compounds: if it cuts the required sample size by 30% and the calculator ignores that, 30% of every experiment's runtime is spent collecting data the analysis does not need. Over a quarter, that is the difference between running your roadmap and not.
Should you add variance reduction to your experimentation platform requirements?
Yes. Variance reduction is one of the few features where the benefit is nearly universal and the cost of a partial implementation is concrete and measurable. Unlike some features where the gap between a checkbox and a real implementation is subtle, the gap here shows up directly in experiment runtime: experiments take longer than they need to because the planning tools do not account for the noise reduction the analysis provides.
The standard RFP question, "do you support CUPED or variance reduction?", is not wrong, but it is incomplete. Seven of the eight vendors reviewed in this series offer some form of variance reduction. Only one does not. But the question that separates a useful implementation from a checkbox is whether the variance reduction connects to the rest of the platform: the sample size calculator, the sequential testing boundaries, the metric types you actually use, and the inference mode your team prefers.
At Spotify, the sample size calculator in Confidence accounts for variance reduction, so the planning estimate matches the actual noise level of the analysis.
Variance reduction that does not connect to planning and does not cover the metric types you rely on creates a specific trap. We refused to ship variance reduction in Confidence without connecting it to the sample size calculator, because the alternative, where every experiment quietly runs longer than necessary while the mismatch goes unnoticed, is a form of waste that compounds silently across the entire portfolio.
What your RFP should ask instead of the yes/no question
Six questions separate a connected variance reduction implementation from a disconnected one.
First: which metric types does variance reduction cover? The simplest case is mean metrics: CUPED regresses the post-exposure metric on its pre-exposure counterpart and subtracts the predicted component. Most vendors support this. Ratio metrics, where both numerator and denominator vary across users, require an extended covariance adjustment that accounts for the joint distribution. Percentile metrics require different statistical treatment entirely, and variance reduction for percentiles is rare. Funnel metrics and retention metrics may also be excluded. If CUPED covers means but not ratio metrics, your revenue-per-user metric runs at full variance while your session count metric gets the benefit. The practical consequence is that the metrics with the most variance, like revenue, are the ones that do not get the reduction. Ask which metric types support variance reduction, and compare that list to the metrics you actually use.
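For ratio metrics, one common construction, sketched below under our own simplifying assumptions rather than as any vendor's documented method, is to linearize the ratio with the delta method and then apply CUPED to the per-user linearized values.

```python
# Hedged sketch: CUPED for a ratio metric via delta-method linearization.
# One common construction, not necessarily what any given vendor does.
import numpy as np

rng = np.random.default_rng(7)
n_users = 10_000
# per-user denominator (e.g. sessions) and numerator (e.g. revenue),
# plus their pre-experiment counterparts
den_pre = rng.poisson(5, n_users) + 1
num_pre = den_pre * rng.gamma(2.0, 1.0, n_users)
den_post = den_pre + rng.poisson(2, n_users)
num_post = num_pre + den_post * rng.gamma(1.0, 0.5, n_users)

def linearize(num, den):
    """Delta-method linearization of sum(num)/sum(den): per-user values
    whose mean difference approximates the ratio difference."""
    r_hat = num.sum() / den.sum()
    return (num - r_hat * den) / den.mean()

l_post = linearize(num_post, den_post)
l_pre = linearize(num_pre, den_pre)

# Standard CUPED applied to the linearized values
theta = np.cov(l_pre, l_post)[0, 1] / np.var(l_pre, ddof=1)
l_cuped = l_post - theta * (l_pre - l_pre.mean())
print(f"variance reduction on the linearized ratio: "
      f"{1 - l_cuped.var() / l_post.var():.0%}")
```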
Second: is variance reduction reflected in the sample size calculation? This is the question that matters most and the one where nearly every vendor falls short. Variance reduction changes the effective noise level of the metric. If the sample size calculator uses the raw, unadjusted variance to compute the required sample size, it overestimates the number of users you need. The experiment still benefits from CUPED during analysis, so you end up overpowered. One overpowered experiment is not a disaster. But across a portfolio of experiments, systematic overpowering means every experiment occupies a traffic slot longer than necessary, and experiments queued behind it are delayed. A platform that offers variance reduction but does not connect it to the calculator leaves the planning side of the problem unsolved. Ask whether enabling CUPED in the experiment configuration changes the sample size estimate and the projected runtime. If it does not, the planning tool and the analysis tool are operating on different assumptions.
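Folding the reduction into planning is a one-line change to the standard formula. A hedged sketch, where rho, the pre/post correlation, is assumed to be estimable from historical data and nothing here is vendor-specific:

```python
# How a sample size calculator could fold CUPED in (illustrative sketch).
from scipy.stats import norm

def required_n_per_group(sigma, mde, alpha=0.05, power=0.8, rho=0.0):
    """Two-sided z-test sample size per group; rho is the pre/post
    correlation used by CUPED. rho=0 recovers the standard formula."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    effective_var = sigma**2 * (1 - rho**2)   # CUPED shrinks the variance
    return 2 * z**2 * effective_var / mde**2

print(required_n_per_group(sigma=10, mde=0.5))            # raw variance
print(required_n_per_group(sigma=10, mde=0.5, rho=0.55))  # ~30% variance cut
```

A calculator that takes rho as an input (or estimates it from the metric's history) produces a plan that matches what the CUPED analysis will actually deliver; one that ignores rho plans for the raw variance.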
Third: does variance reduction work in combination with sequential testing? Sequential testing lets you monitor results during the experiment and stop early if the effect is clear, without inflating false positive rates. Variance reduction makes the effect estimate more precise, which should make the sequential boundaries easier to cross with fewer observations. But combining the two is not automatic. The sequential confidence interval must be computed on the CUPED-adjusted estimate with the CUPED-adjusted variance. If the sequential method uses the raw variance while CUPED adjusts the point estimate, the confidence interval is too wide for the noise level of the adjusted metric, and you lose efficiency. The interaction also needs to flow through to the sample size calculator: a valid pre-experiment estimate for a sequential experiment with CUPED needs to account for both the sequential width penalty and the CUPED variance reduction simultaneously. Ask whether the platform documents how CUPED and sequential testing interact, and whether the combined effect is reflected in the sample size calculation.
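As a back-of-envelope illustration (the 1.5 inflation factor below is a placeholder, since the real penalty depends on the sequential method and stopping rule, not a universal constant), a calculator that handles both adjustments has to apply them to the same estimate:

```python
# Hedged back-of-envelope: planning a sequential experiment with CUPED.
def planned_n(n_fixed, sequential_inflation=1.5, rho=0.0):
    # Sequential monitoring widens the interval (more users needed);
    # CUPED shrinks the variance (fewer users needed). A calculator that
    # applies only one of the two is wrong in the corresponding direction.
    return n_fixed * sequential_inflation * (1 - rho**2)

n_fixed = 20_000
print(planned_n(n_fixed))            # sequential, no CUPED: 30,000
print(planned_n(n_fixed, rho=0.55))  # sequential + CUPED:  ~20,925
```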
Fourth: does variance reduction work in both Bayesian and frequentist modes? If the platform offers both inference frameworks, CUPED should work in both. The underlying adjustment is the same: regress the outcome on a pre-experiment covariate and use the residual as the adjusted metric. Whether the downstream inference is a confidence interval or a credible interval does not change the covariate adjustment step. But some vendors implement CUPED only for the frequentist engine, leaving the Bayesian mode running on unadjusted metrics with higher variance. If your team uses Bayesian analysis, an experiment that takes three weeks with CUPED in frequentist mode takes four or five weeks in Bayesian mode because the variance reduction is unavailable. That penalty may push teams toward a framework they did not choose, or it may go unnoticed entirely. Ask whether CUPED is available in both modes.
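A sketch of why the adjustment is inference-agnostic: the covariate step happens before any inference, and under a flat prior with a normal likelihood (our simplifying assumptions, not a description of any vendor's Bayesian engine) the credible interval coincides numerically with the confidence interval.

```python
# Hedged sketch: one adjustment step, two inference modes.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5_000
x_t, x_c = rng.normal(10, 4, n), rng.normal(10, 4, n)   # pre-period metric
y_t = 0.7 * x_t + rng.normal(0.3, 2, n)                 # small true effect
y_c = 0.7 * x_c + rng.normal(0.0, 2, n)

# theta and the grand mean come from pooled data, so both arms
# receive the identical adjustment
x_all = np.concatenate([x_t, x_c])
y_all = np.concatenate([y_t, y_c])
theta = np.cov(x_all, y_all)[0, 1] / np.var(x_all, ddof=1)
x_mean = x_all.mean()

a_t = y_t - theta * (x_t - x_mean)
a_c = y_c - theta * (x_c - x_mean)
diff = a_t.mean() - a_c.mean()
se = np.sqrt(a_t.var(ddof=1) / n + a_c.var(ddof=1) / n)

z = norm.ppf(0.975)
print(f"frequentist 95% CI:    [{diff - z*se:.3f}, {diff + z*se:.3f}]")
# With a flat prior and normal likelihood the posterior is N(diff, se^2),
# so the 95% credible interval matches the CI numerically.
print(f"Bayesian 95% credible: [{diff - z*se:.3f}, {diff + z*se:.3f}]")
```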
Fifth: what covariates does the platform use, and are they configurable? Standard CUPED uses the pre-experiment version of the outcome metric as the single covariate. Extensions like CUPED++ (used by Eppo) and CURE (used by Statsig) allow multiple covariates: pre-experiment values of other metrics, user attributes like country or device type, and other dimensions. More covariates can produce more variance reduction, but they also introduce complexity. Ridge regression or regularization may be needed to avoid overfitting when the number of covariates is large relative to the sample size. The lookback window (how many days of pre-experiment data is used) also affects the reduction. Too short and you miss stable behavioral patterns. Too long and you include data that no longer predicts current behavior. Ask what covariates are included, whether the set is configurable, what lookback window is used, and whether the platform applies regularization.
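A minimal sketch of the multi-covariate idea, in the spirit of CUPED++ and CURE but not a reproduction of either: the covariate names, the 28-day lookback, and the ridge penalty below are all illustrative assumptions.

```python
# Hedged sketch: multi-covariate variance reduction with regularization.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 20_000
# pre-experiment covariates, all measured before exposure:
X = np.column_stack([
    rng.gamma(2.0, 10.0, n),   # pre-period streams (assumed 28-day lookback)
    rng.poisson(12, n),        # pre-period sessions
    rng.integers(0, 2, n),     # device type indicator
])
y = 0.5 * X[:, 0] + 1.2 * X[:, 1] + rng.normal(0, 8, n)

# Ridge regularization guards against overfitting as the covariate
# list grows relative to the sample size.
model = Ridge(alpha=1.0).fit(X, y)
y_adj = y - model.predict(X) + y.mean()   # keep the mean interpretable

print(f"variance reduction: {1 - y_adj.var() / y.var():.0%}")
```

Because every covariate predates exposure, it is independent of treatment assignment, which is what keeps the adjusted treatment effect estimate unbiased.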
Sixth: does variance reduction carry through to dimensional breakdowns and segments? When you slice experiment results by country, device, or user segment, the CUPED adjustment should still apply to each subgroup. If variance reduction is computed only at the top level and dropped when you explore dimensions, subgroup results are noisier than they need to be, and the comparison between the top-level and subgroup confidence intervals is inconsistent. Some vendors explicitly document that CUPED does not apply to explores, segments, or filtered results. Ask whether the variance reduction carries through to every view of the results, or whether it disappears when you start slicing.
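One way to keep the views consistent, sketched here as a design assumption rather than what any vendor documents: estimate theta once on the full population and compute every segment's results on the already-adjusted values, so top-level and segment views share the same noise model.

```python
# Hedged sketch: carrying the adjustment into segment breakdowns.
# Segment labels are made up for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 12_000
segment = rng.choice(["ios", "android", "web"], size=n)
x = rng.gamma(2.0, 5.0, n)
y = 0.8 * x + rng.normal(0, 3, n)

# theta estimated once on the full population
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_adj = y - theta * (x - x.mean())

# every slice inherits the same adjustment
for s in ["ios", "android", "web"]:
    mask = segment == s
    raw_se = y[mask].std(ddof=1) / np.sqrt(mask.sum())
    adj_se = y_adj[mask].std(ddof=1) / np.sqrt(mask.sum())
    print(f"{s:8s} raw SE={raw_se:.3f}  adjusted SE={adj_se:.3f}")
```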
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.
"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.
| Platform | Variance reduction available? | Metric types covered | Reflected in sample size calc? | Works with sequential testing? | Works in Bayesian mode? | Covariates | Carries through to segments/explores? |
|---|---|---|---|---|---|---|---|
| Confidence | Yes | Mean, ratio | Yes | Yes | Yes | Configurable | Yes |
| GrowthBook | Yes (CUPED) | Mean, proportion, ratio (not quantile) | No | Not documented | Partial | Pre-experiment metrics | Not documented |
| Eppo | Yes (CUPED++) | All documented | No | Not documented | Not documented | Multiple covariates | No |
| Statsig | Yes (CUPED + CURE) | Mean, ratio (not funnel) | No | Yes | Not documented | Multiple covariates | Partial |
| LaunchDarkly | Yes (CUPED) | Mean, conversion (not percentile) | No | Not documented | Yes | Pre-experiment metric | Not documented |
| PostHog | Yes | Mean, funnel (not ratio) | No | N/A | Yes | Pre-experiment metric | Not documented |
| Amplitude | Yes | Most (not retention) | No | Not documented | No | Not documented | No |
| Optimizely | Yes | Numeric only | No | Not documented | Not documented | Pre-experiment metric | Not documented |
| VWO | No | N/A | N/A | N/A | N/A | N/A | N/A |
Four patterns stand out.
The first pattern is the universal disconnect between variance reduction and sample size planning. Every external vendor that offers CUPED excludes it from the sample size calculator. GrowthBook's documentation explicitly acknowledges this: their power calculator page states, "If in practice you use CUPED, your power will be higher. Use CUPED!" Statsig's blog provides a manual calculation (multiply the required sample size by 1 minus the squared correlation coefficient), but this is a formula the experimenter must apply outside the tool, not a built-in adjustment. The result across the vendor landscape is the same: every experiment planned with CUPED enabled uses a sample size based on raw variance. If CUPED typically reduces variance by 30%, the raw-variance plan calls for roughly 43% more users than the adjusted analysis needs (1/0.7 ≈ 1.43); equivalently, 30% of every planned sample is unnecessary. Across a portfolio of hundreds of experiments, that adds up to weeks of wasted traffic per quarter.
The second pattern is uneven metric type coverage. Every vendor that offers CUPED supports it for simple mean metrics. Beyond that, the coverage fragments. GrowthBook and Statsig extend CUPED to ratio metrics through covariance-based adjustments. Eppo claims the broadest coverage, stating that CUPED++ applies to mean, ratio, funnel, and percentile metrics. Optimizely restricts CUPED to numeric metrics only, explicitly excluding conversion metrics, which are Optimizely's most common metric type and the only type supported by their public sample size calculator. PostHog's recent implementation covers means and funnels but explicitly excludes ratio metrics as "out of scope." LaunchDarkly limits CUPED to metrics using the "average" analysis method, excluding percentile metrics. The metrics that tend to have the highest variance, and therefore benefit most from reduction, are often the ones excluded from the feature. Ratio metrics like revenue per user and percentile metrics like P95 latency typically have higher variance than simple means, making the gap between covered and uncovered metric types more costly than it appears on a feature list.
The third pattern is the Bayesian gap. LaunchDarkly documents CUPED for both Bayesian and frequentist modes, and PostHog's GitHub implementation works across both engines. GrowthBook claims CUPED is available for both, but its technical documentation does not describe the Bayesian implementation. Amplitude explicitly states that CUPED is not supported in Bayesian mode. For the remaining vendors (Eppo, Statsig, Optimizely), Bayesian CUPED compatibility is simply not documented. This creates a practical choice for teams using Bayesian inference: switch to frequentist mode to get variance reduction, or accept longer experiments. The choice is not always visible in the feature list, because the Bayesian mode and the CUPED toggle may both appear available without any indication that they do not work together.
The fourth pattern is the dimensional breakdown gap. Several vendors that offer CUPED explicitly exclude it from dimensional breakdowns. Eppo's documentation states that CUPED does not apply to "filtered results, segments, or explore analyses." Amplitude documents the same limitation: CUPED is dropped from group-by results. This means the top-level experiment result benefits from variance reduction while every subgroup analysis runs on unadjusted data. The confidence intervals in the two views are not directly comparable, because they are computed under different noise levels. For teams that rely on dimensional analysis to understand heterogeneous effects, this inconsistency undermines the value of the exploration.
The right RFP question is not "do you support CUPED?" It is whether the variance reduction is connected to the sample size calculator, available for the metric types you use, compatible with your inference framework and your sequential testing method, and consistent across every view of the results. A platform that reduces variance in the analysis but ignores it everywhere else gives you narrower confidence intervals at the end of the experiment while planning and running every experiment as if the reduction did not exist.