RFP Series

How to Write an Experimentation Platform RFP for Ratio Metrics


Last updated: May 2026

Revenue per user. Streams per session. Pages per visit. Order value per checkout. Ratio metrics are among the most common metrics in product experimentation. Both the numerator (total revenue, total streams) and the denominator (number of users, number of sessions) vary across experimental units. That variation in the denominator is what makes ratio metrics different from simple means, and it is where most vendor implementations quietly break down.

The standard variance formula for a mean metric treats each observation as independent with a single source of randomness. A ratio metric has two sources of randomness that are correlated: users who generate more sessions also tend to generate more streams. Treating the ratio as a simple mean with a fixed denominator typically produces variance estimates that are too small, confidence intervals that are too narrow, and p-values that are too optimistic. The correct approach uses the delta method (or bootstrap), which accounts for the variance of the denominator and its covariance with the numerator. The issue is not an edge case. Without those terms, a platform can report a 95% confidence interval that actually covers the true effect less than 95% of the time.
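
To make the covariance term concrete, here is a small simulated comparison on hypothetical streams-per-session data. The naive estimate treats per-user ratios as independent observations; the delta-method estimate combines the numerator variance, the denominator variance, and their covariance. The data and numbers are illustrative, not from any vendor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated per-user data: sessions (denominator) and streams (numerator).
# Users with more sessions also stream more, so the two are correlated.
sessions = rng.poisson(5, n) + 1
streams = rng.poisson(3 * sessions)          # numerator depends on denominator

ratio = streams.sum() / sessions.sum()       # streams per session

# Naive variance: treat per-user ratios as independent observations.
per_user = streams / sessions
naive_var = per_user.var(ddof=1) / n

# Delta method: combines Var(num), Var(den), and their covariance.
xbar = sessions.mean()
vy = streams.var(ddof=1)
vx = sessions.var(ddof=1)
cov = np.cov(streams, sessions, ddof=1)[0, 1]
delta_var = (vy - 2 * ratio * cov + ratio**2 * vx) / (n * xbar**2)

print(f"ratio={ratio:.3f}  naive SE={naive_var**0.5:.5f}  delta SE={delta_var**0.5:.5f}")
```

The two standard errors differ whenever the denominator varies and is correlated with the numerator, which is exactly the situation the rest of this post is about.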

At Spotify, ratio metrics appear throughout the experimentation program. Streams per user, listening time per session, and conversion rates with variable denominators are standard parts of experiment scorecards. The delta method needs to carry through the full pipeline for these to be trustworthy. Without that integration, the metric exists in the UI but the inference behind it is incomplete.

Should you add ratio metrics to your experimentation platform requirements?

Yes. Unlike percentile metrics, where the use case is narrow enough that many teams can skip them entirely, ratio metrics are difficult to avoid. Any metric that normalizes a total by a count that varies across users is a ratio metric. Revenue per user, sessions per day, messages per active user, add-to-cart rate when the number of product views varies. If your experiment scorecard includes metrics like these, your platform needs to handle them correctly.

The question is not whether the platform supports ratio metrics as a metric type. Most do. The question is whether the statistical treatment is correct throughout the pipeline. A platform can let you define revenue divided by users as a metric and still compute the variance as if it were a simple mean, which produces confidence intervals that are systematically too narrow. The metric exists in the UI, but the inference is wrong. Our experience from building Confidence and working with customers has shown us that a ratio metric with incorrect variance is worse than not supporting ratio metrics at all. When the platform says it handles something, people stop questioning it. At Spotify, we will not ship a metric type unless the variance estimation is correct end to end. The RFP question that matters is whether ratio metrics receive the correct statistical treatment at every stage: variance estimation, sequential testing, variance reduction, sample size calculation, and both inference modes if the platform offers Bayesian and frequentist analysis.

What your RFP should ask instead of "yes/no"

Six questions separate a correct ratio metric implementation from one that renders numbers without valid inference.

First: does the platform use the delta method (or an equivalent approach) for ratio metric variance? The delta method accounts for the covariance between numerator and denominator when computing the variance of a ratio. Without it, the variance estimate assumes the denominator is fixed, which it is not. The resulting confidence intervals are too narrow and the false positive rate is inflated. Some platforms use bootstrap resampling instead of the delta method, which is also valid but computationally more expensive. The key question is whether the platform uses any method that accounts for the numerator-denominator covariance. If the documentation describes a single variance formula applied to all metric types, or if formula metrics assume zero covariance between components, the ratio metric inference is flawed. Ask whether the variance estimation for ratio metrics accounts for the covariance between numerator and denominator, and what method is used.
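
The bootstrap alternative mentioned above is straightforward to sketch: resample whole users, so each replicate keeps a user's numerator and denominator together and the covariance is preserved automatically. The data here is simulated and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
sessions = rng.poisson(5, n) + 1
streams = rng.poisson(3 * sessions)

def ratio(idx):
    return streams[idx].sum() / sessions[idx].sum()

# Resample *users* (both columns together) so the numerator-denominator
# covariance is preserved in every bootstrap replicate.
boots = np.array([ratio(rng.integers(0, n, n)) for _ in range(2_000)])

point = streams.sum() / sessions.sum()
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"streams/session = {point:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

Resampling per-user ratio values instead of (numerator, denominator) pairs would discard the covariance and reproduce the naive-variance mistake in bootstrap form.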

Second: does sequential testing work correctly for ratio metrics? Sequential testing adjusts confidence intervals to allow valid interim looks at experiment results. For mean metrics, this adjustment is well understood. For ratio metrics, the sequential correction must be applied on top of the delta method variance, not on top of a simple-mean variance. If the platform computes sequential confidence intervals for ratio metrics using the wrong base variance, the sequential guarantee does not hold: the intervals are too narrow, and the false positive rate under continuous monitoring is higher than stated. This interacts with the broader sequential testing question in this series. A platform that supports sequential testing for means but not for ratio metrics forces you into a fixed horizon for exactly the metrics where early monitoring matters most, like revenue per user or error rate per session. Ask whether sequential testing is available for ratio metrics, and whether the sequential intervals use the delta method variance.
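
To see why the base variance matters, here is a sketch of one concrete sequential scheme: the normal-mixture sequential test from the always-valid inference literature. The function name, the mixing variance `tau2`, and the defaults are illustrative assumptions, not any vendor's implementation; the point is that the only statistical input is the per-user variance, which for a ratio metric must be the delta-method one.

```python
import numpy as np

def avci_halfwidth(var_per_user, n, tau2=1.0, alpha=0.05):
    """Always-valid CI halfwidth for a normal-mixture sequential test.

    var_per_user is the per-unit variance of the estimate -- for a ratio
    metric this must be the delta-method variance, not the simple-mean one.
    """
    s2 = var_per_user
    return np.sqrt(
        s2 * (s2 + n * tau2) / (n**2 * tau2)
        * np.log((s2 + n * tau2) / (alpha**2 * s2))
    )

# Interval widths at interim looks; a too-small base variance shrinks
# every one of these and silently breaks the sequential guarantee.
for n in (1_000, 5_000, 20_000):
    print(n, round(avci_halfwidth(var_per_user=4.0, n=n), 4))
```

Feeding a simple-mean variance into `var_per_user` produces intervals that look sequential but do not hold their stated coverage under continuous monitoring.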

Third: does variance reduction work correctly for ratio metrics? Variance reduction techniques like CUPED use pre-experiment data to reduce metric noise, shortening the time needed to detect a given effect. For mean metrics, CUPED adjusts each user's outcome by their predicted value based on pre-experiment behavior. For ratio metrics, the adjustment is more involved: the pre-experiment covariate is itself a ratio (or a pair of values for numerator and denominator), and the regression adjustment must account for the ratio structure. A platform that applies standard mean-metric CUPED to a ratio metric will reduce variance, but the point estimate of the treatment effect may be biased if the regression does not account for the fact that both numerator and denominator are being adjusted. The efficiency gain also differs. The correlation between pre-experiment and post-experiment ratios is often lower than for simple means, so the variance reduction is typically smaller. If the sample size calculator does not account for this, the runtime estimate will be off in both directions: too long if it ignores variance reduction entirely, or too short if it assumes the same reduction as for mean metrics. Ask whether CUPED is available for ratio metrics, whether the adjustment accounts for the ratio structure, and whether the variance reduction is reflected in the sample size calculation.
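
One common construction, sketched here on simulated data, linearizes the ratio into per-user influence values and then applies standard CUPED to those values. The linearization and the helper names are assumptions about how a platform might implement this, not a description of any specific vendor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Simulated users with persistent activity and streaming-rate levels, so
# pre-experiment behaviour predicts in-experiment behaviour.
base = rng.gamma(2.0, 2.0, n)
rate = rng.gamma(4.0, 0.5, n)
x_pre = rng.poisson(base) + 1          # pre-period sessions
y_pre = rng.poisson(rate * x_pre)      # pre-period streams
x_exp = rng.poisson(base) + 1          # in-experiment sessions
y_exp = rng.poisson(rate * x_exp)      # in-experiment streams

def linearize(y, x):
    """Per-user influence values for the ratio sum(y)/sum(x): their mean
    equals the ratio and their sample variance matches the delta method."""
    r = y.sum() / x.sum()
    return (y - r * x) / x.mean() + r

l_exp = linearize(y_exp, x_exp)
l_pre = linearize(y_pre, x_pre)

# Standard CUPED applied to the linearized values.
theta = np.cov(l_exp, l_pre, ddof=1)[0, 1] / l_pre.var(ddof=1)
l_cuped = l_exp - theta * (l_pre - l_pre.mean())

reduction = 1 - l_cuped.var(ddof=1) / l_exp.var(ddof=1)
print(f"variance reduction from CUPED on the linearized ratio: {reduction:.1%}")
```

Applying CUPED to raw per-user ratios instead of linearized values skips the ratio structure, which is exactly the biased shortcut the paragraph above warns about.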

Fourth: does the sample size calculator support ratio metrics? A sample size calculator for ratio metrics needs the delta method variance, which requires estimates of the numerator variance, the denominator variance, and their covariance. A calculator that uses only the variance of the ratio values (treating them as simple observations) will produce a wrong sample size. The direction of the error depends on how the variance is computed: a calculator that ignores the covariance entirely will overestimate the required sample size when the covariance is positive and underestimate it when the covariance is negative. If the calculator also accounts for sequential testing and variance reduction for ratio metrics, the planning and the analysis match. If it does not, you are back to the planning-analysis disconnect that recurs throughout this series. Ask whether the sample size calculator accepts ratio metrics, and whether it uses the delta method variance or a simpler formula.
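
A ratio-aware calculator can be sketched in a few lines. The function name and the input numbers are hypothetical; the structure is the standard two-sided z-test power calculation with the delta-method per-user variance plugged in.

```python
from statistics import NormalDist

def ratio_sample_size(var_y, var_x, cov_xy, mean_x, ratio,
                      mde, alpha=0.05, power=0.8):
    """Per-group sample size for an absolute change `mde` in a ratio metric,
    using the delta-method per-user variance (two-sided z-test sketch)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    # Delta-method variance of the ratio, per user.
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio**2 * var_x) / mean_x**2
    return int(round(2 * (z_a + z_b) ** 2 * var_ratio / mde**2))

# Hypothetical pre-experiment estimates for "streams per session".
print(ratio_sample_size(var_y=400, var_x=9, cov_xy=45,
                        mean_x=5, ratio=3.0, mde=0.05))
```

A calculator that drops `cov_xy` (or the `var_x` term) changes `var_ratio`, and with it the planned runtime, which is the planning-analysis disconnect described above.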

Fifth: are ratio metrics handled in both Bayesian and frequentist modes? If the platform offers both inference frameworks, the delta method (or an equivalent) needs to be used in both. A frequentist engine that uses the delta method paired with a Bayesian engine that models the ratio as a simple normal distribution without accounting for the covariance will produce different confidence intervals for the same data, not because the frameworks differ in principle, but because one is using the correct variance and the other is not. The pattern is the same one described in the multiple testing post: features that work correctly in one inference mode but not the other create inconsistencies that are difficult to diagnose. Ask whether the delta method (or an equivalent approach) is applied in both modes, and whether switching frameworks changes the variance estimate for ratio metrics.
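
The point that both modes should consume the same variance can be illustrated with a toy normal-normal example. The effect size, standard error, and prior here are hypothetical; the takeaway is that when the Bayesian likelihood uses the same delta-method variance as the frequentist test, the two intervals agree closely under a weak prior.

```python
from statistics import NormalDist

# Hypothetical delta-method SE of the difference in a ratio metric between
# groups; the point is that BOTH inference modes consume this same number.
effect, se = 0.12, 0.05

# Frequentist 95% CI.
z = NormalDist().inv_cdf(0.975)
freq = (effect - z * se, effect + z * se)

# Bayesian normal-normal update with a weak N(0, 0.5^2) prior; the
# likelihood variance is the same delta-method variance.
prior_var, like_var = 0.5**2, se**2
post_var = 1 / (1 / prior_var + 1 / like_var)
post_mean = post_var * (effect / like_var)
bayes = (post_mean - z * post_var**0.5, post_mean + z * post_var**0.5)

print("frequentist:", [round(v, 3) for v in freq])
print("bayesian:   ", [round(v, 3) for v in bayes])
```

If the Bayesian engine instead used a simple-mean variance for `like_var`, the two intervals would diverge for every ratio metric, even with identical data, which is the inconsistency the paragraph above describes.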

Sixth: does the platform document the variance formula it uses for ratio metrics? Documentation is a transparency question. The delta method for a ratio R = Y/X involves the variances of Y and X and their covariance. The formula is well established, but implementations differ in how they estimate the covariance, how they handle edge cases (users with zero denominator values), and whether they use a first-order or higher-order approximation. A platform that documents its variance formula lets you verify correctness. A platform that does not leaves you trusting a black box for every ratio metric in every experiment. Ask whether the statistical documentation describes the variance estimation method for ratio metrics, including how the covariance is estimated.
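
For reference, the first-order delta-method variance for a ratio estimated from per-user values is the formula the documentation should state:

```latex
\operatorname{Var}\!\big(\hat{R}\big)
  \;\approx\; \frac{1}{n\,\bar{X}^{2}}
  \Big( \sigma_{Y}^{2} \;-\; 2\hat{R}\,\sigma_{XY} \;+\; \hat{R}^{2}\,\sigma_{X}^{2} \Big),
  \qquad \hat{R} = \frac{\bar{Y}}{\bar{X}}
```

where n is the number of units, the sigmas are per-user variances, and \sigma_{XY} is the per-user covariance between numerator and denominator: the term whose estimation method the sixth question asks vendors to document.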

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Delta method for ratio variance? | Sequential testing for ratio metrics? | CUPED for ratio metrics? | Sample size calc for ratio metrics? | Bayesian mode handles ratio variance? | Variance formula documented? |
| --- | --- | --- | --- | --- | --- | --- |
| Confidence | Yes | Yes | Yes | Yes | Yes | |
| GrowthBook | Yes | Yes | Yes (excluding legacy) | Yes | Yes | Yes |
| Eppo | Yes | Yes | Yes | Yes | Not documented | Yes |
| Statsig | Yes | Yes | Yes | Partial | Not documented | Yes |
| Optimizely | Yes | Yes | Not documented | Not documented | Not documented | Yes |
| LaunchDarkly | Not documented | Not documented | Not documented | Not documented | Not documented | Partial |
| PostHog | Yes | No (no sequential testing) | No | No | Yes | Yes |
| Amplitude | Partial | Not documented | Not documented | No | Partial | |
| VWO | Not documented | Not documented | Not documented | Not documented | Not documented | |

Four patterns stand out from this comparison.

The first is the covariance problem. The delta method requires an estimate of the covariance between numerator and denominator. Most vendors that document the delta method implement it correctly: GrowthBook, Eppo, Statsig, Optimizely, and PostHog all account for the covariance. Amplitude is the exception that reveals how the gap works in practice. Amplitude's formula metrics compute the variance of each component independently and then combine them using arithmetic operators, assuming zero covariance between the components. For ratio metrics where the numerator and denominator are correlated (which is nearly always the case: users who visit more pages also tend to buy more), this assumption produces variance estimates that are wrong. This is a systematic error and its direction depends on the correlation structure. Ignoring a positive covariance between numerator and denominator overstates the variance of the ratio, making the test conservative and reducing power. Ignoring a negative covariance understates it, inflating the false positive rate. In practice, the covariance for most product metrics is positive (more activity in the numerator correlates with more activity in the denominator), so the typical effect of the zero-covariance assumption is overly wide confidence intervals and reduced power. The result is experiments that run longer than necessary to detect real effects.
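
The direction of the error is easy to demonstrate with simulated data (hypothetical purchases-per-pageview numbers): dropping a positive covariance term inflates the variance estimate relative to the full delta-method formula.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.poisson(5, n) + 1          # denominator: page views
y = rng.poisson(0.4 * x)           # numerator: purchases, positively correlated

r = y.sum() / x.sum()
vy, vx = y.var(ddof=1), x.var(ddof=1)
cov = np.cov(y, x, ddof=1)[0, 1]

full = (vy - 2 * r * cov + r**2 * vx) / (n * x.mean()**2)
zero_cov = (vy + r**2 * vx) / (n * x.mean()**2)   # covariance term dropped

print(f"SE with covariance:          {full**0.5:.6f}")
print(f"SE assuming zero covariance: {zero_cov**0.5:.6f}  (wider)")
```

With a negative covariance the inequality flips, which is why the direction of the error depends on the metric's correlation structure rather than being uniformly conservative.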

The second pattern is the gap between having the delta method in analysis and having it in planning. GrowthBook and Eppo both use the delta method in their analysis pipelines and connect it to their sample size calculators. Statsig uses the delta method in analysis but its sample size calculator documentation describes a single variance formula regardless of metric type, which suggests the ratio-specific variance may not carry through to planning. Optimizely documents the delta method for ratio metrics in analysis but does not document whether the sample size calculator accounts for it. LaunchDarkly, VWO, PostHog, and Amplitude do not document ratio-aware sample size calculation. The gap is the same planning-analysis disconnect that surfaces across every topic in this series: the analysis uses one variance estimate, the planning tool uses another, and the experiment is either underpowered or overpowered from the start.

The third pattern is variance reduction coverage. CUPED for ratio metrics requires accounting for the ratio structure in the regression adjustment. GrowthBook, Eppo, and Statsig all document CUPED support that extends to ratio metrics, though GrowthBook notes that legacy ratio metrics are excluded. Optimizely, LaunchDarkly, VWO, and Amplitude either do not document CUPED for ratio metrics or do not offer CUPED at all. PostHog does not offer CUPED for any metric type. When variance reduction works for mean metrics but not ratio metrics, experiments on ratio metrics run longer than they need to. Since ratio metrics often have higher variance than simple means (the denominator adds noise), they are precisely the metrics that benefit most from variance reduction. A platform where CUPED is available but does not extend to ratio metrics leaves the highest-variance metrics unimproved.

The fourth pattern is the documentation gap. LaunchDarkly and VWO both support experimentation with multiple metric types, but neither documents how ratio metric variance is estimated, whether the delta method or any covariance correction is applied, or whether sequential testing and CUPED extend to ratio metrics. This does not mean the implementations are wrong. It means a buyer cannot verify correctness from public documentation alone. For a metric type where the difference between correct and wrong variance estimation is the difference between valid and invalid inference, undocumented treatment is itself a risk.

Most vendors get the basic delta method right in their analysis engines. That is the easy part. The hard part is carrying the correct variance through every stage of the pipeline: sequential testing that uses it, variance reduction that preserves it, and a sample size calculator that plans around it. A platform where the analysis is correct but the planning is not gives you valid results at the end of an experiment whose duration was wrong from the start. The RFP question that separates vendors is not "do you support ratio metrics?" It is "does the delta method carry through from variance estimation to planning, sequential testing, variance reduction, and both inference modes?" If it does not, the pipeline is correct at one end and incomplete at the other.