The canonical use case for percentile metrics is performance-sensitive features where the tail user experience is the signal you care about: streaming quality, API response time, search latency. For these, P95 or P99 is the right metric, not the mean. Percentiles are sometimes presented as outlier-robust alternatives to means, but metrics capping is often a sufficient remedy for outliers.
It's easy to find an experimentation vendor that supports percentiles. Most do. The problem is that most implementations break down in exactly the conditions where percentile metrics are most useful. A yes on "do you support percentile metrics?" tells you the feature exists, not whether the implementation meets any reasonable standard. Support tends to be treated as a binary question, but it is really a continuum: from a metric type that merely renders in the UI, to one where you can calculate the sample size before you start, monitor results without inflating your false positive rate, and trust that the underlying choices about aggregation, windowing, and missing values are coherent and apply consistently throughout.
Should you add percentiles to your experimentation platform requirements?
At Spotify, we build Confidence for ourselves and have deliberately not added percentile metrics to it, even though there are use cases for which they would be superior to mean metrics. Here is our reasoning; it should also inform how you evaluate vendors that have added them.
Spotify has done serious work on this problem. Our engineering post and accompanying paper describe an assumption-free approach to quantile comparison using a Poisson bootstrap. Getting this right requires solving hard statistical and engineering problems, which is part of why most vendor implementations stay incomplete. We don't add features unless we can deliver a complete offering that doesn't complicate or fragment the user experience.
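As a rough illustration of the underlying idea, here is a naive sketch of a Poisson-bootstrap confidence interval for a quantile difference. It only shows the resampling mechanics; the approach in the paper is considerably more efficient, and every function name and parameter below is illustrative rather than anything from the paper or from a vendor's API.

```python
import numpy as np

def poisson_bootstrap_quantile_diff(treatment, control, q=0.95,
                                    n_boot=2000, alpha=0.05, seed=0):
    """Naive Poisson-bootstrap CI for the difference in the q-th quantile."""
    rng = np.random.default_rng(seed)
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Poisson(1) weights approximate multinomial resampling while
        # keeping each observation's resample count independent.
        wt = rng.poisson(1.0, size=treatment.size)
        wc = rng.poisson(1.0, size=control.size)
        diffs[b] = (np.quantile(np.repeat(treatment, wt), q)
                    - np.quantile(np.repeat(control, wc), q))
    point = np.quantile(treatment, q) - np.quantile(control, q)
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return point, (lower, upper)
```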
The decision not to offer percentile metrics in the Confidence platform is deliberate. We can't afford a fragmented implementation that puts product decisions at risk, and we have not yet seen the benefits outweigh the added complexity. For most experiments, mean-based metrics with proper sequential testing cover the use case. A change that substantially degrades tail latency will generally move the mean too, especially in conjunction with variance reduction, albeit with less statistical precision. Combined with metrics capping for outlier removal, this makes the remaining benefit small.
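For reference, metrics capping itself is a one-line operation; the judgment is in choosing the cap. A minimal sketch, assuming the cap comes from pre-experiment data:

```python
import numpy as np

def cap_metric(values, cap):
    # The cap is ideally chosen from pre-experiment data (for example a
    # high historical percentile), so it does not depend on the
    # experiment being analysed.
    return np.minimum(np.asarray(values, dtype=float), cap)
```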
What your RFP should ask instead of "yes/no"
It's never enough to ask a vendor if they "have a feature". You need to be specific about how it's supported.
Seven questions determine whether a percentile metrics implementation is worth having.
First: is sequential testing available for percentile metrics, and does it connect to the sample size calculation? Sequential testing is what lets you monitor an experiment while it runs and stop early if you see a clear result, without inflating your false positive rate. Without it, you have two bad options: commit to a fixed end date and don't look until then, or peek at results during the experiment and accept that repeated peeking can inflate your false positive rate from the intended 5% to 40% or higher.
For performance metrics specifically, teams should monitor them during the experiment. If a change is degrading latency for 5% of users, you want to catch it and roll back quickly. That is the point of using a percentile metric as a guardrail. A platform that supports percentile metrics but not sequential testing for them makes the guardrail difficult to act on in practice.
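A small A/A simulation makes the inflation from peeking concrete. The numbers of looks, users, and simulations below are illustrative, not figures from this post:

```python
import numpy as np
from scipy import stats

def peeking_false_positive_rate(n_sims=2000, n_users=10_000, n_looks=10,
                                alpha=0.05, seed=0):
    """Share of A/A experiments declared significant at any of the looks."""
    rng = np.random.default_rng(seed)
    looks = np.linspace(n_users // n_looks, n_users, n_looks, dtype=int)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        # A/A: both groups come from the same distribution, so every
        # "significant" result is a false positive.
        a = rng.normal(0.0, 1.0, n_users)
        b = rng.normal(0.0, 1.0, n_users)
        for n in looks:
            se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
            if abs(a[:n].mean() - b[:n].mean()) / se > z_crit:
                false_positives += 1
                break
    # Well above the nominal alpha when every look reuses the fixed-horizon
    # critical value; a sequential test keeps it at alpha by design.
    return false_positives / n_sims
```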
Second: can you calculate sample size for percentile metrics? Before you run an experiment, you need to know how many users you require to have a reasonable chance of detecting the effect you care about. Since percentiles require different statistical machinery, it is not a given that a vendor with a sample size calculator supports percentile metrics. Without it, you are choosing your experiment duration with no principled basis for the choice. You either run too short and get ambiguous results, or run too long and waste time.
Sample size calculation also needs to integrate with the other percentile metric features. A valid sample size estimate needs to account for whether sequential testing will be used, how the percentile is aggregated, whether variance reduction is active, the observation window, and how missing values are handled. Each of those choices affects the underlying variance. A calculator that ignores any of them gives you a number that does not match the experiment you will actually run.
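As a rough illustration of what a principled calculation involves, here is a sketch for an event-level quantile using the asymptotic variance of a sample quantile, p(1-p)/f(q_p)^2, with the density estimated from pre-experiment data. It deliberately ignores sequential testing, variance reduction, within-user correlation, and windowing, which is exactly why a real calculator has to connect to those features. All names and defaults are illustrative.

```python
import numpy as np
from scipy import stats

def quantile_sample_size(pre_experiment_values, p=0.95, mde=10.0,
                         alpha=0.05, power=0.8):
    """Per-group sample size to detect an absolute shift of `mde` in the
    p-th quantile, assuming independent observations and equal groups."""
    values = np.asarray(pre_experiment_values, dtype=float)
    q_p = np.quantile(values, p)
    f_at_q = stats.gaussian_kde(values)(q_p)[0]     # density estimate at q_p
    var_per_obs = p * (1 - p) / f_at_q**2           # asymptotic variance * n
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    # Within-user correlation, CUPED, sequential testing, and windowing
    # would all change this number.
    return int(np.ceil(2 * var_per_obs * z**2 / mde**2))
```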
Third: what is the unit of aggregation, and can you do inference on both? There are two valid approaches. The first pools all observations across all users (every page load, every API call) and computes the percentile across that pool. This is effectively a ratio metric: the statistic is load time per load, where the denominator varies across users, and inference requires the same treatment as any ratio metric. The second approach aggregates within each user (their mean, sum, or their own P95 across their sessions) and then computes the percentile across those user-level values. A user-level P95 treats every user equally regardless of how many loads they generated; an event-level P95 gives more weight to high-frequency users. They answer different questions. Neither is wrong, but confusing them produces an estimate that does not match the question you think you are asking. You should have a clear answer to what you want to use the percentile for and confirm the vendor supports it.
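A small synthetic example shows that the two aggregations produce different numbers from the same data; the distributions and sizes below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1,000 synthetic users with a varying number of page loads each.
loads_per_user = rng.poisson(5, size=1_000) + 1
latencies = [rng.lognormal(mean=5.0, sigma=0.5, size=n) for n in loads_per_user]

# Event-level P95: pool every load; high-frequency users carry more weight.
event_level_p95 = np.quantile(np.concatenate(latencies), 0.95)

# User-level P95: aggregate within each user first (their own P95 here),
# then take the percentile across users; every user counts once.
user_level_p95 = np.quantile([np.quantile(x, 0.95) for x in latencies], 0.95)

print(event_level_p95, user_level_p95)  # generally two different numbers
```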
Fourth: is variance reduction available for percentile metrics, and is it reflected in the sample size calculation? Variance reduction techniques like CUPED reduce experiment noise by adjusting for pre-experiment behavior, which lets you detect the same effect with fewer users or reach significance faster. For mean metrics this is standard on most platforms. For percentile metrics it is often simply excluded. But even when it is available, the efficiency gain only materialises in planning if the sample size calculator accounts for it. A platform where you enable CUPED for a percentile metric but still plan your experiment as if CUPED were off gives you the runtime benefit while hiding it from your pre-experiment estimates. You end up running longer than you needed to.
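For context, the standard CUPED adjustment itself is short at the user level; the hard part is wiring it into percentile inference and planning. A minimal sketch of the adjustment, not any vendor's implementation:

```python
import numpy as np

def cuped_adjust(post, pre):
    """Standard CUPED adjustment of user-level values with a pre-experiment covariate."""
    post = np.asarray(post, dtype=float)
    pre = np.asarray(pre, dtype=float)
    # theta is the regression coefficient of post on pre; subtracting the
    # predicted component removes pre-experiment-explained variance
    # without changing the mean.
    theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
    return post - theta * (pre - pre.mean())
```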
Fifth: what is the observation window, and is it consistent for all users? Metrics are sensitive to how long each user is observed, especially if the metric uses a within-user aggregation before the group percentile is calculated. If some users have one day of post-exposure data and others have fourteen, the distribution shifts as the experiment runs. Extreme single values tend to be smoothed out within users over time, which means that users with a short window are likely to contribute more extreme values. A platform that offers fixed-window or cumulative-window metrics makes it possible to configure how you want to handle this. A fixed window waits for every user to complete the same observation period before entering the analysis; a cumulative window adds users as they arrive, so early users contribute more observations and the tail estimate shifts as the experiment runs. If a ratio metric (see the third point) is used, the comparison is less sensitive to the observation window. Ask whether the observation window (its length, and fixed vs cumulative) is configurable for percentile metrics.
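A sketch of the two window policies, assuming a simple events table with per-user exposure timestamps (the column names are illustrative):

```python
import pandas as pd

def fixed_window(events: pd.DataFrame, window: pd.Timedelta) -> pd.DataFrame:
    """Keep only users whose full window has elapsed, and only their events
    inside that window, so every analysed user is observed equally long."""
    analysis_time = events["event_ts"].max()
    window_complete = events["exposure_ts"] + window <= analysis_time
    inside_window = events["event_ts"] <= events["exposure_ts"] + window
    return events[window_complete & inside_window]

def cumulative_window(events: pd.DataFrame) -> pd.DataFrame:
    """Keep everything observed so far; users exposed early contribute
    more observations than users exposed late."""
    return events
```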
Sixth: how are zeros and missing values handled? Missing-value handling (sometimes called padding) determines what happens when a user is exposed to the experiment but generates no events for the metric during the window. They can be included as a zero or excluded entirely. For a latency metric the distinction is meaningful: users with very high latency might have timed out and generated no events at all, so padding their value with zero would distort the picture and make a regression look like an improvement. At the same time, excluding users with missing values in that situation leads to within-metric sample ratio mismatches: one group ends up with far fewer users for a metric than the other because of, for example, timeouts. To catch this, you might require per-metric sample ratio mismatch checks. Make sure you have decided how you want to handle missing values and ask whether the vendor supports it.
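A per-metric sample ratio mismatch check is straightforward to sketch: compare how many users contributed at least one event for the metric in each group against the assigned split. The names and the alpha threshold below are illustrative:

```python
from scipy import stats

def metric_srm_check(users_with_metric_a, users_with_metric_b,
                     assigned_a, assigned_b, alpha=0.001):
    """Flag a within-metric sample ratio mismatch: the share of users with
    at least one event for the metric should follow the assigned split."""
    observed = [users_with_metric_a, users_with_metric_b]
    total = users_with_metric_a + users_with_metric_b
    share_a = assigned_a / (assigned_a + assigned_b)
    expected = [total * share_a, total * (1 - share_a)]
    p_value = stats.chisquare(observed, f_exp=expected).pvalue
    # A tiny p-value means the metric's user counts are off relative to the
    # assignment, for example because timeouts removed users from one group.
    return p_value < alpha
```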
Seventh: do the other implementation choices hold when you explore by dimension? Slicing a P95 result by country, device type, or user segment is a legitimate use case. But it surfaces a coherence question: do the zero-handling policy, observation window, and unit of aggregation all apply consistently when you filter to a subgroup? Some platforms recompute a segment-level percentile fresh from raw events, bypassing the user-level aggregation or window configuration you set for the full experiment. The result is a subgroup number that does not measure the same thing as the top-line metric. Ask whether segmented views inherit the same zero-handling, windowing, and aggregation choices as the primary analysis, or whether those settings get dropped.
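One way to see the coherence question in code: a segment view computed from the already-aggregated user-level table inherits the window, padding, and aggregation choices of the top-line metric; one recomputed from raw events does not. A sketch with illustrative column names:

```python
import pandas as pd

def segment_p95_consistent(user_level: pd.DataFrame, segment_col: str) -> pd.Series:
    # `user_level` already reflects the window, padding, and per-user
    # aggregation used for the top-line metric; slicing it keeps all three.
    return user_level.groupby(segment_col)["value"].quantile(0.95)

def segment_p95_recomputed(raw_events: pd.DataFrame, segment_col: str) -> pd.Series:
    # Recomputing from raw events per segment silently drops that
    # configuration and measures something else.
    return raw_events.groupby(segment_col)["latency_ms"].quantile(0.95)
```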
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing, based on their public documentation.
"Not documented" means the vendor's public documentation and/or UI does not address this. It does not mean the capability is confirmed absent.
| Vendor | Percentile metrics | Sample size | Sequential testing | Variance reduction | All connected to sample size | Unit of aggregation |
|---|---|---|---|---|---|---|
| GrowthBook | Y | N | N | N | N/A | Configurable |
| Eppo | Y | Not documented | Not documented | Y | Not documented | Event-level |
| Statsig | Y (Warehouse Native) | Not documented | Not documented | Not documented | Not documented | Event-level |
| LaunchDarkly | Y (beta) | Not documented | Partial (guarded rollouts only) | N | Not documented | User-level |
| Confidence | N | N/A | N/A | N/A | N/A | N/A |
| PostHog | N | N/A | N/A | N/A | N/A | N/A |
| Amplitude | N | N/A | N/A | N/A | N/A | N/A |
| Optimizely | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented |
| VWO | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented |
GrowthBook supports percentile metrics with documented variance formulas for both event-level and user-level aggregation. Their documentation explicitly states: "Currently CUPED and Sequential Testing are not implemented for Quantile Testing." Their sample size calculator notes that "All non-quantile metrics are supported," meaning percentile metrics are explicitly excluded from pre-experiment planning. You can create a P99 latency metric in GrowthBook, but sample size estimation and sequential testing are not available for it.
Eppo documents percentile metrics with a specific confidence interval algorithm, and their CUPED documentation explicitly states variance reduction applies to all metric types including percentile, a claim they also make publicly. What their documentation does not address is whether sequential testing and sample size calculation extend to percentile metrics. Both features exist for standard metrics.
LaunchDarkly has been adding percentile support incrementally. Its implementation computes the percentile across per-user aggregates rather than raw events, which is the statistically simpler approach. Sequential testing for percentile metrics arrived in guarded rollouts in January 2026, but the feature is in beta, listed as incompatible with CUPED adjustments, and may cause results tab timeouts on large audiences. How LaunchDarkly constructs confidence intervals for percentile metrics is not documented anywhere. Sample size calculation for percentile metrics with or without sequential testing is not documented.
A complete percentile-metric implementation means sample size calculation that accounts for your aggregation choice, sequential testing connected to that sample size, variance reduction reflected in both, and configuration options for window and missing values that apply consistently across the full analysis and any dimensional breakdowns. A vendor that clears all of that gives you a useful extra lens on tail regressions. If any of those connections are missing, mean metrics with a complete offering are the better choice. Fragmented complexity is never worth it in our experience.

