Last updated: May 2026
Every experimentation platform claims to support sequential testing, and the claim is almost always incomplete. You read the feature list, see "sequential testing" or "always-valid results," and check the box. What you do not see is which metric types are covered, whether the sample size calculator accounts for the wider confidence intervals, whether the stopping rule carries valid error rate guarantees, or whether the sequential method applies consistently across the platform. Sequential testing is not a single feature. It is a set of interlocking commitments: the stopping rule must connect to the sample size calculation, the method must cover every metric type in the experiment, and the guarantees must hold in the mode you actually use.
When hundreds of experiments compete for the same user population, as they do across Spotify's highest-traffic surfaces, an invalid early stop on one experiment delays every experiment queued behind it. The cost of a flawed sequential method is not just one wrong decision. It is a scheduling problem that compounds across the entire program.
Should you add sequential testing to your experimentation platform requirements?
Yes, but not as "do you support sequential testing?" Every vendor will say yes. The question that matters is what the support actually covers and whether the sequential method connects to the rest of the platform.
Sequential testing solves a real and pervasive problem. In practice, teams look at experiment results before the planned sample size is reached. They check dashboards, respond to stakeholder questions, and monitor guardrail metrics for regressions. Without a valid sequential method, every one of those looks inflates the false positive rate (the probability of incorrectly declaring a difference when none exists). The alternative, a strict fixed-horizon discipline where nobody looks until the experiment ends, is unrealistic for most organizations. Sequential testing makes continuous monitoring statistically safe.
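The cost of unguarded peeking is easy to see in a quick simulation. The sketch below is purely illustrative (synthetic A/A data, a plain two-sample z-test, ten evenly spaced looks; none of it reflects any particular platform's implementation): it acts on the first look that clears the nominal 5% threshold and counts how often that happens when there is no true effect.

```python
# Illustrative only: how much ten unguarded peeks inflate the false positive
# rate of a fixed-horizon z-test in an A/A experiment (no true effect).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2_000                          # simulated A/A experiments
n_per_arm = 10_000                             # planned sample size per arm
look_fractions = np.linspace(0.1, 1.0, 10)     # peek at 10%, 20%, ..., 100% of the data
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, n_per_arm)
    b = rng.normal(0.0, 1.0, n_per_arm)        # same distribution: any "win" is spurious
    for frac in look_fractions:
        n = int(frac * n_per_arm)
        diff = b[:n].mean() - a[:n].mean()
        se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
        p_value = 2 * stats.norm.sf(abs(diff) / se)
        if p_value < alpha:                    # naive rule: act on the first significant look
            false_positives += 1
            break

print(f"Realized false positive rate: {false_positives / n_experiments:.3f}")
# Roughly 0.19 with ten looks, far above the nominal 0.05.
```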
But "we support sequential testing" can mean many things. It can mean the platform adjusts confidence intervals for peeking across all metric types and reflects the cost in the sample size calculator. It can also mean the platform offers a sequential toggle for one metric type in one inference mode, with no connection to planning and no coverage of the metrics that matter most to your guardrails. Both answers produce the same checkbox on an RFP. They produce very different experiments.
A sequential method that does not connect to the sample size calculator, does not cover all your metric types, or does not carry through to guardrail monitoring creates a specific failure mode. Our experience building and operating Confidence at Spotify has shown us, consistently, that an incomplete sequential implementation is more dangerous than having none: the "sequential testing" label invites trust that the method cannot justify when it does not actually cover the experiment being run.
The question your RFP should ask is not whether sequential testing exists. It is whether the implementation is complete enough that you can trust the results when you act on them.
What your RFP should ask instead of "yes or no"
Eight questions separate a connected sequential testing implementation from a disconnected one.
First: does sequential testing cover all your metric types? Sequential testing for means is the simplest case, and most platforms that offer sequential testing support it. But experiments rarely evaluate means alone. Ratio metrics like revenue per session, where both numerator and denominator vary, require different variance formulas. Percentile metrics like P95 latency require different statistical machinery entirely. If sequential testing works for means but not for ratio or percentile metrics, you are forced into a fixed horizon for your most important guardrails. That leaves exactly the metrics you most need to monitor early without a valid way to act on them. Ask which metric types support sequential testing, and whether the coverage includes every type you plan to use.
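To make the "different variance formulas" point concrete, here is a minimal delta-method sketch for a ratio metric, where numerator and denominator are per-user values that vary together. The function name and setup are ours for illustration, not any platform's API; percentile metrics need different machinery again (order-statistic or bootstrap-style intervals), which is why coverage for means does not automatically extend to them.

```python
# Delta-method variance for a ratio metric such as revenue per session:
# both numerator and denominator vary per user, and their covariance matters.
# Illustrative sketch; not any platform's implementation.
import numpy as np

def ratio_metric_variance(numerator: np.ndarray, denominator: np.ndarray) -> float:
    """Approximate variance of mean(numerator) / mean(denominator)."""
    n = len(numerator)
    mu_x, mu_y = numerator.mean(), denominator.mean()
    var_xbar = numerator.var(ddof=1) / n             # variance of the numerator mean
    var_ybar = denominator.var(ddof=1) / n           # variance of the denominator mean
    cov_xy = np.cov(numerator, denominator, ddof=1)[0, 1] / n
    r = mu_x / mu_y
    # First-order Taylor expansion of X-bar / Y-bar around (mu_x, mu_y).
    return (var_xbar - 2 * r * cov_xy + r**2 * var_ybar) / mu_y**2
```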
Second: is sequential testing reflected in the sample size calculation, and how? Sequential testing requires wider confidence intervals than fixed-sample testing to keep valid error rates while allowing continuous monitoring. That width comes at a cost: more data is needed to reach the same power. How much more depends on the method. Group sequential boundaries add only a few percent to the maximum required sample size. Always-valid confidence sequences can require 50% more. A sample size calculator that ignores this cost will tell you the experiment is powered when it is not. The disconnect is worse when the calculator applies a generic "sequential testing" adjustment without knowing which method will be used, because the adjustment factor varies by method. Ask whether the calculator adjusts automatically when sequential testing is enabled, whether it knows which sequential method will be used, and whether the adjustment is method-specific or a fixed multiplier.
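As a rough sketch of what a method-aware calculator does, the code below starts from the textbook fixed-sample formula for a difference in means and applies an inflation factor chosen by sequential method. The factors are placeholders consistent with the magnitudes described above, not any vendor's published constants; a real calculator should derive the factor from its own boundary.

```python
# Illustrative sample-size sketch: fixed-sample n per arm for a difference in
# means, then a method-specific inflation for sequential testing.
import math
from scipy import stats

# Placeholder factors in line with the rough magnitudes discussed above.
SEQUENTIAL_INFLATION = {
    "fixed_horizon": 1.00,
    "group_sequential": 1.05,   # pre-planned interim looks: a few percent extra
    "always_valid": 1.50,       # always-valid confidence sequences: up to ~50% more
}

def n_per_arm(mde: float, sd: float, alpha: float = 0.05, power: float = 0.8,
              method: str = "fixed_horizon") -> int:
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n_fixed = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mde ** 2
    return math.ceil(n_fixed * SEQUENTIAL_INFLATION[method])

# A calculator that ignores the method reports the same n for all three:
for method in SEQUENTIAL_INFLATION:
    print(method, n_per_arm(mde=0.5, sd=5.0, method=method))
```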
Third: what types of sequential tests are supported? The two main families are group sequential tests and always-valid confidence sequences. Group sequential tests plan a fixed number of interim analyses at pre-specified points. They are statistically efficient: the maximum sample size penalty is small (often under 5%). But they require you to specify the number and timing of looks in advance. If you look at a different time than planned, the boundary is no longer valid. Always-valid confidence sequences allow you to look at any time, as often as you want, with no pre-specified schedule. The tradeoff is lower power at any given sample size: confidence intervals are wider because the method implicitly corrects for an infinite number of analyses. The mixture sequential probability ratio test (mSPRT) used by several vendors is a closely related approach that shares the always-valid property. These are not interchangeable. They differ in power, in the sample size they require, and in the operational constraints they impose. Ask which methods are available and what tradeoffs the platform documents for each.
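The width tradeoff can be illustrated numerically. The sketch below compares a fixed-sample 95% interval against one always-valid construction from the research literature (an asymptotic confidence sequence with a tuning parameter rho); the specific formula and the rho value are our assumptions for illustration and do not represent any vendor's exact boundary.

```python
# Illustrative width comparison: fixed-sample 95% CI vs. one always-valid
# confidence-sequence construction (an asymptotic confidence sequence).
# The formula and the tuning parameter rho are assumptions for illustration.
import numpy as np
from scipy import stats

def fixed_width(n: int, sd: float = 1.0, alpha: float = 0.05) -> float:
    return stats.norm.ppf(1 - alpha / 2) * sd / np.sqrt(n)

def always_valid_width(n: int, sd: float = 1.0, alpha: float = 0.05,
                       rho: float = 0.05) -> float:
    v = n * rho ** 2 + 1
    return sd * np.sqrt(2 * v / (n ** 2 * rho ** 2) * np.log(np.sqrt(v) / alpha))

for n in (1_000, 10_000, 100_000):
    ratio = always_valid_width(n) / fixed_width(n)
    print(f"n={n:>7}: always-valid interval is {ratio:.2f}x wider")
# The interval stays wider at every n: the price of looking whenever you want,
# and the reason the required sample size can be substantially larger.
```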
Fourth: does sequential testing work properly in both Bayesian and frequentist modes? The question is not whether both modes exist. The question is whether each mode has a valid stopping rule with stated error rate guarantees. On the frequentist side, group sequential boundaries and confidence sequences provide well-defined false positive rate control. On the Bayesian side, stopping when a posterior probability crosses a threshold (such as "95% probability that B beats A") without a calibrated prior does not control the false positive rate. Valid Bayesian stopping rules exist. A Bayes factor threshold, which measures the relative evidence for one hypothesis over another, can provide error rate control when paired with a proper prior. But these require deliberate design. If the platform offers Bayesian experiments with a "stop anytime" claim but no documented stopping rule, the error rate guarantee is missing. Ask whether each inference mode has a stated stopping rule, what error rate it controls, and whether the guarantee is documented.
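The failure mode described here, stopping when a posterior probability crosses a threshold with no calibrated prior and no sequential correction, can also be checked by simulation. Everything in the sketch below is an illustrative assumption: flat Beta(1, 1) priors, a 5% baseline conversion rate, ten looks, and a 95% posterior threshold.

```python
# A/A simulation: stop when P(B beats A | data) exceeds 95% under flat
# Beta(1, 1) priors, with no sequential correction. All settings are
# illustrative assumptions, not any vendor's configuration.
import numpy as np

rng = np.random.default_rng(1)
n_experiments, n_per_arm, base_rate = 1_000, 20_000, 0.05
looks = np.arange(2_000, n_per_arm + 1, 2_000)   # peek every 2,000 users per arm
threshold, posterior_draws = 0.95, 4_000

declared_winner = 0
for _ in range(n_experiments):
    a = rng.random(n_per_arm) < base_rate        # arm A conversions
    b = rng.random(n_per_arm) < base_rate        # arm B: same rate, no true effect
    for n in looks:
        # Conjugate Beta posteriors; estimate P(B > A) by posterior sampling.
        pa = rng.beta(1 + a[:n].sum(), 1 + n - a[:n].sum(), posterior_draws)
        pb = rng.beta(1 + b[:n].sum(), 1 + n - b[:n].sum(), posterior_draws)
        if (pb > pa).mean() > threshold:
            declared_winner += 1
            break

print(f"A/A tests that 'found' a winner: {declared_winner / n_experiments:.1%}")
# Well above the 5% the 95% threshold seems to promise.
```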
Fifth: can you use sequential testing for monitoring even when the primary analysis is fixed-sample? Not every team wants to stop early based on the primary metric. Some commit to a fixed horizon for the main result because it maximizes power and simplifies interpretation. But those same teams still need to catch regressions in guardrail metrics and detect assignment bugs (sample ratio mismatches) as the experiment runs, not after it ends. Sequential testing for monitoring is a different use case from sequential testing for early stopping, and it requires the same statistical rigor. A platform that checks guardrails without a sequential correction inflates the false alarm rate every time it runs the check. A platform that defers all checks to the end of the experiment leaves regressions undetected for weeks. The right answer is sequential monitoring of guardrails and data quality checks running in the background, regardless of whether the primary analysis is sequential or fixed. Ask whether the platform supports sequential monitoring independently from the primary analysis method, and whether that monitoring uses a valid sequential correction. This connects directly to the monitoring and alerting post in this series: the statistical validity of monitoring depends on the sequential method backing it.
Sixth: are all supported sequential methods carried through consistently? Some platforms offer more than one sequential method. That is useful only if each method connects to the rest of the platform. If the platform offers both always-valid confidence sequences and group sequential boundaries, do both work across all metric types? Do both connect to the sample size calculator? Do both apply to guardrail monitoring? A platform that offers two methods but only connects one to the planning tools has the same disconnect as a platform that offers one method and no planning tools. The number of methods on the feature list matters less than whether each method is carried through from planning to analysis to decision. Ask whether every supported sequential method is reflected in the sample size calculator, applied across all metric types, and available for monitoring.
Seventh: does sequential testing interact correctly with multiple testing corrections? When an experiment evaluates multiple metrics, both the sequential boundary and the multiple testing correction affect the effective significance threshold. These adjustments are not additive in the way you might expect. A platform that applies Bonferroni across four success metrics and then layers a sequential correction on top needs to handle the interaction correctly, or the stated error rate does not hold. The sample size calculator needs to account for both adjustments simultaneously, not in isolation. Ask whether the platform documents how sequential testing and multiple testing corrections interact, and whether the combined adjustment is reflected in the sample size calculation.
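A small numeric sketch shows why the combined adjustment has to be computed jointly rather than by stacking two separate corrections. It reuses the illustrative always-valid width from the earlier sketch and applies Bonferroni across four success metrics; the specific numbers are not the point, the gap between the joint and the naively multiplied factor is.

```python
# Why the two adjustments do not simply multiply: compare the interval widening
# from Bonferroni alone, sequential testing alone, and both computed jointly.
# Reuses the illustrative always-valid width from the earlier sketch.
import numpy as np
from scipy import stats

def fixed_width(n: int, alpha: float) -> float:
    return stats.norm.ppf(1 - alpha / 2) / np.sqrt(n)

def always_valid_width(n: int, alpha: float, rho: float = 0.05) -> float:
    v = n * rho ** 2 + 1
    return np.sqrt(2 * v / (n ** 2 * rho ** 2) * np.log(np.sqrt(v) / alpha))

n, alpha, n_metrics = 10_000, 0.05, 4            # Bonferroni across 4 success metrics
bonferroni_only = fixed_width(n, alpha / n_metrics) / fixed_width(n, alpha)
sequential_only = always_valid_width(n, alpha) / fixed_width(n, alpha)
combined = always_valid_width(n, alpha / n_metrics) / fixed_width(n, alpha)

print(f"Bonferroni alone:   {bonferroni_only:.2f}x")
print(f"Sequential alone:   {sequential_only:.2f}x")
print(f"Computed jointly:   {combined:.2f}x")
print(f"Naively multiplied: {bonferroni_only * sequential_only:.2f}x")
```

In this illustration the jointly computed widening is smaller than the naive product, but the direction and size of the gap depend on the method; the point is that the calculator has to know both adjustments at once.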
Eighth: does the platform enforce the stopping rule, or only display it? A sequential boundary is only valid if the stopping decision respects the boundary. If the platform displays a sequential confidence interval but lets the experimenter ignore a significant guardrail regression and continue running, the error rate guarantee no longer holds. Enforcement can take different forms: automatic rollback, mandatory acknowledgment before continuing, or a hard stop. If the platform does nothing when the boundary is crossed, the error rate guarantee exists only in the documentation. Ask whether the platform enforces stopping rules or only surfaces them, and what happens when a sequential boundary is crossed.
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.
"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "—" means the platform does not offer the broader feature at all, so the question does not apply. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.
| Platform | Sequential method | Metric type coverage | Sample size calculation sequential-aware? | Frequentist stopping rule with false positive rate guarantee? | Bayesian stopping rule with error rate guarantee? | Methods carried through consistently? | Sequential and multiple testing interaction documented? | Sequential guardrail monitoring? | Enforcement on boundary crossing? | Other notes |
|---|---|---|---|---|---|---|---|---|---|---|
| Confidence | Group sequential + always-valid confidence sequences | Mean, ratio | Yes | Yes | — | Yes | Yes | Yes | Yes (configurable) | — |
| GrowthBook | Always-valid confidence sequences | Mean, proportion, ratio (not quantile) | Partial | Yes | — (frequentist only) | — | Not documented | Yes (Safe Rollouts) | Yes (auto-rollback) | — |
| Eppo | Always-valid confidence sequences + hybrid sequential | Mean, ratio, proportion, percentile | Partial | Yes | No | Partial | Not documented | Partial | No | Hybrid sequential switches to fixed-sample at planned end date |
| Statsig | Mixture sequential probability ratio test | Mean, ratio, event count, percentile | Partial | Yes | — (frequentist only) | — | Not documented | Partial | No documented enforcement | — |
| Optimizely | Mixture sequential probability ratio test | Conversion, revenue | Partial | Yes | — | — | Yes | Not documented | Not documented | Frequentist fixed-horizon and Bayesian modes are in beta |
| LaunchDarkly | Always-valid confidence sequences | Binary, numeric | No (fixed-horizon only) | Yes | — | — | Not documented | Yes (Guarded Rollouts) | Yes (auto-rollback) | — |
| VWO | Bayesian sequential (SmartStats) | Conversion, revenue | Yes | — (no frequentist mode) | Partial | — | Not documented | Yes (auto-pause) | Yes (auto-pause) | Bonferroni built in |
| PostHog | None | Funnel, mean, ratio (frequentist); count, conversion, continuous (Bayesian) | — | No (fixed-horizon only) | No | — | — | No | No | — |
| Amplitude | Mixture sequential probability ratio test | Uniques, averages, sum (no percentile) | No | Yes | — (frequentist only) | — | Not documented | No | No | — |
Five patterns emerge from this comparison.
The first pattern is the gap between sequential analysis and sequential planning. Three vendors (LaunchDarkly, Amplitude, and to a lesser extent Statsig) offer sequential testing in the analysis but do not fully connect it to the sample size calculator. Optimizely's documentation explicitly states that sequential experiments do not require sample size calculation, treating planning as separate from analysis, even though its calculator uses the Stats Engine sequential formula. LaunchDarkly takes a similar position: its calculator applies to fixed-horizon experiments only, and sequential experiments proceed without a sample size target. Amplitude's duration estimator explicitly supports only t-tests while the analysis engine defaults to sequential testing. The result is that teams using sequential testing on these platforms have no principled way to plan experiment duration or allocate traffic. GrowthBook and Eppo stand out here: both adjust the sample size calculation to account for sequential testing, with Eppo scaling the minimum detectable effect by the ratio of sequential to fixed-sample confidence interval width, and GrowthBook inflating the standard error in the power formula.
The second pattern is incomplete metric type coverage. Sequential testing for simple means and proportions is nearly universal. But experiments rarely evaluate only those types. Ratio metrics (revenue per session, pages per visit) and percentile metrics (P95 latency, P75 load time) are common in guardrail sets, and these are exactly the metrics teams need to monitor sequentially. GrowthBook supports quantile metrics but explicitly does not support sequential testing for them. Eppo supports percentile metrics, but its documentation does not explicitly confirm sequential analysis applies to them. Statsig recently added percentile metrics, though the interaction with sequential testing is not documented. Most other vendors do not support percentile metrics at all, a gap covered in detail in the percentile metrics post in this series. A platform that offers sequential testing for means but forces you into a fixed horizon for ratio and percentile metrics creates an inconsistency: the metrics you most need to monitor early are the ones without a valid sequential method.
The third pattern is the Bayesian stopping rule gap. Every vendor in this comparison offers a Bayesian mode or a Bayesian-first engine. But offering Bayesian analysis and offering a valid Bayesian stopping rule are different things. VWO's SmartStats applies a sequential correction to its Bayesian posterior probabilities and documents false positive rate control as a configurable parameter. This is the most explicit Bayesian stopping rule among the vendors reviewed. PostHog's Bayesian mode allows peeking with no documented sequential correction or error rate guarantee. Optimizely and LaunchDarkly both offer Bayesian analysis as a separate mode, but neither documents a sequential stopping rule for it. Eppo's documentation states directly that Bayesian methods make no promises about the false positive rate. The result is that for most vendors, the only path to a sequential stopping rule with stated error rate guarantees is the frequentist mode. If your organization uses Bayesian analysis, ask what the stopping rule is and what it guarantees. If the answer is "Bayesian inference avoids the peeking problem," that is not a stopping rule.
The fourth pattern is sequential monitoring as a separate capability. The strongest use case for sequential testing is not always early stopping on the primary metric. It is catching regressions in guardrail metrics and data quality issues while the experiment runs. GrowthBook's Safe Rollouts and LaunchDarkly's Guarded Rollouts both use one-sided sequential tests specifically for guardrail monitoring, with automatic rollback when a regression is detected. Confidence monitors guardrails sequentially in every experiment, even when the primary analysis uses a fixed horizon. VWO allows guardrail metrics to automatically pause campaigns when a threshold is breached. Eppo surfaces guardrail warnings using sequential confidence intervals but does not enforce a response. Statsig applies sequential adjustments during monitoring but does not document automatic enforcement. Optimizely, PostHog, and Amplitude do not document sequential monitoring as a capability distinct from the primary analysis. Teams that commit to a fixed horizon for the main result but still want valid early detection of regressions should ask whether the platform supports sequential guardrail monitoring independently, and whether it enforces the boundary or only displays it.
The fifth pattern is the enforcement gap. A sequential boundary is a statistical contract: the error rate guarantee holds only if the stopping decision respects the boundary. Four vendors (Confidence, GrowthBook, LaunchDarkly, and VWO) enforce this contract in at least some contexts, through automatic rollback, automatic pause, or configurable alerts tied to guardrail monitoring. The remaining vendors surface sequential results but leave the stopping decision entirely to the experimenter. This is not inherently wrong, but it means the guarantee depends on human discipline. At scale, with hundreds of concurrent experiments, relying on every team to respect a sequential boundary they may not fully understand is a fragile guarantee.
The right RFP question is not "do you support sequential testing?" It is whether the sequential method is carried through from planning to analysis to monitoring to enforcement, consistently across every metric type and inference mode you will use. If it is not, you are trusting results produced by a method that was never designed to work as a whole.