Last updated: May 2026
Every experimentation platform has a sample size calculator, but the number it gives you may not match the experiment you will actually run. You enter a baseline, a minimum detectable effect, and your desired power, and the calculator assumes a fixed-horizon test with a single metric and no sequential testing. If your platform uses sequential stopping rules, tests multiple metrics with a correction, applies variance reduction, uses metric types beyond simple means, or targets a specific segment of your user base, the number is built for a different experiment than the one you will run.
On Spotify's mobile home screen, dozens of teams share the same traffic pool, with roughly ten new experiments launching every week. An underpowered experiment wastes a slot that could have gone to something conclusive. An overpowered experiment ties up traffic for weeks longer than necessary. The waste starts at the sample size calculator, whenever the number it produces ignores how the experiment will actually be analyzed.
Should you add sample size calculation to your experimentation platform requirements?
Sample size calculation is a must-have, but not just any sample size calculator will do. Keeping track of the false negative rate and being able to plan the run length of an experiment are essential for running a high-throughput experimentation program. The need for planning holds regardless of what platform and inference framework you use.
But the question "do you have a sample size calculator?" is the wrong question. Every vendor will say yes. The question that matters is whether the calculator accounts for the full experimental design, or at least the most important parts of it: the stopping rule, variance reduction, the metric types, the number of metrics and how they interact, the experimental structure, the target population, and the traffic actually available to you. Noticing after the experiment has started that something essential was left out of the sample size calculation is a recipe for low trust and frustration.
At Spotify, the sample size calculator in Confidence is built around the analysis pipeline. Beyond the standard inputs like alpha, power, and sequential testing, the calculator accounts for our multi-metric decision rule and its implied inference. Add a second success metric, and the required sample size updates to reflect the tighter significance threshold. A sample size calculator that doesn't understand the role of different metrics for the decision rule will be inefficient.
Without a tight connection between the calculator and the analysis, a calculator that says you need 10,000 users per group, attached to an analysis that uses sequential boundaries requiring 14,000 users per group, produces an experiment that is underpowered from the start. The team sees "powered" in the planning tool and "inconclusive" when the experiment ends.
What your RFP should ask instead of "yes/no"
Ten questions separate a connected sample size implementation from a disconnected one.
First: does the calculator account for your stopping rule? Sequential testing allows you to monitor results during the experiment and stop early if the effect is clear, without inflating false positive rates. Different sequential methods (group sequential tests, always-valid confidence sequences) have different power profiles and require different sample sizes. A calculator that assumes a fixed horizon while the analysis uses sequential boundaries will produce the wrong sample size. How large the discrepancy is depends on the method: Group sequential boundaries add only a few percent to the maximum required sample size, while fully sequential approaches like always-valid confidence sequences can require 50% more. Ask whether the calculator adjusts automatically when sequential testing is enabled, and whether it knows which sequential method will be used.
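As a rough illustration of how much the stopping rule can move the number, the sketch below computes a standard fixed-horizon sample size and scales it by an assumed inflation factor per method. The factors are illustrative placeholders consistent with the ranges above, not any vendor's actual boundary costs.

```python
from scipy import stats

def fixed_horizon_n(mde, sd, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided z-test on a difference in means."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return 2 * ((z_alpha + z_beta) * sd / mde) ** 2

# Illustrative inflation factors (assumptions, not vendor numbers): group
# sequential designs add a few percent to the maximum required sample size,
# always-valid confidence sequences can add substantially more.
inflation = {"fixed_horizon": 1.00, "group_sequential": 1.03, "confidence_sequence": 1.50}

n = fixed_horizon_n(mde=0.5, sd=10.0)
for method, factor in inflation.items():
    print(f"{method}: {round(n * factor):,} users per group")
```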
Second: does the calculator account for multi-metric decision rules? Most experiments evaluate more than one metric. If your platform corrects for multiple comparisons across success metrics, the effective significance threshold per metric is lower than the nominal alpha. For example, two success metrics with Bonferroni correction means each metric is tested at alpha = 0.025 instead of 0.05. That increases the required sample size by roughly 20%, and more when combined with other adjustments like sequential testing. A calculator that does not know how many success metrics you have and how guardrail metrics (metrics that must not degrade) are treated in the correction cannot produce the right number.
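The roughly 20% figure falls out of the standard normal-approximation formula: sample size scales with the squared sum of the alpha and power quantiles, so the Bonferroni cost of a second success metric is just a ratio. A minimal sketch, assuming two-sided z-tests at 80% power:

```python
from scipy.stats import norm

z_power = norm.ppf(0.80)
single = (norm.ppf(1 - 0.05 / 2) + z_power) ** 2       # one success metric, alpha = 0.05
bonferroni = (norm.ppf(1 - 0.025 / 2) + z_power) ** 2  # two metrics, alpha = 0.05 / 2
print(f"sample size increase: {bonferroni / single - 1:.0%}")  # roughly +21%
```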
Third: does the calculator account for variance reduction? Methods like CUPED use pre-experiment data to reduce metric variance, lowering the sample size needed to detect a given effect. Six of the eight vendors reviewed offer variance reduction in their analysis pipeline, but none factor it into the sample size calculator. The result: every experiment runs on raw-variance estimates and ends up overpowered. One overpowered experiment is not so bad. But if variance reduction cuts required sample size by 30% and the calculator ignores that, every experiment runs 30% longer than necessary. Across hundreds of concurrent experiments competing for the same traffic, that is the difference between running your roadmap in three quarters and running it in four. A platform that offers variance reduction but does not connect it to the calculator captures only half the value. Ask whether the calculator adjusts sample size and runtime when variance reduction is enabled.
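Because the required sample size scales linearly with metric variance, a CUPED-style adjustment with pre/post correlation rho cuts the requirement by a factor of rho squared. A minimal sketch with an assumed correlation:

```python
rho = 0.55        # assumed correlation between pre-experiment and in-experiment metric
n_raw = 14_000    # per-group requirement from a raw-variance calculation

n_cuped = n_raw * (1 - rho ** 2)   # CUPED variance factor (1 - rho^2)
print(round(n_cuped))              # ~9,800 per group, roughly 30% fewer users
```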
Fourth: does the sample size calculation account for which multiple testing correction is used? Many platforms support correction methods beyond Bonferroni, such as Holm, Hommel, or Benjamini-Hochberg. These methods can recover some power that Bonferroni spends, but they also make sample size calculation harder. Bonferroni is the only correction whose power cost plugs directly into a standard sample size formula: just use alpha divided by the number of success metrics. Holm and Hommel require simulation because their rejection decisions depend on the ordering of test statistics across metrics. Benjamini-Hochberg requires a pre-experiment estimate of how many metrics have no real effect. As our analysis of ~1,300 Spotify experiments shows, the practical power difference between these methods is often modest: Holm and Hommel gained roughly 4.5 percentage points in ship rate over Bonferroni, and Benjamini-Hochberg gained 4.9. But a platform that uses Holm or Benjamini-Hochberg in the analysis while the sample size calculator assumes Bonferroni, or no correction at all, produces a sample size that does not match the actual test. Ask which correction method the calculator accounts for, and whether it matches what the analysis will use.
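Because Holm's rejections depend on the ordering of p-values across metrics, its power has no simple closed form and is easiest to estimate by simulation. The sketch below is one way to do that, assuming independent two-sided z-tests; it illustrates the approach, not Confidence's implementation.

```python
import numpy as np
from scipy import stats

def holm_reject(pvals, alpha=0.05):
    """Holm step-down: boolean rejection mask for each p-value."""
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

def holm_power(n, effects, sd, alpha=0.05, sims=20_000, seed=1):
    """Monte Carlo power per metric under Holm, assuming independent z-tests."""
    rng = np.random.default_rng(seed)
    se = sd * np.sqrt(2 / n)                    # std. error of the mean difference
    z = rng.normal(np.asarray(effects) / se, 1.0, size=(sims, len(effects)))
    pvals = 2 * stats.norm.sf(np.abs(z))
    return np.array([holm_reject(p, alpha) for p in pvals]).mean(axis=0)

# two real effects and one null metric, 10,000 users per group
print(holm_power(n=10_000, effects=[0.5, 0.3, 0.0], sd=10.0))
```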
Fifth: does the calculator cover your metric types? The calculator should support the metric types you actually use. Ratio metrics like revenue per user, where both numerator and denominator vary across users, require different variance formulas than simple averages. Percentile metrics like P95 latency require different statistical machinery entirely. Most vendor calculators do not support percentile metrics at all. If your calculator does not handle your metric type, the sample size it produces is not valid.
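For ratio metrics, the usual workaround is the delta method: collapse the numerator/denominator pair into a per-user "effective variance" that can be plugged into the same sample size formula as a simple mean. A sketch of that calculation, not any specific vendor's implementation:

```python
import numpy as np

def ratio_effective_variance(y, x):
    """Delta-method per-user variance of a ratio metric mean(y)/mean(x),
    e.g. revenue per session with per-user y = revenue, x = sessions.
    Plug the result into a standard sample size formula in place of the
    variance of a simple per-user average."""
    r = np.mean(y) / np.mean(x)
    var_y, var_x = np.var(y, ddof=1), np.var(x, ddof=1)
    cov_yx = np.cov(y, x)[0, 1]
    return (var_y - 2 * r * cov_yx + r ** 2 * var_x) / np.mean(x) ** 2
```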
Sixth: does the calculator support your experimental design and analysis framework? Standard A/B tests randomize at the user level with frequentist inference. If your experiments deviate from that, the calculator needs to keep up. Cluster randomized experiments randomize at a higher level (a market, a region, a device type), and the effective sample size depends on the number of clusters and the intra-cluster correlation, not the number of users. Counting users when the experiment randomizes by cluster overstates your power. Bayesian analysis requires its own planning framework: the concepts of power and sample size still apply, but the calculations are different. Every vendor reviewed offers Bayesian analysis, but only two offer a corresponding planning tool in the calculator. The rest leave teams with no principled basis for choosing experiment duration. If you run cluster experiments, use Bayesian methods, or both, ask whether the calculator accounts for your actual design.
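For cluster randomization, the standard correction is the design effect: a user-level sample size gets inflated by 1 + (average cluster size − 1) × intra-cluster correlation. A minimal sketch with illustrative numbers:

```python
def cluster_adjusted_n(n_user_level, avg_cluster_size, icc):
    """Inflate a user-level sample size by the design effect for cluster randomization."""
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n_user_level * design_effect

# 10,000 users per group from a user-level calculation, clusters of 200 users,
# intra-cluster correlation 0.02 -> ~50,000 users (about 250 clusters) per group
print(round(cluster_adjusted_n(10_000, 200, 0.02)))
```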
Seventh: does the platform offer both pre-experiment and during-experiment sample size estimation? A pre-experiment calculator uses historical data to estimate metric variance and produce a sample size before the experiment starts. Every platform should have this. But pre-experiment estimates cannot account for variance introduced by the treatment itself. For most experiments this gap is small. For experiments that fundamentally change user behavior, it can invalidate the power calculation.
A during-experiment calculator solves this by updating the estimate with actual experiment data as it arrives. Peeking at variance-based statistics like power does not inflate false positive rates, unlike peeking at p-values. If variance was underestimated, the platform extends the experiment. If overestimated, it finishes sooner. Some vendors offer partial during-experiment estimates (Amplitude and PostHog update duration projections), but none offer a principled power recalculation that adapts the sample size target based on observed variance. Ask whether the platform offers during-experiment estimation, and whether it is a projection or an actual recalculation.
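Because the required sample size scales linearly with variance, a during-experiment recalculation can be as simple as rescaling the original target by the ratio of observed to assumed variance. A minimal sketch of that update:

```python
def revised_target(planned_n, planned_sd, observed_sd):
    """Rescale the per-group target by the ratio of observed to planned variance."""
    return planned_n * (observed_sd / planned_sd) ** 2

# planned for 10,000 users per group assuming sd = 10; two weeks in, observed sd = 12
print(round(revised_target(10_000, 10.0, 12.0)))   # 14,400 -> extend the experiment
```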
Eighth: does the platform use historical data to estimate runtime automatically? A sample size in users is only half the answer. Teams need to know how long the experiment will take. Some calculators ask you to enter metric variance, baseline values, and daily traffic manually. Others pull historical data for the metric and population you selected, estimate variance and the intake curve automatically, and tell you the experiment will take approximately four weeks. The second version is less error-prone (manual variance estimates are often wrong) and accessible to experimenters who are not statisticians. Ask whether the calculator estimates variance from historical data, models the intake curve, and updates the runtime estimate when you change traffic allocation or add metrics.
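Converting a sample size into a runtime is, at its simplest, cumulative intake versus the target. The sketch below assumes a flat daily intake of unique eligible users; real intake curves flatten as returning users repeat, which is exactly why pulling the historical curve beats a manual guess.

```python
import math

def estimated_runtime_days(required_per_group, groups, daily_unique_users, allocation):
    """Days until a flat daily intake of unique users covers the required sample."""
    needed = required_per_group * groups
    return math.ceil(needed / (daily_unique_users * allocation))

# 14,000 per group, two groups, 10,000 eligible users/day, 10% traffic allocation
print(estimated_runtime_days(14_000, 2, 10_000, 0.10))   # 28 days, roughly four weeks
```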
Ninth: does the calculator account for experiment targeting? Most experiments target a subset of users: a market, a platform, a behavioral segment. When you narrow the population, metric variance within that segment may differ from overall variance, and the number of available users shrinks. A calculator that uses overall variance for a single-market experiment will get both the sample size and runtime wrong. The platform should connect targeting criteria to historical data so the calculator estimates variance and population size for the segment you will actually test, whether by pointing at a previous experiment or mapping targeting rules to logged data. Ask whether the calculator adjusts when targeting is applied.
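In practice this means the calculator's variance and population inputs should come from the targeted slice of historical data, not the overall population. A sketch of what that lookup might look like, with hypothetical table and column names:

```python
import pandas as pd

def segment_planning_inputs(history: pd.DataFrame, market: str, metric: str):
    """Variance and population size for the targeted segment only.
    `history` is a per-user table; column names here are hypothetical."""
    segment = history[history["market"] == market]
    return segment[metric].var(), segment["user_id"].nunique()
```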
Tenth: does the calculator account for traffic coordination? The runtime estimate is only right if the calculator knows how many users are actually available. Holdbacks reduce the reachable population. Mutual exclusion with other experiments reduces it further. A calculator that estimates "two weeks" based on full traffic, when coordination constraints leave you 60%, will underestimate runtime by days or weeks. Ask whether the calculator surfaces the reachable population after holdbacks and exclusions, and whether the runtime estimate reflects the traffic you will actually get.
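The coordination adjustment itself is simple arithmetic; the hard part is that the calculator has to know the constraints at all. With illustrative numbers:

```python
daily_eligible = 10_000
holdback = 0.10          # 10% global holdback
exclusive_share = 0.67   # share left after mutual exclusion with other experiments

daily_reachable = daily_eligible * (1 - holdback) * exclusive_share
print(daily_reachable / daily_eligible)   # ~0.60: a "two week" estimate becomes ~3.3 weeks
```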
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.
"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "—" means the platform does not offer this feature, so the question does not apply. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.
| Platform | Calculator? | Sequential-aware? | Correction-method-aware? | Variance-reduction-adjusted? | Metric-type-aware? | Multi-metric-aware? | Design-aware (cluster, Bayesian)? | During-experiment monitoring? | Historical data for runtime? | Targeting-aware? | Coordination-aware? | Other SSC gaps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Confidence | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | — |
| GrowthBook | Yes | Partial | No | No | Percentiles not supported | No | Partial (Bayesian only) | No | Yes | Partial | No | — |
| Eppo | Yes | Yes | Not documented | No | Most (no funnels) | Not documented | No | Not documented | Yes | Partial | No | — |
| Statsig | Yes | No | No | No | Yes | Partial | No | No | Yes | Partial | No | — |
| LaunchDarkly | Yes (fixed-horizon only) | No | No | No | Binary + numeric | No | No | Not documented | Not documented | No | No | — |
| PostHog | Yes | — | — | — | 3 types | — | — | Partial | Not documented | No | No | — |
| Amplitude | Yes | No | No | No | Limited | No | No | Partial | Yes | Partial | No | — |
| Optimizely | Yes | Partial | No | No | Conversion only | No | No | Not documented | Not documented | No | No | — |
| VWO | Yes | Yes | Not documented | — | 2 types | Not documented | Not documented | Not documented | Yes | No | No | — |
Four patterns stand out. The first is the disconnect between analysis and planning. Optimizely's calculator claims to be "powered by Stats Engine," which uses sequential testing, but accepts only fixed-horizon inputs (baseline rate, MDE, significance level). Their documentation states that sequential experiments do not require sample size calculation at all. Amplitude's duration estimator supports only t-tests while their analysis engine defaults to sequential testing.
The second pattern is partial integration. Sequential testing requires wider confidence intervals than fixed-sample testing to keep valid error rates while allowing continuous monitoring. How much wider depends on the method and the sample size. Eppo's calculator knows this ratio for their confidence sequence implementation and applies it directly: switch from fixed-sample to sequential, and the required sample size scales up to reflect the cost of continuous monitoring. VWO's calculator is explicitly designed for its SmartStats Bayesian engine, creating a direct link between planning and analysis. These are the two most tightly integrated calculators among the vendors reviewed, but neither accounts for the full set of inputs: variance reduction, multi-metric decision rules, cluster randomization, percentile metrics, targeting-aware variance estimation, traffic coordination, or during-experiment power updates.
LaunchDarkly takes a different approach. Their calculator covers fixed-horizon experiments only, and their documentation states that sequential experiments do not require sample size calculation. This avoids the disconnect but leaves teams without planning guidance for sequential experiments.
A third pattern is the gap between the methods a platform supports and what its calculator accounts for. Four vendors support cluster randomization and three support switchback experiments, but no calculator adjusts for cluster-level variance inflation or time-period structure. Every vendor reviewed offers Bayesian analysis, but only GrowthBook and VWO offer a corresponding planning tool in the calculator. Optimizely and VWO both support multivariate testing, but neither offers a multivariate testing-specific sample size calculation.
A fourth pattern emerges around targeting and coordination. Several vendors (GrowthBook, Eppo, Statsig, Amplitude) let you scope the calculator to a segment or past experiment, so the population size reflects who the experiment will reach. But scoping the population is not the same as adjusting variance for it. Statsig's documentation explicitly warns that its tool "does not account for the fact that experiments targeting only a subset of users may have different summary statistics." Eppo comes closest: its Entry Points pull variance estimates from warehouse data for the scoped population. All vendors support holdbacks and mutual exclusion as configuration features, but only Confidence takes them into account in the calculator.
If you only run simple fixed-horizon tests with a single metric, a standard calculator gives you the right number. The disconnect only matters when the analysis method diverges from the calculator's assumptions. For teams using sequential testing, variance reduction, multiple metrics, correction procedures, targeted populations, or coordinated traffic, that divergence is the norm. The right RFP question is whether the calculator and the analysis share the same assumptions. If they do not, you are planning one experiment and running another.