What Makes a Good Sample Size Calculator?

Mårten Schultzberg, Staff Data Scientist

Every experimentation platform has a sample size calculator. You enter a baseline conversion rate, a minimum detectable effect, and your significance and power targets. The calculator returns a number. That number is only correct if the calculator knows how the experiment will be analyzed.

Most calculators do not. They assume a fixed-sample test with a single metric. If your experiment uses sequential testing, corrects for multiple metrics, or measures percentiles, the number is inaccurate. You are planning one experiment and running another.

The calculator is the frontend of the analysis pipeline

The core property of a good sample size calculator is a tight connection to the analysis. A sample size is accurate for one specific analysis procedure and inaccurate for all others. If you calculate the sample size assuming (implicitly or explicitly) one analysis but run a different one in practice, the number is inaccurate by construction. A connected calculator reads the same configuration as the analysis. Change the stopping rule, and the required sample size updates. Add a success metric with a multiple testing correction, and the significance threshold adjusts. Switch to a different sequential testing method, and the power profile recalculates.

In practice, this is rare. Most calculators are standalone widgets: a form with four inputs and a formula that assumes a two-sample z-test run to a fixed sample. The analysis engine sits elsewhere, configured separately, often using different methods entirely. The experimenter sees a green checkmark from the calculator and an inconclusive result from the analysis, with no indication that the two were never in agreement. Each of the sections below describes one way that gap creates a concrete error, and those errors compound with one another.
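For reference, the standalone widget described above typically computes something like the following. This is a minimal sketch of the textbook fixed-sample formula for a two-sided, two-sample z-test on proportions; the function name `fixed_sample_size` and the example inputs are illustrative assumptions, not any particular platform's implementation.

```python
from math import ceil
from statistics import NormalDist


def fixed_sample_size(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test on proportions.

    Assumes a single metric and a single analysis at the end of the
    experiment, which is exactly the assumption that often does not hold.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical inputs: 10% baseline conversion, 5% relative lift.
n_per_group = fixed_sample_size(baseline=0.10, relative_mde=0.05)
print(n_per_group)
```

With these inputs the result is on the order of 58,000 users per group. Nothing in the formula knows about sequential looks, extra metrics, or variance reduction, so any of those in the real analysis makes this number wrong.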

At Spotify, experiment throughput is very high, and bandwidth is always scarce relative to the number of ideas we want to test. An underpowered experiment wastes a slot that could have gone to something conclusive. An overpowered one ties up traffic for weeks longer than necessary. Either way, the team loses time it cannot recover.

Sequential testing changes the required sample size

If your platform uses sequential testing (analyzing results at multiple points during the experiment rather than only at the end), the required sample size depends on which sequential method is used.

Different sequential methods have very different power profiles, as we cover in our blog post. Group sequential tests with O'Brien-Fleming boundaries add only a few percent to the maximum required sample size compared to a fixed-sample test. Always-valid confidence sequences can require 50% or more additional observations to achieve the same power. At a fixed sample size of 500 observations per group, group sequential tests achieve power of 0.89 to 0.93, while always-valid methods achieve 0.71 to 0.76.

A calculator that assumes a fixed-sample test while the analysis uses always-valid confidence sequences will underestimate the required sample size by a third or more. A team that trusted the calculator enters the experiment underpowered from the start. The result comes back inconclusive, and nobody realizes the calculator was the problem. The error runs the other way too: a calculator that assumes always-valid methods when the experiment uses group sequential boundaries will overestimate the sample size, tying up traffic longer than necessary.
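The arithmetic behind "50% more observations" and "underestimates by a third" can be checked in a few lines. In this sketch the inflation factors are expressed relative to the fixed-sample requirement; the helper `misplanning` and the 1.03 factor standing in for "a few percent" are illustrative assumptions.

```python
def misplanning(assumed_factor, actual_factor):
    """Relative error of the planned sample size.

    Both factors are inflation factors relative to a fixed-sample test.
    Negative means the plan is too small (underpowered),
    positive means it is too large (traffic tied up for too long).
    """
    return assumed_factor / actual_factor - 1


# Calculator assumes fixed-sample (1.0), analysis is always-valid (1.5).
under = misplanning(assumed_factor=1.0, actual_factor=1.5)

# Calculator assumes always-valid (1.5), analysis is group sequential (~1.03).
over = misplanning(assumed_factor=1.5, actual_factor=1.03)

print(f"{under:+.2f}", f"{over:+.2f}")
```

The first case plans about a third too few users; the second plans roughly 46% too many under the assumed factors.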

A connected calculator knows more than whether sequential testing is turned on. It knows which sequential method is configured, how many interim analyses are planned, and what rule governs how alpha is allocated across those looks (the spending function). Each of these affects the required sample size. And this is before any multiple testing correction enters the picture.
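To make the spending function concrete, here is a minimal sketch of one common choice, the Lan-DeMets O'Brien-Fleming-type spending function for a two-sided test. Whether a given platform uses this exact function is an assumption; the point is that the number of looks and the spending rule both determine how much alpha is available at each analysis.

```python
from statistics import NormalDist


def obf_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given information fraction
    (0 < t <= 1) under the Lan-DeMets O'Brien-Fleming-type function:
    alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / information_fraction ** 0.5))


looks = 5  # hypothetical number of equally spaced analyses
cumulative = [obf_alpha_spent(k / looks) for k in range(1, looks + 1)]
for k, spent in enumerate(cumulative, start=1):
    print(f"look {k}: cumulative alpha spent = {spent:.5f}")
```

Almost no alpha is spent at the early looks and the full 0.05 is reached only at the final one, which is why this family of designs costs so little in maximum sample size.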

Multiple metrics change the significance threshold

Most experiments evaluate more than one metric. The median experiment at Spotify has 2 success metrics, 4 guardrail metrics, plus company-wide regression metrics (see our blog post for more). When you test multiple success metrics, the significance threshold must be corrected to control the false positive rate across the set.

The correction changes the required sample size. Two success metrics with a Bonferroni correction means each metric is tested at alpha = 0.025 instead of 0.05. That increases the required sample size by roughly 20%. Without knowing how many success metrics the experiment has, the calculator will undercount.
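The "roughly 20%" figure follows directly from the standard formula, where the required sample size is proportional to the squared sum of the two normal quantiles. A minimal sketch, assuming a two-sided test at 80% power (the function names are illustrative):

```python
from statistics import NormalDist


def z_term(alpha, power):
    """(z_{1-alpha/2} + z_{power})^2, the part of the sample size
    formula that depends on the significance and power targets."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2


def bonferroni_inflation(n_success_metrics, alpha=0.05, power=0.80):
    """Sample size with a Bonferroni-corrected alpha, relative to
    testing a single metric at the uncorrected alpha."""
    corrected_alpha = alpha / n_success_metrics
    return z_term(corrected_alpha, power) / z_term(alpha, power)


inflation = bonferroni_inflation(2)
print(f"{inflation:.2f}")  # about 1.21
```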

Two details make this harder than it sounds. First, not all metrics enter the correction the same way. Guardrail metrics do not require an alpha correction, because all guardrails must pass simultaneously and the false positive rate naturally decreases as you add more. But they do affect the required sample size through power: since every guardrail must be non-inferior at the same time, the per-metric power must be corrected upward to keep the overall decision adequately powered (we cover this in our blog post, in our engineering blog post, and in our accompanying paper). A good calculator distinguishes between success metrics and guardrails. Second, different correction methods require different calculations. Bonferroni is the simplest case: the corrected threshold is fixed before any data arrives, so the power cost plugs directly into a standard sample size formula. Step-wise methods like Holm-Bonferroni require simulation because their rejection decisions depend on the ordering of test statistics.
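The guardrail power correction can be illustrated with a union bound: if each of G guardrails has type II error at most beta / G, the probability that at least one fails to show non-inferiority is at most beta. This is a conservative sketch under that assumption, using one-sided tests and an uncorrected alpha; it is not necessarily the exact correction described in the referenced posts.

```python
from statistics import NormalDist


def per_guardrail_power(overall_power, n_guardrails):
    """Per-metric power so that, by the union bound, all guardrails
    pass together with probability at least `overall_power`."""
    return 1 - (1 - overall_power) / n_guardrails


def guardrail_inflation(n_guardrails, alpha=0.05, overall_power=0.80):
    """Sample size relative to a single guardrail at the same targets.
    Alpha stays uncorrected; only the power target changes."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)  # one-sided non-inferiority test (assumption)
    base = (z_alpha + z(overall_power)) ** 2
    corrected_power = per_guardrail_power(overall_power, n_guardrails)
    return (z_alpha + z(corrected_power)) ** 2 / base


power_each = per_guardrail_power(0.80, 4)
inflation = guardrail_inflation(4)
print(f"{power_each:.2f}", f"{inflation:.2f}")
```

With four guardrails, each one needs about 95% power, and the required sample size grows by roughly 75% compared with a single guardrail, even though no alpha was corrected.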

When sequential testing and multiple testing corrections are both in play, the errors do not simply add: each source of miscalibration multiplies against the others. Adjusting for this by hand is impractical, so the calculator has to account for both at once.
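A small worked example shows the difference between adding and multiplying. The factors below are illustrative assumptions taken from the figures earlier in the text (1.5 for an always-valid method, about 1.21 for a two-metric Bonferroni correction), and treating them as independent multipliers is itself an approximation.

```python
from math import prod

# Hypothetical inflation factors relative to a naive fixed-sample,
# single-metric calculation.
factors = {
    "always_valid_sequential": 1.50,
    "bonferroni_two_metrics": 1.21,
}

multiplied = prod(factors.values())
added = 1 + sum(f - 1 for f in factors.values())

print(f"multiplicative: {multiplied:.3f}, additive guess: {added:.2f}")
```

The naive calculator is off by a factor of about 1.8 rather than 1.7, and the gap widens with every additional factor.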

Variance reduction makes metric selection better

The most efficient metric for an experiment is not the one with the smallest raw variance. It is the one with the smallest variance after variance reduction. By using pre-experiment data to reduce the variance of each metric's treatment effect estimate, variance reduction can change which metric is actually the most sensitive choice (we cover this in depth in Sample Size Calculation III). A continuous metric that looks noisy in isolation might have highly stable user behavior over time, meaning a pre-experiment covariate absorbs most of that noise. After adjustment, it can be far more sensitive than a binary alternative that seemed cleaner on the surface.

This matters for the sample size calculator because the required sample size depends on effective variance, not raw variance. If the calculator does not account for variance reduction, it will overestimate the sample size for metrics where reduction is strong and underestimate the relative cost of metrics where it is weak. The ranking of metrics by required runtime changes once variance reduction is in the picture.
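The ranking flip can be sketched with a simple model. Assuming a linear covariate adjustment in which the variance shrinks by a factor of (1 - rho^2), where rho is the correlation between the metric and its pre-experiment covariate, the required sample size scales with the effective variance. The metric names, variances, effects, and correlations below are made up for illustration.

```python
from statistics import NormalDist


def required_n(variance, mde, rho=0.0, alpha=0.05, power=0.80):
    """Per-group n for a difference in means with equal variances.
    `rho` is the assumed correlation with a pre-experiment covariate;
    the effective variance is variance * (1 - rho**2)."""
    z = NormalDist().inv_cdf
    effective_variance = variance * (1 - rho ** 2)
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 * effective_variance / mde ** 2


# Hypothetical metrics: absolute MDE and correlation are assumptions.
metrics = {
    "binary_conversion": {"variance": 0.09, "mde": 0.005, "rho": 0.1},
    "minutes_played": {"variance": 400.0, "mde": 0.3, "rho": 0.8},
}

raw = {name: required_n(m["variance"], m["mde"]) for name, m in metrics.items()}
reduced = {name: required_n(**m) for name, m in metrics.items()}

print("best raw:", min(raw, key=raw.get))
print("best after variance reduction:", min(reduced, key=reduced.get))
```

On raw variance the binary metric needs fewer users; after adjustment the continuous metric needs less than half as many, so the choice of the most sensitive metric changes.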

For this to work, variance reduction must apply to every metric type in the experiment. If it works for simple averages but not for ratio metrics or percentiles, the sample size benefit is uneven across the metrics that enter the decision rule. Some metrics reach power quickly while others lag behind, and the experiment runs as long as the slowest metric requires. A good sample size calculator reflects the reduced variance for every metric where variance reduction is applied, and the unreduced variance for every metric where it is not.

During-experiment power monitoring

Every pre-experiment sample size calculation relies on a variance estimate derived from historical data, under the assumption that variance will be the same in both control and treatment. In practice, variance is not always stable over time, and the equal-variance assumption rarely holds when the treatment has an effect. Most treatments do not affect all users, and among those they do affect, the magnitude varies. This heterogeneity means that the treatment group's variance is likely to increase relative to control, making the pre-experiment estimate too optimistic.

A good calculator does not stop at the planning stage. Fixed-power designs allow monitoring power during the experiment itself. Power depends on the variance of the metric, not on the treatment effect estimate, so peeking at variance-based statistics like the power does not inflate false positive rates, as we discuss in Sample Size Calculation I. As the experiment accumulates data, the required sample size estimate updates based on observed variance. If the experiment needs more data than originally planned, the team knows before results come back inconclusive. If it needs less, the experiment finishes sooner.
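A minimal sketch of the update step, assuming a two-sided z-test on a difference in means with unequal variances. The data and the historical variance are made up; note that the calculation uses only the sample variances, so shifting every treatment observation by a constant (a treatment effect) leaves the result unchanged.

```python
from statistics import NormalDist, variance


def required_n(var_control, var_treatment, mde, alpha=0.05, power=0.80):
    """Per-group n for a difference in means, allowing the two groups
    to have different variances."""
    z = NormalDist().inv_cdf
    z_term = (z(1 - alpha / 2) + z(power)) ** 2
    return z_term * (var_control + var_treatment) / mde ** 2


# Planning stage: historical variance assumed equal in both groups.
planned = required_n(2.5, 2.5, mde=1.0)

# During the experiment: recompute from observed variances only.
control = [1, 2, 3, 4, 5]      # toy data, sample variance 2.5
treatment = [0, 2, 3, 4, 6]    # toy data, sample variance 5.0
updated = required_n(variance(control), variance(treatment), mde=1.0)

print(f"planned: {planned:.1f}, updated: {updated:.1f}")
```

Here the treatment group turned out noisier than the historical estimate, so the required sample size goes up by about half, and the team learns that before the planned end date.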

Teams experimenting in new markets or with new user populations benefit most, because their historical data is thin and their pre-experiment estimates are least reliable.

Trigger analysis and exposure filters

The calculator can also get the population wrong. Trigger analysis (filtering to users who actually experienced the change) can dramatically reduce the required sample size (we write more about this in Sample Size Calculation II). If only 10% of exposed users are affected by the treatment, planning for the full population vastly overestimates the time the experiment needs to run. Modeling the triggered population gives you the right number.
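The 10% example can be worked through under simplifying assumptions: the effect is zero for untriggered users, so the average effect over all exposed users is diluted by the trigger rate, and the metric variance is the same in both populations. The function name and inputs below are illustrative.

```python
from statistics import NormalDist


def exposed_users_needed(effect_on_triggered, variance, trigger_rate,
                         analyze_triggered, alpha=0.05, power=0.80):
    """Exposed users per group needed to detect the effect.

    Assumes equal variance in the full and triggered populations and
    no effect on untriggered users (both are simplifications).
    """
    z = NormalDist().inv_cdf
    z_term = (z(1 - alpha / 2) + z(power)) ** 2
    if analyze_triggered:
        n_triggered = 2 * z_term * variance / effect_on_triggered ** 2
        # Only a fraction of exposed users end up in the analysis.
        return n_triggered / trigger_rate
    diluted_effect = trigger_rate * effect_on_triggered
    return 2 * z_term * variance / diluted_effect ** 2


full = exposed_users_needed(0.5, 4.0, 0.10, analyze_triggered=False)
triggered = exposed_users_needed(0.5, 4.0, 0.10, analyze_triggered=True)
print(f"full population needs {full / triggered:.0f}x the exposed traffic")
```

Under these assumptions, planning on the full population asks for ten times the traffic that a triggered analysis needs.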

The same principle applies to other design choices

If your platform supports percentile metrics like p99 latency, the variance formula differs from the one used for means, and the calculator should reflect that. Cluster randomization changes the effective sample size: it depends on cluster count and intra-cluster correlation, not the number of individual users. If the analysis accounts for clustering, the calculator must too.
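Both points can be sketched with standard textbook approximations. The design effect below assumes equal cluster sizes, and the quantile variance uses the asymptotic formula p(1 - p) / (n f(q)^2), where f(q) is the density at the quantile. The example numbers and the use of a standard normal metric are illustrative assumptions.

```python
from statistics import NormalDist


def design_effect(cluster_size, icc):
    """Variance inflation from cluster randomization with equal-sized
    clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_users, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_users / design_effect(cluster_size, icc)


def quantile_variance(p, density_at_quantile, n):
    """Asymptotic variance of the sample p-quantile."""
    return p * (1 - p) / (n * density_at_quantile ** 2)


# 100,000 users in clusters of 50 with a hypothetical ICC of 0.05.
n_eff = effective_sample_size(100_000, cluster_size=50, icc=0.05)

# p99 of a standard normal metric versus its mean, same n.
dist = NormalDist()
q99 = dist.inv_cdf(0.99)
p99_vs_mean = quantile_variance(0.99, dist.pdf(q99), n=1) / dist.variance

print(f"effective n: {n_eff:.0f}, p99 variance / mean variance: {p99_vs_mean:.1f}")
```

In this example the clustered design is worth under 30,000 independent users, and the p99 estimate is roughly 14 times noisier than the mean, so a calculator that applies the formula for means to either case will be far off.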

Summary

The failure mode is specific: a calculator that was never connected to the analysis produces a number that is wrong for how your experiment actually runs. Each design choice that the calculator does not automatically account for (sequential method, correction type, variance reduction) adds another layer of error, and those layers compound quietly. The fix is not to be more careful with the inputs. It is to use a tool whose sample size calculator accounts for all of your design choices for you.

If you want to go deeper on how each of these factors enters the calculation, we have three dedicated courses in our bootcamp: Sample Size Calculation I (the basics), II (multiple metrics), and III (variance reduction and metric types).