Last updated: May 2026
Every experimentation platform gives you a sample size estimate before the experiment starts. Almost none revisit that estimate after the experiment is running. The pre-experiment number is calculated from historical variance, a guessed minimum detectable effect, and a set of design parameters that may or may not match the experiment you actually launch. After the experiment starts, the number is treated as settled. If the variance turns out to be different from what was assumed, if the user population shifts, or if the treatment itself changes the variance structure of your metrics, the power calculation you relied on during planning no longer describes the experiment you are running.
For most experiments, the gap between planned and actual variance is small. The pre-experiment estimate is close enough, and the experiment ends with the power it was designed to have. But for experiments that change user behavior in unexpected ways, that target a population segment whose variance differs from the historical average, or that run during periods of unusual traffic, the gap can be large enough to render the experiment underpowered or unnecessarily overpowered. Teams discover this after the experiment ends, when the result is either inconclusive despite weeks of runtime or significant at a sample size that could have been reached much sooner.
The fix is straightforward in principle: update the power estimate during the experiment using observed data. The reason most platforms do not is a blanket caution about peeking: during-experiment monitoring of treatment effects inflates false positive rates, and peeking at p-values is the classic form of the problem. But peeking at variance-based statistics like power does not inflate false positive rates. Power depends on the variance of the metric and the sample size accumulated so far, and neither is a function of the treatment effect, so you can monitor them continuously without compromising the experiment's statistical validity. Practitioners have likely been doing this informally for years. What fixed-power designs add is the formal proof that it is safe, published in a recent paper by the Confidence team.
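As a rough illustration, power can be recomputed at any point from the observed variance and the sample accumulated so far. The sketch below assumes a two-sided, two-sample z-test with equal allocation; the MDE, variance, and sample size are invented numbers, not Confidence's formula or API.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(mde: float, sigma_sq: float, n_per_arm: float,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test for a difference
    in means, given the metric variance and the per-arm sample size.

    The observed treatment effect never enters this calculation: only the
    hypothesized MDE, the metric variance, and the accumulated sample size do,
    which is why it is safe to recompute mid-experiment.
    """
    z = NormalDist()
    se = sqrt(2 * sigma_sq / n_per_arm)     # standard error of the difference
    z_alpha = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    return z.cdf(abs(mde) / se - z_alpha)   # P(reject | true effect = mde)

# Planned power vs. the power implied by the variance observed so far
planned  = power_two_sample(mde=0.02, sigma_sq=1.00, n_per_arm=50_000)
observed = power_two_sample(mde=0.02, sigma_sq=1.60, n_per_arm=50_000)
print(f"planned {planned:.2f}, with observed variance {observed:.2f}")
```

With these invented numbers, a variance 60% higher than planned drops the power from roughly 0.89 to roughly 0.70 at the same sample size, which is exactly the kind of drift a static pre-experiment estimate never reveals.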
Should you add during-experiment power monitoring to your experimentation platform requirements?
Yes, and this is a feature where the vendor landscape is almost entirely empty. During-experiment power monitoring is not a standard offering. No external vendor reviewed provides a principled power recalculation that adapts the sample size target based on observed variance while the experiment runs. Some vendors offer partial during-experiment estimates: Amplitude and PostHog update duration projections using live data. But a duration projection based on current enrollment pace is not the same as a power recalculation based on observed metric variance. The first tells you when you will reach your original sample size target. The second tells you whether that target is still the right one.
The original sample size target can be wrong in both directions. If pre-experiment variance was underestimated, the experiment is underpowered from the start, and reaching the planned sample size will not deliver the power you intended. If variance was overestimated, the experiment is overpowered, and reaching the planned sample size wastes traffic that could have been used for other experiments. A platform that projects when you will reach the planned target, without checking whether the planned target is still valid, catches neither of these problems.
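Under the same illustrative z-test assumption, the sample size target itself can be recomputed from observed variance. The numbers below are invented purely to show how an under- or overestimated variance moves the target in either direction.

```python
from math import ceil
from statistics import NormalDist

def required_n_per_arm(mde: float, sigma_sq: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided, two-sample z-test at the given
    power, driven entirely by the metric variance and the target MDE."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma_sq / mde ** 2)

planned_target = required_n_per_arm(mde=0.02, sigma_sq=1.00)  # pre-experiment estimate
updated_target = required_n_per_arm(mde=0.02, sigma_sq=1.60)  # variance observed so far
# Had the observed variance come in at 0.70 instead, the updated target would
# fall below the planned one and the experiment could finish sooner.
print(planned_target, updated_target)
```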
At Spotify, Confidence monitors metric variance during the experiment and updates the sample size target when it diverges from the pre-experiment estimate.
But consider a power monitor that conflates enrollment pace with statistical power. A duration projection that shows "four weeks remaining" when the experiment is actually underpowered gives false reassurance. This is why we built during-experiment power monitoring into Confidence: a team that knows it is operating on assumptions will plan accordingly, but a team that trusts a misleading projection will run the experiment to completion and only then discover the result was never going to be conclusive.
This capability connects directly to the broader sample size calculation problem described in the sample size post in this series. A pre-experiment calculator that accounts for the stopping rule, variance reduction, metric types, and multi-metric decision rules is essential. But even a calculator that accounts for all of this relies on estimates that may not hold after the experiment starts. During-experiment power monitoring is the second half of the problem: it closes the loop between planning and execution.
What your RFP should ask instead of the yes/no question
Six questions separate a principled during-experiment power monitoring implementation from a disconnected one.
First: does the platform offer any during-experiment power monitoring at all? Most platforms give you the pre-experiment estimate and nothing else. After the experiment launches, the power calculation is static: it reflects the assumptions made at planning time and never updates. Some platforms offer partial options. Amplitude updates its duration estimate during the experiment, and PostHog's automatic mode provides live estimates after the first day and 100 exposures. But these are enrollment-pace projections, not power recalculations. They tell you when you will reach the original target, not whether the original target still delivers the intended power. Ask whether the platform provides any during-experiment estimate of power, and whether that estimate recalculates based on observed metric variance or merely projects based on enrollment pace.
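To make the contrast concrete, the sketch below (with invented numbers, not any vendor's formula) computes a duration projection from enrollment pace alone; it says nothing about whether the target it counts down to still delivers the intended power.

```python
# Duration projection: when will we hit the *original* sample size target?
def weeks_remaining(target_n: int, enrolled_n: int, enrollment_per_week: float) -> float:
    return max(target_n - enrolled_n, 0) / enrollment_per_week

original_target = 39_245  # the planned per-arm target from the earlier sketch
print(weeks_remaining(original_target, enrolled_n=20_000, enrollment_per_week=10_000))
# ~1.9 weeks remaining -- but if the observed variance implies a target of ~62_800,
# reaching 39_245 still leaves the experiment underpowered.
```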
Second: does the platform distinguish safe statistics to peek at from unsafe ones? The distinction between safe and unsafe peek statistics is the core question. Peeking at treatment effects (p-values, confidence intervals, posterior probabilities) during an experiment inflates false positive rates. The peeking problem is what sequential testing was designed to solve. But not all statistics are unsafe to peek at. Variance-based statistics like power, metric variance, and accumulated sample size do not depend on the treatment effect. Monitoring them does not affect the experiment's false positive rate. A platform that distinguishes between these two categories can offer continuous power monitoring without requiring sequential testing boundaries for the power estimate itself. A platform that treats all during-experiment statistics as equally dangerous to look at has no principled basis for power monitoring. Ask whether the platform documents which statistics are safe to monitor continuously and which require sequential corrections.
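A quick way to see the distinction is a null simulation. The sketch below makes simplifying assumptions (a normally distributed metric, equal arms, a plain z-test, illustrative constants) and is not any vendor's implementation; it only shows why the two categories behave differently: stopping on a peeked p-value pushes the false positive rate well above the nominal 5%, while monitoring variance at the same looks and testing once at the end leaves it near 5%.

```python
import random
from math import sqrt
from statistics import NormalDist, pvariance

Z = NormalDist()
ALPHA, N_PER_ARM, LOOKS, REPS = 0.05, 500, 10, 1_000

def p_value(a, b):
    """Two-sided z-test p-value for a difference in means."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = sqrt(pvariance(a) / len(a) + pvariance(b) / len(b))
    return 2 * (1 - Z.cdf(abs(diff) / se))

rng = random.Random(7)
peek_effect_fp = peek_variance_fp = 0
looks = [int(N_PER_ARM * (k + 1) / LOOKS) for k in range(LOOKS)]
for _ in range(REPS):
    # Null world: both arms come from the same distribution, so any
    # "significant" result is a false positive.
    a = [rng.gauss(0, 1) for _ in range(N_PER_ARM)]
    b = [rng.gauss(0, 1) for _ in range(N_PER_ARM)]
    # Unsafe peeking: check the p-value at every interim look, stop on significance.
    if any(p_value(a[:n], b[:n]) < ALPHA for n in looks):
        peek_effect_fp += 1
    # Safe peeking: look only at the pooled variance at every look (e.g. to
    # update the sample size target), then test once at the end.
    _ = [pvariance(a[:n] + b[:n]) for n in looks]  # monitored, never gates stopping
    if p_value(a, b) < ALPHA:
        peek_variance_fp += 1

print(f"false positive rate, p-value peeking:  {peek_effect_fp / REPS:.3f}")
print(f"false positive rate, variance peeking: {peek_variance_fp / REPS:.3f}")
```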
Third: does during-experiment power monitoring work across metric types? The variance structure differs across metric types. Simple means have straightforward variance estimates. Ratio metrics like revenue per session, where both numerator and denominator vary, require the delta method or similar approaches. Percentile metrics require entirely different statistical machinery. A power monitoring system that updates variance estimates for means but not for ratio or percentile metrics is incomplete: it covers the easy cases and misses the hard ones. The metrics with the most uncertain pre-experiment variance estimates are often the ones left without a valid during-experiment check. Ask whether the power monitoring covers all metric types the platform supports, including ratio and percentile metrics.
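For concreteness, here is a minimal sketch of the delta-method variance for a per-user ratio metric such as revenue per session. The function and its inputs are illustrative, not any platform's API; the point is that a during-experiment power monitor needs an estimate like this for every metric type it claims to cover.

```python
from math import fsum
from statistics import mean, pvariance

def ratio_metric_variance(x, y):
    """Delta-method approximation of the variance of a ratio metric,
    where x[i] is user i's numerator (e.g. revenue) and y[i] is user i's
    denominator (e.g. session count)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    r = mx / my                                             # the ratio estimate itself
    cov = fsum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var_x, var_y = pvariance(x, mx), pvariance(y, my)
    # Var(r_hat) ~= (var_x - 2*r*cov + r^2*var_y) / (n * my^2)
    return (var_x - 2 * r * cov + r ** 2 * var_y) / (n * my ** 2)
```

An estimate like this is what would replace the plain metric variance in the power and sample-size sketches above when the metric is a ratio rather than a simple mean.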
Fourth: does during-experiment power monitoring work in combination with sequential testing? Fixed-power designs and sequential testing solve different problems and complement each other. Sequential testing controls the false positive rate when peeking at treatment effects, allowing valid early stopping when the effect is clear. Fixed-power designs control the false negative rate by ensuring the experiment maintains adequate power throughout its runtime. A platform that offers sequential testing but not power monitoring lets you stop early when the signal is strong but gives you no warning when the experiment is underpowered. A platform that offers power monitoring but not sequential testing lets you know whether you have enough data but gives you no valid way to act on early results. The combination gives you both: you know the experiment has adequate power (because variance is monitored), and you can stop early if the treatment effect is detected before the full sample size is reached (because sequential boundaries are in place). Ask whether the two capabilities are designed to work together, and whether the sequential stopping boundaries account for the updated sample size target from the power monitor.
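One crude way to picture the combination is sketched below. It is not Confidence's method: a Bonferroni-style alpha split is far more conservative than real sequential boundaries, but it is valid and shows the division of labor, with the variance monitor keeping the target honest while the spent alpha keeps interim effect checks valid.

```python
def sequential_plan(observed_sigma_sq: float, mde: float, looks: int,
                    alpha: float = 0.05, power: float = 0.80):
    """Combine the two ideas: spread alpha across the interim looks so that
    peeking at the treatment effect stays valid, and size the experiment from
    the variance observed so far (reusing required_n_per_arm from the earlier
    sketch) so that the final look still reaches the target power."""
    per_look_alpha = alpha / looks  # Bonferroni: valid but very conservative
    target = required_n_per_arm(mde, observed_sigma_sq,
                                alpha=per_look_alpha, power=power)
    return per_look_alpha, target
```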
Fifth: does during-experiment power monitoring work in both Bayesian and frequentist modes? The statistical insight that variance-based statistics are safe to peek at holds regardless of the inference framework. Power depends on variance and sample size in both Bayesian and frequentist analyses. If the platform offers both modes, ask whether during-experiment power monitoring is available in each. A platform that monitors power in frequentist mode but not in Bayesian mode, or vice versa, leaves one mode without a principled way to detect underpowered experiments while they are running.
Sixth: does the platform act on the updated power estimate, or only display it? A power estimate that updates in real time is useful. A power estimate that triggers an action is more useful. If the updated estimate shows the experiment is underpowered, the platform could extend the experiment automatically, alert the experimenter, or flag the experiment in a dashboard. If the estimate shows the experiment is overpowered, the platform could suggest early completion or reallocate traffic. The difference between displaying a number and acting on it determines whether the feature saves experiments in practice or only in theory. Ask what happens when the during-experiment power estimate falls below the target, and whether the response is automatic, semi-automatic, or manual.
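As a toy example of the difference, a hypothetical response rule (not any vendor's feature) might look like this:

```python
def respond_to_power_update(current_power: float, target_power: float = 0.80,
                            slack: float = 0.05) -> str:
    """Turn an updated power estimate into an action rather than a number."""
    if current_power < target_power - slack:
        return "alert: underpowered -- extend runtime or raise traffic allocation"
    if current_power > target_power + slack:
        return "suggest: overpowered -- consider ending enrollment early"
    return "on track"
```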
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.
"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "—" means the platform does not offer this feature, so the question does not apply. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.
| Platform | During-experiment power monitoring? | Distinguishes safe vs. unsafe peek statistics? | Works across metric types? | Works with sequential testing? | Works in Bayesian mode? | Automated response to power changes? |
|---|---|---|---|---|---|---|
| Confidence | Yes | Yes | Yes | Yes | — | Yes |
| GrowthBook | No | No | — | — | — | — |
| Eppo | Not documented | Not documented | — | — | — | — |
| Statsig | No | No | — | — | — | — |
| LaunchDarkly | Not documented | Not documented | — | — | — | — |
| PostHog | Partial | No | No | — | — | No |
| Amplitude | Partial | No | No | No | — | No |
| Optimizely | Not documented | Not documented | — | — | — | — |
| VWO | Not documented | Not documented | — | — | — | — |
No external vendor offers principled during-experiment power monitoring based on observed metric variance. The two vendors that come closest, Amplitude and PostHog, offer duration projections rather than power recalculations. A duration projection answers "when will we reach the planned sample size?" A power recalculation answers "is the planned sample size still the right target?" The first is an operational convenience. The second is a statistical safeguard.
The first pattern is the absence of the feature entirely. Six of the nine vendors reviewed (GrowthBook, Eppo, Statsig, LaunchDarkly, Optimizely, and VWO) do not document any form of during-experiment power monitoring. After the experiment launches, the sample size target is static. If the pre-experiment variance estimate was wrong, teams discover this after the experiment ends: the result is either inconclusive (variance was underestimated) or the experiment ran longer than necessary (variance was overestimated). Neither outcome is visible during the experiment.
The second pattern is the confusion between duration projection and power monitoring. Amplitude's live duration estimate and PostHog's automatic mode both update during the experiment, but they update the timeline, not the power target. If variance is higher than expected, these projections may show the experiment taking longer (because enrollment is slower or noisier), but they do not explicitly flag that the experiment's statistical power has changed. A duration projection that shows "four weeks remaining" when the experiment is actually underpowered gives false reassurance. The team waits four weeks, reaches the planned sample size, and finds the result is inconclusive because the planned sample size was never adequate.
The third pattern is the lack of a principled statistical framework for safe peeking. Fixed-power designs rest on a specific theoretical insight: that peeking at variance-based statistics does not inflate false positive rates because these statistics do not depend on the treatment effect. No external vendor documents this distinction. Without it, there is no principled basis for deciding which during-experiment statistics are safe to monitor. The result is either that vendors avoid during-experiment monitoring entirely (the cautious but incomplete approach) or that vendors offer monitoring without documenting what is and is not safe to look at (the convenient but potentially misleading approach).
Fixed-power designs were published by Spotify and formalized in the accompanying paper. The method is recent, which likely explains why no external vendor has adopted it yet.
For teams evaluating platforms, the practical question is whether during-experiment power monitoring matters for your program. If your experiments run on large, stable populations with well-characterized metrics, pre-experiment variance estimates are usually right and the gap is small. If your experiments target niche segments, launch during volatile periods, or involve treatments that fundamentally change user behavior, the gap between planned and actual variance can be large enough to waste weeks of runtime or produce inconclusive results. The more experiments you run concurrently and the more competitive traffic allocation becomes, the higher the cost of each overpowered or underpowered experiment.
The right RFP question is not "do you have during-experiment power monitoring?" It is whether the platform has a principled way to detect and respond to the gap between planned and actual metric variance while the experiment is still running. If it does not, every experiment runs on an assumption made before a single data point was collected, and the only way to discover that assumption was wrong is to wait until the experiment is over.