Last updated: May 2026
Every experimentation platform measures user behavior after exposure. The question most buyers never ask is: over what time period? The observation window determines when a user's data starts counting and how long the measurement lasts. When some users have been in the experiment for two weeks while others joined yesterday, a metric computed across all of them is not measuring what you think it is. A metric measured over a fixed seven-day window after exposure answers a different question than the same metric measured cumulatively from exposure until the experiment ends. Most platforms expose a single windowing mode without documenting what assumption it makes, or they offer a configurable window without explaining how the choice interacts with sequential testing, sample size calculation, or dimensional breakdowns.
The problem is not that vendors choose different defaults. It is that most do not document what their default is or explain the tradeoff between early readout and estimation integrity.
At Spotify, Confidence supports configurable observation windows that connect to the sample size calculator and the sequential testing engine. When the observation window changes, the variance estimate changes, and the planning tools reflect that.
Should you add observation windows to your experimentation platform requirements?
Yes. If your experiments measure outcomes over a defined post-exposure period and you care about comparing users measured over the same duration, the observation window is one of the most consequential configuration choices in the platform. It is also one of the most underspecified. But a configurable window that is disconnected from sample size calculation and sequential testing is a display setting, not an analysis feature. At Spotify, we learned early on that a configuration option that does not propagate to planning and monitoring creates more confusion than it resolves. The window changes the metric, but if nothing downstream notices, the experimenter is left comparing results against a plan that no longer applies.
The core issue is what happens when users enter the experiment at different times. Consider a seven-day revenue metric. On day three of the experiment, some users have been exposed for three days and others for one day. A cumulative approach includes all of them, mixing users with one day of post-exposure data and users with three days. The users with shorter exposure have had less time to generate events. If the mix of observation lengths is similar in treatment and control, this adds variance but does not bias the treatment effect. But if exposure timing differs between groups, or if the treatment itself affects when users become active, the mix of observation lengths can differ between groups, and the treatment effect estimate becomes biased. Even in the symmetric case, the composition of the sample shifts as the experiment runs: early users accumulate more data while new users enter with less. The metric at any analysis point reflects a weighted average across users with different amounts of information, and that weighting changes every day.
A closed-window approach avoids this by waiting until each user has completed the full observation period before including them. On day three of the experiment, only users exposed on day one or earlier (who have completed at least three days) would be included in a three-day metric. This means fewer users in each analysis, which delays readout, but every included user has been observed for the same duration. The treatment effect estimate compares like with like.
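To make the distinction concrete, here is a minimal sketch in Python (pandas) of the two inclusion rules for a seven-day revenue metric. The table layout and column names are illustrative, not any platform's schema.

```python
# Minimal sketch of closed vs. cumulative inclusion for a seven-day revenue metric.
# Column names (user_id, exposed_at, event_at, revenue) are illustrative.
from datetime import timedelta
import pandas as pd

WINDOW = timedelta(days=7)

def seven_day_revenue(exposures, events, analysis_time, mode="closed"):
    """exposures: user_id, exposed_at; events: user_id, event_at, revenue."""
    if mode == "closed":
        # Only users whose full window has elapsed by the analysis time.
        included = exposures[exposures["exposed_at"] + WINDOW <= analysis_time]
    else:
        # Cumulative: every exposed user, however little of the window has passed.
        included = exposures[exposures["exposed_at"] <= analysis_time]

    joined = included.merge(events, on="user_id", how="left")
    in_window = joined[
        (joined["event_at"] >= joined["exposed_at"])
        & (joined["event_at"] < joined["exposed_at"] + WINDOW)
        & (joined["event_at"] <= analysis_time)  # only binds in cumulative mode
    ]
    per_user = in_window.groupby("user_id")["revenue"].sum()
    # Users with no events in their window count as zero, not as missing.
    return per_user.reindex(included["user_id"], fill_value=0.0)
```

The closed branch compares users observed for the same seven days; the cumulative branch includes everyone exposed so far, with whatever portion of the window has elapsed.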
The tradeoff is real. Closed windows delay results. Cumulative windows deliver results sooner but introduce unequal observation durations that can bias estimates, particularly for metrics where behavior changes over the observation period. Revenue accumulates over time, so a user with fourteen days of data will almost always have higher total revenue than a user with one day, regardless of treatment. Engagement metrics like daily active use can stabilize or decay over the same period. For any metric where the expected value depends on how long the user has been observed, mixing observation lengths adds variance in all cases and can introduce systematic bias when the mix differs between treatment and control.
This connects directly to the peeking problem for longitudinal data that Spotify identified and solved in Confidence. Standard sequential tests assume that a unit's measurement is finalized when it enters the analysis. When measurements accumulate over a window, that assumption breaks. Peeking at a metric before every user's window is complete inflates false positive rates even when using a valid sequential method. This is the peeking problem 2.0: the sequential test controls for peeking across users, but not for peeking within a user's incomplete observation window. Solving it requires the sequential testing framework to account for the window structure explicitly, which is what Confidence does through its group sequential test design for longitudinal data.
What your RFP should ask instead of "yes/no"
Six questions decide whether a platform's observation window implementation is coherent.
First: is the observation window configurable? The most basic question is whether you can set the length of the post-exposure measurement period. A platform with no configurable window either uses the full experiment duration (every user is measured from their exposure until the experiment ends, producing users observed for different lengths of time) or uses an implicit window that is not documented. If you cannot configure the window, you cannot ensure that the metric measures what you intend. Ask whether the platform lets you define a per-metric observation window relative to exposure, and whether the window length is adjustable.
Second: does the platform offer both closed-window and cumulative-window modes? A closed window waits for every included user to complete the same observation period before entering the analysis. A cumulative window adds users as they arrive, so the analysis at any point includes users at different stages of their observation. Both are legitimate choices, but they answer different questions and have different statistical properties. A platform that only offers cumulative measurement forces you to accept users observed for different durations. A platform that only offers closed windows forces you to wait for the full window before seeing any results. A platform that lets you choose per metric gives you the most flexibility, because different metrics in the same experiment may warrant different treatment. A guardrail metric for crash rate might use a short closed window for clean comparison, while a cumulative revenue metric over the full experiment duration might be acceptable when combined with a ratio-metric formulation that normalizes by observation time. Ask whether both modes are available and whether you can configure them independently per metric.
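As a sketch of what per-metric configuration looks like in practice, the structure below uses hypothetical field names rather than any vendor's actual schema.

```python
# Hypothetical per-metric window configuration; field names are illustrative.
metric_windows = {
    "crash_rate":    {"mode": "closed",     "window_days": 1},    # guardrail: short, clean comparison
    "revenue_total": {"mode": "cumulative", "window_days": None},  # full duration, normalized by observation time
    "retention_d7":  {"mode": "closed",     "window_days": 7},
}
```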
Third: is the observation window reflected in the sample size calculation? The observation window affects metric variance. A seven-day closed-window metric has different variance than a fourteen-day closed-window metric on the same underlying data, because users accumulate more events over longer windows and the within-user variance structure changes. A sample size calculator that ignores the window length produces a number that does not match the experiment you will run. If you plan the experiment with a calculator that assumes full-duration measurement but analyze with a seven-day window, the effective sample size at each analysis point is smaller than planned (because users are only included after their window completes), and the variance per user may differ from the historical estimate. The same applies to variance reduction: if CUPED is enabled, the variance reduction depends on the correlation between pre-experiment and post-experiment data over the configured window, not the full experiment duration. Ask whether the calculator accounts for the configured window length when estimating variance and runtime, and whether it adjusts for variance reduction under the configured window.
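A window-aware version of the calculation looks roughly like the sketch below, which uses the standard fixed-horizon two-sample formula; the historical values and traffic numbers are placeholders, not outputs of any platform.

```python
# Window-aware sample size sketch under standard two-sample z-test assumptions.
# `user_values_7d` stands in for per-user metric values aggregated over the
# configured window in historical data; the traffic figure is illustrative.
import numpy as np
from scipy.stats import norm

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.8):
    """Fixed-horizon formula; sigma must be estimated on the *windowed* metric."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return int(np.ceil(2 * (z_a + z_b) ** 2 * sigma**2 / mde**2))

user_values_7d = np.random.default_rng(0).exponential(scale=5.0, size=50_000)  # placeholder history
sigma_7d = np.std(user_values_7d, ddof=1)
n_per_group = sample_size_per_group(sigma_7d, mde=0.5)

# Closed-window runtime: users only count once their window completes, so the
# required runtime is (days to expose 2 * n_per_group users) plus the window length.
daily_exposures, window_days = 10_000, 7
runtime_days = int(np.ceil(2 * n_per_group / daily_exposures)) + window_days
```

The same logic applies to a fourteen-day window: a different sigma, a different n, and a longer tail added to the runtime.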
Fourth: does sequential testing account for the window type? This is the most technically demanding question on the list, and it is where most implementations break down. For a closed-window metric, sequential testing is relatively straightforward. At each analysis point, users with completed windows form a growing sample, and a standard sequential test applies. The cumulative case is harder. Each user's measurement changes as their observation period grows. The data for a given user is not finalized when it first enters the analysis. Standard sequential tests assume each observation is fixed once included. When observations evolve over time because the user accumulates more events, the sequential guarantee can break. Applying group sequential tests correctly to cumulative metrics requires modeling the within-user longitudinal structure explicitly.
Ask whether the platform's sequential testing method is designed for the window type in use, and whether the error rate guarantee holds when users have incomplete observation periods at the time of analysis. Specifically, look for documentation that describes how the sequential method handles users whose observation window is still open when the analysis runs. If the documentation does not mention this, the implementation likely treats each user's current cumulative value as final.
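One concrete check you can request from a vendor, or run against your own exposure data, is the fraction of included users whose window is still open at each interim analysis; here is a sketch under an illustrative data layout.

```python
# If this fraction is nonzero at an interim analysis, included users' values are
# not yet finalized, and a sequential method that assumes fixed observations
# does not strictly apply. Data layout is illustrative.
from datetime import timedelta

def open_window_fraction(exposure_times, analysis_time, window_days):
    """exposure_times: list of exposure timestamps for users already in the analysis."""
    window = timedelta(days=window_days)
    still_open = sum(1 for t in exposure_times if t + window > analysis_time)
    return still_open / len(exposure_times) if exposure_times else 0.0
```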
Fifth: does the observation window apply consistently when you explore by dimension? Slicing results by country, device type, or user segment is standard practice. But the window configuration needs to carry through to those breakdowns. If the top-line metric uses a seven-day closed window but the dimensional breakdown recomputes the metric from raw events without respecting the window, the subgroup numbers measure something different from the top-line result. The same risk applies to zero-handling and aggregation choices: if any of those settings are dropped when you filter to a segment, the dimensional result is not comparable to the overall result. Ask whether segmented views inherit the same window configuration as the primary analysis, including window length, closed versus cumulative mode, and any filtering of users with incomplete windows.
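The safe pattern is to compute windowed per-user values once and slice those, rather than recomputing from raw events per segment. A sketch, reusing the seven_day_revenue helper from earlier and assuming the exposures table carries the dimension columns:

```python
# Compute windowed per-user values once, then slice, so segment views inherit the
# same window, mode, and zero-handling as the top-line metric. `country` and
# `group` are illustrative columns assumed to exist on the exposures table.
per_user = (
    seven_day_revenue(exposures, events, analysis_time, mode="closed")
      .rename("revenue_7d").rename_axis("user_id").reset_index()
)

segmented = (
    exposures.merge(per_user, on="user_id")   # inner join keeps only window-eligible users
             .groupby(["country", "group"])["revenue_7d"]
             .mean()
)
```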
Sixth: does the observation window work consistently across metric types? A platform might support configurable windows for conversion metrics but not for percentile metrics or ratio metrics. Or the window might apply to the metric's events but not to the denominator of a ratio metric, creating an inconsistency where the numerator is windowed but the denominator is not. For percentile metrics specifically, the observation window interacts with the unit of aggregation: if the platform computes a user-level P95 before the group-level comparison, the window determines how many observations each user contributes to their own P95, which directly affects the distribution. Ask whether the window configuration applies to all metric types you use, and whether the implementation is consistent across types.
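Here is a sketch of what consistent treatment looks like for a ratio metric and a user-level percentile metric, assuming the events have already been filtered to each user's window (for example by the inclusion logic sketched earlier); column names are illustrative.

```python
def windowed_metrics(in_window):
    """in_window: event rows already filtered to each user's observation window,
    with illustrative columns user_id, revenue, session_id, latency_ms."""
    # Ratio metric: revenue per session, with numerator and denominator both windowed.
    per_user = in_window.groupby("user_id").agg(
        revenue=("revenue", "sum"),
        sessions=("session_id", "nunique"),
    )
    revenue_per_session = per_user["revenue"].sum() / per_user["sessions"].sum()

    # Percentile metric: user-level P95 latency over the same window, so the
    # window bounds how many observations feed each user's own P95.
    user_p95 = in_window.groupby("user_id")["latency_ms"].quantile(0.95)
    return revenue_per_session, user_p95
```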
What the answers actually look like across vendors
Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.
"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "—" means the platform does not offer this feature, so the question does not apply. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.
| Platform | Configurable observation window? | Closed-window mode? | Cumulative-window mode? | Window reflected in sample size calculation? | Sequential testing accounts for window type? | Consistent in dimensional breakdowns? | Consistent across metric types? | Other gaps |
|---|---|---|---|---|---|---|---|---|
| Confidence | Yes (per-metric) | Yes | Yes | Yes | Yes | Yes | Yes | — |
| GrowthBook | Yes (three modes) | Partial (In-Progress toggle) | Yes | No | Not documented | Not documented | Not documented | Lookback mode is relative to experiment end |
| Eppo | Yes (configurable offsets) | Yes (aged subject filtering) | Yes | No | Not documented | Not documented | Not documented | Multiple time units supported |
| Statsig | Partial (Warehouse Native only) | Yes (completed window option) | Yes | No | Not documented | Not documented | Partial (Warehouse Native broader) | Two implementations (Cloud vs Warehouse Native) |
| LaunchDarkly | No (fixed 90-day window) | No | No | No | Not documented | Not documented | Not documented | 90-day attribution limit |
| PostHog | Yes (conversion window) | No | Yes | No | — (no sequential testing) | Not documented | Partial (not retention) | Window extends past experiment end |
| Amplitude | Partial (retention only) | No | Yes | No | Not documented | Not documented | No (retention only) | Duration Estimator ignores window |
| Optimizely | Yes (conversion window) | No | Yes | No | Not documented | Not documented | Not documented | Documented for conversion metrics only |
| VWO | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | Not documented | No per-metric window documented |
Four patterns stand out.
The first pattern is the gap between window configuration and statistical integration. Five vendors (GrowthBook, Eppo, Statsig, PostHog, and Optimizely) offer some form of configurable observation window. None of them connect the window configuration to the sample size calculator. The window length affects metric variance: a seven-day window produces different variance than a fourteen-day window, and a closed window produces different effective sample sizes at each analysis point than a cumulative window. A calculator that ignores the window setting plans the experiment as if the window does not exist. This is the same planning-analysis disconnect that appears across nearly every feature in the vendor landscape, but for observation windows it is particularly consequential because the window directly determines how many users are available for analysis at any given time.
The second pattern is the split between closed and cumulative modes. Only three vendors (GrowthBook, Eppo, and Statsig) offer an explicit option to exclude users whose observation window is not yet complete. GrowthBook calls it "In-Progress Conversions," Eppo calls it "aged subject filtering," and Statsig calls it "only include units with a completed window." The terminology varies but the mechanism is the same: users are excluded from the analysis until their full window has elapsed. This is the closed-window mode that gives every included user the same observation duration. PostHog, Amplitude, Optimizely, LaunchDarkly, and VWO do not document an equivalent option. On those platforms, users enter the analysis as soon as they are exposed, regardless of how much of their observation period has elapsed. The result is that the metric at any analysis point reflects a mix of users with different observation lengths. For metrics where the expected value depends on observation duration (revenue, total engagement, retention), this heterogeneity can bias the treatment effect estimate.
The third pattern is the absence of sequential testing designed for the window type. No external vendor documents how their sequential testing method accounts for the observation window. This matters most for cumulative metrics, where a user's measurement changes as their window grows. A sequential test that treats each user's current cumulative value as a fixed observation will have inflated error rates when users' values are still evolving. For closed-window metrics, the problem is simpler but still present: the sequential test needs to account for the fact that users enter the analysis in cohorts defined by when their window completes, not when they are exposed. Confidence solves this through its group sequential test design for longitudinal data, which explicitly models the within-user measurement structure. Among the external vendors, GrowthBook and Eppo both offer sequential testing and configurable windows, but neither documents how the two features interact. The gap is not necessarily that the implementation is wrong. It is that the interaction is not addressed in the documentation, which means buyers have no way to evaluate whether the error rate guarantees hold under their configured window.
GrowthBook also offers a third windowing mode, Lookback, which measures behavior in the last N days before the experiment ends rather than relative to each user's exposure. This answers "what did users do in the final week?" rather than "what did users do in their first week after exposure." The Lookback mode is useful for assessing steady-state behavior, but it conflates exposure timing with measurement timing in a way that makes causal interpretation harder.
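The two filters differ only in their anchor point, as the sketch below illustrates with hypothetical column names: one is relative to each user's own exposure, the other to the experiment's end date.

```python
# Sketch of the two filters being contrasted; events_with_exposure is assumed to
# carry both event_at and the user's exposed_at. Column names are illustrative.
from datetime import timedelta

def exposure_relative(events_with_exposure, window_days):
    w = timedelta(days=window_days)
    e = events_with_exposure
    return e[(e["event_at"] >= e["exposed_at"]) & (e["event_at"] < e["exposed_at"] + w)]

def lookback(events_with_exposure, experiment_end, lookback_days):
    e = events_with_exposure
    return e[e["event_at"] >= experiment_end - timedelta(days=lookback_days)]
```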
The fourth pattern is the dimensional consistency gap. No external vendor documents whether the observation window configuration carries through to dimensional breakdowns. If the top-line metric uses a seven-day closed window but the segment explorer recomputes the metric without that constraint, the subgroup results are not comparable to the top-line result.
A related gap appears in how observation windows interact with other platform features. GrowthBook's quantile metrics are excluded from both sequential testing and sample size calculation, which means the observation window configuration for a percentile metric has no connection to planning or monitoring. Statsig's window implementation differs between its Cloud and Warehouse Native products, creating a consistency question for organizations that use both. Amplitude's window support is limited to retention metrics, leaving other metric types without configurable observation periods. These are the feature combination gaps that a checkbox RFP will not surface: the window exists, but the features that need to interact with it do not.
If your experiments use short observation windows (minutes or hours rather than days), cumulative inclusion may introduce negligible bias because most users complete their window quickly. The longer the window relative to the experiment duration, the more the choice between closed and cumulative modes matters. For a fourteen-day metric on a four-week experiment, users exposed in the last two weeks will not have completed their window when the experiment ends. A closed-window analysis excludes them, reducing your effective sample size. A cumulative analysis includes them with partial data, changing what the metric measures. Neither is wrong, but the platform should let you make this choice deliberately and reflect it in the planning tools.
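The arithmetic is worth making explicit. Under a constant, illustrative exposure rate, a back-of-envelope sketch:

```python
# Back-of-envelope comparison for a 14-day metric on a 28-day experiment,
# assuming a constant (illustrative) exposure rate.
experiment_days, window_days, daily_exposures = 28, 14, 10_000

total_exposed = experiment_days * daily_exposures              # 280,000 users
completed = (experiment_days - window_days) * daily_exposures  # 140,000 users

# Closed mode analyzes only the 140,000 users with completed windows;
# cumulative mode analyzes all 280,000, half of them with partial data at the end.
```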
The right RFP question is not "do you support observation windows?" It is whether the window configuration is connected to the rest of the analysis pipeline. A configurable window that does not connect to sample size calculation, sequential testing, or dimensional breakdowns is a display setting, not an analysis feature. You changed the window from fourteen days to seven. The metric values changed. The sample size calculator still shows the same number. The sequential boundary still uses the same threshold. Nothing downstream noticed. The experiment you planned and the experiment you analyzed are two different experiments.