How to Write an Experimentation Platform RFP for Switchback Experiments

Last updated: May 2026

Switchback experiments exist because standard A/B tests break down when users interact with each other. In a ride-sharing marketplace, randomly assigning drivers to treatment and control is meaningless if both groups compete for the same riders in the same city at the same time. The treatment effect spills over into the control group, and the comparison is no longer valid. Switchback experiments solve this by alternating the entire environment between treatment and control over time: all drivers and riders in a city experience the same condition during each period, and the treatment effect is estimated from the difference between periods.

The use case is narrow but critical. Marketplaces, logistics networks, pricing systems, and any product where one user's experience depends on other users' assignments will produce biased estimates under standard randomization. Switchback designs avoid this by randomizing across time rather than across users.

Two vendors in the current landscape offer switchback experiment support: Eppo and Statsig. A handful of others support cluster randomization, which addresses related but distinct problems. The challenge is not finding a vendor that says "yes" to switchback experiments. It is determining whether the implementation handles the statistical problems that make switchback experiments different from standard A/B tests. A platform that runs a switchback schedule but analyzes the results as if observations were independent across time periods will produce confidence intervals that are too narrow, p-values that are too small, and decisions built on inflated certainty.

Should you add switchback experiments to your experimentation platform requirements?

Only if you operate a marketplace or two-sided platform where user-level randomization causes interference. For most product experiments, standard A/B tests with user-level randomization are the right design, and switchback experiments add complexity without benefit. If your product has network effects that contaminate user-level comparisons, switchback designs are not optional: they are the only way to get an unbiased estimate of the treatment effect.

At Spotify, we build Confidence for a product where user-level randomization works. Listeners and podcasters do not compete for the same resource in a way that causes one user's assignment to change another user's outcome. We have not needed switchback experiments and have not built them into Confidence. If we operated a marketplace where interference was the dominant concern, this would be among the first capabilities we would require.

The question for your RFP is not whether the vendor supports switchback experiments. It is whether the implementation accounts for the statistical problems that are unique to switchback designs: temporal autocorrelation, carry-over effects, the correct unit of inference, and sample size planning that reflects the design rather than a standard A/B test.

What your RFP should ask instead of the yes/no question

Six questions decide whether a switchback implementation is worth having.

First: does the platform account for temporal autocorrelation? Observations within the same unit across adjacent time periods are correlated. A city's ride volume at 6 PM is not independent of its ride volume at 5 PM. If the analysis treats each time step as an independent observation, it underestimates variance and produces confidence intervals that are too narrow. The platform reports statistical significance when the evidence does not support it. A valid switchback analysis must either model the temporal correlation structure explicitly or use an inference method (such as bootstrapping at the period level or cluster-robust standard errors) that does not require independence across time steps. Ask whether the platform's variance estimation accounts for within-unit temporal correlation, and how.
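To make the failure mode concrete, here is a minimal sketch in Python of the difference between variance estimation that assumes independent periods and one that is robust to temporal correlation. The data are simulated with AR(1) noise, and Newey-West (HAC) standard errors stand in for the broader family of correlation-robust methods; neither vendor documents this as their actual approach.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Illustrative period-level data: one observation per switchback period,
# with AR(1) noise so adjacent periods are correlated. All numbers are
# made up for this sketch; this is not any vendor's analysis code.
n = 96
condition = np.tile([1, 0], n // 2)        # alternating treatment/control
err = np.zeros(n)
for t in range(1, n):
    err[t] = 0.6 * err[t - 1] + rng.normal(0, 1.0)
y = 10 + 0.3 * condition + err

X = sm.add_constant(condition)
iid = sm.OLS(y, X).fit()                   # assumes independent errors
hac = sm.OLS(y, X).fit(cov_type="HAC",     # Newey-West (HAC) errors are
             cov_kwds={"maxlags": 4})      # robust to autocorrelation

# The point estimate is identical; only the uncertainty changes. With
# positive autocorrelation, the iid standard error is too small.
print(f"iid SE: {iid.bse[1]:.3f}   HAC SE: {hac.bse[1]:.3f}")
```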

Second: does the platform handle carry-over and washout periods? When the treatment switches, its effect does not vanish instantly. A pricing change that was active for an hour may still influence rider behavior in the first minutes of the next period. If those contaminated observations enter the analysis, they bias the treatment effect estimate. Washout periods (also called burn-in and burn-out periods) discard observations at the boundaries of each switchback window to allow the system to stabilize. The platform should let you configure the length of these exclusion windows, and the analysis should respect them. Ask whether burn-in and burn-out periods are configurable, whether they apply at every switch boundary, and whether the discarded observations are excluded from both the metric calculation and the variance estimation.
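A minimal sketch of what configurable washout looks like in practice, assuming a hypothetical event table and switchback schedule; the `ts`, `start`, and `end` columns are illustrative, not any platform's schema:

```python
import pandas as pd

def apply_washout(events: pd.DataFrame, schedule: pd.DataFrame,
                  burn_in: pd.Timedelta, burn_out: pd.Timedelta) -> pd.DataFrame:
    """Drop events in the burn-in/burn-out window of every period.

    `events` needs a `ts` timestamp column; `schedule` has one row per
    switchback period with `start` and `end` timestamps. Exclusions
    apply at every switch boundary, not just the first.
    """
    kept = []
    for _, p in schedule.iterrows():
        kept.append(events[(events.ts >= p.start + burn_in) &
                           (events.ts < p.end - burn_out)])
    return pd.concat(kept, ignore_index=True)

# Example: hourly periods with a 10-minute burn-in and 5-minute burn-out.
schedule = pd.DataFrame({"start": pd.date_range("2026-05-01", periods=4, freq="h")})
schedule["end"] = schedule["start"] + pd.Timedelta(hours=1)
events = pd.DataFrame({"ts": pd.date_range("2026-05-01", periods=240, freq="min")})
clean = apply_washout(events, schedule,
                      pd.Timedelta(minutes=10), pd.Timedelta(minutes=5))
```

Note that the filtered events must stay excluded downstream: both the metric aggregation and the variance estimation should run on `clean`, not on the raw event stream.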

Third: does inference use the correct number of independent observations? The effective sample size in a switchback experiment is the number of switchback periods (or, with geographic clustering, the number of unit-period combinations), not the number of individual users or events within those periods. An analysis that counts users as independent observations inflates the effective sample size by orders of magnitude. The confidence intervals shrink accordingly, and every test becomes artificially significant. This is the switchback equivalent of ignoring intracluster correlation in a cluster-randomized experiment. Ask what the platform treats as the unit of analysis, and whether the degrees of freedom in the statistical test reflect the number of independent switchback periods rather than the number of users or events.
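The following self-contained simulation illustrates the gap. It compares a naive analysis that counts every event as independent against one that aggregates to the period level first; all values are made up for the illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Simulated switchback: 48 alternating one-hour periods in one city,
# 200 events per period, with period-to-period noise in the mean.
n_periods, per = 48, 200
condition = np.tile([1, 0], n_periods // 2)
period_mean = 10 + 0.3 * condition + rng.normal(0, 1.0, n_periods)
events = pd.DataFrame({
    "period": np.repeat(np.arange(n_periods), per),
    "condition": np.repeat(condition, per),
    "y": np.repeat(period_mean, per) + rng.normal(0, 2.0, n_periods * per),
})

# Naive: every event counted as independent. 9,600 "observations" hide
# only 48 independent periods, so the standard error is far too small.
naive = stats.ttest_ind(events.loc[events.condition == 1, "y"],
                        events.loc[events.condition == 0, "y"])

# Period-level: aggregate first, so the degrees of freedom reflect the
# number of independent switchback periods.
cells = events.groupby(["period", "condition"])["y"].mean().reset_index()
valid = stats.ttest_ind(cells.loc[cells.condition == 1, "y"],
                        cells.loc[cells.condition == 0, "y"])

print(f"naive p={naive.pvalue:.1e}   period-level p={valid.pvalue:.3f}")
```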

Fourth: is sample size planning available for switchback designs? Before running a switchback experiment, you need to know how many switchback periods, across how many units and how much calendar time, you require to detect the effect you care about. This calculation is fundamentally different from a standard A/B test sample size calculation. It depends on the number of units (cities, regions), the switching frequency, the expected temporal autocorrelation, and the carry-over structure. A standard calculator that counts users will give you a number that has no relationship to the experiment you will run. Ask whether the platform offers sample size or power planning specifically for switchback designs, and whether it accounts for the switching frequency, the number of geographic units, and the temporal correlation.
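As a rough illustration of why the standard calculator does not transfer, here is a back-of-envelope sketch that starts from the two-sample formula on period-level means and inflates it with an AR(1) design effect. This is an assumption-laden approximation, not any vendor's method, and it ignores carry-over and multiple geographic units, both of which a real switchback power calculation must handle:

```python
import math
from scipy import stats

def periods_needed(delta: float, sigma_period: float, rho: float,
                   alpha: float = 0.05, power: float = 0.8) -> int:
    """Back-of-envelope count of switchback periods per arm.

    Standard two-sample formula applied to period-level means, then
    inflated by an AR(1) design effect (1 + rho) / (1 - rho) to account
    for temporal autocorrelation between adjacent periods. A sketch
    only: carry-over and the number of units are not modeled.
    """
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    n_iid = 2 * (z_a + z_b) ** 2 * (sigma_period / delta) ** 2
    return math.ceil(n_iid * (1 + rho) / (1 - rho))

# E.g. detect a 0.3-unit lift with period-level sd 1.0 and rho = 0.5:
print(periods_needed(delta=0.3, sigma_period=1.0, rho=0.5))
```

Even this crude version makes the dependence visible: the required number of periods grows with the autocorrelation, which is exactly the parameter a user-count calculator never sees.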

Fifth: what types of metrics are supported, and is the analysis consistent across them? Switchback experiments commonly evaluate ratio metrics (revenue per ride, completion rate per order), count metrics (rides per hour, orders per period), and sometimes continuous metrics (average delivery time). The analysis method needs to handle each correctly. A ratio metric in a switchback design requires aggregation within each period-unit combination before comparison, and the variance estimation must account for the correlation structure at the aggregated level. Ask which metric types the platform supports for switchback experiments, and whether the aggregation and variance estimation are consistent across them.
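A minimal sketch of the order of operations for a ratio metric, with hypothetical column names: sum the numerator and denominator within each (unit, period) cell first, then divide and compare at the cell level:

```python
import pandas as pd

def ratio_metric_by_cell(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a ratio metric (revenue per ride) within each
    (unit, period) cell before any comparison. Sum the numerator and
    denominator per cell first, divide second; dividing per event and
    averaging gives a different, wrong estimand."""
    cells = (events
             .groupby(["unit", "period", "condition"], as_index=False)
             .agg(revenue=("revenue", "sum"), rides=("rides", "sum")))
    cells["revenue_per_ride"] = cells["revenue"] / cells["rides"]
    return cells

# Toy usage; the comparison then runs on cell-level ratios, where the
# effective sample size actually lives. A fuller analysis would use the
# delta method or cluster-robust variance rather than a plain t-test.
events = pd.DataFrame({
    "unit": ["nyc"] * 4, "period": [0, 0, 1, 1],
    "condition": [1, 1, 0, 0],
    "revenue": [20.0, 35.0, 18.0, 22.0], "rides": [2, 3, 2, 2],
})
cells = ratio_metric_by_cell(events)
```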

Sixth: are multiple testing corrections available for switchback designs? Most switchback experiments evaluate more than one metric. If the platform applies multiple testing corrections for standard A/B tests but not for switchback experiments, the false positive rate in switchback analyses is uncontrolled. This matters especially for guardrail metrics in marketplace experiments, where shipping a pricing change that degrades driver earnings or rider wait times can have outsized consequences. Ask whether the correction methods available for standard experiments also apply to switchback experiments.
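The mechanics of the correction are not switchback-specific, which is what makes the gap notable. A sketch with Benjamini-Hochberg, using made-up p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for six metrics in one switchback experiment.
pvals = [0.004, 0.012, 0.030, 0.041, 0.20, 0.55]

# Benjamini-Hochberg controls the false discovery rate across metrics.
# The point is procedural: whatever correction the platform applies to
# standard experiments should also apply to switchback analyses.
reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for p, q, r in zip(pvals, adjusted, reject):
    print(f"p={p:.3f} -> adjusted={q:.3f} ship={r}")
```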

What the answers actually look like across vendors

Here is how the major platforms stand as of this writing. Vendor capabilities are based on public documentation. Confidence capabilities are based on the product itself.

"Yes" and "No" are based on explicit evidence in the vendor's public documentation. "Not documented" means we found no explicit evidence either way, not that the capability is confirmed absent.

| Platform | Switchback support | Carry-over / washout handling | Variance accounts for temporal correlation | Sample size planning | Metric types | Multiple testing for switchback | Bayesian mode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Confidence | No | N/A | N/A | N/A | N/A | N/A | N/A |
| Eppo | Yes | Yes (burn-in / burn-out) | Not documented | Not documented | Not documented | Not documented | Not documented |
| Statsig | Yes | Yes (configurable) | Partial | Not documented | Ratio, sum, count | Not documented | Not documented |
| GrowthBook | No | N/A | N/A | N/A | N/A | N/A | N/A |
| LaunchDarkly | No | N/A | N/A | N/A | N/A | N/A | N/A |
| PostHog | No | N/A | N/A | N/A | N/A | N/A | N/A |
| Amplitude | No | N/A | N/A | N/A | N/A | N/A | N/A |
| Optimizely | No | N/A | N/A | N/A | N/A | N/A | N/A |
| VWO | No | N/A | N/A | N/A | N/A | N/A | N/A |

The first pattern is that switchback support is rare. Only two of the eight external vendors reviewed offer switchback experiments. This is not surprising. The use case is specific to marketplaces and platforms with network effects, and building a correct switchback analysis requires statistical machinery that differs from the standard A/B testing pipeline. For the six vendors that do not offer switchback support, the absence is not necessarily a gap if your product does not require it.

The second pattern is that even where switchback support exists, the implementation is incomplete in ways that mirror the gaps in other RFP topics in this series. Neither Eppo nor Statsig documents sample size planning for switchback designs. Both support burn-in and burn-out periods for carry-over effects. But the deeper statistical questions are either partially addressed or not documented. Does the variance estimation model temporal autocorrelation? Do the degrees of freedom reflect the number of independent periods rather than the number of users? Do multiple testing corrections extend to switchback analyses? For most of these, the answer is either unclear or absent from public documentation.

Statsig's implementation is the better documented of the two. Their original approach uses bootstrapping, collecting bootstrap samples from test and control buckets separately and computing the difference in means across 10,000 iterations to derive confidence intervals. By resampling at the bucket level rather than the user level, this approach implicitly handles some of the correlation structure without requiring distributional assumptions. However, their documentation explicitly notes that CUPED, sequential testing, dimension breakdowns, and time-series analysis are not available for switchback tests. For their Warehouse Native (WHN) customers, Statsig has introduced a regression-based analysis that replaces bootstrapping, with support for pre-computed dimensions and more configurable burn-in/out periods. Whether this regression-based method explicitly models temporal autocorrelation is not documented.
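For intuition, here is a sketch of the general idea as we read that documentation: resample whole buckets rather than individual users, and take percentile confidence intervals from the bootstrap distribution. This is an illustration of the technique, not Statsig's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_diff_ci(test_buckets: np.ndarray, control_buckets: np.ndarray,
                      n_iter: int = 10_000) -> tuple[float, float]:
    """Percentile CI for the difference in means, resampling whole
    buckets (time periods) rather than individual users, so some of the
    within-bucket correlation structure is carried along implicitly."""
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        t = rng.choice(test_buckets, size=test_buckets.size, replace=True)
        c = rng.choice(control_buckets, size=control_buckets.size, replace=True)
        diffs[i] = t.mean() - c.mean()
    return (np.quantile(diffs, 0.025), np.quantile(diffs, 0.975))

# E.g. bucket-level means from a simulated switchback, 24 per arm:
test = rng.normal(10.3, 1.0, 24)
control = rng.normal(10.0, 1.0, 24)
print(bootstrap_diff_ci(test, control))
```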

Eppo's switchback implementation supports burn-in and burn-out periods, CUPED for variance reduction, and a diagnostic that validates subject assignments against the switchback schedule. Their marketing materials reference "valid lift estimates" and compliance validation. But the documentation does not describe the analysis method in detail: how variance is estimated, whether temporal autocorrelation is modeled, or what the unit of inference is. Eppo's switchback feature is listed as being in closed beta, which means the method may still be evolving.

The third pattern is the absence of sample size planning for switchback designs across the entire vendor landscape. Both Eppo and Statsig offer sample size calculators for standard A/B tests, but neither documents a calculator that accounts for switchback-specific parameters: the number of geographic units, the switching frequency, the expected temporal autocorrelation, or the carry-over structure. This means teams running switchback experiments on these platforms have no principled way to decide how long the experiment should run or how many switchback periods they need. The gap is the same one documented in the sample size post in this series, but more consequential here because the relationship between calendar time and statistical power in a switchback experiment is less intuitive than in a standard A/B test.

Neither vendor documents whether switchback experiments work in a Bayesian mode. Both platforms offer Bayesian analysis for standard experiments, but whether that extends to switchback designs is not addressed in their public documentation. Similarly, neither vendor documents multiple testing corrections for switchback analyses. If you evaluate six metrics in a switchback experiment and ship on the first one that clears the threshold, the false positive rate is uncontrolled unless a correction is applied, exactly as in a standard experiment.

If your product requires switchback experiments, the vendor shortlist is short and the implementations are young. The RFP question that matters is not "do you support switchback experiments?" It is whether the analysis behind that support accounts for the properties that make switchback data fundamentally different from user-level data. A switchback implementation without correct variance estimation will produce results that look precise but overstate certainty. The team sees tight confidence intervals and ships the change, but the effect was never real: the intervals looked tight because the analysis counted users as independent observations when the independent units were time periods.